Very large number of global grid points leading to termination

Issue #1854 closed
anonymous created an issue

In WaveMoL/gaussian.par, whenever we increase the number of global grid points to more than 350, the program terminates abruptly. I am attaching the gaussian.par file and hn1.lnmiit.ac.in.ini (simfactory/mdb/machines/hn1.lnmiit.ac.in.ini).

HPC details - Nodes: 6; Processor: Intel Xeon E5-2680 v2 (http://ark.intel.com/products/75277/Intel-Xeon-Processor-E5-2680-v2-25M-Cache-2_80-GHz)

Q1. Is there anything wrong inside the .par or .ini file?
Q2. Should I use Carpet instead of PUGH?

Keyword: grid
Keyword: global
Keyword: terminate

Comments (14)

  1. Ian Hinder

    If you increase the number of grid points, this increases the amount of memory needed for the run. Is your run terminating because it runs out of memory? If you can run "top" on the node, you should see if this is happening. You can also activate the thorn SystemStatistics and add the following into your parameter file to get output of the memory usage ("maxrss"):

    IOBasic::outInfo_reductions             = "maximum"  
    IOBasic::outInfo_vars                   = "SystemStatistics::maxrss_mb  SystemStatistics::swap_used_mb"
    

    This will add two new columns to the info output: the memory used by the process, and the swap space allocated, in MB. You can then compare this to the memory available on the node, and see if there is swapping.
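
    For reference, activating the thorn just means listing it in the parameter file alongside those lines; a minimal sketch (assuming SystemStatistics was compiled into your configuration) would be:

    # Cactus accumulates ActiveThorns lines, so this can simply be appended
    ActiveThorns = "SystemStatistics"
    IOBasic::outInfo_reductions = "maximum"
    IOBasic::outInfo_vars       = "SystemStatistics::maxrss_mb SystemStatistics::swap_used_mb"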

    Can you provide the standard output and standard error files so we can see what is happening?

  2. Frank Löffler

    The attached par file needs about 55GB of RAM on my workstation (but runs out of memory during the first step). Memory usage goes roughly like N^3, so ~368 would be the maximum for 64GB of RAM - but even in the current form, 64GB aren't enough for this parameter file. How much memory do you have available when you run it?
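
    As a rough illustration of that estimate (a sketch only; the 55 GB figure is the measurement quoted above):

    # memory scales roughly as N^3; starting from ~55 GB at N = 350,
    # the largest N fitting in 64 GB is about 350 * (64/55)^(1/3):
    awk 'BEGIN { print 350 * (64/55)^(1/3) }'   # ~368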

  3. anonymous reporter

    Replying to [comment:1 hinder]:

    1. The results for global grid points = 350 (when it runs fine):
       a. top command output - top1.png
       b. output1.out
       c. output1.err

    I stopped it at the 64th iteration, which is why termination error 15 is occurring.

    2. The results for global grid points = 700 (greater than 350, which terminates it automatically):
       a. top command output - top2.png
       b. output2.out
       c. output2.err

    It stopped all cactus_sim tasks as soon as it used all its swap memory and gave termination error 9.

  4. Ian Hinder

    Thanks for the output. The file "top1.png", from n = 350, already shows that the simulation is using much more memory than is available on the node. You can see that there is 99 GB of total memory, and 99 GB of it is "used". Similarly, the swap usage is very high, at 8 GB. As far as I can see, there is no evidence of anything being wrong with this simulation; it is simply too large to fit on the machine. Note that "350 global grid points" means 350^3 points, because the grid is 3D. When you increase that to 700, the memory requirements increase by a factor of 2^3 = 8. So if the n=350 run doesn't fit, the n=700 run has no chance.

    According to the output1.out file, SystemStatistics is reporting that each process is using about 15 GB (this means that this amount of data is resident in physical memory currently; it's very possible that the run needs even more than this, and the rest is in swap). You have 6 processes, so the total memory usage is 90 GB, which is about the same as the total memory available (accounting for other processes, overheads, and general subtleties in measuring memory usage). I'm a bit confused about why SystemStatistics is reporting a swap usage of 0, when the "top" output shows 8 GB. Was the "top" output recorded at the same time as the simulation which generated "output1.out"?
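
    As a quick sanity check of those figures (a sketch; the ~15 GB per process is the value read from output1.out above):

    # doubling the grid points per dimension scales memory by (700/350)^3 = 8,
    # and 6 processes at ~15 GB resident each already total ~90 GB:
    awk 'BEGIN { print (700/350)^3, 6 * 15 }'   # prints: 8 90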

    To summarise, the machine on which you are running does not have enough memory for the size of job you are trying to run. You either need to restrict to a smaller number of grid points, or run on more nodes.

  5. anonymous reporter

    Replying to [comment:4 hinder]: Thanks. I have attached two outputs of "top": top2.png shows the state when the termination is occurring, while top1.png is the output when everything is running fine and it is showing the SystemStatistics output.

    This "top" is showing output for master node of our HPC. We have 6 slave nodes also.

    I use the following command to run:

    ./simfactory/bin/sim submit <simulation folder name> --parfile=arrangements/CactusExamples/WaveMoL/par/gaussian.par --procs=60 --walltime=8:0:0

    This runs the 6 cactus_sim tasks mentioned in top. I have also attached the .ini file containing the HPC details.

  6. Ian Hinder

    There is something wrong: the output says that the evolution is running on 6 MPI processes, and your ini file says that you want 10 threads per process, but I can see all 6 processes on the first node. Can you describe exactly how your "HPC" is set up? It sounds like you have 7 nodes, one "master" and 6 "slaves". How many processors, and how many cores, does each node have? Are you intending to run using OpenMP + MPI, or pure MPI?

  7. anonymous reporter

    Replying to [comment:6 hinder]: HPC details:

    1 node: 2 processors (Intel Xeon E5-2680 v2)
    processor details: http://ark.intel.com/products/75277/Intel-Xeon-Processor-E5-2680-v2-25M-Cache-2_80-GHz
    1 node: 20 cores
    1 node: 96 GB

    master: 1 node
    slave: 6 nodes
    total: 7 nodes

    OpenMP + MPI

    I think right now I am only running everything on one node (the master). I need to submit jobs through job.sh, which will divide the workload across the slave nodes. If that is correct, can you please tell me where to make the changes?

    PFA: job.sh, hostfile, OptionList, SubmitScript, RunScript

  8. Ian Hinder

    job.sh looks like a PBS submission script. Do you have PBS, or a similar queueing system, running on this machine? If so, then you will need to split the content of this script into a "submission script", containing the #PBS directives, and a "run script" containing the mpirun_rsh commands. If you do not have a queueing system, then you only need the run script. Take a look in Cactus/simfactory/mdb, in the submitscripts and runscripts directories for examples. You need to create a submit script and a run script for your machine, and then edit your machine ini file to point to these. You are using a host file in your mpirun command, but if you are using a queueing system, then OpenMPI should instead know automatically which hosts to run on.
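
    Not the actual scripts for your cluster, but a rough sketch of the split being described, assuming PBS: the submission script carries only the #PBS directives and hands off to simfactory, which then invokes the run script. The @...@ placeholders follow simfactory's substitution convention, but the exact placeholder names and the final simfactory invocation should be copied from one of the shipped examples in simfactory/mdb/submitscripts:

    #! /bin/bash
    # Submission script sketch (PBS directives only - no mpirun/mpiexec here).
    # The @...@ names below are illustrative simfactory placeholders; copy the
    # exact set from an existing script in simfactory/mdb/submitscripts.
    #PBS -N @SIMULATION_NAME@
    #PBS -l nodes=@NODES@:ppn=@PPN@
    #PBS -l walltime=@WALLTIME@
    #PBS -o @RUNDIR@/@SIMULATION_NAME@.out
    #PBS -e @RUNDIR@/@SIMULATION_NAME@.err
    # Hand off to simfactory, which in turn calls the run script:
    @SIMFACTORY@ run @SIMULATION_NAME@ --machine=@MACHINE@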

  9. anonymous reporter

    I tried to work on the PBS script but I am still getting some errors. After the run command I ran "qstat -r":

    hn1.lnmiit.ac.in:

    Job ID                Username  Queue  Jobname       SessID  NDS  TSK  Req'd-Memory  Req'd-Time  S  Elap-Time
    207.hn1.lnmiit.ac.in  himanshu  batch  ModernFamily  36422   6    120  --            720:00:00   R  00:14:57

    There were no results (tables), but I got some errors in the .err and .out files. Can you please look at output3.err and output3.out?

    Newly attached files (old name - new name):

    .ini - hn1.lnmiit.ac.in1.ini
    RunScript - RunScript1
    SubmitScript - SubmitScript1
    .out - output3.out
    .err - output3.err

    Sorry for the late reply

  10. Ian Hinder

    There should not be an mpiexec in the submission script. Look at the other submission scripts for examples. You call simfactory from the submission script, and simfactory calls the run script. The run script should then contain the mpiexec call.
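
    For comparison, a stripped-down run script might look something like this (again only a sketch, assuming mpirun is on the path; the placeholder names should be checked against the examples in simfactory/mdb/runscripts):

    #! /bin/bash
    # Run script sketch: this is the only place mpirun/mpiexec should appear.
    # @RUNDIR@, @NUM_THREADS@, @NUM_PROCS@, @EXECUTABLE@ and @PARFILE@ are
    # illustrative simfactory placeholders.
    cd @RUNDIR@
    export OMP_NUM_THREADS=@NUM_THREADS@
    mpirun -np @NUM_PROCS@ @EXECUTABLE@ @PARFILE@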

  11. anonymous reporter

    I removed mpiexec from the submission script, but it is still showing an error:

    /home/himanshu/simulations/test_13/SIMFACTORY/exe/cactus_sim: error while loading shared libraries: libhdf5hl_fortran.so.7: cannot open shared object file: No such file or directory

    PFA:

    .out - output4.out
    .err - output4.err

  12. Ian Hinder

    You are using an optionlist for Fedora, which expects to find HDF5 installed in /usr. The problem is that it is not able to find the Fortran HDF5 library at runtime. Is this installed? Are the compute nodes exactly the same as the head node where you compiled Cactus? I suspect that the file libhdf5hl_fortran.so.7 might be present on the head node but not on the compute nodes. Can you check?

    find /usr -name libhdf5hl_fortran.so.7
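
    To run the same check on the compute nodes, something along these lines could work (a sketch; node01 ... node06 are placeholders for your actual slave hostnames):

    # Hypothetical hostnames - replace with the names of your slave nodes
    for host in node01 node02 node03 node04 node05 node06; do
        echo "== $host =="
        ssh $host 'find /usr -name libhdf5hl_fortran.so.7 2>/dev/null'
    done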
    