Issues with Stampede

Issue #1547 closed
James Healy created an issue

Over the past few months, I have been using RIT's LazEv code with only minor hiccups on Stampede (in particular, the unreproducible 'dapl_conn_rc' crashes that I'm sure other Stampede users are familiar with). That checkout was of the previous release, ET_2013_05, and was compiled with Intel MPI. Most of the jobs I ran took advantage of some symmetry, and I was able to run on 12-16 nodes at about 50-60% memory usage.

After the fix for the sync issue was backported, I checked out the new release, ET_2013_11, and immediately ran into problems. The first issue was with run performance and LoopControl, which we sorted out with the mailing list's help. The second was with crashes and checkpointing. With both the Intel MPI and MVAPICH2 configurations, the code would hang about 50% of the time when dumping a checkpoint, and every time when dumping a termination checkpoint. Further, the crashes seemed more frequent, and I could not get a simulation to run for a full 24 hours without crashing (either by stalling on a checkpoint or otherwise).
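
For reference, the checkpointing described above is controlled by the usual IOUtil/CarpetIOHDF5 parameters. Below is a minimal sketch of the relevant block; the values and the directory name are placeholders rather than the exact settings from my parameter file.

    IOHDF5::checkpoint          = yes            # checkpoint via CarpetIOHDF5
    IO::checkpoint_dir          = "checkpoints"
    IO::checkpoint_ID           = yes            # checkpoint the initial data
    IO::checkpoint_every        = 1024           # periodic checkpoints (hang ~50% of the time)
    IO::checkpoint_on_terminate = yes            # termination checkpoint (hangs every time)
    IO::recover                 = "autoprobe"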

So, I checked out a clean version of the toolkit, with only toolkit thorns, and removed any thorns specific to RIT. I compiled with both the Intel MPI and MVAPICH2 configurations in simfactory.

In both cases, I could run the 'qc0-mclachlan.par' file to completion with no issues. So I edited the qc0 parfile to update the grid, remove the symmetries, and update the initial data to match my test parameter file (a sketch of these edits is included below). I ran the job on 20 nodes, and with either configuration I was not able to run it to completion on any of my numerous attempts. The Intel MPI runs die with the standard, unhelpful "dapl_conn_rc" error at random times in the evolution, and the MVAPICH2 runs die with:

    [c431-903.stampede.tacc.utexas.edu:mpispawn_7][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
    [c431-903.stampede.tacc.utexas.edu:mpispawn_7][mtpmi_processops] Error while reading PMI socket. MPI process died?
    [c431-903.stampede.tacc.utexas.edu:mpispawn_7][child_handler] MPI process (rank: 15, pid: 106620) terminated with signal 9 -> abort job
    [c429-501.stampede.tacc.utexas.edu:mpirun_rsh][process_mpispawn_connection] mpispawn_7 from node c431-903 aborted: Error while reading a PMI socket (4)

The Intel MPI jobs died with the same dapl_conn_rc error after run times of 2 hours, 8 hours, and 21 hours. I also had one job that hung and did not exit until it was killed by the queue manager. The MVAPICH2 jobs died after around 3 hours and 8 hours with the error above.
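
To give a sense of the parfile edits mentioned above (the exact values are in the attached parameter file): removing the symmetries amounts to dropping the symmetry thorns from ActiveThorns and extending the grid to the full domain. A rough sketch with placeholder numbers, assuming the usual ReflectionSymmetry/RotatingSymmetry180 setup:

    # ReflectionSymmetry and RotatingSymmetry180 removed from ActiveThorns;
    # the CoordBase domain now covers the full box (placeholder extents):
    CoordBase::xmin = -120.0
    CoordBase::xmax = +120.0
    CoordBase::ymin = -120.0
    CoordBase::ymax = +120.0
    CoordBase::zmin = -120.0
    CoordBase::zmax = +120.0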

We've been in contact with TACC and they said it was a Cactus issue, so I am sending this report.

Attached is the parameter file I used for the tests. It should work with a stock ET_2013_11 checkout.


Comments (5)

  1. Roland Haas

    Would you mind attaching the stdout of your attempted runs (preferably the MVAPICH2 ones), please? To reproduce this it might be necessary to, e.g., match the number of threads you used, which is printed in stdout. Similarly, it would help if you included the list of loaded modules at the time the executable ran. If you are using SimFactory, please include the output of

    simfactory/bin/sim execute 'module list'
    
  2. James Healy reporter

    I added standard out, standard error, and the module list file that is created at the time of submission. I used 20 nodes with OMP_NUM_THREADS=8. While I only tried this parameter file with this setup, we've had similar issues independent of the number of nodes.
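
    For concreteness, the submission looked roughly like the following. This is only a sketch of the SimFactory invocation, not the exact command: the simulation name, parameter file name, and walltime are placeholders, and it assumes 16 cores per Stampede node with --procs counting total cores.

        # 20 nodes x 16 cores = 320 cores, i.e. 40 MPI ranks x 8 OpenMP threads each
        simfactory/bin/sim create-submit qc0_test \
            --parfile=par/qc0_test.par \
            --procs=320 --num-threads=8 \
            --walltime=24:00:00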

  3. Ian Hinder

    I am also having similar problems on Stampede. I am using MVAPICH2. I posted a summary to the mailing list (http://lists.einsteintoolkit.org/pipermail/users/2014-May/003580.html), and I include it here for reference.

    I've had jobs die when checkpointing, and also hang mysteriously for no apparent reason. These might be separate problems. The checkpointing issue occurred when I submitted several jobs and they all started checkpointing at the same time, after 3 hours. The hang happened after a few hours of evolution, with GDB reporting:

    MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
        at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
    296         for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)

    Unfortunately I didn't ask for a backtrace. I've been in touch with support, and they said the dying while checkpointing coincided with the filesystems being hit hard by my jobs, which makes sense, but they didn't see any problems in their logs, and they have no idea about the mysterious hang. I repeated the hanging job and it ran fine.
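
    If it happens again, something like the following should capture backtraces from every rank on a suspect node before the job is killed. This is only a sketch: the executable name cactus_sim (and hence the pgrep pattern) is an assumption about the particular build.

        # Attach gdb to each Cactus process on this node and dump all thread backtraces
        for pid in $(pgrep -f cactus_sim); do
            gdb -p "$pid" -batch -ex 'thread apply all bt' > "bt-$(hostname)-$pid.txt"
        done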
