Recovery fails in AHFinderDirect RecoverML with out-of-bounds assertion in CarpetLib

Issue #626 closed
Roland Haas created an issue

When trying to recover in AHFinderDirect's RecoverML test (parfiles attached) I get:

cactus_et: /home/rhaas/ET_2011_10/arrangements/Carpet/CarpetLib/src/th.hh:79: double th::get_time(int, int, int) const: Assertion `tl>=0 and tl<timelevels' failed.

After some debugging I traced this down to the metric, which in CarpetIOHDF5/src/Input.cc:762 is reported (by gf->timelevels (ml, rl)) to have three timelevels. However, the timelevels member of gf->t (a th) gives the number of timelevels as two:

 gf->t.timelevels
$28 = 2

Since timelevels is new in Carpet/Hg, my suspicion is that it is not properly updated when gf->set_timelevels is called (the comments in th.hh seem to indicate that it is assumed to be constant, which seems odd given that ggf::set_timelevels exists).
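
For reference, here is a minimal sketch of the failing check, reconstructed from the assertion text above; apart from get_time and timelevels, all names and the container layout are my assumptions, not CarpetLib's actual code:

    // Sketch only: the inconsistency as seen from th::get_time.
    #include <cassert>
    #include <vector>

    struct th_sketch {
      int timelevels;             // the debugger reports 2 here
      std::vector<double> times;  // one stored time per known timelevel
      double get_time (int /*ml*/, int /*rl*/, int tl) const {
        assert (tl>=0 and tl<timelevels);  // fires when tl == 2
        return times.at (tl);
      }
    };

The grid function, however, reports three timelevels via gf->timelevels (ml, rl), so recovery eventually asks for tl == 2 and the assertion fires.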

I don't understand enough of Carpet to fix or further debug this. Marking it as major and release relevant in case it is actually a Carpet bug.


Comments (22)

  1. Erik Schnetter

    Thank you.

    This is an unfortunate coincidence of events.

    Cactus itself does not have a notion of a maximum number of timelevels; this number can be different for each grid group. However, for proper time interpolation etc., Carpet needs to assume such a global maximum, and it uses "prolongation_order_time+1" for this. (This parameter should probably be renamed.) That is, this is the maximum time level that Carpet can handle for interpolation; individual grid groups may have more timelevels, but Carpet will not store/provide any meta-data for them, and the application can only access these as they are (i.e. in the same way as with PUGH).

    In this particular parameter file, prolongation_order_time is 2 (the default), but the metric seems to have 3 time levels. That means that the metric has a pre-previous time level that is most likely never used, but which nevertheless has storage, and which is checkpointed and recovered. I assume that nothing in the code actually uses the 3rd timelevel, except that timelevel rotation will move old data into it.
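
    Stated as code (a hedged restatement of the rule just described, not Carpet's implementation):

      // Carpet keeps per-timelevel time metadata only up to a global maximum.
      int max_metadata_timelevels (int prolongation_order_time) {
        return prolongation_order_time + 1;
      }

      // A group may allocate more timelevels; the excess levels get storage,
      // checkpointing, and rotation, but no time metadata.
      bool has_time_metadata (int tl, int prolongation_order_time) {
        return tl < max_metadata_timelevels (prolongation_order_time);
      }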

    During recovery, all timelevels need to be synchronised. This is special, since during evolution only the current timelevel would be synchronised. All timelevels need to be synchronised because the number of processes may change, and the ghost zones need to be filled. It would instead be possible to read in the ghost zone data from file, but synchronising is probably much more efficient.
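
    Schematically the difference is the following (a hedged sketch; the real routines live in CarpetLib and have different signatures):

      struct gf_sketch {
        int timelevels;
        void sync (int /*tl*/) { /* low-level MPI ghost-zone exchange, no interpolation */ }
      };

      // During evolution only the current timelevel needs fresh ghost zones.
      void sync_evolution (gf_sketch& gf) { gf.sync (0); }

      // After recovery the process count may have changed, so the ghost
      // zones of every stored timelevel must be refilled.
      void sync_recovery (gf_sketch& gf) {
        for (int tl=0; tl<gf.timelevels; ++tl) gf.sync (tl);
      }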

    The major difference between Carpet-git and Carpet-hg is that Carpet now explicitly stores which time is associated with each timelevel, instead of assuming a constant delta time. This can save time interpolation during regridding (leading to slightly different, but still valid results), and also allows e.g. changing the time step size during evolution. However, this also means that the times for the past timelevels need to be initialised correctly, so Carpet needs to know how many old timelevels there can be.
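
    The change can be pictured like this (illustrative only; Carpet's real data structures differ):

      #include <vector>

      struct th_old {            // Carpet-git: constant delta time assumed
        double time0, dt;
        double get_time (int tl) const { return time0 - tl*dt; }
      };

      struct th_new {            // Carpet-hg: one stored time per timelevel
        std::vector<double> times;
        double get_time (int tl) const { return times.at (tl); }
        // times must be initialised for every past timelevel, hence Carpet
        // must know how many old timelevels there can be
      };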

    And this is where things go wrong: Carpet has no meta-data for the 3rd (unused) timelevel. During synchronisation, Carpet accesses (although this is not actually needed) the time associated with this timelevel, and this operation fails.

    There are several ways to correct this:

    • Allocate only 2 timelevels for the metric (since only 2 are used anyway)
    • Set prolongation_order_time to 3 instead of 2 (since you want to use 3 timelevels, you may as well tell Carpet about this)
    • Introduce a special case for synchronising after recovering that somehow doesn't access Carpet's timelevel metadata (which are not required for synchronising)
    • Wrap accessing timelevels during communication in an if statement, so that the error is not triggered (a sketch of this follows below)

    I would prefer one of the first two options. In addition, the error message should obviously be improved.
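
    For illustration, the guard in the last option might look roughly like this (a hedged sketch reusing the th_sketch type from the issue description; not a proposed patch):

      #include <cmath>

      // Sync itself never needs a time, so return a poison value instead of
      // asserting when no metadata exists for this timelevel.
      double get_time_guarded (th_sketch const& t, int ml, int rl, int tl) {
        if (tl>=0 and tl<t.timelevels) return t.get_time (ml, rl, tl);
        return NAN;  // harmless for sync; would poison any real interpolation
      }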

  2. Roland Haas reporter

    Thank you for the quick (and insightful) reply. I used option (1) for the test suite; that works fine.

    If I understand this correctly, and remember properly that the operation requiring time interpolation is the filling of buffer zones, does this imply that prolongation is done after a recovery? I am a bit concerned about what happens if I checkpoint on an odd timestep (assuming I have 2 levels). Below, 'x' is the coarse grid, '+' the fine grid, and '*' is where they overlap. Time increases upward.

    xxxxxxxxxxxxxx 5
       ++++++++    4
       xx**xxxx    3
       ++++++++    2
    xxxxxxxxxxxxxx 1

    In this situation I would expect the buffer zones (4) to be filled in using coarse data from (1), (3) and (5). During evolution, though, I would have expected the corresponding buffer zones to be filled in by what would be (3), (5) and (7) instead.
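
    For concreteness, with prolongation_order_time = 2 the time interpolation in question would be quadratic Lagrange interpolation from three coarse times, e.g. from (1), (3) and (5) to the fine time (4); a generic sketch, not Carpet's implementation:

      // Interpolate the value at time t from values vs stored at times ts.
      double interp_in_time (double t, double const ts[3], double const vs[3]) {
        double result = 0.0;
        for (int i=0; i<3; ++i) {
          double w = 1.0;
          for (int j=0; j<3; ++j)
            if (j != i) w *= (t - ts[j]) / (ts[i] - ts[j]);
          result += w * vs[i];
        }
        return result;
      }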

  3. Erik Schnetter

    No, there is no prolongation after recovering. When I said "sync", I meant the low-level sync operation that fills inter-process ghost zones via MPI, no interpolation is involved.

  4. Roland Haas reporter

    The original reason for this ticket is gone since Erik provided a way to fix it. Reading through the explanation of the behaviour, it seems though that one should create a new ticket to have CarpetIOHDF5 output a clearer error message (at least). One might even have to have Carpet enforce that the maximum number of timelevels in checkpointed grid functions is prolongation_order_time+1, since otherwise it seems one cannot recover from such checkpoints (I really hope I am wrong on this one :-) ).

  5. Ian Hinder

    Roland fixed this in revision 1559 of AHFinderDirect. The test case subsequently passed on Datura. We should create tickets as discussed in comment:4 and then close this one. Removing release milestone.

  6. Roland Haas reporter

    Making this critical because of the issue about not being able to recover from checkpoints under certain circumstances. Should at least be discussed before we release the code.

  7. Erik Schnetter

    I would like to understand the chain of events in detail. This will allow us to either improve the error message, or ensure that the superfluous timelevels are properly ignored. Could you create a stack backtrace? This will tell us where in the recovery routine things are going wrong.

  8. Ian Hinder

    We just ran into this problem here at the AEI. Recovery failed in a unigrid run with that error message. Setting the prolongation parameter fixed the problem. Is someone working on this for the upcoming release?

  9. Erik Schnetter

    To my knowledge no one is.

    Can you describe the steps you took to fix the problem?

  10. Erik Schnetter

    Does it suffice to do so when recovering? Or are the checkpoint files unusable?

  11. Wolfgang Kastaun

    Additional remark: setting the parameter Carpet::prolongation_order_time = 2 fixed our problem, but it had to be changed from the beginning, not just in the restart. Otherwise, the restart crashes with

    terminate called after throwing an instance of 'std::out_of_range'
      what():  vector::_M_range_check
    Rank 2 with PID 7145 got signal 6
    (the remainder is the same std::out_of_range message from several ranks, interleaved)

    From the backtrace, it crashes in CarpetIOHDF5_RecoverGridStructure+0x1332.

    Wolfgang.

  12. Erik Schnetter

    I have tested this with the parameter files checkpointML.par and recoverML.par from AHFinderDirect's test suite.

    • As they are, the parameter files pass.
    • When I comment out "ML_BSSN::timelevels = 2" in both parameter files, I see an error upon recovery (more below).
    • When I then re-introduce this setting for recovering (using the "bad" checkpoint file), everything seems fine again.

    The error Carpet reports occurs because Carpet cannot determine a "current time" associated with the oldest time level. Because of sub-cycling, these times are generally different for each refinement level. How many "current times" Carpet stores depends on the parameter "prolongation_order_time". It is unfortunate that there is a disconnect between this parameter and the number of time levels that the flesh allocates for variables. A work-around seems possible, but I don't want to introduce it before this release unless necessary. This work-around would likely consist of a flag, passed down into CarpetLib, indicating that the current time is not known, which still allows synchronising but disallows e.g. prolongation or restriction. This would likely avoid the problem.
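
    The work-around could be shaped roughly like this (a hedged sketch of the flag idea; the names and the enum are assumptions, not actual Carpet API):

      #include <cassert>

      enum class op_kind { sync, prolongation, restriction };

      // The flag would be passed down into CarpetLib alongside each transfer.
      void transfer (op_kind op, bool times_known) {
        if (not times_known)
          assert (op == op_kind::sync);  // sync needs no times; the others do
        // ... perform the actual data transfer ...
      }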

    However, since I cannot reproduce the problem, and since allocating 3 timelevels in a unigrid run is most likely an oversight anyway, the cleaner solution seems to be to set "ML_BSSN::timelevels = 2" in the parameter file, either all the time, or before recovering. If I change this parameter only before recovering, CarpetIOHDF5 warns about unused datasets in the checkpoint file.

    If you believe that there is a problem that I'm just not seeing, then please:

    • Try to reproduce my steps above to see whether you obtain different results
    • Report exactly which version of Carpet and AHFinderDirect you are using (revision numbers)
    • Post your parameter files (both), as well as the output you obtain (both stdout and stderr for both)
    • Describe other relevant details, e.g. the number of MPI processes, the machine you are using, etc.
    • If you give a backtrace, please either use gdb, or use "addr2line" to convert hex addresses to line numbers

  13. Roland Haas reporter

    Erik: for the ML tests your method works for me. It would not work had I set e.g. ADMBase::metric_timelevels = 3 initially, since that parameter is not steerable. I.e. adding

    ADMBase::metric_timelevels = 3

    to the checkpointML.par from the test cannot be fixed later, since the parameter is not steerable (shouldn't Cactus also complain if I try to steer a parameter upon recovery that is not steerable?). The backtrace is:

    #0  0x00007ffff4305475 in *__GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
    #1  0x00007ffff43086f0 in *__GI_abort () at abort.c:92
    #2  0x00007ffff42fe621 in *__GI___assert_fail (assertion=0x37cb2e5 "tl>=0 and tl<timelevels", file=<optimized out>, line=79, function=0x3806540 "double th::get_time(int, int, int) const") at assert.c:81
    #3  0x0000000001664829 in th::get_time (this=<optimized out>, ml=<optimized out>, rl=<optimized out>, tl=<optimized out>) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/Carpet/CarpetLib/src/th.hh:79
    #4  0x000000000185c96a in transfer_from_all (slabinfo=0x0, use_old_storage=false, ml2=0, rl2=0, tl2=2, sendrecvs=&dh::fast_dboxes::fast_sync_sendrecv, ml1=0, rl1=0, tl1=2, state=..., this=0x79638f0, flip_send_recv=<optimized out>) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/Carpet/CarpetLib/src/ggf.hh:184
    #5  ggf::sync_all (this=0x79638f0, state=..., tl=2, rl=0, ml=0) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/Carpet/CarpetLib/src/ggf.cc:344
    #6  0x000000000166324b in CarpetIOHDF5::Recover (cctkGH=0x78d3360, basefilename=<optimized out>, called_from=3) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/Carpet/CarpetIOHDF5/src/Input.cc:762
    #7  0x0000000003214381 in IOUtil_RecoverFromFile (GH=0x78d3360, basefilename=0x0, called_from=3) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/CactusBase/IOUtil/src/CheckpointRecovery.c:335
    #8  0x0000000003214ac4 in IOUtil_RecoverGH (GH=<optimized out>) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/CactusBase/IOUtil/src/CheckpointRecovery.c:367
    #9  0x000000000056b3c5 in CCTK_CallFunction (function=0x3214aa0, fdata=0x78b74d8, data=0x78d3360) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/src/main/ScheduleInterface.c:311
    #10 0x00000000017ace5f in Carpet::CallScheduledFunction (time_and_mode=<optimized out>, function=0x3214aa0, attribute=0x78b74d8, data=0x78d3360, user_timer=...) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/Carpet/Carpet/src/CallFunction.cc:367
    #11 0x00000000017af5a8 in Carpet::CallFunction (function=0x3214aa0, attribute=0x78b74d8, data=0x78d3360) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/Carpet/Carpet/src/CallFunction.cc:260
    #12 0x000000000056b101 in CCTKi_ScheduleCallFunction (function=0x3214aa0, attribute=0x78b74c0, data=0x7fffffffd340) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/src/main/ScheduleInterface.c:3027
    #13 0x000000000057513b in ScheduleTraverseFunction (data=<optimized out>, function_process=<optimized out>, if_check=<optimized out>, while_check=<optimized out>, item_exit=<optimized out>, item_entry=<optimized out>, ifs=<optimized out>, n_ifs=<optimized out>, whiles=0x0, n_whiles=<optimized out>, attributes=0x78b74c0, function=0x3214aa0) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/src/schedule/ScheduleTraverse.c:584
    #14 ScheduleTraverseGroup (schedule_groups=0x7892280, group=<optimized out>, attributes=0x0, n_whiles=0, whiles=0x0, n_ifs=<optimized out>, ifs=0x0, item_entry=0x56abd0 <CCTKi_ScheduleCallEntry>, item_exit=0x56b750 <CCTKi_ScheduleCallExit>, while_check=0x56a880 <CCTKi_ScheduleCallWhile>, if_check=0x56a530 <CCTKi_ScheduleCallIf>, function_process=0x56b0a0 <CCTKi_ScheduleCallFunction>, data=0x7fffffffd340) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/src/schedule/ScheduleTraverse.c:368
    #15 0x00000000005752ff in CCTKi_DoScheduleTraverse (group_name=0x37267a4 "CCTK_RECOVER_VARIABLES", item_entry=0x56abd0 <CCTKi_ScheduleCallEntry>, item_exit=0x56b750 <CCTKi_ScheduleCallExit>, while_check=0x56a880 <CCTKi_ScheduleCallWhile>, if_check=0x56a530 <CCTKi_ScheduleCallIf>, function_process=<optimized out>, data=0x7fffffffd340) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/src/schedule/ScheduleTraverse.c:158
    #16 0x00000000005707db in ScheduleTraverse (CallFunction=0x17addc0 <Carpet::CallFunction(void*, cFunctionData*, void*)>, GH=0x78d3360, where=0x37267a4 "CCTK_RECOVER_VARIABLES") at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/src/main/ScheduleInterface.c:1360
    #17 CCTK_ScheduleTraverse (where=0x37267a4 "CCTK_RECOVER_VARIABLES", GH=0x78d3360, CallFunction=0x17addc0 <Carpet::CallFunction(void*, cFunctionData*, void*)>) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/src/main/ScheduleInterface.c:891
    #18 0x0000000001738d11 in Carpet::ScheduleTraverse (name=0x37267a4 "CCTK_RECOVER_VARIABLES", cctkGH=0x78d3360, where=<optimized out>) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/Carpet/Carpet/src/Initialise.cc:1277
    #19 0x00000000017396d9 in CallRecoverVariables (cctkGH=0x78d3360) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/Carpet/Carpet/src/Initialise.cc:241
    #20 Carpet::Initialise (fc=0x7fffffffd7e0) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/arrangements/Carpet/Carpet/src/Initialise.cc:115
    #21 0x00000000005633a3 in main (argc=2, argv=0x7fffffffd8e8) at /mnt/data/rhaas/postdoc/gr/ET_Lovelace/src/main/flesh.cc:80

    This is revision 1566 of AHFinderDirect and Carpet hash 3555:12ee1105096a. Attached are the parameter files (ending in -admbase) as well as stdout and stderr. This is with one MPI process and one thread, on my workstation at Caltech.

    The inability to solve this via a parameter setting is thus mostly related to some parameters not being steerable upon recovery, even though they could be changed, at least in some ways, without problems.

  14. Erik Schnetter

    I believe that the ADMBase timelevel parameter should be steerable. Reducing the number of active timelevels upon recovery is safe.

    I suggest recommending this as the remedy: modify ADMBase, and then change ADMBase::metric_timelevels to 2. In the future, ADMBase::metric_timelevels should be steerable upon recovery.

  15. Erik Schnetter

    I made the time level parameters in ADMBase, HydroBase, and TmunuBase steerable upon recovery. The time level parameters in McLachlan, WeylScal4, etc. are already steerable. This should make it possible to recover from checkpoint files without modifying the code.
