global-early, loop-local routines in PostRegrid cause access to elements outside of a vector capacity

Issue #971 closed
Roland Haas created an issue

This happens whenever a refinement level is created during evolution (but not during the initial regrid, which by default is in meta mode). The attached parameter file demonstrates the problem by forcing the initial regrid to happen in level mode.

The actual error obtained (using a debug executable) is:

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check
Rank 0 with PID 23084 received signal6
Writing backtrace to ./backtrace.0.txt
Aborted

which can be traced back to

(gdb) bt
#0  0x00007ffff3476475 in *__GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ffff34796f0 in *__GI_abort () at abort.c:92
#2  0x00007ffff3c5468d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff3c52796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff3c527c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff3c529ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff3ca466d in std::__throw_out_of_range(char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00000000010856b7 in std::vector<std::vector<gdata*, std::allocator<gdata*> >, std::allocator<std::vector<gdata*, std::allocator<gdata*> > > >::_M_range_check (this=0xbba8850, __n=0) at /usr/include/c++/4.7/bits/stl_vector.h:774
#8  0x00000000052405b7 in std::vector<std::vector<gdata*, std::allocator<gdata*> >, std::allocator<std::vector<gdata*, std::allocator<gdata*> > > >::at (this=0xbba8850, __n=0) at /usr/include/c++/4.7/bits/stl_vector.h:792
#9  0x000000000524050b in ggf::data_pointer (this=0xb735ad0, tl=0, rl=2, lc=0, ml=0) at /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/CarpetLib/src/ggf.hh:214
#10 0x0000000005257432 in Carpet::enter_local_mode (cctkGH=0xb68bae0, c=0, lc=0, grouptype=402) at /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/Carpet/src/modes.cc:654
#11 0x0000000005258be1 in Carpet::local_component_iterator::step (this=0x7fffffffc580) at /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/Carpet/src/modes.cc:1030
#12 0x0000000005258aa4 in Carpet::local_component_iterator::local_component_iterator (this=0x7fffffffc580, cctkGH_=0xb68bae0, grouptype_=402) at /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/Carpet/src/modes.cc:1006
#13 0x0000000005261f28 in Carpet::CallFunction (function=0x4b42007, attribute=0xb682af8, data=0xb68bae0) at /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/Carpet/src/CallFunction.cc:187
#14 0x00000000006f9cfc in CCTKi_ScheduleCallFunction (function=0x4b42007, attribute=0xb682ae0, data=0x7fffffffcc50) at /mnt/data/rhaas/postdoc/gr/Zelmani/src/main/ScheduleInterface.c:3027
#15 0x00000000006fdaf7 in ScheduleTraverseFunction (function=0x4b42007, attributes=0xb682ae0, n_whiles=0, whiles=0x0, n_ifs=0, ifs=0x0, item_entry=0x6f9548 <CCTKi_ScheduleCallEntry>, item_exit=0x6f9772 <CCTKi_ScheduleCallExit>, while_check=0x6f9981 <CCTKi_ScheduleCallWhile>, if_check=0x6f9a14 <CCTKi_ScheduleCallIf>, 
    function_process=0x6f9aa3 <CCTKi_ScheduleCallFunction>, data=0x7fffffffcc50) at /mnt/data/rhaas/postdoc/gr/Zelmani/src/schedule/ScheduleTraverse.c:584
#16 0x00000000006fd782 in ScheduleTraverseGroup (schedule_groups=0xb66e280, group=0xb682280, attributes=0xb681df0, n_whiles=0, whiles=0x0, n_ifs=0, ifs=0x0, item_entry=0x6f9548 <CCTKi_ScheduleCallEntry>, item_exit=0x6f9772 <CCTKi_ScheduleCallExit>, while_check=0x6f9981 <CCTKi_ScheduleCallWhile>, 
    if_check=0x6f9a14 <CCTKi_ScheduleCallIf>, function_process=0x6f9aa3 <CCTKi_ScheduleCallFunction>, data=0x7fffffffcc50) at /mnt/data/rhaas/postdoc/gr/Zelmani/src/schedule/ScheduleTraverse.c:368
#17 0x00000000006fd92e in ScheduleTraverseGroup (schedule_groups=0xb66e280, group=0xb677520, attributes=0x0, n_whiles=0, whiles=0x0, n_ifs=0, ifs=0x0, item_entry=0x6f9548 <CCTKi_ScheduleCallEntry>, item_exit=0x6f9772 <CCTKi_ScheduleCallExit>, while_check=0x6f9981 <CCTKi_ScheduleCallWhile>, 
    if_check=0x6f9a14 <CCTKi_ScheduleCallIf>, function_process=0x6f9aa3 <CCTKi_ScheduleCallFunction>, data=0x7fffffffcc50) at /mnt/data/rhaas/postdoc/gr/Zelmani/src/schedule/ScheduleTraverse.c:384
#18 0x00000000006fd4b1 in CCTKi_DoScheduleTraverse (group_name=0x75d82be "CCTK_POSTREGRIDINITIAL", item_entry=0x6f9548 <CCTKi_ScheduleCallEntry>, item_exit=0x6f9772 <CCTKi_ScheduleCallExit>, while_check=0x6f9981 <CCTKi_ScheduleCallWhile>, if_check=0x6f9a14 <CCTKi_ScheduleCallIf>, 
    function_process=0x6f9aa3 <CCTKi_ScheduleCallFunction>, data=0x7fffffffcc50) at /mnt/data/rhaas/postdoc/gr/Zelmani/src/schedule/ScheduleTraverse.c:158
#19 0x00000000006f7f26 in ScheduleTraverse (where=0x75d82be "CCTK_POSTREGRIDINITIAL", GH=0xb68bae0, CallFunction=0x5260878 <Carpet::CallFunction(void*, cFunctionData*, void*)>) at /mnt/data/rhaas/postdoc/gr/Zelmani/src/main/ScheduleInterface.c:1360
#20 0x00000000006f6d78 in CCTK_ScheduleTraverse (where=0x75d82be "CCTK_POSTREGRIDINITIAL", GH=0xb68bae0, CallFunction=0x5260878 <Carpet::CallFunction(void*, cFunctionData*, void*)>) at /mnt/data/rhaas/postdoc/gr/Zelmani/src/main/ScheduleInterface.c:891
#21 0x00000000051f9def in Carpet::ScheduleTraverse (where=0x75d8233 "CallRegridInitialLevel", name=0x75d82be "CCTK_POSTREGRIDINITIAL", cctkGH=0xb68bae0) at /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/Carpet/src/Initialise.cc:1360
#22 0x00000000051f86d5 in Carpet::CallRegridInitialLevel (cctkGH=0xb68bae0) at /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/Carpet/src/Initialise.cc:1168
#23 0x00000000051f54c7 in Carpet::CallInitial (cctkGH=0xb68bae0) at /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/Carpet/src/Initialise.cc:452
#24 0x00000000051f3365 in Carpet::Initialise (fc=0x7fffffffd710) at /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/Carpet/src/Initialise.cc:124
#25 0x00000000006f18f3 in main (argc=2, argv=0x7fffffffd818) at /mnt/data/rhaas/postdoc/gr/Zelmani/src/main/flesh.cc:80

The reason seems to be that the data structures in dh are only valid once Recompose has run on this level (I think). Since global-early, loop-local routines run before this happens, they see inconsistent data structures and trigger the out-of-bounds error in enter_local_mode (frame #10 of the backtrace). The issue became triggerable only recently: before 55759e108807 Carpet would not call global-early routines in PostRegrid, and the first such routine is HydroBase_InitExcisionMask, which was recently made global-early, loop-local.
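The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyGridFunction`, `add_level`, `recompose` and `access_throws` are hypothetical names invented here, not part of CarpetLib. The only thing taken from the backtrace is that a bounds-checked `std::vector::at` call throws `std::out_of_range` when a level's component storage has not been set up yet.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Toy stand-in for per-level grid function storage (not the real ggf class).
struct ToyGridFunction {
  // storage[rl][lc]: one entry per local component lc of refinement level rl.
  std::vector<std::vector<double>> storage;

  // The hierarchy learns about a new level, but nothing is allocated yet.
  void add_level() { storage.emplace_back(); }

  // Stand-in for recomposing level rl: allocate its component storage.
  void recompose(std::size_t rl, std::size_t ncomponents) {
    assert(rl < storage.size());
    storage[rl].assign(ncomponents, 0.0);
  }

  // Bounds-checked access, analogous to the at() call in the backtrace.
  double& data(std::size_t rl, std::size_t lc) {
    return storage.at(rl).at(lc);
  }
};

// Returns true if accessing (rl, lc) throws std::out_of_range.
bool access_throws(ToyGridFunction& gf, std::size_t rl, std::size_t lc) {
  try {
    gf.data(rl, lc);
    return false;
  } catch (const std::out_of_range&) {
    return true;
  }
}
```

In this model, calling `access_throws(gf, rl, 0)` right after `add_level()` reports an exception, and the same access succeeds once `recompose` has run for that level. With an unchecked `operator[]` in place of `at()`, the early access would be undefined behaviour rather than a clean exception.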

Reverting these changes does not seem like a good option for the reasons outlined in #958.

Comments (8)

  1. Roland Haas reporter

    Without a debug executable one is likely to get strange memory corruption errors, I think (I have runs that show them, though they happen late in the run and so are inconvenient to debug), so this should be fixed one way or the other.

  2. Erik Schnetter

    Recomposing (modifying the grid structure) proceeds from coarse to fine levels. This has to happen in this order, since newly created regions depend on respective coarser grids, and may also depend on boundary data or ghost zones on these coarser grids. This means in particular that recomposing and calling postregrid has to occur in lock-step. Calling a global mode routine before the new grid structure has been set up seems thus inconsistent. global-early routines need to be disabled in postregrid.

    We can still allow early-meta mode routines early in postregrid, as these do not have access to the grid hierarchy.

    global-early loop-local is clearly impossible. Depending on the particular action, this either needs to be distributed over local mode routines, or needs to be turned into a global-late loop-local routine, and some other actions may also need to be deferred.
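The ordering argument above can be sketched as a small simulation. It is a toy model, not Carpet code: `GlobalWhen` and `unrecomposed_levels_visited` are hypothetical names, and the model assumes that a global-early routine fires while postregrid runs on the coarsest level and a global-late routine fires on the finest level.

```cpp
#include <vector>

enum class GlobalWhen { Early, Late };

// Returns the refinement levels that a global, loop-local routine would
// visit before they have been recomposed, in a toy lock-step regrid.
std::vector<int> unrecomposed_levels_visited(int nlevels, GlobalWhen when) {
  std::vector<bool> recomposed(nlevels, false);
  std::vector<int> stale;
  // Assumption of this model: early fires on level 0, late on the finest.
  const int trigger = (when == GlobalWhen::Early) ? 0 : nlevels - 1;
  for (int rl = 0; rl < nlevels; ++rl) {
    recomposed[rl] = true;  // recompose level rl ...
    if (rl == trigger) {    // ... then postregrid on rl; global routine fires
      for (int l = 0; l < nlevels; ++l) {
        if (!recomposed[l]) stale.push_back(l);  // loop-local visits level l
      }
    }
  }
  return stale;
}
```

With three levels, the early variant visits levels 1 and 2 before they are recomposed, while the late variant visits none, which is the reason a global-late loop-local routine is a viable replacement in this picture.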

  3. Roland Haas reporter

    Ok. That is about what I had feared (namely that one cannot do all recompose steps first and only then do a postregrid). For PostRegridInitial (the one in Meta mode, not the one during initial data) this works since there is no data that needs to be preserved since regridding happens before CCTK_INITIAL, yes?

    For the present problem the routines that need to be changed are:

    • SphericalSurface::SphericalSurface_Set from global to global-early (runs in PostStep and Initial)
    • SetMask_SphericalSurface::SetMask_SphericalSurface from global, loop-local to just local
    • HydroBase::HydroBase_InitExcisionMask from global-early,loop-local to just local

    The reasoning is that SphericalSurface::SphericalSurface_Set has to run before local routines that schedule themselves after SphericalSurface_HasBeenSet and that global is global-late during CCTK_POSTSTEP (and global-early during EVOL and INITIAL). Anything else that sets spherical surface should then also be converted from global to global-early if it runs in CCTK_POSTSTEP. Anything that runs in HydroBase_InitExcisionMask must be local or global-late if it accesses the grid patches. This will change the order in which routines are called.

    I'll try to come up with a way to alert the user at runtime if they try to schedule a global-early, loop-local routine in PostRegrid. Global-early in itself is still ok in PostRegrid, as long as the routine only accesses say grid arrays and grid scalars, yes?
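A runtime check of the kind proposed here might look roughly like the following. This is a sketch only: `ToyRoutine`, `is_postregrid_bin` and `should_warn` are hypothetical and do not correspond to Carpet's or the flesh's actual data structures. The bin names are the ones that appear in this report (`CCTK_POSTREGRIDINITIAL` in the backtrace) plus `CCTK_POSTREGRID`.

```cpp
#include <string>

// Hypothetical description of a scheduled routine; not Cactus's cFunctionData.
struct ToyRoutine {
  std::string bin;    // schedule bin the routine runs in
  bool global_early;  // scheduled with the global-early option
  bool loop_local;    // scheduled with the loop-local option
};

// True for the schedule bins that run in lock-step with recomposing.
bool is_postregrid_bin(const std::string& bin) {
  return bin == "CCTK_POSTREGRID" || bin == "CCTK_POSTREGRIDINITIAL";
}

// True if the routine is a global-early, loop-local routine in a
// postregrid bin, i.e. the combination that triggers this issue.
bool should_warn(const ToyRoutine& r) {
  return is_postregrid_bin(r.bin) && r.global_early && r.loop_local;
}
```

The sketch presumes the check knows which bin is being traversed, so the bin name is passed in explicitly as part of `ToyRoutine`.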

  4. Erik Schnetter

    Yes, if initialising in meta mode, there will be no interpolation. One can recompose all grid functions at once, and call postregrid only afterwards.

    No, I would say that global-early is not okay in postregrid. One could allow routines that do not access grid functions, but this is difficult to ensure; e.g. reduction, interpolation, and I/O is also not possible. To avoid problems, I would forbid global-early generally. (But this is open for discussion.) Having a "global mode" that does not allow accessing grid functions even indirectly is like introducing yet another mode ("weak global mode"?), and this would be too confusing. At the moment, I rather hope that the schedule can be re-written without a weak global mode.

  5. Roland Haas reporter
    • changed status to open

    Attached please find patches that rewrite the scheduling of HydroBase::HydroBase_InitExcisionMask and SetMask_SphericalSurface to avoid using global-early in postregrid. The patches revert #958 (for HydroBase) and schedule SetMask_SphericalSurface in local mode during postregrid (but leave it global loop-local in poststep).

    The patches to AHFinderDirect and CarpetMask either remove AFTER statements that had no effect, since the referenced group does not exist in the schedule bin in question, or add comments.

    I'll also prepare a patch for Carpet to forbid global-early routines in PostRegrid, though I am not quite sure yet how to achieve this, since only CallFunction looks at the attributes but it no longer knows which bin we are in, I think.

    Ok to apply?

  6. Roland Haas reporter
    • changed status to resolved

    All have been applied. The revisions are 1568 for AHFinderDirect, 56 for HydroBase, 116 for SetMask_SphericalSurface and 87a47611d72e "CarpetMask: comment on scheduling in PostStep after SphericalSurface_HasBeenSet" for Carpet.
