Modify

Opened 3 months ago

Last modified 13 days ago

#2194 review defect

Memory increase during regridding

Reported by: wolfgang.kastaun@… Owned by:
Priority: unset Milestone:
Component: Other Version: development version
Keywords: Cc: miguel.zilhao.nogueira

Description

I still have the problem that my BNS runs consume increasing memory during inspiral, where regridding happens. It limits the runtime of typical BNS runs to around 1 day, then they are killed by memory exhaustion. After merger, with fixed grid structure, the problem goes away.
It happens with two different codes, WhiskyThermal+McLachlan and WhiskyMHD+McLachlan, on all clusters I use (hydra, marconi, supermuc, fermi, ..), and all releases I used: Payne, Brahe, Tesla, Wheeler. Regridding controled using CarpetRegrid2.

Attachments (2)

memory_usage.png (43.7 KB) - added by wolfgang.kastaun@… 3 months ago.
RSS usage from system_statistics over several restarts before and after merger (vertical line)
ML_QC0.par (18.6 KB) - added by Roland Haas 2 months ago.
Miguel Zilhao's example parfile to demonstrate the issue

Download all attachments as: .zip

Change History (11)

Changed 3 months ago by wolfgang.kastaun@…

Attachment: memory_usage.png added

RSS usage from system_statistics over several restarts before and after merger (vertical line)

comment:1 Changed 3 months ago by Ian Hinder

See also this thread: http://lists.einsteintoolkit.org/pipermail/users/2018-August/thread.html#6417. (Line wrapping problematic on the mailing list archive page as usual, which I believe is caused by a bug in Apple Mail).

comment:2 Changed 3 months ago by Steven R. Brandt

Wolfgang, can you identify a small par file for me which illustrates the problem?

Changed 2 months ago by Roland Haas

Attachment: ML_QC0.par added

Miguel Zilhao's example parfile to demonstrate the issue

comment:3 Changed 7 weeks ago by Steven R. Brandt

Confirmed using valgrind. The memory allocated on line 148 of ggf.cc is not freed.

139   // Resize structure and allocate storage
140   storage.resize(h.mglevels());
141   for (int ml = 0; ml < h.mglevels(); ++ml) {
142     storage.AT(ml).resize(h.reflevels());
143     storage.AT(ml).AT(rl).resize(h.local_components(rl));
144     for (int lc = 0; lc < h.local_components(rl); ++lc) {
145       int const c = h.get_component(rl, lc);
146       storage.AT(ml).AT(rl).AT(lc).resize(timelevels(ml, rl));
147       for (int tl = 0; tl < timelevels(ml, rl); ++tl) {
148         storage.AT(ml).AT(rl).AT(lc).AT(tl) = typed_data(tl, rl, lc, ml);
149         storage.AT(ml).AT(rl).AT(lc).AT(tl)->allocate(
150             d.light_boxes.AT(ml).AT(rl).AT(c).exterior, dist::rank());
151       } // for tl
152     }   // for lc
153   }     // for ml
154 

comment:4 Changed 7 weeks ago by Erik Schnetter

Regridding is supposed to happen as follows:

  1. Memory for the new grid structure is allocated
  2. As much as possible, data are copied from the old to the new grid structure
  3. Where necessary, data are interpolated from the old to the new grid structure
  4. The old grid structure is freed

The routine which you list is step 1, allocation. Deallocation is supposed to happen in step 4 in a function called recompose_free_old. When Cactus terminates, then the memory for the current grid structure is deallocated in a function recompose_free.

comment:5 Changed 7 weeks ago by Roland Haas

Please see also this old discussion on how to track down memory leaks: http://lists.einsteintoolkit.org/pipermail/users/2014-May/003611.html where eg Google's perftools https://code.google.com/p/gperftools/ are mentioned which also has some heap checker build in. There is also a home-grown memory tagger (an attachment to one of the emails) that lets you answer questions like: "right now, who allocated how much memory" which you can use to tag memory allocations to find out if some allocates are never freed (there's no need to explicitly match up allocate / free).

Last edited 7 weeks ago by Roland Haas (previous) (diff)

comment:6 Changed 3 weeks ago by miguel.zilhao.nogueira

Cc: miguel.zilhao.nogueira added

comment:7 Changed 2 weeks ago by Roland Haas

I had a look at it and the increase in memory usage was (at least in the example parfile that is attached to this ticket), not due to grid functions.

Digging a bit deeper I found that CarpetLib's commstate objects never free memory allocated for their communication buffers (the procbuf objects), and instead retain that storage in a class static variable (to avoid constantly allocating and freeing it for every single group in a SYNC statement).

This effectively means that the size of procbuf can only ever grow and eg if a large procbuf is needed during recomposing (re-distributing data after regridding) then that large procbuf stays around even if a "regular" SYNC (ie. a ghost zone sync and prolongation) would need much less storage for the communication buffers.

I have tested this guess by editing comm_state::procbufdesc::reinitialize in commstate.cc to read:

  // Note: calling resize(0) instead of clear() ensures that the
  // vector capacity does not change
#if 0
  sendbufbase.resize(0);
  recvbufbase.resize(0);
#else
  {vector<char> empty; sendbufbase.swap(empty);}
  {vector<char> empty; recvbufbase.swap(empty);}
#endif

so that storage is actually freed (note that a call to clear() would not guarantee to free storage either, contrary to what the comment may imply, see http://www.cplusplus.com/reference/vector/vector/clear/).

So there is no memory leak in the sense that no memory is lost and all memory is properly accounted for by Carpet. There is also an upper bound on how much storage the procbufs could take (though that max may be on the size of a full refinement level for the largest group of variables). However there can clearly be an increase in storage used if a minor change in grid structure (as what happens in iteration 2048 of the example, where box 3 just changes size a bit, it existed before already) can lead to a large amount of data needing to be communicated (maybe b/c the assignment of patches to ranks changed).

My change is probably not what one wants to keep in the code (since the comment indicates that the original author did not want to free/allocate the buffers all the time), but a change that only frees the buffers after a regrid would seem useful to me (and not hard to implement I hope).

Last edited 2 weeks ago by Roland Haas (previous) (diff)

comment:9 Changed 13 days ago by Roland Haas

This pull request frees memory in the communication buffers after a regrid operation under the assumption that a regrid will change the communication pattern thus requiring differently sized communication buffers.

Modify Ticket

Change Properties
Set your email in Preferences
Action
as review The ticket will remain with no owner.
Next status will be 'reviewed_ok'.
as The resolution will be set.
The resolution will be deleted.
to The owner will be changed from (none) to the specified user.
to The owner will be changed from (none) to the specified user.
The owner will be changed from (none) to anonymous.

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.