Allow different timers on different processors in TimerReport

Issue #814 closed
Erik Schnetter created an issue

This patch reduces not just timer values, but also timer names across processes. This allows different processes to have different timers.

Keyword: TimerReport

Comments (14)

  1. Jian Tao
    • removed comment

    Recommend to apply the patch.

    the patch was tested with a code where different functions are called on different processes and the number of timers are different across all the mpi processes. the timing results are gathered when the timer name is the same while appended when different names are found. that is what I would expect the timerreport to work in such a case.

  2. Roland Haas
    • changed status to open
    • removed comment

    This fails for me when there are timers that were destroyed (via CCTK_TimerDestroy by WaveExtractL from Llama) with a segfault in line 677 (the strncpy). The error is assuming that timer never go away. Apparently Cactus never reduces its number of timer counter though the actual timer index seems to be re-used by the underlying Util_DeleteHandle/Util_NewHandle. These two will free the associated timer structure (and the name). So CCTK_NumTimers() seems to be the number of times CCTK_TimerCreate was successfully called ie. the number of timers ever created, not the number of timers currently in existence and one always has to check if a timer id in a `for(i=0;i<CCTK_NumTimers();i++)` loop is still valid.

    Additionally it seems as if the large automatic arrays in line 700

    char all_timernames[total_ntimers][TIMERNAME_LENGTH]; which tries to create an automatic array of size CCTK_nProc()*CCTK_NumTimer()*TIMERNAME_LENGTH is also causing problems (since I have a test runs without a call to TimerDestroy which fails in line 704 "name_displacements[0] = 0" with a segfault.)

    In my runs I have nprocs >= 192 (576 cores, 6 threads), NumTimers >= 2500 (hydro, multipatch) and TIMERNAME_LENGTH = 101. The total size of this array is thus about 48MB. If this is ever tried to be allocated on the stack by a simple minded compiler then this should segfault as well (and I have no idea if a compiler is required to support arbitrarily large automatic arrays).

    If this ever causes a segfault then a better solution might be to rewrite the CollectTimerInfo routine in question in C++ and use vector<T> for memory allocation since there are still a number of arrays that have automatic storage class and whose size depends on nProcs so can also grow uncontrolled.

    Attached is a patch for both issues.

  3. Erik Schnetter reporter
    • removed comment

    This patch is okay to apply.

    If you want, you can keep the type of all_timernames; there is no need to change from char[][] to char[]. You would write

    char (*restrict const all_timernames)[TIMERNAME_LENGTH] = malloc(total_ntimers * sizeof *all_timernames); which would save the explicit index calculations, and would avoid the need to cast the type for compare_string_array. However, this is not necessary to apply the patch.

  4. Roland Haas
    • changed status to resolved
    • removed comment

    I committed version 2 of the patch as TimerReport rev 45 which restores the char[][] type. I had not realized that one could get C to handle allocatable 2d arrays in this fashion. Thanks for pointing this out.

  5. Ian Hinder
    • removed comment

    I don't understand why deletion of timers is allowed by Cactus. What does it mean? At the end of a run, you want to know how much time has been spent in each timer. Deleted timers make this impossible. Unless there is some compelling argument, maybe we should consider deprecating CCTK_TimerDestroy at some point in the future.

    Another issue is the assumption that timer names are only 100 characters long (this might have been due to me). Carpet's hierarchical timers are implemented as Cactus timers with names based on their path. i.e. main/Evolve/.../myfn. These could in principle get quite long. I am unhappy about silently truncating the timer names. I think it would be better to allow arbitrarily long timer names in TimerReport.

  6. Roland Haas
    • changed status to open
    • removed comment

    We sprung memory leaks. Some were in this code. Attached please find a patch. Normally I would simply fix these since they are obvious (and valgrind agrees with me that they are leaks) but given that we are soft-frozen...

  7. Log in to comment