TimerReport doesn't scale very well.

Issue #1620 new
Bruno Mundim created an issue

During some scaling tests of a unigrid domain using only MPI (no OpenMP) on Supermuc with Noether release, I noticed that the routine doing the TimerReport routine scales very poorly. I had set the following relevant parameters:

Cactus::cctk_itlast                        = 256
TimerReport::out_every                     = 128

TimerReport::output_all_timers_readable    = "yes"
TimerReport::output_schedule_timers        = "no"
TimerReport::n_top_timers                  = 200

Carpet::output_timer_tree_every            = 128
Carpet::output_initialise_timer_tree       = "yes"

Carpet::sync_barriers                      = "yes"

With 144 MPI processes the timer routine didn't take much time of the simulation:

INFO (TimerReport): Top timers at iteration 256 time 29.0133
======================================================================
    %    Time/s   Min/s   Max/s   Timer (gettimeofday)
======================================================================
100.0   6172.83 6172.82 6172.92   meta mode
...
  0.0      1.24    1.21    2.34   [0176] TimerReport: zzz_TimerReport_Output in CCTK_ANALYSIS
...
  0.0      0.96    0.93    1.81   main/Evolve/CallAnalysis/CCTK_ANALYSIS/CallFunction/thorns/zzz_TimerReport_Output
...

While with 2304 MPI jobs, it clearly shows how bad it scales:

INFO (TimerReport): Top timers at iteration 256 time 29.0133
======================================================================
    %    Time/s   Min/s   Max/s   Timer (gettimeofday)
======================================================================
100.0   618.300 618.288 618.316   meta mode
...
  8.5    52.462  52.372  79.462   [0176] TimerReport: zzz_TimerReport_Output in CCTK_ANALYSIS
...
  6.7    41.451  41.435  61.203   main/Evolve/CallAnalysis/CCTK_ANALYSIS/CallFunction/thorns/zzz_TimerReport_Output
...

Two things caught my attention here. One was obviously how bad zzz_TimerReport is scaling as the number of MPI processes increase. The second was the inconsistency between the timers "[0176] TimerReport:..." and "main/Evolve/...". Why do they show different times for the same function?

Anyways, it might not be overall important but I would like to leave it reported here about this issue.

Keyword:

Comments (2)

  1. Erik Schnetter
    • removed comment

    The overhead in TimerReport surprises me. TimerReport needs to collect information from all processes and sort it, which requires a reduction operation and string operations. Maybe one of the algorithms used isn't scalable -- we should investigate further.

    The difference in timings comes likely from overhead between the respective measurement intervals. The timing prefixed "[0176]" comes from the Cactus scheduler, which calls into Carpet, which then has to perform certain operations before calling the actual function, which is then measured by the other timer (".../thorns/..."). One of these operations is synchronization, but there are also others.

  2. Bruno Mundim reporter
    • removed comment

    The overhead in TimerReport surprises me. TimerReport needs to collect information from all processes and sort it, which requires a reduction operation and string operations. Maybe one of the algorithms used isn't scalable -- we should investigate further.

    That surprised me too. As we start using a larger number of cores for our simulations, it might be good to take a look at this.

    The difference in timings comes likely from overhead between the respective measurement intervals. The timing prefixed "[0176]" comes from the Cactus scheduler, which calls into Carpet, which then has to perform certain operations before calling the actual function, which is then measured by the other timer (".../thorns/..."). One of these operations is synchronization, but there are also others.

    Thanks a lot for this clarification.

    Cheers, Bruno.

  3. Log in to comment