TimerReport doesn't scale very well.
During scaling tests of a unigrid domain using only MPI (no OpenMP) on SuperMUC with the Noether release, I noticed that the TimerReport output routine scales very poorly. I had set the following relevant parameters:
Cactus::cctk_itlast                     = 256
TimerReport::out_every                  = 128
TimerReport::output_all_timers_readable = "yes"
TimerReport::output_schedule_timers     = "no"
TimerReport::n_top_timers               = 200
Carpet::output_timer_tree_every         = 128
Carpet::output_initialise_timer_tree    = "yes"
Carpet::sync_barriers                   = "yes"
With 144 MPI processes, the timer routine didn't take up much of the simulation time:
INFO (TimerReport): Top timers at iteration 256 time 29.0133
======================================================================
    %   Time/s    Min/s    Max/s  Timer (gettimeofday)
======================================================================
100.0  6172.83  6172.82  6172.92  meta mode
...
  0.0     1.24     1.21     2.34  [0176] TimerReport: zzz_TimerReport_Output in CCTK_ANALYSIS
...
  0.0     0.96     0.93     1.81  main/Evolve/CallAnalysis/CCTK_ANALYSIS/CallFunction/thorns/zzz_TimerReport_Output
...
With 2304 MPI processes, however, the poor scaling is clear:
INFO (TimerReport): Top timers at iteration 256 time 29.0133
======================================================================
    %   Time/s    Min/s    Max/s  Timer (gettimeofday)
======================================================================
100.0  618.300  618.288  618.316  meta mode
...
  8.5   52.462   52.372   79.462  [0176] TimerReport: zzz_TimerReport_Output in CCTK_ANALYSIS
...
  6.7   41.451   41.435   61.203  main/Evolve/CallAnalysis/CCTK_ANALYSIS/CallFunction/thorns/zzz_TimerReport_Output
...
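As a sanity check on the figures above, the percent column is just each timer's Time/s divided by the total reported for "meta mode" (this is my reading of the output, not a statement about TimerReport internals):

```python
# Sanity check: percent column = per-timer time / total "meta mode" time.
total = 618.300  # total time at 2304 processes

report_time = 52.462  # [0176] TimerReport: zzz_TimerReport_Output
thorn_time = 41.451   # .../thorns/zzz_TimerReport_Output

print(round(100 * report_time / total, 1))  # 8.5
print(round(100 * thorn_time / total, 1))   # 6.7
```

So at 2304 processes the report routine alone accounts for roughly a twelfth of the total run time, versus essentially zero at 144 processes.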
Two things caught my attention here. The first was, obviously, how badly zzz_TimerReport scales as the number of MPI processes increases. The second was the inconsistency between the timers "[0176] TimerReport: ..." and "main/Evolve/...": why do they show different times for the same function?
Anyway, this may not be critical overall, but I wanted to report the issue here.
Comments (2)
The overhead in TimerReport surprises me. TimerReport needs to collect information from all processes and sort it, which requires a reduction operation and string operations. Maybe one of the algorithms used isn't scalable -- we should investigate further.
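For reference, the reduction itself is conceptually simple; a minimal plain-Python sketch (no MPI, with a hypothetical per-process timer table — TimerReport's actual collection code may differ) of what has to happen for every timer name:

```python
# Hypothetical sketch: each MPI process contributes {timer_name: seconds};
# one process must reduce these into (avg, min, max) per timer and then
# sort by average time. With P processes and T timers, O(P*T) data
# arrives at a single process -- a scalability hazard if done naively.
def aggregate(per_process_tables):
    names = per_process_tables[0].keys()
    stats = {}
    for name in names:
        values = [t[name] for t in per_process_tables]
        stats[name] = (sum(values) / len(values), min(values), max(values))
    # Sort descending by average time, as in the "Top timers" output.
    return sorted(stats.items(), key=lambda kv: -kv[1][0])

# Toy data standing in for 4 processes:
tables = [{"evolve": 10.0 + p, "output": 1.0} for p in range(4)]
print(aggregate(tables)[0])  # ('evolve', (11.5, 10.0, 13.0))
```

In real MPI code the per-timer min/max/avg could be computed with scalable collectives, so if the cost grows this quickly with process count, the gather of timer *names* (string operations) or a serial sort at the root seems a more likely culprit.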
That surprised me too. As we start using a larger number of cores for our simulations, it might be good to take a look at this.
The difference in timings likely comes from overhead between the respective measurement intervals. The timer prefixed "[0176]" comes from the Cactus scheduler, which calls into Carpet, which then has to perform certain operations before calling the actual function, which is in turn measured by the other timer (".../thorns/..."). One of these operations is synchronization, but there are others as well.
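This nesting can be illustrated with two stopwatch-style timers (a toy model, not Cactus code): the scheduler-level timer starts before the pre-call work, so it always reads at least as much as the timer around the function body alone.

```python
import time

def function_body():
    time.sleep(0.05)  # stand-in for the actual scheduled routine

# Scheduler-level timer (like "[0176] ...") spans pre-call work + call.
outer_start = time.perf_counter()
time.sleep(0.02)      # stand-in for synchronization and other pre-call work
inner_start = time.perf_counter()
function_body()       # inner timer (like ".../thorns/...") spans only this
inner = time.perf_counter() - inner_start
outer = time.perf_counter() - outer_start

assert outer >= inner  # the outer timer always includes the extra overhead
```

This matches the numbers above: the "[0176]" timer (52.462 s) exceeds the ".../thorns/..." timer (41.451 s), and the gap is the Carpet-side overhead.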
Thanks a lot for this clarification.
Cheers, Bruno.