- changed status to open
- removed comment
simfactory run script for bluewaters does not use -d or -cc numa_node
The aprun man page states:

> For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and the aprun -d option to specify the number of CPUs hosting the threads. ALPS creates -n pes instances of the executable, and the executable spawns OMP_NUM_THREADS-1 additional threads per PE.

These changes improve the run speed of qc0-mclachlan from 14 M/hr to 17 M/hr.
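Concretely, the man page's advice can be sketched as a job-script fragment like the one below. This is an illustration only: the executable name, parameter file, and rank counts are placeholders, not values taken from this ticket.

```shell
# Hedged sketch of an aprun invocation following the man-page advice:
# 8 MPI ranks total, 4 ranks per node, 8 OpenMP threads per rank,
# with each rank confined to one NUMA node via -cc numa_node.
# Executable and parameter file names are placeholders.
export OMP_NUM_THREADS=8
aprun -n 8 -N 4 -d "$OMP_NUM_THREADS" -cc numa_node \
      ./cactus_sim qc0-mclachlan.par
```

Here `-d` reserves one CPU per OpenMP thread within each PE, and `-cc numa_node` keeps each rank's threads (and, via first-touch, its memory) on a single NUMA node.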
Comments (6)
-
-
- removed comment
I recommend using thorn hwloc instead of these options, because these options are difficult to debug. (In the past, we had errors in such options, and systems had errors in their implementations.) In addition, the compilers' start-up routines sometimes overwrite these settings, partly invalidating them.
Thorn hwloc both outputs the current settings (making debugging possible) and also sets them (making such options in principle unnecessary).
Did you build/run with thorn hwloc?
-
reporter - removed comment
Both runs used hwloc. Hwloc's output for the unmodified run is:
```
INFO (hwloc): MPI process-to-host mapping:
This is MPI process 0 of 8
MPI hosts:
  0: nid07793
  1: nid07854
This MPI process runs on host 0 of 2
On this host, this is MPI process 0 of 4
INFO (hwloc): Topology support:
  Discovery support:
    discovery->pu                            : yes
  CPU binding support:
    cpubind->set_thisproc_cpubind            : yes
    cpubind->get_thisproc_cpubind            : yes
    cpubind->set_proc_cpubind                : yes
    cpubind->get_proc_cpubind                : yes
    cpubind->set_thisthread_cpubind          : yes
    cpubind->get_thisthread_cpubind          : yes
    cpubind->set_thread_cpubind              : yes
    cpubind->get_thread_cpubind              : yes
    cpubind->get_thisproc_last_cpu_location  : yes
    cpubind->get_proc_last_cpu_location      : yes
    cpubind->get_thisthread_last_cpu_location: yes
  Memory binding support:
    membind->set_thisproc_membind            : no
    membind->get_thisproc_membind            : no
    membind->set_proc_membind                : no
    membind->get_proc_membind                : no
    membind->set_thisthread_membind          : yes
    membind->get_thisthread_membind          : yes
    membind->set_area_membind                : yes
    membind->get_area_membind                : yes
    membind->alloc_membind                   : yes
    membind->firsttouch_membind              : yes
    membind->bind_membind                    : yes
    membind->interleave_membind              : yes
    membind->replicate_membind               : no
    membind->nexttouch_membind               : no
    membind->migrate_membind                 : yes
INFO (hwloc): Hardware objects in this node:
Machine L#0: (P#0, total=67108480KB, Backend=Linux, LinuxCgroup=/3012719, OSName=Linux, OSRelease=2.6.32.59-0.7.1_1.0402.7496-cray_gem_c, OSVersion="#1 SMP Wed Aug 7 03:55:25 UTC 2013", HostName=nid07793, Architecture=x86_64)
  Socket L#0: (P#0, total=33554048KB, CPUModel="AMD Opteron(TM) Processor 6276 ")
    NUMANode L#0: (P#0, local=16776832KB, total=16776832KB)
      L3Cache L#0: (P#-1, size=6144KB, linesize=64, ways=64)
        L2Cache L#0: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#0: (P#-1, size=16KB, linesize=64, ways=4) Core L#0: (P#0) PU L#0: (P#0)
        L2Cache L#1: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#1: (P#-1, size=16KB, linesize=64, ways=4) Core L#1: (P#1) PU L#1: (P#1)
        L2Cache L#2: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#2: (P#-1, size=16KB, linesize=64, ways=4) Core L#2: (P#2) PU L#2: (P#2)
        L2Cache L#3: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#3: (P#-1, size=16KB, linesize=64, ways=4) Core L#3: (P#3) PU L#3: (P#3)
        L2Cache L#4: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#4: (P#-1, size=16KB, linesize=64, ways=4) Core L#4: (P#4) PU L#4: (P#4)
        L2Cache L#5: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#5: (P#-1, size=16KB, linesize=64, ways=4) Core L#5: (P#5) PU L#5: (P#5)
        L2Cache L#6: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#6: (P#-1, size=16KB, linesize=64, ways=4) Core L#6: (P#6) PU L#6: (P#6)
        L2Cache L#7: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#7: (P#-1, size=16KB, linesize=64, ways=4) Core L#7: (P#7) PU L#7: (P#7)
    NUMANode L#1: (P#1, local=16777216KB, total=16777216KB)
      L3Cache L#1: (P#-1, size=6144KB, linesize=64, ways=64)
        L2Cache L#8: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#8: (P#-1, size=16KB, linesize=64, ways=4) Core L#8: (P#0) PU L#8: (P#8)
        L2Cache L#9: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#9: (P#-1, size=16KB, linesize=64, ways=4) Core L#9: (P#1) PU L#9: (P#9)
        L2Cache L#10: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#10: (P#-1, size=16KB, linesize=64, ways=4) Core L#10: (P#2) PU L#10: (P#10)
        L2Cache L#11: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#11: (P#-1, size=16KB, linesize=64, ways=4) Core L#11: (P#3) PU L#11: (P#11)
        L2Cache L#12: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#12: (P#-1, size=16KB, linesize=64, ways=4) Core L#12: (P#4) PU L#12: (P#12)
        L2Cache L#13: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#13: (P#-1, size=16KB, linesize=64, ways=4) Core L#13: (P#5) PU L#13: (P#13)
        L2Cache L#14: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#14: (P#-1, size=16KB, linesize=64, ways=4) Core L#14: (P#6) PU L#14: (P#14)
        L2Cache L#15: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#15: (P#-1, size=16KB, linesize=64, ways=4) Core L#15: (P#7) PU L#15: (P#15)
  Socket L#1: (P#1, total=33554432KB, CPUModel="AMD Opteron(TM) Processor 6276 ")
    NUMANode L#2: (P#2, local=16777216KB, total=16777216KB)
      L3Cache L#2: (P#-1, size=6144KB, linesize=64, ways=64)
        L2Cache L#16: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#16: (P#-1, size=16KB, linesize=64, ways=4) Core L#16: (P#0) PU L#16: (P#16)
        L2Cache L#17: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#17: (P#-1, size=16KB, linesize=64, ways=4) Core L#17: (P#1) PU L#17: (P#17)
        L2Cache L#18: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#18: (P#-1, size=16KB, linesize=64, ways=4) Core L#18: (P#2) PU L#18: (P#18)
        L2Cache L#19: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#19: (P#-1, size=16KB, linesize=64, ways=4) Core L#19: (P#3) PU L#19: (P#19)
        L2Cache L#20: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#20: (P#-1, size=16KB, linesize=64, ways=4) Core L#20: (P#4) PU L#20: (P#20)
        L2Cache L#21: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#21: (P#-1, size=16KB, linesize=64, ways=4) Core L#21: (P#5) PU L#21: (P#21)
        L2Cache L#22: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#22: (P#-1, size=16KB, linesize=64, ways=4) Core L#22: (P#6) PU L#22: (P#22)
        L2Cache L#23: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#23: (P#-1, size=16KB, linesize=64, ways=4) Core L#23: (P#7) PU L#23: (P#23)
    NUMANode L#3: (P#3, local=16777216KB, total=16777216KB)
      L3Cache L#3: (P#-1, size=6144KB, linesize=64, ways=64)
        L2Cache L#24: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#24: (P#-1, size=16KB, linesize=64, ways=4) Core L#24: (P#0) PU L#24: (P#24)
        L2Cache L#25: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#25: (P#-1, size=16KB, linesize=64, ways=4) Core L#25: (P#1) PU L#25: (P#25)
        L2Cache L#26: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#26: (P#-1, size=16KB, linesize=64, ways=4) Core L#26: (P#2) PU L#26: (P#26)
        L2Cache L#27: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#27: (P#-1, size=16KB, linesize=64, ways=4) Core L#27: (P#3) PU L#27: (P#27)
        L2Cache L#28: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#28: (P#-1, size=16KB, linesize=64, ways=4) Core L#28: (P#4) PU L#28: (P#28)
        L2Cache L#29: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#29: (P#-1, size=16KB, linesize=64, ways=4) Core L#29: (P#5) PU L#29: (P#29)
        L2Cache L#30: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#30: (P#-1, size=16KB, linesize=64, ways=4) Core L#30: (P#6) PU L#30: (P#30)
        L2Cache L#31: (P#-1, size=2048KB, linesize=64, ways=16) L1dCache L#31: (P#-1, size=16KB, linesize=64, ways=4) Core L#31: (P#7) PU L#31: (P#31)
INFO (hwloc): Thread CPU bindings:
  MPI process 0 on host 0 (process 0 of 4 on this host)
    OpenMP thread 0: PU set L#{0} P#{0}
    OpenMP thread 1: PU set L#{0} P#{0}
    OpenMP thread 2: PU set L#{0} P#{0}
    OpenMP thread 3: PU set L#{0} P#{0}
    OpenMP thread 4: PU set L#{0} P#{0}
    OpenMP thread 5: PU set L#{0} P#{0}
    OpenMP thread 6: PU set L#{0} P#{0}
    OpenMP thread 7: PU set L#{0} P#{0}
  MPI process 1 on host 0 (process 1 of 4 on this host)
    OpenMP thread 0: PU set L#{1} P#{1}
    OpenMP thread 1: PU set L#{1} P#{1}
    OpenMP thread 2: PU set L#{1} P#{1}
    OpenMP thread 3: PU set L#{1} P#{1}
    OpenMP thread 4: PU set L#{1} P#{1}
    OpenMP thread 5: PU set L#{1} P#{1}
    OpenMP thread 6: PU set L#{1} P#{1}
    OpenMP thread 7: PU set L#{1} P#{1}
  MPI process 2 on host 0 (process 2 of 4 on this host)
    OpenMP thread 0: PU set L#{2} P#{2}
    OpenMP thread 1: PU set L#{2} P#{2}
    OpenMP thread 2: PU set L#{2} P#{2}
    OpenMP thread 3: PU set L#{2} P#{2}
    OpenMP thread 4: PU set L#{2} P#{2}
    OpenMP thread 5: PU set L#{2} P#{2}
    OpenMP thread 6: PU set L#{2} P#{2}
    OpenMP thread 7: PU set L#{2} P#{2}
  MPI process 3 on host 0 (process 3 of 4 on this host)
    OpenMP thread 0: PU set L#{3} P#{3}
    OpenMP thread 1: PU set L#{3} P#{3}
    OpenMP thread 2: PU set L#{3} P#{3}
    OpenMP thread 3: PU set L#{3} P#{3}
    OpenMP thread 4: PU set L#{3} P#{3}
    OpenMP thread 5: PU set L#{3} P#{3}
    OpenMP thread 6: PU set L#{3} P#{3}
    OpenMP thread 7: PU set L#{3} P#{3}
INFO (hwloc): Setting thread CPU bindings:
INFO (hwloc): Thread CPU bindings:
  MPI process 0 on host 0 (process 0 of 4 on this host)
    OpenMP thread 0: PU set L#{0} P#{0}
    OpenMP thread 1: PU set L#{1} P#{1}
    OpenMP thread 2: PU set L#{2} P#{2}
    OpenMP thread 3: PU set L#{3} P#{3}
    OpenMP thread 4: PU set L#{4} P#{4}
    OpenMP thread 5: PU set L#{5} P#{5}
    OpenMP thread 6: PU set L#{6} P#{6}
    OpenMP thread 7: PU set L#{7} P#{7}
  MPI process 1 on host 0 (process 1 of 4 on this host)
    OpenMP thread 0: PU set L#{8} P#{8}
    OpenMP thread 1: PU set L#{9} P#{9}
    OpenMP thread 2: PU set L#{10} P#{10}
    OpenMP thread 3: PU set L#{11} P#{11}
    OpenMP thread 4: PU set L#{12} P#{12}
    OpenMP thread 5: PU set L#{13} P#{13}
    OpenMP thread 6: PU set L#{14} P#{14}
    OpenMP thread 7: PU set L#{15} P#{15}
  MPI process 2 on host 0 (process 2 of 4 on this host)
    OpenMP thread 0: PU set L#{16} P#{16}
    OpenMP thread 1: PU set L#{17} P#{17}
    OpenMP thread 2: PU set L#{18} P#{18}
    OpenMP thread 3: PU set L#{19} P#{19}
    OpenMP thread 4: PU set L#{20} P#{20}
    OpenMP thread 5: PU set L#{21} P#{21}
    OpenMP thread 6: PU set L#{22} P#{22}
    OpenMP thread 7: PU set L#{23} P#{23}
  MPI process 3 on host 0 (process 3 of 4 on this host)
    OpenMP thread 0: PU set L#{24} P#{24}
    OpenMP thread 1: PU set L#{25} P#{25}
    OpenMP thread 2: PU set L#{26} P#{26}
    OpenMP thread 3: PU set L#{27} P#{27}
    OpenMP thread 4: PU set L#{28} P#{28}
    OpenMP thread 5: PU set L#{29} P#{29}
    OpenMP thread 6: PU set L#{30} P#{30}
    OpenMP thread 7: PU set L#{31} P#{31}
INFO (hwloc): Extracting CPU/cache/memory properties:
There are 1 PUs per core (aka hardware SMT threads)
There are 1 threads per core (aka SMT threads used)
Cache (unknown name) has type "data" depth 1 size 16384 linesize 64 associativity 4 stride 4096, for 1 PUs
Cache (unknown name) has type "unified" depth 2 size 2097152 linesize 64 associativity 16 stride 131072, for 1 PUs
Cache (unknown name) has type "unified" depth 3 size 6291456 linesize 64 associativity 64 stride 98304, for 8 PUs
Memory has type "local" depth 2 size 17179475968 pagesize 4096, for 8 PUs
Memory has type "global" depth 2 size 68717903872 pagesize 4096, for 32 PUs
```
I attach stdout for both runs. The new runscript's output is in qc0-new.out and the unmodified one in qc0-vanilla.out.
-
- changed status to open
- removed comment
If both runs used hwloc, then it should not matter whether other software chose different settings beforehand. If you look at the outputs after the line "Setting thread CPU bindings", you'll see that both runs ended up with the same bindings.
Actually, I just realized that some low-level systems may initialize themselves before thorn hwloc is called. In that case, MPI would likely allocate some communication data structures during startup, and without these new options those would all live on NUMA node 0. This is bad.
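The mechanism behind this concern is first-touch placement: Linux puts a page on the NUMA node of the CPU that first writes it, so buffers MPI allocates before hwloc rebinds the threads stay wherever the launcher initially placed the process. One hedged way to inspect the policy and bindings currently in effect, assuming the numactl tool is installed (which this ticket does not state), is:

```shell
# Print the current NUMA policy, CPU binding, and allowed memory nodes.
# Falls back to a message if numactl is not available.
if command -v numactl >/dev/null 2>&1; then
    numactl --show
else
    echo "numactl not installed"
fi
```

Pinning at launch time (e.g. via `-cc numa_node`) ensures MPI's startup allocations are first-touched from the right node, which hwloc cannot retroactively fix.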
Please apply this patch.
Please leave this bug report open (or open another one) until we have checked all other mainstream systems.
-
reporter - changed status to resolved
- removed comment
Applied in rev 2255 of simfactory. Opened new ticket #1528 for other mainstream machines.
-
reporter - changed status to closed
- edited description