simfactory run script for bluewaters does not use -d or -cc numa_node

Issue #1527 closed
Roland Haas created an issue

The aprun man page states:

For OpenMP applications, use both the OMP_NUM_THREADS
environment variable to specify the number of threads and
the aprun -d option to specify the number of CPUs hosting
the threads. ALPS creates -n pes instances of the
executable, and the executable spawns OMP_NUM_THREADS-1
additional threads per PE.


Adding -d and -cc numa_node to the run script improves the run speed of qc0-mclachlan from 14 M/hr to 17 M/hr.
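For illustration, here is a minimal sketch of how such an aprun command line could be assembled. The process counts (8 MPI processes, 4 per node, 8 OpenMP threads each) are taken from the hwloc output quoted later in this thread; the executable name `./cactus_sim`, the parameter file name, and the helper `aprun_command` are hypothetical and are not taken from the actual simfactory run script.

```python
import shlex

def aprun_command(num_procs, procs_per_node, num_threads, executable, args=()):
    """Build an aprun command line that reserves one CPU per OpenMP thread."""
    cmd = [
        "aprun",
        "-n", str(num_procs),       # total number of PEs (MPI processes)
        "-N", str(procs_per_node),  # PEs per node
        "-d", str(num_threads),     # CPUs per PE; should match OMP_NUM_THREADS
        "-cc", "numa_node",         # keep each PE's threads within one NUMA node
        executable,
        *args,
    ]
    return " ".join(shlex.quote(c) for c in cmd)

# OMP_NUM_THREADS must still be exported separately in the run script.
env = {"OMP_NUM_THREADS": "8"}

# Hypothetical executable and parameter file names.
command = aprun_command(8, 4, 8, "./cactus_sim", ["qc0-mclachlan.par"])
print(command)
```

Keeping `-d` equal to `OMP_NUM_THREADS` is the point of the man page excerpt above: without it, aprun has no way of knowing how many CPUs each process needs for its threads.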

Comments (6)

  1. Erik Schnetter
    • removed comment

    I recommend using thorn hwloc instead of these options, because these options are difficult to debug. (In the past, we had errors in such options, and systems had errors in their implementations.) In addition, the compilers' start-up routines sometimes overwrite these settings, partly invalidating them.

    Thorn hwloc both outputs the current settings (making debugging possible) and sets them (making such options in principle unnecessary).

    Did you build/run with thorn hwloc?

  2. Roland Haas reporter
    • removed comment

    Both runs used hwloc. Hwloc's output for the unmodified run is:

    INFO (hwloc): MPI process-to-host mapping:
    This is MPI process 0 of 8
    MPI hosts:
      0: nid07793
      1: nid07854
    This MPI process runs on host 0 of 2
    On this host, this is MPI process 0 of 4
    INFO (hwloc): Topology support:
    Discovery support:
      discovery->pu                            : yes
    CPU binding support:
      cpubind->set_thisproc_cpubind            : yes
      cpubind->get_thisproc_cpubind            : yes
      cpubind->set_proc_cpubind                : yes
      cpubind->get_proc_cpubind                : yes
      cpubind->set_thisthread_cpubind          : yes
      cpubind->get_thisthread_cpubind          : yes
      cpubind->set_thread_cpubind              : yes
      cpubind->get_thread_cpubind              : yes
      cpubind->get_thisproc_last_cpu_location  : yes
      cpubind->get_proc_last_cpu_location      : yes
      cpubind->get_thisthread_last_cpu_location: yes
    Memory binding support:
      membind->set_thisproc_membind            : no
      membind->get_thisproc_membind            : no
      membind->set_proc_membind                : no
      membind->get_proc_membind                : no
      membind->set_thisthread_membind          : yes
      membind->get_thisthread_membind          : yes
      membind->set_area_membind                : yes
      membind->get_area_membind                : yes
      membind->alloc_membind                   : yes
      membind->firsttouch_membind              : yes
      membind->bind_membind                    : yes
      membind->interleave_membind              : yes
      membind->replicate_membind               : no
      membind->nexttouch_membind               : no
      membind->migrate_membind                 : yes
    INFO (hwloc): Hardware objects in this node:
    Machine L#0: (P#0, total=67108480KB, Backend=Linux, LinuxCgroup=/3012719, OSName=Linux, OSRelease=2.6.32.59-0.7.1_1.0402.7496-cray_gem_c, OSVersion="#1 SMP Wed Aug 7 03:55:25 UTC 2013", HostName=nid07793, Architecture=x86_64)
      Socket L#0: (P#0, total=33554048KB, CPUModel="AMD Opteron(TM) Processor 6276                 ")
        NUMANode L#0: (P#0, local=16776832KB, total=16776832KB)
          L3Cache L#0: (P#-1, size=6144KB, linesize=64, ways=64)
            L2Cache L#0: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#0: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#0: (P#0)
                  PU L#0: (P#0)
            L2Cache L#1: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#1: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#1: (P#1)
                  PU L#1: (P#1)
            L2Cache L#2: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#2: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#2: (P#2)
                  PU L#2: (P#2)
            L2Cache L#3: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#3: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#3: (P#3)
                  PU L#3: (P#3)
            L2Cache L#4: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#4: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#4: (P#4)
                  PU L#4: (P#4)
            L2Cache L#5: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#5: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#5: (P#5)
                  PU L#5: (P#5)
            L2Cache L#6: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#6: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#6: (P#6)
                  PU L#6: (P#6)
            L2Cache L#7: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#7: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#7: (P#7)
                  PU L#7: (P#7)
        NUMANode L#1: (P#1, local=16777216KB, total=16777216KB)
          L3Cache L#1: (P#-1, size=6144KB, linesize=64, ways=64)
            L2Cache L#8: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#8: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#8: (P#0)
                  PU L#8: (P#8)
            L2Cache L#9: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#9: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#9: (P#1)
                  PU L#9: (P#9)
            L2Cache L#10: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#10: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#10: (P#2)
                  PU L#10: (P#10)
            L2Cache L#11: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#11: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#11: (P#3)
                  PU L#11: (P#11)
            L2Cache L#12: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#12: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#12: (P#4)
                  PU L#12: (P#12)
            L2Cache L#13: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#13: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#13: (P#5)
                  PU L#13: (P#13)
            L2Cache L#14: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#14: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#14: (P#6)
                  PU L#14: (P#14)
            L2Cache L#15: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#15: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#15: (P#7)
                  PU L#15: (P#15)
      Socket L#1: (P#1, total=33554432KB, CPUModel="AMD Opteron(TM) Processor 6276                 ")
        NUMANode L#2: (P#2, local=16777216KB, total=16777216KB)
          L3Cache L#2: (P#-1, size=6144KB, linesize=64, ways=64)
            L2Cache L#16: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#16: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#16: (P#0)
                  PU L#16: (P#16)
            L2Cache L#17: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#17: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#17: (P#1)
                  PU L#17: (P#17)
            L2Cache L#18: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#18: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#18: (P#2)
                  PU L#18: (P#18)
            L2Cache L#19: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#19: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#19: (P#3)
                  PU L#19: (P#19)
            L2Cache L#20: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#20: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#20: (P#4)
                  PU L#20: (P#20)
            L2Cache L#21: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#21: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#21: (P#5)
                  PU L#21: (P#21)
            L2Cache L#22: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#22: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#22: (P#6)
                  PU L#22: (P#22)
            L2Cache L#23: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#23: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#23: (P#7)
                  PU L#23: (P#23)
        NUMANode L#3: (P#3, local=16777216KB, total=16777216KB)
          L3Cache L#3: (P#-1, size=6144KB, linesize=64, ways=64)
            L2Cache L#24: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#24: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#24: (P#0)
                  PU L#24: (P#24)
            L2Cache L#25: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#25: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#25: (P#1)
                  PU L#25: (P#25)
            L2Cache L#26: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#26: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#26: (P#2)
                  PU L#26: (P#26)
            L2Cache L#27: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#27: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#27: (P#3)
                  PU L#27: (P#27)
            L2Cache L#28: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#28: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#28: (P#4)
                  PU L#28: (P#28)
            L2Cache L#29: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#29: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#29: (P#5)
                  PU L#29: (P#29)
            L2Cache L#30: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#30: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#30: (P#6)
                  PU L#30: (P#30)
            L2Cache L#31: (P#-1, size=2048KB, linesize=64, ways=16)
              L1dCache L#31: (P#-1, size=16KB, linesize=64, ways=4)
                Core L#31: (P#7)
                  PU L#31: (P#31)
    INFO (hwloc): Thread CPU bindings:
      MPI process 0 on host 0 (process 0 of 4 on this host)
        OpenMP thread 0: PU set L#{0} P#{0}
        OpenMP thread 1: PU set L#{0} P#{0}
        OpenMP thread 2: PU set L#{0} P#{0}
        OpenMP thread 3: PU set L#{0} P#{0}
        OpenMP thread 4: PU set L#{0} P#{0}
        OpenMP thread 5: PU set L#{0} P#{0}
        OpenMP thread 6: PU set L#{0} P#{0}
        OpenMP thread 7: PU set L#{0} P#{0}
      MPI process 1 on host 0 (process 1 of 4 on this host)
        OpenMP thread 0: PU set L#{1} P#{1}
        OpenMP thread 1: PU set L#{1} P#{1}
        OpenMP thread 2: PU set L#{1} P#{1}
        OpenMP thread 3: PU set L#{1} P#{1}
        OpenMP thread 4: PU set L#{1} P#{1}
        OpenMP thread 5: PU set L#{1} P#{1}
        OpenMP thread 6: PU set L#{1} P#{1}
        OpenMP thread 7: PU set L#{1} P#{1}
      MPI process 2 on host 0 (process 2 of 4 on this host)
        OpenMP thread 0: PU set L#{2} P#{2}
        OpenMP thread 1: PU set L#{2} P#{2}
        OpenMP thread 2: PU set L#{2} P#{2}
        OpenMP thread 3: PU set L#{2} P#{2}
        OpenMP thread 4: PU set L#{2} P#{2}
        OpenMP thread 5: PU set L#{2} P#{2}
        OpenMP thread 6: PU set L#{2} P#{2}
        OpenMP thread 7: PU set L#{2} P#{2}
      MPI process 3 on host 0 (process 3 of 4 on this host)
        OpenMP thread 0: PU set L#{3} P#{3}
        OpenMP thread 1: PU set L#{3} P#{3}
        OpenMP thread 2: PU set L#{3} P#{3}
        OpenMP thread 3: PU set L#{3} P#{3}
        OpenMP thread 4: PU set L#{3} P#{3}
        OpenMP thread 5: PU set L#{3} P#{3}
        OpenMP thread 6: PU set L#{3} P#{3}
        OpenMP thread 7: PU set L#{3} P#{3}
    INFO (hwloc): Setting thread CPU bindings:
    INFO (hwloc): Thread CPU bindings:
      MPI process 0 on host 0 (process 0 of 4 on this host)
        OpenMP thread 0: PU set L#{0} P#{0}
        OpenMP thread 1: PU set L#{1} P#{1}
        OpenMP thread 2: PU set L#{2} P#{2}
        OpenMP thread 3: PU set L#{3} P#{3}
        OpenMP thread 4: PU set L#{4} P#{4}
        OpenMP thread 5: PU set L#{5} P#{5}
        OpenMP thread 6: PU set L#{6} P#{6}
        OpenMP thread 7: PU set L#{7} P#{7}
      MPI process 1 on host 0 (process 1 of 4 on this host)
        OpenMP thread 0: PU set L#{8} P#{8}
        OpenMP thread 1: PU set L#{9} P#{9}
        OpenMP thread 2: PU set L#{10} P#{10}
        OpenMP thread 3: PU set L#{11} P#{11}
        OpenMP thread 4: PU set L#{12} P#{12}
        OpenMP thread 5: PU set L#{13} P#{13}
        OpenMP thread 6: PU set L#{14} P#{14}
        OpenMP thread 7: PU set L#{15} P#{15}
      MPI process 2 on host 0 (process 2 of 4 on this host)
        OpenMP thread 0: PU set L#{16} P#{16}
        OpenMP thread 1: PU set L#{17} P#{17}
        OpenMP thread 2: PU set L#{18} P#{18}
        OpenMP thread 3: PU set L#{19} P#{19}
        OpenMP thread 4: PU set L#{20} P#{20}
        OpenMP thread 5: PU set L#{21} P#{21}
        OpenMP thread 6: PU set L#{22} P#{22}
        OpenMP thread 7: PU set L#{23} P#{23}
      MPI process 3 on host 0 (process 3 of 4 on this host)
        OpenMP thread 0: PU set L#{24} P#{24}
        OpenMP thread 1: PU set L#{25} P#{25}
        OpenMP thread 2: PU set L#{26} P#{26}
        OpenMP thread 3: PU set L#{27} P#{27}
        OpenMP thread 4: PU set L#{28} P#{28}
        OpenMP thread 5: PU set L#{29} P#{29}
        OpenMP thread 6: PU set L#{30} P#{30}
        OpenMP thread 7: PU set L#{31} P#{31}
    INFO (hwloc): Extracting CPU/cache/memory properties:
      There are 1 PUs per core (aka hardware SMT threads)
      There are 1 threads per core (aka SMT threads used)
      Cache (unknown name) has type "data" depth 1
        size 16384 linesize 64 associativity 4 stride 4096, for 1 PUs
      Cache (unknown name) has type "unified" depth 2
        size 2097152 linesize 64 associativity 16 stride 131072, for 1 PUs
      Cache (unknown name) has type "unified" depth 3
        size 6291456 linesize 64 associativity 64 stride 98304, for 8 PUs
      Memory has type "local" depth 2
        size 17179475968 pagesize 4096, for 8 PUs
      Memory has type "global" depth 2
        size 68717903872 pagesize 4096, for 32 PUs
    

    I attach stdout for both runs. The new runscript's output is in qc0-new.out and the unmodified one in qc0-vanilla.out.
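The two "Thread CPU bindings" blocks above differ in one important way: before hwloc sets the bindings, all eight threads of a process share a single PU, and afterwards each thread has its own. A small script can check this automatically in a saved stdout file. The following is only a sketch: the regular expressions are written against the output format quoted above, the function names are hypothetical, and PU sets are compared as plain strings rather than being expanded.

```python
import re
from collections import defaultdict

PROC_RE = re.compile(r"MPI process (\d+) on host (\d+)")
THREAD_RE = re.compile(
    r"OpenMP thread (\d+): PU set L#\{([\d,\-]+)\} P#\{([\d,\-]+)\}")

def parse_bindings(lines):
    """Return {mpi_process: {thread: physical PU set as a string}}."""
    bindings = defaultdict(dict)
    proc = None
    for line in lines:
        m = PROC_RE.search(line)
        if m:
            proc = int(m.group(1))
            continue
        m = THREAD_RE.search(line)
        if m and proc is not None:
            bindings[proc][int(m.group(1))] = m.group(3)
    return dict(bindings)

def oversubscribed(bindings):
    """List processes in which two or more threads share the same PU set."""
    return sorted(p for p, threads in bindings.items()
                  if len(set(threads.values())) < len(threads))

# Shortened excerpts in the format shown above (two threads per process).
before = [
    "MPI process 0 on host 0 (process 0 of 4 on this host)",
    "  OpenMP thread 0: PU set L#{0} P#{0}",
    "  OpenMP thread 1: PU set L#{0} P#{0}",
]
after = [
    "MPI process 0 on host 0 (process 0 of 4 on this host)",
    "  OpenMP thread 0: PU set L#{0} P#{0}",
    "  OpenMP thread 1: PU set L#{1} P#{1}",
]
print(oversubscribed(parse_bindings(before)))
print(oversubscribed(parse_bindings(after)))
```

Run on the first block of each stdout file, a check like this would flag every process whose threads are stacked on one PU before hwloc rebinds them.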

  3. Erik Schnetter
    • changed status to open
    • removed comment

    If both runs used hwloc, then it should not matter whether other software chose some other settings beforehand. If you look at the output following the line "Setting thread CPU bindings" in each run, you'll see that both runs used the same bindings.

    Actually, I just realized that some low-level systems may initialize themselves before thorn hwloc is called. In this case, it is likely that MPI would allocate some communication data structures, and without these new options those would likely all live on NUMA node 0. This is bad.

    Please apply this patch.

    Please leave this bug report open (or open another one) until we have checked all other mainstream systems.
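The NUMA concern can be made concrete with the numbers from the hwloc output in the previous comment. The sketch below maps each MPI process's PU to its NUMA node, before and after hwloc rebinds the threads. It assumes 8 consecutively numbered PUs per NUMA node, as the topology listing shows for this machine, and it assumes that memory allocated early is placed on the node where the allocating thread runs (a first-touch policy); the names used are illustrative only.

```python
PUS_PER_NUMA_NODE = 8  # from the hwloc listing: 4 NUMA nodes with 8 PUs each

def numa_node_of(pu):
    """NUMA node index of a physical PU, assuming consecutive numbering."""
    return pu // PUS_PER_NUMA_NODE

# PU of thread 0 for each MPI process on host 0, read off the hwloc output.
initial_pu = {0: 0, 1: 1, 2: 2, 3: 3}    # before "Setting thread CPU bindings"
rebound_pu = {0: 0, 1: 8, 2: 16, 3: 24}  # after hwloc has set the bindings

initial_nodes = {p: numa_node_of(pu) for p, pu in initial_pu.items()}
rebound_nodes = {p: numa_node_of(pu) for p, pu in rebound_pu.items()}

print("initial:", initial_nodes)
print("rebound:", rebound_nodes)
```

In the unmodified run all four processes start on NUMA node 0, so anything they allocate and touch before thorn hwloc runs would plausibly end up there, even though three of the processes later move to other nodes.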
