Running LoopControl on a strange number of threads fails

Issue #1326 new
Roland Haas created an issue

My machine has 8 cores (according to /proc/cpuinfo). Running, e.g., the Trigger test with 3 threads fails inside LoopControl.

To reproduce:

export OMP_NUM_THREADS=3
mpirun -n 2  exe/cactus_bns_all arrangements/AEIThorns/Trigger/test/trigger.par

Keyword: LoopControl

Comments (5)

  1. Roland Haas reporter

    This still happens even with current (Sun Mar 22 18:46:21 CET 2015) trunk, though the failure looks a bit different now:

    cactus_sim: /data/rhaas/postdoc/gr/ET_trunk/configs/sim/build/LoopControl/loopcontrol.cc:312: T {anonymous}::divexact(T, T) [with T = int]: Assertion `i % j == 0' failed
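
The assertion above names a helper, divexact(T, T), and the condition `i % j == 0`. The following is a minimal sketch of what such a helper presumably looks like, reconstructed only from the error message; it is not the actual LoopControl source, and the companion function divides_exactly is a hypothetical name added for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not part of LoopControl): true when j is non-zero
// and divides i without remainder.
template <typename T> bool divides_exactly(T i, T j) {
  return j != 0 && i % j == 0;
}

// Sketch of an exact integer division. It aborts via assert when i is not
// a multiple of j, which is the condition quoted in the error message.
template <typename T> T divexact(T i, T j) {
  assert(i % j == 0);
  return i / j;
}
```

Which quantities LoopControl actually passes to this helper is not shown in the report. As pure illustration, divides_exactly(8, 3) is false, so a call such as divexact(8, 3) would trip the assertion, whereas divexact(8, 2) would not.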
    
  2. Steven R. Brandt

    I recently attempted to run the gallery example on Deep Bayou. I set OMP_NUM_THREADS=4 and asked for 12 procs to run on the node (which has 48 cores according to lscpu). Cactus failed with this message:

    cactus_sim: /nvme/sbrandt/Cactus/arrangements/Carpet/LoopControl/src/loopcontrol.cc:264: T <unnamed>::divexact(T, T) [with T = int]: Assertion `i % j == 0' failed

    It also said "INFO (Carpet): This process runs on 24 cores." At Roland’s suggestion, I tried adding LoopControl::use_smt_threads = "no" to the par file and all was well. Maybe “no” is a more sensible default? The lscpu output for the node:

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                48
    On-line CPU(s) list:   0-47
    Thread(s) per core:    1
    Core(s) per socket:    24
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    

  3. Erik Schnetter

    Steve

    You seem to expect to use 4 cores per process, and Carpet says there are 24 cores per process. It might be that your job startup isn’t working right. I would debug this first, before looking at LoopControl.

  4. Steven R. Brandt

    I set and exported OMP_NUM_THREADS=4 and manually asked MPI to run with 12 procs. Is that considered wrong/buggy? I suppose I could also have set CACTUS_NUM_THREADS and CACTUS_NUM_PROCS. Regardless, I thought using SMT threads was generally expected to be unhelpful and was to be avoided by default?

  5. Erik Schnetter

    Setting these variables is fine. Setting the other variables (CACTUS_...) only allows Cactus to check whether things actually worked out as intended; they are only used for checking.

    If you set OMP_NUM_THREADS=4 and expect there to be one thread per core, and Cactus later thinks it’s running on 12 cores, then something went wrong. Did you look at the output of omp_get_max_threads()? Did your environment variable actually make it to Cactus? Did you actually start one job with 12 processes, or accidentally 12 individual processes that know nothing about each other? Did the queuing system get confused and set up a cgroup with a different number of cores? Lots of things can go wrong.
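
One way to check the first two questions is to have each process report what it actually sees. The sketch below is a standalone diagnostic, not Cactus code; the function names parse_thread_count, runtime_max_threads, and describe_threads are hypothetical. It uses only std::getenv-style input and the standard OpenMP call omp_get_max_threads(), and falls back to 1 thread when compiled without OpenMP.

```cpp
#include <climits>
#include <cstdlib>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Parse a positive thread count from an environment-variable value.
// Returns -1 when the value is missing, empty, or not a plain integer.
int parse_thread_count(const char *value) {
  if (value == nullptr || *value == '\0') return -1;
  char *end = nullptr;
  const long n = std::strtol(value, &end, 10);
  if (*end != '\0' || n <= 0 || n > INT_MAX) return -1;
  return static_cast<int>(n);
}

// Threads the OpenMP runtime would use; 1 when built without OpenMP.
int runtime_max_threads() {
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}

// One-line summary of the requested versus the available thread count.
std::string describe_threads(const char *env_value) {
  const int requested = parse_thread_count(env_value);
  std::string msg = "OMP_NUM_THREADS=";
  msg += requested > 0 ? std::to_string(requested)
                       : std::string("<unset or invalid>");
  msg += ", omp_get_max_threads()=" + std::to_string(runtime_max_threads());
  return msg;
}
```

Printing describe_threads(std::getenv("OMP_NUM_THREADS")) from every MPI rank would show whether the exported value reached each process and whether the OpenMP runtime agrees with it.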

    The error message might come from LoopControl trying to determine how to split 4 threads across 12 cores. The resulting 1/3 threads per core might have caused the problem (although it shouldn’t).
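
The arithmetic behind this guess can be sketched as follows. This only illustrates the hypothesis in the comment above, not how LoopControl computes its thread layout; Ratio, gcd_int, threads_per_core, and is_whole are hypothetical names, and both arguments are assumed to be positive.

```cpp
// Threads per core as a reduced fraction num/den.
struct Ratio {
  int num;
  int den;
};

// Greatest common divisor by Euclid's algorithm (inputs assumed positive).
int gcd_int(int a, int b) {
  while (b != 0) {
    const int t = a % b;
    a = b;
    b = t;
  }
  return a;
}

// Reduce threads/cores to lowest terms.
Ratio threads_per_core(int threads, int cores) {
  const int g = gcd_int(threads, cores);
  return Ratio{threads / g, cores / g};
}

// True when the ratio is a whole number of threads per core.
bool is_whole(Ratio r) { return r.den == 1; }
```

With 4 threads and 12 cores the reduced ratio is 1/3, so is_whole returns false; any code that assumes an integral number of threads per core and divides exactly would fail on such input.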
