
Opened 20 months ago

Last modified 5 weeks ago

#2008 assigned defect

NaNs when running static tov on >40 cores

Reported by: allgwy001@… Owned by: Frank Löffler
Priority: major Milestone:
Component: Carpet Version: ET_2016_05
Keywords: Cc:

Description

I've been trying to run the static tov example parameter file on an HPC cluster using >40 cores, but this results in NaNs in the data. I can't remember whether the issue first appears at 40 or 41 cores (and won't be able to check this for the next few days), but using 41+ cores definitely gives me NaNs. I remember testing with 39 cores and several other lower values (down to 4), but these runs all seemed fine.

So far, I've been able to run larger simulations (e.g. BBHs) on more than 40 cores (same cluster) without any apparent issues.

I'll attach the static tov parameter file I used (I think I changed one or two outdated parameters), as well as the PBS script and error/output files from a static tov run on 44 cores.

The static tov runs were intended for speed test purposes. The results of the speed tests I've run so far (on fewer than 40 cores) seem quite strange to me, so I'm going to attach a text file with walltimes and CPU times for these runs, too. Any comments would be appreciated!

Attachments (6)

static_tov_mod_2.par (5.8 KB) - added by allgwy001@… 20 months ago.
parameter file
static_tov_mod_2.o1440381 (140.6 KB) - added by allgwy001@… 20 months ago.
output file (I stopped the run after noticing that all wasn't well)
static_tov_mod_2_truncated.e1440381 (113.4 KB) - added by allgwy001@… 20 months ago.
error file (truncated because the full one is about 84 MB)
static_tov_mod_2.sh (404 bytes) - added by allgwy001@… 20 months ago.
pbs script
speed_tests.txt (349 bytes) - added by allgwy001@… 20 months ago.
results of speed tests for the static tov (currently doing more runs to check consistency)
traffic.png (17.8 KB) - added by allgw001@… 19 months ago.


Change History (18)

Changed 20 months ago by allgwy001@…

Attachment: static_tov_mod_2.par added

parameter file

Changed 20 months ago by allgwy001@…

Attachment: static_tov_mod_2.o1440381 added

output file (I stopped the run after noticing that all wasn't well)

Changed 20 months ago by allgwy001@…

Attachment: static_tov_mod_2_truncated.e1440381 added

error file (truncated because the full one is about 84 MB)

Changed 20 months ago by allgwy001@…

Attachment: static_tov_mod_2.sh added

pbs script

Changed 20 months ago by allgwy001@…

Attachment: speed_tests.txt added

results of speed tests for the static tov (currently doing more runs to check consistency)

comment:1 Changed 19 months ago by Frank Löffler

Owner: set to Frank Löffler
Status: new → assigned

comment:2 Changed 19 months ago by allgwy001@…

Here are some comments I received from the person who compiled the Einstein Toolkit on the cluster I'm using. He ran tests with the static tov star parameter file I attached, and sent me the following:

"The times follow a standard MPI reduction pattern. Out beyond 24 cores the code/data does not scale well and latency from message passing starts to increase rather than reduce run times. Some increases in ppn values increase the run times; this may be due to how data objects are handled in the code. There is also some fluctuation in the data runs which can only be caused by the software as I ran all tests on node 607 while no one else was using it."

He suggested that if I want to do runs on more than 20 cores (which I do), then I should ask for advice from people more familiar with the Einstein Toolkit.

comment:3 Changed 19 months ago by Ian Hinder

This is a small example, and I don't think the results you are seeing are surprising. Think of the number of points in the grid on each refinement level, then divide that by the number of cores. Each core will have to evolve that number of points. If you are using standard MPI without OpenMP, this block of points will be surrounded by a shell of ghost points which need to be synchronised with the other processes. As the size of the component on each process decreases, the ratio between the number of ghost points which need to be communicated and the number of real points which need to be evolved will increase. Eventually, you will have so few points evolved on each process that the communication cost dominates (and is not expected to scale well with the number of cores). I suggest that you calculate how many points are evolved on each process in your tests.

One way to improve scalability may be to use OpenMP parallelisation in addition to MPI. Are you doing this already?

Note that the scaling of this example is probably a separate issue to the problem of it crashing on >40 cores.

Also, just because this specific example doesn't scale effectively beyond 20 cores, this doesn't say anything about other examples, or cases that you want to run for science.

comment:4 Changed 19 months ago by Ian Hinder

I can confirm that this is a real problem. I have run your parameter file using the current master branch of the ET on 44 processes, and I get NaNs at iteration 736. This is earlier than you saw them, suggesting some sort of non-deterministic effect. The command I used to run this was

sim create-submit static_tov_mod_2 --parfile par/static_tov_mod_2.par --procs 44 --num-threads 1

You can see the number of processes being used in the output file:

$ grep "Carpet is running on" simulations/static_tov_mod_2/output-0000/static_tov_mod_2.out 
INFO (Carpet): Carpet is running on 44 processes

When I ran on 40 processes, this didn't happen.

This seems to be a bug in Carpet.

If I instead run with Carpet::processor_topology = "recursive", which uses a different algorithm for splitting the domain among processes, the code instead aborts with an error:

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 44) >= this->size() (which is 44)

Backtrace is:

Backtrace from rank 0 pid 18049:
1. /usr/lib64/libc.so.6(+0x35670) [0x7f6cf5284670]
2. /usr/lib64/libc.so.6(gsignal+0x37) [0x7f6cf52845f7]
3. /usr/lib64/libc.so.6(abort+0x148) [0x7f6cf5285ce8]
4. __gnu_cxx::__verbose_terminate_handler()   [/cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d) [0x7f6cf5887d2d]]
5. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5dd86) [0x7f6cf5885d86]
6. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5ddd1) [0x7f6cf5885dd1]
7. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5dfe9) [0x7f6cf5885fe9]
8. std::__throw_out_of_range_fmt(char const*, ...)   [/cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(_ZSt24__throw_out_of_range_fmtPKcz+0x11f) [0x7f6cf58dbfef]]
9. std::vector<int, std::allocator<int> >::_M_range_check(unsigned long) const   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZNKSt6vectorIiSaIiEE14_M_range_checkEm+0x20) [0x1342e60]]
a. Carpet::SplitRegionsMaps_Recursively(_cGH const*, std::vector<std::vector<region_t, std::allocator<region_t> >, std::allocator<std::vector<region_t, std::allocator<region_t> > > >&, std::vector<std::vector<region_t, std::allocator<region_t> >, std::allocator<std::vector<region_t, std::allocator<region_t> > > >&)   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet28SplitRegionsMaps_RecursivelyEPK4_cGHRSt6vectorIS3_I8region_tSaIS4_EESaIS6_EES9_+0x11d7) [0x2f48f87]]
b. Carpet::SplitRegionsMaps(_cGH const*, std::vector<std::vector<region_t, std::allocator<region_t> >, std::allocator<std::vector<region_t, std::allocator<region_t> > > >&, std::vector<std::vector<region_t, std::allocator<region_t> >, std::allocator<std::vector<region_t, std::allocator<region_t> > > >&)   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet16SplitRegionsMapsEPK4_cGHRSt6vectorIS3_I8region_tSaIS4_EESaIS6_EES9_+0x11c) [0x2ef6e5c]]
c. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim() [0x2f05767]
d. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim() [0x2f03b1c]
e. Carpet::SetupGH(tFleshConfig*, int, _cGH*)   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet7SetupGHEP12tFleshConfigiP4_cGH+0x18e6) [0x2eff606]]
f. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(CCTKi_SetupGHExtensions+0xc4) [0xd20924]
10. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(CactusDefaultSetupGH+0x30e) [0xd3ad2e]
11. Carpet::Initialise(tFleshConfig*)   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet10InitialiseEP12tFleshConfig+0x54) [0x2ee2414]]
12. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(main+0x96) [0xcbbef6]
13. /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6cf5270b15]
14. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim() [0xcbbbe9]

"recursive" works on 40 processes. It's possible that there is no sensible way to split such a small domain among 44 processes, and that aborting with an error is the correct behaviour. If that is the case, then this should be caught at a higher level in the code and a more sensible error message should be generated.

comment:5 Changed 19 months ago by Ian Hinder

Component: Other → Carpet
Priority: unset → major

comment:6 in reply to:  3 ; Changed 19 months ago by allgwy001@…

Thank you very much for looking into it!

We can't seem to disable OpenMP in the compile. However, it's effectively disabled by means of OMP_NUM_THREADS=1. Since "OpenMP does not play nicely with other software, especially in the hybridized domain of combining OpenMPI and OpenMP, where multiple users share nodes", they are not willing to enable it for me.

Excluding boundary points and symmetry points, I find 31 evolved points in each spatial direction for the most refined region. That gives 29 791 points in three dimensions. For each of the other four regions I find 25 695 points; that's 132 571 in total. Does Carpet provide any output I can use to verify this?

My contact writes the following: "I fully understand the [comments about the scalability]. However we see a similar decrement with cctk_final_time=1000 [he initially tested with smaller times] and hence I would assume a larger solution space. Unless your problem is what is called embarrassingly parallel you will always be faced with a communication issue."

This is incorrect, right? My understanding is that the size of the solution space should remain the same regardless of cctk_final_time.


comment:7 in reply to:  6 ; Changed 19 months ago by Ian Hinder

Replying to allgwy001@…:

> Thank you very much for looking into it!
>
> We can't seem to disable OpenMP in the compile. However, it's effectively disabled by means of OMP_NUM_THREADS=1. Since "OpenMP does not play nicely with other software, especially in the hybridized domain of combining OpenMPI and OpenMP, where multiple users share nodes", they are not willing to enable it for me.

I have never run on a system where multiple users share nodes; it doesn't really fit very well with the sort of application that Cactus is. You don't want to be worrying about whether other processes are competing with you for memory, memory bandwidth, or cores. When you have exclusive access to each node, OpenMP is usually a good idea. By the way: what sort of interconnect do you have? Gigabit ethernet, or infiniband, or something else? If users are sharing nodes, then I suspect that this cluster is gigabit ethernet only, and you may be limited to small jobs, since the performance of gigabit ethernet will quickly become your bottleneck. What cluster are you using? From your email address, I'm guessing that it is one of the ones at http://hpc.uct.ac.za? If so, or you are using a similar scheduler, then you should be able to do this, as in their documentation:

NB3: If your software prefers to use all cores on a computer then make sure that you reserve these cores. For example running on an 800 series server which has 8 cores per server change the directive line in your script as follows:

#PBS -l nodes=1:ppn=8:series800

Once you are the exclusive user of a node, I don't see a problem with enabling OpenMP. Also note: OpenMP is not something that needs to be enabled by the system administrator; it is determined by your compilation flags (on by default in the ET) and activated with OMP_NUM_THREADS. Is it possible that there was a confusion, and the admins were talking about hyperthreading instead, which is a very different thing, and which I agree you probably don't want to have enabled (it would have to be enabled by the admins)?

> Excluding boundary points and symmetry points, I find 31 evolved points in each spatial direction for the most refined region. That gives 29 791 points in three dimensions. For each of the other four regions I find 25 695 points; that's 132 571 in total. Does Carpet provide any output I can use to verify this?

Carpet provides a lot of output :) You may get something useful by setting

CarpetLib::output_bboxes = "yes"

On the most refined region, making the approximation that the domain is divided into N identical cubical regions, then for N = 40, you would have 29791/40 ≈ 745 ≈ 9³, so about 9 evolved points in each direction. The volume of ghost plus evolved points would be (9+3+3)³ = 15³, so the number of ghost points is 15³ − 9³, and the ratio of ghost to evolved points is (15³ − 9³)/9³ = (15/9)³ − 1 ≈ 3.6. So you have 3.6 times as many points being communicated as you have being evolved. Especially if the interconnect is only gigabit ethernet, I'm not surprised that the scaling flattens off by this point. Note that if you use OpenMP, this ratio will be much smaller, because OpenMP threads communicate using shared memory, not ghost zones. Essentially you will have fewer processes, each with a larger cube, and multiple threads working on that cube in shared memory.

> My contact writes the following: "I fully understand the [comments about the scalability]. However we see a similar decrement with cctk_final_time=1000 [he initially tested with smaller times] and hence I would assume a larger solution space. Unless your problem is what is called embarrassingly parallel you will always be faced with a communication issue."
>
> This is incorrect, right? My understanding is that the size of the solution space should remain the same regardless of cctk_final_time.

Yes - it looks like your contact doesn't know that the code is iterative, and cctk_final_time simply counts the number of iterations. In order to test with a larger problem size, you would need to reduce CoordBase::dx, dy and dz, so that there is higher overall resolution, and hence more points. I would expect the scalability to improve with larger problem sizes.

Changed 19 months ago by allgw001@…

Attachment: traffic.png added

comment:8 in reply to:  7 Changed 19 months ago by allgwy001@…

Thanks Ian!

Yes, we're running on the HEX cluster at the UCT HPC facility. The admin agrees that node sharing isn't desirable, but UCT is in the middle of a financial crisis and we simply can't afford dedicated access to a single node right now.

We use 56G infiniband interconnect, but since the admin ran his tests on a single node, it doesn't really matter.

He tried running on two nodes at 19:30 this evening and found that the IB traffic (bottom right) was minimal.


He's also well aware of the difference between OpenMP and hyperthreading (which is disabled on all HPC nodes), as well as the OMP_NUM_THREADS environment variable.


comment:9 Changed 19 months ago by allgw001@…

Details about the build:

zypper in libnuma-devel
wget https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2016_11/einsteintoolkit.th

remove ExternalLibraries/OpenBLAS, ExternalLibraries/OpenCL, Carpet/ReductionTest and CactusTest/TestFortranCrayPointers
add Llama/WaveExtractCPM and EinsteinAnalysis/ADMConstraints

curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2016_11/GetComponents
chmod a+x GetComponents
./GetComponents --parallel einsteintoolkit.th
module load compilers/gcc530 mpi/openmpi-1.10.1

cd Cactus
cp simfactory/mdb/optionlists/centos.cfg ./sles.cfg
Add MPI_DIR=/opt/exp_soft/openmpi-1.10.1/

make ET-config options=sles.cfg THORNLIST=thornlists/einsteintoolkit.th
make -j 64 ET

where sles.cfg contains:

# SLES
#
# Whenever this version string changes, the application is configured
# and rebuilt from scratch
VERSION = 2016-05-06

CPP = cpp
FPP = cpp
CC  = gcc
CXX = g++
F77 = gfortran
F90 = gfortran

CPPFLAGS = -DMPICH_IGNORE_CXX_SEEK
FPPFLAGS = -traditional

CFLAGS   =  -g3 -std=gnu99 -rdynamic -Wunused-variable
CXXFLAGS =  -g3 -std=c++0x -rdynamic -Wunused-variable -fno-inline
F77FLAGS =  -g3 -fcray-pointer -ffixed-line-length-none
F90FLAGS =  -g3 -fcray-pointer -ffixed-line-length-none

LIBS = numa gfortran

LIBDIRS =

MPI_DIR=/opt/exp_soft/openmpi-1.10.1/

DEBUG           = no
CPP_DEBUG_FLAGS = -DCARPET_DEBUG
FPP_DEBUG_FLAGS = -DCARPET_DEBUG
C_DEBUG_FLAGS   = -O0
CXX_DEBUG_FLAGS = -O0
F77_DEBUG_FLAGS = -O0 -ffixed-line-length-none
F90_DEBUG_FLAGS = -O0 -ffree-line-length-none

My PBS script:

#PBS -q UCTlong
#PBS -N static_tov_mod_2
#PBS -m abe
#PBS -M allgwy001@myuct.ac.za
#PBS -l nodes=1:ppn=44:series600
#PBS -o /home/allgwy001/Output/static_tov_mod_2
#PBS -e /home/allgwy001/Output/static_tov_mod_2
module load software/EinsteinToolkit
cd /home/allgwy001/Output
export OMP_NUM_THREADS=1
mpirun -np 44 -hostfile $PBS_NODEFILE -v cactus_ET /home/allgwy001/Parameter_Files/static_tov_mod_2.par

I simply ran using

qsub <PBS script>

comment:10 Changed 19 months ago by Erik Schnetter

Gwyneth

These PBS options look as if you ran 44 processes on one single node. "nodes=1:ppn=44" means "use 1 node, run 44 processes on that node." Is that intended? If so, there would be no Infiniband traffic generated.

-erik
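If the aim was to generate Infiniband traffic for testing, a two-node request would be needed instead. A hypothetical variant of the attached script's resource line (the node series and per-node core count must match the actual hardware, so treat this only as a sketch):

```
#PBS -l nodes=2:ppn=22:series600
...
mpirun -np 44 -hostfile $PBS_NODEFILE cactus_ET <parfile>
```

With nodes=1, all 44 ranks communicate through shared memory on one host and the interconnect is never exercised.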

comment:11 in reply to:  10 Changed 19 months ago by allgwy001@…

Hi Erik,

Yes and sorry. To clarify, this is the script I used for which the NaNs appeared. The admin would've used a different script for looking at the traffic. If you like, I can ask for it.


comment:12 Changed 5 weeks ago by Roland Haas

This does not look to have quite the same characteristics as ticket:2175#comment:1, which was fixed in #2182, but given that both concern NaN/poison values appearing when running on many MPI ranks, it may be worthwhile to check again whether there is a reproducer for this issue.
