Opened 4 months ago

Last modified 3 months ago

#2200 new defect

Some tests may be non-deterministic

Reported by: Roland Haas
Owned by:
Priority: major
Milestone:
Component: Cactus
Version: development version
Keywords:
Cc:

Description

For the ET_2018_09 release, the test runs from Sep 6th and Sep 22nd report a different number of failed tests on these machines:

  • comet__1_24.log
  • cori__1_16.log
  • osx-homebrew__2_2.log
  • osx-macports__2_2.log

Given that there should have been no code change between those two dates, this seems suspicious.

The particular tests that changed are:

+++ b/results/comet__1_24.log
@@ -1434,7 +1434,7 @@
       AHFinderDirect: misner1.2-025
          Success: 55 files compared, 9 differ in the last digits
       AHFinderDirect: recoverML-EE
-         Success: 5 files compared, 5 differ in the last digits
+         Failure: 2 files missing, 3 files compared, 3 differ, 3 differ significantly
       Carpet: 64k2
          Success: 0 files identical
       Carpet: test_restrict_sync
+++ b/results/cori__1_16.log
@@ -1446,7 +1446,7 @@
       CarpetIOHDF5: CarpetWaveToyNewRecover_test_1proc
          Success: 12 files compared, 7 differ in the last digits
       CarpetIOHDF5: CarpetWaveToyRecover_test_1proc
-         Failure: 12 files missing, 0 files compared, 0 differ
+         Success: 12 files compared, 7 differ in the last digits
       CarpetIOHDF5: CarpetWaveToyRecover_test_newcp_1proc
          Success: 12 files compared, 7 differ in the last digits
       CarpetIOHDF5: newsep
+++ b/results/osx-homebrew__2_2.log
@@ -1631,7 +1631,7 @@
       PeriodicCarpet: testperiodicinterp
          Success: 1 files identical
       QuasiLocalMeasures: qlm-bl
-         Failure: 54 files missing, 115 files compared, 115 differ
+         Success: 169 files compared, 143 differ in the last digits
       QuasiLocalMeasures: qlm-ks
          Success: 169 files compared, 138 differ in the last digits
       QuasiLocalMeasures: qlm-ks-EE
+++ b/results/osx-macports__2_2.log
@@ -1631,7 +1631,7 @@
       PeriodicCarpet: testperiodicinterp
          Success: 1 files identical
       QuasiLocalMeasures: qlm-bl
-         Failure: 54 files missing, 115 files compared, 115 differ
+         Success: 169 files compared, 144 differ in the last digits
       QuasiLocalMeasures: qlm-ks
          Success: 169 files compared, 136 differ in the last digits
       QuasiLocalMeasures: qlm-ks-EE
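
A comparison like the one above can be automated. The following is a minimal sketch, assuming the summary logs use the layout seen in the excerpts (a `Thorn: testname` line followed by a `Success:` or `Failure:` line). The function names `parse_summary` and `changed_status` are made up for illustration and are not part of the Cactus test suite.

```python
import re

# Assumed layout, taken from the excerpts in this ticket:
#   "      Thorn: testname"
#   "         Success: ..."   or   "         Failure: ..."
TEST_RE = re.compile(r"^\s+(\w+): (\S+)\s*$")
STATUS_RE = re.compile(r"^\s+(Success|Failure): (.*)$")


def parse_summary(text):
    """Map (thorn, test) -> (status, detail) for every test in a summary log."""
    results = {}
    current = None
    for line in text.splitlines():
        status = STATUS_RE.match(line)
        if status:
            # A status line belongs to the most recent test header, if any.
            if current is not None:
                results[current] = (status.group(1), status.group(2).strip())
                current = None
            continue
        test = TEST_RE.match(line)
        if test:
            current = (test.group(1), test.group(2))
    return results


def changed_status(old_text, new_text):
    """Return tests present in both logs whose Success/Failure status differs."""
    old, new = parse_summary(old_text), parse_summary(new_text)
    return {
        key: (old[key][0], new[key][0])
        for key in sorted(old.keys() & new.keys())
        if old[key][0] != new[key][0]
    }
```

Calling `changed_status(open(a).read(), open(b).read())` on the logs from two dates lists only the tests that flipped between success and failure, ignoring changes in the number of files that differ in the last digits.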

Attachments (1)

Comet-AHFinderDirect-recoverML-EE.tar.gz (74.0 KB) - added by Roland Haas 3 months ago.
data for failed test on comet

Change History (8)

comment:1 Changed 4 months ago by Roland Haas

Component: Other → Cactus
Priority: unset → major

comment:2 Changed 4 months ago by Steven R. Brandt

I'm confused. I thought all these tests worked on master right before the release?

comment:3 Changed 4 months ago by Roland Haas

Correct. It may be that there is actual variation from run to run (e.g. due to a race condition), or it could be that there was an OS update in between (this happened on golub [not listed], which means that one cannot even compile there anymore).

I have not looked at the failures yet since I am swamped with other issues.

Last edited 4 months ago by Roland Haas

comment:4 Changed 3 months ago by Roland Haas

On comet the log file for the failed test contains:

INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 0
HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 0:
  #000: H5Dio.c line 173 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: H5Dio.c line 550 in H5D__read(): can't read data
    major: Dataset
    minor: Read failed
  #002: H5Dchunk.c line 1872 in H5D__chunk_read(): unable to read raw data chunk
    major: Low-level I/O
    minor: Read failed
  #003: H5Dchunk.c line 2902 in H5D__chunk_lock(): data pipeline read failed
    major: Data filters
    minor: Filter operation failed
  #004: H5Z.c line 1382 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
  #005: H5Zdeflate.c line 136 in H5Z_filter_deflate(): memory allocation failed for deflate uncompression
    major: Resource unavailable
    minor: No space available for allocation
WARNING[L1,P0] (CarpetIOHDF5): HDF5 call 'H5Dread(dataset, datatype, memspace, filespace, xfer, cctkGH->data[patch->vindex][timelevel])' returned error code -1

indicating an issue with the HDF5 library. There certainly should be no issue with running out of memory since the total dataset size in the file in question (checkpointML-EE/checkpoint.chkpt.it_1.h5) is quite small:

h5ls -v checkpointML-EE/checkpoint.chkpt.it_1.h5 | gawk '/logical bytes/{sum += $2} END{print sum/1e6}'
19.9976

That is, the total is only about 20 MB. It seems more likely that there is (again) a bug in HDF5's gzip code (see #1878).
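
The same size check can be sketched in Python for use in a script. This assumes, as the gawk command above does, that `h5ls -v` prints one line per dataset containing the words "logical bytes" with the byte count as the second whitespace-separated field. The function names are illustrative only.

```python
import subprocess


def total_logical_megabytes(h5ls_output):
    """Sum the 'logical bytes' figures in `h5ls -v` output, in units of 1e6 bytes."""
    total = 0
    for line in h5ls_output.splitlines():
        if "logical bytes" in line:
            fields = line.split()
            # Same field as $2 in the gawk one-liner; assumed to be an integer.
            total += int(fields[1])
    return total / 1e6


def checkpoint_size_megabytes(path):
    """Run `h5ls -v` on a file and return its total logical dataset size."""
    proc = subprocess.run(
        ["h5ls", "-v", path],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return total_logical_megabytes(proc.stdout)
```

Keeping the parsing separate from the `h5ls` call makes it possible to check the arithmetic on saved output without needing the HDF5 tools installed.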

We have had issues with Comet's file system in the past (#2073) when writing files and immediately reading them back, which is more or less what the test suite does with its data, though this failure does not seem related.

comment:5 Changed 3 months ago by Roland Haas

On cori the error during the "vanilla" run was:

+ srun -n 1 -c 32 /global/cscratch1/sd/rhaas/simulations/testsuite-cori-ET_vanilla-sim-procs000001/SIMFACTORY/exe/cactus_sim -L 3 /global/cscratch1/sd/rhaas/simulations/testsuite-cori-ET_vanilla-sim-procs000001/output-0000/arrangements/Carpet/CarpetIOHDF5/test/CarpetWaveToyRecover_test_1proc.par
srun: error: task 0 launch failed: Error configuring interconnect

which looks like a cluster error to me.

comment:6 Changed 3 months ago by Steven R. Brandt

I've seen that kind of output from hdf5 when mpi is misconfigured, i.e. the mpirun doesn't match the mpicxx.

comment:7 in reply to: 6 Changed 3 months ago by Roland Haas

Replying to Steven R. Brandt:

> I've seen that kind of output from hdf5 when mpi is misconfigured, i.e. the mpirun doesn't match the mpicxx.

Ok, so I should check if the MPI stack changed in between the tests.
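
One way to make such a check repeatable is to record the tool versions alongside each test run and compare the records afterwards. The sketch below is only an illustration: the helper names are hypothetical, and the commands in the usage note (`mpirun --version`, `mpicxx --version`) are an assumption about what is available on a given machine.

```python
import subprocess


def capture(commands):
    """Run each command and record its combined stdout/stderr under a label."""
    snapshot = {}
    for label, argv in commands.items():
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
            )
            snapshot[label] = proc.stdout.strip()
        except OSError as exc:
            # The command is missing or cannot be executed on this machine.
            snapshot[label] = "unavailable: {}".format(exc)
    return snapshot


def changed_entries(before, after):
    """Return label -> (old, new) for every entry that differs between snapshots."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }
```

For example, `capture({"mpirun": ["mpirun", "--version"], "mpicxx": ["mpicxx", "--version"]})` could be stored as JSON next to each test log, and `changed_entries` applied to the snapshots from the two dates would show whether the MPI stack changed in between.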
