some tests may be non-deterministic

Issue #2200 new
Roland Haas created an issue

For the ET_2018_09 release it seems that the tests run on Sep 6th and Sep 22nd differ; the following logs show a different number of failed tests:

  • comet__1_24.log
  • cori__1_16.log
  • osx-homebrew__2_2.log
  • osx-macports__2_2.log

Given that there should not have been any code change between those dates, this seems suspicious.

The particular tests that changed are:

+++ b/results/comet__1_24.log
@@ -1434,7 +1434,7 @@
       AHFinderDirect: misner1.2-025
          Success: 55 files compared, 9 differ in the last digits
       AHFinderDirect: recoverML-EE
-         Success: 5 files compared, 5 differ in the last digits
+         Failure: 2 files missing, 3 files compared, 3 differ, 3 differ significantly
       Carpet: 64k2
          Success: 0 files identical
       Carpet: test_restrict_sync


+++ b/results/cori__1_16.log
@@ -1446,7 +1446,7 @@
       CarpetIOHDF5: CarpetWaveToyNewRecover_test_1proc
          Success: 12 files compared, 7 differ in the last digits
       CarpetIOHDF5: CarpetWaveToyRecover_test_1proc
-         Failure: 12 files missing, 0 files compared, 0 differ
+         Success: 12 files compared, 7 differ in the last digits
       CarpetIOHDF5: CarpetWaveToyRecover_test_newcp_1proc
          Success: 12 files compared, 7 differ in the last digits
       CarpetIOHDF5: newsep


+++ b/results/osx-homebrew__2_2.log
@@ -1631,7 +1631,7 @@
       PeriodicCarpet: testperiodicinterp
          Success: 1 files identical
       QuasiLocalMeasures: qlm-bl
-         Failure: 54 files missing, 115 files compared, 115 differ
+         Success: 169 files compared, 143 differ in the last digits
       QuasiLocalMeasures: qlm-ks
          Success: 169 files compared, 138 differ in the last digits
       QuasiLocalMeasures: qlm-ks-EE


+++ b/results/osx-macports__2_2.log
@@ -1631,7 +1631,7 @@
       PeriodicCarpet: testperiodicinterp
          Success: 1 files identical
       QuasiLocalMeasures: qlm-bl
-         Failure: 54 files missing, 115 files compared, 115 differ
+         Success: 169 files compared, 144 differ in the last digits
       QuasiLocalMeasures: qlm-ks
          Success: 169 files compared, 136 differ in the last digits
       QuasiLocalMeasures: qlm-ks-EE

Keyword: None

Comments (7)

  1. Steven R. Brandt

    I'm confused. I thought all these tests worked on master right before the release?

  2. Roland Haas reporter

    Correct. It may be that there is actual variation from run to run (e.g. due to a race condition), or there may have been an OS update in between (this happened on golub [not listed], where one can no longer even compile).

    I have not looked at the failures since I am swamped with other issues.

  3. Roland Haas reporter

    On comet the log file for the failed test contains:

    INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 0
    HDF5-DIAG: Error detected in HDF5 (1.8.14) MPI-process 0:
      #000: H5Dio.c line 173 in H5Dread(): can't read data
        major: Dataset
        minor: Read failed
      #001: H5Dio.c line 550 in H5D__read(): can't read data
        major: Dataset
        minor: Read failed
      #002: H5Dchunk.c line 1872 in H5D__chunk_read(): unable to read raw data chunk
        major: Low-level I/O
        minor: Read failed
      #003: H5Dchunk.c line 2902 in H5D__chunk_lock(): data pipeline read failed
        major: Data filters
        minor: Filter operation failed
      #004: H5Z.c line 1382 in H5Z_pipeline(): filter returned failure during read
        major: Data filters
        minor: Read failed
      #005: H5Zdeflate.c line 136 in H5Z_filter_deflate(): memory allocation failed for deflate uncompression
        major: Resource unavailable
        minor: No space available for allocation
    WARNING[L1,P0] (CarpetIOHDF5): HDF5 call 'H5Dread(dataset, datatype, memspace, filespace, xfer, cctkGH->data[patch->vindex][timelevel])' returned error code -1
    

    indicating an issue with the HDF5 library. There certainly should be no issue with running out of memory since the total dataset size in the file in question (checkpointML-EE/checkpoint.chkpt.it_1.h5) is quite small:

    h5ls -v checkpointML-EE/checkpoint.chkpt.it_1.h5 | gawk '/logical bytes/{sum += $2} END{print sum/1e6}'
    19.9976
    

    That is only about 20 MB, so it seems more likely that there is (again) a bug in HDF5's gzip code (see #1878).

    We have had issues with Comet's file system in the past (#2073), related to writing and then immediately reading files, which is more or less what happens with the testsuite data; however, this does not seem related here.
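
    To narrow this down, one could try to read every dataset of the checkpoint back with the plain HDF5 C API, outside of Cactus. Below is a minimal sketch (assuming HDF5 1.8.x, as in the log above; the file name is the checkpoint mentioned earlier, everything else is hypothetical):

    /* Hypothetical standalone check (not part of Cactus): try to read every
     * dataset in the checkpoint file with the plain HDF5 C API.
     * Build with, e.g.:  h5cc read_check.c -o read_check
     */
    #include <hdf5.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* called once per object in the file; reads the object if it is a dataset */
    static herr_t read_one(hid_t loc, const char *name, const H5O_info_t *info,
                           void *op_data)
    {
      (void)op_data;
      if (info->type != H5O_TYPE_DATASET)
        return 0;

      hid_t dset  = H5Dopen2(loc, name, H5P_DEFAULT);
      hid_t dtype = H5Dget_type(dset);
      hid_t space = H5Dget_space(dset);

      hssize_t npoints = H5Sget_simple_extent_npoints(space);
      size_t   elsize  = H5Tget_size(dtype);
      void    *buf     = malloc((size_t)npoints * elsize);

      /* this is the call that fails in the Cactus run above */
      herr_t status = H5Dread(dset, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
      printf("%-60s %s\n", name, status < 0 ? "READ FAILED" : "ok");

      free(buf);
      H5Sclose(space);
      H5Tclose(dtype);
      H5Dclose(dset);
      return 0;
    }

    int main(int argc, char **argv)
    {
      const char *fn =
          argc > 1 ? argv[1] : "checkpointML-EE/checkpoint.chkpt.it_1.h5";
      hid_t file = H5Fopen(fn, H5F_ACC_RDONLY, H5P_DEFAULT);
      if (file < 0) {
        fprintf(stderr, "cannot open %s\n", fn);
        return 1;
      }
      H5Ovisit(file, H5_INDEX_NAME, H5_ITER_NATIVE, read_one, NULL);
      H5Fclose(file);
      return 0;
    }

    If this standalone read also fails inside H5Z_filter_deflate, the problem would be in the HDF5 installation (or in the file on disk) rather than in CarpetIOHDF5.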

  4. Roland Haas reporter

    On cori the error during the "vanilla" run was:

    + srun -n 1 -c 32 /global/cscratch1/sd/rhaas/simulations/testsuite-cori-ET_vanilla-sim-procs000001/SIMFACTORY/exe/cactus_sim -L 3 /global/cscratch1/sd/rhaas/simulations/testsuite-cori-ET_vanilla-sim-procs000001/output-0000/arrangements/Carpet/CarpetIOHDF5/test/CarpetWaveToyRecover_test_1proc.par
    srun: error: task 0 launch failed: Error configuring interconnect
    

    which looks like a cluster error to me.

  5. Steven R. Brandt

    I've seen that kind of output from HDF5 when MPI is misconfigured, i.e. the mpirun doesn't match the mpicxx.

  6. Roland Haas reporter

    Replying to [comment:6 Steven R. Brandt]:

    I've seen that kind of output from HDF5 when MPI is misconfigured, i.e. the mpirun doesn't match the mpicxx.

    OK, so I should check whether the MPI stack changed between the two test runs.
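
    One quick sanity check along those lines: a minimal sketch (assuming MPI >= 3.0 for MPI_Get_library_version; not tied to any particular machine) that prints the MPI standard version the binary was compiled against and the version string of the library it actually runs with. Build it with the same MPI compiler wrapper used for Cactus and launch it with the same srun/mpirun command:

    /* Hypothetical sanity check: compare the MPI the binary was compiled
     * against with the MPI library it actually runs with.
     * Build with, e.g.:  mpicc mpi_check.c -o mpi_check
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);

      char lib[MPI_MAX_LIBRARY_VERSION_STRING];
      int len, rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_library_version(lib, &len);   /* identifies the runtime MPI library */

      if (rank == 0) {
        /* MPI_VERSION/MPI_SUBVERSION come from the mpi.h seen at compile time */
        printf("compiled against MPI standard %d.%d\n", MPI_VERSION, MPI_SUBVERSION);
        printf("running with: %s\n", lib);
      }

      MPI_Finalize();
      return 0;
    }

    If the reported library string differs between the Sep 6th and Sep 22nd module environments, that would point to an MPI stack update on the machine.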
