SphericalHarmonicRecon and SphericalHarmonicReconGen tests fail on Jenkins build machine

Issue #2096 closed
Ian Hinder created an issue

(Adapted from http://lists.einsteintoolkit.org/pipermail/users/2017-July/005671.html)

The following three tests fail on the Jenkins build machine:

SphericalHarmonicRecon.regression_test/2procs
SphericalHarmonicReconGen.SpEC-dat-test/2procs
SphericalHarmonicReconGen.SpEC-h5-test/2procs

See https://build-test.barrywardell.net/job/EinsteinToolkit/1271/testReport/.

These tests all pass on one process but fail on two processes, using the ubuntu.cfg optionlist. They all seem to pass on multiple processes on all other machines, including my laptop with gcc.

The first, SphericalHarmonicRecon.regression_test, fails like this:

WARNING level 0 from host 7ce14e5707a0 process 0
  while executing schedule bin NullEvol_Initial, routine NullEvolve::NullEvol_InitialSlice
  in thorn NullEvolve, file NullEvol_InitialSlice.F90:42:
  ->  Error

The second, SphericalHarmonicReconGen.SpEC-dat-test, fails like this:

NewsB_scri.L02Mm01.asc: substantial differences
      significant differences on 1 (out of 2) lines
      maximum absolute difference in column 1 is 963
      maximum absolute difference in column 2 is 0.000185770963653907
      maximum absolute difference in column 3 is 0.000142466608463344
      maximum relative difference in column 1 is 1
      maximum relative difference in column 2 is 1
      maximum relative difference in column 3 is 1
      ...

The third, SphericalHarmonicReconGen.SpEC-h5-test, fails like this:

NewsB_scri.L02Mm01.asc: substantial differences
      significant differences on 1 (out of 2) lines
      maximum absolute difference in column 1 is 963
      maximum absolute difference in column 2 is 0.000185770963653907
      maximum absolute difference in column 3 is 0.000142466608463344
      maximum relative difference in column 1 is 1
      maximum relative difference in column 2 is 1
      maximum relative difference in column 3 is 1
      ...

I suspect the second and third failures have the same cause. These tests don't seem to fail on any other machines.

See the discussion thread http://lists.einsteintoolkit.org/pipermail/users/2017-July/thread.html#5671 for more information.

Comments (22)

  1. Roland Haas
    • removed comment

    In today's ET call Yosef said: Could be that the test data needs to be updated after an update of the Null code. Should have failed after fix to PITTNull but did not.

  2. Roland Haas
    • removed comment

    I pushed data regenerated after the update to PITTNull in git commit 6b25e9b "SphericalHarmonicReconGen: update test data after 3b23a6b" of pittnullcode and will close the ticket once the Jenkins tests pass.

  3. Roland Haas
    • removed comment

    A new failure mode though. No longer differences but an outright runtime error.

  4. Roland Haas
    • removed comment

    This is a question for Yosef or one of the other authors. This triggers:

    31      ! Note: the extraction WT is at rb(char_nx0+1)
    32      ! start marching at the 2nd point
    33
    34      if ( FirstTime ) then
    35         FirstTime = .false.
    36         if (minval(abs(zeta - dcmplx(1,1))) < 1.0d-10) then
    37            Tarr = minloc(abs(stereo_q(:,1)-1.))
    38            loc_q = Tarr(1)
    39            Tarr = minloc(abs(stereo_p(1,:)-1.))
    40            loc_p = Tarr(1)
    41            if (abs(zeta(loc_q,loc_p) - dcmplx(1.,1.)) .gt. 1d-10) then
    42               call CCTK_WARN(0, " Error ")
    43            endif
    

    in ./NullEvolve/src/NullEvol_InitialSlice.F90 line 42, i.e. the condition abs(zeta(loc_q,loc_p) - dcmplx(1.,1.)) .gt. 1d-10 holds, whatever that means. Note that the data has been freshly regenerated using gcc 7.2 with -O2 but without -ffast-math (which may be used on the Jenkins machine).

    I will have to try and see if I can reproduce the error on my workstation (or dig up my login information for the Jenkins VM machine to play around on it).

    Yosef, would you be able to tell if there is anything one can do about this error?

  5. Roland Haas
    • removed comment

    Hello Yosef,

    the tests now fail, for both SphericalHarmonicReconGen and SphericalHarmonicRecon, by triggering the level 0 warning in line 42 of NullEvolve/src/NullEvol_InitialSlice.F90. You can see the console output here: https://build.barrywardell.net/job/EinsteinToolkit/lastCompletedBuild/testReport/(root)/SphericalHarmonicRecon/regression_test_2procs/

    The condition that triggers the error is: abs(zeta(loc_q,loc_p) - dcmplx(1.,1.)) .gt. 1d-10. Not being an author, I cannot judge whether this is just due to the threshold being too tight or due to something else. Note that the test passed when compiled with -O2 and without -ffast-math using gcc 7.2 on my workstation; namely, the Recon one has

      Test SphericalHarmonicRecon: regression_test
        "PITTNullCode/SphericalHarmonicRecon/test/regression_test.par"
    
      Issuing mpirun -np 2 /data/rhaas/postdoc/gr/cactus/ET_trunk/exe/cactus_sim /data/rhaas/postdoc/gr/cactus/ET_trunk/arrangements/PITTNullCode/SphericalHarmonicRecon/test/regression_test.par
    
       j_wt[0]_2D.asc: differences below tolerance on 9698 lines
    
    
    
      Success: 4 files compared, 1 differ in the last digits
    

    while the ReconGen one passes identically, which is expected since the test data was generated on my workstation.

    Would you be able to comment on whether a difference of 1e-10 for zeta from 1+i is something to be concerned about?

    Yours, Roland

  6. Yosef Zlochower
    • removed comment

    I think we hit this before. The actual test that fails is useless. However, it also can't fail unless something else is really wrong. For a reason lost to history, we at one time wanted the location of the point (1,1).

    The test is saying: if there is a point closer than 1.0e-10 to (1,1), find it and confirm it's within 1.0e-10 of (1,1). The result should be either that there is no such point or that there is one, not this superposition of yes and no.

    So there is a bug in the grid setup somewhere, or the compiler has messed up the code.
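
    For illustration, the logic of that check can be sketched as a stand-alone program. This is only a sketch under stated assumptions: the 5x5 grid and the names q, p, tol, near_exists and located_ok are made up for the example and are not taken from NullEvol_InitialSlice.F90. Only the two-step structure (a minval test, then a minloc lookup) follows the snippet quoted in this ticket.

    ```fortran
    program check_point_on_grid
      implicit none
      integer, parameter :: wp = kind(1.0d0)
      integer, parameter :: n = 5               ! hypothetical grid size
      real(wp), parameter :: tol = 1.0e-10_wp
      complex(wp), parameter :: target = (1.0_wp, 1.0_wp)
      real(wp) :: q(n,n), p(n,n)
      complex(wp) :: zeta(n,n)
      integer :: i, j, loc(1), loc_q, loc_p
      logical :: near_exists, located_ok

      ! Hypothetical grid: q and p each take the values -1, -0.5, 0, 0.5, 1
      do j = 1, n
        do i = 1, n
          q(i,j) = -1.0_wp + 0.5_wp*real(i-1, wp)
          p(i,j) = -1.0_wp + 0.5_wp*real(j-1, wp)
          zeta(i,j) = cmplx(q(i,j), p(i,j), kind=wp)
        end do
      end do

      ! Step 1: is any grid point within tol of 1+i?
      near_exists = minval(abs(zeta - target)) < tol

      ! Step 2: locate the point along each axis and re-check it
      loc = minloc(abs(q(:,1) - 1.0_wp))
      loc_q = loc(1)
      loc = minloc(abs(p(1,:) - 1.0_wp))
      loc_p = loc(1)
      located_ok = abs(zeta(loc_q,loc_p) - target) <= tol

      ! Step 1 being true while step 2 fails is the contradictory outcome
      if (near_exists .and. .not. located_ok) then
        print *, "inconsistent"
      else
        print *, "consistent"
      end if
    end program check_point_on_grid
    ```

    On this hypothetical grid the point 1+i is present exactly, so both steps agree. The failure reported in this ticket corresponds to the "inconsistent" branch.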

  7. Roland Haas
    • removed comment

    Yes, there was discussion about this in the past, in 2017 it seems.

    Anyhow, I can reproduce this on the NDS tutorial machine using the ubuntu.cfg option list that the Jenkins server uses (Ian confirms this last bit of information).

    Yosef, if you could take a look, please.

  8. Yosef Zlochower
    • removed comment

    This is a gfortran compiler bug triggered by the compiler options "-O2 -ffast-math -fno-finite-math-only". Remove any one option, and the bug goes away. Here's a test code:

    program test
      implicit none
      integer, parameter :: nx = 5
      integer, parameter :: ny = 5
      integer, parameter :: wp = kind(1.0d0)

      complex(kind=wp), dimension(nx,ny):: st_z

      integer:: i,j

      do j=1, ny
        do i=1, nx
          st_z(i,j) = dcmplx(-i,-j)
        end do
      end do

      write(*,*) "Test output should not be zero ", minval(abs(st_z - dcmplx(1,1)))

    end program test

  9. Roland Haas
    • removed comment

    Thank you. Now we only need a workaround for our own code until we can expect a bug-fixed version of gcc on all systems :-)

  10. anonymous
    • removed comment

    The problem is that minval is used plenty of times in the code. The code that triggers the errors can be safely commented out, but what about the other uses of minval?

  11. Roland Haas
    • removed comment

    Hmm, so minval is the issue, not minloc? Given your test code, yes, "minval" seems to be the culprit. The smallest change to make things work would be to remove -fno-finite-math-only, but this makes all calls to isnan return "false", which makes NaNChecker ignore all NaNs. There are workarounds for that, e.g. manually looking for the bit pattern that IEEE defines to be NaN rather than using the C (or C++) library's isnan function. This could be encapsulated in the CCTK_IsNaN functions, which in turn could be made to override the system's isnan functions. However, this will fail in Fortran code, where the idiom to check for NaN is "if (a.ne.a) then", which involves no function call.
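
    A minimal sketch of such a bit-pattern check, written in Fortran for illustration. The module name nan_bits and the function name is_nan_bits are hypothetical and are not part of Cactus or of the CCTK_IsNaN interface. The check assumes IEEE 754 binary64 storage for real64, and whether a given compiler leaves it untouched under -ffast-math has not been verified here.

    ```fortran
    module nan_bits
      use, intrinsic :: iso_fortran_env, only: int64, real64
      implicit none
    contains
      ! True if x carries the IEEE 754 binary64 NaN pattern:
      ! all 11 exponent bits set and a non-zero 52-bit fraction.
      pure logical function is_nan_bits(x)
        real(real64), intent(in) :: x
        integer(int64) :: bits
        bits = transfer(x, 0_int64)   ! reinterpret the 64 bits as an integer
        is_nan_bits = ibits(bits, 52, 11) == 2047_int64 .and. &
                      ibits(bits, 0, 52) /= 0_int64
      end function is_nan_bits
    end module nan_bits

    program demo_nan_bits
      use, intrinsic :: iso_fortran_env, only: real64
      use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, &
                                               ieee_positive_inf
      use nan_bits, only: is_nan_bits
      implicit none
      real(real64) :: x

      x = ieee_value(1.0_real64, ieee_quiet_nan)
      print *, "NaN :", is_nan_bits(x)
      x = ieee_value(1.0_real64, ieee_positive_inf)
      print *, "Inf :", is_nan_bits(x)
      print *, "1.0 :", is_nan_bits(1.0_real64)
    end program demo_nan_bits
    ```

    Because the test works on the integer representation, it does not rely on floating-point comparisons such as a.ne.a, which is the part that finite-math-only optimisations are allowed to fold away.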

    Another option would be to provide our own minval function which may be less efficient but given that it is likely memory bandwidth bound anyway we should do about as good as the compiler supplied version.
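
    A hand-written replacement could look roughly like the sketch below. The module name safe_min and the function name min_abs_dist are hypothetical, and the example array simply reuses the values from the test code in this ticket. Whether explicit loops actually sidestep the gfortran problem under "-O2 -ffast-math -fno-finite-math-only" is an assumption that would need to be tested on the affected machine.

    ```fortran
    module safe_min
      implicit none
      integer, parameter :: wp = kind(1.0d0)
    contains
      ! Smallest |z(i,j) - z0| over a 2-D complex array, using explicit
      ! loops instead of minval(abs(...)).
      pure function min_abs_dist(z, z0) result(dmin)
        complex(wp), intent(in) :: z(:,:)
        complex(wp), intent(in) :: z0
        real(wp) :: dmin
        integer :: i, j

        dmin = huge(1.0_wp)
        do j = 1, size(z, 2)
          do i = 1, size(z, 1)
            dmin = min(dmin, abs(z(i,j) - z0))
          end do
        end do
      end function min_abs_dist
    end module safe_min

    program demo_safe_min
      use safe_min, only: wp, min_abs_dist
      implicit none
      complex(wp) :: st_z(5,5)
      integer :: i, j

      ! Same array as in the test code above: every point is far from 1+i
      do j = 1, 5
        do i = 1, 5
          st_z(i,j) = cmplx(-i, -j, kind=wp)
        end do
      end do

      ! The closest point is -1-i, at distance 2*sqrt(2) from 1+i
      print *, "min distance to 1+i: ", min_abs_dist(st_z, cmplx(1, 1, kind=wp))
    end program demo_safe_min
    ```

    The loop reads each element once, so it stays memory-bandwidth bound like the intrinsic it stands in for.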

    Do you know whether the problem is in minval or in abs?

  12. Yosef Zlochower
    • removed comment

    Further testing seems to indicate that it's the combination minval(abs(COMPLEX)) that's the problem. The only code that does this is the two "useless" tests. I think we can comment them out.

  13. Roland Haas
    • changed status to resolved
    • removed comment

    With the commit git hash 1382f3c "removed 'useless' test which tested if (1,1) is on the grid then test if (1,1) is on the grid" of pittnullcode mentioned in comment:20, the tests pass.
