SphericalHarmonicRecon and SphericalHarmonicReconGen tests fail on Jenkins build machine

Issue #2096 closed

Ian Hinder created an issue 2018-01-20

(Adapted from http://lists.einsteintoolkit.org/pipermail/users/2017-July/005671.html)

The following three tests fail on the Jenkins build machine:

- SphericalHarmonicRecon.regression_test/2procs
- SphericalHarmonicReconGen.SpEC-dat-test/2procs
- SphericalHarmonicReconGen.SpEC-h5-test/2procs

See https://build-test.barrywardell.net/job/EinsteinToolkit/1271/testReport/.

These tests all pass on one process but fail on two processes, using the ubuntu.cfg optionlist. They all seem to pass on multiple processes on all other machines, including my laptop with gcc.

The first, SphericalHarmonicRecon.regression_test, fails like this:

WARNING level 0 from host 7ce14e5707a0 process 0
  while executing schedule bin NullEvol_Initial, routine NullEvolve::NullEvol_InitialSlice
  in thorn NullEvolve, file NullEvol_InitialSlice.F90:42:
  ->  Error

The second, SphericalHarmonicReconGen.SpEC-dat-test, fails like this:

NewsB_scri.L02Mm01.asc: substantial differences
      significant differences on 1 (out of 2) lines
      maximum absolute difference in column 1 is 963
      maximum absolute difference in column 2 is 0.000185770963653907
      maximum absolute difference in column 3 is 0.000142466608463344
      maximum relative difference in column 1 is 1
      maximum relative difference in column 2 is 1
      maximum relative difference in column 3 is 1
      ...

The third, SphericalHarmonicReconGen.SpEC-h5-test, fails like this:

NewsB_scri.L02Mm01.asc: substantial differences
      significant differences on 1 (out of 2) lines
      maximum absolute difference in column 1 is 963
      maximum absolute difference in column 2 is 0.000185770963653907
      maximum absolute difference in column 3 is 0.000142466608463344
      maximum relative difference in column 1 is 1
      maximum relative difference in column 2 is 1
      maximum relative difference in column 3 is 1
      ...

I suspect the second and third failures have the same cause. These tests don't seem to fail on any other machines.

See the discussion thread http://lists.einsteintoolkit.org/pipermail/users/2017-July/thread.html#5671 for more information.

Comments (22)

Ian Hinder reporter
- changed status to open
- removed comment
- 2018-01-20T08:13:03+00:00
Ian Hinder reporter
- removed comment
- 2018-01-20T08:14:20+00:00
Roland Haas
- removed comment
In today's ET call Yosef said that the test data may need to be updated after an update of the Null code. The tests should have failed after the fix to PITTNull but did not.
- 2018-01-22T09:48:25+00:00
Roland Haas
- changed status to open
- assigned issue to
- removed comment
- 2018-01-22T10:15:23+00:00
Roland Haas
- removed comment
I pushed data regenerated after the update to PITTNull in git commit 6b25e9b "SphericalHarmonicReconGen: update test data after 3b23a6b" of pittnullcode and will close the ticket once the Jenkins tests pass.
- 2018-01-22T21:00:18+00:00
Ian Hinder reporter
- removed comment
The tests still fail on 2 processes. https://build-test.barrywardell.net/job/EinsteinToolkit/1273/testReport/
- 2018-01-23T08:45:04+00:00
Roland Haas
- removed comment
A new failure mode though. No longer differences but an outright runtime error.
- 2018-01-23T09:09:56+00:00
Roland Haas
- removed comment
This is a question for Yosef or one of the other authors. This triggers:
```
31      ! Note: the extraction WT is at rb(char_nx0+1)
32      ! start marching at the 2nd point
33
34      if ( FirstTime ) then
35         FirstTime = .false.
36         if (minval(abs(zeta - dcmplx(1,1))) < 1.0d-10) then
37            Tarr = minloc(abs(stereo_q(:,1)-1.))
38            loc_q = Tarr(1)
39            Tarr = minloc(abs(stereo_p(1,:)-1.))
40            loc_p = Tarr(1)
41            if (abs(zeta(loc_q,loc_p) - dcmplx(1.,1.)) .gt. 1d-10) then
42               call CCTK_WARN(0, " Error ")
43            endif
```
in ./NullEvolve/src/NullEvol_InitialSlice.F90 line 42, i.e. abs(zeta(loc_q,loc_p) - dcmplx(1.,1.)) .gt. 1d-10, whatever that means. Note that the data has been freshly regenerated using gcc 7.2 with -O2 but without -ffast-math (which may be used on the Jenkins machine).

I will have to try and see if I can reproduce the error on my workstation (or dig up my login information for the Jenkins VM machine to play around on it).

Yosef, would you be able to tell if there is anything one can do about this error?
- 2018-01-23T10:18:14+00:00
Roland Haas
- removed comment
The same error is triggered by SphericalHarmonicRecon's test (no "Gen" this time). See: https://build.barrywardell.net/job/EinsteinToolkit/lastCompletedBuild/testReport/(root)/SphericalHarmonicRecon/regression_test_2procs/ and https://build.barrywardell.net/job/EinsteinToolkit/1273/ for the top level test page.
- 2018-01-23T10:24:53+00:00
Roland Haas
- removed comment
- changed watchers to yosef_zlochower
- 2018-01-24T08:16:14+00:00
Roland Haas
- removed comment
Hello Yosef,

The tests now fail, for both SphericalHarmonicReconGen and SphericalHarmonicRecon, by triggering the level 0 warning in line 42 of NullEvolve/src/NullEvol_InitialSlice.F90. You can see the console output here: https://build.barrywardell.net/job/EinsteinToolkit/lastCompletedBuild/testReport/(root)/SphericalHarmonicRecon/regression_test_2procs/

The condition that triggers the error is: abs(zeta(loc_q,loc_p) - dcmplx(1.,1.)) .gt. 1d-10. Not being an author, I cannot judge whether this is just due to the threshold being too tight or due to something else. Note that the tests passed when compiled with -O2 and without -ffast-math using gcc 7.2 on my workstation, namely the Recon one has
```
  Test SphericalHarmonicRecon: regression_test
    "PITTNullCode/SphericalHarmonicRecon/test/regression_test.par"

  Issuing mpirun -np 2 /data/rhaas/postdoc/gr/cactus/ET_trunk/exe/cactus_sim /data/rhaas/postdoc/gr/cactus/ET_trunk/arrangements/PITTNullCode/SphericalHarmonicRecon/test/regression_test.par

   j_wt[0]_2D.asc: differences below tolerance on 9698 lines



  Success: 4 files compared, 1 differ in the last digits
```
while the ReconGen one passes identically, which is expected since the test data was generated on my workstation.

Would you be able to comment on whether a difference of 1e-10 for zeta from 1+i is something to be concerned about?

Yours, Roland
- 2018-01-24T08:21:38+00:00
Yosef Zlochower
- removed comment
I think we hit this before. The actual test that fails is useless. However, it also can't fail unless something else is really wrong. For a reason lost to history, we at one time wanted the location of the point (1,1).

The test is saying: if there is a point closer than 1.0e-10 to (1,1), find it and confirm it's within 1.0e-10 of (1,1). The result should be either that there is no such point or that there is one, not this superposition of yes and no.

So there is a bug in the grid setup somewhere, or the compiler has messed up the code.
- 2018-01-24T08:36:58+00:00
Roland Haas
- removed comment
Yes, there was discussion about this in the past, in 2017 it seems.

Anyhow, I can reproduce this on the NDS tutorial machine using the ubuntu.cfg option list that the Jenkins server uses (Ian confirms this last bit of information).

Yosef, if you could take a look, please.
- 2018-01-25T17:40:51+00:00
Yosef Zlochower
- removed comment
This is a gfortran compiler bug triggered by the compiler options "-O2 -ffast-math -fno-finite-math-only". Remove any one option, and the bug goes away. Here's a test code:

```
program test
  implicit none
  integer, parameter :: nx = 5
  integer, parameter :: ny = 5
  integer, parameter :: wp = kind(1.0d0)

  complex(kind=wp), dimension(nx,ny):: st_z

  integer:: i,j

  do j=1, ny
    do i=1, nx
      st_z(i,j) = dcmplx(-i,-j)
    end do
  end do

  write(*,*) "Test output should not be zero ", minval(abs(st_z - dcmplx(1,1)))
end program test
```
- 2018-01-29T07:27:46+00:00
Yosef Zlochower
- removed comment
I submitted a bug report to gcc

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84104
- 2018-01-29T07:41:52+00:00
Roland Haas
- removed comment
Thank you. Now we only need a workaround for our own code until we can expect a bug-fixed version of gcc on all systems :-)
- 2018-01-29T07:44:46+00:00
anonymous
- removed comment
The problem is that minval is used plenty of times in the code. The code that triggers the errors can be safely commented out, but what about the other uses of minval?
- 2018-01-29T07:51:56+00:00
Roland Haas
- removed comment
Hmm, so minval is the issue, not minloc? Given your test code, yes, "minval" seems to be the culprit. The smallest change to make things work would be to remove -fno-finite-math-only, but this makes all calls to isnan return "false", which makes NaNChecker ignore all NaNs. There are workarounds for that, e.g. manually looking for the bit pattern that IEEE defines to be NaN rather than using the C (or C++) library's isnan function. This could be encapsulated in the CCTK_IsNaN functions, which in turn could be made to override the system's isnan functions. However, this will fail in Fortran code, where the idiom to check for NaN is "if (a.ne.a) then", which involves no function call.
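For illustration only, a bit-pattern NaN check written directly in Fortran might look like the sketch below. The name is_nan_bits is hypothetical and is not part of Cactus or CCTK. The sketch assumes that real64 is stored as IEEE 754 binary64, and it has not been checked against the compiler options discussed in this ticket.

```fortran
program nan_bits_demo
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  real(real64) :: one, not_a_number

  one = 1.0_real64
  ! Build a quiet NaN from its bit pattern, so that no floating-point
  ! arithmetic is needed to create it.
  not_a_number = transfer(int(z'7FF8000000000000', int64), 1.0_real64)

  write(*,*) "is_nan_bits(1.0) = ", is_nan_bits(one)
  write(*,*) "is_nan_bits(NaN) = ", is_nan_bits(not_a_number)

contains

  ! Hypothetical helper (assumption: IEEE 754 binary64 layout).
  ! A value is NaN if all 11 exponent bits are set and the 52-bit
  ! fraction is non-zero. Only integer operations are used.
  pure logical function is_nan_bits(x)
    real(real64), intent(in) :: x
    integer(int64), parameter :: exp_mask  = int(z'7FF0000000000000', int64)
    integer(int64), parameter :: frac_mask = int(z'000FFFFFFFFFFFFF', int64)
    integer(int64) :: bits

    bits = transfer(x, bits)
    is_nan_bits = (iand(bits, exp_mask) == exp_mask) .and. &
                  (iand(bits, frac_mask) /= 0_int64)
  end function is_nan_bits

end program nan_bits_demo
```

Because the check only inspects integer bits, it is expected to report false for 1.0 and true for the NaN. Existing "a.ne.a" idioms in Fortran code would still have to be replaced by hand with a call to such a helper.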

Another option would be to provide our own minval function, which may be less efficient, but given that it is likely memory-bandwidth bound anyway, we should do about as well as the compiler-supplied version.

Do you know whether the problem is in minval or in abs?
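A minimal sketch of such a hand-written replacement, using the same 5x5 array as the test code earlier in this ticket, could look like this. The function name min_abs_diff is hypothetical, and whether an explicit loop actually avoids the gfortran problem has not been verified here.

```fortran
program minabs_demo
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: nx = 5, ny = 5
  complex(kind=wp), dimension(nx,ny) :: st_z
  integer :: i, j

  ! Same data as the reproducer: st_z(i,j) = (-i, -j)
  do j = 1, ny
    do i = 1, nx
      st_z(i,j) = cmplx(-i, -j, kind=wp)
    end do
  end do

  write(*,*) "hand-written minimum of |st_z - (1,1)|: ", &
             min_abs_diff(st_z, cmplx(1, 1, kind=wp))

contains

  ! Hypothetical stand-in for minval(abs(z - z0)), written as an
  ! explicit loop. Returns huge(dmin) for an empty array.
  pure function min_abs_diff(z, z0) result(dmin)
    complex(kind=wp), intent(in) :: z(:,:)
    complex(kind=wp), intent(in) :: z0
    real(kind=wp) :: dmin
    integer :: i, j

    dmin = huge(dmin)
    do j = 1, size(z, 2)
      do i = 1, size(z, 1)
        dmin = min(dmin, abs(z(i,j) - z0))
      end do
    end do
  end function min_abs_diff

end program minabs_demo
```

For this data the closest point to (1,1) is (-1,-1), so the mathematically correct minimum is 2*sqrt(2), roughly 2.83, and in particular not zero.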
- 2018-01-29T08:29:30+00:00
Yosef Zlochower
- removed comment
Further testing seems to indicate that it's the combination minval(abs(COMPLEX)) that's the problem. The only places in the code that do this are the two "useless" tests. I think we can comment them out.
- 2018-01-29T08:50:51+00:00
anonymous
- removed comment
I pushed a change that removes the "useless" tests. The code should work now
- 2018-01-29T09:26:25+00:00
Roland Haas
- changed status to resolved
- removed comment
With git commit 1382f3c "removed 'useless' test which tested if (1,1) is on the grid then test if (1,1) is on the grid" of pittnullcode, mentioned in comment:20, the tests pass.
- 2018-01-29T15:10:44+00:00
Roland Haas
- changed status to closed
- edited description
- 2019-02-21T20:05:10+00:00
Assignee: Roland Haas

Type: bug

Priority: minor

Status: closed

Component: EinsteinToolkit thorn

Milestone: ET_2018_02

Version: development version

Votes: 0

Watchers: 0