SphericalHarmonicRecon regression_test fails on many machines

Issue #1600 closed
Ian Hinder created an issue

The test SphericalHarmonicRecon/regression_test fails on shelob, datura, carver and mike with the following error:

INFO (NullInterp): will set up circle-shaped guard point mask
WARNING level 0 in thorn NullInterp processor 0 host shelob001
  (line 105 of NullInterp_MaskInit.F90): 
  -> mask setup error
INFO (NullInterp):  setting up a circular evolution mask of radius   2.00000000000000
INFO (NullInterp):  the guard point shell is set for a max. stencil size of           3
WARNING level 0 in thorn NullInterp processor 0 host shelob001
  (line 105 of NullInterp_MaskInit.F90): 
  -> mask setup error

It fails on bluewaters and pandora with substantial differences in the NewsB gridfunction:

   NewsB[0]_2D.asc: substantial differences
      significant differences on 12 (out of 4515) lines
      maximum absolute difference in column 3 is 0.0591090798545013
      maximum absolute difference in column 4 is 0.116319942759234
      maximum relative difference in column 3 is 1
      maximum relative difference in column 4 is 195.702939866031
      (insignificant differences on 305 lines)

The failures occur on both one and two processes.

Comments (4)

  1. Roland Haas

    The routine setup_circle_masks in PITTNullCode/NullInterp/src/NullInterp_MaskInit.F90 contains the following code starting at line 101:

    98     ! check that there is enough room left
    99     ! for guard  points near the patch boundaries
    100
    101     if (((lbnd(1).eq.0).and.any(EG_mask(1:N_ang_stencil_size,:).ne.0)) .or. &
    102         ((ubnd(1).eq.gsh(1)-1).and.any(guard_mask(lsh(1)-N_ang_stencil_size+1:,:).ne.0)) .or. &
    103         ((lbnd(2).eq.0).and.any(guard_mask(:,1:N_ang_stencil_size).ne.0)) .or. &
    104         ((ubnd(2).eq.gsh(2)-1).and.any(guard_mask(:,lsh(2)-N_ang_stencil_size+1:).ne.0))) &
    105          call CCTK_WARN(0, "mask setup error");
    106
    107     ! mark guard points
    108
    

    The first condition "((lbnd(1).eq.0).and.any(EG_mask(1:N_ang_stencil_size,:).ne.0))" seems incorrect to me since it (alone) tests EG_mask rather than guard_mask as the others do.

  2. Roland Haas

    Bela suggests (and the surrounding code supports this) that EG_mask is in fact the correct mask to use, and that it is guard_mask that is used incorrectly.
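
    The check under discussion can be modelled outside Cactus. The following is a minimal Python sketch, not the change committed to NullInterp: it assumes that one and the same mask (EG_mask, per the comment above) is tested in all four boundary strips. The function name boundary_strip_occupied and its argument layout are hypothetical. Fortran's 1-based slices 1:N and lsh-N+1: become [:N] and [n-N:] here.

```python
def boundary_strip_occupied(mask, lbnd, ubnd, gsh, stencil):
    """Return True if `mask` is nonzero within `stencil` points of an
    outer patch boundary owned by this process.

    mask    : local 2D mask as a list of rows, mask[i][j] (i = first index)
    lbnd    : global index of the first local point in each direction
    ubnd    : global index of the last local point in each direction
    gsh     : global grid size in each direction
    stencil : width of the strip that must stay clear (N_ang_stencil_size)
    """
    n1 = len(mask)
    n2 = len(mask[0])

    # Lower and upper strips in the first direction.
    low1 = lbnd[0] == 0 and any(
        v != 0 for row in mask[:stencil] for v in row)
    high1 = ubnd[0] == gsh[0] - 1 and any(
        v != 0 for row in mask[n1 - stencil:] for v in row)

    # Lower and upper strips in the second direction.
    low2 = lbnd[1] == 0 and any(
        v != 0 for row in mask for v in row[:stencil])
    high2 = ubnd[1] == gsh[1] - 1 and any(
        v != 0 for row in mask for v in row[n2 - stencil:])

    return low1 or high1 or low2 or high2


# Hypothetical 8x8 patch covering the whole grid, stencil width 3.
example = [[0] * 8 for _ in range(8)]
example[4][4] = 1   # interior point: allowed
clear = boundary_strip_occupied(example, (0, 0), (7, 7), (8, 8), 3)
example[0][4] = 1   # point inside the lower strip of direction 1
blocked = boundary_strip_occupied(example, (0, 0), (7, 7), (8, 8), 3)
```

    In the Fortran routine, a patch for which this condition holds aborts with "mask setup error". A strip is only examined when the process owns the corresponding outer boundary, which is why lbnd, ubnd and gsh enter the test.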

    Fixed in rev 17 of NullInterp.

    Ian: while this test fails on bluewaters, it fails there with file differences, so Datura might be the only one where it aborts. Please close this ticket if the fix resolves the issue on datura (it still fails on bluewaters).

  3. Ian Hinder reporter
    • changed status to resolved

    All tests now pass on Datura. I assume the same failure is now corrected on shelob, carver and mike, but we have to wait for Erik to run the tests there. Thanks!

    I decided not to create separate tickets for the two different failures, so the ticket summary includes both. However, the machines failing with "significant differences", pandora and bluewaters, have many other failures, and I suspect these are due to roundoff issues caused by Ofast and/or by using different compilers. Closing this ticket.
