Modify

Opened 5 years ago

Last modified 16 months ago

#1276 reopened defect

Intel 2013.1.117 mis-compiles NewRad

Reported by: Roland Haas Owned by:
Priority: major Milestone:
Component: EinsteinToolkit thorn Version: development version
Keywords: NewRad Cc:

Description

Intels compiler fails (with -O2) to push values for bmin onto the stack in lines 316 of newrad.cc and line 126 of extrap.cc. Adding printf's for bmin perturbs the bug out of existence, but adding a printf of the address of bmax and reveals that at the time extrap_kernel is call the integer just before this address is still the initialization value of bmin[2] and not the correct value.

The attached patch disables optimization for the two driver functions affected (but not the actual kernel).

The patch is specific (via an #if) for this particular compiler and version. What is the best way of handling this? Target any intel version starting from the known failing one until we know of known good one? Or starting from an older known good one (that would be intel 11 in my case).

Hopefully no similar bug is triggered by Carpet's use of the same idiom.

Attachments (5)

disable_optimization.patch (2.7 KB) - added by Roland Haas 5 years ago.
extrap.cc (1.5 KB) - added by Frank Löffler 5 years ago.
Short, standalone file to display bug
Cactus_intel_no_restrict (1.5 KB) - added by Frank Löffler 5 years ago.
Disable 'restrict' for Intel compilers known to produce wrong code
0001-Enable-restrict-for-sufficiently-new-versions-of-the.patch (852 bytes) - added by Ian Hinder 4 years ago.
restrict.patch (118.4 KB) - added by Steven R. Brandt 4 years ago.
This code updates autoconf to run the extrap.c test which demonstrates the failure.

Download all attachments as: .zip

Change History (59)

Changed 5 years ago by Roland Haas

Attachment: disable_optimization.patch added

comment:1 Changed 5 years ago by Roland Haas

Status: newreview

comment:2 Changed 5 years ago by Erik Schnetter

Traditionally, we have not included statements like these in source code. Apart from the difficulty of removing these again (when is it safe to do so?), the question is whether the compiler bug would also miscompile other parts of the code.

The preferred approach is to not use this compiler version, or to determine a lower optimisation setting that makes this compiler still usable. This information would be encoded in Simfactory. Obviously this bug would also be reported to Intel, so that they can correct it.

comment:3 Changed 5 years ago by Wolfgang Kastaun <wolfgang.kastaun@…>

We encountered the very same issue recently with intel 13.0.1 on the marenostrum 3 cluster, manifested as assertion failures in Newrad that went away when inserting print statements. What exactly is the compiler bug (how is it triggered) and how did you diagnose this ? In any case the code seems not special, so if it fails probably other parts of ET will fail as well.

comment:4 Changed 5 years ago by Erik Schnetter

Intel 13.1.0.146 (build 20130121) is available. I suggest upgrading to this release, or requesting from the system administrators that they do so.

comment:5 Changed 5 years ago by Roland Haas

I do not fully understand what triggers the compiler bug. Unfortunately I do not have a short test case that reproduces it. Only NewRad. The error manifested as a segfault when the code was run, it was easy to detect. To diagnose I added printf statements at various places and ran the code in gdb. Eventually I found that the error is triggered in the very first iteration of the loop in extrap_kernel at which point i,j,k had the values that bmin was initialized with :-INT_MAX,-INT_MAX/2,-INT_MAX/3 in my case. Adding printf statements at the wrong spot removes the error eg adding a printf(%d %d %d\n", bmin[0],...) removes it, indep. of whether it is added in the driver or in the extrap routine, which in itself is odd since it would hint at some kind of cross-procedure optimization.

I cannot really lower the optimization settings below -O2 (for the whole code) and thus will use a different compiler version. As far are removing the statements goes: we'd never be able to remove them since this compiler version will always be buggy. Only when no one uses the compiler anymore could we do so.

I am unfamiliar with the procedure of submitting a bug report to intel. Can one submit all of newrad as the failing routine or is a stand-alone code that demonstrates the issue required?

comment:6 Changed 5 years ago by Erik Schnetter

First, you would reproduce the bug with the currently released version. Then you would need to find a way for the Intel engineers to reproduce the problem, which is either a self-contained code that can be run (may be large), or a short piece of code where you can point at the assembler output indicating the error there.

comment:7 Changed 5 years ago by Frank Löffler

Independent on how to workaround this - it would be good to have one testsuite using Newrad which would make this kind of problem more visible. Currently the thorn doesn't have a testsuite, but I am not sure if another thorn might include it (although I would doubt it).

comment:8 Changed 5 years ago by Roland Haas

There are quite a number of tests that include it:

rhaas@horizon:.../gr/Zelmani$ grep -l NewRad arrangements/*/*/test/*.par
arrangements/AEIThorns/Trigger/test/trigger.par
arrangements/CIGR/Hydro_Analysis/test/Hydro_Analysis_TOV.par
arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-EE.par
arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180-EE.par
arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180-staggered-EE.par
arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180-staggered.par
arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180.par
arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-staggered-EE.par
arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-staggered.par
arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr.par
arrangements/CactusNumerical/RotatingSymmetry180/test/KerrSchild-rotating-180.par
arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90-EE.par
arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90-staggered-EE.par
arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90-staggered.par
arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90.par
arrangements/CactusNumerical/RotatingSymmetry90/test/KerrSchild-rotating-90.par
arrangements/Carpet/CarpetEvolutionMask/test/CarpetEvolutionMask_test.par
arrangements/Carpet/CarpetEvolutionMask/test/CarpetEvolutionMask_test_off.par
arrangements/EinsteinAnalysis/AHFinderDirect/test/checkpointML-EE.par
arrangements/EinsteinAnalysis/AHFinderDirect/test/checkpointML.par
arrangements/EinsteinAnalysis/AHFinderDirect/test/recoverML-EE.par
arrangements/EinsteinAnalysis/AHFinderDirect/test/recoverML.par
arrangements/EinsteinAnalysis/WeylScal4/test/teukolsky.par
arrangements/EinsteinEvolve/GRHydro/test/A3_oct.par
arrangements/EinsteinEvolve/GRHydro/test/GRHydro_test_tov_ppm_ML.par
arrangements/EinsteinEvolve/GRHydro/test/GRHydro_test_tov_ppm_ML_disable_internal_excision.par
arrangements/EinsteinEvolve/GRHydro/test/GRHydro_test_tov_ppm_no_trp_ML.par
arrangements/EinsteinInitialData/Exact/test/KS-tilted.par
arrangements/LSUThorns/QuasiLocalMeasures/test/qlm-bl.par
arrangements/McLachlan/ML_BSSN_Test/test/ML_BSSN_MP_O8_bh.par
arrangements/McLachlan/ML_BSSN_Test/test/ML_BSSN_NewRad.par
arrangements/Zelmani/GRHydro/test/A3_oct.par
arrangements/Zelmani/GRHydro/test/GRHydro_test_tov_ppm_ML.par
arrangements/Zelmani/GRHydro/test/GRHydro_test_tov_ppm_ML_disable_internal_excision.par
arrangements/Zelmani/GRHydro/test/GRHydro_test_tov_ppm_no_trp_ML.par

essentially all tests that use McLachlan.

comment:9 Changed 5 years ago by Frank Löffler

That is very good. That also means that this problem must be pretty "new", or appears only on otherwise not tested machines, right? We should otherwise have seen testsuite failures because of that in the past.

I otherwise agree with Erik: disabling optimization here is 'dirty' and very likely that compiler has problems compiling other code as well. It would be much better to avoid the compiler instead, if possible.

comment:10 Changed 5 years ago by Roland Haas

This still happens (for me, on bethe, in the NewRad thorn [only]) with 2013.2.146 . I have so far not been able to reproduce this with a small test case. Maybe I will just have to make up a small Cactus thornlist with NewRad to demonstrate.

comment:11 Changed 5 years ago by Erik Schnetter

Milestone: ET_2013_05

comment:12 Changed 5 years ago by Frank Löffler

Is this still an issue? Should we, within NewRad and for this release, detect that very compiler version and abort compilation with an error to at least avoid the segfaults and point users to the root of the problem?

comment:13 Changed 5 years ago by Roland Haas

This is still an issue. Since the last update of the ticket I found that adding

#define restrict

ie. removing "restrict" at the beginning of the files also seems to work around the issue.

comment:14 Changed 5 years ago by Erik Schnetter

Since we are approaching the release date, I suggest a correction that is completely contained within and #ifdef for the particular compiler version. Either switching off optimization or undefining restrict would both be fine.

Switching off optimization in the *_loop routines should be fine, since on the kernels are performance relevant.

comment:15 Changed 5 years ago by Frank Löffler

Just testing this with Intel 13 2013.3.163: still segfaults. So far I think we don't have any version '13' that works, or do we?

comment:16 Changed 5 years ago by Erik Schnetter

Unfortunately, Intel uses different version numbers on the outside of their packages and on the inside. Easiest to check is the output of "icc -V"; what does this show for you?

I'm using Version 13.1.1.163 Build 20130313.

comment:17 Changed 5 years ago by Frank Löffler

Same: Version 13.1.1.163 Build 20130313

comment:18 Changed 5 years ago by Frank Löffler

Removing the 'restrict' keywords from bmin and bmax also seems to be a workaround. Are we sure we are not doing something wrong here, declaring bmin and bmax 'restrict'? I don't see how, but I am not sure.

comment:19 Changed 5 years ago by Frank Löffler

Priority: minorcritical
Version: development version

I attach a short version of the file, reduced as much as I could so far. It is a single C-file, so be compiled using 'icpc -o ./extrap ./extrap.cc'. If working, you should see "2 2" as output (three lines). If it's not, you see random (large) numbers, and it usually hangs. The difference can be seen by commenting/uncommenting the printf in line 59.

It would be nice if someone else could confirm that this short example can be used to shop the bug, wherever it might be.

Also raising severity, since the Intel compiler is crucial and version 13 is the newest, and only available on some systems.

Last edited 4 years ago by Frank Löffler (previous) (diff)

Changed 5 years ago by Frank Löffler

Attachment: extrap.cc added

Short, standalone file to display bug

comment:20 Changed 5 years ago by Frank Löffler

The attached patch https://trac.einsteintoolkit.org/attachment/ticket/1276/Cactus_intel_no_restrict disables restrict for the version of Intel compiler I can test. Other versions might need to be added, but someone else (who has them) would need to test that. With this patch, testsuites that segfaulted before now pass.

comment:21 Changed 5 years ago by Roland Haas

I tried this with the compiler with build date 20130121 and 20121010 and also observe the presumed bug.

comment:22 Changed 5 years ago by Roland Haas

Without restrict I would expect the code to run much slower in some places.

comment:23 Changed 5 years ago by Frank Löffler

I agree. But the alternative is to have it not run at all, or possibly to miss places which the compiler also happens to miscompile, if one would redefine 'restrict' only in files we find to be miscompiled.

comment:24 Changed 5 years ago by Erik Schnetter

This patch doesn't disable restrict. The default case, where "restrict" remains undefined, leaves restrict enabled (since it remains "restrict").
I recommend changing only HAVE_CCTK_C_RESTRICT and CCTK_C_RESTRICT, and keeping the logic that redefines "restrict". Alternatively, add the Intel #ifdef afterwards, and just overwrite HAVE_CCTK_C_RESTRICT, CCTK_C_RESTRICT, and restrict.

#if BORKEN_INTEL
# undef HAVE_CCTK_C_RESTRICT
# undef CCTK_C_RESTRICT
# undef restrict
# define CCTK_C_RESTRICT
# define restrict
#endif

comment:25 Changed 5 years ago by Frank Löffler

The first part of the patch is incorrect, but the second worked. This left 'restrict' for C intact, while it does remove it for C++. The file in question here was C++, so I didn't catch this, and the (most/all?) tests passed for me. For C++, and with this version of the compiler, autoconf sets HAVE_CCTK_CXX_RESTRICT to 1 and CCTK_CXX_RESTRICT to restrict. With the old patch, 'restrict' is then defined empty for problematic versions of the intel compiler (instead of restrict).

However, you are right: this could be made even better. Attached is a new version of the patch. This one first contains a conditional on the version of the compiler for all CCODE. Then, for C and CXX separately, it skips the autoconf-provided values if the compiler was found to be bad, and instead HAVE_CCTK_CXX_RESTRICT remains undefined and CCTK_CXX_RESTRICT is set empty, which later defines 'restrict' to empty too (similarly for C).

Because someone might actually want to overwrite this, CCTK_INTEL_COMPILER_DONT_DISABLE_RESTRICT is checked, and if set, doesn't disable restrict even for bad compilers.

Also, now all Intel compilers with build dates between 20121010 and 20130313 are flagged 'bad'.

Changed 5 years ago by Frank Löffler

Attachment: Cactus_intel_no_restrict added

Disable 'restrict' for Intel compilers known to produce wrong code

comment:26 Changed 5 years ago by Erik Schnetter

Status: reviewreviewed_ok

Tested and approved.

comment:27 Changed 5 years ago by Frank Löffler

Resolution: fixed
Status: reviewed_okclosed

Committed as rev 5001

comment:28 Changed 5 years ago by Roland Haas

Resolution: fixed
Status: closedreopened

Did anyone actually report this to intel?

comment:29 Changed 5 years ago by Frank Löffler

It was reported quite a while back (through TACC). I didn't hear anything back yet though.

comment:30 Changed 5 years ago by Frank Löffler

Milestone: ET_2013_05ET_2014_05

Nothing we can really do about this right now.

comment:31 Changed 5 years ago by Frank Löffler

Milestone: ET_2014_05ET_2013_11

We should apply the workaround also the newer versions.

comment:32 Changed 5 years ago by Frank Löffler

The last version we've tested it with is 20130313, which isn't the most recent version - but the most recent I have currently access to. Also interestingly, this version is marked 'buggy' by intel itself (see http://software.intel.com/en-us/articles/intel-compiler-and-composer-update-version-numbers-to-compiler-version-number-mapping). Newer versions are 20130514, 20130607, 20130728, and the currently most recent 20131008. To the best of my knowledge none of these has been tested by us.

comment:33 Changed 5 years ago by Frank Löffler

Milestone: ET_2013_11ET_2014_05
Priority: criticalmajor

The workaround is now enabled for all versions past 20121010, in commit 5049 of the flesh. Marking as relevant for the next release because we need to revisit after some time to see if Intel fixed this, and in which version.

comment:34 Changed 5 years ago by Roland Haas

It seems as if Intel14 (Version 14.0.1.106 Build 20131008), which is available on stampede as intel/14.0.1.106 is no longer affected. At least the standalone test produces "2 2" without options, with -O3, -O2, -O1, -O0, -Ofast.

comment:35 Changed 4 years ago by Ian Hinder

The problem is present as far back as 20120731; we have changed the blacklist to reflect this in r5092 of the flesh.

comment:36 Changed 4 years ago by Ian Hinder

The version of Intel 14 on Datura is slightly older: Version 14.0.0.080 Build 20130728. The standalone test passes with this version. I am attaching a patch to cctk_Config.h.in which enables restrict for this version and later. We are assuming here that Intel do not release new builds of old versions which might not have the bug fixed.

OK to apply?

comment:37 Changed 4 years ago by Frank Löffler

Yes, please apply. However, please also extend the comment above this section accordingly.

comment:38 Changed 4 years ago by Roland Haas

Intel only fixes the latest version https://software.intel.com/en-us/articles/intel-fortran-compiler-supported-compiler-versions

Typically, bug fixes are provided for the latest version only, unless an explicit support contract for older versions exists.

comment:39 Changed 4 years ago by Frank Löffler

Ian: can you commit the patch - or did you see problems when testing it?

comment:40 Changed 4 years ago by Ian Hinder

Resolution: fixed
Status: reopenedclosed

Committed as 5110/Cactus flesh.

comment:41 Changed 4 years ago by bmundim

Resolution: fixed
Status: closedreopened

This bug seems to be partially back on intel 14.
I am updating the loewe.cfg file after a machine
upgrade and noticed that all tests using the
parameter

ML_BSSN::my_initial_boundary_condition = "extrapolate-gammas"

failed. After I realized all failed at NewRad.c
I recalled this ticket and then I ran Frank's standalone
code with -O0, -O1, -O2, -O3, -Ofast flags. As you can see
below only the -O1 and -O2 flags corrupted the output.
Since the -O2 is usually the default flag, I thought of
reporting this issue here again.

icpc -o ./extrap -O0 ./extrap.cc
./extrap
2 2
2 2
2 2
icpc -o ./extrap -O1 ./extrap.cc
/extrap
-10184 -10184
-10184 -10184
-10184 -10184
icpc -o ./extrap -O2 ./extrap.cc
./extrap
40896 40896
40896 40896
40896 40896
icpc -o ./extrap -O3 ./extrap.cc
./extrap
2 2
2 2
2 2
icpc -o ./extrap -Ofast ./extrap.cc
./extrap
2 2
2 2
2 2

The full version of the compiler I used is:

icpc -V
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.3.174 Build 20140422
Copyright (C) 1985-2014 Intel Corporation. All rights reserved.

comment:42 Changed 4 years ago by Erik Schnetter

Did you also report the issue to Intel?

comment:43 Changed 4 years ago by bmundim

No. I was looking at their website and they
have already an update for this version:

https://software.intel.com/en-us/articles/intel-compiler-and-composer-update-version-numbers-to-compiler-version-number-mapping

I don't know if they would just tell me
to try a more recent update. Do you have
experience filing bugs there? In any case,
I will ask the sysadmin here first to update
the compiler version, before filing a bug
at intel.

@Frank: did you file this bug at intel
with this extrap.c code snipet?

comment:44 Changed 4 years ago by Frank Löffler

I didn't report directly, but through our contact at TACC (which now does not work there anymore).

comment:45 Changed 4 years ago by bmundim

Ok. Do you mind reporting to them? I could also report
to them and use your code snippet, if you don't mind.
But first I will wait for the sysadmin here update
the compiler build to see if this bug goes away.

What about making your code snippet part of the configure
stage where we decide to use restrict or not based
on tests with your code?

comment:46 Changed 4 years ago by Ian Hinder

The blacklisting of restrict in Cactus should be updated to match this compiler version. And unless and until someone tests this with a newer version, any newer versions should also be blacklisted just in case. The latest version I have tested with is Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.0.080 Build 20130728, which is the latest version installed on Datura.

comment:47 Changed 4 years ago by bmundim

I tested the extrap code with other intel
compiler versions. The most recent intel 14
still has the bug. The first build for intel 15
works fine. So, we should blacklist the following:

Intel(R) 64, Version 14.0.3.174 Build 20140422
Intel(R) 64, Version 14.0.4.211 Build 20140805

and add to a white list the intel 15 one:

Intel(R) 64, Version 15.0.0.090 Build 20140723

I haven't tested a full build of ET yet. The sysadmin
here will work on compiling the other modules with
the newest intel 15. I will have to report back here
later.

Changed 4 years ago by Steven R. Brandt

Attachment: restrict.patch added

This code updates autoconf to run the extrap.c test which demonstrates the failure.

comment:48 Changed 4 years ago by Frank Löffler

Can please someone with a failing compiler test it before the release?

comment:49 Changed 4 years ago by Frank Löffler

I only looked through the patch now without trying - but doesn't this test only run the standalone code, not test if it actually did the correct thing?

comment:50 Changed 4 years ago by Steven R. Brandt

It checks to see if it got the correct answer (i.e. 2). If it doesn't, it exits with return code 1 which causes the test to fail. You can artificially cause a failure if you manually change the "return 0" to "return 1" inside the test to see what happens.

comment:51 Changed 4 years ago by Frank Löffler

Milestone: ET_2014_05

comment:52 Changed 3 years ago by physik@…

I tested with intel 15.0.1 20141023 and the standalone test works with -O0, -O1, -O2, -O3.

comment:53 Changed 16 months ago by Roland Haas

This ticket is still open after 2 years. Frank, given that Wolfgang reports that the test is effective, do you want to push this ticket forward so that the patch can be included?

comment:54 Changed 16 months ago by Frank Löffler

I would expect that a few tests with different versions of the compiler would be necessary to include the patch. In general it is a good idea to have a test for it. One immediate problem is that the patch adds

#undef CCTK_DISABLE_RESTRICT

It shouldn't, as by default this isn't defined, and if it is, it was done so presumably by the user and should not be ignored.

Modify Ticket

Change Properties
Set your email in Preferences
Action
as reopened The ticket will remain with no owner.
Next status will be 'review'.
as The resolution will be set.
to The owner will be changed from (none) to the specified user.
The owner will be changed from (none) to anonymous.

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.