Opened 7 years ago

Last modified 4 years ago

#499 new defect

Prolongation fails with vectorisation enabled

Reported by: Ian Hinder Owned by: Erik Schnetter
Priority: optional Milestone:
Component: Carpet Version: development version
Keywords: Cc:

Description (last modified by Erik Schnetter)

The development version of Carpet uses vectorisation to speed up prolongation. This fails with various errors, including corruption of the malloc heap.

I am reducing the priority, and still hope to get to the bottom of this before the release.

Attachments (4)

Llama.CTGamma.ET.CarpetHG.th (6.5 KB) - added by Barry Wardell 7 years ago.
datura.cfg (2.9 KB) - added by Barry Wardell 7 years ago.
bbh.rpar (24.4 KB) - added by Barry Wardell 7 years ago.
damiana.cfg (2.9 KB) - added by Barry Wardell 7 years ago.

Change History (27)

comment:1 Changed 7 years ago by Barry Wardell

This happens only when vectorisation is enabled through the use of VECTORISE="yes" in the optionlist. It also seems to happen only in runs where the grids are moved and in that case it happens at a random iteration before the first regridding happens.

comment:2 Changed 7 years ago by Erik Schnetter

Do you have more information?

For example: Machine name, hardware architecture (which SSE version?), compiler, compiler version, etc.

Also, do you have a stack backtrace or a core file that could lead to a line number? Alternatively, do you have the value of the instruction pointer and the disassembled executable?

comment:3 Changed 7 years ago by Barry Wardell

This happens on Datura, which has Intel Xeon X5650 processors. We are using SSE4.1 and the Intel compiler version 11.1.072. Below is a representative backtrace:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code:  (128)
Failing at address: (nil)
[ 0] /lib64/libpthread.so.0 [0x2ba555f47b10]
[ 1] /lib64/libc.so.6 [0x2ba55a6cbcc8]
[ 2] /lib64/libc.so.6(__libc_malloc+0x6e) [0x2ba55a6cdcde]
[ 3] /usr/lib64/libstdc++.so.6(_Znwm+0x1d) [0x2ba55a00317d]
[ 4] cactus_sim(_ZNSt6vectorIcSaIcEE6resizeEmc+0x137) [0x13025a7]
[ 5] cactus_sim(_ZN10comm_state4stepEv+0xab9) [0x12fd459]
[ 6] cactus_sim(_ZN6Carpet10SyncGroupsEPK4_cGHRKSt6vectorIiSaIiEE+0x3fa) [0x126b85a]
[ 7] cactus_sim(_ZN6Carpet20SyncProlongateGroupsEPK4_cGHRKSt6vectorIiSaIiEE+0x49e) [0x126b28e]
[ 8] cactus_sim [0x12c8798]
[ 9] cactus_sim(_ZN6Carpet12CallFunctionEPvP13cFunctionDataS0_+0x14bb) [0x12c84db]
[10] cactus_sim [0x4e83da]
[11] cactus_sim [0x4eb570]
[12] cactus_sim [0x4eb668]
[13] cactus_sim [0x4eb668]
[14] cactus_sim(CCTKi_DoScheduleTraverse+0x294) [0x4eb2f4]
[15] cactus_sim(CCTK_ScheduleTraverse+0x199) [0x4e4509]
[16] cactus_sim [0x126e341]
[17] cactus_sim(_ZN6Carpet6EvolveEP12tFleshConfig+0x34b) [0x126c18b]
[18] cactus_sim(main+0xa5) [0x4dc2b5]
[19] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba55a676994]
[20] cactus_sim [0x4dbfc9]

comment:4 Changed 7 years ago by Erik Schnetter

I looked at the vectorised code in Carpet's prolongation operator, and I see that all the vectorisation happens for reading, multiplying, and adding numbers. Storing the result into the target array is untouched and independent of vectorisation.

According to the backtrace above, the actual error occurs while resizing a std::vector, probably while allocating communication buffers. It could be the system runs out of memory, or there is internal memory corruption.

Since progress on this problem has stalled, I suggest to disable vectorisation in Carpet's prolongation operator. There is a statement "#if 0" in line 226; changing this to "#if 1" should enable the scalar code and thus circumvent the vectorised code.
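
To illustrate the kind of compile-time switch described here, the following is a minimal sketch. It assumes a toy 1D midpoint interpolation in place of Carpet's actual prolongation stencils. The macro `USE_SCALAR_PATH`, the `Vec2` type and `prolongate_midpoints` are hypothetical names chosen for this example, not Carpet code. As in the comment above, only the loads, multiplications and additions go through the vector type; the stores into the target array are plain scalar assignments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical switch, named for illustration only. In the ticket the switch
// is a bare "#if 0" in the Carpet source: 0 selects the vectorised path,
// 1 selects the scalar path.
#ifndef USE_SCALAR_PATH
#define USE_SCALAR_PATH 0
#endif

// Hypothetical two-lane "vector" standing in for a real SIMD register.
struct Vec2 {
  double lane[2];
};

inline Vec2 load2(const double *p) { return Vec2{{p[0], p[1]}}; }
inline Vec2 mul(Vec2 a, double c) {
  return Vec2{{a.lane[0] * c, a.lane[1] * c}};
}
inline Vec2 add(Vec2 a, Vec2 b) {
  return Vec2{{a.lane[0] + b.lane[0], a.lane[1] + b.lane[1]}};
}

// Toy 1D "prolongation": linear interpolation onto the midpoints between
// neighbouring coarse values.
std::vector<double> prolongate_midpoints(const std::vector<double> &coarse) {
  if (coarse.size() < 2) return {};
  const std::size_t n = coarse.size() - 1;
  std::vector<double> fine(n);
  std::size_t i = 0;
#if USE_SCALAR_PATH
  // Scalar path: nothing to do here, the loop below handles every point.
#else
  // Vector path: read, multiply and add two points at a time.
  for (; i + 2 <= n; i += 2) {
    const Vec2 left = load2(&coarse[i]);
    const Vec2 right = load2(&coarse[i + 1]);
    const Vec2 sum = add(mul(left, 0.5), mul(right, 0.5));
    // Stores are scalar and independent of the vector type.
    fine[i] = sum.lane[0];
    fine[i + 1] = sum.lane[1];
  }
#endif
  // Scalar loop: the whole range in the scalar build, the remainder otherwise.
  for (; i < n; ++i) fine[i] = 0.5 * coarse[i] + 0.5 * coarse[i + 1];
  return fine;
}
```

Building with `-DUSE_SCALAR_PATH=1` removes the vector loop entirely, which is the effect the suggested "#if 1" change is meant to have. Both builds should produce the same values, so a crash that persists in the scalar build points away from this kernel.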

comment:5 in reply to:  4 Changed 7 years ago by Barry Wardell

Replying to eschnett:

> According to the backtrace above, the actual error occurs while resizing a std::vector, probably while allocating communication buffers. It could be the system runs out of memory, or there is internal memory corruption.

I don't think it is a case of running out of memory, but it seems likely that it could be memory corruption. Upon closer inspection, it looks like the segfault is happening in recompose:

Backtrace from rank 16 pid 4306:
1. /lib64/libc.so.6(gsignal+0x35) [0x2b51371a1265]
2. /lib64/libc.so.6(abort+0x110) [0x2b51371a2d10]
3. /lib64/libc.so.6 [0x2b51371db84b]
4. /lib64/libc.so.6 [0x2b51371e330f]
5. /lib64/libc.so.6(cfree+0x4b) [0x2b51371e376b]
6. mem<double>::~mem()(.../cactus_Datura-carpet-hg-test)
7. data<double>::~data()(.../cactus_Datura-carpet-hg-test)
8. ggf::recompose_free_old(int)(.../cactus_Datura-carpet-hg-test)
9. dh::recompose(int, bool)(.../cactus_Datura-carpet-hg-test)
a. gh::recompose(int, bool)(.../cactus_Datura-carpet-hg-test)
b. Carpet::Recompose(_cGH const*, int, bool)(.../cactus_Datura-carpet-hg-test)
c. .../cactus_Datura-carpet-hg-test [0x13adabc]
d. Carpet::Evolve(tFleshConfig*)(.../cactus_Datura-carpet-hg-test)
e. .../cactus_Datura-carpet-hg-test(main+0xa5) [0x4de135]
f. /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b513718e994]
10. .../cactus_Datura-carpet-hg-test [0x4dde49]

> Since progress on this problem has stalled, I suggest to disable vectorisation in Carpet's prolongation operator. There is a statement "#if 0" in line 226; changing this to "#if 1" should enable the scalar code and thus circumvent the vectorised code.

I have disabled vectorisation in this section of the code and still get a segfault, so I guess the problem must be elsewhere (LoopControl?). It certainly has to be somewhere in Carpet, because when I replace the Mercurial version with the git version my simulation runs without any problems.

I also tried debugging with gdb, but unfortunately once I disabled optimisation the crash no longer happened!

comment:6 Changed 7 years ago by Erik Schnetter

LoopControl is another candidate that can create problems when vectorising. It has a self-check built in that is enabled via LoopControl::do_selftest = yes (and which is somewhat expensive). Could you give this a try?

comment:7 in reply to:  6 Changed 7 years ago by Barry Wardell

Replying to eschnett:

> LoopControl is another candidate that can create problems when vectorising. It has a self-check built in that is enabled via LoopControl::do_selftest = yes (and which is somewhat expensive). Could you give this a try?

I have tried this and the crash still happens as before without any error detected by LoopControl. The only way I am able to avoid the crash is to disable optimisation (which annoyingly makes debugging a pain). Do you think this could point to a compiler issue? I'm going to try the same run on Kraken (the crash happens on Datura) to see if the problem happens there too.

Has anybody else encountered a segfault in CarpetHG with vectorisation enabled and AMR? Or has anybody else tried this combination successfully?

comment:8 Changed 7 years ago by Erik Schnetter

Other random ideas coming to my mind:

  • running without OpenMP (with a single thread)
  • building with gcc (and with optimisation) instead of Intel, still on Datura

comment:9 in reply to: 8 Changed 7 years ago by Barry Wardell

I have just tried running the same job on Kraken and it works perfectly fine, without any segfault. I'm using the current SimFactory2 optionlists kraken-intel.cfg and datura.cfg. This means that on Kraken I'm using a slightly different version of the Intel compiler (11.1.038 vs 11.1.072) and some different optimisation settings (in particular SSE2 instead of SSE4.1). Note, however, that when I run the job on Damiana (damiana.cfg), which uses SSE2 but is otherwise identical to Datura, I get the same segfault.

Replying to eschnett:

> Other random ideas coming to my mind:
>
>   • running without OpenMP (with a single thread)
>   • building with gcc (and with optimisation) instead of Intel, still on Datura

I will try these suggestions.

comment:10 in reply to:  9 Changed 7 years ago by Barry Wardell

Replying to barry.wardell:

>   • running without OpenMP (with a single thread)

Running on a single thread no longer triggers the segfault. So, to summarise, the combination required to trigger the problem is:

  • Mercurial version of Carpet
  • Vectorisation enabled
  • Datura or Damiana (or at least not Kraken)
  • Optimisation (-O2) enabled
  • >1 OpenMP thread
  • Moving boxes AMR

and then the segfault happens in ggf::recompose_free_old(int).

comment:11 Changed 7 years ago by Erik Schnetter

It could also be the particular version of the Intel compiler. Your licence should be good for other versions as well; if you kept the install image for a previous version around, you could give this a try.

comment:12 in reply to:  11 Changed 7 years ago by Barry Wardell

Replying to eschnett:

> It could also be the particular version of the Intel compiler. Your licence should be good for other versions as well; if you kept the install image for a previous version around, you could give this a try.

I have now tried with Intel Compiler 12.0.2 (I was previously using 11.1.072) and with OpenMPI 1.5.4 (previously 1.4.3) and the segfault happens just as before.

comment:13 Changed 7 years ago by Frank Löffler

Milestone: ET_2011_11
Version: development version

comment:14 Changed 7 years ago by Ian Hinder

Priority: major → blocker

comment:15 Changed 7 years ago by anonymous

Attached is the thornlist, optionlist and parameter file I use to reproduce the problem

Changed 7 years ago by Barry Wardell

Attachment: Llama.CTGamma.ET.CarpetHG.th added

Changed 7 years ago by Barry Wardell

Attachment: datura.cfg added

Changed 7 years ago by Barry Wardell

Attachment: bbh.rpar added

Changed 7 years ago by Barry Wardell

Attachment: damiana.cfg added

comment:16 in reply to:  15 Changed 7 years ago by Barry Wardell

Replying to anonymous:

> Attached is the thornlist, optionlist and parameter file I use to reproduce the problem

I forgot to mention that I run this with 120 cores. On Datura, I run with 6 threads and on Damiana with 2 threads. The crash usually happens within the first 1500 iterations, although it changes each time it is run.

comment:17 Changed 7 years ago by Erik Schnetter

I believe that setting

LoopControl::use_random_restart_hill_climbing = no

circumvents the segfault. Could you check this?

comment:18 Changed 7 years ago by Barry Wardell

I can confirm that when I use this setting the segfault no longer happens. Does this mean that you have identified the problem? Is it a bug in LoopControl? Or a compiler bug?

comment:19 Changed 7 years ago by Erik Schnetter

No, I have not identified the problem, but I assume that it is either an error in LoopControl's optimisation mechanism or in the compiler.

I have asked Nico to install a malloc debugger on Damiana; could you follow up? This may be the easiest way to debug this.
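
For context on why a malloc debugger helps with this kind of crash: heap corruption is usually detected long after the faulty write, for example in an unrelated `malloc` or `free` as in the backtraces above. One technique such tools commonly use is to surround each allocation with guard bytes and check them later. The following is a minimal sketch of that idea only; `GuardedBuffer`, `kGuard` and `kCanary` are hypothetical names and are not part of Carpet or of any particular debugging tool.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constants for this sketch.
constexpr std::size_t kGuard = 16;        // guard bytes on each side
constexpr unsigned char kCanary = 0xAB;   // known pattern stored in the guards

// Hypothetical helper: a byte buffer with guard zones before and after the
// usable payload. A write that strays just outside the payload lands in a
// guard zone and can be detected afterwards.
class GuardedBuffer {
public:
  explicit GuardedBuffer(std::size_t n)
      : size_(n), storage_(n + 2 * kGuard, kCanary) {
    // Zero the payload; the guard zones keep the canary pattern.
    for (std::size_t i = 0; i < n; ++i) storage_[kGuard + i] = 0;
  }

  // Pointer to the first usable byte.
  unsigned char *data() { return storage_.data() + kGuard; }
  std::size_t size() const { return size_; }

  // True if neither guard zone has been overwritten.
  bool guards_intact() const {
    for (std::size_t i = 0; i < kGuard; ++i) {
      if (storage_[i] != kCanary) return false;
      if (storage_[kGuard + size_ + i] != kCanary) return false;
    }
    return true;
  }

private:
  std::size_t size_;
  std::vector<unsigned char> storage_;
};
```

Calling `guards_intact()` after each suspect routine narrows down which routine wrote out of bounds, instead of waiting for the allocator to fail somewhere else. A sketch like this only catches overruns smaller than `kGuard` bytes.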

comment:20 Changed 7 years ago by Erik Schnetter

Description: modified (diff)
Priority: blocker → critical

I have made

LoopControl::use_random_restart_hill_climbing = no

the default in Carpet since this seems to avoid the segfault.
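
For readers unfamiliar with the technique named by this parameter, the following is a generic sketch of random-restart hill climbing on a toy one-dimensional integer cost function. It is not LoopControl's implementation, and the function names, the cost function in the usage note and the search range are all assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>
#include <random>

// Hypothetical helper: plain hill climbing over the integers lo..hi.
// Moves to a strictly better neighbour until none exists, then returns
// the local minimum it has reached.
int hill_climb(const std::function<double(int)> &cost, int start, int lo,
               int hi) {
  int x = start;
  for (;;) {
    int best = x;
    if (x > lo && cost(x - 1) < cost(best)) best = x - 1;
    if (x < hi && cost(x + 1) < cost(best)) best = x + 1;
    if (best == x) return x;  // no better neighbour: local minimum
    x = best;
  }
}

// Hypothetical helper: repeat hill climbing from random starting points and
// keep the best local minimum found. Restarts reduce the chance of getting
// stuck in a poor local minimum.
int random_restart_hill_climb(const std::function<double(int)> &cost, int lo,
                              int hi, int restarts, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> pick(lo, hi);
  int best = hill_climb(cost, pick(rng), lo, hi);
  for (int r = 1; r < restarts; ++r) {
    const int candidate = hill_climb(cost, pick(rng), lo, hi);
    if (cost(candidate) < cost(best)) best = candidate;
  }
  return best;
}
```

With a cost such as `(x - 40) * (x - 40)` plus a penalty of 50 whenever `x` is not a multiple of 7, a single climb started at 38 stops at the local minimum 40, while one started at 43 reaches 42. Because the starting points are random, the sequence of configurations tried differs from run to run.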

comment:21 Changed 7 years ago by Ian Hinder

Milestone: ET_2011_10 removed
Priority: critical → major

I am removing the release milestone and reducing the priority to major, since it no longer occurs with the default parameters and the workaround is straightforward.

comment:22 Changed 5 years ago by Roland Haas

Ian, Barry: still an issue?

comment:23 Changed 4 years ago by Ian Hinder

Priority: major → optional

According to the comments above, the default for LoopControl::use_random_restart_hill_climbing has been changed to "no" and this avoids the problem. Unless someone wants to use that feature, it sounds like the problem is effectively fixed. I do not even know if the problem can be reproduced now. If it can, then probably the ticket should be considered a bug in LoopControl's optimisation code.
