Output a stack backtrace on a fatal error in a simulation

Issue #443 closed
Ian Hinder created an issue

When a Cactus simulation aborts with a signal, it is often difficult to determine which part of the code led to the problem. The attached patch registers a signal handler on Carpet startup for signals 11 and 6 (segmentation fault and abort, e.g. from assert()) which outputs a stack backtrace from each process to a file, including demangling symbol names. It uses some low-level and possibly unofficial APIs, and is likely not completely portable. However, I have tested it on Mac OS (gcc) and Linux (intel) and it works in those places.

Part of this code was contributed by Justin Luitjens at the Carpet developers' workshop in summer 2010.
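
The attached patch is not reproduced here. As a rough illustration of the approach described above, the following is a minimal sketch, assuming a glibc-style `<execinfo.h>` is available. The helper names (`write_backtrace`, `fatal_signal_handler`, `install_backtrace_handlers`) and the output file name are hypothetical and are not taken from the patch.

```cpp
#include <cassert>
#include <csignal>
#include <cstdio>
#include <execinfo.h>  // backtrace(), backtrace_symbols_fd(): glibc/macOS extensions
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: write the current call stack to an open file
// descriptor and return the number of frames captured.
int write_backtrace(int fd) {
  void *frames[64];
  const int nframes = backtrace(frames, 64);
  // backtrace_symbols_fd() writes directly to fd without calling malloc().
  backtrace_symbols_fd(frames, nframes, fd);
  return nframes;
}

// Handler for fatal signals: dump a backtrace to a per-process file, then
// re-raise the signal with the default action so the process still dies.
extern "C" void fatal_signal_handler(int sig) {
  char filename[64];
  // Illustrative file name; snprintf is not formally async-signal-safe.
  std::snprintf(filename, sizeof filename, "backtrace.%d.txt",
                static_cast<int>(getpid()));
  const int fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd >= 0) {
    write_backtrace(fd);
    close(fd);
  }
  std::signal(sig, SIG_DFL);
  std::raise(sig);
}

// Register the handler for segmentation faults (11) and aborts (6).
void install_backtrace_handlers() {
  std::signal(SIGSEGV, fatal_signal_handler);
  std::signal(SIGABRT, fatal_signal_handler);
}
```

Calling `install_backtrace_handlers()` once at startup would be enough for this sketch. Symbol names only appear in the output if the executable exports them, and this version does not demangle them.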

Comments (17)

  1. Bruno Mundim

    It worked for me on Linux (CentOS 5). However, the backtrace didn't include the symbols:

    Backtrace from rank 1 pid 21481:
    1. /lib64/libc.so.6(gsignal+0x35) [0x3531430265]
    2. /lib64/libc.so.6(abort+0x110) [0x3531431d10]
    3. /lib64/libc.so.6(assert_fail+0xf6) [0x35314296e6]
    4. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x115cdbe]
    5. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x420eb7]
    6. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x420fd1]
    7. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0xa0b9fd]
    8. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0xa0955a]
    9. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x5928e0]
    a. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x11225fd]
    b. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x111b1c7]
    c. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x111c857]
    d. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x111ce83]
    e. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch [0x411979]
    f. /lib64/libc.so.6(libc_start_main+0xf4) [0x353141d994]
    10. /home/bruno/scratch/frozenstar/bbh/./cactus_einstein_patch(memcmp+0x381) [0x411679]

    I had OPTIMISE = yes in my configuration file but I also had CFLAGS = -g -debug all ... So -g doesn't seem to be passed along here. If we don't have the symbols then it becomes much harder to debug. In any case, apart from this detail, I would say this patch looks good for applying.

  2. Ian Hinder reporter

    Try adding -rdynamic to LDFLAGS:

    http://gcc.gnu.org/onlinedocs/gcc/Link-Options.html

    This was necessary for me to get this to work. The patch uses undocumented glibc hacks, so we should test it on several exotic architectures before applying it; it might need to be disabled if certain features are not available. There is no platform-independent or supported way of doing this, according to Justin's comments.

    We could consider adding -rdynamic by default to LDFLAGS in the flesh, though I don't know what the consequences of this would be.

  3. Erik Schnetter

    GCC says:

    -rdynamic Pass the flag -export-dynamic to the ELF linker, on targets that support it. This instructs the linker to add all symbols, not only used ones, to the dynamic symbol table. This option is needed for some uses of "dlopen" or to allow obtaining backtraces from within a program.

    This seems like a safe thing to do. Note that we override many defaults in the SimFactory, so I would begin by adding it to SimFactory's option lists on the machines that we often use.

  4. Roland Haas

    When trying this with gcc (4.6) I needed to #include <cstring>. I also added code to output the backtrace.xx.txt files to out_dir in case more than one run segfaults. Finally, I seem to require -ldl in LDFLAGS; otherwise linking fails with dladdr not found. -lcrypto (from OpenSSL) also allows me to link (I assume crypto links to dl), so maybe this never happened with the ET thorn lists. Since dladdr is not POSIX but a GNU extension, is there maybe an autoconf test for it? I have never used autoconf myself, so I have no idea.

  5. Erik Schnetter

    Autoconf has generic macros to test for header files or for functions. Cactus wraps these. CCTK_CHECK_FUNCS(dladdr) may be all you need; look for CCTK_CHECK_FUNCS in configure.in.

  6. Roland Haas

    I added diffs to use autoconf to detect dladdr and cxa_demangle (both of which are GNU extensions and/or glibc-specific; they are present if -D_GNU_SOURCE or -std=gnuXXX is used, but not, e.g., on Kraken when using PGI and only _BSD_SOURCE).
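
For readers unfamiliar with the demangling call being detected here, the following is a minimal sketch of how `abi::__cxa_demangle` from `<cxxabi.h>` is typically used. It assumes a GCC-compatible C++ runtime; the helper name `demangle` is hypothetical and not taken from the patch.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>  // abi::__cxa_demangle (GCC-compatible C++ runtimes)
#include <string>

// Hypothetical helper: return the demangled form of a C++ symbol name, or
// the input unchanged if it cannot be demangled. `mangled` must not be null.
std::string demangle(const char *mangled) {
  int status = 0;
  // With null buffer arguments, the result is allocated with malloc().
  char *result = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || result == nullptr) {
    return mangled;
  }
  std::string out(result);
  std::free(result);  // the caller owns the returned buffer
  return out;
}
```

Because the call allocates memory, using it inside a signal handler is not strictly safe, which is one reason to treat any such code as best-effort.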

  7. Erik Schnetter

    The patch backtrace_amend_v2.diff removes two #include statements that shouldn't be there in the first place... Is that a patch on top of the first patch?

    Anyway, please apply.

  8. Roland Haas

    Yes, backtrace_amend_v2.diff replaces backtrace_amend.diff (I should have simply let trac replace the file). If you mean the cxxabi and dlfcn includes, then those are still present and are actually required (they declare function prototypes and a struct). They are now below cctk.h.

  9. Ian Hinder reporter

    I had to remove some errant square brackets from backtrace_autoconf.diff. When using the Intel compiler, the Dl_info structure is only available if _GNU_SOURCE is defined. This is because it is a GNU extension. In order to get this defined (both in Initialise.cc and during the autoconf test), we could add it to every optionlist that uses the Intel compiler. Or maybe we could add it in known_architectures. Erik?

  10. Ian Hinder reporter

    The autoconf magic also seems to not work quite right. I have tested it using GCC on Mac OS and Intel 11 on Linux. In each case, autoconf reports

    checking for cxxabi.h... yes
    checking for cxa_demangle... yes
    checking for Dl_info.dli_sname... yes
    checking for dladdr... yes

    but cctk_Config.h contains

    #define HAVE_CXA_DEMANGLE 1
    /* #undef HAVE_DLADDR */

    Since HAVE_DLADDR is not defined, the backtrace names are not demangled. Roland has volunteered to try to get the autoconf macros working.

  11. Erik Schnetter

    _GNU_SOURCE has other effects as well, and may create all sorts of problems e.g. with <math.h> or <stdlib.h>. We can try, but it could lead to a rat's tail of problems on the usual weird architectures (AIX, PGI/IBM compilers, non-x86 CPUs, etc.).

    Since this is an architecture specific piece of code, I suggest instead to #define _GNU_SOURCE before including these include files, but in such a way that only this source file is affected. If necessary, we can add this #define also to the autoconf macros.
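
One way this suggestion could look in practice is sketched below. It is only an illustration, not the committed code: it assumes `HAVE_DLADDR` is supplied by the configuration header (or by `-DHAVE_DLADDR`), and the helper name `describe_address` is hypothetical. On some systems the `dladdr` path also needs `-ldl` at link time.

```cpp
// Define _GNU_SOURCE for this translation unit only. It must come before
// any system header so that Dl_info and dladdr() are declared.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1
#endif
#include <dlfcn.h>

#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: return the symbol name containing `addr` if dladdr()
// is available and can resolve it, otherwise the raw address in brackets.
std::string describe_address(void *addr) {
#ifdef HAVE_DLADDR
  Dl_info info;
  if (dladdr(addr, &info) != 0 && info.dli_sname != nullptr) {
    return info.dli_sname;
  }
#endif
  // Fallback when dladdr() is unavailable or the lookup fails.
  char buf[32];
  std::snprintf(buf, sizeof buf, "[%p]", addr);
  return buf;
}
```

Keeping the `#define` at the top of the one source file that needs it leaves every other file compiled with its usual feature macros.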

  12. Ian Hinder reporter

    Backtrace patch committed in changeset:3345:d87fce06a3cd/Carpet.

    Flesh patch committed in SVN revision changeset:"4726/Cactus flesh".
