Assertion error when using "eval" & new UIUC speedup in TwoPunctures

Issue #1429 closed
Bernard Kelly created an issue

Hi. The new, more efficient, "eval" branch of TwoPunctures is generating a problem in certain cases. I had a job using this code (brought over from the trunk to my ET_2013_05 release), and it failed with an assertion error in TP_utilities.c, within the "d3tensor" allocation routine. I'm attaching a sample parameter file and the associated SCROUT + SCRERR for a small version of this case. It was run on 2 Nehalem nodes (8 cores each; no OpenMP), using an executable compiled with -O3 level optimisation using Intel-2013 compilers and SGI's MPT implementation of MPI.

Here's the actual error message in the SCROUT+ERR file:

TP_utilities.c:146: TP_d3tensor: Assertion `retval[i][nch]-retval[i][ncl] == (nch-ncl)*depth' failed.

I've looked at the d3tensor allocation routine in TP_utilities.c, and it seems to have several problems:

  • it has an actual bug in line 115:

    retval[0][0] = malloc(sizeof(CCTK_REAL)(nrh-nrl+1)(nch-ncl+1)*(nrh-nrl+1));

--- the last factor should be (ndh-ndl+1), ''not'' (nrh-nrl+1)

  • even without that bug, the size allocated is too long by one in each dimension, when called by other TP routines, as, in fact, are all these TP_utilities routines.

  • the way in which memory is allocated seems to assume contiguous memory chunks (I suspect this is the real problem, given the error message).

  • all the routines in TP_utilities.c use "int" and "long" instead of "CCTK_INT"

Keyword: TwoPunctures,
Keyword: malloc

Comments (18)

  1. Bernard Kelly reporter
    • removed comment

    More comments I forgot to include:

    • This was with -O3, as I said. If I use -O0 instead, the problem disappears.

    • My code was entirely the ET_2013_05 release, except the TwoPunctures thorn.

  2. anonymous
    • removed comment

    Suggested patch attached, relative to trunk, rev. 134. This involves a new d3tensor_SIMPLE and free_d3tensor_SIMPLE.

    This is a fairly minimal change. Really, the whole of TP_utilities should be overhauled, IMO.

  3. Erik Schnetter
    • removed comment

    The TwoPunctures thorn uses utility routines from the Numerical Recipes. The Fortran version of the NR uses standard Fortran, but the C version uses Fortran in C disguise -- a questionable choice. To be able to have array indices start with 1, all memory allocation functions can subtract an offset from the pointer, so that array[1] points to the first allocated element. (This is actually illegal in C, but probably works on all relevant systems.)

    Instead of introducing new *_SIMPLE functions, I would rather correct the existing function. Alternatively, the existing function could be removed, and be replaced with the new function.

    It seems the problem is that retval[nrl][0] is never set. The "apply all offsets" part is probably missing a loop.

  4. Roland Haas
    • removed comment

    Sorry for the tardy reply, the bug report was indeed very concise and clear.

    I agree that the memory management in TP_utilities is unnecessarily complex. It tries to be a drop in replacement for the (C-language) numerical recipes routines which allowed to have arrays to start with 1 instead of 0 (or any other offset for that matter). Not sure about the extra entry that they all allocate in each dimenension. It is possible that TP actually does us all elements from 0 to N inclusive (would need to check). Does anyone remember?

    The patch included however allocates the wrong type namely typeof(CCTK_REAL ) rather than typeof(CCTK_REAL) in line 48 of the patch. For typical machines this will go unnoticed since sizeof(CCTK_REAL)==sizof(double)==sizeof(void)==8.

    Otherwise it seems to be ok (baring issues that TP might use element "L" which I don't remember anymore).

    I wonder though if simply changing the original code as outlined in the description "the last factor should be (ndh-ndl+1), not (nrh-nrl+1)" would be sufficient? Mind you that I am the one responsible for the original buggy code should not have that much of an influence :-)

  5. Bernard Kelly reporter
    • removed comment

    Hi Roland. Apologies for the CCTK_REAL/CCTK_REAL* typeof; I'd corrected that myself locally, or so I thought.

    Anyway, I can confirm that fixing the one obvious logic bug in line 115 of the original code does not fix the actual issue; I'd tried that first in my own run. But it should be fixed, anyway, if the Numerical Recipes code is to persist.

    I suspect Erik's observation --- "It seems the problem is that retval[nrl][0] is never set." --- is the real problem in the original code. Nevertheless, I think the Numerical Recipes Fortran-masquerading-as-C code should be junked anyway. I was originally going to submit my simpler d3tensor as a replacement for what's there, but then thought that this would be more minimalist. That still leaves "dvector" etc ...

  6. Roland Haas
    • removed comment

    I unfortunately cannot reproduce the error. TP calls d3tensor with all lower bounds set to 0 so nrl=0 and retval[0][0] holds the return value from malloc.

    I made myself an extra paranoid version of d3tensor and ran a standalone copy (attached) through valgrind. Valgrind finds no errors. Can you perhaps run the standalone version on the cluster where the error occured, Bernard?

    I am also perfectly happy to replace d3tensor by d3tensor_SIMPLE by the way.

  7. Erik Schnetter
    • removed comment

    I attach a patch that shows how I think d3tensor could be corrected. The patch is untested -- it just shows where I think the logic errors are.

  8. Bernard Kelly reporter
    • removed comment

    Hi Erik. I just tried your patch from comment:7, but alas! --- same assertion error as before.

    Roland, I just tried your standalone test code from comment:6. It compiled & ran OK on the master node I compile on, though there were warnings. Here's my compilation line, which matches my standard CC config options for ET:

    icc -std=c99 -U__STRICT_ANSI__ -g -O3 -openmp -o d3tensor_TEST.exe d3tensor.c
    

    The warning messages are as follows (they're short enough not to require an attachment, I think), though they look pretty superficial:

    d3tensor.c(24): warning #181: argument is incompatible with corresponding format string conversion
        printf("d3tensor: %d %d %d\n",nrh,nch,ndh);
                                      ^
    
    d3tensor.c(24): warning #181: argument is incompatible with corresponding format string conversion
        printf("d3tensor: %d %d %d\n",nrh,nch,ndh);
                                          ^
    
    d3tensor.c(24): warning #181: argument is incompatible with corresponding format string conversion
        printf("d3tensor: %d %d %d\n",nrh,nch,ndh);
                                              ^
    
    d3tensor.c(27): warning #556: a value of type "double ***" cannot be assigned to an entity of type "void ***"
        a = retval;
          ^
    
    d3tensor.c(32): warning #556: a value of type "double **" cannot be assigned to an entity of type "void **"
        b = retval[0];
          ^
    
    d3tensor.c(81): warning #181: argument is incompatible with corresponding format string conversion
                   i, i/(nch-ncl+1), i%(nch-ncl+1));
                      ^
    
    d3tensor.c(81): warning #181: argument is incompatible with corresponding format string conversion
                   i, i/(nch-ncl+1), i%(nch-ncl+1));
    
  9. Roland Haas
    • removed comment

    Bernard: sorry I had not used the proper printf format specifiers to deal with the long int arguments. I have updloaded a corrected version that also accesses every single array element and still passes with valgrind on my workstation. The updated file is d3tensor.c.2 .

  10. Bernard Kelly reporter
    • removed comment

    Thanks, Roland. That removed the earlier warnings, but not the two from lines 27 & 32 [now lines 33 & 37] about the mismatch between double *** and void ***.

    But anyway, your standalone code compiles and runs with no sign of the assertion error I encounter when inside the actual TwoPunctures evaluation. I didn't actually mean for you to supply another test source file.

  11. Bernard Kelly reporter
    • removed comment

    Update. I think I've found the actual problem with the original code:

    Under -O3 optimisation (at least on this machine, with the Intel compiler), the pointer assignment loops weren't being completed before the assert statements were executed. I suspect this is because the assert statements don't look like they depend on the result of the assignment?

    I'm attaching a proposed patch to the repository TP_utilities.c that enforces the proper timing of the assert() checks by putting them in an IF statement that tests whether the loop iterator has actually reached the end of the loop. This may not be the best way of doing this, but it works for me, and still passes the test suites.

  12. Erik Schnetter
    • removed comment

    This is very strange. These additional if statements "protecting" the assert statements should not be necessary. This points to a severe problem in your compiler. Can you try with a different (newer) compiler, or with -O2 instead of -O3?

  13. Bernard Kelly reporter
    • removed comment

    Replying to [comment:12 eschnett]:

    This is very strange. These additional if statements "protecting" the assert statements should not be necessary. This points to a severe problem in your compiler. Can you try with a different (newer) compiler, or with -O2 instead of -O3?

    I've now tried it with -O2, and the problem doesn't appear to arise. I've reported it to the system admins to see what they say.

  14. Bernard Kelly reporter
    • removed comment

    So the admins' suggestion was to [drumroll] update the compiler version. The Intel compiler I was using --- v. 2013.1.117 (the latest available when I first hit this issue) --- had some problems with its -O3 "aggressive" optimisation. I went to the now-latest version installed --- v. 2013.5.192 --- and the problem disappeared for -O3.

    So I suppose this ticket should be closed. However, the logic bug in line 115 (noted in the original submission) should be fixed, short-term, and longer-term, I think this whole TP_utilities.c file should be replaced. Thanks for your help.

  15. Ian Hinder
    • removed comment

    I think we also had (or wanted to have) a mechanism for blacklisting certain compiler versions. It sounds like that version should be blacklisted, at least if TwoPuncture is present.

  16. Erik Schnetter
    • removed comment

    At the moment, an #ifdef in TwoPunctures would work fine. Later, we can move this to configure.in.

  17. Roland Haas
    • changed status to resolved
    • removed comment

    Applied as rev 135 of TwoPunctures.

    I've applied the patch to correct the amount of allocated memory "(nrh-nrl+1)" vs. "(ndh-ndl+1))". Bernard, would you mind letting me know what the output of icc --version is, please? The date in the first line is the compiler build date and it is accessible as the preprocessor varialbe __INTEL_COMPILER_BUILD_DATE . We would need this data for both the non-working version and the working one, to output a warning when the version is known to fail.

    The test might miss some if even version earlier than the one you tested fail and be too inclusive if the compiler bug goes away in a version earlier than 2013.5.192.

    I have opened a new ticket #1458 for this.

  18. Log in to comment