Enable Vectorisation in McLachlan

Issue #516 closed
Barry Wardell created an issue

Kranc-generated thorns now support optimisation through vectorisation. They just need the option UseVectors -> True to be set when creating the thorn. The attached patches enable this for the BSSN thorns in McLachlan.

I have tested that this gives a significant performance increase (close to 2x in the right-hand sides in the cases I tried). The results also agree with those of a BBH simulation run with vectorisation disabled, to within what would be expected from roundoff differences. Additionally, the test suites still pass with this patch applied.
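As a rough illustration of what "agreement to within roundoff" can mean in practice, the following sketch compares two arrays of grid values by their largest relative difference. The function name, the sample data and any tolerance chosen by the caller are hypothetical; this is not the comparison actually used for the BBH runs or by the test suites.

```c
#include <stddef.h>

/* Absolute value without pulling in libm; hypothetical helper. */
static double demo_abs(double v)
{
  return v < 0.0 ? -v : v;
}

/* Largest relative difference between a[i] and b[i] over n entries.
   Entries where both values are exactly zero contribute nothing.
   Hypothetical helper, not part of McLachlan or Kranc. */
double demo_max_rel_diff(const double *a, const double *b, size_t n)
{
  double worst = 0.0;
  for (size_t i = 0; i < n; ++i) {
    double abs_a = demo_abs(a[i]);
    double abs_b = demo_abs(b[i]);
    double scale = abs_a > abs_b ? abs_a : abs_b;
    if (scale > 0.0) {
      double rel = demo_abs(a[i] - b[i]) / scale;
      if (rel > worst)
        worst = rel;
    }
  }
  return worst;
}
```

A caller would compare the returned value against a tolerance suited to the length of the run, since roundoff differences can grow as a simulation evolves.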

Comments (12)

  1. Barry Wardell reporter

    I have a second patch which regenerates the code with vectorisation enabled but it is larger than the maximum attachment size allowed by trac so I can't attach it. It is just the result of running make in the 'm' directory.

  2. Erik Schnetter

    (We don't need explicit patches for autogenerating code; e.g. regenerating configure should not be in a patch either.)

    Instead of testing various machines I would test several architectures. In particular we should test:

    - SSE 4.1 (modern Intel)
    - SSE 4a (modern AMD)
    - SSE 2 (old Intel or AMD)
    - [VSX (Power 7)]
    - [Double Hummer (Blue Gene/P)]

    I'm not sure about the last two architectures. Without Blue Waters, Power 7 has become much less interesting, although we still have access to such a machine at LSU. We don't use BG/P in production, and probably won't because the architecture is dated; BG/Q will be interesting.

    Having said this, testing on Datura, Damiana, and Kraken (with the Intel compiler) should do the trick. We may want to throw in a system with the PGI compiler as well since that compiler needs some special casing in some regions of the code.

  3. Barry Wardell reporter

    I agree that this should certainly be well tested before being applied.

    Replying to [comment:3 eschnett]:

    > Instead of testing various machines I would test several architectures. In particular we should test:
    >
    > - SSE 4.1 (modern Intel)
    > - SSE 4a (modern AMD)
    > - SSE 2 (old Intel or AMD)
    > - [VSX (Power 7)]
    > - [Double Hummer (Blue Gene/P)]

    I have verified that the tests pass on SSE 4.1, SSE 4a and SSE 2 machines with vectorisation enabled immediately after commit 4c04a8bc35cf7706e144fe771ba5d6c907f5a455 which was just before the recent schedule changes.

    > I'm not sure about the last two architectures. Without Blue Waters, Power 7 has become much less interesting, although we still have access to such a machine at LSU. We don't use BG/P in production, and probably won't because the architecture is dated; BG/Q will be interesting.

    Unfortunately, I don't have access to any machine with these architectures.

    > Having said this, testing on Datura, Damiana, and Kraken (with the Intel compiler) should do the trick. We may want to throw in a system with the PGI compiler as well since that compiler needs some special casing in some regions of the code.

    I have verified that the McLachlan tests pass with vectorisation enabled on these three machines with the Intel compiler. I haven't yet tried with the PGI compiler.

  4. Erik Schnetter

    The only machine where we use the PGI compiler by default is Hopper at NERSC, and even there the Intel compiler is now available.

  5. Erik Schnetter

    What I wanted to say is that I think we are ready to apply this patch. Do we agree?

  6. Barry Wardell reporter

    Replying to [comment:4 barry.wardell]:

    > I have verified that the tests pass on SSE 4.1, SSE 4a and SSE 2 machines with vectorisation enabled immediately after commit 4c04a8bc35cf7706e144fe771ba5d6c907f5a455 which was just before the recent schedule changes.

    Correction: I have verified this only for SSE4.1 and SSE2. Although I tested on Kraken, which has CPUs supporting SSE4a, it turns out that the Vectors thorn only used SSE2. I guess the SimFactory optionlist for Kraken does not enable SSE4a? Is there a machine which does compile using SSE4a? Can Kraken be modified to do so?

  7. Erik Schnetter

    How do you know that SSE 4a was not used? This is autodetected in vectors-8-SSE2.h. It may be that this autodetection is faulty, of course, if e.g. the Intel and GNU compilers use different conventions here.
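For illustration, compile-time detection of this kind usually keys off macros that the compiler predefines. The sketch below is not the contents of vectors-8-SSE2.h; the DEMO_* names and the helper function are hypothetical. It relies only on the fact that GCC and Clang predefine `__SSE2__`, `__SSE4_1__` and `__SSE4A__` when the matching instruction set is enabled; whether another compiler follows the same convention is an assumption that would need checking.

```c
#include <stdio.h>

/* Illustrative flags only; the DEMO_* names are hypothetical and are not
   taken from the Vectors thorn. GCC and Clang predefine __SSE2__,
   __SSE4_1__ and __SSE4A__ when the matching instruction set is enabled
   (for example with -msse4.1, -msse4a or -march=native). */
#if defined(__SSE2__)
#  define DEMO_HAVE_SSE2 1
#else
#  define DEMO_HAVE_SSE2 0
#endif

#if defined(__SSE4_1__)
#  define DEMO_HAVE_SSE4_1 1
#else
#  define DEMO_HAVE_SSE4_1 0
#endif

#if defined(__SSE4A__)
#  define DEMO_HAVE_SSE4A 1
#else
#  define DEMO_HAVE_SSE4A 0
#endif

/* Write a one-line summary of the detected levels into buf.
   Returns what snprintf returns: the number of characters needed. */
int demo_describe_sse(char *buf, size_t len)
{
  return snprintf(buf, len, "SSE2=%d SSE4.1=%d SSE4a=%d",
                  DEMO_HAVE_SSE2, DEMO_HAVE_SSE4_1, DEMO_HAVE_SSE4A);
}
```

Printing this summary from a build made with the same compiler and flags as the simulation would show which branches were selected at compile time.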

    Most of the vector instructions that we are using are defined in SSE 2. SSE 4.1 defines an instruction that allows a more efficient IfThen implementation, and SSE 4a provides a more efficient implementation of a streaming partial store. Since you probably don't use streaming stores (Ian found them slower), it should make no difference whether SSE 4a is present or not.
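To make the IfThen point concrete, here is a minimal sketch of a per-lane select on two doubles. With SSE 4.1 it uses the single blend intrinsic `_mm_blendv_pd`; with only SSE 2 it falls back to an and/andnot/or sequence. The function names are hypothetical and this is not the Vectors thorn's actual implementation; a scalar path is included so the sketch also builds where `__SSE2__` is not defined.

```c
#if defined(__SSE2__)
#  include <emmintrin.h>    /* SSE2 intrinsics */
#  if defined(__SSE4_1__)
#    include <smmintrin.h>  /* SSE4.1: _mm_blendv_pd */
#  endif

/* Per-lane select: take x where the mask lane is all ones, else y.
   Hypothetical helper for illustration only. */
static inline __m128d demo_ifthen(__m128d mask, __m128d x, __m128d y)
{
#  if defined(__SSE4_1__)
  /* One instruction: picks x in lanes where the mask sign bit is set. */
  return _mm_blendv_pd(y, x, mask);
#  else
  /* SSE2 fallback: (mask & x) | (~mask & y). */
  return _mm_or_pd(_mm_and_pd(mask, x), _mm_andnot_pd(mask, y));
#  endif
}
#endif

/* out[i] = (cond[i] < 0) ? x[i] : y[i] for i = 0, 1. */
void demo_select2(const double cond[2], const double x[2],
                  const double y[2], double out[2])
{
#if defined(__SSE2__)
  /* Comparison yields all-ones lanes where cond < 0, zero lanes elsewhere. */
  __m128d mask = _mm_cmplt_pd(_mm_loadu_pd(cond), _mm_setzero_pd());
  _mm_storeu_pd(out, demo_ifthen(mask, _mm_loadu_pd(x), _mm_loadu_pd(y)));
#else
  /* Scalar path for builds without SSE2. */
  for (int i = 0; i < 2; ++i)
    out[i] = cond[i] < 0.0 ? x[i] : y[i];
#endif
}
```

Both branches return the same values; the only difference is the number of instructions needed for the select.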

  8. Erik Schnetter

    I checked on Kraken, an AMD machine with the Intel compiler. It seems the Intel compiler does not support AMD's SSE extensions, i.e. does not support SSE 4a. (The corresponding include file ammintrin.h is not present.) Therefore, and for the reasons I gave above, let's ignore SSE 4a, and just apply the patch.
