parallelize Multipole using OpenMP and MPI

Issue #962 closed
Roland Haas created an issue

The attached patch parallelizes the work on each extraction sphere in Multipole using OpenMP and distributes the spheres across processes using MPI.

It contains threadprivate OpenMP pragmas for some memory that is allocated by the Simpson integration rule, which means that more memory is allocated than in the non-OpenMP case. The alternative would be to allocate and free the memory each time the function is called (this would be much cleaner).
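
For illustration only, a minimal sketch of the cleaner alternative, with invented names rather than the patch's actual code: the Simpson helper allocates and frees its scratch buffer per call, so it stays thread-safe inside an OpenMP loop without any threadprivate storage:

    #include <stdlib.h>

    /* Sketch only (names are illustrative, not those of the patch): the
     * Simpson-rule weights are allocated and freed inside the call, so each
     * OpenMP thread gets its own buffer and no "#pragma omp threadprivate"
     * is needed.  Assumes n is odd (an even number of intervals). */
    static double simpson_integrate(const double *f, int n, double h)
    {
      double *w = malloc(n * sizeof *w);   /* per-call scratch: thread-safe */
      w[0] = w[n - 1] = 1.0;
      for (int i = 1; i < n - 1; i++)
        w[i] = (i % 2) ? 4.0 : 2.0;

      double sum = 0.0;
      for (int i = 0; i < n; i++)
        sum += w[i] * f[i];

      free(w);
      return sum * h / 3.0;
    }

    /* Called from an OpenMP-parallel loop over modes, each iteration
     * integrating its own slice of the interpolated data. */
    static void integrate_all_modes(const double *integrand, double *result,
                                    int nmodes, int npoints, double h)
    {
    #pragma omp parallel for
      for (int mode = 0; mode < nmodes; mode++)
        result[mode] = simpson_integrate(integrand + (size_t)mode * npoints,
                                         npoints, h);
    }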

Keyword: Multipole

Comments (7)

  1. Roland Haas reporter

    It contains threadprivate OpenMP pragmas for some memory that is allocated by the Simpson integration rule, which means that more memory is allocated than in the non-OpenMP case. The alternative would be to allocate and free the memory each time the function is called (this would be much cleaner).

    I removed the threadprivate variables and checked that the tests pass before and after the change (and that the HDF5 output does not change either).

  2. Ian Hinder

    Have you found a case where this is necessary? Is Multipole a significant performance problem in your simulations? One of the advantages of Multipole over other implementations is its simplicity. Of course, if it is a bottleneck, then parallelising it is a good solution.

  3. Roland Haas reporter

    I lack data. The single run I just looked at spends about half of ANALYSIS (an NSNS run, so there are only Hydro_Analysis, Multipole and QLM, but not e.g. AHFinderDirect) in Multipole. ANALYSIS was about 10% of the total run time, so I doubt Multipole is significant for production runs, in particular if AEILocalInterp were OpenMP parallelized (which it currently is not). The patch *does* make Multipole much faster in the test case. I had the patches and logic lying around from Ylm_Decomp. I am sure I can find some Swedish person's law to claim that all parts of a code must be parallelized for scalability :-)

    Which do you find more objectionable: the while loop to reconstruct el and em from lm, or the MPI_Gatherv?
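
    For reference, a sketch of such a reconstruction, assuming the modes are flattened as lm = l*(l+1) + m with l counted from 0 and m running from -l to l; the enumeration in the actual patch may differ:

        /* Recover (el, em) from a flattened mode index lm, assuming
         * lm = el*(el+1) + em with el = 0,1,2,... and em = -el..el.
         * Sketch only; the patch's enumeration may differ. */
        static void decode_mode(int lm, int *el, int *em)
        {
          int l = 0;
          /* modes with l' < l occupy indices 0 .. l*l - 1 */
          while (lm >= (l + 1) * (l + 1))
            l++;
          *el = l;
          *em = lm - l * (l + 1);
        }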

  4. Ian Hinder

    I don't find either objectionable; it's just that the more code we add, the more we have to debug and maintain. I considered parallelising Multipole in the past, but rejected it because in all my timing results I never saw Multipole taking a large amount of time. This is also the reason I was not enthusiastic about adding more accurate integration schemes: if I had ever seen Multipole being important from a performance point of view, I would have been more enthusiastic.

    The integration cost is proportional to the number of interpolation points, not the number of processes, and the interpolation is naturally parallel already. So I don't think there should be a scaling issue here. In fact, by parallelising the integration, we introduce additional communication cost, which might lead to scaling problems later on. Is it easy to add a flag (when necessary) to prevent the parallel integration?
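
    For what it's worth, a sketch of what such a flag could look like, assuming a hypothetical Boolean parameter (the name distribute_spheres is invented here, not part of the patch) and a simple round-robin assignment of spheres to processes; the actual patch may organize this differently:

        #include "cctk.h"
        #include "cctk_Parameters.h"

        /* Sketch only: decide which process handles a given extraction
         * sphere.  "distribute_spheres" is a hypothetical param.ccl
         * Boolean; switching it off keeps all spheres on one process,
         * i.e. it disables the parallel integration. */
        static int process_for_sphere(const cGH *cctkGH, int sphere_index)
        {
          DECLARE_CCTK_PARAMETERS;

          if (!distribute_spheres)
            return 0;                                 /* no distribution */

          return sphere_index % CCTK_nProcs(cctkGH);  /* round-robin */
        }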

    I guess it can be applied, since you have done work on it, and this should not be lost or allowed to rot, in case it is useful in the future. Aimless musing: what happens if someone tries to use this with PUGH and doesn't have MPI available?

  5. Roland Haas reporter
    • changed status to resolved

    I am not sure I buy all the arguments. Right now Multipole does one radius after the other, so there are nradii calls to the (global) interpolator. With the patch there are (given that we usually have more processes than extraction radii) two calls: one for the interpolator and one for the MPI_Allgatherv, which really only needs to transport data to the root process. And even the Allgatherv is only needed for HDF5 output, where we use an inherently serial file layout with all data in a single file.

    The patch works fine with PUGH (I just tried replacing Carpet with PUGH in the test_rads test). It also works without MPI: in that case NProcs==1, and the Allgatherv, which is then skipped by the #ifdef, would be a no-op anyway (tested by manually #undef'ing CCTK_MPI). I am not sure it would make sense to run without MPI, since e.g. Carpet requires it.
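
    To make that concrete, a minimal sketch of the communication step (not the patch itself; it assumes a contiguous block assignment of radii to processes and uses invented names): the per-sphere results are combined in a single MPI_Allgatherv, and the whole call is guarded by #ifdef CCTK_MPI so a non-MPI build still compiles and a single-process run has nothing to do:

        #include <stdlib.h>
        #ifdef CCTK_MPI
        #include <mpi.h>
        #endif

        /* Sketch only: gather the per-sphere mode coefficients so every
         * process (in particular the one writing the single HDF5 file)
         * sees all of them.  Assumes process p computed the contiguous
         * block of spheres [p*nspheres/nprocs, (p+1)*nspheres/nprocs);
         * the patch's actual assignment and names may differ. */
        static void gather_coefficients(double *coeffs, int nspheres, int ncoeffs)
        {
        #ifdef CCTK_MPI
          int nprocs;
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

          int *counts = malloc(nprocs * sizeof *counts);
          int *displs = malloc(nprocs * sizeof *displs);
          for (int p = 0; p < nprocs; p++) {
            const int lo = p * nspheres / nprocs;
            const int hi = (p + 1) * nspheres / nprocs;
            counts[p] = (hi - lo) * ncoeffs;
            displs[p] = lo * ncoeffs;
          }

          /* MPI_IN_PLACE: each process already holds its own block in coeffs. */
          MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                         coeffs, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

          free(counts);
          free(displs);
        #else
          /* Without MPI there is only one process, which already has everything. */
          (void)coeffs; (void)nspheres; (void)ncoeffs;
        #endif
        }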

    However, simplicity is certainly something good to have, and premature optimization is to be avoided. I can live with the patch not being applied; I'll retract it.
