move CCTK_MyHost and friends into flesh

Issue #2112 new
Roland Haas created an issue

Currently Carpet provides functions CCTK_MyHost etc. that let a thorn discover which host it is running on and what its "rank" local to the host is (i.e. process N of M processes on a given host).

It would be good to have this information provided by the flesh, similar to CCTK_MyRank(); this will require a new set of overloadable functions (of the same name) for Carpet to hook into.

I am not sure what to do about the currently used aliased function names, which would then cause a conflict.
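
As a rough illustration only, the flesh side could mirror the pattern already used for CCTK_MyProc (whose actual implementation lives in the drivers), with overloadable queries plus registration functions for the drivers. None of the declarations below exist yet; all names are hypothetical:

    /* hypothetical overloadable host queries, with flesh-provided
       single-host defaults; nothing below exists in the flesh today */
    #include "cctk.h"

    int CCTK_MyHost (const cGH *cctkGH);      /* id of the host this process runs on */
    int CCTK_nHosts (const cGH *cctkGH);      /* number of hosts in this run */
    int CCTK_MyHostRank (const cGH *cctkGH);  /* rank of this process on its host */

    /* registration functions a driver such as Carpet would call at startup
       to overload the defaults */
    int CCTK_OverloadMyHost (int (*func)(const cGH *));
    int CCTK_OverloadnHosts (int (*func)(const cGH *));
    int CCTK_OverloadMyHostRank (int (*func)(const cGH *));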

Keyword: Carpet
Keyword: PUGH
Keyword: driver

Comments (21)

  1. Erik Schnetter

    Carpet's implementation is based on comparing host names as strings. This involves significant infrastructure to be able to exchange strings between MPI processes.

    On a Blue Gene, the "host name" includes not only what Unix calls host name, but also the core id on which the MPI process is running. This defeats the purpose, and thus one needs to perform string surgery to extract the actual host name from the topology.

    OpenMPI provides environment variables that contain the respective information. These are much easier to decode than exchanging host name strings. However, e.g. MPICH does not provide such a mechanism.

    Finally, the MPI 3 standard, which is supported by all interesting MPI implementations, but not necessarily by all supercomputers of interest to us, provides a function MPI_Comm_split_type, which accepts an argument MPI_COMM_TYPE_SHARED and then defines new per-node communicators. With a bit of post-processing, one can extract a relevant grouping:

    • each per-node communicator chooses a leader (rank 0 within that communicator)
    • all processes exchange the global ranks of their leaders
    • these ranks can be sorted in ascending order
    • this ranking then defines an integer numbering of the nodes
    • you probably want to collect additional statistics in case the number of processes per node isn't uniform
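
    A minimal sketch of this post-processing (function and variable names are made up here, not Carpet's actual code; error checking omitted):

        #include <mpi.h>
        #include <stdlib.h>

        static int cmp_int(const void *a, const void *b)
        {
          const int x = *(const int *)a, y = *(const int *)b;
          return (x > y) - (x < y);
        }

        /* Determine an integer host id, the rank on the host, and the number
           of hosts, using MPI 3's MPI_Comm_split_type. */
        void get_host_layout(MPI_Comm comm, int *host_id, int *host_rank, int *n_hosts)
        {
          int world_rank, world_size;
          MPI_Comm_rank(comm, &world_rank);
          MPI_Comm_size(comm, &world_size);

          /* per-node communicator */
          MPI_Comm nodecomm;
          MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
          MPI_Comm_rank(nodecomm, host_rank);

          /* each node's leader is rank 0 of its node communicator; every
             process learns its leader's global rank */
          int my_leader = world_rank;
          MPI_Bcast(&my_leader, 1, MPI_INT, 0, nodecomm);

          /* exchange the leader ranks globally and sort them in ascending
             order; the position of a node's leader in the sorted,
             deduplicated list is that node's integer id */
          int *leaders = malloc((size_t)world_size * sizeof *leaders);
          MPI_Allgather(&my_leader, 1, MPI_INT, leaders, 1, MPI_INT, comm);
          qsort(leaders, (size_t)world_size, sizeof *leaders, cmp_int);

          *n_hosts = 0;
          *host_id = -1;
          for (int i = 0; i < world_size; i++) {
            if (i == 0 || leaders[i] != leaders[i-1]) { /* a new distinct leader */
              if (leaders[i] == my_leader)
                *host_id = *n_hosts;
              ++*n_hosts;
            }
          }
          /* additional statistics, e.g. processes per host, could be gathered
             here in the same way if the distribution is not uniform */

          free(leaders);
          MPI_Comm_free(&nodecomm);
        }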

  2. Roland Haas reporter

    I see. The hope is that the complex logic can stay in Carpet (or go into the ExternalLibraries/MPI thorn, since it is not driver specific); at this point I am only advocating making CCTK_MyHostID a Cactus overloadable function rather than the aliased function it is now.

    The flesh could then still stay MPI-free (except for the required MPI_Init() calls) if possible.

    The MPI3 method definitely seems easier. Another way to do this without having to rely on host names and string surgery would be to use POSIX (or SYSV) shared memory sections, which are basically the same as the MPI3 shared communicators. I have not tried this exactly, but I have used shared memory (which is what made me wish for a hostid function) and semaphores:

    1. make a shared memory section of a known name which will automatically be shared by all ranks on a node
    2. use POSIX or SYSV semaphores to have exactly one rank per node write its global MPI rank into the shared memory; this rank will be arbitrary since it depends on which process grabs the semaphore first
    3. one needs a couple of MPI_Barriers to make sure that there is proper serialization in some places

    The method is not too hard (probably easier than string surgery) and works as long as one has POSIX or SYSV IPC capabilities (so hopefully everywhere, in particular since we require POSIX by now).
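
    Concretely, the leader election in steps 1–3 could look roughly like this (a sketch only; names such as "/cactus_hostid" are hypothetical, and the fixed name is exactly the uniqueness problem discussed further down; error checking omitted):

        #include <mpi.h>
        #include <fcntl.h>
        #include <semaphore.h>
        #include <sys/mman.h>
        #include <unistd.h>

        /* Returns the global MPI rank of the (arbitrary) leader on this node. */
        int node_leader_rank(MPI_Comm comm)
        {
          int rank;
          MPI_Comm_rank(comm, &rank);

          /* 1. shared memory section with a known name, automatically shared by
                all ranks on this node; freshly created memory is zero-filled */
          int fd = shm_open("/cactus_hostid", O_CREAT | O_RDWR, 0600);
          ftruncate(fd, sizeof(int));
          int *leader_plus1 = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);

          /* 2. a named semaphore serializes access; the first rank on the node
                to acquire it records its global rank (stored +1 so that 0 means
                "not yet claimed") */
          sem_t *sem = sem_open("/cactus_hostid_sem", O_CREAT, 0600, 1);
          sem_wait(sem);
          if (*leader_plus1 == 0)
            *leader_plus1 = rank + 1;
          const int leader = *leader_plus1 - 1;
          sem_post(sem);

          /* 3. barrier before cleanup so that no rank unlinks the objects while
                others are still using them */
          MPI_Barrier(comm);
          munmap(leader_plus1, sizeof(int));
          close(fd);
          sem_close(sem);
          shm_unlink("/cactus_hostid");
          sem_unlink("/cactus_hostid_sem");

          return leader;
        }

    A host id can then be derived from the leaders' global ranks in the same way as in the MPI_Comm_split_type post-processing above.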

  3. Ian Hinder

    How about an all-to-all of the MAC address of the node? We'd have to figure out which interface to use, but this would still uniquely identify the node. It wouldn't give you a rank on the node though.

  4. Roland Haas reporter

    I attach sample code using mmap. Note that it is not bulletproof. The remaining bug is that if more than one independent Cactus run is on the same node (e.g. on your workstation while running test suites), then they all compete for the same shared memory section and only one of them will get it.

    To avoid it one needs some name for the shared memory section that is unique to the simulation. This is quite hard to do and I am somewhat sure that it is not always possible. One cannot, for example, choose the name of the executable or the name of the parfile, since neither needs to be unique. So it is the opposite of the problem Erik mentioned on a BG, since the problem with the mmap approach is to limit the mapping to just the processes in the current run.

  5. Frank Löffler

    Replying to [comment:5 rhaas]:

    To avoid it one needs some name for the shared memory section that is unique to the simulation. This is quite hard to do and I am somewhat sure that it is not always possible.

    If that id only has to be unique on a given node (which I assume since this is for a shared memory block), the process id should provide that.

  6. Roland Haas reporter

    @ knarf: the problem with that is that all ranks on the node need to agree on the same process ID. I also cannot use the process ID of rank 0 and bcast that one, because (playing devil's advocate) there could be two nodes shared between two simulations, with one simulation having rank 0 on node A and the other having rank 0 on node B, and those two processes could just happen to have the same process ID.

    @ eschett: that will almost always work :-)

    A simpler approach, without the need for any mmap, would be basically what you are already doing with the processor names, but using the gethostid() result (a 32-bit number, so much easier to work with). This would avoid having to pass strings around through MPI.
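
    For example (a sketch with made-up names, and subject to the gethostid portability caveat in the next comment):

        #include <mpi.h>
        #include <stdlib.h>
        #include <unistd.h>

        /* Gather the numeric host ids from all ranks and group ranks whose
           values match; the lowest global rank sharing a hostid acts as that
           host's leader. */
        void host_info_via_gethostid(MPI_Comm comm, int *node_leader, int *host_rank)
        {
          int rank, size;
          MPI_Comm_rank(comm, &rank);
          MPI_Comm_size(comm, &size);

          long myid = gethostid();   /* 32-bit numeric host identifier */
          long *ids = malloc((size_t)size * sizeof *ids);
          MPI_Allgather(&myid, 1, MPI_LONG, ids, 1, MPI_LONG, comm);

          *node_leader = -1;
          *host_rank = 0;
          for (int r = 0; r < size; r++) {
            if (ids[r] == myid) {
              if (*node_leader < 0) *node_leader = r;  /* first rank on this host */
              if (r < rank) ++*host_rank;              /* my position on this host */
            }
          }
          free(ids);
          /* an integer host id then follows from the leaders' ranks, exactly as
             in the MPI_Comm_split_type post-processing sketched in comment 1 */
        }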

  7. Erik Schnetter

    gethostid seems to always return 0 on macOS. The respective sysctl call does the same thing.

  8. Roland Haas reporter

    Ok, so the way to go seems to be to combine these things (or bite the bullet and send strings, hoping that there is no cluster out there that uses "localhost.localdomain" for all of its compute nodes).

    So I'd use a combination of

    1. a compile time chosen, fixed string
    2. the current time from gettimeofday() or time() on rank 0
    3. the process ID of rank 0
    4. a random number as returned by random() on rank 0 (no use of /dev/random, which is not in POSIX, and I have seen chroot jails where it was not provided)
    5. the name of the parfile (per rank, they should be identical among all ranks on a host)
    6. the name of the executable (per rank, they should be identical among all ranks on a host)

    to create a hopefully unique name for the shared memory section on each node.
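
    A sketch of how rank 0 might assemble items 1–4 and broadcast the result (the per-rank items 5 and 6 would be appended locally by each rank); all names here are made up:

        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <unistd.h>

        /* Build a hopefully unique POSIX shared-memory name on rank 0 and
           broadcast it to all ranks. */
        void make_shm_name(MPI_Comm comm, char name[64])
        {
          int rank;
          MPI_Comm_rank(comm, &rank);
          if (rank == 0) {
            srandom((unsigned)time(NULL) ^ (unsigned)getpid());
            snprintf(name, 64, "/cactus-%ld-%ld-%ld", /* 1. fixed prefix       */
                     (long)time(NULL),                /* 2. current time       */
                     (long)getpid(),                  /* 3. pid of rank 0      */
                     random());                       /* 4. random() on rank 0 */
          }
          MPI_Bcast(name, 64, MPI_CHAR, 0, comm);
        }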

    Before I spend any actual time on this: would this be something that you would find useful? I would also really like to keep this simple, since a GetHostID functionality is useful but not essential.

  9. Roland Haas reporter

    Since the actual code would stay in the driver (similar to the actual code for CCTK_MyProc being in the drivers), this would all be changes to Carpet/PUGH/MPI.

  10. Erik Schnetter

    If you want a unique ID for a run, ask Formaline: UniqueSimulationID().

    And before we over-design something: Let's use MPI_Comm_split_type and see how that pans out.

  11. Roland Haas reporter

    I do not really want to make the drivers depend on Formaline; otherwise, yes, Formaline would give me a nice, hopefully unique ID for the run.

    I am absolutely behind not over-designing this. MPI_Comm_split_type is the best solution if available. The downside seems to be, as you note, that MPI3 is not supported by all supercomputers of interest to us :-(

    Most of the logic should be re-usable by all methods though since they all boil down to:

    • mark one rank per host as leader
    • choose a host-id based on the leader's rank relative to the other leaders
    • have the leaders pass the host ID to the other ranks on the host

  12. anonymous

    Why not use the combination of gettimeofday(), the process id on rank 0, and the IP address on rank 0?

  13. Roland Haas reporter

    Getting lost even more in the weeds.

    Actually, MPI_Get_processor_name() on rank 0, together with the pid on rank 0, should be enough to give a unique name for the shared memory section. MPI_Get_processor_name() identifies the host (and if it is overly specific, as on a BG, I don't mind, since I just need a unique name that identifies the machine), and the pid is unique on that machine. All of this only as a fallback if the MPI3 approach is not available.

    Since the result of MPI_Get_processor_name has a fixed maximum size, Bcasting it (+ pid) is much simpler than concatenating all gethostname results.
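
    In code, this fallback could be as simple as (made-up names; the fixed-size buffer keeps the Bcast trivial):

        #include <mpi.h>
        #include <stdio.h>
        #include <unistd.h>

        /* Build a per-run shared-memory name from rank 0's processor name and
           pid, and broadcast it in a fixed-size buffer. */
        void shm_name_fallback(MPI_Comm comm, char name[MPI_MAX_PROCESSOR_NAME + 32])
        {
          int rank;
          MPI_Comm_rank(comm, &rank);
          if (rank == 0) {
            char procname[MPI_MAX_PROCESSOR_NAME];
            int len;
            MPI_Get_processor_name(procname, &len);
            /* the processor name identifies the machine rank 0 runs on, and the
               pid is unique on that machine */
            snprintf(name, MPI_MAX_PROCESSOR_NAME + 32, "/cactus-%s-%ld",
                     procname, (long)getpid());
          }
          MPI_Bcast(name, MPI_MAX_PROCESSOR_NAME + 32, MPI_CHAR, 0, comm);
        }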

  14. Frank Löffler

    First of all: I agree with the later posters that something simpler would be nice. Just for reference though, these two aren't good:

    Replying to [comment:11 rhaas]:

    5. the name of the parfile (per rank, they should be identical among all ranks on a host)
    6. the name of the executable (per rank, they should be identical among all ranks on a host)

    At least 6. can be different (think running one simulation on a heterogeneous cluster), and likely 5. can change due to this as well. They are usually the same, but I wouldn't introduce this as a hard requirement for this issue.

  15. Roland Haas reporter

    Replying to [comment:17 knarf]:

    At least 6. can be different (think running one simulation on a heterogeneous cluster), and likely 5. can change due to this as well. They are usually the same, but I wouldn't introduce this as a hard requirement for this issue.

    Different on a single host (that is all that is needed)?

  16. Frank Löffler

    Replying to [comment:18 rhaas]:

    Replying to [comment:17 knarf]:

    At least 6. can be different (think running one simulation on a heterogeneous cluster), and likely 5. can change due to this as well. They are usually the same, but I wouldn't introduce this as a hard requirement for this issue.

    Different on a single host (that is all that is needed)?

    Even that. Imagine one binary for the main CPUs, but another binary for some accelerator on that same host; within the same simulation.

  17. Roland Haas reporter

    Even that. Imagine one binary for the main CPUs, but another binary for some accelerator on that same host; within the same simulation.

    We do technically support such a usage, yes, so indeed parfile names and executable names are out.
