
Opened 12 months ago

Last modified 11 months ago

#2112 new enhancement

move CCTK_MyHost and friends into flesh

Reported by: Roland Haas
Owned by:
Priority: minor
Milestone:
Component: Cactus
Version: development version
Keywords: Carpet PUGH driver
Cc:

Description

Currently Carpet provides functions CCTK_MyHost etc. that let a thorn discover which host it is running on and what its "rank" local to the host is (i.e., process N of M processes on a given host).

It would be good to have this information provided by the flesh, similar to CCTK_MyRank(), which would require a new set of overloadable functions (of the same name) for Carpet to hook into.

It is not clear what to do about the currently used aliased function names, which would then cause a conflict.

Attachments (1)

hostid.c (1.7 KB) - added by Roland Haas 12 months ago.


Change History (22)

comment:1 Changed 12 months ago by Erik Schnetter

Carpet's implementation is based on comparing host names as strings. This involves significant infrastructure to be able to exchange strings between MPI processes.

On a Blue Gene, the "host name" includes not only what Unix calls host name, but also the core id on which the MPI process is running. This defeats the purpose, and thus one needs to perform string surgery to extract the actual host name from the topology.

OpenMPI provides environment variables that contain the respective information. These are much easier to decode than exchanging host name strings. However, e.g. MPICH does not provide such a mechanism.

Finally, the MPI 3 standard, which is supported by all interesting MPI implementations but not necessarily by all supercomputers of interest to us, provides the function MPI_Comm_split_type. When called with the argument MPI_COMM_TYPE_SHARED, it defines new per-node communicators. With a bit of post-processing, one can extract a relevant grouping (a sketch follows the list below):

  • each per-node communicator chooses a leader (rank 0 within that communicator)
  • all processes exchange the global rank of their leaders
  • these ranks can be sorted in ascending order
  • this ranking then defines an integer numbering of the nodes
  • you probably want to collect additional statistics in case the number of processes per node isn't uniform
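A minimal sketch of this recipe, assuming MPI 3 is available (the helper name get_host_info and its output convention are made up for illustration; this is not Carpet's actual code):

  #include <mpi.h>
  #include <stdlib.h>

  /* Illustrative helper: returns a node index in *hostid and the rank local
     to the node in *hostrank, following the steps listed above. */
  static void get_host_info(MPI_Comm comm, int *hostid, int *hostrank)
  {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* split into per-node (shared memory) communicators */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, hostrank);

    /* the node's leader is rank 0 of nodecomm; with key 0 this is the smallest
       global rank on the node; distribute its global rank within the node */
    int leader = rank;
    MPI_Bcast(&leader, 1, MPI_INT, 0, nodecomm);

    /* exchange the leaders' global ranks and number the nodes by the
       ascending order of those ranks */
    int *leaders = malloc(size * sizeof *leaders);
    MPI_Allgather(&leader, 1, MPI_INT, leaders, 1, MPI_INT, comm);
    *hostid = 0;
    for (int i = 0; i < size; i++)
      if (leaders[i] == i && i < leader)  /* rank i is a leader iff leaders[i] == i */
        ++*hostid;

    free(leaders);
    MPI_Comm_free(&nodecomm);
  }

Additional statistics (e.g. the number of processes on each node) could be collected with one more Allgather over the nodecomm sizes.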

comment:2 Changed 12 months ago by Roland Haas

I see. The hope is that the complex logic can stay in Carpet (or go into the ExternalLibraries/MPI thorn, since it is not driver specific); at this point I am only advocating making CCTK_MyHostID a Cactus overloadable function rather than the aliased function it is now.

The flesh could then still stay MPI-free (except for the required MPI_Init() calls) if possible.

The MPI3 method definitely seems easier. Another way to do this without having to rely on host names and string surgery would be to use POSIX (or SYSV) shared memory sections, which are basically the same as the MPI3 shared communicators. I have not tried this exactly, but have used shared memory (which is what made me wish for a hostid function) and semaphores.

  1. make a shared memory section with a known name, which will automatically be shared by all ranks on a node
  2. use POSIX or SYSV semaphores to have exactly one rank per node write its global MPI rank into the shared memory; this rank will be arbitrary since it depends on which process first grabs the semaphore
  3. a couple of MPI_Barriers are needed to make sure that there is proper serialization in some places

The method is not too hard (probably easier than string surgery) and works as long as one has POSIX or SYSV IPC capabilities (so hopefully everywhere, in particular since we require POSIX by now). A sketch of the shared-memory step follows below.
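A hedged sketch of this idea (this is not the attached hostid.c; for brevity it picks the writing rank with shm_open's O_CREAT|O_EXCL flags instead of a semaphore, and the section name "/cactus-hostid" is a placeholder that has the uniqueness problem discussed in the following comments):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <mpi.h>

  /* Illustrative helper: returns the global rank of this node's "leader",
     i.e. the rank that won the race for the shared memory section. */
  static int node_leader_rank(MPI_Comm comm)
  {
    int rank;
    MPI_Comm_rank(comm, &rank);

    const char *name = "/cactus-hostid";  /* placeholder; not unique per run */

    /* steps 1 and 2: whichever rank on a node creates the section
       exclusively gets to write its global MPI rank into it */
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    const int am_writer = (fd >= 0);
    if (am_writer) {
      ftruncate(fd, sizeof(int));
      int *p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      *p = rank;
      munmap(p, sizeof(int));
    }

    MPI_Barrier(comm);                    /* step 3: the writers must be done */

    if (!am_writer)
      fd = shm_open(name, O_RDONLY, 0600);
    int *p = mmap(NULL, sizeof(int), PROT_READ, MAP_SHARED, fd, 0);
    const int leader = *p;
    munmap(p, sizeof(int));
    close(fd);

    MPI_Barrier(comm);                    /* everyone has read before cleanup */
    if (am_writer)
      shm_unlink(name);
    return leader;
  }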

Last edited 12 months ago by Roland Haas

comment:3 Changed 12 months ago by Ian Hinder

How about an all-to-all of the MAC address of the node? We'd have to figure out which interface to use, but this would still uniquely identify the node. It wouldn't give you a rank on the node though.

comment:4 Changed 12 months ago by Erik Schnetter

We can also call gethostname.

comment:5 Changed 12 months ago by Roland Haas

I attach sample code using mmap. Note that it is not bulletproof: if there is more than one independent Cactus run on the same node (e.g. on your workstation while running testsuites), then they all compete for the same shared memory section and only one of them will get it.

To avoid this one needs a name for the shared memory section that is unique to the simulation. This is quite hard to do, and I am fairly sure that it is not always possible. One cannot, for example, choose the name of the executable or the name of the parfile, since neither needs to be unique. So it is the opposite of the problem Erik mentioned on a BG: with the mmap approach the difficulty is limiting the mapping to just the processes in the current run.

Changed 12 months ago by Roland Haas

Attachment: hostid.c added

comment:6 Changed 12 months ago by Roland Haas

There is also gethostid, which returns a 32-bit integer: http://man7.org/linux/man-pages/man3/gethostid.3.html

comment:7 in reply to:  5 Changed 12 months ago by Frank Löffler

Replying to rhaas:

To avoid it one needs some name for the shared memory section that is unique to the simulation. This is quite hard to do and I am somewhat sure that it is not always possible.

If that id only has to be unique on a given node (which I assume since this is for a shared memory block), the process id should provide that.

comment:8 Changed 12 months ago by Erik Schnetter

You can draw a unique random number on node 0 and then broadcast it.

comment:9 Changed 12 months ago by Roland Haas

@knarf: the problem with that is that all ranks on the node need to agree on the same process ID. I also cannot use the process ID of rank 0 and Bcast that, because (playing devil's advocate) there could be two nodes shared among two simulations, with one simulation having rank 0 on node A and the other having rank 0 on node B, and those two just happen to have the same process ID.

@eschnett: that will almost always work :-)

A simpler approach, without the need for any mmap, would be basically what you are already doing with the processor names, but using the gethostid() result (a 32-bit integer, so much easier to work with). This would avoid having to pass strings around through MPI.
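A hedged sketch of this variant (an illustrative helper, not existing Cactus code; keep in mind the macOS caveat in the next comment):

  #include <mpi.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Illustrative helper: *hostid is the smallest global rank that reported
     the same gethostid() value (a unique per-host number), *hostrank is the
     rank local to the host. */
  static void host_info_from_gethostid(MPI_Comm comm, int *hostid, int *hostrank)
  {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    long myid = gethostid();              /* numeric host identifier */
    long *ids = malloc(size * sizeof *ids);
    MPI_Allgather(&myid, 1, MPI_LONG, ids, 1, MPI_LONG, comm);

    *hostid = rank;
    *hostrank = 0;
    for (int i = 0; i < rank; i++) {
      if (ids[i] == myid) {
        if (i < *hostid)
          *hostid = i;                    /* lowest rank on this host acts as leader */
        ++*hostrank;
      }
    }
    free(ids);
  }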

comment:10 Changed 12 months ago by Erik Schnetter

gethostid seems to always return 0 on macOS. The respective sysctl call does the same thing.

comment:11 Changed 12 months ago by Roland Haas

OK, so the way to go seems to be to combine these things (or bite the bullet and send strings, hoping that there is no cluster out there that uses "localhost.localdomain" for all of its compute nodes).

So I'd use a combination of

  1. a compile time chosen, fixed string
  2. the current time from gettimeofday() or time() on rank 0
  3. the process ID of rank 0
  4. a random number as returned by random() on rank 0 (no use of /dev/random, which is not in POSIX, and I have seen chroot jails where it was not provided)
  5. the name of the parfile (per rank, they should be identical among all ranks on a host)
  6. the name of the executable (per rank, they should be identical among all ranks on a host)

to create a hopefully unique name for the shared memory section on each node.
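A hedged sketch of how such a name could be assembled (the tag string and the helper name are made up for illustration; items 5 and 6 are left out since they are dropped later in the discussion):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <unistd.h>

  #define CACTUS_SHM_TAG "cactus-hostid"   /* item 1: compile time chosen, fixed string */

  /* Illustrative helper: all ranks end up with the same, hopefully unique,
     POSIX shared memory name.  Items 5 and 6 (parfile and executable names)
     are omitted; see comments 17-20 below. */
  static void unique_shm_name(MPI_Comm comm, char *name, size_t len)
  {
    int rank;
    MPI_Comm_rank(comm, &rank);

    long vals[3] = {0, 0, 0};
    if (rank == 0) {
      vals[0] = (long)time(NULL);          /* item 2: current time on rank 0 */
      vals[1] = (long)getpid();            /* item 3: process ID of rank 0 */
      vals[2] = random();                  /* item 4: random number on rank 0 */
    }
    MPI_Bcast(vals, 3, MPI_LONG, 0, comm); /* all ranks must agree on the name */

    snprintf(name, len, "/%s-%ld-%ld-%ld", CACTUS_SHM_TAG, vals[0], vals[1], vals[2]);
  }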

Before I spend any actual time on this, would this actually be something that you would find useful? I would also really like to keep this simple since a GetHostID functionality is useful but not essential.

comment:12 Changed 12 months ago by Roland Haas

Since the actual code would stay in the driver (similar to the actual code for CCTK_MyProc being in the drivers), this would all be changes to Carpet/PUGH/MPI.

comment:13 Changed 12 months ago by Erik Schnetter

If you want a unique ID for a run, ask Formaline: UniqueSimulationID().

And before we over-design something: Let's use MPI_Comm_split_type and see how that pans out.

comment:14 Changed 12 months ago by Roland Haas

I do not really want to make the drivers depend on Formaline; otherwise, yes, Formaline would give me a nice, hopefully unique ID for the run.

I am absolutely behind not over-designing this. MPI_Comm_split_type is the best solution if available. The downside to it seems to be your comment that MPI3 is not supported by all supercomputers of interest to us :-(

Most of the logic should be re-usable by all methods though since they all boil down to:

  • mark one rank per host as leader
  • choose a host-id based on the leader's rank relative to the other leaders
  • have the leaders pass the host ID to the other ranks on the host

comment:15 Changed 12 months ago by anonymous

Why not use the combination of gettimeofday(), proc id on rank 0, ip address on rank 0?

comment:16 Changed 12 months ago by Roland Haas

Getting lost even more in the weeds.

Actually, MPI_Get_processor_name() on rank 0 together with the pid on rank 0 should be enough to give a unique name for the shared memory section. MPI_Get_processor_name() identifies the host (and if it is overly specific, as on BG, I don't mind, since I just need a unique name that identifies the machine), plus the pid, which is unique on the machine. All of this only as a fallback if the MPI3 functionality is not available.

Since MPI_Get_processor_name's result has a fixed maximum size (MPI_MAX_PROCESSOR_NAME), Bcasting it (+ pid) is much simpler than concatenating all gethostname results.
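A hedged sketch of this fallback (an illustrative helper; the "/cactus-" name prefix is an assumption):

  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Illustrative helper: Bcast rank 0's processor name and pid (both of fixed
     maximum size) and build the shared memory section name from them. */
  static void fallback_shm_name(MPI_Comm comm, char *name, size_t len)
  {
    char procname[MPI_MAX_PROCESSOR_NAME] = "";
    long pid = 0;
    int rank, namelen;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
      MPI_Get_processor_name(procname, &namelen);
      pid = (long)getpid();
    }
    MPI_Bcast(procname, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, comm);
    MPI_Bcast(&pid, 1, MPI_LONG, 0, comm);

    snprintf(name, len, "/cactus-%s-%ld", procname, pid);
  }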

comment:17 in reply to:  11 ; Changed 12 months ago by Frank Löffler

First of all: I agree with the later posters that something simpler would be nice. Just for reference though, these two aren't good:

Replying to rhaas:

  5. the name of the parfile (per rank, they should be identical among all ranks on a host)
  6. the name of the executable (per rank, they should be identical among all ranks on a host)

At least 6. can be different (think running one simulation on a heterogeneous cluster), and likely 5. can change due to this as well. They are usually the same, but I wouldn't introduce this as a hard requirement for this issue.

comment:18 in reply to:  17 ; Changed 12 months ago by Roland Haas

Replying to knarf:

At least 6. can be different (think running one simulation on a heterogeneous cluster), and likely 5. can change due to this as well. They are usually the same, but I wouldn't introduce this as a hard requirement for this issue.

Different on a single host (that is all that is needed)?

comment:19 in reply to:  18 ; Changed 12 months ago by Frank Löffler

Replying to rhaas:

Replying to knarf:

At least 6. can be different (think running one simulation on a heterogeneous cluster), and likely 5. can change due to this as well. They are usually the same, but I wouldn't introduce this as a hard requirement for this issue.

Different on a single host (that is all that is needed)?

Even that. Imagine one binary for the main CPUs, but another binary for some accelerator on that same host; within the same simulation.

comment:20 in reply to:  19 Changed 12 months ago by Roland Haas

Even that. Imagine one binary for the main CPUs, but another binary for some accelerator on that same host; within the same simulation.

We do technically support such usage, yes, so indeed parfile names and executable names are out.

comment:21 Changed 11 months ago by Steven R. Brandt

How shall we resolve this ticket?
