Automatically start SystemTopology

Issue #1865 open
David Radice created an issue

Carpet used to load hwloc automatically and that would set thread affinities. Now this functionality is in the SystemTopology thorn, which is not automatically activated. This change could result in a significant performance regression on some systems (see discussion in #1850).

Would it make sense to activate SystemTopology automatically?

Comments (14)

  1. Erik Schnetter
    • removed comment

    Yes it would -- this also used to be the default behaviour.

    We should then also point people to SystemTopology::set_thread_bindings = "no".
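
    A minimal parameter-file excerpt showing how a user would opt out, using only the parameter quoted above (an illustrative sketch, not a complete parameter file):

        # Disable automatic thread pinning
        ActiveThorns = "SystemTopology"
        SystemTopology::set_thread_bindings = "no"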

  2. Ian Hinder
    • removed comment

    Currently, hwloc is OPTIONALly activated by LoopControl and MPI. I was going to say that these should be changed to require SystemTopology instead, so that they don't depend directly on the library used to provide this information, but on the interfaces provided by SystemTopology (this was the reason to split the thorn, right?). SystemTopology then requires hwloc. So both thorns would be activated if present in the thorn list and either LoopControl or MPI were activated.

    However, now I notice that MPI itself requires hwloc; is this correct? So there is a circular dependency. That feels wrong. Erik, could you clarify what should be done here?

    The result of all this is that when someone uses a parameter file which doesn't activate SystemTopology, they don't get their threads pinned.

  3. Frank Löffler
    • removed comment

    I only see MPI optionally depending on hwloc, so that hwloc can be added to the configure step in case the thorn builds MPI itself.

  4. Ian Hinder
    • changed status to open
    • removed comment

    Aside: I think we shouldn't use the word "depends" in this case; OPTIONAL means "use this capability if it is present", so maybe we should say "MPI optionally uses hwloc".

    I don't understand what I wrote above. Let me try again. Looking at the configuration.ccl files mentioning hwloc, we have:

        CactusUtils/SystemTopology/configuration.ccl:REQUIRES hwloc MPI
        Carpet/LoopControl/configuration.ccl:OPTIONAL CycleClock hwloc Vectors
        ExternalLibraries/MPI/configuration.ccl:OPTIONAL hwloc

    SystemTopology requires both hwloc and MPI.

    MPI optionally uses hwloc because the MPI thorn might build the MPI library itself, and that library might make use of hwloc.

    LoopControl optionally uses hwloc. There doesn't seem to be anything in LoopControl that directly references hwloc. Is this because LoopControl uses threads, and it is good to pin those threads?

    Assuming all the above is right, I think the right thing to do is to change the LoopControl OPTIONAL from hwloc to SystemTopology (a sketch of this change follows below). This would have the effect of automatically activating SystemTopology whenever anyone uses the ET thornlist, and of allowing it to manage thread pinning.

    Agreed?
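
    A minimal sketch of what the proposed change could look like, reusing the capability list from the listing above (illustrative only, not a committed change):

        # Carpet/LoopControl/configuration.ccl (proposed)
        # Depend on SystemTopology instead of hwloc directly;
        # SystemTopology in turn REQUIRES hwloc and MPI.
        OPTIONAL CycleClock SystemTopology Vectors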

  5. Erik Schnetter
    • removed comment

    LoopControl optionally uses hwloc because this mechanism ensures that, if LoopControl is activated, hwloc will also be auto-activated if it is present. hwloc then used to provide certain aliased functions that LoopControl calls to determine cache sizes and the thread topology. This functionality has moved to SystemTopology, so the dependency needs to be changed to optionally use SystemTopology instead.

    Requiring SystemTopology from some thorn (be it Carpet, or LoopControl, or the flesh) instead of just optionally using it is also possible. I think it is a good idea, but that's more of a policy change.

    Using SystemTopology is not always a good idea. If you are running on a single workstation, and perhaps running multiple instances of Cactus, then these will get in each other's way. In this case you should disable SystemTopology. Last week I implemented a mechanism for this: the parameter "set_thread_bindings" has a new value "env" that makes it look at an environment variable "CACTUS_SET_THREAD_BINDINGS". If this variable is unset (obviously the default), SystemTopology does nothing. On HPC systems, this variable is set in the respective run scripts, thus enabling SystemTopology. The advantage is that this automatically does the right thing for unsuspecting users running on a workstation; the disadvantage is that unsuspecting HPC users need to use Simfactory or update their submit scripts.
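
    A sketch of how this environment-variable mechanism would be used, based on the description above. The parameter value "env" and the variable name are taken from the comment; the assigned value "1" and the job-script line are assumptions for illustration:

        # Cactus parameter file: defer the thread-binding decision to the environment
        ActiveThorns = "SystemTopology"
        SystemTopology::set_thread_bindings = "env"

    and, in an HPC submit/run script:

        # Hypothetical job-script line: opt in to thread pinning on this system.
        # The comment above only says the variable must be set; the value "1" is an assumption.
        export CACTUS_SET_THREAD_BINDINGS=1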

  6. David Radice reporter
    • removed comment

    I don't use Simfactory, and I can see myself easily forgetting to set that environment variable in my run scripts, so I would prefer a solution where binding is active by default. Either way, I think this should be prominently displayed in the documentation, and/or Cactus should warn about it on stderr or stdout.

  7. Frank Löffler
    • removed comment

    Is SystemTopology able to find out whether the current MPI job (not process) uses less than one whole node? If so, couldn't it set thread bindings whenever the job occupies at least one full node (and thus achieve the best performance in that case), and not set them otherwise (e.g., when running more than one independent Cactus job on a node)?

  8. Erik Schnetter
    • removed comment

    SystemTopology can check, but that's dangerous. What if it decides that you're not running on a "full node" simply because it counts cores differently from what the user expects? If you read the documentation for Blue Waters, you'll see that not even NCSA could agree with itself on how many cores a node has. (Slurm, MPI, and the accounting system use different core counts.) What if you run in a virtual machine or a Docker environment? What if you run on a workstation, and by chance are using all cores, while there is also a second Cactus job running?

    In my view, the default behaviour should do the right thing for the unsuspecting user. Experts can be expected to put in a bit more effort, e.g. changing a line in the scripts they use to submit jobs.

  9. Frank Löffler
    • removed comment

    I agree. Then the question becomes: what is an unsuspecting user most likely to do? I think it is running a single simulation at a time on a machine, which would include any typical simulation setup on clusters. Running multiple jobs at the same time on the same node is not very common, I would think. Or is it? I can only speak for myself.

  10. Erik Schnetter
    • removed comment

    I do this for debugging -- I start a small test on my workstation, and sometimes want to start several test runs simultaneously so that they can finish over lunch.
