Make TerminationTrigger listen to signals

Issue #1999 closed
Roland Haas created an issue

This adds the ability to listen to signals, such as the ones that some queuing systems (SLURM for example, Torque/Moab in theory) can send before terminating a job. These signals can be used to implement interruptible jobs.

Pull request is https://bitbucket.org/cactuscode/cactusutils/pull-requests/12/make-terminationtrigger-listen-to-signals/diff

Keyword: TerminationTrigger

Comments (18)

  1. Frank Löffler

    This ability sounds good in theory. Is there any system this works on, where it could be tested?

    Also, I would suggest making SIGTERM the default, unless we know of systems where this would be a problem.

  2. Roland Haas reporter

    This was tested by Ian Hinder on Minerva (AEI, SLURM), which is unfortunately not public. Most likely it would also work on other SLURM-based systems (e.g. Stampede, Comet).

    I used the do-nothing option as the default since that is what most other Cactus thorns do: just enabling the thorn does not yet do anything. Changing the default to SIGTERM would change the default behaviour of TerminationTrigger, as there is no equivalent of a termination_from_file option for signals. So if we make SIGTERM the default signal, I think we need to introduce a termination_from_signal option, which must default to false.

    Even without the preemption support, it can be useful to trap e.g. CTRL-C for a clean shutdown, or as a faster alternative to a termination file.
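
The idea of trapping a signal for a clean shutdown can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not the TerminationTrigger code: the names `termination_requested`, `listen_for_signal` and `evolve` are hypothetical, and the loop body stands in for an evolution step. The handler only sets a flag, and the main loop polls that flag between steps so that the run can stop at a well-defined point.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag: set asynchronously by the handler, polled by the loop. */
static volatile sig_atomic_t termination_requested = 0;

/* Signal handler: does nothing except record that a signal arrived. */
static void on_signal(int signum)
{
    (void)signum;
    termination_requested = 1;
}

/* Install the handler for one signal. Returns 0 on success, -1 on failure. */
int listen_for_signal(int signum)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signum, &sa, NULL);
}

/* Run up to max_iterations steps, stopping early once a signal has been
 * received. Returns the number of completed steps. */
int evolve(int max_iterations)
{
    int done = 0;
    while (done < max_iterations && !termination_requested) {
        /* one evolution step would go here */
        done++;
    }
    return done;
}
```

A caller would install the handler once at startup, for example `listen_for_signal(SIGINT)` to react to CTRL-C, and then run the loop. Keeping the handler down to a single flag assignment avoids calling functions that are not async-signal-safe.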

  3. Frank Löffler
    • changed status to open

    If it was tested by maintainers it doesn't matter much to me if the machine was public or not. Thanks Ian. And thanks Roland for the patch!

  4. Ian Hinder

    As I remember, it worked when I sent the signal to the mpiexec process directly, but not when I sent it to the controlling Python (simfactory) process. Apparently this is correct behaviour: simfactory needs to explicitly install a signal handler to catch it. Since that is the process SLURM would send the signal to, it seemed that there was still more work to be done in simfactory before this would be usable as intended. I didn't "review" the code in TerminationTrigger. I have now added comments to the pull request.
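
What "explicitly install a signal handler" could mean for a controlling process is sketched below in C. This is not simfactory code (simfactory is written in Python), and `run_and_forward` is a hypothetical helper: it launches a child, installs a handler for one signal, and passes that signal on to the child instead of dying itself. The return value follows the common shell convention of 128 plus the signal number when the child was killed by a signal; that convention is an assumption of this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* PID of the running child, or 0 if there is none. */
static volatile pid_t child_pid = 0;

/* Handler in the controlling process: pass the signal on to the child. */
static void forward_signal(int signum)
{
    if (child_pid > 0)
        kill(child_pid, signum);
}

/* Launch argv as a child process and forward `signum` to it while waiting.
 * Returns the child's exit code, 128 + signal number if the child was
 * killed by a signal, or -1 on failure. */
int run_and_forward(char *const argv[], int signum)
{
    struct sigaction sa;
    sigset_t block, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = forward_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signum, &sa, NULL) != 0)
        return -1;

    /* Block the signal until child_pid is set, so none is lost in between. */
    sigemptyset(&block);
    sigaddset(&block, signum);
    sigprocmask(SIG_BLOCK, &block, &old);

    pid_t pid = fork();
    if (pid < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    if (pid == 0) {
        /* Child: restore the signal mask, then become the target program. */
        sigprocmask(SIG_SETMASK, &old, NULL);
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }
    child_pid = pid;
    sigprocmask(SIG_SETMASK, &old, NULL);

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) /* EINTR: the wait was interrupted by our handler */
            return -1;
    }
    child_pid = 0;

    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

Without the handler, the default action for a signal such as SIGTERM would terminate the controlling process while the child never learns that a shutdown was requested.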

  5. Frank Löffler

    I wouldn't hold back the commit to TerminationTrigger just because Simfactory needs further changes. It would already be nice to have when not using simfactory.

  6. Roland Haas reporter
    • changed status to open

    I updated the pull request:

    • multiple signals are now supported
    • signal numbers are supported
    • test suites were added
    • cleaned up some schedule statements
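
Accepting several signals, each given either by name or by number, could look roughly like the C sketch below. The function names `parse_signal` and `parse_signals` and the small name table are hypothetical; how the pull request actually represents its signal list is not shown in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Small illustrative table; a real implementation would list more signals. */
static const struct {
    const char *name;
    int number;
} known_signals[] = {
    {"SIGHUP", SIGHUP},   {"SIGINT", SIGINT},   {"SIGTERM", SIGTERM},
    {"SIGUSR1", SIGUSR1}, {"SIGUSR2", SIGUSR2},
};

/* Convert one specification such as "SIGTERM" or "15" to a signal number.
 * Returns -1 if the specification is not understood. */
int parse_signal(const char *spec)
{
    if (spec == NULL || spec[0] == '\0')
        return -1;

    if (isdigit((unsigned char)spec[0])) {
        char *end;
        errno = 0;
        long n = strtol(spec, &end, 10);
        if (errno != 0 || *end != '\0' || n <= 0 || n > INT_MAX)
            return -1;
        return (int)n;
    }

    for (size_t i = 0; i < sizeof known_signals / sizeof known_signals[0]; i++)
        if (strcmp(spec, known_signals[i].name) == 0)
            return known_signals[i].number;
    return -1;
}

/* Convert `count` specifications into out[]. Returns count on success,
 * or -1 as soon as one entry cannot be parsed. */
int parse_signals(const char *const specs[], int count, int out[])
{
    for (int i = 0; i < count; i++) {
        out[i] = parse_signal(specs[i]);
        if (out[i] < 0)
            return -1;
    }
    return count;
}
```

Numeric input is useful for signals that have no portable name, while names avoid depending on platform-specific numbering.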

  7. Roland Haas reporter

    Replying to [comment:10 rhaas]:

    One more (forced) update:

    • fix some typos that made asserts no-ops
    • work around a bug in the testsuite system

  8. Frank Löffler

    Added a question concerning an error condition in the testsuite case, but it otherwise looked fine. Didn't test though.

  9. Roland Haas reporter

    (Mostly the same comment as on Bitbucket.)

    I think there may be a bit of confusion. TerminationTrigger calls CCTK_TerminateNext, which is a clean exit and returns exit code 0 to the OS. An Abort would be a call to CCTK_Abort or at least CCTK_Error and would return (the former for sure, the latter I'd hope so) a non-zero exit code to the OS.

    The test suite checks for termination (which is a successful exit via CCTK_TerminateNext) and checks that termination was triggered via TerminationTrigger by inspecting a grid scalar that is set to 1 when TerminationTrigger requests a termination. An Abort due to an error would be caught by the test suite system as a non-zero exit code of the Cactus executable.

    Having said that though, I found that I can actually test more by not skipping the call to CCTK_TerminateNext during the test suite so that the test suite also tests if CCTK_TerminateNext would actually terminate. I have pushed an updated version of the code and test suite.
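
The distinction drawn in this comment can be summarised in a small C sketch. The type `run_outcome` and the function `classify_run` are hypothetical and are not part of the Cactus test suite system; they only restate the two checks: a non-zero exit code indicates an abort, while a zero exit code together with the grid scalar set to 1 indicates a clean termination requested by TerminationTrigger.

```c
typedef enum {
    RUN_TERMINATED_BY_TRIGGER,  /* exit code 0 and the trigger scalar is 1 */
    RUN_EXITED_WITHOUT_TRIGGER, /* exit code 0 but the trigger never fired */
    RUN_ABORTED                 /* non-zero exit code, e.g. after an abort */
} run_outcome;

/* Classify a finished run from its exit code and the value of the scalar
 * that is set to 1 when a termination was requested. */
run_outcome classify_run(int exit_code, int trigger_scalar)
{
    if (exit_code != 0)
        return RUN_ABORTED;
    return trigger_scalar == 1 ? RUN_TERMINATED_BY_TRIGGER
                               : RUN_EXITED_WITHOUT_TRIGGER;
}
```

Only the first outcome corresponds to a passing test in the sense described above.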

  10. Roland Haas reporter

    Ok to apply (after the release is fine with me, though having it before may be neat since my favorite cluster may support job termination via signals soon)?

  11. Roland Haas reporter

    This is becoming more interesting again (for me) since BlueWaters will (soon) support receiving signals some time before the OS kills a job due to out-of-walltime events.

    So having someone pick up the review would be great (after the release).

  12. Ian Hinder
    • changed status to open

    I think this has had enough eyes on it now, and I don't think it's going to break anything serious, so please apply.
