[Pull request: CactusUtils/WatchDog] new thorn to automatically terminate jobs that hang

Issue #1751 closed
David Radice created an issue

WatchDog is a thorn that terminates jobs that do not make progress over a user-defined time frame. Internally, WatchDog updates a timer at CCTK_ANALYSIS and uses the pthread library to spawn a watcher thread that periodically checks whether the timer has been updated. If the timer has not been updated for more than a user-defined time frame, the thread calls "exit()" to terminate the process (and the job).
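The mechanism described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the pull request: the names `wd_touch`, `wd_is_stale`, `wd_watch`, `wd_start` and the variable `wd_check_every` are hypothetical, and the sketch uses a single interval both as the sleep period and as the staleness limit.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* All names below are hypothetical; they are not taken from the thorn. */
static pthread_mutex_t wd_mutex = PTHREAD_MUTEX_INITIALIZER;
static time_t wd_last_update;           /* last time the main thread reported progress */
static unsigned wd_check_every = 3600;  /* seconds between checks (assumed default) */

/* Called by the main thread, e.g. from a routine scheduled at CCTK_ANALYSIS. */
void wd_touch(void)
{
  pthread_mutex_lock(&wd_mutex);
  wd_last_update = time(NULL);
  pthread_mutex_unlock(&wd_mutex);
}

/* Returns 1 if more than `limit` seconds separate `last` and `now`, else 0. */
int wd_is_stale(time_t last, time_t now, double limit)
{
  return difftime(now, last) > limit;
}

/* Watcher thread: sleep, read the timer under the mutex, exit if it is stale. */
static void *wd_watch(void *arg)
{
  (void)arg;
  for (;;) {
    sleep(wd_check_every);
    pthread_mutex_lock(&wd_mutex);
    time_t last = wd_last_update;
    pthread_mutex_unlock(&wd_mutex);
    if (wd_is_stale(last, time(NULL), (double)wd_check_every)) {
      fprintf(stderr, "WatchDog: no progress for more than %u s, terminating\n",
              wd_check_every);
      exit(EXIT_FAILURE);
    }
    fprintf(stderr, "WatchDog: everything is fine\n");
  }
  return NULL;
}

/* Starts the detached watcher thread; returns 0 on success, non-zero on failure. */
int wd_start(unsigned check_every)
{
  pthread_t tid;
  assert(check_every > 0);
  wd_check_every = check_every;  /* set before the thread exists, so no lock needed */
  wd_touch();
  if (pthread_create(&tid, NULL, wd_watch, NULL) != 0)
    return -1;
  return pthread_detach(tid);
}
```

In this sketch the main thread would call `wd_start()` once at startup and `wd_touch()` every time the analysis routine runs; only the watcher thread ever calls `exit()`.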


Comments (14)

  1. Roland Haas
    • changed status to open

    I had a look through the code. At this point, the thorn is not yet ready to be included in the ET or Cactus, however it seems worthwhile to have such a thorn. Comments from most severe to least severe:

    • the thorn needs documentation in the standard documentation.tex file, this can be very short, I would think that the description in the pull request plus the usual boilerplate text will be sufficient
    • the thorn has no test case however since it would have to test for an abort, a test case may be hard to design
    • the thorn uses fprintf(stderr, ...) for both error and informational messages. For informational messages (the "Everything is fine" message) it could use CCTK_VInfo() in the main thread's ANALYSIS routine. The warnings cannot be changed since they are emitted by the secondary thread and Cactus is not thread safe.
    • reading the man page for asctime_r, I do not think that the explicit zero termination in lines 32 and 49 is required, since asctime_r null-terminates its output (which is also guaranteed to fit within 26 characters).
    • if possible, the thorn should check for the presence of PTHREADS. Currently, since PTHREADS is a Cactus extra, the only way to do so seems to be at compile time via:

      #ifndef CCTK_PTHREADS
      #error "WATCHDOG requires PTHREADS. Please enable PTHREADS=yes in your option list."
      #endif

    • it may be interesting to make check_every STEERABLE=ALWAYS by resetting it inside the ANALYSIS routine (and protecting access to it by the mutex).
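To illustrate the asctime_r point in the list above: the function writes at most 26 bytes, including a trailing newline and the terminating null byte, so no extra termination is needed, although a caller may still want to drop the newline. The helper name `wd_timestamp` below is hypothetical and the sketch is not taken from the thorn.

```c
#define _POSIX_C_SOURCE 200809L  /* for localtime_r and asctime_r */
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: format `t` as local time into `buf` (at least 26 bytes),
 * dropping the trailing newline that asctime_r appends.
 * On failure, `buf` is set to the empty string. */
char *wd_timestamp(time_t t, char buf[26])
{
  struct tm tm;
  assert(buf != NULL);
  if (localtime_r(&t, &tm) == NULL || asctime_r(&tm, buf) == NULL) {
    buf[0] = '\0';
    return buf;
  }
  /* asctime_r already null-terminates; this only overwrites the newline. */
  buf[strcspn(buf, "\n")] = '\0';
  return buf;
}
```

The result can be passed directly to `fprintf(stderr, ...)` from the watcher thread, since the helper touches no shared state.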

  2. David Radice reporter

    Replying to [comment:2 rhaas]:

    Thank you for having reviewed my patch.

    • the thorn needs documentation in the standard documentation.tex file, this can be very short, I would think that the description in the pull request plus the usual boilerplate text will be sufficient

    I updated the pull request: now my patch contains the documentation as well.

    • the thorn has no test case however since it would have to test for an abort, a test case may be hard to design

    I wouldn't know how to create a unit test for the WatchDog thorn. However, the code is sufficiently simple that it is easy to check that it operates as intended. Whether calling "abort()" results in the run being cleared from the queuing system without leaving "zombies" is a much more delicate issue, because it depends on the reason why a run is hanging in the first place and on the details of the specific machine. I do not see a way to really test this. The WatchDog thorn is meant to avoid burning allocations on dead jobs; however, I do not think that one can guarantee 100% that it will work: users should use this at their own risk.

    • the thorn uses fprintf(stderr, ...) for both error and informational messages. For informational messages (the "Everything is fine" message) it could use CCTK_VInfo() in the main thread's ANALYSIS routine. The warnings cannot be changed since they are emitted by the secondary thread and Cactus is not thread safe.

    "Everything is fine" is a message from the watchdog thread. It cannot be relayed using the CCTK_* functions.

    • reading the man page for asctime_r, I do not think that the explicit zero termination in lines 32 and 49 is required, since asctime_r null-terminates its output (which is also guaranteed to fit within 26 characters).

    I am removing the newline character.

    • if possible, the thorn should check for the presence of PTHREADS. Currently, since PTHREADS is a Cactus extra, the only way to do so seems to be at compile time via:

      #ifndef CCTK_PTHREADS
      #error "WATCHDOG requires PTHREADS. Please enable PTHREADS=yes in your option list."
      #endif

    This is fixed in the new version of the pull request.

    • it may be interesting to make check_every STEERABLE=ALWAYS by resetting it inside the ANALYSIS routine (and protecting access to it by the mutex).

    This is certainly possible, but it seems an unneeded complication to me.
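For reference, the suggestion discussed here (refreshing the check interval from the main thread and protecting it with a mutex) could look roughly like the sketch below. The names `wd_set_check_every` and `wd_get_check_every` are hypothetical, and the default value is an assumption; nothing here is taken from the pull request.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared copy of the check_every parameter, in seconds. */
static pthread_mutex_t wd_param_mutex = PTHREAD_MUTEX_INITIALIZER;
static int wd_check_every = 3600;  /* assumed default */

/* Main thread: call from the ANALYSIS routine with the current parameter value. */
void wd_set_check_every(int seconds)
{
  assert(seconds > 0);
  pthread_mutex_lock(&wd_param_mutex);
  wd_check_every = seconds;
  pthread_mutex_unlock(&wd_param_mutex);
}

/* Watcher thread: read the interval before each sleep. */
int wd_get_check_every(void)
{
  pthread_mutex_lock(&wd_param_mutex);
  int seconds = wd_check_every;
  pthread_mutex_unlock(&wd_param_mutex);
  return seconds;
}
```

A changed value would only take effect after the watcher thread wakes from its current sleep, which is part of the added complication.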

  3. David Radice reporter

    Hello, thank you for the additional patches. However there are some things I would like to change in those:

    • The "everything is fine" message is very useful to see when (if) the watchdog thread is actually executed by the OS. Moving it to the main thread defeats its purpose in my opinion.
    • In the heartbeat output we might want to output much more frequently than every half of the check time: large jobs might "hang" for a long time while checkpointing. I would suggest writing to the heartbeat file more often, say every few iterations (anyway, it is really not so expensive to write an int to a text file).
    • I would also move the heartbeat file to the running directory, to avoid having the runscript know the name of the output directory.
  4. Roland Haas

    All are fine with me. All my patches were just suggestions and an illustration of what I had in mind.

    I did not put the heartbeat file into the running directory since normally all Cactus output should go into out_dir, and I ignored the fact that this actually makes it hard for a script to read the heartbeat file.

    Outputting to file more frequently should work fine if the number of iterations to wait is much smaller than the walltime between checkpoints (or, if one checkpoints every so many iterations, is a divisor of the checkpoint_every parameter). The code as written by me actually has a bug. It uses the last time the timestamp variable was updated for then, but instead it should use the last time the output file was written to. I am a bit wary about saying that it is not expensive to write an int to a text file. I agree that writing the int once the file is open is fast. I worry that opening the file is actually slow (may take several seconds) on large Lustre file systems where the (single) metadata server is the bottleneck.
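A rough sketch of a heartbeat writer that addresses both points from this comment: it rate-limits on the time of the last successful file write (not on the time the in-memory timestamp was updated), and it skips the potentially slow file open when called too soon. The function name `hb_write`, its parameters, and the use of `(time_t)-1` as a "never written" marker are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: write `iteration` to the file at `path` unless the
 * previous successful write was less than `min_interval` seconds before `now`.
 * `*last_write` must start as (time_t)-1 and is updated only after a
 * successful write. Returns 1 if written, 0 if skipped, -1 on I/O error. */
int hb_write(const char *path, int iteration, time_t now,
             double min_interval, time_t *last_write)
{
  assert(path != NULL && last_write != NULL);

  /* Skip without opening the file: the open is the expensive part. */
  if (*last_write != (time_t)-1 && difftime(now, *last_write) < min_interval)
    return 0;

  FILE *fp = fopen(path, "w");
  if (fp == NULL)
    return -1;
  int ok = fprintf(fp, "%d\n", iteration) > 0;
  if (fclose(fp) != 0)
    ok = 0;
  if (!ok)
    return -1;

  *last_write = now;
  return 1;
}
```

Called once per iteration with `now = time(NULL)`, this touches the file system at most once per `min_interval` seconds, regardless of how fast iterations complete.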

  5. David Radice reporter

    Yes, writing output every iteration is probably going to be a bit slow... In the old WatchDog thorn I had a parameter for the frequency of the output, in number of iterations, and I was setting it to be as frequent as 0D output. A better possibility would be to have another parameter for the frequency of the output to the heartbeat file. I am going to merge/adapt your patches to the code for the pull request.

  6. David Radice reporter

    On second thought, I am starting to think that supporting both the HEARTBEAT file and the internal pthread watcher results in a lot of duplication. The HEARTBEAT file approach has the important drawback that it would work in a very non-uniform way across separate machines and that it requires the user to simultaneously adjust his/her runscript and parfile to adjust the timers in the checking code. I would probably stick with the current version of WatchDog: the only drawback is that it could leave zombies on some systems that do not clean up after a job has terminated.

  7. Roland Haas

    Any more changes you'd like to include? One question: the license is given as GPLv3. Since the thorn is intended for CactusUtils is this ok or do we need to ask David if he is willing to license it as LGPL to include it in a "Cactus" repository? I'd like to include this soon if possible.

  8. David Radice reporter

    I am happy with the code as it is now in the pull request, with maybe the exception of the check for CCTK_PTHREADS that fails for no good reason on my Mac, but works everywhere else.

    In its current form the WatchDog thorn has been "tested" successfully on BlueWaters and Stampede with jobs hanging for different reasons (I/O on BlueWaters and MPI on Stampede).

    As for the license: I used GPLv3 because that is my default, but for a piece of code as trivial as the WatchDog thorn, any license would be fine for me. Including "public domain".

  9. Roland Haas

    This thorn is reviewed ok. Unless there are objections I will add it to CactusUtils (same repo as Nice and Formaline) after swapping the license file for LGPL (and checking with David this is ok once more) today or tomorrow.
