Can't stop / clean up simulations that run without a queueing system

Issue #710 resolved
Erik Schnetter created an issue

For example, simulations running on workstations are affected.

Comments (8)

  1. Barry Wardell

    I think this is supposed to work, but it doesn't work for me either. I looked into it before and I think the problem seemed to be that the PID recorded by SimFactory is that of the submit script rather than that of the run script. I wasn't able to find a straightforward way to fix this at the time.

  2. Erik Schnetter reporter

    All the run scripts that don't use a queueing system are very similar, and very simplistic. One way of correcting this is to implement a "pseudo-qsub" script (stored with Simfactory) that we call and that prints the PID to the screen. The mdb entry that filters out the job ID (PID) would then also need to be updated.
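    A minimal sketch of what such a "pseudo-qsub" could look like is given below. The function name pseudo_qsub and the choice to print only the bare PID are assumptions made for illustration; they are not taken from Simfactory, and the matching mdb pattern would have to be written to fit whatever is actually printed.

```shell
#!/bin/sh
# Hypothetical "pseudo-qsub" helper (the name and output format are
# assumptions, not part of Simfactory).
# It starts the given command in the background and prints the PID of that
# command, so the caller can record it as the job ID.
pseudo_qsub() {
    # nohup replaces itself with the command via exec, so $! is the PID of
    # the command itself. Output is discarded here so that the caller does
    # not block when capturing the PID; a real script would redirect it to
    # the simulation's log files instead.
    nohup "$@" >/dev/null 2>&1 &
    echo "$!"
}
```

    Usage would be along the lines of `JOBID=$(pseudo_qsub ./RunScript)`, after which `kill -0 "$JOBID"` reports whether the job is still alive.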

  3. Roland Haas

    This still happens even after git hash 0ce61cb "Add submit scripts for workstations, so that simulations can be run in the background there" of simfactory2, which added an exec to generic.sub so that the PID returned is that of the simfactory run process. This still leaves python and bash between the returned PID and the Cactus executable. E.g., for PID 30727 on a Debian system, one has:

    systemd(1)───python2(30711)───RunScript(30719)───cactus_sim(30727)─┬─orted(30758)─┬─{orted}(30764)
                                                                       │              ├─{orted}(30765)
                                                                       │              └─{orted}(30766)
                                                                       ├─{cactus_sim}(30768)
                                                                       └─{cactus_sim}(30769)
    

    After running stop on the simulation, the tree looks like this:

    systemd(1)───RunScript(30719)───cactus_sim(30727)─┬─orted(30758)─┬─{orted}(30764)
                                                      │              ├─{orted}(30765)
                                                      │              └─{orted}(30766)
                                                      ├─{cactus_sim}(30768)
                                                      └─{cactus_sim}(30769)
    

    I.e., Python terminated but nothing else did. Manually killing RunScript removes only RunScript and leaves cactus_sim as a child of systemd.

    A fix would be to propagate the signal down to cactus_sim (see e.g. http://veithen.io/2014/11/16/sigterm-propagation.html), which is complicated, or to use exec throughout the whole chain, which does not work because the RunScript needs to output “Done” after Cactus exits. Outputting the PID of Cactus in RunScript and using that is also not an option, because the RunScript must be usable with sim run and therefore must not output extraneous text.
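    For reference, the signal-propagation option could be sketched roughly as follows. This is only an illustration of the technique: the function name run_and_forward is hypothetical, only SIGTERM is handled, and the real RunScript would need the same treatment at every level of the chain.

```shell
#!/bin/sh
# Hypothetical wrapper (not Simfactory code): run a command as a background
# child, forward SIGTERM to it, and still print "Done" once it has exited.
run_and_forward() {
    "$@" &
    child=$!
    # The child PID is expanded now, when the trap is installed.
    trap "kill -TERM $child 2>/dev/null" TERM
    wait "$child"
    status=$?
    # A trapped signal makes wait return early with a status above 128, so
    # keep waiting for as long as the child still exists.
    while kill -0 "$child" 2>/dev/null; do
        wait "$child"
        status=$?
    done
    trap - TERM
    echo "Done"
    return "$status"
}
```

    The extra loop around wait is part of what makes this approach fiddly: without it, the wrapper would print “Done” as soon as the signal arrives rather than after the child has actually exited.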

    On Linux, a solution may be to use kill’s process group killing capability (https://riccomini.name/kill-subprocesses-linux-bash) which is hinted at by “Negative PID values may be used to choose whole process groups; see the PGID column in ps command output.” in man kill. Negative PIDs are not supported by OSX’s kill command.

    On both Linux and OSX there is the pkill command (https://linux.die.net/man/1/pkill), whose -g option serves the same purpose. pkill is not part of POSIX, though, so it may or may not be installed on a given Linux system.

    Since one needs the process group ID rather than the process ID, one first has to obtain it (via ps). This command works on both Linux and OSX to kill the whole process group that $JOBID is part of:

    pkill -g $(ps -o pgid= -p $JOBID)
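    Wrapped in a function, the same idea might look like the sketch below. The name stop_job_group is hypothetical, and the fallback to a negative PID when pkill is missing is an assumption that combines the two options discussed above; it is not something Simfactory does.

```shell
#!/bin/sh
# Hypothetical helper (not Simfactory code): send SIGTERM to every process
# in the process group that the recorded job ID belongs to.
# Caution: if the caller is in the same process group, it is signalled too.
stop_job_group() {
    jobid=$1
    # Look up the process group ID; tr strips the padding that ps adds.
    pgid=$(ps -o pgid= -p "$jobid" | tr -d ' ')
    if [ -z "$pgid" ]; then
        echo "no such process: $jobid" >&2
        return 1
    fi
    if command -v pkill >/dev/null 2>&1; then
        pkill -TERM -g "$pgid"
    else
        # Fallback for systems without pkill: a negative PID addresses the
        # whole process group.
        kill -TERM -- "-$pgid"
    fi
}
```

    This only reaches the whole chain if RunScript, cactus_sim and the MPI helpers all stay in the process group of the recorded PID, which is the assumption behind the one-line command as well.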
    
