Can't stop / cleanup simulations that run without a queueing system
For example, simulations running on workstations are affected.
Comments (8)
-
-
reporter - removed comment
All the run scripts that don't use a queuing system are very similar and very simplistic. One way of correcting this is to implement a "pseudo-qsub" script (stored with Simfactory) that we call and that prints the PID to the screen. The mdb entry that filters out the job id (PID) would then also need to be updated.
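A minimal sketch of what such a "pseudo-qsub" could look like is given below. The function name pseudo_qsub and the choice to discard the job's output are assumptions made for illustration only; nothing here is taken from Simfactory itself.

```shell
# Hypothetical "pseudo-qsub": start a command in the background and
# print its PID, roughly the way qsub prints a job id.
# The name pseudo_qsub and the output handling are illustrative assumptions.
pseudo_qsub() {
    # Detach the job's output so the caller only ever sees the PID.
    nohup "$@" >/dev/null 2>&1 &
    # $! is the PID of the most recently started background job.
    echo "$!"
}
```

Called as pseudo_qsub ./RunScript, this would print a single number that a job-id pattern could then pick up. Note that the PID is that of the command given, so any wrapper layers in between would still need to be handled separately.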
-
This is still happening even after git hash 0ce61cb "Add submit scripts for workstations, so that simulations can be run in the background there" of simfactory2, which introduced an exec to generic.sub, making the PID returned that of the simfactory run process. This still leaves python and bash in between the PID returned and the Cactus executable. E.g. for pid 30727 on a Debian system, one has:

    systemd(1)───python2(30711)───RunScript(30719)───cactus_sim(30727)─┬─orted(30758)─┬─{orted}(30764)
                                                                       │              ├─{orted}(30765)
                                                                       │              └─{orted}(30766)
                                                                       ├─{cactus_sim}(30768)
                                                                       └─{cactus_sim}(30769)

and after running stop on the simulation this looks like so:

    systemd(1)───RunScript(30719)───cactus_sim(30727)─┬─orted(30758)─┬─{orted}(30764)
                                                      │              ├─{orted}(30765)
                                                      │              └─{orted}(30766)
                                                      ├─{cactus_sim}(30768)
                                                      └─{cactus_sim}(30769)

i.e. Python terminated but nothing else. Manually killing RunScript gets rid of just the RunScript and now leaves cactus_sim as a child of systemd.

A fix would be to try to propagate the signal down to cactus_sim (see e.g. http://veithen.io/2014/11/16/sigterm-propagation.html), which is complicated, or to try to use exec in the whole chain, which does not work because the RunScript needs to output “Done” after Cactus exits. Outputting the PID of Cactus in RunScript and using that is also not an option because the RunScript must be usable with sim run and therefore must not output extraneous text.

On Linux, a solution may be to use kill’s process group killing capability (https://riccomini.name/kill-subprocesses-linux-bash), which is hinted at by “Negative PID values may be used to choose whole process groups; see the PGID column in ps command output.” in man kill. Negative PIDs are not supported by OSX’s kill command.

On both Linux and OSX there is the pkill command (https://linux.die.net/man/1/pkill), whose -g option serves the same purpose. pkill is not part of POSIX, though, so it may or may not be installed on a given Linux system.

Since one needs the process group ID rather than the process ID, one has to get this first (via ps), and this command works on both Linux and OSX to kill the whole process group that $JOBID is part of:

    pkill -g $(ps -o pgid= -p $JOBID)
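The process-group approach from this comment can be sketched as a small shell helper. The function names pgid_of and kill_job_group are hypothetical, and the fallback to kill with a negative PGID when pkill is missing is an assumption added here, not something the proposed fix necessarily does. It is a sketch under those assumptions rather than the code that was applied.

```shell
# Print the process group ID (PGID) of a PID, with padding stripped.
# Prints nothing if the PID does not exist.
pgid_of() {
    ps -o pgid= -p "$1" 2>/dev/null | tr -d ' '
}

# Send SIGTERM to every process in the group that the given PID belongs to.
# Hypothetical helper: prefers pkill -g, and falls back to kill with a
# negative PGID (the Linux behaviour quoted from man kill) if pkill is absent.
kill_job_group() {
    pgid=$(pgid_of "$1")
    # No PGID means there is no such process, so there is nothing to kill.
    [ -n "$pgid" ] || return 1
    if command -v pkill >/dev/null 2>&1; then
        pkill -g "$pgid"
    else
        kill -TERM -- "-$pgid"
    fi
}
```

A call such as kill_job_group "$JOBID" would then take down RunScript, cactus_sim and orted together, provided they share one process group. If the job shares its process group with the calling shell, the caller is signalled too, so this only makes sense when the job was started in a group of its own.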
-
-
assigned issue to
- edited description
-
assigned issue to
-
- changed status to open
-
Pull request is here: https://bitbucket.org/simfactory/simfactory2/pull-requests/37/generic-properly-kill-cactus-processes-in/diff
Please review.
-
Unless there are objections, I will apply this after 2020-01-10.
-
- changed status to resolved
Applied as git hash bc73012 "generic: properly kill Cactus processes in sim stop" of simfactory2
I think this is supposed to work, but it doesn't work for me either. I looked into it before and I think the problem seemed to be that the PID recorded by SimFactory is that of the submit script rather than that of the run script. I wasn't able to find a straightforward way to fix this at the time.