SimFactory should not run queued chained jobs if a previous job fails

Issue #1286 new
Ian Hinder created an issue

When a simulation consists of multiple chained jobs, the failure of one job is likely to lead to the failure of subsequent jobs. Possible reasons for failure of a job include:

  1. Running out of disk quota;
  2. An error in the code;
  3. A numerical problem;
  4. A problem with the cluster.

Of these, only the last could potentially be recovered from simply by running the next job in the chain, and even then, if the next job starts immediately, it is likely to fail as well because the problem may not yet have resolved itself.

To avoid wasting CPU hours on the remaining jobs in the chain, I think simfactory should hold or remove the subsequent chained jobs. Removing the jobs would probably be simpler, and users can always run "submit" on them to restart them.
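
For illustration, here is a minimal sketch (in Python, since SimFactory itself is written in Python) of the "remove" behaviour. The function names, the `chained_job_ids` list, and the `delete_command` default are hypothetical and not part of SimFactory's actual code; a real implementation would take the delete command from the machine configuration (e.g. scancel on SLURM, qdel on PBS/Torque):

```python
import subprocess

def cancel_chained_jobs(chained_job_ids, delete_command="qdel"):
    """Remove queued chained jobs after a failure (hypothetical helper)."""
    for job_id in chained_job_ids:
        try:
            # Ask the batch system to delete the queued job.
            subprocess.run([delete_command, str(job_id)], check=True)
            print("Removed chained job %s" % job_id)
        except subprocess.CalledProcessError as err:
            print("Could not remove chained job %s: %s" % (job_id, err))

def finish_job(exit_code, chained_job_ids):
    # Let the chain continue only if this job succeeded; otherwise
    # remove the remaining jobs so they do not waste CPU hours.
    if exit_code != 0:
        cancel_chained_jobs(chained_job_ids)
```

The user can then recover exactly as described above, by running "submit" on the simulation again once the underlying problem has been fixed.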


Comments (4)

  1. Frank Löffler

    I agree that simfactory should not continue. A failure in Cactus should be indicated with a non-zero exit code, shouldn't it? If it isn't, it probably should. It would then be the responsibility of the run script to pass this through to simfactory (a sketch of this idea appears after the comments).

  2. Erik Schnetter

    That may be too simplistic. If a simulation runs out of queue time, would that count as "success"? Would you expect Simfactory to continue chaining jobs in this case? What if different MPI processes return different exit codes? What is, in general, the exit code of mpirun anyway? What would you do if a simulation runs out of time? out of memory? out of disk space? What if there is a file permission error and the simulation can't write? What if the Cactus executable never actually starts because something is wrong? What if the user used qdel to stop a simulation? What if the user used the web interface or a termination trigger to stop the simulation?

  3. Frank Löffler

    Replying to [comment:3 eschnett]:

    > That may be too simplistic. If a simulation runs out of queue time, would that count as "success"? Would you expect Simfactory to continue chaining jobs in this case?

    I guess everybody agrees with "yes" here, although I can see that this might also be a problem (for example, if no checkpoint was written because the corresponding parameter was set incorrectly).

    > What if different MPI processes return different exit codes?

    I would assume the overall mpirun-like command would then have a non-zero exit code. It may turn out that Cactus should write its exit status explicitly to a file when it does exit properly (see the sketch after the comments).

    > What is, in general, the exit code of mpirun anyway?

    Very likely it depends on more than we would like.

    > What would you do if a simulation runs out of time? out of memory? out of disk space? What if there is a file permission error and the simulation can't write? What if the Cactus executable never actually starts because something is wrong?

    I agree that catching all of these correctly will be quite some work. That doesn't mean we couldn't start working on some and leave others for later.

    > What if the user used qdel to stop a simulation?

    If a user does this with chained jobs still in the queue, I would indeed expect those to start.

    > What if the user used the web interface or a termination trigger to stop the simulation?

    That's a good question. I can see arguments both ways. We don't specify whether this should mean 'stop this simulation' or 'stop the entire run'.
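
To make the exit-code discussion above concrete, here is a minimal sketch of a run-script-style wrapper, written in Python for consistency with the earlier sketch. The marker file name `CACTUS_TERMINATED` and the function `run_simulation` are hypothetical (Cactus does not necessarily write such a file today); the idea is simply that the wrapper captures the launcher's exit code, optionally checks for a marker written by Cactus on clean shutdown, and returns a non-zero status so that simfactory could decide whether to let the chained jobs run:

```python
import os
import subprocess
import sys

def run_simulation(launch_command, output_dir, marker="CACTUS_TERMINATED"):
    """Run the simulation and report success or failure via the exit code.

    `marker` is a hypothetical file that Cactus would write on a clean
    shutdown; checking it sidesteps the question of what exit code the
    mpirun-like command returns when MPI processes disagree.
    """
    result = subprocess.run(launch_command)

    clean_shutdown = os.path.exists(os.path.join(output_dir, marker))
    if result.returncode != 0 or not clean_shutdown:
        return 1   # failure: simfactory should hold/remove chained jobs
    return 0       # success: chained jobs may proceed

if __name__ == "__main__":
    # The actual command and output directory would come from the run
    # script's environment; these values are placeholders.
    sys.exit(run_simulation(["mpirun", "-np", "4", "./cactus_sim", "sim.par"],
                            output_dir="."))
```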
