SimFactory should not run queued chained jobs if a previous job fails

Issue #1286 new
Ian Hinder created an issue

When a simulation consists of multiple chained jobs, the failure of one job is likely to lead to the failure of subsequent jobs. Possible reasons for failure of a job include:

  1. Running out of disk quota;
  2. An error in the code;
  3. A numerical problem;
  4. A problem with the cluster.

Of these, only the last could potentially be recovered from simply by running the next job in the chain, and even then, if the next job starts immediately, it is likely to fail as well because the problem may not yet have resolved itself.

To avoid wasting CPU hours on the remaining jobs in the chain, I think simfactory should hold or remove the subsequent chained jobs. Removing the jobs would probably be simpler, and users can always run "submit" on them to restart them.
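
For illustration, here is a minimal sketch (in Python, since SimFactory itself is written in Python) of the "remove" behaviour. The function names, the `chained_job_ids` list, and the `delete_command` default are hypothetical and not part of SimFactory's actual code; a real implementation would take the delete command from the machine configuration (e.g. scancel on SLURM, qdel on PBS/Torque):

```python
import subprocess

def cancel_chained_jobs(chained_job_ids, delete_command="qdel"):
    """Remove queued chained jobs after a failure (hypothetical helper)."""
    for job_id in chained_job_ids:
        try:
            # Ask the batch system to delete the queued job.
            subprocess.run([delete_command, str(job_id)], check=True)
            print("Removed chained job %s" % job_id)
        except subprocess.CalledProcessError as err:
            print("Could not remove chained job %s: %s" % (job_id, err))

def finish_job(exit_code, chained_job_ids):
    # Let the chain continue only if this job succeeded; otherwise
    # remove the remaining jobs so they do not waste CPU hours.
    if exit_code != 0:
        cancel_chained_jobs(chained_job_ids)
```

The user can then recover exactly as described above, by running "submit" on the simulation again once the underlying problem has been fixed.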


Comments (4)

  1. Frank Löffler

    I agree that simfactory should not continue. A failure in Cactus should be indicated with a non-zero exit code, shouldn't it? If it isn't, it probably should. It would then be the responsibility of the run script to pass this through to simfactory (a sketch of this idea appears after the comments).

  2. Erik Schnetter

    That may be too simplistic. If a simulation runs out of queue time, would that count as "success"? Would you expect Simfactory to continue chaining jobs in this case? What if different MPI processes return different exit codes? What is, in general, the exit code of mpirun anyway? What would you do if a simulation runs out of time? out of memory? out of disk space? What if there is a file permission error and the simulation can't write? What if the Cactus executable never actually starts because something is wrong? What if the user used qdel to stop a simulation? What if the user used the web interface or a termination trigger to stop the simulation?

  3. Frank Löffler

    Replying to [comment:3 eschnett]:

    > That may be too simplistic. If a simulation runs out of queue time, would that count as "success"? Would you expect Simfactory to continue chaining jobs in this case?

    I guess everybody agrees with "yes" here, although I can see that this might also be a problem (for example, if no checkpoint was written because the corresponding parameter was set incorrectly).

    > What if different MPI processes return different exit codes?

    I would assume the overall mpirun-like command would then have a non-zero exit code. It may turn out that Cactus should write its exit status explicitly to a file when it does exit properly (see the sketch after the comments).

    > What is, in general, the exit code of mpirun anyway?

    Very likely it depends on more than we would like.

    > What would you do if a simulation runs out of time? out of memory? out of disk space? What if there is a file permission error and the simulation can't write? What if the Cactus executable never actually starts because something is wrong?

    I agree that catching all of these correctly will be quite some work. That doesn't mean we couldn't start working on some and leave others for later.

    > What if the user used qdel to stop a simulation?

    If a user does this with chained jobs still in the queue, I would indeed expect those to start.

    > What if the user used the web interface or a termination trigger to stop the simulation?

    That's a good question. I can see arguments both ways. We don't specify whether this should mean 'stop this simulation' or 'stop the entire run'.
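
To make the exit-code discussion above concrete, here is a minimal sketch of a run-script-style wrapper, written in Python for consistency with the earlier sketch. The marker file name `CACTUS_TERMINATED` and the function `run_simulation` are hypothetical (Cactus does not necessarily write such a file today); the idea is simply that the wrapper captures the launcher's exit code, optionally checks for a marker written by Cactus on clean shutdown, and returns a non-zero status so that simfactory could decide whether to let the chained jobs run:

```python
import os
import subprocess
import sys

def run_simulation(launch_command, output_dir, marker="CACTUS_TERMINATED"):
    """Run the simulation and report success or failure via the exit code.

    `marker` is a hypothetical file that Cactus would write on a clean
    shutdown; checking it sidesteps the question of what exit code the
    mpirun-like command returns when MPI processes disagree.
    """
    result = subprocess.run(launch_command)

    clean_shutdown = os.path.exists(os.path.join(output_dir, marker))
    if result.returncode != 0 or not clean_shutdown:
        return 1   # failure: simfactory should hold/remove chained jobs
    return 0       # success: chained jobs may proceed

if __name__ == "__main__":
    # The actual command and output directory would come from the run
    # script's environment; these values are placeholders.
    sys.exit(run_simulation(["mpirun", "-np", "4", "./cactus_sim", "sim.par"],
                            output_dir="."))
```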
