Opened 6 years ago

Last modified 6 years ago

#1286 new enhancement

SimFactory should not run queued chained jobs if a previous job fails

Reported by: Ian Hinder
Owned by: Erik Schnetter
Priority: major
Milestone:
Component: SimFactory
Version:
Keywords:
Cc:

Description

When a simulation consists of multiple chained jobs, the failure of one job is likely to lead to the failure of subsequent jobs. Possible reasons for failure of a job include:

  1. Running out of disk quota;
  2. An error in the code;
  3. A numerical problem;
  4. A problem with the cluster

Of these, only the last could potentially be recovered from by simply running the next job in the chain. Even then, if the next job starts immediately, it is likely to fail as well, because the cluster problem may not yet have been resolved.

As a result, to avoid wasting CPU hours on the remaining jobs in the chain, I think simfactory should hold or remove the subsequent chained jobs. Probably removing the jobs would be easier and simpler, and users can always run "submit" on them to restart them.
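A minimal sketch of the proposed behaviour, assuming the chain is available as an ordered list of job ids. The function name `cancel_remaining_jobs` and the `cancel` callback are hypothetical and are not part of SimFactory; in practice the callback might wrap the queueing system's delete command.

```python
from typing import Callable, List


def cancel_remaining_jobs(
    chain: List[str],
    failed_job: str,
    cancel: Callable[[str], None],
) -> List[str]:
    """Cancel every job queued after `failed_job`; return the cancelled ids."""
    position = chain.index(failed_job)
    remaining = chain[position + 1:]
    for job_id in remaining:
        # `cancel` is a hypothetical hook, e.g. a wrapper around the
        # scheduler's delete command.
        cancel(job_id)
    return remaining


# Illustration only: record the cancelled ids instead of calling a scheduler.
cancelled: List[str] = []
removed = cancel_remaining_jobs(["job-1", "job-2", "job-3"], "job-1", cancelled.append)
```

Removing the jobs rather than holding them keeps the logic simple, since the user can resubmit once the cause of the failure is fixed.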

Change History (4)

comment:1 Changed 6 years ago by Erik Schnetter

Please define how to detect failures.

comment:2 Changed 6 years ago by Frank Löffler

I agree that simfactory should not continue. A failure in Cactus should be indicated with a non-zero exit code, shouldn't it? If it isn't, it probably should. It would then be the responsibility of the run script to pass this through to simfactory.
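As a rough illustration of passing the exit code through, the sketch below runs a command and returns its exit code unchanged, so that a caller can decide whether to continue the chain. The names `run_and_report` and `should_continue_chain` are hypothetical, and the commands are stand-ins for the real Cactus invocation. Treating a zero exit code as the only success signal is an assumption made here for illustration.

```python
import subprocess
import sys
from typing import Sequence


def run_and_report(command: Sequence[str]) -> int:
    """Run the simulation command and return its exit code unchanged."""
    completed = subprocess.run(command)
    return completed.returncode


def should_continue_chain(exit_code: int) -> bool:
    """Assumed policy: only a zero exit code lets the next job run."""
    return exit_code == 0


# Stand-in commands: one that exits cleanly and one that exits with code 3.
ok = run_and_report([sys.executable, "-c", "raise SystemExit(0)"])
bad = run_and_report([sys.executable, "-c", "raise SystemExit(3)"])
```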

comment:3 Changed 6 years ago by Erik Schnetter

That may be too simplistic. If a simulation runs out of queue time, would that count as "success"? Would you expect Simfactory to continue chaining jobs in this case? What if different MPI processes return different exit codes? What is, in general, the exit code of mpirun anyway? What would you do if a simulation runs out of time? out of memory? out of disk space? What if there is a file permission error and the simulation can't write? What if the Cactus executable never actually starts because something is wrong? What if the user used qdel to stop a simulation? What if the user used the web interface or a termination trigger to stop the simulation?

comment:4 in reply to:  3 Changed 6 years ago by Frank Löffler

Replying to eschnett:

> That may be too simplistic. If a simulation runs out of queue time, would that count as "success"? Would you expect Simfactory to continue chaining jobs in this case?

I guess everybody would probably agree on "yes" here, although I can see that this might also be a problem (assume no checkpoint was written because the corresponding parameter was incorrectly set).

> What if different MPI processes return different exit codes?

I would assume the overall mpirun-like command to have a non-zero exit code then. Maybe it turns out that Cactus should write that explicitly to a file (when it does properly exit).

> What is, in general, the exit code of mpirun anyway?

Very likely it depends on more than we would like.

> What would you do if a simulation runs out of time? out of memory? out of disk space? What if there is a file permission error and the simulation can't write? What if the Cactus executable never actually starts because something is wrong?

I agree that catching all of these correctly will be quite some work. That doesn't mean we couldn't start working on some and leave others for later.

> What if the user used qdel to stop a simulation?

If a user does this with chained jobs still in the queue, I would indeed expect these to start.

> What if the user used the web interface or a termination trigger to stop the simulation?

That's a good question. I can see arguments both ways. We don't specify if this should mean 'stop _this_ simulation' or 'stop the entire run'.
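The idea raised earlier in this reply, that Cactus could write its outcome explicitly to a file when it exits properly, can be sketched as follows. The file name `exit_status.json`, its JSON layout, and all function names are assumptions made for illustration; neither Cactus nor SimFactory is known to provide them. A missing file is treated as a failure, which covers cases where the executable never started or was killed before it could exit cleanly.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

STATUS_FILE = "exit_status.json"  # hypothetical file name


def write_status(run_dir: Path, exit_code: int) -> None:
    """What the simulation would do as its last step on a proper exit."""
    (run_dir / STATUS_FILE).write_text(json.dumps({"exit_code": exit_code}))


def read_status(run_dir: Path) -> Optional[int]:
    """Return the recorded exit code, or None if no status file was written."""
    path = run_dir / STATUS_FILE
    if not path.exists():
        return None
    return json.loads(path.read_text())["exit_code"]


def job_succeeded(run_dir: Path) -> bool:
    """Assumed policy: success only if a status file exists and records 0."""
    return read_status(run_dir) == 0


with tempfile.TemporaryDirectory() as tmp:
    clean = Path(tmp) / "clean"
    clean.mkdir()
    write_status(clean, 0)          # job that exited properly

    crashed = Path(tmp) / "crashed"
    crashed.mkdir()                 # job that never wrote a status file

    clean_ok = job_succeeded(clean)
    crashed_ok = job_succeeded(crashed)
```

This sidesteps the question of what mpirun itself returns, at the cost of requiring the simulation to cooperate. It does not by itself settle how running out of queue time or a user-requested stop should be classified.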
