Opened 3 years ago

Last modified 2 years ago

#1868 reopened enhancement

automatically resubmit run if terminated due to walltime

Reported by: Roland Haas
Owned by:
Priority: major
Milestone:
Component: SimFactory
Version: development version
Keywords:
Cc:


This patch adds code to datura's run script that detects whether the run was terminated because its walltime ran out. In that case it resubmits the job automatically.

This is an alternative to presubmission and is often more convenient for the user when simulations fail. Presubmission itself would benefit from Cactus returning a failure code (which simfactory would need to forward) when termination is triggered by an error.
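
The patch itself is not reproduced in this ticket. As a rough sketch of what a walltime check could look like, assuming the run script knows both the elapsed time and the requested walltime (the function name, its arguments, and the five-minute margin are all hypothetical and not taken from the patch):

```python
def terminated_by_walltime(elapsed_seconds, walltime_seconds, margin_seconds=300):
    """Guess whether a job stopped because its walltime allocation ran out.

    Hypothetical helper for illustration; not the code from the patch.
    """
    if walltime_seconds <= 0:
        raise ValueError("walltime_seconds must be positive")
    # A job that ended within `margin_seconds` of its limit (or beyond it)
    # is treated as having been stopped by the walltime limit.
    return elapsed_seconds >= walltime_seconds - margin_seconds
```

A job that stops long before its limit is then treated as finished or failed, not as a candidate for resubmission.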

The pull request is here:

Attachments (0)

Change History (9)

comment:1 Changed 3 years ago by Roland Haas

Status: new → review

comment:2 Changed 3 years ago by Roland Haas

Summary: automatically resubmit if terminated due to walltime → automatically resubmit run if terminated due to walltime

comment:3 Changed 3 years ago by Erik Schnetter

This needs to be combined with a check whether presubmission was used (or presubmission needs to be disabled for Datura for the time being).

comment:4 Changed 3 years ago by Frank Löffler

One issue that might come up is that if machines handle this differently, users will be surprised either way: runs are not automatically resubmitted on other machines, while on Datura runs might go on 'forever'. On other machines users can count on using up only one walltime cycle, regardless of how long a run the parfile requests (assuming it is long enough). Forgetting that on Datura can get costly over time.

Would a prominent warning on Datura be a good solution?

comment:5 Changed 3 years ago by Erik Schnetter

People only read a warning if it's the last line of output of a run that is aborted. And even then they often don't.

comment:6 Changed 3 years ago by Frank Löffler

That's true for Cactus simulations. I hope it's not right after submitting a job using simfactory.

comment:7 Changed 3 years ago by Roland Haas

I agree with all suggestions:

  1. a test for "Done" must be there
  2. this must only happen when the job is submitted with a '--auto-resubmit' option
  3. Cactus should return a useful return value and simfactory should take care to propagate this to the queuing system
  4. datura must not be the only system that behaves differently from all others
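
The first three points can be combined into a single decision, sketched below. This is only an illustration: the function names are hypothetical, treating a last output line that starts with "Done" as the completion marker is an assumption, and so is the convention that a non-zero exit code signals an error.

```python
def run_completed(output_text, marker="Done"):
    """Return True if the last non-empty output line starts with `marker`.

    Assumption: a finished run ends its output with a line beginning "Done".
    """
    lines = [line.strip() for line in output_text.splitlines() if line.strip()]
    return bool(lines) and lines[-1].startswith(marker)


def should_resubmit(auto_resubmit, exit_code, output_text, hit_walltime):
    """Decide whether a finished job should be queued again (hypothetical)."""
    if not auto_resubmit:            # point 2: only when explicitly requested
        return False
    if exit_code != 0:               # point 3: never loop on a failed run
        return False
    if run_completed(output_text):   # point 1: the simulation already finished
        return False
    return hit_walltime              # resubmit only after a walltime stop
```

Point 4 is not something code in a single run script can express; it requires the same logic to be available on every machine.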

Also, in particular, warning messages that are not the last line will not be heeded (in fact, *no* warning will be heeded as long as the run proceeds).

Based on the discussion the idea seems like a good one though (SpEC for example and various private simfactory-like systems have operated like this for a long time and this behaviour is more convenient than presubmission when a failure occurs).

comment:8 Changed 3 years ago by Frank Löffler

I also agree that this would be nice-to-have (assuming it works) on all machines. Specifying the length of a simulation in the par-file should be enough. Checkpointing due to wall time should be made as transparent as possible, so I really appreciate the effort in that direction.

comment:9 Changed 2 years ago by Roland Haas

Status: review → reopened

We want this to be selectable at submit time.
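
A minimal sketch of what "selectable at submit time" could mean, using a standalone `argparse` parser. This is not simfactory's actual option handling; the program name and the `simulation` argument are hypothetical, and only the `--auto-resubmit` flag name comes from the discussion above.

```python
import argparse


def build_parser():
    """Build a toy submit parser with an opt-in resubmission flag."""
    parser = argparse.ArgumentParser(prog="submit-sketch")
    parser.add_argument("simulation", help="name of the simulation to submit")
    parser.add_argument(
        "--auto-resubmit",
        action="store_true",  # defaults to False, so resubmission is opt-in
        help="resubmit the job if it stops because its walltime ran out",
    )
    return parser
```

Calling `build_parser().parse_args(["bbh", "--auto-resubmit"])` yields `auto_resubmit=True`, while omitting the flag leaves it `False`, so every machine keeps today's single-walltime behaviour unless the user asks otherwise.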
