Opened 6 years ago

#1327 new enhancement

Delay subsequent restarts in the case of certain problems

Reported by: Ian Hinder
Owned by: Erik Schnetter
Priority: major
Milestone:
Component: SimFactory
Version:
Keywords:
Cc:

Description

If one restart of a simulation exits abnormally, e.g. due to some transient problem on a cluster, all subsequent restarts might run into the same problem. If we can distinguish between terminations caused by internal problems (numerical or code-related) and those caused by external problems (MPI errors, filesystem issues), we can respond differently to each. Possible actions could be:

  1. Continue as normal with the next restart;
  2. Delay the next restart for a few hours, in the hope that the transient cluster problems are resolved;
  3. Hold the next restart and notify the user by email that an unrecoverable error has occurred.

The outcome of a restart could be communicated by exit codes (whether through official methods, or through an exit code file). Distinguishing between actions 2 and 3 could be achieved by regular expression matching on the standard output or standard error file. This would make the mechanism independent of Cactus: Cactus would only have to say "good" or "bad", and SimFactory could then decide whether "bad" means delay or hold based on some logic in its machine database.
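A minimal sketch of how such a classification might look, written in Python since SimFactory is a Python tool. Everything named here is hypothetical: the `Action` enum, the `classify_restart` function, and the example patterns are illustrations, not existing SimFactory code. The sketch assumes that an exit code of 0 means "good", that output matching a known transient pattern means "delay", and that any other failure means "hold". In practice the patterns would presumably come from the machine database entry.

```python
import re
from enum import Enum


class Action(Enum):
    """The three possible responses listed in this ticket."""
    CONTINUE = "continue"  # 1. continue as normal with the next restart
    DELAY = "delay"        # 2. delay the next restart for a few hours
    HOLD = "hold"          # 3. hold the next restart and notify the user


# Hypothetical patterns for external (transient) problems. These are
# assumptions for illustration only; real patterns would be per-machine.
TRANSIENT_PATTERNS = [
    re.compile(r"MPI_ABORT|MPI_ERR", re.IGNORECASE),
    re.compile(r"Stale file handle|Input/output error", re.IGNORECASE),
]


def classify_restart(exit_code, output_text, patterns=TRANSIENT_PATTERNS):
    """Decide what to do with the next restart.

    exit_code   -- "good" (0) or "bad" (non-zero), as reported by the run
    output_text -- contents of the standard output or standard error file
    patterns    -- compiled regexes that identify transient problems
    """
    if exit_code == 0:
        return Action.CONTINUE
    # A "bad" exit whose output matches a transient pattern is assumed
    # to be an external problem that may resolve itself.
    if any(p.search(output_text) for p in patterns):
        return Action.DELAY
    # Anything else is treated as unrecoverable without user input.
    return Action.HOLD
```

With these assumed patterns, `classify_restart(1, "MPI_ABORT was invoked on rank 3")` would select the delay action, while a failure with unrecognised output would select hold. Defaulting unknown failures to hold is a design choice of this sketch: it avoids repeatedly resubmitting a job that fails for an internal reason.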

Attachments (0)

Change History (0)
