Checkpoint recovery nonfunctional

Issue #316 closed
Ian Hinder created an issue

Checkpoint recovery is nonfunctional in SimFactory 2 (it has broken again since it was last fixed in ticket #60).

Using the attached parameter file, I submit a simulation on Datura:

simfactory2/bin/sim --machine datura --config sim2_datura create-submit parfiles/cptest.par 12 1:00:00

This parameter file terminates the Cactus run after 1 minute and dumps a checkpoint file. I then manually remove the output-0000-active symlink: the automatic cleanup in the main() function was cleaning up restarts that were still attempting to run, so I have disabled it, and manual cleanup does not work (see ticket #315).

I then resubmit the simulation

simfactory2/bin/sim --machine datura submit parfiles/cptest.par

and observe that the checkpoint files from the first restart are never hardlinked into the output directory. The job does not recover, and instead starts from initial data.

Log file is attached.

Looking at the code, it appears that the checkpoint linking is conditional on the from-restart-id parameter being passed to simfactory, which I think has something to do with job chaining. I can't see anywhere in the code where this option is set, which is probably why the linking is not happening.
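
For illustration, here is a minimal sketch of the kind of conditional described above; the function and option names are hypothetical rather than SimFactory's actual internals, and the Carpet-style checkpoint file name pattern is also an assumption. If nothing on the normal submit path ever sets from_restart_id, the linking branch is simply never taken:

import glob
import os

def prepare_restart(sim_dir, new_restart_id, from_restart_id=None):
    # Hypothetical sketch: checkpoints are hard-linked into the new restart
    # only when from_restart_id is given, and a plain "submit" never sets it,
    # so the new restart starts from initial data.
    new_dir = os.path.join(sim_dir, "output-%04d" % new_restart_id)
    os.makedirs(new_dir, exist_ok=True)
    if from_restart_id is None:
        return                      # no --from-restart-id: no linking at all
    old_dir = os.path.join(sim_dir, "output-%04d" % from_restart_id)
    for src in glob.glob(os.path.join(old_dir, "**", "checkpoint.chkpt.*"),
                         recursive=True):
        dst = os.path.join(new_dir, os.path.basename(src))
        if not os.path.exists(dst):
            os.link(src, dst)       # hard-link the checkpoint file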

Keyword: regression

Comments (14)

  1. Ian Hinder reporter

    Workaround: recovery can be made to work if you add the options

    --from-restart-id=<n> --recover

    to the submit command, where <n> is the number of the restart that you want to recover from.

  2. anonymous

    Ian, I have committed a fix for this. Please test without using --from-restart-id or --recover.

  3. Ian Hinder reporter

    I cannot get this to work on my laptop; it still doesn't seem to link the checkpoint files into the new restart directory. Do you think it depends on the machine I run on? Which machine have you tested this on, so that I can try the same one?

  4. Erik Schnetter

    The patch looks good; since Ian confirmed that it solves the problem, please apply it.

  5. Ian Hinder reporter

    It looks like it only attempts to recover from the previous restart. What happens if that restart never ran, perhaps because it was deleted from the queue, and hence contains no checkpoint files? I think simfactory should try to recover from the last restart that actually has checkpoint files, not simply the last restart. If there are no restarts with checkpoint files, then it should not try to recover at all (see the sketch at the end of this comment).

    I don't think the patch should be applied until it works for the case where a job never ran, and hence there are no checkpoint files in the last few restarts.

    Aside: all of this would have been trivial if we used a single checkpoint directory shared across all restarts. All the logic for this is already present in Cactus.
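
    A minimal sketch of the restart-selection behaviour proposed above, assuming the output-NNNN directory layout and a Carpet-style checkpoint file pattern; the function name is illustrative only:

    import glob
    import os
    import re

    def find_recovery_restart(sim_dir):
        # Return the id of the newest output-NNNN restart that actually
        # contains checkpoint files, or None if no restart has any, in
        # which case no recovery should be attempted.
        restart_ids = sorted(
            int(m.group(1))
            for m in (re.match(r"output-(\d{4})$", d) for d in os.listdir(sim_dir))
            if m)
        for rid in reversed(restart_ids):
            pattern = os.path.join(sim_dir, "output-%04d" % rid,
                                   "**", "checkpoint.chkpt.*")
            if glob.glob(pattern, recursive=True):
                return rid          # newest restart with checkpoints
        return None                 # nothing to recover from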

  6. Erik Schnetter

    Let's discuss this on the mailing list. Using a common checkpoint directory may be a good idea.

  7. anonymous

    I have attached a new patch that finds all checkpoint files and links them. Please review.

  8. Erik Schnetter

    This patch has several problems:

    We just agreed on the mailing list to store checkpoint files in a directory "checkpoints" at the same level as the "output-NNNN" directories. Those checkpoint files therefore do not fit the pattern you are expecting and lead to warnings; they should be ignored. Rather than searching the whole simulation directory, it may be better to look for checkpoint files in the individual restart directories.

    The code to replace the "output-NNNN" patterns is too complex. Instead of pattern matching followed by manual string operations, a direct regexp replacement would be simpler and safer (a sketch follows this comment).

    The pattern matching code does not check where in the path name the pattern "output-NNNN" exists. If a simulation is called "output", this leads to problems.

    The message "linking file XXX" is printed even if the linking step does not actually happen.

    The code to check whether a checkpoint file is found twice only looks at file names. This is fragile and can hide problems, for example when two restarts use different numbers of processes. Instead, the incoming list of checkpoint files should be pruned. This would also allow an explicit choice of whether newer or older checkpoint files should be chosen if they exist in several restarts: the current code presumably chooses the older ones, whereas we may want to use the newer ones.

    The fallback call to "shutil.copyfile" overwrites the contents of an existing destination file. This is a serious problem if that file is a hard link, because the shared content is then clobbered. The destination needs to be unlinked first (see the second sketch after this comment).

    This code links all checkpoint files. Only checkpoint files from the last iteration should be linked.

    Overall, the code says "checkpoint" when it should say "recover" in several places.
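
    Regarding the regexp-replacement and path-component points above, a possible sketch, with illustrative names only; matching "output-NNNN" as a complete path component avoids trouble with a simulation that happens to be called "output":

    import re

    def retarget_checkpoint_path(path, new_restart_id):
        # Rewrite the output-NNNN component of a checkpoint path so that it
        # points at the new restart; only a complete "output-NNNN" path
        # component is replaced.
        return re.sub(r"(^|/)output-\d{4}(/|$)",
                      r"\g<1>output-%04d\g<2>" % new_restart_id,
                      path, count=1)

    # retarget_checkpoint_path("output-0003/cptest/checkpoint.chkpt.it_120.file_0.h5", 4)
    #   -> "output-0004/cptest/checkpoint.chkpt.it_120.file_0.h5"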
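
    And for the pruning, last-iteration, and unlink-before-copy points, a sketch assuming the iteration number is encoded in the file name as "it_NNN" (as in Carpet checkpoint names); names are again illustrative:

    import os
    import re
    import shutil

    def link_recovery_checkpoints(checkpoint_files, target_dir):
        # Keep only the checkpoint files of the newest iteration, then
        # hard-link them into target_dir; unlink any existing destination
        # before the copy fallback so hard-linked contents are never
        # overwritten in place.
        def iteration(path):
            m = re.search(r"\.it_(\d+)\.", os.path.basename(path))
            return int(m.group(1)) if m else -1

        if not checkpoint_files:
            return                  # nothing to recover from
        latest = max(iteration(f) for f in checkpoint_files)
        for src in checkpoint_files:
            if iteration(src) != latest:
                continue            # prune older iterations
            dst = os.path.join(target_dir, os.path.basename(src))
            if os.path.exists(dst):
                os.unlink(dst)      # never overwrite in place
            try:
                os.link(src, dst)
            except OSError:
                shutil.copyfile(src, dst)   # fallback across file systems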

  9. Erik Schnetter

    Note: I have just successfully submitted three restarts of a simulation on Damiana, using a global "checkpoints" directory.

  10. Erik Schnetter
    • changed status to resolved

    Note: In my comment above, I meant "finished" instead of "submitted".

    I have also just successfully finished three restarts of a simulation on Damiana, without using a global "checkpoints" directory.

    Since the problem no longer seems to exist, I am closing this ticket.
