Reduce disk space used by checkpoints

Issue #180 closed
Ian Hinder created an issue

Currently SimFactory stores the checkpoints for each restart in their own directory. As a result, the Cactus mechanism for deleting all but the most recent N checkpoints does not see the checkpoints of previous restarts, so it is easy to run out of quota space when doing very long simulations.

One solution to this would be for SimFactory to store the checkpoints in a directory above the output-NNNN directories and make the current checkpoints directories under output-NNNN symbolic links to the common directory. This way, all restarts would see the same directory for checkpoint files, and Cactus could clean up the old checkpoints. The current hardlinking mechanism would not be required any more. This solution might be undesirable because it means that each restart is no longer independent.
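The symlink layout proposed above can be sketched as follows. This is a minimal illustration, not SimFactory code: the function name `link_shared_checkpoints` is hypothetical, and it assumes each restart lives in `output-NNNN` with a `checkpoints` entry directly underneath, as described in this issue.

```python
import os
from pathlib import Path


def link_shared_checkpoints(sim_dir, restart_id):
    """Make output-NNNN/checkpoints a symlink to <sim_dir>/checkpoints.

    Hypothetical helper: the directory names follow the layout described
    in this issue, not a confirmed SimFactory API.
    """
    sim_dir = Path(sim_dir)

    # Common directory that every restart shares.
    shared = sim_dir / "checkpoints"
    shared.mkdir(parents=True, exist_ok=True)

    restart_dir = sim_dir / f"output-{restart_id:04d}"
    restart_dir.mkdir(parents=True, exist_ok=True)

    link = restart_dir / "checkpoints"
    # Only create the link if nothing (file, directory, or link) is there yet.
    if not link.is_symlink() and not link.exists():
        # Relative target, so the simulation directory can be moved as a whole.
        os.symlink(os.path.join("..", "checkpoints"), link,
                   target_is_directory=True)
    return link
```

With this layout every restart writes into the same directory, so whatever cleanup Cactus performs there applies across restarts; the cost, as noted above, is that restarts are no longer independent.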

Another solution would be to give simfactory an option to delete old checkpoint files from previous restarts when a job starts. This would duplicate the functionality already available in Cactus.

Comments (9)

  1. Erik Schnetter

    SimFactory's philosophy is that each restart is independent of the other restarts, like a snapshot of the simulation in time. The hard-linking mechanism was introduced to prevent Cactus from (accidentally?) deleting data from old restarts.

    One problem in long-running simulations is that one may accidentally delete too many checkpoints, so that the simulation becomes unusable. Another problem is that, if an error in the simulation is detected after all old checkpoints have been deleted, it is impossible to "go back in time" (using checkpoints left every so many time steps).

    My current idea is to archive old restarts, and then delete them (or parts of them) locally. In this way, no data have to be deleted. This depends on the availability of archival storage, but this should be present -- we mostly have not been using it because this is somewhat awkward: it takes a long time, different systems have different archives, and it is cumbersome to find out which files have been archived. However, it should in principle be straightforward to automatically copy older restarts into an archive.

  2. Ian Hinder reporter

    This is now a real problem on Damiana because we are short of disk space.

    The idea of going "back in time" to solve a problem is nice in theory, but for large simulations it is simply not practical due to the space requirements. We are having to log in to the cluster every day to delete old checkpoint files because there is only enough space for one or two sets.

    What do you think of an option for simfactory to delete checkpoints during cleanup, which will happen between restarts? SimFactory could check that there is at least one complete set of checkpoint files in the simulation.

  3. Erik Schnetter

    I think this would be an ideal application of cleaning up. I don't like the idea in principle (I think there should be a better solution), but this seems a good short-term remedy.

    How should the setting work? Do you want it to be a manual command line option to cleanup that you have to pass every time? Should it be machine- or simulation-specific, or even user-specific (i.e. applying to all machines)?

  4. Ian Hinder reporter

    How about the following:

    A variable called keep_restart_checkpoints which defaults to -1 (all) and can take integer values and can be customised on a per-machine basis.

    I don't think it should be an argument to cleanup, since cleanup should run automatically.
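A cleanup step driven by such a variable might look roughly like the sketch below. Only the name `keep_restart_checkpoints` and its default of -1 come from the proposal above; the function name, the `checkpoint.*` file pattern, and the `output-NNNN/checkpoints` layout are assumptions made for illustration. The sketch never removes the newest restart that still has checkpoint files, but it does not verify that a checkpoint set is complete.

```python
import re
from pathlib import Path

# Assumed naming of restart directories: output-0000, output-0001, ...
RESTART_RE = re.compile(r"^output-(\d{4})$")


def prune_restart_checkpoints(sim_dir, keep_restart_checkpoints=-1,
                              pattern="checkpoint.*"):
    """Delete checkpoint files from all but the newest N restarts that have any.

    Returns the list of deleted paths. A negative value keeps everything.
    """
    if keep_restart_checkpoints < 0:
        return []
    # Never drop below one restart's worth of checkpoints.
    keep = max(keep_restart_checkpoints, 1)

    restarts = sorted(
        p for p in Path(sim_dir).iterdir()
        if p.is_dir() and RESTART_RE.match(p.name)
    )

    with_checkpoints = []
    for restart in restarts:
        ckpt_dir = restart / "checkpoints"
        # Skip symlinked directories: they may point at shared storage.
        if ckpt_dir.is_dir() and not ckpt_dir.is_symlink():
            files = sorted(ckpt_dir.glob(pattern))
            if files:
                with_checkpoints.append(files)

    deleted = []
    # Everything except the last `keep` restarts is eligible for deletion.
    for files in with_checkpoints[:-keep]:
        for f in files:
            f.unlink()
            deleted.append(f)
    return deleted
```

Because the check happens inside the cleanup step rather than as a command line argument, it would run automatically between restarts, and the per-machine value would only change how many restarts survive.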

  5. Erik Schnetter

    I believe that the "checkpoints" directory solves this issue in principle. Ian, can you confirm this?

  6. Erik Schnetter

    One other option not mentioned above is the following:

    Once a restart is finished (and cleaned up), and once the next restart has been submitted, all data in this restart can be archived and/or deleted without influencing the further progress of the simulation. This includes deleting all checkpoint files in all non-active output directories.

  7. Barry Wardell

    This is solved by using

    IO::checkpoint_dir = "../checkpoints"
    IO::recover_dir = "../checkpoints"

    I have been running simulations like this for a while without any problems. The only issue is that SimFactory no longer knows about these checkpoints. This isn't very important other than for logging purposes.
