Reduce disk space used by checkpoints

Erik Schnetter

removed comment

SimFactory's philosophy is that each restart is independent of the other restarts, like a snapshot of the simulation in time. The hard-linking mechanism was introduced to prevent Cactus from (accidentally?) deleting data from old restarts.

One problem in long-running simulations is that one may accidentally delete too many checkpoints, so that the simulation becomes unusable. Another problem is that, if an error in the simulation is detected after all old checkpoints have been deleted, it is impossible to "go back in time" (using checkpoints left every so many time steps).

My current idea is to archive old restarts, and then delete them (or parts of them) locally. In this way, no data have to be deleted permanently. This depends on the availability of archival storage, which should be present; we have mostly not been using it because it is somewhat awkward: archiving takes a long time, different systems have different archives, and it is cumbersome to find out which files have been archived. However, it should in principle be straightforward to automatically copy older restarts into an archive.
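A minimal sketch of the archive-then-delete idea, for illustration only. It assumes that archival storage can be treated as an ordinary directory, which is often not true of real archive systems, and the function name archive_restart and its arguments are hypothetical, not part of SimFactory. The local copy is removed only after every local path has been found in the archive.

```python
import shutil
import tarfile
from pathlib import Path


def archive_restart(restart_dir: Path, archive_dir: Path,
                    delete_local: bool = False) -> Path:
    """Pack one restart directory into a tar file and optionally delete it.

    Assumption: archive_dir is a plain directory standing in for whatever
    archival storage a given machine provides.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / (restart_dir.name + ".tar")

    # Store the restart under its own directory name inside the archive.
    with tarfile.open(target, "w") as tar:
        tar.add(restart_dir, arcname=restart_dir.name)

    # Re-open the archive and check that every local path made it in.
    with tarfile.open(target, "r") as tar:
        archived = set(tar.getnames())
    local = {
        (Path(restart_dir.name) / p.relative_to(restart_dir)).as_posix()
        for p in restart_dir.rglob("*")
    }
    missing = local - archived
    if missing:
        raise RuntimeError(f"not archived: {sorted(missing)}")

    # Only now is it safe to free the local disk space.
    if delete_local:
        shutil.rmtree(restart_dir)
    return target
```

Checking the archive contents before deleting addresses the concern that it is cumbersome to find out which files have been archived.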

2010-12-31T07:36:02+00:00

Ian Hinder reporter

removed comment

This is now a real problem on Damiana because we are short of disk space.

The idea of going "back in time" to solve a problem is nice in theory, but for large simulations it is simply not practical due to the space requirements. We are having to log in to the cluster every day to delete old checkpoint files because there is only enough space for one or two sets.

What do you think of an option for simfactory to delete checkpoints during cleanup, which will happen between restarts? SimFactory could check that there is at least one complete set of checkpoint files in the simulation.
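The check for "at least one complete set" could look roughly like the sketch below. Two things here are assumptions and not taken from this ticket: the file name pattern checkpoint.chkpt.it_<iteration>.file_<index>.h5, and the rule that a set is complete when file indices 0 to files_per_set - 1 are all present. The function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Assumed naming scheme; adjust to whatever the checkpointing thorn writes.
CHECKPOINT_RE = re.compile(r"^checkpoint\.chkpt\.it_(\d+)\.file_(\d+)\.h5$")


def complete_iterations(filenames: Iterable[str], files_per_set: int) -> List[int]:
    """Return the iterations for which all expected file indices exist."""
    found: Dict[int, Set[int]] = defaultdict(set)
    for name in filenames:
        match = CHECKPOINT_RE.match(name)
        if match:
            found[int(match.group(1))].add(int(match.group(2)))
    expected = set(range(files_per_set))
    return sorted(it for it, indices in found.items() if expected <= indices)


def files_safe_to_delete(filenames: Iterable[str], files_per_set: int) -> List[str]:
    """List checkpoint files older than the newest complete set.

    If no complete set exists, nothing is reported as deletable.
    """
    names = list(filenames)
    complete = complete_iterations(names, files_per_set)
    if not complete:
        return []
    newest = complete[-1]
    deletable = []
    for name in names:
        match = CHECKPOINT_RE.match(name)
        # Be conservative: keep the newest complete set and anything newer.
        if match and int(match.group(1)) < newest:
            deletable.append(name)
    return deletable
```

With this rule, cleanup never removes the last set from which the simulation could still be recovered.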

2011-03-08T07:28:31+00:00

Erik Schnetter

removed comment

I think this would be an ideal application of cleaning up. I don't like the idea in principle (I think there should be a better solution), but this seems a good short-term remedy.

How should the setting work? Do you want it to be a manual command line option to cleanup that you have to pass every time? Should it be machine- or simulation-specific, or even user-specific (i.e. applying to all machines)?

2011-03-08T10:05:56+00:00

Ian Hinder reporter

removed comment

How about the following:

A variable called keep_restart_checkpoints, which defaults to -1 (keep all), takes integer values, and can be customised on a per-machine basis.

I don't think it should be an argument to cleanup, since cleanup should run automatically.
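One possible reading of the proposed variable, as a sketch: -1 keeps everything, and a non-negative value N keeps the checkpoints of the N most recent restarts. The interpretation of N and the function name restarts_to_prune are assumptions made for illustration; the ticket only fixes the name of the variable and its default.

```python
from typing import List, Sequence


def restarts_to_prune(restart_ids: Sequence[int],
                      keep_restart_checkpoints: int = -1) -> List[int]:
    """Return the restart ids whose checkpoints may be deleted.

    keep_restart_checkpoints == -1 keeps all checkpoints (the default);
    N >= 0 keeps the checkpoints of the N most recent restarts
    (assumed semantics).
    """
    if keep_restart_checkpoints < -1:
        raise ValueError("keep_restart_checkpoints must be -1 or non-negative")
    if keep_restart_checkpoints == -1:
        return []
    ordered = sorted(restart_ids)
    cutoff = len(ordered) - keep_restart_checkpoints
    # A cutoff of zero or less means there are fewer restarts than we keep.
    return ordered[:max(cutoff, 0)]
```

Because the value would come from the machine configuration and not from the command line, cleanup can apply it automatically between restarts.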

2011-03-08T11:42:01+00:00

Erik Schnetter

removed comment

I believe that the "checkpoints" directory solves this issue in principle. Ian, can you confirm this?

2011-05-23T17:00:27+00:00

Erik Schnetter

removed comment

One other option not mentioned above is the following:

Once a restart is finished (and cleaned up), and once the next restart has been submitted, all data in this restart can be archived and/or deleted without influencing the further progress of the simulation. This includes deleting all checkpoint files in all non-active output directories.
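A rough sketch of deleting checkpoint files from all non-active output directories. It assumes that restarts live in directories named output-NNNN under the simulation directory and that checkpoint files start with checkpoint.chkpt. Both are assumptions for illustration, as is the function name prune_inactive_checkpoints. The default is a dry run that only reports what would be removed.

```python
from pathlib import Path
from typing import List


def prune_inactive_checkpoints(sim_dir: Path, active: str,
                               dry_run: bool = True) -> List[Path]:
    """Remove checkpoint files from every restart except the active one.

    Returns the list of files that were (or, in a dry run, would be) removed.
    """
    removed: List[Path] = []
    for restart in sorted(sim_dir.glob("output-[0-9]*")):
        # Skip symbolic links, non-directories, and the active restart.
        if restart.is_symlink() or not restart.is_dir() or restart.name == active:
            continue
        for ckpt in sorted(restart.rglob("checkpoint.chkpt.*")):
            if ckpt.is_file():
                removed.append(ckpt)
                if not dry_run:
                    ckpt.unlink()
    return removed
```

Other output in the old restarts is left untouched, so only the checkpoint data is lost locally.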

2011-05-23T18:27:02+00:00

Barry Wardell

removed comment

This is solved by using

IO::checkpoint_dir = "../checkpoints"
IO::recover_dir    = "../checkpoints"

I have been running simulations like this for a while without any problems. The only issue is that SimFactory no longer knows about these checkpoints. This isn't very important other than for logging purposes.

2011-07-11T03:43:48+00:00

Erik Schnetter

changed status to resolved
removed comment

I am closing this ticket because the proposed solution seems to work.

2011-07-11T10:46:55+00:00

Roland Haas

edited description
changed status to closed

2019-02-21T20:27:45+00:00
