Missing data in HDF5 files

Issue #1283 new
Ian Hinder created an issue

If a simulation runs out of disk space while writing an HDF5 file, the simulation will terminate. The HDF5 file being written may then be corrupt, and all data in it may be irretrievable. In that case, restarting from the last-written checkpoint file may leave a "gap" in the data, corresponding to the period between the start of the failed restart and the last checkpoint file it wrote.

Steps to reproduce:

• Start a simulation which checkpoints periodically and consists of several restarts
• Keep all checkpoint files
• Restart 0000 completes successfully and checkpoints at iteration i1
• Restart 0001 checkpoints once after some evolution at iteration i2
• Restart 0001 terminates abnormally while writing an HDF5 output file at iteration i3
• The output file is corrupt and unrecoverable, so there is no data from iteration i1 to iteration i3
• Restart 0002 starts at iteration i2, as this is the last checkpoint available
• The simulation continues until the end, but the data from the corrupted HDF5 file between iterations i1 and i2 is lost

Possible solutions:

1. Write HDF5 files safely, e.g. by first copying the file to a new temporary file, performing the write, then atomically moving the temporary file over the original file.  The original file would then remain intact in the event of a crash while writing the new file.  This could be very expensive for 3D output files (a sketch of this approach follows the list).
2. Start a new set of HDF5 files after each checkpoint.  This seems to be the most efficient and simplest solution, but requires readers of HDF5 files to be modified to take it into account.
3. Check the consistency of all HDF5 files in the previous restart(s) on recovery, and recover from the latest checkpoint file for which all earlier HDF5 files are valid.  We could use code that checks the HDF5 files directly, or some other flagging mechanism indicating that HDF5 writes completed successfully; e.g. we could rename the HDF5 file to .tmp during writes, and rename it back after a successful write.  This is complex and requires Cactus or simfactory to look into previous restarts. It also applies only to HDF5 files, and requires breaking several abstraction barriers.
4. Wait for HDF5 journalling support.  As far as I know, only metadata journalling is planned, which is probably not enough; in any case, the HDF5 developers are not actively working on the next version at the moment due to lack of funding.
5. Checkpoint only on termination of the simulation.
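
A minimal sketch of option 1, assuming POSIX rename() semantics (atomic within a single filesystem). The helper names are illustrative only and not an existing Cactus or CarpetIOHDF5 API; the sketch also assumes the output file already exists:

    /* Option 1 sketch: copy the existing HDF5 file to a temporary file,
       append to the temporary, then atomically rename it over the original.
       A crash at any point leaves the original file untouched. */
    #include <stdio.h>
    #include <hdf5.h>

    static int copy_file(const char *src, const char *dst)
    {
      FILE *in = fopen(src, "rb"), *out = fopen(dst, "wb");
      if (!in || !out) { if (in) fclose(in); if (out) fclose(out); return -1; }
      char buf[1 << 16];
      size_t n;
      while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (fwrite(buf, 1, n, out) != n) { fclose(in); fclose(out); return -1; }
      fclose(in);
      return fclose(out);           /* flush errors (e.g. ENOSPC) show up here */
    }

    int safe_hdf5_append(const char *filename,
                         int (*write_data)(hid_t file))  /* caller's output routine */
    {
      char tmp[4096];
      snprintf(tmp, sizeof tmp, "%s.tmp", filename);

      if (copy_file(filename, tmp) != 0) return -1;

      hid_t file = H5Fopen(tmp, H5F_ACC_RDWR, H5P_DEFAULT);
      if (file < 0) return -1;
      int ierr = write_data(file);
      if (H5Fclose(file) < 0 || ierr != 0) { remove(tmp); return -1; }

      return rename(tmp, filename); /* atomic replacement of the original file */
    }

The price is a full copy of the file on every output step, which is why this may be prohibitive for frequent 3D output.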

In reality, we do not keep all checkpoint files; I usually keep just the last one. I believe that a Cactus simulation will only delete checkpoint files which it has itself written, which means that there will generally be one checkpoint file kept per restart: the last one written. This means that you can always recover from the above situation by rerunning the restart during which the problem occurred. However, keeping one checkpoint file per restart is a problem in itself, and we should fix that as well, which would then reintroduce the potential for losing data in the case of an interrupted write.

Thoughts?


Comments (14)

  1. Frank Löffler

    Independent of the frequency of checkpoints and of how many you keep: in all cases you lose data. You can always 'recover' by re-running whatever is missing. The question is: how much do you lose, and can/should you recover automatically, or would it be acceptable/better to leave that to the user?

    Assuming a simulation stops because it detected a problem with I/O (e.g. disk full), I would rather not have it restart without the user explicitly telling it to. 'sim submit NAME' doesn't count as 'explicitly' to me. Depending on the situation, a user should then probably investigate what happened and take steps accordingly, e.g., re-running the last 'restart' completely if data would otherwise be missing. It would be hard to automate this correctly.

    Having said that: trying to minimize the data loss is certainly worthwhile. Starting new files every time the code checkpoints would be a nice solution. Implementing this in Cactus would touch some thorns (assuming you do the same for all output types), and while you are right about the readers, there aren't so many of them; I don't think that would be a problem. You would get all of this automatically, though, if you do a restart every time you checkpoint. Assuming you checkpoint every 6 hours, set your walltime to 6 hours and submit more restarts. Of course that is likely quite impractical on some systems that have limits on queues. What we could do is have simfactory support this: run 'RunScript' a couple of times instead of once, and set the max_walltime in the Cactus parameter file to real_walltime/count. That way you get real restarts without the queuing system knowing and without any Cactus or reader code changes. You do add some overhead from reading the checkpoints, though.

  2. Ian Hinder reporter

    Replying to [comment:1 knarf]:

    Assuming a simulation stops because it detected a problem with I/O (e.g. disk full), I would rather not have it restart without the user explicitly telling it to. 'sim submit NAME' doesn't count as 'explicitly' to me. Depending on the situation, a user should then probably investigate what happened and take steps accordingly, e.g., re-running the last 'restart' completely if data would otherwise be missing. It would be hard to automate this correctly.

    At the moment, running out of quota essentially causes all your simulations to die immediately, and all the queued restarts also die, until you have nothing left in the queue. What would be nice would be a feature to place a hold on your queued jobs in the case of low disk space. One could imagine simfactory monitoring quota usage periodically, and holding your jobs and sending you an email if you run low on quota. But I think this is a discussion for another ticket.

    Having said that: trying to minimize the data loss is certainly worthwhile. Starting new files every time the code checkpoints would be a nice solution. Implementing this in Cactus would touch some thorns (assuming you do the same for all output types), and while you are right about the readers, there aren't so many of them; I don't think that would be a problem.

    You would get an increased number of inodes used, which has been a problem in the past on datura. I usually checkpoint every 3 hours, so in a 24-hour job that is a factor of 8 increase in the number of inodes. This is really a fault of the HDF5 library: it does not guarantee that the file will be recoverable in the case of an interrupted write.

    You would get all of this automatically, though, if you do a restart every time you checkpoint. Assuming you checkpoint every 6 hours, set your walltime to 6 hours and submit more restarts. Of course that is likely quite impractical on some systems that have limits on queues.

    You know, I hadn't thought of that! I think I will start to do this. Thanks! It has the same issue with inodes though.

    What we could do is have simfactory support this: run 'RunScript' a couple of times instead of once, and set the max_walltime in the Cactus parameter file to real_walltime/count. That way you get real restarts without the queuing system knowing and without any Cactus or reader code changes. You do add some overhead from reading the checkpoints, though.

    Yes, which might be significant on some systems. The "correct" solution to the problem is to implement full journalling in HDF5. In the absence of that, we might want to consider copying the HDF5 file to a temporary file before writing to it. I wonder how bad this would be from a performance point of view. If the filesystem were in any sense "modern", it would use copy-on-write at the block level, which should make this very fast. In reality, it will probably do a full copy.
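
    As a data point on the copy-on-write idea: on Linux filesystems that support reflinks (e.g. Btrfs, or XFS created with reflink support), the copy could be made nearly free with the FICLONE ioctl (kernel 4.5 or later). This is only a sketch of what such a copy would look like, not something Cactus or SimFactory does today, and it assumes both files live on the same, reflink-capable filesystem:

        /* Clone a file using shared copy-on-write blocks where the filesystem
           supports it (FICLONE, Linux >= 4.5, e.g. Btrfs or XFS with reflinks).
           Returns -1 if cloning is unsupported, in which case the caller would
           fall back to an ordinary byte-by-byte copy. */
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>           /* FICLONE */

        int clone_file(const char *src, const char *dst)
        {
          int in = open(src, O_RDONLY);
          if (in < 0) return -1;
          int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (out < 0) { close(in); return -1; }

          int ret = ioctl(out, FICLONE, in);  /* share blocks; no data is copied */

          close(in);
          if (close(out) != 0) ret = -1;
          return ret == 0 ? 0 : -1;
        }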

  3. Erik Schnetter

    Starting a new HDF5 file for every restart is what Simfactory does by default. This is why every restart has its own directory. I implemented it this way because I was bitten by this problem in the past. People didn't like it.

    Simfactory needs a mechanism to check whether the output produced by a restart is complete and consistent, before using a checkpoint file produced by this restart. This could be implemented by Cactus writing a "tag file" that is only present while the output is consistent. That is, this tag file would be deleted before HDF5 output is started, and re-created afterwards if there were no errors. During cleanup, Simfactory would disable checkpoint files from incomplete or inconsistent restarts.
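
    A minimal sketch of such a tag-file protocol; the tag filename and helper names are invented for illustration, and the real version would live in the I/O thorns and be checked by Simfactory during cleanup:

        /* The tag file exists only while the restart's output is known to be
           consistent: delete it before writing, recreate it after a successful
           write.  File name and function names are illustrative. */
        #include <stdio.h>

        #define TAG_FILE "output-consistent.tag"

        static int create_tag(const char *path)
        {
          FILE *f = fopen(path, "w");
          return f ? fclose(f) : -1;
        }

        int do_output(int (*write_all_files)(void))
        {
          /* From here on, a crash leaves no tag and the restart is suspect. */
          remove(TAG_FILE);

          if (write_all_files() != 0)
            return -1;               /* tag stays absent: output inconsistent */

          /* Output completed without error: mark the restart as consistent.
             Simfactory's cleanup would disable checkpoints from restarts
             whose tag file is missing. */
          return create_tag(TAG_FILE);
        }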

    This requires keeping checkpoint files in the restart directories, and requires keeping checkpoint files from previous restarts. This is what Simfactory did in the past (and there is also an option to manually restart from an earlier restart). People didn't like this safety measure. Do we need to go back to it?

    To go back to it, we would discourage people from writing anywhere other than into restart-specific directories.

    If this is inconvenient, it can probably be fixed via post-processing, even automatically during the cleanup stage. That is what the cleanup stage is for...

  4. Frank Löffler

    Replying to [comment:3 eschnett]:

    If this is inconvenient, it can probably be fixed via post-processing, even automatically during the cleanup stage. That is what the cleanup stage is for...

    I honestly never used that (which is my problem). I wonder how many people do at the moment.

  5. Erik Schnetter

    If you restart, the previous restart is cleaned up automatically.

    I'll need to put more thought into automating cleanup.

  6. Ian Hinder reporter

    Starting a new HDF5 file for every restart is what Simfactory does by default. This is why every restart has its own directory. I implemented it this way because I was bitten by this problem in the past. People didn't like it.

    I agree (and always have) that this is the correct approach. It separates the different restarts, so that catastrophic corruption in one restart does not destroy the restarts before it. However, it doesn't solve the problem entirely, because by checkpointing more frequently than once per restart we have introduced a finer granularity, and recovery is based on this finer scale rather than the coarse scale of restarts, which means that you can recover a simulation from a "corrupted" restart, and not have the data from that restart.

    Simfactory needs a mechanism to check whether the output produced by a restart is complete and consistent, before using a checkpoint file produced by this restart. This could be implemented by Cactus writing a "tag file" that is only present while the output is consistent. That is, this tag file would be deleted before HDF5 output is started, and re-created afterwards if there were no errors. During cleanup, Simfactory would disable checkpoint files from incomplete or inconsistent restarts. This requires keeping checkpoint files in the restart directories, and requires keeping checkpoint files from previous restarts.

    An alternative would be to implement this in Cactus. We could add a flag to the checkpoint file which says "all data before this checkpoint file is safe". This flag would be set to "false" at the start of OutputGH and then reset to "true" once OutputGH completes successfully. Thorns which write files outside this bin could call an aliased or flesh function to indicate that they are writing files in a potentially destructive way. Cactus would only manipulate checkpoint files that it had created in the current restart, but it would mark all such files in this way. A simple implementation of this flag would be to rename the checkpoint file to ".tmp" during file writes which might corrupt data. A more complicated implementation would be to actually add information to the checkpoint file itself, or some other external file. During recovery, Cactus won't find the checkpoint files which were renamed before the catastrophic write, so will continue from the last "good" checkpoint file, which will likely be the one that the previous restart started from.
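
    A rough sketch of the "rename the checkpoint aside" variant described above; the function name, the file layout and the assumed recovery behaviour are hypothetical:

        /* Hide the most recent checkpoint behind a .tmp suffix while a
           potentially destructive write is in progress.  If the run dies
           mid-write, recovery does not see the hidden checkpoint and falls
           back to the previous "good" one.  Names are hypothetical. */
        #include <stdio.h>

        int guarded_write(const char *checkpoint,     /* e.g. "checkpoint.it_1024.h5" */
                          int (*unsafe_write)(void))  /* the output that might fail */
        {
          char hidden[4096];
          snprintf(hidden, sizeof hidden, "%s.tmp", checkpoint);

          if (rename(checkpoint, hidden) != 0) return -1;

          int ierr = unsafe_write();

          /* Only reached if the write returned; a crash leaves the checkpoint
             renamed, so recovery skips it. */
          if (rename(hidden, checkpoint) != 0) return -1;
          return ierr;
        }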

    This is what Simfactory did in the past (and there is also an option to manually restart from an earlier restart). People didn't like this safety measure. Do we need to go back to it?

    I actually reviewed the discussion surrounding this decision earlier today. The aim was to reduce disk space usage by putting all the checkpoint files in the same directory, allowing Cactus to delete all the old ones rather than keeping one per restart. However, as far as I can tell, this did not have the desired effect, because Cactus will not delete checkpoint files from previous restarts, only those from the current run. So the only benefit of using a common checkpoint directory is that it is slightly easier to handle from the command line, as all the checkpoint files are in the same place.

    We said that we would rely on the logic in Cactus to do this, since it was established and hence debugged and likely working. It just didn't do quite what we assumed it would do!

    To solve the disk space issue, we need to write new code. SimFactory could, during cleanup (i.e. between restarts), delete old checkpoint files according to some scheme. Alternatively, this could be implemented in Cactus via a new parameter such as "remove_previous_checkpoint_files". The former would work with checkpoint files in the restart directories, whereas the latter would only work if the restarts shared a common checkpoint directory.

    To go back to it, we would discourage people from writing anywhere other than into restart-specific directories. If this is inconvenient, it can probably be fixed via post-processing, even automatically during the cleanup stage. That is what the cleanup stage is for...

    You mean we would discourage people from using "../checkpoints" for the checkpoint and recovery directory? I think I agree that this would be a good thing.

  7. Roland Haas

    When implementing a checkpoint-file removal scheme, we might also want to provide the option of, say, writing checkpoints every 6 hours during the simulation, but only keeping one checkpoint for every 24-hour interval (and keeping that one permanently). This way one can use the less frequent permanent checkpoints to re-run segments of the simulation (e.g. to generate more output quantities; there's more to life than psi4, after all :-) ).

  8. Frank Löffler

    which means that you can recover a simulation from a "corrupted" restart, and not have the data from that restart.

    Shouldn't Cactus abort with a non-zero exit code in such a case (e.g. disk full)? And shouldn't simfactory then not restart, unless explicitly told to, in which case the user could decide, depending on what happened, where to restart from?

  9. Ian Hinder reporter

    Yes, simfactory should be modified to implement this feature. I just created #1286 to track this.

    Summarising the discussion:

    • A simple workaround for this problem is to checkpoint only on termination, terminate based on walltime, and use shorter walltimes to reduce the amount of time wasted in the event of a problem. To be usable, this requires job chaining to be working on the machine.
    • The use of a common "checkpoints" directory does not have any useful effect, and can be dropped, because Cactus will only delete checkpoint files written by the current job.
    • Cactus could write a "tag" file before any output which might lead to loss of data if a write error occurs. This could be provided by a flesh or aliased function; e.g. CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite. The semantics would be that if an error occurs between these two functions being called (leaving the tag in place), the simulation output (possibly just this one file) should be considered invalid. Writers of HDF5 files would be modified to call these functions (a sketch follows this list).
    • HDF5 files could be written "safely": the file would be copied to a temporary file first, then modified, then moved back atomically. This will incur a disk space and speed penalty, especially for 3D output, but this might not be large (should be measured) and will avoid having to ignore certain checkpoint files and use more CPU time to recompute data. In this case, the CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite functions would not be called.
    • SimFactory could be enhanced to "expire" checkpoints during cleanup (i.e. between restarts) according to a particular policy, e.g. to reduce disk space used. It could check tags written by Cactus to determine which checkpoint files to remove and which to link to the new restart for recovery. It would never recover from a checkpoint file for which all the data in the simulation was not determined to be "good".
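
    To make the third bullet concrete, here is a sketch of how an HDF5 writer might bracket its output with the proposed CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite calls. These functions do not exist yet; the tag-file stand-ins below only illustrate the intended semantics:

        /* Hypothetical stand-ins for the proposed flesh/aliased functions:
           mark a file as being written by creating a sibling tag file, and
           remove the mark again after a successful write. */
        #include <stdio.h>
        #include <hdf5.h>

        int CCTK_StartUnsafeWrite(const char *filename)
        {
          char tag[4096];
          snprintf(tag, sizeof tag, "%s.unsafe-write", filename);
          FILE *f = fopen(tag, "w");
          return f ? fclose(f) : -1;
        }

        int CCTK_EndUnsafeWrite(const char *filename)
        {
          char tag[4096];
          snprintf(tag, sizeof tag, "%s.unsafe-write", filename);
          return remove(tag);
        }

        /* A writer thorn would bracket its HDF5 output with the two calls: */
        int write_3d_output(const char *filename)
        {
          if (CCTK_StartUnsafeWrite(filename) != 0) return -1;

          hid_t file = H5Fopen(filename, H5F_ACC_RDWR, H5P_DEFAULT);
          if (file < 0) return -1;
          /* ... write datasets here ... */
          if (H5Fclose(file) < 0) return -1;

          /* A "<file>.unsafe-write" tag that survives a crash means this file,
             and any checkpoint taken after it, should be treated as suspect. */
          return CCTK_EndUnsafeWrite(filename);
        }
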
  10. Erik Schnetter

    The tag files should not be handled by the flesh. A global tag would invalidate all output, whereas we want to invalidate either only individual files, or a single file type (e.g. HDF5 as written by CarpetIOHDF5).

    Simfactory should not simply expire (delete) backups; it should archive them first. Everything that has been archived is then fair game for deletion.

  11. Ian Hinder reporter

    How about just introducing a convention that files should be renamed to <file>.tmp.<extension> while they are being written? If any such files are present in a restart, those files are very likely to be corrupt, and simfactory can make its decisions accordingly. We need to be careful to avoid a conflict with any "safe file write" filename convention, as those files really are temporary, and their existence does not imply that anything important is corrupt.

    Do you mean "checkpoints" or "restarts" where you have written "backups"? Are you proposing to archive checkpoint files? Does anybody do this? For a 500-core job with 2 GB/core, this is 1 TB of data per checkpoint iteration (maybe we don't checkpoint everything, so it could be perhaps half that). I don't think the archive services would be very happy with us regularly archiving terabytes of checkpoint data.

  12. Erik Schnetter

    I would not use "tmp" to indicate files are currently incomplete. What about "incomplete"? Or "$$$"?

    One either needs to store the data belonging to a simulation, or re-run the simulation if there is a problem. If archives can't handle checkpoint files, and if local data disks aren't large enough to keep checkpoint files, then re-running is the correct response. People need to be aware of this, and in this mode, all but the last (verified to be safely usable) checkpoint files should be deleted, probably semi-automatically.

    If a data disk has 300 TByte, then I would expect an archive to have 3,000 TByte. Alternatively, one could design a system with a 100 TByte data disk and 1,000 TByte of usable archive space (10 Gbit interconnect?). Currently, since no one uses the archive except to stash data they don't intend to look at any more, the archive remains difficult to use, and there is large pressure to increase data disks instead, which further reduces the need to use an archive, and thus the pressure to make it easier to use.

    I meant checkpoint files. But, thinking about this, it could equally apply to restarts, or at least large output files in restarts.

  13. Frank Löffler

    Replying to [comment:12 eschnett]:

    I would not use "tmp" to indicate files are currently incomplete. What about "incomplete"? Or "$$$"?

    "incomplete": yes. "$$$": please not; they don't indicate what this is about and they need to be escaped on the command line.

  14. Roland Haas

    Is this discussion still useful? Given that we are still waiting for HDF5 journalling support, right now the only viable option is to checkpoint only at termination (for regular-sized runs); for runs that are big enough to expect node failures while they are running, write checkpoints multiple times per simfactory segment and pray that corruption does not happen while one has the output file open (so that one does not lose the whole segment).

    Basically: since we are actively promoting simfactory, which mitigates the problem, we are no longer on the hook for fixing HDF5's problems.
