Implement a "get" command

Issue #30 new
Ian Hinder created an issue

SimFactory could provide a "get" command which would copy the lightweight data from a remote simulation to the local machine. For example, I use a script called "getsim" which rsyncs a simulation directory and excludes all known "large" files, such as checkpoints, 2D and 3D HDF5 output files, the Cactus executable, core dumps, etc. There could also be a "quick" mode which excludes even more. My script is attached as an example. This is slightly similar to archiving, but has a different purpose: it is meant to be run regularly on the local machine to keep track of a simulation, rather than to archive it permanently once it is complete.
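A minimal sketch of what such a "get" could look like, written in Python around a plain rsync call. The host name, directory layout, exclude patterns and the "quick" mode shown here are illustrative assumptions; the attached getsim script is the actual example.

    # Sketch only: host names, paths and exclude patterns are illustrative,
    # not necessarily the ones used by the attached getsim script.
    import subprocess

    # Known "large" files that should never be pulled to the local machine.
    EXCLUDES = [
        "*.chkpt.*",                      # checkpoint files
        "*.xy.h5", "*.xz.h5", "*.yz.h5",  # 2D HDF5 output
        "*.xyz.h5",                       # 3D HDF5 output
        "cactus_*",                       # the Cactus executable
        "core", "core.*",                 # core dumps
    ]

    # A hypothetical "quick" mode could exclude even more, e.g. all HDF5 output.
    QUICK_EXCLUDES = EXCLUDES + ["*.h5"]

    def get_simulation(remote, simname, local_basedir, quick=False):
        """Copy the lightweight part of a remote simulation to the local machine."""
        patterns = QUICK_EXCLUDES if quick else EXCLUDES
        cmd = ["rsync", "-avz"]
        cmd += ["--exclude=%s" % p for p in patterns]
        cmd += ["%s:simulations/%s/" % (remote, simname),
                "%s/%s/" % (local_basedir, simname)]
        subprocess.check_call(cmd)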


Comments (8)

  1. Ian Hinder reporter

    We will need some way to ensure that the retrieved data is in a consistent state. Truncated ASCII files can be dealt with (though this is not ideal), but partially-written HDF5 files cannot. This can be a serious problem if several HDF5 files are being synced, as after each sync there is a high probability that at least one of them is incompletely written. Some options:

    1. SimFactory (on the remote machine) writes a control file (either into the simulation directory or somewhere else) which indicates to the simulation that it should not open any new files for writing; once the simulation has closed any currently open file as part of normal operation, it records this in the control file, continues running, and only writes new files once the control file tells it to. This has the disadvantage of locking output for the entire simulation for the duration of the transfer of the active restart; for slow data transfers, this could be a significant amount of time. A further disadvantage is that the different files will not be in a consistent state; e.g. one output file may contain the current iteration while another may not.

    2. Before writing to a file, Cactus would move it to a new location (file.tmp) and only move it back once it was fully written. SimFactory would not sync tmp files. Any files which had been renamed to *.tmp in the first pass would then be synced in a separate pass under their original names, if they exist by then; repeat until all files are synced. This solution also does not maintain a consistent state across multiple files. It does not require write access to the simulation directory, so it could also be used by collaborators who do not own the simulation.

    3. Similar to (1), but only performed at the end of an iteration. SimFactory would indicate to Cactus to pause the simulation at the end of the current iteration, when all files are presumably valid on disk. Cactus would indicate that the simulation had paused in a control file, and SimFactory would then transfer the data, and unpause the simulation when it was finished. This would guarantee that the synced data was in a consistent state. We might want to have some mechanism to ensure that simulations do not remain paused forever, perhaps by requiring simfactory to update the control file periodically if it is still syncing.

    All of the above apply only to the active restart. I think (3) is the simplest and most robust, though it is also the most expensive in SUs. The control file location could be customisable, and placed somewhere that all collaborators have write access. A rough sketch of the handshake for (3) follows.
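    One way the handshake in (3) might look on the SimFactory side, as a sketch only: the control file names ("pause.request", "paused") and the polling protocol are invented for illustration, not an existing SimFactory or Cactus interface.

        # Hypothetical control-file handshake for option (3).
        import os
        import time

        def sync_paused(simdir, do_sync, timeout=3600):
            request = os.path.join(simdir, "pause.request")
            paused = os.path.join(simdir, "paused")

            open(request, "w").close()  # ask Cactus to pause at the end of this iteration
            try:
                waited = 0
                while not os.path.exists(paused):  # wait for Cactus to confirm the pause
                    time.sleep(10)
                    waited += 10
                    if waited > timeout:
                        raise RuntimeError("simulation did not pause in time")
                do_sync()  # transfer the data while the run is paused
                # A real implementation would also touch the request file periodically
                # during a long sync, so a stale request cannot pause the run forever.
            finally:
                os.remove(request)  # unpause the simulation, even if the sync failed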

  2. Roland Haas

    I believe option 2 will break if Cactus starts to write to a file while simfactory is syncing it. The OS will allow the rename, and rsync will happily continue reading from the now-changing file, possibly mixing old and new content. If Cactus finishes adding to the file and moves it from tmp back to its original name before rsync finishes, simfactory would not notice this (though rsync most likely would when it does its final check on the file content).

  3. Erik Schnetter

    Stopping a simulation seems dangerous. I'd prefer something safer. Here is one idea:

    - Cactus writes a timestamp or some other unique identifier for each HDF5 file into a separate file.
    - Simfactory checks this file before and after an rsync. If the timestamps differ, the file has to be copied again.

    Another, similar method would be to have Cactus generate checksums for the HDF5 files. This is expensive, as the whole file would have to be checksummed, unless there is an HDF5 facility for this. After copying the file we compare checksums.

    Both methods need to be combined with a method to avoid accessing a file while it is being updated; renaming it to *.tmp is the standard Unix way to do so.

    Yet another way would be to ask Cactus to make copies of all output files. Such copies are never changed after being created (only deleted if they become outdated), and so can be safely copied.
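    A sketch of the identifier-comparison idea, assuming Cactus wrote a small companion file (here called "<output>.stamp") containing a unique identifier for the last completed write; the companion file name and contents are hypothetical, and the local copy stands in for the per-file rsync.

        # Retry the copy of a single file until its identifier is unchanged
        # across the copy.  The ".stamp" companion file is hypothetical.
        import shutil

        def read_stamp(path):
            with open(path + ".stamp") as f:
                return f.read().strip()

        def copy_when_stable(src, dst, max_tries=5):
            for _ in range(max_tries):
                before = read_stamp(src)
                shutil.copy2(src, dst)   # stands in for rsyncing this one file
                after = read_stamp(src)
                if before == after:      # the file was not rewritten during the copy
                    return True
            return False                 # give up; the caller can warn or retry later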

  4. Frank Löffler

    I might be interested in such a 'get' command, but personally I am not so much concerned about consistency. As long as a simulation is running, I accept that files might be inconsistent with each other, and in the case of HDF5 might also be broken at times. Pausing a simulation, and wasting SUs that way, just to get a 100% reliable snapshot of the data doesn't sound like a good idea to me. Automatically checking that an HDF5 file isn't broken would be nice, though. However, making sure the file is 'current' seems unnecessary. I usually don't care about having the absolutely last timestep if it was only just written, and would like to avoid another rsync for that, especially since yet another timestep might be written during that time.

    Of course, as soon as a simulation has finished, none of this is an issue anymore anyway.

  5. Ian Hinder reporter

    I have been working like this for the last few years, and the occasional corrupted HDF5 file has not been a problem. However, recently I have started to run simulations with many more files and more frequent output, and I sometimes get into the situation where every time I sync the simulation, at least one of the files is corrupt: they are written very frequently, and if one is copied before the write finishes, the copy can be corrupt. A corrupt HDF5 file cannot be used; you cannot read the previous datasets, you cannot even list the datasets. You just have to sync again and hope for the best. I also agree that stopping the simulation is to be avoided if possible. In the far future, we might be able to create a filesystem-level snapshot of the simulation at the end of each iteration, which would ensure a consistent state that could easily be copied, but filesystems and OSes are not up to that yet.

    Re: writing a time stamp to a separate file: The timestamp on the file won't be updated until the file is closed.

  6. Frank Löffler

    I don't have a corrupted HDF5 file handy, so I have to ask: is there a quick way to check whether such a file is corrupt, maybe with 'h5ls'? If so, this would make a selective resync quite easy.

    Especially for high-frequency output, you might run into trouble with timestamps if your rsync takes longer than one output interval (and the simulation doesn't stop, which I really don't want). You'll never get all the files consistent, because by the time you copy the last one, the first has already been updated again.

    All I care about is that all files are non-corrupt. Retrying (only the corrupt files) until this is achieved sounds like a possible solution to me, and would be an incentive to use simfactory over the otherwise equally simple manual rsync.
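    A minimal sketch of such a retry loop, using h5ls as the quick, if not fully reliable, validity check; the rsync invocation, directory layout and retry limit are illustrative assumptions.

        # Re-sync only the corrupt HDF5 files until all of them are readable.
        import glob
        import os
        import subprocess

        def h5_ok(path):
            """Return True if h5ls can open the file (a cheap, imperfect check)."""
            return subprocess.call(["h5ls", path],
                                   stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL) == 0

        def resync_corrupt(remote, remote_dir, local_dir, max_passes=5):
            for _ in range(max_passes):
                bad = [f for f in glob.glob(os.path.join(local_dir, "*.h5"))
                       if not h5_ok(f)]
                if not bad:
                    return True          # every HDF5 file opened cleanly
                for f in bad:            # re-fetch only the files that failed the check
                    name = os.path.basename(f)
                    subprocess.check_call(["rsync", "-az",
                                           "%s:%s/%s" % (remote, remote_dir, name), f])
            return False                 # still corrupt after max_passes attempts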

  7. Ian Hinder reporter

    I think that would already be a good solution. Yes, when I have encountered such an error, I find that h5ls fails or lists no datasets. This may not be 100% reliable, but it's better than nothing. And yes, you will never get all the files consistent unless you lock the simulation or have a filesystem snapshot. I agree that temporal consistency is much less important than avoiding corrupt HDF5 files; I already have code to handle temporal inconsistency (two BH_diagnostics output files having different numbers of iterations, etc.).

  8. Ian Hinder reporter

    There is a tool called "h5check" (http://www.hdfgroup.org/products/hdf5_tools/h5check.html) which checks HDF5 files for consistency. It is probably too slow to run on every transfer, but it would be the most robust solution. I have just encountered an HDF5 file which gives no errors with h5ls but fails h5check. This file was being written when I ran out of quota. h5dump also fails on all the datasets I have tried in the file, so I expect the file is unusable because the header is corrupted.
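    If h5check turns out to be fast enough, the h5_ok check sketched earlier could simply call it instead of h5ls. This assumes only the basic "h5check FILE" invocation and a non-zero exit status on a non-compliant file; no other options are assumed.

        # Variant of the earlier validity check using h5check instead of h5ls.
        import subprocess

        def h5_ok(path):
            return subprocess.call(["h5check", path],
                                   stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL) == 0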
