Reduce space taken by Formaline tarballs

Issue #488 closed
Ian Hinder created an issue

When syncing lightweight data of simulations from a cluster, much of the transfer time and disk space on the local system is taken up by Formaline tarballs. It would be good to reduce this.

I propose opportunistically replacing the generated tarballs with hard links: after Formaline has written its tarballs to the new output directory, compare them against existing tarballs on the same filesystem (e.g. from previous restarts of the same simulation when using simfactory) and, where the files compare equal, replace the new copy with a hard link. Then, if an entire simulation is rsynced with the appropriate options, hard links are transferred as hard links, and the overall transfer time and disk space used are significantly reduced.
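
A minimal sketch of what this could look like (the function and argument names are hypothetical, not existing simfactory code), assuming Python and tarballs named *.tar.gz:

```python
# Hypothetical post-write deduplication step: after Formaline has
# written its tarballs into a new restart's output directory, replace
# each one with a hard link to a byte-identical tarball from an
# earlier restart on the same filesystem.

import filecmp
import os

def link_duplicate_tarballs(new_dir, old_dirs):
    """Replace tarballs in new_dir with hard links into old_dirs."""
    for name in os.listdir(new_dir):
        if not name.endswith(".tar.gz"):
            continue
        new_path = os.path.join(new_dir, name)
        for old_dir in old_dirs:
            old_path = os.path.join(old_dir, name)
            # Only link if the contents compare equal byte-for-byte
            if os.path.isfile(old_path) and filecmp.cmp(
                    old_path, new_path, shallow=False):
                os.unlink(new_path)
                os.link(old_path, new_path)  # same filesystem required
                break
```

With the tarballs collapsed into hard links, rsync's -H (--hard-links) option transfers each set of linked files only once and recreates the links on the destination, so both transfer time and local disk use shrink.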

Keyword: Formaline

Comments (16)

  1. Erik Schnetter

    Formaline already has a global cache directory that it uses for executables. This saves space even across different simulations. We could use this for the Formaline tarballs as well.

  2. Ian Hinder reporter

    I was originally going to suggest this, but then I thought it might start to take up too much space. The files could be named according to their hashes in the cache directory, and when Formaline creates the tarballs that are embedded into the executable, it could compute the hash as well. This way, Formaline never even has to write the tarball to disk if it finds a cached tarball with the matching hash.
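
    A sketch of this hash-addressed variant (the cache layout and names are illustrative assumptions): compute the tarball's digest while the data is still in memory, and only write it out if the cache does not already hold that content.

    ```python
    # Hypothetical hash-addressed cache: name cached tarballs after
    # their content hash, so a tarball whose content is already cached
    # never needs to be written again.

    import hashlib
    import os

    def store_tarball(data, dest_path, cache_dir):
        """Write tarball bytes once per unique content; hard-link elsewhere."""
        digest = hashlib.sha256(data).hexdigest()
        cached = os.path.join(cache_dir, digest + ".tar.gz")
        if not os.path.exists(cached):
            with open(cached, "wb") as f:  # first time we see this content
                f.write(data)
        os.link(cached, dest_path)  # further copies cost no extra space
    ```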

  3. Roland Haas

    I am not sure I am so happy with the idea of having hard links. I would like to keep the Formaline logic as simple as possible, to avoid the risk of the source-code tarball in the executable differing from the source code that was actually used. The total tarball size is 40 MB for the full Lovelace ET checkout.

  4. Erik Schnetter

    Formaline already uses hardlinks e.g. for executables and for checkpoint files. This just adds another use case. Such space optimizations were present from the beginning, since they were very important -- otherwise it would be too expensive to keep simulations and restarts independent.

    Tarballs will be larger for production runs involving additional, non-ET thorns. People will also want to handle several (many) simulations, and the sizes add up.

    Simfactory assumes that, once a restart has finished (and has been cleaned up), its contents are not modified any more. It is okay to add more content, but it is not okay to modify or remove files.

  5. Roland Haas

    To clarify: are we talking about Formaline or simfactory? Formaline should not deal with executables or checkpoints, simfactory not with tarballs.

    I was actually going to suggest (in a different ticket) making the simfactory executables copies instead of hard links. Otherwise, changing the executable via a cp command (e.g. when using simfactory for debugging runs, which are the vast majority of all runs I run; other users might differ, of course) is rather surprising in that all hard-linked executables change, which can in fact cause running simulations to fail (since the disk image that the OS used as paging store changed).

    I fully agree that checkpoints in simfactory are too large to copy, but I am not sure the same argument applies to the <200 MB of executable and tarballs. I tend to generate multiple GB of output for even the smallest useful simulation, so I am not sure whether saving 200 MB is worth the added headache of having to deal with a more complicated and less inherently robust system.

    Please don't get me wrong: if others are more concerned about saving disk space, then I can live with the proposed changes; they just would not be my first choice.

  6. Erik Schnetter

    We are discussing Simfactory here. The Formaline tarballs are only one of the issues. With this patch, Simfactory will combine all possible large files via hardlinks, not just Formaline tarballs. This will be done only within each simulation.

    One use case of Simfactory is to run benchmarks. This involves producing many small simulations, and the overhead of copying executables is significant and prohibitive, as the benchmark output is much smaller than the size of the executables.

    I see and understand the problem with hard links. One crazy solution could involve git, which internally collapses identical files automatically. Of course, having both the repository and a checkout present will double the disk space (for compressed files) -- is there a way to use git without a local repository? We could also just wait until deduplicating file systems are more common in HPC.

  7. Ian Hinder reporter

    I currently disable syncing of Formaline tarballs because they take up so much space and time. This defeats the whole purpose of Formaline. The proposed solution in the patch only eliminates duplicated data from within a single simulation. In the case of running benchmarks, this does not help, as typically there will only be one restart per simulation. Similarly for small debugging runs. The current solution is OK, but it would be much better if the files were shared across simulations. They could live in the CACHE directory, just as executables do. Is there a reason to treat them differently? When rsyncing simulations, the CACHE directory could be synced as well, and with suitable options, rsync will respect the hard links.

    Regarding the code: it makes an assumption about the name of the output directories. This should instead be obtained from a function that is used throughout simfactory. I know this assumption is made frequently, but for new code, we should try to do better.
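
    Something like the following helper could centralize that convention (the function name and the "output-NNNN" pattern are assumptions about the layout, not the actual simfactory API):

    ```python
    # Hypothetical central helper: any code that needs a restart's
    # output directory asks this function instead of hard-coding the
    # naming pattern itself.

    import os

    def restart_output_dir(sim_dir, restart_id):
        """Return the output directory of restart number restart_id."""
        return os.path.join(sim_dir, "output-%04d" % restart_id)
    ```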

    Please also see #127 if you think that simfactory 2 puts hardlinks to the executable in each restart.

  8. Erik Schnetter

    Yes, please, just to make sure things don't blow up. (Python detects errors only at run time.)

  9. Ian Hinder reporter
    • changed status to open
    • assigned issue to

    I can do so (when I get back from travelling in three weeks), but I am unhappy about two aspects of the solution:

    1. SimFactory is learning about a detail of Formaline, as well as modifying existing restarts, when I think this job could easily be done by Formaline as it went along. If Formaline knew where to look for existing tarballs (e.g. by having a global directory, or a path pattern in which to search), then it could do the hard-linking at the time the tarball is created.

    2. This solution doesn't address the problem with benchmarks, which typically only have one restart per simulation, and which usually have many simulations for each source tree. Using a common cache directory would solve this problem (either in simfactory, or if 1 is implemented, in formaline).

  10. Erik Schnetter

    Ad 1: I think Simfactory should use a criterion that is not "Formaline". For example, Simfactory could treat all files larger than 1 MByte this way, or all archives, etc.
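
    A sketch of such a size-based pass (the threshold constant and function name are assumptions): hash every file above 1 MByte across the restarts of a simulation and hard-link identical copies together.

    ```python
    # Hypothetical cleanup pass: collapse byte-identical files larger
    # than 1 MByte into hard links, regardless of whether they came
    # from Formaline.

    import hashlib
    import os

    THRESHOLD = 1024 * 1024  # 1 MByte, as suggested above

    def dedup_large_files(restart_dirs):
        seen = {}  # content digest -> first path seen with that content
        for top in restart_dirs:
            for root, dirs, files in os.walk(top):
                for name in files:
                    path = os.path.join(root, name)
                    if os.path.getsize(path) < THRESHOLD:
                        continue
                    h = hashlib.sha256()
                    with open(path, "rb") as f:
                        for chunk in iter(lambda: f.read(1 << 20), b""):
                            h.update(chunk)
                    first = seen.setdefault(h.hexdigest(), path)
                    if first != path:
                        os.unlink(path)
                        os.link(first, path)  # collapse into one inode
    ```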

    This should not happen in Formaline, because it is much more difficult to do this in C than in Perl, and much more difficult to do this reliably at run time on an HPC system than after the fact. Also, doing this in Simfactory means that one can save space (again) after a simulation has been moved to a different system.

    Ad 2: Using Simfactory's cache directory seems like a good idea. However, there is as yet no mechanism to "clean the cache" periodically, i.e. to remove old files that are present only in the cache. Also, the cache can hold only a single file for each name (but this may be good enough for Formaline tarballs).

    Note also that git already does all of these things (compress, de-duplicate). Maybe we should store the tarballs in git, or drop the tarballs in favour of having the source code in git. Formaline would still output them, but once Simfactory verified that the source tree is safely stored in git, it could delete the tarballs.
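
    For what it's worth, a speculative sketch of the git variant (the repository location and flow are assumptions): git's object store de-duplicates identical content by construction, so storing each tarball as a blob in a shared bare repository costs space only once per unique content.

    ```python
    # Hypothetical archiving step: write a file into a shared bare git
    # repository as a blob and record the resulting object id.

    import subprocess

    def archive_in_git(path, repo_dir):
        """Store the file as a git blob; return its object id."""
        # 'git hash-object -w' writes the blob into the object store
        # and prints its id; identical content maps to the same id.
        oid = subprocess.check_output(
            ["git", "--git-dir", repo_dir, "hash-object", "-w", path])
        return oid.strip().decode()
    ```

    One caveat, echoing the compression point above: gzip-compressed tarballs do not delta-compress well against each other in git, which is an argument for storing the source tree itself rather than compressed archives.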
