Support compressed (tar.gz) testsuite data

Issue #566 closed
Frank Löffler created an issue

This patch to Cactus adds support for compressed testsuite data.

It assumes that if a thorn contains no directory with the reference data for a particular test, the data might instead be in a correspondingly named file (for test.par, a missing test/ output directory might mean that there is a test.tar.gz). When running the testsuites, Cactus would decompress the tarfile to TEST/config/thorn/test.orig and compare that against the new results (in TEST/config/thorn/test, as usual). The test.orig directory remains within TEST for easier debugging and comparing later, but is overwritten on each re-run of the test.
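The lookup order described above could be sketched roughly as follows. This is not the code from the patch: the function name `prepare_reference_data` and its three arguments are made up for illustration, and it assumes a tar that understands `-z` and `-C` and a tarball whose top-level entry is the test directory itself.

```shell
# Hypothetical helper, not taken from the patch.
# $1: the thorn's test directory, $2: test name (test.par -> "test"),
# $3: where extracted data goes, e.g. TEST/config/thorn
# Prints the directory that holds the reference data.
prepare_reference_data() {
    testdir=$1 name=$2 dest=$3
    if [ -d "$testdir/$name" ]; then
        # A plain directory has priority over a tarball.
        printf '%s\n' "$testdir/$name"
        return 0
    fi
    if [ -f "$testdir/$name.tar.gz" ]; then
        scratch="$dest/$name.extract"
        # Drop any test.orig left over from an earlier run.
        rm -rf "$dest/$name.orig" "$scratch"
        mkdir -p "$scratch" || return 1
        tar -xzf "$testdir/$name.tar.gz" -C "$scratch" || return 1
        mv "$scratch/$name" "$dest/$name.orig" || return 1
        rmdir "$scratch"
        printf '%s\n' "$dest/$name.orig"
        return 0
    fi
    # Neither a directory nor a tarball: no reference data.
    return 1
}
```

The caller would then compare the printed directory against TEST/config/thorn/test in the usual way.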

Using ET_2011_05 as an example, compressing all testsuites this way and disregarding the svn metadata reduces the number of files within arrangements (following symlinks) from 16k to 4.8k, and the size from 220MB to 112MB. The svn metadata is likely to roughly double these numbers. Using this would therefore save not only space, but also a lot of time when syncing between machines.

I propose to add support for this before the next ET release, but to only actually use it in thorns after this.

Commands you may find useful when testing this:

Compress all testsuite data (from within arrangements/):

for i in $(find -L -maxdepth 3 -mindepth 3 -type d -name test); do cd "$i"; find -maxdepth 1 -mindepth 1 -not -name .svn -type d | awk '{print $1".tar.gz "$1;}' | xargs -n 2 -P 4 tar -czf; cd -; done

After that: remove the test data directories (because they have higher priority than the tar.gz files):

find -L -maxdepth 4 -mindepth 4 -type d | grep -v .svn | grep -e '\/test\/' | xargs rm -rf
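Since the removal step is destructive, it may be worth checking first that a tarball really reproduces its directory. A minimal sketch, assuming a tar that understands `-z` and `-C`; the helper name `check_tarball` is made up here and is not part of the patch:

```shell
# Hypothetical helper: compare a test data directory with its tarball.
# $1: directory (e.g. ./mytest), $2: tarball (e.g. ./mytest.tar.gz)
# Returns 0 if the extracted tarball matches the directory.
check_tarball() {
    dir=$1
    tarball=$2
    scratch=$(mktemp -d) || return 1
    if ! tar -xzf "$tarball" -C "$scratch"; then
        rm -rf "$scratch"
        return 1
    fi
    # diff -r prints any differing or missing files.
    diff -r "$dir" "$scratch/$(basename "$dir")"
    status=$?
    rm -rf "$scratch"
    return $status
}
```

It would be called from within a thorn's test/ directory, for example as `check_tarball ./mytest ./mytest.tar.gz`, before running the removal command.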

Comments (10)

  1. Roland Haas
    • removed comment

    Tarring and zipping the test data will make diffs of testsuites useless (at least with the straight svn diff command). Also, every change to a testsuite will require the full tarball to be recommitted, since gzip spreads information about the tarball throughout the file.

    Instead, just tarring the data might be sufficient. For rsync one can add the -z flag to compress data during transfer, and at least I usually have no problem with file size quotas on clusters, only with the number of files. Just tarring also reduces the number of small files, so it effectively saves space on disk as well. Svn commits would still be efficient, though diffs will still not work (or only very strangely).

  2. Erik Schnetter
    • removed comment

    Reducing the number of files is probably a good idea. Ideally, one could use HDF5 to combine various output files into one; lacking support for this, tar.gz seems like an easy solution.

    One cannot use "tar" in a script, since "tar" may not be a good tar implementation. One should use gnutar or gtar if they are available, and on some platforms one explicitly has to look for a good tar.

    Why is there a "why?" comment in the patch? The need for this line seems obvious.

  3. Frank Löffler reporter
    • removed comment

    The 'why' should go - that is a left-over.

    It is true that this would make 'svn diff' on testsuite data kind of useless. However, I would personally be willing to trade this.

    Concerning MIME: It's nice for something like email, which was designed to contain only text. Using MIME, you can also send binary blobs, but I wouldn't count the resulting text as readable.

    'tar' alone might be a problem, right. We have the same issue with the build scripts for libraries. I propose to change the patch to use something similar to what is done there, as in:

    ` TAR=$(gtar --help > /dev/null 2> /dev/null && echo gtar || echo tar) `

  4. Ian Hinder
    • removed comment

    I am also concerned about losing the ability to diff between commits. Is there a real-world case where the current system causes problems?

  5. Frank Löffler reporter
    • removed comment

    I do have a couple of complete Cactus checkouts - and at least three of them I usually sync to various machines. This can create space issues on /home directories with quotas. This is mainly a space-issue then.

    The other problem is machines where single-file IO is slow, so that a simfactory sync is painfully slow because we have so many files, and most of these files are actually testsuite data files. I do want to transfer these files, but I usually don't need to have them plain within my checkout.

  6. Ian Hinder
    • removed comment

    I think that keeping the tar files in version control is bad because you lose the ability to diff the files. When regenerating data, you cannot easily see what has changed as a final check before committing in your version control tool. It also means that regenerating test data is more tedious, as there are more steps to go through (making the directory, tarring). This might not seem like much but when you are doing this for four or five tests at a time it quickly adds up.

    I propose that the tests are kept in version control as individual files. We can provide a script/make target which tars and compresses all the test output and keeps the tar files in parallel with the untarred files. Users who run into problems on remote machines can use this script. When syncing, these users can exclude the individual files, and sync the tar files. We could provide an example rsync-excludes pattern in SimFactory for this. The current patch would enable Cactus to recognise the tar files on the remote machine and use them. This has the following advantages:

    • The current simple behavior does not change for most users;
    • Regenerating test data is no more tedious than it currently is;
    • The version control system will have maximum knowledge about what has changed between two commits;
    • Syncing will transmit only the compressed single files, solving both the space and file number problem on the remote system.

  7. Frank Löffler reporter
    • changed status to resolved
    • removed comment

    I added a test for gtar/tar and a test for whether that tar actually supports the decompression. I also added bzip2 and xz decompressors, since it was trivial, as well as support for uncompressed tarballs.
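For illustration, picking a tar and a decompressor could look roughly like the sketch below. It is not the committed code: the function names and the suffixes `.tar.bz2` and `.tar.xz` are assumptions, and instead of testing whether tar itself supports decompression it pipes through external `gzip`, `bzip2` or `xz`, which is a different (simpler) technique than the check described above.

```shell
# Hypothetical helpers, not taken from the patch.

# Prefer GNU tar under its usual alternative names, fall back to plain tar.
find_tar() {
    for candidate in gtar gnutar tar; do
        if command -v "$candidate" > /dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Map a tarball suffix (assumed naming) to a command that decompresses
# stdin to stdout; uncompressed tarballs are passed through with cat.
decompressor_for() {
    case $1 in
        *.tar.gz)  echo "gzip -dc" ;;
        *.tar.bz2) echo "bzip2 -dc" ;;
        *.tar.xz)  echo "xz -dc" ;;
        *.tar)     echo "cat" ;;
        *)         return 1 ;;
    esac
}

# Extract tarball $1 into the existing directory $2.
extract_tarball() {
    tarball=$1 dest=$2
    tar_cmd=$(find_tar) || return 1
    unpack=$(decompressor_for "$tarball") || return 1
    # $unpack is left unquoted on purpose so "gzip -dc" splits into words.
    $unpack < "$tarball" | "$tar_cmd" -xf - -C "$dest"
}
```

Decompressing outside of tar sidesteps the question of which compression flags a given tar implementation accepts, at the cost of requiring the separate tools to be installed.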
