Formaline circumvents git's file change caching

Issue #1900 closed
Roland Haas created an issue

Git contains a heuristic that considers a file unchanged if its modification time is the same as when it was last added to the repository, thus avoiding having to diff the file against its repository copy. Formaline commit 68dddf3 circumvents this by using plumbing commands that always read the file to compute a hash, potentially slowing down Formaline significantly.
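
The difference between the two behaviours can be sketched as follows. This is a simplified illustration in Python, not Formaline's or git's actual code: `StatCache` is a hypothetical helper that only compares size and modification time, whereas git's index compares more stat fields. `git_blob_hash` reproduces how git hashes a blob (SHA-1 over a `blob <size>\0` header followed by the file contents).

```python
import hashlib
import os


def git_blob_hash(path):
    """Hash a file the way git hashes a blob: SHA-1 over header + contents."""
    with open(path, "rb") as f:
        data = f.read()
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class StatCache:
    """Hypothetical cache: re-hash a file only if its size or mtime changed."""

    def __init__(self):
        self._entries = {}   # path -> (size, mtime_ns, digest)
        self.hash_count = 0  # how often a file actually had to be read

    def hash_for(self, path):
        st = os.stat(path)
        key = (st.st_size, st.st_mtime_ns)
        cached = self._entries.get(path)
        if cached is not None and cached[:2] == key:
            # Metadata unchanged: trust the cached digest, skip reading the file.
            return cached[2]
        # Metadata changed or file not seen before: read and hash the contents.
        self.hash_count += 1
        digest = git_blob_hash(path)
        self._entries[path] = key + (digest,)
        return digest
```

Calling `git_blob_hash` on every file for every build corresponds to the slow path described in this issue; going through something like `StatCache.hash_for` corresponds to the cached path, where unchanged files are never opened.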

Keyword: Formaline

Comments (8)

  1. Erik Schnetter

    I believe that 68dddf3 was designed by Roland Haas. Roland, do you have an idea for achieving the same effect without using the plumbing commands?

  2. Roland Haas reporter

    Yes, the patch that broke this is mine. The idea behind the patch was to avoid having to make hard links when adding files to the repo (both for hoped-for speed and because some file systems do not allow hard links). I actually looked at the git source code (complicated) to try to find out whether there is a way to use git's cache of metadata when using the plumbing commands. So far I have not found anything that would let me do that.

  3. Erik Schnetter

    I think the current implementation is significantly slower than the previous one. Do you have a suggestion for improving speed? Otherwise I'd be tempted to revert the patch.

  4. Roland Haas reporter

    I have no suggestion. The problem with the previous one is that it relies on hard links between directories, which not all file systems support (well, OK: one file system does not support them, but it is on a cluster that I use).

    If you want to revert the patch in master then that is fine with me, since I have no real better solution so far. However, if this breaks (rather than slows down) Formaline on e.g. loewe, then I think we need a workaround for the failure on loewe.

  5. Ian Hinder

    BeeGFS does not support hard links as used by Formaline, as I understand it. This is becoming a very popular filesystem for HPC, and I think we need to support it. The new AEI NR cluster uses BeeGFS. How hard would it be to include a workaround? For example, use hard links if possible, but fall back to the slow method if not?
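
The "try hard links first, otherwise fall back" pattern suggested here could look roughly like the sketch below. It is written in Python for illustration only and is not Formaline's code: `link_or_copy` is a hypothetical helper, and the fallback shown is a plain file copy standing in for whatever slower method is actually used.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Try to hard-link src to dst; fall back to copying if linking fails.

    Returns "link" or "copy" so the caller can report which path was taken.
    """
    try:
        os.link(src, dst)
        return "link"
    except FileExistsError:
        # An existing destination is a real error, not a missing feature.
        raise
    except OSError:
        # Assumption: any other OSError means the file system does not
        # support hard links for this pair of paths.
        shutil.copy2(src, dst)
        return "copy"
```

The return value makes it possible to warn once when the slow path is in use, so that a slowdown on a file system without hard-link support is at least visible to the user.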

  6. Roland Haas reporter

    A workaround is probably simple (since the change itself is simple). I am not sure how large the slowdown is, because each time a file changes (and thus all files are currently re-hashed) the tarball is also recreated, causing all files to be read and tar-gzipped. It is possible that the slowdown is more noticeable when updating the git repo and creating the tarballs happen at different times, since the OS caches will have been flushed in between. When updating the git repo happens just after creating the tarball, all files should still be in cache and any slowdown would be due to increased CPU load, which should cost much less than IO.
