Rewrite git handling in Formaline

Issue #1832 closed
Erik Schnetter created an issue

Handle git repositories in a single script, vastly simplifying the code. This also correct several error or inconsistencies along the way.

  • tar files are no longer recored in git (they lead to very large repositories)
  • the code is now serial (since git is serial anyway), so that locking the repo is no longer necessary
  • use the flesh's TAR variable instead of looking for tar
  • correct paths returned by "find" (git is picky about double slashes)

See https://bitbucket.org/cactuscode/cactusutils/pull-requests/4/handle-git-repositories-in-a-single-script/diff.

Fixes #1830.

Keyword:

Comments (7)

  1. Ian Hinder
    • removed comment

    Suggestions: * Formaline is using the SILENT environment variable. Should it instead use the VERBOSE variable that Cactus now uses? Rather than parsing the user-supplied value, can it get the information pre-parsed from the build system? Or is this in fact what SILENT now is? * Is it necessary to perform the size check on the repo before garbage collecting? As I understood it, "git gc" performs its own check for whether it is necessary or not. * It would be nice to have a paragraph at the top of the script explaining what the script does.

    Comments: * I would have preferred that the long main function were split into subfunctions, increasing modularity and reducing variable scope, which would also make it easier to see the individual steps, and to understand their inputs and outputs.

    I have looked through the code and don't see any major causes for concern, and would be fine with it being committed. However, this close to a release, committing a rewrite might not be a good idea, since this will be used by everyone building the ET. Committing after the release would be fine. If we decide to do that, is it also possible to fix #1830 in the current version easily, or would that require a substantial effort?

  2. anonymous
    • removed comment
    • Formaline used the SILENT variable before. We can change this, but that wasn't the point of this rewrite.

    • Yes, it is necessary to do such a size check. "git gc" will perform a garbage collection every time you run it. It also isn't run automatically.

    • The individual steps are separated by blank lines and described via a brief comment. Each step uses almost all of the "configuration variables" that are passed in to the script, namely ($git_cmd, $git_repo, $git_master_repo, $git_local_repo, $git_central_repo, $git_root, $build_id, $config_id). Creating subroutines and passing these variables probably wouldn't help with clarity.

  3. Ian Hinder
    • removed comment

    Replying to [comment:4 anonymous]:

    • Yes, it is necessary to do such a size check. "git gc" will perform a garbage collection every time you run it. It also isn't run automatically.

    See the documentation for the --auto flag at https://git-scm.com/docs/git-gc, as well as the gc.auto configuration variable. My point is that git has facilities for determining whether gc is necessary or not. I think it would be good to use these facilities, unless they are found to have poor performance for the usage pattern here, in which case we should implement it ourselves.

    • The individual steps are separated by blank lines and described via a brief comment. Each step uses almost all of the "configuration variables" that are passed in to the script, namely ($git_cmd, $git_repo, $git_master_repo, $git_local_repo, $git_central_repo, $git_root, $build_id, $config_id). Creating subroutines and passing these variables probably wouldn't help with clarity.

    The main reason I dislike such a pattern is that it is very hard to reason about locality of effect; if I set a variable in such a block, it might have an effect several pages down. This can be helped by introducing local variable blocks, but typically this is not how this style is used. Variables that are "common" to each chunk usually are declared at the start of the function and then "reused". Since the interface to the block of code is not enforced (e.g. "this block uses ($git_cmd, $git_repo, $git_master_repo, $git_local_repo, $git_central_repo, $git_root, $build_id, $config_id).", it is not clear immediately to the reader that this is the case. And it is likely to lead to some exceptions. Also, I think this style encourages copy and paste syndrome. Since a block may be slightly coupled to its environment, and not in a clear way via a well-defined interface, it is hard to then extract such a block of code for re-use elsewhere, so the code is copied and pasted to the new location. It is also impossible to see at a glance whether any of these variables are modified by any of these chunks; one has to read through the entire function.

    In this case, I think a sequence of function calls that passed in a dict (or hash, in perl?) of the common arguments would be a lot clearer.

    Looking back at the code now (of course after writing the above!), I see that in this case, the main function is only a couple of pages long, so it's maybe borderline. It's still a bit long for my taste; I prefer a function to fit "on a screen", so I can see the input, the output, and what is done, all at the same time. I just wanted to point out the above reasons for my preference for modular code.

  4. Log in to comment