GetComponents has trouble using svn in parallel with recent versions of svn

Issue #1395 closed
Erik Schnetter created an issue

I receive svn errors when I use

./GetComponents --parallel https://svn.einsteintoolkit.org/manifest/branches/ET_2013_05/einsteintoolkit.th

for the initial checkout, as described in the tutorial. Things work without the --parallel option.

Comments (46)

  1. Roland Haas
    • removed comment

    It currently (as of 5min ago) works for me. Both on queenbee (qb3) and on my workstation. svn version is: 1.1.4 (r13838) on qb3 and 1.6.17 (r1128011) on my workstation.

    Can you provide the text of the error message?

  2. Erik Schnetter reporter
    • removed comment

    After a few seconds, I see this message for a fresh checkout on my laptop:

      Checking out module: .
          from repository: http://svn.cactuscode.org/Utilities/branches/ET_2013_05
                     into: Cactus
                       as: utils
    
    Warning: Could not checkout module .
    svn: E155037: Previous operation has not finished; run 'cleanup' if it was interrupted
    

    Ben Bernard also sees errors about git repositories.

  3. Frank Löffler
    • removed comment

    We should find out what causes these problems. It might not be GetComponents after all. I use --parallel all the time and never had these problems. I just tried on a low-latency machine and a high-latency machine, and neither had any problem. For people who seem to have problems: please make sure there is no previous checkout, and use the following commands to produce a log file for GetComponents (bash):

    ./GetComponents --verbose --parallel https://svn.einsteintoolkit.org/manifest/branches/ET_2013_05/einsteintoolkit.th > checkout.log 2>&1
    svn --version >> checkout.log 2>&1
    

    This logfile will be quite big (1.5MB). Make sure the error occurred, and if so, please compress it ('gzip -9' gets it down to 88kB) and attach it to the ticket.

  4. Erik Schnetter reporter
    • removed comment

    Apparently I can reproduce the error at work (but not at home). Log file attached.

  5. Frank Löffler
    • removed comment

    Is this all on your laptop, so that it seems to depend on the network it is attached to?

  6. Frank Löffler
    • removed comment

    Do you have access to other machines at work where you could try this? Maybe something on this network either tries to limit traffic or cannot cope with parallel connections? Do we have anyone else with this problem?

  7. Erik Schnetter reporter
    • removed comment

    This is the network at Perimeter, it is generally pretty good.

    Yes, a similar error was reported on the mailing list, which prompted me to investigate and then confirm this problem.

  8. Frank Löffler
    • removed comment

    Wasn't that for git repositories? I would expect these to be independent (although we don't know the real problem yet, so we cannot be sure).

    Roland said he couldn't confirm the problem with git on queenbee. However, I just now could. The problem is found (git's ever-changing user interface), and a patch is currently being tested.

  9. Roland Haas
    • removed comment

    Uh actually. While working on this with Frank, I realized that I had compiled my own version of git on QueenBee which is newer than the system provided version (though for different reasons than not being able to check out the ET). So that would explain why it worked for me. Sorry for the confusion.

  10. Frank Löffler
    • removed comment

    The attached patch works around a change in the git interface. Version 1.6.1.3 (queenbee) doesn't understand the shorthand of

    git checkout NAME
    

    for

    git checkout -b NAME origin/NAME
    

    yet. So, I changed GetComponents to use the full command, which also works with newer versions of git. Tested and works on queenbee. Please review for backport to ET_2013_05 (and, obviously master as well).

  11. Frank Löffler

    Ok, changing this ticket to the problems Erik has. Ideally we would have a separate ticket, but since the problems are already described here and the log is also attached, so what. Setting to 'minor' unless we also see this some place else. Also, please speak up if you have any idea what the problem might be. It's pretty hard to work on this if you don't see the problem yourself.

  12. Frank Löffler
    • removed comment

    Either --parallel is working or not. I use it all the time (and would like to in the future) and didn't see problems lately. Plus, so far we have this failure only at the PI network and the tutorial talks about Queenbee, where it does work. So really, at the moment I don't see a need to remove --parallel - not from the tutorial and not from GetComponents.

    However, I would be interested to find out what goes on at PI. Can you check whether running several independent 'svn co' commands in parallel works outside of GetComponents? As I see it, we need more information on this issue. Something makes Subversion abort, and it would be nice to find the cause.

  13. Frank Löffler
    • removed comment

    Works without a problem for me, every time (svn version 1.6.17). Maybe you should report this to the Subversion project?

  14. Frank Löffler
    • removed comment

    It could be that this problem is introduced because svn changed from .svn subdirs in every subdir of a checkout to a single .svn at the top of the checkout. When you request a checkout into a target directory, svn very likely checks if this isn't already a checkout. With the old system it only had to check for an existing .svn in that subdirectory. In our case there wasn't any. Now it has to 'go up', however, and might find one which at that time is 'in use', thus the error. If this is indeed how this was introduced, this is not very likely to go away in future versions.

    GetComponents has to be able to deal with that. We could either teach it that some versions of svn apparently have this problem, and disable parallel checkouts there (not my favorite, especially since this might mean all versions >=1.7), or we teach GetComponents that there are repositories that live 'within' other repositories and that the 'parents' have to be checked out first and completely before the 'children' can proceed. Right now the only such repo is the flesh.

    We could add a new keyword to CRL for this. One such possibility could be adding a !PRIORITY keyword to a section, specifying a number between 0 and 100. Default would be not specified, and thus, 0. Nothing would change in the ET thornlist, except that for the flesh there would be an additional line

    PRIORITY = 100
    

    Only components with an identical number would be checked out in parallel.
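
    The batching rule proposed above can be sketched as follows. This is only an illustration in Python (GetComponents itself is not written this way): the function name and the (name, priority) input shape are hypothetical, and processing higher numbers first is an assumption inferred from the flesh getting 100 while everything else defaults to 0.

```python
from itertools import groupby


def batches_by_priority(components):
    """Group (name, priority) pairs into batches, highest priority first.

    Hypothetical helper: each returned batch could be checked out in
    parallel, but a batch would only start after the previous one finished.
    """
    # sorted() is stable, so names keep their listed order within a batch.
    ordered = sorted(components, key=lambda c: -c[1])
    return [[name for name, _ in group]
            for _, group in groupby(ordered, key=lambda c: c[1])]


if __name__ == "__main__":
    thornlist = [("ThornA", 0), ("flesh", 100), ("ThornB", 0)]
    for batch in batches_by_priority(thornlist):
        print(batch)
```

    With this ordering the flesh forms a batch of its own, and all components without an explicit priority end up together in the last batch.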

  15. Frank Löffler
    • removed comment

    Of course it would be interesting to know why you see this only on the PI network. A possible reason could be that other networks aren't as fast and don't trigger the race. One reason why I don't see this anywhere could be that I don't use version 1.7 on any system yet.

  16. Erik Schnetter reporter
    • removed comment

    CVS seems to have the same problem, according to a comment in GetComponents' source code. We could apply the same solution, which seems to be to check out into a temporary directory.

  17. Frank Löffler
    • removed comment

    An interesting idea. But how would you handle the case if one of the thorns is checked out earlier than the flesh, then the flesh finishes and needs to be moved to live in a parent directory of the thorn? You cannot move the whole flesh checkout, since the target already exists (and contains the thorn). You could only copy everything within the flesh-checkout (including .svn), but that also doesn't sound so clean to me.

  18. Frank Löffler
    • removed comment

    Plus, depending on the location of the temporary checkout relative to the target, the OS might need to copy the contents of .svn file by file instead of just updating the directory inode, resulting in a similar race again.

  19. Frank Löffler
    • removed comment

    For the moment I added a note to the tutorial. With this taken care of, we can start being serious again and instead of removing a feature when it breaks we could fix it.

  20. Roland Haas
    • removed comment

    Question for Ian: did you not also experience some issues with svn and very parallel checkouts where you tried to check out each folder in a repository individually? I.e., IOUtil/src, IOUtil/doc, ...? Speaking of fixes, I don't think that the priority option that Frank suggested is a good solution. While it may work and is simple to implement, it would seem quite hard to set up and requires a lot of user intervention (as in us having to come up with correct priorities). Instead, GetComponents should be able to deduce priorities itself if at all possible.

  21. Frank Löffler
    • removed comment

    I would also prefer an automatic version of all this. It might not work for generic cases, but for all of the cases we are interested in. The one and only case we are really interested in is the flesh, which is checked out as '.' into $ROOT. GetComponents would have to create the final path for each component using TARGET, CHECKOUT and possibly NAME, and look for parent/child relationships in there. Then, parents have to be checked out first, and finish before children. Again, the only such relation I see in our cases would be the flesh (and everything else).
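
    A minimal sketch of that parent/child detection, written in Python purely for illustration. The helper names are hypothetical, and the final path is assumed to be simply TARGET joined with NAME (or CHECKOUT); the real rules in GetComponents may differ.

```python
from pathlib import PurePosixPath


def final_path(target, name):
    """Join TARGET and NAME (or CHECKOUT); a '.' component is dropped."""
    return PurePosixPath(target) / name


def find_parents(paths):
    """Map each final path to the other final paths that contain it."""
    return {str(child): sorted(str(p) for p in paths if p in child.parents)
            for child in paths}


if __name__ == "__main__":
    paths = [
        final_path("Cactus", "."),  # the flesh, checked out as '.'
        final_path("Cactus", "utils"),
        final_path("Cactus/arrangements/CactusBase", "IOUtil"),
    ]
    for child, parents in find_parents(paths).items():
        print(child, "<-", parents)
```

    Any component with a non-empty parent list would have to wait until those parents are completely checked out. In the example only the flesh shows up as a parent.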

  22. Ian Hinder
    • removed comment

    Replying to rhaas (comment 29):

    Question for Ian: did you not also experience some issues with svn and very parallel checkouts where you tried to check out each folder in a repository individually? I.e., IOUtil/src, IOUtil/doc, ...? Speaking of fixes, I don't think that the priority option that Frank suggested is a good solution. While it may work and is simple to implement, it would seem quite hard to set up and requires a lot of user intervention (as in us having to come up with correct priorities). Instead, GetComponents should be able to deduce priorities itself if at all possible.

    SVN changed its on-disk representation of working copies several times (see, e.g., http://stackoverflow.com/questions/1364618/how-do-i-determine-svn-working-copy-layout-version). In one of these changes, they switched from having a .svn directory in each subdirectory to having one at the top level (like git). As far as I know, with this change, you can no longer update subdirectories of an SVN repository in parallel (also like git, unfortunately). Additionally, since each checkout creates a new http(s) connection, you might run into the web server's configured limit if you attempt very parallel checkouts.

  23. Roland Haas
    • removed comment

    So for now we just check the svn version then disable parallel if it is 1.7.X? Objections?
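
    Such a check could look roughly like the following Python sketch. It is not the actual patch: the function name is hypothetical, the input is assumed to be a version string such as the one printed by 'svn --version --quiet', and treating every version from 1.7 on as affected is only the working hypothesis from this ticket.

```python
import re


def parallel_svn_is_safe(version_text):
    """Return True only for svn versions older than 1.7.

    Assumption from this ticket: releases with a single top-level .svn
    directory (1.7 and later) can fail on nested parallel checkouts.
    """
    match = re.match(r"\s*(\d+)\.(\d+)", version_text)
    if match is None:
        return False  # unknown version: fall back to serial checkouts
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (1, 7)


if __name__ == "__main__":
    for version in ("1.6.17", "1.7.9", "1.8.0"):
        print(version, parallel_svn_is_safe(version))
```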

  24. Frank Löffler
    • removed comment

    We could, but version 1.7 (and 1.8.0) are likely to be around for a while. Instead, I propose the attached patch. It doesn't change anything in the CRL syntax. Instead, it sorts the modules by target path with the shortest first (so that the flesh comes first), and does the checkout/update of the very first element in serial. The argument is that the very first element is very likely something containing others, and to avoid the problems with parallel checkouts, the very first isn't done in parallel.

    This isn't an ideal patch. Ideally we would check out everything in another location and move everything 'together' once it is checked out. However, 'update' would have the same problem, and the same strategy wouldn't work there. We also cannot use a hack similar to the one for git, checking out in a completely different place and just symlinking into the checkout, because the symlink would end up inside the containing checkout and might cause confusion there.

    Instead, this patch tries to be as short as possible, changing as little as possible. The only user-visible changes are the order in which things are done, which isn't guaranteed anyway with parallel checkouts, and of course the intentional serial checkout of the very first module (the one with the shortest target path, TARGET.(NAME||CHECKOUT), for each TARGET, with the targets also sorted by length).
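
    The ordering strategy described above can be sketched like this. It is a Python illustration only, not the attached patch: checkout_all and the checkout callback are hypothetical stand-ins for whatever actually runs svn, and a thread pool is just one possible way to model the parallel part.

```python
from concurrent.futures import ThreadPoolExecutor


def checkout_all(paths, checkout, max_workers=4):
    """Check out the shortest target path serially, the rest in parallel.

    `checkout` is a caller-supplied function taking one target path.
    """
    ordered = sorted(paths, key=len)  # shortest path first, e.g. the flesh
    if not ordered:
        return ordered
    first, rest = ordered[0], ordered[1:]
    checkout(first)  # must finish before anything below it is started
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(checkout, rest))  # list() re-raises worker errors
    return ordered


if __name__ == "__main__":
    done = []
    checkout_all(["Cactus/arrangements/CactusBase/IOUtil", "Cactus",
                  "Cactus/utils"], done.append)
    print(done[0])  # the shortest path is always handled first
```

    Sorting by length alone does not prove that the first element contains the others; as in the comment above, it only relies on that being very likely for the thornlists in use.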

  25. Frank Löffler
    • removed comment

    We could tie this to specific svn versions as well. I didn't do this here to keep things simple, but I wouldn't object if someone would like to make that distinction.

  26. Roland Haas
    • changed status to open
    • removed comment

    Patch looks ok to me. It's a stopgap measure but might well do the trick. I would suggest adding a routine to check for the affected versions and to output a warning in that case.

  27. Roland Haas
    • changed status to open
    • removed comment

    Following the discussion on the ET phone call today: please apply the first patch. The second patch (version check) is to be dropped.
