pyc files when syncing

Issue #349 closed
Barry Wardell created an issue

When I use 'sim sync' to copy files to a machine, it often rsyncs several .pyc files in the SimFactory directory. This happens every time I sync after running a simfactory command on the remote machine. I think these files should probably be excluded from the sync process.

Comments (17)

  1. Erik Schnetter

    If we exclude the .pyc files (as we did before), then old, outdated .pyc files for which there is no .py file any more are also not deleted during rsync. This then makes it impossible to remove directories if they contain such outdated .pyc files, leading to rsync warnings.

    The current way does not look "nice" (and is not ideal), but it does not lead to warnings.

  2. Ian Hinder

    rsync has extensive options for excluding files, and can be controlled by a set of "filter rules":

    exclude, -    specifies an exclude pattern.
    include, +    specifies an include pattern.
    merge, .      specifies a merge-file to read for more rules.
    dir-merge, :  specifies a per-directory merge-file.
    hide, H       specifies a pattern for hiding files from the transfer.
    show, S       files that match the pattern are not hidden.
    protect, P    specifies a pattern for protecting files from deletion.
    risk, R       files that match the pattern are not protected.
    clear, !      clears the current include/exclude list (takes no arg).

    See http://www.samba.org/ftp/rsync/rsync.html

    Perhaps the "hide" rule will hide the files when rsync tries to delete the directory, and avoid the warning?
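
    As a rough illustration of the difference between the "exclude" and "hide" rules, here is a minimal sketch that only assembles an rsync command line from Python. The helper names pyc_filter_args and rsync_command are hypothetical and not part of SimFactory; the --filter and --delete options and the "-" and "H" rule prefixes are rsync's own.

```python
def pyc_filter_args(strategy):
    """Return rsync arguments that keep *.pyc files out of a transfer.

    'exclude': '- *.pyc' skips the files on the sender and also protects
               matching files on the receiver from --delete.
    'hide':    'H *.pyc' only hides the files on the sender, so --delete
               may still remove stale copies on the receiver.
    """
    rules = {"exclude": "- *.pyc", "hide": "H *.pyc"}
    return ["--delete", "--filter=" + rules[strategy]]


def rsync_command(source, destination, strategy="hide"):
    # The command is only assembled here, never executed.
    return ["rsync", "-a"] + pyc_filter_args(strategy) + [source, destination]
```

    Whether "hide" actually avoids the directory-deletion warning would still need to be checked against a real remote tree.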

  3. Barry Wardell reporter

    Replying to [comment:1 eschnett]:

    If we exclude the .pyc files (as we did before), then old, outdated .pyc files for which there is no .py file any more are also not deleted during rsync. This then makes it impossible to remove directories if they contain such outdated .pyc files, leading to rsync warnings.

    Wouldn't the fact that we're using --delete-excluded cover this case? It would also mean that the pyc files are deleted on the remote machine every time the sync is run, but I guess that's better than syncing them from the local machine.

    Replying to [comment:1 hinder]:

    Perhaps the "hide" rule will hide the files when rsync tries to delete the directory, and avoid the warning?

    According to this: http://www.mail-archive.com/rsync@lists.samba.org/msg22776.html 'hide' is just 'exclude' without 'protect'. I'm not quite sure I understand what the difference is between this and --exclude with --delete-excluded.

  4. Barry Wardell reporter

    Replying to [comment:3 barry.wardell]:

    I'm not quite sure I understand what the difference is between this and --exclude with --delete-excluded.

    It seems that hide is the same as --exclude with --delete-excluded: http://lists.samba.org/archive/rsync/2009-July/023612.html

    Maybe we should do something similar to the suggestion in that email: not use --delete-excluded, but give a more detailed list of filter rules. It seems that there are some files which you might want to hide and some that you might want to exclude.

  5. Erik Schnetter

    We can certainly exclude .pyc files. However, in this case, rsync will output messages about deleting the remote .pyc files.

  6. Barry Wardell reporter

    How about something like the attached patch? It's not commit-ready, but it works and gives an idea of the approach.

    This makes use of an rsync rules file to specify which paths to include/exclude/hide/show. It doesn't use --delete-excluded since that can be achieved with hide if necessary. As well as making things simpler, it solves the pyc problem by excluding with the perishable flag.

  7. Erik Schnetter

    I like the idea (if I understand the rsync behaviour correctly).

    What the patch misses mostly is a way for the user to extend or at least override the default behaviour. Can one combine two such files?

  8. Barry Wardell reporter

    Yes, we can specify more files either on the command line or directly in the filter.rules file. In fact, this is what the line ': .rsync.rules' does: it tells rsync to look for a file called '.rsync.rules' inside every directory and to read it if it exists.
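
    To make the idea concrete, here is a minimal sketch of what such a rules file and the matching rsync invocation might look like. The rule list is an assumption for illustration (the actual contents of the patch's filter.rules are not shown in this thread), and rules_text, write_rules_file and rsync_command are hypothetical helper names.

```python
# Hypothetical rule set, for illustration only.
DEFAULT_RULES = [
    ": .rsync.rules",  # dir-merge: read .rsync.rules in each directory, if present
    "-,p *.pyc",       # perishable exclude: ignored in directories being deleted
    "- .svn/",
    "- .git/",
]


def rules_text(rules=DEFAULT_RULES):
    """Return the rules as the text of an rsync merge file."""
    return "\n".join(rules) + "\n"


def write_rules_file(path, rules=DEFAULT_RULES):
    with open(path, "w") as handle:
        handle.write(rules_text(rules))
    return path


def rsync_command(rules_path, source, destination):
    # '. FILE' is the short form of the 'merge FILE' filter rule.
    return ["rsync", "-a", "--delete", "--filter=. " + rules_path,
            source, destination]
```

    Because the *.pyc exclude is perishable, it should not stop rsync from deleting a remote directory that no longer exists locally, which is the warning discussed above.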

  9. Barry Wardell reporter

    I think it still needed a bit more tidying up before being ready. I will do so and apply the patch.

  10. Barry Wardell reporter

    Attached is an updated version of this patch. I think this one is probably ready to commit, but I have not done so as I didn't get a chance to test it much. If you test it and think it's OK, please feel free to commit. Otherwise, I will do so myself the week after next.

    There are two modes of operation after this patch:

    sim sync <machine>
        Rsync with <machine> all files specified by the rule files simfactory/etc/filter.rules and simfactory/etc/filter.cactus.rules.

    sim sync <machine> [paths ...]
        Rsync with <machine> the files specified by [paths ...], taking into account the rules in simfactory/etc/filter.rules.

    The filter.rules file just excludes some common paths which the user probably doesn't want to sync (.svn, .git, ...). The filter.cactus.rules file specifies which paths in the Cactus base directory should be synced by default (lib/, arrangements/, ...) and excludes any other paths. The user can also put an optional .rsync.rules file in any directory in their Cactus tree to override the rules for that directory.
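
    The choice of rule files in the two modes can be sketched as follows. This is only an illustration under the assumption that each rule file is passed to rsync as a merge rule; rule_files_for and filter_args are hypothetical names, while the two file paths are the ones given above.

```python
BASE_RULES = "simfactory/etc/filter.rules"
CACTUS_RULES = "simfactory/etc/filter.cactus.rules"


def rule_files_for(paths):
    """Pick the rule files for a sync.

    With explicit paths only the general rules apply; without paths the
    default list of Cactus paths is read as well.
    """
    if paths:
        return [BASE_RULES]
    return [BASE_RULES, CACTUS_RULES]


def filter_args(paths=()):
    # One '--filter=. FILE' (merge) argument per rule file.
    return ["--filter=. " + name for name in rule_files_for(paths)]
```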

  11. Erik Schnetter

    Barry

    Thanks for the new patch!

    You are introducing a new "paths" syntax to the sync command. I thought previously that one could just list multiple machines, and simfactory would sync to all of them -- apparently that isn't the case, that was lost in translation. However, it seems that filter.cactus.rules contains only a list of top-level paths, and isn't supposed to contain any actual rules -- if it did contain rules, then the result would be confusing, because "sim sync" and "sim sync paths" would copy and/or delete different sets of files. Also, people may want to change this default list of paths, so there should probably also be a filter.cactus.local.rules... Should this list of paths instead be stored in an ini file, where there is already a mechanism to configure settings, and where simfactory could check that these are actually only path names and not accidentally patterns?

    I'm also unsure about the rules:

    Shouldn't "_darcs" always be excluded, similar to CVS .svn .git .hg etc?

    What does "C" do? Does it read a .cvsignore file? If so, shouldn't this file be transferred as well, and be documented for simfactory? This would be one more configuration file for people to understand; can we ignore this file instead? cvs is not important any more these days.

    How do you expect people to use the "paths" mechanism? Can one give just top-level paths, or also directly paths deep into the hierarchy? Would you expect to do this regularly? If so, why? I find this somewhat dangerous, because people may miss transferring an updated file. Instead of telling simfactory what to do, the user currently tells simfactory his/her intent, e.g. "copy source files" or "copy parameter files", which are prerequisites to either building or submitting. Simfactory then deals with the details, ensuring things are done in a safe way. Would you find it inconvenient if you had to use an option to specify a pathname, e.g. "sim sync damiana -p par"?

  12. Barry Wardell reporter

    Replying to [comment:12 eschnett]:

    You are introducing a new "paths" syntax to the sync command. I thought previously that one could just list multiple machines, and simfactory would sync to all of them -- apparently that isn't the case, that was lost in translation.

    Yes, sorry for the confusion. This patch introduces three changes (switching to the filter rules system, changing the behavior of sim sync with multiple arguments, and removing the --sync-parfiles and --sync-sourcetree options) which should ideally be split into separate issues for consideration. The reason I didn't do so was that the three changes naturally came at the same time in terms of the changes to the code. The last two can be restored to their original behavior if desired. However, I actually prefer the new behavior because:

    • I am much more likely to want to sync specific paths than to sync to multiple machines at once.
    • The paths system provides more flexibility and control than the --sync-parfiles and --sync-sourcetree options did and makes them somewhat unnecessary. This flexibility is particularly useful on machines with slower filesystems where only syncing a specific path can save a lot of time.

    What are other people's opinions on this?

    However, it seems that filter.cactus.rules contains only a list of top-level paths, and isn't supposed to contain any actual rules -- if it did contain rules, then the result would be confusing, because "sim sync" and "sim sync paths" would copy and/or delete different sets of files. Also, people may want to change this default list of paths, so there should probably also be a filter.cactus.local.rules... Should this list of paths instead be stored in an ini file, where there is already a mechanism to configure settings, and where simfactory could check that these are actually only path names and not accidentally patterns?

    The idea is that if specific paths are not given, then the file filter.cactus.rules is read in and gives a default list of paths to be included. I agree that this should not contain any actual rules for the reason you give and have added a comment to this effect to the top of filter.cactus.rules. Any filter rules to be applied to all paths should be put in filter.rules.

    If the user wants to modify this, they can add a .rsync.rules file in their Cactus base directory which is read in first and so will override anything in filter.cactus.rules. I don't really like storing these in ini files because that would be moving away from using rsync's filter rules system (unless simfactory parsed the ini file and generated the appropriate .rsync.rules file).

    Shouldn't "_darcs" always be excluded, similar to CVS .svn .git .hg etc?

    Yes, "_darcs" is also excluded by a rule in the filter.rules file. I have also added .hg in the latest patch.

    What does "C" do? Does it read a .cvsignore file? If so, shouldn't this file be transferred as well, and be documented for simfactory? This would be one more configuration file for people to understand; can we ignore this file instead? cvs is not important any more these days.

    "C" is designed to exclude many common paths which you often don't want to transfer. These are: "RCS SCCS CVS CVS.adm RCSLOG cvslog.* tags TAGS .make.state .nse_depinfo *~ #* .#* ,* _$* *$ *.old *.bak *.BAK *.orig *.rej .del-* *.a *.olb *.o *.obj *.so *.exe *.Z *.elc *.ln core .svn/ .git/ .bzr/" It also appends any patterns listed in $HOME/.cvsignore and in directory-local .cvsignore files. I don't think we should worry too much about .cvsignore files though - I doubt many people have them any more.

    How do you expect people to use the "paths" mechanism? Can one give just top-level paths, or also directly paths deep into the hierarchy? Would you expect to do this regularly? If so, why? I find this somewhat dangerous, because people may miss transferring an updated file. Instead of telling simfactory what to do, the user currently tells simfactory his/her intent, e.g. "copy source files" or "copy parameter files", which are prerequisites to either building or submitting. Simfactory then deals with the details, ensuring things are done in a safe way. Would you find it inconvenient if you had to use an option to specify a pathname, e.g. "sim sync damiana -p par"?

    The idea is that there are three modes of operation:

    • Without any paths specified we sync all paths given in filter.cactus.rules (and also include any modifications in the file .rsync.rules). This is essentially the same as what happened before.
    • With a list of paths given, only those paths are synchronized. Both filter.cactus.rules and $CACTUSDIR/.rsync.rules are ignored (but any .rsync.rules files in the specified paths are read). For consistency, only toplevel paths are accepted in this mode.
    • With a single path given, only this path is synchronized. Both filter.cactus.rules and $CACTUSDIR/.rsync.rules are ignored (but any .rsync.rules files in the specified paths are read). In this case, non-toplevel paths are allowed and handled appropriately.
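
    The three modes above could be distinguished roughly as in this sketch. The function names and the mode labels are hypothetical, and treating any path that contains a directory separator as non-toplevel is an assumption.

```python
import os


def sync_mode(paths):
    """Classify a 'sim sync' call by the paths given on the command line."""
    cleaned = [os.path.normpath(p) for p in paths]
    if not cleaned:
        return "default"   # sync everything listed in filter.cactus.rules
    if len(cleaned) == 1:
        return "single"    # one path, may lie deep in the hierarchy
    nested = [p for p in cleaned if os.sep in p]
    if nested:
        raise ValueError("only top-level paths may be combined: "
                         + ", ".join(nested))
    return "list"          # several top-level paths


def reads_cactus_rules(paths):
    # filter.cactus.rules is only consulted when no paths are given.
    return sync_mode(paths) == "default"
```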

    The main case where I would expect to use this regularly is when syncing to a machine with a slow filesystem (e.g. Kraken) where simply checking which files need to be synced can sometimes take a long time. In fact, before now I often used rsync manually instead of 'sim sync' when I was syncing small changes often (e.g. when debugging a problem, setting up a new parameter file, etc.). I quite like how things work with this patch applied. We could add the --sync-sources and --sync-parfiles convenience options back, although I'm not sure if I would personally use them.

    What is the advantage of using an option to specify a pathname?

  13. Erik Schnetter

    Replying to barry.wardell:

    Replying to eschnett:

    > You are introducing a new "paths" syntax to the sync command. I thought previously that one could just list multiple machines, and simfactory would sync to all of them -- apparently that isn't the case, that was lost in translation.

    Yes, sorry for the confusion. This patch introduces three changes (switching to the filter rules system, changing the behavior of sim sync with multiple arguments, and removing the --sync-parfiles and --sync-sourcetree options) which should ideally be split into separate issues for consideration. The reason I didn't do so was that the three changes naturally came at the same time in terms of the changes to the code. The last two can be restored to their original behavior if desired. However, I actually prefer the new behavior because:

    I am much more likely to want to sync specific paths than to sync to multiple machines at once.

    I see. Myself, I'm much more likely to sync to different machines. For example, while debugging at scale, I may build and submit on three or four machines at once, to increase my chances of a job starting quickly.

    On the other hand, I don't sync only part of my source tree. Not having to do so is exactly the advantage that Simfactory is supposed to provide, because it can lead to strange errors when one forgets to sync a file. I usually find that the first sync is slow, but subsequent ones are much faster. If your experience is different, then we should introduce another high-level automatic mechanism instead of asking people to do the low-level file management themselves again. For example, Simfactory could remember the time of the last sync to another machine, and then look locally for files that changed since. This would avoid accessing Kraken's slow file system, and shouldn't be more than a line or two with find.
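
    The "files changed since the last sync" idea could look roughly like the sketch below, written in Python rather than with find. It assumes a per-machine stamp file whose modification time records the last sync; the stamp file name and the functions files_changed_since and demo are hypothetical.

```python
import os
import tempfile


def files_changed_since(root, stamp_path):
    """List files under root modified after the stamp file was last touched."""
    try:
        last_sync = os.path.getmtime(stamp_path)
    except OSError:
        last_sync = float("-inf")  # never synced: report every file
    stamp_abs = os.path.abspath(stamp_path)
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.abspath(path) == stamp_abs:
                continue  # the stamp itself is not a candidate
            if os.path.getmtime(path) > last_sync:
                changed.append(os.path.relpath(path, root))
    return sorted(changed)


def demo():
    """Build a throwaway tree with fixed mtimes and query it."""
    root = tempfile.mkdtemp()
    stamp = os.path.join(root, ".last_sync_kraken")
    for name, mtime in [("old.par", 1000), (".last_sync_kraken", 2000),
                        ("new.par", 3000)]:
        path = os.path.join(root, name)
        open(path, "w").close()
        os.utime(path, (mtime, mtime))
    return files_changed_since(root, stamp)
```

    Only the local tree is scanned, so the remote file system is not touched until the actual transfer. Files deleted locally since the last sync would not be detected this way.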

    The paths system provides more flexibility and control than the --sync-parfiles and --sync-sourcetree options did and makes them somewhat unnecessary. This flexibility is particularly useful on machines with slower filesystems where only syncing a specific path can save a lot of time.

    What are other people's opinions on this?

    The main goals of Simfactory are not flexibility or control, but to provide safe and convenient default choices that work almost all the time. In many cases, people want flexibility and control only because something else is not working right -- in this case, sync is apparently too slow for you. I would thus suggest (1) a quick work-around for you that is, hopefully, temporary, and (2) trying to come up with a good solution that lets you "just sync" files without worrying about its performance. But I would not want to design much additional flexibility into Simfactory, because this makes it more difficult to learn and more dangerous to use.

    > However, it seems that filter.cactus.rules contains only a list of top-level paths, and isn't supposed to contain any actual rules -- if it did contain rules, then the result would be confusing, because "sim sync" and "sim sync paths" would copy and/or delete different sets of files. Also, people may want to change this default list of paths, so there should probably also be a filter.cactus.local.rules... Should this list of paths instead be stored in an ini file, where there is already a mechanism to configure settings, and where simfactory could check that these are actually only path names and not accidentally patterns?

    The idea is that if specific paths are not given, then the file filter.cactus.rules is read in and gives a default list of paths to be included. I agree that this should not contain any actual rules for the reason you give and have added a comment to this effect to the top of filter.cactus.rules. Any filter rules to be applied to all paths should be put in filter.rules.

    If the user wants to modify this, they can add a .rsync.rules file in their Cactus base directory which is read in first and so will override anything in filter.cactus.rules. I don't really like storing these in ini files because that would be moving away from using rsync's filter rules system (unless simfactory parsed the ini file and generated the appropriate .rsync.rules file).

    Can we just put the filter rules into ini files and have Simfactory write out these configuration files? In this way, all configuration files are in a single place and are easy to find. This would make it easier for new users to configure their Simfactory, or to understand/copy a Simfactory setup from someone else.
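
    A minimal sketch of that approach, assuming a hypothetical [sync] section with a paths key (the real Simfactory ini layout is not shown in this thread): the ini file holds plain path names, Simfactory checks that none of them is a pattern, and then generates the rsync rules.

```python
import configparser

# Hypothetical ini fragment; section and key names are assumptions.
INI_TEXT = """
[sync]
paths = arrangements lib par
"""

GLOB_CHARS = set("*?[")


def rules_from_ini(text):
    """Turn a list of top-level path names into rsync filter rules."""
    config = configparser.ConfigParser()
    config.read_string(text)
    paths = config.get("sync", "paths").split()
    for path in paths:
        if GLOB_CHARS & set(path):
            raise ValueError("expected a path name, got a pattern: " + path)
    # Include each listed top-level entry, then exclude every other one.
    return ["+ /" + path for path in paths] + ["- /*"]
```

    The generated lines could then be written to a rules file and handed to rsync as a merge file.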

    > How do you expect people to use the "paths" mechanism? Can one give just top-level paths, or also directly paths deep into the hierarchy? Would you expect to do this regularly? If so, why? I find this somewhat dangerous, because people may miss transferring an updated file. Instead of telling simfactory what to do, the user currently tells simfactory his/her intent, e.g. "copy source files" or "copy parameter files", which are prerequisites to either building or submitting. Simfactory then deals with the details, ensuring things are done in a safe way. Would you find it inconvenient if you had to use an option to specify a pathname, e.g. "sim sync damiana -p par"?

    The idea is that there are three modes of operation:

    Without any paths specified we sync all paths given in filter.cactus.rules (and also include any modifications in the file .rsync.rules). This is essentially the same as what happened before.

    Sure.

    With a list of paths given, only those paths are synchronized. Both filter.cactus.rules and $CACTUSDIR/.rsync.rules are ignored (but any .rsync.rules files in the specified paths are read). For consistency, only toplevel paths are accepted in this mode.

    With a single path given, only this path is synchronized. Both filter.cactus.rules and $CACTUSDIR/.rsync.rules are ignored (but any .rsync.rules files in the specified paths are read). In this case, non-toplevel paths are allowed and handled appropriately.

    Do these two last rules mean that .svn files in these paths are then copied to the remote system? That would be strange.

    How important are per-directory .rsync.rules for you? Are these just a coincidence of your implementation using a relative path name in etc/filter.rules? Or do you use this for some purpose? Isn't it strange that only those rules in the exact directories that the user specifies are used, while any rules in subdirectories are ignored?

    I also wonder why you don't allow synchronising two non-toplevel directories at the same time. Assume you modified a source file and a parameter file, and want to copy over both. Currently, you would need to use two separate sync commands. If this is just due to an additional test, we can just leave it out. But I assume that it has more to do with rsync's path specifications, and you would need to call rsync multiple times. In this case, I would rather report a "not yet implemented" error message.

    The main case where I would expect to use this regularly is when syncing to a machine with a slow filesystem (e.g. Kraken) where simply checking which files need to be synced can sometimes take a long time. In fact, before now I often used rsync manually instead of 'sim sync' when I was syncing small changes often (e.g. when debugging a problem, setting up a new parameter file, etc.). I quite like how things work with this patch applied. We could add the --sync-sources and --sync-parfiles convenience options back, although I'm not sure if I would personally use them.

    What is the advantage of using an option to specify a pathname?

    The advantage is that one can list multiple machines.

    I would approve the patch if you use an option to specify paths, so that we can continue to sync to multiple machines. I would strongly favour keeping the list of top-level directories in Simfactory ini files, because I have additional such paths, and being able to configure these is important to me, and since this is currently already working in Simfactory we don't need to change this. I would also prefer to have the rsync rules in a Simfactory ini file for the same reason.

    However, it may also be good to have a bit more discussion about these approaches. Do people think that "sim sync" is too slow? Do you prefer .rsync.rules files over ini files, and if so, why?

  14. Barry Wardell reporter

    After discussing this patch with Erik, we figured out the details of how things should work.

    I have now committed this patch. I also committed a second patch which brings back the sync-parfiles and sync-sources config ini options, the --sync-parfiles and --sync-sourcetree arguments, support for syncing to multiple machines in a single command, and the use of -p/--sync-path for specifying additional paths to sync.
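
    The resulting command-line shape can be sketched with argparse as below. This is not the committed Simfactory code: build_sync_parser is a hypothetical name, and treating --sync-parfiles and --sync-sourcetree as simple on/off flags is an assumption. Only the option names themselves come from the description above.

```python
import argparse


def build_sync_parser():
    parser = argparse.ArgumentParser(prog="sim sync")
    # One or more machines, all synced by a single command.
    parser.add_argument("machines", nargs="+")
    # -p may be repeated to name several additional paths.
    parser.add_argument("-p", "--sync-path", dest="sync_paths",
                        action="append", default=[])
    parser.add_argument("--sync-parfiles", action="store_true")
    parser.add_argument("--sync-sourcetree", action="store_true")
    return parser
```

    With this layout, "sim sync damiana kraken -p par" would name two machines and one extra path, so paths and machines can no longer be confused.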
