Mirror Management: How Often Should I Sync?
A CollabNet customer submitted the following question to us recently:
We set up mirrors of several Subversion repositories which CollabNet hosts for us. I have a question regarding the frequency of synchronization. How often should it be done? Should we synchronize often with a smaller number of changes, or only once in a while with a larger number?
As it turns out, I'd been considering this very question myself recently. The dynamics of mirror management in Subversion are interesting, so fielding this question gave me an opportunity to render some of my recent musings on the matter as text. And as there is nothing particularly unique about this customer's Subversion deployment scenario, you the reader get to benefit from the generality and (now) publicity of the response that I offered to the inquirer.
My response was as follows:
This is an interesting question, and one I've been chewing on myself lately. The answer I provide may not be satisfactory if you're looking for a simple "You need to sync every X minutes" type of response. The answer is instead tied somewhat to the balance between your intended purposes for the mirrors and the level of complexity you're willing to endure while maintaining them.
Let's start by examining the naive approach to synchronization, where you fire off svnsync every so many minutes. Depending on how you wish to use the mirrors, the exact number of minutes may vary. For a simple nightly backup job, 60 x 24 = 1440 minutes works fine. But for a mirror perhaps used by developers trying to stay atop the state of a rapidly changing codebase, that's not often enough. You might need to sync every twenty minutes. Or five. Or one. You don't want to poll the original repository so often that you affect that server's performance, of course. (Launching denial-of-service (DoS) attacks against yourself is not considered wise.) But the cost of attempting to sync an already up-to-date mirror isn't all that great.
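A minimal Python sketch of this naive polling approach might look like the following. The function names, the mirror URL you pass in, and the injectable run and sleep parameters are all assumptions made for illustration; a plain cron entry would do the same job.

```python
import subprocess
import time


def sync_command(mirror_url):
    """Build the svnsync call that pulls any new revisions into the mirror."""
    return ["svnsync", "sync", "--non-interactive", mirror_url]


def poll(mirror_url, interval_minutes, run=subprocess.call,
         sleep=time.sleep, max_runs=None):
    """Run `svnsync sync` every interval_minutes.

    Loops forever when max_runs is None; max_runs exists only so the
    loop can be exercised without waiting (a hypothetical convenience).
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        # Cheap when the mirror is already current: svnsync finds nothing to copy.
        run(sync_command(mirror_url))
        runs += 1
        sleep(interval_minutes * 60)
    return runs
```

Calling poll with an interval of 1440 gives the nightly-backup case; an interval of 20, 5, or 1 gives the developer-facing mirror case.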
Now, if versioned changes were the only bits of data maintained by this synchronization task, the choice of how often to run svnsync sync would be just as simple as the above. Unfortunately, there are also the unversioned revision properties to pay attention to. Because you can change a revision property at any time, and because Subversion doesn't record when you did so, complete synchronization of Subversion repositories gets complicated. Say you have 100 revisions in your master repository, and you've just caught your mirror up to date, too. Later, one of the developers changes the log message for revision 50. svnsync will never realize that this change happened. Future svnsync sync invocations will of course continue to pull down new revisions that have been added (r101 and later), but the log message for r50 in the original and the mirror will not be the same. The svnsync copy-revprops subcommand is the tool for remedying this discrepancy, but something has to tell that subcommand to run, and against which revisions to do its thing.
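To make the r50 scenario concrete, here is a toy Python model of it. This is not svnsync itself: the repositories are plain dictionaries mapping revision numbers to log messages, and toy_sync and toy_copy_revprops are invented names that only imitate the behavior described above.

```python
def toy_sync(master, mirror):
    """Imitate `svnsync sync`: copy only revisions newer than the mirror's latest."""
    last_mirrored = max(mirror, default=0)
    new_revs = [rev for rev in sorted(master) if rev > last_mirrored]
    for rev in new_revs:
        mirror[rev] = master[rev]
    return new_revs


def toy_copy_revprops(master, mirror, rev):
    """Imitate `svnsync copy-revprops` for one revision: re-copy its log message."""
    mirror[rev] = master[rev]


# 100 revisions in the master, mirror freshly caught up.
master = {rev: "log message for r%d" % rev for rev in range(1, 101)}
mirror = {}
toy_sync(master, mirror)

# A developer rewrites r50's log message, and someone commits r101.
master[50] = "corrected log message for r50"
master[101] = "log message for r101"

copied = toy_sync(master, mirror)        # picks up r101 only
stale = mirror[50] != master[50]         # r50 has silently diverged

toy_copy_revprops(master, mirror, 50)    # someone must ask for r50 explicitly
repaired = mirror[50] == master[50]
```

The point of the model is the middle step: nothing in the sync pass looks back at r50, so the fix only happens when something names that revision.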
So the revision property synchronization angle on this adds complexity. Most of the time, developers quickly realize mistakes made in log messages, and fix them relatively soon after the commit completes. As long as they make the fix before your sync job pulls down that revision, all is well in the mirror. So that makes an argument for doing synchronization less often (to allow time for post-facto log message touch-ups). But how long is long enough? What about those cases where somebody changes log messages on revisions committed months ago? These questions can't be answered without again looking to the purpose of the mirrors. In your situation, does it matter if the revision properties are out of sync so long as the core file/directory versioned data is up to date? Maybe not. Maybe it's okay if the revision properties deviate for longer periods of time. Maybe in your situation, a revision sync every ten minutes plus a nightly revision property sync for all revisions in the repository is just what the sysadmin ordered.
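As a rough sketch of that last arrangement, the Python below decides which jobs are due at a given minute: a revision sync every ten minutes, plus one revision property sync over all revisions each night. The function name, the 02:00 nightly hour, and the youngest_rev parameter are assumptions for illustration, and the sketch assumes your svnsync accepts a REV:REV2 range for copy-revprops.

```python
from datetime import datetime


def jobs_due(now, mirror_url, youngest_rev, nightly_hour=2):
    """Return the svnsync commands that should run at the minute `now`."""
    jobs = []
    if now.minute % 10 == 0:
        # Every ten minutes: pull down any new revisions.
        jobs.append(["svnsync", "sync", "--non-interactive", mirror_url])
    if now.hour == nightly_hour and now.minute == 0:
        # Once a night: re-copy revision properties for r0 through youngest_rev.
        jobs.append(["svnsync", "copy-revprops", "--non-interactive",
                     mirror_url, "0:%d" % youngest_rev])
    return jobs


# Hypothetical mirror URL and revision count, for illustration only.
example = jobs_due(datetime(2008, 5, 29, 2, 0), "file:///tmp/mirror", 100)
```

A scheduler would call jobs_due once a minute and run whatever comes back, in order, so the nightly property pass follows the regular sync.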
(As promised, I've probably raised more questions than provided answers here.)
In my opinion, the best approach is a multi-faceted one, a combination of real-time event-based triggering of sync actions and scheduled just-in-case full synchronization jobs.
The first part of this is the part that, barring communication errors between the master repository server and the servers housing the mirrors, keeps those mirrors as up-to-date as possible. Ideally here, your primary repository is able to push changes to your mirror(s), or at least push notifications of changes to them. For example, your primary repository might have post-commit and post-revprop-change hooks that run svnsync to update the mirrors directly. Or, if that's not possible for reasons of firewalls and security and such, then perhaps those post-commit and post-revprop-change hooks at least send email notifications of changes, and the mirror machines have some automated way of noticing those mails and triggering the relevant sync tasks. A commit mail translates to running svnsync sync; a propchange mail to running svnsync copy-revprops for the revision whose property was changed.
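The event-to-action mapping in that last sentence can be sketched in Python as below. The function name and the event labels are assumptions for illustration; the same function could be called from a hook script on the master or from whatever process on the mirror machine reads the notification mails.

```python
def command_for_event(event, mirror_url, revision):
    """Map a repository event to the svnsync command that brings the mirror in line.

    `event` is a label chosen for this sketch: "post-commit" for a new
    revision, "post-revprop-change" for an edited revision property.
    """
    if event == "post-commit":
        # New revision: a plain sync picks up everything the mirror lacks.
        return ["svnsync", "sync", "--non-interactive", mirror_url]
    if event == "post-revprop-change":
        # Edited property: only the named revision needs its properties re-copied.
        return ["svnsync", "copy-revprops", "--non-interactive",
                mirror_url, str(revision)]
    raise ValueError("unhandled event: %r" % (event,))
```

Note that the commit case ignores the revision number on purpose: sync always catches up from wherever the mirror left off, so a missed notification is repaired by the next one.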
The second facet covers the what-if cases. What if the mirror machines didn't get some of those email notifications? What if the sync jobs themselves suffered network outages? To address this, you might want to have some kind of regular scheduled task that attempts svnsync sync (usually finding nothing to sync, because the event-based sync triggers are working just fine), and also does svnsync copy-revprops across ranges of revisions (usually rewriting the mirror's revision properties with the values they already had, for the same reason). Of course the thing to avoid is any given svnsync job taking so long as to cause contention with other svnsync jobs operating against the same repository.
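One way to keep each just-in-case job short is to split the revision property pass into small ranges, so that no single svnsync invocation runs for long. The Python below is a sketch of that idea under stated assumptions: the function names and the chunk size are invented, youngest_rev is supplied by the caller, and it assumes your svnsync accepts a REV:REV2 range for copy-revprops.

```python
def revision_ranges(youngest_rev, chunk):
    """Split r0..youngest_rev into consecutive (start, end) ranges of at most `chunk` revisions."""
    ranges = []
    start = 0
    while start <= youngest_rev:
        end = min(start + chunk - 1, youngest_rev)
        ranges.append((start, end))
        start = end + 1
    return ranges


def safety_net_commands(mirror_url, youngest_rev, chunk=100):
    """One sync, then one short copy-revprops command per revision range."""
    commands = [["svnsync", "sync", "--non-interactive", mirror_url]]
    for start, end in revision_ranges(youngest_rev, chunk):
        commands.append(["svnsync", "copy-revprops", "--non-interactive",
                         mirror_url, "%d:%d" % (start, end)])
    return commands
```

With 250 revisions and a chunk size of 100, revision_ranges yields (0, 99), (100, 199), and (200, 250). Running the resulting commands one at a time leaves gaps in which an event-triggered sync can get its turn.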
While not outright instructive, I hope this has been informative enough for you to decide which implementation works best for you.


> Of course the thing to avoid is any given svnsync job taking so long as to cause contention with other svnsync jobs operating against the same repository.
Hi,
this is exactly what I was wondering about with my every-5-minutes svnsync cron job. If svnsync run n-1 is still running due to a very large commit, how will the next svnsync react?
- exit immediately with an error (or no error) (this is what I would expect)?
- wait until the previous svnsync has finished (-> this just pushes the problem onto the following svnsync calls)?
- wreck the mirror?
Thanks,
Alex | May 29, 2008 at 12:39 AM
svnsync uses custom properties, stored on r0 of the mirror repository, for bookkeeping. One of those properties is svn:sync-lock, and it exists to prevent multiple svnsync jobs from clobbering each other. Your sync of rN will see the property set on the repository by the sync of rN-1, will enter a retry loop for 10 seconds or so to give that lock some time to be cleared, but failing that will just error out with "Failed to get lock on destination repos, currently held by '??'". Of course, that still leaves your sync out of date until the next commit comes in. But adding a regular cron-driven sync to the mix would reduce that out-of-dateness window.
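For a cron-driven sync, one option is to treat that lock failure as "busy, try again next time" instead of as a real error. The Python below is only a sketch: the function name and the three result labels are invented, and the matched text is taken from the error message quoted above, which may differ between svnsync versions.

```python
# Fragment of the error message quoted above; assumed stable for this sketch.
LOCK_MESSAGE = "Failed to get lock on destination repos"


def classify_sync_result(returncode, stderr_text):
    """Label a finished svnsync run as "ok", "busy", or "error"."""
    if returncode == 0:
        return "ok"
    if LOCK_MESSAGE in stderr_text:
        # Another svnsync holds svn:sync-lock; the next scheduled run can retry.
        return "busy"
    return "error"
```

A cron wrapper could log "busy" quietly and only alert on "error", so an overlapping run caused by one very large commit does not page anyone.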
As it turns out, just last night I documented svnsync's custom properties for the Version Control with Subversion book. You can find these notes in this section: http://svnbook.red-bean.com/nightly/en/svn.reposadmin.maint.html#svn.reposadmin.maint.replication
C. Michael Pilato | May 29, 2008 at 05:27 AM