A full rsync of fedora-buffet0 can take hours just to receive the file list, because there are over 12.4 million files. This has to happen for every client, every time they want to update, which is murderous on the download servers and worse on the backend NFS server. It also slows the propagation of important updates because mirrors simply can't poll often.
By generating a simple database of file and directory timestamps, it becomes easy for the client to determine which files have changed since the last mirror run and only mirror those files. This also provides enough information to handle files which have been deleted on the server. And if the timestamps are chosen properly, it allows hardlinks to be copied as hardlinks instead of copying the files multiple times.
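As a rough illustration of the idea, the sketch below filters a timestamp list for entries newer than the last mirror run. The list layout used here (epoch seconds, a tab, then the path) and the function name changed_since are assumptions made for the example; the real file list format may differ.

```shell
# Hypothetical list format: "<epoch seconds><TAB><path>", one entry per line.
# This is NOT necessarily the format create-filelist writes.
changed_since() {
    # $1 = last mirror time in epoch seconds, $2 = file list to scan
    awk -F '\t' -v last="$1" '$1 + 0 > last + 0 { print $2 }' "$2"
}

# Made-up sample data, just to show the filter at work.
sample=$(mktemp)
printf '100\told.rpm\n200\tnew.rpm\n300\trepodata/repomd.xml\n' > "$sample"
changed_since 150 "$sample"
rm -f "$sample"
```

Only the entries stamped after 150 are printed, so the amount of work scales with the number of changes rather than the size of the tree.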
The client is quick-fedora-mirror. It is currently written in zsh but should be portable to bash. (I needed associative arrays and haven't learned to do them in bash.) It needs no external tools besides rsync, awk, and the usual core utilities.
A config file is required (unless you edit the script); see the sample file in quick-fedora-mirror.conf.dist. The destination directory and location of the file to store the last mirror time must be set, though you probably want to set the list of modules to mirror as well.
The client downloads the master file list for each module, generates lists of new files, deletes files and directories which no longer exist on the server, and passes one combined list to rsync via --files-from. Because all modules are copied together, hardlinks between modules will be copied as hardlinks.
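The deletion and transfer steps described above can be sketched roughly as follows. This is not the code from quick-fedora-mirror; the function names stale_files and fetch_changed are made up for the example, and the server list is assumed to have already been reduced to one path per line.

```shell
# Print files that exist in the local mirror but are absent from the server's list.
stale_files() {
    # $1 = file holding server paths, one per line; $2 = local mirror directory
    server_sorted=$(mktemp)
    local_sorted=$(mktemp)
    LC_ALL=C sort "$1" > "$server_sorted"
    (cd "$2" && find . -type f | sed 's|^\./||' | LC_ALL=C sort) > "$local_sorted"
    # comm -13 keeps lines unique to the second input: local-only files.
    LC_ALL=C comm -13 "$server_sorted" "$local_sorted"
    rm -f "$server_sorted" "$local_sorted"
}

# Transfer only the listed paths; -H preserves hardlinks among them.
fetch_changed() {
    # $1 = list of paths to copy, $2 = rsync source, $3 = destination directory
    rsync -aH --files-from="$1" "$2" "$3"
}

# Usage (the paths and source URL are placeholders you would supply):
#   stale_files server-paths.txt /srv/mirror
#   fetch_changed changed-paths.txt "$RSYNC_SOURCE" /srv/mirror
```

Handing rsync a single list covering every module is what lets -H see both ends of a cross-module hardlink in the same run.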
The speed improvements can be extraordinary.
Copy quick-fedora-mirror somewhere. Copy quick-fedora-mirror.conf.dist to quick-fedora-mirror.conf, edit as appropriate and copy to one of the following:
(There is not currently any option parsing, so no way to specify the config file on the command line.)
The last mirror time is assumed to be the epoch if quick-fedora-mirror has not previously been run. This means that every single file will be checked, which will take forever.
If you are certain your mirror is up to date, you can just fudge the last mirror date:
date -d 'last week' +LASTTIME=%s > lastmirrortime
(Obviously, use the timestamp file you chose in your configuration file.) Then your run will only examine files which have changed in the last week. This may still be a lot of files, but not all of them.
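To see what that command writes, the sketch below runs it against a scratch file and reads the value back. It relies on GNU date (as the command above already does); sourcing the file with the dot command is only a convenient way to inspect it here, not a statement about how the client reads it.

```shell
stampfile=$(mktemp)     # stand-in for your configured timestamp file
date -d 'last week' +LASTTIME=%s > "$stampfile"

cat "$stampfile"        # one line of the form LASTTIME=<epoch seconds>

. "$stampfile"          # the line is also a valid shell assignment
date -d "@$LASTTIME"    # GNU date: show that moment in readable form

rm -f "$stampfile"
```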
If you have to add a module after the fact, note that rsync will not pick up any hardlinks. You can of course do the download and then run hardlink afterwards, or do a special run with just the new module and the other modules which have many shared links. Most of the links are between fedora-archive and fedora-enchilada, or fedora-alt and fedora-enchilada.
The server must include one file (by default, fullfiletimelist) per module to be mirrored using this code. This file is created by running create-filelist. This will generate a list of all files in the current directory in the proper format and write it to STDOUT. This output should be captured to a tempfile and moved into place once the run is complete. It could do that directly but I wanted to keep it simple.
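One way to do the capture-and-move step is sketched below. The function name update_filelist and its arguments are made up for the example, and create-filelist is assumed to be on PATH. The scratch directory sits outside the module so the temporary file does not end up in the list itself; keeping it on the same filesystem makes the final mv a plain rename.

```shell
update_filelist() {
    # $1 = top directory of the module
    # $2 = scratch directory outside the module, ideally on the same filesystem
    # $3 = list generator; defaults to create-filelist (assumed to be on PATH)
    tmp=$(mktemp "$2/fullfiletimelist.XXXXXX") || return 1
    if (cd "$1" && "${3:-create-filelist}") > "$tmp"; then
        chmod 644 "$tmp"    # mktemp creates the file mode 0600; mirrors must be able to read it
        mv -f "$tmp" "$1/fullfiletimelist"
    else
        # Generation failed: leave the existing list untouched.
        rm -f "$tmp"
        return 1
    fi
}

# Usage (directories are placeholders):
#   update_filelist /srv/pub/fedora /srv/pub/.tmp
```

Clients never see a half-written list this way: they get either the old file or the complete new one.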
The timestamp in the file list is the newer of mtime and ctime. This means that newly created hardlinks will cause both the original and the new version of the file to appear to have been updated. rsync will note that the extra files are up to date and will create the hardlinks directly (assuming, of course, that it is called with -H).
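The "newer of mtime and ctime" rule can be illustrated with a small sketch. It assumes GNU stat (for the -c format option), and the function name newest_stamp is made up; create-filelist may compute the value differently.

```shell
newest_stamp() {
    # GNU stat: %Y is mtime and %Z is ctime, both in epoch seconds.
    stat -c '%Y %Z' -- "$1" | awk '{ print (($1 > $2) ? $1 : $2) }'
}

# Usage: adding a hardlink changes the link count, which updates ctime
# but not mtime, so the stamp moves forward for every name of the file.
#   ln original.rpm another-name.rpm
#   newest_stamp original.rpm
```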
The format of the file list is simple enough to be parsed by a shell script with a few calls to awk.
Be sure to run create-filelist after every repository change.
Note that this method works for downstream mirrors as well. Intermediate mirrors should not modify the filelists. Even though the timestamps (specifically, ctime) may differ, the file list is still a valid source of data even for mirrors far down the chain. Assuming rsync is called with --delay-updates, downstream should always have a consistent view of the repository. Changes should get out very quickly, because mirrors can poll frequently without overloading servers.
All of this code was originally written by Jason Tibbitts <tibbs@math.uh.edu> and has been donated to the public domain. If you require a statement of license, please consider this work to be licensed as "CC0 Universal", any version you choose.