A full rsync of fedora-buffet0 can take hours just to receive the file list, due to the fact that there are over 12.4 million files. This has to happen for every client, every time they want to update, which is murderous on the download servers and worse on the backend NFS server. It also slows the propagation of important updates because mirrors simply can't poll often.
By generating a simple database of file and directory timestamps, it becomes easy for the client to determine which files have changed since the last mirror run and only mirror those files. This also provides enough information to handle files which have been deleted on the server and files which are missing on the client. In most situations, it also allows hardlinks to be copied as hardlinks instead of being downloaded multiple times.
The client is
quick-fedora-mirror. It is currently written in zsh but
should be portable to bash. (I needed associative arrays and haven't learned
to do them in bash.) It needs no external tools besides
and the usual core utilities.
A config file is required (unless you edit the script); see the sample file in
quick-fedora-mirror.conf.dist. The destination directory and location of
the file to store the last mirror time must be set, though you probably want to
set the list of modules to mirror as well.
The client downloads the master file list for each module, generates lists of
new and updated files, deletes files and directories which no longer exist on
the server, and passes one combined list to rsync via --files-from. Because
all modules are copied together, hardlinks between modules will be copied as
hardlinks. Files and directories which no longer exist on the server are
deleted after the copy has completed, similar to the
The speed improvements can be extraordinary. Just the "receiving file list"
phase of a mirror of
fedora-buffet can take over ten hours and places a
huge load on the host from which you're downloading. With this script it takes
quick-fedora-mirror somewhere. Copy
quick-fedora-mirror.conf, edit as appropriate and copy to one of the
|-a||Always check the file list. This disables the optimization in which the file list isn't processed at all if it hasn't changed from the local copy. Useful if you believe that some files have gone missing from your repository and you want to force them to be fetched.|
|-c||Configuration file to use.|
|-d||Set output verbosity. See the VERBOSE setting in the sample configuration file for details.|
|-t||Instead of the previous run time, use this many seconds since the epoch.
|-T||Instead of the previous run time, use this. The value is passed to |
The last mirror time is assumed to be the epoch if
not previously been run. This means that every single file will be checked,
which will take many hours. If you are certain your mirror is up to date, you
can just fudge the last mirror date:
quick-fedora-mirror -T 'last week'
Then your run will only examine files which have changed in the last week. This may still be a lot of files, but not all of them.
If you have to add a module after the fact (i.e. you already have
fedora-enchilada and you want to add fedora-alt), note that rsync will not pick
up any hardlinks. You can of course do the download and then run hardlink
afterwards, or do a full transfer (i.e. using
-t 0). To minimize the sime
spent counting files, you use a configuration file which specifies only the
modules with which the new module shares hardlinks. Most of the hardlinks are
between fedora-archive and fedora-enchilada, or fedora-alt and
The server must include one file (by default, fullfiletimelist) per module to
be mirrored using this code. This file is created by running
create-filelist. This will generate a list of all files in the current
directory in the proper format and write it to
STDOUT. This output should be
captured to a tempfile and moved into place once the run is complete. It could
do that directory but I wanted to keep it simple.
The timestamp in the file list is the newer of mtime and ctime. This means
that newly created hardlinks will cause both the original and the new version
of the file to appear to have been updated.
rsync will note that the extra
files are up to date and will create the hardlinks directory (assuming, of
course, that it is called with
The format of the file list is simple enough to be parsed by a shell script with a few calls to awk.
Note that this method works for downstream mirrors as well. Intermediate mirrors should not modify the filelists.
rsync is called with --delay-updates, downstream mirrors should
always have a consistent view of the repository. Changes should get out very
quickly, because mirrors can poll frequently without overloading servers.
Note that you can of course run the server component in your own repository,
but the clients will of course need to specify
MODULES array to map module names to directories. The client also the
assumption that all of the separate module are included in a master module. If
you would like to use this code but those constraints don't fit your use case,
please file an issue and I'll be happy to take a look.
Be sure to run
create-filelist after every repository change. You can also
run it from cron, but clients may see the repository in an inconsistent state
in the interval between the changes and the file list generation. This will
not result in any repository corruption, though; clients will pick up the
correct repository state on the next run.
It's a good idea to run a diff or something and only copy the output into place
if the new output differs. Technically
create-filelist could just do this,
but it is currently easy to integrate into whatever script or arrangement you
might have in place.