A full rsync of fedora-buffet0 can take hours just to receive the file list, because the module contains over 11 million files. Every client pays this cost every time it updates, which is murderous on the download servers and worse on the backend NFS server. It also slows the propagation of important updates, because mirrors simply can't poll often.
By generating a simple database of file and directory timestamps and sizes, it becomes easy for the client to determine which files have changed since the last mirror run and only mirror those files. This also provides enough information to handle files which have been deleted on the server and files which are missing on the client. In most situations, it also allows hardlinks to be copied as hardlinks instead of being downloaded multiple times.
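The selection step described above can be sketched in Python. This is only an illustration of the idea, not the client's code: the tab-separated `timestamp`, `size`, `path` layout and the function names `parse_filelist` and `files_to_fetch` are assumptions made for the example, and the real file list format is not reproduced here.

```python
def parse_filelist(text):
    """Parse hypothetical 'timestamp<TAB>size<TAB>path' lines into a dict."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, size, path = line.split("\t", 2)
        entries[path] = (int(stamp), int(size))
    return entries


def files_to_fetch(entries, last_run, local_sizes):
    """Return paths that changed after last_run, are missing locally,
    or whose local size differs from the listed size.

    local_sizes maps path -> size for files already on the client.
    """
    wanted = []
    for path, (stamp, size) in sorted(entries.items()):
        if stamp > last_run or local_sizes.get(path) != size:
            wanted.append(path)
    return wanted
```

Only the paths returned by `files_to_fetch` would need to be handed to the transfer tool; everything else is skipped without contacting the server about it.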
The client is quick-fedora-mirror. It is currently written in zsh but
should be portable to bash. (I needed associative arrays and haven't learned
to do them in bash.) It needs no external tools besides rsync
and the usual core utilities.
A config file is required (unless you edit the script); see the sample file in
quick-fedora-mirror.conf.dist. The destination directory and location of
the file to store the last mirror time must be set, though you probably want to
set the list of modules to mirror as well.
The client downloads the master file list for each module, generates lists of
new and updated files plus those with changed sizes or checksums, and passes
one combined list to rsync via --files-from. Because all modules are copied
together, hardlinks between modules will be copied as hardlinks. Files and
directories which no longer exist on the server are deleted after the copy has
completed, similar to the --delete-delay option to rsync.
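The two halves of that workflow, one combined transfer list and a deletion pass afterwards, can be sketched as follows. This is a Python illustration under stated assumptions, not the zsh client itself: the function names are hypothetical, and `-a` is an assumption of this sketch (only `--files-from`, `-H` and `--delay-updates` are mentioned in this document). The command is built but never executed.

```python
def stale_paths(local_paths, server_paths):
    """Local paths the server no longer lists, deepest first, so files
    are removed before the directories that contain them."""
    stale = set(local_paths) - set(server_paths)
    return sorted(stale, key=lambda p: (-p.count("/"), p))


def rsync_command(source, dest, list_file):
    """Build (but do not run) a single rsync call for the combined list.

    -a is an assumption of this sketch; the other options are the ones
    named in the surrounding text.
    """
    return ["rsync", "-a", "-H", "--delay-updates",
            "--files-from=" + list_file, source, dest]
```

Running the deletions only after the transfer has finished is what makes the behaviour resemble `--delete-delay`.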
The speed improvements can be extraordinary. Just the "receiving file list"
phase of a mirror of
fedora-buffet can take over ten hours and places a
huge load on the host from which you're downloading. With this script it takes
The client preserves file timestamps, but does not preserve directory timestamps in all situations.
After a successful mirror run, the client can optionally perform a mirrormanager checkin for each changed module. This eliminates the need to run report_mirror, and also avoids another full filesystem traversal.
quick-fedora-mirror somewhere. Copy
quick-fedora-mirror.conf, edit as appropriate and copy to one of the
You should ensure that the location you configure as DESTD exists. The
rest of the directory structure will be created there as necessary.
|-a||Always check the file list, and always check in all modules. This disables the optimization in which the file list isn't processed at all if it hasn't changed from the local copy. Useful if you believe that some files have gone missing from your repository and you want to force them to be fetched, or if you want to force a checkin.|
|-c||Configuration file to use.|
|-d||Set output verbosity. See the VERBOSE setting in the sample configuration file for details.|
|-n||Dry run. Don't transfer or delete any content or update the timestamp. Note: the master is still contacted to download the file lists.|
|-N||Partial dry run. Ask rsync to do a normal transfer, but don't delete any local files which are not present in the file list.|
|-t||Instead of the previous run time, use this many seconds since the epoch.|
|-T||Instead of the previous run time, use this. The value is passed to |
|--dir-times||Resynchronize the timestamps on all directories in the repository.|
The last mirror time is assumed to be the epoch if quick-fedora-mirror has
not previously been run. This means that every single file will be checked,
which will take many hours. If you already have a relatively recent mirror,
you can just fudge the last mirror date:
quick-fedora-mirror -T 'last week'
Then your run will only examine (but not necessarily transfer) files which have
changed in the last week. This may still be a lot of files, but not all of
them. The time needn't be precise;
quick-fedora-mirror will clean up stale
files and transfer missing or modified files regardless of the timestamp.
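If you would rather hand an explicit number to -t, the value is just seconds since the epoch. The following Python sketch computes one for "N days ago"; the helper name `epoch_days_ago` is made up for this example, and since the time needn't be precise any reasonable approximation will do.

```python
from datetime import datetime, timedelta, timezone


def epoch_days_ago(days, now=None):
    """Seconds since the epoch for the moment `days` days before `now`."""
    now = now or datetime.now(timezone.utc)
    return int((now - timedelta(days=days)).timestamp())


# Print a command line that starts from roughly one week ago.
print("quick-fedora-mirror -t", epoch_days_ago(7))
```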
If you have to add a module after the fact (e.g. you already have
fedora-enchilada and you want to add fedora-alt), note that rsync will not pick
up any hardlinks. You can of course do the download and then run the
hardlinker afterwards (see below), or do a full transfer (i.e. using
-t 0), though this
will most likely be far slower.
A program to keep your repository fully hardlinked is included.
See the Hardlinker documentation for more information.
The server must include one file per module to be mirrored (by default named
"fullfiletimelist-" with the module name appended). This file is created by
create-filelist. This will generate a list of all files in the
specified directory in the proper format and write it to the specified file.
It is generally best to write this to a temporary location and only move it
into place if the contents actually changed. In order to avoid additional
needless filesystem traversals, it will also optionally generate two extra file
lists not used by the client:
The main file list contains a timestamp and size for each file. The timestamp
in the file list is the newer of mtime and ctime. This means that newly
created hardlinks will cause both the original and the new version of the file
to appear to have been updated.
rsync will note that the extra files are
up to date and will create the hardlinks directly (assuming, of course, that
it is called with -H). But this works only if all of the file lists are
updated at once.
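The "newer of mtime and ctime" rule, and its effect on hardlinks, can be seen in a few lines of Python. This is a sketch of the rule as described here, not code taken from create-filelist; `list_timestamp` is a hypothetical name, and the demonstration assumes a filesystem that supports hardlinks.

```python
import os
import tempfile


def list_timestamp(path):
    """Timestamp as described above: the newer of mtime and ctime."""
    st = os.lstat(path)
    return int(max(st.st_mtime, st.st_ctime))


with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original")
    open(original, "w").close()
    os.utime(original, (1000, 1000))            # pretend the file is very old
    os.link(original, os.path.join(d, "link"))  # a new hardlink bumps ctime
    demo_stamp = list_timestamp(original)       # now reflects the link time
```

Even though the mtime of `original` is still 1000, its listed timestamp is recent, which is why both names show up as updated.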
The output also includes a section with checksums of selected files. By default, this includes only the repomd.xml files, because they are important, their names never change and neither does their size. So if they ever get missed by the mirror process somehow, it's still possible to detect this situation.
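A checksum section of this kind could be gathered roughly as follows. This Python sketch is illustrative only: the function name `checksum_named_files` is hypothetical, and SHA-256 is an arbitrary choice for the example, since the hash used in the real output is not stated here.

```python
import hashlib
import os
import tempfile


def checksum_named_files(root, names=("repomd.xml",)):
    """Map relative path -> hex digest for every file under root whose
    basename is in names. SHA-256 is an assumption of this sketch."""
    sums = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name in names:
                full = os.path.join(dirpath, name)
                with open(full, "rb") as fh:
                    digest = hashlib.sha256(fh.read()).hexdigest()
                sums[os.path.relpath(full, root)] = digest
    return sums


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "repodata"))
    with open(os.path.join(root, "repodata", "repomd.xml"), "wb") as fh:
        fh.write(b"<repomd/>")
    demo_sums = checksum_named_files(root)
```

Because the name and size of such a file stay the same, comparing digests is the only way a client could notice that its copy is out of date.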
The format of the file list is simple enough to be parsed by a shell script with a few calls to awk.
create-filelist takes the following options:
|-d||The directory to scan.|
|-t||The filename of the full file list with times. Defaults to stdout.|
|-f||The filename of the list of files with no additional data. If not specified, no plain file list is generated.|
|-c||Include checksums of all repomd.xml files.|
|-C||Include checksums of all of the specified filenames wherever they appear in the repository. May be specified multiple times.|
|-s||Don't include any fullfiletimelist files in the file list with times to avoid inception.|
|-S||Don't include the named file in the file list with times. May be specified multiple times.|
An example of how you might call
create-filelist as part of a larger system
to manage several modules is given in the
This is only an example, and will at least need to be edited as appropriate for
Note that this method works for downstream mirrors as well. Intermediate mirrors should not modify the filelists.
Because rsync is called with --delay-updates, downstream mirrors should
always have a consistent view of the repository. Due to deletes happening
after rsync runs, downstreams may briefly see a few extra files, but if they
use the file lists this shouldn't matter. Changes should get out very quickly,
because mirrors can poll frequently without overloading servers.
Note that you can of course run the server component in your own repository,
but the clients will then need to specify the
MODULES array to map module names to directories. The client also
makes the assumption that all of the separate modules are subdirectories
accessible from within a master module. If you would like to use this code but
those constraints don't fit your use case, please file an issue and I'll be
happy to take a look.
Be sure to run
create-filelist after every repository change. If you
hardlink files between one module and another, you must update the file lists
in both modules. You can also run it from cron, but clients may see the
repository in an inconsistent state in the interval between the changes and the
file list generation. This will not result in any persistent errors on your
clients, though; they will pick up the correct repository state on the next
run.
It's a good idea to run a diff or something and only copy the output into place if the new output differs. The example wrapper shows one way to do this.
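One possible shape for that compare-then-install step, sketched in Python rather than taken from the example wrapper: the function name `install_if_changed` and the file name `fullfiletimelist-example` are hypothetical.

```python
import filecmp
import os
import tempfile


def install_if_changed(new_path, final_path):
    """Move new_path over final_path only when the contents differ.
    Returns True if final_path was updated."""
    if os.path.exists(final_path) and filecmp.cmp(new_path, final_path, shallow=False):
        os.remove(new_path)           # identical: discard the new copy
        return False
    os.replace(new_path, final_path)  # atomic when both are on one filesystem
    return True


with tempfile.TemporaryDirectory() as d:
    final = os.path.join(d, "fullfiletimelist-example")
    results = []
    for content in ("v1", "v1", "v2"):
        tmp = os.path.join(d, "new")
        with open(tmp, "w") as fh:
            fh.write(content)
        results.append(install_if_changed(tmp, final))
```

Leaving an unchanged list untouched keeps its timestamp stable, so clients that compare it with their local copy can skip processing it.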
Why, when I look at the debugging output, does rsync complain about all of these duplicate directories?
Any directories with updated timestamps will be added to the transfer lists. rsync will implicitly add all levels of parent directories of any updated files, and then complain when that results in duplicates. This is completely harmless.
Does quick-fedora-mirror preserve all timestamps?
It will preserve timestamps on files, but if you modify a timestamp locally to be newer than what the master has, then that timestamp won't be modified unless the file changes on the master.
Timestamps on directories are, in general, not preserved. This script must do any file deletion after the main rsync process has completed, which will necessarily alter various directories and their timestamps.
Code to make a third rsync call to fix up timestamps is being worked on, but this won't be made the default.
All of this code was originally written by Jason Tibbitts <email@example.com> and has been donated to the public domain. If you require a statement of license, please consider this work to be licensed as "CC0 Universal", any version you choose.