#45 Opportunistically pre-link files into place before transfer
Opened 7 years ago by tibbs. Modified 7 years ago

Before we call rsync we already know which files we may be transferring, and the file lists give us some idea of whether we already have files with matching names/sizes. So for each file we expect to transfer which doesn't already exist locally, we can:

  • Look for another file in the file list with that name/size that exists locally.
  • If any match, make a hardlink from that file to the one we intend to download.

If that ends up being the wrong file, the rsync will simply overwrite it. If it's the right file, then we potentially saved ourselves a transfer. This would help even if the files aren't hardlinked on the master server.
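To make the idea concrete, here is a rough Python sketch of the pre-link pass. It is not the project's code: the function name `prelink`, its arguments, and the assumption that the file list has already been parsed into (relative path, size) pairs are all made up for illustration. It also skips anything listed in the checksum section, since name/size matching can't be trusted there.

```python
import os
from collections import defaultdict


def prelink(module_dir, file_list, to_transfer, checksummed=frozenset()):
    """Hardlink locally present name/size matches into place before rsync runs.

    file_list   -- iterable of (relative_path, size) pairs (hypothetical format)
    to_transfer -- relative paths we expect rsync to fetch
    checksummed -- relative paths from the checksum section; never pre-linked
    Returns the (source, target) pairs that were linked.
    """
    sizes = {}
    by_key = defaultdict(list)
    for path, size in file_list:
        sizes[path] = size
        by_key[(os.path.basename(path), size)].append(path)

    linked = []
    for target in to_transfer:
        if target in checksummed or target not in sizes:
            continue
        dest = os.path.join(module_dir, target)
        if os.path.lexists(dest):
            continue  # only fill in files which don't already exist
        for candidate in by_key[(os.path.basename(target), sizes[target])]:
            if candidate == target or candidate in checksummed:
                continue
            src = os.path.join(module_dir, candidate)
            try:
                # The candidate must be a real local file of the listed size.
                if os.path.islink(src) or not os.path.isfile(src):
                    continue
                if os.path.getsize(src) != sizes[target]:
                    continue
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                os.link(src, dest)
            except OSError:
                continue  # e.g. cross-device link; just let rsync transfer it
            linked.append((candidate, target))
            break
    return linked
```

Returning the list of links that were made is deliberate: it gives a later step something to log or undo if the rsync run fails.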

Open questions:

  1. What happens when the transfer fails for some reason? Now we have files which maybe shouldn't exist.
  2. If the file we think will be transferred ends up not existing on the server, we may end up with this random hardlink that shouldn't be there.
  3. It's kind of tough to unwind from this state when something goes wrong in general.
  4. We must be sure to exclude files which appear in the checksum section, because for those we can't trust that "same name/size" means "same content".

Having a few extra files shouldn't actually hurt anything; it might look weird, and they would be pulled down by anyone who uses plain rsync to download content, but they would be removed automatically the next time the module changes. The files wouldn't be unsafe because they would simply be copied from elsewhere in the repo.


It occurs to me that depending on how things are set up, a noarch package could end up signed by a different key depending on where it was built, and the two differently signed copies could still end up with the same size. Such a file might get linked into place, and then the rsync call could bomb out before this "error" gets corrected. And then because the file looks perfectly correct, a subsequent run wouldn't notice it.

So:

  • Can this actually happen?
  • Can this state be detected?

Detection is technically possible because the inode change time on the now "bad" file will be newer than the last mirror time (which won't be updated since the rsync call failed). So we'd need to have our big find over $moduledir/* also output %c, and include in the list of files passed to rsync anything whose ctime is newer than the last mirror time. That isn't really that tough and introduces no additional stat calls. Since hardlinking changes the ctime of the inode (which both names now share), this would occasionally result in a few extra files being sent to rsync, but this is almost completely harmless as they won't be transferred.
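A rough sketch of that ctime check, in Python rather than as an extra `%c` column on the existing find call. The function name `changed_since` and the representation of the last mirror time as a Unix timestamp are assumptions for illustration only.

```python
import os


def changed_since(module_dir, last_mirror_time):
    """Return relative paths whose inode change time is newer than the last
    successful mirror run (last_mirror_time is a Unix timestamp).

    Files pre-linked during a failed run show up here, because os.link()
    bumps the ctime of the linked inode and the mirror time was never updated.
    """
    suspects = []
    for root, _dirs, files in os.walk(module_dir):
        for name in files:
            full = os.path.join(root, name)
            try:
                st = os.lstat(full)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished while we were walking
            if st.st_ctime > last_mirror_time:
                suspects.append(os.path.relpath(full, module_dir))
    return sorted(suspects)
```

Anything returned would simply be added to the list handed to rsync, which re-checks those files and corrects any that were wrongly linked.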

So, the bottom line is that yes, it seems Fedora will intentionally change the content of RPM files without changing the name, and the size remains the same as well. Which means that at least for now, any attempt to be this fancy is going to have to take extra care.

The dates do change when this happens, and thus the file list will reflect the change, but it's possible due to bugs and whatnot that something is missed. So if I go forward with an implementation of this, it needs to be very well tested against various failure modes including interrupted transfers.

Some basic explanation as to why this would still be worth the effort:

  • Sometimes things aren't hardlinked on the master until after you've transferred them.
  • If you poll fast enough you can get one updated module and then another in a different poll.
  • If you're not tier 1 then you will only see half of the linked content until the permissions flip, and then afterwards you won't get the whole set into one transfer because the file list basically won't change.

Metadata Update from @tibbs:
- Issue tagged with: feature

7 years ago
