#59 Opportunistically add "similar" files to the transfer list
Opened 6 years ago by tibbs. Modified 4 years ago

When transferring, say, a file that due to a bitflip has just become accessible to us, we will miss hardlink opportunities because while those files will show up as "missing", they won't show up as "new".

This could be fixed on the server by doing a cp -rl of the bitflipped directory to a temp dir, deleting that temp dir and then regenerating the file list (which will update the inode ctimes of all of the hardlinked stuff). But I don't know if that would be something that people would want to do.
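To illustrate why that trick works, here is a minimal per-file sketch in Python rather than the `cp -rl` invocation itself. It relies on the POSIX behaviour that changing an inode's link count updates its ctime. The function name and the `.tmp-link` suffix are made up for the example.

```python
import os


def link_and_unlink(path):
    """Hard-link `path` to a temporary name, then remove that link.

    Returns (ctime_ns_before, ctime_ns_after). On POSIX systems both
    link() and unlink() mark the inode's ctime for update, so every
    name sharing that inode sees the new ctime.
    """
    before = os.stat(path).st_ctime_ns
    tmp = path + ".tmp-link"  # hypothetical temporary name
    os.link(path, tmp)        # link count goes from n to n+1
    os.unlink(tmp)            # and back to n; contents are untouched
    after = os.stat(path).st_ctime_ns
    return before, after
```

The file's data and mtime are left alone; only the inode change time moves, which is what a regenerated file list would pick up.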

On the client, we have enough information to go through any "missing" files and see if they have counterparts elsewhere in the file lists that have the same name and timestamp. If we just add those to the transfer list, rsync will then figure out when they are hardlinked.

We could conceivably do this for every transferred file. And if we drop the timestamps from the comparison, we could find hardlink opportunities even when the file lists are out of date.
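A rough sketch of the client-side matching described above, in Python for illustration only. The data shapes are assumptions: `missing` is a list of `(module, path, mtime)` tuples and `file_lists` maps a module name to `(path, mtime)` pairs; the real file lists would need parsing into that form first. Setting `use_timestamp=False` gives the looser name-only comparison.

```python
import os
from collections import defaultdict


def find_counterparts(missing, file_lists, use_timestamp=True):
    """Return extra (module, path) entries to add to the transfer list.

    missing:    iterable of (module, path, mtime) for files we lack.
    file_lists: dict of module name -> iterable of (path, mtime).
    Both shapes are assumptions made for this sketch.
    """
    # Index every listed file by basename, plus mtime when requested.
    index = defaultdict(set)
    for module, entries in file_lists.items():
        for path, mtime in entries:
            key = (os.path.basename(path), mtime if use_timestamp else None)
            index[key].add((module, path))

    extra = set()
    for module, path, mtime in missing:
        key = (os.path.basename(path), mtime if use_timestamp else None)
        for candidate in index.get(key, ()):
            # Skip the missing file's own entry in its own module.
            if candidate != (module, path):
                extra.add(candidate)
    return sorted(extra)
```

Whatever this returns would simply be appended to the transfer list, leaving it to rsync to decide which of the entries are actually hardlinked.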

It would take some analysis to see whether this inflates the copied file count excessively, but I think it should be almost free. The only extra cost would come in the unlikely case where a changing file shares its name with other files that have different contents. This is the case with repomd.xml, but we can handle that by not doing this for files listed in the checksums section. Or maybe it doesn't matter, really; it's not that many files.


Metadata Update from @tibbs:
- Issue tagged with: feature

6 years ago

Confirmed that the way that the file lists are generated on the master mirrors will basically guarantee that during a big update like a Fedora release rollout, if you poll at least once every 30 minutes or so, you will see at least one of fedora-enchilada, fedora-secondary or fedora-alt not updated while the others have been updated. Which means you'll miss hardlinks and transfer too much.

This could be fixed on the master just by creating the file lists in lockstep and in one place, but that probably won't ever happen. It's best not to assume it will, and to implement this feature instead.

Recent fun with even active releases being copied to archive, plus even F29 betas being kept well after F30 has been released, rather increases the need for this.

Some questions/notes:

  • Will rsync hardlink on the client even if the file is not hardlinked on the server but could be? Would be awesome if so, but I don't think so.

  • If not, this means we should only look at the file lists from other modules for potential matches.

  • Will need a list of files not to try to transfer opportunistically, because many files are expected to have the same name but different content. Needs to be configurable, of course. For example:

    • repomd.xml
    • risc-v's annoying root.log, build.log and script.log
  • Could also invert this, and only look for linking opportunities with certain types of files: *.rpm, *.iso, various image file formats. Only worth it if doing otherwise results in significant bloat to the transfer list.

  • We should already have the extracted list of paths, though not necessarily for every module since some of them may have changed. Will need to make them all.

  • Should try to make sure we're not looking at directories, just in case. Though if a directory gets added it won't really hurt because we don't do recursive transfers, but it will bloat the list because there are many more directories named g than there are named glibc-2.29.9000-35.fc31.armv7hl.rpm.

  • grep -f will be useful, and using -F may speed things up a bit. Would it work to just use -F on the current transfer list? If 'g' gets in the list then that will end badly unless / is prepended, which requires another temporary file. But then using regexps also has the same issue, because you'd need to prepend \ and append $. The transfer list is expected to be small so these operations should be trivial.
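As a way of thinking through the notes above, here is a small Python sketch of the filtering step. It is not the grep-based approach; it swaps in exact basename lookups against a set, which sidesteps the anchoring problem with a stray `g`. The `(kind, path)` entry shape with `"f"` for regular files, the function names and the default exclude list are all assumptions for illustration.

```python
import fnmatch
import posixpath

# Names expected to repeat with different content (from the notes above).
# In practice this would need to be configurable.
DEFAULT_EXCLUDE = ["repomd.xml", "root.log", "build.log", "script.log"]


def wanted_basenames(transfer_entries, exclude=DEFAULT_EXCLUDE):
    """Collect basenames worth searching for in other modules.

    transfer_entries: iterable of (kind, path); kind "f" means a regular
    file (an assumed convention). Directories and excluded names are
    dropped.
    """
    names = set()
    for kind, path in transfer_entries:
        if kind != "f":
            continue  # never search for directory names such as "g"
        name = posixpath.basename(path)
        if any(fnmatch.fnmatchcase(name, pat) for pat in exclude):
            continue
        names.add(name)
    return names


def matching_paths(names, other_paths):
    """Return paths whose final component exactly equals a wanted name."""
    return [p for p in other_paths if posixpath.basename(p) in names]
```

Because the comparison is on the whole final path component, a name like `g` could only ever match a file literally called `g`, not every path containing that letter. Inverting the filter to an allow list (for example `*.rpm` and `*.iso`) would only mean flipping the `fnmatch` test.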
