#2 Option to use fullfilelist instead of recursing fedora-buffet
Closed 7 years ago Opened 7 years ago by pfrields.

Rather than recursing the tree via rsync, using the fullfilelist would reduce load on the rsync server. This is probably especially useful for frequently repeated processes running in Fedora's cloud or infra. quick-fedora-mirror does something similar, according to @kevin, @pingou, and @puiterwijk. For external users (likely not automating) maybe this isn't so useful, since the fullfilelist is extremely large.

I haven't dug far enough into the fedfind code to see how difficult this would be. The idea would be that, when an option is engaged, fedfind does not use the rsync helper but instead just retrieves the file list via https and pulls the data out of that.


Actually we can do this even better. We can produce a 'notfullfilelist', which would be much smaller, and throw out fedfind's rsync parsing (which I kinda hate - it was just the best idea I could come up with at the time) in favour of parsing that. I will write a patch for https://pagure.io/quick-fedora-mirror/blob/master/f/create-filelist later, but here's my mock-up of it:

[adamw@adam tmp]$ sed -r -e '/^.*\.(rpm|drpm|dtb|html)$/d' altfflist > altnotfflist
[adamw@adam tmp]$ sed -r -e '/^.*\.(rpm|drpm|dtb|html)$/d' fflist > notfflist
[adamw@adam tmp]$ ls -lh *fflist
-rw-rw-r--. 1 adamw adamw 128M Nov 18 09:00 altfflist
-rw-rw-r--. 1 adamw adamw 5.3M Nov 18 09:57 altnotfflist
-rw-rw-r--. 1 adamw adamw 186M Nov 18 08:19 fflist
-rw-rw-r--. 1 adamw adamw 713K Nov 18 09:57 notfflist

That is, if we just filter out rpm, drpm, dtb and html files from the full file lists, we get something pretty tiny. That's small enough that I'm happy just having fedfind always use that. It can of course cache the file locally and only go out and re-download it if the queried release is not in the cached copy.
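For anyone who would rather read the filter as Python than sed, here is a minimal sketch of the same idea. It assumes the list holds one path per line; the function name `filter_filelist` and the sample paths are made up for illustration and are not part of fedfind or create-filelist.

```python
import re

# Same idea as the sed expression above: drop any path ending in one of
# these extensions, keep everything else.
_DROP = re.compile(r"\.(rpm|drpm|dtb|html)$")


def filter_filelist(lines):
    """Yield the paths that do not end in .rpm, .drpm, .dtb or .html."""
    for line in lines:
        path = line.rstrip("\n")
        if not _DROP.search(path):
            yield path


# Invented sample input, only to show the behaviour.
sample = [
    "./example/24/iso/example-live.iso\n",
    "./example/24/Packages/example-1.0-1.x86_64.rpm\n",
    "./example/24/repodata/repomd.xml\n",
    "./example/24/index.html\n",
]
kept = list(filter_filelist(sample))
```

Only the `.iso` and `.xml` entries survive in `kept`; the package and HTML entries are dropped, which is where nearly all of the size goes.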

So we've got a patch for create-filelist merged. I've sent a freeze break request for an infra ansible change to generate the filtered lists. @kevin says he'll test/push that out tomorrow. Once the files show up on the mirrors, I can send out the fedfind change, which I have written and tested already (still need to write the unit tests for it, though). Thanks again for the suggestion.

OK, so the filterlist files are on dl.fedoraproject.org now. The fedfind code is on the no-rsync branch if you want to try it out: note I'm doing force pushes to this branch as I work, rewriting the single commit on it, so to keep in sync you have to do git fetch origin; git reset --hard origin/no-rsync.

The actual code should be done now, I'm just working on the tests. I'm aiming to merge to master and cut a new fedfind release tomorrow or Monday.

It turns out to be a pretty great change:

[adamw@adam fedfind (master %)]$ time ./fedfind.py images -r 24
...
real    0m25.001s
user    0m0.093s
sys     0m0.060s

[adamw@adam fedfind (no-rsync %)]$ time ./fedfind.py images -r 24
...
real    0m1.938s
user    0m0.238s
sys     0m0.056s

so yay for that!

OK, this is now merged to master and released as 3.0.0! Thanks a lot for the idea, it turned out to be a really great improvement to fedfind. It's now hugely faster than before, produces less load on client and server, doesn't fail when the rsync server is full, and I was able to make the tests a lot better. So yay for that!

The create-filelist script now produces 'imagelist' files, which use the productmd 'known image format' definition and only include files in those formats. fedfind simply downloads those files and caches them locally, and derives its results for mirrored composes with no metadata from those files.
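The caching idea discussed in this ticket (keep a local copy of the list, and only re-download when the queried release is not in it) could look roughly like the sketch below. This is not the actual fedfind code: `get_imagelist`, the `fetch` callback, the `/<release>/` path check and the sample paths are all assumptions made for illustration.

```python
import os
import tempfile


def get_imagelist(cache_path, release, fetch):
    """Return the list's lines, re-fetching only if `release` is absent.

    `fetch` is a caller-supplied function returning the list text (for
    example an HTTPS download). It is a stand-in, not a fedfind API.
    """
    needle = "/{0}/".format(release)
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            lines = fh.read().splitlines()
        # Assumption: a release counts as cached if any path mentions it.
        if any(needle in line for line in lines):
            return lines
    text = fetch()
    # Write to a temporary name first so a failed write cannot leave a
    # truncated cache file behind.
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    os.replace(tmp_path, cache_path)
    return text.splitlines()


# Demonstration with a fake fetcher, so nothing touches the network.
calls = []


def fake_fetch():
    calls.append(1)
    return "./example/24/example.iso\n./example/25/example.iso\n"


with tempfile.TemporaryDirectory() as tmpdir:
    cache = os.path.join(tmpdir, "imagelist")
    first = get_imagelist(cache, 24, fake_fetch)   # no cache yet: fetches
    second = get_imagelist(cache, 25, fake_fetch)  # found in cache: no fetch
    third = get_imagelist(cache, 26, fake_fetch)   # release absent: fetches again
```

In the demonstration the fetcher runs twice: once for the initial download and once when release 26 is not found in the cached copy.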

3.0.0 packages will be available soon, or you can check it out of git and run it from the checkout.

@adamwill changed the status to Closed

fedfind 3.0.4 is now built for all releases and updates are submitted for el6, el7, f23, f24, f25. Please let me know if there's anything else you need.
