archive.linux.duke.edu is in the Tier 1 mirror list at https://fedoraproject.org/wiki/Infrastructure/Mirroring/Tiering , but its fedora-enchilada module appears wrong. Based on the behaviour of other mirrors, the top level of the enchilada module should be pub/fedora (so at the top level we should see a 'linux' directory). But this is not the case. Its top level is pub:
[adamw@toolbx releng (main %)]$ rsync --list-only archive.linux.duke.edu::fedora-enchilada/linux
...
rsync: [sender] link_stat "linux" (in fedora-enchilada) failed: No such file or directory (2)
...
[adamw@toolbx releng (main %)]$ rsync --list-only archive.linux.duke.edu::fedora-enchilada/pub/
...
drwxrwsr-x 995 2025/02/25 06:56:58 epel
drwxr-xr-x 198 2025/02/24 00:40:26 fedora
Also, if you roll with the incorrect layout, the development directory - which should be present in enchilada - is entirely missing:
[adamw@toolbx releng (main %)]$ rsync --list-only archive.linux.duke.edu::fedora-enchilada/pub/fedora/linux/
archive.linux.duke.edu
- This rsync server is currently available to any/all people.
- This is subject to change with little or limited notice.
Modules:
drwxr-xr-x 97 2018/02/21 12:20:22 .
drwxr-xr-x 161 2006/10/17 05:46:37 core
drwxrwxr-x 24 2013/04/25 01:55:49 extras
drwxr-xr-x 722 2024/10/25 12:53:24 releases
drwxrwsr-x 703 2025/02/04 07:27:53 updates
Metadata Update from @james: - Issue tagged with: low-gain, medium-trouble, ops
Metadata Update from @phsmoura: - Issue priority set to: Waiting on Assignee (was: Needs Review)
@nate-duke Hey, you are listed as admin contact here. Could you take a look?
I mailed nate-duke directly.
He's going to take a look.
Hey guys, sorry. I was having trouble logging in here yesterday and this morning pagure was down altogether. I didn't have a chance to look at this yet but it's on my list for Monday morning.
Ok, sorry it took me so long to get to this. Busy week!
I've corrected the rsyncd configuration to provide the correct view for ::fedora-enchilada
❯ rsync --list-only rsync://archive.linux.duke.edu/fedora-enchilada/linux/
archive.linux.duke.edu
- This rsync server is currently available to any/all people.
- This is subject to change with little or limited notice.
Modules:
drwxr-xr-x 97 2018/02/21 15:20:22 .
drwxr-xr-x 161 2006/10/17 08:46:37 core
drwxrwxr-x 24 2013/04/25 04:55:49 extras
drwxr-xr-x 722 2024/10/25 15:53:24 releases
drwxrwsr-x 703 2025/02/04 10:27:53 updates
The missing development/ path was due to an effort to save some space in a prior configuration of this system. We've since rearchitected to more sensibly allocate storage resource to this project but I neglected to add extra storage back for fedora and adjust the quick-mirror config. I've corrected this now and we're running a quick-mirror to pull in the new content.
Total on client: 436019 files, 3490 dirs.
Not present on server: 235709 files, 1720 dirs.
Missing on client: 159337 files, 537 dirs.
Size Changed: 58 files.
Timestamps to restore: 2278 files.
Checksum Failed: 72 files.
Filelist changes: 253828 paths.
Total to transfer: 253831 paths.
thanks a lot for the fixes!
So, how are we looking now?
Can we close this? or do we still need to fix anything?
top level looks right now, but the development directory is still missing.
I'll double check it. the development directory was taking a while to sync so i just left it to the persistence of cron to eventually finish.
EDIT: I expect it's the '-T "last day"' argument to our quick-mirror call that's to blame.
Running a fresh sync now without the time specification.
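For anyone following along, here is a minimal sketch of how a relative time string like 'last day' resolves to a cutoff. It assumes the -T value is interpreted the way GNU `date -d` interprets it, which is an assumption about the script's internals and not something confirmed in this thread. The helper name `resolve_cutoff` is made up for illustration. If the script only considers entries newer than that cutoff, a directory that has been missing for much longer than a day might never be picked up.

```shell
# Resolve a relative time string to a Unix timestamp using GNU date.
# Assumption: the -T argument is parsed like `date -d`; check the script's
# own help output before relying on this.
resolve_cutoff() {
    date -u -d "$1" +%s
}

now=$(date -u +%s)
cutoff=$(resolve_cutoff 'last day')

# Show how far back the cutoff reaches, in seconds and as a readable date.
echo "cutoff is $(( now - cutoff )) seconds before now"
date -u -d "@$cutoff" '+cutoff date (UTC): %Y-%m-%d %H:%M:%S'
```

Swapping in 'last week' or 'last month' shows how far a wider window would reach.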
How's it looking now?
Still no development:
[adamw@toolbx openQA (master %)]$ rsync --list-only archive.linux.duke.edu::fedora-enchilada/linux/
archive.linux.duke.edu
- This rsync server is currently available to any/all people.
- This is subject to change with little or limited notice.
Modules:
drwxr-xr-x 97 2018/02/21 12:20:22 .
drwxr-xr-x 161 2006/10/17 05:46:37 core
drwxrwxr-x 24 2013/04/25 01:55:49 extras
drwxr-xr-x 722 2025/04/11 04:01:08 releases
drwxrwsr-x 723 2025/02/04 07:27:53 updates
@adamwill Yeah, i've been messing with it every morning for the past week trying to get it to sync that directory. When i run quick-fedora-mirror it makes a bit of progress and then stalls out. Here's the tail of the output. I'm open to any suggestions on alternative ways. I was reading through the docs yesterday and trying to come up with an alternate way to get the development directory and then running the hardlinker. The docs indicated that was a possibility.
The full output of the current run is in a snippet here
Total on client: 736623 files, 5449 dirs.
Not present on server: 400240 files, 3147 dirs.
Missing on client: 38414 files, 18 dirs.
Size Changed: 58 files.
Timestamps to restore: 3518 files.
Checksum Failed: 90 files.
Filelist changes: 271653 paths.
Total to transfer: 271656 paths.
>> Log: Counts for fedora-epel: Svr:374794/2320 Loc:736623/5449 Diff:271653 New:269549/2104 Xtra:400240/3147 Miss:38414/18 Size:58 Csum:90 Dtim:3518
>> Log: Processing end: fedora-epel
Finished processing fedora-epel.
Changes in fedora-epel: 271656 files/dirs
============================================================
============================================================
Transferring 1644553 files.
>> Calling /usr/bin/rsync --timeout=600 -aSH -f R .~tmp~ --stats --delay-updates --out-format=@ %i %10l %n%L -v --files-from=master-transferlist.sorted rsync://dl.fedoraproject.org/fedora-buffet0/ /fedora/pub/
>> Log: calling /usr/bin/rsync --timeout=600 -aSH -f R .~tmp~ --stats --delay-updates --out-format=@ %i %10l %n%L -v --files-from=master-transferlist.sorted rsync://dl.fedoraproject.org/fedora-buffet0/ /fedora/pub/
so...it's winding up with a huge transfer, basically. I don't think it's doing anything wrong, but that transfer looks like it's going to take a long time. At various points it logs "Possibly aborted rsync run. Cleaning up.", which indicates that yeah, the previous run just didn't finish.
For fedora-enchilada, it looks like you still have the 39 and 40 releases present; manually 'syncing' those (so they just contain the README pointing to archive) might help a bit. It might also be a good idea to concentrate on one module at a time: both enchilada and epel seem to want to do a huge amount of syncing, for some reason, so doing them both in one transaction is maybe not a good idea. It might be best to get one synced, then work on the other.
I'll give that a shot. Since i posted that and the previous run log i streamlined things a bit and increased verbosity to 7 to maybe get some progress output and it's been chewing on the filelist for a few hours.
I also went back to -T 'last day' in case that was adding to the troubles.
It's not using very much memory and seems single threaded (no surprise there i suppose). current run output
far as i can tell it hasn't transmitted much data either, just chomping on that file list.
here's some boring graphs of resource usage for the current manual run.
<img alt="2025-05-01_11-23.png" src="/fedora-infrastructure/issue/raw/files/cae5b5bf32ea78ef0725c2aa50d03e6d55b9c78db9a28ce6296880c6e61654db-2025-05-01_11-23.png" />
<img alt="2025-05-01_11-24.png" src="/fedora-infrastructure/issue/raw/files/a726aaff1c3a976ecab2b1e58228f8b5b3df247b62c1c84c7f05107c056af11e-2025-05-01_11-24.png" />
I wonder if it might actually be better to do a non-"quick" sync first to get things back more or less in line? I'm guessing maybe the "quick" sync stuff doesn't work so well if things are wildly out of line to start with.
oh, also, a dumb idea, but: it's not somehow still configured to sync to the same layout you had before, and thus essentially redoing everything from scratch plus throwing away the correct layout, is it?
shouldn't be. This stuff is all running in kubernetes using a specific PVC for each mirror and that hasn't changed: not the backing NFS export, not the mount point inside the container, and not the parameters for the target directory in the quick-fedora-mirror.conf file.
I'm open to trying this. i messed around with how to do it a few days ago but the stern warnings coming back from the upstream rsync server to not just use rsync shied me away.
... i've been chewing on this question. Can you clarify the issue this would present?
We are using the same layout. Ultimately the only change that we've made was to change
FILTEREXP='(/releases/test|/development)'
to
FILTEREXP='(/releases/test)'
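To make the effect of that change concrete, here is a small sketch run against a hypothetical list of paths (not real file list data). It assumes FILTEREXP is an extended regular expression and that matching paths are skipped, which fits what happened here but is not spelled out in this thread. The function names are made up for illustration.

```shell
# Hypothetical sample of mirror paths, for illustration only.
sample_paths() {
    printf '%s\n' \
        'fedora/linux/releases/41/Everything/x86_64/os/' \
        'fedora/linux/releases/test/41_Beta/' \
        'fedora/linux/development/rawhide/Everything/' \
        'fedora/linux/updates/41/'
}

# Print the sample paths that survive an exclusion regex.
# Assumption: paths matching the expression are excluded from the sync.
kept_paths() {
    sample_paths | grep -Ev "$1"
}

echo "old filter keeps:"
kept_paths '(/releases/test|/development)'
echo "new filter keeps:"
kept_paths '(/releases/test)'
```

With the old expression the development path is dropped; with the new one it is kept.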
No, what I meant is how you initially had the top level set as /pub rather than /pub/fedora, as reported in the initial comment. It seemed just barely possible that there was some kind of mismatch there in the sync config so none of the files matched between the ends at all and it would end up wiping all your local files and re-transferring everything. But I don't think that's right, looking at the logs.
It's definitely wanting to transfer a hell of a lot, though. Look at the enchilada summary:
Total on client: 1313976 files, 3535 dirs.
Not present on server: 378826 files, 765 dirs.
Missing on client: 779291 files, 1569 dirs.
Size Changed: 17 files.
Timestamps to restore: 1658 files.
Checksum Failed: 30 files.
Filelist changes: 1374874 paths.
Total to transfer: 1374883 paths.
So it thinks nearly 800,000 files are "missing on client" (i.e. those are files it's going to have to fully copy over from the server) and nearly 400,000 files are "not present on server" (i.e. those are files you have locally but which don't match anything on the server; I was assuming this was just the F39 and F40 files, but...). That seems like a lot of files, though I don't actually know if 800,000 is anything like "all the files". I guess it can plausibly just be 'all of development'.
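A rough way to put those numbers in proportion is to pull the counts out of the summary text and do the arithmetic. The sketch below embeds the three relevant lines from the enchilada summary and assumes the labels mean what is described above; the helper names are made up for illustration.

```shell
# Three lines copied from the enchilada summary in this thread.
summary() {
    cat <<'EOF'
Total on client: 1313976 files, 3535 dirs.
Not present on server: 378826 files, 765 dirs.
Missing on client: 779291 files, 1569 dirs.
EOF
}

# Print the file count that follows a given label.
count_for() {
    summary | awk -F': ' -v label="$1" '$1 == label { split($2, parts, " "); print parts[1] }'
}

total=$(count_for 'Total on client')
extra=$(count_for 'Not present on server')
missing=$(count_for 'Missing on client')

# Local files that also exist on the server, the share of the local tree
# that has no server counterpart, and the number of files still to fetch.
echo "local files also on server: $(( total - extra ))"
echo "share of local tree not on server: $(( extra * 100 / total ))%"
echo "files to fetch: $missing"
```

So by these counts, the files still to fetch are not far short of the number of local files that already line up with the server.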
@adamwill, thanks for the explanation. The web front end for this stuff is pretty malleable, that was the issue with the earlier "parenting issues" ;)
I guess we'll just let it run like this and hope for the best. If it hasn't made any progress tomorrow I'll abort it and look into running a bare rsync of just the development tree.
still chugging away today, it has made it up to
1013500 files...
1013600 files...
1013700 files...
but is only using ~400MB of memory (4Gi available) and is basically saturating one core.
I'm going to restart it again and give it more cores to see if that helps it chew on that file list.
ok, i just punted and did a plain rsync of the development tree. I have a "normal" quick-fedora-mirror running now that'll hopefully be able to finish on its own since all the development content will be present.
🤞🏻
Just as an update. the quick-fedora-mirror process takes a little over 24h to chew through the file list. It's made some good progress overnight and i'm still minding it daily to make sure it hasn't croaked.
I'll drop an update here every couple of days until I see it able to complete in a timely fashion.
How are things now? all good?
If somebody could assess where we are and let me know that'd be great. I've been running quick-fedora-mirror constantly for the last ~month. I don't KNOW that it's ever finished but it's had various long runs of progress ... and i manually rsync'd the development directory behind its back at one point.
Any pointers on making this thing run better, or tips on how to break it into pieces are welcome.
well, the development dir is there, but it seems to have the Rawhide compose from May 2nd:
rsync --list-only archive.linux.duke.edu::fedora-enchilada/linux/development/rawhide/Everything/x86_64/iso/
...
-rw-r--r-- 1,017,135,104 2025/05/01 22:58:01 Fedora-Everything-netinst-x86_64-Rawhide-20250502.n.0.iso
I'm afraid I can't tell why you're apparently having so much trouble getting it to sync, but...we definitely have other mirrors using the same script who aren't having the same problems...
Sorry to be so problematic as to elicit the ellipsis of incompetence. I assure you we're doing our best here.
Is there a way to execute the individual stages or top level modules of the enchilada so that we can get some parallelism going on? I think the big problem is that it's just a single thread/process trying to chew on like 4TiB and millions (or more?) of files.
I admit that our NAS might not be the greatest thing in the world and maybe nobody else is running this script under kubernetes on NFS. I'm open to suggestions on what to do to improve the situation and get us back in line with your expectations.
Well, quick mirror should just make things better/faster than bare rsync... it gets an index file from us, if there's no changes it doesn't transfer anything, if there are changes, it transfers only those files it doesn't have (and our side doesn't need to have rsyncd stat a million files to compare).
So, does quick mirror finish quickly? or never seem to finish? also, what mirror are you pulling from? if dl.fedoraproject.org, might be download-ib01 or download-cc-rdu01 are 'closer' and faster for you?
A regular rsync should work, but will just be slow to start as it makes our side send checksums for millions of files and then your side has to compare them and stat all the local versions... so quick mirror should be much better if it can be made to work. ;)
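As a toy illustration of the idea described above (compare an index from the server against what is already local, and only fetch the differences), here is a sketch using a made-up "mtime path" list layout. The real index format is not shown in this thread, so the layout, the file names and the `needs_transfer` helper are all hypothetical, and paths containing spaces are not handled.

```shell
# Print paths from the server list that are absent locally or whose
# timestamp differs. Arguments: server list file, local list file.
# Each line is "mtime path" (a simplified, hypothetical layout).
needs_transfer() {
    awk 'FILENAME == ARGV[1] { have[$2] = $1; next }
         !($2 in have) || have[$2] != $1 { print $2 }' "$2" "$1"
}

# Demo with throwaway data.
work=$(mktemp -d)
printf '%s\n' '100 a.rpm' '200 b.rpm' '300 c.rpm' > "$work/server.list"
printf '%s\n' '100 a.rpm' '150 b.rpm' > "$work/local.list"
needs_transfer "$work/server.list" "$work/local.list"
rm -rf "$work"
```

When the two lists are identical nothing is printed, which corresponds to the "no changes, no transfer" case.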
@kevin quick mirror has never finished (to my knowledge). It'll run for a day or two, transfer some files but it spends the vast majority of its time just chewing on this file list.
We're hitting dl.fedoraproject.org which is resolving to dl-iad01.fedoraproject.org and has for at least the last 7 days.
Here's the log from the current run. It's been going about 4 hours.
https://gitlab.oit.duke.edu/-/snippets/327
I should point out that we also have ubuntu, debian and several other smaller mirrors that are all syncing just fine using the same infrastructure. Ubuntu is bare rsync (i think, it's been a while since i touched it) and Debian is using archvsync, their custom tooling.
Granted these are all smaller (ubuntu is ~3TiB and Debian about 800 GiB).
The point though is that the storage and compute resources available seem sufficient to the task of keeping this stuff fresh as Ubuntu syncs twice a day and debian syncs every 4 hours or so. I think there's just something peculiar about this quick mirror script.
Sorry to be so problematic as to elicit the ellipsis of incompetence
That's not what I meant! I use ellipses compulsively all over the place (much like the word "basically". And parentheses...)
I just meant I'm sort of stuck trying to figure out what's causing this. It'd be simpler if the explanation was just "oh, yeah, the script sucks, it's broken for everyone".
I mean, looking at the log, it does seem like wiggy stuff is going on. We're out of the script itself in five minutes, but it feeds rsync a file list with 1.5 million files in it. I would guess it should only be updating, I dunno, a few thousand files?
It is still noticing that the previous run was aborted and trying to 'correct' it. I wonder if it's stuck in a loop of increasingly intensive 'correction' attempts or something.
My next move would probably be to try using rsync directly for a few days at least, see if that works smoother, and if so, try switching back to the script now it's starting from a 'clean' state and see how it does then?
On our tier 1 mirror we also use quick-fedora-mirror but we run it separately for each category. It just took too long to run it over everything at once. We lose the automatic hardlinking between categories, but at least it finishes.
You might look for .~tmp~ dirs that rsync leaves around... try deleting all those and doing a fresh sync? If those are there it tries to complete an existing sync, but it might be it's too old and it gets confused.
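A minimal sketch for locating those directories, assuming the mirror content lives under a single root (the logs above show /fedora/pub/ as the target, but adjust as needed). `.~tmp~` is the holding directory rsync uses with --delay-updates. The helper name is made up, and the sketch only lists what it finds so that deletion stays a deliberate manual step.

```shell
# List leftover rsync --delay-updates holding directories under a root.
# Printing only; nothing is removed.
find_stale_tmp() {
    find "$1" -type d -name '.~tmp~' -print
}

# Demo against a throwaway tree with hypothetical paths, not a real mirror.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/fedora/linux/development/.~tmp~" "$demo_root/epel/10"
find_stale_tmp "$demo_root"
rm -rf "$demo_root"
```

On the real tree the call would be something like `find_stale_tmp /fedora/pub`, run while no sync is in progress.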
quick-mirror should be faster than any normal rsync (because it doesn't have to stat everything, and it only transfers files you need), but it could be something is just completely out of whack and a fresh full rsync would help.
So, i've given up trying to be smart about this and am just going to cycle through a raw rsync for each of the top level modules over the next couple of days, then run a quick-mirror with -T 'last week' or something.
Sorry that this has carried on so long. It takes SO LONG to know if this thing is even making progress.
Thanks for continuing to poke at it. Sorry it's being a hassle. ;(
I think we might be all set here. I did all the rsync work yesterday, each one only took an hour or so. I then let a container just running quick-mirror with -T 'last day' run over night and it finished 19 times. Our normal kubernetes cronjob is running now and seems to be doing okay. It did spit out this in the epel portion which is a little weird and maybe a result of me running rsync by hand? It's the could not make way for symlink at the bottom here:
>> Log: Counts for fedora-epel: Svr:419867/2342 Loc:900148/4814 Diff:0 New:3178/1566 Xtra:480280/2472 Miss:2/0 Size:0 Csum:0 Dtim:3595
>> Log: Processing end: fedora-epel
Finished processing fedora-epel.
Changes in fedora-epel: 4749 files/dirs
============================================================
============================================================
Transferring 9824 files.
>> Calling /usr/bin/rsync --timeout=600 -aSH -f R .~tmp~ --stats --delay-updates --out-format=@ %i %10l %n%L --no-motd --files-from=master-transferlist.sorted rsync://dl-tier1.fedoraproject.org/fedora-buffet0/ /fedora/pub/
>> Log: calling /usr/bin/rsync --timeout=600 -aSH -f R .~tmp~ --stats --delay-updates --out-format=@ %i %10l %n%L --no-motd --files-from=master-transferlist.sorted rsync://dl-tier1.fedoraproject.org/fedora-buffet0/ /fedora/pub/
could not make way for new symlink: epel/10
could not make way for new symlink: epel/testing/10
ok, the make way thing was just the application's desire to symlink 10.0 to 10 and there was already a 10 directory. I cleared that out from epel and epel/testing and it cleared. quick-mirror -T 'last day' finishes in about 5 minutes since it's all so fresh.
I think we're good but would love to do a healthcheck if i can. https://fedoraproject.org/wiki/Infrastructure/Mirroring#rsync_health_checking mentions mirrormanager doing a healthcheck but i can't find anything in my site in mirrormanager to enable or configure that.
If there's a script or something you guys use to do that would you mind passing it along? I'm happy to run it myself and it would probably be a good thing to know about so that i can get ahead of any problems like this in the future.
Here's the typical logs we're seeing now.
https://gitlab.oit.duke.edu/-/snippets/329
despite it saying
>> Extracting file and directory lists for fedora-enchilada.
Total on server: 1496923 files, 3884 dirs.
New on server: 3575 files, 905 dirs.
It doesn't seem to be transferring anything. Not sure if i need to be concerned about that or maybe tweak the -T 'last day' to maybe longer?
In that case it should have a list of those specific 3575 files and ask the server for them.
Are things looking the same or better/worse now? Note that we had the big datacenter move in there.
@kevin
Apologies that i haven't updated this. I think we're good now. What I did to fix it was run through a set of manual rsyncs for each module and then mess with the -T setting some.
Here's a log of a qa run I did just now.
https://gitlab.oit.duke.edu/-/snippets/332/raw/main/snippetfile1.txt
If someone could take a look at that, and maybe verify the health of our mirror i think we can call this good to close. I mentioned earlier that if there's some DIY method for ensuring mirror health that i'm happy to set that up on my own as well.
Thanks again for all your patience and help.
Great. thank YOU for mirroring... we appreciate it!
Metadata Update from @kevin: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)