Every day, in both staging and prod., docs-sync seems to download a lot of data and then delete it. While it downloads the data it uses a lot of CPU and disk space, and depending on the proxy usage that can take it over thresholds and trigger monitoring alerts.
In stg. it takes about 2 hours and consumes ~20GB of data in one shot ... and then deletes it all.
In prod. it takes about 2 hours and consumes ~40GB of data, then deletes 20-30GB of it, then consumes up to 40GB again ... and then deletes it all.
Metadata Update from @james: - Issue assigned to james
Low gain, but there are enough fires going on that stopping random alerts is nice. Also, if it gets much worse we might run out of disk space.
Metadata Update from @james: - Issue tagged with: low-gain, medium-trouble
The "fun" part is that docs-rsync runs every hour, and does not do much for 23 of its runs ... taking 10-20m and using minimal CPU/IO, basically just checking that most of the data is the same.
But then at 23:00 UTC for stg. and 0:00 UTC for prod. ... it goes insane.
E.g. first individual du -shx runs and then a combined one (which counts hardlinks only toward the first dir. they are found in), from a prod. proxy12:
Mon Oct 13 12:55:47 AM UTC 2025
27G     /srv/web/docs-combined
27G     /srv/web/docs.fedoraproject.org
2.2M    /srv/web/docs-redirects
54G     /srv/web

27G     /srv/web/docs-combined
21G     /srv/web/docs.fedoraproject.org
376K    /srv/web/docs-redirects
7.4G    /srv/web

Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/GuestVolGroup00-root   94G   71G   19G  79% /

Mon Oct 13 01:58:03 AM UTC 2025
39G     /srv/web/docs-combined
40G     /srv/web/docs.fedoraproject.org
2.2M    /srv/web/docs-redirects
68G     /srv/web

39G     /srv/web/docs-combined
21G     /srv/web/docs.fedoraproject.org
376K    /srv/web/docs-redirects
7.4G    /srv/web

Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/GuestVolGroup00-root   94G   84G  5.4G  94% /

Mon Oct 13 02:08:26 AM UTC 2025
14G     /srv/web/docs-combined
14G     /srv/web/docs.fedoraproject.org
2.2M    /srv/web/docs-redirects
23G     /srv/web

14G     /srv/web/docs-combined
2.4G    /srv/web/docs.fedoraproject.org
376K    /srv/web/docs-redirects
7.4G    /srv/web

Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/GuestVolGroup00-root   94G   40G   50G  45% /
So ... I'd assumed something was triggering rsync and, due to that, it was doing something weird with the data transfer. But rsync is working as expected.
The roles/fedora-docs/proxy/files/docs-rsync* scripts are identical for staging/prod and take the data from sundries01::docs/ and sundries01::docs-redirects/ and merge them into a single dir. that gets served. The idea being that sundries01::docs/ is old data that doesn't change but new data is being put into the redirects. However rsync can only copy between one or two hosts, so we rsync both sets of remote data to local dirs. and then rsync both to a final combined directory. To save space we hardlink the data between the local copies and the final combined data.
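A minimal sketch of that two-stage flow (paths and flags are illustrative and assumed; the actual roles/fedora-docs/proxy/files/docs-rsync* scripts may hardlink differently, e.g. with -H rather than --link-dest):

```shell
#!/bin/sh
# Sketch of the two-stage sync described above -- not the real script.
set -eu

# Stage 1: pull each remote module into its own local copy.
rsync -a --delete sundries01::docs/           /srv/web/docs.fedoraproject.org/
rsync -a --delete sundries01::docs-redirects/ /srv/web/docs-redirects/

# Stage 2: merge both local copies into the directory that gets served.
# --link-dest points rsync at the local copy, so any file that is
# unchanged becomes a hardlink into it instead of a second data copy.
rsync -a --delete --link-dest=/srv/web/docs.fedoraproject.org/ \
      /srv/web/docs.fedoraproject.org/ /srv/web/docs-combined/
rsync -a --link-dest=/srv/web/docs-redirects/ \
      /srv/web/docs-redirects/ /srv/web/docs-combined/
```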
Alas sundries01::docs/ is actually the NFS mount ntap-rdu3*:/openshift_stg_docs, which is also mounted by the docsbuilding openshift project, which runs a cron job in openshift every day to "rebuild" the docs. Some of that is a bunch of changes like:
--- /srv/web/docs.fedoraproject.org/en-US/fedora/f29/system-administrators-guide/Wayland/index.html	2025-10-15 18:52:35.727536245 +0000
+++ docs.fedoraproject.org/en-US/fedora/f29/system-administrators-guide/Wayland/index.html	2025-10-15 17:52:34.984794327 +0000
@@ -908,7 +908,7 @@
   </li>
 </ul>
 </div>
-<p class="text-center text-xs text-fp-gray-dark">Last build: 2025-10-15 18:50:09 UTC | Last content update: 2018-07-28 </p>
+<p class="text-center text-xs text-fp-gray-dark">Last build: 2025-10-15 17:50:09 UTC | Last content update: 2018-07-28 </p>
 </section>
 <!-- Red Hat Sponsorship Section -->
 <section class="bg-black py-6 text-center md:text-left px-2">
...which means rsync would have to change a bunch of data every day anyway, but the build also regenerates all of the mtimes for the files. Then, because we specify --times (to preserve mtimes, via -a), rsync has to update the mtimes; but because the files are hardlinked rsync can't just update the mtime in place ... it has to copy the data in each file just so it can put a new mtime on it. Sigh. Testing rsync also shows that even if you pass --no-times (or don't pass --times) it will still break the hardlinks.
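The hardlink constraint is easy to demonstrate with plain coreutils (hypothetical temp files, nothing to do with the real docs trees): two names for the same inode share one mtime, so giving one name a "new" mtime changes the other too, which is why rsync has to break the link and re-copy instead.

```shell
# Demonstrate why rsync can't quietly fix the mtime on a hardlinked file:
# both directory entries point at one inode, so there is only one mtime.
tmp=$(mktemp -d)
printf 'data\n' > "$tmp/a"
ln "$tmp/a" "$tmp/b"                          # second name, same inode
touch -d '2020-01-01 00:00:00 UTC' "$tmp/a"   # "update" mtime via one name
mtime_a=$(stat -c %Y "$tmp/a")
mtime_b=$(stat -c %Y "$tmp/b")
rm -rf "$tmp"
echo "$mtime_a $mtime_b"                      # both names show the new mtime
```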
tl;dr Possible solutions:
I wonder if we couldn't look at dropping docs-old at this point?
That's the old publican stuff...
CC: @darknao @pbokoc @pboy
Testing rsync also shows that even if you pass --no-times (or don't pass --times) it will still break the hardlinks. Some kind of change to rsync so that using --no-times doesn't break hardlinks would be another option.
With some more local testing the fix is to pass --no-times and --checksum. The latter means rsync still needs to do a lot of read IO to calculate checksums for each file ... but if a file is identical except for its mtime then it will treat it as equal and won't break the hardlink.
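A sketch of what that looks like for the final merge stage (paths illustrative; the actual change lands in the docs-rsync scripts):

```shell
# -a implies --times; adding --no-times after it turns that back off, and
# --checksum switches the "has this file changed?" test from size+mtime
# to a full content checksum.  A file differing only in mtime is then
# treated as unchanged, so its hardlink is left alone.
rsync -a --no-times --checksum --delete \
      /srv/web/docs.fedoraproject.org/ /srv/web/docs-combined/
```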
Going to push this to staging today, and we can see what the graphs look like tonight.
Graphs looked good: https://zabbix.stg.fedoraproject.org/history.php?action=showgraph&itemids%5B0%5D=50594
PR: https://pagure.io/fedora-infra/ansible/pull-request/2908
Just got merged and pushed.
Issue tagged with: sprint-0
I think we can call this solved then. Thanks @james!
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Issue status updated to: Open (was: Closed)
Issue status updated to: Closed (was: Open) Issue close_status updated to: Fixed