#12848 docs-rsync doing lots of IO at midnight and setting off alarms.
Closed: Fixed a month ago by james. Opened 2 months ago by james.

Every day, in both staging and prod. ... docs-rsync seems to download a lot of data and then delete it. While it downloads the data it uses a lot of CPU and disk space, and depending on the proxy's other usage that can push it over thresholds and trigger monitoring alerts.

In stg. it takes about 2 hours and consumes ~20GB of data in one shot ... and then deletes it all.

In prod. it takes about 2 hours and consumes ~40GB of data, then deletes 20-30GB of it, then consumes up to 40GB again ... and then deletes it all.


Metadata Update from @james:
- Issue assigned to james

2 months ago

Low gain, but there are enough fires going on that stopping random alerts is nice. Also, if it gets much worse we might run out of disk space.

Metadata Update from @james:
- Issue tagged with: low-gain, medium-trouble

2 months ago

The "fun" part is that docs-rsync runs every hour, and does not much for 23 of it's runs ... taking 10-20m and using minimal CPU/IO, basically just checking that most of the data is the same.

But then at 23:00 UTC for stg. and 0:00 UTC for prod. ... it goes insane.

E.g. first individual du -shx runs and then a combined one (which counts a hardlinked file only against the first dir. it is found in), followed by df, from prod. proxy12:

Mon Oct 13 12:55:47 AM UTC 2025
27G  /srv/web/docs-combined
27G  /srv/web/docs.fedoraproject.org
2.2M /srv/web/docs-redirects
54G  /srv/web

27G  /srv/web/docs-combined
21G  /srv/web/docs.fedoraproject.org
376K /srv/web/docs-redirects
7.4G /srv/web

Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/GuestVolGroup00-root   94G   71G   19G  79% /

Mon Oct 13 01:58:03 AM UTC 2025
39G  /srv/web/docs-combined
40G  /srv/web/docs.fedoraproject.org
2.2M /srv/web/docs-redirects
68G  /srv/web

39G  /srv/web/docs-combined
21G  /srv/web/docs.fedoraproject.org
376K /srv/web/docs-redirects
7.4G /srv/web

Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/GuestVolGroup00-root   94G   84G  5.4G  94% /

Mon Oct 13 02:08:26 AM UTC 2025
14G  /srv/web/docs-combined
14G  /srv/web/docs.fedoraproject.org
2.2M /srv/web/docs-redirects
23G  /srv/web

14G  /srv/web/docs-combined
2.4G /srv/web/docs.fedoraproject.org
376K /srv/web/docs-redirects
7.4G /srv/web

Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/GuestVolGroup00-root   94G   40G   50G  45% /
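
For reference, the numbers above come from roughly the following commands (the exact invocations aren't shown in the ticket, so the paths and flags are assumptions):

# Individual runs: each dir. measured on its own, so hardlinked data is
# counted against every dir. that contains a link to it.
du -shx /srv/web/docs-combined
du -shx /srv/web/docs.fedoraproject.org
du -shx /srv/web/docs-redirects
du -shx /srv/web

# Combined run: within one invocation du counts a hardlinked inode only
# under the first argument where it is seen, so shared data shows up once.
du -shx /srv/web/docs-combined /srv/web/docs.fedoraproject.org /srv/web/docs-redirects /srv/web

# Filesystem usage.
df -h /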

So ... I'd assumed something was triggering rsync and causing it to do something weird with the data transfer, but rsync is actually working as expected.

The roles/fedora-docs/proxy/files/docs-rsync* scripts are identical for staging/prod. They take the data from sundries01::docs/ and sundries01::docs-redirects/ and merge them into a single dir. that gets served, the idea being that sundries01::docs/ is old data that doesn't change while new data is being put into the redirects. However, rsync can only copy between one or two hosts, so we rsync both sets of remote data to local dirs. and then rsync both into a final combined directory. To save space we hardlink the data between the local copies and the final combined data. A rough sketch of that flow is below.
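
Very roughly, it looks something like this (a simplified sketch, not the actual docs-rsync script; the options and local paths are assumptions, and delete handling is omitted):

# Stage 1: pull both remote trees down to local staging dirs.
rsync -a --delete sundries01::docs/           /srv/web/docs.fedoraproject.org/
rsync -a --delete sundries01::docs-redirects/ /srv/web/docs-redirects/

# Stage 2: merge both local trees into the dir. that actually gets served.
# --link-dest makes rsync hardlink unchanged files against the local copies
# instead of storing the data a second time.
rsync -a --link-dest=/srv/web/docs.fedoraproject.org \
      /srv/web/docs.fedoraproject.org/ /srv/web/docs-combined/
rsync -a --link-dest=/srv/web/docs-redirects \
      /srv/web/docs-redirects/ /srv/web/docs-combined/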

Alas sundries01::docs/ is actually the NFS mount ntap-rdu3*:/openshift_stg_docs, which is also mounted by the docsbuilding openshift project, and that project runs a daily cron job in openshift to "rebuild" the docs.
Some of that rebuild is a bunch of changes like:

--- /srv/web/docs.fedoraproject.org/en-US/fedora/f29/system-administrators-guide/Wayland/index.html 2025-10-15 18:52:35.727536245 +0000
+++ docs.fedoraproject.org/en-US/fedora/f29/system-administrators-guide/Wayland/index.html  2025-10-15 17:52:34.984794327 +0000
@@ -908,7 +908,7 @@
         </li>
       </ul>
     </div>
-    <p class="text-center text-xs text-fp-gray-dark">Last build: 2025-10-15 18:50:09 UTC  | Last content update: 2018-07-28 </p>
+    <p class="text-center text-xs text-fp-gray-dark">Last build: 2025-10-15 17:50:09 UTC  | Last content update: 2018-07-28 </p>
   </section>
   <!-- Red Hat Sponsorship Section -->
   <section class="bg-black py-6 text-center md:text-left px-2">

...which means rsync would have to change a bunch of data every day anyway, but the rebuild also regenerates all of the mtimes for the files. Then, because we specify --times (to preserve mtimes, via -a), rsync has to update the mtimes, but because the files are hardlinked rsync can't just tweak the mtimes in place ... it has to re-copy the data in each file just so it can put a new mtime on it, breaking the hardlink in the process. Sigh. Testing rsync also shows that even if you pass --no-times (or don't pass --times) it will still break the hardlinks, since the mtime mismatch still fails rsync's quick size+mtime check and forces a transfer. A minimal local reproduction is sketched below.
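
Something like this shows the behaviour locally (a minimal sketch; the file and dir. names are made up for the example):

# dst/file has the same content and mtime as src/file, and keep/file is a
# hardlink to dst/file (standing in for the docs-combined copy).
mkdir -p src dst keep
echo data > src/file
cp -p src/file dst/file
ln dst/file keep/file

# Now only the mtime on the source side changes (like the daily docs rebuild).
touch src/file

stat -c '%i %n' dst/file keep/file   # same inode

# The quick check fails on the mtime, so rsync re-transfers the file into a
# temp file and renames it over dst/file ... which breaks the hardlink, even
# if you add --no-times.
rsync -a src/ dst/

stat -c '%i %n' dst/file keep/file   # different inodes now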

tl;dr Possible solutions:

  1. Stop running the openshift docsbuilding daily, and the old data won't change anymore.
  2. Stop running the rsync of the old data automatically (so it'll still get rebuilt, but we won't re-copy it daily just because of the mtimes).
  3. Some kind of change to rsync so using --no-times doesn't break hardlinks.
  4. Stop using hardlinks to save space (should save a lot of IO, but we'll be using an extra 20-40GB of storage all the time -- and thus probably need to do #5 as well).
  5. Increase the disk space, so we stop getting alerts, and ignore the dumpster fire.

I wonder if we couldn't look at dropping docs-old at this point?

That's the old publican stuff...

CC: @darknao @pbokoc @pboy

> Testing rsync also shows that even if you pass --no-times (or don't pass --times) it will still break the hardlinks.

> Some kind of change to rsync so using --no-times doesn't break hardlinks

With some more local testing, the fix is to pass --no-times and --checksum. The latter means rsync still needs to do a lot of read IO to calculate checksums for each file ... but if a file is identical except for its mtime, rsync will treat it as equal and won't break the hardlink.
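
In terms of the repro above, the adjusted call would look something like this (the real change lands in the docs-rsync scripts, so the exact command line here is just illustrative):

# --checksum makes rsync compare content instead of size+mtime, and
# --no-times (overriding the -t implied by -a) stops it from re-setting
# mtimes afterwards, so identical-but-retouched files are left alone and
# the hardlinks survive.
rsync -a --no-times --checksum src/ dst/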

Going to push this to staging today, and we can see what the graphs look like tonight.

Just got merged and pushed.

Issue tagged with: sprint-0

2 months ago

I think we can call this solved then. Thanks @james!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 months ago

Issue status updated to: Open (was: Closed)

a month ago

Issue status updated to: Closed (was: Open)
Issue close_status updated to: Fixed

a month ago

