sundries01 is currently alerting every hour at XX:55 (see Zabbix history).
On investigation, this appears to be rsync traffic from the proxies, and indeed I see this in Ansible:
roles/fedora-docs/proxy/files/cron-docs-rsync:55 * * * * root /usr/local/bin/lock-wrapper docssync /usr/local/bin/docs-rsync >& /dev/null
There are many sync jobs on the proxies, but this one seems particularly heavy. Perhaps we should distribute it across the hour? Something like this:
cron:
  name: docsync
  minute: "{{ (inventory_hostname | hash('md5') | int(base=16)) % 60 }}"
  user: root
  job: "/usr/local/bin/lock-wrapper docssync /usr/local/bin/docs-rsync >& /dev/null"
  cron_file: docs-rsync
which gives a deterministic "minute" based on hostname.
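For anyone who wants to preview where each proxy would land, the hostname-to-minute mapping can be approximated outside Ansible. This is a minimal sketch, assuming the `hash('md5')` filter hashes the UTF-8 bytes of the hostname; the hostnames below are hypothetical placeholders, not real inventory entries.

```python
import hashlib


def cron_minute(hostname: str) -> int:
    """Read the MD5 hex digest of the hostname as a base-16 integer, modulo 60."""
    digest = hashlib.md5(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % 60


# Hypothetical hostnames, used only to illustrate the spread across the hour.
hosts = ["proxy01.example.org", "proxy02.example.org", "proxy03.example.org"]
schedule = {host: cron_minute(host) for host in hosts}

for host, minute in sorted(schedule.items()):
    print(f"{host}: runs at XX:{minute:02d}")
```

The same hostname always maps to the same minute, but two hosts can still collide on a minute, since there are only 60 slots.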
Metadata Update from @james: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: low-gain, low-trouble
So, some back story ... In the original ticket about docs-rsync alerting on the proxies, one conclusion I had was that the big root cause of the problem is that "old-docs" is N GB and AIUI never changes, yet it is regenerated every night, which changes the timestamps on all the files and also puts "build time: $(date)" in a bunch of the HTML files ... which made rsync lose its mind.
The "simple solution" that was pushed was to have rsync ignore mtime and do checksum checks, so all the files that only change timestamps don't get transferred ... that fix obviously leads to this, as now every docs-rsync run has to checksum N GB of data on sundries. Sigh.
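To make the trade-off concrete, here is a rough sketch of the two comparison strategies. It is not rsync's implementation: SHA-256 stands in for whatever checksum rsync uses, and all file names are made up. The point is only that a size+mtime check reads metadata, while a checksum check has to read every byte of every file.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the whole file; the cost grows with file size."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def quick_check_differs(src: str, dst: str) -> bool:
    """Metadata-only comparison (size and mtime); no file contents are read."""
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def checksum_differs(src: str, dst: str) -> bool:
    """Content comparison; reads both files in full."""
    return file_digest(src) != file_digest(dst)


def demo() -> tuple:
    """Same content, different timestamps: the 'regenerated every night' case."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.html")
        dst = os.path.join(tmp, "dst.html")
        for path in (src, dst):
            with open(path, "wb") as f:
                f.write(b"<html>old docs</html>")
        os.utime(src, (1_500_000_000, 1_500_000_000))
        os.utime(dst, (1_000_000_000, 1_000_000_000))
        return quick_check_differs(src, dst), checksum_differs(src, dst)


print(demo())
```

In the demo the metadata check reports a difference (so the file would be re-sent), while the checksum check reports none, at the price of reading both copies in full.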
We spoke about it in the standup and here is roughly what we said:
Obvious solution repeated: We turn off old-docs generation, turn mtime checking back on, and everything should be happy, as timestamps would then only change when content actually changes.
Other possible solutions:
Randomize it per host ... downsides are:
We can't easily know when any proxy will be updated apart from "wait about an hour, and it should be there"
Proxies will pretty much never have the exact same content on them.
Change the syncing ... e.g. have a single file that changes when anything changes? Downsides are:
Someone needs to change a bunch of code and test it (maybe just use fullfiletimelist for old-docs as we do for public mirroring? -- and then the rsync code changes).
No matter what we'll hit this at ~midnight UTC due to the regeneration.
Boost the HW on sundries so it doesn't alert ... downsides: lol, boil the ocean.
Good point on the differences between proxies, that makes sense. I wasn't thinking about the end result there, clearly.
I think the obvious path is actually - stop alerting CPU on sundries unless it goes on for too long, maybe over an hour? My original goal was to see if we could avoid the load spike, since that's not great - but if we can't, we should make our monitoring understand it. I can look into making that happen.
I've made a change to the Zabbix config that seems to have worked - I changed the "5 minutes" part of the check to a configurable value, and set the default to 5m. Then I set sundries01 to "30m" instead - meaning, it only alerts if the load goes on for more than 30m, which seems right looking at the pattern.
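The effect of the longer window can be illustrated with a small sketch. This is not the Zabbix trigger itself (its expression isn't shown in this ticket); it assumes the check fires only when every sample in the window is above a threshold, and the threshold, sample interval, and load values are all made up.

```python
from typing import Sequence, Tuple


def sustained_above(
    samples: Sequence[Tuple[int, float]],
    now: int,
    window_s: int,
    threshold: float,
) -> bool:
    """True only if the window has samples and every one of them exceeds the threshold."""
    recent = [value for ts, value in samples if now - window_s < ts <= now]
    return bool(recent) and min(recent) > threshold


# Hypothetical data: one sample per minute, with a 10-minute spike from minute 50 to 59.
samples = [(minute * 60, 8.0 if 50 <= minute < 60 else 1.0) for minute in range(61)]
now = 59 * 60
THRESHOLD = 5.0  # made-up load threshold

alert_5m = sustained_above(samples, now, 5 * 60, THRESHOLD)
alert_30m = sustained_above(samples, now, 30 * 60, THRESHOLD)
print(alert_5m, alert_30m)
```

With these made-up numbers the 5-minute window fires during the spike, while the 30-minute window stays quiet because the spike does not last long enough to fill it.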
This seems to have resolved the issue, so unless anyone thinks this isn't the way to go, I'll prep a post-freeze commit to make it permanent.
Metadata Update from @gwmngilfen: - Issue assigned to gwmngilfen
Metadata Update from @gwmngilfen: - Issue tagged with: sprint-0
Issue status updated to: Closed (was: Open) Issue close_status updated to: Fixed
I've made the fix permanent via the template change & override for sundries - seems good :thumbsup: