https://koji.fedoraproject.org/koji/taskinfo?taskID=137394254
It seems that every so often, builds fail with an HTTP 503 Service Unavailable error from kojipkgs.fedoraproject.org.
Filing this ticket to track the problem.
One more: https://koji.fedoraproject.org/koji/taskinfo?taskID=137410948
I'll take a look at this one.
Metadata Update from @kevin: - Issue assigned to kevin - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops
so, I looked into this some.
It seems that sometimes the kojipkgs servers were not passing their haproxy health checks and got dropped from rotation until enough checks passed again. Rarely, both 01 and 02 would drop at the same time, so haproxy would return a 503.
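For context, a minimal sketch of what the haproxy side of this looks like (server names, ports, and check parameters here are illustrative, not the actual production config, which is managed in ansible):

```
backend kojipkgs
    # probe each backend over HTTP; a server is pulled from rotation after
    # "fall" consecutive failed checks and re-added after "rise" good ones
    option httpchk GET /
    server kojipkgs01 kojipkgs01:80 check inter 10s fall 3 rise 2
    server kojipkgs02 kojipkgs02:80 check inter 10s fall 3 rise 2
    # if both servers are marked down, haproxy itself answers 503
```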
As a first step, I updated and rebooted both servers. That gives them a new kernel, a new varnish, and more.
Please let me know if you see this again. I am going to wait and watch it for a bit and see if it comes back.
If it does, might have to adjust some varnish or haproxy settings.
It still seems to be doing it. Will look at some tuning.
Fell into a rabbit hole on this one this afternoon. :)
So, the problem seems to be that sometimes (triggered by what, I am not sure) haproxy gets a timeout on its health check for one or both of the kojipkgs servers. I was able to duplicate it with curl from a proxy as well. Varnish isn't reporting any errors at all and seems very happy.
So either there's something in varnish that's not processing all requests, or something on the proxies is having trouble talking to varnish. I haven't been able to figure out which yet.
FYI, https://koji.fedoraproject.org/koji/taskinfo?taskID=137435046 is a different error and the first time I've seen it, but as it might be related, I am adding it here.
https://koji.fedoraproject.org/koji/taskinfo?taskID=137436572
Out of 5 builds, 3 failed with the original error... haha
One more: https://koji.fedoraproject.org/koji/taskinfo?taskID=137438017
(Let me know if you want me to stop btw)
I've seen this happen a few times as well:
One more: https://koji.fedoraproject.org/koji/taskinfo?taskID=137442170
The src.fedoraproject.org 503 I think is something else.
The kojipkgs ones are all this. ;(
https://koji.fedoraproject.org/koji/taskinfo?taskID=137445971 This one is a bit different, DNF 503'd and 502'd on getting the files from kojipkgs.
Still probably the same thing.
- I can duplicate it from all 4 rdu3 proxies (by just doing a curl in a loop; sketched below).
- I can't duplicate it from batcave or wiki01 (another Fedora instance).
- There are no errors or anything apparently wrong on kojipkgs; it just never gets the connections.
- I updated kernels on proxy101/110, no change. proxy10 has the oldest kernel and still shows it.
- Nothing I can find in updates or ansible changes matches up.
- I've tried a bunch of varnish settings with no luck.
- The latest attempt was to do a curl in a loop with a known source port so I could tcpdump it on kojipkgs... and... it doesn't happen.
- Stopping nftables makes no change.
- Anubis is not involved.
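For reference, the reproduction was essentially a loop of this shape (the hostname/URL is a placeholder, not the exact command used):

```
# hammer the backend from a proxy and flag requests that hang instead of
# returning promptly; TARGET is a placeholder for the real kojipkgs address
TARGET="http://kojipkgs01/"
for i in $(seq 1 1000); do
    curl -s -o /dev/null --max-time 10 -w "%{http_code} %{time_total}s\n" "$TARGET" \
        || echo "timeout/error on iteration $i"
done

# pinning the client source port (for the tcpdump attempt) can be done with
# curl's --local-port option
curl -s -o /dev/null --max-time 10 --local-port 31000 "$TARGET"
```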
ok. I figured out that if the local (source) port on the proxy is over 32k, the connection seems to time out.
I set the local port range to stay under 32k and it seems to be working.
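The knob in question is the kernel's ephemeral port range; a minimal sketch of the workaround (the exact bounds used aren't recorded here, so the values below are an assumption):

```
# shrink the ephemeral (source) port range so outgoing connections never use
# ports above 32k; the exact values here are an assumption
sysctl -w net.ipv4.ip_local_port_range="15000 32000"

# persisted via sysctl.d (file name is illustrative)
echo 'net.ipv4.ip_local_port_range = 15000 32000' > /etc/sysctl.d/90-local-ports.conf
```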
I am still really puzzled why this would happen. I've got some open chat questions to networking folks in case it's something there. I cannot see anything on our side that would do this. ;(
Anyhow, if you see any more 503's from kojipkgs, please let us know.
I went through a whole 246 package chainbuild without a single crash. So at this point this LGTM!
ok. I completely don't understand why this was happening, and indeed... it no longer seems to fail that way.
I'm just going to keep it set the way it is (with the lower port range) and move on I think.
Thanks everyone!
Metadata Update from @kevin: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)
Metadata Update from @kevin: - Issue status updated to: Open (was: Closed)
That fix didn't hold. It's happening again.
Metadata Update from @kevin: - Issue untagged with: medium-gain, medium-trouble - Issue tagged with: high-gain, high-trouble
My experience lately is that it really started yesterday (Oct 7th). During the day, builds crashed sporadically with 503 errors. Sometimes I quite literally couldn't start chain-builds because src.fp.o also returned 503 errors.
It seems to also affect QA according to what I can see.
~/FedoraWork/fedora/kde-updates-work/fw/kf6-kio ~/FedoraWork/fedora/kde-updates-work/fw
branch 'f42' set up to track 'origin/f42'.
Updating 450d0a0..13c138a
Fast-forward
 .gitignore   | 2 ++
 kf6-kio.spec | 8 +++++++-
 sources      | 4 ++--
 3 files changed, 11 insertions(+), 3 deletions(-)
Error during lookup request: status: 503
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
Could not execute push: Failed to execute command.
That's while pushing a package to git.
This is not this issue. It's because src is under heavy scraper load and we can't enable anubis due to https://pagure.io/fedora-infrastructure/issue/12812
ah, I spoke too soon, this might indeed be related now that I have dug deeper. :(
ok. I have put in place some haproxy config where it will retry a few times if it gets a timeout, and also try a different backend.
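A sketch of what such a retry setup can look like (illustrative, not the exact production change; `retry-on` needs HAProxy 2.0 or newer):

```
defaults
    # re-attempt a failed connection a few times...
    retries 3
    # ...and allow the retry to be dispatched to a different backend server
    option redispatch
    # also retry when the connection fails outright or the server times out
    retry-on conn-failure response-timeout
```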
This will not solve the underlying problem, but it hopefully will make it happen much much less...
So, reports of how often or if you see it would be welcome.
10-4. I will let you know. I still gotta do Gear builds in a day or two, so I'll have occasions to spot it.
(Context: KDE Gear = ~350 packages, across all branches)
Issue tagged with: sprint-0
I am getting a lot of these errors this morning:
https://koji.fedoraproject.org/koji/taskinfo?taskID=137993384 BuildError: Error running GIT command "git clone -n https://src.fedoraproject.org/rpms/kate.git /var/lib/mock/f43-build-side-120610-63102458-6610479/root/chroot_tmpdir/scmroot/kate", see checkout.log for details
Submitting a chain build often seems to result in the following error: Could not execute chainbuild: Got an error finding f43 head for kosmindoormap: fatal: unable to access 'https://src.fedoraproject.org/rpms/kosmindoormap.git/': The requested URL returned error: 503
Another one: https://koji.fedoraproject.org/koji/taskinfo?taskID=137994858
I'm going to stop building for a while and try again later.
kojipkgs is doing much better. I still see a few timeouts, but far fewer.
pkgs/src is having a lot more, however. I put in a freeze break to also set up some retry-on-error there.
The underlying cause is still sadly not found. ;(
The src workaround is now also in place. Will see if it improves things...
Yeah, I am seeing some of the other issues (e.g. errors about fedora-messaging when pushing, or a strange "sources are not loaded"), but the F43 KDE Gear build has been largely uneventful.
I am about to start Fedora 42 before I head for bed, so we'll see tomorrow!
It's been a very long time since I had a full gear build without a single issue.... that was nice!
1 chainbuild and everything went fine, good job! I hope you can find the underlying issue!
ok, the problem is still happening and has an even wider scope (although the mitigations for kojipkgs and src should hopefully mostly shield those from problems).
Here's what I know:
- The problem exists between our prod vlan proxies in rdu3 (proxy01/10/110/101) and servers in the build vlan (koji01/02, kojipkgs01/02, pkgs01, oci-registry02, etc).
- Most of the time everything works as expected. Some small % of the time, a connection from one of the rdu3 proxies (proxy01/10/101/110) just times out.
- It happens to both apache->haproxy->backend and apache->backend.
- It affects at least: kojipkgs, src, koji, registry.
- All the proxies are F42 and pretty up to date.
Things I have tried that haven't fixed it:
- upgraded from the 6.16 to the 6.17 kernel on a proxy
- reverted some updates on a proxy that might be related
- tried to find a match for when it started against updates or ansible commits, but it's not super clear when it started
- adjusted the local port range (which seemed to help for a bit, then stopped helping)
- stopped nftables on the local side, the remote side, and both
- messed with haproxy options, before finding that it affects apache -> backend too
- messed with varnish options, before finding that it's not just kojipkgs
- completely disabled/removed anubis from proxy101/110; they still see the issue
- messed with a lot of tcp options on the proxies: tcp_tw_reuse, tcp_ecn, somaxconn, keepalive_time, mtu_probing, tcp_wmem, tcp_rmem, etc. (the corresponding sysctls are sketched after this list)
- looked at lots of tcp queues (netstat -s, ss -ntl); nothing seems to be hitting limits
- asked networking folks to look for any problems between those two vlans; they say everything looks fine
- looked at all the virthosts involved; nothing seems amiss there
- made sure that everything has MTU 1500, and it does
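For reference, the tcp knobs and queue checks from the list above map roughly to the following (no values are shown because the ticket doesn't record exactly what was set):

```
# current values of the tunables that were experimented with
sysctl net.ipv4.tcp_tw_reuse net.ipv4.tcp_ecn net.core.somaxconn \
       net.ipv4.tcp_keepalive_time net.ipv4.tcp_mtu_probing \
       net.ipv4.tcp_wmem net.ipv4.tcp_rmem

# look for listen-queue overflows, drops, and per-socket backlogs
netstat -s | grep -iE 'overflow|drop|prune|listen'
ss -ntl
```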
Additionally, all the externally reachable proxies (i.e., all but proxy101/proxy110) are now getting coredumps from time to time... it seems to be in an http2 teardown path. This is all of them, not just the rdu3 ones. I think it's not related, but I cannot be sure.
So, I think evidence points to it being the 4 proxies themselves somehow, or something in between the build and prod vlans, but I haven't been able to figure out what. ;(
Ideas, suggestions welcome.
This is probably too much work to try out but this is what I was thinking.
If you have a spare hardware box on the prod network, do the following:
1. Install it with RHEL 10 and set it up as a proxy111. Add it to the queue of systems and check to see if it has problems. This eliminates possible problems with virthost networking stacks or virtual machine scheduling.
2. If the problem doesn't occur, reinstall it with Fedora 42, set it up again as proxy111, and see if the problem occurs then.
3. If the problem doesn't occur in either case, then we have a probable cause in the virtual stack.
4. If the problem does occur on EL-10, then we know the problem goes back a while in software, or it is switch/vlan related. However, we then also have a 'this affects more than us' type of problem to get help with.
5. If the problem doesn't occur on EL-10 but does on Fedora 42, we then have to 'walk the releases' until we get it duplicated, probably best with a binary 'search' of 40 and then 38 or 41, depending on what happens.
If we don't have spare hardware, then doing something similar with virtual machines would at least see if the problem is OS related or if it is unknown.
If the problem turns out to be at the virtual machine level, it is tougher to fix, but at least we would still have an idea of where to get help.
Well, setting up another box is possible of course, but... it would require a bunch of work from networking to make it usable (it would need to be in the right vlan, then all the NAT mappings would need to be updated, along with the ACLs for inter-vlan access). It's doable, but I am not sure when they could even get to it.
I'm not sure if rhel10/epel10 has all the things we need to install for a proxy. It might, but it also might not.
I also considered a clean f42 install on one to rule out some kind of misconfig that wasn't in ansible. Or a f43 one likewise.
Will ponder... thanks!
it's not super clear when it started.
Getting visibility here might be the key to solving the mystery (and preventing it from returning in the future). I don't know the details well enough to offer specific suggestions, but perhaps consider what metrics could be put in place to directly observe the issue and its frequency and timing.
(Feel free to offer details here, such as the exact point in the network flow the timeout seems to be happening. e.g., is haproxy the most downstream source of the connection that gets dropped?)
Not sure if https://pagure.io/fedora-infrastructure/issue/12845 is the same issue or something different?
It looks like the same issue to me.
Yeah, we have haproxy and apache logs that show the timeouts, but it's a bit hard to pin down a definite time when it started. Obviously before this ticket was filed... but it seems like it ramped up over time. I'll see if I can get more info there.
Connections from our proxies to servers on the build network. In some cases it's haproxy, in other cases it is apache directly via a balancer. The layer on the proxy side doesn't seem to matter; it's a timeout talking from proxyN to build-vlan appZ.
Thanks everyone for suggestions. I am going to see if I can come up with a list of things to try. I do not want to do anything at all today if I can avoid it as we are trying to compose a RC.
So, looking at proxy01 and koji timeouts in July and August:
```
    5 Jul 15
    2 Jul 23
   44 Jul 24
   62 Jul 25
   47 Jul 26
   32 Jul 27
   39 Jul 28
   56 Jul 29
   47 Jul 30
   31 Jul 31
   31 Aug 01
   22 Aug 02
   21 Aug 03
   17 Aug 04
  289 Aug 05
   38 Aug 06
   32 Aug 07
   33 Aug 08
   33 Aug 09
   27 Aug 10
   30 Aug 11
55468 Aug 12
   50 Aug 13
   45 Aug 14
   36 Aug 15
   50 Aug 16
   35 Aug 17
   38 Aug 18
   43 Aug 19
   38 Aug 20
   36 Aug 21
   31 Aug 22
   21 Aug 23
   27 Aug 24
   32 Aug 25
   36 Aug 26
   34 Aug 27
   54 Aug 28
   75 Aug 29
   60 Aug 30
   81 Aug 31
   86 Sep 01
```
Looks like it really started in late July (that bad day in August is an outlier) and it has slowly ramped up since.
Here's October:
```
  886 Oct 01
  547 Oct 02
  815 Oct 03
  719 Oct 04
  665 Oct 05
  897 Oct 06
 1319 Oct 07
  879 Oct 08
   10 Oct 09
  398 Oct 10
  677 Oct 11
  478 Oct 12
 1061 Oct 13
   26 Oct 14
```
I'm so far unable to find anything that happened on Jul 23rd that I can tie to this.
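For anyone who wants to pull counts like these themselves, something along these lines should work on a proxy (the log path and match pattern are assumptions about the local setup; haproxy marks server-side timeouts with termination codes like sH/sC):

```
# count haproxy-reported server-side timeouts per day; adjust the path and the
# pattern to whatever the proxies actually log
grep -h -E ' s[HCQ]' /var/log/haproxy.log* | awk '{print $1, $2}' | uniq -c
```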
So here are the bigger hammers I can think of:
For proxies we can take one out of dns, so ideally no outage...
Messing with the hypervisors will cause an outage:
I guess at this point I'd say try the proxy reinstalls (because I can do that without an outage) and only then move on to hypervisors, but perhaps the reboot might be worth doing as that would be a pretty short outage.
I've been searching for things online, and this: https://serverfault.com/questions/504308/by-what-criteria-do-you-tune-timeouts-in-ha-proxy-config
suggests that "timeout server" should be close to 30s, where we currently have it at 500s (aka 8m20s)... and while we do have some requests that take a long time to respond, 8m+ seems a bit much... and if there's some funky network thing happening that we can't see on connections over 5m, then lowering that timeout might magically fix everything.
Well, it's worth noting that this happens on non haproxy paths too... so I don't think any haproxy adjustments will solve things. ;(
i.e., pkgs/src uses a direct path with an apache load balancer. ;(
The slowly increasing occurrence rate has me worried, as it points to something systematic that we're not seeing, especially since there have been reboots, which suggests it's not as simple as bad garbage collection somewhere.
Given the choices @kevin has outlined, I'd agree with doing some reinstalls with different versions of Fedora - if we can isolate what it's happening on, we'll at least have a mitigation strategy. It's not exactly a git bisect, but it's probably the closest we're going to get.
report from yesterday/friday:
This seems to have made proxy10 'better', but it's still seeing the timeouts.
I am not sure what next things to try are. Will ponder on it.
I tried some more small changes without much luck. Currently things are probably not failing as much, but they are instead slow... because haproxy is retrying...
I have a meeting tomorrow with our networking folks to help debug from their side.
We've been experiencing build failures with https://riscv-koji.fedoraproject.org/ as well and have noticed 503 errors in the kojid logs.
kojid[1420]: 2025-10-22 01:58:48,283 [INFO] {1420} koji:3198 Try #1 for call 1969 (listBuildroots) failed: 503 Server Error: Service Unavailable for url: https://riscv-koji.fedoraproject.org/kojihub
We have also seen a GSSAPIAuthError due to a read timeout at least once.
GSSAPIAuthError: unable to obtain a session (gssapi auth failed: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='riscv-koji.fedoraproject.org', port=443): Read timed out. (read timeout=60)) Use following documentation to debug kerberos/gssapi auth issues. https://docs.pagure.org/koji/kerberos_gssapi_debug/
Yes, this is the same issue.
Just FYI, I had a meeting with the networking folks and they found some possible things in the firewall. They are hopefully going to try implementing some tomorrow and we can see if it helps...
It seems to happen less, but still: https://koji.fedoraproject.org/koji/taskinfo?taskID=138559725
Yeah, those changes didn't seem to fix it.
I did make another change after that. kojipkgs is now using port 8080 instead of port 80 and... I have yet to see a failure with it. This is pointing to something about port 80 traffic causing this.
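A quick way to compare the two paths from a proxy is a loop like this (the hostname is a placeholder):

```
# run the same request burst against port 80 and port 8080 and compare how
# many time out; TARGET_HOST is a placeholder for the real kojipkgs backend
TARGET_HOST="kojipkgs01"
for port in 80 8080; do
    fails=0
    for i in $(seq 1 200); do
        curl -s -o /dev/null --max-time 10 "http://$TARGET_HOST:$port/" || fails=$((fails + 1))
    done
    echo "port $port: $fails timeouts/errors out of 200"
done
```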
@kevin howdy, so is it now better with 8080 usage?
kojipkgs seems 100% fine to me running on port 8080.
src/pkgs, koji, riscv-koji are all still using port 80 and still showing the problem.
Networking has some more ideas, but have some other fires too. They are going to try and create a plan to apply some changes for those things soon to see if they help.
I'm also going to look at seeing if we can move those services off 80 for now since that seems to mitigate it.
If you see any kojipkgs ones, do let me know.
thanks for the update
Yeah, I built quite a few packages in the last couple of days and I haven't seen koji fail builds because of this. As you mentioned, it's still very painful to do anything using dist-git, even submitting chain-builds (sometimes the operation also 502/503s there, I assume because it has to go fetch the commit hashes? I'm only assuming).
But yes, works as expected.