Since the firewall cluster upgrade earlier this week, users have been reporting the koji cli giving a:
502 Server Error: Bad Gateway for url: https://koji.fedoraproject.org/kojihub
This is easily duplicated by picking a large, long-running build and running a watch-logs on it. The error doesn't actually affect the underlying tasks; it just errors on the client end, and the user can re-run their watch to resume (but then it will fail again).
The traceback looks like this (rawhide koji):
2025-11-11 20:03:54,480 [DEBUG] koji: Opening new requests session
Traceback (most recent call last):
  File "/usr/bin/koji", line 331, in <module>
    rv = locals()[command].__call__(options, session, args)
  File "/usr/lib/python3.14/site-packages/koji_cli/commands.py", line 6559, in anon_handle_watch_task
    return watch_tasks(session, tasks, quiet=options.quiet, poll_interval=goptions.poll_interval, topurl=goptions.topurl)
  File "/usr/lib/python3.14/site-packages/koji_cli/lib.py", line 361, in watch_tasks
    for child in session.getTaskChildren(task_id):
                 ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/lib/python3.14/site-packages/koji/__init__.py", line 2539, in __call__
    return self.__func(self.__name, args, opts)
           ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.14/site-packages/koji/__init__.py", line 3123, in _renew_expired_session
    return func(self, *args, **kwargs)
  File "/usr/lib/python3.14/site-packages/koji/__init__.py", line 3149, in _callMethod
    return self._sendCall(handler, headers, request)
           ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.14/site-packages/koji/__init__.py", line 3042, in _sendCall
    raise e
  File "/usr/lib/python3.14/site-packages/koji/__init__.py", line 3038, in _sendCall
    return self._sendOneCall(handler, headers, request)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.14/site-packages/koji/__init__.py", line 3086, in _sendOneCall
    r.raise_for_status()
    ~~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.14/site-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://koji.fedoraproject.org/kojihub
related/other mentions: https://pagure.io/fedora-infrastructure/issue/12814#comment-993194 https://pagure.io/releng/issue/13087
So, it seems likely something in the new firewall OS is being more aggressive about closing connections. It's worth noting that I see the 502 in the proxy access logs, but there's nothing in the error log or the koji hub log.
@mikem have any thoughts on this? would koji be trying to reuse a closed connection here? I can look at adding some kind of keepalives, but we didn't need them before and I would prefer to reason out what to add before we try too much.
Oops. The releng ticket is not related. It's an unrelated outage this morning.
> would koji be trying to reuse a closed connection here? I can look at adding some kind of keepalives
Koji uses a python-requests session for all hub connections, which automatically uses keep-alive. Also keep-alive is the default in the http standard since http/1.1. Koji should automatically detect when a connection gets closed and make a new one. However, the symptom here is not a connection closing, but an active connection returning a 502.
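To illustrate the keep-alive behavior being described (this is a stdlib-only sketch with a toy local server, not koji's actual code): an HTTP/1.1 client reuses the same TCP connection for consecutive calls, which can be seen by checking that the client-side port stays constant.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 defaults to keep-alive
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence request logging
        pass

# Toy local server standing in for the hub
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/")
conn.getresponse().read()
port_first = conn.sock.getsockname()[1]   # client-side port of the TCP connection

conn.request("GET", "/")                  # second call on the same object...
conn.getresponse().read()
port_second = conn.sock.getsockname()[1]  # ...rides the same kept-alive connection

print(port_first == port_second)          # True: the connection was reused
server.shutdown()
```

If the server had closed the connection in between, a requests session would transparently open a new one; the problem in this ticket is different, since the proxy is returning a live 502 response rather than closing the socket.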
What is probably happening is that the http connection from the proxy to the backend hub is timing out before the http connection from the client to the proxy. What are the keep-alive timeout settings on each? You probably want to make sure the proxy keep-alive timeout is shorter than the backend's.
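Assuming the proxies and hubs run Apache httpd with mod_proxy (a guess; the real configs live in Fedora's ansible, and the hostname below is just taken from the headers seen later in this ticket), the relevant knobs would look roughly like:

```apache
# On the backend hub (e.g. koji01/koji02): keep idle connections a bit longer
KeepAliveTimeout 16

# On the proxy: close idle client-facing connections first
KeepAliveTimeout 15
# ...and cap how long mod_proxy reuses a pooled backend connection (ttl, in seconds)
ProxyPass "/kojihub" "https://koji01.rdu3.fedoraproject.org/kojihub" keepalive=On ttl=14
```

The idea is that each hop should give up on a pooled connection before the layer behind it does, so the proxy never tries to reuse a backend connection the hub has already timed out.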
Fwiw, koji clients can probably work around this by setting anon_retry=true in their koji config, but I think the real fix should be in the proxy setup.
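For reference, the client-side workaround would be a config entry like the following (section name and file locations are the usual koji defaults; adjust to taste):

```ini
; ~/.koji/config or a file under /etc/koji.conf.d/
; Workaround only -- retries failed anonymous calls; the real fix is the proxy setup.
[koji]
anon_retry = true
```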
I set the hub to 16s and the proxies to 15s... but something is still not working right. I was still able to get a 502.
Well, worth a shot at least. I still haven't replicated it on my end
I did get it to happen once here, but takes a while
ok, we had a mass update/reboot cycle and since then I cannot get it to happen here.
I have been watching a webkitgtk build for 2+ hours now and no problems.
Can you all see if you can duplicate it now?
I can still get it to happen if I let something poll the hub for several hours. Most recently:
2025-11-24 01:51:42,545 [ERROR] koji: HTTPError: 502 Server Error: Bad Gateway for url: https://koji.fedoraproject.org/kojihub
Saw it multiple times this morning, including within seconds of doing a fedpkg build.
[adamw@toolbx fedora-toolbox-43 os-autoinst (f42 %)]$ fedpkg build
Building os-autoinst-5^20250707gitd55ec72-6.fc42 for f42-candidate
Created task: 139261749
Task info: https://koji.fedoraproject.org/koji/taskinfo?taskID=139261749
Watching tasks (this may be safely interrupted)...
139261749 build (f42-candidate, /rpms/os-autoinst.git:125dca21d6bed9f2bc17b194b99b3f4e23f7771a): free
139261749 build (f42-candidate, /rpms/os-autoinst.git:125dca21d6bed9f2bc17b194b99b3f4e23f7771a): free -> open (buildvm-a64-25.rdu3.fedoraproject.org)
Could not execute build: 502 Server Error: Bad Gateway for url: https://koji.fedoraproject.org/kojihub
[adamw@toolbx fedora-toolbox-43 os-autoinst (f42 %)]$ koji watch-task 139261749
Watching tasks (this may be safely interrupted)...
139261749 build (f42-candidate, /rpms/os-autoinst.git:125dca21d6bed9f2bc17b194b99b3f4e23f7771a): open (buildvm-a64-25.rdu3.fedoraproject.org)
139261760 buildSRPMFromSCM (/rpms/os-autoinst.git:125dca21d6bed9f2bc17b194b99b3f4e23f7771a): open (buildvm-a64-05.rdu3.fedoraproject.org)
2025-11-24 11:59:46,060 [ERROR] koji: HTTPError: 502 Server Error: Bad Gateway for url: https://koji.fedoraproject.org/kojihub
[adamw@toolbx fedora-toolbox-43 os-autoinst (f42 %)]$ koji watch-task 139261749
Watching tasks (this may be safely interrupted)...
139261749 build (f42-candidate, /rpms/os-autoinst.git:125dca21d6bed9f2bc17b194b99b3f4e23f7771a): closed
139261981 tagBuild (noarch): closed
139261803 buildArch (os-autoinst-5^20250707gitd55ec72-6.fc42.src.rpm, ppc64le): closed
139261804 buildArch (os-autoinst-5^20250707gitd55ec72-6.fc42.src.rpm, s390x): closed
139261801 buildArch (os-autoinst-5^20250707gitd55ec72-6.fc42.src.rpm, x86_64): closed
139261802 buildArch (os-autoinst-5^20250707gitd55ec72-6.fc42.src.rpm, aarch64): closed
139261760 buildSRPMFromSCM (/rpms/os-autoinst.git:125dca21d6bed9f2bc17b194b99b3f4e23f7771a): closed
2025-11-24 12:12:30,826 [ERROR] koji: HTTPError: 502 Server Error: Bad Gateway for url: https://koji.fedoraproject.org/kojihub
@adamwill do you have a custom poll_interval setting for koji?
I don't believe so. If I did, where would it be set? The string poll doesn't appear in my /etc/koji.conf.
> I don't believe so. If I did, where would it be set?
It's not something most people adjust, but I wanted to make sure for replication purposes. Koji pulls settings from ~/.koji/config, /etc/koji.conf, and /etc/koji.conf.d/*.
I've been playing with this and while I can't make it replicate readily, I have noticed that the backend connection changes during the course of keepalive (at least according to the X-Fedora-Appserver header).
header: X-Fedora-Appserver: koji02.rdu3.fedoraproject.org
header: X-Fedora-Proxyserver: proxy01.rdu3.fedoraproject.org
header: Keep-Alive: timeout=15, max=450
...
header: X-Fedora-Appserver: koji01.rdu3.fedoraproject.org
header: X-Fedora-Proxyserver: proxy01.rdu3.fedoraproject.org
header: Keep-Alive: timeout=15, max=449
This implies the two layers of keepalive are untethered. Perhaps the proxy is just grabbing a connection from a pool each time (unless it is not using keepalive?). So my suggestion to make the inner keepalive timeout longer doesn't solve the issue (though perhaps a longer timeout might reduce the problem).
If I spam the hub with hello calls (a no-op) as fast as I can, it takes about 10-15 min to hit it. It seems to happen with both proxy01 and proxy10. If I disable keepalive, it takes longer to hit it, but this is probably just because keepalive allows for a faster rate of calls.
Unfortunately, this doesn't really give any great insight into what is happening.
Here is my best guess -- the proxy is using a pool of keepalive connections to the backends and not quite handling the inherent race condition in the http keepalive spec. The race happens when the keepalive timeout fires just barely after the client (in this case the proxy) has decided to reuse the connection. The reused connection fails, and the proxy passes the error on as a 502.
You can see a different example of this race here -- https://github.com/mikem23/keepalive-race (this code is specific to python-requests, but the general race could happen with different tools).
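As a rough stdlib-only sketch of the race (not the linked repo's code, which uses python-requests; here raw sockets and an arbitrary toy timeout make the timing deterministic): a "backend" closes an idle keep-alive connection, and a "client" that reuses the connection just after the timeout finds it dead, even though the send itself appears to succeed.

```python
import socket
import threading
import time

IDLE_TIMEOUT = 0.5  # server-side keep-alive idle timeout (arbitrary for the demo)
RESPONSE = (b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n"
            b"Connection: keep-alive\r\n\r\nok")

def backend(listener):
    """Accept one connection, answer requests, close it once it idles too long."""
    conn, _ = listener.accept()
    conn.settimeout(IDLE_TIMEOUT)
    try:
        while conn.recv(1024):
            conn.sendall(RESPONSE)
    except socket.timeout:
        pass  # idle past the keep-alive window: server hangs up
    conn.close()

listener = socket.create_server(("127.0.0.1", 0))
threading.Thread(target=backend, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"GET / HTTP/1.1\r\nHost: demo\r\n\r\n")
first = client.recv(1024)                   # first request works fine

time.sleep(IDLE_TIMEOUT + 0.3)              # idle just past the server's timeout
try:
    # the send() itself may "succeed" -- the data sits in the kernel buffer
    client.sendall(b"GET / HTTP/1.1\r\nHost: demo\r\n\r\n")
    second = client.recv(1024)              # ...but the connection is already dead
    reused_ok = bool(second)                # empty read = peer closed
except (ConnectionResetError, BrokenPipeError):
    reused_ok = False

print(first.startswith(b"HTTP/1.1 200"))    # True
print(reused_ok)                            # False: reuse after the timeout fails
```

In the real race the two events are nearly simultaneous rather than forced apart with a sleep, which is why it only bites occasionally; a proxy that hits this on a pooled backend connection and simply propagates the failure would surface it to the end client as a 502.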