A number of users have been reporting sporadic 502/503 errors from src.fedoraproject.org.
I have found a few things on pkgs01 (which is the backend for src.fedoraproject.org):
There are core dumps from time to time. I think we looked at this in the past and found it to be in a shutdown/finished Python code path, but it might be worth another look. I enabled coredumps, so on pkgs01 a 'coredumpctl' should show them. Perhaps @abompard could have a look?
There are some errors in ssl_error_log:
[Mon Oct 28 20:17:54.658285 2024] [wsgi:error] [pid 3096441:tid 139660315584256] [client 10.3.163.74:54508] Truncated or oversized response headers received from daemon process 'pagureproc': /var/www/pagure.wsgi
[Mon Oct 28 20:18:16.181748 2024] [wsgi:error] [pid 3096518:tid 139659686426368] [client 10.3.163.75:39066] Timeout when reading response headers from daemon process 'pagureproc': /var/www/pagure.wsgi
[Mon Oct 28 20:18:16.996922 2024] [wsgi:error] [pid 3094076:tid 139659988432640] [client 10.3.163.74:39334] Truncated or oversized response headers received from daemon process 'pagureproc': /var/www/pagure.wsgi
[Mon Oct 28 20:18:16.996943 2024] [wsgi:error] [pid 3096441:tid 139659745142528] [client 10.3.163.75:52114] Truncated or oversized response headers received from daemon process 'pagureproc': /var/www/pagure.wsgi
Maybe we need to increase the processes or threads of the WSGI app? That could explain the errors: if it were saturated with requests, it might send back a 502/503...
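For reference, a minimal sketch of what I'd look at for the mod_wsgi sizing; the conf path and the example values below are assumptions, not the actual pkgs01 config:

# see how the daemon process is currently sized (conf location is an assumption)
grep -ri "WSGIDaemonProcess" /etc/httpd/conf.d/
# the directive takes processes/threads options, e.g. something like:
#   WSGIDaemonProcess pagureproc processes=6 threads=6 maximum-requests=2000
# raising processes/threads increases how many requests the app can serve
# concurrently before mod_wsgi starts timing out or truncating responses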
Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation, high-gain
I noticed that this is happening on toddlers as well.
Example here: https://pagure.io/releng/fedora-scm-requests/issue/69450
Yeah, there's been an issue raised in FMN that I tracked back to Pagure's API returning 500. Some of those requests are API calls that take more than 60s to resolve, at which point Apache's timeout triggers and the client gets a 500. I don't know if raising this timeout would be a good idea. The server's load is around 9 though, which could explain it.
I'm trying to dig into this and found out that there is a large number of requests from the backup01 machine and the OpenShift apps. I'm trying to figure out why there are so many requests from backup01.
Found out that there is a grokmirror job running every 30 minutes on backup01 and it's sending a large number of requests to dist-git: it is doing a pull of all the repositories on dist-git. I'm not sure if that is the reason for the 500s, but it's definitely sending a large number of requests to dist-git (300,000 in 10 hours).
From the OpenShift apps we have around 1 million requests in 10 hours. I will look into whether there is something I can switch off to lower the load.
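For anyone curious, this is roughly how I'm counting requests per client; the log path, log format, and the placeholder IP are assumptions:

# top client addresses in the dist-git access log over whatever window the log covers
awk '{print $1}' /var/log/httpd/access_log | sort | uniq -c | sort -rn | head -20
# narrow down to a single client (e.g. backup01) to see which paths it is hitting;
# <backup01-ip> is a placeholder, $7 is the request path in the combined log format
grep '<backup01-ip>' /var/log/httpd/access_log | awk '{print $7}' | sort | uniq -c | sort -rn | head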
I tried to increase the number of processes for httpd to 10. I will see if that helps. I wanted to look at the coredumps, but coredumpctl said "No coredumps found."
That didn't help; there are still plenty of segmentation faults in the httpd error log.
I tried to enable the coredumps by following https://access.redhat.com/solutions/2139111, but no luck. I tried changing the coredump directory to /tmp in the httpd config, but still no coredump. I don't see any denials in /var/log/audit/audit.log either, so I'm not sure why there is no coredump.
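For the record, these are roughly the knobs I was checking; this assumes httpd runs under systemd with systemd-coredump handling cores, which may or may not match the setup here:

# where does the kernel send cores? a pipe to systemd-coredump means coredumpctl should see them
cat /proc/sys/kernel/core_pattern
# the core size limit of the httpd service itself, not just of the shell
systemctl show httpd -p LimitCORE
# raising it would mean adding LimitCORE=infinity under [Service] via "systemctl edit httpd",
# restarting httpd, then checking whether anything shows up:
coredumpctl list /usr/sbin/httpd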
I can now see the coredumps, thanks to @kevin. All I needed was to set ulimit -n 8192, as one also needs to change the ulimit in their own shell.
Today @humaton pointed me to pagure_worker failing with an exception every few minutes. I think this is causing most of the issues.
I dug into the error more and it seems that the request sent to the Celery worker is update_git with a non-existent issue or request ID, which ends up raising exceptions. This wouldn't be an issue on its own, but it's happening every few minutes. I will try to track down exactly which request this is.
This is also impacting fedora-scm-requests from being able to be processed correctly.
If someone else runs into this, just add the following comment to your SCM request:
@releng-bot retry
@jsteffan Only retry should be enough; the bot reacts to any comment for now. We do want to add proper commands in the future, but right now every comment triggers it.
But to update with some progress: I was trying to monitor pagure_worker and its failures don't correspond to the errors in httpd. So my next step is to install all the debuginfo packages and get more info from the coredumps.
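Roughly the plan (the package names are a guess at what the backtraces will need):

# pull in debug symbols for the binaries showing up in the cores
sudo dnf debuginfo-install httpd mod_wsgi python3
# open the newest httpd core under gdb and dump the backtraces
coredumpctl gdb /usr/sbin/httpd
# then inside gdb: thread apply all bt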
Finally got to a state where I can read the coredumps, and @kevin was right: all the SIGSEGVs are caused by accessing a thread that no longer exists. So I'm still not sure what the root issue is, as the error is not in the coredump itself, but I will continue investigating.
Could folks still seeing these add in here:
- the time it happened
- what exactly the error message was
- what exactly it was trying to fetch
This might help us isolate where the problem is...
I see error 500 from pagure.io all day. Consistently. I just tried from an incognito window and it works here.
Deleting all pagure.io cookies helped.
Hum, some auth fallout?
Oh crap, I didn't realize Pagure would react that way to the auth method change. I'll look at the logs to see if there's a useful traceback to work around it.
This is for src.fedoraproject.org, it's not happening on pagure.io, so the error is something completely different.
I just hit an error 500 on https://src.fedoraproject.org/fork/music/rpms/python-setuptools-gettext/diff/rawhide..pyproject
$ date -u
Fri Nov 15 01:05:24 PM UTC 2024
In fact, any effort to open a PR on that project has the same result.
I'm seeing the 500 thing too, but it is clearly not the same problem as this one. Let's track it in a separate issue; I'll file one with a backtrace.
Filed https://pagure.io/fedora-infrastructure/issue/12291 for the 500 thing.
Today we investigated the error with @abompard and it seems that the 502/503 errors are not happening anymore. The coredumps are harmless; they happen because the process is restarted after serving too many requests.
I'm closing this issue as Fixed, as the problem seems to have just vanished.
Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)