That is the second time today I have hit "Application is not available :(" opening e.g. this URL.
CC @mizdebsk
Trying once more, I finally got the result, but the behavior does not look healthy to me.
Metadata Update from @phsmoura: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: low-gain, low-trouble, ops
Metadata Update from @mizdebsk: - Issue assigned to mizdebsk
"Application is not available" is caused by no Pod being available to serve the request. There is an excessive number of frontend Pod restarts lately. The restarts are caused by liveness probe failing: context deadline exceeded (Client.Timeout exceeded while awaiting headers) The liveness probe has timeout of 1 second, which I suspect may be not enough during spikes of load. For now I will increase the timeout. Commit: https://pagure.io/fedora-infra/ansible/c/a71bd24
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
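For reference, the timeout in question is the `timeoutSeconds` field of the container's liveness probe. A minimal sketch of the relevant stanza, assuming an HTTP probe (the path, port, and the raised value shown here are assumptions; the actual change is in the commit above):

```yaml
# Sketch of a liveness probe stanza in a Pod spec. Path, port, and the
# raised timeout are assumptions -- the real values are in commit a71bd24.
livenessProbe:
  httpGet:
    path: /            # assumed health endpoint
    port: 8080         # assumed container port
  periodSeconds: 10    # probe every 10 seconds
  timeoutSeconds: 5    # raised from 1 second (assumed new value)
  failureThreshold: 3  # restart the Pod after 3 consecutive failures
```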
The timeout change is live. The issue happens randomly (presumably during higher load), so I don't have an easy way to check if the change helped or not. I hope/assume it did help and I'm closing the ticket as resolved. Please let us know if "Application is not available" starts repeating again.
Metadata Update from @mizdebsk: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Metadata Update from @mizdebsk: - Issue status updated to: Open (was: Closed)
I've reopened the issue as the problem still occurs. I will investigate it further.
As the issue persists, I've temporarily scaled up the frontend Deployment (a52be17) and further increased the frontend liveness and readiness probe timeouts (06f821b).
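Both mitigations amount to a few fields on the Deployment. A hypothetical sketch, expressed as a strategic merge patch (replica count, container name, and timeout values are assumptions; the real changes are in the two commits above):

```yaml
# Hypothetical strategic merge patch corresponding to a52be17 and 06f821b;
# could be applied with "oc patch deployment frontend --patch-file ...".
# All names and numbers here are assumed, not the actual production values.
spec:
  replicas: 3                 # temporarily scaled up from 1 (assumed counts)
  template:
    spec:
      containers:
      - name: frontend        # assumed container name (merge key)
        livenessProbe:
          timeoutSeconds: 30  # assumed increased value
        readinessProbe:
          timeoutSeconds: 30  # assumed increased value
```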
Investigation shows that there is increased load from robots that visit the "affected by" links in succession. The query behind the "affected by" feature is quite slow. It used to require login, which prevented abuse like this, but that requirement was removed in 436621d.
Moreover, Koschei's robots.txt file is not what I expected it to be.
Upstream robots.txt: https://github.com/fedora-infra/koschei/blob/master/static/robots.txt
The actual robots.txt: https://koschei.fedoraproject.org/robots.txt
Options to try next:
- Put a custom robots.txt file in place and forbid robots from following "affected-by" links (a sketch of such a file follows below).
- Revert upstream commit 436621d and require login for using the "affected-by" feature.
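A minimal sketch of what the first option could look like; the exact URL pattern of the "affected-by" pages is an assumption:

```
# Hypothetical robots.txt; the actual path pattern of the
# "affected-by" pages is an assumption.
User-agent: *
Disallow: /affected-by
```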
As a side note, a lot of the "AI" scraper bots pretty much ignore robots.txt. ;( We have blocked the ones that use an agent we can identify, but there are still more of them that use random IPs/agent strings. ;(
I asked for an FBR to add robots.txt. If that does not help, I will consider requiring user authentication for using the "affected-by" feature.
Metadata Update from @mizdebsk: - Issue tagged with: unfreeze
The new robots.txt file has been deployed in production. It can be seen at https://koschei.fedoraproject.org/robots.txt. The upstream robots.txt is at https://koschei.fedoraproject.org/static/robots.txt (unused).

Recently there have been far fewer frontend Pod restarts, so I hope that the issue is resolved for now. In case this starts happening again, I will probably implement the aforementioned solution of requiring user auth for some of the slower pages.

I'm closing the ticket, but feel free to reopen if needed.
Metadata Update from @mizdebsk: - Issue untagged with: unfreeze - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)