#12180 Koschei "is not available"
Closed: Fixed 4 months ago by mizdebsk. Opened 4 months ago by vondruch.

That is the second time today I have hit "Application is not available :(" when opening e.g. this URL

CC @mizdebsk


Trying once more, I finally got the result, but the behavior does not look healthy to me

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: low-gain, low-trouble, ops

4 months ago

Metadata Update from @mizdebsk:
- Issue assigned to mizdebsk

4 months ago

"Application is not available" is caused by no Pod being available to serve the request.
There is an excessive number of frontend Pod restarts lately.
The restarts are caused by the liveness probe failing: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The liveness probe has a timeout of 1 second, which I suspect may not be enough during load spikes.
For now I will increase the timeout.
Commit: https://pagure.io/fedora-infra/ansible/c/a71bd24
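
To illustrate why a short probe timeout can trigger restarts under load, here is a minimal Python sketch. It is not the actual Kubernetes probe or the Koschei frontend: `run_probe` and `make_handler` are hypothetical names, and the latencies are made-up values chosen only to show the effect of the deadline.

```python
import concurrent.futures
import time


def run_probe(check, timeout_seconds):
    """Return True if check() completes within timeout_seconds, else False.

    This mimics the pass/fail decision of a liveness probe: a response
    that arrives after the deadline counts as a failure.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        future.result(timeout=timeout_seconds)
        return True
    except concurrent.futures.TimeoutError:
        return False  # deadline exceeded while waiting for the response
    except Exception:
        return False  # the check itself raised an error
    finally:
        # Do not block on a check that is still running.
        executor.shutdown(wait=False)


def make_handler(latency_seconds):
    """Hypothetical stand-in for a frontend request that takes latency_seconds."""
    def handler():
        time.sleep(latency_seconds)
    return handler


if __name__ == "__main__":
    loaded_frontend = make_handler(0.3)  # assumed latency during a load spike
    print(run_probe(loaded_frontend, timeout_seconds=0.1))
    print(run_probe(loaded_frontend, timeout_seconds=1.0))
```

The same handler fails the probe with the tight deadline and passes with the looser one, which is the reasoning behind raising the timeout. A longer timeout only hides slow responses; it does not remove the load that causes them.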

The timeout change is live.
The issue happens randomly (presumably during higher load), so I don't have an easy way to check if the change helped or not.
I hope/assume it did help and I'm closing the ticket as resolved.
Please let us know if "Application is not available" starts repeating again.

Metadata Update from @mizdebsk:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 months ago

Metadata Update from @mizdebsk:
- Issue status updated to: Open (was: Closed)

4 months ago

I've reopened the issue as the problem still occurs.
I will investigate it further.

As the issue persists, I've temporarily scaled up the frontend Deployment (a52be17) and further increased the frontend liveness and readiness probe timeouts (06f821b).

Investigation shows that there is an increased load from robots that visit the "affected by" links in succession. The query behind the "affected by" feature is quite slow. It used to require login to prevent abuse like that, but this requirement was removed: 436621d

Moreover, Koschei robots.txt file is not what I expected it to be.
Upstream robots.txt: https://github.com/fedora-infra/koschei/blob/master/static/robots.txt
The actual robots.txt: https://koschei.fedoraproject.org/robots.txt

Options to try next:
- deploy a custom robots.txt file that forbids robots from following "affected-by" links.
- revert upstream commit 436621d and require login for using the "affected-by" feature.
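
The first option can be checked locally before deployment. The sketch below uses Python's standard `urllib.robotparser`. The rules, the `/affected-by` path prefix, and the example URLs are assumptions for illustration, not the contents of the deployed file.

```python
import urllib.robotparser

# Hypothetical rules; the real path prefix of the "affected-by" pages
# is an assumption here.
ROBOTS_TXT = """\
User-agent: *
Disallow: /affected-by
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "https://koschei.fedoraproject.org"
    # Both paths below are illustrative, not confirmed routes.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", base + "/affected-by/example"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", base + "/package/example"))
```

This only verifies what a crawler that honors robots.txt would do; it has no effect on clients that ignore the file.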

As a side note, a lot of the "AI" scraper bots pretty much ignore robots.txt. ;( We have blocked the ones that use an agent string we can identify, but there are still more of them that use random IPs/agent strings. ;(

I asked for an FBR (freeze break request) to add robots.txt.
If that does not help, I will consider requiring user authentication for using the "affected-by" feature.

Metadata Update from @mizdebsk:
- Issue tagged with: unfreeze

4 months ago

The new robots.txt file has been deployed in production.
It can be seen at https://koschei.fedoraproject.org/robots.txt
Upstream robots.txt is at https://koschei.fedoraproject.org/static/robots.txt (unused)
Recently there have been far fewer frontend Pod restarts, so I hope that the issue is resolved for now.
In case this starts to happen again, I will probably implement the aforementioned solution of requiring user authentication for some slower pages.
I'm closing the ticket, but feel free to reopen if needed.

Metadata Update from @mizdebsk:
- Issue untagged with: unfreeze
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 months ago

