#9446 openQA prod sometimes fails to respond or 503s, apparently a proxy/load balancer issue
Closed: Insufficient data 3 years ago by smooge. Opened 3 years ago by adamwill.

Sometimes when trying to visit https://openqa.fedoraproject.org/ from outside infra, you get a 503 or it just spins and fails to respond. This seems to be a proxy or load balancer issue, because if I have an ssh connection open to the server (openqa01.iad2.fedoraproject.org) during these times, the connection keeps working and 'curl http://localhost' on the server works just fine.


Interestingly, I don't think I've once seen this with the lab/stg instance.

Looking at the proxies, there are no records of a 503 in the logs. There are a few 502s:

45.41.142.211 - - [02/Nov/2020:21:22:02 +0000] "GET /tests/713795 HTTP/1.1" 502 416 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0"
45.41.142.209 - - [02/Nov/2020:21:22:02 +0000] "GET /tests/713802 HTTP/1.1" 502 416 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0"
45.41.142.102 - - [04/Nov/2020:21:15:53 +0000] "GET /tests/715464 HTTP/1.1" 502 416 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0"
45.41.142.104 - - [04/Nov/2020:21:15:54 +0000] "GET /tests/715539 HTTP/1.1" 502 416 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0"

Everything else is a 200, 404, 206, 302 or 304. Nothing in the error logs either. Can you get me a rough timestamp, IP address and URL so I can see if I can find anything else?
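For anyone repeating this check, here is a minimal sketch of the kind of filter that pulls 5xx entries out of an Apache access log. It assumes the lines are in the combined log format shown above and start with the client IP (no grep filename prefix); the function names `server_errors` and `status_counts` are made up for illustration and are not part of any infra tooling.

```python
import re
from collections import Counter

# Matches the start of an Apache "combined" log line, like the ones quoted above:
# client IP, two ignored fields, [timestamp], "METHOD path protocol", status, size.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def server_errors(lines):
    """Yield (timestamp, ip, status, path) for every 5xx entry in `lines`."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("status").startswith("5"):
            yield m.group("ts"), m.group("ip"), int(m.group("status")), m.group("path")


def status_counts(lines):
    """Count how often each status code appears; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            counts[int(m.group("status"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '45.41.142.211 - - [02/Nov/2020:21:22:02 +0000] "GET /tests/713795 HTTP/1.1" '
        '502 416 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) '
        'Gecko/20100101 Firefox/76.0"',
    ]
    for hit in server_errors(sample):
        print(hit)
```

Running the same filter over the proxy logs and the openqa server logs for the same time window would show whether a 5xx seen by the client was logged at either hop.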

There is no load balancer in this case. proxy01.iad2 and proxy10.iad2 are just using apache proxy to the openqa.iad2 proxy. I have looked on all the boxes and can't find any 503 in the appropriate logs.

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

3 years ago

of course, sod's law dictates it hasn't happened since you asked the questions.

It could indeed be a 502 you get if you wait long enough, not a 503. It's definitely a 50something. :)
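Since it is hard to catch by hand, a small probe left running from outside infra could record the timestamp and status code the next time it happens. This is only a sketch under assumptions: the names `probe`, `is_failure` and `watch`, the 30 second timeout and the 60 second interval are arbitrary choices for illustration, and it does not record the client's public IP, which would still have to be looked up separately.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def is_failure(status):
    """True for a 5xx status, or for no response at all (status is None)."""
    return status is None or 500 <= status <= 599


def probe(url, timeout=30.0, opener=urllib.request.urlopen):
    """Fetch `url` once and return (UTC timestamp, status or None, detail).

    `opener` is injectable so the function can be exercised without network access.
    """
    # UTC, because the proxy logs quoted in this ticket use +0000 timestamps.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    try:
        with opener(url, timeout=timeout) as resp:
            return stamp, resp.status, "ok"
    except urllib.error.HTTPError as exc:
        # The server (or a proxy in front of it) answered with an error status.
        return stamp, exc.code, str(exc.reason)
    except OSError as exc:
        # Covers URLError, timeouts and connection failures: no HTTP status at all.
        return stamp, None, str(exc)


def watch(url, interval=60.0, rounds=None):
    """Probe `url` every `interval` seconds and print only the failures.

    Runs forever when `rounds` is None, otherwise stops after `rounds` probes.
    """
    done = 0
    while rounds is None or done < rounds:
        stamp, status, detail = probe(url)
        if is_failure(status):
            print(stamp, status, detail, flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling `watch("https://openqa.fedoraproject.org/")` and leaving it running would print one line per failed request, which gives a timestamp to match against the proxy logs.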

So looking on the openqa server, 503s do show up in the logs there, and more often:

access_log-20201111.xz:10.3.174.63 - - [10/Nov/2020:22:03:59 +0000] "POST /api/v1/workers?cpu_arch=aarch64&cpu_flags=fp+asimd+evtstrm+aes+pmull+sha1+sha2+crc32+cpuid&cpu_modelname=X-Gene&cpu_opmode=32-bit%2C+64-bit&host=openqa-a64-worker03.iad2.fedoraproject.org&instance=14&isotovideo_interface_version=20&mem_max=127761&websocket_api_version=1&worker_class=qemu_aarch64 HTTP/1.1" 503 380 "-" "Mojolicious (Perl)"

but all of them are POSTs from internal hosts. So I'm not sure where this is happening :disappointed:

Those are worker check-ins, I think. Not sure if it's expected/known that they get 503s. But yeah, not likely the source of this.

Well the best way to make it happen again is to close it as "Invalid" or something similar.

Metadata Update from @smooge:
- Issue close_status updated to: Insufficient data
- Issue status updated to: Closed (was: Open)

3 years ago

Just hit this again, about a minute ago. It was a 503.

"Service Unavailable

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

Apache Server at openqa.fedoraproject.org Port 443"

Refreshed just now and it worked.

