fas is returning 500s from time to time. This is causing all kinds of problems,
from qa's retval being unable to auth to the wiki for test results, to package signing failing, which leaves packages stuck in gating.
We need to track this down and fix it.
So, I redeployed all the fas pods and it seems to be much happier. I wonder if there's some kind of resource leak requiring restarting after X time?
I am not fully sure it's fixed yet, but I will keep watching for failures.
So, it is definitely not solved.
We have isolated it to the actual fas pods, not the proxies or links to them.
@codeblock is looking at adding some debugging so we can isolate it.
OK. Many thanks to @codeblock for tracking this down. It seems that an httpd update on the 30th changed how apache handles load balancers (or introduced a bug). Before the update, our config pointed at all 5 openshift nodes, even though 2 of them were not infra nodes and didn't run routers. After the update, something changed and only fas seemed to break part of the time, when requests were sent to those "down" balancer members. We removed the non-infra nodes from the config and it seems to be fixed.
Please report any further issues.
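For anyone hitting something similar, here is a minimal sketch of the idea behind the fix, not our actual config or tooling. It builds the apache balancer member list from only the nodes that run a router, so non-infra nodes never end up as "down" members. The host names, the runs_router flag, the balancer name "fas" and the port are all made up for illustration.

```python
# Hypothetical inventory: host names and the runs_router flag are
# assumptions for illustration, not the real openshift nodes.
NODES = [
    {"host": "node01.example.org", "runs_router": True},
    {"host": "node02.example.org", "runs_router": True},
    {"host": "node03.example.org", "runs_router": True},
    {"host": "node04.example.org", "runs_router": False},  # not an infra node
    {"host": "node05.example.org", "runs_router": False},  # not an infra node
]


def balancer_members(nodes, scheme="http", port=80):
    """Return BalancerMember lines only for nodes that run a router."""
    return [
        f"BalancerMember {scheme}://{node['host']}:{port}"
        for node in nodes
        if node["runs_router"]
    ]


def render_balancer(name, nodes):
    """Render a mod_proxy_balancer <Proxy> block for the given nodes."""
    lines = [f"<Proxy balancer://{name}>"]
    lines += ["    " + member for member in balancer_members(nodes)]
    lines.append("</Proxy>")
    return "\n".join(lines)


if __name__ == "__main__":
    # "fas" is a hypothetical balancer name.
    print(render_balancer("fas", NODES))
```

With the sample inventory above, the rendered block lists the three router nodes and leaves out the other two, which is roughly what removing the non-infra nodes by hand did.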
Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)
See #8268. It was closed as a duplicate of this ticket, but it does not seem to be fixed.