#10559 DNS lookup failure after FAS login on BlockerBugs
Closed: Fixed with Explanation 2 years ago by kevin. Opened 2 years ago by kparal.

Describe what you would like us to do:


We're seeing errors like this:

Proxy Error

The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request

Reason: DNS lookup failure for: worker03.ocp.iad2.fedoraproject.org

Reproducer:
1. open firefox in anonymous mode
2. open https://qa.fedoraproject.org/blockerbugs
3. click Log in
4. fill in credentials into FAS
5. see the failure

@mobrien says:

Ah I see, so only a couple of our proxies can reach the new ocp4 cluster, which is fine when you go to qa.fp.o as you are directed there, but I guess after ipsilon login you are redirected to a different proxy, which then can't access the cluster, and that throws up the dns errors you are seeing.

When do you need this to be done by? (YYYY/MM/DD)


Soon? :) Ideally this week.


I guess the only quick fix here is to change id to resolve only to iad2 proxies. ;(

@mobrien did you want to do that, or do you want me to? It's in the fedoraproject.org.cfg file; it needs to be removed from there and added into the template using only the 2 iad2 proxies.

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

I have updated the dns to use only the iad2 proxies for id.fedoraproject.org so this should be fixed.

We will need to get a better fix implemented but this will do for now. Please reopen if the issue comes up again.

Metadata Update from @mobrien:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

2 years ago

So unfortunately I still see the same error. I tried it 5 times (each time closing and reopening the anonymous window from scratch), and twice it worked ok, and three times I saw the dns failure (with different workers).

Metadata Update from @kparal:
- Issue status updated to: Open (was: Closed)

2 years ago

There is an infra freeze tomorrow. Can we perhaps resolve this today?

I'd love to resolve it, but I no longer understand what is happening. Changing id to only be in that DC should have fixed it.

Do you see the same 'DNS lookup failure' hosts? Or it's exactly the same as before?

Tested right now, out of 10 login attempts, I saw the failure 4 times. Each time I completely closed and reopened the anonymous firefox window. The error is as before:

Proxy Error

The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request

Reason: DNS lookup failure for: worker01.ocp.iad2.fedoraproject.org

The interesting thing is that when I see the failure and then try to access https://qa.fedoraproject.org/blockerbugs/ (in the same or a new tab, during the same window session), I immediately see the same error, and each time I refresh the page the worker number increases, cycling from worker01 to worker06 again and again. So I can't even access the original page that initiated the login. Only after I hit Ctrl+Shift+R to force-refresh the page do I see the blockerbugs web app. I'm not logged in, but when I hit Login, it immediately logs me in, without going to the FAS auth page and without showing any error. I suppose there is some caching in play here; I'm not sure whether it is relevant to the problem. But I can easily cycle through the workers this way.

It seems that I see the Proxy Error each day, when I first access that site. Refreshing the page doesn't help, just cycles the workers. Force refreshing helps and shows the content. I'm logged in all the time.

In other words, it seems this affects not only people who try to log in, but also people who are logged in and haven't accessed the page in some time (a day or so).

We'd really need to fix this somehow. In the long run, I'll be trying to switch blockerbugs from openid to openid connect, but that will take time and I'm not even sure if it is related to this problem.

I can't duplicate it here at all. ;(

Can you attach:

host id.fedoraproject.org
host qa.fedoraproject.org

$ host id.fedoraproject.org
id.fedoraproject.org has address 38.145.60.20
id.fedoraproject.org has address 38.145.60.21

$ host qa.fedoraproject.org
qa.fedoraproject.org is an alias for ocp.fedoraproject.org.
ocp.fedoraproject.org has address 38.145.60.20
ocp.fedoraproject.org has address 38.145.60.21
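
For anyone comparing their own results, here is a minimal sketch that parses `host` output and checks whether two names resolve to the same address set. The sample text is copied from the output above; the function name `addresses` is only illustrative, and the regex assumes the plain IPv4 "has address" line format shown here.

```python
import re

# Sample output copied from the two `host` runs above.
ID_OUTPUT = """\
id.fedoraproject.org has address 38.145.60.20
id.fedoraproject.org has address 38.145.60.21
"""
QA_OUTPUT = """\
qa.fedoraproject.org is an alias for ocp.fedoraproject.org.
ocp.fedoraproject.org has address 38.145.60.20
ocp.fedoraproject.org has address 38.145.60.21
"""


def addresses(host_output: str) -> set[str]:
    """Collect IPv4 addresses from lines of the form '<name> has address <ip>'.

    Alias lines are ignored because they do not contain 'has address'.
    """
    return set(re.findall(r"has address (\d+\.\d+\.\d+\.\d+)", host_output))


id_addrs = addresses(ID_OUTPUT)
qa_addrs = addresses(QA_OUTPUT)
print(sorted(id_addrs))
print("same address set:", id_addrs == qa_addrs)
```

With the output above, both names resolve to the same two addresses, so from this client DNS itself looks consistent.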

In Firefox, I tried to enable "DNS over HTTPS" using CloudFlare, but that didn't have any impact, I can still see the problem.

Wild idea: could IPv4 vs IPv6 be involved here?

@pingou it shouldn't be, no; iad2 is IPv4 only.

I'm really quite puzzled by why this is still happening (and I can't seem to duplicate it here now).

@kparal can you get a HAR of the problem? that would be: open inspector, go to network, duplicate the problem, click the little gear thing and 'save all as HAR'. I am not sure if there's anything sensitive in there, so if you could upload it somewhere private (like your batcave01 homedir? or fedorapeople homedir) I can grab it from there.

@kevin I placed the HAR file in fedorapeople.org:/home/fedora/kparal/tmp/. Last week my colleague complained about the same problem. I wonder if this could be geographically-related? I can post a public call in the test list and we can try to figure out if this is happening everywhere or just e.g. in Europe, if that helps.

I sometimes get the error too (from Italy) the first time I open the blockerbugs homepage.
Some differences are:
- I do not use Firefox in anonymous mode
- refreshing the page with just F5 usually works after one or two attempts
- once it starts to work, I don't see the error while browsing other pages
- I cannot reproduce the error in a consistent way - for example, at present I cannot trigger the problem despite several attempts and CTRL-F5 refreshes

I must also add that I have this kind of error in BlockerBugs only; no other Fedora-related website has ever shown a similar error.

So, does anyone who can see this see it in staging?

ie, https://qa.stg.fedoraproject.org/blockerbugs ?

I do note that we have only 2 ingress controllers in prod (and 6 worker nodes) and 2 ingress controllers in stg (and 4 workers).

I would still think apache wouldn't expose a non answering worker to the proxy, it should disable it... ;(

@kevin The staging instance doesn't have this problem. I tested 20 login attempts, all worked flawlessly. Given that in production I see at least 30% failure rate (and I verified that it still happens there, failed on my first attempt), it's very unlikely that stg is affected and I was just lucky. So I believe stg is not affected at all.
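
The "just lucky" estimate above can be checked with a quick calculation. This is a rough sketch that assumes each login attempt fails independently with the same probability, which may not hold exactly; the 30% figure and the 20 attempts come from the comment above.

```python
failure_rate = 0.30  # lower bound reported for production above
attempts = 20        # clean login attempts made on staging

# Probability that every attempt succeeds, assuming independent attempts
# that each fail with probability failure_rate.
p_all_ok = (1 - failure_rate) ** attempts
print(f"{p_all_ok:.4%}")  # 0.0798%
```

Under that assumption, 20 clean logins in a row would happen less than once in a thousand tries if staging had the same failure rate as production.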

@mattia I use the anonymous mode just to reproduce the issue intentionally. I found out that regular logout&login process doesn't trigger it, only the first login in a completely clean environment (the anonymous mode) might trigger it, or if you wait some time and something expires, then it might happen once (that's why I sometimes see it in my regular browser session when I first access blockerbugs on a new day in the morning, or perhaps after a weekend).

ok. Good to know.

Can you test again in prod and see if it's still happening? I tweaked routers on the prod openshift cluster.

Yes, it's still broken in prod :-/

After several days without problem, I've just hit the error again:
Reason: DNS lookup failure for: worker01.ocp.iad2.fedoraproject.org

Note that I get the error before trying to login (at the first attempt of opening the homepage).

I believe I just saw this problem when opening https://accounts.fedoraproject.org . I refreshed the page too quickly (muscle memory), but it seems this problem no longer affects just Blockerbugs, but also Accounts.

Accounts has just recently migrated to OCP4 as well, so this confirms (as we suspected) that it's an issue related to that and not to the blockerbugs app itself.

@kparal would it be possible to try this with a web console open and tell us the x-fedora-proxyserver header value when the error happens? This may give us a clue to which proxy is causing the issue
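
Outside the browser, the same header can be read with a few lines of Python. This is only a sketch: the function name `serving_proxy` is made up for illustration, and it assumes the header is also present on error responses. Note that `urllib` opens a fresh HTTP/1.1 connection for every call, so it shows which proxy a new connection lands on rather than what a browser does with an already open connection.

```python
import urllib.error
import urllib.request

HEADER = "x-fedora-proxyserver"  # header name taken from the request above


def serving_proxy(url: str, timeout: float = 10.0):
    """Return (status, proxy) for one GET; proxy is None if the header is absent."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.headers.get(HEADER)
    except urllib.error.HTTPError as err:
        # Error responses raise HTTPError but still carry response headers.
        return err.code, err.headers.get(HEADER)
```

Calling `serving_proxy("https://qa.fedoraproject.org/blockerbugs/")` a few times would then list the proxies that answer for that name.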

I just realized this might be our many year old http/2 reuse bug in firefox. (ie, https://bugzilla.mozilla.org/show_bug.cgi?id=1420777 )
:)

Is everyone who is seeing this seeing it with firefox? Has anyone seen it with another browser?

If this is the case we can fix it. Just have other proxies return a 421.
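
For context, 421 is the HTTP "Misdirected Request" status: it tells the client that this connection cannot serve the requested host, so the client may retry on a new connection. The decision can be sketched roughly as below. This is not the actual proxy configuration; the host list is illustrative, `status_for` is a hypothetical helper, and proxy01/proxy10 are the two proxies named later in this ticket.

```python
from http import HTTPStatus

# Illustrative data only, not the real configuration.
IAD2_ONLY_HOSTS = {"qa.fedoraproject.org", "accounts.fedoraproject.org"}
PROXIES_REACHING_CLUSTER = {"proxy01", "proxy10"}


def status_for(proxy: str, host: str) -> HTTPStatus:
    """Decide what `proxy` should answer for a request whose Host is `host`."""
    if host in IAD2_ONLY_HOSTS and proxy not in PROXIES_REACHING_CLUSTER:
        # The request arrived on a connection to a proxy that cannot reach
        # the backend: ask the client to retry on a fresh connection.
        return HTTPStatus.MISDIRECTED_REQUEST
    return HTTPStatus.OK


print(int(status_for("proxy02", "accounts.fedoraproject.org")))  # 421
print(int(status_for("proxy01", "accounts.fedoraproject.org")))  # 200
```

The point of the design is that the wrong proxy stops trying to reach the cluster at all and instead hands the decision back to the browser.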

Sigh, right now I can't reproduce it with either Accounts or Blockerbugs. But I can confirm that all problems were reported when using Firefox. One colleague tried it with Chrome as well and it didn't happen there (but that's a very small sample set, doesn't prove anything).

So, I put the fix in place for blockerbugs.

I did not (yet) for noggin/accounts... so I would think it would still be possible to see there.

You would need to do something like:

go to some https://fedoraproject.org site and load a few pages with the inspector on... and see when it has hit some proxy that's NOT proxy01 or proxy10.
Then go to accounts, and it should try to reuse that connection and fail...

I went to https://calendar.fedoraproject.org/ which was served through proxy02.fp.o at that moment, then I replaced the URL with https://accounts.fedoraproject.org/ and I got Proxy Error: DNS lookup failure for: worker01.ocp.iad2.fedoraproject.org. I repeated that several times, and it seems to happen consistently in Firefox. In Chrome, I repeated the steps and saw no error.

Back in Firefox, when I follow the same steps starting from https://qa.fedoraproject.org/blockerbugs/ , I get proxy01.iad2.fp.o or proxy10.iad2.fp.o every time, and when I replace the URL with https://accounts.fedoraproject.org/ , it loads correctly.

So it seems you found the issue? :sunglasses:

I've just pushed a commit that should add the 421 to all apps that are only in iad2 and also add them when we move them to ocp4.

So, this should be fixed as soon as I finish testing this in staging and push to prod (later today).

I'm gonna go ahead and mark it now. :)

Thanks everyone!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

2 years ago

Thanks, Kevin, for figuring this out! Fingers crossed so that no more tweaks are necessary.
