#10954 Discourse 2 fedora messaging bridge is failing with 302 errors
Closed: Fixed with Explanation a year ago by kevin. Opened a year ago by mattdm.

This is happening on both Discussion and on Ask, and appears to have been going on for quite some time, unfortunately -- this is going to be a gap in the history.

It appears that there is a redirect loop. The failures look like this:

date: Sat, 22 Oct 2022 14:45:08 GMT
server: gunicorn
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-frame-options: SAMEORIGIN, DENY
x-xss-protection: 1; mode=block, 1; mode=block
x-content-type-options: nosniff, nosniff
referrer-policy: same-origin, same-origin
content-type: text/html; charset=utf-8
content-length: 307
location: https://discourse2fedmsg.fedoraproject.org/webhook
permissions-policy: interest-cohort=()
content-security-policy: default-src 'self' apps.fedoraproject.org; script-src 'strict-dynamic'
set-cookie: 71e45a530a1fabf45160cc70ca8d9671=c819b699e646e54fff21b5c8356c7477; path=/; HttpOnly; Secure; SameSite=None
apptime: D=43440
x-fedora-proxyserver: proxy04.fedoraproject.org
x-fedora-requestid: Y1QB9C_4fO3R6jCZ8asVTQAAAY4
connection: close

Body

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL: <a href="https://discourse2fedmsg.fedoraproject.org/webhook">https://discourse2fedmsg.fedoraproject.org/webhook</a>.  If not click the link.

Please fix this as soon as possible. Thank you!


I'm not sure if this has ever worked in the past...
But I think it may be related to the header policy change in Openshift that we made a while back.

To make sure that we use HTTPS, the application looks for an X-Forwarded-Proto header set to https. If the header is missing, it redirects to HTTPS. But the request is already on HTTPS, so it gets redirected again, hence the infinite loop.

Since the policy change, Openshift is not setting this header anymore and I don't think our proxies are either.
I've applied a temporary fix for now, but it'll revert back to the previous setting on the next restart.
PR is ready: https://pagure.io/fedora-infra/ansible/pull-request/1235 but setting the X-Forwarded-Proto header globally might be a better idea.
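
For anyone following along, here is a minimal sketch of the loop described above. It is not the bridge's actual code: `https_redirect_target` and `follow` are hypothetical names, and the check is only assumed to resemble what the application does with X-Forwarded-Proto.

```python
from typing import Mapping, Optional

WEBHOOK_URL = "https://discourse2fedmsg.fedoraproject.org/webhook"


def https_redirect_target(headers: Mapping[str, str], url: str) -> Optional[str]:
    """Return the URL to redirect to, or None if the request counts as HTTPS.

    Assumption: the app trusts only X-Forwarded-Proto, not the URL scheme.
    The header lookup here is case-sensitive for simplicity.
    """
    if headers.get("X-Forwarded-Proto", "").lower() == "https":
        return None
    # Header missing: the app assumes plain HTTP and redirects to HTTPS.
    return "https://" + url.split("://", 1)[-1]


def follow(url: str, proxy_headers: Mapping[str, str], max_hops: int = 5) -> int:
    """Count redirects; proxy_headers is what the proxy adds on every hop."""
    hops = 0
    while hops < max_hops:
        target = https_redirect_target(proxy_headers, url)
        if target is None:
            return hops
        url = target
        hops += 1
    return hops


# Without the header, every hop redirects to the same HTTPS URL.
print(follow(WEBHOOK_URL, {}))
# With the header set by the proxy, the request is accepted immediately.
print(follow(WEBHOOK_URL, {"X-Forwarded-Proto": "https"}))
```

The redirect target is the same URL that was requested, which matches the `location` header in the failure above.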

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation, high-gain

a year ago

This did work in the past -- I'm not sure when it stopped.

FYI: This is currently working with @darknao's temporary fix (thank you!)

This is failing again. Can someone please review the PR?

Could we also add monitoring for this? As a first pass, if there are no messages in, say, 8 hours?
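
A first-pass check along those lines could be as small as the sketch below. The function name and the way the last-message timestamp is obtained are assumptions; nothing like this exists in the bridge today as far as this ticket shows.

```python
from datetime import datetime, timedelta, timezone

# Threshold suggested above: alert if nothing has arrived in 8 hours.
MAX_GAP = timedelta(hours=8)


def bridge_is_stale(last_message_at: datetime, now: datetime,
                    max_gap: timedelta = MAX_GAP) -> bool:
    """Return True if the newest bridged message is older than max_gap."""
    return now - last_message_at > max_gap


# Hypothetical usage: last_seen would come from wherever messages are stored.
now = datetime(2022, 10, 22, 23, 0, tzinfo=timezone.utc)
last_seen = datetime(2022, 10, 22, 14, 45, tzinfo=timezone.utc)
print(bridge_is_stale(last_seen, now))
```

A fixed 8-hour window may give false alarms during quiet periods, so the threshold would likely need tuning.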

I pushed the PR for now, but we will want to fix this in a permanent way and/or add monitoring.

I'm gonna go ahead and close this.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago

FYI: this seems "mostly working", but a large fraction of requests are hitting a 20-second timeout and getting "error: Operation timed out - FinalDestination::HTTP"

Metadata Update from @mattdm:
- Issue status updated to: Open (was: Closed)

a year ago

Well, the "fix in a perm way" didn't seem worth doing to me...

I have tried changing the router timeout for the app to 5m.
Can you see if those timeouts go down/away?

What exactly is giving that timeout? The Discourse side -> Fedora OpenShift app?

Sorry @kevin — I missed this in the holiday.

The errors I'm seeing are: error: Operation timed out - FinalDestination::HTTP. That field is supposed to show the HTTP response headers, but to me this looks like an error from before any response arrived. It seems like the app is not responding to Discourse's HTTP POST.

FWIW, successful requests are generally less than 1 second. Occasionally an outlier of up to 3 seconds. I'm not seeing any, like, 15 or 17 second successes — or even 5 seconds.

So a few things:

I only see requests from this on proxy09 and proxy04. I am not sure why it's only hitting those two...

I only see 200's in our logs. Every request that hits the proxies seems to work. ;(

We were having problems with IPv6 on proxy09 for a while. I fixed that on Jan 9th/10th... perhaps right after your last comment.

Is this still happening?

If so, we could perhaps change it to use proxy01/10 (our main iad2 proxies that are close to openshift) and see if that helps any?

Failing that, can you get any more info from discourse? Perhaps via support?

Yeah, the info there sure doesn't help us much... if they can scrape out an actual error that would help (DNS resolving? timeout? something else).

It was DNS! It was taking more than 2 seconds for DNS to resolve sometimes, and that was set as a timeout somewhere. The TTL at 60s means lots of queries, too.
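
If anyone wants to check for slow lookups from their own side, a rough sketch follows. The 2-second threshold mirrors the timeout mentioned above; `timed_lookup` is a hypothetical helper, and results depend on the local resolver and cache, so this only approximates what Discourse's hosts would see.

```python
import socket
import time


def timed_lookup(hostname, threshold=2.0, resolver=socket.getaddrinfo):
    """Resolve hostname and return (elapsed_seconds, exceeded_threshold).

    The resolver is injectable so the timing logic can be exercised
    without touching the network.
    """
    start = time.perf_counter()
    resolver(hostname, 443)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed > threshold


if __name__ == "__main__":
    # Needs network access; repeat a few times to get past cached answers.
    elapsed, slow = timed_lookup("discourse2fedmsg.fedoraproject.org")
    print(f"{elapsed:.3f}s slow={slow}")
```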

From CDCK:

We’ve also seen similar sporadic timeout failures during OAuth2 auth flows (always after between 2 and 3 seconds).

Due to that we have introduced a configurable query timeout and raised it to 5 seconds for your infrastructure; this went live yesterday on Thursday 23rd.

Curiously, we have not observed webhook or OAuth2 timeouts since Friday 17th. Any chance you did something on your end which makes DNS resolution systematically faster on that day?

Ha. It's always dns. ;)

The TTL should be 5m?

I don't see anything that would have changed on our end on the 17th. ;(

The TTL should be 5m?

Unless we're planning on moving it without advance notice, more is probably better!

The fedoraproject.org domain is 5m for a reason. We use DNS to add/remove proxies from service and need to be able to take them out and put them back quickly.
:)

Anyhow, is this done then? or is there anything more to do from our side here?

I will go ahead and close this then, and if it happens again, please file a new ticket. :)

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

a year ago


Metadata
Related Pull Requests
  • #1235 Merged a year ago