fedora-infrastructure

#11025 Cannot log in COPR via FAS, I've got a timeout

Closed: Upstream a year ago by kevin. Opened a year ago by ycollet.

NOTE

If your issue is for security or deals with sensitive info please
mark it as private using the checkbox below.

Describe what you would like us to do:

I am trying to log in https://copr.fedorainfracloud.org/coprs/ycollet/audinux/
I log in via FAS.
On the FAS page, I enter my username / password and then, the page looks like it's active and then a timeout.
From time to time, after the timeout, I am logged in (but not always).

When do you need this to be done by?

This started to happen 2 weeks ago, a little while after a COPR update.

mattdm commented a year ago

Yeah, I can reproduce.

ycollet commented a year ago

Ahh, thanks a lot ... :)

kevin commented a year ago

Yep. Me too.

It seems to reach id.fedoraproject.org fine, but then it gets a 'gateway timeout'.

On the ipsilon side I see "LOGIN SUCCESSFUL" so, I think it might be something on the copr end.

CC: @praiskup @frostyx

rapier1 commented a year ago

Can reproduce. I was able to get logged in once about an hour ago but trying to build a new project failed on submission with a gateway timeout error.

praiskup commented a year ago

FWD to Copr https://github.com/fedora-copr/copr/issues/2418

praiskup commented a year ago

Reproduced. Seems like the redirect doesn't work for some reason, because I can then manually visit https://copr.fedorainfracloud.org/ - and I'm logged in.

praiskup commented a year ago

Seems like this works just fine for our staging https://copr.stg.fedoraproject.org/

Not sure what is going on. The @oid.after_login() hook is never called by Flask OpenID.

praiskup commented a year ago

Is it possible that https://id.fedoraproject.org is slower, compared to https://id.stg.fedoraproject.org ? It seems like the production variant redirects back to copr too late, when Copr's Timeout expires.

I tried to increase to Timeout 180 (from default, I bet 60) on https://copr.stg.fedoraproject.org ... and then login works (even though it takes a long time).

praiskup commented a year ago

I created some additional logging:

[Sun Dec 04 15:45:02.387495 2022] [wsgi:error] [pid 28726:tid 28854] [remote 2a00:ca8:a1f:46a:c8af:eec6:7d27:5795:52498] ERROR:coprs:https://copr.stg.fedoraproject.org/login/
[Sun Dec 04 15:45:02.387715 2022] [wsgi:error] [pid 28726:tid 28854] [remote 2a00:ca8:a1f:46a:c8af:eec6:7d27:5795:52498] ERROR:coprs:calling try_login()
[Sun Dec 04 15:47:12.502166 2022] [wsgi:error] [pid 28726:tid 28857] [remote 2a00:ca8:a1f:46a:c8af:eec6:7d27:5795:52498] ERROR:coprs:create_or_login called
[Sun Dec 04 15:47:12.502608 2022] [wsgi:error] [pid 28726:tid 28857] [remote 2a00:ca8:a1f:46a:c8af:eec6:7d27:5795:52498] INFO:coprs:Querying user 'praiskup' by username
[Sun Dec 04 15:47:12.507961 2022] [wsgi:error] [pid 28726:tid 28857] [remote 2a00:ca8:a1f:46a:c8af:eec6:7d27:5795:52498] INFO:coprs:Admin 'praiskup' logged in
[Sun Dec 04 15:47:14.588245 2022] [wsgi:error] [pid 29242:tid 29245] [remote 2a00:ca8:a1f:46a:c8af:eec6:7d27:5795:52498] INFO:coprs:Querying user 'praiskup' by username

Note that between try_login() and create_or_login() (@oid.after_login() hook) is
more than two minutes.

smooge commented a year ago

Is the system trying to authenticate via ipv6 and then falling back to ipv4?

praiskup commented a year ago

I don't think so? I think I've run it fully ipv6... I can reproduce the same thing over ipv4 though:

[Sun Dec 04 18:43:23.539167 2022] [wsgi:error] [pid 28726:tid 28857] [remote 193.165.236.110:47760] ERROR:coprs:https://copr.stg.fedoraproject.org/login/
[Sun Dec 04 18:43:23.540154 2022] [wsgi:error] [pid 28726:tid 28857] [remote 193.165.236.110:47760] ERROR:coprs:calling try_login()
[Sun Dec 04 18:45:39.563156 2022] [wsgi:error] [pid 28726:tid 28854] [remote 193.165.236.110:47760] ERROR:coprs:create_or_login called
[Sun Dec 04 18:45:39.563585 2022] [wsgi:error] [pid 28726:tid 28854] [remote 193.165.236.110:47760] INFO:coprs:Querying user 'praiskup' by username
[Sun Dec 04 18:45:39.588005 2022] [wsgi:error] [pid 28726:tid 28854] [remote 193.165.236.110:47760] INFO:coprs:Admin 'praiskup' logged in
[Sun Dec 04 18:45:40.024811 2022] [wsgi:error] [pid 34730:tid 34735] [remote 193.165.236.110:47760] INFO:coprs:Querying user 'praiskup' by username

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: authentication, medium-gain, medium-trouble

a year ago

smooge commented a year ago

OK my apologies. I thought this was the log from the copr system talking to the id.fedoraproject.org system. [Or the client system talking to the account system..] Is there a way to get what might be happening then?

frostyx commented a year ago

I think it was a good guess with the IPv6, @smooge. I tried

sysctl -w net.ipv6.conf.all.disable_ipv6=1

and the issue disappeared immediately. What confused me is that after I set

sysctl -w net.ipv6.conf.all.disable_ipv6=0

again, the issue didn't appear.

ycollet commented a year ago

I retried the loggin and it went fine now ...

frostyx commented a year ago

I finally found out what is happening. You can see the details here
https://github.com/fedora-copr/copr/issues/2418#issuecomment-1341394581

TL;DR: IPv6 was prioritized over IPv4 and as a consequence, the following reproducer takes ~2 minutes to finish (enough time for the login attempt to timeout).

python -c 'import requests; requests.get("http://frostyx.id.fedoraproject.org/")'

The same happened for wget. On the other hand curl quickly fell back to IPv4 and finished immediately. I solved it by putting precedence ::ffff:0:0/96 100 into /etc/gai.conf.

I am not sure whether this can be caused by some issue on id.fedoraproject.org since id.stg.fedoraproject.org doesn't behave like this, so I am leaving this ticket open.

If everything is as it should be, please close :-)

ycollet commented a year ago

For me, it works now.
But as I am not a network specialist, I'll leave to somebody else the responsability to close this ticket :)

praiskup commented a year ago

From what I see, ipv6 works fine on Copr servers (so I view an explicit ipv4 preference as a work-around). Can ipv6 be broken on id.fedoraproject.org?

kevin commented a year ago

I'm still puzzled why it would timeout... but I could definitely see it being slow.

The path is:

user -> copr fe in aws -> one of our proxies that has ipv6. These could be anywhere, germany, oregon, raleigh, china -> vpn -> ipsilon server

where the ipv4 version would be:

user -> copr fe in aws -> one of our 2 iad2 proxies -> direct network to ipsilon server.

so, I could see slow, but don't see it being that slow.

We could make user.id.fedoraproject.org only resolve to the ipv4 addresses in iad2, but then users who are ipv6 only would be unable to auth at all.

It's planned to add ipv6 to iad2 sometime. We don't have a firm date on it, but early next year seems likely. We could wait for that and switch it then?

frostyx commented a year ago

It's planned to add ipv6 to iad2 sometime. We don't have a firm date on it, but early next year seems likely. We could wait for that and switch it then?

I think it's fine with us. For the time being, I will put our modified gai.conf into Copr frontend's ansible role and we will wait.

kevin commented a year ago

ok. I have made a note to do that once iad2 has ipv6. :)

Metadata Update from @kevin:
- Issue close_status updated to: Upstream
- Issue status updated to: Closed (was: Open)

a year ago

praiskup commented a year ago

What's weird is that id.stg.f.o worked fine, while id.f.o not. Is IPv6 configured differently on those?

kevin commented a year ago

id.fedoraproject.org does have ipv6, but username.id.fedoraproject.org does not, so that shouldn't matter I wouldn't think...

praiskup commented a year ago

Hm, so isn't the problem that we have non-working DNS ipv6 entries for the
production systems? The staging system seems to have none.

$ dig -t AAAA username.id.stg.fedoraproject.org +short
wildcard.stg.fedoraproject.org.

$ dig -t AAAA username.id.fedoraproject.org +short
wildcard.fedoraproject.org.
2600:2701:4000:5211:dead:beef:fe:fed3
2605:bc80:3010:600:dead:beef:cafe:fed9
2600:1f14:fad:5c02:3556:f9be:1874:bdc0
2620:52:3:1:dead:beef:cafe:fed7
2604:1580:fe00:0:dead:beef:cafe:fed1
2620:52:3:1:dead:beef:cafe:fed6

kevin commented a year ago

Those should all be working... except they may not be very fast if they are in another part of the world.

I do see proxy09 in there, we need to fix that one, it is broken. The rest should work.

praiskup commented a year ago

I'm not sure I follow your implications :) could the broken proxy09 be the problem eventually? It was a fairly reproducible problem and I would expect a round-robin on those ipv6 addresses by glibc.

As I understand it, the request timeout was like >= 30s (neither seems to be related to some location issue, or speed of light).

kevin commented a year ago

I've fixed proxy09 (and it's currently out of dns anyhow). But i don't think that was the issue... unless it was somehow getting the broken proxy every single time?

praiskup commented a year ago

Agreed.

Pity, because we are back at the beginning :-( because we still don't know why this isn't a problem at all when using the staging id.fedoraproject.org.

Metadata

Assignee

None

Tags

Blocking

None

Depending on

None

Priority

Waiting on Assignee

fedora-infrastructure

Source Code

#11025 Cannot log in COPR via FAS, I've got a timeout Closed: Upstream a year ago by kevin. Opened a year ago by ycollet.

Describe what you would like us to do:

When do you need this to be done by?

Metadata

authentication medium-trouble medium-gain

#11025 Cannot log in COPR via FAS, I've got a timeout

Closed: Upstream a year ago by kevin. Opened a year ago by ycollet.