#11426 Connections to https://pagure.io sometimes fail or hang, using HTTP 1.1 seems to avoid this
Closed: Fixed 8 months ago by adamwill. Opened 9 months ago by adamwill.

Describe what you would like us to do:

Make it so that connecting to https://pagure.io does not sometimes hang or fail. Various folks have reported this (according to @kevin ); for me it's quite badly affecting openQA tests. The tests both do git clone on pagure.io repos and fetch individual files from pagure.io using curl. Both of these operations are affected and sometimes cause tests to fail.

This is affecting tests on at least Rawhide and Fedora 38 (probably F37 too, I just can't say for sure and it's a pain to work out). It seems like forcing HTTP 1.1 avoids this - over the weekend I changed the direct curl commands (but not yet the git clones) to force HTTP 1.1, and since then, I don't think I've seen any of the curl commands fail (but the git clones still sometimes do).
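A sketch of what the curl side of the workaround looks like (the URL and filename below are placeholders for illustration, not the actual files the tests fetch):

```shell
# Force curl to negotiate HTTP/1.1 instead of HTTP/2.
# --http1.1 is a standard curl flag; the URL is a hypothetical example.
curl --http1.1 -fsSL -o some-file \
    https://pagure.io/some-project/raw/main/f/some-file
```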

When do you need this to be done by? (YYYY/MM/DD)

ASAP, it's rather a big issue.

What would the requests have been before you forced 1.1? 1.0, 2.0, or something else?

https://openqa.fedoraproject.org/tests/2017558 is the earliest occurrence of this I can find, on 2023-07-15. The git clone there ran at 2023-07-15T15:12:03.466177Z (UTC). I've checked back about a week before that and don't see any other occurrences.

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation, high-gain, ops, pagure

9 months ago

Has moving to http1.1 helped any?

I have spoken (well, typed) with our networking folks, and indeed there are some issues on their end.

The main link there was being flaky, so they disabled it and traffic moved over to a backup link, but that backup link may be getting saturated.

They are working on fixing the primary link again and/or mitigating the traffic in other ways.

Hopefully they will have a fix soon.

yes, using http/1.1 appears to be avoiding the problem entirely. haven't seen it once since I switched both curl and git to use it.
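For reference, forcing git's HTTP transport down to HTTP/1.1 can be done per-invocation or persistently via the standard `http.version` config key (a sketch; the repository URL is a placeholder):

```shell
# Per-invocation: override http.version for a single clone.
# http.version is a standard git config key (git >= 2.18).
git -c http.version=HTTP/1.1 clone https://pagure.io/some-project.git

# Or persistently, for all repositories on this machine:
git config --global http.version HTTP/1.1
```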

Interesting. I wonder if that means that http/2 is more sensitive to congestion, or the issue isn't at the networking layer, but in the httpd http/2 implementation somewhere.

The networking issue has been fixed now.

Please let us know if you continue to see the problem after this point.

Closing as there has been no response in a week, so everything seems to be OK now.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

8 months ago

Oh, no. I never saw that message (I was at Flock). openQA is still using HTTP 1.1 as a workaround, so I cannot say whether this is fixed. I can retest with HTTP 2.0 on staging tomorrow.

Metadata Update from @adamwill:
- Issue status updated to: Open (was: Closed)

8 months ago

OK, I've deployed a branch with all the HTTP 1.1 workarounds disabled on the stg instance, we'll see if it starts hitting problems again.
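One way to confirm which protocol version is actually negotiated while retesting is curl's `%{http_version}` write-out variable (a sketch, assuming pagure.io still serves HTTP/2 by default):

```shell
# Print the HTTP version curl negotiated with the server.
# With the workaround disabled this should report 2; with --http1.1 it
# should report 1.1.
curl -s -o /dev/null -w '%{http_version}\n' https://pagure.io/
curl --http1.1 -s -o /dev/null -w '%{http_version}\n' https://pagure.io/
```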

I don't see any related failures on stg overnight, so I'll merge this on prod too and verify for another day or two.

Haven't seen any on prod or stg over the weekend, so I think we're good. Thanks.

Metadata Update from @adamwill:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

8 months ago
