#4882 Timeouts (operation too slow) using mirrors.fedoraproject.org
Closed: Fixed. Opened 8 years ago by gsauthof.

= bug description =

Lately, I have been getting more yum-cron timeout mails than I would like. They read:

{{{
https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64
'Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds'
}}}

Looking at the last week, I got 3 such mails from a server in a NYC datacenter (B), 4 from a server in a German datacenter (G), and 15 from a server connected to a German DSL line with static IPs (E). All machines are dual-stack, i.e. native IPv4 and IPv6.

= bug analysis =

For the last week, I ran a script in three locations that periodically checks each record of mirrors.fedoraproject.org for 30s timeouts (via curl, pausing 10 minutes between each check).

The curl command used looks like this:

{{{
curl --globoff --silent --max-time 30 --insecure -H 'Host: mirrors.fedoraproject.org' 'https://'"$SOME_IPv4_orIPv6"'/metalink?repo=epel-7&arch=x86_64'
}}}
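
A minimal sketch of how such a checking loop can look (the record list, log file name, and exact timings below are illustrative placeholders, not the actual script):

{{{
#!/bin/bash
# Illustrative checking loop: probe each mirror record via curl and log failures.
# The record list and log file name are placeholders.
records=(209.132.181.16 140.211.169.196 '[2610:28:3090:3001:dead:beef:cafe:fed3]')

while true; do
  for r in "${records[@]}"; do
    curl --globoff --silent --output /dev/null --max-time 30 --insecure \
         -H 'Host: mirrors.fedoraproject.org' \
         "https://$r/metalink?repo=epel-7&arch=x86_64"
    rc=$?
    if [ "$rc" -ne 0 ]; then
      echo "$(date +%FT%T%z),$r,$rc" >> timeouts.log
    fi
  done
  sleep 600   # pause 10 minutes between checks
done
}}}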

I've attached a plot for each location.

Each dot denotes a timeout event. The 'mail' lines denote yum cron mails.

The NYC (B) plot shows a red line; this is apparently due to a routing issue (I've provided some traceroutes to Stephen John Smoogen via private communication).

Since the plots also contain samples for IPv4 servers, the issue does not seem to be IPv6-specific.

= fix recommendation =

Fix the routing for {{{2001:4178:2:1269::fed2}}}.

Assuming that the timeouts are due to too much load on the individual machines:

Add more machines to the rotation.

Also, assuming that your monitoring does not issue HTTP requests (perhaps just pings), you could change that to get more realistic results.

Make yum-cron more resilient against single timeouts. Ideally, it should try each mirror round-robin and should only report an error if ALL mirrors time out. I suspect that yum-cron currently just does a simple DNS request and uses the first record. Perhaps it even tries all records, but then I am really not interested in single-timeout messages when it succeeds with another record.
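
To illustrate the suggested behaviour, a rough shell sketch (dig and curl stand in here for what yum-cron's Python code would do; this is not how yum-cron actually works):

{{{
#!/bin/bash
# Hypothetical round-robin fallback: only report an error if ALL records fail.
host=mirrors.fedoraproject.org
path='/metalink?repo=epel-7&arch=x86_64'

# Gather all A and AAAA records; IPv6 literals need brackets in URLs.
mapfile -t records < <(dig +short A "$host"; dig +short AAAA "$host" | sed 's/.*/[&]/')

for r in "${records[@]}"; do
  curl --globoff --silent --max-time 30 --insecure \
       -H "Host: $host" -o metalink.xml "https://$r$path"
  rc=$?
  if [ "$rc" -eq 0 ]; then
    exit 0                     # one record answered in time, no error mail needed
  fi
  echo "record $r failed (curl exit $rc)" >&2
done
echo "ERROR: all records for $host timed out or failed" >&2
exit 1
}}}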

And while we are talking about yum-cron: it should at least include the IP in its 'Operation too slow' message. Since DNS returns multiple records for {{{mirrors.fedoraproject.org}}}, the message, as-is, is not very useful.


Thanks for the detailed info. :)

We will take a look and see what we can find out.

Note that yum-cron is beyond our scope; if you want it to change, please file bugs against it in Bugzilla.

Might you be able to add a traceroute6 from the site that cannot reach 2001:4178:2:1269::fed2 ?

Sure, here is the traceroute (created a few minutes ago):

{{{
traceroute6 2001:4178:2:1269::fed2
traceroute to 2001:4178:2:1269::fed2 (2001:4178:2:1269::fed2), 30 hops max, 80 byte packets
1 2604:a880:800:10:ffff:ffff:ffff:fff1 (2604:a880:800:10:ffff:ffff:ffff:fff1) 9.712 ms 2604:a880:800:10:ffff:ffff:ffff:fff2 (2604:a880:800:10:ffff:ffff:ffff:fff2) 0.472 ms 0.507 ms
2 2604:a880:800::501 (2604:a880:800::501) 0.454 ms 2604:a880:800::601 (2604:a880:800::601) 0.446 ms 2604:a880:800::801 (2604:a880:800::801) 0.423 ms
3 decix-nyc.he.net (2001:504:36::1b1b:0:1) 1.253 ms 10gigabitethernet1-3.core1.nyc5.he.net (2001:504:17:115::17) 7.844 ms 7.952 ms
4 10ge9-7.core1.nyc4.he.net (2001:470:0:20a::1) 7.691 ms 10ge16-1.core1.nyc4.he.net (2001:470:0:259::1) 2.427 ms 10ge9-7.core1.nyc4.he.net (2001:470:0:20a::1) 7.654 ms
5 100ge7-2.core1.lon2.he.net (2001:470:0:2cf::1) 84.762 ms 84.756 ms 84.717 ms
6 100ge7-1.core1.fra1.he.net (2001:470:0:37::2) 132.141 ms 131.991 ms 132.114 ms
7 10ge2-1.core1.muc1.he.net (2001:470:0:2f8::2) 86.706 ms 86.626 ms 86.546 ms
8 *
9 *
10 *
11 *
12 *
13 *
14 *
15 *
16 *
17 *
18 *
19 *
20 *
21 *
22 *
23 *
24 *
25 *
26 *
27 *
28 *
29 *
30 *
}}}

I've been running a similar script here for about 4 hours today and so far have gotten 0 timeouts. ;(

Are you still running your scripts?

Are you still seeing the issue(s)?

I'm going to keep running things here and also keep digging around for possible causes.

Testing from a location in Virginia, IPv4 addresses only, over a 48-hour stretch, starting on Mon Sep 14 11:39:46 2015.

following mirrors: ['209.132.181.16', '66.35.62.162', '67.203.2.67', '67.219.144.68', '140.211.169.196', '152.19.134.142', '209.132.181.15']

See the attached file: timeout.log

Replying to [comment:3 kevin]:

Are you still running your scripts?

No, I am not.

They were only running for 2 weeks. The graphs contain the results from the first week; the log files show some timeouts in the second week as well.

I'll restart them again.

The last yum-cron timeout mails I received (within the last week) arrived at:

{{{
2015-09-11 06:08:19 +0200
2015-09-09 07:03:58 +0200
2015-09-08 06:09:02 +0200
2015-09-07 07:06:13 +0200
}}}

Just (17:05 +0000) got an email from cron on a system in Germany:

{{{
/etc/cron.hourly/0yum-hourly.cron:

Could not get metalink https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64 error was
12: Timeout on https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds')
}}}

I did a bunch of digging around in logs and various monitors today.

On several of the proxies where folks reported timeouts, haproxy reports 0 errors and no downtime, so I think that points to httpd on the proxies somehow not passing the request on to haproxy -> vpn -> mirrorlists.

I am at a loss to explain why that would be, however. The proxies aren't hitting maxclients or anything, and there are no particular errors in the logs.

I'm open to ideas on how to isolate it more...
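
One quick way to cross-check the haproxy side is to dump its per-server counters from the admin stats socket (the socket path and the backend naming below are assumptions about the local setup):

{{{
# Dump haproxy's CSV stats and look at the mirrorlist backend's error and
# downtime counters; socket path and "mirror" backend name are assumptions.
echo "show stat" | socat stdio unix-connect:/var/run/haproxy.sock | grep -i mirror
}}}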

Thinking a bit more about this, the timeouts could be related to the point in time when we reload the mirrorlist server process. The report I posted is from 5 minutes after the full hour; that could be the point where the mirrorlist pkl was newly generated, transferred to the mirrorlist servers, and the mirrorlist process killed with -HUP. Once the mirrorlist server process restarts, it reloads all its configuration and databases and needs some time until it can serve answers again. This happens every hour for a very short period of time, and it would be a rather big coincidence if that were the root cause of the problems described here. I just wanted to mention that we indeed have a short period where all mirrorlist servers are probably not able to answer requests.

Looking at the latest comment from gsauthof, the problems also always occur just a few minutes after the full hour. With access to the mirrorlist servers, this could probably be reproduced easily by killing the mirrorlist server process with -HUP and then looking at the time until the first new requests are served.
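
Something along these lines could measure that gap on a mirrorlist server (the process name and the local URL are assumptions about the actual setup):

{{{
# Send -HUP to the mirrorlist server process and measure how long it takes
# until the first request is answered again. Process name and local URL are
# assumptions, not the real deployment details.
kill -HUP "$(pgrep -f mirrorlist_server)"
start=$(date +%s)
until curl --silent --output /dev/null --max-time 5 \
      'http://localhost/metalink?repo=epel-7&arch=x86_64'; do
  sleep 1
done
echo "first answer after $(( $(date +%s) - start )) seconds"
}}}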

Well, it doesn't seem to take any time to reload:

{{{
Sep 18 21:03:26 mirrorlist-phx2 python2[898]: load_databases_and_caches...done.
Sep 18 21:03:26 mirrorlist-phx2 mirrorlist: x.y.z.a found its best mirror from geoip
Sep 18 21:03:26 mirrorlist-phx2 mirrorlist: x.y.z.q found its best mirror from geoip
Sep 18 21:03:26 mirrorlist-phx2 mirrorlist: x.y.z.a found its best mirror from netblocks
Sep 18 21:03:26 mirrorlist-phx2 mirrorlist: x.y.z.a found its best mirror from continent
Sep 18 21:03:26 mirrorlist-phx2 mirrorlist: x.y.z.a found its best mirror from netblocks
}}}

Also, if it was the mirrorlists or the VPN to them, I would expect haproxy to report the service as down for some small amount of time. I suppose it could be finer-grained than our check interval though: "check inter 5s rise 2 fall 3" (so a check every 5 seconds).
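
For context, that option sits on the server lines in haproxy.cfg, roughly like this (the backend/server names and address here are made up for illustration):

{{{
backend mirror-lists
    # health check every 5s; 2 successes to mark up, 3 failures to mark down
    server mirrorlist-phx2 10.0.0.10:80 check inter 5s rise 2 fall 3
}}}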

So I tried to break the mirrorlist in staging, and I was able to create a situation where it took more than 30 seconds to get an answer (about 50 seconds).

I set processes=1, threads=1 and maximum-requests=2 on the mirrorlist wsgi and was running ab like this:

{{{
ab -n 10000 -c 100 'https://mirrors.stg.fedoraproject.org/metalink?repo=epel-7&arch=x86_64'
}}}

In parallel to this I was running urlgrabber in a loop on the same metalink (which should be irrelevant) and was able to make urlgrabber wait for over 50 seconds before getting an answer. After I stopped ab, I got a bunch of 503s before urlgrabber started getting the usual answers again.
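
The urlgrabber loop can be approximated with plain curl, which makes the per-request latency easy to log (a rough stand-in, not the actual test harness):

{{{
# Print status code and total time of each metalink fetch while ab is running.
while true; do
  curl --silent --output /dev/null --write-out '%{http_code} %{time_total}s\n' \
       'https://mirrors.stg.fedoraproject.org/metalink?repo=epel-7&arch=x86_64'
  sleep 5
done
}}}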

Looking at the cron job configuration in /etc/yum/yum-cron-hourly.conf on CentOS 7, I see a random_sleep of 15. So the hourly yum cron jobs of all EPEL users (with hourly yum-cron enabled) start between minute 01 and minute 16 of every hour.

This could lead to more requests than usual on the mirrorlist servers. Do you have an overview of how the mirrorlist accesses are distributed over an hour? Running urlgrabber against the production mirrorlist, I can see that it takes more time to answer requests in the first 15 minutes of each hour. Usually it is below 2 seconds; in the first 15 minutes of the hour I have seen more than 10 seconds. As we are also updating the mirrorlist servers during this time period, that could increase the answer time even more.
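
For reference, the random_sleep setting mentioned above sits in {{{/etc/yum/yum-cron-hourly.conf}}} roughly like this (quoted from memory, so treat the exact wording as approximate):

{{{
[commands]
# Maximum amount of time to randomly sleep, in minutes.
random_sleep = 15
}}}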

Wed, 7 Oct 2015 20:04:20 +0200 (CEST)

Could not get metalink https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64 error was
12: Timeout on https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds')

Yeah, we actually have a possible culprit now. ;)

Some of our proxy servers are under heavier load than they should be, which might be causing them to not process requests in a timely manner. We are working on that; once that's solved we should re-check this again.

ok, we may have the load issues in better shape now.

Can everyone seeing this please watch out and see if you still see it now? With the load nominal on all the proxies hopefully they are all processing requests in reasonable time.

Started running the script at 03:25. I see timeouts for the resolved record set. I will let it run for a while and then attach the logs to the ticket.

Despite all the recent changes (proxy and max_children for the mirrorlist server), I still got:

Fri, 16 Oct 2015 15:04:07 +0200 (CEST):

Could not get metalink https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64 error was
12: Timeout on https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds')

I don't see any unusual load or issues around that time; however, it's right around the time the mirrorlists reload their pkl. ;(

We could perhaps move to an ansible script that in turn takes each mirrorlist out of haproxy, updates it, then re-adds it? Or perhaps there's some way in the code to handle the reload better.
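
A sketch of what such a rolling update could do per mirrorlist host, using haproxy's admin socket (the backend/server name and socket path are assumptions, and the stats socket has to be configured with "level admin"):

{{{
# Per mirrorlist server: drain it in haproxy, update it, then re-enable it.
SOCK=/var/run/haproxy.sock            # assumed admin socket path
SRV="mirror-lists/mirrorlist-phx2"    # assumed backend/server name

echo "disable server $SRV" | socat stdio unix-connect:$SOCK
# ... push the new pkl to the host and restart/-HUP its mirrorlist process,
#     then wait until it answers requests again ...
echo "enable server $SRV" | socat stdio unix-connect:$SOCK
}}}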

So, any new status here?

Are folks still seeing this and only right after the top of the hour? Or has it gone away now?

I am still seeing it - although less often than before. Last error mail was sent 2015-10-22 01:06:50 (UTC+2).

I've started my scripts again. Thus, I can post the results in a week or so.

So I ran it for the last 11 days, and the situation now is much better than before.

From a German DSL (static IPs) I just got 2 timeouts:

{{{
2015-11-12T07:02:28+0100,67.219.144.68,28
2015-11-13T13:15:59+0100,[2610:28:3090:3001:dead:beef:cafe:fed3],28
}}}

From a machine in a German datacenter:

{{{
2015-11-09T09:49:38+0100,140.211.169.206,28
2015-11-12T09:06:32+0100,209.132.181.16,28
2015-11-16T22:52:40+0100,[2610:28:3090:3001:dead:beef:cafe:fed3],28
2015-11-17T06:18:51+0100,[2610:28:3090:3001:dead:beef:cafe:fed3],28
2015-11-17T06:28:35+0100,[2610:28:3090:3001:dead:beef:cafe:fed3],28
2015-11-17T06:38:24+0100,[2610:28:3090:3001:dead:beef:cafe:fed3],28
2015-11-17T06:48:09+0100,[2610:28:3090:3001:dead:beef:cafe:fed3],28
2015-11-17T06:57:55+0100,[2610:28:3090:3001:dead:beef:cafe:fed3],28
2015-11-17T07:07:49+0100,[2610:28:3090:3001:dead:beef:cafe:fed3],28
2015-11-17T07:17:35+0100,[2610:28:3090:3001:dead:beef:cafe:fed3],28
}}}

From a machine in a NY, USA datacenter 4 timeouts:

{{{
2015-11-10T12:52:37-0500,67.219.144.68,28
2015-11-14T06:38:07-0500,67.203.2.67,35
2015-11-16T12:12:40-0500,209.132.181.15,28
2015-11-16T19:24:36-0500,[2607:f188::dead:beef:cafe:fed1],28
}}}

Methods: as before, the checking script sleeps 9 minutes between tests; each test fetches all records from DNS (both IPv4 and IPv6) and accesses them via curl (with a 30 s timeout, round-robin). Only failures (timeouts etc.) are reported. The last field in the quoted lines is the curl error code (28 = timeout, 35 = SSL connect/handshake error).

Thus, I consider the current state much improved compared to when I opened the ticket (cf. the old attachments).

Cool. That's progress at least.

It would sure be nice to get to none of those however. :(

How are we doing now? I am not sure there's much more we could do from our end, and it sounds like it's pretty reliable now?

For the last 5-6 days or so, I have been seeing this issue on several of my hosts. Not only on IPv6-enabled hosts, but it happens more often on IPv6-enabled hosts. My IPv6-enabled hosts are dual-stack.

The exact error message I see each time is:
{{{
Could not get metalink https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64 error was
12: Timeout on https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds')
}}}

Are you able to better isolate which mirrorlist(s) this is from? Running with -v might tell you more...
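
For example, something like this shows which address was tried and which proxy answered (assuming the proxies send the X-Fedora-ProxyServer response header, as seen in the curl output further down in this ticket):

{{{
# Grep curl's verbose output for the address tried and the answering proxy.
curl -sv -o /dev/null 'https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64' \
  2>&1 | grep -Ei 'Trying|X-Fedora-ProxyServer'
}}}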

I can confirm it. I also got another timeout on Fri, 12 Feb 2016 17:08:41 +0100

ok, any new reports in the last month?

I know adding mirrorlists has helped a good deal.

I've not heard any reports of this in a while, so I am going to close this.

If anyone here sees them again, please try and isolate where from and reopen this or file a new ticket.

I'm hopeful that things are more robust now.

Hi. I'm facing the same issue now. It seems all mirrors work well except proxy06.

For example, this one works fine:

# curl -v "https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64&infra=stock&content=centos" >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to mirrors.fedoraproject.org port 443 (#0)
*   Trying 38.145.60.20...
* Connected to mirrors.fedoraproject.org (38.145.60.20) port 443 (#0)
...
X-Fedora-ProxyServer: proxy01.iad2.fedoraproject.org
...
100 31190  100 31190    0     0  31760      0 --:--:-- --:--:-- --:--:-- 31761
}}}

And here's the proxy06:

# curl -v "https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=x86_64&infra=stock&content=centos" >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to mirrors.fedoraproject.org port 443 (#0)
*   Trying 140.211.169.196...
* Connected to mirrors.fedoraproject.org (140.211.169.196) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=*.fedoraproject.org,O="Red Hat, Inc.",L=Raleigh,ST=North Carolina,C=US
*       start date: Feb 27 00:00:00 2020 GMT
*       expire date: Mar 02 12:00:00 2022 GMT
*       common name: *.fedoraproject.org
*       issuer: CN=DigiCert SHA2 High Assurance Server CA,OU=www.digicert.com,O=DigiCert Inc,C=US
> GET /metalink?repo=epel-7&arch=x86_64&infra=stock&content=centos HTTP/1.1
> User-Agent: curl/7.29.0
> Host: mirrors.fedoraproject.org
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0< HTTP/1.1 200 OK
< Date: Tue, 09 Nov 2021 17:03:37 GMT
< Server: Apache
< X-Frame-Options: SAMEORIGIN
< X-Xss-Protection: 1; mode=block
< X-Content-Type-Options: nosniff
< Referrer-Policy: same-origin
< content-type: application/metalink+xml
< content-length: 31190
< AppTime: D=3975
< X-Fedora-ProxyServer: proxy06.fedoraproject.org
< X-Fedora-RequestID: YYqp6quUoLfR3T26qk0O7gAAB8g
<
{ [data not shown]
 52 31190   52 16384    0     0    626      0  0:00:49  0:00:26  0:00:23     0^C
}}}

(interrupted after 23 seconds of inactivity)

I enabled yum-cron on my servers several days ago, and it complains about timeouts several times a day.

Please, advise.

After analyzing with tcpdump, it appears the slow speed doesn't happen just with proxy06; I also saw proxy09's IP, 140.211.169.206. Is there anything I (or better, you ;) can do to get that fixed? I'm getting a lot of such emails every single day. That's really annoying :((
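
For anyone reproducing this, a capture filter along these lines is enough to see which proxy IP the slow transfers come from (the two addresses are the ones observed so far):

{{{
# Watch traffic to the two suspect proxies while yum/curl fetches the metalink.
tcpdump -nn -i any 'host 140.211.169.196 or host 140.211.169.206'
}}}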
