#172 vault.centos.org has been unstable for the past several days
Closed: Fixed 3 years ago by arrfab. Opened 3 years ago by mrc0mmand.

Hello,

Going through the logs of our (and not only our) CentOS CI jobs, it looks like the vault.centos.org server, which hosts CentOS SRPMs, has been somewhat unstable for the past several days. In many cases fetching metadata for the *-source repositories fails[0][1], be it because of a timeout, a slow download, or even routing issues. Is there some infra maintenance going on?

Thanks!
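For context, the failing step boils down to something like this (a minimal sketch, not our literal job scripts; the repo IDs match the logs below):

# Enable the CentOS 8 source repositories (dnf config-manager comes from dnf-plugins-core)
dnf -y install dnf-plugins-core
dnf config-manager --set-enabled baseos-source appstream-source extras-source
# Refreshing the metadata is where the timeouts against vault.centos.org happen:
dnf makecache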

[0]

enabling appstream-source repository
enabling baseos-source repository
enabling extras-source repository
CentOS Linux 8 - BaseOS - Source                0.0  B/s |   0  B     02:13    
Errors during downloading metadata for repository 'baseos-source':
  - Curl error (28): Timeout was reached for http://vault.centos.org/centos/8/BaseOS/Source/repodata/repomd.xml [Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds]
  - Curl error (28): Timeout was reached for https://vault.centos.org/centos/8/BaseOS/Source/repodata/repomd.xml [Connection timed out after 30461 milliseconds]
  - Curl error (28): Timeout was reached for http://vault.centos.org/centos/8/BaseOS/Source/repodata/repomd.xml [Connection timed out after 30001 milliseconds]
Error: Failed to download metadata for repo 'baseos-source': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

[1]

enabling appstream-source repository
enabling baseos-source repository
enabling extras-source repository
CentOS Linux 8 - BaseOS - Source                3.5 kB/s | 290 kB     01:23    
CentOS Linux 8 - AppStream - Source             0.0  B/s |   0  B     01:51    
Errors during downloading metadata for repository 'appstream-source':
  - Curl error (28): Timeout was reached for http://vault.centos.org/centos/8/AppStream/Source/repodata/33682491b2a4f3b971da89089eb8be14715588b8ecf045ce7cb4995717284e71-filelists.xml.gz [Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds]
  - Curl error (28): Timeout was reached for http://vault.centos.org/centos/8/AppStream/Source/repodata/63ee95c1e2e95ee2e52642230de2591fd2b4c9e5716bc9ad193805ac6e44f49a-primary.xml.gz [Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds]
  - Curl error (28): Timeout was reached for https://vault.centos.org/centos/8/AppStream/Source/repodata/63ee95c1e2e95ee2e52642230de2591fd2b4c9e5716bc9ad193805ac6e44f49a-primary.xml.gz [Connection timed out after 30604 milliseconds]
  - Curl error (28): Timeout was reached for http://vault.centos.org/centos/8/AppStream/Source/repodata/63ee95c1e2e95ee2e52642230de2591fd2b4c9e5716bc9ad193805ac6e44f49a-primary.xml.gz [Connection timed out after 30000 milliseconds]
Error: Failed to download metadata for repo 'appstream-source': Yum repo downloading error: Downloading error(s): repodata/63ee95c1e2e95ee2e52642230de2591fd2b4c9e5716bc9ad193805ac6e44f49a-primary.xml.gz - Cannot download, all mirrors were already tried without success; repodata/33682491b2a4f3b971da89089eb8be14715588b8ecf045ce7cb4995717284e71-filelists.xml.gz - Cannot download, all mirrors were already tried without success

There is no maintenance going on, and the nodes behind vault.centos.org (5 nodes, mainly EC2 instances) are all reachable and under monitoring.
I'm wondering if we want to have (for CI) one internal vault mirror instead of relying on the network link to the outside (which we don't control, btw, and which is shared between multiple projects hosted in the same cage, without any QoS).

But can you give us details, please?
Is that happening from an OpenShift container?
I ask because I tried from a node (VM) in the same CI VLAN and it works fine:

curl --location http://vault.centos.org/centos/8/BaseOS/Source/repodata/repomd.xml
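A slightly more verbose variant (just a sketch, not exactly what I ran) makes it easy to compare the transfer from a Duffy node against a VM in the same VLAN:

# Report which backend we hit, the download speed, and the total time:
curl --location -s -o /dev/null \
  -w 'remote_ip=%{remote_ip} speed=%{speed_download} B/s time=%{time_total}s\n' \
  http://vault.centos.org/centos/8/BaseOS/Source/repodata/repomd.xml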

I know that @dkirwan had to investigate a similar timeout issue happening only in OCP: see for example #159

Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra

3 years ago

But can you give us details, please?
Is that happening from an OpenShift container?

No, every instance of this issue happened on Duffy nodes. Going through the logs, here is some more detailed information (where and when the timeouts happened, in the CET timezone, +/- a couple of minutes); the issue is pretty intermittent:

[2020-12-16]
- n30.crusty: 9:15 AM
- n29.crusty: 9:14 AM
- n56.gusty: 8:13 AM
- n55.crusty: 8:09 AM
- n53.pufty: 8:08 AM
- n38.pufty: 7:05 AM
- n53.pufty: 6:04 AM
- n62.dusty: 6:04 AM

[2020-12-15]
- n40.crusty: 8:34 AM
- n43.crusty: 8:33 AM
- n12.gusty: 7:15 AM
- n10.gusty: 7:13 AM
- n2.gusty: 6:48 AM
- n3.gusty: 6:48 AM
- n4.gusty: 5:23 AM
- n58.dusty: 4:29 AM
- n52.dusty: 4:29 AM
- n39.gusty: 1:12 AM
- n40.gusty: 1:11 AM
- n42.gusty: 12:42 AM
- n47.pufty: 12:39 AM

[2020-12-10]
- n47.pufty: 12:17 PM
- n37.dusty: 10:51 AM
- n41.dusty: 10:49 AM
- n40.dusty: 10:48 AM
- n6.dusty: 10:33 AM
- n13.dusty: 10:33 AM
- n46.crusty: 10:27 AM
- n47.crusty: 10:27 AM
- n40.crusty: 10:25 AM
- n39.crusty: 10:23 AM
- n32.crusty: 10:17 AM
- n37.crusty: 10:17 AM
- n12.crusty: 10:14 AM
- n26.crusty: 10:10 AM

Thanks for the info, so it's not limited to containers running inside OCP/OpenShift.
We have 5 nodes behind vault.centos.org, 2 in the US and 3 in the EU, but you should always hit one of the US ones (we use GeoIP for that).
As said, I can't reproduce the issue myself, so I don't have a clear view of where it happens (traffic goes through the same gateway and then a shared router/firewall between multiple projects, and we can't see if there are routing issues).
While not denying that there is an issue reaching vault.centos.org, I'm wondering about your use case from within the CI VLAN: the same way we have an internal mirror.centos.org, we can work on an internal vault.centos.org to keep that traffic inside the VLAN, which would solve the problem (and reduce in/out traffic too).

Would you mind elaborating?
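In the meantime, if you can run something like this from an affected Duffy node the next time it happens, it would at least show which backend GeoIP hands out and whether the path looks sane (just a sketch):

# Which A record(s) does GeoIP return from inside the CI VLAN?
dig +short vault.centos.org
# Rough look at the path, to spot obvious routing problems:
traceroute vault.centos.org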

While not denying that there is an issue reaching vault.centos.org, I'm wondering about your use case from within the CI VLAN: the same way we have an internal mirror.centos.org, we can work on an internal vault.centos.org to keep that traffic inside the VLAN, which would solve the problem (and reduce in/out traffic too).

Would you mind elaborating?

The only use case in our scenarios is installation of build dependencies (dnf builddep systemd, etc.).
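For reference, that boils down to something like this (illustrative; the real jobs wrap it in their own scripts):

# dnf builddep (from dnf-plugins-core) resolves the SRPM for the named package
# from the enabled *-source repositories, i.e. from vault.centos.org:
dnf -y builddep systemd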

Mirroring the SRPMs internally would most likely help, but I'm not sure whether the benefits would justify the amount of work required.

Also, it's only thanks to this issue that I realized the SRPM repositories aren't mirrored in the same way as the rest of the repositories, which is kind of interesting :-)
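(For anyone reading along: if I understand it correctly, the binary repos go through the mirror network while the stock source repo definitions point straight at vault.centos.org. Roughly, and with filenames and output that are only illustrative and may differ per release:)

# Show where each repo definition points:
grep -H -E '^(mirrorlist|baseurl)=' /etc/yum.repos.d/CentOS-*.repo
#   CentOS-Linux-BaseOS.repo:mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=BaseOS&infra=$infra
#   CentOS-Linux-Sources.repo:baseurl=http://vault.centos.org/$contentdir/$releasever/BaseOS/Source/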

Metadata Update from @arrfab:
- Issue assigned to arrfab

3 years ago

Metadata Update from @arrfab:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: low-gain, low-trouble

3 years ago

Just to let you know that it's now in progress:

* e4bd289 - (HEAD -> master, origin/master, origin/HEAD) Converted old store01 with enough storage as internal vault. #172 (3 minutes ago) <Fabian Arrotin>

So Ansible reconfigured that box and the content is now being imported. I'll close this ticket once the A record is pushed internally and after a quick test.

The internal vault A record was pushed, now that all the content has landed.
That should speed up your builds from now on.
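You can confirm the switch from any node in the CI VLAN with something like this (a sketch; the address returned is whatever the internal DNS now serves):

# vault.centos.org should now resolve to the internal mirror inside the CI VLAN:
getent hosts vault.centos.org
# and a metadata refresh against the source repos should no longer leave the VLAN:
dnf --disablerepo='*' --enablerepo=baseos-source makecache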
Closing the ticket.

Metadata Update from @arrfab:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

