#8679 Fedora mirror problems (31/32)?
Closed: Fixed 4 years ago by kevin. Opened 4 years ago by praiskup.

One of our builds in copr failed today:
https://copr.fedorainfracloud.org/coprs/g/copr/copr-dev/build/1250351/

With the following dnf logs:
https://copr-be.cloud.fedoraproject.org/results/%40copr/copr-dev%3Apr%3A1285/fedora-32-x86_64/01250351-copr-rpmbuild/chroot_scan/var/lib/mock/1250351-fedora-32-x86_64-1582528487.447365/root/var/log/

Librepo log says:
2020-02-24T07:14:48Z INFO Librepo version: 1.11.1 with CURL_GLOBAL_ACK_EINTR support (libcurl/7.66.0 OpenSSL/1.1.1d-fips zlib/1.2.11 brotli/1.0.7 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.39.2)
2020-02-24T07:14:48Z INFO Librepo version: 1.11.1 with CURL_GLOBAL_ACK_EINTR support (libcurl/7.66.0 OpenSSL/1.1.1d-fips zlib/1.2.11 brotli/1.0.7 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.39.2)
2020-02-24T07:14:48Z INFO Downloading: https://download.copr.fedorainfracloud.org/results/@copr/copr-dev/fedora-32-x86_64/repodata/repomd.xml
2020-02-24T07:14:48Z INFO Downloading: https://download.copr.fedorainfracloud.org/results/@copr/copr-dev/fedora-32-x86_64/repodata/1cb61ea996355add02b1426ed4c1780ea75ce0c04c5d1107c025c3fbd7d8bcae-primary.xml.gz
2020-02-24T07:14:48Z INFO Downloading: https://download.copr.fedorainfracloud.org/results/@copr/copr-dev/fedora-32-x86_64/repodata/95a4415d859d7120efb6b3cf964c07bebbff9a5275ca673e6e74a97bcbfb2a5f-filelists.xml.gz
2020-02-24T07:14:48Z INFO Downloading: https://mirrors.fedoraproject.org/metalink?repo=fedora-32&arch=x86_64
2020-02-24T07:14:49Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/repomd.xml
2020-02-24T07:14:49Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/b13cce4d9f9bc5290cac8ed098774c28390390780769759bb5a49918b804442b-primary.xml.zck
2020-02-24T07:14:49Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/412422ad8e60d4c14e1f83bcfc91148a8766bf6f98f2e48d9e4f11c818da58d0-filelists.xml.zck
2020-02-24T07:14:49Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/816c32e48d37d5e668a5c4f337bdfcb3fcfca18e0c7f136a1f0aa479d54f399a-comps-Everything.x86_64.xml
2020-02-24T07:14:49Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/b13cce4d9f9bc5290cac8ed098774c28390390780769759bb5a49918b804442b-primary.xml.zck
2020-02-24T07:14:49Z INFO Error during transfer: Status code: 404 for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/412422ad8e60d4c14e1f83bcfc91148a8766bf6f98f2e48d9e4f11c818da58d0-filelists.xml.zck (IP: 99.84.185.33)
2020-02-24T07:14:49Z INFO Downloading: http://mirrors.mit.edu/fedora/linux/development/32/Everything/x86_64/os/repodata/412422ad8e60d4c14e1f83bcfc91148a8766bf6f98f2e48d9e4f11c818da58d0-filelists.xml.zck
2020-02-24T07:14:49Z INFO Error during transfer: Status code: 404 for http://mirrors.mit.edu/fedora/linux/development/32/Everything/x86_64/os/repodata/412422ad8e60d4c14e1f83bcfc91148a8766bf6f98f2e48d9e4f11c818da58d0-filelists.xml.zck (IP: 18.7.29.125)
2020-02-24T07:14:49Z INFO Downloading: https://pubmirror1.math.uh.edu/fedora-buffet/fedora/linux/development/32/Everything/x86_64/os/repodata/412422ad8e60d4c14e1f83bcfc91148a8766bf6f98f2e48d9e4f11c818da58d0-filelists.xml.zck

Seems like there's some inconsistency between repomd.xml and the rest of the metadata.


This is not just one build; it looks 100% reproducible.

This doesn't happen for me locally, so the location where this happens is the Copr builders in AWS, N. Virginia.

OK, cloudfront is slow to get updates into it... for 'stable' releases I expect this problem to occur rarely, but for rawhide and f32 it may be way behind on things. [Where 'way' means over 24 hours.]

OK, we are thinking this might be happening because we are syncing into cloudfront at a different time from when we do composes. Syncing 500 GB out of phx2 for f32/rawhide takes a while, so the data going into cloudfront is 'old'. We are going to try changing the sync times to see if we can get around this.

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

4 years ago

I propose to run the cron jobs at 13, 17, and 23 UTC.

Rawhide completes at around 11-12 UTC and branched around 14-15 UTC; the last run is a bit later because, if we run another one during the day, it should generally be done by 23 UTC.
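In crontab terms the proposal would look something like this (the script path is illustrative, not the actual job):

# run the S3 sync at the proposed times (all UTC)
0 13,17,23 * * * /usr/local/bin/s3-sync.sh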

Metadata Update from @mohanboddu:
- Issue priority set to: Needs Review (was: Waiting on Assignee)

4 years ago

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

4 years ago

I'm not familiar with mirroring TBH, but the mirroring mechanism seemed to be
designed not to use mirrors that haven't been rsynced yet, no? What exactly is
happening here?

There isn't much I can see we can do to fix this issue at this moment. Most of the changes to deal with CI and similar build systems being usable 24x7x365 really need us to rewrite our entire build and delivery architectures.

I think the main problem is that our mirroring is really based on the older notion that things will eventually work for people, while what COPR and CI need is 'IT WORKS ALL THE TIME'. Fedora Infrastructure does not have the resources (in time, people, or architecture) to provide that promise.

I think the reason that COPR had not run into this for a long time was that the builders were getting the dl.fedoraproject.org download servers as their main source. A long time ago, COPR did have this happen often when it used other mirrors.

I am going to flag this for management to look at.

@smooge Is the Cloudfront cache being invalidated during/after the sync, in order to force Cloudfront to go back to S3 to get the content?

So, here's what I think is happening (and I think we can at least improve it a lot):

The copr builder gets a metalink from mirrorlists. Since it's in AWS, it gets cloudfront as the first/top entry. Since we control cloudfront, it's marked as 'always up to date'.
Currently we are syncing to cloudfront at some miscellaneous times, but not very ideal ones. So there may be cases where we are up to date, an updates push or compose happens, and we don't sync it for many, many hours. Then you get mirrormanager updated but cloudfront not yet, so it has old data.

If we move our syncing to smarter times this should greatly reduce this issue, IMHO.

@praiskup can you tell us what OS/version the copr builders are? f31? And are they using the metalink? The one thing I don't understand is why they don't go on to another mirror after cloudfront 404's. Are they getting a list of mirrors in the metalink? You can log in to one and just curl the metalink URL to see.

The report with ++ in it is actually not this issue, but rather cloudfront's broken handling of ++ in names. ;( In this case it 'should' just go on to the next mirror and get it. I am not sure why it's not doing so.

Looking at one of the logs I see:

[SKIPPED] zstd-1.4.4-2.fc32.x86_64.rpm: Already downloaded                     
[MIRROR] shadow-utils-4.8-3.fc32.x86_64.rpm: Status code: 404 for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/Packages/s/shadow-utils-4.8-3.fc32.x86_64.rpm (IP: 52.85.145.55)
[MIRROR] shadow-utils-4.8-3.fc32.x86_64.rpm: Status code: 404 for http://mirror.cc.vt.edu/pub/fedora/linux/development/32/Everything/x86_64/os/Packages/s/shadow-utils-4.8-3.fc32.x86_64.rpm (IP: 198.82.161.58)
[MIRROR] shadow-utils-4.8-3.fc32.x86_64.rpm: Status code: 404 for http://download-ib01.fedoraproject.org/pub/fedora/linux/development/32/Everything/x86_64/os/Packages/s/shadow-utils-4.8-3.fc32.x86_64.rpm (IP: 152.19.134.145)
[MIRROR] shadow-utils-4.8-3.fc32.x86_64.rpm: Status code: 404 for https://download-ib01.fedoraproject.org/pub/fedora/linux/development/32/Everything/x86_64/os/Packages/s/shadow-utils-4.8-3.fc32.x86_64.rpm (IP: 152.19.134.145)
[MIRROR] shadow-utils-4.8-3.fc32.x86_64.rpm: Status code: 404 for http://mirrors.mit.edu/fedora/linux/development/32/Everything/x86_64/os/Packages/s/shadow-utils-4.8-3.fc32.x86_64.rpm (IP: 18.7.29.125)
[MIRROR] shadow-utils-4.8-3.fc32.x86_64.rpm: Status code: 404 for http://mirror.siena.edu/fedora/linux/development/32/Everything/x86_64/os/Packages/s/shadow-utils-4.8-3.fc32.x86_64.rpm (IP: 199.223.246.113)
[... skip some output ...]
[MIRROR] shadow-utils-4.8-3.fc32.x86_64.rpm: Status code: 404 for http://dl.fedoraproject.org/pub/fedora/linux/development/32/Everything/x86_64/os/Packages/s/shadow-utils-4.8-3.fc32.x86_64.rpm (IP: 209.132.181.24)
[MIRROR] shadow-utils-4.8-3.fc32.x86_64.rpm: Status code: 404 for https://dl.fedoraproject.org/pub/fedora/linux/development/32/Everything/x86_64/os/Packages/s/shadow-utils-4.8-3.fc32.x86_64.rpm (IP: 209.132.181.23)

My interpretation of this is that it first gets repomd.xml from cloudfront, probably an old file. The reason I think it is an old (probably cached) file is that the corresponding RPM is not available anywhere, not even on dl.fedoraproject.org. If the repomd.xml were really new, then the corresponding files should at least be on dl.fedoraproject.org.

Not sure how the mirroring is set up here, but my recommendation would be (and I think that is also what debian recommends for their mirroring setup):

  1. Sync all the RPMS (without deleting old content)
  2. Sync repodata directories
  3. Invalidate cloudfront caches for repodata directories
  4. Delete old files

This is without any understanding of how the S3 mirroring and cloudfront caches work (a sketch of the invalidation step follows below).

(It would even be better if this could be done from a netapp snapshot. The files used to be on a netapp some years ago and there were .snapshot directories.)
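For step 3, the invalidation could presumably be done through the AWS CLI; a minimal sketch, assuming the standard aws cloudfront command (the distribution ID is made up):

# invalidate the repodata paths right after the metadata sync, so
# CloudFront re-fetches them from the origin; E123EXAMPLE is fictional
aws cloudfront create-invalidation \
    --distribution-id E123EXAMPLE \
    --paths '/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/*'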

@smooge wrote:

I think the reason that COPR had not run into this for a long time was that the builders were getting the dl.fedoraproject.org download servers as their main source.

We have already been using AWS builders for some time (a few months).

@kevin wrote:

Since we control cloudfront, it's marked as 'always up to date'.

What is always up to date? The cache, or the original content? We use cloudfront
for copr repositories as well, and we came to the conclusion that we can cache pretty
much everything except the repomd.xml file.

@praiskup can you tell us what OS/version the copr builders are? f31? And are they using the metalink? The one thing I don't understand is why they don't go on to another mirror after cloudfront 404's. Are they getting a list of mirrors in the metalink?

Builders are fedora 31, x86_64/aarch64. They are using the defaults from mock-core-configs package (so metalinks for fedora).

My interpretation is that repomd.xml is cached (and it should never be) ... so even
fallbacks don't make any difference.

You can log in to one and just curl the metalink URL to see.

Yes, once that happens I'll try to do this.

@adrian wrote:

My interpretation of this is that it first gets repomd.xml from
cloudfront, probably an old file. The reason I think it is an old
(probably cached) file is that the corresponding RPM is not available
anywhere, not even on dl.fedoraproject.org. If the repomd.xml were
really new, then the corresponding files should at least be on
dl.fedoraproject.org.

+1; the thing we should do, IMO, is configure the http server to never
cache the repomd.xml file -> cloudfront will respect that http config and
never cache it either. This could reverse the problem, but I assume that
mirrormanager and the other mechanisms have pretty good protections
against that (the fallback to a different mirror - one which has up-to-date
RPM content - would succeed at least).

FTR, copr-be servers use lighttpd, and the configuration - now tested to make
cloudfront respect correct caching - is:
https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=08e2476f5aadd97bfa9d521521ce203cc0a2826e
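A minimal sketch of such a lighttpd rule, assuming mod_setenv is loaded (this approximates the idea of the referenced commit rather than quoting it):

# never let downstream caches (CloudFront included) keep repomd.xml
$HTTP["url"] =~ "/repodata/repomd\.xml$" {
    setenv.add-response-header = ( "Cache-Control" => "no-cache" )
}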

Would blocking the https://d2lzkl7pfhq30w.cloudfront.net host (e.g. by firewall)
help copr builders? (This is not only a copr problem, but still...)

Guys, this ain't funny at all:

https://download.copr.fedorainfracloud.org/results/mkyral/plasma-unstable/fedora-31-x86_64/01283680-bluedevil/builder-live.log.gz

the cloudfront mirrors are broken as hell; the broken state has lasted for at least a week now with no apparent change, and let me be clear: this is a disgrace, really.

Just get rid of them now and I will try to forget this ever happened.

Thanks.

Meh, I'm afraid we mixed two issues together; sorry I did not notice it before.

The problem @mkyral mentions
is not related to metalink= downloads, but to baseurl=..download.fedoraproject.org....
That was specified explicitly as an external repository, and I guess it is acceptable that
the particular mirror isn't in sync. Or what is the exact difference between
dl.fedoraproject.org vs. download.fedoraproject.org?

The original problem (in the initial comment) was related to metalink processing
though, for all fedora 32 builds, but I can not reproduce it now. It happened at least
here, which
was 6 days ago.

@praiskup

download.fedoraproject.org is a round robin of mirrors, with redirects to ones which are supposed to be valid for getting ISOs and other slow-changing items. It is built on the idea that you can always retry later for updates. [AKA eventual consistency.]

dl.fedoraproject.org is the set of 5 master rsync/http servers. When COPR was in the datacentre it usually got them as they were the closest mirrors. These are always considered up to date, with the only thing having newer packages being koji. The problem is that they are usually heavily loaded, because everyone realizes that. We therefore have to limit connections, because otherwise the mirrors can't rsync from them.

@mkyral I will forward your complaint to management to see if we can get better resources aimed at this problem. It is clear that the services we provide for the project are not up to your standards and you are clearly unhappy.

However, this issue happens just sometimes, not all the time - so it is workaroundable with a repeat-until-pass approach (all my builds that failed due to this problem succeeded on the second iteration).

@mkyral I will forward your complaint to management to see if we can get better resources aimed at this problem. It is clear that the services we provide for the project are not up to your standards and you are clearly unhappy.

Thank you

@mkyral

The central problem is that most of the update and distribution system was designed 14+ years ago for the use case of a home user updating their software. In that case the user's system will eventually get updated to a safe state with updates.

Sites needing to do development were always encouraged to have local mirrors, as they could do the freeze control of 'we want these packages but not those'. That is different from Continuous Integration/Continuous Delivery, where you want a 'correct' window of 'HEAD' to build from at all times.

That the mirror system works most of the time for CI/CD has been a bit of luck, building off a lot of hard engineering work done long ago to make the system robust. However, in the end it is like one of those old Roman bridges being used as part of a modern highway system. What it was designed for, how it works, etc. still cause the 8-lane highway to drop down to a 2-lane road to cross this old river.

@smooge, I don't want us to forget about the original problem, i.e. the CDN providing
an out-dated (cached) repomd.xml file. That will most probably bite our default Copr
configs again, right? Or was this resolved?

We want to fix the issues here, but if folks could please be patient that would be nice.

Some points:

  • You should NOT USE download.fedoraproject.org for a baseurl/repo. In fact, we recently worked on changing the fedora repo files to no longer mention it. As smooge says, it's a redirect; the problem is that it just redirects you to a mirror that is active, with no way of knowing whether that specific mirror has that specific content. Additionally, baseurl provides a lot less protection than metalink.

  • We need to adjust our sync to S3. Right now it's running at times that aren't good, so it has old content for a long while. Additionally, we could, as @adrian suggested, do a sync run and then a delete run after. We do invalidate after we sync, so cloudfront caching the repomd.xml is not the issue; it's just that it's syncing at non-ideal times.

On Wednesday, March 4, 2020 10:22:06 PM CET Kevin Fenzi wrote:

We need to adjust our sync to s3. Right now it's running at times that aren't
good, so it has old content for a long while.

I still fail to see how this relates to repo consistency, though.

If a) repomd.xml is old, b) the rest of the metadata is old, and c) the old RPMs are
still provided, I bet that everything will work fine... only with a slight delay,
against old data.

If we a) rsync repomd.xml, and b) invalidate the cache for repomd.xml, we can IMO
afford to live for a while with the rest of the metadata being old, and with old
RPMs. That's because DNF will fetch an up-to-date repomd.xml, and for the rest of
the things it will pick a different, up-to-date mirror. Yes, the consequence is that
non-metalink/baseurl use-cases will be broken till the rsync finishes, but that is
still OK for Copr.

Additionally, we could as @adrian suggested do a sync run then a delete run
after.

Yes, that would solve the problem even for the non-metalink/baseurl cases.
But we will have to (a sketch follows below):

0. never cache repomd.xml
1. do `rsync ...`, without repomd.xml!
2. update the repomd.xml
3. do `rsync --delete ...`

If we rsync without excluding repomd.xml, i.e. without step 0 and step 2, I'm
afraid there's still a large window when similar problems can happen.
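A minimal sketch of that ordering, assuming a plain rsync-based mirror (the module and paths are illustrative):

SRC=rsync://dl.fedoraproject.org/fedora-enchilada/linux/
DST=/srv/mirror/fedora/linux/

# step 1: sync everything except repomd.xml, deleting nothing yet
rsync -a --exclude='repomd.xml' "$SRC" "$DST"

# step 2: publish the new repomd.xml only now, so it never references
# files that have not landed yet
rsync -a --include='*/' --include='repomd.xml' --exclude='*' "$SRC" "$DST"

# step 3: finally drop the files only the old repomd.xml referenced
rsync -a --delete "$SRC" "$DST"

(Step 0, never caching repomd.xml, is an HTTP/CDN setting rather than an rsync step.)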

We do invalidate after we sync, so cloudfront caching the repomd.xml is not
the issue; it's just that it's syncing at non-ideal times.

In the initial comment, I cited a state where we clearly got an old
repomd.xml (either because of the cache, or because rsync wrongly ordered
the transaction and delayed the repomd.xml transfer), while the rest of the
repository was inconsistent with it.

I think it actually is an issue... btw., why not temporarily remove
the cloudfront URL from mirrormanager?

OK, looking through ansible, we don't use rsync, as we are pushing into S3 buckets. We are instead using aws sync, which seems to just write a new set of buckets and then delete the old content afterwards. If I am reading the work correctly, the mirroring is done via the scripts in:

https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles/s3-mirror/files

The cron times for this being pushed out are set in
https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles/s3-mirror/tasks/main.yml

That said.. I could also be looking at the wrong set of scripts.
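For reference, such an aws-cli based push has roughly this shape (the bucket name is made up; this is not the actual s3-mirror script):

# mirror the local tree into the bucket; --delete removes S3 objects
# that no longer exist locally
aws s3 sync /srv/pub/fedora/linux/ \
    s3://example-fedora-mirror/pub/fedora/linux/ --delete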

FTR, it happened on F32 again a few minutes ago:

2020-03-05T14:45:08Z INFO Downloading: https://mirrors.fedoraproject.org/metalink?repo=fedora-32&arch=x86_64
2020-03-05T14:45:09Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/repomd.xml
2020-03-05T14:45:09Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/bf9e4cd165075bec38ec3fd1b70abd78a9d937c9162afd2ae991e0b4d3f45d9b-primary.xml.zck
2020-03-05T14:45:09Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/99d8ba37ff6ff99364527c3e49ab60678c6e731fc3ac2c521b6d9a3025f0b219-filelists.xml.zck
2020-03-05T14:45:09Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/3a205ba9f270bf930817b0f344716aa53bb94ce2bfd9b102e84a25d0bf94b026-comps-Everything.x86_64.xml
2020-03-05T14:45:09Z INFO Error during transfer: Status code: 404 for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/bf9e4cd165075bec38ec3fd1b70abd78a9d937c9162afd2ae991e0b4d3f45d9b-primary.xml.zck (IP: 52.85.145.55)
2020-03-05T14:45:09Z INFO Downloading: https://mirrors.lug.mtu.edu/fedora/linux/development/32/Everything/x86_64/os/repodata/bf9e4cd165075bec38ec3fd1b70abd78a9d937c9162afd2ae991e0b4d3f45d9b-primary.xml.zck
2020-03-05T14:45:09Z INFO Error during transfer: Status code: 404 for https://mirrors.lug.mtu.edu/fedora/linux/development/32/Everything/x86_64/os/repodata/bf9e4cd165075bec38ec3fd1b70abd78a9d937c9162afd2ae991e0b4d3f45d9b-primary.xml.zck (IP: 141.219.188.21)

ok, looking more closely I think I see the actual problem. ;) But let me clean up the scripts and send another freeze break.

You are right, it's not the sync times.

If this doesn't fix things, we can disable and regroup...

Since we are still poking at the patch, I have disabled the mirror for now.

So, things might be slower, but possibly more stable.

ok. I am re-enabling the mirror now that we have completely revamped the sync (see the list for patches).

Please re-open or let us know if you see any more problems with it.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

4 years ago

see the list for patches

What list do you mean?

Otherwise the current problem seems to be:
- the repomd.xml was outdated in the CDN
- so the rest of the old repodata was downloaded and parsed by dnf
- and the old repodata pointed to deleted RPMs (dts-6 seems to be removed now).

This looks like the RPMs were 'rsync --delete'd sooner than the repomd.xml was updated.

All errors listed in the last email link are for CentOS, not related to Fedora's syncing of the files into S3.

Yeah, looks like a centos issue. ;(

I pinged some centos folks, but might be good to talk to them directly on centos-devel or via a bug on bugs.centos.org?

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

Ah, it's because that is EOL and was removed:

https://lists.centos.org/pipermail/centos-devel/2020-March/036649.html

So, the user should no longer use that I'm afraid since it's EOL and gone.

I pinged some centos folks, but might be good to talk to them directly on centos-devel or via a bug on bugs.centos.org?

Ahm, ok, for centos issues I'll report separate issues next time. Let's see whether this
occurs again. Thanks.

So, the user should no longer use that I'm afraid since it's EOL and gone.

I don't think the user wanted to download this; the dts stuff used to be kind of an implicit thing...
but the repository was broken... otherwise the metadata wouldn't point to non-existing
RPMs - so dnf couldn't even try to download them.

Today there is another failure, but I don't have the librepo logs:
https://download.copr.fedorainfracloud.org/results/@mock/mock-pull-requests/srpm-builds/01314966/builder-live.log.gz

Errors during downloading metadata for repository 'fedora':
  - Status code: 404 for http://mirror.its.dal.ca/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/6eec7a9c19e6d1fb05233909549b9f2fb5b9a1162eab637be7d6d321791ddd67-primary.xml.zck (IP: 192.75.96.254)
  - Status code: 404 for http://fedora.mirror.constant.com/fedora/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 108.61.5.84)
  - Status code: 404 for https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/development/32/Everything/x86_64/os/repodata/6eec7a9c19e6d1fb05233909549b9f2fb5b9a1162eab637be7d6d321791ddd67-primary.xml.zck (IP: 147.75.69.165)
  - Status code: 404 for https://mirror.dst.ca/fedora/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 208.89.87.36)
  - Status code: 404 for https://mirror.dst.ca/fedora/development/32/Everything/x86_64/os/repodata/6eec7a9c19e6d1fb05233909549b9f2fb5b9a1162eab637be7d6d321791ddd67-primary.xml.zck (IP: 208.89.87.36)
  - Status code: 404 for http://mirror.web-ster.com/fedora/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 65.182.224.39)
  - Status code: 404 for http://mirror.mrjester.net/fedora/linux/development/32/Everything/x86_64/os/repodata/6eec7a9c19e6d1fb05233909549b9f2fb5b9a1162eab637be7d6d321791ddd67-primary.xml.zck (IP: 50.224.157.174)
  - Status code: 404 for https://ewr.edge.kernel.org/fedora-buffet/fedora/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 147.75.197.195)
  - Status code: 404 for https://mirrors.xmission.com/fedora/linux/development/32/Everything/x86_64/os/repodata/6eec7a9c19e6d1fb05233909549b9f2fb5b9a1162eab637be7d6d321791ddd67-primary.xml.zck (IP: 198.60.22.13)
  - Status code: 404 for https://mirror.us-midwest-1.nexcess.net/fedora/development/32/Everything/x86_64/os/repodata/6eec7a9c19e6d1fb05233909549b9f2fb5b9a1162eab637be7d6d321791ddd67-primary.xml.zck (IP: 208.69.120.125)
  - Status code: 404 for https://pubmirror2.math.uh.edu/fedora-buffet/fedora/linux/development/32/Everything/x86_64/os/repodata/6eec7a9c19e6d1fb05233909549b9f2fb5b9a1162eab637be7d6d321791ddd67-primary.xml.zck (IP: 129.7.128.190)
  - Status code: 404 for https://dl.fedoraproject.org/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 209.132.181.24)
  - Status code: 404 for http://mirror.siena.edu/fedora/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 199.223.246.113)
  - Status code: 404 for https://mirrors.lug.mtu.edu/fedora/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 141.219.188.21)
  - Status code: 404 for https://mirror.atl.genesisadaptive.com/fedora/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 206.198.182.70)
  - Status code: 404 for http://fedora.mirror.iweb.com/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 192.175.120.170)
  - Status code: 404 for http://mirror.siena.edu/fedora/linux/development/32/Everything/x86_64/os/repodata/6eec7a9c19e6d1fb05233909549b9f2fb5b9a1162eab637be7d6d321791ddd67-primary.xml.zck (IP: 199.223.246.113)
  - Status code: 404 for http://fedora.mirrors.tds.net/fedora/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 204.246.0.137)
  - Status code: 404 for http://mirror.mrjester.net/fedora/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 50.224.157.174)
  - Status code: 404 for http://mirror.metrocast.net/fedora/linux/development/32/Everything/x86_64/os/repodata/6eec7a9c19e6d1fb05233909549b9f2fb5b9a1162eab637be7d6d321791ddd67-primary.xml.zck (IP: 65.175.128.102)
  - Status code: 404 for https://mirrors.rit.edu/fedora/fedora/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 129.21.171.72)
  - Status code: 404 for http://mirrors.rit.edu/fedora/fedora/linux/development/32/Everything/x86_64/os/repodata/7c5ecbb890f3129753c26ea33ade940d90ad3ce7c54edd192d0069375d2e3b32-filelists.xml.zck (IP: 129.21.171.72)

This looks hard to solve. Not sure if it is related to S3 or cloudfront.

Looking at the log from the failed copr build, the only explanation I can come up with is that an out-of-date repomd.xml was downloaded, which points to repodata files which no mirror has any more.

It would be interesting to know from which mirror the repomd.xml was downloaded and if it was downloaded from cloudfront.

In the end this error is not solvable with the current mirror setup. It would need a change in DNF so that, if the downloaded repomd.xml does not point to any existing files, DNF tries to download repomd.xml from another mirror.

As DNF came to this point, it seems that it downloaded a repomd.xml file which was one of the files mentioned in the metalink. The metalink can return up to three repomd.xml file checksums, so it seems we somehow were able to download a valid repomd.xml file, maybe the oldest version mentioned in the metalink. Why the mirror with this repomd.xml file did not have the corresponding other files in repodata could be related to it being in the middle of a sync.

I think it has been requested and mentioned a couple of times over the last years that, if there is a failure like this, DNF should try to download another repomd.xml version. Maybe in this case, where the files referenced in the repomd.xml did not exist anywhere, DNF should exclude any repomd.xml with the same checksum and try to fetch a newer version with another checksum.

As this is complicated I hope I was able to explain it, but I would say this requires a change on the DNF side to be more tolerant of failures concerning the download of different repomd.xml files, and to re-download another version in cases like this.

This looks hard to solve. Not sure if it is related to S3 or cloudfront.

I'm pretty sure it is.

I have a new build failure, with the librepo log:

2020-03-22T17:15:27Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/x86_64/repodata/349b0d7ad77f3b95694848900fa2ea2d104b610149e7dd1775fc03a18751111a-primary.xml.zck
2020-03-22T17:15:27Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/x86_64/repodata/628b49b994d75144bd67058f282da25ec6d9edbc8554b3bf02b8eadaafed2512-filelists.xml.zck
2020-03-22T17:15:27Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/x86_64/repodata/500927b05cd03772c464930fbabdaf9b4ac915c31df65a1d1d87be1b1d54215f-comps-Everything.x86_64.xml.gz
2020-03-22T17:15:27Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/x86_64/repodata/d6c95fa1634c726ca62dd7760e38b19b1863dd182582ac3a82e145b6f835b33e-prestodelta.xml.zck
2020-03-22T17:15:27Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/x86_64/repodata/fbad2fd698a5b2615f31b67aaaf1d2cf466b2758e63fd07dcc96c952b86503bf-updateinfo.xml.zck
2020-03-22T17:15:27Z INFO Error during transfer: Status code: 404 for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/x86_64/repodata/628b49b994d75144bd67058f282da25ec6d9edbc8554b3bf02b8eadaafed2512-filelists.xml.zck (IP: 99.84.106.151)
2020-03-22T17:15:27Z INFO Downloading: https://download-ib01.fedoraproject.org/pub/fedora/linux/updates/30/Everything/x86_64/repodata/628b49b994d75144bd67058f282da25ec6d9edbc8554b3bf02b8eadaafed2512-filelists.xml.zck
2020-03-22T17:15:27Z INFO Error during transfer: Status code: 404 for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/x86_64/repodata/349b0d7ad77f3b95694848900fa2ea2d104b610149e7dd1775fc03a18751111a-primary.xml.zck (IP: 99.84.106.151)
...

That seems to be caused by an old repomd.xml.

In the end this error is not solvable with the current mirror setup. It would
need a change in DNF so that, if the downloaded repomd.xml does not point
to any existing files, DNF tries to download repomd.xml from another
mirror.

That is a very, very good idea! Do you want to file an RFE for this?

This looks hard to solve. Not sure if it is related to S3 or cloudfront.

I'm pretty sure it is.

Well, I mean -> copr builders are in AWS, so the cloudfront mirror is likely to
be used for downloading repomd.xml.

Metadata Update from @praiskup:
- Issue status updated to: Open (was: Closed)

4 years ago

I updated the fix proposal thread with two more patches that should help.

If that doesn't help, I'm afraid I'm clueless :-(.

What is also weird is that the s3 mirror is - when problems occur -
apparently the only de-synced mirror. Perhaps we indeed should find a better
rsync time.

This causes a high failure ratio; can the CDN be disabled for now, please?
Another build.

Another two people complained on irc now; makes me wonder, how quickly
does the cache invalidation work in s3? I mean, in copr we never cache
repomd.xml files at all. Perhaps we should consider the same on the s3
mirror.

Or, instead of disabling the caches entirely - we could accept the 3rd patch
in the proposal (a 10-minute delay before removal) - and then we could set a
cache expiration header in the http server (say, 5 minutes).

The s3 sync script could then stop doing explicit cache invalidation through the S3 API.

I disabled the cloudfront server in MirrorManager.

Disabling needs some time. At least 15 more minutes until the changes are synced out everywhere.

I see. Seems to be working fine now. Thanks for looking into this.

@psss and others: this system is built on a zero budget without any dedicated engineer. If this doesn't meet your needs, please work with your management to get a budget. Everything from dnf to mirrorlists is primarily built to get systems updated, not for continual build systems. It is also built with no dedicated developers or engineers working on it, but 'when we have time'. I thank @adrian for stepping up and putting in a lot of thankless work on this over the last 3 years while dealing with his day job.

For quicker results, I think the build systems need to be cognizant of the limitations of mirrormanager and either work around them by working out ways to 'correct' errors, or do what I always had to do: set up a private local cache.

I appreciate the work done as well, thank you @adrian!

@smooge the thing is that this affects a lot of users, not only the copr build
system (through copr we only made the problem visible - which is a good
thing - try googling the 'd2lzkl7pfhq30w 404' term).

I think we are pretty close to making the s3 mirror rock solid; I just didn't know
that there's no "http" server behind it - that it is hosted in an AWS bucket. So I think
that we should use aws s3 sync --cache-control max-age=300 for the repomd.xml
files, and wait at least that many seconds before we start deleting the files
(see the sketch at the end of this comment).

WDYT?

Starting a local mirror for copr purposes would simply be too much work, and
we already made mock in copr really pedantic and resilient (we retry several
times, etc...) ... the thing is that we can not detect that the s3 mirror is giving us
an old repomd.xml file :-( or could we? Yes, we can put it on a blacklist somehow
(firewall), but that goes against the mirror's purpose... so it is really more optimal
for the Copr team to help you with fixing the mirror.... Could I go and apply the
patches to the ansible git, btw.?
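A minimal sketch of the max-age proposal above (the bucket and paths are made up, not the real s3-mirror configuration):

# 1. push new content, deleting nothing yet, and keep repomd.xml out
aws s3 sync --exclude '*/repodata/repomd.xml' \
    /srv/pub/fedora/ s3://example-fedora-mirror/pub/fedora/

# 2. push repomd.xml with a short TTL so caches re-check it quickly
aws s3 sync --exclude '*' --include '*/repodata/repomd.xml' \
    --cache-control 'max-age=300' \
    /srv/pub/fedora/ s3://example-fedora-mirror/pub/fedora/

# 3. wait at least the TTL before deleting anything the old
#    (possibly still cached) repomd.xml may reference
sleep 300
aws s3 sync --delete /srv/pub/fedora/ s3://example-fedora-mirror/pub/fedora/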

@praiskup I know the issue is affecting many people, but I feel like the Amazon problem is just the part of the iceberg we see above the water. CentOS and other groups with CI/CD have been complaining about problems with out-of-date repomd and other files for 6+ years, so the problem goes beyond Amazon. I worked through a large number of problems with previous COPR engineers, where we ended up having copr just tie itself to one download server in Phoenix so it could stop having the problems.

In the end, the entire system is built around the idea that you can always try again in an hour or so. We can shave the top off the iceberg but there is a whole bunch underneath which you are going to run into over and over again.

After applying the patches, it doesn't seem to do what one would expect; compare
the Fedora cloudfront and Copr cloudfront responses:

$ HEAD https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/repomd.xml  | grep -i cache
X-Cache: Hit from cloudfront

$ HEAD https://download.copr-dev.fedorainfracloud.org/results/praiskup/ping/fedora-rawhide-x86_64/repodata/repomd.xml | grep -i cache
Cache-Control: no-cache
X-Cache: RefreshHit from cloudfront

What I'd expect there is a Cache-Control: max-age=300 header. Before I
start to debug what is going on there... a bit of brainstorming.

How big is the chance that the s3 mirror gets synced faster than any
other (close) mirror? Because now one of the problems is that the s3
mirror is always the last one.

What if we:
- replaced the repomd.xml files first, to invalidate the mirror content as soon as possible
- then added the new metadata and RPM files (the long sync run - how long does it actually take?)
- perhaps didn't invalidate the repomd.xml caches
- synced with --delete

What is likely to happen very often in such a config is that an "up2date"
repomd.xml file is downloaded, but the mirror doesn't provide the updated
metadata and RPMs... so dnf/yum will fall back to other mirrors. But that
shouldn't be a problem. Is anyone expected to use the mirror with
baseurl=?

Btw., @adrian, you mentioned that mirrormanager provides 3 alternatives
of the repomd.xml file in the metalink (that's correct) -- but I wasn't able to
find MM2 crawler code that would somehow reflect this. Is the crawler able to
check that a particular mirror provides metadata that is too old (say, a
repomd.xml older than the three current alternatives)?

After applying the patches, it doesn't seem to do what one would expect; compare
the Fedora cloudfront and Copr cloudfront responses:

I think you only changed s3.sh. There is also the script s3-sync-path.sh which syncs certain paths. At least the paths you have tested are also handled by the s3-sync-path.sh. You probably need to add the max-age parameter there. Not sure what the relation between those two scripts is.

Btw., @adrian, you mentioned that mirrormanager provides 3 alternatives
of the repomd.xml file in the metalink (that's correct) -- but I wasn't able to
find MM2 crawler code that would somehow reflect this. Is the crawler able to
check that a particular mirror provides metadata that is too old (say, a
repomd.xml older than the three current alternatives)?

The crawler only checks for the latest file, and if the mirror does not have that file for a certain repository, that part of the mirror is disabled. As we are crawling every 12 hours, it can take 12+ hours until we can tell the clients the state of our mirrors. That is why we include older checksums of repomd.xml: to avoid having all mirrors disabled once the master mirror has been updated. This means a user can hit a mirror with older content and everything still seems to work, even if this user is not getting the newest Fedora content.

There is also the script s3-sync-path.sh

Indeed, thanks for pointing that out; I was curious how the system could work with
only daily syncs... here is a proposal for the other script:
https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedoraproject.org/thread/TKBKSLMZFJTSZENP6WXNK2MI3WGV2UAZ/

As we are crawling every 12 hours, it can take 12+ hours until we can tell the clients the state of our mirrors.

Ok, so s3-sync-path.sh is responsible for the faster responses. I think we are back at what
@mohanboddu proposed above --> move the sync times to a proper time. Basically,
what we want to achieve is that, if the s3 repomd.xml file is a bit older (but still in sync
with the rest of the data), there still exists some other mirror at least as old as the s3 one,
so DNF can fall back to that mirror.

If this still doesn't help with the caches, I'll try to cut the script out into a separate
"sync-s3-rpm-metadata" project and try to debug it on some "testing mirror".

Ok, so s3-sync-path.sh is responsible for the faster responses. I think we are back at what
@mohanboddu proposed above --> move the sync times to a proper time.

For master mirror crawling we used to wait for a message on the message bus to know the correct time. We do not do this anymore. Today we no longer actually scan the master mirror but only read fullfiletimelist-*: every 30 minutes we check if the timestamp of fullfiletimelist-* has changed, and only if it has changed do we read fullfiletimelist-* to update the MirrorManager database. For the MirrorManager database state we rely 100% on the fullfiletimelist-* files and basically never do any I/O.

The script doing this is umdl-required. Maybe something similar can be used for the S3 syncing, to know if something has actually changed on the master mirror and when it changed.
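A rough sketch of that idea for the S3 job (the URL and state file are assumptions, and umdl-required itself works differently - it reads the file list, not just HTTP headers):

URL=https://dl.fedoraproject.org/pub/fedora/fullfiletimelist-fedora
STATE=/var/lib/s3-mirror/fullfiletimelist.last-modified

# only trigger a sync when the file list's timestamp changes
CUR=$(curl -sI "$URL" | tr -d '\r' | awk -F': ' 'tolower($1)=="last-modified" {print $2}')
if [ "$CUR" != "$(cat "$STATE" 2>/dev/null)" ]; then
    printf '%s\n' "$CUR" > "$STATE"
    # ... kick off the S3 sync here ...
fi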

After Pavel's latest patch has been applied, the answers from cloudfront contain the correct headers, and according to the headers the repomd.xml file is never older than 60 seconds.

Just as discussed on the mailing list, I have re-enabled the cloudfront mirror in MirrorManager, and soon AWS clients should be redirected to cloudfront again.

Closing this ticket once more and please let us know if there are failures. If necessary I can disable the cloudfront mirror in MirrorManager again.

Metadata Update from @adrian:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

Trying this in AWS:
$ curl https://d2lzkl7pfhq30w.cloudfront.net/pub/epel/6/x86_64/repodata/repomd.xml

I get a broken XML file - the closing > symbol of the closing </repomd> tag is missing, as is the trailing newline:

  </data>
</repomd

How can this happen?

How can this happen?

No idea, but the s3 sync script had been hanging since 2020-03-27 00:00. I stopped it and restarted it manually to hopefully fix this.

Well, another release of the plasma desktop, and the build in copr is again a carnage:

[MIRROR] gcc-c++-9.3.1-1.fc31.x86_64.rpm: Status code: 404 for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/31/Everything/x86_64/Packages/g/gcc-c++-9.3.1-1.fc31.x86_64.rpm (IP: 99.84.185.139)
(207/212): systemd-243.7-1.fc31.x86_64.rpm 38 MB/s | 3.8 MB 00:00
(208/212): systemd-pam-243.7-1.fc31.x86_64.rpm 22 MB/s | 166 kB 00:00
(209/212): systemd-rpm-macros-243.7-1.fc31.noar 8.3 MB/s | 22 kB 00:00
[MIRROR] xkeyboard-config-2.28-1.fc31.noarch.rpm: Curl error (23): Failed writing received data to disk/application for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/31/Everything/x86_64/Packages/x/xkeyboard-config-2.28-1.fc31.noarch.rpm [Failed writing body (0 != 15913)]
[FAILED] xkeyboard-config-2.28-1.fc31.noarch.rpm: Curl error (23): Failed writing received data to disk/application for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/31/Everything/x86_64/Packages/x/xkeyboard-config-2.28-1.fc31.noarch.rpm [Failed writing body (0 != 15913)]
(211-212/212): libst 91% [================== ] 971 kB/s | 138 MB 00:14 ETA
Error: Error downloading packages:
Curl error (23): Failed writing received data to disk/application for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/31/Everything/x86_64/Packages/x/xkeyboard-config-2.28-1.fc31.noarch.rpm [Failed writing body (0 != 15913)]

Could you, please, just blacklist cloudfront altogether? I mean it. It's proven beyond doubt that it is notoriously unstable, not just out of sync all the time but also unable to serve the content properly. Nothing but a source of problems. Just look at the conversation above. Don't waste your time fixing it. Don't waste my time. Just get rid of it. Thanks.

[Failed writing body (0 != 15913)]

This one seems to be ENOSPC, not related to repository problems. What you
are facing now is likely https://pagure.io/copr/copr/issue/1299, and it is a
high-priority issue; we are working on it.

Don't waste your time fixing it.

The S3 mirror should be fixed now.

This is a different problem... a bug in cloudfront. It means that any file with ++ in the name won't get downloaded from cloudfront, but it should be downloaded from the next mirror. So it might look odd, but it should work fine.

Not only ++; a single + is enough. Has this been reported somewhere?

I am not sure where and for whom we would document it. This is a built-in and documented limitation of S3, so there isn't anything we can do about it.

OK, again this morning :-(

2020-04-07T04:44:08Z INFO Downloading: https://mirrors.fedoraproject.org/metalink?repo=updates-released-f30&arch=aarch64
2020-04-07T04:44:08Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/aarch64/repodata/repomd.xml
2020-04-07T04:44:08Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/aarch64/repodata/fa7958d2c9ea1d09e987e79a55915c9ef79333df4bcb98261d9a10f12397e474-primary.xml.zck
2020-04-07T04:44:08Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/aarch64/repodata/2f0ee96dd70baf7454cb79f51f1509037f12a43ece1edb9a01bc645b9de5230a-filelists.xml.zck
2020-04-07T04:44:08Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/aarch64/repodata/6aef92335772a3efd1319e754a151875f92dd8b4fea9dedd67eb42d20afcfed7-comps-Everything.aarch64.xml.gz
2020-04-07T04:44:08Z INFO Downloading: https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/aarch64/repodata/efab835db9fcf7f891218d562fbe827c3c1c93a323ad4d21c4e4ea16caf5a7f6-prestodelta.xml.zck
2020-04-07T04:44:09Z INFO Error during transfer: Status code: 404 for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/30/Everything/aarch64/repodata/2f0ee96dd70baf7454cb79f51f1509037f12a43ece1edb9a01bc645b9de5230a-filelists.xml.zck (IP: 99.84.185.199)
2020-04-07T04:44:09Z INFO Downloading: http://download-ib01.fedoraproject.org/pub/fedora/linux/updates/30/Everything/aarch64/repodata/2f0ee96dd70baf7454cb79f51f1509037f12a43ece1edb9a01bc645b9de5230a-filelists.xml.zck

I checked manually, and repomd.xml still references the 2f0ee96dd70baf7454cb79f5* file
now - but the file still doesn't exist.

The cloudfront repos are utterly broken and should not be used at all.

Can you disable cloudfront for updates? I think it can be safely enabled for GA fedora and for updates-testing (not used for builds). It is actually only updates which causes the problems.

I.e., if we create a new behaviour for the path /pub/fedora/linux/updates/* with
object caching = Customize, and set both the minimum TTL and maximum TTL to zero,
we disable caching for these URLs.
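In the distribution config that would be an extra cache behavior, roughly like this fragment (the origin ID is made up, and a real behavior needs more required fields than shown):

{
  "PathPattern": "/pub/fedora/linux/updates/*",
  "TargetOriginId": "example-origin",
  "ViewerProtocolPolicy": "allow-all",
  "MinTTL": 0,
  "DefaultTTL": 0,
  "MaxTTL": 0
}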

Metadata Update from @smooge:
- Issue status updated to: Open (was: Closed)

4 years ago

Reopening the ticket since it was closed and might not have been looked at by people. @msuchy, when you say "Can you disable the cloudfront for updates": do you mean the copr systems, so they use something else, or do you mean the Fedora Project admins should do something else?

There are 2 problems tied into this.
1. Cloudfront stores data in S3 buckets, which treat + in a filename as a URL character, which breaks modules and g++.
2. There seems to be some other cloudfront problem with repodata, which I am not sure exactly what is going on with, since we have changed the script to eliminate various problems.

The ++ thing is annoying, but it should only cause dnf to go to the next mirror for that one file. It should in no way cause failures.

For the issue in comment 13, we need more info. Was it only on aarch64? One build? More?

Was the build actually aborted for the last error concerning aarch64? Or was it just a message in the logs?

It looks like 2f0ee96dd70baf7454cb79f51f1509037f12a43ece1edb9a01bc645b9de5230a-filelists.xml.zck was not downloaded from cloudfront but it was downloaded from another mirror. So no failure, right?

We know that ++ does not work, but DNF can handle it gracefully.

Can you disable the cloudfront for updates?

That is not possible with the current setup. The cloudfront mirror is marked as always up to date, as we cannot crawl it, and a mirror which is marked as 'always up to date' has to have the complete content. Cloudfront is, for example, missing rawhide completely, and DNF seems to handle this gracefully as well.

That is not possible with the current setup. The cloudfront mirror is marked as always up to date, as we cannot crawl it, and a mirror that is marked as 'always up to date' has to have the complete content.

Yes. But in CloudFront (itself) settings you can configure that some paths (objects) will not be cached.

So, consider:

  • s3.sh runs (this is the 'everything' sync at 00:00); it syncs stuff.
  • an updates push goes out
  • s3.sh now updates the repodata, but it's the 'new' repodata, not the one that was there when it first started at 00:00, and none of those packages are in the sync.

So, how about this: we scrap the entire s3 sync business and set cloudfront to just cache the dl.fedoraproject.org repos directly. First hits might be slightly slower, but it should be much more in sync and less prone to issues.

Was the build actually aborted for the last error concerning aarch64? Or was it just a message in the logs?

The build was aborted, but I am not able to tell more :-( I lost the link; sorry I didn't
post it here.

It looks like 2f0ee96dd70baf7454cb79f51f1509037f12a43ece1edb9a01bc645b9de5230a-filelists.xml.zck was not downloaded from cloudfront but it was downloaded from another mirror. So no failure, right?

No, unfortunately I posted too short a log; there were several 404s in a row. I concentrated
on the fact that the major repo that is supposed to never be broken actually was broken.

So, consider:

  • s3.sh runs (this is the 'everything' sync at 00:00); it syncs stuff.
  • an updates push goes out
  • s3.sh now updates the repodata, but it's the 'new' repodata, not the one that was there when it first started at 00:00, and none of those packages are in the sync.

So, how about this: we scrap the entire s3 sync business and set cloudfront to just cache the dl.fedoraproject.org repos directly. First hits might be slightly slower, but it should be much more in sync and less prone to issues.

This sounds like a good idea, especially as the s3.sh sync script currently takes longer than one week to finish. Is there a way to tell cloudfront to cache repomd.xml for only 60 seconds, like we are doing now?

Another interesting thing I have just seen is that the same file (https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/repomd.xml) downloaded from Europe does not have the correct caching header. The file downloaded from the US has the caching header: Cache-Control: max-age=60

Just now, fedora 32 this time.
It failed several other builds (even one of mine, so I noticed).

I guess this is related to what I just mentioned that cloudfront in Europe returns different headers and repomd.xml files which are really old.

I disabled the cloudfront mirror again. Should be live in about 1 hour and 20 minutes.

So, how about this: we scrap the entire s3 sync business and set cloudfront to
just cache the dl.fedoraproject.org repos directly. First hits might be slightly
slower, but it should be much more in sync and less prone to issues.

Yes! Just don't forget to set the 'no-cache' header for all repomd.xml files on
dl.fedoraproject.org.

I guess this is related to what I just mentioned that cloudfront in Europe returns different headers and repomd.xml files which are really old.

This is definitely weird. Copr is in the US, and it had issues; I'm in Europe and I see
the max-age header. Never mind, I think that what @kevin proposes is a good
idea.

ok. I have set it up to use dl.fedoraproject.org and not cache *repomd.xml.

Please take a look at it and if it looks ok we can re-enable.

I can disable the s3 jobs.

It behaves differently from copr, ...

$ HEAD http://dl.fedoraproject.org/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/repomd.xml | grep -i cache
<nothing>
$ HEAD https://copr-be.cloud.fedoraproject.org/results/decathorpe/elementary-nightly/fedora-32-x86_64/repodata/repomd.xml | grep -i cache
Cache-Control: no-cache

The cloudfronts then:

$ HEAD https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/32/Everything/x86_64/os/repodata/repomd.xml | grep -i cache
X-Cache: Hit from cloudfront
$ HEAD https://download.copr.fedorainfracloud.org//results/decathorpe/elementary-nightly/fedora-32-x86_64/repodata/repomd.xml | grep -i cache
Cache-Control: no-cache
X-Cache: RefreshHit from cloudfront

Yes? I am not sure what you are wanting... that seems right/expected to me?

Are you sure the repomd.xml file isn't cached? I may be wrong, and I have no way to test it,
but X-Cache: Hit from cloudfront doesn't sound like the non-cached variant. I think that
the no-cache header provided by dl.f.o is needed...

Ah, I see what happened.

Try it now?

I do not think we want to add a no-cache header to dl. Lots of people hit it and I think it's fine for them to cache it.

I do think we need to make cloudfront not cache it... but I think that is fixed now. See if you agree?

X-Cache: Miss from cloudfront sounds OK to me, thanks.

ok. I re-enabled it again. I look forward to any further issues. :)

So far so good. Don't you want to blog-post about this?
"How to cache an RPM repository in AWS S3". People seem to do this wrong, at
least the CentOS guys.

With the new cloudfront setup we probably also could enable cloudfront for fedora-secondary/ and archive/ and maybe even alt/. Would that make sense?

So far so good. Don't you want to blog-post about this?
"How to cache an RPM repository in AWS S3". People seem to do this wrong, at
least the CentOS guys.

I'm not sure about their setup, but sure, a blog post could be nice... once we are sure it's actually all working. :)

With the new cloudfront setup we probably also could enable cloudfront for fedora-secondary/ and archive/ and maybe even alt/. Would that make sense?

Yep. It's caching everything, so all those could be added yes.

I did get one report on irc:

FYI [MIRROR] elfutils-libs-0.178-7.fc31_0.179-1.fc31.x86_64.drpm: Status code: 404 for https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/updates/31/Everything/x86_64/drpms/elfutils-libs-0.178-7.fc31_0.179-1.fc31.x86_64.drpm (IP: 99.86.227.83)

but that was after an updates push, so I could see someone grabbing the repomd.xml, starting to download, and then new repodata/packages appearing on dl. I would think dnf would do the right thing here and go on to the next mirror.

Okay, I will add the other MirrorManager categories to the entry.

The file does not exist on the master mirror, so cloudfront is correct in this case.

$ curl -s https://dl.fedoraproject.org/pub/fedora/linux/updates/31/Everything/x86_64/drpms/elfutils-libs-0.178-7.fc31_0.179-1.fc31.x86_64.drpm -I
HTTP/1.1 404 Not Found

So far so good. Don't you want to blog-post about this?
"How to cache an RPM repository in AWS S3". People seem to do this wrong, at
least the CentOS guys.

Looking at that, it's expected. devtoolset-6 scl is EOL and has been removed from mirrors.centos.org...

https://github.com/sclorg/sclo-ci-tests/blob/master/PackageLists/collections-list-rh-el7#L4

it no longer exists to mirror. It's not going to be available. The user needs to build against something else.

but that was after an updates push, so I could see someone grabbing the
repomd.xml, starting to download, and then new repodata/packages appearing on
dl. I would think dnf would do the right thing here and go on to the next
mirror.

Agreed. From the other side... can the repomd.xml file be "too new", and
available before the new referenced metadata and RPMs are available?

Looking at that, it's expected. devtoolset-6 scl is EOL and has been
removed from mirrors.centos.org...

Yes, but the devtoolset-6 packages were not downloaded intentionally for
the build purposes... there were some packaging bugs, from what I remember,
in rh-scl, so the packages there mistakenly replaced some other packages
from the base repos.

IIUC, the CentOS problem was that the cloudfront repo had an outdated (overly
cached) repomd.xml, pointing to outdated metadata files that still
referenced devtoolset-6 packages... so DNF started to download them.

Agreed. From the other side... can the repomd.xml file be "too new", and
available before the new referenced metadata and RPMs are available?

Nope; it should sync all the data first, and the repodata only at the end.

I guess I will try and close this one again... reopen if (but hopefully not when) it breaks. ;)

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago

Just for the record, I now realized that the new S3 mirroring mechanism doesn't
suffer from the + sign problem. :-) win-win
