Issue #10531: Copr builds agains the Koji latest repo fail sometimes - fedora-infrastructure

fedora-infrastructure

#10531 Copr builds agains the Koji latest repo fail sometimes

Closed: Fixed 2 years ago by kevin. Opened 2 years ago by praiskup.

I'm not sure how this can happen:

Errors during downloading metadata for repository 'http_kojipkgs_fedoraproject_org_repos_rawhide_latest_basearch':
  - Status code: 404 for https://kojipkgs.fedoraproject.org/repos/rawhide/latest/x86_64/repodata/d5852081fe17f07cdbeae5e4b6f11cae2e760f83c5c011769e10b9b2842c9e63-filelists.xml.gz (IP: 38.145.60.20)
  - Status code: 404 for https://kojipkgs.fedoraproject.org/repos/rawhide/latest/x86_64/repodata/4a9333dbad0f011c9b5ce60fee4bddc71c1558a320a910ed9753fb604be49c98-primary.xml.gz (IP: 38.145.60.20)
Error: Failed to download metadata for repo 'http_kojipkgs_fedoraproject_org_repos_rawhide_latest_basearch': Yum repo downloading error: Downloading error(s): repodata/4a9333dbad0f011c9b5ce60fee4bddc71c1558a320a910ed9753fb604be49c98-primary.xml.gz - Cannot download, all mirrors were already tried without success; repodata/d5852081fe17f07cdbeae5e4b6f11cae2e760f83c5c011769e10b9b2842c9e63-filelists.xml.gz - Cannot download, all mirrors were already tried without success
WARNING: Dnf command failed, retrying, attempt #2, sleeping 10s
...

Copr tries 3 times (about a minute) and then fails. Any ideas? It looks like
there's the 'repomd.xml' file for a longter time period which points to the
outdated metadata files that are already removed. But that would be weird...
perhaps caching?

kevin commented 2 years ago

Note that this repodata is regnerated all the time by kojira. Whenever anything changes in the buildroot.

If it gets the repomd.xml and then tries to get those other repodata files based on the repomd.xml, it might be the repodata has changed since then, and it needs to retry re-reading the repodata.xml.

Not sure how to handle this any better. ;(

The old repodata is kept in https://kojipkgs.fedoraproject.org/repos/f36-build/ ie, it moves the existing one, makes a new one and points 'lates' to it. Perhaps something could be done with that?

praiskup commented 2 years ago

If it gets the repomd.xml and then tries to get those other repodata files based on the repomd.xml, it might be the repodata has changed since then, and it needs to retry re-reading the repodata.xml.

But mock should do this, that's why this is weird. We fully restart the DNF install
process from scratch.

The way createrepo should work is that the repomd.xml file should be created
as the last one.., so whenever this file is changed - we can be sure that other
referenced metadata files are available as well (or at least there should be
a very short race when moving files from a temp dir).

Yet, we seem to get repomd.xml file for a longer time period.

The old repodata is kept in https://kojipkgs.fedoraproject.org/repos/f36-build/ ie, it moves the existing one, makes a new one and points 'lates' to it. Perhaps something could be done with that?

Can you please elaborate on the "Moves the existing one" part? Is that hardlinked?

Perhaps something could be done with that?

I believe the symlink s ln -sf action, and is done as the "last" action in the chain
of related actions (to minimize the race conditions). Therefore dunno... I would
bet that some caching goes against us... is there some? If so, is it fully disabled
for the repomd.xml URLs?

mizdebsk commented 2 years ago

Using latest symlink is prone to race condition - the symlink can be changed underneath while repo is being downloaded, which can lead not only to failures to load repo by DNF. But more importantly, builds for different architectures can be done against different repos (there may be a very significant delay between builds for different arches are ran by Copr) which can lead to subtle differences between the same package built for different arches.
To avoid this race condition you can first get ID of the latest repo for particular tag (with call like koji call getRepo f36-build) and then use that repo ID instead of latest.
Koschei and Koji itself always refer to repos by specifying repo ID, never by latest symlink.
They download Koji repos very frequently and I haven't seen any issue with repodata caching.

praiskup commented 2 years ago

But more importantly, builds for different architectures can be done against different repos (there may be a very significant delay between builds for different arches are ran by Copr) which can lead to subtle differences between the same package built for different arches.

Well, the symbolic link is though in the upper level, so it should be ARCH-agnostic?

I understand that the results might be different, all the metadata change between
architectures (when one arch is done later the other). But this is something that we
can tolerate in Copr.

To avoid this race condition you can first get ID of the latest repo for particular tag (with call like koji call getRepo f36-build) and then use that repo ID instead of latest.

I don't think we want to complicate the repo consumption in our logic. :-/ We don't
want to close this race.

The problem I describe now is that, for a non-trivial amount of time, we face the
inconsistency in the repodata (again, copr re-tries several times, and fails
repeatedly - while it immediately on the second attempt should get updated
repodate).

praiskup commented 2 years ago

I don't think we want to complicate the repo consumption in our logic.
:-/ We don't want to close this race.

I mean - we would probably want to fix this as well - but I'm not sure it
is worth it. Copr is close to Mock usage -- and mock is what users
usually do locally ... that is, they tolerate the --enablerepo=local
consequences.

OTOH the problem we want to solve seems to be much simpler, yet it isn't
obvious where is the problem.

Metadata Update from @mohanboddu:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: low-gain, low-trouble, ops

2 years ago

kevin commented 2 years ago

Well, I don't know whats happening here exactly, but I suspect:

dnf grabs latest/x86_64/repomd.xml file
koji updates to a new buildroot repo, so latest/x86_64/* is replaced with the new latest repo
dnf tries to grab the other metadata files, but they are now not there because the hash has changed.
error

So, I think we need dnf to also retry the repomd.xml in this case? Or check that it's not changed?

Otherwise using koji to see what the repo is, calling it by it's non latest version should work, but be more complex of course.

praiskup commented 2 years ago

So, I think we need dnf to also retry the repomd.xml in this case?

We already do this. We restart the whole DNF process from scratch. That's why
I don't get why this error can actually happen.

kevin commented 2 years ago

Huh, then I am puzzled how this is getting triggered.

Would it be possible to add some debugging? ie when this happens grab a index of that latest directory so we can see whats there? If it's just the hashes changed, or somehow it can't reach kojipkgs?

praiskup commented 2 years ago

Meh, I'm blind, but an example of such build is here:
https://download.copr.fedorainfracloud.org/results/%40python/python3.11/fedora-rawhide-x86_64/03329977-plplot/
https://download.copr.fedorainfracloud.org/results/%40python/python3.11/fedora-rawhide-x86_64/03329977-plplot/chroot_scan/var/lib/mock/fedora-rawhide-x86_64-1644133691.331359/root/var/log/
(we turn on debugging for all builds)

praiskup commented 2 years ago

It really looks like repomd.xml is repeatedly returned (because of caches?)
to the client... This isn't a problem for the <id> based URLs, but for the latest/
could be.