#8816 Corrupted package in Fedora 31 compose: perf-debuginfo-5.5.15-200.fc31.x86_64
Closed: Invalid 4 months ago by fweimer. Opened 4 months ago by fweimer.

Describe what you would like us to do:

perf-debuginfo-5.5.15-200.fc31.x86_64 seems broken, and installation fails:

Running transaction
  Preparing        :                                                       1/1 
  Installing       : perf-debuginfo-5.5.15-200.fc31.x86_64                 1/1 
Error unpacking rpm package perf-debuginfo-5.5.15-200.fc31.x86_64
  Verifying        : perf-debuginfo-5.5.15-200.fc31.x86_64                 1/1 

Petr Pisar thinks it's a Koji storage issue:

Downloading the package from Koji
https://kojipkgs.fedoraproject.org//packages/kernel-tools/5.5.15/200.fc31/x86_64/perf-debuginfo-5.5.15-200.fc31.x86_64.rpm hangs. It looks like a storage failure, and the compose process just did not verify that the download finished successfully before copying the incomplete package to the repository.

While I cannot reproduce the hang, the RPM payload is missing in Koji, too.

This seems to affect a previous version of the package as well (perf-debuginfo-5.5.9-200.fc31.x86_64), so it is probably more than just bad luck/cosmic rays.
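The missing verification step described above can be sketched as a toy check; the expected size and the file here are invented stand-ins, not the real repodata or package:

```shell
# Toy illustration (hypothetical size/path): compare the bytes actually
# received against the size the repo metadata claims, which is roughly the
# check the compose process apparently skipped.
expected=15276048                        # size claimed by repodata (assumed)
printf 'partial' > /tmp/download.rpm     # stand-in for an incomplete download
actual=$(wc -c < /tmp/download.rpm)
if [ "$actual" -eq "$expected" ]; then
    echo "download complete"
else
    echo "truncated: $actual of $expected bytes"
fi
```

A real compose would compare against the checksum from repodata rather than the size alone, but a size mismatch is the cheapest first tell of a truncated copy.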

When do you need this to be done by? (YYYY/MM/DD)

As soon as possible? (Sorry.)

The backend storage doesn't show any errors, and I was able to run rpm checks on the files in the main repo without any RPM errors. I did find a mid-level host having problems talking to the NFS backend, but a reboot seems to have cleared that up.

Could I get the command line you used to do the installation in the original ticket? That way I can see if the file is 'corrupt' somewhere outside of where I am looking.

Metadata Update from @smooge:
- Issue priority set to: Waiting on Reporter (was: Needs Review)

4 months ago

@smooge, basically:
rpm -Uvh --root /srv/test/ --nodeps /tmp/perf-debuginfo-5.5.15-200.fc31.x86_64.rpm

But as I pointed out on the mailing list, this means that whatever happened here happened during rpmbuild already, as the payload digests check out OK.

In that case the problem would have occurred inside the mock chroot. I am looking through the logs at https://kojipkgs.fedoraproject.org/packages/kernel-tools/5.5.15/200.fc31/data/logs/x86_64/ and not seeing any 'errors' in the system. However, a yum -y update was done on the server 2 hours before this build occurred, with 47 packages updated, one of them being mock.

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Waiting on Reporter)
- Issue tagged with: high-trouble, koji, medium-gain, packager_workflow_blocker

4 months ago

So, we are not sure what we can do here.

Is this a one off hardware issue?

Is this a bug in rpm or some tool?

Can we tell from the bad rpm what we should be looking for here?

I'm leaning towards an rpmbuild bug currently. I have not yet been able to try a local rebuild, sorry. 8-( I suspect it would reproduce there, too.

OK. If someone could test and duplicate it, we should file an rpm bug and close this ticket in favor of that one.

If you need more info from us, please let us know.

From a local build:

1:perf-debuginfo-5.5.15-200.fc31 ################################# [100%]
error: unpacking of archive failed: cpio: Bad magic
error: perf-debuginfo-5.5.15-200.fc31.x86_64: install failed

@jforbes Thanks for doing that. My local build did not show this problem for some reason.

What's the version of xz in your buildroot? There's been a recent Fedora 31 update of xz.

Since you can reproduce this and I can't, would you please debug this further?

No, xz doesn't seem to be the culprit. Or at least upgrading to 5.2.5-1.fc31 does not introduce the failure for me. And the faulty Koji build used 5.2.4-6.fc31, too. So much for that theory.

What's your glibc version? That's a difference I see between the currently working scratch build and the broken earlier production build (glibc-2.30-11.fc31 vs glibc-2.30-10.fc31).

xz-5.2.5-1.fc31.x86_64 is what I had installed. Just checked on a different system with xz-5.2.4-6.fc31.x86_64 and it built fine. Updating xz on that system still resulted in a successful build.
gcc and glibc were also each updated, with a successful build after each. The system I am testing on hasn't been updated in a while, so I am going through the relevant packages one by one.

Well, that was a bust. I updated every relevant package and could not reproduce. Did a full update and rebooted, still could not reproduce. So I have one updated system where it happens, and another where it does not. Both are fairly modern hardware; the one which does reproduce is an AMD Ryzen 7 2700X system. The one which does not is an Intel Core i7-10510U.

xz is innocent: F31 uses zstd for compression, which is a possible suspect, of course.

@jforbes, can you make the two just-built packages (one working, one broken) available someplace for inspection? Since you have a system where it's reproducible, does it still produce a broken archive if you build with xz compression instead (%define _binary_payload w3.xzdio in the spec, or --define on the rpmbuild command line)?

FWIW, I wasn't able to reproduce the corruption locally either, tried both in mock and locally-locally. Extracting the raw payload from the bad binary reveals that it's the zstd frame header that is somehow damaged:

[pmatilai🎩lumikko ~]$ file /tmp/bad.cpio /tmp/good.cpio
/tmp/bad.cpio: data
/tmp/good.cpio: Zstandard compressed data (v0.8+), Dictionary ID: None
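The damaged-header diagnosis above can be checked by hand: a valid zstd frame begins with the four magic bytes 28 b5 2f fd, which is what `file` keys on. A minimal sketch, using freshly created stand-in files rather than the real payloads:

```shell
# Create a stand-in "good" payload (correct zstd magic, written as octal
# escapes for portability) and a stand-in "bad" one with garbage up front.
printf '\050\265\057\375' > /tmp/fake_good.bin
printf '\000\021\042\063' > /tmp/fake_bad.bin

check_zstd_magic() {
    # Compare the first four bytes of the file against the zstd frame magic.
    [ "$(od -An -tx1 -N4 "$1" | tr -d ' \n')" = "28b52ffd" ]
}

check_zstd_magic /tmp/fake_good.bin && echo "zstd magic OK"
check_zstd_magic /tmp/fake_bad.bin  || echo "not a zstd frame"
```

This is essentially the same test `zstd -d` performs before giving up with "unsupported format", as shown below.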

Similarly zstd complains:

[pmatilai🎩lumikko ~]$ zstd -dvv < /tmp/bad.cpio > /tmp/out
zstd command line interface 64-bits v1.4.4, by Yann Collet
Using stdin for input
Using stdout for output
Sparse File Support is automatically disabled on stdout ; try --sparse
zstd: /stdin\: unsupported format
[pmatilai🎩lumikko ~]$ zstd -dvv < /tmp/good.cpio > /tmp/out
zstd command line interface 64-bits v1.4.4, by Yann Collet
Using stdin for input
Using stdout for output
Sparse File Support is automatically disabled on stdout ; try --sparse
/stdin\ : 15276048 bytes

Garbage in, garbage out, but the question remains: who put the garbage in? All parties seem equally unlikely, as this is the bread-and-butter sort of thing that rpm and zstd do all day long.

It may depend on processor count. I can reproduce it on a fresh Fedora 31 installation with:

ssh -t root@$1 "dnf install -y fedpkg mock"
ssh -t root@$1 "fedpkg clone -a kernel-tools"
ssh -t root@$1 "cd kernel-tools && git checkout f31 && fedpkg srpm"
ssh -t root@$1 "mock -r fedora-31-x86_64 kernel-tools/*.src.rpm"
ssh -t root@$1 "rpm2cpio /var/lib/mock/fedora-31-x86_64/result/perf-debuginfo*.rpm | wc -c"

on some larger machines: a single EPYC 7281 CPU (16 cores, 32 threads) and a single Xeon D-2183IT CPU (likewise). On a two-socket Xeon E5-2640 v2 machine (2x 8 cores, 16 threads), it reproduced twice in three runs.

Panu, is there a way to run just the rpmbuild stage that builds the RPMs? Or if you want to look, I can give you access to these Beaker machines.

Right, it does seem related to thread count; I can reproduce it with some reliability if I bump %_smp_build_nthreads to 16. So it looks like it's an rpm bug all right. For some reason, perf-debuginfo gets written twice, and of course if two threads are writing to the same file, it's like tossing a coin.
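The coin-toss can be illustrated with a toy that has nothing to do with rpm itself: two background writers sharing one output file. Here the writers append, so the total size is deterministic, but the interleaving of A and B blocks depends on scheduling, just as the interleaving of the two payload writers produced garbage:

```shell
# Toy race (not rpm code): two concurrent writers appending to one file.
# Each writer contributes 12 bytes, so the file always ends up 24 bytes,
# but the order in which AAAA and BBBB blocks land is nondeterministic.
out=$(mktemp)
( for i in 1 2 3; do printf 'AAAA' >> "$out"; done ) &
( for i in 1 2 3; do printf 'BBBB' >> "$out"; done ) &
wait
wc -c < "$out"
```

In the rpm case the effect is worse than shuffled blocks: the interleaved output is no longer a valid zstd frame at all, which is why the header ends up damaged.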

It's strange though, I don't see anything like that in other packages that I quickly tested. The relevant code in rpm hasn't changed in the better part of a year, and this kind of corruption would have made itself known if it were widespread. So I guess it's somehow specific to that spec or something it does; I'll need to investigate.

@fweimer, you can run 'rpmbuild -bi <spec>' followed by 'rpmbuild -bb --short-circuit <spec>', where the latter can in principle be repeated as many times as you wish. But kernel-tools.spec's %install modifies the buildroot, so that doesn't work right (you need to use -bi --short-circuit to fix the "damage" first before rerunning -bb --short-circuit).

Anyway, I think we can conclude this is not an infra issue but an rpm bug, so this ticket can be closed.

Okay, kernel-tools.spec actually specifies a perf-debuginfo package manually (due to originating from the kernel spec, which does this for its debug foo) while also having autogenerated debuginfo packages enabled, which causes the mix-up. Rpm will need to catch that condition, of course, but this is a very special case.

Metadata Update from @fweimer:
- Issue close_status updated to: Invalid
- Issue status updated to: Closed (was: Open)

4 months ago

It's always nice to wake up and find that someone has figured out the issue for you! Thanks for looking into this, I will fix kernel-tools today.

FWIW, here's the upstream PR to address this before I forget the case (I briefly considered creating a ticket to deal with it later, but it's such a funky dark corner that fixing it was easier than explaining :joy: ): https://github.com/rpm-software-management/rpm/pull/1177
