#1974 Issues unpacking src.rpm on fedoraproject koji on fedora s390x builders
Opened 2 years ago by mbooth. Modified a year ago

I can't reopen #1418 so I'm filing a new ticket. The error is the same:

DEBUG util.py:427:  Unsharing. Flags: 134217728
DEBUG util.py:598:  Updating / installing...
DEBUG util.py:598:  eclipse-1:4.14-5.fc31                 ########################################
DEBUG util.py:596:  error: unpacking of archive failed on file /builddir/build/SOURCES/eclipse-platform-sources-4.14.tar.xz;5e30651b: cpio: read failed - Inappropriate ioctl for device
DEBUG util.py:596:  error: /builddir/build/originals/eclipse-4.14-5.fc31.src.rpm cannot be installed
DEBUG util.py:744:  Child return code was: 1
DEBUG util.py:329:  kill orphans
DEBUG util.py:763:  child environment: None

Here is the failed task: https://koji.fedoraproject.org/koji/taskinfo?taskID=41143792


We've fixed it in koji code. I don't think, that Fedora already updated their instance.

Metadata Update from @tkopecek:
- Custom field Size adjusted to None

2 years ago

So, yeah this appears to not be fixed. ;(

https://koji.fedoraproject.org/koji/taskinfo?taskID=42725810

for example. The src.rpm job finished fine, but on s390x it was unable to unpack the src.rpm.

I pulled this one down and looked and indeed the src.rpm is corrupt.

-rw-r--r--. 1 kevin kevin 141049321 Mar 23 11:48 vdrift-20141020-21.fc33.src.rpm <- the src.rpm that was in the buildroot
-rw-rw-r--. 1 kevin kevin 508004960 Mar 23 11:33 vdrift-20141020-21.fc33.src.rpm <- the downloaded one from the src.rpm task

How is the src.rpm copied into originals/ dir? It seems like that might be where we are missing a check?

There is also this one that might be related: https://koji.fedoraproject.org/koji/taskinfo?taskID=42726437

Could it be, that upload was finished by buildSRPMFromSCM, but filer (or some mirror used by builder) wasn't fully synced before buildArch started? I surely can add some rpm verification before SRPM is used, but it looks weird as it is using same location and no copying happens. Even mock opens the file from original upload location. Anyway, adding verification just fails a few seconds before the mock, so there will be no real fix.

@kevin, have you found both srpms (the complete and the short) on the s390x builder? Or does the s390x builder contain only the short one?

Does koji somehow ensure that buildArch task starts after the srpm has been fully uploaded by buildSRPMFromSCM?

Can we get some logging on the builder to store the details (like file size and timestamp) about the downloaded srpm? Are those details in the http headers? @tkopecek's idea about download-before-fully-uploaded sounds plausible to me.

For buildArch yes - it runs sequentially, so it has to be uploaded already.

I am now seeing failures on s390x that look like this:

https://koji.fedoraproject.org/koji/taskinfo?taskID=42783181

Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/koji/daemon.py", line 1294, in runTask
    response = (handler.run(),)
  File "/usr/lib/python3.7/site-packages/koji/tasks.py", line 313, in run
    return koji.util.call_with_argcheck(self.handler, self.params, self.opts)
  File "/usr/lib/python3.7/site-packages/koji/util.py", line 263, in call_with_argcheck
    return func(*args, **kwargs)
  File "/usr/sbin/kojid", line 1371, in handler
    broot.init()
  File "/usr/sbin/kojid", line 544, in init
    self.session.host.setBuildRootList(self.id,self.getPackageList())
  File "/usr/sbin/kojid", line 667, in getPackageList
    self.markExternalRPMs(ret)
  File "/usr/sbin/kojid", line 805, in markExternalRPMs
    for line in fo2:
  File "/usr/lib64/python3.7/gzip.py", line 289, in read1
    return self._buffer.read1(size)
  File "/usr/lib64/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib64/python3.7/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

Is it related?

So, let me explain how things are setup:

s390x builders are in boston. Our koji hub is in phx2. There's a internet link between them (of course).

To help with all the packages that need downloaded for buildroots, there is a varnish cache in boston.
All the s390x builders have:

topurl = http://kojipkgs-cache01.s390.fedoraproject.org https://kojipkgs01.fedoraproject.org https://kojipkgs02.fedoraproject.org

(where kojipkgs-cache01 is of course the local machine running varnish).

varnish is set to use kojipkgs01/kojipkgs02 as it's backend servers.

kojipkgs01/02 are the main caching servers for packages in phx2. They run varnish also, which has a backend of apache (on the same machine) that fetches things from a netapp nfs mount of /mnt/koji.

So, could varnish be causing the problem here? I suppose it could, but I am not sure I see why.
When the srpm is written out, it's written to the netapp /mnt/koji mount. When that returns I would assume the file is completely written to disk.
varnish (either one) shouldn't have a incomplete cached copy, because it's not requesting it until after it's written out.

Unfortunately we don't have any varnish logs.

apache has for the above one for example:

127.0.0.1 - - [23/Mar/2020:18:34:18 +0000] "GET /work/tasks/5780/42725780/vdrift-20141020-21.fc33.src.rpm HTTP/1.1" 200 508004960 "-" "p
ython-requests/2.22.0"
127.0.0.1 - - [23/Mar/2020:18:33:40 +0000] "GET /work/tasks/5780/42725780/vdrift-20141020-21.fc33.src.rpm HTTP/1.1" 200 508004960 "-" "p
ython-requests/2.22.0"

The only thing, I've found is https://serverfault.com/questions/284132/static-file-download-from-browser-breaking-in-varnish-but-works-fine-in-apache It has to do something with long-running downloads and/or big files.

Short timeout would explain both big SRPMs for texlive, etc. and external repos (where could be slow download even for smaller packages)

So, maybe try to investigate in this direction?

Yeah, we have played with timeouts quite a lot.

Note that it's never failing in a normal root population or anything, it's ONLY with the newly created src.rpm. Thats why I was asking if it was transferred another way.

build start
builder takes job, populates srpm-build chroot, makes src.rpm
real builds start
builder takes job, populates rpm-build chroot, unpacks src.rpm and it fails.

Did it get the src.rpm via http(s)? or via xmlrpc against the hub? or something else?

build start
builder makes src.rpm
uploads it via xmlrpc to hub
hub saves it to /mnt/koji/tasks/...
buildArch starts
opens it via openRemoteFile which depends on if you've defined topdir or topurl in kojid.conf. First case expects NFS mount, then it is directly opened, second use chunked download via http(s) (requests.get(url, stream=True))
-- here it fails --

@tkopecek , could you add us some diagnostics output for the requests.get() path? Definitely would be useful to know the expected size (from HTTP headers?) and the actual size. If they will differ, then there could be something wrong with requests.get(). if they will be the same, but still shorter than the original srpm, then it could be the cache that truncates the srpm. We could also prepare a standalone downloader and try to reproduce the failure on the affected builders outside koji.

You can add some simple print like in attached patch and kojid with --debug, so you can find it easily. (Those interesting have previous line "Reading SRPM")

Patch is against 1.20.1 (and use it only on that one builder, there would too much clutter, as it is nor running through loggin module, so not filtered out)

0001-debug-requests.patch

Is it still useful to record occurances? I saw another today:

https://koji.fedoraproject.org/koji/taskinfo?taskID=42986402

not that much right now, but it will be again after @kevin will deploy the debugging support patch

I've applied the patch on buildvm-s390x-08.s390.fedoraproject.org. Will watch it over the weekend and see if we can get any debug.

If any of you notice failures on it, let me know and I can collect logs.

BTW, It is not much helpful for finding the root cause here, but I've filed https://pagure.io/koji/issue/2130 , so it is more easier to see in future, that problem was download phase, not corrupted file as the source.

You can add some simple print like in attached patch and kojid with --debug, so you can find it easily. (Those interesting have previous line "Reading SRPM")
Patch is against 1.20.1 (and use it only on that one builder, there would too much clutter, as it is nor running through loggin module, so not filtered out)
0001-debug-requests.patch

Tomas, I have gained access to the builders, so I'm looking into it now. There is one issue with the patch - print(r.headers) should be print(resp.headers) I guess.
The debug output will go into /var/log/kojid.log?

Hmm, it's not in /var/log/kojid.log. In fact I can't find where the info is logged, it's not printed even when running kojid from command line.
update - just got openRemoteFile: prefixed output in the terminal

Yes, it goes to console, if run in foreground or to kojid.log if bg.

an interim report
- I have ended using 0001-log-details-for-downloaded-srpms.patch to start capturing the details for the downloaded srpms
- there are 2 functions that download srpms - tasks.py/loadPath() for buildArch tasks and __init__.py/openRepoFile() for build tasks (the srpm will be later used in the buildArch tasks), it might be useful to merge them
- simple print() output doesn't go into the log files, so I have used the logging infra, now I have the details in /var/log/kojid.log when kojid runs as a service
- I'm going to try if I can capture any exception data from requests, there might be timeouts or something else
- it took ~1 hour for the builder/koji daemon to remove couple dozens of old buildroot when I started it, so I suspect the hypervisor is heavily I/O loaded
- there is already one incomplete download captured

2020-05-26 09:12:53,221 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/40/45000040/uhd-3.15.0.0-0.4.rc2.fc33.src.rpm, header: 126135871, got: 1243773
2020-05-26 09:12:53,222 [INFO] koji.build.BaseTaskHandler: {'Date': 'Tue, 26 May 2020 09:12:30 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Last-Modified': 'Tue, 26 May 2020 09:08:35 GMT', 'Content-Length': '126135871', 'AppTime': 'D=2583', 'X-Fedora-AppServer': 'kojipkgs01.phx2.fedoraproject.org', 'Content-Type': 'application/x-rpm', 'X-Varnish': '402487983, 17207703', 'Via': '1.1 varnish (Varnish/6.3), 1.1 varnish (Varnish/6.3)', 'Age': '0', 'Accept-Ranges': 'bytes', 'Connection': 'keep-alive'}

some more notes about possible next steps
- but if requests would throw exceptions, they would be already in the log (I guess)
- we can try moving from 8KB chunk to bigger (1M?), each chunk is individually written to disk, bigger chunks, less writes
- add some checks for read and written chunks
- explicitly set timeout for reading the chunks

I have added these changes (0001-add-more-debugging-output.patch) and as a result we have following in the kojid.log

2020-05-27 00:14:35,016 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/6345/45036345/gavl-1.4.0-16.eln101.src.rpm, chunk number: 87, 
clen != csize (8151 != 8192)
2020-05-27 00:14:35,017 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/6345/45036345/gavl-1.4.0-16.eln101.src.rpm, header: 4600376, g
ot: 712663, chunk count: 87, expected count: 562
2020-05-27 00:14:35,017 [INFO] koji.build.BaseTaskHandler: {'Date': 'Wed, 27 May 2020 00:14:11 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; pr
eload', 'Last-Modified': 'Tue, 26 May 2020 21:31:00 GMT', 'Content-Length': '4600376', 'AppTime': 'D=3639', 'X-Fedora-AppServer': 'kojipkgs01.phx2.fedoraproject.org', 'Content-Type': 'applic
ation/x-rpm', 'X-Varnish': '412690449, 16729426', 'Via': '1.1 varnish (Varnish/6.3), 1.1 varnish (Varnish/6.3)', 'Age': '0', 'Accept-Ranges': 'bytes', 'Connection': 'keep-alive'}

My interpretation is the requests.get wasn't able to receive the complete file in the 8KB chunks. But the new question is why? :-)

Two more failed downloads after setting a timeout explicitly, but nothing else in the log (no exception thrown or such)

2020-06-03 18:28:22,798 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/4626/45374626/qt5-qt3d-5.14.2-1.eln101.src.rpm, chunk number: 589, clen != csize (1206 != 8192)
2020-06-03 18:28:22,799 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/4626/45374626/qt5-qt3d-5.14.2-1.eln101.src.rpm, header: 81606731, got: 4818102, chunk count: 589, expected count: 9962
2020-06-04 08:01:26,495 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/2581/45402581/qgis-3.12.1-4.fc32.src.rpm, chunk number: 792, clen != csize (3577 != 8192)
2020-06-04 08:01:26,496 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/2581/45402581/qgis-3.12.1-4.fc32.src.rpm, header: 109352460, got: 6483449, chunk count: 792, expected count: 13349

I saw this during the mass rebuild: https://koji.fedoraproject.org/koji/taskinfo?taskID=47985596

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/koji/daemon.py", line 1340, in runTask
    response = (handler.run(),)
  File "/usr/lib/python3.8/site-packages/koji/tasks.py", line 329, in run
    return koji.util.call_with_argcheck(self.handler, self.params, self.opts)
  File "/usr/lib/python3.8/site-packages/koji/util.py", line 258, in call_with_argcheck
    return func(*args, **kwargs)
  File "/usr/sbin/kojid", line 1382, in handler
    h = koji.get_rpm_header(fn)
  File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 962, in get_rpm_header
    hdr = ts.hdrFromFdno(fo.fileno())
  File "/usr/lib64/python3.8/site-packages/rpm/transaction.py", line 185, in hdrFromFdno
    raise rpm.error("error reading package header")
_rpm.error: error reading package header

Is it related?

Login to comment on this ticket.

Metadata
Attachments 1