Learn more about these different git repos.
Other Git URLs
I can't reopen #1418 so I'm filing a new ticket. The error is the same:
DEBUG util.py:427: Unsharing. Flags: 134217728 DEBUG util.py:598: Updating / installing... DEBUG util.py:598: eclipse-1:4.14-5.fc31 ######################################## DEBUG util.py:596: error: unpacking of archive failed on file /builddir/build/SOURCES/eclipse-platform-sources-4.14.tar.xz;5e30651b: cpio: read failed - Inappropriate ioctl for device DEBUG util.py:596: error: /builddir/build/originals/eclipse-4.14-5.fc31.src.rpm cannot be installed DEBUG util.py:744: Child return code was: 1 DEBUG util.py:329: kill orphans DEBUG util.py:763: child environment: None
Here is the failed task: https://koji.fedoraproject.org/koji/taskinfo?taskID=41143792
We've fixed it in koji code. I don't think, that Fedora already updated their instance.
Metadata Update from @tkopecek: - Custom field Size adjusted to None
So, yeah this appears to not be fixed. ;(
https://koji.fedoraproject.org/koji/taskinfo?taskID=42725810
for example. The src.rpm job finished fine, but on s390x it was unable to unpack the src.rpm.
I pulled this one down and looked and indeed the src.rpm is corrupt.
-rw-r--r--. 1 kevin kevin 141049321 Mar 23 11:48 vdrift-20141020-21.fc33.src.rpm <- the src.rpm that was in the buildroot -rw-rw-r--. 1 kevin kevin 508004960 Mar 23 11:33 vdrift-20141020-21.fc33.src.rpm <- the downloaded one from the src.rpm task
How is the src.rpm copied into originals/ dir? It seems like that might be where we are missing a check?
There is also this one that might be related: https://koji.fedoraproject.org/koji/taskinfo?taskID=42726437
Could it be, that upload was finished by buildSRPMFromSCM, but filer (or some mirror used by builder) wasn't fully synced before buildArch started? I surely can add some rpm verification before SRPM is used, but it looks weird as it is using same location and no copying happens. Even mock opens the file from original upload location. Anyway, adding verification just fails a few seconds before the mock, so there will be no real fix.
buildSRPMFromSCM
buildArch
@kevin, have you found both srpms (the complete and the short) on the s390x builder? Or does the s390x builder contain only the short one?
Does koji somehow ensure that buildArch task starts after the srpm has been fully uploaded by buildSRPMFromSCM?
Can we get some logging on the builder to store the details (like file size and timestamp) about the downloaded srpm? Are those details in the http headers? @tkopecek's idea about download-before-fully-uploaded sounds plausible to me.
For buildArch yes - it runs sequentially, so it has to be uploaded already.
I am now seeing failures on s390x that look like this:
https://koji.fedoraproject.org/koji/taskinfo?taskID=42783181
Traceback (most recent call last): File "/usr/lib/python3.7/site-packages/koji/daemon.py", line 1294, in runTask response = (handler.run(),) File "/usr/lib/python3.7/site-packages/koji/tasks.py", line 313, in run return koji.util.call_with_argcheck(self.handler, self.params, self.opts) File "/usr/lib/python3.7/site-packages/koji/util.py", line 263, in call_with_argcheck return func(*args, **kwargs) File "/usr/sbin/kojid", line 1371, in handler broot.init() File "/usr/sbin/kojid", line 544, in init self.session.host.setBuildRootList(self.id,self.getPackageList()) File "/usr/sbin/kojid", line 667, in getPackageList self.markExternalRPMs(ret) File "/usr/sbin/kojid", line 805, in markExternalRPMs for line in fo2: File "/usr/lib64/python3.7/gzip.py", line 289, in read1 return self._buffer.read1(size) File "/usr/lib64/python3.7/_compression.py", line 68, in readinto data = self.read(len(byte_view)) File "/usr/lib64/python3.7/gzip.py", line 482, in read raise EOFError("Compressed file ended before the " EOFError: Compressed file ended before the end-of-stream marker was reached
Is it related?
So, let me explain how things are setup:
s390x builders are in boston. Our koji hub is in phx2. There's a internet link between them (of course).
To help with all the packages that need downloaded for buildroots, there is a varnish cache in boston. All the s390x builders have:
topurl = http://kojipkgs-cache01.s390.fedoraproject.org https://kojipkgs01.fedoraproject.org https://kojipkgs02.fedoraproject.org
(where kojipkgs-cache01 is of course the local machine running varnish).
varnish is set to use kojipkgs01/kojipkgs02 as it's backend servers.
kojipkgs01/02 are the main caching servers for packages in phx2. They run varnish also, which has a backend of apache (on the same machine) that fetches things from a netapp nfs mount of /mnt/koji.
So, could varnish be causing the problem here? I suppose it could, but I am not sure I see why. When the srpm is written out, it's written to the netapp /mnt/koji mount. When that returns I would assume the file is completely written to disk. varnish (either one) shouldn't have a incomplete cached copy, because it's not requesting it until after it's written out.
Unfortunately we don't have any varnish logs.
apache has for the above one for example:
127.0.0.1 - - [23/Mar/2020:18:34:18 +0000] "GET /work/tasks/5780/42725780/vdrift-20141020-21.fc33.src.rpm HTTP/1.1" 200 508004960 "-" "p ython-requests/2.22.0" 127.0.0.1 - - [23/Mar/2020:18:33:40 +0000] "GET /work/tasks/5780/42725780/vdrift-20141020-21.fc33.src.rpm HTTP/1.1" 200 508004960 "-" "p ython-requests/2.22.0"
The only thing, I've found is https://serverfault.com/questions/284132/static-file-download-from-browser-breaking-in-varnish-but-works-fine-in-apache It has to do something with long-running downloads and/or big files.
Short timeout would explain both big SRPMs for texlive, etc. and external repos (where could be slow download even for smaller packages)
So, maybe try to investigate in this direction?
Yeah, we have played with timeouts quite a lot.
Note that it's never failing in a normal root population or anything, it's ONLY with the newly created src.rpm. Thats why I was asking if it was transferred another way.
build start builder takes job, populates srpm-build chroot, makes src.rpm real builds start builder takes job, populates rpm-build chroot, unpacks src.rpm and it fails.
Did it get the src.rpm via http(s)? or via xmlrpc against the hub? or something else?
build start builder makes src.rpm uploads it via xmlrpc to hub hub saves it to /mnt/koji/tasks/... buildArch starts opens it via openRemoteFile which depends on if you've defined topdir or topurl in kojid.conf. First case expects NFS mount, then it is directly opened, second use chunked download via http(s) (requests.get(url, stream=True)) -- here it fails --
openRemoteFile
topdir
topurl
kojid.conf
requests.get(url, stream=True)
@tkopecek , could you add us some diagnostics output for the requests.get() path? Definitely would be useful to know the expected size (from HTTP headers?) and the actual size. If they will differ, then there could be something wrong with requests.get(). if they will be the same, but still shorter than the original srpm, then it could be the cache that truncates the srpm. We could also prepare a standalone downloader and try to reproduce the failure on the affected builders outside koji.
requests.get()
You can add some simple print like in attached patch and kojid with --debug, so you can find it easily. (Those interesting have previous line "Reading SRPM")
--debug
Patch is against 1.20.1 (and use it only on that one builder, there would too much clutter, as it is nor running through loggin module, so not filtered out)
<img alt="0001-debug-requests.patch" src="/koji/issue/raw/files/691e341f6edd8daf9b4aab95add34a3e165eacccfbfa74be697ed6bbffef44c0-0001-debug-requests.patch" />
Is it still useful to record occurances? I saw another today:
https://koji.fedoraproject.org/koji/taskinfo?taskID=42986402
not that much right now, but it will be again after @kevin will deploy the debugging support patch
I've applied the patch on buildvm-s390x-08.s390.fedoraproject.org. Will watch it over the weekend and see if we can get any debug.
If any of you notice failures on it, let me know and I can collect logs.
BTW, It is not much helpful for finding the root cause here, but I've filed https://pagure.io/koji/issue/2130 , so it is more easier to see in future, that problem was download phase, not corrupted file as the source.
https://koji.fedoraproject.org/koji/tasks?state=failed&view=flat&method=buildArch&hostID=323&order=-id can be used to view failed builds on the -08 builder, nothing there yet
Two more just now: https://koji.fedoraproject.org/koji/taskinfo?taskID=44096312 https://koji.fedoraproject.org/koji/taskinfo?taskID=44096424
Another another: https://koji.fedoraproject.org/koji/taskinfo?taskID=44096323
I was hit by this recently
https://koji.fedoraproject.org/koji/taskinfo?taskID=44096430
https://koji.fedoraproject.org/koji/taskinfo?taskID=44207288 https://koji.fedoraproject.org/koji/taskinfo?taskID=44207247 too
You can add some simple print like in attached patch and kojid with --debug, so you can find it easily. (Those interesting have previous line "Reading SRPM") Patch is against 1.20.1 (and use it only on that one builder, there would too much clutter, as it is nor running through loggin module, so not filtered out)
Tomas, I have gained access to the builders, so I'm looking into it now. There is one issue with the patch - print(r.headers) should be print(resp.headers) I guess. The debug output will go into /var/log/kojid.log?
print(r.headers)
print(resp.headers)
/var/log/kojid.log
Hmm, it's not in /var/log/kojid.log. In fact I can't find where the info is logged, it's not printed even when running kojid from command line. update - just got openRemoteFile: prefixed output in the terminal
kojid
openRemoteFile:
Yes, it goes to console, if run in foreground or to kojid.log if bg.
an interim report - I have ended using <img alt="0001-log-details-for-downloaded-srpms.patch" src="/koji/issue/raw/files/4671ed8005f4d37542557188c182a02da51a99fbca61c00af9cb1714dcdcbe49-0001-log-details-for-downloaded-srpms.patch" /> to start capturing the details for the downloaded srpms - there are 2 functions that download srpms - tasks.py/loadPath() for buildArch tasks and __init__.py/openRepoFile() for build tasks (the srpm will be later used in the buildArch tasks), it might be useful to merge them - simple print() output doesn't go into the log files, so I have used the logging infra, now I have the details in /var/log/kojid.log when kojid runs as a service - I'm going to try if I can capture any exception data from requests, there might be timeouts or something else - it took ~1 hour for the builder/koji daemon to remove couple dozens of old buildroot when I started it, so I suspect the hypervisor is heavily I/O loaded - there is already one incomplete download captured
tasks.py/loadPath()
__init__.py/openRepoFile()
build
print()
requests
2020-05-26 09:12:53,221 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/40/45000040/uhd-3.15.0.0-0.4.rc2.fc33.src.rpm, header: 126135871, got: 1243773 2020-05-26 09:12:53,222 [INFO] koji.build.BaseTaskHandler: {'Date': 'Tue, 26 May 2020 09:12:30 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Last-Modified': 'Tue, 26 May 2020 09:08:35 GMT', 'Content-Length': '126135871', 'AppTime': 'D=2583', 'X-Fedora-AppServer': 'kojipkgs01.phx2.fedoraproject.org', 'Content-Type': 'application/x-rpm', 'X-Varnish': '402487983, 17207703', 'Via': '1.1 varnish (Varnish/6.3), 1.1 varnish (Varnish/6.3)', 'Age': '0', 'Accept-Ranges': 'bytes', 'Connection': 'keep-alive'}
some more notes about possible next steps - but if requests would throw exceptions, they would be already in the log (I guess) - we can try moving from 8KB chunk to bigger (1M?), each chunk is individually written to disk, bigger chunks, less writes - add some checks for read and written chunks - explicitly set timeout for reading the chunks
I have added these changes (<img alt="0001-add-more-debugging-output.patch" src="/koji/issue/raw/files/935dbec57caa2f5e1e8ee13224317964299a91a7ae9d5fc78d9d8eabd083c3f9-0001-add-more-debugging-output.patch" />) and as a result we have following in the kojid.log
kojid.log
2020-05-27 00:14:35,016 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/6345/45036345/gavl-1.4.0-16.eln101.src.rpm, chunk number: 87, clen != csize (8151 != 8192) 2020-05-27 00:14:35,017 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/6345/45036345/gavl-1.4.0-16.eln101.src.rpm, header: 4600376, g ot: 712663, chunk count: 87, expected count: 562 2020-05-27 00:14:35,017 [INFO] koji.build.BaseTaskHandler: {'Date': 'Wed, 27 May 2020 00:14:11 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; pr eload', 'Last-Modified': 'Tue, 26 May 2020 21:31:00 GMT', 'Content-Length': '4600376', 'AppTime': 'D=3639', 'X-Fedora-AppServer': 'kojipkgs01.phx2.fedoraproject.org', 'Content-Type': 'applic ation/x-rpm', 'X-Varnish': '412690449, 16729426', 'Via': '1.1 varnish (Varnish/6.3), 1.1 varnish (Varnish/6.3)', 'Age': '0', 'Accept-Ranges': 'bytes', 'Connection': 'keep-alive'}
My interpretation is the requests.get wasn't able to receive the complete file in the 8KB chunks. But the new question is why? :-)
requests.get
Two more failed downloads after setting a timeout explicitly, but nothing else in the log (no exception thrown or such)
2020-06-03 18:28:22,798 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/4626/45374626/qt5-qt3d-5.14.2-1.eln101.src.rpm, chunk number: 589, clen != csize (1206 != 8192) 2020-06-03 18:28:22,799 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/4626/45374626/qt5-qt3d-5.14.2-1.eln101.src.rpm, header: 81606731, got: 4818102, chunk count: 589, expected count: 9962 2020-06-04 08:01:26,495 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/2581/45402581/qgis-3.12.1-4.fc32.src.rpm, chunk number: 792, clen != csize (3577 != 8192) 2020-06-04 08:01:26,496 [INFO] koji.build.BaseTaskHandler: localPath: http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/2581/45402581/qgis-3.12.1-4.fc32.src.rpm, header: 109352460, got: 6483449, chunk count: 792, expected count: 13349
I saw this during the mass rebuild: https://koji.fedoraproject.org/koji/taskinfo?taskID=47985596
Traceback (most recent call last): File "/usr/lib/python3.8/site-packages/koji/daemon.py", line 1340, in runTask response = (handler.run(),) File "/usr/lib/python3.8/site-packages/koji/tasks.py", line 329, in run return koji.util.call_with_argcheck(self.handler, self.params, self.opts) File "/usr/lib/python3.8/site-packages/koji/util.py", line 258, in call_with_argcheck return func(*args, **kwargs) File "/usr/sbin/kojid", line 1382, in handler h = koji.get_rpm_header(fn) File "/usr/lib/python3.8/site-packages/koji/__init__.py", line 962, in get_rpm_header hdr = ts.hdrFromFdno(fo.fileno()) File "/usr/lib64/python3.8/site-packages/rpm/transaction.py", line 185, in hdrFromFdno raise rpm.error("error reading package header") _rpm.error: error reading package header
I've just seen error reading package header on two successive s390x builds, then it went away:
error reading package header
GenericError: Downloaded rpm http://kojipkgs-cache01.s390.fedoraproject.org/work/tasks/7711/88367711/rust-coreos-installer-0.15.0-1.fc37.src.rpm is corrupted: rpm's header can't be extracted: /var/tmp/koji/tasks/7799/88367799/local/work/tasks/7711/88367711/rust-coreos-installer-0.15.0-1.fc37.src.rpm (rpm error: error reading package header)
https://koji.fedoraproject.org/koji/taskinfo?taskID=88367799 https://koji.fedoraproject.org/koji/taskinfo?taskID=88368013
Login to comment on this ticket.