I've been experiencing weird build failures (primarily on s390x, but sometimes also on other architectures) recently (i.e. for the past 1-2 weeks), where network connectivity between builder and the servers where the build repositories are served from is flaky.
Either the buildSRPMfromSCM task or the actual build task fails; s390x seems to be affected most often.
Example tasks from today:
All failed with very similar errors:
DEBUG util.py:459: Errors during downloading metadata for repository 'build':
DEBUG util.py:459: - Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390.fedoraproject.org/repos/epel10.0-build-side-106553/6553376/s390x/repodata/0747f87dfdca84e3ec9899b7f0634b807f8099be7aeb6e57891bc055f760f048-primary.xml.zst [end of response with 1924105 bytes missing]
DEBUG util.py:459: - Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390.fedoraproject.org/repos/epel10.0-build-side-106553/6553376/s390x/repodata/0747f87dfdca84e3ec9899b7f0634b807f8099be7aeb6e57891bc055f760f048-primary.xml.zst [end of response with 5266441 bytes missing]
DEBUG util.py:459: Error: Failed to download metadata for repo 'build': Yum repo downloading error: Downloading error(s): repodata/0747f87dfdca84e3ec9899b7f0634b807f8099be7aeb6e57891bc055f760f048-primary.xml.zst - Cannot download, all mirrors were already tried without success
I have seen other errors too, but all of them amounted to "got only part of the file instead of the whole one".
Anecdote: Today, mostly kojipkgs-cache03.s390.fedoraproject.org seems to be affected.
Another failed task: https://koji.fedoraproject.org/koji/taskinfo?taskID=129610678 Same host: kojipkgs-cache03.s390.fedoraproject.org.
Similarly, in an F42 build https://koji.fedoraproject.org/koji/taskinfo?taskID=129610955 just now:
DEBUG util.py:459: >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
DEBUG util.py:459: [39/68] flexiblas-0:3.4.5-1.fc42.s390x 100% | 11.1 KiB/s | 26.2 KiB | 00m02s
DEBUG util.py:459: >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
[…]
DEBUG util.py:459: [68/68] libraqm-0:0.10.1-2.fc42.s390x 0% | 0.0 B/s | 0.0 B | 00m04s
DEBUG util.py:459: >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
DEBUG util.py:459: >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
DEBUG util.py:459: >>> No more mirrors to try - All mirrors were already tried without success
DEBUG util.py:459: --------------------------------------------------------------------------------
DEBUG util.py:459: [68/68] Total 100% | 7.7 MiB/s | 73.7 MiB | 00m10s
DEBUG util.py:459: Failed to download packages
DEBUG util.py:459: Librepo error: Cannot download toplink/packages/libraqm/0.10.1/2.fc42/s390x/libraqm-0.10.1-2.fc42.s390x.rpm: All mirrors were tried
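For anyone triaging their own root.log files, here is a minimal sketch (not part of any Fedora tooling) that counts "Curl error (18)" lines per host. It assumes the log lines look like the excerpts quoted in this thread; the function name `count_partial_transfers` is made up for illustration.

```python
import re
from collections import Counter

# Matches the partial-transfer lines as they appear in the excerpts above.
# Captures only "scheme://host"; truncated hosts are captured as printed.
CURL18 = re.compile(
    r"Curl error \(18\): Transferred a partial file for (https?://[^/\s]+)"
)


def count_partial_transfers(log_text):
    """Return a Counter mapping 'scheme://host' to the number of curl error 18 lines."""
    return Counter(match.group(1) for match in CURL18.finditer(log_text))


if __name__ == "__main__":
    import sys

    # Usage: python count_curl18.py root.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for host, hits in count_partial_transfers(handle.read()).most_common():
            print(f"{hits:5d}  {host}")
```

Note that dnf truncates URLs in some of these lines, so the same cache host can show up under a shortened name.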
OK. I noticed a few IPs hitting kojipkgs quite heavily.
I blocked those... can anyone let me know if they are still seeing issues?
This is the most recent one I've seen:
https://koji.fedoraproject.org/koji/taskinfo?taskID=129611827
At around 18:52 -> 18:53 UTC, so that might have been right around the time you blocked those IPs?
It's happening still:
So not sure if what you did temporarily helped or not :(
Metadata Update from @phsmoura: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops
Yeah chain builds are pretty annoying right now XD
GenericError: Downloaded file http://kojipkgs-cache03.s390.fedoraproject.org/work/tasks/5682/129615682/libplasma-6.3.2-1.fc43.src.rpm doesn't match expected size (0 vs 1994387)
GenericError: Downloaded file http://kojipkgs-cache03.s390.fedoraproject.org/work/tasks/5538/129615538/krdp-6.3.2-1.fc43.src.rpm doesn't match expected size (0 vs 119967)
I think I only hit it on s390x (today) so far:
https://kojipkgs.fedoraproject.org/work/tasks/1614/129631614/root.log
https://kojipkgs.fedoraproject.org/work/tasks/1684/129631684/root.log (f40)
https://kojipkgs.fedoraproject.org/work/tasks/2501/129632501/root.log (f43)
I've adjusted one timeout and restarted varnish on s390x...
From digging:
Seeing seemingly similar issues on aarch64:
https://koji.fedoraproject.org/koji/taskinfo?taskID=129634174
https://koji.fedoraproject.org/koji/taskinfo?taskID=129634388
https://koji.fedoraproject.org/koji/taskinfo?taskID=129634170
This was an unrelated super load spike on the main kojipkgs01/02... I don't think it's the same as the s390x issue and it stopped before I could look at it too much. ;(
So, I have updated and rebooted both kojipkgs01 and kojipkgs02.
With a sidetrack into the fact that they were running a f39 kernel. ;( Seems sdubby got installed, which caused it to install new kernels as efi. ;( Removing that and rming the /boot/efi/ files and reinstalling the kernel got it working.
I am seeing a lot of load on them still... but perhaps they are handling it better now.
So, again, let me know when you next see this issue (or if it seems gone)
So, I don't see any s390x-specific failures overnight. :)
Will keep this open for a while in case it's just being slower to appear.
Didn't https://pagure.io/fedora-infrastructure/issue/12425 have an error or was that before the fixes you did?
Yes, none of those were last night. All of them were yesterday morning (for me)
I observed that https://koji.fedoraproject.org/koji/taskinfo?taskID=129661276 (Thu, 27 Feb 2025 18:53:44 UTC) failed with
GenericError: Downloaded file http://kojipkgs-cache03.s390.fedoraproject.org//repos/f43-build/6554873/s390x/rpmlist.jsonl doesn't match expected size (1966080 vs 20980475)
so I kicked off another build (mentioned on the devel list) just now (Fri, 28 Feb 2025 03:37:31 UTC), https://koji.fedoraproject.org/koji/taskinfo?taskID=129671299, and was rather horrified to see that the s390x build failed in the same way:
GenericError: Downloaded file http://kojipkgs-cache03.s390.fedoraproject.org/work/tasks/1300/129671300/vtk-9.2.6-36.fc43.src.rpm doesn't match expected size (430833664 vs 638634331)
I canceled the build and am trying again, but it looks like the problem is back.
https://koji.fedoraproject.org/koji/taskinfo?taskID=129666313 (Thu, 27 Feb 2025 23:23:19 UTC) failed with:
GenericError: Downloaded file http://kojipkgs-cache03.s390.fedoraproject.org/work/tasks/6319/129666319/ags-3.6.2.7-1.fc43.src.rpm doesn't match expected size (393216 vs 8755823)
Host buildvm-s390x-20.s390.fedoraproject.org
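To compare how much of each file arrived across the reports in this thread, here is a small sketch that parses the "doesn't match expected size" messages. It assumes the first number in the parentheses is the bytes actually received and the second is the expected size, which is consistent with the "(0 vs 1994387)" reports here but is not confirmed against the koji source; `parse_size_mismatches` is a hypothetical helper name.

```python
import re

# Pattern taken from the GenericError messages quoted in this thread.
SIZE_MISMATCH = re.compile(
    r"Downloaded file (\S+) doesn't match expected size \((\d+) vs (\d+)\)"
)


def parse_size_mismatches(text):
    """Return a list of (url, received, expected, fraction_received) tuples.

    Assumption: the first number is bytes received, the second is bytes expected.
    """
    results = []
    for url, received, expected in SIZE_MISMATCH.findall(text):
        received, expected = int(received), int(expected)
        fraction = received / expected if expected else 0.0
        results.append((url, received, expected, fraction))
    return results
```

Feeding it the error text from a failed task gives one tuple per message, which makes it easy to see whether a download stopped immediately (0 bytes) or partway through.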
So, both of those are before my comment above at 2025-02-27 00:36:58 UTC, so I think they were before the update/reboot.
I looked at https://koji.fedoraproject.org/koji/builds?state=3&order=-build_id and can't see any failures after that which match this issue.
What about https://koji.fedoraproject.org/koji/taskinfo?taskID=129671299, the second build mentioned in https://pagure.io/fedora-infrastructure/issue/12419#comment-958359? That build started at Fri, 28 Feb 2025 03:37:31 UTC.
Yeah, that does seem to be one. ;(
So, not 100% gone, but happening much less often?
Any further cases anyone has seen in the last few days?
I wanted to post an update here earlier today, but forgot. I did 200+ koji builds today and haven't hit this issue once. In my book I'd count that as "fixed" :) Feel free to close this issue.
Thanks. :crossed_flags:
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Metadata Update from @kevin: - Issue assigned to kevin
Just spotted this in a Fedora CI scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=129811343
Can confirm. It's not nearly as bad as last week, but: https://koji.fedoraproject.org/koji/taskinfo?taskID=129821400 (keditbookmarks crashing with that same error)
OK, my only theory left is that the load on kojipkgs01/02 is causing the s390x cache to do this, but not enough to make any of the other arches or external people hit it.
So, to test that theory, I have taken kojipkgs02 out of external load and blocked kojipkgs01 on the s390x cache machine.
So, it should now get everything from kojipkgs02 which isn't being hit by any external stuff.
If that proves ok, then I think we may need to expand the number of kojipkgs or something to handle more external load. If it doesn't help that means this isn't the problem.
Metadata Update from @kevin: - Issue status updated to: Open (was: Closed)
So, I reverted all that this morning as it turned out I had blocked some CloudFront IPs. ;(
I've made some other adjustments now:
Let's see how it does now... thanks everyone for your patience and reporting things back.
I saw one report after this, but are any folks seeing more?
I have not seen any issues in any of the ~40 koji builds I did today.
In the last couple of days I think I've seen it happen once with frameworks. I am sending them to you on Matrix as I get them :)
One more I just got, ironically on the very last build in the chain:
https://koji.fedoraproject.org/koji/taskinfo?taskID=130017870
Ok, it's been 4 days... any more seen? It sounds like this is now mostly ok, but very occasionally we might still hit it?
I'm going to close this now. Feel free to re-open if you see it happening again and we can try more things.
Metadata Update from @kevin: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)