#12419 Network connectivity issues on koji builders (partial downloads, etc.)
Closed: Fixed with Explanation a month ago by kevin. Opened 2 months ago by decathorpe.

For the past 1-2 weeks I've been seeing odd build failures (primarily on s390x, but sometimes on other architectures too) in which network connectivity between the builder and the servers that host the build repositories is flaky.

Either the buildSRPMfromSCM task or the actual build task fails; s390x seems to be affected most often.

Example tasks from today:

All failed with very similar errors:

DEBUG util.py:459:  Errors during downloading metadata for repository 'build':
DEBUG util.py:459:    - Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390.fedoraproject.org/repos/epel10.0-build-side-106553/6553376/s390x/repodata/0747f87dfdca84e3ec9899b7f0634b807f8099be7aeb6e57891bc055f760f048-primary.xml.zst [end of response with 1924105 bytes missing]
DEBUG util.py:459:    - Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390.fedoraproject.org/repos/epel10.0-build-side-106553/6553376/s390x/repodata/0747f87dfdca84e3ec9899b7f0634b807f8099be7aeb6e57891bc055f760f048-primary.xml.zst [end of response with 5266441 bytes missing]
DEBUG util.py:459:  Error: Failed to download metadata for repo 'build': Yum repo downloading error: Downloading error(s): repodata/0747f87dfdca84e3ec9899b7f0634b807f8099be7aeb6e57891bc055f760f048-primary.xml.zst - Cannot download, all mirrors were already tried without success

I have seen other errors too, but all of them amounted to "got only part of the file instead of the whole one".
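
Curl error 18 (CURLE_PARTIAL_FILE) is what libcurl reports when a transfer ends with a different number of bytes than the server announced. As a rough illustration of that check, here is a minimal Python sketch that compares the announced length with what actually arrived and retries on a mismatch. The `fetch` callable and the `PartialDownload` name are hypothetical stand-ins invented for this example; they are not part of koji, mock, or librepo.

```python
from typing import Callable, Tuple


class PartialDownload(Exception):
    """Raised when fewer bytes arrive than the server announced."""


def fetch_with_length_check(
    fetch: Callable[[], Tuple[int, bytes]],
    attempts: int = 3,
) -> bytes:
    """Call fetch() until the body length matches the announced length.

    `fetch` is a hypothetical callable returning (announced_length, body).
    In real use it would wrap an HTTP GET and read Content-Length.
    """
    last_missing = 0
    for _ in range(attempts):
        announced, body = fetch()
        if len(body) == announced:
            return body
        # Remember how short the last attempt was, for the error message.
        last_missing = announced - len(body)
    raise PartialDownload(
        f"{last_missing} bytes missing after {attempts} attempts"
    )
```

Retrying only helps when the truncation is intermittent. If every attempt hits the same cache and comes back short, the caller ends up in the same "all mirrors were already tried" situation as in the log above.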

Anecdotally, kojipkgs-cache03.s390.fedoraproject.org seems to be the host most affected today.


Another failed task:
https://koji.fedoraproject.org/koji/taskinfo?taskID=129610678
Same host: kojipkgs-cache03.s390.fedoraproject.org.

Similarly, in an F42 build https://koji.fedoraproject.org/koji/taskinfo?taskID=129610955 just now:

DEBUG util.py:459:  >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
DEBUG util.py:459:  [39/68] flexiblas-0:3.4.5-1.fc42.s390x  100% |  11.1 KiB/s |  26.2 KiB |  00m02s
DEBUG util.py:459:  >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
[…]
DEBUG util.py:459:  [68/68] libraqm-0:0.10.1-2.fc42.s390x     0% |   0.0   B/s |   0.0   B |  00m04s
DEBUG util.py:459:  >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
DEBUG util.py:459:  >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
DEBUG util.py:459:  >>> No more mirrors to try - All mirrors were already tried without success     
DEBUG util.py:459:  --------------------------------------------------------------------------------
DEBUG util.py:459:  [68/68] Total                           100% |   7.7 MiB/s |  73.7 MiB |  00m10s
DEBUG util.py:459:  Failed to download packages
DEBUG util.py:459:   Librepo error: Cannot download toplink/packages/libraqm/0.10.1/2.fc42/s390x/libraqm-0.10.1-2.fc42.s390x.rpm: All mirrors were tried
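
To check whether the failures really cluster on one cache host, the mock log output can be tallied per host. The following is a small sketch, assuming the log lines look like the excerpts quoted in this thread. The function name is made up for the example, and host names that are truncated in the log (as above) are counted in their truncated form.

```python
import re
from collections import Counter

# Matches the "Curl error (18)" lines quoted above and captures the host part
# of the URL. Truncated URLs yield a truncated host name.
PARTIAL_RE = re.compile(
    r"Curl error \(18\): Transferred a partial file for https?://([^/\s]+)"
)


def count_partial_transfers(log_text: str) -> Counter:
    """Count partial-transfer errors per host in a block of log text."""
    return Counter(m.group(1) for m in PARTIAL_RE.finditer(log_text))


sample = """\
DEBUG util.py:459:  >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
DEBUG util.py:459:  [39/68] flexiblas-0:3.4.5-1.fc42.s390x  100% |  11.1 KiB/s |  26.2 KiB |  00m02s
DEBUG util.py:459:  >>> Curl error (18): Transferred a partial file for http://kojipkgs-cache03.s390
"""

if __name__ == "__main__":
    for host, hits in count_partial_transfers(sample).most_common():
        print(host, hits)
```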

OK. I noticed there were a few IPs hitting kojipkgs quite heavily.

I blocked those... can anyone let me know if they are still seeing issues?
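
For readers wondering how heavy hitters like these can be spotted, one common approach is to count requests per client address in the web server's access log. The sketch below assumes each log line starts with the client address, as in the common and combined log formats. The actual log format on kojipkgs is not shown in this thread, and the sample lines use documentation-only addresses.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_clients(lines: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n client addresses with the most requests.

    Assumes each non-empty line starts with the client address,
    followed by whitespace.
    """
    counts = Counter(
        line.split(maxsplit=1)[0] for line in lines if line.strip()
    )
    return counts.most_common(n)


# Made-up sample using documentation-only addresses (RFC 5737).
sample = [
    '192.0.2.10 - - "GET /repos/a HTTP/1.1" 200 100',
    '198.51.100.7 - - "GET /repos/b HTTP/1.1" 200 100',
    '192.0.2.10 - - "GET /repos/c HTTP/1.1" 200 100',
    '203.0.113.5 - - "GET /repos/d HTTP/1.1" 200 100',
    '192.0.2.10 - - "GET /repos/e HTTP/1.1" 200 100',
    '198.51.100.7 - - "GET /repos/f HTTP/1.1" 200 100',
]

if __name__ == "__main__":
    for address, hits in top_clients(sample):
        print(address, hits)
```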

This is the most recent one I've seen:

https://koji.fedoraproject.org/koji/taskinfo?taskID=129611827

That was at around 18:52 to 18:53 UTC, so it might have been right around the time you blocked those IPs?

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 months ago

Yeah chain builds are pretty annoying right now XD

I've adjusted one timeout and restarted varnish on s390x...

From digging:

  • This seems to be only s390x? I think I saw someone saying they had hit it on another arch, but now I cannot find that. If you see it on a non-s390x arch, please let me know.
  • We are in freeze and haven't made any changes that might cause this that I can see.
  • We do apply security updates, but none of them seem related or match up time-wise.

This was an unrelated, very large load spike on the main kojipkgs01/02... I don't think it's the same as the s390x issue and it stopped before I could look at it too much. ;(

So, I have updated and rebooted both kojipkgs01 and kojipkgs02.

That came with a sidetrack: they were still running an f39 kernel. ;( It seems sdubby got installed, which caused new kernels to be installed as EFI. ;(
Removing it, deleting the /boot/efi/ files, and reinstalling the kernel got it working.

I am seeing a lot of load on them still... but perhaps they are handling it better now.

So, again, let me know when you next see this issue (or if it seems gone)

So, I don't see any s390x-specific failures overnight. :)

Will keep this open for a while in case it's just being slower to appear.

Didn't https://pagure.io/fedora-infrastructure/issue/12425 have an error or was that before the fixes you did?

Yes, none of those were last night. All of them were yesterday morning (for me)

I observed that https://koji.fedoraproject.org/koji/taskinfo?taskID=129661276 (Thu, 27 Feb 2025 18:53:44 UTC) failed with

GenericError: Downloaded file http://kojipkgs-cache03.s390.fedoraproject.org//repos/f43-build/6554873/s390x/rpmlist.jsonl doesn't match expected size (1966080 vs 20980475)

so I kicked off another build (mentioned on the devel list) just now (Fri, 28 Feb 2025 03:37:31 UTC), https://koji.fedoraproject.org/koji/taskinfo?taskID=129671299, and was rather horrified to see that the s390x build failed in the same way:

GenericError: Downloaded file http://kojipkgs-cache03.s390.fedoraproject.org/work/tasks/1300/129671300/vtk-9.2.6-36.fc43.src.rpm doesn't match expected size (430833664 vs 638634331)

I canceled the build and am trying again, but it looks like the problem is back.

https://koji.fedoraproject.org/koji/taskinfo?taskID=129666313 (Thu, 27 Feb 2025 23:23:19 UTC) failed with:

GenericError: Downloaded file http://kojipkgs-cache03.s390.fedoraproject.org/work/tasks/6319/129666319/ags-3.6.2.7-1.fc43.src.rpm doesn't match expected size (393216 vs 8755823)

Host buildvm-s390x-20.s390.fedoraproject.org
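
To compare how much of each file actually arrived, the GenericError messages above can be parsed mechanically. This is a small sketch with made-up names. It assumes the first number in the parentheses is the size received and the second is the expected size, which matches the first being smaller in every message quoted here.

```python
import re
from typing import NamedTuple, Optional

SIZE_RE = re.compile(
    r"Downloaded file (\S+) doesn't match expected size \((\d+) vs (\d+)\)"
)


class SizeMismatch(NamedTuple):
    url: str
    received: int  # assumed: first number in the message
    expected: int  # assumed: second number in the message

    @property
    def missing(self) -> int:
        """Number of bytes that never arrived."""
        return self.expected - self.received

    @property
    def fraction_received(self) -> float:
        """Share of the file that arrived, between 0 and 1."""
        return self.received / self.expected


def parse_size_mismatch(message: str) -> Optional[SizeMismatch]:
    """Extract URL and sizes from a size-mismatch message, or return None."""
    m = SIZE_RE.search(message)
    if m is None:
        return None
    return SizeMismatch(m.group(1), int(m.group(2)), int(m.group(3)))


if __name__ == "__main__":
    msg = (
        "GenericError: Downloaded file "
        "http://kojipkgs-cache03.s390.fedoraproject.org/work/tasks/6319/"
        "129666319/ags-3.6.2.7-1.fc43.src.rpm "
        "doesn't match expected size (393216 vs 8755823)"
    )
    result = parse_size_mismatch(msg)
    if result is not None:
        print(result.missing, f"{result.fraction_received:.1%}")
```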

So, both of those are before my comment above at 2025-02-27 00:36:58 UTC, so I think they were before the update/reboot.

I looked at https://koji.fedoraproject.org/koji/builds?state=3&order=-build_id and can't see any later builds that show this issue.


What about https://koji.fedoraproject.org/koji/taskinfo?taskID=129671299, the second build mentioned in https://pagure.io/fedora-infrastructure/issue/12419#comment-958359? That build started at Fri, 28 Feb 2025 03:37:31 UTC.

Yeah, that does seem to be one. ;(

So, not 100% gone, but happening much less often?

Any further cases anyone has seen in the last few days?

I wanted to post an update here earlier today, but forgot. I did 200+ koji builds today and haven't hit this issue once. In my book I'd count that as "fixed" :) Feel free to close this issue.
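
As a rough sanity check on what a clean run of builds can and cannot show: with zero failures in n builds, a per-build failure rate of up to roughly 3/n is still statistically plausible (the "rule of three"). The sketch below computes the exact 95% bound. It assumes builds fail independently and with the same probability, which is a simplification, and the function name is made up for the example.

```python
def failure_rate_upper_bound(clean_runs: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-run failure rate after `clean_runs` successes.

    Solves (1 - p) ** clean_runs = 1 - confidence for p. This assumes
    independent runs with an identical failure probability.
    """
    if clean_runs <= 0:
        raise ValueError("clean_runs must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_runs)


if __name__ == "__main__":
    for runs in (40, 200):
        print(runs, f"{failure_rate_upper_bound(runs):.2%}")
```

So 200 clean builds are consistent with the problem being gone, but also with a remaining failure rate of around one build in a hundred.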

Thanks. :crossed_flags:

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 months ago

Metadata Update from @kevin:
- Issue assigned to kevin

2 months ago

Can confirm. It's not nearly as bad as last week, but: https://koji.fedoraproject.org/koji/taskinfo?taskID=129821400 (keditbookmarks crashing with that same error)

OK, my only remaining theory is that the load on kojipkgs01/02 is causing the s390x cache to do this, yet is not high enough for any of the other arches or external users to hit it.

So, to test that theory, I have taken kojipkgs02 out of external load and blocked kojipkgs01 on the s390x cache machine.

So, it should now get everything from kojipkgs02 which isn't being hit by any external stuff.

If that proves OK, then I think we may need to add more kojipkgs servers or something similar to handle more external load.
If it doesn't help, that means this isn't the problem.

Metadata Update from @kevin:
- Issue status updated to: Open (was: Closed)

2 months ago

So, I reverted all of that this morning, as it turned out I had blocked some CloudFront IPs. ;(

I've made some other adjustments now:

  • switched the s390x cache to use memory instead of disk.
  • increased the varnish ttl from 120s to 180s to improve caching.
  • switched cloudfront to use 'origin shield' to put less load on kojipkgs01/02 (so instead of every cloudfront endpoint hitting them, there's just one endpoint hitting them and syncing that info to the edge caches)
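
A back-of-the-envelope model shows why the last two adjustments reduce load on the origin servers. It is deliberately simplified: it considers one constantly requested object, ignores grace periods, purges, and cache evictions, and uses a hypothetical count of 50 edge locations purely for illustration.

```python
def max_origin_fetches_per_hour(ttl_seconds: int, fetchers: int = 1) -> int:
    """Upper bound on origin fetches per hour for one hot object.

    Each fetcher (a cache that goes to the origin on a miss) refetches the
    object at most once per TTL window. `fetchers` is the number of caches
    that contact the origin directly.
    """
    if ttl_seconds <= 0 or fetchers <= 0:
        raise ValueError("ttl_seconds and fetchers must be positive")
    windows_per_hour = -(-3600 // ttl_seconds)  # ceiling division
    return fetchers * windows_per_hour


if __name__ == "__main__":
    # TTL change from 120s to 180s, for a single cache in front of the origin.
    print(max_origin_fetches_per_hour(120), max_origin_fetches_per_hour(180))
    # Hypothetical 50 edge locations hitting the origin directly,
    # versus one shield in front of it.
    print(
        max_origin_fetches_per_hour(180, fetchers=50),
        max_origin_fetches_per_hour(180, fetchers=1),
    )
```

In this model the longer TTL cuts the worst case for a single cache from 30 to 20 fetches per hour per object, and the origin shield removes the multiplication by the number of edge locations.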

Let's see how it does now... thanks everyone for your patience and reporting things back.

I saw one report after this, but are any folks seeing more?

I have not seen any issues in any of the ~40 koji builds I did today.

In the last couple of days I think I've seen it happen once with frameworks; I am sending them to you on Matrix as I get them :)

One more I just got, ironically on the very last build in the chain:

https://koji.fedoraproject.org/koji/taskinfo?taskID=130017870

Ok, it's been 4 days... any more seen? It sounds like this is now mostly ok, but very occasionally we might still hit it?

I'm going to close this now. Feel free to re-open if you see it happening again and we can try more things.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

a month ago
