#12463 Further improve full compose speed
Opened 24 days ago by adamwill. Modified 18 days ago

  • Describe the issue
    We've massively improved the time a full compose takes over the last few years; it's now down to 2:13:45 for the latest Rawhide compose. This ticket is really just to hold my notes on what's needed to make it any faster.

We now basically have two significant chains:

  1. pkgset (35:20) -> ostree (28:03) -> ostree_installer (1:05:02). total: 2:08:25
  2. pkgset (35:20) -> gather (39:36) -> createrepo (7:07) -> kiwibuild (46:43). total: 2:08:46

The additional ~5 minutes are small things that happen before pkgset and after the image build phase, and between the gather/createrepo/buildinstall phase and the image build phase.

gather->createrepo runs in parallel with buildinstall, which takes 27:58, so if we saved more than ~20 minutes there, buildinstall would become the bottleneck. kiwibuild runs in parallel with createiso, osbuild, live_media and imagebuild, which take 6:30, 18:40, 31:51, and 41:42 respectively, so a 5 minute saving on kiwi would make imagebuild the bottleneck, and 15 mins on kiwi plus 10 mins on imagebuild would make live_media the bottleneck.
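
Here's the same bottleneck arithmetic as a tiny Python sketch, using the rounded timings above; the slack() helper is just for illustration, it's not anything in Pungi:

```python
# Rounded phase times in minutes, taken from the numbers quoted above.

def slack(parallel, target):
    """How many minutes 'target' can shrink before one of the phases
    running in parallel with it becomes the new bottleneck."""
    longest_sibling = max(t for name, t in parallel.items() if name != target)
    return parallel[target] - longest_sibling

# gather -> createrepo runs alongside buildinstall.
prep = {"gather+createrepo": 40 + 7, "buildinstall": 28}
print(slack(prep, "gather+createrepo"))   # 19 -> roughly the "~20 minutes" above

# kiwibuild runs alongside the other image build phases.
images = {
    "kiwibuild": 47,     # 46:43
    "createiso": 7,      # 6:30
    "osbuild": 19,       # 18:40
    "live_media": 32,    # 31:51
    "imagebuild": 42,    # 41:42
}
print(slack(images, "kiwibuild"))         # 5 -> the "5 minutes saving on kiwi"
```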

Overall it seems like the place to attack is pkgset, if we still want to try and shave more time off this. Getting significant gains anywhere else would require improving multiple things at once, at least two or three.

  • When do you need this? (YYYY/MM/DD)
    No particular time.

  • When is this no longer needed or useful? (YYYY/MM/DD)
    When Konflux eats Pungi in the glorious future!

  • If we cannot complete your request, what is the impact?
    Composes take longer.


The other possibility is moving more image build phases ahead of the gather/createrepo/buildinstall phase, but I suspect that's not possible; I don't think any image builds aside from the ostree installer can work before gather/createrepo happen.

I don't think we are likely to be able to make gather/createrepo much faster easily. It's already hardlinking. I suppose we could drop a bunch of packages? ;)

This may shift around if we move more things to kiwi and drop imagebuild/live_media entirely...

Those run in parallel, so I don't expect a huge impact, even if kiwi is a bit faster. And improving anything on that chain alone doesn't help because the ostree chain will still be the same length. That's why I say pkgset is the obvious place to attack if anywhere. It's the only remaining significant single bottleneck.

I wonder if we could make it do debuginfo / source after the rest, at the same time as the next set of things? (i.e. we don't use debuginfo or source to build anything, we just need them to be there at the end)

ooh, that sounds interesting, yeah.

So, looking at the pkgset phase: perhaps surprisingly, it spends ~20 minutes just constructing per-arch package sets - the codepath is phases.pkgset.common.populate_arch_pkgsets -> phases.pkgset.PackageSetBase.subset -> phases.pkgset.PackageSetBase.merge. That looks kinda attackable to me.

It does four arches, one after the other. ppc64le, aarch64 and s390x take about four minutes each. x86_64 takes about 8-9 minutes. That's about 20 mins total.

One idea I had was to do them all in parallel using threads, instead of one at a time. We already do this for the previous phase, when we're "Processing" packages somehow ("Package set: spawning 10 worker threads"). We could probably do the same here, spawning a thread per arch and doing them all at once; that should cut the total down to ~8 minutes.
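
A minimal sketch of that idea, assuming a wrapper around the existing per-arch subset() call; the function names and the subset() arguments here are illustrative, not the actual Pungi code:

```python
from concurrent.futures import ThreadPoolExecutor

def populate_arch_pkgsets_parallel(global_pkgset, arches):
    """Build the per-arch package sets in parallel instead of serially."""

    def populate_one_arch(arch):
        # Roughly what the current loop does once per arch: pull this
        # arch's packages, plus noarch and src, out of the global set.
        return arch, global_pkgset.subset(arch, [arch, "noarch", "src"])

    package_sets = {}
    # One worker per arch. If the per-arch work can actually overlap, the
    # total drops from the sum of the per-arch times (~20 min) to roughly
    # the slowest single arch (~8-9 min).
    with ThreadPoolExecutor(max_workers=len(arches)) as pool:
        for arch, pkgset in pool.map(populate_one_arch, arches):
            package_sets[arch] = pkgset
    return package_sets
```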

Then I thought, hmm, how about we flip the whole approach on its head: loop over the global package set's per-arch package lists once each, rather than creating one subset per arch at a time and looping over the relevant global package lists for each primary arch (which means we hit noarch and src four times each). https://pagure.io/pungi/pull-request/1794 is a pretty speculative PR for that. I will try and find time over the weekend or on Monday to work on it further and test it out somehow (I need to find somewhere I can run hacked-up Pungi code on a real-ish set of packages without hurting anything).
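
For the record, a very rough sketch of what that flipped approach looks like; attribute names like rpms_by_arch and pkg.nvra are placeholders rather than the real PackageSetBase internals, and multilib/compat arch handling is ignored entirely:

```python
def populate_arch_pkgsets_single_pass(global_pkgset, primary_arches):
    """Walk each per-arch list in the global package set exactly once and
    fan packages out to every primary arch that wants them, instead of
    re-scanning noarch and src once per primary arch."""
    package_sets = {arch: {} for arch in primary_arches}

    # Assume the global set keeps packages keyed by RPM arch, e.g.
    # {"x86_64": [...], "noarch": [...], "src": [...], ...}.
    for rpm_arch, packages in global_pkgset.rpms_by_arch.items():
        if rpm_arch in ("noarch", "src"):
            targets = primary_arches        # every primary arch gets these
        elif rpm_arch in primary_arches:
            targets = [rpm_arch]            # only the matching arch
        else:
            continue                        # skipping multilib/compat arches here

        for pkg in packages:
            for target in targets:
                package_sets[target][pkg.nvra] = pkg

    return package_sets
```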

Metadata Update from @phsmoura:
- Issue tagged with: medium-gain, medium-trouble, ops

21 days ago

https://pagure.io/pungi/pull-request/1796 should save 20 minutes out of the pkgset phase. That was a lot of fun.

Yup, that fix is good.

  • 1119.n.0 compose: Compose phase PKGSET time: 0:35:19
  • 1120.n.0 compose: Compose phase PKGSET time: 0:18:09

But now I'm noticing something else - the ostree container phase time seems pretty variable:

  • 1119.n.0 compose: Compose phase OSTREE_CONTAINER time: 1:02:44
  • 1120.n.0 compose: Compose phase OSTREE_CONTAINER time: 2:57:34

Not sure what's going on there, will dig into it.

In 1119.n.0, Kinoite ostree container on ppc64le took just over an hour:

2024-11-19 05:50:52 [INFO    ] [BEGIN] OSTree container phase for variant Kinoite, arch ppc64le
2024-11-19 06:53:31 [INFO    ] [DONE ] OSTree container phase for variant Kinoite, arch ppc64le

In 1120.n.0, it took nearly three hours:

2024-11-20 05:33:44 [INFO    ] [BEGIN] OSTree container phase for variant Kinoite, arch ppc64le
2024-11-20 08:31:13 [INFO    ] [DONE ] OSTree container phase for variant Kinoite, arch ppc64le

Here are the tasks: 1119.n.0, 1120.n.0. The faster one ran on buildvm-ppc64le-18.iad2.fedoraproject.org, the slower one on buildvm-ppc64le-27.iad2.fedoraproject.org. This has echoes of the previous outlier we found, where server_kvm was taking two hours on ppc64le sometimes - https://pagure.io/releng/issue/12398. It really seems like there's some kind of general issue going on here with storage speed or something...

So I'm looking at the logs on buildvm-ppc64le-27.iad2.fedoraproject.org during the time the build was happening, and there seems to have been some sort of epic ansible operation going on. It starts logging ansible activity around 4:35 and it continues till 7:16 - that's nearly three hours of ansiblizing.

I don't know if this is chicken or egg - is the slow ansiblizing a symptom of whatever problem we're encountering here, or the cause of it - or unrelated but weird? It was happening on buildvm-18 during the faster build too, so maybe unrelated, but...still, it's odd. There seems to be something specific to ppc64le there, too. There's an ansible run on all buildvms starting at 4:34 every day, it seems like, but on x86_64 and aarch64 VMs (randomly sampled) it takes less than an hour and logs 400-600 sshd messages. On ppc64le it takes over two hours and logs over 2500 sshd messages. For some reason we're doing a lot more 'work' on ppc64le buildvms specifically; I don't know why yet.

The ansible stuff on the whole looks like a bit of a dead end. So I'm trying to draw patterns in the task history, as much as I can. Might be easier with a bit more data. Right now, ppc64le builds have been working again for four days - before that they were broken because of the thing with firefox. While they were broken, the ostree_container phase took around 2000-2100 seconds; since ppc64le came back, it's ranged from 3765 to 10655 seconds. So ppc64le builder performance is definitely bottlenecking us somehow, here.

Each day we build Silverblue and Kinoite for ppc64le, so we have eight builds to look at. Here are the details:

20241117:
Kinoite: https://koji.fedoraproject.org/koji/taskinfo?taskID=125940507 - buildvm-ppc64le-18 - 2:12:48
Silverblue: https://koji.fedoraproject.org/koji/taskinfo?taskID=125940513 - buildvm-ppc64le-33 - 0:25:08
20241118:
Kinoite: https://koji.fedoraproject.org/koji/taskinfo?taskID=125974963 - buildvm-ppc64le-18 - 2:03:09
Silverblue: https://koji.fedoraproject.org/koji/taskinfo?taskID=125974969 - buildvm-ppc64le-27 - 2:10:34
20241119:
Kinoite: https://koji.fedoraproject.org/koji/taskinfo?taskID=126004534 - buildvm-ppc64le-18 - 1:02:33
Silverblue: https://koji.fedoraproject.org/koji/taskinfo?taskID=126004538 - buildvm-ppc64le-27 - 0:54:23
20241120:
Kinoite: https://koji.fedoraproject.org/koji/taskinfo?taskID=126043901 - buildvm-ppc64le-27 - 2:57:26
Silverblue: https://koji.fedoraproject.org/koji/taskinfo?taskID=126043905 - buildvm-ppc64le-01 - 0:36:38

6/8 jobs hit vm-18 or vm-27. Four of those took over 2 hours; the other two took around an hour. The two jobs that didn't hit those hosts were the fastest, both Silverblue jobs that took around half an hour.

We have five hosts for ppc64le vmbuilders. vm-01 (which caught one build and did it in 36 minutes) is on host-01. vm-18 (three builds, 2:12, 2:03, 1:02) is on host-02. vm-27 (three builds, 2:10, 0:54, 2:57) is on host-04. vm-33 (one build, 0:25) is on host-05.

All the hosts seem to be identical in hardware terms AFAICS except for 05, which is very much not like the others. They all have 256G of RAM, but 01-04 are PowerNV 9006-22P with 128 'processors' and, more to the point, 8x 4TB 7200RPM hard disks each (ugh). 05 is PowerNV 9183-22X with 160 processors, and has 8x 240GB SSDs (yay).

01 through 04 host 8 buildvms each...so they've effectively got one spinning rust hard disk of storage performance per buildvm (although not really even that, probably, because of overheads). 05 hosts one buildvm (33) and two osbuild buildvms. So...05 is a more powerful system, and has way faster storage, but is hosting five fewer VMs than 01 through 04? This feels weird.

Anyhow, I do suspect this is to do with storage contention on the 01-04 hosts, at this point. I slapped iotop on 03 and watched it for a while, and saw spikes of over 1000 "M/s", which, assuming it's megabytes, is more than 8 hard disks can sustain, I think. So it definitely kinda looks like those machines are struggling for storage bandwidth. Just interacting with that host 'feels' slow, too - it's slow to ssh into, slow to type on, installing iotop took forever. 05 responds extremely fast, not surprisingly.

So...Kevin's idea to hook those hosts up to iSCSI storage seems like a good one for sure, if we can't upgrade them to SSDs. Could we also look into rebalancing a little? It seems weird that 05 is carrying such a light load for such a powerful box. For instance, we could take one buildvm each off the other hosts and give them all to 05; then all five hosts would be hosting seven VMs each. Maybe total storage capacity is an issue, though?

05 has vastly smaller space. There's only 300GB free... enough for perhaps one more VM.
05 has coreos/osbuild/etc.

In theory the phases could be reordered so that image building/kiwibuild would happen after pkgset. Currently it installs packages from the corresponding variant repo, but it could also install from the pkgset repos. Buildinstall and ostree already use those.

In theory, it could result in, e.g., a Server image containing something that is not in the Server repo, but I don't think it's a big worry for Fedora, as most images are using Everything anyway.

Gather could likely be sped up by implementing a dnf5 backend. Right now it's using the dnf4 API, but it does the resolution directly itself rather than using DNF transactions. Using DNF5 to actually resolve the transactions, instead of just querying repos, might have interesting performance outcomes.

Would the reordering be troublesome for RHEL (cause images to include things from variants they should not)? On the whole I'm not super sure I love that idea, at least until we have nothing else left to optimize. Personally speaking I like the idea of being able to restrict image builds to packages from variants and might lobby to use it more extensively in future (spoiler alert...)

Speeding up gather would be fun in theory, but right now it doesn't help at all (as we're bottlenecked on ostree_container, which doesn't run after gather). If we solve the ppc64le storage bandwidth issue, we can re-evaluate the two paths after that and see which is the bottleneck and hence where we can attack.

We do still have ~20 minutes of pkgset to look at, which would be a time saving on every possible path. I did profile it; the result is messy, but I think a lot of what remains is Koji queries. It might be interesting to dig into that and see if there's anything to optimize, but I'm trying not to get distracted from other stuff I ought to be working on at a bit higher priority :P
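
Not something I'm pursuing right now, but for the record: if the remaining time really is lots of small Koji calls, batching them with Koji's multicall support is the usual way to cut round-trips. A purely illustrative sketch - the tag and the listRPMs query are hypothetical examples, not what Pungi actually does:

```python
import koji

session = koji.ClientSession("https://koji.fedoraproject.org/kojihub")
builds = session.listTagged("f42", latest=True)   # hypothetical example tag

# Classic koji multicall pattern: queue up the calls, then send them all
# in a single XML-RPC round-trip instead of one request per build.
session.multicall = True
for build in builds:
    session.listRPMs(buildID=build["build_id"])
results = session.multiCall()

# Each entry is [value] on success, or a fault dict on error.
for build, result in zip(builds, results):
    if isinstance(result, dict):
        continue
    print(build["nvr"], len(result[0]))
```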

