We are having some issues with webkit2gtk3 builds: they take a huge amount of time to complete, run out of memory, and the builders restart due to what appears to be kojid getting OOM-killed. Would it be possible to add it to the koji heavybuilder channel to see if that helps with these issues?
Thanks!
I'd prefer to avoid this if we can... I set up the heavybuilder channel for chromium, and if we add more things to it, they are going to compete.
One issue with the OOM hitting is that systemd-oomd managed to get re-enabled when I reinstalled the builders. I just disabled it globally. Can we see if that helps?
Metadata Update from @phsmoura: - Issue assigned to kevin - Issue tagged with: medium-gain, medium-trouble, ops
Sure, we can give it a try. As long as the builds succeed reliably, I'll be happy.
It looks like your change almost certainly fixed x86_64, which last restarted shortly before your last comment.
I'm still waiting on my very slow s390x and aarch64 builds to complete, but I'm optimistic. (These builds had not been restarting before.)
I've provided a negative update on s390x in #10910.
This build just restarted again, after running for 70 hours. My guess is the builder doesn't have enough RAM, but it's impossible to know for sure: because the job restarts instead of failing, the log is discarded, and I never get a failed log with an error message showing what happened.
s390x builders 01->14 have 8 GB RAM and 6.6 GB swap; builders 15->35 have 12 GB RAM and 8 GB swap. They all seem to have 2 CPUs.
I saw my local build using 12 GB when processing debuginfo a couple days ago. :/
You are correct and it is a memory issue:
Jul 22 12:00:06 buildvm-s390x-19.s390.fedoraproject.org systemd-oomd[472110]: Killed /system.slice/kojid.service due to memory used (13623361536) / total (13685317632) and swap used (774157>
I am trying to see if I can get anything else other than that.
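If it helps for future triage, the figures in that systemd-oomd kill message are raw bytes. A quick shell sketch (sample line trimmed from the journal entry above) converts them to GiB:

```shell
# Sketch: convert the byte counts in the oomd kill message quoted above to GiB.
# The sample line is trimmed from that journal entry.
line='Killed /system.slice/kojid.service due to memory used (13623361536) / total (13685317632)'
echo "$line" | awk -F'[()]' '{ printf "%.1f GiB used of %.1f GiB total\n", $2 / 1024 / 1024 / 1024, $4 / 1024 / 1024 / 1024 }'
```

So kojid's cgroup was at essentially 100% of the 12 GB-class builder's memory when oomd fired.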
OK, I am tailing the current build.log so that when it OOMs again I can see what step it was in. I agree with your earlier assessment that the build will probably never succeed.
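For anyone else watching a doomed build: the trick is to keep a copy of build.log off the builder, since the log is lost when the task restarts. The ssh target and mock result path below are assumptions, not the exact paths in use; the runnable part just demonstrates the capture idea on a local sample file.

```shell
# Keep an off-builder copy of the build log so the failing step survives a
# restart. Real usage would be something like (paths assumed):
#   ssh buildvm-s390x-19.s390.fedoraproject.org \
#     'tail -F /var/lib/mock/*/result/build.log' | tee oom-capture.log
# Demo of the same capture idea against a local sample log:
printf 'compiling Source/WebCore\nextracting debug info\n' > /tmp/sample-build.log
tail -n 1 /tmp/sample-build.log | tee /tmp/oom-capture.log
```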
How to solve this in either the short term or long term is not clear, as it has high tradeoffs which need to be negotiated at a level higher than this ticket.
1. I do not think there is any extra memory or CPU which can be given to a builder without turning off existing builders.
2. Turning off existing builders to make the remaining ones 'bigger' may help for large builds, but causes small builds to pile up due to the lack of builders. [Doing this in the middle of a mass rebuild would not be good either.]
3. Excluding this architecture from this package would have a pile-on effect for a lot of other packages.
Note system requirements might be about to increase again as we still need to add the GTK 4 build. See https://bugzilla.redhat.com/show_bug.cgi?id=2108905 where we are working on that. The build and install process itself should not require any extra RAM, since we only do one at a time and can use %limit_build to control resource usage. But I'm suspicious that the RAM requirement for debuginfo processing might increase. It seems like the sort of thing that should be done for one library at a time and therefore it should be perfectly fine, but not sure if that's true in reality.
Argh. systemd-oomd is not supposed to be running. ;(
It seems to have gotten re-enabled somehow. I am going to clean that up now.
ok. systemd-oomd is gone now. It was definitely firing on builds on the s390x builders (at 90% memory). ;(
So, let's see if this next one finishes this weekend. If not, then yeah, I guess we will need to rebalance things to provide more memory to fewer builders. :(
Also, I see we are hitting this:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free  buff   cache  si  so   bi   bo   in   cs us sy id wa st
 3  0 120784 3428364   596 7705440   0   0    0    0  214   71 73  3  0  0 25
 2  0 120784 3345204   596 7705440   0   0    0    0  244   93 84  2  0  0 15
 2  0 120784 3219456   596 7705440   0   0    0    0  212   57 86  2  0  0 11
 2  0 120784 3116136   596 7705440   0   0    0    0  228  100 73  3  0  0 25
 2  0 120784 2995932   596 7705440   0   0    0    0  216   65 67  3  0  0 31
11 to 31% of CPU is 'stolen'. What this means is that there are other LPARs on that mainframe that have higher priority and are getting those CPU cycles. Typically we have seen this when there's heavy testing/development of RHEL or other internal Red Hat things. ;(
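For reference, averaging the 'st' (steal) column from that vmstat sample, with the five values hard-coded from the output above:

```shell
# Average the 'st' (steal) column from the vmstat samples quoted above.
printf '25\n15\n11\n25\n31\n' | awk '{ s += $1 } END { printf "avg steal: %.1f%%\n", s / NR }'
```

So on average a fifth of the builder's CPU time is going elsewhere on the mainframe.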
OK, I started a scratch build with all three builds enabled at once to match what I plan to ship in F37 and F38. Interestingly, the s390x build is progressing without trouble. This time it's the x86_64 and the ppc64le builds that are restarting. Perhaps systemd-oomd is running there too?
The good news is that when I built this locally, I did not notice memory consumption greater than 12 GB when processing debuginfo, so I don't think the number of builds affects the RAM requirement; it probably only affects the build time.
Sadly, the s390x build from #10910 continues to restart. I'm impressed that build is still running without failing after 94 hours.
The restart on the s390x left this log I was tailing:
+ /usr/bin/find-debuginfo -j2 --strict-build-id -m -i --build-id-seed 2.37.1-10.fc37 --unique-debug-suffix -2.37.1-10.fc37.s390x --unique-debug-src-base webkit2gtk3-2.37.1-10.fc37.s390x --run-dwz --dwz-low-mem-die-limit 10000000 --dwz-max-die-limit 250000000 -S debugsourcefiles.list /builddir/build/BUILD/webkitgtk-2.37.1
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.0.so.18.21.0
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/bin/WebKitWebDriver
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.1.so.0.2.0
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/lib64/libwebkit2gtk-4.0.so.37.57.0
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/lib64/libwebkit2gtk-4.1.so.0.2.0
EXCEPTION: [KeyboardInterrupt()]
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/mockbuild/trace_decorator.py", line 93, in trace
    result = func(*args, **kw)
  File "/usr/lib/python3.10/site-packages/mockbuild/util.py", line 556, in do_with_status
    output = logOutput(
  File "/usr/lib/python3.10/site-packages/mockbuild/util.py", line 394, in logOutput
    i_rdy, o_rdy, e_rdy = select.select(fds, [], [], 1)
  File "/usr/libexec/mock/mock", line 453, in handle_signals
    util.orphansKill(buildroot.make_chroot_path())
  File "/usr/lib/python3.10/site-packages/mockbuild/trace_decorator.py", line 57, in trace
    @functools.wraps(func)
KeyboardInterrupt
It looks like a lot restarted itself at that time:
[Sat Jul 23 03:14:51 2022] sssd_nss invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[Sat Jul 23 03:14:51 2022] CPU: 1 PID: 2027792 Comm: sssd_nss Not tainted 5.18.9-200.fc36.s390x #1
[Sat Jul 23 03:14:51 2022] Hardware name: IBM 8561 LT1 400 (KVM/Linux)
[Sat Jul 23 03:14:51 2022] Call Trace:
....
[Sat Jul 23 03:14:51 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=gdb.minimal,pid=2027553,uid=1000
[Sat Jul 23 03:14:51 2022] Out of memory: Killed process 2027553 (gdb.minimal) total-vm:11985092kB, anon-rss:5329360kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:23302kB oom_score_adj:0
[Sat Jul 23 03:15:14 2022] oom_reaper: reaped process 2027553 (gdb.minimal), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
I don't think this rpm will be buildable currently.
That looks like 11.4 GiB, right? It is actually slightly less than I expected would be required.
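The arithmetic, for the record (the total-vm figure in the oom-kill line is in kB):

```shell
# Convert total-vm:11985092kB from the oom-kill line above into GiB.
awk 'BEGIN { printf "%.1f GiB\n", 11985092 / 1024 / 1024 }'
```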
systemd-oomd is off everywhere.
I took down 2 s390x builders and added their memory to 2 others and moved your s390x builds over to them.
I moved your x86_64 one that was failing to a buildhw box (with a lot more memory/cpu).
On ppc64le, I can make a few more bigger builders, but not sure I will have time this weekend.
I'd still prefer to avoid a special channel for this, so next week, perhaps I can rebalance things (reduce number of builders and increase memory) until things work.
The other thing we can do is increase the size of the generated debuginfo. We currently do this:
# Increase the DIE limit so our debuginfo packages can be size-optimized.
# Decreases the size for x86_64 from ~5G to ~1.1G.
# https://bugzilla.redhat.com/show_bug.cgi?id=1456261
%global _dwz_max_die_limit 250000000
That hugely reduces the package size, but also hugely increases the RAM required to process the debuginfo. It would be nice to not need to change that, though.
All my builds are green now! Thanks.
ok, great. I am going to try and rebalance things so that these builds are ok, but we don't have to use a special channel.
Will try and do that this week...
This F35 armv7hl build might need help. Looks like it restarted at least once so far: https://koji.fedoraproject.org/koji/taskinfo?taskID=90197240
Although it has been churning fine for the past 9 hours, so maybe it will be OK. (Note this is a stable branch build, so no increased system requirements relative to what has historically been required.)
This looks to be a different ongoing issue. Please see https://pagure.io/fedora-infrastructure/issue/10833
I don't think so. It was building C++ files when I checked last night. When I checked this morning, after noticing that it had restarted, it was again building C++ files. Sadly, it has just restarted again 30 minutes ago.
There is no way to know for sure how many times it has restarted total, but I see that it is currently building C++ files again. So I assume it is running out of memory.
OK, several builders came up in the wrong memory configuration and only had 2.9GB. Those have all been rebooted into the 40GB mode. However, on this 32-bit architecture I believe any single process is limited to 4GB of address space, so if the debuginfo step grows beyond that, I don't think the process would complete.
That won't be a problem. This package disables debuginfo on armv7hl for exactly this reason.
The build eventually completed.
So, I am having issues with this too.
First it started getting stuck and restarting; then @kevin wanted to try a z/VM builder, which got right up to extracting debuginfo before being killed and restarted again :/
It is this task: https://koji.fedoraproject.org/koji/taskinfo?taskID=90394095
Has the ICU soname bump been merged into rawhide yet? That is, is the distro broken currently? Or is this "merely" blocking ICU from being updated via a side tag?
Note that webkit2gtk3 is going to be retired imminently, see #10919. The goal is to retire this package before rawhide branches from F37, which is scheduled for early next week. The new webkitgtk package will provide the same libraries.
Has the ICU soname bump been merged into rawhide yet?
Yes.
That is, is the distro broken currently? Or is this "merely" blocking ICU from being updated via a side tag?
The distro shouldn't be broken (it seems that, apart from the R stack and eln, there wasn't any breakage); there is a compat library, libicu69. Packages not rebuilt against the new ICU should pick that up on the user/client side.
Hmm, that's good to know. In that case, rebuilding webkit2gtk3 isn't that important. I just wanted to avoid having two ICU libs shipped on Workstation/other deliverables.
We seem to be good on all platforms except s390x. Frantisek's build -- ongoing for 74 hours -- and my build just keep restarting and wasting resources. 12 GB of RAM is just too low. My possibly-incorrect suspicion is that we are very close to the required amount of RAM, and just a little more would suffice. I bet 16 GB would be safe.
I am doing some napkin calculations here, so my numbers may be off. We have 2 virtual z machines we are running on. One is set up in a z/VM layout and the other is a KVM layout. The KVM system is easiest to read: we have 48 CPUs? and 256 GB of RAM. That system has 21 virtual machines with 2 CPUs/12 GB of RAM each. We would need to shut down at least 6 systems on this box to bring the memory up to 16 GB per host. That would get us 15 systems with 16 GB RAM and 3 CPUs per system.
The z/VM system would probably need an equivalent number of shutdown systems, so we would go from 35 to 23 builders. I do not know if that would improve overall build times, due to the lack of resources there. And if 16GB is only good now but not in 6 months, do we have to drop the number of builders even more? [Even if we were to make a 'big system', we would have to drop the number of builders by an equivalent count to pull those resources together.]
Again back of a napkin calculations so probably off.
We would need to shut down at least 6 systems on this box to bring up the memory to 16GB per host. That would get us 15 systems with 16GB ram and 3 CPU's per system.
An alternative would be 18 VMs at 2 vCPUs and 14 GB of RAM each... but surely 15 VMs with 3 vCPUs would perform better overall.
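Re-running the napkin math for both layouts, assuming the ~252 GB guest pool implied by the 21 x 12 GB figures above:

```shell
# 21 KVM guests at 12 GB each is a ~252 GB pool; divide it across the two
# layouts discussed in this ticket.
awk 'BEGIN {
  pool = 21 * 12                          # GB currently allocated to guests
  printf "15 VMs -> %.1f GB each\n", pool / 15
  printf "18 VMs -> %.1f GB each\n", pool / 18
}'
```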
And if 16GB is only good now but not in 6 months do we have to drop down the number of builders even more?
Eventually, yes. Build requirements are constantly increasing. But they increase relatively slowly, so I would expect it should be longer than 6 months before we need to increase requirements again.
Problem is I honestly do not know how much RAM is required to build successfully on s390x. Because the build log is lost each time the job restarts, I don't even know where it's dying. I'm only guessing that it dies when processing debuginfo. If it happens when compiling, then in theory we could fix things by adjusting the %limit_build macro. I don't think that's what's happening, though.
Another option would be to adjust the debuginfo optimization:
That's probably a terrible idea because, well, you can see how much it would bloat the generated RPMs. But I believe it would reduce memory usage during the build.
Yet another option would be to try -g1 instead of -g, but that would truly make s390x a second-class citizen. Not fond of that.
So I was wrong about the other set of systems. Frantisek's build is on the older z/VM system, which only has 8 GB of RAM per builder and 6 GB of swap. The build is crashing constantly due to OOM kills:
[2535842.291988] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=gdb.minimal,pid=2342849,uid=1000
[2535842.292011] Out of memory: Killed process 2342849 (gdb.minimal) total-vm:9699380kB, anon-rss:3938136kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:18718kB oom_score_adj:0
It will never complete and needs to be moved to a different builder.
I think it's worth a try to have builders with 3 CPUs and 16 GB memory.
It's not impossible that it would actually improve packager experience by allowing builds to complete faster. I guess in the best case 3 CPUs instead of 2 would make it possible for tasks to complete in 2/3 of the time, which then frees up the builder sooner and makes up for fewer builders?
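A napkin check of that claim, assuming perfect CPU scaling (20 builders x 2 vCPUs today versus 15 x 3 after rebalancing):

```shell
# With perfect scaling a 3-vCPU task takes 2/3 the wall time of a 2-vCPU one,
# and the aggregate vCPU pool actually grows slightly after rebalancing.
awk 'BEGIN {
  before = 20 * 2                         # total vCPUs across builders today
  after  = 15 * 3                         # total vCPUs after rebalancing
  printf "vCPU pool: %d -> %d (ratio %.3f)\n", before, after, after / before
}'
```

So in the ideal case the rebalanced layout has about 12% more aggregate CPU, though real builds won't scale perfectly.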
I'm going to suggest that builder just needs to be turned off altogether.
BTW although it would be nice to see Frantisek's build complete, it's also OK to just cancel it. I will retire that package as soon as my build completes. I think my build is likely stuck in the same doom loop, though.
That would turn off ~15 builders... thus making s390x a bigger headache than you expect.
What needs to be done is work out the horse trading between 'what is expected', 'what is needed', and 'what is reasonable' given the limited resources Fedora has.
Strawman starting point proposal: if an architecture provides (a) less than 16 GB of RAM per standard builder, or (b) less than 2 GB of RAM per vCPU, then all large C++ projects (at least Firefox, Thunderbird, LibreOffice, Inkscape, WebKitGTK, QtWebKit, and QtWebEngine) should be built on a heavy builder channel instead of the standard channel. If the goal is to avoid moving projects to a heavy builder channel, then we should turn off however many builders necessary to achieve that. If a particular architecture slows down too much, we can put out a call for hardware donations, and if no company is interested in providing hardware, shut it down. Architecture support is ultimately a community decision, after all.
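As a sanity check, here is the strawman criteria applied to the s390x KVM builder specs quoted earlier in this ticket (12 GB RAM, 2 vCPUs); the script is illustrative only, not a proposed implementation:

```shell
# Evaluate the strawman criteria: (a) < 16 GB per builder, or
# (b) < 2 GB of RAM per vCPU => large C++ projects go to a heavy channel.
awk 'BEGIN {
  ram_gb = 12; vcpus = 2                  # pre-rebalance s390x kvm builder
  if (ram_gb < 16 || ram_gb / vcpus < 2)
    print "heavy channel"
  else
    print "standard channel"
}'
```

Under this rule the pre-rebalance builders fail criterion (a) (12 GB < 16 GB), even though their 6 GB per vCPU comfortably passes (b).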
So, some clarification:
buildvm-s390x-01 to buildvm-s390x-14 are z/vm builders. It's not one big lpar, it's 14 of them. We have no way to change the resource allocation on those, it would require mainframe admins to change anything there.
buildvm-s390x-15 to buildvm-s390x-35 are kvm vm's. They are on buildvmhost-s390x-01, which is a single large instance, and guests run on it like on other virthosts. Here we can reallocate the resources as we like.
koji has a concept of 'capacity' for builders. All the z/vm instances are set to a 2.0 capacity, and all the kvm ones are set to 3.0. So, larger builds will always try and run on the kvm instances.
So, for the kvm hosts: we have 20 of them currently with 2 cpus and 13gb mem each. We can move that to 15 builders with 17gb memory and 3 cpus? But I worry that adding another cpu will increase memory needs, as 3 big c++ threads might suck up more memory, but I suppose we can try it.
I'll try and do that later today after nest... all the builds may restart again as I redo builders.
But I worry adding another cpu will increase memory needs as 3 big threads of say c++ might suck up more memory, but I suppose we can try it.
Of course going from 2 vCPUs to 3 vCPUs will increase memory needs during compile time by 50%. Now that we have LTO, I believe it will also increase memory needs during linking correspondingly. But if my theory that it's failing when processing debuginfo is correct, then that's not the problem here. WebKit should really only need 2-3 GB of RAM per vCPU to compile, and our builders are all way past that point.
ok. I rebalanced them.
There's now 15-30 and they have 17gb mem and 3 cpus.
I restarted the webkitgtk build.
Lets see how this goes.
Well it died due to failure to connect to the builder. :D I will restart it.
Now it seems the ppc64le build has unexpectedly restarted: https://koji.fedoraproject.org/koji/taskinfo?taskID=90503790. Surprise. :/
We just can't win. ;)
buildvm-ppc64le's have 20GB memory and 8 cpus...
Surely that ought to be more than enough. :/ At least it finished on the second try.
Sadly, s390x is still in the doom loop.
I could try increasing %limit_build to require 3 GB of RAM per vCPU instead of 2 GB. But that would just slow things down for no benefit if it is not actually OOMing during the compile or link step. Without evidence of where exactly it hits OOM, I'll hold off on that.
So, I watched one cycle and it ended in:
oh dammit.
systemd-oomd got restarted by socket activation!
Let me mask it. pesky thing.
My build has succeeded! \o/
Another ppc64le build restart: https://koji.fedoraproject.org/koji/taskinfo?taskID=90614841
It's weird... really hard to believe 20 GB would not be enough. :/
s390x is restarting again too. My current build is here: https://koji.fedoraproject.org/koji/taskinfo?taskID=90614616
Is it possible to investigate to see where exactly it is dying? If it dies when compiling or linking, I will adjust %limit_build to tone down the parallelism. If it dies processing debuginfo, that probably just means we need even more RAM?
I am tailing the current log, so I should hopefully get a capture of if (let us be honest, when) it breaks again.
The ppc64le build failed due to a normal OOM, filling up 20GB of RAM and 8GB of swap:
[Tue Aug 9 01:05:50 2022] [c0000000a2337c10] [c00000000023e3c0] sync_hw_clock+0x150/0x310
[c0000000a2337c90] [c00000000017e3cc] process_one_work+0x2ac/0x570
[c0000000a2337d30] [c00000000017ed98] worker_thread+0xa8/0x630
[c0000000a2337dc0] [c00000000018b054] kthread+0x124/0x130
[c0000000a2337e10] [c00000000000ce64] ret_from_kernel_thread+0x5c/0x64
Instruction dump:
3b600500 fba1ffe8 fbc1fff0 3b800a00 3bc00002 fbe1fff8 3ba00f00 3be00003
f8010010 f821fe21 38610020 480602dd <60000000> 39200000 e9410128 f9210158
[Tue Aug 9 07:31:11 2022] cc1plus invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
CPU: 6 PID: 1162682 Comm: cc1plus Tainted: G L 5.18.9-200.fc36.ppc64le #1
Call Trace:
[c0000000177eb650] [c000000000a7b920] dump_stack_lvl+0x74/0xa8 (unreliable)
[c0000000177eb690] [c0000000003faad0] dump_header+0x64/0x24c
[c0000000177eb710] [c0000000003f8ec4] oom_kill_process+0x344/0x350
[c0000000177eb750] [c0000000003fa448] out_of_memory+0x228/0x780
[c0000000177eb7f0] [c000000000488928] __alloc_pages+0x10d8/0x11a0
[c0000000177eb9d0] [c0000000004bed64] alloc_pages+0xd4/0x230
[c0000000177eba10] [c0000000004beef4] folio_alloc+0x34/0x90
[c0000000177eba40] [c0000000003f1aa4] __filemap_get_folio+0x294/0x650
[c0000000177ebaf0] [c0000000003f21f8] filemap_fault+0x398/0xae0
[c0000000177ebba0] [c00000000044fcb4] __do_fault+0x64/0x2a0
[c0000000177ebbe0] [c0000000004542ac] __handle_mm_fault+0x103c/0x1d90
[c0000000177ebce0] [c000000000455128] handle_mm_fault+0x128/0x310
[c0000000177ebd30] [c000000000087fd4] ___do_page_fault+0x2a4/0xb90
[c0000000177ebde0] [c000000000088b30] do_page_fault+0x30/0xc0
[c0000000177ebe10] [c000000000008ce0] instruction_access_common_virt+0x190/0x1a0
--- interrupt: 400 at 0x1117fa88
NIP: 000000001117fa88 LR: 00000000111ae090 CTR: 00007fff98a84a30
REGS: c0000000177ebe80 TRAP: 0400 Tainted: G L (5.18.9-200.fc36.ppc64le)
MSR: 800000004280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 88002883 XER: 00000004
CFAR: 00000000111ae08c IRQMASK: 0
GPR00: 00000000111ae078 00007ffff0128140 0000000011d17a00 00007fff2a59ea40
GPR04: 00007fff2a59ea00 00000000119e5c98 00000000473e35b0 0000000000000034
GPR08: 0000000000000078 0000000045bfd49c 0000000045bfd49d 0000000045bfd499
GPR12: 00007fff98a84a30 00007fff9921cd40 0000000000000000 00000000119e5c38
GPR16: 00000000119e5d68 00007fff2a59ea00 0000000011d59b08 0000000011776520
GPR20: 00000000119e5cc0 0000000011772648 00000000119e5c98 0000000011dee310
GPR24: 00007fff254c1c00 0000000011782f50 000000000000000a 0000000000000004
GPR28: 00007fff253b30d0 0000000011d4f098 0000000000000002 00007fff253b31d8
NIP [000000001117fa88] 0x1117fa88
LR [00000000111ae090] 0x111ae090
--- interrupt: 400
Mem-Info:
active_anon:263487 inactive_anon:21323 isolated_anon:0 active_file:0 inactive_file:82 isolated_file:0 unevictable:62 dirty:0 writeback:0 slab_reclaimable:2820 slab_unreclaimable:6300 mapped:25 shmem:62 pagetables:217 bounce:0 kernel_misc_reclaimable:0 free:1031 free_pcp:255 free_cma:0
Node 0 active_anon:16863168kB inactive_anon:1364672kB active_file:0kB inactive_file:5248kB unevictable:3968kB isolated(anon):0kB isolated(file):0kB mapped:1600kB dirty:0kB writeback:0kB shmem:3968kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:4784kB pagetables:13888kB all_unreclaimable? no
Node 0 Normal free:65984kB boost:49152kB min:71680kB low:92416kB high:113152kB reserved_highatomic:0KB active_anon:16863168kB inactive_anon:1364672kB active_file:1024kB inactive_file:0kB unevictable:3968kB writepending:0kB present:20971520kB managed:20812800kB mlocked:3968kB bounce:0kB free_pcp:16320kB local_pcp:2624kB free_cma:0kB
lowmem_reserve[]: 0 0 0
Node 0 Normal: 552*64kB (UME) 239*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 65920kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
238 total pagecache pages
41 pages in swap cache
Swap cache stats: add 3350394, delete 3350428, find 718844/1746745
Free swap = 0kB
Total swap = 8388544kB
327680 pages RAM
0 pages HighMem/MovableOnly
2480 pages reserved
0 pages cma reserved
0 pages hwpoisoned
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm    rss pgtables_bytes swapents oom_score_adj name
[     680]   193     680     560     74  30208     75     0 systemd-resolve
[     690]     0     690     666     57  28416     29 -1000 auditd
[     725]     0     725     511     53  25600     89     0 systemd-logind
[     728]    81     728     270     37  27648     41  -900 dbus-broker-lau
[     748]    81     748     159     61  22528     27  -900 dbus-broker
[     757]     0     757    4318     56  29952     86     0 NetworkManager
[     772]     0     772    2035     14  29696     71     0 gssproxy
[     787]     0     787     197     52  27136     18     0 crond
[     800]     0     800      69     17  26112      5     0 agetty
[     801]     0     801     119      4  22272      5     0 agetty
[     857]   998     857   45974     43  54528    121     0 polkitd
[    7212]     0    7212      65     61  26112      0 -1000 watchdog
[    7476]     0    7476     862     21  28416     44     0 master
[    7478]    89    7478     885     46  29440     49     0 qmgr
[    7507]     0    7507      84     26  26112     12     0 rpc.idmapd
[    7513]     0    7513     387     29  23040     38     0 rpc.gssd
[    7515]    32    7515     324     29  28160     41     0 rpcbind
[    7517]    29    7517     243     33  27136     32     0 rpc.statd
[    7520]     0    7520     134     29  26624     16     0 nfsdcld
[    7521]     0    7521     317      7  23552     40     0 rpc.mountd
[    7561]   996    7561    1427     36  28416     31     0 chronyd
[   13199]     0   13199     245      3  23296     30     0 oddjobd
[   13416]     0   13416    1004     46  31744    120     0 sssd
[   13418]     0   13418    1523     41  36096    100     0 sssd_nss
[   13419]     0   13419    1029     28  27648    119     0 sssd_pam
[   13420]     0   13420     988     30  27392    111     0 sssd_ssh
[   13421]     0   13421     982     37  31488    106     0 sssd_sudo
[   13422]     0   13422    1893     41  30720    157     0 sssd_pac
[   20590]     0   20590     895     65  28416     95  -900 virtlogd
[   22233]     0   22233     440     54  28928     46     0 systemd-machine
[   22320]   423   22320     255     39  23296     24     0 dnsmasq
[   22321]     0   22321     254     30  23296     25     0 dnsmasq
[ 1873186]     0 1873186     706     62  26112     85 -1000 systemd-udevd
[ 1873189]     0 1873189     428     45  24832     49     0 systemd-userdbd
[ 1873202]     0 1873202    3099     89  45824     54  -250 systemd-journal
[ 2990000]     0 2990000   10498     97  74240     29     0 rsyslogd
[  891427]     0  891427     385     54  24576     46 -1000 sshd
[ 1710876]     0 1710876    2346     74  33536    197     0 sssd_be
[ 1953812]     0 1953812    2921    579  49152    748     0 kojid
[ 1149099]    89 1149099    1062     90  25600      1     0 pickup
[ 1154989]     0 1154989    2921    144  47872   1159     0 kojid
[ 1155947]  1000 1155947    1171     66  31232    541     0 mock
[ 1158092]  1000 1158092     375      1  28672     69     0 rpmbuild
[ 1158125]  1000 1158125     113      1  26368     13     0 sh
[ 1158469]  1000 1158469     673      0  26624     86     0 cmake
[ 1158470]  1000 1158470     670      0  26880     83     0 cmake
[ 1158471]  1000 1158471     227      1  27648     28     0 gmake
[ 1158474]  1000 1158474     230      1  23040     28     0 gmake
[ 1158491]  1000 1158491     235      1  23296     35     0 gmake
[ 1158493]  1000 1158493     233      1  23040     33     0 gmake
[ 1162555]     0 1162555     439     44  29184     45     0 systemd-userwor
[ 1162557]  1000 1162557      96      0  22272     12     0 g++
[ 1162558]  1000 1162558   58395  26130 498944  30153     0 cc1plus
[ 1162559]  1000 1162559   18288  14169 176384   4006     0 as
[ 1162591]  1000 1162591      96      0  22272     12     0 g++
[ 1162592]  1000 1162592   57875  19075 494080  36758     0 cc1plus
[ 1162593]  1000 1162593   15394  13764 148992   1517     0 as
[ 1162681]  1000 1162681      96      0  22016     13     0 g++
[ 1162682]  1000 1162682   32957  20668 293888  10695     0 cc1plus
[ 1162683]  1000 1162683   10759  10120 111616    525     0 as
[ 1162693]  1000 1162693      96      0  22016     12     0 g++
[ 1162694]  1000 1162694   46700  30888 400896  14332     0 cc1plus
[ 1162695]  1000 1162695    6004   5043  73984    845     0 as
[ 1162697]  1000 1162697      96      0  22016     12     0 g++
[ 1162698]  1000 1162698   51999  40214 441344   9458     0 cc1plus
[ 1162699]  1000 1162699    2104      1  42752   1991     0 as
[ 1162702]  1000 1162702      96      0  26112     12     0 g++
[ 1162703]  1000 1162703   32957  25201 289792   6061     0 cc1plus
[ 1162704]  1000 1162704    5851   5207  72704    527     0 as
[ 1162710]  1000 1162710      96      0  26112     12     0 g++
[ 1162711]  1000 1162711   42924  35383 378624   6060     0 cc1plus
[ 1162712]  1000 1162712    1735   1079  35584    543     0 as
[ 1162731]  1000 1162731      96      0  22272     12     0 g++
[ 1162732]  1000 1162732   38014  34452 331520   2444     0 cc1plus
[ 1162733]  1000 1162733    2104   1567  38400    396     0 as
[ 1162753]     0 1162753     439     89  24832      0     0 systemd-userwor
[ 1162754]     0 1162754     151     23  22016      0     0 systemd-userwor
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=cc1plus,pid=1162558,uid=1000
Out of memory: Killed process 1162558 (cc1plus) total-vm:3737280kB, anon-rss:1672256kB, file-rss:64kB, shmem-rss:0kB, UID:1000 pgtables:487kB oom_score_adj:0
For ppc64le I agree with your assessment on %limit_build. I do not know about
The s390x build did not die due to OOM. I do not see any logs explaining why it failed, so I will try to keep a copy of its logs.
OK, that's not what I was expecting to see. The cc1plus process that died was using 3.7 GB of virtual memory, which is roughly twice what I assumed was required. The other processes are all using similar amounts of RAM. I'll adjust the %limit_build macro to request 4 GB per vCPU instead of 2 GB per vCPU.
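For the curious, the job-limiting idea behind %limit_build can be sketched like this. The variable names are illustrative, not the real macro's internals, and the 4 GB/job figure is the one proposed in this comment:

```shell
# Cap build parallelism so that each compile job has ~4 GB of RAM available:
# jobs = min(nproc, MemTotal / 4 GB), and never less than one job.
mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
cpus=$(nproc)
per_job_kb=$((4 * 1024 * 1024))          # assumed 4 GB of RAM per compile job
jobs=$((mem_kb / per_job_kb))
if [ "$jobs" -lt 1 ]; then jobs=1; fi    # always allow at least one job
if [ "$jobs" -gt "$cpus" ]; then jobs=$cpus; fi  # never exceed the vCPU count
echo "build with -j$jobs"
```

On a 2-vCPU/12 GB builder this yields -j2, so the change only bites once per-job memory estimates rise above what the builder can cover.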
I'm going to let the current doomed build continue, because the s390x build is 8 hours along and you are watching to see how it fails, and I don't want to set our investigation back by 8 hours by canceling it now.
Option 1 (sooner): Increase ZRAM size. It's capped to 8G on this system, and it's safe to be 1:1, i.e. 20G.
cp /usr/lib/systemd/zram-generator.conf /etc/systemd/zram-generator.conf
[zram0]
zram-size = ram
systemctl restart systemd-zram-setup@zram0.service
free -m
Option 2 (later): do a little profiling of the workload to see if certain systems like these are better off with disk based swap and either reconfigure or reprovision. I can help with figuring this out.
The ppc64le build got further before it OOM'd:
[Tue Aug 9 07:31:11 2022] Out of memory: Killed process 1162558 (cc1plus) total-vm:3737280kB, anon-rss:1672256kB, file-rss:64kB, shmem-rss:0kB, UID:1000 pgtables:487kB oom_score_adj:0 [Tue Aug 9 16:12:46 2022] gdb.minimal invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0 [Tue Aug 9 16:12:46 2022] CPU: 6 PID: 1466670 Comm: gdb.minimal Tainted: G L 5.18.9-200.fc36.ppc64le #1 [Tue Aug 9 16:12:46 2022] Call Trace: [Tue Aug 9 16:12:46 2022] [c0000000a0317650] [c000000000a7b920] dump_stack_lvl+0x74/0xa8 (unreliable) [Tue Aug 9 16:12:46 2022] [c0000000a0317690] [c0000000003faad0] dump_header+0x64/0x24c [Tue Aug 9 16:12:46 2022] [c0000000a0317710] [c0000000003f8ec4] oom_kill_process+0x344/0x350 [Tue Aug 9 16:12:46 2022] [c0000000a0317750] [c0000000003fa448] out_of_memory+0x228/0x780 [Tue Aug 9 16:12:46 2022] [c0000000a03177f0] [c000000000488928] __alloc_pages+0x10d8/0x11a0 [Tue Aug 9 16:12:46 2022] [c0000000a03179d0] [c0000000004bed64] alloc_pages+0xd4/0x230 [Tue Aug 9 16:12:46 2022] [c0000000a0317a10] [c0000000004beef4] folio_alloc+0x34/0x90 [Tue Aug 9 16:12:46 2022] [c0000000a0317a40] [c0000000003f1aa4] __filemap_get_folio+0x294/0x650 [Tue Aug 9 16:12:46 2022] [c0000000a0317af0] [c0000000003f21f8] filemap_fault+0x398/0xae0 [Tue Aug 9 16:12:46 2022] [c0000000a0317ba0] [c00000000044fcb4] __do_fault+0x64/0x2a0 [Tue Aug 9 16:12:46 2022] [c0000000a0317be0] [c0000000004542ac] __handle_mm_fault+0x103c/0x1d90 [Tue Aug 9 16:12:46 2022] [c0000000a0317ce0] [c000000000455128] handle_mm_fault+0x128/0x310 [Tue Aug 9 16:12:46 2022] [c0000000a0317d30] [c000000000087fd4] ___do_page_fault+0x2a4/0xb90 [Tue Aug 9 16:12:46 2022] [c0000000a0317de0] [c000000000088b30] do_page_fault+0x30/0xc0 [Tue Aug 9 16:12:46 2022] [c0000000a0317e10] [c000000000008914] data_access_common_virt+0x194/0x1f0 [Tue Aug 9 16:12:46 2022] --- interrupt: 300 at 0x118f628c0 [Tue Aug 9 16:12:46 2022] NIP: 0000000118f628c0 LR: 0000000118f716e0 CTR: 0000000118f7d8e0 [Tue Aug 
9 16:12:46 2022] REGS: c0000000a0317e80 TRAP: 0300 Tainted: G L (5.18.9-200.fc36.ppc64le) [Tue Aug 9 16:12:46 2022] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR: 44024484 XER: 2004002d [Tue Aug 9 16:12:46 2022] CFAR: c00000000000c94c DAR: 00007fff48d396ee DSISR: 40000000 IRQMASK: 0 GPR00: 0000000118fb912c 00007fffc6da8520 0000000119947900 000000013515fa00 GPR04: 00007fff48d396ee 00007fffc6da8668 00007fff48d396e9 00000001367d5870 GPR08: 0000000118f7d70c 0000000000000000 0000000000000000 ff00000000000000 GPR12: 0000000118f417d0 00007fffa7cb8c60 0000000000000000 000000034eaccdb0 GPR16: 0000000000010016 0000000000000000 0000000000000088 000000034e95dbf0 GPR20: 00007fffc6da8668 0000000000000000 0000000000000000 00007fffc6da8670 GPR24: 0000000000000003 00007fff48d396ee 000000034eaccdb0 000000034eaccd40 GPR28: 00007fffc6da8798 000000026467f4a0 000000013515fa00 00007fffc6da8798 [Tue Aug 9 16:12:46 2022] NIP [0000000118f628c0] 0x118f628c0 [Tue Aug 9 16:12:46 2022] LR [0000000118f716e0] 0x118f716e0 ... 
[Tue Aug 9 16:12:46 2022] [1466638] 1000 1466638 187716 100777 1496320 42728 0 gdb.minimal [Tue Aug 9 16:12:46 2022] [1466666] 1000 1466666 95 1 22272 13 0 gdb-add-index [Tue Aug 9 16:12:46 2022] [1466670] 1000 1466670 186092 96280 1491968 45644 0 gdb.minimal [Tue Aug 9 16:12:46 2022] [1466680] 1000 1466680 95 0 22016 12 0 gdb-add-index [Tue Aug 9 16:12:46 2022] [1466684] 1000 1466684 169588 87880 1355008 37420 0 gdb.minimal [Tue Aug 9 16:12:46 2022] [1466893] 0 1466893 439 73 28672 0 0 systemd-userwor [Tue Aug 9 16:12:46 2022] [1466911] 0 1466911 439 74 24832 0 0 systemd-userwor [Tue Aug 9 16:12:46 2022] [1466912] 0 1466912 439 63 24832 0 0 systemd-userwor [Tue Aug 9 16:12:46 2022] [1466931] 0 1466931 395 47 27136 15 0 crond [Tue Aug 9 16:12:46 2022] [1466932] 0 1466932 152 47 26880 0 0 lock-wrapper [Tue Aug 9 16:12:46 2022] [1466935] 0 1466935 152 32 22528 0 0 osbuildapi-upda [Tue Aug 9 16:12:46 2022] [1466937] 0 1466937 152 12 22272 0 0 osbuildapi-upda [Tue Aug 9 16:12:46 2022] [1466938] 0 1466938 368 41 23296 0 0 resolvectl [Tue Aug 9 16:12:46 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=gdb.minimal,pid=1466638,uid=1000 [Tue Aug 9 16:12:46 2022] Out of memory: Killed process 1466638 (gdb.minimal) total-vm:12013824kB, anon-rss:6449728kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:1461kB oom_score_adj:0
That's the same as the last OOM, and should be fixed by my adjustment to the package's %limit_build.
The most recent trace is missing the swap free/total section. Did this builder get a zram bump yet, or not? The previous trace showed 0 bytes of swap free.
Nothing has been changed with zram; that would require changes by Real(TM) releng. I am just filling in.
[Tue Aug 9 16:12:46 2022] --- interrupt: 300
[Tue Aug 9 16:12:46 2022] Mem-Info:
[Tue Aug 9 16:12:46 2022] active_anon:265307 inactive_anon:21362 isolated_anon:0 active_file:78 inactive_file:0 isolated_file:0 unevictable:62 dirty:10 writeback:0 slab_reclaimable:2838 slab_unreclaimable:6212 mapped:82 shmem:13 pagetables:220 bounce:0 kernel_misc_reclaimable:0 free:847 free_pcp:0 free_cma:0
[Tue Aug 9 16:12:46 2022] Node 0 active_anon:16979648kB inactive_anon:1367168kB active_file:4992kB inactive_file:0kB unevictable:3968kB isolated(anon):0kB isolated(file):0kB mapped:5248kB dirty:640kB writeback:0kB shmem:832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:4896kB pagetables:14080kB all_unreclaimable? no
[Tue Aug 9 16:12:46 2022] Node 0 Normal free:54208kB boost:34816kB min:57344kB low:78080kB high:98816kB reserved_highatomic:0KB active_anon:16979648kB inactive_anon:1367168kB active_file:3968kB inactive_file:0kB unevictable:3968kB writepending:640kB present:20971520kB managed:20812800kB mlocked:3968kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Tue Aug 9 16:12:46 2022] lowmem_reserve[]: 0 0 0
[Tue Aug 9 16:12:46 2022] Node 0 Normal: 263*64kB (UME) 252*128kB (UME) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 49344kB
[Tue Aug 9 16:12:46 2022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Tue Aug 9 16:12:46 2022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Tue Aug 9 16:12:46 2022] 207 total pagecache pages
[Tue Aug 9 16:12:46 2022] 28 pages in swap cache
[Tue Aug 9 16:12:46 2022] Swap cache stats: add 3777065, delete 3777098, find 723615/2033381
[Tue Aug 9 16:12:46 2022] Free swap = 0kB
[Tue Aug 9 16:12:46 2022] Total swap = 8388544kB
[Tue Aug 9 16:12:46 2022] 327680 pages RAM
[Tue Aug 9 16:12:46 2022] 0 pages HighMem/MovableOnly
[Tue Aug 9 16:12:46 2022] 2480 pages reserved
[Tue Aug 9 16:12:46 2022] 0 pages cma reserved
[Tue Aug 9 16:12:46 2022] 0 pages hwpoisoned
[Tue Aug 9 16:12:46 2022] Tasks state (memory values in pages):
How helpful a bigger zram-size would be depends on the output from zramctl once the server has been up a while (15 minutes, I guess) but before an OOM kill.
zramctl output
NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 lzo-rle         8G   8G  2.2G  2.3G       3 [SWAP]
Currently the system is chewing through:
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-15.fc37.s390x/usr/lib64/libwebkit2gtk-4.0.so.37.57.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-15.fc37.s390x/usr/lib64/libwebkit2gtk-4.1.so.0.2.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-15.fc37.s390x/usr/lib64/libwebkit2gtk-5.0.so.0.0.0
Good news is the workload is highly compressible, 3.5x including overhead. Bad news is swap is full which means the kernel can only reclaim file pages, which is more expensive time and IO wise than just bumping zram-size. Would disk-based swap be even better? Not sure. We'd have to look at the behavior under load, but for sure this system needs more swap given the workload.
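The 3.5x figure follows from the zramctl columns above: 8G of DATA (uncompressed swapped pages) stored in about 2.3G TOTAL (compressed size including overhead). A quick check of that arithmetic:

```shell
# DATA=8G divided by TOTAL=2.3G gives the effective compression ratio,
# overhead included.
awk 'BEGIN { printf "%.1f\n", 8 / 2.3 }'   # prints 3.5
```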
OK, the s390x build crashed while extracting debuginfo from libwebkit2gtk-5.0.so.0.0.0. Here is the OOM. At this point, @catanzaro, I would say it is time to kill the job completely and put in the changes you were looking at:
Tue Aug 9 18:28:24 2022] systemd-journald[508]: Field hash table of /var/log/journal/7080e3794c9045ddaaf8e2b9a8ab0243/system.journal has a fill level at 75.4 (251 of 333 items), suggesting rotation. [Tue Aug 9 18:28:24 2022] systemd-journald[508]: /var/log/journal/7080e3794c9045ddaaf8e2b9a8ab0243/system.journal: Journal header limits reached or header out-of-date, rotating. [Tue Aug 9 20:03:26 2022] systemd-userwor invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0 [Tue Aug 9 20:03:26 2022] CPU: 1 PID: 4106515 Comm: systemd-userwor Not tainted 5.18.11-200.fc36.s390x #1 [Tue Aug 9 20:03:26 2022] Hardware name: IBM 8561 LT1 400 (KVM/Linux) [Tue Aug 9 20:03:26 2022] Call Trace: [Tue Aug 9 20:03:26 2022] [<0000000389fe89a2>] dump_stack_lvl+0x62/0x80 [Tue Aug 9 20:03:26 2022] [<0000000389fdff7a>] dump_header+0x62/0x248 [Tue Aug 9 20:03:26 2022] [<000000038961b3bc>] oom_kill_process+0x1f4/0x1f8 [Tue Aug 9 20:03:26 2022] [<000000038961c49c>] out_of_memory+0x114/0x6d8 [Tue Aug 9 20:03:26 2022] [<00000003896862da>] __alloc_pages_slowpath.constprop.0+0xd02/0xe58 [Tue Aug 9 20:03:26 2022] [<0000000389686764>] __alloc_pages+0x334/0x358 [Tue Aug 9 20:03:26 2022] [<0000000389615e52>] __filemap_get_folio+0x112/0x458 [Tue Aug 9 20:03:26 2022] [<00000003896162e0>] filemap_fault+0x148/0x920 [Tue Aug 9 20:03:26 2022] [<000000038965a164>] __do_fault+0x4c/0x130 [Tue Aug 9 20:03:26 2022] [<000000038965f196>] __handle_mm_fault+0x8a6/0x10a0 [Tue Aug 9 20:03:26 2022] [<000000038965fa5e>] handle_mm_fault+0xce/0x210 [Tue Aug 9 20:03:26 2022] [<000000038940b7f0>] do_exception+0x1e0/0x488 [Tue Aug 9 20:03:26 2022] [<000000038940c00a>] do_dat_exception+0x2a/0x50 [Tue Aug 9 20:03:26 2022] [<0000000389fed048>] __do_pgm_check+0xf0/0x1b0 [Tue Aug 9 20:03:26 2022] [<0000000389ffc80e>] pgm_check_handler+0x11e/0x180 [Tue Aug 9 20:03:26 2022] Mem-Info: [Tue Aug 9 20:03:26 2022] active_anon:3236999 inactive_anon:300098 isolated_anon:0 active_file:51 
inactive_file:5746 isolated_file:135 unevictable:0 dirty:0 writeback:0 slab_reclaimable:109646 slab_unreclaimable:44804 mapped:22 shmem:6 pagetables:15946 bounce:0 kernel_misc_reclaimable:0 free:18946 free_pcp:401 free_cma:0 [Tue Aug 9 20:03:26 2022] Node 0 active_anon:12947996kB inactive_anon:1200392kB active_file:204kB inactive_file:22536kB unevictable:0kB isolated(anon):0kB isolated(file):540kB mapped:88kB dirty:0kB writeback:0kB shmem:24kB writeback_tmp:0kB kernel_stack:3680kB pagetables:63784kB all_unreclaimable? no [Tue Aug 9 20:03:26 2022] Node 0 DMA free:61632kB boost:0kB min:1988kB low:4064kB high:6140kB reserved_highatomic:0KB active_anon:1608784kB inactive_anon:181832kB active_file:8kB inactive_file:2668kB unevictable:0kB writepending:0kB present:2097152kB managed:2097068kB mlocked:0kB bounce:0kB free_pcp:48kB local_pcp:0kB free_cma:0kB [Tue Aug 9 20:03:26 2022] lowmem_reserve[]: 0 15006 15006 [Tue Aug 9 20:03:26 2022] Node 0 Normal free:14656kB boost:0kB min:14712kB low:30076kB high:45440kB reserved_highatomic:0KB active_anon:11339212kB inactive_anon:1018560kB active_file:696kB inactive_file:19704kB unevictable:0kB writepending:0kB present:15728640kB managed:15371692kB mlocked:0kB bounce:0kB free_pcp:1104kB local_pcp:380kB free_cma:0kB [Tue Aug 9 20:03:26 2022] lowmem_reserve[]: 0 0 0 [Tue Aug 9 20:03:26 2022] Node 0 DMA: 208*4kB (UME) 198*8kB (UME) 297*16kB (UME) 214*32kB (UME) 150*64kB (UME) 82*128kB (UME) 31*256kB (UME) 8*512kB (UM) 2*1024kB (UM) 3*2048kB (U) 2*4096kB (UM) = 62528kB [Tue Aug 9 20:03:26 2022] Node 0 Normal: 526*4kB (UME) 367*8kB (UME) 285*16kB (UME) 148*32kB (UME) 6*64kB (U) 2*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14976kB [Tue Aug 9 20:03:26 2022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1024kB [Tue Aug 9 20:03:26 2022] 6222 total pagecache pages [Tue Aug 9 20:03:26 2022] 379 pages in swap cache [Tue Aug 9 20:03:26 2022] Swap cache stats: add 12362617, delete 12362237, find 
126940/6308439 [Tue Aug 9 20:03:26 2022] Free swap = 0kB [Tue Aug 9 20:03:26 2022] Total swap = 8388604kB [Tue Aug 9 20:03:26 2022] 4456448 pages RAM [Tue Aug 9 20:03:26 2022] 0 pages HighMem/MovableOnly [Tue Aug 9 20:03:26 2022] 89258 pages reserved [Tue Aug 9 20:03:26 2022] 0 pages cma reserved [Tue Aug 9 20:03:26 2022] Tasks state (memory values in pages): [Tue Aug 9 20:03:26 2022] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [Tue Aug 9 20:03:26 2022] [ 508] 0 508 28935 134 143360 408 -250 systemd-journal [Tue Aug 9 20:03:26 2022] [ 541] 0 541 7793 5 108544 817 -1000 systemd-udevd [Tue Aug 9 20:03:26 2022] [ 606] 193 606 5658 151 112640 769 0 systemd-resolve [Tue Aug 9 20:03:26 2022] [ 607] 0 607 6325 8 75776 203 -1000 auditd [Tue Aug 9 20:03:26 2022] [ 611] 0 611 4023 23 92160 204 0 systemd-userdbd [Tue Aug 9 20:03:26 2022] [ 639] 81 639 2478 44 67584 163 -900 dbus-broker-lau [Tue Aug 9 20:03:26 2022] [ 646] 81 646 1268 87 57344 169 -900 dbus-broker [Tue Aug 9 20:03:26 2022] [ 652] 2 652 61233 0 124928 411 0 rngd [Tue Aug 9 20:03:26 2022] [ 653] 0 653 169445 176 483328 636 0 rsyslogd [Tue Aug 9 20:03:26 2022] [ 656] 0 656 6727 164 149504 318 0 sssd [Tue Aug 9 20:03:26 2022] [ 657] 0 657 4135 56 92160 224 0 systemd-machine [Tue Aug 9 20:03:26 2022] [ 678] 996 678 21086 34 77824 137 0 chronyd [Tue Aug 9 20:03:26 2022] [ 699] 0 699 63891 187 176128 506 0 NetworkManager [Tue Aug 9 20:03:26 2022] [ 706] 0 706 2281 1 71680 100 0 oddjobd [Tue Aug 9 20:03:26 2022] [ 708] 0 708 3579 2 88064 305 -1000 sshd [Tue Aug 9 20:03:26 2022] [ 709] 0 709 29880 0 120832 227 0 gssproxy [Tue Aug 9 20:03:26 2022] [ 731] 0 731 4443 5 67584 171 0 rpc.gssd [Tue Aug 9 20:03:26 2022] [ 772] 0 772 6733 103 155648 294 0 sssd_pam [Tue Aug 9 20:03:26 2022] [ 773] 0 773 6500 89 149504 283 0 sssd_ssh [Tue Aug 9 20:03:26 2022] [ 774] 0 774 6484 100 149504 279 0 sssd_sudo [Tue Aug 9 20:03:26 2022] [ 775] 0 775 18345 95 202752 519 0 sssd_pac [Tue Aug 9 20:03:26 2022] [ 
795] 0 795 4635 49 100352 559 0 systemd-logind [Tue Aug 9 20:03:26 2022] [ 799] 0 799 2001 31 69632 73 0 crond [Tue Aug 9 20:03:26 2022] [ 803] 0 803 706 0 47104 32 0 agetty [Tue Aug 9 20:03:26 2022] [ 804] 0 804 1348 0 61440 35 0 agetty [Tue Aug 9 20:03:26 2022] [ 860] 0 860 11210 36 77824 213 0 master [Tue Aug 9 20:03:26 2022] [ 862] 89 862 14186 63 114688 210 0 qmgr [Tue Aug 9 20:03:26 2022] [ 353149] 423 353149 2428 0 69632 107 0 dnsmasq [Tue Aug 9 20:03:26 2022] [ 353150] 0 353150 2421 3 67584 100 0 dnsmasq [Tue Aug 9 20:03:26 2022] [ 353391] 0 353391 8824 2 153600 466 -900 virtlogd [Tue Aug 9 20:03:26 2022] [ 665942] 0 665942 30934 3803 444416 11156 0 kojid [Tue Aug 9 20:03:26 2022] [4028975] 0 4028975 30934 899 415744 14028 0 kojid [Tue Aug 9 20:03:26 2022] [4029362] 1000 4029362 12780 123 202752 7313 0 mock [Tue Aug 9 20:03:26 2022] [4029682] 1000 4029682 3730 0 98304 731 0 rpmbuild [Tue Aug 9 20:03:26 2022] [4089952] 0 4089952 7062 9 122880 449 0 sshd [Tue Aug 9 20:03:26 2022] [4089955] 0 4089955 5214 10 104448 771 100 systemd [Tue Aug 9 20:03:26 2022] [4089957] 0 4089957 46203 1 145408 1675 100 (sd-pam) [Tue Aug 9 20:03:26 2022] [4089966] 0 4089966 7100 47 122880 453 0 sshd [Tue Aug 9 20:03:26 2022] [4089967] 0 4089967 2171 211 67584 295 0 bash [Tue Aug 9 20:03:26 2022] [4102697] 1000 4102697 1139 2 61440 85 0 sh [Tue Aug 9 20:03:26 2022] [4102878] 1000 4102878 1034 1 49152 107 0 find-debuginfo [Tue Aug 9 20:03:26 2022] [4102893] 1000 4102893 1034 1 49152 107 0 find-debuginfo [Tue Aug 9 20:03:26 2022] [4102894] 1000 4102894 1067 2 49152 124 0 find-debuginfo [Tue Aug 9 20:03:26 2022] [4102895] 1000 4102895 1067 2 49152 124 0 find-debuginfo [Tue Aug 9 20:03:26 2022] [4102897] 1000 4102897 1067 2 49152 124 0 find-debuginfo [Tue Aug 9 20:03:26 2022] [4103138] 1000 4103138 1001 2 49152 76 0 gdb-add-index [Tue Aug 9 20:03:26 2022] [4103142] 1000 4103142 2584403 1256982 20342784 714850 0 gdb.minimal [Tue Aug 9 20:03:26 2022] [4103147] 1000 4103147 1001 2 49152 
75 0 gdb-add-index [Tue Aug 9 20:03:26 2022] [4103151] 1000 4103151 2352687 1099992 18472960 640325 0 gdb.minimal [Tue Aug 9 20:03:26 2022] [4103156] 1000 4103156 1001 2 49152 76 0 gdb-add-index [Tue Aug 9 20:03:26 2022] [4103160] 1000 4103160 2488701 1169153 19564544 706469 0 gdb.minimal [Tue Aug 9 20:03:26 2022] [4104953] 0 4104953 716 16 57344 15 0 tail [Tue Aug 9 20:03:26 2022] [4106292] 0 4106292 4460 218 108544 0 0 sssd_be [Tue Aug 9 20:03:26 2022] [4106405] 89 4106405 14138 262 100352 0 0 pickup [Tue Aug 9 20:03:26 2022] [4106422] 0 4106422 11490 233 90112 0 0 smtp [Tue Aug 9 20:03:26 2022] [4106423] 89 4106423 14163 263 98304 0 0 cleanup [Tue Aug 9 20:03:26 2022] [4106433] 0 4106433 4942 63 65536 53 0 crond [Tue Aug 9 20:03:26 2022] [4106438] 0 4106438 4942 68 65536 52 0 crond [Tue Aug 9 20:03:26 2022] [4106444] 0 4106444 11202 241 90112 0 0 sendmail [Tue Aug 9 20:03:26 2022] [4106455] 0 4106455 4942 68 65536 52 0 crond [Tue Aug 9 20:03:26 2022] [4106457] 0 4106457 11149 182 81920 0 0 sendmail [Tue Aug 9 20:03:26 2022] [4106461] 0 4106461 11164 183 83968 0 0 trivial-rewrite [Tue Aug 9 20:03:26 2022] [4106467] 0 4106467 4942 66 65536 53 0 crond [Tue Aug 9 20:03:26 2022] [4106468] 0 4106468 4108 234 92160 0 0 systemd-userwor [Tue Aug 9 20:03:26 2022] [4106469] 0 4106469 4108 234 92160 0 0 systemd-userwor [Tue Aug 9 20:03:26 2022] [4106471] 0 4106471 11144 164 81920 0 0 sendmail [Tue Aug 9 20:03:26 2022] [4106476] 0 4106476 11143 164 81920 0 0 postdrop [Tue Aug 9 20:03:26 2022] [4106479] 0 4106479 4942 64 65536 53 0 crond [Tue Aug 9 20:03:26 2022] [4106480] 0 4106480 3656 96 88064 0 0 sssd_nss [Tue Aug 9 20:03:26 2022] [4106482] 0 4106482 11144 164 81920 0 0 sendmail [Tue Aug 9 20:03:26 2022] [4106496] 0 4106496 1744 68 63488 0 0 osbuildapi-upda [Tue Aug 9 20:03:26 2022] [4106497] 0 4106497 4942 59 65536 54 0 crond [Tue Aug 9 20:03:26 2022] [4106499] 0 4106499 1744 67 63488 0 0 osbuildapi-upda [Tue Aug 9 20:03:26 2022] [4106501] 0 4106501 1744 70 51200 0 0 
osbuildapi-upda [Tue Aug 9 20:03:26 2022] [4106502] 0 4106502 4029 78 79872 0 0 resolvectl [Tue Aug 9 20:03:26 2022] [4106504] 0 4106504 1744 72 63488 0 0 lock-wrapper [Tue Aug 9 20:03:26 2022] [4106505] 0 4106505 1744 69 51200 0 0 osbuildapi-upda [Tue Aug 9 20:03:26 2022] [4106506] 0 4106506 4029 78 81920 0 0 resolvectl [Tue Aug 9 20:03:26 2022] [4106507] 0 4106507 4942 61 65536 54 0 crond [Tue Aug 9 20:03:26 2022] [4106510] 0 4106510 1744 72 63488 0 0 lock-wrapper [Tue Aug 9 20:03:26 2022] [4106513] 0 4106513 4942 61 65536 54 0 crond [Tue Aug 9 20:03:26 2022] [4106514] 0 4106514 4942 61 65536 54 0 crond [Tue Aug 9 20:03:26 2022] [4106515] 0 4106515 4000 77 79872 0 0 systemd-userwor [Tue Aug 9 20:03:26 2022] [4106516] 0 4106516 1744 69 79872 0 0 lock-wrapper [Tue Aug 9 20:03:26 2022] [4106517] 0 4106517 4951 58 65536 55 0 crond [Tue Aug 9 20:03:26 2022] [4106520] 0 4106520 1744 58 63488 0 0 run-parts [Tue Aug 9 20:03:26 2022] [4106522] 0 4106522 2015 52 59392 55 0 crond [Tue Aug 9 20:03:26 2022] [4106524] 0 4106524 1555 40 63488 0 0 mkdir [Tue Aug 9 20:03:26 2022] [4106525] 0 4106525 1142 25 47104 0 0 ps [Tue Aug 9 20:03:26 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=gdb.minimal,pid=4103142,uid=1000 [Tue Aug 9 20:03:26 2022] Out of memory: Killed process 4103142 (gdb.minimal) total-vm:10337612kB, anon-rss:5027928kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:19866kB oom_score_adj:0 [Tue Aug 9 20:03:31 2022] oom_reaper: reaped process 4103142 (gdb.minimal), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
As nirik pointed out to me, there are two builds going on and what I pasted was the older one.
Now that you've caught this info, I've canceled the older -15 build. The newer -16 build already contains my changes to request 4 GB RAM per vCPU. I'm pretty sure that will fix ppc64le, but it will not help at all with s390x: debuginfo processing is not affected by the %limit_build macro. All that macro does is compute the -j that gets passed to ninja: that's it.
With this failed s390x build, we see it running three gdb processes at once, each using about 10 GB of virtual memory. That's a lot. (Note that I did perform a scratch build a week or two ago to make sure our infrastructure could handle this, before we imported the new webkitgtk package, and it succeeded on s390x after Kevin disabled systemd-oomd.) Interestingly, it seems to be parallel and based on the number of vCPUs, but I don't think there's any way for the packager to control the parallelism, so that's kinda out of my hands. This probably indicates that switching to 3 vCPU per VM actually backfired on us by increasing the required RAM by 50%. :/
Anyway, I cannot control the parallelism of the debuginfo processing. I guess RPM is responsible for that? What I can control is the _dwz_max_die_limit:
We should be able to reduce RAM usage by lowering that value, at the cost of ballooning the package size. I kinda think that's a bad trade-off, because 5 GB is just huge, so I'd rather not. But it is an option.
Ideally we would not mess with this value in the webkitgtk package, and would instead just set whatever defaults we desire globally. I previously filed a bug suggesting this.
BTW rawhide is broken until this build completes due to https://bugzilla.redhat.com/show_bug.cgi?id=2116626. Sorry about that....
Well, I think the problem is that now that all those 3 things are in the same package, it builds them OK, but then at the end it runs find-debuginfo with N CPUs, and on ppc64le and s390x at least that's too much. ;(
I wonder if there's a way to use normal number of cpus for building, but then set it to 1 for the find-debuginfo.sh part?
The -16 build oom'd at 06:18 UTC when extracting debuginfo from
+ /usr/bin/find-debuginfo -j3 --strict-build-id -m -i --build-id-seed 2.37.1-16.fc37 --unique-debug-suffix -2.37.1-16.fc37.s390x --unique-debug-src-base webkitgtk-2.37.1-16.fc37.s390x --run-dwz --dwz-low-mem-die-limit 10000000 --dwz-max-die-limit 250000000 -S debugsourcefiles.list /builddir/build/BUILD/webkitgtk-2.37.1
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.0.so.18.21.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/bin/WebKitWebDriver
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.1.so.0.2.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libjavascriptcoregtk-5.0.so.0.0.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libwebkit2gtk-4.0.so.37.57.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libwebkit2gtk-4.1.so.0.2.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libwebkit2gtk-5.0.so.0.0.0
As you all have said, the -j3 is the problem.
The -j3 is not under packager control, though. That is not coming from the spec file and is therefore not something I can change. I don't know what calls find-debuginfo, but my guess would be RPM itself? It seems clear that it is matching the number of vCPUs available, so disabling vCPUs should solve this, of course at the cost of making everything much slower.
I was going to say "for s390x, yes, but ppc64le was just using too much RAM during the build itself." But bad news: the ppc64le build has unexpectedly restarted too, even though I switched to %limit_build -m 4096 to request 4 GiB of RAM per vCPU instead of the 2 GiB that I had been using before. I was expecting that to be enough to fix ppc64le, but apparently not.
BTW, the current build started before branching, but will complete after branching. What happens now? Do I need to start another build for rawhide, or for F37?
It's coming from here.
We could tag this into rawhide as well... or do a new build.
So, on ppc64le, I made a builder with 64 GB of memory and 16 CPUs and moved the current build to it. I would expect it can complete there.
On s390x, sadly we don't have a bunch of spare resources. I guess I can take down some more builders and build up one with more memory and move to that to get this build in, but I think we should try and find a more sustainable path after this.
What about trying:
%{_smp_build_ncpus} = 1
at the end of the spec %build section? (to set it to 1 for the debuginfo gathering)
Failing that, I see that the dwz limits might be set per arch?
-13: dwz_limit %{expand:%%{?%{1}%{arch}}%%{!?%{1}%{_arch}:%%%{1}}}
-13: _dwz_low_mem_die_limit 10000000
-13: _dwz_low_mem_die_limit_armv5tel 4000000
-13: _dwz_low_mem_die_limit_armv7hl 4000000
-13: _dwz_max_die_limit 50000000
-13: _dwz_max_die_limit_armv5tel 10000000
-13: _dwz_max_die_limit_armv7hl 10000000
-13: _dwz_max_die_limit_x86_64 110000000
Perhaps we could set it for ppc64le and s390x to a different limit that had larger sizes but finished?
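A per-arch override along those lines could look something like this in the spec. The values here are purely illustrative guesses, not tested settings, and this was not the fix that was ultimately adopted:

```
# Hypothetical per-arch overrides (values are guesses, untested):
# lower the dwz max DIE limit only on the memory-constrained
# architectures, leaving the x86_64 default untouched.
%global _dwz_max_die_limit_s390x    50000000
%global _dwz_max_die_limit_ppc64le  50000000
```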
Should we try and rope in someone from the tools team here for advice?
For sure, but that seems like a one-time fix.
Honestly I think that's the best approach: take down as many builders as necessary and let things go as slowly as required to be reliable. But remember to reduce vCPUs as well. The current setup might actually be OK if we reduce to 2 vCPUs instead of 3.
What about trying: %{_smp_build_ncpus} = 1 at the end of the spec %build section? (to set it to 1 for the debuginfo gathering)
Hm, I think we can use %define to temporarily change the value. That might work. (Just assigning to it won't work, since it's a macro, not a variable.) Of course, this is only good as a temporary/emergency measure to get the builds out. We shouldn't need hacks like this going forward since that will slow things down massively.
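As a sketch, that temporary hack might look like this at the end of %build. This is hypothetical and untested; whether a %define made inside %build actually carries through to the debuginfo-gathering stage is exactly the open question here:

```
# Hypothetical emergency hack (untested): after the parallel build has
# finished, redefine the SMP macro so later stages see only 1 CPU.
%build
%cmake_build
%define _smp_build_ncpus 1
```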
Sure, although I would only be guessing as to which values to try... and also guessing about the names of the macros to define.
Couldn't hurt.
I've taken down 2 s390x builders and upped the memory on builder 29 to 32 GB.
Thanks.
Meanwhile, I've started new builds so I have no doubt of which build goes to which branch: rawhide, and Fedora 37. I suspect the original build was headed only to rawhide and not to F37, but I wasn't certain. Let's see if the builds get scheduled on the builders that you hope they will....
They won't. :)
I'll move them. You will see one 'restart' as I do.
I assigned the f37 ones, as thats the one we want most...
I'm not going to be happy with a solution that requires manually moving builds to particular builders. Isn't that exactly what the heavy builder channel is intended for...?
I'm not happy with that either... just doing it now to get these builds in to fix things. :)
I'm also not happy with a heavybuilder channel if I can at all avoid it. It's very wasteful of resources and causes lots of problems.
My ideal solution is to figure out what exact setup we need to just build reliably and set that for all the builders. If we can adjust the spec/debuginfo, great. If that takes more builder adjustments, so be it...
Kalev came up with a nice trick:
%global _find_debuginfo_opts %limit_build -m 8192
It's not perfect, since it comes after the first -j that we don't control, but the one that comes last should win, so that's fine. I might need to bump that limit higher, though, since 8 GB will probably not actually be enough. I will probably try 12 GB to be safe.
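The "last one wins" behavior can be illustrated with a minimal sketch of left-to-right option parsing. This models the idea only; it is not find-debuginfo's actual parser:

```shell
# When options are consumed left to right, a later -j simply overwrites
# an earlier one, which is why appending '%limit_build -m 8192' after
# rpm's own -j3 is effective.
jobs=1
for arg in -j3 --run-dwz -j2; do
    case "$arg" in
        -j*) jobs=${arg#-j} ;;   # keep only the most recent -j value
    esac
done
echo "$jobs"   # prints 2
```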
I'm jumping in to help with this as well. I kicked off 4 more builds (so it's 6 total now) so that we hopefully have something finished by tomorrow (we really need these now because of the broken upgrade path issues and the upcoming GNOME test week).
What we have running is (all builds are doubled because of f37 and rawhide):
- Builds 1-2: mcatanzaro's builds that nirik moved to the beefiest s390x and ppc64le builders
- Builds 3-4: these have disabled debuginfo on s390x and ppc64le, https://src.fedoraproject.org/rpms/webkitgtk/c/0ece072912ef34ea7778977af0301b3211e81c37?branch=rawhide
- Builds 5-6: these re-enable debuginfo on s390x and ppc64le and instead attempt to limit debuginfo processing parallelization, https://src.fedoraproject.org/rpms/webkitgtk/c/8ea0b4518be2a64b8223c406792a8c9f25c5ea5f?branch=rawhide
Hopefully the last one works because this avoids the need for extra beefy builders and allows us to keep debuginfo. We'll see tomorrow.
I kicked off all the 4 new builds with --skip-tag (so that they don't get submitted to bodhi in wrong NVR order) and moved them to the builders I wanted them to be on. I'll check back on these tomorrow morning my time and tag the best ones that finished with -candidate tags, and move them to beefier builders if necessary.
Hopefully we can get some builds over the finish line with this :)
Looks like we have a winner with '%global _find_debuginfo_opts %limit_build -m 8192' and the builds are starting to get over the finish line.
I ran into a new mode of failure: buildvm-s390x-14.s390.fedoraproject.org (one of the smaller s390x builders, I wanted to see if it could build webkitgtk with the latest changes) ran out of disk space.
lto-wrapper: fatal error: write: No space left on device

https://koji.fedoraproject.org/koji/taskinfo?taskID=90686756
I am checking out disk space on the builder. If something was using up disk space it is gone now.
/dev/md127 107876900 12779828 95097072 12% /
OK, so in conclusion, what ended up fixing the builds was adding:
# Require 8 GB of RAM per vCPU for debuginfo processing
%global _find_debuginfo_opts %limit_build -m 8192
%_find_debuginfo_opts gets passed to /usr/bin/find-debuginfo and we can use it to control parallelism. '%limit_build -m 8192' means that for every 8 GB of RAM it adds one to the -j value, so on s390x builders with 18 GB of RAM it ends up passing -j2, and on 8 GB builders it ends up passing -j1.
Note that I haven't actually verified that it builds on 8 GB builders because the last attempt ran out of disk space, but it hopefully should.
In addition to this (Michael's fix from earlier) cmake is called with
%cmake_build %limit_build -m 4096
... which makes it pass -j3 on s390x builders with 18 GB of RAM, which works out fine as well (especially since they actually have 3 CPUs).
So hopefully this all should be good enough that webkitgtk can now be built on any builder :)
There's still the question of how much time the builds take and how to schedule webkitgtk builds so that they hit the faster builders, to avoid them taking forever, but I think this is something for another time :) We should be good now and able to get builds done again.
I've already talked to nirik on IRC and he's going to take down the extra beefy s390x builder and the extra beefy ppc64le builder as they shouldn't be necessary any more. The regular builders should work just fine now (although the weaker ones may take close to two days to actually build this).
As for s390x kvm builder configuration, I'd leave it as it is now so that they have 3 CPUs and 18 GB of RAM. Reducing the number of CPUs would just make the builds take longer and I think the current number of s390x builders (which is less than before nirik made them have 3 CPUs and 18 GB of RAM) is enough for day-to-day builds. It may end up being a bottleneck for mass rebuilds, but that is a rare event and I think it's better to optimize for day-to-day operations. I'm fairly sure faster builders (but smaller number) makes it more fun for packagers to work on stuff since they can complete builds faster.
Thanks for all the work and patience on this.
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)