#10909 Add webkit2gtk3 to koji heavybuilder channel
Closed: Fixed 2 years ago by kevin. Opened 2 years ago by kalev.

We are having some issues with webkit2gtk3 builds taking a huge amount of time to complete, running out of memory, and builders restarting due to what appears to be kojid getting OOM-killed. Would it be possible to add it to the koji heavybuilder channel to see if that helps with these issues?

Thanks!


I'd prefer to avoid this if we can... I set up the heavybuilder channel for chromium, and if we add more things to it, they are going to compete.

One issue with the OOM hitting is that systemd-oomd managed to get back in when I reinstalled builders.
I just disabled it globally. Can we see if that helps any?

Metadata Update from @phsmoura:
- Issue assigned to kevin
- Issue tagged with: medium-gain, medium-trouble, ops


Sure, we can give it a try. As long as the builds succeed reliably, I'll be happy.

One issue with the OOM hitting is that systemd-oomd managed to get back in when I reinstalled builders.
I just disabled it globally. Can we see if that helps any?

It looks like your change almost certainly fixed x86_64, which last restarted shortly before your last comment.

I'm still waiting on my very slow s390x and aarch64 builds to complete, but I'm optimistic. (These builds had not been restarting before.)

I've provided a negative update on s390x in #10910.

I've provided a negative update on s390x in #10910.

This build just restarted again, after running for 70 hours. My guess is the builder doesn't have enough RAM, but it's impossible to know for sure: because the jobs restart instead of failing, I never get a failed log with an error message showing what is happening.

s390x builders 01-14 have 8 GB RAM and 6.6 GB swap. Builders 15-35 have 12 GB RAM and 8 GB swap. They all seem to have 2 CPUs.

I saw my local build using 12 GB when processing debuginfo a couple days ago. :/

You are correct and it is a memory issue:

Jul 22 12:00:06 buildvm-s390x-19.s390.fedoraproject.org systemd-oomd[472110]: Killed /system.slice/kojid.service due to memory used (13623361536) / total (13685317632) and swap used (774157>

I am trying to see if I can get any more information than that.

OK, I am tailing the current build.log so when it OOMs again I can see which step it died in. I agree with your earlier assessment that the build will probably never succeed.

How to solve this in either the short or long term is not clear, as every option has significant tradeoffs which need to be negotiated at a level higher than this ticket.
1. I do not think there is any extra memory or CPU that can be given to a builder without turning off existing builders.
2. Turning off existing builders to make the remaining ones 'bigger' may help large builds, but it causes small builds to pile up due to the lack of builders. [Doing this in the middle of a mass rebuild would not be good either.]
3. Excluding this architecture from this package would have a knock-on effect on a lot of other packages.

Note that system requirements might be about to increase again, as we still need to add the GTK 4 build; see https://bugzilla.redhat.com/show_bug.cgi?id=2108905 where we are working on that. The build and install process itself should not require any extra RAM, since we only do one build at a time and can use %limit_build to control resource usage. But I suspect the RAM requirement for debuginfo processing might increase. It seems like the sort of thing that should be done one library at a time, and therefore should be perfectly fine, but I'm not sure whether that's true in reality.
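For context, roughly what that spec-level control looks like (a sketch only; this assumes %limit_build takes a minimum amount of RAM per parallel build job in MB via its -m option, as used in some Fedora specs, and the number here is illustrative):

# Cap the number of parallel compile jobs so that each one
# has at least ~2 GB of RAM available on the builder.
%limit_build -m 2048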

Argh. systemd-oomd is not supposed to be running. ;(

It seems to have gotten re-enabled somehow. I am going to clean that up now.

ok. systemd-oomd is gone now. It was definitely firing on builds on the s390x builders (at 90% memory). ;(

So, let's see if this next one finishes this weekend. If not, then yeah, I guess we will need to rebalance things to provide more memory to fewer builders. :(

Also, I see we are hitting this:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0 120784 3428364    596 7705440    0    0     0     0  214   71 73  3  0  0 25
 2  0 120784 3345204    596 7705440    0    0     0     0  244   93 84  2  0  0 15
 2  0 120784 3219456    596 7705440    0    0     0     0  212   57 86  2  0  0 11
 2  0 120784 3116136    596 7705440    0    0     0     0  228  100 73  3  0  0 25
 2  0 120784 2995932    596 7705440    0    0     0     0  216   65 67  3  0  0 31

11 to 31% of the CPU is 'stolen'. This means there are other LPARs on that mainframe that have higher priority and are getting those CPU cycles.
Typically we have seen this when there is heavy testing/development of RHEL or other internal Red Hat things. ;(

OK, I started a scratch build with all three builds enabled at once to match what I plan to ship in F37 and F38. Interestingly, the s390x build is progressing without trouble. This time it's the x86_64 and the ppc64le builds that are restarting. Perhaps systemd-oomd is running there too?

The good news is that when I built this locally, I did not notice memory consumption greater than 12 GB when processing debuginfo, so I don't think the number of builds affects the RAM requirement at all: it probably only affects the build time.

Sadly, the s390x build from #10910 continues to restart. I'm impressed that build is still running without failing after 94 hours.

The s390x restart left this in the log I was tailing:

+ /usr/bin/find-debuginfo -j2 --strict-build-id -m -i --build-id-seed 2.37.1-10.fc37 --unique-debug-suffix -2.37.1-10.fc37.s390x --unique-debug-src-base webkit2gtk3-2.37.1-10.fc37.s390x --run-dwz --dwz-low-mem-die-limit 10000000 --dwz-max-die-limit 250000000 -S debugsourcefiles.list /builddir/build/BUILD/webkitgtk-2.37.1                                                         
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.0.so.18.21.0                                                          
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/bin/WebKitWebDriver                                                                                
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.1.so.0.2.0                                                            
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/lib64/libwebkit2gtk-4.0.so.37.57.0                                                                 
extracting debug info from /builddir/build/BUILDROOT/webkit2gtk3-2.37.1-10.fc37.s390x/usr/lib64/libwebkit2gtk-4.1.so.0.2.0                                                                   
EXCEPTION: [KeyboardInterrupt()]
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/mockbuild/trace_decorator.py", line 93, in trace
    result = func(*args, **kw)
  File "/usr/lib/python3.10/site-packages/mockbuild/util.py", line 556, in do_with_status
    output = logOutput(
  File "/usr/lib/python3.10/site-packages/mockbuild/util.py", line 394, in logOutput
    i_rdy, o_rdy, e_rdy = select.select(fds, [], [], 1)
  File "/usr/libexec/mock/mock", line 453, in handle_signals
    util.orphansKill(buildroot.make_chroot_path())
  File "/usr/lib/python3.10/site-packages/mockbuild/trace_decorator.py", line 57, in trace
    @functools.wraps(func)
KeyboardInterrupt

It looks like a lot of things restarted themselves at that time:

[Sat Jul 23 03:14:51 2022] sssd_nss invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[Sat Jul 23 03:14:51 2022] CPU: 1 PID: 2027792 Comm: sssd_nss Not tainted 5.18.9-200.fc36.s390x #1
[Sat Jul 23 03:14:51 2022] Hardware name: IBM 8561 LT1 400 (KVM/Linux)
[Sat Jul 23 03:14:51 2022] Call Trace:
....
[Sat Jul 23 03:14:51 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=gdb.minimal,pid=2027553,uid=1000
[Sat Jul 23 03:14:51 2022] Out of memory: Killed process 2027553 (gdb.minimal) total-vm:11985092kB, anon-rss:5329360kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:23302kB oom_score_adj:0
[Sat Jul 23 03:15:14 2022] oom_reaper: reaped process 2027553 (gdb.minimal), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I don't think this rpm will be buildable currently.

That looks like 11.4 GiB, right? It is actually slightly less than I expected would be required.

systemd-oomd is off everywhere.

I took down 2 s390x builders and added their memory to 2 others and moved your s390x builds over to them.

I moved your x86_64 one that was failing to a buildhw box (with a lot more memory/cpu).

On ppc64le, I can make a few more bigger builders, but not sure I will have time this weekend.

I'd still prefer to avoid a special channel for this, so next week, perhaps I can rebalance things (reduce number of builders and increase memory) until things work.

The other thing we can do is increase the size of the generated debuginfo. We currently do this:

# Increase the DIE limit so our debuginfo packages can be size-optimized.
# Decreases the size for x86_64 from ~5G to ~1.1G.
# https://bugzilla.redhat.com/show_bug.cgi?id=1456261
%global _dwz_max_die_limit 250000000

That hugely reduces the package size, but also hugely increases the RAM required to process the debuginfo. It would be nice to not need to change that, though.

All my builds are green now! Thanks.

ok, great. I am going to try and rebalance things so that these builds are ok, but we don't have to use a special channel.

Will try and do that this week...

This F35 armv7hl build might need help. Looks like it restarted at least once so far: https://koji.fedoraproject.org/koji/taskinfo?taskID=90197240

Although it has been churning fine for the past 9 hours, so maybe it will be OK. (Note this is a stable branch build, so no increased system requirements relative to what has historically been required.)

This looks to be a different ongoing issue. Please see https://pagure.io/fedora-infrastructure/issue/10833

I don't think so. It was building C++ files when I checked last night. When I checked this morning, after noticing that it had restarted, it was again building C++ files. Sadly, it has just restarted again 30 minutes ago.

There is no way to know for sure how many times it has restarted total, but I see that it is currently building C++ files again. So I assume it is running out of memory.

OK, several builders came up in the wrong memory configuration and only had 2.9 GB. Those have all been rebooted into the 40 GB mode. However, I believe any single process is still limited to 4 GB of memory on this 32-bit architecture. If the debuginfo step grows beyond that, I don't think the process would complete.

That won't be a problem. This package disables debuginfo on armv7hl for exactly this reason.

The build eventually completed.

So, I am having issues with this too.

First it got stuck and restarted; then @kevin wanted to try a z/VM builder, which got it right up to extracting debuginfo before being killed and restarted again :/

It is this task: https://koji.fedoraproject.org/koji/taskinfo?taskID=90394095

Has the ICU soname bump been merged into rawhide yet? That is, is the distro broken currently? Or is this "merely" blocking ICU from being updated via a side tag?

Note that webkit2gtk3 is going to be retired imminently, see #10919. The goal is to retire this package before rawhide branches from F37, which is scheduled for early next week. The new webkitgtk package will provide the same libraries.

Has the ICU soname bump been merged into rawhide yet?

Yes.

That is, is the distro broken currently? Or is this "merely" blocking ICU from being updated via a side tag?

The distro shouldn't be broken (it seems that, apart from the R stack and ELN, there wasn't any breakage); there is a compat library, libicu69, which packages not yet rebuilt against the new ICU should pick up on the user/client side.

Note that webkit2gtk3 is going to be retired imminently, see #10919. The goal is to retire this package before rawhide branches from F37, which is scheduled for early next week. The new webkitgtk package will provide the same libraries.

Hmm, that's good to know. In that case, rebuilding webkit2gtk3 isn't that important; I just wanted to avoid having two ICU libs shipped on Workstation/other deliverables.

We seem to be good on all platforms except s390x. Frantisek's build -- ongoing for 74 hours -- and my build just keep restarting and wasting resources. 12 GB of RAM is just too low. My possibly-incorrect suspicion is that we are very close to the required amount of RAM, and just a little more would suffice. I bet 16 GB would be safe.

I am doing some napkin calculations here, so my numbers may be off. We have 2 virtual z machines we are running on: one is set up as z/VM and the other as KVM. The KVM system is easiest to reason about: we have 48 CPUs(?) and 256 GB of RAM, and that system has 21 virtual machines with 2 CPUs/12 GB of RAM each. We would need to shut down at least 6 guests on this box to bring the memory up to 16 GB per host. That would get us 15 systems with 16 GB of RAM and 3 CPUs per system.

The z/VM system would probably need an equivalent number of systems shut down, so we would go from 35 to 23 builders overall. I do not know whether that would improve overall build times, given the lack of resources there. And if 16 GB is only enough now but not in 6 months, do we have to drop the number of builders even more? [Even if we were to make a 'big system', we would have to drop the number of builders by an equivalent count to pull those resources together.]

Again, back-of-a-napkin calculations, so probably off.

We would need to shut down at least 6 guests on this box to bring the memory up to 16 GB per host. That would get us 15 systems with 16 GB of RAM and 3 CPUs per system.

An alternative would be 18 VMs at 2 vCPUs and 14 GB of RAM each... but surely 15 VMs with 3 vCPUs would perform better overall.
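A quick shell check of those layouts against the figures above (ignoring hypervisor overhead; the 256 GB / 48 CPU numbers are the earlier estimates, not fresh measurements):

echo $(( 21 * 12 ))   # current layout:     21 guests x 12 GB = 252 GB
echo $(( 15 * 16 ))   # proposed layout:    15 guests x 16 GB = 240 GB
echo $(( 18 * 14 ))   # alternative layout: 18 guests x 14 GB = 252 GB
echo $(( 48 / 15 ))   # ~3 vCPUs per guest in the 15-guest layout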

And if 16 GB is only enough now but not in 6 months, do we have to drop the number of builders even more?

Eventually, yes. Build requirements are constantly increasing. But they increase relatively slowly, so I would expect it to be longer than 6 months before we need to increase requirements again.

The problem is I honestly do not know how much RAM is required to build successfully on s390x. Because the build log is lost each time the job restarts, I don't even know where it's dying. I'm only guessing that it dies when processing debuginfo. If it happens when compiling, then in theory we could fix things by adjusting the %limit_build macro. I don't think that's what's happening, though.

Another option would be to adjust the debuginfo optimization:

# Increase the DIE limit so our debuginfo packages can be size-optimized.
# Decreases the size for x86_64 from ~5G to ~1.1G.
# https://bugzilla.redhat.com/show_bug.cgi?id=1456261
%global _dwz_max_die_limit 250000000

That's probably a terrible idea because, well, you can see how much it would bloat the generated RPMs. But I believe it would reduce memory usage during the build.
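If we did go that route, a less drastic variant might be to raise the DIE limit only on architectures whose builders can afford the dwz memory cost. A purely hypothetical sketch, not what the spec currently does:

# Hypothetical: keep the size optimization everywhere except s390x,
# accepting a much larger s390x debuginfo package in exchange for
# lower memory use during dwz/debuginfo processing there.
%ifnarch s390x
%global _dwz_max_die_limit 250000000
%endif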

Yet another option would be to try -g1 instead of -g, but that would truly make s390x a second-class citizen. Not fond of that.

So I was wrong about the other set of systems. Frantisek's build is on the older z system and it only has 8 GB of ram per builder and 6 GB of swap. The build is crashing constantly due to OOMd

[2535842.291988] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=gdb.minimal,pid=2342849,uid=1000
[2535842.292011] Out of memory: Killed process 2342849 (gdb.minimal) total-vm:9699380kB, anon-rss:3938136kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:18718kB oom_score_adj:0

It will never complete and needs to be moved to a different builder.

I think it's worth a try to have builders with 3 CPUs and 16 GB memory.

It's not impossible that it would actually improve packager experience by allowing builds to complete faster. I guess in the best case 3 CPUs instead of 2 would make it possible for tasks to complete in 2/3 of the time, which then frees up the builder sooner and makes up for fewer builders?

So I was wrong about the other set of systems. Frantisek's build is on the older z system and it only has 8 GB of ram per builder and 6 GB of swap. The build is crashing constantly due to OOMd

I'm going to suggest that builder just needs to be turned off altogether.

BTW although it would be nice to see Frantisek's build complete, it's also OK to just cancel it. I will retire that package as soon as my build completes. I think my build is likely stuck in the same doom loop, though.

That would turn off ~15 builders... thus making s390x a bigger headache than you expect.

What needs to happen is working out the horse trading between 'what is expected', 'what is needed', and 'what is reasonable' given the limited resources Fedora has.

Strawman starting point proposal: if an architecture provides (a) less than 16 GB of RAM per standard builder, or (b) less than 2 GB of RAM per vCPU, then all large C++ projects (at least Firefox, Thunderbird, LibreOffice, Inkscape, WebKitGTK, QtWebKit, and QtWebEngine) should be built on a heavy builder channel instead of the standard channel. If the goal is to avoid moving projects to a heavy builder channel, then we should turn off however many builders necessary to achieve that. If a particular architecture slows down too much, we can put out a call for hardware donations, and if no company is interested in providing hardware, shut it down. Architecture support is ultimately a community decision, after all.

So, some clarification:

buildvm-s390x-01 to buildvm-s390x-14 are z/VM builders. It's not one big LPAR, it's 14 of them. We have no way to change the resource allocation on those; it would require mainframe admins to change anything there.

buildvm-s390x-15 to buildvm-s390x-35 are KVM VMs. They are on buildvmhost-s390x-01, which is a single large instance, and guests run on it like on other virthosts. Here we can reallocate the resources as we like.

koji has a concept of 'capacity' for builders. All the z/VM instances are set to a 2.0 capacity, and all the KVM ones are set to 3.0. So, larger builds will always try to run on the KVM instances.
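Capacity is a per-host setting; adjusting it would look something like this with the koji CLI (the hostname and value here are just examples):

koji edit-host buildvm-s390x-15.s390.fedoraproject.org --capacity=3.0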

So, for the KVM hosts: we currently have 20 of them with 2 CPUs and 13 GB of memory. We can move that to 15 builders with 17 GB of memory and 3 CPUs each?
But I worry that adding another CPU will increase memory needs, as 3 big C++ compile jobs might suck up more memory than 2, but I suppose we can try it.

I'll try and do that later today after nest... all the builds may restart again as I redo builders.

But I worry that adding another CPU will increase memory needs, as 3 big C++ compile jobs might suck up more memory than 2, but I suppose we can try it.

Of course going from 2 vCPUs to 3 vCPUs will increase memory needs during compile time by 50%. Now that we have LTO, I believe it will also increase memory needs during linking correspondingly. But if my theory that it's failing when processing debuginfo is correct, then that's not the problem here. WebKit should really only need 2-3 GB of RAM per vCPU to compile, and our builders are all way past that point.

ok. I rebalanced them.

There's now 15-30 and they have 17gb mem and 3 cpus.

I restarted the webkitgtk build.

Let's see how this goes.

Well it died due to failure to connect to the builder. :D I will restart it.

Now it seems the ppc64le build has unexpectedly restarted: https://koji.fedoraproject.org/koji/taskinfo?taskID=90503790. Surprise. :/

We just can't win. ;)

The buildvm-ppc64le builders have 20 GB of memory and 8 CPUs...

Surely that ought to be more than enough. :/ At least it finished on the second try.

Sadly, s390x is still in the doom loop.

I could try increasing the %limit_build requirement to 3 GB of RAM per vCPU instead of 2 GB. But that would just slow things down for no benefit if it is not actually OOMing during the compile or link step. Without evidence showing when exactly it hits OOM, I think I'll hold off on that.

So, I watched one cycle and it ended in:

  • /usr/bin/find-debuginfo -j3 --strict-build-id -m -i --build-id-seed 2.37.1-13.fc37 --unique-debug-suffix -2.37.1-13.fc37.s390x --unique-debug-src-base webkitgtk-2.37.1-13.fc37.s390x --run-dwz --dwz-low-mem-die-limit 10000000 --dwz-max-die-limit 250000000 -S debugsourcefiles.list /builddir/build/BUILD/webkitgtk-2.37.1
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/bin/WebKitWebDriver
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.1.so.0.2.0
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.0.so.18.21.0
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/lib64/libwebkit2gtk-4.0.so.37.57.0
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/lib64/libwebkit2gtk-4.1.so.0.2.0
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/lib64/webkit2gtk-4.0/injected-bundle/libwebkit2gtkinjectedbundle.so
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/lib64/webkit2gtk-4.1/injected-bundle/libwebkit2gtkinjectedbundle.so
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/libexec/webkit2gtk-4.0/MiniBrowser
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/libexec/webkit2gtk-4.0/WebKitNetworkProcess
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/libexec/webkit2gtk-4.0/WebKitWebProcess
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/libexec/webkit2gtk-4.0/jsc
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/libexec/webkit2gtk-4.1/MiniBrowser
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/libexec/webkit2gtk-4.1/WebKitNetworkProcess
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/libexec/webkit2gtk-4.1/WebKitWebProcess
    extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-13.fc37.s390x/usr/libexec/webkit2gtk-4.1/jsc

oh dammit.

systemd-oomd got restarted by socket activation!

Let me mask it. pesky thing.
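(For the record, masking it is roughly along these lines; the exact invocation on the builders may differ:)

# Stop systemd-oomd and prevent socket activation from pulling it back in.
systemctl disable --now systemd-oomd.service systemd-oomd.socket
systemctl mask systemd-oomd.service systemd-oomd.socket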

My build has succeeded! \o/

Another ppc64le build restart: https://koji.fedoraproject.org/koji/taskinfo?taskID=90614841

It's weird... really hard to believe 20 GB would not be enough. :/

s390x is restarting again too. My current build is here: https://koji.fedoraproject.org/koji/taskinfo?taskID=90614616

Is it possible to investigate to see where exactly it is dying? If it dies when compiling or linking, I will adjust %limit_build to tone down the parallelism. If it dies processing debuginfo, that probably just means we need even more RAM?

I am tailing the current log, so I should hopefully get a capture of if (let's be honest, when) it breaks again.

The ppc64le build failed due to a regular kernel OOM after filling up 20 GB of RAM and 8 GB of swap.

[Tue Aug  9 01:05:50 2022] [c0000000a2337c10] [c00000000023e3c0] sync_hw_clock+0x150/0x310
[Tue Aug  9 01:05:50 2022] [c0000000a2337c90] [c00000000017e3cc] process_one_work+0x2ac/0x570
[Tue Aug  9 01:05:50 2022] [c0000000a2337d30] [c00000000017ed98] worker_thread+0xa8/0x630
[Tue Aug  9 01:05:50 2022] [c0000000a2337dc0] [c00000000018b054] kthread+0x124/0x130
[Tue Aug  9 01:05:50 2022] [c0000000a2337e10] [c00000000000ce64] ret_from_kernel_thread+0x5c/0x64
[Tue Aug  9 01:05:50 2022] Instruction dump:
[Tue Aug  9 01:05:50 2022] 3b600500 fba1ffe8 fbc1fff0 3b800a00 3bc00002 fbe1fff8 3ba00f00 3be00003 
[Tue Aug  9 01:05:50 2022] f8010010 f821fe21 38610020 480602dd <60000000> 39200000 e9410128 f9210158 
[Tue Aug  9 07:31:11 2022] cc1plus invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[Tue Aug  9 07:31:11 2022] CPU: 6 PID: 1162682 Comm: cc1plus Tainted: G             L    5.18.9-200.fc36.ppc64le #1
[Tue Aug  9 07:31:11 2022] Call Trace:
[Tue Aug  9 07:31:11 2022] [c0000000177eb650] [c000000000a7b920] dump_stack_lvl+0x74/0xa8 (unreliable)
[Tue Aug  9 07:31:11 2022] [c0000000177eb690] [c0000000003faad0] dump_header+0x64/0x24c
[Tue Aug  9 07:31:11 2022] [c0000000177eb710] [c0000000003f8ec4] oom_kill_process+0x344/0x350
[Tue Aug  9 07:31:11 2022] [c0000000177eb750] [c0000000003fa448] out_of_memory+0x228/0x780
[Tue Aug  9 07:31:11 2022] [c0000000177eb7f0] [c000000000488928] __alloc_pages+0x10d8/0x11a0
[Tue Aug  9 07:31:11 2022] [c0000000177eb9d0] [c0000000004bed64] alloc_pages+0xd4/0x230
[Tue Aug  9 07:31:11 2022] [c0000000177eba10] [c0000000004beef4] folio_alloc+0x34/0x90
[Tue Aug  9 07:31:11 2022] [c0000000177eba40] [c0000000003f1aa4] __filemap_get_folio+0x294/0x650
[Tue Aug  9 07:31:11 2022] [c0000000177ebaf0] [c0000000003f21f8] filemap_fault+0x398/0xae0
[Tue Aug  9 07:31:11 2022] [c0000000177ebba0] [c00000000044fcb4] __do_fault+0x64/0x2a0
[Tue Aug  9 07:31:11 2022] [c0000000177ebbe0] [c0000000004542ac] __handle_mm_fault+0x103c/0x1d90
[Tue Aug  9 07:31:11 2022] [c0000000177ebce0] [c000000000455128] handle_mm_fault+0x128/0x310
[Tue Aug  9 07:31:11 2022] [c0000000177ebd30] [c000000000087fd4] ___do_page_fault+0x2a4/0xb90
[Tue Aug  9 07:31:11 2022] [c0000000177ebde0] [c000000000088b30] do_page_fault+0x30/0xc0
[Tue Aug  9 07:31:11 2022] [c0000000177ebe10] [c000000000008ce0] instruction_access_common_virt+0x190/0x1a0
[Tue Aug  9 07:31:11 2022] --- interrupt: 400 at 0x1117fa88
[Tue Aug  9 07:31:11 2022] NIP:  000000001117fa88 LR: 00000000111ae090 CTR: 00007fff98a84a30
[Tue Aug  9 07:31:11 2022] REGS: c0000000177ebe80 TRAP: 0400   Tainted: G             L     (5.18.9-200.fc36.ppc64le)
[Tue Aug  9 07:31:11 2022] MSR:  800000004280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 88002883  XER: 00000004
[Tue Aug  9 07:31:11 2022] CFAR: 00000000111ae08c IRQMASK: 0 
                           GPR00: 00000000111ae078 00007ffff0128140 0000000011d17a00 00007fff2a59ea40 
                           GPR04: 00007fff2a59ea00 00000000119e5c98 00000000473e35b0 0000000000000034 
                           GPR08: 0000000000000078 0000000045bfd49c 0000000045bfd49d 0000000045bfd499 
                           GPR12: 00007fff98a84a30 00007fff9921cd40 0000000000000000 00000000119e5c38 
                           GPR16: 00000000119e5d68 00007fff2a59ea00 0000000011d59b08 0000000011776520 
                           GPR20: 00000000119e5cc0 0000000011772648 00000000119e5c98 0000000011dee310 
                           GPR24: 00007fff254c1c00 0000000011782f50 000000000000000a 0000000000000004 
                           GPR28: 00007fff253b30d0 0000000011d4f098 0000000000000002 00007fff253b31d8 
[Tue Aug  9 07:31:11 2022] NIP [000000001117fa88] 0x1117fa88
[Tue Aug  9 07:31:11 2022] LR [00000000111ae090] 0x111ae090
[Tue Aug  9 07:31:11 2022] --- interrupt: 400
[Tue Aug  9 07:31:11 2022] Mem-Info:
[Tue Aug  9 07:31:11 2022] active_anon:263487 inactive_anon:21323 isolated_anon:0
                            active_file:0 inactive_file:82 isolated_file:0
                            unevictable:62 dirty:0 writeback:0
                            slab_reclaimable:2820 slab_unreclaimable:6300
                            mapped:25 shmem:62 pagetables:217 bounce:0
                            kernel_misc_reclaimable:0
                            free:1031 free_pcp:255 free_cma:0
[Tue Aug  9 07:31:11 2022] Node 0 active_anon:16863168kB inactive_anon:1364672kB active_file:0kB inactive_file:5248kB unevictable:3968kB isolated(anon):0kB isolated(file):0kB mapped:1600kB dirty:0kB writeback:0kB shmem:3968kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:4784kB pagetables:13888kB all_unreclaimable? no
[Tue Aug  9 07:31:11 2022] Node 0 Normal free:65984kB boost:49152kB min:71680kB low:92416kB high:113152kB reserved_highatomic:0KB active_anon:16863168kB inactive_anon:1364672kB active_file:1024kB inactive_file:0kB unevictable:3968kB writepending:0kB present:20971520kB managed:20812800kB mlocked:3968kB bounce:0kB free_pcp:16320kB local_pcp:2624kB free_cma:0kB
[Tue Aug  9 07:31:11 2022] lowmem_reserve[]: 0 0 0
[Tue Aug  9 07:31:11 2022] Node 0 Normal: 552*64kB (UME) 239*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 65920kB
[Tue Aug  9 07:31:11 2022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Tue Aug  9 07:31:11 2022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Tue Aug  9 07:31:11 2022] 238 total pagecache pages
[Tue Aug  9 07:31:11 2022] 41 pages in swap cache
[Tue Aug  9 07:31:11 2022] Swap cache stats: add 3350394, delete 3350428, find 718844/1746745
[Tue Aug  9 07:31:11 2022] Free swap  = 0kB
[Tue Aug  9 07:31:11 2022] Total swap = 8388544kB
[Tue Aug  9 07:31:11 2022] 327680 pages RAM
[Tue Aug  9 07:31:11 2022] 0 pages HighMem/MovableOnly
[Tue Aug  9 07:31:11 2022] 2480 pages reserved
[Tue Aug  9 07:31:11 2022] 0 pages cma reserved
[Tue Aug  9 07:31:11 2022] 0 pages hwpoisoned
[Tue Aug  9 07:31:11 2022] Tasks state (memory values in pages):
[Tue Aug  9 07:31:11 2022] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Tue Aug  9 07:31:11 2022] [    680]   193   680      560       74    30208       75             0 systemd-resolve
[Tue Aug  9 07:31:11 2022] [    690]     0   690      666       57    28416       29         -1000 auditd
[Tue Aug  9 07:31:11 2022] [    725]     0   725      511       53    25600       89             0 systemd-logind
[Tue Aug  9 07:31:11 2022] [    728]    81   728      270       37    27648       41          -900 dbus-broker-lau
[Tue Aug  9 07:31:11 2022] [    748]    81   748      159       61    22528       27          -900 dbus-broker
[Tue Aug  9 07:31:11 2022] [    757]     0   757     4318       56    29952       86             0 NetworkManager
[Tue Aug  9 07:31:11 2022] [    772]     0   772     2035       14    29696       71             0 gssproxy
[Tue Aug  9 07:31:11 2022] [    787]     0   787      197       52    27136       18             0 crond
[Tue Aug  9 07:31:11 2022] [    800]     0   800       69       17    26112        5             0 agetty
[Tue Aug  9 07:31:11 2022] [    801]     0   801      119        4    22272        5             0 agetty
[Tue Aug  9 07:31:11 2022] [    857]   998   857    45974       43    54528      121             0 polkitd
[Tue Aug  9 07:31:11 2022] [   7212]     0  7212       65       61    26112        0         -1000 watchdog
[Tue Aug  9 07:31:11 2022] [   7476]     0  7476      862       21    28416       44             0 master
[Tue Aug  9 07:31:11 2022] [   7478]    89  7478      885       46    29440       49             0 qmgr
[Tue Aug  9 07:31:11 2022] [   7507]     0  7507       84       26    26112       12             0 rpc.idmapd
[Tue Aug  9 07:31:11 2022] [   7513]     0  7513      387       29    23040       38             0 rpc.gssd
[Tue Aug  9 07:31:11 2022] [   7515]    32  7515      324       29    28160       41             0 rpcbind
[Tue Aug  9 07:31:11 2022] [   7517]    29  7517      243       33    27136       32             0 rpc.statd
[Tue Aug  9 07:31:11 2022] [   7520]     0  7520      134       29    26624       16             0 nfsdcld
[Tue Aug  9 07:31:11 2022] [   7521]     0  7521      317        7    23552       40             0 rpc.mountd
[Tue Aug  9 07:31:11 2022] [   7561]   996  7561     1427       36    28416       31             0 chronyd
[Tue Aug  9 07:31:11 2022] [  13199]     0 13199      245        3    23296       30             0 oddjobd
[Tue Aug  9 07:31:11 2022] [  13416]     0 13416     1004       46    31744      120             0 sssd
[Tue Aug  9 07:31:11 2022] [  13418]     0 13418     1523       41    36096      100             0 sssd_nss
[Tue Aug  9 07:31:11 2022] [  13419]     0 13419     1029       28    27648      119             0 sssd_pam
[Tue Aug  9 07:31:11 2022] [  13420]     0 13420      988       30    27392      111             0 sssd_ssh
[Tue Aug  9 07:31:11 2022] [  13421]     0 13421      982       37    31488      106             0 sssd_sudo
[Tue Aug  9 07:31:11 2022] [  13422]     0 13422     1893       41    30720      157             0 sssd_pac
[Tue Aug  9 07:31:11 2022] [  20590]     0 20590      895       65    28416       95          -900 virtlogd
[Tue Aug  9 07:31:11 2022] [  22233]     0 22233      440       54    28928       46             0 systemd-machine
[Tue Aug  9 07:31:11 2022] [  22320]   423 22320      255       39    23296       24             0 dnsmasq
[Tue Aug  9 07:31:11 2022] [  22321]     0 22321      254       30    23296       25             0 dnsmasq
[Tue Aug  9 07:31:11 2022] [1873186]     0 1873186      706       62    26112       85         -1000 systemd-udevd
[Tue Aug  9 07:31:11 2022] [1873189]     0 1873189      428       45    24832       49             0 systemd-userdbd
[Tue Aug  9 07:31:11 2022] [1873202]     0 1873202     3099       89    45824       54          -250 systemd-journal
[Tue Aug  9 07:31:11 2022] [2990000]     0 2990000    10498       97    74240       29             0 rsyslogd
[Tue Aug  9 07:31:11 2022] [ 891427]     0 891427      385       54    24576       46         -1000 sshd
[Tue Aug  9 07:31:11 2022] [1710876]     0 1710876     2346       74    33536      197             0 sssd_be
[Tue Aug  9 07:31:11 2022] [1953812]     0 1953812     2921      579    49152      748             0 kojid
[Tue Aug  9 07:31:11 2022] [1149099]    89 1149099     1062       90    25600        1             0 pickup
[Tue Aug  9 07:31:11 2022] [1154989]     0 1154989     2921      144    47872     1159             0 kojid
[Tue Aug  9 07:31:11 2022] [1155947]  1000 1155947     1171       66    31232      541             0 mock
[Tue Aug  9 07:31:11 2022] [1158092]  1000 1158092      375        1    28672       69             0 rpmbuild
[Tue Aug  9 07:31:11 2022] [1158125]  1000 1158125      113        1    26368       13             0 sh
[Tue Aug  9 07:31:11 2022] [1158469]  1000 1158469      673        0    26624       86             0 cmake
[Tue Aug  9 07:31:11 2022] [1158470]  1000 1158470      670        0    26880       83             0 cmake
[Tue Aug  9 07:31:11 2022] [1158471]  1000 1158471      227        1    27648       28             0 gmake
[Tue Aug  9 07:31:11 2022] [1158474]  1000 1158474      230        1    23040       28             0 gmake
[Tue Aug  9 07:31:11 2022] [1158491]  1000 1158491      235        1    23296       35             0 gmake
[Tue Aug  9 07:31:11 2022] [1158493]  1000 1158493      233        1    23040       33             0 gmake
[Tue Aug  9 07:31:11 2022] [1162555]     0 1162555      439       44    29184       45             0 systemd-userwor
[Tue Aug  9 07:31:11 2022] [1162557]  1000 1162557       96        0    22272       12             0 g++
[Tue Aug  9 07:31:11 2022] [1162558]  1000 1162558    58395    26130   498944    30153             0 cc1plus
[Tue Aug  9 07:31:11 2022] [1162559]  1000 1162559    18288    14169   176384     4006             0 as
[Tue Aug  9 07:31:11 2022] [1162591]  1000 1162591       96        0    22272       12             0 g++
[Tue Aug  9 07:31:11 2022] [1162592]  1000 1162592    57875    19075   494080    36758             0 cc1plus
[Tue Aug  9 07:31:11 2022] [1162593]  1000 1162593    15394    13764   148992     1517             0 as
[Tue Aug  9 07:31:11 2022] [1162681]  1000 1162681       96        0    22016       13             0 g++
[Tue Aug  9 07:31:11 2022] [1162682]  1000 1162682    32957    20668   293888    10695             0 cc1plus
[Tue Aug  9 07:31:11 2022] [1162683]  1000 1162683    10759    10120   111616      525             0 as
[Tue Aug  9 07:31:11 2022] [1162693]  1000 1162693       96        0    22016       12             0 g++
[Tue Aug  9 07:31:11 2022] [1162694]  1000 1162694    46700    30888   400896    14332             0 cc1plus
[Tue Aug  9 07:31:11 2022] [1162695]  1000 1162695     6004     5043    73984      845             0 as
[Tue Aug  9 07:31:11 2022] [1162697]  1000 1162697       96        0    22016       12             0 g++
[Tue Aug  9 07:31:11 2022] [1162698]  1000 1162698    51999    40214   441344     9458             0 cc1plus
[Tue Aug  9 07:31:11 2022] [1162699]  1000 1162699     2104        1    42752     1991             0 as
[Tue Aug  9 07:31:11 2022] [1162702]  1000 1162702       96        0    26112       12             0 g++
[Tue Aug  9 07:31:11 2022] [1162703]  1000 1162703    32957    25201   289792     6061             0 cc1plus
[Tue Aug  9 07:31:11 2022] [1162704]  1000 1162704     5851     5207    72704      527             0 as
[Tue Aug  9 07:31:11 2022] [1162710]  1000 1162710       96        0    26112       12             0 g++
[Tue Aug  9 07:31:11 2022] [1162711]  1000 1162711    42924    35383   378624     6060             0 cc1plus
[Tue Aug  9 07:31:11 2022] [1162712]  1000 1162712     1735     1079    35584      543             0 as
[Tue Aug  9 07:31:11 2022] [1162731]  1000 1162731       96        0    22272       12             0 g++
[Tue Aug  9 07:31:11 2022] [1162732]  1000 1162732    38014    34452   331520     2444             0 cc1plus
[Tue Aug  9 07:31:11 2022] [1162733]  1000 1162733     2104     1567    38400      396             0 as
[Tue Aug  9 07:31:11 2022] [1162753]     0 1162753      439       89    24832        0             0 systemd-userwor
[Tue Aug  9 07:31:11 2022] [1162754]     0 1162754      151       23    22016        0             0 systemd-userwor
[Tue Aug  9 07:31:11 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=cc1plus,pid=1162558,uid=1000
[Tue Aug  9 07:31:11 2022] Out of memory: Killed process 1162558 (cc1plus) total-vm:3737280kB, anon-rss:1672256kB, file-rss:64kB, shmem-rss:0kB, UID:1000 pgtables:487kB oom_score_adj:0

For ppc64le I agree with your assessment on %limit_build. I do not know about s390x yet.

The s390x build did not die due to OOM. I do not see any logs showing why it failed, so I will try to keep a copy of its logs.

OK, that's not what I was expecting to see. The cc1plus process that died was using 3.7 GB of virtual memory, which is roughly twice what I assumed was required. The other processes are all using similar amounts of RAM. I'll adjust the %limit_build macro to request 4 GB per vCPU instead of 2 GB per vCPU.

I'm going to let the current doomed build continue, because the s390x build is 8 hours along and you are watching to see how it fails, and I don't want to set our investigation back by 8 hours by canceling it now.
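A sketch of what the %limit_build adjustment mentioned above might look like in the spec (again assuming the -m option takes a per-job minimum in MB; the value matches the 4 GB per vCPU discussed here):

# Require at least ~4 GB of RAM per parallel compile job instead of ~2 GB,
# further reducing parallelism on small builders.
%limit_build -m 4096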

Option 1 (sooner): increase the zram size. It's capped at 8G on this system, and it's safe to set it 1:1 with RAM, i.e. 20G.

  • cp /usr/lib/systemd/zram-generator.conf /etc/systemd/zram-generator.conf
  • Edit the file so the configuration reads:
[zram0]
zram-size = ram
  • A reboot is safest. I'd only consider systemctl restart systemd-zram-setup@zram0.service if the system is not busy and has enough free memory to accept all of the used pages in swap (checking with free -m is sufficient).
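A quick way to confirm the new size afterwards (a suggested check, not part of the steps above; both are standard util-linux tools):

zramctl
swapon --show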

Option 2 (later): do a little profiling of the workload to see if systems like these are better off with disk-based swap, and either reconfigure or reprovision. I can help with figuring this out.

The ppc64le build got further before it OOM'd on:

[Tue Aug  9 07:31:11 2022] Out of memory: Killed process 1162558 (cc1plus) total-vm:3737280kB, anon-rss:1672256kB, file-rss:64kB, shmem-rss:0kB, UID:1000 pgtables:487kB oom_score_adj:0
[Tue Aug  9 16:12:46 2022] gdb.minimal invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[Tue Aug  9 16:12:46 2022] CPU: 6 PID: 1466670 Comm: gdb.minimal Tainted: G             L    5.18.9-200.fc36.ppc64le #1
[Tue Aug  9 16:12:46 2022] Call Trace:
[Tue Aug  9 16:12:46 2022] [c0000000a0317650] [c000000000a7b920] dump_stack_lvl+0x74/0xa8 (unreliable)
[Tue Aug  9 16:12:46 2022] [c0000000a0317690] [c0000000003faad0] dump_header+0x64/0x24c
[Tue Aug  9 16:12:46 2022] [c0000000a0317710] [c0000000003f8ec4] oom_kill_process+0x344/0x350
[Tue Aug  9 16:12:46 2022] [c0000000a0317750] [c0000000003fa448] out_of_memory+0x228/0x780
[Tue Aug  9 16:12:46 2022] [c0000000a03177f0] [c000000000488928] __alloc_pages+0x10d8/0x11a0
[Tue Aug  9 16:12:46 2022] [c0000000a03179d0] [c0000000004bed64] alloc_pages+0xd4/0x230
[Tue Aug  9 16:12:46 2022] [c0000000a0317a10] [c0000000004beef4] folio_alloc+0x34/0x90
[Tue Aug  9 16:12:46 2022] [c0000000a0317a40] [c0000000003f1aa4] __filemap_get_folio+0x294/0x650
[Tue Aug  9 16:12:46 2022] [c0000000a0317af0] [c0000000003f21f8] filemap_fault+0x398/0xae0
[Tue Aug  9 16:12:46 2022] [c0000000a0317ba0] [c00000000044fcb4] __do_fault+0x64/0x2a0
[Tue Aug  9 16:12:46 2022] [c0000000a0317be0] [c0000000004542ac] __handle_mm_fault+0x103c/0x1d90
[Tue Aug  9 16:12:46 2022] [c0000000a0317ce0] [c000000000455128] handle_mm_fault+0x128/0x310
[Tue Aug  9 16:12:46 2022] [c0000000a0317d30] [c000000000087fd4] ___do_page_fault+0x2a4/0xb90
[Tue Aug  9 16:12:46 2022] [c0000000a0317de0] [c000000000088b30] do_page_fault+0x30/0xc0
[Tue Aug  9 16:12:46 2022] [c0000000a0317e10] [c000000000008914] data_access_common_virt+0x194/0x1f0
[Tue Aug  9 16:12:46 2022] --- interrupt: 300 at 0x118f628c0
[Tue Aug  9 16:12:46 2022] NIP:  0000000118f628c0 LR: 0000000118f716e0 CTR: 0000000118f7d8e0
[Tue Aug  9 16:12:46 2022] REGS: c0000000a0317e80 TRAP: 0300   Tainted: G             L     (5.18.9-200.fc36.ppc64le)
[Tue Aug  9 16:12:46 2022] MSR:  800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE>  CR: 44024484  XER: 2004002d
[Tue Aug  9 16:12:46 2022] CFAR: c00000000000c94c DAR: 00007fff48d396ee DSISR: 40000000 IRQMASK: 0 
                           GPR00: 0000000118fb912c 00007fffc6da8520 0000000119947900 000000013515fa00 
                           GPR04: 00007fff48d396ee 00007fffc6da8668 00007fff48d396e9 00000001367d5870 
                           GPR08: 0000000118f7d70c 0000000000000000 0000000000000000 ff00000000000000 
                           GPR12: 0000000118f417d0 00007fffa7cb8c60 0000000000000000 000000034eaccdb0 
                           GPR16: 0000000000010016 0000000000000000 0000000000000088 000000034e95dbf0 
                           GPR20: 00007fffc6da8668 0000000000000000 0000000000000000 00007fffc6da8670 
                           GPR24: 0000000000000003 00007fff48d396ee 000000034eaccdb0 000000034eaccd40 
                           GPR28: 00007fffc6da8798 000000026467f4a0 000000013515fa00 00007fffc6da8798 
[Tue Aug  9 16:12:46 2022] NIP [0000000118f628c0] 0x118f628c0
[Tue Aug  9 16:12:46 2022] LR [0000000118f716e0] 0x118f716e0
...
[Tue Aug  9 16:12:46 2022] [1466638]  1000 1466638   187716   100777  1496320    42728             0 gdb.minimal
[Tue Aug  9 16:12:46 2022] [1466666]  1000 1466666       95        1    22272       13             0 gdb-add-index
[Tue Aug  9 16:12:46 2022] [1466670]  1000 1466670   186092    96280  1491968    45644             0 gdb.minimal
[Tue Aug  9 16:12:46 2022] [1466680]  1000 1466680       95        0    22016       12             0 gdb-add-index
[Tue Aug  9 16:12:46 2022] [1466684]  1000 1466684   169588    87880  1355008    37420             0 gdb.minimal
[Tue Aug  9 16:12:46 2022] [1466893]     0 1466893      439       73    28672        0             0 systemd-userwor
[Tue Aug  9 16:12:46 2022] [1466911]     0 1466911      439       74    24832        0             0 systemd-userwor
[Tue Aug  9 16:12:46 2022] [1466912]     0 1466912      439       63    24832        0             0 systemd-userwor
[Tue Aug  9 16:12:46 2022] [1466931]     0 1466931      395       47    27136       15             0 crond
[Tue Aug  9 16:12:46 2022] [1466932]     0 1466932      152       47    26880        0             0 lock-wrapper
[Tue Aug  9 16:12:46 2022] [1466935]     0 1466935      152       32    22528        0             0 osbuildapi-upda
[Tue Aug  9 16:12:46 2022] [1466937]     0 1466937      152       12    22272        0             0 osbuildapi-upda
[Tue Aug  9 16:12:46 2022] [1466938]     0 1466938      368       41    23296        0             0 resolvectl
[Tue Aug  9 16:12:46 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=gdb.minimal,pid=1466638,uid=1000
[Tue Aug  9 16:12:46 2022] Out of memory: Killed process 1466638 (gdb.minimal) total-vm:12013824kB, anon-rss:6449728kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:1461kB oom_score_adj:0

That's the same as the last OOM, and should be fixed by my adjustment to the package's %limit_build.

The most recent trace is missing the section on swap free/total - did this have a zram bump or not yet? The previous one had 0 bytes of swap free.

Nothing has been changed with zram... that would require changes by Real(TM) releng; I am just filling in.

[Tue Aug  9 16:12:46 2022] --- interrupt: 300
[Tue Aug  9 16:12:46 2022] Mem-Info:
[Tue Aug  9 16:12:46 2022] active_anon:265307 inactive_anon:21362 isolated_anon:0
                            active_file:78 inactive_file:0 isolated_file:0
                            unevictable:62 dirty:10 writeback:0
                            slab_reclaimable:2838 slab_unreclaimable:6212
                            mapped:82 shmem:13 pagetables:220 bounce:0
                            kernel_misc_reclaimable:0
                            free:847 free_pcp:0 free_cma:0
[Tue Aug  9 16:12:46 2022] Node 0 active_anon:16979648kB inactive_anon:1367168kB active_file:4992kB inactive_file:0kB unevictable:3968kB isolated(anon):0kB isolated(file):0kB mapped:5248kB dirty:640kB writeback:0kB shmem:832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:4896kB pagetables:14080kB all_unreclaimable? no
[Tue Aug  9 16:12:46 2022] Node 0 Normal free:54208kB boost:34816kB min:57344kB low:78080kB high:98816kB reserved_highatomic:0KB active_anon:16979648kB inactive_anon:1367168kB active_file:3968kB inactive_file:0kB unevictable:3968kB writepending:640kB present:20971520kB managed:20812800kB mlocked:3968kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Tue Aug  9 16:12:46 2022] lowmem_reserve[]: 0 0 0
[Tue Aug  9 16:12:46 2022] Node 0 Normal: 263*64kB (UME) 252*128kB (UME) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 49344kB
[Tue Aug  9 16:12:46 2022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Tue Aug  9 16:12:46 2022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Tue Aug  9 16:12:46 2022] 207 total pagecache pages
[Tue Aug  9 16:12:46 2022] 28 pages in swap cache
[Tue Aug  9 16:12:46 2022] Swap cache stats: add 3777065, delete 3777098, find 723615/2033381
[Tue Aug  9 16:12:46 2022] Free swap  = 0kB
[Tue Aug  9 16:12:46 2022] Total swap = 8388544kB
[Tue Aug  9 16:12:46 2022] 327680 pages RAM
[Tue Aug  9 16:12:46 2022] 0 pages HighMem/MovableOnly
[Tue Aug  9 16:12:46 2022] 2480 pages reserved
[Tue Aug  9 16:12:46 2022] 0 pages cma reserved
[Tue Aug  9 16:12:46 2022] 0 pages hwpoisoned
[Tue Aug  9 16:12:46 2022] Tasks state (memory values in pages):

How helpful a bigger zram-size is depends on the output from zramctl once the server has been up a while (15 minutes, I guess) but before an OOM kill.

zramctl output

NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 lzo-rle         8G   8G  2.2G  2.3G       3 [SWAP]

Currently the system is chewing through:

extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-15.fc37.s390x/usr/lib64/libwebkit2gtk-4.0.so.37.57.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-15.fc37.s390x/usr/lib64/libwebkit2gtk-4.1.so.0.2.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-15.fc37.s390x/usr/lib64/libwebkit2gtk-5.0.so.0.0.0

The good news is the workload is highly compressible, about 3.5x including overhead. The bad news is swap is full, which means the kernel can only reclaim file pages, and that is more expensive in time and I/O than just bumping zram-size. Would disk-based swap be even better? Not sure. We'd have to look at the behavior under load, but this system definitely needs more swap given the workload.

OK, the s390x build crashed while extracting debug info from libwebkit2gtk-5.0.so.0.0.0. Here is the OOM. At this point, @catanzaro, I would say it is time to kill the job completely and put in the changes you were looking at:

[Tue Aug  9 18:28:24 2022] systemd-journald[508]: Field hash table of /var/log/journal/7080e3794c9045ddaaf8e2b9a8ab0243/system.journal has a fill level at 75.4 (251 of 333 items), suggesting rotation.
[Tue Aug  9 18:28:24 2022] systemd-journald[508]: /var/log/journal/7080e3794c9045ddaaf8e2b9a8ab0243/system.journal: Journal header limits reached or header out-of-date, rotating.
[Tue Aug  9 20:03:26 2022] systemd-userwor invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[Tue Aug  9 20:03:26 2022] CPU: 1 PID: 4106515 Comm: systemd-userwor Not tainted 5.18.11-200.fc36.s390x #1
[Tue Aug  9 20:03:26 2022] Hardware name: IBM 8561 LT1 400 (KVM/Linux)
[Tue Aug  9 20:03:26 2022] Call Trace:
[Tue Aug  9 20:03:26 2022]  [<0000000389fe89a2>] dump_stack_lvl+0x62/0x80 
[Tue Aug  9 20:03:26 2022]  [<0000000389fdff7a>] dump_header+0x62/0x248 
[Tue Aug  9 20:03:26 2022]  [<000000038961b3bc>] oom_kill_process+0x1f4/0x1f8 
[Tue Aug  9 20:03:26 2022]  [<000000038961c49c>] out_of_memory+0x114/0x6d8 
[Tue Aug  9 20:03:26 2022]  [<00000003896862da>] __alloc_pages_slowpath.constprop.0+0xd02/0xe58 
[Tue Aug  9 20:03:26 2022]  [<0000000389686764>] __alloc_pages+0x334/0x358 
[Tue Aug  9 20:03:26 2022]  [<0000000389615e52>] __filemap_get_folio+0x112/0x458 
[Tue Aug  9 20:03:26 2022]  [<00000003896162e0>] filemap_fault+0x148/0x920 
[Tue Aug  9 20:03:26 2022]  [<000000038965a164>] __do_fault+0x4c/0x130 
[Tue Aug  9 20:03:26 2022]  [<000000038965f196>] __handle_mm_fault+0x8a6/0x10a0 
[Tue Aug  9 20:03:26 2022]  [<000000038965fa5e>] handle_mm_fault+0xce/0x210 
[Tue Aug  9 20:03:26 2022]  [<000000038940b7f0>] do_exception+0x1e0/0x488 
[Tue Aug  9 20:03:26 2022]  [<000000038940c00a>] do_dat_exception+0x2a/0x50 
[Tue Aug  9 20:03:26 2022]  [<0000000389fed048>] __do_pgm_check+0xf0/0x1b0 
[Tue Aug  9 20:03:26 2022]  [<0000000389ffc80e>] pgm_check_handler+0x11e/0x180 
[Tue Aug  9 20:03:26 2022] Mem-Info:
[Tue Aug  9 20:03:26 2022] active_anon:3236999 inactive_anon:300098 isolated_anon:0
                            active_file:51 inactive_file:5746 isolated_file:135
                            unevictable:0 dirty:0 writeback:0
                            slab_reclaimable:109646 slab_unreclaimable:44804
                            mapped:22 shmem:6 pagetables:15946 bounce:0
                            kernel_misc_reclaimable:0
                            free:18946 free_pcp:401 free_cma:0
[Tue Aug  9 20:03:26 2022] Node 0 active_anon:12947996kB inactive_anon:1200392kB active_file:204kB inactive_file:22536kB unevictable:0kB isolated(anon):0kB isolated(file):540kB mapped:88kB dirty:0kB writeback:0kB shmem:24kB writeback_tmp:0kB kernel_stack:3680kB pagetables:63784kB all_unreclaimable? no
[Tue Aug  9 20:03:26 2022] Node 0 DMA free:61632kB boost:0kB min:1988kB low:4064kB high:6140kB reserved_highatomic:0KB active_anon:1608784kB inactive_anon:181832kB active_file:8kB inactive_file:2668kB unevictable:0kB writepending:0kB present:2097152kB managed:2097068kB mlocked:0kB bounce:0kB free_pcp:48kB local_pcp:0kB free_cma:0kB
[Tue Aug  9 20:03:26 2022] lowmem_reserve[]: 0 15006 15006
[Tue Aug  9 20:03:26 2022] Node 0 Normal free:14656kB boost:0kB min:14712kB low:30076kB high:45440kB reserved_highatomic:0KB active_anon:11339212kB inactive_anon:1018560kB active_file:696kB inactive_file:19704kB unevictable:0kB writepending:0kB present:15728640kB managed:15371692kB mlocked:0kB bounce:0kB free_pcp:1104kB local_pcp:380kB free_cma:0kB
[Tue Aug  9 20:03:26 2022] lowmem_reserve[]: 0 0 0
[Tue Aug  9 20:03:26 2022] Node 0 DMA: 208*4kB (UME) 198*8kB (UME) 297*16kB (UME) 214*32kB (UME) 150*64kB (UME) 82*128kB (UME) 31*256kB (UME) 8*512kB (UM) 2*1024kB (UM) 3*2048kB (U) 2*4096kB (UM) = 62528kB
[Tue Aug  9 20:03:26 2022] Node 0 Normal: 526*4kB (UME) 367*8kB (UME) 285*16kB (UME) 148*32kB (UME) 6*64kB (U) 2*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14976kB
[Tue Aug  9 20:03:26 2022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1024kB
[Tue Aug  9 20:03:26 2022] 6222 total pagecache pages
[Tue Aug  9 20:03:26 2022] 379 pages in swap cache
[Tue Aug  9 20:03:26 2022] Swap cache stats: add 12362617, delete 12362237, find 126940/6308439
[Tue Aug  9 20:03:26 2022] Free swap  = 0kB
[Tue Aug  9 20:03:26 2022] Total swap = 8388604kB
[Tue Aug  9 20:03:26 2022] 4456448 pages RAM
[Tue Aug  9 20:03:26 2022] 0 pages HighMem/MovableOnly
[Tue Aug  9 20:03:26 2022] 89258 pages reserved
[Tue Aug  9 20:03:26 2022] 0 pages cma reserved
[Tue Aug  9 20:03:26 2022] Tasks state (memory values in pages):
[Tue Aug  9 20:03:26 2022] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Tue Aug  9 20:03:26 2022] [    508]     0   508    28935      134   143360      408          -250 systemd-journal
[Tue Aug  9 20:03:26 2022] [    541]     0   541     7793        5   108544      817         -1000 systemd-udevd
[Tue Aug  9 20:03:26 2022] [    606]   193   606     5658      151   112640      769             0 systemd-resolve
[Tue Aug  9 20:03:26 2022] [    607]     0   607     6325        8    75776      203         -1000 auditd
[Tue Aug  9 20:03:26 2022] [    611]     0   611     4023       23    92160      204             0 systemd-userdbd
[Tue Aug  9 20:03:26 2022] [    639]    81   639     2478       44    67584      163          -900 dbus-broker-lau
[Tue Aug  9 20:03:26 2022] [    646]    81   646     1268       87    57344      169          -900 dbus-broker
[Tue Aug  9 20:03:26 2022] [    652]     2   652    61233        0   124928      411             0 rngd
[Tue Aug  9 20:03:26 2022] [    653]     0   653   169445      176   483328      636             0 rsyslogd
[Tue Aug  9 20:03:26 2022] [    656]     0   656     6727      164   149504      318             0 sssd
[Tue Aug  9 20:03:26 2022] [    657]     0   657     4135       56    92160      224             0 systemd-machine
[Tue Aug  9 20:03:26 2022] [    678]   996   678    21086       34    77824      137             0 chronyd
[Tue Aug  9 20:03:26 2022] [    699]     0   699    63891      187   176128      506             0 NetworkManager
[Tue Aug  9 20:03:26 2022] [    706]     0   706     2281        1    71680      100             0 oddjobd
[Tue Aug  9 20:03:26 2022] [    708]     0   708     3579        2    88064      305         -1000 sshd
[Tue Aug  9 20:03:26 2022] [    709]     0   709    29880        0   120832      227             0 gssproxy
[Tue Aug  9 20:03:26 2022] [    731]     0   731     4443        5    67584      171             0 rpc.gssd
[Tue Aug  9 20:03:26 2022] [    772]     0   772     6733      103   155648      294             0 sssd_pam
[Tue Aug  9 20:03:26 2022] [    773]     0   773     6500       89   149504      283             0 sssd_ssh
[Tue Aug  9 20:03:26 2022] [    774]     0   774     6484      100   149504      279             0 sssd_sudo
[Tue Aug  9 20:03:26 2022] [    775]     0   775    18345       95   202752      519             0 sssd_pac
[Tue Aug  9 20:03:26 2022] [    795]     0   795     4635       49   100352      559             0 systemd-logind
[Tue Aug  9 20:03:26 2022] [    799]     0   799     2001       31    69632       73             0 crond
[Tue Aug  9 20:03:26 2022] [    803]     0   803      706        0    47104       32             0 agetty
[Tue Aug  9 20:03:26 2022] [    804]     0   804     1348        0    61440       35             0 agetty
[Tue Aug  9 20:03:26 2022] [    860]     0   860    11210       36    77824      213             0 master
[Tue Aug  9 20:03:26 2022] [    862]    89   862    14186       63   114688      210             0 qmgr
[Tue Aug  9 20:03:26 2022] [ 353149]   423 353149     2428        0    69632      107             0 dnsmasq
[Tue Aug  9 20:03:26 2022] [ 353150]     0 353150     2421        3    67584      100             0 dnsmasq
[Tue Aug  9 20:03:26 2022] [ 353391]     0 353391     8824        2   153600      466          -900 virtlogd
[Tue Aug  9 20:03:26 2022] [ 665942]     0 665942    30934     3803   444416    11156             0 kojid
[Tue Aug  9 20:03:26 2022] [4028975]     0 4028975    30934      899   415744    14028             0 kojid
[Tue Aug  9 20:03:26 2022] [4029362]  1000 4029362    12780      123   202752     7313             0 mock
[Tue Aug  9 20:03:26 2022] [4029682]  1000 4029682     3730        0    98304      731             0 rpmbuild
[Tue Aug  9 20:03:26 2022] [4089952]     0 4089952     7062        9   122880      449             0 sshd
[Tue Aug  9 20:03:26 2022] [4089955]     0 4089955     5214       10   104448      771           100 systemd
[Tue Aug  9 20:03:26 2022] [4089957]     0 4089957    46203        1   145408     1675           100 (sd-pam)
[Tue Aug  9 20:03:26 2022] [4089966]     0 4089966     7100       47   122880      453             0 sshd
[Tue Aug  9 20:03:26 2022] [4089967]     0 4089967     2171      211    67584      295             0 bash
[Tue Aug  9 20:03:26 2022] [4102697]  1000 4102697     1139        2    61440       85             0 sh
[Tue Aug  9 20:03:26 2022] [4102878]  1000 4102878     1034        1    49152      107             0 find-debuginfo
[Tue Aug  9 20:03:26 2022] [4102893]  1000 4102893     1034        1    49152      107             0 find-debuginfo
[Tue Aug  9 20:03:26 2022] [4102894]  1000 4102894     1067        2    49152      124             0 find-debuginfo
[Tue Aug  9 20:03:26 2022] [4102895]  1000 4102895     1067        2    49152      124             0 find-debuginfo
[Tue Aug  9 20:03:26 2022] [4102897]  1000 4102897     1067        2    49152      124             0 find-debuginfo
[Tue Aug  9 20:03:26 2022] [4103138]  1000 4103138     1001        2    49152       76             0 gdb-add-index
[Tue Aug  9 20:03:26 2022] [4103142]  1000 4103142  2584403  1256982 20342784   714850             0 gdb.minimal
[Tue Aug  9 20:03:26 2022] [4103147]  1000 4103147     1001        2    49152       75             0 gdb-add-index
[Tue Aug  9 20:03:26 2022] [4103151]  1000 4103151  2352687  1099992 18472960   640325             0 gdb.minimal
[Tue Aug  9 20:03:26 2022] [4103156]  1000 4103156     1001        2    49152       76             0 gdb-add-index
[Tue Aug  9 20:03:26 2022] [4103160]  1000 4103160  2488701  1169153 19564544   706469             0 gdb.minimal
[Tue Aug  9 20:03:26 2022] [4104953]     0 4104953      716       16    57344       15             0 tail
[Tue Aug  9 20:03:26 2022] [4106292]     0 4106292     4460      218   108544        0             0 sssd_be
[Tue Aug  9 20:03:26 2022] [4106405]    89 4106405    14138      262   100352        0             0 pickup
[Tue Aug  9 20:03:26 2022] [4106422]     0 4106422    11490      233    90112        0             0 smtp
[Tue Aug  9 20:03:26 2022] [4106423]    89 4106423    14163      263    98304        0             0 cleanup
[Tue Aug  9 20:03:26 2022] [4106433]     0 4106433     4942       63    65536       53             0 crond
[Tue Aug  9 20:03:26 2022] [4106438]     0 4106438     4942       68    65536       52             0 crond
[Tue Aug  9 20:03:26 2022] [4106444]     0 4106444    11202      241    90112        0             0 sendmail
[Tue Aug  9 20:03:26 2022] [4106455]     0 4106455     4942       68    65536       52             0 crond
[Tue Aug  9 20:03:26 2022] [4106457]     0 4106457    11149      182    81920        0             0 sendmail
[Tue Aug  9 20:03:26 2022] [4106461]     0 4106461    11164      183    83968        0             0 trivial-rewrite
[Tue Aug  9 20:03:26 2022] [4106467]     0 4106467     4942       66    65536       53             0 crond
[Tue Aug  9 20:03:26 2022] [4106468]     0 4106468     4108      234    92160        0             0 systemd-userwor
[Tue Aug  9 20:03:26 2022] [4106469]     0 4106469     4108      234    92160        0             0 systemd-userwor
[Tue Aug  9 20:03:26 2022] [4106471]     0 4106471    11144      164    81920        0             0 sendmail
[Tue Aug  9 20:03:26 2022] [4106476]     0 4106476    11143      164    81920        0             0 postdrop
[Tue Aug  9 20:03:26 2022] [4106479]     0 4106479     4942       64    65536       53             0 crond
[Tue Aug  9 20:03:26 2022] [4106480]     0 4106480     3656       96    88064        0             0 sssd_nss
[Tue Aug  9 20:03:26 2022] [4106482]     0 4106482    11144      164    81920        0             0 sendmail
[Tue Aug  9 20:03:26 2022] [4106496]     0 4106496     1744       68    63488        0             0 osbuildapi-upda
[Tue Aug  9 20:03:26 2022] [4106497]     0 4106497     4942       59    65536       54             0 crond
[Tue Aug  9 20:03:26 2022] [4106499]     0 4106499     1744       67    63488        0             0 osbuildapi-upda
[Tue Aug  9 20:03:26 2022] [4106501]     0 4106501     1744       70    51200        0             0 osbuildapi-upda
[Tue Aug  9 20:03:26 2022] [4106502]     0 4106502     4029       78    79872        0             0 resolvectl
[Tue Aug  9 20:03:26 2022] [4106504]     0 4106504     1744       72    63488        0             0 lock-wrapper
[Tue Aug  9 20:03:26 2022] [4106505]     0 4106505     1744       69    51200        0             0 osbuildapi-upda
[Tue Aug  9 20:03:26 2022] [4106506]     0 4106506     4029       78    81920        0             0 resolvectl
[Tue Aug  9 20:03:26 2022] [4106507]     0 4106507     4942       61    65536       54             0 crond
[Tue Aug  9 20:03:26 2022] [4106510]     0 4106510     1744       72    63488        0             0 lock-wrapper
[Tue Aug  9 20:03:26 2022] [4106513]     0 4106513     4942       61    65536       54             0 crond
[Tue Aug  9 20:03:26 2022] [4106514]     0 4106514     4942       61    65536       54             0 crond
[Tue Aug  9 20:03:26 2022] [4106515]     0 4106515     4000       77    79872        0             0 systemd-userwor
[Tue Aug  9 20:03:26 2022] [4106516]     0 4106516     1744       69    79872        0             0 lock-wrapper
[Tue Aug  9 20:03:26 2022] [4106517]     0 4106517     4951       58    65536       55             0 crond
[Tue Aug  9 20:03:26 2022] [4106520]     0 4106520     1744       58    63488        0             0 run-parts
[Tue Aug  9 20:03:26 2022] [4106522]     0 4106522     2015       52    59392       55             0 crond
[Tue Aug  9 20:03:26 2022] [4106524]     0 4106524     1555       40    63488        0             0 mkdir
[Tue Aug  9 20:03:26 2022] [4106525]     0 4106525     1142       25    47104        0             0 ps
[Tue Aug  9 20:03:26 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/kojid.service,task=gdb.minimal,pid=4103142,uid=1000
[Tue Aug  9 20:03:26 2022] Out of memory: Killed process 4103142 (gdb.minimal) total-vm:10337612kB, anon-rss:5027928kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:19866kB oom_score_adj:0
[Tue Aug  9 20:03:31 2022] oom_reaper: reaped process 4103142 (gdb.minimal), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

As nirik pointed out to me, there are two builds going on and what I pasted was the older one.

OK, the s390x build crashed while extracting debuginfo from libwebkit2gtk-5.0.so.0.0.0. Here is the OOM. At this point, @catanzaro, I would say it is time to kill the job completely and put in the changes you were looking at:

Now that you've caught this info, I've canceled the older -15 build. The newer -16 build already contains my changes to request 4 GB RAM per vCPU. I'm pretty sure that will fix ppc64le, but it will not help at all with s390x: debuginfo processing is not affected by the %limit_build macro. All that macro does is compute the -j that gets passed to ninja: that's it.

With this failed s390x build, we see it running three gdb processes at once, each using about 10 GB of virtual memory. That's a lot. (Note that I did perform a scratch build a week or two ago, before we imported the new webkitgtk package, to make sure our infrastructure could handle this, and it succeeded on s390x after Kevin disabled systemd-oomd.) Interestingly, the parallelism seems to be based on the number of vCPUs, but I don't think there's any way for the packager to control it, so that's kinda out of my hands. This probably indicates that switching to 3 vCPUs per VM actually backfired on us by increasing the required RAM by 50%. :/

Anyway, I cannot control the parallelism of the debuginfo processing. I guess RPM is responsible for that? What I can control is the _dwz_max_die_limit:

# Increase the DIE limit so our debuginfo packages can be size-optimized.
# Decreases the size for x86_64 from ~5G to ~1.1G.
# https://bugzilla.redhat.com/show_bug.cgi?id=1456261
%global _dwz_max_die_limit 250000000

We should be able to reduce RAM usage by lowering that value, at the cost of ballooning the package size. I kinda think that's a bad trade-off, because 5 GB is just huge, so I'd rather not. But it is an option.
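
For example (purely illustrative; the right value would need experimentation), dropping it back toward the distro default quoted further down in this ticket would look like:

# Smaller DIE limit: less RAM for dwz, but a much bigger debuginfo package.
%global _dwz_max_die_limit 50000000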

Ideally we would not mess with this value in the webkitgtk package at all, and would instead set whatever defaults we desire globally. I previously reported a bug asking for exactly that.

BTW rawhide is broken until this build completes due to https://bugzilla.redhat.com/show_bug.cgi?id=2116626. Sorry about that....

Well, I think the problem is that now that all those 3 things are in the same package, it builds them OK, but then at the end it runs find-debuginfo with N CPUs, and on ppc64le and s390x at least that's too much. ;(

I wonder if there's a way to use the normal number of CPUs for building, but then set it to 1 for the find-debuginfo.sh part?

The -16 build oom'd at 06:18 UTC when extracting debuginfo from

+ /usr/bin/find-debuginfo -j3 --strict-build-id -m -i --build-id-seed 2.37.1-16.fc37 --unique-debug-suffix -2.37.1-16.fc37.s390x --unique-debug-src-base webkitgtk-2.37.1-16.fc37.s390x --run-dwz --dwz-low-mem-die-limit 10000000 --dwz-max-die-limit 250000000 -S debugsourcefiles.list /builddir/build/BUILD/webkitgtk-2.37.1
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.0.so.18.21.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/bin/WebKitWebDriver
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libjavascriptcoregtk-4.1.so.0.2.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libjavascriptcoregtk-5.0.so.0.0.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libwebkit2gtk-4.0.so.37.57.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libwebkit2gtk-4.1.so.0.2.0
extracting debug info from /builddir/build/BUILDROOT/webkitgtk-2.37.1-16.fc37.s390x/usr/lib64/libwebkit2gtk-5.0.so.0.0.0

As you all have said, the -j3 is the problem.

The -j3 is not under packager control, though. It is not coming from the spec file and is therefore not something I can change. I don't know what calls find-debuginfo, but my guess would be RPM itself? It seems clear that it matches the number of vCPUs available, so reducing the vCPU count would solve this, of course at the cost of making everything much slower.

Well, I think the problem is that now that all those 3 things are in the same package, it builds them OK, but then at the end it runs find-debuginfo with N CPUs, and on ppc64le and s390x at least that's too much. ;(

I was going to say "for s390x, yes, but ppc64le was just using too much RAM during the build itself." But bad news: the ppc64le build has unexpectedly restarted too, even though I switched to %limit_build -m 4096 to request 4 GiB of RAM per vCPU instead of the 2 GiB that I had been using before. I was expecting that to be enough to fix ppc64le, but apparently not.

BTW, the current build started before branching, but will complete after branching. What happens now? Do I need to start another build for rawhide, or for F37?

BTW, the current build started before branching, but will complete after branching. What happens now? Do I need to start another build for rawhide, or for F37?

We could tag this into rawhide as well... or do a new build.

So, on ppc64le, I made a builder with 64 GB of memory and 16 CPUs and moved the current build to it. I would expect it to be able to complete it.

On s390x, sadly we don't have a bunch of spare resources. I guess I can take down some more builders and build up one with more memory and move to that to get this build in, but I think we should try and find a more sustainable path after this.

What about trying:

%{_smp_build_ncpus} = 1

at the end of the spec %build section?
(to set it to 1 for the debuginfo gathering)

Failing that, I see that the dwz limits can be set per arch:

-13: dwz_limit %{expand:%%{?%{1}%{arch}}%%{!?%{1}%{_arch}:%%%{1}}}
-13: _dwz_low_mem_die_limit 10000000
-13: _dwz_low_mem_die_limit_armv5tel 4000000
-13: _dwz_low_mem_die_limit_armv7hl 4000000
-13: _dwz_max_die_limit 50000000
-13: _dwz_max_die_limit_armv5tel 10000000
-13: _dwz_max_die_limit_armv7hl 10000000
-13: _dwz_max_die_limit_x86_64 110000000

Perhaps we could set a different limit for ppc64le and s390x that produces larger packages but lets the build finish?
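
For example, something along these lines (the macro names just follow the per-arch pattern shown above, and both the names and the values are only guesses that would need testing):

%global _dwz_max_die_limit_s390x   50000000
%global _dwz_max_die_limit_ppc64le 50000000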

Should we try to rope in someone from the tools team here for advice?

So, on ppc64le, I made a builder with 64 GB of memory and 16 CPUs and moved the current build to it. I would expect it to be able to complete it.

For sure, but that seems like a one-time fix.

On s390x, sadly we don't have a bunch of spare resources. I guess I can take down some more builders and build up one with more memory and move to that to get this build in, but I think we should try and find a more sustainable path after this.

Honestly I think that's the best approach: take down as many builders as necessary and let things go as slowly as required to be reliable. But remember to reduce vCPUs as well. The current setup might actually be OK if we reduce to 2 vCPUs instead of 3.

What about trying:

%{_smp_build_ncpus} = 1

at the end of the spec %build section?
(to set it to 1 for the debuginfo gathering)

Hm, I think we can use %define to temporarily change the value. That might work. (Just assigning to it won't work, since it's a macro, not a variable.) Of course, this is only good as a temporary/emergency measure to get the builds out. We shouldn't need hacks like this going forward since that will slow things down massively.
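
A minimal sketch of that, assuming the -j that find-debuginfo receives is derived from %_smp_build_ncpus (which we haven't actually confirmed here):

%build
%cmake_build %limit_build -m 4096
# Hack: drop the CPU count to 1 at the end of %build so that the later
# debuginfo extraction runs single-threaded. Untested.
%define _smp_build_ncpus 1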

Perhaps we could set a different limit for ppc64le and s390x that produces larger packages but lets the build finish?

Sure, although I would only be guessing as to which values to try... and also guessing about the names of the macros to define.

Should we try to rope in someone from the tools team here for advice?

Couldn't hurt.

I've taken down 2 s390x builders and upped the memory on builder 29 to 32 GB.

Thanks.

Meanwhile, I've started new builds so I have no doubt of which build goes to which branch: rawhide, and Fedora 37. I suspect the original build was headed only to rawhide and not to F37, but I wasn't certain. Let's see if the builds get scheduled on the builders that you hope they will....

They won't. :)

I'll move them. You will see one 'restart' as I do.

I assigned the f37 ones, as that's the one we want most...

I'm not going to be happy with a solution that requires manually moving builds to particular builders. Isn't that exactly what the heavy builder channel is intended for...?

I'm not happy with that either... just doing it now to get these builds in to fix things. :)

I'm also not happy with a heavybuilder channel if I can at all avoid it. It's very wasteful of resources and causes lots of problems.

My ideal solution is to figure out what exact setup we need to just build reliably and set that for all the builders.
If we can adjust the spec/debuginfo, great. If that takes more builder adjustments, so be it...

Kalev came up with a nice trick:

%global _find_debuginfo_opts %limit_build -m 8192

It's not perfect, since it comes after the first -j that we don't control, but the one that comes last should win, so that's fine. I might need to bump that limit higher, though, since 8 GB will probably not actually be enough. I will probably try 12 GB to be safe.
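
To illustrate (on an 18 GB / 3 vCPU s390x builder, where '%limit_build -m 8192' should expand to -j2; the exact position where the extra options get injected is a guess), the find-debuginfo invocation quoted earlier would end up with two -j flags, roughly:

/usr/bin/find-debuginfo -j3 ... -j2 ... -S debugsourcefiles.list /builddir/build/BUILD/webkitgtk-2.37.1

with the later -j2 being the one that takes effect.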

I'm jumping in to help with this as well. I kicked off 4 more builds (so it's 6 total now) so that we hopefully have something finished by tomorrow (we really need these now because of the broken upgrade path issues and the upcoming GNOME test week).

What we have running is (all builds are doubled because of f37 and rawhide):
- Builds 1-2: mcatanzaro's builds that nirik moved to the beefiest s390x and ppc64le builders
- Builds 3-4: these have disabled debuginfo on s390x and ppc64le, https://src.fedoraproject.org/rpms/webkitgtk/c/0ece072912ef34ea7778977af0301b3211e81c37?branch=rawhide
- Builds 5-6: these re-enable debuginfo on s390x and ppc64le and instead attempt to limit debuginfo processing parallelization, https://src.fedoraproject.org/rpms/webkitgtk/c/8ea0b4518be2a64b8223c406792a8c9f25c5ea5f?branch=rawhide

Hopefully the last one works because this avoids the need for extra beefy builders and allows us to keep debuginfo. We'll see tomorrow.

I kicked off all 4 new builds with --skip-tag (so that they don't get submitted to bodhi in the wrong NVR order) and moved them to the builders I wanted them to be on. I'll check back on these tomorrow morning my time, tag the best ones that finished with -candidate tags, and move them to beefier builders if necessary.

Hopefully we can get some builds over the finish line with this :)

Looks like we have a winner with '%global _find_debuginfo_opts %limit_build -m 8192' and the builds are starting to get over the finish line.

I ran into a new mode of failure: buildvm-s390x-14.s390.fedoraproject.org (one of the smaller s390x builders; I wanted to see if it could build webkitgtk with the latest changes) ran out of disk space.

lto-wrapper: fatal error: write: No space left on device
https://koji.fedoraproject.org/koji/taskinfo?taskID=90686756

I am checking out disk space on the builder. If something was using up disk space it is gone now.

/dev/md127     107876900 12779828  95097072  12% /

OK, so in conclusion, what ended up fixing the builds was adding:

# Require 8 GB of RAM per vCPU for debuginfo processing
%global _find_debuginfo_opts %limit_build -m 8192

%_find_debuginfo_opts gets passed to /usr/bin/find-debuginfo and we can use it to control parallelism. '%limit_build -m 8192' means it allows one job per 8 GB of RAM, so on s390x builders with 18 GB of RAM it ends up passing -j2, and on 8 GB builders it ends up passing -j1.
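
In shell terms, the effect is roughly this (an approximation of the behaviour described above, not the actual %limit_build implementation):

mem_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
ncpus=$(nproc)
jobs=$(( mem_mb / 8192 ))                 # one find-debuginfo job per 8 GB of RAM
[ "$jobs" -lt 1 ] && jobs=1               # always allow at least one job
[ "$jobs" -gt "$ncpus" ] && jobs=$ncpus   # never more jobs than CPUs
echo "-j$jobs"                            # 18 GB / 3 CPUs -> -j2, 8 GB -> -j1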

Note that I haven't actually verified that it builds on 8 GB builders because the last attempt ran out of disk space, but it hopefully should.

In addition to this (Michael's fix from earlier) cmake is called with

%cmake_build %limit_build -m 4096

... which makes it pass -j3 on s390x builders with 18 GB of RAM, which works out fine as well (especially since they actually have 3 CPUs).

So hopefully this all should be good enough that webkitgtk can now be built on any builder :)

There's still the question of how much time it takes and how to schedule webkitgtk builds so that they hit the faster builders to avoid builds taking forever, but I think this is something for another time :) We should be good now and are able to get builds done again.

I've already talked to nirik on IRC and he's going to take down the extra beefy s390x builder and the extra beefy ppc64le builder as they shouldn't be necessary any more. The regular builders should work just fine now (although the weaker ones may take close to two days to actually build this).

As for the s390x kvm builder configuration, I'd leave it as it is now, so that they have 3 CPUs and 18 GB of RAM. Reducing the number of CPUs would just make the builds take longer, and I think the current number of s390x builders (which is lower than before nirik gave them 3 CPUs and 18 GB of RAM) is enough for day-to-day builds. It may end up being a bottleneck for mass rebuilds, but those are rare events and I think it's better to optimize for day-to-day operations. I'm fairly sure that having fewer but faster builders makes it more fun for packagers to work on stuff, since they can complete builds faster.

Thanks for all the work and patience on this.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
