#119 enabling earlyoom by default
Opened 2 months ago by chrismurphy. Modified a day ago


@catanzaro @hadess @hakavlad @xvitaly
Change proposal is still set to incomplete, feel free to edit as appropriate, and add your names as owners if you wish.

Right now it's constrained to Workstation. I'm not sure how desktop spins inherit such changes, if they have to opt-in or opt-out.

Earlyoom service is enabled by default, which will cause kernel oom-killer to trigger sooner

earlyoom does not trigger the kernel oom-killer (as LMM or psi-monitor do). It just sends a signal to the victim.

KILL_PERCENT setting, default 2%.

No. 5% for SIGKILL:

  -m PERCENT[,KILL_PERCENT] set available memory minimum to PERCENT of total
                            (default 10 %).
                            earlyoom sends SIGTERM once below PERCENT, then
                            SIGKILL once below KILL_PERCENT (default PERCENT/2).
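The defaulting rule in that help text can be sketched as follows. This is only an illustration of the PERCENT[,KILL_PERCENT] syntax quoted above; the function name parse_mem_option is hypothetical and is not part of earlyoom.

```python
def parse_mem_option(value="10"):
    """Parse a 'PERCENT[,KILL_PERCENT]' string as described in the help text.

    Returns (term_percent, kill_percent). When KILL_PERCENT is omitted it
    defaults to PERCENT / 2, so the default '-m 10' means SIGKILL at 5 %.
    """
    parts = value.split(",")
    if len(parts) > 2:
        raise ValueError("expected PERCENT[,KILL_PERCENT]")
    term = float(parts[0])
    kill = float(parts[1]) if len(parts) == 2 else term / 2
    return term, kill


print(parse_mem_option())        # (10.0, 5.0)
print(parse_mem_option("4,2"))   # (4.0, 2.0)
```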

How To Test

The simplest way is to run tail /dev/zero. Another way is described here: https://lkml.org/lkml/2019/8/4/15.

Simplest way is running tail /dev/zero.

Ugh, that's brilliant. Never thought to try that!

Updated. I'm leaving it as a system-wide change in case it will apply to other spins (do they inherit 80-workstation.preset?). But I've also stated that it applies to product Workstation. Lemme know how to clear that up, and I'll set it to ready for wrangler. Ben told me he'd put it through as a system-wide change despite being 2 days late.

I need to add earlyoom package to the Workstation package set, what thingy do I need to submit a PR for to add it?

Also, if we want to apply the change to upgrades, I need to submit PR to earlyoom package, e.g.

%triggerpostun -- fedora-release < 32
systemctl preset earlyoom.service

I'll wait until FESCo approves the change, but probably should include this information in the change proposal so they understand the scope.

It's systemwide because the change affects the entire system (applies to more than just a particular application or service), not because it only affects a particular edition or spin.

I think Workstation presets will not be inherited by other editions.

I need to add earlyoom package to the Workstation package set, what thingy do I need to submit a PR for to add it?

fedora-comps at https://pagure.io/fedora-comps/tree/master.

Would be wise to aim for WG approval of this at next week's meeting, so we have that to inform FESCo before they vote on the change.

That'll happen in the devel@ announcement, which is part of the change process.

My fears are not about possible bugs in earlyoom itself (it is fairly well tested and widely known, and the last release was more than six months ago), but about the changed behavior: 1. The cgroup2 kernel setting memory.oom.group, used to kill the whole group, will not work: one process will be killed instead of the group. 2. The KillMode= and OOMPolicy= settings applied in units will probably not work.

We may also receive complaints from users whose applications are terminated before any obvious signs of OOM appear, especially from those who do not use swap space. To users who do not monitor memory levels and have not heard that earlyoom is enabled by default, this may look like applications being terminated for no reason while everything seems fine.

I believe that it is necessary to more thoroughly convey to users all the possible consequences of the changes and their causes.

@hakavlad @xvitaly

  1. Lack of cgroupv2 group kill support is suboptimal but not a regression. The kernel oom-killer behavior kills only one process, not a group.

  2. My reading of KillMode= and OOMPolicy= suggests no conflict. But I want to be clear on why you think there might be a conflict, because I definitely don't want to step on other work and expectations in this area.
    https://www.freedesktop.org/software/systemd/man/systemd.kill.html
    https://www.freedesktop.org/software/systemd/man/systemd.service.html

  3. In the case of no swap, I agree this becomes difficult to use a single percentage. 1% is too low for e.g. 8GiB RAM system, and is too high for e.g. 64GiB RAM system. This might be the most significant liability of the feature.

Even though no swap isn't our default configuration, and the installer warns users when swap isn't configured, I think it's sufficiently common that users with lots of RAM forgo swap. I'm not convinced that merely documenting that users will experience OOM killing of processes at 5% of 32-64GiB, i.e. 1.6-3.2GiB free RAM, is acceptable. Opinions?

Workaround A: Is it possible, given the time frame, to enhance earlyoom to use /proc/pressure/memory instead of a RAM percentage? This facility is available on Fedora kernels and it's cheap to read.

Workaround B: We need a way to disable earlyoom on systems without swap. Either as a systemd unit condition to prevent service from starting, or an earlyoom test for it that's enabled in /etc/default/earlyoom

Any other ideas?
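For reference on Workaround A, here is a minimal sketch of what reading /proc/pressure/memory involves. It only parses the kernel's PSI text format ("some" and "full" lines with avg10/avg60/avg300/total fields); the sample values are made up, and the function name parse_pressure is hypothetical. It makes no claim about how earlyoom would or should use the numbers.

```python
def parse_pressure(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()          # kind is "some" or "full"
        result[kind] = {key: float(val)
                        for key, val in (pair.split("=") for pair in pairs)}
    return result


# Made-up sample in the kernel's PSI format; on a real system, read
# the text from /proc/pressure/memory instead.
SAMPLE_PSI = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.50 avg60=0.25 avg300=0.05 total=65432\n"
)

psi = parse_pressure(SAMPLE_PSI)
print(psi["full"]["avg10"])   # share of the last 10 s in which all tasks stalled on memory
```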

In the case of no swap, I agree this becomes difficult to use a single percentage. 1% is too low for e.g. 8GiB RAM system, and is too high for e.g. 64GiB RAM system. This might be the most significant liability of the feature.

Can you explain this further?

Our first priority should be to make this work well with our default configuration, which is currently swap partition equal in size to RAM. I've just reported #120 because it seems like an awful default.

earlyoom's low watermark for triggering terminate or kill is based on a percentage of RAM and swap. If there is both RAM and swap, we're for sure in a bad UX responsiveness state once swap is closing in on 5% free because for sure RAM is already at minimum. And therefore earlyoom default behavior is fine.

However, if there is no swap, earlyoom has only a single low watermark, free RAM. Once that gets to 5% on a 64GiB system, i.e. 3.2GiB free, it's going to kill something. That's the problem. 3.2GiB free is plenty of RAM to keep chugging away, so in the no swap case I think it needs a different metric for what the low watermark is.
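To make the two cases concrete, here is a rough sketch of the decision as described above. It assumes the rule is "act only when both available RAM and free swap are below the threshold", with a missing swap device counting as 0% free; the function names and the sample /proc/meminfo excerpt are hypothetical, and this is not earlyoom's actual code.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as {name: value in kB}."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def free_percentages(fields):
    """Percent of RAM available and percent of swap free."""
    mem_pct = 100.0 * fields["MemAvailable"] / fields["MemTotal"]
    swap_total = fields.get("SwapTotal", 0)
    # Assumption: with no swap device, swap is treated as 0 % free,
    # which leaves RAM as the only watermark.
    swap_pct = 100.0 * fields["SwapFree"] / swap_total if swap_total else 0.0
    return mem_pct, swap_pct


def action(mem_pct, swap_pct, term=10.0, kill=5.0):
    """Signal to send under the assumed 'both below threshold' rule."""
    if mem_pct <= kill and swap_pct <= kill:
        return "SIGKILL"
    if mem_pct <= term and swap_pct <= term:
        return "SIGTERM"
    return None


# Hypothetical 64 GiB machine without swap and 3 GiB still available.
SAMPLE_MEMINFO = """MemTotal:       67108864 kB
MemAvailable:    3145728 kB
SwapTotal:             0 kB
SwapFree:              0 kB
"""

mem_pct, swap_pct = free_percentages(parse_meminfo(SAMPLE_MEMINFO))
print(round(mem_pct, 2), action(mem_pct, swap_pct))
```

With the same 3 GiB available but a mostly free swap device, the assumed rule returns no action, which is the asymmetry described above.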

I'm not worried about 5%, especially not if we have swap enabled by default. Could always lower the default though... is 2% enough to keep up with the tail /dev/zero test? If 5% is what it needs to respond quickly enough to rapidly rising memory usage, then that's what it needs and we should use that. But if a lower value works in practice, then that's great too.

Some more comments: https://www.reddit.com/r/Fedora/comments/ejbm75/earlyoom_was_proposed_to_be_enabled_by_default_in/

Why earlyoom uses 10% thresholds: https://github.com/rfjakob/earlyoom/issues/6 : 10% is good value to save page cache.

IMHO 4% for SIGTERM and 2% for SIGKILL are not bad thresholds for MemAvailable. This should be enough to prevent freezes at OOM. But 10% allows you to maintain better responsiveness after corrective action, especially without swap space.

By the way, using the earlyoom with the default settings 10% without swap space is one of the best ways to maintain a very responsive desktop:

  1. The monitoring intensity is very high, 1 - 10 Hz. This makes it easy to handle even fast memory allocations, such as tail /dev/zero, and makes this processing not affect responsiveness.
  2. You will not encounter slow swapping because there is simply no swap space.
  3. There is always a 5% to 10% disk cache margin.
  4. You need not be afraid of running ninja -j64 (high memory pressure will not be a problem; the only problems will be IO and maybe CPU pressure).

Out-of-memory handlers based only on PSI handle OOM without swap much worse, see https://github.com/facebookincubator/oomd/issues/80.

My reading of KillMode= and OOMPolicy= suggests no conflict.

Okay. The main incompatibility with the kernel killer is the impossibility of group killings (OOMPolicy also provides this: "If set to kill and one of the service's processes is killed by the OOM killer the kernel is instructed to kill all remaining processes of the service, too.")

I think I understand why I'm confused. earlyoom isn't killing tail /dev/zero; the kernel oom-killer is doing it every time, so that's not a good test case for the proposal write-up.

Every time tail is killed by oom-killer, GNOME Terminal is also killed, and I see this:
Jan 04 01:45:42 fmac.local systemd[978]: gnome-terminal-server.service: A process of this unit has been killed by the OOM killer.

earlyoom uses oom_score to determine what to terminate and kill; but doesn't actually use the kernel oom-killer to do the killing. Whereas systemd OOMPolicy depends on oom-killer.
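As a rough illustration of "same metric, different mechanism": the kernel exposes each process's badness as /proc/PID/oom_score, and a userspace daemon can pick the highest one and signal it itself. The sketch below only covers the selection step; the function names are hypothetical, and earlyoom's real selection logic may differ in details.

```python
import os


def read_oom_scores(proc="/proc"):
    """Return {pid: oom_score} for every process that can be read."""
    scores = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "oom_score")) as f:
                scores[int(entry)] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # the process exited or is not readable
    return scores


def pick_victim(scores):
    """Return the pid with the highest oom_score, or None if there is none."""
    if not scores:
        return None
    return max(scores, key=scores.get)


# Made-up scores; on a live system use pick_victim(read_oom_scores()).
print(pick_victim({978: 120, 42912: 800, 1: 0}))   # 42912
```

The daemon would then send SIGTERM or SIGKILL to that pid directly (e.g. with os.kill), so the kernel oom-killer is never involved.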

So yeah, enabling earlyoom by default means OOMPolicy is, in effect, preempted. Most of the time it should be earlyoom doing the killing.

GNOME Terminal is also killed

Jan 04 01:45:42 fmac.local systemd[978]: gnome-terminal-server.service: A process of this unit has been killed by the OOM killer.

Are you sure that the gnome terminal was also killed? This line only indicates that the process was killed in gnome-terminal-server.service.

tail /dev/zero is not meant to show how the system freezes. It is meant to demonstrate how earlyoom can handle fast memory allocations. If earlyoom is turned off, then starting tail /dev/zero always leads to OOMK triggering.

I ran tail /dev/zero from GNOME Terminal and the whole Terminal app definitely quits, I see it happen at the same time. I don't know why. I don't see any messages in the journal that help me understand why Terminal goes away rather than only tail being killed.

So far in fc31 I'm not seeing earlyoom SIGTERM/SIGKILL tail. It's always the kernel oom-killer that does it, whether earlyoom is running or not.

Jan 04 01:45:43 fmac.local kernel: Out of memory: Killed process 42912 (tail) total-vm:6190904kB, anon-rss:5975628kB, file-rss:1016kB, shmem-rss:0kB, UID:1000 pgtables:12042240kB oom_score_adj:0
Jan 04 01:45:43 fmac.local kernel: oom_reaper: reaped process 42912 (tail), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I honestly prefer the webkitgtk example better, I just wish it were easier to setup and got things into trouble sooner.

As for the percentage, I'm not finding a sweet spot. 4% means SIGTERM happens at ~2.6G free on a 64G RAM computer; while SIGKILL happens at ~82M on a 4G RAM computer. The former is high, the latter is low. I never see less than ~130M free in the totally hung state.
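The arithmetic behind those two figures, as a small sketch. It assumes "G" means GiB and uses the 4% SIGTERM and 2% SIGKILL thresholds discussed above; the helper name threshold_mib is hypothetical.

```python
def threshold_mib(total_gib, percent):
    """Absolute size in MiB of `percent` of `total_gib` GiB of RAM."""
    return total_gib * 1024 * percent / 100


for total_gib in (4, 8, 16, 32, 64):
    term = threshold_mib(total_gib, 4)   # SIGTERM threshold at 4 %
    kill = threshold_mib(total_gib, 2)   # SIGKILL threshold at 2 %
    print(f"{total_gib:>3} GiB RAM: SIGTERM below {term:7.0f} MiB, "
          f"SIGKILL below {kill:6.0f} MiB")
```

A single percentage spans a 16x range in absolute terms between the 4 GiB and 64 GiB machines, which is why no one value fits both ends.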

“Lack of cgroupv2 group kill support is suboptimal but not a regression. The kernel oom-killer behavior kills only one process, not a group.”

It does kill groups as a single unit with memory.oom.group. And it would be unfortunate to lose that because I think careful usage of cgroups in userspace could actually be the better way to keep the system responsive. For example, putting processes started by gnome-terminal in a cgroup or modifying Flatpak and Toolbox to use cgroup v2 with memory limits would guarantee responsiveness, indicate to the OOM killer (whether kernel or userspace) which collection of processes is the culprit, even if a compiler has been forking itself, a web browser is running multiple processes, etc.

It solves the “unprivileged process can hang the system” issue nicely as well. And, very importantly (to me), it wouldn’t require any new daemon running as root. Furthermore, it still allows the user to run an expensive task without the memory limits when they know the task is not going to grow uncontrollably. So the user can still benefit from using all their memory. It’d be unfortunate to lose that, which would be the case with earlyoom.
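For anyone who wants to check which cgroup a process lives in and whether group kill is set for it, here is a minimal sketch. It assumes cgroup v2 mounted at /sys/fs/cgroup; the function names are hypothetical and the sample line is illustrative only.

```python
import os


def cgroup_v2_path(proc_cgroup_text):
    """Extract the cgroup v2 path from the contents of /proc/PID/cgroup.

    The unified hierarchy is the entry of the form '0::/some/path'.
    """
    for line in proc_cgroup_text.splitlines():
        if line.count(":") < 2:
            continue
        hierarchy, controllers, path = line.split(":", 2)
        if hierarchy == "0" and controllers == "":
            return path
    return None


def oom_group_file(cgroup_path, mount="/sys/fs/cgroup"):
    """Path of the memory.oom.group knob for a cgroup (assumed mount point)."""
    return os.path.join(mount, cgroup_path.lstrip("/"), "memory.oom.group")


# Illustrative sample; on a live system read /proc/self/cgroup instead.
sample = "0::/user.slice/user-1000.slice/session-1.scope\n"
path = cgroup_v2_path(sample)
print(oom_group_file(path))
```

Reading that file yields 1 when the kernel should treat the cgroup as a single unit on OOM, and 0 otherwise.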

Another note: I don’t think notifying applications and hoping they can drop a few thumbnails or caches is going to help enough to rescue any system. It could work as a way to run more applications on a machine with 4 GB of RAM, at best. On a machine with more memory, the only situation I can imagine high memory pressure is when running a buggy application or an expensive task, and in those cases it is very likely any memory freed by notifying applications will be used instantly.

I'm not seeing earlyoom SIGTERM/SIGKILL tail. It's always the kernel oom-killer that does it, whether earlyoom is running or not.

Seems like it's time to debug. Please run earlyoom with -d -r 0.1 flags and show the output.

The output in my case with tail /dev/zero and -d -r 0.1 without swap space https://pastebin.com/YC9wU94t and with swap space https://pastebin.com/T6UyMc9x : the kernel OOMK was never invoked.

I suggest adding to the Benefit to Fedora sending SIGTERM as the first stage: this helps some applications to be terminated more correctly.

Release Notes
Earlyoom service is enabled by default, which will cause kernel oom-killer to trigger sooner.

Needs fixing: earlyoom sends signals, but does not trigger the kernel OOMK.

I suggest adding to the Benefit to Fedora sending SIGTERM as the first stage: this helps some applications to be terminated more correctly.

Good idea, done.

Release Notes
Earlyoom service is enabled by default, which will cause kernel oom-killer to trigger sooner.

Needs fixing: earlyoom sends signals, but does not trigger the kernel OOMK.

Done. Yes this was an early misunderstanding on my part, confusing oom_score and oom-killer. Earlyoom uses the same metric, but a different mechanism. Very curiously, I regularly see oom-killer clobber small privileged processes like sshd, and I've never seen that behavior from earlyoom and nohang.

@hakavlad re: oom-killer always killing tail, not earlyoom. So what I'm seeing is that it's killed before swap is even filled. The kernel oom-killer trace says "Free swap = 6914492kB" out of 8G, so that'd explain why earlyoom is not triggering. But I can't tell why oom-killer is triggering so fast in this particular case...

https://pastebin.com/cX9mj9bh

I like the meminfo and task state that kernel oom-killer spits out. sysrq+m includes the former, not the latter. And I'm not seeing another trigger that shows the task list with oom_score, maybe/probably f does but I don't want to kill something else in addition to earlyoom. Of course sysrq+t is just way excessive, it'll fill up the kernel buffer with the default size we're using, and it doesn't contain oom_score.

@hakavlad @catanzaro or anyone really - is there a way for a systemd service unit to condition check for the existence of a swap device? What if we only start earlyoom if there's a swap device?

Heavy simultaneous page-in/page-out is what consistently appears to be responsible for the loss of system responsiveness. A system without a swap device is unlikely to experience this problem. And then we don't have to answer the question about 10% being too high for those users.
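Whatever mechanism ends up gating the service, the check itself is simple: active swap devices are listed in /proc/swaps after a header line. A minimal sketch follows; the function name active_swap_devices and the sample contents are hypothetical, and this does not answer whether systemd offers a built-in condition for it.

```python
def active_swap_devices(proc_swaps_text):
    """Return the device/file names listed in the contents of /proc/swaps."""
    lines = proc_swaps_text.splitlines()[1:]   # first line is the column header
    return [line.split()[0] for line in lines if line.strip()]


# Illustrative sample; on a live system read /proc/swaps instead.
SAMPLE_SWAPS = (
    "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n"
    "/dev/dm-1                               partition\t8388604\t\t0\t\t-2\n"
)

print(active_swap_devices(SAMPLE_SWAPS))   # ['/dev/dm-1']
```

An empty list would mean no swap device is active, which is the case in which the service would be skipped under this idea.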

And then we don't have to answer the question about 10% being too high for those users.

I think we can ask the author of earlyoom to change the default values. Disabling earlyoom by default is a bad idea, because earlyoom greatly improves the user experience, especially without a swap. Without earlyoom Artem S. Tashkinov's case https://lkml.org/lkml/2019/8/4/15 won't be fixed.

In general, without a swap, earlyoom is one of the best solutions along with nohang. MemAvailable is the only metric that needs to be monitored. The use of PSI does not give a good effect: psi-monitor and oomd won't help you without swap: https://github.com/facebookincubator/oomd/issues/80. -- PSI is inefficient without swap space. The only problem is the default value.

oom-killer always killing tail

I observed this with vm.swappiness=0. BTW, vm.swappiness=0 doesn't forbid swapping. I have also seen forum topics where users complained that the OOM killer comes before all swap space has been exhausted.

Curiously, Michal Hocko suggests vm.swappiness=100 to encourage more swapping, but is skeptical it'll make a difference. I found vm.swappiness=0 has no effect in the webkitgtk use case.
https://marc.info/?l=linux-mm&m=157865446317791&w=2

Anyway, it sounds like there's pretty strong interest now in integrating oomd into systemd as a medium-term solution, and no doubt we'll be able to adjust its OOM heuristics to meet our responsiveness requirements. So do we really want to continue pursuing earlyoom in light of this? Maybe it's time to refocus on the zram issue as low-hanging fruit short term, and defer earlyoom to a future release cycle? If the systemd-level effort progresses as envisioned, it sounds like that'd be clearly preferable.

Concerns about killing entire cgroups no longer seem problematic for Workstation, since we are already launching each application under its own cgroup since 3.34.2. Web browsers have the ability to do this for their web content processes; I was able to quickly test by changing WebKit to run its web process under 'systemd-run --scope' and that worked fine. I didn't submit a patch upstream because that solution wasn't hierarchical; the new scope is created under systemd's toplevel scope, and it's unclear how to easily change. But as a proof of concept, I think that's sufficient to show browsers will be able to handle this should they choose to.

Then there's also concern about using PSI metrics. Well, if that doesn't work out, we can ask for a basic % RAM use metric instead. It seems clear that the systemd developers will try to accommodate our needs here and come up with a solution that actually works.

RE: ZRAM
I think zram-generator needs some work. I've filed a few bugs.
https://github.com/systemd/zram-generator/issues

But aren't we past the system-wide change period to get this into Fedora 32?

RE: earlyoom
It does help. It doesn't help enough in all cases. My biggest concern is if users expect a much better responsiveness experience, and don't actually get it.

RE: OOM managers generally
I'm not convinced any oom manager can estimate what the user wants. IF the user walks away, chances are they want the job to finish, responsiveness is a non-factor. IF the user comes back, wants to make another change and restart the job, they will want immediate responsiveness.

In my example to upstream, it was making slow progress, only 1/2 of swap is used, but GUI is frozen. What's the right thing to do? I think the existing oom managers do nothing in this case, and therefore no worse off with earlyoom.
https://marc.info/?l=linux-mm&m=157865446317791&w=2

IF the user walks away, chances are they want the job to finish

@chrismurphy This is exactly what I mean when I do not enable PSI by default. earlyoom guarantees that there will be no corrective action until there is really low memory left, unlike killers using PSI. Users do not always need a responsive desktop. Sometimes they are simply forced to actively use the swap and come to terms with the low responsiveness of the desktop when they do a lot of work. In many cases, compilation could complete successfully and work could be done and responsiveness would be restored if the PSI-based killer does not kill the task.

I found vm.swappiness=0 has no effect

In my tests, tasks involving active swapping with swappiness=0 take longer and go under higher mem/io PSI pressure.

defer earlyoom to a future release cycle?

In this case, users simply will not receive a solution to some of the problems, for example, A. S. Tashkinov's case will still not be fixed.

Maybe it's time to refocus on the zram issue as low-hanging fruit short

You can enable zram and earlyoom at the same time.

there's pretty strong interest now in integrating oomd into systemd as a medium-term solution

I am sure that systemd-oomd will not be ready for Fedora 33. Even oomd still has many problems.

only 1/2 of swap is used, but GUI is frozen. What's the right thing to do?

We cannot have a definite answer now. Some users would rather wait until the compilation is complete. Other users would prefer to achieve high responsiveness at all costs. Some users criticize earlyoom for taking early corrective actions. But in the case of an attempt to maintain the responsiveness of the desktop, based on the PSI, the corrective action will occur even earlier.

we are already launching each application under its own cgroup since 3.34.2

@catanzaro Other desktop environments, other browsers, other applications may not do this - they still start all processes in the session-1.scope. Thus, an oomd that kills entire control groups is much less universal. BTW, oomd sends SIGKILL instead of SIGTERM by default. By the way, oomd also ignores memory.oom.group - oomd just always kills cgroup.

But aren't we past the system-wide change period to get this into Fedora 32?

Yes. That's OK. We can play a longer game here. We're trying to solve a decades-old problem; users aren't expecting fixes to materialize overnight. Let's target F33 for zram, and try to land it in rawhide as quickly as possible after the branch point.

RE: earlyoom
It does help. It doesn't help enough in all cases. My biggest concern is if users expect a much better responsiveness experience, and don't actually get it.

I think it's just not compatible with a large disk-based swap. We don't need it to be perfect in all cases. Let's worry about making it work well in our default case, which is going to be zram.

I'm not convinced any oom manager can estimate what the user wants. IF the user walks away, chances are they want the job to finish, responsiveness is a non-factor. IF the user comes back, wants to make another change and restart the job, they will want immediate responsiveness.

Let's just always optimize for responsiveness, not throughput. Desktop workstations should always be responsive, period.

In my example to upstream, it was making slow progress, only 1/2 of swap is used, but GUI is frozen. What's the right thing to do? I think the existing oom managers do nothing in this case, and therefore no worse off with earlyoom.
https://marc.info/?l=linux-mm&m=157865446317791&w=2

I have no doubt it's best to kill before GUI is frozen. But again, let's simplify the problem with zram before worrying that earlyoom doesn't work well in combination with large disk-based swaps.

I am sure that systemd-oomd will not be ready for Fedora 33. Even oomd still has many problems.

Yeah that seems certain, it's certainly F34-F36 timeframe. It just seems unfortunate to install earlyoom now and then change our minds in favor of (hypothetical) systemd-oomd next year. We can't easily remove packages on upgrade, so earlyoom will stick around forever on users' systems.

only 1/2 of swap is used, but GUI is frozen. What's the right thing to do?

We cannot have a definite answer now. Some users would rather wait until the compilation is complete. Other users would prefer to achieve high responsiveness at all costs. Some users criticize earlyoom for taking early corrective actions. But in the case of an attempt to maintain the responsiveness of the desktop, based on the PSI, the corrective action will occur even earlier.

We can pick a default that's right for most users, though: prioritize responsiveness. A process that's so out of control as to cause the desktop to become laggy either needs to ease up on the system resource usage or be killed. We can have a configuration knob somewhere (e.g. the earlyoom config file) for technically-minded users who don't like this to change the behavior.

we are already launching each application under its own cgroup since 3.34.2

Other desktop environments, other browsers, other applications may not do this - they still start all processes in the session-1.scope.

That's OK: we're building Fedora Workstation, and we know Fedora Workstation means GNOME, and that means applications are now running in isolated scopes.

Thus, an oomd that kills entire control groups is much less universal. BTW, oomd sends SIGKILL instead of SIGTERM by default. By the way, oomd also ignores memory.oom.group - oomd just always kills cgroup.

I understand all these are limitations to be fixed before we can turn on our hypothetical future systemd-oomd. Your expertise is much appreciated, by the way; you've saved us from making several mistakes so far....

running in isolated scopes

@catanzaro It seems that the processes running in all windows of the gnome-terminal work in one gnome-terminal-server.service. If we enable the killing of entire control groups, this will mean that if we run dnf update in one window and tail /dev/zero in another, then in the end all these processes will be killed at the same time.

In fact, supporting memory.oom.group is not a big problem. Any killer can support it if desired. For example, I am going to add support for memory.oom.group to nohang in the future: https://github.com/hakavlad/nohang/issues/66. This could also be done in earlyoom (I don't think the author of earlyoom will do it himself, but he might merge a PR if someone writes one).

I agree that gnome-terminal will need to create a new cgroup when running processes.

IMHO, I don't think that installing earlyoom is necessary. It's a work-around for the majority of applications and binaries on our system not setting their oom policies properly.

We're slowly moving to systemd for session management, in addition to using it for the system, and we should, instead of working around the kernel OOM killer, be modifying settings expectations in user-space.

Between setting OOM policies for important user-space components, getting Flatpak to apply this setting to applications it launches (and those that become background applications), and voluntary memory savings through listening to low memory signals, I don't think we'd need earlyoom, which will paper over the problems rather than solving them.

(As an aside, we had problems during the F31 beta where the kernel OOM killer could bring down the whole session, which was caused by bugs in the systemd user setup, and have now been fixed. I'd be worried about earlyoom hiding those problems)

Working Group discussed this today at the meeting. The vast majority agree that system responsiveness is a relevant working group matter, and that the working group should make a decision to support or withdraw the feature proposal; to continue the discussion and get subject experts as guests for the next meetings; and to produce a plan indicating a deliberate process for the next few releases, one that is open to adjustments as we learn more about this issue. Based on present available information, a majority of voting working group members support earlyoom by default for Fedora 32.

@hakavlad Can you contact me off list? I can't find an email address for you. chris@colorremedies.com - thanks.

The earlyoom change proposal includes: earlyoom.service will be enabled on upgrade.

Workstation PRD says: Upgrading ... should give a result that is the same as an original install of Fedora Workstation.

Should we consider whether to exclude upgraded systems from this change? The PRD is non-binding, and also says "should" not "must". I think a particularly persuasive argument is needed to overcome "should".

The earlyoom change proposal includes: earlyoom.service will be enabled on upgrade.

Workstation PRD says: Upgrading ... should give a result that is the same as an original install of Fedora Workstation.

These seem consistent with each other.

Should we consider whether to exclude upgraded systems from this change? The PRD is non-binding, and also says "should" not "must". I think a particularly persuasive argument is needed to overcome "should".

I'd want an upgrade to F32 to give the same result as a new install in this area. And if we later remove earlyoom from enablement, an upgrade to F3X to again give the same result as a new install.

Agreed: proceed with enabling earlyoom for F32.

Action items: Chris to change Fedora presets and update comps. Ensure earlyoom gets installed during distro upgrades with Supplements: fedora-release-workstation.

Metadata Update from @catanzaro:
- Issue untagged with: meeting

13 days ago

Agreed: proceed with enabling earlyoom for F32.

Are the meeting minutes available somewhere?

Are the meeting minutes available somewhere?

They're supposed to be posted to the mailing list later today. I've shared them with you privately in the meantime.

earlyoom.service: drop root privileges

Run as a random unprivileged user instead of as root,
but add the capabilities CAP_KILL CAP_IPC_LOCK.

https://github.com/rfjakob/earlyoom/commit/f2b45e6a18a0624032d289318569ad57c24fd419

I'm glad to see these updates upstream. How soon will Fedora see an update? I don't see it in the mass rebuild failures list, and yet there's no fc32 build in koji. The latest package is from July.
