#119 enabling earlyoom by default
Opened 3 months ago by chrismurphy. Modified 6 days ago


@catanzaro @hadess @hakavlad @xvitaly
Change proposal is still set to incomplete, feel free to edit as appropriate, and add your names as owners if you wish.

Right now it's constrained to Workstation. I'm not sure how desktop spins inherit such changes, if they have to opt-in or opt-out.

Earlyoom service is enabled by default, which will cause kernel oom-killer to trigger sooner

earlyoom does not trigger the kernel OOM killer (as LMM or psi-monitor do). It just sends a signal to the victim.

KILL_PERCENT setting, default 2%.

No. 5% for SIGKILL:

  -m PERCENT[,KILL_PERCENT] set available memory minimum to PERCENT of total
                            (default 10 %).
                            earlyoom sends SIGTERM once below PERCENT, then
                            SIGKILL once below KILL_PERCENT (default PERCENT/2).
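
To make the defaults in the help text above concrete, here is a minimal sketch of how the two thresholds relate. The function names are hypothetical and this is not earlyoom's actual code; it only mirrors the rule stated in the help text (KILL_PERCENT defaults to PERCENT/2) and ignores swap entirely.

```python
def parse_m_option(arg=None):
    """Return (term_percent, kill_percent) for a '-m PERCENT[,KILL_PERCENT]' value.

    Hypothetical helper: mirrors the documented defaults only.
    """
    if arg is None:
        term = 10.0  # documented default for PERCENT
        return term, term / 2
    parts = arg.split(",")
    term = float(parts[0])
    # KILL_PERCENT defaults to PERCENT/2 when it is not given
    kill = float(parts[1]) if len(parts) > 1 else term / 2
    return term, kill


def action(avail_percent, term, kill):
    """Return the signal name that would be sent at this availability level."""
    if avail_percent < kill:
        return "SIGKILL"
    if avail_percent < term:
        return "SIGTERM"
    return None
```

With no argument this gives 10% for SIGTERM and 5% for SIGKILL, which is where the "5% for SIGKILL" correction comes from.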

How To Test

Simplest way is running tail /dev/zero. Another way is described at https://lkml.org/lkml/2019/8/4/15.

Simplest way is running tail /dev/zero.

Ugh, that's brilliant. Never thought to try that!

Updated. I'm leaving it as a system-wide change in case it will apply to other spins (do they inherit 80-workstation.preset?) But I've also stated that it applies to product Workstation. Lemme know how to clear that up, and I'll set it to ready for wrangler. Ben told me he'd put it through as a system-wide change despite being 2 days late.

I need to add earlyoom package to the Workstation package set, what thingy do I need to submit a PR for to add it?

Also, if we want to apply the change to upgrades, I need to submit PR to earlyoom package, e.g.

%triggerpostun -- fedora-release < 32
systemctl preset earlyoom.service

I'll wait until FESCo approves the change, but probably should include this information in the change proposal so they understand the scope.

It's systemwide because the change affects the entire system (applies to more than just a particular application or service), not because it only affects a particular edition or spin.

I think Workstation presets will not be inherited by other editions.

I need to add earlyoom package to the Workstation package set, what thingy do I need to submit a PR for to add it?

fedora-comps at https://pagure.io/fedora-comps/tree/master.

Would be wise to aim for WG approval of this at next week's meeting, so we have that to inform FESCo before they vote on the change.

That'll happen in the devel@ announcement, which is part of the change process.

My fears are not about possible bugs in earlyoom itself (it is fairly well tested, widely known, and the last release was more than six months ago), but about the changed behavior: 1. The cgroup2 kernel setting memory.oom.group, used to kill the whole group, will not work: one process will be terminated instead of the group. 2. The KillMode= and OOMPolicy= settings applied in units will probably not work.

We may also receive complaints from users whose applications are terminated before obvious signs of OOM appear, especially from those who do not use swap space. To users who do not monitor memory levels and have not heard that earlyoom is enabled by default, this may look like applications being terminated for no reason while everything seems fine.

I believe that it is necessary to more thoroughly convey to users all the possible consequences of the changes and their causes.

@hakavlad @xvitaly

  1. Lack of cgroupv2 group kill support is suboptimal but not a regression. The kernel oom-killer behavior kills only one process, not a group.

  2. My reading of KillMode= and OOMPolicy= suggests no conflict. But I want to be clear on why you think there might be a conflict, because I definitely don't want to step on other work and expectations in this area.
    https://www.freedesktop.org/software/systemd/man/systemd.kill.html
    https://www.freedesktop.org/software/systemd/man/systemd.service.html

  3. In the case of no swap, I agree this becomes difficult to use a single percentage. 1% is too low for e.g. 8GiB RAM system, and is too high for e.g. 64GiB RAM system. This might be the most significant liability of the feature.

Even though no swap isn't our default configuration, and the installer warns users when swap isn't configured, I think it's sufficiently common that users with lots of RAM forgo swap. I'm not convinced that merely documenting that users will experience OOM killing of processes at 5% of 32-64GiB, i.e. 1.6-3.2GiB free RAM, is acceptable. Opinions?

Workaround A: Is it possible given the time frame to enhance earlyoom to use /proc/pressure/memory instead of a RAM percentage? This facility is available on Fedora kernels and it's cheap to read.

Workaround B: We need a way to disable earlyoom on systems without swap. Either as a systemd unit condition to prevent service from starting, or an earlyoom test for it that's enabled in /etc/default/earlyoom

Any other ideas?
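
For reference on Workaround A, reading /proc/pressure/memory is indeed cheap: the file has two lines ("some" and "full"), each with avg10, avg60, avg300 and total fields. A minimal parsing sketch follows; the function names are hypothetical, and how a killer should act on these numbers is left open here.

```python
def parse_pressure(text):
    """Parse PSI text such as the contents of /proc/pressure/memory.

    Returns e.g. {"some": {"avg10": 0.0, ...}, "full": {...}}.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        # Each remaining field looks like "avg10=0.00" or "total=12345"
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def read_memory_pressure(path="/proc/pressure/memory"):
    """Read and parse the memory pressure file (needs a PSI-enabled kernel)."""
    with open(path) as f:
        return parse_pressure(f.read())
```

The avg* values are percentages of time stalled over the last 10, 60 and 300 seconds; total is cumulative stall time in microseconds.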

In the case of no swap, I agree this becomes difficult to use a single percentage. 1% is too low for e.g. 8GiB RAM system, and is too high for e.g. 64GiB RAM system. This might be the most significant liability of the feature.

Can you explain this further?

Our first priority should be to make this work well with our default configuration, which is currently swap partition equal in size to RAM. I've just reported #120 because it seems like an awful default.

earlyoom's low watermark for triggering terminate or kill is based on a percentage of RAM and swap. If there is both RAM and swap, we're for sure in a bad UX responsiveness state once swap is closing in on 5% free because for sure RAM is already at minimum. And therefore earlyoom default behavior is fine.

However, if there is no swap, earlyoom has only a single low watermark, free RAM. Once that gets to 5% on a 64GiB system, i.e. 3.2GiB free, it's going to kill something. That's the problem. 3.2GiB free is plenty of RAM to keep chugging away, so in the no swap case I think it needs a different metric for what the low watermark is.
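
The two cases above can be sketched as follows. This is an illustration under the assumption that both the RAM and the swap percentage must be below their thresholds before anything is signalled, and that a system without swap is treated as having 0% swap free; the function names are hypothetical and this is not earlyoom's source.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in KiB}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


def below_watermark(info, mem_percent=10.0, swap_percent=10.0):
    """True when available RAM and free swap are both at or under their thresholds."""
    mem_avail = 100.0 * info["MemAvailable"] / info["MemTotal"]
    swap_total = info.get("SwapTotal", 0)
    # Assumption: with no swap device, free swap counts as 0%,
    # so the RAM percentage alone decides.
    swap_free = 100.0 * info.get("SwapFree", 0) / swap_total if swap_total else 0.0
    return mem_avail <= mem_percent and swap_free <= swap_percent
```

With swap present, low available RAM alone does not trip the check as long as swap is mostly free; without swap, the RAM percentage is the only watermark, which is the 64GiB concern described above.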

I'm not worried about 5%, especially not if we have swap enabled by default. Could always lower the default though... is 2% enough to keep up with the tail /dev/zero test? If 5% is what it needs to respond quickly enough to rapidly rising memory usage, then that's what it needs and we should use that. But if a lower value works in practice, then that's great too.

Some more comments: https://www.reddit.com/r/Fedora/comments/ejbm75/earlyoom_was_proposed_to_be_enabled_by_default_in/

Why earlyoom uses 10% thresholds: https://github.com/rfjakob/earlyoom/issues/6 : 10% is good value to save page cache.

IMHO 4% for SIGTERM and 2% for SIGKILL are not bad thresholds for MemAvailable. This should be enough to prevent freezes at OOM. But 10% allows you to maintain better responsiveness after corrective action, especially without swap space.

By the way, using earlyoom with the default 10% setting and without swap space is one of the best ways to maintain a very responsive desktop:

  1. The monitoring intensity is very high, 1 - 10 Hz. This makes it easy to handle even fast memory allocations, such as tail /dev/zero, and makes this processing not affect responsiveness.
  2. You will not encounter slow swapping because there is simply no swap space.
  3. There is always a 5% to 10% disk cache margin.
  4. You need not be afraid of running ninja -j64 (high memory pressure will not be a problem; the only problems will be IO and maybe CPU pressure).

Out-of-memory handlers based only on PSI handle OOM without swap much worse, see https://github.com/facebookincubator/oomd/issues/80.

My reading of KillMode= and OOMPolicy= suggests no conflict.

Okay. The main incompatibility with the kernel killer is the impossibility of group kills (OOMPolicy also provides this: "If set to kill and one of the service's processes is killed by the OOM killer the kernel is instructed to kill all remaining processes of the service, too.")

I think I understand why I'm confused. earlyoom isn't killing tail /dev/zero; the kernel oom-killer is doing it every time, so that's not a good test case for the proposal write-up.

Every time tail is killed by oom-killer, GNOME Terminal is also killed, and I see this:
Jan 04 01:45:42 fmac.local systemd[978]: gnome-terminal-server.service: A process of this unit has been killed by the OOM killer.

earlyoom uses oom_score to determine what to terminate and kill; but doesn't actually use the kernel oom-killer to do the killing. Whereas systemd OOMPolicy depends on oom-killer.

So yeah, enabling earlyoom by default means OOMPolicy is, in effect, preempted. Most of the time it should be earlyoom doing the killing.
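
The distinction above (same metric, different mechanism) can be illustrated with a small sketch: a userspace killer reads each process's oom_score from /proc and then sends a signal itself, so the kernel oom-killer is never involved. The function name is hypothetical, and this assumes the victim is simply the process with the highest oom_score.

```python
import os


def pick_victim(proc="/proc"):
    """Return (pid, oom_score) of the process with the highest oom_score.

    Returns (None, -1) if no readable process entries are found.
    """
    best_pid, best_score = None, -1
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc, entry, "oom_score")) as f:
                score = int(f.read().strip())
        except (OSError, ValueError):
            continue  # process exited, or the file was unreadable
        if score > best_score:
            best_pid, best_score = int(entry), score
    return best_pid, best_score
```

A killer would then call something like os.kill(pid, signal.SIGTERM) on the result. Because the signal comes from userspace rather than from the kernel oom-killer, systemd does not see an OOM kill, which is why OOMPolicy= is preempted.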

GNOME Terminal is also killed

Jan 04 01:45:42 fmac.local systemd[978]: gnome-terminal-server.service: A process of this unit has been killed by the OOM killer.

Are you sure that the gnome terminal was also killed? This line only indicates that the process was killed in gnome-terminal-server.service.

tail /dev/zero is not meant to show how the system freezes. It is meant to demonstrate how earlyoom can handle fast memory allocations. If earlyoom is turned off, then starting tail /dev/zero always leads to OOMK triggering.

I ran tail /dev/zero from GNOME Terminal and the whole Terminal app definitely quits, I see it happen at the same time. I don't know why. I don't see any messages in the journal that help me understand why Terminal goes away rather than only tail being killed.

So far in fc31 I'm not seeing earlyoom SIGTERM/SIGKILL tail. It's always the kernel oom-killer that does it, whether earlyoom is running or not.

Jan 04 01:45:43 fmac.local kernel: Out of memory: Killed process 42912 (tail) total-vm:6190904kB, anon-rss:5975628kB, file-rss:1016kB, shmem-rss:0kB, UID:1000 pgtables:12042240kB oom_score_adj:0
Jan 04 01:45:43 fmac.local kernel: oom_reaper: reaped process 42912 (tail), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I honestly prefer the webkitgtk example better, I just wish it were easier to setup and got things into trouble sooner.

As for the percentage, I'm not finding a sweet spot. 4% means SIGTERM happens at ~2.6G free on a 64G RAM computer; while SIGKILL happens at ~82M on a 4G RAM computer. The former is high, the latter is low. I never see less than ~130M free in the totally hung state.
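
The arithmetic behind those two figures, as a quick sketch. It assumes SIGKILL fires at half the SIGTERM percentage, as in the -m help text; the function name is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def thresholds(total_bytes, term_percent=4.0):
    """Return (sigterm_bytes, sigkill_bytes) for a percentage-based watermark."""
    term = total_bytes * term_percent / 100
    kill = term / 2  # SIGKILL at PERCENT/2
    return term, kill


for ram_gib in (4, 8, 16, 32, 64):
    term, kill = thresholds(ram_gib * GIB)
    print(f"{ram_gib:>2} GiB RAM: SIGTERM below {term / GIB:.2f} GiB, "
          f"SIGKILL below {kill / MIB:.0f} MiB")
```

4% of 64 GiB is 2.56 GiB, and 2% of 4 GiB is about 82 MiB, so a single percentage spans a 16x range of absolute headroom across these machines.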

“Lack of cgroupv2 group kill support is suboptimal but not a regression. The kernel oom-killer behavior kills only one process, not a group.”

It does kill groups as a single unit with memory.oom.group. And it would be unfortunate to lose that because I think careful usage of cgroups in userspace could actually be the better way to keep the system responsive. For example, putting processes started by gnome-terminal in a cgroup or modifying Flatpak and Toolbox to use cgroup v2 with memory limits would guarantee responsiveness, indicate to the OOM killer (whether kernel or userspace) which collection of processes is the culprit, even if a compiler has been forking itself, a web browser is running multiple processes, etc.

It solves the “unprivileged process can hang the system” issue nicely as well. And, very importantly (to me), it wouldn’t require any new daemon running as root. Furthermore, it still allows the user to run an expensive task without the memory limits when they know the task is not going to grow uncontrollably. So the user can still benefit from using all their memory. It’d be unfortunate to lose that, which would be the case with earlyoom.

Another note: I don’t think notifying applications and hoping they can drop a few thumbnails or caches is going to help enough to rescue any system. It could work as a way to run more applications on a machine with 4 GB of RAM, at best. On a machine with more memory, the only situation I can imagine high memory pressure is when running a buggy application or an expensive task, and in those cases it is very likely any memory freed by notifying applications will be used instantly.

I'm not seeing earlyoom SIGTERM/SIGKILL tail. It's always the kernel oom-killer that does it, whether earlyoom is running or not.

Seems like it's time to debug. Please run earlyoom with -d -r 0.1 flags and show the output.

The output in my case with tail /dev/zero and -d -r 0.1 without swap space https://pastebin.com/YC9wU94t and with swap space https://pastebin.com/T6UyMc9x : the kernel OOMK was never invoked.

I suggest adding to the Benefit to Fedora sending SIGTERM as the first stage: this helps some applications to be terminated more correctly.

Release Notes
Earlyoom service is enabled by default, which will cause kernel oom-killer to trigger sooner.

Needs fixing: earlyoom sends signals, but does not trigger the kernel OOMK.

I suggest adding to the Benefit to Fedora sending SIGTERM as the first stage: this helps some applications to be terminated more correctly.

Good idea, done.

Release Notes
Earlyoom service is enabled by default, which will cause kernel oom-killer to trigger sooner.

Needs fixing: earlyoom sends signals, but does not trigger the kernel OOMK.

Done. Yes this was an early misunderstanding on my part, confusing oom_score and oom-killer. Earlyoom uses the same metric, but a different mechanism. Very curiously, I regularly see oom-killer clobber small privileged processes like sshd, and I've never seen that behavior from earlyoom and nohang.

@hakavlad re: oom-killer always killing tail, not earlyoom. So what I'm seeing is that it's killed before swap is even filled. The kernel oom-killer trace says "Free swap = 6914492kB" out of 8G, so that'd explain why earlyoom is not triggering. But I can't tell why oom-killer is triggering so fast in this particular case...

https://pastebin.com/cX9mj9bh

I like the meminfo and task state that kernel oom-killer spits out. sysrq+m includes the former, not the latter. And I'm not seeing another trigger that shows the task list with oom_score, maybe/probably f does but I don't want to kill something else in addition to earlyoom. Of course sysrq+t is just way excessive, it'll fill up the kernel buffer with the default size we're using, and it doesn't contain oom_score.

@hakavlad @catanzaro or anyone really - is there a way for a systemd service unit to condition check for the existence of a swap device? What if we only start earlyoom if there's a swap device?

Heavy page-in/page-out (simultaneously) is what consistently appears to be responsible for the loss of system responsiveness. A system without a swap device is unlikely to experience this problem. And then we don't have to answer the question about 10% being too high for those users.
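
Independent of whether systemd offers a suitable unit condition, detecting an active swap device from userspace is straightforward: /proc/swaps has one header line followed by one line per active swap area. A minimal sketch, with hypothetical function names, that a wrapper or pre-start check could use:

```python
def active_swaps(text):
    """Return the names of active swap areas from /proc/swaps-style text."""
    lines = text.splitlines()[1:]  # first line is the column header
    return [line.split()[0] for line in lines if line.strip()]


def has_swap(path="/proc/swaps"):
    """True if at least one swap area is currently active."""
    with open(path) as f:
        return bool(active_swaps(f.read()))
```

Note this only reflects the state at the moment of the check; swap activated later (for example by the installer) would not be noticed without re-checking.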

And then we don't have to answer the question about 10% being too high for those users.

I think we can ask the author of earlyoom to change the default values. Disabling earlyoom by default is a bad idea, because earlyoom greatly improves the user experience, especially without a swap. Without earlyoom Artem S. Tashkinov's case https://lkml.org/lkml/2019/8/4/15 won't be fixed.

In general, without a swap, earlyoom is one of the best solutions along with nohang. MemAvailable is the only metric that needs to be monitored. The use of PSI does not give a good effect: psi-monitor and oomd won't help you without swap: https://github.com/facebookincubator/oomd/issues/80. -- PSI is inefficient without swap space. The only problem is the default value.

oom-killer always killing tail

I observed this with vm.swappiness=0. BTW, vm.swappiness=0 doesn't forbid swapping. I have also seen forum threads where users complained that the OOM killer fires before all swap space has been exhausted.

Curiously, Michal Hocko suggests vm.swappiness=100 to encourage more swapping, but is skeptical it'll make a difference. I found vm.swappiness=0 has no effect in the webkitgtk use case.
https://marc.info/?l=linux-mm&m=157865446317791&w=2

Anyway, it sounds like there's pretty strong interest now in integrating oomd into systemd as a medium-term solution, and no doubt we'll be able to adjust its OOM heuristics to meet our responsiveness requirements. So do we really want to continue pursuing earlyoom in light of this? Maybe it's time to refocus on the zram issue as low-hanging fruit short term, and defer earlyoom to a future release cycle? If the systemd-level effort progresses as envisioned, it sounds like that'd be clearly preferable.

Concerns about killing entire cgroups no longer seem problematic for Workstation, since we are already launching each application under its own cgroup since 3.34.2. Web browsers have the ability to do this for their web content processes; I was able to quickly test by changing WebKit to run its web process under 'systemd-run --scope' and that worked fine. I didn't submit a patch upstream because that solution wasn't hierarchical; the new scope is created under systemd's toplevel scope, and it's unclear how to easily change. But as a proof of concept, I think that's sufficient to show browsers will be able to handle this should they choose to.

Then there's also concern about using PSI metrics. Well, if that doesn't work out, we can ask for a basic % RAM use metric instead. It seems clear that the systemd developers will try to accommodate our needs here and come up with a solution that actually works.

RE: ZRAM
I think zram-generator needs some work. I've filed a few bugs.
https://github.com/systemd/zram-generator/issues

But aren't we past the system-wide change period to get this into Fedora 32?

RE: earlyoom
It does help. It doesn't help enough in all cases. My biggest concern is if users expect a much better responsiveness experience, and don't actually get it.

RE: OOM managers generally
I'm not convinced any oom manager can estimate what the user wants. IF the user walks away, chances are they want the job to finish, responsiveness is a non-factor. IF the user comes back, wants to make another change and restart the job, they will want immediate responsiveness.

In my example to upstream, it was making slow progress, only 1/2 of swap is used, but GUI is frozen. What's the right thing to do? I think the existing oom managers do nothing in this case, and therefore no worse off with earlyoom.
https://marc.info/?l=linux-mm&m=157865446317791&w=2

IF the user walks away, chances are they want the job to finish

@chrismurphy This is exactly what I mean when I do not enable PSI by default. earlyoom guarantees that there will be no corrective action until there is really low memory left, unlike killers using PSI. Users do not always need a responsive desktop. Sometimes they are simply forced to actively use the swap and come to terms with the low responsiveness of the desktop when they do a lot of work. In many cases, compilation could complete successfully and work could be done and responsiveness would be restored if the PSI-based killer does not kill the task.

I found vm.swappiness=0 has no effect

In my tests, tasks involving active swapping with swappiness=0 take longer and go under higher mem/io PSI pressure.

defer earlyoom to a future release cycle?

In this case, users simply will not receive a solution to some of the problems, for example, A. S. Tashkinov's case will still not be fixed.

Maybe it's time to refocus on the zram issue as low-hanging fruit short

You can enable zram and earlyoom at the same time.

there's pretty strong interest now in integrating oomd into systemd as a medium-term solution

I am sure that systemd-oomd will not be ready for Fedora 33. Even oomd still has many problems.

only 1/2 of swap is used, but GUI is frozen. What's the right thing to do?

We cannot have a definite answer now. Some users would rather wait until the compilation is complete. Other users would prefer to achieve high responsiveness at all costs. Some users criticize earlyoom for taking early corrective actions. But in the case of an attempt to maintain the responsiveness of the desktop, based on the PSI, the corrective action will occur even earlier.

we are already launching each application under its own cgroup since 3.34.2

@catanzaro Other desktop environments, other browsers, other applications may not do this - they still start all processes in the session-1.scope. Thus, an oomd that kills entire control groups is much less universal. BTW, oomd sends SIGKILL instead of SIGTERM by default. By the way, oomd also ignores memory.oom.group - oomd just always kills cgroup.

But aren't we past the system-wide change period to get this into Fedora 32?

Yes. That's OK. We can play a longer game here. We're trying to solve a decades-old problem; users aren't expecting fixes to materialize overnight. Let's target F33 for zram, and try to land it in rawhide as quickly as possible after the branch point.

RE: earlyoom
It does help. It doesn't help enough in all cases. My biggest concern is if users expect a much better responsiveness experience, and don't actually get it.

I think it's just not compatible with a large disk-based swap. We don't need it to be perfect in all cases. Let's worry about making it work well in our default case, which is going to be zram.

I'm not convinced any oom manager can estimate what the user wants. IF the user walks away, chances are they want the job to finish, responsiveness is a non-factor. IF the user comes back, wants to make another change and restart the job, they will want immediate responsiveness.

Let's just always optimize for responsiveness, not throughput. Desktop workstations should always be responsive, period.

In my example to upstream, it was making slow progress, only 1/2 of swap is used, but GUI is frozen. What's the right thing to do? I think the existing oom managers do nothing in this case, and therefore no worse off with earlyoom.
https://marc.info/?l=linux-mm&m=157865446317791&w=2

I have no doubt it's best to kill before GUI is frozen. But again, let's simplify the problem with zram before worrying that earlyoom doesn't work well in combination with large disk-based swaps.

I am sure that systemd-oomd will not be ready for Fedora 33. Even oomd still has many problems.

Yeah that seems certain, it's certainly F34-F36 timeframe. It just seems unfortunate to install earlyoom now and then change our minds in favor of (hypothetical) systemd-oomd next year. We can't easily remove packages on upgrade, so earlyoom will stick around forever on users' systems.

only 1/2 of swap is used, but GUI is frozen. What's the right thing to do?

We cannot have a definite answer now. Some users would rather wait until the compilation is complete. Other users would prefer to achieve high responsiveness at all costs. Some users criticize earlyoom for taking early corrective actions. But in the case of an attempt to maintain the responsiveness of the desktop, based on the PSI, the corrective action will occur even earlier.

We can pick a default that's right for most users, though: prioritize responsiveness. A process that's so out of control as to cause the desktop to become laggy either needs to ease up on the system resource usage or be killed. We can have a configuration knob somewhere (e.g. the earlyoom config file) for technically-minded users who don't like this to change the behavior.

we are already launching each application under its own cgroup since 3.34.2

Other desktop environments, other browsers, other applications may not do this - they still start all processes in the session-1.scope.

That's OK: we're building Fedora Workstation, and we know Fedora Workstation means GNOME, and that means applications are now running in isolated scopes.

Thus, an oomd that kills entire control groups is much less universal. BTW, oomd sends SIGKILL instead of SIGTERM by default. By the way, oomd also ignores memory.oom.group - oomd just always kills cgroup.

I understand all these are limitations to be fixed before we can turn on our hypothetical future systemd-oomd. Your expertise is much appreciated, by the way; you've saved us from making several mistakes so far....

running in isolated scopes

@catanzaro It seems that the processes running in all windows of the gnome-terminal work in one gnome-terminal-server.service. If we enable the killing of entire control groups, this will mean that if we run dnf update in one window and tail /dev/zero in another, then in the end all these processes will be killed at the same time.

In fact, supporting memory.oom.group is not a big problem. Support for it can be added, if desired, to any killer. For example, I am going to add support for memory.oom.group to nohang in the future: https://github.com/hakavlad/nohang/issues/66. You could also do this in earlyoom if you wish (I don't think the author of earlyoom will do it himself, but perhaps he would merge a PR if someone wrote one).

I agree that gnome-terminal will need to create a new cgroup when running processes.

IMHO, I don't think that installing earlyoom is necessary. It's a work-around for the majority of applications and binaries on our system not setting their oom policies properly.

We're slowly moving to systemd for session management, in addition to using it for the system, and we should, instead of working around the kernel OOM killer, be modifying settings expectations in user-space.

Between setting OOM policies for important user-space components, getting Flatpak to apply this setting to applications it launches (and those that become background applications), and voluntary memory savings through listening to low memory signals, I don't think we'd need earlyoom, which would paper over the problems rather than solve them.

(As an aside, we had problems during the F31 beta where the kernel OOM killer could bring down the whole session, which was caused by bugs in the systemd user setup, and have now been fixed. I'd be worried about earlyoom hiding those problems)

Working Group discussed this today at the meeting. The vast majority agree that system responsiveness is a relevant working group matter, and that the working group should make a decision to support or withdraw the feature proposal; to continue the discussion and get subject experts as guests for the next meetings; and to produce a plan indicating a deliberate process for the next few releases, one that is open to adjustments as we learn more about this issue. Based on present available information, a majority of voting working group members support earlyoom by default for Fedora 32.

@hakavlad Can you contact me off list? I can't find an email address for you. chris@colorremedies.com - thanks.

The earlyoom change proposal includes: earlyoom.service will be enabled on upgrade.

Workstation PRD says: Upgrading ... should give a result that is the same as an original install of Fedora Workstation.

Should we consider whether to exclude upgraded systems from this change? The PRD is non-binding, and also says "should" not "must". I think a particularly persuasive argument is needed to overcome "should".

The earlyoom change proposal includes: earlyoom.service will be enabled on upgrade.

Workstation PRD says: Upgrading ... should give a result that is the same as an original install of Fedora Workstation.

These seem consistent with each other.

Should we consider whether to exclude upgraded systems from this change? The PRD is non-binding, and also says "should" not "must". I think a particularly persuasive argument is needed to overcome "should".

I'd want an upgrade to F32 to give the same result as a new install in this area. And if we later remove earlyoom from enablement, I'd want an upgrade to F3X to again give the same result as a new install.

Agreed: proceed with enabling earlyoom for F32.

Action items: Chris to change Fedora presets and update comps. Ensure earlyoom gets installed during distro upgrades with Supplements: fedora-release-workstation.

Metadata Update from @catanzaro:
- Issue untagged with: meeting

2 months ago

Agreed: proceed with enabling earlyoom for F32.

Are the meeting minutes available somewhere?

Are the meeting minutes available somewhere?

They're supposed to be posted to the mailing list later today. I've shared them with you privately in the meantime.

earlyoom.service: drop root privileges

Run as a random unprivilege user instead of as root,
but add the capabilities CAP_KILL CAP_IPC_LOCK.

https://github.com/rfjakob/earlyoom/commit/f2b45e6a18a0624032d289318569ad57c24fd419

I'm glad to see these updates upstream. How soon will Fedora see an update? I don't see it in the mass rebuild failures list, and yet there's no fc32 build in koji. The latest package is from July.

Don't worry, the package will be updated anyway. Most likely in a few months. I hope some more changes will be made before the release.

I'm pretty sure we want an fc32 build of earlyoom to ship by beta, don't we? At the least it should be subject to mass rebuild, I'm not sure why it got missed.

Fedora 32 beta freeze is in five days (Feb 25). I've filed a bug. Maybe it just needs a poke to get it to build an fc32 version.

@xvitaly @hakavlad

@xvitaly said that the new unit was backported. @chrismurphy

I think the earlyoom defaults are OK for default/autopartition installations, which have a swap device. What about the noswap case? I'm considering:

EARLYOOM_ARGS="-r 60 -M 512000"

I haven't seen free memory often below 100M when under pressure, and 500M gives quite a good amount of headroom without being so wasteful on systems with a lot of RAM.

500M is 25% MemTotal if MemTotal = 2000M.
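
The trade-off with a fixed -M value can be sketched the same way, assuming (as the 512000 / "500M" figures above imply) that the value is given in KiB. The function name is hypothetical.

```python
def fixed_as_percent(fixed, total):
    """Express a fixed minimum as a percentage of total memory (same units)."""
    return 100.0 * fixed / total


KIB_PER_GIB = 1024 * 1024

for ram_gib in (2, 4, 8, 32, 64):
    pct = fixed_as_percent(512000, ram_gib * KIB_PER_GIB)
    print(f"-M 512000 on {ram_gib:>2} GiB RAM is {pct:.2f}% of MemTotal")
```

A fixed size is a large share of memory on small machines and a tiny one on large machines, which is the mirror image of the problem with a fixed percentage.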

We can assume systems with < 4 G RAM have swap, and thus the limiting factor is 10% swap free.

However, they don't have swap on initial installation. Anaconda's swap-on-ZRAM starts pretty early on DVD/netinstall; but not until Anaconda launches on Live. Also, right now there is a switch that happens during the installation, where Anaconda's zram.service is deactivated and swap-on-disk is activated. I don't know if they overlap or if the former happens before the latter in a way that maybe means no swap very briefly. At which point such a system might be prone to becoming a shooting gallery.

What about 200M? Does that provide enough head room for OOM recovery?

What about 200M? Does that provide enough head room for OOM recovery?

The amount that is going to be "enough head room", and provide the best interactivity can't be defined in strict numbers. It depends on the specific workload (even for "desktop" workloads). That's why low-memory-monitor (and others) uses the memory pressure information rather than amounts of RAM (whether absolute or relative).

E.g. 200 megs might be enough to recover unless one of your applications needs that 200 megs of RAM in one allocation to carry on working.

That's why low-memory-monitor (and others) uses the memory pressure information rather than amounts of RAM

@hadess What about https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8 ?

That's why low-memory-monitor (and others) uses the memory pressure information rather than amounts of RAM

@hadess What about https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8 ?

It has nothing to do with what we're discussing. It was a systemd session problem which has since been fixed, and the functionality is disabled anyway. “Tu quoque” arguments aren't useful in this case.

Android's lmkd and Endless' psi-monitor also use the memory pressure information, did you have links you wanted to use for those?

I appreciate the feedback so far. I'd like to constrain the discussion in this issue to that of fine tuning earlyoom appropriately for Fedora Workstation 32, since it is an approved feature. There will be a time to re-evaluate for F33.

re: earlyoom in the noswap case. We know earlyoom is naive and simple, getting to perfection isn't the goal. But is SIGTERM on 10% free RAM the best that can be done? e.g. workstation with 64G RAM, it means SIGTERM at ~6.4G RAM free which seems like quite a lot. Or am I overstating things by suggesting it's excessive?

earlyoom.service: drop root privileges
Run as a random unprivileged user instead of as root,
but add the capabilities CAP_KILL CAP_IPC_LOCK.

https://github.com/rfjakob/earlyoom/commit/f2b45e6a18a0624032d289318569ad57c24fd419

When /proc is mounted with hidepid=2, it doesn't work. The Earlyoom service can only see its own pid.

What is a good default memory/swap limit: https://github.com/rfjakob/earlyoom/issues/6

What about 200M? Does that provide enough head room for OOM recovery?

It's enough, IMHO. But look at https://github.com/rfjakob/earlyoom/issues/111.

earlyoom will send only SIGKILL if you set -M 200000 with MemTotal=32GB.

I suggest -m 4. I want to propose these changes in earlyoom/issues, please wait.

Updated to version 1.3.1: F31, F32, F33, epel8.

@xvitaly @hakavlad Note that F32 is in beta freeze until 17 Mar. Pretty much only if earlyoom were having problems with Live media (testing out the distro and installing) would freeze exception apply; otherwise it's still possible to get updates from updates-testing repo, and updates will become available to beta users in the first batch of updates.

The "test the distro" use case is actually one in which there is no swap. This would change in F33 if #127 is approved.

When /proc is mounted with hidepid=2, it doesn't work. The Earlyoom service can only see its own pid.

@latalante https://github.com/rfjakob/earlyoom/wiki/proc-hidepid

IF the user walks away, chances are they want the job to finish; responsiveness is a non-factor. IF the user comes back, wants to make another change and restart the job, they will want immediate responsiveness.

@chrismurphy It is for this reason that I do not believe in the possibility of universal settings based on PSI. Some users prefer low latency and good responsiveness no matter what. For other users, responsiveness is not a priority, at least when performing some of the tasks - they would prefer that the task be completed and no process be killed no matter what.

It has nothing to do with what we're discussing.

@hadess OK, we can discuss PSI and LMM in #98 further.

it is for this reason that I do not believe in the possibility of universal settings based on PSI. Some users prefer low latency and good responsiveness no matter what. For other users, responsiveness is not a priority, at least when performing some of the tasks - they would prefer that the task be completed and no process be killed no matter what.

We can't predict what a general user will want in general scenarios, but we are making Workstation-specific decisions for Workstation here. I think we have a broad consensus that we want to prioritize responsiveness on Workstation.

I wonder if we should make the default staggered, something like

Up to 4GB RAM: 10%
Up to 8GB RAM: 4%
Up to 64GB RAM: 2%
Above 64GB RAM: 1%

-- https://github.com/rfjakob/earlyoom/issues/162#issuecomment-593149868
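To make the staggered proposal concrete, here is a minimal Python sketch that maps total RAM to the quoted percentage tiers and prints the resulting absolute SIGTERM level. The function names are hypothetical, and it assumes "Up to N GB" means an inclusive upper bound in binary GiB; nothing here is taken from earlyoom's code.

```python
def staggered_percent(mem_total_gib: float) -> float:
    """Return the SIGTERM threshold (% of MemTotal) for the tiers quoted above.

    Assumption: "Up to N GB" is treated as an inclusive upper bound in GiB.
    """
    if mem_total_gib <= 4:
        return 10.0
    if mem_total_gib <= 8:
        return 4.0
    if mem_total_gib <= 64:
        return 2.0
    return 1.0


def sigterm_level_mib(mem_total_gib: float) -> float:
    """Absolute amount of available memory (MiB) at which SIGTERM would be sent."""
    return mem_total_gib * 1024 * staggered_percent(mem_total_gib) / 100


if __name__ == "__main__":
    for size in (4, 8, 16, 64, 128):
        print(f"{size:>4} GiB: {staggered_percent(size):>4.0f} % "
              f"= {sigterm_level_mib(size):8.1f} MiB")
```

One thing the sketch makes visible is that the absolute level is not monotonic: a 4 GiB machine would act at roughly 410 MiB, while an 8 GiB machine would act at roughly 330 MiB.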

I wonder if we should make the default staggered

I like it in theory, but in practice I wonder if it will make the configuration file complicated?

I think we can discount the likelihood of no swap below 4GiB RAM. Thus the proposal suggests a SIGTERM trigger from 400M to 1280M, in the no swap case. What about a default of -M 358400?

I like it in theory, but in practice I wonder if it will make the configuration file complicated?

You do not need to change the configuration file if this is done upstream. Perhaps the values will be hardcoded and will depend on MemTotal.

I think we can discount the likelihood of no swap below 4GiB RAM.

We can discount the likelihood of no swap below 4GiB RAM in any case. No swap with low RAM is a bad idea.

See also:
https://github.com/rfjakob/earlyoom/tree/v1.4
https://github.com/rfjakob/earlyoom/tree/v1.4#changelog

Updated to version 1.4: F31, F32, F33, epel8.

Testing needed.

...

I upgraded to F32 and installed earlyoom, and I see this in system journal:

Mar 06 13:07:14 phoenix earlyoom[1134]: mem avail: 19883 of 23924 MiB (83 %), swap free: 8191 of 8191 MiB (100 %)
Mar 06 13:08:14 phoenix earlyoom[1134]: mem avail: 19775 of 23924 MiB (82 %), swap free: 8191 of 8191 MiB (100 %)
Mar 06 13:09:14 phoenix earlyoom[1134]: mem avail: 19785 of 23924 MiB (82 %), swap free: 8191 of 8191 MiB (100 %)
Mar 06 13:10:14 phoenix earlyoom[1134]: mem avail: 19751 of 23924 MiB (82 %), swap free: 8191 of 8191 MiB (100 %)
Mar 06 13:11:14 phoenix earlyoom[1134]: mem avail: 19769 of 23924 MiB (82 %), swap free: 8191 of 8191 MiB (100 %)
Mar 06 13:12:14 phoenix earlyoom[1134]: mem avail: 19740 of 23924 MiB (82 %), swap free: 8191 of 8191 MiB (100 %)

Is it necessary to log every minute? It makes the journal less readable. The background checking can run as frequently as needed, but it would be nice if it logged very rarely when there's enough free memory, and logged more often as memory fills up. When I have 80% of memory free, such frequent logging seems superfluous. Or is it just a debugging feature that will get turned off in the F32 final release?

Benjamin is very close to getting desktop session protection working in #98. In light of the progress in that issue, I'd like to again propose that we defer the decision to enable earlyoom from F32 to F33. It's beginning to look quite likely that we will be able to obsolete earlyoom just with desktop session protection, without needing a replacement userspace oom handler.

Metadata Update from @catanzaro:
- Issue tagged with: meeting-request

a month ago

Metadata Update from @chrismurphy:
- Issue untagged with: meeting-request
- Issue tagged with: meeting

a month ago

we will be able to obsolete earlyoom just with desktop session protection, without needing a replacement userspace oom handler.

@catanzaro In fact, one does not interfere with the other. earlyoom will still be improved further. Using both can provide comprehensive protection.

No i like the memory report, fedora can set it to 0 if they want.

-- https://github.com/rfjakob/earlyoom/issues/171#issuecomment-596059650

@chrismurphy @xvitaly can change default values if WG wants. Just say what you want.

@hakavlad

EARLYOOM_ARGS="-r 0 -m 4 -M 500000"

But the man page says "You can only use either -m or -M." I'm not sure what happens with both.

I think it's reasonable for final release to do -r 0 and then suggest enabling reporting on a case by case basis; but I also don't find it meddlesome, the earlyoom messages are priority "info" so they can be filtered out with journalctl -p 5 or lower value.

Actually, we can use the old command line options. When you pass both a percentage and a megabyte value, the lower value could be used

-- rfjakob, https://github.com/rfjakob/earlyoom/issues/162#issuecomment-593424982

I'm not sure what happens with both

@chrismurphy I mean we can use this -m 4 -M 500000 in the future (I hope).

Benjamin is very close to getting desktop session protection working in #98. In light of the progress in that issue, I'd like to again propose that we defer the decision to enable earlyoom from F32 to F33. It's beginning to look quite likely that we will be able to obsolete earlyoom just with desktop session protection, without needing a replacement userspace oom handler.

We've discussed and agreed to continue with earlyoom for F32 regardless.

Metadata Update from @catanzaro:
- Issue untagged with: meeting

24 days ago

Most concerns from devel@ have been addressed in earlyoom-1.4, which is in u-t right now. The one item the WG remains concerned about is conflict with systemd service unit functionality:

Probably, the settings KillMode= and OOMPolicy= applied in units will not work.

It would be useful to know if a future earlyoom will avoid conflict?

F32 defaults

-r (reporting frequency). Many other things dump low priority info messages into the journal, that can be filtered out with journalctl -p, but I have no objection to turning off reporting in final release either.

-M 300000 This only affects the noswap case; with swap, the SIGTERM/SIGKILL trigger will be swap going below its low water mark value. If we discount the low RAM+noswap case, 300M seems to be a decent compromise among the values floated so far.

I've just tested -r 0 -M 300000 while building webkitgtk, and the only journal entries I get are:

Mar 10 09:40:21 fmac.local earlyoom[1746]: mem avail:    28 of  7865 MiB ( 0 %), swap free:  431 of 3931 MiB (10 %)
Mar 10 09:40:21 fmac.local earlyoom[1746]: low memory! at or below SIGTERM limits: mem 3 %, swap 10 %
Mar 10 09:40:22 fmac.local earlyoom[1746]: sending SIGTERM to process 9354 uid 1000 "cc1plus": badness 112, VmRSS 1151 MiB
Mar 10 09:40:22 fmac.local earlyoom[1746]: process exited after 0.3 seconds

Another option is to increase the reporting interval to 2 minutes? It's somewhat useful to get a RAM/swap usage report 1-2 minutes before earlyoom takes action; it helps show the trend leading up to the action.

It would be useful to know if a future earlyoom will avoid conflict?

We can ask rfjakob to add memory.oom.group support, but I don’t think it will be quickly implemented.

We can ask rfjakob to add memory.oom.group support, but I don’t think it will be quickly implemented.

I'd first ask rfjakob for an assessment. The WG prefers a generic solution that also cooperates with systemd. The expectation is that there will be a systemd-oomd to transition to, and generally the path of least resistance is preferred unless there's significant benefit to extra effort.

This suggests a direct transition from earlyoom to systemd-oomd; but there's an open ended role for Workstation edition to participate in systemd-oomd development that may include using oomd itself as an intermediate step.

OOMPolicy=
Configure the Out-Of-Memory (OOM) killer policy.

To be honest, earlyoom is not OOM killer. It's just low memory handler and OOM prevention daemon. See also https://github.com/rfjakob/earlyoom/issues/60.

And I don’t know how full support for OOMPolicy in earlyoom can be realized.

earlyoom may support memory.oom.group: it can check memory.oom.group value and, if it=1, kill all processes in cgroup of the victim.

I'd first ask rfjakob for an assessment.

OK, please.

The WG prefers a generic solution that also cooperates with systemd.

Only low-memory-monitor and psi-monitor cooperate with systemd, because they trigger the kernel OOM killer (OOMK).

BTW, how many services use non-default OOMPolicy values? It seems to me that there are very few such services; most units use default values.

Only low-memory-monitor and psi-monitor cooperate with systemd, because they trigger the kernel OOM killer (OOMK).

Understood.

BTW, how many services use non-default OOMPolicy values? It seems to me that there are very few such services; most units use default values.

Right, which is why I'm not worried about it. Yet. However, I want to make sure we're not stepping on other people's work and expectations. If there's no conflict, there's no problem.

I've been going through Lennart's complaints and selected the ones that aren't complaining about earlyoom not using PSI (hakavlad has already established PSI doesn't work well):

  1. Waking up all the time in 100ms intervals? We generally try to avoid waking the CPU up all the time if nothing happens. Saving power and things.

Tejun suggested we use a 1s poll interval, or possibly even higher. I imagine 100ms will likely have a noticeable impact on our battery life?

  1. New code using system() in the year 2020? Really?

  2. Fixed size buffers and implicit, undetected, truncation of strings at various places (for example, when formatting the shell string to pass to system()).

Have these been fixed?

BTW, this should not be a root daemon anyway. It only needs one cap:
CAP_SYS_KILL. Hence, drop privs to some user of its own, and keep that
one cap. Use AmbientCapabilities= in the unit file.

Are we still running as root, or did we create an earlyoom user?

I want to make sure we're not stepping on other people's work and expectations

Of course, we cannot be 100% sure.

hakavlad has already established PSI doesn't work well

PSI can work quite well with default kernel if used properly (do not shoot if the pressure is not increased for a long time and strongly, do not shoot if MemAvailable > 5% MemTotal, for example). BTW, default oomd settings:

                            "cgroup": "user.slice",
                            "resource": "memory",
                            "threshold": "60",
                            "duration": "30"

These are much higher thresholds than are used in psi-monitor and especially in low-memory-monitor. I think good practice is using PSI and MemAvailable (to prevent possible false-positives) at the same time. And I already enabled PSI usage by default in nohang-desktop, see https://pagure.io/fedora-workstation/issue/98#comment-623868.
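The idea of combining PSI with a MemAvailable guard can be sketched in a few lines of Python. This is a minimal illustration, not the logic of oomd, nohang, or any other tool: the function names are hypothetical, the default threshold of 60 is borrowed from the oomd snippet above, the 5% guard comes from the example in this comment, and the 10-second average stands in for a real "sustained for N seconds" check.

```python
def parse_psi(text: str) -> dict:
    """Parse the contents of /proc/pressure/memory into nested dicts.

    Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {key: float(value)
                        for key, value in (field.split("=") for field in fields)}
    return result


def should_act(psi_text: str, mem_available_kib: int, mem_total_kib: int,
               pressure_threshold: float = 60.0,
               min_available_percent: float = 5.0) -> bool:
    """Act only if pressure is high AND available memory is low.

    Simplification: "some avg10" is used instead of tracking how long the
    pressure has stayed above the threshold.
    """
    psi = parse_psi(psi_text)
    pressure_high = psi["some"]["avg10"] >= pressure_threshold
    available_percent = 100.0 * mem_available_kib / mem_total_kib
    return pressure_high and available_percent <= min_available_percent


# Made-up sample in the kernel's PSI file format, for illustration only.
SAMPLE = (
    "some avg10=72.50 avg60=40.10 avg300=12.00 total=123456\n"
    "full avg10=65.00 avg60=30.00 avg300=9.00 total=98765\n"
)

if __name__ == "__main__":
    print(should_act(SAMPLE, 200_000, 8_000_000))    # high pressure, 2.5 % available
    print(should_act(SAMPLE, 4_000_000, 8_000_000))  # high pressure, 50 % available
```

With the guard in place, high pressure alone (the second call) is not enough to trigger an action, which is the false-positive case the comment is worried about.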

Tejun suggested we use a 1s poll interval

earlyoom uses a 1s interval in most cases. You can check the interval if you run earlyoom with the -d flag. The interval approaches 0.1s as MemAvailable and SwapFree approach 0. See also https://github.com/rfjakob/earlyoom/issues/61. I see that earlyoom uses 0% CPU in top/htop.

New code using system() in the year 2020?
formatting the shell string to pass to system()

It is not used by default. It is used only with -n and -N flags.

Fixed size buffers

explain buffer size choice - https://github.com/rfjakob/earlyoom/commit/4139165ec9681d88ea006ab6860fc30a30096eb2

Are we still running as root, or did we create an earlyoom user?

earlyoom v1.4 starts as DynamicUser=true with only CAP_KILL and CAP_IPC_LOCK capabilities, see https://github.com/rfjakob/earlyoom/blob/master/earlyoom.service.in and https://github.com/rfjakob/earlyoom#changelog.

Thus, the main problems now are:

  • default memory threshold;

  • OOMPolicy support.

This also allows simultaneous use of -M and -m and change of memory/swap.
https://github.com/rfjakob/earlyoom/pull/176

When is the deadline for patching earlyoom?

When is the deadline for patching earlyoom?

Final freeze begins 2020-04-07. I'd like to see us settle on the settings to use, ideally by the end of March.

In some ways a bigger issue I've run into is #138; I can only get earlyoom in clean installs, there's nothing (so far) to drag it in on an upgrade from F31.

earlyoom is ready for patching now. You can use -m and -M both:

# earlyoom -m 10 -M 128000
earlyoom v1.4-20-g3c28576
mem total: 9790 MiB, swap total:    0 MiB
sending SIGTERM when mem <=  1.28% and swap <= 10.00%,
        SIGKILL when mem <=  0.64% and swap <=  5.00%
mem avail:  6401 of  9790 MiB (65.38%), swap free:    0 of    0 MiB ( 0.00%)
...
mem avail:  1339 of  9790 MiB (13.68%), swap free:    0 of    0 MiB ( 0.00%)
mem avail:   684 of  9790 MiB ( 7.00%), swap free:    0 of    0 MiB ( 0.00%)
mem avail:   180 of  9790 MiB ( 1.85%), swap free:    0 of    0 MiB ( 0.00%)
mem avail:    93 of  9790 MiB ( 0.95%), swap free:    0 of    0 MiB ( 0.00%)
low memory! at or below SIGTERM limits: mem  1.28%, swap 10.00%
sending SIGTERM to process 16873 uid 1000 "python3": badness 672, VmRSS 6568 MiB
process exited after 0.3 seconds
mem avail:  6677 of  9790 MiB (68.20%), swap free:    0 of    0 MiB ( 0.00%)
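The percentages in that output can be reproduced with a little arithmetic, assuming the rule rfjakob described earlier (the lower of the -m percentage and the -M value wins) and the default SIGKILL level of half the SIGTERM level. The Python sketch below is only a check of the numbers; the function name is hypothetical, and MemTotal is approximated from the rounded "9790 MiB" in the log.

```python
def effective_percent(mem_total_kib: int, m_percent: float, m_kib: int) -> float:
    """Return the SIGTERM threshold in percent when both -m and -M are given.

    Assumption: the lower of the two limits is used, as discussed upstream.
    """
    return min(m_percent, 100.0 * m_kib / mem_total_kib)


if __name__ == "__main__":
    mem_total_kib = 9790 * 1024                       # approximated from the log
    term = effective_percent(mem_total_kib, 10, 128000)
    kill = term / 2                                   # default KILL_PERCENT is PERCENT/2
    print(f"SIGTERM at {term:.2f}%, SIGKILL at {kill:.2f}%")
```

On this 9790 MiB machine, 128000 KiB is about 1.28% of RAM, which is lower than 10%, so the -M value wins. With the proposed -m 4 -M 409600, the same rule would mean 4% applies below roughly 10000 MiB of RAM and the fixed 400 MiB cap applies above it.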

@xvitaly is waiting for the WG's final decision.

@chrismurphy My suggestion is using -m 4 -M 409600. And what -r value do you want? Maybe 0, 60, 120 or 300?

Currently earlyoom spams the system journal once per minute, and this is too much for me. I think I will disable this completely.

I've tested earlyoom with F32 beta on a dedicated VM with 2 GB RAM for a week, and earlyoom pushed tons of garbage to journald.

earlyoom.default: set -r 3600

As there have now been two complaints about the default
log output being too noisy, I'll have to accept that
other users find the memory reports less interesting
than I do :)

Increase from 60 seconds to 3600 seconds (one memory
report per hour)

https://github.com/rfjakob/earlyoom/commit/e1c74d918c3951e0ec20bd8817ab55530e020491

Most users don't need these reports at all. I will disable them completely with -r 0.

Btw, hurry up, guys. The time is running out.

@chrismurphy What thresholds should we set by default?

EARLYOOM_ARGS="-r 0 -m 4 -M 409600" sounds reasonable to me.

Also, man page needs updates to reflect the changes to -p -m -M -s -S.

EARLYOOM_ARGS="-r 0 -m 4 -M 409600" sounds reasonable to me.

What about adding the list of excluded process names? It can easily kill X11/dnf and this is not good.

Also, man page needs updates to reflect the changes to -p -m -M -s -S.

It generates man pages automatically from Markdown files.

If the dnf process is killed by earlyoom or by any other userspace killer, the system will be completely broken and must be reinstalled. That's why I think we should add the list of excluded processes.

Example:

EARLYOOM_ARGS="-r 0 -m 4 -M 409600 --avoid '(^|/)(systemd|dnf|sshd)$'"

F32 scratch build with EARLYOOM_ARGS="-r 0 -m 4 -M 409600" for testing: https://koji.fedoraproject.org/koji/taskinfo?taskID=42715225

we should add the list of excluded processes

My suggestion is --avoid '^(dnf|packagekitd|gnome-shell|Xwayland)$'.

Top by oom_score in F31W: http://okturing.com/src/8038/body

Killing gnome-shell or Xwayland causes all processes in the user session to be killed; that's why we should protect them.
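To see what the anchored pattern would and would not protect, here is a small Python sketch. It uses Python's re module as a stand-in for whatever regex engine earlyoom uses (for a plain alternation like this the behavior should be the same, but that is an assumption), and it assumes matching is done against the kernel's process name (comm), which is limited to 15 characters. The helper name is hypothetical.

```python
import re

# Pattern copied from the suggestion above.
AVOID = re.compile(r"^(dnf|packagekitd|gnome-shell|Xwayland)$")


def is_avoided(process_name: str) -> bool:
    """Return True if the name, cut to the 15-char comm limit, matches fully."""
    comm = process_name[:15]
    return AVOID.search(comm) is not None


if __name__ == "__main__":
    for name in ("dnf", "dnf-automatic", "gnome-shell",
                 "gnome-shell-calendar-server", "Xwayland", "firefox"):
        print(f"{name:<28} avoided={is_avoided(name)}")
```

Because of the ^ and $ anchors, only exact names are protected: dnf matches, but dnf-automatic and the gnome-shell helper processes do not. Anything that should be covered by prefix (such as gsd-*) would need its own alternative in the pattern.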

Also at least: gnome-session, dbus-broker

@benzea, anything else? Maybe gsd-* processes?

Maybe gsd-|ibus-|Xorg|at-spi|dbus-|pipewire|pulseaudio. Maybe xdg- (i.e. all the portal stuff).

Didn't try to include system service (e.g. systemd itself, or UPower).

gnome-session, dbus-broker

They're not near the top by oom_score; they're small processes. Killing gsd-* does not kill the user session.

Time is running out! This is the last chance to change the default options. The next opportunity will be after the Fedora 32 release.

@chrismurphy

Should we prefer Firefox tabs, like in chromium and electron-based apps?

This is default: chromium tabs get oom_score_adj=300 by default.

$ oom-sort -l0
oom_score oom_score_adj  UID   PID Name            VmRSS   VmSwap
--------- ------------- ---- ----- --------------- ------- --------
      304           300 1000 18072 riot-web          188 M      0 M 
      302           300 1000 19125 chromium           97 M      0 M 
      302           300 1000 19133 chromium           78 M      0 M 
      301           300 1000 19185 chromium           46 M      0 M 
      202           200 1000 19095 chromium          110 M      0 M 
      200           200 1000 19165 chromium           15 M      0 M 
       59             0 1000 17420 VirtualBox       2322 M      0 M 

Should we set --prefer 'Web Content' to terminate ff children instead of whole browser?

My suggestion:

EARLYOOM_ARGS="-r 0 -m 4 -M 409600 --prefer '^Web Content$' --avoid '^(dnf|packagekitd|gnome-shell|gnome-session-c|gnome-session-b|lightdm|sddm|sddm-helper|gdm|gdm-wayland-ses|gdm-session-wor|gdm-x-session|Xorg|Xwayland|systemd|systemd-logind|dbus-daemon|dbus-broker|cinnamon|cinnamon-sessio|kwin_x11|kwin_wayland|plasmashell|ksmserver|plasma_session|startplasma-way|xfce4-session|mate-session|marco|lxqt-session|openbox)$'"

Looks good. I think for Firefox, ideally individual tabs get killed rather than all of Firefox.

If web browser tabs get +300 oom_score_adj, does this mean it will favor killing off browser tabs one by one, even if there's a runaway program operating from the command line? Where can I find oom-sort?

it will favor killing off browser tabs one by one

Yes.

if there's a runaway program operating from the command line

I think it does not affect oom_score.

Hey, does anyone know where the +300 OOM score for firefox is coming from? I am asking because there is the idea to default to +500 for user processes so that we can then adjust the score down again selectively. But if firefox just sets 300 (rather than changing it relatively), then it could suddenly get a lower score than anything else.

The Chrome people made their browser (and all electron-based apps - vscode, skype, discord etc) always be the first victim by setting oom_score_adj very high - https://bugs.chromium.org/p/chromium/issues/detail?id=333617

@benzea

earlyoom just copies this behavior (adds +300 to badness) for preferred processes.

This behavior leads to tabs being terminated one by one instead of the whole browser. Demo: https://youtu.be/PLVWgNrVNlc

@hakavlad sure.

All I want to make sure here is that no one is setting absolute values in a way that will interfere with a change of defaults inside systemd. And the chrome stuff you linked smells like it might end up doing exactly that. I.e. I believe they should only change it relative to the main process or relative to their own initial score.

Earlyoom 1.5 build: F31, F32, F33, EPEL8.

Please test these builds and add karma to the updates.

All I want to make sure here is that no one is setting absolute values

No one is setting absolute values. Badness of processes used in earlyoom is equal to oom_score by default.

earlyoom does not come up with anything new. It uses roughly the same mechanisms as in the kernel.

--prefer works like setting oom_score_adj=300. Chrome has long done something similar, without problems.

a change of defaults inside systemd

earlyoom does not affect systemd or kernel settings and does not change anything in /proc or any other filesystem.

@hakavlad I am not sure you understand.

If all user processes get an oom_score_adj of 500 by default, and you set it to 300 using --prefer, then suddenly you would be protecting those processes rather than preferentially killing them.

@benzea It will add +300 to oom_score. 500+300=800.
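The difference between the two readings (setting the score to 300 versus adding 300) can be shown with a short Python sketch. It models only what is stated in this thread: badness starts from the process's oom_score and a --prefer match adds 300. The function name is hypothetical, --avoid is left out, and this is not earlyoom's actual code.

```python
import re

PREFER_BONUS = 300  # value stated in this thread for --prefer


def badness(oom_score: int, comm: str, prefer: re.Pattern) -> int:
    """Return the victim-selection score: oom_score plus a bonus if preferred."""
    score = oom_score
    if prefer.search(comm):
        score += PREFER_BONUS  # relative adjustment, not an absolute assignment
    return score


if __name__ == "__main__":
    prefer = re.compile(r"^Web Content$")
    # Hypothetical case from the discussion: every user process starts at 500.
    print(badness(500, "Web Content", prefer))
    print(badness(500, "bash", prefer))
```

Under the additive reading, a preferred process always ranks above an otherwise identical non-preferred one, whatever baseline the session manager applies.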
