#202 Improve the user feedback for OOM situations
Opened a year ago by tpopela. Modified 3 months ago

We had a discussion with @zbyszek about systemd-oomd a few weeks ago and he mentioned one thing that I'm really not sure whether we'd talked about in the past. I hope that @zbyszek won't mind if I put part of our conversation here:

I think it is fair to say that earlyoom in F33 is working well. If anything, the experience on the desktop is too smooth, e.g. firefox tabs are just killed without giving a hint that an oom killer was involved, and the whole thing looks like a quiet crash of a tab.

I think that no matter what the mechanism is (earlyoom, systemd-oomd, kernel oomd) we should enhance the user feedback. Even a simple notification popup saying "Machine was oom. Something was killed" would go a long way.


I'm inclined to say that we should only show a notification if we are able to say which app/process was affected - without this information, they will likely do more harm than good.

Yes, that makes sense. I suspect that earlyoom is putting the information (what process it killed) somewhere in the journal. I suspect that @chrismurphy or @hakavlad might know more.

earlyoom does log the process ID and name of what it kills to the journal, along with the memory and swap values and % free. What about user space oom killers using dbus to communicate a subset of this information, so the desktop environment can pick this up and notify the user accordingly?

Related work but a bit of a tangent:
https://github.com/facebookexperimental/resctl-demo

In the demo, we see less dependency on killing off processes; instead, resource control throttles the heavy resource consumer to keep the system responsive. Part of this is pushing dirty pages aggressively to swap, which, thanks to recent improvements in the kernel's swap code (with more forthcoming), looks pretty good for the non-thrashing case of just paging out something like a browser tab that isn't being used.

@benzea @lennart @hadess

earlyoom can show GUI notifications using an additional daemon:

Since version 1.6, earlyoom can send notifications about killed processes via the system d-bus. Pass -n to enable them.

To actually see the notifications in your GUI session, you need to have systembus-notify running as your user.

https://github.com/rfjakob/earlyoom#notifications
https://github.com/rfjakob/systembus-notify

nohang-desktop shows GUI notifications about low-memory conditions and about what was killed for all logged-in users out of the box, demo: https://youtu.be/PLVWgNrVNlc, https://youtu.be/G0FYDIKVPYI.

Metadata Update from @tpopela:
- Issue tagged with: meeting-request

a year ago

Metadata Update from @catanzaro:
- Issue untagged with: meeting-request
- Issue tagged with: meeting

a year ago

Discussed today. Desktop notifications in earlyoom would be nice, but we're not very interested in tracking this because systemd-oomd is now ready (or almost ready).

ACTION: Michel Salim to submit systemd-oomd change proposal

ACTION: Chris to round up @benzea @lennart and maybe @hadess to discuss how to improve user feedback (presumably via desktop notifications)

The challenge is to display application names whenever possible, and avoid displaying confusing process names. But we also don't want to say "Firefox" is killed when only one individual web content process running in its own cgroup gets killed....

Metadata Update from @catanzaro:
- Issue untagged with: meeting

a year ago

My humble opinion is that specifying what got killed will only confuse the user (e.g. when it is just a single tab in Firefox, which the user might not even notice). The message can be generic and still understandable, something like:

Your system is running out of available memory. Applications might crash or behave incorrectly. Please close some applications to free more resources.

This might actually make sense to display before something gets killed, but it is still helpful even if we display it after.

I just had a situation where a notification that doesn't specify the process would have been very useful, provided that it appeared shortly after the process was killed: I was building a project, and the build kept failing without specifying a clear reason. Turns out it was earlyoom, but I really had no clue and had to ask around for help.

Even a generic "A process has been terminated to maintain system responsiveness." would give a hint to go look at logs. But that isn't going to help with what to do next: change some setting but what? file a bug, against what?

Right now I'd expect that there'd be some brief sluggishness before there's an earlyoom triggered termination. Otherwise, it suggests earlyoom is too eager: either zram-swap needs to be bumped or maybe earlyoom swap % for terminate needs to be even less.

@aday do you have the journal entry for this earlyoom termination? I'd like to see what the memory and swap %'s were at the time. And also is it disk or zram based swap, or combo?

Even a generic "A process has been terminated to maintain system responsiveness." would give a hint to go look at logs. But that isn't going to help with what to do next: change some setting but what? file a bug, against what?

In my case, adding -j1 to the build command fixed it. It would be good to have a discussion about what actions to recommend in which circumstances. That could reveal some design options.

Whatever happens, it would be good to link to docs with generic advice:

  • close some apps/tabs/processes, particularly heavy ones
  • try using the system monitor to see what's using memory
  • if it's a VM, increase its memory allocation
  • etc

@aday do you have the journal entry for this earlyoom termination? I'd like to see what the memory and swap %'s were at the time. And also is it disk or zram based swap, or combo?

No logs right now, but I can try to reproduce later. It happened in a VM.

OK I've got an idea of what it looks like in a VM, I don't need a reproducer. It's common for some build/compile tools to ask for more jobs than they really should, by looking only at the number of cpus, and not at all accounting for memory. And then one of the threads gets clobbered by earlyoom.

I'm not certain what the right thing to do is, but somehow it seems like the thing that's spawning multiple jobs needs to either know in advance not to ask so aggressively for resources that just aren't there, or dynamically figure it out and not spawn so many jobs later in the build process. I don't know what coordination or capability that would take. Maybe that's more in the realm of low-memory-monitor? Because cgroup v2 resource control, as I understand it, will constrain IO, memory, and CPU going to a particular process or group of processes. That too might have problems where the process goes "I want 1G memory" or "I want to spin off more threads" and cgroup v2 says "nope" and the process goes "ack!" and just dies? That also isn't great (if true).
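
To make the "know in advance" option concrete, here is a minimal Python sketch of picking a job count from both the CPU count and the memory that is currently available. The function names and the per-job memory budget are assumptions for illustration only; no build tool mentioned in this ticket is known to work this way.

```python
import os


def parse_mem_available_kib(meminfo_text):
    """Return MemAvailable in KiB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def pick_job_count(cpu_count, mem_available_kib, kib_per_job):
    """Cap parallel jobs by both CPU count and available memory.

    kib_per_job is a hypothetical estimate of the peak memory one job
    needs; it has to be tuned per project.
    """
    by_memory = mem_available_kib // kib_per_job
    # Always allow at least one job so the build can make progress.
    return max(1, min(cpu_count, by_memory))


def suggested_jobs(kib_per_job=1536 * 1024, meminfo_path="/proc/meminfo"):
    """Read the live system state and return a suggested -j value."""
    with open(meminfo_path) as f:
        available = parse_mem_available_kib(f.read()) or 0
    return pick_job_count(os.cpu_count() or 1, available, kib_per_job)
```

With 8 CPUs, 3 GiB available and a 1.5 GiB budget per job, this yields 2 jobs rather than 8. It only looks at memory once, at start-up, so it does not cover the "dynamically figure it out" case.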

ACTION: Michel Salim to submit systemd-oomd change proposal

ACTION: Chris to round up @benzea @lennart and maybe @hadess to discuss how to improve user feedback (presumably via desktop notifications)

Adding pending-action for these two.

Metadata Update from @aday:
- Issue tagged with: pending-action

a year ago

Metadata Update from @aday:
- Issue assigned to chrismurphy

a year ago

We have this upstream https://gitlab.gnome.org/GNOME/gnome-settings-daemon/-/issues/302

Sounds like a good place to handle that in GNOME.

@anitazha @hadess @benzea
Any thoughts on what work is needed for improving desktop feedback for OOM situations? I'm thinking a warning notification well enough prior to kill that the user could save their data (for those applications that still aren't doing some kind of regular autosave); and then a notification after a kill has happened.

Do we need a design doc for how this would work? When could we do that? I suspect there's two or three parts: oomd's minimal information to indicate potential candidates for kill in the near term (resource hogs) and the notification system that needs to pick up on the warn and kill events, and the UI/UX for advising the user about those two kinds of events.

Could this pre-kill warning information be picked up by the wayward apps themselves, such that it could trigger autosave? Is there a role for low-memory-monitor in any of this?

This is just one example that's prompting discussion in the Workstation working group:
https://bugzilla.redhat.com/show_bug.cgi?id=1983048

Also likely true, but out of scope for this issue: apps might have a bug that needs fixing; apps should have reliable periodic save so that if there's a crash/kill/power fail, the user doesn't experience data loss.

I think a design doc would help a lot to coordinate on what we're looking for. Speaking from the oomd side I would want to know: the specifics of when to send those signals (in relation to the existing thresholds for pressure and swap), who should receive them, and what those signals should look like (dbus? signals? or something else). In terms of when we can do a design doc... just say when and where :)

low-memory-monitor (https://gitlab.freedesktop.org/hadess/low-memory-monitor/) which should be running by default on Fedora Workstation, sends D-Bus signals, which applications can listen to:
https://developer.gnome.org/gio/2.64/GMemoryMonitor.html
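
For reference, a minimal sketch of what listening to these signals could look like from Python through PyGObject. The level constants are the GMemoryMonitorWarningLevel values from the GIO documentation; the responses in `describe_action` are illustrative assumptions, not something low-memory-monitor prescribes.

```python
# GMemoryMonitorWarningLevel values as documented in GIO.
LEVEL_LOW = 50
LEVEL_MEDIUM = 100
LEVEL_CRITICAL = 255


def describe_action(level):
    """Map a warning level to an example (hypothetical) app response."""
    if level >= LEVEL_CRITICAL:
        return "save user data now; processes may be terminated soon"
    if level >= LEVEL_MEDIUM:
        return "free memory aggressively; quit if not needed"
    if level >= LEVEL_LOW:
        return "drop caches that can be rebuilt"
    return "nothing to do"


def watch_memory_warnings():
    """Block and print a suggested action for each low-memory warning.

    Needs PyGObject and GLib >= 2.64; imported here so the pure
    helper above stays usable without them.
    """
    import gi
    gi.require_version("Gio", "2.0")
    from gi.repository import Gio, GLib

    monitor = Gio.MemoryMonitor.dup_default()

    def on_warning(_monitor, level):
        print(describe_action(int(level)))

    monitor.connect("low-memory-warning", on_warning)
    GLib.MainLoop().run()
```

Calling `watch_memory_warnings()` keeps a main loop running, so a real application would connect the handler to its existing main loop instead.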

I'm not quite sure why we neutered low-memory-monitor's OOM killer and kept systemd-oomd's one alive though, and I don't know whether inkscape should or shouldn't be getting killed.

I don't think that any system notifications would be helpful to end users. The OOM killer needs to choose better targets ultimately, and those targets be better prepared.

low-memory-monitor (https://gitlab.freedesktop.org/hadess/low-memory-monitor/) which should be running by default on Fedora Workstation, sends D-Bus signals, which applications can listen to:
https://developer.gnome.org/gio/2.64/GMemoryMonitor.html

I think the goal here is to help applications that are unlikely to ever use GMemoryMonitor, like Inkscape. Even if Inkscape were to use GMemoryMonitor, in this particular case it surely has some major bug that's unlikely to be fixed by dropping caches.

I'm not quite sure why we neutered low-memory-monitor's OOM killer and kept systemd-oomd's one alive though, and I don't know whether inkscape should or shouldn't be getting killed.

I would flip the question around: what strong reason is there to not use systemd-oomd?

This seems like a communication issue more than anything. We in the WG had planned for the OOM killer to move to systemd before systemd-oomd existed, and were surprised when you wrote low-memory-monitor. We eventually approved it anyway on the condition that it not handle OOM killing since we certainly don't need two different userspace OOM killers.

I don't think that any system notifications would be helpful to end users. The OOM killer needs to choose better targets ultimately, and those targets be better prepared.

I think systemd-oomd is choosing the right target, because Inkscape is leaking a huge amount of memory. The desktop should be responsible for notifying when an application is killed so that applications don't individually have to implement their own logic for telling users that they're about to die. It's just not realistic to expect applications to do that.

low-memory-monitor (https://gitlab.freedesktop.org/hadess/low-memory-monitor/) which should be running by default on Fedora Workstation, sends D-Bus signals, which applications can listen to:
https://developer.gnome.org/gio/2.64/GMemoryMonitor.html

I think the goal here is to help applications that are unlikely to ever use GMemoryMonitor, like Inkscape. Even if Inkscape were to use GMemoryMonitor, in this particular case it surely has some major bug that's unlikely to be fixed by dropping caches.

Why wouldn't it ever use this API, I don't understand?

I'm not quite sure why we neutered low-memory-monitor's OOM killer and kept systemd-oomd's one alive though, and I don't know whether inkscape should or shouldn't be getting killed.

I would flip the question around: what strong reason is there to not use systemd-oomd?

This seems like a communication issue more than anything. We in the WG had planned for the OOM killer to move to systemd before systemd-oomd existed, and were surprised when you wrote low-memory-monitor. We eventually approved it anyway on the condition that it not handle OOM killing since we certainly don't need two different userspace OOM killers.

That's a nice history rewriting there. low-memory-monitor existed for a year before systemd-oomd work was even started. systemd-oomd wasn't targeted at desktop and laptop machines, whereas low-memory-monitor was based on code that was, and settings that were tested.

Feel free to double-check the dates when all this happened, because my discussions of user-space OOM killing in Fedora, along with our move to zram, started when running gdb on my desktop machine gobbled up all the RAM and rendered it unusable for an afternoon.

I don't begrudge having a project I worked on replaced, but systemd-oomd doesn't implement an API that would be useful to applications, so I can't sunset it. I did ask:
https://github.com/systemd/systemd/pull/15206#issuecomment-659310309

I don't think that any system notifications would be helpful to end users. The OOM killer needs to choose better targets ultimately, and those targets be better prepared.

I think systemd-oomd is choosing the right target, because Inkscape is leaking a huge amount of memory. The desktop should be responsible for notifying when an application is killed so that applications don't individually have to implement their own logic for telling users that they're about to die. It's just not realistic to expect applications to do that.

There aren't any notifications when applications crash or get killed because of OOM on mobile OSes like Android or iOS. On the desktop, abrt should be able to pick up that the application got killed because of OOM and show a notification with the ability to send a bug report. You have to wonder how useful those bug reports would be though.

I wonder if the lmm dbus messaging component can be incorporated into oomd?

A big difference between silent OOM kills on mobile vs desktop is mobile platforms have fairly robust APIs for autosaving user data on a regular basis. From the user standpoint the silent kill has no negative impact unless the killed app was in focus. And even then, their data was very likely saved before the kill.

low-memory-monitor (https://gitlab.freedesktop.org/hadess/low-memory-monitor/) which should be running by default on Fedora Workstation, sends D-Bus signals, which applications can listen to:
https://developer.gnome.org/gio/2.64/GMemoryMonitor.html

I think the goal here is to help applications that are unlikely to ever use GMemoryMonitor, like Inkscape. Even if Inkscape were to use GMemoryMonitor, in this particular case it surely has some major bug that's unlikely to be fixed by dropping caches.

Why wouldn't it ever use this API, I don't understand?

I imagine because it is useless on the platform they currently all develop on (Ubuntu).

That said, it's a very bad plan for us to rely on application awareness, so we need to be able to do reasonably correct things without it.

If there's something useful for an application to do behind the scenes (drop caches, suspend VMs, whatever), I think it's reasonable to expect the application to listen to LMM notifications, and to file issues / PRs as necessary.

But I don't think it's reasonable to expect every application to write their own "You are running low on memory, consider closing tabs and windows" dialog, or even a good idea for applications to do this:

  • it provides a really poor user experience if every application pops up their own dialog.
  • Application dialogs can't properly integrate with system tools like Usage to let the user see "hey, Inkscape is using 16GB of memory for my two small images, what's going on here?"

Looking back, I see a suggestion to use gnome-settings-daemon and a request for design details for oomd.

Let's cover those points in a meeting and come up with a plan.

Metadata Update from @aday:
- Issue untagged with: pending-action
- Issue set to the milestone: Fedora 36
- Issue tagged with: meeting

9 months ago

Discussion deferred until the meeting next week.

OK, really, this is purely about giving user feedback. From our perspective, I believe that we are mostly interested in knowing about a systemd user unit failing due to an OOM kill.

Right now, this would already be possible by inspecting the systemd log. I am hoping that systemd could just show the unit going into the failed state with the oom-kill result. That should make it extremely easy to show a notification based on changing unit properties.
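
A rough Python sketch of the property-based check, assuming the killed unit does end up with `Result=oom-kill` as hoped above. The helper names are made up for illustration; a real implementation would more likely subscribe to property changes over D-Bus than poll `systemctl`.

```python
import subprocess


def parse_show_output(text):
    """Parse `systemctl show` Key=Value lines into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def was_oom_killed(props):
    """True if the unit's Result property reports an OOM kill."""
    return props.get("Result") == "oom-kill"


def query_user_unit(unit):
    """Fetch Result and ActiveState for a unit in the user manager."""
    out = subprocess.run(
        ["systemctl", "--user", "show", unit,
         "--property=Result,ActiveState"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_show_output(out)
```

A notifier could then call `was_oom_killed(query_user_unit(name))` for each unit it cares about and decide whether to show anything.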

The WG discussed this at today's meeting. Our consensus is that we'd at least like to have notifications for oom kill events, as a bare minimum. (Of course there's more that could be done.)

Since there's been some recent discussion upstream we'll let that play out and keep an eye on how it's going.

Metadata Update from @aday:
- Issue untagged with: meeting

8 months ago

While the journald log has enough information to figure out what happened, it's not enough to explain to users why web browser tabs just disappear silently. Going back to the top of this issue:

Even a simple notification popup saying "Machine was oom. Something was killed" would go a long way.

What pieces are needed still to make that possible, and what sort of timeline would be needed to do it? i.e. should this remain on the Fedora 36 schedule or does it need to be pushed to Fedora 37?

Upstream discussions continuing, most recent comments (September and November) include the idea of mutter tracking the killed off cgroup back to the (now vanished) app window and putting up a notification.
https://github.com/systemd/systemd/issues/20649#issuecomment-914411222

But it's not going to happen for Fedora 36 so I'll bump this to Future until we have a better idea when it'll happen.

Metadata Update from @chrismurphy:
- Issue set to the milestone: Future Release (was: Fedora 36)

4 months ago

Maybe I don't understand what is going on... But I have the feeling this new feature is a disaster and QA violation.

Look at that case for a laptop with 8GB and a several day session:
- KDE
-- Firefox with a lot of tabs
-- Konsole
--- Emacs
--- and a ninja build stressing the RAM

This configuration should work with this amount of memory from a fresh boot.

But systemd-oomd killed everything except KDE instead of the ninja process.
And I am facing a lot of arbitrary cases like that.

For a workstation, I don't want to worry about cgroups. I need a smart process manager instead.

When you are doing dev, it is frequent to overshoot memory. Kernel should detect that and respond correctly, instead of killing processes arbitrarily.

I have the feeling this implementation was designed for Facebook cloud servers and pushed like that in Fedora desktop.

Of course there is an improvement, I no longer have to press the on-off button because the Linux kernel is just dying. But this feature actually does an evil job.

I think such a feature is dramatic for Linux Desktop. Should we use Windows and docker as a workaround in 2022?

@fabricesalvaire please file a RHBZ against systemd component and put [oomd] in the title; feel free to cc me bugzilla at colorremedies.com; please attach a journal, e.g. journalctl -b -o short-monotonic --no-hostname > journal.log so we can see the oomd kill events.

This configuration should work with this amount of memory from a fresh boot.

Maybe, it depends. You're welcome to systemctl disable --now systemd-oomd and try to reproduce and see if you get better or worse behavior, and mention it in the bug. Before the various resource control related changes, the result was a system that'd increasingly become sluggish, then totally hang, sometimes even including the mouse arrow.

When you are doing dev, it is frequent to overshoot memory.

If you are routinely overcommitting available resources, my suggestion is to add more swap, i.e. either add a swapfile or swap partition, in addition to or instead of swap on zram (the default). oomd takes swap pressure into account before killing processes.
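
To see whether adding swap actually leaves more headroom, one simple check is how full swap is while the workload runs. Below is a minimal Python sketch that computes this from `/proc/meminfo`-style text; the function name is an assumption, and the threshold at which oomd acts is configured separately and not modelled here.

```python
def swap_used_percent(meminfo_text):
    """Return the percentage of swap in use, or None if there is no swap."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # Values are reported in kB; the unit cancels out below.
            fields[name] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return None
    return 100.0 * (total - fields.get("SwapFree", 0)) / total
```

For example, `swap_used_percent(open("/proc/meminfo").read())` can be sampled during a build, before and after adding a swapfile, to compare how close the system gets to running out of swap.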

Kernel should detect that and respond correctly, instead of killing processes arbitrarily.

The kernel's oom killer is concerned with the health of the kernel, not the responsiveness of the system from the user's point of view. It is not a resource control component. It will kill processes as a last resort, as in, when the kernel's stability is threatened, it'll clobber something.

Of course there is an improvement, I no longer have to press the on-off button because the Linux kernel is just dying. But this feature actually does an evil job.

It's an improvement, but it's evil? Let us help you refine the complaint so it's not so overtly contradictory. If you can please file a bug, include what happened and what you expect to have happen instead, we can likely figure out what's going on. And then see if something needs to be tweaked.
