#98 Better interactivity in low-memory situations
Opened 4 months ago by hadess. Modified 11 hours ago

This is a "tracker bug" about replacing disk-based swap with zram-based swap.

Installing and enabling zram by default (already merged):
https://bugzilla.redhat.com/show_bug.cgi?id=1731598
https://pagure.io/fedora-comps/pull-request/391

Disabling disk-based swap by default:
https://bugzilla.redhat.com/show_bug.cgi?id=1731978

Justification is listed in the anaconda bug.


If anyone wants to object to zram by default or removal of disk-based swap, please do so here. Otherwise, Bastien seems to have this under control, so I don't see the need to discuss at a meeting.

I think it would be useful to have a Self-Contained Change for this so that more people are aware of it. Having steps to migrate from the old setup to the new one would be nice.

> I think it would be useful to have a Self-Contained Change for this so that more people are aware of it. Having steps to migrate from the old setup to the new one would be nice.

I started doing that, but there were too many unknowns for that page to be useful and the deadline is today, and I won't have the answers by that time.

Adding meeting keyword due to some initial skepticism from Anaconda developers. We might need to have a formal vote here.

> I started doing that, but there were too many unknowns for that page to be useful and the deadline is today, and I won't have the answers by that time.

IMO it would actually be perfectly fine to do this as an F32-timeframe change proposal instead of for F31. That way users have plenty of time to report possible fallout. This is a very longstanding problem, so fixing it ASAP doesn't seem urgent.

Also, I'd say this would be a systemwide change.

Metadata Update from @catanzaro:
- Issue tagged with: meeting

4 months ago

OK so all of this work is fine by me, but at the same time it totally steps on all the work and coordination I just wrapped up in issue 56. I was expecting that to happen at some point, just not this quickly.

This change probably means swap on zram gets activated in early boot on lives
https://src.fedoraproject.org/rpms/fedora-release/pull-request/87#request_diff

and that will conflict with this change in anaconda, which uses a different service file to set up and enable swap on zram when anaconda is launched:
https://github.com/rhinstaller/anaconda/pull/2039

There's no error handling in the anaconda code. I don't know for sure what happens when PR 87 lands, but I suspect on LiveOS we'll see two swap on zram devices, the one with higher priority will get used and the other will be a benign extra.

But to clean it up, either PR 87 needs to be reverted until this issue gets better organized, or an alternative is this:
https://src.fedoraproject.org/fork/chrismurphy/rpms/zram/c/8dad080d9285db3076fdd502efb83081545520d8?branch=devel

@hadess our meeting is tomorrow at 9:00 AM EDT if you want to participate (in #fedora-meeting-2)

> @hadess our meeting is tomorrow at 9:00 AM EDT

Er, I suppose that's already "today" in France.

> This is a "tracker bug" about replacing disk-based swap with zram-based swap.
> Installing and enabling zram by default (already merged):
> https://bugzilla.redhat.com/show_bug.cgi?id=1731598

This was reopened for reversal as it conflicts with:
https://github.com/rhinstaller/anaconda/pull/2039

> https://pagure.io/fedora-comps/pull-request/391

This can stay, it just installs a file on disk.

> Disabling disk-based swap by default:
> https://bugzilla.redhat.com/show_bug.cgi?id=1731978
> Justification is listed in the anaconda bug.

This is another problem.

I'll close this as I'm not interested in working on this for Fedora 32. It was made abundantly clear that I should have coordinated with an effort I knew nothing about and that my work was "[not] even done".

There are bugs already opened for the individual changes though, so coordination should be straightforward.

Metadata Update from @hadess:
- Issue close_status updated to: Won't fix
- Issue status updated to: Closed (was: Open)

4 months ago

I'm still interested in tracking this here even if you're not. :)

Metadata Update from @catanzaro:
- Issue status updated to: Open (was: Closed)

4 months ago

> I'm still interested in tracking this here even if you're not. :)

That's fine by me, as long as ownership is clear. Thanks.

Metadata Update from @catanzaro:
- Issue assigned to chrismurphy

4 months ago

I very much appreciate @hadess's contributions to a generic swap on zram solution so far. A developer once reminded me that Fedora is supposed to be fun. The way I expressed surprise at all of his work actually triggering changes is contrary to having fun, and for that I apologize.

There are actually quite a few things that get touched by this feature, and I think it should go through the feature process for a future Fedora. I'm happy to negotiate the bureaucracy aspect of this and make sure there's proper coordination and testing. But I can barely bash my way out of a hat, so there is no way I'm going to magically get the systemd zram-generator working on my own.
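For context, the systemd zram-generator mentioned here is driven by a small INI-style config file. A hypothetical sketch follows; the option names and values are assumptions (they have varied between versions), so check the project's README before relying on them:

```ini
# /etc/systemd/zram-generator.conf (illustrative only)
[zram0]
zram-fraction = 0.5      # size the zram device at half of RAM (assumed option)
max-zram-size = 4096     # cap it at 4 GiB regardless of RAM (assumed option)
```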

I think the first step is to establish whether the systemd zram-generator project is going to be the accepted generic swap on zram implementation by Anaconda, Workstation, and IoT folks. And concurrently sort out who, or at least that someone, will respond when it breaks; when it breaks, we'll need a pretty fast fix. Right now it's broken. I don't think there's a viable feature if there isn't agreement on a) a generic solution, b) someone to maintain it, and c) for it to actually be working. And right now none of those three things is known for sure.

Anaconda folks aren't in favor of switching to merely a different swap on zram solution than what they have. I think it's reasonable to want something robust that is sufficiently upstream that it can and will be used by other distributions. Along those lines, something I need to follow up on: Arch has recently discussed moving to swap on zram by default too.

Concurrently, the subject of this issue suggests a problem that needs additional investigation. I'll make that point in this bug
https://bugzilla.redhat.com/show_bug.cgi?id=1731978

Does that sound reasonable? And, at least for initial steps, is there anything I've left out?

@hadess or anyone else experiencing this problem: I'd like as clear and simple a set of reproduction steps as possible for this statement:

> Unfortunately, especially on interactive systems such as the Workstation variants, hitting the disk-based swap under low-memory conditions renders the machine completely unusable. The disk-based swap is not fast enough to free up physical memory to keep the machine's interactivity.
> https://bugzilla.redhat.com/show_bug.cgi?id=1731978

I have experienced cases like that myself of course, but I do not have a consistent reproducer and want to make certain I'm seeing the same thing everyone else is talking about with a relevant Workstation specific use case example. Post it here or in a BZ, whichever is appropriate. Thanks.

BTW I noticed that Anaconda's automatic partitioning defaults to creating a swap partition equal to the amount of RAM, which can result in ludicrous swap sizes on systems with large amounts of RAM. So that's something else to keep in mind.

> I have experienced cases like that myself of course, but I do not have a consistent reproducer and want to make certain I'm seeing the same thing everyone else is talking about with a relevant Workstation specific use case example. Post it here or in a BZ, whichever is appropriate. Thanks.

http://trac.webkit.org/wiki/BuildingGtk wouldn't be a horrible testcase:

$ cmake -DPORT=GTK -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja
$ ninja

If that's not enough to hang your computer, then something absurd like ninja -j64 might do the trick.

@catanzaro:

> BTW I noticed that Anaconda's automatic partitioning defaults to creating a swap partition equal to the amount of RAM, which can result in ludicrous swap sizes...

I just mentioned that in bug 1731978. In ancient times swap at 1x RAM was 4MB. And compared to drive performance, 16-32GB is just goofycakes. But it has to be at least 1x RAM if your view is that hibernation is a plausible use case to support; really, hibernation requires space for 100% of the total used memory, which is RAM+swap, so plausibly hibernation requires 2x RAM. There's a reason swap and hibernation files are decoupled on macOS and Windows, and why Microsoft has effectively abandoned this style of hibernation. But as far as I know, there's no support on Linux for modernizing it and dealing with all the firmware bugs.

> something absurd like ninja -j64 might do the trick

Isn't that pathological? I mean, should I really be considering things that are not realistically a good idea in normal usage? Anyway the test system I have has 4 real cores, 8 with hyperthreading, and only 8GB RAM, so I think this should be very straightforward to trigger with 8GB of swap on SSD.

> Isn't that pathological? I mean, should I really be considering things that are not realistically a good idea in normal usage? Anyway the test system I have has 4 real cores, 8 with hyperthreading, and only 8GB RAM, so I think this should be very straightforward to trigger with 8GB of swap on SSD.

I think ninja with no -j args should default to -j8 or -j10; it's either nproc or nproc+2, something like that. I'm fairly confident 8GB is not enough, so it should hang without any -j passed. But if not, you can play with -j and see what it takes.

Metadata Update from @catanzaro:
- Issue untagged with: meeting

3 months ago

https://lkml.org/lkml/2019/8/4/15 is relevant, although it shows the limits of what we hope to achieve here: we're hoping that disabling swap will be our solution, but in this example swap is already disabled and everything goes wrong anyway.

The kernel developers have posted a patch. Who knows, maybe they will finally magically solve this for us after all these years....

> https://lkml.org/lkml/2019/8/4/15 is relevant, although it shows the limits of what we hope to achieve here: we're hoping that disabling swap will be our solution, but in this example swap is already disabled and everything goes wrong anyway.
> The kernel developers have posted a patch. Who knows, maybe they will finally magically solve this for us after all these years....

Even if it works around the worst behavioural problems, it won't fix the fact that hitting the disk swap is bad, and that we should prefer RAM compression to disk based swap.

Perhaps the most problematic processes that instigate this problem are good candidates for being run in a container with a cgroup memory request and limit.

Whether it's an application or kernel responsibility, or some hybrid approach, I think we can agree that it's not OK for an application to just hog all system resources.

OK I've reproduced this:
https://pagure.io/fedora-workstation/issue/98#comment-585500

Test system is a Macbook Pro, 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation.

Summary:
Whether I use 8GiB swap on an SSD plain partition, or 8GiB swap on ZRAM (a 1:1 ratio), the system eventually hangs; not even the mouse pointer will move. I gave up after 30 minutes and forced power off.

The most central problem on this example build and test system is ninja's default number of jobs. This is autocalculated somehow. Whether N jobs is derived from package configuration, cmake, or ninja itself, the default used for this package on this system is guaranteed to fail the build, and always results in an unresponsive system that most users would consider totally lost.

Does swap on ZRAM help? Performance-wise, no. The problem happens sooner, and is more abrupt, than with swap on an SSD plain partition. It does help reduce wear on the SSD: a successful build will write 15GiB to disk, and nearly another 15GiB in swap writes. But in both cases the default build fails, so which one fails better or worse is irrelevant for our purposes.

What did work? ninja -j 4 resulted in a responsive system the entire time, except for a few brief periods lasting less than 15s near the end. I was able to use Firefox for browsing with 8-12 tabs, and concurrently running youtube video without any stuttering. And the build did finish. The configuration for this was 8GiB swap on SSD plain partition.

Based on this limited testing, I can't recommend only moving to swap on ZRAM. First, we need better build application defaults. It's not reasonable for developers to have to know these things; defaults should not cause the system to blow up and the build to fail 100% of the time on a reasonable, even if limited, configuration where manual intervention allows it to succeed and gives exactly the user experience we want.

Extended cut

What's going on with swap on ZRAM? With a 1:1 ratio on the test system, /dev/zram0 is 8GiB, the same as available RAM. But ZRAM device usage is allocated dynamically. If all of swap gets consumed, and I have screenshots showing it was, at a best-case 2:1 compression ratio this uses 4GiB of RAM. Pilfering that much memory away from the build basically wedged the entire build process, comprised of at minimum 20 processes (more on that in a bit). It's just untenable: the system flat out needs more memory to use ninja's defaults. With SSD swap the system actually wasn't as memory starved, even though swapping to SSD is less efficient than swapping in memory. The memory starvation is why the swap on ZRAM case failed sooner and more abruptly, with no responsiveness even via a remote ssh connection. In the swap on SSD case, while the GUI became totally unresponsive in the same way, there was partial responsiveness via a remote ssh connection, but it wasn't good enough to regain control. The OOM killer was never invoked in either case.

Interestingly, repeating the test with a 1/4-sized swap on ZRAM, the oom killer was invoked just before the midway point of the build: the build fails, all build processes quit, and complete system recovery happens relatively quickly (1-2 minutes). This data point suggests it's possible to overcommit /dev/zram and cause worse memory starvation. And it just underscores that the real problem is the application asking for too many resources.

What is ninja doing by default? When I run ninja --help it reports back:
-j N run N jobs in parallel (0 means infinity) [default=10 on this system]

If I reboot with nr_cpus=4, and rerun ninja --help it reports back:
-j N run N jobs in parallel (0 means infinity) [default=6 on this system]

I'm gonna guess its metric is to set N jobs to nrcpus + 2. Each job actually means a minimum of two processes, so -j 10 translates to ten c++ and ten gcc/cc1plus processes running concurrently. These defaults strike me as intended for a dedicated headless build system with a ton of resources. They're wildly inappropriate for a developer's desktop or laptop running Fedora Workstation while doing other work at the same time as the build. So I think our true task is how to get better defaults: either convincing upstreams that their build defaults need to be for individual user machines, and burdening build systems with custom build settings, or we're going to have to do the work of containerizing build applications so they can't (basically) sabotage users' systems by using inappropriate defaults.
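That guess is easy to sanity-check against the numbers reported above. A quick sketch of the assumed heuristic (this is my reading of the observed behavior, not ninja's actual code):

```shell
# Assumed heuristic only: default N jobs = logical CPUs + 2.
nproc=8                          # 4 cores + hyperthreading on the test machine
default_jobs=$(( nproc + 2 ))
echo "$default_jobs"             # 10, matching ninja's reported [default=10]
```

With nr_cpus=4 the same formula gives 6, matching the second `ninja --help` report.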

A secondary task, which can be done concurrently with the above: more testing is necessary to figure out the optimal ZRAM device size. I think 1:1 is too aggressive; perhaps it can be 1/3 or 1/4 the size of RAM. But at this point in my testing I can't recommend only moving to swap on ZRAM; it mostly just means rearranging the deck chairs.

Is this worth broader discussion on devel@? Maybe bring in more subject matter experts about what the proper relative behaviors should be for build defaults, the kernel, swap behavior, etc? And maybe encourage more testing/experimentation in this area?

Hm wow, I wasn't expecting zram to make responsiveness worse. That's... interesting.

> What is ninja doing by default? When I run ninja --help it reports back:
> -j N run N jobs in parallel (0 means infinity) [default=10 on this system]
> If I reboot with nr_cpus=4, and rerun ninja --help it reports back:
> -j N run N jobs in parallel (0 means infinity) [default=6 on this system]
> I'm gonna guess its metric is to set N jobs to nrcpus + 2. Each job actually means a minimum of two processes, so -j 10 translates to ten c++, and ten gcc/cc1plus processes running concurrently. These defaults strike me as intended for a dedicated headless build system with a ton of resources. They're wildly inappropriate for a developer's desktop or laptop running Fedora Workstation while doing other work at the same time as the build.

nproc + 2 is a little aggressive, but not unreasonable for C projects or small C++ projects. But clearly it's way too much for WebKit. (It also works extremely well for my high-end system, with ridiculous specs that we should not optimize for. Using a lower value would make my builds drastically slower.) Ideally make and ninja would learn to look at system memory pressure and decide whether to launch a new build process based on that, but it's probably too much to expect. :/ Trying to trigger OOM earlier seems like a better bet.
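The "look at system memory pressure" idea could in principle key off PSI. A toy sketch follows, parsing a hardcoded sample line in the /proc/pressure/memory format; the 10.0 threshold is an arbitrary assumption, and a real tool would read the live file instead:

```shell
# Toy sketch: gate launching another build job on the avg10 memory-pressure
# figure. The PSI line below is hardcoded sample data in the format the
# kernel exposes in /proc/pressure/memory.
psi_line='some avg10=0.00 avg60=1.23 avg300=0.80 total=123456'
avg10=${psi_line#*avg10=}        # strip everything up to and including "avg10="
avg10=${avg10%% *}               # keep just the number
if awk -v a="$avg10" 'BEGIN{exit !(a < 10.0)}'; then
  echo "ok to launch another job"
else
  echo "hold: memory pressure too high"
fi
```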

> Is this worth broader discussion on devel@? Maybe bring in more subject matter experts about what the proper relative behaviors should be for build defaults, the kernel, swap behavior, etc? And maybe encourage more testing/experimentation in this area?

Yes.

> Hm wow, I wasn't expecting zram to make responsiveness worse. That's... interesting.

While monitoring top and iotop as this happens, it looked like a case of Ouroboros, the snake eating itself. It's not the fault of ZRAM per se; it just has a different and faster failure mode once the system was set up to fail from the outset. Had the resource demand not reached the critical level, ZRAM would have relieved the pressure better than swap on SSD.

Thanks for the detailed testing, Chris!

As other people have pointed out, swapping anonymous pages out is only one out of two tools that the kernel has to reclaim memory - the other tool is reclaiming pages from the page cache - including things like mapped-in program code.

So adjusting the way we swap by itself is unlikely to make a huge difference. We really need the kernel to be making a decision in some fashion "this process / these pages are less important" - "this process / these pages are more important".

The main tool that seems to be available in this area is the cgroups memory controller. (cgroups v2 has more abilities in this area, like per-cgroup pressure stall information - https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html) I don't think there's any ability to "prioritize" things, but there are various controls over minimum/maximum amounts of memory used. You can even set "swappiness" per-cgroup.

So, a vague idea would be to try to arrange things so that gnome-shell, gnome-terminal, dbus-broker, and other tasks critical for maintaining interactive performance are in one part of the cgroup hierarchy, applications and terminal-spawned processes are in another part, and a minimum amount of memory is protected for the critical processes.
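The protection part of that idea maps onto cgroup v2's memory.low knob. A minimal sketch, using a scratch directory to stand in for a real /sys/fs/cgroup mount (the slice name and the 1G figure are illustrative; on a real system systemd manages these files):

```shell
# Simulated cgroup v2 layout; on a real Fedora system these attribute files
# live under /sys/fs/cgroup and are written by systemd, not by hand.
cg=./fake-cgroup/session-critical.slice
mkdir -p "$cg"
echo "1G" > "$cg/memory.low"   # best-effort memory protection for gnome-shell etc.
cat "$cg/memory.low"           # prints 1G
```

memory.low is a soft guarantee: the kernel avoids reclaiming from the cgroup while it is under the threshold, rather than hard-reserving the memory.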

You'd really like it so that in a low-memory situation Firefox staying interactive was prioritized over ninja spawning 10 g++'s, but that seems harder: you don't want to say that on an 8GB system, 2GB are for the system, 3GB are for applications, and 3GB are for non-interactive processes!

Definitely bringing this to a wider audience would be a good idea - there may be more tools that we aren't aware of.

> I think 1:1 is too aggressive. Perhaps it can be 1/3 or 1/4 the size of RAM. But at this point in my testing I can't recommend only moving to swap on ZRAM, it mostly just means rearranging the deck chairs.

1:1 is certainly not aggressive. I use 2:1 and am very pleased.
Chrome OS uses 3:2 on its Chromebooks:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/refs/heads/master/chromeos-base/swap-init/files/init/swap.sh#205
Sample system log, shows 2GB RAM, zram swap 2.74GB.
https://bugs.chromium.org/p/chromium/issues/detail?id=970989
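The ratio arithmetic is easy to check; a quick sketch (MiB values; the 3:2 multiplier comes from the swap.sh script linked above, and the 2 GiB figure is just an example):

```shell
# Chrome OS-style zram sizing: zram swap = RAM * 3/2.
ram_mib=2048                      # e.g. a 2 GiB Chromebook
zram_mib=$(( ram_mib * 3 / 2 ))
echo "$zram_mib"                  # 3072 MiB of zram swap
```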

> Based on this limited testing, I can't recommend only moving to swap on ZRAM. First, we need better build application defaults. It's not reasonable for developers to have to know these things, defaults should not cause the system to blow up and the build to fail 100% of the time on a reasonable, even if limited configuration, where manual intervention allows it to succeed and have exactly the user experience we want.

WebKit has always been a pain to compile on any system; I'm not sure we need to base ourselves on this one workload test.

Of course, using a zram-backed swap isn't enough, nor is removing disk-based swap, but each makes the heavy workload transitions slightly smoother and reduces wear-and-tear on the disks. Using cgroups or OOM scores is a parallel problem. See:
https://gitlab.freedesktop.org/hadess/low-memory-monitor
https://gitlab.gnome.org/GNOME/glib/merge_requests/1005

Lennart suggested something interesting here, in the second portion of his threaded response:
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/MML5MAKBFNEXBT67TCOVUWGFNOUDYUUP/

The gist is: perhaps multiple swaps, some of which are not normally active and are activated upon hibernation. I will ask Lennart some questions about this, and also what happens if e.g. there are two swaps, ZRAM and disk, and the user tries to hibernate.

Also, there is zswap, which is a totally different thing. It uses a definable memory pool for swap and spills over to a conventional disk-based swap, all of which is always compressed. It really fits this use case nicely; the gotcha is that it's flagged experimental in the kernel documentation. I just pinged someone upstream to see if that's really the current state. (I'm assuming something marked experimental is not something we want to use in default installations.)
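For reference, zswap is typically enabled via kernel command-line parameters rather than a service. An illustrative /etc/default/grub fragment follows; zswap.enabled, zswap.compressor and zswap.max_pool_percent are real module parameters, but the chosen values here are arbitrary assumptions to experiment with, not recommendations:

```shell
# Illustrative config fragment, not tested here; regenerate grub.cfg after editing.
GRUB_CMDLINE_LINUX="... zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=20"
```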

@hadess I agree this is a complicated issue (the general and specific problems, as well as this ticket). It's sorta like a dam with holes spouting water and wanting to fill them all while deciding which ones we can work on and when and in what order, etc.

The webkit example is badass because of its simplicity and clarity as an unprivileged task fork bombing the system. I don't mean to indicate that specific case must be solved in this ticket. But I'm also skeptical of desktop specific bandaids that paper over lower level deficiencies.

And I like the low-memory-monitor concept insofar as it can provide feedback to the user and give them some control over a runaway train situation. But I also strongly take the position a non-sysadmin user can never be held responsible for an unprivileged task taking down the system. Perhaps I want my cake and to eat it too, and so be it.

Another tangent issue when the system gets into these heavy swap states. This one is a GPU hang I reported upstream and there's some relevant back and forth.
https://bugs.freedesktop.org/show_bug.cgi?id=111512

https://facebookmicrosites.github.io/oomd/docs/overview
https://news.ycombinator.com/item?id=17590858

I love this line "flexibility where each workload can have custom protection rules" in that it suggests it could be adapted from its server roots to a workstation use case.

oomd will be packaged, https://github.com/facebookincubator/oomd/issues/90

By the way, earlyoom and nohang are ready for desktop right now.

https://apps.fedoraproject.org/packages/earlyoom
https://github.com/rfjakob/earlyoom

https://copr.fedorainfracloud.org/coprs/atim/nohang
https://github.com/hakavlad/nohang

nohang also supports PSI and low memory warnings, you can test it right now.
Demo: nohang v0.1 prevents Out Of Memory with GUI notifications: https://youtu.be/ChTNu9m7uMU.

I'm the nohang author and an earlyoom contributor. I'd like to see your questions about OOM prevention in userspace.

I think earlyoom is stable and tiny. PSI support is under discussion: https://github.com/rfjakob/earlyoom/issues/100. I would like to hear arguments against enabling it by default.

oomd is not good for desktop:

> seems like oomd kills all processes in memhog.scope, not only fattest process, it is not good for desktop

> Yeah that's by design. The smallest granularity oomd will operate on is a cgroup. Doing per-process is kind of a mess, especially when multiple teams own different services on a system.

https://github.com/facebookincubator/oomd/issues/61#issuecomment-520641352

> There's this bug which has been bugging many people for many years already and which is reproducible in less than a few minutes under the latest and greatest kernel, 5.2.6. All the kernel parameters are set to defaults.
>
> Steps to reproduce:
>
> 1) Boot with mem=4G
> 2) Disable swap to make everything faster (sudo swapoff -a)
> 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> 4) Start opening tabs in either of them and watch your free RAM decrease
>
> Once you hit a situation when opening a new tab requires more RAM than is currently available, the system will stall hard.

This bug cannot be reproduced if earlyoom or nohang is enabled, see demo above.

> You'd really like it so that in a low memory situation Firefox staying interactive was prioritized over ninja spawning 10 g++'s - but that seems harder - you don't want to say that on a 8GB system, 2GB are for the system, 3GB are for applications, and 3GB are for non-interactive processes!

@otaylor what about nice -n 19 ionice -c 3 ./foo? What about systemd-run --user -p TasksMax=10 -p IOWeight=1 -p CPUWeight=1 -t ./foo?

@hakavlad In the web browser reproduce case, what happens with earlyoom or nohang? Does it kill off just a child process of the browser killing off a tab? Is it randomly chosen? Would it kill off the whole browser? Could it kill off some other process entirely or does it try to kill the parent process whose combined resource consumption is the greatest?

In the ninja+webkitgtk example, killing off ninja itself would be good, whereas the kernel's existing oom-killer either plays with it like a cat does a mouse and maybe kills it tomorrow, or abruptly kills just one child process, leaving the others to continue on for quite some time doing unnecessary work now that the build has failed. Conversely, in the browser example, I would be OK with a tab getting killed off but not my entire browser session.

And then what if I try to reproduce the browser and ninja cases at the same time? I'd like the browser tab (child process) consuming the most CPU+memory to be treated as the most expendable. But it really is a gray area: which comes next? A secondary browser tab, or the entire compile?

I definitely agree that unprivileged processes need resource limitations put on them. It would be very cool if these limitations could be retroactively applied, i.e. changing TasksMax and IO/CPU weighting on the fly.

These also sorta relate:
https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/4#note_217565
https://twitter.com/ramcq/status/1150489660688424962

In fact I've run into that same bt gdb debugging Firefox io hell @hadess has. I suppose in some ideal world, my (interactive) actions inform my system "I'm working, give me priority", and if that means gdb is effectively suspended, fine. And then when some trigger happens that indicates I've walked away (like the display going to powersave), it lets gdb hog all the system's resources, so long as sshd doesn't face-plant.

OK I did one test with earlyoom and did a quick and dirty write up on it in the devel@ thread.
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/TCIVNXEHWENIWQT35XX6PRC7ZAYTRDGQ/

Sadly, I failed to specifically mention the workload, but it's the same as before: Fedora 31 Workstation, Firefox ~8-10 tabs, Terminal building webkitgtk with the default ninja command

> In the web browser reproduce case, what happens with earlyoom or nohang? Does it kill off just a child process of the browser killing off a tab? Is it randomly chosen? Would it kill off the whole browser? Could it kill off some other process entirely or does it try to kill the parent process whose combined resource consumption is the greatest?

earlyoom and nohang by default select the victim in the same way as the default OOM killer does: the victim is the process with the highest oom_score. The default behavior can be modified.

> In the ninja+webkitgtk example, killing off ninja itself would be good

See https://github.com/rfjakob/earlyoom#preferred-processes

> Preferred Processes
>
> The command-line flag --prefer specifies processes to prefer killing; likewise, --avoid specifies processes to avoid killing. The list of processes is specified by a regex expression. For instance, to avoid having foo and bar be killed:
>
> earlyoom --avoid '^(foo|bar)$'

For earlyoom you can set:

earlyoom --prefer '^(ninja)$'

It adds +300 to ninja's badness. The +300 value is hardcoded in earlyoom.

For nohang it may be more flexible. For example,

@BADNESS_ADJ_RE_NAME 500 /// ^ninja$

or

@BADNESS_ADJ_RE_REALPATH 900 /// ^/usr/bin/ninja$

Prefer Chromium tabs (they already have oom_score_adj=300 by default):

@BADNESS_ADJ_RE_CMDLINE 300 /// --type=renderer

Prefer firefox children:

@BADNESS_ADJ_RE_CMDLINE 300 /// -childID

Other way:

@BADNESS_ADJ_RE_NAME 300 /// ^(Web Content|WebExtensions)$

Avoid killing processes in system.slice:

@BADNESS_ADJ_RE_CGROUP_V2   -200   ///   ^/system\.slice/

Other options - https://github.com/hakavlad/nohang/blob/master/nohang.conf

> It would be very cool if these limitations can be retroactively applied, i.e. to change TasksMax and IO/CPU

Just change the values (pids.max, io.max, io.weight, cpu.weight, cpu.max) in cgroup, you can do it on-the-fly.
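Since cgroup controls are plain files, the on-the-fly part can be sketched like this, again using a scratch directory to stand in for the real /sys/fs/cgroup tree (the scope name and values are illustrative):

```shell
# Simulated cgroup v2 directory for a running scope; writing these attribute
# files is the same operation systemd performs for TasksMax/CPUWeight changes.
cg=./fake-cgroup/run-ninja.scope
mkdir -p "$cg"
echo 10 > "$cg/pids.max"       # cap concurrent tasks (like TasksMax=10)
echo 50 > "$cg/cpu.weight"     # halve the CPU share (the default weight is 100)
cat "$cg/pids.max" "$cg/cpu.weight"
```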

> leaving the others to continue on for quite some time doing unnecessary work now that the build has failed

Killing a cgroup as a single unit can help you. I plan to implement it in nohang.

By the way, if you always run ninja via systemd-run --user ninja, you can enjoy cgroup-killing right now:

You can customize corrective action in nohang:

@SOFT_ACTION_RE_NAME  ^ninja$  ///  systemctl kill -s SIGKILL $SERVICE

If the victim's name is ninja, the next command will be the corrective action: systemctl kill -s SIGKILL $SERVICE. $SERVICE will be replaced by ninja's unit name, and ninja's cgroup will be killed as a single unit. Right now it works only with the legacy/mixed cgroup hierarchy; I will fix it to support the unified cgroup hierarchy.

P.S. Seems like systemctl kill doesn't work with services in user.slice. It should work with the following:

systemd-run --uid=1000 --gid=1000 -t ninja

> Yes, the oom happens sooner

@chrismurphy Low memory != OOM. OOM did not happen in your test case; OOM was prevented by earlyoom when SwapFree was at 10%:

[ 5049.976811] fmac.local earlyoom[3470]: mem avail:   258 of  7837 MiB ( 3 %), swap free: 2469 of 7836 MiB (31 %)
[ 5096.265936] fmac.local earlyoom[3470]: mem avail:   216 of  7837 MiB ( 2 %), swap free:  860 of 7836 MiB (10 %)
[ 5096.265936] fmac.local earlyoom[3470]: low memory! at or below SIGTERM limits: mem 10 %, swap 10 %
[ 5097.266702] fmac.local earlyoom[3470]: sending SIGTERM to process 6775 "cc1plus": badness 99, VmRSS 251 MiB
[ 5101.141163] fmac.local earlyoom[3470]: process exited after 4.7 seconds
[ 5115.243511] fmac.local earlyoom[3470]: mem avail:  1546 of  7837 MiB (19 %), swap free: 4259 of 7836 MiB (54 %)

Please use exact wording. In this case there was low memory handling, not OOM killing.

In 2019, we do not have a good zram manager. We do not have a zram manager that can handle errors, use many zram features such as backing_dev, offer new compression algorithms such as zstd, and offer many options for fine-tuning. I am surprised by this fact.

How do you feel about low memory GUI notifications? For example, if MemAvailable and SwapFree both fall below 20%, the user starts receiving periodic notifications, just as they do when the battery is low. This would let the user stop in time (for example, stop opening new browser tabs) and avoid data loss. Should this behavior be enabled on the desktop by default? Please give arguments for your position.
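The proposed check could be prototyped by reading /proc/meminfo. A sketch under the 20% threshold suggested above (the notification delivery itself, e.g. via a desktop notification service, is omitted; the helper names are mine):

```python
def meminfo_percentages(meminfo_text):
    """Parse /proc/meminfo-style text into (MemAvailable %, SwapFree %)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key.strip()] = int(rest.split()[0])  # values are in kB
    mem_pct = fields["MemAvailable"] * 100 // fields["MemTotal"]
    swap_total = fields.get("SwapTotal", 0)
    swap_pct = fields.get("SwapFree", 0) * 100 // swap_total if swap_total else 100
    return mem_pct, swap_pct

def should_notify(meminfo_text, threshold=20):
    """True when both MemAvailable and SwapFree are below the threshold."""
    mem_pct, swap_pct = meminfo_percentages(meminfo_text)
    return mem_pct < threshold and swap_pct < threshold
```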

How do you feel about low memory GUI notifications?

I don't think we want this. Various reasons:

  1. Our goal ought to be to reduce the amount of work a user has to do to maintain their system, not the opposite
  2. It assumes knowledge/understanding of memory and how it works - many users don't have this
  3. It doesn't actually solve the problem - we will still inevitably end up with situations where the available memory is exceeded, despite the notifications, and we will still need to handle this somehow
  4. There will be situations where memory usage increases while someone isn't actively using the machine

In 2019, we do not have a good zram manager.

This is why I'm sceptical of swap on ZRAM. There are ways to make sure there's a low water mark for RAM, but none of the service implementations or the upstream generator do this. And it's why I'm slightly more in favor of zswap as a swap thrashing moderator. But in the runaway high memory pressure examples, that is inadequate. It's just a moderator.

How do you feel about low memory GUI notifications?

With my QA tester hat on, I like it. As a user, the first question that comes to mind is "What year is this?" I don't really see memory management as user domain. Sounds like the operating system is confused and falling over, and what am I supposed to do about that?

And if I merge the two perspectives together, I come up with: an unprivileged process just preempted the GUI, that is a fail whale.

RPMs for the just released oomd v0.2.0 are available on this COPR repository:
https://copr.fedorainfracloud.org/coprs/filbranden/oomd/

I think it would be badass if we had a way to get Fedora users to opt-in to experiments, and then randomly give them things like nohang and earlyoom and oomd and low-memory-monitor. No documentation, no warning, nothing. They just get one of them. As if it were their default installation. And see what blows up, or not, what complaints they have, or not. If they explicitly install something, instead of random, they end up with bias that actually pollutes the data. Just a thought.

report back on how they work for you!

  1. It kills the whole session instead of one process.
  2. It uses 7% CPU on my VM.
  3. I turned off swap and started opening browser tabs, and in the end the system froze.

https://imgur.com/a/FSOtqPm

Summary: oomd is not for the desktop. Don't recommend using it on the desktop; use earlyoom or nohang instead. - https://github.com/facebookincubator/oomd/issues/90#issuecomment-530836227

@chrismurphy

kills just one child process, leaving the others to continue on for quite some time doing unnecessary work now that the build has failed

I can add an option in nohang to kill bash (or any other named) sessions (by SID or NSsid) as a single unit, if the victim is part of that group. For example, if the session leader's name is bash, the entire session will be killed.

This is why I'm sceptical of swap on ZRAM. There are ways to make sure there's a low water mark for RAM, but none of the service implementations or the upstream generator do this. And it's why I'm slightly more in favor of zswap as a swap thrashing moderator. But in the runaway high memory pressure examples, that is inadequate. It's just a moderator.

Have you done performance tests? According to my benchmarks, zswap is much slower than plain swap. It is not worth special attention at present (it was faster around kernel 3.11, but much has changed since then).
https://lore.kernel.org/lkml/CALZtONCO5BJJw-RjrhEeap95nZy0h9GBqYgx2apVB62ZemY54g@mail.gmail.com/T/#m9580de3ad114a1699b096570b14131155600aead

zram is in a league of its own. The best choice, without a doubt. I love this solution.

@latalante

zswap is much slower than plain swap

It depends a lot on the settings. With a zswap pool size of 90% and z3fold/zsmalloc, I got performance similar to zram. Of course, if you use the default 20% pool size and zbud, you will get bad performance.
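For anyone who wants to try comparable settings, zswap's module parameters can be set on the kernel command line or at runtime through sysfs. A sketch (parameter names per the kernel's zswap documentation; whether z3fold is built into your kernel is an assumption you should verify first):

```shell
# Kernel command line fragment (e.g. appended in /etc/default/grub):
#   zswap.enabled=1 zswap.max_pool_percent=90 zswap.zpool=z3fold

# Or toggled at runtime via sysfs (as root):
echo 1      > /sys/module/zswap/parameters/enabled
echo 90     > /sys/module/zswap/parameters/max_pool_percent
echo z3fold > /sys/module/zswap/parameters/zpool
```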

@latalante

Have you done performance tests? According to my benchmarks, zswap is much slower than plain swap.

My testing does not indicate a performance difference between swap on NVMe and swap on SSD when the memory pool is the same size. I have no HDD systems to test. The URL you provide involves a hyperspecific workload in a VM, no details of that setup are provided, and responses to the proposal mention that the results don't adequately take the general case into account. If you're finding that enabling zswap is slower than swap alone, that's unquestionably a bug, and it requires a bug report showing the system details and steps to reproduce.

In the worst-case scenario, I was consistently able to get the test system to totally wedge itself with swap on zram; essentially, "swap thrashing" becomes CPU bound rather than IO bound, but the system was still lost (omitting any of earlyoom, nohang, oomd). In the incidental swap usage cases, whatever differences exist between zswap and swap on zram are not noticeable to me. The time frame is roughly 18 months for zswap and 3 months for swap on zram.

@hakavlad
Maybe. I think we need upstream zswap folks to commit to production status by updating kernel documentation, rather than giving it the go ahead in a bug report.
https://bugzilla.kernel.org/show_bug.cgi?id=204563#c6

@chrismurphy can you provide a status update here?

Everyone I've talked to about it: kernel people, and user space folks, all say the same thing: oh yeah, that problem. Everyone knows it's a problem, there is no universal one size fits all fix, and it pretty much amounts to use case specific work arounds. I'd say a near majority consensus is: buy more memory, or (manually) use build commands that require fewer resources.

The issue really is that the kernel's oom-killer is not at all interested in user space. It only cares about making sure the kernel itself is still treading water, so it doesn't care if almost all of user space is totally unresponsive and faceplanting. Ergo, there is, by design, no kernel facility that tries to preserve responsiveness from a user perspective.

Facebook has learned a lot from oomd, first iteration. There is an oomd2 they've been working on that requires less effort on the part of specific use cases to build their own use case specific modules. So that's possibly worth another look from what I got out of watching the All Systems Go conference in Berlin last month.

I'm still mostly interested in a generic solution that doesn't require users either spending money, or having to configure and self-monitor the problem, in order to paper over what's clearly a fundamental deficiency in the operating system.

I'm using the OS term to mean kernel+systemd+watchdog-like-daemons, as in I'm casting a broad net of blame, none of which is the user themselves, but even going so far as to say I don't think this is the responsibility of application developers either. It's 2019 and ordinary sane applications effectively acting like unprivileged fork bombs is really quite impressive to me, and not in a good way.

Also, I haven't done any comparative evaluation on Windows or macOS to see how they handle unprivileged fork bombs. But I can't say I'm convinced it's that relevant, because the behavior is such a betrayal of users that I wouldn't feel OK saying: well, it's the same problem on macOS and Windows, as if that were acceptable if true.

Can we assume at this point that we should be enabling some userspace OOM handler, such as one of those analyzed here?

@chrismurphy nohang is also available in the Fedora 30+ repos. Version 0.2 is coming soon. Please test it too. It has a lot of config keys for fine-tuning, supports PSI and GUI notifications, and I plan to support killing cgroups as a single unit as an option.

https://github.com/hakavlad/nohang

Demo with a typical use case: 2.3 GB RAM, 1.8 GB swap. Opening a lot of browser tabs, and OOM prevention with GUI notifications: https://youtu.be/PLVWgNrVNlc

Also look at psi-monitor in the nohang package. You can start it before stress testing to monitor PSI pressure values; the output looks like the following:

$ psi-monitor
Set target to SYSTEM_WIDE to monitor /proc/pressure
Starting psi-monitor, target: SYSTEM_WIDE, period: 2
------------------------------------------------------------------------------------------------------------------
 some cpu pressure   || some memory pressure | full memory pressure ||  some io pressure    |  full io pressure
---------------------||----------------------|----------------------||----------------------|---------------------
 avg10  avg60 avg300 ||  avg10  avg60 avg300 |  avg10  avg60 avg300 ||  avg10  avg60 avg300 |  avg10  avg60 avg300
------ ------ ------ || ------ ------ ------ | ------ ------ ------ || ------ ------ ------ | ------ ------ ------
  0.17   0.83   0.57 ||  11.54  49.25  32.03 |  11.27  45.70  29.59 ||  42.80  78.67  42.78 |  42.09  74.13  39.81
  0.14   0.80   0.57 ||   9.63  47.67  31.81 |   9.41  44.24  29.39 ||  40.12  77.01  42.68 |  39.54  72.62  39.73
  0.11   0.77   0.56 ||   7.88  46.11  31.60 |   7.70  42.79  29.19 ||  32.85  74.49  42.39 |  32.38  70.25  39.46
  0.09   0.75   0.56 ||   6.45  44.60  31.38 |   6.31  41.39  28.99 ||  27.08  72.09  42.11 |  26.69  67.98  39.19
  0.07   0.72   0.55 ||   5.28  43.14  31.16 |   5.16  40.04  28.79 ||  22.90  69.86  41.85 |  22.58  65.89  38.95
  0.06   0.70   0.55 ||   4.33  41.73  30.95 |   4.23  38.73  28.60 ||  18.75  67.57  41.56 |  18.49  63.73  38.69
  0.05   0.67   0.55 ||   7.35  41.05  30.88 |   7.26  38.15  28.54 ||  20.24  66.25  41.46 |  20.03  62.53  38.61
  0.04   0.65   0.54 ||   6.38  39.77  30.68 |   6.31  36.96  28.36 ||  17.48  64.24  41.21 |  17.31  60.65  38.38
  0.03   0.63   0.54 ||   5.22  38.47  30.47 |   5.17  35.75  28.17 ||  14.67  62.21  40.94 |  14.53  58.73  38.13
  0.02   0.61   0.54 ||   4.27  37.21  30.27 |   4.23  34.58  27.97 ||  12.02  60.17  40.66 |  11.90  56.81  37.87
  0.02   0.59   0.53 ||   3.50  36.00  30.06 |   3.46  33.45  27.78 ||  10.20  58.27  40.40 |  10.10  55.01  37.62
  0.01   0.57   0.53 ||   2.86  34.82  29.85 |   2.83  32.36  27.59 ||   8.35  56.36  40.12 |   8.27  53.21  37.36

I would like to know how much memory pressure rises during your tests. Please also turn on the debug keys in the config if you test nohang.
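psi-monitor's columns come straight from the kernel's PSI files; each line of /proc/pressure/memory has the form `some avg10=... avg60=... avg300=... total=...` (cpu has only a `some` line). A small parser sketch for that format:

```python
def parse_psi(pressure_text):
    """Parse /proc/pressure/{memory,io,cpu} contents into a dict like
    {"some": {"avg10": 11.54, ..., "total": 123456}, "full": {...}}.
    avg* fields are percentages; total is stalled time in microseconds."""
    result = {}
    for line in pressure_text.splitlines():
        kind, *pairs = line.split()  # "some" or "full", then key=value pairs
        result[kind] = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            result[kind][key] = int(value) if key == "total" else float(value)
    return result
```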

Look at this discussion:

installed nohang-git and enabled its service
well its better than expected
no complete freeze when using full ram(to get to full ram use i opened 75 tabs of firefox and 70 tabs on chromium)
system is still responsive

I have used Earlyoom for a few months, and wonder why it is not preinstalled.

-- https://forum.manjaro.org/t/out-of-memory-killer-nohang-earlyoom-oomd/95543/9

Actually, IMHO earlyoom is the best candidate to be the default OOM prevention daemon in Fedora 32 right now: it is stable and tiny, works fine, has been developed since 2014, and is the best-known userspace OOM killer. It does one thing well: it prevents the system from freezing when available memory and free swap drop critically low. Perhaps in the future it can be replaced with something else, but it is better to start with it.

If anyone wants to object to zram by default or removal of disk-based swap, please do so here.

zram doesn't help you if the data in memory compresses poorly (actually a very rare case; the compression ratio is about 3:1 if memory is full of browser tabs). zswap has the advantage that it lets you put a constant amount of data into swap regardless of its compressibility. Also, zswap does not break hibernation.

zswap is the less disruptive (it does not break hibernation, and it works even if memory is full of incompressible data) and more conservative solution.

Regarding nohang, could you leave a comment on https://github.com/hakavlad/nohang/commit/2a3209ca72616a6a8f59711ff7fde7a6662ff3c7 to indicate what you are fixing and what its security impact is? The brief commit messages are concerning.

Regardless, I appreciate your assistance in this issue and your vote of confidence for earlyoom.

@catanzaro Done. Sorry for the messy commit style.

Can we assume at this point that we should be enabling some userspace OOM handler

I wrote one specifically for Fedora Workstation and desktop use, which is available at:
https://gitlab.freedesktop.org/hadess/low-memory-monitor/

The GLib integration code is here:
https://gitlab.gnome.org/GNOME/glib/merge_requests/1005

The Portal code is here:
https://github.com/flatpak/xdg-desktop-portal/pull/365

Something new: high responsiveness with active swapping. Code not yet published.

With tail /dev/zero:
https://youtu.be/H6-qfXqzINA

With stress -m 9 --vm-bytes 99G:
https://youtu.be/DJq00pEt4xg

In the latter case, without a special daemon, the system would freeze completely.

Perhaps @hadess and @hakavlad could collaborate on this? It seems like we have more than enough competing implementations already?

Perhaps @hadess and @hakavlad could collaborate on this? It seems like we have more than enough competing implementations already?

None of them was suitable for integration into the desktop, unfortunately; otherwise I wouldn't have written a new one. There's not much "collaboration" left to be done: the code is written and functional. It's waiting for integration.

integration in the desktop

@hadess Do you mean only Gnome? How will this be integrated? How will this be configured?

None of them were suitable for integration in the desktop

What should integration look like?

Tested low-memory-monitor https://aur.archlinux.org/packages/low-memory-monitor-git/ on Manjaro.

https://imgur.com/a/UTB6tZJ - screenshots.

At the slightest dip into swap, the culprit (tail /dev/zero) was killed, and a few seconds later Xorg was killed. After recovery, Firefox was killed some time later, although there was enough memory.

No --help option, no output. VmRSS is about 4 MiB.

@aday, Philip Withnall, Rob McQueen, and I discussed this issue last week at an Endless OS/GNOME hackfest in London. It's a difficult problem, is what I've concluded.

We discussed, but didn't reach a conclusion on, the degree to which we can depend on swap configurations other than what we have now, given the Anaconda defaults and the user's control over this in Custom/Advanced partitioning anyway. My experience with no swap is objectively worse: the kernel oom-killer becomes prone to knocking over random processes like sshd, systemd-journald, and other processes we'd really want to keep around. Also, I wonder about always assuming the top memory consumer is the process to kill. What if it's a VM? Killing VMs would be a pretty bad UX and likely leads to some form of data loss.

Another thread about this topic has appeared on devel@
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/JW6FG5HB3Q67U2F7PJKBXUBZWL2FK32C/

We discussed, but didn't reach a conclusion on, the degree to which we can depend on swap on zram by default instead of swap on HDD/SSD. There are known problems with having no swap at all: quite a lot of people have use cases where incidental swap is used regularly, and if it's not available the in-kernel oom killer is very quick to knock off random processes to free up memory. In my testing, such a setup regularly caused sshd, systemd-journald, and similar processes to get killed, not the top-consuming process.

Something I brought up is the distinction between incidental swap use (a rate of usage that is trivial and can be handled by HDD/SSD without noticeably affecting system responsiveness) and persistently aggressive use (a rate that does cause responsiveness problems). The former is acceptable; the latter isn't. How do we distinguish between them? Is there a way to sense a jerky mouse pointer, for example? I can see it. Is there anything in Linux, either kernel or user space, that could treat that as an early indicator of a low water mark of resources being overwhelmed?
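One candidate signal for that distinction is the kernel's PSI "full" memory-pressure metric, which reports the share of time all non-idle tasks were stalled on memory at once. A hedged heuristic sketch (the 25% threshold is an arbitrary illustration, not an established rule; incidental swap use barely moves this value, while thrashing drives it up quickly):

```python
def swap_use_is_aggressive(full_mem_avg10, threshold=25.0):
    """Heuristic: treat swapping as "persistently aggressive" when the
    PSI full memory-pressure avg10 (from /proc/pressure/memory) exceeds
    the threshold, i.e. every runnable task spent more than that share
    of the last 10 seconds stalled waiting on memory."""
    return full_mem_avg10 > threshold
```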
