#98 Better interactivity in low-memory situations
Opened 9 months ago by hadess. Modified a month ago

This is a "tracker bug" about replacing disk-based swap with zram-based swap.

Installing and enabling zram by default (already merged):
https://bugzilla.redhat.com/show_bug.cgi?id=1731598
https://pagure.io/fedora-comps/pull-request/391

Disabling disk-based swap by default:
https://bugzilla.redhat.com/show_bug.cgi?id=1731978

Justification is listed in the anaconda bug.


If anyone wants to object to zram by default or removal of disk-based swap, please do so here. Otherwise, Bastien seems to have this under control, so I don't see the need to discuss at a meeting.

I think it would be useful to have a Self-Contained change for this so that more people are aware of it. Having steps to migrate from the old setup to the new would be nice.

> I think it would be useful to have a Self-Contained change for this so that more people are aware of it. Having steps to migrate from the old setup to the new would be nice.

I started doing that, but there were too many unknowns for that page to be useful and the deadline is today, and I won't have the answers by that time.

Adding meeting keyword due to some initial skepticism from Anaconda developers. We might need to have a formal vote here.

> I started doing that, but there were too many unknowns for that page to be useful and the deadline is today, and I won't have the answers by that time.

IMO it would actually be perfectly fine to do this as an F32-timeframe change proposal instead of for F31. That way users have plenty of time to report possible fallout. This is a very longstanding problem, so fixing it ASAP doesn't seem urgent.

Also, I'd say this would be a systemwide change.

Metadata Update from @catanzaro:
- Issue tagged with: meeting

9 months ago

OK so all of this work is fine by me, but at the same time it totally steps on all the work and coordination I just wrapped up in issue 56. I was expecting that to happen at some point, just not this quickly.

This change probably means swap on zram gets activated in early boot on lives
https://src.fedoraproject.org/rpms/fedora-release/pull-request/87#request_diff

and that will conflict with this change in anaconda, which uses a different service file to set up and enable swap on zram when anaconda is launched:
https://github.com/rhinstaller/anaconda/pull/2039

There's no error handling in the anaconda code. I don't know for sure what happens when PR 87 lands, but I suspect on LiveOS we'll see two swap on zram devices, the one with higher priority will get used and the other will be a benign extra.

But to clean it up, either PR 87 needs to be reverted until this issue gets better organized, or there's this alternative:
https://src.fedoraproject.org/fork/chrismurphy/rpms/zram/c/8dad080d9285db3076fdd502efb83081545520d8?branch=devel

@hadess our meeting is tomorrow at 9:00 AM EDT if you want to participate (in #fedora-meeting-2)

> @hadess our meeting is tomorrow at 9:00 AM EDT

Er, I suppose that's already "today" in France.

> This is a "tracker bug" about replacing disk-based swap with zram-based swap.
> Installing and enabling zram by default (already merged):
> https://bugzilla.redhat.com/show_bug.cgi?id=1731598

This was reopened for reversal as it conflicts with:
https://github.com/rhinstaller/anaconda/pull/2039

> https://pagure.io/fedora-comps/pull-request/391

This can stay, it just installs a file on disk.

> Disabling disk-based swap by default:
> https://bugzilla.redhat.com/show_bug.cgi?id=1731978
> Justification is listed in the anaconda bug.

This is another problem.

I'll close this as I'm not interested in working on this for Fedora 32. It was made abundantly clear that I should have coordinated with an effort I knew nothing about and that my work was "[not] even done".

There are bugs already opened for the individual changes though, so coordination should be straightforward.

Metadata Update from @hadess:
- Issue close_status updated to: Won't fix
- Issue status updated to: Closed (was: Open)

9 months ago

I'm still interested in tracking this here even if you're not. :)

Metadata Update from @catanzaro:
- Issue status updated to: Open (was: Closed)

9 months ago

> I'm still interested in tracking this here even if you're not. :)

That's fine by me, as long as ownership is clear. Thanks.

Metadata Update from @catanzaro:
- Issue assigned to chrismurphy

9 months ago

I very much appreciate @hadess's contribution to a generic swap-on-zram solution so far. A developer once reminded me that Fedora is supposed to be fun. The way I expressed surprise at all of his work actually triggering changes is contrary to having fun. And for that I apologize.

There are actually quite a few things that get touched by this feature and I think it should go through the feature process for a future Fedora. I'm happy to negotiate the bureaucracy aspect of this and make sure there's proper coordination and testing. But I can barely bash my way out of a hat, so there is no way I'm going to magically get the systemd zram-generator working on my own.

I think the first step is to establish whether the systemd zram-generator project is going to be the accepted generic swap-on-zram implementation for the Anaconda, Workstation, and IoT folks. Concurrently, we need to sort out who (or at least that someone) will respond when it breaks; when it breaks, we'll need a pretty fast fix, and right now it's broken. I don't think there's a viable feature if there isn't agreement on a) a generic solution, b) someone to maintain it, and c) it actually working. And right now none of those three things is known for sure.

Anaconda folks aren't in favor of switching to merely a different swap-on-zram solution than what they have. I think it's reasonable to want something robust that is sufficiently upstream that it can and will be used by other distributions. Along those lines, one thing I need to follow up on: Arch has recently discussed moving to swap on zram by default too.

Concurrently, the subject of this issue suggests a problem that needs additional investigation. I'll make that point in this bug
https://bugzilla.redhat.com/show_bug.cgi?id=1731978

Does that sound reasonable? And at least for initial steps, is there anything I've left out?

@hadess or anyone else experiencing this problem: I'd like a clear, as-simple-as-possible set of reproduction steps for this statement:

> Unfortunately, especially on interactive systems such as the Workstation variants, hitting the disk-based swap under low-memory conditions renders the machine completely unusable. The disk-based swap is not fast enough to free up physical memory to keep the machine's interactivity.
https://bugzilla.redhat.com/show_bug.cgi?id=1731978

I have experienced cases like that myself of course, but I do not have a consistent reproducer and want to make certain I'm seeing the same thing everyone else is talking about with a relevant Workstation specific use case example. Post it here or in a BZ, whichever is appropriate. Thanks.

BTW I noticed that Anaconda's automatic partitioning defaults to creating a swap partition equal to the amount of RAM, which can result in ludicrous swap sizes on systems with large amounts of RAM. So that's something else to keep in mind.

> I have experienced cases like that myself of course, but I do not have a consistent reproducer and want to make certain I'm seeing the same thing everyone else is talking about with a relevant Workstation specific use case example. Post it here or in a BZ, whichever is appropriate. Thanks.

http://trac.webkit.org/wiki/BuildingGtk wouldn't be a horrible testcase:

$ cmake -DPORT=GTK -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja
$ ninja

If that's not enough to hang your computer, then something absurd like ninja -j64 might do the trick.

@catanzaro

> BTW I noticed that Anaconda's automatic partitioning defaults to creating a swap partition equal to the amount of RAM, which can result in ludicrous swap sizes...

I just mentioned that in bug 1731978. In ancient times, swap at 1x RAM was 4MB. And compared to drive performance, 16-32GB is just goofycakes. But it has to be at least 1x RAM if your view is that hibernation is a plausible use case to try to support; really, hibernation requires 100% of the total used memory, which is RAM+swap. So plausibly hibernation requires 2x RAM. There's a reason swap and hibernation files are decoupled on macOS and Windows, and why Microsoft has effectively abandoned this style of hibernation. But as far as I know, there's no support on Linux for modernizing it and dealing with all the firmware bugs.

> something absurd like ninja -j64 might do the trick

Isn't that pathological? I mean, should I really be considering things that are not realistically a good idea in normal usage? Anyway the test system I have has 4 real cores, 8 with hyperthreading, and only 8GB RAM, so I think this should be very straightforward to trigger with 8GB of swap on SSD.

> Isn't that pathological? I mean, should I really be considering things that are not realistically a good idea in normal usage? Anyway the test system I have has 4 real cores, 8 with hyperthreading, and only 8GB RAM, so I think this should be very straightforward to trigger with 8GB of swap on SSD.

I think ninja with no -j args defaults to -j8 or -j10; it's either nproc or nproc+2, something like that. I'm fairly confident 8GB is not enough, so it should hang without any -j passed. But if not, you can play with -j and see what it takes.
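A minimal sketch of the nproc+2 guess, with jobs_for as a hypothetical helper (ninja's real heuristic may differ for very small core counts):

```shell
# Hypothetical helper mirroring the thread's guess at ninja's default
# parallelism: number of CPU threads plus two.
jobs_for() {
  echo $(( $1 + 2 ))
}

jobs_for 8   # 8 threads (4 cores + HT) -> 10
jobs_for 4   # booted with nr_cpus=4    -> 6
```

Those two values match the `[default=10 on this system]` and `[default=6 on this system]` figures reported from `ninja --help` later in the thread.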

Metadata Update from @catanzaro:
- Issue untagged with: meeting

8 months ago

https://lkml.org/lkml/2019/8/4/15 is relevant, although it shows the limits of what we hope to achieve here: we're hoping that disabling swap will be our solution, but in this example swap is already disabled and everything goes wrong anyway.

The kernel developers have posted a patch. Who knows, maybe they will finally magically solve this for us after all these years....

> https://lkml.org/lkml/2019/8/4/15 is relevant, although it shows the limits of what we hope to achieve here: we're hoping that disabling swap will be our solution, but in this example swap is already disabled and everything goes wrong anyway.
> The kernel developers have posted a patch. Who knows, maybe they will finally magically solve this for us after all these years....

Even if it works around the worst behavioural problems, it won't fix the fact that hitting the disk swap is bad, and that we should prefer RAM compression to disk based swap.

Perhaps the most problematic processes that instigate this problem are good candidates for being run in a container with a cgroup memory request and limit.

Whether it's an application or kernel responsibility, or some hybrid approach, I think we can agree that it's not OK for an application to just hog all system resources.

OK I've reproduced this:
https://pagure.io/fedora-workstation/issue/98#comment-585500

Test system is a Macbook Pro, 8GiB RAM, Samsung SSD 840 EVO, Fedora Rawhide Workstation.

Summary:
Whether I use 8GiB swap on an SSD plain partition or 8GiB swap on ZRAM (a 1:1 ratio), the system eventually hangs; not even the mouse pointer will move. I gave up after 30 minutes and forced power off.

The most central problem in this example build and test system is ninja's default number of jobs, which is autocalculated somehow. Whether N jobs is derived from package configuration, cmake, or ninja itself, the default used for this package on this system is guaranteed to fail the build, and always results in an unresponsive system that most users would take for a totally lost system.

Does swap on ZRAM help? Performance-wise, no. The problem happens sooner and is more abrupt than with swap on an SSD plain partition. It does help reduce wear on the SSD: a successful build will write 15GiB to disk, and nearly another 15GiB in swap writes. But in both cases the default build fails, so which one fails better or worse is irrelevant for our purposes.

What did work? ninja -j 4 resulted in a responsive system the entire time, except for a few brief periods lasting less than 15s near the end. I was able to use Firefox for browsing with 8-12 tabs while concurrently running a YouTube video, without any stuttering. And the build did finish. The configuration for this was 8GiB swap on an SSD plain partition.

Based on this limited testing, I can't recommend only moving to swap on ZRAM. First, we need better build application defaults. It's not reasonable for developers to have to know these things; defaults should not cause the system to blow up and the build to fail 100% of the time on a reasonable (even if limited) configuration where manual intervention allows it to succeed with exactly the user experience we want.

Extended cut

What's going on with swap on ZRAM? With a 1:1 ratio on the test system, /dev/zram0 is 8GiB, the same as available RAM. But ZRAM device usage is allocated dynamically. If all of swap gets consumed (and I have screenshots showing it was), at a best-case 2:1 compression ratio this uses 4GiB of RAM. Pilfering that much memory away from the build basically wedged the entire build process, comprised of at minimum 20 processes (more on that in a bit). It's just untenable. The system flat out needs more memory to use ninja's defaults on this system.

With SSD, by contrast, the system actually wasn't as memory starved, even though the swap was less efficient on SSD than in memory. The memory starvation is why the swap-on-ZRAM case failed sooner and more abruptly, with no responsiveness even via a remote ssh connection. In the swap-on-SSD case, while the GUI became totally unresponsive in the same way, there was partial responsiveness via a remote ssh connection - but it wasn't good enough to regain control. The oom killer was never invoked in either case.

Interestingly, repeating the test with a 1/4-sized swap on ZRAM, the oom killer was invoked just before the midway point of the build, all build processes quit, and complete system recovery happens relatively quickly (1-2 minutes). But the build fails. This data point suggests it's possible to overcommit /dev/zram and cause worse memory starvation. And it just underscores that the real problem is the application asking for too many resources.

What is ninja doing by default? When I run ninja --help it reports back:
-j N run N jobs in parallel (0 means infinity) [default=10 on this system]

If I reboot with nr_cpus=4, and rerun ninja --help it reports back:
-j N run N jobs in parallel (0 means infinity) [default=6 on this system]

I'm gonna guess its metric is to set N jobs to nproc + 2. Each job actually means a minimum of two processes, so -j 10 translates to ten c++, and ten gcc/cc1plus processes running concurrently. These defaults strike me as intended for a dedicated headless build system with a ton of resources. They're wildly inappropriate for a developer's desktop or laptop running Fedora Workstation while doing other work at the same time as the build. So I think our true task is how to get better defaults: either convincing upstreams that their build defaults need to be for individual user machines, burdening build systems with custom build settings, or doing the work of containerizing build applications so they can't (basically) sabotage users' systems with inappropriate defaults.

A secondary task, which can be done concurrently with the above: more testing is necessary to figure out the optimal ZRAM device size. I think 1:1 is too aggressive. Perhaps it can be 1/3 or 1/4 the size of RAM. But at this point in my testing I can't recommend only moving to swap on ZRAM; it mostly just means rearranging the deck chairs.

Is this worth broader discussion on devel@? Maybe bring in more subject matter experts about what the proper relative behaviors should be for build defaults, the kernel, swap behavior, etc? And maybe encourage more testing/experimentation in this area?

Hm wow, I wasn't expecting zram to make responsiveness worse. That's... interesting.

> What is ninja doing by default? When I run ninja --help it reports back:
> -j N run N jobs in parallel (0 means infinity) [default=10 on this system]
> If I reboot with nr_cpus=4, and rerun ninja --help it reports back:
> -j N run N jobs in parallel (0 means infinity) [default=6 on this system]
> I'm gonna guess its metric is to set N jobs to nproc + 2. Each job actually means a minimum of two processes, so -j 10 translates to ten c++, and ten gcc/cc1plus processes running concurrently. These defaults strike me as intended for a dedicated headless build system with a ton of resources. They're wildly inappropriate for a developer's desktop or laptop running Fedora Workstation while doing other work at the same time as the build.

nproc + 2 is a little aggressive, but not unreasonable for C projects or small C++ projects. But clearly it's way too much for WebKit. (It also works extremely well for my high-end system, with ridiculous specs that we should not optimize for. Using a lower value would make my builds drastically slower.) Ideally make and ninja would learn to look at system memory pressure and decide whether to launch a new build process based on that, but it's probably too much to expect. :/ Trying to trigger OOM earlier seems like a better bet.
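The "look at system memory pressure" idea does have a concrete kernel interface: PSI (pressure stall information, kernel 4.20+), exposed as /proc/pressure/memory. A minimal parsing sketch; psi_avg10 is a hypothetical helper, and the sample line is hard-coded so the sketch runs anywhere:

```shell
# Extract the avg10 figure (10-second average stall percentage) from a
# /proc/pressure/memory "some" line. A build tool could hold off spawning
# another compile job while this value sits above some threshold.
psi_avg10() {
  echo "$1" | sed -n 's/^some .*avg10=\([0-9.]*\).*/\1/p'
}

line="some avg10=12.34 avg60=3.21 avg300=0.50 total=123456"
psi_avg10 "$line"    # -> 12.34
```

On a live system the input would come from `grep ^some /proc/pressure/memory` instead of the hard-coded sample.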

> Is this worth broader discussion on devel@? Maybe bring in more subject matter experts about what the proper relative behaviors should be for build defaults, the kernel, swap behavior, etc? And maybe encourage more testing/experimentation in this area?

Yes.

> Hm wow, I wasn't expecting zram to make responsiveness worse. That's... interesting.

While monitoring top and iotop as this happens, it looked like a case of Ouroboros, the snake eating itself. It's not the fault of ZRAM per se; it just has a different and faster failure mode once the system was set up to fail from the outset. Had the resource demand not reached the critical level, ZRAM would have relieved the pressure better than swap on SSD.

Thanks for the detailed testing, Chris!

As other people have pointed out, swapping anonymous pages out is only one out of two tools that the kernel has to reclaim memory - the other tool is reclaiming pages from the page cache - including things like mapped-in program code.

So adjusting the way we swap by itself is unlikely to make a huge difference. We really need the kernel to be making a decision in some fashion "this process / these pages are less important" - "this process / these pages are more important".

The main tool that seems to be available in this area is the cgroups memory controller. (cgroups v2 has more abilities in this area, like per-cgroup pressure stall information - https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html) I don't think there's any ability to "prioritize" things, but there are various controls over minimum/maximum amounts of memory used. You can even set "swappiness" per-cgroup.

So, a vague idea would be to try to arrange things so that gnome-shell, gnome-terminal, dbus-broker, and other tasks critical for maintaining interactive performance are in one part of the cgroup hierarchy, and applications and terminal-spawned processes are in another part, and try to protect a minimum amount of memory for the critical processes.

You'd really like it if, in a low-memory situation, Firefox staying interactive was prioritized over ninja spawning 10 g++'s - but that seems harder - you don't want to say that on an 8GB system, 2GB are for the system, 3GB are for applications, and 3GB are for non-interactive processes!

Definitely bringing this to a wider audience would be a good idea - there may be more tools that we aren't aware of.

> I think 1:1 is too aggressive. Perhaps it can be 1/3 or 1/4 the size of RAM. But at this point in my testing I can't recommend only moving to swap on ZRAM, it mostly just means rearranging the deck chairs.

1:1 is certainly not aggressive. I use 2:1 and am very pleased.
Chrome OS uses 3:2 on its chromebooks.
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/refs/heads/master/chromeos-base/swap-init/files/init/swap.sh#205
Sample system log, shows 2GB RAM, zram swap 2.74GB.
https://bugs.chromium.org/p/chromium/issues/detail?id=970989
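For comparison, the sizing arithmetic behind the ratios under discussion; zram_size_kb is a hypothetical helper, and the round MemTotal figures are illustrative (real systems report slightly less):

```shell
# Compute a zram device size from MemTotal (in kB) and a ratio expressed
# as numerator/denominator.
zram_size_kb() {
  echo $(( $1 * $2 / $3 ))
}

zram_size_kb 8388608 1 1   # 1:1 on an 8 GiB machine  -> 8388608 (8 GiB)
zram_size_kb 2097152 3 2   # Chrome OS's 3:2 on 2 GiB -> 3145728 (3 GiB)
zram_size_kb 8388608 1 4   # the 1/4-of-RAM retest    -> 2097152 (2 GiB)
```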

> Based on this limited testing, I can't recommend only moving to swap on ZRAM. First, we need better build application defaults. It's not reasonable for developers to have to know these things, defaults should not cause the system to blow up and the build to fail 100% of the time on a reasonable, even if limited configuration, where manual intervention allows it to succeed and have exactly the user experience we want.

WebKit has always been a pain to get compiled on any system, I'm not sure we need to base ourselves off of this one workload test.

Of course, using a zram-backed swap isn't enough, nor is removing disk-based swap, but each makes the heavy workload transitions slightly smoother and reduces wear and tear on the disks. Using cgroups or OOM scores is a parallel problem. See:
https://gitlab.freedesktop.org/hadess/low-memory-monitor
https://gitlab.gnome.org/GNOME/glib/merge_requests/1005

Lennart suggested something interesting here, in the second portion of his threaded response:
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/MML5MAKBFNEXBT67TCOVUWGFNOUDYUUP/

The gist is: perhaps multiple swaps, some of which are not normally active and are activated upon hibernation. I will ask Lennart some questions about this, and also what happens if e.g. there are two swaps, ZRAM and disk, and the user tries to hibernate.

Also, there is zswap, which is a totally different thing. It keeps a definable in-memory pool for swap and spills over to a conventional disk-based swap, all of it always compressed. It really fits this use case nicely; the gotcha is that it's flagged experimental in the kernel documentation. I just pinged someone upstream to see if that's really the current state. (I'm assuming something marked experimental is not something we want to use in default installations.)

@hadess I agree this is a complicated issue (the general and specific problems, as well as this ticket). It's sorta like a dam with holes spouting water and wanting to fill them all while deciding which ones we can work on and when and in what order, etc.

The webkit example is badass because of its simplicity and clarity as an unprivileged task fork bombing the system. I don't mean to indicate that specific case must be solved in this ticket. But I'm also skeptical of desktop specific bandaids that paper over lower level deficiencies.

And I like the low-memory-monitor concept insofar as it can provide feedback to the user and give them some control over a runaway train situation. But I also strongly take the position a non-sysadmin user can never be held responsible for an unprivileged task taking down the system. Perhaps I want my cake and to eat it too, and so be it.

Another tangent issue when the system gets into these heavy swap states. This one is a GPU hang I reported upstream and there's some relevant back and forth.
https://bugs.freedesktop.org/show_bug.cgi?id=111512

https://facebookmicrosites.github.io/oomd/docs/overview
https://news.ycombinator.com/item?id=17590858

I love this line "flexibility where each workload can have custom protection rules" in that it suggests it could be adapted from its server roots to a workstation use case.

oomd will be packaged, https://github.com/facebookincubator/oomd/issues/90

By the way, earlyoom and nohang are ready for desktop right now.

https://apps.fedoraproject.org/packages/earlyoom
https://github.com/rfjakob/earlyoom

https://copr.fedorainfracloud.org/coprs/atim/nohang
https://github.com/hakavlad/nohang

nohang also supports PSI and low-memory warnings; you can test it right now.
Demo: nohang v0.1 prevents Out Of Memory with GUI notifications: https://youtu.be/ChTNu9m7uMU.

I'm the nohang author and an earlyoom contributor. I'd like to see your questions about OOM prevention in userspace.

I think earlyoom is stable and tiny. PSI support is under discussion: https://github.com/rfjakob/earlyoom/issues/100. I would like to hear arguments against its enabling by default.
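For reference, earlyoom's trigger is integer percentages of MemAvailable and SwapFree (SIGTERM at the default mem 10% / swap 10% limits). The arithmetic, with pct as a hypothetical helper, reproduces the figures in the earlyoom log excerpt quoted later in this thread:

```shell
# Integer percentage of free vs. total, the way earlyoom logs it.
pct() {  # pct <free> <total>, same units for both
  echo $(( $1 * 100 / $2 ))
}

pct 258 7837   # mem avail 258 of 7837 MiB -> 3, i.e. "( 3 %)"
pct 860 7836   # swap free 860 of 7836 MiB -> 10, at the SIGTERM limit
```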

oomd is not good for desktop:

> seems like oomd kills all processes in memhog.scope, not only fattest process, it is not good for desktop
>
> Yeah that's by design. The smallest granularity oomd will operate on is a cgroup. Doing per-process is kind of a mess, especially when multiple teams own different services on a system.

https://github.com/facebookincubator/oomd/issues/61#issuecomment-520641352

> There's this bug which has been bugging many people for many years
> already and which is reproducible in less than a few minutes under the
> latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> defaults.
>
> Steps to reproduce:
>
> 1) Boot with mem=4G
> 2) Disable swap to make everything faster (sudo swapoff -a)
> 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> 4) Start opening tabs in either of them and watch your free RAM decrease
>
> Once you hit a situation when opening a new tab requires more RAM than
> is currently available, the system will stall hard.

This bug cannot be reproduced if earlyoom or nohang is enabled, see demo above.

> You'd really like it so that in a low memory situation Firefox staying interactive was prioritized over ninja spawning 10 g++'s - but that seems harder - you don't want to say that on a 8GB system, 2GB are for the system, 3GB are for applications, and 3GB are for non-interactive processes!

@otaylor what about nice -n 19 ionice -c 3 ./foo? What about systemd-run --user -p TasksMax=10 -p IOWeight=1 -p CPUWeight=1 -t ./foo?

@hakavlad In the web browser reproduce case, what happens with earlyoom or nohang? Does it kill off just a child process of the browser killing off a tab? Is it randomly chosen? Would it kill off the whole browser? Could it kill off some other process entirely or does it try to kill the parent process whose combined resource consumption is the greatest?

In the ninja+webkitgtk example, killing off ninja itself would be good, where kernel's existing oom-killer either plays with it like a cat does a mouse and maybe kills it tomorrow, or abruptly kills just one child process, leaving the others to continue on for quite some time doing unnecessary work now that the build has failed. Conversely in the browser example, I would be OK with a tab getting killed off but not my entire browser session.

And then what if I try to reproduce the browser and ninja cases at the same time? I'd like the browser tab (child process) consuming the most CPU+memory to be treated as the most expendable. But which comes next is a gray area: a secondary browser tab, or the entire compile?

I definitely agree that unprivileged processes need resource limitations put on them. It would be very cool if these limitations could be retroactively applied, i.e. changing TasksMax and the IO/CPU weighting on the fly.

These also sorta relate:
https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/4#note_217565
https://twitter.com/ramcq/status/1150489660688424962

In fact I've run into that same gdb-backtrace-on-Firefox I/O hell @hadess has. I suppose in some ideal world, my (interactive) actions inform my system "I'm working, give me priority", and if that means gdb is effectively suspended, fine. And then when some trigger happens that indicates I've walked away (like the display going to powersave), it lets gdb hog all system resources, so long as sshd doesn't face plant.

OK I did one test with earlyoom and did a quick and dirty write up on it in the devel@ thread.
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/TCIVNXEHWENIWQT35XX6PRC7ZAYTRDGQ/

Sadly, I failed to specifically mention the workload, but it's the same as before: Fedora 31 Workstation, Firefox ~8-10 tabs, Terminal building webkitgtk with the default ninja command

> In the web browser reproduce case, what happens with earlyoom or nohang? Does it kill off just a child process of the browser killing off a tab? Is it randomly chosen? Would it kill off the whole browser? Could it kill off some other process entirely or does it try to kill the parent process whose combined resource consumption is the greatest?

earlyoom and nohang by default select the victim in the same way as the default OOM killer does: the victim is the process with the highest oom_score. The default behavior can be modified.
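The oom_score both tools read is visible in procfs, and the --prefer/--avoid style tuning ultimately shifts those numbers. A quick unprivileged way to peek at the knobs for the current shell:

```shell
# Each /proc/<pid>/ directory exposes the kernel's badness score
# (oom_score, read-only) and the user-settable bias (oom_score_adj,
# range -1000..1000; -1000 makes a process unkillable by the OOM killer).
cat /proc/self/oom_score_adj
```

Writing a positive value into oom_score_adj marks a process as more expendable, which is the same mechanism earlyoom's +300 preference builds on.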

> In the ninja+webkitgtk example, killing off ninja itself would be good

See https://github.com/rfjakob/earlyoom#preferred-processes

> Preferred Processes
>
> The command-line flag --prefer specifies processes to prefer killing; likewise, --avoid specifies processes to avoid killing. The list of processes is specified by a regex expression. For instance, to avoid having foo and bar be killed:
>
> earlyoom --avoid '^(foo|bar)$'

For earlyoom you can set:

earlyoom --prefer '^(ninja)$'

This adds +300 to ninja's badness; the +300 value is hardcoded in earlyoom.

For nohang it may be more flexible. For example,

@BADNESS_ADJ_RE_NAME 500 /// ^ninja$

or

@BADNESS_ADJ_RE_REALPATH 900 /// ^/usr/bin/ninja$

Prefer chromium tabs (they already have oom_score_adj=300 by default):

@BADNESS_ADJ_RE_CMDLINE 300 /// --type=renderer

Prefer firefox children:

@BADNESS_ADJ_RE_CMDLINE 300 /// -childID

Other way:

@BADNESS_ADJ_RE_NAME 300 /// ^(Web Content|WebExtensions)$

Avoid killing processes in system.slice:

@BADNESS_ADJ_RE_CGROUP_V2   -200   ///   ^/system\.slice/

Other options - https://github.com/hakavlad/nohang/blob/master/nohang.conf

> It would be very cool if these limitations can be retroactively applied, i.e. to change TasksMax and IO/CPU

Just change the values (pids.max, io.max, io.weight, cpu.weight, cpu.max) in cgroup, you can do it on-the-fly.
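To make the "just change the values" point concrete: the cgroup v2 interface is plain file writes. A sketch using a temp directory as a stand-in for a real cgroup directory so it runs unprivileged (the real path would be something like /sys/fs/cgroup/user.slice/...):

```shell
# A temp directory stands in for a real cgroup directory here, so the
# writes below succeed without root. On a live cgroup the same writes
# take effect immediately.
CG=$(mktemp -d)

echo 10 > "$CG/pids.max"    # TasksMax equivalent: cap concurrent tasks
echo 50 > "$CG/cpu.weight"  # halve the CPU share (default weight is 100)

cat "$CG/pids.max"          # -> 10
```

On a systemd system the same effect is available without touching the files directly, via `systemctl set-property` on the unit.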

> leaving the others to continue on for quite some time doing unnecessary work now that the build has failed

Killing a cgroup as a single unit can help you. I plan to implement it in nohang.

By the way, if you always run ninja via systemd-run --user ninja, you can enjoy cgroup-killing right now.

You can customize corrective action in nohang:

@SOFT_ACTION_RE_NAME  ^ninja$  ///  systemctl kill -s SIGKILL $SERVICE

If the victim's name is ninja, the following command becomes the corrective action: systemctl kill -s SIGKILL $SERVICE, where $SERVICE will be replaced by ninja's service unit, and ninja's cgroup will be killed as a single unit. Right now it works only with the legacy/mixed cgroup hierarchy; I will fix it to support the unified cgroup hierarchy.

P.S. It seems systemctl kill doesn't work with services in user.slice. It should work as follows:

systemd-run --uid=1000 --gid=1000 -t ninja

> Yes, the oom happens sooner

@chrismurphy Low memory != OOM. OOM did not happen in your test case. OOM was prevented by earlyoom when the SwapFree was at 10%:

[ 5049.976811] fmac.local earlyoom[3470]: mem avail:   258 of  7837 MiB ( 3 %), swap free: 2469 of 7836 MiB (31 %)
[ 5096.265936] fmac.local earlyoom[3470]: mem avail:   216 of  7837 MiB ( 2 %), swap free:  860 of 7836 MiB (10 %)
[ 5096.265936] fmac.local earlyoom[3470]: low memory! at or below SIGTERM limits: mem 10 %, swap 10 %
[ 5097.266702] fmac.local earlyoom[3470]: sending SIGTERM to process 6775 "cc1plus": badness 99, VmRSS 251 MiB
[ 5101.141163] fmac.local earlyoom[3470]: process exited after 4.7 seconds
[ 5115.243511] fmac.local earlyoom[3470]: mem avail:  1546 of  7837 MiB (19 %), swap free: 4259 of 7836 MiB (54 %)

Please use exact wording. In this case there was low memory handling, not OOM killing.

In 2019, we do not have a good zram manager. We do not have a zram manager that can handle errors, use the many zram features such as backing_dev, offer new compression algorithms such as zstd, and provide many options for fine tuning. I am surprised at this fact.

How do you feel about low memory GUI notifications? For example, if the levels of MemAvailable & SwapFree fall below 20%, the user begins to receive periodic notifications, as they do when the battery is low. This would allow the user to stop in time and stop opening new browser tabs, avoiding data loss. Should this behavior be enabled on the desktop by default? Please give arguments for your position.
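A sketch of what such a notifier could look like, under the 20% thresholds suggested above (the helper names and the notify-send invocation are assumptions for illustration, not an existing tool):

```python
import shutil
import subprocess

def should_notify(mem_avail_pct, swap_free_pct, threshold=20):
    """Warn only when both MemAvailable and SwapFree drop below the threshold."""
    return mem_avail_pct < threshold and swap_free_pct < threshold

def notify_low_memory(mem_avail_pct, swap_free_pct):
    """Return True and fire a desktop notification if memory is low."""
    if not should_notify(mem_avail_pct, swap_free_pct):
        return False
    # notify-send ships with libnotify; guard in case it is absent.
    if shutil.which("notify-send"):
        subprocess.run(["notify-send", "-u", "critical", "Low memory",
                        f"RAM {mem_avail_pct}% / swap {swap_free_pct}% left"])
    return True
```

A real implementation would also need rate limiting, so the user is not spammed while memory stays low.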

How do you feel about low memory GUI notifications?

I don't think we want this. Various reasons:

  1. Our goal ought to be to reduce the amount of work a user has to do to maintain their system, not the opposite
  2. It assumes knowledge/understanding of memory and how it works - many users don't have this
  3. It doesn't actually solve the problem - we will still inevitably end up with situations where the available memory is exceeded, despite the notifications, and we will still need to handle this somehow
  4. There will be situations where memory usage increases while someone isn't actively using the machine

In 2019, we do not have a good zram manager.

This is why I'm sceptical of swap on ZRAM. There are ways to make sure there's a low water mark for RAM, but none of the service implementations or the upstream generator do this. And it's why I'm slightly more in favor of zswap as a swap thrashing moderator. But in the runaway high memory pressure examples, that is inadequate. It's just a moderator.

How do you feel about low memory GUI notifications?

With my QA tester hat on, I like it. As a user, the first question that comes to mind is "What year is this?" I don't really see memory management as user domain. Sounds like the operating system is confused and falling over, and what am I supposed to do about that?

And if I merge the two perspectives together, I come up with: an unprivileged process just preempted the GUI, that is a fail whale.

RPMs for the just released oomd v0.2.0 are available on this COPR repository:
https://copr.fedorainfracloud.org/coprs/filbranden/oomd/

I think it would be badass if we had a way to get Fedora users to opt-in to experiments, and then randomly give them things like nohang and earlyoom and oomd and low-memory-monitor. No documentation, no warning, nothing. They just get one of them. As if it were their default installation. And see what blows up, or not, what complaints they have, or not. If they explicitly install something, instead of random, they end up with bias that actually pollutes the data. Just a thought.

report back on how they work for you!

  1. It kills the whole session instead of one process.
  2. It uses 7% CPU on my VM.
  3. I turned off swap and started opening browser tabs, and in the end the system froze.

https://imgur.com/a/FSOtqPm

To sum up: oomd is not for the desktop. Don't advise using it on the desktop; use earlyoom or nohang on the desktop instead of oomd. - https://github.com/facebookincubator/oomd/issues/90#issuecomment-530836227

@chrismurphy

kills just one child process, leaving the others to continue on for quite some time doing unnecessary work now that the build has failed

I can add an option to nohang: kill bash (or any other named) sessions (by SID or NSsid) as a single unit, if the victim is part of that group. That is, for example, if the name of the session leader is bash, then the entire session will be killed.

This is why I'm sceptical of swap on ZRAM. There are ways to make sure there's a low water mark for RAM, but none of the service implementations or the upstream generator do this. And it's why I'm slightly more in favor of zswap as a swap thrashing moderator. But in the runaway high memory pressure examples, that is inadequate. It's just a moderator.

You've done performance tests. According to my benchmarks, zswap is much slower than plain swap. It is not worth special attention at present (it was faster around the 3.11 kernel, but it has changed a lot since then).
https://lore.kernel.org/lkml/CALZtONCO5BJJw-RjrhEeap95nZy0h9GBqYgx2apVB62ZemY54g@mail.gmail.com/T/#m9580de3ad114a1699b096570b14131155600aead

zRAM is out of competition. The best choice without a doubt. I love this solution.

@latalante

zswap is much slower than plain swap

It depends a lot on the settings. With zswap pool size=90% and z3fold/zsmalloc I got performance similar to zram. Of course, if you use default 20% pool size and zbud, you will get bad performance.
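For reference, those knobs live under /sys/module/zswap/parameters; a hedged sketch of applying the settings mentioned above (requires root and a zswap-capable kernel; the helper names are mine):

```python
import os

def zswap_tuning():
    """Settings suggested above: a large pool and the denser z3fold allocator."""
    return {
        "enabled": "Y",
        "max_pool_percent": "90",  # kernel default is 20
        "zpool": "z3fold",         # kernel default is zbud
    }

def apply_zswap_tuning(base="/sys/module/zswap/parameters"):
    """Write the settings to sysfs; needs root and zswap built into the kernel."""
    for param, value in zswap_tuning().items():
        with open(os.path.join(base, param), "w") as f:
            f.write(value)
```

The same values can be set persistently via kernel command-line parameters (zswap.enabled=1 and so on) rather than at runtime.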

@latalante

You've done performance tests. According to my benchmarks, zswap is much slower than plain swap.

My testing does not indicate a performance difference whether swap is on NVMe or SSD, if the memory pool is the same size. I have no HDD systems to test. The URL you provide involves a hyperspecific workload in a VM, no details of that setup are provided, and responses to the proposal mention that the results don't adequately take the general case into account. If you're finding that enabling zswap is slower than swap alone, that's unquestionably a bug, and requires a bug report showing the system details and reproduction steps.

In the worst-case scenario, I was consistently able to get the test system to totally wedge itself with swap on ZRAM; essentially "swap thrashing" becomes CPU bound rather than IO bound, but the system was lost (omitting any of earlyoom, nohang, oomd). In the incidental swap usage cases, whatever differences there are between zswap and swap on ZRAM, I can't say they're noticeable. Time frame is roughly 18 months for zswap and 3 months for swap on ZRAM.

@hakavlad
Maybe. I think we need upstream zswap folks to commit to production status by updating kernel documentation, rather than giving it the go ahead in a bug report.
https://bugzilla.kernel.org/show_bug.cgi?id=204563#c6

@chrismurphy can you provide a status update here?

Everyone I've talked to about it: kernel people, and user space folks, all say the same thing: oh yeah, that problem. Everyone knows it's a problem, there is no universal one size fits all fix, and it pretty much amounts to use case specific work arounds. I'd say a near majority consensus is: buy more memory, or (manually) use build commands that require fewer resources.

The issue really is, the kernel's oom-killer is not at all interested in user space. It only cares about making sure the kernel itself is still treading water. So it doesn't care if almost all of user space is totally unresponsive and faceplanting. Ergo, there is, by design, no kernel facility that tries to preserve responsiveness from a user perspective.

Facebook has learned a lot from oomd, first iteration. There is an oomd2 they've been working on that requires less effort on the part of specific use cases to build their own use case specific modules. So that's possibly worth another look from what I got out of watching the All Systems Go conference in Berlin last month.

I'm still mostly interested in a generic solution that doesn't require users either spending money, or having to configure and self-monitor the problem, in order to paper over what's clearly a fundamental deficiency in the operating system.

I'm using the OS term to mean kernel+systemd+watchdog-like-daemons, as in I'm casting a broad net of blame, none of which is the user themselves, but even going so far as to say I don't think this is the responsibility of application developers either. It's 2019 and ordinary sane applications effectively acting like unprivileged fork bombs is really quite impressive to me, and not in a good way.

Also, I haven't done any comparative evaluation on Windows or macOS to see how they're handling unprivileged fork bombs. But I can't say I'm convinced it's that relevant, because the behavior is such a betrayal of users that I can't feel OK saying: well, it's the same problem on macOS and Windows, as if that were OK if true.

Can we assume at this point that we should be enabling some userspace OOM handler, such as one of those analyzed here?

@chrismurphy nohang is also available in Fedora 30+ repos. Version 0.2 is coming soon. Please test it too. It has a lot of config keys for fine tuning, supports PSI and GUI notifications. And I plan to support killing cgroups as a single unit as an option.

https://github.com/hakavlad/nohang

Demo with a typical use case: 2.3GB RAM, 1.8GB swap. Opening a lot of browser tabs and OOM prevention with GUI notifications: https://youtu.be/PLVWgNrVNlc

Also look at psi2log in the nohang package. You can start it before stress testing to monitor PSI pressure values; the output looks like the following:

$ psi2log
Set target to SYSTEM_WIDE to monitor /proc/pressure
Starting psi2log, target: SYSTEM_WIDE, period: 2
------------------------------------------------------------------------------------------------------------------
 some cpu pressure   || some memory pressure | full memory pressure ||  some io pressure    |  full io pressure
---------------------||----------------------|----------------------||----------------------|---------------------
 avg10  avg60 avg300 ||  avg10  avg60 avg300 |  avg10  avg60 avg300 ||  avg10  avg60 avg300 |  avg10  avg60 avg300
------ ------ ------ || ------ ------ ------ | ------ ------ ------ || ------ ------ ------ | ------ ------ ------
  0.17   0.83   0.57 ||  11.54  49.25  32.03 |  11.27  45.70  29.59 ||  42.80  78.67  42.78 |  42.09  74.13  39.81
  0.14   0.80   0.57 ||   9.63  47.67  31.81 |   9.41  44.24  29.39 ||  40.12  77.01  42.68 |  39.54  72.62  39.73
  0.11   0.77   0.56 ||   7.88  46.11  31.60 |   7.70  42.79  29.19 ||  32.85  74.49  42.39 |  32.38  70.25  39.46
  0.09   0.75   0.56 ||   6.45  44.60  31.38 |   6.31  41.39  28.99 ||  27.08  72.09  42.11 |  26.69  67.98  39.19
  0.07   0.72   0.55 ||   5.28  43.14  31.16 |   5.16  40.04  28.79 ||  22.90  69.86  41.85 |  22.58  65.89  38.95
  0.06   0.70   0.55 ||   4.33  41.73  30.95 |   4.23  38.73  28.60 ||  18.75  67.57  41.56 |  18.49  63.73  38.69
  0.05   0.67   0.55 ||   7.35  41.05  30.88 |   7.26  38.15  28.54 ||  20.24  66.25  41.46 |  20.03  62.53  38.61
  0.04   0.65   0.54 ||   6.38  39.77  30.68 |   6.31  36.96  28.36 ||  17.48  64.24  41.21 |  17.31  60.65  38.38
  0.03   0.63   0.54 ||   5.22  38.47  30.47 |   5.17  35.75  28.17 ||  14.67  62.21  40.94 |  14.53  58.73  38.13
  0.02   0.61   0.54 ||   4.27  37.21  30.27 |   4.23  34.58  27.97 ||  12.02  60.17  40.66 |  11.90  56.81  37.87
  0.02   0.59   0.53 ||   3.50  36.00  30.06 |   3.46  33.45  27.78 ||  10.20  58.27  40.40 |  10.10  55.01  37.62
  0.01   0.57   0.53 ||   2.86  34.82  29.85 |   2.83  32.36  27.59 ||   8.35  56.36  40.12 |   8.27  53.21  37.36

I would like to know how much memory pressure rises during your tests. Please also turn on the debug keys in the config if you test nohang.
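Each /proc/pressure file psi2log reads has lines like `some avg10=11.54 avg60=49.25 avg300=32.03 total=100`. A small sketch of parsing them and checking a trigger threshold (the 25.0 limit here is an arbitrary illustration, not a recommended default):

```python
def parse_psi_line(line):
    """Parse one /proc/pressure/* line into (kind, {metric: value})."""
    kind, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return kind, {k: float(v) for k, v in fields.items()}

def full_pressure_high(pressure_text, avg10_limit=25.0):
    """True if the 'full' pressure avg10 value crosses the limit."""
    for line in pressure_text.splitlines():
        kind, fields = parse_psi_line(line)
        if kind == "full":
            return fields["avg10"] >= avg10_limit
    return False
```

A daemon like nohang would poll /proc/pressure/memory with this kind of check and treat a sustained crossing as a trigger.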

Look at this discussion:

installed nohang-git and enabled its service
well its better than expected
no complete freeze when using full ram(to get to full ram use i opened 75 tabs of firefox and 70 tabs on chromium)
system is still responsive

I have used Earlyoom for a few months, and wonder why it is not preinstalled.

-- https://forum.manjaro.org/t/out-of-memory-killer-nohang-earlyoom-oomd/95543/9

Actually, IMHO earlyoom is the best candidate to be the default OOM prevention daemon in Fedora 32 right now: it is stable and tiny, works fine, has been developed since 2014, and is the best-known userspace OOM killer. It does one thing well: it prevents the system from freezing at a critical decrease in available memory and free swap. Perhaps in the future it will be possible to replace it with something else, but it is better to start with it.

If anyone wants to object to zram by default or removal of disk-based swap, please do so here.

ZRAM doesn't help you if the data in memory compresses badly (actually a very rare case; the compression ratio is about 3:1 if memory is full of browser tabs). ZSWAP has the advantage that it allows you to put a constant amount of data into swap regardless of its compressibility. Also, zswap does not break hibernation.

ZSWAP is the less traumatic (it does not break hibernation, and it works if memory is full of incompressible data) and more conservative solution.

Regarding nohang, could you leave a comment on https://github.com/hakavlad/nohang/commit/2a3209ca72616a6a8f59711ff7fde7a6662ff3c7 to indicate what you are fixing and what its security impact is? The brief commit messages are concerning.

Regardless, I appreciate your assistance in this issue and your vote of confidence for earlyoom.

@catanzaro Done. Sorry for dirty commit style.

Can we assume at this point that we should be enabling some userspace OOM handler

I wrote one specifically for Fedora Workstation and desktop use, which is available at:
https://gitlab.freedesktop.org/hadess/low-memory-monitor/

The GLib integration code is here:
https://gitlab.gnome.org/GNOME/glib/merge_requests/1005

The Portal code is here:
https://github.com/flatpak/xdg-desktop-portal/pull/365

Something new: high responsiveness with active swapping. Code not yet published.

With tail /dev/zero:
https://youtu.be/H6-qfXqzINA

With stress -m 9 --vm-bytes 99G:
https://youtu.be/DJq00pEt4xg

In the latter case, without the use of a special daemon, there would be complete freezing.

Perhaps @hadess and @hakavlad could collaborate on this? It seems like we have more than enough competing implementations already?

Perhaps @hadess and @hakavlad could collaborate on this? It seems like we have more than enough competing implementations already?

None of them were suitable for integration in the desktop unfortunately, otherwise I wouldn't have written a new one. There's not much "collaboration" left to be done; the code is written and functional. It's waiting for integration.

integration in the desktop

@hadess Do you mean only Gnome? How will this be integrated? How will this be configured?

None of them were suitable for integration in the desktop

What should integration look like?

Tested low-memory-monitor https://aur.archlinux.org/packages/low-memory-monitor-git/ on Manjaro.

https://imgur.com/a/UTB6tZJ - screenshots.

At the slightest dip into swap, the tail /dev/zero culprit was killed, and after a few seconds Xorg was killed. After recovery, Firefox was killed some time later, although there was enough memory.

No --help option, no output. VmRSS is about 4 MiB.

@aday, Philip Withnall, Rob McQueen, and I discussed this issue last week at an Endless OS/GNOME hackfest in London. And it's a difficult problem, is what I've concluded.

We discussed, but didn't conclude, to what degree we can depend on swap configurations other than what we have now, due to Anaconda defaults, and the user having control over this in Custom/Advanced partitioning anyway. My experience with no swap is objectively worse, where the kernel oom-killer becomes super sensitive to knocking over random processes like sshd, and systemd-journald, and other processes we'd really want to keep around. Also, I wonder about always assuming the top memory consumer is the process to kill. What if it's a VM? Killing VMs would be a pretty bad UX, and likely leads to some form of data loss.

Another thread about this topic has appeared on devel@
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/JW6FG5HB3Q67U2F7PJKBXUBZWL2FK32C/

We discussed, but didn't conclude, to what degree we can depend on swap on ZRAM by default instead of swap on HDD/SSD. There are known problems with having no swap at all, which is that quite a lot of people do have use cases where incidental swap is used regularly, and if it's not available the in-kernel oom killer is very fast at knocking off random processes to free up memory. In my testing, such a setup regularly caused sshd, systemd-journald, and similar such processes to get killed - not the top consuming process.

Something I brought up is the distinction between incidental swap use (a rate of usage where it's trivial and can be handled by HDD/SSD without seemingly affecting system responsiveness) and persistently aggressive use (a rate that does cause responsiveness problems). The former is acceptable, the latter isn't. How to distinguish between them? Is there a way to sense a jerky mouse pointer, for example? I can see it. Is there anything in Linux, either kernel or user space, that could become aware of that as an early indicator for a low water mark of resources being overwhelmed?

Is there anything in Linux, either kernel or user space, that could become aware of that as an early indicator for a low water mark of resources being overwhelmed?

@chrismurphy /proc/pressure in the kernel. oomd, low-memory-monitor and nohang in userspace can respond to PSI metrics. nohang responds most gently, sending SIGTERM. oomd acts most severely, killing at least the whole control group. low-memory-monitor just triggers the kernel's OOM killer.

swap is used regularly, and if it's not available the in-kernel oom killer is very fast at knocking off random processes to free up memory

@chrismurphy you can avoid this using earlyoom or nohang: they both act carefully; by default they first send SIGTERM to the process with the highest oom_score, then wait for the death of the victim before sending the next signal.

I think for quite a lot of use cases, we risk wasting a valuable resource, RAM, by opting for no swap at all, by default. There are definitely common cases where seldom used data can be effectively, efficiently offloaded into swap with zero meaningful performance cost, thereby freeing up RAM.

I'm open to the idea of swap on ZRAM, ZSWAP (using zbud for now since z3fold has some issues still being worked out), or either a plain swap partition or swap files, including setting the size to something other than the current 1:1 ratio with RAM, i.e. fixing it to 1-2GiB perhaps. That does mean dropping any pretense of supporting hibernation.

On hibernation: About the only defense of hibernation I can offer is that it reduces the chance of data loss, e.g. the user hasn't saved the working files, the system goes to sleep (suspend to RAM), battery gets low and dies = data loss. Whereas hybrid sleep, supported by systemd, where a hibernation image is written first and then suspend to RAM is initiated, could avoid that. But it's a bit complicated, including the hibernation lockdown on UEFI systems with Secure Boot enabled. It seems like most users now use UEFI with Secure Boot enabled.

I tested LMM again, now version 2.0, on Fedora Rawhide.

@hadess Corrective action occurs long before the loss of the system. I ran a script slowly consuming memory. LMM killed the culprit very early, at a time when the system was working normally and had no problems with responsiveness.

And now the main question: what thresholds should we use by default?

I believe that the corrective action should only occur under strong and prolonged high pressure - when we are sure that the increase in pressure is not caused by temporary or accidental circumstances.

In nohang, I generally turn off the default response to PSI: I am afraid that the daemon will kill processes when users have not yet lost control of the system.

The opinion of the author of the earlyoom daemon:

I have tested a few things with PSI, and it seems pretty difficult to do "the right thing".

The difficult example is this: https://github.com/rfjakob/earlyoom/blob/master/contrib/oomstat/loadshift.txt
You have an app that uses 70% of your RAM but is idle, and you have enough swap free. Then another app wants to use 70% of your RAM. PSI goes crazy, but the situation calms down as the idle app is swapped out. The right thing here is not killing anything.

-- https://github.com/rfjakob/earlyoom/issues/100#issuecomment-508562342

And one more thing: someone may not want a responsive system. Someone may want to start a compilation with -j64 overnight and go to bed; there are situations in which responsiveness is not important.

I'm strongly against adding userspace OOM killers to Fedora default images. Users should explicitly enable them only when needed.

  1. Such applications run with super-user privileges and have full access to all private memory of all processes and sensitive user data. This is a huge security hole.
  2. Some implementations kill all processes with the same name, and their developers think this is a feature.
  3. Super-user daemons should not touch userspace. If you want to implement a real user-space OOM handler, you should run it with the privileges of the same user using system-user units.
  4. nohang and earlyoom currently use dirty hacks for notifications. They must be rewritten to use D-Bus notifications.

@xvitaly https://github.com/hakavlad/oomie/commit/b7a0961bde471e8a599ff9342a6db7a301cc8eb1

This commit solves all the problems.

Such applications run with super-user privileges and have full access to all private memory of all processes and sensitive user data. This is a huge security hole.

Now it runs as a dynamic user with a lot of restrictions.

Some implementations kill all processes with the same name, and their developers think this is a feature.

It is not earlyoom's problem.

Super-user daemons should not touch userspace

Now it is not a super-user daemon.

nohang and earlyoom currently use dirty hacks for notifications

New unit forbids GUI notifications.

Thus, none of these problems are unsolvable. They were only particular problems of individual implementations.

Better interactivity with ninja in low-memory situations: https://github.com/ninja-build/ninja/issues/1706. Likes and comments are welcome!

Honestly, teaching ninja to be nicer to desktop users would be wonderful, but I consider it a bug that any random process is capable of causing desktop responsiveness issues by using too much RAM. ninja is probably the worst offender currently, but it's easy to imagine others. I've repeatedly had my system brought down by the ctags process spawned by Builder, for example (which seems to require an infinite amount of RAM to index WebKit).

You'd really like it so that in a low memory situation Firefox staying interactive was prioritized over ninja spawning 10 g++'s. But that seems harder: you don't want to say that on an 8GB system, 2GB are for the system, 3GB are for applications, and 3GB are for non-interactive processes!

Definitely bringing this to a wider audience would be a good idea - there may be more tools that we aren't aware of.

@otaylor

https://github.com/poelzi/ulatencyd Ulatency is a daemon that controls how the Linux kernel will spend its resources on the running processes. It uses dynamic cgroups to give the kernel hints and limitations on processes. It doesn't work with cgroup2: https://github.com/poelzi/ulatencyd/issues/67

https://github.com/Nefelim4ag/Ananicy Ananicy (ANother Auto NICe daemon) — is a shell daemon created to manage processes' IO and CPU priorities, with community-driven set of rules for popular applications (anyone may add his own rule via github's pull request mechanism).

https://github.com/JohannesBuchner/verynice A Dynamic Process Re-nicer

It seems like new kernels (at least since 5.3) use the mq-deadline I/O scheduler by default and don't support ionice. It seems that there is not a single good enough re-nice daemon for user space.

I've repeatedly had my system brought down

@catanzaro I can offer only what was written above:

  • swap compression;
  • userspace daemons.

The only question is what exactly needs to be included in default images.

My suggestions:

  • Enable low-memory-monitor by default but only without OOMK triggering support (it has a critical bug https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8 and Bastien fails to see the problem);

  • Enable earlyoom by default to handle low MemAvailable and SwapFree. earlyoom is a stable and well-established application.

PSI-based process killing should not be used by default, because this topic is still poorly understood and we don’t know what thresholds are desirable for most users: it’s hard to find good default values.

Maybe later I will recommend nohang instead of earlyoom, after I solve some problems. In contrast to earlyoom and other desktop-oriented killers, nohang allows you to flexibly configure thresholds of PSI, flexibly select a victim, and also prepares the possibility of killing process groups.

Enable low-memory-monitor by default but only without OOMK triggering support (it has a critical bug https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8 and Bastien fails to see the problem);

I don't think this gains us anything. From the bug report, I agree low-memory-monitor is too risky in its current form.

Enable earlyoom by default to handle low MemAvailable and SwapFree. earlyoom is a stable and well-established application.

So this seems like the better proposal. @hadess, anything to add?

Discussed this with Robert McQueen and Philip Withnall in London last month; and my understanding is the API that LMM will use (in Glib) should be about done, and that LMM would be ready for Fedora 32 or 33. If it's not ready for F32, or there are open questions, is there something we can learn from enabling earlyoom by default in F32? And are we OK with backing earlyoom out for Fedora 33 in favor of LMM?

With respect to https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8

If LMM is enabled by default, would -Dtrigger_kernel_oom=true also be the default behavior? That does seem a potentially risky default. If even a small percent of users need to change this to false, it's a problem because it's not obvious why terminal processes are being killed off, and what action they should take to stop it.

The main problems of LMM are awful default settings (oversensitive thresholds cause mass shootings) and low customizability. The critical threshold at which the OOM killer is invoked should be greatly increased, and it should be configurable through the config. https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8 was caused by oversensitive thresholds.

The original EOS psi-monitor (https://github.com/endlessm/eos-boot-helper/tree/master/psi-monitor) has no such bug. psi-monitor responds quite quickly to increased pressure, but does not cause mass shootings.

low-memory-monitor doesn't kill anything, it asks the kernel to do that.

@hadess It sounds like "It was not me who did the mass shooting, it was my gun!".

If LMM is enabled by default, would -Dtrigger_kernel_oom=true also be the default behavior?

@chrismurphy now -Dtrigger_kernel_oom=false by default, see https://gitlab.freedesktop.org/hadess/low-memory-monitor/commit/176c85d6a77d7c74a4b9e3ec412138dc8567cb4b and https://src.fedoraproject.org/rpms/low-memory-monitor/blob/f31/f/low-memory-monitor.spec#_37.

LMM would be ready for Fedora 32 or 33

LMM is ready for Fedora 31 and 32. The problem is that there are no applications that can respond to LMM signals. Perhaps enabling LMM by default will give developers the motivation to add the ability to react to its signals. By the way, applications can monitor the state of memory without the mediation of LMM. For example: https://github.com/WebKit/webkit/blob/master/Source/WebKit/UIProcess/linux/MemoryPressureMonitor.cpp.

LMM and earlyoom can now be enabled by default at the same time.

low-memory-monitor-2.0-3.fc31: https://bodhi.fedoraproject.org/updates/FEDORA-2019-5b12d346b1.

@hadess What about integrating LMM in systemd?

See discussion: Killing the elephant in the room - the userspace low memory handler ability to gracefully handle low memory pressure - https://www.reddit.com/r/linux/comments/ee6szk/killing_the_elephant_in_the_room_the_userspace/.

-

Just another vote for the user-space handler:

...So in this testing, linux plus either earlyoom or nohang is much better than Windows, and much, much better than linux without either.

Full report: https://www.reddit.com/r/linux/comments/ee6szk/killing_the_elephant_in_the_room_the_userspace/fcj0qjr/

Questions: Should LMM + earlyoom be made into a change proposal for visibility? I think it's a good idea to promote this effort. If it works as intended, it's a great feature. If it has bugs, we should give people a heads up so bugs can get fixed, and the experience improved upon.

And if so, would it be a self-contained change? I think the idea is to put it only into Fedora Workstation. If self-contained, the deadline is Jan 21. If it's system-wide, the deadline is tomorrow, but I'm pretty sure this is self-contained.

Ideally, this would be proposed as a system-wide change.

It's definitely a system-wide change. Obviously we will miss the change deadline. Perhaps we can submit it late if we're appropriately apologetic, but I don't think we have a clear plan here yet. Is your proposal to enable both earlyoom and low-memory-monitor?

Certainly, low-memory-monitor on its own does not seem very useful right now, given substantial evidence that earlyoom and nohang are better at handling low memory conditions. I would like to see @hadess take @hakavlad's expertise on OOM handling into more serious consideration, so that we can get everything into one daemon with good default settings, instead of running multiple at the same time.

@hakavlad mentioned four months ago that earlyoom is ready and could go into fc32. LMM might be ready, but we need a readiness update. Can they co-exist? Even if they can, does it add noise to the testing? My inclination is we need to choose:

a. earlyoom for fc32, lmm for fc33
b. lmm for fc32+
c. no change for fc32, lmm for fc33

@hakavlad, the current description from dnf info earlyoom contains this portion: "At least that's what I think what it will do." which does not inspire confidence. Perhaps it's a stale description and should be updated?

"At least that's what I think what it will do." which does not inspire confidence.

I think this line in the description does not affect the actual behavior of the daemon.

I asked the maintainer (@xvitaly) to change the description to the following:

The oom-killer generally has a bad reputation among Linux users. This may be part of the reason Linux invokes it only when it has absolutely no other choice. It will swap out the desktop environment, drop the whole page cache and empty every buffer before it will ultimately kill a process. This made me and other people wonder if the oom-killer could be configured to step in earlier. As it turns out, no, it can't. At least using the in-kernel oom-killer. In the user space, however, we can do whatever we want. earlyoom checks the amount of available memory and free swap up to 10 times a second (less often if there is a lot of free memory). By default if both are below 10%, it will kill the largest process (highest oom_score). The percentage value is configurable via command line arguments.

@xvitaly said he might change the description for the next release.
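The victim-selection rule in that description ("kill the largest process (highest oom_score)") can be sketched as follows; this is an illustration of the idea, not earlyoom's actual code:

```python
import os

def pick_victim(oom_scores):
    """Given {pid: oom_score}, return the pid with the highest score."""
    return max(oom_scores, key=oom_scores.get)

def read_oom_scores(proc="/proc"):
    """Collect the kernel-computed oom_score for every visible pid (Linux only)."""
    scores = {}
    for entry in os.listdir(proc):
        if entry.isdigit():
            try:
                with open(os.path.join(proc, entry, "oom_score")) as f:
                    scores[int(entry)] = int(f.read())
            except (OSError, ValueError):
                pass  # process exited between listdir() and open()
    return scores
```

A daemon would then SIGTERM pick_victim(read_oom_scores()) and wait for the victim to exit before escalating.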

Can they co-exist?

I do not see this as a problem. LMM and earlyoom do different jobs: earlyoom causes applications to terminate when MemAvailable < 10% MemTotal and SwapFree < 10% SwapTotal; LMM just sends signals via D-Bus (all applications currently ignore them).

Right now LMM has almost no effect by default. The point of including it is to let developers know that LMM and GMemoryMonitor are already used by default in distributions, so they can add support for responding to its D-Bus signals to their applications.

By the way, the bug in LMM is not fixed: if you enable OOM killing in LMM, you may still get mass shootings. I still consider the thresholds used in LMM suboptimal. There is still no configuration file that allows you to change the default threshold values.

I just re-read the earlyoom description, and the context of the excerpt I quoted was in regard to the kernel OOM killer, not earlyoom's killer. Sorry for the confusion.

I'm inclined to submit a change proposal for earlyoom inclusion in fc32. I'll work on the change proposal today, and give Ben a heads up that we're considering it.

Just FTR, you want to do it only for the Workstation edition, right? Also, how would it behave with VMs? I am pretty sure users who run VMs would not want to have their VMs killed, but rather some Firefox and such.

users which run VMs would not want to have their VMs killed, but rather some Firefox and such.

The OOM killer is already in Linux. earlyoom by default terminates processes in the same order as the kernel OOMK, but a bit earlier. oom_score_adj, which is used to change process kill priorities in the OOMK, also works in earlyoom. @ignatenkobrain

By the way, some userspace out-of-memory handlers let you control process-kill priorities more conveniently and flexibly than the kernel OOM killer does. So preventing the killing of virtual machines can be even easier with some of them.
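For illustration, adjusting a process's kill priority from userspace is just a write to procfs. A hedged sketch (the /proc/PID/oom_score_adj path and its -1000..1000 range are the kernel's documented interface; the helper names are invented):

```python
def clamp_adj(adj):
    """oom_score_adj accepts -1000 (never kill) through +1000 (kill first)."""
    return max(-1000, min(1000, adj))

def set_oom_score_adj(pid, adj):
    """Write the adjustment for one process.

    Note: lowering the value below its current setting requires privileges
    (CAP_SYS_RESOURCE); raising it is allowed for unprivileged processes.
    """
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(clamp_adj(adj)))
```

For example, `set_oom_score_adj(qemu_pid, -300)` would make a VM a much less attractive victim for both the kernel OOMK and earlyoom.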

@ignatenkobrain yes this is why I thought maybe this wasn't a system-wide change; my intent is to explicitly ask server, cloud, IoT folks, on their lists, if they want to opt-in. As for other desktop spins, my expectation is they inherit the Workstation defaults, but can opt-out.

Re: VMs, @hakavlad can perhaps better answer this; but my understanding is that earlyoom doesn't alter what would get killed off, just that it happens sooner, i.e. if qemu-kvm has a low oom_score compared to firefox, then it's not going to get killed off before firefox does.

The OOM killer is already in Linux. earlyoom by default terminates processes in the same order as the kernel OOMK, but a bit earlier. oom_score_adj, which is used to change process kill priorities in the OOMK, also works in earlyoom. @ignatenkobrain

Ah, I see.. I was curious if you plan to adjust these settings as well, since that is the thing that would make it better for workstation people.

@ignatenkobrain yes this is why I thought maybe this wasn't a system-wide change; my intent is to explicitly ask server, cloud, IoT folks, on their lists, if they want to opt-in. As for other desktop spins, my expectation is they inherit the Workstation defaults, but can opt-out.

I think it does not make much sense to have this for server edition.


Thanks for answers!

Ah, I see.. I was curious if you plan to adjust these settings as well, since that is the thing that would make it better for workstation people.

I'm not sure how oom_score is initially set; I think there is a default, and the process itself can also set one. I am very suspicious that oom_score may be set wrong for some processes, because I have seen the kernel oom-killer clobber e.g. systemd-journald and sshd only slightly less often than the offending unprivileged 'bad process'. An unprivileged process pigging out on CPU and RAM should have a high 'badness' score compared to the journal, sshd, or qemu-kvm.

The oom_adj is a way to adjust that, but I really see that as user domain, as kludgy as that is. If we learn that certain processes have an improper oom_score, that should get fixed by those developers, not papered over with oom_adj. And also, LMM has a different way of dealing with this, so I'm reluctant to step on any of that work.

The idea is to improve system responsiveness, the short term approach with earlyoom, and a longer term approach with LMM. That the kernel oomkiller is so reluctant to trigger, may very well expose a variety of processes with poorly set oom_score.

I was curious if you plan to adjust these settings as well, since that is the thing that would make it better for workstation people.

@ignatenkobrain I would like to change some priorities in nohang-desktop. The fact is that killing some of the processes that form the basis of the desktop usually leads to the termination of all processes on the desktop, which can mean the loss of unsaved data from every process in the session. I think it would be a good idea to protect processes such as Xorg, Xwayland, gnome-shell and the like from being killed, by lowering their kill priority, to prevent data loss across the whole session. I plan to protect processes whose death takes down the entire session, as follows:

    Protect X.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^(/usr/libexec/Xorg|/usr/lib/xorg/Xorg|/usr/lib/Xorg|/usr/bin/X|/usr/bin/Xwayland)$

    Protect Gnome.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^(/usr/bin/gnome-shell|/usr/bin/metacity|/usr/bin/mutter)$

    Protect Plasma.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^(/usr/bin/plasma-desktop|/usr/bin/plasmashell|/usr/bin/kwin|/usr/bin/kwin_x11|/usr/bin/kwin_wayland)$

    Protect Cinnamon.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^(/usr/bin/cinnamon|/usr/bin/muffin|/usr/bin/cinnamon-session)$

    Protect Xfce.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^(/usr/bin/xfwm4|/usr/bin/xfce4-session|/usr/bin/xfce4-panel|/usr/bin/xfdesktop)$

    Protect Mate.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^(/usr/bin/marco|/usr/bin/mate-session|/usr/bin/caja|/usr/bin/mate-panel)$

    Protect LXQT.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^(/usr/bin/lxqt-panel|/usr/bin/pcmanfm-qt|/usr/bin/lxqt-session)$

    Protect other.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^(/usr/bin/compiz|/usr/bin/openbox|/usr/bin/fluxbox|/usr/bin/awesome|/usr/bin/icewm|/usr/bin/enlightenment)$

    Protect `systemd --user`.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^(/lib/systemd/systemd|/usr/lib/systemd/systemd)$

    Protect `dbus-daemon --session`.
@BADNESS_ADJ_RE_REALPATH -300  ///  ^/usr/bin/dbus-daemon$

Of course, this goes beyond simply improving responsiveness on the desktop, but it allows you to better maintain control of the system and prevent data loss if corrective actions are applied.
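The matching logic behind such rules can be illustrated with plain regular expressions; a toy Python sketch (the rule syntax above is nohang's own, this code is only an illustration, and the two sample patterns are abbreviated from the full lists):

```python
import re

# (pattern, badness adjustment) pairs, abbreviated versions of the -300 rules above.
BADNESS_ADJ_RULES = [
    (re.compile(r"^(/usr/libexec/Xorg|/usr/bin/Xwayland)$"), -300),
    (re.compile(r"^(/usr/bin/gnome-shell|/usr/bin/mutter)$"), -300),
]

def badness_adjustment(exe_realpath):
    """Sum the adjustments of every rule whose regex matches the executable path.

    On a live system exe_realpath would be os.path.realpath(f"/proc/{pid}/exe").
    """
    return sum(adj for rx, adj in BADNESS_ADJ_RULES if rx.search(exe_realpath))
```

The adjustment would then be subtracted from a candidate's badness score before choosing a victim, so session-critical processes only die when nothing else is left.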

It sounds like if we want to avoid killing the desktop, we need to use nohang rather than earlyoom. Yes?

It is not implemented right now. I hope to release nohang v0.2 in January, which should be ready to use and with desktop protection. Maybe I will offer nohang-desktop for Fedora 33.

@catanzaro you can also start earlyoom with --avoid ^(gnome-shell|Xwayland)$, for example.
https://github.com/rfjakob/earlyoom#preferred-processes

@BADNESS_ADJ_RE_REALPATH -300

Hacks, hacks, hacks. That's why I'm strongly against installing any of user-space OOM killers by default. If users want them, they should explicitly install them manually.

Hacks, hacks, hacks.

Indeed, it is hard to argue with your argument. Well, let's wait another 20 years, maybe the kernel guys will finally come up with something.

Of course, userspace implementation of @BADNESS_ADJ_RE_REALPATH -300 is a hack, but kernel implementation of oom_score_adj is not a hack.

When the kernel protects itself at OOM - this is not a hack. When userspace protects itself because the kernel doesn’t help it, it’s a hack.

Well I think we have consensus agreement that the status quo is not acceptable. I'm not interested in waiting for kernel improvements unless someone is actively working on the problem at the kernel level and expects to be able to demonstrate results comparable to what userspace solutions offer, which seems unlikely. Since the people working on userspace "hacks" are delivering real, demonstrable improvements to system responsiveness, it seems pretty clear that a userspace killer is going to be the path forward right now.

I'm opening new issue #119 specifically for tracking earlyoom by default change proposal. That issue will be the narrow scope directly related to the change proposal, and any tweaks we think are needed there.

This issue #98 will remain as the big picture issue of interactivity problems in low memory situations and coordination.

Lennart has proposed an alternative to earlyoom: wait for Facebook to integrate a dumbed-down version of oomd into systemd, then use that: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/W73ZVOIRNNW4YVQT6FNSLI6GHUJCZSKY/

@hakavlad, any comments on this proposal? I understand you have not been impressed with oomd thus far, but it sounds like integration into systemd would be a good opportunity to make changes.

I believe that the fact that oomd kills whole cgroups makes it not very suitable for the desktop: on the desktop, oomd kills the entire session instead of just a browser tab.

Right now PSI is overrated for desktop usage. There are drawbacks to using it: the optimal settings for the corrective action strongly depend on the system configuration - on the type of swap space (on HDD, on SSD, on zram), on the presence of swap, on swappiness, and on workloads.

What is used in oomd is great for the server, when the oomd is configured by the administrator for a specific configuration and workload. But this can be terrible for the desktop, especially if it is offered by default without taking into account the specific configuration.

PSI on the desktop is well reviewed by those who have never used PSI on the desktop. I hope you remember the issue https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8.

Lennart speaks well of PSI because he never used PSI in practice.

I believe that the fact that oomd kills whole cgroups makes it not very suitable for the desktop: on the desktop, oomd kills the entire session instead of just a browser tab.

But that's easily solvable by running each web process in a separate systemd scope, yes? It's not like open source web browsers are black boxes that we cannot modify. We just changed WebKit to launch processes under bubblewrap; systemd scopes should be a lot easier than bubblewrap, right?

Lennart speaks well of PSI because he never used PSI in practice.

OK, I think we need to get you and Lennart together to discuss this. There's no point in going to a ton of trouble to integrate oomd into systemd if it's not going to work well.

Yes!
Right!

get you and Lennart together

I think he can read all my posts here. Perhaps I will clarify my position on PSI here.

PSI on the desktop is well reviewed by those who have never used PSI on the desktop. I hope you remember the issue https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8.

Ehm. That issue is just gnome-terminal/vte being a bit stupid. We should place every terminal tab into a systemd transient scope/service and the problem there will go away. Really, it will be easy to fix in gnome-terminal/vte and such a fix should happen anyway.

This bug is not about the terminal. I've seen LMM kill Xorg and Firefox, see my reports above. This bug is about suboptimal thresholds. LMM uses the least adequate thresholds; oomd uses much more adequate ones. Different userspace OOM killers have different ideas about good thresholds. I still have not found the optimal value, so I have temporarily given up on using PSI. This topic calls for further research.

This bug is not about the terminal. I've seen LMM kill Xorg and Firefox, see my reports above.

Well, certain components currently misbehave when the killer kicks in. What we need here is that people start filing bugs against said components to fix them (feel free to CC me). As an example, gnome-terminal should probably place each tab into its own systemd transient scope unit (also setting OOMPolicy=continue).

Note that we should probably set MemoryMin= or MemoryLow= for gnome-shell. Doing so should ensure that the shell remains running smoothly even on a loaded system. You could try the following:

$ systemctl --user --force edit gnome-shell-.service
[Service]
MemoryMin=250M
MemoryLow=500M
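For reference, those drop-in settings surface as memory.min and memory.low files in the unit's cgroup v2 directory, so they can be read back to verify they took effect. A small sketch (the attribute names come from the cgroup v2 memory controller interface; the helper function is an assumption):

```python
import os

def read_mem_protection(cgroup_dir):
    """Read memory.min / memory.low from a cgroup v2 directory, if present.

    For the drop-in above, cgroup_dir would be something like
    /sys/fs/cgroup/user.slice/user-1000.slice/.../gnome-shell-....service
    (path is an assumption and varies per system).
    """
    out = {}
    for name in ("memory.min", "memory.low"):
        path = os.path.join(cgroup_dir, name)
        try:
            with open(path) as f:
                out[name] = f.read().strip()
        except FileNotFoundError:
            out[name] = None  # controller not enabled for this cgroup
    return out
```

The values are reported in bytes (or the literal string `max`), so `MemoryMin=250M` should read back as `262144000`.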

I don't think it makes sense to choose a solution based on which solution seems to have the better thresholds. Especially if tested on a system where a lot of other adjustments and fixes are still needed.

Note that we should probably set MemoryMin= or MemoryLow= for gnome-shell.

I can't find systemd documentation for these. Can you point me to the right place?

I've broached this upstream directly, as they were in the middle of discussing IO/swap related congestion for the next LSF/MM meeting:
https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster.area/T/#m749648b58633da4528ea904533fbf5ba8134eb10

There is a follow-up response from Michal Hocko. It contains some subtly interesting things, including to what degree PSI is helpful, suggesting that the action to take is workload-specific. I followed up asking: how do we determine and categorize the workload, and then what do we do with that information to turn knobs automatically? A user has different workloads on the same computer, of course.

I was asked to email linux-mm@ separately, so I did, and also filed a tracking bug. At the moment these just state the facts we already know, until there's a response, but I leave them here in case anyone wants to track them directly. Otherwise I'll post meaningful updates as I become aware of them.
https://marc.info/?l=linux-mm&m=157842897820390&w=2

loss of responsiveness during heavy swap
https://bugzilla.kernel.org/show_bug.cgi?id=206117

New nohang behavior with psi_checking_enabled=True: don't kill if MemAvailable > the hard/soft threshold. It works well and prevents false positives.

Demo: https://youtu.be/Y6GJqFE_ke4.

Commands in the demo:

$ tail /dev/zero
$ stress -m 8 --vm-bytes 88G
$ for i in {1..5}; do tail /dev/zero; done      # sequential launch of five instances
$ for i in {1..5}; do (tail /dev/zero &); done  # simultaneous launch of five instances

with such config keys:

psi_checking_enabled = True
psi_excess_duration = 1
psi_post_action_delay = 5
soft_threshold_max_psi  = 5
hard_threshold_max_psi  = 90
ignore_positive_oom_score_adj = True

MemTotal=9.6GiB, zram disksize=47.8GiB.

Now PSI response enabled by default in nohang-desktop.conf.

It sounds like if we want to avoid killing the desktop, we need to use nohang rather than earlyoom. Yes?

Yes, with nohang-desktop.conf and nohang-desktop.service https://github.com/hakavlad/nohang/commit/ff620f04381ce412ac0c2a684222a85209e65ee9 @catanzaro

By the way, nohang-desktop should now prevent long-term freezes without mass killing when using PSI. The desktop may still freeze for up to a minute, but that delay is necessary to prevent false positives that kill innocent processes (of course, you can make the behavior more aggressive).

So, I have been experimenting quite a bit. This can be done with the following tarball (please follow the README):
https://benjamin.sipsolutions.net/gnome-slice-cgroups.tar.xz

There are a few high level observations from tests using the pathological no-swap test:

  • In the case of e.g. a heavy build, we end up with heavy memory and io pressure on the main shell (the io pressure is higher)
  • All of this pressure appears to stem from reads from the root file system. These reads are all for mmap'ed pages (i.e. file caches for data or executable maps)
  • Pages are always owned by one cgroup only, this includes file caches; so essential libraries for the shell might get accounted against an application
  • Protection can only kick in if the page is accounted against the correct cgroup

I have tried to create an artificial scenario (a C program that mmap's a file and then XORs all of it once a second). You can very nicely see the page ownership migration in progress. So you get a scenario where:

  • Run the app, pages are in its own cgroup
  • Kill it, pages get transferred to parent cgroup
  • Run it again, pages remain where they are
  • Create memory pressure; pages are reclaimed
  • Application thrashes heavily; pages get loaded back in and accounted to its own cgroup

Ideally that thrashing part there would be prevented by the kernel. However, I am not sure that is easily possible, nor am I sure that this is what we really need. But mostly because we shouldn't be restarting session critical services all the time.

Remind me, please, why zram, not zswap?
For example, zswap works fine when filling memory with incompressible data, unlike zram.
For example, zswap uses max_pool_percent to limit the maximum pool of compressed pages in memory. By contrast, zram managers don't use mem_limit in their settings.
@chrismurphy

Remind me, please, why zram, not zswap?

It's a good question. I've considered it for a year.

  • zswap requires a swap partition; zram is standalone.

  • there's already a systemd/zram-generator using existing systemd facilities to load the kernel module, create the zram device, format it, and activate. Simple.

  • I've written about my experiences with both zswap and swap-on-zram on the devel@ list; but no clear winner, or feedback suggesting a clear winner for our use case.

For example, zswap works fine when filling memory with incompressible data, unlike zram.
For example, zswap uses max_pool_percent to limit the maximum pool of compressed pages in memory. By contrast, zram managers don't use mem_limit in their settings.

I think those concerns can be mitigated by limiting the zram device's disklimit or mem_limit in sysfs. I understand disklimit only indirectly affects memory use but in practice if disklimit is 50% RAM or less, it's pretty sane as you say in #127, and that work is done - all of the implementations support disklimit.
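That mitigation is a one-line sysfs write; a hedged sketch (the /sys/block/&lt;device&gt;/mem_limit attribute is part of the kernel's zram interface and accepts size suffixes like K/M/G; the helper functions themselves are assumptions, not part of any zram manager):

```python
_SUFFIX = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_size(s):
    """Turn a size string like '512M' into bytes (sanity-check helper)."""
    if s and s[-1] in _SUFFIX:
        return int(s[:-1]) * _SUFFIX[s[-1]]
    return int(s)

def set_zram_mem_limit(limit, device="zram0"):
    """Cap the compressed pool size of a zram device (needs root).

    Writing 0 removes the limit; the kernel parses the same suffixes.
    """
    parse_size(limit)  # raise early on a malformed value
    with open(f"/sys/block/{device}/mem_limit", "w") as f:
        f.write(limit)
```

E.g. `set_zram_mem_limit("2G")` would bound the compressed pool regardless of how incompressible the swapped-out data turns out to be.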

Supporting zswap means:

  1. installer needs to learn about it, and add the proper boot parameters; or
  2. creation of, e.g. systemd/zswap-generator, to poke the sysfs interface to set it up.

I think 2 is preferred because it keeps complexity out of the installer; and makes it easier to support upgrades. To support upgrades with kernel parameters means modifying /etc/default/grub and regenerating grub.cfg and grubenv, which I think is risky and may not fly.

I'm happy to (re)consider zswap still, but there's some extra work indicated, compared to swap-on-zram.

Update: actually zswap is mentioned 28 times in this issue, mostly by me :D

OK, one more update on the effort to figure out what exactly is going on and protecting things with cgroups.

This is based on the cgroup hierarchy as found above (mem-min: 2GiB, mem-low: 4GiB), but I don't think it is relevant.

I figured out that perf trace --no-syscalls -F -p PID lets me trace (major) page faults for certain applications. It also resolves the mapping that is being loaded, meaning you can actually kind of see the kernel fetching one function after the other. If you look at the output below, you see that the page fault tends to happen each time a function is loaded, and then the corresponding library page is read from disk:

18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [gjs_context_get_current+0x0] => /usr/lib64/libgjs.so.0.0.0@0x7e0f0 (d.)
18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [ToggleQueue::is_queued+0x0] => /usr/lib64/libgjs.so.0.0.0@0x737d0 (d.)
18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [JS_BeginRequest+0x0] => /usr/lib64/libmozjs-60.so.0.0.0@0x5cf750 (d.)
18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [JS::HeapObjectPostBarrier+0x0] => /usr/lib64/libmozjs-60.so.0.0.0@0x2e4ab0 (d.)
18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x90a4a0] => /usr/lib64/libmozjs-60.so.0.0.0@0x90a4a0 (d.)
18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [JS::AddPersistentRoot+0x0] => /usr/lib64/libmozjs-60.so.0.0.0@0x2ab3f0 (d.)

Also visible is that libraries with the same offset are repeatedly faulted in during high load workloads (make -j32 of mozjs without swap in my case):

$ grep -n /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 majfault-traces.txt 
204:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x42170] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 (d.)
367:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x42170] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 (d.)
1897:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x42170] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 (d.)
1930:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x42170] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 (d.)
1964:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x42170] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 (d.)
2222:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x42170] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 (d.)
2414:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x42170] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 (d.)
5348:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x42170] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 (d.)
8040:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [0x42170] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x42170 (d.)
$ grep -n _clutter_threads_acquire_lock+0x0 majfault-traces.txt 
1982:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [_clutter_threads_acquire_lock+0x0] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x9dd40 (d.)
2145:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [_clutter_threads_acquire_lock+0x0] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x9dd40 (d.)
3675:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [_clutter_threads_acquire_lock+0x0] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x9dd40 (d.)
3708:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [_clutter_threads_acquire_lock+0x0] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x9dd40 (d.)
3742:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [_clutter_threads_acquire_lock+0x0] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x9dd40 (d.)
4000:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [_clutter_threads_acquire_lock+0x0] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x9dd40 (d.)
9818:18446744073709.551 ( 0.000 ms): gnome-shell/2339 majfault [_clutter_threads_acquire_lock+0x0] => /usr/lib64/mutter-5/libmutter-clutter-5.so.0.0.0@0x9dd40 (d.)

Other files affected are generally dconf stores (these are only visible as pointers in the log but can be resolved using the mappings from /proc). Overall, both shared and non-shared mappings appear to be affected by the reclaim.

I think the question now becomes which cgroup these pages are accounted to. My expectation would have been that the pages are refaulted only once, and then the cgroup reclaim protection kicks in. In that case the desktop would hang once for a few seconds maybe, but recover at that point (pages might e.g. belong to user.slice as they were loaded by GDM). However, that does not appear to be happening for me right now; I may need to double-check whether I got some configuration wrong again.

So, I think this is probably something that we can actually ask the kernel people about, i.e. how does the accounting work in this case, and why isn't the reclaim protection kicking in?

I saw an interesting RFC patch series yesterday that adds a tool called DAMON. This allows applications to trace their own usage patterns and e.g. mlock data into memory. I think it is interesting in principle, but I don't expect it makes sense for us to use it directly. i.e. in the long run the kernel might use it to figure out the working set of session.slice.

See https://lkml.org/lkml/2020/2/25/303

And my traces (two traces separated by ==== and the mmaps at the start)
majfault-traces.txt

Well, sometimes it works. i.e. right now it started working for me after setting a memory limit on all child cgroups, but removing that limit again didn't get me back to the old behaviour. I'll try some reboots to see if I get a similar effect. But in principle, it seems that the reclaim protection does kick in sometimes.

The effect is funky. The cursor still moves in general and the shell is responsive. Though as applications are not responding, I usually get no cursor on top of applications which feels odd.

PS: I also tried setting io.latency. But my experimentation stopped there, when I figured I might need to write to /sys/fs/cgroup/io.cost.qos for it to take effect. And trying that on LUKS causes a NULL pointer dereference in the kernel.

OK, it was reproducible for me after a suspend/resume cycle. Maybe the protection is kind of sticky for some reason. I updated https://benjamin.sipsolutions.net/gnome-cgroup-testing.tar.xz (different URL, the first was wrong) for people to try this (with a huge 2G/4G protection).

Note that the shell is actually responding to input. So even though applications are extremely slow to respond, one can successfully select a terminal window and send a Ctrl+C to a thrashing make -j32. It takes a long time, but it does get there.

@benzea thanks for the update. It sounds both tedious and promising!

@benzea, I tested your new configuration with my "build WebKit" stress test. I think you made a significant improvement. Instead of my desktop freezing up all at once, I saw my mouse cursor stuttering for a few seconds, then gnome-shell "crashed" and returned me to the login screen. (It's not a real crash though, because I don't have a core dump in coredumpctl. Looking through my journal, I'm not really sure what happened there.)

This behavior is awful, but it's a lot less awful than an immediate unrecoverable permanent freeze, so I consider this to be pretty serious progress.

@catanzaro, hmm, maybe Xwayland decided to quit or something like that. I have also seen e.g. clients getting disconnected from wayland by the shell. So I could imagine some buggy error handling when wayland clients are not responding properly. I could take a look, but really, I expect it is an unrelated bug.

Either way, it seems like the page-reclaim protection did also kick in for you (as seen by the mouse cursor recovering after stuttering for a while; i.e. pages were re-faulted, accounted to the shell's cgroup and protected from then onwards). I'll send out a mail to the kernel people to ask whether the behaviour where we seem to need explicit protection on the children is expected.

EDIT: Posted to linux-mm and cgroups:

https://lore.kernel.org/linux-mm/d4826b9e568f1ab7df19f94c409df11956a8e262.camel@sipsolutions.net/T/#u (cgroups archive is not updated yet)

Note my mouse continued stuttering until my desktop "crashed."

OK, so a summary of things that I have learned recently:

NOTE: There are a number of other very interesting mails from Tejun in response to my cgroups/linux-mm threads. They are not very long right now, so they are good to read in their entirety. I would suggest against reading the cgroups mount options thread.

"In defence of swap: common misconceptions": there is a tl;dr at the top; Tejun and others reviewed it. Gist is we still need some swap; the open questions are how much and where, which relates to #127.

Re: LUKS/LVM, @benzea @catanzaro @chrismurphy discussed on #fedora-workstation a little bit about the need for short and long term plans on partitioning+filesystem, which relates to #54 and #82. Gist is, dropping LVM and going with plain ext4 in the short term is consistent with the ~18 month decision in #54, wouldn't be controversial, doesn't make things more complicated for #82, and relieves some cgroups IO controller limitations; Btrfs also came up in the discussion, but needs some homework (maybe recruitment) on the WG's part; and last but not least, the encryption subgroup really needs a reboot.

There are some resource control related items in systemd-homed; man homectl on Fedora 32 or homectl.xml

--umask=MASK, --nice=NICE, --rlimit=LIMIT=VALUE[:VALUE], --tasks-max=TASKS, --memory-high=BYTES, --memory-max=BYTES, --cpu-weight=WEIGHT, --io-weight=WEIGHT

In this post I want to tell you about psi2log and psi-top scripts from nohang package. These simple tools can help you to conveniently measure and log various PSI metrics.

psi-top can print PSI metrics (avg10, avg60, avg300) from all available cgroups. You can choose one metrics type (memory, io, cpu) with the --metrics flag. The default is memory.

psi2log prints PSI metrics (cpu, memory, io) from /proc/pressure (this is the default behavior) or from a selected cgroup. The program can work in two modes.
With --mode 1 (the default) psi2log prints avg10, avg60, avg300 values.
With --mode 2 psi2log can log PSI metrics over any timeframe, using the total values from the PSI files. You can set the timeframe in seconds with the --interval flag. This especially helps to record short-term spikes of pressure.
psi2log can log metrics to a logfile specified with the --log flag.
Use the --target flag to choose the cgroup whose metrics to log (the default is logging from /proc/pressure).
Use the --interval flag to choose the timeframe with --mode 2 and the interval with --mode 1.
psi2log performs mlockall() if started with sudo, so that it works correctly under high memory pressure. It's recommended to start psi2log with sudo, especially with --mode 2.
psi2log also prints the peak values observed over the entire measurement period on exit (if you press Ctrl+C).
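The /proc/pressure files these tools read have a simple two-line format; a minimal parser sketch (the line format follows the kernel's PSI interface; the sample values below are made up):

```python
def parse_psi(text):
    """Parse a /proc/pressure/{memory,io,cpu} file into nested dicts.

    Each line looks like:
        some avg10=0.00 avg60=1.23 avg300=0.50 total=123456
    where "some" means at least one task is stalled and "full" means all
    non-idle tasks are stalled; total is cumulative stall time in usec.
    """
    out = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        pairs = (f.split("=") for f in fields)
        out[kind] = {k: (float(v) if "." in v else int(v)) for k, v in pairs}
    return out

sample = ("some avg10=0.00 avg60=1.23 avg300=0.50 total=123456\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=5000")
metrics = parse_psi(sample)
```

On a live system with PSI enabled, the text would come from reading e.g. /proc/pressure/memory or a cgroup's memory.pressure file.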

Next I will give an example of using psi-top and psi2log.

I ran tail /dev/zero (with swap space mounted) and measured memory/io PSI metrics from each cgroup at the end:

$ tail /dev/zero; psi-top --metrics memory; psi-top --metrics io

Output: https://gist.github.com/hakavlad/7fb5ee0303f8bfce524614a912de8d8d

While tail /dev/zero was running, I recorded PSI metrics using psi2log.

Just log avg10, avg60, avg300 from /proc/pressure:

$ sudo psi2log --mode 1 --log 1.log

Log: https://gist.github.com/hakavlad/e3d58330a2063f9097792430afa78eb4

Log metrics from /user.slice with the default 2-second timeframe:

$ sudo psi2log --mode 2 --log 2.log --target /user.slice

Log: https://gist.github.com/hakavlad/63317f3c98bd78013e1a21b284b6ab96

Run psi-top -h and psi2log -h to get help.

Source code:
https://github.com/hakavlad/nohang/blob/master/tools/psi-top
https://github.com/hakavlad/nohang/blob/master/tools/psi2log

@benzea You can use these tools to compare PSI metric values across various tests and easily share PSI measurement results.

Let's talk about PSI spikes and this bug in LMM.
See https://gist.github.com/hakavlad/63317f3c98bd78013e1a21b284b6ab96.
In this log you can see a memory pressure spike that happened 20 seconds after the memory hog was killed.
Short-term bursts of pressure are possible during normal operation of a system with good responsiveness, after active use of the swap space.

2020-03-07 12:19:38,038:   1.1 |  14.9  14.0 |  16.0  15.4 | 2.003
2020-03-07 12:19:40,045:   0.4 |   2.5   2.5 |   4.5   4.5 | 2.007
2020-03-07 12:19:42,049:   0.5 |   2.7   2.5 |   3.0   2.8 | 2.004
2020-03-07 12:19:44,054:   1.0 |   1.7   1.7 |   1.9   1.9 | 2.005
2020-03-07 12:19:46,058:   1.1 |   0.7   0.7 |   1.6   1.6 | 2.004
2020-03-07 12:19:48,066:   0.4 |   1.2   1.1 |   2.0   1.9 | 2.008
2020-03-07 12:19:50,071:   1.0 |   1.2   1.2 |   1.2   1.2 | 2.005
2020-03-07 12:19:52,074:   0.8 |   4.4   4.0 |   5.2   4.8 | 2.003
2020-03-07 12:19:54,077:   0.7 |  95.9  92.3 |  95.8  92.3 | 2.003 <--[1]
2020-03-07 12:19:56,080:   0.5 |   7.5   6.4 |   8.4   7.3 | 2.002

[1] This is a memory pressure spike. At this point the system has plenty of available memory, yet at this moment low-memory-monitor would kill an innocent victim. @hadess

It was a systemd session problem which has since been fixed

No, the problem is LMM's incorrect behavior.

the functionality is disabled anyway

This indicates that you acknowledge the presence of a bug.

Android's lmkd and Endless' psi-monitor also use the memory pressure information

Endless' psi-monitor uses the avg10 metric. avg10 is not subject to short-term fluctuations, in contrast to the super-short timeframes used in LMM.

How to fix the bug in low-memory-monitor?

  • Use longer timeframes (maybe 5-10 sec);
  • Use higher thresholds (maybe 40% for partial stall or 25% for complete stall to start OOMK triggering);
  • Don't kill if MemAvailable >= 5% MemTotal (instead of 50% now).
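Taken together, the suggestions amount to a simple guard; an illustrative sketch (not LMM code; avg10 of the "full" metric stands in for the longer timeframe, and the 25% and 5% numbers are the ones suggested above):

```python
def should_trigger_oom(full_avg10, mem_available, mem_total,
                       psi_threshold=25.0, mem_fraction=0.05):
    """Trigger the OOM killer only on a sustained complete stall AND
    genuinely low MemAvailable, so short pressure spikes are ignored."""
    low_mem = mem_available < mem_total * mem_fraction
    return low_mem and full_avg10 >= psi_threshold

# A short spike with plenty of free memory (like the log entry above)
# must not trigger; sustained stall with almost no memory left must.
spike_only = should_trigger_oom(full_avg10=95.0,
                                mem_available=4_000_000, mem_total=8_000_000)
real_oom = should_trigger_oom(full_avg10=30.0,
                              mem_available=200_000, mem_total=8_000_000)
```

With this logic the 12:19:54 spike in the log above would be ignored, because MemAvailable was still high at that moment.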

The bug is also described there and here.

Update:
Maybe https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/8 and bug that I described are different bugs.
Maybe "LMM killing all instances of gnome-terminal" and "I ran tail /dev/zero from GNOME Terminal and the whole Terminal app definitely quits" (see here) are the same problem. And this has nothing to do with LMM.
This bug is not documented and not yet fixed if LMM works with TriggerKernelOom=true.

Update2:
I've reproduced this bug just now with the latest LMM on F31. I ran tail /dev/zero, and tail was killed by the OOMK triggered by LMM. About 20 seconds later I just clicked on the Dash, and the session was killed. Next I saw the GDM display. I logged in, tried to start a terminal, and the session was killed again.
LMM vs gnome-shell: LMM wins: https://imgur.com/a/0nQDkio.

Update3:
https://gitlab.freedesktop.org/hadess/low-memory-monitor/issues/10.

Linux-ck with MuQSS provides incorrect PSI metrics: https://imgur.com/a/atIjhUw.
The CPU metric is always about 100%. After a stress test, "some" memory is always about 100%. "Full" io is always about 0.
