#662 [F34] systemd-oomd test week
Closed: Fixed 3 years ago by sumantrom. Opened 3 years ago by chrismurphy.

Change:
https://fedoraproject.org/wiki/Changes/EnableSystemdOomd

  • Applies to all editions and spins
  • Applies to new installs and upgrades
  • Contingency is final freeze, 6 Apr
  • Everyone's system is different, every workload is different - this poses some challenges in developing test cases.
    • Idea for a base test case that helps users develop their own: add tasks (provide examples), keeping track of the tasks and the order they were added, until oomd kills something.

Test case creation is mainly up to the change owners. Fedora QA helps coordinate the test day/week and, on a best-effort basis, helps with formatting and sanity-checking the test cases, so that testers don't get confused and provide useful data. Example test cases. QA will want to see the test cases written up at least a week before the start of the test week.

I'm thinking the week of March 15. End of that test week will give two weeks before final freeze.

@dcavalca @salimma @anitazha


Metadata Update from @sumantrom:
- Issue set to the milestone: Fedora 34
- Issue tagged with: test days

3 years ago

Metadata Update from @sumantrom:
- Issue assigned to sumantrom

3 years ago

Metadata Update from @sumantrom:
- Issue tagged with: test cases

3 years ago

I'm still in the midst of working out some test cases and will have that done this week. The week of March 15 for the test week sounds great. @sumantrom is there anything else we need to get this set up?

My 2 cents

Example results page that users fill in
Wiki page that users read for any instructions/prerequisites

Prerequisites for the wiki:
* confirm earlyoom is disabled
* version of systemd and kernel to test? not sure if we want to narrow the focus; it'll be a 5.11 kernel that will ship in Fedora, but it might be ok if folks want to be testing a 5.12-rc kernel too. Any preference, Anita?

Profile: any special info to ask for? [1] Maybe it'd be helpful to see results at a glance if they include something like 'diskswap, zramswap, noswap, bothswap' ? Or leave it for any bug report?

What sorts of information to include in bug reports? Ideally a list of commands they can just run and copy paste into the bug report.

For tests, I imagine the hard part is finding a way to trick oomd into doing something wrong: for example, oomd picks the wrong process to terminate, or it should have picked something to terminate but doesn't terminate anything. It might also be interesting to see whether tests can be run concurrently and still give the expected results (or not?).

One of our hard-hitting tests is to build webkitgtk. Mock can help with this if we give the user the specific command to run; it'll download everything needed. The build pretty much always does poorly with its defaults: it overcommits the hardware I have, and system responsiveness is poor. Is that a useful test to include?

[1]
Tests to be run on: noswap (default Cloud), zram-based swap (default everything else), disk-based swap, and combined zram-based and disk-based swap (happens when upgrading from F32 and older to F33 and newer). Obviously the user doesn't have to reconfigure their machine to run all four possibilities, and bug reports will contain detailed info.
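
As a rough illustration of the four swap profiles above, here is a minimal sketch that classifies a system from the contents of /proc/swaps. The function name `swap_profile` is hypothetical, and the rule is an assumption: any device named /dev/zram* counts as zram-based and everything else (partitions or swap files) counts as disk-based.

```shell
# Classify swap configuration from /proc/swaps-style input on stdin.
# Prints one of: noswap, zramswap, diskswap, bothswap.
swap_profile() {
    zram=0
    disk=0
    # The first line of /proc/swaps is a column header; skip it.
    read -r header || true
    while read -r name rest || [ -n "$name" ]; do
        [ -z "$name" ] && continue
        case "$name" in
            /dev/zram*) zram=1 ;;   # assumption: zram devices are named /dev/zramN
            *)          disk=1 ;;   # anything else is treated as disk-based swap
        esac
    done
    if [ "$zram" = 1 ] && [ "$disk" = 1 ]; then
        echo bothswap
    elif [ "$zram" = 1 ]; then
        echo zramswap
    elif [ "$disk" = 1 ]; then
        echo diskswap
    else
        echo noswap
    fi
}
```

On a live system this would be run as `swap_profile < /proc/swaps`, and the resulting label could go in the profile field of the results page.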

It would have to be systemd 248~rc2 at this point. And I think either kernel would work as long as it gets noted in the test results. Same for swap configuration, though I expect no swap cases to be finicky.

Commands... I need to think about, but probably some combination of journalctl, oomctl, and free so I can see what the host state is like.

Concurrent tests could be interesting but might give varying results between systems when it comes to what systemd-oomd kills. There is a 15 sec delay between each kill systemd-oomd performs. But maybe that's okay? How does building webkitgtk work? I'd be interested to observe it on my host and see how bad it is first.
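
To make the "commands they can just run and copy paste" idea concrete, here is a minimal sketch built around the tools mentioned above (journalctl, oomctl, free) plus the prerequisite checks (earlyoom, systemd and kernel versions). The exact list and the function names `report_commands` and `collect_report` are assumptions for illustration, not an agreed set.

```shell
# Print the diagnostic commands, one per line. Edit the list to taste.
report_commands() {
    cat <<'EOF'
systemctl --version
uname -r
systemctl is-enabled earlyoom.service
free -h
swapon --show
oomctl
journalctl -b -u systemd-oomd.service --no-pager
EOF
}

# Run each command and print its output under a "$ command" heading,
# so the whole transcript can be pasted into a bug report.
collect_report() {
    report_commands | while IFS= read -r cmd; do
        printf '$ %s\n' "$cmd"
        # Keep going if a command fails or is missing; that is useful to see too.
        sh -c "$cmd" </dev/null 2>&1 || true
        echo
    done
}
```

A tester would run `collect_report > oomd-report.txt` (the file name is arbitrary) right after a test and attach the result.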

> I'm still in the midst of working out some test cases and will have that done this week. The week of March 15 for the test week sounds great. @sumantrom is there anything else we need to get this set up?

Hey Folks,

I will set up the test day pages and the app as usual by today (my time). Let's have it on the 18th. Does that work for all of you?

The wiki link is https://fedoraproject.org/wiki/Test_Day:2021-03-18_Systemd-OOMd_Test_Week
I have added this on QA fedocal as well.
I am going to take a stab at the test cases today and post some here; reviews will be much needed, needless to say :D

I wrote up 2 test cases if there's a good way to link these in to the results page:
https://fedoraproject.org/wiki/QA:Testcase_Swap_Based_Killing
https://fedoraproject.org/wiki/QA:Testcase_Memory_Pressure_Based_Killing

I ended up using stress-ng for the memory pressure test but I don't like it since I have to disable the swap policy. I'll keep thinking about it and see if I can come up with something better.

@sumantrom It looks like the results page link is incorrect in the wiki.

> How does building webkitgtk work? I'd be interested to observe it on my host and see how bad it is first.

There is a way to do it with mock. It's easier for users, since it automates everything, but I also think it uses systemd-nspawn for some things? And therefore might not be the same kind of test as building from source, which is what I do.

Basically ninja defaults to y jobs, where y is x CPUs + 2. On my test system that's 10 jobs, which is really 20 threads because it's 2 threads per job, with only 12G RAM (let alone the 8G it had just six months ago). About halfway through, the jobs start needing around 1G each, sometimes more, sometimes less. This involves a lot of memory, CPU, and IO, and the dynamic is quite different between disk-based and zram-based swap. It is possible to get a system "wedged" where the kernel decides not to kill anything off, yet makes so little forward progress that without a user space oom killer it could be hours, days, maybe weeks before it completes. During that time there are long periods, 10-30 minutes, where it does not respond to anything, and killing it means the power button. :D it's awesome!
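
The arithmetic above can be sketched as follows, assuming the "CPUs + 2" rule as stated and the rough figure of about 1G per job at peak. Both helper names are hypothetical.

```shell
# Default job count as described above: number of CPUs plus 2.
default_jobs() {
    echo $(( $1 + 2 ))
}

# Rough peak memory in GiB if every job needs about per_job_gib at once.
# Usage: peak_gib <jobs> <per_job_gib>
peak_gib() {
    echo $(( $1 * $2 ))
}
```

With the numbers from this comment, `default_jobs 8` gives 10 jobs and `peak_gib 10 1` gives roughly 10G, which leaves very little headroom on a 12G machine once jobs need more than 1G. On a real system the CPU count could come from `nproc`, and ninja's `-j N` option can cap the job count if the goal is to finish the build, not to stress the machine.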

> @sumantrom It looks like the results page link is incorrect in the wiki.

Thanks for the test cases
Here's the test results page
https://testdays.fedoraproject.org/events/105

So, are we all set here?
Do you need some more test cases?
Is there something in the works that you want me to add to the result matrices?

I might want to add another test case if I can figure it out today. What is the hard stop deadline?

There isn't one, just add the link here and I will do the needful. The whole idea is, tomorrow my time I will start with blogs, announcements, and so on and so forth.

@anitazha I'm following https://fedoraproject.org/wiki/QA:Testcase_Memory_Pressure_Based_Killing and have a few questions.

> Boot the system and log in as a regular user.

GNOME has the concept of admin and standard users. Since the test is going to be generic, including Server edition and other desktops, is the idea that the tests should not be run as root, i.e. as any user other than root?

sudo printf "[Slice]/nManagedOOMSwap=auto" > /etc/systemd/system/-.slice.d/99-test.conf

I get:
-bash: /etc/systemd/system/-.slice.d/99-test.conf: No such file or directory
I'm not sure where it should go.

[chris@fedora ~]$ systemd-run --user --unit -r systoomd_mempressure_test /usr/bin/stress-ng --brk 1 --stack 1 --bigheap 1 -t 90s

I get:
Failed to find executable systoomd_mempressure_test: No such file or directory
I'm not sure what should provide this file, stress-ng-0.12.04-1.fc34.x86_64 is installed.

Likewise in https://fedoraproject.org/wiki/QA:Testcase_Swap_Based_Killing
systoomd_swap_test isn't found on the system; but maybe I'm just checking it out too early.

Thanks for trying it out @chrismurphy. I made some typos in the commands; they are corrected now.

As for which users to run as, running as root is fine as long as it triggers a login session since the pressure policy is applied on user@.service. So if a user logs in as root they should be able to run the tests and achieve the same result as long as user@0.service was started.
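
The thread does not show the corrected commands, so the following is only a sketch of one plausible fix for the errors reported above, not the wiki's actual text. Three things stand out in the quoted commands: the drop-in directory did not exist, `/n` was written where `\n` was intended, and with `sudo printf ... > file` the redirect is performed by the unprivileged shell, not by sudo. The helper name `write_swap_dropin` is hypothetical.

```shell
# Create the drop-in directory if needed, then write the [Slice] override.
# Usage (from a root shell):
#   write_swap_dropin /etc/systemd/system/-.slice.d && systemctl daemon-reload
write_swap_dropin() {
    dir=$1
    mkdir -p -- "$dir" &&
        printf '[Slice]\nManagedOOMSwap=auto\n' > "$dir/99-test.conf"
}

# For the systemd-run error: in the quoted command, --unit takes the next
# word as its argument, so "-r" became the unit name and the intended unit
# name was treated as the executable. Putting -r (--remain-after-exit)
# before --unit avoids that:
#   systemd-run --user -r --unit systoomd_mempressure_test \
#       /usr/bin/stress-ng --brk 1 --stack 1 --bigheap 1 -t 90s
```

The function takes the directory as a parameter so it can be tried against a scratch directory before touching /etc.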

This test week went great; the results are now in the wiki: https://fedoraproject.org/wiki/Test_Day:2021-03-18_Systemd-OOMd_Test_Week

Any notes before closing the issue?

Awesome!
Thanks for all the help @chrismurphy and @anitazha :)

Metadata Update from @sumantrom:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
