Systemd 248 RC3: systemd-oomd is now considered fully supported

[+] marcodiego|5 years ago|reply

Linux OOM behavior has always drawn complaints[0]. Hopefully this finally fixes the desktop case[1].

[1] https://lkml.org/lkml/2019/8/4/15

[+] bscphil|5 years ago|reply

For reference, [1] describes the SSD near-OOM thrash problem, where the speed of refaulting file pages is so high that the kernel detects enough activity and doesn't trigger OOM.

It's a heck of a problem to deal with, and an OOM killer will go part of the way to fixing it, but I still have two complaints about the thrash problem.

1. In similar situations on other operating systems, I don't usually see the GUI freeze to an unusable state. In Linux hitting a thrash situation will freeze the computer to the point where Ctrl-Alt-F2 will not switch the vterm even when waiting several minutes, and it has to be rebooted. I suppose it's possible that I just haven't used other OSes enough in a long time and they have this too, but this seems like a solvable problem. I assume it's not a CPU issue, since the scheduler should handle that just fine. So why not have a mechanism by which the user's desktop environment and init / login system can reserve itself enough memory to always be responsive under any memory condition?

2. Browsers. The state of browser memory management is atrocious. Just about every time I've run out of memory (other than compiling some complex software with -O3) it's been because my browser is hogging a huge chunk of it (even if the OOM killer blames something else). Now, I understand that in theory unused memory is wasted memory. So a browser using a ton of memory can be a good thing. But this is only true when memory the browser is holding but not using can be reclaimed to be used by another program, and this seems to basically never happen. If I close and reopen the browser (with the tabs automatically restored) the memory usage drops by 80% or more; the issue is that the browser simply never suspends tabs I haven't used in days or weeks. Modern browsers are not good "desktop citizens", if you will. They hog memory to the point where opening anything new will create a pressure stall.

I suspect solving these two issues would largely fix thrashing for most desktop users, without the need to ever kill anything, which is obviously undesirable.

Feel free to tell me if one or both of these is incorrect or impossible. These are just my observations, I'm not a kernel dev or anything close to one.

[+] alerighi|5 years ago|reply

In that regard Linux is much worse than Windows. In Windows you basically never run out of memory in a way that locks up the system, since the OS reserves some memory for its own functions. It will block applications, but the operating system remains responsive.

In Linux I have never been able to get similar behaviour. No matter how aggressively I set the swappiness, the system didn't begin to swap until it was practically locked up, with the mouse pointer lagging.

Windows reserves some memory for its functions, for example the UI (maybe the fact that the GUI is in the kernel helps), and thus doesn't have this kind of problem, especially with an SSD (and I know that a page file on an SSD is not ideal... but I had a disk swapping for 5 years and never had a problem).

[+] warmwaffles|5 years ago|reply

[0] was a humorous read. I love these little analogies.

[+] hedora|5 years ago|reply

> A concept of system extension images is introduced. Such images may be used to extend the /usr/ and /opt/ directory hierarchies at runtime with additional files (even if the file system is read-only). When a system extension image is activated, its /usr/ and /opt/ hierarchies and os-release information are combined via overlayfs with the file system hierarchy of the host OS.

Oh, what fresh hell is this?

[+] viraptor|5 years ago|reply

It's basically the ostree idea on a smaller scale. Did you ever want to install some proprietary software with access to system libraries/config, but without actually polluting the global namespace? It's this. Like running a docker service with host as your base layer. (Minus other namespaces)

Honestly, it seems great and not very tricky for a system which already manages namespaces with private mounts.

[+] IgorPartola|5 years ago|reply

Just because it comes from systemd doesn’t mean it’s a bad idea. OpenWRT and other embedded Linux systems use overlay file systems very effectively.

[+] 2ion|5 years ago|reply

It's a fine concept. Good for replacing stuff like appimage besides the packaging part?

Mixing this with OS package management will require some consideration.

So they are using overlayfs, which means that "upper" directories have visibility precedence on file name conflicts. That means some auditing facility should be required that alerts if an upper system extension image masks a lower one, or if an OS package some time later installs a file to the "lower" directory, perhaps the original rootfs, but a system extension image still takes precedence…
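A rough sketch of what such an audit could look like, assuming you have the lower and upper trees available as plain directories. The function names are made up for illustration, and this only compares file paths; it ignores overlayfs whiteouts and opaque directories, so treat it as a starting point rather than a real tool:

```python
import os
from typing import Iterator, List


def relative_files(root: str) -> Iterator[str]:
    """Yield every file path under root, relative to root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.relpath(os.path.join(dirpath, name), root)


def masked_paths(lower: str, upper: str) -> List[str]:
    """Return relative paths that exist in both trees.

    In an overlay mount the upper copy wins, so each returned path is a
    lower-layer file that is hidden by the upper layer.
    """
    lower_files = set(relative_files(lower))
    return sorted(p for p in relative_files(upper) if p in lower_files)
```

Calling `masked_paths("/path/to/lower", "/path/to/upper")` (both paths are placeholders) gives the list of files a package manager hook could warn about.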

[+] chunkyks|5 years ago|reply

To be fair overlay fs is a good way to manage network boot / nfs root systems, live cds... All sorts of things where I want an underlying fs that clients can't mess with but they still need to think they can write to it

[+] crooked-v|5 years ago|reply

There's actually a directly analogous case in the gaming world, of all things: the app Mod Organizer 2 uses a runtime-only virtual file system to layer together the files from different mod packages, making it easy to adjust for and dynamically disable/enable overlapping mod resource files (of which there can be a huge number in games like Skyrim), as well as redirecting output files from mods to a sane holding area instead of letting them pollute your game installs.

[+] diegocg|5 years ago|reply

If you want to understand why an oomd daemon is necessary and the kernel OOM killer alone is not enough, you might be interested in this talk from FB, "Linux memory management at scale": https://youtube.com/watch?v=cSJFLBJusVY

[+] Foxboron|5 years ago|reply

The finished talk with a (short) Q&A can also be found here:

https://media.ccc.de/v/arch-conf-online-2020-6390-linux-memo...

Q&A Questions: https://gitlab.archlinux.org/archlinux/conf-files/-/blob/mas...

[+] noobermin|5 years ago|reply

Can anyone provide a summary or short set of reasons? Proof by link to 40 minute video doesn't feel satisfactory.

[+] wooptoo|5 years ago|reply

They missed the opportunity to call it systemd-doom.

[+] hhh|5 years ago|reply

systemd-oom

[+] znpy|5 years ago|reply

I have some questions.

I started reading about systemd-oomd, then followed the link to its manpage (at https://www.man7.org/linux/man-pages/man8/systemd-oomd.8.htm...), then followed a link from there to this page: "In defence of swap: common misconceptions" (at https://chrisdown.name/2018/01/02/in-defence-of-swap.html).

The first line of the tl;dr is: "Having swap is a reasonably important part of a well functioning system. Without it, sane memory management becomes harder to achieve."

Now, afaik Linux as a kernel is pretty much designed to have swap and is really not meant to be run without it. Except disabling swap is required to run Kubernetes, so much so that the kubelet will plain refuse to start if it detects swap enabled.

Why?

[+] aidenn0|5 years ago|reply

Linux works fine without swap. It also works fine with swap. It does not work fine under memory contention regardless of swap, but it is much, much worse if you have many GB of swap on a spinning disk.

Note that managing resources is much harder when swap is in the picture, because you can get the resident memory of a process, but not the swap usage of a process.

It used to be that the recommendation for swap was 2x RAM. This recommendation was from the days when single-digit megabytes of RAM were common, and also from when Linux core dumps were both expected and stored to swap.

Fast forward to 2010ish. You might have a 64GB or 128GB SSD, but you probably also have a spinning drive. You have 8 or 16GB of RAM, so your 16-32GB swap would be 1/4 of the SSD; plus you heard that putting swap on an SSD wears it out faster, so you put a 16-32GB swap file on the spinning drive.

Let's say you are developing software and you accidentally allocate in a tight loop. At some point the majority of the pages in RAM not from the malloc loop end up in swap, so you try to do something to kill it, but Xorg and xterm both have their pages scattered across spinning drives, and your program is allocating memory faster than your machine swaps in. If you are patient enough to not hard reset, an hour later swap fills up and the OOM killer starts killing programs, and probably (but not definitely) kills the buggy malloc loop before it kills something more vital.

You are annoyed, so you disable swap and do the same thing: the OOM killer probably kills just your web browser (which is virtually indistinguishable from a malloc loop with Web 2.0) and the malloc loop. The system recovers in minutes. You now swear off ever running with swap again.

Today: If you have a spinning disk, 1GB of swap on it is fine, but zram is another good alternative. If you have slow and/or fragile SSD storage (e.g. sdcard or eMMC), zram is a good option. If you have fast NVMe, I hear that you can go fairly crazy with swap sizes and have a system that works well under memory pressure, but I have not tried it myself.

[+] geofft|5 years ago|reply

This is one of the primary reasons the Kubernetes folks are working on letting you run with swap! See recent discussion in https://github.com/kubernetes/kubernetes/issues/53533 .

The historic reason Kubernetes didn't let you run with swap is that it's complicated from a resource management perspective - it's not clear whether each pod should be requesting some amount of swap the same way it requests some amount of RAM, and also the container runtimes didn't really have support for that (largely because you need cgroupv2 to have a hope of that sort of per-cgroup swap resource control working at all). But it's not an insurmountable problem, and it sounds from that ticket like they recognize that letting systems have swap enabled for the sake of a userspace oomd is an important use case.

[+] xyzzy_plugh|5 years ago|reply

Why is Linux not meant to be run without swap? In 2021? I've never heard of this. I've operated massive clusters of thousands of hosts with no swap enabled with zero hiccups.

Swap comes with a lot of caveats -- when a performance-critical process begins swapping it can on occasion be more useful to just let it OOM instead.

My personal desktop has 64GB of memory, with no swap enabled. Why would I want swap if I don't need more than that?

[+] iio7|5 years ago|reply

Could they maybe start handling the 1400+ bugs that have been part of systemd for such a long time (some unresolved since 2015) rather than keep adding and adding and adding and adding code?

[+] jabiko|5 years ago|reply

There are a lot of open issues but you are a bit unfair in calling them bugs.

If you filter out the RFEs (Requests For Enhancement) you find 617 issues. If you filter for only issues labeled as a bug you get 97 results.

[+] zokier|5 years ago|reply

People add features they need, and fix bugs that impact them. Scratching your own itch is the fundamental principle in open source. I recommend you to do the same.

[+] jbverschoor|5 years ago|reply

Feel free to contribute to open source projects!

[+] The_rationalist|5 years ago|reply

How will systemd-oomd communicate with the user? I was using https://github.com/hakavlad/nohang and it sent a notification telling you to save your work before application X got killed. (I always wanted a way to click somewhere on the notification to prevent nohang from killing it when unwanted.) Also, nohang uses PSI data; does oomd do that too?

Finally, the author of nohang is working on arguably superior solutions: https://github.com/hakavlad/prelockd https://github.com/hakavlad/memavaild https://github.com/hakavlad/le9-patch It's unfortunate that systemd has chosen to integrate an inferior solution.

[+] yawaramin|5 years ago|reply

I actually don't understand why systemd-oomd is needed given MemoryMax= ( https://man7.org/linux/man-pages/man5/systemd.resource-contr... ). What are they doing differently?

[+] geofft|5 years ago|reply

I think the distinction is that MemoryMax= is just an interface to the cgroupv2 setting, i.e., that rule is implemented inside the kernel and invokes the kernel's OOM killer within a cgroup. The manpage for systemd-oomd says, "systemd-oomd is a system service that uses cgroups-v2 and pressure stall information (PSI) to monitor and take action on processes before an OOM occurs in kernel space."

It looks like systemd-oomd is related to (based on? from the same people as?) Facebook's oomd https://github.com/facebookincubator/oomd , whose documentation gives a bunch of reasons as to why you would prefer a userspace oomd that takes in PSI data and can be configured to proactively kill misbehaving processes instead of just letting the kernel OOM killer handle it. The major reason is time to recovery: a misbehaving process can cause a system to be so far under pressure that the kernel OOM killer will take a long time to flush things out, but a userspace component can respond in advance with more configurable rules (and more flexibility, since the kernel doesn't believe you're at capacity yet).
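To make the PSI part concrete, here is a minimal sketch of reading the kernel's memory pressure file and applying a simple rule to it. The file path and line format are the kernel's PSI interface; the `under_pressure` policy and its 60% threshold are assumptions for illustration only and are not how systemd-oomd or Facebook's oomd actually decide:

```python
from typing import Dict

# Present on kernels built with PSI support.
PSI_MEMORY_PATH = "/proc/pressure/memory"


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI lines such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def under_pressure(psi: Dict[str, Dict[str, float]], threshold: float = 60.0) -> bool:
    """Hypothetical policy: act when 'full' stall time over the last 10s exceeds threshold (%)."""
    return psi.get("full", {}).get("avg10", 0.0) > threshold


def read_memory_psi(path: str = PSI_MEMORY_PATH) -> Dict[str, Dict[str, float]]:
    """Read and parse the memory pressure file."""
    with open(path) as f:
        return parse_psi(f.read())
```

A userspace daemon would poll something like `read_memory_psi()` and act while the kernel still thinks there is memory left, which is the "respond in advance" point above.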

[+] danobi|5 years ago|reply

When a cgroup hits its memory.max the kernel OOM killer is invoked. systemd-oomd enables finer grained policy than the kernel OOM killer. systemd-oomd can also prevent livelocks (which memory.max does not).
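For anyone who wants to see how close a cgroup is to that limit, a small sketch that reads the cgroup v2 `memory.max` and `memory.current` files. The helper names are made up, and the cgroup directory you pass in depends on your own hierarchy:

```python
import os
from typing import Optional


def parse_memory_max(raw: str) -> Optional[int]:
    """memory.max holds either the literal 'max' (no limit) or a byte count."""
    value = raw.strip()
    return None if value == "max" else int(value)


def headroom(cgroup_dir: str) -> Optional[int]:
    """Bytes left before the cgroup hits memory.max, or None if unlimited."""
    with open(os.path.join(cgroup_dir, "memory.max")) as f:
        limit = parse_memory_max(f.read())
    with open(os.path.join(cgroup_dir, "memory.current")) as f:
        current = int(f.read().strip())
    return None if limit is None else limit - current
```

Note this only tells you about the hard limit. It says nothing about pressure or stalls, which is the gap a PSI-based daemon is meant to fill.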

[+] Klwohu|5 years ago|reply

That change log is HUGE and it’s the main reason we don’t use systemd. Creeping featurism isn’t good in your OS when you need things to be consistent and work the same every time. Systemd has a lot of state.

[+] dgellow|5 years ago|reply

I've been very happily using systemd for a bit less than a decade now. It's fantastic to have a common standardized way to manage your services and jobs that doesn't rely on a fragile set of scripts, and unified logs that can be easily queried and exported to a machine-readable format.

[+] Jasper_|5 years ago|reply

Wait until you see the Linux kernel changelogs...

[+] greyw|5 years ago|reply

Who is we? I use systemd on the operating systems I use (debian, centos and opensuse)

[+] p4cmanus3r|5 years ago|reply

Am I the only one excited about support for luks via TPM?

[+] flippinburgers|5 years ago|reply

Systemd is such a wonderful example of "if I didn't develop it, it isn't good enough."

[+] chunkyks|5 years ago|reply

Soon it'll include a browser and email client. Can't wait!

[+] rkangel|5 years ago|reply

The issue here is branding, not engineering or design.

Systemd is an umbrella under which many key userspace tools are being developed. The actual init system itself is a lot bigger than the old SysV init system was, but it is separate from all these other things that also say Systemd on them.

The best analogy is GNU. You don't complain that GNU contains a compiler, and also a text editor, and also a C system library, do you? GNU is a project that develops a load of components that play well with each other. Systemd is the same.

[+] __turbobrew__|5 years ago|reply

GNU/Linux will soon be replaced by Systemd/Linux

[+] hexo|5 years ago|reply

Every time I see all the stuff packed into it I feel an extreme urge to switch to FreeBSD again.

Oomd, oh yeah, good idea, but WHY does it have to be in systemd? The same applies to about 3/4 of the things in there. Because of this I've started to call it LennartOS. And I also don't feel like Linux is about freedom anymore, but more like some sort of weird power display game.

128 comments