top | item 20620545

The Linux kernel's inability to gracefully handle low memory pressure

647 points| emkemp | 6 years ago |lkml.org | reply

455 comments

[+] bArray|6 years ago|reply
Similarly, there are many annoying Linux bugs:

`pthread_create` can sometimes return a garbage thread value or crash your program entirely, with no way to catch or detect it [1]. High-speed threading is hard enough as it is without the kernel acting non-deterministically.

Un-killable processes after copy failure (D or S state) [2]. If the kernel is completely unable to recover from this failure, is it really best to make the process hang forever, where your only available option is to restart the machine? I ran into this with a copy onto a network drive with a spotty connection; the actual file itself really didn't matter, but there was no way to tell the kernel this.

Out Of Memory (OOM) "randomly" kills off processes without warning [3]. There doesn't appear to be a way to mark something as low-priority or high-priority, and if you have a few things running, it's just "random" what you end up losing. From a software-writing standpoint this is frustrating to say the least, and makes recovery very difficult: who restarts whom, and how do you tell why the other process is down?

[1] https://linux.die.net/man/3/pthread_create

[2] https://superuser.com/questions/539920/cant-kill-a-sleeping-...

[3] https://serverfault.com/questions/84766/how-to-know-the-caus...

[+] idoubtit|6 years ago|reply
Point 3 is wrong. OOM killing is not random. Each process is given a score according to its memory usage, and the highest score is chosen by the kernel. The way to mark priority in killing is to adjust this score through /proc. All of this is documented in `man 5 proc` from `/proc/[pid]/oom_adj` to `/proc/[pid]/oom_score_adj`.

http://man7.org/linux/man-pages/man5/proc.5.html

[+] throwaway2048|6 years ago|reply
There is now a way to mark processes as OOM-killer exempt

https://backdrift.org/oom-killer-how-to-create-oom-exclusion...

Part of the issue with processes stuck in D state (waiting for the kernel to do something) is that it is deeply tied into kernel assumptions about things like NFS. NFS is stateless: theoretically, servers can appear and disappear at will, and operations will keep working when one comes back. You can make NFS a lot less annoying in this regard by mounting it with the soft or intr flags, but then if the network disappears or hiccups you WILL lose data (the network is NEVER reliable; in fact the entire model of NFS is arguably wrong to begin with).

[+] joshumax|6 years ago|reply
What are you referring to regarding pthread_create()? Last time I checked, it returns an undefined thread handle only when giving a nonzero return code, which, while certainly not handled in a lot of multi-threaded applications, can be checked before anything else is done with the newly created thread.
[+] epiphanitus|6 years ago|reply
What do you recommend doing when Linux freezes? It doesn't come up a lot, but when it does it can be kind of unnerving, since the three-finger salute doesn't work.

I would also love to know if anybody has a solution for getting video to play properly in Firefox. I know it's not a bug per se, but it would be nice to not have to switch between browsers all the time.

I've been using Ubuntu for about a year now, and otherwise it's been a very positive experience.

[+] azinman2|6 years ago|reply
I have to say, reading all these replies about the OOM killer makes Linux look quite bad. These proc scores are not an elegant solution. I far prefer Darwin’s launchd which lets you set actual memory limits (soft and hard) that gives you warnings before you cross a threshold. Now this is more consumer OS oriented, but something equivalent for servers that let you express preferences in a more natural way seems desirable.
[+] altmind|6 years ago|reply
There is nothing more frustrating than unkillable processes stuck in iowait (D state). There's no reason for this behavior to exist. And it's so easy to hang forever: the network blinks, your NFS client gets stuck, and your programs do too.
[+] Animats|6 years ago|reply
Ah, yes, that bug.

Few programs can handle a failure return from "malloc", and Linux perhaps tries too hard to avoid forcing one. Most programs just aren't very good at getting a "no" to "give me more memory". Browsers should be better at this, since they started using vast amounts of memory for each tab.

I used to hit a worse bug on servers. If you did lots of MySQL activity, so that many blocks of open files were in memory, and then started creating processes, you'd often hit a situation where the Linux kernel needed a page of memory but couldn't evict a file block due to some lock being set. Crash. That was years ago; I hope it's been fixed by now.

[+] fluffything|6 years ago|reply
> Browsers should be better at this,

Browsers are quite good at this, actually. Major web browsers run on Windows (even 32-bit Windows!), where there is no overcommit, so malloc can return "no" at any time, which happens quite often when you are limited to 4 GB of memory per process.

The only apps that suck at this are Linux-only apps that are never used anywhere else and just assume that all Linux systems have overcommit enabled.

[+] simias|6 years ago|reply
>Most programs just aren't very good at getting a "no" to "give me more memory"

I suspect that overcommiting is one of the reasons for this though. Many programmers in the Linux world have integrated that "malloc can't fail" and the only error handling they bother doing is calling abort() if malloc fails.

Of course the fact that C doesn't provide any sane way to implement error handling probably doesn't help.

[+] JJMcJ|6 years ago|reply
> vast amounts of memory for each tab

What underlies this? I am astounded to see 1GB of memory returned when I close a couple of tabs.

Chrome and Firefox both seem like this.

[+] zwaps|6 years ago|reply
I think that's not that bug at all. When memory runs out, the entire system stalls, including the UI, but nothing crashes. If these stalls are frequent, the system is basically frozen.

I have this in Matlab on Linux. Matlab can actually deal with worker processes being killed, but my machine just locks up. Therefore, we have to run these specific simulations under Windows, where this doesn't occur.

[+] HugThem|6 years ago|reply
I witnessed MySQL bringing Linux servers down too.

In my case it happens like this:

I have a long running PHP process that constantly fires away mostly SELECT but also a bunch of INSERT and UPDATE statements and also some DELETEs.

Since the DB and the key files do not fit into memory, it's all disk-bound work.

All tables are MyISAM.

Like clockwork, this stalls the virtual machine once per day.

All I can do is to hard power down the VM and restart it. Afterwards the table data is corrupted beyond repair.

Not sure it is related to memory though. Because the memory usage of PHP and MySQL seem to be constant. Most RAM seems to be used by Linux for caches.

[+] ensiferum|6 years ago|reply
In general, though, the out-of-memory condition doesn't always come from the Linux kernel but from the underlying memory allocator, which is typically the allocator in the C runtime (libc). Just because some process's memory allocator returned NULL or threw bad_alloc doesn't mean the system as a whole is running out of memory.

When the kernel is running out of memory, it will just start the OOM killer, which will kill the process with the highest OOM score (not, as is sometimes assumed, the one with the lowest "nice" value).

[+] LgWoodenBadger|6 years ago|reply
Would it be possible for the kernel to suspend the process in scenarios where malloc would fail, instead of returning a failure? Either until enough memory becomes available for it to succeed, or until something tells the kernel to renew/revive/resume the process and try the malloc again?
[+] ailideex|6 years ago|reply
You provide limited information, but it is not clear that the scenario you describe is a bug. If too much memory is locked into resident memory with mlock, then this sounds like the expected and correct behavior.
[+] notacoward|6 years ago|reply
> Few programs can handle a fail return from "malloc"

Fewer than should, that's for sure, but hardly a trivial number. A lot of old-school C programs are very careful about this, and would handle such a failure passably well. Unfortunately, just about every other language tends to achieve greater "expressiveness" by making it harder to check for allocation failure. How many constructors were invoked by this line of code? By this simple manipulation of a list, map, or other collection type? How many hidden memory allocations did those involve? I'm not saying such expressiveness is a bad thing, but it does make memory-correctness more difficult and so most programmers won't even try.

As the world moves more and more toward "higher level" languages, returning an error from malloc becomes a less and less viable strategy. Might as well just terminate immediately, since "most frequent requester is most likely to die" is better than 99% of the OOM-killer configurations I've ever seen.

[+] sinsterizme|6 years ago|reply
Glad to see this issue raised! My system hangs for minutes sometimes and is very frustrating compared to Windows and OSX which seem to handle out of memory in a much more user-friendly way. Which seems to be: suspending the offending program and letting the user decide what to do from there. I'm sure there's a reason the Linux kernel doesn't do something similar, but can anyone enlighten me? :)
[+] cperciva|6 years ago|reply
Further to the comments about the pager hammering the disk to read clean pages (mainly but not exclusively binaries) even if swapping is disabled: In many cases adding swap space will reduce the amount of paging which occurs.

Many long-lived processes are completely idle (when was the last time that `getty ttyv6` woke up?) or at a minimum have pages of memory which are never used (e.g. the bottom page of main's stack). Evicting these "theoretically accessible but in practice never accessed" pages frees up memory for the things which matter.

[+] quazeekotl|6 years ago|reply
Unfortunately, enabling swap in Linux has a very annoying side effect: Linux will preferentially push out pages of running programs that have been untouched for X time in favor of more disk cache, pretty much no matter how much RAM you have.

This comes into play when you copy or access huge files that are going to be read exactly once: they will start pushing untouched program pages out to disk in exchange for disk cache that is completely, 100% useless, even to the tune of hundreds of gigabytes of it.

Programs can reduce the problem with madvise(MADV_DONTNEED), but that only applies to files you are mmap()ing, and every single program under the sun would need to be patched to issue these calls.

You can adjust the vm.swappiness sysctl to make X larger, but no matter what, programs will eventually get pushed out to disk and cause unresponsiveness when reactivated. You can reduce vm.swappiness to 1, but then the system only starts swapping in an absolutely critical low-RAM situation, and you encounter anywhere from 5 minutes to 1+ hour of total, complete unresponsiveness.

There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless the system is approaching a low-RAM situation, but BEFORE that causes long periods of total, crushing unresponsiveness.

[+] throwaway3627|6 years ago|reply
Linux resource scheduling and prioritization is pretty awful given its popularity.

TBH, there are very few OSes that get high-pressure resource scheduling and prioritization right under nearly all normal circumstances.

The hackaround on Linux for decades has been to always add a tiny swap device, say 64-256 MiB on a fast device, in order to 0) detect sustained high memory pressure with monitoring tools and 1) prevent oddities under load without swap (as in the OP's example).

[+] snvzz|6 years ago|reply
Swap has a side effect that's not very nice: It makes memory non-deterministic, as disk is non-deterministic.

Linux does unfortunately have serious issues with latency spikes, which make it behave horribly with realtime tasks such as pro audio. It's so bad that it's often perceivable in desktop usage.

linux-rt does mitigate this considerably, but it's still not very good.

I'm hopeful for genode (with seL4 specifically), haiku, helenos and fuchsia.

[+] mnw21cam|6 years ago|reply
Yes, this is exactly the cause. The pager hammers the disc to read clean pages, because they don't count as swap.

And I agree that a small amount of swap can actually reduce the paging that occurs, if you assume that the amount of RAM required is independent of the amount of RAM available. However, as we all know, stuff grows to fill the available space, and if you do configure swap you just delay the inevitable rather than prevent it.

Having said that, having swap available means that when memory pressure occurs, you have a more graceful degradation in service, because the first you know about it is when the kernel starts swapping out an idle process to free memory for the new memory hog that you are interacting with. This slows down your interactive session, but not as much as if you have no swap available - in that case, the system suddenly and drastically reduces in performance because it is trying to swap in your interactive process. The more graceful degradation of having some swap available gives you a chance to realise that you are doing more than your computer can cope with, and stop.

As far as I see it, there are three solutions:

1. Disable overcommit. This tends not to play very nicely with things that allocate loads of virtual memory and don't use it, like Java. And if you do have a load of processes that actually use all the memory they allocated, then the same problem can still occur. The solution to that one is to get the kernel to say no early, before the system actually runs out of RAM.

2. Get the OOM killer to kill things much earlier, before the system starts throwing away clean pages that are actually being used. On my system with 384GB RAM, I have installed earlyoom, and instructed it to trigger the OOM killer if the free RAM falls below 10% (and remember that stuff you are actually using, but happens to be clean, counts as free RAM). This is the easiest and quickest solution right now. If your main objection to this is that you are inviting the system to kill things that might be important, remember that the kernel already does this, and if you don't like it you should use option 1 above (and really hope that all your software handles malloc failure correctly).

3. Introduce a new system in the kernel to mark pages that are actually being regularly used but are clean as "important", and no longer count as free RAM for the purposes of calculating memory pressure. This could either be as a new madvise (but it would be impractical to get all software developers to start using this), or by marking all binary text by default (which would neglect the large read-only databases that some programs hammer), or by some heuristics. This would then trigger the OOM killer (or the allocator to say no, depending on overcommit) when actual free RAM is low.

[+] zwaps|6 years ago|reply
This exact bug has been a huge issue for me when I am developing with Matlab. Those are large simulations.

Things get swapped around and memory is often close to the limit. Linux then becomes unresponsive, and basically stalls. Theoretically it recovers, but that process is so slow that the next stall is already happening.

It is therefore impossible to run large scale Matlab simulations on my Linux machine, while it is no issue in Windows.

As far as I can see, Linux is only usable with enough RAM that you are guaranteed never to run out. I don't know why this has never been a bigger issue; I guess because it's a server OS where RAM is plannable, or because running out is very infrequent?

[+] waingake|6 years ago|reply
I'm so happy someone has made a clear bug report here. Because damn, this is a thing.
[+] blattimwind|6 years ago|reply
> Your disk LED will be flashing incessantly (I'm not entirely sure why).

The VM is basically paging all clean pages in and out constantly as their tasks become runnable. A pretty standard case of thrashing.

[+] deepbreath|6 years ago|reply
I committed the grave mistake of purchasing a laptop with only 8 GB of RAM, and I constantly run out of memory as a result. When it happens, I just repeatedly mash Alt+SysRq+F until it kills off some Chromium tabs and unfreezes my machine. It essentially behaves like one of those extensions that lets you unload tabs; if needed, you can get a tab back by just reloading the page. The machine slows down to a crawl at 96% usage and freezes at 97% usage (according to my i3 bar).
[+] quotemstr|6 years ago|reply
I've never liked the approach the Linux kernel and userland take to memory exhaustion. Many people confidently assert that it never happens. The somewhat better-informed suggest that it's unreasonable to write programs that recover from memory exhaustion because unwinding requires allocation --- a curious belief, because there are many existence proofs of the contrary. Then we get a feedback loop where everyone uses overcommit because everyone believes that programs can't recover from OOM, and people avoid writing OOM recovery code because they believe that everyone is using overcommit and allocation failure is unavoidable. And then they write kernel code and bring this attitude there.

Memory is just a resource. If you can recover from disk space exhaustion, you can recover from memory exhaustion. I think the current standard of memory discipline in the free software world is inadequate and disappointing.

[+] rwallace|6 years ago|reply
How are all the people talking about Windows here getting it to behave better? In my experience, when you run out of memory on Windows, the whole machine locks up hard for ten or fifteen minutes while it thrashes the disk before finally killing the offending process. (Admittedly that's on spinning metal; SSD would probably do better.)
[+] linsomniac|6 years ago|reply
The most annoying thing about OOM is when a process goes crazy and starts using a lot of memory, the OOM killer looks at the system and sees that process is really active, so it kills mysql/ssh/apache/postgres to make room for the run away.

I've set up monitoring that pages me when "dmesg" includes "OOM".

[+] isodude|6 years ago|reply
You can actually adjust the oom_score on those long lived important processes to hinder OOM to kill them.
[+] kasabali|6 years ago|reply
Try setting vm.oom_kill_allocating_task. I find it more useful than the default behavior and it seems to be closer to what you want.
[+] punnerud|6 years ago|reply
Apple have solved some of it by limiting the maximum number of processes before the user gets a warning and has to close some of the old ones. Not sure if they handle memory any better; from my experience, no.
[+] GuB-42|6 years ago|reply
I wonder how they did with Android? Especially in the early days, not with today's 8GB+ monstrosities.

My first Android device was a Nexus One. 512MB of RAM for what is essentially a full Linux system. Able to run a browser and multiple Java apps, all isolated and running their own VM. Task managers often reported near 100% RAM use and things still worked fine.

And my understanding is that they optimized things further since, but given how overpowered phones are today and how bloated apps are, it is hard to tell.

[+] alexozer|6 years ago|reply
A couple of weeks ago, one of my physical sticks of RAM completely stopped working after yet another Linux out-of-memory force-poweroff situation. No idea if that could be the actual cause, but I do find it a little funny.

I just arrived at this thread after my entire system stalled completely in yet another low-memory situation.

Let's just say I'm extremely grateful to discover some of these userspace early-OOM solutions in this thread.

[+] alexghr|6 years ago|reply
I hit this bug yesterday on my laptop (16GB of RAM / 1GB of swap) with 2 instances of Firefox (about 60 tabs), Slack, Insomnia (Electron-based Postman clone) and a couple of `node` processes watching and transpiling.. stuff. `kswapd0` was running at 100% CPU, I guess trying to free up some RAM by moving things to swap (the swap partition was full by this point). Luckily I managed to recover the system by switching to another tty and killing kswapd0 and the node instances.

Sometimes instructing the kernel to clear its caches helps: `echo 1 | sudo tee /proc/sys/vm/drop_caches` [1]

[1]: https://serverfault.com/questions/696156/kswapd-often-uses-1...

[+] userbinator|6 years ago|reply
I'm not sure how he's getting swap even with swap off, but this seems to be the big disadvantage to having overcommit --- the memory allocator won't ever say NO, so an application can keep allocating memory even if that memory becomes uselessly slow to actually access.

Then again, this "allocation will never fail" mentality has also lead to applications being written with such an assumption, and when allocations do fail, they crash. (Arguably, that's better than thrashing the rest of the system.) I don't know if the modern browsers will actually stop letting you open new tabs and just give an "out of memory" error instead of crashing, but that's how most Windows programs are usually written --- without the assumption that allocations can never fail, because on Windows, they can.

[+] jhallenworld|6 years ago|reply
This is a very old problem; I used to see it decades ago when making tape backups. Tar would move the entire disk through the buffer cache, so that eventually everything in it was paged out. The classic solution was to use the unbuffered (raw) disk devices for backups.

What I've always thought is that there should be a working set size limit on a process which somehow includes the buffer cache. The idea is that the process may not use more RAM than this size: if it exceeds it, it must either fail or swap out its own pages, not those of any other process. This would fix the problem for tar, which only needs a tiny amount of memory.

I think the situation is very similar with the web-browser example. The browser should not be allowed to force all unrelated data to be paged out.

[+] makz|6 years ago|reply
So... running Linux swapless is a thing? How popular is it?
[+] kokey|6 years ago|reply
I don't know what other people do, but I think the better option would be to set vm.swappiness to 0. Swap space is a good safety valve. You should never really have to use it, so a good way to detect that something is going really wrong, and to take action before it brings the system down, is to watch for swap filling up.

Also, if someone opens an application that grabs huge chunks of RAM but leaves a lot of it idle, and they turn swap off completely, they should not be surprised. I don't know why people see this as a bug, but perhaps I've just been spending time in the UNIX family tree for too many decades.

[+] yongjik|6 years ago|reply
Kubernetes, for example, doesn't even support swap. Some bug reports say it won't even run with swap enabled, though I didn't test it myself. ¯\_(ツ)_/¯
[+] snazz|6 years ago|reply
No idea how popular it is, but that’s how I run my ThinkPad X220 with 4 GB RAM and a mechanical hard drive. It’s still incredibly snappy.
[+] sandov|6 years ago|reply
I used to disable swap because it supposedly reduced the life span of SSDs, but I don't care about that anymore — when it dies, it dies.
[+] w-m|6 years ago|reply
When working on Ubuntu 16.04 LTS, this is such a productivity killer. Quite annoyed at the time lost from this behavior, after coming from a Mac. In shells where I run a program that may load a larger data set (e.g. before ipython), I now regularly run `ulimit -v 50000000` to limit the shell's virtual memory to ~50 GB of the available 64 GB on this machine.

If the program tries to use more RAM it'll then just die, and not drag down the whole system with it. Works fine, but I really shouldn't have to do this.