Linux Load Averages: Solving the Mystery

623 points| dmit | 8 years ago |brendangregg.com

86 comments

[+] ChuckMcM|8 years ago|reply
Awesome analysis, I have added it to my favorites list. Around 1990 or so I was in the kernel group at Sun, and a team had just embarked on the multi-processor kernel work that would later result in the 'interrupts as threads'[1] paper. During that time there was an epic thread on email which was something like "What the F*ck does load average mean on an MP system?" (no doubt I have a copy on an unreadable quarter inch tape somewhere :-(). If it helps, the exact same pivot point was identified, which is this: does 'load average' mean the load on the CPU or the load on the system? While there were supporters in the 'system' camp, the traditionalists carried the day with "We can't change the definition on existing customers, all of their shell scripts would break!" or something to that effect. Basically, the response was that if we were to change it, we would have to call it something different to maintain a commitment to the principle of least surprise. This has never been an issue for Linux :-).

As a "systems" guy I am always interested in how balanced the system is, which is to say that I am always trying to figure out what the slowest part of my system is and ensuring that it is within some small epsilon of the other parts. If you do that, then system load is linear with workload almost regardless of task composition. So disk heavy processes load the "system" as much as "compute heavy" processes and "memory heavy" or "network heavy." In an imaginary world you could decompose a system into 'resource units' and then optimize it for a particular workload.

[1] http://dl.acm.org/citation.cfm?id=202217

[+] samstave|8 years ago|reply
Uh.. complete but relevant aside:

All you old farts (TM) need to get these freaking quarter inch tapes pushed up to some glacier S3 bucket or some such bucket before you kick said bucket...

I'm serious. C'mon, don't steal from the future what you actually did in the past to make the present the reality of today!!

[+] otoburb|8 years ago|reply
I wonder if Illumos / SmartOS and OpenIndiana continued with the principle of least surprise via their ancestry chain, or whether they also moved to a "system" load view like Linux.

It's great that design decisions and thinking from decades past can be dug out and examined by complete strangers.

[+] siebenmann|8 years ago|reply
This is great work in general and excellent historical research.

As an additional historical note: in Unix, load averages were introduced in 3BSD, and at that time they included processes in disk IO wait and other theoretically short-term waits that weren't interruptible. This definition was carried through the BSD series and onward into Unixes derived from them, such as the initial versions of SunOS and Ultrix. At some point (perhaps SunOS 3 to SunOS 4, perhaps later), the SunOS/Solaris definition changed to be purely runnable processes.

(I'm not sure what System V derived Unixes such as Irix, HP-UX, and so on did, and their kernel source is not readily available online for spelunking.)

As of early 2016 when I last looked at this, the situation on FreeBSD, OpenBSD, and NetBSD was somewhat tangled. FreeBSD load average only included runnable processes, but NetBSD and OpenBSD counted some sleeping or waiting processes as well.

[+] EvanAnderson|8 years ago|reply
When details of a piece of "open" software are so easily lost I shudder to think about the vast quantity of "closed" software that has had its history lost.

I also kept thinking about how the term "software archaeology" (which I first saw in the 1999 Vernor Vinge novel "A Deepness In the Sky") becomes more and more mainstream each day.

[+] brendangregg|8 years ago|reply
Thanks, I didn't know those Unix versions included disk I/O.
[+] ams6110|8 years ago|reply
The numbers are definitely not comparable between Linux and some other Unix variants. OpenBSD "idle" load averages are about 1, for example.
[+] Twirrim|8 years ago|reply
Several years back the company I worked for ended up picking up some work for a client. Every quarter we'd download a huge trove of TIFFs from some source, and then do some image conversion work before transferring them to the customer's infrastructure.

There was a java application that powered the logic side of things, calling out to ImageMagick to do the actual processing and conversion. For whatever reason, after careful benchmarking we settled on a java thread count that happened to get us the peak throughput, but also caused system load average to hit around 400 and keep steady at around that level.

The day that happened, and I could show that no application on the server took a performance hit, was the day that I finally persuaded my boss that load average is an interesting stat, but it's not the be-all and end-all, and that a high load average doesn't necessarily correlate to an actual problem.

[+] Bluecobra|8 years ago|reply
I had something similar happen a long time ago on an x86 Solaris 10 mail server. An employee thought it was a good idea to share best quality/full resolution JPEG pictures of his new baby with the whole company. This swamped the mail server (load average was well over 700) while it chugged through delivering a 50 MB email to 200+ employees. I forget which process was the culprit (I think GNU Mailman) but after a couple of hours it finally settled down. I was amazed that I could still SSH into it and figure out what happened.
[+] sreque|8 years ago|reply
One source of high load average spikes that I've seen in my job is when a process crashes and generates a core dump. While the core dump is being written, all threads in the process are in the TASK_UNINTERRUPTIBLE state even though they are doing absolutely nothing, and as such they all count towards the load average as if they were spinning on a CPU core. If the total virtual memory of the process is large, say in the multi-GB range, coredumping can take on the order of a minute, and Linux will report an unreasonably high load average if that process had a lot of running threads.
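
To check whether a spike like this comes from runnable or uninterruptible tasks, one rough approach is to count thread states from `/proc`. The sketch below is only an instantaneous snapshot, not the kernel's own accounting; the function names are hypothetical, and it assumes a Linux `/proc` layout where each `stat` line has the form `pid (comm) state ...`.

```python
import glob


def task_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces
    or ')', so split on the LAST ')' instead of on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_load_contributors(stat_lines):
    """Count tasks that are runnable (R) or in uninterruptible sleep (D)."""
    counts = {"R": 0, "D": 0}
    for line in stat_lines:
        state = task_state(line)
        if state in counts:
            counts[state] += 1
    return counts


def read_all_thread_stats():
    """Yield the stat line of every thread visible under /proc (Linux only)."""
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                yield f.read()
        except OSError:
            continue  # the thread exited between listing and reading
```

Calling `count_load_contributors(read_all_thread_stats())` during a core dump should show a large `D` count if the scenario above is what is happening. The script itself will appear as one `R` task.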

Things like the above scenario make me treat the load average metric with a lot of skepticism. I would much rather use other metrics to infer load.

[+] lotyrin|8 years ago|reply
I rarely recommend alerting, monitoring, or any kind of action based on load averages or, more generally, any metric derived from queue lengths. It's trends in high-quantile queue latencies that your users (and therefore you) should care about.
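
As a minimal illustration of a high-quantile latency metric, here is a nearest-rank percentile over a window of samples. The function name and the sample values are made up for the example; a real monitoring system would more likely use histograms or a streaming estimator.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # q * n is computed before dividing to avoid float rounding surprises.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical queue latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [10] * 97 + [250, 400, 900]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Here the median stays at 10 ms while the 99th percentile is 400 ms, which is the kind of tail a queue-length average hides.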
[+] saalweachter|8 years ago|reply
If it was bothering anyone else: yes, the parentheses in the patch in the email are unbalanced, and the code was checked in as:

                if (*p && ((*p)->state == TASK_RUNNING ||
                           (*p)->state == TASK_UNINTERRUPTIBLE ||
                           (*p)->state == TASK_SWAPPING))
[+] mentat|8 years ago|reply
They're... not unbalanced.
[+] simonjgreen|8 years ago|reply
Under Better Metrics the author discusses ways of drilling down to find the source of a high load average. I feel like this section should mention `atop`, which is imo a really underrated single-pane-of-glass view into everything your system is doing, now and historically.

If you haven't tried `atop`, give it a go.

The historical analysis in this article is great though, because while Load Average has been an oft-discussed and well-understood topic for a long time, the decisions that got us there are not.

[+] mnw21cam|8 years ago|reply
Good article. However, it is missing the reason why load averages include tasks waiting for disc/swap.

One of the things that the load average is sometimes used for is to work out whether it is appropriate to start some more processes running on a system. For example, make has a "-l" option, which prevents more parallel jobs being run while the load is above a supplied number. When a system is thrashing due to insufficient RAM, then the load average will be high, and this option will appropriately prevent more tasks being started which would make the thrashing worse. If the load average was just based on CPU, then it would be low while thrashing, and using that make option could lead to complete system collapse.
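
A minimal sketch of that gating rule, assuming the documented behavior of `make -l` (no new job is started while other jobs are running and the load average is at least the limit). The function name and parameters are hypothetical; `os.getloadavg()` is the standard-library call for reading the three averages on Unix.

```python
import os


def may_start_job(load_1min: float, max_load: float, running_jobs: int) -> bool:
    """Decide whether a scheduler may launch one more parallel job.

    The first job is always allowed, so the build cannot stall forever
    on a machine that is busy for unrelated reasons.
    """
    if running_jobs == 0:
        return True
    return load_1min < max_load


def current_one_minute_load() -> float:
    """Read the 1-minute load average from the OS (Unix only)."""
    return os.getloadavg()[0]
```

On a thrashing Linux box the tasks stuck in swap I/O push the load above the limit, so `may_start_job(current_one_minute_load(), 4.0, running_jobs=2)` would return `False` and hold back further jobs.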

[+] Florin_Andrei|8 years ago|reply
> As a set of three, you can tell if load is increasing or decreasing

That could be accomplished with a set of two.

A set of three could in theory give you acceleration.

[+] btilly|8 years ago|reply
This comment makes perfect sense if load is a smooth function. But it is not. It tends to be a step function.

The most recent 2 data points tell you whether the problem is currently getting worse, getting better or steady. The third gives you a sense of whether it has been going on a while.
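
One way to read the three numbers along these lines is to compare adjacent pairs. This is only a sketch: the function name and the 5% tolerance are assumptions, not part of any tool.

```python
def describe_trend(one, five, fifteen, tolerance=0.05):
    """Return (recent, longer_term) trend labels from the 1-, 5- and
    15-minute load averages.

    recent      compares the 1-minute value with the 5-minute value
    longer_term compares the 5-minute value with the 15-minute value
    """
    def direction(newer, older):
        if newer > older * (1 + tolerance):
            return "rising"
        if newer < older * (1 - tolerance):
            return "falling"
        return "steady"

    return direction(one, five), direction(five, fifteen)


# A sample reading of 2.39, 2.34, 2.08: steady over the last few
# minutes, but higher than it was over the last quarter hour.
trend = describe_trend(2.39, 2.34, 2.08)
```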

[+] BayAreaSmayArea|8 years ago|reply
Never go to sea with two chronometers; take one or three.
[+] hathawsh|8 years ago|reply
This analysis cleared up a mystery for me. I've noticed that when a server app is under heavy load in Linux, the load average goes high if the bottleneck is the CPU or the disk, but the load average goes low if the bottleneck is network resources (like databases or microservice calls). I know why that happens, but it's very unintuitive and it confused me when I was new to Linux. I thought load average would measure the CPU load only. Now I know the historical reasons for measuring system load instead of CPU load.

I kind of like it the way it is since it's handy to be able to distinguish network load from CPU+disk load just by looking at the load average. However, since the load average includes other stuff as well, sometimes I still don't know what the load average really means.

[+] ty_a|8 years ago|reply
Holy crap, Brendan Gregg's site went down. Proof he is human I guess?
[+] brendangregg|8 years ago|reply
Yes, sorry. I guess proof this is a hobby on some personal hosting that can get overloaded. Try refreshing. Although its load averages (couldn't resist) aren't that high:

    10:36:09 up 34 days, 20:05,  1 user,  load average: 2.39, 2.34, 2.08
[+] seanp2k2|8 years ago|reply
The cobbler's children have no shoes :)

Just because we can deploy services that can take a million RPS doesn't mean we have our side projects / hobby sites in order, hah. I worked in hosting for a long time and I had a personal WordPress site which would get hacked every other month. I literally fixed that problem daily at $JOB, but couldn't be arsed to do something better for myself. It worked, and it was quick and easy. The point was the content.

These days, I'd just use something like Medium or Tumblr. Let someone else worry about hosting it :)

[+] rcarmo|8 years ago|reply
I still managed to read the whole thing. Quite fascinating, really, considering the lengths he went to in tracking down the ancient (1993) patch that turned CPU load averages into whole system load averages.
[+] ge96|8 years ago|reply
Why isn't there one for RAM in i3? I read something about how it's hard to gauge RAM usage, even though htop and inxi display it. On Windows you generally look at Task Manager and there is memory usage.
[+] faragon|8 years ago|reply
It's incredibly detailed, including references and historical investigation. Mind blowing. Kudos, Brendan Gregg.
[+] solarengineer|8 years ago|reply
When I'd asked Brendan via Twitter for an article on Load Averages in Linux, I hadn't expected such a detailed response. I've worked on a few projects where I've had to show that even though the "load" on the Linux system was low, it was really the steal% and the iowait that were killing performance. I'm sure that from now on, so many system and support engineers will have a good article to reference. Thanks, Brendan!
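
For reference, steal% and iowait% can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names and the sample numbers are invented for illustration.

```python
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of ticks.

    The trailing guest fields are skipped because guest time is already
    included in user time.
    """
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(p) for p in parts[1:9])))


def percentages(before, after):
    """Share of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * value / total for name, value in deltas.items()}


# Two made-up samples taken some seconds apart.
first = parse_cpu_line("cpu  100 0 50 800 30 0 0 20 0 0")
second = parse_cpu_line("cpu  150 0 70 850 60 0 0 70 0 0")
pct = percentages(first, second)
```

In this invented example a quarter of the interval is stolen by the hypervisor and another 15% is idle time waiting on I/O, neither of which shows up as user or system CPU usage.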
[+] sytringy05|8 years ago|reply
My company took over production support of an ESB from another company for a client a couple of years ago. Each worker node had about 100 JVMs running on it and its resting Load Avg was around 30. This was on a 2 CPU RHEL VM.

Out of morbid curiosity, I restarted one of the test servers and ran top. Load Avg was in the order of 2200 for about 3 hours.

The worst part was that the guys we took it over from didn't even think it was a problem.

[+] mnarayan01|8 years ago|reply
Page swapping seems like it makes a lot of sense to include in the load average. Disk I/O seems like something more towards the opposite end of the spectrum, though TASK_KILLABLE (https://lwn.net/Articles/288056/) presumably mitigates this where used.
[+] rotten|8 years ago|reply
What we need is a systems model that allows us to assess the overall health of a server in a single metric. Signs of something under strain would be reflected in the metric and draw our attention for further drilldown and analysis. "Load Average" is the metric we (the systems community) have generally been using for this. Unfortunately it appears that the model it is based on may be rather dated and may have flaws which mean we will miss or misinterpret system health status by relying on that number. So the million dollar question is - starting from scratch, how can we design a model of our system that yields a reliable system health indicator metric?
[+] mobilethrow|8 years ago|reply
OT: what could cause a system to have a load of 1 when idle?

I have one (unimportant) Linux system that idles with a load of exactly 1. The issue persists through reboots. It is a KVM virtual machine and qemu confirms nothing is going on in the background.

Any ideas how to find out what's causing it?

[+] fanf2|8 years ago|reply
I thought that including disk wait in the load average was a common Unix feature. Sadly I can't go spelunking through the archives right now, but it would be interesting to see what Solaris and BSD do, for comparison with systems a little bit closer to Linux than TENEX :-)
[+] brendangregg|8 years ago|reply
Solaris and BSD load averages are based on CPU only. As for avoiding TENEX, here's the comment from the FreeBSD src:

    /*
     * Compute a tenex style load average of a quantity on
     * 1, 5 and 15 minute intervals.
     */
    static void
    loadav(void *arg)
    {
    [...]
:)
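
A "tenex style" load average is usually described as an exponentially damped moving average of the number of active tasks. A minimal sketch of that idea follows; it is not the FreeBSD or Linux code, the 5-second sampling interval is an assumption for the example, and all names are hypothetical.

```python
import math

SAMPLE_INTERVAL = 5.0            # seconds between samples (assumed)
WINDOWS = (60.0, 300.0, 900.0)   # 1, 5 and 15 minutes, in seconds


def update_load(loads, active_tasks):
    """Fold one new sample into the three damped averages.

    Each average decays toward the current task count with a time
    constant equal to its window length.
    """
    updated = []
    for load, window in zip(loads, WINDOWS):
        decay = math.exp(-SAMPLE_INTERVAL / window)
        updated.append(load * decay + active_tasks * (1.0 - decay))
    return tuple(updated)


# Start idle, then keep exactly one task active for one minute.
loads = (0.0, 0.0, 0.0)
for _ in range(12):              # 12 samples * 5 s = 60 s
    loads = update_load(loads, active_tasks=1)
```

Under these assumptions the 1-minute figure is `1 - e^-1` (about 0.63) after one minute of a single busy task, so the "1 minute" value is a time constant of the damping, not a plain average over the last 60 seconds.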
[+] swinglock|8 years ago|reply
It’s indeed different, Solaris doesn’t count time waiting for the disk in the load average.