Linux Performance Analysis (2015)

181 points | benjacksondev | 7 months ago | netflixtechblog.com

40 comments

janvdberg|7 months ago

My first command is always 'w'. And I always urge young engineers to do the same.

There is no shorter command to show uptime, load averages (1/5/15 minutes), and logged-in users. Essential for quick system health checks!

mmh0000|7 months ago

It should also be mentioned, Linux Load Average is a complex beast[1]. However, a general rule of thumb that works for most environments is:

You want the load average to stay below the total number of CPU cores. If it is higher, tasks are likely queuing for CPU (or, on Linux, sitting in uninterruptible I/O waits), and you're probably seeing a lot of context switching.

[1] https://www.brendangregg.com/blog/2017-08-08/linux-load-aver...
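
A minimal sketch of that rule of thumb in Python (not from the thread). The function names `load_per_core` and `is_overloaded` and the 1.0 threshold are illustrative assumptions; `os.cpu_count()` counts logical CPUs, so hyperthreads are included.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average normalized by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def is_overloaded(load_avg: float, cores: int) -> bool:
    """Rule of thumb: a load average above the core count is suspect."""
    return load_per_core(load_avg, cores) > 1.0


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # cpu_count() may return None
    try:
        averages = os.getloadavg()  # the same 1/5/15-minute values `w` prints
    except OSError:
        averages = ()  # load average not obtainable on this platform
    for label, value in zip(("1m", "5m", "15m"), averages):
        flag = "HIGH" if is_overloaded(value, cores) else "ok"
        print(f"{label}: {value:.2f} ({load_per_core(value, cores):.2f} per core) {flag}")
```

A high value here is only a prompt to look further, since the linked article explains that Linux load averages also count tasks blocked on I/O.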

Propelloni|7 months ago

Me too! So much so that I add it to my .bashrc everywhere.

__turbobrew__|7 months ago

If you like this post, I would recommend “BPF Performance Tools” and “Systems Performance: Enterprise and the Cloud” by Brendan Gregg.

I have pulled off a few miracles using these tools (identifying kernel bottlenecks or profiling programs using eBPF) and it has been well worth the investment to read through the books.

wcunning|7 months ago

Literally did miracles at my last job with the first book, and that got me my current job, where I used it again for some impressive work proving which libraries had what performance... Seriously valuable stuff.

sour-taste|7 months ago

Almost all of these have been replaced for me with below: https://developers.facebook.com/blog/post/2021/09/21/below-t...

It is excellent and contains most things you could need. The downside is that it isn't yet a standard tool, so you need to get it installed across your fleet.

benreesman|7 months ago

Oh man nostalgia city. I vividly remember meeting atop time travel debugging at 3am in Menlo Park in 2012, wild times.

louwrentius|7 months ago

The iostat command has always been important for observing HDD/SSD latency numbers.

SSDs especially are treated like magic storage devices with infinite IOPS at Planck-scale latency.

Until you discover that SSDs that can do 10GB/s don't do nearly so well (not even close) when you access them from a single thread with random I/O at a queue depth of 1.
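
The arithmetic behind that can be sketched with Little's law (throughput = requests in flight / latency). The 100 µs per-request latency and 4 KiB block size below are assumed round numbers, not measurements of any drive, and the model pretends latency stays constant as queue depth grows.

```python
def iops(queue_depth: int, latency_us: float) -> float:
    """Little's law: requests completed per second = in-flight requests / latency."""
    if latency_us <= 0:
        raise ValueError("latency_us must be positive")
    return queue_depth * 1_000_000 / latency_us


def throughput_mb_s(queue_depth: int, latency_us: float, block_bytes: int) -> float:
    """Bandwidth in MB/s (10^6 bytes) for a given block size."""
    return iops(queue_depth, latency_us) * block_bytes / 1e6


if __name__ == "__main__":
    LATENCY_US = 100.0   # assumed per-request latency, not a measured value
    BLOCK_BYTES = 4096   # assumed 4 KiB random reads
    for depth in (1, 32):
        print(
            f"QD{depth}: {iops(depth, LATENCY_US):,.0f} IOPS, "
            f"{throughput_mb_s(depth, LATENCY_US, BLOCK_BYTES):,.1f} MB/s"
        )
```

Under those assumptions a single thread at queue depth 1 tops out at 10,000 IOPS, about 41 MB/s, no matter what the drive's headline sequential figure is.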

wcunning|7 months ago

That's where you start down the eBPF rabbit hole with bcc/biolatency and other block device histogram tools. Further, the cache hit rate and block size behavior of the SSD/NVMe drive can really affect things if, say, your autonomous vehicle logging service uses MCAP with a chunk size much smaller than a drive block... Ask me how I know.

5pl1n73r|7 months ago

After this article was written, `free -m` on many systems gained an "available" column that estimates how much memory can be used without swapping (roughly free plus reclaimable memory, not an exact sum). It's nicer than the "-/+" section shown in this old article.

  $ free -m
                 total        used        free      shared  buff/cache   available
  Mem:            3915        2116        1288          41         769        1799
  Swap:            974           0         974
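
For scripting, the same figure can be read directly: the "available" column comes from the `MemAvailable` field in /proc/meminfo (kernels 3.14 and later). A minimal sketch, where the helper names `parse_meminfo` and `available_mib` are hypothetical:

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def available_mib(fields: dict) -> int:
    """MemAvailable converted from kB to MiB, rounded down."""
    return fields["MemAvailable"] // 1024


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # Linux only
        fields = parse_meminfo(meminfo.read_text())
        print(f"available: {available_mib(fields)} MiB")
```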

CodeCompost|7 months ago

> At Netflix we have a massive EC2 Linux cloud

Wait a minute. I thought Netflix famously ran FreeBSD.

craftkiller|7 months ago

My understanding was their CDN ran on FreeBSD, but not their API servers. But I don't work for Netflix.

drewg123|7 months ago

The CDN runs FreeBSD. Linux is used for nearly everything else.

ImPostingOnHN|7 months ago

Maybe I missed it, but checking available disk space is often a good step in diagnosing misbehaving systems.
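
A quick way to script that check, as a sketch: the 90% warning threshold and the helper names are assumptions for illustration, and the percentage is computed against total capacity, so it can differ slightly from what `df` reports.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Percentage of capacity in use; 0.0 for a zero-sized filesystem."""
    return 100.0 * used / total if total else 0.0


def is_nearly_full(total: int, used: int, warn_at: float = 90.0) -> bool:
    """True when usage is at or above the (assumed) warning threshold."""
    return percent_used(total, used) >= warn_at


if __name__ == "__main__":
    usage = shutil.disk_usage("/")  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    state = "NEARLY FULL" if is_nearly_full(usage.total, usage.used) else "ok"
    print(f"/: {pct:.1f}% used, {usage.free / 2**30:.1f} GiB free ({state})")
```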

appleaday1|7 months ago

he forgot about rusttop

AnyTimeTraveler|7 months ago

I'm pretty sure that that didn't exist in 2015 ;)