top | item 45111071


judge123 | 6 months ago

This hits so close to home. I once tried to explain to a manager that a server at 60% utilization had zero room left, and they looked at me like I had two heads. I wish I had this article back then!


hinkley | 6 months ago

You also want to hit him with queueing theory.

Up to a hair over 60% utilization, the queueing delays on any work queue remain essentially negligible. At 70% they become noticeable, and at 80% they've doubled. And then it just turns into a shitshow from there on.

The rule of thumb is that 60% is zero headroom, and 80% is the inflection point where delays go exponential.
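The shape behind that rule of thumb can be sketched with the classic single-server M/M/1 model (Poisson arrivals, exponential service times; real systems differ, but the curve is similar). Mean time in system relative to bare service time grows as 1/(1 - ρ):

```python
# Sketch of the M/M/1 queueing-delay curve behind the rule of thumb.
# Assumes Poisson arrivals and exponential service on one server;
# this is a model, not a measurement of any real system.

def delay_factor(utilization: float) -> float:
    """Mean time in system relative to bare service time: 1 / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)

for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    print(f"{rho:.0%}: {delay_factor(rho):.1f}x")
```

Note that the factor at 80% (5.0x) is exactly double the factor at 60% (2.5x), which matches the "at 80% they've doubled" observation, and it blows up quickly past 90%.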

On the biggest cluster I ran, we hit about 65% CPU at our target P95 latency, which is pretty much right on the theoretical mark.

BrendanLong | 6 months ago

A big part of this is that CPU utilization metrics are frequently averaged over a long period of time (like a minute), but if your SLO is 100 ms, what you care about is whether there's any ~100 ms period where CPU utilization is at 100%. Measuring p99 (or even p100) CPU utilization can make this a lot more visible.
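A toy illustration of the averaging problem: a CPU that is fully saturated 10% of the time but lightly loaded otherwise looks healthy in the one-minute average, while the p99 of 100 ms samples exposes it. The numbers here are synthetic, not from any real host:

```python
# 600 samples of 100 ms each = one minute of CPU utilization.
# Every 10th window is fully saturated (1.0); the rest sit at 20%.
samples = [1.0 if i % 10 == 0 else 0.2 for i in range(600)]

minute_avg = sum(samples) / len(samples)
p99 = sorted(samples)[int(0.99 * len(samples))]

print(f"1-minute average:       {minute_avg:.0%}")  # looks fine
print(f"p99 of 100 ms samples:  {p99:.0%}")         # fully saturated
```

The minute average comes out around 28%, well under any alarm threshold, even though 10% of all 100 ms windows (and thus potentially many SLO-sized requests) ran on a pegged CPU.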

Ambroisie | 6 months ago

Do you have a link to a more in-depth analysis of the queuing theory for these numbers?

PunchyHamster | 6 months ago

That entirely depends on the workload, especially now that average server CPUs start at 32 cores.
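The core count does change the picture: in a multi-server M/M/c model (a simplification; it assumes one shared queue and exponential service), queueing delay at a given utilization drops sharply as the number of servers grows. A sketch using the Erlang C formula:

```python
from math import factorial

def erlang_c(servers: int, utilization: float) -> float:
    """Probability an arriving job has to wait (Erlang C formula)."""
    a = servers * utilization  # offered load in Erlangs
    top = (a ** servers / factorial(servers)) / (1.0 - utilization)
    bottom = sum(a ** k / factorial(k) for k in range(servers)) + top
    return top / bottom

def mean_queue_delay(servers: int, utilization: float) -> float:
    """Mean queueing delay, in units of one service time."""
    return erlang_c(servers, utilization) / (servers * (1.0 - utilization))

for c in (1, 8, 32):
    print(f"{c:>2} servers at 80%: {mean_queue_delay(c, 0.8):.3f} service times")
```

At 80% utilization a single server queues jobs for about 4 service times on average, while a 32-server system queues for a small fraction of one, which is why the 60%/80% rule of thumb is much more forgiving on a 32-core box (for workloads that actually spread across the cores).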