item 28518795

Ask HN: Anyone else seeing people averaging percentiles?

15 points| stat_throwaway | 4 years ago

I work at a middling SV company, and I see people taking averages of percentiles (or, even crazier, percentiles of percentiles) every day - that is, each server computes its own "50%/90%/95% latency" of all requests, sends it to a central time-series database, and the Grafana console averages them all to show a "nice" graph. They are used for everything from alerts to launch decisions.

It's driving me crazy because that makes no sense: you can't average percentiles. You get bogus numbers that jump up and down depending on how you aggregate (how many servers you have, which tag sets you use, and so on). And I'm apparently the only one who is seriously bothered. Everyone else is somewhere between "Eh, that's the best data we have." and "What do you mean the numbers are wrong?"

Is this normal?

11 comments


richk449|4 years ago

On one hand, your point seems to be technically correct.

On the other hand, don't we average things that we can't technically justify all the time? Teachers give out homework and tests, assign a weight to each homework and each test, and average the results to assign you a grade. Is that grade arbitrary? Yes. Does that mean it is useless? Probably not.

If I were you, I would make your case based less on "you can't do that" and more on "if we used this approach to aggregation, we would improve our ability to detect cases X and Y that our customers really care about".

NumberCruncher|4 years ago

> Is this normal?

It is. At my last job the first ticket assigned to me was about fixing the bug "average is greater than the 80th percentile". When I told them that this wasn't a bug, they thought I was just kidding.

Bostonian|4 years ago

I think the idea is to measure average latency but to give extra weight to the worst 5% and 10% of latency cases. I've not seen such a measure before but don't think it's absurd.

wizwit999|4 years ago

I think they might be getting at what I've seen called a 'trimmed mean'. It's actually a pretty useful statistic for something like latency, because unlike a simple percentile, an average lets you see changes in the distribution within the range it covers.

But that's the wrong way to collect it; they should aggregate at the end.
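For illustration, a minimal sketch of a trimmed mean computed the way this comment suggests, i.e. on the pooled raw latencies rather than on per-server summaries (the function name and sample data are invented for the example):

```python
def trimmed_mean(samples, trim=0.05):
    """Mean of samples after dropping the lowest and highest `trim` fraction."""
    s = sorted(samples)
    k = int(len(s) * trim)
    core = s[k:len(s) - k] if k else s
    return sum(core) / len(core)

# Pooled latencies from all servers: 90 requests at 200 ms, 10 at 500 ms.
latencies = [200] * 90 + [500] * 10
print(trimmed_mean(latencies, trim=0.05))
```

Trimming 5% from each end discards the 5 slowest (and 5 fastest) requests before averaging, which damps outliers while still reflecting shifts inside the bulk of the distribution.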

millrawr|4 years ago

It happens frequently because there's often not support in the underlying systems for mergeable latency sketches (e.g. DDSketch, HDR Histogram, or libcircllhist). There thankfully seems to be slowly increasing amounts of support for doing this the correct way.
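To show the idea (this is a toy fixed-width-bucket histogram, not the actual API of DDSketch or HDR Histogram, which use cleverer bucket layouts): each server ships bucket counts, the central system merges the counts, and only then reads off a quantile.

```python
from collections import Counter

BUCKET_MS = 50  # bucket width; real sketches use log-sized buckets for bounded relative error

def record(hist, latency_ms):
    """Count one request into its latency bucket."""
    hist[latency_ms // BUCKET_MS] += 1

def merge(hists):
    """Bucket counts add up, so per-server histograms merge losslessly."""
    total = Counter()
    for h in hists:
        total.update(h)
    return total

def quantile(hist, q):
    """Upper edge of the bucket containing the q-th quantile."""
    n = sum(hist.values())
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= q * n:
            return (bucket + 1) * BUCKET_MS
    return None

server_a, server_b = Counter(), Counter()
for ms in [200] * 45 + [500] * 5:
    record(server_a, ms)
for ms in [200] * 45 + [500] * 5:
    record(server_b, ms)

merged = merge([server_a, server_b])
print(quantile(merged, 0.8))  # 250 -- true p80 is 200; the error is bounded by the bucket width
```

Unlike percentiles, the counts commute with aggregation: merging the sketches and then taking the quantile gives the same answer regardless of how requests were sharded across servers.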

wikibob|4 years ago

Do you work at my company? This is literally rampant. Nobody understands.

kderbyma|4 years ago

You need to do aggregate operations. Some operations are point-based (think matrix dot multiplication) and others are series-based (FFT).

This is typically fixed by realising which one you need.

ryanmonroe|4 years ago

Not sure it’s possible to know the answer without more context. What do they use the averages for? What do you think they should do instead and why is it better?

stat_throwaway|4 years ago

Here's a simplified example, adapted from an actual case I saw:

We have ten servers. Each server gets 90% user requests and 10% googlebot requests. Each user request takes exactly 200 ms, each googlebot request takes exactly 500 ms.

Each server reports a p80 latency of 200 ms. After averaging, it's still 200 ms, which is correct.

Now, we conjecture that by sending all googlebot requests to a single server, it can use cache more efficiently. So we change routing: now nine servers get only user requests (90% of traffic), and one server gets only googlebot requests (10% of traffic). By doing this, we speed up every request by 10 ms. Great success! Or is it?

Let's calculate the average p80. The p80 of a user server is 190 ms (everything takes exactly 190 ms there). The p80 of the googlebot server is 490 ms. Average p80 = (190 * 9 + 490 * 1) / 10 = 220 ms. Huh, wait, the service became 20 ms slower! Something is wrong, maybe we should revert the rollout ...
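The arithmetic above can be checked in a few lines (the server counts and latencies are the ones from the example; the percentile helper is a simple nearest-rank version):

```python
def percentile(samples, q):
    """Nearest-rank style percentile: value at index floor(q * n)."""
    s = sorted(samples)
    return s[min(int(q * len(s)), len(s) - 1)]

user_server = [190] * 100   # what each of the nine user-only servers sees
bot_server  = [490] * 100   # what the one googlebot-only server sees

# What the dashboard shows: the average of per-server p80s.
p80s = [percentile(user_server, 0.8)] * 9 + [percentile(bot_server, 0.8)]
avg_of_p80 = sum(p80s) / len(p80s)
print(avg_of_p80)   # 220.0 -- "slower"!

# What actually happened: p80 over all requests pooled together.
true_p80 = percentile(user_server * 9 + bot_server, 0.8)
print(true_p80)     # 190 -- every single request got faster
```

The averaged number moves in the opposite direction from every individual request, because the per-server p80s weight the googlebot server's 490 ms as if it were 10% of requests at that quantile, which it isn't.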

-----

You may say "Well, now that you explained what you did, of course that's wrong." The problem is, there's no right way to average percentiles. The numbers will always be suspect. The only difference is when they fail.

As far as I can tell, the only time the numbers are right is when every server gets exactly the same mix of requests and shows exactly the same performance characteristics. But of course, if you already knew that, you wouldn't need the average at all: you could take the p90 from a single server, and every other server would have the same value.

You are aggregating because you DON'T know if they are all behaving in the same way. When you're averaging percentiles, you're assuming the very thing you set out to verify.
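A tiny made-up illustration of the grouping-dependence (all twelve latencies are invented): the same requests, tagged by host versus by zone, produce two different "average p90"s, and neither matches the real one.

```python
def percentile(samples, q):
    """Nearest-rank style percentile: value at index floor(q * n)."""
    s = sorted(samples)
    return s[min(int(q * len(s)), len(s) - 1)]

# The same twelve requests, grouped two different ways:
requests = [100, 100, 100, 100, 100, 100, 300, 300, 300, 900, 900, 900]

by_host = [requests[0:6], requests[6:12]]                 # tagged by host
by_zone = [requests[0:4], requests[4:8], requests[8:12]]  # tagged by zone

avg_p90_host = sum(percentile(g, 0.9) for g in by_host) / len(by_host)
avg_p90_zone = sum(percentile(g, 0.9) for g in by_zone) / len(by_zone)
true_p90     = percentile(requests, 0.9)
print(avg_p90_host, avg_p90_zone, true_p90)  # three different answers from one dataset
```

Change the tag set and the dashboard number moves, even though not a single request changed.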

lolln|4 years ago

Sounds like another dev who doesn’t know about LLN tbqh.