Ask HN: Anyone else seeing people averaging percentiles?
15 points | stat_throwaway | 4 years ago
It's driving me crazy because it makes no sense: you can't average percentiles; you get bogus numbers that jump up and down at random depending on the aggregation (how many servers you have, which tag sets you use, and so on). And I'm apparently the only one who is seriously bothered. Everyone else is somewhere between "Eh, that's the best data we have" and "What do you mean, the numbers are wrong?"
Is this normal?
richk449|4 years ago
On the other hand, don't we average things that we can't technically justify all the time? Teachers give out homework and tests, assign a weight to each homework and each test, and average the results to assign you a grade. Is that grade arbitrary? Yes. Does that mean it is useless? Probably not.
If I were you, I would make your case based less on "you can't do that" and more on "if we used this approach to aggregation, we would improve our ability to detect cases X and Y that our customers really care about".
NumberCruncher|4 years ago
It is. At my last job, the first ticket assigned to me was about fixing the "bug" that the average was greater than the 80th percentile. When I told them that this isn't a bug, they thought I was just kidding.
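A tiny sketch (my own illustration with made-up numbers, not from the thread) of how a mean can legitimately exceed the 80th percentile when the distribution has a heavy tail:

```python
# Illustration (hypothetical numbers): with a heavy tail, the mean
# can sit far above the 80th percentile -- no bug involved.
import math

latencies = [1] * 80 + [1000] * 20   # 80 fast requests, 20 very slow ones

mean = sum(latencies) / len(latencies)

# Nearest-rank p80: the value at rank ceil(0.8 * n) in sorted order.
rank = math.ceil(0.8 * len(latencies))
p80 = sorted(latencies)[rank - 1]

print(mean, p80)   # 200.8 1 -- the average dwarfs the 80th percentile
```

The 20% tail contributes almost all of the mean while being invisible to p80, which only sees the fast 80% of requests.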
wizwit999|4 years ago
But that's the wrong way to collect it; they should aggregate at the end.
sobriquet9|4 years ago
[1] https://en.wikipedia.org/wiki/L-estimator
kderbyma|4 years ago
This is typically fixed by realising which one you actually need.
stat_throwaway|4 years ago
We have ten servers. Each server takes 90% user requests and 10% googlebot requests. Each user request takes exactly 200 ms, each googlebot request takes exactly 500 ms.
Each server reports a p80 latency of 200 ms. After averaging, it's still 200 ms, which is correct.
Now, we conjecture that by sending all googlebot requests to a single server, it can use cache more efficiently. So we change routing: now nine servers get only user requests (90% of traffic), and one server gets only googlebot requests (10% of traffic). By doing this, we speed up every request by 10 ms. Great success! Or is it?
Let's calculate the average p80. p80 of a user server: 190 ms (every request takes exactly 190 ms there). p80 of a googlebot server: 490 ms. Average p80 = (190 * 9 + 490 * 1) / 10 = 220 ms. Huh, wait, the service became 20 ms slower! Something is wrong, maybe we should revert the rollout ...
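The arithmetic above can be checked with a short script (a sketch of my own, using nearest-rank percentiles; the numbers and helper are illustrative, not from the thread):

```python
# Reproduces the thought experiment above: averaging per-server p80s
# reports a regression even though every single request got faster.
import math

def p80(latencies):
    """Nearest-rank 80th percentile."""
    ordered = sorted(latencies)
    rank = math.ceil(0.8 * len(ordered))
    return ordered[rank - 1]

# Before: 10 identical servers, each 90% user (200 ms) + 10% googlebot (500 ms)
mixed = [200] * 90 + [500] * 10
avg_p80_before = sum(p80(mixed) for _ in range(10)) / 10      # 200.0

# After: 9 user-only servers (190 ms) and 1 googlebot-only server (490 ms)
user_only = [190] * 100
bot_only = [490] * 100
avg_p80_after = (9 * p80(user_only) + p80(bot_only)) / 10     # 220.0 -- "slower"!

# Pooling the raw latencies first gives the honest answer
pooled_after = user_only * 9 + bot_only
print(avg_p80_before, avg_p80_after, p80(pooled_after))       # 200.0 220.0 190
```

Note the last line: the percentile of the pooled raw data (190 ms) reflects the real improvement, while the average of per-server percentiles (220 ms) points in the opposite direction.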
-----
You may say, "Well, now that you've explained what you did, of course that's wrong." The problem is that there's no right way to average percentiles. The numbers will always be suspect; the only difference is when they fail.
As far as I can tell, the only time the numbers are right is when every server gets exactly the same mix of requests and shows exactly the same performance characteristics. But of course, if you already know that, you don't even need an average: you can take the p80 from a single server, and every other server will have the same value.
You aggregate precisely because you DON'T know whether they are all behaving the same way. When you average percentiles, you're assuming the very thing you set out to verify.