Not sure it’s possible to know the answer without more context. What do they use the averages for? What do you think they should do instead and why is it better?
Here's a simplified example, adapted from an actual case I saw:
We have ten servers. Each server's traffic is 90% user requests and 10% googlebot requests. Each user request takes exactly 200 ms; each googlebot request takes exactly 500 ms.
Each server reports a p80 latency of 200 ms. Averaged across servers, it's still 200 ms, which is correct.
Now, we conjecture that by sending all googlebot requests to a single server, that server can use its cache more efficiently. So we change the routing: nine servers now get only user requests (90% of traffic), and one server gets only googlebot requests (10% of traffic). Doing this speeds up every request by 10 ms. Great success! Or is it?
Let's calculate the average p80. The p80 latency of a user server is 190 ms (every request takes exactly 190 ms there). The p80 latency of the googlebot server is 490 ms. Average latency = (190 * 9 + 490 * 1) / 10 = 220 ms. Huh, wait: the service became 20 ms slower! Something is wrong; maybe we should revert the rollout ...
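The whole scenario fits in a few lines of code. This is a hypothetical simulation of the setup above (server counts, latencies, and function names are all illustrative): it compares the dashboard number (average of per-server p80s) against the true p80 computed over the pooled raw latencies, before and after the rerouting.

```python
import numpy as np

# Before rerouting: each of 10 servers gets 90% user requests (200 ms)
# and 10% googlebot requests (500 ms). 100 requests per server for simplicity.
before = [np.array([200] * 90 + [500] * 10) for _ in range(10)]

# After rerouting: nine servers get only user requests, one gets only
# googlebot requests, and every request is 10 ms faster.
after = [np.array([190] * 100) for _ in range(9)] + [np.array([490] * 100)]

def avg_of_p80(servers):
    # What the dashboard shows: per-server p80, averaged across servers.
    return np.mean([np.percentile(s, 80) for s in servers])

def true_p80(servers):
    # The honest number: p80 over all raw latencies pooled together.
    return np.percentile(np.concatenate(servers), 80)

print(avg_of_p80(before), true_p80(before))  # 200.0 200.0
print(avg_of_p80(after), true_p80(after))    # 220.0 190.0
```

The averaged metric goes from 200 ms to 220 ms while the true p80 goes from 200 ms down to 190 ms: the rollout actually helped, and the dashboard says the opposite.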
-----
You may say, "Well, now that you've explained what you did, of course that's wrong." The problem is, there's no right way to average percentiles. The numbers will always be suspect; the only difference is when they fail.
As far as I can tell, the only time the numbers are right is when every server gets exactly the same mix of requests and shows exactly the same performance characteristics. But of course, if you already know that, you don't need an average at all: you can take the p80 from a single server, and every other server will show the same value.
You are aggregating precisely because you DON'T know whether all the servers are behaving the same way. When you average percentiles, you're assuming the very thing you set out to verify.
stat_throwaway|4 years ago