Stop trying to re-invent statistics. Use a box and whisker plot of latency. You quickly get to see the mean, the quartiles, and all the outliers and you get it in a format which is familiar and easy to understand. You can even plot box and whisker plots next to each other for quick meaningful comparisons between different things.
scott_s|7 years ago
vlovich123|7 years ago
In this particular case the challenge is aggregating statistics from a very large fleet & having automated alarms. Visualization tools don't help with any of that. More specifically, the reporting tools out there apparently have a very common & persistent flaw of reporting an average of percentiles across agents which is a statistically meaningless metric. It makes no difference how you visualize it - the data is bunk.
This article flips it so that agents simply report how many requests they got & how many exceeded the required threshold. This lets them report the percentage of users having a worse experience than the desired SLA. You can also build reliable tools on top of this metric. It's not a universal solution but it's a neat trick to maintain the performance properties of not needing to pull full logs from all agents & still have a meaningful representation of the latency of your users.