This is easy for me to say in hindsight, but a graph of queries per user, for the top 100 accounts would have lit up the issue like a Christmas tree. But it's the kind of chart you stuff on the full page of charts that are usually not interesting. So many problems have been avoided by "huh, that looks weird" on a dashboard.
In a more targeted search, asking Splunk to show out where the traffic is coming from on a map, and a pie chart of the top 10 accounts they're coming from would also be illuminating.
But again, this is easy for me to say in hindsight, after the issue has been helpfully explained to me in a blog post. I don't claim could have done any better if I were there, my point is that you need to see inside the system to be operate it well.
If daily volume is ~400k a sharp 60k spike should show even on a straight rps dashboard. I suspect that didn’t exist because most cloudy SaaS tools don’t seem to put any emphasis on traffic over a particular unit time, like in the Cloudflare site dashboard one must hover over the graph and subtract timestamps to find out if the number of request is per hour, minute, etc. Splunk is similarly bad — the longer the period query covers, the less granularity you get and the minimum seems to be minutely.
Drill down by user, locale, etc would just make it even easier to figure out what’s going on once you spot the spike and start to dig.
My unsolicited advice for Zac in case they’re reading is — start thinking about SLIs (I for indicators) sooner rather than later to help you think through cases like this.
blantonl|2 years ago
fragmede|2 years ago
In a more targeted search, asking Splunk to show out where the traffic is coming from on a map, and a pie chart of the top 10 accounts they're coming from would also be illuminating.
But again, this is easy for me to say in hindsight, after the issue has been helpfully explained to me in a blog post. I don't claim could have done any better if I were there, my point is that you need to see inside the system to be operate it well.
eep_social|2 years ago
Drill down by user, locale, etc would just make it even easier to figure out what’s going on once you spot the spike and start to dig.
My unsolicited advice for Zac in case they’re reading is — start thinking about SLIs (I for indicators) sooner rather than later to help you think through cases like this.
scblzn|2 years ago
/s