2gremlin181 | 3 months ago
I'm not asking for an instant RCA or confirmation of the scope of the outage. Just stop gaslighting me.
knorker|3 months ago
Ah, so you're saying the status page should be hooked up to internal monitoring probers?
So how sure are you that it's the service that's broken, and not the probers? How sure are you that the granularity of the probers reflects the actual scope of the outage?
Also, this invites questions like "well why don't you have probing on the EXACT workflow that happened to break this time?!". And honestly, that's not helpful.
Say you have a complete end to end workflow for your web store. Should you publish "100% outage, the webstore is down!!" on your status page, automatically, because the very diligent prober failed to get into the shoe section of your store? That's probably not helpful to anybody.
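To make the web store example concrete, here is a minimal sketch of rolling prober results up per component instead of flipping one global "outage" flag. Everything in it is hypothetical: the `ProbeResult` type, the component names, and the three status labels are assumptions for illustration, not how any particular status page works.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    component: str  # hypothetical component name, e.g. "shoes"
    ok: bool        # True if the probe for this component succeeded


def summarize(results):
    """Return (overall_status, failing_components) for a list of ProbeResult."""
    failing = sorted({r.component for r in results if not r.ok})
    total = len({r.component for r in results})
    if not failing:
        overall = "operational"
    elif len(failing) == total:
        # Every probed component failed; this could also mean the prober is broken.
        overall = "major outage"
    else:
        overall = "partial degradation"
    return overall, failing


results = [
    ProbeResult("checkout", True),
    ProbeResult("search", True),
    ProbeResult("shoes", False),
]
print(summarize(results))
```

Even this only reports what the probers can see. It says nothing about whether the prober itself is broken, or about workflows nobody wrote a probe for.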
> Clearly these metrics and alerts exist internally too.
Well, no. Probers can never cover every dimension across which a service can have an outage. You may think that the service is simple and has an obvious status, but you're using like 0.1% of the user surface, and have never even heard of the weird things that 99% of actual traffic does.
How do you even model your minority use case? Is it an outage? Or is your workflow maybe a tiny weird one, even though you think it's the straightforward one?
Especially since outages in complex systems tend to be hard to describe accurately, and a status page needs to boil them down to something simple.
In many cases even engineers inspecting the system cannot be certain whether real users are experiencing an outage, or whether they're chasing an internal user, or whether nothing is user visible because internal retries are taking care of everything, or what.
Complex systems are often complex because the world is complex. And if the problem were simple and unchanging, there would be no reason to have outages in the first place.
And often the engineers helping phrase an outage statement need to sacrifice detail for clarity.
Another thing is what do you do if you start serving 500s to 90% of traffic? An outage, right? Surely auto-publish to a status page? Oh, but it turns out this was a DoS attack, and no non-DoS traffic was affected. Can your monitoring detect the difference? Unlikely.
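A toy illustration of that last point, assuming (unrealistically) that each request already carries a reliable `suspected_attack` label. The label, the function name, and the traffic mix are all made up for the sketch. Producing that label is the part monitoring usually can't do.

```python
def error_rate(requests, exclude_suspected_attack=False):
    """requests: list of (status_code, suspected_attack) tuples.

    Returns the fraction of considered requests that got a 5xx response.
    """
    considered = [
        status
        for status, suspected in requests
        if not (exclude_suspected_attack and suspected)
    ]
    if not considered:
        return 0.0
    return sum(1 for status in considered if status >= 500) / len(considered)


# Hypothetical mix: 90 attack requests answered with 500,
# 10 legitimate requests answered with 200.
traffic = [(500, True)] * 90 + [(200, False)] * 10

raw = error_rate(traffic)
filtered = error_rate(traffic, exclude_suspected_attack=True)
print(raw, filtered)
```

The raw number says 90% of requests failed, while the filtered number says no legitimate user noticed anything. An alert keyed on the raw rate would auto-publish an outage that nobody real experienced.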
gwbas1c|3 months ago