Don't read too much into announcements like this. Status pages and outage notices are often political.
I don't know how status pages work at Google, but I do work in reliability engineering and I sometimes make recommendations to update the status pages.
Not sure why you’re being downvoted. Status pages for big companies are never hooked up to automation. It’s just bad PR to show red across the board.
voytec|2 years ago
Status pages are rarely dynamic, and updates require a blessing from upstairs. More often than not, complete outages are referred to as "degraded performance affecting some users".
oooyay|2 years ago
Some context before I go on: reliability is often measured by mapping critical features to services and their degradation. This gets more challenging once a feature maps to more than a couple of services and those services have dependencies of their own. When your average reliability is measured in its number of nines, as opposed to its significant preceding digits, your signal interpretation game has to step up significantly. Together, these two situations make it far harder to state whether a given service degradation in a chain of services is truly having external customer impact at a given time. That's why a human needs to make the call to update the status page, and why status page availability numbers differ from internal numbers.
I spend a good portion of nearly every sprint hunting down systemic issues that pop up across the ecosystem of services, seen from a bird's-eye view. Often, knowing whether a given series of errors will cause external customer impact relies heavily on knowing the current configuration of the services in the chain, their graceful failure mechanisms, how failure manifests client-side, and whether that failure is critical to an SLA.
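To make the "number of nines" point concrete, here is a rough sketch of how availability compounds along a chain of services. It assumes every service in the chain is a hard dependency and that failures are independent, which real systems rarely satisfy. The function names and the sample numbers are made up for illustration.

```python
import math

def nines(availability: float) -> float:
    """Convert an availability fraction to a 'number of nines' (0.999 -> ~3)."""
    return -math.log10(1.0 - availability)

def chain_availability(availabilities) -> float:
    """Availability of a feature that needs every service in the chain.

    Assumes independent failures and no graceful fallback, so the
    feature is up only when all services are up at the same time.
    """
    return math.prod(availabilities)

# Hypothetical per-service availabilities for one feature's call chain.
chain = [0.9999, 0.9999, 0.999, 0.9995]
feature = chain_availability(chain)

print(f"feature availability: {feature:.5f} ({nines(feature):.2f} nines)")
```

Under those assumptions the feature ends up with fewer nines than its weakest service, which is part of why a per-service error spike does not translate directly into a customer-facing number.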
I have not, in my history of reliability engineering, seen anyone object to updating the status page for political reasons.
eddythompson80|2 years ago
If there is a networking outage, everything on a status page should be red, but that looks bad for PR. So you just set “networking outage” and leave everything else green, even though realistically everything is down.
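For illustration, this is roughly what an automated rollup would show if a status page propagated outages through hard dependencies. It is only a sketch: the component names and the `DEPENDS_ON` map are hypothetical, the graph is assumed to be acyclic, and it ignores graceful fallbacks entirely.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: component -> components it hard-depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "networking": [],
    "compute": ["networking"],
    "storage": ["networking"],
    "console": ["compute", "storage"],
}

def effective_status(reported_down: Set[str],
                     depends_on: Dict[str, List[str]]) -> Dict[str, str]:
    """Mark a component 'down' if it, or anything it depends on, is down.

    Assumes the dependency graph has no cycles.
    """
    memo: Dict[str, bool] = {}

    def is_down(name: str) -> bool:
        if name not in memo:
            memo[name] = name in reported_down or any(
                is_down(dep) for dep in depends_on.get(name, [])
            )
        return memo[name]

    return {name: "down" if is_down(name) else "ok" for name in depends_on}

# Only networking is reported down, yet every dependent turns red.
for component, state in effective_status({"networking"}, DEPENDS_ON).items():
    print(f"{component}: {state}")
```

A manually curated page would instead flag only "networking" and leave the dependents green.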