(no title)
martius | 2 years ago
But I don't think it changes my point: how Google Cloud designs regions or zones is still an implementation detail. What matters is what MTTR they are targeting, and this should be known ahead of time.
There are so many "implementation details" that customers are not aware of, because they are always changing, non-contractual, or just hard to make sense of. What matters is meaningful abstractions.
I am not saying whether it's OK for the zones to be in the same building; I don't know, and I was really surprised when I discovered this a few years ago. But this information gives you a mental model of "what could go wrong" that is biased towards some specific risks, and in my experience, relying on these very practical aspects makes the risk analysis and design decisions harder.
OTOH, one thing that may be problematic (and biasing) too is that the commonly understood definition of a "zone" is the one people know from AWS, so using the same term without being very explicit about the differences will also lead to incorrectly calculated risks. I find the public documentation of Google Cloud too vague in general (and often ambiguous).
flaminHotSpeedo | 2 years ago
But back to the point: philosophically I agree, but practically I don't. IMO, writing SLAs and enforceable guarantees that give customers the information they need is much harder than exposing the implementation details.
"Zones within a region may be located in the same building" is much more concise than SLAs written in contractual language, and probably conveys more (though potentially less accurate) information once I apply my own context.
Also, if we look at GCP's SLAs, this outage blew the SLA breach threshold out of the water for many services. Some are down to roughly two nines of availability from this incident alone.
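For a sense of scale, here is a rough sketch of the arithmetic behind "nines". It assumes a 30-day measurement window, which is my assumption and not necessarily the period any particular GCP SLA uses, and the function names are made up for illustration. Under that assumption, two nines (99%) leaves room for 7.2 hours of downtime in the window.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Minutes of downtime that still meet the given availability target.

    period_days is an assumed measurement window; real SLAs define their own.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


def achieved_availability_pct(downtime_minutes: float, period_days: float = 30) -> float:
    """Availability (in percent) actually achieved after the given downtime."""
    period_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    # Compare the downtime budget at two, three, and four nines.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes per 30 days")
```

Each extra nine cuts the downtime budget by a factor of ten, so an hours-long outage can use up many months of a three- or four-nines budget in one go.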
Finally (in hindsight maybe I should have led with this, but I'm too lazy to restructure this comment), SLAs are a joke. Outages can destroy your business, but all you get from your cloud provider is a credit, usually for a small fraction of what they charge you. SLAs have no teeth, so if you can't just write off a major outage, you have to have a plan to avoid it, which means you need to know the implementation details.