martius | 2 years ago

But does it really matter whether the incident is a flood or a cascading software failure if the likelihood and severity are the same?

Being in the same building is an "implementation detail" from a customer perspective; what matters is the consequences of this decision.

For example, maybe this decision allows for better network connectivity at a lower cost for inter-zone traffic while, on the other hand, not protecting against some classes of risk.

In the end, you can have a similar multi-zone outage that keeps the region down for an extended period of time just because of a bad network config push (see the massive Facebook outage in 2021). As a customer, I don't care if it's a flood or a network outage.

IMHO, what matters most is clear documentation of how these abstractions work for users, along with the corresponding contractual agreements (costs, SLAs, etc.). Users can then decide whether they are ready to pay the price of protecting themselves against an extended outage impacting a single region.

flaminHotSpeedo|2 years ago

It absolutely does matter.

The MTTR for outages caused by physical damage is way higher, and resiliency against physical disasters is a major selling point of availability zones as a fault container.

Hosting every zone of your region (if that's actually the case here) in the same building is simply negligent.

Besides the obvious risks like this incident, even if the zones have physical fire barriers, the chances that operators will be allowed into one "zone" after another has had a fire are slim to none.
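
To put rough numbers on the MTTR point, here is a minimal sketch comparing two incident classes that share the same likelihood and impact radius but differ in recovery time. Every figure is made up for illustration, and `IncidentClass` and its fields are hypothetical names, not anything a cloud provider publishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentClass:
    """One class of outage, described by hypothetical figures."""
    name: str
    events_per_year: float  # likelihood (made-up value)
    zones_affected: int     # impact radius (made-up value)
    mttr_hours: float       # mean time to recovery (made-up value)

    def expected_downtime_hours(self) -> float:
        # Expected yearly downtime = how often it happens * how long it lasts.
        return self.events_per_year * self.mttr_hours


# Same likelihood and same impact radius; only the MTTR differs.
config_push = IncidentClass("bad network config push", 0.1, 3, 6.0)
flood = IncidentClass("physical damage to a shared building", 0.1, 3, 240.0)

for incident in (config_push, flood):
    print(f"{incident.name}: {incident.expected_downtime_hours():.1f} h/year")
```

With these assumed inputs, the physical-damage class carries 40 times the expected downtime even though likelihood and impact radius are identical, which is why MTTR has to be tracked separately from both.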

martius|2 years ago

True, I implicitly included the MTTR in the "severity", but this is actually a different thing (severity is more about the impact radius).

But I don't think it changes my point: knowing how Google Cloud designs regions or zones is still an implementation detail. What matters is what MTTR they are targeting, and this should be known ahead of time.

There are so many "implementation details" that customers are not aware of, because they are always changing, non-contractual, or just hard to make sense of. What matters is meaningful abstractions.

I am not saying whether it's OK for the zones to be in the same building; I don't know, and I was really surprised when I discovered this a few years ago. But this information gives you a mental model of "what could go wrong" that is biased towards some specific risks, and in my experience, relying on these very practical aspects makes the risk analysis and design decisions harder.

OTOH, one thing that may also be problematic (and biasing) is that the commonly understood definition of a "zone" is the one people know from AWS, so using the same term without being very explicit about the differences will also lead to incorrectly calculated risks. I find Google Cloud's public documentation too vague in general (and often ambiguous).

traderj0e|2 years ago

Seems the likelihood isn't the same. AWS separates its AZs physically; GCP does not. I'd want to know this as a customer, not just some abstraction.
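
A rough way to see why physical separation could change the likelihood: a minimal sketch, assuming each zone fails independently with some probability and that co-located zones also share one building-level risk. The function name `p_all_zones_down` and every probability below are hypothetical, not measured figures for any provider.

```python
def p_all_zones_down(p_zone: float, zones: int, p_shared: float = 0.0) -> float:
    """Probability that every zone is down in a given period.

    p_zone:   assumed independent failure probability of a single zone
    zones:    number of zones in the region
    p_shared: assumed probability of one shared event (e.g. a flood in a
              common building) that takes all zones down at once
    """
    p_independent = p_zone ** zones
    # Either the shared event happens, or it does not and all zones
    # happen to fail independently.
    return p_shared + (1.0 - p_shared) * p_independent


# Made-up inputs: 1% per-zone risk, 3 zones, 0.1% shared-building risk.
separated = p_all_zones_down(0.01, 3)
colocated = p_all_zones_down(0.01, 3, p_shared=0.001)

print(f"physically separated zones: {separated:.2e}")
print(f"zones in one building:      {colocated:.2e}")
```

Under these assumptions the shared risk dominates the result, so the region-wide likelihood depends on where the zones physically sit and not only on per-zone reliability.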