top | item 34668590

vhold | 3 years ago

This is an example of why you want interoperable diversity in complex distributed systems.

By having everything so standardized and consistent, they had the exact same failure mode everywhere and lost redundant fault tolerance. If they had different interoperable switches, running different software, the outage wouldn't have been absolute.

When large complex distributed systems grow organically over time, they tend to wind up with diversity. It usually takes a big centralized project focused on efficiency to destroy that property.

yusyusyus|3 years ago

I appreciate this comment. In my world of packet pushing, I try to promote vendor diversity for this reason.

The practical downsides of this diversity live in the complexity of the interop (often slowing feature velocity), operations, and procurement/support.

But issues like the AT&T 4ESS outage have occurred before in IP networks too, for example through BGP bugs. Diversity alleviates some of the global impact.

vlovich123|3 years ago

There are other ways of accomplishing this, like staged rollouts, that keep the cost efficiencies of implementing your own network only once and avoid a combinatorial explosion in testing complexity.

You can sometimes play this game with vendors because you want them to give you an interoperable interface so that you avoid vendor lock-in and get better pricing, but that’s a secondary benefit, and staged rollouts should still be performed even if you have heterogeneous software.
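To make "staged rollout" concrete, here is a minimal sketch, not anything from the actual AT&T incident. The function name `staged_rollout`, the `deploy` and `healthy` callbacks, and the stage fractions are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Sequence


def staged_rollout(
    nodes: Sequence[str],
    stage_fractions: Sequence[float],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy to progressively larger fractions of the fleet.

    Halts after the first stage in which any updated node is unhealthy.
    Returns the nodes that were updated before finishing or halting.
    All names here are hypothetical; this is an illustration only.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of nodes that should be updated after this stage.
        target = int(len(nodes) * fraction)
        for node in nodes[len(updated):target]:
            deploy(node)
            updated.append(node)
        # Check every node updated so far, not just the newest batch.
        if not all(healthy(node) for node in updated):
            break
    return updated


if __name__ == "__main__":
    fleet = [f"switch-{i}" for i in range(10)]
    done = staged_rollout(
        fleet,
        stage_fractions=(0.1, 0.5, 1.0),
        deploy=lambda node: None,               # stand-in for a real update
        healthy=lambda node: node != "switch-3",  # pretend one node breaks
    )
    print(f"updated {len(done)} of {len(fleet)} nodes before stopping")
```

Note that the health check only catches failures that show up while a stage is being observed; a fault that stays dormant until some later trigger would pass every stage.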

kortilla|3 years ago

Staged rollouts do not protect you from long-lurking bugs. Even in this AT&T case they most certainly did a staged rollout, if only because they couldn’t shut off the entire phone network to run an update across all systems.