top | item 7615816

tbagman | 12 years ago

I understand your point of view, but I have a different opinion.

When I see these kinds of claims, my guess is that they are usually based on a failure analysis given a particular replication degree and estimates of failure probabilities of various components and failure domains. Underlying this analysis is usually an assumption about the independence of failures between the failure domains.
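For concreteness, here's a minimal sketch of the kind of independence-based estimate I mean. Every number in it is illustrative and made up by me, not anything Amazon has published:

```python
# Sketch of an independence-based durability estimate.
# All figures here are illustrative, not Amazon's actual parameters.

def annual_loss_probability(p_fail: float, replicas: int) -> float:
    """Probability of losing an object in a year, assuming each of its
    `replicas` copies fails independently with probability `p_fail`."""
    return p_fail ** replicas

# 3 independent replicas, each with a (made-up) 1% annual failure
# probability, gives roughly a one-in-a-million annual loss probability:
print(annual_loss_probability(0.01, 3))
```

Crank up the replication degree or shrink the per-copy failure probability and the exponent drives the result to astronomically small numbers — which is exactly why the independence assumption carries so much weight.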

The good news is that industry has moved away from the computer as the unit of independent failure to a much larger failure domain: often a cluster within a data center, or an entire data center. This means that the analysis takes into account the infrequent occurrence of a large number of correlated failures within the failure domain.

The bad news is that there are inevitably correlated failures across the failure domains, regardless of how carefully you design to avoid them. Software bugs, coordinated attacks, operator errors, cascading failures caused by well-intentioned but runaway control loops and automated failover mechanisms, and so on, can be the culprit.

So, here's the problem. This statistic from Amazon, taken at face value, says that relying on Amazon to keep your data durable and safe is practically risk-free: durability issues would never happen in your lifetime (or, alternatively, would affect such a dramatically small fraction of objects that you might not care).

In practice, however, I suspect you do want to plan for the "unknown unknowns" that will cause data loss at low probability, but at a probability much higher than 0.000000001%.
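In that framing, even a tiny correlated-failure probability swamps the advertised figure. A back-of-envelope sketch, where the correlated-loss number is a pure guess on my part:

```python
# Illustrative arithmetic only; p_correlated is a made-up guess.
p_independent = 1e-11   # the advertised 99.999999999% durability figure
p_correlated = 1e-7     # hypothetical "unknown unknowns" loss probability

# Probability of loss from either cause (union of the two events):
p_total = p_independent + p_correlated - p_independent * p_correlated
print(p_total)  # dominated almost entirely by the correlated term
```

The point isn't the specific guess; it's that any correlated-failure floor, however small, becomes the number that actually matters once it exceeds the independence-based estimate.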

Here's another way to look at it: I'd love it if Amazon posted data about the rate at which they've actually experienced durability failures in the past year or two, rather than posting what I'm supposing (I might be wrong!) are calculations based on assumptions of independent failure probabilities.
