rcrowley|5 months ago
2. The thinking laid out in the blog post you linked to is how we went about it. You can do the math with your own parameters by computing the probability of a second node failure within the time it takes to recover from a first node failure. The failures are independent, since the nodes are on physically separate hardware in physically separate availability zones. It's only when they overlap that problems arise. The core is this:
P(second node failure within MTTR for first node failure) = 1 - e^( -(MTTR for a node failure) / (MTBF for a node) )
3. This one's harder to test yourself. You can run all sorts of tests yourself (<https://rcrowley.org/2019/disasterpiece-theater.html>) and via AWS FIS, but you kind of have to trust the cloud provider (or read their SOC 2 report) to learn how availability zones really work and really fail.
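For anyone who wants to plug their own parameters into the formula from point 2, here is a minimal Python sketch. It assumes node failures follow an exponential distribution with rate 1/MTBF, which is what the formula implies. The function name `p_second_failure` and the example values (1 hour MTTR, 1000 hour MTBF) are made up for illustration and are not from the blog post.

```python
import math


def p_second_failure(mttr_hours: float, mtbf_hours: float) -> float:
    """P(a given node fails within the MTTR window) = 1 - e^(-MTTR / MTBF).

    Assumes exponentially distributed failures; both arguments must use
    the same time unit (hours here).
    """
    if mttr_hours < 0 or mtbf_hours <= 0:
        raise ValueError("MTTR must be >= 0 and MTBF must be > 0")
    # expm1(x) computes e^x - 1 without losing precision when x is tiny,
    # which is the usual case here since MTTR is much smaller than MTBF.
    return -math.expm1(-mttr_hours / mtbf_hours)


# Hypothetical parameters, purely for illustration.
example = p_second_failure(mttr_hours=1.0, mtbf_hours=1000.0)
print(f"P(second failure within MTTR) = {example:.6f}")
```

When MTTR is much smaller than MTBF the result is very close to MTTR / MTBF, which is a handy sanity check.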
n_u|5 months ago
Independence simplifies things:
P(first failure and second failure within MTTR of first node)
= P(one failure) * P(second failure within MTTR of first node)
= P(one failure) * (1 - e^(-λx))
where x = MTTR for first node
λ = 1/MTBF
Plugging in the numbers from your blog post:
P(one failure within 30 days) = 0.01 (not sure if this part is correct)
MTTR = 5 minutes + 5 hours ~= 5.083 hours
MTBF = 30 days / 0.01 = 3000 days = 72000 hours
0.01 * (1 - e^(-5.083 / 72000)) ~= 0.0000007 = 0.00007%
I must be doing something wrong because I'm not getting the 0.000001% you have in the blog post. If there's some existing work on this I'd be stoked to read it; I can't quite find a source.
Also, there are two nodes that could fail while the first is down, but that would make my answer larger, not smaller.
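The arithmetic above can be reproduced with a short Python sketch. It uses only the numbers quoted in this comment (0.01 per 30 days, 5 minutes + 5 hours) and assumes exponentially distributed, independent failures. The variable names are made up for illustration, and the two-node variant is one reading of the last remark, not something taken from the blog post.

```python
import math

# Numbers quoted in the comment above.
p_first = 0.01                        # P(one failure within 30 days)
mttr_hours = 5 / 60 + 5               # 5 minutes + 5 hours
mtbf_hours = (30 * 24) / p_first      # 30 days / 0.01 = 72000 hours

rate = 1 / mtbf_hours                 # lambda = 1 / MTBF

# P(one specific other node fails within the MTTR window).
p_second = -math.expm1(-rate * mttr_hours)
p_joint = p_first * p_second

# Assumed variant: either of the two remaining nodes may fail, so the
# combined failure rate during the window is 2 * lambda.
p_second_either = -math.expm1(-2 * rate * mttr_hours)
p_joint_either = p_first * p_second_either

print(f"one specific node: {p_joint:.3e} ({p_joint * 100:.5f}%)")
print(f"either of two:     {p_joint_either:.3e} ({p_joint_either * 100:.5f}%)")
```

The first result comes out near 7e-7, matching the 0.0000007 figure above, and the two-node variant is roughly double that, so it moves further from 0.000001% rather than closer.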
rcrowley|5 months ago