
wparad | 3 months ago

This is absolutely true, but the end result is the same. The assumption is "We can fix a third-party component behaving temporarily incorrectly, and therefore we can do something about it." If the third-party component never behaves correctly, then there's nothing we can do to fix it.

Correlations don't have to be talked about, because they don't increase the likelihood of success, but rather the likelihood of failure, meaning that we would need orders of magnitude more reliable technology to solve that problem.

In reality, those sorts of failures aren't usually temporary, but rather systemic, such as "we've made an incorrect assumption about how that technology works" - feature not a bug.

In that case, it doesn't really fit into this model. There are certainly signals that would better indicate whether we could or could not use a component, but for the sake of the article, I think that was probably going much too far.

TL;DR Yes, for sure, individual attempts are correlated, but in most cases it doesn't make sense to track that, because those situations end up in other buckets: "always down = unreliable" or "actually up = more complex story which may not need to be modelled".


scottlamb | 3 months ago

I think the reasoning matters as much as the answer, and you had to make at least a couple strange turns to get the "right answer" that retries don't solve the problem:

* the 3rd-party component offering only 90% success—I've never actually seen a system that bad. 99.9% success SLA is kind of the minimum, and in practice any system that has acceptable mean and/or 99%/99.9% latency for a critical auth path also has >=99.99% success in good conditions (even if they don't promise refunds based on that).

* the whole "really reliable retry handler" thing—as mentioned in my first comment, I don't understand what you were getting at here.

I would go a whole other way with this section—more realistic, much shorter. Let's say you want to offer 99.999% success within 1 second, and the third-party component offers 99.9% success per try. Then two tries gives you 99.9999% success if the failures are all uncorrelated but retries do not help at all when the third-party system is down for minutes or hours at a time. [1] Thus, you need to involve an alternative that is believed to be independent of the faulty system—and the primary tool AWS gives you for that is regional independence. This sets up the talk about regional failover much more quickly and with less head-scratching. I probably would have made it through the whole article yesterday even in my feverish state.

[1] unless this request can be done asynchronously, arbitrarily later, in which case the whole chain of thought afterward goes a different way.
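The independence arithmetic above is easy to sanity-check: with per-try success probability p and n independent tries, combined success is 1 − (1 − p)^n. A minimal sketch (the 0.999 and two-tries figures come from the comment; the function name is mine):

```python
def combined_success(p: float, tries: int) -> float:
    """Probability that at least one of `tries` independent attempts
    succeeds, given per-try success probability `p`."""
    return 1 - (1 - p) ** tries

# 99.9% per try, two independent tries -> six nines.
print(f"{combined_success(0.999, 2):.7f}")  # ~0.9999990
```

The key caveat is the word "independent": if the third-party system is down for minutes, the second try fails with probability ~1, and the formula no longer applies.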

wparad | 3 months ago

Hmm, I never considered using an SLA on latency as a way to justify the argument. If I pull this content into a future article or talk, I will definitely consider reframing it for easier understanding.