sarosh | 4 years ago

There is already a nice writeup on the current incident from Cloudflare at https://blog.cloudflare.com/october-2021-facebook-outage/

The key observations:

"Due to Facebook stopping announcing their DNS prefix routes through BGP, our and everyone else's DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.

But that's not all. Now human behavior and application logic kicks in and causes another exponential effect. A tsunami of additional DNS traffic follows.

This happened in part because apps won't accept an error for an answer and start retrying, sometimes aggressively, and in part because end-users also won't take an error for an answer and start reloading the pages, or killing and relaunching their apps, sometimes also aggressively."
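
For illustration only, here is a minimal sketch of the usual client-side mitigation for the retry storm described in the quote: capped exponential backoff with random jitter. It is not taken from the Cloudflare post. The function name `call_with_backoff` and its parameters (`max_attempts`, `base`, `cap`, `sleep`, `rng`) are hypothetical and chosen just for this example.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation() up to max_attempts times.

    After each failure except the last, wait a random time between 0 and
    min(cap, base * 2**attempt) seconds ("full jitter"), so that many
    clients failing at once do not all retry at the same instant.
    All names and defaults here are illustrative assumptions.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt == max_attempts - 1:
                break  # out of attempts, do not sleep again
            ceiling = min(cap, base * (2 ** attempt))
            sleep(rng() * ceiling)
    raise last_error
```

The `sleep` and `rng` parameters exist only so the delays can be observed or replaced in tests. A caller would normally pass just the operation, for example `call_with_backoff(lambda: socket.getaddrinfo("example.com", 443))` after importing `socket`.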

spyspy|4 years ago

> apps won't accept an error for an answer and start retrying, sometimes aggressively

I'm certainly guilty of this. Retries make the world go round, and round again. I've been given attitude by teams that own downstream services.

Them: "Why are you retrying so aggressively?" Me: "Why is your service so damn flakey?"

yesbabyyes|4 years ago

Surely they are upstream from you if they need to rate limit you?

(And that sounds like you giving, rather than being given, attitude.)

drewcoo|4 years ago

I don't reload often, but when I do, I do it rapidly and in anger.

-some tester I know