Dependence on big providers like Google, Microsoft, and Cloudflare is increasing, which means a failure at even one of them causes outages on a wide scale. Distribution is the key.
Well, for the vast majority of simple apps, you're better off failing when everybody else is. People will blame you less. When your alternative solution fails while everything else seems to be up, the blame will fall on you.
Google could probably do a better job here and not put so many services on the same pool of L7 devices. Separate pools with smaller groupings would reduce the blast radius.
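To make the blast-radius point concrete, here is a minimal sketch. The service names, pool names, and the round-robin assignment are all hypothetical and say nothing about how Google actually groups services; it only compares what one pool failure takes down under each layout.

```python
def blast_radius(assignment, failed_pool):
    """Return the sorted list of services that sit behind the failed pool."""
    return sorted(s for s, pool in assignment.items() if pool == failed_pool)


# Hypothetical service names, for illustration only.
services = ["mail", "docs", "video", "maps", "chat", "drive"]

# Layout A: every service shares one pool of L7 devices.
shared = {s: "pool-0" for s in services}

# Layout B: services are spread round-robin over three smaller pools.
split = {s: f"pool-{i % 3}" for i, s in enumerate(services)}

print(blast_radius(shared, "pool-0"))  # all six services are affected
print(blast_radius(split, "pool-0"))   # only the services assigned to pool-0
```

With the shared layout a single pool failure hits everything; with three pools the same failure hits a third of the services.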
(Googler, opinion is my own, I know nothing about this specific outage).
Google has LOTS of internal routing systems. BGP is about announcing which IPs a given network can handle, which is not the case here.
Before hitting application-level routing, I believe you hit Maglev[0]. It seems unlikely this was the cause, as it would likely have taken down all services.
One of the first well-known application-layer balancers you hit is the GFE[1][2]. This is similar to an HTTP reverse proxy, but Google-made. I could definitely see this as the cause.
Traffic entering Google's network hits a bunch of front ends that route traffic to the relevant back ends. I'd guess it's those application-level front ends that were having trouble, rather than anything network-level like BGP.
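As a rough illustration of why a front-end problem looks different from a network-level one, here is a toy sketch of a host-based front end. The route table, host names, and status codes are assumptions made up for the example, not a description of how the GFE works.

```python
# Hypothetical host -> back-end mapping held by an application-level front end.
ROUTES = {
    "mail.example.com": "mail-backend",
    "docs.example.com": "docs-backend",
}


def route(host, frontend_healthy=True):
    """Return (status, backend) for a request that has already reached the front end."""
    if not frontend_healthy:
        # The network delivered the request, but the front end cannot serve it,
        # so every service behind this front end fails the same way.
        return 502, None
    backend = ROUTES.get(host)
    if backend is None:
        return 404, None
    return 200, backend


print(route("mail.example.com"))
print(route("mail.example.com", frontend_healthy=False))
```

In this toy model the IPs stay reachable the whole time, so nothing would look wrong at the BGP level even while requests fail.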
There's a huge """secret""" Google data center in Council Bluffs, Iowa that appears to be in the finishing phases of completion. Yesterday I talked to a union worker who is moving to Des Moines tonight to work on a new Microsoft data center there. It appears that work is drying up at the data center here, and a lot of the travelling blue-collar folk are leaving the area.
I wonder if this data center coming apparently partially online is a part of the problem?
Also, after this he is likely to work on an Amazon fulfillment center next year - impressed by all the (albeit temporary) blue collar jobs created by FAANG at the moment!
[+] [-] Kaknut|5 years ago|reply
[+] [-] auganov|5 years ago|reply
[+] [-] tyingq|5 years ago|reply
[+] [-] tyingq|5 years ago|reply
[+] [-] kyrra|5 years ago|reply
For example, a PM (postmortem) was posted for this previous outage: https://status.cloud.google.com/incident/cloud-networking/19...
[+] [-] heartbeats|5 years ago|reply
[+] [-] kyrra|5 years ago|reply
[0] https://static.googleusercontent.com/media/research.google.c...
[1] https://cloud.google.com/security/infrastructure/design#goog...
[2] https://landing.google.com/sre/workbook/chapters/managing-lo...
[+] [-] MyelinatedT|5 years ago|reply
[+] [-] mcpherrinm|5 years ago|reply
[+] [-] enneff|5 years ago|reply
[+] [-] skim_milk|5 years ago|reply
[+] [-] jeffbee|5 years ago|reply
https://www.google.com/maps/@41.2197694,-95.8658016,3a,89.3y...
[+] [-] SteveNuts|5 years ago|reply
[+] [-] qmarchi|5 years ago|reply
[+] [-] throwawayinfo|5 years ago|reply
[+] [-] rezonant|5 years ago|reply