item 34956319

Ongoing Incident in Google Cloud

235 points | sd2k | 3 years ago | status.cloud.google.com

105 comments


fastest963|3 years ago

This affected us starting at 4:57am US/Pacific, with a significant drop in traffic through the HTTPS Global Load Balancer across all regions and Pub/Sub 502 errors, but there was nothing on the status page for another 45 minutes. Things returned to normal by 5:05am from what I can tell.

dixie_land|3 years ago

Yup, we saw the exact same symptoms, with some GCLBs serving 100% 502s (our upstream QPS graph looks scary, with 5 minutes of 0 QPS).

bushbaba|3 years ago

This demonstrates yet again why global configurations, global services, and global anycast VIP routing should be considered an anti-pattern.

gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

tgtweak|3 years ago

You can't really have 30+ fully independent regions running their own stacks, with different versions of apps and separate secrets, IP/routing, and certificates in each. At some point you have to unify, or it becomes either unmanageable or inconsistent.

zamnos|3 years ago

The underlying problem is that Google doesn't operate the world's DNS servers, but still wants to offer the best possible user experience as a global service. This means anycast VIP routing, because not all DNS resolvers implement EDNS Client Subnet, but they want SSL connections to terminate as close to users as possible.

As far as global services go though, it's easy enough to say "it should just not be possible", but how do you propose doing that in practice for a global service?

How is new config going to go out, globally, without being global? How do global services work if they're not global? How does DDoS protection work if you don't do it globally?

People make fun of "webscale" but operating Google is really difficult and complicated!
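To be fair, there is a standard partial answer to the "how does config go out globally" question: wave-based rollout, where a canary region gets the new config first and each subsequent wave is gated on health checks. A sketch of the general idea (hypothetical region names and helpers, not Google's actual release process):

```python
# Wave-based config rollout: a bad config is caught in the canary
# wave instead of reaching every region at once.
WAVES = [
    ["us-central1"],                        # canary wave
    ["us-east1", "europe-west1"],           # second wave
    ["asia-east1", "australia-southeast1"], # remaining regions
]

def rollout(apply_config, is_healthy, waves=WAVES):
    """Apply config wave by wave, halting if health checks fail."""
    done = []
    for wave in waves:
        for region in wave:
            apply_config(region)
            done.append(region)
        if not is_healthy():
            raise RuntimeError(f"halted after {done}; rolling back")
    return done
```

Of course, this only helps against bad config; it does nothing for a control-plane bug in the rollout system itself.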

medler|3 years ago

> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

As I understand it, GCP is already designed to make global outages impossible. Obviously this outage shows that they messed up somehow and some global point of failure still remains. Looking forward to the post-mortem.

consumer451|3 years ago

My knowledge level: can use AWS console to do < 5% of what is possible.

How much more work would Google create for themselves if they had not globalized their stack? Are we talking something like 5 subsets to manage instead of 1?

kevinventullo|3 years ago

> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

I reckon the only way to achieve that would be to have the same level of interoperability between regions as you get between two distinct cloud providers.

uniformlyrandom|3 years ago

From the messaging, this seems like a partial network outage.

Of course, at Google scale 'partial' is still very big.

Terretta|3 years ago

> This demonstrates yet again why global configurations, global services, and global anycast VIP routing should be considered an anti-pattern.

And why enterprises clamoring for AWS to feature-match Google's global stuff (theoretically making I.T. easier) instead of remaining regionally isolated (actually making I.T. more resilient, with no extra work if I.T. operators can figure out infra-as-code patterns) should STFU and learn themselves some Terraform, Pulumi, or the like.

Also, AWS, if you're in this thread: stop with the recent cross-region coupling features already. Google's doing it wrong; explain that, be patient, and the market share will come back to you when they run out of GCP subsidy dollars.

dotancohen|3 years ago

> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

If that's what you really need, then distribute your assets across GCP, AWS, and DO. That likely means not using any cloud-specific features such as Lambda. AWS is actually really good in this regard, as SES and RDS are easily replaced with regular instances at other cloud providers, even if those wrap some cloud-specific features themselves.

Cthulhu_|3 years ago

For reference / comparison, how many regional outages have there been? Did service outages get avoided due to running a workload in multiple regions?

papruapap|3 years ago

Because copypasting from A to B is much safer...

dekhn|3 years ago

There are three things that scare Google engineers enough to keep them up at night: a global network outage, a global power outage, and a global Chubby outage. Actually, they only really worry about that last one.

typaty|3 years ago

Even the public apt key for signing Google's cloud packages (https://packages.cloud.google.com/apt/doc/apt-key.gpg) is unavailable; it returns 500 for me. This is insane.

nicholasklem|3 years ago

This key was returning 500 some hours before the incident started; I hope it's unrelated.

roseway4|3 years ago

Downloading the key has been erroring since at least ~5pm PT yesterday, 2/27. It’s likely unrelated. Though I’d be unsurprised if the recent layoffs contributed to the situation.

kkielhofner|3 years ago

As has happened many times throughout history (back to mainframes and thin clients of the 90s) there are swings/trends in how infrastructure is hosted.

Listening to the “All In Podcast” yesterday, even those guys were talking about revenue drops in the big cloud services and noting we’re currently in the midst of a swing back toward self-hosting/co-location/whatever, and migrations out.

IMHO those building greenfield solutions today should take a hard look at whether the default approach of the last ~10 years (“of course you build in $BIGCLOUD”) makes sense for the application; in many cases it does not.

It also has the added benefit of de-centralizing the internet a bit (even if only a little).

lolinder|3 years ago

As others have mentioned, there was no revenue drop, there's been a reduction in growth. AWS's 20% growth rate is still very respectable, more than double the 9% growth rate the company had overall.

I would be hesitant to attribute slowed growth to a return to self hosting, it's much more likely that it's caused by companies dialing back their cloud growth after spending a few years going ham digitizing everything during the pandemic.

ranman|3 years ago

The parent comment is neither factual nor advisable.

You build greenfield in cloud precisely because it is greenfield and the utilization isn't well understood. Cloud options let you adjust and experiment quickly. Once a workload is well understood it's a good candidate for optimization, including a move to self managed hardware / on prem.

Buying hardware is a great option once you actually understand the utilization of your product. Just make sure you also have competent operators.

ctvo|3 years ago

AWS is a 75 bln a year business still growing 20%+ YoY; it'll break 100 bln this year. I'd encourage you to examine the numbers yourself.

outworlder|3 years ago

> IMHO those building greenfield solution today should take a hard look at whether the default approach from the last ~10 years “of course you build in $BIGCLOUD” makes sense for the application - in many cases it does not.

When one buys a house, they should take a hard look at whether the default approach of paying for utilities makes sense, versus generating their own power.

While that's a bit snarky, the reasoning is similar. You can:

* Use "bigcloud"(TM) with the whole kit: VMs, their managed services, etc.
* Use bigcloud, but just VMs or storage
* Rent VMs from a smaller provider
* Rent actual servers
* Buy your servers and ship them to a colo
* Buy your servers and build a datacenter

Every level you drop, you need more work, and it grows (not linearly, I suspect). Sure, if you have all the required experts (or you rent them) you can do everything yourself. If not, you'll have to defer to vendors. You will pay some premium for this, but it's either that or payroll.

What also needs to be factored in is how static your system is. If a single machine works for your use-case, great.

One of the systems I manage has hundreds of millions of dollars in contracts on the line and thousands of VMs. I do not care if any single VM goes down; the system will kill it and provision a new one. A big cloud provider availability zone often spans multiple datacenters too, each datacenter with its own redundancies. Even if an entire AZ goes down, we can survive on the other two (with possibly some temporary degradation for a few minutes). If the whole region goes down, we fail over to another. We certainly don't have time to discuss individual servers or rack-and-stack anything.
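The region-level fallback described above can be sketched roughly like this (hypothetical region names; in practice this usually lives in health-checked DNS or load-balancer policy rather than application code):

```python
# Route traffic to the first healthy region in preference order;
# if the primary goes down, traffic falls back automatically.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # preference order

def pick_region(healthy: dict) -> str:
    """Return the preferred region whose health check is passing."""
    for region in REGIONS:
        if healthy.get(region, False):
            return region
    raise RuntimeError("no healthy region available")
```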

It does not come cheap. AWS specifically has egregious networking fees, and you end up paying multiple times (AZ-to-AZ traffic, NAT gateways, and a myriad of services that also charge by the GB, like GuardDuty). It adds up if you are not careful.

From time to time, management comes with the idea of migrating to 'on-prem', because that's reportedly cheaper. Sure, ignoring the hundreds of engineers that will be involved in this migration, and also ignoring all the engineers that will be required to maintain this on-premises, it might be cheaper.

But that's also ignoring the main reason why cloud deployments tend to become so expensive: they are easy. Confronted with the option of spinning up more machines versus possibly missing a deadline, middle managers will ask for more resources. Maybe it's "just" 1k a month extra (those developers would cost more!). It gets approved. 50 other groups are doing the same. Now it's 50k. Rinse, repeat. If more emphasis were placed on optimization, most cloud deployments could be shrunk spectacularly. The microservices fad doesn't help (your architecture might require it, but often that's because you want to ship your org chart, not for technical reasons).

camhart|3 years ago

All in podcast mentioned growth slowing, but not revenue dropping.

knorker|3 years ago

Revenue drop? Google Cloud is still growing 30-40% year on year.

Dave3of5|3 years ago

Ouch some pain at google today then. I hate to wake up on a Monday morning to this.

<3 To the engineers trying to fix it at the moment.

zamnos|3 years ago

Google has follow-the-sun on-call for its large rotations, so this hit the UK team just after lunch.

monero-xmr|3 years ago

This is why any criticism of AWS reliability is meaningless to me. All the cloud providers go down - all of them. Either you are multi-cloud, or you run your own hardware, but these events are inevitable.

vhiremath4|3 years ago

The amount of time you are down vs. up dictates your SLOs and SLAs. Criticism of how reliable one provider is versus another is not only valid, it's backed by hundreds of millions of contractual dollars and credits every year. We spend tens of millions on AWS per year and have several SLAs with them. Our ElastiCache SLA was breached once (localized to us, not the whole customer base), and we got credits commensurate with the amount of business we lost during that downtime period.

If one provider is down more than the others, the criticism is not only valid, it results in real business loss for the provider and its customers.

On multi-cloud: it's one way to reduce the amount of downtime you have, but it comes with a significant operational cost depending on how your application is architected and how the teams internal to your company are formed. It is totally practical to bank on AWS's reliability until you have enough traction or revenue that the added uptime of going multi-cloud is worth the investment. I know you're not saying this isn't the case (I think you're saying "do that if you're going to complain about one provider's uptime"), but I thought it was worth putting the context into the HN ether.
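The credit mechanics mentioned above generally look like tiered uptime thresholds. A sketch with illustrative percentages (not any provider's actual contract terms):

```python
# Hypothetical tiered-credit schedule: the lower the monthly uptime,
# the larger the fraction of the monthly bill returned as credit.
def service_credit(uptime_pct: float) -> float:
    if uptime_pct >= 99.99:
        return 0.00   # SLA met, no credit
    if uptime_pct >= 99.0:
        return 0.10
    if uptime_pct >= 95.0:
        return 0.30
    return 1.00       # full credit for catastrophic months

# A ~45-minute outage in a month (~99.9% uptime) lands in the 10% tier.
print(service_credit(99.9))  # 0.1
```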

yjftsjthsd-h|3 years ago

> This is why any criticism of AWS reliability is meaningless to me.

Er, we absolutely can and should compare rates of problems and overall reliability.

crazygringo|3 years ago

> Either you are multi-cloud, or you run your own hardware

If you run your own hardware these events are inevitable too.

dymk|3 years ago

Inevitable != immune to criticism

ctvo|3 years ago

> This is why any criticism of AWS reliability is meaningless to me.

Is anyone tracking reliability for these public providers? Would be curious how AWS compares to Azure and GCP. My experience is it's better, but we may have avoided Kinesis or whatever that keeps going down.

MuffinFlavored|3 years ago

> you run your own hardware

in multiple datacenters?

uniformlyrandom|3 years ago

05:41 - 06:26 PT, 45 min total.

Not great, not terrible.

throwaway892238|3 years ago

Yep. Of course there's no detail yet, so we don't know what exactly was affected. All we can see is "Multiple services are being impacted globally" and a list of services (Build, Firestore, Container Registry, BigQuery, Bigtable, Networking, Pub/Sub, Storage, Compute Engine, Identity and Access Management), but there's no indication of what specifically was impacted. Could you still see the status of your VMs, but not launch new ones? Was it mostly affecting only a couple of regions? No idea. All we know is that they're now below four nines in February for a handful of critical services.
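Back-of-the-envelope math on that four-nines claim, assuming the ~45-minute window reported elsewhere in the thread:

```python
# 99.99% monthly uptime allows only ~4 minutes of downtime in February;
# a 45-minute outage blows through that budget (and the 99.9% one too).
minutes_in_feb = 28 * 24 * 60                # 40,320 minutes
budget_four_nines = minutes_in_feb * 0.0001  # ~4.03 minutes allowed
outage_minutes = 45
availability = (minutes_in_feb - outage_minutes) / minutes_in_feb
print(f"four-nines budget: {budget_four_nines:.2f} min")
print(f"availability: {availability:.5f}")   # below 0.99900, i.e. under three nines
```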

Let's take a gander at incident history: https://status.cloud.google.com/summary

Cloud Build looks bad... three multi-hour incidents this year, four in fall/winter last year.

Cloud Developer Tools have had four multi-hour incidents this year, many last fall/winter.

Cloud Firestore looks abysmal... Six multi-hour incidents this year, one of them 23 hours.

Cloud App Engine had three multi-hour incidents this year, many in fall/winter last year.

BigQuery had three multi-hour incidents this year, many in fall/winter last year.

Cloud Console had five multi-hour incidents this year, many in fall/winter last year. (And from my personal experience, their console blows pretty much all the time)

Cloud Networking has had nine incidents this year, one of them was eight days long. What the fuck.

Compute Engine has had five multi-hour incidents this year, many last fall/winter.

GKE had 3 incidents this year, multiple the past winter.

Can somebody do a comparison to AWS? This seems shitty but maybe it's par for the course?

0x0000000|3 years ago

Outages at the hyperscalers can have a huge blast radius, is anyone encountering other services with outages because they're built on GCP?

hellcow|3 years ago

We are in us-central1 and didn't have an outage, so it appears not to have affected everyone.

lokl|3 years ago

mail.google.com showed error messages for me intermittently during the past hour.

oars|3 years ago

Is it likely this outage still would've occurred even without their 12,000 layoffs in January?

Aldipower|3 years ago

Certainly a problem with a BGP misconfiguration. :)

zamnos|3 years ago

Nope. A BGP misconfiguration would manifest in more broad/different ways.

m00dy|3 years ago

Our workloads are fully functional, DK/EU

lee101|3 years ago

[deleted]

pictur|3 years ago

[deleted]

andsoitis|3 years ago

Solving this sort of thing is not about throwing more people at it. That would be brute force and not strategic. Instead, you want to architect systems like these in a way that strikes a good balance between resilience and things like cost/efficiency/etc.