We are a small software company (2 people) and we've also had plenty of issues with Google over the years, mostly related to Google AdWords. For example:
If Google has no interest in providing decent support to the author of the original article, who is paying megabucks to Google, what hope do small businesses like mine have?
Generally, I think over the last few years, GCP has lost its way.
There was a time several years ago when they were a meaningfully better option on price/performance for compute, storage, and bandwidth compared to AWS. At the time, we did detailed performance testing and cost modeling to prove this for our workload (hundreds of Compute Engine instances, etc.).
Support back then was also excellent. One of our early tickets was an obscure networking issue. The request was quickly escalated, then passed between engineers in different regions around the world until it was resolved; the cause turned out to be a change on the GCP end, which was reverted. We quickly got to real engineers who competently worked the problem with us to resolution. We were very impressed.
The sales team interactions were also better back then. We had a great sales rep who would quickly connect us with any internal resources we needed. The sales rep was a net positive and made our experience with GCP better.
Since then, AWS has certainly caught up and is every bit as good from a cost / performance standpoint. They remain years ahead on many managed services.
The GCP support experience has degraded significantly at this point. Most cases seem to go to outsourced providers who don’t seem able to see any data about the actual underlying GCP infrastructure. We too have detected networking issues that GCP does not acknowledge. The support folks we are dealing with don’t seem to have any greater visibility than we do. It’s pathetic and deeply frustrating. I’m sure it’s just as frustrating for them.
The sales experience is also significantly worse. Our current rep is a significant net negative.
We’ve made significant investments in GCP and we hate seeing this happen. While we would love to see things improve, we don’t see any signs of that actually happening. We are actively working to reduce our GCP spend.
A few years ago, I was a vocal GCP advocate. At this point, I’d have a hard time suggesting anyone build anything new on GCP.
For my day job, over the last 2 years we have discovered and reported multiple issues with Keyspaces, Amazon Aurora, and App Runner. In all cases these issues resulted in performance degradation, and in AWS support wasting our time by sending us chasing our tails. After many weeks of escalation, we eventually reached project leads who confirmed the issues (some of which they were already aware of, yet the support teams had wasted our time anyway!), and some of them have since been resolved.
We are stuck with Keyspaces for the time being, but now refuse to use any non-core services (i.e., anything beyond EC2, EBS, and S3). As soon as you venture away from those, there be dragons.
Oh, for goddamn sure. Probably half the services on AWS are very poorly designed or very poorly run (or both). CloudWatch stands out to me as one that is mind-bogglingly buggy and slow, to the point of basically being a "newbie trap": when I see companies using it for all their logging, I assume it's due to inexperience with the many alternatives.
It's hilarious that people are bashing GCP for having one compute instance go down when the author acknowledges it's a rare event. On AWS I've got instances getting force-stopped or even disappearing outright all the time. The difference between 99.95% and 99.999% durability is huge.
If they had the same architecture on AWS it would go down all the time IME. AWS primitives are way less reliable than GCP, according to AWS' docs and my own experiences.
The article doesn't seem to mention AWS, really. I also feel like the primary issue is the lack of communication and support, even for a large corporate partner.
Seems like they're moving to bare-metal, which has an obvious benefit of being able to tell your on-call engineer to fix the issue or die trying.
EC2 [0] and GCP Compute [1] have the exact same SLA: 99.99%, dipping below which gets you a 10% refund. Dipping below 95% gets you a 100% refund.
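To put rough numbers on those SLA tiers, here is a back-of-the-envelope sketch (my own arithmetic, not from either provider's docs; the thresholds are the ones quoted above):

```python
def downtime_budget_minutes(sla_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a period before dipping below an SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_pct / 100)

# 99.99% over a 30-day month: ~4.3 minutes before the 10% credit kicks in
print(round(downtime_budget_minutes(99.99), 1))      # 4.3
# 95%, the 100%-refund threshold: 36 hours of downtime in a month
print(round(downtime_budget_minutes(95.0) / 60, 1))  # 36.0
```

In other words, a multi-hour outage on one box blows far past the 99.99% budget for the affected workload, though how each SLA actually counts a single-instance failure is a separate question.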
This is very different from my experience. In my years with AWS I’ve only had an instance get stopped once for a reason that was weird AWS background stuff that had nothing to do with my application. I don’t think I’ve ever had or even heard of an instance just disappearing.
I have a lot of interaction with Google Cloud Support, mostly around their managed services. I am genuinely not impressed with their service, considering that at similarly sized employers on AWS the support experience was always wonderful.
However, I will say if you are on Google Cloud and you have a positive interaction, make a big deal about someone helping you. Given how rarely it occurs, it’s not a big deal to really go out of your way to reward someone with some emphatic positive feedback. I’ve had four genuinely fantastic experiences, and there’s always a message to a TAM that follows soon after. I hope more people like those I interacted with get rewarded and promoted.
> However, I will say if you are on Google Cloud and you have a positive interaction, make a big deal about someone helping you.
This. These sorts of discussions are like bike shedding over vi/emacs.
Only the complaints make it to the front page on HN. I've been using GCP off and on for projects for a decade now. Built multiple very successful businesses on it. Sure it hasn't been all perfect, but I'm an overall happy camper.
Having also used AWS heavily when I was on the team building the original hosted version of Cloud Foundry, I'd never go back to them again. It was endless drama.
I've had an experience with GCP where a very enterprise-y feature broke in a way that clearly showed it had never worked properly up until that point (besides causing downtime when they tried to quietly fix it). On the call where GCP reps were supposed to explain what happened, they instead proceeded to remind everyone that they were under NDA, because admitting to the above would have been a nightmare for regulated industries.
"On December 1st, at 8:52am PST, a box dropped offline; inaccessible. And then, instead of automatically coming back after failover — it didn’t. Our primary on-call engineer was alerted for this and dug in. While digging in, another box fell offline and didn’t come back"
This makes no sense. A machine restarted and you had a catastrophic failure? VMs reboot from time to time. But if you design your setup to completely destroy itself in this scenario, I don't think you will like a move to AWS, or, god forbid, your own colo.
Read the article more carefully. The article (the text you quoted, even) clearly states that the machine didn't "restart". It crashed and didn't come back online.
And nowhere in the article do they state that this was a "catastrophic failure" - Railway itself didn't go down entirely. But Railway is a deployment company, so they are re-selling these compute resources to their customers to deploy applications. So when one of those VMs goes down and doesn't automatically failover, that's downtime for the specific customer who was running their service on that machine.
As they state:
> During manual failover of these machines, there was a 10 minute per host downtime. However, as many people are running multi-service workloads, this downtime can be multiplied many times as boxes subsequently went offline.
> For all of our users, we’re deeply sorry.
Interesting, I’m starting to think undocumented thresholds are quite common in GCP.
I experienced something similar with Cloud Run: inexplicable scaling events that couldn't be explained by CPU utilization or concurrent requests (the two metrics that regulate scaling according to their docs).
After a lot of back and forth with their (premium) support, it turned out there are additional criteria, something related to request duration, but of course nobody was able to explain it in detail.
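For what it's worth, the documented part of that scaling behavior is roughly "enough instances to keep per-instance concurrency under the limit, adjusted by CPU". Here's a toy sketch of that model; the function and its formula are my own illustrative assumptions, not Cloud Run's real algorithm, and the undocumented criteria described above wouldn't show up in it:

```python
import math

def estimated_instances(concurrent_requests: int,
                        max_concurrency: int,
                        cpu_utilization_pct: float,
                        target_cpu_pct: float = 60.0) -> int:
    """Naive model: enough instances to keep concurrency under
    `max_concurrency`, scaled further if CPU exceeds the target."""
    by_concurrency = math.ceil(concurrent_requests / max_concurrency)
    by_cpu = math.ceil(by_concurrency * cpu_utilization_pct / target_cpu_pct)
    return max(by_concurrency, by_cpu)

# 250 concurrent requests with a concurrency limit of 80, CPU well under target:
print(estimated_instances(250, 80, cpu_utilization_pct=50))  # 4
```

If a model like this can't reproduce the scaling you observe, something undocumented (e.g. the request-duration criterion support hinted at) is presumably in play.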
Yes, we have also experienced undocumented limits in Cloud Run. For us it was an obscure quota on max network packets per second per instance. Really infuriating; it took 6 months to track down what it was. I think it has been documented here now: https://cloud.google.com/run/quotas#cloud_run_bandwidth_limi...
Bit confused about why nested virt has anything to do with their problems given that they aren’t using virt inside the VMs. Softlocks are a generic indication of a lack of forward progress.
Same confusion with the MMIO instructions comment. If that’s about instruction emulation, not sure why it matters where it happens? It’s both slow and bound for userspace anyway. If it’s supposed to be fast it should basically never be exiting the guest, let alone be emulated.
Sounds like the author is a bit frustrated and (understandably) grasping at whatever straws they can for that most recent incident.
> In 2022, we experienced continual networking blips from Google’s cloud products. After escalating to Google on multiple occasions, we got frustrated. So we built our own networking stack — a resilient eBPF/IPv6 Wireguard network that now powers all our deployments. Suddenly, no more networking issues.
My understanding is that the network for VMs is a VLAN programmed via switches, so when you create a VPC, you're probably creating a VLAN.
So how can an overlay (UDP/WireGuard) be more reliable if the underlying network isn't stable?
PS: Had even a tenth of these issues happened on AWS with such a customer, their army of solution architects would be camping in conference rooms every other week, reviewing architecture, getting support engineers on call, and what not.
My guess is that whatever clever network optimizations Google has are probably interfering with their traffic.
By building their own network stack, they are bypassing those optimizations, and WireGuard might also be better equipped to deal with occasional faults, as it is built on UDP, which is inherently unreliable, so the protocol is designed to tolerate loss.
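A toy illustration of the "UDP tolerates blips" point: a tunnel that judges peer liveness over a window of keepalive probes has no connection to reset, so a single lost packet is a non-event. This is a sketch of the general idea only, not WireGuard's actual handshake or timer logic:

```python
from collections import deque

class PeerHealth:
    """Judge a peer over a sliding window of keepalive probes, so one lost
    UDP packet (expected; UDP is lossy) doesn't mark the peer dead or tear
    down any connection state."""

    def __init__(self, window: int = 5, required: int = 2):
        self.results = deque(maxlen=window)  # True = probe was answered
        self.required = required             # answered probes needed in window

    def record(self, answered: bool) -> None:
        self.results.append(answered)

    def alive(self) -> bool:
        return sum(self.results) >= self.required

peer = PeerHealth()
for answered in [True, False, True, False, True]:  # 40% probe loss
    peer.record(answered)
print(peer.alive())  # True: still considered up despite the blips
```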
> In our experience, Google isn’t the place for reliable cloud compute
In the early days of cloud computing unreliability was understandable, but for Google to be frustrating its large customers in 2023 is a pretty bad look.
Curious to know if others have had similar experiences, or if the author was simply unlucky?
I don't know how it happened, but I used GKE for a side project. It was overkill for such a small project, and I could live with $100/month, but the bill kept creeping up to $300 and later $400 with no apparent explanation or workload increase. I had no choice but to move to something else, and ended up on good old Heroku at $20/month. Never regretted it.
You should've migrated many months ago; if a cloud provider forces you to build your own networking or registry, you shouldn't use that cloud provider.
That was the first thing that struck me; the 'workarounds' beggar belief, but they seem to be casually dropped in (?).
If I were in a situation where my company was contemplating building our own registry/network stack, then the benefits of using a cloud provider are gone, and I would have considered moving to another provider... not saying "I can fix him". This feels like the sunk cost fallacy, if that's the right term.
Well, for folks building out cloud infrastructure, building your own networking stack and registry is a good way to achieve platform independence; without it you'll be left at a disadvantage, vulnerable to the whims of cloud providers who may or may not extend volume discounts, indirectly harming your ability to compete.
It sounds like if you deploy on Railway they don't automatically handle a box dying (e.g. with K8s or similar) -- "half the company was called in to go through runbooks." When they move to their own hardware, how will they handle that?
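For contrast, the automation being asked about here is conceptually small; something shaped like the following loop. All names are hypothetical, and real platforms would use Kubernetes node controllers or equivalents rather than a hand-rolled sweep:

```python
def sweep(hosts, check, failover, failures, max_failures=3):
    """One pass of a minimal health-check loop: `check` probes a host, and
    after `max_failures` consecutive misses, `failover(host)` is expected
    to reschedule that host's workloads elsewhere."""
    for host in hosts:
        if check(host):
            failures[host] = 0  # healthy: reset the miss counter
        else:
            failures[host] = failures.get(host, 0) + 1
            if failures[host] == max_failures:
                failover(host)

# Simulated run: host "b" misses three sweeps in a row and gets failed over.
state, moved = {}, []
for _ in range(3):
    sweep(["a", "b"], check=lambda h: h == "a",
          failover=moved.append, failures=state)
print(moved)  # ['b']
```

The hard part in practice is of course a reliable `failover`, not the detection loop itself.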
I wonder how many of these stories it would take before they start affecting Google's bottom line. I've tinkered with GCP on small side projects, sure, but after seeing stories like these on HN for over a decade, I can never recommend GCP as a serious cloud alternative. I can't imagine I'm the only one in this boat.
Many data centers provide colo/hardware renting facilities, such as Equinix, Coresite, Digital Realty etc. (Even AWS got started off those, though they mostly build their own data centers now.)
Any business can rent space in a colo pretty easily. The constraint is mostly hiring engineers with experience racking and stacking boxes, who are willing to drive to the colo when on call.
Maybe it is me, but this doesn't exactly reflect well on anyone. Isn't the value prop of Railway not having to worry about things like this? It doesn't matter what the problem is: you shouldn't be passing such problems on to customers at all.
I have worked on a product that caused such a spike on Google App Engine that within 20 minutes of it going public, Google were on the phone explaining that their pagers had all gone off; in that case they resolved to temporarily bump the quota up for 48 hours while a mutual workaround was implemented. The state of Google Cloud today seems like just another classic case of the trend of blaming the customer.
hermitcrab | 2 years ago (the Google Ads examples mentioned above):
https://successfulsoftware.net/2015/03/04/google-bans-hyperl...
https://successfulsoftware.net/2016/12/05/google-cpa-bidding...
https://successfulsoftware.net/2020/08/21/google-ads-can-cha...
https://successfulsoftware.net/2021/05/04/wtf-google-ads/
wavemode | 2 years ago:
At least the compute services are reliable.
deanCommie | 2 years ago (the SLA links referenced above):
[0] https://aws.amazon.com/compute/sla/
[1] https://cloud.google.com/compute/sla
HenryBemis | 2 years ago:
I would like to assume, 'no, you can always report a crime'.
politelemon | 2 years ago:
https://hacks.mozilla.org/2022/02/retrospective-and-technica...
readams | 2 years ago:
This is out of date but gives you the idea: https://www.usenix.org/conference/nsdi18/presentation/dalton
kgeist | 2 years ago:
Isn't Discord hosted on GCP, too? If it goes down, does monitoring also go down?
rurban | 2 years ago:
Always was, always will be. For them, customers always come last.
londons_explore | 2 years ago:
So they probably never built in health checks and automatic failover.
lawgimenez | 2 years ago:
[0] https://issuetracker.google.com/issues/230950647
b112 | 2 years ago:
There are far more colos, people who will rent you a rack, and bandwidth options than VPS providers. And you can rent servers too, instead of buying your own.
Colo is literally 10000x cheaper than many AWS deployments. I've seen million-dollar bills drop to tens of thousands per year.
And of course, you can always deploy in-house, in your own server room.
ur-whale | 2 years ago:
Basement of their office? We reached the same conclusion they did a while back and went back to good old self-hosted.
Reliability has been as good as the cloud, and TCO is divided by a factor of 10.