We are a small software company (2 people) and we've also had plenty of issues with Google over the years, mostly related to Google AdWords. For example:
If Google has no interest in providing decent support to the author of the original article, who is paying megabucks to Google, what hope do small businesses like mine have?
Generally, I think over the last few years, GCP has lost its way.
There was a time several years ago when they were a meaningfully better option on price/performance for compute, storage, and bandwidth compared to AWS. At the time, we did detailed performance testing and cost modeling to prove this for our workload (hundreds of Compute Engine instances, etc.).
Support back then was also excellent. One of our early tickets was an obscure networking issue. The request was quickly escalated, then passed between engineers in different regions around the world until it was resolved; the cause turned out to be a change on the GCP end, which was reverted. We quickly got to real engineers who competently worked the problem with us to resolution. We were very impressed.
The sales team interactions were also better back then. We had a great sales rep who would quickly connect us with any internal resources we needed. The sales rep was a net positive and made our experience with GCP better.
Since then, AWS has certainly caught up and is every bit as good from a cost / performance standpoint. They remain years ahead on many managed services.
The GCP support experience has degraded significantly at this point. Most cases seem to go to outsourced providers who don’t seem able to see any data about the actual underlying GCP infrastructure. We too have detected networking issues that GCP does not acknowledge. The support folks we are dealing with don’t seem to have any greater visibility than we do. It’s pathetic and deeply frustrating. I’m sure it’s just as frustrating for them.
The sales experience is also significantly worse. Our current rep is a significant net negative.
We’ve made significant investments in GCP and we hate seeing this happen. While we would love to see things improve, we don’t see any signs of that actually happening. We are actively working to reduce our GCP spend.
A few years ago, I was a vocal GCP advocate. At this point, I’d have a hard time suggesting anyone build anything new on GCP.
For my day job, over the last 2 years we have discovered and reported multiple issues with Keyspaces, Amazon Aurora, and App Runner. In all cases these issues resulted in performance degradation, and in AWS support wasting our time by sending us chasing our tails. After many weeks of escalation, we eventually reached project leads who confirmed the issues (some of which they were already aware of, yet the support teams had wasted our time anyway!), and some of them have since been resolved.
We are stuck with Keyspaces for the time being, but now refuse to use any non-core services (i.e., anything beyond EC2, EBS, and S3). As soon as you venture away from those, there be dragons.
Oh, for goddamn sure. Probably half the services on AWS are very poorly designed or very poorly run (or both). CloudWatch stands out to me as one that is mind-bogglingly buggy and slow, to the point of basically being a "newbie trap": when I see companies using it for all their logging, I assume it's due to inexperience with the many alternatives.
It's hilarious that people are bashing GCP for having one compute instance go down when the author acknowledges it's a rare event. On AWS I've got instances getting force-stopped or even disappearing outright all the time. The difference between 99.95% and 99.999% durability is huge.
If they had the same architecture on AWS it would go down all the time IME. AWS primitives are way less reliable than GCP, according to AWS' docs and my own experiences.
The article doesn't seem to mention AWS, really. I also feel like the primary issue is the lack of communication and support, even for a large corporate partner.
Seems like they're moving to bare-metal, which has an obvious benefit of being able to tell your on-call engineer to fix the issue or die trying.
EC2 [0] and GCP Compute [1] have the exact same SLA: 99.99%, dipping below which gets you a 10% refund. Dipping below 95% gets you a 100% refund.
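To put rough numbers on those SLA tiers, here is a back-of-the-envelope sketch (my own arithmetic, not from either provider's docs; the thresholds are the ones quoted above):

```python
def downtime_budget_minutes(sla_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a period before dipping below an SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_pct / 100)

# 99.99% over a 30-day month: ~4.3 minutes before the 10% credit kicks in
print(round(downtime_budget_minutes(99.99), 1))      # 4.3
# 95%, the 100%-refund threshold: 36 hours of downtime in a month
print(round(downtime_budget_minutes(95.0) / 60, 1))  # 36.0
```

In other words, a multi-hour outage on one box blows far past the 99.99% budget for the affected workload, though how each SLA actually counts a single-instance failure is a separate question.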
This is very different from my experience. In my years with AWS I’ve only had an instance get stopped once for a reason that was weird AWS background stuff that had nothing to do with my application. I don’t think I’ve ever had or even heard of an instance just disappearing.
I have a lot of interaction with Google Cloud Support, mostly around their managed services. I am genuinely not impressed with their service, considering that at similarly sized employers on AWS the support experience was always wonderful.
However, I will say if you are on Google Cloud and you have a positive interaction, make a big deal about someone helping you. Given how rarely it occurs, it’s not a big deal to really go out of your way to reward someone with some emphatic positive feedback. I’ve had four genuinely fantastic experiences, and there’s always a message to a TAM that follows soon after. I hope more people like those I interacted with get rewarded and promoted.
> However, I will say if you are on Google Cloud and you have a positive interaction, make a big deal about someone helping you.
This. These sorts of discussions are like bike shedding over vi/emacs.
Only the complaints make it to the front page on HN. I've been using GCP off and on for projects for a decade now. Built multiple very successful businesses on it. Sure it hasn't been all perfect, but I'm an overall happy camper.
Having also used AWS heavily when I was on the team building the original hosted version of Cloud Foundry, I'd never go back to them again. It was endless drama.
I've had an experience with GCP where a very enterprise-y feature broke in a way that clearly showed it had never worked properly up until that point (besides causing downtime when they tried to quietly fix it). On the call where GCP reps were supposed to explain what happened, they instead proceeded to remind everyone that they were under NDA, because admitting to the above would have been a nightmare for regulated industries.
"On December 1st, at 8:52am PST, a box dropped offline; inaccessible. And then, instead of automatically coming back after failover — it didn’t. Our primary on-call engineer was alerted for this and dug in. While digging in, another box fell offline and didn’t come back"
This makes no sense. A machine restarted and you had a catastrophic failure? VMs reboot from time to time. But if you design your setup to completely destroy itself in this scenario, I don't think you will like a move to AWS, or, god forbid, your own colo.
Read the article more carefully. The article (the text you quoted, even) clearly states that the machine didn't "restart". It crashed and didn't come back online.
And nowhere in the article do they state that this was a "catastrophic failure" - Railway itself didn't go down entirely. But Railway is a deployment company, so they are re-selling these compute resources to their customers to deploy applications. So when one of those VMs goes down and doesn't automatically failover, that's downtime for the specific customer who was running their service on that machine.
As they state:
> During manual failover of these machines, there was a 10 minute per host downtime. However, as many people are running multi-service workloads, this downtime can be multiplied many times as boxes subsequently went offline.
> For all of our users, we’re deeply sorry.
Interesting, I’m starting to think undocumented thresholds are quite common in GCP.
I experienced something similar with Cloud Run: inexplicable scaling events that couldn't be explained by CPU utilization or concurrent requests (the two metrics that regulate scaling according to their docs).
After a lot of back and forth with their (premium) support, it turned out there are additional criteria, something related to request duration, but of course nobody was able to explain it in detail.
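For what it's worth, the documented part of that scaling behavior is roughly "enough instances to keep per-instance concurrency under the limit, adjusted by CPU". Here's a toy sketch of that model; the function and its formula are my own illustrative assumptions, not Cloud Run's real algorithm, and the undocumented criteria described above wouldn't show up in it:

```python
import math

def estimated_instances(concurrent_requests: int,
                        max_concurrency: int,
                        cpu_utilization_pct: float,
                        target_cpu_pct: float = 60.0) -> int:
    """Naive model: enough instances to keep concurrency under
    `max_concurrency`, scaled further if CPU exceeds the target."""
    by_concurrency = math.ceil(concurrent_requests / max_concurrency)
    by_cpu = math.ceil(by_concurrency * cpu_utilization_pct / target_cpu_pct)
    return max(by_concurrency, by_cpu)

# 250 concurrent requests with a concurrency limit of 80, CPU well under target:
print(estimated_instances(250, 80, cpu_utilization_pct=50))  # 4
```

If a model like this can't reproduce the scaling you observe, something undocumented (e.g. the request-duration criterion support hinted at) is presumably in play.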
Yes, we have also experienced undocumented limits in Cloud Run. For us it was an obscure quota on max network packets per second per instance. Really infuriating; it took 6 months to track down what it was. I think it has been documented here now: https://cloud.google.com/run/quotas#cloud_run_bandwidth_limi...
Bit confused about why nested virt has anything to do with their problems given that they aren’t using virt inside the VMs. Softlocks are a generic indication of a lack of forward progress.
Same confusion with the MMIO instructions comment. If that’s about instruction emulation, not sure why it matters where it happens? It’s both slow and bound for userspace anyway. If it’s supposed to be fast it should basically never be exiting the guest, let alone be emulated.
Sounds like the author is a bit frustrated and (understandably) grasping at whatever straws they can for that most recent incident.
> In 2022, we experienced continual networking blips from Google’s cloud products. After escalating to Google on multiple occasions, we got frustrated. So we built our own networking stack — a resilient eBPF/IPv6 Wireguard network that now powers all our deployments. Suddenly, no more networking issues.
My understanding is that the network for VMs is a VLAN programmed via switches, so when you create a VPC, you're probably creating a VLAN.
So how can an overlay (UDP/WireGuard) be more reliable if the underlying network isn't stable?
PS: Had even a tenth of these issues happened on AWS with such a customer, their army of solution architects would be camping in conference rooms every other week, reviewing architecture, getting support engineers on call, and what not.
My guess is that whatever clever network optimizations Google has are probably interfering with their traffic.
By building their own network stack, they are bypassing those optimizations, and WireGuard might also be better equipped to deal with occasional faults, as it is built on UDP, which is inherently unreliable, so the protocol is designed to tolerate loss.
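A toy illustration of the "UDP tolerates blips" point: a tunnel that judges peer liveness over a window of keepalive probes has no connection to reset, so a single lost packet is a non-event. This is a sketch of the general idea only, not WireGuard's actual handshake or timer logic:

```python
from collections import deque

class PeerHealth:
    """Judge a peer over a sliding window of keepalive probes, so one lost
    UDP packet (expected; UDP is lossy) doesn't mark the peer dead or tear
    down any connection state."""

    def __init__(self, window: int = 5, required: int = 2):
        self.results = deque(maxlen=window)  # True = probe was answered
        self.required = required             # answered probes needed in window

    def record(self, answered: bool) -> None:
        self.results.append(answered)

    def alive(self) -> bool:
        return sum(self.results) >= self.required

peer = PeerHealth()
for answered in [True, False, True, False, True]:  # 40% probe loss
    peer.record(answered)
print(peer.alive())  # True: still considered up despite the blips
```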
> In our experience, Google isn’t the place for reliable cloud compute
In the early days of cloud computing unreliability was understandable, but for Google to be frustrating its large customers in 2023 is a pretty bad look.
Curious to know if others have had similar experiences, or if the author was simply unlucky?
I don't know how it happened, but I used GKE for a side project. It was overkill for such a small project, and I could live with $100/month, but the bill kept creeping up to $300 and later $400 with no apparent explanation or workload increase. I had no choice but to move to something else, and ended up on good old Heroku at $20/month. Never regretted it.
You should've migrated many months ago; if a cloud provider forces you to build your own networking or registry, you shouldn't use that cloud provider.
That was the first thing that struck me; the 'workarounds' beggar belief, but they seem to be casually dropped in (?).
If I were in a situation where my company was contemplating building our own registry/network stack, then the benefits of using a cloud provider are gone, and I would have considered moving to another provider... not saying "I can fix him". This feels like the sunk cost fallacy, if that's the right term.
Well, for folks building out cloud infrastructure, building your own networking stack and registry is a good way to achieve platform independence; without it you'll be left at a disadvantage, vulnerable to the whims of cloud providers who may or may not extend volume discounts, indirectly harming your ability to compete.
It sounds like if you deploy on Railway they don't automatically handle a box dying (e.g. with K8s or similar) -- "half the company was called in to go through runbooks." When they move to their own hardware, how will they handle that?
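For contrast, the automation being asked about here is conceptually small; something shaped like the following loop. All names are hypothetical, and real platforms would use Kubernetes node controllers or equivalents rather than a hand-rolled sweep:

```python
def sweep(hosts, check, failover, failures, max_failures=3):
    """One pass of a minimal health-check loop: `check` probes a host, and
    after `max_failures` consecutive misses, `failover(host)` is expected
    to reschedule that host's workloads elsewhere."""
    for host in hosts:
        if check(host):
            failures[host] = 0  # healthy: reset the miss counter
        else:
            failures[host] = failures.get(host, 0) + 1
            if failures[host] == max_failures:
                failover(host)

# Simulated run: host "b" misses three sweeps in a row and gets failed over.
state, moved = {}, []
for _ in range(3):
    sweep(["a", "b"], check=lambda h: h == "a",
          failover=moved.append, failures=state)
print(moved)  # ['b']
```

The hard part in practice is of course a reliable `failover`, not the detection loop itself.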
I wonder how many of these stories it would take before they start affecting Google's bottom line. I've tinkered with GCP on small side projects, sure, but after seeing stories like these on HN for over a decade, I can never recommend GCP as a serious cloud alternative. I can't imagine I'm the only one in this boat.
Many data centers provide colo/hardware renting facilities, such as Equinix, Coresite, Digital Realty etc. (Even AWS got started off those, though they mostly build their own data centers now.)
Any business can rent space in a colo pretty easily. The constraint is mostly hiring engineers with experience racking and stacking boxes, who are willing to drive to the colo when on call.
Maybe it is me, but this doesn't exactly reflect well on anyone. Isn't the value prop of Railway not having to worry about things like this? It doesn't matter what the problem is: you shouldn't be passing such problems on to customers at all.
I have worked on a product that caused such a spike on Google App Engine that within 20 minutes of it going public, Google were on the phone explaining that their pagers had all gone off; in that case they resolved to temporarily bump the quota up for 48 hours while a mutual workaround was implemented. The state of Google Cloud today seems like just another classic case of the trend of blaming the customer.
hermitcrab | 2 years ago (the Google Ads examples mentioned above):
https://successfulsoftware.net/2015/03/04/google-bans-hyperl...
https://successfulsoftware.net/2016/12/05/google-cpa-bidding...
https://successfulsoftware.net/2020/08/21/google-ads-can-cha...
https://successfulsoftware.net/2021/05/04/wtf-google-ads/
wavemode | 2 years ago:
At least the compute services are reliable.
deanCommie | 2 years ago (the SLA links referenced above):
[0] https://aws.amazon.com/compute/sla/
[1] https://cloud.google.com/compute/sla
HenryBemis | 2 years ago:
I would like to assume, 'no, you can always report a crime'.
politelemon | 2 years ago:
https://hacks.mozilla.org/2022/02/retrospective-and-technica...
readams | 2 years ago:
This is out of date but gives you the idea: https://www.usenix.org/conference/nsdi18/presentation/dalton
kgeist | 2 years ago:
Isn't Discord hosted on GCP, too? If it goes down, does monitoring also go down?
rurban | 2 years ago:
Always was, always will be. For them, customers always come last.
londons_explore | 2 years ago:
So they probably never built in health checks and automatic failover.
lawgimenez | 2 years ago:
[0] https://issuetracker.google.com/issues/230950647
b112 | 2 years ago:
There are far more colos, people who will rent you a rack, and bandwidth options than VPS providers. And you can rent servers too, instead of buying your own.
Colo is literally 10000x cheaper than many AWS deployments. I've seen million-dollar bills drop to tens of thousands per year.
And of course, you can always deploy in-house, in your own server room.
ur-whale | 2 years ago:
Basement of their office? We reached the same conclusion they did a while back and went back to good old self-hosted.
Reliability has been as good as the cloud, and TCO is divided by a factor of 10.