This affected us starting at 4:57am US/Pacific: a significant drop in traffic through the HTTPS Global Load Balancer across all regions, plus Pub/Sub 502 errors, but there was nothing on the status page for another 45 minutes. Things returned to normal by about 5:05am from what I can tell.
You can't really have 30+ fully independent regions running their own stack with different versions of apps and separate secrets, IP/routing and certificates in each. At some point you have to unify or it becomes either unmanageable or inconsistent.
The underlying problem is that Google doesn't operate the world's DNS servers, but still wants to offer the best possible user experience as a global service. This pushes them toward anycast VIP routing: not all DNS resolvers implement EDNS (Client Subnet), so DNS alone can't reliably steer users to the nearest region, yet they want SSL/TLS connections to terminate as close to users as possible.
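To make that tradeoff concrete, here's a toy Python sketch of DNS-based geo-routing; the prefix-to-region table and region names are invented for the example and this is nothing like Google's real serving path:

    # Toy illustration: why DNS-based routing degrades without EDNS Client Subnet.
    GEO_TABLE = {
        "203.0.113.": "asia-southeast1",
        "198.51.100.": "europe-west1",
        "192.0.2.": "us-central1",
    }

    def pick_region(resolver_ip, client_subnet=None):
        # With EDNS Client Subnet (ECS) we can route on where the user is.
        # Without it, the only signal is the recursive resolver's address,
        # which may sit a continent away from the user (public resolvers).
        key = client_subnet or resolver_ip
        for prefix, region in GEO_TABLE.items():
            if key.startswith(prefix):
                return region
        return "us-central1"  # arbitrary fallback

    print(pick_region("8.8.8.8"))                 # resolver-only: weak signal
    print(pick_region("8.8.8.8", "203.0.113.7"))  # with ECS: user-local region

Anycast sidesteps the weak-signal problem by announcing one VIP from every edge location and letting BGP pick the closest one - but that single VIP, and the configuration behind it, is then a global object.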
As far as global services go though, it's easy enough to say "it should just not be possible", but how do you propose doing that in practice for a global service?
How is new config going to go out, globally, without being global? How do global services work if they're not global?
How does DDoS protection work if you don't do it globally?
People make fun of "webscale" but operating Google is really difficult and complicated!
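For the config question specifically, the usual mitigation is to make "global" changes roll out in waves rather than everywhere at once. A generic sketch (not a claim about how GCP's release machinery actually works; the wave names and helper functions are placeholders):

    import time

    # Placeholder wave definitions: smallest blast radius first.
    WAVES = [
        ["canary-cell"],
        ["us-central1", "europe-west1"],
        ["asia-east1", "us-east1", "southamerica-east1"],
    ]

    def push_config(target, config):
        print(f"pushing {config!r} to {target}")   # stand-in for a real push

    def healthy(target):
        return True                                # stand-in for SLO/error checks

    def rollout(config, bake_seconds=600):
        for wave in WAVES:
            for target in wave:
                push_config(target, config)
            time.sleep(bake_seconds)               # let metrics bake before widening
            if not all(healthy(t) for t in wave):
                raise RuntimeError(f"unhealthy after wave {wave}, roll back")

It doesn't make the service any less global, but it bounds how much of it a single bad change can take down at once.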
> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.
As I understand it, GCP is already designed to make global outages impossible. Obviously this outage shows that they messed up somehow and some global point of failure still remains. Looking forward to the post-mortem.
My knowledge level: can use AWS console to do < 5% of what is possible.
How much more work would Google create for themselves if they had not globalized their stack? Are we talking something like 5 subsets to manage instead of 1?
gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.
I reckon the only way to achieve that would be to have the same level of interoperability between regions as you would get between two distinct cloud providers.
> This demonstrates yet again why global configurations, global services, and global anycast VIP routing should be considered an anti pattern.
And why enterprises clamoring for AWS to feature-match Google's global stuff (theoretically making I.T. easier) instead of remaining regionally isolated (actually making I.T. more resilient, with no extra work if I.T. operators can figure out infra-as-code patterns) should STFU and learn themselves some Terraform, Pulumi, etc.
Also, AWS, if you're in this thread, stop with the recent cross-region coupling features already. Google's doing it wrong, explain that, and be patient, the market share will come back to you when they run out of the GCP subsidy dollars.
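For what the "regionally isolated plus infra-as-code" pattern can look like in practice, a minimal Pulumi (Python) sketch; the region list and the bucket resource are just placeholders:

    import pulumi
    import pulumi_aws as aws

    REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]  # example regions

    for region in REGIONS:
        # One explicit provider per region: nothing in this stack spans regions.
        provider = aws.Provider(f"aws-{region}", region=region)
        opts = pulumi.ResourceOptions(provider=provider)

        # The same stack, stamped out independently in each region.
        bucket = aws.s3.Bucket(f"assets-{region}", opts=opts)
        pulumi.export(f"bucket_{region}", bucket.bucket)

Losing a region then means losing one copy of the stack, not the control plane for all of them.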
> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.
If that's what you really need, then distribute your assets across GCP, AWS, and DO. That likely means not using any cloud-specific features such as Lambda. AWS is actually really good in this regard, as SES and RDS are easily replaced with regular instances in other cloud providers, even if those instances end up wrapping some cloud-specific feature themselves.
There are three things that scare Google engineers enough to keep them up at night: a global network outage, a global power outage, and a global Chubby outage. Actually, they only really worry about that last one.
Downloading the key has been erroring since at least ~5pm PT yesterday, 2/27. It’s likely unrelated. Though I’d be unsurprised if the recent layoffs contributed to the situation.
As has happened many times throughout history (back to mainframes and thin clients of the 90s) there are swings/trends in how infrastructure is hosted.
Listening to the "All In Podcast" yesterday, even those guys were talking about revenue drops in the big cloud services and noting that we're currently in the midst of a swing back toward self-hosting/co-location thinking and migrations out of the cloud.
IMHO those building greenfield solutions today should take a hard look at whether the default approach from the last ~10 years of "of course you build in $BIGCLOUD" makes sense for the application - in many cases it does not.
It also has the added benefit of de-centralizing the internet a bit (even if only a little).
As others have mentioned, there was no revenue drop, there's been a reduction in growth. AWS's 20% growth rate is still very respectable, more than double the 9% growth rate the company had overall.
I would be hesitant to attribute slowed growth to a return to self hosting, it's much more likely that it's caused by companies dialing back their cloud growth after spending a few years going ham digitizing everything during the pandemic.
The parent comment is neither factual nor advisable.
You build greenfield in cloud precisely because it is greenfield and the utilization isn't well understood. Cloud options let you adjust and experiment quickly. Once a workload is well understood it's a good candidate for optimization, including a move to self managed hardware / on prem.
Buying hardware is a great option once you actually understand the utilization of your product. Just make sure you also have competent operators.
> IMHO those building greenfield solutions today should take a hard look at whether the default approach from the last ~10 years of "of course you build in $BIGCLOUD" makes sense for the application - in many cases it does not.
When one buys a house, they should take a hard look at whether the default approach of paying for utilities makes sense, versus generating their own power.
While that's a bit snarky, the reasoning is similar. You can:
* Use "bigcloud"(TM) with the whole kit: VMs, their managed services, etc
* Use bigcloud, but just VM or storage
* Rent VMs from a smaller provider
* Rent actual servers
* Buy your servers and ship to a colo
* Buy your servers and build a datacenter
Every level you drop, you need more work. And it grows (I suspect not linearly). Sure, if you have all the required experts (or you rent them) you can do everything yourself. If not, you'll have to defer to vendors. You will pay some premium for this, but it's either that, or payroll.
What also needs to be factored in is how static your system is. If a single machine works for your use-case, great.
One of the systems I manage has hundreds of millions of dollars in contracts on the line and thousands of VMs. I do not care if any single VM goes down; the system will kill it and provision a new one. A big cloud provider availability zone often spans multiple datacenters too, each with its own redundancies. Even if an entire AZ goes down, we can survive on the other two (possibly with some temporary degradation for a few minutes). If the whole region goes down, we fail over to another. We certainly don't have the time to discuss individual servers or rack and stack anything.
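At the application level, the region fallback described above can be as simple as trying a second base URL when the first one stops answering; a toy sketch with hypothetical endpoints (real setups usually use health-checked DNS or a global load balancer instead):

    import urllib.request

    REGION_ENDPOINTS = [
        "https://api.us-east-1.example.com",   # primary region (hypothetical)
        "https://api.us-west-2.example.com",   # warm standby region (hypothetical)
    ]

    def get_with_failover(path, timeout=2):
        last_err = None
        for base in REGION_ENDPOINTS:
            try:
                with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                    return resp.read()
            except OSError as err:     # timeouts, refused connections, DNS failures
                last_err = err         # this region looks down, try the next one
        raise RuntimeError("all regions unavailable") from last_err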
It does not come cheap. AWS specifically has egregious networking fees, and you end up paying multiple times over (AZ-to-AZ traffic, NAT gateways, and a myriad of services that also charge by the GB, like GuardDuty). It adds up if you are not careful.
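Rough arithmetic on how those per-GB charges stack up, using commonly cited us-east-1 list prices (they vary by region and change over time, so treat the numbers as illustrative):

    # Back-of-envelope only; substitute your own traffic volume and current prices.
    monthly_gb = 50 * 1024                    # say 50 TB/month of east-west traffic

    cross_az = monthly_gb * (0.01 + 0.01)     # ~$0.01/GB billed on each side
    nat_gw = monthly_gb * 0.045 + 3 * 730 * 0.045   # per-GB processing + 3 gateways hourly

    print(f"cross-AZ ~${cross_az:,.0f}/mo, NAT gateways ~${nat_gw:,.0f}/mo")
    # roughly $1,000/mo + $2,400/mo before anything leaves the VPC, and
    # per-GB services like GuardDuty add their own line items on top.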
From time to time, management comes up with the idea of migrating to 'on-prem', because that's reportedly cheaper. Sure, if you ignore the hundreds of engineers that will be involved in the migration, and also ignore all the engineers that will be required to maintain everything on-premises, it might be cheaper.
But that's also ignoring the main reason why cloud deployments tend to become so expensive: they are easy. Confronted with the option of spinning up more machines versus possibly missing a deadline, middle managers will ask for more resources. Maybe it's "just" 1k a month extra (those developers would cost more!). It gets approved. 50 other groups are doing the same. Now it's 50k. Rinse, repeat. If more emphasis were placed on optimization, most cloud deployments could be shrunk spectacularly. The microservices fad doesn't help (your architecture might require it, but often the reason it does is that you want to ship your org chart, not anything technical).
This is why any criticism of AWS reliability is meaningless to me. All the cloud providers go down - all of them. Either you are multi-cloud, or you run your own hardware, but these events are inevitable.
The amount of time you are down vs. up dictates your SLOs and SLAs. Criticism of how reliable one provider is versus another is not only valid, it's backed by hundreds of millions of contractual dollars and credits every year. We spend tens of millions on AWS per year. We have several SLAs with them. Our ElastiCache SLA was breached once (localized to us - not the whole customer base) and we got credits commensurate with the amount of business we lost during that downtime period.
If one provider is down more than the others, the criticism is not only valid, it results in real business loss for the provider and its customers.
On multi-cloud: it's one way to reduce the amount of downtime you have, but it comes with a significant operational cost depending on how your application is architected and how your teams internal to your company are formed. It is totally practical for someone to bank on AWS' reliability until they're at a significant amount of traction or revenue where the added uptime of going multicloud is worth the investment. I know you're not saying this isn't the case (I think you're saying "do that if you're going to complain about 1 providers' uptime"), but thought it was worth putting the context into the HN ether.
> This is why any criticism of AWS reliability is meaningless to me.
Is anyone tracking reliability for these public providers? Would be curious how AWS compares to Azure and GCP. My experience is it's better, but we may have avoided Kinesis or whatever that keeps going down.
Yep. Of course there's no detail yet so we don't know what exactly was affected. All we can see is "Multiple services are being impacted globally" and a list of services (Build, Firestore, Container Registry, BigQuery, Bigtable, Networking, Pub/Sub, Storage, Compute Engine, Identity and Access Management) but there's no indication of what specifically was impacted. Could you still see status for your VMs, but not launch new ones? Was it mostly affecting only a couple regions? No idea. All we know is they're now below four nines in February for a handful of critical services.
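A quick sanity check on the four-nines arithmetic, assuming the roughly 4:57-5:05am window described at the top of the thread was the whole event:

    feb_minutes = 28 * 24 * 60            # 40,320 minutes in February 2023
    budget = feb_minutes * (1 - 0.9999)   # four nines allows ~4.0 minutes/month
    outage = 8                            # minutes, ~4:57am to ~5:05am
    print(f"allowed {budget:.1f} min, observed ~{outage} min")  # 4.0 vs 8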
Let's take a gander at incident history: https://status.cloud.google.com/summary
Cloud Build looks bad... three multi-hour incidents this year, four in fall/winter last year.
Cloud Developer Tools have had four multi-hour incidents this year, many last fall/winter.
Cloud Firestore looks abysmal... Six multi-hour incidents this year, one of them 23 hours.
Cloud App Engine had three multi-hour incidents this year, many in fall/winter last year.
BigQuery had three multi-hour incidents this year, many in fall/winter last year.
Cloud Console had five multi-hour incidents this year, many in fall/winter last year. (And from my personal experience, their console blows pretty much all the time)
Cloud Networking has had nine incidents this year, one of them was eight days long. What the fuck.
Compute Engine has had five multi-hour incidents this year, many last fall/winter.
GKE had three incidents this year, multiple the past winter.
Can somebody do a comparison to AWS? This seems shitty but maybe it's par for the course?
I fantasize that it's a three-letter agency with a warrant making them either start shitting their chat logs or pulling drives and recovering them the hard way.
They claim the Gmail specific issues are resolved. We shall see...
Feb 27, 2023 2:03 PM UTC
We experienced a brief network outage with packet loss, impacting a number of Workspace services. The impact is over. We are investigating and monitoring.
Solving this sort of thing is not about throwing more people at it. That would be brute force and not strategic. Instead, you want to architect systems like these in a way that strikes a good balance between resilience and things like cost/efficiency/etc.
uniformlyrandom|3 years ago
Of course, at Google scale 'partial' is still very big.
Dave3of5|3 years ago
<3 To the engineers trying to fix it at the moment.
yjftsjthsd-h|3 years ago
Er, we absolutely can and should compare rates of problems and overall reliability.
crazygringo|3 years ago
If you run your own hardware these events are inevitable too.
MuffinFlavored|3 years ago
in multiple datacenters?
uniformlyrandom|3 years ago
Not great, not terrible.
2OEH8eoCRo0|3 years ago
https://arstechnica.com/tech-policy/2023/02/us-says-google-r...