Yep. Of course there's no detail yet, so we don't know exactly what was affected. All we can see is "Multiple services are being impacted globally" and a list of services (Build, Firestore, Container Registry, BigQuery, Bigtable, Networking, Pub/Sub, Storage, Compute Engine, Identity and Access Management), but there's no indication of what specifically was impacted. Could you still see status for your VMs, but not launch new ones? Was it mostly affecting only a couple of regions? No idea. All we know is they're now below four nines in February for a handful of critical services.
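For context on the "four nines" figure, rough back-of-the-envelope math on how little downtime that budget actually allows in a 28-day February (illustrative only; the exact SLA wording varies per product):

```python
# Downtime budget for 99.99% ("four nines") monthly availability in February.
# Assumes a 28-day month; real SLAs define the measurement window differently
# per product, so treat this as a rough illustration.
minutes_in_february = 28 * 24 * 60             # 40,320 minutes
allowed_downtime = minutes_in_february * (1 - 0.9999)
print(f"{allowed_downtime:.1f} minutes")       # ~4.0 minutes for the whole month
```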
throwaway892238|3 years ago
Let's take a gander at incident history: https://status.cloud.google.com/summary
Cloud Build looks bad... three multi-hour incidents this year, four in fall/winter last year.
Cloud Developer Tools have had four multi-hour incidents this year, many last fall/winter.
Cloud Firestore looks abysmal... six multi-hour incidents this year, one of them 23 hours.
Cloud App Engine had three multi-hour incidents this year, many in fall/winter last year.
BigQuery had three multi-hour incidents this year, many in fall/winter last year.
Cloud Console had five multi-hour incidents this year, many in fall/winter last year. (And from my personal experience, their console blows pretty much all the time)
Cloud Networking has had nine incidents this year, one of which was eight days long. What the fuck.
Compute Engine has had five multi-hour incidents this year, many last fall/winter.
GKE had three incidents this year, and multiple over the past winter.
Can somebody do a comparison to AWS? This seems shitty but maybe it's par for the course?
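If anyone wants to reproduce or refine this tally, here's a rough sketch against the status dashboard's JSON feed. The feed URL, the field names ("begin", "end", "affected_products"), and the one-hour cutoff are my own assumptions about how the page exposes its data, not anything official:

```python
# Rough sketch: count incidents lasting over an hour, per product, from the
# public status feed. Field names are assumptions based on eyeballing the feed.
import json
from collections import Counter
from datetime import datetime
from urllib.request import urlopen

FEED = "https://status.cloud.google.com/incidents.json"

def parse(ts: str) -> datetime:
    # Feed timestamps look ISO-8601; normalize a trailing "Z" if present.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

with urlopen(FEED) as resp:
    incidents = json.load(resp)

multi_hour = Counter()
for inc in incidents:
    begin, end = inc.get("begin"), inc.get("end")
    if not (begin and end):
        continue  # still open or missing timestamps
    hours = (parse(end) - parse(begin)).total_seconds() / 3600
    if hours < 1:
        continue
    for product in inc.get("affected_products", []):
        multi_hour[product.get("title", "unknown")] += 1

for product, count in multi_hour.most_common():
    print(f"{count:3d}  {product}")
```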
joatmon-snoo|3 years ago
This is a pretty reductionist summary, e.g. the 8-day Cloud Networking incident root cause:
> Description: Our engineering team continues to investigate this issue and is evaluating additional improvement opportunities to identify effective rerouting of traffic. They have narrowed down the issue to one regional telecom service provider and reported this to them for further investigation. The connectivity problems are still mostly resolved at this point although some customers may observe delayed round trip time or longer latency or sporadic packet loss until fully resolved.
Still a big problem product-wise, but you're looking at a global incident history view without any region/severity filters.
The corresponding AWS service health dashboard makes it much harder to view this level of detail, but is also actually useful for someone asking "is product $xyz, which I depend on, currently down in region $abc or not?"
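For what it's worth, the same status feed can answer that "currently down or not" question, if you're willing to trust its schema. This is just a sketch: "end", "affected_products", and "currently_affected_locations" are assumed field names, and the product/region strings below are placeholder examples:

```python
# Sketch of an "is my product currently impacted in my region?" check against
# the same incidents feed. Field names and example arguments are assumptions.
import json
from urllib.request import urlopen

FEED = "https://status.cloud.google.com/incidents.json"

def currently_impacted(product: str, region: str) -> bool:
    with urlopen(FEED) as resp:
        incidents = json.load(resp)
    for inc in incidents:
        if inc.get("end"):  # incident already closed
            continue
        products = {p.get("title") for p in inc.get("affected_products", [])}
        regions = {loc.get("id") for loc in inc.get("currently_affected_locations", [])}
        # Treat an empty location list as "global" and count it as a hit.
        if product in products and (not regions or region in regions):
            return True
    return False

print(currently_impacted("Google Compute Engine", "us-central1"))
```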
Rebelgecko|3 years ago
It's weird: I did a cursory search and can't find people complaining about that 8-day networking issue. I wonder if the latency was just barely out of SLO, so people didn't notice? Or, since it was a telecom problem, maybe it was part of one of the recent undersea cable outages and people weren't surprised enough to remark on it? Or maybe I'm just not searching well.
(full disclosure, work at Google but not on cloud stuff)