In lieu of an actual Googler, how about some educated speculation? It blows my mind that Google can even have problems like this. Aren't their apps highly distributed across tons of CDNs? Don't they have world-class DevOps people who roll out changes in a piecemeal fashion to check for bugs? How exactly can they have an issue that affects a huge swath of their customers across countries? Insight appreciated.
joatmon-snoo|5 years ago
* We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation. (e.g. what if you're in a situation where rolling back could make the problem worse? we might be Google, but we don't have magic wands)
* Debugging new failure modes is a coin flip: maybe your existing tools are sufficient to understand what's happening, but if they're not, getting that visibility can in itself be difficult. And just like everyone else, this can become a trial and error process: we find a plausible root cause, design and execute a mitigation based on that understanding, and then get more information that makes very clear that our hypothesis was incomplete (in the worst case, blatantly wrong).
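The first bullet's point about automatable mitigations can be sketched in a few lines. This is a toy illustration, not Google's actual tooling: the `FlagStore` class, the flag name, and the threshold are all made up.

```python
# Toy sketch of "straightforward mitigation" automation: when a monitored
# error rate crosses a threshold, revert the most recently flipped feature
# flag. All names here are hypothetical, for illustration only.
class FlagStore:
    def __init__(self):
        self.flags = {}       # flag name -> current boolean value
        self.history = []     # flips in order, newest last

    def set(self, name, value):
        self.flags[name] = value
        self.history.append((name, value))

    def rollback_last(self):
        """Revert the most recent flag flip and report which flag it was."""
        name, value = self.history.pop()
        self.flags[name] = not value
        return name

def auto_mitigate(store, error_rate, threshold=0.05):
    """If errors spike and there is a recent flip to blame, roll it back."""
    if error_rate > threshold and store.history:
        return store.rollback_last()  # the easy, automatable case
    return None  # nothing obvious to revert: a novel failure mode, humans take over
```

The `return None` branch is exactly the situation the comment describes: once the obvious rollbacks are exhausted, what remains needs custom, specialized mitigation.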
userbinator|5 years ago
As Douglas Adams says, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."
Aperocky|5 years ago
Here come the poison pills!
brown9-2|5 years ago
https://status.cloud.google.com/incident/zall/20013#20013003
max_streese|5 years ago
Well, I guess the thing left unanswered for now is why the quota management system reduced the capacity of Google's IMS in the first place.
Maybe we will know someday :)
enneff|5 years ago
When you operate at Google's scale, everything that can go wrong will go wrong. Google does an amazing job providing high-availability services to billions of users, but doing so is a constant learning process; they are constantly blazing new trails for which there are no established best practices, and so there will always be unforeseen issues.
marcan_42|5 years ago
Yes, apps are highly distributed. Yes, roll-outs are staggered and controlled.
But some things are necessarily global. Things like your Google account are global (that's what went down the other day). Of course you can (and Google does) design such a system so that it's distributed and tolerant of any given piece failing. But it's still one system. And so, if something goes wrong in a new and exciting way... it might just happen to hit the service globally.
When things go down, it's because something weird happened. You don't hear about all the times the regular process prevented downtime... because things don't go down.
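The staggered roll-outs mentioned above can be pictured as a gate that ramps a change through ever-larger slices of traffic and aborts on regression. This is a minimal sketch under assumed stage fractions and an assumed error-rate tolerance, not any real deployment system:

```python
# Hypothetical staged-rollout gate: advance a change through increasing
# fractions of traffic, rolling back everywhere if errors regress.
ROLLOUT_STAGES = [0.001, 0.01, 0.05, 0.25, 1.0]  # assumed traffic fractions

def next_stage(current_fraction, baseline_error_rate, observed_error_rate,
               tolerance=1.5):
    """Return the next traffic fraction, or 0.0 to signal a full rollback."""
    if observed_error_rate > baseline_error_rate * tolerance:
        return 0.0  # regression detected: abort and roll back
    for stage in ROLLOUT_STAGES:
        if stage > current_fraction:
            return stage
    return current_fraction  # already fully rolled out
```

The catch, as the parent comment notes, is that this only protects against failures the change itself introduces; a bug in a necessarily-global system can still slip past every stage and hit everything at once.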
Zenst|5 years ago
However, I'd speculate that in this instance, when you hit that 0.0001% problem, having fewer hands on deck makes the work-from-home aspect harder. It's akin to fixing somebody's PC remotely versus standing behind them.
With that premise, I'd speculate that while working from home wasn't the root cause, it may have been a small ripple that contributed to the root cause and/or led to a slower resolution than you would normally get.
Those speculations aside, it highlights that some tooling, along with designs and set-ups, needs to adjust for remote workers. Water-cooler talk isn't just for gossip; a counter to losing it would be more regular online group socialising at a work level, so that both companies and workers can fully adapt to and embrace the medium, and so the kinks and rough areas can be polished and made better for all.
Lastly, I'd speculate that I'm totally wrong, and yet what I said may well match some people's anecdotes and resonate with others.
megous|5 years ago
It should not be a problem that Gmail is "down". Unless it stayed down for more than a few days, no one would lose e-mail. The problem is that it's not returning a temporary error code, but a permanent one.
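The temporary-vs-permanent distinction above comes from SMTP reply code semantics (RFC 5321): a 4xx reply tells the sending server to queue the message and retry later, while a 5xx reply tells it to bounce the message immediately. A minimal classifier illustrating the difference:

```python
def classify_smtp_reply(code: int) -> str:
    """Classify an SMTP reply code per RFC 5321 semantics."""
    if 200 <= code < 300:
        return "success"    # message accepted
    if 400 <= code < 500:
        return "transient"  # sender queues the message and retries later
    if 500 <= code < 600:
        return "permanent"  # sender bounces the message back immediately
    raise ValueError(f"unexpected SMTP reply code: {code}")
```

So a receiving service that is merely overloaded should answer with a 4xx (e.g. 451): senders keep retrying for days and no mail is lost. Answering with a 5xx during an outage makes every sender give up and bounce, which is the complaint here.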
eloisant|5 years ago
Sometimes it's a deployment script that propagates an issue to the whole system. Sometimes the routing goes wrong (for example, when AWS routed all production traffic to the test cluster instead of the production cluster).