top | item 25436740

In lieu of an actual Googler, how about some educated speculation? It blows my mind that Google can even have problems like this. Aren't their apps highly distributed across tons of CDNs? Don't they have world-class DevOps people who roll out changes in a piecemeal fashion to check for bugs? How exactly can they have an issue that affects a huge swath of their customers across countries? Insight appreciated.

joatmon-snoo|5 years ago

Googler but nowhere near Gmail, so just educated speculation:

* We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation. (e.g. what if you're in a situation where rolling back could make the problem worse? we might be Google, but we don't have magic wands)

* Debugging new failure modes is a coin flip: maybe your existing tools are sufficient to understand what's happening, but if they're not, getting that visibility can in itself be difficult. And just like everyone else, this can become a trial and error process: we find a plausible root cause, design and execute a mitigation based on that understanding, and then get more information that makes very clear that our hypothesis was incomplete (in the worst case, blatantly wrong).

userbinator|5 years ago

> We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation.

As Douglas Adams says, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."

missblit|5 years ago

Rollback-proof bugs are rare, but boy howdy are they exciting. I think I've only seen one so far (unless you count bad data / bad state that persists after a bad change is rolled back... which can also be pretty exciting).

Aperocky|5 years ago

> what if you're in a situation where rolling back could make the problem worse?

Here come the poison pills!

brown9-2|5 years ago

You don’t really have to speculate, they disclosed yesterday that yesterday’s issue had to do with the automated quota system deciding the auth system had zero quota:

https://status.cloud.google.com/incident/zall/20013#20013003

max_streese|5 years ago

Thanks for providing this. It's funny to read the speculations when you have read the actual root cause :D

Well, I guess the thing left unanswered for now is why the quota management reduced the capacity for Google's IMS in the first place.

Maybe we will know someday :)

ravenstine|5 years ago

Maybe they have world class DevOps, but they also have way more things that can go wrong than the vast majority of businesses. It's kind of remarkable that the entire world can be pinging Google services and they have ~99.9% uptime.
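To put that ~99.9% figure in perspective, here is a rough back-of-the-envelope calculation of how much downtime a given availability target allows per year. The function name and the 365-day year are my own assumptions for illustration; this is plain arithmetic, not how any provider actually measures its uptime.

```python
def downtime_budget_hours(availability: float, period_hours: float = 365 * 24) -> float:
    """Hours of downtime allowed in a period for an availability fraction (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    # Compare a few common availability targets over one (non-leap) year.
    for label, fraction in [("99.9%", 0.999), ("99.99%", 0.9999)]:
        print(f"{label}: {downtime_budget_hours(fraction):.2f} hours/year")
```

At 99.9% that works out to a bit under nine hours a year, so a single outage lasting a few hours uses up a large share of the yearly budget.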

enneff|5 years ago

> It blows my mind that Google can even have problems like this.

When you operate at Google's scale then everything that can go wrong, will go wrong. Google does an amazing job providing high-availability services to billions of users, but doing so is a constant learning process; they are constantly blazing new trails for which there are no established best practices, and so there will always be unforeseen issues.

marcan_42|5 years ago

Ex-Googler here.

Yes, apps are highly distributed. Yes, roll-outs are staggered and controlled.

But some things are necessarily global. Things like your Google account are global (that's what went down the other day). Of course you can (and Google does) design such a system so that it's distributed and tolerant of any given piece failing. But it's still one system. And so, if something goes wrong in a new and exciting way... it might just happen to hit the service globally.

When things go down, it's because something weird happened. You don't hear about all the times the regular process prevented downtime... because things don't go down.
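For readers who haven't seen one, a staggered roll-out of the kind mentioned above can be sketched roughly like this. This is a minimal illustration under my own assumptions, not Google's actual tooling: `staged_rollout`, the stage names, and the `deploy`/`healthy`/`rollback` callbacks are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy stage by stage; on the first unhealthy stage, undo everything.

    Returns the list of stages left deployed (empty if the roll-out was aborted).
    """
    deployed: List[str] = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage)
        if not healthy(stage):
            # Roll back in reverse order, most recent stage first.
            for done in reversed(deployed):
                rollback(done)
            return []
    return deployed


if __name__ == "__main__":
    # Hypothetical stages; the health check pretends "region-a" goes bad.
    result = staged_rollout(
        ["canary", "region-a", "global"],
        deploy=lambda s: print(f"deploy   {s}"),
        healthy=lambda s: s != "region-a",
        rollback=lambda s: print(f"rollback {s}"),
    )
    print("still deployed:", result)
```

The catch is that this only limits the blast radius when each stage is genuinely smaller than the whole. A system that is global by nature, or a failure that rollback can't undo, doesn't fit neatly into the loop.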

Zenst|5 years ago

I speculate that for many companies, working from home has had less impact than they expected.

However, I'd speculate that in this instance, when you hit that 0.0001% problem, having fewer hands on deck makes working from home harder, akin to fixing somebody's PC remotely rather than standing behind them.

On that premise, I'd speculate that remote work, whilst not the root cause, may have been a small ripple that contributed to it and/or led to a slower resolution than would normally be the case.

Those speculations aside, this will only highlight that some tooling, as well as design and set-ups, needs to adjust for remote workers. Water cooler talk is not just for gossip, and a counter would be more regular online group socialising at a work level, so that not only the companies but also the workers can fully adapt to and embrace the work medium, and so that the kinks and areas that need polishing can be polished and made better for all.

Lastly, I'd speculate that I'm totally wrong, and yet what I said may well match the anecdotes of some out there and resonate with others.

throwaway201103|5 years ago

You might be right for the smaller company where physical access to the machines in the data center is necessary at a certain point in the troubleshooting process. I work at such a place myself. I would guess, however, that Google moved beyond that quite some time ago. It's simply not practical, with or without having offices with people in them.

erhk|5 years ago

Software isn't as simple as splitting across different locations to prevent global failures.

megous|5 years ago

I thought SMTP was specifically designed for this (with support for multiple MX entries, queuing at the sender MTA side, etc.) and there's an easy hard boundary at the user mailbox level you can use to partition your system.

It should not be a problem that Gmail is "down". Unless this went on for more than a few days, no one would lose e-mail. The problem is that it's returning a permanent error code rather than a temporary one.
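To make the temporary/permanent distinction concrete: in SMTP the first digit of the reply code tells the sending MTA what to do. 2xx means the message was accepted, 4xx is a transient failure (keep it queued and retry later), and 5xx is a permanent failure (give up and bounce). Below is a minimal sketch of that decision; the `Action` enum and `sender_action` function are names I made up for illustration, not part of any real MTA.

```python
from enum import Enum


class Action(Enum):
    DELIVERED = "delivered"
    RETRY_LATER = "retry_later"
    BOUNCE = "bounce"


def sender_action(reply_code: int) -> Action:
    """Map a final SMTP reply code to what a sending MTA would typically do."""
    if 200 <= reply_code < 300:
        return Action.DELIVERED      # positive completion
    if 400 <= reply_code < 500:
        return Action.RETRY_LATER    # transient failure: stay in queue, retry
    if 500 <= reply_code < 600:
        return Action.BOUNCE         # permanent failure: return to sender
    raise ValueError(f"not a final SMTP reply code: {reply_code}")


if __name__ == "__main__":
    for code in (250, 451, 550):
        print(code, sender_action(code).value)
```

So an overloaded or broken receiver that answers with something like 451 costs the sender nothing but a delay, whereas a 550 makes the sender bounce mail that could otherwise have been delivered once the receiver recovered.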

eloisant|5 years ago

If there is something I've learned from AWS outages (they tend to publish detailed post-mortems), it's that no matter how you design your architecture in a distributed way, you will always have a Single Point of Failure (SPOF), and sometimes you discover a SPOF you didn't think of.

Sometimes it's a script responsible for deployment that propagates an issue to the whole system. Sometimes it's the routing that goes wrong (for example, when AWS routed all production traffic to the test cluster instead of the production cluster).

nimchimpsky|5 years ago

[deleted]

yudlejoza|5 years ago

[deleted]

ink404|5 years ago

Your contribution has greatly enhanced this conversation, thank you.

aprdm|5 years ago

Because, maybe, like in every big company, the thing actually doing the work is some old Oracle database with some huge monolith around it...

sellyme|5 years ago

Out of all the companies Google might be relying on in their back-end, I think Oracle is probably pretty far down the list.

pmlnr|5 years ago

Hush, you'll scare the shiny-eyed FAANG wannabes away; they aren't supposed to know this until they've been employed for at least two decades.