top | item 13744621

Incident management at Google – adventures in SRE-land

250 points | kungfudoi | 9 years ago | cloudplatform.googleblog.com | reply

43 comments

[+] no_wizard|9 years ago|reply
I find this bit to be particularly insightful:

"Can I handle this? What if I can’t?" But then I started to work the problem in front of me, like I was trained to, and I remembered that I don’t need to know everything — there are other people I can call on, and they will answer. I may be on point, but I’m not alone

It might be because I'm currently training people in this realm and this is one of their biggest fears, or maybe because it was my biggest fear, but it's so true. We're a team. We're here to help. At least if your SRE org is any good. Never be afraid to ask for help, and never be afraid to admit you don't know something or that it might be outside of your comfort zone.

I'll take someone who is willing to learn and readily able to admit knowledge deficits over someone who isn't, any day of the week. Great book they're working on, and a great article on this. So many gems, but this one stuck out for me, and it's pretty relevant to me right now.

[+] AdmiralAsshat|9 years ago|reply
I notice this a lot with the team I work with. Our team's tickets are not assigned automatically: there's simply a queue, and people are encouraged to grab what they can. Unfortunately, what I've often seen is that people see something in the description that they're not familiar with and refuse to touch the ticket because they don't want to ask for help, which means that it languishes in the queue. The end result is that the same person ends up taking the same kind of ticket over and over, because they're the only one who has any familiarity with the program in question.
[+] HerraBRE|9 years ago|reply
It's not mentioned in the article, but there is an underlying point that affects hiring for roles like this: you need people who can and will admit they don't know everything and will ask for help rather than wing it.

"Rock stars" are downright dangerous, as are people who prefer to make things up rather than admit ignorance.

A new SRE doesn't need to know everything (and can't). But he absolutely needs to be curious and willing to ask for help.

[+] justicezyx|9 years ago|reply
The hard part is knowing what you know and what you don't know. It has nothing to do with being "inside a team and feeling safe". If you don't know what you know and what you don't, you will eventually make the most naive mistakes.

I myself once deleted the data files of a production MySQL server because I had no idea what I was doing. I had to call teammates at 12am to learn how to take care of the mess I had created.

[+] richforrester|9 years ago|reply
Goes for any job really; know what you do/don't know, and know what you should/shouldn't know.

For some things, you should be able to trust others.

It's funny, because it boils down to "don't lie" (neither to yourself nor to others).

[+] divbit|9 years ago|reply
>there are other people I can call on, and they will answer. I may be on point, but I’m not alone

that sounds really nice

[+] mikecb|9 years ago|reply
The coolest thing I took away from the SRE book was this progression of system operations from manual, to scriptable, to automated, to a fourth category I hadn't even known existed: autonomous. The idea that you can keep moving up this hierarchy of exception management beyond even Chef and Puppet, and systems will be able to heal themselves, is a pretty cool one.
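The difference between the last two rungs can be sketched as who closes the loop: "automated" means a human notices a problem and runs a script; "autonomous" means the system watches, decides, and remediates with no human in the loop. A toy Python sketch (the health check and remediation functions here are hypothetical stand-ins, not anything from the SRE book):

```python
# Toy sketch of the top of the manual -> scriptable -> automated ->
# autonomous ladder. "Autonomous" = a closed loop: detect and heal
# without a human triggering anything.

def check_health(service: dict) -> bool:
    """Hypothetical health check: healthy if error rate is under threshold."""
    return service["error_rate"] < 0.05

def remediate(service: dict) -> None:
    """Hypothetical remediation, e.g. a restart that clears the error rate."""
    service["error_rate"] = 0.0
    service["restarts"] += 1

def autonomous_loop(services: list, iterations: int = 3) -> None:
    """No human in the loop: detect unhealthy services and heal them."""
    for _ in range(iterations):
        for svc in services:
            if not check_health(svc):
                remediate(svc)

services = [
    {"name": "frontend", "error_rate": 0.01, "restarts": 0},
    {"name": "backend", "error_rate": 0.20, "restarts": 0},
]
autonomous_loop(services)
# Only the unhealthy service gets remediated, and only once.
```

In the "automated" world a person would run `remediate` by hand after a page; here the loop itself owns detection and repair.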

As a manager, this made the concept of 20% time a lot more clear. These are people with the knowledge and incentive to build a hierarchy of systems that progressively remove risk from their work. This is in fact their primary business objective. And we need to make sure they have time to do that, vs working them to death with manual remediation. It's a great lesson.

Incidentally, Stackdriver contains a simple alerting and incident management tool that's really nice to use. Hopefully it gets more robust as time goes on and larger and more complex orgs move to their cloud. Edit: not Outalator.

[+] ben_jones|9 years ago|reply
If anything it proves how much mismanagement and wasted potential software organizations have had in the last 20 years. Full automation should be the natural progression of our trade, but I fear most companies stall after five-ish years due to turnover, brain drain, re-organizations, acquisitions, management incompetence, etc. Google, on the other hand, has always had a seemingly never-ending pool of resources and talent to keep pushing the boundary further. Fortunately they give a lot back to the community in the form of books, talks, and projects such as Kubernetes (a poor man's Borg). However, I fear that, as with all things commercial, it will lead to an oligarchy where companies like Google, Facebook, and Uber are just so far ahead of the curve that nobody else will ever catch up.
[+] asuffield|9 years ago|reply
(I'm a Google SRE. My opinions are my own.)

That's not what our 20% time is for, and 20% is way too small a number for that purpose. "20% time" (the way we use the term) is for personal/career growth/scratching itches.

Time spent on building systems that make our service better is my primary job. Manual remediation ("toil") is something to be tracked as a dangerous antipattern that must not be allowed to take over.

Toil and oncall response should be less than 20% of my time, together. At least half my time should go into engineering projects. If the level of toil is in excess of 50% of team activity then I would expect only percussive intervention to get the team out of this situation.

[+] kyrra|9 years ago|reply
I'm a google employee, opinions are my own.

The incident management tool is not Outalator. Outalator is the pager queue management tool. The incident management tool is for manually creating incidents that have much broader visibility than Outalator does.

As someone who has been incident commander a few times, incidents tend to have broader impact beyond your immediate team or owned jobs.

[+] nodesocket|9 years ago|reply

  progression of system operations from manual, to scriptable, to automated, to a fourth category I hadn't even known existed: autonomous.
Completely agree. The first "eureka" moment is when you define all your infrastructure in code in something like Terraform. Magically, networking, firewall rules, disks, and instances are all provisioned, with dependencies calculated. It is quite a breakthrough from running CLI commands or using the web interface to allocate infrastructure.
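What "dependencies calculated" buys you can be illustrated with a toy example (this is plain Python, not Terraform; the resource names are made up): given declared resources and what each depends on, a safe provisioning order falls out of a topological sort of the graph.

```python
# Toy illustration of dependency-ordered provisioning, the way an
# infrastructure-as-code tool derives it from declarations.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical resources; each maps to the resources it depends on.
resources = {
    "network": [],
    "firewall_rule": ["network"],
    "disk": [],
    "instance": ["network", "disk", "firewall_rule"],
}

# static_order() yields each resource only after all of its dependencies.
order = list(TopologicalSorter(resources).static_order())
print(order)
```

Declaring the graph and letting the tool derive the order is exactly what replaces the hand-sequenced CLI commands the comment mentions.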

Plug: I wrote a blog post on getting started with Terraform and Google Compute Engine for those interested https://blog.elasticbyte.net/getting-started-with-terraform-...

[+] bluedino|9 years ago|reply
>> from manual, to scriptable, to automated, to a fourth category I hadn't even known existed: autonomous.

What's the difference between automated and autonomous?

[+] WestCoastJustin|9 years ago|reply
FYI - it's linked to in the post, but in case it's not obvious, they have posted the SRE book for free at https://landing.google.com/sre/book.html

I'd highly recommend it if you're in the Ops field. Probably the best book out there on current large-scale Ops practices.

[+] ben_jones|9 years ago|reply
It's a great book, very well written and fair. But the first time I read it I suffered a certain amount of zealotry: "Google is amazing! I should rewrite everything to be more like them!". Really, the caveat everyone should keep in mind is that the book describes how Google built systems for Google. YMMV.
[+] twosheep|9 years ago|reply
Maybe it's just me but I found the constant in-line plugs for the book to be distracting -- footnotes would have been better.

Interesting write-up, though.

[+] daenney|9 years ago|reply
I had a similar reaction. I was a bit irked by it because it felt very pushy towards the SRE book and broke me out of the flow of the article a few times. 10/10 on the book though; I would recommend anyone read it.
[+] vgy7ujm|9 years ago|reply
Is it just me, or are we seeing a trend, almost before the "new" SRE role has become mainstream, of SRE turning into support technician, because that is what is needed at most places that are not Google scale? The devaluation of the sysadmin took some time; this is happening much faster. What will be the next title when "SRE" can't get you a decent salary anymore? And why don't we see the same with the SWE role? Is it just that business leaders see Ops as a cost, no matter what name it has?
[+] saycheese|9 years ago|reply
Anyone able to compare and contrast Google's "Wheel of Misfortune" with Netflix's "Chaos Monkey" both in terms of the systems that enable them and the operations that relate to them?
[+] bskap|9 years ago|reply
They're unrelated. Wheel of Misfortune is just a role-playing replay of a previous incident as a training exercise. Someone will grab (or simulate) logs and dashboards from the incident and then play GM for the wheel of misfortune at a future team meeting. Someone who isn't familiar with the incident will be designated "on-call". They'll state what they want to do and the GM will tell them or show them what they see when they do those things.

Chaos Monkey is actually taking down production systems to make sure the system as a whole stays up when those individual pieces fail. Google does have (manual, not automatic) exercises doing similar things, called DiRT (Disaster Recovery Testing), but it's not related to the SRE training exercise.

(standard disclaimer: Google employee, not speaking for company, all opinions are my own, etc.)

[+] asuffield|9 years ago|reply
(I'm an SRE at Google. My opinions are my own.)

WoMs are a training exercise, intended to build familiarity with systems and how to respond when oncall. A typical WoM format is a few SREs sitting in a room, with a designated victim who is pretending to be oncall. The person running the WoM will open with an exchange a bit like this (massively simplified):

"You receive a page with this alert in it showing suddenly elevated rpc errors (link/paste)"
"I'm going to look at this console to see if there was just a rollout"
"Okay, you see a rollout happened about two minutes before the spike in rpc errors"
"I'll roll that back in one location"
"rpc errors go back to normal in that location"
...etc

(Depending on the team and quality of simulation available, some of this may be replaced with actual historical monitoring data or simulated broken systems)
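The GM's role above is mechanical enough to sketch in code: a table mapping the trainee's stated actions to the observations they would have seen during the real incident. A toy Python sketch (all scenario strings and names here are invented for illustration, not from any real incident):

```python
# Toy sketch of a scripted Wheel-of-Misfortune exchange: the GM replays a
# past incident; the designated "oncall" states actions, and the script
# answers with what they would observe.
scenario = {
    "opening": "You receive a page: suddenly elevated rpc errors.",
    "responses": {
        "check recent rollouts": "A rollout landed two minutes before the error spike.",
        "roll back in one location": "rpc errors return to normal in that location.",
    },
}

def gm_answer(action: str) -> str:
    """The GM tells the trainee what they see when they try an action."""
    return scenario["responses"].get(
        action, "Nothing obviously relevant; what else do you try?")

print(scenario["opening"])
print(gm_answer("check recent rollouts"))
```

A real WoM replaces the lookup table with a human GM (or, as noted above, with historical monitoring data), but the turn-by-turn shape of the exercise is the same.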

The "chaos monkey" tool, as I understand it, is intended to maintain a minimum level of failure in order to make sure that failure cases are exercised. I've never been on a team which needed one of those: at sufficient scale and development velocity, the baseline rate of naturally occurring failures is already high enough. We do have some tools like that, but they're more commonly used by the dev teams during testing (where the test environment won't be big enough to naturally experience all the failures that happen in production).