Simplicity – Google SRE Handbook (2017)

[+] zbentley|1 year ago|reply

Many commenters here are rightly pointing out Google’s hypocrisy in not actually following the principles in this article. Fair enough. But others are throwing the baby out with the bathwater: it’s a little silly to read comment after comment saying that the advice in TFA must be bad because Google does dumb/bad stuff on the regular. Companies aren’t homogeneous. Even misguided companies may employ people who can teach others important things.

Boeing is a perfect example of this. I would absolutely read an article proposing principles of engineering reliability from a Boeing eng/QA greybeard. Even as the rest of the company spiraled due to horrible leadership and management practices, many people in engineering and quality control did their damnedest to keep those failures from causing even more harm and loss of life. Those people probably have very valuable lessons to share about how to maintain what quality you can in a deeply hostile environment.

[+] ilrwbwrkhv|1 year ago|reply

Also Google has this problem after they outsourced a bunch of work to third world countries where original thinking is quite limited and management through bureaucracy is the norm.

[+] dangus|1 year ago|reply

I don’t see any validity to the alleged hypocrisy.

End users making that criticism are confusing the products with the reliability practices.

[+] fl0ki|1 year ago|reply

It's only fair to say that SRE's attitude towards complexity can be very different to SWE's, reinforced by how their performance is reviewed.

Though there's still hypocrisy in SRE complaining about how SWE builds projects while building equally contrived projects for other SRE to use.

[+] burakemir|1 year ago|reply

While the text touches on many points I would immediately sign, the paragraph starting with "Because engineers are human beings who often form an emotional attachment to their creations, ..." is really out of place.

The cause of complexity is not emotional attachment; these are decisions being made. The decision to add feature after feature and punt on maintenance, for example, has little to do with emotions. Engineers, SWE and SRE alike, have a lot of agency in shaping how things are. However, there can be good reasons to abandon simplicity. The real trouble here is not psychology but that, as a profession, we are really bad at measuring and estimating the effective cost of maintenance. Part of that is treating measures to improve simplicity and maintainability as a cost that comes without gain and is somehow less important than features, and then just accepting a giant rewrite a few years later. A continuous portion of upkeep would likely be more economical, and real engineering has always included an aspect of economy: cost vs. benefit.

IMHO the loaded accusation of emotional attachment might be rooted in an "us vs them" attitude (SRE vs software engineering) that should have no place in a sober discussion on the value of simplicity and it diminishes an otherwise great text.

[+] CraigJPerry|1 year ago|reply

>> Because engineers are human beings who often form an emotional attachment to their creations, confrontations over large-scale purges of the source tree are not uncommon. Some might protest, "What if we need that code later?"

> the paragraph starting with "Because engineers are human beings who often form an emotional attachment to their creations, ..." is really out of place.

FWIW I’ve definitely encountered developers clinging to things when the business context has completely changed. I totally recognise the scenario in the original text.

[+] oooyay|1 year ago|reply

I'm an SRE and I disagree too, though I think you're giving SREs too much credit in the category of our hegemony for an "us vs them" debate. Maybe at Google, SWEs having relationships with their code base is a well-studied thing. It could also just be someone's opinion that made its way unchallenged into the book. That's to say, Google SRE wasn't the best or last iteration of SRE.

I personally think systems evolve the way you describe because of a system of incentives. There are more incentives for features than there are for refactoring and non-top-priority defect fixes. This comes from the people who hold power to shape incentives, and they often do so with conflicting priorities and a superficial understanding of the existing incentive structure.

I'd also like to say that it's my own personal theory that systemic issues can only be caused by systemic forces. Individual mindsets cannot be to blame then; if a mindset has become systemic (example: SWEs overly attached to code and features) then your next question should be "why?". There's a system that enforces that, and if you don't look beyond personal obsession then you'll never find it.

[+] arccy|1 year ago|reply

But people do get attached to their creations, they don't want their things deprecated/removed, since to them it may feel like their thing is thrown away or wasted work down the drain. While they may not obviously state it as such, it can be the underlying reason driving their arguments (e.g. sunk cost fallacy).

[+] scott_w|1 year ago|reply

I think the examples the paragraph gives more than back up the statement. I’ve met people who comment out code instead of deleting it (luckily not in a long time!) and I feel the authors speak from experience here.

[+] intelVISA|1 year ago|reply

> Because engineers are human beings who often form an emotional attachment to their creations

Because engineers are human beings who often form an emotional attachment to their job security

It's understandably very unwise to admit that Very Complex Solution that cost A Lot Of Money was A Bad Thing

[+] mrbungie|1 year ago|reply

I think that is being transparent about what actually happens in the real world (engineers, at least in part, being human and emotional in their decisions), rather than just talking about impossible ideals (engineers thinking about tradeoffs in a purely objective manner).

NIH, CV based development, preference for shiny/new things and a myriad of other "engineer/organizational diseases" exist, you know. And there are even SaaS/PaaS/XaaS marketing teams exploiting such human qualities when making software sales.

[+] jimmySixDOF|1 year ago|reply

When containers got going there was a phrase used in devops to think of servers as "cattle not pets" for just this reason.

[+] jimbokun|1 year ago|reply

It’s not specific to software engineers. People in every field get emotionally attached to their own creations.

[+] elktown|1 year ago|reply

Just remember that what Google writes in these kinds of things is not universal. It's written from their very unusual circumstances. You can certainly pick nuggets that are more universal than others but, like in many other instances, too much unnecessary work is spent trying to imitate Google and others when it's not really needed. And no, you won't turn into Google overnight; you will have time to adapt if fortune hits you. Some things are not even necessarily good advice at all, but rather a product of incentives within Google (and perhaps most tech corps) rewarding the aesthetics of "innovation".

[+] fmbb|1 year ago|reply

I read the whole text (granted, a bit quickly) looking for weird or unnecessary advice but I cannot see any.

This is a great text about considerations everyone operating software services should take to heart.

It applies regardless of whether you deploy a monolith or several smaller servers.

If you are only one developer, it might apply in a smaller context.

[+] nvarsj|1 year ago|reply

Even within Google, this is not universal. I doubt the majority of SREs at Google have even read the "Google SRE book".

On the other hand, the book has some nuggets that make it worth reading. But it should be treated as a collection of essays from some very senior SREs rather than a manual.

[+] gnuser|1 year ago|reply

At the last “real” job I tried to help implement this, first as a member and later as the manager of the ops team. It’s a great start, but in that case management wanted the idea of devops/SRE but didn’t actually support it, and it really was a shit show. If you have a bad CTO and bad leadership at the board level, no amount of re-tooling will paper over their lack of support for the real principles.

[+] kryptonomist|1 year ago|reply

Glad to see those valuable principles written down, even if it seems we are heading in the complete opposite direction. At least we can try to apply them on our side business.

These were also true in the early ages of aviation:

“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”

― Antoine de Saint-Exupéry

[+] MichaelMug|1 year ago|reply

> Source control systems make it easy to reverse changes

I have not observed this to be the case. After a few revisions there are so many changes that the code cannot be reverted without losing a lot. A mechanism that aims to cut out the soon-to-be-dead code, like a flag, is better. But perhaps I’m doing something wrong.
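
To make the flag idea concrete, here is a minimal sketch of gating a soon-to-be-dead code path behind a flag. Everything in it is an assumption made up for illustration: the `USE_LEGACY_GREETING` variable, the function names, and the choice of an environment variable as the flag source. It is not something from the book.

```python
import os

def legacy_greeting(name):
    # Old code path that is slated for deletion.
    return f"Hello, {name}"

def new_greeting(name):
    # Replacement code path.
    return f"Hi, {name}"

def flag_enabled(flag, env=None):
    # Hypothetical flag lookup: reads from the process environment by default.
    # Passing a dict as `env` makes the behaviour easy to test.
    env = os.environ if env is None else env
    return env.get(flag, "").lower() in {"1", "true", "yes"}

def greeting(name, env=None):
    # The legacy path stays reachable until the flag is retired,
    # at which point both the flag and legacy_greeting can be deleted.
    if flag_enabled("USE_LEGACY_GREETING", env):
        return legacy_greeting(name)
    return new_greeting(name)
```

The catch, as others in the thread point out, is that the legacy branch still has to be maintained until someone actually removes the flag.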

[+] userbinator|1 year ago|reply

A lot of preaching but bears little resemblance to what Google is actually doing in reality. IMHO those who actually understand what "simplicity" means in software are only those who have tried to do anything in highly-resource-constrained environments.

[+] hiAndrewQuinn|1 year ago|reply

A taxonomy of what we mean when we talk about "resource-constrained" might be helpful for those seeking to gain this knowledge. Limited CPU, RAM, etc are the obvious contenders - but then there's also "resource-constrained" as in "I'm the solo dev of this project and have 5 hours in a good week to work on it", or "this runs in a weird place without Internet that I only get access to twice a year". I've been in all of these situations, sometimes multiple at the same time, and they've been great forcing functions to find new paths towards simplicity.

[+] sumeno|1 year ago|reply

> bears little resemblance to what Google is actually doing in reality

What is Google doing in reality?

[+] gtirloni|1 year ago|reply

You also have to keep in mind the scope and timeline of where these principles apply. I'm sure someone would be able to apply them to their own work most of the time but if you look at a company as a whole, unless someone at the top is really pushing for global simplicity, things are pretty messy most of the time.

I'm just saying this because Google might be doing this in little islands, not as a company strategy. I don't really know and can only guess from the outside.

[+] wouldbecouldbe|1 year ago|reply

"Why don’t we gate the code with a flag instead of deleting it?" These are all terrible suggestions. Source control systems make it easy to reverse changes, whereas hundreds of lines of commented code create distractions and confusion."

In most cases deleting code would be a good idea, but it is a stretch to say that source control systems make reverting easy. After a few months most developers will have forgotten about those lines, and at times commenting out code and explaining it explicitly might be a better way to preserve knowledge than relying on digging through Git.

[+] lloydatkinson|1 year ago|reply

First time I’m hearing that feature flags and commented out code are the same thing.

[+] wiseowise|1 year ago|reply

Nothing like supporting dead code forever just because you might need it some day.

[+] ChrisArchitect|1 year ago|reply

Some more recent discussion:

https://news.ycombinator.com/item?id=39580346

[+] YokoZar|1 year ago|reply

See also the Simplicity chapter in the followup Google SRE Workbook: https://sre.google/workbook/simplicity/

[+] quintes|1 year ago|reply

Yeah look. This may be the throw-it-over-the-wall problem. SRE says no.

You build it, you run it. But it may work at their scale.

[+] YZF|1 year ago|reply

This reads for me as a reflection of Google politics/org structure. The SRE org positioning itself as the guardian of system design vs. the SWEs who are agents of complexity. Doesn't feel healthy to me. The principles are fine but it's the SWEs that should be talking and applying them because they are "closer" to the decisions.

[+] maximinus_thrax|1 year ago|reply

Maybe an unpopular opinion, but this type of content is useless and serves no other purpose than feeding the already bloated Google cargo-culting machine.

[+] OutOfHere|1 year ago|reply

Does this include instructions on accidentally deleting a customer's account? Because that's what Google does. I don't think I want to take any advice from Google on anything.

[+] gtirloni|1 year ago|reply

> Because that's what Google does

Your argument would be stronger if you could list a few cases like that latest high profile one where GCP deleted some enterprise customer's account. A single one won't cut it for "that's what Google does".

[+] dieortin|1 year ago|reply

I challenge you to find an organization that has never made a mistake. Truth is the uptime and reliability of Google services is very good, while operating at huge scale. And I have no association with Google whatsoever.

[+] infinityplus1|1 year ago|reply

Cloud computers are just someone else's computers. Amazon and Microsoft engineers can make the same mistake too. Take backups and test them regularly and you'll be OK.
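
As a rough sketch of what "test them" could mean in practice: copy the backup into a scratch directory and check that its checksum matches the source. The function names and the single-file layout are assumptions for illustration; a real setup would restore into an environment resembling production.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    # Hash the file in chunks so large files do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup(source, backup_dir):
    # Copy `source` into `backup_dir` and return the path of the copy.
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)
    return target

def restore_test(source, backup_path):
    # "Restore" the backup into a scratch directory and compare checksums.
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / Path(backup_path).name
        shutil.copy2(backup_path, restored)
        return sha256_of(restored) == sha256_of(source)
```

Running `restore_test` on a schedule, rather than only after an incident, is the part that tends to get skipped.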

[+] postepowanieadm|1 year ago|reply

It's from 2016, when Google was less trash.

[+] kubb|1 year ago|reply

SRE has got to be one of the organisations that have done the most damage in the big G. They were given a license to mandate things based on philosophical musings backed with no science, and they can decide what's best and should be done without any data, just based on feels. They also have a culture of misanthropy, patronization and contempt towards devs. From what I can tell anyway.

[+] gtirloni|1 year ago|reply

> they can decide what's best and should be done without any data, just based on feels.

The book is exactly the opposite of this. The Principles chapter alone talks about many things that involve actually dealing with numbers (SLO, measuring complexity, etc).
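
A small illustration of the kind of arithmetic involved, assuming the usual definition of an error budget as one minus the availability SLO over a time window. The function names and the 30-day default window are my own choices, not taken from the book.

```python
def error_budget_minutes(slo, window_days=30):
    # Allowed downtime in minutes for an availability SLO such as 0.999.
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo, downtime_minutes, window_days=30):
    # Positive means budget is left; negative means the SLO was missed.
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

For a 99.9% SLO over 30 days this works out to roughly 43 minutes of allowed downtime, which is a number a team can actually argue about.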

[+] makerofthings|1 year ago|reply

Be Google SRE. Elite software engineer. Cool under pressure.

Pager goes off! Grab Pixel. Press fingerprint reader until it lets me enter my passcode. Ack page. Put down whisky. Shake self. 5 minutes to be logged in and dealing with the problem.

Password. gnubby. password. gnubby. gnubby. gnubby.

Check alert, see playbook, ignore playbook. Check which cell the problem is in. Correlate with rollouts. See a match. Roll back poorly tested dev promo project. Charts recover. Alert not firing.

Log out. Back to whisky.

[+] alienchow|1 year ago|reply

Why don't you give Mission Control a try for 6 months?

[+] sgarland|1 year ago|reply

> culture of misanthropy, patronization and contempt towards devs.

When you’re being paged for the Nth time because of an idiotic problem that you’ve pointed out repeatedly, you too might exhibit these traits.

[+] bru|1 year ago|reply

[citation needed]

[+] randmeerkat|1 year ago|reply

Google’s “best practices” led them to delete a customer’s entire $135 billion pension account [1]. I’m surprised anyone is still reading anything Google writes.

1. https://arstechnica.com/gadgets/2024/05/google-cloud-acciden...

[+] klabb3|1 year ago|reply

You’re assuming that those systems were all implemented to the letter of that guide. That’s never the case. Often these types of guidelines are written to address recurring problems found in an organization.

[+] dieortin|1 year ago|reply

If we should only read things written by organizations that make no mistakes, then we will never read anything.

[+] thirteenfingers|1 year ago|reply

That was seven years later. Maybe the problem is that Google stopped reading what Google wrote.

[+] gtirloni|1 year ago|reply

Oh, completely ignoring anything anyone from Google ever writes again? This is akin to the cancel culture which we all know is how society should work. /s

115 comments