Software Engineering at Google (2020) [pdf]

[+] inoffensivename|3 years ago|reply

I've maintained some pretty big libraries inside Google at one time or another in the last 16 years. I believe our system is bad for both sides (library maintainers and users). Here are some of my frustrations:

* the One Version Rule means that library authors can't change a library "upstream": your changes must work now or they can't get in. This sets the bar for landing changes very high, so development moves at a snail's pace. The version control and build system we have makes it difficult to work inside a branch, so collaboration on experimental/new things between engineers is difficult.

* users can create any test they wish that exercises their integration with your library. They can depend on things you never promised (Hyrum's law, addressed in the book). Each of these tests becomes a promise you, the library maintainer, make to that user: we will never break that test, no matter how weird it is. This is another huge burden on library maintainers.

* the One Version Rule means that, as a user, I can't just pin my dependence on some software to version X. For example, if I depend on Bigtable Client, just give me version 1.4. I don't need the meager performance improvements in version 1.5, because I don't want to risk breaking my project. This means every roll-up release you make, every sync to HEAD you do, risks bringing in some bug in the bleeding-edge version of literally everything you depend on.

[+] pclmulqdq|3 years ago|reply

I worked on one of those infrastructure services (like Bigtable). It was a huge boon to my productivity that the one version rule existed.

If we had to maintain 6 past months of releases, we would never get anything done. Since breakages in our client library were our problem, getting a months-old bug means that we can just tell you to recompile rather than figuring out how to backport a fix for you.

Hyrum's Law considerations got really weird, especially when people take them too far. I think this is based on the kernel idea of "don't break userspace," but its practical implications are nuts. Hyrum's law has killed infrastructure projects that could have made things a lot better, and has resulted in crazy test-only behavior being the norm.

One person on an adjacent team loved taking behaviors as promises (and he also had a reputation as one of the most prolific coders at G). We had to clean up his messes every time he relied on unspecified behavior. I pushed back on his nonsense a few times, particularly when he used an explicitly banned combination of configuration options that we forgot to check-fail on, but always lost. 1.5 SWEs on our team were full-time cleaning up after him.

[+] miiiiiike|3 years ago|reply

Thank you for saying this. I bought the physical copy of the book a few years ago, before I was a heavy user of Google OSS projects. I cracked it after I was shocked by how bad many of the practices seemed to be.. And it's all working as intended? The bit on docs is confusing. Google docs of some of the most poorly organized, least easily referenced docs I've ever seen. [0]

Skip forward a few years and my projects are full of workarounds for years old Google bugs. It feels like fixing basic functionality just isn't a priority. Most of them are literally labeled "NOT A PRIORITY".

[0]: You can read Scrapy's docs, and the docs for most major Python libraries, from beginning to end and just "know" how to use it (https://docs.scrapy.org/_/downloads/en/latest/pdf/). With Google docs you have to piece together fragments of information connected by an complex web of "[barely] Related Topics".

[+] titzer|3 years ago|reply

Google overcompensates for a number of practices that don't scale by throwing ungodly computational power at them. I don't know how much CPU forge/TAP consumes these days, but I remember when it was at least 90K cores in a single cluster. It's insane to me that hundreds of thousands of giga-brains are pinned 100% 24/7 to dynamically check literally trillions of things because the combinatorial space was too hard to factor.

This is not to disparage the people who built those systems, but there is only so much concrete you can put in a rocket ship.

[+] yowlingcat|3 years ago|reply

I'd agree with you that this would be bad for most companies I've worked at barring arguably one (Amazon, which is also gigantic) -- at any startup/medium sized company, these practices would not be right-sized to the work being done.

Even during my tenure at AWS inside Amazon, I have difficulty believing this would be useful; for example, at AWS, we'd separate services by tiers based on how close to the center they were inside the internal service graph. Running EC2/S3 or another tier 1 service? Yes, you're probably going to index on moving a bit slower to reduce operational risk than not. Running a new AWS service that is more of a "leaf node" than a center node? You can go ahead and move fairly quickly so long as you're obeying AWS best practices, which while somewhat hefty by startup standards are quite relaxed by other corporate standards.

What I wonder is whether this kind of heterogeneity would have been a better path for Google than what you describe. Or, is it the case that the sheer monolithic scale of search/ads at Google is such that it just wouldn't make sense, and that continuing to pile incremental resources into the monstrosity (and I mean this gently/positively) of search is what the company must do and so what the engineering culture must enable.

But, as you might be alluding to, perhaps the current approach doesn't even suit the needs of the company and is purely bad even for Google's specific problems -- in that case, is it simply there due to cultural cruft/legacy? I haven't worked at Google before, so it's hard for me to say something based on my experience with it.

[+] Rarebox|3 years ago|reply

I thought google did this versioning thing for libraries before, but it was stopped for reasonable reasons (g3 components).

Basically if you could pin lib versions everyone would be stuck on old versions for a long time, causing difficult debugging work for each user of the library. You'd then also have all sorts of diamond problems: what if you want newest absl but older bigtable client?

It's a difficult problem no matter which way you go.

[+] alpb|3 years ago|reply

This is inherently contradictory with the trunk-based development model.

I get the feeling you need to pin to v1.4 but ideally by being on the trunk head at all times, you force everyone (especially the library owners, and yourself, by writing tests around your wrapper) to do things properly such as having enough tests in place. Otherwise, you find yourself digging a grave for yourself when the time comes to migrate from v1.4 to v1.7, and it becomes grunt work that nobody wants to take on.

[+] kodah|3 years ago|reply

Re: the one version rule

On the other hand, my users can pin versions, and we maintain a longer LTS window for those features. To this day, that LTS window has never been exercised because we end up having to build backwards compatibility into everything we do. The backwards compatibility promise also means our testing is extremely verbose.

[+] mucle6|3 years ago|reply

>The version control and build system we have makes it difficult to work inside a branch

nit: use fig

[+] vinay_ys|3 years ago|reply

The book itself is a bit abstract and generalised (on purpose I suppose). People who have worked within Google vs not will think differently on how they can apply it in practice in their own company/team. Many of the practices that work very well at Google don't work as well elsewhere primarily due to bootstrapping problems. Google's developmental practice are built on top of so many layers of highly complex and large tech-infra capabilities that don't exist outside. The process/culture practices also have cyclic dependencies with those infra capabilities.

Interestingly, so many things that are quite easy everywhere else aren't so easy to do at Google. Software engineering within Google is painless in many of the usual sense of pains seen outside Google, but it has its own pains – some quite painful pains.

[+] ctvo|3 years ago|reply

Google is the only company I'm aware of where the engineers constantly publish popular books about their engineering practices.

Apple, Amazon / AWS, MSFT, etc. have all done impressive things in their space at various points, but seem to lack the mixture of personalities / culture / reputation where "Engineering at Apple" isn't quite the hit that SRE at Google [1] or this book may be.

1 - https://www.amazon.com/Site-Reliability-Engineering-Producti...

Edit: If you're at Apple and happen to work in hardware, I would pay good money to read about the process and war stories.

[+] jrvarela56|3 years ago|reply

Maybe a bit lighter than what you meant but entertaining: Creative Selection by Ken Kocienda (https://www.amazon.com/-/es/Ken-Kocienda-ebook/dp/B079DVT6VP).

One anecdote I enjoyed was how Steve Jobs said the guiding principle for Safari was that it had to load pages faster than any other existing browser so the engs made a benchmark with top 100/1k sites part of the CI to enforce the principle: diffs that slowed-down the browser could not be merged.

[+] dilyevsky|3 years ago|reply

Other companies usually publish books on management

[+] erwincoumans|3 years ago|reply

I find this part on page 35 very important:

"The Three Pillars of Social Interaction So, if teamwork is the best route to producing great software, how does one build (or find) a great team?

[...]

They’re the foundation on which all healthy interaction and collaboration are based:

Pillar 1: Humility

You are not the center of the universe (nor is your code!). You’re neither omniscient nor infallible. You’re open to self-improvement.

Pillar 2: Respect

You genuinely care about others you work with. You treat them kindly and appreciate their abilities and accomplishments.

Pillar 3: Trust

You believe others are competent and will do the right thing, and you’re OK with letting them drive when appropriate.

[+] alexashka|3 years ago|reply

They forgot Pillar 0 - not everyone's going to get along, let people choose teammates until they find ones that fit.

Once they do, pillars 1, 2 and 3 no longer need to exist.

If you need to tell people to be nice, you're doing it wrong. You're not their mother and these are not kids to be talked down to and told how to behave.

[+] lstamour|3 years ago|reply

FYI, this is also available as an audiobook from "Upfront Books". The first few chapters are easy to listen to, and the book is broken down into shorter sections. When it gets into talking about syntax formatting rules, it's a bit un-listenable. Some chapters have a lot of advice in them, others, like the chapter on inclusivity, have less specific, actionable advice. (What do I mean? Well, you could start talking about user research or scientific UX studies to get feedback beyond groupthink at the company, but that chapter on inclusivity didn't mention it.) But besides the unevenness, it's a book that does add value, I think, to general conversations on software development. And it's a rare technical book that's also available as an audiobook, so that helped it stand out for me.

I couldn't find a canonical publisher link, so here's a few ways to get the audiobook, instead: https://www.oreilly.com/videos/software-engineering-at/14920... or https://play.google.com/store/audiobooks/details/Software_En... or https://www.audible.ca/pd/Software-Engineering-at-Google-Aud... or https://books.apple.com/us/audiobook/software-engineering-at... or https://www.kobo.com/ca/en/audiobook/software-engineering-at...

[+] osigurdson|3 years ago|reply

Code reviews are an interesting topic since there is so much variability on the part of the reviewer. One could spend hours reviewing a change only to respond with "Looks good to me" after finally concluding that there is no better way, do a cursory code style level review with the same response or spend the same amount of time as the original author completely re-writing the code.

[+] UncleMeat|3 years ago|reply

Because Google owns their own code review tool, they can collect data to distinguish these sorts of cases.

[+] golondon|3 years ago|reply

I skimmed through it but still didn't see any part describing how is the local dev env is? Let's say that you work on a service that does something for ad serving, how do you write code for that and test it?

I understand unit tests and e2e tests are used but what I'm referring is just simple opening web browser, navigating to the localhost:3000/foo/bar/something and seeing if it's ok, I found this as much faster feedback loop while writing code in addition to the tests. Can anyone from Google share that?

[+] jeffbee|3 years ago|reply

I don't work there but when I did it was no problem to just `blaze run //the:thing -- --port=3000`. If a service needs to make RPCs (which is generally the case) that's not a problem because developer workstations are able to do so via ALTS[1]. Developers can only assert their own authority or the authority of testing entities, so such processes do not have access to production user data.

Another possibility is to install your program in production but under your own personal credentials. Every engineer has unlimited budget to do so, albeit with low priority.

Aside from the above, other practices vary but a team I was on had several dev environments in Borg (the cluster management system). One of which was just "dev" where anyone was welcome to release any service with a new build at any time. Another of which was "test" which also had basically no rules but existed for other teams to play with integrating with our services. The next of which was "experimental" where developers could release only an official release branch binary because it served a tiny amount of production traffic, "canary" which served a large amount of production traffic and required a 2-person signoff to release, and finally full production.

So basically developers had four different environments to just play with: their own workstations under their own authority, in prod, under their own authority, and dev/test in prod under team testing credentials.

1: https://cloud.google.com/docs/security/encryption-in-transit...

[+] Jensson|3 years ago|reply

Basically every service has already been packed full with so many things that an instance can barely fit on a server, you wont be able to run that monstrosity locally. Which is why they started doing "micro services", out of necessity since when each binary gets over a few gigabytes you don't got many other options. their micro services still takes gigabytes, but it let them continue adding more code. But each of those depends on hundreds of other micro services. And those microservices are of course properly secured so you wont be able to talk to production servers from your development machine.

[+] dilyevsky|3 years ago|reply

When i worked there you couldn’t even checkout code locally - had to ssh into office workstation. At that point just run your dev service on borg using free quota

[+] karmasimida|3 years ago|reply

Read during a long flight.

I like the testing part very much, it is pretty inspiring

[+] aaomidi|3 years ago|reply

Don't try to emulate google. Seriously. It's how you're going to kill your company if you do.

Google is not an efficient or fast company. You need to be efficient and fast to deal with the momentum google has.

If you interview people from FAANG and you see they are going to want to recreate their company within yours, don't hire them. They're going to ruin team dynamics. And seriously, make sure these people aren't assholes. With some companies in particular you need to really be an asshole to survive. Weeding out assholes is far more important for most smaller companies than technical expertise. Heck for most people you probably don't need a separate technical interview.

Just, honestly whatever google does do the opposite. From hiring to project management to planning to software development.

[+] 0x20cowboy|3 years ago|reply

> They're going to ruin team dynamics. And seriously, make sure these people aren't assholes.

I’ve worked with a number of X-FANNGers, and interviewed at a couple places that were predominantly run by them. Your comment brought up some unsavoury memories.

When I research places to work now, if any of the main people are X-FANNG I politely decline the opportunity - it just saves everyone’s time.

[+] dkersten|3 years ago|reply

> From hiring to

Speaking of hiring, just yesterday I spoke with a Google recruiter but bowed out when he told me that the process takes months.

I can't think of any point in my career where I'd be willing to put up with a process that takes MONTHS. Either I'm actively looking for work and want something ASAP, or I'm not actively looking, in which case why would I put myself through that? My time is valuable, I won't spend months on an interview/hiring process. If anything, Google better spend that time telling me why I should work there, because I have plenty of alternative options.

So, if you copy their multi-month hiring process, I certainly won't be applying.

[+] elromulous|3 years ago|reply

This is not a very constructive comment. It doesn't mention any specifics, then it jumps into assuming that Google / FAANG have more assholes?

Disclaimer: I'm a swe at alphabet.

[+] anon2020dot00|3 years ago|reply

Agree with the other commentator that this is an unconstructive message.

Replace Google in the message with Apple/Netflix/Microsoft and it will read exactly the same since it has no specifics except "don't do what x company is doing" with no particular reasoning behind it.

[+] Tepix|3 years ago|reply

> Just, honestly whatever google does do the opposite. From hiring to project management to planning to software development.

Also, respecting the privacy of your customers and having a business model that does not depend on collecting data they may not want to be collected.

[+] jeffbee|3 years ago|reply

Gotta say this is absolutely not my experience. I've worked at a lot of companies and have never seen one with anywhere near the developer productivity as Google enjoys. They have the groundwork that enables the velocity, so you can build/test/deploy very very quickly. Other organizations believed they had velocity because they skipped unit tests, code review, production security, supply chain security, etc. Consequently their whole thing becomes a haunted graveyard that everyone is terrified to change.

[+] no_wizard|3 years ago|reply

Name names. I think we are doing ourselves injustices not to name name's here. If Google, or Meta (Facebook) or whatever has terrible cultural practices as you describe, I think we as a community owe it to ourselves to call it out.

This also otherwise rings a little hollow without specific things to backup the assertion. I think being direct forces us to have the real conversation.

[+] alexashka|3 years ago|reply

You're bringing your personal trauma/opinions and re-stating them as universal truths.

[+] unknown|3 years ago|reply

[deleted]

[+] einpoklum|3 years ago|reply

At this link, it says: "File not found". Try:

https://abseil.io/resources/swe-book

or more directly:

https://abseil.io/resources/swe-book/html/toc.html

[+] nomilk|3 years ago|reply

I haven't read this, but for those who have, what did you find counter intuitive or surprising?

[+] google234123|3 years ago|reply

Is the first chapter explaining why all lines must be less than 80 characters in width and all parameters and conditional statements must be automatically formatted in the most compact way possible - damn readability.

[+] posharma|3 years ago|reply

600 pages? Wow! Will be great to have the best practices summarized somewhere.

[+] Bromeo|3 years ago|reply

There's a TLDR at the end of every chapter.

[+] ncmncm|3 years ago|reply

It is largely a litany of what never to do unless you work at, or are Google.

[+] b20000|3 years ago|reply

404

[+] jenny91|3 years ago|reply

https://raw.githubusercontent.com/abseil/abseil.github.io/cd...

[+] Nzen|3 years ago|reply

In the last three hours [0], they removed it from their git repository, and replaced it with separate html pages [1].

[0] https://github.com/abseil/abseil.github.io/commit/a69a63ec3f...

[1] https://abseil.io/resources/swe-book/html/toc.html

[+] blacksoil|3 years ago|reply

The PDF on the original link is down. Does anybody mind to re-upload it? Thanks :)

[+] mertysn|3 years ago|reply

It's available on the wayback machine [1]

1: https://web.archive.org/web/20220329143949/https://abseil.io...

[+] TheFragenTaken|3 years ago|reply

"For most Google projects, we must assume that they will live indefinitely" on page 6 made me chuckle.

[+] nine_k|3 years ago|reply

Indefinitely, not endlessly. You cannot predict the date when your project is going to be sunsetted: it may be in 6 month, or may be well past your retirement.

[+] izacus|3 years ago|reply

What's that software engineering truth? All the "temporary" code ends up running in production for decades? :D

[+] unknown|3 years ago|reply

[deleted]

[+] bigcat123|3 years ago|reply

Well, for people who are ridiculing the absurdity, note that it's software engineering at Google, not software engineering. One's peculiarities are another absurdity. Reserve judgement, and try to understand the rationale.

[+] classsic009f|3 years ago|reply

[deleted]

178 comments