Eliminating Toil

302 points| omarfarooq | 3 years ago |sre.google

82 comments

[+] capableweb|3 years ago|reply
> If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.

It's funny that the engineering at Google is so great at recognizing things, while the product/"human" teams (like whoever came up with the account reviews and other parts) seem to suck so much.

If YouTube applied the same view of what should/shouldn't be automated, they could solve the problem of people's YouTube channels being locked in front of them, even if they don't break any ToS's.

[+] nonrandomstring|3 years ago|reply
The part I really liked is "Is Toil Always Bad?" because quite astute psychology and management ideas appear here. My takeaway is that people who like toil (and there are many zealous oversystematisers around) are some of the most dangerous to long-term productivity.

However, I really feel for the author, because the distance between this philosophically ambitious position and the reality of using Big Tech products - which in my life are the primary source of pointless makework activity - is sad and frustrating.

Now, one cannot blame the tools for stupid policies that lead to their misuse in constructing makework processes, but (to take the Heideggerian stance) they carry certain exacerbating values along with them. Technologies can be 'seductive'.

The point of departure for me was the definition of "overheads" as justifiable makework according to some value set. Whose values, exactly? And if, in Weber's sense, bureaucracy is an unavoidable side effect of process, then the possibility of designing "toil-free" systems is really about complexity management, not post-facto eliminating toil through automation, because that will only break things and introduce more toil (which is of course the primary theme of Gall's "Systemantics").

[+] rmbyrro|3 years ago|reply
This is by design.

You can't expect a free service to have highly trained human judgement whenever you want.

They need it to run fully autonomously. And it does run flawlessly for 99% of the users, which is impressive.

I wish they offered a paid option in the 1% cases. Like an arbitration.

But that would be a cost center for them, and they don't want it.

[+] vkou|3 years ago|reply
> It's funny that the engineering at Google is so great at recognizing things, while the product/"human" teams (like whoever came up with the account reviews and other parts) seem to suck so much

There's ~4 orders of magnitude difference in the scope of work between the two.

[+] sokoloff|3 years ago|reply
(In Google’s view), perhaps the need for careful review of these actions has been designed away.
[+] planetsprite|3 years ago|reply
Same can be said for most organizations. Google's privileged hyper-enlightened culture of organizational "niceness" rests on the bedrock of millions of man-hours of near perfect, rigorously vetted engineering.

I think the main cause of this is that humans tend to become very wrong and misguided without accurate feedback. An engineer who is wrong is proven wrong immediately; their code fails, doesn't pass tests, breaks, etc. A people-person design-minded product-guru takes years to be proven wrong, and even then their tactical obfuscation of reality can morph who ends up being blamed.

[+] throwaway892238|3 years ago|reply
You should probably not automate all toil. You should only automate toil when the toil cost/effect is more burdensome than the cost of automating it. All automation has a cost, and may or may not create value. Automation should have a positive and timely return on investment. If the ROI is 10 years down the road, you probably shouldn't automate it (yet). If there is a cheaper way to deal with the toil, explore that avenue first.
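As a rough sketch of that break-even reasoning (the function name and every number below are illustrative assumptions, not figures from the article or from any real project), the payback period is the cost of building the automation divided by the toil it removes per month, net of the upkeep the automation itself needs:

```python
def payback_months(build_hours, toil_hours_per_month, upkeep_hours_per_month=0.0):
    """Months until the hours saved repay the hours spent automating.

    Returns None when the automation never pays for itself, i.e. when its
    upkeep costs at least as much as the toil it removes.
    """
    net_saving = toil_hours_per_month - upkeep_hours_per_month
    if net_saving <= 0:
        return None
    return build_hours / net_saving


# Hypothetical case: 120 hours to build, removes 10 hours of toil a month,
# but needs 2 hours a month of maintenance.
print(payback_months(120, 10, 2))  # 15.0

# Hypothetical case: the upkeep eats the whole saving.
print(payback_months(40, 3, 3))  # None
```

This ignores everything that is hard to put in hours (risk of human error, morale, learning), so it is at best a first filter.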

Several times in my career I worked on projects to reduce toil. Sometimes the project would fail because the time it took to work on them went well past the cost saving estimation. Sometimes they would be completed, but the value created was far less than their cost. And sometimes automation wasn't even the solution, and we just needed to change our process or system, or do some other manual thing that reduced the toil cost. Sometimes we chose to automate toil because we were afraid to take on a larger project we knew would make the toil unnecessary, so we paid for the automation and then later for rebuilding everything. Or toil was used as an excuse to justify a project that didn't really have to do with toil.

One of my biggest mistakes as an engineer was making assumptions about my work that ended up creating more waste than value. Talk to an outsider about your plans and why you're doing them, and take their advice seriously. And if your automation is optional, make sure you have buy-in before you start working on it; I've sunk months into things that nobody ended up using.

A great way to automate toil is incrementally. Typically you have a runbook with step-by-step instructions, and over time you automate one step, then another, etc. The investment is minimal and gradual, it can change over time, and you can target the costliest parts of the toil, optimizing value.
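One way to sketch this incremental approach is a script that walks the runbook in order, running the steps that already have code and prompting the operator for the rest. The `Step` class, `run_runbook`, and the example steps are all hypothetical names invented for illustration, not part of any particular tool:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    description: str
    # None means "still manual": the operator does this step by hand.
    action: Optional[Callable[[], None]] = None


def run_runbook(steps, confirm=input, log=print):
    """Walk the runbook in order; return how many steps ran automatically."""
    automated = 0
    for number, step in enumerate(steps, start=1):
        if step.action is None:
            # Manual step: show the instruction and wait for the operator.
            confirm(f"[{number}] MANUAL: {step.description} (press Enter when done) ")
        else:
            log(f"[{number}] AUTO: {step.description}")
            step.action()
            automated += 1
    return automated


def drain_traffic():
    # Hypothetical placeholder for the first step that got automated.
    print("draining traffic...")


RUNBOOK = [
    Step("Drain traffic from the host", drain_traffic),
    Step("Apply the kernel patch"),
    Step("Reboot and verify health checks"),
]

# Interactive use (prompts once per manual step):
# run_runbook(RUNBOOK)
```

Automating another step then means attaching an `action` to one more `Step`, so the investment stays small and can be aimed at whichever step currently costs the most.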

[+] ensignavenger|3 years ago|reply
Sometimes, the process of automating something provides enough positive returns in and of itself. For example, you might learn how to do new things along the way. Or you might be able to give the task to someone new so they can learn.

Or maybe in the process of automating, you discover new things about the process itself and can improve it.

I agree that one should be careful to consider work priorities and return on investment, but there are often hidden returns to something like this that leaders don't understand and take into full account.

[+] snovv_crash|3 years ago|reply
This misses the induced demand effect of dramatically reducing the cost of the task. There are many things that only happen occasionally because they are annoying and slow. If you reduce the friction, suddenly everyone does it 10x per day and the whole company benefits from faster feedback loops.
[+] kubanczyk|3 years ago|reply
I don't know whether it's visible from where you are sitting, but what you wrote is exactly contrary to the TFA. TFA implicitly starts from "what if we forget for a minute that ROI exists, and see where that leads us". (If it seems incredibly wasteful, stay with me to the end.) And where it led them is that the people like you and me are crucial parts of Investment (the I of ROI), and the people group together. The ones that gladly do 90% toil don't like to team up with those that prefer 10% toil.

Because the problem with ROI calculation is that for some areas the ideal amount of toil would be 99% and for others 1%. For both extremes, you'll bleed valuable people, so sometimes your Investment becomes "hire a new team just for that" and the rest is peanuts.

To put the thing back on its feet, first create a team that takes 50% toil and give them the areas that ROI-wise require approximately 50% toil. Call the team "SRE". Create a team that takes 10% toil. Give them different areas. Create a team that takes 90% toil, etc.

[+] bushbaba|3 years ago|reply
An organization generally has fixed resources for automation investment. You should look beyond whether the ROI justifies the investment and instead prioritize the highest-ROI items that are most likely to be successfully automated.
[+] blowski|3 years ago|reply
Exactly. It's not unusual to end up with more toil on the automation than you had in the manual process.
[+] baobabKoodaa|3 years ago|reply
Why spend 5 minutes manually toiling on a task, when you can spend 6 hours failing to automate it?
[+] carlsborg|3 years ago|reply
> Among the many reasons why too much toil is bad

They missed the big one: human error is a common point of failure. Some of the big outages on GCP were due to ops configuration changes. GitLab wiped their prod DB one time. Knight Capital suffered death by config error, etc.

[+] pooper|3 years ago|reply
I wonder if writing (bad?) software can also be toil.

Like if I need to change the spelling or add a new configuration setting and I need to make sure to use the same spelling in three places because they are all "stringly(sic) typed", is that toil?

[+] AnimalMuppet|3 years ago|reply
> human error is a common point of failure.

True. But it is also common to find that software automating the process didn't cover some corner case and you need human intervention. And it's worse if the process assumed that human intervention would never be necessary...

[+] dekhn|3 years ago|reply
Most of my work for SRE was the opposite; I did things manually because the automated systems were guaranteed to mess up some fraction of things. At some point my managers wanted me to automate a hardware management process. I checked, and it would take 6 months to deploy the code to prod. Instead, I identified all the broken machines and filed tickets manually, getting things fixed far more quickly without a high rate of false positives and churn (Google's hardware repair system churns a lot).

Many of the automated systems at Google were developed by geniuses. Others, not so much, and it ended up making a lot of work for other people.

[+] jeffbee|3 years ago|reply
I was also on the side of hands-on operations in SRE and it earned me no friends in that org to be sure. But I like to think that point of view is still basically correct. The "annealing" people have been working on their wacky automaton for more than a decade now and a critical reading of their publication reveals that it still fundamentally doesn't work.

https://www.usenix.org/publications/loginonline/prodspec-and...

[+] dusted|3 years ago|reply
While I'm not against eliminating toil, this article does not seem to consider the negative aspects of automation, such as the deskilling that happens naturally.

"This plant basically runs itself, but we do have a human present for if something goes wrong"... 50 years down the line, something goes wrong and nobody has the kind of insight and familiarity with the system that they'd have had if it had been manually operated.

[+] jamesmishra|3 years ago|reply
On the other hand, it is very good that the plant operated flawlessly for 50 years.
[+] ladyattis|3 years ago|reply
I think this is a good scenario to consider especially now that we're running into this issue just from trying to build passenger rail networks in California where the expertise for it isn't domestic at all.
[+] otter-rock|3 years ago|reply
That's like claiming that the government services that still run on cobol should have been manual office jobs instead.
[+] dikei|3 years ago|reply
In some organizations, including mine, toil is sometimes "reduced" by saying "not my problem" and pushing it to other teams. It sucks to be on the receiving end of it.
[+] trhway|3 years ago|reply
It naturally applies not only to SRE. Toil is a great equalizer - if 80% of, say, development work is basically toil, then a 10x developer is not really that distinguishable from, nor more useful than, a 1x one (Amdahl's law, so to speak :) The amount of toil (and not, say, failed projects etc.) seems to be one of the main factors separating the companies with revenues of $2M/year/head like Google from the ones with a mere $300K/year/head, and one of the best things a mid/low performing company can do is to reduce toil - though usually in practice any such attempt means something like MBA-style "efficiency improvement" measures and processes which add even more toil.
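The Amdahl's law analogy can be made concrete with a back-of-the-envelope sketch. It assumes (this is an assumption of the sketch, not a claim from the article) that toil takes the same time no matter how skilled the developer is, and that only the remaining share of the work gets faster; the function name is made up for illustration:

```python
def overall_speedup(toil_fraction, skill_speedup):
    """Amdahl's law: only the non-toil share of the work gets faster."""
    creative_fraction = 1.0 - toil_fraction
    return 1.0 / (toil_fraction + creative_fraction / skill_speedup)


# With 80% toil, a "10x" developer is only about 1.2x faster overall.
print(round(overall_speedup(0.8, 10), 2))  # 1.22

# With 20% toil, the same developer is about 3.6x faster overall.
print(round(overall_speedup(0.2, 10), 2))  # 3.57
```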
[+] bibliographer|3 years ago|reply
I am somewhat skeptical of the claim that "one of the best things a mid/low performing company can do is to reduce toil". It feels that this principle is dependent on the context to the extent of not being useful guidance anymore.

For instance, if the company is pre-product-market fit reducing toil seems like the wrong investment; doing stuff manually can be the way to go until you find what works (unless the effort investment in toil reduction is trivial).

If the company has reached something approximating product-market fit, reducing toil still ought to be weighed against the other priorities. That (as all technical debt reduction) can do wonders to productivity, but alternatives (e.g. pushing for a new feature) may as well be the better call.

[+] quickthrower2|3 years ago|reply
I think it is market dominance/monopoly that gives the big revenue per head rather than lack of toil.
[+] TimPC|3 years ago|reply
The difference between $300k/head revenue companies and $2M/head revenue companies is far larger than amount of toil and efficiency measures/overhead. Most notably, there tends to be a fundamental difference in who they hire, how they hire them and how they compensate them. It might even make sense for $100k/year engineers to do more toil than $400k/year engineers.
[+] VikingCoder|3 years ago|reply
I'm so glad that toil has been automated.

Now all I need to do is learn this new Domain Specific Language and find all of the exact configuration parameters to express my specific needs. Oh except this tool has leaky abstractions under it, and those tools also have their own DSLs and configuration parameters. And the tools under those do, too. It's all turtles, all the way down.

[+] wjdp|3 years ago|reply
Was confused at first as toil is also 'time off in lieu'. AKA unpaid overtime, where you're not paid but get compensated with holiday.
[+] black_puppydog|3 years ago|reply
Good read, good reasoning.

Just a bit sad that someone at Google seems to have read this and focused on the "Automatable" part going "but that includes basically everything we do!"

cf youtube/contentId, cf account blocking, cf customer "support", ...

[+] hoffs|3 years ago|reply
Content review is not an SRE task
[+] reedlaw|3 years ago|reply
This reads like a positive framing of Jacques Ellul's critique of technique:

> The characteristics of the technical phenomenon are Autonomy, Unity, Universality, Totalization. Technique obeys a specific rationality. The characteristics of technical progress are self-augmentation, automization, absence of limits, causal progression, a tendency toward acceleration, disparity, and ambivalence. [1]

Supposing the harm Google does (e.g. ambivalence towards individuals harmed by algorithms) is a direct result of this totalizing impulse, maybe it's time to question some of the fundamental assumptions present within.

1. https://ellul.org/themes/ellul-and-technique/

[+] j7ake|3 years ago|reply
This article was nice. I wonder if it can be generalized to careers in general?

Long, satisfying careers often involve a proactive, design-oriented approach rather than a purely reactive one.

The only way to make grunt work an entire career would be if you’re constantly doing something for the first or second time, e.g. artists, novelists.

Even scientists: they can initially discover something significant, but if they keep repeating the work on the same topic without more depth or breadth, the work will become toil.

[+] factsaresacred|3 years ago|reply
Knew I recognized some of this writing before. This book is quoted in an annual letter[0] from Zack Kanter which is also worth a read:

> Eliminating toil allows people to focus on the inherent complexity of the difficult, interesting problems at hand, rather than the incidental complexity caused by choices made along the way.

> Toil can be eliminated...by drawing the system boundary a bit differently. When we use an external service instead of an external library, we’re moving the code outside of our system – thereby outsourcing the entropy-fighting toil to some third party. Not our entropy, not our problem

[0] https://www.stedi.com/blog/excerpts-from-the-annual-letter

[+] natly|3 years ago|reply
Engineers automating themselves. This is why we should be kinda scared of software innovation stagnating. If we don't work on innovation we don't really have a purpose and a job.
[+] krageon|3 years ago|reply
Things are always breaking, everywhere. The people analysing and fixing that are engineers too, but what they do is not innovation. It's maintenance. Nothing wrong with that.
[+] jimbokun|3 years ago|reply
Having a job where I am required to innovate every day, almost by the job definition, is why I like being a software engineer.
[+] lamontcg|3 years ago|reply
Coming from a software engineering perspective there is a certain amount of toil which is impossible to automate away. CI break-fix issues often depend on the surface area of your software as it interfaces with third parties, including the CI system itself. In some cases that surface area can be large and break-fix takes up a considerable amount of time, but that toil is not _repetitive_ and is _necessary_ table stakes based on the system.

And this is after having someone who is extremely aggressive with automation and empowered to do whatever they like to reduce that surface area working on the system. I've taken codebases and hacked out 60% of the lines of code in order to remove brittle external surface area along with unnecessary requirements and contain the project better within its own boundaries and stop repetitive issues. I've taken clever ideas that someone had 5+ years ago out behind the barn and shot them in order to reduce total surface area.

But people can walk into an area with a lot of toil going on and go "oh, I know all the strategies on how to reduce this, I will explain to these people who clearly aren't as clever as me how to do it" without realizing that there's often a minimum level of toil for a project which you can't effectively reduce. There's a nonzero vacuum expectation value of toil in any project, and in some cases it can be quite large. Inherently.

I don't know how many managers I went through who would come and decide to document all the different failures we were having and spreadsheet them and look for the patterns to address them. And every week there would be 2-3 that would come up and they'd struggle with the fact that there was really no pattern, other than that the project inherently touched many different third parties, because it really HAD to, and that those third parties would change, which would then force interrupt driven toil.

There's some point where you just have to hire more people and spread it out. There's no magical incantation to manage your way out of additional headcount.

And I don't think the OP article even touched on re-engineering to reduce surface area and brittleness. Automation isn't the only answer to toil. You can automate restarting a service if it crashes, but it's always better to just fix the bug (which may involve fixing architectural issues) and make it stop crashing in the first place.

[+] kubanczyk|3 years ago|reply
Is "CI break-fix" the term you wanted to use? It has only a handful of hits on the Internet, and your post is the first one. It doesn't seem to be related to the contractual https://en.wikipedia.org/wiki/Break/fix

But I don't have any term ready for the very familiar category of problems you've described, well maybe except CI business-as-usual :)

[+] tmp_anon_22|3 years ago|reply
> Eliminating Toil

With 150,000 employees.

[+] hinkley|3 years ago|reply
And about 4 million servers, give or take a million.
[+] zx8080|3 years ago|reply
Consider the amount of time needed for a process requiring a number of code reviews, plus approvals for code style and architecture.

Eliminating toil costs a lot of time from every engineer.