top | item 20522868

Some items from my “reliability list”

567 points | luu | 6 years ago | rachelbythebay.com | reply

169 comments

[+] B-Con|6 years ago|reply
Required reading. Have you worked as an SRE?

> Item: Rollbacks need to be possible

This is the dirty secret to keeping life as an SRE unexciting. If you can't roll it back, re-engineer it with the dev team until you can. When there's no alternative, you find one anyway.

(When you really and truly cross-my-heart-and-hope-to-die can't re-engineer around it fully, isolate the non-rollbackable pieces, break them into small chunks, and deploy them in isolation. That way if you're going to break something, you break as little as possible and you know exactly where the problem is.)

Try having a postmortem, even an informal one, for every rollback. If you were confident enough to push to prod but it didn't work, figure out why that happened and what you can do to avoid it next time.

> Item: New states (enums) need to be forward compatible

Our internal Protobuf style guides strongly encourage this. In fact, some of the most backward-compatibility-breaking features of protobuf v2 were changed for v3.
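A minimal sketch of that defensive posture in plain Python (no protobuf dependency; the enum and its values are hypothetical): the consumer maps any wire value it doesn't recognize to a sentinel instead of crashing.

```python
from enum import Enum

class OrderState(Enum):
    """States this consumer knows about; the producer may add more later."""
    PENDING = "pending"
    SHIPPED = "shipped"
    UNKNOWN = "unknown"  # sentinel for values added after this code shipped

def parse_state(raw: str) -> OrderState:
    """Map a wire value to a known state, tolerating future additions."""
    try:
        return OrderState(raw)
    except ValueError:
        # A newer producer sent a state we don't know yet. Don't explode;
        # degrade to UNKNOWN and let the caller decide what "reasonable" means.
        return OrderState.UNKNOWN
```

Ship the tolerant parser everywhere first; only then start emitting the new value.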

> Item: more than one person should be able to ship a given binary.

Easy to take this one for granted when it's true, but it 100% needs to be true. Also includes:

* ACLs need to be wide enough that multiple people can perform every canonical step.

* Release logic/scripts needs to be accessible. That includes "that one" script "that guy" runs during the push that "is kind of a hack, I'll fix it later". Check. It. In. Anyway.

* Release process needs to be understood by multiple people. It doesn't matter if they have access to perform the release if they don't know how to do it.

> Item: if one of our systems emits something, and another one of our systems can't consume it, that's an error, even if the consumer is running on someone's cell phone far far away.

Easy first step is to monitor 4xx response codes (or some RPC equivalent). I've rolled back releases because of an uptick in 4xxs. Even better is to get feedback from the clients. Having a client->server logging endpoint is one option.
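A rough sketch of that first step, assuming you can sample recent response codes (the baseline and multiplier here are made-up thresholds):

```python
from collections import Counter

def error_ratio(status_codes, family=4):
    """Fraction of responses in the given status family (4 -> 4xx)."""
    if not status_codes:
        return 0.0
    counts = Counter(code // 100 for code in status_codes)
    return counts[family] / len(status_codes)

def should_page(status_codes, baseline=0.01, multiplier=5):
    """Alert (and consider rolling back) when the 4xx share jumps well
    past its normal baseline rate."""
    return error_ratio(status_codes) > baseline * multiplier
```

The same ratio computed per release version makes the "uptick after a push" case obvious.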

And if a release broke a client, roll back and see the first point. The postmortem should include why it wasn't caught in smoke/integration testing.

[+] silvestrov|6 years ago|reply
> Release logic/scripts needs to be accessible

More than accessible: all scripts must be in git just like source code because that is also source code.

Building releases should be done using a fresh VM. Creating and configuring that VM should only use a script which is also in git.

Everything needs to be automated using scripts. If you type "apt-get" on the command line, you have lost reliability. When multiple people are involved, such manual setups will become a problem at some point: people make mistakes. Manual steps also mean you have lost good testing of the build process.

[+] packetslave|6 years ago|reply
Required reading. Have you worked as an SRE?

She was both a long-time SRE at Google and a long-time PE at Facebook.

[+] pmlnr|6 years ago|reply
> Rollbacks need to be possible

I always feel like people who write these never faced SQL schema changes or dataset updates. I wonder what rollback plans are in place for complete MySQL replication chains, for example.

[+] exlurker|6 years ago|reply
SRE: Site Reliability Engineer
[+] cwilkes|6 years ago|reply
Second step would be to have a way for the client to report an error. I’ve seen a number of systems where this isn’t the case — just looking at the server logs everything looks okay, until a week later the client calls to say they haven’t gotten a file in a week and asks if something is wrong.
[+] matthewowen|6 years ago|reply
"Check. It. In. Anyway."

It's consistently a source of sorrow to me how many bugs exist and how much inefficiency there is because people don't want to be embarrassed by the code they quickly wrote.

[+] tty7|6 years ago|reply
Check. It. In. Anyway.

Amen

[+] pjungwir|6 years ago|reply
> Also, if you are literally having HTTP 400s internally, why aren't you using some kind of actual RPC mechanism? Do you like pain?

I just had a discussion about this yesterday where we have an internal JSON API that auths a credit card, and if the card is declined it returns a status and a message. Another developer wanted it to return a 4xx error, but that made me uneasy. I think you could make a good argument either way, but to me that isn't a failure you'd present at the HTTP layer. 4xx is better than 5xx, but I was still worried how intermediate devices would interfere. (E.g. an AWS ELB will take your node out of service if it gives too many 5xxs, and IIS can do some crazy things if your app returns a 401.) Also I don't want declined cards to show up in system-level monitoring. But what do other folks think? I believe smart people can make a case either way.

EDIT: Btw based on these Stack Overflow answers I'm in the minority: https://stackoverflow.com/questions/9381520/what-is-the-appr...

[+] citrusx|6 years ago|reply
In my opinion, a rejection is an expected outcome, and therefore should have a response code of 200. You're not asking, "Does this card exist?", and sending a 404 if you have no record. You're asking a remote system to do a job for you, and if that job completes successfully, it's a "Success" in the HTTP world.

At that point, you rely on the body content to tell you what the service correctly determined for you. A result that the user doesn't like is way different than a result that comes about because something was done wrong at the client side (4xx) or a failure on the server side (5xx).
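The 200-with-body approach could be sketched like this in Python (the response shape and field names are hypothetical, not from any real payments API):

```python
import json

def charge_response(card_ok: bool, request_valid: bool = True):
    """Build an (http_status, body) pair for a card-auth endpoint.

    A decline is a *successful* answer to the question we asked, so it
    travels as 200 with the outcome in the body; 4xx is reserved for
    requests we couldn't process at all (malformed JSON, bad fields).
    """
    if not request_valid:
        return 400, json.dumps({"error": "malformed request"})
    if card_ok:
        return 200, json.dumps({"status": "approved"})
    return 200, json.dumps({"status": "declined",
                            "message": "card was declined by the issuer"})
```

With this split, infrastructure (load balancers, status-code monitoring) only reacts to genuine protocol failures.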

[+] parliament32|6 years ago|reply
I think you're right. 4XX means "client made an error" at the application layer. Something like a malformed request (say, bad JSON passed in): absolutely, 4XX is the way to go. However, if the client sent a well-formed message whose contents happen to be rejected, that's a 200 (and whatever error goes in the response body). The whole "did this card work" question happens in a layer above the application layer, and HTTP status codes are only meant for the application layer itself.
[+] msluyter|6 years ago|reply
Interesting question. My understanding is that a 400 response indicates that there's something malformed about the request itself such that the server can't/won't process it. Given that in order to decline the card the service has to actually process the request, I'd agree that a 400 is inappropriate.
[+] munchbunny|6 years ago|reply
I would agree with not using the http status codes to return a card processing error.

Specifically, you want to be able to easily distinguish between "your url is wrong" or "your authentication credentials are wrong" or "the API endpoint threw an exception" type errors and "credit card processing failed" errors.

It's easier in the long run to put business-logic errors somewhere separate from the protocol/routing layer (even an HTTP header would be better), so that you can tell what is Rails/Flask/whatever failing vs. your logic failing. This also gives you more flexibility to do stuff at the infrastructure layer (another commenter mentioned ELB) without interfering with the application layer.

[+] erpellan|6 years ago|reply
4xx errors are no better or worse than 5xx. They just mean different things.

Generally, 4xx errors mean 'Client screwed up'. There's probably no point sending this message again; it isn't going to work until something is fixed. That might mean it will work if, for example, the account is funded or the card unlocked. But something needs to be done on behalf of the client.

5xx errors mean 'Server screwed up'. It probably _is_ worth having another attempt at sending the request. Maybe it was a temporary glitch, or maybe a new release of the server has to happen. Regardless, there was probably nothing inherently wrong with the actual request.
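That retry asymmetry can be sketched as follows (the `send` callable and backoff numbers are placeholders):

```python
import time

def send_with_retry(send, request, max_attempts=3, backoff_s=0.0):
    """Retry only on 5xx: the request itself is presumed fine and the
    server may recover. A 4xx means the request must change first, so
    retrying it verbatim would just repeat the failure."""
    for attempt in range(max_attempts):
        status = send(request)
        if status < 500:
            return status          # success or client error: don't retry
        if attempt < max_attempts - 1:
            time.sleep(backoff_s)  # transient server trouble: try again
    return status
```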

[+] xorcist|6 years ago|reply
Using HTTP for RPC lends itself to this sort of endless discussion. Authorization is even worse.

You could argue either way. What's important is being consistent, at the very least throughout an API but preferably throughout the whole organization.

(Personally I'd probably lean towards a 40x of some kind, just make sure it doesn't clash with something that you care about.)

Along the same lines, and arguably more important, is how to log the operations where a transaction completed successfully but with a negative answer. If you log expected negatives as errors you can get error blindness.

[+] russelldavis|6 years ago|reply
Stripe returns HTTP 402 ("Payment Required") when a card is declined.
[+] nieve|6 years ago|reply
The one I've seen missed most often in startups is directly implied by a lot of the other points and obvious to anyone with long experience: take the time and put real thought into how to break up your big transitions into smaller stages, each of which is functional on its own. It's usually possible to at least narrow the risky parts down to a few finer-grained steps, and when something fails, rolling back only one part to get to a good state is almost always faster and safer.

It's very easy to get absorbed into the awareness of the high level change you're making and miss the details of the process. Even just sitting down together and outlining what you think is actually going to go on (and then breaking those down into what they each are comprised of) can make it really clear that you don't have to run as many giant risks. I'm occasionally amazed how brilliant people (including some with big names in devops) can forget it's an option.

It's like taking small steps from stable to stable when you're going across a steep scree slope and only jumping when you have to - sometimes it feels riskier to take lots of small steps, but if you start to slide it can be a lot easier to recover from. Your chance of dying taking a big leap isn't the sum of the equivalent small steps. Perhaps complex computer systems have the equivalent of an Angle of Repose?

[+] mahkoh|6 years ago|reply
On JSON:

"if you only need 53 bits of your 64 bit numbers"

JSON numbers are arbitrary precision.

"blowing CPU on ridiculously inefficient marshaling and unmarshaling steps"

On the other hand I am not blowing dev and qa time on learning/developing tools to replace curl/jq/browser/text editor.

[+] patrec|6 years ago|reply
> JSON numbers are arbitrary precision.

> On the other hand I am not blowing dev and qa time on learning/developing tools to replace curl/jq/browser/text editor.

What a pleasant surprise it will be for you when you find out that jq silently corrupts integers with more than 53 bits.
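Python makes the 53-bit cliff easy to demonstrate, since its own ints are arbitrary precision but IEEE-754 doubles (what JavaScript, and at least historically jq, parse JSON numbers into) are not:

```python
import json

big = 2**53 + 1  # 9007199254740993, not representable as a double

# Python's json round-trips it exactly, because Python ints are arbitrary
# precision — consistent with the "JSON numbers are arbitrary precision" claim:
assert json.loads(json.dumps(big)) == big

# But any consumer that stores JSON numbers as doubles rounds it silently:
assert float(big) == float(2**53)  # the +1 is gone
```

So both commenters are right: the JSON *text* carries the digits; a double-based parser drops them.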

[+] kristiandupont|6 years ago|reply
I don't understand why you are being downvoted. This struck me not as something that starts to "emerge after you've dealt with enough failures in a given domain" as the author claims, but more like a pet peeve.

The fact that JS has 53 bit precision will be a JS problem whether you use protobuf or anything else. On the other hand, if you are not using JS, it will almost certainly parse numbers of the precision that your language offers.

[+] jcoffland|6 years ago|reply
Javascript cannot handle 64-bit unsigned integers and the largest exact integral value is 2^53-1, but yeah JSON itself has no such restrictions.
[+] TeMPOraL|6 years ago|reply
> On the other hand I am not blowing dev and qa time on learning/developing tools to replace curl/jq/browser/text editor.

Because learning how to use the correct tool for the job is so difficult and so outside of dev/qa job descriptions that it's much better to waste compute on your servers and deal with performance and scalability problems.

[+] LaGrange|6 years ago|reply
> On the other hand I am not blowing dev and qa time on learning/developing tools to replace curl/jq/browser/text editor.

Is your turnover so high that learning a single tool is a significant portion of employee costs?

[+] raxxorrax|6 years ago|reply
Why not just use SOAP?

I think recent experience should tell us that the easiest thing plainly wins out. JSON can be terrible, Javascript can be terrible, but limitations on large integers aren't the worst hurdle to clear.

[+] wyc|6 years ago|reply
I love item #2, as it talks about writing code that can safely handle "future" enums and values as a result of rolled back code. Maybe we should call it the Sarah Connor Pattern.

I haven't heard enough people discuss the deployment management of growing enums or state-machine evolution. This is a problem more particular to software than hardware, since once hardware is shipped it's usually set in silicon, but growing the state garden is an expectation in many software architectures.

[+] VBprogrammer|6 years ago|reply
One of the fun challenges in my current job is that we provide releases to customers according to the customer's schedule (which is related to needing hours of downtime because it's a creaky old system).

Some customers will skip releases altogether, making strategies like "add a new column, backfill it online, then have the next release use the new value" impossible.

I guess that point is slightly moot when it'd take 2-3 releases to achieve the end goal and each release cycle is about a month.

[+] jefftk|6 years ago|reply
This is a great list! One thing I would add, once you have everything on this list, is a way to experiment on your changes. Instead of flipping a flag, seeing your error rate jump, and flipping it back, you run an experiment where you flip the flag on 0.1% of requests. Now you can compare this traffic to a control group, and you aren't stuck wondering "did errors go up by 5% during our rollout because we broke things, or by chance?". If things look as expected at 0.1% you can ramp to 1%, then 10%, before releasing.
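One common way to get stable experiment arms is deterministic hash bucketing; a sketch (the salt/permille scheme is an illustrative assumption, not from the article):

```python
import hashlib

def in_experiment(request_id: str, salt: str, permille: int) -> bool:
    """Deterministically assign a request to the experiment arm.

    Hashing (salt + id) gives a stable, roughly uniform bucket in
    [0, 1000), so the same id always lands in the same arm, and ramping
    from 1 -> 10 -> 100 permille only *adds* traffic, never reshuffles it.
    """
    digest = hashlib.sha256(f"{salt}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 1000
    return bucket < permille
```

A fresh salt per experiment keeps arms independent across flags.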
[+] ricardobeat|6 years ago|reply
> using something with a solid schema that handles field names and types and gives you some way to find out when it's not even there [...] ex: protobuf

In proto3 all fields are optional, and have default values, so it becomes impossible to detect the absence of data unless you explicitly encode an empty/null state in your values.

[+] aplusbi|6 years ago|reply
This is true for primitives but not [completely] true for messages. Messages still have the "hasMessage" semantics. So if you truly need to differentiate between unset and default for primitives, you can box them in messages.
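The boxing trick can be mimicked in plain Python to show the distinction (a dataclass stands in for a proto message; names are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

# proto3 primitives can't distinguish "unset" from the zero value, but a
# boxed (message) field can: in Python terms, Optional[int] vs plain int.
@dataclass
class Item:
    count: int = 0                     # primitive: 0 and "never set" look identical
    boxed_count: Optional[int] = None  # message-like box: None means "absent"

item = Item()
assert item.count == 0           # default or unset? can't tell
assert item.boxed_count is None  # definitely unset
item.boxed_count = 0
assert item.boxed_count == 0     # explicitly set to the default value
```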
[+] akavi|6 years ago|reply
Go structs behave the same way. I wonder if one influenced the other.
[+] tantalor|6 years ago|reply
> impossible to detect the absence of data

There are other ways:

- Check if the value different than the default, e.g., empty string

- If your data is repeated, then check number of data elements != 0

[+] asark|6 years ago|reply
Ew, what was the reason for that?
[+] punnerud|6 years ago|reply
I managed releases for one of Norway’s largest hospitals, and when everything on the ‘reliability list’ is checked and you have frequent releases, the real headache is ‘cross-system rollback’ between several systems/companies. Add that this is done with the whole hospital in emergency procedure.
[+] C4stor|6 years ago|reply
Seems like a typical article from (I assume) a GAFA employee: good advice mixed with "how to be Google even if you don't need to" advice.
[+] crummy|6 years ago|reply
I don't know anything about databases. How do you roll back a significant schema change?
[+] twic|6 years ago|reply
I believe the strategy described avoids needing to do that. You start by releasing a version of the software which can run with the old or new schema, then you apply the new schema, then you release a version of the software which actually uses the new schema. If you discover a problem at that point, you roll back to the previous version of the software, but leave the schema as it is. You then have time to figure out what to do, which may involve changing the schema again, but that will be as a forward change, rather than as a rollback.

Some migration tools do support rollback scripts for schema changes, but unless you're actually testing these before release (deploy the new version in staging, accumulate representative data in the new schema, roll back the schema, deploy an old version of the app, test that it is doing the right thing), then they aren't really something you can rely on in production.

[+] evanelias|6 years ago|reply
A few strategies used by large companies, e.g. Facebook where the author worked for some time:

* Use external online schema change tooling which operates on a shadow table, so the tooling can be interrupted without affecting the original table. (Generally all of the open source MySQL online schema change tools work this way.)

* Use declarative schema management (e.g. tool operates on repo of CREATE TABLE statements), so that humans never need to bother writing "down" migrations. Want to roll something back? Use `git revert` to create a new commit restoring the CREATE TABLE to its previous state, and then push that out in your schema management system. (Full disclosure, I spend my time developing an open source system in this area, https://skeema.io)

* Ensure that your ORMs / data access layers don't interact with brand new columns until a code push occurs or a feature flag is flipped.

[+] tybit|6 years ago|reply
It really depends on the task but the general pattern is to split it up into separate steps, each of which is either low risk and/or easily reversible.

Let’s say you want to add a new non-nullable foreign-key column to replace an old non-nullable foreign-key column pointing at a different table that’s obsolete and needs to be deleted.

1) update the code to be ok with a new nullable column. Rollback: deploy previous version of code.

2) create the new column in the DB with its desired constraint, but make it nullable. Rollback: delete the column.

3) have the code start populating the new column as well as the old. Rollback: deploy previous version of code.

4) start backfilling historical entries with the new column. Rollback: you can’t roll this back!

5) make new column non nullable. Rollback: make it nullable.

6) update code to read from new column, continuing to write to both. Rollback: deploy previous version of code.

7) make old column nullable. Rollback: you can’t roll this back!

8) stop writing to old column. Rollback: deploy previous version of code.

9) once you’re satisfied the old column is no longer used and the version of code from the previous step will never be deployed again, drop the old column.

Rollback: you can’t roll this back!

10) rename the obsoleted table and see if anything breaks. Rollback: rename it back to its original name.

11) delete the obsoleted renamed table.

Rollback: you can’t roll this back!
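The steps above could be encoded as data, making the irreversible ones explicit (a compressed sketch of the same pattern; all table/column names are hypothetical):

```python
# Each step is (description, forward_sql, rollback_sql); None marks an
# irreversible step that deserves extra scrutiny before it runs.
MIGRATION = [
    ("add nullable column",
     "ALTER TABLE orders ADD COLUMN new_ref BIGINT NULL",
     "ALTER TABLE orders DROP COLUMN new_ref"),
    ("backfill historical rows",
     "UPDATE orders SET new_ref = old_ref WHERE new_ref IS NULL",
     None),  # can't un-backfill
    ("tighten constraint",
     "ALTER TABLE orders MODIFY new_ref BIGINT NOT NULL",
     "ALTER TABLE orders MODIFY new_ref BIGINT NULL"),
    ("drop old column",
     "ALTER TABLE orders DROP COLUMN old_ref",
     None),  # irreversible: only run once nothing reads old_ref
]

def reversible_prefix(steps):
    """How many leading steps can still be rolled back as a unit."""
    n = 0
    for _desc, _forward, rollback in steps:
        if rollback is None:
            break
        n += 1
    return n
```

Reviewing `reversible_prefix` per release tells you exactly how far a deploy can safely retreat.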

[+] minxomat|6 years ago|reply
In your migration file(s) you have both forward and backward steps. If operations are reducers or otherwise destructive, the forward step backs that data up to separate "recovery" tables. The n+1 migration may delete recovery tables. That's how I do it most of the time. Sometimes with full snapshots. Ymmv
[+] beachy|6 years ago|reply
For a complex system, with many tables, rolling back a large change is very difficult.
[+] Thorrez|6 years ago|reply
>In this case, you need to make sure you can recognize the new value and not explode just from seeing it, then get that shipped everywhere. Then you have to be able to do something reasonable when it occurs, and get that shipped everywhere. Finally, you can flip whatever flag lets you start actually emitting that new value and see what happens.

Can't those first 2 steps be combined together? Why do they need to be shipped separately?

[+] barbarbar|6 years ago|reply
She has written many very interesting posts. Also one about "The One".
[+] z3t4|6 years ago|reply
Hot patching is scary but it makes you optimize for the right things like easy to understand, easy to update, detectable early errors, easy to recover/rollback. And it makes you think and understand before writing the actual code.
[+] torbjorn|6 years ago|reply
I love this and it makes me feel both excited and scared since I am suddenly realizing the ways my org is not in compliance with this good advice.

Are there any good books that are full of more rules of thumb like these?

[+] maximente|6 years ago|reply
what is the load balancing story for these RPC services thusly recommended? it was completely glossed over as if it was not even relevant; i know gRPC uses HTTP/2 and persists connections so it's not as simple as throwing a proxy in front.

that seems like a non-trivial point of friction when it comes to "just using solid storage/RPC formats" or whatever.

[+] akuji1993|6 years ago|reply
I think something that's missing from this list is a really good and consistent QA cycle. You need to have rollbacks, I agree, but even better is when bugs can't make it into your production build at all. Having automated testing, (actually correctly done) code reviews, and quality gates in place can save you a lot of time rolling back your code. Catch the bug before it goes live.
[+] msh|6 years ago|reply
But you will never catch them all.