(no title)
jorgeortiz85|10 years ago
We have a library that allows us to describe expected schemas and expected indexes in application code. When application developers add or remove expected indexes in application code, an automated task turns these into alerts to database operators to run pre-defined tools that handle index operations.
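To give a feel for the pattern, here's a minimal Python sketch; the ExpectedIndex class, the model, and the field names are made up for illustration, not our actual library:

    from dataclasses import dataclass

    # Minimal sketch of declaring expected indexes right next to the model code.
    # ExpectedIndex and the field names below are illustrative, not our real API.
    @dataclass(frozen=True)
    class ExpectedIndex:
        fields: tuple        # indexed fields, in order
        unique: bool = False

    class Charge:
        collection = "charges"
        expected_indexes = [
            ExpectedIndex(fields=("merchant_id", "created")),
            ExpectedIndex(fields=("card_fingerprint",), unique=True),
        ]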
In this situation, an application developer didn't add a new index description or remove an existing one, but rather modified an existing index description. Our automated tooling mishandled this particular change: instead of interpreting it as a single intent, it encoded it as two separate operations (an addition and a removal).
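In other words, the diff was effectively keyed on the full index definition, so a modification fell out as an unrelated drop plus an unrelated add. A simplified illustration of the failure mode (not the real tooling; field names invented):

    # Simplified illustration of the failure mode, not our actual tooling.
    def diff_indexes(declared, existing):
        # Keyed on the full definition, a *modified* index surfaces as one
        # addition and one removal, with nothing linking the two operations.
        return declared - existing, existing - declared

    existing = {("merchant_id", "created")}            # what the database has today
    declared = {("merchant_id", "created", "status")}  # developer edited this index
    to_add, to_drop = diff_indexes(declared, existing)
    # to_add  == {("merchant_id", "created", "status")}
    # to_drop == {("merchant_id", "created")}  <- surfaced as a separate, unlinked alert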
Developers describe indexes directly in the relevant application/model code to ensure we always have the right indexes available -- and in part to help avoid situations like this. In addition, the tooling for adding and removing indexes in production is restricted to a smaller set of people, both for security and to provide an additional layer of review (also to help prevent situations like this). Unfortunately, because of the bug above, the intent was not accurately communicated. The operator saw two operations, not obviously linked to each other, among several other alerts, and, well, the result followed.
There are some pretty obvious areas for tooling and process improvements here. We've been investigating them over the last few days. For non-urgent remediations, our practice is to wait at least a week after an incident before conducting a full postmortem and determining remediations. This gives us time to cool down and think clearly about our remediations for the long term. We'll be having these in-depth discussions, and making decisions about the future of our tooling and processes, over the next week.
asuffield|10 years ago
I'm an SRE at Google, where postmortems are habitual. The thing that jumped out at me here is that a production change was instantaneously pushed globally, instead of being canaried on a fraction of the serving capacity so that problems could be detected. That seems like your big problem here.
(Of course, without knowing how your data storage works, it's difficult to tell how hard it is to fix that.)
jorgeortiz85|10 years ago
This is one of our few remaining unsharded databases (legacy problems...), so we can't easily canary a fraction of serving capacity. However, one clear remediation we can implement easily is to have our tooling change a replica first, fail over to it as primary, and, if problems are detected, quickly fail back to the healthy former primary.
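In rough Python pseudocode, the flow we have in mind looks something like this (the callables are placeholders for whatever the real tooling provides, not any particular driver's API):

    import time

    # Hypothetical sketch of the remediation flow; apply_change, promote, and
    # is_healthy are placeholders, not a specific database driver's API.
    def apply_change_via_failover(apply_change, promote, is_healthy,
                                  replica, old_primary, grace_secs=300):
        apply_change(replica)          # make the index change on a replica first
        promote(replica)               # fail over so the changed node becomes primary
        deadline = time.time() + grace_secs
        while time.time() < deadline:
            if not is_healthy(replica):
                promote(old_primary)   # problems detected: fail back to the old primary
                return False
            time.sleep(5)
        return True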
Lesson learned. We'll be doing a review of all of our database tooling to make sure changes are always canaried or easily reversible.
eldavido|10 years ago
I'd actually applied to work at Stripe about two years ago; you guys turned me down ;)
I was responsible for ops at a billion-device-scale mobile analytics company for about 1.5 years. Your tooling is far superior to anything we produced. I like the idea of a single source of truth describing the data model (code, tables, query patterns, etc.), and doubly so that it's revision-controlled and available right alongside the code.
I think it's far from settled, though, how much to involve human operators in processes like this. Judging from this answer, you seem to be on the extreme end of "automate everything". How, then, do you train developers on (and communicate) what can be done safely vs. what would cause I/O bottlenecks, slowdowns, or other potentially production-impacting effects? Can you even predict these things accurately in advance? (Some of our worst outages were caused by emergent phenomena that only manifested at production scale, such as hitting packet throughput and network bandwidth limits on memcached -- totally unforeseeable in a code-only test environment).
It sounds like you let developers request changes (a la "The Phoenix Project") but ops is responsible for final approval of the change? That actually sounds like a great system. Would love some elaboration on this.
In any case, great writeup and from one guy who's been there when the pager goes off to another, sounds like the recovery went pretty smoothly.
jorgeortiz85|10 years ago
We're always learning and improving. In order to scale, we'll need better ways to manage complexity and isolate failure. Our tools, patterns, and processes have changed quite a bit over the last few years, and they will continue to change. Ultimately, we want every Stripe employee to have the right information evident to them when they make decisions. This will be challenging, especially as we grow, but I'm excited to take on that challenge.
If you're still interested in working at Stripe, I'd encourage you to reapply! Our needs have changed quite a bit since you applied, and we're willing to reconsider candidates after a year has passed. Feel free to shoot me a resume: jorge@stripe.com
tempestn|10 years ago
As a simple example, to make an atomic change to a write-only table, you could create a copy of the table, alter the copy as necessary, and then, in a single rename operation, rename the live table to '_old' and the '_new' table to live. You most likely would not want to add two additional table schemas and all of those steps to your development database operations.
It's entirely possible that they could capture what is done in production as migrations, and test them first, but it would still likely be separate from what the application developers are working with.
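To make the rename swap concrete, here's a rough Python/DB-API sketch of the first paragraph; the table and column names are invented, and the atomicity of the multi-table RENAME here relies on MySQL semantics:

    # Illustrative sketch of the copy-alter-swap pattern described above.
    # `conn` is assumed to be a DB-API connection; names are made up.
    def swap_in_new_table(conn):
        cur = conn.cursor()
        cur.execute("CREATE TABLE events_new LIKE events")
        cur.execute("ALTER TABLE events_new ADD COLUMN region VARCHAR(32)")
        cur.execute("INSERT INTO events_new SELECT *, NULL FROM events")
        # The swap itself is one statement, so readers never see a missing table.
        cur.execute("RENAME TABLE events TO events_old, events_new TO events")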