(no title)
jorgeortiz85|10 years ago
We have a library that allows us to describe expected schemas and expected indexes in application code. When application developers add or remove expected indexes in application code, an automated task turns these into alerts to database operators to run pre-defined tools that handle index operations.
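To give a feel for the pattern, here's a minimal Python sketch; the ExpectedIndex class, the model, and the field names are made up for illustration, not our actual library:

    from dataclasses import dataclass

    # Minimal sketch of declaring expected indexes right next to the model code.
    # ExpectedIndex and the field names below are illustrative, not our real API.
    @dataclass(frozen=True)
    class ExpectedIndex:
        fields: tuple        # indexed fields, in order
        unique: bool = False

    class Charge:
        collection = "charges"
        expected_indexes = [
            ExpectedIndex(fields=("merchant_id", "created")),
            ExpectedIndex(fields=("card_fingerprint",), unique=True),
        ]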
In this situation, an application developer didn't add a new index description or remove an existing one, but rather modified an existing index description. Our automated tooling mishandled this particular change: instead of interpreting it as a single intent, it encoded it as two separate operations (an addition and a removal).
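In other words, the diff was effectively keyed on the full index definition, so a modification fell out as an unrelated drop plus an unrelated add. A simplified illustration of the failure mode (not the real tooling; field names invented):

    # Simplified illustration of the failure mode, not our actual tooling.
    def diff_indexes(declared, existing):
        # Keyed on the full definition, a *modified* index surfaces as one
        # addition and one removal, with nothing linking the two operations.
        return declared - existing, existing - declared

    existing = {("merchant_id", "created")}            # what the database has today
    declared = {("merchant_id", "created", "status")}  # developer edited this index
    to_add, to_drop = diff_indexes(declared, existing)
    # to_add  == {("merchant_id", "created", "status")}
    # to_drop == {("merchant_id", "created")}  <- surfaced as a separate, unlinked alert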
Developers describe indexes directly in the relevant application/model code to ensure we always have the right indexes available -- and in part to help avoid situations like this. In addition, the tooling for adding and removing indexes in production is restricted to a smaller set of people, both for security and to provide an additional layer of review (also to help prevent situations like this). Unfortunately, because of the bug above, the intent was not accurately communicated. The operator saw two operations, not obviously linked to each other, among several other alerts, and, well, the result followed.
There are some pretty obvious areas for tooling and process improvements here. We've been investigating them over the last few days. For non-urgent remediations, our practice is to wait at least a week after an incident before conducting a full postmortem and determining remediations. This gives us time to cool down and think clearly about our remediations for the long term. We'll be having these in-depth discussions, and making decisions about the future of our tooling and processes, over the next week.
asuffield|10 years ago
I'm an SRE at Google, where postmortems are habitual. The thing that jumped out at me here is that a production change was instantaneously pushed globally, instead of being canaried on a fraction of the serving capacity so that problems could be detected. That seems like your big problem here.
(Of course, without knowing how your data storage works, it's difficult to tell how hard it is to fix that.)
jorgeortiz85|10 years ago
This is one of our few remaining unsharded databases (legacy problems...), so we can't easily canary a fraction of serving capacity. However, one clear remediation we can implement easily is to have our tooling change a replica first, fail over to it as primary, and, if problems are detected, quickly fail back to the healthy former primary.
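In rough Python pseudocode, the flow we have in mind looks something like this (the callables are placeholders for whatever the real tooling provides, not any particular driver's API):

    import time

    # Hypothetical sketch of the remediation flow; apply_change, promote, and
    # is_healthy are placeholders, not a specific database driver's API.
    def apply_change_via_failover(apply_change, promote, is_healthy,
                                  replica, old_primary, grace_secs=300):
        apply_change(replica)          # make the index change on a replica first
        promote(replica)               # fail over so the changed node becomes primary
        deadline = time.time() + grace_secs
        while time.time() < deadline:
            if not is_healthy(replica):
                promote(old_primary)   # problems detected: fail back to the old primary
                return False
            time.sleep(5)
        return True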
Lesson learned. We'll be doing a review of all of our database tooling to make sure changes are always canaried or easily reversible.
eldavido|10 years ago
I'd actually applied to work at Stripe about two years ago; you guys turned me down ;)
I was responsible for ops at a billion-device-scale mobile analytics company for about 1.5 years. Your tooling is far superior to anything we produced. I like the idea of a single source of truth describing the data model (code, tables, query patterns, etc.), and doubly so that it's revision-controlled and available right alongside the code.
I think it's far from settled, though, how much to involve human operators in processes like this. Judging from this answer, you seem to be on the extreme end of "automate everything". How, then, do you train developers on (and communicate) what can be done safely vs. what would cause I/O bottlenecks, slowdowns, or other potentially production-impacting effects? Can you even predict these things accurately in advance? (Some of our worst outages were caused by emergent phenomena that only manifested at production scale, such as hitting packet throughput and network bandwidth limits on memcached -- totally unforeseeable in a code-only test environment).
It sounds like you let developers request changes (a la "The Phoenix Project") but ops is responsible for final approval of the change? That actually sounds like a great system. Would love some elaboration on this.
In any case, great writeup and from one guy who's been there when the pager goes off to another, sounds like the recovery went pretty smoothly.
jorgeortiz85|10 years ago
We're always learning and improving. In order to scale, we'll need better ways to manage complexity and isolate failure. Our tools, patterns, and processes have changed quite a bit over the last few years, and they will continue to change. Ultimately, we want every Stripe employee to have the right information evident to them when they make decisions. This will be challenging, especially as we grow, but I'm excited to take on that challenge.
If you're still interested in working at Stripe, I'd encourage you to reapply! Our needs have changed quite a bit since you applied, and we're willing to reconsider candidates after a year has passed. Feel free to shoot me a resume: jorge@stripe.com
tempestn|10 years ago
As a simple example, to make an atomic change to a write-only table, you could create a copy of the table, alter the copy as necessary, and then, in a single rename operation, rename the live table to '_old' and the '_new' table to live. You most likely would not want to add two additional table schemas and all of those steps to your development database operations.
It's entirely possible that they could capture what is done in production as migrations, and test them first, but it would still likely be separate from what the application developers are working with.
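To make the rename swap concrete, here's a rough Python/DB-API sketch of the first paragraph; the table and column names are invented, and the atomicity of the multi-table RENAME here relies on MySQL semantics:

    # Illustrative sketch of the copy-alter-swap pattern described above.
    # `conn` is assumed to be a DB-API connection; names are made up.
    def swap_in_new_table(conn):
        cur = conn.cursor()
        cur.execute("CREATE TABLE events_new LIKE events")
        cur.execute("ALTER TABLE events_new ADD COLUMN region VARCHAR(32)")
        cur.execute("INSERT INTO events_new SELECT *, NULL FROM events")
        # The swap itself is one statement, so readers never see a missing table.
        cur.execute("RENAME TABLE events TO events_old, events_new TO events")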