item 20423264

dps | 6 years ago

(Stripe CTO here)

Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.
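A staged blue/green rollout of the kind described might look roughly like this. This is a minimal sketch, not Stripe's actual tooling; the stage fractions, health-check threshold, and traffic-shifting hook are all assumptions:

```python
# Minimal sketch of a staged blue/green rollout.
# STAGES, healthy(), and the callbacks are hypothetical, not Stripe's system.

STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of traffic on the new ("green") fleet


def healthy(error_rate: float, threshold: float = 0.001) -> bool:
    """Stand-in health check: error rate must stay below the threshold."""
    return error_rate < threshold


def rollout(observe_error_rate, shift_traffic) -> bool:
    """Shift traffic to green in stages, rolling back on any unhealthy signal.

    observe_error_rate() returns the current error rate;
    shift_traffic(f) routes fraction f of traffic to the green fleet.
    """
    for fraction in STAGES:
        shift_traffic(fraction)
        if not healthy(observe_error_rate()):
            shift_traffic(0.0)  # roll back: all traffic returns to blue
            return False
    return True  # green fleet is fully serving
```

The point of the stages is that a bad change is caught while it serves only a small slice of traffic, and the rollback path is a single traffic shift rather than a redeploy.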

In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however, we lacked an efficient way to fully validate this change on the order of minutes. We're investing in tooling to make our rapid-response mechanisms more robust and to help responding engineers understand the potential impact of configuration changes or other remediations they're pushing through an accelerated process.

I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.


ssalazars | 6 years ago

Thank you for taking the time to respond to my questions. I believe the high potential of causing a follow-up incident was left out of the post (or maybe I missed it?).

I hope that lessons are learned from this operational event, and that you invest in building metrics and tooling that allow you, first of all, to prevent issues, and second, to shorten outage/mitigation times in the future.

I'm happy you guys are being open about the issue, and taking feedback from people outside your company. I definitely applaud this.

tus88 | 6 years ago

> ship hundreds of deploys a week safely

That seems like a lot of change in a week. Or does "deploys" mean something else, like customer websites being deployed?

tschwimmer | 6 years ago

They very likely have continuous deployment, so each change could be released as a separate deploy. If a change touches the data model, they have to run a migration as well. So hundreds seems reasonable to me.
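The deploy-with-migration step above could be sketched like this. Everything here is hypothetical (the change format, the callbacks); it just shows the ordering: apply any pending schema migrations, then release the code that depends on them:

```python
# Hypothetical continuous-deployment step: run pending schema migrations
# before releasing the application version that needs them.

def deploy(change: dict, applied_migrations: set, run_migration, release):
    """Apply any migrations this change requires, then release its version."""
    for migration in change.get("migrations", []):
        if migration not in applied_migrations:
            run_migration(migration)          # e.g. ALTER TABLE ...
            applied_migrations.add(migration)  # record it as applied
    release(change["version"])                 # code ships only after schema is ready
```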

nialldalton | 6 years ago

From the outside it sounds like, whatever the database is, it has far too many critical services tightly bound within it. E.g. leader election implemented internally instead of as a service with its own lifecycle management, so that pushing the database query processor's minor version forward forces you to move the leader election code or replica-config handling forward too... ick.

From the description/comment it also sounds like the database operates directly on files rather than file leases, since there's no notion of a separate (cluster-scoped) byte-level replication layer below it. That makes it harder to shoot a stateful node. And it sounds like it's tricky to externally cross-check various rates, i.e. to monitor replication RPCs and notice that certain nodes are deviating from the expected numbers, without depending on the health reporting of the nodes themselves.
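The external cross-check being described might amount to something as simple as comparing each node's observed replication RPC rate against the fleet median, with no input from the nodes' own health checks. A minimal sketch (the tolerance and the shape of the rate data are assumptions):

```python
# Sketch: flag nodes whose replication RPC rate deviates from the fleet
# median by more than a fractional tolerance, without trusting any node's
# self-reported health.
from statistics import median


def deviating_nodes(rates: dict, tolerance: float = 0.5) -> list:
    """Return names of nodes whose rate strays more than `tolerance`
    (as a fraction of the fleet median) from that median."""
    m = median(rates.values())
    if m == 0:
        return [name for name, rate in rates.items() if rate != 0]
    return [name for name, rate in rates.items()
            if abs(rate - m) / m > tolerance]
```

The appeal of this kind of check is exactly what the comment points at: it is computed from externally observed traffic, so a node that has silently stopped replicating still gets flagged even if its own health endpoint says everything is fine.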

Hopefully the database doesn't also mix geo-replication for data-locality / sovereignty requirements into the same mechanisms, rather than separating it out into aggregation layers above purely cluster-scoped zones!

Of course, this is all far far easier said than done given the available open source building blocks. Fun problems while scaling like crazy :)