That’s a great point. Despite technical changes such as Apple Pay/Android Pay, chip cards, and so on, I can't recall an instance when I was unable to use a credit card globally. It seems most failures to run a credit card are pretty localized, too, and never at the interchange level...
[2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
[2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.
There's a 20-minute gap between investigation and "rollback". Why did they roll back if the service was back to normal? How could they decide on, and document, the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly not all variables were considered.
To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then, you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.
Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.
In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however we lacked an efficient way to fully validate this change on the order of minutes. We're investing in building tooling to increase robustness in rapid response mechanisms and to help responding engineers understand the potential impact of configuration changes or other remediation efforts they're pushing through an accelerated process.
I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.
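A staged fleetwide rollout like the one described can be sketched roughly as follows (the stage fractions and function names are hypothetical, not Stripe's actual tooling): deploy to a growing fraction of the fleet, verify health between stages, and halt with the old "blue" side still serving if anything regresses.

```python
# Hypothetical sketch of a staged blue/green-style rollout: ship the new
# version to a growing fraction of nodes, checking health between stages,
# and stop (leaving the untouched side serving) if any stage regresses.
STAGES = [0.01, 0.10, 0.50, 1.00]

def rollout(nodes, deploy, healthy):
    """Deploy in stages; return (completed, number_of_nodes_deployed)."""
    done = 0
    for fraction in STAGES:
        target = int(len(nodes) * fraction)
        for node in nodes[done:target]:
            deploy(node)
        done = target
        if not all(healthy(n) for n in nodes[:done]):
            return False, done  # halt the rollout; remaining nodes untouched
    return True, done

nodes = [f"node-{i}" for i in range(100)]
ok, deployed = rollout(nodes, deploy=lambda n: None, healthy=lambda n: True)
print(ok, deployed)  # True 100
```

The point of the staging is that a bad version is caught while it serves only a small slice of traffic, which is also why an accelerated emergency path (as discussed below) carries more risk than a typical change.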
In my experience customers deeply detest the idea of waiting around for a failure case to recur so that you can understand it better. When your customers are losing millions of dollars in the minutes you're down, mitigation would be the thing, and analysis can wait. All that is needed is enough forensic data so that testing in earnest to reproduce the condition in the lab can begin. Then get the customers back to working order pronto. 20 minutes seems like a lifetime if in fact they were concerned that the degradation could happen again at any time. 20 minutes seems like just enough time to follow a checklist of actions for capturing environmental conditions, gather a huddle to make a decision, document the change, and execute on it. Commendable actually, if that's what happened.
I think this is a good point. Don't roll back if you don't know why your new code is giving you problems. You may fix things with the rollback, or you may put yourself in a worse situation where the forward/backward compatibility has a bug in it. The issue may even be coincidental to the new code.
However, it's hard to say whether this is a poor decision unless we know that they didn't analyze the path and determine that it would most likely be fine. If they did do that, then it's just a mistake and those happen. 20 minutes is enough time to make that call for the team that built it.
The odds of you understanding all of the constraints and moving variables in play, and doing situation analysis better than the seasoned ops team at a multibillion dollar company are pretty low. Maybe hold off on the armchair quarterbacking.
This was a focus in our after-action review. The nodes responded as healthy to active checks while silently dropping updates on their replication lag; together, this created the impression of a healthy node. The missing bit was verifying the absence of lag updates. (Which we have now.)
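The check described above can be sketched as a staleness alarm (node names and the threshold are hypothetical): instead of alerting on the lag value itself, alert when a node has not reported lag at all within some window.

```python
import time

# Hypothetical staleness check: a node that stops emitting replication-lag
# updates should be flagged, even if its health endpoint still answers "ok".
MAX_SILENCE_SECONDS = 60

def stale_lag_reporters(last_lag_report, now=None):
    """Return node ids whose most recent lag report is older than the window.

    last_lag_report maps node id -> timestamp of its last lag update.
    """
    now = time.time() if now is None else now
    return sorted(
        node for node, reported_at in last_lag_report.items()
        if now - reported_at > MAX_SILENCE_SECONDS
    )

# Example: "db-2" last reported lag 300 seconds ago, so it gets flagged
# even though the other checks against it might still pass.
reports = {"db-1": 1000.0, "db-2": 700.0, "db-3": 990.0}
print(stale_lag_reporters(reports, now=1000.0))  # ['db-2']
```

The design point is that "no data" is treated as a failure signal in its own right, rather than as an implicitly healthy default.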
Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.
Having a few nodes down is perfectly acceptable. I guess they would have had an alert if the number of down nodes exceeded some threshold.
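The two-level split quoted above can be illustrated with a toy router (cluster names, shard counts, and replica counts are all made up for illustration): the data kind picks a cluster, a hash of the record id picks a shard within it, and each shard is backed by several redundant nodes.

```python
import hashlib

# Hypothetical illustration of splitting data by kind into clusters and by
# quantity into shards, with redundant nodes per shard.
CLUSTERS = {
    "charges": {"shards": 8},
    "customers": {"shards": 4},
}
REPLICAS_PER_SHARD = 3

def route(kind, record_id):
    """Map a record to its shard and the redundant nodes serving it."""
    n_shards = CLUSTERS[kind]["shards"]
    digest = hashlib.sha256(record_id.encode()).hexdigest()
    shard = int(digest, 16) % n_shards
    nodes = [f"{kind}-shard{shard}-node{i}" for i in range(REPLICAS_PER_SHARD)]
    return shard, nodes

shard, nodes = route("charges", "ch_123")
print(shard, nodes)
```

With this layout, losing one node only thins the redundancy of a single shard; the incident's impact came from shards losing their ability to elect a primary, not from individual node loss.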
So the article identifies a software bug and a software/config bug as the root cause. That sounds a bit shallow for such a high-visibility case - I was expecting something like the https://en.wikipedia.org/wiki/5_Whys method with subplots on why the bugs were not caught in testing. By the way, I only clicked on it because I was hoping it would be an occasion to use the methods from http://bayes.cs.ucla.edu/WHY/ - alas no - it was too shallow for that.
It is likely that this RCA was shallow because it was intended for everyone, including non-technical users, who (at least in my experience) tend to misinterpret or get confused by deep technical or systemic failure analysis.
It would be excellent if Stripe published a truly technical RCA, perhaps for distribution via their tech blog, so that folks like us could get a more complete understanding and what-not-to-do lesson (if the failing systems were based on non-proprietary technologies, that is).
I'm Stripe's CTO and wrote a good deal of the RCA (with the help of others, including a lot of the engineers who responded to the incident). If you've any specific feedback on how to make this more useful, I'd love to hear it.
Out of curiosity, how would you have preferred to see a shard unable to accept writes? I think in both post-mortems, you would see comparable graphs - usage and then a drop in usage. I think it's easier to document a failed regex versus "here's our cluster architecture that we've been using for 3 months".
Also, does your company's engineering decisions change based on other companies' post-mortems?
Is this Stripe's first public RCA? Looking through their tweets, there do not appear to be other RCAs for the same "elevated error rates". It seems hard to conclude much from one RCA.
That's a reasonable question. We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including e.g. moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would be necessarily safe. For example, users (if they don’t use idempotency keys in our API) who simply decided to re-charge all orders in their database during the event might inadvertently double charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future - thank you for the feedback.
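The double-charge risk described above is exactly what idempotency keys prevent. A minimal server-side sketch (class and field names are hypothetical, not Stripe's implementation): the server caches the result under the client-supplied key, so replaying the same request after an ambiguous failure cannot create a second charge.

```python
import uuid

# Hypothetical sketch of why idempotency keys make retries safe: the server
# remembers the outcome keyed by the client-supplied key, so replaying the
# same request returns the cached result instead of charging again.
class ChargeServer:
    def __init__(self):
        self._seen = {}   # idempotency key -> charge record
        self.charges = []

    def create_charge(self, amount, idempotency_key):
        if idempotency_key in self._seen:
            # Replay of a request we already processed: return the same result.
            return self._seen[idempotency_key]
        charge = {"id": f"ch_{uuid.uuid4().hex[:8]}", "amount": amount}
        self.charges.append(charge)
        self._seen[idempotency_key] = charge
        return charge

server = ChargeServer()
key = "order-42"                           # e.g. derived from the order id
first = server.create_charge(1000, key)
retry = server.create_charge(1000, key)    # retry after an ambiguous timeout
print(first == retry, len(server.charges))  # True 1
```

A client that does not send such a key has no safe way to retry a request whose outcome it never learned, which is the scenario the RCA is warning merchants about.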
My guess is that it's because not everything was down so it wasn't a total outage. From the post mortem:
> Stripe splits data by kind into different database clusters and by quantity into different shards.
So in theory any request that didn't interact with the problematic database should have been OK (I don't know if the offending DB was in the critical path of _every_ request).
Since both companies' root cause analyses are currently trending on HN, it's pretty apparent that Stripe's engineering culture has a long way to go to catch up with Cloudflare's.
"We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation."
Damn, what a mess. Sounds like y'all are rolling out way too many changes too quickly, with little to no time for integration testing.
It's a somewhat amateur move to assume you can just arbitrarily rollback without consequence, without testing etc.
One solution I don't see mentioned, don't upgrade to minor versions ever. And create a dependency matrix so if you do rollback, you rollback all the other things that depend on the thing you're rolling back as well.
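The "dependency matrix" idea above can be sketched as a transitive closure over reverse dependencies (component names and the dependency graph are hypothetical): rolling back a component means also rolling back everything that, directly or indirectly, depends on it.

```python
# Hypothetical sketch of a rollback dependency matrix: given which components
# depend on which, rolling back X requires rolling back everything that
# (transitively) depends on X.
DEPENDS_ON = {
    "api": ["election-protocol", "config-v2"],
    "config-v2": ["election-protocol"],
    "billing": ["api"],
}

def rollback_set(target):
    """Components that must be rolled back together with `target`."""
    to_roll = {target}
    changed = True
    while changed:
        changed = False
        for component, deps in DEPENDS_ON.items():
            if component not in to_roll and to_roll.intersection(deps):
                to_roll.add(component)
                changed = True
    return sorted(to_roll)

print(rollback_set("election-protocol"))
# ['api', 'billing', 'config-v2', 'election-protocol']
```

In this toy graph, the second incident's trigger (an old election protocol meeting a new config setting) corresponds to rolling back "election-protocol" without also rolling back "config-v2".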
Yes this was very surprising. The system was working fine after the cluster restart. There was no need for an emergency rollback.
Doing a large rollback based on a hunch seems like an overreaction.
It's totally normal for engineers to commit these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures and automation is in place to reduce operator errors.
Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real-time and coordinating the response could have reacted better.
Of course, I wasn't there so I could be completely off.
vjagrawal1984|6 years ago
Is it because they are ahead of the curve and don't make "any" changes to their system, as opposed to other companies that are still maturing?
wallflower|6 years ago
> Visa, for example, uses the mainframe to process billions of credit and debit card payments every year.
> According to some estimates, up to $3 trillion in daily commerce flows through mainframes.
https://www.share.org/blog/mainframe-matters-how-mainframes-...
https://blog.syncsort.com/2018/06/mainframe/9-mainframe-stat...
https://www.ibm.com/it-infrastructure/servers/mainframes
londons_explore|6 years ago
https://www.ft.com/content/1fd2a066-860f-11e8-a29d-73e3d4545...
I suspect they sometimes 'fail open' (ie. allow all payments through and reconcile later) too.
laCour|6 years ago
How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.
NikolaeVarius|6 years ago
The article said that the node stalled in a way that was unforeseen, which may have caused standard recovery mechanisms to silently fail.
Havoc|6 years ago
The remediation part is quite cautious/generic but overall it seems like a good faith effort by someone constrained by corporate rules.
NikolaeVarius|6 years ago
I don't understand why people demand the usage of incorrect language.