That’s a great point. Despite technical changes such as Apple Pay/Android Pay, chip cards, and so on, I can't recall an instance when I was unable to use a credit card globally. It seems most failures to run a credit card are pretty localized, too, and never at the interchange level...
[2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
[2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.
There's a 20-minute gap between investigation and "rollback". Why did they roll back if the service was back to normal? How could they decide on, and document, the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly not all variables were considered.
To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then, you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.
Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.
In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however we lacked an efficient way to fully validate this change on the order of minutes. We're investing in building tooling to increase robustness in rapid response mechanisms and to help responding engineers understand the potential impact of configuration changes or other remediation efforts they're pushing through an accelerated process.
I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.
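A staged fleetwide rollout like the one described can be sketched roughly as follows (the stage fractions and function names are hypothetical, not Stripe's actual tooling): deploy to a growing fraction of the fleet, verify health between stages, and halt with the old "blue" side still serving if anything regresses.

```python
# Hypothetical sketch of a staged blue/green-style rollout: ship the new
# version to a growing fraction of nodes, checking health between stages,
# and stop (leaving the untouched side serving) if any stage regresses.
STAGES = [0.01, 0.10, 0.50, 1.00]

def rollout(nodes, deploy, healthy):
    """Deploy in stages; return (completed, number_of_nodes_deployed)."""
    done = 0
    for fraction in STAGES:
        target = int(len(nodes) * fraction)
        for node in nodes[done:target]:
            deploy(node)
        done = target
        if not all(healthy(n) for n in nodes[:done]):
            return False, done  # halt the rollout; remaining nodes untouched
    return True, done

nodes = [f"node-{i}" for i in range(100)]
ok, deployed = rollout(nodes, deploy=lambda n: None, healthy=lambda n: True)
print(ok, deployed)  # True 100
```

The point of the staging is that a bad version is caught while it serves only a small slice of traffic, which is also why an accelerated emergency path (as discussed below) carries more risk than a typical change.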
In my experience customers deeply detest the idea of waiting around for a failure case to recur so that you can understand it better. When your customers are losing millions of dollars in the minutes you're down, mitigation would be the thing, and analysis can wait. All that is needed is enough forensic data so that testing in earnest to reproduce the condition in the lab can begin. Then get the customers back to working order pronto. 20 minutes seems like a lifetime if in fact they were concerned that the degradation could happen again at any time. 20 minutes seems like just enough time to follow a checklist of actions for capturing environmental conditions, gather a huddle to make a decision, document the change, and execute on it. Commendable actually, if that's what happened.
I think this is a good point. Don't roll back if you don't know why your new code is giving you problems. You may fix things with the rollback, or you may put yourself in a worse situation where the forward/backward compatibility has a bug in it. The issue may even be coincidental to the new code.
However, it's hard to say whether this is a poor decision unless we know that they didn't analyze the path and determine that it would most likely be fine. If they did do that, then it's just a mistake and those happen. 20 minutes is enough time to make that call for the team that built it.
The odds of you understanding all of the constraints and moving variables in play, and doing situation analysis better than the seasoned ops team at a multibillion dollar company are pretty low. Maybe hold off on the armchair quarterbacking.
This was a focus in our after-action review. The nodes responded as healthy to active checks while silently dropping updates on their replication lag; together, this created the impression of a healthy node. The missing bit was verifying the absence of lag updates. (Which we have now.)
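The check described above can be sketched as a staleness alarm (node names and the threshold are hypothetical): instead of alerting on the lag value itself, alert when a node has not reported lag at all within some window.

```python
import time

# Hypothetical staleness check: a node that stops emitting replication-lag
# updates should be flagged, even if its health endpoint still answers "ok".
MAX_SILENCE_SECONDS = 60

def stale_lag_reporters(last_lag_report, now=None):
    """Return node ids whose most recent lag report is older than the window.

    last_lag_report maps node id -> timestamp of its last lag update.
    """
    now = time.time() if now is None else now
    return sorted(
        node for node, reported_at in last_lag_report.items()
        if now - reported_at > MAX_SILENCE_SECONDS
    )

# Example: "db-2" last reported lag 300 seconds ago, so it gets flagged
# even though the other checks against it might still pass.
reports = {"db-1": 1000.0, "db-2": 700.0, "db-3": 990.0}
print(stale_lag_reporters(reports, now=1000.0))  # ['db-2']
```

The design point is that "no data" is treated as a failure signal in its own right, rather than as an implicitly healthy default.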
Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.
Having a few nodes down is perfectly acceptable. I guess they would have had an alert if the number of down nodes exceeded some threshold.
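The two-level split quoted above can be illustrated with a toy router (cluster names, shard counts, and replica counts are all made up for illustration): the data kind picks a cluster, a hash of the record id picks a shard within it, and each shard is backed by several redundant nodes.

```python
import hashlib

# Hypothetical illustration of splitting data by kind into clusters and by
# quantity into shards, with redundant nodes per shard.
CLUSTERS = {
    "charges": {"shards": 8},
    "customers": {"shards": 4},
}
REPLICAS_PER_SHARD = 3

def route(kind, record_id):
    """Map a record to its shard and the redundant nodes serving it."""
    n_shards = CLUSTERS[kind]["shards"]
    digest = hashlib.sha256(record_id.encode()).hexdigest()
    shard = int(digest, 16) % n_shards
    nodes = [f"{kind}-shard{shard}-node{i}" for i in range(REPLICAS_PER_SHARD)]
    return shard, nodes

shard, nodes = route("charges", "ch_123")
print(shard, nodes)
```

With this layout, losing one node only thins the redundancy of a single shard; the incident's impact came from shards losing their ability to elect a primary, not from individual node loss.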
So the article identifies a software bug and a software/config bug as the root cause. That sounds a bit shallow for such a high-visibility case - I was expecting something like the https://en.wikipedia.org/wiki/5_Whys method with subplots on why the bugs were not caught in testing. By the way, I only clicked on it because I was hoping it would be an occasion to use the methods from http://bayes.cs.ucla.edu/WHY/ - alas no - it was too shallow for that.
It is likely that this RCA was shallow because it was intended for everyone, including non-technical users, who (at least in my experience) tend to misinterpret or get confused by deep technical or systemic failure analysis.
It would be excellent if Stripe published a truly technical RCA, perhaps for distribution via their tech blog, so that folks like us could get a more complete understanding and what-not-to-do lesson (if the failing systems were based on non-proprietary technologies, that is).
I'm Stripe's CTO and wrote a good deal of the RCA (with the help of others, including a lot of the engineers who responded to the incident). If you've any specific feedback on how to make this more useful, I'd love to hear it.
Out of curiosity, how would you have preferred to see a shard unable to accept writes? I think in both post-mortems, you would see comparable graphs - usage and then a drop in usage. I think it's easier to document a failed regex versus "here's our cluster architecture that we've been using for 3 months".
Also, does your company's engineering decisions change based on other companies' post-mortems?
Is this Stripe's first public RCA? Looking through their tweets, there do not appear to be other RCAs for the same "elevated error rates". It seems hard to conclude much from one RCA.
That's a reasonable question. We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including e.g. moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would be necessarily safe. For example, users (if they don’t use idempotency keys in our API) who simply decided to re-charge all orders in their database during the event might inadvertently double charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future - thank you for the feedback.
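The double-charge risk described above is exactly what idempotency keys prevent. A minimal server-side sketch (class and field names are hypothetical, not Stripe's implementation): the server caches the result under the client-supplied key, so replaying the same request after an ambiguous failure cannot create a second charge.

```python
import uuid

# Hypothetical sketch of why idempotency keys make retries safe: the server
# remembers the outcome keyed by the client-supplied key, so replaying the
# same request returns the cached result instead of charging again.
class ChargeServer:
    def __init__(self):
        self._seen = {}   # idempotency key -> charge record
        self.charges = []

    def create_charge(self, amount, idempotency_key):
        if idempotency_key in self._seen:
            # Replay of a request we already processed: return the same result.
            return self._seen[idempotency_key]
        charge = {"id": f"ch_{uuid.uuid4().hex[:8]}", "amount": amount}
        self.charges.append(charge)
        self._seen[idempotency_key] = charge
        return charge

server = ChargeServer()
key = "order-42"                           # e.g. derived from the order id
first = server.create_charge(1000, key)
retry = server.create_charge(1000, key)    # retry after an ambiguous timeout
print(first == retry, len(server.charges))  # True 1
```

A client that does not send such a key has no safe way to retry a request whose outcome it never learned, which is the scenario the RCA is warning merchants about.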
My guess is that it's because not everything was down so it wasn't a total outage. From the post mortem:
> Stripe splits data by kind into different database clusters and by quantity into different shards.
So in theory any request that didn't interact with the problematic database should have been OK (I don't know if the offending DB was in the critical path of _every_ request).
Since both companies' root cause analyses are currently trending on HN, it's pretty apparent that Stripe's engineering culture has a long way to go to catch up with Cloudflare's.
"We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation."
Damn, what a mess. Sounds like y'all are rolling out way too many changes too quickly, with little to no time for integration testing.
It's a somewhat amateur move to assume you can just arbitrarily rollback without consequence, without testing etc.
One solution I don't see mentioned, don't upgrade to minor versions ever. And create a dependency matrix so if you do rollback, you rollback all the other things that depend on the thing you're rolling back as well.
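The "dependency matrix" idea above can be sketched as a transitive closure over reverse dependencies (component names and the dependency graph are hypothetical): rolling back a component means also rolling back everything that, directly or indirectly, depends on it.

```python
# Hypothetical sketch of a rollback dependency matrix: given which components
# depend on which, rolling back X requires rolling back everything that
# (transitively) depends on X.
DEPENDS_ON = {
    "api": ["election-protocol", "config-v2"],
    "config-v2": ["election-protocol"],
    "billing": ["api"],
}

def rollback_set(target):
    """Components that must be rolled back together with `target`."""
    to_roll = {target}
    changed = True
    while changed:
        changed = False
        for component, deps in DEPENDS_ON.items():
            if component not in to_roll and to_roll.intersection(deps):
                to_roll.add(component)
                changed = True
    return sorted(to_roll)

print(rollback_set("election-protocol"))
# ['api', 'billing', 'config-v2', 'election-protocol']
```

In this toy graph, the second incident's trigger (an old election protocol meeting a new config setting) corresponds to rolling back "election-protocol" without also rolling back "config-v2".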
Yes this was very surprising. The system was working fine after the cluster restart. There was no need for an emergency rollback.
Doing a large rollback based on a hunch seems like an overreaction.
It's totally normal for engineers to commit these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures and automation is in place to reduce operator errors.
Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real-time and coordinating the response could have reacted better.
Of course, I wasn't there so I could be completely off.
vjagrawal1984|6 years ago
Is it because they are ahead of the curve and don't make "any" changes to their system, as opposed to other companies that are still maturing?
wallflower|6 years ago
> Visa, for example, uses the mainframe to process billions of credit and debit card payments every year.
> According to some estimates, up to $3 trillion in daily commerce flows through mainframes.
https://www.share.org/blog/mainframe-matters-how-mainframes-...
https://blog.syncsort.com/2018/06/mainframe/9-mainframe-stat...
https://www.ibm.com/it-infrastructure/servers/mainframes
londons_explore|6 years ago
https://www.ft.com/content/1fd2a066-860f-11e8-a29d-73e3d4545...
I suspect they sometimes 'fail open' (ie. allow all payments through and reconcile later) too.
laCour|6 years ago
How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.
NikolaeVarius|6 years ago
The article said that the node stalled in a way that was unforeseen, which may have caused standard recovery mechanisms to silently fail.
Havoc|6 years ago
The remediation part is quite cautious/generic but overall it seems like a good faith effort by someone constrained by corporate rules.
NikolaeVarius|6 years ago
I don't understand why people demand the usage of incorrect language.