"[Three months prior to the incident] We upgraded our databases to a new minor version that introduced a subtle, undetected fault in the database’s failover system."
could have been prevented if you had stopped upgrading minor versions, i.e. froze on one specific version and not even applied security fixes, instead relying on containing it as a "known" vulnerable database?
The reason I ask is that I heard of ATM's still running windows XP or stuff like that. but if it's not networked could it be that that actually has a bigger uptime than anything you can do on windows 7 or 10?
what I mean is even though it is hilariously out of date to be using windows xp, still, by any measure it's had a billion device-days to expose its failure modes.
when you upgrade to the latest minor version of databases, don't you sacrifice the known bad for an unknown good?
Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.
having a few nodes down is perfectly acceptable. I guess they would have had an alert if the number of down nodes exceeded some threshold.
lethain|6 years ago
This was a focus in our after-action review. The nodes responded as healthy to active checks while silently dropping updates on their replication lag; together, this created the impression of a healthy node. The missing bit was verifying the absence of lag updates. (Which we have now.)
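Roughly, the check that was missing amounts to something like the sketch below (purely illustrative, not the real implementation; the metric names and threshold are invented): alert when a node stops sending replication-lag updates at all, rather than only when the reported lag is high.

    # Illustrative sketch only. Assumes a metrics store that records, per node,
    # the timestamp of the last replication-lag report it received.
    import time

    MAX_SILENCE_SECONDS = 300  # invented threshold: 5 minutes with no lag report

    def nodes_gone_silent(last_lag_report_at, now=None):
        """Return nodes whose replication-lag reports have stopped arriving.

        last_lag_report_at: dict of node name -> unix timestamp of the most
        recent replication-lag metric received from that node.
        """
        now = now or time.time()
        return [node for node, ts in last_lag_report_at.items()
                if now - ts > MAX_SILENCE_SECONDS]

    # Example: node-c answers active health checks but stopped reporting lag.
    silent = nodes_gone_silent({
        "node-a": time.time() - 30,
        "node-b": time.time() - 45,
        "node-c": time.time() - 3600,  # no lag update for an hour
    })
    if silent:
        print("ALERT: no replication-lag updates from: " + ", ".join(silent))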
aeyes|6 years ago
You might want to clarify this in the post. To me it reads like you knowingly had degraded infra for days leading up to an incident which might have been preventable had you recovered these instances.
throwaway3489|6 years ago
"[Three months prior to the incident] We upgraded our databases to a new minor version that introduced a subtle, undetected fault in the database’s failover system."
could have been prevented if you had stopped upgrading minor versions, i.e. froze on one specific version and not even applied security fixes, instead relying on containing it as a "known" vulnerable database?
The reason I ask is that I heard of ATM's still running windows XP or stuff like that. but if it's not networked could it be that that actually has a bigger uptime than anything you can do on windows 7 or 10?
what I mean is even though it is hilariously out of date to be using windows xp, still, by any measure it's had a billion device-days to expose its failure modes.
when you upgrade to the latest minor version of databases, don't you sacrifice the known bad for an unknown good?
excuse my ignorance on this subject.
ashelmire|6 years ago
sithlord|6 years ago
raverbashing|6 years ago
I kinda lost count of how many times Nagios barfed itself and reported an error while the application was fine
gtirloni|6 years ago
Stripe splits data by kind into different database clusters and by quantity into different shards. Each cluster has many shards, and each shard has multiple redundant nodes.
Having a few nodes down is perfectly acceptable. I guess they would have had an alert if the number of down nodes exceeded some threshold.
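A threshold alert of that kind could be as simple as the following sketch (purely illustrative; the cluster layout and the threshold are invented):

    # Purely illustrative. A few down nodes per shard is tolerable; alert once
    # any shard drops below the number of healthy nodes it needs.
    MIN_HEALTHY_NODES_PER_SHARD = 2  # invented threshold

    def shards_needing_attention(clusters):
        """clusters: dict cluster -> dict shard -> dict node -> is_healthy."""
        alerts = []
        for cluster, shards in clusters.items():
            for shard, nodes in shards.items():
                healthy = sum(1 for ok in nodes.values() if ok)
                if healthy < MIN_HEALTHY_NODES_PER_SHARD:
                    alerts.append((cluster, shard, healthy, len(nodes)))
        return alerts

    example = {
        "payments": {
            "shard-01": {"a": True, "b": True, "c": False},   # one node down: fine
            "shard-02": {"a": True, "b": False, "c": False},  # below threshold
        },
    }
    for cluster, shard, healthy, total in shards_needing_attention(example):
        print(f"ALERT: {cluster}/{shard} has only {healthy}/{total} healthy nodes")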
runevault|6 years ago
In this case that doesn't sound like it was the issue; it was the lack of promotion of a new master, due to the bug in the shard promotion system.
NikolaeVarius|6 years ago
The article said that the node stalled in a way that was unforeseen, which may have caused standard recovery mechanisms to silently fail.
laCour|6 years ago
Right, but they didn't recover speedily. To have the cluster in such a state for so long sounds like poor monitoring to me, because this is known to interfere with an election later.
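To spell out why replicas left in a degraded state matter for a later election: in quorum-based failover, a majority of voting members has to be healthy before a new primary can be chosen, so every node stuck in a bad state eats into that margin. A toy sketch (not any particular database's election logic):

    # Toy illustration, not any particular database's election algorithm.
    def can_elect_primary(voting_members):
        """voting_members: dict of node name -> is_healthy (bool)."""
        healthy = sum(1 for ok in voting_members.values() if ok)
        return healthy > len(voting_members) // 2

    print(can_elect_primary({"a": True, "b": True, "c": False}))   # True
    print(can_elect_primary({"a": True, "b": False, "c": False}))  # False: no quorum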
unknown|6 years ago
[deleted]