dps | 6 years ago
Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes.
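For readers unfamiliar with the pattern, a staged blue/green rollout of the kind described above can be sketched roughly like this. This is a minimal illustration, not dps's actual tooling; the stage fractions, health check, and function names are all hypothetical:

```python
# Illustrative sketch of a staged blue/green rollout: ship the new
# version to a small "green" slice of the fleet, check health, then
# widen the slice; on failure, shift traffic back to the "blue"
# (old-version) fleet. All names here are hypothetical.

STAGES = [0.01, 0.10, 0.50, 1.0]  # fraction of fleet running "green"

def healthy(error_rate: float, threshold: float = 0.001) -> bool:
    """A stage passes if the observed error rate stays under threshold."""
    return error_rate < threshold

def rollout(observe_error_rate) -> bool:
    """Advance through stages; abort (roll back to blue) on first failure."""
    for fraction in STAGES:
        error_rate = observe_error_rate(fraction)
        if not healthy(error_rate):
            print(f"stage {fraction:.0%} unhealthy, rolling back to blue")
            return False
        print(f"stage {fraction:.0%} healthy, widening green")
    return True

# Example: a deploy whose error rate spikes once half the fleet is green.
rollout(lambda fraction: 0.0002 if fraction < 0.5 else 0.005)
```

The point of the staging is that a bad change is caught while it affects only a small slice of the fleet, and the blue fleet is still running the known-good version to fall back on.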
In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however, we lacked an efficient way to fully validate the change on the order of minutes. We're investing in tooling to make our rapid-response mechanisms more robust and to help responding engineers understand the potential impact of configuration changes or other remediations they push through an accelerated process.
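A fast validation gate for accelerated pushes might look something like the sketch below: cheap checks that run in seconds rather than a full CI pass. This is purely a hypothetical illustration of the kind of tooling being described; the field names (`nodes_touched`, `rollback_plan`) and thresholds are invented:

```python
# Hypothetical sketch of a fast pre-push gate for emergency changes:
# run cheap checks (does the config parse? is the blast radius small?
# is there a rollback plan?) before an expedited change ships.
# Field names and limits are illustrative, not any real system's schema.

import json

def validate_config(raw: str, max_nodes_touched: int = 100) -> list[str]:
    """Return a list of problems; an empty list means the change may proceed."""
    problems = []
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"config does not parse: {exc}"]
    if cfg.get("nodes_touched", 0) > max_nodes_touched:
        problems.append("blast radius too large for an accelerated push")
    if not cfg.get("rollback_plan"):
        problems.append("no rollback plan attached")
    return problems

# Example: a fleetwide change with no rollback plan fails both checks.
print(validate_config('{"nodes_touched": 5000}'))
```

A gate like this doesn't replace full validation; it just gives the responding engineer a minutes-scale sanity check on the change they're about to push.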
I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.
ssalazars | 6 years ago
I hope that lessons are learned from this operational event, and that you invest in building metrics and tooling that allow you to, first of all, prevent issues, and second, shorten outage/mitigation times in the future.
I'm happy you guys are being open about the issue, and taking feedback from people outside your company. I definitely applaud this.
tus88 | 6 years ago
That seems like a lot of change in a week. Or does "deploys" mean something else, like customer websites being deployed?
tschwimmer | 6 years ago
nialldalton | 6 years ago
From the description/comment it also sounds like the database operates directly on files rather than through file leases, since there's no notion of a separate local, cluster-scoped, byte-level replication layer below it. That makes it harder to safely shoot a stateful node. And it sounds like it's tricky to externally cross-check various rates, i.e. to monitor replication RPCs and notice that certain nodes are drifting from the expected numbers, without depending on the health of the nodes themselves.
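The external cross-check being suggested can be sketched simply: sample each node's replication-RPC rate from outside the node (e.g. from the network side) and flag nodes that deviate sharply from the fleet-wide median, without trusting the nodes' own health reports. This is a toy illustration under assumed names, not a description of any real monitoring stack:

```python
# Illustrative sketch of an external cross-check on replication rates:
# flag nodes whose observed replication-RPC rate is far from the fleet
# median, independent of node self-reported health. Names are hypothetical.

from statistics import median

def flag_outliers(rpc_rates: dict[str, float], tolerance: float = 0.5) -> list[str]:
    """Return nodes whose replication rate deviates from the fleet-wide
    median by more than `tolerance` (as a fraction of the median)."""
    expected = median(rpc_rates.values())
    return [
        node for node, rate in rpc_rates.items()
        if expected > 0 and abs(rate - expected) / expected > tolerance
    ]

# Example: node-c has quietly stopped replicating.
rates = {"node-a": 980.0, "node-b": 1010.0, "node-c": 12.0, "node-d": 995.0}
print(flag_outliers(rates))  # → ['node-c']
```

The median is a deliberate choice here: a single stalled node drags an average down, but barely moves the median, so the expected rate stays honest.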
Hopefully the database doesn't also mix geo-replication for local-access requirements/sovereignty into the same mechanisms, rather than separating that out into aggregation layers above purely cluster-scoped zones!
Of course, this is all far far easier said than done given the available open source building blocks. Fun problems while scaling like crazy :)