Yep, a decent canary mechanism should have caught this. There's a trade-off between canarying and rollout speed, though. If this were a system for fighting bots, I'd expect it to be optimized for the latter.
Presumably optimal rollout speed means something as close to "push it everywhere all at once and activate immediately" as you can get. That's fine if you'd rather risk short downtime than delay the rollout; what I don't understand is why the nodes don't have any independent verification and rollback mechanism. I might be underestimating the complexity, but it really doesn't sound much more involved than a process launching another process, concluding that it crashed, and restarting it with different parameters.
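The node-local rollback idea above can be sketched in a few lines. This is a minimal sketch, not anyone's actual implementation: `run_with_fallback` and both sets of worker arguments are hypothetical, and the "worker" here is just a throwaway Python one-liner standing in for the real process.

```python
import subprocess
import sys

def run_with_fallback(primary_args, fallback_args, timeout=30):
    """Launch the worker with the new parameters; if it exits non-zero
    (or hangs past the timeout), restart it with known-good parameters."""
    try:
        result = subprocess.run(primary_args, timeout=timeout)
        if result.returncode == 0:
            return "primary"
    except subprocess.TimeoutExpired:
        pass  # treat a hang the same as a crash
    # The new parameters crashed the worker: roll back locally.
    subprocess.run(fallback_args, timeout=timeout, check=True)
    return "fallback"

# Example: a "worker" that crashes on the new config but runs on the old one.
status = run_with_fallback(
    [sys.executable, "-c", "raise SystemExit(1)"],  # bad new parameters
    [sys.executable, "-c", "raise SystemExit(0)"],  # known-good parameters
)
print(status)  # fallback
```

The whole mechanism is a supervisor, one exit-code check, and one retry; the hard part in practice is deciding what "crashed" means for a process that starts but misbehaves.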
I think they need to seriously evaluate whether they need this level of rollout speed. Even spending a few minutes with an automated canary gives you a ton of safety.
Even if the servers weren't crashing, it is possible that a bad set of parameters results in far too many false positives, which may as well be complete failure.
kevincox|3 months ago
Even if you want this data to be very fresh you can probably afford to do something like:
1. Push out data to a single location or some subset of servers.
2. Confirm that the data is loaded.
3. Wait to observe any issues. (Even a minute is probably enough to catch the most severe issues.)
4. Roll out globally.
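The four steps above can be sketched as a small driver loop. Everything here is a stand-in: `push`, `healthy`, the dict-based "servers", and the parameter names are all hypothetical, since the comment doesn't describe a concrete API.

```python
import time

# Hypothetical helpers: in a real system these would call your deploy API
# and health/metrics endpoints. Here a "server" is just a dict.
def push(data, servers):
    for s in servers:
        s["data"] = data

def healthy(servers):
    return all(s.get("data") is not None and not s.get("crashed")
               for s in servers)

def canary_rollout(data, fleet, canary_size=1, soak_seconds=60):
    canary, rest = fleet[:canary_size], fleet[canary_size:]
    push(data, canary)                # 1. push to a small subset
    if not healthy(canary):          # 2. confirm the data is loaded
        raise RuntimeError("canary failed to load data; aborting rollout")
    time.sleep(soak_seconds)         # 3. soak: let severe issues surface
    if not healthy(canary):
        raise RuntimeError("canary degraded during soak; aborting rollout")
    push(data, rest)                 # 4. roll out globally

fleet = [{} for _ in range(5)]
canary_rollout({"rules": "v2"}, fleet, soak_seconds=0)
```

Even with `soak_seconds` set to a minute, the global rollout is only delayed by that one soak window, while a config that crashes or degrades the canary never leaves it.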