When I first read about it, I assumed it was a "poison pill": a bad config whose ingestion makes the process crash and restart. Because the process crashes again on startup, there is no automated way to revert to a good config. These are the worst issues that any global control plane has to deal with.

The report actually seems to confirm this: it was indeed a crash on ingesting the bad config. However, I'm surprised that the long duration didn't come from "it takes a long time to restart the fleet manually" or "tooling to restart the fleet was bad".

The problem mostly seems to have been "we didn't know what's going on". A look into the proxy logs would hopefully have shown the stack trace/unwrap, and metrics on incoming requests would hopefully have shown that there was no abnormal volume of requests coming in.
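
To make the "poison pill" point concrete, here is a minimal sketch of the usual defence: validate a new config before swapping it in, and keep serving the last known good one if validation fails. Everything in it is made up for illustration (the `features` schema, the limit of 200, the `ConfigHolder` name); it is not taken from the report and not how the affected proxy actually works.

```python
import json


class ConfigError(Exception):
    """Raised when a candidate config fails validation."""


def parse_config(raw: str, max_features: int = 200) -> dict:
    # Hypothetical schema: {"features": [...]} with a hard size limit.
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"not valid JSON: {exc}") from exc
    features = cfg.get("features") if isinstance(cfg, dict) else None
    if not isinstance(features, list):
        raise ConfigError("missing 'features' list")
    if len(features) > max_features:
        raise ConfigError(f"{len(features)} features exceeds limit {max_features}")
    return cfg


class ConfigHolder:
    """Holds the active config and refuses to replace it with a bad one."""

    def __init__(self, initial: dict):
        self.active = initial
        self.rejected = 0  # worth exporting as a metric so rejections are visible

    def ingest(self, raw: str) -> bool:
        try:
            candidate = parse_config(raw)
        except ConfigError as exc:
            # Log and keep the old config instead of crashing the process.
            self.rejected += 1
            print(f"rejected config, keeping last known good: {exc}")
            return False
        self.active = candidate
        return True
```

The point of the design is that a bad config becomes a logged, counted rejection rather than a crash loop, so the failure shows up in logs and metrics instead of looking like an outage with an unknown cause.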