paradite|2 months ago
I've worked at one of the top fintech firms. Whenever we do a config change or deployment, we are supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.
The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.
I never saw downtime longer than 1 minute while I was there, because you see a spike on the dashboard immediately when something goes wrong.
For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
vlovich123|2 months ago
Comparing the difficulty of running a huge share of the world’s internet traffic, with hundreds of customer products, to your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.
paradite|2 months ago
If there’s indeed a 5-minute lag in Cloudflare's monitoring dashboards, I honestly think that's a pretty big concern.
For example, a simple curl script hitting your top 100 customers' homepages every 30 seconds would have raised warnings and notifications within a minute. If you stagger deployments at 5-minute intervals, you could have identified the issue and initiated the rollback within 2 minutes and completed it within 3.
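A minimal sketch of the kind of probe I mean (Python; the endpoints and alert threshold are made up for illustration):

    # Hypothetical synthetic check: poll customer homepages every 30 seconds
    # and alert when failures spike across the board.
    import time
    import urllib.request

    ENDPOINTS = [
        "https://customer1.example.com/",
        "https://customer2.example.com/",
    ]

    def is_up(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    while True:
        down = [u for u in ENDPOINTS if not is_up(u)]
        # Page someone if more than half the probes fail in one sweep.
        if len(down) > len(ENDPOINTS) // 2:
            print(f"ALERT: {len(down)}/{len(ENDPOINTS)} probes failing: {down}")
        time.sleep(30)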
autoexec|2 months ago
This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective, and it's starting to look like a liability for large numbers of people, there are obvious solutions for that.
notepad0x90|2 months ago
Just speculating based on my experience: more likely than not, they refused to invest in fail-safe architectures for cost reasons. Control plane and data plane should be separate; a React patch shouldn't affect traffic forwarding.
Forget manual rollbacks, there should be automated reversion to a known working state.
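As a rough sketch of what I mean (Python; apply_config and error_rate are hypothetical stand-ins for real control-plane hooks and metrics, not Cloudflare's actual API):

    # Hypothetical auto-revert wrapper: push a config, watch an error-rate
    # signal, and restore the last known-good config if the signal degrades.
    import time

    def apply_config(cfg):
        print(f"applied {cfg}")  # stand-in for the real config push

    def error_rate():
        return 0.0  # stand-in for a live 5xx-rate metric

    def deploy_with_auto_revert(new_cfg, last_good_cfg,
                                watch_seconds=300, threshold=0.01):
        apply_config(new_cfg)
        deadline = time.time() + watch_seconds
        while time.time() < deadline:
            if error_rate() > threshold:
                # Automated reversion: no human in the loop.
                apply_config(last_good_cfg)
                return False
            time.sleep(5)
        return True  # new config held for the whole watch window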
cowsandmilk|2 months ago
Cloudflare’s own post says the configuration change that resulted in the outage rolled out in seconds.
paradite|2 months ago
I'm talking more about how slow it was to detect the issue caused by the config change and to roll it back. It took 20 minutes.
theideaofcoffee|2 months ago
The process was pretty tight: almost no revenue-affecting outages that I can remember, because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).
nova22033|2 months ago
https://www.henricodolfing.ch/case-study-4-the-440-million-s...