top | item 45975590

(no title)

keypusher | 3 months ago

The most surprising thing to me here is that it took 3 hours to root cause, and points to a glaring hole in the platform observability. Even taking into account the fact that the service was failing intermittently at first, it still took 1.5 hours after it started failing consistently to root cause. But the service was crashing on startup. If a core service is throwing a panic at startup like that, it should be raising alerts or at least easily findable via log aggregation. It seems like maybe there was some significant time lost in assuming it was an attack, but it also seems strange to me that nobody was asking "what just changed?", which is usually the first question I ask during an incident.

discuss

eastdakota|3 months ago

That’s not accurate. As with any incident response there were a number of theories of the cause we were working in parallel. The feature file failure was one identified as potential in the first 30 minutes. However, the theory that seemed the most plausible based on what we were seeing (intermittent, initially concentrated in the UK, spike in errors for certain API endpoints) as well as what else we’d been dealing with (a bot net that had escalated DDoS attacks from 3Tbps to 30Tbps against us and others like Microsoft over the last 3 months). We worked multiple theories in parallel. After an hour we ruled out the DDoS theory. We had other theories also running in parallel, but at that point the dominant theory was that the feature file was somehow corrupt. One thing that made us initially question the theory was nothing in our changelogs seemed like it would have caused the feature file to grow in size. It was only after the incident that we realized the database permissions change had caused it, but that was far from obvious. Even after we identified the problem with the feature file, we did not have an automated process to role the feature file back to a known-safe previous version. So we had to shut down the reissuance and manually insert a file into the queue. Figuring out how to do that took time and waking people up as there are lots of security safeguards in place to prevent an individual from easily doing that. We also needed to double check we wouldn’t make things worse. The propagation then takes some time especially because there are tiers of caching of the file that we had to clear. Finally we chose to restart the FL2 processes on all the machines that make up our fleet to ensure they all loaded the corrected file as quickly as possible. That’s a lot of processes on a lot of machines. So I think best description was it took us an hour for the team to coalesce on the feature file being the cause and then another two to get the fix rolled out.

keypusher|3 months ago

Thank you for the clarification and insight, with that context it does make more sense to me. Is there anything you think can be done to improve the ability to identify issues like this more quickly in the future?

QREguy|3 months ago

Any "limits" on system should be alerted... like at 70% or 80% threshold.. it might be worth it for a SRE to revisit the system limits and ensuring threshold based alerting around it..

Fiadliel|3 months ago

If one actually looks at the current pingora API, it has limited ability to initialize async components at startup - the current pattern seems to be to lazily initialize on first call. An obvious downside of this is that a service can startup in a broken state. e.g. https://github.com/cloudflare/pingora/issues/169

I can imagine that this could easily lead to less visibility into issues.