The most surprising thing to me here is that it took 3 hours to root cause, and points to a glaring hole in the platform observability. Even taking into account the fact that the service was failing intermittently at first, it still took 1.5 hours after it started failing consistently to root cause. But the service was crashing on startup. If a core service is throwing a panic at startup like that, it should be raising alerts or at least easily findable via log aggregation. It seems like maybe there was some significant time lost in assuming it was an attack, but it also seems strange to me that nobody was asking "what just changed?", which is usually the first question I ask during an incident.
eastdakota|3 months ago
keypusher|3 months ago
QREguy|3 months ago
Fiadliel|3 months ago
I can imagine that this could easily lead to less visibility into issues.