vjeux | 1 year ago
- Code changes
- Configuration changes (this includes the equivalent of server topology changes like cloudformation, quota changes)
- Experimentation rollout changes
There have been external issues (like user behavior changes around New Year's Eve or a World Cup final, or a physical connection between datacenters being severed…), but they tend to be a lot less frequent.
All three big buckets are tied to a single trackable change with an ID, which is what makes this kind of automated root-cause analysis possible at scale.
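The automated analysis described above can be sketched as a time-window join between an outage and a change ledger. This is a minimal illustration, not Meta's actual system; the ledger schema, field names, and scoring are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Change:
    change_id: str      # single trackable ID (code, config, or experiment)
    kind: str           # "code" | "config" | "experiment"
    landed_at: datetime
    scope: set          # services the change touched (hypothetical field)

def candidate_causes(outage_start, affected, ledger,
                     window=timedelta(hours=2)):
    """Return changes that landed shortly before the outage and touched
    an affected service. A real system would score candidates, not just
    filter them."""
    hits = [c for c in ledger
            if outage_start - window <= c.landed_at <= outage_start
            and c.scope & affected]
    # Most recent changes first: they are the most likely culprits.
    return sorted(hits, key=lambda c: c.landed_at, reverse=True)
```

The key property the comment points at is that every change, whatever its bucket, shares one ID namespace, so a single join like this covers code pushes, config flips, and experiment rollouts alike.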
Now, Meta is mostly a closed loop where all the infra and product are controlled as one entity, so those results may not be applicable elsewhere.
donavanm | 1 year ago
Definitely agree that the bulk of “impact” traces back to changes introduced in the SDLC. Even for major incidents, infrastructure is probably down to 10-20% of causes in a good org. My view in the GP is probably skewed towards major incidents impairing multiple services/regions as well. While I worked on a handful of services, it was mostly edge/infra side, and I focused the last few years specifically on major incident management.
I'd still be curious about internal system state and faults due to issues like deadlocked workflows, incoherent state machines, and invalid state values. But maybe it's simply not that prevalent.
vitus | 1 year ago
I'm curious how well that works when your config change or experiment rollout results in a time bomb (e.g. one triggered by a task restart after a software rollout), speaking as someone who just came off an on-call shift where that was one of our more notable outages.
Google also has a ledger of production events which _most_ common infra will write to, but there are so many distinct systems that I would be worried about identifying spurious correlations with completely unrelated products.
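The spurious-correlation worry can be made concrete with a back-of-the-envelope calculation: if many unrelated systems write to a shared ledger, the expected number of innocent events landing inside any root-cause window grows linearly with the number of systems. The rates below are made up for illustration.

```python
def expected_innocent_hits(n_systems, events_per_day, window_hours):
    """Expected count of unrelated ledger events that fall inside a
    root-cause lookback window purely by chance, assuming events
    arrive independently at a steady rate (Poisson-like arrivals)."""
    return n_systems * events_per_day * (window_hours / 24.0)

# With 500 systems each logging 10 events/day, a 2-hour lookback
# window catches ~417 unrelated events on average, so naive
# time-correlation alone would flag hundreds of false candidates.
```

This is why filtering on time alone doesn't scale across a large fleet; some notion of scope or dependency (which services a change could plausibly affect) is needed to prune the candidate list.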
> There have been external issues (like ... a physical connection between datacenters being severed…), but they tend to be a lot less frequent.
That's interesting to hear, because my experience at Google is that we'll see a peering metro fully isolated from our network at least once a year; smaller fiber cuts that temporarily leave us with a SPOF or a capacity shortfall happen much, much more frequently.
(For a concrete example: a couple months ago, Hurricane Beryl temporarily took a bunch of peering infrastructure in Texas offline.)