vjeux | 1 year ago

From my experience, the vast majority of reliability issues at Meta come from 3 areas:

- Code changes

- Configuration changes (this includes the equivalent of server topology changes, like CloudFormation, and quota changes)

- Experimentation rollout changes

There have been issues that are external (like user behavior changes for New Year's / the World Cup final, a physical connection between datacenters being severed…) but they tend to be a lot less frequent.

All three big buckets are tied to a single trackable change with an ID, which is what makes this kind of automated root cause analysis possible at scale.
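As a rough sketch of what that change-ID-driven correlation could look like (purely illustrative; the record shape, field names, and the two-hour lookback window are my assumptions, not Meta's actual tooling):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Change:
    change_id: str      # single trackable ID (code diff, config push, experiment rollout)
    kind: str           # "code" | "config" | "experiment"
    landed_at: datetime

def suspect_changes(changes: list[Change],
                    incident_start: datetime,
                    lookback: timedelta = timedelta(hours=2)) -> list[Change]:
    """Naive automated RCA: rank changes that landed shortly before the incident."""
    window_start = incident_start - lookback
    candidates = [c for c in changes if window_start <= c.landed_at <= incident_start]
    # Most recent change first: the usual first suspect for a regression.
    return sorted(candidates, key=lambda c: c.landed_at, reverse=True)
```

The useful property is that all three buckets share one ID space, so the same ranking works whether the culprit was code, config, or an experiment.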

Now, Meta is mostly a closed loop where all the infra and product are controlled as one entity, so those results may not be applicable outside.

donavanm | 1 year ago

Interesting. It sounds like “all” service state management (admin config, infra, topology) is discoverable/legible for Meta. I think that contrasts with AWS where there is a strong DevTools org, but many services and integrations are more of an API-centric service-to-service model with distributed state, which is much harder to observe. Every cloud provider I know of also has an (externally opaque) division between “native” cloud-services-built-on-cloud-infra and (typically older) “foundational” services that are much closer to “bare metal” with their own bespoke provisioning and management. E.g. EC2 has great visibility inside of their placement and launch flows, but it'll never look like / interop with the CFN & CloudTrail that ~280 other “native” services use.

Definitely agree that the bulk of “impact” is back to changes introduced in the SDLC. Even for major incidents, infrastructure is probably down to 10-20% of causes in a good org. My view in the GP is probably skewed towards major incidents impairing multiple services/regions as well. While I worked on a handful of services, it was mostly edge/infra side, and I focused the last few years specifically on major incident management.

I'd still be curious about internal system state and faults due to issues like deadlocked workflows, incoherent state machines, and invalid state values. But maybe it's simply not that prevalent.

vitus | 1 year ago

> this leads to the ability to do those kind of automated root cause analysis at scale.

I'm curious how well that works in the situation where your config change or experiment rollout results in a time bomb (e.g. triggered by task restart after a software rollout), speaking as someone who just came off an on-call shift where that was one of our more notable outages.

Google also has a ledger of production events which _most_ common infra will write to, but there are so many distinct systems that I would be worried about identifying spurious correlations with completely unrelated products.
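To put a toy number on that worry (everything here is made up, not Google's ledger): with enough independent systems writing production events, some unrelated event almost always lands inside any incident's lookback window.

```python
import random
from datetime import datetime, timedelta

random.seed(0)
incident = datetime(2024, 8, 20, 14, 0)
lookback = timedelta(hours=2)
day = timedelta(days=1)

n_systems = 500         # distinct systems writing to the ledger (assumed)
events_per_system = 4   # unrelated pushes/restarts per system per day (assumed)

hits = 0
for _ in range(n_systems * events_per_system):
    t = datetime(2024, 8, 20) + random.random() * day  # event at a random time that day
    if incident - lookback <= t <= incident:
        hits += 1

print(hits, "unrelated ledger events fall in the 2h window before the incident")
# Roughly 2000 * (2/24) ≈ 170 coincidental "suspects" before any real analysis starts.
```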

> There has been issues that are external (like ... physical connection between datacenters being severed…) but they tend to be a lot less frequent.

That's interesting to hear, because my experience at Google is that we'll see a peering metro being fully isolated from our network at least once a year; smaller fiber cuts that temporarily leave us with a SPOF or a capacity shortfall happen much, much more frequently.

(For a concrete example: a couple months ago, Hurricane Beryl temporarily took a bunch of peering infrastructure in Texas offline.)