
netik | 10 days ago

Caveat: I was employee 13 at Twitter and I spent a long time dealing with random failure modes.

At extremely high scale you start to run into very strange problems. We used to say that all of your "Unix Friends" fail at scale and act differently.

I once had 3000 machines running NTP-synced cron jobs on the exact same second, pounding the upstream server and causing outages. (Whoops: add random offsets to cron!)
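The fix above can be sketched as a jitter helper: each host delays by a random offset before firing the synced job, so the fleet's requests spread out instead of landing on the same second. The names and the 5-minute window here are illustrative, not from the original setup.

```typescript
// Spread NTP-synced jobs across a window instead of firing them all at once.
// rng is injectable so the jitter is testable; defaults to Math.random.
function jitterMs(maxJitterMs: number, rng: () => number = Math.random): number {
  return Math.floor(rng() * maxJitterMs);
}

// e.g. delay a job that cron fires on the minute by up to 5 minutes:
// setTimeout(runSyncedJob, jitterMs(5 * 60_000));
```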

This sort of "dogpile effect" exists when fetching keys as well: a key drops out of cache and 30 machines (or worker threads) all try to load the same key at the same time, because the cache is empty.
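One common mitigation is "single-flight" de-duplication: concurrent requests for the same missing key share one in-flight fetch instead of all hitting the origin. A minimal sketch, assuming `fetchFn` is your slow backing loader:

```typescript
// Map of keys currently being fetched; later callers piggyback on the
// same promise instead of issuing their own origin request.
const inflight = new Map<string, Promise<string>>();

async function loadOnce(
  key: string,
  fetchFn: (k: string) => Promise<string>, // e.g. your DB/origin fetch
): Promise<string> {
  const existing = inflight.get(key);
  if (existing) return existing; // someone is already fetching this key
  const p = fetchFn(key).finally(() => inflight.delete(key));
  inflight.set(key, p);
  return p;
}
```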

One of the solutions to this problem was Facebook's DataLoader (https://github.com/graphql/dataloader), which intercepts the request pipeline, batches requests together, and coalesces many requests into one.

Essentially DataLoader will coalesce all individual loads which occur within a single frame of execution (a single tick of the event loop) and then call your batch function with all requested keys.

It helps by reducing requests and offering something resembling backpressure by moving the request into one code path.
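The tick-level coalescing described above can be sketched in a few lines. This is not the real DataLoader, just the shape of it: loads requested within one synchronous run are queued, and a single batch call is made at the end of the microtask tick.

```typescript
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

// Toy DataLoader-style coalescer: collects all load() calls made in the
// current tick, then calls batchFn once with every requested key.
class TinyLoader<K, V> {
  private queue: { key: K; resolve: (v: V) => void }[] = [];
  private scheduled = false;

  constructor(private batchFn: BatchFn<K, V>) {}

  load(key: K): Promise<V> {
    return new Promise((resolve) => {
      this.queue.push({ key, resolve });
      if (!this.scheduled) {
        this.scheduled = true;
        queueMicrotask(() => this.flush()); // fires after the current tick
      }
    });
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    const values = await this.batchFn(batch.map((b) => b.key));
    batch.forEach((b, i) => b.resolve(values[i]));
  }
}
```

Two `load()` calls in the same tick produce one batch call, which is where the request reduction and the single code path come from.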

I would expect that you'd have the same sort of problem at scale with this system given the number of requests on many procs across many machines.

We had a lot of small tricks like this (they add up!). In some cases we'd insert a message queue in between the requestor and the service so that we could increase latency / reduce request rate while systems were degraded. Those "knobs" were generally implemented by "Decider" code which read keys from memcache to figure out what to do.

By "pushes to connected SDKs": I assume you're holding a thread with this connection. How do you reconcile this when you're running something like Node with PM2, where you've got 30-60 processes on a single host? They won't be sharing memory, so that's a lot of updates.

It seems better to have these updates pushed to one local process that other processes can read from via socket or shared memory.
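That sidecar idea can be sketched with a Unix domain socket: one process per host holds the remote connection and caches the latest pushed state, and sibling workers read it locally. All names here (`startSidecar`, `readState`) are hypothetical, not any real SDK's API.

```typescript
import * as net from "net";
import * as os from "os";
import * as path from "path";

// One local socket per host; siblings dial this instead of the remote service.
const SOCK = path.join(os.tmpdir(), `sidecar-${process.pid}.sock`);

// Sidecar: serves the last state it received from upstream to any local reader.
function startSidecar(state: string): Promise<net.Server> {
  return new Promise((resolve) => {
    const server = net.createServer((conn) => conn.end(state));
    server.listen(SOCK, () => resolve(server));
  });
}

// What a sibling worker would do instead of holding its own remote connection.
function readState(): Promise<string> {
  return new Promise((resolve, reject) => {
    const client = net.createConnection(SOCK);
    let buf = "";
    client.on("data", (d) => (buf += d.toString()));
    client.on("end", () => resolve(buf));
    client.on("error", reject);
  });
}
```

With 30-60 PM2 workers this turns N remote connections per host into one, at the cost of a small local hop.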

I'd also consider the many failure modes of services. Sometimes services go catatonic upon connect and don't respond, sometimes they time out, sometimes they throw exceptions, and so on.

There's a lot to think about here but as I said what you've got is a great start.


rodrigorcs | 9 days ago

This is incredibly generous context... thank you. A few of these hit close to problems I'm thinking about.

The Decider pattern you're describing (reading keys from memcache to decide behavior at runtime) is essentially what Openfuse is trying to productize. A centralized place that tells your fleet how to behave, without each process figuring it out independently. So it's validating to hear that's where Twitter landed organically.

On the PM2 point: you're right, holding a connection per process doesn't scale well at that density. A local sidecar that receives state updates and exposes them via socket or shared memory to sibling processes is a much better model. That's not how it works today (each process holds its own connection), but your framing is exactly how I'd want to evolve it. However, I can't say it's a short-term goal: I need to validate the product first, add some important features, and publish the self-hosted version.

On the dogpile: the half-open state is where this matters most. When a breaker opens and then transitions to half-open, you don't want 50 instances all sending probe requests simultaneously. The coalescing pattern you're describing from DataLoader is a neat way of solving it; I wonder whether I can implement it without adding a service/proxy closer to the clients just for that.
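Within a single process, the probe stampede can be limited with a small gate: exactly one caller wins the right to probe while the breaker is half-open, and everyone else keeps failing fast until that probe settles. This is an illustrative sketch, not Openfuse's implementation (and it doesn't solve coordination across machines).

```typescript
// Gate that admits at most one half-open probe at a time.
class HalfOpenGate {
  private probing = false;

  // Returns true for exactly one caller; that caller must call settle()
  // once its probe request succeeds or fails.
  tryAcquireProbe(): boolean {
    if (this.probing) return false;
    this.probing = true;
    return true;
  }

  settle(): void {
    this.probing = false;
  }
}
```

Across a fleet, the same idea usually needs either a shared lock/lease or per-instance random jitter on when each process is allowed to probe.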

On failure modes: agreed, "service is down" is the simplest case. Catatonic connections, slow degradation, partial responses that look valid but aren't: those are harder to classify. Right now Openfuse trips on error rates, timeouts, and latency. However, the back-end is ready for custom metrics; I just haven't implemented them yet. Tripping the breaker based on OpenTelemetry metrics is also something I'm looking forward to trying, which opens up a whole new world.

I'm not going to pretend this is built for Twitter-scale problems today. But hearing that the patterns you arrived at are directionally where this is headed is really encouraging.