top | item 45723138

(no title)

jimmyl02 | 4 months ago

always wondered at what scale gossip / SWIM breaks down and you need a hierarchy / partitioning. fly's use of corrosion seems to imply it's good enough for a single region which is pretty surprising because iirc Uber's ringpop was said to face problems at around 3K nodes.

it would be super cool to learn more about how the world's largest gossip systems work :)

discuss

order

tptacek|4 months ago

SWIM is probably going to scale pretty much indefinitely. The issue we have with a single global SWIM broadcast domain isn't that the scale is breaking down; it's just that the blast radius for bugs (both in Corrosion itself, and in the services that depend on Corrosion) is too big.

We're actually keeping the global Corrosion cluster! We're just stripping most of the data out of it.

chucky_z|4 months ago

Back of napkin math I’ve done previously, it breaks down around 2 million members with Hashicorps defaults. The defaults are quite aggressive though and if you can tolerate seconds of latency (called out in the article) you could reach billions without a lot of trouble.

tptacek|4 months ago

It's also frequency of changes and granularity of state, when sizing workloads. My understanding is that most Hashi shops would federate workloads of our size/global distribution; it would be weird to try to run one big cluster to capture everything.