> Load-sensitivity is one “smart” approach. The idea is that you keep track of the load on each shard, and selectively route traffic to the lightly-loaded ones and away from the busy ones. Simplest thing is, if you have some sort of load metric, always pick the shard with the lowest value.
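For concreteness, the naive policy is a one-liner, which is part of why it's so tempting (shard names and the load dict here are made up for illustration):

```python
# Naive least-loaded routing: always send the next request to the
# shard whose reported load metric is currently lowest.
def pick_shard(loads: dict[str, float]) -> str:
    return min(loads, key=loads.get)

# Example: shard-b looks idle right now, so every balancer sampling
# these numbers will pick it until the metric catches up.
loads = {"shard-a": 0.82, "shard-b": 0.35, "shard-c": 0.77}
```

Note that every balancer looking at the same numbers makes the same choice at the same time, which is exactly the herding behavior described in the replies.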
Gotta be super careful with this one. We did this at reddit and it bit us bad. The problem was as soon as the load on a machine went down it got pounded with new requests and the load shot up, but it takes a few seconds for the load number to react to all the new requests. So we saw a really bad see-saw effect.
We had to add extra logic to mark how long a machine had been at a certain load and also randomly send requests to slightly more loaded machines to keep things even.
The moral of the story here is make sure you pick a metric that reacts to the change in request rate as quickly as your request rate changes!
Adding hysteresis definitely helps to stabilize issues like this. Using rolling windows or exponentially decayed weighting has worked out well in my experience. In general, it seems like load based routing can be quite perfidious if you get the heuristic for “load” wrong.
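A minimal sketch of the exponentially decayed version (the class name and default smoothing factor are my own):

```python
class SmoothedLoad:
    """Exponentially decayed load estimate: a hedge against routing
    on a raw metric that lags behind the real request rate."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha   # smoothing factor: lower = more damping
        self.value = 0.0

    def update(self, sample: float) -> float:
        # Standard exponential weighting: blend the newest sample
        # with the previous estimate instead of trusting it outright.
        self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value
```

Tuning alpha is the hysteresis knob: too low and you route on stale data, too high and you're back to chasing the raw metric.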
I worked on a system that used total connections as our heuristic, measured by the load balancer. The problem we experienced was that some failure scenarios could cause requests to fail quickly compared to normal traffic. In effect, a host would go into a bad state and start failing requests with lower latency than normal traffic, causing the load balancer to route an increasing amount of traffic to the bad host. This happened because the load balancer could only measure connections and didn’t discriminate between good and bad responses.
We ended up injecting fake latency into bad responses at the application layer which worked to prevent this sort of “black hole” effect.
Xerox Grapevine experienced this in 1983. Servers that had free disk space for messages in a congested cluster would announce this, and other nodes would swarm and relocate objects there and you'd get oscillation.
> The moral of the story here is make sure you pick a metric that reacts to the change in request rate as quickly as your request rate changes!
The key things are that you don't react to changes in the metric faster than the metric can move (be appropriately damped), and that you react to the metric "smoothly" (e.g. pick a random server where the odds to get a specific server vary with the loading metric, like you mention).
As others say, it's fundamentally a controls problem... a controls problem where there are many, many actuators and the delay/phase shift is relatively unpredictable. Making things easy for the control system by making reaction smooth and the system overall react slowly (overdamped) is important.
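A sketch of that smooth, probabilistic reaction, weighting each server inversely to its (ideally damped) load estimate; function and variable names here are made up:

```python
import random

def pick_server(loads: dict[str, float], rng=random) -> str:
    # Weight each server inversely to its load estimate, so traffic
    # shifts gradually toward idle servers instead of stampeding the
    # single minimum. The epsilon avoids division by zero on an
    # idle server.
    servers = list(loads)
    weights = [1.0 / (loads[s] + 1e-6) for s in servers]
    return rng.choices(servers, weights=weights, k=1)[0]
```

A lightly loaded server gets most of the traffic, but busier servers still receive some, so no single host sees a step change in request rate.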
That reminds me of a series of blogposts a while back about "the power of two choices", where you randomly pick/sample two nodes and pick the one with the least contention.
The claim is that it results in pretty decent behavior under load while avoiding some problems caused by delayed information.
https://ieeexplore.ieee.org/document/963420
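A sketch of the idea, assuming each balancer has some (possibly stale) load estimate per server:

```python
import random

def two_choices(loads: dict[str, float], rng=random) -> str:
    # Sample two distinct servers uniformly at random, then take the
    # less loaded of just that pair. Stale load data can only mislead
    # you within the pair, which bounds the herding effect.
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b
```

One nice property: the most loaded server in the fleet is never chosen, since any pair containing it resolves to the other member.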
I worked for a major cloud service a few years back, and got a wonderful introduction to how load balancers are, for the most part, somewhat awful.
They work better when it's just one in front of a fleet of servers, and so have the total picture of what is going on, but of course that's quite the bottleneck. So you get two LBs, or more, and they each only have their notion of what the back end fleet is doing. There's no standard feedback mechanism to them at all.
Some offer approaches like measuring response time, but that doesn't work so great as soon as you consider APIs where no two requests perform the same. Was it a fast request that got answered slowly (back-end overloaded?), or a slow request that got answered quickly (back-end bored?). Who knows.
For the service I was working on a few years ago, no two requests were the same by any stretch of the imagination, even for the same API call; they varied in both request size and the computational power required to process them.
As you'd expect, traditional load balancer behaviour actually handled things to an okay degree probably 90% of the time. That 10% was a real killer though.
We had the same problem at Justin.tv with the video servers, with the added wrinkle that every choice had the potential to meaningfully affect load for an hour or more. We eventually ended up putting extremely detailed information into a central database for the load balancer so that it could consider not only server load, but also the load on all of our internal and external network links.
We also had to keep track of how many people were on the webpage for a channel when it went live so that we could preemptively replicate the video stream to enough, but not too many, servers.
A useful metric for "load" is just as hard as doing the load balancing itself.
Requests can be vastly different; unless you have only one application, they're also constantly changing, and there are usually multiple load balancers involved (both horizontally and vertically, since a single request can pass through several). There are also numerous failure conditions under which responses are very fast.
In that situation it's easier to design for an even spread, then work to improve that metric as much as possible as more information becomes available.
Per Brendan Gregg's Gesamtkunstwerk, BPF Performance Tools, I feel like you should be able to measure instructions per cycle at the service level. Even in the cloud if exposed by Xen. And even at the resource utilization level for each container.
http://www.brendangregg.com/bpf-performance-tools-book.html
Of course, you can always just use cloudflare ;)
I managed a team that built a 5x 1000 node distributed setup 10+ years ago.
We ended up going with
a) short DNS TTL + a custom DNS server that sent people to the closest cluster (with some intra-communication to avoid sending people to broken clusters)
b) in each cluster; three layers: 1) Linux keepalived load balancing, 2) Our custom HTTP/TLS-level loadbalancers (~20 nodes per DC), 3) our application (~1000 nodes per DC)
A typical node had 24 (4x6) CPU cores when we started and 48 (4x12) towards the end.
These were not GC/AWS nodes, we were buying hardware directly from IBM/HP/Dell/AMD/Intel/SuperMicro and flying our own people out to mount them in DCs that we hired. Intel gave us some insane rebates when they were recovering from the AMD dominance.
Load-balancing policy: we just randomized targets, but kept sticky sessions. Nodes were stateless, except for shared app properties - for those we built a separate, globally/DC-aware distributed key-value store, which was a whole new thing 12 years ago; we based it on the vague concept of AWS Dynamo. App nodes reported for duty to the load balancers when they were healthy.
We had a static country-to-preferred-DC mapping. That worked fine at this scale.
This setup worked fine for a decade and 250M+ MAUs. We had excellent availability.
At some point like 10 years ago a kinda well known US-based board member really, really wanted us to move to AWS. So we did the cost calculations and realized it would cost like 8X more to host the service on AWS. That shut him up.
Different times. It's so much easier now with AWS/GC to build large-scale services. But also so much more expensive - still! I wonder how long that can last before computation, network and storage really become a commodity.
My favorite sharding/load balancing algorithm is Highest Random Weight, or Rendezvous hashing [0]. It has all the benefits of consistent key hashing without the hotspots, and it doesn't require any coordination between nodes.
[0] https://en.wikipedia.org/wiki/Rendezvous_hashing
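A minimal sketch of the technique (the hash choice and key format here are arbitrary): each node gets a deterministic pseudo-random score per key, and the key lives on the highest-scoring node.

```python
import hashlib

def rendezvous(key: str, nodes: list[str]) -> str:
    # Highest Random Weight: hash (node, key) to a score and pick the
    # node with the highest score. Every client computes the same
    # answer with no coordination, and removing a node only remaps
    # the keys that lived on that node.
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=score)
```

The no-hotspot property comes from each key getting an independent permutation of the nodes, rather than all keys in a hash-ring arc landing on the same successor.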
Two years or so back, I stumbled on power-of-2 load balancing via Twitter Finagle documentation. Found it pretty interesting. Here is a relevant news.yc discussion: https://news.ycombinator.com/item?id=14640811
And of course, the exponentially weighted moving average is a good algorithm too. It is, I believe, used by Elasticsearch. Cloudflare blogged about using it as well: https://blog.cloudflare.com/i-wanna-go-fast-load-balancing-d...
> But the cache is a distraction. The performance you’re going to get will depend on your record sizes and update patterns and anyhow you probably don’t care about the mean or median as much as the P99.
True, your 99th percentile slowest requests won't hit the cache, and certainly caching won't solve all your scaling difficulties.
However, keeping requests for commonly-needed data away from (say) a DB cluster decreases the load on it at a given level of throughput, and that can be good for P99, and (as the post notes) caching can specifically help with super-hot data which can cause problematic hotspots in some sharding strategies.
Obviously situations vary and there're limits, but a cache seems like a legit tool, not just a band-aid, for a decent number of situations.
Yeah; usually sharding implies some form of state or data storage that's split up across machines. If the same state is held on multiple machines it's called "mirroring"; if different state is held on different machines (and there's some function to determine which machine you need to talk to), it's "sharding". Load balancing encompasses both of these strategies, as well as the trivial cases where you have either stateless computations or read-only state that can be replicated.
Load balancing is just a subset of sharding though. It's how you shard your incoming traffic. The same strategies generally apply to both, but you get more leeway with traffic if it's stateless.
Def worth clicking through to the shuffle sharding thread. Simple concept (and somewhat common in my experience) but I’ve never seen the analysis before.
Is it just me, or is this article talking about load balancing, not sharding? My understanding of "sharding" is to split up a database into groups, either by time or by some index key (e.g., A-C on one shard, D-G on another, etc.). This article seems to be about splitting up web traffic, not sharding.
Sharding is about splitting up the data in groups; in this case, the idea is that the web nodes have some local state, reducing the need to hit the databases so much:
"If all the clickstream clicks from the same user or state-change events from the same workflow or whatever go to the same host, you can hold the relevant state in that host’s memory, which means you can respond faster and hit your database less hard."
Is there any (significant) difference between sharding and load balancing?
It seems that in both cases the idea is to distribute (supposedly independent) requests between workers and one of the main difficulties is that requests might not be independent either within one stream (say, in the case of sessions) or between different streams (say, if they need to use one common state).
ucarion|6 years ago
I'm curious how you reached this condition as a requirement:
> The moral of the story here is make sure you pick a metric that reacts to the change in request rate as quickly as your request rate changes!
It makes sense intuitively, but I'm having trouble proving to myself that this is necessary+sufficient.
[1]: https://en.wikipedia.org/wiki/PID_controller