For client-side LB, wouldn't moving active health checks out into a dedicated service create more reliability issues, since it's one more service to worry about? Are there any examples of this approach being used in the industry?
IME you end up with both: something like a discrete client, LB, and controller. You can’t rely on any one component to “turn itself off.” For example, a client or LB can easily get into a “wedged” state where it’s unable to take itself out of consideration for traffic. I’ve had silly incidents caused by BGP routes staying up, memory errors/pressure preventing new health check results from being parsed, file systems going read-only, SKB pressure interfering with pipes, and of course the classic difference between a dedicated health check endpoint and actual traffic. In all of those cases, the failure prevents the client or LB from removing itself from the traffic path.
An external controller is able to safely remove traffic from either of the other components when it fails. In addition, the client can still do local traffic analysis, or use in-band signaling, to identify anomalous endpoints and remove itself or them from the traffic path.
Good active probes are actually a pretty meaningful traffic load. It was a HUGE problem for flat virtual network models like Heroku a decade ago. This is exacerbated when you have more clients and more endpoints.
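To put rough numbers on the probe load point, here is a minimal sketch. It assumes every checker probes every endpoint on a fixed interval; the fleet sizes and the 5 second interval are made up for illustration and do not come from any real system.

```python
def probe_rate(checkers: int, endpoints: int, interval_s: float) -> float:
    """Aggregate probes per second when every checker probes every endpoint."""
    return checkers * endpoints / interval_s


# Hypothetical fleet where each client runs its own active checks.
per_client = probe_rate(checkers=1000, endpoints=1000, interval_s=5.0)

# Hypothetical dedicated health check service with a few checkers.
shared = probe_rate(checkers=3, endpoints=1000, interval_s=5.0)

print(f"client-side probing: {per_client:,.0f} probes/s in total")
print(f"dedicated checkers:  {shared:,.0f} probes/s in total")
```

Under these assumptions each endpoint answers 200 probes/s in the first case and 0.6 probes/s in the second, which is why the load grows so quickly as both clients and endpoints are added.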
As a reference, this distributed model is what AWS moved to 15 years ago. And if you look at any of the high-throughput cloud services or CDNs, they’ll have a similar model.
One thing to add for passive healthchecking and client-side load balancing is that throughput and dilution of signal really matter.
There are obviously plenty of low/sparse call volume services where passive healthchecks would take forever to get signal, or where signal is collected so infrequently that it's meaningless. And even with decent RPS, say 1M RPS distributed between 1000 caller replicas and 1000 callee replicas, any one caller-callee pair is only seeing 1 RPS. Depending on your noise threshold, a centralized active healthcheck can respond much faster.
There are some ways to improve signal in the latter case using subsetting and aggregating/reporting controllers, but that all comes with added complexity.
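The dilution arithmetic above can be sketched as follows. This assumes traffic is spread evenly across pairs; the subset size of 50 and the 100-sample threshold are hypothetical values chosen only to show the effect of subsetting.

```python
from typing import Optional


def per_pair_rps(total_rps: float, callers: int, callees: int,
                 subset_size: Optional[int] = None) -> float:
    """Requests/s seen on one caller->callee pair, assuming an even spread.

    With subsetting, each caller only talks to subset_size callees.
    """
    fanout = callees if subset_size is None else subset_size
    return total_rps / (callers * fanout)


def seconds_to_collect(pair_rps: float, samples_needed: int) -> float:
    """Time for one caller to observe samples_needed requests to one callee."""
    return samples_needed / pair_rps


full_mesh = per_pair_rps(1_000_000, callers=1000, callees=1000)
subsetted = per_pair_rps(1_000_000, callers=1000, callees=1000, subset_size=50)

print(full_mesh, seconds_to_collect(full_mesh, 100))   # 1.0 rps, 100.0 s
print(subsetted, seconds_to_collect(subsetted, 100))   # 20.0 rps, 5.0 s
```

Subsetting concentrates each caller's traffic on fewer callees, so each pair gathers passive signal faster, at the cost of managing the subsets.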
From a dataplane perspective, it does mean your healthchecks are running from a different location than your proxy. So there are risks where routability is impacted for proxy -> dest but not for healthchecker -> dest.
For general reliability, you can create partitions of checkers and use quorum across partitions to determine what the health state is for a given dest. This also enables centralized monitoring to detect systemic issues with bad healthcheck configuration changes (i.e. are healthchecks failing because the service is unhealthy or because of a bad healthchecker?)
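A minimal sketch of the quorum idea, not a description of how any of the systems mentioned here actually work. The partition names, the majority rule inside a partition, and the "keep the previous state when there is no quorum" fallback are all assumptions, and `quorum` is assumed to be more than half the number of partitions.

```python
from typing import Dict, List, Optional, Tuple


def partition_vote(probe_results: List[bool]) -> Optional[bool]:
    """Strict majority of one partition's checkers; None on a tie or no data."""
    ups = sum(probe_results)
    downs = len(probe_results) - ups
    if ups == downs:
        return None
    return ups > downs


def quorum_health(partitions: Dict[str, List[bool]], quorum: int,
                  previous: bool) -> Tuple[bool, List[str]]:
    """Return (healthy, dissenting_partitions) for one destination.

    Falls back to the previous state when no side reaches quorum.
    """
    votes = {name: partition_vote(r) for name, r in partitions.items()}
    ups = sorted(n for n, v in votes.items() if v is True)
    downs = sorted(n for n, v in votes.items() if v is False)
    if len(downs) >= quorum:
        return False, ups
    if len(ups) >= quorum:
        return True, downs
    return previous, []


# Hypothetical probe results for one dest from three checker partitions.
results = {
    "az-a": [True, True, True],
    "az-b": [True, True, False],
    "az-c": [False, False, False],
}
print(quorum_health(results, quorum=2, previous=True))  # (True, ['az-c'])
```

The list of dissenting partitions is the hook for centralized monitoring: a partition that disagrees with the quorum across many dests at once points at a bad checker or a routing problem rather than at an unhealthy service.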
In industry, I personally know AWS has one or two health-check-as-a-service systems that they are using internally for LBs and DNS. Uber runs its own health-check-as-a-service system which it integrates with its managed proxy fleet as well as p2p discovery. IIRC Meta also has a system like this for at least some things? But maybe I'm misremembering.
donavanm|6 days ago
dastbe|5 days ago
dastbe|6 days ago