top | item 47145659

(no title)

dastbe | 5 days ago

From a dataplane perspective, it does mean your healthchecks are running from a different location than your proxy. So there are risks where routability is impacted for proxy -> dest but not for healthchecker -> dest.

For general reliability, you can create partitions of checkers and use quorum across partitions to determine what the health state is for a given dest. This also enables centralized monitoring to detect systemic issues with bad healthcheck configuration changes (i.e. are healthchecks failing because the service is unhealthy or because of a bad healthchecker?)

In industry, I personnaly know AWS has one or two health-check-as-a-service systems that they are using internally for LBs and DNS. Uber runs its own health-check-as-a-service system which it integrates with its managed proxy fleet as well as p2p discovery. IIRC Meta also has a system like this for at least some things? But maybe I'm misremembering.

discuss

No comments yet.