top | item 45915937

paulsutter | 3 months ago

> When a system uses very short intervals, such as sending heartbeats every 500 milliseconds

500 milliseconds is a very long interval on a CPU timescale. Funny how we all tend to judge intervals by human timescales.

Of course, the best way to choose heartbeat intervals is based on metrics like transaction failure rate or latency.

hinkley|3 months ago

Top shelf would be noticing an anomaly in behavior for a node and then interrogating it to see what’s wrong.

Automatic load balancing always gets weird, because it can end up sending more traffic to the sick server instead of less, because the results come back faster. So you have to be careful with status codes.
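
A minimal sketch of the failure mode described above, under invented assumptions: the backend names, the sample data, and both scoring functions are hypothetical and not taken from any real load balancer. A balancer that ranks backends by raw response time prefers the server that fails fast, while one that looks at status codes does not.

```python
def mean(values):
    return sum(values) / len(values)

def pick_naive(stats):
    """Pick the backend with the lowest mean latency, ignoring status codes."""
    return min(stats, key=lambda name: mean([lat for lat, _ in stats[name]]))

def pick_status_aware(stats):
    """Pick by mean latency of successful replies, divided by the success rate."""
    def score(samples):
        ok = [lat for lat, status in samples if status < 500]
        if not ok:
            return float("inf")  # no successful replies: never prefer this backend
        success_rate = len(ok) / len(samples)
        return mean(ok) / success_rate
    return min(stats, key=lambda name: score(stats[name]))

# Hypothetical samples as (latency_ms, HTTP status).
# The sick server answers almost instantly, but mostly with 503s.
stats = {
    "healthy": [(20, 200)] * 10,
    "sick": [(1, 503)] * 9 + [(25, 200)],
}

print(pick_naive(stats))         # sick
print(pick_status_aware(stats))  # healthy
```

Dividing by the success rate is only one possible penalty; the point is that error responses must not count as fast successes.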

just_mc|3 months ago

You have to consider the tail latencies of the system responding plus the network in between. The p99 is typically much higher than the average. Also, you may have to account for GC, as was mentioned in the article. 500ms gets used up pretty fast.
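
To illustrate the gap between average and tail latency, here is a small sketch using made-up round-trip times. The sample values and the `percentile` helper are assumptions for illustration, not measurements from any real system.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical round-trip times in ms: 98 fast replies and 2 slow ones
# (for example, replies delayed by a GC pause).
rtts_ms = [2.0] * 98 + [400.0] * 2

mean_ms = sum(rtts_ms) / len(rtts_ms)
p99_ms = percentile(rtts_ms, 99)
print(f"mean={mean_ms:.2f} ms  p99={p99_ms:.1f} ms")  # mean=9.96 ms  p99=400.0 ms
```

With these numbers the mean suggests plenty of headroom under a 500ms budget, while the p99 alone consumes most of it.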

roncesvalles|3 months ago

500ms is actually a very short interval for heartbeats in modern distributed systems. Kubernetes nodes out of the box send heartbeats every 10s, and Kubernetes only declares a node as dead when there's no heartbeat for 40s.

The relevant timescale here is not CPU time but network time. There's so much jitter in networks that if your heartbeats are on CPU scale (even, say, 100ms) and you wait for 4 missed before declaring dead, you'd just be constantly failing over.
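
The trade-off above can be sketched with a toy detector that declares a node dead once a fixed number of heartbeat intervals pass in silence. This is a simplification with hypothetical names and arrival times, not how Kubernetes implements its node checks; times are integers in milliseconds.

```python
def first_declared_dead(arrivals_ms, interval_ms, missed_allowed):
    """Return the time (ms) at which the node is declared dead, or None.

    The node is declared dead once more than `missed_allowed` intervals
    elapse after the last heartbeat received.
    """
    timeout_ms = interval_ms * missed_allowed
    last = arrivals_ms[0]
    for t in arrivals_ms[1:]:
        if t - last > timeout_ms:
            return last + timeout_ms
        last = t
    return None

# 100 ms heartbeats, 4 missed allowed (400 ms timeout); one 500 ms network stall.
fast = first_declared_dead([0, 100, 200, 700, 800], interval_ms=100, missed_allowed=4)

# 10 s heartbeats, 4 missed allowed (40 s timeout); the same 500 ms stall.
slow = first_declared_dead([0, 10_000, 20_500, 30_000], interval_ms=10_000, missed_allowed=4)

print(fast, slow)  # 600 None
```

The short configuration fails over on a stall the node recovers from; the long one rides it out, at the cost of taking up to 40 s to notice a real failure.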

paulsutter|3 months ago

Speak for your own network. On a densely interconnected datacenter network there would be no appreciable jitter.

4 x 10s heartbeats sounds like an incredibly conservative decision by whoever chose the default, and I can't imagine any critical service keeping those timeouts.

blipvert|3 months ago

Well, it is called a heartbeat after all, not an oscillator beat :-)

nitwit005|3 months ago

cat /proc/sys/net/ipv4/tcp_keepalive_time

7200

That is two hours in seconds.
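
That value is only the system-wide default; an application can override it per socket. A minimal sketch, assuming a platform where Python exposes the `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options (Linux does); the function name and the chosen timings are hypothetical.

```python
import socket

def make_keepalive_socket(idle_s=10, interval_s=5, probes=4):
    """Create a TCP socket with keepalive enabled and shorter timers where supported."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are platform dependent, so set only the ones that exist.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock

sock = make_keepalive_socket()
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(keepalive_enabled)
```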