majke | 3 months ago

> Coordinator sees Node A has significantly fewer rows (logical count) than the cluster average. It flags Node A as "underutilized."

Ok, so you are dealing with a classic problem: you measure A, but what matters is B. For load balancing, a decent metric is response time (and jitter).

For data partitioning, I guess the number of rows is not the right metric? Change it to row count * average row size, or something similar?
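A rough sketch of the difference, assuming each node can report a row count and an average row size. `NodeStats`, `underutilized`, the 0.5 threshold, and the numbers are all made up for illustration, not taken from the system being discussed:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NodeStats:
    # Hypothetical per-node report; field names are illustrative only.
    name: str
    row_count: int
    avg_row_bytes: float

    @property
    def estimated_bytes(self) -> float:
        # Size-weighted load: row count times average row size.
        return self.row_count * self.avg_row_bytes


def underutilized(nodes: List[NodeStats],
                  load: Callable[[NodeStats], float],
                  threshold: float = 0.5) -> List[str]:
    """Names of nodes whose load is below threshold * cluster mean."""
    mean_load = sum(load(n) for n in nodes) / len(nodes)
    return [n.name for n in nodes if load(n) < threshold * mean_load]


nodes = [
    NodeStats("A", row_count=1_000, avg_row_bytes=1_000_000.0),  # few, huge rows
    NodeStats("B", row_count=100_000, avg_row_bytes=1_000.0),
    NodeStats("C", row_count=120_000, avg_row_bytes=1_000.0),
]

by_rows = underutilized(nodes, load=lambda n: n.row_count)
by_bytes = underutilized(nodes, load=lambda n: n.estimated_bytes)
```

With these invented numbers, the row-count metric flags A as underutilized even though A holds the most data, while the size-weighted metric flags B and C instead.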

If you can't measure the thing directly, then take a look at stuff like a "PID controller". This can be approached as a typical control loop problem, although in 99% of cases doing PID for software systems is overkill.
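For reference, a textbook PID loop is only a few lines. This is a minimal sketch, not a tuned controller: the gains, the 100 ms latency target, and the toy "latency grows linearly with traffic weight" model in `simulate` are all assumptions for illustration.

```python
class PID:
    """Textbook discrete PID controller (no anti-windup, no output clamping)."""

    def __init__(self, kp: float, ki: float, kd: float, setpoint: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error = None  # no derivative term on the first sample

    def update(self, measurement: float, dt: float) -> float:
        error = self.setpoint - measurement
        self._integral += error * dt
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def simulate(steps: int) -> float:
    """Steer a node's traffic weight toward a 100 ms latency target."""
    pid = PID(kp=1.0, ki=0.1, kd=0.0, setpoint=100.0)
    weight = 1.0
    for _ in range(steps):
        latency_ms = 50.0 + 100.0 * weight  # toy plant model, purely assumed
        correction = pid.update(latency_ms, dt=1.0)
        # Scale the correction into a small weight change, clamped to [0, 1].
        weight = min(1.0, max(0.0, weight + 0.005 * correction))
    return weight
```

In this toy model the weight settles near 0.5, where the modelled latency equals the target. A real system would need tuning and probably anti-windup, which is part of why it is usually overkill.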

leo_e | 3 months ago

The trouble with mmap is the performance cliff. A node goes from 'fine' to 'dead' almost instantly, which breaks our balancing logic.

You are right that we need better backpressure. Instead of a smarter coordinator, we probably need 'dumber' nodes that aggressively shed load (return 429s) the moment local pressure spikes, rather than waiting for a re-balance.
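Roughly what I have in mind, as a sketch only. It uses an in-flight request count as a stand-in for "local pressure"; the real signal for an mmap-backed node would more likely be something like page-fault rate or memory pressure. `LoadShedder` and `max_in_flight` are hypothetical names, not our actual code:

```python
import threading
from typing import Any, Callable, Tuple


class LoadShedder:
    """Reject work with a 429 as soon as local pressure crosses a limit."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle(self, work: Callable[[], Any]) -> Tuple[int, Any]:
        # Admission check happens locally, without asking a coordinator.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return 429, "overloaded, retry later"
            self._in_flight += 1
        try:
            return 200, work()
        finally:
            # Always release the slot, even if the work raised.
            with self._lock:
                self._in_flight -= 1


shedder = LoadShedder(max_in_flight=1)

# A request that arrives while another is still running gets shed.
outer_status, inner_result = shedder.handle(
    lambda: shedder.handle(lambda: "never runs")
)
```

The point of the design is that the node decides by itself and immediately, so the cliff shows up to clients as fast 429s instead of timeouts while the coordinator catches up.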