(no title)
majke | 3 months ago
Ok, so you are dealing with a classic - you measure A, but what matters is B. For "load" balancing a decent metric is, well, response time (and jitter).
For data partitioning - I guess number of rows is not the right metric? Change it to number*avg_size or something?
If you can't measure the thing directly, then take a look at stuff like "PID controller". This can be approach as a typical controller loop problem, although in 99% doing PID for software systems is an overkill.
leo_e|3 months ago
You are right that we need better backpressure. Instead of a smarter coordinator, we probably need 'dumber' nodes that aggressively shed load (return 429s) the moment local pressure spikes, rather than waiting for a re-balance.