Good thought... I think you're wrong because the dominant factor is bandwidth over the interconnect. In this case they're using 5Gbps over Ethernet; compare that to 80-120 Gbps for a Thunderbolt 5 connected Mac Studio cluster: https://www.youtube.com/watch?v=bFgTxr5yst0
fc417fc802|10 hours ago
Is it? Why do you say that? I understand inference to be almost entirely bottlenecked on memory bandwidth.
There are n^2 weights per layer but only n state values in the vector that exists between layers. Transmitting a few thousand (or even tens of thousands) of fp values does not require a notable amount of bandwidth by modern standards.
Training is an entirely different beast of course. And depending on the workload latency can also impact performance. But for running inference with a single query from a single user I don't see how inter-node bandwidth is going to matter.