
fc417fc802 | 23 hours ago

Given that the APU only has 4 memory channels, isn't this setup comically starved for bandwidth? By the same token, wouldn't you expect performance to scale approximately linearly as you add additional boxes? And wouldn't you be better off with smaller nodes (i.e., less RAM and CPU power per box)?

If I'm right about that, then if you're willing to go in for somewhere in the vicinity of $30k (24x the Max 385 model), you should be able to achieve ChatGPT performance.
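The linear-scaling intuition can be sketched with rough numbers. This assumes a tensor-parallel style setup where each node streams only its shard of the weights per token, and that memory bandwidth is the sole bottleneck; the per-node bandwidth and model size are illustrative guesses, not figures from the thread.

```python
# Sketch: tokens/sec if every node reads its weight shard in parallel
# and memory bandwidth is the only limit. All numbers are assumptions.

def tokens_per_sec(model_bytes: float, nodes: int, bw_per_node: float = 256e9) -> float:
    """Per-token time = (shard size per node) / (per-node memory bandwidth)."""
    return 1.0 / ((model_bytes / nodes) / bw_per_node)

model_bytes = 400e9 * 2  # hypothetical 400B-parameter model in fp16

for n in (1, 4, 24):
    print(f"{n:2d} nodes: ~{tokens_per_sec(model_bytes, n):.2f} tok/s")
```

Under these assumptions throughput grows in direct proportion to node count, which is what makes the "many small boxes" argument plausible, provided the interconnect doesn't eat the gains.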


ibeckermayer|11 hours ago

Good thought... I think you're wrong, because the dominant factor is bandwidth over the interconnect. In this case they're using 5 Gbps Ethernet; compare that to 80-120 Gbps for a Thunderbolt 5-connected Mac Studio cluster: https://www.youtube.com/watch?v=bFgTxr5yst0

fc417fc802|10 hours ago

> I think you're wrong because the dominant factor is bandwidth over the interconnect.

Is it? Why do you say that? I understand inference to be almost entirely bottlenecked on memory bandwidth.

There are n^2 weights per layer but only n activation values in the vector passed between layers. Transmitting a few thousand (or even tens of thousands) of floating-point values does not require a notable amount of bandwidth by modern standards.
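A back-of-envelope comparison of the two costs. The hidden size, fp16 precision, per-layer parameter count, and both bandwidth figures are assumed round numbers, not values from the thread:

```python
# Per layer: bytes a node must read from RAM (weights) vs bytes it must
# ship over the wire (one activation vector to the next pipeline stage).

d = 8192                      # hidden dimension (assumption)
bytes_per_val = 2             # fp16

weight_bytes = 12 * d**2 * bytes_per_val   # ~12*d^2 params/layer (attn + MLP, rough)
activation_bytes = d * bytes_per_val       # the n-value vector between layers

mem_bw = 256e9                # ~quad-channel LPDDR5X, bytes/s (assumption)
net_bw = 5e9 / 8              # 5 Gbps Ethernet, in bytes/s

t_mem = weight_bytes / mem_bw
t_net = activation_bytes / net_bw

print(f"weights read per layer: {weight_bytes / 1e6:.0f} MB ({t_mem * 1e3:.2f} ms)")
print(f"activations per layer:  {activation_bytes / 1e3:.1f} KB ({t_net * 1e6:.1f} us)")
```

With these numbers the activation transfer, even over slow Ethernet, is two orders of magnitude cheaper than the weight read from local RAM, which is the crux of the memory-bandwidth-bound argument.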

Training is an entirely different beast, of course. And depending on the workload, latency can also impact performance. But for running inference with a single query from a single user, I don't see how inter-node bandwidth is going to matter.