fc417fc802 | 4 hours ago
Is it? Why do you say that? I understand inference to be almost entirely bottlenecked on memory bandwidth.
There are n^2 weights per layer but only n state values in the activation vector passed between layers. Transmitting a few thousand (or even tens of thousands) of fp values doesn't require a notable amount of bandwidth by modern standards.
Training is an entirely different beast of course. And depending on the workload, latency can also impact performance. But for running inference on a single query from a single user I don't see how inter-node bandwidth is going to matter.
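To put rough numbers on that asymmetry: a back-of-envelope sketch, assuming a hypothetical 70B-class transformer (hidden size d = 8192, fp16, roughly 12·d² parameters per layer counting attention projections plus a 4x MLP). The model dimensions are illustrative assumptions, not taken from any specific deployment.

```python
# Back-of-envelope: bytes crossing an inter-node link vs bytes read from
# local memory, per layer, per token. Dimensions are illustrative
# (assumed 70B-class model: d = 8192, fp16).
d = 8192           # hidden size (assumption)
bytes_per_val = 2  # fp16

# What inter-node links carry in pipeline-style inference:
# one d-sized activation vector per token at each layer boundary.
activation_bytes = d * bytes_per_val
print(f"activations per boundary: {activation_bytes / 1024:.0f} KiB")

# What local memory bandwidth carries: the layer's weights.
# ~4 attention projections (d x d) + MLP (~8 d^2 with a 4x FFN) ≈ 12 d^2.
weight_bytes = 12 * d * d * bytes_per_val
print(f"weights per layer: {weight_bytes / 2**30:.1f} GiB")

# Ratio: weights dominate by a factor of 12 * d, i.e. ~100,000x here.
print(f"ratio: {weight_bytes // activation_bytes:,}x")
```

So at these assumed dimensions each layer boundary ships ~16 KiB while each layer reads ~1.5 GiB of weights from local memory: roughly five orders of magnitude apart, which is the gap the comment is pointing at.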