fc417fc802 | 4 hours ago
Is it? Why do you say that? I understand inference to be almost entirely bottlenecked on memory bandwidth.
There are n^2 weights per layer but only n state values in the activation vector passed between layers. Transmitting a few thousand (or even tens of thousands) of fp values doesn't require a notable amount of bandwidth by modern standards.
Training is an entirely different beast of course. And depending on the workload, latency can also impact performance. But for running inference on a single query from a single user I don't see how inter-node bandwidth is going to matter.
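To put rough numbers on that asymmetry: a back-of-envelope sketch, assuming a hypothetical 70B-class transformer (hidden size d = 8192, fp16, roughly 12·d² parameters per layer counting attention projections plus a 4x MLP). The model dimensions are illustrative assumptions, not taken from any specific deployment.

```python
# Back-of-envelope: bytes crossing an inter-node link vs bytes read from
# local memory, per layer, per token. Dimensions are illustrative
# (assumed 70B-class model: d = 8192, fp16).
d = 8192           # hidden size (assumption)
bytes_per_val = 2  # fp16

# What inter-node links carry in pipeline-style inference:
# one d-sized activation vector per token at each layer boundary.
activation_bytes = d * bytes_per_val
print(f"activations per boundary: {activation_bytes / 1024:.0f} KiB")

# What local memory bandwidth carries: the layer's weights.
# ~4 attention projections (d x d) + MLP (~8 d^2 with a 4x FFN) ≈ 12 d^2.
weight_bytes = 12 * d * d * bytes_per_val
print(f"weights per layer: {weight_bytes / 2**30:.1f} GiB")

# Ratio: weights dominate by a factor of 12 * d, i.e. ~100,000x here.
print(f"ratio: {weight_bytes // activation_bytes:,}x")
```

So at these assumed dimensions each layer boundary ships ~16 KiB while each layer reads ~1.5 GiB of weights from local memory: roughly five orders of magnitude apart, which is the gap the comment is pointing at.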