item 47205932

elcritch | 2 days ago
Running inference requires sharing intermediate matrix results between nodes. Faster networking speeds that up.

wokkel | 2 days ago
I read (but cannot find it anymore) that the information sent from layer to layer is minimal. The actual matrix work happens within a layer. They are not doing matrix multiplication over the network (that would be insane latency-wise).

elcritch | 23 hours ago
The LLM/transformer attention layers require an O(n^2) operation across all tokens, which does require significant bandwidth. Yes, the latency hurts performance; that's why it's only achieving ~8 tok/s.
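The O(n^2) point in the thread can be made concrete with a small sketch (not from the thread itself): naive self-attention over n tokens builds an n x n score matrix, so the compute, and any cross-node traffic needed to materialize it, grows quadratically in sequence length. All names here (`attention`, the toy sizes) are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    # scores has shape (n, n) -- this is the quadratic term in
    # sequence length that the comment above refers to.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 8, 4  # toy sizes: 8 tokens, 4-dim head
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # one d-dim output vector per token
```

Doubling n quadruples the size of the score matrix, which is why sharding attention across machines puts real pressure on interconnect bandwidth, whereas the per-layer feed-forward matrices stay local to whichever node holds them.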