top | item 47206012

wokkel | 1 day ago

I read (but cannot find this anymore) that the information sent from layer to layer is minimal. The actual matrix work happens within a layer. They are not doing matrix multiplication over the network (that would be insane latency-wise).
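A rough sketch of the claim above, with hypothetical sizes (ballpark Llama-7B-like dimensions, single-token decoding): the tensor crossing a pipeline-stage boundary is just the activations for the current token, while the big matrices the matmuls run against stay local to the layer.

```python
import numpy as np

# Hypothetical, illustrative sizes (roughly 7B-class model dimensions).
d_model = 4096      # hidden size
d_ff = 11008        # feed-forward projection size
seq_len = 1         # one token per step during autoregressive decoding

# What crosses the network between pipeline stages: only the activations.
activations = np.zeros((seq_len, d_model), dtype=np.float16)
bytes_over_wire = activations.nbytes   # 4096 values * 2 bytes = 8 KiB

# What stays local inside a layer: a weight matrix the matmul runs against.
w_up = np.zeros((d_model, d_ff), dtype=np.float16)
bytes_local = w_up.nbytes              # ~86 MiB for one projection matrix

print(bytes_over_wire)  # 8192
print(bytes_local)      # 90177536
```

So per layer boundary you ship kilobytes of activations, while the megabytes of weights (and the matmul FLOPs against them) never leave the machine hosting that layer.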

elcritch | 6 minutes ago

The LLM/transformer attention layers require an O(n^2) operation across all tokens, which does require significant bandwidth.

Yes, the latency hurts performance; that's why it's only achieving ~8 tok/s.
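A toy sketch of why attention is quadratic in sequence length (tiny hypothetical sizes, not any particular model): every token's query is scored against every token's key, so the score matrix is n x n.

```python
import numpy as np

n, d = 8, 16                        # hypothetical: 8 tokens, head dim 16
rng = np.random.default_rng(0)
q = rng.standard_normal((n, d))     # queries, one row per token
k = rng.standard_normal((n, d))     # keys, one row per token

# Scaled dot-product scores: shape (n, n), i.e. O(n^2) pairwise entries.
scores = q @ k.T / np.sqrt(d)

# Softmax over each row turns scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(scores.shape)  # (8, 8)
```

Note this is the full-sequence (prefill) case; during decoding with a KV cache, each new step computes one query row against all cached keys, but the cumulative work across the sequence is still quadratic.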