(no title)

gwern | 8 days ago

K-V caches are large, but hidden states aren't necessarily that large. And if you can run a model once ridiculously fast, then you can loop it repeatedly and still be fast. So I wonder about the 'modern RNNs' like RWKV here...
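To put rough numbers on the size gap, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names are hypothetical, the model shape (32 layers, 32 heads, head dimension 128, fp16) is made up, and the recurrent state is modeled as one head_dim x head_dim matrix per head per layer, which is a common linear-attention-style layout and not necessarily what any particular RWKV version stores.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a transformer K-V cache for one sequence.

    Each layer stores a key tensor and a value tensor (hence the factor 2),
    each of shape seq_len x n_kv_heads x head_dim. Grows linearly in seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Size of a fixed recurrent state for one sequence.

    ASSUMPTION: one head_dim x head_dim matrix per head per layer.
    There is no seq_len term, so the size is constant in context length.
    """
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem


# Hypothetical model shape, chosen only to make the arithmetic concrete.
N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

kv = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
state = recurrent_state_bytes(N_LAYERS, N_HEADS, HEAD_DIM)

print(f"K-V cache at {SEQ_LEN} tokens: {kv / 2**30:.1f} GiB")
print(f"recurrent state:             {state / 2**20:.1f} MiB")
print(f"ratio:                       {kv // state}x")
```

Under these assumptions the cache is 4 GiB at 8192 tokens against a 32 MiB state, and the ratio works out to 2 * seq_len / head_dim, so it keeps growing with context while the recurrent state stays put.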
