top | item 34447048

aljungberg | 3 years ago

The RWKV model seems really cool. If you could get transformer-like performance with an RNN, the “hard coded” context length problem might go away. (That said, RNNs famously have infinite context in theory and very short context in reality.)

Is there a primer on what RWKV does differently? According to the GitHub page, the key seems to be multiple channels of state with different decay rates, giving, I assume, a combination of short- and long-term memory. But isn't that what LSTMs were supposed to do too?
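
The "channels with different decay rates" idea from the README can be sketched in a few lines. This is a toy illustration with made-up decay values, not RWKV's actual WKV mechanism:

```python
import numpy as np

def decay_mix(inputs, decay):
    """Toy per-channel exponentially decaying memory (hypothetical sketch,
    not RWKV code): each channel keeps a running state whose half-life is
    set by its own decay rate."""
    state = np.zeros_like(decay)
    outputs = []
    for x in inputs:
        state = decay * state + x  # fast channels forget quickly, slow ones persist
        outputs.append(state.copy())
    return outputs

# Two channels fed an impulse at t=0: one near-instant memory, one long memory.
decay = np.array([0.1, 0.99])
outs = decay_mix([np.array([1.0, 1.0])] + [np.array([0.0, 0.0])] * 4, decay)
# After a few steps the fast channel has forgotten the impulse;
# the slow channel still remembers most of it.
```

With many channels spanning a range of decay rates, the model gets both short- and long-term memory at once, which sounds like the same goal LSTMs pursue with learned gates.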

thegeomaster | 3 years ago

There's already research that tries to fix this problem with transformers in general, like Transformer-XL [1]. I'm a bit puzzled that I don't see much interest in getting a pre-trained model out that uses this architecture; it seems to give good results.

[1]: https://arxiv.org/abs/1901.02860

gok | 3 years ago

T5 uses relative positional encoding
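
For the curious, the core of relative positional encoding can be sketched in a couple of lines. This is a hypothetical illustration of relative position offsets only, not T5's actual bucketing or bias tables:

```python
import numpy as np

def relative_positions(seq_len):
    """Toy sketch: the attention bias in a relative scheme depends on
    (key_pos - query_pos), not on absolute positions, so nothing is
    tied to a fixed trained context length."""
    pos = np.arange(seq_len)
    return pos[None, :] - pos[:, None]  # shape (seq_len, seq_len)

rel = relative_positions(4)
# rel[i, j] == j - i for every query position i and key position j
```

Because the model only ever sees offsets, the same learned biases apply at any sequence length, which is why relative schemes sidestep the hard-coded context problem absolute position embeddings have.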

solomatov | 3 years ago

My understanding is that RNNs aren't worse than Transformers per se, they're just slower to train: Transformers use the GPU much more efficiently, i.e. much more of the computation can run in parallel.

Hendrikto | 3 years ago

Also slower to perform inference on. RNNs have to be much more sequential.
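
A minimal sketch of why (illustrative NumPy, not any particular model): each RNN step consumes the previous hidden state, so the time loop cannot be parallelized across positions, whereas attention over a whole sequence is batched matrix math a GPU handles in one shot.

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(8, 8)) * 0.1  # toy recurrent weights (made up)
W_x = rng.normal(size=(8, 8)) * 0.1  # toy input weights (made up)

def rnn_forward(xs):
    """Minimal RNN loop: step t needs the hidden state from step t-1,
    so the T steps must run one after another."""
    h = np.zeros(8)
    for x in xs:  # inherently sequential over the sequence
        h = np.tanh(W_h @ h + W_x @ x)
    return h

h = rnn_forward([np.ones(8)] * 5)
```

A transformer layer, by contrast, computes attention over all positions with a few large matrix multiplications, which is exactly the workload GPUs are built for.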

euclaise | 3 years ago

We also don't have evidence that they scale the way transformers do

swyx | 3 years ago

> RNNs famously have infinite context in theory and very short context in reality.

Any sources to read more about this please? It's the first I've heard of it.

solomatov | 3 years ago

Naive RNNs have vanishing gradients, but LSTMs and GRUs are much better in this respect.
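
A toy numerical illustration of that point (my own sketch, not from the thread): backprop through a naive RNN multiplies one Jacobian per timestep, and when their norms sit below 1, the gradient reaching early timesteps shrinks exponentially with sequence length.

```python
import numpy as np

# Toy sketch of vanishing gradients: repeated Jacobian products in a
# naive RNN. With small recurrent weights (spectral norm < 1), the
# gradient backprop sends to early timesteps decays exponentially.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.05  # small recurrent weight matrix (made up)

grad = np.eye(16)
norms = []
for _ in range(50):
    grad = W.T @ grad  # one backprop step (tanh' <= 1, so this bounds the real product)
    norms.append(np.linalg.norm(grad))
# norms[-1] is many orders of magnitude below norms[0]
```

LSTMs and GRUs mitigate this by routing the gradient through an additively updated cell state with gates close to identity, so the product of Jacobians no longer collapses to zero as quickly.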