> The bitter lesson [1] is going to eventually come for all of these. Eventually we'll figure out how to machine-learn the heuristic rather than hard-code it. Recurrent neural networks (RNNs) do this implicitly, but we don't yet know how to effectively train RNNs on ultra-deep sequences.

Linear RNNs and RWKV are examples of RNNs trained on deep sequences:
https://arxiv.org/abs/2303.06349
https://arxiv.org/abs/2305.13048
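To make the "linear RNN" idea concrete, here is a minimal sketch in the spirit of the first paper above: a recurrence with a diagonal (elementwise) transition, which is what makes training on very long sequences tractable. All names, shapes, and constants here are illustrative, not taken from either paper.

```python
import numpy as np

def linear_rnn(xs, a, B):
    """Scan h_t = a * h_{t-1} + B @ x_t with a diagonal transition a.

    Keeping |a| < 1 elementwise keeps the recurrence stable however
    long the sequence gets; a close to 1 means long memory.
    """
    h = np.zeros(B.shape[0])
    hs = []
    for x in xs:
        h = a * h + B @ x
        hs.append(h.copy())
    return np.stack(hs)

rng = np.random.default_rng(0)
a = 0.99 * np.ones(4)            # decay near 1 -> long-range memory
B = rng.normal(size=(4, 3))      # input projection
xs = rng.normal(size=(100, 3))   # toy length-100 input sequence
hs = linear_rnn(xs, a, B)        # (100, 4) hidden states
```

Because the transition is linear and diagonal, the whole scan can also be computed in parallel over the sequence (as a prefix sum), which is the key trick these architectures exploit.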
sdenton4|2 years ago
https://srush.github.io/annotated-s4/
HiPPO was brilliant: instead of working with the raw sequence, you work with a weighted Laplace transform of it, and instead of actually computing the transform you derive the rule for updating it as new data arrives. Furthermore, you can 'band-limit' the transform (similarly to PCA) to keep only the most important components while still preserving most of the information in the sequence; this is a common and quite effective compression technique.
Any 'fast' transformer is going to be working with some kind of sampling, aggregation, or compression of the long sequence. Sampling is ultimately going to be too noisy, and standard aggregations are going to be too coarse. So the thing to bet on is better compression techniques, which is what the S4/RWKV groups are ultimately working on.
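The "band-limit and keep most of the information" point can be illustrated with a truncated orthogonal-polynomial projection. This is only a rough stand-in for what HiPPO actually does (HiPPO additionally derives an online update rule for the coefficients); the signal and the cutoff K here are made up.

```python
import numpy as np
from numpy.polynomial import legendre

# Compress a length-N sequence down to its first K Legendre
# coefficients, then reconstruct from just those K numbers.
N, K = 512, 16
t = np.linspace(-1, 1, N)
signal = np.sin(4 * t) + 0.3 * t**2        # smooth toy sequence

coeffs = legendre.legfit(t, signal, deg=K - 1)  # K numbers summarize N samples
recon = legendre.legval(t, coeffs)              # reconstruction from the summary

err = np.max(np.abs(signal - recon))  # tiny for smooth signals despite 32x compression
```

For smooth signals the truncated coefficients preserve almost everything, which is the sense in which this kind of compression can be lossy in size but not in usable information.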
LoganDark|2 years ago
The current versions of RWKV slowly go insane when exposed to sequences that are too long: the state gradually diverges once you run past the context length used in training. They are experimenting with ways to avoid this, though: https://github.com/Blealtan/RWKV-LM-LoRA/tree/dev-infctx
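A toy picture of why a recurrent state can diverge past its training horizon: if the effective state decay is even slightly above 1, the state stays reasonable for the sequence lengths seen in training but blows up on much longer ones. The numbers below are purely illustrative and are not RWKV's actual dynamics.

```python
def run(decay, steps):
    # Scalar recurrent state with a constant input drive.
    h = 1.0
    for _ in range(steps):
        h = decay * h + 0.1
    return h

stable = run(0.999, 100_000)    # decay < 1: settles near 0.1 / (1 - 0.999) = 100
unstable = run(1.001, 100_000)  # decay > 1: grows without bound
```

The failure mode is invisible at training lengths (a few thousand steps of `decay = 1.001` still looks fine) and only shows up when inference runs far past them, which matches the "slowly goes insane" description.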