(no title)
lettergram | 5 months ago
https://medium.com/capital-one-tech/why-you-dont-necessarily...
At the time it was clear to everyone on the team that RNNs, just like transformers later on, are general-purpose frameworks that really only need more data and scale to work. In the 2018-2020 era, and probably still today, they were slower to train. They were also less prone to certain pitfalls, but overall had the same characteristics.
In 2019-2020 I was convinced that transformers would give way to a better architecture. RNNs in particular trained faster and required less data, particularly when combined with several architectural components I won't get into. I believe that's still true today, though I haven't worked on it in the last 2-3 years.
That said, transformers "won" because they are better overall building blocks and don't require the careful handling RNNs do. Combined with the compute optimizations that now exist, I don't see that changing in the near term. Folks are even working to convert transformers to RNNs:
https://medium.com/@techsachin/supra-technique-for-linearizi...
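To give a sense of why a transformer layer can be converted to an RNN at all, here is a minimal sketch of the general idea behind linearized attention: replace the softmax similarity with a kernel feature map `phi`, after which causal attention can be computed with a fixed-size running state instead of re-attending over the whole history. The function names and the choice `phi(x) = elu(x) + 1` are illustrative assumptions, not the specifics of the SUPRA technique linked above.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map, a common choice for linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_parallel(Q, K, V):
    """Transformer-style O(T^2) form: each step attends over all past positions."""
    T = Q.shape[0]
    out = np.zeros_like(V)
    for t in range(T):
        w = phi(Q[t]) @ phi(K[: t + 1]).T      # unnormalized attention weights
        out[t] = (w @ V[: t + 1]) / w.sum()
    return out

def linear_attention_recurrent(Q, K, V):
    """Same outputs computed as an RNN with constant-size state per step."""
    d = Q.shape[1]
    S = np.zeros((d, V.shape[1]))              # running sum of phi(k_t) v_t^T
    z = np.zeros(d)                            # running sum of phi(k_t), for normalization
    out = np.zeros_like(V)
    for t in range(len(Q)):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        out[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
# Both formulations agree; the recurrent one is what makes RNN-style inference possible
assert np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```

The recurrent form is the payoff: generation needs only the state `(S, z)` rather than a growing KV cache, which is exactly the RNN-like inference profile these conversion efforts are after.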
There are also RNN-based models beating Qwen 3 8B on certain benchmarks.
I suspect that, over time, the other methods my team explored, along with other types of networks and nodes, will continue to push state-of-the-art LLMs beyond transformers.
algo_trader | 5 months ago
Counter-consensus is where the alpha is...
Do you think RNN/RWKV models have an edge in verifiable domains with tree-search at inference time? You could use cheaper GPUs and do multiple sampling.
(but of course, it's hard to beat the sunk cost of a foundation model)