If you want to prove a new alternative to attention (i.e. show that it works and/or is faster in a real-world scenario) without breaking the bank, one of the best ways is probably to retrain an existing model with the attention modules swapped out. Once you have such a model, you can run apples-to-apples benchmarks. This has been done successfully in the past:
https://huggingface.co/featherless-ai/QRWKV-72B
Note that this is a 72B model, which would be very expensive to train from scratch, but here the conversion cost less than $2,000.
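To make the "swap the attention modules, reuse everything else" idea concrete, here is a minimal sketch in plain Python. All class and function names are hypothetical stand-ins (a real conversion like QRWKV's operates on actual pretrained weights and is followed by fine-tuning); the point is only the shape of the procedure: walk the blocks, replace each attention module in place, and leave the rest of the pretrained model untouched.

```python
# Hypothetical sketch of an attention-module swap; names are illustrative,
# not any real library's API.

class SoftmaxAttention:
    def __init__(self, dim):
        self.dim = dim  # in a real model, pretrained Q/K/V weights live here


class LinearAttention:
    """Stand-in for the alternative mechanism (e.g. an RWKV-style layer)."""
    def __init__(self, dim):
        self.dim = dim  # would be initialized from / distilled against the original


class Block:
    def __init__(self, dim):
        self.attn = SoftmaxAttention(dim)
        self.mlp = ("pretrained-mlp", dim)  # untouched, reused as-is


class Model:
    def __init__(self, n_layers, dim):
        self.blocks = [Block(dim) for _ in range(n_layers)]


def convert(model):
    """Replace every attention module in place; all other weights are reused,
    which is why this is far cheaper than training from scratch."""
    for block in model.blocks:
        block.attn = LinearAttention(block.attn.dim)
    return model


model = convert(Model(n_layers=4, dim=64))
assert all(isinstance(b.attn, LinearAttention) for b in model.blocks)
```

After the swap, a short fine-tuning run adapts the new modules, and the converted model can be benchmarked head-to-head against the original.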
Herring|1 month ago
https://github.com/KellerJordan/modded-nanogpt
naasking|1 month ago
https://www.techrxiv.org/users/685780/articles/1375955-topol...
nickpsecurity|1 month ago
https://www.databricks.com/blog/mosaicbert
I'll add that they should do a number of small training runs with different architectures and data mixes. That would demonstrate generalization.
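The suggested protocol can be sketched as a simple sweep over the cross product of architectures and data mixes; the labels and the `train_and_eval` stub below are hypothetical placeholders for real small-scale runs.

```python
from itertools import product

# Hypothetical sweep dimensions; a real study would pick its own.
architectures = ["softmax_attention", "linear_attention"]
data_mixes = ["web_heavy", "code_heavy", "balanced"]


def train_and_eval(arch, mix):
    # Placeholder for an actual small training run + held-out evaluation.
    return {"arch": arch, "mix": mix, "eval_loss": None}


# One run per (architecture, data mix) pair shows whether the alternative
# holds up across settings rather than on a single cherry-picked setup.
results = [train_and_eval(a, m) for a, m in product(architectures, data_mixes)]
assert len(results) == len(architectures) * len(data_mixes)
```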
tuned|1 month ago
It is somewhat like what is called a "Grassmann-like flow", but without the Plücker embedding; it is also similar to what is done in DavisTensor, but relying on a spectral Laplacian instead of purely geometric distances.
The problem with a lot of prior work is that it focuses on dense representations. This architecture focuses on sparse representations and provides a new approximation computation based on energy-informed graphs.
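For readers unfamiliar with the term: the "spectral Laplacian" of a graph is the standard graph Laplacian L = D - A (degree matrix minus adjacency matrix), whose eigenvalues and eigenvectors encode the graph's connectivity. The sketch below only constructs L for a small undirected graph and checks its defining property (every row sums to zero); it is a generic illustration of the object, not the architecture described above.

```python
# Generic graph-Laplacian construction, L = D - A, in plain Python.
# This illustrates the standard object, not the specific architecture.

def laplacian(edges, n):
    """Build L = D - A for an undirected graph on n nodes."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1          # adjacency matrix
    deg = [sum(row) for row in A]       # node degrees (diagonal of D)
    return [
        [(deg[i] if i == j else 0) - A[i][j] for j in range(n)]
        for i in range(n)
    ]


L = laplacian(edges=[(0, 1), (1, 2)], n=3)  # path graph 0-1-2
# Each row of a graph Laplacian sums to zero by construction.
assert all(sum(row) == 0 for row in L)
```

Spectral methods then work with the eigendecomposition of L (e.g. the smallest nonzero eigenvalue measures how well-connected the graph is), which is what distinguishes them from purely geometric-distance approaches.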