item 39597847

Based: Simple linear attention language models

165 points | swyx | 2 years ago | together.ai

14 comments

[+] vessenes|2 years ago|reply
Together.ai is interesting; I think it might be a relatively new business model in tech -- since they sell inference and training, you might be tempted to think of them as an engineering / infrastructure company.

But, because inference is largely quality based -- e.g. customers seem to be selecting "cheapest generation at the quality I require" -- they have a strong incentive to optimize speed of inference at different quality points, and so this paper is coming at the market from a very different place than "quality first, sell second", like OpenAI or Anthropic. On those terms, the ideas and concepts in Based are pretty interesting. Faster inference is awesome, faster sequential token generation is awesome, cheaper long-range memory is awesome.

As revenues at these places grow, they should have access to more compute, which should mean they'll be able to start training at a scale that will get to 'minimum acceptable quality', and then they'll be off to the races.

I'm looking forward to the next year, where companies like together can start putting out models optimized toward specific workflows that compete on quality!

[+] swyx|2 years ago|reply
we actually talked to them a few weeks ago - they're almost 50% a research lab!

https://latent.space/p/together

i think it makes total sense - infra will commoditize rapidly so you have to make research bets on future differentiators. Together is basically the only GPU infra company with a successful research dept (am I missing someone? i probably am) that is likely to pay off turning it into a frontier model lab at some point in future.

[+] anon291|2 years ago|reply
The paper 'Hopfield Networks is All You Need' talks about why the softmax in the standard attention formulation is important for recall, and I'm always surprised its ideas haven't penetrated further in the community. Basically, viewing attention as a Hopfield network, there's a theoretical maximum number of storable patterns: with linear similarity functions it's actually very low, but with the exponential you get very high information density and recall.
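To make the contrast concrete, here's a minimal NumPy sketch (my own illustration, not code from either paper): querying with a stored key, softmax attention retrieves the matching value almost exactly, while a linear attention with the common elu+1 feature map smears the result across all stored pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 32                       # head dimension, number of stored pairs
K = rng.standard_normal((n, d))     # keys
V = rng.standard_normal((n, d))     # values

def softmax_attn(q, K, V, beta=1.0):
    s = beta * (K @ q)
    w = np.exp(s - s.max())         # numerically stable softmax weights
    return (w / w.sum()) @ V

def elu1(x):
    # elu(x) + 1: a common non-negative feature map for linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attn(q, K, V):
    w = elu1(K) @ elu1(q)           # positive weights, far less peaked than exp
    return (w / w.sum()) @ V

q = K[0]                            # query with the first stored key
soft_err = np.linalg.norm(softmax_attn(q, K, V) - V[0])
lin_err = np.linalg.norm(linear_attn(q, K, V) - V[0])
print(soft_err, lin_err)            # softmax recall error is far smaller
```

The exponential makes the self-match weight dominate all cross-term weights, which is exactly the high-density recall the Hopfield view predicts.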
[+] dilawar|2 years ago|reply
True. And its scaling-down properties are much better than any other network I've played with (not an expert). I could run an MNIST benchmark on an ESP32 board.

I also liked the Convex Concave trick in the paper. The guarantee that at every step you are closer to the minima is very nice.
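For intuition on that guarantee, here's a small sketch (my own, using the standard modern-Hopfield energy from that paper, nothing from Based): the CCCP-derived fixed-point update xi <- X softmax(beta X^T xi) never increases the energy, so every step moves you toward a stored pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, beta = 16, 8, 2.0
X = rng.standard_normal((d, n))     # stored patterns as columns

def energy(xi):
    # modern Hopfield energy: -1/beta * logsumexp(beta * X^T xi) + 0.5 * ||xi||^2
    s = beta * (X.T @ xi)
    lse = (np.log(np.exp(s - s.max()).sum()) + s.max()) / beta
    return -lse + 0.5 * (xi @ xi)

def update(xi):
    # CCCP step: xi_new = X @ softmax(beta * X^T xi)
    s = beta * (X.T @ xi)
    p = np.exp(s - s.max())
    return X @ (p / p.sum())

xi = rng.standard_normal(d)         # random starting state
energies = [energy(xi)]
for _ in range(10):
    xi = update(xi)
    energies.append(energy(xi))

# the convex-concave argument guarantees monotone descent
print(energies[0], energies[-1])
```

The descent property falls out of splitting the energy into a convex and a concave part and minimizing a linearized surrogate at each step.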

[+] fancyfredbot|2 years ago|reply
It's striking how this paper reads like experimental science. They have a proposal and run experiments to confirm it rather than proving it mathematically. Also, I love that their new DSL is named ThunderKittens. I'm glad they don't take themselves too seriously.
[+] 3abiton|2 years ago|reply
Interesting observation, that's experimental physics for you.
[+] logicchains|2 years ago|reply
It's interesting how Google's filtering works: searches for "based LLM", "based LLM model" and "based LLM model together" all fail to show any reference to this model, while searches for "Hawk LLM" (another recent LLM with a less-threatening name) correctly show that in the first few results. Presumably Google doesn't want anyone looking for models that are actually "based". Bing doesn't do much better for those terms, but if I search "based linear transformer", Bing correctly gets this post as the first result, while Google ignores it completely.
[+] Nuzzerino|2 years ago|reply
They couldn’t have done it without their surrogate army of downvoters that passively gaslight any presumed readers of whistleblowers like yourself. “It’s a conspiracy theory!” Any discussion about censorship around here gets similar treatment.
[+] vicktorium|2 years ago|reply
The RWKV model was mentioned; it is not based on transformers but on RNNs. [1]

The context window is particularly interesting. I interacted with the people on their Discord some time ago, and the model seems good but isn't widely used yet.

People are noticing the limitations and won't just shift to pure hardware -> energy scaling now.

Transformers allow heavy parallelization, but they're too computationally intensive even with quantization.

People are simply trying to run from the transformer, it seems.

[1] https://github.com/BlinkDL/RWKV-LM

[+] swyx|2 years ago|reply
(not to toot own horn too much but i believe we were also the first big ai pod to feature rwkv: https://latent.space/p/rwkv )

Based presents the first real challenge to rwkv/mamba i've seen, both of which fall prey to the recall tradeoff referenced in TFA. i do have real questions on how the recall can grow unbounded with no tradeoff like that, but then again i haven't seriously studied the math.