(no title)
hanselot | 2 years ago
10M context with that retrieval rate is such a monstrous leap. And to top it off, we got LargeWorldModel (LWM) in the same week, capable of 1M-token context with insane retrieval in the open source space. So not only is the open source world currently technically ahead of ChatGPT, so is Google. Which is why OpenAI had to announce Sora: Google's model is so far ahead of the competition. That's also why it will probably be ages before we get access to Sora. Now don't get me wrong, the average person can't afford 32 TPUs to run LWM, but we already have quants for it, which is a step towards letting the average person (who somehow has 24-48 GB of VRAM) get a taste of that power.
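For what it's worth, here is a minimal sketch of what "a taste of that power" on a single 24-48 GB card might look like, assuming a 4-bit quant loaded through the usual transformers + bitsandbytes route. The repo id and prompt are my assumptions, not details from this thread:

```python
# Hypothetical sketch: loading a 4-bit quantized LWM text checkpoint on one GPU.
# The repo id below is an assumption, not a verified reference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "LargeWorldModel/LWM-Text-Chat-1M"  # assumed Hugging Face repo id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization via bitsandbytes
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers across whatever VRAM is available
)

prompt = "Summarize the following document:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Caveat: this only gets the quantized weights onto the card; actually feeding anything close to 1M tokens would still blow well past a 24-48 GB VRAM budget on the KV cache alone.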
What is also striking is that the new models are all multimodal by default. We not only leapfrogged in context size, but also in modalities. The models seem to only benefit from having more modalities to work with.
I think Bill Gates's statement claiming that "LLMs have reached a plateau" itself indicates they don't believe they can make more money from training better/larger models. Which suggests they have already done as well as they could with their existing people, and are now "years" behind Google. I never thought Google could catch up, especially after their infamous "We have no moat" memo. But it seems they actually doubled down and did something about it.
To a lot of people, last Thursday was a very nihilistic day for local models, as the goalposts shifted from 128-200k context to 10M tokens with near-perfect retrieval. It's honestly scary. But luckily we got LWM, and that means we have only been 10xed.
Now the local crowd will work on figuring out how to bridge the gap before being leapfrogged again. What is really insane is that we have had LLaMA 2 for quite a while now, and nobody else figured out how to get this result out of it.
I still believe there are modifications to the architecture of MoE that will unlock new powers that we haven't even dreamed of yet.
Sorry, this was supposed to be well thought out, but it turned into more of a stream of consciousness, and I honestly had no intention of disagreeing with you.
pk-protect-ai | 2 years ago
If I remember the LWM paper correctly, it mentioned something about a 4M-token context. So the gap is 10M / 4M ≈ 2.5x, not 10x.
> What is really insane is that we have had LLaMA 2 for quite a while now, and nobody else figured out how to get this result out of it.
This isn't true. For now, extending context to 10M tokens is brute-forced with money (the increased hardware requirements for training and inference, and the longer training time, are ultimately financial costs too). And for now, there simply is no leapfrogging solution, for open source or commercial models, that would decrease those costs by orders of magnitude.
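A rough back-of-the-envelope sketch (my numbers, not from the thread) of why longer context is mostly a money problem: with standard dense self-attention, compute grows roughly quadratically with sequence length and the KV cache grows linearly, so 128k → 10M tokens is a huge multiplier on both. The model shape below is a hypothetical 7B-class configuration:

```python
# Back-of-the-envelope scaling for context extension (illustrative assumptions only).
# Assumes dense self-attention and a hypothetical 7B-class shape: hidden=4096, 32 layers.

def attention_flops(seq_len: int, hidden: int = 4096, layers: int = 32) -> float:
    """Very rough FLOPs for the QK^T and attention-times-V matmuls in one forward pass."""
    # Two (seq_len x hidden) @ (hidden x seq_len)-shaped matmuls per layer -> O(seq_len^2 * hidden).
    return 4.0 * seq_len ** 2 * hidden * layers

def kv_cache_bytes(seq_len: int, hidden: int = 4096, layers: int = 32, bytes_per_val: int = 2) -> float:
    """Memory for keys + values at fp16, ignoring grouped-query-attention tricks."""
    return 2.0 * seq_len * hidden * layers * bytes_per_val

for ctx in (128_000, 1_000_000, 10_000_000):
    print(f"{ctx:>12,} tokens: "
          f"attention ~{attention_flops(ctx):.2e} FLOPs, "
          f"KV cache ~{kv_cache_bytes(ctx) / 1e9:.0f} GB")
```

Under those assumptions, the attention FLOPs alone go up by roughly (10M / 128k)^2 ≈ 6,000x, and the fp16 KV cache lands in the terabytes, which is exactly the "solved with more hardware, not a new trick" situation described above.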