Due to perverse incentives and the long history of over-claimed model accuracy, it's very hard to believe anything until it's open source and can be tested out.
That being said, I do very much believe that the computational efficiency of models is going to go up drastically over the coming months, which does pose interesting questions about Nvidia's throne.
I don't doubt the increase in efficiency. I doubt the "drastically".
We already see models become more and more capable per weight and per unit of compute. I don't expect a state-change breakthrough. I expect: more of the same. A SOTA 30B model from 2026 is going to be ~30% better than one from 2025.
Now, expecting that to hurt Nvidia? Delusional.
No one is going to stop and say "oh wow, we got more inference efficiency - now we're going to use less compute". A lot of people are going to say "now we can use larger and more powerful models for the same price" or "with cheaper inference for the same quality, we can afford to use more inference".
If the claims in the abstract are true, then this is legitimately revolutionary. I don't believe it. There are probably some major constraints/caveats that keep these results from generalizing. I'll read through the paper carefully this time instead of skimming it, and come back with thoughts after I've digested it.
What's not to believe? Qwerky-32b already did something similar as a finetune of QwQ-32b that doesn't use a traditional attention architecture.
And hybrid models aren't new; an MLA-based hybrid model is basically Deepseek V3.2 in a nutshell. Note that Deepseek V3.2 (and V3.1, R1, and V3... and V2, actually) all use MLA. Deepseek V3.2 is what adds the linear attention stuff.
Actually, since Deepseek V3.1 and Deepseek V3.2 are just post-training on top of the original Deepseek V3 pretrain run, I'd say this paper is basically doing exactly what Deepseek V3.2 did in terms of efficiency.
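For anyone who hasn't looked at MLA: the trick is to cache one small latent vector per token and re-expand it into per-head keys and values at attention time. Here's a toy sketch of that idea (my own simplified version with made-up dimensions and names like W_dkv/W_uk; the decoupled RoPE path and anything specific to this paper or Deepseek's code is omitted):

    import torch, torch.nn as nn, torch.nn.functional as F

    class ToyMLA(nn.Module):
        """Toy multi-head latent attention: only a d_latent vector per token is
        cached, instead of n_heads * d_head keys AND values."""
        def __init__(self, d_model=512, n_heads=8, d_head=64, d_latent=64):
            super().__init__()
            self.n_heads, self.d_head = n_heads, d_head
            self.W_q   = nn.Linear(d_model, n_heads * d_head, bias=False)
            self.W_dkv = nn.Linear(d_model, d_latent, bias=False)           # down-projection; its output is what gets cached
            self.W_uk  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # re-expand latent -> keys
            self.W_uv  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # re-expand latent -> values
            self.W_o   = nn.Linear(n_heads * d_head, d_model, bias=False)

        def forward(self, x, kv_cache=None):
            B, T, _ = x.shape
            latent = self.W_dkv(x)                              # (B, T, d_latent)
            if kv_cache is not None:
                latent = torch.cat([kv_cache, latent], dim=1)   # the cheap thing we keep around
            S = latent.shape[1]
            q = self.W_q(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            k = self.W_uk(latent).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
            v = self.W_uv(latent).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
            out = F.scaled_dot_product_attention(q, k, v)       # causal mask omitted for brevity
            out = out.transpose(1, 2).reshape(B, T, -1)
            return self.W_o(out), latent                        # latent is the new cache

    # Cache per token: 64 floats here vs 2 * 8 * 64 = 1024 for vanilla MHA (~6%).

With these toy numbers the per-token cache is about 6% of vanilla multi-head attention, which is the ballpark of the "90% smaller KV cache" claims people quote for MLA.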
This is great! But what if the US invests 1% of GDP in GPU datacenters and then they turn out not to be needed because someone created a much more efficient architecture?
More efficiency just means more consumption. Think of when they add lanes to a highway: traffic gets better for a little while, but very soon the highway is just as congested as before.
Look up Jevons Paradox: when something becomes more efficient, consumption can go up, often because demand is price-elastic.
Think of it like this: imagine car prices go from $200,000 to $20,000. You wouldn't just sell 10x as many cars. In fact, I just looked up the numbers: worldwide, only about 100K cars sold are priced at $200K or higher, whereas roughly 80 million cars are in that affordable category.
So a ~90% price drop takes sales from 0.1M to 80M! I think this means we need more engines, tires, roads, gas, and spare parts.
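For what it's worth, those car numbers make the elasticity point concrete. A crude check, using the commenter's rough figures rather than real market data:

    # Naive price elasticity of demand: % change in quantity / % change in price,
    # with the $200K tier as the baseline.
    p_old, p_new = 200_000, 20_000
    q_old, q_new = 0.1e6, 80e6         # rough worldwide unit sales from the comment above
    pct_dq = (q_new - q_old) / q_old   # ~ +79,900%
    pct_dp = (p_new - p_old) / p_old   # -90%
    print(pct_dq / pct_dp)             # ~ -888: quantity grows far faster than price falls

That is the same logic the Jevons Paradox argument applies to compute: a big drop in cost per token doesn't have to mean less spend on GPUs.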
It would be REALLY cool to see this same technique applied to a distillation of a much more recent OSS model. For example, Mistral 3 14B would be a great target. How efficient can we get inference there?
> Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7–11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size—down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively—while preserving 100%, 100%, and 97% of average zero-shot performance on LM Harness tasks.
This is an extraordinary claim, is there a catch I’m missing? Am I misreading?
The catch that you're missing is that Deepseek did this ages ago.
They're just using MLA, which is well known to reduce KV size by 90%. You know, the MLA that's used in... Deepseek V2, Deepseek V3, Deepseek R1, Deepseek V3.1, Deepseek V3.2.
Oh, and they also added some hybrid linear attention stuff to make it faster at long context. You know who else uses hybrid linear attention? Deepseek V3.2.
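A back-of-envelope check on the abstract's 2.73% figure for the 8B variant, using the public Llama-3.1-8B attention config and the 8 MLA + 24 Mamba layer split in the released checkpoint's name. The per-layer latent size below is inferred from the quoted percentage, not taken from the paper, and the Mamba layers' constant-size state is ignored:

    # Baseline: Llama-3.1-8B caches K and V for 8 KV heads of dim 128 in all 32 layers.
    layers, kv_heads, head_dim = 32, 8, 128
    baseline = layers * 2 * kv_heads * head_dim     # 65,536 cached values per token
    mla_layers = 8                                  # only these layers cache anything
    target_fraction = 0.0273                        # abstract's number for the 8B variant
    implied_latent = target_fraction * baseline / mla_layers
    print(baseline, round(implied_latent))          # 65536, ~224 values per token per MLA layer

So the headline percentages fall straight out of dropping the KV cache in 24 of 32 layers and compressing it in the remaining 8: extraordinary-sounding, but mostly arithmetic.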
https://huggingface.co/amd/Zebra-Llama-8B-8MLA-24Mamba-SFT