
Zebra-Llama – Towards efficient hybrid models

113 points | mirrir | 2 months ago | arxiv.org

61 comments


adityashankar|2 months ago

Due to perverse incentives and the historical tendency of models to over-claim accuracy, it's very hard to believe anything until it is open source and can be tested.

That being said, I do very much believe that the computational efficiency of models is going to go up drastically over the coming months, which does pose interesting questions over Nvidia's throne.

*previously miswrote and said computational efficiency will go down

ACCount37|2 months ago

I don't doubt the increase in efficiency. I doubt the "drastically".

We already see models become more and more capable per weight and per unit of compute. I don't expect a state-change breakthrough. I expect: more of the same. A SOTA 30B model from 2026 is going to be ~30% better than one from 2025.

Now, expecting that to hurt Nvidia? Delusional.

No one is going to stop and say "oh wow, we got more inference efficiency - now we're going to use less compute". A lot of people are going to say "now we can use larger and more powerful models for the same price" or "with cheaper inference for the same quality, we can afford to use more inference".

danielbln|2 months ago

I think you mean computational efficiency will go _up_ in the future. To your last point: Jevons paradox might apply.

a_wild_dandan|2 months ago

If the claims in the abstract are true, then this is legitimately revolutionary. I don’t believe it. There are probably some major constraints/caveats that keep these results from generalizing. I’ll read through the paper carefully this time instead of a skim and come back with thoughts after I’ve digested it.

jychang|2 months ago

What's not to believe? Qwerky-32b has already done something similar: it's a finetune of QwQ-32b that doesn't use a traditional attention architecture.

And hybrid models aren't new; MLA-based hybrid models are basically just Deepseek V3.2 in a nutshell. Note that Deepseek V3.2 (and V3.1, R1, and V3... and V2, actually) all use MLA. Deepseek V3.2 is what adds the linear attention stuff.

Actually, since Deepseek V3.1 and Deepseek V3.2 are just post-training on top of the original Deepseek V3 pretrain run, I'd say this paper is basically doing exactly what Deepseek V3.2 did in terms of efficiency.

xer|2 months ago

This is great! But what if the US invests 1% of GDP in GPU datacenters and then those turn out not to be needed because someone created a much more efficient architecture?

wild_egg|2 months ago

More efficiency just means more consumption. Think of when they add lanes to a highway: traffic gets better for a little while, but very soon the highway is just as congested as before.

dkural|2 months ago

Look up Jevons Paradox: when something becomes more efficient, consumption can go up, often due to price elasticity.

Think of it like this: imagine car prices go from $200,000 to $20,000. You wouldn't just sell 10x the number of cars. In fact, I just looked up the numbers: worldwide, only about 100K cars sell at $200K and higher, whereas roughly 80 million cars are in that affordable category.

So a 90% price drop allowed sales to go from 0.1M to 80M, an 800x increase! I think this means we need more engines, tires, roads, gas, and spare parts.
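The back-of-envelope behind the car analogy (using only the round numbers from the comment above) looks like this:

```python
# Jevons-style arithmetic from the car analogy above.
# Round numbers only: a ~90% price drop vs. an ~800x jump in units sold.
old_price, new_price = 200_000, 20_000
old_units, new_units = 0.1e6, 80e6

price_drop = 1 - new_price / old_price   # fraction cheaper per car
unit_growth = new_units / old_units      # multiple of cars sold

# Total spend rises even though each unit is far cheaper:
old_revenue = old_price * old_units
new_revenue = new_price * new_units

print(price_drop, unit_growth, new_revenue / old_revenue)
```

The punchline is the last ratio: aggregate spending on cars goes up ~80x despite each car costing 90% less, which is the Jevons-style outcome the comment is pointing at.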

chpatrick|2 months ago

Then they'll be able to use those datacenters much more efficiently.

_boffin_|2 months ago

They will still use capacity. Why would you believe anything different?

Reubend|2 months ago

It would be REALLY cool to see this same technique applied to a distillation of a much more recent OSS model. For example, Mistral 3 14B would be a great target. How efficient can we get inference there?

AlexCoventry|2 months ago

This is from May 2025, according to the arxiv watermark. Maybe that should be mentioned in the title.

KnuthIsGod|2 months ago

Looks like the trillions of dollars spent on datacentres will end up being regretted.

pryelluw|2 months ago

I should have been an electrician.

mason_mpls|2 months ago

> Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7–11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size—down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively—while preserving 100%, 100%, and 97% of average zero-shot performance on LM Harness tasks.

This is an extraordinary claim, is there a catch I’m missing? Am I misreading?
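For a sense of scale, here is a rough sizing of what "2.73% of the KV cache" would mean in memory. The dimensions below are illustrative assumptions for a Llama-3-8B-like config, not numbers from the paper:

```python
# Rough KV-cache sizing for a vanilla-attention 8B-class model.
# Dimensions are illustrative assumptions (Llama-3-8B-like with GQA),
# NOT taken from the Zebra-Llama paper.
n_layers, n_kv_heads, head_dim = 32, 8, 128
seq_len, bytes_per_elem = 32_768, 2  # 32K context, fp16/bf16

# K and V per token: 2 tensors * layers * kv_heads * head_dim
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
full_cache_gib = kv_bytes_per_token * seq_len / 2**30

# The abstract claims the 8B variant keeps only 2.73% of the cache.
reduced_cache_gib = full_cache_gib * 0.0273

print(f"per-token KV: {kv_bytes_per_token / 1024:.0f} KiB")  # 128 KiB
print(f"full cache @32K: {full_cache_gib:.2f} GiB")          # 4.00 GiB
print(f"at 2.73%: {reduced_cache_gib:.3f} GiB")              # ~0.109 GiB
```

Under these assumptions, a 4 GiB cache at 32K context would shrink to roughly 0.1 GiB, which is why the claim reads as extraordinary at face value.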

jychang|2 months ago

The catch that you're missing is that Deepseek did this ages ago.

They're just using MLA, which is well known to reduce KV size by 90%. You know, the MLA that's used in... Deepseek V2, Deepseek V3, Deepseek R1, Deepseek V3.1, Deepseek V3.2.

Oh, and they also added some hybrid linear attention stuff to make it faster at long context. You know who else uses hybrid linear attention? Deepseek V3.2.
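A sketch of where MLA's 90%+ cache reduction comes from: instead of caching full per-head K/V, it caches one small shared latent (plus a decoupled RoPE key) per layer. The dimensions below follow the publicly reported Deepseek-V3 config, but treat them as assumptions for illustration:

```python
# Per-token cache elements: standard multi-head attention vs. MLA.
# Dimensions follow the publicly reported Deepseek-V3 config
# (61 layers, 128 heads, head_dim 128, KV latent 512, RoPE dim 64);
# treat these as assumptions for illustration.
n_layers, n_heads, head_dim = 61, 128, 128
kv_latent_dim, rope_dim = 512, 64

# Standard attention caches full K and V for every head in every layer.
mha_per_token = 2 * n_layers * n_heads * head_dim

# MLA caches one compressed latent plus a shared RoPE key per layer;
# per-head K/V are re-materialized from the latent at attention time.
mla_per_token = n_layers * (kv_latent_dim + rope_dim)

print(mha_per_token, mla_per_token, f"{mla_per_token / mha_per_token:.2%}")
```

With these numbers the latent cache is under 2% of the full-attention cache per token, consistent with the "90%+ reduction" figure, before any hybrid linear-attention savings on top.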