(no title)
adityashankar | 2 months ago
That being said, I do very much believe that the computational efficiency of models is going to go up drastically over the coming months, which poses interesting questions about Nvidia's throne.
*previously miswrote and said computational efficiency will go down
credit_guy|2 months ago
https://huggingface.co/amd/Zebra-Llama-8B-8MLA-24Mamba-SFT
jychang|2 months ago
I don't know what's so special about this paper.
- They claim to use MLA to reduce the KV cache by 90%. Yeah, Deepseek invented that for Deepseek V2 (and it's also in V3, Deepseek R1, etc.). A rough sketch of where that kind of number comes from is below.
- They claim a hybrid linear attention architecture. So does Deepseek V3.2, and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next. (A toy sketch of that kind of layer mix is at the end of this comment.)
- They claim to have saved a lot of money by not doing a full pre-train run costing millions of dollars. Well, so did Deepseek V3.2... Deepseek hasn't done a full $5.6mil pretraining run since Deepseek V3 in 2024. Deepseek R1 is just a $294k post-train on top of the expensive V3 pretrain run. Deepseek V3.2 is just a hybrid linear attention post-train run - I don't know the exact price, but it's probably just a few hundred thousand dollars as well.
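For a rough sense of where a ~90% KV-cache reduction can come from, here's a back-of-the-envelope sketch. The dimensions below are illustrative (roughly DeepSeek-V2-style numbers), not Zebra-Llama's actual configuration:

```python
# Back-of-the-envelope KV-cache comparison: standard multi-head attention (MHA)
# vs. MLA-style latent compression. Sizes are per token, per layer, in elements;
# the specific dimensions are assumptions for illustration only.

n_heads = 128      # attention heads (assumed)
head_dim = 128     # dimension per head (assumed)
latent_dim = 512   # compressed KV latent stored by MLA (assumed)
rope_dim = 64      # extra decoupled RoPE key dimension stored by MLA (assumed)

# Standard MHA caches full keys and values for every head.
mha_cache = 2 * n_heads * head_dim   # 32768 elements

# MLA caches one shared latent vector (plus a small RoPE key) and
# reconstructs per-head keys/values from it at attention time.
mla_cache = latent_dim + rope_dim    # 576 elements

reduction = 1 - mla_cache / mha_cache
print(f"MHA: {mha_cache} elems, MLA: {mla_cache} elems, "
      f"reduction: {reduction:.1%}")  # ~98% with these numbers
```

The exact percentage depends on the baseline (full MHA vs. GQA) and the chosen latent size, but the mechanism is the same one DeepSeek V2 introduced.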
Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive gpt-4o pre-train run from 2024. That's why they all have the same knowledge cutoff date.
I don't really see anything new or interesting in this paper that Deepseek V3.2 hasn't already sort of done (just on a bigger scale). Not exactly the same, but is there anything amazingly new here that's not in Deepseek V3.2?
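On the hybrid point: judging by the checkpoint name (Zebra-Llama-8B-8MLA-24Mamba), the recipe is a small number of full-attention (MLA) layers interleaved with many linear-time Mamba layers. A minimal sketch of that kind of layer schedule, with the 8/24 split taken from the name but the even spacing assumed for illustration:

```python
# Toy layer schedule for a hybrid stack: a few quadratic-cost attention (MLA)
# layers mixed into a majority of linear-time Mamba/SSM layers. The 8/24 split
# matches the checkpoint name; the even interleaving is an assumption.

N_LAYERS = 32
N_MLA = 8                    # layers that keep a (compressed) KV cache
STRIDE = N_LAYERS // N_MLA   # place one MLA layer every 4 layers

def layer_schedule(n_layers: int = N_LAYERS, stride: int = STRIDE) -> list[str]:
    """Return the mixer type used at each layer index."""
    return ["MLA" if (i + 1) % stride == 0 else "Mamba" for i in range(n_layers)]

schedule = layer_schedule()
print(schedule.count("MLA"), "MLA layers,", schedule.count("Mamba"), "Mamba layers")
# -> 8 MLA layers, 24 Mamba layers
```

The appeal is that the Mamba layers carry a fixed-size recurrent state instead of a growing KV cache, so only the handful of MLA layers pay the usual memory cost at long context.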
deepdarkforest|2 months ago
> Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that...
Hmmm
ACCount37|2 months ago
We already see models become more and more capable per weight and per unit of compute. I don't expect a state-change breakthrough. I expect: more of the same. A SOTA 30B model from 2026 is going to be ~30% better than one from 2025.
Now, expecting that to hurt Nvidia? Delusional.
No one is going to stop and say "oh wow, we got more inference efficiency - now we're going to use less compute". A lot of people are going to say "now we can use larger and more powerful models for the same price" or "with cheaper inference for the same quality, we can afford to use more inference".
colechristensen|2 months ago
Right now, Claude is good enough. If LLM development hit a magical wall and never got any better, Claude would still be terrifically useful, and there are diminishing returns on how much extra good we get out of it hitting $benchmark.
Say we're satisfied with that... well, how many years until efficiency gains from one side and consumer hardware from the other meet in the middle, so that "good enough for everybody" open models are available to anyone willing to pay for a $4000 MacBook (and after another couple of years a $1000 MacBook, and several more after that, a fancy wristwatch)?
Point being, unless we get to the point of developing "models" that deserve civil rights and citizenship, the days are numbered for NEEDING cloud infrastructure and datacenters full of racks and racks of $x0,000 hardware.
I strongly believe the top end of the S curve is nigh, and with it we're going to see these trillion dollar ambitions crumble. Everybody is going to want a big-ass GPU and a ton of RAM, but that's going to quickly become boring, because open models are going to exist that eat everybody's lunch, and the trillion dollar companies trying to beat them with a premium product aren't going to stack up outside of niche cases and much more ordinary cloud compute motivations.
danielbln|2 months ago
adityashankar|2 months ago
If computational efficiency goes up (thanks for the correction) and CPU inference becomes viable for most practical applications, then GPUs (or dedicated accelerators) may become unnecessary for most workloads.