Kimi Linear: An Expressive, Efficient Attention Architecture

217 points| blackcat201 | 4 months ago |github.com

47 comments


eXpl0it3r|4 months ago

For the uninitiated, what's a "hybrid linear attention architecture"?

quotemstr|4 months ago

1/4 of their layers are conventional quadratic attention
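To make that concrete, here is a toy NumPy sketch of the generic idea (this is not Kimi's actual KDA kernel, just textbook kernelized linear attention; the 3:1 interleaving ratio matches the "1/4 quadratic" figure above). Quadratic attention compares every token with every earlier token, O(n²); linear attention folds history into a fixed-size running state, O(n):

```python
import numpy as np

def quadratic_attention(Q, K, V):
    # Standard causal softmax attention: O(n^2) in sequence length.
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.tril(np.ones((n, n), dtype=bool))      # causal mask
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Kernelized attention kept as a running (d x d_v) state: O(n).
    n, d = Q.shape
    S = np.zeros((d, V.shape[-1]))   # running sum of phi(k) v^T
    z = np.zeros(d)                  # running sum of phi(k), for normalization
    out = np.empty_like(V)
    for t in range(n):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

# A hybrid schedule in the spirit described above: 3 linear layers per 1 quadratic.
layers = ["linear" if i % 4 != 3 else "quadratic" for i in range(8)]
print(layers)
```

The point of the hybrid: the few quadratic layers preserve exact token-to-token recall, while the linear layers keep memory and compute flat as context grows.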

arresin|4 months ago

Hey, thanks for asking this question. It led to good replies.

oxqbldpxo|4 months ago

I switched from ChatGPT to Perplexity, and now to Kimi K2, after reading an article here explaining that all the fear around some of the Chinese models spying and so on is simply not true. I have to say that in my experience Kimi K2 is way better than Perplexity. I hope we can get our act together. It seems that building these AIs requires a level of collaboration that is in opposition to greed.

wongarsu|4 months ago

My default assumption would be that every model is spying (or rather: is being spied on). The data is just way too juicy; every major intelligence agency has to be salivating at the thought of getting this degree of insight into people.

Of course with Kimi there is fear because the Chinese government can easily pressure Moonshot AI into sharing the data, and other countries have to work to stealthily siphon data off without being caught by Chinese counterintelligence. As opposed to GPT5 where the American government can easily pressure OpenAI and every other country has to stealthily siphon data off without being caught by American counterintelligence. The only way to be reasonably certain that you aren't spied on is to run your own models or rent GPU time to run models.

The bigger worry imho is whether the models are booby-trapped to give poisoned answers when they detect certain queries, or when they detect that you work for a competitor or enemy of China. But that would have to be reasonably stealthy to work.

embedding-shape|4 months ago

> after reading an article here explaining that all the fear around some of the Chinese models spying and so on.. is simply not true

Not saying they are spying on people, but regardless, how would you really know? Are you basing this on the fact that no Chinese police have visited you? How would you verify whether it's "simply not true" or not?

With that said, I use plenty of models coming out of China too with no fear, but I'm also using them locally, not cloud platforms.

lostmsu|4 months ago

Why do you think either of Perplexity or Kimi are better than GPT-5?

nh43215rgb|4 months ago

> Chinese models spying and so on.. is simply not true.

They all must be doing humanity a great favor out of good will, then.

Sorry, but seriously -- the Chinese government, controlled by the Chinese Communist Party (CCP), can effectively seize or shut down internet services and infrastructure at will within its borders under its national security laws.

No need to read the TOS; it's in the law.

softwaredoug|4 months ago

Everyone is worried about AI data centers destroying the planet with their extreme energy needs, though it seems we still have a big learning curve ahead to make AI inference and training more efficient.

How likely are we to NOT see the AI data center apocalypse through better algorithms?

wongarsu|4 months ago

We have already seen huge efficiency increases over the last two years. Small models have become increasingly capable, the minimum viable model size for simple tasks keeps shrinking, and proprietary model providers have long stopped talking about new milestones in model size, instead achieving massive price cuts through methods they largely keep quiet about (but that almost certainly include smaller models and intelligent routing to different model sizes).

But so far this has just led to more induced demand. There are a lot of things we would use LLMs for if they were just cheap enough, and every increase in efficiency makes more of those use cases viable.

simgt|4 months ago

Without policies, gains in efficiency are always offset by increased demand. Global energy consumption by source is a good example: we've never consumed as much coal as we do now, even though we have alternatives.

https://ourworldindata.org/global-energy-200-years

naasking|4 months ago

> How likely are we to NOT see the AI data center apocalypse through better algorithms?

Near certain IMO. Algorithmic improvements have outpaced hardware improvements for decades. We're already seeing the rise of small models, and how simple tweaks can make them very capable problem solvers, sometimes better than state-of-the-art large models. Data center scaling is nearing its peak IMO, as we're hitting data limits that cap model size anyway.

m00x|4 months ago

I don't think this worry is widespread, or even warranted. China has been able to more than double the US in energy production without massive effects on the environment by using nuclear, solar, and hydro.

If anything, the US is massively underproducing.

logicartisan|4 months ago

Amazing how fast AI keeps improving; every new model feels like a big step forward.

hirako2000|4 months ago

It is only improving in efficiency. While that is extremely valuable given the disproportionate (relative to value) costs of these things, your statement almost sounds as if it had improved an even more challenging aspect: raw performance.

lostmsu|4 months ago

Any comparison with existing models on common benchmarks? Text? Coding? MMLU?

ted_dunning|4 months ago

Did you even look at the article?

Evaluation Benchmarks: Our evaluation encompasses five primary categories of benchmarks, each designed to assess distinct capabilities of the model:

• Language Understanding and Reasoning: Hellaswag [121], ARC-Challenge [14], Winogrande [83], MMLU [36], TriviaQA [47], MMLU-Redux [26], MMLU-Pro [103], GPQA-Diamond [82], BBH [94], and [105].

• Code Generation: LiveCodeBench v6 [44], EvalPlus [60].

• Math & Reasoning: AIME 2025, MATH 500, HMMT 2025, PolyMath-en.

• Long-context: MRCR, RULER [38], Frames [52], HELMET-ICL [118], RepoQA [61], Long Code Arena [13], and LongBench v2 [6].

• Chinese Language Understanding and Reasoning: C-Eval [43], and CMMLU [55].

amoskvin|4 months ago

Any hardware recommendations? How much memory do we need for this?

uniqueuid|4 months ago

You will effectively want a 48GB card or more for quantized versions; otherwise you won't have meaningful space left for the KV cache. Blackwell and above is generally a good idea for faster hardware support of 4-bit formats (some recent models took a while to ship support for older architectures, gpt-oss IIRC).
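As a sanity check on that figure, here's a back-of-the-envelope in Python. All numbers below are illustrative assumptions, not Kimi Linear's actual config, and `kv_cache_gib` uses the standard dense-attention cache formula (in a hybrid model, only the quadratic layers would keep a full per-token KV cache):

```python
def quantized_weight_gib(n_params_billions, bits_per_weight):
    """Approximate weight footprint in GiB for a quantized model."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Dense-attention KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 2**30

# Illustrative, assumed numbers: a 48B-param model at 4-bit, with a KV cache
# kept only for a handful of full-attention layers over a 128k context.
weights = quantized_weight_gib(48, 4)
cache = kv_cache_gib(n_layers=16, n_kv_heads=8, head_dim=128,
                     context_len=128_000)
print(f"weights ~= {weights:.1f} GiB, KV cache ~= {cache:.1f} GiB")
```

Under these assumptions, weights land around 22 GiB and the cache around 8 GiB, which is why a 48GB card leaves comfortable headroom while a 24GB card leaves almost none.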

andai|4 months ago

How does Gemini have a million token context window?

textembedding|4 months ago

125 upvotes with 2 comments is kinda sus

muragekibicho|4 months ago

Lots of model releases are like this. We can only upvote; we can't run the model on our personal computers, nor can we test their "Efficient Attention" concept ourselves.

Honestly, it would take 24 hours just to download the 98 GB model if I wanted to try it out (assuming I had a card with 98 GB of RAM).
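For what it's worth, the implied link speed works out to single-digit megabits per second (plain arithmetic on the numbers in the comment, assuming decimal GB):

```python
size_gb = 98   # model size from the comment, in GB
hours = 24     # claimed download time

bytes_per_s = size_gb * 1e9 / (hours * 3600)
mbit_per_s = bytes_per_s * 8 / 1e6   # connection speed that makes 24h true
print(f"{mbit_per_s:.1f} Mbit/s")
```

So the 24-hour figure corresponds to roughly a 9 Mbit/s connection; on a typical 100 Mbit/s line it would be closer to two and a half hours.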

actionfromafar|4 months ago

I'm hoping someone will explain what this release even means.

WhereIsTheTruth|4 months ago

The Chinese century ain't gonna build itself /s