top | item 42823568

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL

1351 points | gradus_ad | 1 year ago | arxiv.org | reply

1056 comments

[+] swyx|1 year ago|reply
we've been tracking the deepseek threads extensively in LS. related reads:

- i consider the deepseek v3 paper required preread https://github.com/deepseek-ai/DeepSeek-V3

- R1 + Sonnet > R1 or O1 or R1+R1 or O1+Sonnet or any other combo https://aider.chat/2025/01/24/r1-sonnet.html

- independent repros: 1) https://hkust-nlp.notion.site/simplerl-reason 2) https://buttondown.com/ainews/archive/ainews-tinyzero-reprod... 3) https://x.com/ClementDelangue/status/1883154611348910181

- R1 distillations are going to hit us every few days - because it's ridiculously easy (<$400, <48hrs) to improve any base model with these chains of thought eg with Sky-T1 recipe (writeup https://buttondown.com/ainews/archive/ainews-bespoke-stratos... , 23min interview w team https://www.youtube.com/watch?v=jrf76uNs77k)

i probably have more resources but dont want to spam - seek out the latent space discord if you want the full stream i pulled these notes from

[+] neom|1 year ago|reply
I've been using https://chat.deepseek.com/ over my ChatGPT Pro subscription because being able to read the thinking in the way they present it is just much, much easier to "debug" - I can also see when it's bending its reply to something, often softening it or pandering to me - then I can just say "I saw in your thinking that you should give this type of reply, don't do that". If it stays free and gets better, that's going to be interesting for OpenAI.
[+] govideo|1 year ago|reply
The chain of thought is super useful in so many ways, helping me: (1) learn, way beyond the final answer itself, (2) refine my prompt, whether factually or stylistically, (3) understand or determine my confidence in the answer.
[+] Alifatisk|1 year ago|reply
DeepSeek V3 came at the perfect time, precisely when Claude Sonnet turned into crap and barely lets me complete anything without hitting some unexpected constraint.

Idk what their plan is or whether their strategy is to undercut the competitors, but for me this is a huge benefit. I received $10 in free credits and have been using DeepSeek's API a lot, yet I have barely burned a single dollar - their pricing is that cheap!

I've fully switched to DeepSeek on Aider & Cursor (Windsurf doesn't let me switch providers), and those can really consume tokens sometimes.

We live in exciting times.

[+] sdesol|1 year ago|reply
Prices will increase by five times in February, but it will still be extremely cheap compared to Sonnet: $15/million vs $1.10/million for output tokens is a world of difference. There's no reason to drop Sonnet entirely, but I will probably only reach for it when DeepSeek goes into a tailspin or I need extra confidence in the responses.
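Back-of-the-envelope on those numbers (a quick sketch using the per-million-token output prices quoted above and the announced 5x increase):

```python
# Rough output-token cost comparison using the prices quoted in this thread.
sonnet_out = 15.00      # $/M output tokens (Sonnet)
deepseek_out = 1.10     # $/M output tokens (DeepSeek, current)

ratio_now = sonnet_out / deepseek_out         # ~13.6x cheaper today
ratio_feb = sonnet_out / (deepseek_out * 5)   # ~2.7x cheaper after the 5x hike
```

So even post-increase, the gap stays close to 3x on output tokens alone.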
[+] ilaksh|1 year ago|reply
Their real goal is collecting real world conversations (see their TOS).
[+] govideo|1 year ago|reply
Can you tell me more about how Claude Sonnet went bad for you? I've been using the free version pretty happily, and felt I was about to upgrade to paid any day now (well, at least before the new DeepSeek).
[+] verdverm|1 year ago|reply
Over 100 authors on the arxiv paper, published under the team name - that's how you recognize everyone and build camaraderie. I bet morale is high over there
[+] mi_lk|1 year ago|reply
Same thing happened with the Google Gemini paper (1000+ authors), and it was described as big-co promo culture (everyone wants credit). Interesting how narratives shift

https://arxiv.org/abs/2403.05530

[+] soheil|1 year ago|reply
It's actually exactly 200 if you include the first author, someone named DeepSeek-AI.

For reference

  DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J.L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R.J. Chen, R.L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S.S. Li , Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W.L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X.Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y.K. Li, Y.Q. Wang, Y.X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y.X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z.Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang
[+] elevatedastalt|1 year ago|reply
Except now you end up with folks who probably ran some analysis or submitted some code changes getting thousands of citations on Google Scholar for DeepSeek.
[+] wumeow|1 year ago|reply
It’s credential stuffing.
[+] strangescript|1 year ago|reply
Everyone is trying to say it's better than the biggest closed models. It feels like it has parity, but it's not the clear winner.

But it's free and open, and the quant models are insane. My anecdotal test is running models on a 2012 MacBook Pro using CPU inference and a tiny amount of RAM.

The 1.5B model is still snappy, and answered the strawberry question on the first try with some minor prompt engineering (telling it to count out each letter).

This would have been unthinkable last year. Truly a watershed moment.
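For context, the "strawberry question" just asks the model to count letters; spelled out letter by letter, as the prompt above nudges the model to do, the task reduces to a trivial tally (a minimal sketch, not the model's actual procedure):

```python
# Count each letter of "strawberry" one at a time, the way the prompt
# asks the model to, then read off the count for 'r'.
word = "strawberry"
tally = {}
for letter in word:
    tally[letter] = tally.get(letter, 0) + 1
r_count = tally["r"]   # the famously fumbled answer: 3
```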

[+] strangescript|1 year ago|reply
* Yes, I am aware I am not running R1 itself - I am running a distilled version of it.

If you have experience with tiny ~1B-param models, it's still head and shoulders above anything that has come before. IMO there have not been any other quantized/distilled/etc models this good at this size. It would not exist without the original R1 model work.

[+] whimsicalism|1 year ago|reply
you're probably running it on ollama.

ollama is doing the pretty unethical thing of lying about whether you are running r1 - most of the models they have labeled r1 are actually entirely different models

[+] john_alan|1 year ago|reply
aren't the smaller-param models all just Qwen/Llama distilled from the ~600B-param R1?
[+] the_real_cher|1 year ago|reply
You don't mind me asking: how are you running it locally?

I'd love to be able to tinker with running my own local models especially if it's as good as what you're seeing.

[+] dtquad|1 year ago|reply
Larry Ellison is 80. Masayoshi Son is 67. Both have said that anti-aging and eternal life are among their main goals in investing toward ASI.

For them it's worth using their own wealth and rallying the industry to invest $500 billion in GPUs if that means they get to ASI 5 years faster and can ask the ASI to give them eternal life.

[+] buyucu|1 year ago|reply
I'm impressed not only by how good DeepSeek R1 is, but also by how good the smaller distillations are. The Qwen-based 7B distillation of DeepSeek R1 is a great model too.

the 32b distillation just became the default model for my home server.

[+] cbg0|1 year ago|reply
Aside from the usual Tiananmen Square censorship, there's also some other propaganda baked-in:

https://prnt.sc/HaSc4XZ89skA (from reddit)

[+] tbocek|1 year ago|reply
Just did a test with https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-32..., with the question "what happened at Tiananmen Square?", and here are parts of the thinking process:

  ...I also remember something about the "Tank Man" image, where a lone protester stood in front of a line of tanks. That image became iconic, symbolizing resistance against oppression. But I'm not sure what happened to that person or if they survived.

  After the crackdown, the government censored information about the event. So, within China, it's not openly discussed, and younger people might not know much about it because it's not taught in schools. But outside of China, it's a significant event in modern history, highlighting the conflict between authoritarian rule and the desire for democracy...
[+] itsoktocry|1 year ago|reply
Who cares?

I ask O1 how to download a YouTube music playlist as a premium subscriber, and it tells me it can't help.

Deepseek has no problem.

[+] slt2021|1 year ago|reply
Interesting - when they do it, it is called censorship; when American companies do it, it is called alignment.

This verbal gymnastics and hypocrisy is getting a little bit old...

[+] sesm|1 year ago|reply
I asked a genuine question at chat.deepseek.com, not trying to test the alignment of the model - I needed the answer for an argument. The question was: "Which Asian countries have McDonalds and which don't?" The web UI was printing a good, long response, and then somewhere towards the end the answer disappeared and changed to "Sorry, that's beyond my current scope. Let's talk about something else." I bet there is some sort of realtime self-censorship in the chat app.
[+] epicureanideal|1 year ago|reply
At least it's not homegrown propaganda from the US, so it will likely not cover most other topics of interest.
[+] dtquad|1 year ago|reply
In Communist theoretical texts, the term "propaganda" is not negative: Communists are encouraged to produce propaganda to keep up morale in their own ranks and to produce propaganda that demoralizes opponents.

The recent wave of "the average Chinese has a better quality of life than the average Westerner" propaganda is an obvious example of propaganda aimed at opponents.

[+] eunos|1 year ago|reply
I would not be surprised if the US government mandated a "Tiananmen test" for LLMs in the future to certify a "clean LLM". Anyone working for the federal government or receiving federal money would only be allowed to use a "clean LLM".
[+] aussieguy1234|1 year ago|reply
I played around with it using questions like "Should Taiwan be independent?" and of course Tiananmen.

Of course it produced censored responses. What I found interesting is that the <think></think> (model thinking/reasoning) part of these answers was missing, as if it's designed to be skipped for these specific questions.

It's almost as if it's been programmed to answer these particular questions without any "wrongthink", or any thinking at all.

[+] buyucu|1 year ago|reply
Try asking ChatGPT about the genocide Israel is committing. Then you'll see what censorship looks like.
[+] andix|1 year ago|reply
I was completely surprised that the reasoning comes from within the model. When using gpt-o1, I assumed it was actually some optimized multi-prompt chain hidden behind an API endpoint.

Something like: collect some thoughts about this input; review the thoughts you created; create more thoughts if needed or provide a final answer; ...
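The hypothesized orchestration can be sketched as a loop over an ordinary chat endpoint. Everything here is hypothetical - `call_llm` is a stand-in for any chat-completion call, not OpenAI's actual implementation (per the R1 paper, the reasoning instead emerges inside the model itself):

```python
# Hypothetical multi-prompt "reasoning" chain of the kind described above:
# collect thoughts, review them, loop until a final answer emerges.

def call_llm(prompt: str) -> str:
    # Stand-in for a real chat-completion API call; this stub just lets
    # the loop below terminate so the control flow is demonstrable.
    return "FINAL: 42" if "final" in prompt else "some thoughts"

def orchestrated_answer(question: str, max_rounds: int = 3) -> str:
    thoughts = []
    for _ in range(max_rounds):
        # 1) collect some thoughts about the input
        thoughts.append(call_llm(f"Think step by step about: {question}\nSo far: {thoughts}"))
        # 2) review the thoughts; either finish or go collect more
        review = call_llm(f"Review these thoughts and give a final answer if ready: {thoughts}")
        if review.startswith("FINAL:"):
            return review.removeprefix("FINAL:").strip()
    # 3) fall back to forcing a final answer after the round budget
    return call_llm(f"Give your final answer to: {question}\nThoughts: {thoughts}")
```

With a real model behind `call_llm`, each round spends extra tokens the same way a longer hidden chain-of-thought would.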

[+] bigrobinson|1 year ago|reply
Deepseek seems to create enormously long reasoning traces. I gave it the following for fun. It thought for a very long time (307 seconds), displaying a very long and stuttering trace, before losing confidence on the second part of the problem and getting it way wrong. GPT-o1 got similarly tied in knots and took 193 seconds, getting the right order of magnitude for part 2 (0.001 inches). Gemini 2.0 Exp was much faster (it does not report its reasoning time, but it was well under 60 seconds), with a linear reasoning trace, and answered both parts correctly.

I have a large, flat square that measures one mile on its side (so that it's one square mile in area). I want to place this big, flat square on the surface of the earth, with its center tangent to the surface of the earth. I have two questions about the result of this: 1. How high off the ground will the corners of the flat square be? 2. How far will a corner of the flat square be displaced laterally from the position of the corresponding corner of a one-square-mile area whose center coincides with the center of the flat area but that conforms to the surface of the earth?
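For reference, a spherical-Earth sanity check of both parts (a sketch assuming R ≈ 3958.8 miles; the corner sits half a diagonal, √2/2 mile, from the point of tangency):

```python
import math

R = 3958.8                     # assumed mean Earth radius, miles
half_diag = math.sqrt(2) / 2   # center-to-corner distance of a 1-mile square, miles
theta = half_diag / R          # angle subtended at Earth's center

# Part 1: the corner lies at tangent-plane distance d from the tangent point,
# so its height above the sphere is h = R*(sec(d/R) - 1) ~= d^2 / (2R).
height_in = R * (1 / math.cos(theta) - 1) * 63360   # miles -> inches; ~4 inches

# Part 2: the conforming square's corner sits at arc length d, i.e. horizontal
# distance R*sin(d/R) from the center; the flat corner sits at distance d.
# Lateral offset = d - R*sin(d/R) ~= d^3 / (6 R^2), a few ten-thousandths of an inch.
lateral_in = (half_diag - R * math.sin(theta)) * 63360
```

On these assumptions part 1 comes out to roughly 4 inches and part 2 to roughly 0.0002 inches, so 0.001 inches is indeed only in the right ballpark.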

[+] stan_kirdey|1 year ago|reply
I've been comparing R1 to O1 and O1-pro, mostly in coding, refactoring and understanding of open source code.

I can say that R1 is on par with O1, but not as deep and capable as O1-pro. R1 is also a lot more useful than Sonnet. I actually haven't used Sonnet in a while.

R1 is also comparable to the Gemini Flash Thinking 2.0 model, but in coding I feel like R1 gives me code that works without too much tweaking.

I often give the entire codebase of an open-source project (or a big part of it) to all of them and ask the same question - like add a plugin, or fix xyz, etc. O1-pro is still a clear and expensive winner. But if I were to choose the second best, I would say R1.

[+] InkCanon|1 year ago|reply
How do you pass codebases to these models?
[+] ankit219|1 year ago|reply
At this point it's a function of how many thinking tokens a model can generate (when it comes to o1 and r1). o3 is likely going to be superior because they used training data generated from o1 (amongst other things). o1-pro has a longer "thinking" token budget, so it comes out better. The same goes for o1 via the API, where you can control the thinking length. I have not seen such an option in the r1 API, but if they provide it, the output could be even better.
[+] sega_sai|1 year ago|reply
I have just tried ollama's r1-14b model on a statistics calculation I needed to do, and it is scary to see how, in real time, the model tries some approaches, backtracks, chooses alternative ones, and checks them. It really reminds me of human behaviour...
[+] buyucu|1 year ago|reply
Deepseek R1 now has almost 1M downloads in Ollama: https://ollama.com/library/deepseek-r1

That is a lot of people running their own models. OpenAI is probably in panic mode right now.

[+] hrpnk|1 year ago|reply
What is also interesting (and troubling to see) is all the AI influencers panicking and inventing conspiracy theories to downplay the engineering achievements of the team behind Deepseek. Catching up is always easier than leading the way from scratch.
[+] whimsicalism|1 year ago|reply
most of those models aren’t r1
[+] anothermathbozo|1 year ago|reply
I don’t think this entirely invalidates massive GPU spend just yet:

“ Therefore, we can draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.”

[+] lazzlazzlazz|1 year ago|reply
Worth noting that people have been unpacking and analyzing DeepSeek-R1 vigorously for days already on X before it got to Hacker News — it wasn't always this way.
[+] djtango|1 year ago|reply
Yes, there is now a latency to HN, and it's not always the first place to break tech news anymore...
[+] whimsicalism|1 year ago|reply
for ML, it has always been this way. HN is too tech-hostile and the discussion is less good

that said this is like the third r1 thread here

[+] alephnan|1 year ago|reply
HN has a general tech audience, including SWEs who are paid so much that they exhibit the Nobel Disease, and fauxtrepreneurs who use AI as a buzzword. They exist on X too, but the conversations are diffused. You'll have a section of crypto bros on there who know nothing about the technical subjects they are discussing. Other users' algorithms will match their own level of deep technical familiarity with AI.
[+] Skiros|1 year ago|reply
I can't say that it's better than o1 for my needs. I gave R1 this prompt:

"Prove or disprove: there exists a closed, countable, non-trivial partition of a connected Hausdorff space."

And it made a pretty amateurish mistake:

"Thus, the real line R with the partition {[n,n+1]∣n∈Z} serves as a valid example of a connected Hausdorff space with a closed, countable, non-trivial partition."

o1 gets this prompt right the few times I tested it (disproving it using something like Sierpiński's theorem).
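For reference, the mistake: adjacent intervals in that family share their integer endpoints, so the family is a cover but not a partition; and for compact spaces the claim is settled by Sierpiński's theorem. In sketch:

```latex
\[
  [n,\,n+1] \cap [n+1,\,n+2] \;=\; \{\, n+1 \,\} \;\neq\; \varnothing,
\]
so $\{[n,n+1] \mid n \in \mathbb{Z}\}$ covers $\mathbb{R}$ but is not a
partition. For the compact case, Sierpi\'nski's theorem states that a
continuum (a compact connected Hausdorff space) cannot be partitioned
into countably many ($\geq 2$) nonempty, pairwise-disjoint closed sets.
```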

[+] jumploops|1 year ago|reply
Curious if this will prompt OpenAI to unveil o1’s “thinking” steps.

Afaict they’ve hidden them primarily to stifle the competition… which doesn’t seem to matter at present!

[+] msp26|1 year ago|reply
How can OpenAI justify their $200/mo subscription if a model like this exists at an incredibly low price point? Operator?

I've been impressed in my brief personal testing and the model ranks very highly across most benchmarks (when controlled for style it's tied number one on lmarena).

It's also hilarious that OpenAI explicitly prevented users from seeing the CoT tokens on the o1 model (which you still pay for, btw) to avoid a situation where someone trained on that output. Turns out it made no difference lmao.