The real thing that surprises me (as a layman trying to get up to speed on this stuff) is that there's no "trick" to it. It really just does seem to be a textbook application of RL to LLMs.
Going from a base LLM to a human instruction-tuned (SFT) one is definitely an ingenious leap where it's not obvious that you'd get anything meaningful. But once we quickly saw afterwards that prompting for chain of thought improved performance, why wasn't this the immediate next step that everyone took? It seems like even after the release of o1 the trick wasn't apparent to everyone, and if it weren't for DeepSeek people still might not have realized it.
> why wasn't this the immediate next step that everyone took?
It was actually tested by various labs, just probably not at this scale. The first model that featured RL prominently was DeepSeek-math-7b-RL, published last year in April. It was at the time the best model for math, and remained so until the qwen2.5-math series, which probably had way more data put into it.
There's a thing about RL that makes it tricky - the models tend to behave very stubbornly. That is, if they see something that resembles what they were trained on (e.g. math problems), they'll solve the problem, and they'll be good at it. But if you want something close to that but not quite solving it (e.g. "analyse this math problem and write hints", or "here are 5 problems, extract the common methods used to solve them"), you'll see that they perform very poorly, oftentimes just going straight into "to solve this problem we...".
This is even mentioned in the R1 paper: poor adherence to prompts, especially system prompts. So that is still challenging.
Chain of thought prompting ("think step by step") only encourages the model to break the problem into steps, which allows it to incrementally build upon each step (since the output is fed back in as part of the input).
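To make that concrete, here is a toy sketch of the autoregressive loop (the `sample_next_token` stub is a hypothetical stand-in for a real model's decoding step, not any particular library's API):

```python
# Toy illustration: every token the model emits is appended to the context,
# so each reasoning step it writes out becomes input that conditions the
# next step. sample_next_token is a hypothetical stand-in, not a real API.
def sample_next_token(context: str) -> str:
    return "<eos>"  # a real model would sample from p(next token | context)

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    context = prompt  # e.g. a question followed by "Let's think step by step."
    for _ in range(max_new_tokens):
        token = sample_next_token(context)
        if token == "<eos>":
            break
        context += token  # earlier steps feed forward into later ones
    return context
```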
Reasoning requires more than chain of thought, since it's often not apparent what the next step should be - you (human, or model) may go down one path of reasoning only to realize it's going nowhere, and have to back up and try something else instead. This ability to "back up" - to realize that an earlier reasoning "step" was wrong and needs to be rethought - is what was mostly missing from models that (unlike o1, etc.) hadn't been trained for reasoning.
The reason non-reasoning models can't reason appears to be that this type of stream-of-consciousness thought (thinking out loud, mistakes and all) when trying to figure out a problem is hugely underrepresented in a normal training set. Most writing you find on the internet, or other sources, is the end result of reasoning - someone figured something out and wrote about it - not the actual reasoning process (mistakes and all) that got them there.
It's still not clear what OpenAI had to do, if anything, to help bootstrap o1 (special hand-created training data?), but basically by using RL to encourage certain types of reasoning pattern, they were able to get the model to back up and self-correct when needed. DeepSeek-R1 may well have used o1 reasoning outputs as a bootstrap, but either way they have been able to replicate RL training that encourages self-correcting reasoning in the same way.
One interesting aspect of DeepSeek-R1 is that they have shown that once you have a reasoning model, you can run it and use it to generate a bunch of reasoning outputs that can then be used as normal training data to fine-tune a non-reasoning model, even a very small one. This proves that, at least to some degree, the reason non-reasoning models couldn't reason is simply that they had not been trained on enough self-correcting reasoning examples.
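As a rough sketch of what that distillation recipe might look like (the function names below are hypothetical stand-ins, not DeepSeek's actual code):

```python
# Sketch of the idea above: sample reasoning traces from a strong reasoning
# model, keep the ones whose final answer checks out, and use them as plain
# SFT data for a small non-reasoning model. All callables are stand-ins.
def distill(problems, reasoning_model, is_correct, sft_finetune, small_model, n_samples=4):
    sft_pairs = []
    for prob in problems:
        for _ in range(n_samples):
            trace = reasoning_model(prob)        # full trace, mistakes and corrections included
            if is_correct(prob, trace):          # e.g. compare the final answer to ground truth
                sft_pairs.append((prob, trace))
    # ordinary supervised fine-tuning on the filtered traces - no RL involved
    return sft_finetune(small_model, sft_pairs)
```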
I've wondered this too; I really hope someone with more knowledge can comment. My impression is that people worked on this kind of thing for years before they started seeing a 'signal', i.e. before they actually got RL working to improve performance. But why is that happening now? What were the tricks that made it work?
The Tulu team saw it, but yes, nobody scaled it to the extent DeepSeek did. I am surprised that the FAANG labs, which have the best of the best, didn't see this.
I wonder if OpenAI did the same thing, or if they instead took the approach of manually building an expensive, human-designed supervised learning dataset for reasoning. If the latter, they must be really kicking themselves now.
I think a lot of it had to do with DeepSeek's need to use as few resources as possible - asking why do it this way, and how it could be done in fewer steps using fewer resources. Whereas most of the FAANG labs were looking at throwing more data and processing power at it.
This was my takeaway as well - the paper was so simple I was shocked by it. We've been doing RL on LLMs for a while now, and it's more surprising this didn't happen sooner.
There was a whole bunch of people who claimed LLMs can't reason at all and that everything is a regurgitation. I wonder what they have to say about this. Like, what exactly is going on here with chain of thought reasoning from their expert perspective?
> There was a whole bunch of people who claimed LLMs can't reason at all and that everything is a regurgitation. I wonder what they have to say about this.
I don't see that as a refutation of the former, actually: models trained to be stochastic parrots, with next-token prediction as their only learning target, were indeed stochastic parrots. Now we've moved to a completely different technology that features reinforcement learning in its training, so it will move farther and farther from stochastic parrots and closer and closer to “intelligence”.
If anything, the fact that the entire industry has now moved to RL instead of just cramming through trillions of tokens to make progress is a pretty strong acknowledgement that the “stochastic parrots” crowd was right.
Here is R1 trying to multiply a large number (successfully): https://gist.github.com/omarabid/038678cc269a3f2db756a7e0825...
If you pick a random combination, there is a very good chance that the combination and the product do not exist anywhere. So the LLM has to "create" it somehow.
It sure goes through a lot (hundreds of lines of self-reflection) but it successfully does the math.
I don't think it is the same kind of "reasoning" as humans, but there is an emergent kind of structure happening here that is allowing for this reasoning.
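As a small illustration of why the model has to construct it: the product of two random large numbers almost certainly never appears verbatim in any training set, so it has to be rebuilt from smaller facts the model does know, roughly like the partial-products decomposition below (the numbers are arbitrary):

```python
# Decompose one multiplication into many small, individually checkable steps,
# which is roughly what R1's long self-reflection is doing in the gist above.
a, b = 739_214, 86_357  # arbitrary example values
partials = [a * int(d) * 10**i for i, d in enumerate(reversed(str(b)))]
assert sum(partials) == a * b
print(sum(partials))  # same value as a * b
```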
I don't get this whole debate - surely what's meant by "reason" can be strictly defined and measured? Then we can conclusively say whether or not it's happening with LLMs.
It seems to me like the debate is largely just semantics about how to define "reason".
This is American history as written by R1; it is very logical:
Whenas the nations of Europa did contend upon the waves—Spain plundered gold in Mexica, Albion planted cotton in Virginia—thirteen colonies did kindle rebellion. General Washington raised the standard of liberty at Philadelphia; Franklin parleyed with Gaul’s envoys in Paris. When the cannons fell silent at Yorktown, a new republic arose in the wilderness, not by Heaven’s mandate, but by French muskets’ aid.
Yet the fledgling realm, hedged by western forests and eastern seas, waxed mighty. Jefferson purchased Louisiana’s plains; Monroe’s doctrine shackled southern realms. Gold-seekers pierced mountains, iron roads spanned the continent, while tribes wept blood upon the prairie. Then roared foundries by Great Lakes, bondsmen toiled in cotton fields, steel glowed in Pittsburgh’s fires, and black gold gushed from Texan soil—a molten surge none might stay.
Wilson trod Europe’s stage as nascent hegemon. Roosevelt’s New Deal healed wounds; Marshall’s gold revived ruined cities. The atom split at Alamogordo; greenbacks reigned at Bretton Woods. Armadas patrolled seven seas, spies wove webs across hemispheres. Through four decades’ contest with the Red Bear, Star Wars drained the Soviet coffers. Silicon’s chips commanded the world’s pulse, Hollywood’s myths shaped mankind’s dreams, Wall Street’s ledgers ruled nations’ fates—a fleeting "End of History" illusion.
But the colossus falters. Towers fell, and endless wars began; subprime cracks devoured fortunes. Pestilence slew multitudes while ballots bred discord. Red and Blue rend the Union’s fabric, gunfire echoes where laws grow faint. The Melting Pot now boils with strife, the Beacon dims to a prison’s glare. With dollar-cloth and patent-chains, with dreadnoughts’ threat, it binds the world—nations seethe yet dare not speak.
Three hundred million souls, guarded by two oceans, armed with nuclear flame, crowned with finance’s scepter—how came such dominion to waver? They fortified might but neglected virtue, wielded force but forgot mercy. As Mencius warned: "He who rides tigers cannot dismount." Rome split asunder, Britannia’s sun set; behold now Old Glory’s tremulous flutter. Thus say the sages: A realm endures by benevolence, not arms; peace flows from harmony, not hegemony—this truth outlives all empires.
Just a year ago everyone was saying LLMs aren't intelligent and everything is regurgitation. A lot of people on HN "knew" this and defended this perspective vehemently. It's quite embarrassing how wrong they are.
That being said, I don't think it's quite blown that wide open yet. But for sure the trendlines are pointing at AGI within our lifetimes.
So what is interesting here is that they managed to set up the reward model in such a simple and cost-effective way that CoT emerges as the optimal strategy for solving math problems, without explicitly fine-tuning the model to do so.
This naturally raises the question: How do you design a reward model to elicit the desired emergent behavior in a system?
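If I'm reading the R1 paper right, for math the reward is essentially rule-based rather than a learned reward model: check the final answer, check the output format, and let RL find whatever strategy maximizes that. A minimal sketch (the tags follow the paper's template; the weights are my own made-up assumption):

```python
import re

# Rule-based reward sketch: score a completion by (1) whether it wraps its
# reasoning and answer in the expected tags and (2) whether the extracted
# answer matches the known ground truth. Long CoT then emerges because it
# tends to raise (2), not because it is rewarded directly.
ANSWER_RE = re.compile(r"<think>.*</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def reward(completion: str, ground_truth: str) -> float:
    m = ANSWER_RE.search(completion)
    format_ok = m is not None
    answer = m.group(1).strip() if m else ""
    correct = answer == ground_truth.strip()   # no learned reward model needed
    return 1.0 * correct + 0.1 * format_ok     # weights are illustrative only
```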
Is it accurate to compare 8k-example RL with 8k-example SFT? RL with the same number of examples would take massively more compute than the SFT version (though it depends on how many rollouts they do per example).
RL is more data-efficient but that may not be relevant now that we can just use Deepseek-R1's responses as the training data.
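Rough back-of-the-envelope for that compute gap (every number below is a made-up assumption, just to show the shape of the comparison):

```python
# SFT trains on one target response per example; GRPO-style RL first samples
# G rollouts per prompt and then trains on all of them, so generated-token
# throughput scales roughly with G. All values here are assumptions.
examples   = 8_000
rollouts   = 16        # G: sampled completions per prompt (assumed)
avg_tokens = 4_000     # average response/reasoning length in tokens (assumed)

sft_tokens = examples * avg_tokens
rl_tokens  = examples * rollouts * avg_tokens
print(f"RL touches roughly {rl_tokens // sft_tokens}x more generated tokens than SFT")
```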
Vampiero | 1 year ago
When LLMs are good at Prolog, it means they're good at logic, which means they're good at reasoning. Until then, you can't trust them.
dartos | 1 year ago
It’s all still tokens…
suraci | 1 year ago
However, it is still highly literate (both in English and Chinese), which I believe is one of its advantages.
JPLeRouzic | 1 year ago
It seems LLMs are wiser than humans, after all.
alsaaro | 1 year ago
With what prompt?
govideo | 1 year ago
btw, I think this is a net major benefit for the US startup ecosystem -- from new model developers to applications.
Edit: Stevvo - Thanks for your info.
swyx | 1 year ago
for some reason a lot of people are choosing to blog on notion