- The model is fine-tuned from Qwen-2.5 Instruct, which already includes millions of specially filtered math examples in both pretraining and supervised fine-tuning.
- To generate the perfect 817 math examples for LIMO, they used state of the art models like R1 to filter down from an initial pool of 10 million math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data. It’s not very clear to me if this is more or less impressive than getting the same result by simply fine-tuning on the 10 million initial pool, but I suppose that would make for a worse headline.
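As a rough illustration of what such a filtering pass might look like (the `pass_rate` and `quality` scores here are hypothetical stand-ins for judgments from a strong model like R1; the paper's actual pipeline is more involved):

```python
# Hypothetical sketch: keep only problems that strong models still find hard,
# then rank by solution quality and keep a tiny budget.
problems = [
    {"id": 1, "pass_rate": 0.95, "quality": 0.2},  # too easy: discarded
    {"id": 2, "pass_rate": 0.10, "quality": 0.9},  # hard and high quality: kept
    {"id": 3, "pass_rate": 0.30, "quality": 0.8},
    {"id": 4, "pass_rate": 0.05, "quality": 0.1},  # hard but low quality: ranked out
]

def curate(pool, max_pass_rate=0.5, budget=2):
    # keep problems that models usually fail, then take the best-explained ones
    hard = [p for p in pool if p["pass_rate"] <= max_pass_rate]
    hard.sort(key=lambda p: p["quality"], reverse=True)
    return hard[:budget]

kept = curate(problems)
print([p["id"] for p in kept])  # [2, 3]
```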
Yes, the authors explicitly highlight those two points in the abstract as the elicitation threshold for complex reasoning: an extremely complete pre-trained foundation model, and a set of extremely high-quality post-training examples.
To your question on fine-tuning on the initial 10 million pool: intuitively, it would require a tremendous amount of fine-tuning data to move the needle. You really won't be able to move the gradients much with just 817 examples in there; that initial pool is effectively enforcing pretty rigid regularization.
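A toy way to see that regularization intuition: average per-example gradients from a small curated set against those from a huge noisy pool (all sizes, dimensions, and noise levels below are made up for speed; this is a sketch of the intuition, not of any real training run):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
signal = np.ones(d) / np.sqrt(d)  # the "useful" gradient direction, norm 1

def mean_gradient(n_examples, noise_scale):
    # each example contributes the useful direction plus example-specific noise
    grads = signal + noise_scale * rng.standard_normal((n_examples, d))
    return grads.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

curated = mean_gradient(817, noise_scale=0.5)     # small, low-noise curated set
raw_pool = mean_gradient(20_000, noise_scale=50)  # big pool, mostly off-target

print(cosine(curated, signal))   # close to 1.0
print(cosine(raw_pool, signal))  # much lower
```

The small clean set points almost exactly along the useful direction, while the big noisy pool's average gradient is dominated by noise that barely cancels out.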
There is now increasing interest in showing that small data combined with inference-time scaling provides significant yield. A couple of recent examples:
* TinyZero: https://github.com/Jiayi-Pan/TinyZero
* s1, Simple Test-Time Scaling: https://arxiv.org/abs/2501.19393
Why is everyone so critical of using information from a previous model to make a more efficient model? There's nothing wrong with making progress using prior work. And increasing efficiency is progress.
You wouldn’t criticize someone’s kombucha because they didn’t piece their SCOBY (symbiotic culture of bacteria and yeast) together microbe by microbe.
Just imagine a textbook that gives you the understanding you need to score high in math competitions…and it describes less than 1,000 problems. This in itself is a major discovery in metacognition.
> To generate the perfect 817 math examples for LIMO, they used state of the art models like R1 to filter down from an initial pool of 10 million math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data
The paper, and this comment, seem awfully reminiscent of creating a textbook: a curated, "maximally informative and distilled" set of cognitive examples to teach students with foundational learning a next level of reasoning.
The last few years of LLM progress have shown we can predict human "reasoning" responses to inputs by modeling likely human responses as if LLM generated. Put another way, most responses are not particularly reasoned, but chain of tokgen*.
Sit near someone who "talks to herself" while doing problems and it's even more evident.
---
* tokgen definition: Listen to conversations in a cafeteria. Many are something other than thoughtful: responses that follow the prompts with near-perfect predictability. To differentiate these responses from speech that comes after a pause to reflect, one can use the labels thought versus token generation, or tokgen.
The context right now is that OpenAI, with first-mover advantage, cutting-edge-hardware, and tens of billions of dollars of investment, are not getting benchmark performance better than Chinese-developed models that are trained with cut-down nvidia GPUs and a lot less money.
You are missing the point: it's about showing the importance of the preselection. Now we know that we may not need huge amounts of data for similar results in other reasoning areas, only highly curated data; yes, sometimes curated by models themselves, but not necessarily.
Here is how I make sense of it (I have no expertise in this subject, so please feel free to correct me if I am wrong). When the model is pretrained on the internet, it gains most of the skills required for mathematical reasoning. However, since its task is to predict the next-word distribution over the entire internet, it does not normally use this ability, because most text on the internet is not this type of reasoning text. Think of generative image models a few years ago, where appending "unreal engine" to a prompt would significantly improve output quality. The model was trained to generate the distribution of images on the internet, most of which are not particularly impressive; but since images mentioning "unreal engine" were usually high-quality screenshots, the phrase shifted the distribution of generations toward higher quality. So I think the model already has most of the ability; it just needs to adjust a few connections to actually utilize this latent skill, and it makes sense that a few training examples are enough to adjust those connections and increase mathematical reasoning skill.
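The "unreal engine" effect is just conditioning shifting a distribution, which a toy simulation can show (the tag frequency and quality scores below are invented for illustration):

```python
import random

random.seed(0)

# Toy population: most "images" have middling quality, but the rare ones
# tagged "unreal engine" skew high (they were mostly polished screenshots).
population = []
for _ in range(10_000):
    if random.random() < 0.02:  # rare tagged subset
        population.append(("unreal engine", random.uniform(0.7, 1.0)))
    else:
        population.append(("generic", random.uniform(0.0, 1.0)))

def mean_quality(items):
    return sum(q for _, q in items) / len(items)

overall = mean_quality(population)
conditioned = mean_quality([x for x in population if x[0] == "unreal engine"])

print(round(overall, 2))      # population-wide average, middling
print(round(conditioned, 2))  # conditioned average, clearly higher
```

Sampling from the conditioned distribution rather than the overall one is, on this reading, exactly what a few well-chosen fine-tuning examples accomplish.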
Kinda similar to how Anthropic was able to achieve Golden Gate Claude, or even maximize/minimize features like "buggy code", by analyzing concepts in activations and manipulating them [0].
[0]: https://www.anthropic.com/news/mapping-mind-language-model
Pattern identification and continuation can be applied to evaluate symbolic reasoning. You can see this in e.g. the semantics of a functional programming language if evaluation semantics are defined in terms of rewrite rules.
If you have a model which can convert a problem into language that's precise enough to start pattern matching to LLM-encoded generative programs that evaluate logical implications, you can get into a very interesting space. Autoregressive prediction can turn into symbolic progressive evaluation and calculation. The background LLM is still guiding choice of evaluation and goal seeking.
Reinforcing these evaluation rules seems like it should be doable without enormous corpora, as long as the base model already has enough meat on it to cleanly attach to the more precise language.
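A minimal sketch of evaluation-as-rewriting, in the spirit of the rewrite-rule semantics described above (this is an illustration of the concept, not anything an LLM implements internally):

```python
# A toy rewrite system: evaluation is repeated pattern matching. Rules map a
# matched head to its reduction; evaluation applies rules bottom-up until no
# rule fires.

def rewrite(expr, rules):
    # expr is a nested tuple like ("add", ("add", 1, 2), 3), or a bare int
    if not isinstance(expr, tuple):
        return expr
    head, *args = expr
    args = [rewrite(a, rules) for a in args]  # reduce subterms first
    if head in rules and all(isinstance(a, int) for a in args):
        return rules[head](*args)
    return (head, *args)  # stuck term: no rule applies yet

rules = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
print(rewrite(("mul", ("add", 1, 2), 4), rules))  # 12
```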
The reasoning R1 demonstrates most of the time sounds to me like a 5th grader's wording, in support of what you say. But then if you compress the knowledge needed for math reasoning, perhaps you get category theory paired with Prolog or something along those lines, which is rule-based.
This suggests fine-tuning a base model (with SL or RL) generally doesn't make the model inherently smarter, only the initial self-supervised learning during pretraining does. Though it would be strange if no amount of reinforcement learning could make the LLM truly smarter.
My guess at the upshot: Some domains, like math, are general but have outsized effective vocabularies like all possible numbers, which makes them more expensive to train by the same method that works for domains of regular-sized vocabularies. If you train for reasoning steps in such a problem domain, you can reinforce the comparatively few general terms of the vocabulary like "add", "inverse", "solve". And that leaves the arithmetic of number combinations separate from particular problems because you're not emphasizing one-shot answers. You can train N reasoning cases + M arithmetic cases instead of N*M whole math problems. So you have to use more inference power but you can get better answers for less training.
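The N + M versus N * M arithmetic can be made concrete (the vocabulary and operand ranges below are arbitrary placeholders):

```python
from itertools import product

# If reasoning templates and arithmetic facts are learned separately, they
# compose; if every whole problem must be seen, the counts multiply.
reasoning_steps = ["add", "inverse", "solve"]        # N general terms
operand_pairs = list(product(range(10), range(10)))  # M arithmetic cases

n, m = len(reasoning_steps), len(operand_pairs)
print(n + m)  # 103 things to learn separately
print(n * m)  # 300 whole problems if each combination needs its own example
```

The gap obviously widens as both vocabularies grow, which is the claimed source of the training savings.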
Theory aside, I would think a good application-side method is to use this general reasoning process to structure a final expression and then pass that through a traditional evaluator. Then the reasoning and training thereof need only go as far as symbol manipulation. This is something like Wolfram Alpha, if its NLP handed off to the evaluator much later in the process.
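A minimal version of that late handoff, using Python's `ast` module as the "traditional evaluator" (where exactly the model's symbol manipulation ends and the evaluator begins is an assumption of this sketch, not anything from the paper):

```python
import ast
import operator

# The model's job ends at producing a well-formed arithmetic expression;
# exact evaluation is done conventionally, never by token prediction.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.USub: operator.neg}

def safe_eval(expression: str) -> int:
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

print(safe_eval("(13 + 29) * -7"))  # -294
```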
A connected question: has there been an LLM that is a perfect calculator? I.e., you give it an expression involving standard operations (+, -, etc.) and, say, integer numbers, and it always returns a correct result.
I don't remember seeing any papers on this (but I'm not an expert).
I think I've recently read two seemingly contradicting things:
1- LLMs can never generalize theorem proving
2- this paper: "This suggests that contemporary LLMs may already possess rich mathematical knowledge in their parameter space, transforming the challenge from knowledge acquisition to knowledge elicitation"
Not sure what is what anymore!
I think the way to swallow this bitter pill is to acknowledge they can "generalize" because all human knowledge is actually a relatively "small" finite distribution that models are now big enough to pattern match on.
The LLM can generate the correct search space for the problem, but identifying the solution within the search space is inefficient?
Another way to put this: most students who study the lecture notes for their high school math already have it within them to get a gold at the olympiad (the math itself is not more advanced than their high school material), but getting a high school kid to get gold at the olympiad is hard. It might be something similar to P vs NP.
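The P vs NP flavor of this can be sketched with subset-sum, where verifying a candidate is cheap but finding one may take exponential search:

```python
from itertools import combinations

# Toy verify-vs-find asymmetry: the knowledge needed to *check* an answer is
# tiny, but locating the answer in the search space is expensive.
numbers = [3, 34, 4, 12, 5, 2]
target = 9

def verify(subset, target):  # cheap: linear in the subset size
    return sum(subset) == target

def search(numbers, target):  # expensive: up to 2^n candidate subsets
    for r in range(1, len(numbers) + 1):
        for subset in combinations(numbers, r):
            if verify(subset, target):
                return subset
    return None

solution = search(numbers, target)
print(solution)                  # (4, 5)
print(verify(solution, target))  # True
```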
You are going to see a lot of people (both hypers and skeptics) tell you things that you can verify yourself. Even when you have a screenshot verifying the opposite of what they are claiming, they will continue to claim it.
For skeptics in particular, you will be able to use a top-tier LLM and see: does it do the thing someone claims it doesn't? It often will. If you look at recently submitted papers by skeptics, you will see them making a claim about state-of-the-art LLMs but then only testing versions from over a year ago (this has happened recently!^).
The way for you to be sure what is what is to just use the thing for yourself and decide what is true.
^ https://x.com/tylercowen/status/1881051976102035880
You could have rich mathematical knowledge while not being very good at proving theorems. You might also be good at solving competitive mathematics problems without having rich mathematical knowledge. It's also possible to have rich mathematical knowledge and be good at proving theorems, but mostly in your field of expertise.
In the same way that image diffusion models showed that convincing approximations of the entire visual world could be summarized in a 5GB model, are "reasoning patterns" similarly compressible? Are there actually countably few reasoning patterns that are used across all domains, and as such can be captured with relatively small training sets?
I would say there are only a smallish number of truly generic "reasoning patterns" (strategies/approaches), but applied reasoning not only requires a reasoning "pattern", but also a repertoire of valid domain-specific reasoning steps that can be applied pursuant to that approach, as well as the combination of capabilities it takes to overcome impasses when you've exhausted your knowledge and learnt reasoning steps and still not got to a solution.
Perhaps in a domain like math a smallish number of math-specific reasoning steps will go a long way, but math itself also has many "sub-domains" (algebra, geometry, calculus, topology, etc), and AFAIK the techniques of one branch are only going to be useful in another to the extent you can map the problem from one domain to the other.
If the LIMO hypothesis is true, namely that small models have a latent capacity for efficient reasoning which can be elicited by fine-tuning the model with a small dataset, then we could see a huge transfer of power from huge models to small models, and applying that recurrently seems to offer unlimited power. But to feed that loop, those datasets should have a particular property: they teach the model to adapt its reasoning to the model size, which is verified by the model extending the depth of the reasoning chain while using a small branching factor in the exploration space, like a minimum cover to detect deep patterns.
Reasoning is the art of prediction. Reasoning is distilling many observations of reality into a tiny model of reality that predicts new observations well enough. "What's the simplest model that explains most of what I'm seeing?" is the main question our mind tries to answer. When the art of creating such models is mastered, we pattern-match new problems to our models and use them to predict the outcome.
I noticed a similar phenomenon in my work on JoyCaption when I began teaching it VQA. JoyCaption was trained on about 800k image-caption pairs, and built from so400m and Llama 3.1 8B Instruct. There's no VQA data in its training.
As an experiment, I hand built a VQA dataset of ~600 examples, which is a vanishingly small number compared to even rudimentary VQA datasets (which tend to be about 10k examples or more). However, I ensured that the dataset was broad and highly varied, and that the queries aggressively exercised both visual and textual understanding.
With only 600 training examples, I finetuned the base JoyCaption model in a handful of minutes and to my surprise, not only did it gain VQA abilities, it's able to generalize quite far outside of its training set. Even for concepts not in the original 800k caption data.
My hypothesis is that if the training data is varied enough, it forces the model to generalize. It isn't given enough examples of any given type of task to learn specialized circuitry for them, so its only option is to learn a broadly generalized set of circuitry. The data keeps it on its toes, so to speak.
Of course, this leans heavily on Llama's existing instruction (text-based) tuning, so it's starting off on good footing there. The surprising bit is being able to generalize so well to a new domain (vision) with so little data.
One caveat is that this model is highly unstable, and the accuracy of its responses is much worse than the accuracy of the base model. It's able to handle all of the tasks I've tested on it, but often requires a few retries to get it right.
Building these datasets is also tedious and intensive. I've yet to successfully train existing AIs to generate useful user queries/instructions/questions, either through prompting or finetuning. So it has to all be done by hand. And every answer was either written by me, or generated by an existing VLM and then edited by me to ensure perfect accuracy and adherence to the request. Since the queries are complex and challenging, this makes the work of writing those answers similarly challenging and time consuming.
As an aside: this training also seems to have broken Llama's alignment. I've had it be remarkably sassy in its responses, and it's much better at simulating more normal human responses.
With really high-quality samples, the reasoning ability of a well-trained LLM can be activated using a very small number of SFT samples; this is what I learned from the paper. It is an interesting finding, but not practical, as you need a far more capable reasoning model (R1 in this case) to get those 817 high-quality samples first. DeepSeek-R1-Distill-Qwen-32B has better reasoning skills according to the same benchmarks.
Another trend I've noticed is that there are already 3 papers reporting similar findings using Qwen-2.5-Instruct. Did they find something general about LLMs, or something unique to Qwen-2.5-Instruct? I guess we need more experimental results to draw conclusions.
I think the title of the paper is misleading. Obviously the result shows impressive performance with just a few training examples. However, I cannot see that, keeping the method the same, reducing training data leads to more performance; they have simply shifted the performance curve (impressively) to lower thresholds. More training data should still give better results with this new method. It would be interesting to see a full performance curve for the method as a function of training data amount (and potentially quality).
It's actually difficult for non-Chinese readers to work out the affiliations of the authors. SJTU = Shanghai Jiao Tong University, but I couldn't work out GAIR and IIS.
So it sounds like we should have schizophrenic AIs that alternate and collaborate between specialized, domain-specific submodels. I guess the number of submodels does not cost compute, so it can grow quite large, and if each submodel is as reduced as in this paper, the overall compute cost should drop substantially.
While there are interesting findings here, https://arxiv.org/pdf/2502.03373 (also with a lot of good findings) suggests a contradicting theory on the critical mass of training process/data needed for reasoning capability.
The s1 paper did basically the same a few days ago: 1,000 total CoT examples with SFT.
I believe all this shows that the pre-training stage already creates the representations needed for CoT reasoning, so they are very simple to uncover, whether with R1-Zero-style pure RL or with small-scale SFT.
Any idea if the same dataset can be used to improve human reasoning? Let's say I manually analyze 817 math examples; would that be the optimal strategy for me to improve my math reasoning? Can the same distillation process be applied to LeetCode?
This training is less about learning how to reason and more about conditioning the LLM to use self-evaluation automatically. You could probably reproduce this effect yourself by sticking a paper reminder in front of you that says "after every small step, spend 2 minutes considering whether it's right and whether it works in the context of the task so far; evaluate alternatives" (which, yes, could likely improve reasoning).
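A literal version of that paper-reminder trick, as a prompt wrapper (the instruction text here is illustrative, not from the paper):

```python
# Conditioning-by-prompt: prepend a standing self-evaluation instruction to
# every task instead of fine-tuning the habit into the weights.
SELF_CHECK = (
    "After every small step, pause and check: is the step correct, and does it "
    "work in the context of the task so far? Consider alternatives before moving on."
)

def with_self_evaluation(task: str) -> str:
    return f"{SELF_CHECK}\n\nTask: {task}"

prompt = with_self_evaluation("Solve for x: 3x + 5 = 20")
print(prompt)
```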
Come in under the shadow of this impure rock
And I will show you something different from either
Your shadow at morning striding behind you
Or your shadow at evening rising to meet you;
I will show you wisdom in a handful of sand.
Yeah that was the idea behind the Phi series of models. It gets good benchmark results but you can still tell something is missing when you actually try to use it for anything.
You don't want that as a product: having an AI model train itself by simply having internal conversations, without ever looking at any human-written content, might result in something that humans cannot comprehend.
Also, well - there's the technicality of "you don't 'win' a conversation like you can 'win' at Go", so how would you know to reward the model as you're training it?
We kind of have that in DeepSeek-R1-Zero [1], but it has problems. From the original authors:
> With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing.
A lot of these we can probably solve, but as others have pointed out, we want a model that humans can converse with, not an AI built for the purpose of other AIs.
That said, it seems like a promising area of research:
> DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community.
[1] https://github.com/deepseek-ai/DeepSeek-R1
Someone first needs to design a rule set for a game that only permits the correct use of language but encompasses the entire breadth of language use. Then it's plausible.
Thankfully for mathematics and code this seems plausible due to automated theorem proving.
The advantage of AlphaGo Zero is that it is constrained to the language of Go. If you made two LLMs train only off each other, they would develop their own language. Maybe they'd be great at reasoning, but we wouldn't understand them. Even humans in that situation would develop jargon, and as time goes on, a dialect or language of their own. And humans are a lot more grounded in their language than LLMs.
Sounds like any textbook. (and generally the process of knowledge compression over generations that made us who we are)
I mean, you wouldn't use this brand of AI to plot your path to Mars. Well, you could, BUT you'll also want to validate the path or risk dying.
But this AI is good enough for Elon and his ilk. Because Elon's not gonna get into the capsule, you are.
Because you are not the master of this AI, you are the validator.
The word "reasoning" has been subverted by those pushing these LLMs, and we have all bought in. Quite a magic trick this illusionist has pulled on us.