top | item 44872850

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

160 points | blueridge | 7 months ago | arstechnica.com

131 comments

[+] NitpickLawyer|7 months ago|reply
> Without specification, we employ a decoder-only language model GPT2 (Radford et al., 2019) with a configuration of 4 layers, 32 hidden dimensions, and 4 attention heads.

Yeah, ok. The research is interesting and warranted, but writing an article about it, leading with conclusions gathered from toy models, and implying that this generalises to production LLMs is useless.

We've been here before with small models. Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no one read the fine print: they were testing on small toy models, and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results.

Research in this area is good, and needed. Mainly to understand limitations, discover if there are any scale levels where "emergent" stuff appears and so on. But writing articles based on incipient research, based on tiny models is not worth the effort.
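For perspective, here is a back-of-envelope parameter count of the quoted configuration, assuming GPT-2's default vocabulary and context sizes (the paper may have shrunk those too) and ignoring biases and layer norms:

```python
# Rough parameter count for the quoted toy GPT-2 config:
# 4 layers, hidden size 32, 4 heads. Vocab/context are GPT-2 defaults, assumed.
vocab, d, layers, ctx = 50257, 32, 4, 1024

embeddings = vocab * d + ctx * d        # token + position embedding tables
per_layer = 4 * d * d + 8 * d * d       # attention (Q, K, V, out) + 4x-wide MLP
transformer = layers * per_layer

print(f"embeddings:  {embeddings:,}")   # 1,640,992
print(f"transformer: {transformer:,}")  # 49,152
# ~1.7M parameters, almost all of it embedding table -- roughly five
# orders of magnitude below production LLMs.
```

Whatever you think of extrapolation, the actual transformer here is about 49k parameters.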

[+] willvarfar|7 months ago|reply
Doing analysis on small models or small data is perfectly valid if the results extrapolate to large models. Which is why, right now, we're looking at new research papers that still list the same small datasets and compare against the same small models that papers five years ago did.
[+] kazinator|7 months ago|reply
> conclusions gathered from toy models and implying this generalises to production LLMs is useless

You are just trotting out the tired argument that model size magically fixes the issues, rather than just improves the mirage, and so nothing can be known about models with M parameters by studying models with N < M parameters.

Given enough parameters, a miraculous threshold is reached whereby LLMs switch from interpolating to extrapolating.

Sure!

[+] OtherShrezzing|7 months ago|reply
I think it is worth writing about simply because it might get the (cost-constrained) researcher's work in front of someone with the near-unlimited research budget of one of the big AI companies.
[+] Insanity|7 months ago|reply
The results from a smaller model are still viable if the paradigm is identical, unless you believe that larger volumes of data lead to more (unexplained) emergent properties of the AI. I.e., if you think that a larger volume of training data somehow means the model develops actual reasoning skills, beyond the normal next-token prediction.

I do think that larger models will perform better, but not because they fundamentally work differently than the smaller models, and thus the idea behind TFA still stands (in my opinion).

[+] suddenlybananas|7 months ago|reply
>Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no one read the fine print: they were testing on small toy models, and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results

You're conflating two very different things. Training on synthetic data one time is very different than cyclically training models on their own data. It has nothing to do with model size.

[+] kevingadd|7 months ago|reply
Almost every mention I've seen of gpt-oss was a complaint that the training on synthetic datasets produced a model that's mostly good at benchmarks. Are benchmarks the great results you're referring to or are there a lot of satisfied users out there that just don't post here on HN? Genuinely curious.

I can see how performing well on benchmarks at the expense of everything else counts as great results if that's the point of the model.

[+] pxc|7 months ago|reply
Well now they could use GPT-OSS, but it wasn't out when they began the study.

I've recently been taking a look at another paper, from 2023, and subsequent research. It has a morally similar finding, though not focused on "reasoning traces", but it's based on GPT-4:

https://proceedings.neurips.cc/paper_files/paper/2023/hash/d...

[+] syllogism|7 months ago|reply
It's interesting that there's still such a market for this sort of take.

> In a recent pre-print paper, researchers from the University of Arizona summarize this existing work as "suggest[ing] that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text."

What does this even mean? Let's veto the word "reasoning" here and reflect.

The LLM produces a series of outputs. Each output changes the likelihood of the next output. So it's transitioning in a very large state space.

Assume there exist some states that the activations could be in that would cause the correct output to be generated. Assume also that there is some possible path of text connecting the original input to such a success state.

The reinforcement learning objective reinforces pathways that were successful during training. If there's some intermediate calculation to do or 'inference' that could be drawn, writing out a new text that makes that explicit might be a useful step. The reinforcement learning objective is supposed to encourage the model to learn such patterns.

So what does "sophisticated simulators of reasoning-like text" even mean here? The mechanism that the model uses to transition towards the answer is to generate intermediate text. What's the complaint here?

It makes the same sort of sense to talk about the model "reasoning" as it does to talk about AlphaZero "valuing material" or "fighting for the center". These are shorthands for describing patterns of behaviour, but of course the model doesn't "value" anything in a strictly human way. The chess engine usually doesn't see a full line to victory, but in the games it's played, paths which transition through states with material advantage are often good -- although it depends on other factors.

So of course the chain-of-thought transition process is brittle, and it's brittle in ways that don't match human mistakes. What does it prove that there are counter-examples with irrelevant text interposed that cause the model to produce the wrong output? It shows nothing --- it's a probabilistic process. Of course some different inputs lead to different paths being taken, which may be less successful.
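The "transitioning in a state space" framing can be made concrete with a toy chain. The transition probabilities below are entirely made up for illustration, not measured from any model; the point is only that emitting intermediate text can route more probability mass to the answer than jumping to it directly:

```python
# Toy state space: keys are (current_text, next_text) transitions with
# hypothetical probabilities -- purely illustrative numbers.
p = {
    ("question", "answer"): 0.05,       # direct jump: rarely reinforced
    ("question", "worked_step"): 0.70,  # writing out the intermediate step
    ("worked_step", "answer"): 0.60,    # ...from which the answer is likelier
}

p_direct = p[("question", "answer")]
p_via_step = p[("question", "worked_step")] * p[("worked_step", "answer")]

# The detour through 'worked_step' dominates the direct jump.
print(p_direct, p_via_step)
```

That detour is the whole mechanism; calling it "reasoning" or "reasoning-like text" doesn't change the arithmetic.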

[+] wzdd|7 months ago|reply
> The mechanism that the model uses to transition towards the answer is to generate intermediate text.

Yes, which makes sense, because if there's a landscape of states that the model is traversing, and there are probabilistically likely pathways between an initial state and the desired output, but there isn't a direct pathway, then training the model to generate intermediate text in order to move across that landscape so it can reach the desired output state is a good idea.

Presumably LLM companies are aware that there is (in general) no relationship between the generated intermediate text and the output. The point of the article is that by calling it a "chain of thought" rather than "essentially-meaningless intermediate text which increases the number of potential states the model can reach", users are misled into thinking that the model is reasoning. They may then make unwarranted assumptions, such as that the model could apply the same reasoning to similar problems, which is in general not true.

[+] skywhopper|7 months ago|reply
So, you agree with the point that they’re making and you’re mad about it? It’s important to state that the models aren’t doing real reasoning because they are being marketed and sold as if they are.

As for your question: ‘So what does "sophisticated simulators of reasoning-like text" even mean here?’

It means CoT interstitial “reasoning” steps produce text that looks like reasoning, but is just a rough approximation, given that the reasoning often doesn’t line up with the conclusion, or the priors, or reality.

[+] wnissen|7 months ago|reply
It's not clear what LLMs are good at, and there's great interest in finding out. This is made harder by the frenetic pace of development (GPT 2 came out in 2019). Not surprising at all that there's research into how LLMs fail and why.

Even for someone who kinda understands how the models are trained, it's surprising to me that they struggle when the symbols change. One thing computers are traditionally very good at is symbolic logic. Graph bijection. Stuff like that. So it's worrisome when they fail at it. Even in this research model which is much, much smaller than current or even older models.

[+] heresie-dabord|7 months ago|reply
> It's interesting that there's still such a market for this sort of take.

What do you think the explanation might be for there being "such a market"?

[+] bubblyworld|7 months ago|reply
Not sure why everyone is downvoting you as I think you raise a good point - these anthropomorphic words like "reasoning" are useful as shorthands for describing patterns of behaviour, and are generally not meant to be direct comparisons to human cognition. But it goes both ways. You can still criticise the model on the grounds that what we call "reasoning" in the context of LLMs doesn't match the patterns we associate with human "reasoning" very well (such as ability to generalise to novel situations), which is what I think the authors are doing.
[+] Workaccount2|7 months ago|reply
If you read the comments on AI articles on Ars Technica, you will find that they seem to have become the tech bastion of anti-AI sentiment. I'm not sure how it happened, but it seems they found, or fell into, a strong anti-AI niche, and now feed it.

You cannot even see the comments of people who pointed out the flaws in the study, since they are so heavily downvoted.

[+] zerof1l|7 months ago|reply
> ... that these "reasoning" models can often produce incoherent, logically unsound answers when questions include irrelevant clauses or deviate even slightly from common templates found in their training data.

I have encountered this problem numerous times now. It really makes me believe that the models do not really understand the topic, even the basics, but just try to predict the text.

One recent example was me asking the model to fix my docker-compose file. In it, there's `network: host` under the `build` section. The model kept assuming that the container would be running with the host network and kept asking me to remove it as a way to fix my issue, even though it wouldn't do anything for the running container, because the container runs on the `custom_net` network only. The model was obsessed with it and kept telling me to remove it until I explicitly told it that this is not, and cannot be, the issue.

```
services:
  app:
    build:
      network: host
    networks:
      custom_net:
    ...
```
[+] burnte|7 months ago|reply
> It really makes me believe that the models do not really understand the topic, even the basics but just try to predict the text.

This is correct. There is no understanding, there aren't even concepts. It's just math, it's what we've been doing with words in computers for decades, just faster and faster. They're super useful in some areas, but they're not smart, they don't think.

[+] Frieren|7 months ago|reply
This assessment fits with my anecdotal evidence. LLMs just cannot reason in any basic way.

LLMs have a large knowledge base that can be spit out at a moment's notice. But they have zero insight into its contents, even when the information was given just a few lines before.

Most of the "intelligence" that LLMs show is just the ability to mirror the correct questions, asked in the correct way, back to the user. That is why there is so much advice on how to do "proper prompting".

That, and the fact that most questions have already been asked before, as anyone who spent some time on Stack Overflow back in the day realized. Memory, not reasoning, is what is needed to answer them.

[+] starchild3001|6 months ago|reply
> This assessment fits with my anecdotal evidence. LLMs just cannot reason in any basic way.

LLM reasoning is brittle and not like human cognition, but it is far from zero. It has demonstrably improved to a point where it can solve complex, multi-step problems across domains. See the numerous successful benchmarks and out of sample evals (livebench.ai, imo 2025, trackingai.ai IQ, matharena.ai etc).

I personally gained multiple months of productivity from vibe coding in 2025. If being able to correctly code a complex piece of software from a vague, single-paragraph description isn't reasoning, what else is? Btw, I don't code UIs. I code complex mathematical algorithms, some of which are found in no textbook.

> LLMs have a large knowledge base that can be spit out at a moment notice. But they have zero insight on its contents, even when the information has just been asked a few lines before.

LLMs have excellent recall of recent information within their context window. While they lack human-like consciousness or "insight," their ability to synthesize and re-contextualize information from their vast knowledge base is a powerful capability that goes beyond simple data retrieval.

If anything LLMs show polymath-level ability to synthesize information across domains. How do I know? I use them everyday and get great mileage. It's very obvious.

> Most of the "intelligence" that LLMs show is just the ability to ask in the correct way the correct questions mirrored back to the user. That is why there is so many advice on how to do "proper prompting".

Prompting is the user interface for steering the model's intelligence. However, the model's ability to generate complex, novel, and functional outputs that far exceed the complexity of the input prompt shows that its "intelligence" is more than just a reflection of the user's query.

To summarize, I'm appalled by your statements, as a heavy daily user of SoTA LLMs for practically anything. I suspect you don't use them enough, and lack a visceral feel for the scope of their capabilities.

[+] PeterStuer|7 months ago|reply
Please don't tell me you were one of those people marking every SO question as a duplicate, more often than not missing the entire nuance that made it not a duplicate at all, with the answers to the so-called previously asked question utterly unusable?

This was one of those infuriating things that drove so many away from SO; they jumped ship the second there was an alternative.

[+] IAmGraydon|7 months ago|reply
>This assessment fits with my anecdotal evidence. LLMs just cannot reason in any basic way.

Agreed completely, and the sentiment seems to be spreading at an ever-increasing rate. I wonder how long it will be before the bubble collapses. I was thinking maybe as long as a few years, but it might be far sooner at this rate. All it will take is one of the large AI companies coming out and publicly stating that they're no longer making meaningful gains or some other way that shows the public what's really going on behind the curtain.

I'm certain the AI hype bubble will be studied for generations as the greatest mass delusion in history (so far).

[+] jongjong|7 months ago|reply
I've used LLMs to generate code for a custom serverless framework which I wrote from scratch that it had never seen before. The framework follows some industry conventions but applied in a distinct way with some distinct features which I have not yet encountered in any other framework...

I'm willing to accept that maybe LLMs cannot invent entirely new concepts but I know for a fact that they can synthesize and merge different unfamiliar concepts in complex logical ways to deliver new capabilities. This is valuable on its own.

[+] mirekrusin|7 months ago|reply
Hold on, their evaluation tasks are based on rotating letters in text? Isn't this a known weak area for token-based models?
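For context, the transformations in question are (as I understand it) cyclic letter shifts in the ROT-N family. A minimal sketch of the task, my own code rather than the paper's:

```python
import string

def rot_n(text: str, n: int) -> str:
    """Cyclically shift each lowercase letter n places around the alphabet."""
    a = string.ascii_lowercase
    return text.translate(str.maketrans(a, a[n % 26:] + a[:n % 26]))

print(rot_n("apple", 1))              # "bqqmf"
print(rot_n(rot_n("apple", 13), 13))  # ROT13 is self-inverse: "apple"
```

The tokenization point stands: a BPE model sees "apple" as one or two tokens, so a character-level shift cuts across its input units.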
[+] Terr_|7 months ago|reply
I think that's the point, really: It's a reliable and reproducible weakness, but also one where the model can be trained to elicit impressive-looking "reasoning" about what the problem is and how it "plans" to overcome it.

Then when it fails to apply the "reasoning", that's evidence the artificial expertise we humans perceived or inferred is actually some kind of illusion.

Kind of like a Chinese Room scenario: if the other end appears to talk about algebra perfectly well but just can't do it, that's evidence you might be talking to a language-lookup machine instead of one that can reason.

[+] afro88|7 months ago|reply
I have a real world problem I gave o1 when it came out and it got it quite wrong. It's a scheduling problem with 4 different constraints that vary each day, and success criteria that need to be fulfilled over the whole week.

GPT-5 Thinking (Think Longer) and Opus 4.1 Extended Thinking both get it right.

Maybe this unique problem is somehow a part of synthetic training data? Or maybe it's not and the paper is wrong? Either way, we have models that are much more capable at solving unique problems today.

[+] sachin_rcz|7 months ago|reply
Models today also have access to certain tooling, or have been reinforced to use that tooling in complicated situations. E.g., questions about counting letters in a word are answered by running Python code in the background.
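For instance, the classic letter-counting stumble goes away once the model can emit and execute a snippet like this instead of counting characters across its own BPE tokens:

```python
# What a tool-using model runs in the background instead of "counting"
# characters inside its tokenized view of the word.
word = "strawberry"
print(word.count("r"))  # 3
```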
[+] moi2388|7 months ago|reply
“ the researchers created a carefully controlled LLM environment in an attempt to measure just how well chain-of-thought reasoning works when presented with "out of domain" logical problems that don't match the specific logical patterns found in their training data.”

Why? If it’s out of domain we know it’ll fail.

[+] podgorniy|7 months ago|reply
> Why? If it’s out of domain we know it’ll fail.

To see whether LLMs adhere to logic, or whether the observed "logical" responses are rather reproductions of patterns.

I personally enjoy this idea of isolating "logic" from "pattern" and seeing if "logic" will manifest in LLM "thinking" in a "non-patternized" domain.

--

Also, it's never bad to give the public proof that "thinking" (like "intelligence") in an AI context isn't the same thing we think about intuitively.

--

> If it’s out of domain we know it’ll fail.

Below is a question which is out of domain. Yet LLMs handle it in what appears to be a logical way.

```
Kookers are blight. And shmakers are sin. If peker is blight and sin who is he?
```

It is out of domain and it does not fail (I put it through Gemini 2.5 with thinking). Now back to the article. Is the observed logic intrinsic to LLMs, or is it an elaborate form of pattern? According to the article, it's a pattern.
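The expected answer, that peker is both a kooker and a shmaker, can even be recovered by a trivial abductive lookup. This is a hypothetical mini-implementation using the names from the question above:

```python
# Rules from the prompt: category -> attribute.
rules = {"kooker": "blight", "shmaker": "sin"}

def abduce(observed_attrs):
    # Abduction: which categories would explain the observed attributes?
    return {cat for cat, attr in rules.items() if attr in observed_attrs}

print(sorted(abduce({"blight", "sin"})))  # ['kooker', 'shmaker']
```

Which is exactly the question at issue: is the LLM doing this kind of rule application, or pattern-matching the shape of the syllogism?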

[+] Octoth0rpe|7 months ago|reply
I don't think we know that it'll fail, or at least that is not universally accepted as true. Rather, there are claims that given a large enough model / context window, such capabilities emerge. I think skepticism of that claim is warranted. This research validates that skepticism, at least for certain parameters (model family/size, context size, etc).
[+] m3047|7 months ago|reply
There's a question which was rhetorically asked by Yaser S. Abu-Mostafa: "How do we know if we're learning from data?" and his answer was: "We are learning from data if we can generalize from our training set to our problem set."

To me, it feels a lot like Deming's "what gets measured gets done" (with the quiet part "...oftentimes at the expense of everything else."). Of course, the quiet part is different in this case.

What is this "domain" of which you speak? Because LLMs are supposedly good for flying airplanes, mental health, snakebites, and mushroom poisoning.

[+] willvarfar|7 months ago|reply
It's getting to the nub of whether models can extrapolate instead of interpolate.

If they had _succeeded_, we'd all be taking it as proof that LLMs can reason, right?

[+] Gusarich|7 months ago|reply
The article already seems outdated on the first day. The key points about SFT are irrelevant in the era of RL.
[+] pllbnk|7 months ago|reply
We're rapidly reaching the trough of disillusionment with LLMs, and with other generative transformer models for that matter. I am happy, because it will help a lot of misinformed people understand what is and isn't possible (100+% productivity gains are not).
[+] megaloblasto|7 months ago|reply
100% productivity gains on coding tasks are absolutely within the realm of possibility
[+] ricardorivaldo|7 months ago|reply
don't know, maybe in the technical circles, but for users the thrill is still going on, and rising
[+] pama|7 months ago|reply
The math in the original paper is questionable. By leaving free the choice of divergence in Eq 3, Eq 4 has no practical value except when said divergence is zero exactly.
[+] Martin_Silenus|7 months ago|reply
If only we could train people like that to see their reasoning output...
[+] ponow|7 months ago|reply
> LLMs are [...] sophisticated simulators of reasoning-like text

Most humans are unsophisticated simulators of reasoning-like text.

[+] NoGravitas|7 months ago|reply
Except you, right? You're one of the special few who can actually reason, not like /those/ people.
[+] floppiplopp|7 months ago|reply
'Chain-of-thought AI "degrades significantly" when asked to generalize beyond training.' - yeah thanks Captain Obvious.