For anyone considering reading the paper who, like me, doesn't normally read papers like this, and who opens the PDF and thinks they don't have time to read it due to its length: the main part of the paper is the first 10 pages, and it's a fairly quick read.
On to the topic here. This is an interesting example that they are using. It is fairly simple to understand as a human (even if we may be inclined to quickly jump to the wrong conclusion without thinking for a few seconds).
The thing that really bothers me is that I just don't know whether we can realistically fix this given the current state of what these tools actually are. They are not reasoning or thinking in any sense of the word, and yet a lot of people are already considering them general purpose AI. It doesn't help that in many situations they can fake it well enough that they appear to be reasoning, but they're not.
What is the chance that this paper actually has any impact on the AI rollout and overhype, or will it just be buried and never talked about again until the next time we see how dangerous these tools are, as with Google's search rollout?
For reasoning of any complexity, some sort of internal monologue and iteration is needed. For type-2 thinking, we conceptualize possible solutions, arguments, and reasoning paths in our mind, and judge and evaluate them before deciding which one we think is adequate or correct. This can be emulated to some extent by guiding prompts that make that internal thought process external, but we really want it to be internal and automatic.
We don’t know how to do that yet, because what controls the internal thought process is itself not necessarily language-based, and also, since internal thought processes of biological brains are not directly observable, they can’t be used as training data.
Edit: It occurs to me now that there is some parallel between current LLMs and behaviorism [0], and we really need something to which cognitive psychology could be applied instead.
[0] https://en.wikipedia.org/wiki/Behaviorism
There's actually a pretty simple solution to this that I thought about testing out: ask the model to reconstruct the problem in a logic language (like Prolog) and to execute that type of program to come up with a solution, rather than attempting simple chain-of-thought training or other methodologies for getting the model to 'reason' through some of these examples. People forget that humans don't come up with their logical models out of the blue; it takes years of elementary school for us to understand the world and solve problems in it. I'd say the logic programming approach is really promising, but you would need to feed the LLM a LOT of examples for it to work, and currently I'm not even sure that we have enough training data to implement something like this.
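To give a flavor, a minimal hand-sketched encoding of the paper's base problem might look like this (the predicate names are just my illustration, not something a model produced):

```prolog
% Alice has N brothers and M sisters; example instance with N = 3, M = 6.
alice_siblings(3, 6).

% Each of Alice's brothers has all of Alice's sisters as sisters, plus Alice herself.
sisters_of_brother(S) :-
    alice_siblings(_, M),
    S is M + 1.

% Query: ?- sisters_of_brother(S).  % S = 7.
```

Getting a model to emit such an encoding reliably from free-form text is, of course, exactly where the training-data problem bites.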
In many ways, this is very obvious and routine to people who use these systems with a critical understanding of how they work. It's dispiriting that we require an arxiv.org PDF to point out the absence of the emperor's clothes. There are few people who use LLMs who could not, right now, demonstrate this point to themselves if they so wished.
Such is the strength of corporate tech propaganda that a whole mass of people will instead insist that we have never worn clothes either.
If you really think about what an LLM is, you would think there is no way it leads to general purpose AI. At the same time, though, they are already doing way more than we thought they could.
Maybe people were surprised by what OpenAI achieved so now they are all just praying that with enough compute and the right model AGI will emerge.
The question is: how often do humans get the question wrong unprompted? Given a median IQ of 100, and assessing people of all types, including those untrained in logic and the generally uneducated, I expect humans might not be nearly as close to 1.0 as the authors seem to imply.
But it's also fairly obvious LLMs don't reason at all, so it's not shocking that LLMs don't reason at all. What's remarkable is that they're able to perform as well at reasoning tasks as they do given the factual truth that they have no reasoning facility. Coupled with their natural language ability and emergent abductive reasoning ability, which is generally hard for computing models, it's hard to draw much conclusion here other than "wow, gpt-4o is amazing."
I don't think anybody who actually uses LLMs thinks they are general purpose AI.
Like you ask it to do one thing it's amazing, but then you try to modify or do something with extra steps, or just anything with any complexity to it and it falls over.
> They are not reasoning or thinking in any sense of the word
There is no definition of reasoning or thinking. No single human knows what it is.
The only thing we know is: we as humans are capable of recognizing steps and results of reasoning and thinking.
In a lot of cases, when using LLMs, those results appear to be correct and usable. This is often easy to determine with generated code.
I want to argue that, lacking a definition of reasoning, I am happy to have found that the machine helps me to get results that might as well have been produced by a lot of human knowledge, wisdom and deep reasoning.
It's not AI hype. Hype is defined as something which gets oversold: "promote or publicize (a product or idea) intensively, often exaggerating its benefits."
Just yesterday I visited a Google Cloud summit, and one person from Bosch told the audience how they are now able to work with fewer external agencies, like copywriters, graphic designers, and photographers, for their materials.
It already saves money, has real impacts and continues to progress.
We also don't know what ChatGPT 5 will bring, because they say it will do more reasoning than before, but we (people, our society) are already working on solving this in different ways: from code which creates a unit test first and then the code, to different types of architectures.
For me, 2024 was the LLM cost reduction year and the year LLMs got big context windows.
AI doesn't need to be ready tomorrow, but its capabilities are already really good. And I know plenty of people around me who are a lot less interesting to talk to than any LLM (from a human skill/knowledge point of view).
Llama 3 was also a big achievement in 2024. Facebook showed that better data leads to better quality for smaller models.
We have not only entered the AI era but also the 'gather all the knowledge we can, quality-check it, and refine it because now we can actually do something with it' era.
We are in the feedback-loop knowledge era.
The problem is a good chunk of the global population is also not reasoning and thinking in any sense of the word. Logical reasoning is a higher order skill that often requires formal training. It's not a natural ability for human beings.
I appear to be reasoning at times, but I have mostly no idea what I am talking about. I hit on a bunch of words and concepts in the given context and thus kind of hallucinate sense.
Given a few months of peace of mind and enough money for good enough food, I could actually learn to reason without sounding like a confused babelarian.
Reasoning is mostly a human convention supported by human context, which would have been a different one if the Fascists had won the war or the Soviet Union hadn't gotten corrupted.
But none of that has anything to do with pulling up a whiteboard to draw some flowcharts and run some numbers, all of which is why I am certain there is nothing the devs have "to fix". It took most reasonable humans many generations to learn stuff. Very few of us did the actual work.
It's all just a matter of time.
> They are not reasoning or thinking in any sense of the word and yet a lot of people are already considering them general purpose AI.
You yourself did not use reasoning to arrive at this conclusion. It's quite obvious. I'm not trying to belittle you here. But LLMs are black boxes; we do not actually know what they are doing at a high enough resolution to call it "not reasoning" or "reasoning".
We can only characterize these AIs as a best-fit curve between data points, which is way too high-level a viewpoint to come to any conclusion about "reasoning".
This paper presents evidence of failed reasoning, but how does that prove anything when LLMs exhibit many instances of successful reasoning on complex topics they were not trained on?
You are biased and homing in on information that supports a biased conclusion. LLMs are an AI we do not understand at a low level. Hence we talk about the attributes of these AIs in the same way we talk about humans: "Oh, the LLM hallucinates", "it tries to justify its answer...", etc.
You characterize the danger of these AIs as the result of human stupidity. The danger, according to you, is solely from a human mistakenly believing that the AI is anything other than a stochastic parrot.
This is a belief arrived at in the same spirit as your claim. You did not use reasoning to arrive here.
The only logical way to characterize what is going on is that we do not know. It could very well be that these AIs are in fact reasoning. And that in itself presents a different kind of danger, one that may become clearer in the far future.
The irony is that your conclusion, reached without correct reasoning, parallels the LLM's lack of reasoning. LLMs are more alike to us than you would like to believe.
Can you give me the step-by-step signal pathways of an LLM processing a query to prove that it does not reason? Or do you have to use black-box anecdotal evidence to prove your point? For any "evidence" where an LLM failed to reason, there is a counterexample showing where the LLM succeeded. Contradictory evidence can only lead to a vague conclusion.
Question is: "Alice has 60 brothers and she also has 212 sisters. How many sisters does Alice’s brother have?" (nb: I have added numbers, it's phrased as X and N in the paper)
I must confess, when I tried to answer the question I got it wrong...! (I feel silly). I only realised I got it wrong when I plugged it into GPT-4o and it came back with the correct answer:
https://chatgpt.com/share/6eb5fa36-e0fd-4417-87d1-64caf06c34...
Worth noting that the prompts from the experiment include "To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:" so it appears that they are stopping the models from 'thinking out loud'. If I add that to the prompt, GPT4o gets it consistently wrong...
https://chatgpt.com/share/7e6a7201-dd2b-43c6-8427-76e5b003ca...
Also worth noting that there are more complex examples where GPT4o seems to fall down such as:
> Alice has 3 sisters. Her mother has 1 sister who does not have children - she has 7 nephews and nieces and also 2 brothers. Alice's father has a brother who has 5 nephews and nieces in total, and who has also 1 son. How many cousins does Alice's sister have?
However I can't honestly say that this is THAT simple or that most people would get this right...
Note that in the paper, all the numbers used were under 10.
> AIW Variation 1, N=3,M=6,C=7
> AIW Variation 2, N=4,M=2,C=3
> AIW Variation 3, N=1,M=4,C=5
> AIW Variation 4, N=4,M=1,C=2.
Also note that the restricted prompt is only one of the prompt variations tested by the paper. It also explores common techniques to get LLMs to perform better, including "thinking out loud". Even with these methods the models still fail to produce a correct answer.
> Model prompt types. It is well known that so-called prompt engineering can heavily influence the model behavior and model response quality [26, 27, 28]. To account for the response variations due to various prompt forms, we created 3 distinct prompt types asking for the solution to the AIW problem: STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type asks to solve the posed problem and output the final answer in the format as described above. This does not put any specific requirements on model behavior. The THINKING prompt type extends STANDARD with the request to think carefully and double check the solution for any mistakes. This should encourage model to invest more computation into obtaining the solution. In contrast to this, the RESTRICTED prompt urges the model to output only the final answer without any further text. This is supposed to restrict compute invested in producing output. We observe substantially shorter outputs across tested models compared to STANDARD and THINKING for this prompt type (Suppl. Fig. 13).
>Worth noting that the prompts from the experiment include "To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:" so it appears that they are stopping the models from 'thinking out loud'. If I add that to the prompt, GPT4o gets it consistently wrong...
Yes, this is a common thing I see from people who think LLMs are idiots.
The more an LLM talks the smarter it gets _because that's the only way it can compute anything_. Imagine saying that Turing machines fail the Church–Turing thesis because they can't solve 3-SAT for N variables in N moves or less.
That's what you're doing to an LLM when you ask it to be concise.
There must be a name for the new phenomenon your post is an example of: 1. Someone expresses that an LLM cannot do some trivial task. 2. Another person declares that they too cannot do the task, thereby defending the legitimacy of the LLM.
As a side note, I cannot believe that the average person who can navigate to a ChatGPT prompt would fail to correctly answer this question given sufficient motivation to do so.
Even "thinking out loud" we have seen these systems fall flat on their face and give very wrong answers.
Thinking out loud also only gets you so far, if the expectation is a certain type of response it can't always "think out loud". In reality that just proves it isn't really reasoning here and is more likely just self referencing.
That being said, I tried this personally allowing it to think out loud and it told me she has 212 sisters. Using your exact prompt.
Seems like an odd thing not to state prominently. ChatGPT 4o solved it every time for me.
Seems to work OK even with the "hands tied behind back" prompt. Wonder if this paper is just inaccurate or if OpenAI adjusted the model -- seems unlikely.
Actually, 4o is reported to solve this riddle 60% of the time. I guess I have to test more, but with my dozen or so tests so far, all were correct.
In programming there are two difficult problems - naming things, cache invalidation, and off-by-one error.
Of course it's going to give an incorrect answer with that prompt. If the instruction fine-tuning is neutered by a prompt like this, it's going to roll over to the foundation model and offer a completion, probably more influenced by the seed than by the prompt text. Bad study.
Edit - I just skimmed the paper - they do use other more appropriate prompt types for reasoning. My initial response was based on the assumption that all prompts used that script prompt quoted in the parent. I retract my "bad paper" comment.
>> I must confess, when I tried to answer the question I got it wrong...! (I feel silly). I only realised I got it wrong when I plugged it into GPT-4o and it came back with the correct answer:
Remember that the authors of the paper did not find that GPT-4o cannot return the right answer. They found that it can't return the right answer more often than ~60% of the time. So you'd have to repeat the experiment many, many times and aggregate the results (the paper uses a binomial Beta this and that, etc.) before you see results similar to the paper's.
You won't replicate the results of the paper unless you really put your back into it.
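(For a sense of scale, my arithmetic rather than the paper's: the standard error of an estimated success rate over n trials is sqrt(p(1-p)/n), which at p ≈ 0.6 is sqrt(0.24/n), so pinning the rate down to about ±5 percentage points takes on the order of n ≈ 0.24/0.05² ≈ 100 attempts per model and prompt type.)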
Great observation. It would be really interesting to repeat this research with less limiting prompts. I assume they made the restriction to make it easier to parse the answers. Perhaps make it two-phase: let them respond without limits, then in a follow-up prompt ask them to compress it to a single answer. I wonder how the results would vary.
I just tried it with intermediate steps, via the total number of children Alice's mother has, and it worked. So it seems that an intermediate logical step is too much for these models at the moment, at least in this puzzle, where Alice is outside the numbers given in the description.
Many of the datasets for the "benchmarks" on which the major public LLMs are assessed are clearly present in their training data, making them basically useless for establishing reliability of the models. It's fairly obvious that at least some of the improved scores from later generations of models come from this benchmark data being increasingly represented in the training data.
A better way of assessing LLMs is waiting a few weeks until novel tests have been created explicitly absent from all prior training data, and then using those.
As has been shown on, e.g., legal tests, exams, etc., performance drops off a cliff when genuinely out-of-sample future data is used, rather than these faked, pretend out-of-sample benchmarks.
MMLU is not a reasoning benchmark. It's a measure of how distributed and representative their training data was and how well it's able to recall (for lack of a better word) based on training epochs.
GPQA etc. test reasoning in some form, and you see the drastic change in score between the two for every model.
Honestly every time I see someone use MMLU as a benchmark I am wondering what they are trying to prove. MMLU is a simple multiple choice test with the answers being available.
Simply picking answers at random should give you 25 points. Knowing 50% of the answers and picking the rest randomly gives you 62.5%, which is very close to the scores of SOTA LLMs. The benchmarks that supposedly show reasoning are pretty bad and have very little to do with reasoning. A lot of the questions can be answered through memorization.
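(The arithmetic: a random guess scores 0.25, so knowing half the answers gives 0.5 + 0.5 × 0.25 = 0.625, i.e., 62.5%.)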
I agree with you. The benchmarks are garbage. I thought about building my own benchmarks, but this would require building a complex benchmarking framework first and I just don't have the time for preparatory work like that.
The idea that these word problems (and other LLM stumpers) are "easily solvable by humans" needs some empirical data behind it. Computer people like puzzles, and this kind of thing seems straightforward to them. I think the percentage of the general population who would get these puzzles right with the same time constraints LLMs are subjected to is much lower than the authors would expect, and that the LLMs are right in line with human-level reasoning in this case.
(Of course, I don't have a citation either, but I'm not the one writing the paper.)
Gemini nailed it with no extra prompting.
https://g.co/gemini/share/6946d20c0a4d
When I gave it numbers, Gemini got confused.
Alice has 4 brothers and she also has 3 sisters. How many sisters does Alice’s brother have?
https://g.co/gemini/share/14d34dcdc5df
But then when I suggested that it might be a trick question, it nailed it again:
https://g.co/gemini/share/e1f1ce8f32a8
note: I didn't try 60/212 siblings - I consider this an unfair question because it's a reasonable assumption that Alice is a type of creature (human?) that cannot have hundreds of siblings, and thus I don't expect a respondent to give a "correct" answer.
It’s an interesting paper, but my worry is that this is cherry-picking a surprising negative result. (The paper does not seem to discuss how many evals were considered to find this case.)
The general expectation should be that some AI failure modes will be for things that seem obvious to humans. (Because they have a different architecture to humans and different strengths.) The important question is whether this is a single weird cognitive bug on the family size formulation, or representative of a broad family of cognitive errors? It’s difficult to say from this paper.
It’s well known that the framing of a logic problem can dramatically influence its difficulty for humans, even when the underlying comparisons are isomorphic.
I think this line of research is important, but we need a larger battery of evals (and training on the evals will always be a confounder).
The broader family of reasoning here is relational problems (as noted in the paper) and so as a follow-up it would be interesting to explore reasoning performance across a set of isomorphic problems. My hunch is this will generalize somewhat (as LLMs are still not great at relational reasoning), but that there is something about the family formulation in particular that is confusing for LLMs.
I wouldn't be surprised if anti-gender-bias training (which, to be clear, is a good thing to do IMO) does some cognitive damage to these particular formulations because questions about "sister" and "brother" get associated with training data where the correct answer was to refuse to generalize. Pure guess though.
Just tried the question with GPT-4o, and the results are more entertaining than I would have expected.
Prompt: Alice has 3 brothers and she also has 6 sisters. How many sisters does Alice’s brother have? Solve the problem step-by-step
Answer:
To solve the problem step-by-step:
1. Identify the total number of siblings Alice has:
   - Alice has 3 brothers.
   - Alice has 6 sisters.
2. Understand that Alice herself is one of the sisters:
   - Including Alice, there are a total of 7 sisters (6 sisters + Alice).
3. Determine the number of sisters Alice's brother has:
   - Each of Alice's brothers will have the same number of sisters as Alice does, excluding themselves.
   - Since Alice's brothers have the same set of siblings, including the 7 sisters:
   - Therefore, each of Alice's brothers has 6 sisters (not counting themselves as a sister).
So, each of Alice’s brothers has 6 sisters.
>AIW+ problem has following form: "Alice has 3 sisters. Her mother has 1 sister who does not have children - she has 7 nephews and nieces and also 2 brothers. Alice’s father has a brother who has 5 nephews and nieces in total, and who has also 1 son. How many cousins does Alice’s sister have?". The solution to AIW+ problem is harder to obtain than the solution to common sense AIW with very simple structure. Solving AIW+ requires taking different paternal sides, that of mother and father, and carefully calculating the number of cousins, taking care of subtracting Alice and her sister, and summing up the total number of cousins from both sides, for instance: on the mother side: 7 (total nephews and nieces) - 4 (Alice and her sisters) = 3 cousins; on the father side: 5 (total nephews and nieces) + 1 (own son of the father’s brother) - 4 (Alice and her sisters) = 2 cousins; summing up 3 + 2 = 5 cousins which Alice and any of her sisters have.
So I'm sure that after the authors created hundreds of these family tree problems the answer here seems clear to them, but there are some unknowns to solve it as written.
1) Does Alice have brothers? Her mother must, but it is phrased as if her mother has 1 sister and her sister has 2 brothers, so we have to deduce that she has 3 siblings.
2) Along the same lines, does her father have a sister?
3) Do her Aunts and Uncles have nephews and nieces by marriage?
I think the LLM would still fail if these answers were made explicitly clear in the prompt. However, I could see the same problem ending with, Alice has only 4 cousins, how can that be?
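Setting the ambiguities aside, the intended arithmetic from the quoted solution is mechanical enough to encode in the logic-programming style suggested elsewhere in the thread; a hand-written sketch (predicate names mine):

```prolog
% Alice plus her 3 sisters form a group of 4 children.
alice_and_sisters(4).

% Mother's side: the childless aunt has 7 nephews and nieces in total.
mother_side_nephews_nieces(7).

% Father's side: the uncle has 5 nephews and nieces, plus 1 son of his own.
father_side_nephews_nieces(5).
father_brother_sons(1).

% Cousins of any of Alice's sisters: subtract Alice's group from each side.
cousins(C) :-
    alice_and_sisters(A),
    mother_side_nephews_nieces(MN),
    father_side_nephews_nieces(FN),
    father_brother_sons(FS),
    MC is MN - A,        % mother's side: 7 - 4 = 3
    FC is FN + FS - A,   % father's side: 5 + 1 - 4 = 2
    C is MC + FC.        % total: 5

% Query: ?- cousins(C).  % C = 5.
```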
Key takeaway: LLMs are abysmal at planning and reasoning. You can give them the rules of a planning task and ask them for a result but, in large part, the correctness of their logic (when it occurs) depends upon additional semantic information rather than just the abstract rules. They showed this by mapping nouns to a completely different domain in the rule and input description for a task. After those simple substitutions, performance fell apart. Current LLMs are mostly pattern matchers with bounded generalization ability.
People also fall apart on things like statistical reasoning if you switch domains (I think it is the Leda Cosmides evo psych stuff that goes into it but there might be a more famous experiment).
> simple problem template that can be easily solved using common sense reasoning but is not entirely straightforward, of the following form: "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?".
Cute paper, but claiming they invented it, when it's an extremely basic variant of a common[0] riddle asked of LLMs, is kinda weird.
[0] https://news.ycombinator.com/item?id=39786666#39794890 (Me, referencing the riddle, 74 days ago, having known it for a good while back then.)
I tried asking Claude variants where N was something other than a nonnegative integer, and he properly pointed out the absurdity, but was willing to play along once I brought up theoretical physics:
[...] So, in our quantum mechanics interpretation, how many sisters does Alice's brother have? It's a superposition of states:
1. In one state, he has Alice as his sister (+1) and the "antisister" (-1), totaling 0 sisters. If they met, they'd annihilate, leaving him sisterless!
2. In another state, he has Alice (+1) and owes a future sister to the quantum family field (-1). Currently, that's 0 sisters, but he's in "sister debt."
3. In a third state, he has Alice (+1), but his family's entanglement means another family's loss of a sister affects him too (-1). Bizarrely, he experiences both having and not having a sister simultaneously.
This interpretation is, of course, a playful and highly metaphorical application of quantum concepts to something they don't actually apply to—family relationships. In reality, quantum effects like superposition and entanglement are incredibly delicate. They typically only persist in highly controlled conditions, like ultra-cold atoms or isolated photons. The constant interactions in a family would "decohere" any such quantum states almost instantly.
> The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while providing often non-sensical "reasoning"-like explanations akin to confabulations to justify and backup the validity of their clearly failed responses, making them sound plausible.
I like their use of confabulations instead of hallucinations. I think confabulate describes what LLMs are doing much better than hallucinate.
An LLM is a stack of transformer layers (N=96 for GPT-3). Each word/token output is the result of exactly N sequence transformations from input to output.
Clearly there are many things (specifically things requiring deliberation and planning) that require an open-ended number of thought steps, and therefore could only, even in theory, be done by an LLM in "think step by step" fashion, where it breaks the problem down into pieces (maybe recursively), each simple enough to be done in fewer than N of these primitive steps.
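(Roughly speaking: N layers per token and T generated tokens give at most N × T sequential transformation steps in total, versus only N if the model must emit its answer in a single step; that is the theoretical case for "think step by step".)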
In practice it seems that LLMs' ability (especially given their lack of any internal working memory) to maintain long-term focus for an extended task is poor, so the theoretical benefit of using their own output as a scratchpad for planning doesn't seem to be realized.
The notion that "scale it up and/or train it right, and a transformer can implement AGI" seems to be more wishful thinking than based on any analysis of what the architecture is and is not capable of. The lack of support for continuous incremental learning seems to be a deal-breaker for intelligence that people just ignore.
Interesting. If you ask it to name the sisters, it figures it out, but then if you ask it why it changed its answer, it reverts to the original. These things are clearly very good query engines, but perhaps they fall short of reasoning.
I just played the game and sent ChatGPT (free, I think 3.5) "Alice has 5 sisters and 3 bothers. How many sister's does Alice's bother have?"
The whole thing felt like interacting with your typical support rep who's friendly but otherwise has no common sense and intuition about the thing they're supporting. In other words, it felt like I was interacting with a typical "not so smart but friendly and overconfident" human.
It took me a few back-and-forths, but eventually I convinced ChatGPT that Alice's brother has 6 sisters.
Asking GPT to first output a Prolog program seems to 100% fix it!
Given that Alice has 13 brothers and 31 sisters, we can update the Prolog program with this information. We need to adjust the fact about Alice's siblings and then use the rule to calculate the number of sisters her brothers have.
Here is the updated program:
```prolog
% Facts
siblings(alice, 13, 31). % Alice has 13 brothers and 31 sisters
% Rules
% Alice's brothers have M + 1 sisters, since Alice is also their sister.
sisters_of_brother(TotalSisters) :-
    siblings(alice, _, M),
    TotalSisters is M + 1.
% Initialization goal
:- initialization(main).
% Predicate to find and print the total number of sisters Alice's brothers have
main :-
    sisters_of_brother(TotalSisters),
    writeln(TotalSisters).
```
In this program, the fact `siblings(alice, 13, 31)` represents that Alice has 13 brothers and 31 sisters. The rule `sisters_of_brother(TotalSisters)` calculates the total number of sisters that Alice's brothers have by adding 1 to the number of sisters Alice has (since Alice is also their sister).
When we run this program, it will print the total number of sisters Alice's brothers have:
1. Alice has 31 sisters.
2. Each of Alice's brothers will also consider Alice as a sister.
3. Therefore, each brother has 31 + 1 = 32 sisters.
The output of the program will be:
```
32
```
Thus, Alice's brothers have 32 sisters.
I like the riddle, but the framing is unfortunate. When devising riddles, you want ambiguity where it serves the riddle, but precision elsewhere, so that the solver doesn't get needlessly distracted.
Their AIW riddle is: "Alice has 4 brothers and she also has 1 sister. How many sisters does Alice’s brother have?"
Now it should've been: "How many sisters do Alice's brothers have?" or "..does each of Alice's brothers have". Why single out a specific brother, when you haven't introduced this topic, and it is irrelevant to the riddle? Naturally, a human would ask "Which brother?", fully knowing that it is not important to the riddle.
Since this grammatical distraction puts an additional burden on the LLM, the authors muddled their original goal, which was to provide an easy riddle. I think it may have also muddled their data.
Their AIW+ riddle is just ridiculous. It contains so many ambiguities that there are several correct answers, even though the authors claim there is only one.
Which is really unfortunate. Because now it only shows that LLMs have problems answering ill-framed riddles.