
Comparing humans, GPT-4, and GPT-4V on abstraction and reasoning tasks

217 points | mpweiher | 2 years ago | arxiv.org

177 comments

[+] krona|2 years ago|reply
This paper evaluates performance compared to a 'human', which presumably means an average adult human without cognitive impairment. I had to dig into the references:

In the first batch of participants collected via Amazon Mechanical Turk, each received 11 problems (this batch also only had two “minimal Problems,” as opposed to three such problems for everyone else). However, preliminary data examination showed that some participants did not fully follow the study instructions and had to be excluded (see Section 5.2). In response, we made the screening criteria more strict (requiring a Master Worker qualification, 99% of HITs approved with at least 2000 HIT history, as opposed to 95% approval requirement in the first batch). Participants in all but the first batch were paid $10 upon completing the experiment. Participants in the first batch were paid $5. In all batches, the median pay-per-hour exceeded the U.S. minimal wage.

(Arseny Moskvichev et al)

So in conclusion, this isn't a random sample of (adult) humans, and the paper doesn't give standard deviations.

It would've been more interesting if they had sampled an age range of humans on which we could place GPT-4, rather than just 'it's not as good', which is really all this paper can say.

[+] svnt|2 years ago|reply
What is your concern exactly?

This was a first-pass study in a field, addressing some of the criticisms leveled against an earlier study whose spatial reasoning problems were viewed as too hard. They seemingly made the spatial reasoning questions as easy as they could.

The qualifications they put on MTurk are pretty standard if you want responses from humans who care about what they are doing. It costs more to do this.

It is a limitation of science that is both budgetary and procedural.

By calling their results into question you seem to be suggesting that an average human would be able to tell, only 33% of the time, e.g. how many points are inside a box, or whether more points are inside or outside of a box. This is extremely basic spatial reasoning we are talking about.

The problem they were addressing with those settings is noise in the results from cheap bots and click-happy humans trying to earn $0.50. It is endemic on MTurk.

[+] cs702|2 years ago|reply
Also, it's possible there are LLMs pretending to be human beings on Mechanical Turk!
[+] a1j9o94|2 years ago|reply
Another thing I take issue with is that this doesn't seem to use known ways of improving LLM performance, such as chain-of-thought and tree-of-thought prompting.
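
For context, chain-of-thought prompting just means asking the model to lay out intermediate reasoning before committing to an answer. A minimal sketch of what that wrapper could look like (the prompt wording and task text here are placeholders, not what the paper used):

    # Illustrative only: wrapping a ConceptARC-style task with a
    # chain-of-thought instruction. The wording is a placeholder.
    def build_cot_prompt(task_description: str) -> str:
        return (
            task_description
            + "\n\nBefore giving your final answer, reason step by step about "
              "the transformation rule shown in the examples, then state the "
              "answer on a final line starting with 'Answer:'."
        )

    baseline = (
        "Here are some example input/output grids...\n"
        "What is the output grid for the test input?"
    )
    print(build_cot_prompt(baseline))

Tree-of-thought goes further, sampling several such reasoning chains and searching over them.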
[+] petermcneeley|2 years ago|reply
This critique in no way invalidates the conclusions of the paper.
[+] colincooke|2 years ago|reply
My wife studies people for a living (experimental cognitive psychologist), and the quality of MTurk is laughable; if that's our standard for higher-level cognition then the bar is low. You'll see the most basic "attention check" questions ("answer option C if you read the question") routinely failed. Honestly, at this point I think GPT4 would do a better job than most MTurkers at these tasks...

She has found that Prolific is substantially better (you have to pay more for it as well); however, that may only be because it's a higher-cost/newer platform.

[+] karmakaze|2 years ago|reply
This is interesting in a 'human interest news' report way, but it doesn't do anything more to judge current systems than average people thinking older, less capable chatbots were human did.
[+] hackerlight|2 years ago|reply
I also want to see GPT-3 vs GPT-4 comparison on these tasks.
[+] RecycledEle|2 years ago|reply
What every paper I have seen so far is missing is that there are many ways to achieve super-human intelligence. (I need to give credit to Isaac Arthur of SFIA for this.)

Getting results faster is one way. AIs beat me in speed.

Getting results cheaper is another way. AI is cheaper than I am.

Knowledge across many fields is better. AI beats me here too.

Getting better results in one narrow field is another way, but only one of many. I love evaluations of human-produced work vs. machine-produced work. If we had quality evaluations (not the typo-riddled garbage most people use), if we compared AIs to people who work in those fields in occupations recognized by the US Dept of Labor, if we asked both sides to justify their answers, and if we had statistically significant sample sizes, then maybe we could get some good results on quality of work. I can imagine the US DOL spending billions of dollars benchmarking AIs against humans in all the occupations they recognize. Alternately, this could be a very profitable company.

[+] kaoD|2 years ago|reply
Maybe I'm missing what "abstraction" means here but seems like the tasks were centered around grids and other spatial problems, which are a very limited subset of abstraction/reasoning.

In my experience GPT4/V is pretty bad at those specifically, not necessarily around abstraction in general. Positions, rotations, etc. are concepts that GPT4 finds very hard to apply, which is kinda unsurprising since it has no body, no world, no space; it "lives" in the realm of text. DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.

[+] bloaf|2 years ago|reply
It's also worth remembering that blind humans who can recognize squares by feel do not have the ability to recognize squares by sight upon gaining vision.

I suspect the model is bad at these kinds of "reasoning" tasks in the same way that a newly-sighted person is bad at recognizing squares by sight.

[+] joe_the_user|2 years ago|reply
> In my experience GPT4/V is pretty bad at those specifically, not necessarily around abstraction in general.

The problem with a statement like this is that it leaves the door open to accepting any kind of canned generality as "abstraction in general". Abstract reasoning is indeed a fuzzy/slippery concept, and spatial reasoning may not capture it well, but I'm pretty sure it captures it better than a general impression of ChatGPT does.

> ...since it has no body, no world, no space; it "lives" in the realm of text.

There's a bizarre anthropomorphism on this thread, both in the reflexive comparison of this software system to a blind human and in the implicit call to be considerate of this thing's supposed disability.

[+] lazy_moderator1|2 years ago|reply
> which is kinda unsurprising since it has no body, no world, no space; it "lives" in the realm of text

or rather the training set was lacking in this regard

[+] Sharlin|2 years ago|reply
> DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.

This has nothing to do with having "no body, no world" and everything to do with the fact that training pictures where things are upside down are simply vastly rarer than pictures where they aren't.

[+] pixl97|2 years ago|reply
What would directions be for an intelligent creature that lives in zero gravity? I just like thinking about this for the same reasons humans like writing speculative science fiction. Trying to guess what alien perspectives look like might also give us insights when we're the ones making the alien.
[+] mr_toad|2 years ago|reply
> DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.

There’s probably not many (if any) upside down images or objects in the training data.

[+] cs702|2 years ago|reply
Interesting. Immediate thoughts and questions:

* How would human beings perform on the text-only version of the tasks given to GPT-4?

* How would human beings perform if each grid is shown on its own, making it impossible to perform side-by-side visual comparisons?

* How would human beings perform if each grid is shown on its own only once, making it impossible to perform any back-and-forth comparisons?

* How could we give LLMs the ability to "pay attention" to different parts of images, as needed, so they can make back-and-forth comparisons between parts of different images to solve these kinds of visual reasoning tasks?

[+] QuadmasterXLII|2 years ago|reply
> How could we give LLMs the ability to "pay attention" to different parts of images, as needed, so they can make back-and-forth comparisons between parts of different images to solve these kinds of visual reasoning tasks?

I’ve got good news

[+] YetAnotherNick|2 years ago|reply
Also, I want to know how much gain could be made by optimizing the prompt for GPT and by including things like CoT. The current version of the prompt is pretty bad, both for humans and AI.
[+] mistermann|2 years ago|reply
* How would human beings perform if they didn't know they were being tested (i.e. in the same mode they are in when writing comments on the internet)?

* How would human beings perform if the questions are based on culture war topics, which tend to invoke System 1 intuitive/emotional thinking?

[+] theptip|2 years ago|reply
If you look at the appendix, you can see example transcripts. The sample they provide looks like a very bad eval.

It’s encoding an originally visual problem into a textual matrix form, and then expecting GPT to recognize visual correlations. You simply can’t compare these two tasks! Most humans wouldn’t recognize the 5x5 matrix for a 4x4 square.

So the comparison with “human level” is completely invalid. And even the valid comparison is only measuring visuo-spatial intelligence, not IQ.
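
To make the mismatch concrete, here is roughly the kind of text encoding involved (my own sketch; the paper's exact serialization may differ):

    # Sketch: serializing an ARC-style grid into text, the way a text-only
    # model receives it. The exact format used in the paper may differ.
    grid = [
        [1, 1, 1, 1, 0],
        [1, 0, 0, 1, 0],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]

    def grid_to_text(g):
        return "\n".join(" ".join(str(cell) for cell in row) for row in g)

    print(grid_to_text(grid))

Rendered as an image this is obviously a square; as a flat stream of tokens, neither most humans nor the model find that structure easy to recover.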

[+] skepticATX|2 years ago|reply
It has been interesting to see evidence accumulating that shows, despite initial excitement bred by papers such as "Sparks", there is something missing from current language models.

Individually none of these results will ever get the attention of a "Sparks" type paper, but collectively a strong case has been built.

[+] naasking|2 years ago|reply
Sparks of AGI is not AGI. It's also possible that we're not testing LLMs fairly, or that merely slight tweaks to the architecture or methods would address the issues. I think this comment elaborates nicely:

https://news.ycombinator.com/item?id=38332420

I do think there might be something missing, but I also suspect that it's not as far off as most think.

[+] dr_dshiv|2 years ago|reply
I’m really looking forward to students majoring in “machine psychology.”
[+] ryzvonusef|2 years ago|reply
https://en.wikipedia.org/wiki/Susan_Calvin

    > Graduating with a bachelor's degree from Columbia University in 2003, she began post-graduate work in cybernetics, learning to construct positronic brains such that responses to given stimuli could be accurately predicted. She joined US Robots in 2008 as their first Robopsychologist, having earned her PhD. By 2029, when she left Earth for the first time to visit Hyper Base, her formal title was Head Psychologist.
https://en.wikipedia.org/wiki/Robopsychology
[+] tovej|2 years ago|reply
Conclusion is obvious, but the paper is still probably necessary.

Of course LLM's can't reason. They pattern match answers to previously asked questions, and humans will read the text as a reasonable answer because we assign meaning to it, but there is simply no way an LLM could use a "mental model" to "reason" about a problem other than constructing sentences out of probable matches it's been trained on.

[+] naasking|2 years ago|reply
> Of course LLM's can't reason

That they are not effective at some forms of reasoning does not entail they can't reason.

[+] mcswell|2 years ago|reply
The conclusion may be obvious to you and me (although it's hard to know for certain, since these available LLMs are black boxes). But it's definitely not obvious to everyone. There are plenty of people saying this is the dawn of AGI, or that we're a few short steps from AGI, whereas people like Gary Marcus (who knows tons more than I do) say LLMs are going off in the wrong direction.
[+] koe123|2 years ago|reply
On the other hand, can we conclusively say that humans aren’t really advanced biological stochastic parrots?
[+] mr_toad|2 years ago|reply
> Of course LLM's can't reason. They pattern match

You’re assuming that reasoning isn’t just pattern matching.

[+] sfn42|2 years ago|reply
Love how the AI hypebros always downvote answers like this. Everyone's just insisting that LLMs are AGI because they kinda seem that way.
[+] stillsut|2 years ago|reply
I've actually been working on a similar eval task that uses grids of symbols to evaluate LLMs' reasoning ability around the game Wordle. My initial results are:

- Success: ChatGPT-4-November, Phind-November

- Failures (via hallucination): GPT-3.5, Mistral-7b, Llama2-13b

Interestingly, ChatGPT-4 wrote a short python script (unprompted) to develop a pseudo-CoT type generation and answered correctly. Phind, which also answered correctly, is also a code-based/code-enabled LLM.

Non-code-based LLMs appear to reason cogently when describing their games - what moves are illegal, what the best guess in certain conditions is, etc. - but their actual answer, even in multiple choice, is comically wrong.

These are very initial results, but my tentative hypothesis is that symbolic logical reasoning - string manipulation / function-calling type tasks - is greatly enhanced by having a code modality. Knowing when to switch between natural language and code generation may offer a strong boost on tasks like these.
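
Roughly, the kind of symbol grid I mean can be generated like this (a simplified sketch of the scoring, not the full eval harness):

    from collections import Counter

    # Simplified sketch: score a Wordle guess against the answer and render
    # the feedback as a row of symbols (G = green, Y = yellow, . = gray).
    def score_guess(guess: str, answer: str) -> str:
        marks = ["."] * len(guess)
        remaining = Counter()
        # First pass: exact matches.
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g == a:
                marks[i] = "G"
            else:
                remaining[a] += 1
        # Second pass: right letter, wrong position.
        for i, g in enumerate(guess):
            if marks[i] == "." and remaining[g] > 0:
                marks[i] = "Y"
                remaining[g] -= 1
        return " ".join(marks)

    print(score_guess("crane", "cable"))  # G . Y . G

The model gets a few rows like that as text and has to reason about the game state from them.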

[+] datadrivenangel|2 years ago|reply
"Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels."

Some combination of LLMs and logical reasoning systems will get us much closer, but that becomes a lot more complicated.

[+] xbmcuser|2 years ago|reply
It has been really interesting, over the last few years of machine learning, to read that a model can't do this or that, and then the next week or month read that it can do it, or something else. ChatGPT and the models that have come after it seem to have accelerated this back and forth a lot. Unless you keep up with it closely and keep updating your information, what you knew it could or could not do well is probably no longer correct.
[+] intended|2 years ago|reply
This new Gen of AI adds an interesting twist to the infinite monkeys and typewriter issue.

How do you actually check an infinite amount of junk to verify that one piece of it is the collected works of Shakespeare?

The question I ask now is “what’s your error rate for domain-specific work?”

It could be faster and smarter, but it doesn’t matter if it’s wrong.

[+] baxtr|2 years ago|reply
Could a human and an LLM each submit a summary of the paper so we can compare?
[+] j2kun|2 years ago|reply
That's what an abstract is for:

> Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.

[+] z7|2 years ago|reply
>The paper investigates the abstract reasoning abilities of text-only and multimodal versions of GPT-4 using the ConceptARC benchmark, concluding that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
[+] firtoz|2 years ago|reply
The best thing about research like this is that it allows new models to be built, or improvements to be made to existing ones, that can lead them to pass these evaluations.
[+] kenjackson|2 years ago|reply
Can someone provide the prompts in text rather than the images from the paper? That would make it easier to try and replicate results.
[+] wouldbecouldbe|2 years ago|reply
Never heard of Mechanical Turk, hahah. For a Dutch person it sounds pretty racist - Turk is what we call Turkish people.
[+] air7|2 years ago|reply
Thank god...