
skinner_ | 1 year ago

You dismiss parent's example test because it's in the training data. I assume you also dismiss the Sally-Ann test, for the same reason. Could you please suggest a brand new test not in the training data?

FWIW, I tried to confuse 4o using the now-standard trick of changing the test to make it pattern-match and overthink it. It wasn't confused at all:

https://chatgpt.com/share/67b4c522-57d4-8003-93df-07fb49061e...


zipy124 | 1 year ago

No, I can't suggest a new test; it's a hard problem, and identifying problems is usually easier than solving them.

I'm just trying to say that strong claims require strong evidence, and the claim that LLMs can have theory of mind, and thus "understand that other people have different beliefs, desires, and intentions than you do", is a very strong claim.

It's like giving students the math problem 1+1=2 along with plenty of worked examples, then testing them with "you have one apple, and I give you another apple; how many do you have?", and concluding from a correct answer that they can do all additive arithmetic.

This is why most benchmark suites include many classes of examples. Looking at current theory-of-mind benchmarks [1], we can see that even more recent models such as o1-preview still score substantially below human performance. More importantly, simply changing the perspective from first to third person drops LLM accuracy by 5-15 percentage points (absolute, not relative), while it doesn't change for human participants, which tells you that something different is going on there.

[1]: https://arxiv.org/html/2410.06195v1