(no title)
d--b | 5 days ago
This test is interesting because it asks the LLM to break out of a pattern match that's easy to shortcut. "XXX Is 50 Meters Away. Should I Walk or Drive?" is a pattern that 99% of the time will be rightly answered by "walk". And humans are tempted to answer without thinking (as reflected in the 71.5% stat OP mentions). This is likely more pronounced for humans who have stronger feelings about the environment, as emotions tend to shortcut reasoning.
For a long time, LLMs have only been able to think in that "fast" mode, missing obvious trick questions like these. They were mostly pattern recognition machines.
But the more important result here is not "oh look! Those LLMs fail at this basic question". The more important result is that the latest generation actually doesn't fail.
I think I am not the only one to have noticed that there was a giant leap in reasoning capabilities between Sonnet 4.5 and Opus 4.6. As a developer, working with Opus 4.6 has been incredible.