cubefox | 5 days ago

Interesting that GPT-5.1 and 5.2 (0 of 10 correct) are a lot worse than the older GPT-5 (7 of 10 correct).

But unfortunately the article doesn't mention whether they used the reasoning model or not.

Even more interesting: Gemini 2.0 Flash Lite got a perfect score (10/10) despite being quite a small and old model.

randomtoast|5 days ago

> But unfortunately the article doesn't mention whether they used the reasoning model or not.

You can run the test yourself: if you ask GPT-5.2 with reasoning effort set to high or xhigh, it will always answer correctly. So if they got 0 out of 10, they used no reasoning effort, which easily explains the results.
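
If you want to repeat the comparison, a rough sketch of a scoring loop could look like the one below. Everything in it is an assumption for illustration: `score_by_effort` and `fake_ask` are hypothetical names, the effort labels are placeholders, and `fake_ask` is a stand-in you would swap for your own wrapper around whichever API client you use. It is not the article's actual harness.

```python
from typing import Callable, Dict, Iterable


def score_by_effort(
    ask: Callable[[str, str], str],
    prompt: str,
    expected: str,
    efforts: Iterable[str],
    trials: int = 10,
) -> Dict[str, int]:
    """Count, per effort level, how many of `trials` answers contain `expected`."""
    results: Dict[str, int] = {}
    for effort in efforts:
        correct = 0
        for _ in range(trials):
            answer = ask(prompt, effort)
            # Simple case-insensitive substring check; a real eval may need stricter grading.
            if expected.lower() in answer.lower():
                correct += 1
        results[effort] = correct
    return results


# Hypothetical stand-in for a real model call, so the sketch runs offline.
# It only mimics the pattern described above: correct with high effort, wrong otherwise.
def fake_ask(prompt: str, effort: str) -> str:
    return "The answer is 42" if effort == "high" else "I am not sure"


if __name__ == "__main__":
    scores = score_by_effort(fake_ask, "What is 6 * 7?", "42", ["low", "high"])
    for effort, correct in scores.items():
        print(f"{effort}: {correct} of 10 correct")
```

Reporting the score per effort level, rather than one number per model, would make it obvious whether a 0 of 10 comes from the model itself or from the setting it was run with.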

felix089|5 days ago

Good question. I used the API defaults across the board, since that felt like the most reasonable baseline for comparison. Flash Lite getting 10/10 was definitely very surprising.