cubefox | 5 days ago
But unfortunately the article doesn't mention whether they used the reasoning model or not.
Even more interesting: Gemini 2.0 Flash Lite got a perfect score (10/10) despite being quite a small and old model.
randomtoast|5 days ago
You can run the test yourself: if you ask GPT-5.2 with reasoning effort set to high or xhigh, it will always answer correctly. So if they got 0 out of 10, they used zero reasoning effort, which easily explains the results.
felix089|5 days ago