zone411 | 2 months ago

I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):

The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.

The medium-reasoning version also improves: 62.7 → 72.1.

The no-reasoning version also improves: 22.1 → 27.5.

Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.
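
For anyone who wants the gains as numbers, here is a minimal sketch that turns the scores quoted above into absolute and relative improvements. The dictionary keys and the `gains` helper are illustrative names, not anything from the benchmark repo.

```python
# Scores quoted in the comment above (GPT-5.1 -> GPT-5.2), in percent.
# The keys are just labels for the three reasoning settings.
scores = {
    "high": (69.9, 77.9),
    "medium": (62.7, 72.1),
    "none": (22.1, 27.5),
}


def gains(old, new):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - old
    relative = 100.0 * absolute / old
    return round(absolute, 1), round(relative, 1)


for level, (old, new) in scores.items():
    abs_gain, rel_gain = gains(old, new)
    print(f"{level:>6}: +{abs_gain} points ({rel_gain}% relative)")
```

The no-reasoning setting has the smallest absolute gain but the largest relative one, since it starts from a much lower baseline.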

Donald|2 months ago

Gemini 3 Pro Preview gets 96.8% on the same benchmark? That's impressive.

capitainenemo|2 months ago

And it performs very well on the latest 100 puzzles too, so it isn't just learning the data set (unless, I guess, they routinely index this repo).

I wonder how well AIs would do at Bracket City. I tried Gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.

bigyabai|2 months ago

GPT-5.2 might be Google's best Gemini advertisement yet.

tikotus|2 months ago

Here's someone else testing models on a daily logic puzzle (Clues by Sam): https://www.nicksypteras.com/blog/cbs-benchmark.html GPT 5 Pro was already the winner in that test.

thanhhaimai|2 months ago

This link doesn't have Gemini 3 performance on it. Do you have an updated link with the new models?

crapple8430|2 months ago

GPT 5 Pro is a good 10x more expensive, so it's an apples-to-oranges comparison.

fellowniusmonk|2 months ago

I think they are overfitting more. I'm seeing it perform worse on esoteric logic puzzles.

Bombthecat|2 months ago

I would like to see a row for something like cost per percentage point. I feel like Grok would beat them all.

scrollop|2 months ago

Why no Grok 4.1 reasoning?

sanex|2 months ago

Do people other than Elon fans use Grok? Honest question. I've never tried it.