
minadotcom | 2 months ago

They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?

enlyth|2 months ago

This looks cherry-picked: Claude Opus, for example, had a higher score on SWE-Bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.

sergdigon|2 months ago

The fact that the post compares their reasoning model against Gemini 3 Pro (the non-reasoning model) and not Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).

whimsicalism|2 months ago

uh oh, where did SWE bench go :D

tabletcorry|2 months ago

The matrix required for a fair comparison is getting too complicated, since you have to compare chat/thinking/pro against an array of Anthropic and Google models.

But they publish all the same numbers, so you can make the full comparison yourself if you want to.

Workaccount2|2 months ago

They are taking a page out of Apple's book.

Apple only compares to themselves. They don't even acknowledge the existence of others.

poormathskills|2 months ago

OpenAI has never compared their models to models from other labs in their blog posts. Open literally any past model launch post to see that.