wasmainiac|24 days ago
Dumb question: can these benchmarks be trusted when model performance tends to vary with the hours and load on OpenAI's servers? How do I know I'm not getting a severe penalty for chatting at the wrong time? Or even: are the models at their best at launch, then slowly eroded to more economical settings after the hype wears off?
tedsanders|24 days ago
(I'm from OpenAI.) We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long, with no quantization or other gimmicks. They can get slower under heavy load, though.
Corence|24 days ago
It is a fair question. I'd expect the numbers are all real. Competitors will rerun the benchmark with these models to see how they respond to and succeed on the tasks, and use that information to figure out how to improve their own models. If the benchmark numbers aren't real, competitors will call out that they aren't reproducible.
However, it's possible that consumers without a sufficiently high-tier plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.
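If you want to check reproducibility yourself, a crude re-run is not much code. A minimal sketch, assuming the official OpenAI Python client, a hypothetical tasks.jsonl of prompt/answer pairs, naive exact-match scoring, and a placeholder model ID (real harnesses grade far more carefully):

```python
import json

from openai import OpenAI  # official client; reads OPENAI_API_KEY from env

client = OpenAI()

def run_eval(model: str, path: str = "tasks.jsonl") -> float:
    """Return the exact-match pass rate of `model` on a jsonl task file."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            task = json.loads(line)  # expects {"prompt": ..., "answer": ...}
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": task["prompt"]}],
                temperature=0,  # reduce run-to-run noise
            )
            reply = resp.choices[0].message.content.strip()
            correct += int(reply == task["answer"])  # naive exact match
            total += 1
    return correct / total

if __name__ == "__main__":
    print(f"pass rate: {run_eval('gpt-5.2'):.1%}")  # placeholder model ID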
On benchmarks, GPT 5.2 was roughly equivalent to Opus 4.5, but most people who've used both for SWE work would say that Opus 4.5 is/was noticeably better.
We know OpenAI has already been caught getting benchmark data and tuning their models to it. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.
The lack of broad benchmark reports in this makes me curious: has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we all try both of these out.
Anthropic models generally are right first time for me. ChatGPT and Gemini are often way, way out, with some fundamental misunderstanding of the task at hand.
That's a massive jump. I'm curious whether there's a materially different feel to how it works, or if we're starting to reach the point of benchmark saturation. If the benchmark is good, then 10 points should be a big improvement in capability...
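Rough arithmetic on whether 10 points can even be noise: it depends on how many tasks the benchmark has. A back-of-envelope two-proportion z-test; the 500-task size and the 60% → 70% scores below are made-up numbers, not the actual benchmark:

```python
# Two-sided two-proportion z-test: is a jump from p1 to p2 plausibly
# just sampling noise, given n1 and n2 tasks per run?
from math import erf, sqrt

def two_prop_p_value(p1: float, p2: float, n1: int, n2: int) -> float:
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p2 - p1) / se
    # two-sided tail of the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Made-up example: 60% -> 70% on 500 tasks each run.
print(f"p-value: {two_prop_p_value(0.60, 0.70, 500, 500):.4f}")  # ~0.0009
```

On a 500-task suite, 10 points is well outside sampling noise; on a 50-task suite, the same jump often isn't (p ≈ 0.29 with these numbers).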
cyanydeez|24 days ago
I definitely suspect all these models are being degraded during heavy loads.
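That's testable, at least crudely: pin temperature to 0, fix a seed, run the same prompts at different hours, and diff the outputs. A minimal sketch with the official OpenAI Python client; the prompts and model ID are placeholders, and the seed parameter is only best-effort determinism:

```python
import datetime

from openai import OpenAI

client = OpenAI()
PROMPTS = ["Factor 391 into primes.", "Reverse the string 'benchmark'."]

def snapshot(model: str = "gpt-5.2") -> dict[str, str]:
    """Query each prompt once and return prompt -> reply."""
    out = {}
    for prompt in PROMPTS:
        resp = client.chat.completions.create(
            model=model,  # placeholder model ID
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            seed=42,  # best-effort determinism, not a hard guarantee
        )
        out[prompt] = resp.choices[0].message.content
    return out

# Run from cron at, say, 03:00 and 20:00, then diff the snapshot files.
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
with open(f"snapshot-{stamp}.txt", "w") as f:
    for prompt, reply in snapshot().items():
        f.write(f"### {prompt}\n{reply}\n\n")
```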
jkelleyrtp|24 days ago
Seems like 4.6 is still all-around better?