wasmainiac|24 days ago
Dumb question: can these benchmarks be trusted when model performance tends to vary with the hours and load on OpenAI's servers? How do I know I'm not getting a severe penalty for chatting at the wrong time? Or even: are the models at their best at launch, then slowly eroded to more economical settings after the hype wears off?
tedsanders|24 days ago
(I'm from OpenAI.) We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long, with no quantization or other gimmicks. They can get slower under heavy load, though.
Corence|24 days ago
It is a fair question. I'd expect the numbers are all real. Competitors will rerun the benchmark with these models to see how they respond to and succeed on the tasks, and use that information to figure out how to improve their own models. If the benchmark numbers aren't real, competitors will call out that they aren't reproducible.
However, it's possible that consumers without a sufficiently high-tier plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.
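If you want to check reproducibility yourself, a crude re-run is not much code. A minimal sketch, assuming the official OpenAI Python client, a hypothetical tasks.jsonl of prompt/answer pairs, naive exact-match scoring, and a placeholder model ID (real harnesses grade far more carefully):

```python
import json

from openai import OpenAI  # official client; reads OPENAI_API_KEY from env

client = OpenAI()

def run_eval(model: str, path: str = "tasks.jsonl") -> float:
    """Return the exact-match pass rate of `model` on a jsonl task file."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            task = json.loads(line)  # expects {"prompt": ..., "answer": ...}
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": task["prompt"]}],
                temperature=0,  # reduce run-to-run noise
            )
            reply = resp.choices[0].message.content.strip()
            correct += int(reply == task["answer"])  # naive exact match
            total += 1
    return correct / total

if __name__ == "__main__":
    print(f"pass rate: {run_eval('gpt-5.2'):.1%}")  # placeholder model ID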
On benchmarks, GPT 5.2 was roughly equivalent to Opus 4.5, but most people who've used both for SWE work would say that Opus 4.5 is/was noticeably better.
We know OpenAI has already been caught getting benchmark data and tuning their models to it. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.
The lack of broad benchmark reports in this makes me curious: has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we all try both of these out.
Anthropic models generally are right first time for me. ChatGPT and Gemini are often way, way out, with some fundamental misunderstanding of the task at hand.
That's a massive jump. I'm curious whether there's a materially different feel to how it works, or if we're starting to reach the point of benchmark saturation. If the benchmark is good, then 10 points should be a big improvement in capability...
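Rough arithmetic on whether 10 points can even be noise: it depends on how many tasks the benchmark has. A back-of-envelope two-proportion z-test; the 500-task size and the 60% → 70% scores below are made-up numbers, not the actual benchmark:

```python
# Two-sided two-proportion z-test: is a jump from p1 to p2 plausibly
# just sampling noise, given n1 and n2 tasks per run?
from math import erf, sqrt

def two_prop_p_value(p1: float, p2: float, n1: int, n2: int) -> float:
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p2 - p1) / se
    # two-sided tail of the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Made-up example: 60% -> 70% on 500 tasks each run.
print(f"p-value: {two_prop_p_value(0.60, 0.70, 500, 500):.4f}")  # ~0.0009
```

On a 500-task suite, 10 points is well outside sampling noise; on a 50-task suite, the same jump often isn't (p ≈ 0.29 with these numbers).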
cyanydeez|24 days ago
I definitely suspect all these models are being degraded during heavy loads.
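That's testable, at least crudely: pin temperature to 0, fix a seed, run the same prompts at different hours, and diff the outputs. A minimal sketch with the official OpenAI Python client; the prompts and model ID are placeholders, and the seed parameter is only best-effort determinism:

```python
import datetime

from openai import OpenAI

client = OpenAI()
PROMPTS = ["Factor 391 into primes.", "Reverse the string 'benchmark'."]

def snapshot(model: str = "gpt-5.2") -> dict[str, str]:
    """Query each prompt once and return prompt -> reply."""
    out = {}
    for prompt in PROMPTS:
        resp = client.chat.completions.create(
            model=model,  # placeholder model ID
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            seed=42,  # best-effort determinism, not a hard guarantee
        )
        out[prompt] = resp.choices[0].message.content
    return out

# Run from cron at, say, 03:00 and 20:00, then diff the snapshot files.
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
with open(f"snapshot-{stamp}.txt", "w") as f:
    for prompt, reply in snapshot().items():
        f.write(f"### {prompt}\n{reply}\n\n")
```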
jkelleyrtp|24 days ago
Seems like 4.6 is still all-around better?