granzymes | 24 days ago
The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, just ahead of GPT-5.2-codex's 64.7.
GPT-5.3-codex scores 77.3.
the_duke|24 days ago
That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
So very much looking forward to trying out 5.3.
NitpickLawyer|24 days ago
aurareturn|24 days ago
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
Looking forward to trying 5.3.
fooker|24 days ago
Every new model overfits to the latest overhyped benchmark.
Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.
mmaunder|24 days ago
int_19h|23 days ago
jahsome|24 days ago
nerdsniper|24 days ago
leumon|24 days ago
Cost to Run Artificial Analysis Intelligence Index:
GPT-5.2 Codex (xhigh): $3244
Claude Opus 4.5-reasoning: $1485
(and probably similar values for the newer models?)
redox99|24 days ago
Computer0|24 days ago
wilg|24 days ago
dudeinhawaii|24 days ago
Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generating custom CSS on every page), because I hate webdev and Claude's designs always look better.
But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.
Solution: use both as needed!
soulofmischief|24 days ago
Opus is the first model I can trust to just do things, and do them right, at least small things. For larger/more complex things I have to keep either model on extremely short leashes. But the difference is enough that I canceled my GPT Pro sub so I could switch to Claude. Maybe 5.3 will change things, but I also cannot continue to ethically support Sam Altman's business.
fragmede|24 days ago
__jl__|24 days ago
granzymes|24 days ago
nurettin|24 days ago
Hopefully performance will pick up after the rollout.
nickstinemates|24 days ago
jronak|24 days ago
tedsanders|24 days ago