
GPT-5.3-Codex

1530 points | meetpateltech | 1 month ago | openai.com

605 comments

[+] Rperry2174|1 month ago|reply
What's interesting to me is that GPT-5.3 and Opus 4.6 are diverging philosophically, and in the same way that actual engineers and orgs have diverged philosophically.

With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

that feels like a reflection of a real split in how people think llm-based coding should work...

some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result

Interested to see whether models eventually optimize for those two philosophies, and for the 3rd, 4th, 5th philosophies that will emerge in the coming years.

Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means

[+] karmasimida|1 month ago|reply
> With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

Ain't the UX the exact opposite? Codex thinks much longer before giving you back the answer.

[+] ghosty141|1 month ago|reply
I'm personally 100% convinced (assuming prices stay reasonable) that the Codex approach is here to stay.

Having a human in the loop eliminates all the problems that LLMs have, and continuously reviewing smallish chunks of code works really well in my experience.

It saves so much time having Codex do all the plumbing so you can focus on the actual "core" part of a feature.

LLMs still can't think and generalize (and I doubt that changes). If I tell Codex to implement 3 features, it won't stop and find a general solution that unifies them unless explicitly told to. This makes it kinda pointless for the "full autonomy" approach, since effectively code quality and abstractions completely go down the drain over time. That's fine if it's just prototyping or "throwaway" scripts, but for bigger codebases where longevity matters it's a dealbreaker.

[+] utilize1808|1 month ago|reply
I think it's the opposite. Especially considering Codex started out as a web app that offers very little interactivity: you are supposed to drop a request and let it run autonomously in a containerized environment; you can then follow up on it via chat --- no interactive code editing.
[+] mcintyre1994|1 month ago|reply
This kind of sounds like both of them stepping into the other’s turf, to simplify a bit.

I haven’t used Codex but use Claude Code, and the way people (before today) described Codex to me was like how you’re describing Opus 4.6

So it sounds like they're potentially converging toward "both of these approaches are useful at different times"? And neither wants people who prefer one way of working to be locked into the other's model.

[+] giancarlostoro|1 month ago|reply
> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

This feels wrong. I can't comment on Codex, but Claude will prompt you and ask before changing files. Even when I run it in dangerous mode on Zed, I can still review all the diffs and undo them, or, you know, tell it what to change. If you're worried about it making too many decisions, you can pre-prompt Claude Code (via .claude/instructions.md) and instruct it to always ask follow-up questions about architectural decisions.

Sometimes I go out of my way to tell Claude DO NOT ASK ME FOR FOLLOW UPS JUST DO THE THING.

[+] jhancock|1 month ago|reply
Good breakdown.

I usually want the codex approach for code/product "shaping" iteratively with the ai.

Once things are shaped and common "scaling patterns" are well established, then for things like adding a front end (which is constantly changing, with more views), letting the autonomous approach run wild can *sometimes* be useful.

I have found that Codex is better at remembering when I ask it not to get carried away... whereas Claude requires constant reminders.

[+] techbro_1a|1 month ago|reply
> With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

This is true, but I find that Codex thinks more than Opus. That's why 5.2 Codex was more reliable than Opus 4.5

[+] dimgl|1 month ago|reply
Did you get those backwards? Codex, Gemini, etc. all wait until the requests are done to accept user feedback. Claude Code allows you to insert messages in between turns.
[+] bob1029|1 month ago|reply
I think there is another philosophy where the agent is domain specific. Not that we have to invent an entirely new universe for every product or business, but that there is a small amount of semi-customization involved to achieve an ideal agent.

I would much rather work with things like the Chat Completion API than any frameworks that compose over it. I want total control over how tool calling and error handling works. I've got concerns specific to my business/product/customer that couldn't possibly have been considered as part of these frameworks.

Whether or not a human needs to be tightly looped in could vary wildly depending on the specific part of the business you are dealing with. Having a purpose-built agent that understands where additional verification needs to occur (and not occur) can give you the best of both worlds.
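The kind of control described above can be sketched with a minimal, hand-rolled dispatcher over the Chat Completions tool-call wire shape, instead of a framework's generic loop. This is an illustrative assumption, not anything from the thread: the `lookup_invoice` tool and its error policy stand in for whatever a specific business would plug in.

```python
import json

# Hypothetical domain-specific tool; the name and validation rule are
# illustrative assumptions, not from the thread.
def lookup_invoice(invoice_id: str) -> dict:
    if not invoice_id.startswith("INV-"):
        raise ValueError(f"unknown invoice id: {invoice_id}")
    return {"invoice_id": invoice_id, "status": "paid"}

TOOLS = {"lookup_invoice": lookup_invoice}

def run_tool_call(call: dict) -> dict:
    """Execute one model-emitted tool call with business-specific error handling.

    `call` mimics the Chat Completions tool-call shape:
    {"id": ..., "function": {"name": ..., "arguments": "<json string>"}}
    and the return value is the `role: "tool"` message fed back to the model.
    """
    name = call["function"]["name"]
    try:
        args = json.loads(call["function"]["arguments"])
        result = TOOLS[name](**args)
        content = json.dumps(result)
    except Exception as exc:
        # Business-specific policy lives here: rather than a framework's
        # generic retry, return a structured error the model can reason
        # about on its next turn.
        content = json.dumps({"error": type(exc).__name__, "detail": str(exc)})
    return {"role": "tool", "tool_call_id": call["id"], "content": content}
```

The point of owning this loop is that the `except` branch is exactly where per-customer verification or "ask a human first" checks would go, which is hard to retrofit into a framework's opaque dispatch.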

[+] aulin|1 month ago|reply
Admit I didn't follow the announcements, but isn't that a matter of UI? It doesn't seem like something that should be baked into the model, but into the tooling around it and the instructions you give them.

E.g. I've been playing with GitHub Copilot CLI (which, despite its bad reputation, is absolutely amazing), and the same model completely changes its behavior with the prompt. You can have it answer a question promptly, or send it on a multi-hour multi-agent exploration writing detailed specs, with a single prompt. Or you can have it stop midway for clarification. It all depends on the instructions.

Also, this is particularly interesting with GitHub's billing model, as each prompt counts as 1 request no matter how many tokens it burns.
[+] cchance|1 month ago|reply
Just because you can inject steering doesn't mean they steered away from long running...

There are hundreds of people posting Codex 5.2 runs that go for hours unattended and come back with full commits.

[+] mdale|1 month ago|reply
I think it's just both companies building/marketing to the strengths of their competitor, as general perception has been the opposite for Codex and Opus respectively.
[+] hbarka|1 month ago|reply
How can they be diverging? LLMs are built on similar foundations, aka the Transformer architecture. Do you mean the training method (RLHF) is diverging?
[+] sfmike|1 month ago|reply
It's the opposite? Codex course-corrects and is self-inquisitive. Opus is just wrong, and you need to re-feed it that it's wrong.
[+] dboon|1 month ago|reply
…what? It is quite literally the opposite. This isn’t a matter of taste or perception.
[+] blurbleblurble|1 month ago|reply
Funny cause the situation was totally flipped last iteration.
[+] mi_lk|1 month ago|reply
It’s the opposite way
[+] rozumbrada|1 month ago|reply
I've read this exact comment, with I would say completely the same wording, several times on X, and I would bet money it's LLM-generated by someone who hasn't even tried both tools. This AI slop, even on a site like this with no direct monetisation incentive from fake engagement, is making me sick...
[+] drsalt|1 month ago|reply
be rich, hire an ai guy, let him deal with it
[+] d--b|1 month ago|reply
I am definitely using Opus as an interactive collaborator that I steer mid-execution, stay in the loop and course correct as it works.

I mean, Opus asks a lot whether it should run things, and each time you can tell it to change course. And if that's not enough, you can always press Esc to interrupt.

[+] granzymes|1 month ago|reply
I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!

The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 for GPT-5.2-Codex.

GPT-5.3-codex scores 77.3.

[+] the_duke|1 month ago|reply
I do not trust the AI benchmarks much, they often do not line up with my experience.

That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.

So very much looking forward to trying out 5.3.

[+] leumon|1 month ago|reply
they tested it at xhigh reasoning though, which is probably double the cost of Anthropic's model.

Cost to Run Artificial Analysis Intelligence Index:

GPT-5.2 Codex (xhigh): $3244

Claude Opus 4.5-reasoning: $1485

(and probably similar values for the newer models?)

[+] wilg|1 month ago|reply
In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding, I’m baffled why people think Claude has the edge on programming.
[+] __jl__|1 month ago|reply
Impressive jump for GPT-5.3-codex and crazy to see two top coding models come out on the same day...
[+] nurettin|1 month ago|reply
Opus was quite useless today. Created lots of globals, statics, forward declarations, hidden implementations in cpp files with no testable interface, erasing types, casting void pointers, I had to fix quite a lot and decouple the entangled mess.

Hopefully performance will pick up after the rollout.

[+] jronak|1 month ago|reply
Did you look at ARC-AGI-2? Codex might be overfit for Terminal-Bench.
[+] xiphias2|1 month ago|reply
> GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework, and the first we've directly trained to identify software vulnerabilities. While we don't have definitive evidence it can automate cyber attacks end-to-end, we're taking a precautionary approach and deploying our most comprehensive cybersecurity safety stack to date. Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence.

While I love Codex and believe it's an amazing tool, I believe their Preparedness Framework is out of date. As it gets more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come from more and more security-critical software being vibe coded.

It's great to measure how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and to get better on that scale.

In simpler terms: Codex should write secure software by default.

[+] itay-maman|1 month ago|reply
Something that caught my eye from the announcement:

> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training

I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.

[+] minimaxir|1 month ago|reply
I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.
[+] SunshineTheCat|1 month ago|reply
I've always been fascinated to see significantly more people talking about using Claude than about Codex.

I know that's anecdotal, but it just seems Claude is often the default.

I'm sure there are key differences in how they handle coding tasks and maybe Claude is even a little better in some areas.

However, the note I see the most from Claude users is running out of usage.

Coding differences aside, this would be the biggest factor for me using one over the other. After several months on Codex's $20/mo. plan (and some pretty significant usage days), I have only come close to my usage limit once (never fully exceeded it).

That (at least to me) seems to be a much bigger deal than coding nuances.

[+] bgirard|1 month ago|reply
> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.

I wish they would share the full conversation, token counts, and more. I'd like a better sense of how they normalize these comparisons across versions. Is this a 3-prompt, 10M-token game? A 30-prompt, 100M-token game? Are both models using similar prompts/token counts?

I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.

[1] https://factory-gpt.vercel.app/

[+] tosh|1 month ago|reply
Terminal Bench 2.0

  | Name                | Score |
  |---------------------|-------|
  | OpenAI Codex 5.3    | 77.3  |
  | Anthropic Opus 4.6  | 65.4  |
[+] nananana9|1 month ago|reply
I've been listening to the insane 100x productivity gains you all are getting with AI and "this new crazy model is a real game changer" for a few years now, I think it's about time I asked:

Can you guys point me to a single useful, majority-LLM-written, preferably reliable program that solves a non-trivial problem that hasn't already been solved a bunch of times in publicly available code?

[+] RivieraKid|1 month ago|reply
Do software engineers here feel threatened by this? I certainly am. I'm surprised that this topic is almost entirely missing in these threads.
[+] jstummbillig|1 month ago|reply
It's so interesting that I start to feel a change, that is developing as a separate thing to capability. Previously, yeah sure, things changed but models got so outrageously better at the basic things that I simply wouldn't care.

Now... increasingly it's like changing a partner just so slightly. I can feel that something is different and it gives me pause. That's probably not a sign of the improvement diminishing. Maybe more so my capability to appreciate them.

I can see how one might get from here to the whole people being upset about 4o thing.

[+] trilogic|1 month ago|reply
When 2 multi-billion-dollar giants advertise on the same day, it is not competition but rather a sign of struggle and survival. With all the power of the "best artificial intelligence" at your disposal, a lot of capital, and all the brilliant minds, THIS IS WHAT YOU COULD COME UP WITH?

Interesting

[+] tombert|1 month ago|reply
Actually kind of excited for this. I've been using 5.2 for a while now, and it's already pretty impressive if you set the context window to "high".

Something I have been experimenting with is AI-assisted proofs. Right now I've been playing with TLAPS to help write some more comprehensive correctness proofs for a thing I've been building, and 5.2 didn't seem quite up to it; I was able to figure out proofs on my own a bit better than it was, even when I would tell it to keep trying until it got it right.

I'm excited to see if 5.3 fares a bit better; if I can get mechanized proofs working, then Fields Medal here I come!

[+] nickandbro|1 month ago|reply
I have found GPT-5.3-Codex to do exceedingly well when working with graphics rendering pipelines. They must have better training data or RL approaches than Anthropic, as I have given the same prompt and config to Opus 4.6 and it seems to have added unwanted rendering artifacts. This may just be an issue specific to my use case, but I wonder, since OpenAI is partners with MSFT, which makes lots of games, whether this may be an area they heavily invested in.
[+] morleytj|1 month ago|reply
The behind the scenes on deciding when to release these models has got to be pretty insanely stressful if they're coming out within 30 minutes-ish of each other.
[+] dllrr|1 month ago|reply
Using opus 4.6 in claude code right now. It's taking about 5x longer to think things through, if not more.
[+] modeless|1 month ago|reply
It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so the numbers for them are not comparable.
[+] jiggawatts|1 month ago|reply
I think this announcement says a lot about OpenAI and their relationship to partners like Microsoft and NVIDIA, not to mention the attitude of their leadership team.

On Microsoft Foundry I can see the new Opus 4.6 model right now, but GPT-5.3 is nowhere to be seen.

I have a pre-paid account directly with OpenAI that has credits, but if I use that key with the Codex CLI, it can't access 5.3 either.

The press release very prominently includes this quote: "GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership."

Sounds like OpenAI's ties with their vendors are fraying while at the same time they're struggling to execute on the basics like "make our own models available to our own coding agents", let alone via third-party portals like Microsoft Foundry.

[+] zozbot234|1 month ago|reply
GPT 5.3 is not in the API yet AIUI.
[+] kingstnap|1 month ago|reply
> GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership.

This is hilarious lol