Whats interesting to me is that these gpt-5.3 and opus-4.6 are diverging philosophically and really in the same way that actual engineers and orgs have diverged philosophically
With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
that feels like a reflection of a real split in how people think llm-based coding should work...
some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result
Interested to see if we eventually see models optimize for those two philosophies and 3rd, 4th, 5th philosophies that will emerge in the coming years.
Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means
> With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
Ain't the UX is the exact opposite? Codex thinks much longer before gives you back the answer.
I'm personally 100% convinced (assuming prices stay reasonable) that the Codex approach is here to stay.
Having a human in the loop eliminates all the problems that LLMs have and continously reviewing small'ish chunks of code works really well from my experience.
It saves so much time having Codex do all the plumbing so you can focus on the actual "core" part of a feature.
LLMs still (and I doubt that changes) can't think and generalize. If I tell Codex to implement 3 features he won't stop and find a general solution that unifies them unless explicitly told to. This makes it kinda pointless for the "full autonomy" approach since effecitly code quality and abstractions completely go down the drain over time. That's fine if it's just prototyping or "throwaway" scripts but for bigger codebases where longevity matters it's a dealbreaker.
I think it's the opposite. Especially considering Codex started out as a web app that offers very little interactivity: you are supposed to drop a request and let it run automatously in a containerized environment; you can then follow up on it via chat --- no interactive code editing.
This kind of sounds like both of them stepping into the other’s turf, to simplify a bit.
I haven’t used Codex but use Claude Code, and the way people (before today) described Codex to me was like how you’re describing Opus 4.6
So it sounds like they’re converging toward “both these approaches are useful at different times” potentially? And neither want people who prefer one way of working to be locked to the other’s model.
> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
This feels wrong, I can't comment on Codex, but Claude will prompt you and ask you before changing files, even when I run it in dangerous mode on Zed, I can still review all the diffs and undo them, or you know, tell it what to change. If you're worried about it making too many decisions, you can pre-prompt Claude Code (via .claude/instructions.md) and instruct it to always ask follow up questions regarding architectural decisions.
Sometimes I go out of my way to tell Claude DO NOT ASK ME FOR FOLLOW UPS JUST DO THE THING.
I usually want the codex approach for code/product "shaping" iteratively with the ai.
Once things are shaped and common "scaling patterns" are well established, then for things like adding a front end (which is constantly changing, more views) then letting the autonomous approach run wild can *sometimes* be useful.
I have found that codex is better at remembering when I ask to not get carried away...whereas claude requires constant reminders.
Did you get those backwards? Codex, Gemini, etc. all wait until the requests are done to accept user feedback. Claude Code allows you to insert messages in between turns.
I think there is another philosophy where the agent is domain specific. Not that we have to invent an entirely new universe for every product or business, but that there is a small amount of semi-customization involved to achieve an ideal agent.
I would much rather work with things like the Chat Completion API than any frameworks that compose over it. I want total control over how tool calling and error handling works. I've got concerns specific to my business/product/customer that couldn't possibly have been considered as part of these frameworks.
Whether or not a human needs to be tightly looped in could vary wildly depending on the specific part of the business you are dealing with. Having a purpose-built agent that understands where additional verification needs to occur (and not occur) can give you the best of both worlds.
Admit I didn't follow the announcements but isn't that a matter of UI? Doesn't seem something that should be baked in the model but in the tooling around it and the instructions you give them. E.g. I've been playing with with GitHub copilot CLI (that despite the bad fame is absolutely amazing) and the same model completely changes its behavior with the prompt. You can have it answer a question promptly or send it on a multi-hour multi-agent exploration writing detailed specs with a single prompt. Or you can have it stop midway for clarification. It all depends on the instructions. Also this is particularly interesting with GitHub billing model as each prompt counts 1 request no matter how many tokens it burns.
I think it's just both companies building/ marketing to the strength of their competitor. As general perception has been the opposite for codex and Opus respectfully.
How can they be diverging, LLMs are built on similar foundations aka the Transformer architecture. Do you mean the training method (RLHF) is diverging?
I read this exact comment with I would say completely the same words several times in X and I would bet my money it's LLM generated by someone who has not even tried both the tools. This AI slop even in the site like this without direct monetisation implications from fake engagement is making me sick...
I am definitely using Opus as an interactive collaborator that I steer mid-execution, stay in the loop and course correct as it works.
I mean Opus asks a lot if he should run things, and each time you can tell it to change. And if that's not enough you can always press esc to interrupt.
In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding, I’m baffled why people think Claude has the edge on programming.
Opus was quite useless today. Created lots of globals, statics, forward declarations, hidden implementations in cpp files with no testable interface, erasing types, casting void pointers, I had to fix quite a lot and decouple the entangled mess.
Hopefully performance will pick up after the rollout.
,,GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework , and the first we’ve directly trained to identify software vulnerabilities. While we don’t have definitive evidence it can automate cyber attacks end-to-end, we’re taking a precautionary approach and deploying our most comprehensive cybersecurity safety stack to date. Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence.''
While I love Codex and believe it's amazing tool, I believe their preparedness framework is out of date. As it is more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come up by having more and more security critical software vibe coded.
It's great to look at systems written by humans and how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and get better on that scale.
In simpler terms: Codex should write secure software by default.
I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.
I've always been fascinated to see significantly more people talking about using Claude than I see people talking about Codex.
I know that's anecdotal, but it just seems Claude is often the default.
I'm sure there are key differences in how they handle coding tasks and maybe Claude is even a little better in some areas.
However, the note I see the most from Claude users is running out of usage.
Coding differences aside, this would be the biggest factor for me using one over the other. After several months on Codex's $20/mo. plan (and some pretty significant usage days), I have only come close to my usage limit once (never fully exceeded it).
That (at least to me) seems to be a much bigger deal than coding nuances.
> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.
I wish they would share the full conversation, token counts and more. I'd like to have a better sense of how they normalize these comparisons across version. Is this a 3-prompt 10m token game? a 30-prompt 100m token game? Are both models using similar prompts/token counts?
I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.
I've been listening to the insane 100x productivity gains you all are getting with AI and "this new crazy model is a real game changer" for a few years now, I think it's about time I asked:
Can you guys point me ton a single useful, majority LLM-written, preferably reliable, program that solves a non-trivial problem that hasn't been solved before a bunch of times in publicly available code?
It's so interesting that I start to feel a change, that is developing as a separate thing to capability. Previously, yeah sure, things changed but models got so outrageously better at the basic things that I simply wouldn't care.
Now... increasingly it's like changing a partner just so slightly. I can feel that something is different and it gives me pause. That's probably not a sign of the improvement diminishing. Maybe more so my capability to appreciate them.
I can see how one might get from here to the whole people being upset about 4o thing.
When 2 multi billion giants advertise same day, it is not competition but rather a sign of struggle and survival.
With all the power of the "best artificial intelligence" at your disposition, and a lot of capital also all the brilliant minds, THIS IS WHAT YOU COULD COME UP WITH?
Actually kind of excited for this. I've been using 5.2 for awhile now, and it's already pretty impressive if you set the context window to "high".
Something I have been experimenting with is AI-assisted proofs. Right now I've been playing with TLAPS to help write some more comprehensive correctness proofs for a thing I've been building, and 5.2 didn't seem quite up to it; I was able to figure out proofs on my own a bit better than it was, even when I would tell it to keep trying until it got it right.
I'm excited to see if 5.3 fairs a bit better; if I can get mechanized proofs working, then Fields Medal here I come!
I have found GPT 5.3-Codex to do exceedingly well when working with graphics rendering pipelines. They must have better training data or RL approaches than Antropic as I have given the same prompt and config to Opus 4.6 and it seems to have added unwanted rendering artifacts. This may be just an issue specific to my use case, but wonder since OpenAI is partners with MSFT, which makes lots of games, that this may be an area they heavily invested in
The behind the scenes on deciding when to release these models has got to be pretty insanely stressful if they're coming out within 30 minutes-ish of each other.
It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so the numbers for them are not comparable.
I think this announcement says a lot about OpenAI and their relationship to partners like Microsoft and NVIDIA, not to mention the attitude of their leadership team.
On Microsoft Foundry I can see the new Codex 4.6 model right now, but GPT-5.3 is nowhere to be seen.
I have a pre-paid account directly with OpenAI that has credits, but if I use that key with the Codex CLI, it can't access 5.3 either.
The press release very prominently includes this quote: "GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership."
Sounds like OpenAI's ties with their vendors are fraying while at the same time they're struggling to execute on the basics like "make our own models available to our own coding agents", let alone via third-party portals like Microsoft Foundry.
[+] [-] Rperry2174|1 month ago|reply
With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
that feels like a reflection of a real split in how people think llm-based coding should work...
some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result
Interested to see if we eventually see models optimize for those two philosophies and 3rd, 4th, 5th philosophies that will emerge in the coming years.
Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means
[+] [-] karmasimida|1 month ago|reply
> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
Ain't the UX is the exact opposite? Codex thinks much longer before gives you back the answer.
[+] [-] ghosty141|1 month ago|reply
Having a human in the loop eliminates all the problems that LLMs have and continously reviewing small'ish chunks of code works really well from my experience.
It saves so much time having Codex do all the plumbing so you can focus on the actual "core" part of a feature.
LLMs still (and I doubt that changes) can't think and generalize. If I tell Codex to implement 3 features he won't stop and find a general solution that unifies them unless explicitly told to. This makes it kinda pointless for the "full autonomy" approach since effecitly code quality and abstractions completely go down the drain over time. That's fine if it's just prototyping or "throwaway" scripts but for bigger codebases where longevity matters it's a dealbreaker.
[+] [-] utilize1808|1 month ago|reply
[+] [-] mcintyre1994|1 month ago|reply
I haven’t used Codex but use Claude Code, and the way people (before today) described Codex to me was like how you’re describing Opus 4.6
So it sounds like they’re converging toward “both these approaches are useful at different times” potentially? And neither want people who prefer one way of working to be locked to the other’s model.
[+] [-] giancarlostoro|1 month ago|reply
This feels wrong, I can't comment on Codex, but Claude will prompt you and ask you before changing files, even when I run it in dangerous mode on Zed, I can still review all the diffs and undo them, or you know, tell it what to change. If you're worried about it making too many decisions, you can pre-prompt Claude Code (via .claude/instructions.md) and instruct it to always ask follow up questions regarding architectural decisions.
Sometimes I go out of my way to tell Claude DO NOT ASK ME FOR FOLLOW UPS JUST DO THE THING.
[+] [-] jhancock|1 month ago|reply
I usually want the codex approach for code/product "shaping" iteratively with the ai.
Once things are shaped and common "scaling patterns" are well established, then for things like adding a front end (which is constantly changing, more views) then letting the autonomous approach run wild can *sometimes* be useful.
I have found that codex is better at remembering when I ask to not get carried away...whereas claude requires constant reminders.
[+] [-] techbro_1a|1 month ago|reply
This is true, but I find that Codex thinks more than Opus. That's why 5.2 Codex was more reliable than Opus 4.5
[+] [-] dimgl|1 month ago|reply
[+] [-] bob1029|1 month ago|reply
I would much rather work with things like the Chat Completion API than any frameworks that compose over it. I want total control over how tool calling and error handling works. I've got concerns specific to my business/product/customer that couldn't possibly have been considered as part of these frameworks.
Whether or not a human needs to be tightly looped in could vary wildly depending on the specific part of the business you are dealing with. Having a purpose-built agent that understands where additional verification needs to occur (and not occur) can give you the best of both worlds.
[+] [-] aulin|1 month ago|reply
[+] [-] cchance|1 month ago|reply
Theres hundreds of people who upload Codex 5.2 running for hours unattended and coming back with full commits
[+] [-] mdale|1 month ago|reply
[+] [-] hbarka|1 month ago|reply
[+] [-] sfmike|1 month ago|reply
[+] [-] dboon|1 month ago|reply
[+] [-] blurbleblurble|1 month ago|reply
[+] [-] pyrolistical|1 month ago|reply
[+] [-] mi_lk|1 month ago|reply
[+] [-] rippeltippel|1 month ago|reply
[+] [-] rozumbrada|1 month ago|reply
[+] [-] drsalt|1 month ago|reply
[+] [-] d--b|1 month ago|reply
I mean Opus asks a lot if he should run things, and each time you can tell it to change. And if that's not enough you can always press esc to interrupt.
[+] [-] adarsh2321|1 month ago|reply
[deleted]
[+] [-] granzymes|1 month ago|reply
The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 from GPT-5.2-codex.
GPT-5.3-codex scores 77.3.
[+] [-] the_duke|1 month ago|reply
That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
So very much looking forward to trying out 5.3.
[+] [-] leumon|1 month ago|reply
Cost to Run Artificial Analysis Intelligence Index:
GPT-5.2 Codex (xhigh): $3244
Claude Opus 4.5-reasoning: $1485
(and probably similar values for the newer models?)
[+] [-] wilg|1 month ago|reply
[+] [-] __jl__|1 month ago|reply
[+] [-] nurettin|1 month ago|reply
Hopefully performance will pick up after the rollout.
[+] [-] jronak|1 month ago|reply
[+] [-] xiphias2|1 month ago|reply
While I love Codex and believe it's amazing tool, I believe their preparedness framework is out of date. As it is more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come up by having more and more security critical software vibe coded.
It's great to look at systems written by humans and how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and get better on that scale.
In simpler terms: Codex should write secure software by default.
[+] [-] itay-maman|1 month ago|reply
> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training
I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.
[+] [-] minimaxir|1 month ago|reply
[+] [-] SunshineTheCat|1 month ago|reply
I know that's anecdotal, but it just seems Claude is often the default.
I'm sure there are key differences in how they handle coding tasks and maybe Claude is even a little better in some areas.
However, the note I see the most from Claude users is running out of usage.
Coding differences aside, this would be the biggest factor for me using one over the other. After several months on Codex's $20/mo. plan (and some pretty significant usage days), I have only come close to my usage limit once (never fully exceeded it).
That (at least to me) seems to be a much bigger deal than coding nuances.
[+] [-] bgirard|1 month ago|reply
I wish they would share the full conversation, token counts and more. I'd like to have a better sense of how they normalize these comparisons across version. Is this a 3-prompt 10m token game? a 30-prompt 100m token game? Are both models using similar prompts/token counts?
I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.
[1] https://factory-gpt.vercel.app/
[+] [-] tosh|1 month ago|reply
[+] [-] nananana9|1 month ago|reply
Can you guys point me ton a single useful, majority LLM-written, preferably reliable, program that solves a non-trivial problem that hasn't been solved before a bunch of times in publicly available code?
[+] [-] RivieraKid|1 month ago|reply
[+] [-] fishpham|1 month ago|reply
[+] [-] jstummbillig|1 month ago|reply
Now... increasingly it's like changing a partner just so slightly. I can feel that something is different and it gives me pause. That's probably not a sign of the improvement diminishing. Maybe more so my capability to appreciate them.
I can see how one might get from here to the whole people being upset about 4o thing.
[+] [-] trilogic|1 month ago|reply
Interesting
[+] [-] tombert|1 month ago|reply
Something I have been experimenting with is AI-assisted proofs. Right now I've been playing with TLAPS to help write some more comprehensive correctness proofs for a thing I've been building, and 5.2 didn't seem quite up to it; I was able to figure out proofs on my own a bit better than it was, even when I would tell it to keep trying until it got it right.
I'm excited to see if 5.3 fairs a bit better; if I can get mechanized proofs working, then Fields Medal here I come!
[+] [-] nickandbro|1 month ago|reply
[+] [-] morleytj|1 month ago|reply
[+] [-] dllrr|1 month ago|reply
[+] [-] modeless|1 month ago|reply
[+] [-] jiggawatts|1 month ago|reply
On Microsoft Foundry I can see the new Codex 4.6 model right now, but GPT-5.3 is nowhere to be seen.
I have a pre-paid account directly with OpenAI that has credits, but if I use that key with the Codex CLI, it can't access 5.3 either.
The press release very prominently includes this quote: "GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership."
Sounds like OpenAI's ties with their vendors are fraying while at the same time they're struggling to execute on the basics like "make our own models available to our own coding agents", let alone via third-party portals like Microsoft Foundry.
[+] [-] zozbot234|1 month ago|reply
[+] [-] kingstnap|1 month ago|reply
This is hilarious lol