the_duke | 24 days ago

I do not trust the AI benchmarks much; they often do not line up with my experience.

That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.

So very much looking forward to trying out 5.3.

NitpickLawyer|24 days ago

Just some anecdata++ here, but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by Codex, and then re-prompt with the findings from the review. It finds good things, doesn't flag nits (if prompted not to), and the overall flow is worth it for me. The speed loss doesn't impact this flow much.
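
For the curious, a minimal sketch of that flow in Python. call_model() is a hypothetical stand-in for whatever API or CLI you actually use, and the model names are placeholders, not real endpoints:

    def call_model(model: str, prompt: str) -> str:
        """Hypothetical helper: send a prompt to the named model, return its reply."""
        raise NotImplementedError("wire this up to your provider of choice")

    def draft_review_revise(task: str) -> str:
        # 1. A cheaper model crunches the first draft.
        draft = call_model("cheap-coder", "Implement the following:\n" + task)
        # 2. A Codex-class model reviews it; prompt it to skip style nits.
        review = call_model(
            "codex-reviewer",
            "Review this code for correctness and design problems. "
            "Do not flag style nits.\n\n" + draft,
        )
        # 3. Re-prompt the cheap model with the review findings.
        return call_model(
            "cheap-coder",
            "Revise the code to address this review:\n" + review
            + "\n\nOriginal code:\n" + draft,
        )

The sketch keeps the steps sequential for clarity; in practice the review runs async, which is why the slower model's speed barely matters here.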

kilroy123|24 days ago

Personally, I have Claude do the coding, then have 5.2-high do the reviewing.

StephenHerlihyy|24 days ago

I don't use OpenAI too much, but I follow a similar workflow: Opus for design/architecture work, then over to Sonnet for implementation and build-out, then finally over to Gemini for review, QC, and standards checks. There is a clear gain in using different models. Each has its own style and way of solving the problem, just like a human team. It's kind of awesome and crazy and a bit scary all at once.

aurareturn|24 days ago

5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.

I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.

Looking forward to trying 5.3.

koakuma-chan|24 days ago

Opus 4.5 is more creative and better at making UIs

fooker|24 days ago

Yeah, these benchmarks are bogus.

Every new model overfits to the latest overhyped benchmark.

Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.

bunderbunder|24 days ago

All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.

But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.
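
A quick way to see that bias: simulate a pile of "models" that are all literally coin flips, score them on one shared test set, and take the winner. The best observed score climbs as you reuse the set, even though every model's true accuracy is exactly 0.5. A toy sketch (standard library only; the numbers are illustrative):

    import random

    random.seed(0)
    N = 200  # shared test-set size
    truth = [random.randint(0, 1) for _ in range(N)]

    def score(preds):
        # Accuracy against the one fixed test set everyone reuses.
        return sum(p == t for p, t in zip(preds, truth)) / N

    def coin_flip_model():
        # True accuracy is exactly 0.5 by construction.
        return [random.randint(0, 1) for _ in range(N)]

    # Used once: an unbiased estimate, close to 0.5.
    print(f"single use: {score(coin_flip_model()):.3f}")

    # Reused to pick a "winner" among k submissions: biased upward.
    for k in (10, 100, 1000):
        best = max(score(coin_flip_model()) for _ in range(k))
        print(f"best of {k}: {best:.3f}")

With 1000 coin-flip submissions on a 200-example set, the "winner" typically lands around 0.60, purely from reuse.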

mrandish|24 days ago

> Yeah, these benchmarks are bogus.

It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.

mmaunder|24 days ago

The ARC-AGI-2 leaderboard has a strong correlation with my Rust/CUDA coding experience with the models.

int_19h|23 days ago

Codex 5.3 seems to be a lot chattier. As in, it comments in the chat about things it has done or is about to do. These don't show up as "thinking" CoT blocks but as regular outputs. Overall the experience is somewhat more like Claude's, in that you can spot problems in the model's reasoning much earlier if you keep an eye on it as it works, and steer it away.

jahsome|24 days ago

Another day, another HN thread of "this model changes everything", followed immediately by a reply stating "actually I have the literal opposite experience and find a competitor's model is the best", repeated until it's time to start the next day's thread.

StephenHerlihyy|24 days ago

What amazes me the most is the speed at which things are advancing. Go back a year, or even a year before that, and all these incremental improvements have compounded. Things that used to require real effort to solve consistently, whether with RAG or context/prompt engineering, have become… trivial. I totally agree with your point that each step along the way doesn't necessarily change that much. But in the aggregate it's sort of insane how fast everything is moving.

SatvikBeri|24 days ago

I use Claude Code every day, and I'm not certain I could tell the difference between Opus 4.5 and Opus 4.0 if you gave me a blind test

malshe|24 days ago

This pretty accurately summarizes all the long discussions about AI models on HN.

clhodapp|24 days ago

And of course the benchmarks are from the school of "It's better to have a bad metric than no metric", so there really isn't any way to falsify anyone's opinions...

cactusplant7374|24 days ago

Hourly occurrence on /r/codex. Model astrology is about the vibes.

nerdsniper|24 days ago

Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because several "99% of the time, assumption X is correct" assumptions are reversed in my project. I think Opus does better at not falling into those traps. Excited to try out 5.3.

nubg|24 days ago

what do you do?