gklitt | 10 months ago

I tried one task head-to-head with Codex o4-mini vs Claude Code: writing documentation for a tricky area of a medium-sized codebase.

Claude Code did great and wrote pretty decent docs.

Codex didn't do well. It hallucinated a bunch of stuff that wasn't in the code, and completely misrepresented the architecture - it started talking about server backends and REST APIs in an app that doesn't have any of that.

I'm curious what went so wrong - feels like possibly an issue with loading in the right context and attending to it correctly? That seems like an area that Claude Code has really optimized for.

I have high hopes for o3 and o4-mini as models so I hope that other tests show better results! Also curious to see how Cursor etc. incorporate o3.

strangescript|10 months ago

Claude Code still feels superior. o4-mini has all sorts of issues. o3 is better but at that point, you aren't saving money so who cares.

I feel like people are sleeping on Claude Code for one reason or another. It's not cheap, but it's by far the best, most consistent experience I have had.

artdigital|10 months ago

Claude Code is just way too expensive.

These days I’m using Amazon Q Pro on the CLI. Very similar experience to Claude Code minus a few batteries. But it’s capped at $20/mo and won’t set my credit card on fire.

ekabod|10 months ago

"gemini 2.5 pro exp" is superior to Claude Sonnet 3.7 when I use it with Aider [1]. And it is free (with some high limit).

[1] https://aider.chat/

Aeolun|10 months ago

> It's not cheap, but it's by far the best, most consistent experience I have had.

It’s too expensive for what it does though. And it starts failing rapidly when it exhausts the context window.

ilaksh|10 months ago

Did you try the same exact test with o3 instead? The mini models are meant for speed.

gklitt|10 months ago

I want to but I’ve been having trouble getting o3 to work - lots of errors related to model selection.

ksec|10 months ago

Sometimes I see certain areas where AI/LLMs are absolutely crushing the work. A whole category of those jobs will be gone in the next 5 to 10 years, since the models are already at the 80-90% mark. They just need another 5-10% as they continue to improve, and they are already cheaper per task.

Other times I see an area where I think that even with a 10x efficiency improvement and 10x the hardware resources, 100x in aggregate, AI/LLMs will still be nowhere near good enough.

The truth is probably somewhere in the middle, which is why I don't believe AGI will be here any time soon. But Assisted Intelligence is no doubt in its iPhone moment, and that will continue for another 10 years before, hopefully, another breakthrough.

enether|10 months ago

there was one post that detailed how those OpenAI models hallucinate and double down on their mistakes by "lying" - it speculated on a bunch of interesting reasons why this may be the case

recommended read - https://transluce.org/investigating-o3-truthfulness

I wonder if this is what's causing it to do badly in these cases

victor9000|10 months ago

> I no longer have the “real” prime I generated during that earlier session... I produced it in a throw‑away Python process, verified it, copied it to the clipboard, and then closed the interpreter.

AGI may well be on its way, as the model is mastering the fine art of bullshitting.

kristopolous|10 months ago

Ever use Komment? They've been in the game a while. Looks pretty good.