bsaul | 16 days ago

Not my experience either, and I'm on Claude Code. I'd be really curious to see what went wrong in OP's case. Maybe too much guidance? Could it be that it used a fast model instead of the deep ones?

aurareturn|16 days ago

No, OP said he used the Max Opus 4.6.

Anyways, I think one area where Codex and Claude Code fall short is that they do not test the changes they make by actually using the app.

In this case, the LLM should ideally render the page in a real browser and actually click on the buttons to verify. Best if the LLM tests it before the changes and again after, to confirm the behavior is the same. Maybe it should take a screenshot before the change, take another after, and check that they match.
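A minimal sketch of the "screenshot before, screenshot after, and match" step, assuming both screenshots were already captured at the same size and decoded to raw pixel bytes by whatever tool drives the browser. The names `diff_fraction` and `screenshots_match` and the `tolerance` and `max_diff` parameters are made up for illustration, not part of any tool mentioned here.

```python
def diff_fraction(before: bytes, after: bytes, tolerance: int = 0) -> float:
    """Return the fraction of byte positions that differ by more than `tolerance`."""
    if len(before) != len(after):
        raise ValueError("screenshots must have identical dimensions")
    if not before:
        return 0.0
    changed = sum(1 for a, b in zip(before, after) if abs(a - b) > tolerance)
    return changed / len(before)


def screenshots_match(before: bytes, after: bytes, max_diff: float = 0.0) -> bool:
    """Treat the two screenshots as matching when at most `max_diff` of the bytes changed."""
    return diff_fraction(before, after) <= max_diff
```

An exact match (`max_diff=0.0`) is probably too strict for pages with animations or anti-aliased text, so a small nonzero threshold may be more practical.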

I asked why Codex and Claude don't do this here: https://news.ycombinator.com/item?id=46792066

threetonesun|16 days ago

Yeah, if you have these tools in place to validate its changes, you can quickly iterate with it to the right results. But think through how it's making UI changes and it quickly becomes obvious why it can make absolutely wrong and terrible guesses about the implementation details: it can't _see_ what it's doing or interact with it. It's just pattern matching against other implementations it has seen.

mwigdahl|16 days ago

You can easily do this, at least with Claude Code. Ask it to install and use Playwright to confirm rendering and flow. You're correct that it is a failing to not do this. When you do, it definitely helps cut down on bugs.
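For reference, a rough sketch of the kind of check Claude Code might write when asked to do this, using Playwright's Python sync API. The function names, the URL, the selector, and the expected text are all hypothetical; the real script depends on the app under test.

```python
def verify_click(page, url: str, button_selector: str, expected_text: str) -> bool:
    """Load the page, click a button, and check that the expected text shows up."""
    page.goto(url)
    page.click(button_selector)
    return expected_text in page.inner_text("body")


def run_in_chromium(url: str, button_selector: str, expected_text: str) -> bool:
    """Run verify_click in headless Chromium (requires Playwright to be installed)."""
    # Imported lazily so verify_click can be exercised with a stub page object.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            return verify_click(browser.new_page(), url, button_selector, expected_text)
        finally:
            browser.close()
```

Something like `run_in_chromium("http://localhost:3000", "#save", "Saved")` would then exercise the flow against a locally running dev server, assuming one is listening on that port.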

EDIT: Sorry, just noticed you said "real browser". Haven't tried this but Playwright gets you a long way down the road.

lenerdenator|16 days ago

FWIW, I've found Playwright tests to be a decent way of getting Claude to do what you're talking about.

throwup238|16 days ago

See the /chrome command in Claude Code.

n4r9|16 days ago

They say explicitly what model they're using.