Not my experience either, and I'm on Claude Code. I'd be really curious to see what went wrong in OP's case. Maybe too many instructions? Could it be that it used a fast model instead of one of the deeper ones?
Anyway, I think one area where Codex and Claude Code fall short is that they do not test the changes they make by actually using the app.
In this case, the LLM should ideally render the page in a real browser and actually click the buttons to verify. Best if the LLM tests it before the changes and again after, to confirm the behavior is the same. Maybe it should take a screenshot before the change, take another after, and compare the two.
Yeah, if you have these tools in place to validate its changes, you can quickly iterate with it to the right results. But think through how it's making UI changes and it quickly becomes obvious why it can make absolutely wrong and terrible guesses about the implementation details: it can't _see_ what it's doing or interact with it; it's just pattern matching on other implementations it's seen.
You can easily do this, at least with Claude Code. Ask it to install and use Playwright to confirm rendering and flow. You're correct that it is a failing to not do this. When you do, it definitely helps cut down on bugs.
EDIT: Sorry, just noticed you said "real browser". Haven't tried this but Playwright gets you a long way down the road.
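For what it's worth, here's a rough sketch of the before/after check using Playwright's Python sync API. It drives headless Chromium, which may or may not count as a "real browser" for your purposes. The helper names `capture` and `screenshots_match` are made up for illustration, and the byte-for-byte comparison is a simplifying assumption; a tolerance-based pixel diff would probably hold up better against anti-aliasing and animations.

```python
import hashlib
from typing import Optional


def capture(url: str, click_selector: Optional[str] = None) -> bytes:
    """Load `url` in headless Chromium, optionally click one element, return a PNG."""
    # Imported lazily so screenshots_match() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        if click_selector is not None:
            # Exercise the UI the way a user would before taking the screenshot.
            page.click(click_selector)
        png = page.screenshot(full_page=True)
        browser.close()
    return png


def screenshots_match(before: bytes, after: bytes) -> bool:
    """Strict check: True only if the two PNGs are byte-for-byte identical."""
    return hashlib.sha256(before).digest() == hashlib.sha256(after).digest()
```

You'd call `capture(...)` with your dev server URL and a button selector (both specific to your app) once before the agent's change and once after, then pass the two results to `screenshots_match`. Needs `pip install playwright` and `playwright install chromium` first.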
aurareturn|16 days ago
I asked why Codex and Claude don't do this here: https://news.ycombinator.com/item?id=46792066
threetonesun|16 days ago
mwigdahl|16 days ago
lenerdenator|16 days ago
throwup238|16 days ago
n4r9|16 days ago