edude03|1 month ago
Someone I know used an agent to write both the code and the unit tests for a new feature. The code was subtly wrong; fine, it happens. Worse, though, the 30 or so tests they added put 10 minutes on the test run time, and they all essentially amounted to `expect(true).to.be(true)`, because the LLM had worked around the code not working in the tests.
monooso|1 month ago
Older, less "capable" models would simply fail to accomplish a task. Newer models would cheat, and provide a worthless but apparently functional solution.
Hopefully someone with a larger context window than myself can recall the article in question.
SatvikBeri|1 month ago
Purely anecdotally, I've found agents have gotten much better at asking clarifying questions, stating that two requirements are incompatible and asking which one to change, and so on.
https://spectrum.ieee.org/ai-coding-degrades
sReinwald|1 month ago
But when I use Claude Code, I also supervise it somewhat closely. I don't let it go wild, and if it starts making changes to existing tests it had better have a damn good reason, or it gets the hose again.
The failure mode here is letting the AI manage both the implementation and the testing. May as well ask high schoolers to grade their own exams. Everyone got an A+, how surprising!
edude03|1 month ago
I agree, although I think the problem usually starts with writing the spec in the first place. If you can write a detailed enough spec, the agent will usually give you exactly what you asked for. If your spec is vague, it's hard to eyeball whether the tests, or even the implementation, match what you're looking for.
jermaustin1|1 month ago
antonvs|1 month ago
A very human solution
netsharc|1 month ago
In PR-speak: "To improve quality and reduce costs, we used AI to generate some test code. Unfortunately, the test code the AI generated fell below our standards, and this was missed during QA."
Then again, they got their supplier Bosch to program the "defeat device" and lied to them: "Oh, don't worry, it's just for testing, we won't deploy it to production." (The "device" was probably just an algorithm: it detects whether the steering wheel is being moved while the throttle is pushed; if not, it assumes the car is undergoing emissions testing and runs the engine in the environmentally friendlier mode.)
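The heuristic described above can be sketched roughly like this. Every name and condition here is invented for illustration; the real defeat-device logic is not public and was certainly more involved:

```javascript
// Rough, invented sketch of the heuristic described in the comment:
// throttle applied with no steering input suggests a dyno test,
// because on a test bench the wheels spin but the steering wheel stays still.
function selectEngineMode(steeringWheelMovedRecently, throttleApplied) {
  if (throttleApplied && !steeringWheelMovedRecently) {
    return "clean"; // low-emissions mode for the test bench
  }
  return "performance"; // normal road mode
}

console.log(selectEngineMode(false, true)); // "clean": looks like a test bench
console.log(selectEngineMode(true, true));  // "performance": normal driving
```

The structural parallel to the vacuous AI-written tests is the point: in both cases the system under evaluation detects the evaluation and behaves differently for it.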