> Agents require tests to keep from spinning out of control when writing more than a few thousand lines, but we know that tests are wildly insufficient to describe the state of the actual code.

Provide them with a mature, well-structured codebase to work within. Break the work down into tasks sized such that it's unlikely they'll spin out of control. Limit the scope/nature of changes such that they're changing one thing at a time rather than trying to one-shot huge programs. Use static analysis to identify affected user-facing flows and flag for human review. Provide the human-in-the-loop with fully functional before and after dev builds. Allow the human-in-the-loop to provide direct feedback within the dev build. Track the feedback the same way you track other changes. And, yes, have some automated tests that ensure core functionality matches requirements.

I think everything I've listed there can be built with existing technology.
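As a rough illustration of the static-analysis step, here is a minimal sketch that walks a module import graph backwards from the changed modules and reports which user-facing flows could be touched. Everything in it is hypothetical: the `affected_flows` helper, the module names, and the flow-to-entry-point mapping are made up for the example. A real setup would extract the graph from the codebase with an existing analysis tool rather than hard-code it.

```python
from collections import deque


def affected_flows(imports, flow_entry_points, changed_modules):
    """Return the user-facing flows whose entry point depends,
    directly or transitively, on any changed module."""
    # Invert the import graph: module -> set of modules that import it.
    dependents = {}
    for module, deps in imports.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk from the changed modules along reverse edges.
    seen = set(changed_modules)
    queue = deque(changed_modules)
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    # A flow is flagged for human review if its entry point was reached.
    return sorted(
        flow for flow, entry in flow_entry_points.items() if entry in seen
    )


# Hypothetical import graph: module -> modules it imports.
IMPORTS = {
    "checkout_page": ["cart", "payments"],
    "profile_page": ["user_store"],
    "cart": ["pricing"],
    "payments": ["pricing"],
    "pricing": [],
    "user_store": [],
}

# Hypothetical mapping of user-facing flows to their entry modules.
FLOWS = {"Checkout": "checkout_page", "Edit profile": "profile_page"}

print(affected_flows(IMPORTS, FLOWS, ["pricing"]))  # ['Checkout']
```

The output is deliberately conservative: it over-reports rather than under-reports, since the point is to tell the reviewer which before/after builds to click through, not to prove anything about correctness.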

> You are essentially saying that we should develop other methods of capturing the state of the program to prevent unintended changes.

I think you're imagining something far more sophisticated than what I'm actually suggesting. I also think you're setting a higher bar for agents to clear than what's actually required in practice.

Tests don't need to catch every issue; agents should be expected to make some mistakes, as humans do.

> However there’s no reason to believe that these other systems will be any easier to reason about than the code itself. If we had these other methods of ensuring that observable behavior doesn’t change and they were substantially easier than reasoning about the code directly, they would be very useful for human developers as well.

There are lots of powerful static analysis tools out there that can help improve correctness and reduce the incidence of regressions. IME most human developers tend to eschew tools that are unfamiliar, have steep learning curves, or require extra effort when writing code.

> The fact that we’ve not developed something like this in 75 years of writing programs, says it’s probably not as easy as you’re making it out.

I think the cost/benefit of what I'm describing has changed. We've only had LLMs capable of reliably producing working code changes for around a year.
