But are those tests relevant? I tried using LLMs to write tests at work and whenever I review them I end up asking it “Ok great, passes the test, but is the test relevant? Does it test anything useful?” And I get a “Oh yeah, you’re right, this test is pointless”
manmal|2 months ago
gaigalas|2 months ago
If you leave an agent for hours trying to increase coverage percentage without further guiding instructions, you will end up with lots of garbage.
To avoid that, you need several distinct loops: one that creates tests (there will be garbage), one that consolidates redundant tests, one that parametrizes repetitive tests, and so on.
Agents create redundant tests for all sorts of reasons. Maybe they're trying to hit a hard-to-reach line and leave several attempts behind. Or maybe they "get creative" and guess at what is uncovered instead of actually following the coverage report, etc.
Less capable models are actually better at doing this. They're faster, don't "get creative" with weird ideas mid-task, and cost less. Just make them work one test at a time: spawn, write one test that verifiably increases overall coverage, exit. Once you reach a threshold, start the consolidation loop: pick a redundant pair of tests, consolidate, exit. And so on...
Of course, you can use a powerful model and babysit it as well. A few disambiguating questions and interruptions will guide it well. If you want truly unattended operation, though, it's damn hard to get stable results.
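The spawn/verify/exit loop and the consolidation pass described above can be sketched roughly like this. This is a minimal sketch, not anyone's actual harness: `generate_test` stands in for the LLM call, and `measure_coverage` is a stub where a real setup would shell out to a coverage tool (e.g. coverage.py) and parse its report. Tests are modeled as dicts with a `covers` set of line numbers, purely for illustration.

```python
def measure_coverage(tests):
    # Stub: union of lines each test claims to cover. A real harness
    # would run the suite under a coverage tool and read the report.
    covered = set()
    for t in tests:
        covered |= t["covers"]
    return covered

def accept_if_coverage_grows(tests, candidate):
    """The per-spawn gate: keep the candidate test only if it
    verifiably increases overall coverage, otherwise discard it
    so redundant attempts don't accumulate as garbage."""
    before = measure_coverage(tests)
    after = measure_coverage(tests + [candidate])
    if after > before:  # strict superset: new lines reached
        tests.append(candidate)
        return True
    return False

def consolidate(tests):
    """The consolidation loop's end state: drop any test whose
    covered lines are already covered by the tests kept so far."""
    kept, covered = [], set()
    # Greedy: consider broader tests first so subsets get dropped.
    for t in sorted(tests, key=lambda t: len(t["covers"]), reverse=True):
        if not t["covers"] <= covered:
            kept.append(t)
            covered |= t["covers"]
    return kept
```

The key design point is that the agent never judges its own work: acceptance is decided by the coverage delta, which is mechanically checkable, so even a weak model working one test per spawn can't regress the suite.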
jackschultz|2 months ago
People see "LLMs" and "tons of tests" in the same sentence and think it shows how models love writing pointless tests, rather than realizing the tests are standard, human-written tests that show the model wrote code validated by an already trusted source.
It shows that writing comments humans will read with the right context is _very_ similar to how we need to interact with LLMs. And if we fail to communicate with humans, clearly we're going to fail with models.
wahnfrieden|2 months ago
Skill issue... And perhaps the wrong model + harness