Using LLMs to enhance our testing practices

renegade-otter|1 year ago

In every single system I have worked on, tests were not just tests - they were their own parallel application, and it required careful architecture and constant refactoring in order for it to not get out of hand.

"More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code. Sometimes I spend more time on the test code than the actual code (probably normal).

Also, I feel like people would be inclined to go with whatever the LLM gives them, as opposed to really sitting down and thinking about all the unhappy paths and edge cases of UX. Using an autocomplete to "bang it out" seems foolish.

jeswin|1 year ago

> Using an autocomplete to "bang it out" seems foolish.

Based on my own experience, I find the widespread scepticism on HN about AI-assisted coding misplaced. There will be corner cases, there will be errors, and there will be bugs. There will also be apps for which AI is not helpful at all. But that's fine - nobody is saying otherwise. The question is only about whether it is a _significant_ nett saving on the time spent across various project types. The answer to that is a resounding Yes.

The entire set of tests for a web framework I wrote recently were generated with Claude and GPT. You can see them here: https://github.com/webjsx/webjsx/tree/main/src/test

On an average, these tests are better than tests I would have written myself. The project was written mostly by AI as well, like most other stuff I've written since GPT4 came out.

"Using an autocomplete to bang it out" is exactly what one should do - in most cases.

swatcoder|1 year ago

Fully agreed.

It's bad enough when human team members are submitting useless, brittle tests with their PR's just to satisfy some org pressure to write them. The lazy ones provide a false sense of security even though they neglect critical scenarios, the unstable ones undermine trust in the test output because they intermittently raise false negatives that nobody has time to debug, and the pointless ones do nothing but reify architecture so it becomes too laborious to refactor anything.

As contextually aware generators, there are doubtless good uses for LLM's in test developement, but (as with many other domains) they threaten to amplify an already troubling problem with low-quality, high-volume content spam.

BeetleB|1 year ago

Mostly agree.

My first thought when I read this post was: Is his goal to test the code, or validate the features?

The first problem is he's providing the code, and asking for tests. If his code has a bug, the tests will enshrine those bugs. It's like me writing some code, and then giving it to a junior colleague, not providing any context, and saying "Hey, write some tests for this."

This is backwards. I'm not a TDD guy, but you should think of your test cases independent of your code.

danmaz74|1 year ago

I subscribe to the concept of the "pyramid of tests" - lots of simpler unit tests, fewer integration tests, and very few end-to-end tests. I find that using LLMs to write unit tests is very useful. If I just wrote code which has good naming both for the classes, methods and variables, useful comments where necessary and if I already have other tests which the LLMs can use as examples for how I test things, I usually just need to read the created tests and sometimes add some test cases, just writing the "it should 'this and that'" part for cases which weren't covered.

An added bonus is that if the tests aren't what you expect, often it helps you understand that the code isn't as clear as it should be.

viraptor|1 year ago

Pretty much this and I prefer the opposite. "Here's the new test case from me, make the code pass it" is a decent workflow with Aider.

I get that occasionally there are some really trivial but important tests that take time and would be nice to automate. But that's a minority in my experience.

skissane|1 year ago

> "More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code.

Are there ways we can measure this?

One idea that I’ve had, is collect code coverage separately for each test. If a test isn’t covering any unique code or branches, maybe it is superfluous - although not necessarily, it can make sense to separately test all the boundary conditions of a function, even if doing so doesn’t hit any unique branches.

Maybe prefer a smaller test which covers the same code to a bigger one. However, sometimes if a test is very DRY, it can be more brittle, since it can be non-obvious how to update it to handle a code change. A repetitive test, updating it can be laborious, but at least reasonably obvious how to do so.

Could an LLM evaluate test quality, if you give it a prompt containing some expert advice on good and bad testing practices?

nrnrjrjrj|1 year ago

There is an art to writing tests especially getting absraction levels right. For example do you integration test hitting the password field with 1000 cases or do that as a unit test, and does doing it as a unit test sufficiently cover this.

AI could do all this thinking in the future but not yet I believe!

Let alone the codebase is likely a mess of bad practice already (never seen one that isn't! That is life) so often part of the job is leaving the campground a bit better than how you found it.

LLMs can help now on last mile stuff. Fill in this one test. Generate data for 100 test cases. Etc.

dngit|1 year ago

Great point on focusing on high-impact tests. I agree that LLMs risk giving a false sense of coverage. Maybe a smart strategy is generating boilerplate tests while we focus on custom edge cases.

bryanrasmussen|1 year ago

>Sometimes I spend more time on the test code than the actual code (probably normal).

This seems like the kind of thing that should be highly dependent on the kind of project one is doing, if you have an MVP and your test code is taking longer than the actual code then it is clear the test code is antagonistic to the whole concept of an MVP.

aoeusnth1|1 year ago

Detecting regressions is the goal. If LLMs can do that for free to cheap, that’s good. It doesn’t have to be complicated.

idoco|1 year ago

Totally agree, especially about the need for well-architected, high-impact tests that go beyond just coverage. At Loadmill, we found out pretty early that building AI to generate tests was just the starting point. The real challenge came with making the system flexible enough to handle complex customer architectures. Think of multiple test environments, unique authentication setups, and dynamic data preparation.

There’s a huge difference between using an LLM to crank out test code and having a product that can actually support complex, evolving setups long-term. A lot of tools work great in demos but don’t hold up for these real-world testing needs.

And yeah, this is even trickier for higher-level tests. Without careful design, it’s way too easy to end up with “dumb” tests that add little real value.

mastersummoner|1 year ago

I actually tested Claude Sonnet to see how it would fare at writing a test suite for a background worker. My previous experience was with some version of GPT via Copilot, and it was... not good.

I was, however, extremely impressed with Claude this time around. Not only did it do a great job off the bat, but it taught me some techniques and tricks available in the language/framework (Ruby, Rspec) which I wasn't familiar with.

I'm certain that it helped having a decent prompt, asking it to consider all the potential user paths and edge cases, and also having a very good understanding of the code myself. Still, this was the first time for me I could honestly say that an LLM actually saved me time as a developer.

shadowmanifold|1 year ago

This latest update to Sonnet is super impressive.

We are really already past the point of being able to discuss these matters though in large groups.

The herd speaks as if all LLMs on all programming languages are basically the same.

It is an absurdity. Talking to the herd is mostly for entertainment at this point. If I actually want to learn something, I will ask Sonnet.

throwa5456435|1 year ago

All this makes me think making software engineers redundant is really the "killer app" of LLM's. This is where the AI labs are spending most of the effort - its the best marketing after all for their product - fear sells better than greed (loss aversion) making engineers notice and unable to dismiss it.

Despite some of the comments on this thread, despite it not wanting to be true, I must admit LLM's are impressive. Software engineers and ML specialists have finally invented the thing which disrupts their own jobs substantially either via large reduction in hours and/or reduction in staff. As the hours a software engineer spends coding diminishes by large factors so too especially in this economy will hours spent required paying an engineer will fall up to the point where anyone can create code and learn from an LLM as you have just done. Once everybody is special, no one is and fundamentally employment, and value of things created from software, comes from scarcity just like everything else in our current system.

I think there's probably only a few years left where software engineers are around - or at least seen as a large part of an organization with large teams, etc. Yes AI software will have bugs, and yes it won't be perfect but you can get away with just one or two for a whole org to fix the odd blip of an LLM. It feels like people are picking on minor things at this point, which while true, for a business those costs are "meh" while the gains of removing engineers are substantial.

I want to be wrong; but every time I see someone "learning from LLM's", saving lots of time doing stuff, saving 100's of hours, etc I think its only 2-3 years in and already its come this far.

mkleczek|1 year ago

I am very sceptical of LLM (or any AI) code generation usefulness and it does not really have anything to do with AI itself.

In the past I've been involved in several projects deeply using MDA (Model Driven Architecture) techniques which used various code generation methods to develop software. One of the main obstacles was the problem of maintaining the generated code.

IOW: how should we treat generated code?

If we treat it in the same way as code produced by humans (ie. we maintain it) then the maintenance cost grows (super-linearly) with the amount of code we generate. To make matters worse for LLM: since the code it generates is buggy it means we have more buggy code to maintain. Code review is not the answer because code review power in finding bugs is very weak.

This is unlike compilers (that also generate code) because we don't maintain code generated by compilers - we regenerate it anytime we need.

The fundamental issue is: for a given set of requirements the goal is to produce less code, not more. _Any_ code generation (however smart it might be) goes against this goal.

EDIT: typos

mvdtnz|1 year ago

You should NEVER modify generated code. All of our generated code is pretended with a big comment that says "GENERATED CODE DO NOT MODIFY. This code could be regenerated at any time and any changes will be lost."

If you need to change behaviour of generated code you need to change your generator to provide the right hooks.

Obviously none of this applies to "AI" generated code because the "AI" generator is not deterministic and will hallucinate different bugs from run to run. You must treat "AI" generated code as if it was written by the dumbest person you've ever worked with.

smokel|1 year ago

I agree. Adding unit tests without a good reason comes at a cost.

Refactoring is harder, especially if it's not clear why a test is in place. I've seen many developers disable tests simply because they could not understand how, or why, to fix them.

I'm hopeful that LLMs can provide guidance in removing useless tests or simplifying things. In an ideal future they may even help in formulating requirements or design documentation.

DeathArrow|1 year ago

It's hard to generate tests for typical C# code. Or for any context where you have external dependencies.

If you have injected services in your current service, the LLM doesn't know anything about those so it makes poor guesses. You have to bring those in context, so they can be mocked properly.

You end up spending a lot of time guiding the LLM, so it's not measurably faster than writing test by hand.

I want my prompt to be: "write unit tests for XYZ method" without having to accurately describe it the prompt what the method does, how it does it and why it does it. Writing too many details in the prompt takes the same time as writing the code myself.

Github Copilot should be better since it's supposed to have access to you entire code base. But somehow it doesn't look at dependencies and it just uses the knowledge of the codebase for stylistic purposes.

It's probably my fault, there are for sure better ways to use LLMs for code, but I am probably not the only one who struggles.

nazgul17|1 year ago

Should we not, instead, write tests ourselves and have LLMs write the code to make them pass?

jayd16|1 year ago

Just ask it to do both.

satisfice|1 year ago

Like nearly all the articles about AI doing "testing" or any other skilled activity, the last part of it admits that it is an unreliable method. What I don't see in this article-- which I suspect is because they haven't done any-- is any description of a competent and reasonably complete testing process of this method of writing "tests." What they probably did is to try this, feel good about it (because testing is not their passion, so they are easily impressed), and then mark it off in their minds as a solved problem.

The retort by AI fanboys is always "humans are unreliable, too." Yes, they are. But they have other important qualities: accountability, humility, legibility, and the ability to learn experientially as well as conceptually.

LLM's are good at instantiating typical or normal patterns (based on its training data). Skilled testing cannot be limited to typicality, although that's a start. What I'd say is that this is an interesting idea that has an important hazard associated with it: complacency on the part of the developer who uses this method, which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

johnjwang|1 year ago

Author here: Yes, there are certain functions where writing good tests will be difficult for an LLM, but in my experience I've found that the majority of functions that I write don't need anything out of the ordinary and are relatively straightforward.

Using LLMs allows us to have much higher coverage than if we didn't use it. To me and our engineering team, this is a pretty good thing because in the time prioritization matrix, if I can get a higher quality code base with higher test coverage with minimal extra work, I will definitely take it (and in fact it's something I encourage our engineering teams to do).

Most of the base tests that we use were created originally by some of our best engineers. The patterns they developed are used throughout our code base and LLMs can take these and make our code very consistent, which I also view as a plus.

re: Complacency: We actually haven't found this to be the case. In fact, we've seen more tests being written with this method. Just think about how much easier it is to review a PR and make edits vs write a PR. You can actually spend your time enforcing higher quality tests because you don't have to do most of the boilerplate for writing a test.

wenc|1 year ago

I do use LLMs to bootsrap my unit testing (because there is a lot boilerplate in unit tests and mocks), but I tend to finish the unit tests myself. This gives me confidence that my tests are accurate to the best of my knowledge.

Having good tests allows me to be more liberal with LLMs on implementation. I still only use LLMs to bootstrap the implementation, and I finish it myself. LLMs, being generative, are really good for ideating different implementations (it proposes implementations that I would never have thought of), but I never take any implementation as-is -- I always try to step through it and finish it off manually.

Some might argue that it'd be faster if I wrote the entire thing myself, but it depends on the problem domain. So much of what I do is involve implementing code for unsolved problems (I'm not writing CRUD apps for instance) that I really do get a speed-up from LLMs.

I imagine folks writing conventional code might spend more time fixing LLM mistakes and thus think that LLMs slow them down. But this is not true for my problem domain.

simonw|1 year ago

The answer to this is code review. If an LLM writes code for you - be it implementation or tests - you review it before you land it.

If you don't understand how the code works, don't approve it.

Sure, complacent developers will get burned. They'll find plenty of other non-AI ways to burn themselves too.

tsv_|1 year ago

Each time a new LLM version comes out, I give it another try at generating tests. However, even with the latest models, tailored GPTs, and well-crafted prompts with code examples, the same issues keep surfacing:

- The models often create several tests within the same equivalence class, which barely expands test coverage

- They either skip parameterization, creating multiple redundant tests, or go overboard with 5+ parameters that make tests hard to read and maintain

- The model seems focused on "writing a test at any cost" often resorting to excessive mocking or monkey-patching without much thought

- The models don’t leverage existing helper functions or classes in the project, requiring me to upload the whole project context each time or customize GPTs for every individual project

Given these limitations, I primarily use LLMs for refactoring tests where IDE isn’t as efficient:

- Extracting repetitive code in tests into helpers or fixtures

- Merging multiple tests into a single parameterized test

- Breaking up overly complex parameterized tests for readability

- Renaming tests to maintain a consistent style across a module, without getting stuck on names

deeviant|1 year ago

All of the points you raise I find common in human written tests.

iambateman|1 year ago

I did this for Laravel a few months ago and it’s great. It’s basically the same as the article describes, and it has definitely increased the number of tests I write.

Happy to open source if anyone is interested.

frays|1 year ago

I'd certainly be interested to read more about your experience!

gengstrand|1 year ago

I went with a more clinical approach and used models that were available a half year ago but I also was interested in using LLMs to write unit tests. You can learn the details of that experiment at https://www.infoq.com/articles/llm-productivity-experiment/ but the net of what I found was that LLMs improve developer productivity in the form of unit test creation but only marginally. Perhaps I find myself a bit skeptical on the claims from that Assembled blog on significant improvement.

unknown|1 year ago

[deleted]

unknown|1 year ago

[deleted]

simonw|1 year ago

If you add "white-space: pre-wrap" to the elements containing those prompt examples you'll avoid the horizontal scrollbar (which I'm getting even on desktop) and make them easier to read.

johnjwang|1 year ago

Thanks for the suggestion -- I'll take a look into adding this!

apwell23|1 year ago

i would love to used to use it change code in ways that compiles and see if test fails. Coverage metric sometimes doesn't really tell you if some piece of code is covered or not.

sesm|1 year ago

Coverage metric can tell if lines of code were executed, but they can't tell if execution result was checked.

taberiand|1 year ago

I believe that's called mutation testing. Using an LLM to perform the mutation sounds like a great idea

dfilppi|1 year ago

[deleted]

unknown|1 year ago

[deleted]

77 comments