top | item 39405996

Automated Unit Test Improvement Using Large Language Models at Meta

301 points| mfiguiere | 2 years ago |arxiv.org

188 comments

[+] hubraumhugo|2 years ago|reply
At a large insurance company I worked for, management set a target of 80% test coverage across our entire codebase. So people started writing stupid unit tests for getters and setters in Java DTOs to reach the goal. Of course devs also weren't allowed to change the coverage measuring rules in Sonar.

As a young dev, it taught me that focusing only on KPIs can drive behaviors that don't align with the intended goals. A few well-thought-out E2E test scenarios would probably have had a better impact on software quality.

[+] hibikir|2 years ago|reply
My favorite anecdote on the topic involved a codebase like this, handled by inexperienced programmers. I came into the team, and realized that a whole lot of careless logic could be massively simplified, so I sent a PR cutting the codebase by 20%, and still passing all the tests and meeting user requirements.

The problem is, the ugly, careless code was extremely well tested: 95% code coverage. My replacement had 100% code coverage... but by being far shorter, my PR couldn't pass the coverage gate, as total coverage went down, not up. The remaining code in the repo? A bunch of Swing UI code, the kind that is hard to test, and where the tests don't mean anything. So, facing the prospect of spending a week or two writing Swing tests, the dev lead decided it was best to keep the old code somewhere in the repo, with tests pointing at it, just never called in production.

Thousands of lines of completely dead, but very well covered code were kept in the repo to keep Sonar happy.

[+] Galanwe|2 years ago|reply
I can relate. At my first internship, a code quality tool was forced on the team by management as well. It had a "no magic numbers" rule.

The result was a header with:

    static const unsigned ONE = 1;
    static const unsigned TWO = 2;
    static const unsigned THREE = 3;
    ...
Up to some thousands.
[+] wilgertvelinga|2 years ago|reply
The solution to that is mutation tests. They force your tests to actually verify the implementation instead of just running the code to fake coverage. https://en.m.wikipedia.org/wiki/Mutation_testing Tools and frameworks exist for almost all languages. Some examples:

- stryker-mutator (C#, Typescript)

- pitest (Java)

- mutatest (Python)
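The core idea can be sketched in miniature (Python for illustration; the tools listed above generate and apply mutants to source or bytecode automatically rather than by hand):

```python
# Minimal illustration of the mutation-testing idea: a test that merely
# executes code (fake coverage) lets a mutant survive; a test with real
# assertions kills it.

def original(a, b):
    return a + b

def mutant(a, b):          # the mutation: '+' flipped to '-'
    return a - b

def weak_test(fn):
    fn(2, 2)               # 100% line coverage, zero assertions
    return True

def strong_test(fn):
    return fn(2, 3) == 5   # actually verifies behaviour

weak_survives = weak_test(original) and weak_test(mutant)          # mutant survives
strong_kills = strong_test(original) and not strong_test(mutant)  # mutant killed
```

A surviving mutant means the suite didn't actually check that behaviour, which is exactly what coverage-gaming tests look like under a mutation score.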

[+] dgan|2 years ago|reply
We have a mandatory Sonar scan, and when I was hired, my tech lead proudly showed me the "A" grade they had been awarded, and said something like "we have a high standard to maintain".

I have never seen such a poorly written application in my 6 years of experience (and I am not only talking about style; stuff was absolutely, utterly broken, while they had no clue what was wrong).

I hate Sonar with a passion. It should only ever be used to report vulnerabilities, not to tell me to rename variables or that I "should refactor this code duplication!" I already have a fucking backlog of Jira tickets; don't tell me what I am or am not supposed to do, or when I am supposed to do it.

But oh boy, managers love this stupid power burner

[+] drowsspa|2 years ago|reply
"When a metric becomes a target, it ceases to be a good metric".

A big problem is making it mandatory, with huge bureaucracy required to get around its stupidity. Just last week I was battling yet another mandatory code quality tool: it was complaining that my res.status(200).json() wasn't setting HSTS headers. I tried setting them manually; it kept complaining. app.use(helmet()); same thing. Apparently it wanted me to write the whole backend in a single file for it to stop complaining. And of course, HSTS is much more elegantly and automatically handled by the ingress or load balancer itself.

I could have spent a week or two flagging it as a false positive and explaining what HSTS is to upper management for approval. I ended up just adding a res.sendJson(data, status = 200) to the prototype of the response object. Which is obviously stupid, but working in a bureaucracy-heavy sector has made me realize how much bad software is composed of many such bad implementations combined.

[+] danielheath|2 years ago|reply
IME, the only "test coverage % rule" that I've ever seen work was "must not decrease the overall percentage of tested code". Once you get to 100%, that becomes "All code must have a test".

Various people objected to this, pointing out that 100% test coverage tells you nothing about whether the tests are any good. Our lead (wisely, IMO) responded that they were correct - 100% tells you nothing - but that _any other percentage_ does tell you something.
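The "must not decrease" rule is usually implemented as a ratchet check in CI. A minimal sketch (the function name and tolerance parameter are assumptions; a real setup would read the numbers from a coverage report and a stored baseline file):

```python
# Sketch of a coverage "ratchet": the build fails if coverage drops
# below the best value seen so far, and the baseline only ever goes up.

def ratchet(current: float, baseline: float, tolerance: float = 0.0):
    """Return (passes, new_baseline)."""
    passes = current + tolerance >= baseline
    new_baseline = max(baseline, current) if passes else baseline
    return passes, new_baseline

# A PR that raises coverage moves the bar up...
ok, bar = ratchet(current=82.5, baseline=80.0)    # -> (True, 82.5)
# ...and the next PR is measured against the new bar.
ok2, bar2 = ratchet(current=81.0, baseline=bar)   # -> (False, 82.5)
```

A small tolerance is sometimes allowed so that deleting well-covered dead code (as in the anecdote above) doesn't fail the build.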

[+] foofie|2 years ago|reply
> So people started writing stupid unit tests for getters and setters in Java DTOs to reach the goal.

To me that reads like your team fucked up at a very fundamental level, as they both failed to take into account the whole point of automated tests and also everyone failed to flag those nonsense tests as a critical blocker for the PR.

Unless your getters and setters are dead code, they are already exercised by any test covering the happy path. Also, an 80% coverage target leaves plenty of headroom to skip stupid getter/setter tests.

A team that pulls this sort of stunt is a team that has opted to develop defensive tricks to preserve their incompetence instead of working on having in place something that actually benefits them.

[+] skissane|2 years ago|reply
> So people started writing stupid unit tests for getters and setters in Java DTOs to reach the goal

Ideally, the code coverage tool would have heuristics to detect trivial getter/setter methods, and filter them out, so adding tests for them won't improve code coverage. Non-trivial getters/setters (where there is some actual non-trivial logic involved) shouldn't be filtered, since they should be tested.

Although, there is room for debate about what counts as trivial. Obviously this is trivial:

    public void setUser(User user) {
       this.user = user;
    }
But should this count as trivial too?

    public void setUser(User user) {
       this.user = Objects.requireNonNull(user);
    }
Probably. What about this?

    public void setOwners(List<User> owners) {
       this.owners = List.copyOf(owners);
    }
Probably that too. Which suggests, maybe, there ought to be a configurable list of methods, whose presence is ignored when determining whether a getter/setter is trivial or not.
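That heuristic could be sketched roughly as follows (a toy Python regex over Java setter bodies; the wrapper allow-list and function name are hypothetical, and a real coverage tool would work on the AST or bytecode rather than text):

```python
import re

# Hypothetical heuristic (not a real Sonar/JaCoCo feature): a setter is
# "trivial", and so excludable from coverage, if its body is a single
# `this.x = x;` assignment, optionally wrapped in one of a configurable
# list of value-preserving methods.
TRIVIAL_WRAPPERS = ["Objects.requireNonNull", "List.copyOf"]

def is_trivial_setter(body: str) -> bool:
    body = body.strip().rstrip(";").strip()
    m = re.fullmatch(r"this\.(\w+)\s*=\s*(.+)", body)
    if m is None:
        return False
    field, rhs = m.group(1), m.group(2)
    if rhs == field:  # this.user = user
        return True
    # this.user = Objects.requireNonNull(user), this.owners = List.copyOf(owners)
    return any(rhs == f"{w}({field})" for w in TRIVIAL_WRAPPERS)
```

Anything that transforms the value beyond the allow-listed wrappers would be classified as non-trivial and still count toward coverage.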
[+] jes|2 years ago|reply
> At a large insurance company I worked for, management set a target of 80% test coverage across our entire codebase. So people started writing stupid unit tests for getters and setters in Java DTOs to reach the goal.

I attended many TOC conferences in the 90s and early 2000s. Eli Goldratt was famous for saying "Tell me how you'll measure me, and I'll tell you how I will behave."

[+] raverbashing|2 years ago|reply
Java is the only place where (non-automatic, non-syntax-sugared) getters and setters are thought of as important and valuable.

It only confirms my view that the language is deficient.

[+] seattle_spring|2 years ago|reply
> As a young dev, it taught me that focusing only on KPIs can sometimes drive behaviors that don't align with the intended goals

Something I've learned along the way as well. A few times in my career I have ended up working under a manager who insisted that only work "that can be explicitly measured" be performed. That meant they forbade library upgrades, refactors, things like that, because you couldn't really prove an improvement in customer metrics or an immediate change in eng productivity.

I've also been at companies that follow that mantra more broadly and apply it to eng performance reviews. The entire culture turns into engineers focusing on either short term gains without regard for long term impact, or gaming metrics for meaningless changes and making them look important and impactful.

Important but thankless work gets left behind because, again, it's work that is not "immediately measurable." The end result is a bunch of features customers hate (e.g. dark patterns), and a rickety codebase that everyone is disincentivised to fix or improve.

[+] salawat|2 years ago|reply
And you'd be dead wrong.

Career QA here. E2E tests are the absolute highest level of test: they take the most time to implement, have the most dependencies, and tell you the least about what is actually wrong.

If you think finding what broke is painful even with full suites of unit/integration tests underneath the E2E suite, try throwing those out or stop maintaining them. Let me know how it goes.

[+] kevin_nisbet|2 years ago|reply
Yea, I've seen this get carried away even by just individual team members.

My personal favorite was a test that a team member introduced for an object that held all the default runtime parameters. Think how long a timeout should be by default, before being overridden by instance-specific settings. The test checked that each value was the same as the value it was configured to be.

So if I wanted to update or add a default, I had to write it in two places, the actual default value, and the unit test that checked if the defaults were the same as the test required.

[+] nitwit005|2 years ago|reply
I'm convinced high coverage targets cause people to write worse code.

With Java at least, it seems to drive people to do things like not catch exceptions, because it's hard to inject errors so that all the catch blocks are hit.

Also makes people not want to switch to use record classes, as that feature removes class boilerplate that is easy to cover.

[+] dexwiz|2 years ago|reply
I’ve started getting pinged for this at my current job. I think it’s time to move on.
[+] oneshtein|2 years ago|reply
Is it hard to write ONE test case for ALL getters and setters using reflection?
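Not especially; here is the idea sketched in Python rather than Java reflection (the Dto class and the get_/set_ naming convention are illustrative):

```python
# One generic round-trip test for every getter/setter pair, discovered
# by introspection (a Python stand-in for Java reflection).

class Dto:
    def set_name(self, v): self._name = v
    def get_name(self): return self._name
    def set_age(self, v): self._age = v
    def get_age(self): return self._age

def roundtrip_all_properties(obj) -> int:
    """Push a unique sentinel through every set_X and read it back via get_X."""
    checked = 0
    for attr in dir(obj):
        if not attr.startswith("set_"):
            continue
        getter = getattr(obj, "get_" + attr[4:], None)
        if getter is None:
            continue
        sentinel = object()
        getattr(obj, attr)(sentinel)
        assert getter() is sentinel, f"{attr} round-trip failed"
        checked += 1
    return checked
```

One such test covers every trivial accessor without anyone hand-writing per-field assertions.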
[+] awwaiid|2 years ago|reply
True.

However we might think about this differently if we flipped it to "our standard is 20% completely untested".

Uncoverage communicates the value much better.

[+] xLaszlo|2 years ago|reply
This is called "Goodhart's Law"
[+] dclowd9901|2 years ago|reply
What if…

They knew that people would write coverage tests for getters and setters, and calculated that eventuality into their minimums.

[+] dataviz1000|2 years ago|reply
At least they didn’t do what IBM did, write tests and pay coding farms to write the code to satisfy the unit tests.
[+] farhanhubble|2 years ago|reply
True of any metric when it becomes the goal in itself.
[+] tivert|2 years ago|reply
> 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage.

The problem I have with LLM generated tests is that it seems highly likely that they'd "ratify" buggy behavior, and I'd think that'd be especially likely if the code-base already had low test coverage. One of the nice things about writing new tests by hand is you've got someone who can judge if it's the system being stupid or if it's the test.

At a minimum they should be segregated in a special test folder, so they can be treated with an appropriate level of suspicion.

[+] ithkuil|2 years ago|reply
Writing tests is indeed a great opportunity for finding bugs.

But a codebase with good test coverage allows you to safely perform large-scale refactorings without regressions, and that's a useful property even if you have bugs and the refactoring preserves them faithfully.

The risk of using a tool that generates tests designed to encode the current behaviour is that you may be lulled into a false sense of safety, while all you've done is encode the current behaviour, as advertised.

Perhaps this problem could be solved just by not calling these tests "tests" but something like "behavioural snapshots" (I can't think of a better name, but the idea is that they were never meant to encode the correct behaviour, just the current behaviour).
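A "behavioural snapshot" along those lines is usually called a characterization test. A minimal sketch (legacy_pricing and the JSON snapshot format are illustrative assumptions):

```python
import json
import pathlib

# Characterization test: record the current outputs for fixed inputs on
# the first run, then fail on any drift afterwards. Note it pins the
# CURRENT behaviour, bugs included, not the correct behaviour.

def legacy_pricing(qty: int) -> float:
    return round(qty * 9.99 * (0.9 if qty >= 10 else 1.0), 2)

def check_snapshot(fn, inputs, path: pathlib.Path) -> str:
    observed = {str(i): fn(i) for i in inputs}
    if not path.exists():              # first run: record, don't judge
        path.write_text(json.dumps(observed))
        return "recorded"
    expected = json.loads(path.read_text())
    return "match" if observed == expected else "drift"
```

"Drift" then means "behaviour changed, go look", which is exactly the safety net wanted for refactoring, without claiming correctness.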

[+] planetjones|2 years ago|reply
From reading the PDF it seems that this ‘merely’ generates tests that will repeatedly pass i.e. that are not flaky. The main purpose is to create a regression test suite by having tests that pin the behaviour of existing code. This isn’t a replacement for developer written tests, which one would hope come with the knowledge of what the functional requirement is.

Almost 20 years ago the company I worked for trialled AgitarOne - its promise was automagically generating test cases for Java code that help explore its behaviour. But also Agitar could create passing tests more or less automatically, which you could then use as a regression suite. Personally I never liked it, as it just led to too much stuff and it was something management didn’t really understand - to them if the test coverage had gone up then the quality must have too. I wonder how much better the LLM approach FB talk about here is compared to that though…

http://www.agitar.com/solutions/products/agitarone.html

[+] TheChaplain|2 years ago|reply
In my experience writing tests is generally an outstanding method to determine code quality.

If a test is complicated or coverage is hard to achieve, it's likely your tested code needs improvement.

[+] bbor|2 years ago|reply

  In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers.
…is that a good rate? I guess I have to read more and see if the unacceptable ones were silly mistakes like the ones that make us all do code review, or serious ones. I don’t think a human engineer with 25% failure rate would be very helpful, if it’s a certain kind of failure.

  As part of our overall mission to automate unit test generation for Android code, we have developed an automated test class improver, TestGen-LLM.
Is that a good mission? I feel like the TDD people are turning over in their graves, or at least in their beds at home. But again something tells me that they caveat this later
[+] shardullavekar|2 years ago|reply
At unlogged.io, for some time our primary focus was to auto-generate JUnit tests. The approach didn't take off for a few reasons:

1. A lot of generated test code that no devs wanted to maintain.

2. The generated tests didn't simulate real-world scenarios.

3. Code coverage was a vanity metric. Devs worked around it to reach their goals with scenarios that didn't matter.

We are currently working on offering no-code replay tests that simulate all unique production scenarios and developers can replay locally while mocking external dependencies.

Disclaimer: I am a founder at unlogged.io

[+] regularfry|2 years ago|reply
I want to go the other way. Let me feed acceptance criteria in, have it generate tests that check them, and only then generate code that passes the tests.

You can get close to this with Copilot, sometimes, in a fairly limited way, but why do I feel like nobody is focusing on doing it that way round?

[+] curtis3389|2 years ago|reply
TestGen-LLM is such a strange creation. I can see how it could be used as a first step in a refactoring or rewrite, but the emphasis on code coverage in the paper seems totally brain-broke. I suppose it'd be great if your org is already brain-broke and demanding high coverage, but TestGen-LLM won't make your project's code better in any way, and it'll increase the friction involved in actually implementing improvements. It'd be much more useful to generate edge-case tests that may or may not be passing, but TestGen-LLM relies on compiler errors and failing tests to filter out LLM garbage. The lack of any examples of the generated tests in the paper makes me suspect that they're the same as the rest of the LLM-generated code I have seen: amateurish.
[+] bilekas|2 years ago|reply
I'll admit it's interesting: a 12-page paper by Meta employees promoting AI for developers. They even brought out the Sankey diagram.

I'm probably wrong, but if it's published this way, shouldn't enough information be given to reproduce it?

Edit: This is not tinfoil hat; I just don't have the kind of data that Meta has to learn from. So, maybe they released something?

[+] refulgentis|2 years ago|reply
If it's anything like Google, it's way too intimately tied to their infra and monorepo to release
[+] seanmcdirmid|2 years ago|reply
It’s an FSE 2024 paper, so I’m guessing the artifacts need theory or formal evaluation.
[+] tdiff|2 years ago|reply
I wonder what would be the cost of maintaining some huge auto-generated corpus of tests in the future. They need to provide some automated way not only to generate cases, but also to update them.
[+] pearjuice|2 years ago|reply
So write unit tests automatically, change code later and then regenerate the unit tests? Now the code has a bug but the unit tests pass. I'm already seeing this today with devs using ChatGPT to quickly get the "test boilerplate" over and over.
[+] newzisforsukas|2 years ago|reply
> In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage.

That doesn't seem great?

[+] Roritharr|2 years ago|reply
I am currently looking into building something like this for a client with large (several > 3M LoCs) and old (started in 2001) Java Projects with low coverage.

Interesting to read how it would work if you already have good coverage (I assume).

[+] oslac|2 years ago|reply
Is there still no type-theoretic answer to unit testing? Doesn't the type or the class generally contain all the necessary information to unit test itself, assuming it's a unit? That is, we shouldn't even have to write these, "theoretically". Just hit "compiler --unit_test <type>".
[+] afro88|2 years ago|reply
What you're describing is more or less fuzzing [1], at the unit level. I can't remember the names, but there are tools around that work like this at runtime (ie you define a test that executes functions from the test library that run tests based on input/output types and other user defined constraints).

There's almost always more business logic to what a unit should do than it's types though. Depending on the language, the type system can only encode so much of that logic.

Consider the opposite: can't the compiler generate implementations from types and interfaces? In most cases, no. LLMs are filling some of that gap though because they can use some surrounding context to return the high probability implementation (completion) from the interface or type definition.

[1] https://en.m.wikipedia.org/wiki/Fuzzing
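A hand-rolled sketch of that idea (Hypothesis and similar property-based tools do this far more thoroughly, with counterexample shrinking; the generator table, function, and property here are illustrative):

```python
import random

# Type-driven property testing in miniature: inputs are generated from
# a type-to-generator table, and a property is checked across many
# random inputs instead of a few hand-picked examples.
GENERATORS = {int: lambda rng: rng.randint(-10**6, 10**6)}

def sort_then_dedupe(xs: list) -> list:
    return sorted(set(xs))

def check_property(fn, prop, trials=200, seed=0):
    """Return a counterexample input, or None if the property held."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [GENERATORS[int](rng) for _ in range(rng.randint(0, 20))]
        if not prop(fn(xs), xs):
            return xs  # like a fuzzer's crash input
    return None

# Property: the output is sorted and holds exactly the distinct inputs.
failure = check_property(
    sort_then_dedupe,
    lambda out, xs: out == sorted(out) and set(out) == set(xs),
)
```

The types tell the harness what to generate; the property is the part the types can't express, which is the gap the parent comment points out.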

[+] Lio|2 years ago|reply
It’s papers like this that will act as the justification of the next round of FAANG lAIoffs. Regardless of how successful this approach is in the long term.
[+] emmender2|2 years ago|reply
Happy to see such a huge team collaborating on a project at any company.

Perhaps because it involves LLMs, and LLMs are hot, and everyone wants a piece of it.

[+] MeteorMarc|2 years ago|reply
So, assign to: LLM when lots of tests are broken after the next refactoring!
[+] westurner|2 years ago|reply
"Automated Unit Test Improvement using Large Language Models at Meta" (2024) https://arxiv.org/abs/2402.09171 :

> This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. [...] We believe this is the first report on industrial scale deployment of LLM-generated code backed by such assurances of code improvement.

Coverage-guided unit test improvement might [with LLMs] be efficient too.

https://github.com/topics/coverage-guided-fuzzing :

- e.g. Google/syzkaller is a coverage-guided syscall fuzzer: https://github.com/google/syzkaller

- Gitlab CI supports coverage-guided fuzzing: https://docs.gitlab.com/ee/user/application_security/coverag...

- oss-fuzz, osv

Additional ways to improve tests:

Hypothesis and pynguin generate tests from type annotations.

There are various tools to generate type annotations for Python code;

> pytype (Google) [1], PyAnnotate (Dropbox) [2], and MonkeyType (Instagram) [3] all do dynamic / runtime PEP-484 type annotation type inference [4] to generate type annotations. https://news.ycombinator.com/item?id=39139198

icontract-hypothesis generates tests from icontract DbC Design by Contract type, value, and invariance constraints specified as precondition and postcondition @decorators: https://github.com/mristin/icontract-hypothesis

Nagini and deal-solver attempt to Formally Verify Python code with or without unit tests: https://news.ycombinator.com/item?id=39139198

Additional research:

"Fuzz target generation using LLMs" (2023) https://google.github.io/oss-fuzz/research/llms/target_gener... https://security.googleblog.com/2023/08/ai-powered-fuzzing-b... https://hn.algolia.com/?q=AI-Powered+Fuzzing%3A+Breaking+the...

OSSF//fuzz-introspector//doc/Features.md: https://github.com/ossf/fuzz-introspector/blob/main/doc/Feat...

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C43&q=Fuz... :

- "Large Language Models Based Fuzzing Techniques: A Survey" (2024) https://arxiv.org/abs/2402.00350 : > This survey provides a systematic overview of the approaches that fuse LLMs and fuzzing tests for software testing. In this paper, a statistical analysis and discussion of the literature in three areas, namely LLMs, fuzzing test, and fuzzing test generated based on LLMs, are conducted by summarising the state-of-the-art methods up until 2024