item 26024915

Mutation Driven Testing: When TDD Just Isn’t Good Enough

125 points | ingve | 5 years ago | software.rajivprab.com

67 comments

[+] nullc|5 years ago|reply
I've deployed mutation testing extensively in libsecp256k1 for the past five years or so, to good ends.

It's turned up some testing inadequacies here and there, and even a substantial performance improvement ( https://twitter.com/pwuille/status/1348835954396516353 ). I don't believe it's yet caught an outright bug there, but I do feel a lot more confident in the tests as a result.

I've also deployed it to a lesser degree in the Bitcoin codebase and turned up some minor bugs as a result of the tests being improved to pass mutation testing.

The biggest challenge I've seen in most parties' use of mutation testing is that you must begin with 100% branch coverage of the code you might mutate, and very few ordinary pieces of software reach that level of coverage.

The next issue is that in C/C++ there really aren't any great tools that I'm aware of-- so every effort needs to be homebrewed.

My process is to have a harness script that:

1. Makes a modification (e.g. a Python script that does small search-and-replace substitutions one at a time, line by line, or just doing it by hand).

2. Attempts to compile the code (if it fails, moves on to the next change).

3. Compares the hash of the optimized binary to a collection of already-tested hashes and moves on to the next if it's already been seen.

4. Runs the tests and, if they pass, saves off the diff.

5. Goes to 1.

Then I go back through the diffs and toss the ones that obviously have no meaningful effect, and lob the remaining diffs over to other contributors to figure out whether they're false positives or to improve the tests.
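The steps above, compressed into a toy but runnable sketch (Python standing in for the C workflow; the mutation operators and the code under test are made up, not my real set):

```python
import hashlib

# Step 1: a few one-token mutation operators (illustrative only).
MUTATIONS = [("<", "<="), ("+", "-"), ("==", "!=")]

def mutants(source):
    """Yield every single-site mutation of `source`, one at a time."""
    for old, new in MUTATIONS:
        idx = source.find(old)
        while idx != -1:
            yield source[:idx] + new + source[idx + len(old):]
            idx = source.find(old, idx + 1)

def run_harness(source, test_suite):
    """Return mutants that compile, are novel, and still pass the tests."""
    seen = set()
    survivors = []
    for mutated in mutants(source):
        try:
            code = compile(mutated, "<mutant>", "exec")   # step 2: does it build?
        except SyntaxError:
            continue
        digest = hashlib.sha256(mutated.encode()).hexdigest()
        if digest in seen:                                # step 3: already tested?
            continue
        seen.add(digest)
        namespace = {}
        exec(code, namespace)
        if test_suite(namespace):                         # step 4: tests still pass?
            survivors.append(mutated)                     # save off the diff
    return survivors

# Toy code under test, plus a test suite that never checks the boundary.
SOURCE = "def is_small(x):\n    return x < 10\n"

def weak_suite(ns):
    return ns["is_small"](3) is True and ns["is_small"](99) is False

# The "<" -> "<=" mutant survives: nothing ever asserts on is_small(10).
survivors = run_harness(SOURCE, weak_suite)
```

A surviving mutant like that is exactly the kind of diff that gets lobbed over to someone to either dismiss or turn into a better test.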

[+] Agentlien|5 years ago|reply
This automated approach reminds me of fuzzing.
[+] bastawhiz|5 years ago|reply
I agree with the author's philosophy, but the approach described only gives you confidence at the time the code is written/tested. If someone changes adjacent code, you can no longer assume that your manual mutation testing is still valid. At some point (either in age or size or complexity of the codebase) manual mutation testing is going to decrease in effectiveness until the ROI of doing it is hard to justify. Automation is really key.

There's lots of great tools that help with mutation testing, though they can be expensive to run (depending on how they work and how many tests you have). In a past life, I wrote my own mutation testing library which ran the tests after each mutation and generated a "reverse code coverage" report: essentially a report of which lines/functions/statements/etc. did not cause the tests to fail when mutated. Where code coverage ideally approaches 100%, reverse code coverage should be near 0%. If you take the intersection of a coverage report with a reverse coverage report, you can easily find code that is executed, but whose behavior is not checked.
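As a toy illustration of that intersection (the line numbers are invented, and real reports obviously aren't hand-written sets):

```python
# Lines the test suite executed (from an ordinary coverage report), and
# lines whose mutation did NOT fail any test (the "reverse" report).
covered = {10, 11, 12, 13, 20, 21}
unkilled = {13, 21, 30}          # 30 was never executed at all

# Executed but never actually verified: prime suspects for weak tests.
executed_but_unchecked = sorted(covered & unkilled)
```

Here lines 13 and 21 run under the tests yet can be mutated freely, which is precisely the gap neither report shows on its own.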

[+] joatmon-snoo|5 years ago|reply
Relevant: https://research.google/pubs/pub46584/

From the abstract:

> We focus on a code-review based approach and consider the effects of surfacing mutation results on developer attention. The described system is used by 6,000 engineers in Google on all code changes they author or review, affecting in total more than 14,000 code authors as part of the mandatory code review process.

[+] postalrat|5 years ago|reply
What many hardcore testing advocates don't want to accept is their tests will always be inadequate. I've found that bringing up mutation testing or bebugging tends to draw dirty looks from such people.

The last 100% test coverage advocate I mentioned it to said it would be a waste of developer effort. I assume they feel that effort would be better spent writing more tests.

[+] rook166|5 years ago|reply
For any Python users, there's a library that automates mutation testing by parsing the AST: https://github.com/EvanKepner/mutatest
[+] boxed|5 years ago|reply
There's mutmut (I'm the maintainer), cosmic-ray and mutpy too. In fact those are the established players. I have never heard of mutatest before! I will have to try it.
[+] JustFinishedBSG|5 years ago|reply
While it sounds like a “good idea” mentally, it also seems completely unrealistic and impractical.

Basically, what this is is writing tests for your tests. And because the input to those tests is functions, you need to be able to generate functions. That’s nice, but it’s a pain considering the only solution proposed is “just do it manually”, which is neither exhaustive nor trustworthy.

Also, every single one of the author’s examples is caught by an actually good testing tool like QuickCheck.

https://en.wikipedia.org/wiki/QuickCheck
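For illustration, a hand-rolled QuickCheck-style check in Python (the function under test and the property are made up; a real tool like QuickCheck or hypothesis adds proper generation strategies and shrinking):

```python
import random

def buggy_sort(xs):
    # Deliberately broken: silently drops duplicates.
    return sorted(set(xs))

def check_property(fn, trials=500, seed=0):
    """QuickCheck-style check: random inputs, assert the output is a
    sorted permutation of the input; return a counterexample or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(0, 5) for _ in range(rng.randint(0, 8))]
        out = fn(xs)
        is_sorted = out == sorted(out)
        is_permutation = sorted(out) == sorted(xs)
        if not (is_sorted and is_permutation):
            return xs  # counterexample found
    return None

# Any input containing a duplicate exposes the bug.
counterexample = check_property(buggy_sort)
```

The property ("sorted permutation of the input") does the work here; no hand-enumerated cases needed.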

[+] tom_mellior|5 years ago|reply
> Also, every single one of the author’s examples is caught by an actually good testing tool like QuickCheck.

I'd be interested to see what a sufficiently strong QuickCheck specification of this problem would look like. I've used it a bit in the past, but not enough that I could reliably get it to produce all the interesting failure modes and know the expected result for each case.

[+] bavent|5 years ago|reply
I'm a big fan of mutation testing and I've converted a few other devs I've worked with to it. I use a tool called Stryker Mutator in C# and JavaScript/TypeScript to automate the bug-injection. It adds a little bit of overhead - a small-ish TS project where our normal test suite runs in 2 minutes or so takes about 20-25 with Stryker - but it has definitely found things that our tests weren't really covering as well as they should have before.
[+] visarga|5 years ago|reply
This approach, like regular TDD, is upper bounded by the imagination of the tester. You would only catch bugs you can invent.
[+] pjmorris|5 years ago|reply
Ammann and Offutt's 'Introduction to Software Testing' [0] describes model-driven test design and criteria-based testing (e.g. input domain) as approaches to being thorough about testing.

[0] https://cs.gmu.edu/~offutt/softwaretest/

[+] henrikeh|5 years ago|reply
I don’t really get what you are saying. Are you talking about automated tools as a better option?

Edit: To elaborate a bit: programming is naturally limited by the programmer's imagination, too. But what alternative is on offer is a much more interesting question than simply disregarding something because it isn't a perfect solution.

[+] paullth|5 years ago|reply
We have a job that runs https://pitest.org/; we analyse the report and tweak the codebase as per the results. Not sure it's ever found a bug that was likely to happen in prod, but it definitely gives us a confidence boost.
[+] jghn|5 years ago|reply
We use this as well. It caught some lurking bugs when we first turned it on. Since then it catches things before merging, so it is harder to keep metrics. It did just point out a bug in a PR of mine the other day so it’s at least doing something :)
[+] Kototama|5 years ago|reply
I have also found that most TDD practitioners don't even know about generative tests (QuickCheck and similar), when these kinds of tests - when well written - can catch much more subtle bugs than unit tests. Also, there is a point where you should invest effort in monitoring rather than testing.

Testing with mutations is certainly interesting but I never had the opportunity to try it.

[+] dathinab|5 years ago|reply
> don't even know [...] QuickCheck and similars

True, but in my experience these kinds of testing tools only complement testing. They should never be used to replace properly hand-written tests, as they are probability-based, and as long as the input domain is large enough it's quite possible to miss very obvious bugs, not just in one run but repeatedly.

Though if you are under time pressure, it can be a good idea to replace writing some relatively unimportant tests with writing generative tests. Just only start doing so after you cover the most important parts with your manual tests.

It's kind of sad, but for a lot of applications, making the main features work right in their main use-case, and shipping more main features, is more important than making all features always right but having fewer of them. In the end, an imperfect but reasonably well-working product (especially if all the small demo cases work) sells better than a perfect but very constrained product.
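A contrived sketch of how a probability-based test can repeatedly miss an obvious bug that a single targeted manual test catches (Python; the function and the magic value are invented):

```python
import random

def parse_port(p):
    # Deliberate bug: exactly one value in the domain is mishandled.
    if p == 8080:
        return -1
    return p

def random_probe(trials=10_000, seed=1):
    """Generative-style probe: random draws over a large input domain."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        p = rng.randint(0, 2**32 - 1)
        if parse_port(p) != p:
            failures += 1
    return failures

# 10k draws over 2^32 values: the single bad point at 8080 is
# essentially never sampled, run after run.
missed = random_probe()

# Whereas the obvious hand-written test for a well-known port finds it.
assert parse_port(8080) == -1
```

The random probe reports zero failures despite the bug, which is exactly the "miss obvious bugs repeatedly" failure mode.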

[+] krzepah|5 years ago|reply
What I dislike about a heavy test process is that you might basically be writing b*lshit until your product owner validates the implementation.
[+] wpietri|5 years ago|reply
> On the one hand, [TDD is] too strict. Insisting on writing tests first often gets in the way of the exploratory work

Who are the proponents of TDD that promote 100% adherence even when it's not a good match for the situation? I keep coming across this claim, but it's not how I learned TDD, and I wonder if it's a straw man.

[+] the_af|5 years ago|reply
> Who are the proponents of TDD that promote 100% adherence even when it's not a good match for the situation?

While nobody proposes TDD "when it's not a good match", plenty of people overestimate the cases where they think it's a good match.

Plenty of TDD proponents believe you shouldn't write a single line of code without a test first. I've met them, and they believe it's a good match for the situation.

It's not helped by notorious failures like the infamous sudoku puzzle debacle -- a case where it was evidently not suitable, yet Jeffries went ahead and tried it anyway (and failed, predictably). The conclusion that TDD was not suitable for this kind of algorithmic exploratory development was somehow never reached...

[+] strulovich|5 years ago|reply
I dealt with this a bunch. I think it’s a natural tendency of humans to hear a new idea and consider a simpler less nuanced version of it before they fully grasp it.

As an example: I was taught about object calisthenics [1] in school. The lecturer presented a straw-man version of it, and I thought about it that way for a while. What's worse, when I later decided it was a good exercise and explained it to others, I noticed I always had to reiterate multiple times that this set of rules is meant as an exercise in exploration, not a set of hard rules for whatever piece of code you are going to write next.

TDD is just so easy to turn into a straw man, and it has so many hardcore fans that presenting the nuanced form of it becomes even harder.

[1] https://williamdurand.fr/2013/06/03/object-calisthenics/

[+] aceBacker|5 years ago|reply
Robert Martin is the dude that said the only acceptable target is 100%. Of course, he also said you shouldn't plan to actually hit that target, just get as close as possible without having to test things that don't matter, like frameworks.

Kent Beck recently went through and clarified TDD and he definitely doesn't advocate 100%. https://youtube.com/playlist?list=PLlmVY7qtgT_lkbrk9iZNizp97...

[+] dathinab|5 years ago|reply
Honestly, depending on how you interpret TDD it might be too strict, but things which I believe are never too strict and always a good idea are:

- Writing tests first.

- Have a write test => write code => loop.

But there are some people who are pedantic about how small the loop needs to be, or who insist that TDD excludes doing any planning about aspects like how you will likely structure code at a larger scale and similar (i.e. software architecture).

At the same time, there are tool/library/framework/language combinations which make testing in general really hard and can play really badly with TDD. While I believe such tools should be avoided, there are situations in which you cannot do so, and in which, due to e.g. time constraints, you are furthermore simply not able to do any proper testing, including TDD. It's kind of a nightmare situation, but it does happen.

EDIT: Yes I realized I responded to the wrong comment :(

[+] UK-Al05|5 years ago|reply
Depends what the exploratory work is.

If you know what result you want, it's easy to apply: assert on the output.

If you don't know what you want, you can't apply it, because you can't write the asserts.

So you can do exploratory implementation as long as you know the result you're looking for. But I'd also argue you want to have an idea about what you want before you start writing code. So I don't find that many places where it doesn't apply.

One of the few ones where it doesn't apply is tweaking the UI for looks.

[+] tenaciousDaniel|5 years ago|reply
I might be wrong, but I'm fairly certain I've seen Uncle Bob make statements along the lines of "there are vanishingly small scenarios in which 100% code coverage is not applicable" (heavy paraphrasing). Not arguing for or against, just making that observation. I could be misremembering.
[+] jonny_eh|5 years ago|reply
Every senior engineer who joined a startup before me. It can take a while to disabuse them of this.
[+] User23|5 years ago|reply
When teaching TDD, strict adherence is often appropriate to help learners develop good habits. But like many such rules, experts get to break it, because they have the expertise to know when it's appropriate to do so.
[+] mam2|5 years ago|reply
Ehh.. some people who wanna make it their edge in the swe community.
[+] typhonius|5 years ago|reply
I’ve used the infection PHP library (https://github.com/infection/infection) in an API SDK that I maintain.

My experiences were very similar to the author’s when I first started using it. Even though my test coverage was near 100%, the mutations introduced revealed that in large part my tests were fallible due to assumptions I’d made when writing them.

I’ve incorporated mutation testing as the final step in my CI workflow as a test for my tests. It’s a fair bit of work the first time it’s run (especially with larger libraries), but in my opinion vital as a pairing with tests.

[+] jackcviers3|5 years ago|reply
Can't you do this with expectations of failure, especially if your functions under test aren't total functions?

I really like encoding failure types into the type signature of my code (assuming types).

    def a(x: str) -> Union[str, AnErrorType]:

That way, it is clear that your code may fail, and so you must test the unhappy path as well. It limits the places where you might throw or read/write input, and it requires an effect type, but keeping functions total allows for fewer bugs and fewer places to test for horrible things happening.
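Fleshed out a bit (Python; `parse_int` and `ParseError` are made-up names for illustration), the unhappy path becomes an ordinary return value you can assert on:

```python
from typing import Union

class ParseError:
    """Failure encoded as a value, not an exception."""
    def __init__(self, message: str):
        self.message = message

def parse_int(x: str) -> Union[int, ParseError]:
    """Total function: every input maps to a value, nothing is thrown."""
    try:
        return int(x)
    except ValueError:
        return ParseError(f"not an integer: {x!r}")

# The signature forces callers (and tests) to handle both arms.
assert parse_int("42") == 42
assert isinstance(parse_int("oops"), ParseError)
```

Since failure is in the return type, a test suite that never inspects the error arm is visibly incomplete.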
[+] ChrisMarshallNY|5 years ago|reply
It’s not a bad practice; however, it suffers from the same “low-hanging fruit” issue that affects TDD. It relies on the developers being able to predict faults.

In many cases this isn't too difficult, but most bugs I encounter are ones that I never would have considered, no matter how much thought I devoted to the matter.

In my experience, there’s just no way to predict all the bugs, and it isn’t helpful to ever assume that my test suites have full coverage, even if I use a scientific approach to writing tests.

The author mentioned one aspect of TDD that has always bothered me: that I can’t “explore.” My basic design philosophy is “Pave The Bare Spots”[0]. This means that I develop the design as I develop the code[1].

Rigid philosophies like TDD spike this methodology. What I tend to do, is rely on test harnesses, as opposed to unit tests[2].

In any case, I definitely support any effort to improve fundamental software quality. I feel as if the classic “rush to MVP at any cost” approach results in enormous tech debt that never gets repaid.

Obligatory XKCD: https://xkcd.com/2030/

[0] https://littlegreenviper.com/miscellany/the-road-most-travel...

[1] https://littlegreenviper.com/miscellany/evolutionary-design-...

[2] https://littlegreenviper.com/miscellany/testing-harness-vs-u...

[+] jacques_chester|5 years ago|reply
> there’s just no way to predict all the bugs

This is the Nirvana Fallacy, "if it isn't perfect, it's useless". Fine, you can't predict all the bugs. But you can predict some, perhaps many. That is strictly more robust than not predicting at all.

> The author mentioned one aspect of TDD that has always bothered me: that I can’t “explore.” ... Rigid philosophies like TDD spike this methodology.

"Spike" is the word TDD weirdoes like me use to say we are exploring. TDD requires enough knowledge to specify the problem in a way that drives out code. When you don't have that knowledge, you explore first.

I've had codebases where I wrote and discarded untested code multiple times in order to understand what design I needed. Once I began to grok the problem, I backed out and then test-drove my way back in.

[+] jrootabega|5 years ago|reply
Running full mutation testing may be overkill in many situations, as people have mentioned here, but the idea is still useful during normal development and review. If you find a problem in the tests, submit the mutation that shows the tests still pass in spite of it.
[+] generated|5 years ago|reply
Used to be called bebugging
[+] GuB-42|5 years ago|reply
More like bugging in this case.

The idea here is to deliberately inject bugs to see if your tests catch them.

[+] omginternets|5 years ago|reply
I didn’t know this had a name! I always thought this was just “testing done right”. :)
[+] siscia|5 years ago|reply
At this point I would advocate against mutation-driven testing and just go for property-based testing.

Much less manual work and with more stable results.

[+] pfdietz|5 years ago|reply
The two are complementary. In particular, PBT can be used to generate new minimized inputs that kill mutants, without the need for a test oracle to say what the code should be doing.
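A minimal sketch of that combination (Python; the mutant is written by hand here to stand in for a mutation tool's output): the unmutated code acts as its own oracle, so no independent specification is needed, and the failing input is shrunk toward a minimal test case.

```python
import random

def original(x):
    return abs(x)

def mutant(x):
    return x          # a mutation deleted the negation branch

def find_and_shrink(trials=200, seed=0):
    """Differential PBT: random inputs judged against the unmutated
    code as oracle; then shrink the failing input toward zero."""
    rng = random.Random(seed)
    failing = None
    for _ in range(trials):
        x = rng.randint(-1000, 1000)
        if original(x) != mutant(x):
            failing = x
            break
    if failing is None:
        return None       # mutant not killed (possibly equivalent)
    # Shrink: move toward zero while the discrepancy persists.
    while original(failing + 1) != mutant(failing + 1):
        failing += 1
    return failing

# Any negative input kills this mutant; shrinking lands on -1,
# a minimal input worth adding to the suite permanently.
killer = find_and_shrink()
```

The minimized killing input is the artifact: it gets promoted into a regular regression test, so the mutant stays dead.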
[+] underwater|5 years ago|reply
This doesn't really sound like a new coding methodology, it's just describing a way to write good tests. TDD gets a special label because it flips the normal coding and testing process.
[+] mpweiher|5 years ago|reply
That's because TDD is primarily a design technique, not a testing technique.

Tests are a (very) nice additional benefit, particularly because TDD has a simple way of ensuring coverage: you're only allowed to write production code if there is a failing test case, and only enough to make it pass.