
A demonstration of Mutation Testing

Test coverage is one of the simplest possible metrics to help gauge quality of testing, which makes it one that is often targeted with rules like “don’t commit any code with less than 80% coverage”. However, it is also an easy metric to manipulate, and doesn’t necessarily prove anything about the quality of the tests you do have. A lot of people dismiss it entirely for those reasons. While there is a good defence to be made for paying attention to coverage, that’s not the purpose of this post. Instead, I’m going to provide a simple example of how test coverage can be misleading and introduce mutation testing as a way to address those shortcomings.

Fizzbuzz: A high coverage and buggy example

First, the example code. There’s a simple little game that comes up in coding interviews called fizzbuzz. The rules are:

  • Take turns counting, starting from 1;
  • If a number is a multiple of 3, say “fizz” instead;
  • If a number is a multiple of 5, say “buzz” instead;
  • If a number is a multiple of both 3 and 5, say “fizzbuzz”.

The game would go something like this: 1, 2, fizz, 4, buzz, fizz, 7…
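
For reference, a straightforward implementation of those rules might look something like this. This is my own sketch, not the code we’re about to ship, and it assumes a number that isn’t a multiple of 3 or 5 should come back as a string:

function fizzbuzzReference(number) {
    var result = '';
    if (number % 3 === 0) {
        result += 'fizz';
    }
    if (number % 5 === 0) {
        result += 'buzz';
    }
    // If the number is not a multiple of 3 or 5, say the number itself.
    return result === '' ? String(number) : result;
}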

I’ve implemented that algorithm in a JavaScript function that takes a number and returns what you should say, ready to ship out to our clients. The code for this example is on GitHub, if you’d like to play along. I’ve run the tests, they all pass, and I even have 100% coverage. So we’re good to ship, right?

Well, actually, no. Almost immediately, my client comes back to me saying almost everything in their app is broken. The fizzbuzz game doesn’t work. Their customers are furious.

This is no doubt a caricature of a situation we’re all familiar with: a bug gets out to production despite our best efforts to test thoroughly before release. 100% test coverage didn’t serve as the guarantee we might have thought it did.

Let’s take a look at the code we shipped in this example:

function fizzbuzz(number) {
    var result = '';
    if (number % 3 === 0) {
        result += 'fooz';
    }
    if (number % 5 === 0) {
        result += 'buzz'
    }
    return result;
}

That’s… pretty terrible. There’s at least one big typo, and right from number 1 this won’t give us the expected result. I’m sure you can guess that the tests must be equally terrible if they run without raising any alarms. Take a minute to think about what kinds of things can go wrong with unit tests that might make this happen. Bad specs? Bad assertions? Remember, we know from having 100% coverage that the code did, at least, run. Sure enough:

describe("Fizzbuzz", function() {
    it("gets fizzbuzz", function() {
        fizzbuzz(15);
    });

    it("not fizzbuzz", function() {
        fizzbuzz(8);
    });
});

Turns out these tests don’t actually assert against anything. Fizzbuzz of 15 should return a string “fizzbuzz”, but we never check the results of calling fizzbuzz(15). At least we know we didn’t throw an error, but that’s about all these tests tell us.

Introducing mutation testing

This is where mutation testing comes in. The concept is this: given some code with passing tests, we’ll deliberately introduce bugs into that code and run the tests again. If the tests fail, that means they caught the bug, and we call that a success. We want the tests to fail! If the tests pass, that means they’re not capable of catching the bug we introduced.

Whereas regular coverage just tells you that your code ran, mutation coverage tells you whether your tests can fail.
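
As a hand-made illustration (this is not actual Stryker output), one possible mutant inverts an equality check in our fizzbuzz function:

// Original condition:
if (number % 3 === 0) {
    result += 'fooz';
}

// One possible mutant: the equality is inverted.
if (number % 3 !== 0) {
    result += 'fooz';
}

If the suite still passes when run against the mutant, we know the tests never pin down when that branch should run.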

For JavaScript, I use Stryker, a tool named for a character in the X-Men movies known for killing mutants. He’s a bad guy in the movies, but he’s on our side now. It supports React, Angular, Vue, and TypeScript. And of course there are similar tools in other languages, though I haven’t used them personally. The setup is very easy, since it just hooks into your existing test suite to run tests you’ve already written.
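
As a rough sketch, getting started looks something like this (the file name fizzbuzz.js is an assumption for this example; check Stryker’s documentation for the details of your test runner):

npm install --save-dev @stryker-mutator/core
npx stryker init    # walks you through generating a config file
npx stryker run     # runs the mutation tests

A minimal stryker.conf.json might contain little more than what to mutate and how to report:

{
  "mutate": ["fizzbuzz.js"],
  "reporters": ["clear-text", "progress"]
}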

Let’s run Stryker on our example code.

Stryker generates 14 mutants from our function, and shows that our tests manage to kill none of them. This is a much more helpful number than coverage was. And much like coverage, it reports for us exactly which mutants survived. While it doesn’t tell us exactly what tests we need, it does point us in the right direction. For example, if no test fails when we force an if condition to always be true, that means we don’t have any tests where it’s false.

Take mutant #7 as an example: the string “fooz” in the code, a typo that we didn’t catch, was replaced with an empty string. Because no test failed, the mutant is counted as a survivor. This is telling us explicitly that this string is never checked in the tests.
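
In code terms, the surviving mutant looks something like this (reconstructed for illustration):

// Original line:
result += 'fooz';

// Mutant #7: the string literal is replaced with an empty string.
// Every test still passed, so the mutant survived.
result += '';

Let’s fix that.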

Fixing fizzbuzz

The easiest thing we can do is just add an assertion to one of the existing tests:

    it("gets fizzbuzz", function() {
        expect(fizzbuzz(15)).toEqual("fizzbuzz");
    });

As always, we want to make sure this test actually fails against the buggy code, and it does.

Next, we can fix the code. If we tried to run our mutation tests right away, we’d be in trouble: Stryker wouldn’t be able to tell us whether a failure means our test successfully found a mutant or the code was simply broken in the first place. Luckily, the fix in the code is easy; we just have to correct the typo:

    if (number % 3 === 0) {
        result += 'fizz';     // not "fooz"
    }

Now the tests are passing, and note that the coverage results are still happily (and unhelpfully) at 100%. Running the mutation tests again shows us that we were able to catch all but two mutants.

I’ll leave it as an exercise for you to figure out which two mutants remain and how to catch them too. One last time, here’s a link to the code to get you started.

Mutation testing in real life

This toy example is obviously contrived to show an extreme case, but this works on real code too. I have a number of examples of production code that had full test coverage but still had bugs in areas where mutation testing shone a big red spotlight. As was the case here, it was still up to me to add the tests necessary to assert against the code in question and figure out what the bug was, but it did help tell me where to look.

Mutation testing isn’t a perfect replacement for test coverage, of course. It is only able to catch certain classes of bugs, usually around flow control, booleans, and assignments. It won’t catch faulty logic or fitness for purpose, though you may find that being unable to test something is a sign that something is wrong. In fact, if you work through the example above, you’ll find that it’s possible to catch 100% of mutants and still not have a good implementation of fizzbuzz. Even if you add additional mutations with Stryker’s plugin API, like any tool, it will never catch everything.
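
To make that concrete: killing every mutant still wouldn’t force our function to handle plain numbers, because there is no code for that case to mutate. A test like this one (my own addition, assuming the number should come back as a string) would still fail against the fixed function:

it("returns the number when it isn't a multiple of 3 or 5", function() {
    // Fails: the shipped implementation returns '' for 7.
    expect(fizzbuzz(7)).toEqual('7');
});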

Mutation testing also takes quite a while to run, since it has to run the tests again for every mutant it generates. Using Jest, Stryker is smart enough to run only the tests that cover the mutated file, but it is still more resource-intensive. In this small example, Jest finishes in 1 second while Stryker takes 6 seconds. Because of that, it’s not necessarily something you’ll include as part of a regular build pipeline, though it is certainly possible.
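
One practical compromise is to expose mutation testing as a separate script that you run on demand (or nightly) instead of on every build. Something like this in package.json, as an illustrative sketch:

"scripts": {
    "test": "jest",
    "test:mutation": "stryker run"
}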

I can also give you a bit of a shortcut. In my experience, the types of tests that are required for mutation testing tend to be the same types of tests required for branch coverage. This is just an anecdotal correlation based on the handful of products I’ve used this on, so don’t take my word for it. However, if you’re set on using coverage as a test quality gauge, at least upgrade to making sure all your branches are covered, not just all lines of code. Most coverage tools already have this built-in.
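
With Jest, for instance, branch coverage can be enforced in the config rather than only line coverage. A minimal sketch (the thresholds are up to you):

// jest.config.js
module.exports = {
    collectCoverage: true,
    coverageThreshold: {
        global: {
            branches: 80,
            lines: 80
        }
    }
};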

These days, I treat mutation testing as a tool for occasionally reviewing unit tests, especially when there are large changes. Tests are code, after all, and all code can have bugs in it. Even if you don’t consider unit tests part of a tester’s responsibility, they are the foundation of a solid test strategy, so we would do well to make sure that they’re doing what we think they are.
