Kill All Mutants! (Intro to Mutation Testing)

by Dave Aronson

In this presentation titled "Kill All Mutants! (Intro to Mutation Testing)" delivered at RubyConf 2019, Dave Aronson introduces the concept of mutation testing as a powerful software testing technique aimed at improving the effectiveness of unit tests. Mutation testing involves creating slight variations of code, known as ‘mutants,’ and running existing tests against them. The main purpose is to identify gaps in the test suite and ensure that any meaningful changes in the code are detected by appropriate tests. The key points covered in this talk include:

  • Definition and Objective of Mutation Testing: Mutation testing assesses code quality by checking if tests can detect introduced faults. A test suite's inadequacy is exposed if mutants can pass without triggering failures, indicating potential flaws in either code or tests.

  • How Mutation Testing Works: Tools create mutants by altering code in various ways, then run the tests against each mutant. If tests fail for a mutant, it is considered 'killed,' indicating effective tests and meaningful code.

  • Benefits: It ensures that tests are strict and the code is meaningful. By revealing weaknesses in both code and tests, it encourages improving test coverage and code quality.

  • Drawbacks: Mutation testing can be CPU-intensive, slow, and challenging to interpret, especially regarding the meaning behind the results. It is not beginner-friendly, as understanding the implications of surviving mutants can be complex.

  • Examples: Aronson shares specific examples, such as mutant alterations in a mathematical method to demonstrate how surviving mutants indicate poor test coverage. Another example illustrates redundant code, prompting the need for more extensive tests to evaluate loop behavior correctly.

  • Popular Tools: For Ruby, the main tool mentioned is 'Mutant,' which has transitioned to a closed-source model, while alternatives like 'MuTest' and 'Heckle' were also touched upon.

In conclusion, Aronson emphasizes that mutation testing is not a foolproof solution but a valuable technique for developers to improve code quality and test robustness. Practitioners are encouraged to experiment with mutation testing to enhance their practices, keeping in mind the associated challenges.

00:00:11.860 Okay, good afternoon Nashville! I'm Dave Aronson, the T. Rex of Codosaurus, and I'm here to teach you how to kill mutants!
00:00:18.410 But first, I think we need to set some expectations. So calling this an advanced talk might be a little misleading. It is an introduction to an advanced topic. If you're already well-versed in mutation testing, I won't be offended if you seek better learning opportunities. Still, I’d rather you stick around so you can correct my mistakes later in private.
00:00:31.250 I mentioned mistakes because I'm not an expert in mutation testing. One of the dirty little secrets of public speaking is that you don't have to be an expert; you just need to know a little more than your audience and be able to convey it.
00:00:43.280 Mutation testing is still so rare that most developers have never even heard of it. So, let's start with the basics: what on Earth is mutation testing?
00:00:56.270 In our universe, that of software development (and not comic books, of course), mutation testing is a software testing technique.
00:01:07.100 Surprise! You might look at the name and think it’s about testing the mutations in genetic algorithms, but no—it's about testing our code and also our unit test suite by using mutations.
00:01:20.640 Its primary benefit is to help ensure that our tests are strict by finding gaps in our test suite that allow our code to have unintended behavior.
00:01:34.040 Once we find these gaps, we can either add tests or expand and improve our tests to close them. This lack of strictness mainly comes from a lack of tests or poorly written or poorly maintained tests that didn’t keep pace with changes in the code.
00:01:49.080 Speaking of which, mutation testing can also help improve our code by ensuring that our code is meaningful. This means that any little change to the code should result in a noticeable change in its behavior.
00:02:02.210 Lack of meaning generally comes from code being unreachable, redundant, or otherwise without any real effect. Once we find code like that, we can either make it meaningful if it fits our original intention or just get rid of it.
00:02:38.950 Mutation testing puts these two concepts together by ensuring, first, that our code is meaningful so that the changes we make have a noticeable effect.
00:02:49.660 Then, it ensures that our tests are strict enough to notice that change and fail. Now, not all the tests have to fail, but each change should make at least one test fail.
00:03:02.310 Mutation testing does all this by mutating copies of our code, hence the name. It does this with the intention of creating test failures, otherwise known as faults.
00:03:20.670 So, mutation testing can be categorized as a fault-based testing technique. This means it is sort of related to something you're already familiar with: Chaos Monkey from Netflix. Just like Chaos Monkey helps Netflix find flaws in their error recovery process, mutation testing helps us find flaws in our test suite and in our code.
00:03:34.440 However, the way mutation testing works is sort of upside down from how Chaos Monkey works. Chaos Monkey does many things, but it's best known for injecting faults, such as dropped connections, into Netflix’s production network.
00:03:46.160 If all goes well, in the sense that Netflix's customers don’t notice, they know their error recovery is working well. Mutation testing, however, injects semantic changes.
00:04:02.870 It doesn't know whether these changes will result in faults, but we certainly hope they will. But that's up to the test suite. It puts these changes into copies of our code, not our actual network, and it does this all in our test environment, not in production.
00:04:15.970 Lastly, if all goes well in the sense that our unit tests all still pass, that does not mean everything's great; it means there is a problem.
00:04:28.780 Remember, each change should make at least one unit test fail. But there are some drawbacks, so this is not quite a silver bullet. As we developers know, there's no such thing. Besides, those are for killing werewolves, not mutants.
00:04:42.500 The first drawback is that it's rather CPU-intensive and therefore usually fairly slow. We wouldn’t want to run mutation tests on our whole codebase on every save.
00:04:54.350 It may take a lunch break for a smaller system, or overnight or even longer for larger ones. Fortunately, most tools include an incremental mode so we can check only what has changed since the last run.
00:05:10.980 This may allow us to do this on every save for a very small system or at least over a much shorter break for others. Its CPU-intensive nature can also significantly increase our bills on cloud-based systems such as AWS.
00:05:26.430 Another drawback is that it is not beginner-friendly; mainly, it shows us that making a particular change to the code made no difference whatsoever to our test results.
00:05:39.890 What does that mean? It takes a lot of interpretation to figure out what a mutant is trying to tell us. Their accents are varied and they are almost incoherent; sometimes it's gibberish, but with a much larger vocabulary.
00:05:55.570 They usually indicate that our code is meaningless, our tests are lacking, or both. But it can be very difficult to figure out exactly how. Worse yet, often it's a false alarm.
00:06:10.800 The mutation might not have made any tests fail, but it didn't really make any actual difference either. Figuring that out can still take quite a lot of time.
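A mutation that makes no actual difference is often called an equivalent mutant. As a hypothetical illustration (the `absolute` method below is an assumption for this sketch, not code from the talk), changing `<` to `<=` only reroutes the input zero, and negating zero still gives zero, so no test on return values could ever kill this mutant:

```ruby
# Hypothetical method under test (not from the talk).
def absolute(x)
  x < 0 ? -x : x
end

# A mutant with `<` changed to `<=`. The only input routed differently
# is zero, and negating zero still gives zero, so the result never changes.
def absolute_mutant(x)
  x <= 0 ? -x : x
end

# Compare the two over a range of inputs; they agree on every one.
same_everywhere = (-1000..1000).all? { |x| absolute(x) == absolute_mutant(x) }
puts same_everywhere
```

A surviving mutant like this one is a false alarm: the honest response is to recognize it and move on, not to add a test.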
00:06:25.550 Even if a mutation did make a difference, there’s still quite a lot of code that we shouldn't bother to test. For example: if you have a debugging statement that says "the value of X is <value>", that constant part of the string will get mutated, even though we don’t really care about that.
00:06:41.950 Fortunately, most tools have a way to indicate that we don't want to mutate this line or this whole function. However, this is usually done with comments, which can clutter the code and make it less readable.
00:06:53.290 Now that we’ve seen some of the major pros and cons, let's examine what mutation testing actually does and how it works.
00:07:05.150 First, our chosen mutation testing tool will break our code apart into pieces to test. Usually in Ruby or other object-oriented languages, these pieces will be our methods.
00:07:17.130 For each one, it will try to find the tests for that method. If the tool can’t find any, some will use the entire test suite, which is usually inefficient and leads to a lot more false alarms.
00:07:29.490 But most will just skip that one method and move on. Better yet, many of them will warn us so we know we should add some tests.
00:07:42.540 Then the tool makes the mutants. To do that, it closely examines exactly how a method can be changed, and for each tiny alteration, it will create one mutant that incorporates that change.
00:07:58.590 Once done creating all the mutants it can from a particular method, it iterates over the list. Now we get to the heart of the matter: for each mutant derived from a given method, our tool will run each of that method’s unit tests using each mutant in turn instead of the original method.
00:08:14.370 If even one test fails, that is called killing the mutant, and that's exactly what we want. It means, first, the code was meaningful enough that the tiny change it made to create that mutant had a noticeable effect on the behavior.
00:08:29.390 Second, at least one of the tests was strict enough to spot that difference and fail. After a mutant has made a test fail, generally, the tool will stop running any more tests against that mutant.
00:08:44.000 We don't care how many more tests that mutant could make fail—like so much in computer science, we only care about ones and zeros.
00:08:57.610 Then we move on to the next mutant. If that was the last one from that function or method, we move on to the next method. But if a mutant survives all those tests, meaning it lets them all pass, then it has the superpower of mimicry—skilled enough to fool our tests.
00:09:09.730 This usually means that our code is meaningless, or our tests are lacking, or both. Now it's up to us to figure out how.
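The kill-or-survive loop described above can be sketched in a few lines of Ruby. This is a minimal sketch under stated assumptions, not how any real tool is implemented: the method under test is a hypothetical `adult?(age)` whose original body is `age >= 18`, the mutants are hand-written lambdas standing in for altered copies of it, and `run_mutation_tests` is a made-up name.

```ruby
# Hypothetical stand-ins: each mutant is a lambda playing the role of an
# altered copy of `adult?(age)`, whose original body is `age >= 18`.
ADULT_MUTANTS = {
  "age > 18" => ->(age) { age > 18 },
  "age < 18" => ->(age) { age < 18 },
  "true"     => ->(_age) { true },
}.freeze

# Each test takes an implementation and returns true if the test passes.
ADULT_TESTS = [
  ->(impl) { impl.call(30) == true },
  ->(impl) { impl.call(5) == false },
].freeze

def run_mutation_tests(mutants, tests)
  mutants.each_with_object({}) do |(name, mutant), report|
    # `any?` stops at the first failing test, just as a tool stops
    # running tests against a mutant once it has been killed.
    killed = tests.any? { |test| !test.call(mutant) }
    report[name] = killed ? :killed : :survived
  end
end

run_mutation_tests(ADULT_MUTANTS, ADULT_TESTS).each do |name, fate|
  puts "#{name}: #{fate}"
end
```

In this sketch the `age > 18` mutant survives because neither test checks the boundary value 18; a test asserting that 18 counts as an adult would kill it.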
00:09:22.290 Let's peel back another layer and look at some of the technical details of exactly how this works. First, our tool parses our code into an abstract syntax tree.
00:09:35.670 Some of those boxes may be a bit small to read from the back there, but we don’t need to understand this AST in detail.
00:09:45.940 Some tools do it slightly differently; they work against bytecode rather than an AST. Some proof-of-concept tools even work on actual raw text.
00:09:58.780 But the majority of them use an AST, so we'll roll with that. After our tool creates an AST, it traverses it looking for subtrees or branches representing our methods.
00:10:13.940 After it finds them, it looks for their tests, at this point basically just seeing whether there are any rather than running them. But how does it look for the tests? It usually relies on us developers either annotating our tests or following a naming convention.
00:10:27.570 Sometimes, this is augmented or even completely replaced by the tool examining which tests call which methods.
00:10:40.250 This can get tricky if a method is not called directly from the test. Once the tool has found all the method's tests, it generates the mutants.
00:10:55.440 How does it create mutants from an AST? It traverses that subtree just like it did to the whole tree. But now, instead of looking for smaller subtrees, it's looking for nodes where something can be changed.
00:11:11.150 For each one it finds and for each way it can change that node, it makes a fresh copy of that method's AST subtree.
00:11:25.470 Suppose our tool has traversed the AST and has gotten to a not-equal comparison. It will create a fresh copy of that entire subtree with only that node changed in that particular way.
00:11:39.650 Once it's done doing that for that node, it continues traversing the rest of the subtree and does likewise for the other nodes.
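To make the traversal concrete, here is a minimal sketch that uses nested arrays as a toy AST. Everything in it is an assumption for illustration: real tools use a proper parser and far richer node types, and the `REPLACEMENTS` table and `mutate` method are hypothetical names. Each returned tree is a copy of the whole subtree with exactly one operator changed.

```ruby
# Hypothetical replacement table: operator => alternatives to try.
REPLACEMENTS = {
  :!= => [:==],
  :+  => [:-, :*],
}.freeze

# Returns an array of complete copies of `node`, each differing from the
# original at exactly one operator. The original tree is never modified;
# unchanged subtrees are shared between copies, which is safe here
# because nothing is mutated in place.
def mutate(node)
  return [] unless node.is_a?(Array)

  operator, *children = node
  # Mutations of this node itself: same children, different operator.
  here = REPLACEMENTS.fetch(operator, []).map { |alt| [alt, *children] }
  # Mutations inside each child, re-wrapped in an otherwise
  # unchanged copy of this node.
  below = children.each_with_index.flat_map do |child, index|
    mutate(child).map do |mutated_child|
      copy = children.dup
      copy[index] = mutated_child
      [operator, *copy]
    end
  end
  here + below
end

# Toy AST for the expression: a != b + 1
AST = [:!=, [:var, :a], [:+, [:var, :b], [:lit, 1]]].freeze

mutate(AST).each { |mutant| p mutant }
```

For this expression the sketch yields three mutants: one with `!=` changed to `==`, and two with `+` changed to `-` and `*` respectively.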
00:11:52.370 I've said a lot about changes it can make, but what kind of changes are we talking about? There are quite a lot.
00:12:03.630 It could change a mathematical, logical, or bitwise operator from one to another. In languages and situations where it can do so, it could even substitute one of a different category.
00:12:17.900 For example, in Ruby, we can treat pretty much anything as a boolean, so X plus Y could be changed to X and Y, or X or Y. It can also change the order of operands in situations where that matters.
00:12:38.580 Most tools are smart enough not to bother doing this with addition, but they will do that with subtraction, division, and exponentiation.
00:12:51.320 It could also change a comparison from one to another; it could insert or remove a logical, mathematical, or bitwise negation.
00:13:03.970 It can change a constant, variable, or function call to a completely different value, possibly of a different type. For instance, it might change a number to a string.
00:13:18.650 It can remove entire lines of code, or remove a condition or a loop control. It could scramble or truncate arguments in a method call or a method declaration.
00:13:32.080 It can replace a method's entire body with just a constant, or either of the arguments, or trigger a deliberate error, or do nothing at all if the language permits.
00:13:48.370 There are many more kinds of changes it could make, but I hope you get the idea by now. From here on, I don’t want to add any more low-level details.
00:14:02.060 Let’s look at some examples. We'll start with an easy one. Suppose we have a method like this.
00:14:11.970 It's pretty simple, but think about what a mutant made from this might return. These are the values our unit tests are probably going to be looking at.
00:14:22.180 There are lots and lots of possibilities, but mainly it could return all kinds of values, like any of these expressions or constants, and many more.
00:14:39.050 Suppose we have only one test like this. This is a pretty bad test, but even so, most of the mutants shown here would get killed.
00:14:54.530 The ones returning constants are quite unlikely to match; for example, '4' is not a very common constant to randomly come up with.
00:15:04.960 Subtracting will yield zero; dividing gets us one; and returning either argument alone gets us two.
00:15:14.560 Even the ones that deliberately raise an error or accidentally do so will at least make the test fail.
00:15:27.700 But addition, multiplication, and exponentiation in reverse order can yield the right answer, and therefore will survive that particular test.
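The slides are not part of this transcript, so the following is a sketch that assumes the method is a two-argument `power(x, y)` returning `x ** y` and that the single test asserts `power(2, 2) == 4`, which matches the results quoted above. The mutants are hand-written lambdas standing in for a few of the ones a tool might generate, and `survivors` is a hypothetical helper.

```ruby
# Assumed shape of the method from the talk.
def power(x, y)
  x ** y
end

# Hand-written stand-ins for a few of the mutants a tool might generate.
POWER_MUTANTS = {
  "x + y"    => ->(x, y) { x + y },
  "x * y"    => ->(x, y) { x * y },
  "y ** x"   => ->(x, y) { y ** x },
  "x - y"    => ->(x, y) { x - y },
  "x / y"    => ->(x, y) { x / y },
  "return x" => ->(x, _y) { x },
  "return 0" => ->(_x, _y) { 0 },
}.freeze

# Names of the mutants that still satisfy `power(x, y) == expected`.
def survivors(x, y, expected)
  POWER_MUTANTS.select { |_name, mutant| mutant.call(x, y) == expected }.keys
end

p survivors(2, 2, 4) # => ["x + y", "x * y", "y ** x"]
```

Swapping the parameters in the declaration and swapping the operands of `**` behave identically, so both are represented here by the single `y ** x` entry.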
00:15:42.350 We know this because we run a mutation testing tool, and it gives us a report that looks something like this. The exact words and formatting vary widely depending on which tool we use.
00:16:01.520 To unpack the information it provides: it tells us that if we changed the method called 'power', which is in 'demo.rb', starting at line 42 in any of four different ways, then all its unit tests would still pass.
00:16:14.590 Those four ways include changing line 42 to swap the order of the arguments, changing line 43 to change exponentiation to addition or multiplication, or swapping the order of the operands.
00:16:29.960 Pretty straightforward, right? So, what is this set of surviving mutants trying to tell us? A good start to figuring that out is to ask ourselves how these mutants are surviving.
00:16:43.330 What is it about the code that lets all our unit tests, or in this case, our one and only unit test, still pass? The usual answer is that they give the same result or have the same side effects as our original method.
00:16:59.360 To determine how that happens, one useful technique is to look at at least one mutant together with at least one of the unit tests it passes.
00:17:14.450 Let's start with the plus mutant. Looking at that change, changing exponentiation to addition makes it fairly clear that this mutant survives because two to the second power equals two plus two, which both yield four.
00:17:30.560 How do we kill this mutant? The main way to kill a mutant is to ensure you have a test that can fail. You can either change the existing test or add a new one where the two expressions produce different results.
00:17:45.430 In other words, we just need inputs where X to the Y does not equal X plus Y. For instance, we could add or change a test to say two to the third power equals eight, while two plus three equals five, which is not eight.
00:18:00.520 This kills the plus mutant. Two times three is also not eight, so it kills the times mutant too. Three squared equals nine, again not eight, so that kills the argument swapping mutants as well.
00:18:17.640 We didn't have to attack them all at once; we could add an intermediate test that says two to the fourth equals sixteen, which would kill the plus and times mutants.
00:18:30.710 Then, we could add this or change the test to also kill the argument swapping mutants.
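The arithmetic behind these two tests can be checked with a short sketch. It assumes the method under test computes `x ** y`; the lambdas are hand-written stand-ins for the surviving mutants, and `killed_by` is a hypothetical helper.

```ruby
# Stand-ins for the surviving mutants named in the report.
SURVIVING_MUTANTS = {
  "x + y"  => ->(x, y) { x + y },
  "x * y"  => ->(x, y) { x * y },
  "y ** x" => ->(x, y) { y ** x },
}.freeze

# Names of the mutants killed by asserting `power(x, y) == expected`.
def killed_by(x, y, expected)
  SURVIVING_MUTANTS.reject { |_name, mutant| mutant.call(x, y) == expected }.keys
end

p killed_by(2, 3, 8)  # => ["x + y", "x * y", "y ** x"]
p killed_by(2, 4, 16) # => ["x + y", "x * y"]
```

The second test leaves the swapped version alive because four squared is also sixteen, which is why it only works as an intermediate step.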
00:18:40.700 Now, this may make mutation testing sound simple, but this was a nearly trivial example. It's generally easy to come up with new arguments to make all reasonable mutants return a different value.
00:18:55.230 There are lots of ways to handle this, so let’s look at a more complex example. Suppose we have a method that sends a message.
00:19:09.750 This method loops over and over, sending in each chunk as much as 'send_bytes' can handle in one shot, picking up where it left off until the whole thing has been sent.
00:19:21.970 A mutation testing tool can create many different mutants from this method. Of particular interest would be a mutant that removes a loop control.
00:19:39.320 Suppose that this mutant does indeed survive our test suite, which mainly consists of this. There's a little more to it, but even without seeing those parts, the survival of the non-looping mutant informs us.
00:19:54.590 It indicates that if a mutant that only goes through the loop body once acts the same as far as our test suite can tell as the original code, then our tests are not making the original code go through that loop body more than once.
00:20:08.240 The question then becomes: What does that mean? You'll find that interpreting what mutants are trying to tell you involves recursively asking yourself, 'What does that mean?'
00:20:24.440 In this case, it means we're not testing sending a message larger than what 'send_bytes' can handle in one chunk.
00:20:33.730 For instance, here, the maximum chunk size that 'send_bytes' can handle in one shot is 10,000 bytes. If we're testing with only a small three-byte message, it’s only going to go through that loop once.
00:20:47.700 How do we fix that? It should be easy, assuming there's nothing peculiar about the code preventing us from doing so. We should take the maximum chunk size as declared, presumably publicly, add one, create that big message, and try to send it.
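Since the slide code is not in the transcript, the following is a sketch under assumptions: `FakeConnection`, `send_message_mutant`, and `delivered?` are hypothetical names, and `send_bytes` is assumed to accept at most 10,000 bytes per call and return how many it took. It shows why a three-byte message lets the non-looping mutant survive while a message one byte over the limit kills it.

```ruby
MAX_CHUNK_SIZE = 10_000 # the chunk limit mentioned in the talk

# Hypothetical fake of the lower-level sender: accepts at most
# MAX_CHUNK_SIZE bytes per call and returns how many it took.
class FakeConnection
  attr_reader :received

  def initialize
    @received = +""
  end

  def send_bytes(data)
    chunk = data.byteslice(0, MAX_CHUNK_SIZE)
    @received << chunk
    chunk.bytesize
  end
end

# Assumed shape of the original method: loop until everything is sent.
def send_message(connection, message)
  sent = 0
  while sent < message.bytesize
    remaining = message.byteslice(sent, message.bytesize - sent)
    sent += connection.send_bytes(remaining)
  end
end

# The mutant of interest: loop control removed, so the body runs once.
def send_message_mutant(connection, message)
  connection.send_bytes(message)
end

# A test passes if the whole message arrives.
def delivered?(sender_name, message)
  connection = FakeConnection.new
  method(sender_name).call(connection, message)
  connection.received == message
end

small = "abc"
large = "x" * (MAX_CHUNK_SIZE + 1)

puts delivered?(:send_message_mutant, small) # true: the mutant survives
puts delivered?(:send_message_mutant, large) # false: the mutant is killed
puts delivered?(:send_message, large)        # true: the original still passes
```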
00:21:02.220 However, paraphrasing Shakespeare: 'The fault, dear Nashville, is not in our tests, but in our code that allows these mutants to survive.' Perhaps we did try sending with the largest permissible message from a set of predefined sizes.
00:21:17.620 For instance, here we have smaller and larger messages. If we tried sending a large message and the non-looping mutant still survives, what could that possibly mean?
00:21:31.770 It's trying to tell us that a version of the 'send_message' with the looping removed will do the job just fine. This is an example of mutation testing finding redundant code.
00:21:45.340 If we strip out all the other stuff that was there only to support the looping, it boils down to this, and now it’s really clear that the ultimate message is that 'send_message' may well be entirely redundant.
00:22:01.570 However, I say it may be redundant, not that it is. In real-world scenarios, there may be some logging and error handling that needs to be in 'send_message'.
00:22:16.220 But at the very least, the looping was redundant. So what do we do about it? We can either chop out the looping or, if we don't need anything extra in 'send_message', we can eliminate it altogether.
00:22:28.760 This will make our code more maintainable by getting rid of unnecessary clutter. Fortunately, when dealing with this kind of situation, the fix is often pretty clear.
00:22:43.780 Now that we’ve seen a few examples of spotting bad tests and redundant code, let’s address some frequently asked questions.
00:22:56.230 First, this all sounds pretty crazy: literally making tests fail to prove that the code succeeds. Where does this bizarro idea come from?
00:23:10.240 Mutation testing has a surprisingly long history in the context of computers. It was first proposed in 1971 in Richard Lipton's term paper titled 'Fault Diagnosis of Computer Programs.'
00:23:22.860 The first tool didn’t appear until nine years later, in 1980, as part of Timothy Budd's PhD work at Yale. After that, it still wasn't practical for most of us, running consumer-grade desktop machines.
00:23:39.960 Only with advances in CPU speed, multi-core CPUs, and more memory over the past couple of decades did it become feasible.
00:23:56.760 That leads us to the next question: why is it so CPU-intensive? To answer that, we have to do some basic math.
00:24:09.840 Suppose our program has methods with about 10 lines each, and each line has about five places it could be changed to approximately 20 different alternatives.
00:24:27.520 That works out to about a thousand mutants per function. For each one, we probably have to run at least a few unit tests.
00:24:40.469 If we get lucky and kill it on the first shot, we might run just one test; otherwise, we run all applicable tests for that method.
00:24:56.600 If we run, on average, just 10% of a method's unit tests against each of its mutants, that's still a hundred times more test runs than running our normal test suite.
00:25:10.240 If our normal unit tests take about ten seconds, mutation testing with these assumptions might take about a thousand seconds, which is almost 17 minutes.
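The back-of-the-envelope estimate above can be written out step by step. The constant names are made up for this sketch; the numbers are the assumptions stated in the talk.

```ruby
LINES_PER_METHOD      = 10
SITES_PER_LINE        = 5   # places on a line that could be changed
ALTERNATIVES_PER_SITE = 20  # different changes possible at each place
PERCENT_OF_TESTS_RUN  = 10  # share of a method's tests run per mutant
NORMAL_SUITE_SECONDS  = 10

# One mutant per (line, site, alternative) combination.
MUTANTS_PER_METHOD = LINES_PER_METHOD * SITES_PER_LINE * ALTERNATIVES_PER_SITE

# Each mutant runs a tenth of the method's tests, so the total number of
# test runs grows by this factor compared with a normal run.
SLOWDOWN = MUTANTS_PER_METHOD * PERCENT_OF_TESTS_RUN / 100

MUTATION_RUN_SECONDS = NORMAL_SUITE_SECONDS * SLOWDOWN
MUTATION_RUN_MINUTES = (MUTATION_RUN_SECONDS / 60.0).round(1)

puts MUTANTS_PER_METHOD   # 1000
puts SLOWDOWN             # 100
puts MUTATION_RUN_SECONDS # 1000
puts MUTATION_RUN_MINUTES # 16.7
```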
00:25:22.440 But there is some good news. Over the past decade, there’s been a lot of research aimed at trimming down the mutant horde.
00:25:35.910 Techniques include weeding out those that are redundant with another mutant, semantically equivalent to the original code, or otherwise trivial.
00:25:50.280 These have squeezed it down to about a third under optimal circumstances. However, even with that, this is still not a silver bullet.
00:26:05.640 The leftover mutants are still considerable. The last common question is: "When making mutants, why change them in only one way?"
00:26:19.750 There are a few reasons. First, it helps us focus. It's much easier to tell what a mutant is trying to say if we're only dealing with one change.
00:26:34.599 You can think of it like using the single-responsibility principle. Another reason is that multiple changes may balance each other out, leading to false alarms.
00:26:47.649 For instance, remember that first trivial example with the argument-swapping mutants? If one mutation swapped the arguments, while another swapped them back, it would lead to no net effect.
00:27:01.140 Lastly, allowing multiple mutations would create a combinatorial explosion of mutants, severely increasing the CPU-intensive workload.
00:27:14.320 Just to spare you the math: using the earlier code-size assumptions, if we managed to reduce the mutants to a third, then with one mutation per mutant we would have about 333 mutants per method.
00:27:29.950 And by multiplying this through, we would end up with incredibly high quantities—hundreds of thousands to millions of mutants.
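One rough way to see the explosion is a crude upper bound that treats each additional mutation as an independent choice among the same 333 single changes. This model is an assumption for illustration, not necessarily the calculation behind the figures quoted in the talk, and `mutant_counts` is a hypothetical helper.

```ruby
SINGLE_MUTATIONS = 333 # mutants per method with one mutation each

# Crude upper bound: with k mutations per mutant, each one is picked
# independently from the same set, giving SINGLE_MUTATIONS ** k.
def mutant_counts(single, max_mutations)
  (1..max_mutations).map { |k| single**k }
end

p mutant_counts(SINGLE_MUTATIONS, 3) # => [333, 110889, 36926037]
```

Even under this rough model, going from one mutation per mutant to two or three moves a single method from hundreds of mutants to hundreds of thousands or tens of millions.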
00:27:43.210 We can avoid the increased workload and lack of focus if we restrict it to one mutation per mutant. So, to summarize: mutation testing is a powerful tool.
00:27:56.050 It ensures our code is meaningful and our tests are strict. It's relatively easy to set up the tools and annotate tests, which may be tedious but manageable.
00:28:06.660 However, it’s not very easy to interpret results, nor is it easy on the CPU. Even if these drawbacks mean it's not well-suited for our current projects, I still think it's a cool idea.
00:28:20.520 If you'd like to try mutation testing for yourself, here’s a list of some popular tools for various languages and frameworks.
00:28:33.800 Regarding Ruby, 'Mutant' is the main one, though they are moving to a closed-source paid model, leaving the last open-source version available for free.
00:28:49.320 I'm not entirely sure, but if I remember correctly, open-source projects can use even the paid version for free. 'MuTest' is a fork of an old version of 'Mutant' with continued work on it.
00:29:05.380 It has some features that 'Mutant' doesn’t and vice-versa, but it’s only compatible with RSpec, not Minitest.
00:29:19.220 My current main client uses Minitest, so I didn’t explore MuTest much. 'Heckle' hasn't been updated since 2013, so good luck with that.
00:29:32.300 For the rest of them, just be aware that many may be outdated. I don’t know if all the listed tools are current or if they still support those languages.
00:29:46.020 Okay, has everyone finished taking pictures? Moving on, I’d like to give a shout-out to Toptal, a consulting network I'm part of, whose speakers’ network helped me prepare this presentation.
00:30:01.660 Please use my referral link there if you want to hire us or join us. And many thanks to Markus Schirp, who created 'Mutant', the main mutation testing tool I’ve used.
00:30:16.240 He has been very willing to answer my questions and critique my presentation, and now it’s your turn. We have about five and a half minutes for questions.
00:30:30.040 If you think of something later, here’s my contact information up there, and of course, I have cards and will be around for the rest of the conference.
00:30:44.670 Any questions? Okay, first question: How long does it take to get up and running with mutation testing on a project?
00:31:00.090 To get your first report, I suppose you could make a decent dent in it by just installing the tool and running it. How long that takes depends on the size of your codebase and the tool you're using.
00:31:17.240 If it skips methods with no tests, that can impact the time. I would guess that with a not too huge Rails app, you could have an actionable set of mutant reports in under an hour.
00:31:30.520 The most time-consuming part, at least in my limited experience, has been waiting for the tool to run rather than the setup.
00:31:42.360 You could run it during a lunch break or overnight. Even more time-consuming is trying to understand what the mutants are trying to tell you, which is why I consider this an advanced topic.
00:31:55.160 Much of it will take experience. Okay, next question: When killing mutants, do I think it’s more of a best practice to optimize a given test or add additional tests?
00:32:07.160 I'd say that’s a subset of whether one does that overall in tests. I generally write tests for specific situations I’m trying to ensure are covered.
00:32:19.600 For instance, considering the initial 'power' method example, I might want some tests like: What if someone passes in zero? Or something else invalid? And, of course, some for normal usage.
00:32:33.160 I’d try to have one or two tests for each of those, but not too many, with the normal tests chosen to avoid letting a lot of mutants survive.
00:32:46.270 It might be difficult with some functions to have one test kill a mass of mutants, but as I said, you don’t need to be a superhero about it.
00:33:00.740 You can have one test eliminate that batch of mutants and another test take out another batch. They may overlap a bit, but don't overthink it too much.
00:33:14.260 Okay, and one last question: Do 'Mutant' and 'MuTest' patch the Ruby interpreter to make this work?
00:33:29.480 I don’t think they do anything to the copy you're actively using for most of your normal Ruby work. I haven't looked into how they work under the hood.
00:33:43.160 They might make a copy and tweak that. I’ve done some custom parsing in that domain. What I've seen usually does not affect the standard interpreter usage.
00:34:00.310 We have 45 seconds left. Does anyone have a quick question? If not, I guess we all get a little lead on the afternoon snacks!