00:00:11.860
Okay, good afternoon Nashville! I'm Dave Aronson, the T. Rex of Codosaurus, and I'm here to teach you how to kill mutants!
00:00:18.410
But first, I think we need to set some expectations. So calling this an advanced talk might be a little misleading. It is an introduction to an advanced topic. If you're already well-versed in mutation testing, I won't be offended if you seek better learning opportunities. Still, I’d rather you stick around so you can correct my mistakes later in private.
00:00:31.250
I mentioned mistakes because I'm not an expert in mutation testing. One of the dirty little secrets of public speaking is that you don't have to be an expert; you just need to know a little more than your audience and be able to convey it.
00:00:43.280
Mutation testing is still so rare that most developers have never even heard of it. So, let's start with the basics: what on Earth is mutation testing?
00:00:56.270
In our universe, that of software development (and not comic books, of course), mutation testing is a software testing technique.
00:01:07.100
Surprise! You might look at the name and think it’s about testing the mutations in genetic algorithms, but no—it's about testing our code and also our unit test suite by using mutations.
00:01:20.640
Its primary benefit is to help ensure that our tests are strict by finding gaps in our test suite that allow our code to have unintended behavior.
00:01:34.040
Once we find these gaps, we can either add tests or expand and improve our tests to close them. This lack of strictness mainly comes from a lack of tests or poorly written or poorly maintained tests that didn’t keep pace with changes in the code.
00:01:49.080
Speaking of which, mutation testing can also help improve our code by ensuring that our code is meaningful. This means that any little change to the code should result in a noticeable change in its behavior.
00:02:02.210
Lack of meaning generally comes from code being unreachable, redundant, or otherwise without any real effect. Once we find code like that, we can either make it meaningful if it fits our original intention or just get rid of it.
00:02:38.950
Mutation testing puts these two concepts together by ensuring, first, that our code is meaningful so that the changes we make have a noticeable effect.
00:02:49.660
Then, it ensures that our tests are strict enough to notice that change and fail. Now, not all the tests have to fail, but each change should make at least one test fail.
00:03:02.310
Mutation testing does all this by mutating copies of our code, hence the name. It does this with the intention of creating test failures, otherwise known as faults.
00:03:20.670
So, mutation testing can be categorized as a fault-based testing technique. This means it is sort of related to something you're already familiar with: Chaos Monkey from Netflix. Just like Chaos Monkey helps Netflix find flaws in their error recovery process, mutation testing helps us find flaws in our test suite and in our code.
00:03:34.440
However, the way mutation testing works is sort of upside down from how Chaos Monkey works. Chaos Monkey does many things, but it's best known for injecting faults, such as dropped connections, into Netflix’s production network.
00:03:46.160
If all goes well, in the sense that Netflix's customers don’t notice, they know their error recovery is working well. Mutation testing, however, injects semantic changes.
00:04:02.870
It doesn't know whether these changes will result in faults, but we certainly hope they will. But that's up to the test suite. It puts these changes into copies of our code, not our actual network, and it does this all in our test environment, not in production.
00:04:15.970
Lastly, if all goes well in the sense that our unit tests all still pass, that does not mean everything's great; it means there is a problem.
00:04:28.780
Remember, each change should make at least one unit test fail. But there are some drawbacks, so this is not quite a silver bullet. As we developers know, there's no such thing. Besides, those are for killing werewolves, not mutants.
00:04:42.500
The first drawback is that it's rather CPU-intensive and therefore usually fairly slow. We wouldn’t want to run mutation tests on our whole codebase on every save.
00:04:54.350
It may take a lunch break for a smaller system, or overnight or even longer for larger ones. Fortunately, most tools include an incremental mode so we can check only what has changed since the last run.
00:05:10.980
This may allow us to do this on every save for a very small system or at least over a much shorter break for others. Its CPU-intensive nature can also significantly increase our bills on cloud-based systems such as AWS.
00:05:26.430
Another drawback is that it is not beginner-friendly; mainly, it shows us that making a particular change to the code made no difference whatsoever to our test results.
00:05:39.890
What does that mean? It takes a lot of interpretation to figure out what a mutant is trying to tell us. Their accents are varied, and they are almost incoherent: sometimes it sounds like gibberish, but with a much larger vocabulary.
00:05:55.570
They usually indicate that our code is meaningless, our tests are lacking, or both. But it can be very difficult to figure out exactly how. Worse yet, often it's a false alarm.
00:06:10.800
The mutation might not have made any tests fail, but it didn't really make any actual difference either. Figuring that out can still take quite a lot of time.
00:06:25.550
Even if a mutation did make a difference, there’s still quite a lot of code that we shouldn't bother to test. For example: if you have a debugging statement that says "the value of X is <value>", that constant part of the string will get mutated, even though we don’t really care about that.
00:06:41.950
Fortunately, most tools have a way to indicate that we don't want to mutate this line or this whole function. However, this is usually done with comments, which can clutter the code and make it less readable.
00:06:53.290
Now that we’ve seen some of the major pros and cons, let's examine what mutation testing actually does and how it works.
00:07:05.150
First, our chosen mutation testing tool will break our code apart into pieces to test. Usually in Ruby or other object-oriented languages, these pieces will be our methods.
00:07:17.130
For each one, it will try to find the tests for that method. If the tool can’t find any, some will use the entire test suite, which is usually inefficient and leads to a lot more false alarms.
00:07:29.490
But most will just skip that one method and move on. Better yet, many of them will warn us so we know we should add some tests.
00:07:42.540
To make the mutants, the tool closely examines exactly how a method can be changed, and for each tiny alteration, it creates one mutant that incorporates that change.
00:07:58.590
Once done creating all the mutants it can from a particular method, it iterates over the list. Now we get to the heart of the matter: for each mutant derived from a given method, our tool will run each of that method’s unit tests using each mutant in turn instead of the original method.
00:08:14.370
If even one test fails, that is called killing the mutant, and that's exactly what we want. It means, first, the code was meaningful enough that the tiny change made to create that mutant had a noticeable effect on the behavior.
00:08:29.390
Second, at least one of the tests was strict enough to spot that difference and fail. After a mutant has made a test fail, generally, the tool will stop running any more tests against that mutant.
00:08:44.000
We don't care how many more tests that mutant could make fail—like so much in computer science, we only care about ones and zeros.
00:08:57.610
Then we move on to the next mutant. If that was the last one from that function or method, we move on to the next method. But if a mutant survives all those tests, meaning it lets them all pass, then it has the superpower of mimicry—skilled enough to fool our tests.
00:09:09.730
This usually means that our code is meaningless, or our tests are lacking, or both. Now it's up to us to figure out how.
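The kill-or-survive loop just described can be sketched in a few lines of Ruby. This is only an illustration under assumptions: `run_mutant`, `mutation_report`, and the doubling example are hypothetical names, mutants are modeled as lambdas standing in for the original method, and a test is a lambda that returns true when it passes. Real tools work on the code itself rather than on hand-written lambdas.

```ruby
# Hypothetical mini-harness: a "mutant" is any callable standing in for the
# original method, and a "test" is a callable that returns true on pass.
def run_mutant(mutant, tests)
  tests.each do |test|
    passed = begin
      test.call(mutant)
    rescue StandardError
      false # a mutant that raises makes the test fail, which also counts
    end
    return :killed unless passed # stop at the first failing test
  end
  :survived # every test passed, so this mutant fooled the suite
end

# Runs every mutant against the tests and records its fate by name.
def mutation_report(mutants, tests)
  mutants.transform_values { |mutant| run_mutant(mutant, tests) }
end

# Illustration: the original method doubles its argument (n + n).
DOUBLE_TESTS = [
  ->(impl) { impl.call(0) == 0 },
  ->(impl) { impl.call(2) == 4 }
].freeze

DOUBLE_MUTANTS = {
  "n + n -> n * n" => ->(n) { n * n },  # 0 * 0 is 0 and 2 * 2 is 4
  "n + n -> n - n" => ->(n) { n - n },  # 2 - 2 is 0, not 4
  "body -> raise"  => ->(_n) { raise "deliberate error" }
}.freeze

p mutation_report(DOUBLE_MUTANTS, DOUBLE_TESTS)
```

In this sketch the multiplication mutant survives both tests, while the subtraction and raising mutants are killed. Adding a test with an argument such as 3 would kill the survivor, since 3 * 3 is not 3 + 3.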
00:09:22.290
Let's peel back another layer and look at some of the technical details of exactly how this works. First, our tool parses our code into an abstract syntax tree.
00:09:35.670
Some of those boxes may be a bit small to read from the back there, but we don’t need to understand this AST in detail.
00:09:45.940
Some tools do it slightly differently; they work against bytecode rather than an AST. Some proof-of-concept tools even work on actual raw text.
00:09:58.780
But the majority of them use an AST, so we'll roll with that. After our tool creates an AST, it traverses it looking for subtrees or branches representing our methods.
00:10:13.940
After it finds one, it looks for that method's tests. But how does it look for the tests? It usually relies on us developers either annotating our tests or following a naming convention.
00:10:27.570
Sometimes, this is augmented or even completely replaced by the tool examining which tests call which methods.
00:10:40.250
This can get tricky if a method is not called directly from the test. Once the tool has found all the method's tests, it generates the mutants.
00:10:55.440
How does it create mutants from an AST? It traverses that subtree just like it did to the whole tree. But now, instead of looking for smaller subtrees, it's looking for nodes where something can be changed.
00:11:11.150
For each one it finds and for each way it can change that node, it makes a fresh copy of that method's AST subtree.
00:11:25.470
Suppose our tool has traversed the AST and has gotten to a not-equal comparison. It will create a fresh copy of that entire subtree with only that node changed in that particular way.
00:11:39.650
Once it's done doing that for that node, it continues traversing the rest of the subtree and does likewise for the other nodes.
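As a rough sketch of that traversal, the code below walks a toy AST and returns one full copy of the method's subtree per possible single-node change. Everything here is an assumption for illustration: the nested-array node shapes, the `ALTERNATIVES` table, and the `mutants_of` helper are made up, and a real tool would use its parser's node objects and a much larger set of changes.

```ruby
# Toy AST as nested arrays; real tools use a parser's node objects instead.
# The node names and the table of alternatives are assumptions.
ALTERNATIVES = {
  :"!=" => [:==, :<, :>],
  :+    => [:-, :*]
}.freeze

# Returns one complete copy of `tree` per possible single-node change.
def mutants_of(tree)
  return [] unless tree.is_a?(Array)

  results = []
  # Change this node itself, if it is an operator call we know how to alter.
  if tree[0] == :send
    ALTERNATIVES.fetch(tree[1], []).each do |op|
      results << [tree[0], op, *tree[2..]]
    end
  end
  # Then, keeping this node as-is, change exactly one thing in one child.
  tree.each_with_index do |child, i|
    mutants_of(child).each do |mutated_child|
      copy = tree.dup
      copy[i] = mutated_child
      results << copy
    end
  end
  results
end

# Toy tree for: def differ?(a, b); (a + 1) != b; end
METHOD_AST = [:def, :differ?, [:args, :a, :b],
              [:send, :"!=", [:send, :+, [:lvar, :a], [:int, 1]], [:lvar, :b]]].freeze

mutants_of(METHOD_AST).each { |mutant| p mutant }
```

Here the not-equal node yields three mutants and the addition node yields two, so five copies come back, each differing from the original in exactly one node.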
00:11:52.370
I've said a lot about changes it can make, but what kind of changes are we talking about? There are quite a lot.
00:12:03.630
It could change a mathematical, logical, or bitwise operator from one to another. In languages and situations where it can do so, it could even substitute one of a different category.
00:12:17.900
For example, in Ruby, we can treat pretty much anything as a boolean, so X plus Y could be changed to X and Y, or X or Y. It can also change the order of operands in situations where that matters.
00:12:38.580
Most tools are smart enough not to bother doing this with addition, but they will do that with subtraction, division, and exponentiation.
00:12:51.320
It could also change a comparison from one to another; it could insert or remove a logical, mathematical, or bitwise negation.
00:13:03.970
It can change a constant, variable, or function call to a completely different value, possibly of a different type. For instance, it might change a number to a string.
00:13:18.650
It can remove entire lines of code, or remove a condition or a loop control. It could scramble or truncate arguments in a method call or a method declaration.
00:13:32.080
It can replace a method's entire body with just a constant, or either of the arguments, or trigger a deliberate error, or do nothing at all if the language permits.
00:13:48.370
There are many more kinds of changes it could make, but I hope you get the idea by now. From here on, I don’t want to add any more low-level details.
00:14:02.060
Let’s look at some examples. We'll start with an easy one. Suppose we have a method like this.
00:14:11.970
It's pretty simple, but think about what a mutant made from this might return. These are the values our unit tests are probably going to be looking at.
00:14:22.180
There are lots and lots of possibilities, but mainly it could return all kinds of values, like any of these expressions or constants, and many more.
00:14:39.050
Suppose we have only one test like this. This is a pretty bad test, but even so, of the mutants shown here, most would get killed.
00:14:54.530
The ones returning constants are quite unlikely to match; for example, '4' is not a very common constant to randomly come up with.
00:15:04.960
Subtracting will yield zero; dividing gets us one; and returning either argument alone gets us two.
00:15:14.560
Even the ones that deliberately raise an error or accidentally do so will at least make the test fail.
00:15:27.700
But addition, multiplication, and exponentiation in reverse order can yield the right answer, and therefore will survive that particular test.
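Since the slides are not part of this transcript, here is a reconstruction of the example as a hedged sketch. The `power` method body, the list of mutants, and the helper `survives_weak_test?` are assumptions chosen to match the description: one test checking that two to the second power is four, and mutants modeled as lambdas with exactly one change each.

```ruby
# Reconstruction for illustration; the slide's actual code is not in the transcript.
def power(x, y)
  x ** y
end

# Each mutant is a stand-in for `power` with exactly one change applied.
POWER_MUTANTS = {
  "swap arguments" => ->(y, x) { x ** y },
  "** -> +"        => ->(x, y) { x + y },
  "** -> *"        => ->(x, y) { x * y },
  "swap operands"  => ->(x, y) { y ** x },
  "** -> -"        => ->(x, y) { x - y },
  "** -> /"        => ->(x, y) { x / y },
  "return x"       => ->(x, _y) { x },
  "return 0"       => ->(_x, _y) { 0 },
  "raise"          => ->(_x, _y) { raise "deliberate error" }
}.freeze

# The weak test: power(2, 2) should be 4. An error counts as a failure.
def survives_weak_test?(impl)
  impl.call(2, 2) == 4
rescue StandardError
  false
end

SURVIVORS = POWER_MUTANTS.select { |_name, impl| survives_weak_test?(impl) }.keys
puts SURVIVORS
```

With these assumed mutants, the survivors are the two swaps plus the addition and multiplication mutants, because every one of them still returns 4 for the arguments 2 and 2.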
00:15:42.350
We know this because we run a mutation testing tool, and it gives us a report that looks something like this. The exact words and formatting vary widely depending on which tool we use.
00:16:01.520
To unpack the information it provides: it tells us that if we changed the method called 'power', which is in 'demo.rb', starting at line 42 in any of four different ways, then all its unit tests would still pass.
00:16:14.590
Those four ways are: changing line 42 to swap the order of the arguments, or changing line 43 to turn exponentiation into addition or multiplication, or to swap the order of the operands.
00:16:29.960
Pretty straightforward, right? So, what is this set of surviving mutants trying to tell us? A good start to figuring that out is to ask ourselves how these mutants are surviving.
00:16:43.330
What is it about the code that lets all our unit tests, or in this case, our one and only unit test, still pass? The usual answer is that they give the same result or have the same side effects as our original method.
00:16:59.360
To determine how that happens, one useful technique is to look at at least one mutant together with at least one of the unit tests it passes.
00:17:14.450
Let's start with the plus mutant. Looking at that change, changing exponentiation to addition makes it fairly clear that this mutant survives because two to the second power equals two plus two, which both yield four.
00:17:30.560
How do we kill this mutant? The main way to kill a mutant is to ensure you have a test that can fail. You can either change the existing test or add a new one where the two expressions produce different results.
00:17:45.430
In other words, we just need some X and Y for which X to the Y does not equal X plus Y. For instance, we could add or change a test to say two to the third power equals eight, while two plus three equals five, which is not eight.
00:18:00.520
This kills the plus mutant. Two times three is also not eight, so it kills the times mutant too. Three squared equals nine, again not eight, so that kills the argument swapping mutants as well.
00:18:17.640
We didn't have to attack them all at once; we could add an intermediate test that says two to the fourth equals sixteen, which would kill the plus and times mutants.
00:18:30.710
Then, we could add this or change the test to also kill the argument swapping mutants.
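To compare those candidate tests side by side, the sketch below checks which of the four surviving mutants still give the expected answer for each choice of arguments. The lambdas and the helper `survivors_of` are hypothetical stand-ins for illustration, not output from any real tool.

```ruby
# The four survivors of the weak test, as stand-ins for power(x, y) = x ** y.
SURVIVING_MUTANTS = {
  "swap arguments" => ->(y, x) { x ** y },
  "** -> +"        => ->(x, y) { x + y },
  "** -> *"        => ->(x, y) { x * y },
  "swap operands"  => ->(x, y) { y ** x }
}.freeze

# Names of the mutants that still return `expected` for the given arguments.
def survivors_of(x, y, expected)
  SURVIVING_MUTANTS.select { |_name, impl| impl.call(x, y) == expected }.keys
end

p survivors_of(2, 2, 4)   # all four survive
p survivors_of(2, 4, 16)  # only the two swaps survive: 4 ** 2 is also 16
p survivors_of(2, 3, 8)   # none survive
```

The intermediate test with 2 and 4 shows why argument choice matters: it kills the plus and times mutants, but since four squared is also sixteen, the swapping mutants need a test like 2 and 3.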
00:18:40.700
Now, this may make mutation testing sound simple, but this was a nearly trivial example. It's generally easy to come up with new arguments to make all reasonable mutants return a different value.
00:18:55.230
There are lots of ways to handle this, so let’s look at a more complex example. Suppose we have a method that sends a message.
00:19:09.750
This method loops over and over, sending in each chunk as much as 'send_bytes' can handle in one shot, picking up where it left off until the whole thing has been sent.
00:19:21.970
A mutation testing tool can create many different mutants from this method. Of particular interest would be a mutant that removes a loop control.
00:19:39.320
Suppose that this mutant does indeed survive our test suite, which mainly consists of this. There's a little more to it, but even without seeing those parts, the survival of the non-looping mutant tells us something.
00:19:54.590
It indicates that if a mutant that only goes through the loop body once acts the same as far as our test suite can tell as the original code, then our tests are not making the original code go through that loop body more than once.
00:20:08.240
The question then becomes: What does that mean? You'll find that interpreting what mutants are trying to tell you involves recursively asking yourself, 'What does that mean?'
00:20:24.440
In this case, it means we're not testing sending a message larger than what 'send_bytes' can handle in one chunk.
00:20:33.730
For instance, here, the maximum chunk size that 'send_bytes' can handle in one shot is 10,000 bytes. If we're testing with only a small three-byte message, it’s only going to go through that loop once.
00:20:47.700
How do we fix that? It should be easy, assuming there's nothing peculiar about the code preventing us from doing so. We should take the maximum chunk size as declared, presumably publicly, add one, create that big message, and try to send it.
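The following is a hedged reconstruction of that situation, since the slide code is not in the transcript. `MAX_CHUNK`, the fake `send_bytes` that writes into a string, both versions of `send_message`, and the helper `delivers?` are all assumptions made up for this sketch. It shows a three-byte message letting the non-looping mutant survive, and a message one byte over the maximum chunk size killing it.

```ruby
MAX_CHUNK = 10_000 # assumed limit of what send_bytes accepts per call

# Hypothetical stand-in for the real send_bytes: appends at most MAX_CHUNK
# bytes to `wire` and returns how many bytes it actually took.
def send_bytes(wire, bytes)
  chunk = bytes[0, MAX_CHUNK]
  wire << chunk
  chunk.length
end

# Reconstruction of the method under test: loop until everything is sent.
def send_message(wire, message)
  sent = 0
  while sent < message.length
    sent += send_bytes(wire, message[sent..])
  end
  sent
end

# The non-looping mutant: the loop control is gone, so the body runs once.
def send_message_mutant(wire, message)
  sent = 0
  sent += send_bytes(wire, message[sent..])
  sent
end

# True when the named implementation delivers the whole message.
def delivers?(implementation, message)
  wire = +""
  method(implementation).call(wire, message)
  wire == message
end

small = "abc"
big   = "x" * (MAX_CHUNK + 1)

puts delivers?(:send_message_mutant, small) # true: the mutant survives
puts delivers?(:send_message_mutant, big)   # false: the mutant is killed
puts delivers?(:send_message, big)          # true: the original still passes
```

The design point is that the test message's size is derived from the declared maximum rather than hard-coded, so the test keeps forcing a second pass through the loop even if the limit changes.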
00:21:02.220
However, paraphrasing Shakespeare: 'The fault, dear Nashville, is not in our tests, but in our code that allows these mutants to survive.' Perhaps we did try sending with the largest permissible message from a set of predefined sizes.
00:21:17.620
For instance, here we have smaller and larger messages. If we tried sending a large message and the non-looping mutant still survives, what could that possibly mean?
00:21:31.770
It's trying to tell us that a version of the 'send_message' with the looping removed will do the job just fine. This is an example of mutation testing finding redundant code.
00:21:45.340
If we strip out all the other stuff that was needed only to support the looping, it boils down to this, and now it's really clear that the ultimate message is that 'send_message' may well be entirely redundant.
00:22:01.570
However, I say it may be redundant, not that it is. In real-world scenarios, there may be some logging and error handling that needs to be in 'send_message'.
00:22:16.220
But at the very least, the looping was redundant. So what do we do about it? We can either chop out the looping or, if we don't need anything extra in 'send_message', we can eliminate it altogether.
00:22:28.760
This will make our code more maintainable by getting rid of unnecessary clutter. Fortunately, when dealing with this kind of situation, the fix is often pretty clear.
00:22:43.780
Now that we’ve seen a few examples of spotting bad tests and redundant code, let’s address some frequently asked questions.
00:22:56.230
First, this all sounds pretty crazy: literally making tests fail to prove that the code succeeds. Where does this bizarro idea come from?
00:23:10.240
Mutation testing has a surprisingly long history in the context of computers. It was first proposed in 1971 in Richard Lipton's term paper titled 'Fault Diagnosis of Computer Programs.'
00:23:22.860
The first tool didn’t appear until nine years later, in 1980, as part of Timothy Budd's PhD work at Yale. After that, it still wasn't practical for most of us, running consumer-grade desktop machines.
00:23:39.960
Only with advances in CPU speed, multi-core CPUs, and more memory over the past couple of decades did it become feasible.
00:23:56.760
That leads us to the next question: why is it so CPU-intensive? To answer that, we have to do some basic math.
00:24:09.840
Suppose our program has methods with about 10 lines each, and each line has about five places it could be changed to approximately 20 different alternatives.
00:24:27.520
That works out to about a thousand mutants per function. For each one, we probably have to run at least a few unit tests.
00:24:40.469
If we get lucky and kill it on the first shot, we might run just one test; otherwise, we run all applicable tests for that method.
00:24:56.600
If we run just 10% of a method's unit tests against a typical mutant, that's still a hundred times more test runs than running our normal test suite.
00:25:10.240
If our normal unit tests take about ten seconds, mutation testing with these assumptions might take about a thousand seconds, which is almost 17 minutes.
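The back-of-the-envelope arithmetic above can be written out as follows. The constant names are made up for this sketch, and the numbers are just the talk's stated assumptions, not measurements.

```ruby
LINES_PER_METHOD      = 10
CHANGE_SITES_PER_LINE = 5
ALTERNATIVES_PER_SITE = 20
PERCENT_OF_TESTS_RUN  = 10 # share of a method's tests run per mutant
NORMAL_SUITE_SECONDS  = 10

# 10 lines * 5 places * 20 alternatives
MUTANTS_PER_METHOD = LINES_PER_METHOD * CHANGE_SITES_PER_LINE * ALTERNATIVES_PER_SITE

# Each mutant runs a tenth of the tests, so the suite effectively runs this many times.
SLOWDOWN = MUTANTS_PER_METHOD * PERCENT_OF_TESTS_RUN / 100

MUTATION_SECONDS = NORMAL_SUITE_SECONDS * SLOWDOWN

puts "mutants per method: #{MUTANTS_PER_METHOD}"
puts "test runs vs. normal suite: #{SLOWDOWN}x"
puts "estimated time: #{MUTATION_SECONDS} s (#{(MUTATION_SECONDS / 60.0).round(1)} min)"
```

Changing any one assumption shows how sensitive the estimate is: halving the alternatives per change site, for example, halves the whole run.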
00:25:22.440
But there is some good news. Over the past decade, there’s been a lot of research aimed at trimming down the mutant horde.
00:25:35.910
Techniques include weeding out those that are redundant with another mutant, semantically equivalent to the original code, or otherwise trivial.
00:25:50.280
These have squeezed it down to about a third under optimal circumstances. However, even with that, this is still not a silver bullet.
00:26:05.640
The leftover mutants are still considerable. The last common question is: "When making mutants, why change them in only one way?"
00:26:19.750
There are a few reasons. First, it helps us focus. It's much easier to tell what a mutant is trying to say if we're only dealing with one change.
00:26:34.599
You can think of it like using the single-responsibility principle. Another reason is that multiple changes may balance each other out, leading to false alarms.
00:26:47.649
For instance, remember that first trivial example with the argument-swapping mutants? If one mutation swapped the arguments, while another swapped them back, it would lead to no net effect.
00:27:01.140
Lastly, allowing multiple mutations would create a combinatorial explosion of mutants, severely increasing the CPU workload.
00:27:14.320
To spare you the math: using the earlier code-size assumptions, if we managed to reduce the mutants to a third, then with one mutation per mutant we would have about 333 mutants per method.
00:27:29.950
With multiple mutations per mutant, multiplying this through, we would end up with incredibly high quantities: hundreds of thousands to millions of mutants.
00:27:43.210
We can avoid the increased workload and lack of focus if we restrict it to one mutation per mutant. So, to summarize: mutation testing is a powerful tool.
00:27:56.050
It ensures our code is meaningful and our tests are strict. It's relatively easy to set up the tools and annotate tests, which may be tedious but manageable.
00:28:06.660
However, it’s not very easy to interpret results, nor is it easy on the CPU. Even if these drawbacks mean it's not well-suited for our current projects, I still think it's a cool idea.
00:28:20.520
If you'd like to try mutation testing for yourself, here’s a list of some popular tools for various languages and frameworks.
00:28:33.800
Regarding Ruby, 'Mutant' is the main one, though they are moving to a closed-source paid model, leaving the last open-source version available for free.
00:28:49.320
I'm not entirely sure, but if I remember correctly, open-source projects can use even the paid version for free. 'MuTest' is a fork of an old version of 'Mutant' with continued work on it.
00:29:05.380
It has some features that 'Mutant' doesn’t and vice-versa, but it’s only compatible with RSpec rather than MiniTest.
00:29:19.220
My current main client uses MiniTest, so I didn’t explore MuTest much. 'Heckle' hasn't been updated since 2013, so good luck with that.
00:29:32.300
For the rest of them, just be aware that many may be outdated. I don’t know if all the listed tools are current or if they still support those languages.
00:29:46.020
Okay, has everyone finished taking pictures? Moving on, I'd like to give a shout-out to Toptal, a consulting network I'm part of, whose speakers' network helped me prepare this presentation.
00:30:01.660
Please use my referral link there if you want to hire us or join us. And many thanks to Markus Schirp, who created 'Mutant', the main mutation testing tool I've used.
00:30:16.240
He has been very willing to answer my questions and critique my presentation, and now it’s your turn. We have about five and a half minutes for questions.
00:30:30.040
If you think of something later, here’s my contact information up there, and of course, I have cards and will be around for the rest of the conference.
00:30:44.670
Any questions? Okay, first question: How long does it take to get up and running with mutation testing on a project?
00:31:00.090
To get your first report, I suppose you could make a decent dent in it by just installing the tool and running it. How long that takes depends on the size of your codebase and the tool you're using.
00:31:17.240
If it skips methods with no tests, that can impact the time. I would guess that with a not too huge Rails app, you could have an actionable set of mutant reports in under an hour.
00:31:30.520
The most time-consuming part, at least in my limited experience, has been waiting for the tool to run rather than the setup.
00:31:42.360
You could run it during a lunch break or overnight. The harder part is understanding what the mutants are trying to tell you, which is why I consider this an advanced topic.
00:31:55.160
Much of it will take experience. Okay, next question: When killing mutants, do I think it’s more of a best practice to optimize a given test or add additional tests?
00:32:07.160
I'd say that’s a subset of whether one does that overall in tests. I generally write tests for specific situations I’m trying to ensure are covered.
00:32:19.600
For instance, considering the initial 'power' method example, I might want some tests like: What if someone passes in zero? Or something else invalid? And, of course, some for normal usage.
00:32:33.160
I'd try to have one or two normal-usage tests set up, not too many, with arguments chosen so that they don't let a lot of mutants survive.
00:32:46.270
It might be difficult with some functions to have one test kill a mass of mutants, but as I said, you don’t need to be a superhero about it.
00:33:00.740
You can have one test eliminate that batch of mutants and another test take out another batch. They may overlap a bit, but don't overthink it too much.
00:33:14.260
Okay, and one last question: Do 'Mutant' and 'MuTest' patch the Ruby interpreter to make this work?
00:33:29.480
I don’t think they do anything to the copy you're actively using for most of your normal Ruby work. I haven't looked into how they work under the hood.
00:33:43.160
They might make a copy and tweak that. I’ve done some custom parsing in that domain. What I've seen usually does not affect the standard interpreter usage.
00:34:00.310
We have 45 seconds left. Does anyone have a quick question? If not, I guess we all get a little lead on the afternoon snacks!