00:00:13.019
Ah, really warm applause, thank you.
00:00:19.080
Um, I would like to introduce a funny topic. It's called mutation testing.
00:00:24.779
I'm really proud that there are so many people in the room. Sorry, um, okay, who saw this talk at eurucamp?
00:00:33.180
Ah, some hands. Okay, basically, it's the same talk, but the background is now not black anymore.
00:00:38.520
So that's what I changed, and I'm really proud it looks better now.
00:00:44.520
So I'm going to talk about a testing strategy called mutation testing.
00:00:51.120
It's hard to explain, so I'll try not to rely too much on slides because actually I'm not really good at creating good slides.
00:00:56.460
I hope you will interrupt me in case I go too fast or use terms that need explanation.
00:01:03.059
That's the setup of talks I really like.
00:01:08.460
If there were no questions, I would pick random people in the audience and say, 'Okay, what's the AST I'm talking about? Tell me.'
00:01:14.220
So, let's start. Mutation testing is about testing your tests.
00:01:19.380
Why should tests be tested? We all say, 'Okay, we write code, we do TDD, and once we have the tests green, we can say: good job, let's move on, deploy it.'
00:01:26.820
Yes, it's not that easy.
00:01:33.240
Okay, I started testing years ago. I hated it, seriously, I hated it.
00:01:39.540
When I started testing, I was always freaking out.
00:01:45.479
Oh no, I just hacked for five minutes. I only found holes.
00:01:51.000
Typically hours, to get something working. Now an HTML site is presented on the screen, I'm happy, I need to ship this.
00:01:56.340
Then someone told me I needed to test it, not me.
00:02:02.520
I tested it; I ran it multiple times in my browser, so it must work.
00:02:08.880
That was before I finally got into TDD.
00:02:14.760
Then I started to do too much testing.
00:02:20.940
I basically tested simple arithmetic primitives.
00:02:27.060
One plus one should equal two because I didn't want to stop.
00:02:32.760
I finally started testing, but I did too much. I wasn't identifying edge cases.
00:02:39.540
I was just enumerating everything I could enumerate because I thought this is testing.
00:02:45.959
So it didn't work out. Over time, I discovered testing tools.
00:02:51.120
Oh, what a cool thing! Test metrics, some automated way to verify if I tested enough.
00:02:57.660
Cool, let's achieve it. Let's try to use test metrics.
00:03:03.540
The first thing was pretty lame. I said, okay, there is a ratio between lines of code written and lines of test code.
00:03:09.720
Yeah, if it's close to some specific number or whatever, it's good. I'm done with testing.
00:03:15.060
Because it's enough, okay? It's a joke.
00:03:22.260
Then I saw the line coverage thing, and the line coverage thing looked pretty good.
00:03:28.500
It showed me a nice report. Some lines had a colored background. Yeah, green; that's probably because the light is on.
00:03:34.140
When I see 100% coverage, I'm done; I can ship my code.
00:03:40.680
Now, that's not actually true, because here's the ternary operator.
00:03:46.980
This test case gives 100% line coverage: I only have to put in true, and only one of these branches gets tested.
00:03:53.940
So it might still blow up and it happened to me a lot.
00:04:00.480
I was the king of pushing typos because the right side was basically okay.
00:04:05.940
Now it's a little bit simple, but it could be any mistyped method call and would blow up in production.
00:04:12.659
Because your users are actually fuzzing your application all the time.
00:04:17.880
If you don't have any users, Google will parse your application. So line coverage doesn't give you anything.
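A minimal Ruby sketch of the line coverage problem described here. The method name `greeting`, its strings, and the deliberate typo `upcasee` are hypothetical and not taken from the slide.

```ruby
# Hypothetical method: both branches live on one line, so a line
# coverage tool marks the line as covered once either branch has run.
def greeting(input)
  input ? 'hello' : 'hello'.upcasee # the typo hides in the untested branch
end

# The only test: it passes and the single line counts as covered.
raise 'unexpected result' unless greeting(true) == 'hello'

# The branch no test exercises still blows up at runtime.
begin
  greeting(false)
rescue NoMethodError => e
  puts "production failure: #{e.class}"
end
```

The report would show this method as fully covered even though half of its behavior never ran under test.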
00:04:24.240
Then I did some formal research and I found the branch coverage—a really cool thing.
00:04:31.620
We don't have any tooling for it in Ruby, but there is well-tested code outside of Ruby.
00:04:36.780
We all use SQLite, which is branch covered, and it's basically one of the best-covered libraries we'll ever have in open source.
00:04:42.000
But actually, branch coverage and statement coverage are, in my opinion, the same.
00:04:48.780
We can argue later, but it also has some nasty properties.
00:04:57.000
If you have two statements next to each other, only the last side effect generated needs to be tested.
00:05:04.440
And the other side effect does not need any confirmation that it's actually intended.
00:05:11.060
So it's not a solution.
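A small Ruby sketch of the two-statement case. The class `Signup` and its fields are hypothetical, chosen only to show one verified and one unverified side effect.

```ruby
# Hypothetical example: two side effects next to each other.
class Signup
  attr_reader :log, :mails

  def initialize
    @log = []
    @mails = []
  end

  def call(user)
    log << "signup #{user}"    # side effect A
    mails << "welcome #{user}" # side effect B
  end
end

signup = Signup.new
signup.call('alice')

# The test asserts only side effect B, yet both statements were executed
# and therefore both count as covered.
raise 'mail missing' unless signup.mails == ['welcome alice']
puts 'green, although side effect A was never checked'
```

Deleting the line for side effect A would leave this test green, which is exactly the gap that statement coverage cannot see.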
00:05:16.680
Then I joined the ROM team, aka DataMapper 2, aka 'We won't finish it' or whatever.
00:05:23.699
Yeah, and there was a guy, his name is Dan.
00:05:30.600
Kubb, and he said, 'Markus, your tests suck! Please use, uh, try to read the code I wrote.'
00:05:36.300
And that's mutation coverage.
00:05:41.639
I said, 'Oh my God! What's that?' and I ran this test.
00:05:48.720
I was totally freaking out—it was so nice!
00:05:55.080
Because actually, mutation coverage turned out to subsume all the metrics I just showed.
00:06:02.160
If you apply it right, you can easily beat all line coverage problems.
00:06:08.220
You can beat the statement coverage problems.
00:06:15.120
It worked, but it had some shortcomings, especially the mutation tester we used at that time.
00:06:21.900
So over time, I realized that, okay, I cannot simply blame a bad mutation tester because I didn't write it.
00:06:28.020
So I had to fend for myself and write one, and that resulted in the mutant project.
00:06:34.560
Let's dive a little bit more into that.
00:06:41.640
Okay, I don't want to rely on slides too much.
00:06:48.720
So let's go into some formal definitions.
00:06:55.680
Mutation testing is not as easy to explain as line coverage or statement coverage.
00:07:02.520
We need some agreement on names.
00:07:09.960
In the next 10 or 15 minutes, I will talk about alive mutants and killed mutants.
00:07:17.880
A mutant itself is just a variation of your code.
00:07:24.240
If you change a literal one to a literal two, the literal two is a mutation.
00:07:31.260
That's quite easy to understand.
00:07:38.640
The problem is mutation testing changes your code in an automated manner.
00:07:46.080
It runs the tests and inverts the meaning of the tests.
00:07:52.800
If a test fails after the change, we basically count the mutation as killed.
00:07:59.520
If any change slips through the test after it has been automatically introduced, then it is called alive.
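The killed and alive terminology can be sketched in a few lines of Ruby. Everything here is hypothetical: the mutants are written by hand as lambdas, whereas a real tool derives them automatically from the code.

```ruby
# Original implementation plus two hand-written variants ("mutants").
ORIGINAL = ->(a, b) { a + b }
MUTANTS = {
  'replace + with -' => ->(a, b) { a - b },
  'replace + with *' => ->(a, b) { a * b }
}.freeze

# A weak test: returns true when the implementation passes.
TEST = ->(add) { add.call(2, 2) == 4 }

raise 'tests must be green before mutating' unless TEST.call(ORIGINAL)

# A mutant is killed when the test fails against it, alive otherwise.
REPORT = MUTANTS.transform_values do |mutant|
  TEST.call(mutant) ? :alive : :killed
end

REPORT.each { |name, state| puts "#{name}: #{state}" }
```

The multiplication mutant stays alive because 2 * 2 and 2 + 2 agree; a test with different operands would kill it.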
00:08:06.180
Because, as we all know from the movies, mutants should be dead.
00:08:12.960
So I never saw a good mutant.
00:08:20.220
Okay, good. There are also several specific mutations which should not be killed.
00:08:25.800
We can go into that later.
00:08:33.600
When we want to mutate code, it's not about doing substitutions on the string representation of code.
00:08:41.160
Because you will break the syntax, and you're not really sure what you're doing.
00:08:48.360
So mutation testing must be outside.
00:08:55.920
I saw tools for the Go language and many other languages that just grep through the code and exchange a minus with a plus.
00:09:02.520
That is not a durable strategy for a mutation tester.
00:09:09.420
The semantics of today's programming languages, especially Ruby, are incredibly complex.
00:09:16.920
So you need a transformable representation, and the transformable representation is the AST.
00:09:24.300
Who knows what the AST is?
00:09:31.140
Yeah, really nice! So just set the advanced shop to transform.
00:09:36.840
Um, it was not full coverage here.
00:09:42.600
So I will just go into the AST without a slide because I don't know how to do slides in the right way.
00:09:49.740
The AST is the abstract syntax tree.
00:09:56.220
If you ever saw one at university, you probably saw it for one plus one, with the plus at the top.
00:10:02.700
The left is one, and the right is also one—it's an abstract representation.
00:10:09.000
So if you parse code, you probably end up with an AST.
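To make the tree shape concrete, here is a small sketch using Ripper, the parser interface bundled with MRI. It is used only because it needs no extra gems; nothing here claims that mutant itself is built on Ripper, and the helper `first_node` is hypothetical.

```ruby
require 'ripper'

# Returns the first statement of the parsed source as a nested array.
def first_node(source)
  Ripper.sexp(source)[1][0]
end

type, left, operator, right = first_node('1 + 1')

puts type     # kind of node
puts operator # the operator sits at the top of this subtree
puts left[1]  # left operand literal
puts right[1] # right operand literal
```

A mutation then becomes a change to one node of this structure instead of a string substitution on source text.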
00:10:16.200
If you parse Markdown, you end up with HTML.
00:10:23.940
Okay, there is another class of mutators you could use, but I did not go this route.
00:10:31.860
Because it would not have been portable in Ruby; we only have shared code.
00:10:39.240
We don't even have a shared AST, but the AST is somehow shareable across implementations.
00:10:46.560
So I chose to use an AST-based mutator.
00:10:53.520
Um, this is another mutation example. Probably the slide should have come earlier.
00:10:59.640
This is the same method I mentioned where we had the line coverage issue.
00:11:06.780
That's interesting because the mutation tester changed that input to true, a constant.
00:11:12.840
The one test we had before, which gave us 100% line coverage, would still pass.
00:11:19.260
So the mutation tester here identified a missing test.
00:11:25.440
So basically, yeah, cool, the missing statement was identified.
00:11:32.280
If you want to kill the mutations, you basically have to make sure that the other branch gets executed.
00:11:39.600
In this case, just add a new test or drop the code.
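A sketch of this mutation in Ruby. The lambdas are hypothetical stand-ins for the method on the slide: the mutant replaces the condition with the constant `true`, and tests are modeled as lambdas that receive an implementation.

```ruby
# Original method and the mutant with its condition replaced by true.
ORIGINAL_LABEL = ->(input) { input ? 'yes' : 'no' }
MUTATED_LABEL  = ->(_input) { true ? 'yes' : 'no' }

# The old suite only ever passes in true.
OLD_TESTS = [->(impl) { impl.call(true) == 'yes' }].freeze
# The new suite also exercises the other branch.
NEW_TESTS = (OLD_TESTS + [->(impl) { impl.call(false) == 'no' }]).freeze

# A mutant survives when every test still passes against it.
def survives?(tests, impl)
  tests.all? { |test| test.call(impl) }
end

puts "old suite, mutant survives: #{survives?(OLD_TESTS, MUTATED_LABEL)}"
puts "new suite, mutant survives: #{survives?(NEW_TESTS, MUTATED_LABEL)}"
```

The alternative the talk mentions, dropping the code, would mean deleting the untested branch so that no such mutant can be generated.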
00:11:46.740
Okay, there are many ways to mutate your code.
00:11:52.440
I cannot even enumerate all of them.
00:11:58.260
I tried once, but it's a deep rabbit hole.
00:12:05.520
Maybe I don't understand the domain well enough to come up with a definitive set of mutations.
00:12:11.760
I'm still discovering.
00:12:18.240
Yes! Those mutations typically only change the code in an automated way.
00:12:24.240
It tries to make the tests red. In case it cannot make the tests red, probably a test is missing or there is superfluous code.
00:12:30.840
That's the idea behind it.
00:12:37.560
So code coverage is a little bit of an unspecific term.
00:12:44.760
Okay, we can do many types of mutations.
00:12:51.060
We can add or change literals, and we can delete statements, which proves whether a statement has a side effect that the tests measure.
00:12:57.600
That solves the problem with pure statement coverage.
00:13:04.260
We can invert conditionals, replace binary operators, or delete arguments.
00:13:11.220
There are many other strategies. I cannot enumerate them all.
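A toy mutator over a hand-rolled AST, as a sketch of two of the mutation kinds listed above (changing literals and replacing binary operators). The node shapes `[:int, n]` and `[:send, left, operator, right]` and the `SWAPS` table are assumptions for this example and do not reflect mutant's real operator set.

```ruby
# Operator replacements used by this sketch.
SWAPS = { :+ => :-, :- => :+, :< => :>=, :== => :!= }.freeze

# Returns every tree that differs from the input by exactly one mutation.
def mutations(node)
  type, *children = node
  case type
  when :int
    value = children.first
    # Literal mutations: increment, and replace with zero.
    [[:int, value + 1], [:int, 0]].reject { |mutant| mutant == node }
  when :send
    left, operator, right = children
    swapped = SWAPS.key?(operator) ? [[:send, left, SWAPS[operator], right]] : []
    swapped +
      mutations(left).map { |l| [:send, l, operator, right] } +
      mutations(right).map { |r| [:send, left, operator, r] }
  else
    []
  end
end

TREE = [:send, [:int, 1], :+, [:int, 2]].freeze
mutations(TREE).each { |mutant| p mutant }
```

Each returned tree would be turned back into code and run against the tests, one mutation at a time.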
00:13:17.160
So let’s just move on. There is some real-world use of this stuff.
00:13:23.880
We tried to write ROM, the Ruby object mapper, and we tried to ensure it has 100% mutation coverage.
00:13:30.780
This is some measurement for the component we have; it's called Axiom.
00:13:38.340
It's the relational algebra behind it and I just ran it today.
00:13:45.960
It did not finish because mutant currently has some problems with this library.
00:13:51.899
Yeah, it's super slow.
00:13:59.640
It's a technique you don't want to run at your commit stage.
00:14:06.360
I wanted to cover that stuff later, but it's actually a really good point.
00:14:13.860
For each mutation, in the worst case, your whole test suite gets run.
00:14:19.920
I say your whole test suite, so if it touches the database to set up stuff.
00:14:25.140
Because you are on ActiveRecord, you basically have a super long runtime.
00:14:33.060
So if you write real unit tests, mutation testing can be really fast.
00:14:40.080
Suppose you write unit tests in a way that lets the mutation tester identify that there is a unit test for a given subject.
00:14:46.380
Then it can run only those specific tests, and it becomes quite fast.
00:14:53.520
For example, when we introduced mutation testing to Axiom, it had a runtime of a day.
00:14:59.640
Once we scoped the test execution to public interface tests, it went down to 30 minutes, which is manageable.
00:15:06.240
Okay, I just simply lost track, so I will just discover what comes next.
00:15:12.900
Perfect, okay.
00:15:19.320
Typically, mutation testers need to report the mutations in a way programmers can consume.
00:15:26.339
I chose to implement it as a diff because we all know how to consume a diff.
00:15:33.780
If you run a mutation tester and see 100,000 unkilled mutations, sometimes a single test can kill 50 of them.
00:15:40.680
The mutation tester itself is quite dumb. It just fuzzes your code. It's not fuzzing an input, it's just fuzzing code.
00:15:47.940
Okay, to run a mutation tester, you need to manage the CLI.
00:15:54.120
I don’t have any examples about the CLI here because I wanted to talk about the theory of mutation testing.
00:16:00.420
This talk is not really about mutant itself; it’s about what mutation testing is.
00:16:06.600
I want to get you, the audience, interested in trying it out.
00:16:12.840
There are many mutation testers, so I don’t want to start with mutant.
00:16:19.440
No, because all mutation testers are quite complex.
00:16:26.520
I can only scratch the surface in this format.
00:16:32.760
I really hope for questions.
00:16:39.240
Other questions? No questions? Okay.
00:16:45.420
Okay, that's your slide.
00:16:51.000
Write real unit tests; that's actually a nice property of mutation testing.
00:16:57.720
If mutation testing is too slow, you don't have unit tests. There's nothing to argue about.
00:17:03.120
Okay, test selection; I probably should have just gone forward to the test selection slide.
00:17:09.420
The isolation one is really interesting.
00:17:15.060
If you just randomly change your code and execute it, what can possibly go wrong in a dynamic language?
00:17:21.540
Say you have a test that dynamically generates a class, or the code under test generates classes and puts them somewhere in the VM.
00:17:27.960
It might leak, and all later tests could be totally screwed up. We could have artificially dead mutants.
00:17:34.560
There needs to be sandboxing.
00:17:41.760
The sandboxing in particular was the reason I wrote my own mutation tester.
00:17:48.240
Because Heckle did not have it, and once I had 100% mutation coverage with one test.
00:17:55.200
Basically, that one test had mutations that invalidated future tests.
00:18:01.920
So currently, I fork before injecting the mutant and measuring the effect; it works quite nicely.
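One way to get this kind of sandboxing is to run each mutation in a forked child process. The sketch below assumes a platform where `fork` is available (so not JRuby or Windows); the helper name `isolated` is hypothetical, and the code is an illustration, not mutant's implementation.

```ruby
# Runs the block in a forked child so anything it defines or patches
# disappears together with that process.
def isolated
  pid = fork do
    passed = begin
      yield
    rescue StandardError
      false
    end
    exit!(passed ? 0 : 1) # exit! skips at_exit hooks inherited from the parent
  end
  _, status = Process.wait2(pid)
  status.success?
end

result = isolated do
  Object.const_set(:LeakedClass, Class.new) # side effect stays in the child
  true
end

puts "block passed: #{result}"
puts "leaked into parent: #{Object.const_defined?(:LeakedClass)}"
```

The parent only learns the exit status, so a mutant that corrupts global state cannot make later mutants look dead.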
00:18:08.280
There are many other strategies, especially for JRuby.
00:18:14.880
We could probably just build a second runtime to isolate effects.
00:18:21.600
I'm not really sure.
00:18:27.600
Because there is no silver bullet, as you all know, there are real shortcomings in all mutation testers.
00:18:34.320
I have to mention here that there is the problem of equivalent code.
00:18:40.920
If a mutation tester mutates your code in a way that has exactly the same semantics, nobody can rescue you.
00:18:47.280
We cannot manually blacklist the mutation to say, okay, I as a human can decide that one should never be in the report again.
00:18:54.720
What's really nice in Ruby is that we have such a dense Enumerable API.
00:19:00.960
Most of the equivalent mutants typically occur here.
00:19:07.140
This would not happen in Ruby; you would just use 1.upto(10), and mutant would not generate an equivalent mutant here.
00:19:14.460
That's a really nice property.
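A sketch of an equivalent mutant in Ruby. The three methods are hypothetical: a manual counter loop, a mutated version of its condition, and the `1.upto(10)` form that has no loop condition to mutate.

```ruby
# Manual counter loop.
def sum_original
  total = 0
  i = 1
  while i <= 10
    total += i
    i += 1
  end
  total
end

# Mutant: `i <= 10` became `i != 11`. Because i only ever steps by one
# from 1, both conditions stop the loop at the same moment.
def sum_mutant
  total = 0
  i = 1
  while i != 11
    total += i
    i += 1
  end
  total
end

# Enumerable style: no comparison operator left to mutate.
def sum_upto
  1.upto(10).sum
end

puts sum_original == sum_mutant # no test can tell these two apart
puts sum_upto
```

Since the mutant behaves identically for every run, it stays alive forever and only a human can decide that it is harmless.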
00:19:19.800
Okay, then we have the infinite runtime problem.
00:19:27.180
I did not solve that currently and I'm not aware of any mutation testers that solve the problem.
00:19:34.620
Because if we mutate a conditional and that conditional controls a loop, that loop may never terminate.
00:19:42.000
The mutation tester will never terminate.
00:19:49.260
It's the halting problem; nobody can tell you whether you should count it as killed or not.
00:19:55.680
Currently, it doesn't happen much because of the enumerable API.
00:20:02.040
Because we control loop execution more in Ruby, this is not a big problem currently.
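A pragmatic workaround, sketched here with Ruby's standard `Timeout` module, is to cap the runtime of each mutation. The helper `run_mutant` and the choice to report a timeout as its own outcome are assumptions of this sketch; the talk leaves open how such a mutant should be counted.

```ruby
require 'timeout'

# Classifies one mutant run as :alive, :killed, or :timeout.
def run_mutant(limit_seconds)
  Timeout.timeout(limit_seconds) { yield ? :alive : :killed }
rescue Timeout::Error
  :timeout
end

# Mutant of a countdown loop: `i -= 1` became `i += 1`, so it never ends.
endless = lambda do
  i = 10
  i += 1 while i > 0
  true
end

puts run_mutant(0.2, &endless)
puts run_mutant(0.2) { false } # tests failed against the mutant
puts run_mutant(0.2) { true }  # tests still passed
```

A time limit does not solve the halting problem; it only bounds how long the tool waits before giving up on a mutant.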
00:20:08.220
Another example: someone codes a bug and writes a test for it, or implements a lookup table for 100 integers.
00:20:14.940
Mutation testing will not tell you that the implementation is incorrect.
00:20:21.960
Mutation testing can only ensure the coverage between intentions of the tester and the coder.
00:20:28.560
If someone wants to cheat, they can still cheat.
00:20:35.220
So it's not a silver bullet here.
00:20:41.280
There was a hunting cheat sheet; it's totally imperfect.
00:20:48.960
But basically, it boils down to whenever killing a mutation, think about not adding a test, but writing simpler code.
00:20:55.440
If there is a way to avoid a literal, do it. If there is a way to reduce the syntax, just do it.
00:21:03.060
It results in fewer mutations, and you will be happy afterward.
00:21:10.260
When to use mutation testing? I can't say.
00:21:17.100
I have commercial projects on Rails which are 100% mutation covered.
00:21:24.840
It takes a while; in that project, it proved some value.
00:21:32.160
I have projects where I tried to apply mutation testing, and it failed.
00:21:38.160
Mostly because we had a legacy test base which had too much remote code execution or inter-process communication.
00:21:45.840
The setup and teardown times ruined the experience.
00:21:53.040
That's up to the community to decide when mutation testing is appropriate.
00:22:00.540
For all mathematical, algebraic, and transformation domains which are self-contained, I can only recommend it.
00:22:07.680
If you want to try mutation testing on your project, mutation test the classes where money is involved.
00:22:14.640
It works; it's a fitting domain. It allows testing critical code.
00:22:21.180
Because your revenue calculation and promotion code stuff should have, in my opinion, potential.
00:22:28.080
It's all about doing the best to our knowledge.
00:22:35.460
So hopefully, you have more knowledge now and can run home to your product owner.
00:22:42.720
And say, 'Okay, I can't stand code without mutation testing anymore.'
00:22:49.860
I would love to see some questions.
00:22:58.920
Yes, this statement stuff, yeah.
00:23:05.340
So I'll wait for him to get to the slide.
00:23:11.520
My question is about this slide. We have two statements: side effect A and side effect B.
00:23:17.220
How would mutation testing cover the fact that maybe side effect A should happen?
00:23:24.240
It would delete side effect A and run the tests; that's statement deletion. But then you'd have to ensure that the test actually tests the side effect.
00:23:30.840
Yes, exactly. You will notice a live mutant.
00:23:36.840
You will see in the diff just a removed line.
00:23:42.480
You, as a programmer, should ask, 'Hey, why can the mutation tester remove that line without my tests noticing?'
00:23:48.720
Let's go on.
00:23:55.920
There are some authors who write that it's not a good idea to aim for 100% line coverage.
00:24:02.520
Because it's too expensive. Could you tell us what we should aim for with mutation coverage?
00:24:09.540
I personally aim for 100%, but that's because of the loopback effect.
00:24:15.720
It's my tool; I try to fulfill the tool's intention.
00:24:22.320
But I would aim for 100% coverage of core classes.
00:24:29.040
If there's a tricky class, or even a group of classes within a domain, that should be 100% covered.
00:24:36.960
It's a great way to ensure testing on the most important business path.
00:24:43.800
I got a question; how do these mutations work when you're mutating the code right here?
00:24:54.960
You're basically showing basic mutations where you mutate the code.
00:25:02.280
But there's also practices where you mutate the tests; QuickCheck is the best example.
00:25:09.960
Yes, do you have any experience with this?
00:25:16.920
I tried to use QuickCheck, and I failed with the setup, so I have zero experience with QuickCheck.
00:25:24.840
But I know it's closely related to mutation testing.
00:25:31.920
I think if we had a more introspectable invariant definition...
00:25:39.240
I'm not talking about RSpec here; RSpec does not have that.
00:25:45.960
We would have to parse the code again and assume certain things.
00:25:53.520
But if we have introspectable predicates on various states of objects under test, we could do something like QuickCheck.
00:25:59.760
So maybe as a follow-up, do you think Ruby is a language where you can do invariant checking sensibly?
00:26:06.120
In my opinion, yes. Ruby is the language where, to my knowledge, mutation testing performs best.
00:26:13.200
It excels under mutation testing because it loads code dynamically so effectively.
00:26:19.680
Injecting a mutant is nothing more significant than a require in Ruby.
00:26:26.760
So it has a good optimized performance.
00:26:33.720
You gave an example of a project where mutation testing failed.
00:26:41.520
Can you give an example of where mutation testing was successful?
00:26:48.900
Yes, I need to include that stuff in the slide.
00:26:55.560
The person who wrote the Axiom library, mutation-covered Axiom, wrote a fuzzer.
00:27:03.780
They wrote an SQL generator and plugged the fuzzer to generate random mutations.
00:27:10.620
It was able to generate SQL, serialize it, and apply it to SQLite, and it found a bug in SQLite.
00:27:17.880
Yes, it was fixed upstream.
00:27:26.760
I have to include the references in the slides.
00:27:33.840
More questions back there? Yeah.
00:27:38.520
Hey, do you have any statistics regarding how many mutants were killed by given tests, and also how many tests killed a given mutant?
00:27:45.180
I only have the first direction you mentioned.
00:27:52.320
I know how many mutants I should have included, but today it did not finish.
00:27:59.760
So, I couldn't update my slides.
00:28:06.000
So I know what mutants are dead on a given subject in the whole project.
00:28:12.720
I don't know the inverse; that's basically a shortcoming of me doing the RSpec integration.
00:28:19.140
We could do that, but in my opinion, we should do something like QuickCheck.
00:28:25.920
There, the measurements would become even better.
00:28:32.420
Hi, my question is, because you've already said that it's kind of time-consuming to run mutation tests:
00:28:40.020
Where do you think they fit into the development process?
00:28:46.560
I run them locally.
00:28:53.040
I know I should probably write a Guard plugin so it knows I changed that class and says, let's mutation cover it.
00:29:00.240
It will take a minute before I check in, so I have an excuse to get a new coffee.
00:29:06.840
But I recommend running it on stage two.
00:29:12.480
If you have a normal CI setup, a multi-stage setup, there is a standard unit test.
00:29:19.200
I run them before my headless test.
00:29:25.620
That works best for me because identifying a problem on a headless test is much more costly.
00:29:31.680
It is much simpler to identify a problem that gets reported in a diff.
00:29:37.200
So I arrange the chain of tests in the way that reflects how problematic it would be to track down a problem.
00:29:44.520
Thanks.
00:29:51.600
Yeah, no problem.
00:29:58.080
So I have two questions. The first one is when you started doing mutation testing, what was the most common reason why your tests failed?
00:30:05.160
What were the things that were being changed most often that might lead to the tests failing?
00:30:11.280
It took me a while to learn to infer the connection between the behavior change introduced and the test.
00:30:18.600
And why the test does not kill it.
00:30:24.000
So I ended up basically doing the same mutation just firing it up.
00:30:30.540
I tried to make a dead mutation by hand with my normal TDD cycles.
00:30:36.600
I said, 'Okay, I will keep this mutation if I can't prove it wrong, or I will just find a way to refactor my tests.'
00:30:43.560
If I make them red, I would then change the code back.
00:30:50.520
At that time, I realized Git is really nice.
00:30:57.120
So my second question is more about the multi-stage testing setup that you mentioned.
00:31:03.840
Is it part of your content where you can switch out the different implementations so the tests run faster?
00:31:10.560
Do you think integrating mutation testing into a workflow encourages you to separate
00:31:17.880
the concerns more and to get to a point where you have a more multi-stage testing environment?
00:31:24.840
Absolutely! I found that good code is easy to mutation cover.
00:31:32.520
I have already sneaked mutation testing into various projects.
00:31:39.240
I started with the core classes, then expanded.
00:31:46.380
However, I realized, okay, I want to have this class mutation covered too.
00:31:52.920
It generated a significant number of mutations I could not test. It was better to refactor the class.
00:31:59.160
Then I mutation covered it by utilizing the existing tests.
00:32:05.520
No? Let's simply move on.
00:32:10.680
Any more questions?
00:32:16.560
No? Yeah, there was—yeah, I saw her hand.
00:32:22.680
When you do one mutation and you launch the whole suite, how do you identify the mutants?
00:32:30.480
They should fail, but you probably have a lot of tests that pass.
00:32:37.380
If you use the shotgun approach, all tests are allowed to kill the mutation.
00:32:46.560
Any test that fails kills the mutation.
00:32:53.220
So there is no need to identify them; it's a really unintelligent strategy.
00:32:58.200
For now, mutant's default is to use a selective selection.
00:33:05.460
If you have a mutation on Foo#bar, only tests that touch Foo#bar can kill the mutants.
00:33:12.840
Only they get executed, which speeds up the mutation test.
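A sketch of selective test execution in Ruby. The registry `TESTS` and the rule of matching a test by the subject name at the start of its description are assumptions for this example; how a given mutation tester maps subjects to tests may differ.

```ruby
# Hypothetical registry mapping test descriptions to test bodies.
TESTS = {
  'Foo#bar returns 1'   => -> { true },
  'Foo#bar handles nil' => -> { true },
  'Foo#baz returns 2'   => -> { true },
  'Bar#call works'      => -> { true }
}.freeze

# Selects only the tests whose description starts with the subject name.
def tests_for(subject)
  TESTS.select { |description, _| description.start_with?("#{subject} ") }
end

# Any selected test that fails kills the mutant.
def killed?(subject)
  tests_for(subject).any? { |_, test| !test.call }
end

puts tests_for('Foo#bar').keys.inspect
puts killed?('Foo#bar')
```

With four tests in the suite, a mutation on `Foo#bar` runs two of them; the shotgun approach would run all four for every mutation.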
00:33:17.520
You selectively don't run the whole suite.
00:33:24.720
There are configurations you cannot set in mutant because I tend to write open-source code.
00:33:30.840
Only for my pleasure, so I'm a little bit easier on that.
00:33:38.160
But now I typically only bring in the features I personally like.
00:33:45.540
If there is a pull request that adds a feature, I will accept it.
00:33:53.280
But I don’t know—there's demand for more selective strategies.
00:34:00.720
But I couldn't make it currently.
00:34:05.460
No, it's okay.
00:34:09.840
My next question is about mocking.
00:34:13.920
In one of the slides, you used mocking as an example.
00:34:19.200
The mock with should_receive.
00:34:24.840
Does mutation testing invalidate the need for mocking?
00:34:30.840
No, no.
00:34:33.840
Because you need mocking to write a true unit test.
00:34:38.760
In cases where you test classes that do IO, you need mocks.
00:34:46.440
But I prefer to choose to set up units from my domain.
00:34:52.800
I don’t think mutation testing hinders or promotes mocking.
00:34:58.560
But mutant performs lots of mutations.
00:35:03.720
The last time I counted, I had 91 unique mutations.
00:35:10.920
But I should find a better strategy to enumerate them.
00:35:15.480
Well, was it one slide back where you have if input does something?
00:35:21.600
No, maybe I'm thinking of something else.
00:35:28.560
But maybe there was one test or one slide that said something about 'should receive.'
00:35:34.320
That was just an example to measure side effects.
00:35:40.320
You can use any strategy.
00:35:46.680
Mutant can change method calls or it’s just literal stuff.
00:35:52.020
No, it does lots of mutations.
00:35:58.620
So anything that makes the test fail will count as a killed mutant.
00:36:05.220
Does having mutants invalidate the need for some mocking?
00:36:11.280
I don't think so.
00:36:17.520
No, I don't think so.
00:36:22.560
All right, thanks, Markus.
00:36:32.520
Thank you.