Mario Gonzalez
Re-thinking Regression Testing

In this talk titled "Re-thinking Regression Testing" at the MountainWest RubyConf 2014, speaker Mario Gonzalez emphasizes the importance of effective regression testing, particularly focusing on the inherent limitations of traditional code coverage metrics. Gonzalez argues that, despite achieving high code coverage percentages, many tests fail to detect regressions accurately, leading to a false sense of security regarding software reliability.

Key Points Discussed:

  • Lack of Confidence in Testing: Traditional regression testing practices often fail to provide a clear understanding of their effectiveness, leading to uncertainty about which areas of the code are sufficiently tested.
  • Redundancy Issues: Gonzalez reports that the average Ruby test suite reaches about 85% code coverage and carries about 25% test redundancy, yet its regression detection capability is only about 55%.
  • Mutation Testing Introduction: To remedy the flaws of traditional metrics, the speaker introduces mutation testing (or mutation analysis) as a powerful technique for evaluating test effectiveness. This technique involves modifying existing code and assessing whether tests can detect these changes (loss of functionality).
  • Mutation Score: Mutation analysis provides a numeric mutation score based on the percentage of 'mutants' (modified code) that the tests can successfully detect. A higher score correlates with better test regression detection capability.
  • Practical Application: Gonzalez shares guidelines for implementing mutation analysis, suggesting a sample of at least 50 mutants and highlighting the importance of focusing on lower-order mutants for detecting common errors, while also trying higher-order mutants for more complex scenarios.
  • Implications of Code Coverage: He critiques the idea of relying heavily on code coverage as it often does not accurately represent the reliability of tests over time, proposing mutation analysis as a preferable metric.

Through research findings, Gonzalez illustrates that higher mutation scores correlate with lower redundancy and, thus, provide a more reliable measure for assessing test quality.

Conclusions and Takeaways:
- Prioritize mutation testing over mere code coverage to achieve better regression detection.
- Utilize mutation scores to guide improvements in testing practices.
- Understand that over-testing some areas of the codebase creates technical debt and leaves other areas under-tested or untested, which is where undetected bugs tend to creep in.
- Engage in practices that maintain low coupling in code to enhance the effectiveness of mutation testing.

Gonzalez concludes by encouraging software developers to embrace mutation testing as an integral part of their testing strategy, providing the confidence required to make changes and improvements in their codebases.

00:00:25.760 Hello, everyone. Can you hear me? Sorry for being a bit late. I'm from Tucson and still on Tucson time. That's when you know you're having a good trip—you tend to be late for everything.
00:00:42.880 Speaking of Tucson, I found out that Sam Rollins, one of my classmates from the University of Arizona, is here. If Sam is still here, hi! I hope we can talk later. My name is Mario Gonzalez, and I want to talk to you about regression testing, specifically rethinking regression testing.
00:01:02.919 But why do we need to discuss regression testing anymore? It's a widely understood and widely used practice. The reason is confidence, or rather the lack of confidence in our tests. Even though we trust our tests completely, we don’t really know where their weaknesses or blind spots are. We don’t know if we have over-tested one area of our product while leaving others under-tested or even untested.
00:01:35.680 Some of you might be thinking, 'Well, there's code coverage for that, right?' Code coverage is indeed a good indicator of the quality of our tests. However, let me show you some numbers, which I'll explain shortly.
00:01:48.320 The average Ruby test suite achieves around 85% code coverage. That’s not bad, right? Yet, in our attempts to minimize those gaps, we often introduce redundancy in our tests. These aren’t exact duplicates; rather, they are tests that ultimately test the same functionality but do so in quite different ways. On average, test suites have about 25% redundancy built in, yet the actual regression detection capability of these tests is only 55%.
00:02:12.680 So, is code coverage a good indicator of the quality of our tests? No, not really. While it’s good to have, it’s a second-order metric; it’s merely a proxy for the real quality of our tests.
00:02:26.159 Where do these numbers come from? I work in research focused on software quality. I started my company because I noticed a pattern in my own projects and those of the teams I worked with. Despite following best practices like test-driven development and striving for high code coverage, our projects were still riddled with bugs. Last year, I completed two research projects: one focused on the effects of test redundancy and the other, still unpublished, on regression detection.
00:02:52.960 In this presentation, I want to focus on the regression detection capability of our tests—how to measure it accurately and how to improve it. And we're going to do this using a technique called mutation testing or mutation analysis. I know there have been previous presentations on mutation analysis, but I'll be sharing the results of my own research throughout this presentation.
00:03:16.800 My goal is that by the end of this presentation, you will be empowered to use this more accurate metric to assess the quality of your tests and start improving them confidently. So, let’s talk about why code coverage is unreliable. Imagine a graph where the curve represents the evolution of your software over time. The x-axis represents time, and the y-axis represents lines of code. This is a simplistic but effective representation.
00:03:39.760 Suppose we want to determine the slope at a point on that curve. Would it be reasonable to expect that slope to continue forever? No, it wouldn’t. This is analogous to the problem with code coverage. Code coverage is a point-in-time metric; it tells you what your tests cover right now, but it has no relevance to what they will cover tomorrow or a month from now. This is why code coverage often feels like a game of catch-up; achieving 100% code coverage every day is nearly impossible.
00:04:04.960 We need a better way to assess the quality of our tests, and this is where mutation analysis and its associated mutation score come in. Now, what is the mutation score? It's a ratio that can be read as a probability. A score of 0 means there's a 0% chance of your tests detecting regressions, and a score of 1 (or 100%) means you have a 100% probability of detecting regressions.
00:04:26.320 This is known as mutation adequacy. So, what do we mutate in mutation analysis? It’s a type of fault injection where we don’t introduce new code, although we could. Instead, we change the existing source code and run our unmodified tests against this 'mutant.' If any of our tests fail, that’s a good sign—we say the mutant has been killed, indicating that our tests have a high probability of detecting this type of regression if it is ever introduced.
00:04:47.680 Suppose we modify a relational operator instead. If we run our unmodified tests, but none of them fail, it means our tests are inadequate and have a low probability of detecting that kind of error. By failing to detect this, we have identified a problem area in our testing strategy. Next, I want to show you a demo of a mutation engine I use for my research. Here we go.
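As a rough illustration of a killed mutant versus a live one, here is a minimal Python sketch. The function `can_vote`, its mutant, and both test suites are hypothetical and are not taken from the demo in the talk; the mutant changes a single relational operator (`>=` to `>`), and the tests themselves are left unmodified.

```python
def can_vote(age):
    """Original code under test (hypothetical example)."""
    return age >= 18


def can_vote_mutant(age):
    """Mutant: the relational operator >= was changed to >."""
    return age > 18


def weak_tests(fn):
    """Tests that never check the boundary value 18."""
    return fn(30) is True and fn(5) is False


def strong_tests(fn):
    """The same tests plus a boundary check at exactly 18."""
    return weak_tests(fn) and fn(18) is True


def is_killed(tests, mutant):
    """A mutant is killed when at least one test fails against it."""
    return not tests(mutant)


# Both suites pass against the original code.
print(weak_tests(can_vote), strong_tests(can_vote))
# The weak suite leaves the mutant alive; the strong suite kills it.
print(is_killed(weak_tests, can_vote_mutant))
print(is_killed(strong_tests, can_vote_mutant))
```

In this sketch the live mutant points directly at the missing boundary test, which is the kind of blind spot the talk describes.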
00:05:15.760 So, every mutant gives us a binary pass or fail outcome. Mutation analysis is a probabilistic technique, meaning for a more accurate mutation score, we need to run it multiple times. For instance, I recommend running it at least 50 times and then we can analyze the results. Let's see how it does.
00:05:49.680 As you can see, the mutation score is a ratio: the total number of mutants killed (detected) by your tests divided by the total number of generated mutants. Mutation analysis is one of the few fun techniques in software development. You can have mutant scales, like zombie scale or ninja scale! The more enjoyable the technique is, the more likely you are to apply it consistently, which in turn improves your tests and the quality of your software.
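The ratio described here can be written down in a few lines of Python. This is only a sketch: the function name `mutation_score` and the outcome counts are made up for illustration and do not come from the talk's mutation engine.

```python
def mutation_score(outcomes):
    """Return killed mutants divided by generated mutants.

    `outcomes` holds one boolean per mutant: True if the tests killed it.
    """
    if not outcomes:
        raise ValueError("at least one mutant is required")
    return sum(outcomes) / len(outcomes)


# Hypothetical run: 50 mutants generated, 32 killed, 18 still alive.
outcomes = [True] * 32 + [False] * 18
print(mutation_score(outcomes))
```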
00:06:12.480 So, how many mutations should a mutant have? In the demo I just showed, we only had one mutation. Should we mutate the entire source file and run it again? Let's call the number of mutations per mutant 'k.' Back in the 1970s, the original researchers made a significant assumption, and it has since been validated. This assumption is known as the competent programmer hypothesis.
00:06:38.560 This hypothesis suggests that experienced programmers write code that is almost correct. What remains are small errors, such as off-by-one errors, and those are the bugs in the code. Regarding our mutation quantity 'k,' research has shown that only one mutation is needed to identify this class of bugs.
00:06:58.240 In software engineering and science, we often name concepts for better communication. When there is only one mutation, we call the mutant a first-order mutant. If there are two mutations, we designate it as a second-order mutant, and in general, k mutations result in a k-order mutant. This naming helps us discuss the concepts more easily.
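To make the naming concrete, the following Python sketch builds first-order and second-order mutants from a small source string. The `SOURCE` snippet, the `MUTATIONS` list, and the text-replacement approach are all assumptions chosen for brevity; real mutation engines typically work on a parsed syntax tree rather than on raw text.

```python
from itertools import combinations

# Hypothetical code under test, kept as source text so it can be mutated.
SOURCE = "def in_range(x, lo, hi):\n    return lo <= x and x < hi\n"

# Hypothetical single mutations: (original text, replacement text).
MUTATIONS = [
    ("lo <= x", "lo < x"),
    ("x < hi", "x <= hi"),
    (" and ", " or "),
]


def kth_order_mutants(source, mutations, k):
    """Yield every mutant source built by applying exactly k mutations."""
    for combo in combinations(mutations, k):
        mutated = source
        for old, new in combo:
            # Replace only the first occurrence of each target.
            mutated = mutated.replace(old, new, 1)
        yield mutated


first_order = list(kth_order_mutants(SOURCE, MUTATIONS, 1))
second_order = list(kth_order_mutants(SOURCE, MUTATIONS, 2))
print(len(first_order), len(second_order))
print(second_order[0])
```

Each first-order mutant differs from the original in one place, and each second-order mutant in two.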
00:07:10.560 First-order mutants are crucial for capturing the primary errors. Meanwhile, higher-order mutants capture more complex bugs, which are combinations of simple errors introduced by experienced programmers. The coupling effect states that many bugs are essentially the result of these small errors working together. These nasty bugs can unexpectedly disappear when changes are made elsewhere in the code.
00:07:34.960 Two primary hypotheses form the foundation for mutation analysis: the competent programmer hypothesis and the coupling effect. For mutation quantity 'k', between two and five mutations are generally sufficient to capture this class of bugs. Now, we've established why mutation analysis effectively assesses regression detection capabilities in our tests.
00:07:56.640 When one programmer commits changes related to bug fixes or features, the competent programmer hypothesis holds. Moreover, if multiple contributors are involved, both hypotheses apply, reinforcing the significance of mutation analysis.
00:08:14.560 But why is this method better than code coverage? Let me present some results from my research. What you're looking at are Pearson correlations, specifically the correlation between code coverage and mutation score. A correlation ranges from -1 to 1: -1 represents a perfect negative relationship, where one metric falls as the other rises, while +1 represents a perfect positive relationship, where both rise together.
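For readers who want to see how such a coefficient is computed, here is a small Python sketch of the Pearson correlation. The `coverage` and `mutation` values are invented for illustration only; they are not data from the research described in the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for five test suites, purely for illustration.
coverage = [0.60, 0.70, 0.80, 0.90, 1.00]
mutation = [0.30, 0.45, 0.50, 0.65, 0.75]
print(round(pearson(coverage, mutation), 3))
```

A value near 1 means the two metrics rise together, and a value near -1 means one falls as the other rises.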
00:08:42.960 In this case, the correlation between code coverage and mutation score is quite positive and close to one, indicating that higher mutation scores correlate with higher code coverage. However, the opposite is not necessarily true. A cluster of test suites achieving 100% code coverage might have a mutation score of merely 75%, suggesting that mutation score is a stronger, more reliable metric.
00:09:03.040 Now let’s examine the correlation between code coverage and test redundancy. There’s a strong negative correlation here. This means that higher test redundancy leads to lower code coverage, indicating underlying issues with test quality.
00:09:20.800 Test redundancy is indicative of a deeper problem. Surprisingly, the correlation between redundancy and likelihood of bugs is positive: higher redundancy correlates with higher likelihood of bugs. This was unexpected and illustrates that redundancy alone doesn’t guarantee better coverage or effectiveness.
00:09:40.240 To clarify why test redundancy is a smell, let's consider bugs and their likelihood. Higher code coverage correlates with a lower likelihood of bugs, which is expected. However, with redundancy, the correlation is the opposite; increased redundancy raises the likelihood of bugs, which is counterintuitive.
00:10:08.080 Let’s explore an analogy. Imagine you need to inflate a tire. If you choose to over-inflate it, expecting it to last longer, you’ll decrease traction and potentially shorten the tire’s lifespan. In a similar way, over-testing parts of your application may introduce technical debt, with certain areas being over-tested while others remain under-tested or completely untested; this is where bugs tend to creep in.
00:10:22.880 Now, let's revisit the correlations between mutation score and the other metrics. The correlation between code coverage and mutation score remains strong, showing a consistent relationship. The correlation between mutation score and test redundancy is still negative, but the results indicate that mutation score is less susceptible to redundancy than code coverage is.
00:10:36.480 So how do we apply mutation analysis? Hopefully, you're feeling inspired. Here are some mutation analysis tools for various languages. The first two are for Ruby, the next two for Java, and the last one for C#. In my research, I couldn’t use the Ruby tools due to compatibility issues, so I ended up writing my own tools that were fast and accurate across languages.
00:10:56.320 One significant argument I hear against mutation analysis is that it's slow and complicated. However, I want to share some guidelines for applying it smoothly in your work. The first guideline is that you want passing tests. They don’t have to be perfect, just passing. With a mutation score of zero, you can only improve from there.
00:11:23.520 Earlier, we ran the mutation engine for 50 mutants. As you can see, the mutation score is much more accurate than before, and it only took a few minutes to get a good score for this test suite.
00:11:38.080 In my research, I had to run 10,000 mutants per test suite to achieve a very low margin of error, but you don’t need to go that far. You can rely on the central limit theorem, which says that with a sufficiently large sample, the sample average is approximately normally distributed around the true value, so it becomes a reliable estimate. For our purposes, a sample size of 50 is sufficient, though larger sizes yield even better results.
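The trade-off between sample size and margin of error can be sketched in Python. This is a simplification under stated assumptions: each mutant is treated as an independent killed-or-alive trial, the 95% interval uses the usual normal approximation, and the score of 0.55 is only an illustrative input. The function name `margin_of_error` is hypothetical.

```python
from math import sqrt


def margin_of_error(score, n_mutants, z=1.96):
    """Approximate 95% margin of error for an estimated mutation score.

    Uses the normal approximation for a proportion, which the central
    limit theorem justifies for reasonably large samples.
    """
    return z * sqrt(score * (1 - score) / n_mutants)


# Illustrative mutation score of 0.55 estimated from different sample sizes.
for n in (50, 1000, 10000):
    print(n, round(margin_of_error(0.55, n), 3))
```

Under these assumptions, 50 mutants give a margin of roughly ±0.14, while 10,000 mutants bring it down to about ±0.01, which is consistent with a small sample being enough for a working estimate and a large one being needed for research-grade precision.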
00:12:01.040 You should focus on lower-order mutants to ensure detection of common bugs introduced by experienced programmers. However, you should also experiment with higher-order mutants to ensure your tests can capture those nasty bugs. The process of mutation analysis is similar to refactoring: you mutate, then score your tests and improve based on the results.
00:12:23.760 Every time you encounter a live mutant, it highlights a weakness in your tests. For instance, if your tests fail to catch an off-by-one error, that area of the test suite clearly needs improvement.
00:12:42.960 Lastly, a note on coupling: it’s crucial to follow good engineering practices to maintain low coupling in your code. With highly coupled code, mutation scores become unreliable and often ineffective. However, mutation analysis could serve as a diagnostic tool for identifying instances of highly coupled code.
00:13:02.560 I hope you feel empowered to start using this valuable technique to enhance the quality of your tests, providing you with the confidence you need. Also, aim to reevaluate the role of code coverage in your strategy, as it is more of a second-order metric and might not serve the intended purpose of assessing test quality.
00:13:20.160 Thank you for your attention!
00:26:33.200 Thank you!