Improving Coverage Analysis

by Ryan Davis

The video titled Improving Coverage Analysis, presented by Ryan Davis at RubyConf 2016, critically examines the issue of test coverage analysis in software development, particularly within Ruby applications. The central theme is that modern test coverage tools often provide a misleading sense of security, resulting in vulnerabilities due to insufficiently identified regressions. Ryan argues that high coverage percentages can create a false narrative about code quality.

Key Points Discussed:
- Introduction to Coverage Analysis: Coverage analysis is a measurement of test coverage in code; it was first introduced in 1963 and involves various metrics like statement, branch, and condition coverage. However, these metrics do not always accurately reflect the quality of tests or the actual state of the code.
- Common Misconceptions: Many developers equate high coverage with high-quality testing, which can lead to overlooking bugs in the code despite achieving 100% coverage. Ryan highlights that coverage and code quality are orthogonal, meaning they do not directly affect each other.
- Types of Errors in Coverage Analysis: Ryan identifies various types of errors that can occur with coverage tools, including Type I errors (false positives) and Type II errors (false negatives), and proposes a Type III error (error of omission), where code that is never loaded leads to inflated coverage statistics.
- Ruby Tooling for Coverage: He discusses Ruby's built-in coverage tools and introduces SimpleCov, emphasizing that while these tools provide better reports, they still share the foundational flaws of original coverage tools.
- Solutions to Improve Analysis: The presentation closes with suggestions for improving coverage reliability, particularly through a new tool called Minitest Coverage, which aims to better assess actual test coverage through enhanced mechanisms.

Conclusions and Takeaways: Ryan asserts that developers should adopt a more robust approach to testing, focusing not only on achieving high coverage numbers but also on ensuring that tests are meaningful. Test Driven Development (TDD) is advocated as a method to ensure comprehensive testing strategies, which can result in better software quality. Ultimately, improving coverage analysis involves recognizing the limitations of existing tools and employing better practices in testing.

00:00:14.940 If you follow modern practices, test coverage analysis is a lie, plain and simple. The tools are deficient, and what they report is a false positive that leaves you with a false sense of security, making you vulnerable to regression without you even being aware of it. Let's look at how and why this occurs, and what we can do about it. First, a bit about me: I hate these slides and talks, but honestly, I need new clients. I've been coding professionally for 25 years, 16 of which have been in Ruby. I'm the founder of Seattle.rb, the first and oldest Ruby brigade. I'm also the author of MiniTest, Flay, and Ruby Parser, among about ninety-five others. I just checked RubyGems, and they are reporting ten point six billion downloads. My gems account for 114 million downloads, which places me in the one percent. Oddly enough, one of my newest gems is called GitHub Score. It has 298 downloads, so if you could download that to help me become part of the two percent, I would appreciate it. I am a developer's developer; I love building tools for other developers. I run CLLRB Consulting, where I do a lot of that. This, I promise, is the most text on any slide I have, so it gets better from here on out. Setting expectations, I want to make clear that this is a conceptual or idea talk and is meant for beginners and up; anyone should be able to understand this. I don't necessarily expect beginners to take what I'm talking about back to their companies and implement it, but anyone should understand this talk. It's one gross of slides, so I'm aiming for 30 minutes at an easy pace. Let's get started.
00:02:26.350 As a consultant, I'm often called in to help teams that are struggling with huge and messy implementations, big ball of mud designs, or worse, the overeager use of patterns before they are necessary. Of course, these are often coupled with too few tests, if any at all, which can make it dangerous, if not impossible, for me to work on the code. This means either I have to pair with the domain expert 100 percent of the time, which isn't realistic for either me or my client, or I have to work alone, leading to my pull requests sitting in limbo for months, which can be incredibly frustrating. Being a tool builder, I have a load of tools to help me in my work. Flog points out where the complexity lies and allows me to quickly home in on the problematic code. Flay points out refactoring opportunities, letting me clean up the code efficiently. Debride helps identify whole methods that might no longer be used, allowing me to delete entire segments of code. Additionally, MiniTest enables me to write tests quickly and easily, providing a range of extra plugins for enhanced functionality. One of those plugins is MiniTest Bisect, which makes it incredibly fast and easy to find and fix unstable tests.
00:03:03.430 But what if there are no tests or too few tests? What do I do then? Well, I'm not getting very far without resorting to full-time pairing or improving the tests. However, I don’t know the system, and I may not necessarily understand the domain or know what is being tested. I need a tool to guide me, and that’s done with code coverage analysis. So, what is code coverage analysis? It was introduced by Miller and Maloney in the Communications of the ACM in 1963. In the simplest terms, possibly too simple, it is a measurement of how much code your tests actually touch. A visual representation might help. Given an overly simple implementation and the tests for it, you can see that the tests hit both initialize and add. Because there are no branches or extra complexity in this code, the coverage is 100 percent; everything has been executed. There are many ways to measure coverage. While I don’t particularly like the terms, they are prevalent in the industry, so let’s do a quick overview. C0 is called statement or line coverage. C1 is branch coverage, and C2 is condition coverage. Function coverage simply measures what percentage of functions were called at all; in my opinion, it’s not terribly useful, but it has some utility. For example, Debride could be considered a form of function coverage. Chris gave a talk earlier today on deletion-driven development, which covered a very similar concept.
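As an illustration of function coverage, here is a minimal sketch using the method-coverage mode of Ruby's standard coverage library (available from Ruby 2.5). The Calculator class and its reset method are hypothetical stand-ins for the simple implementation on the slide, not code from the talk.

```ruby
require "coverage"
require "tempfile"

# Hypothetical implementation, written to a file so that Coverage sees it load.
source = <<~RUBY
  class Calculator
    def initialize
      @total = 0
    end

    def add(n)
      @total += n
    end

    def reset
      @total = 0
    end
  end
RUBY

file = Tempfile.new(["calculator", ".rb"])
file.write(source)
file.close
path = File.realpath(file.path)

Coverage.start(methods: true)   # method coverage needs Ruby 2.5 or later
load path

Calculator.new.add(2)           # exercises initialize and add, never reset

# Keys look like [Calculator, :add, start_line, start_col, end_line, end_col].
counts = Coverage.result[path][:methods]
called = counts.select { |_, hits| hits > 0 }.keys.map { |_klass, name, *| name }
puts "called: #{called.sort.inspect}"
puts "function coverage: #{called.size} of #{counts.size} methods"
```

The report only says which methods were entered at least once; it says nothing about whether their results were checked.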
00:04:42.230 So, why doesn't this work? C0, or statement coverage, measures what statements are executed. For some reason, statements equate to lines of code; in other words, it measures what percentage of the lines of code were actually executed. This is a fuzzed example of my client’s code. Branch coverage measures whether all sides of all branching expressions have been exercised; branching expressions are any points in the code where execution might jump due to conditions. Condition coverage looks at the Boolean sub-expressions inside those conditions: you might be conducting exhaustive condition coverage, which requires four times as many tests to run, or a happy medium that checks two Boolean expressions, which gives you four cases to ensure everything gets exercised. Parameter edge case coverage deals with providing relevant argument types to test a method. For strings, these types might include null, empty strings, whitespace, valid formats (like a date), invalid formats, single-byte strings, and multiline strings. Path coverage examines the various paths through a given method: have you traversed all those paths? Entry and exit coverage is like function call coverage but also requires exercising all explicit and implicit exits. State coverage ensures that the possible states of your objects are covered, which can get out of hand very quickly due to the complexity of data.
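The gap between line (C0) and branch (C1) coverage can be sketched with the standard coverage library, which has reported branches since Ruby 2.5. The shipping_cost method below is a hypothetical example, not the client code from the slide; a single call executes every line while leaving one side of the ternary untouched.

```ruby
require "coverage"
require "tempfile"

# Hypothetical one-line method containing a branch (the ternary).
source = <<~RUBY
  def shipping_cost(total)
    total > 50 ? 0 : 5
  end
RUBY

file = Tempfile.new(["shipping", ".rb"])
file.write(source)
file.close
path = File.realpath(file.path)

Coverage.start(lines: true, branches: true)  # branch mode needs Ruby 2.5 or later
load path

shipping_cost(100)   # only the "free shipping" side of the ternary runs

result      = Coverage.result[path]
line_hits   = result[:lines].compact                       # nil = not executable
branch_hits = result[:branches].values.flat_map(&:values)  # one count per branch side

line_percent   = 100.0 * line_hits.count(&:positive?) / line_hits.size
branch_percent = 100.0 * branch_hits.count(&:positive?) / branch_hits.size
puts "line coverage:   #{line_percent}%"
puts "branch coverage: #{branch_percent}%"
```

Every executable line ran, so the line figure is complete, yet the else side of the ternary has a count of zero in the branch data.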
00:09:01.100 That's an overview of the types of coverage you could implement. I do, however, want to give you a warning: there do exist bad metrics. I had originally intended to fluff my talk with a Dilbert cartoon, but it turns out Scott Adams propagates borderline racist and sexist sentiments on his blog, so we're not doing that. People, and by people, I mean both engineers and managers, can easily get stuck on the gamification of any metric, doing whatever it takes to push up the score, even at the expense of code quality. High code coverage seems like a good thing to have, but it can be a false sense of security. Low coverage does indicate insufficient tests, but high coverage says nothing about the quality of the tests; you can still have bug-riddled code. Coverage and quality are orthogonal to each other, yet it is a common misconception that high code coverage correlates to good testing and, thus, good code.
00:09:43.000 This is an actual quote given to an engineer in this room reporting a bug, an actual bug. A simple proof can be made using the previous examples: if you remove an assertion from your tests, you might still show 100 percent coverage, but there won't be any verification that the system is functioning as intended anymore. That’s where TDD (Test Driven Development) can save the day: you intentionally write a failing test, implement only what is needed to make it pass, and then refactor where possible. In this way, you ensure you have sufficient coverage to make the test pass without playing around with the numbers. This method offers a straightforward solution that has many additional benefits, so it’s worth employing. Now, where does Ruby stand concerning coverage? There’s a standard tool that almost nobody knows about, because it ships with Ruby rather than as a gem, so we often overlook it. It is fairly easy to use: you need to require it and call Coverage.start early in your run, before loading and executing your code.
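The point about removing an assertion can be sketched as follows. This is an assumed, minimal setup rather than the example from the slides: the Adder class carries a deliberate bug, and the plain method test_add_without_assertion stands in for a test whose assertion has been deleted.

```ruby
require "coverage"
require "tempfile"

# Hypothetical implementation with a deliberate bug: add subtracts.
source = <<~RUBY
  class Adder
    def initialize
      @total = 0
    end

    def add(n)
      @total -= n   # bug: should be +=
    end
  end
RUBY

file = Tempfile.new(["adder", ".rb"])
file.write(source)
file.close
path = File.realpath(file.path)

Coverage.start
load path

# A "test" with its assertion removed: it runs the code but verifies nothing.
def test_add_without_assertion
  Adder.new.add(2)
  true   # always passes
end

passed  = test_add_without_assertion
lines   = Coverage.result[path].compact   # drop non-executable lines (nil)
percent = 100.0 * lines.count(&:positive?) / lines.size
puts "test passed: #{passed}, line coverage: #{percent}%"
```

The bug is fully "covered" and the test passes, which is exactly the situation a coverage percentage cannot distinguish from a verified implementation.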
00:10:57.610 When you request the results, you receive a hash mapping each loaded file path to an array with one entry per line: nil for lines that cannot be executed, such as comments or blank lines, zero for an executable line that wasn't covered, and otherwise the number of times the line ran. The issue with Coverage is that it's not primarily designed for user interaction; it’s meant to be utilized by other tools. However, let's take a moment to review its functionality. Coverage hooks into the Ruby VM itself; when you call Coverage.start, you set some internal state in the Ruby VM, and any code loaded subsequently is injected with extra instructions to record coverage. As your code runs, each executed line increments its corresponding count in the hash. You can call Coverage.result to retrieve a copy of this data, after which it shuts down the measurement, which poses a challenge for what I aim to resolve today. There's also Coverage.peek_result, which returns a copy of the data while allowing measurement to continue, but the copy it returns is static, presenting additional obstacles I hope to address.
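A minimal sketch of that workflow with the standard coverage library, assuming a hypothetical Counter class written to a temporary file so that it is loaded after Coverage.start:

```ruby
require "coverage"
require "tempfile"

# Hypothetical implementation file; the first line is a comment on purpose.
source = <<~RUBY
  # a comment line
  class Counter
    def initialize
      @total = 0
    end

    def add(n)
      @total += n
    end

    def reset
      @total = 0
    end
  end
RUBY

file = Tempfile.new(["counter", ".rb"])
file.write(source)
file.close
path = File.realpath(file.path)

Coverage.start    # must run before the code under test is loaded
load path         # code loaded from here on is instrumented

counter = Counter.new
counter.add(1)
snapshot = Coverage.peek_result[path]   # copy of the counts; measurement continues
counter.add(2)
final = Coverage.result[path]           # copy of the counts; measurement stops

puts "comment line:  #{final[0].inspect}"            # nil means not executable
puts "add body:      #{snapshot[7]} then #{final[7]}"
puts "reset body:    #{final[11]}"                   # 0 means never executed
```

The array indexes are zero-based, so index 7 is the body of add on line 8 and index 11 is the body of reset on line 12. The snapshot taken with peek_result does not change when add is called a second time.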
00:11:43.150 There's a tool called SimpleCov. How many people are familiar with SimpleCov? That’s great! You’re all doomed! Its usage is much the same: require it, tell it to start, require your code and tests, run the tests, and you're done. SimpleCov uses Coverage internally but significantly improves the output, providing a clean overview with sortable columns and detailed pages for each class that color-code coverage to allow you to see the specifics. Unfortunately, it still carries forward the flaws of the original Coverage tool. Before I proceed to describe those flaws, I must take a tangent to discuss the types of errors that exist within this context. Statistics classifies errors into types: Type I errors (false positives) arise when you report something that isn't there, and Type II errors (false negatives) arise when you have not detected something you should have. I believe that both Type I and Type II errors stem from the numerator, underscoring the importance of ensuring a well-distributed sampling across your implementation to avoid misrepresentations.
00:14:33.150 How do we define this? I propose a Type III error, or an error of omission, which occurs when you haven’t even loaded the code, leading to a falsely inflated percentage due to a lack of comprehensive data. So, now that we've delineated the types of errors, I’d like to address why I believe Ruby tooling has deficiencies in this regard. Coverage suffers from macro Type I errors, where tests for one part of the implementation execute code in another part that has no tests of its own, so that code is reported as covered, resulting in the illusion of adequate coverage. Additionally, the tools are line-oriented, and C0 is notoriously insufficient: executing any part of a line marks the entire line as covered, even when there may be multiple execution paths through that line. I have yet to see Type II errors associated with SimpleCov; however, I suspect that they may exist and I plan on validating that assumption. Meanwhile, Type III errors come down to sampling: if not all of the implementation is loaded and known to the coverage analysis, the result is inflated statistics. So what can we do to enhance this situation? I recently developed a new gem called
00:19:32.100 Minitest Coverage,