00:00:14.940
If you follow modern practices, test coverage analysis is a lie, plain and simple. The tools are deficient, and what they report is a false positive that gives you a false sense of security, leaving you vulnerable to regressions without even being aware of it. Let's look at how and why this occurs, and what we can do about it.
First, a bit about me: I hate these slides and talks, but honestly, I need new clients. I've been coding professionally for 25 years, 16 of which have been in Ruby. I'm the founder of Seattle.rb, the first and oldest Ruby brigade. I'm also the author of MiniTest, Flay, and Ruby Parser, among about 95 others. I just checked RubyGems, and they are reporting 10.6 billion downloads. My gems account for 114 million of those, which places me in the one percent.
Oddly enough, one of my newest gems is called GitHub Score. It has 298 downloads, so if you could download that to help me become part of the two percent, I would appreciate it. I am a developer's developer; I love building tools for other developers. I run CLLRB Consulting, where I do a lot of that.
This, I promise, is the most text on any slide I have, so it gets better from here on out. Setting expectations, I want to make clear that this is a conceptual or idea talk and is meant for beginners and up; anyone should be able to understand this. I don't necessarily expect beginners to take what I'm talking about back to their companies and implement it, but anyone should understand this talk. It's one gross of slides, so I'm aiming for 30 minutes at an easy pace. Let's get started.
00:02:26.350
As a consultant, I'm often called in to help teams that are struggling with huge and messy implementations, big ball of mud designs, or worse, the overeager use of patterns before they are necessary. Of course, they are often coupled with too few, if any, tests at all, which can make it dangerous, if not impossible, for me to work on the code. This means either I have to pair with the domain expert 100 percent of the time, which isn't realistic for either me or my client, or I have to work alone, leading to my pull requests sitting in limbo for months, which can be incredibly frustrating.
Being a tool builder, I have a load of tools to help me in my work. Flog points out where the complexity lies and allows me to quickly home in on the problematic code. Flay points out refactoring opportunities, letting me clean up the code efficiently. Debride helps identify whole methods that might no longer be used, allowing me to delete entire segments of code. Additionally, MiniTest enables me to write tests quickly and easily, providing a range of extra plugins for enhanced functionality. One of those plugins is MiniTest Bisect, which makes it incredibly fast and easy to find and fix unstable tests.
00:03:03.430
But what if there are no tests or too few tests? What do I do then? Well, I'm not getting very far without resorting to full-time pairing or improving the tests. However, I don’t know the system, and I may not necessarily understand the domain or know what is being tested. I need a tool to guide me, and that’s done with code coverage analysis. So, what is code coverage analysis?
It was introduced by Miller and Maloney in the Communications of the ACM in 1963. In the simplest terms, possibly too simple, it is a measurement of how much code your tests actually touch. A visual representation might help. Given an overtly simple implementation and the tests for it, you can see that the tests hit both initialize and add. Because there are no branches or extra complexity in this code, the coverage is 100 percent; everything has been executed.
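The slide itself isn't reproduced here, but a minimal sketch of the idea might look like this (the class and test names are hypothetical, not from the talk):

```ruby
require "minitest/autorun"

# A deliberately simple implementation: no branches, no extra complexity.
class Accumulator
  def initialize
    @total = 0
  end

  def add(n)
    @total += n
  end
end

class TestAccumulator < Minitest::Test
  def test_add
    acc = Accumulator.new        # exercises #initialize
    assert_equal 5, acc.add(5)   # exercises #add
  end
end
```

Because every line is straight-line code, this single test executes everything, and a line-coverage tool reports 100 percent.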
There are many ways to measure coverage. While I don’t particularly like the terms, they are prevalent in the industry, so let’s do a quick overview. C0 is called statement or line coverage. C1 is branch coverage, and C2 is condition coverage. Function coverage simply measures what percentage of functions were called at all; in my opinion, it’s not terribly useful, but it has some utility. For example, Debride could be considered a form of function coverage. Chris gave a talk earlier today on deletion-driven development, which covered a very similar concept.
00:04:42.230
So, why doesn't this work? C0, or statement coverage, measures what statements are executed. For some reason, statements equate to lines of code—in other words, what percentage of the lines of code were actually executed. This is a fuzzed example of my client’s code. Branch coverage measures whether all sides of all branching expressions have been exercised; branching expressions are any points in the code where execution might jump due to conditions.
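To see why line orientation matters, consider a hypothetical one-line conditional (not from the client code on the slide):

```ruby
# One line of code, two execution paths.
def discount(price, member)
  member ? price * 0.9 : price
end

# A single call executes the line, so C0 (line coverage) marks it covered:
discount(100, true)   # member path only
# ...but the non-member path never ran. C1 (branch coverage) would
# report the `else` side of the ternary as missed.
```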
You might conduct exhaustive condition coverage, which can require four times as many tests or more; even a happy medium of two Boolean expressions gives you four cases to ensure everything gets exercised. Parameter edge case coverage deals with providing relevant argument types to test a method. For strings, these might include null, empty strings, whitespace, valid formats (like a date), invalid formats, single-byte strings, and multiline strings. Path coverage examines the various paths through a given method: have you traversed all of them? Entry and exit coverage is like function call coverage but also requires exercising all explicit and implicit exits. State coverage ensures that the possible states of your objects are covered, which can get out of hand very quickly due to the complexity of data.
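As a sketch of how condition coverage multiplies, a hypothetical guard with two Boolean sub-expressions already needs four combinations:

```ruby
def ship?(paid, in_stock)
  paid && in_stock
end

# Exhaustive condition coverage: every combination of truth values.
cases   = [true, false].product([true, false])   # 2^2 = 4 combinations
results = cases.map { |paid, stock| ship?(paid, stock) }
# Only [true, true] ships; branch coverage alone would be satisfied
# with fewer of these cases than condition coverage demands.
```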
00:09:01.100
That's an overview of the types of coverage you could implement. I do, however, want to give you a warning: there do exist bad metrics. I had originally intended to fluff my talk with a Dilbert cartoon, but it turns out Scott Adams propagates borderline racist and sexist sentiments on his blog, so we're not doing that. People, and by people, I mean both engineers and managers, can easily get stuck on the gamification of any metric, doing whatever it takes to push up the score, even at the expense of code quality.
To me, code coverage seems like a good thing to have, but it can be a false sense of security: low coverage does indicate insufficient tests, but high coverage means nothing about the quality of the tests; you can still have bug-riddled code. Coverage and quality are orthogonal to each other, yet it is a common misconception that high code coverage correlates to good testing and, thus, good code.
00:09:43.000
This is an actual quote given to an engineer in this room reporting a bug, an actual bug. A simple proof can be made using the previous examples: if you remove an assertion from your tests, you might still show 100 percent coverage, but there won't be any verification that the system is functioning as intended anymore. That’s where TDD (Test Driven Development) can save the day by intentionally writing a failing test and then only implementing what is needed to make it pass—followed by refactoring where possible. In this way, you ensure you have sufficient coverage to make the test pass without playing around with the numbers.
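That proof can be sketched with a hypothetical test whose assertion has been commented out: coverage stays at 100 percent while a real bug sails through.

```ruby
require "minitest/autorun"

class Adder
  def add(a, b)
    a - b   # bug: subtracts instead of adding
  end
end

class TestAdder < Minitest::Test
  def test_add
    Adder.new.add(2, 2)                    # executes every line of Adder,
    # assert_equal 4, Adder.new.add(2, 2)  # but with the assertion gone,
  end                                      # nothing verifies the result
end
```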
This method offers a straightforward solution that has many additional benefits, so it's worth employing. Now, where does Ruby stand concerning coverage? There's a standard tool that almost nobody knows about; the reason being it ships with Ruby, not as a gem, so we often overlook it. Despite being fairly easy to use, you need to require it and start it early in your run, before loading and executing your code.
00:10:57.610
When you request the results, you receive a hash mapping the code paths in question to an array of per-line data: nil for lines like comments or blank lines, zero for a line that wasn't covered, and a positive count for lines that were executed. The issue with Coverage is that it's not primarily designed for user interaction; it's meant to be utilized by other tools. However, let's take a moment to review its functionality. Coverage integrates hooks with the Ruby VM itself; when you call Coverage.start, you set some internal state in the Ruby VM, and any code loaded subsequently is injected with extra instructions to record coverage.
As your code runs, each executed line increments its corresponding count in the hash. You can call Coverage.result to retrieve a copy of this data, after which it shuts coverage down, which poses a challenge for what I aim to resolve today. There's also Coverage.peek_result, which returns a copy of the data while allowing measurement to continue, but the information remains static, presenting additional obstacles I hope to address.
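A self-contained sketch of that API follows; the measured library is written to a temp file here only so the example can run standalone:

```ruby
require "coverage"
require "tempfile"

Coverage.start   # must run before the measured code is loaded

lib = Tempfile.new(["demo", ".rb"])
lib.write(<<~RUBY)
  def used_method
    1 + 1
  end

  def unused_method
    2 + 2
  end
RUBY
lib.close
load lib.path

used_method                       # unused_method is never called

snapshot = Coverage.peek_result   # copy of the data; measurement continues
lines    = Coverage.result[lib.path]   # final data; this shuts coverage down
# lines has one entry per source line:
#   nil -> not executable (blank lines, `end`)
#   0   -> executable but never run (the body of unused_method)
#   1+  -> execution count
```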
00:11:43.150
There's a tool called SimpleCov. How many people are familiar with SimpleCov? That’s great! You’re all doomed! Its usage is entirely consistent: require it, tell it to start, require your code and tests, run the tests, and you're done. SimpleCov uses coverage internally but significantly improves the output, providing a clean overview with sortable columns and detailed pages for each class that color-code coverage to allow you to see the specifics. Unfortunately, it still carries forward the flaws of the original coverage tool.
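That usage pattern, sketched as a hypothetical test helper (this assumes the simplecov gem is installed, and is guarded so the sketch still runs without it):

```ruby
# Typically the first lines of test/test_helper.rb:
begin
  require "simplecov"
  SimpleCov.start   # must come before your code is required
rescue LoadError
  warn "simplecov not installed; running without coverage"
end

require "minitest/autorun"

# Then require the code and the tests, for example:
# require_relative "../lib/my_library"
# Dir.glob("test/**/*_test.rb") { |f| require File.expand_path(f) }
```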
Before I proceed to describe those flaws, I must take a tangent to discuss the types of errors that exist within this context. For obvious reasons, statistics classifies types of errors: Type I errors (false positives) occur when you detect something that isn't actually there, and Type II errors (false negatives) occur when you fail to detect something you should have. I believe that both Type I and Type II errors stem from the numerator, underscoring the importance of ensuring a well-distributed sampling across your implementation to avoid misrepresentations.
00:14:33.150
How do we define this? I propose a Type III error, an error of omission, which occurs when you haven't even loaded the code, leading to a falsely inflated percentage because the unloaded code never enters the calculation at all. So, now that we've delineated the types of errors, I'd like to address why I believe Ruby tooling is deficient in this regard. Coverage suffers from macro Type I errors, where tests exercise some implementations while leaving others untested, resulting in the illusion of adequate coverage.
Additionally, the tools are line-oriented, making C0 notoriously insufficient, since any execution of any line marks the entire line as covered, even when there may be multiple execution paths through that line. I have yet to see Type II errors with simple code; however, I suspect they may exist, and I plan on validating that assumption. Meanwhile, Type III errors rely heavily on sampling; if not all implementations are loaded and known to the coverage analysis, the statistics will be inflated.
So what can we do to enhance this situation? I recently developed a new gem called
00:19:32.100
Minitest Coverage,