RSpec

How To Stop Hating Your Test Suite

by Justin Searls

In the talk titled "How To Stop Hating Your Test Suite" by Justin Searls at RubyConf 2015, the speaker addresses the common frustrations software development teams face with their testing processes. Searls emphasizes the importance of consistency in tests, arguing that inconsistent tests can hinder productivity and make it more challenging to maintain code. He introduces a structured approach to testing that revolves around three main areas: test structure, test isolation, and test feedback.

Key Points Discussed:

- Test Structure: Searls discusses the drawbacks of having large, complex code objects that result in myriad test cases and dependencies. He advocates for smaller, simpler objects with fewer dependencies to ease the testing process, emphasizing that each new argument or feature increases the potential complexity of tests exponentially.
- Testing Phases: He highlights the importance of a clear arrangement, action, and assertion (AAA) structure in tests, promoting clarity and readability. Consistent naming conventions, like using "subject" to refer to the test subject, help maintain clarity across test suites.
- Test Isolation: Searls critiques mixed test strategies that blend integration and unit tests, which can lead to confusion about the test's purpose. He recommends defining test suites based on their focus—either strictly integrated tests or isolated tests—to simplify understanding and management.
- Test Feedback: The speaker stresses the significance of good feedback mechanisms in tests, particularly focusing on error messages and feedback loops. He points out that poor error messages can waste valuable time, urging developers to seek assertion libraries that provide clear, actionable feedback.
- Maintaining Engagement: Finally, Searls discusses the importance of managing test speeds and redundancy, warning against overly realistic tests that can slow down development. He encourages teams to analyze their testing practices regularly to avoid false negatives and maintain morale.

Conclusion: Searls concludes that by prioritizing consistency, leveraging structured test conventions, and emphasizing effective feedback, teams can revitalize their testing processes. This not only leads to more maintainable code but also enhances developer satisfaction with their testing practices. The main takeaway is to approach testing with a plan that focuses on simplicity and clarity, ultimately making tests a source of joy rather than frustration.

00:00:17.840 All right! High energy! I love it! All right, doors are closing and now we're covered. Great! Can we get my slides up on the monitors?
00:00:28.960 Great! Let me start my timer. Where's my phone? Uh-oh, who has my phone? Where's my timer? All right, we'll start.
00:00:41.960 So, there's this funny thing where every year, conference season lines up with Apple's operating system release schedule. And I'm a big Apple fanboy, so I, on one hand, really want to upgrade, and on the other hand, want my slide deck to work. This year, it was because they announced the iPad Pro. I was pretty excited; I was like, maybe this year, finally, OS9 is going to be ready for me to give a real talk from.
00:00:59.600 So this talk was built entirely in OS9. Let's just start it up, see how it goes. I'm a little bit nervous now.
00:01:16.079 Okay, there it is! I've got to find the play button. Here we go! And good! Alright, so this talk is about how to stop hating your tests.
00:01:30.040 My name is Justin, I play a guy named Searls on the internet, and I work at the best software agency in the world, Test Double. So why do people hate their tests?
00:01:44.040 Well, I think a lot of teams start off in experimentation mode, where everything is fun and free. They're pivoting all the time, and having a big test suite would really just slow down their rate of change and discovery. But eventually, we get to a point where we're worried that if we make a new change, we might break things. It's important that things stay working, so people start writing some test suites.
00:02:08.039 They have a build, so when they push new code, they know whether they've just broken stuff. However, if we write our tests in a haphazard, unorganized way, they tend to be slow and convoluted. Every time we want to change something, we end up spending all day updating tests.
00:02:28.440 Eventually, teams reach a point where they yearn for the good old days when they could change stuff and move quickly. I see this pattern repeat so much that I'm starting to believe that an ounce of prevention is worth a pound of cure in this instance.
00:02:36.280 Because once you get to the end, there's not much you can do. A lot of people might say, 'Well, I guess we're just not testing hard enough.' But when you see a problem over and over again, I don't believe the "work harder" approach is appropriate. You should always be inspecting your workflow and your tools and trying to make them better if you keep running into the same issue.
00:03:11.120 Some others might say, 'Okay, well, let's just buckle down and remediate; testing is job one now. Let's really focus on testing for a while.' But from the perspective of the people who pay us to build stuff, testing is not job one. It's at best job two; from their perspective, they want to see us shipping new features.
00:03:36.760 The longer we go with that impedance mismatch, the more friction and tension we're going to have, and that's not sustainable. And while we're talking about prevention, if you're working in a big legacy monolithic application rather than something greenfield, don't worry: this is not a problem at all.
00:04:13.040 I have this cool thing to show you. There's this one weird trick to starting fresh with your test suite. That's right, you're going to learn what the one weird trick is: basically, just move your tests into a new directory, and then you make another directory. Now you have two directories, and you can write a shell script that runs both test suites.
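The talk describes this trick as a shell script; here is the same idea sketched as a Rake task instead, since this is a Ruby audience. The directory names and task names are invented for illustration, and the only library assumed is the standard rake/testtask.

```ruby
# Rakefile -- a sketch of the "one weird trick": quarantine the old suite in
# test_legacy/ (an invented name), start fresh in test/, and run both.
require "rake/testtask"

Rake::TestTask.new(:test_legacy) do |t|
  t.libs << "test_legacy"
  t.pattern = "test_legacy/**/*_test.rb"
end

Rake::TestTask.new(:test_fresh) do |t|
  t.libs << "test"
  t.pattern = "test/**/*_test.rb"
end

desc "Run the legacy suite and the fresh suite back to back"
task test_all: %i[test_legacy test_fresh]
```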
00:04:49.119 Eventually, you can port them over and decommission the old test suite. However, I hesitated to even give this talk about testing because I am the worst kind of expert. I have too much experience overthinking testing.
00:05:06.680 I've built and open-sourced tools around testing and been on many teams as the guy who cared just a little bit more about testing than everyone else. I've participated in many highfalutin, philosophical, and nuanced Twitter arguments that really are not pertinent to anyone's life. My advice is toxic. I am overly cynical; I’m very risk-averse.
00:05:41.760 If I told you what I really thought about testing, it would just discourage all of you. So instead, my goal here today is to distill my advice down into a few component parts. The first part we're going to discuss is the structure—the physicality of our tests, like what the lines and files look like on disk.
00:06:24.160 Next, we're going to talk about isolation, because I really believe that how we choose to isolate the code we’re testing is the best way to communicate the concept and value we hope to get out of a test. Lastly, we'll discuss feedback: do our tests make us happy or sad? Are they fast or slow? Do they make us more or less productive?
00:06:48.399 Keep in mind we're thinking about all this from the perspective of prevention, because these are all things that are much easier to establish on day one than to try and shoehorn in on day one hundred. So, at this point, in keeping with the Apple theme, my brother dug up an Apple II copy of Family Feud, and he discovered it's really hard to make custom artwork in AppleWorks 6.
00:07:20.080 Thus, I just ripped off the artwork from this Family Feud board. We're going to use that to organize our slides for today. It's a working board, which means if I point at the screen and say 'show me potato salad,' the next answer flips over. Unfortunately, I didn't have a hundred people to survey; I just surveyed myself a hundred times, so I know all the answers already.
00:07:55.520 So, first round: we're going to talk about test structure. I'm going to say, 'Show me too big to fail.' People hate tests of big code. In fact, have you ever noticed that the people who are really into testing and Test-Driven Development (TDD) seem to hate big objects and big functions more than normal people?
00:08:15.840 We all understand big objects are harder to deal with than small objects, but one thing I've learned over the years is that tests can actually make big objects even harder to manage, which is counterintuitive. You'd expect the opposite. I think part of the reason is that when you have big objects, they might have many dependencies, which means you have lots of test setup.
00:08:41.279 They also might have multiple side effects in addition to whatever they return, meaning you have lots of verifications. But what's most interesting is that they have lots of logical branches. Depending on the arguments and state, there's a lot of test cases that you have to write.
00:09:06.080 So let's take a look at some code. At this point, I realized that OS9 is not Unix, so I found a new terminal. Actually, it's a cool new one; it just came out this week. So let's boot that up. Yep, here we go! We're almost there.
00:09:31.599 It's a little slow. So this is a fully operational terminal. All right, I'm going to type in an arbitrary Unix command that works fine. I'm going to start a new validation method for a timesheet object to see whether or not people have entered notes. Let's say its validity depends on whether you have notes, whether you're an admin, whether it's an invoice week or an off week, and whether you've entered time: all four Boolean attributes factor into whether or not that record is considered valid.
00:10:02.880 At this point, I wrote the first test, but I'm like, 'Oh, I've got a lot of other contexts to write.' I'm like, 'Damn, this is a lot of tests I would need to write to cover this case of just four Booleans.' What I fell victim to there is a thing called the rule of product, a concept from the school of combinatorics. It's a real math thing because it has a Wikipedia page.
00:10:53.440 What it essentially states is that if you have a method with four arguments, you need to take each of those arguments and the number of possible values, multiplying them together to find the total amount of potential combinations—or the upper bound of test cases you might need to write.
00:11:02.720 In this case, with all Booleans, it's 2 to the 4th, so we have 16 test cases that we may need to write. If you are a team that's used to writing a lot of big objects and big functions, you are probably in the habit of thinking, 'Oh, I have some new functionality; I'll just add one more argument. What more harm could that do?' Other than doubling the number of test cases you have to write.
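To make the rule of product concrete, here is a hedged sketch (the class and its validation rule are invented, not taken from the talk) showing how four boolean inputs produce 2^4 combinations:

```ruby
# Four boolean attributes feed one validation, so there are 2**4 = 16
# combinations a thorough test suite might have to cover.
class Timesheet
  def initialize(notes:, admin:, invoice_week:, time_entered:)
    @notes, @admin, @invoice_week, @time_entered = notes, admin, invoice_week, time_entered
  end

  # One plausible rule; the exact logic isn't the point, the combinatorics are.
  def valid?
    return true if @admin
    @time_entered && (@notes || !@invoice_week)
  end
end

booleans = [true, false]
puts booleans.product(booleans, booleans, booleans).size # => 16
```

Add a fifth boolean argument and the upper bound doubles to 32.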
00:11:33.520 As a result, as someone who trains people on testing a lot, I'm not surprised at all to see teams that are used to big objects decide to get serious about testing and then say, 'Wow, this is really hard; I quit!' So if you want to get serious about testing and have many tests on some code, I encourage you: stop the bleeding! Don't keep adding onto your big objects.
00:12:04.079 Try to limit new objects to one public method and at most three dependencies. That, to that particular audience, is shocking. The first thing they usually say is, 'But then we’ll have too many small things! How will we possibly deal with all the well-organized, carefully named, and comprehensible small things?'
00:12:37.120 People get off on their own complexity; they think that's what makes them serious software developers, and writing small things sounds to them like programming on easy mode. But it's not rocket science to build an enterprise CRUD application. Just write small stuff; it works.
00:13:03.040 Next, I want to talk about how we hate when our tests go off script. Code can be anything; our program can and should be unique and creative; special unicorns of awesomeness. But tests can and should only do three things. They all follow the same script: every test sets stuff up, invokes a thing, and then verifies behavior. It’s writing the same program over and over again.
00:13:53.679 It has these three phases: arrange, act, and assert. A more natural way to express this would be given, when, then. When I am writing a test, I always call out those three phases clearly and consistently.
00:14:01.040 For example, if I'm writing this as a MiniTest method, I place exactly two empty lines in every single xUnit-style test that I write: one after my arrange phase and one after my act phase. This makes it really clear at a glance what my arrange, act, and assert are. I always make sure they go in the correct order, which is something people often get wrong.
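A minimal MiniTest sketch of that convention (the subject here is invented; the point is the two blank lines separating arrange, act, and assert):

```ruby
require "minitest/autorun"

class UpcaserTest < Minitest::Test
  def test_upcases_a_name
    name = "justin"

    result = name.upcase

    assert_equal "JUSTIN", result
  end
end
```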
00:14:27.760 If I'm using something like RSpec, I have many constructs available to specify my intent. I can use 'let' and give a value to do a setup. So 'let' indicates I'm setting up a new thing. I can use 'before' to indicate my action, allowing me to split up those assertions into separate blocks if I choose.
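Sketched in RSpec terms, with 'let' marking setup, 'before' holding the single action, and each 'it' carrying an assertion (the Upcaser class is invented for illustration):

```ruby
class Upcaser
  def call(name)
    name.upcase
  end
end

RSpec.describe Upcaser do
  subject(:upcaser) { described_class.new }

  let(:name) { "justin" }                   # arrange

  before { @result = upcaser.call(name) }   # act

  it "upcases the name" do
    expect(@result).to eq("JUSTIN")         # assert
  end
end
```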
00:14:57.440 If someone knows RSpec, they'll know where each line belongs instantly. I also try to minimize each phase to just one action per line so that test-scoped logic doesn't sneak in. The late, great Jim Weirich wrote a fantastic Ruby gem that I hope you check out called rspec-given. I help maintain it now; he and Mike Moore ported it to MiniTest as well. I also ported it to Jasmine, and someone else has taken it on and ported it to Mocha.
00:15:56.240 It's a really cool given/when/then-conscious testing API that starts from the same place, using 'Given' instead of 'let' for straightforward setups and 'When' instead of 'before.' Where it really shines is that 'Then' is just a little one-liner. It doesn't need a custom assertions API, because it interprets the Ruby inside the block and breaks it apart to give great error messages; it's a really expressive testing API.
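Here is roughly what that looks like with rspec-given (a sketch; User here is just a Struct standing in for a real model):

```ruby
require "rspec/given"

User = Struct.new(:name)

RSpec.describe User do
  Given(:user)  { User.new("Sterling Malory Archer") }

  When(:result) { user.name }

  Then { result == "Sterling Malory Archer" }
end
```

If that Then expression comes back false, the gem decomposes it and reports what each sub-expression evaluated to, which is the failure-message behavior discussed again in the feedback section later.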
00:16:25.680 You don't have to use that tool to write your tests in a way that’s conscious of given/when/then; they’re easier to read in general. It points out superfluous bits of test code that don't fit one of those three phases and can highlight certain design smells. For instance, if you have a lot of given steps, maybe you have too many dependencies on your subject or too complex of arguments.
00:16:57.520 If it takes more than one when step, your API is likely confusing or hard to invoke; there's probably something awkward in how you use that object. And, if you have many then steps, your code is probably doing too much or returning too complex of a type.
00:17:39.760 Next up, I want to talk about hard-to-read, hard-to-skim code. Some people are fond of saying that test code is still code, but test code is often untested code. So, I try to minimize it and make it as boring as possible for that reason. What I find is that a good test tells me a story of what the code under test should look like, but if there's logic in the test, it confuses that story.
00:18:16.480 I spend most of my time reading that logic and making sure it's right because I know there's no test of that test. So, test-scoped logic is hard to read. If there are any errors, they're easy to miss: maybe it's passing for the wrong reasons, or maybe only the last item in the loop of data is executing over and over again.
00:18:56.640 A lot of times, people have this impulse; they say, 'Hey, I've got a lot of redundancy in my test... I could really dry this up by generating all my test cases.' For example, this person did a Roman numeral kata, and they can clearly see they could just have a data structure, loop over it, and use define_method to generate much tidier test cases.
00:19:40.120 That’s a perfectly reasonable test structure, and in this case, it totally works fine. But I still think it's problematic because often that person experienced test pain, and their reaction was to make the test cleaner. The usual impulse when experiencing test pain is to look and see if there’s something wrong with the production code that led us to that pain.
00:20:02.880 If you look at the person's production code, you can see data hiding in ifs and elses; they contain really dense logic. I would much rather take a look at the same thing and extract the same sort of data structure from that, so instead of having all the ifs and elses, I can loop over the same data structure and figure out whatever rule I need.
00:20:40.119 Now I only need a few test cases. In fact, I can just keep adding additional keys to that hash, and now I've covered a lot of cases without needing a bunch of very explicit test cases. It’s much cleaner this way.
00:21:10.560 Sandi Metz, who's around, has a thing called the squint test. It helps her understand and cope with really big file listings and allows her to draw a few conclusions. I don't have anything nearly as fancy, but when I'm reading your test suite, I hope to glance at it and understand the thing under test—particularly, where are all the methods, are they in order, are they symmetrical, and can I easily find all the tests of just one method.
00:21:55.240 I like to use, for example in RSpec, the context method to point out every logical branch and all the subordinate behavior beneath each logical branch. It’s very easy to organize this way, and when you do it consistently, it makes it easy to read tests.
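A skeleton of that organization, with one describe per method and one context per logical branch (the names reuse the invented timesheet example; the pending it blocks just show the shape):

```ruby
class Timesheet; end # stub so the sketch stands alone

RSpec.describe Timesheet do
  describe "#valid?" do
    context "when the user is an admin" do
      it "is valid regardless of notes"
    end

    context "when time is entered without notes" do
      it "is invalid during an invoice week"
      it "is valid during an off week"
    end
  end
end
```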
00:22:40.839 Additionally, the arrange-act-assert should really pop in a consistent way. If I'm using an xUnit-style testing tool like MiniTest, I at least want to see the arrange-act-assert pattern throughout every single file listing, and the names of the tests should mean something.
00:23:08.880 Next up, let’s talk about tests that are too magic. A lot of people hate tests that are either too magic or not magic enough, as it turns out. All software is a balancing act, and test libraries are no different. The expressiveness of our testing API exists along a spectrum.
00:23:51.440 You know, smaller APIs are generally a bit less expressive than larger ones, which have more features, but you have to learn those features to use them well. For example, MiniTest is just classes and methods that we already know. Every test suite is a class, we override setup and teardown to change behavior, every new test is another method, and assert is very easy to use. Ryan's a funny guy, and he has some funny ones like i_suck_and_my_tests_are_order_dependent! to opt into custom behavior.
00:24:56.800 When comparing that to RSpec, it’s night and day. RSpec has describe and context as their synonyms, subject and let; these all operate under similar principles. You've also got a lot of extra features like before, after, and around constructs and many matchers. There's a considerable amount to learn in RSpec.
00:25:21.520 Jim tried to have it both ways when he designed 'Given.' It’s a straightforward API using 'given, when, then,' with just a handful of other methods, ensuring it retains its expressiveness while being simple to use. Now, because it's not a standalone testing library, you're still relying on all of MiniTest or RSpec, so it is still rather complicated—yet it's quite pleasant day to day.
00:26:08.160 I'm not here to tell you that there's a right or wrong testing library or level of expressiveness. You just have to remain aware of the trade-offs. Smaller testing APIs are easier to learn, but they might encourage more one-off test helpers, which can make your code more complex, whereas a bigger testing API might yield more useful test feedback at the cost of an onboarding learning curve.
00:26:48.240 Finally, in this category, people hate tests that are accidentally creative because, in testing, consistency is golden. If we look at a similar test as we had before, using 'let' to set up an author, a blog, and a comment might be confusing. It’s not clear what the thing under test is, so I rename it 'subject,' and I always call the thing I get back from that thing that I'm going to assert on 'result' or 'results' 100% of the time.
00:27:26.040 So, if I'm reading a big gnarly test, at least I know what's being tested and what's being asserted on, which is more than you can say for a lot of test suites. If you learn one thing today and begin calling the thing you're testing 'subject,' that alone will have made this presentation worth it.
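A sketch of that naming convention in practice, with invented classes: the thing under test is always subject, and the value asserted on is always result.

```ruby
Author  = Struct.new(:name)
Comment = Struct.new(:author, :body)

class CommentSearch
  def by_author(author, comments)
    comments.select { |comment| comment.author == author }
  end
end

RSpec.describe CommentSearch do
  subject { CommentSearch.new }

  let(:author)  { Author.new("Pants") }
  let(:comment) { Comment.new(author, "nice post") }

  it "returns only that author's comments" do
    other = Comment.new(Author.new("Someone Else"), "hi")

    result = subject.by_author(author, [comment, other])

    expect(result).to eq([comment])
  end
end
```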
00:27:51.760 When you're consistent, inconsistency can carry nuanced meaning. For instance, if I have a handful of tests and look at them—oh wait, there's something weird about 'test C.' That implies there's probably something interesting about 'object C.' I should look into that. This is helpful; it speeds me up.
00:28:27.300 But when every test is inconsistent—if every test looks entirely different—I have to bring that same level of scrutiny to each one and read carefully to understand what's going on, to grasp the story of the test.
00:28:35.760 If I'm adopting your test suite, I would much rather see hundreds of very, very consistent tests, even if they're mediocre or just kind of okay, rather than just a handful of beautifully crafted, brilliantly designed, artisanal tests that vary widely. Because every time I fix anything, it just becomes a one-off.
00:29:09.760 Additionally, readers often assume that all our code holds meaning, but especially in testing, it’s often the case that the stuff we put in our tests is just plumbing to make our code execute properly. I try to highlight meaningless test code to help my reader.
00:29:31.120 In this instance, I’m setting up a new author object. He has a fancy name, a phone number, and an email, and they’re all validatable—but that’s not necessary for this method. So here, I’ll just change his name to 'Pants,' I'll remove his phone number because it's not needed, and I'll change his email to '[email protected].' Then I’ll update my assertion.
00:30:04.640 Now everyone in the room, who might have assumed you needed a real author, realizes you didn't. Now everyone in the room could implement this method understanding exactly what it needs to accomplish. So, test data should be minimal, but also minimally meaningful.
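A small MiniTest sketch of minimally meaningful data (the Author struct and mailto_link helper are invented): only the attribute the method actually reads gets a value, and the value itself advertises that it is fake.

```ruby
require "minitest/autorun"

Author = Struct.new(:name, :phone, :email, keyword_init: true)

def mailto_link(author)
  "mailto:#{author.email}"
end

class MailtoLinkTest < Minitest::Test
  def test_builds_a_mailto_link
    # No realistic name or phone number: only the email matters, and it says so.
    author = Author.new(email: "pants@example.com")

    result = mailto_link(author)

    assert_equal "mailto:pants@example.com", result
  end
end
```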
00:30:31.560 Now that we’re through section one on test structure, congratulations! Let’s move on to round two, talking about test isolation. The first thing I want to bring up that cheeses me off is unfocused test suites. Most teams define success in boolean terms when it comes to testing. There’s one question: is it tested? And if the answer is yes, then they feel pretty good about themselves.
00:31:14.440 But I think we can dig deeper! My question is, is the purpose of each test readily apparent? Does its test suite promote consistency? Very few teams can answer yes to this question, and when I raise the issue, a lot of people say, "Consistent? I’ve got tons of tests, all with different purposes testing things all over inside my test suite!" And I'm like, yes, that’s true, but you could probably boil it down to four or five.
00:31:59.800 In fact, for each type of test that I define, I create a separate test suite for each, with its own set of conventions. Those conventions are lovingly reinforced with their own spec helpers or test helpers to try to encourage consistency. I actually did a whole talk just on that called 'Breaking Up With Your Test Suite,' which is available on our blog.
00:32:26.880 In Agile land, there's this illustration people like called the testing pyramid. TL;DR: stuff at the top is illustrated to be more integrated, while stuff at the bottom is less integrated. When I look at most people’s test suites, they’re all over the place. Some tests call through to other units, while others fake out relationships.
00:32:58.320 Some tests might hit a database or fake third-party APIs, while others might hit all those fake APIs but operate beneath the user interface. This means every time I open a test, I have to read it carefully to understand what's real, what's fake, and what the test is meant to accomplish. It’s a huge waste of time.
00:33:35.920 So instead, I start with just two suites in every application I write. One suite I make maximally realistic and as integrated as I can possibly manage, and another I design to be as isolated as possible.
00:34:02.880 Part of the reason I do this is because then, intuitively, I can answer the question: should I fake this? Yes or no? So, I lean towards one of those two extremes instead of landing all over. The bottom suite's job is to ensure that every single detail works in your system, while the top suite is to verify that when it’s all plugged together, nothing breaks.
00:34:39.560 It's pretty straightforward and comprehensible. As the need arises, you might need to define some kind of semi-integrated test suite. It’s just important to establish a clear set of norms and conventions.
00:35:09.760 For instance, I was on an Ember team recently, and we agreed that we would write Ember component tests, but upfront, we had to all agree we would fake our APIs. We weren't going to use test double objects; we would trigger actions instead of UI events and verify app state—not HTML templates. These were arbitrary decisions, but we relished the chance to codify those arbitrary decisions to ensure consistency.
00:35:47.440 Next, I want to talk about how overly realistic tests can bum us out. When I ask someone how realistic they think a test should be, they usually don’t have a good answer other than maximally realistic—like they want to ensure their thing works. They might be proud of their realistic web test; you know, there’s a browser that talks to a real server and a real database, and they think that’s as realistic as it gets.
00:36:19.680 To poke holes in it, I might ask, 'Does it talk to your production DNS server?'—and they say no. I ask, 'Does it talk to your CDN and verify that your cache invalidation strategy is working?'—and they say no. So, technically, it's not a maximally realistic test.
00:36:54.840 In fact, there are definite boundaries here, but those boundaries are often implicit. That implicitness poses a problem, because now, if something breaks, anyone on the team is liable to ask why we didn't write a test for that. This traps teams: they write tests, and then when stuff breaks in production, the managers come in, everyone has a come-to-Jesus moment about 'why,' and they all say, 'Never again.'
00:37:56.320 Their only reaction is to increase the realism of all their tests and all their integrations. That's fine, except realistic tests are slower. They require more time to write and debug; they have a higher cognitive load, since we have to keep more in our heads at once; and they fail in more ways because they have more moving parts. They have a real cost.
00:38:35.680 Instead, think of it this way: if you have really clear boundaries, then you can focus on what's being tested really clearly and remain consistent about how you control everything outside those boundaries. For that same team, with clear boundaries in mind, the story plays out differently.
00:39:08.640 When they write tests and something breaks in production, they can stand tall and have a grown-up conversation about how they agreed that type of test was too expensive. Or, frankly, that they didn’t intentionally break production—they just couldn’t anticipate that particular failure.
00:39:42.320 Simply having tests isn’t a universal ideal; less integrated tests can be useful. They provide richer design feedback on how it should operate, and any failures are easier to understand and reason through.
00:40:13.840 Next, let’s talk about redundant code coverage. Suppose you have a lot of tests in your test suite: browser tests, view tests, controller tests—those all call through to a model. Maybe that model has relationships with other models, and everything is tested in eight different ways. You feel proud of your thorough test suite.
00:40:47.680 In fact, you're a test-first team, so you change that model. The first thing you do is write a failing test, and you make it pass—you feel pretty good about it. But then what happens? All those dependent tests—your controller tests, view tests, browser tests—break because they incidentally depend on that call-through model.
00:41:28.440 So those all broke, and what took you half an hour on Monday morning is now taking you two days just cleaning up all those tests you didn't anticipate would break. It was thorough, yes, but it was redundant too.
00:42:03.680 I've found that redundant coverage can really kill a team's morale. It doesn't bite you on day one when everything is fast and easy to run in one place, but as things get slow, redundant coverage can really kill productivity. How do you detect redundant coverage?
00:42:47.680 Well, it's the same way you detect any coverage; you run a coverage report and look at it. The only thing we typically look at in a coverage report is where coverage is missing, as targets for increasing it. But there are a lot of columns there; what do those other columns say?
00:43:07.920 We rarely look at those other columns. The last column shows the average number of hits per line. That’s interesting because the top thing, having been hit 256 times when I run my tests, tells me that if I change that method, I'm likely to have tests breaking everywhere—an important insight!
00:43:26.740 You can identify a clear set of layers to test through. For instance, that same team might agree that the browser tests are valuable, but the view and controller tests are more redundant, so they'll just test through the browser and models to reduce redundancy or try outside-in Test Driven Development. This is sometimes referred to as London School TDD.
00:43:50.880 Martin Fowler termed it mockist TDD, though I don't love that term. If you've read the book 'Growing Object-Oriented Software,' you may understand my newer take on it, which I call Discovery Testing. Recently, I did a screencast series on our blog about it, so check that out if you're interested.
00:44:29.520 However, we don’t have time to talk about that today. But since I mentioned test doubles and fakes, it’s fair to discuss how people hate careless mocking. A test double is a catchall term for anything that fakes out another thing for the purpose of writing our tests—like a stunt double. A test double could be a fake object, a stub, a mock, or a spy.
00:45:06.360 Humorously, I co-founded a company named Test Double and maintain several test double libraries. When I talk about testing, people often think, 'Oh, Justin, you're probably pro-mocking!' I actually have a more nuanced relationship with mock objects and test doubles.
00:45:48.520 I start by defining the subject I want to write. I think about what dependencies it needs, and I create fakes for those dependencies. I use the test as a sounding board to see if those APIs are easy to use or awkward. The data contracts flowing between the subject and those dependencies should make sense, and if they don't, I can easily change the fake.
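A hedged sketch of that workflow using rspec-mocks (not the speaker's own test double libraries); every class and method name here is invented. The subject is designed first, and its not-yet-written dependencies are faked so the test can exercise the shape of their contracts.

```ruby
class InvitesUser
  def initialize(finds_user:, sends_email:)
    @finds_user  = finds_user
    @sends_email = sends_email
  end

  def call(email)
    user = @finds_user.call(email)
    @sends_email.call(user)
  end
end

RSpec.describe InvitesUser do
  subject { InvitesUser.new(finds_user: finds_user, sends_email: sends_email) }

  let(:finds_user)  { double("FindsUser") }
  let(:sends_email) { spy("SendsEmail") }
  let(:user)        { double("User") }

  it "sends an invite to the looked-up user" do
    allow(finds_user).to receive(:call).with("pants@example.com").and_return(user)

    subject.call("pants@example.com")

    expect(sends_email).to have_received(:call).with(user)
  end
end
```

If wiring the fakes feels awkward here, that awkwardness is design feedback about the contracts, which is the point of the exercise.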
00:46:32.440 That’s a very simple time to catch design problems. That’s not how most people use mock objects—most people only try to write realistic tests. They have dependencies, some easy to set up, others hard. When a dependency fails intermittently, they use mocking frameworks as a crutch to shut up those pesky dependencies.
00:47:01.520 Then they try to get their tests passing, and when they're done, they're exhausted. But from day two onward, we find those tests just treat the symptoms of test pain, not the root cause. They greatly confuse future readers about which tests are valuable and what's real versus fake.
00:47:39.720 That makes me really sad because it gives test doubles a bad name, and I have to protect my brand, y’all! So, if you see someone abusing a test double, say something.
00:48:16.560 Before we wrap up on test isolation, I want to mention application frameworks because frameworks are cool—they provide repeatable solutions to common problems we face. The most common problems are integration concerns. If we visualize our application as juicy plain old code in the middle and some framework code at the periphery, we see that your framework is likely providing simple ways to talk to HTTP or email or other cool stuff.
00:48:50.840 Most frameworks focus on integration problems. As a result, when they provide test helpers, those helpers assume the same level of integration because you want to use the framework correctly. That’s fair; frameworks aren’t messing up here. However, as framework consumers, if we view the framework as the giver of all we need, we end up writing only integration tests.
00:49:29.840 In reality, if some of our code doesn't rely on a framework, why should our tests? The answer is they shouldn’t. You might still have a first suite that calls through all the framework stuff, an overly integrated test suite to check that everything's plugged together correctly. But if you've got juicy domain logic, test that logic without the framework coupling.
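For instance, a plain-Ruby domain object can be specified with no framework, database, or HTTP anywhere in the loop (PriceCalculator is an invented example):

```ruby
class PriceCalculator
  def total(quantity, unit_price, tax_rate)
    subtotal = quantity * unit_price
    (subtotal * (1 + tax_rate)).round(2)
  end
end

RSpec.describe PriceCalculator do
  subject { PriceCalculator.new }

  it "adds tax to the subtotal" do
    result = subject.total(3, 10.0, 0.07)

    expect(result).to eq(32.1)
  end
end
```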
00:50:00.000 Not only will it be faster, but you'll receive much better feedback with clearer messages and better insights.
00:50:25.120 That wraps up our discussion on test isolation! Let's move to our final round: test feedback. We'll start with another issue people face with tests—bad error messages. Consider this example: "Oh crap, I broke the build! Now what?" Let’s pull down this gem that I wrote; this is a real gem and a real build failure.
00:51:15.200 So, naturally, it’s going to have an awesome error message. Let's look at that—"Failed assertion; no message given on line 25!" What’s my workflow here to fix this? First, I’ve got to check the failure and open the test; next, find that line. I then have to set a print statement or debug to understand what the expectation was versus what happened.
00:51:58.240 Then, I change the code and see it passing. At that point, I feel like I deserve a coffee break after spending 20 minutes on this! That’s a wasteful workflow. Every time I see a failure like this, it's frustrating.
00:52:24.000 Even if tests are quick, bad failure messages provide enough friction and waste time that they can offset how fast the test suite is.
00:53:02.160 Now, let's look at a good error message, specifically an rspec-given example. Let's say 'Then user.name should equal Sterling Archer.' We run the test, and you can see the assertion right there: 'Expected Sterling Malory Archer to equal Sterling Archer.' It shows you that the failure was tripped by the whole expression evaluating to false.
00:53:50.560 It keeps breaking the expression down until it can't anymore. The reference on the left, user.name, evaluated to 'Sterling Malory Archer,' and it knew it could also call user, which is an ActiveRecord object, so it prints that ActiveRecord object right there for me.
00:54:28.360 Now, when I see a failure in rspec-given, I'm like, 'Okay, cool!' My workflow is to see the failure, realize what I did wrong, change my code, and then earn a big promotion for being faster than the other guy stuck with a bad assertion library.
00:55:09.920 In my opinion, we should judge assertion libraries based on the feedback they give us when assertions fail, not just on how snazzy their API is.
00:55:48.520 Next, let's discuss productivity and slow feedback loops. 480 is an intriguing number; it’s the number of minutes in an eight-hour workday. I think about this number a lot. For example, if it takes me 15 seconds to change some code, 5 seconds to run a test, and 10 seconds to determine my next action, that creates a 30-second feedback loop.
00:56:09.760 That means in an 8-hour workday, I have an upper limit of 960 thoughts that I'm allowed to have. However, if, like me, you have some non-coding responsibilities, you carry additional overhead: the meetings, email, and chat, and the context-switching they cause, get amortized into every feedback loop.
00:56:48.360 If that pushes me to a 60-second feedback loop, that ties directly back to 480 thoughts a day, with something like two hours of non-code activity folded into an eight-hour workday. But let's say we're successful, and we have many tests. If running a single test now takes 30 seconds, that brings us to an 85-second loop, allowing for only 338 actions a day, roughly a third of what I started with.
00:57:26.120 That non-code time doesn't care how fast your tests are; it's a fixed cost in every loop. Now imagine you have really poor error messages. Instead of being able to see what's wrong in 10 seconds, it takes you 60 seconds to debug, and your feedback loop stretches to 155 seconds.
00:57:55.920 Now I can only have 185 useful thoughts in a day, and that sucks! If you're on a team with many integration tests, and you draw the short straw for a given iteration, your job might be to update all those integration tests.
00:58:27.640 I once worked on a team where running an empty cucumber test literally took four minutes as the baseline. That is extremely slow! In this case, if my testing loop averages 422 seconds, I end up with only 68 actions in a single day.
00:59:04.720 If I ran a four-minute test, I’d run it, start browsing Twitter, Reddit, email, or elsewhere, and then come back later, only to realize that the test finished three minutes ago. So my real feedback loop is more like 660 seconds...which translates to 11 minutes! I may only get 43 actions a day. After six months, I could feel my skills atrophy. I was miserable, even though I spent all that time on Reddit.
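The arithmetic behind those counts is just the seconds in an eight-hour day divided by the length of one feedback loop; a quick sketch reproducing the figures from this section:

```ruby
SECONDS_PER_DAY = 8 * 60 * 60 # 28,800 seconds in an eight-hour workday

[30, 60, 85, 155, 422, 660].each do |loop_seconds|
  puts "#{loop_seconds}s loop => #{SECONDS_PER_DAY / loop_seconds} loops per day"
end
# => 960, 480, 338, 185, 68, and 43 loops respectively
```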
00:59:56.440 The number 43 is significantly smaller than 480. You might not realize it, but we just did something significant together today. We found it: the mythical 10x developer in the room!
01:00:29.320 So this matters! A few seconds here and there really add up. I encourage you to use a stopwatch to monitor your activity and optimize feedback loops.
01:01:05.400 If you find your app is too slow, just embrace that; iterate quickly and integrate later if needed!
01:01:14.920 Now let's talk about one reason for slowness: painful test data. How much control each test has depends significantly on the testing data strategy we employ.
01:01:47.480 For example, you might use inline model creation in every test for a lot of control over how your test data is set up. Others might use fixtures for a decent starting point for a schema each time you run tests.
01:02:28.320 If you have complex relationships or plenty of data required to do anything interesting, you may curate a SQL dump to load at the start of each test run. Other places lacking control over their data will write self-priming tests to run actions first—like creating an account before running a test.
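A toy MiniTest sketch of the two ends of that control spectrum, with plain Ruby structs standing in for real models and fixtures: inline creation builds exactly what one test needs, while a shared fixture trades that control for reuse.

```ruby
require "minitest/autorun"

User = Struct.new(:name, :admin, keyword_init: true)

# Fixture-style: one pre-baked starting point shared by every test.
FIXTURES = { pants: User.new(name: "Pants", admin: false) }.freeze

class TestDataStrategiesTest < Minitest::Test
  def test_inline_creation_gives_full_control
    user = User.new(name: "Pants", admin: true) # built right here, per test

    assert user.admin
  end

  def test_fixtures_trade_control_for_reuse
    user = FIXTURES.fetch(:pants) # cheap to reuse, but shared across tests

    refute user.admin
  end
end
```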
01:03:16.640 None of these methods are straightforwardly good or bad, but it’s essential to note that you don’t have to pick just one means of setting up data for your application. You can change it at any time.
01:03:40.640 Take the testing pyramid, for example: maybe inline creation is a good way to test models because it's explicit and gives a lot of control. Fixtures could work for integration tests, where creating users again and again isn't ideal, while data dumps are wise for smoke tests so you don't spend four minutes churning through a factories.rb file.
01:04:32.800 If you're writing tests against staging or production, you'll probably rely on self-priming tests, because direct database access is too risky. In many slow test suites, data setup proves to be the largest contributor to slowness.
01:05:06.440 I don't have evidence to support that—it just seems truthful. Therefore, I encourage everyone to profile those slower tests and use Git Blame to figure out exactly what made them slower. If needed, change how you manage your test data.
01:05:43.080 Speaking of things slowing down, let's address a favorite phenomenon: superlinear build slowdown. Our intuition about how long tests take to run often misleads us. If we write one integration test and it takes 5 seconds, we assume that if we write 25, it'll take 25 times as long, and if we write 50, it'll take 50 times as long.
01:06:17.600 We think this way because it seems logical; we measure test duration as spending 5 seconds in the testing code, but that’s not entirely accurate, as we also spend time in the app code or during setup or teardown. We probably spend more time in the app code than in the test code.
01:06:48.760 Typically, in that 5-second test, we may only spend one second in the test code itself. And the more tests and features we add, the more those features start to interact with one another, so each test exercises more and more app code as things get bigger. Consequently, the test that used to take five seconds now takes seven, even though we still only spend a second in the test code.
01:07:24.640 By the time we've written 50 tests, we're likely exercising feature interactions all throughout our code. In that scenario, each test might average something like 18 seconds, and we have to zoom the graph out: the whole suite now takes around 900 seconds.
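A toy model of that superlinear curve; the per-test creep constant below is an assumption chosen to land near the 18-second and 900-second figures above, not data from the talk.

```ruby
creep = 0.26 # assumed seconds added to every test for each feature in the app

[1, 10, 25, 50].each do |n|
  per_test = 5 + creep * n
  puts format("%2d tests: ~%.0fs total (linear guess: %ds)", n, n * per_test, n * 5)
end
# At 50 tests, each averages about 18 seconds and the suite takes ~900 seconds,
# not the 250 seconds the linear intuition predicts.
```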
01:08:05.840 Even halfway through that journey, at 25 tests, we've already added roughly 150 extra seconds to the total run time, which differs greatly from the earlier linear intuition. Tracking this stuff is essential; I've witnessed teams casually mention, 'Our build's a little bit slow,' only to discover three months later that their build has stretched to 9 hours.
01:08:50.400 This is shocking for them and sneaks up out of nowhere! They often do not perceive it because it’s counterintuitive. So track this stuff! Moreover, I encourage everyone to avoid the impulse to create a new integration test by rote for every feature you write.
01:09:26.400 Instead, look to manage a few integration tests that intelligently cover unique interactions across different features, as that’s the better way to test for user behavior!
01:10:09.320 During the early phases of development, you can collectively decide on arbitrary limits, like capping the build at 5 or 10 minutes total. Once we begin to exceed these limits, we agree that something must change: either we need to delete a test or we have to make the whole suite run faster.
01:10:52.480 That's quite effective: drawing a hard line in the sand. I've seen many teams achieve great outcomes with it. Moving on to our last topic, and this is my favorite: false negatives.
01:11:36.480 So, what are false negatives? They raise this question: what does it mean when a build fails? Someone might immediately answer, 'It means the code's broken!' But wait a moment; the follow-up question is: which file needs to change to fix that build?
01:12:23.040 Usually, we really just need to update some test somewhere. So a test failed, not the actual code, and that's where people start scratching their heads, which brings us to defining true and false negatives.
01:12:47.080 A true negative refers to a red build, indicating that something was broken and meaningful, and requires fixing to restore the integrity of our code. A false negative is a red build that signals we're unfinished. We just forgot to update that test.
01:13:31.160 True negatives are great because they reinforce the value of our tests. When our managers invest in us writing tests, they don't know false negatives exist; they assume every failure caught a bug that would otherwise have reached production.
01:14:08.440 That perception is a big part of how the investment in our testing gets justified, but I've found that true negatives are depressingly rare. I can typically count only three or four in several months of hard work.
01:15:00.320 In contrast, false negatives steadily erode our confidence in our tests. Every time we see a build failure, we think, 'Oh, sure, I just need to go update all these tests,' and we start to feel like slaves to our test suite.
01:15:45.480 The top causes of false negatives are redundant code coverage, changes to models that we forget about, and slow tests. When tests are too slow, we can't run all of them before pushing our commits, so we outsource that checking to CI.
01:16:43.680 Then the build surprises us with failures we never saw locally. So, if you take one thing from this discussion, please write fewer integration tests. It will make you a lot happier!
01:17:31.040 I encourage you to track whether your build failures are true or false negatives. Determine how long it took to resolve them because that data can help pinpoint root cause problems in your test suite.
01:18:11.400 We can take this data to justify investing in improvements to our tests. We’ve just talked about five things concerning test feedback, which means we've done it! We've reached the end of our journey together this morning.
01:18:50.240 If this talk bummed you out and felt a little too close to home, remember that regardless of how bad your tests might be, this guy right here probably hates AppleWorks more than you hate your tests!
01:19:29.520 I’m here from Test Double, like I mentioned. If you're trying to hire senior developers for your team, you’re likely having a tough time right about now. But at Test Double, we have amazing senior developer consultants, and we love collaborating with existing teams.
01:20:00.000 So get a hold of me! I'll be here all week. We have Josh Greenwood, who is somewhere around here, as well as Jerry D'Antonio, who's giving a talk on concurrent Ruby tomorrow afternoon. Hope you can check that out.
01:20:39.280 We work with a mission focused on improving how the world builds software. If you're interested in helping us write new features and get better, consider joining us. You can reach us at [email protected].
01:21:04.680 Most importantly, I hope this was valuable for you! I have plenty of stickers, business cards, and goodies to share, and I would love to meet you and hear your story. Thank you so much for sharing your precious time with me; I truly appreciate it!