00:00:14.000
Hey everyone! It's a little dark in here because all my slides are black, so please don't fall asleep. I'll try to speak loudly and quickly to keep the energy up.
00:00:28.320
The story of how this talk came about is that I realized last year that conference season tends to line up with Apple’s operating system release schedule. I'm a huge Apple fanboy and I always want to be riding the latest and greatest. At the same time, I also need to make sure my slides work because I'm always really nervous about them getting broken. Maybe it was because they announced the iPad Pro or something, but I thought that maybe OS 9 is finally ready for me to build an entire talk in it. I put my typical 80 hours of work into building a slide deck, and this talk was written entirely in OS 9.
00:00:52.500
Let’s boot it up and see how it goes. Please bear with me. Alright, we’re starting up; it’s a little retro with the flat look. Alright, I’ve got to find my talk on the desktop. Alright, AppleWorks 6, and I've got to find the play button in there.
00:01:14.790
Here we go! This talk is called 'How to Stop Hating Your Tests.' My name is Justin Searls, and I work at a really great software agency called Test Double. If you don't know me, you might recognize my face as being associated with lots of snarky tweets that get retweeted. I’m told I don’t look like this anymore, which is depressing for my hairline.
00:01:28.409
This is the tapper edition of the talk. I’ve never given a talk at a conference that served free beer before, but I hope that means you'll all like this more than you otherwise would. Let’s dive in! So, why do people hate their tests? What I think happens is that there’s a cycle that teams almost always seem to go through.
00:02:03.329
First, you start a new project in an experimentation mode; you're just making stuff work. You’re having fun and doing new things, iterating really quickly. Eventually, you get into production, and it becomes important that your stuff still works, so you start to write some tests. You put a build around it and ensure your new changes don’t break anything. However, without a lot of forethought and planning, those tests are likely going to be slow, and as they aggregate, they will feel like a burden.
00:02:48.030
Eventually, you’ll see teams get to this point where they feel like they are slaves to their builds and tests. They're always just cleaning up tests and yearning for the good old days when they didn't have to worry about that, and they could just move fast and break things. I see this pattern often enough that I’m beginning to think that an ounce of prevention might be worth a pound of cure in this instance. Once you get to the end, there’s only so much you can do. You might say, 'Oh, our test approach isn’t working,' but how do you respond to that? You could say, 'Well, you’re just not testing hard enough; you need to buckle down!'
00:03:30.930
Whenever I see the same problem emerge over and over again, I’m not a fan of the 'work harder, comrade' approach. I think we should introspect on our tools and workflows and be open and honest about whether we’re testing effectively. Another common response is to treat testing as job one, insisting that we must remediate. You might get away with this for a little while, but testing is never job one from the perspective of the people paying our paychecks. At best, it's job two; you’ve got to ship stuff and get it out the door.
00:04:09.750
I’m talking about prevention stuff today, so if you’re not in a greenfield project, if you’re in a big monolith and you're checking out because you’re thinking, 'Oh well, this is all about stuff to do at the beginning of a project,' don’t worry; it’s not a problem. Here’s one weird trick for starting fresh with tests: just move your tests into a new directory. Create a second directory, and then you can write a shell script that runs the tests out of both directories as you port them over to the new, clean test suite.
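A minimal plain-Ruby sketch of that trick; the directory names (`test` and `test_v2`) are my own invention, not from the talk:

```ruby
# Collect test files from both the legacy directory and the new one,
# so a single command can run the combined suite while tests are
# gradually ported over.
def test_files(*dirs)
  dirs.flat_map { |dir| Dir.glob(File.join(dir, "**", "*_test.rb")) }.sort
end

# A runner script (or Rake task) can then loop over
# test_files("test", "test_v2") and run each file with ruby -Ilib.
```

Once the new suite covers everything, you delete the old directory from the list and you're done.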
00:04:40.280
Eventually, maybe you can decommission the old, crappy tests. I hesitated before giving this talk or producing it at all because I am the worst kind of expert. I’ve spent years navel-gazing about testing and doing lots of open-source projects to help people with tests. I’ve been that guy on every single team you’ve ever been on who cares just a little bit more about testing than you do. I am involved in so many highfalutin Twitter spats about terminology and stuff that I don’t know why anybody follows me.
00:05:12.130
If I were just to tell you what I really think, it would be toxic. My goal today is not to just spew that at you but to distill that toxic stew of opinion into three component parts. The first topic I want to discuss is structure—how we’re writing and organizing our tests.
00:05:29.040
The second is isolation, because how you isolate the thing you're testing from the components around it communicates the value you’re trying to get out of the test. Finally, we’ll talk about feedback: do our tests make us happy or sad, are they fast or slow, and what can we do to improve them? Keep in mind, we’re always talking from the perspective of prevention here. These ideas might be new or different for your project, but I hope you can pick them up along the way.
00:05:51.010
It was around this point that I realized doing custom artwork in Apple Works 6 is a huge pain. My brother picked up this old Family Feud Apple II game, and we just ripped off the artwork from that. So we’re going to operate off of this Family Feud board—it's a real board, because if I point to it and say something ridiculous like 'show me potato salad,' it’ll give me an X. In fact, I didn’t have a hundred people to survey for populating this board; I just surveyed myself a hundred times, so I know all the answers.
00:06:13.050
So, it's going to be a really bad game of Family Feud. First, we’re going to talk about test structure. The first thing that people hate about their tests is when they're just too big to fail—big gigantic test files. I want to pause here. Have you ever noticed that people who are really into testing, especially TDD, seem to really hate big objects more than other people do?
00:06:40.990
Big methods—they're harder to deal with than small methods. While we can all agree that big stuff is harder, testing aficionados have even more disdain for them. What I have found is that testing actually makes working with big objects and big methods much harder, which is a bit counterintuitive. The root cause, if we were to analyze the nature of big objects and big tests, is that if you have a big object, you likely have lots of dependencies. This means that your tests have lots of setup.
00:07:05.210
Big objects typically have lots of side effects in addition to whatever they are returning, which means your tests might have numerous verifications. Big objects also have many logical branches based on the arguments you provide and the broader state of the system. Things start to fly off the rails here because you then have many test cases to write based on all those branches.
00:08:02.990
Now, I want to show off some code, but I realized OS 9 has no terminal because it’s not Unix, so I had to go find one somewhere. Let’s check this out. Boot it up; it takes a minute. Sorry, it's a little slow.
00:08:27.360
Alright, here’s our terminal emulator. This is a fully operational UNIX terminal, which means I can type in arbitrary commands like `whoami`. Okay, great. Let’s open up a file here. I'm going to write a simple RSpec test; let’s say there's an Active Record model called Timesheet, and it has a validation that depends on whether notes have been entered into the timesheet, whether the user is an admin, whether it's invoice week, and whether they’ve entered time.
00:08:58.200
So, I’ve got the first case down, but then I think of all these other permutations; for instance, what if there are no notes? Or what if it’s a non-admin user? Or what if it's an off week instead of an invoice week, or what if they don’t have any time entered? Now I'm realizing I’ve got a lot of different permutations to worry about.
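To make the example concrete, here's a plain-Ruby stand-in for a Timesheet validation like the one described; the attribute names and the exact rule are my assumptions, not the code from the slides:

```ruby
# Hypothetical rule: admins are always valid; everyone else must have
# entered time, and during invoice week must also have notes.
# Four boolean-ish inputs feed a single validation method.
Timesheet = Struct.new(:notes, :admin, :invoice_week, :time_entered,
                       keyword_init: true) do
  def valid?
    return true if admin
    return false unless time_entered
    invoice_week ? !notes.to_s.empty? : true
  end
end

Timesheet.new(admin: false, time_entered: true,
              invoice_week: true, notes: "").valid?  # => false
```

Every one of those four inputs doubles the number of distinct situations the method can be in, which is exactly what makes the permutations pile up.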
00:09:40.160
What happened here? I fell victim to what’s called the rule of product, which comes from combinatorics. The TL;DR is that if I have a method with four arguments, I can figure out the number of possible combinations by multiplying together the number of variations of each argument.
00:10:11.140
In this case, that gives us the upper bound of potential test cases we need to write. When they are all boolean attributes, we have a really easy case; it’s only two to the fourth, so we only need to write 16 test cases for this very trivial validation method. If you're used to writing a lot of big objects and big functions, it’s common to think, 'Oh, I’ll just add one more little argument here.' You may not realize that this implies you’re going to double the number of tests you have to write.
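The arithmetic is easy to check:

```ruby
# Rule of product: the upper bound on test cases is the product of the
# number of variations of each input. Four boolean inputs:
variations = [2, 2, 2, 2]
variations.reduce(:*)   # => 16

# Adding "just one more" boolean argument doubles the bound:
([2] * 5).reduce(:*)    # => 32
```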
00:10:37.800
So, if you're comfortable with big code and you want to get more serious about tests, recognize that testing makes writing big objects harder to deal with. Testing is supposed to make things easier, but in this case, I advise you to stop the bleeding and stop adding to big objects. When I write Ruby, I try to limit every new object to just one public method and at most three dependencies, maybe just a handful of arguments. Never any more than that.
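As an illustration of that guideline, one public method and few dependencies, here's the kind of small object I mean (the class and its domain are invented for this example):

```ruby
# One public method (#call), one piece of configuration, no collaborators.
# An object this small has very few branches to test.
class CalculatesOvertime
  def initialize(standard_week: 40)
    @standard_week = standard_week
  end

  def call(hours_worked)
    [hours_worked - @standard_week, 0].max
  end
end

CalculatesOvertime.new.call(46)  # => 6
```

With one entry point and two branches, the whole object is covered by a couple of test cases instead of sixteen.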
00:10:56.200
Now, people might push back. If you're used to big objects and like seeing everything in one place, that can feel uncomfortable. You might say, 'But then we'll have way too many small things! How will we possibly deal with all those?' Well, we always have to guard ourselves against the fact that as programmers, we sometimes get caught up in our own incidental complexity. This enterprise CRUD stuff we do doesn’t have to be rocket science, but when we make it convoluted, we feel like that's hardcore programming.
00:11:38.540
To some programmers, this advice feels like telling them to program in easy mode, and my reaction is, 'Yeah, it is easy! We don’t have to make this stuff so hard, just write small objects!' Another thing people hate about tests is when they go off-script. What I mean by this is that we think of code as something that can do anything, but tests can really only do three things: they all follow the same script.
00:12:04.500
Every test ever follows the same script: it sets something up, invokes a thing, and verifies the behavior. This can be formalized into phases: the Arrange phase, the Act phase, and the Assert phase. In all my tests, I take great care to call out all three of those phases consistently. For instance, if I have a condensed-looking Minitest test, I will add a blank line after my Arrange, and another blank line after my Act.
00:12:41.370
If I do this consistently, it means that when I skim the test, at least I can easily see the setup, the behavior I'm invoking, and the assertions. I always have it in that order, so that it reads like that script. If I'm doing something more BDD-style, like RSpec, I can use constructs like 'let' to set up a variable, so anyone familiar with RSpec knows that’s a setup step. I can use 'before' to invoke my action and split my assertions across separate 'it' blocks if I want to. It might be more verbose, but at least, at a glance, we know exactly what phase each line belongs to.
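Here's what those separated phases might look like in a Minitest test; the `Cart` class is a made-up stand-in just so the test has something to exercise:

```ruby
require "minitest/autorun"

# Hypothetical class under test.
class Cart
  def initialize
    @items = []
  end

  def add(price)
    @items << price
  end

  def total
    @items.sum
  end
end

class CartTest < Minitest::Test
  def test_total_sums_item_prices
    cart = Cart.new         # Arrange
    cart.add(5)
    cart.add(7)

    result = cart.total     # Act

    assert_equal 12, result # Assert
  end
end
```

Even without reading a word, the blank lines tell you where setup ends, where the behavior is invoked, and where verification begins.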
00:13:28.569
I really like the 'given, when, then' approach put forward by Jim Weirich, a huge hero of mine, who should be a hero to anyone in the Ruby community. One of his final contributions is called rspec-given, and it provides a conscientious API that’s as terse as possible yet expressive for tests. Instead of using 'let,' we just call 'Given' with a label, 'When' for the action, and 'Then' for the assertion. This makes tests easier to read and helps highlight superfluous code.
00:14:02.560
For instance, if you have lots of 'given' steps, it could mean you have too many dependencies, or things are too hard to set up. If you have more than one 'when' step, the API may not be user-friendly. Lots of 'then' steps could mean the code is doing too much or you're violating the command-query separation.
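For illustration, a sketch of what those phases look like in rspec-given (this requires the rspec-given gem, and the subject is a contrived example of mine):

```ruby
require "rspec/given"

describe "Array#pop" do
  Given(:stack) { [1, 2, 3] }

  When(:result) { stack.pop }

  Then { result == 3 }
  And  { stack == [1, 2] }
end
```

One Given, one When, a couple of Thens: anything beyond that shape is a smell the API makes visible.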
00:14:39.950
Another thing people hate about tests is when they’re hard to read. Some people are fond of saying that test code is code, implying it should be taken seriously, but in my mind, test code is untested code, which means it should be treated with derision and skepticism. Test-scoped logic can confuse the story of what's being tested, especially when it includes conditionals and branches.
00:15:08.490
People excited about testing are often the most eager to introduce abstractions to solve their testing pain. For instance, someone might see an opportunity to DRY something up by looping over test cases. Looks like they are converting Roman numerals to Arabic, and while there is a lot of duplication, it’s tempting to use a data structure to loop over it.
00:15:57.700
Technically, there's nothing wrong with this approach; it will work and provide good error messages. However, they may miss an opportunity by not addressing a root cause in the production code, which has much more importance than the test code. If the production code is convoluted, you could pull out keys into a hash and simplify it significantly.
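Here's a sketch of what that hash-driven production code might look like; this is my reconstruction, not the code from the slides:

```ruby
# Pull the numeral values out into a lookup table, with subtractive
# pairs listed first so "IV" is matched before "I".
ROMAN_VALUES = {
  "CM" => 900, "CD" => 400, "XC" => 90, "XL" => 40, "IX" => 9, "IV" => 4,
  "M" => 1000, "D" => 500, "C" => 100, "L" => 50, "X" => 10, "V" => 5,
  "I" => 1
}.freeze

def to_arabic(roman)
  total = 0
  rest = roman
  until rest.empty?
    key = ROMAN_VALUES.keys.find { |k| rest.start_with?(k) }
    total += ROMAN_VALUES.fetch(key)
    rest = rest[key.length..]
  end
  total
end

# With the production logic this simple, a handful of table-driven
# cases is plenty:
{ "III" => 3, "XIV" => 14, "MCMXC" => 1990 }.each do |roman, arabic|
  raise "#{roman} failed" unless to_arabic(roman) == arabic
end
```

Once the data lives in the production code instead of the test, the test no longer needs its own looping abstraction to stay DRY.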
00:16:29.940
As a result, I won’t need to worry about as many test cases, and perhaps only need a handful. In a similar vein, Sandi Metz has discussed the idea of a 'squint test': open up a bunch of tests at random and see how easy it is to discern what's being tested and whether the methods are organized logically.
00:16:56.230
I like to use constructs like RSpec's context to indicate the logical branches and maintain order. Most importantly, I aim for clarity in the Arrange, Act, and Assert sections even if I’m using xUnit. Next, I'd like to discuss tests that seem too magical.
00:17:27.710
Some tests struggle with too much magic; others with too much repetition. It’s important to keep in mind that software is a balancing act, and testing libraries and their APIs are no different: they vary dramatically in expressiveness. Some have small APIs that require you to do more heavy lifting, while others allow for more expressive tests but require more learning.
00:17:54.790
If you look at something like Minitest, everything is a class and we know classes. The methods are straightforward, while RSpec has a massive API with constructs like 'describe,' 'context,' 'subject,' and 'before' that can easily feel overwhelming.
00:18:22.359
Jim Weirich's rspec-given API offers a thoughtful way to manage this complexity with terms like 'Given,' 'When,' 'Then,' and 'Invariant.' Instead of packing in a massive assertion library, it carries out much of the heavy lifting through introspection, in a feature known as natural assertions, but it still stands atop the complexity of both Minitest and RSpec.
00:18:56.710
There isn't a right or wrong expressiveness level in testing APIs; it’s vital to keep in mind that when using a smaller one, it's easier to learn but you have to guard against complex test logic. On the other hand, a bigger test API like RSpec may provide prettier tests but might also create a greater burden to understand.
00:19:28.080
People also dislike tests that show accidental creativity. My only real takeaway from this whole journey is that consistency is worth its weight in gold. When I look at a test, I always check what the thing under test is, and I name that subject. I look at what I get back from the Act step, and I name that result or results.
00:19:55.800
With consistency, even if a test is huge and unwieldy, I can scroll through it and note its various components. I’d much rather inherit a gigantic test suite of consistent tests even if they’re mediocre, because they can be improved broadly rather than having a few luxurious one-off tests, which require starting from scratch for every improvement.
00:20:27.030
Readers often assume every bit of code we write has meaning, but that’s not the case. A lot of our test code is merely plumbing to facilitate the behavior we’re trying to assert. I strive to make the unimportant bits of test code look obviously meaningless to the reader. For example, if I create a new author with a very realistic name, phone number, and email, but those details don’t matter to the test, I simplify them down to something minimal.
00:20:56.180
Details that previously seemed to require a fully valid author can be minimized while still driving the assertion I’m after. Now my test looks simple enough that anyone could probably implement the function just by looking at it.
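A tiny example of the difference; the `Author` struct and `display_label` function are invented for illustration:

```ruby
Author = Struct.new(:name, :email, keyword_init: true)

def display_label(author)
  "#{author.name} <#{author.email}>"
end

# Overspecified: realistic-looking details imply they matter to the test.
Author.new(name: "Jane Q. Public", email: "jane.public@example.com")

# Minimal: obviously-generic values make the actual behavior under test
# (the formatting) stand out.
minimal = Author.new(name: "name", email: "email")
display_label(minimal)  # => "name <email>"
```

With the minimal version, the assertion practically writes the implementation for you.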
00:21:29.490
So, congratulations, we’ve talked about test structure! Now let’s move on to discussing test isolation. The first thing people often mess up is not having a clear focus in their test suites. When teams define their success, they're often content to ask if their stuff is tested, yes or no. If it's yes, they feel good.
00:21:43.770
However, a more nuanced and important question is whether the purpose of every test is readily apparent. Does the test suite promote consistency so it can be maintained? Usually, if the answer is no, people push back with common feedback that they have many different tests due to the variety of conditions because their system does a million things. But if they think carefully, they could likely identify three, four, or five different types of tests that might cover eighty percent of what they need.
00:22:15.440
Not every single test needs to be a special snowflake! I encourage teams to create separate test suites; we can make as many folders in our project as we want. I suggest creating a separate suite, with its own directory, its own configuration, and its own set of conventions, for every type of test being written in the system; that's what makes consistency enforceable.
00:22:38.610
One way to visualize test suites is through the testing pyramid concept from agile land. The short version: tests at the top are more integrated and realistic, while tests at the bottom are less integrated and more unit-like. Most test suites I come across look like one gigantic folder. If I open one test at random, I might find a test of a unit that calls through to two other units.
00:23:11.260
Another might fake its relationships with the units around it, and yet another might hit the database but fake third-party APIs. Another test could call through to those APIs but operate beneath the user interface. They’re all over the place, so every time a pair of developers writes a new test, they have to relitigate low-value arguments about whether or not to mock. This is not a beneficial discussion.
00:23:46.050
Instead, I try to start with two suites each approaching one extreme: one test suite that is as integrated as I can manage so that if I’m in an argument about mocking something, I can just say no. Let’s make everything as realistic as possible! Similarly, I write another test suite that aims to isolate everything from its collaborators, unless it’s a pure function.
00:24:07.850
In this case, each individual little file ensures that everything works as it’s supposed to, while the test suite at the top checks if everything seems to work when it’s all plugged together. This structure can work well for the first several months of any project. However, over time, the complexity might compel a middle-tier test suite where the norms are agreed upon.
00:24:38.230
Last year, I was part of an Ember team, and we agreed to write middle-tier tests for the Ember components we were writing. We agreed that we would fake all the APIs and not use any test doubles. We triggered logical actions, not user interface events, and verified app state instead of the rendered HTML. We could have done it either way, but simply agreeing upfront made consistency much more manageable, and made it easier for the entire team to know what was going on at a glance.
00:25:22.580
People also dislike test suites that are too realistic. Teams can feel trapped by the question of 'how realistic should your test be?' which is a question that often leads to the answer being 'the most realistic possible.' However, the problem with that is that while one might be proud of a very realistic-looking test suite, they might not consider whether it tests through conditions like DNS configuration or the cache invalidation strategy for a CDN.
00:25:57.500
Although the team might ask whether they’ve tested everything, the reality is that making all tests maximally realistic leads to slower tests. Higher realism means more moving parts, which makes tests harder to debug and creates a higher cognitive load. There is a tangible cost to increasing realism that isn't often acknowledged.
00:26:32.020
Establishing clear boundaries up front helps increase focus on what is being tested and what is being controlled. When teams have clear boundaries about what they are testing up front and agree on how to approach testing, they are more resilient when issues pop up in production.
00:27:06.620
However, if they never set boundaries, they might run into situations where test failures lead to the question of why no tests exist for problems that occur in production. Recognizing that it’s impractical to test for every eventuality is important, and helps teams manage their time and resources better.
00:27:47.700
So we need to write targeted tests for specific issues without increasing the realism of all our tests. While realism is not inherently a bad characteristic, it shouldn’t be the sole guiding principle. Tests with less integration provide valuable design feedback and allow for easier troubleshooting when changes are made.
00:28:34.120
Still on the topic of test isolation: even if people don't know the term, redundant coverage can be a huge issue. Suppose you're proud of a large test suite consisting of browser tests, view tests, controller tests, and model tests. You might think that thoroughness adds to the quality of your tests, but if the same behavior is covered at every layer, changes trigger failures across all of them, and that redundancy grinds down morale.
00:29:09.660
When you have a new requirement in your model and you work test-first, you write a failing test, make it pass, and it all feels worthwhile, until every other test that redundantly covers the same code breaks too. The time spent fixing all those cascading failures can feel like an overwhelming obstacle.
00:29:49.040
Redundant coverage can appear thorough on the surface, but it’s ultimately a drain on team morale. It makes CI much slower, often leading to days spent cleaning up tests or managing failures during new feature rollouts. To detect redundant coverage, we usually rely on coverage reports.
00:30:17.610
Typically, we focus on the first column to identify parts of code that need more coverage; however, the other columns reveal insights about how often those tests are executed in reality. If you organize or sort by hits per line, you can spot methods hit numerous times during testing and you might realize there are tests that can probably be eliminated.
00:30:43.480
Another approach is to establish clear layers to test through; some tests, like view tests and controller tests, may not offer much value. By eliminating entire classes of tests, we save ourselves time and focus on higher value areas. Another technique you might consider is outside-in test-driven development, focusing on the interactions between layers.
00:31:11.160
Mocking at each of those boundaries might feel impractical, but it encourages the team to design components that interact through clear interfaces, which improves both productivity and the testing experience. At the same time, excessive mocking often leads to confusion about what value a test provides, so understanding the role of each mocked dependency becomes crucial.
00:31:32.360
The term 'test double' refers to any fake or stub used in a test, which can lead to some misunderstanding. As a professional working at Test Double, I often engage in conversations about the utility and pitfalls of mocks. It’s crucial to understand when mocks provide a temporary solution versus when they complicate relationships between components.
00:32:00.360
If I had to summarize my relationship with mocking, it would be a balanced one. Sketch out your dependencies up front, with a diagram if it helps; if you then mock certain dependencies consistently, you'll get better insight into the contracts between components and how they fit together.
00:32:39.050
Another aspect related to test isolation is having a clear narrative about how to integrate with framework and libraries. Frameworks provide repeatable solutions to common problems, but this also triggers integration problems for the tests themselves. It's important to maintain the perspective that the extensive integration provided by these frameworks does not overshadow the boundaries of independent test cases.
00:33:13.070
When writing tests, we need to decide whether to include frameworks in test cases. If we can disengage the framework from our tests while still achieving coverage of our key domain code, we can avoid superfluous integration issues by isolating the tests instead.
00:33:50.970
Now let's talk about feedback. Tests inevitably influence morale, so they should aim to provide meaningful error messages. Slow feedback loops are another detrimental factor. Let’s take a number like 480, which is the number of minutes in an eight-hour workday.
00:34:26.250
If our task of changing code takes 15 seconds, running a test takes 5 seconds, and interpreting results takes 10 seconds, we are looking at a rapid feedback loop. However, disruptions can lead to increasing delays, and we may find the actual time spent on feedback becomes longer, reducing the viable number of times we can test throughout the day.
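The arithmetic behind that feedback-loop budget, using the numbers just mentioned:

```ruby
workday_seconds = 480 * 60   # 480 minutes in an eight-hour day

change    = 15  # seconds to make a change
run       = 5   # seconds to run the test
interpret = 10  # seconds to read the result

loop_seconds = change + run + interpret   # => 30
workday_seconds / loop_seconds            # => 960 feedback loops per day

# If interruptions stretch the loop to five minutes, the budget collapses:
workday_seconds / (5 * 60)                # => 96 loops per day
```

Shaving seconds off any of the three steps pays back hundreds of times a day.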
00:35:06.480
Continuing to track feedback across these dimensions helps identify where improvements might be made in terms of increasing response rates to failures, introducing tools or techniques that streamline error interpretations, and countering distractions so we can remain focused.
00:35:48.980
When controlling test data, we find a spectrum ranging from inline setup, where each test manually creates the instances it needs, to self-priming tests. While we might think having a single data strategy is optimal, it often works out better to adapt the approach to the type of test being executed.
00:36:34.870
For example, integration tests may require fixtures while self-priming tests apply in staging environments. A common slowdown factor in testing is the setup stage, which increases as projects get bigger if multiple dependencies are tied together in frameworks without a clear separation.
00:37:09.190
Often, as we create tests, we have the intuition that one new test will add one unit of time to the build. But as the number of tests grows, along with the setup each one drags in, failures compound, and build times skyrocket beyond reasonable expectations.
00:37:50.200
This phenomenon means that the longer it has been since the suite last ran green, the more challenges you face once crucial tests start failing. Each additional test can feel like it adds to the total time exponentially and exacerbates integration problems.
00:38:25.130
To manage testing sprawl effectively, limits should be defined up front: how many layers to test at, and whether the associated testing complexity is truly warranted. It's also key to keep the suite from snowballing into a state of constant failure that demoralizes the team.
00:39:00.750
In closing, it’s worth reflecting on false negatives. When a build fails, people usually jump to the conclusion that the code is broken, but it could just mean an outdated test needs revising. Tracking which failures were true negatives (broken code) and which were false negatives (stale tests) over a few weeks makes the distinction apparent.
00:39:43.370
Conscientiously evaluating build failures can offer clarity on areas where the focus could be better directed, reducing unnecessary redundant coverage or consolidating integration tests while boosting overall confidence in how the test suite operates. The goal isn’t merely to call tests good but to substantiate their value as efficient tools for assessing incremental code changes.
00:40:31.740
This talk may have shed light on various testing frustrations, but remember: no matter how displeasing your tests feel, I probably hate AppleWorks more than you hate your tests. It has been a real challenge. I’m grateful for the opportunity to share this time with you. If your team is hiring, I know everyone is looking for senior developers, and it’s hard to find them. Test Double is an agency focused on helping engineers improve how they write software; we’d love to chat with you after.