Rocky Mountain Ruby 2012

To Mock or Not to Mock

by Justin Searls

Summary of 'To Mock or Not to Mock'

In his talk at the Rocky Mountain Ruby 2012 conference, Justin Searls explores the complexities and necessities of testing in software development, particularly in relation to using mocks and test doubles. He reflects on how reality checks are essential in assessing the effectiveness of tests and challenges common philosophies surrounding them.

Key Points Discussed:

  • Testing Realism

    In software engineering, especially in critical systems like aviation, the primary aim is to ensure the functionality of code through comprehensive testing. However, achieving a realistic testing environment can often be prohibitively expensive. Thus, understanding the levels of reality needed in tests is crucial.

  • Types of Tests

    Searls delineates various types of tests: end-to-end tests, full stack integration tests, unit tests, and isolation tests. Each has its own purpose and balance of realism needs. For instance, unit tests should minimize dependencies to better isolate components, while end-to-end tests seek to reflect real-world interactions comprehensively.

  • Mocking and Test Doubles

    He discusses the various forms of test doubles (fakes, stubs, mocks, and spies), clarifying the differences and their applications. Mocks are particularly useful for verifying interaction but can lead to misunderstandings if overused without careful consideration of the underlying code's behavior.

  • Motivations for Testing

    Searls highlights the importance of defining clear objectives for testing—be it acceptance testing for end-user functionality, specification testing for regression stability, or characterization tests for legacy code understanding. Each scenario requires a tailored approach regarding realism.

  • Challenges in Rails Testing

    He reflects on the specific hurdles faced when applying rigorous testing strategies within the Rails framework, discussing how architectural choices can impede effective isolation testing and introduce unnecessary complexity.

Conclusion

Ultimately, Searls emphasizes a thoughtful and nuanced approach to testing. He suggests continuously evaluating how much realism each test requires in order to maximize its efficacy and reliability. The take-home message is to recognize our motivations for using each testing tool and to approach each situation with a critical mindset.

This overall consideration aids in ensuring that testing not only supports software functionality but also aligns with broader design and maintainability goals.

In closing, Searls invites the audience to reflect on how they apply testing strategies within their own projects.

00:00:09.200 I want to start with a story today. This is a story about a friend of mine named John, who used to work as a software engineer at a big aviation contractor.
00:00:14.799 One day, he walked into a meeting and the client came to them and said, 'Here’s a million dollars and here's a deadline that's a year out. Here's a little chip, and your job is to make sure that chip doesn't kill people.'
00:00:21.680 The chip was already done; all they needed to do was test it. What the chip was is a sensor that goes inside of landing gear. It detects extreme temperatures, extreme pressures, and atmospheric conditions that might indicate a problem.
00:00:32.960 If there is a failure, the landing gear could stick, and the plane would crash while failing to recognize that the landing gear was stuck. Clearly, lives are on the line, making it crucial that the tests prove this device works as expected.
00:00:46.960 Normally, our response to testing is to ensure it works—prove that it works. However, in this case, a realistic test was not feasible. What could be a maximally realistic test for this situation? Of course, put the chip in multiple planes, load those planes with people to simulate the right weight, and then keep crashing planes until the chip works.
00:01:03.280 Thus, if our tests, when we use the word 'test', aim to verify our code’s functionality, the term 'fake', or anything less than real, feels like a cheat. It feels like we are diminishing the value of the tests, so we prefer to think that all our tests are real.
00:01:16.880 Web applications are certainly much simpler to test than landing gear on an airplane; they are usually not life-critical. However, we also often partake in various forms of cheating, justifying that they are acceptable ways to cheat without discussing it as if we are faking anything.
00:01:28.320 For example, when we use Selenium, we create a fake user and script that user on what to click on with the keyboard and mouse. It follows our instructions repetitively, without variations or randomness, which does not simulate human behavior.
00:01:36.240 In terms of test data, most test suites clean the database on every run. If you have a long-lived app with years of data migrations, your production data carries the effects of every, let’s say, nuanced column change made over those years. Your tests never see any of that, because they get a clean database every time, so they won’t catch problems that stem from this lack of realism.
00:01:49.439 Moreover, the runtime is usually restarted for each individual test. If you have a subtle memory leak occurring over time, causing your application to lag, these long-running situations might not be visible to your tests. Your tests will pass, so we often cheat.
00:01:59.680 Typically, tests run against a single process or even a single-threaded runner or server. Meanwhile, in production, when scaled up, you have multiple servers running simultaneously, load balancers, and all kinds of potential race conditions in your code that your tests might not expose.
00:02:07.920 So, we are all cheaters. Reality is costly; it would be expensive to hire a multitude of people in focus groups living in a lab all day to ensure our code functions with each commit.
00:02:13.360 If reality is expensive, why not budget it? Why not be more intentional about evaluating how real versus how fake our tests should be? We cannot always have pure reality, so we must determine how much reality each of our tests needs.
00:02:22.400 Instead of asking if a test is too fake, we should ask how much reality it needs. We are flipping the question on its head, aiming to assess the reality worth of the test. Is it worth 8 bits of Mario, 16, or 64, or is that guy a real plumber?
00:02:38.960 Every test has different needs. These reflections come from my observations at conferences, many of which are Ruby conferences, where I hear platitudes like 'mocks are good' or 'mocks are bad', or others saying, 'only mock what you own.'
00:02:57.600 Some assert that you should only mock external systems to speed up the tests, while others proclaim, 'mock everything.' Some individuals mock everything all the time to an uncomfortable degree, while occasional voices of reason suggest avoiding over-mocking without defining what that entails.
00:03:08.960 We have a plethora of platitudes surrounding faking in tests, and I want to step back to initiate a more formal discussion. I crave nuance and rigor in how we talk about test doubles, and thoughtfulness regarding our testing goals.
00:03:21.200 It’s ultimately up to us to determine the value we seek from our tests and to assess how much reality is required for each type of test, along with the associated consequences of that reality.
00:03:35.760 This notion aligns well with the themes of the first two talks today, focusing on long-term sustainability in design and understanding our motivation for our actions.
00:03:50.560 I first want to discuss different types of tests to establish a common language we can use today.
00:03:57.040 Let's start with an end-to-end test. In an end-to-end test, the real application is a black box. Consider a dog app, and a separate test application that runs in its own process and exercises the dog app from the outside, likely through a web interface or command line.
00:04:11.560 Full stack integration tests involve using the same process as our main application, maintaining visibility into the object model. For example, a dog feeder class, responsible for feeding dogs, will access a bone repository to gather new bones as well as a dog repository to save the animal after receiving a bone.
00:04:25.440 An integration test typically exercises all components working together, whereas a unit test, as defined by Michael Feathers, follows four rules. These rules include no network access and no file system access, which means a unit test essentially fakes out the database layer while still using the real repositories.
00:04:43.680 An isolation test features similar elements, but here the test orchestrates its own supporting components. For example, the dog feeder test creates its own fake dog and bone repositories, injecting them into the dog feeder to maintain complete control over that interaction.
00:04:56.760 This set-up allows our test to serve as a thorough specification of what the dog feeder does, including its relationships with others. When following Test-Driven Development, this approach helps clarify what collaborators the object might require, thus forming an isolation test.
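The isolation test described above can be sketched in plain Ruby. This is a minimal illustration, not code from the talk: the `DogFeeder` constructor, the `next_bone` and `save` method names, and the hand-rolled fake repositories are all assumptions made for the example.

```ruby
# Hypothetical DogFeeder, reconstructed from the description in the talk.
class DogFeeder
  def initialize(bone_repository, dog_repository)
    @bone_repository = bone_repository
    @dog_repository = dog_repository
  end

  # Fetches a new bone, gives it to the dog, then saves the dog.
  def feed(dog)
    bone = @bone_repository.next_bone
    dog.bones << bone
    @dog_repository.save(dog)
    dog
  end
end

Dog = Struct.new(:name, :bones)

# Fakes created and owned by the test itself (assumed interfaces).
class FakeBoneRepository
  def next_bone
    :fake_bone
  end
end

class FakeDogRepository
  attr_reader :saved

  def initialize
    @saved = []
  end

  def save(dog)
    @saved << dog
  end
end

# The test injects its own collaborators into the subject.
bones = FakeBoneRepository.new
dogs = FakeDogRepository.new
feeder = DogFeeder.new(bones, dogs)
fido = Dog.new("Fido", [])

feeder.feed(fido)

raise "dog should have received the bone" unless fido.bones == [:fake_bone]
raise "dog should have been saved" unless dogs.saved == [fido]
```

Because the test builds both repositories, it documents exactly which collaborators the dog feeder needs and how it talks to them.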
00:05:11.080 When discussing mocks, we find the term 'mock' is generally used for any fake that represents a real entity in our tests. 'Test double' is the more precise term for that general idea, because 'mock' also describes one particular subtype of fake used during testing.
00:05:25.840 The term 'test double' was popularized by Gerard Meszaros in his book xUnit Test Patterns, emphasizing that we should visualize it as a stunt double: this fake object stands in for a real actor during tests.
00:05:39.440 We can categorize different types of test doubles further. A 'fake' often consists of makeshift objects we craft for testing simplicity, like a fake file data source instead of a slow, volatile network data source.
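A fake in this sense might look like the following sketch. The class names `FakeDogDataSource` and `DogGreeter` and the `fetch` interface are hypothetical, chosen only to illustrate swapping a slow network source for an in-memory one.

```ruby
# Hypothetical fake: serves canned records from memory instead of the network.
class FakeDogDataSource
  def initialize(records)
    @records = records
  end

  # Same interface the real (assumed) network data source would expose.
  def fetch(id)
    @records.fetch(id)
  end
end

# Hypothetical subject that only needs something responding to #fetch.
class DogGreeter
  def initialize(source)
    @source = source
  end

  def greeting(id)
    "Hello, #{@source.fetch(id)[:name]}!"
  end
end

source = FakeDogDataSource.new({ 1 => { name: "Rex" } })
greeter = DogGreeter.new(source)

raise "unexpected greeting" unless greeter.greeting(1) == "Hello, Rex!"
```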
00:05:50.560 A 'stub' is a test double that pre-configures responses to certain messages. For instance, if we have a database class, we could create a stub that returns a specific order whenever it receives a message for 'find' with the argument '1'.
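As a rough sketch of the stub just described, assuming a hypothetical `Receipt` class as the code under test and a hand-rolled `StubDatabase` rather than any particular mocking library:

```ruby
Order = Struct.new(:id, :total)

# Hand-rolled stub: returns a canned response for the messages it was
# configured with, and complains about anything else.
class StubDatabase
  def initialize(canned)
    @canned = canned
  end

  def find(id)
    @canned.fetch(id) { raise ArgumentError, "no stubbed response for find(#{id})" }
  end
end

# Hypothetical subject under test.
class Receipt
  def initialize(database)
    @database = database
  end

  def line_for(order_id)
    order = @database.find(order_id)
    "Order #{order.id} totals #{order.total}"
  end
end

# find(1) is pre-configured to return a specific order.
database = StubDatabase.new({ 1 => Order.new(1, 25) })
receipt = Receipt.new(database)

raise "unexpected line" unless receipt.line_for(1) == "Order 1 totals 25"
```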
00:06:05.120 Mocks are distinct in that they pre-configure both responses and expected interactions. For example, a mock fails the test unless certain methods are called as expected, such as 'database.save!' with a specified dog object.
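A hand-rolled mock along these lines could be sketched as follows. The `expect_save!` and `verify!` methods and the `Kennel` subject are assumptions for illustration; real mocking libraries expose their own, different APIs.

```ruby
Dog = Struct.new(:name)

# Hand-rolled mock: expectations are declared up front, then verified.
class MockDatabase
  def initialize
    @expected = []
    @actual = []
  end

  # Declare that save! should be called with this dog.
  def expect_save!(dog)
    @expected << dog
  end

  def save!(dog)
    @actual << dog
    true
  end

  # Fails unless the recorded calls match the declared expectations.
  def verify!
    return true if @expected == @actual

    raise "expected save! with #{@expected.inspect}, got #{@actual.inspect}"
  end
end

# Hypothetical subject under test.
class Kennel
  def initialize(database)
    @database = database
  end

  def admit(dog)
    @database.save!(dog)
  end
end

fido = Dog.new("Fido")
database = MockDatabase.new
database.expect_save!(fido)

Kennel.new(database).admit(fido)
database.verify!
```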
00:06:23.200 So, a spy is also a special subtype, designed to verify behavior over time without expecting specific invocations. Instead, it records all interactions so that we can later interrogate its actions against test assertions.
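The difference from a mock shows up in where the assertion lives. In this sketch (again with a hypothetical `Kennel` subject and a hand-rolled `SpyDatabase`), nothing is expected up front; the test inspects the recording afterwards.

```ruby
Dog = Struct.new(:name)

# Hand-rolled spy: records every save! it receives, expects nothing up front.
class SpyDatabase
  attr_reader :calls

  def initialize
    @calls = []
  end

  def save!(dog)
    @calls << [:save!, dog]
    true
  end
end

# Hypothetical subject under test.
class Kennel
  def initialize(database)
    @database = database
  end

  def admit(dog)
    @database.save!(dog)
  end
end

fido = Dog.new("Fido")
database = SpyDatabase.new
Kennel.new(database).admit(fido)

# Assertions happen after the fact, by interrogating the recording.
raise "expected one save! call with fido" unless database.calls == [[:save!, fido]]
```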
00:06:34.280 These various types of test doubles give us a shared vocabulary. However, as we delve into their practical application, it’s essential to ask ourselves why we should use them.
00:06:43.360 Let’s analyze our motivations for testing. Acceptance testing serves as an end-to-end test, showcasing that the application functions as promised to a product owner, customer, or even ourselves in the future.
00:06:57.600 In this context, reality holds immense value, as we wish to ensure our acceptance tests mirror real-world conditions. This type of testing necessitates offering complete transparency about functionality to avoid any ambivalence.
00:07:12.159 Specification tests also hold importance; they exercise code while offering regression value. Their primary benefit is providing future developers with a chance to make safe changes—with clear readability and clarity being pivotal.
00:07:27.680 Regression tests come into play when a bug arises, leading us to write a test to reproduce the issue, correct it, and ensure the bug won’t return. Here, some realism is crucial to accurately replicate bugs; but we must realize that often, identifying bugs involves pinpointing a small, narrow issue.
00:07:39.280 In this case, we can afford a relatively minor level of reality—perhaps a 64-bit Mario's worth, just enough to identify core problems without needing a fully rigged application.
00:07:49.440 When we consider motivations based on test-driven development (TDD), it shifts towards a focus on test-driven design. Here, the objective is to discover the nuances of API design and understand the code's architecture.
00:08:04.480 Reality doesn’t play such a critical role in this context; the regression value of isolated tests is typically low, emphasizing more on layout rather than realism behind the tests.
00:08:18.559 Another situation we encounter is when dealing with legacy code. We might lack clarity about how it operates, yet we must alter it. Characterization tests help us make black boxes out of existing code by documenting inputs and outputs.
00:08:30.720 In such cases, it’s crucial to gain some realism about the potential changes without overly complicating how the surrounding application acts. It requires some realism but doesn't necessitate full, detailed coverage.
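A characterization test can be sketched as recording what the legacy code currently does and pinning it. The `legacy_daily_kibble` function below is an invented stand-in for legacy code, and the pinned values are simply whatever it returns today, not a statement of what it should return.

```ruby
# Hypothetical legacy code whose rules nobody remembers.
def legacy_daily_kibble(weight_kg, puppy)
  grams = weight_kg * 30
  grams += 70 if puppy
  grams = 100 if grams < 100
  grams
end

# Outputs observed by running the existing code, pinned as expectations.
PINNED = {
  [1, false] => 100,
  [1, true]  => 100,
  [5, false] => 150,
  [5, true]  => 220
}.freeze

# Any change in behavior for these inputs now fails loudly.
PINNED.each do |args, expected|
  actual = legacy_daily_kibble(*args)
  raise "behavior changed for #{args.inspect}: #{actual}" unless actual == expected
end
```

The test treats the function as a black box: it says nothing about why a 1 kg puppy gets 100 grams, only that a refactoring must not change it silently.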
00:08:47.360 Overall, my point here is that we should consistently ask ourselves what value we anticipate derived from each test. If I am spending 10 hours on a test, I hope to realize significant returns from that investment.
00:09:00.560 Next, we will explore strategies regarding testing, assessing where and how test doubles fit into various strategies. Starting on one end of the realism spectrum, some teams prioritize writing tests simply to demonstrate functionality.
00:09:16.960 Under this philosophy, testing is regarded as primarily functional confirmation instead of design guidance. Thus, such teams invest primarily in end-to-end tests and integration tests to ensure things work, taking minimal risks in real-world testing.
00:09:32.960 The benefits here are high confidence in green builds, complementing that confidence with a lower volume of tests. You spend less effort on testing, but the trade-off is that you miss significant design clues.
00:09:48.320 If your primary goal is to make sure everything works, quick feedback becomes vital. You do not wish to wait an extended time to determine if your changes are valid. However, end-to-end and integration tests are often slow and may constrict your overall productivity.
00:10:07.760 Another strategy involves mocking boundaries; this strategy sees teams willing to fake out remote systems and third-party integrations but abstaining from faking things they completely own. This strategy focuses on unit tests and end-to-end tests, utilizing mocks on external resources.
00:10:18.800 On the plus side, this means each object is exercised both by its own unit tests and by the tests of its collaborators, allowing for extensive coverage and higher regression value when the object is used in different contexts.
00:10:30.479 The downside includes both redundancy and overload. One small change can potentially break thirty other tests, and each failure demands the mental effort of reconciling that test with the changed behavior. Unit tests alone do not guarantee that your entire application is functional, so end-to-end tests are still necessary.
00:10:45.680 A case-by-case approach intuitively allows developers to select the best tools and methods for each test. While this offers freedom, it may lead to incoherence or inconsistent expectations among teams and developers, blurring the lines of original intent.
00:10:58.960 In this chaotic approach, collaboration is essential to process changes effectively; however, it may allow test doubles to be abused. In many Rails projects, I’ve observed a significant number of people using mocks and stubs within model tests.
00:11:14.080 Rails model tests are typically designed to communicate with real databases, and when mocks or stubs enter the mix, it muddles the clarity and the actual intent of the test.
00:11:26.679 In most situations, tests designed to verify key functionality may not effectively drive design decisions or architecture. Consequently, the results become ambiguous, leading to uncertainty over whether a test facilitates the correct approach.
00:11:39.440 As an alternative, we can shift towards the approach of Growing Object-Oriented Software, Guided by Tests, often abbreviated by its advocates as 'GOOS'. This method involves utilizing isolation tests to inform the design of the objects we create.
00:11:52.480 We still retain end-to-end tests to validate the software's capabilities. Although we might encounter skepticism surrounding high-level integration tests lacking detail, our real backing comes from integration smoke tests assessing overarching collaboration.
00:12:07.360 I’ve found this balance yields the best results, maintaining a handful of end-to-end tests while having my isolation tests run rapidly—for instance, a complete suite of thousands running in just seconds.
00:12:20.240 Jasmine encourages isolation testing by design due to difficulties surrounding integration tests. My experience showed me that high numbers of test cases quickly yield positive results, often allowing testers to maintain their focus without excessive time required for results.
00:12:33.599 This leads to the understanding that quick feedback directly enhances productivity. Developer efficiency suffers when feedback takes an excessively long time to arrive.
00:12:46.080 A primary advantage of isolation testing is the feedback on abstractions and understanding the responsibilities of collaborating objects. If tests become too cumbersome, it may signify that our object structure requires refining.
00:12:56.080 As we establish and redefine these collaborations, effective testing helps regulate the workflow, keeping codebases clean and manageable without excessive setup for each component.
00:13:10.080 However, not all tests will fit into purely isolated or entirely integrated categories. Therefore, we may entertain the idea of maintaining numerous specialized test suites focused on various tasks.
00:13:24.120 Those categorized by motivation—characterization tests, design tests, and end-to-end assessments—can serve to balance our priorities effectively while allowing flexibility in addressing diverse testing needs.
00:13:39.200 Historically, I delved into GOOS-style isolation testing within Java projects. Yet, I’ve often questioned why many Rails users avoid test doubles.
00:13:54.080 I believe this stems fundamentally from Rails architecture. It’s not a bad architecture by any means, but it is a factor we must acknowledge and work within correctly.
00:14:09.720 Test doubles excel at writing isolation tests for our subject code, but they require first identifying and substituting all dependencies. For instance, understanding inherited surface area becomes vital—this relates to everything the subject interacts with.
00:14:24.120 While creating a simple application, I found that a class extending ActiveRecord had over 300 public methods. This extensive inherited surface area complicates matters when we are trying to move toward more straightforward, manageable code.
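Inherited surface area can be measured with Ruby's reflection methods. The sketch below uses a hypothetical `HeavyBase` class with 50 generated methods as a stand-in for a framework base class, so the numbers are illustrative only and are not the figure quoted in the talk.

```ruby
# Hypothetical stand-in for a framework base class with a large public API.
class HeavyBase
  50.times { |i| define_method("framework_method_#{i}") { i } }
end

class Dog < HeavyBase
  def bark
    "Woof"
  end
end

# Methods the class defines itself.
own = Dog.public_instance_methods(false)

# Methods inherited from the base class, excluding what every Object has.
inherited = Dog.public_instance_methods - own - Object.public_instance_methods

puts "own: #{own.size}, inherited: #{inherited.size}"
```

In a Rails application, the same two calls could be pointed at a model class to see how much of its public interface was written by the team and how much came from the framework.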
00:14:39.920 In scenarios where classes require extensions, lacking clarity and assumptions complicate isolation significantly within Rails. Moreover, class loading can become a burden due to the performance hit incurred with heavy class reliance.
00:14:54.080 Consequently, we see the emergence of fantasy tests: tests that use mocking without any runtime checks, generating a false sense of validity because the real classes aren’t loaded during the tests. If a method name changes, the tests still pass based on their prior configuration.
00:15:08.080 Given these observations, Rails presents challenges in adopting a strict isolation testing structure. The GOOS methodology offers one potential route, but it intersects awkwardly with common challenges experienced in Rails.
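The failure mode can be reproduced in a few lines. In this sketch, `BoneRepository`, `unchecked_double`, and `checked_double` are hypothetical names: the unchecked double happily stubs a method that no longer exists, while the checked variant compares each stubbed name against the real class, which requires loading that class.

```ruby
class BoneRepository
  # Assume this method was recently renamed from fetch_bone to next_bone.
  def next_bone
    :real_bone
  end
end

# Unchecked double: accepts any method name, real or not.
def unchecked_double(stubs)
  double = Object.new
  stubs.each { |name, value| double.define_singleton_method(name) { value } }
  double
end

# Checked double: refuses to stub methods the real class does not define.
def checked_double(klass, stubs)
  stubs.each_key do |name|
    next if klass.method_defined?(name)

    raise NameError, "#{klass} does not implement #{name}"
  end
  unchecked_double(stubs)
end

# The stale stub still "works", so a test built on it keeps passing.
stale = unchecked_double({ fetch_bone: :fake_bone })
raise "stale stub should still answer" unless stale.fetch_bone == :fake_bone

# The checked version surfaces the rename immediately.
begin
  checked_double(BoneRepository, { fetch_bone: :fake_bone })
rescue NameError => e
  puts e.message
end
```

The trade-off is the one raised in the talk: the check only works if the real class is loaded, which brings back the load-time cost that the isolated test was trying to avoid.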
00:15:24.120 Further considerations must evaluate existing Rails objects. Writing seamless integration tests, while respecting their inherent structures, yields dependencies that require clarification.
00:15:35.360 Ultimately, if we desire to avoid excessive functional coupling with our tests, we might need to reconsider how we leverage Rails altogether as the framework wasn’t necessarily built with isolation testing in mind.
00:15:50.320 For additional insights on this topic, I authored a post surveying various test libraries used in Ruby, from Mocha to FlexMock, addressing motivations behind my project decisions.
00:16:05.920 As a result of my experiences, I developed a test double library called 'Gimme,' which is open for exploration should anyone find value in it. Although I haven’t had the chance to deeply use it, I welcome any feedback.
00:16:11.440 In closing, I appreciate this opportunity to speak today. My company's name is 'Test Double,' which seems fitting for this discussion.
00:16:18.560 You can contact me via my Twitter handle, @searls, and I value any feedback from all of you.
00:16:21.720 Thank you all very much for your attention! I truly appreciate it.