Fixing Flaky Tests Like a Detective

by Sonja Peterson

In the talk "Fixing Flaky Tests Like a Detective," presented by Sonja Peterson at RailsConf 2019, the speaker delves into the pervasive issue of flaky tests in software development. Flaky tests are those that pass sometimes and fail at other times without any changes to the underlying code. Sonja emphasizes the importance of not only fixing these by identifying their root causes but also preventing them from being introduced in the first place. She shares a structured approach that parallels detective work to diagnose and resolve flaky tests efficiently.

Key Points Discussed:

Challenges of Flaky Tests: They can significantly impede the development process, leading to wasted time and lost trust in automated tests.
Categories of Flaky Tests: Sonja identifies typical culprits behind flaky tests:
- Async Code: Tests influenced by the order of asynchronous events, particularly in feature tests using Capybara.
- Order Dependency: This arises when tests' outcomes change based on the state influenced by previous tests, emphasizing the need to isolate test states.
- Time Issues: Tests failing due to date and time calculations, potentially fixed by using libraries like Timecop.
- Unordered Collections: Ensure that database queries have predictable outcomes; ordering results can avoid flaky failures.
- Randomness: Reducing reliance on randomness in tests increases reliability.
Information Gathering: Before attempting fixes, developers should gather data on flaky tests such as error messages, timing of failures, and the running order of tests. This can be managed using a bug tracking system.
The Detective Method: Adopt a systematic method similar to investigating a crime: gather evidence, identify suspects, form hypotheses, and test fixes rather than relying on trial and error.
Team Collaboration: Handling flaky tests is a collective responsibility. Designated individuals should monitor and fix these tests while keeping the rest of the team informed.

Conclusions and Takeaways:
- Flaky tests are an inherent part of a developer's journey, but they can also be opportunities for growth and learning. Fixing them leads to improved understanding of both code and testing frameworks.
- It is vital to maintain a healthy test suite with a focus on high reliability and effectiveness, which ultimately leads to better software quality and development speed. The overarching message is that addressing flaky tests is a valuable investment in the stability of software development processes.

00:00:20.689 Hello everyone! I'm Sonja, and I really appreciate you all coming to my talk. Thank you, RailsConf, for having me.

00:00:28.140 Today, I'm going to talk about fixing flaky tests, and how reading a lot of mystery novels helped me learn to do that better.

00:00:35.339 I want to start out by telling you a story about the first flaky test I ever had to deal with.

00:00:41.220 It was back in my first year as a software engineer. I worked really hard building out this very complicated form. It was my first big front-end feature.

00:00:52.980 I wrote a lot of unit and feature tests to make sure that I didn't miss any edge cases. Everything worked pretty well, and we shipped it.

00:00:58.829 However, a few days later, we started to have an issue. A test failed unexpectedly on our master branch. The failing test was one of the feature tests for my form, but nothing related to the form had changed.

00:01:09.330 Strangely, the test went back to passing in the next build. The first time it failed, we all kind of ignored it; after all, tests can fail randomly sometimes.

00:01:15.270 But then it happened again, and again. I decided to spend an afternoon digging into it to fix it, thinking we could all move on thereafter.

00:01:31.289 The problem was that I had never fixed a flaky test before, and I had no idea why a test would pass one time and fail on a different run.

00:01:44.160 So, I did what I often do when trying to debug problems that I don't understand: I started trying things randomly.

00:01:50.640 I made a random change and then ran the test repeatedly to see if it would still fail occasionally. This trial-and-error approach can sometimes work with normal bugs.

00:02:02.940 Sometimes, it even leads you to a solution that helps you better understand the actual problem. But with this flaky test, my attempts proved futile.

00:02:15.930 Running random fixes many times didn’t help me figure out the underlying issue, and a few days later, with that supposed fix, it failed again.

00:02:26.640 I needed another approach, and this is what makes fixing flaky tests so challenging.

00:02:31.989 You really can't just try random fixes repeatedly. That's a slow feedback loop. Eventually, we figured out a fix for that flaky test, but not until several different people had tried random fixes that failed, wasting entire days of work.

00:02:52.390 From this experience, I also learned that just a few flaky tests can really slow down your team. When a test fails without actually indicating a problem with your code, you have to rerun all of your tests before being ready to deploy.

00:03:10.450 This slows down the entire development process and erodes trust in your test suite.

00:03:17.040 Eventually, you might even start ignoring legitimate failures because you assume they are just flaky tests.

00:03:25.420 This makes it essential to learn how to fix flaky tests efficiently and avoid writing them in the first place. For me, the real breakthrough in figuring out how to fix flaky tests came when I developed a method.

00:03:41.030 Rather than trying things randomly, I began by gathering all the information I could about the flaky tests and the times they had failed.

00:03:52.090 Then, I used that information to categorize the problem according to one of the five main categories of flaky tests. I will talk about what those are shortly.

00:04:08.680 Based on that categorization, I formulated a theory on what might be happening. While working on this, I was binging on mystery novels.

00:04:21.070 It struck me that each time I was fixing a flaky test, I felt like a detective solving a mystery.

00:04:26.440 After all, the steps involved, at least in the novels I read, are to gather evidence, identify suspects, develop a theory of motive, and then solve it.

00:04:39.130 Thinking about fixing flaky tests in this way made the experience much more enjoyable, transforming it from a frustrating problem into a fun challenge.

00:04:45.659 This will be the framework I use in this talk for explaining how to fix flaky tests.

00:04:52.390 Let's start with step one: gathering evidence. There are many pieces of information that can help diagnose and fix flaky tests.

00:05:09.599 Some helpful pieces of evidence include error messages for every time the test has failed.

00:05:15.509 You should also document the times of day when those failures occurred, how frequently the test fails, and which tests were run before it when it failed.

00:05:27.180 To efficiently gather all this information, I recommend setting up a process where any failure on your master branch, where you wouldn't expect to see failures, is automatically sent to a bug tracker.

00:05:38.400 This should include all necessary metadata, such as a link to the CI build where it failed.

00:05:43.830 I have had success using Rollbar in the past, but I'm sure other bug trackers could work well too.

00:05:50.669 When doing this, it's important to group failures for the same test together in the bug tracker. This organization enables you to cross-reference different occurrences of the same failure.

00:06:01.889 This can help you uncover patterns that make it easier to understand why they are occurring.
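As a minimal sketch of that cross-referencing, assuming failure records have already been exported from the bug tracker: the `Failure` struct, its field names, and the sample data below are hypothetical and do not reflect Rollbar's or any other tracker's schema.

```ruby
# Hypothetical failure record; field names are illustrative only.
Failure = Struct.new(:test_id, :message, :failed_at, :previous_test, keyword_init: true)

failures = [
  Failure.new(test_id: "form_spec.rb:12", message: "expected to find text \"Saved\"",
              failed_at: Time.utc(2019, 4, 2, 19, 42), previous_test: "user_spec.rb:8"),
  Failure.new(test_id: "form_spec.rb:12", message: "expected to find text \"Saved\"",
              failed_at: Time.utc(2019, 4, 5, 21, 3), previous_test: "user_spec.rb:8"),
  Failure.new(test_id: "task_spec.rb:30", message: "expected: 2019-04-06, got: 2019-04-07",
              failed_at: Time.utc(2019, 4, 5, 20, 15), previous_test: "post_spec.rb:4")
]

# Group occurrences of the same test so they can be compared side by side.
report = failures.group_by(&:test_id).transform_values do |occurrences|
  {
    count: occurrences.size,
    hours: occurrences.map { |f| f.failed_at.hour }.sort,     # time-of-day pattern
    previous_tests: occurrences.map(&:previous_test).tally,   # order-dependency hint
    messages: occurrences.map(&:message).uniq
  }
end

report.each { |test_id, summary| puts "#{test_id}: #{summary}" }
```

A summary like this makes it easier to see, for instance, that one test always fails in the evening or always fails after the same neighbor.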

00:06:10.450 Now that we have our evidence, we can start looking for suspects. With flaky tests, what’s nice is that there is a consistent set of usual suspects to start looking at.

00:06:28.610 Those suspects are asynchronous code, order dependency, time, unordered collections, and randomness.

00:06:38.590 I’m going to walk through an example or two of how to identify whether a test fits into any of these categories and how to fix it based on that.

00:06:53.690 Let's start with asynchronous code, which I have found to be one of the largest categories of flaky tests in Rails applications.

00:07:04.990 When we talk about asynchronous code, we mean tests where some code runs asynchronously, leading to events happening in a non-deterministic order.

00:07:18.110 This often arises in system or feature tests, where most Rails apps employ Capybara.

00:07:25.010 Capybara, used either through the Rails built-in system tests or through feature tests, lets you write end-to-end tests that spin up a Rails server and a browser.

00:07:31.280 The test interacts with the app similar to how an actual user behaves.

00:07:39.100 The reason you're generally dealing with asynchronous code and concurrency in your Capybara tests is that at least three different threads or processes are involved.

00:07:51.730 There's the main thread executing your test code, a second thread where Capybara launches your Rails server, and finally a separate process running the browser, which Capybara controls.

00:08:04.250 To illustrate, imagine you have a Capybara test that clicks a 'submit post' button in a blog post form and then checks to see if that post has been created in the database.

00:08:14.250 The happy path for this test looks straightforward in terms of the order of events.

00:08:24.059 Initially, we instruct Capybara to click the button, triggering an AJAX request to the Rails server, which creates the blog post in the database.

00:08:38.880 Once that request is completed, the UI updates, and the test code checks the database to confirm that the post is there.

00:08:48.940 This order is quite predictable—assuming you’re not optimistically updating the UI before the request completes.

00:09:01.640 An issue can arise, for instance, if the test code checks the database immediately after clicking 'submit post', before the blog post is created.

00:09:13.470 To fix this, we just need to wait for the request to finish before checking the database. We can achieve this by using one of Capybara's waiting matchers, like 'have_content', which keeps retrying until it finds what it is looking for or times out.

00:09:26.250 This adjustment ensures that the code only proceeds to the next line once the post creation has completed.
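The race and its fix can be sketched with plain Ruby threads. This is an illustration of the retry-until-timeout idea only, not Capybara's implementation: `wait_until`, the `posts` array standing in for the database, and the simulated latency are all assumptions made for the example.

```ruby
# Polls the block until it returns a truthy value or the timeout elapses.
# The short sleep is only a poll interval, not a fixed wait for the outcome.
def wait_until(timeout: 2.0, interval: 0.01)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    raise "condition not met within #{timeout}s" if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    sleep interval
  end
end

posts = []          # stands in for the posts table
lock = Mutex.new

# Stands in for the server thread handling the request after the click.
server = Thread.new do
  sleep 0.05        # simulated request latency
  lock.synchronize { posts << "My first post" }
end

# A flaky test would inspect `posts` right here, racing the server thread.

# Reliable version: wait for the observable outcome, then assert on the data.
wait_until { lock.synchronize { posts.include?("My first post") } }
post_count = lock.synchronize { posts.size }
server.join
puts "posts after waiting: #{post_count}"
```

In a real Capybara test the waiting would normally be done on something visible in the page, such as a confirmation message, before querying the database.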

00:09:37.990 This is an example of a relatively simple asynchronous flaky test that you may have dealt with if you have written Capybara tests.

00:09:44.740 The complexity can escalate, so let’s examine another example.

00:09:57.800 This one involves a test that goes to a page listing books, clicks a sort button, and waits for those books to appear in sorted order.

00:10:03.199 It then clicks again to reverse that order, expecting to see them sorted back.

00:10:12.050 While it may seem that this should work smoothly since we are waiting at each step, it can still lead to flakiness.

00:10:20.890 Suppose when we visit the books page, the books happen to already be sorted. The subsequent expectation check may pass immediately, failing to block further action from occurring.

00:10:33.050 Consequently, both clicks can happen before the page has re-sorted, resulting in a test that never reaches the reverse-sorted state.

00:10:45.030 The fix involves making sure our waiting finders are looking for markers indicating requests have finished executing.
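The following toy model shows why an expectation that is already true cannot act as a wait, and why a completion marker can. It is a deterministic simulation under stated assumptions, not Capybara code: `FakeBooksPage`, `click_sort`, `deliver_responses`, and `sorts_completed` are hypothetical names.

```ruby
# Toy model of a page whose sort requests complete asynchronously.
class FakeBooksPage
  attr_reader :titles, :sorts_completed

  def initialize(titles)
    @titles = titles
    @direction = nil       # last sort direction the page has applied
    @pending = []          # requests sent but not yet answered
    @sorts_completed = 0   # marker a test can wait on
  end

  # A click chooses its direction from the page state at click time.
  def click_sort
    @pending << (@direction == :asc ? :desc : :asc)
  end

  # Simulates the responses arriving and the UI re-rendering.
  def deliver_responses
    @pending.each do |direction|
      @titles = direction == :asc ? @titles.sort : @titles.sort.reverse
      @direction = direction
      @sorts_completed += 1
    end
    @pending.clear
  end
end

# Flaky flow: the titles start out sorted, so the "is it sorted?" check
# passes before any response arrives and the second click races ahead.
flaky = FakeBooksPage.new(%w[Emma Ivanhoe Ulysses])
flaky.click_sort
passed_immediately = flaky.titles == flaky.titles.sort
flaky.click_sort
flaky.deliver_responses
flaky_titles = flaky.titles

# Fixed flow: wait on a marker that only changes when a sort has finished.
fixed = FakeBooksPage.new(%w[Emma Ivanhoe Ulysses])
fixed.click_sort
fixed.deliver_responses until fixed.sorts_completed == 1
fixed.click_sort
fixed.deliver_responses until fixed.sorts_completed == 2
fixed_titles = fixed.titles

puts "flaky: #{flaky_titles.inspect}, fixed: #{fixed_titles.inspect}"
```

In a browser test the marker might be an element or attribute that the page updates after each request; what it looks like depends on the application.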

00:10:51.860 If you're evaluating a flaky test to see if it falls in the asynchronous category, first check if it is a system or feature test utilizing Capybara or a different browser interaction tool.

00:11:04.189 You may also find other areas of your codebase where you're dealing with async code, but this is typically where issues will surface.

00:11:16.740 Next, consider whether the code triggers events without explicitly waiting for the outcomes. Even in seemingly benign places, ensure you're emulating a real user experience by waiting between actions.

00:11:27.300 Capybara can also help with debugging: it can save screenshots, which makes it easy to capture the state of the page when a test fails.

00:11:41.440 To prevent asynchronous flakes, remember to avoid using sleep or waiting for arbitrary periods. Waiting for specific events is crucial because arbitrary time delays can lead to inconsistency.

00:11:51.730 Understanding Capybara's API, especially which methods wait and which do not, is vital. Familiarity with Capybara's documentation is essential for proper usage.

00:12:07.899 Lastly, it's important to ensure that every assertion in your test behaves as expected. Misleading assertions can mistakenly indicate success.

00:12:19.840 Now let's move on to our next suspect: order dependency.

00:12:27.220 I define this category as any tests that can pass or fail based on the order in which they run.

00:12:35.949 This is usually the result of shared state leaking between tests. If the state created by one test is present or absent for another, it could lead to flaky results.

00:12:48.229 A few potential areas for shared state can include your database, global or class variables modified in tests, and the browser's state.

00:12:59.310 In the context of Rails applications, database state is often a key issue.

00:13:11.590 When writing tests, each one should begin with a clean database state. This doesn't require an empty database, but anything a test creates or modifies should be restored to its original state afterward.

00:13:24.260 Think of it as 'Leave No Trace' when camping. This approach prevents unintended changes from affecting subsequent tests, avoiding interdependencies that could lead to cascading failures.

00:13:36.640 To manage database cleanup, wrapping tests in transactions and rolling back after execution is typically the quickest method and is the default in Rails.

00:13:51.450 Previously, you couldn't use transactions with Capybara because the test code and the test server ran on separate database connections, so the server could not see data inside the test's uncommitted transaction.

00:14:06.490 Rails 5.1 system tests addressed this limitation by sharing one database connection between the test thread and the server thread.

00:14:18.660 However, using transactions can have subtle variations compared to the regular operation of your app.

00:14:30.650 For example, if you have any after-commit hooks set up on your models, those won't run unless a transaction has successfully committed.

00:14:48.150 If you're not using transactional cleanup, the Database Cleaner gem can either truncate tables or perform delete operations to maintain state.

00:15:05.640 This method is usually slower but provides a more realistic scenario since tests operate without an extra transaction wrapper around them.

00:15:18.460 It's crucial to ensure that your database cleanup occurs after Capybara's cleanup, so that requests still in flight from the previous test can't write to the database once it has been cleaned.

00:15:29.450 If using RSpec, you can insert the Database Cleaner call within an 'after' block to maintain order.

00:15:44.270 Understanding database cleaning is essential. If it is misconfigured, it can lead to flaky tests that turn out to be order-dependent.

00:15:57.640 Here's an example: suppose you're using Database Cleaner with a truncation strategy. To speed up the suite, you might configure it to exclude certain tables from truncation.

00:16:10.210 However, suppose a test updates records in one of those excluded tables. Because that table is never truncated, those updates will not be reverted between tests.

00:16:26.470 This can create confusion and lead to order-dependent failures.

00:16:38.580 It's not just database state that causes order dependency issues. Because tests run in the same browser, browser state can persist from one test to the next, depending on what ran earlier.

00:16:54.210 Capybara does recognize this and usually cleans it up, but depending on your setup, some state can slip through.

00:17:04.889 Another shared state area to look at is global and class variables. Modifying these may cause values to linger between tests.

00:17:13.990 Normally, Ruby will warn you if you reassign a constant, but mutating a hash assigned to a constant won't trigger a warning.
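A small sketch of how a silently mutated hash leaks between tests, and how restoring it in teardown removes the dependency. The `SETTINGS` constant, the lambdas standing in for tests, and the `run_tests` helper are hypothetical and exist only for this illustration.

```ruby
# Hypothetical application setting held in a constant.
SETTINGS = { per_page: 25 }

# Each "test" is a lambda that returns true when it passes.
changes_page_size = lambda do
  SETTINGS[:per_page] = 5      # mutation: no warning, and it outlives the test
  SETTINGS[:per_page] == 5
end
expects_default = -> { SETTINGS[:per_page] == 25 }

# Runs tests in the given order; with isolate: true, state is restored after each.
def run_tests(tests, isolate:)
  tests.map do |name, test|
    snapshot = SETTINGS.dup                 # shallow copy suffices for a flat hash
    passed = test.call
    SETTINGS.replace(snapshot) if isolate   # teardown puts the original values back
    [name, passed]
  end.to_h
end

suite = { changes_page_size: changes_page_size, expects_default: expects_default }

leaky = run_tests(suite, isolate: false)
SETTINGS.replace({ per_page: 25 })          # reset between the two demonstrations
isolated = run_tests(suite, isolate: true)

puts "without teardown: #{leaky}"
puts "with teardown:    #{isolated}"
```

Without the restore step, `expects_default` passes only when it happens to run before `changes_page_size`, which is exactly the order-dependent behavior described above.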

00:17:28.499 When you suspect a test is order-dependent, begin replicating its failure using the same set of tests in the same order.

00:17:38.810 If the test fails every time with the same seed, it's likely order dependent. However, identifying which tests are impacting others may still be challenging.

00:17:46.490 You can cross-reference historical failures to see which tests ran beforehand, or use RSpec's built-in bisect tool to narrow down which tests interact badly.

00:17:58.200 To prevent order dependency, configure your suite to run tests in random order. This is the default for Minitest and configurable in RSpec.
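The reason a recorded seed makes a random order reproducible can be shown with Ruby's own seeded shuffle. This is a sketch of the principle, not how any particular test runner orders its tests; the test names are made up.

```ruby
tests = %w[creates_post sorts_books sets_due_date lists_active_posts]

# A fresh seed is chosen per run; printing it is what makes a failure reproducible.
seed = Random.new_seed
order = tests.shuffle(random: Random.new(seed))
puts "Randomized with seed #{seed}: #{order.join(', ')}"

# Re-running with the recorded seed yields exactly the same order.
replayed = tests.shuffle(random: Random.new(seed))
puts "Replayed order matches: #{replayed == order}"
```

Both RSpec and Minitest print the seed they used and accept a `--seed` option, so a failing order from CI can be replayed locally.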

00:18:09.490 Comprehending your test setup and teardown processes also aids in pinpointing shared state leakage.

00:18:19.790 Next, let's tackle time as a suspect. This category can include tests that give different results based on when they are run.

00:18:30.150 For instance, consider this code running in a before-save hook on a task model that automatically sets due dates.

00:18:43.059 If a due date is not specified, the next day is automatically assigned. You might write a test checking that it lands on the right date.

00:18:55.390 However, this test may fail unexpectedly after 7 p.m. because of how the date is calculated.

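00:19:05.890 One side could be using Date.tomorrow, which is based on your Rails time zone, while the other uses Date.today, which relies on the system time zone.

00:19:15.540 When those zones differ, the two disagree about the current date for part of each day, so the test fails only at certain times.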

00:19:27.440 One straightforward fix is to use Date.current instead, ensuring consistency with the timezone.
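The mismatch can be reproduced with the standard library alone by looking at one instant in two zones. The offsets here are assumptions chosen to match a failure that starts at 7 p.m.: the application is taken to run in UTC and the machine's clock to be five hours behind.

```ruby
require "date"

# Returns [tomorrow as seen in UTC, tomorrow as seen five hours behind UTC].
def tomorrows_at(instant)
  app_today    = instant.to_date                      # date in the app's zone (UTC)
  system_today = instant.getlocal("-05:00").to_date   # date on the machine's clock
  [app_today + 1, system_today + 1]
end

# 15:00 UTC is 10:00 on the machine: both zones agree on the date.
midday = tomorrows_at(Time.utc(2019, 5, 1, 15, 0))

# 00:30 UTC is 19:30 the previous evening on the machine: the dates differ.
evening = tomorrows_at(Time.utc(2019, 5, 2, 0, 30))

puts "midday:  app=#{midday[0]} system=#{midday[1]}"
puts "evening: app=#{evening[0]} system=#{evening[1]}"
```

Computing both values from the same zone-aware source removes the window in which they can disagree.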

00:19:41.490 Alternatively, you can use the Timecop gem to mock Ruby's sense of time by freezing it.

00:19:54.920 Freezing time not only simplifies tests but also makes them clear and easy to comprehend.
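A frozen clock can also be approximated without any gem by passing the current date in as an argument. This swaps in dependency injection for Timecop's time mocking, so it is a related technique rather than the one described above; the `Task` struct and `with_default_due_date` are hypothetical.

```ruby
require "date"

# Hypothetical task model; `today` is injectable so a test can pin it.
Task = Struct.new(:due_date) do
  def self.with_default_due_date(today: Date.today)
    new(today + 1)   # default due date is the day after `today`
  end
end

# The test fixes "today", so the expected value never depends on the wall clock.
frozen_today = Date.new(2019, 5, 1)
task = Task.with_default_due_date(today: frozen_today)
puts "due date: #{task.due_date}"
```

With the date pinned, the expectation can be written as a literal date, which is part of what makes frozen-time tests easy to read.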

00:20:07.520 When identifying flaky tests due to time, check for any date or time references in the code.

00:20:20.150 If you have a record of past failures, observe if they coincide with specific times of day. You may also temporarily harness Timecop to replicate failure scenarios.

00:20:34.880 Wrapping each test in a Timecop travel block that jumps to a random time of day can be immensely helpful, because it surfaces tests that would otherwise fail only after hours.

00:20:45.490 It's essential to log the time of each test run to aid in later debugging by allowing you to rerun tests using the same time.

00:20:59.520 Next up, unordered collections, which refers to tests passing or failing based on the order of elements within a group.

00:21:10.350 For instance, if you have a test that fetches a collection of active posts and asserts that it equals a specific array of posts created earlier, the test can be flaky.

00:21:20.400 Database queries often do not return results in a predetermined order, so a lack of control can lead to failure.

00:21:30.870 The solution is to specify an explicit order for items returned from the database, ensuring your expectations mirror that order.

00:21:43.220 When diagnosing potential unordered collections, look for assertions that depend on the order of arrays or elements. If the order doesn't matter, use an order-independent expectation such as RSpec's match_array.

00:21:52.740 If necessary, add explicit sorting to both expected and actual results.
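The difference between an order-sensitive and an order-insensitive comparison is easy to see with plain arrays. The `Post` struct and the shuffled `returned` array are stand-ins for a model and an unordered query result.

```ruby
Post = Struct.new(:id, :title)

created = [Post.new(1, "First"), Post.new(2, "Second"), Post.new(3, "Third")]

# Stands in for a query with no explicit ordering: same rows, arbitrary order.
returned = [created[2], created[0], created[1]]

# Plain array equality depends on element order, so this comparison is fragile.
order_sensitive = returned == created

# Sorting both sides by a stable key makes the comparison independent of order.
order_insensitive = returned.sort_by(&:id) == created.sort_by(&:id)

puts "order-sensitive: #{order_sensitive}, order-insensitive: #{order_insensitive}"
```

Sorting both sides is the manual equivalent of what an order-independent matcher does for you.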

00:22:01.830 Finally, let's examine randomness as a possible suspect.

00:22:10.850 Every category of flaky test fails seemingly at random, but this category covers tests that explicitly invoke randomness through a random number generator.

00:22:22.520 Take, for example, test data where the start date is generated randomly up to five days in the future and the end date up to ten days in the future. Some combinations put the end date before the start date.

00:22:36.110 If the model validates that the end date follows the start date, the test fails whenever the random values produce invalid data. The fix is to create the same dates every time.

00:22:52.130 While randomness can seem powerful for testing various data ranges, it's more reliable to test specific edge cases and boundaries.

00:23:07.310 When checking a flaky test for randomness, start by looking for any use of a random number generator, which often appears in factories or fixtures.

00:23:21.080 Using the seed option in MiniTest or RSpec allows you to replicate tests consistently and scrutinize their behaviors under controlled conditions.

00:23:34.830 In summary, seek to eliminate randomness from your tests, favoring explicit edge cases instead.
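To see how often random dates can collide, the sketch below enumerates every pair a generator could produce. The ranges (a start date 0 to 5 days out, an end date 0 to 10 days out) and the rule that the end date must not precede the start date are assumptions made for the illustration.

```ruby
require "date"

today = Date.new(2019, 5, 1)
valid = ->(start_date, end_date) { end_date >= start_date }

# Every (start offset, end offset) pair under the assumed ranges.
pairs = (0..5).to_a.product((0..10).to_a)
invalid = pairs.reject { |s, e| valid.call(today + s, today + e) }
puts "#{invalid.size} of #{pairs.size} possible pairs are invalid"

# Deterministic alternative: name the boundaries and test each one explicitly.
boundary_cases = {
  same_day: [today, today],
  end_before_start: [today + 1, today]
}
results = boundary_cases.transform_values { |dates| valid.call(*dates) }
puts results
```

Under these assumptions a noticeable share of runs would build invalid data, while the explicit cases cover the same boundary on every run.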

00:23:45.300 Now that we've seen the usual suspects, we can move on to the next phase: formulating a theory and resolving the flaky test mystery.

00:23:56.280 When addressing a flaky test without an obvious solution, run through the categories previously described and look for potential connections.

00:24:07.580 Even if a test seems fine, if it involves dates, delve deeper. Resist the temptation to leap straight into trial and error fixes.

00:24:20.460 Instead, create a solid theory about what might be happening, even if your understanding isn't complete.

00:24:34.870 Where some trial and error is appropriate is in finding ways to reproduce the failure, which lets you test your theory.

00:24:45.980 As noted earlier, for categories like randomness, dates, and order dependency, you have more control over the factors at play.

00:24:57.410 Flaky tests tend to fail infrequently and pass most of the time. If you can replicate failures consistently, you stand a better chance of isolating the issue.

00:25:08.480 Doing so will provide you with more confidence in your findings than relying solely on haphazard trial and error.

00:25:22.380 If you find yourself stuck despite trying these strategies, consider adding logging so that the next failing run gives you more information.

00:25:35.860 If you suspect a certain variable of causing issues, examine that along with the database's state when the failure occurs.

00:25:44.620 Working collaboratively with another developer is also beneficial since fixing flaky tests demands deep familiarity with your testing tools and code.

00:26:00.290 Everyone has unique knowledge gaps, and collaboration enhances overall understanding while preventing individual frustration.

00:26:16.800 Another common question arises when developers wonder if they should simply delete the test if they can't resolve its flakiness.

00:26:29.740 It's crucial to acknowledge that if you engage in testing, you will eventually deal with flaky tests.

00:26:43.130 Deleting flaky tests will lead to compromised coverage and diminish your learning journey in fixing such issues.

00:26:56.240 However, if a flaky test is addressing a low-risk edge case that already has strong coverage, it may be sensible to remove or replace it.

00:27:12.520 This ties into the balance between realism and maintainability when writing tests. Automated testing brings trade-offs when it substitutes for manual QA.

00:27:25.780 Tests vary in realism, and the most realistic ones are generally the hardest to maintain. With so many moving parts, they are also the most likely to be flaky.

00:27:42.120 It's prudent to balance out the number of complex tests in your suite while preserving coverage for more significant problems.

00:27:58.690 The last point I want to emphasize is teamwork in resolving flaky tests. It shouldn't be viewed as just an individual problem.

00:28:10.520 Flaky tests can hinder everyone’s progress and erode collective trust in your testing suite, making them a priority for the team.

00:28:26.050 It's vital for your team to discuss flaky tests, communicate their importance to newcomers, and work together to ensure progress.

00:28:39.650 Designating a specific team member to manage active flaky tests can be beneficial. This person should investigate solutions and seek help when necessary.

00:28:50.550 Furthermore, maintaining accountability across the team can ensure that one individual doesn't end up as the sole manager of flaky tests.

00:29:03.250 Establishing targets for pass rates in your main branch is also advisable—tracking progress over time helps support reliability and fix issues promptly.

00:29:17.380 To summarize, I hope you walk away from this talk understanding that flaky tests need not be annoying or frustrating.

00:29:27.550 Rather, fixing them can be an opportunity to deepen your understanding of your tools and code, and to enjoy playing the role of a detective.

00:29:40.460 Thank you all for attending. If you have questions, feel free to approach me afterwards.