Eliminating Inconsistent Test Failures

Capybara

Austin Putman

#test-driven-development

#continuous-integration-ci

#debugging

Eliminating Inconsistent Test Failures

by Austin Putman

The video titled "Eliminating Inconsistent Test Failures," presented by Austin Putman at RailsConf 2014, addresses the common issue of random test failures in software development, particularly those involving Cucumber and Capybara tests. The speaker discusses the severity of these failures and their impact on team productivity and morale.

Key Points Discussed:

Random Failures in Testing: The prevalence of random test failures is highlighted, with audience participation showing that many attendees have experienced similar issues.
Testing Culture: Putman emphasizes the importance of a strong testing culture and the consequences of ignoring test failures, leading to reduced trust in build integrity and delayed feedback on code deployments.
Sources of Randomness: Various sources of random failures are examined, including test pollution caused by shared state between tests, race conditions, and external dependencies like third-party services and time zone issues.
Specific Cases: Putman shares anecdotes from his experience at Omada Health where random test failures were rampant, particularly during a critical development cycle, and the strategies employed to manage this chaos.
Mitigation Strategies: The speaker presents several strategies for mitigating random failures:
- Always assert the existence of necessary elements before interacting with them to avoid race conditions.
- Use immutable fixture data to prevent database state inconsistencies, especially with PostgreSQL.
- Rely on tools like mutexes to avoid database access collisions.
- Document database states at the time of failures to help reproduce and diagnose issues.
- Emphasize the need to run tests in a specific order, especially when test pollution is suspected.
- Use libraries such as Webmock and VCR to handle external API calls during tests reliably.
Conclusion: The talk concludes that by focusing on identifying root causes of random failures and applying the right set of tactics, teams can achieve a remarkably stable testing environment, significantly reducing the occurrence of random test failures over time.

Through this presentation, Putman aims to provide actionable insights into managing test reliability, transforming a problematic test culture into a productive and trustworthy one.

00:00:17.039 This is the last session before happy hour. I appreciate all of you for hanging around this long. Maybe you're here because you don’t know there's a bar on the first floor of this hotel; I think that is where the main track is currently taking place.

00:00:23.600 I am Austin Putman, the VP of Engineering for Omada Health. At Omada, we support people at risk of chronic diseases like diabetes, helping them make crucial behavior changes for a longer and healthier life. It’s pretty awesome! I'm going to start with some spoilers because I want you to have an amazing RailsConf. So, if this is not what you're looking for, don’t be shy about finding that bar.

00:00:41.840 We’re going to spend some quality time discussing Capybara and Cucumber, whose flakiness is legendary for very good reasons. Let me take your temperature: can I see hands—how many folks have had problems with random failures in Cucumber or Capybara? Yeah, this is reality, folks. We're also going to cover the ways that RSpec does and does not help us track down test pollution.

00:01:09.520 How many folks out there have had a random failure problem in the RSpec suite, like in your models or your controller tests? Okay, still a lot of people, right? It happens, but we don't talk about it. In between, we’re going to review some problems that can dog any test suite. This includes random data, time zones, and external dependencies—all of this leads to pain.

00:01:43.119 There was a great talk before about external dependencies. Here’s just a random question: how many people here have had a test fail due to a daylight savings time issue? Yeah—Ben Franklin, you are a menace! Let’s talk about eliminating inconsistent failures in your tests. On our team, we call that fighting randos. I'm here to talk about this because I was short-sighted, and random failures caused us a lot of pain.

00:02:09.360 I chose to try to hit deadlines instead of focusing on build quality, and our team paid a terrible price. Is anybody out there paying that price? Does anyone else feel me on this? Yeah, it sucks! So let’s do some science. Some problems seem to have more random failure occurrences than others. I want to gather some data, so first, if you write tests on a regular basis, raise your hand. Wow, I love RailsConf! Keep your hand up if you believe you have experienced a random test failure.

00:02:50.159 Now, if you think you're likely to have one in the next four weeks, keep your hand up. Who's out there—it's still happening, right? You're in the middle of it. Okay, so this is not hypothetical for this audience; this is a widespread problem. But I don't see a lot of people talking about it. The truth is, while being a great tool, a comprehensive integration suite is like a breeding ground for baffling Heisenbugs.

00:03:14.560 To understand how test failures become a chronic productivity blocker, I want to talk a little bit about testing culture. Why is this even a problem? We have an automated CI machine that runs our full test suite every time a commit is pushed, and every time the build passes, we push the new code to a staging environment for acceptance. How many people out there have a setup that's kind of like that? Okay, awesome! So a lot of people know what I’m talking about.

00:03:38.239 In the fall of 2012, we started seeing occasional unreproducible failures of the test suite in Jenkins while we were pushing to get features out the door for January 1st. We found that we could just rerun the build and the failure would go away. We got pretty good at spotting the two or three tests where this happened, so we would check the output of a failed build, and if it was one of the suspect tests, we would just run the build again—not a problem. Staging would deploy, and we would continue our march toward the launch.

00:04:02.959 However, by the time spring rolled around, there were around seven or eight places causing problems regularly. We would try to fix them; we wouldn’t ignore them, but the failures were unreliable, so it was hard to say if we had actually fixed anything. Eventually, we just added a gem called Cucumber Rerun. This gem re-runs the failed specs if there’s a problem, and when they pass the second time, it’s good—you’re fine, no big deal.

00:04:43.840 Then some people on our team got ambitious and thought we could make CI faster with the parallel test gem, which is awesome. But Cucumber Rerun and parallel tests are not compatible. So, we had a test suite that ran three times faster but failed twice as often. As we came into the fall, we had our first bad Jenkins week. On a fateful Tuesday at 4 PM, the build just stopped passing, and there were anywhere from 30 to 70 failures. Some of them were our usual suspects, but dozens of them were previously good tests that we trusted.

00:05:21.600 None of them failed in isolation, right? After like two days of working on this, we eventually got to clean our RSpec build, but Cucumber would still fail, and the failures could not be reproduced on a dev machine or even on the same CI machine outside of the whole build running. Then over the weekend, somebody pushed a commit, and we got a green build; there was nothing special about this commit—it was like a comment change. We had tried a million things, and no single change obviously led to the passing build.

00:06:07.360 The next week, we were back to a 15% failure rate—pretty good. So we could push stories to staging again. We were still under the deadline pressure, so we shrugged it off and moved on. Then maybe somebody wants to guess what happened next? Yeah, it happened again—right? A whole week of no tests passing. The build never passes, so we turn off parallel tests because we can’t even get a coherent log of which tests are causing errors.

00:06:39.480 We then started commenting out the really problematic tests. Yet there were still these seemingly innocuous specs that failed regularly but not consistently. These are tests that have enough business value that we were very reluctant to just delete them. So, we reinstated Cucumber Rerun and its buddy RSpec Rerun. This mostly worked; we were making progress.

00:07:06.319 However, the build issues continued to show up in the negative column in our retrospectives, which was due to several problems with this situation: reduced trust. When build failures happen four or five times a day, those aren’t a red flag; those are just how things go, and everyone on the team knows that the most likely explanation is a random failure. The default response to a build failure became 'run it again.'