Talks

Story of an Haemorrhage

Paris.rb Conf 2020

00:00:10.890 Hello everyone! My name is Anne Sophie, and I'm a back-end Ruby developer at a crowdfunding platform named Kiss Kiss Bang Bang. Today, I'm going to tell you the story of my experience with flaky tests. So prepare yourself; there may be some blood involved, but don't worry—there will also be plenty of cute kittens, so everything's going to be fine!
00:00:30.830 To begin, let me provide some context. At Kiss Kiss Bank Bank, we write our tests using RSpec and use CircleCI for our continuous integration. We care deeply about code reliability and sustainability, which is why we write a lot of tests. However, writing a lot of tests can lead to a point I like to call 'the oh my god, this is way too long' moment. As you can see here, running our entire test suite could take about 30 to 45 minutes, which was not acceptable—it was driving us crazy.
00:01:17.570 So, what we decided to do was implement parallelism, a fantastic feature offered by CircleCI that allows you to split your test jobs across multiple containers. We assigned multiple containers to our tests, which reduced overall test execution time to around seven minutes instead of 30. We were thrilled! We had made significant strides in speeding up our tests, which was a great starting point.
00:01:36.049 However, shortly after that, I noticed that one of my test jobs had failed. I wasn’t really surprised because failures happen, but when I inspected the job, I was shocked to find that it was only one test that had failed. Even weirder was that this test had nothing to do with the code introduced in the branch being tested. It was also running perfectly fine beforehand! I braved denial and simply relaunched the job, and, of course, the test passed. This is denial, everyone! But denial has its consequences, and soon enough, I found that 70% of my jobs were failing for no apparent reason.
00:02:12.490 So, denial was no longer an option. The situation was clear—we were facing a severe flaky test crisis. When confronted with such circumstances, the first thing you want to do is reproduce the failure locally to make it easier to investigate. You can accomplish this by fetching two things from the failed job: the list of tests executed in the failed container and the seed number. The seed number allows you to run the tests in the exact same order as in the failed container.
00:03:00.160 Armed with that information, I could rerun the tests with the seed number as an argument to reproduce the failures. The first flaky test I wanted to look at was dead simple: we were just creating three kittens of different colors, and then we expected our kitten finder to fetch the correct number of kittens based on color. So, I did a quick sanity check to see if everything was fine before diving deeper into the investigation.
00:03:36.490 To my surprise, I found a kitten present in the test case when I expected none. This was confusing—usually, the more kittens, the happier I am. But not this time. It was clear that the flaky test crisis was being caused by what I refer to as 'bleeding' data. This prompts us to the third part of the journey—understanding how and why our data was leaking. Let's ask ourselves a very simple question: why was our test bleeding in the first place? It's the kind of question you often don't consider until you're already in trouble.
00:04:44.220 The first thing I did was to check the Gemfile and quickly stumbled upon the 'database_cleaner' gem. Considering its name, I felt it might be the solution I was looking for. The database_cleaner gem's mission is to help you clean up your data between each test executed. It offers three different cleaning strategies: one relies on the SQL instruction 'TRUNCATE TABLE', the second uses 'DELETE FROM', and the last one allows you to wrap all the data created during a test inside a transaction that rolls back at the end.
00:05:21.450 This last strategy seemed ideal because it is generally easier and faster to roll back a transaction rather than delete data afterwards. Next, I checked our database_cleaner configuration. The first line set our global cleaning strategy to 'transaction', which complimented what I had learned. The next line performed preventive cleaning before executing anything. And then I noticed an uncommented line referencing 'transactional fixtures', which confused me because I thought we were not using Rails' transactional fixtures.
00:06:01.740 So, I quickly researched this feature and discovered that it’s a native Rails feature. I realized that it was the same as the database_cleaner’s transaction cleaning strategy. If you're wondering whether it makes sense to use both transactional fixtures and database_cleaner at the same time, the answer is absolutely not; it’s redundant. However, this redundancy wasn’t the root cause of our flaky test crisis—it simply meant we were not being very optimal.
00:06:41.330 Additionally, while browsing the database_cleaner documentation, I found that one of the motivations for creating it was to provide a simple way to enable the transactional fixtures. This felt somewhat disheartening, but fortunately, for my ego, it appears that the creator had also been confused by this naming issue and subsequently made a pull request to rename it to 'transactional tests', which was clarified in Rails 5.
00:07:24.180 Now, armed with this knowledge, we understand how and why data is not supposed to bleed. Ironically, I was still puzzled as to why our data was bleeding in the first place, especially since we were cleaning our data twice. Therefore, my next step was to check the tests executed right before my flaky test.
00:08:04.420 It turned out that the failing tests happened to be built differently, using 'let' instead of the more traditional approaches to create test data. The key issue with 'let' is that it gets reevaluated with each test run, which isn't inherently wrong, but in this scenario, we were trying to build data only once to save time. I noticed a pattern: every time a test was failing, it had been preceded by a test that created data in this way, which couldn’t be coincidental.
00:08:50.270 So, I dove into the RSpec documentation to explore more about this kind of hook. I discovered that data created in a 'before(:all)' block wasn’t inherently cleaned up, which was a eureka moment! I finally understood not only how and why the bleeding was occurring but also how the recent transition to parallel testing contributed to the issue. Our tests always ran in alphabetical order previously, and the leaking test was harmless at that time. But with parallelism, the order got mixed up, and at times, the leaking test would run first, causing failures in other tests.
00:09:24.790 Now that I had all the keys to solve this issue, it was time to act. The straightforward solution is to randomize your tests; this can help surface such issues earlier. Going back to our flaky test, especially the one using the 'before(:all)' hook, I could solve this by employing an 'after(:all)' hook to properly delete the kittens and other related data.
00:10:06.679 However, there was a potential problem. If other associated objects like humans or toys were created alongside our kittens, we would need to address each of these individually. A more robust solution might be to apply the database_cleaner with a proper configuration to handle this.
00:10:32.239 Ultimately, I think it’s always better to treat the underlying issue rather than just the symptoms. I advise against using 'before(:all)' or 'before(:context)' to create data unless you're absolutely clear on how it affects your tests. To summarize, you probably don’t need, and I emphasize the word 'probably', to use database_cleaner in your Rails application as transactional tests can handle most of the work.
00:11:05.180 Instead: always randomize the execution of your tests, ensure you’re not using 'before(:context)' to define test data, and make sure to read the documentation carefully. Thank you, everyone, for listening to my talk. I’d like to extend a big thank you to my incredible co-workers at Kiss Kiss Bang Bang, Perry's Abby for the invitation, and a very special thanks to my cats for being so cute.
00:11:47.010 Now, are there any questions?
00:12:00.000 Did you employ blame files to identify culprits? "No, we don't assign blame—we focus on collaboration here. As for some suggestions, one of the things you can do with Minitest, as opposed to RSpec, is using Minitest hooks. You can run your entire specs within a transaction, allowing you to benefit from transactional tests while supporting more complex scenarios like creating models before all. The reason for using 'before(:all)' is primarily for performance, but ensuring that you understand what's happening in your tests is crucial.
00:12:49.710 Thank you for your insights! Is there any impact on performance from using both transactional tests and database_cleaner simultaneously? I haven't benchmarked it closely, but upon removing the database_cleaner, we saw a decrease in execution time from about eight minutes to seven minutes for our complete test suite. So while I can't confirm a specific performance gain, instinctively it felt faster after the change.
00:13:51.650 If there are no further questions, thank you all once again!