Speeding up Tests With Creativity and Behavioral Science

00:00:14.290 Hi everyone, my name is Moncef Belyamani, and it's great to be back in Paris. I was actually born here, but I only lived here for the first year of my life. Since then, I've been back several times, including five times to DJ. These days, I work remotely from Virginia for a small consulting firm based out of San Francisco called Truss. It's one of the best places I have worked, and one of the things that drew me in was the mission-driven projects that help improve people's lives.

00:00:29.380 For example, my first project was with the Department of Veterans Affairs, or the VA for short. This was a collaboration between the U.S. Digital Service, Nava, TISTA, and Truss. In the U.S., there are hundreds of thousands of veterans who suffer from health conditions and depend on the VA for financial assistance. They can submit claims to receive benefits, but the problem is that it can take a long time to process these claims.

00:00:53.059 So, this group of wonderful folks modernized legacy systems from the 1980s and built open-source Rails apps to make it easier and faster to track and process claims. When I joined the project, I worked on an app called Case Flow, which was almost three years old at the time. I remember being warned on my first day not to try running the test suite locally because of how slow it was. Given the age of the project, this was not very surprising. Raise your hand if you feel like your test suite is slower than it should be.

00:01:25.280 If you look around, this is a common problem, and there are many reasons why a test suite can be slow. However, an interesting one that we don’t often hear about is human nature. To understand why we sometimes behave in ways that don’t benefit us in the long term, let's go back about 200,000 years, to the approximate age of the modern human brain. Back then, we were always focused on the present and on surviving. If we were hungry, we saw an animal, hunted it, and felt satisfied.

00:01:51.470 Scientists call this type of environment an immediate return environment because we can see the results of our actions immediately. Today, however, we live in a delayed return environment where the work we put in can take years to pay off. For example, if you practice deliberately every day, you might progress from having no coding experience to becoming a senior engineer in a few years. However, our brains have not yet adapted to this delayed return environment, which has only existed for about 500 years.

00:02:16.320 So, compared to the age of the brain, this is very little time. Our brains still prefer quick rewards that happen now over potentially larger rewards that happen in the future. This is especially true when we make decisions that have negative consequences that do not appear until far in the future. Our brains are wired to pay attention to immediate threats, but not so much to gradual warning signs.

00:02:36.349 This explains why some people develop bad habits that affect their health, and why many projects end up with technical debt and slow test suites. When we don’t follow best practices, we might enjoy the immediate reward of high test coverage and shipping quickly. However, when our development slows down later on, we feel regret. In contrast, with good habits, the initial work might not be very enjoyable, but the end result feels great.

00:02:55.999 Because of our bias towards the present, we don’t internalize the long-term benefits of speed improvements, whether it’s for a test suite or for how we perform various actions on our computers every day. We may not be sure if it’s worth spending time to optimize a test suite if we only save a few seconds. So, how did I know that it was worth it on the Case Flow project? Well, actually, I didn’t know for sure, but I tested my theory using the Pareto principle, named after Italian economist Vilfredo Pareto, who was also born in Paris.

00:03:26.720 The principle is also known as the 80/20 rule, as Pareto observed that 80% of the land in Italy was owned by 20% of the population. This rule has been applied in various fields. In engineering, for example, you might find that 80% of the bugs come from 20% of the code, or that 80% of the code was written in 20% of the time.

00:03:57.099 Using this rule as a guideline, a good starting point for investigating a slow test suite in a Rails app is the Rails helper file (rails_helper.rb). In our case, it was 268 lines long, mostly consisting of helper methods and methods for seeding the database. To test the theory that something in the Rails helper was slowing down the tests, I ran a set of unit tests that didn’t require loading Rails at all, and they ran in about 17 seconds, which seemed too slow to me.

00:04:21.960 Then I noticed that none of these tests explicitly required Rails, so I thought maybe we were automatically requiring the Rails helper everywhere in our .rspec file, and indeed we were. After this, I removed the flag and re-ran those same tests, and this time they ran in a fraction of a second, almost 450 times faster. While this doesn't necessarily mean that the entire test suite will be 450 times faster, it was a sign that we were on the right track.
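One way to make that check concrete is to list what a .rspec options file auto-requires. The following is a minimal Ruby sketch, not code from the project: the helper name `auto_required_files` and the sample file contents are assumptions made up for illustration.

```ruby
# Hypothetical helper: list the files that a .rspec options file auto-requires.
# RSpec reads default command-line options from .rspec, so a line such as
# "--require rails_helper" loads Rails for every spec, including pure unit tests.
def auto_required_files(rspec_options)
  rspec_options.scan(/(?:^|\s)(?:--require|-r)[ =]+(\S+)/).flatten
end

# Sample contents, assumed for illustration.
with_flag    = "--color\n--require rails_helper\n"
without_flag = "--color\n"

p auto_required_files(with_flag)
p auto_required_files(without_flag)
```

If the first list includes rails_helper, every spec pays the cost of booting Rails and running whatever the helper sets up, whether or not the spec needs it.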

00:04:53.169 Next, I looked through the large Rails helper file for any obvious clues, but nothing jumped out at me. I then thought maybe the issue was in the spec support folder, which we were also automatically requiring. I disabled that, added the Rails helper flag back, and reran the unit tests, and I continued to see the same speed improvements.

00:05:12.540 After more investigation, I narrowed the issue down to the database cleaner file. In the Case Flow app, we used two databases: Postgres for new data and Oracle for legacy data. It turned out that cleaning Oracle was a lot slower than cleaning Postgres, by about five seconds, and I had a feeling there were enough tests not using Oracle that it would be worth selectively cleaning the database only when needed.

00:05:34.020 To achieve this, I split the database cleaner file into two separate files: one for Postgres only, and one for Oracle plus Postgres. I configured it to only clean the database if the spec was tagged with either Postgres or all databases. We then excluded these two new files from being automatically required, which meant we had to manually add the proper require statements in each test, as well as the tags.
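The tag-based cleaning can be sketched roughly as follows. This is not the project's code: the tag names `:postgres` and `:all_dbs`, the helper `databases_to_clean`, and the placeholder `clean_database` are all assumptions chosen for illustration.

```ruby
# Hypothetical tag names; the talk only says that specs were tagged with
# "either Postgres or all databases".
DATABASES_BY_TAG = {
  postgres: [:postgres],
  all_dbs:  [:postgres, :oracle]
}.freeze

# Return the databases to clean for one example's metadata hash.
# Untagged specs get an empty list, so they skip cleaning entirely.
def databases_to_clean(metadata)
  DATABASES_BY_TAG.select { |tag, _| metadata[tag] }.values.flatten.uniq
end

# Placeholder: swap in the project's real cleaning strategy per database.
def clean_database(name)
  raise NotImplementedError, "no cleaner configured for #{name}"
end

# Wire the helper into RSpec only when RSpec is loaded.
if defined?(RSpec)
  RSpec.configure do |config|
    config.after(:each) do |example|
      databases_to_clean(example.metadata).each { |name| clean_database(name) }
    end
  end
end
```

With this shape, a spec declared as `describe Thing, :postgres` would only pay for the Postgres cleanup, and a spec with no tag would pay for none.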

00:06:09.319 As expected, this work was not enjoyable, but I kept the future reward in mind. Once I was done, I tested these changes against a subset of tests: basically all tests except for the feature specs, which are slow by nature. Before the changes, these tests ran in about 18 minutes; after the changes, drumroll please... six and a half minutes!

00:06:28.690 I thought this was a significant enough time difference that it outweighed the cost of doing all that manual work. However, since we're human and might forget to add the proper require statements and tags, especially when this workflow is still new to us, I added a new rule to the Danger gem, which we were already using. This rule checks whether a pull request made any spec changes and, if so, displays a warning in GitHub.
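A rule of that kind might look roughly like the sketch below. The helper `changed_specs`, the path convention (spec/ files ending in _spec.rb), and the reminder wording are assumptions, not the project's actual Dangerfile.

```ruby
# Hypothetical helper: pick the spec files out of the paths a pull request touched.
def changed_specs(paths)
  paths.select { |path| path.start_with?("spec/") && path.end_with?("_spec.rb") }
end

# Reminder text, paraphrased for illustration.
REMINDER = "This PR changes specs. Avoid using a database in tests if possible; " \
           "otherwise tag the spec and require the matching cleaner."

# Inside a Dangerfile, where Danger provides `git` and `warn`, the rule could read:
#   touched = git.added_files + git.modified_files
#   warn(REMINDER) unless changed_specs(touched).empty?

p changed_specs(["app/models/claim.rb", "spec/models/claim_spec.rb"])
```

Keeping the path filter in a plain method makes the rule easy to test outside of Danger.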

00:06:52.270 The warning reminds people to avoid using a database at all in tests if possible, and if not, to tag appropriately. Problem solved, right? Well, not quite. Some tests slipped through the cracks and ended up being merged with database cleaning issues that resulted in flaky tests that were hard to troubleshoot. On top of that, I ended up leaving the project for unrelated reasons, though I kept up with these suspenseful developments on GitHub.

00:07:33.639 These changes were merged in late July. I left in mid-August, and then in late October, they had a meeting to discuss these flaky tests and whether they should revert the changes. In mid-November, they indeed reverted the changes. What was interesting to me but not surprising in hindsight is that they thought the changes did not have much effect in CircleCI. In my mind, it was unquestionable that the tests were faster, but I obviously failed to communicate that.

00:08:06.640 I was focusing on the local time savings, where the tests were not run in parallel. However, in CircleCI, we had parallelism set to 5. Therefore, when they didn’t see a huge time difference, they were disappointed. This is actually a classic example of a cognitive bias called anchoring. This describes how we are influenced by numerical reference points even if they're not relevant.

00:08:41.540 There have been repeated demonstrations of this. For example, at MIT, students were handed a sheet listing various items, like a nice bottle of French wine. They were asked to write the last two digits of their social security number at the top and then again as a dollar amount next to each item. When they then bid on the items, the students whose last two digits fell in the highest range bid over three times as much as those in the lowest range.

00:09:08.240 Similarly, when I kept repeating the time of six and a half minutes, it became the anchor. Essentially, my teammates were asking themselves how long they were willing to wait for CircleCI to finish. Well, certainly not 17 minutes. It’s worth noting that the six and a half minutes was only for a portion of the tests, whereas the 17 minutes displayed by CircleCI was for the entire build process.

00:09:37.218 That total included setting up the environment, downloading custom Docker images, and so on. The RSpec portion was only about half of that total time. It wasn’t until after I left the project that I decided to measure the difference in CircleCI for a blog post. I determined that the average build time saved was about 39 seconds, which added up to three workweeks saved per year.

00:10:09.650 Notice the use of the word 'saved.' It feels natural to discuss time savings, but it turns out it’s the wrong word in this context. To understand why, let's take a look at two key concepts from behavioral science: the framing effect and loss aversion.

00:10:35.370 These concepts were introduced by Amos Tversky and Daniel Kahneman in the late 70s and early 80s. The framing effect suggests that our choices are influenced by how they’re presented. Loss aversion means we experience losses psychologically about twice as strongly as equivalent gains. Marketers utilize this when trying to persuade people by framing their messages as gains.

00:11:01.110 For example, a 5-second ad that can’t be skipped feels like a loss of five seconds, while a 30-second ad that can be skipped after five seconds feels like a gain of 25 seconds. Conversely, if you wish to convince someone to take beneficial action, it’s better to focus on the negative consequences of inaction rather than the benefits of taking action.

00:11:29.770 In my case, I should have framed it as losing three weeks a year instead of focusing on saving time. However, given that we engineers like to rely on data, I wasn’t sure that framing alone would be sufficient to convince my team.

00:11:55.210 To verify whether the changes made a difference in CircleCI, the Case Flow team measured build times before and after reverting the changes. They found that, after the reversion, build times increased by 90 seconds. At that time, the number of daily builds had also increased to 80, so the total loss came to nearly a full quarter of a year.
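The arithmetic behind that figure can be worked through in a few lines. The 90 seconds and 80 builds per day come from the talk; the 260 workdays per year and the 40-hour workweek are assumptions added here to complete the calculation.

```ruby
# Convert a per-build slowdown into workweeks lost per year.
# workdays_per_year and hours_per_workweek are assumed defaults, not from the talk.
def workweeks_lost_per_year(seconds_per_build:, builds_per_day:,
                            workdays_per_year: 260, hours_per_workweek: 40)
  hours_per_day  = seconds_per_build * builds_per_day / 3600.0
  hours_per_year = hours_per_day * workdays_per_year
  hours_per_year / hours_per_workweek
end

# 90 s x 80 builds = 7,200 s = 2 hours per day; 2 h x 260 days = 520 h; 520 / 40 = 13.
puts workweeks_lost_per_year(seconds_per_build: 90, builds_per_day: 80) # → 13.0
```

Under those assumptions the loss is 13 workweeks out of 52, which is in line with the quarter of a year mentioned above.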

00:12:30.700 Yet this doesn't address the issue of flaky tests. Given the cognitive biases that lead us to make poor decisions, how can we reduce the risk of broken or slow tests in general? I like to reference the framework known as the four laws of behavior change, which were discussed by James Clear in his excellent book, "Atomic Habits".

00:12:49.510 These laws are based on the habit loop described by Charles Duhigg in his book, "The Power of Habit." To reduce the risk of flaky tests, we can use the third law: make it easy. We want to reduce the friction that prevents people from taking necessary actions. For instance, the friction we faced was determining which database was being used by a test and adding the right require statements.

00:13:12.440 One way I thought we could address this issue was to check before running each test whether the database in use is empty. If it isn’t, we know that a previous test did not clean it properly. I proposed an RSpec config block that checks the state of the database and indicates which tables are not empty and which tests failed, which could help catch these kinds of errors early.
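A check along those lines might be sketched as follows. This is not the config block proposed on the project: the helper `dirty_tables` and the ignored-table list are assumptions, and the hook only inspects the default ActiveRecord connection, so a second database would need the same treatment.

```ruby
# Rails bookkeeping tables that are expected to contain rows.
IGNORED_TABLES = %w[schema_migrations ar_internal_metadata].freeze

# Hypothetical helper: given row counts per table, return the tables left dirty.
def dirty_tables(row_counts)
  row_counts.reject { |table, count| IGNORED_TABLES.include?(table) || count.zero? }.keys
end

# Wire the check into RSpec only when RSpec and ActiveRecord are both loaded.
if defined?(RSpec) && defined?(ActiveRecord::Base)
  RSpec.configure do |config|
    config.before(:each) do |example|
      conn = ActiveRecord::Base.connection
      counts = conn.tables.to_h do |table|
        sql = "SELECT COUNT(*) FROM #{conn.quote_table_name(table)}"
        [table, conn.select_value(sql).to_i]
      end
      leftovers = dirty_tables(counts)
      unless leftovers.empty?
        raise "Database not clean before '#{example.full_description}': " \
              "#{leftovers.join(', ')}"
      end
    end
  end
end
```

Counting rows before every example adds its own overhead, so a check like this may be better suited to an occasional diagnostic run than to every build.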

00:13:35.360 Another strategy to ensure success involves preventing bad habits, which is essentially making it difficult to engage in them. One approach is to use commitment devices: systems put in place now to prevent undesirable behaviors in the future. A historical example is Victor Hugo, who procrastinated writing "Notre-Dame de Paris." His publisher told him he had five months to finish, or they would charge him 1,000 francs for each week he was late.

00:14:01.700 Hugo was motivated to complete the book early. To stay focused, he bought a nice bottle of ink and a warm shawl, and he locked away his formal clothes so that he wouldn't be tempted to go out. A contemporary example can be seen in GitHub, which has features that prevent merging a pull request unless certain checks pass.

00:14:27.430 To catch any gradual warning signs, you should measure the speed of your test suite. For instance, track tests per minute and make sure this data is visible. This transparency removes uncertainty concerning the state of the test suite and provides immediate feedback, which is often absent in a delayed return environment. You could also set a threshold to ensure test speed doesn’t fall below a certain point.

00:15:01.500 If it drops too low, use a commitment device that prevents merging pull requests until you resolve the issue. Another check you can perform is to look at the ratio of unit tests to acceptance tests, to make sure the suite isn't disproportionately made up of slow feature tests.
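Both measurements are simple enough to compute from numbers a test run already reports. The sketch below is only illustrative: the method names and any threshold values passed in are assumptions, not figures from the talk.

```ruby
# Tests per minute, from an example count and a run duration in seconds.
def tests_per_minute(example_count, duration_seconds)
  example_count / (duration_seconds / 60.0)
end

# True when the suite meets the agreed minimum speed.
def fast_enough?(example_count, duration_seconds, minimum_per_minute)
  tests_per_minute(example_count, duration_seconds) >= minimum_per_minute
end

# Ratio of unit tests to feature (acceptance) tests.
def unit_to_feature_ratio(unit_count, feature_count)
  return Float::INFINITY if feature_count.zero?

  unit_count.to_f / feature_count
end

# A CI step could turn this into a commitment device by exiting non-zero
# when fast_enough? returns false, so that a required check fails.
```

Publishing these two numbers on every build gives the immediate feedback described above, and a failing required check makes the slow path the hard one.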

00:15:28.050 Additionally, a powerful way to influence behavior is to focus on social norms. This falls under the second law: make it attractive. Our inherent desire to belong and gain respect from our peers often drives us to conform to the majority, as we tend to rely on others’ experiences when uncertain.

00:15:52.340 A relevant example is a study conducted in the U.K. that sought to encourage timely tax payments. Various messaging strategies were tested, but the most impactful was a combination of two sentences: one highlighting what the community was doing and another making it personal to the reader.

00:16:18.200 With all the examples we've explored, let's consider how to apply all four laws of behavior change concurrently. For 'make it obvious,' display a progress dashboard while you work, showing metrics you care about. An example of this is how the Prius uses a dashboard to demonstrate how driving behavior affects gas mileage in real time.

00:16:55.150 To further enhance this, implement the second and fourth laws to 'make it attractive' and 'make it satisfying.' When the project is performing well, show messages of praise and encouragement that highlight best practices from successful projects, reinforcing the value of your efforts.

00:17:22.210 When falling behind, use the inverse of the second and fourth laws—'make it unattractive' and 'make it unsatisfying.' Remind yourself that you’re in the minority and that most projects similar to yours maintain functional test suites. Also, emphasize the costs that are accumulating.

00:17:50.570 Lastly, you can use the inverse of the third law, implementing commitment devices that block PR merges until you resolve issues. Through these strategies, we can foster improvement in our work, helping to maintain an efficient and effective approach to testing.

00:18:24.160 Thank you for your attention, and I hope you find value in applying these principles from behavioral science in your work and personal life.