Speed up your test suite by throwing computers at it

Daniel Magliola

@dmagliola

In the video titled "Speed up your test suite by throwing computers at it," speaker Daniel Magliola discusses strategies to improve continuous integration (CI) times, emphasizing the need to reduce the amount of waiting time associated with CI processes. Acknowledging the common frustration developers face when CI runs become lengthy, the speaker presents a variety of techniques aimed at optimizing CI without the need to rewrite tests. Key points include:

Rationale for Speeding Up CI: Daniel highlights the time wasted by developers while waiting for CI results and advocates for a systematic approach to leverage parallel processing by utilizing multiple machines.
Parallel Test Execution: The approach focuses on running tests in parallel rather than optimizing individual test performance. This can be achieved by splitting the test suite into manageable chunks, both manually and automatically, to facilitate running them on various servers simultaneously.
Managing CI Resources: By strategically managing CI resources, such as selecting appropriate instance sizes for different test types and adding a catch-all job, teams can keep costs down and avoid silently skipping test files that no job targets.
Containerization Insights: The speaker explains how to optimize Docker container initialization times, which often constitute the majority of CI time. Suggestions include using popular and cached images and minimizing unnecessary layers in Docker images.
Reducing Setup Times: Magliola delves into specifics such as implementing gem caching, shallow clones for git repositories, and efficient Docker strategies to mitigate latency and enhance speeds.
Handling Flaky Tests: Addressing the issue of flaky tests, which exacerbate CI delays, Daniel proposes automating the detection of these tests to streamline reporting and tracking for resolution.
Critical Path Optimization: Finally, he emphasizes the importance of identifying critical paths in the CI workflow, advising that efforts should concentrate on optimizing these paths rather than peripheral processes to ensure a faster overall CI time.

In conclusion, the video encourages teams to adopt these approaches to make better use of CI resources, minimizing wasteful waiting and improving productivity in software development.

00:00:09.900 It was the best of times, it was the worst of times; it was the age of waiting for CI, which takes forever. It's one of my least favorite things to do, partly because I'm impatient and bad at multitasking. However, if you think about how much time you spend waiting on CI every day, and multiply that by the number of people on your team, that's a lot of time we could be using better.

00:00:17.520 It's significant enough that it makes sense to put in some real work to make it run faster. Today, I'd like to tell you about a few techniques that I've used in the past which yielded great results. First, though, hi, my name is Daniel.

00:00:31.019 I learned all of this while working for GoCardless, a payments company based in London. Now, as you probably noticed, I'm not originally from London—I come from Argentina. So, in case you were wondering, that's the accent you'll be hearing. I want to help you make your CI finish faster, and as the title of the talk suggests, I'm proposing that you do this by throwing lots of computers at the problem.

00:01:05.580 I won't be discussing how to make individual tests run faster; there are plenty of resources out there for that, like using fixtures instead of factories or mocking stuff. These techniques are shared by others who can explain them much better than I can. However, the downside to these techniques is that they often involve rewriting your tests and that takes time, which is typically linear with the number of tests you have.

00:01:24.240 And your test suite is likely huge, or you wouldn't be watching my talk right now. Therefore, what I want to discuss is how to reduce the total runtime of your CI test suite, focusing on getting the most impact for the time you invest. We can do that not by making your tests faster, but by running them on lots of computers at the same time.

00:01:42.060 My focus here is on making systemic changes that allow your tests to run as they always have, while enabling them to run massively in parallel in CI to finish quickly. Now, of course, what I am advocating for here is throwing money at the problem in exchange for saving developers' time. While this may not always be the appropriate approach, for many companies with engineers on typical salaries waiting on CI and getting distracted, it makes sense to spend money to keep things efficient.

00:02:00.960 To be honest, it's not even that much. For example, we're using many machines and spending less than a hundred dollars per developer per month, which is reasonable. Before we start, I want to make a couple of notes. First, you'll notice a lot of CircleCI references in this talk, especially in my screenshots. I'll also mention a tool or two they provide, but this is not an endorsement.

00:02:20.280 The reason I'm using CircleCI as an example is simply that I have the most experience with them, which makes it easier for me to show complex CI setups. However, this talk is not CircleCI-specific; everything I discuss will be applicable to most CI providers. The same goes for when I mention RSpec: I'm using RSpec as an example, but most of my points will apply to Minitest or any other testing framework in Ruby or other languages.

00:02:42.640 I've even applied the same advice for a PHP project, despite the projects' specifics being quite different; the underlying thinking applies across programming languages.

00:03:10.080 I’ll show you various code examples to explain these techniques, though I will be oversimplifying to clarify concepts quickly. There’s a supplementary GitHub repo linked, where you'll find fuller code examples and more documentation on how they work. You can use these as a starting point, but most will require adaptation to meet your specific needs.

00:03:36.360 Importantly, this isn't a 'do these three things and you will achieve this exact result' roadmap type of talk. Your mileage will vary, as your CI setup will differ from mine and what's slow for you may be different from others. I'm sharing a variety of techniques, some of which may be highly relevant to you while others might not be as impactful.

00:03:54.240 This talk mainly serves as a framework for considering the problem of CI times, along with tools and techniques to help you improve those times. However, you'll need to experiment, determine what works for you, and tailor these techniques to your situations.

00:04:14.640 As I mentioned, we've improved runtime for both Ruby and PHP projects significantly using these ideas, but what worked for one may not yield the same results for the other. Adapting to your needs takes some work, but it’s worthwhile as you’ll no longer wait on CI for long periods. So, measure, experiment, and see what works for you.

00:04:36.059 With that said, let's dive in. We want our CI suite to finish faster, and we can achieve that by running tests in parallel. The first question is, how do we even do this? There are two main ways: you can manually split your test suite into logical chunks and run each chunk in parallel. The other option is to automatically split the tests across multiple machines that also run in parallel. This isn't an either-or choice; you’ll likely want to implement both.

00:05:22.259 Starting with the simplest method, which you're likely already doing: if you can separate your test suite into different pieces that make sense, most likely different subdirectories within your specs directory, you can create a separate CI job for each piece that runs RSpec on just that directory.

00:05:57.840 Rails does this beautifully; their test suite has separate jobs for Active Model, Active Record, Action Cable, etc. Each of these is a coherent unit, making the CI setup easy to grasp, and they all run in parallel. If your app is modular, this offers a significant advantage.

00:06:10.440 In most Rails apps, you'll have integration, models, controllers, etc. While they might not be as granular, it's worth separating them, as one may take much longer to run than the others. There are two benefits to remember: separating tests gives you greater control over how jobs execute, and some CI providers let you choose between instance sizes.

00:06:27.780 Particular types of tests may require larger instances to fit all their dependencies, so by isolating those tests into their own jobs, you pay for larger machines only where necessary, without inflating the cost of other tests. For instance, in our setup, we separate out certain tests that would usually reside under integration, because their heavier dependencies need larger machines.

00:06:52.140 The second benefit of separating job executions is controlling inter-job dependencies. For example, JS tests may require Node modules installed, whereas unit tests probably don’t need to wait for that to finish. By separating JS and unit tests, you can increase overall efficiency.

00:07:17.700 Also, having a catch-all job can be beneficial. When using a granular splitting approach, it’s easy to forget to add a new job for a test subdirectory you create later. If you forget, those tests won’t run in CI and you might not notice; this can be dangerous.

00:07:57.300 A final catch-all job guards against this: instead of targeting specific subdirectories, it locates all tests and filters out the ones already run by other jobs. You can accomplish this easily with the find command.

00:08:20.940 While the structure of the catch-all job may seem unappealing, its safety justifies its awkwardness. The caveat, however, is the need to remember to add exceptions when you create jobs targeting other subdirectories.

00:08:45.180 This approach notably reduces the penalty for forgetting something since the worst-case scenario will involve running certain tests twice instead of skipping entire chunks. That's how to split your tests manually, and many of you might already be doing this, but it's still worth emphasizing.
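The talk does this with the find command; the same idea can be sketched in Ruby. This is a minimal illustration, assuming a hypothetical layout in which spec/models, spec/controllers and spec/integration already have dedicated jobs. The directory names and the `catch_all_files` helper are not from the talk.

```ruby
# Hypothetical layout: these directories already have their own CI jobs,
# so the catch-all job must skip them.
COVERED_DIRS = %w[spec/models spec/controllers spec/integration].freeze

# Return every spec file that no dedicated job already runs.
def catch_all_files(all_files, covered_dirs)
  all_files.reject do |file|
    # The trailing slash stops "spec/models" from matching "spec/models_extra".
    covered_dirs.any? { |dir| file.start_with?("#{dir}/") }
  end.sort
end
```

In a CI job this might be fed with `Dir.glob("spec/**/*_spec.rb")` and the result passed to the test runner. A new, forgotten directory then lands in the catch-all job automatically instead of being skipped.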

00:09:11.940 The other method of parallelizing, which we will focus on the most today, is leveraging multiple machines to run tests by splitting the files between them. Here’s how you do it: rather than giving your test runner a single directory to run, provide a complete list of files in that directory.

00:09:36.600 This allows you to split the list into chunks and distribute them across many machines. Each machine will select its assigned chunk of files to run, allowing for all tests to be executed in parallel. This way, no two machines run the same file, and all files are accounted for.
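One simple way to do this split is for each machine to take every nth file starting at its own offset. The sketch below assumes each machine knows a zero-based index and the total machine count; the function name `files_for_machine` is illustrative, not from the talk.

```ruby
# Pick the files for one machine: every `total`-th file, starting at `index`.
# `index` is zero-based; `total` is the number of machines.
def files_for_machine(files, index, total)
  raise ArgumentError, "total must be positive" unless total.positive?
  raise ArgumentError, "index out of range" unless (0...total).cover?(index)

  # Sort first so every machine sees the list in the same order;
  # otherwise two machines could pick the same file.
  files.sort.each_slice(total).map { |slice| slice[index] }.compact
end
```

On a real CI provider the index and total would typically come from environment variables, for example `files_for_machine(Dir.glob("spec/**/*_spec.rb"), Integer(ENV.fetch("BOX_INDEX")), Integer(ENV.fetch("BOX_TOTAL")))`, where `BOX_INDEX` and `BOX_TOTAL` are hypothetical names you would replace with whatever your provider sets.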

00:10:03.060 Some of you might already know how to do this, or perhaps you're already implementing it. If you're reaching for your phone to check Twitter right now—stay with me for a quick moment. Your mileage will vary, and not every section will be relevant to everyone, so here’s a useful feature in this talk.

00:10:20.279 As a part of this new interactive RailsConf format, if a section doesn’t apply to you, just keep track of the displayed timestamp and skip ahead until the icon is gone. Now, to split the test files efficiently so that every machine knows which file to run, each machine needs to know its respective number and the total number of machines involved.

00:10:44.700 By doing this, every machine fetches every nth file with an initial offset. This process is critical to achieving parallelism without overlap. Of all the aspects I'm discussing today, this is where different CI providers most distinctly differ. For example, CircleCI has a CLI tool that aids you in locating your test files and managing their distribution.

00:11:17.640 You specify a parallelism value for how many boxes you want to run; this command will generally work seamlessly. With some additional effort, you can also implement a smart allocation based on historical execution times for each test file.
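To illustrate what allocation based on historical execution times can look like, here is a minimal sketch using a greedy "longest file first" strategy. This is an assumed technique chosen for illustration, not a description of how CircleCI's tool works internally, and the `timings` hash (file path to seconds) is hypothetical input you would collect from earlier runs.

```ruby
# Assign files to `total` machines so the summed historical runtimes are
# roughly even. `timings` maps file path => seconds from a previous run.
def balance_by_timing(timings, total)
  raise ArgumentError, "total must be positive" unless total.positive?

  buckets = Array.new(total) { { files: [], seconds: 0.0 } }

  # Longest files first; ties broken by name so the result is deterministic.
  timings.sort_by { |file, seconds| [-seconds, file] }.each do |file, seconds|
    # Give the next file to whichever machine currently has the least work.
    lightest = buckets.min_by { |bucket| bucket[:seconds] }
    lightest[:files] << file
    lightest[:seconds] += seconds
  end

  buckets
end
```

Compared with taking every nth file, this keeps one very slow file from being grouped with other slow files on the same machine, at the cost of needing timing data from a previous run.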

00:11:40.380 CodeShip offers a less ideal approach: you define a number of steps to run in parallel and embed the environment variables in each one, which requires more effort upfront. Buddy provides a feature similar to CircleCI's, but it splits the files ahead of time by creating variables like BUDDY_SPLIT_1, BUDDY_SPLIT_2, etc., one for each machine. You can use those directly for your tests.

00:12:06.180 Sadly, most providers do not have a built-in solution for this, but CI platforms generally have a build matrix feature that lets you run the same job repeatedly with slightly varying parameters. It is intended for running tests against different versions of dependencies to ensure compatibility.

00:12:40.640 The build matrix is designed for parameters like the OS or Ruby version, but you can also define custom keys of your own when setting it up. That makes it possible to use the matrix to split and distribute jobs.

00:13:09.380 Adding a box index to the job matrix lets you set environment variables so that each machine can identify which files to process without overlap.

00:13:36.600 Each machine then uses its index and the total number of machines to pick its own share of the file list at runtime. This is efficient and ensures that every test file still runs across the parallel machines.

00:13:57.180 Now that we’ve explored the basic methods of parallel execution, let’s shift focus to optimizing CI performance. It’s important to remember that while spinning up a huge number of test instances might seem ideal, the setup time of each instance must also be considered.

00:14:17.760 Effectively, running too many instances leads to setup time eclipsing execution time, which ultimately results in diminishing returns. Most CI providers implement a setup phase that consumes precious time, which can impede efficiency.

00:14:45.180 Consequently, if each CI instance spends most of its time waiting to start the testing phase, adding machines no longer speeds up CI, because completion times are dominated by that initial setup.
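The diminishing returns can be seen with a rough back-of-the-envelope model. The sketch below assumes every machine pays the same fixed setup cost and the test work divides evenly; the 120-second setup and 1800 seconds of tests are made-up numbers for illustration only.

```ruby
# Rough model: every machine pays the same fixed setup cost, and the test
# work is divided evenly between machines.
def wall_time_seconds(setup_seconds, test_seconds, machines)
  setup_seconds + test_seconds / machines.to_f
end

# Illustrative numbers only: 120 s of setup, 1800 s of total test time.
[1, 5, 10, 20, 40].each do |n|
  puts format("%2d machines: %6.1f s", n, wall_time_seconds(120, 1800, n))
end
```

Under these assumptions, going from 1 to 10 machines cuts the wall time from 1920 to 300 seconds, while doubling from 20 to 40 machines only saves 45 seconds. Past that point, reducing the 120 seconds of setup is the only change that still helps.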

00:15:16.780 I said at the start that I wouldn't go into optimizing individual test execution, but it's important to recognize that every other step in a CI job also adds to the total runtime.

00:15:57.020 Everything that happens before your tests start is a necessary evil compared to the test execution itself, so it is worth analyzing each of those steps. Parallelizing doesn't help much if every machine still spends most of its time waiting on dependencies.

00:16:30.120 To begin addressing this, review your CI jobs carefully, observing how long each step takes on average. By identifying the slow steps, you can address the bottlenecks that actually delay your tests.

00:16:57.820 You might find that dependency installation or even code checkout takes long enough to warrant improvement, which gives you room to optimize outside the tests themselves.

00:17:23.900 Now, as you conduct experiments, keep track of how long each step in your setup takes over time, so you can consistently identify the points of inefficiency on your CI service.

00:17:53.460 For example, if your gem installation takes far too long, aim to minimize that time by caching effectively, so that gems installed in one run are reused in the next.

00:18:20.500 Review how you run commands such as Bundler and other dependency installers, so that the files they produce end up in the cache.
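A common way to make a gem cache reliable is to key it on the contents of the lockfile, so the cache is reused until dependencies actually change. The talk does not spell out a key scheme, so the sketch below is an assumption, and the `gems-v1` prefix and `gem_cache_key` helper are hypothetical names.

```ruby
require "digest"

# Build a cache key that changes only when the lockfile contents (or the
# Ruby version) change. Bump `prefix` to invalidate old caches by hand.
def gem_cache_key(lockfile_contents, ruby_version: RUBY_VERSION, prefix: "gems-v1")
  checksum = Digest::SHA256.hexdigest(lockfile_contents)[0, 16]
  "#{prefix}-ruby#{ruby_version}-#{checksum}"
end
```

In practice this would be called as `gem_cache_key(File.read("Gemfile.lock"))`. Many CI providers offer a built-in checksum template for cache keys that serves the same purpose, so check your provider's caching documentation before rolling your own.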

00:18:47.120 This may also apply to checking out your code. As your repository accumulates history, the growing number of commits can slow down checkout times.

00:19:01.560 In those instances, consider performing shallow clones of your repository, which ensures that your local checkout includes only the relevant recent history and any new changes.
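A shallow clone can be sketched as a small helper that builds the git command line. It uses git's `--depth`, `--branch` and `--single-branch` options; the helper name, URL and paths are placeholders, and your CI provider's built-in checkout step may already offer an equivalent setting.

```ruby
# Build the argument list for a shallow clone of a single branch.
# `depth` is how many commits of history to fetch.
def shallow_clone_command(repo_url, branch, dest, depth: 1)
  raise ArgumentError, "depth must be positive" unless depth.positive?

  ["git", "clone", "--depth", depth.to_s,
   "--branch", branch, "--single-branch", repo_url, dest]
end
```

It could be run with `system(*shallow_clone_command("https://example.com/org/repo.git", "main", "repo"))`. Passing the arguments as a list avoids shell quoting problems. Note that tools relying on full history, such as those that diff against an older commit, may need a larger depth.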

00:19:29.200 As you work to reduce the time it takes to containerize your CI work, keep in mind the significance of utilizing Docker effectively, which is helpful to mitigate delays during startup.

00:19:54.160 Many CI providers run your jobs inside Docker containers, and initializing that environment often leads to lengthy container spin-up times if you’re unaware of what happens underneath, such as fetching each image required during startup.

00:20:24.640 Starting from the default images, consider how long each of those Docker images takes to download and start. Regularly removing files and layers you no longer need can reduce that startup latency.

00:21:05.780 If your environment setup is increasingly time-consuming, consider reviewing the caching strategies employed by your CI provider, and using custom images that are hosted close to where your CI runs.

00:21:26.560 Getting more speed out of Docker ultimately comes down to balancing what you put in your images against the overhead of starting them; fine-tuning what gets cached can save substantial overall runtime.

00:21:58.920 Furthermore, for teams conscious of time and costs, building custom images tailored to your project can help your CI processes and reduce execution time.

00:22:34.000 Docker caches image layers, so layers that have not changed can be reused instead of being fetched again. That being said, review your CI's performance on a regular basis to make sure these optimizations keep paying off.

00:23:03.480 Be sure to build observability into your CI processes; it is essential for achieving CI bliss, as it lets your team keep track of the factors that affect runtime.

00:23:41.320 The harsh reality is that flaky tests disrupt CI performance tremendously. Waiting an eternity on reruns makes an already challenging debugging process even worse.

00:24:11.780 To begin solving flaky tests, put a clear process in place for handling them; for example, tooling that identifies failures as flaky and organizes them makes consistent resolution much easier.

00:24:49.600 When you run your tests, consider automating tickets for any flaky tests that arise. Automatically detecting inconsistencies can help ensure resolution, as this formalizes accountability.

00:25:16.160 Solutions like Jira allow ticket creation through CLI commands that can be integrated into your CI process, formalizing tracking and making it easier to assign each flaky test to a responsible person or point of contact.

00:25:43.400 Setting up notifications for flaky tests helps your team resolve failures more consistently, which in turn keeps CI fast.
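One way to detect flaky tests automatically is to look for a test that both passed and failed against the same commit. The talk does not specify its detection rule, so this definition is an assumption, and the `TestRun` record and its fields are hypothetical.

```ruby
# One recorded test result. `commit` is the SHA the run was made against.
TestRun = Struct.new(:test_id, :commit, :passed)

# Treat a test as flaky when the same test on the same commit has both
# passed and failed: the code did not change, but the outcome did.
def flaky_tests(runs)
  runs
    .group_by { |run| [run.test_id, run.commit] }
    .select { |_key, group| group.map(&:passed).uniq.size > 1 }
    .keys
    .map(&:first)
    .uniq
    .sort
end
```

The returned test identifiers could then be passed to whatever ticketing or notification step your pipeline uses. A test that fails on one commit and passes on another is deliberately not flagged, since that may be a genuine fix or regression.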

00:26:03.700 Beyond tracking flaky tests, inspect how evenly work is distributed between machines. An uneven split often comes from a few disproportionately long test files, which leave some machines running far longer than others.

00:26:23.920 Establishing mechanisms for balanced distribution, such as profiling test execution times to identify problematic areas, helps with optimization.

00:26:56.920 This can involve adjusting the way CI parallelizes jobs, so that each machine receives a comparable amount of work despite the variability between test files.

00:27:30.240 To keep your CI cost-effective, keep a close watch on these numbers. How you do so depends heavily on how your platform distributes parallel jobs.

00:27:52.400 Ensure that your jobs are well-structured: identify the critical path, simplify it, and remove redundancy by separating jobs that don't depend on each other. Keep your job scheduling resilient so your team can keep improving the CI infrastructure.
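Finding the critical path means finding the longest chain of dependent jobs, since that chain determines when the whole workflow finishes. The sketch below assumes a hypothetical workflow description in which each job lists its duration and the jobs it needs; it also assumes the dependency graph has no cycles.

```ruby
# jobs: name => { seconds: duration, needs: [jobs that must finish first] }
# Returns [total_seconds, [job names along the slowest chain]].
def critical_path(jobs)
  finish = {} # memo: job name => [earliest finish time, path leading to it]

  visit = lambda do |name|
    finish[name] ||= begin
      job = jobs.fetch(name)
      # A job can only start once its slowest dependency has finished.
      slowest = job[:needs].map { |dep| visit.call(dep) }.max_by(&:first)
      start_time, path = slowest || [0, []]
      [start_time + job[:seconds], path + [name]]
    end
  end

  jobs.keys.map { |name| visit.call(name) }.max_by(&:first)
end
```

Speeding up a job that is not on the returned path does not shorten the overall workflow, which is why optimization effort is best spent on the jobs this function reports.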

00:28:30.840 Even as execution times improve, prioritizing root causes requires diligence; keep running health checks across your active CI processes and keep CI optimization work in your backlog.

00:29:10.600 As you implement these suggestions, aim to evolve. The best approach integrates visibility with actionable insights to continually help your organization.

00:29:48.340 Wrap-up: CI optimizations are dynamic; the convergence of technologies, effective processes, and robust strategies will lead you toward constructing a high-performing CI environment.

00:30:34.200 Start with a base overview, and then put your foot on the gas, knocking out solutions to bottleneck issues, establishing observability, and reinforcing team accountability across CI workflows.

00:31:15.360 Adapting these strategies will help you regain control over testing efficiency, leading to significant time savings across projects.

00:31:36.000 This approach is critical for shaping positive organizational culture and results through actionable measures on CI performance.

00:31:49.920 Thank you for your attention and participation in this session. I'm looking forward to engaging with you and answering any questions you may have about optimizing your CI process.

RailsConf 2021