00:00:09.900
It was the best of times, it was the worst of times; it was the age of waiting for CI, which takes forever. It's one of my least favorite things to do, partly because I'm impatient and bad at multitasking. However, if you think about how much time you spend waiting on CI every day, and multiply that by the number of people on your team, that's a lot of time we could be using better.
00:00:17.520
It's significant enough that it makes sense to put in some real work to make it run faster. Today, I'd like to tell you about a few techniques that I've used in the past which yielded great results. First, though, hi, my name is Daniel.
00:00:31.019
I learned all of this while working for GoCardless, a payments company based in London. Now, as you probably noticed, I'm not originally from London—I come from Argentina. So, in case you were wondering, that's the accent you'll be hearing. I want to help you make your CI finish faster, and as the title of the talk suggests, I'm proposing that you do this by throwing lots of computers at the problem.
00:01:05.580
I won't be discussing how to make individual tests run faster; there are plenty of resources out there for that, like using fixtures instead of factories or mocking stuff. These techniques are shared by others who can explain them much better than I can. However, the downside to these techniques is that they often involve rewriting your tests and that takes time, which is typically linear with the number of tests you have.
00:01:24.240
And your test suite is likely huge, or you wouldn't be watching my talk right now. Therefore, what I want to discuss is how to reduce the total runtime of your CI test suite, focusing on getting the most impact for the time you invest. We can do that not by making your tests faster, but by running them on lots of computers at the same time.
00:01:42.060
My focus here is on making systemic changes that allow your tests to run as they always have, while enabling them to run massively in parallel in CI to finish quickly. Now, of course, what I am advocating for here is throwing money at the problem in exchange for saving developers' time. While this may not always be the appropriate approach, for many companies with engineers on typical salaries waiting on CI and getting distracted, it makes sense to spend money to keep things efficient.
00:02:00.960
To be honest, it's not even that much. For example, we're using many machines and spending less than a hundred dollars per developer per month, which is reasonable. Before we start, I want to make a couple of notes. First, you'll notice a lot of CircleCI references in this talk, especially in my screenshots. I'll also mention a tool or two they provide, but this is not an endorsement.
00:02:20.280
The reason I'm using CircleCI as an example is simply that I have the most experience with them, which makes it easier for me to show complex CI setups. However, this talk is not CircleCI-specific; everything I discuss will be applicable to most CI providers. The same goes for when I mention RSpec: I'm using RSpec as an example, but most of my points will apply to Minitest or any other testing framework in Ruby or other languages.
00:02:42.640
I've even applied the same advice to a PHP project; although the projects' specifics were quite different, the underlying thinking applies across programming languages.
00:03:10.080
I’ll show you various code examples to explain these techniques, though I will be oversimplifying to clarify concepts quickly. There’s a supplementary GitHub repo linked, where you'll find fuller code examples and more documentation on how they work. You can use these as a starting point, but most will require adaptation to meet your specific needs.
00:03:36.360
Importantly, this isn't a 'do these three things and you will achieve this exact result' roadmap type of talk. Your mileage will vary, as your CI setup will differ from mine and what's slow for you may be different from others. I'm sharing a variety of techniques, some of which may be highly relevant to you while others might not be as impactful.
00:03:54.240
This talk mainly serves as a framework for considering the problem of CI times, along with tools and techniques to help you improve those times. However, you'll need to experiment, determine what works for you, and tailor these techniques to your situations.
00:04:14.640
As I mentioned, we've improved runtime for both Ruby and PHP projects significantly using these ideas, but what worked for one may not yield the same results for the other. Adapting to your needs takes some work, but it’s worthwhile as you’ll no longer wait on CI for long periods. So, measure, experiment, and see what works for you.
00:04:36.059
With that said, let's dive in. We want our CI suite to finish faster, and we can achieve that by running tests in parallel. The first question is, how do we even do this? There are two main ways: you can manually split your test suite into logical chunks and run each chunk in parallel, or you can automatically split the tests across multiple machines that also run in parallel. This isn't an either-or choice; you'll likely want to implement both.
00:05:22.259
Let's start with the simplest method, which you're likely already using: if you can separate your test suite into different pieces that make sense (likely different subdirectories within your specs directory), you can create different CI jobs that each call RSpec on one of those directories.
00:05:57.840
Rails does this beautifully; their test suite has separate jobs for Active Model, Active Record, Action Cable, etc. Each of these is a coherent unit, making the CI setup easy to grasp, and they all run in parallel. If your app is modular, this offers a significant advantage.
00:06:10.440
In most Rails apps, you'll have integration, models, controllers, etc. While they might not be as granular, it's worth separating them, as one may take much longer to run than the others. Two things to remember: separating tests gives you greater control over how jobs are executed, and some CI providers let you choose between instance sizes.
00:06:27.780
Particular types of tests may require larger instances to fit all their dependencies, so by isolating those tests into their own jobs, you pay for larger machines only where necessary, without inflating the cost of the other tests. For instance, in our setup, we separate out certain search tests that would usually reside under integration, because their heavier dependencies need larger machines.
00:06:52.140
The second benefit of separating job executions is controlling inter-job dependencies. For example, JS tests may require Node modules installed, whereas unit tests probably don’t need to wait for that to finish. By separating JS and unit tests, you can increase overall efficiency.
00:07:17.700
Also, having a catch-all job can be beneficial. With a granular splitting approach, it's easy to forget to add a new job for a test subdirectory you create later. If you forget to add a CI job for a new directory, those tests won't run in CI and nobody may notice; this can be dangerous.
00:07:57.300
A final catch-all job guards against this: rather than targeting specific subdirectories, it locates all test files and filters out the ones already run by other jobs. You can accomplish this easily with the find command.
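The filtering step of a catch-all job can be sketched as follows. This is a minimal Python illustration rather than the find invocation used in the talk; the function name, the `*_spec.rb` naming pattern, and the directory names in the usage note are assumptions made for the example.

```python
from pathlib import Path


def catch_all_spec_files(spec_root, covered_dirs):
    """List spec files that no dedicated CI job is already running.

    spec_root:    directory that holds every spec (hypothetical layout).
    covered_dirs: subdirectories of spec_root that have their own CI job.
    """
    root = Path(spec_root)
    covered = [root / name for name in covered_dirs]
    remaining = []
    for path in sorted(root.rglob("*_spec.rb")):
        # Skip anything that lives below a directory another job handles.
        if any(directory in path.parents for directory in covered):
            continue
        remaining.append(path.relative_to(root).as_posix())
    return remaining
```

Calling `catch_all_spec_files("spec", ["models", "integration"])` would return everything outside those two directories, so a newly created subdirectory is picked up without any CI change.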
00:08:20.940
While the structure of the catch-all job may seem unappealing, its safety justifies its awkwardness. The caveat, however, is the need to remember to add exceptions when you create jobs targeting other subdirectories.
00:08:45.180
This approach notably reduces the penalty for forgetting something since the worst-case scenario will involve running certain tests twice instead of skipping entire chunks. That's how to split your tests manually, and many of you might already be doing this, but it's still worth emphasizing.
00:09:11.940
The other method of parallelizing, which we will focus on the most today, is leveraging multiple machines to run tests by splitting the files between them. Here’s how you do it: rather than giving your test runner a single directory to run, provide a complete list of files in that directory.
00:09:36.600
This allows you to split the list into chunks and distribute them across many machines. Each machine will select its assigned chunk of files to run, allowing for all tests to be executed in parallel. This way, no two machines run the same file, and all files are accounted for.
00:10:03.060
Some of you might already know how to do this, or perhaps you're already implementing it. If you're reaching for your phone to check Twitter right now—stay with me for a quick moment. Your mileage will vary, and not every section will be relevant to everyone, so here’s a useful feature in this talk.
00:10:20.279
As a part of this new interactive RailsConf format, if a section doesn’t apply to you, just keep track of the displayed timestamp and skip ahead until the icon is gone. Now, to split the test files efficiently so that every machine knows which file to run, each machine needs to know its respective number and the total number of machines involved.
00:10:44.700
With those two values, every machine takes every nth file, starting at its own offset. This is what gives you parallelism without overlap. Of all the aspects I'm discussing today, this is where CI providers differ the most. For example, CircleCI has a CLI tool that helps you locate your test files and manage their distribution.
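The "every nth file with an initial offset" rule fits in a few lines. The sketch below is in Python for illustration; the function name and the zero-based machine index are assumptions, and providers differ in how they number machines.

```python
def files_for_machine(files, index, total):
    """Return the slice of test files that machine `index` should run.

    index is zero-based (an assumption; some providers count from 1).
    Sorting first guarantees every machine sees the same ordering, which
    is what makes the slices disjoint and complete.
    """
    if total < 1 or not 0 <= index < total:
        raise ValueError("index must satisfy 0 <= index < total")
    return sorted(files)[index::total]
```

Because each machine computes its slice from the same sorted list, no coordination between machines is needed.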
00:11:17.640
You specify a parallelism value for how many boxes you want to run, and the command generally handles the rest. With some additional effort, you can also get a smarter allocation based on historical execution times for each test file.
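One way to allocate files by historical timings is a greedy "longest file first, onto the least loaded machine" pass. The Python sketch below illustrates that general idea only; it is not a description of how CircleCI's tool works internally, and the function name and timing data are hypothetical.

```python
import heapq


def split_by_timings(timings, total):
    """Distribute files across `total` machines using past run times.

    timings: dict mapping file name -> seconds from a previous run.
    Returns a list with one list of file names per machine.
    """
    # Min-heap of (accumulated seconds, machine index).
    heap = [(0.0, machine) for machine in range(total)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(total)]
    # Place the slowest files first; ties broken by name for determinism.
    for name, seconds in sorted(timings.items(), key=lambda kv: (-kv[1], kv[0])):
        load, machine = heapq.heappop(heap)
        buckets[machine].append(name)
        heapq.heappush(heap, (load + seconds, machine))
    return buckets
```

Compared with taking every nth file, this keeps one very slow file from landing on a machine that is already busy with other slow files.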
00:11:40.380
CodeShip offers a less ideal approach: you define a number of steps to run in parallel, and you can embed environment variables right there in each step, though this requires more effort upfront. Buddy provides a feature similar to CircleCI's, but it splits the files ahead of time, creating variables like BUDDY_SPLIT_1, BUDDY_SPLIT_2, etc., one for each machine. You can use those directly for your tests.
00:12:06.180
Sadly, most providers do not have a built-in solution for this, but CI platforms generally have a build matrix feature that lets you run the same job repeatedly with slightly varying parameters. It is normally used to run your tests against different versions of dependencies, ensuring code compatibility.
00:12:40.640
By adding a custom parameter of your own to the build matrix, alongside the usual ones like OS or Ruby version, you can repurpose the matrix to control how a job is split and distributed.
00:13:09.380
Adding a box index to the matrix lets you set environment variables so that each machine can identify which files to process without overlap.
00:13:36.600
The approach involves splitting the list of files dynamically across machines, with each CI instance working out which files it should run. This is efficient and ensures that every test still runs somewhere across the parallel jobs.
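When the box index comes from a build matrix, each job has to read it back from its environment. A minimal Python sketch, assuming the matrix exports two variables; the names `CI_BOX_INDEX` and `CI_BOX_TOTAL` are hypothetical and should be replaced by whatever your own matrix defines.

```python
import os


def box_position(env=None):
    """Read this machine's zero-based index and the machine count.

    CI_BOX_INDEX and CI_BOX_TOTAL are hypothetical variable names that
    the build matrix is assumed to set for every job.
    """
    env = os.environ if env is None else env
    index = int(env["CI_BOX_INDEX"])
    total = int(env["CI_BOX_TOTAL"])
    if total < 1 or not 0 <= index < total:
        raise ValueError(f"invalid box position {index} of {total}")
    return index, total


def files_for_this_box(all_files, env=None):
    """Pick this box's share of the files from the matrix variables."""
    index, total = box_position(env)
    return sorted(all_files)[index::total]
```

Failing loudly on a missing or out-of-range value matters here: a job that silently falls back to running nothing would look green while skipping its tests.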
00:13:57.180
Now that we've explored the basic methods of parallel execution, let's shift focus to optimizing CI performance. It's important to remember that while spinning up a huge number of test instances might seem ideal, setup time must also be considered.
00:14:17.760
In effect, with too many instances, setup time eclipses execution time and you hit diminishing returns. Most CI providers have a setup phase that takes up precious time before any test can run.
00:14:45.180
Consequently, if each CI instance spends most of its time waiting to start the testing phase, adding more machines stops helping, because completion time is dominated by setup.
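The diminishing returns are easy to see with a little arithmetic. The Python sketch below assumes a perfectly even split and a fixed setup time per machine, both simplifications; the numbers in the usage note are made up for illustration.

```python
def wall_clock_minutes(test_minutes, setup_minutes, machines):
    """Time until the slowest machine finishes, assuming an even split."""
    return setup_minutes + test_minutes / machines


def billed_machine_minutes(test_minutes, setup_minutes, machines):
    """Total machine time paid for: setup is repeated on every machine."""
    return machines * setup_minutes + test_minutes
```

With 60 minutes of tests and 2 minutes of setup, 10 machines finish in 8 minutes and 60 machines in 3 minutes, while the billed time grows from 80 to 180 machine-minutes. Past a certain point, the setup phase is the only thing left to shorten.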
00:15:16.780
At the start, I said I wouldn't delve into optimizing individual test execution, but every other step involved in CI also adds to the total, and those steps are worth looking at.
00:15:57.020
Everything that happens before your tests execute is a necessary evil compared to the actual test execution, and it is worth analyzing, because parallelizing does little good if every machine still spends a long time waiting on dependencies.
00:16:30.120
To begin addressing this, review your CI jobs carefully and look at the average time each step takes. Identifying the slow steps tells you where the bottlenecks are.
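Finding the slow steps can start as a simple average over a handful of recent runs. This Python sketch assumes you have already exported step durations from your CI provider in some form; the data shape, step names, and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def slowest_steps(runs, top=3):
    """Rank CI steps by their average duration.

    runs: list of dicts, one per CI run, mapping step name -> seconds.
    Returns up to `top` (step, average seconds) pairs, slowest first.
    """
    durations = defaultdict(list)
    for run in runs:
        for step, seconds in run.items():
            durations[step].append(seconds)
    averages = {step: mean(values) for step, values in durations.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Averaging over several runs keeps a single unusually slow run from pointing you at the wrong step.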
00:16:57.820
You might find that dependency install times, or even checkout times, warrant improvement, which gives you room to optimize without touching the tests themselves.
00:17:23.900
Now, as you conduct experiments, make sure you know how long each component of your setup takes over time. Identify the points of inefficiency on your CI service and keep tracking them consistently.
00:17:53.460
For example, if your gem installation takes far too long, aim to minimize that time by using caching effectively, so that gems installed in a previous run can be reused.
00:18:20.500
Review how you run commands such as Bundler, and where they install your application's dependencies, so that the necessary files are kept in the cache.
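A dependency cache is only useful if its key changes exactly when the dependencies change. A common way to get that is to derive the key from the lockfile's checksum; the Python sketch below shows the idea. The function name, key format, and `gems-v1` prefix are assumptions, and most CI providers offer their own syntax for checksum-based keys.

```python
import hashlib


def gem_cache_key(lockfile_bytes, ruby_version, prefix="gems-v1"):
    """Build a cache key that changes whenever the lockfile changes.

    lockfile_bytes: raw contents of Gemfile.lock.
    prefix: bump it by hand to throw away every existing cache.
    """
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-ruby{ruby_version}-{digest}"
```

Including the Ruby version in the key avoids restoring gems with native extensions that were compiled against a different interpreter.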
00:18:47.120
This may also apply to checking out your code. As your repository accumulates history, the growing number of commits can slow down checkouts.
00:19:01.560
In those instances, consider a shallow clone of your repository, so that your checkout includes only the most recent history rather than every commit.
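A shallow clone is requested with git's `--depth` option. The Python sketch below only builds the command, so it can be inspected before being run; the function name and the repository URL in the usage note are hypothetical.

```python
def shallow_clone_command(repo_url, dest, branch=None, depth=1):
    """Build a `git clone` command that fetches only recent history.

    depth: number of commits to keep. With --depth, git fetches a single
    branch by default, so no extra flag is needed for that.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    command = ["git", "clone", "--depth", str(depth)]
    if branch:
        command += ["--branch", branch]
    command += [repo_url, dest]
    return command
```

The result can be passed to `subprocess.run(command, check=True)`, for example with `shallow_clone_command("https://example.com/app.git", "app", branch="main")`. Some tools need more history than one commit, such as anything that diffs against the main branch, so the depth may need raising.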
00:19:29.200
As you work to reduce the time it takes to get your CI containers up and running, keep in mind that using Docker effectively helps reduce delays during startup.
00:19:54.160
Many CI providers start each job by initializing a Docker environment, which can lead to lengthy container spin-up times if you're unaware of what happens underneath, such as checking for each dependency required during startup.
00:20:24.640
Beyond the default images, consider how long each of your Docker images takes to build and start. Regularly removing outdated files from them can reduce startup time.
00:21:05.780
If your environment setup is becoming increasingly time-consuming, consider reviewing the caching strategies offered by your CI provider and using custom images that you build yourself and host close to your CI machines.
00:21:26.560
Getting more speed out of Docker ultimately comes down to efficiency: balancing the cost of keeping instances running against the overhead of starting them, and fine-tuning the cache, which can save substantial overall runtime.
00:21:58.920
Furthermore, for teams conscious of time and costs, building custom containers tailored to your CI jobs can reduce execution times.
00:22:34.000
Docker caches image layers, so steps that have not changed can be reused instead of rebuilt. That being said, review your CI's performance on a regular basis to make sure it stays fast.
00:23:03.480
Be sure to build observability into your CI processes; it is essential for achieving CI bliss, as it lets your team keep track of the external factors affecting runtime.
00:23:41.320
The harsh reality is that flaky tests disrupt CI performance tremendously. Waiting an eternity for a test run just to see whether a failure was real makes an already challenging debugging process even worse.
00:24:11.780
To begin solving flaky tests, make sure someone owns them; for example, tools that identify failures and route them to an owner can lead to consistent resolution.
00:24:49.600
When you run your tests, consider automating tickets for any flaky tests that arise. Automatically detecting inconsistencies can help ensure resolution, as this formalizes accountability.
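Detecting a flaky test automatically usually means rerunning a failure and seeing whether it passes without any code change. A minimal Python sketch of that classification, where `rerun` stands in for however you invoke a single test in your own setup; both names are hypothetical.

```python
def classify_failures(failed_tests, rerun, attempts=2):
    """Separate flaky failures from consistent ones.

    failed_tests: names of tests that failed in the main run.
    rerun: callable taking a test name and returning True if it passed.
    A test that passes on any retry is reported as flaky.
    """
    flaky, broken = [], []
    for test in failed_tests:
        # any() stops at the first passing retry.
        if any(rerun(test) for _ in range(attempts)):
            flaky.append(test)
        else:
            broken.append(test)
    return flaky, broken
```

The `flaky` list is what would feed ticket creation, while anything in `broken` should still fail the build.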
00:25:16.160
Tools like Jira allow ticket creation through CLI commands that can be integrated into your CI process, formalizing tracking and helping assign each flaky test to a responsible person or point of contact.
00:25:43.400
Setting up notification chains for flaky tests helps your team resolve failures more consistently, which feeds back into the efficiency of CI.
00:26:03.700
Beyond tracking flaky tests, inspecting how tests are distributed across machines frequently reveals uneven run times, where some machines finish long after the others.
00:26:23.920
Mechanisms such as profiling test execution times help you identify problematic areas and distribute tests in a more balanced way.
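Once per-file timings are profiled, checking how balanced a split is takes only a few lines. The Python sketch below assumes you know which files each machine ran and how long each file took; the function name and data shapes are hypothetical.

```python
def box_imbalance(assignment, timings):
    """Measure how unevenly work is spread across machines.

    assignment: list with one list of file names per machine.
    timings: dict mapping file name -> seconds.
    Returns (seconds per machine, slowest machine / average machine).
    A ratio of 1.0 means a perfectly even split.
    """
    totals = [sum(timings[name] for name in files) for files in assignment]
    average = sum(totals) / len(totals)
    return totals, max(totals) / average
```

Since the build finishes only when the slowest machine does, a ratio well above 1.0 suggests rebalancing will help more than adding machines.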
00:26:56.920
This can involve adjusting the way CI parallelizes jobs, so that each instance receives a comparable share of the work.
00:27:30.240
To keep your CI environment performing well, keep a close watch on it. How you do so can depend heavily on how your platform handles parallel job distribution.
00:27:52.400
Ensure that your jobs are well-structured, simplifying the critical paths and removing redundancy by separating relevant jobs. Maintain resilient job scheduling to ensure your team is constantly improving CI infrastructure.
00:28:30.840
Even as execution times improve, prioritizing root causes requires diligence; keeping thorough health checks on your CI processes and maintaining a backlog with CI optimization in mind remains paramount.
00:29:10.600
As you implement these suggestions, aim to evolve. The best approach integrates visibility with actionable insights to continually help your organization.
00:29:48.340
Wrap-up: CI optimizations are dynamic; the convergence of technologies, effective processes, and robust strategies will lead you toward constructing a high-performing CI environment.
00:30:34.200
Start by measuring a baseline, and then put your foot on the gas: knock out the bottlenecks, establish observability, and reinforce team accountability across CI workflows.
00:31:15.360
Adapting these strategies will help you regain control over testing efficiency, leading to significant time savings across projects.
00:31:36.000
This approach is critical for shaping positive organizational culture and results through actionable measures on CI performance.
00:31:49.920
Thank you for your attention and participation in this session. I'm looking forward to engaging with you and answering any questions you may have about optimizing your CI process.