00:00:09.500
Let's begin. Hi, my name is Emil, and today I'm going to be talking about testing Rails at scale.
00:00:13.259
I am a production engineer at Shopify, where I work on the production pipeline and on performance. Shopify is an e-commerce platform that allows merchants to set up online stores and sell products on the internet.
00:00:18.900
To give you a little background on Shopify: we have over 240,000 merchants, and over the lifetime of the company we've processed 14 billion dollars in sales. In any given month we handle about 300 million unique visitors, and we have over a thousand employees.
00:00:44.579
When you're testing Rails at scale, you typically use a Continuous Integration (CI) system. I want to make sure we are all on the same page about how I think about CI systems. I like to think of CI systems as having two main components: the scheduler and the compute.
00:00:51.600
The scheduler is the component that decides when the build needs to be kicked off, typically triggered by a webhook from something like GitHub. It orchestrates the work by determining what scripts need to run. In contrast, the compute is where the code and tests are actually executed.
00:01:03.360
The compute encompasses everything that interacts with the machine, including getting the code onto the machine and ensuring all resources are available. If you look at the market of CI systems, you typically have two types: managed providers, which are closed systems and multi-tenant, handling both compute and scheduling for you; and unmanaged providers, where you host both scheduling and compute within your own infrastructure.
00:01:27.360
Some examples of managed providers are CircleCI, CodeShip, and Travis CI, while Jenkins falls into the unmanaged category. Today, Shopify boots up over 50,000 containers in a single day of testing. Over that day we run about 700 builds, and every build runs 42,000 tests in about five minutes.
00:01:59.930
However, this wasn't always the case. About a year ago, Shopify's build times were close to 20 minutes, and we were experiencing serious flakiness, not just from our code but also from the CI provider we were using. We were that provider's biggest customer, and they were running into capacity issues, which resulted in problems like out-of-memory errors.
00:02:23.250
The provider we were using was also expensive, and not only in the dollar amount: you typically pay a hosted provider monthly, but your workload runs roughly five days a week, eight to twelve hours a day, so you are paying for compute time you never use during the remaining hours.
00:02:48.000
So, we set out to solve this problem. Our directive was to lower our build times to five minutes while staying within our current budget. Given the level of flakiness and the long build times, we realized we needed to completely rebuild our CI system: even though the test suite itself was passing, developers would often see their builds fail, which made the deployment process unreliable.
00:03:18.930
We looked around the market and found an Australian CI provider called Buildkite. The interesting thing about Buildkite is that it's a hosted service that handles only the scheduling component; you bring your own compute.
00:03:40.950
This is valuable because, for 99% of use cases, the scheduling component is essentially the same across CI systems, so there was no point in reinventing that wheel ourselves. The way Buildkite works is that you run Buildkite agents on your own machines, and these agents communicate back to Buildkite.
00:04:06.000
Buildkite also integrates with events for your repository. When you push code to GitHub, Buildkite knows it needs to start a build. You tell Buildkite precisely which scripts you want the agents to run, and those agents then pull the code down from GitHub, execute the scripts, and propagate the results back.
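To make that flow concrete, here is a rough conceptual sketch in Ruby of an agent's job cycle. It is not the real Buildkite agent (that is a standalone binary); the method name, result hash, and reporting step are illustrative assumptions only.

```ruby
# Conceptual illustration of an agent's job cycle (not the real Buildkite agent).
# The scheduler hands the agent a job; the agent fetches the code, runs the
# configured script, and reports the result back.
require "tmpdir"
require "open3"

def run_job(repo_url, commit_sha, script)
  workdir = Dir.mktmpdir("build")

  # Pull down the code that triggered the build.
  system("git", "clone", repo_url, workdir) or return { state: "failed", step: "checkout" }
  system("git", "-C", workdir, "checkout", commit_sha) or return { state: "failed", step: "checkout" }

  # Run the script the pipeline configured for this step.
  output, status = Open3.capture2e(script, chdir: workdir)

  # In the real system, the agent streams the log and result back to Buildkite.
  { state: status.success? ? "passed" : "failed", log: output }
end
```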
00:04:27.540
Our compute cluster is primarily hosted on AWS. At peak it consists of 90 c4.8xlarge instances, giving us about 5.4 terabytes of memory and over 3,200 cores. The cluster is configured to auto-scale and is managed with Chef and pre-built AMIs. The instances are memory-bound because the containers we run include all the services Shopify requires.
00:05:02.040
The workload is write-heavy, so we needed some optimizations early on: we keep temporary files in memory by mounting a ramfs on the machines. We also had to write our own auto-scaler, since Amazon's auto-scaling is geared toward HTTP request traffic, which did not suit our needs.
00:05:34.680
We created a simple Rails application that polls Buildkite for the currently running builds and agents, works out how many agents are required, and starts or shuts down EC2 machines accordingly. We also worked to keep costs down by optimizing how we use AWS, for example keeping instances running for a full hour, since Amazon bills by the hour.
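A minimal sketch of that polling loop might look like the following. The Buildkite endpoint usage, the agents-per-build heuristic, the instance pool, and the polling interval are illustrative assumptions, not our production code.

```ruby
# Rough sketch of the custom auto-scaler loop. The pool of pre-built instances,
# the agents-per-host figure, and the scaling heuristic are all hypothetical.
require "net/http"
require "json"
require "aws-sdk-ec2"

BUILDKITE_TOKEN = ENV.fetch("BUILDKITE_TOKEN")
AGENT_POOL      = %w[i-0aaa1111 i-0bbb2222 i-0ccc3333]  # hypothetical instance IDs
AGENTS_PER_HOST = 36                                    # hypothetical agents per machine

def running_builds_count
  uri = URI("https://api.buildkite.com/v2/organizations/shopify/builds?state=running")
  req = Net::HTTP::Get.new(uri, "Authorization" => "Bearer #{BUILDKITE_TOKEN}")
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
  JSON.parse(res.body).size
end

ec2 = Aws::EC2::Client.new

loop do
  agents_needed = running_builds_count * 100                # e.g. ~100 agents per build
  hosts_needed  = (agents_needed / AGENTS_PER_HOST.to_f).ceil

  to_start = AGENT_POOL.first(hosts_needed)
  to_stop  = AGENT_POOL.drop(hosts_needed)

  ec2.start_instances(instance_ids: to_start) unless to_start.empty?
  # In practice we only shut machines down near their billing-hour boundary,
  # since (at the time) AWS billed by the full hour.
  ec2.stop_instances(instance_ids: to_stop) unless to_stop.empty?

  sleep 30
end
```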
00:06:31.200
We also keep costs in check by using spot and reserved instances and by improving overall utilization. The number of agents allocated per build is dynamic, so that we can handle up to 100 agents for branch builds and 200 for master builds.
00:06:51.060
It's important to note that one size does not fit all: auto-scaling on AWS works for us, but for other companies bare metal might be the better choice. Interestingly, the Buildkite agents serve as an implicit measure of our developers' activity: the more code they push, the more agents are busy.
00:07:06.890
Looking at an average day, we noticed distinct time patterns correlating with lunch breaks and overall activity. For instance, we see peaks just before and after lunch as developers commit work in progress before stepping away from their desks.
00:07:45.730
One of the most significant performance improvements came from using Docker containers for running tests. The configuration for each test environment is done once during the container build, allowing instances to immediately execute tests once the container is up and running.
00:08:34.680
We used Docker's distribution API, which means we can deploy containers wherever necessary as long as we indicate where the registry is. Over time, we've outgrown standard Docker Compose files and developed our internal build system, named Locutus, using the Docker build API.
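As a sketch of what a build-and-push step like that can look like, here is an example using the community docker-api gem. This is not Locutus itself, and the registry and image names are placeholders.

```ruby
# Simplified sketch of building an image via the Docker build API and pushing
# it to an internal registry. Uses the community `docker-api` gem for
# illustration; the registry and image names are placeholders.
require "docker"

REGISTRY = "registry.example.com"                 # hypothetical internal registry
SHA      = ENV.fetch("GIT_SHA", "latest")

# Build the image once; every test container started from it already has the
# application, its gems, and its services baked in.
image = Docker::Image.build_from_dir(".")
image.tag("repo" => "#{REGISTRY}/shopify-ci", "tag" => SHA)
image.push(nil, repo_tag: "#{REGISTRY}/shopify-ci:#{SHA}")
```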
00:09:01.370
In its initial iteration, Locutus ran on a single EC2 instance and ran into the technical debt that Shopify's ten-year-old codebase has accumulated. We struggled in particular with compiling assets and with the fact that distributing tests required a MySQL connection.
00:09:42.170
To work around this, we decided that every container would execute a fixed slice of the tests, offset by its index. That created its own challenge: Ruby tests are fast but numerous, while browser tests run much longer, so the Ruby tests would finish quickly and the long-running browser tests would end up delaying the final build time.
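The offset scheme is essentially static test splitting. Here is a minimal sketch of the idea; the environment variable names and the runner invocation are illustrative.

```ruby
# Static test distribution: each container knows its index and the total
# container count, and deterministically picks its slice of the test files.
# The environment variable names here are illustrative.
index = Integer(ENV.fetch("CONTAINER_INDEX"))   # 0..(N-1)
total = Integer(ENV.fetch("CONTAINER_COUNT"))   # N containers per build

all_tests = Dir.glob("test/**/*_test.rb").sort
my_tests  = all_tests.select.with_index { |_, i| i % total == index }

# Run only this container's share of the suite.
exec("bundle", "exec", "rails", "test", *my_tests)
```

The drawback, as mentioned above, is that equally sized slices do not mean equal runtimes.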
00:10:14.700
As a build finished, the agents on the boxes would collect artifacts and upload them to S3. We also had an out-of-band service that received webhooks from Buildkite and pushed the results into Kafka, which helped us analyze test failures and the stability of different areas of the code.
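A sketch of that out-of-band service: a small endpoint that accepts the Buildkite webhook and forwards the payload into Kafka. The Sinatra and ruby-kafka choices, the topic name, and the broker address are assumptions for illustration, not our actual service.

```ruby
# Minimal sketch of the out-of-band webhook service. Sinatra and ruby-kafka
# are illustrative choices; the topic and broker names are placeholders.
require "sinatra"
require "kafka"

kafka = Kafka.new(["kafka1.example.com:9092"], client_id: "ci-webhooks")

post "/buildkite/webhook" do
  payload = request.body.read

  # Forward the raw build/job event into Kafka; downstream consumers analyze
  # failure rates and flakiness per test and per area of the codebase.
  kafka.deliver_message(payload, topic: "ci-build-events")

  status 200
end
```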
00:10:52.130
This structure was our first iteration, which we later expanded upon. While we rolled out the new Docker-based system, we had to run it alongside our existing provider, and running two CI systems in parallel created a lot of confusion among developers: conflicting statuses made it hard to determine which system to trust.
00:11:23.350
One system might show green while a build failed due to underlying issues in the other, which hurt overall developer confidence. Eventually, we switched over to the new system completely and gradually phased out the old one.
00:12:01.849
When we outgrew our original Locutus instance, we headed back to the drawing board to create something more scalable. We focused on keeping our system as stateless as possible, which led to designing a new version of Locutus that utilized a coordinator instance for distributing work.
00:12:30.389
Now each worker handles builds for specific repositories, which reduces the load on any single instance under heavy traffic. One challenge remained, though: losing the cache frequently resulted in longer build times, so we needed more efficient caching.
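One way to picture the coordinator is as a router that consistently assigns each repository to the same worker, so that worker's cache stays warm across builds. This is a conceptual sketch, not the actual Locutus code; the worker names are placeholders.

```ruby
# Conceptual sketch of the coordinator: route builds for the same repository
# to the same worker so that worker keeps a warm build cache.
# (Illustrative only; not the actual Locutus implementation.)
require "digest"

WORKERS = %w[locutus-worker-1 locutus-worker-2 locutus-worker-3]  # hypothetical

def worker_for(repo)
  # Consistent assignment: the same repo always hashes to the same worker.
  WORKERS[Digest::MD5.hexdigest(repo).to_i(16) % WORKERS.size]
end

worker_for("org/some-repo")   # => always the same worker for this repo
```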
00:13:00.480
For our third iteration, we worked on improving stability in our test infrastructure, particularly failures caused by race conditions. If a container pulled a test from the queue and then failed, there was a risk that no one would notice, since the overall build could still report a green status. To combat this, we re-run failed tests and verify that every test was actually executed.
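Conceptually, that safety check works like the following sketch: compare the set of tests that reported results against the set that was enqueued, requeue anything missing, and retry failures. The method and variable names are illustrative, not our actual implementation.

```ruby
# Conceptual sketch of the safety check: never let a build go green unless
# every enqueued test actually reported a result. (Queue/result storage and
# names are illustrative.)
def verify_and_requeue(enqueued_tests, reported_results, queue)
  reported = reported_results.keys
  missing  = enqueued_tests - reported

  # A container may have died after dequeuing tests but before reporting;
  # put those tests back so another container picks them up.
  missing.each { |test| queue.push(test) }

  # Retry failed tests to separate genuine failures from flakiness.
  failed = reported_results.select { |_test, status| status == :failed }.keys
  failed.each { |test| queue.push(test) }

  missing.empty? && failed.empty?
end
```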
00:13:46.570
Today, the current iteration of our Buildkite-based system reflects all of the improvements we made along the way. In conclusion: if your build times are under ten minutes, don't attempt to build your own CI system. The time spent building and operating one can outweigh the benefits if your existing setup is already working well.
00:14:29.590
This project took us a significant amount of time, with many people working on it over the course of months. Often the problem isn't compute capacity at all; it's configuration, and that's where optimization can yield significant gains.
00:14:52.110
If your build times exceed fifteen minutes, that's when you should consider building your own CI system. Having a monolithic application with a highly variable codebase, and being at the limits of what your current provider can handle, are strong indicators that a custom solution could be beneficial.
00:15:20.520
When choosing to build your CI system, ensure you commit fully to the transition. We learned the importance of not falling down rabbit holes and being realistic about timelines—projects like this often take longer than initially anticipated.
00:15:41.310
Additionally, treat your infrastructure as cattle rather than pets—this adjustment can save you considerable time and reduce headaches associated with managing server failures.
00:16:00.600
Finally, our experience confirmed the importance of distributing tests efficiently and of recognizing and dealing with test flakiness; both have made our CI process stronger going forward. Thank you for your time.
00:17:03.960
If anyone has questions, whether about optimizing our codebase or about the challenges we hit focusing on the CI system itself, I'm happy to clarify and share what we learned. In short, parallelization helped immensely, even though test flakiness required significant effort to mitigate.