Testing Rails at Scale

Emil Stolarsky • May 26, 2016 • Kansas City, MO

In the talk "Testing Rails at Scale" by Emil Stolarsky at RailsConf 2016, the speaker shares insights into the challenges and solutions Shopify faced in optimizing their Continuous Integration (CI) system. As a production engineer at Shopify, Emil recounts the journey of improving their CI processes due to inefficiencies that included lengthy build times and a lack of reliability which eroded developer trust.

Key points discussed include:

  • Understanding CI Components: Emil describes the two main components of CI systems: the scheduler, which decides when to start a build (typically triggered by a webhook from GitHub), and the compute, where code and tests are executed.

  • Initial CI Challenges: Shopify's previous CI provider left it with 20-minute build times and flaky builds, which made deployments unreliable. The provider was also expensive, since paid capacity sat idle outside working hours.

  • Transition to Buildkite: Shopify adopted Buildkite, an Australian CI provider that offers scheduling while enabling users to manage their compute resources. This hybrid model allowed Shopify to leverage its own infrastructure while simplifying the scheduling process.

  • Infrastructure Utilization: The compute cluster, mainly hosted on AWS, used 90 c4.8xlarge instances at peak, providing about 5.4 terabytes of memory and over 3,200 cores for builds. Autoscaling and resource optimization strategies were developed to control costs and maintain efficiency.

  • Enhancements through Docker: The integration of Docker for testing isolated environments proved to be a significant performance booster, reducing build times and increasing reliability. Docker containers allowed for immediate test execution upon startup.

  • Iterative Improvement: Emil describes the ongoing evolution of their CI system, focusing on enhancing stability and reducing the scope of failures originating from test flakiness. This included developing a more scalable version of their internal tool, Locutus, which uses a coordinator instance for work distribution.

  • Key Takeaways: The video concludes with crucial insights regarding CI systems: organizations with build times over fifteen minutes should consider building a custom solution, fully commit to migration, and manage infrastructure efficiently—adopting practices like treating infrastructure as ‘cattle’ rather than ‘pets’ to ease management burdens.

Overall, the talk encapsulates the hard lessons learned from constructing a responsive CI ecosystem capable of supporting Shopify’s scalable e-commerce platform.

This presentation not only highlights the technical challenges but also emphasizes the importance of reliability and efficiency in modern CI practices.

Testing Rails at Scale by Emil Stolarsky

It's impossible to iterate quickly on a product without a reliable, responsive CI system. At a certain point, traditional CI providers don't cut it. Last summer, Shopify outgrew its CI solution and was plagued by 20-minute build times, flakiness, and waning trust from developers in CI statuses.

Now our new CI builds Shopify in under 5 minutes, 700 times a day, spinning up 30,000 Docker containers in the process. This talk will cover the architectural decisions we made and the hard lessons we learned so you can design a similar build system to solve your own needs.

RailsConf 2016

00:00:09.500 Let's begin. Hi, my name is Emil, and today I'm going to be talking about testing Rails at scale.
00:00:13.259 I am a production engineer at Shopify, where I work in the production pipeline and performance. Shopify is an e-commerce platform that allows merchants to set up online stores and sell products on the internet.
00:00:18.900 To give you a little background on Shopify, we have over 240,000 merchants. Over the lifespan of the company, we've processed 14 billion dollars in sales. On any given month, we handle about 300 million unique visitors, and we have over a thousand employees.
00:00:44.579 When you're testing Rails at scale, you typically use a Continuous Integration (CI) system. I want to make sure we are all on the same page about how I think about CI systems. I like to think of CI systems as having two main components: the scheduler and the compute.
00:00:51.600 The scheduler is the component that decides when the build needs to be kicked off, typically triggered by a webhook from something like GitHub. It orchestrates the work by determining what scripts need to run. In contrast, the compute is where the code and tests are actually executed.
00:01:03.360 The compute encompasses everything that interacts with the machine, including getting the code onto the machine and ensuring all resources are available. If you look at the market of CI systems, you typically have two types: managed providers, which are closed systems and multi-tenant, handling both compute and scheduling for you; and unmanaged providers, where you host both scheduling and compute within your own infrastructure.
00:01:27.360 Some examples of managed providers are CircleCI, CodeShip, and Travis CI, while Jenkins falls into the unmanaged category. Today, Shopify boots up over 50,000 containers in a single day of testing. In that time we run 700 builds, and every build runs 42,000 tests. Each build takes about five minutes.
00:01:59.930 However, this wasn't always the case. About a year ago, Shopify's build times were close to 20 minutes. We were experiencing serious issues with flakiness, not just from our code but also from the CI provider we were using. We were the biggest customer of this provider, and they were running into capacity issues, resulting in problems like out-of-memory errors.
00:02:23.250 The provider we were using was also expensive, and not just in terms of the dollar amount. You typically pay a hosted provider monthly, but your workload usually runs five days a week, eight to twelve hours a day, so you pay for compute time that sits unused during the remaining hours.
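To put rough numbers on that idle time, here is a minimal sketch that assumes builds run only during the working hours mentioned above. The method name `weekly_utilization` is illustrative and does not come from the talk.

```ruby
# Total hours in a week.
HOURS_PER_WEEK = 7 * 24 # 168

# Fraction of the week during which builds are running, assuming
# builds only happen during working hours (an assumption).
def weekly_utilization(days_per_week:, hours_per_day:)
  (days_per_week * hours_per_day).to_f / HOURS_PER_WEEK
end

low  = weekly_utilization(days_per_week: 5, hours_per_day: 8)
high = weekly_utilization(days_per_week: 5, hours_per_day: 12)

puts format("Utilization: %.0f%% to %.0f%%", low * 100, high * 100)
```

Under these assumptions, capacity paid for around the clock is busy for roughly a quarter to a third of the week.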
00:02:48.000 So, we set out on a journey to solve this problem. Our directive was to lower our build times to five minutes while maintaining our current budget. Due to the high level of flakiness and long build times, we realized we needed to completely rebuild our CI system. Despite the test suite showing green statuses multiple times, developers would often see their builds fail, making the deployment process unreliable.
00:03:18.930 We looked around the market and found an Australian CI provider called Buildkite. The interesting aspect of Buildkite is that it’s a hosted provider that provides only the scheduling component; you have to bring your own compute to this service.
00:03:40.950 This is valuable because for 99% of use cases, the scheduling component is the same across any CI system. It also addressed our 'not invented here' concerns about reinventing the wheel. The way Buildkite works is that you run Buildkite agents on your own machines, and these agents communicate back to Buildkite.
00:04:06.000 Buildkite also integrates with events for your repository. When you push code to GitHub, Buildkite knows it needs to start a build. You tell Buildkite precisely which scripts you want the agents to run, and those agents then pull the code down from GitHub, execute the scripts, and propagate the results back.
00:04:27.540 Our compute cluster is primarily hosted on AWS. At peak, it consists of 90 c4.8xlarge instances, giving us about 5.4 terabytes of memory and over 3,200 cores. The cluster auto-scales, and the machines are managed with Chef and pre-built AMIs. The instances are memory-bound because the containers we run include all the services required for Shopify.
00:05:02.040 Initially, we needed some optimizations because of our write-heavy workload: we keep temporary files in memory by using ramfs on the machines. We also had to write our own auto-scaler, since Amazon's only scales on HTTP requests, which did not suit our needs.
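The cluster totals quoted above can be checked with simple arithmetic. This is a minimal sketch that assumes each instance has 36 vCPUs and 60 GiB of memory, which is the published c4.8xlarge shape. The per-instance figures are an assumption and are not stated in the talk.

```ruby
# Assumed per-instance shape (published c4.8xlarge specification).
PEAK_INSTANCES          = 90
VCPUS_PER_INSTANCE      = 36
MEMORY_GIB_PER_INSTANCE = 60

# Total vCPUs and memory for a fleet of identical instances.
def cluster_capacity(instances, vcpus_each, memory_gib_each)
  { vcpus: instances * vcpus_each, memory_gib: instances * memory_gib_each }
end

capacity = cluster_capacity(PEAK_INSTANCES, VCPUS_PER_INSTANCE, MEMORY_GIB_PER_INSTANCE)
puts "#{capacity[:vcpus]} vCPUs, #{capacity[:memory_gib]} GiB of memory"
```

With these inputs the totals come to 3,240 vCPUs and 5,400 GiB, which matches the "over 3,200 cores" and "about 5.4 terabytes" figures.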
00:05:34.680 We created a simple Rails application that polls Buildkite for the current running build agents, checks how many are required based on the number of builds, and activates or shuts down EC2 machines accordingly. We also strategized to keep costs down, which included optimizing our usage of AWS services, such as keeping instances running for a full hour since Amazon bills by the hour.
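The scaling decision described here can be sketched as a pure function. This is an illustration rather than Shopify's implementation: the `Instance` struct, the method names, the five-minute termination window, and the fixed number of agents per instance are all assumptions. The calls to Buildkite and EC2 are left out on purpose.

```ruby
# Hypothetical value object: one running EC2 instance and its launch time.
Instance = Struct.new(:id, :launched_at)

SECONDS_PER_HOUR = 3600

# Instances needed to run `required_agents` agents, assuming a fixed
# number of agents per instance.
def instances_needed(required_agents, agents_per_instance)
  (required_agents.to_f / agents_per_instance).ceil
end

# True when the instance is within `window` seconds of the end of its
# current billing hour, so terminating it wastes little paid time.
def near_end_of_billing_hour?(instance, now, window: 300)
  elapsed = (now - instance.launched_at) % SECONDS_PER_HOUR
  elapsed >= SECONDS_PER_HOUR - window
end

# Decide how many instances to launch and which ones to terminate.
# Returns { launch: Integer, terminate: [instance ids] }.
def scaling_decision(instances:, required_agents:, agents_per_instance:, now:)
  needed  = instances_needed(required_agents, agents_per_instance)
  surplus = instances.size - needed

  if surplus.negative?
    { launch: -surplus, terminate: [] }
  else
    # Only retire surplus instances whose paid hour is almost used up.
    expiring = instances.select { |i| near_end_of_billing_hour?(i, now) }
    { launch: 0, terminate: expiring.first(surplus).map(&:id) }
  end
end

now = Time.at(10_000)
fleet = [
  Instance.new("i-a", now - 3500), # about 58 minutes into its billing hour
  Instance.new("i-b", now - 600)   # 10 minutes into its billing hour
]
p scaling_decision(instances: fleet, required_agents: 4, agents_per_instance: 4, now: now)
```

In this example one instance is surplus, and only "i-a" is chosen for termination because its paid hour is nearly over. Keeping the decision separate from the API calls makes the hourly-billing rule easy to test in isolation.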
00:06:31.200 We attempted to maintain efficiency by utilizing spot and reserved instances and improving overall utilization. For peak operations, we allocate a dynamic number of agents for builds, ensuring that we can handle up to 100 agents for branch builds and 200 for master builds.
00:06:51.060 It’s important to note that no one size fits all; for us, AWS auto-scaling works, but for other companies, bare metal might be the optimal choice. Interestingly, the Buildkite agents serve as an implicit measure of our developers' productivity: the more code they push, the more productive we can consider them.
00:07:06.890 From an analysis on an average day, we noticed distinct time patterns correlating with lunch breaks and productivity levels. For instance, we see peaks in activity just before and after lunch as developers commit work in progress before stepping away from their desks.
00:07:45.730 One of the most significant performance improvements came from using Docker containers for running tests. The configuration for each test environment is done once during the container build, allowing instances to immediately execute tests once the container is up and running.
00:08:34.680 We used Docker's distribution API, which means we can deploy containers wherever necessary as long as we indicate where the registry is. Over time, we've outgrown standard Docker Compose files and developed our internal build system, named Locutus, using the Docker build API.
00:09:01.370 In its initial iteration, Locutus ran on a single EC2 instance that faced challenges due to accumulating technical debt over the span of Shopify's ten-year-old codebase. We struggled particularly with issues like compiling assets and the requirement of a MySQL connection for distributing tests.
00:09:42.170 To work around this, we decided that every container would execute a set of tests, with an offset based on its index. This led to challenges since Ruby tests were faster but numerous, while browser tests took longer. This differentiation created a bottleneck upon completion of testing; Ruby tests would complete quickly, leaving longer-running browser tests to delay final build times.
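Index-based splitting of this kind can be sketched in a few lines. The method name `tests_for_container` and the test names are hypothetical, and the exact offset scheme used at Shopify is not given in the talk. This version gives each container every `total`-th test, starting at its own index.

```ruby
# Return the tests that the container with `index` should run, out of
# `total` containers: every `total`-th test, starting at offset `index`.
def tests_for_container(all_tests, index, total)
  all_tests.each_with_index
           .select { |_test, position| position % total == index }
           .map(&:first)
end

tests = %w[cart_test checkout_test login_test search_test theme_test]

2.times do |index|
  puts "container #{index}: #{tests_for_container(tests, index, 2).join(', ')}"
end
```

The split needs no shared state, so containers do not have to talk to a database or to each other. It also ignores how long each test takes, which is how a container that draws the slow browser tests ends up holding back the whole build.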
00:10:14.700 As the CI process concluded, the agents on the boxes would retrieve and upload artifacts to S3. Additionally, we had an out-of-band service that received webhooks from Buildkite, transferring artifacts into Kafka, helping us analyze test failures and the stability of various code areas.
00:10:52.130 This structure was our first iteration, which we later expanded upon. Subsequently, as changes to Docker were introduced, we had to adapt our system to incorporate a second provider. Running two CI systems in parallel created a lot of confusion among developers due to conflicting statuses that made it hard to determine which system to trust.
00:11:23.350 Despite there being a green status for one system, builds often failed due to underlying issues with the other, affecting overall developer confidence. Eventually, we switched to the new system completely, gradually phasing out the old CI system.
00:12:01.849 When we outgrew our original Locutus instance, we headed back to the drawing board to create something more scalable. We focused on keeping our system as stateless as possible, which led to designing a new version of Locutus that utilized a coordinator instance for distributing work.
00:12:30.389 Now, each worker would perform tasks for specific repositories, reducing single-instance load under high traffic situations. However, we faced a challenge: losing cache frequently resulted in longer build times, leading to the need for more efficient caching methods.
00:13:00.480 For our third iteration, we worked on enhancing stability within our testing framework, particularly addressing issues with test failures occurring due to race conditions. If a container pulled a test from the queue and failed, there was a risk that no one would be aware since the overall build could still report green statuses. To combat this, we would re-run failed tests and verify that all tests had been executed correctly.
00:13:46.570 Today, the current iteration of our Buildkite-based CI system reflects numerous improvements made based on our previous experiences. In conclusion, if your build times are under ten minutes, don’t attempt to build your own CI system. The time spent organizing and optimizing a CI system can outweigh the benefits if your build system is already functioning efficiently.
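The safeguard described here, checking that every test pulled from the queue was actually run, can be illustrated with a small in-memory model. The `TestQueue` class and its method names are hypothetical. A real system would keep this state in a store shared by the workers, and the talk does not say how Shopify did so.

```ruby
require "set"

# Hypothetical in-memory stand-in for a shared test queue.
class TestQueue
  def initialize(tests)
    @all_tests = tests.dup
    @pending   = tests.dup
    @completed = Set.new
  end

  # A worker takes the next test, or nil when the queue is empty.
  def pop
    @pending.shift
  end

  # A worker reports that a test ran to completion (pass or fail).
  def report(test)
    @completed << test
  end

  # Put a test back, for example when the container that took it died.
  def requeue(test)
    @pending.push(test)
  end

  # Tests that never reported back.
  def missing
    @all_tests.reject { |test| @completed.include?(test) }
  end

  # The build may only go green if every test was actually executed.
  def all_executed?
    missing.empty?
  end
end

queue = TestQueue.new(%w[cart_test checkout_test login_test])
queue.report(queue.pop) # cart_test runs normally
lost = queue.pop        # checkout_test is taken, then the container dies
queue.report(queue.pop) # login_test runs normally

puts queue.all_executed? # false: checkout_test never reported back
queue.requeue(lost)
queue.report(queue.pop)
puts queue.all_executed? # true
```

Comparing the set of reported tests against the full list is what stops a build from going green when a container silently dropped work.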
00:14:29.590 It took us a significant amount of time to navigate through this project, as many people were working on it over the course of months. Often, the issues lie not within compute capacity, but stem from configuration problems where optimizations can yield significant gains.
00:14:52.110 If your build times exceed fifteen minutes, that’s when you should consider starting your own CI system. Having a monolithic application with a highly variable codebase and being at the limits of what your current provider can handle are strong indicators that a custom solution could be beneficial.
00:15:20.520 When choosing to build your CI system, ensure you commit fully to the transition. We learned the importance of not falling down rabbit holes and being realistic about timelines—projects like this often take longer than initially anticipated.
00:15:41.310 Additionally, treat your infrastructure as cattle rather than pets—this adjustment can save you considerable time and reduce headaches associated with managing server failures.
00:16:00.600 Finally, our experience confirmed the importance of efficiently managing test distributions and recognizing test flakiness issues, ultimately fortifying our CI process moving forward. Thank you for your time.
00:17:03.960 If anyone has questions about optimizing our code base or if we faced challenges just focusing on the CI systems, I'm here to clarify any confusion and share our learnings. We found that implementing parallelization helped immensely, even when faced with test flakiness that required significant efforts to mitigate.