00:00:13.980
My name is Simon, and I work in Site Reliability Engineering at Shopify, on reliability, performance, and infrastructure. Today, I want to talk about how we built a large application with many moving parts, and how, even as those parts fail, the system as a whole stays up most of the time. We've learned a lot from this experience, and I want to share some resources and vocabulary that will help you reason about these challenges.
00:00:24.630
Shopify is a company that helps people sell things; we make commerce easy, whether people sell online, in their brick-and-mortar stores, on Pinterest, on Facebook, or pretty much anywhere else. As the company grows, a significant amount of financial traffic flows through us, and as many of you may know, more money can lead to more problems. Today, I want to discuss what we should consider when building these large systems. Nowadays, building distributed systems is the default. We are increasingly using the cloud, which means relying on hardware and components, such as routing and networks, that we cannot control.
00:01:02.910
We now face a new reality: we have to build systems from numerous components that we don't control, and these components can fail. This is becoming even more relevant with the rise of microservices and modern architectures that introduce more components. This also means your job now includes managing the relationships between these various services. Today, I will discuss how you can make these systems reliable. This has been the most significant win for my team during my two years here, as we now have confidence in understanding what happens when different components fail or become slow.
00:01:32.160
We have developed much greater awareness of our system's behavior, allowing us to reason about it effectively. This newfound understanding has also helped us sleep better at night. One critical aspect of my job is preparing for high-traffic events like Black Friday and Cyber Monday. This is a crazy time for us, marked by a substantial surge in traffic, and some of our stores may even conduct flash sales that double or triple our regular traffic.
00:02:00.210
Every year around this time, my team discusses our preparations for these events. This year and last year, we experienced numerous embarrassing failures; things were failing left and right, and there were times when our entire system went down due to failures in what seemed like trivial components. We realized that we lacked a comprehensive overview of the relationships between our services. So, we sat down as a team of five to seven and brainstormed ways to tackle this issue.
00:02:38.280
This brings me to the topic of resiliency, which is the practice of building a reliable system out of many unreliable components. If a single service fails or slows down, it should not compromise the availability or performance of the entire system. To achieve this, you need loosely coupled components that can function independently, ensuring the reliability of your infrastructure as a whole. If you fail to build this structure, your uptime suffers from what you might call the microservice equation: if each of your n services has availability a, a request that touches all of them succeeds with probability roughly a^n, so availability degrades exponentially as you add services.
00:03:37.740
Even with services that achieve four nines of availability, once you reach 10 to 100 services, you can quickly find yourself facing several days of downtime per year. This scenario represents a particularly dire situation that you might not currently face, but it's critical to understand it's a risk worth considering. At the end of the day, your system is only as strong as its weakest link, and it can only remain resilient if you actively monitor and manage your failure points.
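To put numbers on that equation, here is a quick sketch: each of n independent services succeeds with probability a, so a request touching all of them succeeds with probability a**n, and the four-nines/100-services case works out to several days of downtime a year.

```ruby
# Compound availability: a request that must touch n independent
# services, each available with probability a, succeeds with
# probability roughly a**n.
def compound_availability(a, n)
  a**n
end

# Yearly downtime in days implied by a given availability.
def downtime_days(availability)
  (1 - availability) * 365
end

four_nines = 0.9999
compound_availability(four_nines, 10)                 # ~0.9990
downtime_days(compound_availability(four_nines, 100)) # ~3.6 days per year
```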
00:04:19.560
Many developers might convince themselves that having a monolithic application makes them immune to these problems, especially if it has few external dependencies. However, Shopify, as proud as we may be of our monolith, is not exempt from these issues. We still have numerous dependencies, such as backing relational stores and various key-value stores. We also interact with payment gateways and APIs, and we send emails through external CRM systems. There are easily tens, if not over a hundred, dependencies that we do not control, and each can affect overall system performance.
00:05:05.040
Now, let's discuss fallbacks. When you're dependent on an external service, there can be times when the data you rely on is unavailable, for example, if you're browsing Netflix and the page needs to retrieve star ratings, but that specific service goes down. In this situation, you have two options: you can either fail the entire page, which is the default behavior in high-level languages like Ruby, or you can implement a reasonable fallback. For instance, displaying five gray stars until the rating service resumes.
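That star-rating fallback can be sketched in a few lines of Ruby. `RatingService` is a hypothetical client for the ratings backend, stubbed here to always fail so the fallback path is exercised:

```ruby
require 'timeout'

# Hypothetical ratings client, stubbed to simulate an outage.
module RatingService
  def self.fetch(_movie_id)
    raise IOError, "ratings backend unreachable"
  end
end

# Neutral placeholder: gray stars instead of a failed page.
GRAY_STARS = { stars: nil, placeholder: true }.freeze

def rating_for(movie_id)
  RatingService.fetch(movie_id)
rescue Timeout::Error, IOError
  GRAY_STARS # degrade gracefully instead of failing the whole page
end

rating_for(42) # => { stars: nil, placeholder: true }
```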
00:05:49.500
This fallback is preferable because users can still browse your offerings instead of experiencing complete service failure due to one component's unavailability. Let me illustrate how we implement this. In our application, a store might consist of various services, such as search, session storage, shopping carts, and CDN dependencies, all interlinked. If any of these dependencies fail, such as the session storage, we could end up presenting an HTTP 500 error to the user. This situation breaks our application as it results in a complete outage even though the session component may not be essential for storefront operations.
00:06:56.790
When the session storage goes down, instead of showing an error, we might sign the user out but still allow them to browse, checkout, and add items to their carts without experiencing disruption. This way, customers continue having a pleasant experience with the storefront, and merchants are oblivious to any downtime. Moreover, the infrastructure team feels less stressed when they get notifications about such failures because the application is still able to cope with the situation.
00:07:39.060
We can examine each of our dependencies to make sure that our application behaves well, even when one or more components are down. For instance, if the cart service is unavailable, we could modify our application to restrict users from adding items to their carts but still allow them to browse the store. An advanced fallback could involve storing items in local storage until the service is back up.
00:08:01.560
The resilient code might look something like this: we fetch user data from the session layer, but if that layer is unreachable, we make sure the code does not raise and instead treats the user as signed out. With just a couple of lines of code, we can make our application significantly more resilient. You can apply this logic broadly: if a service fails, return an empty data structure instead of letting an exception bubble up to the user.
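One way to generalize that pattern is a small helper that returns a caller-supplied "empty" value whenever a dependency is unreachable. This is a sketch, not Shopify's actual code:

```ruby
require 'timeout'

# Run a block against a dependency; on connection-level failures,
# return the supplied empty value instead of raising.
def with_fallback(empty)
  yield
rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT, Timeout::Error, IOError
  empty
end

# Session store down: treat the visitor as signed out.
user = with_fallback(nil) { raise Errno::ECONNREFUSED } # simulated outage
# Cart service down: render an empty cart rather than a 500.
cart = with_fallback([]) { raise IOError }              # simulated outage
```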
00:09:02.580
In the Ruby community, many developers might ask how to test this resiliency. A common approach is to use mocks where you simulate behavior of various data sources. However, this can quickly become unwieldy, especially if you're trying to test different drivers for different databases like MySQL or Redis. Faking the entire environment may lead to oversights, making it difficult to emulate real-world behavior adequately.
00:09:49.590
One solution is to introduce tests in production through controlled chaos, akin to the chaos monkey concept, which helps identify potential weaknesses in our systems, but testing solely in this manner can be challenging. Instead, we need a middle ground: we want to recreate failure scenarios without negatively affecting the production environment. What we did at Shopify was build a TCP-level proxy called Toxiproxy, which lets us simulate various types of failures seamlessly.
00:10:34.650
Toxiproxy allows us to introduce artificial latency and other failure modes into our service calls without directly changing our codebase. This has helped us uncover numerous bugs within our dependency chains. With Toxiproxy, our developers now routinely run tests against failing services, verifying resiliency before the code goes live to users.
00:11:13.790
After implementing this tool, we built an interface in our admin where developers can easily adjust settings to simulate different failure scenarios. They can artificially slow down or kill connections to test how resilient the overall application is under stress. This testing has proven incredibly beneficial in revealing hidden weaknesses that we previously overlooked.
00:12:09.240
Prior to this change, we could only describe our concerns and hope to design workarounds. Now, it has transformed the way we think about our applications' resiliency. For the session storage fiasco, for instance, we write a test that instructs Toxiproxy to kill connections to specific services and checks how our UI reacts in these situations.
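Such a test might look roughly like this with the Toxiproxy Ruby client. This is a sketch: it assumes a running toxiproxy server with proxies named `session_store` and `redis` configured, and a Rack::Test-style `get`/`last_response` helper in the test class.

```ruby
require 'toxiproxy'
require 'minitest/autorun'

class StorefrontResiliencyTest < Minitest::Test
  def test_storefront_stays_up_when_sessions_are_down
    # Sever connections to the session store for the duration of the block.
    Toxiproxy[:session_store].down do
      get '/products'
      assert_equal 200, last_response.status
      assert_includes last_response.body, 'Sign in' # user appears signed out
    end
  end

  def test_storefront_tolerates_slow_redis
    # Add one second of downstream latency on the Redis proxy.
    Toxiproxy[:redis].downstream(:latency, latency: 1000).apply do
      get '/products'
      assert_equal 200, last_response.status
    end
  end
end
```

The key property is that the failure injection happens at the TCP level, so the application code under test is exactly what runs in production.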
00:12:26.640
Building a resiliency matrix helps us understand which components are critical and how their absence could affect the entire system. We created a matrix that lists all our services and their dependencies, noting how they influence overall availability under different failure scenarios. During this process, we discovered significant gaps in our application architecture that needed addressing.
00:13:22.860
While many cells indicated operational readiness, we identified others that resulted in failures. The team's mission then became to rectify these failed components and ensure they became successes. This journey of discovery improved not only our application's architecture but allowed us to find bugs in our own code as well as within the frameworks we depended on.
00:14:22.710
As an example, we discovered issues with middleware that handles connections when accessing the database. Even pages that do not interact with the database could still fail due to a lack of connection management within the middleware. Fixing this issue is one of the many adjustments we have made to promote greater resiliency.
00:15:52.590
We continuously evaluate our external data stores, like Redis, to ensure we are resilient to dependency failures. Instead of tightly coupling our system to any individual service or dependency, we have to prepare for potential failures throughout the entire system. This has led us to centralize error handling and implement strategies like returning empty data structures or default values instead of letting errors ripple through the application.
00:17:05.340
Another critical concept to understand is Little's Law from queueing theory: the average number of requests in a system equals the arrival rate times the average time each request spends there (L = λW). With a fixed pool of workers, that means throughput is capped at L/W, so as response time grows, throughput falls. For instance, if our web server can only serve one request at a time and one of our data stores is slow, a backlog builds up and responsiveness degrades across the entire system. We need to establish robust timeout and response handling strategies so that slow components do not drag down the entire service.
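As a back-of-the-envelope sketch, rearranging Little's Law for a fixed worker pool shows why a slow dependency is so damaging: pushing average response time from 100 ms to 2 s cuts the same pool's throughput twentyfold.

```ruby
# Little's Law: L = lambda * W
#   L      -> requests concurrently in the system (here: worker slots)
#   lambda -> throughput in requests per second
#   W      -> average time a request spends in the system, in seconds
# Rearranged for a fixed worker pool: lambda = L / W.
def max_throughput(workers, avg_response_time_s)
  workers / avg_response_time_s
end

max_throughput(10, 0.1) # => 100.0 req/s at 100 ms responses
max_throughput(10, 2.0) # => 5.0 req/s once a slow dependency pushes W to 2 s
```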
00:18:55.650
Both timeouts and effective circuit breaker patterns work together to enhance our application's ability to fail faster and recover quicker from unexpected interruptions. The circuit breaker pattern allows us to identify when a service is consistently failing and can allow us to cease trying to communicate with it until a recovery is confirmed. It's essential to validate that the components can withstand failures, and to support better decision-making over how services should interact with one another under stress.
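A minimal circuit breaker can be sketched like this (illustrative only, not Semian's actual implementation): after a threshold of consecutive failures the circuit opens and calls fail fast; once a timeout passes, one trial call is let through, and a success closes the circuit again.

```ruby
# Minimal circuit breaker sketch (illustrative; not Semian's code).
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, timeout: 10)
    @threshold = threshold
    @timeout   = timeout
    @failures  = 0
    @opened_at = nil
  end

  def call
    if open? && (Time.now - @opened_at) < @timeout
      raise OpenError, "circuit open, failing fast"
    end
    result = yield
    @failures  = 0          # a success closes the circuit
    @opened_at = nil
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end

  def open?
    !@opened_at.nil?
  end
end
```

Failing fast matters because of the queueing math above: a request that errors in a millisecond frees its worker, while one that hangs for the full network timeout holds a slot the whole time.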
00:20:01.890
Bulkheads shape how we distribute load across our backend services by limiting the number of requests any single dependency can tie up at one time. This keeps the rest of the application functional while containing slowdowns, giving our team a more resilient architecture under varying loads.
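A bulkhead can be sketched as a cap on in-flight calls to one dependency (illustrative, in-process only; Semian does this differently, sharing the limit across worker processes):

```ruby
# Bulkhead sketch: cap in-flight calls to a single dependency so a slow
# backend cannot tie up every worker at once.
class Bulkhead
  class FullError < StandardError; end

  def initialize(limit)
    @limit     = limit
    @mutex     = Mutex.new
    @in_flight = 0
  end

  def acquire
    @mutex.synchronize do
      raise FullError, "too many in-flight calls" if @in_flight >= @limit
      @in_flight += 1
    end
    begin
      yield
    ensure
      @mutex.synchronize { @in_flight -= 1 }
    end
  end
end

redis_bulkhead = Bulkhead.new(5) # at most 5 workers talk to Redis at once
redis_bulkhead.acquire { :redis_call_here }
```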
00:21:00.000
We continue testing and exploring implementations of circuit breakers and bulkheads across our services. By taking advantage of each of these patterns, your applications will be increasingly prepared for unexpected failures. As your company grows, consider the need for each tool and evaluate your current architecture against any arising challenges.
00:21:49.260
While some of you may feel you do not need this level of resiliency until faced with specific challenges, I encourage you to embrace these patterns early on. Adopting concepts such as circuit breakers and bulkheads now can have significant long-term benefits as your systems evolve. They will prepare you for when you inevitably encounter these problems in production.
00:25:08.280
We implemented a library called Semian that incorporates these ideas. Companies like Netflix and Twitter have also created their own libraries focused on enhancing resiliency. These libraries come with comprehensive documentation, which can help increase the reliability of your applications.
00:27:16.260
Netflix developed the Simian Army, a collection of tools designed to test the resiliency of applications in production environments. Implementing Semian, Toxiproxy, or similar technologies will help mature your resilience practices. As your application grows, incorporate these strategies into your infrastructure so you can handle faults gracefully and fortify against unpredictable issues.
00:28:00.100
Finally, I encourage you to document your learnings from these experiences, as we have done at Shopify. The broader community can benefit from your insights. Thank you.