00:00:07.910
Hey everyone, thanks for coming. I'm not going to talk about cryptography today.
00:00:14.150
The talk I want to give today is called 'How to Stay Alive Even When Others Go Down.' I want to discuss how to build resilient applications and how to test that resiliency.
00:00:26.420
Originally, the title was 'Testing in Ruby,' but after some discussions I realized the content wasn't really specific to Ruby, so I kept it a bit more general.
00:00:38.120
The goal of this talk is to help you understand how your application behaves when something goes wrong, like when something's on fire or a service goes down.
00:00:48.930
You should know exactly what will happen without freaking out or encountering any unknowns. The main takeaway is that you want to gain confidence in your application's behavior as early as possible during testing rather than only in production, where it’s usually too late.
00:01:08.460
For background, I work at Shopify on one of the infrastructure teams. I'm not going to discuss anything specific to Shopify; instead, I'll draw on it for examples to illustrate what I'm talking about.
00:01:19.350
So, when I say 'How to Stay Alive Even When Others Go Down,' who do I mean by 'others'? This can be anything that is not the application you are looking at right now. If you have a monolithic application, it’s every other application besides that.
00:01:37.500
If you have a service-oriented architecture, you're going to look at every single service.
00:01:44.159
From this point of view, 'others' just means all the other services you have. It could be a third-party service, maybe it's S3 or an API you're integrating with, or some part of your core infrastructure that's not actually your application.
00:02:10.259
For context, we still mostly run a monolithic Rails application. It's huge, with some smaller services around it, but it's mostly one application. If you find yourself in the same situation, you might think this talk doesn't really apply to you.
00:02:24.819
For the purpose of this talk, I want to redefine what most people think of as a service. When I say 'service,' I mean any logical unit of your application that's not necessarily a different process. It might be part of the same application but could just be an abstraction or a class.
00:02:49.630
This class might have a backend that is some datastore, or it might be doing some network communication, etc. It doesn’t have to be a separate service in the microservice architecture sense.
00:03:01.930
A couple of examples of services that might fit my definition include the database, a caching server, a session store, an authentication system, or somewhere you store your assets. Everything that's not part of the core application I would consider a service.
00:03:20.440
The goal of this talk is to share a couple of design patterns, best practices, and things we learned at Shopify that made a significant difference to the stability of our platform.
00:03:57.310
These patterns were crucial for us because about a year ago, we experienced a lot of outages due to poor design and things being inappropriately coupled together. We went through a huge cleanup initiative and stumbled upon these patterns, which we tried out.
00:04:21.419
Today, I want to discuss what they mean and how you can use them yourself. The one I want to emphasize, which is highlighted in green, is the easiest one to implement.
00:04:43.530
If you don’t remember anything else I'm talking about, this is the one to remember because it’s super easy to do and has great returns. Testing is highlighted in red because it is, in the long term, the most valuable aspect.
00:05:02.820
To begin, let's talk about timeouts. Most of you probably know what I mean by timeouts. To understand why they are important, let's look at a couple of performance metrics that are crucial for web applications.
00:05:13.620
If you run a web application, you've probably heard the term 'capacity'. At Shopify, we have an internal website where every developer can deploy code to the production servers. Before pressing the deploy button, developers are asked to check a set of graphs of these performance metrics.
00:05:39.150
If anything changes after the deploy, then it’s important to take note. The two main metrics are response time and throughput. Response time indicates how long it takes to process a single request, while throughput describes how many requests can be handled in a given time.
00:06:03.180
Many of you may be using an architecture similar to the Nginx and Unicorn setup. To understand why timeouts are essential, it’s important to note how Unicorn works.
00:06:30.150
When you start Unicorn, it spawns a fixed number of worker processes; that number does not change with system load or with how many requests come in. If all of those workers are occupied, no new workers are spawned, and requests start to queue.
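To make that concrete, here is a minimal sketch of a typical Unicorn configuration; the numbers are placeholders rather than recommendations, but they show that the worker pool is a fixed size and that the per-request timeout is something you choose up front.

```ruby
# config/unicorn.rb -- illustrative values only
worker_processes 16   # a fixed pool: at most 16 requests can be served concurrently
preload_app true      # load the app once in the master, then fork the workers

# Unicorn kills any worker that spends longer than this many seconds on a
# single request; the default is 60, which is usually far more than you can afford.
timeout 15
```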
00:06:59.939
If all workers are busy, additional requests have to wait until a worker frees up. You can picture how requests queue up and take longer and longer to handle once the system is under strain.
00:07:22.650
In situations of high load, making requests slower will ultimately degrade throughput, which means the maximum number of requests you can handle will decline.
00:07:36.240
Capacity can be loosely defined as the maximum throughput you can handle at a reasonable response time. This became very visible for us when someone ran a data-migration maintenance task that used an inefficient Redis command, and the system became unresponsive.
00:08:00.599
Requests stalled for about two minutes because the entire Redis instance was blocked, and our capacity was badly degraded.
00:08:19.409
Even though this example is somewhat specific to the Unicorn worker model, it generally applies to any web server or background queue system.
00:08:31.589
You will always be limited by your number of workers and system resources. This highlights the crucial importance of having timeouts.
00:08:50.389
The concept of timeouts is to fail fast. Failures can happen in various ways, but the best scenario in an outage is for the connection to be refused quickly.
00:09:02.120
When connections take a long time to fail, they do more damage than necessary, which is why setting appropriate timeouts is so valuable.
00:09:26.240
Taking a look at Ruby applications, many Ruby gems do not have default timeouts configured. For instance, the Unicorn web server allows requests to take up to 60 seconds by default.
00:09:45.259
This means during that time, the worker cannot handle any new requests. Some gems, like the Redis gem, ship with no default timeout at all, which can lead to issues.
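As a sketch of what setting timeouts explicitly can look like, here is how you might configure the redis-rb client and a Net::HTTP call; the values and host names are placeholders you would tune to your own baseline.

```ruby
require "redis"
require "net/http"

# Set explicit timeouts rather than relying on whatever the gem's defaults are.
redis = Redis.new(
  host: "localhost",
  connect_timeout: 0.2, # seconds to establish the TCP connection
  timeout: 0.5          # seconds to wait for a reply
)

# Net::HTTP defaults to a 60-second read timeout; tighten it for external calls.
http = Net::HTTP.new("api.example.com", 443)
http.use_ssl = true
http.open_timeout = 1   # seconds to open the connection
http.read_timeout = 2   # seconds to wait for the response body
```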
00:10:07.960
On a Linux system, without an application-level timeout you end up waiting for the kernel to eventually give up on the connection, which can take a very long time. This is especially problematic with Redis, because it's a single-threaded server.
00:10:27.920
To address this, you should instrument your code. One approach is to use StatsD or similar monitoring tools to get insight into your performance baseline and to lower timeouts to levels you can afford.
00:10:55.500
Once you know your baseline is, say, five seconds, there's no reason to allow 60. Requests that deviate far from that baseline just eat into your capacity.
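As an example of what that instrumentation might look like, here is a small sketch assuming the statsd-instrument gem; the class and metric names are made up.

```ruby
require "statsd-instrument"

# Measure how long calls to an external service take so you can establish a
# baseline before tightening timeouts. Metric and class names are illustrative.
class PaymentGateway
  def charge(order)
    StatsD.measure("payment_gateway.charge") do
      # ... the actual network call to the gateway ...
    end
  end
end
```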
00:11:19.190
Sometimes prolonged executions are legitimate, so consider moving long-running requests to background jobs.
00:11:45.490
In background processing, similar principles apply because you are still limited by the number of workers in your queue system, though it’s not as noticeable to users as slow web requests.
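Here is a hedged sketch of what moving such work into a background job can look like, using Active Job; the class, queue, and model names are hypothetical.

```ruby
# A long-running export moved out of the web request into a background job.
class ProductExportJob < ActiveJob::Base
  queue_as :low_priority

  def perform(shop_id)
    shop = Shop.find(shop_id)
    # ... the slow work that used to block a web worker for minutes ...
  end
end

# In the controller, enqueue instead of doing the work inline:
#   ProductExportJob.perform_later(shop.id)
```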
00:12:06.500
Another example is Redis commands that are time-consuming or costly. If you run inefficient commands, performance can degrade across your whole architecture.
00:12:42.210
In some cases, especially with services like Elasticsearch, if there isn't enough memory available you can run into significant garbage-collection pauses.
00:13:06.360
It's essential to think of a slow service as just another kind of failure; it can often do more damage than a service that is completely down.
00:13:22.180
Networking problems are also noteworthy: packet loss or saturation can create further service disruptions. This is a common concern we faced at Shopify and could have been avoided with stricter timeouts.
00:13:54.210
The next pattern I want to talk about is the circuit breaker pattern. The main idea is if you have an external service and you're encountering failures, it's likely that the service will continue to fail.
00:14:11.370
The circuit breaker pattern is designed to temporarily stop communication with a service that is failing and give it time to recover. This is implemented by keeping track of the number of errors within a specified timeframe.
00:14:32.330
Once this error count exceeds a certain threshold, you can switch to an 'open' state and stop sending new requests to the failing service.
00:14:49.090
This behavior can be visualized similarly to an electrical circuit fuse that prevents overheating.
00:15:08.000
Each service should maintain its own independent circuit state; the errors from one service shouldn’t break the circuit of another service.
00:15:25.400
The circuit breaker pattern consists of three states—'closed' (everything is fine), 'open' (requests are failing), and 'half-open' (testing if the service has recovered).
00:15:45.400
You can implement the pattern at the driver level, meaning that the application code does not need to be aware of the circuit breaker; it's taken care of transparently.
00:16:10.640
After a defined timeout, the circuit switches from 'open' to 'half-open' and lets a request through; if that request succeeds the circuit closes again, and if it fails the circuit goes back to 'open'. That way you only resume normal traffic once the service has properly recovered.
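To show how little code this takes, here is a minimal, single-threaded sketch of those three states; the thresholds are placeholders, and a production version would need thread safety and one instance per service.

```ruby
# Minimal three-state circuit breaker: closed -> open after too many errors,
# open -> half-open after a cool-down, half-open -> closed again on success.
# Single-threaded sketch; thresholds are placeholders.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(error_threshold: 5, recovery_timeout: 30)
    @error_threshold  = error_threshold
    @recovery_timeout = recovery_timeout
    @error_count      = 0
    @opened_at        = nil
  end

  def call
    raise OpenError, "circuit is open" if open?
    result = yield
    reset # a successful request closes the circuit again
    result
  rescue OpenError
    raise
  rescue StandardError
    record_error
    raise
  end

  private

  def open?
    return false unless @opened_at
    # After the cool-down we are "half-open": let one request through to probe.
    Time.now - @opened_at < @recovery_timeout
  end

  def record_error
    @error_count += 1
    @opened_at = Time.now if @error_count >= @error_threshold
  end

  def reset
    @error_count = 0
    @opened_at   = nil
  end
end

# Usage: each service gets its own breaker instance.
#   breaker = CircuitBreaker.new
#   breaker.call { redis.get("cart:#{customer.id}") }
```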
00:16:34.900
Next is the idea of failing gracefully. When we experienced outages, our instinct was to fix every individual issue, but it's crucial to accept that some failures are inevitable and prepare for them.
00:16:52.990
The focus should be on degrading user experience rather than breaking it completely. It’s essential to understand that there are often reasonable fallbacks for errors in any application.
00:17:18.500
Using the example of our Shopify storefront, if one service fails, instead of breaking everything, we can hide specific elements or show generic data to maintain continuity.
00:17:46.340
For instance, if the cart service is down, we display a total of zero dollars rather than breaking the entire checkout experience.
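A minimal sketch of that kind of fallback, with a hypothetical CartService client:

```ruby
# Degrade instead of break: if the cart backend is unreachable, show a $0 cart
# and keep the rest of the storefront working. CartService is hypothetical.
def cart_total(customer)
  CartService.total_for(customer)
rescue CartService::ConnectionError, Timeout::Error
  0 # rendered as $0.00 in the template; the page still loads
end
```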
00:18:07.980
Moreover, a recommendation system can display generic recommendations if personalized ones are unavailable.
00:18:24.440
Thus, gracefully degrading the user experience always trumps breaking it, and in practice it's rare for many services to fail at exactly the same time, so there's usually something reasonable to fall back on.
00:18:40.080
The next topic is concurrency control: limiting how many processes hit the same shared resource at once helps prevent unnecessary overload.
00:19:01.920
If several applications are hitting a single database, quick overload can occur, leading to cascading issues.
00:19:25.420
To counteract this, implement a semaphore system to limit the number of requests to the database to the threshold it can handle.
00:19:47.280
This will isolate issues in smaller areas rather than have one slow query affect multiple applications.
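Here is a counting-semaphore sketch of that idea. It is in-process only, so across Unicorn workers the count would have to live somewhere shared (SysV semaphores, Redis, and so on); the limit and the query helper are made up.

```ruby
# Counting semaphore: at most `limit` callers may hold a ticket at once.
# In-process only; across worker processes the count must live in something
# shared (SysV semaphores, Redis, ...). Purely illustrative.
class Semaphore
  def initialize(limit)
    @limit   = limit
    @count   = 0
    @mutex   = Mutex.new
    @condvar = ConditionVariable.new
  end

  def acquire
    @mutex.synchronize do
      @condvar.wait(@mutex) while @count >= @limit
      @count += 1
    end
    begin
      yield
    ensure
      @mutex.synchronize do
        @count -= 1
        @condvar.signal
      end
    end
  end
end

# Allow at most five concurrent expensive reporting queries against the database.
REPORT_QUERY_SEMAPHORE = Semaphore.new(5)
REPORT_QUERY_SEMAPHORE.acquire { run_expensive_report_query } # helper is hypothetical
```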
00:20:01.610
To make systems fail gracefully and keep them operational, we need to monitor performance consistently and limit the impact failures have on availability.
00:20:19.269
It helps to start with a contained area: focus on the critical path, the part of the application that sees the most interaction.
00:20:38.080
An application can have specific areas that, if down, will result in significant losses, so tackle those first.
00:20:55.640
Be careful with deployments, because they can be a big pain point. You want to be able to roll code forward and back smoothly even while a service is down.
00:21:13.780
When introducing new code, avoid creating dependencies on other services in the initialization sequence, so the application can still boot and deploy when those services are unavailable.
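Here is a small sketch of the difference, with a hypothetical feature-flag service: the risky version calls it during boot, the safer one resolves it lazily and falls back to a default.

```ruby
# Risky: runs at boot, so the app cannot start, deploy, or roll back while the
# flag service is down.
#
#   # config/initializers/feature_flags.rb
#   FLAGS = FeatureFlagService.fetch_all!
#
# Safer: resolve lazily on first use and fall back to a default on failure.
module Flags
  def self.enabled?(name)
    @flags ||= FeatureFlagService.fetch_all!
    @flags.fetch(name, false)
  rescue FeatureFlagService::ConnectionError, Timeout::Error
    false # sensible default while the service is unreachable
  end
end
```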
00:21:35.740
To systematically analyze resiliency and avoid overwhelming yourself, creating a resiliency matrix can provide a clear visual guide to assess risk levels.
00:21:55.070
The matrix lists the sections of your application on one axis and the services they depend on on the other; for each combination you write down what you expect to happen when that service fails, which shines a light on risks you hadn't seen.
00:22:15.820
By mapping out dependencies and associated risks, you can prioritize and effectively mitigate weaknesses.
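A purely hypothetical slice of such a matrix might look like this: rows are sections of the application, columns are services they depend on, and each cell records what you expect to happen when that service is down.

```
              Redis (sessions)              MySQL                   CDN / assets
Storefront    visitors browse logged out    error page              unstyled but usable
Checkout      works, cart rebuilt later     checkout unavailable    works
Admin         admins are logged out         error page              works
```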
00:22:34.900
Treat resiliency as something your test suite covers like any other behavior; that gives you a foundation for discovering weaknesses in your codebase before they hit production.
00:22:57.061
On top of thorough coverage, a bit of Ruby metaprogramming can cut down the boilerplate: small helpers make it much more streamlined to check that code adheres to your resiliency principles.
00:23:20.600
When such tests fail, they usually point at service calls that don't check for errors; fixing those calls is exactly the work that makes the application more resilient.
00:23:45.240
Lastly, having a dedicated testing environment that mirrors production can provide a practical means to test application responses under duress.
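As a sketch of what one of these resiliency tests could look like, here is a Minitest example that simulates an outage by swapping in a client that always fails; the Storefront class and its interface are hypothetical.

```ruby
require "minitest/autorun"

# Resiliency test straight from the matrix: "the storefront still renders when
# the recommendation service is down." Storefront and its interface are
# hypothetical; in a Rails app this would be an integration test against a route.
class StorefrontResiliencyTest < Minitest::Test
  def test_storefront_renders_without_recommendations
    # Simulate the outage by injecting a client that always fails.
    failing_client = Object.new
    def failing_client.recommendations_for(*_args)
      raise Timeout::Error
    end

    storefront = Storefront.new(recommendation_client: failing_client)
    html = storefront.render

    assert_includes html, "Products"             # the core page still works
    refute_includes html, "Recommended for you"  # the extra section is hidden
  end
end
```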
00:24:09.180
Now, once all the tests pass and the issues have been addressed, the final step is chaos testing. The idea is to check how your application holds up under real failures.
00:24:32.390
You can achieve this manually at first. For instance, purposely shut down a service you're confident is resilient to observe how your application responds.
00:24:51.990
Once you start to feel confident in different failure scenarios, you can automate this process, just as Netflix does by randomly disabling services to test fault tolerance.
00:25:13.760
I want to encourage everyone here to pick one takeaway from this discussion and experiment with it in your applications.
00:25:34.490
Timeouts are an easy fix that can yield significant improvements, and you can start adding them today.
00:25:54.560
If you're interested in circuit breakers, implementing one is a straightforward exercise, perhaps less than 100 lines of Ruby. And if you maintain an application, create a resiliency matrix for it.
00:26:16.430
Fill in the assumptions of behavior and then verify those assumptions to ensure that your application will react appropriately to failures.
00:26:39.390
If you’re still engaged, learn more about Ruby metaprogramming or work on building tests to handle generates to-do lists for resiliency upgrades.
00:26:59.970
I’d recommend looking into the book 'Release It!' which covers many principles discussed today, particularly from a Java-centric perspective but is broadly applicable.
00:27:20.880
Also, check out Netflix's tech blog for insights around resiliency testing as they provide valuable lessons that can inspire implementations in Ruby.
00:27:43.030
For concurrency control ideas, explore our implementation called Semaphore, which prevents overload by managing shared resource access.
00:28:02.270
This will not only assist in improving your code but also ensure everyone has a shared understanding of the system architecture.
00:28:18.620
Thank you very much for being here today, and I hope you find the resilience concepts discussed valuable for your applications!