00:00:07.910
Hey everyone, thanks for coming. I'm not going to talk about cryptography today.
00:00:14.150
The talk I want to give today is called 'How to Stay Alive Even When Others Go Down.' I want to discuss how to build resilient applications and how to test that resiliency.
00:00:26.420
Originally, the title was 'Testing in Ruby,' but after some discussions I realized the content wasn't really specific to Ruby, so I kept it a bit more general.
00:00:38.120
The goal of this talk is to help you understand how your application behaves when something goes wrong, like when something's on fire or a service goes down.
00:00:48.930
You should know exactly what will happen without freaking out or encountering any unknowns. The main takeaway is that you want to gain confidence in your application's behavior as early as possible during testing rather than only in production, where it’s usually too late.
00:01:08.460
For background, I work at Shopify on one of the infrastructure teams. I'm not going to discuss anything specific to Shopify; instead, I'll draw on it for examples to illustrate what I'm talking about.
00:01:19.350
So, when I say 'How to Stay Alive Even When Others Go Down,' who do I mean by 'others'? This can be anything that is not the application you are looking at right now. If you have a monolithic application, it’s every other application besides that.
00:01:37.500
If you have a service-oriented architecture, you're going to look at every single service.
00:01:44.159
From this point of view, 'others' just means all the other services you have. It could be a third-party service, maybe it's S3 or an API you're integrating with, or some part of your core infrastructure that's not actually your application.
00:02:10.259
For context, we still mostly run a monolithic Rails application. It's huge, with some smaller services around it, but it's mostly one application. If you find yourself in the same situation, you might think this talk doesn't really apply to you.
00:02:24.819
For the purpose of this talk, I want to redefine what most people think of as a service. When I say 'service,' I mean any logical unit of your application that's not necessarily a different process. It might be part of the same application but could just be an abstraction or a class.
00:02:49.630
This class might have a backend that is some datastore, or it might be doing some network communication, etc. It doesn’t have to be a separate service in the microservice architecture sense.
00:03:01.930
A couple of examples of services that might fit my definition include the database, a caching server, a session store, an authentication system, or somewhere you store your assets. Everything that's not part of the core application I would consider a service.
00:03:20.440
The goal of this talk is to share a couple of design patterns, best practices, and things we learned at Shopify that made a significant difference to the stability of our platform.
00:03:57.310
These patterns were crucial for us because about a year ago, we experienced a lot of outages due to poor design and things being inappropriately coupled together. We went through a huge cleanup initiative and stumbled upon these patterns, which we tried out.
00:04:21.419
Today, I want to discuss what they mean and how you can use them yourself. The one I want to emphasize, which is highlighted in green, is the easiest one to implement.
00:04:43.530
If you don’t remember anything else I'm talking about, this is the one to remember because it’s super easy to do and has great returns. Testing is highlighted in red because it is, in the long term, the most valuable aspect.
00:05:02.820
To begin, let's talk about timeouts. Most of you probably know what I mean by timeouts. To understand why they are important, let's look at a couple of performance metrics that are crucial for web applications.
00:05:13.620
If you run a web application, you've probably heard the term 'capacity'. At Shopify, we have an internal website where every developer can deploy code to the production servers. Before pressing the deploy button, developers are asked to check a set of graphs of these performance metrics.
00:05:39.150
If anything changes after the deploy, then it’s important to take note. The two main metrics are response time and throughput. Response time indicates how long it takes to process a single request, while throughput describes how many requests can be handled in a given time.
00:06:03.180
Many of you may be using an architecture similar to the Nginx and Unicorn setup. To understand why timeouts are essential, it’s important to note how Unicorn works.
00:06:30.150
When you start Unicorn, it spawns a fixed number of worker processes; that number does not change with system load or with how many requests come in. If all of those workers are occupied, no new workers are spawned, and requests start to queue.
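To make that concrete, here is a minimal sketch of a typical Unicorn configuration; the numbers are placeholders rather than recommendations, but they show that the worker pool is a fixed size and that the per-request timeout is something you choose up front.

```ruby
# config/unicorn.rb -- illustrative values only
worker_processes 16   # a fixed pool: at most 16 requests can be served concurrently
preload_app true      # load the app once in the master, then fork the workers

# Unicorn kills any worker that spends longer than this many seconds on a
# single request; the default is 60, which is usually far more than you can afford.
timeout 15
```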
00:06:59.939
If all workers are busy, additional requests have to wait until a worker frees up. You can picture how requests queue up and take longer and longer to handle once the system is under strain.
00:07:22.650
In situations of high load, making requests slower will ultimately degrade throughput, which means the maximum number of requests you can handle will decline.
00:07:36.240
Capacity can be loosely defined as the maximum throughput you can handle at a reasonable response time. This became very visible for us when someone ran a data-migration maintenance task that used an inefficient Redis command, and the system became unresponsive.
00:08:00.599
Requests stalled for about two minutes because the entire Redis instance was blocked, and our capacity was badly degraded.
00:08:19.409
Even though this example is somewhat specific to the Unicorn worker model, it generally applies to any web server or background queue system.
00:08:31.589
You will always be limited by your number of workers and system resources. This highlights the crucial importance of having timeouts.
00:08:50.389
The concept of timeouts is to fail fast. Failures can happen in various ways, but the best scenario in an outage is for the connection to be refused quickly.
00:09:02.120
When connections take a long time to fail, they do more damage than necessary, which is why setting appropriate timeouts is so valuable.
00:09:26.240
Taking a look at Ruby applications, many Ruby gems do not have default timeouts configured. For instance, the Unicorn web server allows requests to take up to 60 seconds by default.
00:09:45.259
This means during that time, the worker cannot handle any new requests. Some gems, like the Redis gem, ship with no default timeout at all, which can lead to issues.
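As a sketch of what setting timeouts explicitly can look like, here is how you might configure the redis-rb client and a Net::HTTP call; the values and host names are placeholders you would tune to your own baseline.

```ruby
require "redis"
require "net/http"

# Set explicit timeouts rather than relying on whatever the gem's defaults are.
redis = Redis.new(
  host: "localhost",
  connect_timeout: 0.2, # seconds to establish the TCP connection
  timeout: 0.5          # seconds to wait for a reply
)

# Net::HTTP defaults to a 60-second read timeout; tighten it for external calls.
http = Net::HTTP.new("api.example.com", 443)
http.use_ssl = true
http.open_timeout = 1   # seconds to open the connection
http.read_timeout = 2   # seconds to wait for the response body
```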
00:10:07.960
On a Linux system, without an application-level timeout you end up waiting for the kernel to eventually give up on the connection, which can take a very long time. This is especially problematic with Redis, because it's a single-threaded server.
00:10:27.920
To address this, you should instrument your code. One approach is to use StatsD or similar monitoring tools to get insight into your performance baseline and to lower timeouts to levels you can afford.
00:10:55.500
Once you know your baseline is, say, five seconds, there's no reason to allow 60. Requests that deviate far from that baseline just eat into your capacity.
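As an example of what that instrumentation might look like, here is a small sketch assuming the statsd-instrument gem; the class and metric names are made up.

```ruby
require "statsd-instrument"

# Measure how long calls to an external service take so you can establish a
# baseline before tightening timeouts. Metric and class names are illustrative.
class PaymentGateway
  def charge(order)
    StatsD.measure("payment_gateway.charge") do
      # ... the actual network call to the gateway ...
    end
  end
end
```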
00:11:19.190
Sometimes prolonged executions are legitimate, so consider moving long-running requests to background jobs.
00:11:45.490
In background processing, similar principles apply because you are still limited by the number of workers in your queue system, though it’s not as noticeable to users as slow web requests.
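Here is a hedged sketch of what moving such work into a background job can look like, using Active Job; the class, queue, and model names are hypothetical.

```ruby
# A long-running export moved out of the web request into a background job.
class ProductExportJob < ActiveJob::Base
  queue_as :low_priority

  def perform(shop_id)
    shop = Shop.find(shop_id)
    # ... the slow work that used to block a web worker for minutes ...
  end
end

# In the controller, enqueue instead of doing the work inline:
#   ProductExportJob.perform_later(shop.id)
```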
00:12:06.500
Another example is Redis commands that are time-consuming or costly. If you run inefficient commands, performance can degrade across your whole architecture.
00:12:42.210
In some cases, especially with services like Elasticsearch, if there isn't enough memory available you can run into significant garbage-collection pauses.
00:13:06.360
It's essential to think of a slow service as just another kind of failure; it can often do more damage than a service that is completely down.
00:13:22.180
Networking problems are also noteworthy: packet loss or saturation can create further service disruptions. This is a common concern we faced at Shopify and could have been avoided with stricter timeouts.
00:13:54.210
The next pattern I want to talk about is the circuit breaker pattern. The main idea is if you have an external service and you're encountering failures, it's likely that the service will continue to fail.
00:14:11.370
The circuit breaker pattern is designed to temporarily stop communication with a service that is failing and give it time to recover. This is implemented by keeping track of the number of errors within a specified timeframe.
00:14:32.330
Once this error count exceeds a certain threshold, you can switch to an 'open' state and stop sending new requests to the failing service.
00:14:49.090
This behavior can be visualized similarly to an electrical circuit fuse that prevents overheating.
00:15:08.000
Each service should maintain its own independent circuit state; the errors from one service shouldn’t break the circuit of another service.
00:15:25.400
The circuit breaker pattern consists of three states—'closed' (everything is fine), 'open' (requests are failing), and 'half-open' (testing if the service has recovered).
00:15:45.400
You can implement the pattern at the driver level, meaning that the application code does not need to be aware of the circuit breaker; it's taken care of transparently.
00:16:10.640
After a defined timeout, the circuit switches from 'open' to 'half-open' and lets a request through; if that request succeeds the circuit closes again, and if it fails the circuit goes back to 'open'. That way you only resume normal traffic once the service has properly recovered.
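To show how little code this takes, here is a minimal, single-threaded sketch of those three states; the thresholds are placeholders, and a production version would need thread safety and one instance per service.

```ruby
# Minimal three-state circuit breaker: closed -> open after too many errors,
# open -> half-open after a cool-down, half-open -> closed again on success.
# Single-threaded sketch; thresholds are placeholders.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(error_threshold: 5, recovery_timeout: 30)
    @error_threshold  = error_threshold
    @recovery_timeout = recovery_timeout
    @error_count      = 0
    @opened_at        = nil
  end

  def call
    raise OpenError, "circuit is open" if open?
    result = yield
    reset # a successful request closes the circuit again
    result
  rescue OpenError
    raise
  rescue StandardError
    record_error
    raise
  end

  private

  def open?
    return false unless @opened_at
    # After the cool-down we are "half-open": let one request through to probe.
    Time.now - @opened_at < @recovery_timeout
  end

  def record_error
    @error_count += 1
    @opened_at = Time.now if @error_count >= @error_threshold
  end

  def reset
    @error_count = 0
    @opened_at   = nil
  end
end

# Usage: each service gets its own breaker instance.
#   breaker = CircuitBreaker.new
#   breaker.call { redis.get("cart:#{customer.id}") }
```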
00:16:34.900
Next is the idea of failing gracefully. When we experienced outages, our instinct was to fix every individual issue, but it's crucial to accept that some failures are inevitable and prepare for them.
00:16:52.990
The focus should be on degrading user experience rather than breaking it completely. It’s essential to understand that there are often reasonable fallbacks for errors in any application.
00:17:18.500
Using the example of our Shopify storefront, if one service fails, instead of breaking everything, we can hide specific elements or show generic data to maintain continuity.
00:17:46.340
For instance, if the cart service is down, we display a total of zero dollars rather than breaking the entire checkout experience.
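A minimal sketch of that kind of fallback, with a hypothetical CartService client:

```ruby
# Degrade instead of break: if the cart backend is unreachable, show a $0 cart
# and keep the rest of the storefront working. CartService is hypothetical.
def cart_total(customer)
  CartService.total_for(customer)
rescue CartService::ConnectionError, Timeout::Error
  0 # rendered as $0.00 in the template; the page still loads
end
```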
00:18:07.980
Moreover, a recommendation system can display generic recommendations if personalized ones are unavailable.
00:18:24.440
Thus, gracefully degrading the user experience always trumps breaking it, and in practice it's rare for many services to fail at exactly the same time, so there's usually something reasonable to fall back on.
00:18:40.080
The next topic is concurrency control: limiting how many processes hit the same shared resource at once helps prevent unnecessary overload.
00:19:01.920
If several applications are hitting a single database, quick overload can occur, leading to cascading issues.
00:19:25.420
To counteract this, implement a semaphore system to limit the number of requests to the database to the threshold it can handle.
00:19:47.280
This will isolate issues in smaller areas rather than have one slow query affect multiple applications.
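Here is a counting-semaphore sketch of that idea. It is in-process only, so across Unicorn workers the count would have to live somewhere shared (SysV semaphores, Redis, and so on); the limit and the query helper are made up.

```ruby
# Counting semaphore: at most `limit` callers may hold a ticket at once.
# In-process only; across worker processes the count must live in something
# shared (SysV semaphores, Redis, ...). Purely illustrative.
class Semaphore
  def initialize(limit)
    @limit   = limit
    @count   = 0
    @mutex   = Mutex.new
    @condvar = ConditionVariable.new
  end

  def acquire
    @mutex.synchronize do
      @condvar.wait(@mutex) while @count >= @limit
      @count += 1
    end
    begin
      yield
    ensure
      @mutex.synchronize do
        @count -= 1
        @condvar.signal
      end
    end
  end
end

# Allow at most five concurrent expensive reporting queries against the database.
REPORT_QUERY_SEMAPHORE = Semaphore.new(5)
REPORT_QUERY_SEMAPHORE.acquire { run_expensive_report_query } # helper is hypothetical
```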
00:20:01.610
To make systems fail gracefully and keep them operational, we need to monitor performance consistently and limit the impact failures have on availability.
00:20:19.269
It helps to start with a contained area: focus on the critical path, the part of the application that sees the most interaction.
00:20:38.080
An application can have specific areas that, if down, will result in significant losses, so tackle those first.
00:20:55.640
Be careful with deployments, because they can be a big pain point. You want to be able to roll code forward and back smoothly even while a service is down.
00:21:13.780
When introducing new code, avoid creating dependencies on other services in the initialization sequence, so the application can still boot and deploy when those services are unavailable.
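Here is a small sketch of the difference, with a hypothetical feature-flag service: the risky version calls it during boot, the safer one resolves it lazily and falls back to a default.

```ruby
# Risky: runs at boot, so the app cannot start, deploy, or roll back while the
# flag service is down.
#
#   # config/initializers/feature_flags.rb
#   FLAGS = FeatureFlagService.fetch_all!
#
# Safer: resolve lazily on first use and fall back to a default on failure.
module Flags
  def self.enabled?(name)
    @flags ||= FeatureFlagService.fetch_all!
    @flags.fetch(name, false)
  rescue FeatureFlagService::ConnectionError, Timeout::Error
    false # sensible default while the service is unreachable
  end
end
```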
00:21:35.740
To systematically analyze resiliency and avoid overwhelming yourself, creating a resiliency matrix can provide a clear visual guide to assess risk levels.
00:21:55.070
The matrix lists the sections of your application on one axis and the services they depend on on the other; for each combination you write down what you expect to happen when that service fails, which shines a light on risks you hadn't seen.
00:22:15.820
By mapping out dependencies and associated risks, you can prioritize and effectively mitigate weaknesses.
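A purely hypothetical slice of such a matrix might look like this: rows are sections of the application, columns are services they depend on, and each cell records what you expect to happen when that service is down.

```
              Redis (sessions)              MySQL                   CDN / assets
Storefront    visitors browse logged out    error page              unstyled but usable
Checkout      works, cart rebuilt later     checkout unavailable    works
Admin         admins are logged out         error page              works
```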
00:22:34.900
Treat resiliency as something your test suite covers like any other behavior; that gives you a foundation for discovering weaknesses in your codebase before they hit production.
00:22:57.061
On top of thorough coverage, a bit of Ruby metaprogramming can cut down the boilerplate: small helpers make it much more streamlined to check that code adheres to your resiliency principles.
00:23:20.600
When such tests fail, they usually point at service calls that don't check for errors; fixing those calls is exactly the work that makes the application more resilient.
00:23:45.240
Lastly, having a dedicated testing environment that mirrors production can provide a practical means to test application responses under duress.
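As a sketch of what one of these resiliency tests could look like, here is a Minitest example that simulates an outage by swapping in a client that always fails; the Storefront class and its interface are hypothetical.

```ruby
require "minitest/autorun"

# Resiliency test straight from the matrix: "the storefront still renders when
# the recommendation service is down." Storefront and its interface are
# hypothetical; in a Rails app this would be an integration test against a route.
class StorefrontResiliencyTest < Minitest::Test
  def test_storefront_renders_without_recommendations
    # Simulate the outage by injecting a client that always fails.
    failing_client = Object.new
    def failing_client.recommendations_for(*_args)
      raise Timeout::Error
    end

    storefront = Storefront.new(recommendation_client: failing_client)
    html = storefront.render

    assert_includes html, "Products"             # the core page still works
    refute_includes html, "Recommended for you"  # the extra section is hidden
  end
end
```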
00:24:09.180
Now, once all the tests pass and the issues have been addressed, the final step is chaos testing. The idea is to check how your application holds up under real failures.
00:24:32.390
You can achieve this manually at first. For instance, purposely shut down a service you're confident is resilient to observe how your application responds.
00:24:51.990
Once you start to feel confident in different failure scenarios, you can automate this process, just as Netflix does by randomly disabling services to test fault tolerance.
00:25:13.760
I want to encourage everyone here to pick one takeaway from this discussion and experiment with it in your applications.
00:25:34.490
Timeouts are an easy fix that can yield significant improvements, and you can start adding them today.
00:25:54.560
If you're interested in circuit breakers, implementing one is a straightforward exercise, perhaps less than 100 lines of Ruby. And if you maintain an application, create a resiliency matrix for it.
00:26:16.430
Fill in the assumptions of behavior and then verify those assumptions to ensure that your application will react appropriately to failures.
00:26:39.390
If you’re still engaged, learn more about Ruby metaprogramming or work on building tests to handle generates to-do lists for resiliency upgrades.
00:26:59.970
I’d recommend looking into the book 'Release It!' which covers many principles discussed today, particularly from a Java-centric perspective but is broadly applicable.
00:27:20.880
Also, check out Netflix's tech blog for insights around resiliency testing as they provide valuable lessons that can inspire implementations in Ruby.
00:27:43.030
For concurrency control ideas, explore our implementation called Semaphore, which prevents overload by managing shared resource access.
00:28:02.270
This will not only assist in improving your code but also ensure everyone has a shared understanding of the system architecture.
00:28:18.620
Thank you very much for being here today, and I hope you find the resilience concepts discussed valuable for your applications!