Smit Shah
Resilient by Design

By Smit Shah

Modern distributed systems have aggressive requirements around uptime and performance, and they have to face harsh realities such as a sudden rush of visitors, network issues, tangled databases, and other unforeseen bugs. With so many moving parts involved in even the simplest of services, it becomes mandatory to adopt defensive patterns that guard against some of these problems and to identify anti-patterns before they trigger cascading failures across systems. This talk is for all those developers who hate getting an on-call page at 4 AM.

Garden City Ruby 2015

00:00:15.040 First of all, as they just mentioned, I work at Nilenso, and I'm also part of the Bundler core team. Today, I want to talk about how to build resilient systems.
00:00:24.400 Why do we care about resilient systems in the first place? Anyone who loses sleep over system downtime, customer issues, or business ramifications would care about the resilience of their systems. Downtime can mean disaster for them, and developers may have to wake up at night to answer on-call pages. Thus, we want to build systems that continue to work and perform under conditions of failure or high traffic.
00:00:43.680 Let's move away from software for a bit and think about cars. Cars are designed to be resilient; they have airbag systems that activate during an accident to save lives. Similarly, nuclear power reactors don't leak radiation if there's a power outage; they have built-in mechanisms to handle such failures. These resilience mechanisms are considered from the start of the design process.
00:01:02.239 Unfortunately, in software the pattern tends to be that folks build the entire feature and only then think about what happens if a key component, like a cache server, goes down. That realization often comes too late, usually after something has gone wrong in production. One fine day, when your servers are under heavy load and you need your cache the most, it might fail. At that point, developers are left to manage the aftermath.
00:01:40.560 To design a resilient system, resilience has to be considered up front. This means anticipating things like, 'What if the service I depend on goes down?' or 'What if my database doesn’t respond?' Many developers overlook these aspects.
00:02:01.120 Instead, I want to discuss patterns that can help you plan for these situations. It's crucial to think ahead about the limitations of your code. The main crux of my talk today will revolve around resilient design patterns. While I can't guarantee that using these patterns will prevent failures, they can significantly improve system uptime.
00:02:29.200 Let me give an example of why design patterns are preferable to ad-hoc solutions. We had a production system where over time the memory usage of our Unicorn web servers increased significantly. After about a week or two, the memory usage would escalate to a point that it would start swapping. The only option then was to restart the web server, which was quite painful.
00:03:12.000 We couldn't determine the cause at first. Initially, we suspected memory leaks from native Ruby extensions. Eventually, we discovered that the issue was related to JSON parsing of keys that kept changing, specifically timestamps. Every time we parsed such JSON, we added new entries to Ruby's symbol table, and symbols are not garbage collected before Ruby 2.2. This mistake compounded over time, leading to steadily growing memory usage and, eventually, swapping, which we needed to control.
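
A minimal sketch of how this kind of leak can arise, assuming the JSON was parsed with `symbolize_names: true` (the talk does not name the exact call). The payload shape is hypothetical and only illustrates a timestamp used as a key.

```ruby
require "json"

# Hypothetical payload: the key is a timestamp, so almost every
# document introduces a key that has never been seen before.
payload = '{"2015-01-10T10:00:00Z": {"visits": 42}}'

# With symbolize_names: true every new key becomes a new Symbol.
# Before Ruby 2.2 such symbols were never garbage collected.
leaky = JSON.parse(payload, symbolize_names: true)

# Default parsing keeps keys as Strings, which are collected normally.
safe = JSON.parse(payload)

puts leaky.keys.first.class # prints "Symbol"
puts safe.keys.first.class  # prints "String"
```

Keeping externally controlled keys as strings avoids growing the symbol table on older Rubies.
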
00:05:57.480 To mitigate the problem, we had to employ memory limits for our worker processes. We put a limit on the memory a worker could use, and if it increased beyond a certain threshold, we would restart it. This approach allowed us to maintain service continuity while we took our time identifying the root cause of the memory issue.
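
A rough sketch of such a memory limit, not the setup used in the talk. `MemoryGuard`, its parameters, and the 300 MB threshold are hypothetical names and values chosen for illustration. Reading memory through `ps` assumes a Unix-like system.

```ruby
# Minimal per-worker memory limit check. All names here are hypothetical.
class MemoryGuard
  # limit_kb:   resident set size in kilobytes above which the worker
  #             should be restarted.
  # rss_reader: callable returning the current RSS in kilobytes; it is
  #             injectable so the check can be exercised without `ps`.
  def initialize(limit_kb:, rss_reader: -> { MemoryGuard.current_rss_kb })
    @limit_kb = limit_kb
    @rss_reader = rss_reader
  end

  # Reads the RSS of this process via `ps` (assumes a Unix-like system).
  def self.current_rss_kb
    `ps -o rss= -p #{Process.pid}`.to_i
  end

  def over_limit?
    @rss_reader.call > @limit_kb
  end

  # Intended to be called between requests. Yields when the limit is
  # breached so the caller decides how to restart the worker.
  def check
    return false unless over_limit?
    yield if block_given?
    true
  end
end

# Simulated reading of 350 MB against a 300 MB limit.
guard = MemoryGuard.new(limit_kb: 300_000, rss_reader: -> { 350_000 })
guard.check { puts "over limit, restarting worker" }
```

What "restart" means depends on the server. The point of the sketch is only that the threshold and the reaction are decided ahead of time.
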
00:07:37.440 This brings me to the patterns I will be covering today. The first pattern is bounding, particularly timeouts. It's essential to know the default timeout settings in your applications. For instance, the default timeout of Ruby's Net::HTTP is a staggering 60 seconds, which may not be suitable for production. A timeout this high can create significant issues in high-load situations, because every slow call ties up a worker for up to a minute.
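
A minimal sketch of setting explicit timeouts with Ruby's standard `Net::HTTP`. The host name, the 2 and 5 second values, and the `fetch_status` helper are assumptions for illustration; suitable values depend on the service being called.

```ruby
require "net/http"

# Hypothetical host. Net::HTTP.new does not open a connection yet.
http = Net::HTTP.new("service.example.com", 80)

http.open_timeout = 2 # seconds to wait for the connection to open
http.read_timeout = 5 # seconds to wait for each read from the socket

# Hypothetical helper: returns the status code, or a short marker
# string when the call times out, instead of hanging for a minute.
def fetch_status(http, path)
  http.get(path).code
rescue Net::OpenTimeout, Net::ReadTimeout => e
  "timeout: #{e.class}"
end
```

Failing after a few seconds frees the worker to serve other requests while the slow dependency recovers.
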
00:08:00.879 Another critical point of bounding is memory management. For any worker processes you have, ensure that memory usage is monitored, and establish behavior that should occur when thresholds are breached. The next aspect of bounding is to be cautious about your queues and buffers, especially in high-load situations. Having controlled and limited buffer sizes allows you to apply back pressure and better manage the flow of processes.
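
A small sketch of a bounded queue that applies back pressure, using Ruby's built-in `SizedQueue`. The limit of 2 and the `try_enqueue` helper are arbitrary choices for illustration.

```ruby
require "thread"

# Bounded work queue: at most 2 pending jobs.
jobs = SizedQueue.new(2)

# Tries to enqueue without blocking. Returns false when the queue is
# full so the caller can shed load or ask the client to retry later.
def try_enqueue(queue, job)
  queue.push(job, true) # non_block = true raises ThreadError when full
  true
rescue ThreadError
  false
end

results = %w[a b c].map { |job| try_enqueue(jobs, job) }
p results # prints [true, true, false]
```

An unbounded queue would have accepted the third job and simply grown, hiding the overload until memory ran out.
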
00:09:05.120 Next, let's talk about circuit breakers, a popular design pattern for handling failures gracefully. Circuit breakers prevent operations from trying to call services that have already proven to be unreliable. This ensures better resource management and protects your system from cascading failures. If you’re trying to access service C and it keeps failing, the circuit breaker will prevent further attempts until the service is confirmed to be back up.
00:10:43.600 Circuit breakers also allow you to implement fallbacks during failures, directing clients to alternative content or cached responses. This helps ensure a seamless user experience even in the event of failures. Over time, you can transition from an open state to a half-open state to see if the service is back online without overwhelming it.
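
The closed, open, and half-open states described above can be sketched as follows. This is a simplified, single-threaded illustration, not the implementation of any particular gem. The class name, the thresholds, and the injectable `clock` are assumptions.

```ruby
# Minimal circuit breaker sketch (not thread-safe).
class CircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(failure_threshold: 3, reset_timeout: 30,
                 clock: -> { Time.now.to_f })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout # seconds to stay open before a trial call
    @clock = clock
    @failures = 0
    @state = :closed
    @opened_at = nil
  end

  # Runs the block through the breaker. When the circuit is open or the
  # call fails, the fallback (if given) supplies the result instead.
  def call(fallback: nil)
    if @state == :open
      if @clock.call - @opened_at >= @reset_timeout
        @state = :half_open # let a single trial request through
      else
        return fallback.call if fallback
        raise OpenError, "circuit open"
      end
    end

    begin
      result = yield
      on_success
      result
    rescue StandardError
      on_failure
      raise unless fallback
      fallback.call
    end
  end

  private

  def on_success
    @failures = 0
    @state = :closed
  end

  def on_failure
    @failures += 1
    # A failed trial call, or too many failures in a row, opens the circuit.
    if @state == :half_open || @failures >= @failure_threshold
      @state = :open
      @opened_at = @clock.call
    end
  end
end

breaker = CircuitBreaker.new(failure_threshold: 2, reset_timeout: 10)
cached = -> { "cached response" }

3.times do
  puts breaker.call(fallback: cached) { raise IOError, "service C is down" }
end
puts breaker.state # prints "open"
```

The third call never reaches the failing block: the breaker is already open, so the fallback is returned immediately.
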
00:12:03.120 Resilience patterns are not just applicable to service communication but can extend to other elements, like databases. Applying the same practices to MySQL, such as bounding how long a query is allowed to run, can prevent a single query from monopolizing resources, so the remaining queries continue to be served.
00:13:38.120 The final resilience pattern I want to impart to you today is the bulkhead design. Borrowed from shipbuilding, the bulkhead pattern suggests compartmentalization to localize failures. In the context of microservices, this means isolating service instances so that, if one service fails, it won’t bring down others relying on it. This approach adds an extra layer of reliability.
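
One way to approximate a bulkhead inside a single process is to give each dependency its own small, fixed concurrency budget. The sketch below assumes that interpretation; `Bulkhead`, `FullError`, and the `payments` and `search` names are hypothetical.

```ruby
# Minimal bulkhead sketch: caps concurrent calls per dependency.
class Bulkhead
  class FullError < StandardError; end

  def initialize(max_concurrent)
    @max_concurrent = max_concurrent
    @in_flight = 0
    @lock = Mutex.new
  end

  # Runs the block if a slot is free; otherwise rejects immediately
  # rather than letting callers pile up behind a slow dependency.
  def call
    acquire
    begin
      yield
    ensure
      release
    end
  end

  private

  def acquire
    @lock.synchronize do
      raise FullError, "bulkhead full" if @in_flight >= @max_concurrent
      @in_flight += 1
    end
  end

  def release
    @lock.synchronize { @in_flight -= 1 }
  end
end

# One compartment per dependency (names are hypothetical).
bulkheads = { payments: Bulkhead.new(1), search: Bulkhead.new(1) }

bulkheads[:payments].call do
  # While payments is saturated, search still has its own capacity.
  puts bulkheads[:search].call { "search ok" }
  begin
    bulkheads[:payments].call { "never runs" }
  rescue Bulkhead::FullError => e
    puts e.message # prints "bulkhead full"
  end
end
```

Because each compartment has its own limit, a stuck dependency can exhaust only its own slots and not the capacity reserved for the others.
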
00:15:36.239 Lastly, I want to stress the importance of specifications. Writing detailed specifications before you begin coding helps illuminate potential pitfalls and ensures that you're considering edge cases. Effective specifications can lead to better code, preventing oversights from manifesting in production. Even if you’re working on a smaller service, the complexity and unreliability of distributed systems make it necessary to plan these patterns into your development cycle. By implementing resilience design patterns, you will lessen the chances of sleepless nights spent firefighting production issues.