
Fault Tolerance in Ruby

wroc_love.rb 2017

00:00:11.590 My name is Hubert, and I'm a Ruby developer. You can find me on Twitter, GitHub, and other services using the handle HubertLepicki. I really enjoy engaging on Twitter.
00:00:17.500 I have the privilege of working at AmberBit, a Ruby on Rails development shop. Originally, we focused solely on Ruby, but in recent years, we've expanded into JavaScript as well.
00:00:25.000 Today, we're going to talk about fault tolerance and some related concepts, particularly scalability and concurrency. These two subjects often come together with fault tolerance.
00:00:31.029 The same tools can help us address both issues. We will also do some time travel, and if you're a fan of Star Trek, you know that time travel can be fun.
00:00:38.140 Let's start with fault tolerance by looking at some definitions. According to the collective wisdom from Wikipedia, fault tolerance is a property of software that enables a system to operate properly, even if a failure occurs.
00:00:44.470 However, I don't fully agree with those definitions. Some definitions suggest that fault tolerance means the system should function correctly according to specification, even when a failure occurs, but that rarely happens.
00:00:51.700 From my perspective, fault tolerance is the ability to detect and recover from software failures. If we can achieve automatic detection and recovery from software failures without waking up developers at six in the morning on a Saturday, that would be perfect.
00:01:07.810 So the key question is whether we can avoid cascading failures that affect all users and cost the business revenue, even if some users occasionally see an error page, pretty or not.
00:01:15.039 The reality is that implementing fault tolerance is one of the hardest challenges in computer science.
00:01:22.120 To illustrate how difficult it is, let's travel back to the year 1992. In that year, we saw an online debate between two influential figures in the software world.
00:01:35.710 On one side, we had Linus Torvalds, the creator of the Linux kernel, and on the other side, we had Professor Andrew Tanenbaum, creator of the MINIX operating system. Most of us are familiar with Linux, and its presence is probably larger than the audience here.
00:01:50.310 MINIX, on the other hand, is still used today for teaching the design basics of operating systems. The exchange between those two gentlemen was quite heated, as Linus is known for being very opinionated.
00:02:06.780 During their debate, Tanenbaum argued for building an operating system based on a microkernel architecture that would provide fault tolerance and resilience to hardware and software failures.
00:02:21.510 Linus countered that it was simpler to build a monolithic kernel that worked for most users. He claimed that creating an operating system according to Tanenbaum's suggestions was far too complex.
00:02:38.210 In the end, both had their points. Linus believed that the simpler solution provided the best advantage for users, whereas Tanenbaum aimed for a more fault-tolerant and resilient approach.
00:02:56.130 Fast forward several decades, and we know that operating systems like Linux are widely used, while MINIX remains relatively niche.
00:03:10.090 However, the initial goals of creating fault-tolerant systems are still relevant today. We often face similar challenges concerning software failures.
00:03:21.650 Despite the allure of implementing extensive fault tolerance mechanisms, we should be cautious. Adding complexity can lead to issues such as increased costs, resource consumption, and potential new failure points.
00:03:34.550 It's important to focus first on delivering business value, and only then make incremental tweaks to enhance fault tolerance.
00:03:49.070 Fortunately, Ruby provides various defensive techniques against software failures, particularly when contacting external components.
00:04:03.680 These techniques include exception handling. When something fails in Ruby, an exception is raised, which carries useful information about the failure.
00:04:17.460 It is advisable to wrap operations that can fail, such as network calls or file operations, in exception handling blocks.
00:04:30.350 Ruby allows you to catch multiple exceptions and implement different responses for different exceptions. You can also implement retry logic for transient failures.
00:04:44.180 Logging exceptions is crucial for diagnosing issues. By integrating logging facilities with exception tracking software, we can monitor and respond to failures effectively.
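A minimal sketch of the exception handling, retry, and logging ideas above. The error classes `TransientError` and `PermanentError` and the helper `with_retries` are hypothetical names chosen for illustration; a real client library raises its own exception classes.

```ruby
require "logger"

# Hypothetical error classes standing in for whatever a real client raises.
class TransientError < StandardError; end
class PermanentError < StandardError; end

LOGGER = Logger.new($stderr)

# Runs the block, retrying transient failures up to max_attempts times.
# Permanent failures are logged and re-raised immediately.
def with_retries(max_attempts: 3, base_delay: 0.01)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue TransientError => e
    if attempts < max_attempts
      LOGGER.warn("attempt #{attempts} failed: #{e.message}, retrying")
      sleep(base_delay * attempts) # simple linear backoff
      retry
    end
    LOGGER.error("giving up after #{attempts} attempts: #{e.message}")
    raise
  rescue PermanentError => e
    LOGGER.error("not retrying: #{e.message}")
    raise
  end
end

# Usage: the block fails twice with a transient error, then succeeds.
result = with_retries do |attempt|
  raise TransientError, "connection reset" if attempt < 3
  "ok on attempt #{attempt}"
end
puts result # => ok on attempt 3
```

Keeping the retry count bounded matters: an unbounded `retry` against a service that is really down turns a transient-failure handler into an infinite loop.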
00:04:58.560 Another effective approach is partitioning parts of the system using 'bulkheads,' as described in the book Exceptional Ruby.
00:05:12.640 Bulkheads isolate specific operations, allowing the system to remain operational even when failures occur in other components.
00:05:29.170 Using timeouts with external operations is also a critical strategy, as it prevents long-running requests from affecting system responsiveness.
00:05:42.360 In traditional Ruby applications, where requests can be blocked by external services, implementing bulkheads and timeouts helps ensure the overall health of the system.
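The combination of a bulkhead and a timeout can be sketched as a single wrapper. The helper name `with_bulkhead` and its parameters are assumptions made for this example, and it uses the standard library's `Timeout.timeout` only to keep the sketch self-contained.

```ruby
require "timeout"

# Hypothetical helper: runs a block behind a "bulkhead" with a time limit.
# Any StandardError (Timeout::Error included) is contained here, and the
# fallback value is returned instead of letting the failure spread.
def with_bulkhead(timeout_seconds:, fallback:)
  Timeout.timeout(timeout_seconds) { yield }
rescue StandardError => e
  warn "contained failure: #{e.class}: #{e.message}"
  fallback
end

# Usage: a slow call stands in for an unresponsive external service.
recommendations = with_bulkhead(timeout_seconds: 0.05, fallback: []) do
  sleep 1
  ["item"]
end
p recommendations # => []
```

`Timeout.timeout` interrupts the block at an arbitrary point, so for real network calls it is usually safer to rely on the timeouts offered by the client itself, such as `open_timeout` and `read_timeout` on `Net::HTTP`.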
00:05:55.620 The circuit breaker pattern is another useful technique that prevents call failures from propagating through the system.
00:06:07.700 Just like electrical circuit breakers prevent fires, software circuit breakers open when they detect repeated failures, halting further calls to the malfunctioning service.
00:06:19.920 When implementing this pattern, it's essential to define the three states: closed, open, and half-open. After the circuit has been open for a while, it moves to the half-open state and lets a trial call through to check whether the service has recovered.
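The three states can be sketched as a small state machine. This is an illustrative, single-process, non-thread-safe example; the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are assumptions for the sketch and do not come from any particular gem.

```ruby
# Minimal circuit breaker sketch (not thread-safe, state lives in one process).
class CircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(failure_threshold: 3, reset_timeout: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    @clock = clock
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def call
    if @state == :open
      # Fail fast while open; after the timeout allow one trial call.
      raise OpenError, "circuit open" unless reset_timeout_elapsed?
      @state = :half_open
    end

    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    record_success
    result
  end

  private

  def reset_timeout_elapsed?
    @clock.call - @opened_at >= @reset_timeout
  end

  def record_failure
    @failures += 1
    # A failed trial call, or too many failures, (re)opens the circuit.
    if @state == :half_open || @failures >= @failure_threshold
      @state = :open
      @opened_at = @clock.call
    end
  end

  def record_success
    @failures = 0
    @state = :closed
  end
end
```

A caller wraps each request in `breaker.call { ... }` and rescues `CircuitBreaker::OpenError` to serve a fallback while the circuit is open.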
00:06:33.100 For Ruby, there are libraries available such as the Circuit Breaker gem, which allows for easy implementation of this pattern.
00:06:46.140 However, care must be taken when using these libraries: Ruby applications often run as several worker processes that do not share in-memory state, so each process tracks its circuits independently.
00:07:00.200 Alternatives like the Semian library from Shopify can help synchronize circuit states across multiple workers.
00:07:16.040 To make Ruby applications more resilient to failures, we can also subdivide our systems into smaller, more manageable services.
00:07:32.250 This approach improves flexibility and allows for distributed fault isolation. For example, in one of our projects, we divided an application into three distinct parts using RabbitMQ for communication.
00:07:46.200 However, we encountered complications we dubbed 'ghost failures' that were hard to trace. These failures emerged after idle periods and often turned out to be misconfigurations related to the messaging system.
00:08:01.620 The issues arose from lost connections rather than actual failures: firewalls or cloud providers were silently dropping connections that had been idle for too long.
00:08:15.600 The lesson from this experience is that proper monitoring and maintenance of your services are crucial for success.
00:08:35.350 The core of fault tolerance remains the ability to recover from failures without causing cascading impacts on the application or user experiences.
00:08:50.290 Moving forward through technology, we can look at how concurrency and fault tolerance principles were integrated into languages like Erlang.
00:09:05.250 Erlang was designed for telecom systems, combining hardware and software aspects to manage failures, thus addressing the same challenges developers face today.
00:09:16.830 In Erlang, failure management is improved through 'supervisors' that monitor and restart crashing actors. Arranging supervisors and workers into a supervision tree gives the software a resilient structure.
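The supervisor idea can be approximated in plain Ruby with threads. This is only a rough sketch of the restart behaviour, not how Erlang implements supervision; the `Supervisor` class and its `max_restarts` parameter are hypothetical.

```ruby
# Hypothetical supervisor: runs a worker block in a thread and restarts it
# when it crashes, up to max_restarts times.
class Supervisor
  attr_reader :restarts

  def initialize(max_restarts: 3, &worker)
    @max_restarts = max_restarts
    @worker = worker
    @restarts = 0
  end

  # Returns the worker's result, or re-raises its error once the
  # restart budget is used up.
  def run
    loop do
      thread = Thread.new do
        Thread.current.report_on_exception = false # the supervisor reports crashes
        @worker.call
      end
      begin
        return thread.value # waits for the worker; re-raises its exception
      rescue StandardError => e
        raise if @restarts >= @max_restarts
        @restarts += 1
        warn "worker crashed (#{e.class}: #{e.message}), restart #{@restarts}"
      end
    end
  end
end

# Usage: the worker crashes twice, then completes.
attempts = 0
supervisor = Supervisor.new(max_restarts: 5) do
  attempts += 1
  raise "boom" if attempts < 3
  "finished after #{attempts} attempts"
end
puts supervisor.run # => finished after 3 attempts
```

Capping the number of restarts mirrors the idea that a supervisor should eventually give up and let the failure propagate upward rather than restart a broken worker forever.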
00:09:33.150 The actor model of concurrency implemented by Erlang, where each actor communicates via messages, offers tremendous scalability and fault tolerance.
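A minimal sketch of an actor in plain Ruby: private state, a mailbox, and a single thread that handles one message at a time. The `CounterActor` class and its message names are made up for illustration and only hint at what Erlang processes provide.

```ruby
# Hypothetical minimal actor: only its own thread ever touches @count,
# so no locks are needed around the state.
class CounterActor
  def initialize
    @count = 0
    @mailbox = Queue.new
    @thread = Thread.new { process_messages }
  end

  # Asynchronous: enqueue a message and return immediately.
  def increment(by = 1)
    @mailbox << [:increment, by]
    self
  end

  # Synchronous: ask for the current value and wait for the reply.
  def value
    reply = Queue.new
    @mailbox << [:get, reply]
    reply.pop
  end

  # Ask the actor to finish and wait for its thread to exit.
  def stop
    @mailbox << [:stop]
    @thread.join
  end

  private

  def process_messages
    loop do
      type, payload = @mailbox.pop # blocks until a message arrives
      case type
      when :increment then @count += payload
      when :get then payload << @count
      when :stop then break
      end
    end
  end
end

# Usage: several threads send messages concurrently.
actor = CounterActor.new
senders = 4.times.map { Thread.new { 100.times { actor.increment } } }
senders.each(&:join)
puts actor.value # => 400
actor.stop
```

Because all state changes go through the mailbox, concurrent senders never race on `@count`; the queue serializes their messages.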
00:09:48.920 In Ruby, libraries like Celluloid allow for similar actor-based design, aiding in building resilient applications.
00:10:03.520 Celluloid allows Ruby developers to create actors with ease, handling failures effectively and providing a clean interface for managing concurrency.
00:10:15.750 Another helpful library is Concurrent Ruby, utilized within Rails for handling WebSockets, which brings concurrency support to Ruby.
00:10:31.020 These patterns signify a shift towards more structured fault tolerance within Ruby applications, enabling more robust software design.
00:10:45.290 As we anticipate Ruby 3.0's launch, there's excitement for new concurrency improvements inspired by Erlang's actor model.
00:10:54.550 These enhancements promise to make Ruby applications even more resilient to failures, streamlining the process for developers.
00:11:10.700 In this discussion, it's essential to remember that while mechanisms for handling failures are crucial, they still need to be understandable and manageable for developers.
00:11:31.030 As we continue to innovate and improve, let’s remain conscious of the balance between complexity and fault tolerance.
00:11:42.430 Thank you for your attention, and I hope this discussion has provided insights into making Ruby applications more fault-tolerant. If you have any questions, feel free to ask.