00:00:11.590
My name is Hubert, and I'm a Ruby developer. You can find me on Twitter, GitHub, and other services using the handle HubertLepicki. I really enjoy engaging on Twitter.
00:00:17.500
I have the privilege of working at AmberBit, a Ruby on Rails development shop. Originally, we focused solely on Ruby, but in recent years, we've expanded into JavaScript as well.
00:00:25.000
Today, we're going to talk about fault tolerance and some related concepts, particularly scalability and concurrency. These two subjects often come together with fault tolerance.
00:00:31.029
The same tools can help us address both issues. We will also do some time travel. If you're a fan of Star Trek, you know that time travel can be fun.
00:00:38.140
Let's start with fault tolerance by looking at some definitions. According to the collective wisdom from Wikipedia, fault tolerance is a property of software that enables a system to operate properly, even if a failure occurs.
00:00:44.470
However, I don't fully agree with those definitions. Some definitions suggest that fault tolerance means the system should function correctly according to specification, even when a failure occurs, but that rarely happens.
00:00:51.700
In my perspective, fault tolerance is the ability to detect and recover from software failures. If we can achieve automatic detection and recovery from software failures without waking up developers at six in the morning on a Saturday, that would be perfect.
00:01:07.810
So the key question is whether we can avoid cascading failures that affect all users and lead to lost revenue for the business, no matter how pretty or less pretty the error pages we serve are.
00:01:15.039
The reality is that implementing fault tolerance is one of the hardest challenges in computer science.
00:01:22.120
To illustrate how difficult it is, let's travel back to the year 1992. In that year, we saw an online debate between two influential figures in the software world.
00:01:35.710
On one side, we had Linus Torvalds, the creator of the Linux kernel, and on the other side, we had Professor Andrew Tanenbaum, creator of the MINIX operating system. Most of us are familiar with Linux, and its presence is probably larger than the audience here.
00:01:50.310
MINIX, on the other hand, is still used today for teaching the design basics of operating systems. The exchange between those two gentlemen was quite heated, as Linus is known for being very opinionated.
00:02:06.780
During their debate, Tanenbaum argued for building an operating system based on a microkernel architecture that would provide fault tolerance and resilience to hardware and software failures.
00:02:21.510
Linus countered that it was simpler to build a monolithic kernel that worked for most users. He claimed that creating an operating system according to Tanenbaum's suggestions was far too complex.
00:02:38.210
In the end, both had their points. Linus believed that the simpler solution provided the best advantage for users, whereas Tanenbaum aimed for a more fault-tolerant and resilient approach.
00:02:56.130
Fast forward several decades, and we know that operating systems like Linux are widely used, while MINIX remains relatively niche.
00:03:10.090
However, the initial goals of creating fault-tolerant systems are still relevant today. We often face similar challenges concerning software failures.
00:03:21.650
Despite the allure of implementing extensive fault tolerance mechanisms, we should be cautious. Adding complexity can lead to issues such as increased costs, resource consumption, and potential new failure points.
00:03:34.550
It's important to focus first on delivering business value, and only then make incremental tweaks to enhance fault tolerance.
00:03:49.070
Fortunately, Ruby provides various defensive techniques against software failures, particularly when contacting external components.
00:04:03.680
These techniques include exception handling. When something fails in Ruby, an exception is raised, which carries useful information about the failure.
00:04:17.460
It is advisable to wrap operations that can fail, such as network calls or file operations, in exception handling blocks.
00:04:30.350
Ruby allows you to rescue multiple exception classes and implement different responses for different exceptions. You can also implement retry logic for transient failures.
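As a minimal sketch of what that can look like, the helper below retries transient failures a few times and responds differently to another kind of error. The names `TransientError` and `fetch_with_retries` are made up for illustration, and treating `IOError` as "do not retry, return nil" is only an assumption about what a sensible fallback might be.

```ruby
require "timeout"

# Hypothetical error class standing in for a temporary failure of an external service.
class TransientError < StandardError; end

# Runs the block, passing it the current attempt number.
# Transient failures are retried with a growing delay; IOError gets a different response.
def fetch_with_retries(max_attempts: 3, base_delay: 0.01)
  attempts = 0
  begin
    attempts += 1
    yield(attempts)
  rescue TransientError, Timeout::Error
    # Give up and re-raise once the retry budget is used up.
    raise if attempts >= max_attempts
    sleep(base_delay * attempts)
    retry
  rescue IOError => e
    # Assumed policy: retrying will not help here, so report and fall back to nil.
    warn "giving up without retry: #{e.message}"
    nil
  end
end
```

A caller would wrap the risky operation in the block, for example `fetch_with_retries { |attempt| call_external_service }`, where `call_external_service` is whatever network or file operation can fail.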
00:04:44.180
Logging exceptions is crucial for diagnosing issues. By integrating logging facilities with exception tracking software, we can monitor and respond to failures effectively.
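Here is a small sketch of logging an exception with Ruby's standard `Logger` before letting it propagate. The helper names `log_failure` and `with_logging` are hypothetical, and a real setup would likely also forward the error to whatever exception tracking service you use, which is not shown here.

```ruby
require "logger"
require "stringio"

# Writes the exception class, message, some context and the top of the backtrace.
def log_failure(logger, error, context = {})
  logger.error("#{error.class}: #{error.message} context=#{context.inspect}")
  # The backtrace is nil for exceptions that were never raised.
  (error.backtrace || []).first(5).each { |line| logger.error("  #{line}") }
end

# Runs the block, logs any StandardError, then re-raises so callers still see the failure.
def with_logging(logger, operation)
  yield
rescue StandardError => e
  log_failure(logger, e, { operation: operation })
  raise
end
```

Re-raising after logging is a deliberate choice in this sketch: the log records the failure, but the decision about how to recover stays with the caller.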
00:04:58.560
Another effective approach is partitioning parts of the system using 'bulkheads,' as described in the book Exceptional Ruby.
00:05:12.640
Bulkheads isolate specific operations, allowing the system to remain operational even when failures occur in other components.
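One simple way to picture a bulkhead in Ruby is as an exception barrier around a non-critical operation. This is only a sketch: the `bulkhead` helper and its `fallback` and `on_error` parameters are names I made up for illustration, not an API from any library.

```ruby
# Runs the block behind a barrier: a StandardError inside it does not escape.
# The optional on_error callable receives the exception (for logging, say),
# and the fallback value is returned instead of the block's result.
def bulkhead(fallback: nil, on_error: nil)
  yield
rescue StandardError => e
  on_error&.call(e)
  fallback
end

# Example: a page still renders when an optional widget fails.
def render_page(recommendations_source)
  recommendations = bulkhead(fallback: []) { recommendations_source.call }
  "page with #{recommendations.size} recommendations"
end
```

With this in place, `render_page(-> { raise IOError, "service down" })` still returns a page, just with zero recommendations.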
00:05:29.170
Using timeouts with external operations is also a critical strategy, as it prevents long-running requests from affecting system responsiveness.
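A minimal sketch of putting an upper bound on a slow call, using the `Timeout` module from Ruby's standard library. The helper name `call_with_timeout` and its `fallback` parameter are hypothetical. Note that `Timeout.timeout` interrupts the block from another thread, so where a client offers its own settings (for example `open_timeout` and `read_timeout` on `Net::HTTP`), those are usually the safer option.

```ruby
require "timeout"

# Runs the block, but gives up after the given number of seconds
# and returns the fallback value instead of waiting any longer.
def call_with_timeout(seconds, fallback: nil)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  fallback
end
```

For example, `call_with_timeout(2, fallback: :unavailable) { slow_service_call }` returns `:unavailable` if the hypothetical `slow_service_call` takes longer than two seconds.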
00:05:42.360
In traditional Ruby applications, where requests can be blocked by external services, implementing bulkheads and timeouts helps ensure the overall health of the system.
00:05:55.620
The circuit breaker pattern is another useful technique that prevents call failures from propagating through the system.
00:06:07.700
Just like electrical circuit breakers prevent fires, software circuit breakers open up upon detecting failures, halting further calls to the malfunctioning service.
00:06:19.920
When implementing this pattern, it's essential to define the states: closed, open, and half-open. Once the circuit has been open for a while, it can move to the half-open state and let a trial call through to check if the system has recovered.
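To make the three states concrete, here is a minimal, single-process sketch of a circuit breaker. It is not the API of any particular gem: the class, its parameters (`failure_threshold`, `reset_timeout`, `clock`) and the `OpenError` exception are all names chosen for illustration, and the sketch is not thread-safe.

```ruby
class CircuitBreaker
  # Raised when a call is rejected because the circuit is open.
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(failure_threshold: 3, reset_timeout: 30, clock: nil)
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    # The clock is injectable so the behaviour can be exercised without sleeping.
    @clock = clock || -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def call
    if @state == :open
      # After the reset timeout, allow one trial call through (half-open).
      raise OpenError, "circuit is open" if @clock.call - @opened_at < @reset_timeout
      @state = :half_open
    end

    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    record_success
    result
  end

  private

  def record_success
    @failures = 0
    @state = :closed
  end

  def record_failure
    @failures += 1
    # A failed trial call, or too many consecutive failures, opens the circuit.
    if @state == :half_open || @failures >= @failure_threshold
      @state = :open
      @opened_at = @clock.call
    end
  end
end
```

Usage would look like `breaker.call { call_external_service }`, rescuing `CircuitBreaker::OpenError` to serve a fallback while the circuit is open. Because the state lives in instance variables, each Ruby process would have its own independent breaker.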
00:06:33.100
For Ruby, there are libraries available such as the Circuit Breaker gem, which allows for easy implementation of this pattern.
00:06:46.140
However, care must be taken when using these libraries, as Ruby applications often run in a way that might not be conducive to sharing states across processes.
00:07:00.200
Alternatives like the Semian library from Shopify can help synchronize circuit states across multiple workers.
00:07:16.040
To make Ruby applications more resilient to failures, we can also subdivide our systems into smaller, more manageable services.
00:07:32.250
This approach improves flexibility and allows for distributed fault isolation. For example, in one of our projects, we divided an application into three distinct parts using RabbitMQ for communication.
00:07:46.200
However, we encountered complications we dubbed 'ghost failures' that were hard to trace. These failures emerged after idle periods and often came down to misconfigurations related to the messaging system.
00:08:01.620
The issues arose due to lost connections rather than actual failures, caused by garbage collection of idle connections by firewalls or cloud providers.
00:08:15.600
The lesson from this experience is that proper monitoring and maintenance of your services are crucial for success.
00:08:35.350
The core of fault tolerance remains the ability to recover from failures without causing cascading impacts on the application or user experiences.
00:08:50.290
Moving forward through technology, we can look at how concurrency and fault tolerance principles were integrated into languages like Erlang.
00:09:05.250
Erlang was designed for telecom systems, combining hardware and software aspects to manage failures, thus addressing the same challenges developers face today.
00:09:16.830
In Erlang, failure management is improved through 'supervisors' that monitor and restart crashing actors. This concept of a supervision tree allows for resilient software structure.
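To give a rough feel for the idea in Ruby terms, here is a toy supervisor that restarts a crashing worker a limited number of times. It uses a plain thread, so it is only a loose analogy to Erlang's supervisors and not how Erlang implements them; the `Supervisor` class and its `max_restarts` parameter are hypothetical names.

```ruby
class Supervisor
  attr_reader :restarts

  # The worker block receives the number of restarts so far.
  def initialize(max_restarts: 5, &worker)
    @worker = worker
    @max_restarts = max_restarts
    @restarts = 0
  end

  # Starts the worker in a thread and restarts it whenever it crashes,
  # until it finishes normally or the restart budget is exhausted.
  def start
    @thread = Thread.new do
      begin
        @worker.call(@restarts)
      rescue StandardError
        if @restarts < @max_restarts
          @restarts += 1
          retry
        end
        raise
      end
    end
    self
  end

  # Waits for the worker and returns its result (or re-raises its final error).
  def wait
    @thread.value
  end
end
```

The restart limit matters: without it, a worker that always crashes would be restarted forever, which is its own kind of failure.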
00:09:33.150
The actor model of concurrency implemented by Erlang, where each actor communicates via messages, offers tremendous scalability and fault tolerance.
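As a sketch of the message-passing idea using only Ruby's built-in `Thread` and `Queue`, the toy actor below owns its state and changes it only in response to messages from its mailbox. The `CounterActor` class and its message names are invented for illustration; this is not how Erlang processes work internally.

```ruby
class CounterActor
  def initialize
    @mailbox = Queue.new
    @count = 0
    # Only this thread ever touches @count, so no locks are needed.
    @thread = Thread.new { run }
  end

  # Asynchronously delivers a message; reply_to is an optional Queue for answers.
  def send_message(message, reply_to = nil)
    @mailbox << [message, reply_to]
    self
  end

  # Asks the actor to finish and waits for its thread to exit.
  def stop
    send_message(:stop)
    @thread.join
  end

  private

  # Processes messages one at a time, in the order they arrived.
  def run
    loop do
      message, reply_to = @mailbox.pop
      case message
      when :increment then @count += 1
      when :get then reply_to << @count
      when :stop then break
      end
    end
  end
end
```

A caller sends `:increment` messages and later asks for the value by passing its own `Queue` with a `:get` message, then reads the answer from that queue.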
00:09:48.920
In Ruby, libraries like Celluloid allow for similar actor-based design, aiding in building resilient applications.
00:10:03.520
Celluloid allows Ruby developers to create actors with ease, handling failures effectively and providing a clean interface for managing concurrency.
00:10:15.750
Another helpful library is concurrent-ruby, which Rails uses for handling WebSockets and which brings concurrency support to Ruby.
00:10:31.020
These patterns signify a shift towards more structured fault tolerance within Ruby applications, enabling more robust software design.
00:10:45.290
As we anticipate Ruby 3.0's launch, there's excitement for new concurrency improvements inspired by Erlang's actor model.
00:10:54.550
These enhancements promise to make Ruby applications even more resilient to failures, streamlining the process for developers.
00:11:10.700
In this discussion, it's essential to remember that while mechanisms for handling failures are crucial, they still need to be understandable and manageable for developers.
00:11:31.030
As we continue to innovate and improve, let’s remain conscious of the balance between complexity and fault tolerance.
00:11:42.430
Thank you for your attention, and I hope this discussion has provided insights into making Ruby applications more fault-tolerant. If you have any questions, feel free to ask.