Samuel Williams

Fibers Are the Right Solution

The majority of performance improvements in modern processors are due to increased core count rather than increased instruction execution frequency. To maximise hardware utilization, applications need to use multiple processes and threads. Servers that process discrete requests are good candidates for both parallelization and concurrency improvements. We discuss different ways in which servers can improve processor utilization and how these different approaches affect application code. We show that fibers require minimal changes to existing application code and are thus a good approach for retrofitting existing systems.

RubyKaigi 2019 https://rubykaigi.org/2019/presentations/ioquatix.html#apr18


00:00:00.030 Hello everyone, konnichiwa! Welcome to my talk, "Fibers Are the Right Solution." My name is Samuel, and today we're going to discuss how to make Ruby more scalable.
00:00:05.400 Firstly, let us consider the question: What is scalability? What does it mean to be scalable? Here we have a processor doing work; for example, the job could be a web request. In order for a system to be scalable, adding more hardware should give a proportional improvement to job processing capacity. The key point is that the improvement is proportional: if we double the amount of hardware, we should be able to do twice as much work.
00:00:19.770 Why is scalability important? Scalability is a measure of the efficiency of a system with respect to the work it must do. When we create systems, designing for efficiency is not enough; we must consider how efficiency changes as the workload grows. This is a critical consideration not just for technical reasons, but also for businesses that depend on Ruby. Efficient systems are also good for the planet, as the technology sector is responsible for about 2% of global carbon emissions. As a father, I think about the impact we have both as individuals and collectively on this incredible planet we live on. It is something we, as software engineers, need to improve.
00:01:02.520 So, is Ruby scalable? We need to consider the context in which we are asking this question. The biggest context by far is Ruby web applications; Ruby is used on about a million websites globally. Yet, there have been many cases discussed where scaling Ruby has been difficult. In the systems I've observed over the past couple of years, I've seen Ruby web applications often spending a large proportion of time waiting on database queries and other forms of blocking I/O. In a typical 60-second sample, for example, 10 seconds were spent doing actual processing, while 50 seconds were spent waiting on blocking I/O. Blocking I/O is by far the single biggest contributor to general latency and throughput issues, especially in Ruby web applications. The current design of Ruby makes it difficult to utilize hardware resources effectively.
00:01:53.960 So, how do we maximize hardware utilization? This problem is already solved by modern operating systems, but it was not always the case. Let us go back in time to the late 1950s. This is a picture of an IBM 709 mainframe; it was one of the last IBM computers to use valves rather than transistors, and it was also one of the first computers on which time-sharing was experimented with. Time-sharing is a way to improve the utilization of hardware by scheduling multiple jobs to share common hardware resources more efficiently. One of the biggest sources of latency in computer programs at the time was reading and writing to tape-based storage; seeking was especially slow, often taking tens of seconds. This kind of inefficiency caused the main processor to spend a lot of time idle, waiting for operations to complete.
00:02:59.459 Even though hardware has improved significantly, the relative difference in performance between CPUs and I/O devices remains similar. The time it takes to execute a single CPU instruction is about one nanosecond, while modern CPUs have a local cache that is fast to access, on the order of tens of nanoseconds. Main memory is much slower, as the CPU must use a memory controller to communicate with external RAM chips, with access times on the order of hundreds of nanoseconds. Persistent storage, like solid-state disks, is slower still, taking on the order of tens of microseconds to milliseconds per access. You can clearly see that the latency of accessing the disk greatly exceeds anything that happens within the CPU; network latency is even higher, on the order of tens to hundreds of milliseconds.
00:05:10.049 If a processor is executing a job that must wait for I/O, we can schedule another job to run. By doing this, we can improve hardware utilization. Even if a job isn't waiting on I/O, we may force a context switch, a technique referred to as time slicing. This kind of interleaving of work is generally referred to as concurrency, which allows us to improve hardware utilization by sharing it between multiple jobs. Another way to enhance utilization is to increase the available hardware, for example, by adding an extra processor. Each processor can interact with a shared storage system, but if they both try to access it simultaneously, contention occurs, causing one of them to wait.
00:06:02.960 Simultaneous execution of jobs is known as parallelism. Parallelism maximizes the utilization of hardware by ensuring different parts of the system are balanced. So, how does this apply to Ruby? A major focus for the past several years has been improving the performance of the Ruby interpreter. Is Ruby fast enough? Let's assume we could make Ruby ten times faster. What happens to our hypothetical application that spent ten seconds processing? That work now takes only one second, which is a big improvement, but we still spend 50 seconds waiting on blocking I/O, which hasn't changed despite improving Ruby significantly. What if we could make Ruby a hundred times faster? Is that even possible? In my experience, Ruby is only about ten to a hundred times slower than native C, so that is perhaps achievable with the JIT compiler.
00:08:20.910 Unfortunately, the overall latency remains significant. Making Ruby faster does not help much because most of the latency comes from blocking I/O, meaning we didn't optimize the right area. How do we handle more requests? Interpreter performance is critical, but it's not the biggest source of latency in many cases—especially in web applications.
00:08:57.960 Can we use multiple processes? Absolutely. The operating system can schedule processes efficiently on multiple processors to maximize hardware parallelism and concurrency. What about threads? Like processes, threads allow for parallelism and concurrency. However, because threads operate in the same address space, they can cause data corruption if not used carefully. Threads also have another downside—in MRI, to protect the shared state within the interpreter, the global interpreter lock (GIL) prevents Ruby threads from running in parallel.
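To make the effect concrete, here is a minimal illustration of my own (not from the talk): with CPU-bound work under MRI's GIL, running the work on four threads takes about as long as running it serially. The `fib` method is just a stand-in workload.

```ruby
require 'benchmark'

def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

# CPU-bound work: under MRI's GIL, only one thread executes Ruby code at
# a time, so four threads take roughly as long as serial execution.
Benchmark.bm(8) do |x|
  x.report("serial")  { 4.times { fib(28) } }
  x.report("threads") { 4.times.map { Thread.new { fib(28) } }.each(&:join) }
end
```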
00:09:40.100 How bad is the global interpreter lock? Here is a quick benchmark I conducted using Falcon, a multi-process, multi-threaded, event-driven web server. With eight processes, Ruby managed about 60,000 requests per second. What was the overhead of the global interpreter lock? Running the same workload with threads instead, we lost about three-quarters of the performance, which is a substantial reduction. Clearly, processes and threads can scale Ruby up, but are they sufficient? How many simultaneous connections can we handle? How many processes can we create? How much memory does Ruby require per process? In my experience, the average web application consumes anywhere from 50 to 500 megabytes of main memory per process.
00:10:54.600 What about threads? How many threads can we create—100, 1000, 10,000? Can you use just processes and threads to handle 10,000 requests per second? What about long-running connections? What if there are a hundred thousand connected WebSockets? Do you spawn a hundred thousand processes or threads? Does your server have 100 gigabytes of memory? Maybe it does, especially if you're on AWS.
00:11:35.000 We need to go deeper. The proven model for massive scalability is event-driven, non-blocking I/O. Instead of having multiple processes, one for each connection, and letting the operating system manage scheduling, we can use a single process which asks the operating system about events that have occurred on any of the connections we are interested in. By handling many connections in a single process, we significantly reduce the per-connection overhead. This design can handle millions of connections, limited more by how many operations you can process per second than by the number of connections.
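As a rough sketch of this model (a simple echo server of my own, not code from the talk), a single process can use IO.select to ask the operating system which connections are ready:

```ruby
require 'socket'

server = TCPServer.new(9090)
clients = []

# A single process multiplexes many connections: ask the operating system
# which descriptors are readable, then handle only those.
loop do
  readable, = IO.select([server, *clients])

  readable.each do |io|
    if io == server
      clients << server.accept
    else
      data = io.read_nonblock(1024, exception: false)
      if data == :wait_readable
        next # not actually ready; try again later
      elsif data.nil? # end of stream
        clients.delete(io)
        io.close
      else
        io.write(data) # echo the data back
      end
    end
  end
end
```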
00:12:26.230 When we have one process managing each connection, we can use sequential programming, writing one statement after another. Let us consider a sequential program that uses blocking I/O. Firstly, the program connects to a remote system, initializes the count to zero, and then reads data into a buffer to accumulate the size. This blocking operation means that the program must wait until the data is available. Once completed, it returns the count, ensuring that the connection is closed at the end of the function.
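Reconstructed from that description, the sequential version looks roughly like this (the `count_size` name and exact details are my own):

```ruby
require 'socket'

# Sequential, blocking version: each statement runs after the previous
# one completes; reads block until data is available.
def count_size(host, port)
  peer = TCPSocket.new(host, port)
  count = 0

  while (buffer = peer.read(1024)) # blocks until data arrives or EOF
    count += buffer.bytesize
  end

  count
ensure
  peer&.close
end
```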
00:13:35.130 Using a sequential approach is easy to read, write, and debug. However, when using an event loop, we need to invoke user code when the I/O operation is ready to continue. One of the simplest ways to manage this is with callbacks. When we have a blocking operation, we provide a callback to be invoked when the operation finishes or fails.
00:14:00.470 Here’s an example of applying callbacks in a function. We initiate the connection to the remote host; if the operation can block, we include a callback for when it completes. Using callbacks means that the code utilizing this function must also be modified accordingly. If the connection is successful, we can start reading data. Implementing loops with callbacks is challenging. We need to create an anonymous function and use recursion instead of a flat loop. This not only complicates the code but can also confuse those who need to maintain it.
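Here is a sketch of what that looks like; the event-loop API (`event_loop.connect` and `peer.read` invoking a block on completion) is hypothetical, for illustration only:

```ruby
# Callback version of the same function, against a hypothetical event loop.
def count_size(event_loop, host, port, &callback)
  event_loop.connect(host, port) do |error, peer|
    if error
      callback.call(error)
    else
      count = 0

      # Loops become recursion: each completed read schedules the next.
      reader = lambda do |read_error, buffer|
        if read_error
          peer.close
          callback.call(read_error)
        elsif buffer.nil? # end of stream
          peer.close
          callback.call(nil, count)
        else
          count += buffer.bytesize
          peer.read(1024, &reader)
        end
      end

      peer.read(1024, &reader)
    end
  end
end
```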
00:15:15.350 So, how many of you enjoy using callbacks, and does this kind of programming make you happy? (Responses from audience) Welcome to callback hell! Is this the kind of code you want to write? Having `async` and `await` is a step in the right direction, yet it’s still largely syntactic sugar for callbacks. Thus, it may not eliminate the need for significant changes to your existing code.
00:16:00.000 Here is the same sequential program using a hypothetical implementation of Ruby with `async`. It resembles the previous structure quite a bit. I made a gem called async-await, and you should try it. It feels almost identical to the traditional programming style, but methods that might block must be preceded by `async`. However, you must ensure that calling `puts`, which is I/O, is properly handled, otherwise the code could lead to unforeseen issues. Like a virus, the `async` keyword could spread to other areas of the code.
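Based on the async-await gem's documented style (treat the exact interface as an assumption from memory), the example looks something like this:

```ruby
require 'socket'
require 'async/await'

class Counter
  include Async::Await

  # `async` marks methods that may block; calling one returns a task
  # running inside the event loop.
  async def count_size(host, port)
    peer = TCPSocket.new(host, port)
    count = 0

    while (buffer = peer.read(1024))
      count += buffer.bytesize
    end

    count
  ensure
    peer&.close
  end
end
```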
00:17:36.330 How can we improve on this? The main question is whether these approaches work, which they certainly do in various cases. What do we want for Ruby? Should we rewrite existing code simply to enhance scalability? What if there were another option—using fibers? You’re probably wondering, what are fibers? Fibers operate like functions but maintain their own stack. A fiber retains its state between calls. For example, if we call a fiber with the value 10, it will hold that state and can be resumed later, allowing it to accumulate values over calls.
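A minimal illustration of that accumulator (my own code, mirroring the example described):

```ruby
# A fiber keeps its own stack and local variables between resumes.
accumulator = Fiber.new do |value|
  total = 0

  loop do
    total += value
    value = Fiber.yield(total) # pause; the next resume supplies a new value
  end
end

accumulator.resume(10) # => 10
accumulator.resume(5)  # => 15
accumulator.resume(7)  # => 22
```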
00:18:40.510 Fibers can yield and resume their operations. This means we can create multiple fibers, one for each connection, which yield during blocking I/O operations. When those operations are ready, we can resume the fibers from where they left off. For instance, if we have two connected sockets: one performs multiple reads until it would block, then yields to the event loop. The event loop waits for the operating system to notify which sockets are ready and resumes the fibers appropriately. This allows the program to handle multiple connections efficiently.
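Here is a toy sketch of this scheme (an echo server of my own devising, not the talk's code): each connection gets a fiber that yields the IO it is blocked on, and the loop resumes it once IO.select reports readiness.

```ruby
require 'socket'

waiting = {} # io => the fiber blocked on it

handle = lambda do |io|
  Fiber.new do
    while (data = io.read_nonblock(1024, exception: false))
      if data == :wait_readable
        Fiber.yield(io) # would block: hand control back to the event loop
      else
        io.write(data) # echo the data back
      end
    end
    io.close # read_nonblock returned nil: end of stream
  end
end

server = TCPServer.new(9090)

loop do
  readable, = IO.select([server, *waiting.keys])

  readable.each do |io|
    fiber = (io == server) ? handle.call(server.accept) : waiting.delete(io)
    result = fiber.resume # runs until the fiber yields an io or finishes
    waiting[result] = fiber if fiber.alive?
  end
end
```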
00:20:34.570 Here is our original synchronous code restructured into an asynchronous event loop. Notice that the core logic remains identical, ensuring minimal disruption. This code actually creates two fibers—one for the event loop and one for the main function's body. Nesting `async` blocks will generate multiple fibers in the same event loop.
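Reconstructed along those lines, using the async and async-io gems (treat the exact endpoint API as an assumption from memory):

```ruby
require 'async'
require 'async/io'

# The outer Async block starts the event loop in one fiber; the block
# body runs in another. The core logic is unchanged from the blocking
# version; reads now yield to the reactor instead of blocking the thread.
Async do
  endpoint = Async::IO::Endpoint.tcp("localhost", 9090)

  endpoint.connect do |peer|
    count = 0

    while (buffer = peer.read(1024))
      count += buffer.bytesize
    end

    puts count
  end
end
```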
00:21:38.090 How do we make existing code scalable? Ruby is dynamic, allowing us to replace blocking primitives with non-blocking implementations. This solution is implemented in a pull request that transparently makes all I/O in Ruby non-blocking on a per-thread basis. This is the best way to make existing Ruby code scalable with minimal changes. It's a small change that can easily be implemented across MRI, JRuby, and TruffleRuby. The core of the change involves intercepting `wait_readable` and `wait_writable`, invoking the appropriate selector, and letting it manage the rest.
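Conceptually, the interception works like this (my own Ruby sketch of the idea, not the actual patch): where the interpreter would block, it instead parks the current fiber in a per-thread selector.

```ruby
# A per-thread selector: fibers park here instead of blocking the thread.
class Selector
  def initialize
    @readable = {} # io => fiber waiting to read
    @writable = {} # io => fiber waiting to write
  end

  # Called where the interpreter would otherwise block waiting to read:
  def wait_readable(io)
    @readable[io] = Fiber.current
    Fiber.yield # suspend this fiber until the event loop resumes it
  end

  def wait_writable(io)
    @writable[io] = Fiber.current
    Fiber.yield
  end

  # The event loop: wait for readiness, then resume the parked fibers.
  def run
    until @readable.empty? && @writable.empty?
      readable, writable = IO.select(@readable.keys, @writable.keys)
      readable&.each { |io| @readable.delete(io).resume }
      writable&.each { |io| @writable.delete(io).resume }
    end
  end
end
```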
00:23:44.430 There is a robust ecosystem of gems providing non-blocking interfaces, including support for Postgres, WebSockets, DNS, HTTP, and more. The async gem is compatible with all currently supported Ruby releases. As web applications are almost always I/O-bound, this model allows Ruby to scale effectively. Where the Ruby ecosystem has lacked scalability, for example handling WebSockets under Puma, async unlocks higher tiers of scalability in a truly Ruby-like way.
00:26:33.850 For those migrating from Puma, Falcon is a pure-Ruby application server built on the async framework. Falcon supports multi-process and multi-threaded containers with non-blocking fibers, as well as HTTP/1 and HTTP/2 with TLS support out of the box, complete with push promises and other HTTP/2 features. With its design, Falcon can dynamically manage thousands of WebSocket connections. If you can swap an existing synchronous library for an asynchronous one, like using an async driver for Postgres, you can vastly improve concurrency and handle more requests.
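The migration itself is small. Assuming a standard Rack setup, it looks something like this:

```ruby
# Gemfile: replace puma with falcon.
gem 'falcon'

# config.ru — an unmodified Rack application works as-is:
run lambda { |env|
  [200, { 'content-type' => 'text/plain' }, ['Hello from Falcon']]
}

# Then start the server with `falcon serve` instead of `puma`.
```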
00:27:43.620 So, how does it perform? For I/O-bound work, it scales exceptionally well; how much it helps depends on whether your workload is I/O-bound or CPU-bound. In testing, while Puma struggled to fully utilize CPU resources, Falcon effectively engaged all CPU cores, showcasing fibers as the right solution.
00:27:59.998 Fibers can outperform threads alone; they allow for improved concurrency without running into the contention issues associated with the global interpreter lock. Furthermore, fibers enable a largely synchronous programming model that avoids callback hell and does not require new language constructs like async and await.
00:28:20.870 Fibers enhance the scalability of existing code. Given Ruby's dynamic nature, we can easily substitute blocking operations with non-blocking event reactors. Thus, fibers are the right solution. Thank you very much for listening to my talk! If you have any questions, feel free to ask.