RubyKaigi Takeout 2020

Don't Wait For Me! Scalable Concurrency for Ruby 3!

We have proven that fibers are useful for building scalable systems. In order to develop this further, we need to add hooks into the various Ruby VMs so that we can improve the concurrency of existing code without changes. There is an outstanding PR for this work, but additional effort is required to bring this to completion and show its effectiveness in real world situations. We will discuss the implementation of this PR, the implementation of the corresponding Async Scheduler, and how they work together to improve the scalability of Ruby applications.


00:00:00.560 Hello and welcome to my talk: "Don't Wait For Me! Scalable Concurrency for Ruby 3!" Today, we are going to discuss Ruby scalability, what it is, and how it impacts your applications. Then, we will look at the new Ruby fiber scheduler, how it is implemented, and how it can improve existing applications with minimal changes.
00:00:24.640 So, what makes Ruby a great language? Ruby has many good qualities, but I wanted to gather some opinions from others. Naturally, I asked Twitter, and here are the results: syntax was the most important factor, followed closely by semantics, libraries, and finally scalability. Someone even voted for Matz. Thank you, Matz, for creating Ruby and supporting my work!
00:00:54.879 Looking at these results, we might wonder: is scalability important? Why did it come last? Many companies handle large amounts of traffic with Ruby, yet it appears scalability might be a weak point. Before we delve deeper, we should discuss what scalability is, so we have a clear idea of the problems we are trying to solve and how we can improve Ruby in this regard.
00:01:30.640 Scalability is fundamentally about the efficiency of a system and how many resources are required as the size of the problem increases. For example, we can compare the efficiency of English and Kanji: English requires six times more characters to convey the same meaning, making it less efficient.
00:01:52.560 As we increase the size of our system, efficiency issues manifest in many different ways. One significant issue is resource utilization: how much time we spend processing versus waiting. If we do not efficiently utilize the hardware, our scalability will be negatively impacted, leading to higher running costs. Cloud providers thrive on inefficient systems because they profit from it, but we want to avoid this situation to enhance the cost-effectiveness of our business growth.
00:02:26.720 Modern processors are so quick that they often spend a lot of time waiting on memory, storage, and network I/O. To utilize hardware more efficiently, we can time-share between multiple tasks; when one task is waiting, another can execute. This is one way to improve scalability.
00:02:51.120 Typical operating systems use processes and threads to execute tasks. While threads are similar to processes, they have less isolation and fault tolerance because they share memory and address space. A single hardware processor can context-switch between these tasks to complete all required work as quickly as possible.
00:03:16.879 We can add more hardware processors to increase system scale, allowing multiple tasks to execute simultaneously. If one task is waiting, the processor can move to another. However, switching between tasks is not free; context switching, as it is called, is an expensive operation at the system level. As processors dance between tasks, the overhead of switching can become significant.
00:03:48.640 We refer to simultaneous execution as parallelism, which is an important aspect of any scalable system. However, parallelism introduces significant complexities, including non-determinism, race conditions, and fault tolerance. Application code should not have to manage these complexities, as it has repeatedly been shown that thread-safety issues are a significant source of bugs. Used carefully, isolated parallelism provides the foundation for scalable systems.
00:04:25.440 Another way to improve hardware utilization is to interleave tasks. This can be achieved through several methods: callbacks, promises, and fibers. When application code would execute a blocking operation, the state is saved, and instead of waiting, another task can be executed. When the application code is ready to process again, it will resume.
00:05:01.039 This approach allows us to combine tasks back to back within a single process, minimizing the overhead of system-level context switches by implementing an event loop that schedules work more efficiently. We refer to interleaved execution as concurrency, which is an important tool for highly scalable systems.
00:05:20.080 Threads and processes are designed to expose hardware-level parallelism, limited by the number of processor cores in your system. In practice, having one process or thread per connection typically implies an upper bound of 10 to 100 connections per system, which is simply not practical for building highly scalable applications.
00:05:43.840 How does scalability impact cost? A lack of scalability in an application can be costly for businesses. Not only does it increase your hardware costs, but it can also negatively impact user experience if systems become overloaded. Additionally, you may find yourself building more complex distributed systems to compensate for a lack of scalability in your core application. Such designs bring many hidden costs, including operational and maintenance overheads, making it difficult to pursue new opportunities due to scalability bottlenecks.
00:06:20.160 Efficient and scalable systems are also beneficial for the planet, as fewer computers are needed to perform the same work. The technology sector is responsible for approximately 2% of global carbon emissions, and civilization faces unprecedented challenges related to climate change. We can help by improving the scalability of our code to reduce natural resource consumption and energy utilization, thus contributing to the development of more efficient systems for society.
00:06:46.639 I believe this is our responsibility as software engineers. Now, is Ruby fast enough? Significant efforts go into improving Ruby, including enhancements to the interpreter, garbage collection, and just-in-time compilation. Yet, we are still striving to improve performance.
00:07:07.120 Let's consider a hypothetical request that includes several blocking operations. I have split it into 10 milliseconds of processing time and 190 milliseconds of waiting time. The total time is 200 milliseconds, a very typical situation. For instance, at its peak, a real Ruby web application might spend 25 milliseconds executing Ruby code and 150 milliseconds waiting on Redis and external HTTP requests.
00:07:46.720 So, what happens if we make Ruby ten times faster? In our hypothetical scenario, we reduce the processing time from 10 milliseconds to 1 millisecond—a tenfold improvement—but the total time is still dominated by waiting, which remains unchanged. Now, what if we could make Ruby's execution a hundred times faster?
00:08:09.679 The results are clear: even if we improve Ruby's performance, so much time is still wasted waiting. Thus, we need to find ways to make Ruby faster and avoid long waiting times.
00:08:25.760 Let's revisit the example of a single request within the context of a small program. This request is one among many, and such sequential processing happens frequently when clients interact with multiple remote systems or fetch multiple records on an application server handling various incoming requests.
00:08:46.640 Imagine if we could change this: while waiting for one request to complete, we could start another. Although we cannot change the latency of individual requests, we can significantly reduce the total latency of all requests together, thereby improving the application's throughput.
00:09:22.640 So, how do we implement this? Let's examine the changes needed in the code. We will start from our sequential program and add a worker to wrap our requests, allowing us to represent individual jobs that can run concurrently.
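A minimal sketch of this step, with illustrative names (the `Worker` class and `fake_request` method are stand-ins, not the talk's actual code): each request is wrapped in a worker, which is simply a resumable block, so individual jobs can later be interleaved.

```ruby
# Illustrative sketch: wrap each request in a worker so individual
# jobs can be scheduled independently. A Fiber gives us a block we
# can start, suspend, and resume.
class Worker
  def initialize(&block)
    @fiber = Fiber.new(&block)
  end

  def resume
    @fiber.resume
  end
end

def fake_request(id)
  "response #{id}" # stands in for connect / write / read
end

responses = []
workers = 3.times.map { |i| Worker.new { responses << fake_request(i) } }
workers.each(&:resume)
```

At this point the workers still run sequentially; the concurrency comes later, once blocking operations yield control instead of waiting.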
00:10:12.320 Next, let's expand what a request looks like. First, we will make a connection to the service, then write a request, and finally read the response.
00:10:31.440 Let's make our read operation non-blocking. First, we will change 'read' to 'read_nonblock,' and then we need to check the result; if the data is unavailable, we'll wait for the response to become readable. In that case, we will use a specific system call—IO.select—that waits until the data is available. This operation accounts for the waiting time in our program. Instead of waiting, we can use a special operation called 'yield', which passes control out of the current worker, allowing another worker to execute.
00:11:27.919 Additionally, we store the current worker into a waiting table, indexed by the I/O we are waiting on. Finally, we implement an event loop that waits for any I/O to become ready using IO.select, then resumes the corresponding worker so it can continue processing the response.
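The steps above can be condensed into a small runnable sketch (names like `nonblocking_read` are illustrative): a non-blocking read, a waiting table indexed by IO, and an event loop built on `IO.select` that resumes the corresponding worker.

```ruby
# Illustrative sketch of non-blocking read + waiting table + event loop.
waiting = {}

def nonblocking_read(io, waiting)
  io.read_nonblock(1024)
rescue IO::WaitReadable
  waiting[io] = Fiber.current # store the worker, indexed by its IO
  Fiber.yield                 # pass control so another worker can run
  retry
end

reader, writer = IO.pipe

worker = Fiber.new { nonblocking_read(reader, waiting) }
worker.resume # the read would block, so the worker yields back to us

writer.write("hello")

result = nil
until waiting.empty?
  ready, = IO.select(waiting.keys)      # wait for any IO to become ready
  ready.each do |io|
    result = waiting.delete(io).resume  # resume the corresponding worker
  end
end
```

When the resumed worker retries the read, the data is now available, so it completes without ever having blocked the event loop.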
00:12:01.040 Congratulations! This is the entire implementation of non-blocking event-driven I/O for Ruby.
00:12:05.760 As you can see, the implementation is a bit verbose. We don’t want developers to have to rewrite their code like this because it exposes many details they shouldn’t have to worry about. In fact, I believe that you shouldn’t need to rewrite code at all; scalability should be a property of the runtime system, not your application code.
00:12:42.640 Let's simplify this code and hide some of the details behind a tidy interface. First, we will encapsulate the waiting list into a scheduler object. This scheduler will provide the necessary interface to wait for I/O and other blocking operations, including sleep, while concealing all event loop and operating system details.
00:13:09.440 Next, we will move the specifics of waiting on a particular I/O into the scheduler implementation. There is already a standard method for doing this called 'wait_readable,' which allows the scheduler to handle the scheduling of individual tasks without the user needing to yield control or manage the waiting list—these are typically implementation-dependent details.
00:13:51.760 Finally, we move the entire event loop into the scheduler implementation and replace it with a single entry point: 'run.' Since some operating systems have specific requirements regarding how this loop is implemented, encapsulating it within the scheduler enhances the flexibility of user code across different platforms.
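Putting these three refactorings together, a condensed sketch might look like this. The interface names follow the talk ('wait_readable' and 'run'); the internals are illustrative, not the actual implementation.

```ruby
# Illustrative scheduler: encapsulates the waiting table, exposes
# wait_readable for blocking operations, and hides the event loop
# behind a single entry point, run.
class Scheduler
  def initialize
    @waiting = {}
  end

  # Suspend the current worker until io is readable.
  def wait_readable(io)
    @waiting[io] = Fiber.current
    Fiber.yield
  end

  # The event loop, hidden from user code.
  def run
    until @waiting.empty?
      ready, = IO.select(@waiting.keys)
      ready.each { |io| @waiting.delete(io).resume }
    end
  end
end

scheduler = Scheduler.new
reader, writer = IO.pipe
received = nil

Fiber.new do
  scheduler.wait_readable(reader)
  received = reader.read_nonblock(1024)
end.resume

writer.write("ready")
scheduler.run
```

User code now only ever touches `wait_readable` and `run`; the waiting list, `IO.select`, and fiber resumption are implementation details.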
00:14:46.559 As I mentioned earlier, I don't believe we should need to rewrite existing code to take advantage of the scheduler. Other programming languages often require explicit syntax like 'async' and 'await' to achieve this, but thanks to Ruby's fibers, we can avoid such explicit syntax and conceal the 'await' within the method implementations.
00:15:15.360 In our code, we expanded the read operation to be non-blocking, but this detail can be hidden within Ruby's implementation itself. Therefore, we revert the implementation back to the original blocking read while still needing to call into the scheduler. How can we do this? In our implementation, 'scheduler' is a local variable, but we can introduce a new special thread-local variable to store the current scheduler.
00:16:02.080 This thread-local variable allows us to pass the scheduler as an implicit argument to methods invoked on the same thread. The Ruby implementation of I/O read is mostly similar to our non-blocking implementation, and it internally invokes 'rb_io_wait_readable.' In this method, we add a bit of magic: at the start, we check if a thread-local scheduler is set and, if so, defer to its implementation of 'wait_readable.'
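A rough Ruby rendering of that dispatch logic, for illustration only: in CRuby the real check lives in C, inside 'rb_io_wait_readable,' and the `:scheduler` thread-local key and `blocking_read` method here are hypothetical names.

```ruby
require "io/wait"

# Hypothetical thread-local slot for the current scheduler.
Thread.current[:scheduler] = nil

# Illustrative Ruby version of what the C implementation does:
# a blocking read checks for a thread-local scheduler and, if one
# is set, defers to its wait_readable instead of blocking.
def blocking_read(io)
  io.read_nonblock(1024)
rescue IO::WaitReadable
  if (scheduler = Thread.current[:scheduler])
    scheduler.wait_readable(io) # non-blocking: yields this fiber
  else
    io.wait_readable            # no scheduler: block as before
  end
  retry
end
```

Because the check happens inside the I/O implementation, application code keeps calling plain `read` and is rerouted to the scheduler automatically whenever one is installed on the current thread.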
00:16:44.960 This reroutes all blocking I/O operations to the scheduler if it is defined, all without necessitating any changes to application code.
00:17:26.060 Finally, we haven’t revealed the implementation of 'worker.' It is, in fact, nothing more than a Ruby fiber. While the practical implementation has additional internal details, this represents the complete implementation of non-blocking event-driven I/O for Ruby.
00:18:06.720 The original concept of this design began in 2017, and finally, in 2020, after extensive testing of these ideas, we made a formal proposal for Ruby 3. This design outlines the scheduler interface, which must work across different interpreters, including CRuby, JRuby, and TruffleRuby, and describes how existing blocking operations should be redirected.
00:19:00.160 The implementation of the scheduler may vary based on the underlying implementations, allowing for great flexibility. This proposal included a working implementation for CRuby and was recently approved by Matz for experimentation in the master branch, marking a significant milestone for the project.
00:19:35.680 We hope to iterate on the current proposal and deliver a robust implementation for Ruby 3. But how do we support existing web applications? Fundamentally, the scheduler is designed to require minimal changes in application code.
00:19:59.280 However, we must still provide a model to execute application code concurrently. Not all blocking operations are due to I/O; this is especially true for current database gems, which often conceal I/O operations.
00:20:18.720 Despite the challenges, this presents an exciting development: we have found that existing maintainers are willing to begin implementing the necessary improvements.
00:20:41.360 For applications requiring concurrency, you can start using Async. The core abstraction of Async is the asynchronous task, which provides fiber-based concurrency.
00:21:05.959 We introduce new non-blocking event-driven wrappers that have the same interface as Ruby's I/O classes. In many cases, you can inject them into existing code, and everything will function smoothly without requiring changes.
00:21:46.800 In Async, tasks execute predictably from top to bottom, so no special keywords or changes to program logic are necessary. In Ruby 2, you must use the async wrappers, but in Ruby 3, the scheduler interface allows us to hook into Ruby's blocking operations without altering application code.
00:22:06.960 At the core of Async is the non-blocking reactor, which implements the event loop. It supports various blocking operations, including I/O, timers, queues, and semaphores, all of which yield control back to the reactor while it schedules other tasks until the operation can proceed.
00:22:39.920 The Async gem is under 1,000 lines of code with substantial unit test coverage. We believe that reaching a stable 1.0 release with a well-defined interface and a robust test suite is essential because this establishes a strong foundation for building reliable applications.
00:23:01.920 Lately, I have been looking into Async because one of my projects would greatly benefit from non-blocking I/O, and it is designed beautifully. Janko Marohnić's thoughts on the general design of Async have been very informative.
00:23:19.840 Async has a growing ecosystem of scalable components, including support for Postgres, Redis, HTTP, and WebSockets. One of our goals is to bring high-quality implementations of common network protocols to Ruby, focusing on both consistency and scalability.
00:23:56.961 Almost all components are tested for performance regressions. Since the Ruby 3 scheduler interface has been merged, we now have experimental support for it in Async. Once this interface is finalized, you will be able to make existing Ruby I/O non-blocking when used within an Async task.
00:24:27.200 Currently, this is experimental, but we hope to release it soon. If you write existing code with Async, you won't need to make any changes to see improved concurrency with Ruby 3.
00:24:49.920 One of the most critical components for many businesses using Ruby is the web application server.
00:24:56.640 We built Falcon as a scalability proof of concept, and its success has led us to develop it into a mature application server for Ruby web applications.
00:25:24.960 Falcon is implemented in pure Ruby, yet it rivals the performance of other servers that include native components. This reinforces our original assumptions about the diminishing returns of improving the Ruby interpreter.
00:25:51.520 Out of the box, it supports HTTP/1 and HTTP/2 without any configuration, providing a modern localhost development environment. It can also be used in production without NGINX, featuring a robust model for WebSockets. Recently, we demonstrated a single instance managing one million active WebSocket connections!
00:26:32.240 To achieve this level of scalability, Falcon employs one fiber per request, scheduling your application requests concurrently if they perform blocking operations, such as database queries, web requests, or interacting with Redis.
00:27:07.680 To facilitate parallelism across multiple CPU cores, Falcon uses a hybrid multi-process and multi-threaded model, allowing seamless support for CRuby, JRuby, and TruffleRuby.
00:27:44.160 So, is it scalable? Well, like anything, it depends on how you use it. Various micro-benchmarks have demonstrated that it can perform exceptionally well, especially when compared to operations that are blocking on other systems.
00:28:03.040 For instance, the graph compares performance against Puma: Puma plateaued at 16 workers, while Falcon continued to scale upward until all CPU cores were maximized.
00:28:39.200 We regard this as a fantastic outcome, especially for companies that encounter substantial traffic volume. Using Async is the right model because web applications are almost always I/O-bound, and the Ruby web ecosystem has been lacking in scalability.
00:29:16.640 For instance, WebSockets on Falcon unlock the next layer of scalability in the most Ruby-like way possible. Brian Powell has insights on migrating from Puma to Falcon.
00:29:42.160 Now, the important question: does it work with Rails? Yes, it is Rack-compatible. However, some caveats come into play that we will discuss next.
00:30:08.640 Active Record is a commonly used gem for accessing databases, but unfortunately, it currently utilizes one connection per thread. This leads to issues with multiple independent fibers sharing the same underlying connection.
00:30:40.560 This presents a design flaw within Active Record, and discussions are ongoing on how to improve the situation. I believe Ruby 3 will drive these changes as the model for event-driven concurrency becomes clearer.
00:31:11.920 Another challenge is that database drivers do not use native Ruby I/O and are generally opaque C interfaces. However, the maintainer of the Postgres gem is eager to add support for non-blocking I/O.
00:31:30.560 I anticipate expanded support from other maintainers once the Ruby 3 concurrency model stabilizes, so these changes will hopefully propagate to higher-level libraries and drive adoption.
00:31:54.400 Alternately, we recently began implementing a low-level database interface gem called 'DB,' which currently provides non-blocking access to PostgreSQL and MariaDB/MySQL.
00:32:14.320 While we believe existing gems may eventually adopt the event-driven model, we didn’t want to wait for our implementation.
00:32:49.440 Redis is another commonly used system for swift access to in-memory data structures, widely utilized for job queues, caches, and similar tasks. We initiated a pull request to support event-driven I/O in Redis.rb, but, unfortunately, it was rejected.
00:33:13.440 Thus, we have continued to maintain the pure Ruby async-redis gem, which is almost as fast as the native hiredis implementation.
00:33:36.320 One essential aspect of the async gems is that they are guaranteed to be safe in a concurrent context, as our research identified many gems that are not thread-safe—including, until recently, the Redis gem.
00:34:09.840 HTTP is a fundamental protocol for the modern web, and we provide a gem called async HTTP, which features full client and server support for HTTP/1 and HTTP/2, and we hope HTTP/3 support will follow.
00:34:34.560 This gem fully supports persistent connections and HTTP/2 multiplexing transparently, thus delivering excellent performance right out of the box when paired with Falcon.
00:34:54.160 We recently added Faraday support with the async HTTP Faraday adapter, which we tested on various projects.
00:35:20.400 The GitHub changelog generator makes numerous requests to GitHub to retrieve detailed information about a project. While we haven't addressed every area of the code, we observed a notable performance improvement by fetching commits in pull requests concurrently.
00:35:55.440 For larger projects, the improvements resulting from multiplexing and persistent connections are even more significant.
00:36:15.680 In another script, we compared a traditional HTTP gem with the async HTTP Faraday adapter when fetching job status across a large build matrix from Travis, comprising about 60 to 70 requests.
00:36:45.280 Because those requests could now be multiplexed over a single connection, the overhead was significantly reduced. This highlights the advantages of leveraging async methods and underscores my excitement about the future of Async, Falcon, and non-blocking event-driven Ruby.
00:37:48.160 As a fun note, I've come across an intriguing experiment involving the FFI gem, which we utilize for implementing database drivers. This experiment offloads FFI calls to a background thread, subsequently yielding control back to the scheduler. While this approach is somewhat risky overall, it may be beneficial in certain cases.
00:38:05.680 What are the next steps? It depends on you, the community. Please test out these new gems and features within Ruby and provide feedback. The best thing you can do is enjoy experimenting with these new projects.
00:38:22.560 To businesses, please support this vital work. We recently introduced a paid model for Falcon, and if you're interested, please reach out.
00:38:45.440 We wish to support the Ruby ecosystem and foster its growth, especially regarding scalability. If this is essential for your business, you can assist us in further developing these projects through financial investment.
00:39:09.920 Thank you for listening to my talk!