Real World Applications with the Ruby Fiber Scheduler

RubyKaigi 2022

00:00:03.020 Hello everyone, it is such a pleasure to be here today. My name is Samuel, and today I'm going to tell you about some real-world applications of the Ruby fiber scheduler.
00:00:06.180 All the example code is available online using this QR code link. But before we get to that, I'm going to tell you a little bit about its history and development.
00:00:22.500 Over 10 years ago, I started working on a DNS server implementation in Ruby, unsurprisingly called Ruby DNS. I wanted to host my own website on a server in my bedroom, and I needed to have a hostname resolved differently depending on whether I was inside my home network or coming from the outside internet.
00:00:49.500 So, I created Ruby DNS for this purpose and set it up on my local network. However, everything stopped working. My web browser became really slow, and then it completely stopped working. It turned out that my DNS server had crashed.
00:01:07.920 It turns out that making a DNS server with a single process, and thus a single point of failure, is really bad for a network environment. Accessing a single website can require tens or hundreds of DNS queries. So, if your DNS server is slow, it can cause a lot of problems.
00:01:29.700 I wanted to make Ruby DNS robust and scalable, so I refactored the implementation to use EventMachine, which was a popular framework at the time. However, because EventMachine had a totally different implementation of IO, I had to rewrite my server.
00:01:42.180 Ruby's native UDP interface is pretty straightforward: you create a socket, bind it, and receive data. EventMachine is completely different. It uses a different method for creating and binding sockets, and a callback instead of the more traditional interface for receiving data.
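As a rough sketch of the native interface described here (not the code from the talk), Ruby's standard socket library lets you create a UDP socket, bind it, and receive a datagram. The loopback address, the OS-assigned port, and the payload are assumptions made for illustration.

```ruby
require 'socket'

# Server side: create a socket, bind it, and receive data.
server = UDPSocket.new
server.bind('127.0.0.1', 0)   # port 0 asks the OS for any free port
port = server.addr[1]         # addr is [family, port, host, ip]

# Client side: send a single datagram to the server.
client = UDPSocket.new
client.send('hello', 0, '127.0.0.1', port)

# recvfrom blocks until a datagram arrives and returns [data, sender_info].
data, sender = server.recvfrom(1024)
puts "received #{data.inspect} from #{sender[3]}"

client.close
server.close
```

The blocking recvfrom call is the kind of operation that a callback-based framework replaces with its own API, which is why moving to such a framework meant rewriting the server.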
00:02:06.899 This made me think about the different models we have for building scalable systems (EventMachine in this case, but also other proposals like async and await keywords) and how they force you to rewrite your code. Do we really want to rewrite everything?
00:02:31.140 Around this time, I built a Wikipedia DNS server that would fetch Wikipedia summaries and return them as DNS text records. To get the summary from Wikipedia, you need to perform an HTTP GET request. I tried several different HTTP clients to access Wikipedia from within EventMachine, but they all caused unexpected blocking that prevented the DNS server from handling more than one request at a time.
00:03:02.459 Here is a simplified event loop: it receives an incoming DNS request, performs an HTTP request for the summary, and writes the result back as a DNS response. However, because the HTTP GET is blocking, it stalled the event loop, so only one request could be processed at a time.
00:03:28.800 This led to a backlog of delayed requests that could overload the server. In contrast, a non-blocking HTTP client allows multiple requests to be processed simultaneously, reducing the chance of any one request getting blocked, thus improving throughput.
00:03:59.760 Since all the primitives provided by EventMachine were different from native Ruby, it wasn't obvious to me which combinations of libraries were compatible or would introduce blocking behavior. I started looking for other options to help avoid these compatibility issues and found Celluloid.
00:04:39.600 Each actor in Celluloid runs on its own thread, so I could have an actor for a DNS server and another for an HTTP client, allowing them to communicate using non-blocking message passing. I began rewriting Ruby DNS into Celluloid DNS, but I found that it would occasionally hang during test runs.
00:05:31.560 I started digging into the Celluloid code and found that it was occasionally locking up when starting a server before running a test. I investigated the link operation where linked actors tie the lifecycle of one actor to another; if one crashes, the other is forced to deal with the failure.
00:06:25.080 Often, this is handled by using a supervisor to restart the failed actor. However, I found that this implementation introduced a lot of non-deterministic globally visible behavior when writing tests, causing previous states or failures to leak into subsequent tests, making it very hard to guarantee isolation.
00:07:11.460 I wasn't comfortable with a system that had non-deterministic global state as part of its core design. Celluloid was complex internally, so it was difficult to understand or even know what the correct behavior should be.
00:07:22.620 I wasn't sure if I was even solving the right problems and didn't feel confident enough to use or recommend Celluloid as a reliable foundation for scalable Ruby applications. Why is it so hard to build something as simple as a non-blocking DNS server that performs HTTP requests? This shouldn't be so complicated.
00:08:00.900 I was frustrated that Ruby did not provide a compelling solution to this problem, considering it is such a great language in other respects. So, I felt strongly that I wanted to solve this problem and make Ruby even better.
00:08:35.640 I started thinking about what I wanted. I had invested enough of my time in other people's visions; I didn't want to rewrite my code. I wanted things to be compatible with each other, with well-defined semantics and isolation. I wanted a simple, straightforward foundation for introducing concurrency into my programs.
00:09:07.320 In 2017, I took all these ideas and started working on a new gem called async, which stands for asynchronous execution. Several months later, after building the first prototype, I released version one.
00:09:34.920 The short time frame was part of an explicit goal to restrict the implementation to only the very essential interfaces without unnecessary complexity or features. In 2018, I introduced a proposal for what is now called the fiber scheduler to add hooks to Ruby so that async could intercept native blocking IO operations and transparently redirect them to the event loop.
00:10:12.780 Hold on a minute: what are fibers? A fiber is a lightweight unit of execution that must be manually scheduled. Fibers are used by the fiber scheduler to switch between pieces of application code; for example, several incoming requests can each have their own fiber.
00:10:41.600 When a fiber executes an operation that would block, such as reading from the network, we can transfer to another fiber. When the data is available, we can switch back. I refer to this hidden context switch as internally non-blocking because it's not directly visible to the application code.
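To make the manual scheduling concrete, here is a minimal sketch using only core Ruby fibers. It is not how async is implemented: Fiber.yield marks the point where a real program would block on IO, and the toy round-robin loop is a hypothetical stand-in for the event loop.

```ruby
log = []

# Each "request" runs in its own fiber and yields where it would block on IO.
requests = %w[A B].map do |name|
  Fiber.new do
    log << "#{name}: start"
    Fiber.yield                 # pretend to wait for the network
    log << "#{name}: finish"
  end
end

# A toy round-robin scheduler: keep resuming fibers until all have finished.
until requests.none?(&:alive?)
  requests.each { |fiber| fiber.resume if fiber.alive? }
end

puts log
```

Request B starts before request A finishes, even though each fiber's own code reads as a plain sequence of steps.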
00:11:15.060 One of the first exciting achievements of the fiber scheduler interface was wrapping multiple concurrent Net::HTTP requests. This showed that it was possible for existing code to be embedded in the fiber scheduler without any changes.
00:11:55.260 The fiber scheduler proposal was merged in 2020, and we have continued to develop it to include many other internally non-blocking operations. At the end of 2021, we released async version 2, which has full support for the Ruby fiber scheduler. This was a big milestone, 10 years after I first created my DNS server.
00:12:19.740 Now that you've heard about why I created the Ruby fiber scheduler and the kinds of problems I was trying to solve, let me tell you about how it works. Async is a gem that provides an implementation of the Ruby fiber scheduler interface.
00:12:44.640 It transparently intercepts blocking operations and replaces them with internally non-blocking implementations. We refer to the interface methods allowing this redirection as hooks.
00:13:07.620 One of the first hooks we implemented, and perhaps the most important, was waiting for network sockets to be readable or writable. Here is a small example of a non-blocking HTTP request using async and the fiber scheduler.
00:13:50.160 We use the async block to define units of concurrency, which we refer to as tasks. The top-level async block also creates the event loop, so if you nest async blocks, they will share the same event loop.
00:14:22.020 Everything inside this task executes sequentially like normal code, but the task itself executes asynchronously in relation to other tasks. We can use the fiber scheduler interface to redirect blocking IO operations like connect, read, and write to an internally non-blocking implementation.
00:14:51.199 Notice that we don't need special wrappers or keywords. Depending on the host operating system, async uses efficient multiplexing provided by kqueue, epoll, or io_uring, allowing multiple tasks to execute IO operations concurrently.
00:15:18.420 Being able to integrate with existing code is crucial, so we made all thread primitives safe to use within the fiber scheduler. One of the most important operations is joining a thread and waiting for it to complete, which is internally non-blocking when performed within a fiber scheduler.
00:15:54.900 This provides an escape hatch for native code that might not be compatible with fiber concurrency. Here we show three threads being created; each thread sleeps for one second to simulate some native code. After that, we can wait for that work to finish without blocking the event loop.
00:16:32.040 We expect the total execution time to be around one second. Joining a thread is internally non-blocking and won't stall other tasks. We also support thread mutex, which can be used to implement thread-safe access to shared resources.
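The three-thread example can be approximated in plain Ruby as follows. This sketch assumes no scheduler is installed, so it only shows the threads overlapping; under a fiber scheduler the same join calls would also yield to the event loop. The delay is shortened from the talk's one second so the sketch runs quickly.

```ruby
delay = 0.2   # the talk uses 1 second; shortened here

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

# Each thread sleeps to stand in for some blocking native code.
threads = 3.times.map do |i|
  Thread.new do
    sleep delay
    i
  end
end

# Wait for all of the work to finish, then collect each thread's result.
threads.each(&:join)
results = threads.map(&:value)

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts "results=#{results.inspect} elapsed=#{elapsed.round(2)}s"
```

Because the three sleeps overlap, the elapsed time should be close to one delay rather than three.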
00:16:44.340 Here, we compute the Fibonacci sequence using a thread and an event loop that share the same state. The thread reads the previous two values and computes the sum within a synchronized block, while the event loop performs the same operation with mutex lock and synchronize, both of which are internally non-blocking.
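A minimal sketch of this shared-state pattern, assuming plain Ruby with the main thread standing in for the event loop (the slide code is not reproduced here, and the variable names are illustrative):

```ruby
sequence = [0, 1]
mutex = Thread::Mutex.new

# Read the previous two values and append their sum while holding the lock,
# so the read-compute-append step cannot interleave with another writer.
append_next = lambda do
  mutex.synchronize { sequence << sequence[-2] + sequence[-1] }
end

worker = Thread.new { 5.times { append_next.call } }
5.times { append_next.call }   # the main thread plays the event loop's role
worker.join

puts sequence.inspect
```

Whatever order the two writers run in, each append happens atomically, so the result is always the same Fibonacci prefix.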
00:17:30.060 Thread-safe queues are another crucial tool for building scalable programs and are safe to use in the fiber scheduler. You might want to communicate between different parts of your program using the producer-consumer pattern.
00:17:53.160 Here, we have a thread that adds several items to a queue, and those items are subsequently consumed by a task running in the event loop. Popping from the queue is internally non-blocking.
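The producer-consumer pattern described here might look like the following in plain Ruby. The item names are made up, and the consumer runs on the main thread rather than in an async task.

```ruby
queue = Thread::Queue.new

# Producer: a thread adds several items, then closes the queue.
producer = Thread.new do
  3.times { |i| queue.push("item #{i}") }
  queue.close
end

# Consumer: pop waits for the next item and returns nil once the queue
# is closed and empty, which ends the loop.
consumed = []
while (item = queue.pop)
  consumed << item
end

producer.join
puts consumed.inspect
```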
00:18:30.720 Most network clients need to perform DNS resolution, and the standard libc DNS resolver is blocking. It can be tricky to support non-blocking DNS queries, but we introduced a hook for this too. Almost any program that integrates with the network will need to resolve a hostname to an internet address.
00:19:03.840 Any Ruby code that does DNS resolution will go through the fiber scheduler hook, which async uses in conjunction with Ruby's own resolver library. Ruby's resolver library is usually blocking, but because it runs from within the fiber scheduler, it becomes internally non-blocking.
00:20:03.060 This is yet another great example of how an existing implementation can become event-driven with no changes required. It's also common to launch additional processes and wait on them, so we added a hook for this too.
00:20:51.600 This hook is used by a variety of methods, including system, backticks, and Process.wait, allowing all of these operations to be non-blocking. Here, we have an example that launches five child processes, each sleeping for one second.
00:21:26.400 Since they are all in separate concurrent tasks, the total execution time is only slightly more than one second. We also introduced safe asynchronous timeouts in Ruby. The Timeout module is often regarded as unsafe because it can interrupt code at any point.
00:22:14.640 However, when running within the scheduler, only operations that yield to the event loop can raise a timeout error, making it somewhat more predictable. In this example, we set a timer of one second around a sleep of ten seconds.
00:23:03.240 Both the timeout and the sleep are hooked by the fiber scheduler; the timeout will cancel the sleep after one second, leading to a safe exit with a timeout exception. With the hooks provided by the Ruby fiber scheduler, we can make existing Ruby code transparently non-blocking.
00:23:44.820 It appears to execute sequentially, but it is internally multiplexing between fibers using an event loop. This is a remarkable achievement, and I'm genuinely proud of it.
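The timeout example mentioned a moment ago can be sketched with the standard timeout library. This assumes plain Ruby without a scheduler, where Timeout interrupts the block from outside; the talk's point is that under the fiber scheduler the same code is only interrupted at points that yield to the event loop. The timer is shortened from one second to keep the sketch fast.

```ruby
require 'timeout'

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

outcome =
  begin
    # A short timer around a much longer sleep (the talk uses 1s around 10s).
    Timeout.timeout(0.1) { sleep 10 }
    :completed
  rescue Timeout::Error
    :timed_out
  end

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts "outcome=#{outcome} elapsed=#{elapsed.round(2)}s"
```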
00:24:38.520 Now that we've looked at how the Ruby fiber scheduler works, what can it do in real-world applications? What kind of problems can we solve with it? I'm going to break this down into five main questions.
00:25:15.780 The first question is: how can we handle lots of requests? Falcon is a web server built on top of async and can help your application manage many requests, especially in the presence of blocking operations like database queries and HTTP requests.
00:25:39.840 Let's compare Falcon with Puma. Both servers can handle more than 10,000 requests per second. This synthetic benchmark generates a small response of about 12,000 bytes. As you can see, both servers perform well.
00:26:03.900 However, the more response data we generate per request, the fewer requests we can handle per second. This benchmark creates a large response body of about 12 megabytes, resulting in around 200 requests per second, which is approximately 2.4 gigabytes per second—close to saturating a 25-gigabit network link just from one Ruby process.
00:27:01.140 But why should we prefer Falcon over Puma? To answer this, we need to observe what happens when a slow blocking operation is introduced. Here is a small application that simulates a blocking operation; we use a predictable sleep to reduce variability in measurements.
00:27:35.280 Each request sleeps for one-tenth of a second, so you'd expect about ten requests per second per connected client in our benchmark. That's exactly what we see with Puma, configured with sixteen threads and maxing out at about 160 requests per second.
00:27:59.820 In the test, I connected 100 clients, so not all of them were able to get a response, leading to timeouts. Falcon has a significant advantage here because it can transparently context switch between all 100 incoming connections, thus getting close to the theoretical maximum of 1,000 requests per second. This is one of the key advantages of using Falcon.
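The numbers in this comparison follow from simple arithmetic: the upper bound on throughput is the number of requests in flight divided by the time each one takes. The helper name below is hypothetical and the calculation ignores all server overhead.

```ruby
# Upper bound on requests per second when every request takes latency_ms
# and at most concurrent_requests can be in flight at once.
def max_requests_per_second(concurrent_requests:, latency_ms:)
  concurrent_requests * 1000 / latency_ms
end

# 16 threads, each request sleeping for 100 ms.
puma_limit = max_requests_per_second(concurrent_requests: 16, latency_ms: 100)

# 100 connections served concurrently, same 100 ms sleep.
falcon_limit = max_requests_per_second(concurrent_requests: 100, latency_ms: 100)

puts "thread-limited: #{puma_limit} req/s, connection-limited: #{falcon_limit} req/s"
```

This gives 160 and 1,000 requests per second respectively, matching the figures quoted in the benchmark.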
00:28:44.040 Recently, I received feedback from a developer who ported an IO-bound project to Falcon without any code modifications. They reported outstanding concurrency. Unfortunately, Falcon can't solve every problem.
00:29:09.780 For a long time, Rails followed a request-per-thread model that limited our ability to handle many concurrent requests. This benchmark shows Falcon executing a database query in Rails 7.0 compared to my own database gem.
00:29:51.480 The database gem effectively uses one connection per fiber, leading to about eight concurrent requests with eight connected clients, while ActiveRecord assumes one connection per thread. As a result, all the requests share a single ActiveRecord connection, leading to significant contention and reduced throughput.
00:30:29.880 Fortunately, this is set to change in Rails 7.1 with a new configuration option for controlling the isolation of per-request state. With this change, we can allocate one connection per fiber. Although it's still unreleased, I expect this will mitigate many major performance issues when combining Falcon and Rails applications.
00:30:56.400 You can see that in Rails 7.1, the orange bar in the chart performs similarly to my database gem when using the request-per-fiber mode, which I find very exciting. The second question is: how can we gracefully scale as web applications grow in complexity?
00:31:22.440 It's important to be aware of the different parts that limit scalability. Fixed-size pools can create significant performance issues. Puma has a fixed number of workers, which limits the count of simultaneous requests it can handle. ActiveRecord also generally has an upper limit tied to the number of threads.
00:31:59.880 However, this fixed arrangement can be tricky. If you don't provision enough workers, incoming requests are queued and experience high latency or worse, timeouts. If you over-provision your worker pool and don't have enough database connections, you may not be able to handle all incoming requests.
00:32:53.520 Fixed-size pools don't scale gracefully according to utilization; they are static assertions about the available hardware and workload. While there are many situations where they are acceptable, I've also seen cases where they perform poorly.
00:33:41.160 Unfortunately, there is no perfect solution, but async and Falcon adopt a different policy called utilization-sized pools. What this essentially means is that Falcon will continue to accept incoming connections, gradually slowing down as CPU and memory are consumed.
00:34:26.340 The async event loop is designed to ensure every connection gets a chance to execute on each iteration, ensuring a better distribution of resources across all connections. Here's a real-world demonstration: we run Falcon with a single event loop and make eight connections.
00:35:04.140 Each request sleeps for one second, and all eight connections can execute ten requests in ten seconds. This behavior is expected and holds up to about 5,000 connections on my desktop, at which point the event loop becomes saturated and impacts latency across all connections.
00:35:54.480 Now, here is Puma running with four workers and eight connected clients. Only the first four connected clients can complete any requests; the remaining four clients try to make a request but end up stuck in the queue and eventually time out.
00:36:42.540 Falcon can help you avoid situations like this. The third question is: how can we respond quickly? Most Ruby web applications are built on top of Rack, which provides the specification for how our applications and servers should interact.
00:37:14.460 Rack version 2 generally requires you to buffer the entire response, and for good reason: it allows catching errors during response generation and redirecting to an error page. However, this approach also increases the time to first byte, and if the response is large, it consumes a lot of memory.
00:37:38.820 Today, we're introducing Rack 3, which makes real-time streaming a required feature of the Rack specification. Rack 3 has been in development for over two years and represents a significant team effort. Falcon is fully compatible with Rack 3, and we also introduced a new conformance test suite to help other servers adopt these new features.
00:38:27.360 You might have noticed that we have a fabulous new logo, created by Marlena Logiston. When we discussed the project, she suggested that we seemed to need a vacation and created this outstanding artwork of myself, Jeremy, and Ruby relaxing on the moon. I'm looking forward to taking some time off in the future.
00:39:10.440 Anyway, let's talk about real-time streaming and Rack 3. Here is a simple hello world example: we create a body proc that accepts a stream argument, writes to it, and then closes the stream. We respond with this body proc instead of the more commonly used array of strings.
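A minimal sketch of such a streaming body follows. No Rack server is involved: the app is a plain lambda, and a StringIO is used as a hypothetical stand-in for the HTTP connection that a server would pass in as the stream.

```ruby
require 'stringio'

# A Rack-style app whose body is a proc instead of an array of strings.
app = lambda do |_env|
  body = proc do |stream|
    stream.write("Hello World\n")
  ensure
    stream.close   # always close the stream, even if writing fails
  end

  [200, { 'content-type' => 'text/plain' }, body]
end

# What a server would do: call the app, then call the body with the stream.
status, headers, body = app.call({})
stream = StringIO.new
body.call(stream)

puts status, headers.inspect, stream.string
```

Because the body is invoked with the stream itself, the application decides when each chunk is written rather than handing over a finished response.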
00:39:53.640 The web server will execute the proc with the underlying HTTP connection as the stream argument, allowing full bi-directional streaming. We can also implement more advanced functionalities like streaming templates. In this example, we create a small template that recites the traditional '99 Bottles' song while sleeping for one second between each line.
00:40:41.520 The generated output can then be returned as a proc that writes to the output stream. This streaming response can easily be demonstrated using an HTTP client like curl, where each line is printed after a one-second delay.
00:41:21.600 Streaming templates can greatly enhance time to first byte, especially when queries to a database are involved. In fact, this kind of output streaming has been a feature of PHP for many years and has been extensively used to create real-time output.
00:41:58.440 However, there are other use cases: buffering large datasets can require significant memory. Multiple requests with pagination are sometimes utilized to avoid these issues, but they ultimately add complexity. Incremental streaming of rows can help reduce both latency and memory usage.
00:42:42.000 Here’s an example demonstrating how to stream CSV data using the Rack 3 streaming body. We wrap the output stream in a CSV interface and generate rows of CSV data; it's as simple as that.
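Roughly, wrapping the stream in a CSV writer could look like this. The column names and rows are invented, an Enumerator stands in for a database cursor, and a StringIO stands in for the connection; the csv library is assumed to be available.

```ruby
require 'csv'
require 'stringio'

# Hypothetical row source that produces one row at a time.
rows = Enumerator.new do |yielder|
  3.times { |i| yielder << [i, "row #{i}"] }
end

# Streaming body: wrap the output stream in a CSV writer and emit each row
# as soon as it is generated, instead of building the whole file in memory.
body = proc do |stream|
  csv = CSV.new(stream, row_sep: "\n")
  csv << %w[id name]
  rows.each { |row| csv << row }
ensure
  stream.close
end

stream = StringIO.new
body.call(stream)
puts stream.string
```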
00:43:27.960 The streaming CSV can be seen using the command-line HTTP client. Since we only generate one row at a time, memory usage and latency are significantly reduced. If we want to stream real-time events to a web browser, server-sent events serve as a good option.
00:44:10.320 The protocol was first specified in 2004 and supported by browsers starting in 2006. It's excellent for real-time events like chat messages, stock market tickers, or status updates—anything you wish to update in real time.
00:45:00.420 On the server side, you generate a streaming response of the type text/event-stream. The specification is fairly basic, but you can send multiple event types. In this case, we have a single event type called 'data'.
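For reference, the wire format is plain text: each event is a set of field lines followed by a blank line. The helper below is a hypothetical sketch of that formatting, with a StringIO standing in for the response stream; it is not taken from the talk's slides.

```ruby
require 'stringio'

# Format one server-sent event. Multi-line data becomes several "data:" lines,
# and a blank line terminates the event.
def format_event(data, event: nil)
  lines = []
  lines << "event: #{event}" if event
  data.to_s.each_line(chomp: true) { |line| lines << "data: #{line}" }
  lines.join("\n") + "\n\n"
end

headers = { 'content-type' => 'text/event-stream' }

# Streaming body that writes a few events to the output stream.
body = proc do |stream|
  3.times { |i| stream.write(format_event("tick #{i}")) }
ensure
  stream.close
end

stream = StringIO.new
body.call(stream)
puts headers.inspect, stream.string
```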
00:45:36.840 We can implement client-side JavaScript to receive these events. We create a new EventSource object and listen for messages using a callback function. In the callback, we append a message to the document as an example.
00:46:19.260 When we open this page in a browser, we can see the streaming events. The connection will automatically reconnect if it gets disconnected. This approach is far easier to implement than websockets.
00:47:02.520 The fourth question is: how can we reduce latency? It’s common for a web application to interact with external systems while handling a request. In this scenario, a request might perform an HTTP RPC or interact with another web service.
00:47:44.280 Often, there will be more than one external operation, and executing them sequentially can lead to slow responses. If we could rearrange these operations so that we could execute them concurrently, we could significantly reduce latency.
00:48:28.920 In fact, you can achieve this with async in both Puma and Falcon, or even within Sidekiq. Here’s a simple method that computes the best price by sending requests to several shops concurrently.
00:49:06.840 For each shop, we map the shop URL to a price, perform a GET request, and extract it from a JSON response. Then, we wait for all operations to complete and return the minimum price.
00:49:41.760 This process can be visualized by comparing sequential execution, where each request takes about 1.5 seconds for a total of 4.5 seconds, against concurrent execution, which takes only about 1.8 seconds.
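The fan-out pattern can be sketched as below. To keep it runnable with the standard library alone, this version swaps async tasks for threads and replaces the HTTP GET with a hypothetical fetch lambda that sleeps and returns canned JSON; the shop URLs, delays, and prices are all invented.

```ruby
require 'json'

# Invented shops: each has a simulated latency and a canned JSON response.
shops = {
  'https://shop-a.example/price' => { delay: 0.10, body: '{"price": 12.5}' },
  'https://shop-b.example/price' => { delay: 0.15, body: '{"price": 9.99}' },
  'https://shop-c.example/price' => { delay: 0.20, body: '{"price": 11.0}' }
}

# Hypothetical stand-in for an HTTP GET request.
fetch = lambda do |url|
  sleep shops.fetch(url)[:delay]
  shops.fetch(url)[:body]
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

# Fan out: one concurrent unit per shop, each mapping a URL to a price.
# Then wait for all of them and reduce to the minimum.
best_price = shops.keys
  .map { |url| Thread.new { JSON.parse(fetch.call(url)).fetch('price') } }
  .map(&:value)
  .min

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts "best price: #{best_price} (took #{elapsed.round(2)}s)"
```

Run sequentially, the total time is the sum of the delays; run concurrently, it is roughly the slowest single request, which is the effect the comparison describes.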
00:50:27.300 Async makes it very straightforward to implement fan-out and map-reduce-style processing, greatly reducing overall latency. The fifth question is: how can we support real-time interactions?
00:51:00.000 Rack 3 bi-directional streaming simplifies the implementation of websockets. I created a gem called async-websocket, which supports this new interface. Here's a straightforward example demonstrating its simplicity.
00:51:39.000 You can comfortably handle between 1,000 and 10,000 connections per process. Beyond that, Ruby's garbage collection and latency may become an issue.
00:52:26.160 This level of simplicity encourages experimentation and novel interactivity approaches. This example application renders HTML on the server in real time and streams it to the client. It also streams events back to the server for interactivity with the buttons and board.
00:53:10.440 You're welcome to try this online application. It's utilizing Rack 3 and websockets; if enough people connect, we can test how robust the server implementation is, which is running on a free Heroku Dyno with a limited number of connections.
00:54:03.360 The application might crash, but that would not necessarily be Falcon's fault. So please feel free to test it out.
00:54:46.560 In summary, Falcon is an excellent server for IO-bound applications. Ruby 3.2 and Rails 7.1 will deliver exciting new opportunities for concurrency in production. Be cautious with fixed pool sizes, especially for servers like Falcon that will push the limits of your application.
00:55:31.800 Take full advantage of real-time streaming to reduce latency and memory usage, and look for opportunities to improve your applications' concurrency.
00:56:16.680 Let's explore new opportunities for interactivity with websockets. And now, what about Ruby DNS? I rewrote Ruby DNS on top of async, making it much easier to create your own scalable DNS server in Ruby.
00:56:57.779 First, we start the server, and then we can make a request to rubykaigi.wikipedia and get the response back. Surprisingly, most of this code is very similar to the original implementation from over ten years ago, but it's non-blocking, so I feel like I achieved my goal.
00:57:27.300 Thank you for listening to my talk! Please come and see me during lunchtime if you have any questions, or feel free to reach out to me on Twitter. I hope this talk excites you for the future of a more scalable Ruby.
00:57:59.199 Thank you.