00:00:00.000
Thank you. Hello everyone! So, about two weeks ago, this happened in production with the primary key of an important table.
00:00:05.759
Yes, we ran out of numbers. I'm sure several people here have had the pleasure of this problem.
00:00:11.639
The primary key of this table used the int data type, and the biggest value you can represent is around 2.1 billion.
00:00:19.080
Once you've reached this value, you can't insert any new rows. As you can imagine, this is a problem.
00:00:26.279
It turns out there is a data type bigger than int, perhaps unsurprisingly called bigint.
00:00:33.600
So, that's a simple fix, right? Let's just alter the table schema.
00:00:39.600
We worked out that the throughput is around fifteen thousand rows per second per core, and we have around 2 billion records. So, 2 billion divided by 15,000 rows per second is about 133,000 seconds, or roughly 37 hours, give or take.
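The back-of-the-envelope calculation can be sketched in a few lines of Ruby, using the rough figures from above:

```ruby
rows = 2_000_000_000          # records in the table
throughput = 15_000           # rows copied per second, per core

seconds = rows / throughput   # integer division is fine at this scale
hours = seconds / 3600.0

puts seconds                  # 133333
puts hours.round(1)           # 37.0
```

So on a single core, the migration is well over a day of work.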
00:00:50.399
Fortunately, we pay AWS wheelbarrows of money for our database and have 48 cores. So, we ran the migration, and it was beautiful! We could utilize all the hardware available to us, and we got the migration done in a couple of hours off-peak.
00:01:09.659
Actually, that wasn't what happened at all. PostgreSQL, for all the awesome things that it enables, actually has some pretty sharp edges when it comes to big data.
00:01:23.700
If we ran this migration, it would execute on a single core, and the company would have been out of operation for several days. As you can imagine, this was unacceptable.
00:01:30.720
So, we had to solve the problem by breaking it down into several jobs. But dealing with such a problem makes you think: what kind of properties do we want our software systems to have to avoid scalability issues?
00:01:49.259
Efficiency is really important. It's scary when you have two billion records and find that even simple operations take a huge amount of time. On small data, being two times slower might be the difference between 5 and 10 minutes of execution time, which probably wouldn't cause any problems. But on big data, it could be the difference between 40 and 80 hours of execution, which is significantly longer in absolute terms.
00:02:07.140
Responsiveness is also very important. The aforementioned migration blocks all other access to the table. It would literally prevent the business from operating for several days. How we choose to build systems and the latency of operations within those systems greatly affects our users.
00:02:26.400
Finally, being robust is another important property. As systems scale in size and complexity, errors become the norm, not the exception. Handling these errors is part of normal operation and is necessary for the correct and reliable operation of a system.
00:02:51.959
So, that's what I've been working on for the past two weeks, and it's a great case study for talking about asynchronous Rails. So, hello, my name is Samuel Williams, and today I'm going to talk about asynchronous Rails.
00:03:03.599
It's a real pleasure to be here. Let’s start with the basics. What is synchronous execution?
00:03:09.660
In the database migration example I talked about, PostgreSQL creates a new table with the updated schema and copies the records across one at a time. When it's done, it replaces the old table with the new table. Here is a visualization of that process: we see the original table and the temporary table with the updated schema. Each item is copied across one after the other, in order, until all the items are copied. At this point, we can swap the tables.
00:03:36.099
This is an example of synchronous execution: the time it takes is the sum of all the individual operations, completed in order. So what about asynchronous execution? In contrast, it means execution without synchronization between the individual steps.
00:04:10.920
If those steps can be executed independently of each other, we can potentially interleave or overlap the execution, thus reducing the total execution time. In the case of the PostgreSQL migration, you could subdivide the problem into multiple chunks and copy each chunk independently.
00:04:27.900
If we use two processes to copy rows, we could potentially reduce the time of the operation by half. Because we know in advance how to divide the work into independent chunks, no coordination is needed, and the two operations run independently of each other—in other words, asynchronously.
00:04:46.680
However, in order to complete the operation, we do need a synchronization point where we wait for all the individual jobs to complete. So, the total time depends on the slowest sequential job. If we can break the problem down, we can reduce the total run time.
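This pattern can be sketched in a few lines of plain Ruby, using threads and summation as a stand-in for the actual copy work:

```ruby
rows = (1..1_000_000).to_a

# Divide the work into four independent chunks in advance,
# so no coordination is needed while the jobs run.
chunks = rows.each_slice(250_000).to_a

# Each chunk is processed asynchronously in its own thread.
jobs = chunks.map { |chunk| Thread.new { chunk.sum } }

# The synchronization point: wait for every job to complete.
total = jobs.sum(&:value)

puts total  # 500000500000
```

The total time is bounded by the slowest chunk, not the sum of all of them.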
00:05:04.919
We haven't yet defined how to execute multiple jobs asynchronously. There are two main ways in which you can execute multiple jobs: parallel execution and concurrent execution.
00:05:30.240
Parallel execution is normally accomplished by having distinct processing units work on independent jobs. Most modern processors have a lot of cores, and each core can handle one job at a time. But, as we saw earlier, if you don't break your work down into separate jobs, you won’t be able to take advantage of all the cores.
00:05:49.860
Concurrent execution is another way to execute asynchronous jobs. It allows us to improve the utilization of a single processing core by interleaving execution. For example, if a job is waiting for data from the network or local storage, another job can execute. Alternatively, jobs might have fixed maximum time slices, so that each job gets an opportunity to execute.
00:06:05.000
Practically all modern computer systems are both parallel and concurrent, with multiple processor cores interleaving the execution of multiple jobs. So, what are the advantages of asynchronous execution? Let's consider both the advantages for application performance and the advantages for application design.
00:06:32.640
The first and most obvious advantage of asynchronous execution is that it allows us to improve application performance. As we saw, the synchronous execution model was limited to a single processor, but by breaking the problem into separate asynchronous jobs, we are able to improve the algorithm to take advantage of all the available hardware resources, thus giving us better scalability.
00:06:46.740
In addition, by splitting up the work, the total execution time can be reduced, lowering latency and making your application more responsive.
00:07:00.600
The second area that can benefit from asynchronous execution relates to application design and the kind of problems we can solve. We are no longer limited to synchronous request-response style execution models for the user experience and can leverage asynchronous operations to enhance the interactivity of the application, including multi-user experiences.
00:07:19.080
Similarly, with real-time updates, users can receive timely and relevant information, enabling applications to provide a continuous stream of data.
00:07:29.940
It's also true that some problems can't be solved at scale with a single sequential processor in a realistic time frame. Some problems are literally impossible to solve without huge data centers with tens of thousands of processors, each solving a chunk of the problem.
00:07:40.919
Asynchronous execution enables us to solve scalability problems and unlocks new user experiences.
00:07:52.500
But why does this matter to Rails? Rails is a framework for building web applications and naturally benefits from asynchronous execution in a number of ways.
00:08:03.600
In order to understand how Rails benefits from asynchronous execution, let's consider a single request. Firstly, we will focus on Ruby, the language and environment which executes all the code. Then, we’ll discuss Rack, which defines the interface that servers adhere to for handling HTTP requests, and finally Rails, which provides the framework for building your application.
00:08:16.139
Firstly, let's talk about the changes we made to Ruby to provide a foundation for asynchronous execution.
00:08:28.800
One of the big bottlenecks in a lot of web applications is I/O wait. This manifests in things like databases, where you send a query over the network and wait for the results to come back.
00:08:40.320
Depending on the complexity of your query, you could be waiting a long time. It's also common for web applications to do HTTP requests, connecting to remote services. Sending the request and waiting for the response can easily take hundreds of milliseconds.
00:09:03.840
A lot of background job processing systems use Redis for enqueuing work. This involves sending a message across the network and waiting for an acknowledgment. Here is an example of a real web application which does all of these things: at its peak, 25 milliseconds is spent executing Ruby code (the blue part), while 150 milliseconds is spent waiting on I/O operations (the yellow and green parts).
00:09:25.023
So, can't we just make Ruby faster? Here is the proportional time spent executing Ruby (which took 25 milliseconds) versus waiting on I/O (which took 150 milliseconds). So, what happens if we make Ruby 10 times faster?
00:09:44.880
Well, because Ruby was only a small proportion of the total execution time, the advantage is not that huge. The total request time was reduced by only about 13%.
00:10:06.300
What about if we could improve the performance of Ruby a hundred times? It would be a huge effort to improve Ruby's execution performance by 100x, and at best, in this case, it can only improve the latency by about 14%.
00:10:30.780
This is why it's important to characterize performance issues before attempting to address them. As an example of this, we have been working and continue to work on improving the performance of fibers and threads.
00:10:51.300
Between Ruby 2.6 and 2.7, I introduced native coroutines and improvements to fiber allocation, making it roughly three times faster in general. We also did something similar for threads, reducing thread allocation costs by about 4X.
00:11:12.840
But unless your application does nothing but allocate fibers and threads, you won't see a huge advantage from these optimizations. At best, you might get a five to ten percent reduction in latency.
00:11:26.700
So, how do we reduce waiting time? Well, let me be more precise. If a network operation is going to take 100 milliseconds, we generally have to wait. But while we are waiting, how do we improve utilization?
00:11:44.280
We need a way to interleave the execution of multiple jobs so that when one is waiting, another can execute. It turns out we have this already. If a fiber can execute a block of code at times we choose, for example, when we know we have to wait, we can suspend its execution and resume a different fiber.
00:12:01.440
By using multiple fibers and an event loop, we can convert synchronous execution into asynchronous execution and reduce the total execution time by interleaving the execution of the individual jobs.
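At its core, a fiber lets us suspend a block of code at a point we choose and resume a different one. A toy sketch with plain fibers (no event loop) shows the interleaving:

```ruby
log = []

# Each fiber yields at the point where it would otherwise have to wait.
a = Fiber.new { log << "a1"; Fiber.yield; log << "a2" }
b = Fiber.new { log << "b1"; Fiber.yield; log << "b2" }

# An event loop would do this resuming for us, based on I/O readiness.
a.resume; b.resume; a.resume; b.resume

puts log.inspect  # ["a1", "b1", "a2", "b2"]
```

The two jobs make progress in an interleaved order, even though each one reads as sequential code.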
00:12:21.899
In 2017, I experimented with and created a gem for fiber-based concurrency called async. Version 1 uses wrappers around Ruby's I/O objects to switch between fibers when an I/O operation would otherwise wait.
00:12:40.920
Compatibility with existing code was tricky because it used the native Ruby objects rather than the wrappers I created in the async gem, so in 2018, I introduced a proposal for what is now called the fiber scheduler.
00:12:57.300
This allows us to intercept the operations that could wait, enabling us to be more compatible with a wider range of existing code. The fiber scheduler proposal was merged in 2020 and was supported in async version 2.
00:13:17.760
From the user's point of view, the code executes sequentially, but the fiber scheduler delegates waiting operations to an event loop, where they can be executed concurrently.
00:13:34.680
So, let's talk about the different operations that we hooked in Ruby. One of the first hooks we implemented, and possibly one of the most important, was waiting for network sockets to be readable or writable.
00:13:51.600
Here is a small example of an HTTP request using async and the fiber scheduler. We use the async block to define tasks that can execute concurrently.
00:14:09.540
Internally, each task has a fiber for execution. Everything inside the task executes sequentially like normal code, but the task itself executes asynchronously in relation to other tasks using a shared event loop. We can use the fiber scheduler interface to redirect the blocking I/O operations like connect, read, and write to the event loop.
00:14:29.640
Notice that we don't need to use any special wrappers or keywords. Depending on the host operating system, async uses highly efficient multiplexing provided by kqueue, epoll, or io_uring. This allows us to interleave multiple tasks that are waiting on I/O.
00:14:47.220
Being able to integrate with existing code is also critical to the success of async, so we made all thread primitives safe to use within the fiber scheduler.
00:15:02.910
One of the most important operations, joining a thread and waiting for it to complete, can be delegated to the event loop when running on the fiber scheduler. Here we show three threads being created.
00:15:24.360
Each thread sleeps for one second to simulate external work, and then we wait for that work to finish without blocking the event loop, which we expect to take around one second in total.
00:15:38.220
We also support thread mutexes, which can be used to implement safe asynchronous access to shared resources.
00:15:51.339
Here we compute the Fibonacci sequence using a thread and an asynchronous task, which have access to the same state.
00:16:03.480
Here is where the thread reads the previous two values and computes the sum within a synchronized block, and here is where the event loop does the same operation.
00:16:22.740
The asynchronous queue is another important tool for building scalable programs, and it's okay to use in the fiber scheduler.
00:16:39.460
You might want to communicate between different parts of your program using the producer-consumer pattern.
00:16:55.020
Here we have a thread that adds several items to a queue, and those items are subsequently consumed by an asynchronous task.
00:17:09.780
Most network clients need to perform DNS resolution, and the standard library's DNS resolver is blocking.
00:17:13.720
This can make it tricky to support asynchronous DNS queries, but we introduced a hook for this too. Almost any program which integrates with the network will need to resolve a hostname to an Internet address.
00:17:29.160
Any Ruby code that does DNS resolution will go via the fiber scheduler hook, which in async uses Ruby's own resolver library. Resolving addresses is asynchronous when done within the fiber scheduler.
00:17:46.920
It's also common to launch child processes and wait on them, so we added a hook for this too. This hook is used by a variety of different methods, including the system method and backticks.
00:18:10.980
Here we have an example that launches five child processes, each sleeping for one second. Since they are all in separate asynchronous tasks, the total execution time is only a little bit more than one second.
00:18:35.400
We also introduced safer timeouts in Ruby. The timeout module is often regarded as unsafe due to the fact that it can interrupt code at any point.
00:18:52.740
However, when running in the scheduler, only operations which yield to the event loop can raise a timeout error, which is somewhat more predictable.
00:19:11.520
In this example, we put a timeout of one second around a sleep of 10 seconds. Both the timeout and the sleep are hooked by the fiber scheduler; the timeout will cancel the sleep after one second and safely exit with the timeout exception.
00:19:27.480
So, with the hooks provided by the Ruby fiber scheduler, we can improve the concurrency of existing Ruby programs without any changes. It appears to execute sequentially, but it is internally multiplexing between fibers using an event loop.
00:19:44.560
This is a huge achievement, and I'm really proud of it. One other area I'm excited about is asynchronous state handling. With parallel execution, it was relatively easy to deal with per request state.
00:20:03.060
Each request runs until completion, and it is impossible for state to be interleaved. In this example, at the start of each request, a thread-local variable is set to a request ID number.
00:20:21.860
With concurrent execution, we need to be careful to scope per request state to each individual fiber. Lots of existing code assumes the former sequential execution model.
00:20:41.045
In this example, each fiber is assigning a thread local, but because the execution is interleaved, when the first request is resumed after request two sets it, it has the wrong request ID.
00:21:00.540
To solve this problem, we introduced per-fiber inheritable storage. Each fiber has associated storage, which can be considered per request.
00:21:18.180
When you create nested fibers or threads, that state is automatically copied, allowing you to easily and conveniently share context and data between nested asynchronous jobs.
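In Ruby 3.2 this is exposed as fiber storage; a minimal sketch (the `:request_id` key is just an illustrative name):

```ruby
# Per-fiber storage, set on the current (main) fiber.
Fiber[:request_id] = 42

# A nested fiber receives a copy of its parent's storage by default.
child = Fiber.new { Fiber[:request_id] }
inherited = child.resume

puts inherited  # 42
```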
00:21:38.580
This greatly simplifies Ruby programs that need to use this kind of state. So, Ruby now provides the foundation on which we can build asynchronous applications.
00:21:53.640
The next critical layer is Rack, which is the interface between the web server (like Unicorn, Puma, or Falcon) and your application framework (like Rails).
00:22:09.240
Rack is a gem that provides convenient code for building web applications and a specification that defines how web servers should invoke web applications.
00:22:26.640
For a long time, Rack-based servers only supported HTTP/1. Rack itself is loosely based on the CGI specification, so this made sense given the circumstances when Rack was created.
00:22:43.840
However, it also became an obvious limitation as I worked towards supporting HTTP/2. Here's a small example of a Rack application. You can run this on any Rack-compatible server; it takes any request and returns a 200 OK response with the plain text 'Hello, world!'.
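The slide itself isn't captured in the transcript, but the classic Rack "Hello, world" looks something like this; here we invoke it directly instead of through a server:

```ruby
# A Rack application is any object that responds to #call with the
# request environment, returning [status, headers, body].
app = proc do |env|
  [200, {"content-type" => "text/plain"}, ["Hello, world!"]]
end

# A server would call it once per request; we can call it directly.
status, headers, body = app.call({})

puts status     # 200
puts body.join  # Hello, world!
```

In a `config.ru` file, you'd pass `app` to `run` and start it with any Rack-compatible server.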
00:23:02.640
Rack 2, which is the most popular release of Rack, has only limited support for bi-directional streaming, even though HTTP/2 has full support for it.
00:23:20.460
It's unfortunate then that Rack did not expose this. In particular, Rack's lack of direct support for streaming has made implementing things like WebSockets particularly challenging.
00:23:37.560
In September last year, we released Rack 3, which is probably best described as a stricter subset of Rack 2 with mandatory support for bi-directional streaming.
00:23:59.520
Bi-directional streaming is the most important feature of Rack 3 because it makes it easier for web applications to take advantage of asynchronous execution, for example improving interactivity or reducing latency.
00:24:18.360
That doesn't mean it wasn't possible with Rack 2; it's just a lot easier and better defined with Rack 3.
00:24:36.300
Here is a simple 'Hello, World' example. We create a body proc which accepts the stream argument, writes to it, and then closes the stream. We respond with the body proc, which we will refer to as a streaming body.
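A sketch of that streaming body, using a StringIO to stand in for the HTTP connection the server would normally pass:

```ruby
require "stringio"

# A Rack 3 streaming body: a proc that accepts the stream,
# writes to it, and then closes it.
body = proc do |stream|
  stream.write("Hello, world!")
  stream.close
end

response = [200, {"content-type" => "text/plain"}, body]

# The server calls the body with the underlying connection;
# here a StringIO stands in for it.
io = StringIO.new
body.call(io)

puts io.string  # Hello, world!
```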
00:24:54.540
The web server will execute this proc with the underlying HTTP connection as a stream argument. We can do more advanced things like streaming templates.
00:25:13.920
In this example, we create a small template which sings the traditional 99 bottles song. Between each line, we sleep for one second while generating the output.
00:25:31.680
We can return this template as a streaming body that writes to the output stream. The streaming response can be easily demonstrated by an HTTP client.
00:25:48.180
Each line is printed after a one second delay because the templates are generated and streamed in real-time. The time to first byte can be greatly reduced.
00:26:07.800
Here's an example of how to use WebSockets, which is compatible with any Rack 3 server, including Falcon. And as of the recent 6.1 release, it's compatible with Puma as well, which just blows my mind.
00:26:30.480
This adapter takes a Rack request environment and generates a suitable response using a streaming body. The implementation itself just echoes back every incoming message.
00:26:54.060
Making this easy was a really important goal for Rack 3, and I'm looking forward to seeing what people do with it.
00:27:11.580
Rack 3 can also help reduce memory usage. Buffering large data sets can require a lot of memory.
00:27:33.300
Multiple requests with pagination are sometimes used to avoid these issues but ultimately result in more complexity. Incremental streaming of large data sets can simultaneously reduce latency and memory usage.
00:27:52.080
Here is an example showing how to stream CSV data using Rack 3's streaming body. We wrap the output stream in the CSV interface, and then we generate rows of CSV data.
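A sketch along those lines, again with a StringIO standing in for the connection; the column names and row count are illustrative:

```ruby
require "csv"
require "stringio"

# Stream CSV rows as they are generated — only one row is in memory at a time.
body = proc do |stream|
  csv = CSV.new(stream)
  csv << ["id", "name"]
  3.times { |i| csv << [i, "user#{i}"] }
  stream.close
end

io = StringIO.new  # stands in for the HTTP connection
body.call(io)

puts io.string
```

With millions of rows, this shape keeps memory usage flat, because nothing is buffered before sending.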
00:28:09.360
The streaming CSV can be easily seen using a command-line HTTP client. Since we only generate one row at a time, the memory usage and latency are limited.
00:28:28.680
Rack 3 also makes it easy to implement Server-Sent Events. If we want to stream real-time events to a web browser, Server-Sent Events are a good option.
00:28:47.040
This protocol was first specified in 2004 and supported by browsers starting in 2006. It's an excellent choice for real-time events like chat messages, stock market tickers, status updates—anything you want updated in real time.
00:29:06.380
On the server side, you generate a streaming response of type text/event-stream. We can write events one per line using this format.
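A sketch of that wire format as a streaming body; each event is a `data:` line followed by a blank line, and the message text here is illustrative:

```ruby
require "stringio"

# A text/event-stream body: one "data: ...\n\n" record per event.
body = proc do |stream|
  3.times do |i|
    stream.write("data: message #{i}\n\n")
  end
  stream.close
end

io = StringIO.new  # stands in for the HTTP connection
body.call(io)

puts io.string
```

The browser's EventSource parses exactly this framing on the other end.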
00:29:23.640
We can implement some client-side JavaScript to receive these events. We create a new EventSource object and listen for all messages with a callback function.
00:29:41.640
We append messages to the document in the callback to show them being received. If we open this page in a browser, we can see the streaming events.
00:29:57.740
It will automatically reconnect if disconnected, and it's far easier to use than WebSockets, so I highly recommend it.
00:30:10.560
With the introduction of Rack 3, how do we ensure that all servers implement the new functionality correctly? The Rack specification has middleware that validates how applications should behave, but it does not provide any mechanism to test servers.
00:30:28.560
I wanted to encourage servers to adopt Rack 3, so I created rack-conform, a test suite for server implementations.
00:30:48.420
rack-conform provides an application and a test suite around the server being tested and validates the requests and the responses. It's not hugely extensive but includes tests for Rack 2, Rack 3, and new features like bi-directional streaming.
00:31:06.960
As you can see, all the tested servers pass the Rack 2 level of conformance, while Falcon, Puma, and WEBrick pass the Rack 3 conformance.
00:31:23.820
This has been a great tool for helping server maintainers adopt Rack 3. I just mentioned Falcon, which is my own web server that I created based on async.
00:31:43.320
It has full support for HTTP/1, HTTP/2, Rack 2, and Rack 3. Since Falcon is built on async, it provides fiber-based concurrency, which is different from Puma's thread pool-based design.
00:32:03.240
So let's compare Falcon with Puma. Both servers are capable of handling more than 10,000 requests per second, which is absolutely excellent.
00:32:19.320
The synthetic benchmark generates small responses of about 1200 bytes. You can see they both perform well.
00:32:38.520
The more response data we generate per request, the fewer requests we can handle per second. This benchmark generates larger responses of about 12 megabytes each at the rate of 200 requests per second—about 2.4 gigabytes per second, which is close to saturating a 25-gigabit network link from just one Ruby process.
00:32:56.040
This might seem pretty decent, so why should we prefer Falcon over Puma? To answer that question, let's look at what happens when we introduce a blocking operation.
00:33:10.920
Here is a small application that simulates a blocking operation, for example, a database query. We use a predictable sleep to reduce the variability of the measurements.
00:33:25.920
Each request sleeps for one tenth of a second, so you'd expect about 10 requests per second per connected client. Puma is able to use all 16 threads, handling about 160 requests per second.
00:33:40.620
However, Falcon has an advantage, since it can interleave the execution of all the requests. In this benchmark, I actually connected 100 clients and Falcon was able to serve them all while Puma had some timeouts.
00:33:54.420
We can see this advantage in the real world too. After porting an I/O-bound project to Falcon without any code modifications, it shows outstanding concurrency.
00:34:12.180
I think that speaks volumes about how important concurrency is in modern web applications.
00:34:32.940
Rails version 7.1, which should be released soon, is on track to fully support Rack 3. So, how did we change Rails?
00:34:49.200
I'll touch on a few of the pull requests that have been merged over the past two years. One of the trickiest things in Rails is the ossified assumption of one request per thread.
00:35:06.840
A lot of internal systems use thread-local variables or were actually accidentally fiber-local. If we are concurrently interleaving fibers, you can end up with state leaking between requests.
00:35:23.220
Rails introduced a new interface called isolated execution state that can be configured for either request-per-thread or request-per-fiber, allowing Rails to work correctly on Falcon, which uses the request-per-fiber model while still being backward compatible with existing applications that might depend on request-per-thread.
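In an application, that choice is a single configuration setting; shown here for the fiber model (`:thread` being the traditional default):

```ruby
# config/application.rb
config.active_support.isolation_level = :fiber
```

With `:fiber`, per-request state keyed through the isolated execution state follows each request's fiber rather than its thread.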
00:35:39.180
One particular area where this was a problem was the per-thread Active Record connection. Since the connections were shared per thread, it was entirely possible to open a transaction in one request fiber and see that transaction still open in a different request fiber.
00:35:57.840
By using the isolated execution state interface, it was possible to fix the connection pool implementation so that it is safe to use Active Record in Falcon.
00:36:14.820
Rack 2 specifies that the request input body should be rewindable; that means it can be read multiple times. However, to do this, the input body must be buffered.
00:36:34.200
Buffering the input body can consume a lot of storage. It can also increase the latency if it's buffered completely before passing the request to the application.
00:36:53.640
Very few applications made use of this buffer, and it's also at odds with real-time streaming. So in Rack 3, this requirement was dropped, and so we need to update Rails accordingly. This is still a work in progress.
00:37:10.320
Set-Cookie is unique among HTTP headers in that it's the only header whose multiple values can't be merged into a single field using commas.
00:37:30.180
Rack 2 separates multiple value headers using the new line character, but Rack 3 prefers to use an array of string values, which is easier to manipulate and is also safer and more memory efficient.
00:37:46.440
The Rails test suite also assumed the internals of the Rack 2 format rather than the HTTP semantics, so we implemented a Set-Cookie header parser which follows the HTTP semantics and updated all the relevant tests to use it.
00:38:05.280
This makes it compatible with both Rack 2 and Rack 3. In addition, Rack 3 requires all header keys to be lowercase and stored in a hash table.
00:38:21.060
Surprisingly, this wasn't actually the case in Rack 2. This is to make it more efficient to implement middleware and modify response headers.
00:38:41.700
This is also the normalized form in the HTTP specification and a requirement for HTTP/2. Because there is a lot of code which assumes mixed case for header keys, we've introduced a class (Rack::Headers) which implements lowercase normalization with case insensitive key lookup. Rails has adopted this as part of the controller layer to avoid compatibility issues.
00:39:06.840
Rack 2 had a lot of code unrelated to the Rack specification, and it was hard to evolve the Rack specification independently from the implementation.
00:39:21.600
Specifically, we've extracted code into two new gems: Rack::Session, which includes a convenient implementation of session management; and Rackup, which includes a convenient way to run a Rack application.
00:39:37.440
Rack::Session, in particular, suffered from several security issues that cut across multiple releases of Rack. By extracting it into a separate gem, it will become easier to update independently of Rack.
00:39:55.320
These gems are compatible with both Rack 2 and Rack 3, so they are now integrated into Rails.
00:40:08.880
This will allow us to evolve these specific areas of Rack more rapidly and independently from Rack 2's specification.
00:40:23.280
What database adapter can I use? Unfortunately, not every native library is compatible with the fiber scheduler.
00:40:36.840
However, this problem can be solved with a bit of effort. In short, the best option right now is PostgreSQL.
00:40:50.760
The PG gem has built-in support for fiber-based concurrency, which, when combined with the per-fiber isolation level in Rails, just works like you'd expect.
00:41:08.520
If you're using MySQL, GitHub has just released a new library called Trilogy and an Active Record adapter that supports the Ruby fiber scheduler. I've heard rumors that this will be released as part of Rails 7.1.
00:41:22.680
This will be very exciting. So, with Rails 7.1, you'll be able to use Falcon and Rack 3 to gain some interesting new capabilities.
00:41:36.360
How do you take advantage of this in your application? Actually, this was a trick question: you don't need to change anything.
00:41:47.880
The fiber scheduler provided by Ruby transparently improves the concurrency of your application. Rack 3 makes it easier to build asynchronous applications, and Rails continues to evolve with these underlying changes.
00:42:09.960
So, in summary, Ruby 3.2 provides a great foundation for asynchronous programs and, in particular, concurrent execution.
00:42:35.880
Rack 3 provides a standard interface for bidirectional streaming, enabling things like WebSockets without escape hatches.
00:42:51.480
Rails 7.1 is compatible with Rack 2 and Rack 3, allowing more applications to take advantage of these new features. And that's just the beginning!
00:43:06.600
We've built the foundation for you all; now it's up to you to go and build awesome asynchronous Rails applications.
00:43:28.680
I'm incredibly excited to see what you all create, and I'm incredibly proud and privileged to be part of it. Thank you for listening to my talk.
00:43:49.680
Please feel free to ask me any questions or come and find me later.