Running Rack and Rails Faster with TruffleRuby

Optimizing Rack and Rails applications with a just-in-time (JIT) compiler is a challenge. For example, MJIT does not currently speed up Rails. TruffleRuby tackles this challenge. We have been running the Rails Simpler Bench benchmarks with TruffleRuby and now achieve higher performance than any other Ruby implementation.

In this talk we’ll show how we got there and what TruffleRuby optimizations are useful for Rack and Rails applications. TruffleRuby is getting ready to speed up your applications. Will you try it?

RubyKaigi Takeout 2020

00:00:03.439 Hello and welcome to my talk, "Running Rack and Rails Faster with TruffleRuby." My name is Benoit Daloze, and I am the project lead of TruffleRuby. If you don't know yet, TruffleRuby is a high-performance Ruby implementation from Oracle Labs. It uses the Graal JIT compiler to achieve its impressive performance. TruffleRuby targets full compatibility with Ruby 2.6, including C extensions, and it is open source and available on GitHub.

00:00:14.960 There are two ways you can run TruffleRuby: in JVM mode or in native mode. In JVM mode, you run on GraalVM, which lets you integrate seamlessly with Java. In native mode, the TruffleRuby interpreter and the Graal compiler are compiled ahead of time into a native executable. The result is similar to MRI: a single executable containing everything. The benefits of this approach are faster startup, quicker warm-up, and a smaller memory footprint.

00:00:43.600 The reason for these enhancements is that there is a virtual machine tailored specifically for TruffleRuby, which avoids loading unnecessary classes and operations found in a more generic VM. As for compatibility, TruffleRuby is progressing well. It is compatible with Ruby 2.6, and many C extensions work out of the box, including libraries such as LibRuby and geary, as well as many database drivers.

00:01:06.760 There's a good chance that a random C extension you try will work with TruffleRuby. This is a significant advantage if you're looking to run existing applications on TruffleRuby, as most applications that run on MRI should work without any changes to the Gemfile, or at most with minimal adjustments.

00:01:29.440 In terms of completeness, TruffleRuby currently passes 97% of the Ruby spec, the highest pass rate among all alternative Ruby implementations. One of the primary goals of TruffleRuby is to execute Ruby code more quickly, and we have seen great performance on many CPU-intensive benchmarks and microbenchmarks. However, the focus of today's talk is on web applications.

00:02:06.240 TruffleRuby does not use a global lock for Ruby code, so Ruby code can execute in parallel, which is a significant advantage. The situation gets more complicated when it comes to C extensions. Many C extensions require that all their C code runs under a single lock. To maximize compatibility, TruffleRuby therefore has a global lock for C extensions enabled by default.

00:02:36.400 As the project evolves, we will gradually remove this lock. TruffleRuby also provides tools that interoperate across languages, such as cross-language debuggers and profilers. A simple example is the CPU sampler, which is a sampling profiler. Running it on TruffleRuby is straightforward: you execute "truffleruby --cpusampler" followed by your program, and it generates a report detailing the time spent in the various parts of your code.

00:03:04.800 Today, I want to look at Rails Simpler Bench, or RSB. This project was developed by Noah Gibbs, who has done extensive benchmarking for Ruby 3x3. The objective of Ruby 3x3 is to make Ruby 3 three times faster than Ruby 2.0. RSB consists of two primary applications: a simple Rack application and a simple Rails 4 application, both tested under various loads.

00:03:34.079 The benchmarking for both applications uses wrk to measure their performance, focusing on latency and handling various configurations. For our discussion, we will concentrate on a few key aspects. Specifically, we will benchmark the Rack application with a simple ERB template, as well as the Rails 4 application serving plain responses.

00:04:04.080 Both applications will run on Puma, a popular web server, and we will measure both Ruby 2.6, which is the version TruffleRuby is compatible with, and the latest version of TruffleRuby in JVM mode for its superior speed. Unfortunately, I encountered a bug when using ActiveRecord in conjunction with Puma, which resulted in extremely slow response times.

00:04:36.079 Eventually, we decided to just create a new database connection for each request instead of using a persistent connection for the benchmarks. This approach significantly impacted the measurement, skewing results toward measuring kernel and network performance rather than Ruby itself.

00:05:06.800 We adjusted our concurrency settings, controlling the number of server threads and wrk request threads. To maintain clarity and consistency, we used a uniform number of threads for both settings. Each request maintains a single connection, which simplifies processing and helps focus on the response generation.

00:05:26.240 My benchmarking runs were performed on an eight-core processor that had frequency scaling and turbo boost enabled to mimic real-world server usage. However, with too many threads, my results did not scale perfectly. Monitoring indicated that increasing threads led to diminishing returns beyond a certain point.

00:05:56.919 During the measurements, I focused on how requests stabilized around a certain throughput figure, reported in requests per second. Each run lasted for ten seconds, measuring how many requests were completed in that time.
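As a small worked example of the throughput figure, the sketch below divides the number of completed requests by the run duration. The method name and the request count are hypothetical, picked only to illustrate the arithmetic for a ten-second run; they are not measurements from the talk.

```ruby
# Throughput in requests per second for one benchmark run.
def requests_per_second(completed_requests, duration_seconds)
  raise ArgumentError, "duration must be positive" unless duration_seconds.positive?

  completed_requests.to_f / duration_seconds
end

# Hypothetical count for a ten-second run, not a measured value.
EXAMPLE_RPS = requests_per_second(200_000, 10)
puts EXAMPLE_RPS
```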

00:06:25.280 It's worth noting that these were simple Rack and Rails applications, and the conditions for measurement were idealized to a single request at a time. Real-world conditions will likely be more complex as they involve multiple concurrent requests.

00:06:55.680 Let's begin with the Rack application. For the simple ERB template, we found that Ruby 2.6 handled approximately 20,000 requests per second. As we increased the thread count, Ruby 2.6 became slower because of the global lock, which shows that threads do not help with CPU-bound work in Ruby 2.6.
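The effect of a global lock on CPU-bound threads can be explored with a small experiment. This is a sketch under assumptions, not the benchmark from the talk: the fib workload, the helper names, and the thread count are all hypothetical. The idea is to run the same CPU-bound work in several threads and time it; on an implementation with a global lock the elapsed time should grow roughly with the thread count, while an implementation that runs Ruby threads in parallel can overlap the work.

```ruby
# CPU-bound work: a deliberately naive Fibonacci.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

# Runs fib(n) once per thread and returns each thread's result.
def run_in_threads(thread_count, n)
  threads = Array.new(thread_count) { Thread.new { fib(n) } }
  threads.map(&:value)
end

# Returns the elapsed wall-clock seconds of the given block.
def time_it
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Small input so the sketch finishes quickly; raise n for a real comparison.
RESULTS = run_in_threads(4, 20)
ELAPSED = time_it { run_in_threads(4, 20) }
puts "4 threads took #{ELAPSED.round(4)}s"
```

Comparing the elapsed time for one thread and for several threads on each Ruby implementation gives a rough picture of how well threads scale for this kind of work.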

00:07:00.559 In contrast, TruffleRuby achieved over 30,000 requests per second. By running it with more threads, performance improved further. With Ruby 2.6's multi-process setting, throughput did increase but not to the same extent as TruffleRuby, which remained significantly ahead.

00:07:30.400 When evaluating Rails performance, we found that Ruby 2.6 handled around 3,000 requests per second for a basic response. TruffleRuby managed to process about 6,000 requests per second, a 2x speedup on a single thread. Performance improved progressively as the number of threads increased, peaking at about 18,000 requests per second.
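The ratios behind these Rails figures are easy to check. The sketch below uses the approximate numbers quoted above; the method and constant names are hypothetical.

```ruby
# Ratio of a candidate throughput to a baseline throughput.
def speedup(baseline_rps, candidate_rps)
  candidate_rps.to_f / baseline_rps
end

# Approximate requests-per-second figures quoted in the talk.
RUBY_26_SINGLE_THREAD = 3_000
TRUFFLERUBY_SINGLE_THREAD = 6_000
TRUFFLERUBY_PEAK = 18_000

# TruffleRuby versus Ruby 2.6, both on a single thread.
SINGLE_THREAD_SPEEDUP = speedup(RUBY_26_SINGLE_THREAD, TRUFFLERUBY_SINGLE_THREAD)

# TruffleRuby at its multi-threaded peak versus its own single-thread result.
THREAD_SCALING = speedup(TRUFFLERUBY_SINGLE_THREAD, TRUFFLERUBY_PEAK)

puts "single-thread speedup: #{SINGLE_THREAD_SPEEDUP}x"
puts "thread scaling: #{THREAD_SCALING}x"
```

The second ratio compares TruffleRuby with itself, so it describes how well throughput scaled with threads rather than a speedup over Ruby 2.6.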

00:08:06.800 In conducting these benchmarks, it is crucial to note that I ran tests with the C extension lock disabled. The only C extension in use was the Puma HTTP parser, which is designed for safe parallel operation. The discussion surrounding C extensions and their ability to operate in parallel will also extend to Ruby 3 and the Ractor project led by Koichi Sasada.

00:08:46.560 We are keen to standardize how C extensions are marked for safety in parallel execution. One challenge we encountered was adjusting the splitting limit in Truffle to accommodate larger codebases like Rails, which exceed the default limits, potentially causing performance hits during execution.

00:09:27.600 The aim is to refine the system so the limits can be intelligently omitted without future complications. This requires an understanding of code optimization strategies, particularly as it pertains to how certain functions and methods are compiled into efficient executing code.

00:10:07.520 In conclusion, the results showcased in my talk reflected a basic benchmarking setup. However, I encourage all developers to benchmark their own applications and share their findings, whether their results are favorable or otherwise. If you are looking to try TruffleRuby, it’s quite simple to set up.

00:10:29.920 You can use your favorite Ruby version manager to install it in native mode, or use GraalVM for both native and JVM modes. GraalVM also gives you access to other languages in a single VM, allowing efficient multi-language interactions.

00:10:57.120 This implementation of C extensions in TruffleRuby is an exciting frontier. Thank you for listening!