Talks

Splitting: the Crucial Optimization for Ruby Blocks

Blocks are one of the most expressive parts of Ruby syntax. Many Ruby methods take a block. When a method is given different blocks, a crucial optimization is necessary to unlock the best performance. This optimization dates back to the early days of research on dynamic languages, yet it seems only a single Ruby implementation currently uses it. This optimization is called splitting: it uses different copies of a method and specializes them to the block given at different call sites. This enables compiling the method and the block together for the best performance.

RubyConf 2022

00:00:00.000 ready for takeoff
00:00:17.359 Okay, so welcome to my talk: Splitting, the crucial optimization for Ruby blocks. I'm Benoit Daloze and I have worked on TruffleRuby for quite a while, eight years actually. I did a PhD on parallelism and concurrency in dynamic languages. I maintain ruby/spec, I'm also a CRuby committer, and you can find me on Twitter, Mastodon, and everything.
00:00:40.140 TruffleRuby is a high-performance Ruby implementation which aims to be the fastest Ruby out there, and a good deal of that is through the just-in-time compiler it has, which is the Graal just-in-time compiler.

00:00:54.539 TruffleRuby targets full compatibility with CRuby 3.1, including C extensions, so it really aims to be a drop-in replacement for CRuby. For instance, recently we tried to run Mastodon, which is a Rails app with a few hundred gems, and it just works on TruffleRuby: there's only a single-line change in the Puma config, but otherwise the server starts and everything works.

00:01:18.119 It's on GitHub and there's a website and so on.
00:01:21.900 So today I want to talk about splitting. But what is splitting? To explain it, I want to go back to the origin, when this concept was actually invented, and that was quite a long time ago.
00:01:35.159 In 1986 some researchers created the Self language. Self is a little bit similar to Smalltalk, which is quite similar to Ruby, but it's actually prototype-based, so it's a bit like JavaScript in that aspect.

00:01:48.119 The Self researchers had many breakthroughs, and here I just listed four of them. These breakthroughs are all still used in almost all dynamic language implementations nowadays, because they are so important, so fundamental to having good performance in dynamic languages.
00:02:06.180 One of them, for instance, is Self maps, or shapes, which is a way to represent objects efficiently. That's been used by TruffleRuby since the beginning, and CRuby just recently, a few months ago, started to use it too.
00:02:19.860 Another concept they invented is deoptimization. Once we just-in-time compile Ruby code to machine code, sometimes it doesn't hold anymore: for instance, if somebody monkey-patches a method, we cannot use the optimized code anymore. So what we do is go back to the interpreter and restore all the state of the interpreter, which is what deoptimization is, and then later we can potentially compile it again.
00:02:43.920 A third concept is polymorphic inline caches, which I will explain as we go through the talk, when I have a concrete example. And finally splitting, which is the subject of this talk.

00:02:55.260 Splitting was invented in a paper 33 years ago: the customization and splitting paper by Craig Chambers and David Ungar at Stanford.
00:03:05.220 What's amazing is that the example they use in this paper, which of course is Self code, can be very easily translated to Ruby, and the concepts basically hold one to one. So we're going to take this example and translate it to Ruby. It's very straightforward because, as you might notice, Self also has blocks and a very similar concept overall.
00:03:29.400 What they do in this example is define a sum_to method, and they define it on all numbers, so we define it on Numeric. It's very simple: we set a sum variable to zero, and then we step from the current number, which is simply self, up to the upper bound, included, and every time we just add the value to the sum. It's a trivial example, but it's a good example to illustrate splitting. As a note, we use step here, not upto, because upto is only defined for Integer; it's not defined for Float, Rational, and so on.
00:04:00.659 And you can call this method on quite a lot of things: you can call it on integers, 1.sum_to(10), which is 55; you can use floating-point numbers; you can use rationals; and you can use large integers, for instance. It just works for all of them.
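For reference, here is the sum_to example in Ruby, reconstructed from the description above (a sketch; Numeric#step is the real Ruby method it relies on):

    class Numeric
      # Sum of all numbers from self up to and including upper_bound.
      def sum_to(upper_bound)
        sum = 0
        step(upper_bound) { |value| sum += value }
        sum
      end
    end

    1.sum_to(10)            # => 55
    1.0.sum_to(10.0)        # => 55.0
    Rational(1).sum_to(10)  # => (55/1)
    (10**20).sum_to(10**20 + 10)  # works for large Integers too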
00:04:18.419 So we want to make this method fast. We want to see how an optimizing Ruby implementation would go about this: can we actually optimize it properly? To do that, we're going to illustrate how we would just-in-time compile this sum_to method.

00:04:35.759 The method itself doesn't do much: there's the sum initialization, sum = 0, that's trivial; the main thing is that it calls this step method. And the first question is: can we inline step? When we compile sum_to, can we also compile step within it and optimize them together?
00:04:53.759 The first approach we take here is: can we know that this calls Numeric#step from a static analysis point of view? And the answer is no. Even though we know self is a Numeric, because it's a method defined on Numeric, we don't know whether somebody might have defined Float#step or Integer#step or Rational#step; that would then take over, and it wouldn't call Numeric#step. So we couldn't just use the logic from Numeric#step and still be correct.

00:05:20.220 Here I have two example call sites, 1.sum_to(10) and 1.0.sum_to(10.0), just to give an idea of what might be used.
00:05:28.979 So how do dynamic language implementations deal with this? Static analysis is basically useless in Ruby, because we learn very little from it. What they do instead is use inline caches. An inline cache is a cache that lives in the representation the virtual machine uses to execute. For instance, in CRuby it's a cache that's directly in the bytecode; that's why it's called inline: it's together with the representation of how we execute things. In TruffleRuby it would be inside the abstract syntax tree directly.

00:05:58.199 In an inline cache, what you want to cache here is the method lookup, because method lookup can be quite expensive: it can be multiple hash map lookups and so on, and we don't want to do that every time.
00:06:10.020 So we have a lookup cache. When we had an Integer receiver and resolved the method through lookup, the result was Numeric#step, and when we saw a Float, from the second example, the result was also Numeric#step.

00:06:22.440 TruffleRuby has a slightly more advanced version: we actually have two inline caches. We have the lookup cache, which we just explained, but we also have the call target cache, and the call target cache is simply: after we resolved everything, which method did we end up calling?
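As a rough illustration of the two-level idea, here is a minimal sketch of such a call site in plain Ruby (a hypothetical CallSite class for illustration only, not TruffleRuby's actual implementation):

    class CallSite
      def initialize(name)
        @name = name
        @lookup_cache = {}   # level 1: receiver class => resolved method
        @call_targets = []   # level 2: distinct methods actually called here
      end

      def call(receiver, *args, &block)
        method = @lookup_cache[receiver.class] ||=
          receiver.class.instance_method(@name)   # the expensive lookup, cached
        @call_targets << method unless @call_targets.include?(method)
        # if @call_targets has a single entry, a JIT can speculatively inline it
        method.bind_call(receiver, *args, &block)
      end
    end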
00:06:38.100 And here it's always the same method, Numeric#step. Because we now have a single method, we basically know we are most likely going to call Numeric#step here. It's not a guarantee, it could still change, but so far it has always been the case, and we're just going to assume it's going to continue like this. And because we have a single entry like this, it just makes sense to inline it, because at that point, why not? We basically have a very good guess of what is going to be called next.
00:07:05.759 So let's compile Numeric#step; let's look into it. The problem is, Numeric#step is really complicated: it doesn't fit on a slide, and this is already a simplified version that doesn't handle keyword arguments and more. The reason it's so complicated is that there are many ways you can call Numeric#step.

00:07:21.660 You can call just 1.step(3): one, two, three, very easy. But you can do that with floats; you can do it with a step value of 2, so for instance only odd numbers; or you can have -2, so you actually go descending. Or you can use keyword arguments, and with keyword arguments you can even omit the to: value, and then you step to infinity by the given step. And you can also call it without a block, and then you get an Enumerator. That's just to illustrate that Numeric#step can do a little bit of everything, and that's why it's quite complicated.
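The variants mentioned above look like this in standard Ruby:

    1.step(3)     { |i| p i }   # 1, 2, 3
    1.0.step(3.0) { |f| p f }   # works with floats: 1.0, 2.0, 3.0
    1.step(9, 2)  { |i| p i }   # step value of 2: 1, 3, 5, 7, 9
    9.step(1, -2) { |i| p i }   # negative step, descending: 9, 7, 5, 3, 1
    1.step(by: 2) { |i| break if i > 7; p i }  # no to: value, steps toward infinity
    1.step(3)                   # no block: returns an Enumerator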
00:07:54.060 But I want this to fit on the slide, so I'm going to simplify a few things. I'm going to remove the first three lines: the first line checks the Enumerator case, which we're not really interested in for this talk, and the other two lines just check that the step is not zero or nil, because those are error cases. So we just get rid of those.

00:08:12.120 Then we have the descending and the ascending cases, whether the step is negative or positive. The case we are interested in for sum_to is the ascending case, so the descending one we'll just put in a different method, step_descending, to get it out of the way.

00:08:28.500 Now we have something that actually fits on the slide, and we can think about how to compile it.
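A sketch of the simplified Numeric#step as described, reconstructed from the walkthrough that follows (step_descending is the helper from the talk; the float handling is condensed here into a placeholder float_limit helper, since the exact slide code isn't shown):

    class Numeric
      def step(limit = nil, step = 1)
        # Enumerator case and step == 0 / step == nil error checks removed
        descending = step < 0
        limit ||= descending ? -Float::INFINITY : Float::INFINITY
        if descending
          step_descending(limit, step) { |value| yield value }
        else
          value = self
          if value.is_a?(Float) || limit.is_a?(Float) || step.is_a?(Float)
            limit = float_limit(value, limit, step)  # floats need extra care
          end
          until value > limit
            yield value
            value += step
          end
        end
        self
      end
    end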
00:08:33.779 The important part in this whole method is the loop, because that's where we're going to spend most of the time; that's where the CPU will be executing most of the time. Loops are typically good places to look for optimizations.

00:08:48.180 So we have this loop, and what's inside it? until value > limit, okay, that's very simple; value += step; but then there is this yield value: we call a block with the given value.
00:09:00.959 And suppose we have two call sites again: we have 1.sum_to(10), but we also have a call site directly to step, 1.step(3). That means the inline cache will see two blocks: it will see the block in sum_to, but it will also see the block in main, the one that does p i.
00:09:19.800 do like do we want to inline this block
00:09:21.360 too but like would you do in line
00:09:23.880 so we could align both block that's a
00:09:26.519 possibility
00:09:27.660 so say okay if the block is the Block in
00:09:29.339 some tool we're gonna like do the logic
00:09:31.440 of that block in line and if it's the
00:09:33.540 other one we do the other one and if
00:09:35.160 it's none of these we do day opt which
00:09:36.959 makes it video optimized so we just
00:09:38.339 don't under the untold that in the
00:09:39.959 compile code we'll just recompile again
00:09:42.240 if this happens
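Conceptually, the compiled loop with both blocks inlined would behave like this pseudocode (BLOCK_IN_SUM_TO, BLOCK_IN_MAIN, and deoptimize! are illustrative names, not VM API):

    until value > limit
      if block.equal?(BLOCK_IN_SUM_TO)
        block.outer.sum += value   # inlined body of the sum_to block
      elsif block.equal?(BLOCK_IN_MAIN)
        p value                    # inlined body of the block in main
      else
        deoptimize!                # unexpected block: back to the interpreter
      end
      value += step
    end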
00:09:44.519 But there is a big problem here, because this really hurts optimization. For instance, take the first block: we do sum += value, but sum is a variable outside the block, so we cannot just access it directly; there are actually several indirections. That's why I write it here as block.outer.sum: there are at least two indirections. And this kind of indirection is something we want to move out of the loop, so we have less work to do in the loop. But we cannot do that here, because we don't know which block it will be, and if we did it we would make the second block slower. So that's really the problem here.
00:10:22.920 And then the question is: what if we don't have two blocks but, say, seven of them? Sure, we can have seven cases like this and inline all seven of them, but that's getting much less reasonable. And if it's actually the seventh block that's being passed, we'll go through seven checks until we even get to it, and now it's getting really slow. So that's not good.
00:10:41.580 So what can we do? Well, the idea is very simple: what if we copy the method? We copy the method step. And the virtual machine does this: even though I write them as step1 and step2 here to illustrate, the source code doesn't change; it's the same step we had from the beginning. It's just that the virtual machine is able to optimize all of this automatically. So what it does is take a copy of step, and the first copy is going to be specialized for the block in sum_to, and the second copy is going to be specialized for the block in main.
00:11:13.019 And then we just have a check: is it the block we expect? If not, we deoptimize. And now we can optimize this much better: for instance, that outer-variable access we could move out of the loop, and so on. The just-in-time compiler would be able to do that.
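Conceptually, the two split copies would behave like this pseudocode (again with illustrative names; the VM does this on its internal representation, not on Ruby source):

    # step1: copy of step specialized for the block in sum_to
    deoptimize! unless block.equal?(BLOCK_IN_SUM_TO)
    sum_holder = block.outer           # indirection hoisted out of the loop
    until value > limit
      sum_holder.sum += value          # the one known block, inlined
      value += step
    end

    # step2: copy of step specialized for the block in main
    deoptimize! unless block.equal?(BLOCK_IN_MAIN)
    until value > limit
      p value
      value += step
    end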
00:11:27.000 So that's cool, and this is splitting. It's very simple: it's basically a copy.

00:11:31.500 Graphically, we had these two call sites, 1.sum_to(10) and 1.step(3). This one calls sum_to, but eventually both call step, and step was calling one of two blocks; we didn't know which one. What we did was just copy step, and now it's much simpler: now we have a linear flow, and that's much more optimizable. The CPU is very happy when it has a straight sequence of instructions and knows where to go; it's not happy when it has no idea where it's going and there are many branches everywhere.
00:12:02.279 So yeah, splitting. The main advantage is that we take a copy of step for each caller, and those copies can be optimized further simply because they have more information: they know they're only reached through this one caller, so there are fewer cases, it's more precise; we have the extra context of the caller to optimize better.
00:12:24.300 In TruffleRuby specifically, splitting is a bit more advanced than what I explained. But first I want to define some terms. We had an inline cache before, and an inline cache can be monomorphic, polymorphic, or megamorphic. Monomorphic means a single entry in the cache; polymorphic is two or more entries; and megamorphic is too many entries, like when we had seven blocks: at some point we just can't track it anymore.

00:12:50.519 Polymorphic is slower because you can optimize less, and megamorphic is horribly slow because it cannot cache anymore, so for example we have to do method lookup every single time. We have to say: okay, Integer, what are the methods on it? Do a hash map lookup; okay, it's not there; then we go to Numeric, what are its methods? Another hash map lookup, and so on.
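For instance, hypothetical call sites for each cache state (the x.to_s send is the call site being cached):

    [1, 2, 3].each { |x| x.to_s }    # monomorphic: only Integer#to_s is seen
    [1, 2.0].each  { |x| x.to_s }    # polymorphic: Integer#to_s or Float#to_s
    # megamorphic: too many receiver classes reach this one call site
    [1, 2.0, :a, "b", [], {}, nil, true].each { |x| x.to_s }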
00:13:08.579 So what TruffleRuby does is: whenever it detects polymorphism or megamorphism, basically whenever it detects that inline caches are getting two or more entries, it tries to split to kind of fix it and be monomorphic again. And in TruffleRuby we decide to split per call site: once we've decided we need to split a given method, so once we've decided we need to split step, we're going to split every call to step. That's it.
00:13:31.800 But more than that, we can actually split more than one level, and I'll illustrate that to explain it. Suppose here we have two call sites: 1.sum_to(10) and 1.0.sum_to(10.0). They both call the same sum_to, and that calls step. Inside step we have this loop, until value > limit, and that > is actually polymorphic: it can either call Integer#> or Float#>. That's a problem, because this is the innermost loop, the most important part of our code, and we still have to decide between these two.

00:14:04.620 So we say: okay, we could split step, but if we split step, does sum_to know which copy to call? It just doesn't. So what we do is split both of them, and then we have this again, so this is kind of recursive splitting. TruffleRuby is sensible enough to notice this and say: okay, I need to split one level further. And then we can have this, and everything is monomorphic and linear again.
00:14:30.660 And to make the point again: what if we didn't have splitting? What happens, what are the problems? Here in red I put all the places that have polymorphism if we don't have splitting. There's step < 0: step could be a floating-point number, so we don't know; it could be two different methods here. There's the > we saw; the block could also be a different block; and the + in the block as well. So there are already four places that are problematic if we don't have splitting.

00:14:55.980 But more than that, if we consider polymorphism as a more general concept, if we consider it for branches, for instance an if/else branch where both sides are possible because a big program could call step in many different ways, then basically we will explore all possibilities in this method, and everything in red here is a place where the CPU doesn't know where to go, where it has to branch and cannot predict where it goes, because actually all of them are possible.

00:15:23.160 So that's a big problem, and splitting basically solves it by taking multiple copies, so that we know exactly where we'll go in all cases.
00:15:34.079 So back to having splitting. Here we have our sum_to method, and now we've actually split sum_to, and this split is for the call site 1.sum_to(10). Of course, this 10 could be, I don't know, an argument like n coming from somewhere else; it doesn't have to be a literal 10. But this split actually profiles: it checks that the argument is always an Integer. If it's not, it deoptimizes and recompiles, but here it will most likely always be an Integer. So the upper bound is profiled as an Integer, which is just a quick check that we don't show, and after that we can just assume the upper bound is going to be an Integer inside the method.
00:16:14.880 Also, the inline cache of step here has only one entry: we've only ever seen Integer as the receiver. So it's simple: let's inline, let's continue and dive inside step.
00:16:27.360 Inside step we also have an argument profile that tells us which arguments we got. Here the upper bound is the limit, and that's the only argument we passed, so limit is an Integer, and step is not passed. And that's very interesting: because step is not passed (we profile that it's not passed too), we know it must be the default value, and the default value is 1. So at that point we know that step is 1. Here I've highlighted all occurrences of step, and we can just replace them by 1; that's what the just-in-time compiler would do.
00:16:59.579 The next thing: is 1 smaller than 0? Well, obviously that's not true, and the just-in-time compiler can figure that out too, so we can just replace that with false. And then descending is just false, so we can fold it: we can remove some branches, because we know it's not descending, it's ascending.
00:17:17.640 One more thing here: we know limit is an Integer from the profile, so the ||= only ever does something if limit were false or nil, but it's not, it's always an Integer. So this we can just remove as well.
00:17:31.740 Then one more thing: we notice that self is an Integer. Like I said, we profile arguments, but we also profile self, just because it's useful. And because we know self is an Integer, value is an Integer, which means value > limit will actually always call Integer#>. We already had this information in an inline cache, but here we can prove it, so that's one less check: we don't need to check that it's an Integer, we know it. And the same thing here for the increment: we know it's Integer#+.
00:18:04.200 And we have some checks here, which I rewrote a little to fit on the slide: is value, limit, or step a Float? If any of them is a Float, we have different logic, because you have to be careful: when you add floats together, if you're not careful, you end up with a lot of rounding error. So I rewrote this from the original version as just an if with ||.

00:18:27.780 And so we have: is value a Float? No, value is an Integer, so it's definitely not a Float. Is limit a Float? No, limit is an Integer. And 1 is not a Float. So all of these are false, the condition just goes away, it's if false, and we can just remove that code.
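Putting those folds together, the split copy of step for this call site boils down to something like this (a sketch of the end result; the JIT performs this on its internal representation, not on Ruby source):

    # Split copy of Numeric#step for the call site in sum_to, given the
    # profiles: self and limit are Integers, step was not passed, so step == 1.
    def step(limit)
      value = self
      until value > limit   # provably Integer#>
        yield value
        value += 1          # provably Integer#+
      end
      self
    end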
00:18:42.299 Okay, and now we have a fairly simple step, a very simple, optimized definition of step, and now we can inline it. What we do to inline is: we take both methods, we copy-paste step inside sum_to, and we adjust the variable names. Then we end up with this: now we have the loop from inside step inside sum_to.
00:19:04.740 And what's still left here is this block, this { |value| sum += value }, and we immediately call it. Well, if we immediately call the block, we can basically see it's the same as doing the logic inline, so we can just replace it with sum += value.

00:19:24.240 And actually, what remains is that potentially the block allocation would stay, but the Graal JIT compiler is able to do escape analysis and realize that nobody uses this block, so we can just remove it entirely; there's no allocation either. So it's really as simple as written here.
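So the fully optimized split of sum_to for this call site behaves like this sketch:

    def sum_to(upper_bound)   # self and upper_bound profiled as Integers
      sum = 0
      value = self
      until value > upper_bound
        sum += value          # the block body, inlined; no block object allocated
        value += 1
      end
      sum
    end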
00:19:39.179 And now we have something that's very well optimized. We know exactly what it does: it's just a simple loop. There are no calls except the very trivial arithmetic operations, which are typically also inlined.
00:19:50.820 And what we have now is actually very similar to the corresponding C program. If we compare the two, they're very similar, and actually the only difference is that the Ruby version does an overflow check whenever it adds something, because we have to care: if there were an overflow, we'd actually need to use big integers. But this overflow check is fairly cheap, because it's a flag in the CPU: whenever you add, it remembers whether there was an overflow, and you can just check it; there's an instruction to do that efficiently.

00:20:19.799 So what we did is: we have high-level Ruby code, with blocks and everything, and it compiles as well as the C code, basically. But this Ruby code is much better than the C code, because it works for Float, for Rational, it doesn't have overflow, and so on.
00:20:34.919 So let's benchmark this. Here I have a few call sites to simulate being in a bigger application, which calls step and sum_to in various different ways, and then what I actually benchmark is 1.sum_to(1000).
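A hypothetical harness along those lines, using the benchmark-ips gem (the extra call sites exercise step and sum_to in various ways, as a bigger application would):

    require 'benchmark/ips'   # assumes sum_to from earlier is defined

    # extra call sites, as in a bigger application
    1.step(3)     { |i| i }
    1.0.step(3.0) { |f| f }
    1.step(9, 2)  { |i| i }
    1.0.sum_to(10.0)
    Rational(1).sum_to(10)

    Benchmark.ips do |x|
      x.report('1.sum_to(1000)') { 1.sum_to(1000) }
    end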
00:20:50.220 Here I take CRuby 3.1 as the baseline, so it's always 1. If I use TruffleRuby without splitting, it's already 50 times faster, and the reason for that is that TruffleRuby uses an advanced just-in-time compiler, so it's able to be that much faster. But if we use splitting, then we are more than 100 times faster than CRuby, so actually 7.7 times faster than TruffleRuby without splitting. That's really the benefit of splitting: a single optimization makes things 7.7 times faster, which is crazy; it's almost unheard of. And the reason for that is that it's a bit like inlining: splitting enables a lot of other optimizations to come in.

00:21:31.020 So it's really an optimization enabler: it enables other optimizations to kick in, and then we get really big speedups. Instead of having a very generic version which branches everywhere, like when we had all the red on the slide, we now have a perfectly linear version.
00:21:48.780 If we take other benchmarks, for instance here we look at OptCarrot: TruffleRuby without splitting is 5 times faster, but with splitting it's actually 7.7 times faster, so here again a big gain from splitting. And that's maybe not so expected on OptCarrot, which perhaps doesn't use blocks that much, but it still uses them to some degree.
00:22:10.679 And then if we look at railsbench, the version from the yjit-bench benchmark suite, we see that TruffleRuby without splitting is 1.366 times as fast as CRuby, which is something, but with splitting it's 2.75 times as fast, and that's much better. So on Rails it looks like splitting is really key, because it enables so much more optimization; it enables compiling Rails and all the methods it uses in a much more efficient way.
00:22:39.900 And to get an idea of the overall performance of TruffleRuby, here's a blog post I wrote at the beginning of the year. On the 14 benchmarks of yjit-bench, we see that TruffleRuby is 6.23 times as fast as CRuby 3.1. We can think a big part of this is also splitting; of course not only that, the Graal just-in-time compiler also matters, but it's part of the overall result. And we can see TruffleRuby is basically at a different level here: it has all these very important optimizations that so far other Ruby implementations don't really have yet.
00:23:17.880 So let's talk about a related paper, from people at the University of Kent: Sophie Kaleba, Octave Larose, Stefan Marr, and Richard Jones. They used TruffleRuby to analyze the behavior of call sites, so the big question of course is: are they monomorphic, polymorphic, or megamorphic? But also other things.

00:23:39.840 And this paper finds that TruffleRuby has two ways to reduce polymorphism and megamorphism: one of them is the two-level inline cache we saw in the beginning, where we separate the lookup from the dispatch when we call a method, and the other one is splitting. There's also a blog post on Stefan Marr's website if you want a kind of summary of the paper.
00:24:02.280 I took the numbers from that paper, specifically the railsbench numbers. We see that initially railsbench has about seven percent of calls which are polymorphic. That's not good; that's not such a small number of polymorphic calls: it means seven percent of all calls in the benchmark don't know where they're going; they have to choose between two or more possibilities.

00:24:25.380 And even more worrying: half a percent of those calls, which is still 63,000 calls, are megamorphic, which means they have no inline cache; they have to do method lookup every single time.
00:24:37.980 But what they see is that after we use the two-level inline cache for calls, this lookup-and-dispatch cache, we actually cut the polymorphic calls in half: now we only have 3.5 percent, and the megamorphic calls are now very small, like 0.004 percent.

00:24:54.419 And then, once we apply splitting on top of that, we actually get to a flat zero: there is no polymorphism, no megamorphism anymore. So that's really a big result, one that I didn't expect: when you have Rails, you have quite a bit of polymorphism and megamorphism, and actually these two optimizations remove it entirely. And this didn't just happen on railsbench; it actually happened on all 44 benchmarks in the paper.
00:25:21.360 So, in conclusion: we saw that splitting is a technique from the Self VM research; it was invented 33 years ago. But even though it's a bit old, it actually applies very well to Ruby: it applies, for instance, to methods taking blocks and calling them in a loop, but not only there; there are many more forms of polymorphism, like the branches we saw and so on. And we see that splitting can completely remove polymorphism and megamorphism on the 44 benchmarks of that paper, which is really quite a good result, and it also gives really big speedups: we saw the 7.7x speedup on sum_to, 1.5x on OptCarrot, and 2x on railsbench.

00:25:57.659 And with that, I want to thank you for listening, and I'm open to any questions.
00:26:10.620 Thank you.

00:26:11.820 Right, so the question is: does it have an impact on memory footprint? And yes, it does have an impact, of course: if we copy something, we need to use memory for it. It has an impact which is proportional to the representation the virtual machine uses to represent code, so bytecode or whatever else.

00:26:32.640 In TruffleRuby there is actually a limit, if I remember correctly: if we take the whole footprint of the code representation for the application, we don't grow it to more than 2.5 times what it was originally. So on average we're allowed to have one copy and a half; some things will never be split and some things will be split many times. So there's a limit, but actually we don't hit this limit that much. In practice, yes, it needs to be controlled so it doesn't go crazy, but if it's done in sensible places it tends not to be too unreasonable; compared to the application footprint it's typically not that important.

00:27:15.480 But yeah, like every optimization it's a trade-off; here the trade-off is a little bit more footprint for much faster code. So it was this one, right?
00:27:24.360 Yeah, so the trick here, in Truffle at least, is that we notice that step here is called by a single caller: there's only one thing calling step at this point in the program. So this is kind of the first time we see step being used, and that means: well, this can't be the source of the polymorphism, or if it is, we can't do anything about it, so let's keep looking higher. And we can keep looking higher like this, as long as there are single callers, up to a threshold; I think it's something like five.

00:27:56.460 And then here, for instance, we saw: okay, now we have two callers, we don't only have one, so this point is probably okay to split. But even then, if it didn't work, like there's no speedup and maybe there's still some polymorphism, it would detect that later and then try again to split higher, potentially.

00:28:12.240 Okay, then: thank you!