Optimizing Ruby Core

by Shyouhei Urabe

In this presentation from RubyConf 2016, Shyouhei Urabe discusses his work on optimizing the Ruby programming language, specifically CRuby version 2.4. A long-time Ruby contributor who had been away from active development for a while, Urabe presents an optimization layer that can drastically improve Ruby's execution speed, with speedups of up to 400 times in some benchmarks. The talk covers the following points:

Ruby's Performance Issue: Ruby has traditionally been slower than other programming languages. Commonly cited reasons include garbage collection and the Global VM Lock (GVL), but Urabe argues that the main reason is a lack of optimization.
Complex Assembly Code: Urabe shows that even a simple operation such as 1 + 2 produces complicated assembly code, because methods can be redefined at any time and the interpreter must evaluate the expression in full every time.
Deoptimization Strategy: He introduces a deoptimization mechanism that lets the interpreter optimize for the typical case and fall back to the basic evaluation path when an infrequent redefinition occurs.
Portable Optimization: The optimization is implemented in pure C, which avoids assembly and keeps it portable. Instruction sequences are overwritten in place without changing their length, so VM state such as the program counter remains intact.
Performance Analysis: Extensive benchmarks show improvements in execution speed across various cases, with some instances achieving significant speedups. However, some cases also showed slower performance, indicating variability based on method interactions and overhead management.
Future Enhancements: Urabe expresses interest in exploring more complex optimizations, including common subexpression elimination and escape analysis, which could further improve Ruby's performance.

In conclusion, the optimization engine implemented in CRuby shows initial success and establishes a foundation for future improvements. Ruby's performance can be enhanced, but significant work remains before it reaches optimal levels. Urabe closes by inviting questions from the audience.

00:00:16.160 Welcome! Today, we're going to talk about optimizing Ruby. Let me introduce myself: I'm Shyouhei Urabe. It's a bit difficult to pronounce, but just call me Shyouhei. I've been a Ruby contributor since 2008, and I was the branch maintainer of the Ruby 1.8 series. I also created the ruby/ruby repository on GitHub. However, I was not an active developer for a while because my job kept me busy. But I changed jobs last February, which allows me to develop new things, and today I will show you something I've developed recently.

00:00:24.880 To give you a brief overview of this talk, I implemented what we call the optimization layer on CRuby version 2.4. Some benchmarks I will show you later indicate that it can boost execution speed by up to 400 times, depending on the benchmark. This is the very first attempt of this kind, which leaves a lot of room for future optimizations. Now, I know that everyone has something to say about this, but Ruby is not the fastest programming language. Here you can see a screenshot of a language shootout site that compares the speed of various programming languages, and you will see that Ruby is not among the fastest.

00:01:01.920 The orange line corresponds to Ruby, and you can see that it's on the right side of the chart, which is the slower end. Interestingly, JRuby is just a bit faster, but it is also positioned on the right. Many reasons are cited for why Ruby is slow, such as garbage collection or the Global VM Lock (GVL), but I'd like to suggest another reason: Ruby is slow because it is not optimized. What you see here is the assembly code executed when a simple expression like 1 + 2 is evaluated. It is far more complicated than it needs to be. Ideally, the evaluation would just add two numbers directly, without all the complexity.
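The same point can be seen one level up, in the VM bytecode. The following is a minimal sketch that assumes a CRuby interpreter, since RubyVM::InstructionSequence is CRuby-specific. It prints VM instructions rather than the machine-level assembly shown on the slide, and the exact listing differs between Ruby versions.

```ruby
# CRuby only: RubyVM is not available on other Ruby implementations.
iseq = RubyVM::InstructionSequence.compile("1 + 2")

# Print the bytecode listing. Even this tiny expression is typically compiled
# to several instructions (push 1, push 2, call +) instead of the constant 3.
puts iseq.disasm

# Evaluating the compiled sequence still yields 3.
result = iseq.eval
puts result
```

Each of those VM instructions in turn expands to many machine instructions, which is the complexity the slide illustrates.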

00:01:42.720 Now, there are many rules in Ruby, especially with regard to redefinitions, and these complicate optimization. For instance, one plus two (1 + 2) should always equal three, but this is hard to prove because a method such as + can be redefined dynamically, anywhere in the program. This is part of the reason why we evaluate expressions in full every time. Redefinitions are crucial, and they must work; however, do they really need to be optimized as well? To tackle this problem, I would like to introduce a mechanism called deoptimization. In this approach, we stop worrying about redefinitions in advance, since they are unlikely to happen, and only deal with them when they do. When one occurs, we revert to the basic evaluation mechanism.
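To see why 1 + 2 cannot simply be assumed to be 3, here is a small sketch that temporarily redefines Integer#+ and then restores it. The helper name with_redefined_plus is hypothetical and exists only for this illustration; redefining core operators like this is not something to do in real code.

```ruby
# Temporarily redefine Integer#+ for the duration of the block, then restore it.
# `with_redefined_plus` is a hypothetical helper written for this illustration.
def with_redefined_plus
  Integer.class_eval do
    alias_method :__original_plus, :+
    define_method(:+) { |_other| 42 }
  end
  yield
ensure
  Integer.class_eval do
    alias_method :+, :__original_plus
    remove_method :__original_plus
  end
end

before = 1 + 2                          # the usual result
during = with_redefined_plus { 1 + 2 }  # same source text, different meaning
after  = 1 + 2                          # original behavior is back

p [before, during, after]  # → [3, 42, 3]
```

An optimizer that had replaced 1 + 2 with 3 ahead of time would give the wrong answer inside the block, which is the situation deoptimization is meant to handle.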

00:02:37.240 That said, redefinitions occur infrequently, so most of the time our optimized code stays in effect. Now let's take a look at how we can optimize instruction sequences without introducing a new binary format. We modify the existing sequences into more efficient forms by overwriting them on the fly, which means we cannot change the length of a sequence, only its contents. This diagram illustrates a part of Ruby's internals, showing how the VM instructions are encoded. Program counters typically reside in the machine stack, which is separate from this management structure. In our implementation, we've added two new fields, deoptimize and created_at, that will be used later.

00:03:41.360 Now, if a sequence has been optimized, it can be changed back to its original form. The main procedure you see here is memcpy. The advantage of this approach is its portability, as it is written in pure C and involves no assembly. The rewriting does not touch the program counter, ensuring that the VM states remain intact. The preparation is done only once at the beginning. To detect redefinitions, we maintain a global VM timestamp that increments whenever something happens, such as a constant assignment or a method definition. The implementation is extensive, so I will skip some details here, but it is crucial that this new state variable tracks modifications accurately.
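The idea of overwriting a sequence in place and later restoring a saved copy can be sketched with a toy model. This is not CRuby's real data structure: ToySequence, its instance variables, and the instruction names are all hypothetical, and Array#replace merely stands in for the plain memory copy that the C implementation would use.

```ruby
# Toy model, not CRuby's real data structures. All names are hypothetical.
class ToySequence
  attr_reader :encoded, :created_at

  def initialize(instructions)
    @encoded    = instructions.dup  # the sequence the "VM" executes
    @deoptimize = nil               # pristine copy, saved before the first rewrite
    @created_at = nil               # timestamp at which the rewrite was made
  end

  # Overwrite the sequence in place. The block must return a replacement of
  # the same length, so a saved program counter still points at a valid slot.
  def optimize!(timestamp)
    @deoptimize ||= @encoded.dup
    replacement = yield(@encoded.dup)
    unless replacement.length == @encoded.length
      raise ArgumentError, "length must not change"
    end
    @encoded.replace(replacement)
    @created_at = timestamp
    self
  end

  # Revert by copying the saved original back over the live sequence.
  def deoptimize!
    return self unless @deoptimize
    @encoded.replace(@deoptimize)
    @created_at = nil
    self
  end
end

original = [[:putobject, 1], [:putobject, 2], [:opt_plus]]
seq  = ToySequence.new(original)
live = seq.encoded  # the same Array object before and after rewriting

seq.optimize!(0) { |_insns| [[:putobject, 3], [:nop], [:nop]] }
optimized = live.dup

seq.deoptimize!
restored = live.dup
```

Because the array is mutated rather than replaced, anything holding an index into it (the stand-in for a program counter) stays valid across both the rewrite and the revert.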

00:05:01.520 Importantly, deoptimization occurs in the bottom half of a method call, the part that we execute right after the invoked method returns. The state variable can only be incremented while the callee is running, so if something has changed, we can test for that immediately after the call returns and trigger deoptimization at that point. This lets us catch the parts of the instruction sequence that have become outdated. A major advantage of this approach is that it adds almost no overhead to execution speed, as a preliminary experiment indicates. This experiment invoked methods many times to see whether the check affects performance. The graph shows that the overhead is minimal and performance stays within acceptable margins.
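The check after a call returns can be modeled in a few lines. This is a toy sketch under stated assumptions: ToyVM, GuardedCaller, and bump! are hypothetical names, and a real VM would revert instruction sequences where this model only counts how often a fallback would be triggered.

```ruby
# Toy model with hypothetical names; it mimics the idea, not CRuby's internals.
module ToyVM
  @timestamp = 0

  class << self
    attr_reader :timestamp

    # Called whenever "something happens", such as a method or constant
    # being defined or redefined.
    def bump!
      @timestamp += 1
    end
  end
end

class GuardedCaller
  attr_reader :deoptimizations

  def initialize
    @created_at = ToyVM.timestamp  # when our "optimized" code was produced
    @deoptimizations = 0
  end

  # Top half: invoke the callee. Bottom half: once it has returned, compare
  # timestamps and fall back if anything was redefined in the meantime.
  def invoke
    result = yield
    if ToyVM.timestamp != @created_at
      @deoptimizations += 1
      @created_at = ToyVM.timestamp  # resync so we do not revert twice
    end
    result
  end
end

call_site = GuardedCaller.new
call_site.invoke { :no_change }   # nothing was redefined
call_site.invoke { ToyVM.bump! }  # the callee "redefines" something
call_site.invoke { :no_change }   # already handled, nothing to do

puts call_site.deoptimizations  # → 1
```

In this model the cost on the common path is a single integer comparison per call, which is consistent with the small overhead reported in the talk.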

00:06:42.680 In summary, we have introduced the optimization engine in Ruby, which is designed to keep VM states like the program counter consistent. As a result, the engine is quite lightweight. Now, let's discuss how we perform optimizations while adhering to our restrictions around VM states. We achieve this through techniques like eliminating method calls, constant folding, and eliminating unused variables. Constant folding is an effective strategy: we consolidate what would typically require several instructions into a single instruction. This is straightforward since constants are already stored in the inline cache.
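A length-preserving fold can be sketched as a small peephole pass. The instruction names below are modeled on CRuby's, but the array representation and the function fold_constant_additions are hypothetical. The sketch folds integer additions rather than the inline-cached constants mentioned in the talk, and such a fold is only valid as long as Integer#+ has not been redefined.

```ruby
# Toy peephole pass. It folds `putobject a; putobject b; opt_plus` into
# `putobject a+b; nop; nop`, so the sequence keeps its original length.
def fold_constant_additions(instructions)
  folded = instructions.map(&:dup)
  (0..folded.length - 3).each do |i|
    first, second, third = folded[i, 3]
    next unless first[0] == :putobject && second[0] == :putobject && third[0] == :opt_plus
    next unless first[1].is_a?(Integer) && second[1].is_a?(Integer)

    folded[i]     = [:putobject, first[1] + second[1]]
    folded[i + 1] = [:nop]  # padding keeps every later index unchanged
    folded[i + 2] = [:nop]
  end
  folded
end

input  = [[:putobject, 1], [:putobject, 2], [:opt_plus], [:leave]]
output = fold_constant_additions(input)

p output  # → [[:putobject, 3], [:nop], [:nop], [:leave]]
```

Padding with nop instructions is what lets the rewritten sequence occupy exactly the same slots as the original.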

00:08:06.800 Moreover, we also eliminate unnecessary method calls. A method is considered pure if it doesn't modify any non-local variables and doesn't yield to a block that might alter state. If a method is known to be pure and its result is not used, the call can be optimized away, greatly simplifying the execution path. However, it's essential to note that purity is often context-dependent. In cases where we cannot determine whether a method is pure, particularly for methods written in C, we opt for caution and treat it as impure. The goal is to keep methods that perform input and output intact while still eliminating redundant calls wherever possible.
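The cautious default can be illustrated with a toy pass that removes a call only when it is listed as pure and its result is discarded. Everything here is an assumption made for illustration: the KNOWN_PURE table, the helper names, and the instruction layout are hypothetical, and the talk's engine decides purity by analysis rather than from a fixed table.

```ruby
require "set"

# Hypothetical purity table: [receiver class, method name].
# Anything not listed is treated as impure, which is the cautious default.
KNOWN_PURE = Set[[Integer, :abs], [String, :length]].freeze

def pure_call?(receiver, method_name)
  KNOWN_PURE.include?([receiver.class, method_name])
end

# Replace `putobject recv; send m; pop` (a call whose result is discarded)
# with three nops when the call is known to be pure.
def eliminate_unused_pure_calls(instructions)
  result = instructions.map(&:dup)
  (0..result.length - 3).each do |i|
    push, call, discard = result[i, 3]
    next unless push[0] == :putobject && call[0] == :send && discard[0] == :pop
    next unless pure_call?(push[1], call[1])

    result[i, 3] = [[:nop], [:nop], [:nop]]
  end
  result
end

program = [
  [:putobject, -5], [:send, :abs], [:pop],    # pure, result unused: removable
  [:putobject, "hi"], [:send, :puts], [:pop]  # not known to be pure: kept
]
optimized = eliminate_unused_pure_calls(program)
```

The second call survives because it is not in the table, mirroring the rule that calls of unknown purity, including anything that performs I/O, are left alone.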

00:09:40.400 Next, we have variable elimination: when a local variable is unnecessary, we can safely reduce overhead by removing its assignment. This requires liveness analysis to determine whether a variable is dead, based on how it is used. We perform this optimization at runtime, on the fly, while also accounting for variables that may still be read or modified through a binding.
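The binding caveat is easy to demonstrate in plain Ruby. In this small sketch, make_binding is a hypothetical method name; Binding#local_variable_get is a standard Ruby method.

```ruby
def make_binding
  unused = 42  # never read again inside this method, so it looks dead
  binding      # but the returned Binding keeps the local variable reachable
end

captured = make_binding
value = captured.local_variable_get(:unused)

puts value  # → 42
```

If the assignment to unused had been removed as dead code, the lookup through the binding would no longer see 42, so an optimizer has to treat variables in such methods as potentially live.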

00:11:05.480 In summary, optimizing at the C level has provided several key advantages, yielding efficient code without overly complicating existing structures. Notably, the optimizations we perform at runtime preserve critical VM states while stripping out unnecessary work. Although we haven't discussed how contradictions or exceptions are handled, I want to emphasize that they don't interact negatively with these optimizations.

00:11:51.719 In our benchmarks, we assessed the proposal against Ruby version 2.4 using standard benchmark libraries under controlled conditions. The results show a noticeable difference, with most benchmarks running faster. In the results shown here, values greater than one signify improved speed, while values below one indicate slower performance. It's important to note that not every benchmark improved dramatically, and some ran slightly slower than the original Ruby.
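For readers unfamiliar with this kind of chart, the ratio can be read as follows. This sketch assumes the ratio is the baseline time divided by the optimized time, which the talk does not state explicitly; speedup_ratio is a hypothetical helper and the timings are made up purely to show how to read the numbers.

```ruby
# Assumption: ratio = baseline time / optimized time,
# so a value above 1.0 means the optimized build is faster.
def speedup_ratio(baseline_seconds, optimized_seconds)
  unless baseline_seconds.positive? && optimized_seconds.positive?
    raise ArgumentError, "times must be positive"
  end
  baseline_seconds.fdiv(optimized_seconds)
end

# Made-up timings, not measurements from the talk.
faster = speedup_ratio(2.0, 1.0)   # optimized run took half the time
slower = speedup_ratio(1.0, 1.25)  # optimized run took longer

puts faster  # → 2.0
puts slower  # → 0.8
```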

00:12:23.920 On one hand, we see some benchmarks achieving exceptional speed, yet on the other, some tests suggested slower execution under specific conditions. A striking example observed from the data is the Evo method benchmarks, which showcased a marginal overhead despite being slower overall. The key to efficient optimization lies in how each piece of code interacts with state and how we manage secondary overheads during execution. Ultimately, variable elimination provided vital speed-ups in several cases, indicating our method's potential for boosting performance.

00:13:45.360 As we conclude this discussion, I've presented the optimization engine implemented in CRuby version 2.4. The benchmarks showed that we can boost execution speeds by up to 400 times in certain cases. Given that this was the first attempt of this nature, it opens doors for future enhancements in our optimizations. Next steps will include working on more complex optimizations like common subexpression elimination, which may further improve Ruby's performance through reduced sequence sizes. We are also considering methods such as static library analysis and escape analysis, both of which may present useful optimization avenues. However, any such optimizations involving modifications to VM states should be approached with caution.

00:16:12.200 Thank you for your attention. I’d like to open the floor to any questions you might have regarding these findings or the optimization strategies presented. It is essential to understand that optimizing Ruby can be achieved smoothly in your applications; however, we still have a long way to go regarding performance improvements. Thank you!