Performance Optimization

Summarized using AI

Message in a Bottle

Konstantin Haase • October 08, 2012 • Earth

In the video titled "Message in a Bottle," Konstantin Haase delves into the internals of Ruby programming language implementations, analyzing how method dispatch is executed and optimized across different versions. The talk, presented at Aloha RubyConf 2012, seeks to demystify the operations that occur under the hood when methods are called in Ruby, ultimately uncovering strategies for enhancing performance.

Key points discussed in the presentation include:

- Performance Overview: Ruby versions 1.8 and 1.9 are interpreted languages, leading to inherent slowness. The focus shifts to Ruby 1.9 and the upcoming Ruby 2.0, as well as JRuby and Rubinus, to explore their dispatch mechanisms.

- Method Dispatch: The majority of Ruby's performance concerns arise from method dispatch. Haase explains how each implementation utilizes different methods of caching, such as call-site caches and inline caches, to improve speed.

- Bytecode Representation: Understanding the bytecode produced by each implementation is vital. Haase contrasts MRI’s and Rubinus's bytecode representations, demonstrating how they reflect the underlying operations.

- Optimizing Dispatch: Strategies such as using inline caches and reducing CPU instructions can streamline method lookups. Haase illustrates how the method search process navigates through class inheritance to locate methods efficiently.

- Rubinus Features: Rubinus's unique capabilities allow for runtime modifications of bytecode and method specialization, enhancing its performance via JIT compilation and LLVM (Low-Level Virtual Machine).

- JRuby Overview: Haase briefly introduces JRuby's innovative use of "invoke dynamic" to enhance method dispatch efficiency by simplifying lookup processes in the JVM. The use of invoke dynamic eliminates the deep lookup overhead present in traditional method calls, optimizing overall execution.

The talk concludes with actionable insights for developers looking to implement Ruby interpreters or seeking to optimize their Ruby code performance: prioritize minimizing CPU instructions, enhance method lookup speeds, and properly leverage available optimizations. Overall, effectiveness in Ruby execution hinges on foundational concepts of method dispatch and caching strategies.

Message in a Bottle
Konstantin Haase • October 08, 2012 • Earth

What does really happen when we call a method? How do the different Ruby implementations actually figure out what code to execute? What plumbing is going on under the hood to get a speedy dispatch? In this talk we will have a look at the internals of the the major Ruby implementations, focusing on their dispatch. From look-up tables and call site caches, to inlining and what on earth is invokedynamic? Fear not, all will be explained!

Help us caption & translate this video!

http://amara.org/v/FGfe/

Aloha RubyConf 2012

00:00:15.519 Hi everybody, can you hear me? Perfect. Okay.
00:00:20.680 So, I'm here to give my talk titled 'Message in a Bottle.' So here it goes: Ruby 1.8 is slow.
00:00:29.439 Because it’s interpreted. Surprise! Ruby 1.9 is interpreted too. Thanks. Any questions?
00:00:41.039 Okay, um, I'm Konstantin. This is my ridiculously long Twitter handle. You're welcome to follow me; it actually seems long, but it's not that bad. It's Konstantin Haase, my first name and last name.
00:00:54.120 On GitHub, you might know me from the last time you did a Google image search for 'bike shedding,' and my face popped up.
00:01:08.159 I work for a company that supports a project called Travis CI. Who in here has not heard of Travis CI?
00:01:15.720 Okay, so you two. We do continuous integration for open source projects, nearly 25,000 of them. We also have servers for your private projects if you have them on GitHub. We can run your tests, so just grab me after this talk. We obviously charge for the service; that’s how we finance the work on the open source project. If you're willing to sign up before we go public, I can give you a sweet coupon, so just talk to me.
00:01:50.560 This is our Berlin-based team. One of our sponsors is Agile Animal, and it’s thanks to them that I’m able to speak here. They covered my travel expenses. Otherwise, I would be sitting in rainy Berlin right now, much colder, working on Travis instead of giving fun talks. When I added that slide, I actually had to use Google image search for Agile Animal. Stuff like that comes up. I highly recommend it; there are many agile animals out there.
00:02:28.080 Okay, so let's get to the talk. I usually chat a lot before I start, because when I get up here to talk, I get nervous. To calm myself down, I wonder, 'What would Aaron Poon do?'
00:02:36.280 So, this talk is actually about Ruby, especially Ruby internals. I want to focus on some core aspects of all the Ruby implementations related to performance.
00:02:49.319 Bear with me, it might be that some of the stuff I show on my slides is not immediately obvious or that you won't have to understand all the code I’ll show, which may include C++, C, and bytecode. It’s important that you grasp the general picture of what is going on. As I said, I'll be talking about different Ruby implementations. How many Ruby implementations do you think there are?
00:04:10.319 Shout out some numbers! 28? 28! Okay. 51? 55? Alright, 51. You guys are good. I gave this talk in Barcelona, and they were guessing five or eight. Actually, I think there are 59 or something, including implementations like RubyMotion or HotRuby, which runs Ruby in JavaScript by compiling to bytecode and then executing it in a browser. It’s by Kichi Sasada, the guy also behind the YARV (Yet Another Ruby VM).
00:04:51.600 I'm going to focus on the two larger ones—Ruby 1.9 and the coming Ruby 2.0, JRuby and Rubinus. What I want to look at is how they make method dispatch fast. It turns out that when you run Ruby, what you're doing most of the time is actually doing method dispatch. So, for MRI, we will examine method dispatch and execution in general. For Rubinus, we will look at inline caches and JIT (Just-In-Time Compilation), and for JRuby, we will look at invoke dynamic.
00:05:29.280 This doesn't mean that only those implementations do those things. Obviously, all implementations perform method dispatch and some form of caching. The code sample we’re going to examine is this nice code example. I hereby release this into the public domain; you may use it if you like. It is a class called 'DeepThought' that checks if the ultimate answer is 42 or whatever you pass in. It accepts both integers and strings. We’ll look at this specific line, assuming that 42 is passed as an argument.
00:06:21.479 One thing you should know about all these Ruby implementations is that they love bytecode. So let’s look at the bytecode for this statement: 'value.to_s.' The bytecode encodes the instructions, but you can also write them as instruction names in a way similar to assembly language. For MRI, this would look like 'get value' followed by 'send to_s' with no arguments. For Rubinus, it would look like 'push local zero,' since that's the first local variable we have.
00:06:56.560 For JRuby, depending on your JRuby version and the JVM version, it might look something similar, but with some missing parts here, there, and elsewhere. When you look at that, you might feel overwhelmed and think, 'What am I looking at?' The best thing to do next is to break it down a bit. As you may know, or may not, if you’ve played with implementing programming languages, generally, a programming language works like this: You start with the source code, which is a text stream that goes through a parser generating a parse tree. The parse tree is a tree in memory, like (plus one one); you might know that from computer science classes.
00:07:47.360 You can interpret that directly, walking the tree and sending out CPU instructions, which is what Ruby 1.8 does. Alternatively, you could generate bytecode from that and then interpret the bytecode, which is basically a big loop that gets the next byte and checks which instruction it corresponds to, sending out further bytes and executing CPU instructions. This is what Ruby 1.9 typically does. Lastly, you could take that bytecode, generate machine code, and execute that. The machine code is basically the same as bytecode except that the part of the system that understands the bytes is no longer software—it’s your CPU.
00:08:34.159 The general plan for the VM is that we have to find a method and then execute it quickly. How do we speed that up? We can find the method faster, and we can execute the method faster. It’s as easy as that! There are tricks for doing just that. For finding methods faster, we can use inline caches or call-site caches, and we can apply lookup caches, which is what MRI utilizes. We can also employ inlining, which I will explain later.
00:09:38.480 To execute methods faster, we can reduce operations. This is a feature coming in MRI, where instead of looping around multiple times to get a value and send it, we bundle those operations together. This reduces the interpretation overhead. We can even optimize search speeds because faster searches often equal faster execution. Let’s take a look at how MRI finds methods. One way to search for a method is by asking for the RB method entry, a C method available from the public Ruby API. This method will call RB method entry without caching unless it's already cached in the lookup cache.
00:10:48.320 When I say 'finding a method,' I mean that I have the object and the name of the method, and I want to find the corresponding bytecode that needs to be executed. Eventually, it ends up in the search method, which is clean and nice to play with if you want to hack on Ruby. The arguments to this method include the class of the object, which if it has a Singleton Class, will be the Singleton Class; otherwise, it will be the proper class. The ID is the C version of the method name's symbol.
00:11:09.160 The method basically loops through the inheritance chain, including all the superclasses of the class and the modules, looking for a method entry within the method table of each class. If a class becomes null after checking, we’ve reached the end of the inheritance chain without finding the method. In that case, the method returns zero. Once we find it, we now need to execute the method, which ultimately leads us to an internal method called VM call zero. This takes numerous arguments, but the general structure is that the VM Loop is not a simple while loop; instead, it’s a switch statement often nested within itself.
00:12:26.440 This method is quite lengthy, so I’ve cut out a lot of unnecessary details. It used to rely heavily on jumps, which implement the loop system. For example, if you have an instruction, it will execute that corresponding bytecode actually. The VM knows which instruction to execute by looking it up in an instruction manual, which is found in a file called instructions.def. The same principle applies to Rubinus, which operates in a similar way, defining what to execute when encountering a specific instruction.
00:13:30.560 If the instruction is called 'send', it fetches the class and searches for the method, triggering the search function again. This is why speeding up the search process will ultimately speed up overall execution. All of you still awake? Great! Let's look at Rubinus. The nice aspect of Rubinus is that you can work with all of that from Ruby directly.
00:14:02.880 The structure you just saw is exposed as Ruby objects. Each module has a method table comprising buckets that contain the compiled code. You could implement the search for a method in a Ruby idiomatic manner, which would look similar to the search method. You obtain the object's single class and continue checking in the lookup table as long as there remains a class to check and the method has not yet been found.
00:15:07.680 This will eventually yield something called compiled code. If you look up 'ultimate answer' in the method table of DeepThought and ask for its method, you can decode the method. This will provide you with the opcode representation and a bytecode stream format. These bytes can also be written to a file and executed directly from there. You can check out the literals such as 2, 42, and equality checks. Let’s take a closer look at this.
00:16:27.920 Now, if I rearrange this, you'll see the literals, locals, and bytes. Grouping the bytes around opcodes reveals additional insights. The first byte on the stream is 20. In Rubinus, the bytecode reads 'push local,' which takes one argument. The next byte indicates zero, meaning we push the first local onto the stack. Checking our first local confirms it’s 42, so we push 42 onto the stack. The next instruction is 'send to_s', which takes what's on the stack and processes it. We pop off what's on the stack and push into it the result of that method call, which in this case results in '42' as a string.
00:17:24.679 Next, we push literal number one (the second literal 42) onto the stack, so now there are two 42s on it. Following this, we execute a string equality operation, which optimizes equal comparisons for many cases. The op statement will pop both from the stack, pushing true onto it. Finally, opcode 11 serves the purpose of returning by just popping the last item off the stack. Since this is a Ruby object, you can modify it, and those modifications will take effect.
00:18:29.000 Rubinus utilizes this for implementing inline caches. You could replace information, for instance, the number 42 within the literals tuple, which is a fixed size array. You can make modifications at runtime, so from that point on, the ultimate answer is no longer 42; let’s say it becomes 23. You could adjust the bytecode as well, adding in modifications to the code that do something different and update the return value accordingly to reflect your changes.
00:19:14.000 Rubinus leverages the compiled method at runtime. If we examine the bytes with their representation, we see 'push local zero' and the argument reference, where 2 sends the symbol object. The first time Rubinus sees the code during execution, it replaces that with an inline cache referencing the compiled code. Next time, when it encounters 'send to_s', it won’t have to look things up again but will call the inline cache for the method.
00:20:00.800 The inline cache retains a reference to the compiled code and can check if the code at hand is still the correct code to execute. This is key for late binding; if you had a statically typed language, e.g. C, it could decide at compile time which code to call. In Ruby, the types are integers - for example, an array might be type 17. As objects come in, they check if they're of type 17, and if not, the method needs looking up again.
00:21:18.080 The beauty here is that this check can become a single CPU instruction, making the overhead compared to a full lookup extremely low and efficient. Additionally, you can change what the method points to, creating specialized methods. When you turn on debugging in MRI, everything slows down, but in Rubinus, it can keep all optimized code unchanged aside from the method you're currently debugging. You retain clear performance without affecting the rest of the software.
00:22:15.440 Furthermore, Rubinus can specialize for arguments and inline bytecode by adjusting the logic based on branch conditions to improve speed for specific cases. If a method gets called often enough, Rubinus employs JIT compilation to machine code. However, compiling to machine code is resource-intensive, so this isn’t done frequently. When done, Rubinus uses LLVM - Low-Level Virtual Machine, immensely useful for understanding just-in-time compilation. Instead of the conventional interpretation, it compiles to native code allowing Rubinus to run several optimizations.
00:23:56.032 LLVM operates with the intermediate representation (IR), independent of CPU architecture. Therefore, Rubinus doesn’t need to get bogged down with architecture-specific instructions. Now, who is still awake? Good.
00:25:04.158 Let’s touch upon JRuby briefly. I assume you might not want to hear that much about JRuby or otherwise, you would be in another presentation. Anyway, when it comes to JRuby, you may have heard of something called 'invoke dynamic.' Who here has heard of invoke dynamic? Raise your hands. Okay, who of you feels comfortable explaining how invoke dynamic works?
00:26:50.920 So let's examine some bytecode snippets from JRuby with and without invoke dynamic. Without invoke dynamic, the bytecode appears as 'invoke virtual’; it's similar to method calls but JRuby needs to implement its own look-up. Therefore, it has to call the current object, push the necessary arguments onto the stack, and invoke the dynamic object which facilitates the Ruby method dispatch.
00:27:56.720 But now the JVM handles optimizations similar to what happens in Rubinus, which are not applied to the Ruby method dispatch itself. The introduction of invoke dynamic in JVM 7 streamlines the process; it simplifies calls that rely on a singleton, linking the Ruby method's dispatching to a more efficient execution pathway.
00:28:29.360 Invoke dynamic modifies the calling process, instead simply accepting a call site and performing Ruby method dispatch without deep lookup burdens. It uses a bootstrap method linking directly back to the bytecode whenever possible. At the first invocation of invoke dynamic, the JVM will indeed call this bootstrap method, unlock the correct method during execution, and create a reference that sticks for subsequent calls.
00:29:22.560 The process continues to optimize and lookup methods through whatever means possible, so anytime it can return to an optimized path, it will. The take-home point here is that updating JVM and JRuby can lead to free performance enhancements, allowing for extremely efficient Ruby execution.
00:30:19.799 To summarize, if you intend to implement a Ruby interpreter or contribute, or if you're curious why some codes perform better than others, you should focus on reducing CPU instructions, minimizing jumps, inlining code, and balancing those efforts with maintaining late binding capabilities. Understanding these processes allows for effective leveraging of optimizations. Remember, fast lookups usually equal fast executions.
Explore all talks recorded at Aloha RubyConf 2012
+17