00:00:15.519
Hi everybody, can you hear me? Perfect. Okay.
00:00:20.680
So, I'm here to give my talk titled 'Message in a Bottle.' So here it goes: Ruby 1.8 is slow.
00:00:29.439
Because it’s interpreted. Surprise! Ruby 1.9 is interpreted too. Thanks. Any questions?
00:00:41.039
Okay, um, I'm Konstantin. This is my ridiculously long Twitter handle. You're welcome to follow me; it actually seems long, but it's not that bad. It's Konstantin Haase, my first name and last name.
00:00:54.120
On GitHub, you might know me from the last time you did a Google image search for 'bike shedding,' and my face popped up.
00:01:08.159
I work for a company that supports a project called Travis CI. Who in here has not heard of Travis CI?
00:01:15.720
Okay, so you two. We do continuous integration for open source projects, nearly 25,000 of them. We also have servers for your private projects if you have them on GitHub. We can run your tests, so just grab me after this talk. We obviously charge for the service; that’s how we finance the work on the open source project. If you're willing to sign up before we go public, I can give you a sweet coupon, so just talk to me.
00:01:50.560
This is our Berlin-based team. One of our sponsors is Agile Animal, and it’s thanks to them that I’m able to speak here. They covered my travel expenses. Otherwise, I would be sitting in rainy Berlin right now, much colder, working on Travis instead of giving fun talks. When I added that slide, I actually had to use Google image search for Agile Animal. Stuff like that comes up. I highly recommend it; there are many agile animals out there.
00:02:28.080
Okay, so let's get to the talk. I usually chat a lot before I start, because when I get up here to talk, I get nervous. To calm myself down, I wonder, 'What would Aaron Poon do?'
00:02:36.280
So, this talk is actually about Ruby, especially Ruby internals. I want to focus on some core aspects of all the Ruby implementations related to performance.
00:02:49.319
Bear with me, it might be that some of the stuff I show on my slides is not immediately obvious or that you won't have to understand all the code I’ll show, which may include C++, C, and bytecode. It’s important that you grasp the general picture of what is going on. As I said, I'll be talking about different Ruby implementations. How many Ruby implementations do you think there are?
00:04:10.319
Shout out some numbers! 28? 28! Okay. 51? 55? Alright, 51. You guys are good. I gave this talk in Barcelona, and they were guessing five or eight. Actually, I think there are 59 or something, including implementations like RubyMotion or HotRuby, which runs Ruby in JavaScript by compiling to bytecode and then executing it in a browser. It’s by Kichi Sasada, the guy also behind the YARV (Yet Another Ruby VM).
00:04:51.600
I'm going to focus on the two larger ones—Ruby 1.9 and the coming Ruby 2.0, JRuby and Rubinus. What I want to look at is how they make method dispatch fast. It turns out that when you run Ruby, what you're doing most of the time is actually doing method dispatch. So, for MRI, we will examine method dispatch and execution in general. For Rubinus, we will look at inline caches and JIT (Just-In-Time Compilation), and for JRuby, we will look at invoke dynamic.
00:05:29.280
This doesn't mean that only those implementations do those things. Obviously, all implementations perform method dispatch and some form of caching. The code sample we’re going to examine is this nice code example. I hereby release this into the public domain; you may use it if you like. It is a class called 'DeepThought' that checks if the ultimate answer is 42 or whatever you pass in. It accepts both integers and strings. We’ll look at this specific line, assuming that 42 is passed as an argument.
00:06:21.479
One thing you should know about all these Ruby implementations is that they love bytecode. So let’s look at the bytecode for this statement: 'value.to_s.' The bytecode encodes the instructions, but you can also write them as instruction names in a way similar to assembly language. For MRI, this would look like 'get value' followed by 'send to_s' with no arguments. For Rubinus, it would look like 'push local zero,' since that's the first local variable we have.
00:06:56.560
For JRuby, depending on your JRuby version and the JVM version, it might look something similar, but with some missing parts here, there, and elsewhere. When you look at that, you might feel overwhelmed and think, 'What am I looking at?' The best thing to do next is to break it down a bit. As you may know, or may not, if you’ve played with implementing programming languages, generally, a programming language works like this: You start with the source code, which is a text stream that goes through a parser generating a parse tree. The parse tree is a tree in memory, like (plus one one); you might know that from computer science classes.
00:07:47.360
You can interpret that directly, walking the tree and sending out CPU instructions, which is what Ruby 1.8 does. Alternatively, you could generate bytecode from that and then interpret the bytecode, which is basically a big loop that gets the next byte and checks which instruction it corresponds to, sending out further bytes and executing CPU instructions. This is what Ruby 1.9 typically does. Lastly, you could take that bytecode, generate machine code, and execute that. The machine code is basically the same as bytecode except that the part of the system that understands the bytes is no longer software—it’s your CPU.
00:08:34.159
The general plan for the VM is that we have to find a method and then execute it quickly. How do we speed that up? We can find the method faster, and we can execute the method faster. It’s as easy as that! There are tricks for doing just that. For finding methods faster, we can use inline caches or call-site caches, and we can apply lookup caches, which is what MRI utilizes. We can also employ inlining, which I will explain later.
00:09:38.480
To execute methods faster, we can reduce operations. This is a feature coming in MRI, where instead of looping around multiple times to get a value and send it, we bundle those operations together. This reduces the interpretation overhead. We can even optimize search speeds because faster searches often equal faster execution. Let’s take a look at how MRI finds methods. One way to search for a method is by asking for the RB method entry, a C method available from the public Ruby API. This method will call RB method entry without caching unless it's already cached in the lookup cache.
00:10:48.320
When I say 'finding a method,' I mean that I have the object and the name of the method, and I want to find the corresponding bytecode that needs to be executed. Eventually, it ends up in the search method, which is clean and nice to play with if you want to hack on Ruby. The arguments to this method include the class of the object, which if it has a Singleton Class, will be the Singleton Class; otherwise, it will be the proper class. The ID is the C version of the method name's symbol.
00:11:09.160
The method basically loops through the inheritance chain, including all the superclasses of the class and the modules, looking for a method entry within the method table of each class. If a class becomes null after checking, we’ve reached the end of the inheritance chain without finding the method. In that case, the method returns zero. Once we find it, we now need to execute the method, which ultimately leads us to an internal method called VM call zero. This takes numerous arguments, but the general structure is that the VM Loop is not a simple while loop; instead, it’s a switch statement often nested within itself.
00:12:26.440
This method is quite lengthy, so I’ve cut out a lot of unnecessary details. It used to rely heavily on jumps, which implement the loop system. For example, if you have an instruction, it will execute that corresponding bytecode actually. The VM knows which instruction to execute by looking it up in an instruction manual, which is found in a file called instructions.def. The same principle applies to Rubinus, which operates in a similar way, defining what to execute when encountering a specific instruction.
00:13:30.560
If the instruction is called 'send', it fetches the class and searches for the method, triggering the search function again. This is why speeding up the search process will ultimately speed up overall execution. All of you still awake? Great! Let's look at Rubinus. The nice aspect of Rubinus is that you can work with all of that from Ruby directly.
00:14:02.880
The structure you just saw is exposed as Ruby objects. Each module has a method table comprising buckets that contain the compiled code. You could implement the search for a method in a Ruby idiomatic manner, which would look similar to the search method. You obtain the object's single class and continue checking in the lookup table as long as there remains a class to check and the method has not yet been found.
00:15:07.680
This will eventually yield something called compiled code. If you look up 'ultimate answer' in the method table of DeepThought and ask for its method, you can decode the method. This will provide you with the opcode representation and a bytecode stream format. These bytes can also be written to a file and executed directly from there. You can check out the literals such as 2, 42, and equality checks. Let’s take a closer look at this.
00:16:27.920
Now, if I rearrange this, you'll see the literals, locals, and bytes. Grouping the bytes around opcodes reveals additional insights. The first byte on the stream is 20. In Rubinus, the bytecode reads 'push local,' which takes one argument. The next byte indicates zero, meaning we push the first local onto the stack. Checking our first local confirms it’s 42, so we push 42 onto the stack. The next instruction is 'send to_s', which takes what's on the stack and processes it. We pop off what's on the stack and push into it the result of that method call, which in this case results in '42' as a string.
00:17:24.679
Next, we push literal number one (the second literal 42) onto the stack, so now there are two 42s on it. Following this, we execute a string equality operation, which optimizes equal comparisons for many cases. The op statement will pop both from the stack, pushing true onto it. Finally, opcode 11 serves the purpose of returning by just popping the last item off the stack. Since this is a Ruby object, you can modify it, and those modifications will take effect.
00:18:29.000
Rubinus utilizes this for implementing inline caches. You could replace information, for instance, the number 42 within the literals tuple, which is a fixed size array. You can make modifications at runtime, so from that point on, the ultimate answer is no longer 42; let’s say it becomes 23. You could adjust the bytecode as well, adding in modifications to the code that do something different and update the return value accordingly to reflect your changes.
00:19:14.000
Rubinus leverages the compiled method at runtime. If we examine the bytes with their representation, we see 'push local zero' and the argument reference, where 2 sends the symbol object. The first time Rubinus sees the code during execution, it replaces that with an inline cache referencing the compiled code. Next time, when it encounters 'send to_s', it won’t have to look things up again but will call the inline cache for the method.
00:20:00.800
The inline cache retains a reference to the compiled code and can check if the code at hand is still the correct code to execute. This is key for late binding; if you had a statically typed language, e.g. C, it could decide at compile time which code to call. In Ruby, the types are integers - for example, an array might be type 17. As objects come in, they check if they're of type 17, and if not, the method needs looking up again.
00:21:18.080
The beauty here is that this check can become a single CPU instruction, making the overhead compared to a full lookup extremely low and efficient. Additionally, you can change what the method points to, creating specialized methods. When you turn on debugging in MRI, everything slows down, but in Rubinus, it can keep all optimized code unchanged aside from the method you're currently debugging. You retain clear performance without affecting the rest of the software.
00:22:15.440
Furthermore, Rubinus can specialize for arguments and inline bytecode by adjusting the logic based on branch conditions to improve speed for specific cases. If a method gets called often enough, Rubinus employs JIT compilation to machine code. However, compiling to machine code is resource-intensive, so this isn’t done frequently. When done, Rubinus uses LLVM - Low-Level Virtual Machine, immensely useful for understanding just-in-time compilation. Instead of the conventional interpretation, it compiles to native code allowing Rubinus to run several optimizations.
00:23:56.032
LLVM operates with the intermediate representation (IR), independent of CPU architecture. Therefore, Rubinus doesn’t need to get bogged down with architecture-specific instructions. Now, who is still awake? Good.
00:25:04.158
Let’s touch upon JRuby briefly. I assume you might not want to hear that much about JRuby or otherwise, you would be in another presentation. Anyway, when it comes to JRuby, you may have heard of something called 'invoke dynamic.' Who here has heard of invoke dynamic? Raise your hands. Okay, who of you feels comfortable explaining how invoke dynamic works?
00:26:50.920
So let's examine some bytecode snippets from JRuby with and without invoke dynamic. Without invoke dynamic, the bytecode appears as 'invoke virtual’; it's similar to method calls but JRuby needs to implement its own look-up. Therefore, it has to call the current object, push the necessary arguments onto the stack, and invoke the dynamic object which facilitates the Ruby method dispatch.
00:27:56.720
But now the JVM handles optimizations similar to what happens in Rubinus, which are not applied to the Ruby method dispatch itself. The introduction of invoke dynamic in JVM 7 streamlines the process; it simplifies calls that rely on a singleton, linking the Ruby method's dispatching to a more efficient execution pathway.
00:28:29.360
Invoke dynamic modifies the calling process, instead simply accepting a call site and performing Ruby method dispatch without deep lookup burdens. It uses a bootstrap method linking directly back to the bytecode whenever possible. At the first invocation of invoke dynamic, the JVM will indeed call this bootstrap method, unlock the correct method during execution, and create a reference that sticks for subsequent calls.
00:29:22.560
The process continues to optimize and lookup methods through whatever means possible, so anytime it can return to an optimized path, it will. The take-home point here is that updating JVM and JRuby can lead to free performance enhancements, allowing for extremely efficient Ruby execution.
00:30:19.799
To summarize, if you intend to implement a Ruby interpreter or contribute, or if you're curious why some codes perform better than others, you should focus on reducing CPU instructions, minimizing jumps, inlining code, and balancing those efforts with maintaining late binding capabilities. Understanding these processes allows for effective leveraging of optimizations. Remember, fast lookups usually equal fast executions.