00:00:05.920
Thank you, everyone, for coming. I know it's kind of late in the day now, so I appreciate this many people showing up.
00:00:11.040
Thank you to the organizers. It's great going back to a smaller, single-track conference. The Ruby community kind of lost these, and it's nice to be at one again.
00:00:17.119
Here's my contact information. I can't figure out how to customize the title slide, so it's here.
00:00:22.720
The title of this talk is 'Making Ruby Fast(er)'. The basic motivation for this talk is that the three major Ruby implementations—JRuby, TruffleRuby, and CRuby, or MRI, whatever you want to call it—have Just-In-Time (JIT) compilers.
00:00:31.240
Over the past few years, representatives from each of the projects have talked extensively about the latest developments in their JIT compilers.
00:00:36.559
But if you come from a coding boot camp, or maybe you're someone who works in high-level Ruby and writes applications all day, you might not fully understand how these JIT compilers work.
00:00:42.960
Consequently, you might be missing out on the latest developments in the language. This talk is intended to be something of a prequel to those other talks; hopefully, we can fill in the knowledge gap.
00:00:50.039
It's a lot of information to cover in 30 minutes. I'll probably run a little long, but I'm going to go as quickly as I can. To start off, we'll look at how CPUs actually run software. We’ll do a quick crash course in computer architecture, then we will get into what compilers do and finally how that pertains to JIT compilers and Ruby in particular.
00:01:14.159
So, how is software run on CPUs? If we take a typical laptop or server and boil it down to its essence, we have two critical components: the CPU, or Central Processing Unit, which performs the operations in your computer, and memory, which we generally think of as RAM.
00:01:32.399
The memory and CPU are connected by something called a bus. For practical purposes, you can think of this as a set of wires, so if the two need to communicate, they transmit data over these wires. However, there are transmission delays involved in doing that.
00:01:45.680
On top of that, there are trade-offs in memory design that make it slower than the CPU. The CPU has a set of storage locations built into it called registers. There's a limited number of these—typically 16 or 32 general-purpose registers, depending on the architecture.
00:02:02.960
Registers are small, typically 4 or 8 bytes each. If we can keep our working data in them, access is incredibly fast, because registers are built directly into the CPU; nothing has to travel over the bus.
00:02:16.239
The CPU also has execution units. One you might come across is the ALU, or Arithmetic Logic Unit, which performs mathematical and bit operations. A computer is quite simple; the CPU doesn't do anything until you feed it instructions.
00:02:44.120
When we compile a program, we are encoding it into a set of instructions. When you execute the program, the operating system loads it into memory and tells the CPU where in memory your program starts so it can begin executing.
00:02:51.599
The CPU communicates with memory because we want it to operate on data. Otherwise, it becomes a closed system and does nothing useful. The instructions change per CPU or per CPU family and are defined by something called the Instruction Set Architecture, or ISA.
00:03:09.119
The ISA describes everything about the CPU, such as the names of its registers, their sizes, how to access them, and whether they are intended for general use or have special purposes. For instance, on x86-64, registers can be accessed at different widths—64-bit, 32-bit, 16-bit, or 8-bit—and the ISA also defines the data types the CPU supports and how they are represented in memory.
00:03:43.040
As for assembly language, it's a written, much more readable representation of machine code. The ISA includes numerous details about instructions, and the specifics of how to encode each one in binary are given by an opcode table. This table tells you how to represent each instruction in binary so the CPU can decode it accurately. This brings us to the two major families of instruction sets: CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer).
00:04:24.040
CISC architectures, like Intel's x86, make use of a wide variety of complex instructions, while RISC architectures, widely adopted in mobile and newer CPUs, are simpler. The key differences are the complexity of the instructions and how registers are used, with RISC architectures generally featuring more registers that can be used more freely. Special registers in a CPU include the Program Counter, which stores the address of the next instruction to execute, and the Stack Pointer, which points to the top of a region of memory managed as a stack data structure.
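The program-counter-and-stack idea maps directly onto how a bytecode interpreter works, too. Here is a toy stack machine in Ruby—purely illustrative, with a made-up three-instruction set—where `pc` plays the role of the program counter and an array stands in for stack-pointer-managed memory:

```ruby
# A toy stack machine (illustrative only -- not how any real CPU or VM
# is implemented). `pc` is the program counter: the index of the next
# instruction. `stack` stands in for the memory the stack pointer manages.
def run(program)
  stack = []
  pc = 0
  while pc < program.length
    op, arg = program[pc]
    case op
    when :push then stack.push(arg)
    when :add  then b = stack.pop; a = stack.pop; stack.push(a + b)
    when :mul  then b = stack.pop; a = stack.pop; stack.push(a * b)
    end
    pc += 1                     # advance the program counter
  end
  stack.pop                     # the result is whatever is left on top
end

# (2 + 3) * 4
program = [[:push, 2], [:push, 3], [:add], [:push, 4], [:mul]]
run(program)  # => 20
```

Real CPUs do essentially this in hardware, with the program counter and stack pointer held in the special registers just described.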
00:05:11.039
When it comes to compilers, they translate high-level languages into machine code, or native code. Although you could write machine code by hand, most low-level work is done in assembly language, which is much friendlier but still closely tied to its corresponding ISA. An assembler converts that assembly source into machine code, and a linker may then combine libraries and object files into an executable. The resulting machine code can be executed directly by the CPU.
00:05:58.080
For practical purposes, compilers provide a whole host of optimizations over manually written machine code. For instance, if you have a piece of code that is never called, the compiler can determine that and skip generating machine code for it. It can also eliminate redundant calculations within a function, and it can inline a function's body directly at the call site, eliminating the call overhead entirely.
00:06:36.880
When we run our code and enable optimizations, the assembly code generated becomes significantly shorter, with instructions executed much more efficiently. Now this brings us to JIT compilers in Ruby. Up until now, we've been discussing Ahead-Of-Time (AOT) compilers, where the entire application must be compiled before execution.
00:07:09.920
With Ruby, we can employ Just-In-Time (JIT) compilation because Ruby code runs on a virtual machine—essentially a program that mimics a physical computer, eliminating the differences among various ISAs. This allows Ruby code written on a MacBook, for example, to run on a Linux server in production without any additional modification. The virtual machine facilitates operations like memory management without us needing to handle it at a low level, allowing for easier development.
00:08:07.600
Typically, the compilation process involves multiple steps. One essential step is parsing, which transforms your Ruby source into a format the computer can work with more efficiently: the parser turns your code into an abstract syntax tree, which the compiler then lowers into bytecode, both of which are much easier to iterate over and execute than raw text.
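You can actually look at the parser's output yourself, using CRuby's built-in (MRI-specific) `RubyVM::AbstractSyntaxTree` API. The exact node names vary between Ruby versions, but the tree shape is the point:

```ruby
# Parse a snippet of Ruby and inspect the resulting abstract syntax tree.
ast = RubyVM::AbstractSyntaxTree.parse("1 + 2")
puts ast.type      # the root node: a top-level scope
pp ast.children    # nested nodes for the expression inside it
```

Each node knows its type and its children, which is exactly the "easy to iterate over" structure the interpreter and compiler want.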
00:08:53.800
In CRuby, the virtual machine is known as YARV, and it executes its own bytecode. You can dump either the parse tree or the compiled instruction sequence for some Ruby code; the instruction sequence is a linear list of instructions that is much more manageable for the interpreter to step through, and the dump can be adjusted with various flags to change which view of the code you get.
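Here is one way to see that bytecode, using the MRI-only `RubyVM::InstructionSequence` API; running `ruby --dump=insns your_file.rb` from the command line produces the same kind of listing:

```ruby
# Compile a snippet to YARV bytecode and print its disassembly.
code = <<~RUBY
  def greet(name)
    "Hello, " + name
  end
RUBY
puts RubyVM::InstructionSequence.compile(code).disasm
```

The output is a flat, numbered list of YARV instructions—push this, call that—rather than nested Ruby syntax.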
00:09:37.640
Ruby can optimize its bytecode just as C programs are optimized by traditional compilers. For example, a standard method call can be replaced with a specialized instruction for a common operation like addition. Additionally, a VM's instruction set can be extended far more flexibly than a hardware ISA, where adding an instruction requires new silicon.
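You can see that specialization in the disassembly: on CRuby, a plain `+` call site compiles to the specialized `opt_plus` instruction rather than a generic method send.

```ruby
# A generic `a + b` call site becomes the specialized opt_plus instruction.
insns = RubyVM::InstructionSequence.compile("a = 1; b = 2; a + b").disasm
puts insns
insns.include?("opt_plus")  # => true on CRuby
```

`opt_plus` still falls back to a full method call when the operands aren't the simple types it expects, so the semantics are unchanged—only the fast path is new.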
00:10:36.960
A virtual machine's profiler can analyze your code's execution to find the pieces that run frequently; that information drives the decision about when sections of the program should be compiled to machine code by the JIT compiler. The core benefit of a JIT compiler is emitting optimized machine code that can be executed quickly. In CRuby, this process stores the generated machine code in a code cache, giving fast access to the segments of code that are executed most often.
00:11:40.040
What’s fascinating is that during execution, the JIT compiler can compile smaller pieces of code, often referred to as basic blocks, which simplifies optimization efforts. A specific improvement in Ruby’s JIT compilation is speculative optimization, allowing the compiler to guess the data types that will likely be used during execution.
00:12:20.560
For instance, if a method can handle multiple input types, the JIT compiler learns from prior executions and assumes the same types will keep showing up. When those assumptions hold, execution is fast; when they turn out to be wrong, the JIT de-optimizes, falling back to the interpreter.
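This is exactly the kind of call site a speculative JIT bets on. The method below is hypothetical; the point is that `+` works for any type, but if `add` has only ever seen Integers, the JIT can compile a fast integer-only version and de-optimize the first time another type shows up:

```ruby
# A duck-typed method: `+` could be Integer addition, String
# concatenation, Array concatenation, and so on.
def add(a, b)
  a + b
end

1_000.times { add(1, 2) }  # warm-up: a JIT would speculate on Integer here
add(1, 2)          # => 3        (the fast, speculated path)
add("foo", "bar")  # => "foobar" (type guard fails; deopt/slow path)
```

Crucially, the program's behavior is identical either way—speculation only changes how fast the common case runs.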
00:13:05.720
A compelling set of optimizations in Ruby arises from method lookup. Since nearly everything in Ruby involves a method call, executing them efficiently has significant performance implications. The interpreter caches method lookups to avoid a costly full lookup every time, starting with a global method cache.
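To see why an uncached lookup is costly, remember that Ruby has to walk the receiver's entire ancestor chain to find a method:

```ruby
# The chain Ruby searches, in order, when you call a method on an Integer.
p Integer.ancestors
# e.g. [Integer, Numeric, Comparable, Object, Kernel, BasicObject]

# `between?` isn't defined on Integer itself -- lookup only finds it
# several modules up the chain, in Comparable.
p Integer.instance_method(:between?).owner  # => Comparable
```

Doing that walk on every single call would be painful, which is what the caches avoid.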
00:13:45.960
If a lookup misses the cache, a full VM lookup occurs, but once the cache is populated, subsequent calls are much faster. An inline cache is another such optimization: it stores method dispatch information at the call site itself, so repeating the same call doesn't mean repeating the lookup.
00:14:43.560
This keeps the system efficient by tracking which methods and classes are actually used at each call site. Inline caches have various states—monomorphic when one receiver class has been seen, polymorphic when a few have—which dictate how lookup behaves when a call deviates from what has been seen so far.
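A plain-Ruby sketch can show what a monomorphic inline cache does at a single call site. The `CallSite` class and its names are entirely made up for illustration—real inline caches live inside the VM, not in Ruby code:

```ruby
# A hypothetical model of one call site with a monomorphic inline cache.
class CallSite
  def initialize(name)
    @name = name
    @cached_class = nil
    @cached_method = nil
  end

  def call(receiver, *args)
    unless receiver.class == @cached_class     # cache check: same class as last time?
      @cached_class  = receiver.class          # miss: do the slow lookup once...
      @cached_method = receiver.class.instance_method(@name)
    end                                        # ...and remember the result
    @cached_method.bind(receiver).call(*args)  # hit: skip the lookup entirely
  end
end

site = CallSite.new(:upcase)
site.call("hello")  # slow path: fills the cache
site.call("world")  # fast path: class matches, lookup skipped
```

If a receiver of a different class arrived at this site, the cache would miss and refill—a real VM would instead transition it toward a polymorphic state holding several class/method pairs.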
00:15:29.720
To maximize efficiency, JIT compilers need to adapt swiftly to changes at a call site without slowing down the common case. Whether a change is minor or major matters here: once a change is detected that invalidates a core assumption—say, a method is redefined or a new receiver class shows up—the inline cache is updated or invalidated accordingly.
00:16:16.680
As Ruby evolves, so do these optimizations, with implementations such as TruffleRuby benefiting significantly from bytecode-level optimization. With the aid of the JIT, method inlining becomes a key factor: replacing a method call with the body of its implementation avoids the dispatch overhead altogether.
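A manual before-and-after makes the idea concrete. These methods are hypothetical; a JIT effectively rewrites the first form into the second behind the scenes:

```ruby
# A small method the JIT might decide to inline.
def double(x)
  x * 2
end

def with_call(n)
  double(n) + 1   # pays method-dispatch overhead on every call
end

def hand_inlined(n)
  (n * 2) + 1     # body of `double` copied in; no dispatch at all
end

with_call(5)     # => 11
hand_inlined(5)  # => 11  (same result, one fewer method call)
```

Inlining also unlocks further optimization, since the compiler can now see the combined body as one unit.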
00:16:50.000
Moreover, metaprogramming lets us dynamically create and manipulate code at runtime, but it typically incurs some overhead. JIT compilers can mitigate this with inline caches and related optimizations, so much of the performance loss can be recovered while keeping response times acceptable.
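One practical consequence: metaprogramming that produces real methods stays cache-friendly. In this made-up example, the accessors generated with `define_method` are ordinary methods once defined, so call sites to them can be inline-cached like any other—whereas routing everything through `method_missing` forces a failed lookup on every call:

```ruby
# Hypothetical example: generating accessors at load time with define_method.
class Settings
  %i[host port].each do |key|
    define_method(key) { @data[key] }                 # real, cacheable method
    define_method("#{key}=") { |value| @data[key] = value }
  end

  def initialize
    @data = {}
  end
end

s = Settings.new
s.host = "example.com"
s.host  # => "example.com"
```

The metaprogramming cost is paid once, when the class is loaded, rather than on every call.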
00:17:22.200
By layering these optimizations, we can make Ruby's execution significantly more efficient. Ultimately, embracing JIT compilation in Ruby delivers higher performance without requiring complex changes from developers.
00:18:19.400
Overall, JIT compilers are instrumental, dynamically speeding up Ruby code execution while keeping the language user-friendly. The advice for developers is to lean into idiomatic Ruby instead of clever workarounds that may defeat JIT optimizations.
00:19:00.380
In conclusion, Ruby's JIT capabilities significantly elevate the performance landscape, allowing code to run efficiently while maintaining a developer-friendly paradigm.
00:19:34.840
Thank you for your time.