YJIT (Yet Another Ruby JIT)

Keynote: Stories from developing YJIT

by Alan Wu

In this keynote presentation at RubyKaigi 2022, Alan Wu explores the development and optimization of YJIT (Yet another Just-In-Time compiler) for Ruby. He shares personal insights and experiences gathered throughout the YJIT project, which started in 2020. The talk provides a chronological overview of YJIT's development, focusing on key design considerations and the challenges faced in optimizing Ruby code execution.

Key points discussed include:

  • Understanding the Interpreter: Wu explains how the Ruby interpreter breaks Ruby code down into small instructions and executes them through instruction handlers.
  • Initial Design and Performance Issues: The early design aimed at minimizing jumps in machine code led to mixed results, with some benchmarks showing speedup while Rails benchmarks experienced slowdowns.
  • Performance Profiling: Wu discusses how performance counters and profiling methodologies were used to diagnose bottlenecks in different Ruby workloads, revealing that while some tasks were back-end bound, others, like Rails bench, were front-end bound due to diverse Ruby features.
  • Lazy Basic Block Versioning: An innovative approach built on lazily compiled stubs allows YJIT to delay compilation until code is actually needed, significantly improving execution speed by minimizing jumps back to the interpreter.
  • Speculative Optimizations: YJIT benefits from speculative optimizations that dynamically adapt based on method call patterns, enabling more efficient execution paths to be created and reused.
  • Integration with the Interpreter: Wu emphasizes the symbiotic relationship between YJIT and the Ruby interpreter, noting techniques for maintaining an interactive user experience while optimizing performance.

In conclusion, YJIT has demonstrated the capacity to significantly speed up Rails benchmarks while executing fewer instructions and improving instruction-level parallelism. Wu highlights the importance of carefully balancing performance optimization with user interaction, ensuring that enhancements to execution speed do not detract from the usability of Ruby applications.

00:00:09.860 Thank you! Hi everyone.
00:00:15.120 I just have a talk here about YJIT. It's just some random things I found interesting during development.
00:00:21.300 I know it's a keynote, but it's really just the greatest hits from my development experience.
00:00:26.580 This is actually my first time giving a talk at a conference, so mostly I’m just going to read from my script. Please bear with me for the next 40 to 50 minutes.
00:00:33.719 I'm going to take you chronologically through YJIT's development history.
00:00:41.879 When we started the project in 2020, we wanted to do something simple to get our feet wet.
00:00:49.070 At the time, I was looking at overhead in the interpreter, and influenced by that, we tried to design something closely related to how the interpreter works.
00:00:55.500 So first, I'll talk a little bit about the interpreter. The interpreter does not run the text of the Ruby code directly; instead, the text is broken down into small instructions.
00:01:02.340 First, the list of instructions encodes some information obtained through parsing the text, which makes it easier to execute than running the text directly.
00:01:09.000 For example, the order of operations is resolved, so we don't need to worry about running the right-hand side of an assignment first.
00:01:15.840 The interpreter can just go from top to bottom in the list of instructions to execute the code.
00:01:21.180 The chunks of machine code in the Ruby executable implement the logic for each kind of instruction.
00:01:27.000 These chunks of machine code are called instruction handlers. Each handler is designed to pass control to the next handler.
00:01:32.460 They update the Ruby program counter to keep track of which part of the program they are running and then they pass control by jumping.
00:01:38.460 You don't need to worry too much about the details here; the main point is that the interpreter runs the Ruby program a bit at a time with these handlers.
00:01:45.420 Handlers can jump to each other. I made some animations to illustrate how the setup runs over time.
00:01:50.720 We are going to simulate 'x = 2 + 2.' In the middle of this animation, I have the machine code for the handlers as they could be laid out in the Ruby executable.
00:01:56.220 On the right, I am going to log the name and address of the instruction each time we run one.
00:02:02.900 We will run address 30 first, because the first instruction is 'put_object,' as you can see in the leftmost column.
00:02:08.640 After executing the 'put_object' handler, we look at the next Ruby instruction and jump to the handler for it.
00:02:15.840 The next Ruby instruction is 'put_object' again, so we jump back to address 30.
00:02:21.780 We run 'put_object,' and now the next instruction is 'plus,' with address 10.
00:02:29.159 We take the jump, run through the handler, and finally we get to the last instruction, which is at address 40.
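If you want to see these instructions for yourself, Ruby can dump them (the exact output format and instruction names vary by Ruby version):

    # Dump the bytecode for the expression used in the animation.
    puts RubyVM::InstructionSequence.compile("x = 2 + 2").disasm
    # The listing contains two putobject instructions (for the two 2s),
    # an opt_plus, and a setlocal for x -- the same shape as the walkthrough above.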
00:02:37.319 The main thing I’m trying to show here with the simulation is that there are two execution contexts when the interpreter runs Ruby code.
00:02:44.400 Even though the Ruby instructions are completely linear, the machine code may need to jump forwards and backwards during execution.
00:02:51.480 As you can see from the three arrows, the jumps are necessary because there's only one copy of each handler in the Ruby executable.
00:02:58.080 For example, to repeat the same handler, like 'put_object' in this example, we need to jump backwards.
00:03:04.319 Allowing ourselves to generate new machine code at runtime means we are no longer limited by what's in the executable.
00:03:10.620 So the first design we tried revolved around generating code to eliminate the need to jump.
00:03:16.139 The output machine code is basically what the interpreter runs over time, minus the jumps that stitch the handlers together.
00:03:22.680 This layout turns the linear sequence of Ruby instructions into equally linear machine code, squashing the handlers together.
00:03:28.440 I like this design because it’s conceptually simple. The machine code generated is closely related to how the interpreter runs over time.
00:03:34.680 It was also easy to see how the output could be faster due to the similarity to the interpreter; it’s the same code with a few jump instructions removed.
00:03:42.540 Fewer instructions should imply less work, right? Simple and nice.
00:03:48.900 Well, too bad it's not that easy. We did get some speedup on Optcarrot, but we actually got a slowdown on Rails bench.
00:03:55.440 Since it's our production use case, we want Rails to run well. So what's the difference between the two benchmarks?
00:04:01.680 Of course, it might be obvious that an emulator for a family entertainment console is different from a web app.
00:04:08.280 However, I want to characterize the benchmarks in terms of how the processor runs them.
00:04:14.820 Processors have performance counters that can tell us how the workload uses computational resources.
00:04:22.139 Since there are hundreds of these counters, it’s hard to know which ones to focus on.
00:04:29.520 Luckily, there are existing tools and methodologies to get a concise summary.
00:04:36.060 We can use the Linux performance profiler to compare the two benchmarks at a low level.
00:04:43.440 The profiler output looks like this:
00:04:49.260 Apart from the retiring column, each column shows a different type of bottleneck.
00:04:55.380 I created some animations to explain what these columns mean.
00:05:01.740 To run some computation, the processor has to know what to execute first.
00:05:09.000 Machine code is stored in memory, so the processor has to load from memory before decoding it.
00:05:15.120 In each clock cycle, the front-end issues a few operations to the back-end for actual computation.
00:05:21.300 In this example instruction stream, we want to do three multiplications but there are only two arithmetic units in the back-end.
00:05:27.660 The back-end can execute two multiplications simultaneously, but while it’s working on them, the third multiplication will be stuck waiting.
00:05:34.680 We call this situation back-end bound since the pipeline is stalled due to the back-end.
00:05:41.520 Congestion can occur at the front-end as well.
00:05:48.900 Loading code from memory takes time, leading to periods when the back-end has underutilized capacity.
00:05:56.520 You can see the adds slowly coming in from memory in this example, while the back-end has no work left to do
00:06:02.520 after finishing the final multiplication. Hopefully, that made sense.
00:06:08.820 This methodology is designed for processors that can perform multiple operations per clock cycle, and the percentages are based on execution slots.
00:06:16.380 Let’s say we can execute up to four operations per clock cycle; then we have four slots per cycle.
00:06:23.520 If we spend all the slots computing, we reach the theoretical maximum throughput.
00:06:29.520 As we saw, though, there can be delays in the pipeline; some slots may need to be spent waiting for either the front-end or the back-end.
00:06:37.020 This model counts in slots so it can account for partial delays within the cycle.
00:06:44.280 In the animation I showed, we were still able to perform some computation while waiting for new code to load from memory.
00:06:51.780 So only some slots were spent waiting.
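As a rough illustration of counting in slots rather than cycles (the numbers here are made up, not taken from the talk):

    # Toy top-down accounting: with 4 issue slots per cycle over 100 cycles,
    # every slot is either retired (useful work) or attributed to a stall.
    slots_per_cycle = 4
    cycles          = 100
    total_slots     = slots_per_cycle * cycles   # 400 slots available
    retired         = 240                        # slots that did useful work
    frontend_bound  = 100                        # slots waiting on instruction fetch/decode
    backend_bound   = 60                         # slots waiting on execution resources
    printf("retiring:        %d%%\n", 100 * retired / total_slots)         # 60%
    printf("front-end bound: %d%%\n", 100 * frontend_bound / total_slots)  # 25%
    printf("back-end bound:  %d%%\n", 100 * backend_bound / total_slots)   # 15%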
00:06:58.440 Now that we understand what the different columns mean, we can compare the results for Optcarrot and Rails bench.
00:07:05.160 We can see that Optcarrot uses execution slots fairly efficiently, with over 60% spent on retiring computations rather than waiting.
00:07:12.000 In terms of bottlenecks, back-end bound wins, but it’s evenly distributed across different categories.
00:07:19.740 What does Rails bench look like? Well, it's very front-end bound.
00:07:26.520 Rails bench uses a diverse set of Ruby features and a few C extensions, so the processor needs to execute a lot of machine code from different places.
00:07:34.020 This turns out to be quite significant.
00:07:42.900 These results are from running just the interpreter. How does YJIT change things?
00:07:50.880 YJIT generates more machine code on the fly for the processor to load.
00:08:00.200 It's true that YJIT code takes less work to run than the interpreter's.
00:08:06.300 However, going through this setup increases the work for the front-end, since jumping to the code generated by YJIT is a new source of delay.
00:08:13.800 If the generated code is much more efficient than the interpreter, it may be worth spending a few slots to access it.
00:08:20.280 However, the initial version simply wasn't optimized well enough.
00:08:26.640 Also, since the design can only handle linear sections of Ruby code, it needs to exit back to the interpreter frequently.
00:08:33.360 Thus, we faced frequent jumps between the interpreter and the generated code.
00:08:40.200 I’ll explain why these jumps can introduce delays.
00:08:46.920 The idea with this profiling methodology is to drill down into each category of delays and identify where in the code these delays occur.
00:08:55.080 I have an example of a contributing factor to front-end bound delays.
00:09:01.680 This occurs when the front end is delayed due to incorrectly predicting where a jump goes.
00:09:08.280 This type of delay happens in the interpreter a lot because of the jumps between the handlers.
00:09:14.640 We must load the address of the jump destination before we can load the instruction at that destination.
00:09:21.240 So running these two instructions has some inherent memory delay, even when things are cached.
00:09:27.600 Modern front ends strive to hide this delay by predicting where these jumps land.
00:09:33.840 By guessing ahead of time, the front end can work while waiting for the actual destination to come from cache or memory.
00:09:40.320 The animation I showed earlier is a bit misleading; while the adds were still rolling in, the front end was likely already feeding instructions to the back end.
00:09:47.160 If the front end makes the correct prediction, it appears as though there were no delays at all.
00:09:54.000 However, if we guess incorrectly, there’s a penalty incurred for redirecting the front end and correcting everything.
00:10:00.600 Now, recent security research has revealed that front-end predictors use a jump's previous destinations to predict where it will go next.
00:10:06.960 Unfortunately, there's a growing number of places the Ruby executable needs to jump to as YJIT compiles more Ruby code.
00:10:13.620 The predictor can only track so much history, and we can easily surpass that limit.
00:10:19.620 Compared to Rails bench, Optcarrot has less Ruby code, so this issue isn't as severe there.
00:10:26.280 There's also a structural issue with the initial design: the same branches used to jump between handlers are also used to transition into generated code.
00:10:33.540 This means those branches accumulate history for the growing set of generated-code addresses in addition to the handler addresses.
00:10:40.800 This mixed history doesn't help the predictor when running the interpreter, making accurate predictions more difficult.
00:10:47.040 And remember, we still exit back to the interpreter frequently.
00:10:54.120 That's the longer version of why jumping from the Ruby executable code can be challenging for the front end.
00:11:01.680 At this point, we were two to three months into the project.
00:11:08.640 We knew that jumping between the interpreter and the generated code was taxing.
00:11:16.200 So ideally, we wanted to minimize those jumps.
00:11:22.680 It would also be beneficial if we could further optimize the output.
00:11:30.180 This is when we decided to change our design and started using lazy basic block versioning to address these issues.
00:11:36.960 This is a technique for making just-in-time compilers, which Maxime developed during her PhD.
00:11:44.340 She talked about this in detail during RubyKaigi last year, and if you want to learn more about this topic, please check it out.
00:11:53.160 Today, I'm going to focus on the lazy aspect because it's my favorite part; I’m pretty lazy myself!
00:11:59.640 YJIT implements laziness with these code fragments called 'stubs.'
00:12:06.720 Each stub represents some code that hasn’t been compiled yet, kind of like a placeholder for machine code.
00:12:12.060 The general idea is to generate stubs first, then replace them with actual code the first time they’re run.
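A conceptual Ruby sketch of that idea (real YJIT stubs are tiny pieces of machine code that get patched in place, not Ruby objects):

    # A placeholder that only produces ("compiles") the real code the first time it runs.
    class LazyStub
      def initialize(&compile_block)
        @compile_block = compile_block
        @compiled = nil
      end

      def call(*args)
        @compiled ||= @compile_block.call   # compile lazily, on first use
        @compiled.call(*args)               # later calls go straight to the result
      end
    end

    add = LazyStub.new { ->(a, b) { a + b } }  # nothing produced yet
    p add.call(2, 2)                           # first call triggers the block => 4
    p add.call(3, 4)                           # subsequent calls reuse it      => 7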
00:12:18.300 Since the interaction with stubs unfolds over time, I believe it’s easier to understand how they work with some animations.
00:12:25.380 I have an example method here that just calls itself; its name is 'call_self.'
00:12:31.680 Let’s see how YJIT uses stubs to generate code for it.
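A minimal sketch of the kind of method being walked through here (the transcript garbles the names from the slides, so treat the class and method below as assumptions):

    class Custom
      # A trivial method that does nothing special: it simply returns its receiver.
      def just_return
        self
      end
    end

    def call_self(obj)
      obj.just_return   # YJIT cannot know which method this resolves to until it runs
    end

    call_self(Custom.new)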
00:12:39.800 YJIT starts by generating code for accessing the local variable, but when it sees the method call, it faces a challenge.
00:12:46.200 It needs to know where the call goes to connect the correct code.
00:12:52.920 In this case, it will go to the method found on our custom class, but that won't always be the case for other objects.
00:12:59.520 This is when YJIT decides, 'I don’t want to deal with this problem right now; I'm lazy,' and generates a stub.
00:13:05.020 The compiler completes its operation after making the stub, and now we run the freshly generated code.
00:13:11.960 We did not compile the entire method; however, running what we just compiled advances execution a little bit.
00:13:18.240 When we hit the stub, we are within the body of 'call_self,' and 'obj' is an instance of the custom class.
00:13:25.200 From here, we know that, at least in this immediate case, the call goes to the method defined on the custom class.
00:13:32.040 However, we also need to accommodate other cases, so we can solve this using a stub again.
00:13:38.400 At the stub, we first generate a check on the receiver's class to make sure it really is our custom class.
00:13:44.760 If that check passes, then we know the method call goes to the custom class's method, which simply returns.
00:13:52.680 We can generate the code for that immediately after the check.
00:13:58.920 When we call a method, we have to provide a return address, so the caller knows what code to run once the method execution is complete.
00:14:05.520 We will use the stub for the return address as well.
00:14:12.600 Next, we generate code for the called method itself.
00:14:18.420 It doesn’t do anything special, so we don’t actually require a stub for that; we can simply generate the code for returning.
00:14:26.280 YJIT then finishes generating code for this segment.
00:14:32.880 All this time, we have been generating the code that replaces the stub, highlighted in orange here.
00:14:39.240 Once we finish, we can patch the jump from the stub to the new generated code.
00:14:45.480 The new code is directly after the jump to the stub, so we actually need not jump at all; we can remove the jump.
00:14:52.299 When the new code runs and returns, we hit the stub we left for the return address.
00:14:58.440 The rest of 'call_self' simply finishes its work and returns.
00:15:05.160 We then adjust the return address pointing at the stub so it points to the new code instead.
00:15:12.300 When we resume execution, we can return from the method and finally finish running the whole program.
00:15:20.400 Now, let’s examine what happens if you repeat the call with the same 'custom' array.
00:15:29.880 What’s intriguing is that you get this linear layout again, executing from top to bottom.
00:15:35.520 This layout is practically optimal when the execution order is linear, displaying favorable behavior toward the processor front end.
00:15:42.120 There's a strong likelihood that the next instruction is already decoded, since the front end decodes instructions in blocks.
00:15:48.360 What I find remarkable is that YJIT never performed explicit analysis to achieve this nice layout; we naturally get this from being lazy.
00:15:55.260 This contrasts with how other just-in-time compilers might generate similar code.
00:16:01.920 We don’t record an execution trace nor do we take any profiling data before compiling; we simply create stubs and respond to them.
00:16:08.760 What we are doing here is called speculative optimization in the academic literature.
00:16:15.660 The code performs optimally when the input is the 'custom' array.
00:16:22.200 In the future, though, the method could receive different types of inputs.
00:16:28.560 So, we are essentially guessing what future inputs will occur.
00:16:35.040 If we call the method again with a standard array, which has a '3' in it, we will fail the check and then hit the stub.
00:16:41.640 This will result in generating additional code to handle this new condition.
00:16:48.120 The checks we leave in the code are termed guards, which protect specialized code that follows them.
00:16:54.690 Let’s observe the code produced when this guard fails.
00:17:00.600 Each code path is still linear; if we receive the normal array, we just take the first jump.
00:17:06.600 As we keep hitting stubs, we gradually roll out and generate more code.
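To make the two cases concrete, a hedged sketch of the same call site seeing two receiver classes (the names are placeholders, not the ones from the slides):

    def call_self(obj)
      obj.itself   # Object#itself just returns its receiver
    end

    CustomArray = Class.new     # hypothetical stand-in for the custom class from the talk
    call_self(CustomArray.new)  # first receiver seen: YJIT compiles a path behind a class guard
    call_self([3])              # a plain Array fails that guard, so a stub compiles a second path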
00:17:12.660 It’s common for just-in-time compilers to utilize speculative optimizations, as they are effective.
00:17:18.540 The main approach for achieving speedup is based on making wise guesses using runtime state.
00:17:25.380 YJIT makes predictions about where a method call will go based on its initial behavior.
00:17:32.220 This is closely related to inline caching in the interpreter.
00:17:41.160 For a specific code location making a method call, the interpreter remembers the method called.
00:17:47.340 If you repeat the call and it goes to the same method, then it's a cache hit.
00:17:54.000 When you hit the cache, you avoid a full method look-up and save work when setting up parameters.
00:18:00.660 Sadly, this cache only has space for one target, so if it’s called again and goes to another method, it’s a cache miss.
00:18:07.440 Despite this small cache limitation, studies show that it works quite effectively.
00:18:13.680 For instance, using bpftrace on Linux, I was able to measure how often we hit the cache in production.
00:18:20.640 The findings revealed a hit rate exceeding 90%.
00:18:27.360 This refers to the interpreter's monomorphic inline caching.
00:18:34.440 Looking up inline caching on Wikipedia reveals that this type of caching was quite effective in Smalltalk applications.
00:18:39.720 Essentially, method calls generally tend to go to one target.
00:18:46.560 This pattern comes from how people typically write code, which is why I bring up Smalltalk.
00:18:53.520 When writing a method call, you likely have a specific method in mind.
00:18:59.160 Moreover, when reading code, it's hard to follow if you don't know where a method call goes.
00:19:05.160 That confusion is another reason calls that go to multiple targets are less common.
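A small illustration of what "calls tend to go to one target" means in practice (a made-up example, not from the talk):

    # Monomorphic call site: the receiver of #upcase here is always a String,
    # so the one-entry inline cache at this call site keeps hitting.
    %w[a b c].each { |s| s.upcase }

    # Polymorphic call site: the receiver of #to_s keeps changing class,
    # so a one-entry cache keeps missing.
    [1, :sym, "str", 2.0].each { |x| x.to_s }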
00:19:12.060 Returning to our example, it shows how method calls can target multiple places.
00:19:18.840 However, the first call target still gets the shortest path, and it takes only two checks to reach the second case.
00:19:26.460 The linear layout is still achieved in this case.
00:19:33.000 Thus, speculation based on the first call target works well, as there’s usually only one.
00:19:39.060 With the stubs and laziness, YJIT can generate code for many more Ruby constructs.
00:19:46.560 This is crucial, especially as jumping from the executable to generated code can be very costly.
00:19:54.120 So, once we enter the generated code, we aim to remain there for as long as possible.
00:20:01.500 A key element of this is speculation.
00:20:08.160 Not all speculations work out, as we’ve noted, and we can’t generate code for everything at once.
00:20:16.560 Using the interpreter as a fallback lets us build YJIT progressively.
00:20:23.520 More importantly, we can concentrate on the critical Ruby features.
00:20:30.180 Fast forward to today: YJIT clearly speeds up Rails bench, with a speedup ratio of about 1.3.
00:20:37.680 We perform the same tasks using fewer instructions.
00:20:44.400 Furthermore, we execute more instructions per clock cycle—enhancing how we use slots.
00:20:51.240 This is reflected in better instruction level parallelism achieved through making good speculations.
00:20:58.140 All of this results in a better layout of code, which we observed earlier.
00:21:05.400 We have managed to overcome the inherent overhead of jumping to the generated code.
00:21:12.480 Since YJIT now compiles more Ruby features, we end up with more machine code than before.
00:21:20.280 However, we still achieve speedup.
00:21:27.600 Having more code than the instruction cache can hold isn’t necessarily problematic—provided we structure the instruction stream to help the processor conceal memory delays.
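For readers who want to check this on their own machine, YJIT exposes a bit of introspection from Ruby (the process must be started with --yjit, and the exact counters available depend on the Ruby version and build):

    if defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled?
      p RubyVM::YJIT.runtime_stats   # counters such as side-exit counts (may need --yjit-stats)
    else
      puts "YJIT is not enabled in this process (start Ruby with --yjit)"
    end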
00:21:34.560 Earlier, I mentioned how the interpreter is integral to YJIT.
00:21:41.700 Let me explain how we utilize the interpreter when required.
00:21:48.480 It's a straightforward two-step process: basically, we do two stores to memory in the output code.
00:21:56.640 It's only this simple because YJIT's code works very similarly to the interpreter's.
00:22:02.700 Typically, when the output code is more specialized, reconstituting the interpreter state entails more work.
00:22:09.120 For context, let’s delve into why assigning to 'cfp->sp' is sufficient.
00:22:17.040 This represents the stack pointer for temporary values arising during computation.
00:22:24.960 Here, I have a simple method call on 'self,' which breaks down into several instructions.
00:22:31.320 The 'put_self' instruction pushes onto the temporary stack, while the 'send' instruction pops from it.
00:22:37.560 Both 'put_self' and 'send' write to memory in the interpreter and YJIT.
00:22:44.160 If YJIT encounters an unresolved 'send' instruction and has to revert back to the interpreter, 'self' has already been stored in the proper location.
00:22:51.120 So, all YJIT has to do is adjust the pointer so the interpreter can correctly locate the value.
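You can see the two instructions being discussed with the bytecode disassembler (instruction names differ slightly between Ruby versions; 'foo' below is just a stand-in method name):

    # A bare method call compiles to a putself followed by a send-style instruction.
    puts RubyVM::InstructionSequence.compile("foo").disasm
    # putself pushes self onto the temporary value stack; the send pops it
    # back off when it performs the call, which is what cfp->sp tracks.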
00:22:58.920 This strategy works well when YJIT needs to exit. However, most of the time YJIT keeps running on its own without exiting.
00:23:06.000 In that case, we can end up with code that makes an unfortunate round trip through memory.
00:23:12.240 Here, a value is stored to memory and then immediately loaded back.
00:23:20.760 It even happens to use the same register, x11.
00:23:27.240 So there's a clear optimization opportunity on the happy path.
00:23:33.480 In cases where enough information is available, we can remove the memory operations and retain values in registers.
00:23:40.680 Even with this optimization, we still need to be able to reconstruct the interpreter state.
00:23:48.900 Now YJIT keeps values in registers, while the interpreter needs them in memory.
00:23:56.520 Thus, we have to store the values back in memory to bridge this gap when we exit from YJIT.
00:24:02.520 There are other details that need similar handling in the implementation.
00:24:09.780 YJIT incorporates multiple moving pieces.
00:24:17.640 Now, let's look at the second of the two stores.
00:24:24.840 Recall earlier that the interpreter updates the program counter while running each instruction.
00:24:32.220 It essentially needs to know this to determine what code to run next.
00:24:39.840 When running through YJIT, everything is already machine code, so updating the Ruby program counter isn't required for making progress.
00:24:46.980 Updates are only required before operations that need to know the execution progress.
00:24:54.780 Examples include obtaining a backtrace or raising an exception.
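Two everyday Ruby operations that need that up-to-date progress information (just an illustration):

    def needs_progress_info
      p caller_locations.first   # building a backtrace requires knowing the current file and line
      begin
        raise "boom"             # an exception also captures a backtrace when raised
      rescue => e
        p e.backtrace_locations.first
      end
    end

    needs_progress_info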
00:25:03.960 In addition to the program counter, various other information aids in tracking execution progress.
00:25:10.560 To name a few, there's the iseq object, which holds the VM instructions and lots of metadata like source file names and line numbers.
00:25:17.640 There's also a method object since the same Ruby code could be executed as different methods.
00:25:24.180 Furthermore, a scope object assists with constant resolution and refinement method lookups.
00:25:31.920 Both YJIT and the interpreter need to fill out all of this information during method calls.
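The instruction sequence object mentioned above is visible from Ruby, metadata included (a small illustration):

    iseq = RubyVM::InstructionSequence.compile("1 + 1")
    p iseq.path           # source file name metadata ("<compiled>" for a bare string)
    p iseq.first_lineno   # line number metadata
    p iseq.label          # a human-readable label for this piece of code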
00:25:39.060 This is a heavy workload, especially since method calls occur frequently.
00:25:46.560 Conveniently, YJIT often needs this data to run stubs and validate code.
00:25:54.840 Consequently, YJIT effectively possesses a global perspective of where every call leads.
00:26:01.080 If YJIT generates code for a method, we then retain metadata surrounding it.
00:26:09.840 This lets us speculate that calls always target methods YJIT already knows about.
00:26:17.160 When executing a method call, rather than filling out the information piece by piece, we could select from a precomputed bundle.
00:26:24.060 If the bundle covers the right part of the call graph, we would only need to change one variable to record the new execution progress.
00:26:30.900 That would be substantially less work than before.
00:26:38.220 Incorporating this scheme into YJIT would make it remarkably distinct from the interpreter.
00:26:45.780 However, bridging the gap back to the interpreter would then be much slower.
00:26:52.920 Naturally, making these optimizations necessitates careful deployment to maintain a positive user experience.
00:26:59.640 The caution is about not sacrificing user interactivity for higher peak performance.
00:27:07.140 So that was kind of the dark side of optimization.
00:27:14.520 I raise this point because I want YJIT to be well-rounded, without notable performance caveats.
00:27:20.760 One area where YJIT excels is its invisibility to users.
00:27:27.180 We compile code without adding latency that is perceptible to humans.
00:27:34.440 While we desire superior peak performance, we still strive to maintain an interactive feel with IRB.
00:27:41.520 As runtime systems advance and become more complex, there exists a genuine risk of inadvertently trading human interactivity for higher performance.
00:27:48.420 Thus, we must exercise caution.
00:27:58.140 That’s all I have.
00:28:05.520 Thanks for listening, and thank you for having me!