Alan Wu

Keynote: Stories from developing YJIT

RubyKaigi 2022

00:00:09.860 Thank you! Hi everyone.
00:00:15.120 I just have a talk here about YJIT. It's just some random things I found interesting during development.
00:00:21.300 I know it's a keynote, but it's really just a greatest hits from my development experience.
00:00:26.580 This is actually my first time giving a talk at a conference, so mostly I’m just going to read from my script. Please bear with me for the next 40 to 50 minutes.
00:00:33.719 I'm going to take you chronologically through YJIT's development history.
00:00:41.879 When we started the project in 2020, we wanted to do something simple to get our feet wet.
00:00:49.070 At the time, I was looking at overhead in the interpreter, and influenced by that, we tried to design something closely related to how the interpreter works.
00:00:55.500 So first, I'll talk a little bit about the interpreter. The interpreter does not run the text of the Ruby code directly; instead, the text is broken down into small instructions.
00:01:02.340 First, the list of instructions encodes some information obtained through parsing the text, which makes it easier to execute than running the text directly.
00:01:09.000 For example, the order of operations is resolved, so we don't need to worry about running the right-hand side of an assignment first.
00:01:15.840 The interpreter can just go from top to bottom in the list of instructions to execute the code.
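As an aside, CRuby exposes these instruction lists through the `RubyVM::InstructionSequence` API, so you can see them for yourself (the exact listing varies by Ruby version):

```ruby
# Compile a snippet and print its instruction list (CRuby-specific API).
insns = RubyVM::InstructionSequence.compile("x = 2 + 2").disasm
puts insns
# The listing contains instructions such as putobject and opt_plus.
```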
00:01:21.180 The chunks of machine code in the Ruby executable implement the logic for each kind of instruction.
00:01:27.000 These chunks of machine code are called instruction handlers. Each handler is designed to pass control to the next handler.
00:01:32.460 They update the Ruby program counter to keep track of which part of the program they are running and then they pass control by jumping.
00:01:38.460 You don't need to worry too much about the details here; the main point is that the interpreter runs the Ruby program a bit at a time with these handlers.
00:01:45.420 Handlers can jump to each other. I made some animations to illustrate how the setup runs over time.
00:01:50.720 We are going to simulate 'x = 2 + 2.' In the middle of this animation, I have the machine code for the handlers as they could be laid out in the Ruby executable.
00:01:56.220 On the right, I am going to log the name and address of the instruction each time we run one.
00:02:02.900 We will run address 30 first, because the first instruction is 'put_object,' as you can see in the leftmost column.
00:02:08.640 After executing the 'put_object' handler, we look at the next Ruby instruction and jump to the handler for it.
00:02:15.840 The next Ruby instruction is 'put_object' again, so we jump back to address 30.
00:02:21.780 We run 'put_object,' and now the next instruction is 'plus,' with address 10.
00:02:29.159 We take the jump, run through the handler, and finally we get to the last instruction, which is at address 40.
00:02:37.319 The main thing I’m trying to show here with the simulation is that there are two execution contexts when the interpreter runs Ruby code.
00:02:44.400 Even though the Ruby instructions are completely linear, the machine code may need to jump forwards and backwards during execution.
00:02:51.480 As you can see from the three arrows, the jumps are necessary because there's only one copy of each handler in the Ruby executable.
00:02:58.080 For example, to repeat the same handler, like 'put_object' in this example, we need to jump backwards.
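The dispatch loop above can be sketched as a toy model in Ruby; the handler addresses (30, 10, 40) match the ones in the animation, but are of course made up for illustration:

```ruby
# A toy model of interpreter dispatch: one handler per instruction kind,
# plus a log of which handler "address" runs at each step.
HANDLERS = { putobject: 30, plus: 10, setlocal: 40 }

# The instruction list for `x = 2 + 2`
program = [[:putobject, 2], [:putobject, 2], [:plus], [:setlocal, :x]]

stack  = []
locals = {}
log    = []

program.each do |op, arg|   # the "jump to the next handler" loop
  log << [op, HANDLERS[op]]
  case op
  when :putobject then stack.push(arg)
  when :plus      then stack.push(stack.pop + stack.pop)
  when :setlocal  then locals[arg] = stack.pop
  end
end

p log     # => [[:putobject, 30], [:putobject, 30], [:plus, 10], [:setlocal, 40]]
p locals  # => {:x=>4}
```

Note how the same handler entry (address 30) appears twice in the log: with only one copy of each handler, repeating an instruction means jumping backwards.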
00:03:04.319 Allowing ourselves to generate new machine code at runtime means we are no longer limited by what’s in the executable.
00:03:10.620 So the first design we tried revolved around generating code to eliminate the need to jump.
00:03:16.139 The output machine code is basically what the interpreter runs over time, minus the jumps that stitch the handlers together.
00:03:22.680 This layout maps the linear sequence of Ruby instructions to linear machine code, which is a nice direct correspondence.
00:03:28.440 I like this design because it’s conceptually simple. The machine code generated is closely related to how the interpreter runs over time.
00:03:34.680 It was also easy to see how the output could be faster due to the similarity to the interpreter; it’s the same code with a few jump instructions removed.
00:03:42.540 Fewer instructions should imply less work, right? Simple and nice.
00:03:48.900 Well, too bad it's not that easy. We did get some speedup on Optcarrot, but we actually got a slowdown on Rails bench.
00:03:55.440 Since it's our production use case, we want Rails to run well. So what's the difference between the two benchmarks?
00:04:01.680 Of course, it might be obvious that Optcarrot, an emulator for a family entertainment console, is different from a web app.
00:04:08.280 However, I want to characterize the benchmarks in terms of how the processor runs them.
00:04:14.820 Processors have performance counters that can tell us how the workload uses computational resources.
00:04:22.139 Since there are hundreds of these counters, it’s hard to know which ones to focus on.
00:04:29.520 Luckily, there are existing tools and methodologies to get a concise summary.
00:04:36.060 We can use the Linux perf profiler to compare the two benchmarks at a low level.
00:04:43.440 The profiler output looks like this:
00:04:49.260 Aside from the retiring column, the columns show different types of bottlenecks.
00:04:55.380 I created some animations to explain what these columns mean.
00:05:01.740 To run some computation, the processor has to know what to execute first.
00:05:09.000 Machine code is stored in memory, so the processor has to load from memory before decoding it.
00:05:15.120 In each clock cycle, the front-end issues a few operations to the back-end for actual computation.
00:05:21.300 In this example instruction stream, we want to do three multiplications but there are only two arithmetic units in the back-end.
00:05:27.660 The back-end can execute two multiplications simultaneously, but while it’s working on them, the third multiplication will be stuck waiting.
00:05:34.680 We call this situation back-end bound since the pipeline is stalled due to the back-end.
00:05:41.520 Congestion can occur at the front-end as well.
00:05:48.900 Loading code from memory takes time, leading to periods when the back-end has underutilized capacity.
00:05:56.520 You can see the adds arriving slowly in this example, leaving the back-end with no work to do.
00:06:02.520 Eventually the code arrives and we finish with the final multiplication. Hopefully, that made sense.
00:06:08.820 This methodology is designed for processors that can perform multiple operations per clock cycle, and percentages are based on execution slots.
00:06:16.380 Let’s say we can execute up to four operations per clock cycle; then we have four slots per cycle.
00:06:23.520 If we spend all the slots computing, we reach the theoretical maximum throughput.
00:06:29.520 As we saw, though, there can be delays in the pipeline; some slots may need to be spent waiting for either the front-end or the back-end.
00:06:37.020 This model counts in slots so it can account for partial delays within the cycle.
00:06:44.280 In the animation I showed, we were still able to perform some computation while waiting for new code to load from memory.
00:06:51.780 So only some slots were spent waiting.
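The slot accounting works out like this; here's a tiny worked example with invented numbers, just to show the arithmetic behind the percentages:

```ruby
# Slot accounting with made-up numbers: 4 issue slots per cycle.
slots_per_cycle = 4
cycles          = 100
total_slots     = slots_per_cycle * cycles   # 400 slots available

retired        = 250   # slots that did useful computation
frontend_stall = 90    # slots spent waiting on code fetch and decode
backend_stall  = 60    # slots spent waiting on execution resources

retiring_pct = 100.0 * retired / total_slots
frontend_pct = 100.0 * frontend_stall / total_slots
puts "retiring: #{retiring_pct}%, front-end bound: #{frontend_pct}%"
```

Counting in slots rather than whole cycles is what lets the model charge a cycle partially to waiting and partially to useful work.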
00:06:58.440 Now that we understand what the different columns mean, we can compare the results for Optcarrot and Rails bench.
00:07:05.160 We can see that Optcarrot uses execution slots fairly efficiently, with over 60% spent on retiring computations rather than waiting.
00:07:12.000 In terms of bottlenecks, back-end bound wins, but the delays are fairly evenly distributed across the different categories.
00:07:19.740 What does Rails bench look like? Well, it's very front-end bound.
00:07:26.520 Rails bench uses a diverse set of Ruby features and a few C extensions, so the processor needs to execute a lot of machine code from different places.
00:07:34.020 This turns out to be quite significant.
00:07:42.900 These results are from running just the interpreter. How does YJIT change things?
00:07:50.880 YJIT generates more machine code on the fly for the processor to load.
00:08:00.200 Even though it's true that YJIT code takes less work to run than the interpreter,
00:08:06.300 going through this setup increases the work for the front end, since jumping to the code generated by YJIT is a new source of delay.
00:08:13.800 If the generated code is much more efficient than the interpreter, it may be worth spending a few slots to access it.
00:08:20.280 However, the initial version simply wasn't optimized well enough.
00:08:26.640 Also, since the design can only handle linear sections of Ruby code, it needs to exit back to the interpreter frequently.
00:08:33.360 Thus, we faced frequent jumps between the interpreter and the generated code.
00:08:40.200 I’ll explain why these jumps can introduce delays.
00:08:46.920 The idea with this profiling methodology is to drill down into each category of delays and identify where in the code these delays occur.
00:08:55.080 I have an example of a contributing factor to front-end bound delays.
00:09:01.680 This occurs when the front end is delayed due to incorrectly predicting where a jump goes.
00:09:08.280 This type of delay happens in the interpreter a lot because of the jumps between the handlers.
00:09:14.640 We must load the address of the jump destination before we can load the instruction at that destination.
00:09:21.240 So running these two instructions has some inherent memory delay, even when things are cached.
00:09:27.600 Modern front ends strive to hide this delay by predicting where these jumps land.
00:09:33.840 By guessing ahead of time, the front end can work while waiting for the actual destination to come from cache or memory.
00:09:40.320 The animation I showed earlier is a bit misleading; while the adds were trickling in, the front end was likely already feeding instructions to the back end.
00:09:47.160 If the front end makes the correct prediction, it appears as though there were no delays at all.
00:09:54.000 However, if we guess incorrectly, there’s a penalty incurred for redirecting the front end and correcting everything.
00:10:00.600 Now, recent security research reveals that front-end predictors use a jump's previous destinations to forecast where it will go next.
00:10:06.960 Unfortunately, there's a growing number of places the Ruby executable needs to jump to as YJIT compiles more Ruby code.
00:10:13.620 The predictor can only track so much history, and we can easily surpass that limit.
00:10:19.620 On Optcarrot, there’s less Ruby code, so this issue isn’t as severe.
00:10:26.280 There’s also a structural issue with the initial design: the same branches used to jump between handlers also transition into generated code.
00:10:33.540 This means those branches now carry history for handler addresses in addition to the growing set of generated-code addresses.
00:10:40.800 This mixed history doesn’t help the predictor when running the interpreter, making accurate predictions more difficult.
00:10:47.040 Consequently, we still exit back to the interpreter frequently.
00:10:54.120 That's the longer version of why jumping from the Ruby executable to generated code can be challenging for the front end.
00:11:01.680 At this point, we were two to three months into the project.
00:11:08.640 We knew that jumping between the interpreter and the generated code was taxing.
00:11:16.200 So ideally, we wanted to minimize those jumps.
00:11:22.680 It would also be beneficial if we could further optimize the output.
00:11:30.180 This is when we decided to change our design and started using lazy basic block versioning to address these challenges.
00:11:36.960 This is a technique for making just-in-time compilers, which Maxime developed during her PhD.
00:11:44.340 She talked about this in detail during RubyKaigi last year, and if you want to learn more about this topic, please check it out.
00:11:53.160 Today, I'm going to focus on the lazy aspect because it's my favorite part; I’m pretty lazy myself!
00:11:59.640 YJIT implements laziness with these code fragments called 'stubs.'
00:12:06.720 Each stub represents some code that hasn’t been compiled yet, kind of like a placeholder for machine code.
00:12:12.060 The general idea is to generate stubs first, then replace them with actual code the first time they’re run.
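The shape of that idea, generate a placeholder first and patch in real code the first time it runs, can be sketched in plain Ruby. This is only a toy model: real YJIT patches machine code, and all the names here are made up:

```ruby
# A toy "stub": a placeholder that compiles real code on first use,
# then replaces itself so later calls skip compilation entirely.
compiled      = {}   # branch target -> callable code
compile_count = 0

make_stub = lambda do |name, compile|
  lambda do |*args|
    code = compile.call            # compile lazily, on first execution
    compiled[name] = code          # "patch": future lookups get real code
    code.call(*args)
  end
end

compiled[:double] = make_stub.call(:double, lambda do
  compile_count += 1               # count how many times we compile
  ->(x) { x * 2 }                  # the "generated code"
end)

r1 = compiled[:double].call(21)    # hits the stub: compiles, then runs
r2 = compiled[:double].call(5)     # runs the patched code directly
p [r1, r2, compile_count]          # => [42, 10, 1]
```

Note that compilation happens exactly once, and only because the code was actually reached, which is the lazy part.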
00:12:18.300 Since the interaction with stubs unfolds over time, I believe it’s easier to understand how they work with some animations.
00:12:25.380 I have an example method here that just calls itself; its name is 'call_self.'
00:12:31.680 Let’s see how YJIT uses stubs to generate code for it.
00:12:39.800 YJIT starts by generating code for accessing the local variable, but when it sees the call to itself, it faces a challenge.
00:12:46.200 It needs to know where the call goes to connect the correct code.
00:12:52.920 In this case, it will go to the method defined on Custom, but that won't always be the case for other objects.
00:12:59.520 This is when YJIT decides, 'I don’t want to deal with this problem right now; I'm lazy,' and generates a stub.
00:13:05.020 The compiler completes its operation after making the stub, and now we run the freshly generated code.
00:13:11.960 We did not compile the entire method, but running what we just compiled advances execution a bit.
00:13:18.240 When we encounter the stub, we are within the body of 'call_self,' and 'obj' is a Custom.
00:13:25.200 From here, we know that, at least in this immediate case, the call goes to the method defined on Custom.
00:13:32.040 However, we also need to accommodate other cases, so we can solve this using a stub again.
00:13:38.400 When a call reaches the stub, we first check whether the receiver is a Custom.
00:13:44.760 If that check passes, then we know the method call goes to Custom's method, which simply returns.
00:13:52.680 We can generate the code for that immediately after the check.
00:13:58.920 When we call a method, we have to provide a return address, so the caller knows what code to run once the method execution is complete.
00:14:05.520 We will use the stub for the return address as well.
00:14:12.600 Next, we generate code for Custom's method.
00:14:18.420 It doesn’t do anything special, so we don’t actually require a stub for that; we can simply generate the code for returning.
00:14:26.280 YJIT then completes generating code during this segment.
00:14:32.880 All the while, we have been generating this code in response to the stub, highlighted in orange here.
00:14:39.240 Once we finish, we can patch the jump from the stub to the new generated code.
00:14:45.480 The new code is directly after the jump to the stub, so we actually need not jump at all; we can remove the jump.
00:14:52.299 When the new code runs, we hit the stub we used for the return address.
00:14:58.440 The rest of 'call_self' simply finishes and returns.
00:15:05.160 We then adjust the reference to the stub to point to the new code.
00:15:12.300 When we resume execution, we can return from the two methods and finally finish running the complete program.
00:15:20.400 Now, let’s examine what happens if you repeat the call with the same array of Custom objects.
00:15:29.880 What’s intriguing is that you get this linear layout again, executing from top to bottom.
00:15:35.520 This layout is practically optimal when the execution order is linear, displaying favorable behavior toward the processor front end.
00:15:42.120 There’s a strong likelihood that the next instruction is already decoded, since the front end decodes instructions in blocks.
00:15:48.360 What I find remarkable is that YJIT never performed explicit analysis to achieve this nice layout; we naturally get this from being lazy.
00:15:55.260 This contrasts with how other just-in-time compilers might generate similar code.
00:16:01.920 We don’t record an execution trace nor do we take any profiling data before compiling; we simply create stubs and respond to them.
00:16:08.760 What we are doing here is called speculative optimization in the academic literature.
00:16:15.660 The code performs optimally when the input is the Custom array.
00:16:22.200 In the future, though, the method could receive different types of inputs.
00:16:28.560 So, we are essentially guessing what future inputs will occur.
00:16:35.040 If we call the method again with a standard array, which has a '3' in it, we will fail the check and then hit the stub.
00:16:41.640 This will result in generating additional code to handle this new condition.
00:16:48.120 The checks we leave in the code are termed guards, which protect specialized code that follows them.
00:16:54.690 Let’s observe the code produced when this guard fails.
00:17:00.600 Only one code path can be linear; if we receive the normal array, we now take the first jump.
00:17:06.600 As we keep hitting stubs, we gradually unroll and generate more code.
00:17:12.660 It’s common for just-in-time compilers to utilize speculative optimizations, as they are effective.
00:17:18.540 The main approach for achieving speedup is based on making wise guesses using runtime state.
00:17:25.380 YJIT makes predictions about where a method call will go based on its initial behavior.
00:17:32.220 This is closely related to inline caching in the interpreter.
00:17:41.160 For a specific code location making a method call, the interpreter remembers the method called.
00:17:47.340 If you repeat the call and it goes to the same method, then it's a cache hit.
00:17:54.000 When you hit the cache, you avoid a full method look-up and save work when setting up parameters.
00:18:00.660 Sadly, this cache only has space for one target, so if it’s called again and goes to another method, it’s a cache miss.
00:18:07.440 Despite this small cache limitation, studies show that it works quite effectively.
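A one-entry inline cache is simple enough to sketch in Ruby. This toy version keys the cache on the receiver's class at a single call site and counts hits and misses; it's illustrative only, not CRuby's actual structure:

```ruby
# A one-entry (monomorphic) inline cache for one call site: remember the
# receiver's class and the resolved method; any other class forces a
# fresh lookup and overwrites the single cache slot.
class CallSiteCache
  attr_reader :hits, :misses

  def initialize(method_name)
    @method_name   = method_name
    @cached_class  = nil
    @cached_method = nil
    @hits = @misses = 0
  end

  def call(receiver, *args)
    if receiver.class == @cached_class
      @hits += 1                               # cache hit: skip the lookup
    else
      @misses += 1                             # cache miss: full lookup
      @cached_class  = receiver.class
      @cached_method = receiver.class.instance_method(@method_name)
    end
    @cached_method.bind(receiver).call(*args)
  end
end

site = CallSiteCache.new(:upcase)
100.times { site.call("hello") }   # same class every time: 1 miss, 99 hits
site.call(:sym)                    # a Symbol: evicts the entry, 1 more miss
puts "hit rate: #{site.hits.to_f / (site.hits + site.misses)}"
```

Even this tiny model shows the pattern: as long as a call site keeps seeing one class, almost every call is a hit.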
00:18:13.680 For instance, using Linux bpftrace, I was able to measure how often we hit the cache in production.
00:18:20.640 The findings revealed a hit rate exceeding 90%.
00:18:27.360 This refers to the interpreter's monomorphic inline caching.
00:18:34.440 The Wikipedia article on inline caching notes that this type of caching is quite effective in Smalltalk applications.
00:18:39.720 Essentially, method calls generally tend to go to one target.
00:18:46.560 This pattern comes from how people typically write code, which is why I bring up Smalltalk.
00:18:53.520 When writing a method call, you likely have a specific method in mind.
00:18:59.160 Moreover, when reading code, it’s hard to follow a method call when you can’t tell where it goes.
00:19:05.160 That’s another reason calls with multiple targets are less common.
00:19:12.060 Returning to our example, it shows how method calls can target multiple places.
00:19:18.840 However, the first call target gets the shortest path, and it takes only two checks to reach the second case.
00:19:26.460 The linear layout is still achieved in this case.
00:19:33.000 Thus, speculation based on the first call target works well, as there’s usually only one.
00:19:39.060 With the stubs and laziness, YJIT can generate code for many more Ruby constructs.
00:19:46.560 This is crucial, especially as jumping from the executable to generated code can be very costly.
00:19:54.120 So, once we enter the generated code, we aim to remain there for as long as possible.
00:20:01.500 A key element of this is how we handle failed speculation.
00:20:08.160 Not all speculations work out, as we’ve noted, and we can’t generate code for everything at once.
00:20:16.560 Using the interpreter as a fallback lets us build YJIT progressively.
00:20:23.520 More importantly, we can concentrate on the critical Ruby features.
00:20:30.180 Fast forward to today, YJIT clearly speeds up Rails bench, with a timing ratio of about 1.3.
00:20:37.680 We perform the same tasks using fewer instructions.
00:20:44.400 Furthermore, we execute more instructions per clock cycle—enhancing how we use slots.
00:20:51.240 This is reflected in better instruction level parallelism achieved through making good speculations.
00:20:58.140 All of this results in a better layout of code, which we observed earlier.
00:21:05.400 We have managed to overcome the inherent overhead of jumping to the generated code.
00:21:12.480 As YJIT compiles more Ruby features, we end up with more machine code than before.
00:21:20.280 However, we still achieve speedup.
00:21:27.600 Having more code than the instruction cache can hold isn’t necessarily problematic—provided we structure the instruction stream to help the processor conceal memory delays.
00:21:34.560 Earlier, I mentioned how the interpreter is integral to YJIT.
00:21:41.700 Let me explain how we utilize the interpreter when required.
00:21:48.480 It's a straightforward two-step process: basically, we perform two stores to memory in the output code.
00:21:56.640 The simplicity of this process reflects how YJIT code functions similarly to the interpreter.
00:22:02.700 Typically, when the output code is more specialized, reconstituting the interpreter state entails more work.
00:22:09.120 For context, let’s delve into why assigning to 'cfp->sp' is sufficient.
00:22:17.040 This represents the stack pointer for temporary values arising during computation.
00:22:24.960 Here, I have a simple method call on 'self,' which breaks down into several instructions.
00:22:31.320 The 'put_self' instruction pushes onto the temporary stack, while the 'send' instruction pops from it.
00:22:37.560 Both 'put_self' and 'send' write to memory in the interpreter and YJIT.
00:22:44.160 If YJIT encounters an unresolved 'send' instruction and has to revert back to the interpreter, 'self' has already been stored in the proper location.
00:22:51.120 So, all YJIT has to do is adjust the pointer so the interpreter can correctly locate the value.
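That "just adjust the pointer" step can be modeled in Ruby. In this toy sketch (names invented), the JIT and the interpreter share one temporary-value stack in "memory," the JIT tracks the stack top in a local standing in for a register, and exiting means publishing that index:

```ruby
# Toy model: JIT and interpreter share one temporary stack in "memory"
# (an Array); exiting to the interpreter only requires storing the
# stack-top index, because the values themselves are already in place.
class Frame
  attr_accessor :stack, :sp
  def initialize
    @stack = []   # shared temporary-value stack ("memory")
    @sp    = 0    # interpreter's view of the stack top
  end
end

frame  = Frame.new
jit_sp = frame.sp               # JIT keeps the stack top in a "register"

# put_self: the JIT writes the value into the shared stack slot.
frame.stack[jit_sp] = :the_self_object
jit_sp += 1

# Unsupported instruction hit: exit by publishing the stack pointer.
frame.sp = jit_sp               # the single store that bridges the gap

p frame.stack[0...frame.sp]     # interpreter now sees [:the_self_object]
```

This only works because the toy JIT, like early YJIT, writes temporaries to the same memory the interpreter reads; once values live purely in registers, exiting requires storing them back first.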
00:22:58.920 This strategy works well when YJIT needs to exit. However, YJIT usually operates independently.
00:23:06.000 Thus, we can find ourselves with code leading to an unfortunate memory round trip.
00:23:12.240 In this scenario, the code stores a value to memory and then immediately loads it back.
00:23:20.760 In this case, it even uses the same register, x11.
00:23:27.240 This presents an optimization opportunity on the happy path.
00:23:33.480 In cases where enough information is available, we can remove the memory operations and retain values in registers.
00:23:40.680 While making this optimization, we still need to be able to reconstruct interpreter state.
00:23:48.900 Consequently, YJIT positions values in registers, while the interpreter requires them in memory.
00:23:56.520 Thus, we have to store the values back in memory to bridge this gap when we exit from YJIT.
00:24:02.520 I’ve also noted other factors that may require tweaking for implementation.
00:24:09.780 YJIT incorporates multiple moving pieces.
00:24:17.640 Now let’s look at the second part of the optimization process.
00:24:24.840 Recall earlier that the interpreter updates the program counter while running each instruction.
00:24:32.220 It essentially needs to know this to determine what code to run next.
00:24:39.840 With YJIT, everything is translated to machine code, so updating the Ruby program counter isn’t necessary for making progress.
00:24:46.980 Updates are only required before operations necessitating knowledge of execution progress.
00:24:54.780 Examples include before obtaining a back trace or raising an exception.
00:25:03.960 In addition to the program counter, various other information aids in tracking execution progress.
00:25:10.560 To name a few, there’s the iseq object, which houses the VM instructions and lots of metadata like source file names and line numbers.
00:25:17.640 There's also a method object since the same Ruby code could be executed as different methods.
00:25:24.180 Furthermore, a scope object assists with constant resolution and refinement method lookups.
00:25:31.920 Both YJIT and the interpreter need to fill out all of this information during method calls.
00:25:39.060 This is a heavy workload, especially since method calls occur frequently.
00:25:46.560 Conveniently, YJIT often needs this data to run stubs and validate code.
00:25:54.840 Consequently, YJIT effectively possesses a global perspective of where every call leads.
00:26:01.080 If YJIT generates code for a method, we then retain metadata surrounding it.
00:26:09.840 This allows us to speculate that calls always target methods YJIT already knows about.
00:26:17.160 When executing a method call, rather than filling out information gradually, we can select from a precomputed bundle.
00:26:24.060 If our bundle covers the correct call graph, we only need to change one variable to select the appropriate entry.
00:26:30.900 This method ultimately requires substantially less work than previously.
00:26:38.220 Incorporating this scheme into YJIT would make it remarkably distinct from the interpreter.
00:26:45.780 However, bridging between these metadata-based optimizations and the interpreter would slow things down.
00:26:52.920 Naturally, making these optimizations necessitates careful deployment to maintain a positive user experience.
00:26:59.640 The caution lies in avoiding sacrificing user interactivity for higher peak performance.
00:27:07.140 So that was kind of the dark side of optimization.
00:27:14.520 I raise this point because I want YJIT to be well-rounded, without notable performance caveats.
00:27:20.760 One area where YJIT excels is its invisibility to users.
00:27:27.180 We compile code with latency so low it isn’t perceptible to humans.
00:27:34.440 While we desire superior peak performance, we still strive to maintain an interactive feel with IRB.
00:27:41.520 As runtime systems advance and become more complex, there exists a genuine risk of inadvertently trading human interactivity for higher performance.
00:27:48.420 Thus, we must exercise caution.
00:27:58.140 That’s all I have.
00:28:05.520 Thanks for listening, and thank you for having me!