Fast Metaprogramming with Truffle

00:00:00.260 Good afternoon, everyone. My name is Kevin Menard. I'm a principal member of the technical staff at Oracle Labs, which is a research group within Oracle.

00:00:07.170 I'm here today to talk to you about fast metaprogramming with Truffle. Before we get started, I need to inform you that what I'm presenting today is research work from our group. It should not be construed as a product announcement.

00:00:15.480 You really shouldn't buy or sell stock in Oracle based on anything you hear today. All right, with that out of the way, let's get back to fast metaprogramming with Truffle.

00:00:22.439 We'll start by focusing on what Truffle is, since I'm guessing many of you are not familiar with it. Truffle is a framework and DSL for writing self-optimizing AST interpreters. It provides a lot of the common tooling needed to build out a language VM, such as debugging and instrumentation.

00:00:39.930 It also provides facilities for writing and optimizing AST interpreters, such as cache control, node splitting, and type specialization. Truffle can interface with Graal, which is an optimizing JIT compiler. The combination of the two performs partial evaluation across the AST interpreter to generate highly efficient machine code for your programs.

00:01:03.090 For those of you who are unfamiliar with parts of what I just said, the basic idea is that when building a way to execute a dynamic language, you often start with an AST interpreter. These are very easy to build and easy to reason about, but they often suffer from performance issues. Many languages go through this evolution where they start with an AST interpreter, hit a performance wall, and then try to build out a bytecode interpreter or their own compilation phases for their language.

00:01:34.799 Alternatively, they might integrate with something like LLVM to take advantage of its facilities. However, they still face the challenge of implementing all the other necessary components to build out a language VM, such as a debugger.

00:02:01.229 The Truffle philosophy is to build a simple AST interpreter and keep that implementation straightforward while we handle the optimization part for you. You can maintain the simplicity of the AST interpreter by adding a bit of annotation, while gaining everything you would typically need by building out a full language VM.

00:02:34.840 Truffle provides a way to build support for dynamic languages, but Ruby is a large language with a rich core API. What we've done is pull in code from both the JRuby and Rubinius projects; we are actually part of the JRuby project. Everything we do is open source, and we've been shipping with JRuby since JRuby 9000.

00:02:53.880 If you use the -X+T switch, you can activate our backend. While we're not quite ready for production usage, we are far from a toy implementation. Currently, we pass ninety percent of the core library specs for Ruby and eighty-two percent of the language specs.

00:03:11.350 Due to gaps in coverage of Ruby specs, we've been emphasizing running actual libraries, and we can now pass ninety-nine percent of the tests for the stable version of Active Support. So, with Truffle out of the way, let’s turn our attention to metaprogramming.

00:03:51.250 Metaprogramming has existed for a long time across various languages and is a core component of Ruby. Unfortunately, it seems to almost defy definition. If I asked each of you to define metaprogramming, I’m quite certain I'd receive a different definition from each of you, some of them potentially conflicting. It appears to be one of those concepts you know when you see it.

00:04:24.210 For the purposes of this talk, I'll restrict the definition of metaprogramming to the dynamic modification of objects, including classes and the generation of code at runtime to execute dynamically. Moreover, this usage will be limited to Ruby's reflection and evaluation APIs.

00:04:48.050 It's quite a mouthful, but basically, if you used accessor helpers to load and store instance variables of an object, we won't consider that metaprogramming. The act of generating the accessors would be, but using the generated accessors to get or modify that state isn't metaprogramming.

00:05:14.699 On the other hand, if you had used instance_variable_get or instance_variable_set, which are reflection API methods, then that would be metaprogramming. Ruby has a rich metaprogramming API. It's not explicitly called out as such, and it's not cohesive; these methods appear in several classes and modules, but they provide the necessary functionality for what we consider metaprogramming.
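To make the distinction concrete, here is a minimal sketch. The `Point` class and its `x` attribute are hypothetical names chosen for illustration; only `attr_accessor`, `instance_variable_get`, and `instance_variable_set` are real Ruby API.

```ruby
class Point
  attr_accessor :x   # generating the accessors counts as metaprogramming

  def initialize(x)
    @x = x
  end
end

point = Point.new(1)

# Ordinary access through the generated accessors: not metaprogramming
# under the definition used in the talk.
point.x = 2
puts point.x

# Reflective access: the variable name is passed as data at runtime,
# so this is metaprogramming.
point.instance_variable_set(:@x, 3)
puts point.instance_variable_get(:@x)
```

Both paths read and write the same `@x`; the difference is only in whether the variable name is fixed in the source or supplied as a value.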

00:05:40.159 Now, 'fast' in Ruby is a contentious term. I’ve been using Ruby since 2008, and I’m sure many of you have been using it longer. The number of times I have seen the battle waged between developer productivity and execution efficiency is mind-boggling. I don’t know if it's a case of a vocal minority wanting to pit these as opposites, but they needn’t be.

00:06:18.270 There seems to be a belief that to write concise, elegant Ruby, you must sacrifice performance. Conversely, if you want something that runs fast in Ruby, it needs to be ugly. For 'fast', we want to satisfy both conditions: it should be elegant and perform well.

00:07:00.790 Fast and metaprogramming are two words seldom seen together. If you've ever done metaprogramming in MRI (Matz's Ruby Interpreter), you have probably noticed performance issues. Here, I implemented four micro-benchmarks, which are indeed micro-benchmarks, so please take the usual warnings that accompany them.

00:07:37.660 I measured method dispatch time using static dispatch by calling the method directly and comparing it to the time required for metaprogramming by calling methods through 'send'. I also measured the time required for loading and storing instance variables using the generated helpers versus instance variable get and set.
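The benchmark code itself is not shown in the transcript. A rough sketch of what such a comparison could look like follows; the `Counter` class, the `time_it` helper, and the iteration count are all assumptions made for illustration, and the usual micro-benchmark caveats apply.

```ruby
class Counter
  attr_accessor :count

  def initialize
    @count = 0
  end

  def increment
    @count += 1
  end
end

# Returns elapsed wall-clock seconds for the given block.
def time_it
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

ITERATIONS = 200_000 # arbitrary; chosen only to keep the run short
counter = Counter.new

timings = {
  "direct call"   => time_it { ITERATIONS.times { counter.increment } },
  "send"          => time_it { ITERATIONS.times { counter.send(:increment) } },
  "accessor read" => time_it { ITERATIONS.times { counter.count } },
  "ivar get"      => time_it { ITERATIONS.times { counter.instance_variable_get(:@count) } },
}

timings.each { |label, seconds| puts format("%-14s %.4fs", label, seconds) }
```

The absolute numbers depend heavily on the Ruby implementation and machine, so only the ratio between each static and reflective pair is meaningful.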

00:08:05.930 Lastly, I implemented a simple object proxy using method missing. In this case, I created a wrapper that just wrapped around the string and delegated to string's length. In the static dispatch case, there is a method defined that does the delegation, while in the metaprogramming case, the same delegation is performed using method missing.
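The two proxy variants described here might look roughly like the sketch below. The class names `StaticProxy` and `DynamicProxy` are hypothetical, and the forwarding logic is one common way to write such a proxy, not necessarily what the benchmark used.

```ruby
# Static version: the delegating method is written out explicitly.
class StaticProxy
  def initialize(target)
    @target = target
  end

  def length
    @target.length
  end
end

# Metaprogramming version: unknown methods are forwarded at runtime.
class DynamicProxy
  def initialize(target)
    @target = target
  end

  def method_missing(name, *args, &block)
    if @target.respond_to?(name)
      @target.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with what method_missing handles.
  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name, include_private) || super
  end
end

puts StaticProxy.new("truffle").length
puts DynamicProxy.new("truffle").length
```

Both calls return the wrapped string's length; the second only gets there after normal method lookup on `DynamicProxy` has failed, which is where the extra cost comes from.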

00:08:40.039 What we can see is that, for method dispatch, calling through 'send' is noticeably slower than calling the method directly, by about two-thirds. The time spent on instance variable getting and setting is roughly the same; there's a performance hit there, but these are also optimized calls in YARV.

00:09:14.830 However, for method missing, we see a substantial performance hit. When running the same benchmarks with JRuby+Truffle, we have virtually eliminated the overhead associated with metaprogramming. The numbers do skew a bit more in comparison to MRI due to the JIT compiler and its nondeterministic nature.

00:09:57.740 I took the average of three runs here, and there’s perhaps a five percent skew in the scores. I left that in to show that I didn't employ any trickery; I could have tuned the benchmarks so the numbers matched almost exactly, but that would not reflect real-world performance.

00:10:32.600 I've compared the costs and performance of MRI's static dispatch and metaprogramming against running the same code with JRuby+Truffle's metaprogramming. What you can see is that with dispatch, we are about four times faster, and for instance variable operations, we're about five times faster. For the method missing cases, we're about ten times as fast.

00:11:09.890 So, does it matter? Ruby has been around for twenty years, and no one has really seemed bothered by this performance gap. We are a research group, and we invented a solution to fix it, so I wanted to survey how frequently metaprogramming calls are made.

00:11:46.310 For this, I looked at MRI's standard library directory, reviewed the most recent stable version of Rails, and ran a GitHub code search. I can't show all the values for all the metaprogramming methods, but I took a sample here.

00:12:05.030 What we see is that MRI is fairly conservative in its usage of metaprogramming within the standard library. It does make considerable use of respond_to?, but for the most part, it doesn't heavily rely on metaprogramming calls. Rails adds to this a little bit, but the GitHub code search shows that Rubyists are using metaprogramming calls across a wide variety of projects.

00:12:36.380 However, the GitHub code search results can be noisy, as they depend on whatever GitHub's code indexing does and how it facets the data. For instance, in the case of respond_to?, I couldn't include results because GitHub's code search doesn't index question marks, and Rails has its own respond_to helper that does something different, dominating the results.

00:12:59.940 Counting how often metaprogramming methods are called doesn't give the full picture. It helps paint it somewhat, but we could have a small number of metaprogramming calls made on very hot execution paths and a large number of calls made in execution paths that simply never execute.

00:13:22.000 I wanted to analyze where these metaprogramming calls are actually being made. In a sample from net/http, if you post a file, you could end up calling respond_to? multiple times for method lookup. The delegate library employs metaprogramming extensively, and Matrix uses 'send' inside of a loop.

00:14:01.900 In Rails, there are two prevalent cases: Rails has this class called TimeWithZone, which is used virtually everywhere, and ActiveRecord has these find_by column helpers. These helpers have an interesting performance history, as they're implemented using method missing.

00:14:49.780 In JRuby, method missing performs very well. Some projects have transitioned from using these find_by column helpers to using AR's where clause, which handles JIT compilation. However, the find_by column helpers later got caching optimizations that may have surpassed the performance gains of using where, even though they still go through method missing.

00:15:12.880 At some point, someone else added a feature in the method missing handler that dynamically defines the helper method on the model class, so you don’t have to go through method missing next time. Adding a method to a class can invalidate inline caches, but eventually things stabilize.
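The define-on-first-use pattern described here can be sketched in plain Ruby as follows. This is not ActiveRecord's code: the `Model` class, its in-memory `RECORDS` list, and the finder behavior are hypothetical stand-ins used to show the technique.

```ruby
class Model
  # Hypothetical in-memory "table" standing in for a database.
  RECORDS = [
    { name: "alice", email: "alice@example.com" },
    { name: "bob",   email: "bob@example.com" },
  ].freeze

  def self.method_missing(name, *args)
    column = name.to_s[/\Afind_by_(\w+)\z/, 1]
    return super unless column

    # Define the finder on the singleton class so that later calls
    # are ordinary method calls and never reach method_missing.
    define_singleton_method(name) do |value|
      RECORDS.find { |record| record[column.to_sym] == value }
    end

    public_send(name, *args)
  end

  def self.respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("find_by_") || super
  end
end

puts Model.singleton_methods.include?(:find_by_name) # before the first call
puts Model.find_by_name("bob").inspect
puts Model.singleton_methods.include?(:find_by_name) # after the first call
```

The first call pays for both the failed lookup and the method definition; subsequent calls dispatch directly, at the cost of having changed the class at runtime.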

00:15:43.940 There has been extensive work around managing the performance implications that metaprogramming introduces. Using static dispatch means we don’t have the same level of performance issues. Just to illustrate this, in the Benchmark 9000 suite, which is a benchmark tool that is part of the JRuby project, we have found that JRuby+Truffle is about 33 times faster than MRI in certain tests.

00:16:22.940 This speed advantage is not solely linked to metaprogramming, but a significant portion is indeed attributable to it. So, what makes JRuby+Truffle different? Why are we able to eliminate metaprogramming overhead when other implementations have not been able to?

00:17:01.490 If we take a look at this subset of the metaprogramming API, you'll notice a common structure among many of these methods. They often take some form of name as the first argument. This name could be a method name, variable name, or constant name, and provides a means to discriminate what the methods ultimately do.
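A few of these calls side by side show the shared shape. The `Greeter` class is a hypothetical example; the reflective methods themselves are standard Ruby.

```ruby
class Greeter
  GREETING = "hello"

  def initialize(name)
    @name = name
  end

  def greet
    "#{GREETING}, #{@name}"
  end
end

greeter = Greeter.new("world")

# Each call takes a name (a symbol here) as its first argument,
# and that name selects what the call actually does.
puts greeter.send(:greet)                  # method name
puts greeter.respond_to?(:greet)           # method name
puts greeter.instance_variable_get(:@name) # variable name
puts Greeter.const_get(:GREETING)          # constant name
```

In every case the interesting work, finding the thing the name refers to, is a lookup keyed on that first argument.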

00:17:46.000 This structure has a strong parallel to Ruby's method lookup. Method lookup in dynamic languages is a slow operation, and in Ruby, it is particularly slow due to its rich class hierarchy. Calling a method could require traversing classes, singleton classes, and mixed-in modules.
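The chain that a lookup may have to walk can be inspected directly. In this sketch, `Loggable`, `Base`, and `Widget` are hypothetical names; `singleton_class` and `ancestors` are standard Ruby.

```ruby
module Loggable
  def describe
    "loggable"
  end
end

class Base
end

class Widget < Base
  include Loggable
end

widget = Widget.new

# Give this one object its own singleton-class method.
def widget.special
  "only on this instance"
end

# The lookup path Ruby walks for a call on `widget`, in order:
# singleton class first, then the class, mixed-in modules, and superclasses.
puts widget.singleton_class.ancestors.inspect
```

A call such as `widget.describe` is resolved by checking each entry in that list until one defines the method.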

00:18:16.000 In isolation, method lookup might not seem that slow, but there's a compounding factor: everything in Ruby is a method call. For example, adding two numbers involves method calls, leading to a large number of lookups and potential performance hits.
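As a small illustration of the point that even arithmetic is a method call, the three lines below perform the same addition:

```ruby
a = 1 + 2         # operator syntax
b = 1.+(2)        # the same call written as an explicit method call
c = 1.send(:+, 2) # the same call made reflectively

puts [a, b, c].inspect
puts 1.respond_to?(:+)
```

Since `+` is a method on the receiver, each use of it is subject to method lookup in principle, which is why caching those lookups matters so much.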

00:18:46.000 What we do is cache method lookups. The basic anatomy of a call site includes a receiver, a method name, and arguments for that method. Ruby allows dynamic definitions, so we must ensure our cached values reflect the current state of the code.

00:19:12.350 In MRI, the global method cache has a default size of 2048 entries. Once you exceed this limit, cache evictions trigger a full method lookup to retrieve the replacement value in the cache. Such evictions can lead to potential performance bottlenecks.

00:19:44.000 Dynamic language features complicate this further. For example, methods may be added or removed at runtime, which could invalidate cached results. This complexity can lead to severe performance issues due to frequent cache invalidations.

00:20:16.000 To solve this, we localize our caches. We implement an inline cache system, where the method cache pointer is stored at the call site. Since Ruby is dynamic, we guard this cache check against potential changes in receiver type and class.

00:20:49.450 This inline cache mechanism matches the majority of calls, leading to performance boosts. If the guards fail, a full method lookup is performed. This strategy, coupled with the inline caches, provides efficient method dispatch without the overhead of global caches.
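The guard-then-fallback behavior can be modeled in plain Ruby. This is a toy simulation, not how MRI, JRuby, or Truffle implement it: the `CallSite` class is hypothetical, it holds a single cache entry, it guards only on the receiver's class, and it ignores singleton methods and method redefinition, which a real implementation must also guard against.

```ruby
# Hypothetical model of one call site with a one-entry inline cache.
class CallSite
  attr_reader :hits, :misses

  def initialize(method_name)
    @method_name = method_name
    @cached_class = nil
    @cached_method = nil
    @hits = 0
    @misses = 0
  end

  def call(receiver, *args)
    klass = receiver.class
    if klass.equal?(@cached_class)
      @hits += 1 # guard passed: reuse the cached method
    else
      @misses += 1 # guard failed: full lookup, then refill the cache
      @cached_class = klass
      @cached_method = klass.instance_method(@method_name)
    end
    @cached_method.bind(receiver).call(*args)
  end
end

site = CallSite.new(:length)
%w[a bb ccc].each { |s| site.call(s) } # one miss, then two hits
site.call([1, 2])                      # receiver class changed: miss
puts "hits=#{site.hits} misses=#{site.misses}"
```

As long as the same class keeps arriving at the call site, the lookup happens once and every later call only pays for the guard.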

00:21:32.000 JRuby+Truffle extends this concept using dispatch chains, which generalize inline caching. In practical terms, a metaprogramming call such as 'send' gets its own inline cache that checks the name passed as the argument, which maintains performance when calling methods dynamically.

00:22:30.210 With dispatch chains, we check the argument type before executing the method, maintaining speed and efficiency without compromising the structural integrity of Ruby. If the argument does not match, we can then perform a lookup and update the cache accordingly.
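A plain-Ruby model of a 'send' call site with a dispatch chain might look like the sketch below. It is only an illustration of the idea: the `SendSite` class is hypothetical, the chain is a simple array searched linearly, entries are keyed on the name argument and the receiver's class, and the give-up threshold is a constructor parameter whose default of eight follows the figure given later in the Q&A.

```ruby
# Hypothetical model of a `send` call site backed by a dispatch chain.
class SendSite
  MEGAMORPHIC_LIMIT = 8 # default taken from the Q&A; described there as tunable

  attr_reader :lookups

  def initialize(limit = MEGAMORPHIC_LIMIT)
    @limit = limit
    @chain = [] # entries of [name, receiver class, method]
    @megamorphic = false
    @lookups = 0
  end

  def megamorphic?
    @megamorphic
  end

  def call(receiver, name, *args)
    klass = receiver.class
    # Guard: does a cached entry match both the name argument and the class?
    entry = @chain.find { |n, k, _| n == name && k.equal?(klass) }
    method = entry ? entry[2] : lookup_and_cache(name, klass)
    method.bind(receiver).call(*args)
  end

  private

  def lookup_and_cache(name, klass)
    @lookups += 1
    method = klass.instance_method(name)
    if @megamorphic
      # Already gave up on caching at this site; just return the method.
    elsif @chain.size < @limit
      @chain << [name, klass, method]
    else
      @megamorphic = true
      @chain.clear
    end
    method
  end
end

site = SendSite.new
5.times { site.call("ruby", :upcase) } # one lookup, then four cached calls
site.call("ruby", :length)            # new name: one more lookup
puts "lookups=#{site.lookups}"
```

With only a few distinct names at a call site, almost every reflective call is served from the chain; once the chain would grow past the limit, the site stops caching and falls back to a full lookup each time.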

00:23:09.540 Truffle supports these dispatch chains through specialized nodes, one for each metaprogramming API method. That lets us choose what kind of value to cache depending on the method being called. This setup removes unnecessary overhead, which improves performance considerably.

00:23:54.450 The caching mechanisms we implement in JRuby+Truffle minimize overhead when metaprogramming. This efficient handling leads to enhanced performance, allowing our system to execute metaprogramming faster than traditional implementations without compromising language features.

00:24:34.080 To conclude, Rubyists like to use metaprogramming, and we have over twenty years of Ruby code that demonstrates its effectiveness. Our goal as language implementers is to ensure metaprogramming is fast, without penalizing users for leveraging the language's capabilities. Rather than introducing limitations surrounding metaprogramming, we should aim to make it as efficient as possible.

00:25:16.630 Furthermore, I've demonstrated that metaprogramming can indeed be fast, and we achieved this without modifying the Ruby language. Thus, it's fast in the ways Rubyists have come to expect it in terms of developer productivity, but it is also efficient regarding execution speed.

00:25:50.360 This simple solution took considerable research and effort to achieve. The work was presented at PLDI last year. The team investigated performance in both Ruby and Smalltalk implementations and compared results with meta-tracing JITs and partial evaluators.

00:26:32.310 They found that the technique of using dispatch chains works exceptionally well across various systems. JRuby+Truffle is the first, and I believe the only, implementation in the wild utilizing these dispatch chains to minimize the overhead associated with metaprogramming.

00:27:12.930 Achieving this took a collective effort from many researchers across the JRuby and Rubinius communities. That’s my contact information if you have any questions about this; please feel free to email or tweet at me.

00:27:59.510 We have a comprehensive wiki page within the JRuby project, and you can always search for JRuby Truffle to find a variety of blog posts and resources out there. That's it; thank you very much.

00:28:17.970 We actually have several minutes for Q&A. Does anyone have a question?

00:28:25.440 Yes, hi! What happens if you show the example of what the two-level caching looks like where you're also caching the argument length?

00:28:37.080 Could you explain more about what happens when that starts to fail frequently? The point of the 'send' mechanism is that the argument could be different each time, and it very well might be.

00:29:06.450 A good example might be the Active Record find_by methods; from one call to the next, a different actual method may end up being called. Can you explain a little bit more about what happens in that scenario?

00:29:32.790 So we essentially treat this similarly to how an inline cache would operate. There hasn’t been extensive research yet on whether the arguments tend to be monomorphic, but we can certainly support polymorphic calls.

00:29:46.650 For the examples we've run thus far, the number of distinct arguments tends to be low, typically fewer than three. In the case where you would have a large number, similar to an inline cache, we would eventually mark it as megamorphic and give up.

00:29:56.240 So we give up after eight entries, but this is tunable.

00:30:01.750 Any other questions?

00:30:04.070 Thank you!