00:00:10.580
All right, I'll get going here. I guess hi everyone, my name is Kevin. I've met some of you before, and I work at Oracle Labs, which is a research group within Oracle.
00:00:22.080
In particular, I work on a team that specializes in virtual machine and compiler technologies. So I'm here today to talk about TruffleRuby, which is our implementation of the Ruby language, and how we've improved its startup time with a new tool called the SubstrateVM.
00:00:34.710
Before I get started, I need to inform you that what I'm presenting today is research from a research group and should not be construed as a product announcement. Please do not buy or sell stock in anything based on what you hear in this talk today.
00:00:45.360
The title of my talk, 'Improving TruffleRuby’s Startup Time with the SubstrateVM,' is somewhat verbose. I'm not super creative when it comes to these things, but it is quite descriptive.
00:01:03.180
If you're new to Ruby or you don't keep track of all the various Ruby implementations out there, you might be wondering what TruffleRuby is. TruffleRuby is, as I mentioned, an alternative implementation of the Ruby programming language. It aims to be compatible with all of your Ruby code but provides new advantages compared to other Ruby implementations.
00:01:19.409
It's relatively young, about four years old now. I like to think of it as a best-of-breed approach. We actually pull in code from JRuby, Rubinius, and MRI. The core of TruffleRuby is written in Java, so there's a natural ability for us to share some code with JRuby. Like Rubinius, we aim to implement as much of the language as possible in Ruby itself.
00:01:37.170
We were able to leverage a lot of the work the Rubinius team had previously done in implementing the core library in Ruby. We also pull in the standard library from MRI. More recently, we've become able to run MRI's C extensions, and we can run MRI's OpenSSL implementation and Zlib. Currently, we're about 97% compatible with Ruby based on the core library specs from the Ruby Spec Project.
00:01:50.250
We're at 99% compatibility on the Ruby language specs. These test suites are really nice, but they're not as comprehensive as we'd like. So we've also spent a fair amount of time testing the top 500 gems out there, with ActiveSupport being one that's really popular; we're actually 99% compatible with that.
00:02:10.140
We don't quite have database drivers yet, but that's something we are working on. We can't run all of Rails yet, but we're closing the compatibility gap quickly.
00:02:27.170
TruffleRuby is implemented in Truffle. Truffle is a language toolkit for generating simple and fast runtimes. With Truffle, you build an AST interpreter for your language, which is about the simplest way to implement a language.
00:02:40.050
AST interpreters are very straightforward to develop; they're easy to reason about and debug. However, the problem with AST interpreters is that they tend to be slow. We fixed that by pairing Truffle with another project called Graal, which is our JIT compiler.
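To make the idea of an AST interpreter concrete, here is a minimal sketch in plain Java. It does not use the Truffle API; the `AstNode`, `IntLiteral`, `AddNode`, and `MulNode` names are hypothetical and chosen only for illustration. Each node knows how to execute itself, and running the program is just calling `execute()` on the root.

```java
// A toy AST interpreter. All class names are hypothetical; this is not the Truffle API.
interface AstNode {
    long execute();
}

final class IntLiteral implements AstNode {
    private final long value;

    IntLiteral(long value) {
        this.value = value;
    }

    @Override
    public long execute() {
        return value;
    }
}

final class AddNode implements AstNode {
    private final AstNode left;
    private final AstNode right;

    AddNode(AstNode left, AstNode right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public long execute() {
        // Evaluate both children, then combine the results.
        return left.execute() + right.execute();
    }
}

final class MulNode implements AstNode {
    private final AstNode left;
    private final AstNode right;

    MulNode(AstNode left, AstNode right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public long execute() {
        return left.execute() * right.execute();
    }
}

class AstInterpreterDemo {
    static long run() {
        // Tree for the expression (1 + 2) * 4
        AstNode program = new MulNode(
            new AddNode(new IntLiteral(1), new IntLiteral(2)),
            new IntLiteral(4));
        return program.execute();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 12
    }
}
```

The virtual `execute()` call on every node is the kind of overhead that makes a naive tree walk slow, which is the gap the JIT is meant to close.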
00:03:03.930
Graal is a compiler for Java written in Java, and it has hooks from Java that Truffle can use to optimize these AST interpreters through a process called partial evaluation. This is important because many languages start with an AST interpreter, then hit a performance wall, and have to build a language-specific VM.
00:03:13.560
Building a VM requires a lot of work and expertise, and it can be hard to get right. MRI went through this itself; Ruby up through 1.8 was a simple AST interpreter, while Ruby 1.9 introduced the YARV instruction set and a virtual machine. What we want to do with Truffle is to say, 'Hey, just build your language; stay in the AST interpreter where it’s simple, and we'll take care of the performance optimization.'
00:03:37.170
In addition to being a language-building toolkit, Truffle provides some additional features like debugging, profiling, and general instrumentation. These are things that all languages need, and you get them out of the box. Additionally, some JIT control features, such as inline caching, are also provided.
00:03:49.960
Truffle also has a polyglot feature; this is a first-class feature of the framework. All languages implemented in Truffle can call into and out of each other, and because they inherit from a base node class hierarchy, Truffle nodes from one language can be mixed with nodes from another easily.
00:04:11.100
When they are submitted for compilation with Graal, we can eliminate that cross-language call boundary. So for instance, you can call JavaScript code from Ruby, and there is no performance penalty for calling into JavaScript after it gets optimized.
00:04:29.270
As I mentioned, this is a first-class feature of Truffle. Some of Truffle's functionalities are actually implemented as domain-specific languages within Truffle. If you have been following TruffleRuby, you might be wondering what we've been up to over the past year.
00:04:52.090
We actually spun out of JRuby; we used to ship as an alternative backend to JRuby. At that time, we were called JRuby plus Truffle, but now we are TruffleRuby. We began running C extensions from MRI last year, and Chris Seaton, who was on our team, gave a talk at RubyConf outlining a blueprint for running MRI's C extensions.
00:05:14.300
Since then, we have managed to run OpenSSL, Nokogiri, JSON, and we're also beginning to work on some of the database adapters. This approach is working, and the results are really promising. We now have Java interop, so you can call Java methods on Java objects from Ruby. If you've ever used JRuby, the syntax looks very similar.
00:05:41.540
We've been working on improving our native calls as well. Ruby has a rich interface for making underlying POSIX calls, and Truffle has a new feature called the Native Function Interface (NFI), which provides functionality akin to Ruby's FFI for Truffle languages.
00:06:00.940
We've made some really good progress in the short time the project has been around. We have achieved a high level of compatibility, and while performance has been good, we've had one sticking point, which is startup time.
00:06:19.350
Applications typically go through three cycles: startup, warm up, and then steady-state. Startup time is the period the interpreter uses to prepare itself to start running your code. Warm up time starts when it begins running your code, but since the code is cold, it tends to be slow.
00:06:40.070
The idea is that as it executes multiple times, it should get progressively faster to the point where we call it hot, and thus warm up. If you have a JIT, this is when you would be profiling the application, submitting items for compilation, and compiling them. However, even if you don't have a JIT, your application could still have a warm-up phase.
00:07:01.000
The operating system will perform tasks that populate filesystem caches, and you'll also be populating cache lines in the CPU. Eventually, you hit your steady-state, which is where your application spends most of its life. Most applications enter an event loop or something similar that keeps them running until they exit.
00:07:13.520
You can broadly classify applications into two types: those that are long-lived and those that are short-lived. In a long-lived application, you spend most of the time in the steady-state. However, TruffleRuby, like many languages implemented with a JIT, has a slower startup and warm-up time.
00:07:34.390
This trade-off is made because it generates very fast code for the steady-state, as the idea is that your application will spend enough time in that steady-state to make the upfront time spent generating fast code worthwhile. However, many Ruby applications are short-lived, like IRB or Pry, or the test suites we often run.
00:07:56.870
In cases like these, most of your application's running time can actually be spent in the startup phase. This is illustrated in a graph where the startup time does not get any longer; it merely accounts for a larger percentage of the application's overall lifetime.
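The point about startup taking a larger share of a short-lived process can be worked through with simple arithmetic. The sketch below uses the 1.7 second startup figure quoted later in the talk; the run times after startup (1 second, 1 minute, 1 hour) are made-up values chosen only to show the trend.

```java
class StartupShare {
    // Fraction of total wall-clock time spent in startup.
    static double startupShare(double startupSeconds, double runSeconds) {
        return startupSeconds / (startupSeconds + runSeconds);
    }

    public static void main(String[] args) {
        double startup = 1.7; // seconds, the TruffleRuby-on-JVM figure from the talk
        // Hypothetical amounts of useful work done after startup.
        double[] runTimes = {1.0, 60.0, 3600.0};
        for (double run : runTimes) {
            System.out.printf("run %.0fs -> startup is %.1f%% of total%n",
                run, 100.0 * startupShare(startup, run));
        }
    }
}
```

The startup cost is constant in every row; only its share of the total changes as the process lives longer.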
00:08:18.700
After reaching the warm-up phase, a significant amount of work could be wasted, since the time spent in the steady-state before exiting is so minimal, resulting in little to no benefit from warming up. Thus, while optimized for long-lived applications, TruffleRuby has not spent much time optimizing for short-lived ones.
00:08:41.520
To improve startup time, it's helpful to see what the current status is. I ran a very simple 'Hello, World!' application, and we can observe that MRI is hands down the fastest, running it in about 38 milliseconds. JRuby runs it in 1.4 seconds, while TruffleRuby lags behind at 1.7 seconds.
00:09:06.630
While nobody really runs a 'Hello, World!' application in production, I thought it was a great way to illustrate the differences. Turning back to the Ruby Spec Suite, the test suite I mentioned earlier that we use to evaluate our language compatibility.
00:09:19.700
The Ruby Spec Suite is modular, broken up into various components like tests for core language features and tests for the standard library. The idea is that multiple Ruby implementations can select subsets of this test suite to progressively add functionality.
00:09:39.320
If you're developing a new implementation, it starts without running all of Ruby, so you pick a subset or one of the components of the spec suite that you can start running to evaluate progress. This approach is beneficial as it tests something that will run on multiple Ruby implementations without favoring one over another.
00:10:02.420
Another interesting aspect of the spec suite is that it includes a test runner called MSpec that looks and feels a lot like RSpec. It uses matchers and the `.should` syntax for clarity.
00:10:19.550
Looking particularly at the language specs, this isn't the largest test suite in the world, but it has about 2,100 examples and 3,800 expectations. This makes it a decent proxy for what we can expect when running application-level test suites.
00:10:32.570
When running these on the various Ruby implementations, MRI is again the fastest, completing all tests in about a second and a half. JRuby comes in at 33 seconds, and TruffleRuby lags at 47.5 seconds, which presents a notable challenge for us.
00:10:49.240
While we are making great strides in improving compatibility, many developers running test suites discount us due to this slower performance. You might be asking yourself, 'If TruffleRuby is slow in running these tests and startup time, what do I gain from it?' Our advantage has been in peak performance.
00:11:10.400
To evaluate peak performance, I turn to the Optcarrot benchmark. If you saw Matz's opening keynote, he presented some numbers in the context of MJIT. Optcarrot is a benchmark used by the core Ruby team to evaluate progress on its Ruby 3x3 initiative.
00:11:30.780
It's essentially a Nintendo Entertainment System emulator written in Ruby that runs NES games and presents a score based on the number of frames rendered per second. Ruby 2.0 runs these games at about 20 frames per second.
00:11:47.430
If Ruby 3 achieves its intended 3x speed-up objective, it would mean running at 60 frames per second, which coincidentally matches the frame rate of real NES hardware. In his keynote, Matz indicated that MJIT can run at approximately 2.8 times the speed of MRI 2.0, edging closer to that 3x goal.
00:12:01.580
I ran Optcarrot with these same Ruby implementations, and what I found is that MRI 2.3 runs at about 23 frames per second. JRuby roughly doubles that at 46, while TruffleRuby reaches 197 frames per second, about 8.5 times MRI's rate.
00:12:29.300
In summary, we made this trade-off where our startup time hasn't improved as much as we'd like, but our peak performance is quite impressive. I've been presenting this from the angle of short-lived versus long-lived applications in terms of application profiles.
00:12:49.850
Still, we can also consider it as a development versus production issue. In development, you typically run Pry or IRB or your test suites, where the focus is on fast iteration. In production, you often have long-lived application profiles.
00:13:05.440
Thus, balancing between the two can be somewhat problematic. Let me take a step back here to share my experience with JRuby, which I used before joining the TruffleRuby team. JRuby has similar problems; its startup time isn't as favorable as MRI's, but it has a peak performance advantage over it.
00:13:27.070
With my teams, we decided how to balance these aspects. One option was to always use MRI and optimize for developer time, enabling quicker movement and happiness among the development team.
00:13:48.050
On the other end of the spectrum, you could opt to optimize entirely for peak performance with JRuby in production, which would ultimately deliver more value to customers, while accepting that development time would incur additional costs.
00:14:12.640
Another option is a hybrid model, which I was never able to implement successfully. However, I know of teams that managed to do so by running MRI locally for fast development time while deploying JRuby in production for peak performance.
00:14:38.420
That said, there is inherent risk, as MRI and JRuby have different runtimes, leading to potential bugs from differing behavior in production versus local environments, despite JRuby’s high compatibility with MRI.
00:15:02.800
You could mitigate some of this risk, but it remains a risk nonetheless. Technical hurdles arise when certain things, such as the value of RUBY_ENGINE or other global values, differ depending on the runtime.
00:15:25.550
Having experienced this beforehand, I was particularly interested to see how we could manage it better with TruffleRuby. I believe we can handle it well, and so now I will introduce how we plan to achieve this with a new project known as the Substrate VM.
00:15:49.450
The basic idea is this: since TruffleRuby is implemented in Java and Ruby, it runs on the JVM. However, we can also re-target it to another VM known as the Graal VM. If you run on a stock JVM, you get the AST interpreter, but without our JIT, the runtime won't be as fast.
00:16:12.950
The key point is that we can target both the JVM, providing functional correctness, and the Graal VM, which packages the JVM with Graal to deliver that optimized compiler. The goal is to integrate yet another target: the Substrate VM.
00:16:34.190
The Substrate VM has two core components: an ahead-of-time compiler (also known as the native image generator) and VM services that link into the application. The ahead-of-time compiler compiles a Java application, taking the JDK and any additional jars or libraries your application may rely on, compiling everything directly to native machine code.
00:17:01.550
The program in question, in this case, is the TruffleRuby interpreter. The ahead-of-time compiler essentially treats Java like other languages such as C, C++, or Go. The output is a static native binary; the JVM is completely removed.
00:17:25.430
This binary will then have the Substrate VM linked into it, allowing for garbage collection, hardware abstraction, and other features expected from a virtual machine. To illustrate this further, I have an example of a simple add method in Java.
00:17:44.310
This method takes two integers, a and b, and calls Math.addExact. When you compile this for the JVM using javac, you'll generate Java bytecode, which is typically contained in .class files that are fed into the JVM and interpreted until the hot code paths are optimized.
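The slide itself is not part of the transcript, so the following is a reconstruction of what such a method might look like; the class name `Adder` and the `main` method are assumptions added to make it runnable. `Math.addExact` is a standard JDK method that throws `ArithmeticException` on `int` overflow instead of silently wrapping.

```java
class Adder {
    // Adds two ints, throwing ArithmeticException if the result overflows.
    static int add(int a, int b) {
        return Math.addExact(a, b);
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3)); // prints 5
        try {
            add(Integer.MAX_VALUE, 1);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

Compiling this file with `javac` produces an `Adder.class` file containing bytecode, which is the form the JVM interprets and later JIT-compiles.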
00:18:07.680
On the other hand, the native image generator creates actual machine code output for that method written in Java. What the native image generator does is perform a closed-world static analysis on the application. While it may sound convoluted, it's quite straightforward once broken down.
00:18:30.640
Every Java application has a main method, which acts as the entry point. This differs from Ruby in that Ruby scripts can execute code directly without needing an explicit entry point. Java has static methods, static fields, and static initializers roughly equivalent to Ruby's class variables and methods.
00:19:00.800
The analyzer starts with the main method, looking for all the classes your program actually uses, all the methods used from those classes, and discarding everything else. The Java Development Kit (JDK) is massive, so we wouldn’t want to compile the entire thing, instead discarding unused code.
00:19:24.240
It's a closed-world model, meaning you can't dynamically load classes after this process. Everything your application could use during runtime must be supplied to the native image generator input, and the analysis must be somewhat conservative.
00:19:40.280
For example, if the analysis cannot determine which concrete subtypes are used at a call through an interface or an abstract class, then all candidate classes must be compiled in, which could inadvertently pull in the whole JDK.
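The reachability idea can be sketched as a worklist traversal over a call graph. This is only an illustration of the concept, not how the native image generator is implemented: the call graph here is a hand-written map of hypothetical method names, and the conservative treatment of interface calls is modeled by listing every implementation as a possible callee.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReachabilitySketch {
    // Returns every method reachable from the entry point; anything else can be discarded.
    static Set<String> reachable(Map<String, List<String>> callGraph, String entryPoint) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        worklist.add(entryPoint);
        while (!worklist.isEmpty()) {
            String method = worklist.poll();
            if (!seen.add(method)) {
                continue; // already processed
            }
            List<String> callees = callGraph.getOrDefault(method, Collections.emptyList());
            worklist.addAll(callees);
        }
        return seen;
    }

    // A hypothetical program: main calls Shape.area through an interface.
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("App.main", Arrays.asList("Shape.area"));
        // Conservative step: the concrete type is unknown, so keep every implementation.
        graph.put("Shape.area", Arrays.asList("Circle.area", "Square.area"));
        // Never called from main, so it is dropped along with its callee.
        graph.put("Unused.helper", Arrays.asList("Unused.other"));
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(reachable(exampleGraph(), "App.main"));
    }
}
```

The more implementations an interface has, the more code the conservative step drags in, which is why an imprecise call site can grow the image so quickly.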
00:19:56.220
We monitor this with each push to the TruffleRuby repository to ensure no unintended dependencies are introduced into the static image. If you're interested in more details, Christian Wimmer gave a talk at the JVM Language Summit this year that covers this process in depth.
00:20:12.420
What’s interesting for us on the TruffleRuby team is that we can take advantage of this static analysis. When we load Java classes into the image generator, it executes all the static blocks, which allows computation to be pushed into the ahead-of-time compilation procedure.
00:20:33.490
As I mentioned, we implement a lot of Ruby in Ruby. Core methods are written in Java, and once we've bootstrapped the runtime, we can implement, for instance, all of Enumerable in Ruby. The downside of this structure is that every time the TruffleRuby interpreter starts, we load and parse these files and set up their state.
00:20:58.310
However, these files seldom change, only altering when we issue a new release of TruffleRuby. This results in a lot of duplicated computation whenever we run the interpreter.
00:21:24.370
With the ahead-of-time compiler, we can push this computation into the native image generator and execute it once. When we start the static binary produced by the image generator, all of that state has already been computed.
00:21:41.490
For instance, we will pre-parse core Ruby files, obtaining the AST and storing it in the binary. This means when we launch this binary, we simply retrieve it from memory, circumventing file system operations and AST building.
00:22:09.240
We go a step further by including all encoding tables. Ruby supports 110 different encodings, each with a substantial metadata set, as well as transcoding mappings to convert the encoding on a string through these mappings.
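As a rough sketch of the pattern, a Java static initializer can build a lookup table once, at class initialization. On a regular JVM that work happens at every startup; as described in the talk, the image generator runs static blocks during image building, so the resulting table would already be present when the binary starts. The class name, the three encoding names, and the table contents below are illustrative assumptions, not TruffleRuby's actual data structures.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class PrecomputedTables {
    // Built once in the static initializer below.
    static final Map<String, Integer> ENCODING_INDEX;

    static {
        // Hypothetical subset; a real runtime would register many more encodings.
        String[] names = {"US-ASCII", "UTF-8", "ISO-8859-1"};
        Map<String, Integer> table = new HashMap<>();
        for (int i = 0; i < names.length; i++) {
            table.put(names[i], i);
        }
        ENCODING_INDEX = Collections.unmodifiableMap(table);
    }

    // Returns the table index for a known encoding name, or -1 if it is unknown.
    static int indexOf(String name) {
        return ENCODING_INDEX.getOrDefault(name, -1);
    }

    public static void main(String[] args) {
        System.out.println(indexOf("UTF-8")); // prints 1
    }
}
```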
00:22:36.670
These tables seldom change, which makes them good candidates for precomputation and helps us reduce startup time. We can also pre-construct some common strings, for instance, string literals that are used in multiple places.
00:22:55.880
This allows us to avoid dynamically allocating bytes for strings. Now, let's see what we gain from this change. The objective is to improve our startup time, so let’s revisit the previous results for the Hello, World! benchmark.
00:23:10.770
Recalling our previous timings where TruffleRuby was at 1.7 seconds, we can now observe that on the Substrate VM, it has decreased to a total of 100 milliseconds. While this doesn’t quite match MRI, we're certainly closing the gap.
00:23:48.270
Again, nobody runs 'Hello, World!' in production, so let’s take another look at that test suite. By running on the Substrate VM, we’ve dropped our test suite time from 47.5 seconds to just below 7 seconds.
00:24:04.390
MRI still outpaces us several times over at one and a half seconds, but for a development team, there's a significant difference between waiting close to a minute for a test suite to complete and waiting seven seconds.
00:24:22.590
The crucial question remains: did we sacrifice peak performance in the process? After all, I framed everything around the idea of trading slow startup time for faster steady-state performance.
00:24:35.280
But if we review those Optcarrot performance results running on the Substrate VM, we indeed take a minor performance hit, dropping our frames-per-second from 197 to 169.
00:25:06.570
This figure still makes us more than seven times as fast as MRI. Thus, while we take a reduction of roughly 15%, the trade-off for improved startup time appears worthwhile.
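For anyone who wants to check the arithmetic, the ratios can be recomputed from the frame rates quoted in the talk (23 fps for MRI 2.3, 197 fps for TruffleRuby on Graal, 169 fps on the Substrate VM). The helper names below are hypothetical.

```java
class OptcarrotRatios {
    // How many times faster one frame rate is than another.
    static double speedup(double fps, double baselineFps) {
        return fps / baselineFps;
    }

    // Relative drop from one frame rate to a lower one, as a percentage.
    static double percentDrop(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        double mri = 23.0;        // MRI 2.3, frames per second
        double graal = 197.0;     // TruffleRuby on Graal
        double substrate = 169.0; // TruffleRuby on the Substrate VM
        System.out.printf("Graal vs MRI: %.2fx%n", speedup(graal, mri));
        System.out.printf("Substrate VM vs MRI: %.2fx%n", speedup(substrate, mri));
        System.out.printf("Graal to Substrate VM drop: %.1f%%%n", percentDrop(graal, substrate));
    }
}
```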
00:25:27.380
However, there’s no inherent reason why this performance hit should occur. What happened? Why did we experience a performance dip? To contextualize, Optcarrot heavily relies on interpreting and dispatching instructions dynamically.
00:25:52.660
The mechanism essentially decodes opcodes corresponding to instructions for NES hardware, and in tight loops, it uses opcodes as an index for a dispatch table, employing metaprogramming to manage dynamic dispatch operations.
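The dispatch-table pattern can be illustrated with a toy machine. Optcarrot itself is written in Ruby and uses metaprogramming for the dispatch; the Java sketch below is only an analogue, with a made-up three-instruction machine that has a single accumulator. The opcode is used directly as an index into an array of handlers.

```java
import java.util.function.IntUnaryOperator;

class DispatchTableSketch {
    // Handlers indexed by opcode. The instruction set is hypothetical.
    static final IntUnaryOperator[] DISPATCH = {
        acc -> acc + 1, // opcode 0: increment the accumulator
        acc -> acc * 2, // opcode 1: double the accumulator
        acc -> 0        // opcode 2: clear the accumulator
    };

    // Runs a program (a sequence of opcodes) starting from an accumulator of 0.
    static int run(int[] program) {
        int acc = 0;
        for (int opcode : program) {
            // The opcode selects the handler; the call target changes on every iteration.
            acc = DISPATCH[opcode].applyAsInt(acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        // inc, double, double, inc: ((0 + 1) * 2 * 2) + 1
        System.out.println(run(new int[] {0, 1, 1, 0})); // prints 5
    }
}
```

Because the handler invoked at that one call site keeps changing, it is a hard case for any optimization that relies on a call site seeing the same target repeatedly.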
00:26:15.650
The first issue relates to the small arrays being created: the Substrate VM doesn't yet optimize the creation of small arrays as efficiently as Graal.
00:26:37.460
This means that while Graal can occasionally avoid small array allocations and instead access elements directly, the Substrate VM has yet to refine this ability.
00:26:58.150
The second issue arises from the send calls, which quickly become megamorphic, resulting in the loss of inline caching opportunities for method calls.
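To show what "going megamorphic" means, here is a simplified model of an inline cache at a single call site. It is a sketch under stated assumptions, not Truffle's implementation: the cache limit of 2, the `METHODS` table, and the `CallSite` class are all hypothetical. The site remembers the targets for the receiver types it has seen; once it sees more types than the limit, it gives up caching and does a full lookup on every call.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class InlineCacheSketch {
    static final int CACHE_LIMIT = 2; // hypothetical limit on cached receiver types

    // Stand-in for a full method lookup table, keyed by receiver class.
    static final Map<Class<?>, Function<Object, String>> METHODS = new HashMap<>();

    static {
        METHODS.put(Integer.class, o -> "int:" + o);
        METHODS.put(String.class, o -> "str:" + o);
        METHODS.put(Double.class, o -> "dbl:" + o);
    }

    static final class CallSite {
        private final Map<Class<?>, Function<Object, String>> cache = new LinkedHashMap<>();
        private boolean megamorphic = false;
        int slowLookups = 0; // counts full lookups, for demonstration only

        // Only receiver types registered in METHODS are supported in this sketch.
        String call(Object receiver) {
            Class<?> type = receiver.getClass();
            Function<Object, String> target = megamorphic ? null : cache.get(type);
            if (target == null) {
                slowLookups++;
                target = METHODS.get(type); // full lookup
                if (!megamorphic) {
                    if (cache.size() < CACHE_LIMIT) {
                        cache.put(type, target);
                    } else {
                        // Too many receiver types seen: stop caching at this site.
                        megamorphic = true;
                        cache.clear();
                    }
                }
            }
            return target.apply(receiver);
        }
    }

    public static void main(String[] args) {
        CallSite site = new CallSite();
        site.call(1);
        site.call(2);
        site.call(3);
        System.out.println(site.slowLookups); // prints 1: only the first call missed
        site.call("a");
        site.call(1.5); // third receiver type: the site goes megamorphic
        site.call(1);   // now every call takes the slow path
        System.out.println(site.slowLookups); // prints 4
    }
}
```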
00:27:16.960
TruffleRuby excels in the unique ability to optimize metaprogramming with inline caching, a feature not currently mirrored by other implementations.
00:27:36.840
On the Substrate VM, once those send calls go megamorphic, method invocation is slightly slower than it is on Graal.
00:27:55.510
The Substrate VM team is already aware of these inefficiencies and intends to fix them. In summary, I believe we’ve effectively fixed TruffleRuby's startup time.
00:28:14.500
While we currently don't match the speed of MRI, we are on a promising path, and I’m personally thrilled to announce that the Substrate VM is now publicly available.
00:28:34.290
For those following TruffleRuby, we've often been asked about our startup time, which is a fair criticism. We had been saying that the Substrate VM was on the way to improve startup performance, but it was at times perceived as vaporware.
00:29:04.560
However, it’s now publicly available, and we are actively utilizing it. I think one of the exciting takeaways about the Substrate VM is that it further validates the strategies we’re implementing for building Ruby.
00:29:25.400
Building atop this Truffle AST interpreter framework, we now have free access to debugging and profiling tools.
00:29:46.830
Now, with the Substrate VM, we have an excellent virtual machine that effectively addresses our startup time issues.
00:30:08.030
From the perspective of the TruffleRuby codebase, we don't have to make any substantial modifications to get these benefits, although there are some minor adjustments to work around functionality that depends on reflection, which has limited availability in the Substrate VM.
00:30:25.870
Ultimately, most of the same codebase can target both the Graal VM and Substrate VM with minimal modifications. There’s certainly more work ahead.
00:30:46.920
Currently, the Substrate VM lacks support for compressed pointers, an optimization that the JVM already possesses, although they are working on integrating that feature.
00:31:04.360
When this is resolved, pointers will consume half the memory, which should improve performance for systems running smaller heaps.
00:31:22.560
They are diligently looking into refining array handling, and I have mentioned there are additional optimizations we could focus on regarding memory consumption.
00:31:43.250
In the future, we'll see potential for improved native function overhead management and incorporating the standard library into the image.
00:32:07.140
I believe we can continue making meaningful enhancements to startup times, enabling us to narrow the gap with MRI further or address memory consumption concerns in parallel.
00:32:31.940
I have included some slides that detail how to run the TruffleRuby SVM binary, as well as additional information on our benchmark environment.
00:32:48.490
I have also provided links to talks for those interested in learning more about the involved processes. Here’s a picture of the GraalVM team. The GraalVM team is significant; Oracle has invested considerable resources into these various projects.
00:33:05.620
Many people have contributed to this work, including alumni, interns, and university collaborations. This is a huge undertaking with far more effort than I can manage alone, so I want to acknowledge the efforts of my colleagues on the TruffleRuby team.
00:33:27.690
I’d like to specifically highlight contributors like Chris Seaton, Peter Hu, Malcolm McGregor, Ben Lood, and Brandon Fisch, as well as from the Substrate VM team, specifically David Yablonski, who played an integral role in bringing this presentation together.
00:33:48.220
Here's my contact information. I love discussing this material, so if you're interested in learning more about TruffleRuby, please reach out. If you'd like to try running your application or library with TruffleRuby, I'm also happy to work with you to ensure compatibility.
00:34:02.720
TruffleRuby is completely open-source, so you can also check out the project for yourself. And that’s it. Any questions? We have a few minutes left.
00:34:17.880
One of the questions raised is about Rails support, which is indeed an ongoing topic of discussion. The question pertains particularly to database adapters.
00:34:30.070
TruffleRuby is a new implementation, and database drivers are shipped primarily as native extensions. While a pure Ruby version exists for the Postgres driver, most database drivers contain a native component.
00:34:49.720
The difference in APIs between MRI and JRuby leads to the challenge of integrating database drivers. We've opted to develop compatibility with MRI, thereby bridging any gaps via a pseudo-API tailored for it.
00:35:08.830
As our work has progressed, we've successfully implemented extensions for OpenSSL and JSON, and we are currently working on the Postgres driver.
00:35:25.370
Work on MySQL is making steady headway, and with efforts also under way on SQLite3, running all of Rails is becoming feasible. Active Model, Active Support, and Action Mailer all function well with TruffleRuby.
00:35:49.150
However, Spring may pose challenges. To answer a question: if startup time dominates the application profile, why doesn’t 'Hello, World!' show larger effects than running the test suite?
00:36:08.920
This discrepancy can arise because the test suite forks and execs subprocesses, which multiplies the impact of startup time throughout the run.
00:36:26.610
Also, 'Hello, World!' exits so quickly that there is little opportunity for memory to be allocated and collected. The Substrate VM's garbage collector differs from the JVM's and hasn't been tuned as thoroughly.
00:36:46.390
The next question is about static linking of the JDK into the binary and its implications for Java calls from Ruby. It's true; we give up the Java interop feature because it relies on runtime reflection.
00:37:02.340
Substrate VM has begun to develop limited reflection capability, allowing for some Java classes to be linked and utilized. However, we cannot dynamically load new classes.
00:37:22.280
From a developer standpoint in Java, you may be used to adding a jar and implementing a new interface, which would not be possible in the Substrate environment. Fortunately, our needs from the JDK are predetermined, preventing complications.
00:37:39.680
Oracle hopes to gain valuable insights from this project at Oracle Labs, a research group with a mission to identify and explore promising new technologies. In terms of the overall Graal project, the core integrates various languages.
00:37:54.960
These implementations are crucial for advancing Truffle and Graal technology, keeping it generic rather than overfitted to a single language. With Graal now shipped as part of Java 9, there's evidence of ideas being transferred back into other divisions.
00:38:11.100
In summary, I hope to reflect on key points toward enhancing Ruby’s evolution. Thank you for being a great audience!