RubyKaigi 2017

Improving TruffleRuby’s Startup Time with the SubstrateVM

http://rubykaigi.org/2017/presentations/nirvdrum.html

We’ve solved the startup time problem in TruffleRuby! In this talk, I’ll introduce the SubstrateVM and how we make use of it to compile the Java-based TruffleRuby to a static binary and massively improve our startup time.

00:00:00.179 The next speaker is Mr. Kevin Menard, and the title is "Improving TruffleRuby’s Startup Time with the SubstrateVM." Hi, thanks for coming out! I know we're heading into the tail end of the conference here, so yeah, thanks for coming.
00:00:07.379 My name is Kevin. I work at Oracle, particularly with a team working on a new implementation of Ruby called TruffleRuby. This is part of a larger research group that investigates and works on virtual machine and compilation technology.
00:00:35.250 Before I get started, I do need to inform you that what I'm presenting today is research work out of a research group and should not be construed as a product announcement. Please don't buy any stocks based on what you hear today.
00:01:07.950 So, improving TruffleRuby's startup time with the SubstrateVM. I wish I could pick a shorter title, but it’s descriptive. For those of you who aren't familiar with TruffleRuby, you're probably wondering what TruffleRuby is.
00:01:17.970 As I said, it is an alternative implementation of Ruby. It's relatively new, although it has been around for four years. It was originally called RubyTruffle; then we merged with JRuby, and it became JRuby + Truffle. About a year ago, we spun it back out into its own project, now called TruffleRuby.
00:01:44.460 I like to think of TruffleRuby as kind of a rubyist Ruby. We still pull in a fair bit of code from JRuby, such as their parser, encoding libraries, transcoding tools, and regular expression engine, as well as some utility functions. So TruffleRuby is written in Java, like JRuby, but like Rubinius, we wanted to implement as much of Ruby in Ruby as possible.
00:02:31.780 We've pulled in a lot of the Rubinius core library, so we now actually implement a lot of the core library in Ruby. We also took code from MRI, pulling in the standard library, and recently we've started running MRI C extensions as well.
00:03:00.160 As a quick rundown on completeness, we now run about 96% of the core library specs. Around this time last year, I think we were at 90 or 91 percent, so we've continued to make progress. Some of the missing functionality is things that are hard to do on the JVM. With the language specs, we pass 99% of those, failing only one or two. We've also been actively looking at support for various Ruby gems.
00:03:34.940 ActiveSupport is one of the popular ones, and we've almost achieved 100% compatibility with it. So, TruffleRuby is Ruby implemented with Truffle. A natural follow-up question is, what is Truffle? Truffle is a language toolkit for generating simple and fast runtimes.
00:04:10.780 They are simple because the idea behind Truffle is to build an Abstract Syntax Tree (AST) interpreter. AST interpreters are among the simplest possible runtimes you can build; they are straightforward to reason about and very simple to debug. However, they can be slow since they end up walking the full tree.
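To make the idea of an AST interpreter concrete, here is a minimal sketch in Python. It is not Truffle code: the node classes (Const, Var, Add, Mul) and the execute method are hypothetical names chosen for illustration. Each node evaluates itself by recursively evaluating its children, which is why such interpreters are easy to reason about and also why walking the full tree on every run can be slow.

```python
from dataclasses import dataclass


class Node:
    """Hypothetical base class: every AST node knows how to execute itself."""

    def execute(self, env):
        raise NotImplementedError


@dataclass
class Const(Node):
    value: int

    def execute(self, env):
        return self.value


@dataclass
class Var(Node):
    name: str

    def execute(self, env):
        # Look the variable up in the environment passed down the tree.
        return env[self.name]


@dataclass
class Add(Node):
    left: Node
    right: Node

    def execute(self, env):
        # Evaluating a node means walking into both children first.
        return self.left.execute(env) + self.right.execute(env)


@dataclass
class Mul(Node):
    left: Node
    right: Node

    def execute(self, env):
        return self.left.execute(env) * self.right.execute(env)


# The expression (x + 2) * 3 as a tree of nodes.
tree = Mul(Add(Var("x"), Const(2)), Const(3))
result = tree.execute({"x": 4})
print(result)
```

Every call to execute re-walks the whole tree, so the cost of the interpreter structure is paid on each run unless a compiler removes it.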
00:04:37.000 Truffle aims to address this by using Graal, our optimizing compiler, to take care of performance. Graal is our Just-In-Time (JIT) compiler, and Truffle knows how to communicate with it. It does this using a process called partial evaluation. Tom Stuart gave a great talk on partial evaluation for Ruby, and I recommend searching for it if you're interested. In addition to helping you build the AST interpreter, Truffle aims to provide common VM runtime components, essentially a one-stop shop.
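As a rough illustration of partial evaluation, the sketch below specializes a small expression tree against the inputs that are already known and leaves a smaller residual tree for the rest. This is a toy under stated assumptions: the tuple-based node format and the specialize and evaluate functions are hypothetical, and Graal applies the idea to the interpreter itself rather than to an expression like this.

```python
def evaluate(node, env):
    """Plain interpreter over tuple-shaped nodes (a hypothetical format)."""
    kind = node[0]
    if kind == "const":
        return node[1]
    if kind == "var":
        return env[node[1]]
    left = evaluate(node[1], env)
    right = evaluate(node[2], env)
    return left + right if kind == "add" else left * right


def specialize(node, known):
    """Fold everything that depends only on `known`; return the residual tree."""
    kind = node[0]
    if kind == "const":
        return node
    if kind == "var":
        name = node[1]
        # A known variable becomes a constant; an unknown one stays symbolic.
        return ("const", known[name]) if name in known else node
    left = specialize(node[1], known)
    right = specialize(node[2], known)
    if left[0] == "const" and right[0] == "const":
        value = left[1] + right[1] if kind == "add" else left[1] * right[1]
        return ("const", value)
    return (kind, left, right)


# scale * 10 + x, where `scale` is known ahead of time and `x` is not.
program = ("add", ("mul", ("var", "scale"), ("const", 10)), ("var", "x"))
residual = specialize(program, {"scale": 3})
print(residual)
print(evaluate(residual, {"x": 5}))
```

The residual tree gives the same answer as the original program for any value of x, but the multiplication has already been done before the program runs.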
00:05:41.770 We provide a debugger and a general instrumentation API so you can build a profiler if you'd like. We also offer some JIT controls, allowing you to unroll loops or avoid inlining certain methods. With Truffle, we also have this polyglot feature, allowing every language implemented with Truffle to extend from the same node hierarchy for implementing the ASTs, enabling languages to call in and out of each other.
00:06:22.510 Because they all extend from a base class, we can mix and match nodes from different languages. When it’s time to optimize, we can compile that across languages into a single optimized method. In doing so, we effectively achieve zero overhead in cross-language calls, a first-class feature of Truffle. As for the milestones we've achieved in the last year: as I mentioned, TruffleRuby used to be part of JRuby.
00:07:34.560 Originally, we were JRuby + Truffle, but we are now our own project. We can now run MRI C extensions: just this past week, we got our first SQL query running with the MySQL gem, and we have some C library support implemented. The way we achieved this is through an LLVM bitcode interpreter called Sulong. Chris Seaton, who's on our team, gave a talk on how all of this works at last year’s RubyConf, so if you're interested, I urge you to check it out.
00:08:53.030 The reason we don't have 100% compatibility is that the MRI C extension API is quite vast, and supporting it requires implementing each of the rb_ functions and various macros. I think we’ve proved that the overall approach works, so we just need to fill in these implementations, and we've prioritized the extensions we want to work with first. We now have a Java interop capability, allowing you to call into Java from Ruby, and we've been improving our native calls using one of the native extensions I mentioned, called Truffle Native.
00:10:35.880 We want to minimize overhead in our native calls, as we've been dealing with pointer handling. Nonetheless, we face the core problem regarding startup time. This is something that has emerged as a theme in various talks at this conference. During the application lifecycle, there are three basic phases: startup time, warmup, and steady state.
00:11:50.330 Startup time refers to the time the runtime needs to get into a state where it can start executing code. Warmup is when you start running code for the first time; it tends to be slower than later runs, since JIT compilation begins during this phase. Even if there isn’t a JIT, caches such as the file system cache and CPU cache lines need to be populated. As demonstrated in a talk by Noah a couple of days ago, even MRI has a warmup time. Finally, the steady state is when everything is warmed up and the code is running at its fastest speed.
00:13:41.780 There are two basic application profiles: long-lived applications and short-lived applications. We have been optimizing for long-lived applications because we expect that they will spend substantial time in the steady state, and that the faster generated code will compensate for the slow startup and warmup times.
00:14:46.180 However, Ruby is also used in short-lived applications, which presents some challenges for us. In this case, the startup time isn't actually any longer; rather, it accounts for a larger percentage of the application's runtime. For instance, when executing a simple hello world application, we find that MRI has a notable advantage in startup time, at around 40 milliseconds, while JRuby takes about 1.4 seconds and TruffleRuby approximately 1.7 seconds.
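The point about percentages can be worked through with a small calculation. The startup figures below are the ones quoted in the talk; the two workload lengths (half a second and one hour) are assumptions picked only to contrast a short-lived script with a long-lived server, and the sketch ignores warmup and any difference in steady-state speed.

```python
# Startup times quoted in the talk, in seconds.
startup_seconds = {"MRI": 0.040, "JRuby": 1.4, "TruffleRuby": 1.7}

# Assumed amounts of useful work; these are illustrative, not measured.
SHORT_JOB_SECONDS = 0.5
LONG_JOB_SECONDS = 3600.0


def startup_share(startup, work):
    """Fraction of total wall-clock time spent just starting the runtime."""
    return startup / (startup + work)


for name, startup in startup_seconds.items():
    short = startup_share(startup, SHORT_JOB_SECONDS)
    long_ = startup_share(startup, LONG_JOB_SECONDS)
    print(f"{name}: {short:.0%} of a short job, {long_:.3%} of a long job")
```

The same absolute startup cost dominates the short job and all but disappears in the long one.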
00:15:53.290 Although nobody runs a hello world program in production, it serves to illustrate the challenge. More realistically, I looked at the Ruby spec suite. This suite consists of tests that Ruby implementations run to check their overall compatibility. It's a decent number of tests, with over 3,800 expectations or assertions. MRI is able to execute this test suite in about 1.5 seconds, while JRuby takes around 33 seconds, and TruffleRuby ends up at around 47.4 seconds. This demonstrates our current challenge.
00:16:44.390 Now, concerning peak performance, we believe that TruffleRuby has some significant advantages. I’m using the optcarrot benchmark, which MRI is currently using to validate its Ruby 3x3 performance targets. It’s a Nintendo Entertainment System emulator written in Ruby, and NES hardware runs at 60 frames per second.
00:17:26.219 MRI 2.0 runs around 20 frames per second. If Ruby 3 achieves its goal, it will run as fast as NES hardware. Presently, we see that MRI runs around 23 frames per second, JRuby nearly double that at approximately 46 frames per second, while TruffleRuby stands at around 197 frames per second. That's about eight and a half times faster than MRI.
00:18:13.370 This causes a dilemma. While you want to optimize for development, you also want to optimize for production. Looking at JRuby, it has been the mature alternative to MRI for quite some time now, and users often face tough trade-offs. Many choose to use MRI for everything, sacrificing peak performance to take advantage of faster startup times, enabling quick Rails console launches and faster-running test suites.
00:19:45.660 Alternatively, one could opt for the more performance-oriented Ruby implementation and accept the slower startup time, or pursue a hybrid approach, with which some companies have succeeded. However, this carries risks, as running two different runtimes against one application can lead to complex issues, especially with native extensions.
00:20:46.890 I'm hopeful that we can do better, so let me introduce a new project called the Substrate VM. TruffleRuby can already target two distinct virtual machines. It runs on the JVM as an AST interpreter, achieving full functionality but sacrificing peak performance.
00:22:34.200 We have a separate distribution known as GraalVM, which packages the JVM along with the Graal compiler. This is where you actually see peak performance. GraalVM also includes JavaScript, R, and Sulong, offering a complete Truffle development environment. When Java 9 launches, you should be able to run TruffleRuby with Graal on a standard JVM.
00:23:28.220 Now, by utilizing the same TruffleRuby code base, we can change the virtual machine it operates on, generally requiring no alterations to the application. The Substrate VM includes two essential components: an ahead-of-time compiler, also called the native image generator, and the ability to streamline the application’s deployment.
00:24:56.890 You feed in your application (again, in this case, TruffleRuby) along with the Java Development Kit (JDK) and any necessary JAR files, which are Java's equivalent of gems. The ahead-of-time compiler then treats Java much like C or C++. The output of this process is a static native binary that incorporates the Substrate VM, which provides the VM components you still need, such as garbage collection and hardware abstraction.
00:26:42.690 To illustrate, when compiling a Java method, you normally use a tool called javac to generate Java bytecode. The JVM interprets this bytecode, and if it sees that the method becomes hot, it will compile it into machine code. Contrast this with the ahead-of-time compiler of the Substrate VM, which produces real machine code with no JVM required.
00:27:55.500 The compiler performs static analysis on your application under a closed-world assumption, where it can only work with the classes provided to it. It starts from the main class and determines all reachable types, discarding everything else. This ensures that only the parts of the JDK that you actually use are compiled into the binary, not the entire JDK. If the compiler can't determine a concrete type because of Java's interfaces or class hierarchies, it must compile in every implementation it has encountered.
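The closed-world analysis described here is, at its core, a reachability search from an entry point. The sketch below shows that idea over a made-up call graph; the method names and the CALLS table are hypothetical, and the real analysis works on types, fields, and bytecode rather than a hand-written dictionary. A call through an interface is modeled by listing every known implementation as a possible target.

```python
from collections import deque

# Hypothetical call graph: each method maps to the methods it may call.
# "Node.execute" stands in for an interface call, so it points at every
# implementation the analysis has seen.
CALLS = {
    "Main.main": ["Parser.parse", "Interpreter.run"],
    "Parser.parse": ["Lexer.next"],
    "Lexer.next": [],
    "Interpreter.run": ["Node.execute"],
    "Node.execute": ["AddNode.execute", "ConstNode.execute"],
    "AddNode.execute": [],
    "ConstNode.execute": [],
    "Swing.showWindow": ["Awt.paint"],  # never called from Main.main
    "Awt.paint": [],
}


def reachable(entry, calls):
    """Breadth-first search for everything reachable from the entry point."""
    seen = {entry}
    worklist = deque([entry])
    while worklist:
        method = worklist.popleft()
        for callee in calls.get(method, []):
            if callee not in seen:
                seen.add(callee)
                worklist.append(callee)
    return seen


kept = reachable("Main.main", CALLS)
dropped = set(CALLS) - kept
print("compiled into the binary:", sorted(kept))
print("discarded:", sorted(dropped))
```

Anything the search never reaches is left out of the binary, while the interface call keeps both implementations because the analysis cannot tell which one will run.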
00:29:52.630 This also brings opportunities: since we load all classes during compilation, we can execute any static field assignments or static initializer blocks ahead of time. This lets us pre-build the AST and store it in a static field, so the application can start immediately without parsing Ruby files.
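As a loose analogy for doing the parsing work at build time, the sketch below compiles some source once, serializes the result, and then starts up from the serialized form without touching the parser again. It uses Python's compile and marshal, which is an assumption made purely for illustration: the Substrate VM snapshots Java objects into the native image, and the names build_image, start_from_image, and CORE_LIBRARY_SOURCE are hypothetical.

```python
import marshal

# Stand-in for core library source that would normally be parsed at startup.
CORE_LIBRARY_SOURCE = "def greet(name):\n    return 'hello, ' + name\n"


def build_image(source):
    """Build step: parse and compile once, then serialize the result."""
    code = compile(source, "<core>", "exec")
    return marshal.dumps(code)


def start_from_image(image):
    """Startup step: load the pre-built code; no parsing happens here."""
    namespace = {}
    exec(marshal.loads(image), namespace)
    return namespace


image = build_image(CORE_LIBRARY_SOURCE)  # done once, ahead of time
runtime = start_from_image(image)         # done on every launch
print(runtime["greet"]("RubyKaigi"))
```

The expensive step moves out of every launch and into a single build, which is the same trade the talk describes for the pre-built AST.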
00:31:23.590 However, the Substrate VM imposes some limitations. As a language implementer, you can't dynamically load Java classes, and you must handle the Unsafe API cautiously; as its name suggests, it is risky. Additionally, the way offsets are calculated on the Substrate VM is different from that on the JVM.
00:32:22.540 Currently, the Substrate VM does not optimize the creation of small arrays, unlike Graal on the JVM, which can unroll copies and inline more effectively. Method calls are also slower on the Substrate VM, but that is something we are addressing.
00:33:45.889 So the Substrate VM does incur a slight performance hit, but we still run optcarrot at 169 frames per second, which is well ahead of MRI. As for startup, we reduced TruffleRuby's startup time from 1.7 seconds to 100 milliseconds for simple applications.
00:35:20.541 With the language specs, we see a drop from 47 seconds to just under 20 seconds. There’s potential to reduce these times much further. The idea revolves around pre-compilation of the core library.
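The before-and-after figures can be put side by side with a little arithmetic. All numbers below come from the talk; "just under 20 seconds" is taken as 20.0, so the spec suite speedup shown is a lower bound, and the dictionary keys are labels made up for this sketch.

```python
# (before on the JVM, after on the Substrate VM), in seconds.
timings = {
    "hello world startup": (1.7, 0.100),
    "language specs": (47.0, 20.0),  # "just under 20", so 20.0 is an upper bound
}

# optcarrot frames per second: GraalVM versus the Substrate VM.
FPS_GRAALVM = 197
FPS_SUBSTRATE = 169


def speedup(before, after):
    """How many times faster the 'after' measurement is."""
    return before / after


for name, (before, after) in timings.items():
    print(f"{name}: {speedup(before, after):.2f}x faster")

fps_retained = FPS_SUBSTRATE / FPS_GRAALVM
print(f"peak performance retained: {fps_retained:.0%}")
```

Read together, the numbers show a large startup win bought with a comparatively small loss in peak throughput.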
00:36:52.220 Finally, we’re actively working on improving the native calls built into the Substrate VM. We can make native calls without the JVM as an intermediary, which allows us to communicate with libraries directly and should bring performance gains.
00:38:01.600 We still need to work on reducing overall memory usage, as some aspects contribute additional overhead. The Substrate VM can reduce memory usage, but it also has a relatively large binary size because it includes compiler graphs.
00:39:09.100 The growth could be significant when considering multiple applications and overall scaling. I’m planning to release benchmarks that plot the efficacy of Majors to gauge memory usage during execution.
00:40:02.420 The core takeaway is that while we address startup time improvements, future work will include implementing compressed objects to further enhance execution speed.
00:41:02.580 I appreciate your attention, and I'm open to questions.
00:41:45.410 Is there any build time increase when building on the Substrate VM?
00:42:03.820 This is similar to compiling MRI from source, as you’re building the interpreter that runs your applications, only needing a new build when releases occur.
00:42:51.060 Currently, we’re heavily utilizing JRuby implementation success while integrating various extensions to ensure functionality.
00:43:20.640 Community aids are welcome as we iterate improvements made to the compiler and runtime extensions. While driven by progress, we see continued collaboration fostering an efficient implementation.
00:44:23.690 Follow-up questions about integration and extension compatibility will be addressed in line with our overall optimization goals over the course of the project.
00:45:12.170 We hope to compile further community efforts to optimize native extensions and build tighter application integrations.
00:46:04.330 Thank you for your participation and willingness to support us through various phases of development.
00:46:29.890 If there are more inquiries or follow-up conversations, feel free to ask.