
Upcoming Improvements to JRuby 9000

http://rubykaigi.org/2015/presentations/headius_enebo

JRuby 9000 is here! After years of work, JRuby now supports Ruby 2.2 and ships with a redesigned optimizing runtime. The first update release improved performance and compatibility, and we've only just begun. In this talk we'll show you where we stand today and cover upcoming work that will keep JRuby moving forward: profiling, inlining, unboxing...oh my!

RubyKaigi 2015

00:00:01.239 Alright, thanks for coming today. I'm Thomas, and this is Charles 'Tender Love'.
00:00:10.610 I tweeted this earlier today. It was originally Ben and Jerry of Ben and Jerry's ice cream. For those few people who don't know what JRuby is, I'll talk about it for just a short time.
00:00:20.690 First of all, we are a Ruby implementation that tries to be as compatible with CRuby as we can, but our differences are a little more interesting.
00:00:32.660 Most of those differences come because we're built on top of the Java platform. We have multiple garbage collectors that you can tune to make your application perform better.
00:00:44.059 It runs on top of HotSpot, so we have this best-of-both-worlds type of native JIT that employs dynamic profiled optimizations, which makes JRuby run fast. There's lots of tooling, and probably the most important difference is that we have no Global Interpreter Lock (GIL).
00:01:02.239 This means you can use native threads and fully utilize all cores from one Ruby process. Additionally, Java runs absolutely everywhere, making it really easy to pull in Java libraries. Any Java library you integrate can be accessed as if it were a Ruby type, and it feels like you're programming in Ruby.
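As a rough illustration of the point about native threads, the sketch below splits a sum across several threads. The method name parallel_sum and the worker count are made up for this example. The code runs on any Ruby implementation, but only one without a GIL, such as JRuby, can execute the slices on several cores at the same time.

```ruby
# Hypothetical helper: sum an array by handing each thread one slice.
def parallel_sum(numbers, workers)
  return 0 if numbers.empty?

  slice_size = (numbers.size.to_f / workers).ceil
  threads = numbers.each_slice(slice_size).map do |slice|
    Thread.new { slice.sum }      # each slice is summed in its own thread
  end
  threads.sum(&:value)            # Thread#value waits for the thread and returns its result
end

puts parallel_sum((1..1_000).to_a, 4)
```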
00:01:22.670 Let's say you have a Ruby gem and there's some problem that the Ruby gem isn't solving; you can find the Java equivalent version of that gem, pull it in, and potentially solve your problem. As I mentioned, you can easily call into the Java language, but you can also call into any other JVM language that's out there, such as Clojure and Scala.
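A minimal sketch of what calling into a Java library can look like, assuming the code runs on JRuby. The helper name java_list_roundtrip is invented for this example, and on other Ruby implementations it simply returns nil.

```ruby
# Hypothetical helper: pushes strings through java.util.ArrayList and back.
# Returns nil when not running on JRuby.
def java_list_roundtrip(strings)
  return nil unless RUBY_ENGINE == "jruby"

  require "java"
  list = java.util.ArrayList.new     # a Java class, instantiated like a Ruby class
  strings.each { |s| list.add(s) }   # an ordinary Java method call
  list.map { |s| s.to_s }            # Java collections can be iterated like Ruby ones
end

p java_list_roundtrip(%w[a b c])
```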
00:01:41.719 So, there are two supported versions of JRuby today. There's JRuby 9000, which we're going to be talking about today, tracking the latest version of CRuby, which right now is 2.2. At the end of the talk, we will mention our plans for supporting Ruby 2.3.
00:02:03.409 We have a maintenance release of JRuby 1.7, which supports either Ruby 1.8.7 or 1.9.3, and you can pass in a command-line flag to specify which version you want to run. These are the big bullet points for our latest major release, JRuby 9000.
00:02:30.440 As I said, we're tracking the latest release of CRuby. We have a brand new runtime called Internal Representation, which we've been working on for years. I'll be discussing that later in the talk.
00:02:51.680 Most of the I/O in JRuby now bypasses the JVM and goes out to C, so we can be more compatible with CRuby. Additionally, Oniguruma has been fully ported, so the transcoding engine is entirely ported over, and all the data tables are now bug-free.
00:03:11.820 You might be curious about our version number – it started as a joke and eventually became a reality. True story.
00:03:29.430 But the only thing anyone ever cares about at conferences is performance, so we're going to focus on that. Before we dive into performance, I will give a quick overview of our new runtime, Internal Representation.
00:03:45.720 We were tired of working with an Abstract Syntax Tree (AST). We wanted something that directly represented Ruby semantics. We also aimed to create a traditional compiler design so that those who took compiler courses at university would recognize the same nouns and verbs in our work.
00:04:06.360 In JRuby 1.7 and earlier, we would parse the source and create an AST. Now, in JRuby 9000, we have these additional compiler phases. During semantic analysis, we translate that syntax tree into a set of virtual machine instructions and create other data structures like a control flow graph.
00:04:30.690 After that, we go into an optimization phase where we run a set of pluggable compiler passes. We may then create further data structures, like a data flow graph. Finally, we enter the interpreter phase where we interpret the machine instructions we've created.
00:04:55.290 Once the code has been running for a while, we decide it's time to generate Java bytecode. The JVM takes that bytecode and optimizes it for performance.
00:05:08.040 To illustrate this, here's our first look at instructions. On the left is the Ruby code, and on the right are the instructions. For example, on line 0 we check the arity to ensure there are two required arguments; otherwise, we raise an exception.
00:05:21.840 On lines 1 and 2, we bind the parameters 'a' and 'b' to the zeroth and first required arguments. We have special variables, and on line 3 we receive any block that’s passed into the method. Also, on line 5, we have a line number instruction, so if we raise an exception, we can pinpoint the line number where the error occurred.
00:05:54.960 On line 8, we are calling the '+' method on the receiver 'a' and giving it the argument 'c' to compute 'a + c'. This is a register-based design as opposed to a stack-based one, which makes it easier to read.
00:06:15.950 During optimizations, we execute a series of compiler passes, which I will demonstrate with a simple session showing how we reduce work to make JRuby faster.
00:06:28.430 Initially, we notice that 'b' is not accessed, and we don’t perform any actions with the closures. Therefore, through dead code elimination, we can completely remove those instructions.
00:06:42.780 Next, we observe that 'c' is assigned to a constant value, and we only read that value, allowing us to propagate that constant. This enables us to eliminate the instruction.
00:07:03.610 Moreover, we notice that there are two line number instructions consecutively. Since there's no chance of raising an exception between those two instructions, we can eliminate one of them.
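The three passes just described can be sketched on a toy instruction list. This is only an illustration: the instruction names, the Instr struct, and the sample program are assumptions made for this example and are not JRuby's actual IR. Symbols in args stand for variable reads; anything else is a literal.

```ruby
# Toy instruction format (illustrative names, not JRuby's real IR).
Instr = Struct.new(:op, :dest, :args)

# Roughly: def foo(a, b); c = 1; a + c; end
PROGRAM = [
  Instr.new(:check_arity, nil,  [2]),
  Instr.new(:recv_arg,    :a,   [0]),
  Instr.new(:recv_arg,    :b,   [1]),
  Instr.new(:recv_block,  :blk, []),
  Instr.new(:line_num,    nil,  [1]),
  Instr.new(:copy,        :c,   [1]),
  Instr.new(:line_num,    nil,  [2]),
  Instr.new(:call_plus,   :t0,  [:a, :c]),
  Instr.new(:return,      nil,  [:t0])
].freeze

# Ops with no side effects; safe to drop when their result is never read.
PURE_OPS = %i[recv_arg recv_block copy].freeze

def eliminate_dead_code(instrs)
  used = instrs.flat_map(&:args).grep(Symbol)
  instrs.reject { |i| PURE_OPS.include?(i.op) && !used.include?(i.dest) }
end

# Assumes each variable is assigned at most once in this toy program.
def propagate_constants(instrs)
  constants = {}
  instrs.each do |i|
    constants[i.dest] = i.args.first if i.op == :copy && !i.args.first.is_a?(Symbol)
  end
  instrs.reject { |i| i.op == :copy && constants.key?(i.dest) }.map do |i|
    Instr.new(i.op, i.dest, i.args.map { |arg| constants.fetch(arg, arg) })
  end
end

# Of two adjacent line-number instructions, only the later one matters.
def drop_redundant_line_numbers(instrs)
  instrs.reject.with_index do |instr, idx|
    instr.op == :line_num && instrs[idx + 1]&.op == :line_num
  end
end

def optimize(instrs)
  drop_redundant_line_numbers(propagate_constants(eliminate_dead_code(instrs)))
end

optimize(PROGRAM).each { |i| puts i.to_a.inspect }
```

Running the passes in this order matters in the sketch: the two line-number instructions only become adjacent once constant propagation has removed the copy between them.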
00:07:22.930 Now I will discuss some actual shipping optimizations that we've implemented in the current release, JRuby 9000.
00:07:36.940 First, we always wanted to compile blocks as well as methods. However, in JRuby 1.7, only method bodies would get compiled. A block would be converted to JVM bytecode and run faster only if it was within a method and that method got compiled.
00:07:57.789 It is much easier for us to compile blocks independently now. Here are some performance comparisons between MRI and JRuby 9001, running a regular method and a define_method one. As expected, performance isn't where we wanted it to be because the benchmark itself had blocks.
00:08:18.539 Although that benchmark is still interpreted, with the change to compile blocks, compiled procs, and lambdas, JRuby 9004's performance is more aligned with our expectations. Methods are now running as fast as they should.
00:08:30.330 The benchmark shows that it's much faster than running in CRuby, but some of these issues remain. How many people here use define_method in their code? Most Rails users do.
00:08:43.919 So, this performance difference is something we aimed to fix. There are two different cases of define_method: the first is a simple define_method that doesn't rely on state around its block; it only needs the state that's inside. The second case is what we call a capturing block, as it uses some state from outside of the define_method.
00:09:09.010 In this case, it sends a modified method name based on a variable. Both cases introduce overhead compared to a regular method.
00:09:19.230 Here's a comparison highlighting that this affects both MRI and JRuby, with a regular method definition compared to both forms of define_method. As expected, MRI performs well on regular methods; however, both define_method forms show severe performance hits.
00:09:38.230 To address these inefficiencies, if we see that a define_method does not capture any state, as in the simple form, we compile it as a method. Essentially, we treat it as a normal method body, avoiding all the overhead associated with blocks.
00:09:56.889 We also want to extend this to capturing cases. Typically, when a block captures state from outside, that state is a constant value, so we should be able to propagate it. This is future work that we're planning.
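To make the two cases concrete, here is a small sketch of a regular method, a simple define_method, and a capturing define_method. The class and method names are invented for this example; only define_method and send are real Ruby API.

```ruby
class Account
  # Regular method definition, the baseline.
  def balance_plain
    100
  end

  # Simple form: the block uses nothing from the surrounding scope,
  # so it could be treated like an ordinary method body.
  define_method(:balance_simple) { 100 }

  # Capturing form: each block closes over the local variable `name`
  # and sends a method name derived from it.
  %w[plain simple].each do |name|
    define_method("checked_#{name}") { send("balance_#{name}") }
  end
end

account = Account.new
puts account.balance_simple
puts account.checked_plain
```

In the capturing form, `name` never changes after the method is defined, which is the kind of constant captured state the speakers say they hope to propagate.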
00:10:15.020 As of now, in JRuby 9004, the performance of the simple define_method is nearing that of a regular method definition, which is excellent. Historically, exceptions have been an expensive operation for JRuby.
00:10:35.740 This is primarily due to the JVM, where generating an exception backtrace demands a significant amount of resources. Unfortunately, exceptions are often used as flow control, leading to many thrown and ignored exceptions.
00:10:48.710 For instance, we encountered a bug report for CSV.rb, which exhibited very poor performance on JRuby—even up to one hundred times slower than on CRuby. This was because enabling value converters would trigger a cascading series of blocks.
00:11:07.780 In the worst case, every value parsed by the CSV library would generate multiple exceptions, complete with backtraces, which severely impacted performance. We needed to address this issue, as JRuby was significantly slower at generating backtraces.
00:11:26.370 Here is the new state of affairs. On the left, you can see JRuby performance before we optimized exception handling, and on the right is the performance after.
00:11:40.880 To achieve this, we analyze the code and inspect our internal representation (IR). If we determine that an exception is ignored, we flag it and avoid generating a backtrace during that throw. As a result, it is now considerably cheaper to handle exceptions.
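The pattern in question, exceptions used purely as flow control, looks roughly like the sketch below. It is a simplified stand-in for a chain of value converters, not the csv library's actual code, and the method name convert is made up. Note that neither rescue clause binds or inspects the exception, so its backtrace is never read.

```ruby
# Hypothetical converter chain: try Integer, then Float, else keep the string.
def convert(field)
  Integer(field)
rescue ArgumentError          # exception is only used to branch, never inspected
  begin
    Float(field)
  rescue ArgumentError
    field
  end
end

p %w[42 3.5 hello].map { |f| convert(f) }   # => [42, 3.5, "hello"]
```

A plain string like "hello" raises twice before falling through, which is how one parsed value can cost several exceptions.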
00:12:03.510 This update has restored performance on libraries like CSV.rb to where it should be.
00:12:14.170 Now, let's discuss some upcoming work that we're currently developing and hope to see in production shortly.
00:12:28.470 The first area is method inlining. For those unfamiliar, inlining a method means that when you make a call to a method that is called often, we can pull the body of that method right back to where the call was made.
00:12:51.340 By doing this, we eliminate the need to set up a stack frame or allocate extra memory, and we avoid the setup of parameters for method calls. Many compilers and dynamic runtimes use method inlining to achieve better performance.
00:13:06.790 For instance, let's say we have some Ruby code where every iteration in a loop calls a method. This is a prime candidate for inlining. In the inlined version, we can simplify the loop without the overhead of making calls.
00:13:23.220 In the inlined version, we must check that nothing has changed, for example that the method has not been redefined. If it has, we must fall back to traditional method dispatch. Thankfully, the failure case is already accounted for as the 'else' branch.
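A hand-written picture of the idea, under stated assumptions: the class Calc, its methods, and the guard are all invented for this sketch. A real runtime would use a much cheaper check than comparing method objects on every iteration; the comparison here only makes the guard and the 'else' fallback visible.

```ruby
class Calc
  def add_one(x)
    x + 1
  end

  # Before: every iteration performs a full method call.
  def sum_with_calls(n)
    total = 0
    n.times { total = add_one(total) }
    total
  end

  # Snapshot of the definition that was "inlined" below.
  ORIGINAL_ADD_ONE = instance_method(:add_one)

  # After (conceptually): the body of add_one is pasted into the loop,
  # protected by a guard that detects redefinition.
  def sum_inlined(n)
    total = 0
    n.times do
      if self.class.instance_method(:add_one) == ORIGINAL_ADD_ONE
        total = total + 1          # inlined body of add_one
      else
        total = add_one(total)     # fall back to normal dispatch
      end
    end
    total
  end
end

calc = Calc.new
puts calc.sum_with_calls(10)
puts calc.sum_inlined(10)
```

If add_one is later redefined, the guard fails and sum_inlined takes the fallback branch, so both methods keep returning the same answer.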
00:13:39.080 At present, this functionality works in the JIT since we implemented it recently. After establishing the loop and when inlining works, we want to focus on specialization—specifically, numeric specialization.
00:14:01.990 In Ruby, everything is an object, but in CRuby, numbers are a special type referred to as tagged pointers. Floats and fixnums typically do not have the overhead of an object. However, with the JVM, we must choose between a lightweight number or a reference when compiling.
00:14:24.890 To achieve strong performance for numeric algorithms in JRuby, we need to specialize fixnums, floats, and other types into their primitive Java counterparts.
00:14:48.430 By optimizing these numeric types to use long values directly, we escape the overhead of objects and significantly enhance performance. Prototypes of this specialization have yielded impressive results, making them considerably faster than traditional fixnum code.
00:15:05.920 We're nearing the point of achieving Java-level speeds for these operations, and we're looking forward to releasing this specialization feature with our next JRuby 9000 release.
00:15:20.730 In addition to the above optimizations, we require a robust profiling mechanism. For each call site, we keep track of which methods are invoked and how frequently.
00:15:41.540 To clarify, a call site is literally where a call occurs within the code. We focus on monomorphic call sites, which are instances where only one method is invoked repeatedly. These are ideal for optimization.
00:16:06.500 Once we identify a number of hot monomorphic call sites, they become candidates for optimizations, such as numeric specialization or method inlining.
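A toy version of per-call-site profiling might look like the following. The class CallSiteProfile, its methods, and the threshold are assumptions for illustration and say nothing about how JRuby's profiler is actually built.

```ruby
# Hypothetical profile for a single call site.
class CallSiteProfile
  def initialize
    @counts = Hash.new(0)   # [owner class, method name] => number of calls
  end

  # Record which concrete method a call at this site resolved to.
  def record(receiver, method_name)
    target = receiver.method(method_name)
    @counts[[target.owner, target.name]] += 1
  end

  # Monomorphic: only one target method has ever been seen here.
  def monomorphic?
    @counts.size == 1
  end

  # Hot: the site has been executed at least `threshold` times.
  def hot?(threshold)
    @counts.values.sum >= threshold
  end
end

site = CallSiteProfile.new
1_000.times { |i| site.record(i, :to_s) }
puts site.monomorphic? && site.hot?(500)   # candidate for inlining

site.record("text", :to_s)                 # a second target appears
puts site.monomorphic?
```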
00:16:23.570 Currently, we have a functional profiler in place. It only became operational on Friday afternoon, and its overhead is just around 2%.
00:16:44.440 The profiler presently only focuses on method inlining, but we plan to include numeric specialization once we resolve the remaining issues.
00:17:01.870 Let me demonstrate a benchmark representing a very common pattern in code. In the big loop method, we consistently call the small loop method, passing in a block, which is a typical Ruby pattern.
00:17:15.840 The small loop method essentially calls 'yield' several times. You could think of it as a 'reduce' method or any enumerable in Ruby. This test evaluates how well we optimize the inlining of blocks.
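The benchmark itself is not reproduced in the transcript, so the following is only a guess at its shape: method names follow the description, while the loop counts and the block body are assumptions.

```ruby
# Yields a fixed number of times, like a tiny Enumerable-style method.
def small_loop
  k = 0
  while k < 10
    yield k
    k += 1
  end
end

# Calls small_loop over and over, passing a block each time.
def big_loop(iterations)
  sum = 0
  i = 0
  while i < iterations
    small_loop { |k| sum += k }   # the block a runtime would like to inline
    i += 1
  end
  sum
end

puts big_loop(100_000)
```

Each pass through big_loop pays for one method call plus ten block invocations, which is why inlining the block back into the call site matters so much for this pattern.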
00:17:31.170 I want to clarify: this isn't an indication that we're now running four times faster across the board. We're targeting one specific test where, historically, we've struggled with inlining.
00:17:48.559 Due to our inability to inline blocks back to the call site, JRuby historically invoked blocks slower than CRuby.
00:18:01.280 However, we've begun to mitigate this timing overhead, and we're progressing towards getting faster.
00:18:17.410 The last section I want to touch on is the collaboration between JRuby and the OpenJDK team. We work closely together, sharing benchmarks and optimizing the JVM based on our findings.
00:18:41.610 We are excited about the addition of a native foreign function interface (FFI) to the JVM via Project Panama, which bridges the Java and native worlds.
00:19:01.840 By developing this native FFI within OpenJDK, we'll be able to write Ruby code calling C libraries more seamlessly. For developers, this means reduced complexity and improved performance.
00:19:29.550 In the past, interfacing between Ruby and native libraries involved numerous layers, such as JNI and jnr, which added considerable complexity for developers.
00:19:56.400 With Project Panama, the FFI will be built directly within the JVM, allowing Ruby or Java code to call out to native functions with a simpler interface. This change will significantly benefit both Ruby and Java developers.
00:20:17.740 The second major enhancement we are focusing on is startup time, which remains one of the most significant challenges for JRuby. In comparison to CRuby, JRuby experiences a notable startup delay.
00:20:42.240 This is largely due to all of JRuby's components starting cold—the parser, interpreter, and core classes, which dramatically slows down performance on initial boot.
00:21:05.770 To alleviate this, we've implemented a development flag that disables JRuby optimizations that are unnecessary during development, improving startup time at the cost of peak performance.
00:21:22.260 For example, in tests involving Rails apps that boot the entire framework, JRuby's default startup can significantly lag behind that of CRuby.
00:21:43.100 This severe lag in startup time often leads to hesitation in using JRuby, which is why it is essential to leverage the development flag to speed this process.
00:22:05.840 In addition, the OpenJDK team is actively developing an ahead-of-time (AOT) compiler, which precompiles JRuby bytecode to native code to expedite initialization.
00:22:28.890 This approach aims to ensure optimal performance from the start while also facilitating continual optimization over time.
00:22:49.740 We see this as a means to significantly reduce cold startup delays and enhance the overall JRuby experience.
00:23:12.090 The work for this AOT compiler is still in prototype stages, but we anticipate engaging closely with Oracle to perfect its integration into JRuby.
00:23:36.970 Furthermore, many more optimizations are on the table for JRuby, as we're collaborating closely with JVM developers to improve integration and smooth out issues.
00:23:55.250 I want to extend my gratitude to the numerous companies and individuals who actively support JRuby. Your engagement is what drives us to improve.
00:24:16.470 If you are currently using MRI without any issues, that's great! We encourage you to consider JRuby when you need better performance or native threading capabilities.
00:24:55.540 I'd also like to mention that we have stickers available on the table, so feel free to take some. Now, I would like to discuss the work we're doing for Ruby 2.3.
00:25:17.750 We have a tracking issue on GitHub for our ongoing work, which includes the missing features. The community's contributions have been invaluable. Charlie has reached out to Ruby contributors for support.
00:25:43.840 We've seen multiple pull requests daily targeting these missing features. By the end of January, we hope to release version 9.1 or 9.1.0, which will support Ruby 2.3.
00:26:05.170 I believe we now have time for questions.
00:26:50.720 Thank you, JRuby team! We indeed have time for questions.
00:27:04.860 One question regarding your Intermediate Representation (IR): is it SSA? It is not. The actual temporary variables and local variables are in that form, but the IR in general is not.
00:27:23.050 We contemplated this. The engineer who worked with us on IR had extensive experience optimizing the JVM JIT during his Ph.D. work, and so far we haven't found a compelling reason to adopt SSA fully.
00:27:44.440 A simple question many here would relate to: if I want to get started with JRuby tomorrow using the dev flag, what is the straightforward process?
00:28:00.520 You can begin by installing JRuby through RVM and then navigate to your Rails project. Generally you just execute 'bundle exec rails s'; how would you apply the dev flag?
00:28:12.540 All JRuby command-line options, including the dev flag, can be set using the environment variable JRUBY_OPTS. This is typically the easiest approach. Remember to unset it when you start benchmarking.
00:28:36.830 It is a common mistake for developers to add the dev flag but forget to unset it, inadvertently losing optimizations. Is there another question?
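One way to guard against that mistake is a small check at the top of a benchmark script. This is a sketch: JRUBY_OPTS and the --dev flag are real, but the helper name dev_mode_enabled? and the warning text are made up for this example.

```ruby
# Hypothetical helper: true when JRUBY_OPTS contains the --dev flag.
# Accepts any hash-like object so it can be tried without touching ENV.
def dev_mode_enabled?(env = ENV)
  env.fetch("JRUBY_OPTS", "").split.include?("--dev")
end

if dev_mode_enabled?
  warn "JRUBY_OPTS contains --dev; unset it before benchmarking."
end
```

For development, the variable would be set in the shell before running the usual command, for example with JRUBY_OPTS set to --dev ahead of 'bundle exec rails s'.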
00:28:51.479 Regarding a potential workflow of using MRI for development while deploying on JRuby in production, do you recommend this approach?
00:29:06.720 This is quite common, and many companies pursue this strategy. However, be cautious with any C extensions that may sneak into your development side.
00:29:20.930 Ensure you run regular tests against JRuby to catch any issues with compatibility, although such bugs have decreased significantly.
00:29:35.630 Is it possible to change the dev flag dynamically? I have a command in my script that runs multiple commands, and one command requires it to run quickly, while another necessitates the JIT.
00:29:57.130 You can turn off JRuby's JIT and optimizations dynamically after the application starts. However, the part of the dev flag that turns down the JVM's own optimizations cannot be changed dynamically.
00:30:18.080 Unfortunately, turning off those JVM optimizations is one of the key advantages of the dev flag, and leaving them off could significantly impact performance.
00:30:32.009 Alright, I believe our time is up. Thank you, everyone!