It's dangerous to GC alone. Take this!

00:00:04.220 Hey everybody! Thank you so much for coming out to our talk. We're really excited to be here today. This trip has been full of firsts for us—it's our first time in Japan and in Tokyo, and this has been an amazing experience.

00:00:18.779 Today, we want to share our work with garbage collection and the OMR (Open Multi-language Runtime) framework. So, I'm Robert Young, and this incredibly handsome man beside me is Craig Lehmann. We're both software developers from Ottawa, Canada, and we are members of the IBM Runtime Technologies organization.

00:00:41.190 Our team is responsible for the various language runtime offerings that we provide to our clients. In particular, our Ottawa lab is known for our work on the J9 Java Virtual Machine. Our team has been involved in language runtime development for over 20 years, beginning with a Smalltalk heritage. Our first IBM product, Visual Age Smalltalk, was released in 1988, followed by the first version of the J9 Java Virtual Machine in 1993.

00:01:09.240 This was created from the same core technologies we originally developed for the Smalltalk VM. Since its release, J9 has undergone hundreds of person-years of active development. So today, we're excited to talk about our new project and the future of our technology, OMR.

00:01:31.920 OMR is an open-source toolkit for language runtime development, derived from the source code of our production J9 virtual machine runtime. We’re distributing it as a C++ framework, open-sourced under the EPL (Eclipse Public License). The core idea behind OMR is that a significant portion of a language runtime can be shared across different language implementations.

00:02:07.620 As a toolkit, we include technology for just-in-time compilation, garbage collection, and a portability layer that allows us to run on various platforms. The idea is that OMR can serve as a shared backbone for many language runtimes; that means an improvement to one language could benefit all languages using it.

00:02:48.770 Currently, we are using OMR in production code. This codebase has been battle-tested and has been deployed successfully in J9 for years. In its current form, OMR is actively used in our Java SDK and we have proof-of-concept integrations with Ruby, C Python, and a Smalltalk implementation. We are keen on bringing OMR to various languages because we support them in the cloud, aiming to empower our clients as much as possible.

00:03:19.070 However, since this is a Ruby talk, let's focus on OMR and Ruby. I'm going to hand it over to Craig, who will explain how we began integrating our garbage collection technology into Ruby.

00:03:31.730 Thanks, Rob. We didn't want to merely tell everyone to use our great framework; we aimed to prove that OMR adds value to languages outside the JVM, and we thought Ruby would be a perfect fit. Today, we're going to discuss our work on integrating OMR’s garbage collection framework into Ruby. Throughout this process, we've encountered some roadblocks and challenges, but we've also achieved some fantastic successes.

00:04:04.670 Let’s start from the beginning, when we knew nothing about the internals of C Ruby. Step one was figuring out how to implement OMR GC into Ruby. This required understanding what was happening beneath the Ruby VM, so we turned to our savior, Google, and searched for information on how garbage collection works in Ruby.

00:04:41.980 In our research, we discovered a book on garbage collection published in 2013, but unfortunately, we don't speak Japanese yet, which left us with the option to read the source code of Ruby itself. We examined the C code associated with garbage collection in the Ruby interpreter and noticed something interesting: the file was created in 1993. Given that both Rob and I were not garbage collection developers back then, it shows how long this technology has been in development.

00:05:49.789 Ruby, particularly MRI (Matz's Ruby Interpreter), has a rich history and has been under active development for over 20 years. Over those years, Ruby has evolved to become more expressive and powerful. However, the design of MRI introduces complexities. We realized that Ruby internals directly use object pointers, and the GC needs to mark these pointers by conservatively scanning the stack to identify valid objects.

00:06:41.420 This allows C extensions to use object pointers directly; however, it also means that Ruby objects cannot move. In Ruby, objects have a fixed size of 40 bytes, which could be problematic for large objects like big strings or arrays. Ruby solves this by allocating any additional space needed for large objects.

00:07:01.600 The advantage of this approach is that the managed heap experiences no direct memory fragmentation, which simplifies conservative marking. However, there are downsides; malloc and free operations can be slow, and fragmentation issues can arise in the malloc space, which behaves like a black box implementation.

00:07:34.880 Ruby uses an incremental garbage collection strategy, interleave marking and sweeping with the application’s execution. The incremental GC helps mitigate potentially long GC pauses by cleaning up memory on demand.

00:08:10.110 Having clarified how garbage collection works in Ruby, let’s discuss how we integrated OMR's GC into Ruby. We started with several goals: making allocations faster, improving GC time, enhancing mutator performance (which refers to the performance of applications), and ensuring 100% compatibility. This means there are no changes to the C extension API; if code runs on Ruby today, it should run with OMR’s GC in Ruby.

00:08:52.280 Initially, we created a simple single-threaded mark-sweep implementation of our GC in Ruby. To implement this, we modified areas where Ruby GC initiations occurred and directed them to use OMR's GC instead. Similarly, instead of allocating a Ruby object traditionally, we adopted OMR’s allocator to request that memory. With just a few more connection points, we were able to plug in the OMR GC.

00:09:40.440 However, OMR’s heap is not slot-oriented like Ruby's, which necessitated the creation of an object map to identify the beginnings of valid objects. At this point, we had a single-threaded mark-sweep implementation of the GC working in Ruby, and we decided to take it a step further by making it parallel.

00:10:10.149 As we all know, parallel programming is challenging, particularly when it comes to sharing workloads between threads. Fortunately, the code for efficiently executing this already exists in OMR. OMR exposes a straightforward and portable API for parallel atomic operations, featuring advanced work-sharing algorithms, highly parallel internal operations, worker thread pooling, and task synchronization.

00:10:47.820 To make marking parallel, we began with the roots; any root that was too large to be scanned in a single thread was split into smaller work units. The remainder of the marking phase is managed by OMR's GC and is marked efficiently in parallel. To enable this, we had to make some changes to Ruby’s internal GC APIs, including passing around a thread context for C extensions. We placed this thread context in thread-local storage, allowing us to achieve parallel marking without disrupting anything in the C extensions.

00:11:41.640 Following the marking phase comes the sweeping phase of the GC. Some objects, like C extension objects, need to be swept together to protect the C API. Nevertheless, we managed to sweep simple types like strings and arrays in parallel. Additionally, when various Ruby objects are associated with internal VM data structures, the death of an object necessitates updating that structure to remove the reference. We've transitioned this work from an on-demand format to performing it entirely in parallel with the sweeping operation.

00:12:02.080 At this point, we have a parallel mark-sweep GC working in Ruby. With OMR, we also provide a couple of valuable tools for garbage collection. One of these is a Garbage Collection Memory Visualizer (GCMU) that allows you to visualize details like the size of your heap, the amount of free space, and GC pause times. Beyond providing direct analytics, it also offers profiling recommendations to help optimize the performance of your garbage collection.

00:12:59.440 Another useful component is the Health Center, which can attach to a running application to provide real-time information. The GC view in Health Center displays similar metrics to those shown in the GCMU, but from a running application context. A popular element of the Health Center is method profiling, which allows users to see the call chain of each method sample and to sort methods by their accumulated execution time.

00:13:39.740 These tools are actively used in production for Java, and they will soon be available for any runtime that incorporates OMR. Now, let's delve deeper into Ruby. Rob will explain how we've made OMR’s GC and Ruby even faster.

00:14:05.660 Our primary objective here was to enhance application performance. To initiate this, we focused on object allocations, aiming to speed up mutator performance and facilitate quicker object allocation. We brought the concept of thread-local heaps from OMR, which allows exclusive ownership of a large contiguous region of the heap to a thread, enabling it to allocate objects without global synchronization or acquiring a lock.

00:14:57.260 If objects allocated together are utilized together, they benefit from improved cache locality during their execution. However, we encountered challenges when using thread-local heaps, as this approach relies on having large free spaces of memory, making us susceptible to heap fragmentation when available free memory is tiny.

00:15:38.460 To address this, within a thread-local heap, a thread allocates its objects from left to right. By granting exclusive ownership of this region, we can efficiently track very little global state—essentially just maintaining the next allocation pointer and the remaining free space within the heap. This technique, known as bump allocation, is incredibly quick and lightweight due to its simplicity.

00:16:49.779 We also worked hard to minimize allocations and enhance their speed. Upon profiling, we determined that malloc and free are notably expensive because virtually every allocated object in your program has some off-heap memory associated with it. This allocation occurs on the critical path of your program, slowing performance.

00:17:47.040 Moreover, when an object dies during garbage collection, the associated memory must be freed, adding to the overall cost. Thus, we aimed to replace malloc and free with a solution more suited to our needs. Our approach uses VM-specific knowledge to develop a new type of allocator that aligns with the Ruby Virtual Machine's requirements.

00:18:48.980 We introduced a new built-in type called the OMR Buffer, the only variable-sized object allocated on the Ruby heap. This structure consists of a header and a data portion, replacing traditional malloc-allocated memory. We have effectively replaced malloc buffers with these on-heap OMR Buffer objects and now utilize pointers to their data segments instead of traditional buffers.

00:19:32.990 Using OMR Buffers helped us transition all native off-heap memory associated with built-in types onto the heap. This change enables access to that fast object allocation pathway we've implemented, enhancing cache locality by allocating them in close proximity to their parent objects. An OMR Buffer remains alive as long as it is reachable from a root (like the stack) or a parent object.

00:20:19.450 Most importantly, these OMR Buffers don’t need to be explicitly freed. Once they become unusable, they are automatically collected by the garbage collector. By shifting data onto the heap, we have alleviated the need to free off-heap native memory. We’ve also moved much of the object finalization process out of sweeping individual objects and integrated it into parallel work units, effectively almost eliminating the remaining workload associated with sweeping.

00:21:05.920 We also wanted to add a new API to simplify the process of allocating C extensions on the heap, such as with Nokogiri. To assist with this, we introduced a new flag to the Ruby C API, allowing implicit allocation of what would typically scale off-heap memory directly onto the heap.

00:21:56.030 Thus, if you usually relied on `default_free`, you can now eliminate your `free` method altogether. This change allows the collector to avoid unnecessary work when sweeping your C extension objects. The allocation’s lifetime now aligns with the allocation's lifespan of the API, making it straightforward to use. Simply add the flag, and you will gain better cache locality, faster allocations, and reduced GC pause times.

00:22:38.470 The implementation of OMR Buffers has significantly optimized object allocation, resulting in faster allocations in the mutator and better cache locality. Consequently, they've enhanced your program’s execution. We have been navigating some issues along the way, particularly related to heap fragmentation introduced by OMR Buffers' variable-sized objects.

00:23:32.760 We've had to adjust how we handle conservative collection to ensure these OMR Buffers remain alive, as they are now used in lieu of off-heap memory. We’ve also encountered intriguing issues with concurrency within the MRI internals. For example, MRI can perform certain internal operations without holding the GBL (Global Interpreter Lock), which often happens during lengthy I/O operations.

00:24:23.040 Operations such as reading from a socket into a Ruby string can modify slots and create resizing or even free the native memory tied to Ruby objects. However, with OMR Buffers, this memory resides on the heap, necessitating the introduction of a finer-grained locking mechanism that offers more permissive locking than the conventional GBL.

00:25:01.520 This adjustment allows us to manage and allocate OMR Buffers from background tasks, yielding exciting capabilities like true parallel allocations and the ability to trigger garbage collections from background threads without needing to hold the GBL. At this juncture, we have achieved significant changes in object mutation and heap transformation within a multi-threaded context.

00:25:48.790 Thus, we view this as a preliminary step towards potentially implementing a multi-threaded Ruby. By integrating our garbage collector with Ruby, we are seeing an increase in throughput—specifically, an eight percent improvement on our Is Ruby Fast Yet stress test for Rails. We’ve successfully reduced GC overhead from approximately eight percent to three percent while employing our collector.

00:26:43.020 Throughout the development of OMR for Ruby, we have undertaken numerous experiments, and we are excited to discuss them with you. We are currently experimenting with reintroducing a non-copying generational GC into Ruby with OMR, which traditionally has not been supported in our garbage collection framework. This implementation aims to reduce CPU time costs during the marking phase of garbage collection.

00:27:37.600 Since we are mindful of heap fragmentation issues, we are also looking to reclaim as much memory as possible during a single GC cycle. We are observing higher costs in implementing bitmap and generational garbage collection with OMR due to the guarantees about heap layout and object alignment. To tackle these challenges, we are exploring the concept of a segregated heap, where the heap is divided into separate regions, each with a different fixed size.

00:28:21.720 This modification will help us make specific guarantees regarding the maximum amount of fragmentation that can occur in the heap, which will be especially beneficial for long-running Ruby processes. We are exploring integrating more OMR features with Ruby, including a concurrent garbage collector, which functions similarly to Ruby’s incremental garbage collector. The intention is to interleave the marking phase with mutator execution while offloading a lot of that marking work to background threads.

00:29:13.200 This approach will enable marking to occur outside the critical path of your application while still employing our stop-the-world parallel sweeping phase. Now I’ll hand it back to Craig to discuss what’s next for OMR and Ruby.

00:29:57.760 Thanks, Rob. In addition to the ongoing experimentation Rob has mentioned, we have plans to incorporate more GC technology into OMR soon. One example is the Balanced GC algorithm, which is a GC policy that leverages very large heaps in your application, specifically useful for heaps over 28 gigabytes.

00:30:46.430 Another upcoming algorithm is a semi-space copying generational collector. This is similar to the current generational collector in Ruby but will actively move objects into different heap sections. The advantage is improved cache locality and further reduction in GC pause times overall. Currently, Ruby's internals are not structured to allow object relocation, but in the future, if certain components are rewritten, it may enable object movement.

00:31:32.320 So what’s next for OMR? We’ve accomplished much work to date and presented on OMR’s GC and its application in Ruby. We’ve boosted performance, integrated notable features, and wish to clarify: we are not looking to fork Ruby or create an IBM version of it. Our goal is to contribute this ongoing work directly to C Ruby.

00:32:21.110 We want to receive feedback from the Ruby community, as you are the true Ruby experts. This weekend, we’re releasing a technology preview for RubyKaigi that contains all of the GC changes we’ve discussed, though it does not include ongoing experimental work. If you attended Matthew’s talk on Friday, this preview will also contain the JIT work he’s been diligently developing.

00:33:07.830 Although the OMR directories are not ready for the preview, we hope to open source the OMR work soon. Please check out the link and let us know your thoughts. To conclude, thank you for your attention. If you'd like to connect with us, my name is Craig, this is Rob, and you can contact us through email or Twitter. We also have Charlie, the architect for the OMR project, available for any questions.

00:33:49.720 Mark is the OMR project lead, and John is the CTO for IBM Runtimes. With that, I believe we have time for questions.

00:34:16.960 Yes, we still have 10 more minutes. Are there any questions from the audience?

00:34:26.660 Thank you for the great presentation. My question is, what does the future hold? You mentioned many techniques, so what is finished and what does the successful implementation look like?

00:34:32.880 So, in terms of what is currently available, the performance enhancements we shared today showcase the finalized implementations. As for the ongoing experiments, those are in the pipeline, and we plan to release benchmarks highlighting the results. Our standard Ruby implementations have undergone testing against things like Rails performance, measuring the request handling capabilities.

00:35:53.730 The test compares the Ruby implementation with OMR against the stock Ruby implementation to assess throughput.

00:36:09.440 What we found is approximately an eight percent increase in requests processed.

00:36:20.740 Additionally, we maintain complete compatibility with MRI.

00:36:38.620 Are there any increases in memory usage?

00:37:13.160 Yes, when we refer to memory usage in the runtime during testing, our binary size has increased, yielding comparable resident memory consumption.

00:37:44.230 We achieved efficiency in managing memory usage, retaining similar sizes across benchmarks.

00:38:13.680 I have a couple of questions. Firstly, how expensive are finalizers in the OMR GC?

00:38:55.290 From what we've observed, finalizers in OMR aren't too costly and are separate from Ruby's finalized features.

00:39:36.640 Also, you discussed finer-grained locks. Are those only in the allocator?

00:40:22.020 The finer-grained locks pertain to VM access, particularly around allocations.

00:40:58.180 These leverage VM access beyond mere allocations.

00:41:21.820 With your updated locks, are there many operations possible with multiple threads?

00:41:38.840 With fine-grained locks, multiple threads can access resources during non-critical operations, resulting in streamlined performance.

00:41:59.720 We can facilitate multiple threads accomplishing allocations and modifications, which optimizes resource efficiency.

00:42:23.680 Further questions?

00:42:50.260 Thank you for the intriguing presentation. In regards to the benchmark data you provided, what specific techniques contributed to the performance improvements?

00:43:14.440 We see enhancements primarily due to the reduction in cache misses that were achieved, particularly around mutator execution.

00:43:55.020 Additional improvements were observed due to the reduced overhead of the GC.