Plug & Play Garbage Collection with MMTk

00:00:06.859 Thank you for coming. This talk is called 'Plug and Play Garbage Collection with MMTk.' My goal is to propose what I believe is a way to fundamentally improve Ruby's garbage collector, enabling it to take advantage of some of the latest advancements in garbage collection (GC) research, not just now but also in the future.

00:00:10.200 My name is Matt, and I've been working on Shopify's Ruby and Rails infrastructure team for almost three years now. I'm a Ruby committer since March of this year, and I have a keen interest in garbage collection and memory management. To understand how we can move forward with GC, it's necessary for us to comprehend not only where we are now but also how we got here.

00:00:32.660 Let’s wind the clock back to December 1995 when the first public version of Ruby was released. We'll discuss what GC looked like back then and how it has evolved since then. The entire memory management code in Ruby 0.95 was 562 lines of code. Today, that same file has grown to over 14,000 lines; it is about 25 times larger than it was back then. Nevertheless, despite this significant growth, the lineage of Ruby's memory management remains quite clear.

00:01:00.899 Similar to today, Ruby objects were implemented as 'R' values, which are fixed-size C structs laid out in contiguous memory regions. Those regions were then grouped together. While there are some differences in terminology compared to modern Ruby, the general heap layout is very recognizable even at this early stage.

00:01:27.740 Just like today, these empty 'R' value slots were linked together in what is known as a free list. When Ruby needed to allocate a new object, it would unlink the first empty slot from the free list and write into it. If there were no slots available on the free list, that was when garbage collection (GC) would start.

00:01:50.220 GC in Ruby 0.95 was obviously much simpler than it is today, given how little code there was. However, the core algorithm has not changed all that much. Ruby 0.95 implemented Mark and Sweep, an algorithm originally described in just a few paragraphs in a 1960 research paper about Lisp.

00:02:05.280 In the Mark and Sweep process, the GC starts when the program tries to allocate and cannot proceed because the free list is empty. The garbage collector then starts from a set of known roots, tracing the entire object graph and marking every object that can be reached along the way. Next, it sweeps through every single object on the heap, adding any unmarked objects back to the free list.

00:02:30.720 This is exactly what Ruby 0.95 did; it set a bit in each object’s header to indicate whether it was marked and iterated over all objects, adding any that did not have that mark bit set back to the free list.

00:02:40.680 Fast forward over 15 years, and we observe the first major step change in Ruby's GC: the introduction of lazy sweeping in 2011 for Ruby 1.9.3. Mark and Sweep is a stop-the-world collector, meaning that program execution halts while the GC is running. At the time, particularly with web applications like Rails, this was a significant issue. If Ruby's GC executed in the middle of a web request, it would adversely affect the user experience.

00:03:01.620 Applications frequently experienced very slow P99 response times. Lazy sweeping attempted to mitigate this by dividing the sweep phase into many shorter steps, meaning even if the GC ran during a request, its impact was minimized. Although lazy sweeping shortened the GC pause times, the additional work required to track progress ultimately slightly decreased overall application performance.

00:03:39.599 This illustrates the trade-offs involved in building GCs. In practical terms, a web application might be willing to sacrifice a slight drop in overall requests per second if it resulted in a significant reduction in P99 response times, but this obviously depends on the workload. After this, there was a two-year gap until the next major change, when Ruby 2.1 introduced generations into the GC.

00:04:05.099 Generational collectors are based on observations made in a 1984 research paper that noted most objects in a system tend to die young, and those that don’t are likely to live forever. Therefore, generational collectors can enhance performance over non-generational ones by not scanning many objects that are unlikely to change.

00:04:27.840 Ruby's first implementation of a generational GC featured two generations: the young generation and the old generation. Marking was divided into two phases: a minor mark that focused exclusively on young objects, and a major mark that considered all objects.

00:04:47.580 By default, minor marks are performed, and any object that survives a minor mark is promoted to the old generation. A major mark is triggered when the count of old objects doubles.

00:05:04.500 An interesting aspect of generational collectors is that they often can ignore references from old objects to young objects during minor marks. If this scenario arises and a minor mark occurs, live young objects can end up being swept.

00:05:19.620 In traditional implementations, this is addressed using a write barrier—code that detects when an object receives a reference to a young object and adds that old object to a remembered set. This remembered set is merely a group of objects that are always considered during the minor mark phase, alongside the young objects. This is what Ruby implemented. However, Ruby has always had a rich C API, allowing gem authors to write native extensions in C.

00:05:53.660 Since none of the gems were utilizing these write barrier APIs initially, an object created from a C extension could lead to application crashes with this new GC. Ruby circumvented this by establishing a distinction between write barrier protected and write barrier unprotected objects, with all objects originating from C extensions remaining unprotected by default. This meant gem authors had to explicitly implement barrier support in their gems.

00:06:19.680 Unprotected objects could never be promoted to the old generation and were always considered for marking, even during a minor GC. This allowed Ruby to implement a generational GC in a backwards-compatible manner, albeit at the expense of performance, as minor marks now needed to account for numerous objects that might not have been necessary.

00:06:46.500 It's also important to note that Ruby implements generational GCs differently compared to traditional literature. Most generational GCs are evacuating, whereby generations exist in distinct memory spaces, and promoting an object means copying it to the new region. This distinction gives the ability to tune performance based on the specific characteristics of each memory region.

00:07:00.840 Ruby's GC does not have this separation yet. Returning to our timeline, the next major advancement in GC occurred in 2014 when Ruby 2.2 introduced incremental marking. Incremental marking is similar to lazy sweeping from Ruby 1.9.3; it broke up the long marking phase of a major collection.

00:07:23.280 Marking involves tracing the entire object graph, and Ruby needed to pause marking and then continue from where it had left off. At that time, there was no established method for doing this, so the well-known tri-color abstraction was applied, which was first noted by Edgar Dijkstra in 1978.

00:07:43.560 This approach classifies objects at the start of tracing as possibly dead, referred to as white. As objects are traced, they are marked as gray. Once fully scanned, they are presumed alive and marked black. Such object classification permits the marking state to be saved and resumed by disregarding all black objects and continuing to mark from the gray objects.

00:08:05.220 However, a similar issue arose as with the old objects in a generational GC: new references could be added to a black object, leading it to be swept because the GC considered that black object fully marked. This was resolved using write barriers, which marked a black object as gray if a new reference was appended, ensuring it wouldn’t be ignored in the subsequent minor mark.

00:08:31.260 As of late 2014, Ruby had a non-copying generational and fully incremental Mark and Sweep garbage collector. The next major change came with Ruby 2.7 in 2019, which introduced a compactor, marking a monumental shift since it was the first time objects could move around in the heap.

00:08:43.500 Ruby's compactor, similar to its original GC core, was based on algorithms from Lisp in the 1960s, fitting well with the contiguous memory regions and fixed-size object model that Ruby employs. Compaction unlocked numerous possibilities for performance improvements, such as reducing heap size, enhancing data locality by minimizing fragmentation, and improving copy-on-write friendliness.

00:09:05.640 However, C extensions were required to explicitly opt into this change by integrating custom code to update their own references, which meant that initially, objects from C extensions did not support compaction and had to remain static in the heap, counteracting many benefits of compaction.

00:09:32.580 With Ruby 3.0, introduced in 2020, the compactor gained the ability to run automatically; previously, it was a manual process requiring developers to trigger 'gc.compact.' Now, compaction was intertwined with the sweep phase of a major GC, filling openings in memory automatically by relocating objects from elsewhere in the heap.

00:09:49.140 However, Ruby previously assumed it was safe to modify other objects during sweeping, as they wouldn’t move. For example, every class maintains a reference to its subclass list and a pointer to its superclass. If a parent class needs to be swept, its subclasses would require updates, but if that parent had already been relocated, it would lead to inaccuracies.

00:10:08.760 This issue mirrored challenges seen with incremental and generational GCs, as previous assumptions about GC behavior had changed. To address this, Ruby implemented a read barrier, which ensured that if a read was attempted on a moved object, that object would first be restored to its original location before allowing the read to proceed. This circumvented the object modification problem and facilitated the implementation of auto-compaction.

00:10:31.260 However, the introduction of read barriers added complexity to the codebase, ultimately slowing throughput and extending GC pauses. Therefore, as it stands, compaction is currently an optional feature in Ruby and remains disabled by default.

00:10:54.420 This brings us to Ruby 3.2 and the most recent change regarding variable allocation. My colleague Peter and I discussed this at RubyKaigi a couple of years ago if you're interested in detailed intricacies, but to summarize, we have focused on enhancing application performance instead of GC pause times by modifying the memory allocation approach to improve data locality and diminish external allocation instances.

00:11:03.300 This briefing brings us almost to date regarding the current state of Ruby's GC. I have left out some smaller yet significant adjustments for the sake of brevity, but I believe a full talk could be crafted around the complete history of Ruby's GC. In summary, Ruby comes equipped with an incremental, non-copying generational Mark and Sweep collector, along with an optional compaction phase.

00:11:24.120 None of these changes were originally designed as goals; rather, Ruby's GC has evolved organically over the past 29 years. It is challenging to address assumptions in Ruby that no longer hold true. Sometimes, Ruby has resolved problems using methods that were applicable in their time but have since produced an inflexible GC system. All of this is entirely normal; in 2011, a research paper analyzing the landscape of managed languages concluded that language developers must prioritize memory performance from the outset.

00:12:06.000 Failing to do so would lead to a strong compiler being hindered by poor memory performance. As we progress with this talk, I urge you to keep in mind that Ruby's GC implementation is approximately 30 years old; it rests on algorithms that have been around for decades. It constitutes a substantial and intricate portion of the codebase, which makes any alterations to it a lengthy endeavor.

00:12:41.820 When you embed a GC into a VM, weak boundaries can often result. We have already observed several instances where other parts of the VM make assumptions about how the collector operates or how memory is structured. Consequently, the collector becomes tightly integrated, posing difficulties in adapting to new ideas and stifling innovation within the VM.

00:13:36.660 With that context, I want to discuss some innovations occurring within the GC research community, focusing on two noteworthy algorithms. The first is IMIX, which was presented at the Programming Language Design and Implementation Forum (PLDI) in 2008. Every GC algorithm can be boiled down to three core components: allocation, garbage identification, and reclamation.

00:14:10.680 Each of these components has established fundamental approaches. Allocation can employ a free list approach—like we have seen in Ruby—or a bump pointer approach, where a marker indicates the first free address in the heap, allocating at that point, then shifting the pointer forward to the next available space. Garbage identification has an implicit procedure where garbage is determined by first identifying what is active; this is essentially the tracing method Ruby implements. Alternatively, there is an explicit approach in which objects manage their own references, though reference counting poses challenges with circular references.

00:15:01.560 Reclamation involves sweeping, where garbage objects are wiped and their addresses are returned to a free list, evacuation, where live objects are copied to a new memory space and the old space is discarded, and compaction, which is akin to evacuation but consolidates objects to one end of the existing space. Fundamentally, these processes can intermingle to form three canonical collectors: Mark and Sweep (combining free list allocation and tracing), Mark Compact (combining bump pointers with tracing and compaction), and Semi-space (combining bump pointers, tracing, and evacuation).

00:15:35.099 Each of these canonical algorithms possesses distinct performance characteristics tailored to specific workloads. However, these collectors are rarely used independently nowadays as advanced GC methodologies such as G1, Shenandoah, and ZGC build upon these algorithms through various innovative approaches, integrating them with generations, concurrency, and parallelism, along with layering numerous optimizations.

00:16:10.680 IMIX formalized a new category of GC algorithms known as Mark Region, showcasing performance advantages of up to 25% over traditional collectors, averaging across around 20 standard GC benchmarks. This chart presents total program execution time relative to heap size, contextualizing traditional GC performance dichotomy: you either obtain a swift collector or a memory-efficient collector. IMIX, however, breaks away from this dichotomy, outperforming other canonical collectors even at larger heap sizes.

00:16:51.840 The subsequent chart exhibits GC time relative to heap size, revealing IMIX matching Mark and Sweep as the fastest collector, thereby minimizing overall GC time while providing comparable data locality performance to Semi-space, which is engineered for optimal locality.

00:17:34.800 Notably, IMIX is currently utilized in the Inco concurrent programming language and Scala Native, a managed runtime and ahead-of-time compiler for Scala. There is ongoing effort to implement IMIX in GHC, yielding promising real-world outcomes. Of interest, Rubinius once employed an evacuating generational GC using Semi-space for the young generation while utilizing IMIX for the old generation.

00:18:24.420 The next algorithm worth mentioning is LXR, introduced through a 2022 paper at PLDI. LXR amalgamates IMIX with reference counting, along with a series of optimizations, allowing it to outshine the most prevalent collectors currently in use.

00:19:07.619 Benchmarks pitting LXR against G1 (designed for throughput), Shenandoah (latency optimized), and ZGC (aiming for both throughput and latency) across various heap sizes revealed remarkable results. The data is normalized relative to G1, illustrating that lower numbers equate to better performance. ZGC's benchmarks showed that it would only execute if permitted to use six times the usual memory for the benchmarks, but it showcased LXR outpacing other collectors in latency and throughput.

00:19:52.859 IMIX and LXR have both been developed within MMTk, which stands for the Memory Management Toolkit. This initiative aims to equip programmers with a concrete suite of building blocks to facilitate the creation of high-level GC algorithms, akin to a toolkit for garbage collectors. MMTk serves as an infrastructure upon which innovative GC systems can be constructed.

00:20:28.000 Originally part of the Jikes Research VM established in 2004—a flexible testbed for prototyping VM design—MMTk was rewritten from scratch in 2017 as a standalone project to make it language-agnostic. The version used for LXR’s implementation derived from this rewrite. Notably, the exceptional modularity and abstraction characteristics of MMTk facilitated LXR's journey from initial commit to a full implementation, alongside a published research paper, within approximately 19 weeks.

00:21:06.960 Currently, MMTk features bindings for several languages and runtimes, including OpenJDK, the V8 JavaScript runtime in Chrome, the Jikes RVM, and ongoing efforts toward bindings for Julia and GHC, the Haskell compiler. As anticipated, work for Ruby bindings is also underway. Chris Eaton, a gifted member of the Ruby Community who tragically passed away last year, believed that Ruby would greatly benefit from being better aligned with cutting-edge research in virtual machines and interpreters.

00:21:51.780 Among his many initiatives at Shopify, he was instrumental in enabling our investment in significant research projects deemed interesting to the Ruby community, including MMTk. Consequently, Shopify became one of the industry supporters of MMTk, joining Huawei and Google. While the implementation effort was led by researchers at the Australian National University and the MMTk team, Shopify's support informed our Ruby development experience.

00:22:41.040 Now, let's discuss how MMTk connects to Ruby. MMTk comprises core components distributed as a Rust library, where all GC algorithms and essential building constructs are implemented—anything that isn’t specific to the host runtime. To interface it with Ruby, we need to develop a binding layer consisting of two parts.

00:23:06.000 The first part, written in Rust, encapsulates Ruby-specific code that necessitates proximity to the GC for performance reasons—tracing being a good example. Ruby has its own unique object layout in R values, which MMTk needs to accommodate without the slowdown associated with cross-language calls. The second part, crafted in C, the same language as Ruby, contains MMTk-specific code that benefits from being closer to Ruby, like functions to start and stop Ruby threads or pause the world.

00:23:47.740 The Rust and C components come together into a shared object linked against Ruby at compile time. The Rust bindings leverage the Foreign Function Interface (FFI) to make functions easily callable from the Ruby side. I’ll now demonstrate our current progress, and if you wish to try this out yourself, we have nightly Ruby releases available on our GitHub repository.

00:24:23.760 You can download pre-built binaries or access container files with build scripts to create your own versions. This is a depiction of my built version of Ruby with MMTk; we can confirm this by examining the version string, which indicates MMTk is included. When passing MMTk arguments, you will see it initializes with its Mark and Sweep collector by default, as denoted by the uppercase '+ MMTk' and bracketed 'Mark Sweep' notation.

00:25:11.260 By default, executing a GC manually will utilize Ruby's internal GC, not MMTk. Observing the output via 'GC.stat,’ we would see typical results. Calling MMTk's default collector can be accomplished by simply passing 'MMTk' or by explicitly selecting a plan with 'mmtk plan.' Here is Ruby executing a successful GC using MMTk’s Mark Sweep implementation, complete with added logging.

00:26:03.820 Notably, this marks a substantial milestone for us as we witness MMTk executing a full and successful collection with IMIX. However, there are caveats to consider—this work is still highly experimental. Currently, many of the C Ruby test suites do not operate as expected, and MMTk is Linux-exclusive; therefore, booting a Rails application using Mark Sweep is possible, yet IMIX tends to crash with Rails, necessitating smaller scripts.

00:26:42.960 Enhancements in performance are still very much on our roadmap, and we are prioritizing correctness over speed at this stage; we are not as fast as Ruby's internal GC just yet. So, looking ahead, I want to share some of our aspirations.

00:27:33.420 We believe there is significant scope for improving Ruby's GC. Currently, our fork comprises a wealth of code, relying heavily on preprocessor directives to compile certain components when MMTk is included; runtime checks are then required to determine if it’s enabled, which results in convoluted code that’s prone to complexity and maintenance issues.

00:28:00.460 However, this approach has allowed us to accomplish much in a reasonable timeframe, establishing MMTk with several collectors, after which we can now focus on refining that code. Navigating through this process has illuminated areas within Ruby that lack sufficient flexibility to manage multiple collectors.

00:28:35.100 We've identified leaking abstractions and assumptions regarding the memory model. Ultimately, however, this isn’t the code we aspire to ship. Both V8 and GHC took the initiative to ensure the support for MMTk was tangible, addressing pain points that complicated operating multiple GCs. They abstracted key GC-related functionalities into a unified interface, minimizing edge cases requiring consideration.

00:29:01.260 Being in this unique position with Ruby, we can harness the established stable and mature GC while also implementing MMTk as a second system, enabling us to analyze shared elements that can be unified into a general-purpose API.

00:29:37.779 But what if we extend beyond just MMTk? Could we create a standard memory management interface for Ruby—one that could allow for plug-and-play GCs at runtime, whether it be MMTk, Ruby's native GC, or something entirely different? I believe this concept carries substantial potential.

00:30:04.680 It could facilitate a library of GC implementations structured as C extensions, integrated with Ruby as needed, depending on workload—illustrated with MMTk and Ruby's VM among others—allowing anyone to develop and deploy a customized Ruby GC.

00:30:38.400 Moreover, we could utilize MMTk to design innovative GCs, presenting researchers with a platform to test against while simultaneously advancing Ruby at the forefront of GC and memory management innovation.

00:31:21.700 We could establish a method for effortlessly testing custom GC patches or configurations, duplicating a GC module for side-by-side comparisons during runtime in our production environments.

00:32:08.520 While this fully extensible memory management system can yield immense creative opportunities, we must also address several critical questions: How will we manage C extensions? What API should we provide for extensions, especially established ones in the wild?

00:32:52.620 We’ve initiated some efforts on this front with recent patches, including one that merges the two existing GC extension callbacks for marking and moving into a single tracing function. This adjustment establishes a way to declare references in a gem so the GC can trace them instead of necessitating custom tracing code from extension authors.

00:33:35.460 Another concern involves the potential development burden on Ruby for maintaining a stable interface. This structured code organization mandates diligence and upkeep, ensuring its continued relevance as the VM evolves.

00:34:16.140 Finally, a pressing question remains: what performance penalties might arise from implementing such a modular GC? I hope that any performance enhancements resulting from adopting modern research will exceed potential slowdowns from this abstraction, yet this remains an area where empirical testing and validation are vital.

00:34:42.560 In conclusion, allow me to emphasize that garbage collection is not a resolved issue. Innovations in memory management techniques evolve continuously, and I am genuinely enthusiastic about utilizing these advancements in Ruby.

00:35:17.440 I believe we have a unique opportunity to transform Ruby's memory management narrative, equipping it with high-performance, modern GCs to capitalize on novel innovations and position Ruby for longevity and relevance across diverse production workloads.

00:35:54.880 With that, I sincerely thank you for attending my talk. If you wish to further discuss any of the ideas presented here, please feel free to approach me. I am always eager to converse about GC research. You will find links to all the papers and references I relied upon for this talk at this URL.

00:36:10.740 I hope you enjoy the rest of RubyKaigi. Thank you very much!