Variable Width Allocation: Optimizing Ruby's Memory Layout

00:00:00.399 Hello RubyKaigi! I'm Peter, and I am a Ruby core committer and production engineer at Shopify on the Ruby infrastructure team. This is Matt, who is also at Shopify and a senior engineer in the Ruby infrastructure team. For the last few months, we have been working together to implement improvements to how memory is laid out in Ruby. In this talk, we will summarize the work we've done so far, the challenges we have faced, and the direction we are heading in the future.

00:00:19.600 Before we delve into details, we will provide some background on how Ruby's garbage collector structures and manages memory. The primary unit of storage in the Ruby heap is an R-value. An R-value is a C struct that holds one Ruby object; it's always 40 bytes wide and contains a union of all possible Ruby call types, each having its own data structure within the union. This means we could have a structure representing a string, an array, a class, and so on.

00:00:35.280 The R-value itself is really just a container for our object. Let's take a closer look at one of the object types. The first 16 bytes of every type are always the same: an 8-byte field called flags followed by an 8-byte class pointer. The remaining 24 bytes differ for each type, where class objects store a pointer to an extension object, while strings can store the actual string content and so on. These R-values are organized into regions called pages. A heap page is a container for a 16 kilobyte memory region, and the number of slots per page can be either 408 or 409. The important aspect of a heap page's memory region is that all the slots are contiguous, with their memory addresses spaced 40 bytes apart and no gaps between the slots.

00:01:21.040 When a heap page is created, all the slots are initially filled with R-values containing a special internal type called T_NONE, which denotes an empty slot. This empty slot only contains flags and a pointer value called 'next' that can point to another R-value, allowing these empty slots to be linked together to form a data structure known as the free list. When a heap page is initialized, Ruby traverses each page, visiting each slot in turn. As it visits each slot, it sets a pointer on the heap page called the free list pointer to the address of the slot and updates the slot's 'next' pointer to the address of the previously visited slot, creating a linked list of empty slots.

00:02:04.479 When we need to allocate an object and require an empty slot, Ruby queries the heap page for the first slot on the free list. The page returns the slot, and then the free list pointer is updated to point to the next slot in the list, unlinking our slot from the free list and allowing us to store data within it. This utilization of a free list ensures that object allocation remains a constant-time operation, even when a page is sparsely populated. If there is at least one empty slot, the free list is guaranteed to point to it.

00:02:50.560 Eventually, heap pages will run out of slots, at which point the garbage collector takes over. If the free list is empty when Ruby attempts to allocate a new object in the page, it signifies that no free slots are available on that page. Now, Peter will explain how the garbage collector functions in such cases to reclaim space from unused objects on that page.

00:03:01.280 After reviewing how memory is laid out inside Ruby, let's explore how Ruby's garbage collector operates. The garbage collection phase consists of three distinct phases: marking, sweeping, and compaction. The last phase, compaction, is optional and does not run unless it is manually enabled. During garbage collection, Ruby stops executing code, meaning Ruby code cannot run while the garbage collector is active.

00:03:38.720 The garbage collector's complexity prevents us from covering all technical details in just a few minutes, so we will provide a high-level overview containing all necessary information for this talk. First, we examine marking, which is the phase where we determine which Ruby objects are alive and which can be freed. We initially mark the roots, including global variables, classes, modules, and constants. Next, we recursively mark the unmarked children of the marked objects until the mark stack is empty.

00:04:05.120 Here’s a toy example of a heap with two pages containing seven slots on each page, with a total of ten objects labeled A through J. Slots without labels indicate that they are empty. The arrows illustrate references; for example, if there’s an arrow from object J to object D, it means that object J holds a reference to object D. The first step of marking involves marking root objects, which in this case are objects A and B. Once marked, they go onto the mark stack. After marking A and placing it on the mark stack, we pop it off and mark its child, object C, pushing it onto the mark stack as well. We repeat this process until all objects that need to be marked are processed.

00:07:36.960 After marking, all marked objects are live, while unmarked objects are dead and can be reclaimed by the garbage collector. To do this, we scan the slots on the pages and free those that are unmarked. Continuing with our marking example, we locate the first unmarked object, object D, which we can free and reclaim memory. We move through the rest of the objects, finding object E as the next unmarked, and proceed to page two, freeing object I and then object J. Once all pages are swept, we conclude the sweeping phase.

00:10:43.040 The manual compaction feature, introduced in Ruby 2.7 and improved in Ruby 3.0, is optional and disabled by default. Compaction moves objects within the heap to the start of the heap, providing several benefits, such as reduced memory usage, faster garbage collection, and better copy-on-write performance. Ruby employs a two-finger algorithm for compaction, consisting of two steps: the compact step and the reference updating step. In the compact step, two cursors move through the heap, efficiently relocating objects to create contiguous memory.

00:12:05.440 The reference updating step follows, where we scan all the objects, updating any pointers that may have moved during compaction. For example, if an object has a forwarding address due to being moved, we update that pointer to reference the new location. This ensures all object references remain valid, and any unused forwarding addresses can be reclaimed. This is how the garbage collector operates, but throughout this talk, we have referred to slots as being fixed size, and Matt will now discuss handling cases where an object needs to store additional data.

00:13:26.080 As we have seen, Ruby objects aren't always fixed in size and may not fit in a standard slot. There are two scenarios to consider: when objects fit within the 24 bytes at the end of an R-value and when they exceed that limit. For instance, the string 'Hello World' is 12 bytes, fitting comfortably into a 40-byte R-value. However, the string 'Hello RubyKaigi, thanks for having us' is 37 bytes, which requires extra handling. When allocating the string, Ruby checks its length against an internal constant called R_STRING_EMBED_LEN_MAX. If the requested length is shorter, it directly pushes the string data into the remaining slot after the flags. If longer, it sets a flag indicating that the data is a pointer to memory allocated outside of the Ruby heap.

00:16:10.800 This behavior allows Ruby to manage its memory allocation more effectively, maintaining performance while accommodating larger objects. Now, let’s discuss the challenges associated with a fixed memory layout and introduce our Variable Width Allocation project, aimed at resolving these challenges. A significant bottleneck in the Ruby heap arises when objects require multiple pieces of memory to store data. This situation often leads to poor cache locality, which in turn is exacerbated by the need to use system malloc for memory allocation.

00:16:59.760 Our CPU has faster memory built onto it called caches, which are divided into multiple levels. The lower the level of the cache, the closer it is to the CPU core, leading to improved performance due to the reduced distance for electric signals. The entire memory fetch creates an architecture where the cache also affects the allocation performance. For example, array allocations in the current implementation point to a memory region in the system heap, resulting in separate fetches for both the R-value and the data within the array. This requires two memory fetches, which is inefficient.

00:19:52.480 Ruby 2.6 introduced the Transient Heap to improve performance by reducing system malloc calls; however, this approach has limitations and cannot serve as a true malloc replacement. Our project aims to enhance overall performance by allowing Ruby to control its memory allocation rather than relying on the system's memory management. By optimizing the layout structure and placing object contents right after their respective objects, we intend to improve cache locality and performance during dynamic size allocations without system calls. As we implement this, we have merged changes in the Ruby codebase, establishing this functionality while still considering testing implications.

00:22:59.679 In this talk, we have explored the mechanics of Ruby's memory allocator and garbage collector, discussed the issues present in Ruby's memory layout, and shared how we are addressing them with our Variable Width Allocation project. Although we continue to make improvements to the implementation, we acknowledge that much work lies ahead. Thank you for listening, and thanks to RubyKaigi for the opportunity to present.