00:00:00.799
It’s probably better to give the floor to Peter and Matthew, who are coming to talk about optimizing Ruby's memory layout.
00:00:07.200
Peter is a Ruby core committer and production engineer at Shopify, working on improving Ruby. He is passionate about open source software and fascinated by the unknown. Matthew has been working with Ruby in various capacities since 2007 but hasn't previously ventured behind the scenes into its implementation. He is currently a senior developer at Shopify and lives with his wife and daughter in a small town in the UK.
00:00:18.080
When he is not scratching his head over segmentation faults, he can be found brewing beer and building mechanical keyboards.
00:01:02.559
In this talk, we’ll discuss how Ruby's current memory layout does not optimize for cache performance and what implications this has. Object data, like string and array contents, are often not stored close to the objects that use them, resulting in poor data locality, reduced efficiency of CPU caches, and increased complexity for the garbage collector.
00:01:10.799
Additionally, objects may contain pointers to data that point back into the Ruby heap, which slows down object access due to pointer indirection. Join us as we explore how the variable width allocation project will change the garbage collector heap to replace the system’s malloc heap.
00:01:16.040
This change will give us finer control over the memory layout to optimize for performance. I know you both have a lot to learn about this, so without further ado, Peter and Matthew, it’s your turn.
00:01:35.119
Hello, Euruko! I'm Peter, a Ruby core committer and production engineer at Shopify on the Ruby infrastructure team. This is Matt, who is also at Shopify as a senior engineer within the Ruby infrastructure team. For the last few months, we have been working together to make improvements to how memory is laid out in Ruby.
00:01:53.440
In this talk, we are going to summarize the work we’ve done, the challenges we’ve faced, and where we're heading. But before we dive into details, we’ll provide some background on how Ruby's garbage collector structures and manages memory.
00:02:11.920
When our Ruby programs create objects, we need somewhere to store them that is consistently accessible and organized. Ruby does this by requesting memory from the operating system and arranging it into a series of data structures collectively known as the heap.
00:02:31.520
The primary unit of storage in the Ruby heap is a C struct called RValue, which holds a single Ruby object. It is always 40 bytes wide and contains a union of all the possible Ruby core types, each represented by its own data structure inside that union. We could have a struct representing a Ruby string, an array, a class, and so on.
00:02:49.920
These types contain their own fields and data structures specific to that type. The RValue itself is essentially a container for Ruby objects.
00:03:03.760
Let's look more closely at one of the Ruby object types: RClass. Ruby creates one of these for every class in the system. The first 16 bytes of every Ruby type are always the same: an 8-byte field called flags, followed by an 8-byte class pointer.
00:03:20.000
We use flags to store the type of an object, as well as data about its state, such as whether it’s frozen. The class pointer references another RValue that represents the Ruby class of that object. The remaining 24 bytes have different uses for every type.
00:03:43.680
For example, RClass objects store a pointer to an extension object, whereas strings can store the actual string content. These RValues are organized into regions called pages.
00:03:56.840
A heap page is a container for a 16-kilobyte memory region, and the maximum number of slots per page is 409, although sometimes it’s 408 depending on the alignment of the memory region returned by malloc.
00:04:09.960
Initially, when the heap page is created, all slots are filled with RValues containing a special internal type called T_NONE. This type represents an empty slot, containing only flags and a pointer called 'next' that can point to another RValue.
00:04:26.880
This allows empty slots to be linked together to form a data structure called the free list. When a heap page is initialized, Ruby traverses each page, visits each slot, sets a pointer on the heap page called the free list pointer to the pointer of the slot, and connects it to the previous slot.
00:04:42.560
When we need to allocate an object and we need an empty slot, Ruby asks the heap page for the first slot on the free list. The heap page returns the slot, and the free list pointer gets updated to point to the next slot in the list. This unlinking process allows us to put data in it.
00:05:03.520
The use of a free list keeps the object allocation process quick and constant time even when a page is sparsely populated, as it guarantees that if the page has at least one empty slot, the free list always points to it.
00:05:26.000
However, eventually our heap pages may run out of slots, which is where garbage collection comes into play. If the free list is empty when Ruby tries to allocate in the page, it indicates that there are no free slots left.
00:05:37.760
Now, Peter will elaborate on how the garbage collector runs to reclaim space from objects that are no longer in use.
00:05:50.240
Having examined how memory is laid out inside Ruby, let’s look at how Ruby's garbage collector operates. Ruby's garbage collection consists of three distinct phases: marking, sweeping, and compaction.
00:06:04.960
The last phase, compaction, is optional and does not run unless manually enabled. Ruby’s garbage collection is a "stop the world" operation, which means Ruby code does not execute while the garbage collector is active.
00:06:17.440
Because the garbage collector in Ruby is fairly complex, we can’t possibly cover all the technical details in just a few minutes, but we’ll provide a high-level overview that contains the essential information for this talk.
00:06:30.800
Let’s start with marking. Marking is the phase where we determine which Ruby objects are alive and which can be freed. Initially, we mark the roots, which include global variables, classes, modules, and constants.
00:06:49.920
Next, we recursively mark unmarked children of marked objects until the mark stack is empty. For example, if we have a toy example of a heap with two pages and seven slots.
00:07:00.960
The objects are labeled from A to J, and the arrows indicate references: an arrow from object J to object D means that object J has a reference to object D. The first step of marking requires us to mark root objects.
00:07:21.600
For our example, the root objects are A and B. We mark these two objects black and place them on the mark stack. Now that we have marked all the root objects and our mark stack is not empty, we can pop objects from the mark stack and mark their children.
00:07:41.040
We begin by popping object A from the stack and marking its child, which is object C. We then mark object C black and place it on the mark stack. Next, we pop object B from the mark stack; it has one child, which is object G. We mark it black and push it onto the stack.
00:07:58.560
Next, we pop object C from the stack, which has one child, object G. Since G is already marked, we do nothing. Once again, we only have object G on the mark stack, so we pop it.
00:08:15.440
Object G has two children, objects F and H, so we mark both of them and add them to the stack. Finally, when we pop object F from the stack, it has no children, so we do nothing. We then pop object H from the stack; H has one child, object G, which is already marked, so we do nothing.
00:08:36.080
When the mark stack is empty, marking is complete. All marked objects are live, and all unmarked objects are dead and can be reclaimed by the garbage collector.
00:08:48.160
Next, we scan the slots on the pages and free the ones that are not marked. Moving the cursor to the first unmarked object in the page brings us to object D, which we can free.
00:09:04.720
Continuing, we move our cursor until we find the next unmarked object, which is object E. We’ve reached the end of page one, so we proceed to page two.
00:09:17.760
On this page, we continue scanning for unmarked objects as we find and free object I. We keep moving the cursor until we reach the next unmarked object, which is object J.
00:09:31.440
Now that we’ve swept all the pages, sweeping is complete. Manual compaction was a new feature introduced in Ruby 2.7 and further improved with automatic compaction in Ruby 3.0.
00:09:47.680
However, this feature remains optional and is turned off by default. Compaction moves objects within the heap to the start of the heap, yielding several benefits including reduced memory usage, faster garbage collection, and enhanced copy-on-write performance.
00:10:04.960
Ruby uses the two-finger algorithm for compaction, which consists of two steps: the compact step and reference updating step. During the compact step, two cursors, the free cursor and compact cursor, are involved.
00:10:19.200
The free cursor starts at the beginning of the heap and moves forward, while the compact cursor starts at the end and moves backward. When these cursors meet, the step is complete.
00:10:36.080
After compaction, the next step is the reference updating step, where we scan all objects and update pointers that have been moved. For example, an object reference holds addresses of other objects.
00:10:53.400
During updating, we check the references of each object scanned and update the references accordingly to point to the correct object.
00:11:09.440
Once we've recovered by reclaiming forwarding addresses, this is how the garbage collector operates. Throughout this talk, we’ve emphasized that slots are fixed-size.
00:11:24.000
Now, Matt will discuss how we handle cases where the object needs to store additional data. Ruby objects aren’t always a fixed size and won't always fit into a slot.
00:11:43.040
For example, let’s look at two strings: "Hello, world!" which is 12 bytes, and "Hello, Yuruko! Thanks for having us!" which is 34 bytes. When allocating the shorter string, Ruby checks the requested string length against an internal constant called max embedded length.
00:12:01.760
If the string length is shorter than the maximum length, Ruby pushes the string data straight into the remainder of the slot after flags. Objects like this are termed embedded objects.
00:12:18.080
However, if the string length exceeds the maximum, Ruby sets a status bit called No Embed in the RValue’s flags, signifying that the data inside the RValue isn't the actual string, but a pointer to a separately allocated memory region.
00:12:31.680
It uses malloc to allocate a new memory region and stores the address in the RValue, copying the string data into that newly allocated memory. This makes operations on large or varying size objects like strings and arrays non-trivial.
00:12:46.320
First, Ruby has to ascertain whether the object is embedded, and if it’s not, it has to access a different memory region, potentially involving multiple cache reads.
00:13:03.840
So, with that, you should understand how objects are stored in memory in your Ruby process, even when those objects are large.
00:13:12.959
We’ve examined how Ruby lays out its memory and seen how the garbage collector helps us organize that memory by reclaiming space from unused objects.
00:13:27.040
Now, Pete will address some challenges with this memory layout and introduce our project, Variable Width Allocation, which aims to fix them.
00:13:34.160
Let’s discuss some bottlenecks that exist within the Ruby heap that we aim to tackle with this project. The first issue is that a Ruby object often requires multiple memory pieces to store its data.
00:13:51.120
For instance, we previously saw how the contents of a string object exist separately from the string object itself, resulting in poor cache locality. Additionally, this memory is often allocated via malloc, which negatively impacts performance.
00:14:06.080
We think that the CPU only has a single level of memory, but that’s not true. There's faster memory built into the CPU itself, called caches. These caches are usually divided into three levels; the lower the level, the closer it is to the CPU core.
00:14:24.240
Being closer means the electric signal travels a shorter distance, providing better performance. Here we see a shot of the CPU die, illustrating eight cores—four on each side. Zooming in on one of the cores reveals a level one cache, very small at 32 kilobytes.
00:14:43.360
Next, to the left of the core, we see the level two cache, which is larger, with a capacity of 256 or 512 kilobytes. The large region of level three cache shared among all cores has a capacity of up to 32 megabytes.
00:14:59.760
When memory is fetched from main memory, it is stored in CPU caches as well. This data is fetched in units called cache lines, which, in the x86 architecture of Intel and AMD CPUs, is 64 bytes.
00:15:17.360
Thus, even if you only need a single byte of data, 64 bytes will be fetched. When the CPU cache is full, it will overwrite the least recently used data to make way for the most recently accessed data.
00:15:31.360
Increasing the cache hit rate is crucial for optimizing performance, as cache hits mean we can access data quickly without doing a slow read from main memory.
00:15:43.920
Now let’s examine the cache performance of the current Ruby implementation. Consider an array; the current version points to a memory region in the system heap allocated via malloc to store its contents.
00:16:01.760
If our array has four elements, we need four pointers to Ruby objects, requiring 32 bytes of memory. Hence, the object occupies one RValue, which is 40 bytes, while its content occupies 32 bytes, requiring two memory fetches to get the elements of the array.
00:16:18.560
Moreover, fetching memory from the system heap using malloc has performance overhead, so we need to minimize malloc calls. Malloc also necessitates space for headers to store metadata when allocating memory.
00:16:36.560
However, we are not the first to focus on tackling this issue. Ruby 2.6 introduced a second heap called the transient heap, which reduces malloc calls and has improved performance by up to 50 percent in some benchmarks.
00:16:50.920
Nonetheless, the transient heap has technical limitations and cannot fully replace malloc. One significant goal of this project is to enhance Ruby's performance by allowing it to control memory layout directly instead of relying on the malloc library.
00:17:06.560
By leveraging Ruby’s garbage collector, we can optimize allocation and efficiency by placing contents of objects right next to the object itself.
00:17:21.360
We can enhance cache locality by allocating dynamically sized memory inside the Ruby garbage collector and avoid the overhead of malloc system calls.
00:17:35.760
Next, we’ll look at how the memory layout changes with variable width allocation. Here we see the array example again, showcasing our project's aim to transfer the array's data from the system heap to the Ruby heap right after the object itself.
00:17:50.160
Eliminating the need for a malloc system call when creating the array object can significantly impact performance. Every Ruby object requires allocating an RValue, which is 40 bytes in size.
00:18:06.720
In this case, if the array’s data itself is 32 bytes, placing the contents of the array next to the array object means we can access the first three elements while remaining in the same cache line.
00:18:18.480
Matt will now discuss our progress and what we've implemented thus far.
00:18:32.560
We are building this project incrementally and recently merged our first major pull request into Ruby core. This pull request marks the first step toward supporting variable width objects in the garbage collector heap.
00:18:49.200
In summary, we built the necessary infrastructure to move data that would previously have been allocated on the system heap into a heap page alongside the attached RValue. We have also provided a single reference implementation.
00:19:05.520
I should note that this is currently suitable for testing only. Please do not use it on your production workloads. However, if you want to experiment with it, you can enable it by compiling Ruby with 'USE_OF_RGC' in your CFLAGS.
00:19:21.360
Let's walk through an example of what this entails, while keeping the technical jargon minimal. We'll talk about class allocation. Every time you instantiate a class in Ruby, an RValue containing the RClass object is allocated into a heap page.
00:19:36.240
Each class has data associated with it, such as instance methods, class methods, and instance variables. To facilitate this, Ruby creates a separate data structure named RbClassExt, which is fixed in size at 104 bytes and is allocated on the system heap.
00:19:53.760
This is the code responsible for allocating a class: the function Allocation does the work of claiming a slot from the appropriate page’s free list, creating an RClass struct, and returning a pointer to it.
00:20:08.680
The call to 'z_alloc' here is where the system heap space needed for the RClass extension gets allocated. This call returns the address of the allocated memory, which Ruby will store in a field within the RClass structure.
00:20:27.040
After our patch, this is what the class allocation code looks like. The notable change is that we are calling a different entry point, which takes a payload size indicating the number of bytes for the class extension (104 bytes in this case).
00:20:46.000
This specifies enough space to store that number of bytes immediately following our initial RValue in the heap page. When this new function is called, Ruby’s allocator calculates the minimum number of slots needed to accommodate the payload size, as well as the original RValue slot.
00:21:08.720
Instead of requesting a single slot from the free list, it requests a pointer to a contiguous set of slots large enough to hold both the RValue and its data.
00:21:29.920
This is clearly more complex than simply taking a slot from the free list. Let’s illustrate what happens when we search for three contiguous slots in a heap page.
00:21:48.280
Initially, we start at the head of the free list. If a slot to the right exists, we check if it is empty; if not, we scan the next slot. If the adjacent slot isn’t immediately next to the one we just surveyed, we keep track of the address we've just left to connect the free list later.
00:22:05.920
As we continue, at each stop, we assess the immediate right slot for emptiness. We keep checking the next adjacent slots until we find one or more that are empty. Once we find a region of the appropriate length, we stop looking.
00:22:29.280
Once we have this pointer to the start of our region, we can seamlessly remove these slots from the free list. Next, we allocate the RValue in the first slot and then allocate for our payload. We introduce a new Ruby type, TPayload.
00:22:51.120
This TPayload object is straightforward—it holds a single 8-byte flags field. The flags are used for two purposes: the least significant bits indicate type information, just like all other RValue types, and the upper bits hold the payload length in slots.
00:23:08.920
Once the payload header has been established, the system starts treating the RValue and its entire adjacent payload as a single object. Thus, during the garbage collection marking phase, if an object with a payload is marked, it also marks the slots holding the payload.
00:23:27.360
During the sweeping phase, when the object is swept, it will add the payload slots back to the free list for reuse. This simplification allows the sweeping phase to avoid custom functions meant to free various system memory parts.
00:23:45.760
Now that we've initialized our payload, we need to determine where to start allocating extra data. We’ve added a helper to the allocation API to calculate the offset address from the start of the payload slot.
00:24:00.560
This acts as a drop-in replacement for the z_alloc call, calculating and returning an address to already allocated memory instead of allocating new memory.
00:24:19.760
With these two functions, we have successfully replaced the original RValue plus system heap allocation strategy with a new multi-slot allocation system.
00:24:36.400
This has eliminated a call to malloc and the associated call to free, ensuring that class data is stored contiguously with its payload in a heap page.
00:24:52.080
Our initial implementation has some compromises, and now Peter will discuss the challenges we’ve faced while implementing variable width allocation.
00:25:06.880
Thanks, Matt. Now that you’ve seen an overview of the current implementation, I’ll discuss a few challenges we’ve encountered.
00:25:19.760
Many of these challenges remain unsolved or have been addressed using non-optimal techniques, which means there’s still room for improvement and optimization.
00:25:36.080
The first challenge is heap parsability. With the current Ruby heap, where each object occupies a single slot, we can directly jump to specific indexes in the heap and access the corresponding object. However, with variable width allocation, we can no longer do that.
00:25:59.760
Payloads of the objects we store can contain arbitrary data, so a payload slot may appear to be an object. To address this issue, we need to scan from the beginning of the page to determine whether a particular slot represents a Ruby object.
00:26:13.920
This complex scanning necessitates thorough checks, impacting performance and leading to poor cache performance.
00:26:30.720
The current implementation does not support compaction, meaning every object allocated via variable width allocation is pinned and cannot be moved by the compactor. This results in memory fragmentation.
00:26:45.680
Fragmentation occurs when sufficient memory exists overall for object allocation, but there is no single region large enough to fit the request.
00:27:01.120
For example, with occupied slots colored red and three slots colored white, if we attempt to allocate an object that requires three slots, we'll find inadequate space.
00:27:16.480
If we supported compaction, all objects could be consolidated at the beginning of the heap, creating adequate empty slots to potentially satisfy our future allocation requests.
00:27:36.760
Our current implementation employs a simple first fit allocation strategy, but this can lead to inefficiencies.
00:27:49.760
In highly fragmented situations, where no suitable slot exists for allocation, we may have to explore all free slots to confirm this, which consumes unnecessary computational resources.
00:28:05.760
The elevated computation demands and poor cache locality can cause us to read from many scattered memory regions.
00:28:19.760
Lastly, Matt will share our future plans and the direction this project is heading.
00:28:36.080
It should be evident that much work remains to fully realize the vision for true variable width allocation. We're focusing on two major components.
00:28:48.600
The most ambitious undertaking is our plan to change the structure of heap pages. Currently, all heap pages are the same size and store 40-byte-wide objects.
00:29:04.720
We are exploring the potential for allocating heap pages with size classes, enabling different classes to support varying slot sizes. A size class allocation approach is already implemented in Jemalloc, which is a popular alternative to the libc malloc library.
00:29:21.520
To determine our size classes efficiently, we’re evaluating the distribution of object sizes across various Ruby workloads.
00:29:37.680
Moreover, we’re investigating the possibility of removing the TPayload header, which currently separates the RValue from its data by eight bytes.
00:29:54.960
Doing so would enhance CPU cache utilization by bringing the RValue closer together with its data. These two changes aim to overcome many existing challenges.
00:30:11.920
If pages are grouped into size classes, we could revert to allocating a single slot from the free list, thus restoring allocation speed.
00:30:27.600
Furthermore, we would be able to utilize the existing two-finger compaction algorithm without modifications, as each object and its payload would reside in a single slot.
00:30:43.360
Additionally, addressing slots accurately would become feasible without skipping payload bodies, as those will no longer span multiple slots.
00:30:59.440
In conjunction with these changes, we’re also extending our variable width allocation implementation to encompass more objects beyond our reference implementation on classes.
00:31:15.280
Classes were the simplest choice due to the fixed size of the allocation region, known at compile time. We need to support objects whose size isn’t known at compile time or could change during our Ruby program's runtime, like strings, hashes, and arrays.
00:31:31.760
Currently, we lack the capability to resize a payload region, which is essential for handling these adaptable objects.
00:31:44.440
That’s it, folks! In this talk, we covered how Ruby's memory allocator and garbage collector function, the issues in Ruby's memory layout, and our attempts to resolve them with variable width allocation.
00:32:05.760
We're continuously enhancing the implementation, and there’s still a long way to go. Thank you for listening and thanks to Euruko for having us. Join us for the Q&A session for any questions you may have.
00:32:19.840
Thank you, Matt and Peter! I see you are both online. Yes, I can see two Matts and two Peters. I’m in good company now. Stick around for an AMA with some Shopify Rails teams.
00:32:55.519
At 5:15, there are also several Cute Foundation team members at the Shopify Discord channel, and you can find the discord link in the stream chat.
00:33:10.399
Before the AMA, we’ll have some comments from our speakers and some questions. First, I’d like to thank you both for this excellent presentation.
00:33:24.880
Peter and Matt, you delivered an awesome deep dive. The audience is truly delighted by your talk.
00:33:41.040
A question from the audience: does 'page' in this talk refer to a virtual memory page, or do Ruby internals have their own concept of a page as a memory region?
00:33:53.520
I can take this one. Inside the Ruby VM, it refers to memory structures called pages. The heap is composed of these pages, which are different from system heap or virtual memory pages.
00:34:05.680
We apologize for not clarifying this aspect better. Another question: when searching for free space needed for three slots, is it not doing this efficiently?
00:34:19.680
Indeed, this process can be relatively slow. In the worst case, it needs to scan every free slot on the page. This is a challenge we’re currently addressing.
00:34:33.760
I am out of my depth on this topic—I have no real skills to continue the discussion. Again, my appreciation for the remarkable work you had just shared.
00:34:43.440
The detail-rich concepts explained so well—thank you once more, Matt and Peter!