00:00:47.079
Thank you very much for having me here. It's a pleasure to be in Naha talking to you all today.
00:00:49.280
I am privileged to have the opportunity to discuss a fun little performance regression that we found and fixed regarding string interpolation during the development of Ruby 3.3.
00:00:58.320
My name is Matt, also known as eightbitraptor on the internet. I am a full-time CRuby core committer, and I work at Shopify on the Ruby Infrastructure team.
00:01:15.040
This talk is related to variable width allocation, a feature that my team and I developed to improve the performance of Ruby by changing the way memory was laid out. We shipped the first version of this feature for Ruby 3.2 and continued to work on it during the development cycle for 3.3 to further improve performance and refine the implementation.
00:01:38.280
While working on this, we noticed a performance regression that we wanted to fix. I set out to determine the source of that regression and whether it was related to our changes.
00:01:58.039
I won't spend much time explaining variable width allocation during this talk since it's been covered in previous RubyKaigi events and has been part of the implementation for a couple of years. However, it's important to understand how Ruby uses memory to comprehend this talk, so I will give a quick summary of how allocation works.
00:02:02.680
When you create a Ruby object, it gets allocated into a specific part of the heap called a size pool based on the size of the object. The smallest size pool holds objects that are up to 40 bytes, and those thresholds double from there.
00:02:21.560
The largest size pool we have is 640 bytes, and anything larger than that is handled separately. When we create an object, Ruby chooses the appropriate size pool depending on the total size of that object.
00:02:30.680
An object that fits completely within the slots of its respective size pool is called an embedded object. However, objects are malleable; they can be resized after allocation. For example, if we substitute part of a string for something longer using gsub!, the original object will be modified directly.
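As a concrete illustration, here is a small sketch (assuming the objspace standard-library extension is available) of a string losing its embedded status after gsub! grows it past its slot:

```ruby
# sketch: a string losing its embedded status when it outgrows its slot
require "objspace"
require "json"

s = +"hello world"                          # 11 bytes, starts embedded
before = JSON.parse(ObjectSpace.dump(s))["embedded"]
s.gsub!("world", "x" * 100)                 # now far too large for its slot
after = JSON.parse(ObjectSpace.dump(s))["embedded"]
p [before, after]                           # embedded before, but not after
```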
00:02:44.599
This may cause it to no longer fit in the size pool where it was allocated. In this case, Ruby allocates a chunk of memory elsewhere in the system and creates a pointer from the original object in the garbage collection (GC) heap to that piece of memory. We then refer to this object as an extended object rather than an embedded object because it no longer fully fits in its original slot.
00:03:13.720
We can inspect VM-specific information about objects using the ObjectSpace.dump API, which provides details about an object's size, the pool it's allocated in, and more. Running code like this will yield output showing this data about the string: its address, whether it's frozen, whether it's write-protected, and other attributes.
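The transcript doesn't show the code itself, but a minimal sketch of this kind of inspection might look like the following (ObjectSpace.dump comes from the objspace standard-library extension):

```ruby
require "objspace"
require "json"

str  = "a" * 5
info = JSON.parse(ObjectSpace.dump(str))
p info["type"]      # => "STRING"
p info["bytesize"]  # => 5
p info["embedded"]  # => true, since 5 + 16 + 1 = 22 bytes fits the smallest slot
```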
00:03:37.640
Additionally, it will display whether the string is embedded, alongside its size and which pool it lives in. For example, a string embedded in the 80-byte size pool might only use 36 bytes of memory to hold its data. To understand why it lands in that pool rather than a smaller one, you need to know that every Ruby object is prefixed with 16 bytes of metadata.
00:04:44.759
This metadata contains internal status flags about the object and information about, or a pointer to, its class. Moreover, Ruby strings are null-terminated, which requires one extra byte for the terminator. So a string whose data is only 36 bytes actually needs 36 + 16 + 1 = 53 bytes in total, which is why it is embedded in the 80-byte size pool rather than a smaller one.
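Putting those numbers together, here is a small Ruby sketch (using the header size and pool thresholds from the talk) of how a payload maps onto a size pool:

```ruby
# sketch: which size pool a string payload lands in, assuming the 16-byte
# object header and one-byte null terminator described above
HEADER_BYTES = 16
SIZE_POOLS   = [40, 80, 160, 320, 640]

def pool_for(payload_bytes)
  total = HEADER_BYTES + payload_bytes + 1  # +1 for the null terminator
  SIZE_POOLS.find { |slot| total <= slot }  # nil means "handled separately"
end

p pool_for(5)   # => 40 (5 + 16 + 1 = 22)
p pool_for(36)  # => 80 (36 + 16 + 1 = 53)
```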
00:05:08.360
So, let's recap our background. At one point while working with variable width allocation, I had code that looked something like this. It was allocating a new string, constructed from a shorter and a longer string using string interpolation.
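The exact code isn't in the transcript, but judging from the byte sizes discussed next, it was something along these lines (the literal contents here are placeholders):

```ruby
short    = "hello"               # 5 bytes
long     = "b" * 29              # stand-in for a 29-byte string literal
combined = "#{short}: #{long}"   # 5 + 2 + 29 = 36 bytes of payload
```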
00:05:53.600
I inspected the strings using the ObjectSpace.dump API to ensure they were being allocated correctly. The first string looked good; it was embedded in the 40-byte bucket with a byte size of 5. Remember, 5 plus 16 gives 21, plus the null terminator results in a total size of 22 bytes.
00:06:02.160
The second string was also embedded and in the 80-byte bucket, with a byte size of 29. Considering all the additional overhead, its total was 46 bytes. However, the third string, which was the result of interpolating the first two strings into a new object, had a byte size of 36. When factoring in the extra header and the terminator, it should have resulted in a total object size of 53 bytes.
00:06:59.360
So, why was it placed in the 40-byte bucket? The embedded flag was absent, which meant the string had been created as an extended object: its slot held the header plus a pointer to memory allocated elsewhere. I suspected this was a bug, so I began investigating, starting with what the Ruby VM does when it executes that code.
00:07:40.040
I ran the code again, dumping the instructions using the --dump=insns command-line option. This parses and compiles our code but stops short of executing it, allowing us to print the bytecode to the screen and see exactly what's going on inside the interpreter. The output reveals the sequence of instructions the VM executes to run the string interpolation code.
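The same output can also be produced from Ruby itself; a small sketch using RubyVM::InstructionSequence (rather than the command-line flag) looks like this:

```ruby
# compile the code without executing it, then print the bytecode
code = 'a = "foo"; b = "bar"; c = "#{a} #{b}"'
iseq = RubyVM::InstructionSequence.compile(code)
puts iseq.disasm  # includes setlocal instructions and a concatstrings instruction
```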
00:08:39.040
First, Ruby retrieves our short string, which, because it's defined right inside the source code, has already been created as an object during the compilation phase. That string can be pushed directly onto the stack and assigned to a local variable using the setlocal instruction. The longer string undergoes the same process and is likewise assigned to a local variable.
00:09:43.360
Next, the interpolation itself is handled by a concatstrings instruction, which pops three strings off the stack and combines them into our new string.
00:10:16.360
Given that most of the work happens in this concatstrings instruction, I decided to investigate it further. All of the Ruby VM's instructions are defined in a file called insns.def, written using a C-based DSL.
00:10:59.320
During the compilation stage of the Ruby interpreter, this DSL is processed to generate a C function for every instruction, which is then linked into the VM core. This structured format makes it easier to read and reason about the instructions.
00:11:36.840
Each instruction begins with a call to the DEFINE_INSN macro, followed by the name of the instruction (in this case, concatstrings) and a list of its parameters and their types.
00:12:17.440
Next comes the body of the instruction, which executes whenever the VM reaches a concatstrings instruction in the bytecode. It passes our num parameter (in this instance, 3) to the STACK_ADDR_FROM_TOP macro to grab the top three elements off the stack.
00:12:51.439
This leads to the function rb_str_concat_literals, which, at first glance, has two rough halves. The first half builds a string object that serves as our return value, while the second half iterates through the array of strings passed in and appends them onto that initial string.
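As a rough Ruby-level sketch of that shape (the real function is C; this only mirrors the two halves):

```ruby
def concat_literals(strings)
  # first half: compute the final length and build the result string
  total  = strings.sum(&:bytesize)
  result = String.new(capacity: total)
  # second half: append each input onto the result
  strings.each { |s| result << s }
  result
end

p concat_literals(["foo", " ", "bar"])  # => "foo bar"
```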
00:13:24.679
However, the first half felt odd to me: it loops through the strings to calculate the total length needed for the final interpolated string, but then compares it against a hardcoded magic number, MIN_PRE_ALLOC_SIZE, to decide whether to take an optimized path.
00:14:02.960
This is where I spotted the bug. The total length of the three strings was 36 bytes, less than MIN_PRE_ALLOC_SIZE's 48 bytes, so the code took the string resurrection path, which reuses a copy of the first string as the base of the result. However, that first string was allocated in the 40-byte size pool.
00:15:05.639
Appending the remaining strings onto it quickly pushed past that slot's boundary, causing Ruby to convert it to an extended object, which was the source of the performance regression we observed. I went looking for the rationale behind this optimization and found it had been implemented back in 2017.
00:15:58.760
The original author confirmed that MIN_PRE_ALLOC_SIZE was a magic number chosen experimentally on his laptop at the time, raising doubts about whether the optimization remained relevant today.
00:16:26.440
I benchmarked with and without the optimization and discovered that removing the check slowed short-string interpolation by around 9%, which indicated that the optimization likely still had merit.
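This isn't the benchmark from the talk, but a minimal sketch with the stdlib Benchmark module shows the kind of workload being measured:

```ruby
require "benchmark"

a = "hello"
b = "b" * 29  # placeholder strings; the talk's exact inputs aren't shown
elapsed = Benchmark.realtime { 100_000.times { "#{a} #{b}" } }
puts format("%.4fs for 100k short-string interpolations", elapsed)
```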
00:17:02.960
To address the bug while retaining the optimization, I modified the conditional so that, rather than deciding based on an arbitrary length, we check whether the final string will fit within the same size pool as the first string. That way we avoid the allocation penalty of converting it to an extended object.
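A Ruby-level sketch of the difference between the two conditions (the names and the 48-byte value follow the talk; the real check lives in C):

```ruby
HEADER     = 16
SLOT_SIZES = [40, 80, 160, 320, 640]

def slot_for(payload_bytes)
  SLOT_SIZES.find { |s| HEADER + payload_bytes + 1 <= s }
end

# old heuristic: take the resurrection fast path whenever the total is "small"
def old_fast_path?(total_len)
  total_len < 48  # MIN_PRE_ALLOC_SIZE
end

# new heuristic (sketch): take it only when the result still fits in the
# first string's size pool, so the object never needs extending
def new_fast_path?(first_len, total_len)
  slot_for(first_len) == slot_for(total_len)
end

p old_fast_path?(36)     # => true  (fast path taken, then outgrew the 40-byte slot)
p new_fast_path?(5, 36)  # => false (5 maps to the 40-byte pool, 36 to the 80-byte pool)
```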
00:17:38.960
After running benchmarks again, performance numbers were much closer, although some small variations were still present. These could be accounted for by typical benchmarking noise, given the differences between the environments used in the original benchmarks and my own.
00:18:09.960
While the string was now embedded in the correct size pool, I wanted to address why string resurrection produced a 9% improvement when using arbitrarily small strings and whether this was affected by the allocation specifics.
00:18:36.720
Upon revisiting the allocation path, I discovered it performed extra work that the string resurrection path did not, particularly copying the encoding from the head of the string array into the new string. That copy goes through a public API function that performs strict safety checks, since it may be handling arbitrary user input, and those checks add overhead.
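Those checks exist for good reason; mixing incompatible encodings during interpolation must fail, as this small example shows:

```ruby
utf8   = "héllo"   # UTF-8, contains a multibyte character
binary = 0xFF.chr  # ASCII-8BIT with a high byte

begin
  "#{utf8}#{binary}"
rescue Encoding::CompatibilityError => e
  puts e.class  # the interpolation is rejected as incompatible
end
```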
00:19:24.639
By benchmarking this again without the need for safety checks, I confirmed that it was indeed the overhead from the encoding checks that caused performance issues. In contrast, the string resurrection path bypassed those checks as it builds from known, valid strings.
00:20:03.959
Digging further, I found a private function, str_enc_copy_direct, that copies the encoding while skipping the safety checks. I refactored the code to pre-allocate the string and use this faster internal function for the encoding copy.
00:20:54.799
This change streamlined the code structure, removed unnecessary complexity, and ensured that strings are allocated into the correct size pools.
00:21:48.360
In doing so, performance improved dramatically—doubling performance for interpolations within the same bucket and enhancing performance across buckets. This refactored code made its way into Ruby 3.3. Now, string interpolation is not only faster but cleaner than before.
00:22:38.960
I appreciate your attention while I shared this debugging story, which, while not groundbreaking, illustrates some valuable lessons learned along the way. Small changes can have substantial impact, as demonstrated by an eight-line reduction producing nearly a 2x speedup in specific cases.
00:23:23.820
This reflects the importance of questioning our assumptions and encouraging curiosity. Don’t hesitate to probe into areas that are not clear to you—these efforts can lead to meaningful discoveries.
00:24:13.600
I am truly thankful for the collaboration and support from my colleagues, particularly the original authors of the optimization, whose documentation was vital in unraveling the complexities I faced in this journey.
00:24:50.960
To everyone out there, I encourage you to explore the Ruby codebase and engage with it. Contributing offers valuable learning experiences, so keep diving in and remember to have fun! Thank you very much for listening.