The Second Oldest Bug

RubyKaigi 2023

00:00:07.080 In this presentation, I'll be discussing what was Ruby's second oldest open bug and how I fixed it in Ruby 3.3. My name is Jeremy Evans, and I'm a recommitter who focuses on fixing bugs.

00:00:14.219 I'm also the author of "Polished Ruby Programming," which was published in English a couple of years ago and was published in Japanese last month. This book is aimed at intermediate Ruby programmers and focuses on teaching principles of Ruby programming, as well as trade-offs to consider when making implementation decisions.

00:00:25.800 So, what is the second oldest bug that I am discussing? I’m referring to bug 4040, which has the subject line 'system stack error with hash splat of a very large array.'

00:00:38.640 The reproduction code given in the bug report was simple Ruby code. It calls the array reference method on a hash with the splat of a very large array containing over a million entries. It turns out that this bug is not specific to that hash method; at the time the bug was reported, this issue applied to any method called with a splatted array if the number of arguments passed to the method was large enough.

00:00:55.920 You could reproduce the bug with the kernel puts method. You could also reproduce it when calling a method you defined in Ruby, and even with a method that does not accept any arguments, such as the kernel object ID method. This is because the error occurs while setting up the arguments for the method call before actually calling the method.

00:01:10.260 Ruby's behavior when passing a very large array splat to a method has changed over time. Back in Ruby 1.8, passing too many arguments to a method would result in the Ruby interpreter crashing and dumping core, both for methods defined in C and for methods defined in Ruby.

00:01:26.280 In Ruby 1.9, this condition was recognized, and instead of the program crashing, a system stack error was raised. Ruby 2.0 and 2.1 maintained the same behavior as Ruby 1.9. Then, in Ruby 2.2, Sadasan fixed this issue for methods defined in Ruby, and that same behavior remains in Ruby 3.2.

00:01:49.560 As you can probably guess, we fixed this issue in Ruby 3.3. Now, calling the object ID method with a very large array splat correctly results in an argument error. If you switch to the example code given in the bug report, you can see that it works correctly and returns a hash with a single entry.

00:02:42.720 To start working on fixing this bug, I needed to find out where the bug was occurring. When debugging in Ruby, I usually use GDB to get a general idea of how method calling works. I can run Ruby through GDB and tell Ruby to execute the 'self' method, after which I can set a breakpoint on the underlying C function that will be called.

00:03:06.300 After the breakpoint is hit, I can have GDB print the backtrace. This backtrace can be split into four separate sections: the bottom section represents everything Ruby runs before calling the 'self' method; this section is generic method calling code, regardless of how the method was defined. The next section is specific to calling methods defined in C, and finally, the top section contains code specific to the 'self' method.

00:03:31.800 Now, since this bug occurs when calling methods defined in C, it seems likely that this problem lies within one of these three functions or in one of the functions they call. While this information is helpful, it would be best to see exactly where the system stack error is being raised. We can start by grepping to see where 'system stack error' is defined, and it’s always nice when you find a single result. After a few more greps to gather information, we end up looking inside the virtual machine code.

00:04:24.000 Looking at the line highlighted, there is one indication that this is what we are looking for, and that's the function name, which contains the words 'stack overflow.' It looks like this is the function called when the stack would overflow to raise an exception instead of crashing. We can use GDB to see if that is indeed the case.

00:05:00.660 In this instance, we will pass code that should trigger the stack overflow and then set a breakpoint on the 'ec stack overflow' function. Thankfully, we guessed correctly, and we can use the backtrace command to see how we got there. Most of the backtrace is the same as the previous backtrace, but we’ve highlighted the new lines. The top two lines refer to functions with 'stack overflow' in the name, likely called when a potential stack overflow has been detected, but not the cause of the stack overflow.

00:05:41.580 The function directly before those lines, named 'vm caller setup arg splat,' is the probable cause. This function determines which object is being splatted; if the object being splatted is not nil, we assume it is an array, and we then determine the length of the array to splat. In the normal case, we copy all objects from the array to the VM stack and increase the number of arguments passed to the method by the length of the splatted array, subtracting one for the array itself.

00:06:05.400 However, before doing any copying, we check whether copying all objects to the VM stack would overflow the stack. This is the check that is hit when you are splatting a very large array. Unfortunately, there is no way to avoid this issue as long as the approach of copying all arguments to the VM stack is taken.

00:06:39.000 The only way to avoid this issue would be to not pass the arguments via the stack. If we examine the initial backtrace when calling 'self' without any arguments, we see that the function calling 'rb object self' is named 'vm call c func with frame.' The prototype for that function looks like this: it takes the receiver of the Ruby method, the number of arguments that the Ruby method was called with, a pointer to the first argument for the method, and a function pointer for the C function to call that implements the Ruby method.

00:07:20.460 Now, the function calling 'rb safe call c func0' is named 'vm call c func with frame.' This function is large, so I will focus on the line calling 'rb safe call c func0' through a function pointer. The important part here is the value that it passes as a pointer to the first argument, which is always passing a pointer to the stack. Perhaps if we change this to pass a pointer to a different location in memory, such as the heap, calling a method defined in C with many arguments would work.

00:08:03.300 We can start this process back in the 'vm caller setup arg splat' function, reducing the code focus just on the case where there is an array to splat. If some condition holds true, we should use an approach that does not copy arguments via the stack. One simple condition is if the number of arguments being passed to the method, plus the length of the splatted array, exceeds a certain high number, say a thousand, then do not use the stack for argument passing.

00:08:28.620 Now, how should you pass arguments if you're not going to pass them on the stack? One of the easiest ways to do so is by creating a temporary Ruby array for the arguments. Instead of passing a pointer to the stack, we can pass a pointer to the first element of that array. Here is the code that implements this approach: we start by creating a Ruby array with the expected capacity, and we then hide the array to ensure it cannot be accessed via object space.

00:09:03.780 We copy the methods before the splatted array to the temporary array, followed by copying the elements of the splatted array to the temporary array. We still need to pass a pointer to the temporary array of arguments on the VM stack; thus, we adjust SP, the stack pointer, to reflect that. Since there is only one argument on the stack, we adjust the calling information to show only one argument. While this is correct from a stack perspective, it is incorrect from a method calling perspective.

00:09:38.640 Finally, we set a new flag on the calling information, named 'heap argv', to indicate that the arguments are being passed on the heap in a temporary Ruby array instead of on the stack. Returning to the function for calling methods defined in C, once we have set up the temporary array of arguments on the heap, we need to modify the function to use a pointer to the first element of that array instead of pointing to the stack. We need to recognize when we should use the temporary array for arguments, and we do this by checking for the 'heap argv' flag we set earlier.

00:10:02.459 Here is the code for using the temporary array for arguments: we obtain a pointer to the temporary array of arguments; this pointer is located on the stack. Then we get the pointer to the first element of that temporary array, which is located on the heap. We use the length of the temporary array as the number of method arguments when we call the C function, using the correct number of arguments and a pointer to the first element of that Ruby array.

00:10:48.180 While the changes I discussed are the most significant changes, there were a few additional modifications needed in other functions. Most notably, we added about five different checks for the 'heap argv' flag, as well as adjusting argument handling in that case. I don't have time in this presentation to discuss every area requiring changes, but after implementing them, it didn't take much more work to get Ruby's test suite passing with this change.

00:11:09.600 Unfortunately, while the tests passed, the change resulted in a minor slowdown. The additional checks added caused method calling microbenchmarks to slow down by as much as seven percent. Given how few users are affected by this change, the slowdown in method calling to such a degree was deemed unacceptable.

00:12:00.180 I tried a couple of different approaches to speed up the implementation and ultimately managed to make the performance difference small enough that the microbenchmark no longer indicated it was slower. However, my approach was somewhat invasive, making multiple changes to a few methods used in the generic method dispatch code. Since this issue had already been solved for methods defined in Ruby, it would be better if changes could be confined to methods defined in C.

00:12:50.600 In discussions with Sadasan, I suggested we could switch to that approach, but anticipated it would result in more duplication. Thankfully, Sadasan took it upon himself to implement that non-invasive approach. It turns out I was right about the duplication; the VM call C func function before the changes was quite simple. As we review the function after the changes, we see a significant increase in lines.

00:13:42.960 In normal cases where you’re not passing a large, array splat, the two lines use the standard argument setup functions. However, those two lines have been replaced with expanded and customized versions when a large array is being splatted.

00:14:11.320 There is a maintainability trade-off here. One option is to use generic functions, which complicates those functions and could make them slower and harder to debug when code doesn’t benefit from the additional features you’re adding. On the plus side, it can make maintenance easier, especially if future code can benefit from those features.

00:14:58.140 The other option is to duplicate parts of the generic functions needed and modify them for the specific use case. This can result in faster code and minimize complexity, but it presents challenges for maintenance if changes are required in multiple locations.

00:15:34.080 The patch was merged back in January, ensuring that the second oldest bug will be fixed in Ruby 3.3. Just kidding, that's what I thought when I started working on this presentation. It turns out this bug is not a single issue but part of a general class of bugs.

00:15:55.440 My initial patch and Sadasan's patch only fixed a single instance of this class of bugs, so this particular bug was not fixed or at least not completely resolved. I will go over some code that still raises a system stack error for large argument splats even after the patch was committed.

00:16:27.900 I mentioned earlier that Sadasan made changes to Ruby 2.2 to allow this code to work. However, if you replace the normal singleton method definition with the equivalent 'define singleton method' call, it raises a system stack error. To understand why the normal method definition works but the block-based approach fails, you need to realize that Ruby has multiple method types. Each type is handled differently internally.

00:17:07.100 Ruby has a C function named 'vm_call_method_each_type' that handles all method types. This function uses a C switch statement on the type of method, similar to a Ruby case expression. Normal Ruby methods defined with def use the 'iseq' method type, and the logic specific for calling 'iseq' methods is in 'vm_call_iseq_setup.' This is the code path that Sadasan fixed in Ruby 2.2 to allow normal methods to accept large argument splats.

00:18:09.960 Ruby methods defined by C functions utilize the 'cfunc' method type. The logic specific to calling Cfunc methods is in 'vm_call_c_func,' which is the implementation path fixed by my initial patch and Sadasan's patch. However, Ruby methods defined with 'define_method' or 'define_singleton_method' use the 'bmethod' type, and logic specific to calling B methods is in 'vm_call_b_method,' which is another code path that had not been addressed.

00:18:49.500 This explains why calling a B method with a large argument splat still raises a system stack error, even after Sadasan's patch was merged. Unrelated to the current bug but useful to note is that Ruby uses a call cache to improve performance. The first time you call a method, the call cache directs you to a generic and slower method dispatch function.

00:19:33.520 Once Ruby determines a more specific and faster method handler, it updates the call cache so that future method calls at the same call site can quickly access the more optimized method. Looking at the definition of the 'vm_call_b_method' function, we need to determine why it fails. There is one similarity here with the 'vm_call_c_func' function prior to fixing the bug.

00:20:09.540 Both functions utilize the 'caller_setup_arg' function. Reviewing the backtrace for the stack overflow concerning C methods, the stack overflow occurs during a call to 'caller_setup_arg,' indicating vulnerability in the setup function.

00:20:38.700 It turns out that all callers of this function are vulnerable to this issue, and as mentioned previously, all method types other than 'iseq' utilize 'caller_setup_arg,' which is why they are susceptible to it. I also mentioned that my initial patch to fix this issue complicated the generic argument setup code, while Sadasan's patch was less invasive because it only modified the logic for Cfunc methods.

00:21:09.540 While the duplication approach may be preferable for issues specific to certain cases, since all usage of 'caller_setup_arg' is vulnerable, the duplication approach is not maintainable. The only maintainable solution that resolves the issue for all cases requires complicating the generic code.

00:22:07.440 Here's the initial code I discussed earlier to avoid the system stack error. Between when I originally submitted my pull request and when I began working on this presentation, I discovered these additional bugs. In the meantime, Sadasan improved related code to significantly boost performance.

00:22:51.000 Therefore, I had to modify my initial patch to adapt to Sadasan's improvements. One of the changes made is that at the point this code is called, the VM stack has already removed the splatted array, and another change is that any keyword splat is passed as a separate argument rather than as the last element of that array.

00:23:36.720 Instead of using separate 'rb_array_new' and 'rb_object_hide' functions, I found there was an 'rb_array_new' function that combined those features, so I switched to employing that. Now the capacity of the temporary array is set to the number of arguments before the splat plus the number of arguments in the array being splatted, plus one extra for a possible keyword hash.

00:24:11.520 Another significant change was storing a pointer to the temporary array in the calling structure instead of only keeping the pointer on the stack. This change was primarily to simplify referencing it later. Going back to the 'vm_call_b_method' function, we can modify it to use this new approach. The first change is adding an argument to 'caller_setup_arg' to indicate whether it is safe to use a temporary array for arguments.

00:25:01.260 This is only the case if a temporary array was created to handle a large argument splat. We now get a pointer to the first element in that array and need to adjust the stack pointer by two because the array counts as one argument. This adjustment matches the stack pointer modification in the normal case. Failing to calibrate the stack pointer correctly will lead to stack pointer mismatch errors or violate consistency checks, both of which are burdensome to debug.

00:25:43.020 Unfortunately, this change did not rectify the issue for B methods, due to a similar problem arising later within the VM. Following the call to 'vm_call_b_method_body,' the VM invokes a function named 'invoke_iseq_block' from C, which copies all arguments to the VM stack, potentially raising a system stack error if doing so would overflow the stack.

00:26:16.260 To address this problem, I reverted to using the same method I employed in 'caller_setup_arg' — creating a temporary Ruby array and copying the elements into it. If the total number of arguments surpasses a certain limit, we invoke a newly added 'vm_ruby_array' function to create the temporary Ruby array, copying the arguments from argV into it, then returning a pointer to the Ruby array.

00:27:16.860 We always pass an array of two arguments; the first is the temporary array, while the second argument is always a keyword hash. If 'keyword_splat' is true, this means the last argument of the array represents keyword arguments, so we remove it from the array and store it as a separate argument. If 'keyword_splat' is false, we add an empty keyword hash.

00:27:56.640 With these modifications, we can now call methods defined with 'define_method' or 'define_singleton_method' using a large array splat. However, additional cases also failed; while passing a large array splat typically worked with normal method definitions starting in Ruby 2.2, using 'send' on the same method raised a system stack error.

00:28:41.460 Similarly, using 'symbol_to_proc' to obtain a proc and calling that with a large array splat caused a system stack error. Additionally, if you generated a method object for the method and called it with a large array splat, it raised a system stack error. Furthermore, defining 'method_missing' and then invoking a non-existent method with a large array splat also led to a system stack error.

00:29:24.540 The same issue occurred for C extensions when using 'rb_yield_block' with a large number of arguments. Those instances were the most critical cases demanding fixes to ensure the capability of handling a considerable number of arguments. Additionally, in circumstances where a method type only accommodates zero or one argument, passing a large array splat instead resulted in a system stack error instead of an argument error.

00:30:07.320 Returning to the normal method definition, when changing the argument splat to a regular argument, if you call the method with a large array splat, it is preferable for an argument error to be raised, but Ruby raises a system stack error instead. This dilemma also applies to classes utilizing attr_reader, where calling the method defined with a large array splat raises a system stack error rather than an argument error.

00:30:54.720 The same circumstance is true for attr_writer. Likewise, if you create a struct class and invoke the member reader method for an instance of that struct using a large array splat raises a system stack error instead of an argument error. Addressing this bug for method types capable of accepting a large number of arguments and raising the correct exceptions for types accepting zero or one argument proved to be time-consuming.

00:31:31.740 One complication with this case is that this issue isn't just one bug but a class of related bugs. It’s worth noting that nearly no code in Ruby’s test suite currently passes a large array splat. Running Ruby's accompanying test suites is unlikely to catch such issues arising from the changes I was making.

00:32:15.660 To gain a greater level of confidence that the bug fixes were effective, I conducted most of my debugging by enforcing temporary array usage for all method calls with splats instead of adhering to the default behavior of only utilizing temporary arrays for large arrays.

00:32:56.880 Ultimately, I managed to ensure the entire test suite passed after fixing all cases I was aware of where passing a large array splat caused a system stack error. However, just because a bug has been completely resolved doesn’t mean the fix should be committed. When deciding on committing changes, one must understand the costs and benefits of the change and whether it outweighs them.

00:33:31.740 In this scenario, the benefit is being able to splat an array when calling any Ruby method while expecting the desired behavior regardless of the array's size. However, I believe that the actual need to pass a large array splat is uncommon and that it is almost always more efficient to provide a large array of arguments as a regular argument instead.

00:34:17.900 Significantly, there are considerable costs associated with the fix. For one, the modifications have proven invasive, particularly more so than my initial efforts to address the bug for Cfunc methods. Additionally, there’s a minor slowdown that accompanies this change. Given the efforts to enhance Ruby's performance, committing this patch may seem like a regression.

00:35:01.260 I worried that committers would be disinclined to accept a slight slowdown to remedy these edge cases. Thankfully, my experiences resolving these bugs taught me considerable insights regarding Ruby's internals, and from those insights, I devised a series of patches to optimize method calling in particular instances.

00:35:45.480 One notable optimization involved B method calls. Here is the code for 'vm_call_b_method.' Notice that this function consistently invokes 'caller_setup_arg.' The rationale for this is that it flattens the arguments into a C array, allowing a pointer to the first argument to be passed to 'vm_call_b_method_body.' I determined this call to 'caller_setup_arg' is not required in common scenarios where the block for the B method is defined in Ruby.

00:36:38.940 It is essential for other instances where the block was defined in C or created by 'symbol_to_proc.' I thus decided to segregate B method call handling into two pathways: one to manage B methods where the block is defined in Ruby, and another for other block types. All this code is necessary to determine if the block had been defined in Ruby.

00:37:32.760 If the block was defined in Ruby, we can call an optimized function that does not require us to utilize 'caller_setup_arg.' Conversely, if the block was not defined in Ruby, we call another function that utilizes 'caller_setup_arg,'' mirroring the previous implementation. Before invoking any function, we update the call cache to ensure future calls at the same call site can directly transition to the optimized method and bypass all initial code.

00:38:23.880 I included benchmarks and discovered these changes improved B method calls by forty percent in simple cases and up to 180 percent in scenarios involving keywords. Another optimization was made regarding 'method_missing' calls. Here is the top of the 'vm_call_method_missing' body prior to the bug fix and optimization. I determined that this specific invocation of 'caller_setup_arg' is unnecessary, as 'method_missing' calls do not need to modify existing arguments; they only need to appraise arguments.

00:39:12.840 Hence, the call to 'caller_setup_arg' can be completely eliminated. However, it is imperative to rectify the calling flags; these flags encompass information concerning whether the call comprises an argument splat or keyword arguments. Since ‘caller_setup_arg’ was previously invoked, new calling flags were generated. However, upon removing this invocation, we can merely duplicate the calling flags from the original method call that led to 'method_missing,' which resolves the issue.

00:40:02.340 This simple modification improved 'method_missing' calls by ten percent in uncomplicated cases and up to a hundred percent for calls involving keyword arguments. I implemented a similar change to calls to symbol procs — procs created using 'symbol_to_proc.' This optimization resulted in a five percent enhancement for simple calls and up to one hundred percent for keyword argument calls. I executed a similar adjustment for method calls utilizing 'send,' which improved performance by five percent for uncomplicated calls and up to 115 percent for those employing keyword arguments.

00:40:43.680 To evaluate the overall impact of the bug fixes in conjunction with the optimizations, I used 'wide jetbench,' a testing tool developed by Shopify to assess performance. 'Wide jetbench' serves as an excellent set of general benchmarks and encompasses thirty-three distinct test cases. Results indicate performance remained relatively unchanged in twelve cases, eight of which benchmarks witnessed a slowdown, despite the optimizations, by as much as three percent in the least favorable instance.

00:41:36.540 However, thirteen benchmarks improved, with some showing speed enhancements of up to ten percent because the performance increase due to optimizations surpassed the decrease stemming from the bug fixes.

00:42:27.060 After finishing those method calling optimizations, I raised this issue as a discussion topic last month within the monthly developer meeting. This allowed other committers to offer feedback, and Mats could decide whether the benefits outweighed the costs. Concerns arose regarding performance and the invasiveness of the changes, but ultimately, I received approval to merge the modifications.

00:43:31.140 There was one significant caveat: I mentioned earlier that I had fixed all the bugs I was aware of, but it should be noted that I had resolved all bugs within the virtual machine. It turned out the YG did not support the alterations I had made, resulting in the widget tests failing on arm64 with sporadic failures on AMD64.

00:44:06.540 I discussed those issues with the widget team, and thankfully, they swiftly remedied the widget-related problems. After addressing the widget issues and conducting one last round of testing, I successfully merged the changes, ensuring that the second oldest bug and its various instances will finally be resolved in Ruby 3.3.

00:44:42.900 Here are the lessons I learned from my experience fixing these bugs: First, just because a bug is old does not mean it cannot be fixed. Splatting a very large array during a method call had been an issue since support for splatting arrays was introduced, and this problem had been known for over a decade before I began rectifying it. In general, it takes one determined individual to resolve an issue.

00:45:18.420 Secondly, fixing an old bug often serves as a learning opportunity, imparting knowledge you might not have otherwise acquired through experience. The knowledge I gained from rectifying these bugs equipped me to implement performance optimizations that I previously mentioned.

00:46:04.559 Lastly, don't fret if you cannot perfectly resolve any Ruby bug all by yourself. As this situation shows, other committers are likely available to assist you in enhancing your bug fixes, making them suitable for commitment. Ruby currently has over eighty open bugs in the bug tracker that have surpassed five years, just awaiting your contribution.

00:46:44.979 We look forward to your contributions! I hope you enjoyed learning about Ruby's second oldest bug and the steps taken to fix it in Ruby 3.3. If you found this presentation enlightening, I highly encourage you to check out my book, 'Polished Ruby Programming.' That concludes my presentation, and I would like to thank all of you for your attention. A special thanks goes to Shopify and QuickPad for sponsoring my travel to RubyKaigi.