Optimizing Partial Backtraces in Ruby 3

Backtraces are very useful tools when debugging problems in Ruby programs. Unfortunately, backtrace generation is expensive for deep callstacks. In older versions of Ruby, this is true even if you only want a partial backtrace, such as a single backtrace frame. Thankfully, Ruby 3 has been optimized so that it no longer processes unnecessary backtrace frames, which can greatly speed up the generation of partial backtraces. Join me for an interesting look a Ruby backtraces, how they are generated, how we optimized partial backtraces in Ruby 3, and how we fixed bugs in the optimization in 3.0.1.

RubyKaigi Takeout: https://rubykaigi.org/2021-takeout/presentations/jeremyevans0.html

RubyKaigi Takeout 2021

00:00:02.159 Hello everyone.

00:00:03.520 In this presentation, I will discuss backtraces. I will explain why backtraces are important, the implementation of backtraces in Ruby, and how the generation of partial backtraces was slow in older versions of Ruby, as well as how it was fixed in Ruby 3.

00:00:06.960 I will also cover the bugs that occurred as a result of the optimization and how those bugs were fixed.

00:00:08.960 My name is Jeremy Evans, and I'm a Ruby committer who mostly focuses on fixing bugs in Ruby. I'm also the author of the book, 'Polished Ruby Programming,' which was published a couple of months ago. This book is aimed at intermediate Ruby programmers and focuses on teaching principles of Ruby programming, as well as trade-offs to consider when making implementation decisions.

00:00:12.400 So, why do we care about backtraces? Consider programming without them. Let's say you are working on a refactoring in your library and, after making your changes, you run the tests, only to find that one test fails with an ArgumentError, outputting something similar to this. This output informs you that you have an error and what type of error it is. However, without a backtrace, you have no idea where the error is occurring, which makes debugging significantly more difficult.

00:00:45.520 So, backtraces are important. However, how important is backtrace performance, specifically partial backtrace performance? I do not actually know the exact answer to that question. As I mentioned earlier, I focus most of my time on Ruby by going through the bug tracker and fixing bugs, not necessarily on implementing performance improvements. However, it happened that another Ruby committer filed issue #17031 on the issue tracker regarding the lack of optimization for partial backtraces.

00:01:18.799 The other committer mentioned that they discovered the issue while checking the memory usage of RuboCop. In any case, this performance issue was filed as a bug, and since I typically try to fix many Ruby bugs, I did some research into it. This was the first time I had looked into how backtraces are generated in Ruby, so it was a learning experience for me. I think the process of how Ruby generates a backtrace is interesting, and I'm guessing most of you are unfamiliar with it, as I was before working on this bug. So, I will go over the process now.

00:01:58.160 Before understanding the backtrace generation process, it helps to better understand what backtraces are. In most programming languages, backtraces are a representation of the call stack. So, let's start with an example of a call stack. We'll begin with a method called 'foo' that calls 'bar,' which in turn calls 'caller' with zero passed as the argument. This makes the backtrace include the call to the 'foo' method.

00:02:11.360 The default value for 'caller's' first argument is one, which excludes the method that calls 'caller.' Then we have the 'bar' method that simply yields. To see the results of the backtrace, we will use the command 'puts foo,' and these are the results. Note that the dashes here represent the filename. In this case, I was providing the Ruby code to execute via standard input, in which case Ruby uses the dash to represent the filename.

00:02:38.560 The first entry in the backtrace is the call to 'caller' inside the block passed to 'bar.' Each call to a block has a separate entry in the backtrace. The second entry in the backtrace is the yield in 'bar,' which is the same as calling the block that was passed to 'bar.' The next entry is the call to 'bar' inside 'foo,' and the last entry in the backtrace is the call to 'foo' in the main script. I mentioned earlier that a backtrace is a representation of the call stack.

00:03:02.919 If you read the backtrace from bottom to top, you can see how each call was added to the stack. In the main script, you called 'foo,' which called 'bar,' which then called the block in 'foo.' Now, let's dive into how backtraces are generated because Ruby does not have a call stack as a separate data structure. Ruby uses a virtual machine to execute code. In order to execute your Ruby code, Ruby compiles your Ruby source code into bytecode that can be executed on the virtual machine.

00:03:35.840 The virtual machine then processes the bytecode for execution. It keeps track of the execution of the current method or block using a virtual machine control frame, often abbreviated as VM frame control frame or just frame. The virtual machine tracks these control frames using a stack. Each time a method or block is called, the virtual machine pushes a frame onto the stack, and when the method or block returns, it pops the frame from the stack. I just described something that sounds a lot like a call stack, but earlier I said that Ruby does not have a call stack.

00:04:08.160 The reason for the difference is that the virtual machine control frame stack is not exactly a call stack. There are control frames that are not part of the call stack and are ignored during backtrace generation. We will later see that these control frames, which do not belong to the call stack, end up causing bugs in the optimization of partial backtraces. Like most parts of Ruby, the implementation of backtraces is written in C. Backtrace handling is implemented in Ruby's vm_backtrace.c file.

00:04:43.360 To make the backtrace generation process more understandable to Ruby programmers, this presentation will translate the algorithm used into Ruby pseudocode. I will also slightly simplify it to avoid parts that are not important to this discussion. We will start with a description of the backtrace generation used in Ruby 2.7, which is mostly the same for older versions of Ruby, to describe the process before the optimization.

00:05:13.920 The backtrace generation process starts with the virtual machine's end control frame. This is the starting control frame for backtrace generation because the Ruby 2.7 backtrace generation process runs from the earliest control frame to the current control frame, in the order that each control frame was added to the stack. The first two control frames after the end control frame should be skipped, as they are considered dummy frames from the perspective of backtrace generation.

00:05:36.560 Assuming the control frame's next method returns the control frame pushed after the receiver, the backtrace is returned as an array. Thus, we will initialize an empty array for it. The backtrace process will iterate over the remaining control frames. It needs to know the total number of backtrace frames and have a counter that is incremented for each control frame to ensure that it does not process off the end of the control frame stack.

00:06:08.480 The majority of the backtrace process happens while iterating over the control frames, which we will do here using a while loop. At the end of each loop iteration, we will move to the next frame and increment the counter. Here is the main part of the backtrace generation algorithm: as I mentioned earlier, there are different types of virtual machine control frames, and not all of them are considered in backtrace generation. Only two types of control frames are taken into account during backtrace generation: isec frames and c-funct frames.

00:07:00.320 Isec stands for instruction sequence, and isec frames are usually, but not always, frames for normal methods defined in Ruby. C-funct stands for C function, and c-funct frames are frames for calling methods defined in C. Isec and c-funct frames must be treated differently during backtrace generation because methods defined in C do not have file or line numbers associated with them.

00:07:39.760 The location of c-funct frames is the same as the preceding isec frame. This is the reason that the backtrace is processed starting with the virtual machine’s end control frame. The generation process does not use all isec frames; it only uses isec frames that have an instruction pointer. Isec frames without an instruction pointer can occur for internal functions that call blocks, such as Enumerable's group_by method. These frames should be ignored, as otherwise a single method call would result in multiple backtrace entries, one of which wouldn't have location information.

00:08:16.640 Since the location information is determined using the instruction pointer for the isec frames that have an instruction pointer, these represent methods defined in Ruby. In that case, we need to use the location info from the current isec frame for any future c-funct frames until the next isec frame. Therefore, we record the location of the call in the current control frame to use it if needed when creating the backtrace entry for a future c-funct frame.

00:08:53.680 After recording the location, we add the backtrace entry to the array storing the backtrace. For c-funct frames, we also add the backtrace entry to the same array. However, creating the backtrace entry for a c-funct frame requires knowledge of the last isec location, which we pass in when creating the entry. After this whole process completes, we reverse the backtrace array since we want to return the backtrace ordered from the most recent call to the oldest call. Finally, at the very end of this process, we use the arguments provided to 'caller' to extract the entries in the backtrace that we are interested in.

00:09:37.280 Caller's default start value is 1, and the default length value is nil, meaning all entries in the backtrace after the first entry. Hopefully, this demonstrates why partial backtraces were slow; they required unnecessary work. For instance, if you only want the most recent caller entry before the current method, you would call 'caller' with 1 as the start entry and 1 as the length, which returns an array of at most one backtrace entry.

00:10:05.760 However, to obtain that array, Ruby would still need to create the backtrace entries for most or all of the control frames before filtering it to only the requested frames. So, the problem was clear: Ruby was doing unnecessary work that would ultimately be discarded. The solution was equally clear: we needed to avoid this unnecessary work and only add the backtrace frames that we cared about.

00:10:27.040 As the first step in implementing the solution, we need to consider the start and length arguments passed to 'caller.' In this preliminary example, we will assume that both were provided, and that the start value plus the length is still less than the number of control frames. To illustrate the changes in this case, instead of iterating over the total number of control frames, we will use the start argument passed to 'caller' to reduce the number of iterations, which we will store in a variable named 'last.'

00:10:57.760 We calculate the starting point where we actually want to add backtrace entries by subtracting the desired length of the backtrace from the number of iterations. We still iterate starting from the same control frame we used in the previous implementation, but until we reach the starting point, we only record the last isec location for every isec control frame with an instruction pointer, without adding any backtrace entries. We need to do this so that if the first frame in the returned backtrace is a c-funct frame, it will have the correct location, even though it obtains that location from an isec frame not included in the returned backtrace.

00:11:36.720 I mentioned that in this initial example, we will assume that the start and length were both provided, and that their combination is still less than the number of control frames. However, 'caller' can provide arguments where the length is nil or where the start frame plus the length exceeds the number of control frames, in which case this logic is not sufficient. Therefore, to manage scenarios where the start plus the length exceeds the total number of frames, we should replace it with the following code: if the start argument passed to 'caller' is greater than the number of frames, then the caller needs to return nil instead of an array.

00:12:16.560 If the start plus the length is greater than the number of frames, or if no length argument is provided, the starting point for backtrace generation should be set to zero, ensuring that backtrace generation begins at the outermost frame – the main script. This algorithm is basically what Ruby 3.0.0 shipped with. It passed all tests and specifications, resulting in significantly faster partial backtraces, especially in scenarios involving many control frames.

00:12:48.720 I was quite pleased with my work, having declared victory over this long-standing performance issue. Unfortunately, the optimized algorithm turned out to be subtly incorrect. It took a whole month after the release of Ruby 3.0.0 before someone reported issues with it. As detailed in issue 17581, the optimized algorithm could result in missing backtrace entries.

00:13:20.720 The way the algorithm was implemented, it was implicitly expected that each loop iteration would yield a backtrace entry to arrive at the expected number of entries. A problem arises when there is an isec control frame without an instruction pointer. In this case, an iteration occurs without adding a backtrace entry; thus, if you are past the start iteration, any iterations where an isec control frame is encountered without an instruction pointer results in that many missing backtrace entries.

00:14:06.560 The solution to this issue is to decrement the counter if this situation occurs; this ensures that the desired number of backtrace frames is generated. However, certain conditions can lead to the processing of the end of the control frame stack. To prevent this, a separate counter needs to be maintained, which is compared against the total number of control frames. This extra measure guarantees that even if you are skipping control frames, you do not go beyond the bounds of the control frame stack.

00:14:46.560 Again, I was quite pleased with myself as I was able to quickly and easily address this regression. Unfortunately, while that change corrected the issue of missing frames in the backtrace, it later broke a test case in Rails that had previously passed with the original optimization. At first, neither the Rails developers nor I could construct a self-contained example demonstrating the failure, which complicated debugging.

00:15:25.600 However, the Rails developers managed to provide a method to reproduce the problem using the Rails test framework, which helped me debug the issue. Later, after the fix was developed but before it was committed, another Ruby committer was able to create a self-contained example that illustrated the problem, which they posted as a note on the bug tracker. This example was included as a test case when the fix was eventually committed.

00:16:06.080 It turns out that the cause of this issue was similar to the cause of the previous issue, involving isec control frames without an instruction pointer. Instead of missing a backtrace entry—as in the earlier issue—it resulted in an incorrect backtrace entry. For instance, a 'caller' without an argument could be treated as a caller with a start argument of zero.

00:16:40.640 Two changes were necessary to rectify this bug: one was simple, and the other was complex. The straightforward fix required ensuring that we did not process frames after the requested start frame. This was the issue that made a 'caller' without an argument behave like a caller with a start argument of zero. We implemented this by subtracting the caller's start argument from the initial size of the backtrace, making sure we do not process frames that the caller has requested to ignore.

00:17:21.440 We then adjusted the related code that utilizes the caller start argument to assume that the caller's start argument is zero, considering it has already been applied to adjust the size. Now let’s discuss the complex fix. The purpose of this fix is to adjust the starting frame for backtrace collection to ensure that the optimized process produces the correct number of frames. Since the text is quite small, I will zoom in for the explanation.

00:17:55.080 This fix is applicable if the backtrace is limited in both directions, where the start argument to 'caller' is non-zero and provided a length shorter than the actual number of frames. We begin scanning the entire backtrace from the most outer control frame to the most recent control frame, looking for ignored frames. Checking for ignored frames utilizes the same criteria as before, specifically targeting isec frames without an instruction pointer.

00:18:33.680 If the ignored frame is located before the starting frame in the backtrace, we flag that it requires additional work. Otherwise, we decrement the 'i' counter. The difference between the 'i' and 'j' counters will then indicate the number of ignored frames. If there are no ignored frames, we do not have to take any further action; otherwise, additional work is required.

00:19:18.160 If there are no ignored frames remaining before we expect to start backtrace generation, we will just subtract the number of ignored frames from the start. If the issue is more complex, we first have to scan to the intended start control frame. Subsequently, we need to scan backward from that control frame, decrementing the start and the number of ignored frames until all ignored frames have been accounted for. If, during this backward scan, we encounter an ignored frame, we will increment the number of ignored frames.

00:19:58.160 This approach resolved the issue, and both fixes were included in Ruby 3.0.1. Once again, I was able to declare victory over this issue. However, the victory felt somewhat hollow. On examining the code required for the optimization, I noticed it was complex and did not inspire any joy. The workarounds necessary to align the existing backtrace algorithm with the optimization resulted in far more code than the algorithm itself.

00:20:28.240 While this optimized approach provided enhanced performance, the cost was substantial. Throughout the development of this presentation, I came to the conclusion that rather than optimizing the previous algorithm, we needed an entirely new algorithm. Instead of starting from the outermost control frame and working inward, we should begin at the current control frame. Theoretically, this change should lead to a significant improvement in partial backtrace performance by only scanning the control frames we need to examine.

00:21:05.840 Here is the new algorithm, again translated into Ruby code, and we will review each section. We begin with the top section. Instead of having the starting VM frame for backtrace generation be the virtual machine's end control frame, we will utilize the virtual machine's start control frame. We will still need the end control frame to avoid processing past the control frame stack’s end.

00:21:29.920 Remember that the reason the old algorithm originally processed from the outermost frame and worked inward was to establish the location information for c-funct backtrace locations when they were created. However, this new algorithm cannot do that because at the time the c-funct backtrace locations are established, we do not know the location of the calling isec frame. Hence, we must maintain a counter of how many c-funct backtrace locations have been added.

00:22:06.240 If the length of the requested backtrace is not set or exceeds the number of frames, we set the maximum backtrace length to match the number of frames. Here lies the core of the algorithm: the loop over the control frames. Different from the prior algorithm, this new version scans from the innermost frame outward to the previous frame on each iteration.

00:22:41.840 The loop stops either when we reach the end control frame or when the size of the backtrace meets the requested length. Much like the previous algorithm, we only focus on isec control frames with an instruction pointer and c-funct control frames. Assuming the control frame for the current iteration belongs to those two categories, if we are skipping initial backtrace frames, we decrement the number of frames to be skipped.

00:23:16.560 'Caller' defaults to skipping a single frame, but this can be adjusted by passing the first argument to 'caller.' For c-funct frames, if we are not bypassing the frame, we add the backtrace location to the array and increment the c-funct counter. Currently, this c-funct backtrace location will not have location information for isec frames with instruction pointers.

00:23:46.640 If we are not skipping the frame, we first add the isec backtrace location. If the c-funct counter is positive, we then set the same location to the directly preceding c-funct backtrace locations. After this, we reset the c-funct counter to zero.

00:24:19.760 Next, we discuss the final section, which is only required if the last entry in the backtrace is a c-funct backtrace location. In this case, we must continue scanning the control frames to identify a location to use. We first check to ensure that the c-funct counter is positive; if it is not, then the last entry in the backtrace is an isec backtrace location, and the location will already be set for it.

00:24:53.440 The scanning criteria essentially mirror those used in the main loop, except that we do not need to consider the backtrace size as we will not be adding to the backtrace. To locate the final c-funct backtrace locations, we need to find the next isec frame with an instruction pointer. If we do encounter such an isec frame, we update the location of the previous c-funct frames and halt the scan.

00:25:36.080 Switching from the old algorithm to this new one took only a few hours of work. The new algorithm is significantly simpler than the previous version and about twice as fast as the earlier optimized algorithm if only the most recent backtrace location is needed, and slightly faster if retrieving the entire backtrace.

00:26:18.160 Additionally, during the testing of the new algorithm, I discovered an unreported bug in the previous optimization, which emerged when a backtrace was requested with a starting frame following an isec frame without an instruction pointer. Shortly thereafter, two more bugs were reported regarding the previous optimization: one led to a segfault, and the other resulted in an infinite loop.

00:26:52.320 So, while I initially thought the attempt to optimize the previous algorithm represented a small victory, it turned out to be a defeat following the last two bug fixes. I submitted a pull request for the new algorithm, which has been merged and will be included in Ruby 3.1. Thus, I can once again claim victory over this performance issue, at least until the next bug report comes in.

00:27:23.000 However, I still needed to address the issues in Ruby 3.0, and there were a few potential approaches to achieve that. One option was to revert to the Ruby 2.7 behavior; however, doing so would prove problematic due to unrelated changes in backtrace generation that were implemented after the optimization was committed.

00:27:54.080 Another approach was to backport the entire new algorithm from the master branch to Ruby 3.0. While the new algorithm is simple, the actual changes committed were extensive and invasive due to alterations in related data structures. I opted for a third approach: working to fix the bugs in the previous optimization while retaining the same core algorithm.

00:28:24.320 It took me significantly longer to rectify the issues within the previous optimization than it did to implement the new algorithm. However, I eventually succeeded. I ended up employing a similar methodology to that used in the new algorithm by scanning from the current control frame towards older control frames and halting once we located the last frame that should be included in the returned backtrace.

00:29:01.760 That frame then becomes the starting frame for backtrace generation. Unfortunately, due to time constraints, I am unable to walk you through how the new algorithm operates in this presentation. However, I would like to share some lessons I've learned about this optimization.

00:29:35.360 First, do not hesitate to step out of your comfort zone by working on unfamiliar code. I had never looked at the vm_backtrace.c file before tackling this issue. Many programmers think that low-level code, such as programming languages, databases, and operating systems, is impenetrable to newcomers. Nonetheless, generally, it is merely code, and by dedicating time to read it, you are likely to grasp its concepts.

00:30:04.560 Second, do not be discouraged if your first attempt at making improvements leads to complications in complex systems. It is inherently impossible to test for all potential use cases, and you should expect that any significant change could cause regressions in untested circumstances. It's crucial to understand that regressions in untested cases are the price of improvements.

00:30:27.680 Lastly, while you often do not need a new algorithm, there are times when you do. Make an effort to recognize instances where the challenges you are facing stem from your approach and consider whether an entirely different methodology could simplify your process. I hope you found this presentation engaging and informative as we explored backtraces and the optimizations made to partial backtraces in Ruby 3.

00:31:00.320 If you enjoyed this presentation and wish to learn more about my insights on Ruby programming, consider picking up a copy of 'Polished Ruby Programming.' That concludes my presentation, and I would like to thank all of you for listening.