Automatically Find Memory Leaks in Native Gems

RubyKaigi 2022

00:00:08.540 Thank you.

00:00:11.360 Peter to mōshimasu. Oh, I'm supposed to be speaking in English. Hi everyone, my name is Peter. I'm on the Ruby core team and I'm a senior developer at Shopify on the Ruby infrastructure team. Today, I'll be talking about how you can automatically find memory leaks in native gems using the Ruby memcheck tool.

00:00:20.820 First, let's talk a little bit about what native gems are. Native gems are Ruby gems with parts of them written in a lower-level language such as C, C++, or Rust. This is usually done for performance purposes or to be able to use a library on the system.

00:00:26.519 Languages like C lack the garbage collector found in languages like Ruby. This means that developers are required to manually clean up memory, and if they forget to clean up the memory, then a memory leak will occur. So, what's a memory leak? A memory leak is a piece of memory that is no longer useful but has not been released back to the system. In Ruby, this can happen if you hold onto useless global or instance variables. An example of how this could happen would be a logger that never flushes its buffer, which will accumulate more and more entries. In C, this can happen if you forget to clean up memory when it's no longer used. As the program executes, more and more memory leaks will accumulate, causing your program to use more and more memory. Eventually, your computer will run out of memory, and your program will be killed by the system to free up memory.
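The logger example can be made concrete with a few lines of Ruby. This is only a minimal sketch: `BufferedLogger` and its methods are hypothetical names invented for illustration, not part of any library mentioned in the talk.

```ruby
# Hypothetical logger that buffers entries in memory (illustrative only).
class BufferedLogger
  attr_reader :buffer

  def initialize
    @buffer = []
  end

  def log(message)
    # Every entry stays reachable through @buffer, so the garbage
    # collector can never reclaim it. If flush is never called, the
    # buffer grows for the lifetime of the process.
    @buffer << "#{Time.now.utc} #{message}"
  end

  # The fix: write the entries out and drop the references.
  def flush(io = $stdout)
    @buffer.each { |entry| io.puts(entry) }
    @buffer.clear
  end
end

# A long-lived object, such as one held in a constant or global variable.
LOGGER = BufferedLogger.new
1_000.times { |i| LOGGER.log("request #{i}") }
puts LOGGER.buffer.size
```

Nothing here is a bug from the garbage collector's point of view: the entries are still referenced, so they are kept. The leak only goes away once something calls `flush`.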

00:01:06.659 For a web server, such as Rails, this could mean that it crashes mid-request. If the service is not self-healing, meaning it does not replace dead servers with new ones, then you may experience downtime. Even if it is self-healing, this is bad for capacity as a new server has to boot, which can take many minutes. Let's see an example of the consequences of memory leaks. This is an image of the memory usage graph of a Shopify service in production on a weekend. The weekend was chosen because there are no deploys on the weekend. We see typical behavior of memory leaks with a linear growth of memory usage over time.

00:01:59.640 At about three and a half gigabytes of memory usage, the memory usage drops because containers are being killed due to running out of memory. The service is on Kubernetes, which has self-healing capabilities, so the killed containers are replaced with new containers. Here we see a deploy that kills all of the containers and starts new ones. This is the same service on the weekend after the memory leak was fixed. We see a flat graph with no linear growth. We also see much lower memory usage. Previously, we peaked at over three gigabytes of memory usage per container, and we had out-of-memory kills. Now, each container uses less than 1.4 gigabytes and there are no out-of-memory kills.

00:03:10.800 Now that you've seen what a memory leak is and the impacts of memory leaks, let me tell you the story of how the memory leaks were found and fixed. On the afternoon of Friday, October 8th, 2021, I found a memory leak in the native gem called Liquid C. It's a native gem we use at Shopify, and this is the pull request that fixes it. If you don't know what Liquid is, it's an open-source templating language used at Shopify. You can think of it like Ruby, but without the features that could cause remote code execution vulnerabilities.

00:03:29.400 It's used to render web pages for stores on Shopify, and the code is supplied by the merchant, which is arbitrary code, so we need to ensure there are no remote code execution vulnerabilities. If you've ever used Jekyll to generate your personal site or blog, you've written Liquid templates. The Liquid gem is completely written in Ruby, while Liquid C is an extension to the Liquid gem that speeds it up by re-implementing parts of it in C. So, fixing the memory leak was easy enough, but I was sure there were more memory leaks. So I had a choice: I could either spend a day or two debugging and trying to find the memory leaks, or I could sink an unknown amount of time into making a tool that might or might not work to find it for me.

00:04:03.180 Of course, I chose the latter approach. That got me thinking: what tools could I use to automatically find memory leaks in native gems? I had ideas on how to build this tool. I knew I wanted to use Valgrind Memcheck. Valgrind is a collection of error detectors and performance profilers. One of the tools, Memcheck, tracks all memory allocations and reports them upon shutdown. It detects memory leaks by reporting all of the memory that has been allocated but not freed. Additionally, it detects invalid memory usages, such as using memory after it has been released, known as 'use after free,' or accessing memory that is out of bounds.

00:05:00.360 Let's try Valgrind on an empty Ruby script. Let's scroll to the end of this output. Yikes! There are nearly 70,000 lines of output with over 10,000 memory leaks. Note that this output has been truncated to about 300 lines due to limitations in Keynote, so in reality, it's about 200 times longer. In a non-trivial Ruby program, this output can easily get to hundreds of thousands of lines long. Valgrind Memcheck is a great tool, but unfortunately, it’s unusable on Ruby. This is because Ruby does not free all of its resources at shutdown. This is deliberate, as when the process exits, the system will reclaim all of the memory anyway. So freeing the memory is just extra work that will only make Ruby's shutdown slower.

00:06:00.840 A solution to this would be to add a feature to Ruby to free all memory during shutdown. However, this would be a non-trivial task as I would need to track down every memory leak and figure out where to free the memory. I didn't really want to go through tens or hundreds of thousands of lines of output. Let's look at an example output from Valgrind. This is an example of a memory leak in an earlier version of Liquid C, from before the memory leaks were fixed. What you can see here is that this was part of tens of thousands of lines of irrelevant output. We see information such as that it leaked about five kilobytes of memory, it occurred 53 times, and it was one of about 22,000 memory leaks, most of which are false positives.

00:06:53.220 It's then followed by the stack traces of the source of allocation. The first line we see is a call to the malloc system library, which allocates the memory. This is the part in Liquid C that actually allocates the memory, with the rest in Ruby, including at the top a malloc wrapper in Ruby. So, how can we apply heuristics to this stack to filter out noise and automatically find memory leaks? Well, one idea I had was to keep all the allocations that happen inside of the native gem and reject all of the others as false positives. We assume that the binary that allocated the memory is also the one responsible for managing and eventually freeing it.

00:07:29.640 Thus, Ruby is responsible for managing the memory it allocates, and the native gem is responsible for managing the memory that it allocates. If we do this, then all allocations that happen only in Ruby will be filtered out. That sounds like a solid plan. Let's try implementing this. As I started coding this, I quickly realized I had a technical difficulty. I knew where each line in the stack traces came from, but how would my algorithm know that? One idea I had was to use the filename to determine where each stack frame came from, but that wouldn't work if there are duplicate names.

00:08:02.940 For example, both Ruby and Liquid C have a vm.c file. So, one solution around that would be to read symbol names from Ruby and the native gem binaries to determine where each function in the stack trace comes from. However, this is pretty complex, as it requires the use of additional tools, so I wanted to keep this as a last resort solution. So, before doing this, I took one last look at Valgrind's documentation, and I found Valgrind's --xml=yes flag. This isn't a very well-documented feature, but essentially it outputs in an XML format that's easier to parse programmatically.
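To make the flag concrete, here is a minimal sketch of how a Ruby script might assemble such a Valgrind invocation. The helper name `valgrind_command`, the script path, and the report filename are hypothetical; the flags themselves (`--tool=memcheck`, `--leak-check=full`, `--xml=yes`, `--xml-file`) are standard Valgrind options. This is not claimed to be the exact command line that Ruby memcheck runs.

```ruby
# Hypothetical helper: builds a Valgrind command that writes an XML report.
def valgrind_command(script, xml_file: "valgrind.xml")
  [
    "valgrind",
    "--tool=memcheck",        # select the Memcheck tool
    "--leak-check=full",      # report each leak with its stack trace
    "--xml=yes",              # emit the report as XML
    "--xml-file=#{xml_file}", # XML output needs an explicit destination
    "ruby", script
  ]
end

cmd = valgrind_command("test/leak_test.rb")
puts cmd.join(" ")

# On a Linux machine with Valgrind installed, the command could be run with:
#   system(*cmd)
```

Passing the command as an array to `system` avoids shell quoting problems if the script path contains spaces.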

00:08:55.020 Additionally, there are other features that make it easier to use programmatically, such as outputting to a file descriptor, a custom file, or to a socket. How does this flag actually help us? Let's look at the same memory leak in XML format. We see similar information, such as the number of bytes leaked and the number of times this leak occurred. Let's zoom in on the stack trace. This XML output gives us more information. We see the binary name, which will let us figure out where the stack frame actually comes from, whether it's Ruby or the native gem or some library on the system. We did not have this output in the previous output.

00:09:29.880 We see similar information about the function name, the filename, and the line number. But now that we have the binary name, we can easily figure out that this stack frame is from the system library. We also have frames from our native gems and from Ruby. Now we have all the pieces of data needed to implement our heuristics. Now that you've seen the information in Valgrind's output, let's talk about some of the heuristics Ruby memcheck uses to filter out false positives.
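As a rough illustration of why the binary name matters, the sketch below parses a small hand-written sample in the shape of Valgrind's XML leak records and labels each frame by the binary it came from. The sample values, paths, the `example_alloc` function, and the helper names `frame_origin` and `parse_leaks` are all made up for illustration, and matching on the binary's basename is an assumption rather than the gem's actual logic. It uses REXML, which ships with Ruby as a bundled gem.

```ruby
require "rexml/document"

# Hand-written sample shaped like a Valgrind XML leak record (values made up).
SAMPLE_XML = <<~XML
  <valgrindoutput>
    <error>
      <kind>Leak_DefinitelyLost</kind>
      <xwhat>
        <text>5,120 bytes in 53 blocks are definitely lost</text>
        <leakedbytes>5120</leakedbytes>
        <leakedblocks>53</leakedblocks>
      </xwhat>
      <stack>
        <frame>
          <obj>/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so</obj>
          <fn>malloc</fn>
        </frame>
        <frame>
          <obj>/usr/local/lib/libruby.so.3.1</obj>
          <fn>ruby_xmalloc</fn>
          <file>gc.c</file>
        </frame>
        <frame>
          <obj>/app/lib/liquid_c.so</obj>
          <fn>example_alloc</fn>
          <file>vm.c</file>
        </frame>
      </stack>
    </error>
  </valgrindoutput>
XML

# Classify a frame by the binary (the <obj> element) it belongs to.
def frame_origin(obj, gem_binary:)
  name = File.basename(obj)
  if name.start_with?("ruby", "libruby")
    :ruby
  elsif name.start_with?(gem_binary)
    :native_gem
  else
    :other
  end
end

# Returns one hash per leak record, with its frames ordered top-down.
def parse_leaks(xml, gem_binary:)
  doc = REXML::Document.new(xml)
  doc.root.get_elements("error").map do |error|
    {
      kind: error.elements["kind"].text,
      bytes: error.elements["xwhat/leakedbytes"].text.to_i,
      blocks: error.elements["xwhat/leakedblocks"].text.to_i,
      frames: error.get_elements("stack/frame").map do |frame|
        obj = frame.elements["obj"].text
        { fn: frame.elements["fn"].text,
          origin: frame_origin(obj, gem_binary: gem_binary) }
      end
    }
  end
end

parse_leaks(SAMPLE_XML, gem_binary: "liquid_c").each do |leak|
  puts "#{leak[:bytes]} bytes in #{leak[:blocks]} blocks"
  leak[:frames].each { |f| puts "  #{f[:fn]} (#{f[:origin]})" }
end
```

Note that both the Ruby frame and the native gem frame could report a file called vm.c, so the filename alone would be ambiguous; the binary path is what tells them apart.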

00:10:07.740 The first heuristic Ruby memcheck uses is that if the stack trace does not contain a frame inside of our native gem, we reject it as a false positive. We assume that only memory allocated by the native gem is the responsibility of the native gem to manage and eventually free. This should remove most of the false positives coming from Ruby. The second heuristic Ruby memcheck uses is that if the stack trace calls into our native gem but then goes back into the Ruby virtual machine, it should be rejected as a false positive. We can call back into the Ruby virtual machine using functions such as rb_funcall, which calls a Ruby method, rb_raise, which raises an error, or rb_yield, which yields the block passed into the method.

00:11:20.940 However, we don't want to skip all functions that call into Ruby. For example, we don't want to skip the creation of typed data objects, which is a special type of Ruby object that allows native extensions to store custom data. Typed data objects are created by Ruby; however, it is the responsibility of the native gem to manage and eventually free the memory. So how do we determine whether a stack trace is in the Ruby virtual machine or the native gem? The stack could go from Ruby into a native gem and back into Ruby and so on.

00:12:07.260 We want to scan from the top of the stack downwards, where the top of the stack is the location closest to the allocation source. If while scanning top-down we first encounter a call to the Ruby virtual machine, then we want to reject the memory leak, as it means that the memory leak was allocated by the virtual machine. However, if the stack frame is inside of our native gem first, then we want to keep the memory leak because it could be real. Let's see an example of a memory leak that is rejected by this heuristic.

00:12:55.800 This is a memory leak reported by Valgrind, and this is the stack trace. Let's overlay information on where each stack frame comes from. The first line is from the malloc system library, followed by several lines in Ruby, which are shown in red, then in our native gem, which is shown in yellow, and then back in Ruby again. We start scanning from the top of the stack, which is closest to the allocation source. We first enter Ruby, perform some hash operations, but then we see an rb_funcall call. Recall that rb_funcall calls a Ruby method through the C API, and this call occurs before we enter the native gem.

00:13:40.899 Thus, the native gem has called back into the Ruby virtual machine, so we reject this memory leak as a false positive. Let's see an example of a real memory leak that is kept by this heuristic. This is the same memory leak as the examples used earlier. We will see if the heuristic will successfully keep this memory leak. Again, let's overlay information on where each stack frame comes from. As we scan from the top of the stack downwards, we first see a call to the malloc system library, then we enter a stack frame in Ruby.

00:14:24.360 However, we never enter the Ruby virtual machine before encountering a stack frame in the native gem. Therefore, we accept this memory leak as valid. The third heuristic Ruby memcheck uses is to reject memory leaks in the init function of the native gem. Every native gem has an init function. This is called the first time the native gem is required. It's used to set up state and do other preparation work that the gem needs to do. It's important to note that this function is only called once, so even if it is memory inefficient, it cannot cause memory to grow over time.

00:15:06.120 Since the init function is used to set up state, it often allocates memory that is used throughout the lifetime of the program. This means that it is never freed and it ends up being reported as a memory leak. This is the code for the heuristic, which is directly extracted from the Ruby memcheck gem. This method returns true if the memory leak should be rejected and false if it is determined to be a real memory leak. Let's loop through the stack frames from the top of the stack downwards. We check where the stack frame originated from. If it's in Ruby, and we haven't seen a frame in our native gem yet...

00:15:48.000 ...and we see a function that calls back into the Ruby virtual machine, then we should reject this memory leak as a false positive. Okay, this is a lot to unpack, but this is essentially rule two from the previous slide. configuration.skipped_ruby_functions is an array of regexes or function names that call into the Ruby virtual machine. We want to check whether we’ve been into the native gem yet, because if we were in the native gem before we entered the Ruby virtual machine, then the source of allocation is the native gem and not Ruby.

00:16:30.840 Okay, let's consider the case when the frame is in the native gem. We set the in_binary variable to true to remember that we've already entered the native gem. We skip if it's the init function, which is rule three from the previous slide, and we don't care if the frame is neither in Ruby nor the native gem. In the end, we reject if we’ve never entered the native gem and keep if we did enter the native gem. This is rule one from the previous slide. Ruby memcheck is designed to be easy to integrate into native gems. It uses your existing MiniTest or RSpec test suite to find memory leaks, and the memory leaks and memory errors are output after your tests complete.
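The slide with the code is not part of this transcript, so the following is a self-contained sketch reconstructed from the spoken description of the three rules, not the gem's actual source. The `Frame` struct, the `SKIPPED_RUBY_FUNCTIONS` list, the method name `reject_leak?`, and the frame names in the examples are all assumptions made for illustration.

```ruby
# A stack frame reduced to what the rules need.
# origin is :ruby, :native_gem, or :other (for example, a system library).
Frame = Struct.new(:origin, :fn)

# Hypothetical stand-in for the configured list of functions that call
# back into the Ruby virtual machine: regexes or exact function names.
SKIPPED_RUBY_FUNCTIONS = [/\Arb_funcall/, /\Arb_yield/, "rb_raise"].freeze

# Returns true if the leak should be rejected as a false positive.
# frames must be ordered top-down, starting closest to the allocation.
def reject_leak?(frames, init_fn:)
  in_binary = false

  frames.each do |frame|
    case frame.origin
    when :ruby
      # Rule two: a call back into the Ruby VM seen before any native gem
      # frame means the VM, not the gem, made the allocation.
      if !in_binary && SKIPPED_RUBY_FUNCTIONS.any? { |pattern| pattern === frame.fn }
        return true
      end
    when :native_gem
      in_binary = true
      # Rule three: allocations made while the gem is being set up.
      return true if frame.fn == init_fn
    end
    # Frames that are in neither Ruby nor the native gem are ignored.
  end

  # Rule one: reject if the stack never entered the native gem.
  !in_binary
end

called_back = [
  Frame.new(:other, "malloc"),
  Frame.new(:ruby, "rb_hash_aset"),
  Frame.new(:ruby, "rb_funcallv"),
  Frame.new(:native_gem, "example_render"),
  Frame.new(:ruby, "vm_call_cfunc")
]
allocated_by_gem = [
  Frame.new(:other, "malloc"),
  Frame.new(:ruby, "ruby_xmalloc"),
  Frame.new(:native_gem, "example_alloc"),
  Frame.new(:ruby, "vm_call_cfunc")
]

p reject_leak?(called_back, init_fn: "Init_liquid_c")      # → true
p reject_leak?(allocated_by_gem, init_fn: "Init_liquid_c") # → false
```

Using `===` lets one list hold both regexes and plain strings: a regex matches by pattern, while a string matches only an identical function name.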

00:17:09.240 Valgrind, and the heuristics that I'm using, have their limitations. So, this tool isn't perfect. Let's see some of the limitations of Ruby memcheck. The first limitation is that because Valgrind only works on Linux and Ruby memcheck uses Valgrind under the hood, Ruby memcheck won't work on other platforms such as macOS or Windows. This is not a problem for CI, as most workers are Linux, but if your personal development machine is on another platform, then you may need to run a virtual machine or a Docker image.

00:17:40.560 The second limitation is performance. Valgrind works by running your program in a sandbox. This is how it tracks memory allocations and detects invalid memory usages. This allows Valgrind to be powerful, but it comes at a significant performance penalty. Running with Valgrind will cause your program to run 20 to 50 times slower. The third limitation of Ruby memcheck is that it uses your test suite to find memory leaks, so the coverage is only as good as that of your test suite. Memory leaks in situations outside of your test suite won't be found. This is yet another reason to have a high-quality test suite with good coverage.

00:18:24.600 The fourth limitation is that because the heuristics filter out everything that originated inside of the Ruby virtual machine, memory leaks caused by the Ruby virtual machine won't be reported. Memory leaks in Ruby are challenging to debug, but unfortunately, Ruby memcheck won't be able to help you with that. The fifth limitation of Ruby memcheck is that it may miss memory leaks that were allocated inside of Ruby but leaked by the native gem. For example, this can happen if a string was allocated in Ruby and passed into the native gem, and the native gem changes the pointer of the string content buffer into another region of memory but forgets to free the original buffer.

00:19:03.600 Then a memory leak will occur, but Ruby memcheck won't be able to catch that since the original allocation happened inside of Ruby. Okay, hopefully, you've understood all of that from a high level. We’ve seen the algorithm and the limitations caused by the very aggressive heuristics. Now you might be thinking: do these heuristics even work? Do they effectively filter out noisy false positives and keep the real memory leaks? Yes, it was able to find the original memory leak in Liquid C, and I was able to find another one. It was able to do this without a single false positive, and Liquid C now runs Ruby memcheck on CI to prevent memory leaks in the future.

00:19:54.360 So now you might be thinking, does this work on other native gems or was I just lucky with Liquid C? Well, I found four memory leaks in Nokogiri and a fifth memory leak also in Nokogiri, and it’s running on CI on Nokogiri. It found memory leaks in TracePoint, a feature of Ruby, and it's running on CI on Rotoscope. And it found a memory leak in another gem.

00:20:07.518 So I'm asking you, the native gem maintainers and users, to try Ruby memcheck on your native gems or any native gems that you suspect have memory leaks. You can follow the instructions in the readme on how to use your existing MiniTest or RSpec test suite to automatically find memory leaks. It's designed to be really easy to set up and should only take five to ten minutes, but if it finds memory leaks, that will potentially take much longer. Together, let's make Ruby a more efficient and stable platform for everyone.