rubyday 2020

Reduce Memory by Using More

Reduce Memory by Using More

by Aaron Patterson

In Aaron Patterson's presentation entitled "Reduce Memory by Using More," delivered at Ruby Day 2020, he explores the intricacies of garbage collection (GC) in Ruby, specifically focusing on recent advancements in GC compaction. He introduces the concept of automatic concurrent compaction, emphasizing that chances are most developers are unaware of how memory management works in the languages they use daily.

Key Points Discussed:

  • Overview of Garbage Collection: Aaron starts by explaining the role of the garbage collector, which manages memory allocation and deallocation automatically, thus sparing developers from having to manage memory themselves.
  • Feature Introduction - gc.compact: He highlights a feature introduced in Ruby 2.7 called gc.compact, which aims to reduce memory fragmentation by compacting the heap. This is likened to defragmenting a hard drive, making memory usage more efficient.
  • Compaction Process: The compaction process is detailed, involving moving live objects closer together in memory and updating references. This ensures that memory is utilized efficiently, reducing wastage.
  • Goals of Compaction: Aaron enumerates goals for future memory compaction implementations:
    • Memory Savings: To reduce memory usage effectively.
    • Automatic Operation: Developers should not have to invoke compaction manually.
    • Concurrent Processing: The GC should run while the Ruby program continues its operations, minimizing downtime.
    • Speed: Compaction procedures should be fast to maintain performance.
  • Implementation Challenges: He addresses challenges faced during implementation, particularly with regard to reference updating and the speed of the compaction process.
  • Using Read Barriers: An innovative approach using read barriers was proposed to handle reference updates without crashing the program. This involves intercepting memory read requests to ensure references remain valid after compaction.

Significant Examples and Illustrations:

  • Aaron shares a graphical representation of memory layout before and after compaction, illustrating the efficiency gained through compaction.
  • The complexities of class relationships in Ruby are exemplified to explain how the compaction algorithm needs to manage references correctly to prevent crashes.

Conclusion and Takeaways:

Aaron concludes that while performance metrics showed the read barrier implementation to be slower, the primary objective was to ensure functional accuracy before optimizing for speed. The crucial lesson emphasized is that developers should ensure their code works correctly before attempting to enhance its speed. This talk reflects a deeper understanding of memory management in Ruby and the ongoing improvements aimed to enhance developer experience.

00:00:48.480 Well then, let's get started. The last talk is by our pretty renowned speaker, Aaron. The topic is going to be. Hello, Aaron, how are you? I'm doing well. How are you? I'm doing well, thank you. No cat butts on the screen, not from my side. Same here. Okay, cool. I'm enjoying the sun and still hanging in there until tomorrow. Just another day of summer, yes! We can do it! Nice! I was about to say that your talk is going to be about the garbage collector. This reminds me of a Marvel franchise because I've been seeing talks on garbage collectors from you since I think three or four years now, and they're generally great.
00:01:20.320 Because, as I was saying before to someone else in the hallway chat, seldom do you get a chance in your daily job to look at the internals of something that you use all the time. I've always appreciated the insights I get from your talks, besides the fun, of course. Oh, thank you! I like talking about the garbage collector and looking at it because it's this thing that's always there, doing work for us. But we don't ever have to look at it, nor do we have to mess with it too much, until sometimes we do, and then it's like, oh boy! It's perfect. I think everybody shares that sentiment when looking at it.
00:01:52.960 Fun fact: The only technical question I got in my interview when I interviewed for my current company was about how Ruby handles memory. I was like, 'What kind of question is that?' It seemed strange. I was like, are you really sure you’re a Ruby developer? I guess so. Anyway, this is going to be a pre-recorded talk, and the people in the chat know that already. They’ll post questions there if there are any, and I’ll read them to you afterwards. Just be funny, people. Come on, the bar is pretty high here. I leave the stage to you then.
00:04:09.360 Hi everyone! Today I'm going to be talking about reducing memory by using more. Actually, I'm mostly going to be talking about concurrent and automatic compaction in Ruby, but I thought that the other name was much better. So I wanted to use that name instead. I am going to be talking about reducing memory by using more, but the main focus will be on automatic concurrent compaction today.
00:04:21.519 Hi everyone, welcome to Ruby Day! I really wish that I could have come in person, but I guess the good news is that since I can film from home, you can actually see my cat. Hey buddy, say hello to everyone! Ah, oh, hello! Okay, I think he’s done. My name is Aaron Patterson, and I'm known on the internet as Tenderlove, but you can just call me Aaron too; it’s really fine.
00:04:54.320 This year, I started a new job at a company called Shopify, and I'm a full-time Ruby and Rails contributor for the company. I feel really lucky to have this job, and I just wanted to say that I'm really happy I get to work on Ruby and Rails, improving the language and the framework all day. So I just want to say thanks to Shopify; it’s really a dream come true for me.
00:05:28.960 In this presentation today, I want to talk about the developments I’ve been doing at work on Ruby, as well as the progress and process that I’ve made so far. As I said, I'm on the Ruby and Rails core team, and I'm employed to be a Ruby and Rails contributor. I’m also the inventor of the Analog Terminal Bell, and I don’t know if you have seen the ad for this, but if we have time, I may play it.
00:06:03.600 Anyway, let's get into the presentation. In Ruby 2.7, I introduced a feature called gc.compact, and this feature allows us to save memory by filling in spaces in our heap. So if we have extra unused areas in the heap that are fragmented, we can fill in those spaces and use them. This is a technique called eliminating fragmentation.
00:06:28.000 If you recall, there was a DOS utility program from a long time ago that would defrag your hard drive—it would fill in any unused spaces on your hard drive. This is basically the same concept, except instead of your hard drive, it's filling in spaces in your memory. Compaction allows us to reclaim memory that's not being used. This sample program demonstrates the effectiveness of compaction.
00:06:59.199 So first off, it loads a Rails application into memory. Then it runs the GC to remove any garbage that was created while loading all the code, and then it compacts the heap to remove any unused spaces in memory. Here’s an image showing what the memory layout looks like before compaction runs. The green and black dots represent objects, while the white dots represent free space. After the compactor runs, it looks something like this: the layout changes, and the red dots indicate objects that couldn’t be moved. In this new graph, the objects are much closer together, so you don’t see as many white spaces between them.
00:07:36.800 And it's kind of hard to see, but the graph is more narrow than the previous one, indicating that we're consuming less memory. Before we dive into my current work, I want to cover the existing compaction algorithm in Ruby 2.7. The compaction process has two steps: first, moving the objects closer together, and then updating all the references.
00:08:11.120 Imagine that there’s a memory layout that looks like this: the blue squares represent live Ruby objects, and the white squares represent empty slots where we could store a Ruby object. Yellow represents a forwarding address. The compaction process in Ruby 2.7 requires two pointers, one at either side of the heap: one at the beginning and one at the end. One of them is called a free pointer, and the other one is called a scan pointer.
00:08:50.280 The free pointer scans right until it finds a free slot, and then the scan pointer moves left until it finds a filled slot. When it finds a filled slot, it swaps the two, so they get swapped around. It then writes a forwarding address to the old location. For example, the object in slot 9 is moved to slot 3, so we write a forwarding address of 3. We repeat this process of looking for empty slots, finding filled slots, swapping them, and writing addresses until the two pointers meet. Then we know we're done with the movement process.
00:09:17.920 Here’s the same heap layout before compaction. Some of these objects have references to each other, and it might look something like this graph. Now, when the objects get moved around, the references can go bad. For example, some references have gone bad, so we need to update all the references.
00:09:53.280 To do this, we simply iterate over the heap, checking the references and updating them. In this case, object one references object seven, but the object that was at seven is now at four, so we update that reference to point to four. The next object, two, is pointing at six; we need to update that to point to five. We repeat this process for every single object in the heap, and once we reach the end, we’re done with the reference update process.
00:10:31.440 Finally, when we're done, we can clear out all the forwarding addresses, and those become just empty slots again. Now we're done compacting the heap. This is the existing algorithm in Ruby 2.7. As far as compaction itself is concerned, I want to talk about my personal goals for that feature.
00:11:05.760 First off, I want to save memory, which is probably the most important goal. We want to save memory, and that’s essential. I also want it to be automatic, so developers don't have to know that it's happening or explicitly call for heap compaction; it should just happen automatically in the background and just work. Additionally, I would like it to be concurrent, which means your program can run while we compact the heap without pausing it too much.
00:11:47.680 And finally, I want it to be fast. I will discuss each of these goals and where they stand. First, let's talk about saving memory. In this slide, we see two graphs: one is the layout of a Rails application before we've compacted the heap, and the other one is after we've compacted the heap.
00:12:06.400 By comparing these two graphs, we can see that we've saved memory because the compacted heap is narrower. We were able to free up some space, and the compacted heap has all the objects much closer together, so there’s no free space in between them. This helps improve copy-on-write performance. Thus, we can confidently say that the first goal of saving memory is met.
00:12:46.240 Now, let’s look at the next goal of making the compactor automatic. I want to make the compactor automatic because I don't think developers should have to know that it's running or invoke it on their own. It should just happen behind the scenes for you. But today, we have to call gc.compact in order for the compactor to run, which clearly does not meet this goal.
00:13:14.160 Typically, we load our entire Rails application and then manually compact the heap because the code will remain loaded until we kill the process. However, the need to call gc.compact ourselves is a clear indicator that this goal is not met.
00:13:46.720 As for concurrency, we also must call gc.compact explicitly, confirming that this goal is not met either. I want to take this opportunity to discuss an implementation problem: When objects are moved, the destination slot has to be empty, which might not sound like an issue. However, it means that the GC steps must happen in a certain order.
00:14:04.720 As previously mentioned, the first step is moving the objects, followed by updating the references. Since we can only move objects into empty slots, the sweep step must occur before the compact step. The mark step must happen before the sweep step as well. Therefore, we need to call gc.compact ourselves, adhering to this specific order of operations.
00:14:41.760 Finally, the last goal of speed presents a subjective issue. The question arises: how fast is compacting the heap? It depends on what you are comparing it against. In my view, discussing speed is moot unless we have concurrency.
00:15:06.720 If you have to call gc.compact when booting your application, the time it takes becomes irrelevant because you won’t be serving any requests until the process is fully initialized. Therefore, we can't accurately assess whether the current solution is fast. However, making the compactor automatic, and fast, both hinge on implementing concurrency.
00:15:48.960 When we mention concurrency, we must consider running concurrently with something. Ideally, it would be best if we were executing concurrently with Ruby code. This way, we can reduce the pause times of your program, allowing it to run while compacting. However, for my first implementation, I thought it would be easier to start by making it concurrent with the sweeping step.
00:16:31.840 So, for the first implementation, we will adjust the move and sweep steps to execute at the same time. Now, let's talk about the sweeping process and how we can integrate the move process. After the marking step of the GC, any objects that will be kept will be marked, indicated by the marked bits along the top.
00:17:09.440 We keep a bit table that indicates whether or not an object is marked. Anything that is marked will be red and cannot be released, while unmarked objects can be swept and freed. So we look at each object; if the object is marked, we ignore it and move to the next one. If it isn't marked, we free the object and mark that slot as reusable. We continue this process, iterating through the heap.
00:17:48.720 Let’s now take a look at how we can combine the sweep and move steps. The sweep cursor resembles the move cursor, so we should be able to combine these two processes, moving across and examining each object.
00:18:02.960 The first step is to add a scan cursor analogous to the one in the original algorithm. The sweep step, in this case, remains unchanged. It just needs to look at each object and its marked bit. However, after freeing an object, instead of proceeding to the next, we utilize the scan cursor to search backward for an object that we can move into that slot.
00:18:46.640 By introducing the scan cursor, we can write a forwarding address and this looks identical to the previous compaction algorithm, merging movement and freeing any objects. When the two cursors converge, we know that the heap has been compacted, just as before.
00:19:14.560 Once they meet, we pause the sweep step to update all the references. Reference updating is the one non-concurrent step, so we must halt the movement and sweeping at that time. After we update the references in the same manner as the earlier compaction algorithm, we can continue the sweep step as normal.
00:19:59.120 The sweeping mechanism automatically frees any forwarding addresses. Thus, the sweep will clear out any forwarding addresses that aren’t marked, returning them to the free list. After the sweep step finishes, compaction is complete.
00:20:37.200 After implementing this algorithm, it seemed to work well. I sent a PR, and you can see it at this URL; the CI passed, and everything appeared fine. However, after merging, CI started failing, prompting us to revert the patch.
00:21:04.880 I set out to discover what was wrong. The algorithm itself seemed sound, but it took time to pinpoint the bug. I will describe what it was. Let’s say in Ruby we have maybe three classes: A, B, and C, where B and C are subclasses of A.
00:21:41.440 Internally, Ruby tracks the class parents and children. So, class A knows its children, B and C, while B and C know their parent A. If class C gets deleted, the GC frees C, but C must also be removed from class A's child list.
00:22:19.520 So due to this, class C has to access class A during the GC sweep phase. Let’s illustrate what happens when we incorporate the compaction algorithm. The sweep cursor moves along until it finds class C, which is unmarked, therefore it must remove itself from class A's child list.
00:22:59.760 After that, class C is deleted. However, let’s examine the same process involving the compaction algorithm. The sweep moves along and spots an empty slot; thus, the scan pointer proceeds to class A, leaving a forwarding address.
00:23:41.120 But when the sweep reaches class C during the sweep, class C needs to remove itself from class A's child list. It attempts to access its parent to perform this task, but class C is now pointing at a forwarding address, not the actual class A.
00:24:13.920 Therefore, when C tries to remove itself from class A's list, it encounters a forwarding address it cannot process, resulting in a program crash.
00:24:36.960 In the debug logs from the revert commit, if we evaluate the crash, we can see that this child class is in fact pointing at a moved address. To address this issue effectively, I made a significant assumption: when an object was freed, it would only free its own memory.
00:25:10.560 Accordingly, I wanted to implement a restriction on objects, ensuring they could only interact with their internals and not affect external objects. However, I realized I could not enforce this, as existing code might rely on such behavior.
00:25:43.200 Thus, I needed a solution. I opted to introduce read barriers. What are read barriers? Read barriers are functions triggered when something is read. For instance, a simple read barrier could intercept attempts to access an instance variable.
00:26:15.920 When an object is read, we have a chance to execute an action. In this particular GC bug scenario where C tries to read A, we could intercept and reposition A back into its original place.
00:26:52.960 Unavoidably, read barriers present challenges. They require placement in numerous locations, which complicates implementation because every significant point in the code might require a barrier.
00:27:50.080 Also, if we have to incorporate these barriers everywhere, it might slow down program execution, as we will have to check if a read barrier needs execution each time. This method treats our program as a graph, aiming to preserve its integrity.
00:28:27.360 Should any instance variable move, the barrier could execute upon reading and rectify the location. While this holds great potential, it is indeed challenging to implement and may hinder performance.
00:29:08.000 I, being quite lazy, decided to use and protect. This function allows us to safeguard a memory section from being accessed. If there’s an access attempt, an exception occurs, allowing us to take appropriate action.
00:29:49.600 To elucidate how this resolves the bug, we would implement the same process as before. After moving an object, we would protect the associated page, such that when C attempts to read the address, it triggers an exception since the region of memory is guarded.
00:30:26.640 After handling the exception, we remove the protection from the page and reposition all objects back to their original locations, allowing the sweep to proceed.
00:30:44.320 At this point, A is back in the right location, enabling C to free itself from A's child list.
00:31:16.000 However, this implementation leads to one hurdle: the system call's protected regions must be divisible by four kilobytes. The beginning address and the protected region’s size must both meet this criterion.
00:31:52.880 Although we confirm that the page addresses align to the four-kilobytes requirement, the total size to protect presents an issue. The calculated size for memory is not divisible by four kilobytes, creating complications.
00:32:35.440 Instead, I proposed leveraging exactly 16 kilobytes, enabling compatibility with mprotect. An interesting observation is that the required size by malloc equaled a Ruby object size.
00:33:05.680 Ultimately, if we allocate precisely 16 kilobytes, each Ruby page accommodates an additional object. To test this patch, I computed memory usage during RDoc generation as a benchmark.
00:33:44.720 The graph showcases memory usage over time. The x-axis represents time while the y-axis denotes memory consumption by the process. The red lines represent the master branch while the blue lines demonstrate the branch containing the applied patches from my changes. After running the benchmark 50 times, we can observe that the blue lines are typically lower.
00:34:35.440 In fact, our analysis shows reduced memory overall when applying these changes. This surprising outcome stems from how jd malek organizes memory allocation utilizing fixed size buckets.
00:35:33.040 In essence, allocating a suitably sized block encourages memory efficient usage, indicating that requesting more memory can actually help reduce overall usage.
00:36:00.480 This is why I titled my talk 'Reducing Memory by Using More,' because it encapsulates a discovery regarding how our memory management works.
00:36:44.080 The pivotal part of the patch is where heap page size now aligns with heap page alignment. Therefore, allocation respects the four-kilobyte boundary.
00:37:21.760 I made a pull request that implements a read barrier while combining compact and sweeps, using mprotect effectively. Unfortunately, the read barrier's performance turned out to be very slow.
00:38:04.400 Comparing the read barrier with the original compacting shows that the read barrier implementation is 4.5 times slower than before. This slow down raises the question of speed.
00:39:02.560 In responding to that question, I recall that with performance optimization always check if your compiler optimizations are enabled beforehand.
00:39:46.400 Reflecting upon this project, I initially felt dissatisfied as it appeared to slow down feedback. Truly, the project isn’t solely about enhancing speed but proving that we could implement a read barrier while facilitating concurrent compaction.
00:40:20.560 With that objective achieved, I realize the primary takeaway is about functional implementation rather than sheer speed. One essential point I want to convey is this: ensure that your code works successfully before attempting optimization.
00:40:59.520 This brings us to the conclusion of today’s talk; I see this issue in many different projects, and I admit to being a victim of it myself. It’s vital to establish the functionality of your code first, and once that is certain, focus on speeding it up. Thank you so much for having me! I wish I could have attended in person, but next time, I'll definitely be there.
00:41:38.080 Thank you all again! Hi, everybody!
00:42:09.760 Oh, you are not happy, are you?
00:42:56.569 Okay.
00:43:07.440 Has this ever happened to you? Changing directories, Git, and wondering if that is the right directory.
00:43:21.680 If only I had the Analog Terminal Bell! You’ll never miss a terminal bell notification again.
00:43:40.080 The Analog Terminal Bell is a device that rings anytime a bell character is displayed on your terminal.
00:43:56.440 Simply plug it into USB, then enable the Analog Terminal Bell in your terminal settings.
00:44:09.360 It works with your shell, it works with Vim, it even works with Devview.
00:44:16.720 Thanks to the Analog Terminal Bell, I'll never miss a terminal bell notification again. Go online and get yours today at AnalogTerminalBell.com.
00:44:29.680 Please note: the Analog Terminal Bell is not actually for sale and requires a custom build of item two. Other restrictions may apply; see the website for details.
00:44:52.960 Hello, hello! Just brilliant.
00:45:02.280 Thank you! I was totally not expecting that. Hello, there she is!
00:45:11.360 Unfortunately, not yet, but we’re gearing up to wrap up the day.
00:45:24.720 Let’s see if there are any questions.
00:45:51.120 The jokes in the chat have been delightful already! My question is, was that really a microwave plate?
00:46:05.040 That’s what I’ve been wondering throughout this talk! Months of work, and my real question is: was that a microwave plate?
00:46:29.120 It was indeed! I wanted something to spin on, and so I thought—oh, it's the microphone!
00:46:36.960 I was filming it, and Ebby was like, 'What are you doing?'
00:46:44.960 Now I have another question. That’s actually my question too!
00:46:51.680 For the audience, we asked all our speakers to send us a bio just to get to know them better.
00:47:01.760 In Aaron’s bio, he mentioned his interests: keyboards, cats, and codes. These are all hard C's.
00:47:14.880 But I noticed, why K and not C, like for cheese? Your Twitter feed is full of cheese!
00:47:28.320 That's a good point. Yes, yes! Bad joke should have been another one of my interests: cheese! What has it done to you?
00:47:41.760 Now, moving to the serious question in the final version.
00:48:04.960 Matteo wonders: Is concurrency simply with respect to sweeping and moving, or is it involved with other Ruby codes?
00:48:20.960 Actually, I must admit that when I started this project, I initially aimed to make it concurrent with freeing.
00:48:31.680 But thanks to being able to implement a read barrier, it is now concurrent with Ruby code as well. It runs alongside your Ruby code.
00:48:42.240 The only thing is that reference updating is not concurrent; there’s definitely a pause required for that segment.
00:49:04.640 Gregorio is asking: How does the cat feel about the Analog Terminal Bell?
00:49:24.000 The cat doesn’t mind it much; however, when the bell rings, my wife inevitably asks, 'What are you doing?'
00:49:35.040 Usually, the bell sound is loud! You might find my cat, Choo Choo, residing on my desk and she often sleeps on it.
00:49:55.680 In that case, I really cannot ring it while she’s on it! I could, but it might not be the greatest idea.
00:50:08.480 As we approach the conclusion, we’ve run out of questions, serious and otherwise.
00:50:26.240 If you had one piece of advice to offer the audience interested in diving deeper into internals, where should they stand?
00:50:40.480 I would recommend starting by writing a C extension if you want to begin hacking on MRI. Afterward, you can delve into Ruby internals.
00:51:05.680 A good book to read is 'The Garbage Collector's Handbook,' or just the GC handbook. It includes all the algorithms I've discussed, as well as those utilized in Ruby.
00:51:21.680 I believe what we’ve implemented in MRI isn’t innovative technology; rather, it’s generally known and awaits documentation in that resource.
00:51:45.680 The final question from Thiago is whether there’s any benchmark on large web projects.
00:51:56.480 We’ve conducted benchmarks, noting that memory savings hover around 10% for certain objects. However, the performance impacts of automatic compaction remain uninvestigated.
00:52:16.800 This warrants further exploration. Additionally, we must understand that this feature only engages during major GC runs, which should ideally arrive infrequently with proper tuning.
00:52:34.440 Thus, even if the performance dips slightly, it should not matter significantly in a larger context. Thank you profoundly for your time and participation today!
00:52:57.440 Of course! Thank you for having me!