Talks

Parallel Ruby: Managing the Memory Monster

by Kevin Miller

In the talk "Parallel Ruby: Managing the Memory Monster," Kevin Miller discusses the challenges of memory management when transitioning from single-threaded to multi-threaded Ruby applications. This transition was made at Flexport to better handle its high volume of IO-bound background jobs, which peaked at around 20,000 jobs per minute. The presentation examines how memory behavior can degrade performance, focusing on the implications of Ruby's garbage collection process and memory fragmentation.

Key points discussed include:
- The Transition to Multi-threading: Flexport moved from a single-process, single-threaded setup to a thread pool to improve resource utilization. Despite initial success, they soon encountered severe performance degradation.
- Performance Challenges: As job runtimes grew steadily between deployments, understanding the application's memory patterns became crucial. Although the garbage collector periodically reclaimed memory, usage showed a consistent upward trend, indicating deeper issues.
- Memory Management in Ruby: Miller explained how Ruby manages memory through its own heap, organized into fixed-size slots, in contrast with the less structured C heap, where memory fragmentation arises.
- Impact of Parallelization: Multiple threads exacerbate memory fragmentation, leading to inefficient memory use and potential leaks in multi-threaded applications.
- Solutions Implemented: Miller introduced strategies for mitigating memory issues, including setting the MALLOC_ARENA_MAX environment variable so that all threads share one malloc arena, and adopting alternative memory allocators such as jemalloc, which reportedly improved memory efficiency significantly.

To conclude, Miller emphasizes practices such as continuous deployment, which can help quickly paper over performance regressions, and notes that a few specific configuration changes can markedly improve memory management and performance in Ruby applications, especially those running multiple threads.

00:00:12.050 Alright, let's go ahead and get started. Welcome everyone.
00:00:16.379 There are some seats for those of you standing on the side, if you want them; you can also keep standing, it's all good.
00:00:22.110 Cool, so I'm talking about Parallel Ruby and how it eats all of your memory, as well as some magic to stop it from eating all your memory.
00:00:30.689 A bit of background about myself: my name is Kevin, and I work at a company called Flexport. Our mission is to make global trade easier for everyone.
00:00:39.899 This matters for two reasons. One, it's a shameless way for me to plug my company using my platform. But more importantly, we run a lot of background jobs because there are a lot of boats, planes, and trucks we're tracking at all times.
00:00:51.579 Keeping track of these things requires running a lot of cron jobs, scraping data from various sites, and hooking up to GPS devices we have plugged into trucks. We end up doing around 7,000 jobs a minute, peaking at around 20,000.
00:01:02.069 These jobs are almost all IO-bound, involving network requests to determine the location of various transports. Trucks and planes are usually more interesting than boats, since boats tend to just sit in the middle of the ocean.
00:01:28.830 Initially, we ran all of these jobs in a single-process world with about 20 servers, each running 12 Ruby processes using Sidekiq. While Sidekiq is awesome, we didn't necessarily trust ourselves to be thread-safe.
00:01:41.579 As a result, we ended up running a single-threaded, single-process setup across 20 servers, totaling 240 Ruby processes. Even though these are IO-bound jobs, it felt like we were underutilizing our resources.
00:01:54.300 Now, everything I'm discussing here is specifically about CRuby (also known as MRI Ruby). CRuby has a global interpreter lock, which means that only one thread can execute Ruby code at a time. However, since our jobs are IO-bound, threads should still help: a thread waiting on a network request releases the lock so other threads can run.
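As a minimal sketch (not code from the talk), the benefit of threads for IO-bound work can be demonstrated by timing sleeps, which release the lock the same way network waits do:

```ruby
require "benchmark"

# Simulate an IO-bound "job" (e.g. a network request) with a sleep.
# Sleeping releases the global interpreter lock, so threads can all wait at once.
def fake_io_job
  sleep 0.1
end

sequential = Benchmark.realtime { 10.times { fake_io_job } }

threaded = Benchmark.realtime do
  10.times.map { Thread.new { fake_io_job } }.each(&:join)
end

puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```

Sequentially the ten jobs take about a second; threaded, the waits overlap and the total stays close to a single job's duration.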
00:02:32.070 So we did a major migration to a threaded setup, which let us cut down on servers and save money, and my boss was pleased with my performance. We could run way more threads and handle a lot more work.
00:02:53.610 However, things didn't go as planned. This graph shows our job runtimes increasing over time. New deploys are indicated by green bars, and you'll notice that every time a deploy happens, runtimes drop but then creep back up.
00:03:18.000 During lunch, the performance tanked further as no one was deploying, leading to a host of issues. I wanted to dive into why our jobs were becoming increasingly slow. One important side note here is that if you deploy every 10 minutes, you can often ignore many problems.
00:03:49.779 Continuous deployment has its benefits, allowing you to quickly resolve problems. However, that’s not the crux of my talk today. I’d rather not delve into the actual code since it can be confusing to dive into any code base, so I'll present a simplified example.
00:04:03.630 In my example, I'm spinning up 20 threads, each running a loop that allocates memory and immediately discards it. Each thread allocates a thousand-element array, and each element of that array is a thousand-character string.
00:04:26.220 That means each thread allocates around one million characters, roughly a megabyte, per loop iteration. When we run this and monitor the memory usage, we see it start climbing rapidly.
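A reconstruction of that example might look like the following; it is a sketch rather than the talk's actual code, and where the original loops forever, the iteration count here is bounded so it terminates:

```ruby
ITERATIONS = 50 # the original loops forever; bounded here so the sketch finishes

# 20 threads, each allocating a 1000-element array of 1000-character
# strings (~1 MB of string data per iteration) and discarding it immediately.
threads = 20.times.map do
  Thread.new do
    ITERATIONS.times do
      Array.new(1_000) { "a" * 1_000 } # becomes garbage right away
    end
  end
end
threads.each(&:join)

puts "allocated roughly #{20 * ITERATIONS} MB of short-lived string data"
```

Even though every array is garbage as soon as it is created, watching the process's resident memory while this runs shows the growth the talk describes.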
00:04:44.730 As expected, the memory usage increases, but we also notice drops when the garbage collector runs to clean some memory. Despite this, the overall trend shows continuous growth in memory usage.
00:05:03.449 To illustrate this better, I previously ran the same code and graphed the results. You can see the garbage collector running, yet memory still trends upward, which looks like a leak.
00:05:28.620 This brings us to memory fragmentation. To understand it, we need to discuss how memory management works in Ruby.
00:05:45.609 Ruby manages its own memory differently than C does, using what's called the Ruby heap, which is easy to confuse with the C heap. The Ruby heap is organized into fixed-size chunks called slots, and any object small enough to fit lives entirely in a slot.
00:06:06.660 When we allocate an object, it goes into one of these slots. If its data exceeds the slot size, the overflow is stored on the C heap, which is OS-managed memory. The Ruby heap is consistent and structured, while the C heap is less organized, handed out in 4 KB pages.
00:06:23.189 When we allocate a large array or string, it gets a Ruby slot, which in turn references the corresponding data stored on the C heap. Allocations of varying sizes scattered across those pages cause fragmentation in the C heap, leading to wasted space that impacts performance.
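You can observe the slot-versus-C-heap split from Ruby itself. This sketch uses `ObjectSpace.memsize_of` from the stdlib `objspace` extension; the exact byte counts and the embedding cutoff vary by Ruby version:

```ruby
require "objspace"

# A short string is embedded entirely in its Ruby heap slot; a long string
# keeps a slot but spills its payload into malloc'd memory on the C heap.
short = "a" * 10
long  = "a" * 10_000

puts ObjectSpace.memsize_of(short) # small: essentially just the slot
puts ObjectSpace.memsize_of(long)  # slot plus a ~10 KB buffer on the C heap
```

The reported size of the long string dwarfs the short one because almost all of its memory lives outside the Ruby heap.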
00:06:53.070 In other words, a 4 KB page cannot be returned to the OS while any allocation on it is still live, so partially used pages add up to wasted memory. Ruby keeps its own heap tidy, but fragmentation still arises on the C heap.
00:07:08.189 Now, let's consider our problem with parallel Ruby. When we start using multiple threads, glibc's malloc gives each thread its own arena, effectively its own region of the C heap, and each arena fragments independently, worsening memory use over time. This is why memory usage can become unexpectedly high in multi-threaded Ruby applications.
00:07:50.170 With this understanding, what can we do about it? One option is to stop assigning one arena per thread and instead configure a single shared one. Thanks to Ruby's global interpreter lock, only one thread runs at a time anyway, so contention on a single arena costs us very little.
00:08:22.440 This can be achieved by setting the MALLOC_ARENA_MAX environment variable to 1. By doing this, we've seen improved memory efficiency, because all threads share the same arena.
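As a sketch, the setting just needs to be in the environment before the Ruby process starts; `worker.rb` here is a placeholder, not a file from the talk:

```shell
# Cap glibc malloc at a single arena so all threads share it.
export MALLOC_ARENA_MAX=1
ruby worker.rb   # placeholder for your actual worker / Sidekiq command
```

This only affects processes launched with the variable set, so it belongs in your process manager or deployment configuration.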
00:09:12.230 As a comparison, I tested this, and the first time our memory climbed to over a gigabyte. When using the modified setting, it stabilized around 600-700MB, demonstrating the effectiveness of this environment variable.
00:09:36.570 Another option is to consider using a different memory allocator. By default, Ruby uses the standard C malloc, but we can replace it with alternatives like jemalloc, which is known for its efficiency in memory management.
00:10:01.709 The reason I'm choosing to demonstrate jemalloc is that it has a strong recommendation from various communities, including the authors of Sidekiq. It's widely adopted and has shown stable performance across various applications.
00:10:20.690 Once installed and enabled, for example by pointing an environment variable at the allocator, we see a significant drop in memory usage compared to the default allocator.
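Two common ways to wire jemalloc up are sketched below; the library path is a Debian/Ubuntu example and will differ on other systems, and `worker.rb` and the Ruby version are placeholders:

```shell
# Option 1: preload jemalloc into an existing Ruby build.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ruby worker.rb

# Option 2: compile Ruby against jemalloc (e.g. via ruby-build / rbenv).
RUBY_CONFIGURE_OPTS="--with-jemalloc" rbenv install 3.2.2
```

Preloading is the quicker experiment; compiling Ruby against jemalloc bakes the choice in so every process gets it without extra environment plumbing.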
00:10:57.390 To summarize my findings thus far, I highly recommend adopting continuous deployment practices since they help mitigate many operational issues. Furthermore, switching to jemalloc can enhance memory management significantly, even for single-threaded Ruby applications.
00:11:38.609 In conclusion, if you have control over your environment, consider setting MALLOC_ARENA_MAX and adopting jemalloc to reduce memory usage significantly. For multi-threaded applications especially, it pays to understand how Ruby's memory architecture interacts with the allocator underneath it.
00:12:03.350 These adjustments can genuinely improve memory efficiency; even though they constrain the allocator's default behavior, we observed a substantial improvement in our threaded workloads.
00:12:30.350 It's essential to weigh your options, because memory management remains a complex subject. Discussion about the best setups and allocators is ongoing, and while some settings yield clear improvements, others come with occasional quirks.
00:12:50.180 I'll conclude my talk here, and I appreciate you all joining me. If anyone has further questions concerning Ruby, or if you're interested in deep dives into memory management, be sure to also attend Aaron Patterson’s talk tomorrow.
00:13:50.000 Thank you for your time and for allowing me to share insights on parallel Ruby, managing memory, and its complexities.