00:00:12.050
Alright, go ahead and get started. Welcome everyone.
00:00:16.379
There are some seats for those of you standing on the side, if you want them. You can also stay standing; it's all good.
00:00:22.110
Cool, so I'm talking about Parallel Ruby and how it eats all of your memory, as well as some magic to stop it from eating all your memory.
00:00:30.689
A bit of background about myself: my name is Kevin, and I work at a company called Flexport. Our mission is to make global trade easier for everyone.
00:00:39.899
This matters for two reasons. One, it's a shameless way for me to plug my company using my platform. But more importantly, we run a lot of background jobs because there are a lot of boats, planes, and trucks we're tracking at all times.
00:00:51.579
Keeping track of these things requires running a lot of cron jobs, scraping data from various sites, and hooking up to GPS devices we have plugged into trucks. We end up doing around 7,000 jobs a minute, peaking at around 20,000.
00:01:02.069
These jobs are almost all IO-bound, involving network requests to determine the location of various transports like boats, trucks, and planes. It's worth noting that trucks and planes are usually more interesting than boats, since boats often just sit in the middle of the ocean.
00:01:28.830
Initially, we ran all of these jobs in a single-threaded world with about 20 servers, each running 12 Ruby processes using Sidekiq. While Sidekiq is awesome, we didn't necessarily trust ourselves to be thread-safe.
00:01:41.579
As a result, we ended up running one thread per process, with 12 processes on each of 20 servers, totaling 240 Ruby processes. Since these are IO-bound jobs, it felt like we were underutilizing our resources.
00:01:54.300
Now, everything I'm discussing here is specifically about CRuby (also known as MRI Ruby). In CRuby, there is a global interpreter lock, which means that only one thread can execute Ruby code at a time. However, since our jobs are IO-bound, we should be using threads, because a thread that is waiting on a network request lets other threads run in the meantime.
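To make the point about IO-bound threads concrete, here is a minimal sketch. It is not code from the talk: `sleep` stands in for a network request, and the method name `elapsed_for_waits` is made up for illustration. Waiting threads overlap even under the global interpreter lock, so the total time stays close to a single wait rather than the sum of all of them.

```ruby
# Hypothetical helper: start `count` threads that each "wait on IO"
# (simulated here with sleep) and return the wall-clock time taken.
def elapsed_for_waits(count, seconds)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  # A sleeping thread does not hold the interpreter lock,
  # so the other threads can start their own waits right away.
  threads = Array.new(count) { Thread.new { sleep(seconds) } }
  threads.each(&:join)

  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end
```

Calling `elapsed_for_waits(5, 0.2)` should take roughly 0.2 seconds instead of a full second, which is the headroom a single-threaded setup leaves unused.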
00:02:32.070
We did a major migration to a multi-threaded setup and reduced the number of servers, which allowed us to save money, and my boss was pleased with my performance. As a result, we could run way more threads and handle a lot more work.
00:02:53.610
However, things didn't go as planned. This graph shows our job runtime slowing down over time. New deploys are indicated by green bars, and you'll notice that every time a new deploy happens, the runtime drops and then climbs again.
00:03:18.000
During lunch, the performance tanked further as no one was deploying, leading to a host of issues. I wanted to dive into why our jobs were becoming increasingly slow. One important side note here is that if you deploy every 10 minutes, you can often ignore many problems.
00:03:49.779
Continuous deployment has its benefits, allowing you to quickly resolve problems. However, that’s not the crux of my talk today. I’d rather not delve into the actual code since it can be confusing to dive into any code base, so I'll present a simplified example.
00:04:03.630
In my example, I'm spinning up multiple threads. Specifically, I'm creating 20 threads, each in a loop that allocates memory without ever returning it. Each thread allocates a thousand-element array, and each element in that array consists of a thousand-character string.
00:04:26.220
This means each thread allocates about one million characters, roughly 1 MB, per loop iteration, so the 20 threads together allocate about 20 MB per round. When we run this and monitor the memory usage, we see it start climbing rapidly.
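The demo described here can be sketched roughly as follows. This is a reconstruction under assumptions, not the code shown on the slide: the method names are invented, and the loop is bounded by a `rounds` parameter so the sketch terminates, whereas the version in the talk loops forever.

```ruby
# One iteration of the demo: a thousand-element array where every
# element is a thousand-character string (about one million characters).
def allocate_once(elements = 1_000, chars = 1_000)
  Array.new(elements) { "x" * chars }
end

# Hypothetical bounded version of the demo. Each thread allocates
# `rounds` arrays and drops them immediately, returning only the
# number of bytes it created so the work can be checked.
def run_demo(threads: 20, rounds: 5)
  workers = Array.new(threads) do
    Thread.new do
      total = 0
      rounds.times { total += allocate_once.sum(&:bytesize) }
      total
    end
  end
  workers.map(&:value)
end
```

With the defaults, each thread creates about 5 MB of short-lived strings. Nothing is kept after a round finishes, so any lasting growth in process memory is not coming from objects the program still holds.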
00:04:44.730
As expected, the memory usage increases, but we also notice drops when the garbage collector runs to clean some memory. Despite this, the overall trend shows continuous growth in memory usage.
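One way to watch the garbage collector during an experiment like this is Ruby's built-in `GC` module. The sketch below is an assumption about how you might check it yourself, not the tooling used in the talk, and the method name `gc_activity` is hypothetical.

```ruby
# Hypothetical helper: allocate short-lived data for `rounds` iterations
# and report how many GC runs happened and how many slots remain live.
def gc_activity(rounds)
  runs_before = GC.count

  rounds.times { Array.new(1_000) { "x" * 1_000 } }
  GC.start # force a full collection so the numbers below are settled

  {
    gc_runs: GC.count - runs_before,
    live_slots: GC.stat(:heap_live_slots)
  }
end
```

If the live slot count stays flat across repeated calls while the process's memory keeps growing, the growth is happening outside the Ruby heap.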
00:05:03.449
To illustrate this better, I previously ran the same code and graphed the results. You can see the garbage collector has run, yet we still face a memory leak...
00:05:28.620
... which leads us to a discussion about memory fragmentation. To understand this better, we should discuss how memory management works in Ruby.
00:05:45.609
Ruby manages its own memory differently than C does, using what's called a Ruby heap, which can be confusing when compared to a C heap. The Ruby heap is organized into chunks called slots, with everything that needs to be allocated fitting into these slots.
00:06:06.660
When we allocate any object, it goes into one of these slots. However, if anything exceeds the slot size, then it gets sent to the C heap, which is OS-managed memory. The Ruby heap is consistent and structured, while the C heap is less organized, composed of 4 KB pages.
00:06:23.189
Each new object takes a slot in the Ruby heap, and Ruby allocates more slots when it runs out. When we allocate arrays, they get assigned to a Ruby slot, which in turn references the corresponding data stored in the C heap. This can cause fragmentation in the C heap, leading to wasted space that impacts performance.
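The split between slots and the C heap can be observed with the standard `objspace` library. This is a small sketch with an invented method name; the exact numbers depend on the Ruby version and platform, and `ObjectSpace.memsize_of` is only an estimate, so no specific values are claimed here.

```ruby
require "objspace"

# Hypothetical helper: compare the reported size of a string that fits
# in its slot with one whose contents have to live outside the slot.
def slot_vs_heap_sizes
  short = "a"          # small enough to be stored inside the slot
  long  = "a" * 1_000  # too large for a slot, so its bytes are malloc'ed

  {
    short: ObjectSpace.memsize_of(short),
    long: ObjectSpace.memsize_of(long)
  }
end
```

The short string should report something close to a bare slot, while the long one reports the slot plus the separately allocated buffer.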
00:06:53.070
In other words, if a new allocation is too large to fit into the free gaps left inside existing pages, it has to go somewhere else, and those gaps remain as wasted memory in the OS heap. Therefore, while Ruby handles its own heap superbly, fragmentation can still arise due to the limitations of the C heap.
00:07:08.189
Now, let's consider our problem with parallel Ruby. When we start using multiple threads, each thread gets allocated its own fragmented OS heap, worsening the memory fragmentation over time. This is why memory usage can become unexpectedly high in multi-threaded Ruby applications.
00:07:50.170
With this understanding, what can we do about it? One option is to not assign one heap per thread; instead, we could configure a single heap. This fits well with Ruby's global interpreter lock: since only one thread can run Ruby code at a time, threads rarely compete for that one heap.
00:08:22.440
This can be achieved by adjusting environment variables related to memory heap management, such as setting MALLOC_ARENA_MAX to one. By doing this, we've noticed improved memory efficiency because all threads share the same arena.
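As a sketch of how the setting is applied: `MALLOC_ARENA_MAX` is read by glibc's allocator when a process starts, so as far as I know it has to be in the environment of the process being launched rather than assigned to `ENV` in a Ruby process that is already running. The helper below is hypothetical and only demonstrates passing the variable to a child Ruby process; it does not measure memory.

```ruby
require "rbconfig"

# Hypothetical helper: launch a child Ruby with MALLOC_ARENA_MAX set
# and return the value that child sees in its environment.
def arena_max_seen_by_child(value)
  env = { "MALLOC_ARENA_MAX" => value.to_s }
  child_script = 'print ENV["MALLOC_ARENA_MAX"]'

  # The env hash applies only to the spawned process.
  IO.popen(env, [RbConfig.ruby, "-e", child_script], &:read)
end
```

In a real deployment the variable would more likely be set in whatever starts the worker processes, for example a service definition or container configuration. The setting is specific to glibc, so it has no effect with other allocators.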
00:09:12.230
As a comparison, I tested this, and the first time our memory climbed to over a gigabyte. When using the modified setting, it stabilized around 600-700MB, demonstrating the effectiveness of this environment variable.
00:09:36.570
Another option is to consider using a different memory allocator. By default, Ruby uses the standard C malloc, but we can replace it with alternatives like jemalloc, which is known for its efficiency in memory management.
00:10:01.709
The reason I'm choosing to demonstrate jemalloc is that it has a strong recommendation from various communities, including the authors of Sidekiq. It's widely adopted and has shown stable performance across various applications.
00:10:20.690
Once jemalloc is installed and configured correctly, another environment variable tells the process to use it in place of the default allocator. With that set, memory usage drops significantly compared to the default setting.
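The talk does not name the environment variable. On Linux the usual mechanism for swapping allocators is `LD_PRELOAD` pointing at the jemalloc shared library, so that is assumed here. The sketch below, with an invented method name, checks two common signs that a Ruby process is using jemalloc: the interpreter was linked against it at build time, or it is being preloaded.

```ruby
require "rbconfig"

# Hypothetical helper: look for hints that jemalloc is in use.
# `config` and `env` are parameters so the check can be tried
# against sample data as well as the running process.
def jemalloc_hints(config: RbConfig::CONFIG, env: ENV)
  # Rubies built with jemalloc list it among their linked libraries.
  libs = [config["MAINLIBS"], config["LIBS"]].compact
  linked = libs.any? { |entry| entry.include?("jemalloc") }

  # Assumption: jemalloc is injected through LD_PRELOAD on Linux.
  preloaded = env.fetch("LD_PRELOAD", "").include?("jemalloc")

  { linked: linked, preloaded: preloaded }
end
```

Calling `jemalloc_hints` with no arguments inspects the current process. Both values being false suggests the default allocator is in use, though this check is a heuristic and not proof.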
00:10:57.390
To summarize my findings thus far, I highly recommend adopting continuous deployment practices since they help mitigate many operational issues. Furthermore, switching to jemalloc can enhance memory management significantly, even for single-threaded Ruby applications.
00:11:38.609
In conclusion, if you have control over your environment, consider setting MALLOC_ARENA_MAX and using jemalloc to reduce memory usage significantly. However, for multi-threaded applications, staying within the confines of Ruby's architecture is crucial.
00:12:03.350
Charitably speaking, these adjustments might benefit memory efficiency. Even when limiting Ruby's default behavior, you can observe a substantial performance uptick in managing threads.
00:12:30.350
It's essential to balance options because memory management remains a complex subject. The discussions regarding the best setups and allocators are ongoing, and while some settings yield improvements, others may come with the occasional quirk.
00:12:50.180
I'll conclude my talk here, and I appreciate you all joining me. If anyone has further questions concerning Ruby, or if you're interested in deep dives into memory management, be sure to also attend Aaron Patterson’s talk tomorrow.
00:13:50.000
Thank you for your time and for allowing me to share insights on parallel Ruby, managing memory, and its complexities.