00:00:09.910
Hi, I'm Prats. I work for a company called Recipe. We do a variety of things, but one of our main focuses is clothing recommendation engines.
00:00:16.960
Today, I want to share with you two problems we've encountered: one related to performance and the other concerning memory. I will tell you the stories around these challenges and how we addressed them. This talk is not about 24/7 uptime; it's about real-time recommender systems.
00:00:30.099
If you are familiar with building recommendation systems, you often face a gap between batch predictions and real-time recommendations. Batch predictions use the most accurate algorithms, but they may take 12 to 24 hours to run, and can only be done with all users' data at once. While this method excels in accuracy, it's not suitable for timely responses to current events.
00:00:58.350
On the other hand, real-time recommenders respond immediately to individual events like clicks, interactions, and purchases. However, the results may be more approximate, changing every second in reaction to ongoing events. Naturally, we want both: the accuracy of batch predictions and the immediacy of real-time responses. So we decided to combine these two approaches.
00:01:21.579
Initially, everything went well. The pull request looked great, I approved it, and it passed our staging tests. We also conducted evaluation tests that predicted the improvement in accuracy when using real data. The numbers were promising, and for the first 12 hours post-deployment, everything functioned smoothly.
00:01:50.439
However, that evening, things changed. The blue line in our metrics graph started to climb around 7:00 p.m. This blue line reflects the number of jobs in our queue, which reached a backlog of 10,000 jobs. The red line indicates the number of instances processing the queue. We could add new instances, but they weren’t significantly making a difference. Eventually, around 11:00 p.m., the backlog decreased, but that was simply because users had gone offline.
00:02:23.010
In summary, our real-time recommender system was operating with a two-hour delay, which wasn’t ideal. Fortunately, the failure mode of our system resulted in users experiencing a non-real-time recommendation system, something they were unaware of since we had just rolled out the feature that morning.
00:02:35.490
The next day, I began monitoring the situation. My suspicion was that the job creating the backlog was likely the last step in our process. In this setup, we compute batch predictions overnight, and then we have an API to get real-time predictions for users. The data is serialized into a specific format before being stored in another database. The delay could have been caused by any one of these steps or a combination of them.
00:03:03.540
We took a closer look at application performance metrics using AppSignal. The output indicated several API calls and data store interactions, all contributing to the slowdown. It became clear that the serialization process, taking up 90-95% of our time, was the significant bottleneck. I hadn't anticipated it would be an issue, but that class was performing poorly, causing the lag.
00:03:27.209
The serialization method was intended to convert data into an integer array in a specific format. The challenge was that it iterated over element pairs, scaling the values so that they fell between 1 and 1000. This was an attempt to simplify processing but ultimately proved inefficient.
00:03:57.099
To address the issue, we began using RubyProf, a profiler that records behavior such as method calls. This gave us insight into what functions were being called and how much time was spent on them. Running this profiler enabled us to benchmark our serialization process effectively.
00:04:30.479
The profiling output revealed a wealth of information. Each row represented a method call with its respective execution time compared to the total run time. I noticed a common pattern: many methods were called, but they accumulated negligible time spend until we reached the serialization method. This is where significant time was consumed.
00:04:58.740
The call tree showed that the method being invoked had no arguments, and the cache was growing excessively due to the nature of how the middleware was designed. It continued to grow without being cleared, leading to the performance degradation we experienced.
00:05:22.920
Next, we explored the StackProf, a sampling profiler, which is less intrusive than RubyProf. This tool provides insights without the considerable overhead. Its sampling methodology allowed us to analyze without distorting performance metrics starkly.
00:05:58.110
We carried out tests, revealing that StackProf had significantly less impact on the run time versus RubyProf. After running some benchmark tests, we found that while RubyProf added substantial overhead, StackProf maintained near-ideal performance levels, allowing us to run it in production safely.
00:06:16.360
In terms of tackling performance issues, we learned to use effective monitoring tools and gather feedback effectively. When confronted with performance regressions, collecting data quickly and iterating on those changes is paramount. RubyProf proved beneficial here for detailed analysis, while StackProf offered real-time insights without significant overhead.
00:06:45.820
Now shifting focus, I want to address memory issues. As a new Ruby developer coming from C, I was delighted to be freed from memory management burdens. The garbage collector handles memory allocation and cleanup, and the Ruby core team continuously enhances its capabilities.
00:07:16.680
However, I quickly realized that memory concerns still exist. Loading massive files can lead to significant memory usage, influencing application performance and hosting costs. Garbage collection can result in runtime delays, especially if involved processes are heavy on memory allocation.
00:07:45.060
One particular memory issue arose from our monitoring system that alerted us to a memory error. These types of errors often go unnoticed during normal operation and show up only under specific conditions, such as high usage scenarios.
00:08:06.950
Examining the process lifetime of our application, we anticipated a typical growth curve: rapid increases during loading, followed by stabilization. However, we unexpectedly found a persistent upward trend with no signs of stabilization — an indication of a memory leak.
00:08:36.050
Memory leaks can be quite insidious. As a new Ruby developer, I felt betrayed by this possibility. The common culprits include persistent references from long-lived objects, excessive use of global variables, or poorly managed caches.
00:08:59.090
In my experience, one particular memory leak was caused by misunderstanding object lifecycles. I had implemented a simple cache in a rack middleware, inadvertently allowing memory to grow excessively.
00:09:05.560
This issue stemmed from the fact that while Rails controllers are instantiated per request, middleware operates persistently throughout the application's lifetime. The controller object would vanish swiftly, but the middleware maintained a cache holding onto memory.
00:09:38.810
Because this misunderstanding combined with improper references caused a memory leak, we saw a gradual but unmistakable increase in memory usage over time.
00:10:00.300
What made the leak challenging was the complex interactions and no clear indication of where to begin diagnosing the issue. Unlike performance alerts that suggest specific endpoints, memory leaks simply presented as high memory consumption.
00:10:30.660
We realized the leak had existed for quite some time without detection, possibly for over a year. Our monitoring data showed that it took about two weeks to exhaust memory, presenting a slow and gradual increase in consumption.
00:10:56.960
As a result, we found ourselves frequently deploying new instances to manage load, which masked the underlying issue. Often, instances would auto-restart before noticeable effects registered.
00:11:22.090
Addressing this leak required careful analysis and eventual fixes.
00:11:30.379
Next, we discussed the debugging approaches we utilized in production for investigating memory issues. Although a daunting task, collecting real-time insights is often crucial for unearthing issues that are invisible through standard development and staging environments.
00:12:00.480
Among the tools I employed, I first used a gem called rbtrace. It functions as a real-time data collection tool, allowing us to track method calls and memory allocations during live application use.
00:12:29.740
I also integrated ObjectSpace, which records every memory allocation, creating logs for in-depth analysis later.
00:12:48.030
The downside is the performance overhead incurred while tracking this information. Understanding the impact of this kind of profiling is essential in discerning its effectiveness as you troubleshoot.
00:13:11.520
Running the profiling tools produced substantial data, which took time to sift through. However, we needed insight into object allocation patterns to determine where memory was being retained unnecessarily.
00:13:34.680
The heap dump data provided specific information about allocated objects, including memory addresses, sizes, and locations in the code where these allocations occurred.
00:13:58.189
When analyzing the dumps from our non-leaking app, I noted expected patterns of memory usage. There were allocations that occurred, but those were promptly released, reflecting healthy memory management.
00:14:26.640
In contrast, the heap dump from our leaking application revealed retained objects that were never released. By studying these leaks, I could identify specific areas in the codebase requiring attention. This analysis directly targeted memory mismanagement patterns.
00:14:56.200
Ultimately, understanding object lifecycles allowed us to rectify the issue. I learned significant lessons about managing memory within Ruby applications and the importance of scrutinizing object duration.
00:15:20.680
In conclusion, effective monitoring and detailed analysis are key to managing both performance and memory challenges in Ruby. Never underestimate the importance of understanding how your objects behave and interact with each other in various contexts.
00:15:42.710
Thank you for listening.