Memory Management

Fixing Performance & Memory Problems

by Frederick Cheung

The video "Fixing Performance & Memory Problems" by Frederick Cheung at RubyConf 2019 addresses common challenges faced in Ruby application development, focusing on performance and memory management issues. The session shares practical insights and tools for debugging and optimizing applications based on two real-world experiences.

Key Points Discussed:
- Introduction to Real-time Recommender Systems: the speaker explains the difference between batch processing and real-time recommendations, highlighting the need to balance accuracy and immediacy.
- Initial Deployment Challenges: after a promising deployment, a performance regression was noticed when a backlog of jobs accumulated, indicative of systemic delays in response times.
- Identifying Bottlenecks with Profilers: using RubyProf, the team uncovered that serialization processes were consuming 90-95% of their time, leading to the decision to refactor the serialization method for efficiency.
- Using StackProf for Production Monitoring: StackProf was utilized due to its lower overhead in comparison to RubyProf, providing crucial insights without heavily impacting application performance.
- Memory Management Issues: addressing the initial misconception of Ruby's garbage collection, the speaker delves into problems encountered in memory usage, particularly focusing on a memory leak caused by long-lived objects in middleware.
- Diagnostic Tools: the presentation covers how tools like rbtrace and ObjectSpace were employed to track memory allocations and diagnose leaks in real-time, significantly aiding in understanding memory behavior.
- Heap Dumps for Memory Leak Analysis: an analysis of heap dumps allowed the identification of retained objects that never released memory, leading to effective memory management practices.

Conclusions and Takeaways:
- Effective monitoring and profiling tools are critical in identifying and addressing performance and memory issues within Ruby applications.
- It is vital to understand object lifecycles and memory allocation patterns to manage application resources efficiently.
- The session emphasizes the importance of collecting data swiftly and iteratively improving application performance and memory handling through strategic testing and debugging techniques.

00:00:09.910 Hi, I'm Fred. I work for a company called Dressipi. We do a variety of things, but one of our main focuses is clothing recommendation engines.
00:00:16.960 Today, I want to share with you two problems we've encountered: one related to performance and the other concerning memory. I will tell you the stories around these challenges and how we addressed them. This talk is not about 24/7 uptime; it's about real-time recommender systems.
00:00:30.099 If you are familiar with building recommendation systems, you often face a gap between batch predictions and real-time recommendations. Batch predictions use the most accurate algorithms, but they may take 12 to 24 hours to run and must be computed over all users' data at once. While this method excels in accuracy, it's not suitable for timely responses to current events.
00:00:58.350 On the other hand, real-time recommenders respond immediately to individual events like clicks, interactions, and purchases. However, the results may be more approximate, changing every second in reaction to ongoing events. Naturally, we want both: the accuracy of batch predictions and the immediacy of real-time responses. So we decided to combine these two approaches.
00:01:21.579 Initially, everything went well. The pull request looked great, I approved it, and it passed our staging tests. We also conducted evaluation tests that predicted the improvement in accuracy when using real data. The numbers were promising, and for the first 12 hours post-deployment, everything functioned smoothly.
00:01:50.439 However, that evening, things changed. The blue line in our metrics graph started to climb around 7:00 p.m. That line reflects the number of jobs in our queue, which reached a backlog of 10,000 jobs. The red line indicates the number of instances processing the queue. We added new instances, but they weren't making a significant difference. Eventually, around 11:00 p.m., the backlog decreased, but that was simply because users had gone offline.
00:02:23.010 In summary, our real-time recommender system was operating with a two-hour delay, which wasn’t ideal. Fortunately, the failure mode of our system resulted in users experiencing a non-real-time recommendation system, something they were unaware of since we had just rolled out the feature that morning.
00:02:35.490 The next day, I began monitoring the situation. My suspicion was that the job creating the backlog was likely the last step in our process. In this setup, we compute batch predictions overnight, and then we have an API to get real-time predictions for users. The data is serialized into a specific format before being stored in another database. The delay could have been caused by any one of these steps or a combination of them.
00:03:03.540 We took a closer look at application performance metrics using AppSignal. The output indicated several API calls and data store interactions, all contributing to the slowdown. It became clear that the serialization process, taking up 90-95% of our time, was the significant bottleneck. I hadn't anticipated it would be an issue, but that class was performing poorly, causing the lag.
00:03:27.209 The serialization method was intended to convert data into an integer array in a specific format. The challenge was that it iterated over element pairs, scaling the values so that they fell between 1 and 1000. This was an attempt to simplify processing but ultimately proved inefficient.
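For illustration only (the method name and values here are hypothetical, not the talk's code), the kind of transformation being described, scaling a list of scores into integers between 1 and 1000, can be done in a single pass:

```ruby
# Hypothetical sketch of the serialization step described above:
# map float scores onto integers in the range 1..1000.
def serialize_scores(scores)
  min, max = scores.minmax
  span = max - min
  scores.map do |value|
    span.zero? ? 1 : (1 + (value - min) * 999.0 / span).round
  end
end

serialize_scores([0.2, 0.8, 0.5]) # => [1, 1000, 501]
```

The sketch only shows the transformation itself; the inefficiency described in the talk came from how the original implementation iterated over element pairs while doing it.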
00:03:57.099 To address the issue, we began using RubyProf, a profiler that records behavior such as method calls. This gave us insight into what functions were being called and how much time was spent on them. Running this profiler enabled us to benchmark our serialization process effectively.
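For reference, a minimal RubyProf session looks roughly like this (`serializer` and `payload` are stand-ins for whatever code is under investigation):

```ruby
require 'ruby-prof'

# Profile the block and collect per-method timings.
result = RubyProf.profile do
  serializer.serialize(payload)   # stand-in for the code being benchmarked
end

# Print a flat report: one row per method, sorted by time spent.
RubyProf::FlatPrinter.new(result).print($stdout, min_percent: 1)
```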
00:04:30.479 The profiling output revealed a wealth of information. Each row represented a method call with its execution time relative to the total run time. I noticed a common pattern: many methods were called, but the time spent in each was negligible until we reached the serialization method, which is where the significant time was consumed.
00:04:58.740 The call tree showed that the method was being invoked with no arguments, and that a cache was growing excessively because of how the middleware was designed; it continued to grow without ever being cleared, leading to the performance degradation we experienced.
00:05:22.920 Next, we explored StackProf, a sampling profiler that is less intrusive than RubyProf. This tool provides insights without the considerable overhead; its sampling approach allowed us to analyze behavior without badly distorting the performance metrics.
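A typical StackProf capture looks roughly like this (the output path and the profiled code are placeholders):

```ruby
require 'stackprof'

# Sample the CPU roughly once per millisecond while the block runs and
# write the raw profile to a dump file for later inspection.
StackProf.run(mode: :cpu, out: 'tmp/stackprof-cpu.dump', interval: 1000) do
  serializer.serialize(payload)   # stand-in for the code under investigation
end
```

The resulting dump can then be summarized from the command line with `stackprof tmp/stackprof-cpu.dump --text`.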
00:05:58.110 We carried out tests, revealing that StackProf had significantly less impact on the run time versus RubyProf. After running some benchmark tests, we found that while RubyProf added substantial overhead, StackProf maintained near-ideal performance levels, allowing us to run it in production safely.
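A rough way to reproduce that kind of comparison (the workload here is an arbitrary placeholder, not the talk's benchmark) is to time the same block bare, under RubyProf, and under StackProf:

```ruby
require 'benchmark'
require 'ruby-prof'
require 'stackprof'

work = -> { 50_000.times.map { |i| (i * i).to_s }.join(',') }  # placeholder workload

Benchmark.bm(10) do |x|
  x.report('bare')      { work.call }
  x.report('ruby-prof') { RubyProf.profile { work.call } }
  x.report('stackprof') { StackProf.run(mode: :cpu, out: '/tmp/sp.dump') { work.call } }
end
```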
00:06:16.360 In terms of tackling performance issues, we learned to rely on good monitoring tools and to gather feedback quickly. When confronted with performance regressions, collecting data fast and iterating on changes is paramount. RubyProf proved beneficial here for detailed analysis, while StackProf offered real-time insights in production without significant overhead.
00:06:45.820 Now shifting focus, I want to address memory issues. As a new Ruby developer coming from C, I was delighted to be freed from memory management burdens. The garbage collector handles memory allocation and cleanup, and the Ruby core team continuously enhances its capabilities.
00:07:16.680 However, I quickly realized that memory concerns still exist. Loading massive files can lead to significant memory usage, influencing application performance and hosting costs, and garbage collection can introduce runtime pauses, especially in processes that allocate heavily.
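You can get a feel for how allocation-heavy a piece of code is with nothing more than `GC.stat`; a small sketch:

```ruby
# Sample GC counters around an allocation-heavy placeholder workload.
before = GC.stat
100_000.times { 'x' * 50 }   # placeholder: allocates many short-lived strings
after = GC.stat

puts "objects allocated: #{after[:total_allocated_objects] - before[:total_allocated_objects]}"
puts "GC runs:           #{after[:count] - before[:count]}"
```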
00:07:45.060 One particular memory issue arose when our monitoring system alerted us to a memory error. These types of errors often go unnoticed during normal operation and show up only under specific conditions, such as high usage scenarios.
00:08:06.950 Examining the process lifetime of our application, we anticipated a typical growth curve: rapid increases during loading, followed by stabilization. However, we unexpectedly found a persistent upward trend with no signs of stabilization — an indication of a memory leak.
00:08:36.050 Memory leaks can be quite insidious. As a new Ruby developer, I felt betrayed by this possibility. The common culprits include persistent references from long-lived objects, excessive use of global variables, or poorly managed caches.
00:08:59.090 In my experience, one particular memory leak was caused by misunderstanding object lifecycles. I had implemented a simple cache in a rack middleware, inadvertently allowing memory to grow excessively.
00:09:05.560 This issue stemmed from the fact that while Rails controllers are instantiated per request, middleware operates persistently throughout the application's lifetime. The controller object would vanish swiftly, but the middleware maintained a cache holding onto memory.
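A minimal, hypothetical sketch of that shape of bug (names invented for illustration): because Rack middleware is instantiated once per process, an instance-level cache with no eviction grows for the life of the process.

```ruby
require 'rack'

# Hypothetical middleware illustrating the leak pattern described in the talk.
class CachingMiddleware
  def initialize(app)
    @app = app
    @cache = {}   # lives as long as the process, not the request
  end

  def call(env)
    path = Rack::Request.new(env).path
    @cache[path] ||= expensive_lookup(path)   # never evicted: grows with every unique path
    @app.call(env)
  end

  private

  def expensive_lookup(path)
    path.reverse   # placeholder for real work
  end
end
```

Bounding the cache, or keeping it out of long-lived objects entirely, is what stops this pattern behaving like a leak.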
00:09:38.810 This misunderstanding, combined with improperly held references, caused a memory leak, and we saw a gradual but unmistakable increase in memory usage over time.
00:10:00.300 What made the leak challenging was the complexity of the interactions and the lack of any clear indication of where to begin diagnosing the issue. Unlike performance alerts, which point at specific endpoints, the memory leak simply presented as high memory consumption.
00:10:30.660 We realized the leak had existed for quite some time without detection, possibly for over a year. Our monitoring data showed that it took about two weeks to exhaust memory, presenting a slow and gradual increase in consumption.
00:10:56.960 As a result, we found ourselves frequently deploying new instances to manage load, which masked the underlying issue. Often, instances would auto-restart before noticeable effects registered.
00:11:22.090 Addressing this leak required careful analysis and eventual fixes.
00:11:30.379 Next, let me discuss the debugging approaches we used in production for investigating memory issues. Although daunting, collecting insights from the live process is often crucial for unearthing issues that are invisible in standard development and staging environments.
00:12:00.480 Among the tools I employed, I first used a gem called rbtrace. It attaches to a running Ruby process, allowing us to trace method calls and trigger memory-allocation tracking in the live application.
00:12:29.740 I also used ObjectSpace's allocation tracing, which records where every object is allocated so the heap can be dumped and analyzed in depth later.
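Roughly, the combination works like this (PID and file paths are placeholders): rbtrace evaluates Ruby inside the running process, while the `objspace` standard library does the actual allocation tracing and heap dumping.

```ruby
# From a shell on the host, with the rbtrace gem loaded in the app:
#   rbtrace -p <pid> -e 'Thread.new { require "objspace"; ObjectSpace.trace_object_allocations_start }'
#   ... wait while the leak accumulates ...
#   rbtrace -p <pid> -e 'Thread.new { require "objspace"; GC.start; File.open("/tmp/heap.dump", "w") { |io| ObjectSpace.dump_all(output: io) } }'
#
# The same calls, run directly inside a Ruby process, look like this:
require 'objspace'

ObjectSpace.trace_object_allocations_start   # record the file/line of every new allocation
# ... exercise the code suspected of leaking ...
GC.start                                     # collect anything that is actually garbage
File.open('/tmp/heap.dump', 'w') do |io|
  ObjectSpace.dump_all(output: io)           # writes one JSON object per live Ruby object
end
```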
00:12:48.030 The downside is the performance overhead incurred while tracking this information, so it is essential to understand that impact before relying on this kind of profiling while you troubleshoot.
00:13:11.520 Running the profiling tools produced substantial data, which took time to sift through. However, we needed insight into object allocation patterns to determine where memory was being retained unnecessarily.
00:13:34.680 The heap dump data provided specific information about allocated objects, including memory addresses, sizes, and locations in the code where these allocations occurred.
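Each line of such a dump is a JSON object, so a small script (file path hypothetical) is enough to total retained memory by allocation site:

```ruby
require 'json'

# Sum the memsize of heap-dump entries, grouped by the file:line that allocated them.
totals = Hash.new(0)
File.foreach('/tmp/heap.dump') do |line|
  obj = JSON.parse(line)
  next unless obj['file']   # entries allocated before tracing started carry no location
  totals["#{obj['file']}:#{obj['line']}"] += obj['memsize'].to_i
end

totals.sort_by { |_, bytes| -bytes }.first(10).each do |site, bytes|
  puts format('%10d bytes  %s', bytes, site)
end
```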
00:13:58.189 When analyzing the dumps from our non-leaking app, I noted expected patterns of memory usage. There were allocations that occurred, but those were promptly released, reflecting healthy memory management.
00:14:26.640 In contrast, the heap dump from our leaking application revealed retained objects that were never released. By studying these leaks, I could identify specific areas in the codebase requiring attention. This analysis directly targeted memory mismanagement patterns.
00:14:56.200 Ultimately, understanding object lifecycles allowed us to rectify the issue. I learned significant lessons about managing memory within Ruby applications and the importance of scrutinizing object duration.
00:15:20.680 In conclusion, effective monitoring and detailed analysis are key to managing both performance and memory challenges in Ruby. Never underestimate the importance of understanding how your objects behave and interact with each other in various contexts.
00:15:42.710 Thank you for listening.