Talks

Finding a needle in the haystack - Debugging performance issues

Finding a needle in the haystack - Debugging performance issues

by Puneet Khushwani

In the presentation titled "Finding a Needle in the Haystack: Debugging Performance Issues" delivered by Puneet Khushwani at RubyConf 2023, the speaker explores systematic methods for identifying and resolving performance issues in software development. The discussion focuses on the challenges of debugging severe performance problems, particularly in large monolithic applications built on Ruby on Rails. Puneet highlights the importance of measurement, profiling, and benchmarking to effectively analyze and optimize application performance.

Key Points Discussed:
- Importance of Measurement: Establishing a performance baseline is crucial. Continuous measurement against benchmarks allows teams to identify performance degradations early.
- Profiling and Benchmarking: Profiling tools help developers pinpoint which parts of their code are responsible for performance issues. Benchmarking allows for the comparison of code changes to assess their impact on performance.
- Use of Profiling Tools: Several profiling tools, including statistical profilers and tracers, are discussed. These tools aid in understanding performance bottlenecks by providing insight into resource consumption and execution time of methods.
- Statistical Profilers: These provide snapshots of the call stack at defined intervals, allowing developers to see which methods are consuming resources.
- Flame Graphs: These visual tools help identify performance issues by aggregating method calls and representing execution hierarchies.
- Tracers: Capturing detailed runtime activity, tracers log all events within the code, although they may introduce significant overhead.
- Case Studies from Koopa: Puneet shares real examples from his team’s experiences at Koopa Software:
- Regex Performance Issue: A job using complex regex in loops was consuming excessive memory. Profiling tools revealed the high memory usage was due to active processing rather than a job being stuck, leading to a successful optimization.
- Boot-Up Time Increase: By utilizing RB Spy and flame graphs, the team identified unnecessary class variable usage that was impacting boot-up times.
- Ruby Version Upgrade: The transition from Ruby 2.x to 3.x resulted in unexpected performance degradation. The team’s analysis showed increased garbage collection frequency and memory allocation issues, leading to refinements in caching configurations.

Conclusions and Takeaways:
- Monitoring and proactively addressing performance issues are essential for preventing customer complaints.
- Debugging is an opportunity for learning—approaching it with curiosity can help uncover deeper inefficiencies.
- The process of performance optimization is ongoing and requires continuous measurement, profiling, and adjusting configurations.

In closing, Puneet emphasizes that once developers identify a performance issue, they should remain vigilant in searching for any further inefficiencies, advocating for a thorough approach to performance debugging.

00:00:17.960 Hello, hello! Good morning! My name is Peter, and I'm with the programming committee. Today, I'll be introducing our first speaker, Puneet Khushwani, all the way from India. A brief bio: the ethos of Puneet's career has always been to never stop building. The first part of his nearly decade-long career was a crazy rollercoaster ride, where he and his team built several products. In 2020, Puneet joined Koopa Software, where he transitioned to an engineering manager role. Please give a warm welcome to Puneet.
00:01:03.160 Thank you! Thank you! Hello! Mic check, can people hear me alright? Thank you! Let's start.
00:01:15.759 Welcome to RubyConf and this talk! I am Puneet, and the title for this presentation is 'Finding a Needle in the Haystack: Debugging Performance Issues.' As mentioned, I traveled all the way from Pune, India, for this talk. This is my first time in the United States, and I must say it's a beautiful country. I'd like to start off with a Namaste, which is an important part of our Indian culture. We respectfully greet people by bowing slightly, so, Namaste, and welcome again to this talk.
00:01:38.840 A brief introduction about myself: I've been in the industry for about 10 years now, which explains my gray hair! Over these 10 years, I’ve only changed jobs once. I spent around six and a half years with my first employer, where the environment was quite like a startup. We built various products, starting with e-commerce and moving on to private label furniture and a social media app called Instagram Stories back in 2016-2017. We also ventured into the crypto space in 2018-2019, aiming to be the Stripe for blockchains.
00:02:01.920 All of these experiences were unique, but they were all like a 'zero to one' kind of journey. By 2019-2020, I realized I wanted to be part of a company that was already at a 'ten' and contribute to taking it to 'a thousand.' That is when I decided to join Koopa Software. Over the last three to three and a half years at Koopa, I transitioned from an individual contributor role to a manager position, supporting one of the platform teams from Pune, India.
00:02:28.120 Let me share a bit about Koopa. Koopa is the leader in the business spend management industry. We help companies save money by enabling them to spend smarter. I won't go into a lot of detail, but our product portfolio includes procurement, invoicing, and expense management.
00:02:39.640 On the technology front, Koopa has one of the largest monolithic applications still running on Ruby on Rails. A lot of our code is in Ruby on Rails, and we have hundreds, maybe close to thousands of developers who contribute to it. In cases where teams are large, it sometimes happens that code gets merged inadvertently, which can cause performance issues. The timing of when these issues appear varies; some may be noticed a week later, while others might take a month to show up. Tracking back to find what caused those issues is literally like finding a needle in a haystack.
00:03:11.680 Today, we'll discuss a systematic approach to tackling these issues. Here's a brief agenda for my talk: We'll start with the importance of measurement because unless you measure performance, it's very difficult to track it. Then, we'll talk about some key concepts in this area, such as profiling and benchmarking. After that, we'll examine various profiling tools. There are a lot of different tools available, and we’ll discuss when to use which and how they can assist us. In the second half, I'll cover some performance issues that my team at Koopa faced last year and how these tools helped us resolve them.
00:03:50.799 But before we get there, are there any Bollywood fans here? I am! So we'll start off with this picture, and I want you to tell me the difference between this one and that one. Look at this one and this one: the lovely lady is missing, and the men seem happier, enjoying their time. But what I want to highlight here is the seat belts. Yes, seat belts save lives. Literally!
00:04:13.520 If you see this car, it’s a much faster racing car. A lot is at stake, hence there is increased safety; the seat belts here are better. The faster the car, the more safety measures are needed. In the context of software, processes are like seat belts. I actually did a whole talk at RubyConf India a couple of months ago about five processes we have at Koopa. One of those processes involves measuring performance. Unless you measure performance, you won't be able to track it and identify degradations, which might otherwise be reported by customers. You don’t want that to happen.
00:05:07.480 So, you need a very systematic process in which you measure performance week over week, sprint over sprint, or whatever time frame suits you. How do you do that? Locally, there's no point in measuring on a QA instance; you need to do this on an instance that has production-like data, a production-like stack, and production-like hardware capacity. You need to test under production-like load. If you do all of this, you will get a realistic picture of your software’s performance.
00:05:30.240 Then, you measure this against your benchmarks, against your baseline, over various time periods. If you notice degradations, you need to flag them right away because you don't want your customers to flag them for you. So, starting with the measurement of performance is crucial. Now, let’s say the performance team informs you that a background job is running 30% slower. How do you tackle this?
00:06:10.880 In legacy software applications, your code may go through a lot of outdated software, producing a long chain of methods without your knowledge. This makes debugging more challenging. That's where profiling comes into play. Profiling is a step-by-step accounting of the resources consumed by a program and its subroutines. In simpler terms, if you have a big program with ten methods, you want to identify which of those methods is the slowest. This magic is profiling.
00:06:44.239 When you have that data, it may seem overwhelming, and you need to analyze it meticulously. For instance, let’s say there’s a simple background job with three methods. If method two does nothing but sleeps for 10 seconds, that's your slowest piece. You need some magic to tell you that method two is the slowest. There are profilers available that can help you with this.
00:07:17.205 I'll talk about this in detail in the following slides, but this is the essence of profiling. The next important concept is benchmarking. Benchmarking is crucial because once you identify that method two is slow, you’ll want to optimize it. However, you don’t want to just merge it into master and wait for the next performance testing report, which could take weeks. You need a structured approach to benchmark and compare your changes.
00:07:49.040 For instance, if method two originally sleeps for 10 seconds and you optimize it down to 8 seconds, that’s a trade-off between memory allocation and execution time. You’ll need to compare and measure again. Something like Benchmark IP may be helpful. I have personally found it quite interesting. As you can see here, the configuration allows you to run it for 100 seconds, and the warmup is important. As all of us know, software often runs slower during the first execution since it’s loading various components.
00:08:25.679 We need to ignore the first few iterations of our run and focus on the remaining ones for our comparison. The report generated helps outline allocations, which can be categorized into allocated and retained memory. Your program will allocate memory, but it should ideally not retain it in excessive amounts. If it does, it's essential to investigate. My case showed zero retention, which gave me a clear overview of my allocations.
00:08:58.839 This structured approach allows for a shorter feedback loop, as you can make a change today and receive feedback sooner rather than waiting months for results. Now we will talk about various profiling tools available in the industry. There are different types of tools, and we will explore how they assist us. Starting with statistical profilers, these tools are interesting because they take snapshots of the call stack at defined intervals.
00:09:50.200 They collect data and present it to you. Just to clarify what a call stack is: if function A calls B, B calls C, and C calls D, when you look at the call stack during D’s execution, D is at the top, followed by C, B, and A. These profilers will take snapshots over time and report back. They generally work in two ways: one requires you to add instrumentation code, while the other, such as RB Spy, attaches directly to a process without the need for extra code.
00:10:38.440 Let’s say we have a scenario where method two is shorter, taking just 5 seconds. If we were to run the stack prof on that, we could generate a snippet to do so. An important consideration is the measuring mode: 'wall time' indicates end-to-end time, while 'CPU time' denotes the time the CPU is actively engaged processing. Since method two does almost nothing when it sleeps, CPU time would be close to zero.
00:11:11.040 But the wall time would reflect the sleep time as well. Likewise, the object mode can check for allocations by your program. The output of profiling can sometimes seem overwhelming, but tools are available to visualize the data better. One such tool is the flame graph, which is generated from JSON data and can display various views.
00:11:41.920 The flame graph represents the sequence in which your program was executing. It aggregates multiple calls of the same method, allowing you to visualize performance bottlenecks in a clearer way. Now, we move on to tracers, a different kind of profiling tool. Unlike statistical profilers, tracers inject themselves into your code and log every single activity happening within.
00:12:15.479 This method comes with significant overhead since they can slow down your program. For instance, a program that runs in 5 seconds could take longer with a tracer, which could extend the time to 8 seconds. Ruby Prof is a notable example of such trigging tools. It captures detailed data, which can be beneficial.
00:12:43.600 Now you might wonder why we need various types of profilers. While statistical profilers can overlook certain events due to their snapshotting, tracers help capture every nuanced behavior. In one of my presentations, I discussed the call stack visualizer that works in a cohesive way to represent this data. This presentation has covered the academic aspects of performance debugging, now let's dive into real-world examples that my team at Koopa faced.
00:13:38.679 Before delving into specific cases, I'd like to share a powerful quote by Albert Einstein about the importance of curiosity: 'Curiosity exists for a reason.' He mentioned he has no special talent; he’s just passionately curious. When you're debugging performance issues, curiosity is vital. If you lack curiosity, you may struggle to unearth the root causes of problems.
00:14:05.839 Along with curiosity, having the right mindset is crucial. Instead of feeling burdened by being pulled into debugging, see these situations as learning opportunities. When problems arise, take them as chances to learn something new that you may not have encountered before.
00:14:46.640 Let’s jump into our first example. One situation involved our DevOps team observing a job that was stuck—our async processing workers were non-responsive and were consuming a lot of memory without releasing it. This led our DevOps team to suspect a bottleneck in processing.
00:15:20.760 They took the initiative to use GDB, a profiling tool that requires attaching a process ID. GDB is considered risky in production since it can interrupt execution, but they were confident that they only needed 5 seconds to get a snapshot. After just this short evaluation, they found that the job was engaged in complex regex operations in loops, which explained the high memory usage.
00:15:55.920 The data revealed that it wasn’t stuck but actively processing large datasets, which led us to optimize this regex operation significantly. Without these tools, reaching such insights would have been very difficult. The second example involves the app's boot-up time.
00:16:34.500 In another case, the team reported a multi-fold increase in boot-up time. Fortunately, owing to a robust performance testing infrastructure, we caught this issue during the development phase, prior to production release. The importance of boot-up time cannot be overstated—it affects both production environments and developer experiences.
00:17:05.040 In typical deployments, when you execute a command like 'bundle exec rails s' in production, you will receive a process ID. The team utilized this PID with RB Spy to generate flame graphs, comparing these to baseline metrics without making any code changes.
00:17:46.960 This allowed them to identify differences caused by new code that altered the boot-up process, helping us pinpoint unnecessary class variables that were leading to delays in startup. This proved invaluable information for optimizing the code.
00:18:29.120 Now, let's talk about a significant performance issue we encountered. Every transaction within the application showed a 2x to 3x slowdown over a period of two to three weeks. Since Koopa is a large company with many employees, this raised alarm, making us the first responders to investigate the issue.
00:19:06.920 Initially skeptical about an incident leading to such extensive performance degradation, I reviewed the commits made in the last few weeks and noted one outlier that involved upgrading our Ruby version from 2.x to 3.x.
00:19:41.000 Before diving into this, I want to clarify that this isn’t about comparing Ruby versions—my upcoming slides will reflect how this impacted our application specifically. First, we conducted a detailed analysis comparing Git commits before and after the upgrade to determine if performance testing could identify where the regression originated.
00:20:20.080 Using our APM tools, we discovered the heap size increased dramatically, along with spikes in garbage collection frequency and object allocations. Additionally, profiling the previously slowing transaction revealed that 63% of processing time now involved garbage collection.
00:21:01.760 It became clear that the garbage collection settings pertaining to Ruby might have been altered in some way. We decided to explore the environment variables associated with garbage collection configurations introduced in Ruby 3.
00:21:37.520 This investigation didn’t yield the results we had hoped for. However, continuing our efforts, we noticed that while Ruby 3 performed better locally, it also allocated more memory than its predecessor. We then scrutinized profiling data under memory mode, which led us to further insights.
00:22:10.760 Our profiling data indicated that many allocations correlated with marshalling activities—an unexpected discovery. In detail, this prompted us to redirect our focus back to our application's configuration to determine which components were leading to inflated memory use.
00:23:01.420 As we revisited our configurations, we used a thread profiler to visualize call stacks, discovering that about 70% of operational time was spent on inflation, which is related to caching. When caching data, compression is often applied; while this can optimize space, it may result in increased costs in performance.
00:23:40.360 Further investigation led us to identify a specific commit involving changes to our caching template, an upgrade that affected how marshaling was incorporated into our data processing. We soon learned that due to the configuration of our caching engine, we were inadvertently compressing the data twice, which resulted in unnecessarily lengthy processing times.
00:24:17.760 Upon uncovering this, we were diligent in reevaluating our configurations and took measures to fix any missteps, restoring our application to its previous performance levels. However, after some adjustments, we witnessed a significant increase in the volume of writes to our caching server, which highlighted the need for further refinements.
00:25:05.960 While our performance improved and backdated to normal levels, we also realized the impact of our fixes on our system’s behavior, which required ongoing adjustments to maintain stability.
00:25:45.040 To summarize key takeaways from this discussion: we initiated our talk with foundational concepts, emphasizing the importance of monitoring performance closely—if you don’t measure it, you can't gauge profiling needs, and without profiling, you can’t establish benchmarks for performance evaluation.
00:26:14.960 We also examined the functionality of various profiling software, like Stack Prof and others, while emphasizing how these tools can save developers time by helping to troubleshoot issues early and effectively. During discussions, I encouraged awareness of the quantity of effort required to truly optimize performance and reiterated the necessity of each stage.
00:27:02.840 As we wrap up, I want to encourage you with another quote from Albert Einstein: 'When you’re trying to find a needle in the haystack, do not stop when you find the first one; keep looking.' This principle applies directly to performance debugging—don’t give up once you’ve discovered an issue, as there may be additional inefficiencies lurking that warrant attention.
00:27:44.220 In closing, I appreciate the Ruby community for being supportive. The journey to achieving effective performance is continuous, and the experience shared here in my presentation is a result of my team's efforts at Koopa. Thank you for allowing me to present, and I’d like to acknowledge my team for their backing and Mike for being my mentor on this talk.
00:28:19.640 Thank you all for attending!