Euruko 2018

Tool belt of a seasoned bug hunter

by Damir Zekić

In the video titled "Tool belt of a seasoned bug hunter," speaker Damir Zekić walks through the debugging techniques and tools every developer should have, drawing on his personal experience with a performance issue in a Rails application. He introduces the concept of different types of bugs: loud bugs that reveal themselves quickly and silent bugs that cause long-term damage without immediate notice.

Key points discussed include:

- Identifying Bugs: The process begins with reproducing the issue reported by the CTO: specs that take a very long time to start. Initial attempts show no problem, but after repeatedly modifying code and re-running the tests, a significant delay appears, confirming that the bug is real.

- Using Git Bisect: The speaker highlights the use of git bisect as a binary search method to identify the commit that introduced the performance issue, showcasing its efficiency in narrowing down potential problematic changes based on good and bad commit markers.

- Challenges in Debugging: Zekić recounts the hurdles faced, such as crashes due to uninstalled gems and database mismatches while checking out older commits. He emphasizes the importance of stable commits to ensure that any code version can be executed successfully in the future.

- Memory Profiling: The discussion moves into memory profiling and detecting potential memory leaks using Ruby tools. He explains how Ruby's memory management can lead to bloat over time, making it crucial to analyze memory consumption trends.

- Performance Issues: Further investigation reveals a particular method in Rails responsible for performance slowdowns. He explains the method's logic, which dynamically alters route helper methods, directly impacting application performance.

- Solution Implementation: To mitigate the issue, Zekić recommends lazy-loading controllers so that method generation happens only when the controllers are actually needed, which significantly improves application speed.

In conclusion, the video stresses that developers need to understand not only their own code but also all of the application's dependencies. Effective debugging requires not just technical skill but also attentiveness to the application's structure and to the external libraries that can affect overall performance.

00:02:24.590 So, while people are still working, we might as well do something a little bit chaotic. We’re going to do a Friday hug! Who is not familiar with what a Friday hug is? Okay, you're lying; I see it. A Friday hug is where you stand with your arms wide open, as if you’re about to approach someone and give them a big hug. You are all facing me while you do that, and we’ll have a photographer here to capture it. So, we’re all getting up. Yes, I know it was just lunch, but hey! On the count of three, we’ll do some hugging. One, two, three—hug!
00:04:01.200 It’s not just about wanting to go there; you must actually be willing to organize it. So, please tell the people around you to stop talking—yeah, that’s a lot easier said than done! I’m going to go all flight attendant on you now. There are flyers in your goodie bags that tell you about two beautiful challenges that Niš Laps is organizing. One of them is a coding challenge where you can win actual cash. That’s kind of nice! The other challenge is for you to create a mind map of one of the upcoming talks, share it with them, and you’ll get a free t-shirt.
00:05:02.070 Additionally, we have another opportunity—not really a challenge—sponsored by GitHub. They are upstairs and looking for people to interview about their open-source projects. So, if you have open-sourced something and want to share it with the rest of the world, they will do audio and video recordings, so it won’t be a stressful interview. It will be a nice chance to share your stories of open source.
00:05:31.170 Now, I hope you are silent for our next speaker. Hi! Damir is a software engineer who loves teaching. He’s also a remote worker, as I guess many of us are, and he will try to get to the bottom of a performance issue with us based on a true story, so that's exciting.
00:06:17.370 We program and debug, and we've been doing both for quite a while. Naturally, we've all heard a lot of horror stories and seen some ourselves, such as mismatched parentheses, which is pretty bad. But there are worse things than that, like the bugs that keep you up at night, the ones that wake you up and destroy your production server or database when you don't have a backup. We won't speak about those today, though.
00:06:31.660 There are also bugs that aren't so bad in the grand scheme. They make you a bit embarrassed, but no one is hurt, everything is fine, and we can move on. Then there are loud bugs: for instance, you call a method on nil and it looks like everything has exploded, but it's fine, because the error tells you exactly where to look to fix the problem. However, there are also silent bugs, which are worse, because they do their damage over time. You don't know they are there until it's too late and things have gotten pretty bad. So, this will be a story of a hunt for one such bug, which I like to call "2016: A Debugging Odyssey."
00:07:36.729 The odyssey begins with my CTO coming to the team channel and asking us to look at a problem that has made our specs tremendously slow to start. By 'running slow,' I mean that it takes a huge amount of time for them to start at all, which is not great; naturally, that slows down our whole development cycle as well. If we can fix this, we could save hours of effort for everyone. Every debugging story begins with trying to reproduce the problem, with being able to see it for yourself.
00:08:58.840 So, I naturally start my application and try to run the tests, and they start quickly. I'm using Spring, so a copy of the Rails app is already loaded in memory; when you start the tests, that preloaded application forks into another process, which is typically very quick. I can't just go to my CTO and say, "Well, there's no bug; you're crazy," because I know he's not. I try again, and it's still pretty quick. I change some code and try again, and now something is happening. Basically, I iterate: I modify some code, run the tests, and observe the outcome.
00:10:25.030 After six or seven times, suddenly it takes about 20 to 30 seconds for the Rails application to fork and for tests to start. So, my CTO is not crazy—there is definitely something there! I feel a rush of adrenaline because I solved something; I reproduced the issue. Now, I need to find out what actually introduced the issue. If I try the same steps on a smaller Rails application, there’s no problem; it runs quickly. Thus, I want to find one commit that introduced the issue into the codebase and understand how it happened.
00:11:40.050 I reach for one of my favorite tools for that: git bisect. Git bisect uses the magic of binary search to quickly identify a problematic commit in a very small number of steps. I just need to tell Git where the bug is and where there is no bug, marking one commit as bad and the other as good. Now I let git do its job.
00:12:21.590 Git checks out the commit in the middle between those two and asks me to tell it whether it's good or bad. I run the tests and find that the commit is bad, so Git marks it as such. Now we automatically know that all commits that follow it are also bad. Git selects another middle commit and asks for its status, and this one turns out to be fine.
00:12:55.790 Now we are left with one final commit to test, and then we will know where the problem is. I will leave you with that cliffhanger. The catch with git bisect is that you first need to know a commit that was good. However, because git bisect uses binary search, we don't have to worry too much about that: to identify one bad commit among thousands, we might need just 13 steps, so we can pick a commit from ages ago as our good starting point and rely on Git to do the magic.
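The arithmetic behind that "13 steps" claim is just the logarithm of the history size; a quick check (the commit counts are mine, chosen to illustrate the point):

```ruby
# Binary search halves the candidate range on every step, so the number of
# bisect steps grows only logarithmically with the number of commits.
Math.log2(8_192).ceil   # => 13, i.e. ~8,000 commits need at most 13 steps
Math.log2(100_000).ceil # => 17, even a huge history stays cheap
```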
00:14:00.750 So, that's what I try to do. I write a small script to check whether a particular commit is slow to start, kick it off, go to sleep, and wake up in the morning excited to see the results. However, I find that the script has crashed. Why? Well, I had gone back far enough in history that I no longer had the necessary gems installed; I had to install the gems as I checked out old versions of the codebase. That's usually fine, so I added a bundle install line to the script, thinking that would solve it.
00:15:22.270 I go to bed again, very excited, but of course there is another crash in the morning. Why? I forgot about the database. You know, the code and the database can also go out of sync as you check out old commits, so I have to add a database reset to the script as well. After several attempts of this kind, I realize, funnily enough, that at some point someone had introduced a dependency in our main product pointing to his private repository on GitHub.
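Putting those fixes together, the kind of helper script described here might look roughly like this. It is a sketch under my own assumptions (the 15-second threshold and the db:reset step are illustrative, not from the talk), written so `git bisect run` can call it; exit code 125 tells Git to skip a commit that can't be set up at all:

```ruby
#!/usr/bin/env ruby
# Sketch of a bisect helper, used as: git bisect run ./bisect_check.rb
# Exit 0 => mark the commit good, exit 1 => bad, exit 125 => skip it.
require "benchmark"

SLOW_THRESHOLD = 15 # seconds; an assumed cut-off for "slow to start"

# Keep whatever commit bisect checked out runnable: reinstall its gems and
# reset the database, since both drift out of sync across old commits.
system("bundle install --quiet")              or exit 125
system("bin/rails db:reset > /dev/null 2>&1") or exit 125

# Time how long it takes the app to boot far enough to do anything at all.
boot_time = Benchmark.realtime do
  system("bin/rails runner 'puts :booted' > /dev/null 2>&1") or exit 125
end

exit(boot_time > SLOW_THRESHOLD ? 1 : 0)
```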
00:16:13.160 Sometimes we do such things, and eventually that dependency was removed from the Gemfile. He cleaned up his Git repositories and removed the wrapper, and now there is a hard line in history before which I cannot check out the code and get it to work. There's a lesson here, expressed in the famous words of General Maximus: "What we commit in life echoes in eternity." So, always aim to make every commit stable, ensuring that you can check it out in the future and run your application or your tests.
00:16:51.680 Another debugging problem was that the issue became more pronounced over time, so the further back I went toward its source, the harder it was to separate the signal from the noise. However, we gradually got an indication of where the problem lay, and it turned out to be in a framework that we had built internally for writing our monolithic Rails application; it allowed us to mount business-logic classes directly onto Rails routes.
00:18:03.570 This framework encapsulated controller logic and various operations, but what I found was not the few hundred lines of code I had hoped for; instead, it was a merge commit that introduced this whole new technology and changed tens of thousands of lines of code. Finding the commit that probably introduced the problem therefore didn't help me much on its own. However, if the framework was the issue, it would explain why the problem kept getting worse over time, since we were adopting it gradually. So we could assume it had something to do with this new framework.
00:18:56.870 However, that doesn't really help us fix anything. As I was discussing this with people, some asked whether it could be a memory leak, which would explain our problems. One way to test this is simply to look at the memory footprint of your application over time using a memory profiler. If the profile is wibbly-wobbly but keeps returning to some baseline, we can assume there is no memory leak. However, if the memory profile is a steady incline, we can be fairly confident that there is indeed a leak.
00:20:14.760 This is tricky, though, because Ruby is quick to grab memory but slow to release it back to the operating system. The memory profile of most Ruby applications therefore tends to climb and only come down slowly, which over time creates a sort of memory bloat. That is not inherently a bug, but it can lead to issues and complications, so we need to look into what is actually consuming the memory.
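One crude way to tell a genuine leak from ordinary bloat is to watch the process's resident memory while repeating the suspect work. The helper below is a minimal sketch of that idea; run_suspect_workload is a hypothetical stand-in for whatever reproduces the problem:

```ruby
# Sample the process RSS after each round of work: a curve that keeps
# climbing suggests a leak, one that levels off suggests normal bloat.
def rss_kb
  `ps -o rss= -p #{Process.pid}`.to_i # resident set size, in kilobytes
end

20.times do |i|
  run_suspect_workload # hypothetical: whatever reproduces the slow behaviour
  GC.start             # give the GC a chance, so we measure retained memory
  puts "iteration #{i}: #{rss_kb} KB"
end
```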
00:21:12.910 Ruby provides two handy methods to help us investigate this. One is ObjectSpace.trace_object_allocations_start, which begins collecting information about everything Ruby allocates; the other is ObjectSpace.dump_all, which exports everything on Ruby's heap to a single file. That file contains one line per object, and each object can be inspected individually.
00:21:54.440 There are many interesting attributes in there, but for now I want to focus on one in particular: generation, which records the garbage-collector generation in which the object was allocated. In other words, it tells us roughly how long the object has been in memory. To explore this, I generated a heap dump; as you can imagine, good luck scrolling through it! Thankfully, there are tools like heapy that condense and summarize the object allocations for us.
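In code, those two ObjectSpace calls and the heapy follow-up look roughly like this; the dump path and the "boot the app" workload are my own placeholders:

```ruby
require "objspace"

# Start recording allocation metadata (file, line, GC generation) for every
# object created from this point on.
ObjectSpace.trace_object_allocations_start

require_relative "config/environment" # the work whose allocations we care about

# Write everything currently on the heap, one JSON object per line.
File.open("tmp/heap.json", "w") do |file|
  ObjectSpace.dump_all(output: file)
end

# Each line can then be inspected directly, or summarised per GC generation
# with the heapy gem, e.g. `heapy read tmp/heap.json`.
```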
00:23:56.400 So, I was exploring this heapy report for my application and found plenty of classes and objects, but I started having trouble interpreting it. I kept the memory leak as a plausible theory and carried on exploring: looking at memory profiles wasn't enough, and I needed to find out why something was taking longer than it should to run.
00:25:01.700 Thus, I decided to trace the call stack over time to see whether one of the methods was persistently taking too long. To illustrate what I mean, consider a simple program: the main method calls other methods, such as say_hello and say_bye. By observing how the call stack changes over time, I can see which methods the program spends most of its time in.
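A minimal version of that idea is to sample the main thread's backtrace on an interval while the expensive work runs and count how often each frame shows up. The sketch below is a toy (gems like stackprof do the same thing properly; naming stackprof is my addition, since the talk doesn't specify a profiler):

```ruby
# Toy call-stack sampler: record which frames the main thread spends its
# time in while the expensive work (here, booting Rails) is happening.
samples = Hash.new(0)
main    = Thread.current

sampler = Thread.new do
  loop do
    main.backtrace&.each { |frame| samples[frame] += 1 }
    sleep 0.01
  end
end

require_relative "config/environment" # the slow work being profiled

sampler.kill
samples.sort_by { |_, count| -count }.first(10).each do |frame, count|
  puts format("%6d  %s", count, frame)
end
```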
00:26:23.030 To visualize this, I used a tool called flame graph, which produces a visualization in SVG format. When I tried to load it in Chrome, though, it crashed instantly because of the sheer number of methods called during Rails initialization. Safari, however, handled it fine, and I switched to it as my default browser.
00:27:25.250 Looking at several flame graphs side by side, I could make better comparisons. After analyzing multiple graphs, it quickly became evident that there was one method consistently taking more time than in previous iterations. And, naturally, this method belonged to Rails, so I was concerned about how to navigate that issue.
00:28:38.340 So, I continued to investigate and found that the method causing the slowdown lived in ActionDispatch, specifically in a method named clear. Among other things, this method dynamically removes and re-adds the route helper methods. As a result, every time you change code, Rails reloads and redefines the routes, and that is where the slowdown comes from.
00:29:44.430 To explore the performance, I set up a benchmark script. It created tens of thousands of objects, managed module inclusions, and systematically timed the addition and removal of methods. Unfortunately, the results were not what I had initially expected; the measured timings were much slower than anticipated.
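The benchmark itself isn't shown in the talk, but a rough reconstruction of the idea, defining and then removing tens of thousands of route-helper-style methods on a module and timing both directions, might look like this:

```ruby
require "benchmark"

helpers = Module.new
names   = Array.new(50_000) { |i| :"route_#{i}_path" } # helper-style names

Benchmark.bm(8) do |bm|
  bm.report("define") do
    names.each { |name| helpers.send(:define_method, name) { "/#{name}" } }
  end
  bm.report("remove") do
    names.each { |name| helpers.send(:remove_method, name) }
  end
end
```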
00:30:53.500 This led me deeper into how Ruby handles method definitions. I discovered that when methods are removed from a module or class, Ruby essentially tags them with a marker indicating they are undefined, rather than simply forgetting them, until they are defined again. It turned out that this bookkeeping introduced further slowness in my application.
00:32:34.160 Ultimately, I understood that if an excessive number of modules is involved in the controllers, reloading them can become a bottleneck as Rails injects methods into those objects. By switching to lazy-loading the controllers, we were able to alleviate the application's performance issue, because controllers were now only generated on demand.
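The shape of that fix is to defer the expensive class and method generation until a controller is first needed. Here is a sketch of the idea; the OperationsFramework module and its API are hypothetical, standing in for the internal framework mentioned earlier:

```ruby
# Hypothetical sketch of lazy controller generation: instead of building a
# controller class (and its methods) for every operation at boot or on every
# reload, build each one the first time it is actually requested.
module OperationsFramework
  @controllers = {}
  @mutex       = Mutex.new

  def self.controller_for(operation)
    @mutex.synchronize do
      @controllers[operation] ||= build_controller(operation)
    end
  end

  def self.build_controller(operation)
    # Inside a Rails app, ActionController::Base is already loaded.
    Class.new(ActionController::Base) do
      define_method(:call) { render plain: operation.to_s }
    end
  end
end
```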
00:33:05.100 In conclusion, while troubleshooting, you need to know not just your own code but your application in its entirety, including all its dependencies and libraries. When debugging, it's essential to dig deep into every part of your application, and to choose your libraries wisely, because they can significantly affect your performance.