Damir Zekić

Toolbelt of a Seasoned Bug Hunter

wroc_love.rb 2018

00:00:13.640 Thank you very much for the warm welcome and for the introduction. As has been said, my name is Damir Zekić, and I'm going to share with you a tool belt for debugging. It goes a bit beyond the typical debugging tools, like binding.pry, that you reach for when you try to figure out what happened; this talk is about bugs that are a bit more difficult to track down. I have been working as a programmer for quite a while, and I particularly enjoy programming. In fact, I have two cats, and we do programming together: one of them purrs while I program, and that works quite well for us. While programming together, we've encountered a lot of issues, not all of which have been pleasant. We've faced unmatched parentheses, and we've also faced bugs.
00:01:14.880 We've seen some bad bugs—bugs that can destroy your production database or bugs that crash your servers. You know, really bad bugs. We've also seen bugs that aren’t so bad; essentially, bugs that make you feel a bit guilty about not doing all the due diligence you should have done, but it’s fine since no one noticed. However, there is another way to talk about bugs. We can speak about loud bugs and silent bugs. Loud bugs are usually easy to notice. For example, something crashes, there's a stack trace, and you see where the error points to, such as a method that is missing or misspelled. You just fix it, and life goes on.
00:01:57.380 However, there are also silent bugs, and those are a bit trickier. They work in the background, maybe for some time, and are hard to notice while they do bad things. When you finally notice them, you often have no idea what's going on, because it's usually a pile of issues that come together in really unexpected ways and create unforeseen problems over a long time. That's what we will be exploring today, in a story from 2016 that I think of as a debugging odyssey: six chapters with an epilogue. The story starts like any remote programmer's job: you log into Slack, and you see a message from your CTO saying, 'Oh, something bad has happened, please go fix it.' In this case, it was about running tests in the development environment, or rather, the local environment.
00:02:35.620 We were using Guard to automatically run tests after a file had been changed in a Rails application. To speed up the time it took for the Rails application to start, we had been using Spring, as I assume many of you do. If you are not familiar with Spring, here's a quick explanation: essentially, it preloads the Rails application into memory and keeps it there until it is explicitly stopped or killed. Every time you run tests, start the application, or open the Rails console, Spring reuses that cached copy and reloads only what has changed. So we don't have to load all of Rails and the libraries in our application again, and startup times are typically much faster.
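For context, here is a minimal sketch of the kind of setup I mean, assuming guard-rspec together with spring-commands-rspec (our actual Guardfile differed):

```ruby
# Guardfile (sketch): re-run the matching spec through Spring whenever a file changes.
guard :rspec, cmd: "spring rspec" do
  watch(%r{^spec/.+_spec\.rb$})
  watch(%r{^app/(.+)\.rb$}) { |m| "spec/#{m[1]}_spec.rb" }
end
```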
00:03:31.240 The problem for us was that after a file change was detected, we would wait a long time for the tests to actually begin, and it wasn't clear why this was happening. Chapter one, and step one of bug hunting: reproduce the bug. Let's make sure that we can actually reproduce it, see it, and experience it. So, we ran the tests, and they worked as expected. Feeling satisfied, we made a change, ran the tests again, and everything worked again. I was almost ready to say it couldn't be reproduced, but thought I should test it one more time. Actually, that run felt a bit slower. Testing again, it got slower and slower with each attempt. An interesting observation, and it meant our chapter one ended with success: we had reproduced the bug.
00:04:45.590 So why? I asked myself what introduced the bug. If I could find a commit in the git history that says, 'Oh, I changed this and that,' with five lines of code in that one commit, I would know what was happening and why it was slower. So, I used the first tool I'm going to mention today: git bisect. How many of you have actually used git bisect? Well, almost everyone seems to be familiar with it, but let me give a quick explanation for those who didn't raise their hands. It works by starting with two commits: one known to be bad (with the bug) and one known to be good (without the bug). We mark them accordingly and start searching for the first bad commit.
00:05:32.700 The way git bisect works is it finds one commit in the middle and checks whether that commit is good or bad. It essentially runs your test suite or a script to reproduce the bug. In this case, I had a script that would check if tests were getting slower and slower, for example. Let’s mark this one as bad, and we can ignore all subsequent commits because they will also be bad. We find another middle commit, test it out—maybe it turns out to be good, so we can proceed along our flow. Eventually, we reach a point where we have just one more commit to check, and we run the tests. I’ll leave you with a cliffhanger.
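As an aside, here is a sketch of the kind of check script you can hand to `git bisect run`: exit code 0 marks the current commit as good, a small non-zero code marks it as bad, and 125 means 'skip this commit'. The spec path and the threshold below are made up for illustration.

```ruby
#!/usr/bin/env ruby
# bisect_check.rb (sketch)
# Usage: git bisect start; git bisect bad; git bisect good <sha>;
#        git bisect run ruby bisect_check.rb
require "benchmark"

first  = Benchmark.realtime { system("bin/rspec spec/models") }
second = Benchmark.realtime { system("bin/rspec spec/models") }

# Mark the commit as bad when the second run is noticeably slower than the first.
exit(second > first * 1.5 ? 1 : 0)
```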
00:06:53.460 The issue is that for git bisect to work effectively, every commit in your git history must be stable enough for your test suite, or a check script, to run. If your tests randomly pass and fail because of a broken initializer, you cannot use git bisect effectively. So it pays to keep your workflow such that every commit produces a green build; then the only thing the check script has to detect is the specific bug you are hunting. In this case, the problem was how to find that one good commit. I went back through 100 commits, and they all seemed to have this slowdown.
00:07:42.240 What I needed was my favorite debugging tool: the Ruby programming language. I wrote a script called 'find_slow', which checks out commits one by one and checks whether the slowdown is there. If it is, the script moves to the previous commit and continues until it eventually finds a commit without the slowdown. This is essentially a linear search back through git history. I fired off the script and went to sleep. I woke up the next morning and... wow! It turned out I had forgotten one crucial detail in my script: I hadn't run 'bundle install' after checking out each commit.
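Roughly, that first version of the script looked like this; it is a reconstruction with made-up helper names and an arbitrary threshold, and, tellingly, it does nothing between checking out a commit and timing the tests:

```ruby
#!/usr/bin/env ruby
# find_slow (sketch): walk back through git history one commit at a time
# until we reach a commit that does not show the slowdown.
require "benchmark"

def time_test_run
  Benchmark.realtime { system("bin/rspec spec/models") }
end

def slowdown_present?
  first  = time_test_run
  second = time_test_run
  second > first * 1.5
end

commit = `git rev-parse HEAD`.strip
while slowdown_present?
  commit = `git rev-parse #{commit}~1`.strip
  system("git checkout --quiet #{commit}")   # note: no setup of any kind happens here
end
puts "First commit without the slowdown: #{commit}"
```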
00:08:42.010 Okay, so I fixed that and ran 'bundle install' after each checkout. Then I ran the script again, and... what did I forget this time? I forgot to migrate the database, so the schema got out of sync with the code. Easy fix: I was about to write a bin/setup script that made sure gems, the schema, and everything else were up to date with the codebase, but it turned out the script already existed; I just needed to use it. So, I replaced 'bundle install' with 'bin/setup', ran the script again, went to sleep, and woke up again to find that the script had crashed.
00:09:48.450 What happened this time? Well, there was quite a crazy story here. We had a developer in the company who had written a lot of JavaScript and maintained some personal JavaScript repositories. Some of those repositories defined a few functions, but he left the company and deleted them, while our Gemfile still pointed to repositories that no longer existed. I couldn't get past that point in history. Well, I could, but it was tricky. The lesson: try not to let contributors leave your company with their personal gems still in your Gemfile!
00:10:15.840 Still, while analyzing the output from the script, I noticed that the slowdown was getting smaller and smaller as I went back in git history. This suggested that the problem had been growing over time, but it was challenging to pinpoint exactly what introduced it: because the slowdown decreased gradually, it was hard to see the point where it first appeared. I did notice a huge merge commit that introduced many changes to the codebase, which could have been the start of the problem. It was the dawn of our monolith framework.
00:11:19.860 This framework was implemented and rolled out in-house to support our monolithic Rails application, and we described it as a business actions framework built on top of Rails. It lets you build business objects that handle requests coming from the browser: controllers execute business actions, which carry the business logic and interact with models, and so on. There are several other libraries and frameworks out there offering similar capabilities, but this was our solution.
00:12:05.540 At this point, I want to announce that our monolith framework is no longer called that—it's now called Granite: Business Actions Architecture for Rails applications. I believe I’m the first public speaker to mention it! I want to formally announce Granite as an open-source gem available today on GitHub. If you have a monolithic Rails application and want to refactor it to write objects encapsulating commands, business actions, and operations that include validations and conditions, check the repository out! It could help you better organize your Rails application.
00:13:10.370 Now, what I was facing was one enormous merged pull request, and I wasn't entirely sure it was the cause of the slowdown. So this chapter closes on a note of uncertainty: it doesn't have to be the reason. Let's try a different approach: what other causes could there be? Many people would mention memory leaks, because when things become slower over time, memory leaks are often the suspect. So, how exactly do we check for memory leaks?
00:14:12.700 My instinct was first to check the memory usage of the Rails application. That should be simple, right? We just run a command that prints out how much memory the process is utilizing. Ideally, we would see a chart indicating that, as time goes on, memory usage increases linearly or in another predictable manner. If it continues to grow indefinitely, then it's evident that something is leaking somewhere, and we just need to identify it. However, I didn’t know this at the time: Ruby doesn’t release memory back to the operating system.
00:15:23.050 So, if you have a spike in your program that utilizes a lot of memory, the Ruby process will increase its memory usage. Objects will exist in Ruby memory, and while Ruby's garbage collector may have cleared that memory, it won’t return the allocated heap back to the operating system immediately. Consequently, when you check Ruby’s memory usage, it will appear as high as it was before. Another common pattern you might see with Ruby processes is that, as time increases, memory usage will spike significantly but then reach a certain point and stabilize. Therefore, just checking the memory usage over time isn’t sufficient. We need to utilize a more scientific approach.
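For reference, the naive check is just watching the process's resident set size over time; a rough Ruby sketch, assuming a Unix-like system where the `ps` command is available:

```ruby
# Print this process's resident set size every second (ps reports it in kilobytes).
pid = Process.pid
10.times do
  rss_kb = `ps -o rss= -p #{pid}`.to_i
  puts "RSS: #{rss_kb / 1024} MB"
  sleep 1
end
```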
00:16:33.610 Let's formulate a hypothesis: "There is a memory leak". Now we need to either prove or disprove this by conducting an experiment, which would involve checking heap dumps. The heap is a type of memory where all our objects are stored, and dumping it means reading that memory and outputting it to a file we can analyze later. This is essentially a snapshot of our dynamic memory. Such tools exist for almost every programming language, including Ruby. The code for creating heap dumps is already built into Ruby; we just need to enable tracing of object allocations and run our code. Eventually, we print out all objects known to the object space.
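A minimal sketch of what that looks like with the objspace extension that ships with Ruby; the file path and the sample class are just examples:

```ruby
require "objspace"

# Record the allocation site of every object created from this point on.
ObjectSpace.trace_object_allocations_start

# ...run the code you are interested in, e.g. boot the application.
class Greeter
  def hello
    "world"
  end
end
Greeter.new.hello

# Dump every live object, one JSON document per line, for later analysis.
ObjectSpace.dump_all(output: File.open("tmp/heap.json", "w"))
```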
00:17:56.680 For instance, let's say we define one method on a class and invoke it. How many objects do you think Ruby creates? The answer was around 10,000, which is quite a lot! As this illustrates, there's still a long way for Ruby to go in terms of optimization and performance. Looking at these object allocations, we can see that every line in a heap dump is a JSON object, so we can read the file line by line. We might hope that the dump produced by a large application's startup is something we could follow in our heads, but instead we are met with thousands upon thousands of lines.
00:19:03.900 To help with this, you need tools designed for analyzing heap dumps; one such tool is called heapy. It's a gem that can be installed and run as a command, and it outputs a summary of objects grouped by garbage collection generation. For example, it might say, 'There are 209,000 objects that have survived through multiple garbage collection generations.' With heapy, you can also analyze one generation at a time, inspecting the objects that were allocated in it and survived.
00:19:55.320 In my case, while this was a useful way to investigate memory leaks, it wasn't sufficient, because a single profiling run at application initialization only gave me one snapshot. I had to analyze multiple heap dumps in sequence and compare them, which led me to write Ruby scripts that compare heap dumps in order to identify memory leaks. As I monitored the results, I noticed something intriguing: class DSL objects survived the longest.
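A sketch of the kind of comparison script I mean: it treats objects that appear in an earlier dump and are still present in a later one as long-lived survivors, then groups them by allocation site. It is deliberately simplified (addresses can be reused between dumps, for example), and the file names are just examples.

```ruby
require "json"
require "set"

def addresses(path)
  File.foreach(path).map { |line| JSON.parse(line)["address"] }.compact.to_set
end

earlier   = addresses("tmp/heap-1.json")
survivors = File.foreach("tmp/heap-3.json")
                .map { |line| JSON.parse(line) }
                .select { |obj| earlier.include?(obj["address"]) }

# Show the ten allocation sites with the most surviving objects.
survivors.group_by { |obj| [obj["file"], obj["line"]] }
         .sort_by { |_, objs| -objs.size }
         .first(10)
         .each { |(file, line), objs| puts "#{objs.size}\t#{file}:#{line}" }
```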
00:20:54.020 What are class DSL objects, you may wonder? They are the objects representing attributes and callbacks, which one wouldn't typically expect to be leaking. The fact that they survived for so long pointed to the possibility of a genuine memory leak. Still, this didn't provide clear answers, so I needed to keep investigating.
00:21:55.660 Returning to the performance problem we were faced with: it might have been related to memory, but that wasn't certain. I decided to trace the call stack. The plan was to see how long certain functions take to complete in order to identify any that consume more time than expected. This was my first experience using flame graphs. For those who are not familiar, here's a brief introduction: a flame graph is a visual depiction of the call stack.
00:22:54.250 Consider a small Ruby program where the main method calls one method and then another. If we run this program, we build up a call stack with a frame for each invoked method. We can then render that stack, illustrating where the time is being spent and which methods take the longest. In the context of a Rails startup, however, you can imagine the complexity once a huge number of objects and methods are involved. Three years ago, when I attempted to visualize such a flame graph in a browser, it would often crash the browser, which led me to favor Safari over Chrome for this task!
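These days, one way to capture such a graph (not necessarily what we used back then) is the stackprof sampling profiler; a hedged sketch:

```ruby
require "stackprof"

# Sample wall-clock time while the block runs; raw: true keeps the full stacks
# so a flame graph can be rendered from the dump afterwards, for example with
# `stackprof --d3-flamegraph tmp/boot.dump > flamegraph.html`.
StackProf.run(mode: :wall, raw: true, out: "tmp/boot.dump") do
  require_relative "config/environment"   # boot the Rails application
end
```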
00:24:21.510 Looking at the initial flame graphs wasn't particularly revealing; I didn't notice any spikes that indicated a problem. What stood out, however, was that the methods I was monitoring got progressively slower each time the Rails application started. I loaded three SVG flame graph files representing three consecutive runs of the application and observed that one of the functions took more time on each subsequent run, which finally gave me a tangible lead.
00:25:47.260 Everything about this function was encapsulated within Rails code, and I suspected my application code was buried somewhere deep within the Rails call tree. In an effort to dig into code I had not previously examined, I opened the router file in Rails and found a method responsible for clearing path helpers every time routes were loaded. This process appeared directly linked to the performance degradation.
00:26:50.420 Analyzing the functionality of this method revealed that it was clearing and redefining helper methods each time routes were loaded. Essentially, Rails had to redefine them each time to accommodate potential modifications to the routes. I questioned whether this was inherently an inefficient process. To test this theory, I decided on an experiment.
00:27:42.300 I created 1,000 dummy objects and used a method to include several methods in a module and then remove them from it, measuring how long this took. I speculated the operation would take linear time, i.e. the more objects there were, the proportionally longer it would take. However, the results turned out to be quite unexpected: the time was neither constant nor linear, but polynomial, which led me to believe that something was not operating efficiently. Reproducing the same pattern in plain Ruby, outside the application, showed the same behaviour, and that perplexed me enough to question how Ruby itself handles method definitions.
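A reconstruction of that experiment, with made-up names and numbers: many dummy classes include one shared module, and we measure how long it takes to define a batch of methods on that module and undef them again as the number of classes grows.

```ruby
require "benchmark"

[250, 500, 1_000, 2_000].each do |n|
  helpers = Module.new
  dummies = Array.new(n) { Class.new { include helpers } }   # keep them alive

  time = Benchmark.realtime do
    100.times { |i| helpers.send(:define_method, "helper_#{i}") { i } }
    100.times { |i| helpers.send(:undef_method, "helper_#{i}") }
  end
  puts format("%5d classes: %.2f ms", dummies.size, time * 1000)
end
```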
00:28:56.620 I dug into the underlying Ruby code for undefining methods. What I found was that much of the slowdown stemmed from how Ruby handles method definitions. Specifically, when a method is undefined, Ruby does not simply delete it; it adds an entry flagging the method as undefined. That may seem reasonable, but I learned that this operation can take a significant amount of time: in other words, the slowest part of undef is adding that flag. Along the way there were many observations about how inefficient this part of Ruby can be in specific contexts.
00:30:24.970 To recap: we had this monolith framework with one peculiar behaviour. Every time we defined a business action, we would mount it onto a route, and mounting it created a controller during the application's boot. Because of the way Spring operates, keeping an instance of the Rails application in memory, the controllers created at boot persisted longer than necessary, and older controllers were retained even across subsequent reloads. Together, these conditions piled up into a mountain of performance issues.
00:31:44.620 Making these discoveries allowed me to go back and address what we had been doing wrong. We re-implemented our interface for mounting business actions onto the routes, which was a significant change. This took considerable time, comparable to the debugging itself. I call this debugging experience an odyssey because it was not a simple task; in fact, it spanned over a month. But we learned a critical lesson: we should refrain from creating controllers at the application's boot time.
00:32:47.160 So, we made the adjustments and fixed it. The cost became effectively constant again, and our issues were resolved. In essence, we now create controllers only after Spring has started, eliminating their unintended retention in that primary Spring instance. Fixing this was a significant success for the team.
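The shape of the fix, as a purely hypothetical sketch (the real framework code is different and the names here are made up): instead of building a controller class for every business action while routes are drawn, build it lazily the first time it is needed and memoize it, so the preloaded Spring process never accumulates controllers it does not use.

```ruby
# Hypothetical illustration, not the actual framework code.
module BusinessActions
  def self.controller_for(action_class)
    @controllers ||= {}
    @controllers[action_class] ||= Class.new(ApplicationController) do
      define_method(:perform) { action_class.new(params).perform! }
    end
  end
end
```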
00:33:53.060 Now, to finish up: the lessons we're left with aren't about avoiding Rails altogether, but rather a reminder of the importance of knowing the code you have written. It's essential to understand how your application interfaces with the frameworks you rely upon, such as Rails. Gaining that level of understanding can be quite challenging, especially when collaborating with numerous other developers who are continuously introducing changes to the codebase.
00:35:02.190 We also have to grasp that our frameworks add layers upon layers of complexity to our work, so it may be worth considering different approaches to application architecture that do not hinge on monolithic and cumbersome frameworks. While I have yet to pursue these ideas, I intend to explore that direction more vigorously going forward.
00:36:42.750 Finally, I promised an epilogue to my talk. Years after this experience, I discovered a small piece of code within Active Support, specifically a module called 'DescendantsTracker'. It is used by base classes throughout a Rails application and keeps a class-level hash linking each class to its descendants. If, for some reason, you find your classes aren't being garbage collected, it might be because they are still retained within that hash, which exists for optimization purposes. This is crucial to keep in mind when dealing with memory issues in Rails applications, as frameworks may cache and manage object life cycles in non-transparent ways.
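Conceptually (this is not the actual Rails code, just an illustration of the mechanism), it behaves like a long-lived registry that holds a reference to every subclass, which is exactly what prevents those classes from being garbage collected:

```ruby
# Illustration only: a hash owned by a long-lived module keeps every subclass
# reachable, so the subclasses can never be garbage collected while listed here.
module Tracker
  REGISTRY = Hash.new { |hash, key| hash[key] = [] }

  def inherited(subclass)
    super
    REGISTRY[self] << subclass
  end
end

class Base
  extend Tracker
end

subclass = Class.new(Base)     # even an anonymous subclass is now retained
subclass = nil                 # dropping our reference does not free it
p Tracker::REGISTRY[Base].size # => 1
```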
00:37:34.960 So, bug hunting can be quite an adventure. Thank you for your attention! Now, was it worth debugging a persistent development environment issue for such an extended period, or should you just ditch Spring or Guard altogether? In my opinion, absolutely yes, it was worth it. During that time, we had between 50 and 200 developers in the company, all running tests frequently throughout the day. The first three runs with Spring were significantly faster than running those tests without it; after several runs, Spring's benefit diminished, but it still had clear advantages, and I truly believe the effort was worth it.
00:39:10.920 Now, our framework is production-tested, and you can confidently use it since it won’t impede your development environment. Thank you!