Talks

A Common Taxonomy Of Bugs And How To Squash Them

by Kylie Stradley

In her presentation at RubyDay 2016, Kylie Stradley discusses a practical taxonomy of bugs in software development and effective strategies for debugging them. The central theme of the talk is to demystify debugging processes by categorizing bugs into recognizable patterns, thus equipping developers with logical approaches to solving issues rather than relying solely on instinct or experience. The key points of the presentation include:

  • Debugger as a Swiss Army Knife: Debugging is compared to a Swiss Army knife, highlighting its versatility in tackling various software problems.
  • Instincts vs. Patterns: While debugging often seems intuitive, Stradley emphasizes the importance of developing logical patterns over relying on a single developer's insights.
  • Bug Categories: Stradley identifies four major categories of bugs to aid in debugging:
    • Bohr Bugs: Easily reproducible, deterministic bugs in the code, often resulting from inadequate testing.
    • Schrodinger Bugs: Bugs that appear functional until examined closely, often leading to confusion regarding their validity.
    • Heisenberg Bugs: Elusive bugs that alter behavior upon observation, making them difficult to diagnose.
    • Mandel Bugs: Bugs where everything appears broken at once, but the chaos usually traces back to a single underlying problem.
  • Reproducibility and Logging: Importance of establishing clear reproducibility for debugging and utilizing logging strategies to facilitate tracking down bugs effectively.
  • Examples: Stradley uses specific scenarios to illustrate how these bugs manifest and how best to tackle them, including logging practices and testing scenarios.
  • Takeaway: Debugging can be transformed from a mystical process into a systematic approach through classification and documentation of bugs. All developers should aspire to create their own debugging toolkit to enhance problem-solving efficiency.
  • Collaboration and Teamwork: The significance of fostering a collaborative environment among developers and ops teams to handle complex issues, promoting shared knowledge and support.

Kylie concludes that building a practical toolkit for debugging will empower developers to manage complex applications and improve overall team efficacy.

00:00:09.650 Thank you so much for that introduction. I am visiting from very far away, Atlanta, Georgia, and I'm here to share with you a practical taxonomy of bugs and how to squash them. I hope you'll excuse the messiness of this presentation; these are kind of my field notes on how to squash bugs, my field guide for debugging, along with case studies from things that I've encountered.
00:00:18.829 We can probably agree that debugging software is like having a Swiss Army knife in a developer's toolbelt. A Swiss Army knife is a multi-tool designed to solve various problems and tackle any situations you might find yourself in. Debuggers seem to be able to troubleshoot anything, resolving issues from a database that refuses connections to an 'undefined is not a function' error. It seems they always know the next question to ask and the next path to take.
00:00:54.049 Watching them work can seem almost magical; they appear like mythical creatures, or unicorns, who can debug things at all levels of the stack. You might work alongside them and, if you're that unicorn on your team, you reassure your teammates that they, too, will develop these debugging instincts someday. Their reactions make it easy to believe that debugging comes purely from intuition, as you don’t see the gears turning in their heads—you just see them working. Having worked on consulting projects in the past and more recently in a large product company, I often heard phrases like, 'As you familiarize yourself with the application, you'll build up some debugging instincts,' or 'I know where to look because I have scars from the last time this happened.'
00:01:45.110 Sometimes you hear something very specific, such as, 'When I see something like this, the first thing I do is scan the logs to determine if this process is completing or if it’s sending a weird error message.' You can take that at face value, thinking that, eventually, as you work in the application you'll develop instincts, too. Alternatively, you could write down exactly what they say and memorize it: 'When X happens, check the logs.' You might even disabuse yourself of the notion that these instincts are inherent magic you just pick up. Instincts can sometimes mystify us and allow us to create heroes or unicorns from each other.
00:03:11.310 While it’s alright to praise heroic effort, turning a fellow developer into a hero makes them the sole source of truth instead of our code. Our code should be the primary source of truth regarding how our application behaves. Maybe on your team, you have someone who says, 'Oh, that always breaks,' or 'I won’t go near that part of the application,' or you might have to ask Kate, who 'knows how to fix it.' Believing that someone has better instincts than others lets us off the hook for the hard work needed for effective communication as a team and with our code. We pass the hard work to the heroes to shoulder everything and fix all the really bad bugs.
00:04:19.440 However, instincts don't scale. You can't just spin up a new instance of your resident unicorn developer the way you can spin up a new server or database to replace one that has failed or is on vacation. What you can do, more realistically, is create more of these magical creatures, and I believe that if we examine instincts closely, we can use logic to turn them into patterns. In object-oriented programming, we often discuss design patterns, and I believe there are also debugging patterns.
00:05:03.009 To illustrate, consider the useful phrase: 'Whenever I see X, I always check Y.' I love this sentence because it serves as a rule set. If you use conditionals or case statements, you likely see similar conditional logic. In my experience, instincts, scars, and feelings are simply internalized rule sets. Think back to a time when someone told you how they debugged something. Often, they might say, 'If the API is running slow, I assume we are getting too many calls,' and that knowledge translates into actions they take, helping to respond to the issue.
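To make that rule-set idea concrete, here is a minimal Ruby sketch of an instinct written down as explicit conditional logic; the symptoms and suggested checks are invented for illustration and are not from the talk.

```ruby
# A debugging "instinct" written down as an explicit rule set.
# The symptoms and suggested checks below are made-up examples.
def first_thing_to_check(symptom)
  case symptom
  when :api_responding_slowly
    "Check request volume -- are we suddenly getting too many calls?"
  when :undefined_is_not_a_function
    "Check which JavaScript bundle shipped and whether the function exists in it."
  when :database_refusing_connections
    "Check the connection pool size and how many workers are competing for it."
  else
    "Scan the logs: is the process completing, or is it raising a weird error?"
  end
end

puts first_thing_to_check(:api_responding_slowly)
```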
00:06:02.690 Now, let's get to the actual debugging patterns I've developed for my field guide. I do think that having names and categories for bugs helps contextualize the debugging process. The first thing to keep in mind is that sometimes containment must take priority over squashing. If a bug is wreaking havoc on your production environment, you must kill whatever is causing it and get the app back online, getting things working for users before you can truly sit down and examine why it happened and how to prevent it in the future.
00:06:58.360 Next, we can only work with facts. We cannot make assumptions or conjecture; we must be able to say, 'This is exactly what is happening.' Lastly, while I cannot detail how to catch every type of bug, I can introduce four larger categories that I believe most bugs fall into. We can identify bugs by their observable attributes, focusing on their behaviors and where we observe them. Borrowing from research scientists, there's a branch of taxonomy called phenetics, which identifies and groups living organisms by their observable attributes and behaviors. Bugs in the software world can be identified in the same way.
00:08:23.180 Now, I want to give you a warning: in order to illustrate these patterns, I will present some convenient scenarios that may be contrived. However, my intent is for you to focus on identifying attributes and larger, more general patterns, particularly regarding the kinds of bugs encountered in live production applications. As I mentioned, I have identified four types of bugs, which fall into the two major groupings I observed during my research.
00:09:32.610 I refer to those groupings as 'upsettingly observable' and 'wildly chaotic.' Upsettingly observable bugs are the ones that make you smack yourself when you realize what happened. You might think, 'How could this happen?' or 'Shouldn't unit tests have caught this?' They usually reside in your code, often in under-tested or completely untested paths: the kinds of code we assume will always work correctly and therefore never bother to test thoroughly.
00:10:01.430 On the other hand, wildly chaotic bugs are much more distressing. These bugs seem to break everything at once and are often not reproducible on local development machines. You're terrified to reproduce them in production. For instance, you might be baffled about how a port can be open but is simultaneously rejecting connections. These seemingly impossible scenarios can leave you feeling like nothing is functioning.
00:10:48.320 Now that we understand those two groups, how do we squash them? Let's discuss the first category of upsettingly observable bugs. These bugs have identifiable observable attributes: can the bug be reproduced locally, and is it restricted to a specific area, like a particular workflow in a web application? If many customers are hitting the same sticking point, that's a sign you could have this kind of bug on your hands.
00:11:55.670 Bugs characterized this way are called 'Bohr bugs,' named for the simplicity and determinism of the Bohr model of the atom. They provide consistent output for a given input and are characterized by repeatability and reliable manifestation. Typically, when you see one, you think, 'I can't believe that happened,' while also recognizing immediately how to reproduce it manually. Most frequently, these bugs are found buried in your code, but they can also hide within server configurations.
00:13:24.500 Moreover, they tend to lurk within complex branching functions or classes, or within configuration files. Bugs frequently hide in any code with numerous if-else statements or switch cases. When we look for Bohr bugs in the wild, validation can often be a minefield. Validation classes are important to unit test because of their inherent complexity. That complexity can lead to scenarios where the user experience appears to function while errors fail silently, so the error never reaches the developer, who lacks insight into the full flow of the process.
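As a contrived illustration of that hiding spot (the validator class and its rules are hypothetical, not from the talk), here is a branch-heavy Ruby validator in which one path silently skips recording an error:

```ruby
# A contrived, branch-heavy validator -- a classic home for Bohr bugs.
# The class and its rules are invented for illustration.
class SignupValidator
  attr_reader :errors

  def initialize(params)
    @params = params
    @errors = []
  end

  def valid?
    check_email
    check_age
    @errors.empty?
  end

  private

  def check_email
    email = @params[:email].to_s
    if email.empty?
      @errors << "email is required"
    elsif !email.include?("@")
      # Bug: this branch was meant to add "email looks invalid", but it does
      # nothing -- a malformed address sails straight through, silently.
    end
  end

  def check_age
    @errors << "age must be a number" unless @params[:age].to_s.match?(/\A\d+\z/)
  end
end

validator = SignupValidator.new(email: "not-an-email", age: "30")
p validator.valid?  # => true, even though the email is malformed
p validator.errors  # => []
```

Given the same input, the wrong result appears every single time, which is exactly what makes this kind of bug reproducible and catchable.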
00:14:15.190 How do we reproduce and resolve this Bohr bug? It's friendly and the easiest to catch; start with this bug because it will provide a model for the others. Replicate the bug's manifestation on your machine as simply as possible, and write a test that mimics that replication process. Verify manually as well, making sure the correct behavior is actually taking place. Once you've identified the bug, apply the simplest implementation that resolves the issue without breaking any tests.
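A minimal Minitest sketch of that workflow, assuming the hypothetical SignupValidator above has been loaded: first a failing test that reproduces the bug exactly as triggered by hand, then the simplest change that makes it pass.

```ruby
require "minitest/autorun"

# Reproduces the Bohr bug in the hypothetical SignupValidator sketched above
# (this file assumes that class has already been loaded).
class SignupValidatorTest < Minitest::Test
  def test_rejects_email_without_an_at_sign
    validator = SignupValidator.new(email: "not-an-email", age: "30")

    refute validator.valid?, "expected a malformed email to be rejected"
    assert_includes validator.errors, "email looks invalid"
  end
end

# The simplest fix that turns this red test green is to replace the silent
# branch in check_email with:
#   @errors << "email looks invalid"
```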
00:15:39.860 It's crucial to rewrite the code so that it is comprehensible and extendable. That way, a Bohr bug, or any other kind, cannot sneak back in when someone later inherits or builds on your code. So make sure your implementation is not only efficient but also clean and clear to read. On top of that, remember that our goal is not merely to contain the bug but to squash it, and in doing so to solidify our understanding of the code.
00:16:26.620 Now let's proceed to discuss the second upsettingly observable bug. These are characterized by their observable attributes. You might find yourself asking, 'How does this work? Does this even work? What is it testing? Did it ever work?' If your answers to these questions yield confusion or uncertainty, then you may have encountered a 'Schrodinger bug.' This term refers to situations where we cannot confirm the validity of a process without directly observing it.
00:17:23.160 Much like in the famous thought experiment, the code masquerades as functional until you observe it closely, at which point it reveals itself as a bug. Two main types of Schrodinger bugs exist: those that never truly functioned and those that worked, but not as intended. The second type can be more insidious, as it misleads developers into believing that everything is functioning normally.
00:18:11.240 In production, a Schrodinger bug that never worked often divulges its nature through side effects. A common hiding spot occurs around return values when a function completes or when a value is saved. For instance, a user's UI may reflect changes they've made, but those changes could fail to save in the database, masking the real issue unless caught. Two indicators that signal this type of bug often emerge: a pattern of repeated calls within the code or the presence of complex logical paths.
00:19:14.040 In this scenario, checking how and when the value is being read or assigned can lead to insight. When you run the code, you may discover that a certain part is being evaluated multiple times when it should not be. Each validation must confirm that the data is actually ready to save, so paying attention to call counts can expose these hidden bugs. Surface-level flags might suggest successful execution while hiding an intricate path that leads to failure.
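Here is a small, self-contained Ruby sketch of that pattern; the Profile class is a hypothetical stand-in that follows ActiveRecord's convention of #save returning false when validation fails. The return value is ignored, so the user sees a success message while nothing is persisted.

```ruby
# Plain-Ruby stand-in for a record whose #save returns false when validation
# fails -- the same contract as ActiveRecord's #save. Names are hypothetical.
class Profile
  attr_reader :errors
  attr_accessor :display_name

  def initialize
    @errors = []
    @saved_name = "original"
  end

  def save
    @errors = []
    if display_name.to_s.strip.empty?
      @errors << "display name can't be blank"
      return false           # validation failed; nothing is persisted
    end
    @saved_name = display_name
    true
  end

  def persisted_name
    @saved_name
  end
end

profile = Profile.new
profile.display_name = "   "

profile.save                 # Bug: the false return value is ignored...
puts "Profile updated!"      # ...so the UI cheerfully reports success,
p profile.persisted_name     # => "original" -- but nothing was saved.

# Squashed: check the return value (or use a bang method such as save!).
if profile.save
  puts "Profile updated!"
else
  warn "save failed: #{profile.errors.join(', ')}"
end
```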
00:20:30.470 To manage both types of Schrodinger bugs, implement logging on save and update failures to discern why values don't persist. You should also log at the various data-manipulation steps to capture the current value of the data and the factors influencing it. This comprehensive logging will help you verify precisely what occurred and facilitate troubleshooting later. Remember, we aren't merely containing bugs; we want to squash them. If a Schrodinger bug once functioned but now does not, use a tool like git bisect to find the commit that lies between the last known working state and the current buggy state.
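A minimal sketch of that kind of logging, using Ruby's standard Logger and reusing the hypothetical Profile class from the previous sketch: the flow logs the data at each manipulation step and records save failures instead of letting them vanish.

```ruby
require "logger"

LOGGER = Logger.new($stdout)   # in production this would point at a log file

# Hypothetical update flow (reusing the Profile sketch above), instrumented
# so that each data-manipulation step and any failed save leave a trail.
def update_profile(profile, raw_params)
  LOGGER.info("update_profile: raw params = #{raw_params.inspect}")

  cleaned = raw_params.transform_values { |value| value.to_s.strip }
  LOGGER.info("update_profile: after cleanup = #{cleaned.inspect}")

  profile.display_name = cleaned[:display_name]

  if profile.save
    LOGGER.info("update_profile: saved display_name=#{profile.display_name.inspect}")
    true
  else
    # The failure and its reasons are recorded instead of vanishing silently.
    LOGGER.warn("update_profile: save FAILED: #{profile.errors.join(', ')}")
    false
  end
end

update_profile(Profile.new, display_name: "   ")
```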
00:21:28.190 Next up are the wildly chaotic bugs; let's take the first type. Observable attributes include: does it appear nondeterministic? Does it seem to vanish when you try to observe or debug it? If your answers to these questions are yes, you could be dealing with a 'Heisenberg bug.' Heisenberg bugs are challenging because they are elusive, often disappearing under scrutiny, which makes them hard to reproduce.
00:22:37.890 There are two types of Heisenberg bugs you may encounter. One type lives within the code, as debugging tools can unintentionally alter code execution or timing. For instance, simply adding a print statement can lead to behavior changes that mask the original bug. The second type is bound to data—specifically, large data sets or particular user scenarios—making reproducibility a significant challenge. In these cases, you might not realize the source issue because the problem is buried within how data is processed.
00:23:32.920 To address Heisenberg bugs, profiling can expose what triggers specific data states. Profiling measures computational timing and complexity, allowing you to identify which functions are called and in what order. By observing performance data and execution paths with tools like QCachegrind, you can visualize a flame graph that indicates which functions consume the most time and processing power.
00:24:52.590 This approach helps illustrate where sluggish performance might occur and can help pinpoint data causing the slowdown. When the results of profiling show long execution times, it can point to problematic calculations or inefficient data handling, letting you know exactly what to focus on. Focus on reconstructing the conditions that lead to the Heisenberg bug to diagnose its root cause.
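As a rough sketch of that profiling step, assuming the ruby-prof gem (the summarize method and its data are invented): wrap the suspect code, then inspect which calls dominate. ruby-prof's callgrind-format output is what a tool like QCachegrind can then render visually.

```ruby
require "ruby-prof"   # gem install ruby-prof

# Hypothetical calculation suspected of misbehaving only for large data sets.
def summarize(orders)
  orders.map { |order| order[:total] * 1.08 }.sort.last(10)
end

orders = Array.new(200_000) { { total: rand(1_000) } }

RubyProf.start
summarize(orders)
result = RubyProf.stop

# Flat text report: which methods consumed the most time.
RubyProf::FlatPrinter.new(result).print($stdout)

# RubyProf also ships a CallTreePrinter whose callgrind-format output can be
# opened in QCachegrind to visualize the call graph.
```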
00:26:10.040 Finally, we arrive at the second chaotic bug: literally everything is broken. If you find yourself responding to indicators like that, you might have what I call a 'Mandel bug.' The name comes from the Mandelbrot set, a mathematical object that looks chaotic but is generated from simple, deterministic relationships among complex numbers. So when everything seems to fail, the fault usually lies in just one place.
00:27:45.460 When faced with a Mendel bug, your focus should start small; check whether you can access the servers and explore logs. If logs and foundational processes indicate failure, it may point directly to system-level processes rather than the code itself. Keep monitoring the logs for indications of what’s breaking down. A classic issue arises when a server has run out of disk space, especially when extensively logging errors. Once you determine this root issue, clearing unnecessary space solves many overarching problems.
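A tiny Ruby sketch of that 'start small' check, shelling out to common Linux tools (the log directory is only an example, and GNU sort's -h flag is assumed):

```ruby
# First, small checks for an "everything is broken" incident: is the disk
# full, and if so, which log files are eating it? This shells out to
# standard Linux tools; the log directory is just an example path.
puts "Disk usage per filesystem:"
puts `df -h`

log_dir = "/var/log"
puts "\nLargest items under #{log_dir}:"
puts `du -sh #{log_dir}/* 2>/dev/null | sort -rh | head -n 10`
```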
00:28:50.560 Debugging is an ongoing process: monitor your logs and keep them from overflowing. Be proactive about resolving issues so the same errors do not keep accumulating in your logs without remediation. And when error rates spike and the sense of urgency is high, remember that when everything appears broken, it often hinges on a single crucial problem.
00:30:09.560 We've covered bugs ranging from the upsettingly observable Bohr and Schrodinger bugs to the wildly chaotic Heisenberg and Mandel bugs. You don't need innate debugging instincts, and you don't need to rely solely on unicorns for problem-solving. What you do need is a practical toolkit, and I've shared one with you. You have the skills to observe and classify your bugs, transforming complex issues into something simple and deterministic. You have robust logging practices, and you can use version control tools like git bisect to navigate debugging.
00:31:35.240 You are more than capable of using powerful Linux tools and profilers; don't shy away from them. This toolkit is your Swiss Army knife, enabling you to address complex bugs and challenges. So I encourage you to build your own debugging toolkit and be ready to share it with your team, because every project you work on is likely just as complex, if not more intricate, than what I've described.
00:33:14.150 These ideas aren't particularly unique or novel. I've gleaned them from various resources, and I encourage you to read up on them as well. The resources listed here are genuinely beneficial, especially those at the top of the list. Once again, my name is Kylie Stradley, and you can find me online at Chi Fast. I work for Mailchimp in Atlanta, Georgia.
00:34:05.960 We have five minutes for questions. The question was about how to quickly determine whether RAM or other hardware failures are causing issues, rather than server configuration problems. I'm not entirely sure, as that is not my area of expertise, but I recommend having a well-structured ops team with people dedicated to addressing those challenges.
00:34:26.950 For those of you working with larger applications, a capable operations team would be invaluable. I work in an environment where our ops team is highly skilled at troubleshooting hardware issues, and we leverage their expertise to manage these scenarios effectively. Having a book like 'Site Reliability Engineering,' which encapsulates holistic approaches to server management, would be beneficial.
00:35:18.060 When our developers are working with the API or the database, we take a cross-departmental approach to handling high-priority bugs, sometimes referred to as a war-room strategy. Facilitating knowledge-sharing means no single person has to memorize every process, and I'm thankful for that collaboration.
00:36:07.290 This might lead to enhancements in documentation or additional features to ensure alignment between user expectations and application functionality.
00:36:42.690 Sometimes, clients will ask us to log everything, and we have to push back and communicate about what's reasonable. Expecting a junior teammate to troubleshoot on their own can be a heavy lift, but it's essential to remember that resources and support abound.
00:37:22.780 Thank you so much for your time.