Talks
A Common Taxonomy of Bugs and How to Catch Them

http://www.rubyconf.org.au

Catching software bugs is a mysterious magic, unknowable by science, and untouchable by process. Debugging skills are instinctual and come only from years of experience.

False! Programming bugs, like real bugs, can be organized into a deterministic taxonomy. At its base, consistent and successful debugging is pattern matching. Classifying bugs by their attributes and behaviors makes the seemingly impossible possible—codifying the elusive “programmer’s instincts” into a logical debugging process.

RubyConf AU 2017

00:00:10.719 Personal sound check for Kylie. Can I get an okay emoji from the back? I'm getting thumbs up. I'd love to see an okay. All right, I see one. Okay, I'll take it. This is a common taxonomy of bugs and how to squash them.
00:00:20.080 Earlier today, someone said, 'Don't apologize during your presentation.' But I need to apologize because I'm not giving you the full presentation today. The airline lost my luggage, which means that they also lost my common taxonomy of bugs and how to squash them outfit. So when you see me present, please don't think about what I'm wearing now; think about me wearing that.
00:00:40.320 Many people have asked me if I dress like this all the time, and the answer is yes. Until they lost all this, you would have seen me wearing that every day of the conference. I hope you'll excuse how messy this notebook is because it is my lab notebook, my field book of case studies, coffee stains, and things I've seen while I was out debugging.
00:01:02.040 Getting right into it, debugging skills are important. Being able to debug is one of the most powerful tools in your developer toolkit. Adept debuggers seem to be able to debug anything and troubleshoot problems across various issues, from ports that are simultaneously open while refusing connection to the error 'undefined is not a function.' Their depth of knowledge makes it easy to believe that debugging comes from instinct and intuition.
00:01:14.560 Having worked on consulting projects and then joining a long-running product company like MailChimp, I've heard phrases suggesting that as you familiarize yourself with an application, you'll start to build up some debugging instincts. It might suggest that whenever you see something happen, you check the logs or see if it's sending a weird message on exit. But how do adept debuggers know this? They seem like magical creatures, almost like unicorns—they can debug anything and always know the next path to take or the question to ask.
00:01:39.120 They reassure you that someday you, too, will become a magical creature with debugging instincts. You can take this at face value and write down when X happens, always do Y. But you might also want to disabuse yourself of the notion that instincts are magical. Instincts are mystifying. They allow us to make heroes of each other, and while it's fine to praise heroic efforts, we shouldn't raise up our teammates as heroes. Instead, we should focus on our code.
00:01:59.360 Relying on heroes lets us put off the hard work of communication onto our teammates instead of writing highly readable code that clearly communicates our intentions. Lastly, unicorns don't scale. You can't just spin up a new instance of your resident unicorn whenever something bad happens. Instead, let's focus on the statement from earlier: when you look at it, you might notice it's just a conditional.
00:02:24.080 We handle conditionals in a straightforward manner. Whenever I see X, I always check Y. Instincts can be converted into internalized rule sets. I want to take these instincts and convert them to internalized rule sets and then start to look at them as patterns rather than something you'll just magically develop because that doesn't make any sense.
00:02:44.080 Before you get started, you need to agree on some research methods. The first method is that sometimes containment has to take priority over actually squashing a bug. Sometimes you simply need to contain it and stop it while you can, but you might not have time to figure out what went wrong and why just yet.
00:03:05.500 Next, we can only work with facts. You can't say, 'I feel' or 'I have a gut feeling about this.' Lastly, we can't squash every single bug in this talk. I submitted a 20-hour talk, and they came back to me and said, 'Just give us 20 minutes.' So that's why we can't squash every bug, which is no fault of mine.
00:03:27.880 What I can do, however, is show you how to identify bugs by their observable attributes. I know software developers love reinventing the wheel, but we don't have to this time. Biologists use a form of taxonomy called phenetics to identify and organize living organisms by observable attributes and behaviors. Bugs in the natural world can be classified this way, and bugs in the software world can too.
00:03:39.000 Before we get into my field guide, I must warn you that there are highly contrived and convenient scenarios ahead to help us focus on talking about finding and observing attributes and then what to do with them. If we're comfortable with that, I'm going to get started. If you're not, feel free to flee at any time.
00:04:04.720 Let's look at the two major phenotypes of bugs. We have the upsettingly observable and the wildly chaotic. Upsettingly observable bugs are the ones that make you smack yourself when you see them. They raise questions like, 'How could this have happened?' or 'Shouldn't unit tests have caught this?' They usually live in your code and sometimes result from under-tested or untested code.
00:04:24.160 On the other hand, wildly chaotic bugs seem to break everything everywhere. You can't reproduce them on your local machine and are terrified to reproduce them in production. For example, how can this port be open but simultaneously rejecting connections? It doesn't make sense.
00:04:41.280 With this baseline set, we can talk about how to squash these two types of bugs. Let's start with upsettingly observable bug number one. To identify its observable attributes, ask: Is the bug observable in production? Can it be reproduced locally? Does it seem to be restricted to maybe one area of the application? If you answered yes to all of these, you might have a Bohr bug on your hands.
00:04:56.560 The Bohr bug is named for the Bohr model of the atom: it's highly deterministic and very simple. By 'deterministic,' I mean that for a given input, it always returns the same output, and it's characterized by its repeatability. You can reproduce it easily, and it reliably manifests. You can say, 'This always causes that to happen.'
00:05:09.320 Bohr bugs are commonly found in application code, but sometimes they are found in server code as well. You can usually find them in code that has complex branching logic, such as validation logic, which is inherently complex, making it easy to write many unit tests around it without doing full regression tests.
00:05:33.040 Many of us have seen the picture of the two doors that open toward each other: each handle works on its own, but you can't open either door because they block each other. That's aggressively unit tested but not heavily regression tested.
00:05:46.920 So if the consequences of failing validation aren't planned for, the failure can't be rescued meaningfully, and it fails silently. The lack of errors leads you to believe that the save or the submit is functioning fine, and unit testing makes you think that there's no problem with your validation logic, until someone needs to send an email or make a purchase.
00:06:05.480 Then, the UI tells the user they are in a good state, but the action they wish to trigger is never triggered. I start with the Bohr bug because it's the friendliest bug and the easiest to squash. Squashing it first provides a model for each of the other bugs. Here's how to squash a Bohr bug: Step one: replicate it locally and in test. Step two: write the simple solution. Step three: rewrite the code to be highly readable and extensible.
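That silent-validation scenario can be sketched in a few lines of Ruby. All class and method names here are hypothetical, invented purely for illustration:

```ruby
# Contrived Bohr bug: deterministic, easy to reproduce, hiding in
# branching logic that was unit tested but never regression tested.
class Signup
  attr_reader :errors

  def initialize(email)
    @email = email
    @errors = []
    @delivered = false
  end

  def valid?
    @errors << "email must contain @" unless @email.include?("@")
    @errors.empty?
  end

  # Bug: returns true whether or not the welcome email was sent,
  # so the UI reports success and the failure stays silent.
  def submit
    deliver_welcome_email if valid?
    true
  end

  def delivered?
    @delivered
  end

  private

  def deliver_welcome_email
    @delivered = true # stands in for the real side effect
  end
end
```

Given the same input it always fails the same way: `Signup.new("no-at").submit` returns `true`, yet `delivered?` stays `false`. That repeatability is exactly what makes a Bohr bug squashable once someone finally looks.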
00:06:25.240 Readable code makes it much easier for someone to add onto your code in the future without introducing or reintroducing a bug that you had already squashed. Now, audience participation is welcome: I'll do one, and if you want to jump in on the next ones, you can.
00:06:41.520 Now we’re ready for upsettingly observable bug number two. Its observable attributes include: does this work? Wait, what is this even testing? More worrying, did this ever work? If you answered yes, or 'huh' or 'oh no,' to all of those, you might have a Schrödinger bug on your hands.
00:07:05.000 With a Schrödinger bug, as in Schrödinger's infamous thought experiment, you can't confirm the validity of the code without observing it directly. It looks like functional code but reveals itself to be a bug upon closer inspection. There are two types of Schrödinger bugs: those that have never worked and those that reveal themselves via side effects.
00:07:26.000 A common hiding spot for a Schrödinger bug is in return values. You may complete a function or save something, but if the function has a return value that obscures the true results, it can hide a bug. In the wild, you might see this in a UI showing an update immediately after user input, but it never gets passed to the database.
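A minimal sketch of a return value hiding a failure, in plain Ruby standing in for an ORM's save path (the classes here are hypothetical):

```ruby
# A store whose writes fail, to make the bug visible.
class FlakyStore
  def write(_record)
    raise IOError, "disk full"
  end
end

class UserRepository
  def initialize(store)
    @store = store
  end

  # Bug: the rescue swallows the failure, so the return value looks
  # identical whether or not the write succeeded.
  def save(user)
    @store.write(user)
    user
  rescue IOError
    user
  end

  # Honest variant: let the failure surface to the caller.
  def save!(user)
    @store.write(user)
    user
  end
end
```

The UI happily renders the returned user; only the database knows nothing ever arrived.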
00:07:43.160 It looks like it's working to the user because they see it in the UI. Now let's talk about type two of Schrödinger bugs—code that doesn't work as you thought. The data reaches the right state eventually but appears incorrect. This might manifest itself in call counts, where you see a validation function being called multiple times.
00:08:02.160 Perhaps the first time it puts the data in the right state, but then it gets saved later, maybe on the third or fourth time instead of the first time. You’re calling the same path several times before the execution occurs.
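One way to confirm that suspicion is to instrument the suspect method and count its calls. A minimal sketch, with a hypothetical `Order` class whose save path re-validates more often than you'd expect:

```ruby
# Wrap a method so every call is counted; useful for confirming that
# a validation runs three or four times before the save lands.
module CallCounter
  COUNTS = Hash.new(0)

  def self.wrap(owner, name)
    original = owner.instance_method(name)
    owner.define_method(name) do |*args, &block|
      COUNTS[name] += 1
      original.bind(self).call(*args, &block)
    end
  end
end

class Order
  def validate
    true
  end

  def save
    # Contrived: the save path re-validates three times per save.
    3.times { validate }
    true
  end
end

CallCounter.wrap(Order, :validate)
Order.new.save
puts CallCounter::COUNTS[:validate] # more validate calls than saves
```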
00:08:20.720 Remember these basic reproduction and resolution steps. You might be thinking, 'Kylie, be reasonable. How can I reproduce this without knowing exactly what's happening?' Here's some good news: logging is an essential part of the developer toolkit.
00:08:37.200 We can use logging to verify what's happening. When you have a bug that seems to be doing the right thing but doesn't, add logging on 'save' and 'update' features to indicate why the value wasn’t saved. Also, add logging at various areas where data manipulation occurs to see your value at specific times.
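A sketch of what that looks like with Ruby's stdlib `Logger`, logging the value at each manipulation point in a suspect save path (`save_preferences` and its arguments are hypothetical):

```ruby
require 'logger'

# Log the value before and after each data-manipulation point, plus the
# outcome of the store call, so a failed save explains itself.
def save_preferences(user_id, prefs, store, log = Logger.new($stdout))
  log.info("save_preferences user=#{user_id} incoming=#{prefs.inspect}")
  cleaned = prefs.reject { |_, v| v.nil? } # a data-manipulation point
  log.info("save_preferences user=#{user_id} cleaned=#{cleaned.inspect}")
  result = store.call(user_id, cleaned)
  log.info("save_preferences user=#{user_id} saved=#{result}")
  result
rescue StandardError => e
  log.error("save_preferences user=#{user_id} failed: #{e.class}: #{e.message}")
  false
end
```

With the value logged at each step, reading the log tells you exactly where the data went wrong, no guessing required.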
00:08:56.000 This practice is good for both types of bugs. For Schrödinger bugs that seem to have worked at some point, you can use git bisect. This method allows you to compare your current bad state to a previous known good state to find out when, if ever, the code actually worked.
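git bisect can do that good-state/bad-state comparison mechanically. Here's a self-contained demo in a throwaway repo (file names and commit messages are invented; `git bisect run` drives the search with a check script):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name Demo

# Build a tiny history: v1 works, v3 silently breaks ok.txt.
echo good > ok.txt
git add ok.txt && git commit -qm "v1: works"
git commit -q --allow-empty -m "v2: unrelated"
echo broken > ok.txt
git commit -qam "v3: introduces bug"
git commit -q --allow-empty -m "v4: unrelated"

# A check that exits 0 on good commits and non-zero on bad ones.
printf '%s\n' '#!/bin/sh' 'grep -q good ok.txt' > check.sh
chmod +x check.sh

# Mark HEAD bad and the root commit good, then let git do the searching.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
result=$(git bisect run ./check.sh)
echo "$result"
git bisect reset >/dev/null
```

git walks the history and reports which commit first broke the check, turning "did this ever work?" into a binary search instead of an archaeology dig.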
00:09:20.080 Reproduction and resolution steps: reproduce the broken state locally in test, add logging statements until you can verify the cause of the broken state, and if the bug did work at some point, find that moment in time using git bisect. Then, follow the Bohr bug instructions.
00:09:41.000 All right, we’re ready for wildly chaotic bug number one. Its observable attributes include: does it appear non-deterministic? Does it not appear on every server or request? Does it seem to disappear once you try to observe and debug it? If you answered yes to these, you might have a Heisenbug.
00:10:04.720 Heisenbugs are characterized by an inability to be reproduced. Once you try to observe it through recreation, you might not find it again. There are two types of Heisenbugs: one that lives in code and another that lives in data.
00:10:21.680 Type one often occurs because of the debugging statements themselves, such as print statements affecting whether a function is evaluated, or throwing off the execution timing. Adding a print statement forces evaluation when evaluation wouldn't have occurred otherwise.
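Ruby's lazy enumerators make this easy to demonstrate. In this contrived sketch, a latent bug hides behind lazy evaluation, and the "print everything" debugging statement is what detonates it:

```ruby
# Contrived Heisenbug: the normal path only ever evaluates three
# elements, so the latent error is never reached.
pipeline = (1..Float::INFINITY).lazy.map do |n|
  raise "boom" if n > 3 # latent bug, dormant in normal use
  n * 2
end

first_three = pipeline.first(3) # normal path: only 3 elements evaluated

# A debugging statement that dumps the whole collection forces full
# evaluation and triggers the bug (here it raises "boom"; on a clean
# infinite range it would simply never return):
#   p pipeline.to_a
```

The act of observing the pipeline changes what executes, which is exactly why the bug seems to appear and disappear depending on how you look at it.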
00:10:39.280 For the second type of Heisenbug, which lives in data or large datasets, it seems hard to reproduce because you shouldn’t download user data to your machine to recreate the steps. You might only be able to see the bug in production and wonder how to reproduce it without testing on production.
00:10:59.760 Profiling can be a useful tool to verify what's happening at each point. It helps you find out how long a function takes to execute on production. Turn on profiling with your available tools, though the output might not seem clear or easy to read.
00:11:16.840 A better way to visualize profiling data is through flame graphs. The bottom of a flame graph shows the first function or class called, and the width each frame occupies indicates its share of execution time. Profiling can reveal what's called and when, without affecting evaluation timing the way print statements do.
00:11:32.520 It also informs you how much time is spent during execution. If there's a query that's expensive or a large dataset that's not optimized, profiling can help you identify these issues so you can make improvements.
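In Ruby you'd typically reach for a profiling gem such as stackprof or ruby-prof, whose dumps can be rendered as flame graphs. As a dependency-free sketch of the underlying idea, the stdlib `Benchmark` module can time individual suspects (the method names below are hypothetical stand-ins):

```ruby
require 'benchmark'

def expensive_query
  (1..500_000).reduce(:+) # stands in for an unoptimized query
end

def cheap_formatting
  "result".upcase # stands in for trivial work
end

timings = {
  expensive_query:  Benchmark.realtime { expensive_query },
  cheap_formatting: Benchmark.realtime { cheap_formatting },
}

# Sort slowest-first, like reading the widest frames of a flame graph.
timings.sort_by { |_, secs| -secs }.each do |name, secs|
  puts format("%-18s %.6fs", name, secs)
end
```

Even this crude version answers the key Heisenbug question, "where is the time actually going?", without changing what gets evaluated.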
00:11:49.280 For reproduction and resolution, use profiling to find the triggering state. Use the application itself, not fixtures or manual data manipulation, to replicate the data state. You can also wait for a user request or utilize a test user, and remember to turn off profiling when you're done, as it can consume disk space.
00:12:05.840 Now you've turned your Heisenbug into a Bohr bug, so follow the Bohr bug instructions to squash it. Next, we have wildly chaotic bug number two. Its observable attributes might include everything being broken or needing immediate help.
00:12:22.760 If you answered yes or said, 'I don't have time for these questions,' you might have a Mandelbug on your hands. The Mandelbug is named for its resemblance to the Mandelbrot set, a mathematical set that appears random when graphed.
00:12:40.680 This set is just a collection of complex numbers that create convoluted and sometimes beautiful fractals—depending on the artist, of course. It might seem like everything's broken, and as a result, people are very upset with you.
00:12:59.840 You're under pressure to resolve issues quickly. The good news is that it's likely an issue with your system and not your code. You might think you can pass this off to ops, but you can do more exploratory work to fix this.
00:13:19.200 Let's look at a sample call log from a specific Mandelbug scenario. You may start by checking if you can access your servers. If jobs aren't running and emails aren't sending, that indicates something is wrong.
00:13:36.480 You might check your disk usage. Did you forget to turn off profiling? Adding too many log statements can also fill up your disk space, and these things accumulate quickly. We can use commands like df to check overall disk usage and du to find which files or directories take up excessive space.
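On a Unix-like host, two commands cover most of that check (the log directory below is an assumption; adjust the path for your system):

```shell
# Which filesystem is full, or close to it?
usage=$(df -h)
echo "$usage"

# Which entries under the suspect directory are biggest? Human-readable
# sizes, sorted largest-first, top ten only.
du -sh /var/log/* 2>/dev/null | sort -rh | head -n 10
```

A filesystem at 100% in the `df` output, paired with an enormous log file in the `du` output, is the classic forgot-to-turn-off-profiling signature.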
00:13:49.760 You might also check the logs to verify what actions ran and when things started breaking. This gives you an exact reference point, so you can tell customers when everything may have stopped working.
00:14:07.840 For reproduction and resolution, check if all storage is being used, attempt to connect to the server, and view the logs. If necessary, determine if the server can be restarted, rotated, or killed.
00:14:22.720 Finally, it's time to squash this Mandelbug. We now have a practical taxonomy of bugs and a methodology for squashing them. I'm sorry if I've crushed your hopes of becoming a unicorn with debugging instincts.
00:14:43.760 The good news is that you don't need that. You have a debugging toolkit. You've learned how to observe and classify bugs, verify with logging, and utilize profiling. You're now equipped to use server tools to check the entire process.
00:15:03.680 This is my toolkit, but I encourage you to build your own and share it with your team. Your company's project, application, or system is likely just as complex, confusing, and unique as mine.
00:15:22.960 These resources are for further study, and while they might not have a home online just yet, I imagine they will soon. I'm Kylie Stradley; you can find me online at @kyfast, and I work for MailChimp in Atlanta, Georgia. Thank you.