How to Debug Anything

by James Golick

In the talk 'How to Debug Anything' at GoRuCo 2014, James Golick discusses the complexities of debugging in software development, emphasizing that while many resources focus on writing code, few address the critical skill of diagnosing why code does not work. He highlights that debugging often consumes a significant portion of the time necessary to ship a product, especially given the rapid innovation in software engineering.

Key Points Discussed:

  • Understanding Software Bugs: Golick points out that software is inherently buggy and unreliable due to the rapid pace of innovation in the field. He stresses the importance of accepting that issues will arise at all levels of the software stack.
  • Debugging Methodology: He proposes a consistent methodology for debugging, regardless of the programming language or system complexity. This methodology begins with an open mind, devoid of assumptions.
  • Case Study Example: Golick shares a real-world example involving a non-technical friend whose PHP website was down. Lacking access to the code or detailed system knowledge, Golick used system call tracing (strace) to identify the root cause—a missing file leading to a server error. This process took under five minutes and highlighted the effectiveness of systematic debugging.
  • Tools for Debugging: He advocates for using various debugging tools, emphasizing the utility of 'strace' for understanding system-level interactions and identifying failures. He encourages learning multiple tools based on the development environment.
  • Third-Party Input: One important rule is to seek third-party opinions during debugging sessions. This approach can provide fresh insights that may lead to quicker problem resolution.
  • Finding the Right Source Code: Golick discusses the challenges of accessing the correct source code and emphasizes its importance in effective debugging.
  • Learning Through Debugging: He concludes that debugging not only resolves issues but is also an opportunity for learning, particularly in unfamiliar systems or languages.

Conclusion:

Golick summarizes his debugging process with actionable steps: forget what you think you know, seek third-party opinions, find the correct source code, identify areas that need attention, and then proceed to fix the issue. The talk equips the audience with practical insights to improve their debugging skills, thus enhancing their software development practices.

00:00:16.039 Hello everyone! My name is James Golick.
00:00:18.800 I go by jamesgolick everywhere online: on Twitter, GitHub, Instagram, and Freenode.
00:00:22.480 You can find my blog at jamesgolick.com. It's very easy to find me online, and I work 24/7, so feel free to reach out!
00:00:34.920 I work for a company called packagecloud, where we do APT and YUM repositories, as well as RubyGems repositories, as a service.
00:00:39.320 If you need public or private package repositories, I encourage you to check us out. Come talk to me if you want to discuss packaging—I can guarantee that I'll probably get more excited about it than anyone else in this room.
00:01:08.880 So, people often say this: programmers say it quite frequently, and there has been a lot of discussion on my Twitter account recently regarding whether we should stop saying this.
00:01:12.400 I think it’s important to distinguish between those who express this sentiment in a moment of frustration—because what we do can indeed be very frustrating—and those who genuinely believe that everything is terrible.
00:01:34.640 I could show you my iPhone, a supercomputer in my pocket that allows me to access nearly any media ever created, and you'd realize that this is not terrible; it’s awesome.
00:01:57.920 I think you’re at the very least naive and possibly a bit disingenuous if you believe that nothing is broken. I mean, obviously, things work, but the reality is that software is often buggy, flaky, and unreliable, despite our best efforts.
00:02:19.000 This makes sense because we are innovating rapidly. Software engineering is a relatively new field, and we haven't fully caught up with the pace of innovation and growth in our industry.
00:02:59.360 As a result, it is understandable that things are broken.
00:03:01.000 When engineers come together, whether in a room or online, a common topic of discussion is, 'How do we write better code?' How do we create more reliable software that can handle edge cases more effectively?
00:03:39.079 There are various techniques for improving our code, from testing—popular in the Ruby world—to static analysis, and incorporating advanced type systems from newer languages. However, one topic that I feel isn’t discussed enough is how to cope with software that doesn’t work, whether it's our code or someone else's.
00:04:13.820 If you want to deploy high-quality software, you should expect to fix bugs at every level. There's a fundamental reason why bugs exist at this depth and complexity.
00:04:20.440 Given enough time and sufficient complexity, you're going to encounter those bugs, and either you’re going to fix them, or they’re still going to be broken.
00:04:37.120 Over the past few years, I have fixed many bugs at different levels of the stack. I’ve dealt with bugs in my own code, in the Ruby VM, memory allocators, MySQL, and various other places.
00:05:04.800 People often ask me, 'How did you find that bug? How do you debug unfamiliar code, or a language you don't know?'
00:05:23.120 I've come to realize that my methodology for debugging is always the same. It doesn’t matter what stack I’m looking at or whether I know the language well; my approach remains consistent.
00:05:50.480 Every good debugging session starts with a mantra that may resonate with some of you. Someone reports a defect, and your first reaction is disbelief. You examine the code you suspect and wonder how this could possibly be happening.
00:06:35.279 This is a true story about a debugging session I participated in a few years ago. I’m from Toronto, and a non-technical friend came to me with a problem regarding his PHP website.
00:06:58.680 He called me up and said, 'My site is down.' I asked why he didn't get his team to fix it, and he explained they weren't available.
00:07:29.480 Since he was desperate, he asked if I could help out. I didn’t have access to the source code or detailed knowledge of the system. The last time I had written PHP was about five years prior.
00:07:58.120 Despite these challenges, I did have SSH access to his server from a previous diagnosis. Once logged in, I checked the Apache error logs, assuming they'd show some PHP errors.
00:08:37.880 Interestingly, there was nothing in the logs. This is strangely common: often the logs contain no useful information. After all, if the program knew why it was broken, it probably wouldn't be broken.
00:09:26.000 So what do you do when you hit a wall like this? I realized that the PHP code was likely running within one of the Apache processes. I found the process ID for one and used a program called 'strace' to attach to it and get debugging output.
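The attach step described here can be sketched in Python. This is a minimal illustration, not the commands typed in the talk: the process name "apache2" and the helper names `find_pids` and `strace_attach_argv` are assumptions. It looks up candidate PIDs under /proc and builds the strace command line, using strace's `-p` (attach to PID), `-f` (follow children), and `-s` (string length) options.

```python
import os

def find_pids(name, proc_root="/proc"):
    """Return PIDs whose command name equals `name`, read from <proc_root>/<pid>/comm."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs here (for example, not running on Linux)
    pids = []
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited or is unreadable
    return sorted(pids)

def strace_attach_argv(pid):
    """Build the argv for attaching strace to an already running process."""
    # -p attaches to a PID, -f follows child processes,
    # -s 256 prints longer string arguments than the default.
    return ["strace", "-f", "-s", "256", "-p", str(pid)]

if __name__ == "__main__":
    # "apache2" is an assumed process name; it may be "httpd" on other systems.
    for pid in find_pids("apache2"):
        print(" ".join(strace_attach_argv(pid)))
```

Attaching to another user's process usually requires root, so the printed command would typically be run with sudo.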
00:09:56.160 For those unfamiliar, 'strace' provides a trace of all system calls made by a program. System calls are critical because they form the interface between userland programs and the operating system.
00:10:38.160 Strace output can provide useful information, especially when debugging programs. Common system calls like write and open will show you what the program is trying to do and help identify what went wrong.
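To make the idea of a failing system call concrete, here is a small Python sketch. The path is hypothetical and `try_open` is an illustrative helper, not something from the talk. Opening a file that does not exist makes the kernel return an error code, which Python surfaces as an `OSError`; under strace the same event would appear as an open call returning -1 with the matching error name.

```python
import errno
import os

def try_open(path):
    """Attempt a read-only open and return "OK" or the symbolic errno name."""
    try:
        fd = os.open(path, os.O_RDONLY)  # issues an open/openat system call
    except OSError as exc:
        # The system call failed and set errno; Python raises OSError for it.
        return errno.errorcode[exc.errno]
    os.close(fd)
    return "OK"

if __name__ == "__main__":
    # Hypothetical missing file, standing in for the one in the story.
    print(try_open("/definitely/missing/config.php"))
```

For a path whose parent directory does not exist, the reported name should be ENOENT, the same error the story below hinges on.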
00:11:21.600 When navigating strace output, you should always work backwards through the information to find where a failure occurs.
00:11:40.320 In this specific case, by looking through the output, I eventually found an error being reported by Apache: an HTTP 500 response, indicating a server error.
00:12:07.360 Working through the system call outputs revealed a failed attempt to open a file that didn’t exist, leading to that 500 error.
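The "work backwards" habit can be sketched as a small parser. The trace lines below are invented for illustration and are not the output from the talk; the helper name `failures_before` and the regular expression are assumptions about how one might pick out failed calls, which strace prints as `= -1` followed by an error name.

```python
import re

# A failed call looks like: name(args) = -1 ERRNAME (description)
FAILED = re.compile(r"(?P<call>\w+)\((?P<args>.*)\)\s+= -1 (?P<errno>[A-Z0-9]+)")

def failures_before(lines, marker):
    """Walk backwards from the last line containing `marker`, collecting failed calls."""
    last = max((i for i, line in enumerate(lines) if marker in line), default=None)
    if last is None:
        return []
    found = []
    for line in reversed(lines[:last]):
        m = FAILED.search(line)
        if m:
            found.append((m["call"], m["errno"], m["args"]))
    return found  # the failure nearest to the marker comes first

# Made-up trace for illustration only; paths and descriptors are hypothetical.
SAMPLE = [
    'open("/var/www/index.php", O_RDONLY) = 5',
    'read(5, "<?php ...", 4096) = 212',
    'open("/var/www/lib/settings.php", O_RDONLY) = -1 ENOENT (No such file or directory)',
    r'write(9, "HTTP/1.1 500 Internal Server Error\r\n", 36) = 36',
]

if __name__ == "__main__":
    for call, err, args in failures_before(SAMPLE, "500 Internal Server Error"):
        print(call, err, args)
```

Starting from the visible symptom (the 500 response) and scanning upward keeps attention on the calls that happened just before the failure, rather than on the thousands of unrelated calls earlier in the trace.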
00:12:39.000 From there, I generated a hypothesis that maybe someone had deployed bad code or introduced a typo, causing the outage.
00:13:01.919 After confirming our hypothesis, I made the necessary fixes and the site was back up and running—only about three minutes from the time of that phone call.
00:13:30.000 It’s funny how I can fix issues for other people quickly, while my own coding bugs can take much longer to diagnose. I realized that the key is to come into debugging with an open mind, free from assumptions.
00:14:18.560 In debugging, you must forget everything you think you know, as it may lead you astray.
00:15:00.000 The first rule is to ask for a third-party opinion. If you find yourself blind in a debugging session, get someone else to provide information about what is happening, rather than rely on your preconceived notions.
00:15:44.160 A tool called strace proves invaluable during debugging sessions on Linux. Many tools are worth learning based on the software you develop.
00:16:31.440 I’m going to share some slides with diagrams like the one showing various tools used for debugging. Some suggestions include reviewing how programs function and employing third-party opinions to guide you.
00:17:06.080 The next example highlights a debugging issue we faced with package management on Ubuntu. When a customer tried to install a repository, they encountered numerous errors with key files being ignored.
00:17:48.320 After unsuccessfully staring at the code, I turned to 'strace' again, invoking it with a different method to check the output of a process I was testing.
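Besides attaching to a running process, strace can start a command itself and trace it from the beginning. The sketch below assumes that is the "different method" meant here; the command `apt-get update`, the trace path, and the helper names are illustrative stand-ins, not details from the talk. It uses strace's `-f` (follow children) and `-o` (write trace to a file) options.

```python
import shutil
import subprocess

def strace_launch_argv(command, trace_path):
    """Build the argv that starts `command` under strace from its first system call."""
    # -f follows child processes; -o writes the trace to a file so it does not
    # interleave with the traced program's own output.
    return ["strace", "-f", "-o", trace_path, *command]

def run_traced(command, trace_path):
    """Run the command under strace if strace is installed; return its exit code."""
    if shutil.which("strace") is None:
        raise RuntimeError("strace is not installed on this machine")
    return subprocess.run(strace_launch_argv(command, trace_path)).returncode

if __name__ == "__main__":
    # "apt-get update" stands in for whatever command reproduces the failure.
    print(" ".join(strace_launch_argv(["apt-get", "update"], "/tmp/apt.trace")))
```

Launching under strace is useful for short-lived commands that exit before there is time to attach to them.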
00:18:27.360 The output confirmed that a file was missing, leading us back to our source files for a deeper look. We found an indication of a failure with controllable files not being detected properly, leading to the error.
00:19:00.720 This unpleasant discovery indicated that some other program—likely an underlying dependency—had broken functionality. So, by checking in detail against the servers, I discovered how the applications interacted.
00:19:37.480 By locating the correct source code, I ensured that my exploration of the repository for bugs became manageable.
00:19:57.360 The second rule concerns accessing the right source code. This has its challenges, which vary by environment. I've often spent late nights hunting for the right code, only to discover that what I had was outdated or wrong.
00:20:34.680 This experience taught me the importance of understanding how to find the exact compositions of packaged installs and the methods to debug them effectively.
00:21:02.560 In each session, I identify specific strings or triggers that help determine where issues lie within a system, even when I'm working in a slightly unfamiliar environment.
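One way to turn a string from an error message into an entry point in unfamiliar code is to search the source tree for it. The following is a rough sketch under that assumption; `find_string` and its default list of file extensions are hypothetical choices, and in practice a tool like grep does the same job.

```python
import os

def find_string(root, needle, extensions=(".c", ".cc", ".h", ".rb", ".php", ".py")):
    """Yield (path, line_number, line) for each source line containing `needle`."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(extensions):
                continue  # skip files that are unlikely to be source code
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="replace") as f:
                    for number, line in enumerate(f, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file; keep searching

if __name__ == "__main__":
    # Both arguments are placeholders: point them at the unpacked source tree
    # and at the exact text seen in the error message or trace.
    for path, number, line in find_string(".", "exact error text"):
        print(f"{path}:{number}: {line}")
```

Each hit is a place where the program produces that message, which gives a concrete starting point for reading the surrounding code.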
00:21:43.440 In essence, finding a key piece of information can generate momentum. I'd encourage anyone to persist whenever they wade into a new set of code.
00:22:04.600 Debugging is an effective practice for learning new languages and systems programming. If you find a bug, you should track down its origins.
00:22:38.200 When you reach the point where you can fix what's broken, that’s the moment for a little celebration.
00:23:03.440 Here’s a summary of my steps to debug anything effectively: first, forget everything you know.
00:23:36.759 Next, get a third-party opinion and locate the accurate source code. Identify your hook, and stare at the code until you gain some understanding. Finally, go ahead and fix the issue.
00:24:06.960 I'd be happy to take any questions you may have.