00:00:16.039
Hello everyone! My name is James Golick.
00:00:18.800
I go by jamesgolick everywhere online: on Twitter, GitHub, Instagram, and Freenode.
00:00:22.480
You can find my blog at jamesgolick.com. It's very easy to find me online, and I'm online pretty much 24/7, so feel free to reach out!
00:00:34.920
I work for a company called Computology, where we provide APT and YUM repositories, as well as RubyGems repositories, as a service.
00:00:39.320
If you need public or private package repositories, I encourage you to check us out. Come talk to me if you want to discuss packaging—I can guarantee that I'll probably get more excited about it than anyone else in this room.
00:01:08.880
So, people often say that everything is broken. Programmers say it quite frequently, and there has been a lot of discussion on my Twitter timeline recently about whether we should stop saying it.
00:01:12.400
I think it’s important to distinguish between those who express this sentiment in a moment of frustration—because what we do can indeed be very frustrating—and those who genuinely believe that everything is terrible.
00:01:34.640
I could show you my iPhone, a supercomputer in my pocket that allows me to access nearly any media ever created, and you'd realize that this is not terrible; it’s awesome.
00:01:57.920
I think you’re at the very least naive and possibly a bit disingenuous if you believe that nothing is broken. I mean, obviously, things work, but the reality is that software is often buggy, flaky, and unreliable, despite our best efforts.
00:02:19.000
This makes sense because we are innovating rapidly. Software engineering is a relatively new field, and we haven't fully caught up with the pace of innovation and growth in our industry.
00:02:59.360
As a result, it is understandable that things are broken.
00:03:01.000
When engineers come together, whether in person or online, a common topic of discussion is, 'How do we write better code? How do we create more reliable software that handles edge cases more effectively?'
00:03:39.079
There are various techniques for improving our code, from testing, which is popular in the Ruby world, to static analysis and the advanced type systems of newer languages. However, one topic that I feel isn't discussed enough is how to cope with software that doesn't work, whether it's our code or someone else's.
00:04:13.820
If you want to deploy high-quality software, you should expect to fix bugs at every level of the stack. There's nothing fundamentally different about code deep in the stack; bugs exist at every depth and level of complexity.
00:04:20.440
Given enough time and sufficient complexity, you're going to encounter those bugs, and either you’re going to fix them, or they’re still going to be broken.
00:04:37.120
Over the past few years, I have fixed many bugs at different levels of the stack. I’ve dealt with bugs in my own code, in the Ruby VM, memory allocators, MySQL, and various other places.
00:05:04.800
People often ask me, 'How did you find that bug? How do you debug unfamiliar code, or a language you don't know?'
00:05:23.120
I've come to realize that my methodology for debugging is always the same. It doesn’t matter what stack I’m looking at or whether I know the language well; my approach remains consistent.
00:05:50.480
Every good debugging session starts with a mantra that may resonate with some of you: that sense of disbelief when someone reports a defect. You examine the code you suspect has the issue and wonder how this could possibly be happening.
00:06:35.279
This is a true story about a debugging session I participated in a few years ago. I’m from Toronto, and a non-technical friend came to me with a problem regarding his PHP website.
00:06:58.680
He called me up and said, 'My site is down.' I asked why he didn't get his team to fix it, and he explained they weren't available.
00:07:29.480
Since he was desperate, he asked if I could help out. I didn’t have access to the source code or detailed knowledge of the system. The last time I had written PHP was about five years prior.
00:07:58.120
Despite these challenges, I did have SSH access to his server from a previous diagnosis. Once logged in, I checked the Apache error logs, assuming they'd show some PHP errors.
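To give a concrete sketch of that first step, assuming a Debian-style Apache install (the hostname and log path here are illustrative and vary by setup):

    # log in and look at the most recent Apache errors
    ssh user@example-server
    tail -n 100 /var/log/apache2/error.log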
00:08:37.880
Interestingly, there was nothing in the logs. This is surprisingly common; often there is no useful information in the logs, and if the program knew why it was broken, it probably wouldn't be broken.
00:09:26.000
So what do you do when you hit a wall like this? I realized that the PHP code was likely running within one of the Apache processes. I found the process ID for one and used a program called 'strace' to attach to it and get debugging output.
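Roughly, attaching looks like this; the process name and PID are illustrative, and on most systems you need root to attach to Apache's workers:

    # find the PID of one of the Apache worker processes
    ps aux | grep apache2
    # attach strace to that worker and watch system calls as requests come in
    sudo strace -p 12345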
00:09:56.160
For those unfamiliar, 'strace' provides a trace of all system calls made by a program. System calls are critical as they create the interface between userland programs and the operating system.
00:10:38.160
Strace output provides a lot of useful information when you're debugging. Common system calls like write and open show you what the program is trying to do and help you identify what went wrong.
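To give a feel for it, the output has roughly this shape; the file names, descriptors, and byte counts below are made up for illustration:

    open("/var/www/site/index.php", O_RDONLY) = 11
    read(11, "<?php\nrequire 'config.php';\n"..., 8192) = 2381
    write(1, "HTTP/1.1 200 OK\r\n"..., 17) = 17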
00:11:21.600
When navigating strace output, you should always work backwards through the information to find where a failure occurs.
00:11:40.320
In this specific case, looking through the output, I eventually found the error being reported by Apache: an HTTP 500 response, indicating a server error.
00:12:07.360
Working backwards through the system calls revealed a failed attempt to open a file that didn't exist, which is what led to that 500 error.
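A sketch of how you might narrow in on that kind of failure; the PID, output path, and file name are illustrative:

    # capture the trace to a file, then search it for failed system calls
    sudo strace -p 12345 -o /tmp/apache.trace
    grep ENOENT /tmp/apache.trace
    # the smoking gun looks something like:
    # open("/var/www/site/includes/helpers.php", O_RDONLY) = -1 ENOENT (No such file or directory)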
00:12:39.000
From there, I generated a hypothesis that maybe someone had deployed bad code or introduced a typo, causing the outage.
00:13:01.919
After confirming the hypothesis, I made the necessary fix, and the site was back up and running, only about three minutes after that phone call.
00:13:30.000
It’s funny how I can fix issues for other people quickly, while my own coding bugs can take much longer to diagnose. I realized that the key is to come into debugging with an open mind, free from assumptions.
00:14:18.560
In debugging, you must forget everything you think you know, as it may lead you astray.
00:15:00.000
The first rule is to get a third-party opinion. If you find yourself flying blind in a debugging session, get something else, a tool or another person, to tell you what is actually happening, rather than relying on your preconceived notions.
00:15:44.160
A tool called strace is invaluable during debugging sessions on Linux, and there are many more tools worth learning, depending on the software you develop.
00:16:31.440
I'm going to share these slides, including a diagram of the various tools used for debugging. The suggestions boil down to understanding how your programs actually work and using third-party opinions to guide you.
00:17:06.080
The next example highlights a debugging issue we faced with package management on Ubuntu. When a customer tried to install from a repository, they encountered numerous errors about key files being ignored.
00:17:48.320
After staring at the code without success, I turned to strace again, this time invoking it differently: running the command I was testing under strace and checking its output.
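A minimal sketch of that approach; apt-get update stands in for whatever command the customer was actually running, and the output path is arbitrary:

    # -f follows child processes, -o writes the full trace to a file
    strace -f -o /tmp/apt.trace apt-get update
    # then look for files the tool tried and failed to open
    grep ENOENT /tmp/apt.trace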
00:18:27.360
The output confirmed that a file was missing, which led us back to the source files for a deeper look. There we found the failure: the key files were not being detected properly, which led to the error.
00:19:00.720
This unpleasant discovery indicated that some other program, likely an underlying dependency, was where the breakage lived. By looking in detail at what was actually running on the servers, I worked out how the applications interacted.
00:19:37.480
Once I had located the correct source code, exploring it for the bug became manageable.
00:19:57.360
The second rule is to get the right source code. That can be harder than it sounds, depending on the environment. I've spent late nights digging through what I thought was the right code, only to discover it was outdated or the wrong version entirely.
00:20:34.680
That experience taught me how important it is to know how to find out exactly what went into a packaged install, and how to debug it effectively.
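On Debian and Ubuntu there are standard ways to do this; a sketch, assuming deb-src entries are enabled in your apt sources (the file and package names are just examples):

    # which installed package owns this file?
    dpkg -S /usr/bin/apt-key
    # what exact version of that package is installed?
    dpkg -l apt
    # fetch the source for the packaged version you actually have
    apt-get source apt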
00:21:02.560
In each session, I look for a specific string or trigger that helps me determine where the issue lies in the system, even when I'm working in a somewhat unfamiliar environment.
00:21:43.440
In essence, finding a key piece of information generates momentum. I'd encourage anyone to persist whenever they wade into a new codebase.
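In practice, that hook is often just an error message you can search for in the source you downloaded; the string here is a made-up example:

    # search the unpacked source tree for the exact error string you saw
    grep -rn "ignoring key file" .
    # then read outward from that spot to understand the code path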
00:22:04.600
Debugging is an effective way to learn new languages and systems programming. If you find a bug, you should track it down to its origin.
00:22:38.200
When you reach the point where you can fix what's broken, that’s the moment for a little celebration.
00:23:03.440
Here’s a summary of my steps to debug anything effectively: first, forget everything you know.
00:23:36.759
Next, get a third-party opinion and locate the correct source code. Identify your hook, and stare at the code until you gain some understanding. Finally, go ahead and fix the issue.
00:24:06.960
I'd be happy to take any questions you may have.