We'll do it live: Underhanded Debugging Tactics
Summarized using AI


by Saimon Sharif

In the video titled "We'll do it live: Underhanded Debugging Tactics," Saimon Sharif shares his insights on effective debugging strategies for software engineers, especially when dealing with bugs in production environments. Drawing from his experience, he emphasizes the importance of remaining calm and adopting a systematic approach when encountering elusive issues that cannot be replicated locally.

Key Points Discussed:

- Understanding Different Environments: Engineers typically move code from local development to QA and staging before it reaches production, making debugging more challenging as the code cannot be directly edited in these environments.

- Stay Calm: Remaining composed is crucial when debugging in stressful production scenarios.

- Philosophy of Debugging: Compare local input and output with what is collected in production, treating the process like a scientific method.

- Identifying Configuration Differences: Many bugs stem from variations in configurations across environments, highlighting the need for familiarity with the stack build process.

- Utilizing CDNs: Understand CDNs' caching mechanisms, as they can lead to outdated or broken assets. Techniques like bypassing the CDN can help isolate problems.

- Debugging Caches: Troubleshoot client-side and server-side caching by clearing it or using incognito windows to eliminate variables.

- Client-Side JavaScript Debugging: Tools like Source Maps and conditional breakpoints can enhance visibility into minified code and help isolate issues with precision.

- Inspecting AJAX Requests: Understanding and replaying AJAX requests can shed light on many client-server interaction problems.

- Server-Side Debugging: Employ logging and traffic inspection techniques (e.g., tcpdump) to analyze live production behavior.

- Impact of Race Conditions: Using bash scripts with curl to identify race conditions can help uncover hidden bugs that are difficult to reproduce.

- Adding Logging: Sharif stresses the long-term importance of improving logging within critical paths of code to capture and analyze problems faster in the future.

In conclusion, the video provides practical debugging techniques that software engineers can apply when faced with bugs in production, encouraging a methodical and calm approach complemented by appropriate tools and strategies.

00:00:12.990 Hey folks, I’m Saimon. Thanks for having me! I'm going to talk about underhanded debugging tactics you can use the next time you tackle a bug in production.
00:00:18.700 So real quick, here's some background on me: I'm a software engineer working out of New York City. At [Company Name], we use data, design, and technology to make insurance as easy as it should be. By the way, we're hiring, so come talk to me after if you're interested.
00:00:26.050 I'm also a part-time instructor at General Assembly in New York, where I've taught evening courses in front-end web development, JavaScript, and Python, as well as many short workshops. Now, let’s dive into debugging.
00:00:40.150 At some point in your career as a software engineer, you're going to encounter a bug that’s only reproducible in production, which means it's time to debug in production. Obviously, this is one of the most stressful moments as an engineer.
00:00:52.750 This is probably how most of us would feel when encountering a bug in a production environment that can't be reproduced locally. In fact, it's probably even worse than whiteboarding interviews, and that says a lot!
00:01:06.070 My favorite time debugging in production was at a previous company when we were trying to launch an important feature by Black Friday. A vendor couldn’t provide a testing or staging environment, only a production one.
00:01:17.520 So, we couldn’t even test the flow end-to-end without deploying to production. Whenever any issues came up, it was impossible to debug them locally. It was quite a journey. However, we eventually launched the feature and made a lot of money, but it was very stressful.
00:01:30.610 Let’s quickly review the different deployment environments we might typically encounter. Engineers usually develop against a local environment where they can edit their application, set breakpoints, and use a debugger—all that good stuff.
00:01:54.100 After developing a feature locally, the code is usually promoted to a QA environment and then to a staging environment. However, we can't edit the code directly in these environments, just like with production, which makes it tricky.
00:02:19.270 Bugs are often harder to pin down in these environments. For example, it's extremely frustrating when a QA analyst finds a bug in a QA environment that you can't reproduce locally.
00:02:33.870 So let's go through some tactics you can use the next time you debug your application in production, or really any place that's not your local environment. I learned these tactics the hard way, so you can benefit from my frustrations, lost hair, and stress.
00:02:46.080 The first tactic is to not panic. I know it sounds silly, and remaining calm might be too much to ask for, but just try to stay mellow. Debugging complex applications is stressful enough, so staying calm is essential.
00:03:05.580 Next, let's review the overarching philosophy behind these underhanded debugging tactics. When developing applications locally, we have the ability to edit our code and step through it using a debugger.
00:03:19.500 When we encounter a bug, our debuggers provide us with input and output information. However, when debugging in production, we need to find a different source for input and output introspection. Essentially, we need to play detective.
00:03:39.420 This requires asking ourselves questions like: How do I know what code has been executed? What data can I collect to gain more introspection into the system? What can I do to make hypotheses and test those hypotheses?
00:04:03.780 What are the boundaries or interfaces in the actual system? Essentially, we're applying the scientific method to a running web application. For this talk, let's imagine we have the following setup.
00:04:20.549 There’s a very good chance all our technology stacks are different, so I’ll try to be technology agnostic. On the left, we have our clients, which are web browsers. As a request goes through our app, it'll pass through a CDN, load balancer, and finally reach our application servers and database.
00:04:40.179 From the architecture diagram, we can see the boundaries between the systems. At each boundary, we have input and output. Later on, we'll discuss some tactics for exploring these boundaries or interfaces.
00:04:52.360 I’m also going to assume that you have access to your application’s codebase and can run it locally. This will enable us to debug our local application in parallel with production.
00:05:06.819 Additionally, we can compare our local input and output with what we glean from the production application. This raises a question: What could actually be different between these environments?
00:05:26.919 An easy first step is to ask ourselves if there is a configuration difference between our local application and production. These configuration differences are often responsible for unreproducible bugs.
00:05:39.699 I recommend becoming familiar with the stack build process, for example, the flags passed to module bundlers like Webpack. This is often where these bugs tend to pop up first.
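As a minimal sketch of that sanity check, assuming the project keeps separate, per-environment config files (the file names below are hypothetical), a quick diff can reveal configuration drift between environments:

```bash
# Compare the environment-specific build configs (hypothetical file names)
diff webpack.dev.js webpack.prod.js

# Or compare the environment variables each build actually uses (hypothetical .env files)
diff <(sort .env.local) <(sort .env.production)
```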
00:05:54.449 But what if we’re not familiar with this particular system or codebase? Perhaps we’re a new hire, or it's our first time working with it. What should we do? It sounds silly, but ask a friend, or a subject matter expert on that system or codebase to pair with you.
00:06:07.740 Don't be hesitant to reach out for help; it will save your team a lot of developer hours, plus it’s a new chance for you to learn from pairing on a hard problem.
00:06:20.550 All right, let’s begin our debugging journey at the very start: the CDN. The CDN, or Content Delivery Network, is the first step in our request journey.
00:06:37.760 Most companies don't run their own CDN; rather, they use providers like Cloudflare or Akamai. If we use this CDN, our end users will have their requests fulfilled by edge servers that are geographically closer to them, which means they’ll likely see performance gains.
00:06:53.020 However, if we’re trying to reproduce a bug, there might not be a guarantee that we ourselves will hit the same edge server as an end user.
00:07:11.080 CDNs are also used to cache assets like CSS files and images, as well as full pages and partial page content, among other things. While that all sounds great, unfortunately, it can also make reproducing bugs in production difficult.
00:07:24.149 We might encounter stale assets, which means that our end users could get old cached files with breaking changes. There could also be issues related to redirect rules, firewalls, and so much more.
00:07:34.740 So while we should definitely use CDNs in production, we need a strategy for ruling them out as the source of a bug.
00:07:52.540 One weird trick to rule out CDN issues is to bypass the CDN entirely and hit your origin server directly. Going back to our architecture diagram, here's what it might look like.
00:08:01.639 Instead of our clients going through our CDN, they would hit the load balancer or our origin servers directly. As a quick caveat, most companies tend to restrict making requests to the origin server from general networks, so be cautious.
00:08:13.140 However, if the bug disappears when you hit your origin server directly, it’s likely that your problem lies with the CDN. In that case, you'll want to investigate how your CDN configuration may be causing the issue.
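A minimal sketch of that bypass, using a hypothetical hostname and origin IP: curl's --resolve flag pins the hostname to the origin's address so the request skips the CDN's edge servers, letting you compare the two responses directly.

```bash
# Send the request straight to the origin server (203.0.113.10 is a hypothetical origin IP)
curl --resolve www.example.com:443:203.0.113.10 https://www.example.com/assets/app.css -v

# Compare against the normal, CDN-served response
curl https://www.example.com/assets/app.css -v
```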
00:08:24.560 Let’s focus on bugs caused by caching next.
00:08:33.500 As a quick disclaimer, Jeff Atwood was not the first person to coin the saying 'it's not a bug, it's a feature', but he did get the most retweets and likes, which is why I chose that quote.
00:08:48.030 Caching can occur in several places: client-side in the browser, server-side in web applications, in-memory caches, and at the database layer. So let’s go through some examples.
00:09:02.250 Starting with client-side caching, we can have web storage, cookies, and caching service workers. There are plenty of places for caching issues to arise.
00:09:14.450 For example, here’s an issue caused by client-side caching. Notice the last to-do item looks a bit wonky. Let's take a look at this situation.
00:09:27.250 When we reload the page, the issue is still present. This indicates that there's likely a caching issue somewhere.
00:09:38.780 Let’s open Chrome's developer tools and check local storage and cookies. We see something in local storage, so we'll clear it out and then reload the page.
00:09:52.180 Now our to-do items are back where they should be. Let’s say we want to buy some bread, or maybe, as millennials, make some avocado toast. That’s an example of how clearing the cache resolved the issue.
00:10:08.040 Even for complex applications, I recommend ruling out client-side cache first by using an incognito window or a different browser. Flush the cache as I did; it will save time.
00:10:22.500 Server-side caching can be more difficult to troubleshoot, as it’s hard to get insights into what was actually cached. However, you can check your application logs or the cache directly if you have access.
00:10:48.300 But the simplest solution is to flush the entire cache, similar to what we did in the previous example. While I know that might feel like a cop-out, it’s a good sanity check.
00:10:54.250 Caching issues can be particularly hard to debug, especially if you have a distributed cache. If your bug still occurs after this step, it’s probably not caused by server-side caching.
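For example, if the server-side cache happens to be Redis (an assumption about the stack), flushing it is a one-liner; heed the warning below before running anything like this in production.

```bash
# Flush every key in the Redis cache (destructive; assumes Redis is the cache store)
redis-cli FLUSHALL
```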
00:11:12.220 Fair warning: please be very careful when clearing any caches, especially during critical times like Black Friday. Don’t take your application down while you’re trying to debug.
00:11:40.080 Let’s now shift our focus to client-side JavaScript that runs in the browser. Debugging client-side JavaScript can be fun since we can interact directly with the code.
00:11:51.169 However, the downside is that browser code usually tends to be minified, making it challenging to see what’s happening. To tackle this, we can use Source Maps.
00:12:06.320 By generating a source map with whatever tool you use to minify your original JavaScript files, you can apply the source map directly in Chrome's dev tools.
00:12:18.163 After adding a source map, the minified JavaScript file is now readable, similar to how it appears in a text editor.
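As a rough sketch, assuming a Webpack-based build (flag names may vary by bundler and version), generating source maps alongside your minified bundles is usually a single option:

```bash
# Produce minified bundles plus .map files alongside them (Webpack example; hedged, not universal)
npx webpack --mode production --devtool source-map
```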
00:12:37.500 Now that we have identified our code, we can use arbitrary breakpoints in the browser's developer tools to gain necessary input and output data.
00:12:49.250 These breakpoints help us determine where to pause our debugging process. We can use recognizable identifiers as wayfinders: by looking at our local files, we can find strings to match against the production files.
00:13:03.600 Let’s assume we’re focused on the edit function in the to-do application. Maybe it’s broken for whatever reason, and we need to identify where.
00:13:22.410 While examining the code, we find identifiers, such as the 'Enter' key name, that will likely remain intact after minification.
00:13:32.000 Now, let’s search for that identifier in our broken to-do application to see if we can locate it. We search through all our files, possibly looking at multiple JavaScript files.
00:13:52.100 We print the results and determine where that identifier is used in the code. By doing this, we can tie what we saw locally to what we see in production.
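One way to do that search outside the browser, sketched with a hypothetical asset URL:

```bash
# Download the minified bundle the page references (hypothetical URL)
curl -sO https://app.example.com/assets/app.min.js

# Find the identifier we spotted in the local source; -o prints each match, -b its byte offset
grep -ob "Enter" app.min.js
```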
00:14:06.560 Next, we can place a breakpoint on that line to see if it is hit in the running application. If the function is broken, we can determine what is going on.
00:14:18.100 Once we’ve set our breakpoint, we might notice that the bug we had is tedious to reproduce. It’s common for such issues to reside in collections.
00:14:32.540 When dealing with arrays or objects, you may have only one element that is faulty while everything else appears correct. This calls for using conditional breakpoints.
00:14:42.430 Conditional breakpoints are a great way to pause execution only when the error condition is met. The worst scenario is hitting 'play' for hours, hoping to catch which object caused the issue.
00:14:54.000 In the case of a broken to-do application, let’s see what’s wrong. We encounter a scary runtime error; it's challenging to assess in the moment.
00:15:06.840 Identifying the line where the error occurs will help narrow down the problem, so let’s place a breakpoint there and see what's happening.
00:15:17.000 We’ve now paused on that line. For this specific value/object, everything appears functional.
00:15:26.320 After continuing, we can add a conditional breakpoint that stops only when the problematic to-do object comes through.
00:15:38.240 By this point, we find that the to-do object is undefined, so when we attempt to access its title property, it throws an error.
00:15:50.340 This confirms that we have successfully narrowed down the faulty object in our collection by using conditional breakpoints, rather than sifting through every single entry.
00:16:05.000 Reflecting back on that previous example, we noted that the app is built with React; the props give this away.
00:16:18.760 Something else we can employ for input introspection is using React developer tools. The tools offer insight into the state values and props without needing to set breakpoints.
00:16:29.560 Since these props and state values represent the actual inputs to our components, we can see discrepancies more easily.
00:16:42.420 For instance, if our to-do app is broken again, we notice something odd with the two to-do items.
00:16:55.000 Let’s focus on one of the functioning components, which appears complete. Now, let's examine the broken component.
00:17:07.950 We note the broken component lacks certain properties under the todo prop, indicating potential state issues.
00:17:25.000 Instead of using breakpoints or lengthy manual checks, we can use React Developer Tools to compare the working component to the malfunctioning one.
00:17:41.220 Focusing on data and its differences highlights how one element functions while the other doesn’t. This approach can assist us in troubleshooting data integrity issues.
00:17:54.560 We can serialize values like the previous props and current props using JSON.stringify. By utilizing Chrome DevTools, we can copy this data to our clipboard.
00:18:09.960 We can analyze these serialized JSON values in files or compare them using tools, identifying precisely what changed.
00:18:22.560 For example, if we look at the title property, we can identify changes between previous props and current ones.
00:18:35.400 This process helps us streamline our investigations by utilizing command-line tools specifically designed for JSON data analysis.
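A small sketch of that comparison, assuming the previous and current props were saved to two files from the DevTools console (hypothetical file names): jq -S sorts keys so the diff shows only genuine differences.

```bash
# Pretty-print both snapshots with sorted keys, then diff them
diff <(jq -S . previous-props.json) <(jq -S . current-props.json)
```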
00:18:45.020 So far, we’ve been dissecting data in our to-do app to understand its breakdown. This prompts an important question: where does this data originate from?
00:19:05.420 How do our applications procure the data they need?
00:19:11.260 We often make AJAX requests from the client. These requests are handled by our application servers, introducing another boundary in our data flow.
00:19:27.420 Consequently, we have bugs that can arise from client request creation, server-side handling of the request, and client-side handling of the response.
00:19:44.880 These areas present opportunities for bugs to appear. Thus, our goal is to access the requests as closely as possible, disregarding any surrounding code.
00:19:58.680 One benefit is that we can inspect AJAX requests directly within our browser. Let’s dig into the AJAX requests made every time we add a new to-do item.
00:20:14.860 For instance, I might type in 'buy an owl costume' followed by researching owl behavior, generating multiple subsequent AJAX requests.
00:20:29.240 Instead of analyzing our code, we can focus on the interaction at the interface level, confirming the returned data from the request.
00:20:45.460 Furthermore, we can discern the data passed in with our request by observing AJAX requests in the network panel.
00:20:59.600 Notably, we should verify whether our request was successfully made, and if we provided the expected request body. Also, check for the anticipated response.
00:21:09.500 These three factors are common points of failure when debugging.
00:21:24.950 It gets better! We can begin replaying requests independently of the browser, removing the rest of the application code from the picture, by utilizing curl.
00:21:39.960 Chrome’s dev tools let us easily copy each request as a curl command. After grabbing the curl output, simply paste it into our terminal.
00:21:53.950 We can observe output exclusive to that request or use jq, a widely used command line JSON processing tool, to assess the actual return values.
00:22:04.210 This allows us to isolate our system requests while determining which headers to pass and what defines the request body.
00:22:20.050 Once we have our curl command perfected, we can modify it to verify system behavior and assess various input and output scenarios.
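Here's roughly what that looks like with a hypothetical endpoint, after using Chrome's "Copy as cURL" and trimming the headers down to the ones that matter:

```bash
# Replay the request on its own and pretty-print the JSON response
curl -s 'https://app.example.com/api/todos' \
  -H 'Accept: application/json' \
  -H 'Cookie: session=<copied-from-devtools>' \
| jq '.'
```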
00:22:36.350 However, editing curl commands by hand is error-prone. Instead, I recommend using a tool like Postman, as it allows for convenient interaction with our requests.
00:22:49.850 For instance, let’s purchase a cool owl costume by altering our earlier request. After sending this modified request to the backend server, we should see the to-do item saved.
00:23:05.950 Then, we can refresh our task list to confirm the purchase appears as intended.
00:23:20.790 This illustrates how isolated requests to backend servers can streamline debugging, allowing us to bypass the application during testing.
00:23:37.990 To address potential race conditions in our code—especially troublesome to reproduce manually—I suggest repeating the curl requests.
00:23:53.350 Using a bash script, we can perform this repetitive task reliably. The script allows us to hit multiple endpoints repeatedly, surfacing race conditions.
00:24:04.630 For instance, utilizing curl commands in a bash script can help us diagnose race conditions across production endpoints.
00:24:17.050 By doing so, we can expose issues with database transactions that might otherwise go unnoticed during normal operations.
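A minimal sketch of such a script, with a hypothetical endpoint and payload; firing the requests in the background makes them overlap, which is what shakes out race conditions:

```bash
#!/usr/bin/env bash
# Hit the same endpoint many times concurrently to try to trigger a race condition.
# The URL and payload are hypothetical placeholders.
for i in $(seq 1 50); do
  curl -s -X POST 'https://app.example.com/api/todos' \
       -H 'Content-Type: application/json' \
       -d "{\"title\": \"race test $i\"}" &
done
wait   # block until every background request has finished
```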
00:24:35.949 Now, let’s transition into server-side debugging, where we will investigate the actual code running on our servers.
00:24:54.750 Please note that gaining introspection here can be a bit more challenging, especially without the use of tools.
00:25:06.250 If you have logging, tracing, or monitoring management tools (like Splunk, Datadog, or New Relic), use them! They simplify the debug process, but the advice still applies even without these tools.
00:25:21.870 In production, we likely have multiple servers handling requests from actual users. Therefore, we probably shouldn’t edit the code on these servers directly.
00:25:37.240 Instead, try to match the entries in your log files to the logging code in your application, effectively using your logs as a roadmap to the executed code.
00:25:51.520 Additionally, you could take the curls we discussed previously and focus on isolating specific endpoints pertinent to your debugging task.
00:26:04.380 This method resembles the AJAX techniques we covered earlier, but you’ll gain much more insight if you’re capturing live production traffic.
00:26:17.840 Thus, you won't want to lose yourself in irrelevant log entries—it’s crucial to filter for those that are impactful to your work.
00:26:33.780 In this example, let’s attempt to find a specific to-do item using its ID when examining the logs. Once in the application, we can inspect the ID attribute.
00:26:50.440 Performing an action like deleting the to-do item will enable us to search the logs for the unique ID, lending insight into any consequences of that action.
00:27:06.030 After deleting the item, we can cross-reference our logs for that ID to see which messages correspond, narrowing the data to only relevant entries.
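Assuming file-based logs at a hypothetical path, and a hypothetical to-do ID, that cross-referencing can be as simple as:

```bash
# Pull out only the log lines that mention the to-do's ID
grep "4821" /var/log/myapp/production.log

# Or follow the log live while reproducing the action in the browser
tail -f /var/log/myapp/production.log | grep --line-buffered "4821"
```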
00:27:21.830 But what if our application doesn't generate useful identifiers? In cases like this, we need to create our own for tracing.
00:27:36.880 By posting new to-dos in Postman, we can include our own unique identifiers or attributes that will show up in our logs. If we note those identifiers, we can follow their paths.
00:27:53.640 This lets us distinguish the requests we've made from genuine user traffic, which makes debugging much easier.
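Sketched with curl instead of Postman (and a hypothetical endpoint), the idea is simply to embed a marker string you'll never see in real traffic, then follow it through the logs:

```bash
# Create a to-do whose title doubles as a unique tracer we can grep for later
curl -s -X POST 'https://app.example.com/api/todos' \
     -H 'Content-Type: application/json' \
     -d '{"title": "debug-trace-7f3a9c"}'

# Then follow the tracer through the application logs (hypothetical path)
grep "debug-trace-7f3a9c" /var/log/myapp/production.log
```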
00:28:06.950 Now, let’s wrap up with a few last tricks. These methods, however, are risky, and I recommend obtaining permission before implementing them.
00:28:20.730 For one, using tcpdump, a command-line packet analyzer, allows you to inspect network traffic live on your production servers.
00:28:34.860 This approach reveals actual requests made and provides insight into the request bodies before they reach the application, directly assisting us in pinpointing issues.
00:28:50.960 To illustrate this, you can run tcpdump, filtering for HTTP traffic by referencing the server's designated port.
00:29:06.750 Additionally, you can export the packet capture from production servers and analyze it with a tool like Wireshark, which streamlines evaluation.
00:29:25.950 Consequently, you can look for pertinent traffic patterns and log relevant requests, making debugging far simpler.
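A rough sketch, assuming the application server listens on port 8080 (adjust the port and interface for your setup):

```bash
# Print HTTP traffic on the app port as ASCII so request lines and bodies are readable
sudo tcpdump -i any -A 'tcp port 8080'

# Or write a capture file to copy off the server and open in Wireshark
sudo tcpdump -i any -s 0 -w /tmp/app-traffic.pcap 'tcp port 8080'
```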
00:29:43.700 If you're fortunate enough to have a highly-controlled environment, you could remove one of your application servers from the load balancer.
00:29:55.450 Deploy code directly to that server and debug as needed. This creates a safe space for debugging so that no customer will experience bad service during testing.
00:30:11.990 If there is a configuration bug, this method allows you to isolate concerns to that specific server.
00:30:29.170 Finally, if you take anything from this talk, remember the importance of adding logging in critical code paths.
00:30:45.580 After enhancing logging, you can deploy your code and simply let it run, capturing insights from log messages whenever bugs occur.
00:30:58.800 This allows for future bug reproduction through log correlation, regardless of how many requests it takes to revisit the issue. Letting that code run wild will provide good insight.
00:31:14.940 Notably, for client-side logging, there are various vendors, such as Sentry and Bugsnag, that can assist in capturing error messages as they occur.
00:31:27.210 Wrapping up, I sincerely hope these underhanded tactics for debugging in production assist you the next time you face a troubling bug.
00:31:38.640 Thank you all for your time! I’m happy to answer any questions!
Explore all talks recorded at Ancient City Ruby 2019