RubyConf 2016

Diving into the Details with DTrace

by Colin Jones

In his talk at RubyConf 2016, Colin Jones dives into the intricacies of system performance monitoring using DTrace, a dynamic tracing tool that helps developers understand application behavior and optimize performance. He begins by highlighting common misconceptions developers face when troubleshooting slow applications, often jumping to conclusions without fully investigating the root causes. By leveraging DTrace, Jones aims to provide clarity and methods for effective performance analysis.

Key Points:
- Introduction to DTrace: Jones explains that DTrace is an observability system that provides insight into system activity in real time. It originated on Solaris and also runs on FreeBSD, Oracle Linux, and macOS, with some caveats on El Capitan and later. Like strace on Linux, it can trace system calls, but it also lets you write small programs that filter and aggregate the traced events.
- Usage Scenarios: The speaker discusses practical applications of DTrace to identify where applications lag. Using DTrace, developers can monitor system calls, memory usage, and I/O operations to discern performance bottlenecks.
- Case Study - Investigating Test Suite Performance: Jones illustrates a real-life example where his team faced issues with a slow test suite. Through methodical investigation involving DTrace, they discovered a delay in DNS resolution as the key factor contributing to test slowdown. This allowed the team to implement targeted solutions rather than broad changes that could have led to more complications.
- Benefits of Dynamic Tracing: DTrace can instrument running code without modifying it, and its deliberately restricted language is designed to be safe to run in the kernel. Together these make it practical for investigating performance issues on live systems, including production environments.
- Learning Resources: Jones encourages continuous learning with DTrace, recommending resources like Brendan Gregg’s website for deeper dives into performance analysis and DTrace scripting.

In conclusion, DTrace serves as a powerful tool for developers to understand system performance intricately, allowing for better troubleshooting and optimization strategies. It emphasizes the need to conduct thorough investigations before jumping to solutions, ultimately leading to more reliable software development practices.

00:00:15.360 Hello everyone! I'm Colin Jones, and you can find me on Twitter as @trptcolin. I'm the CTO at Eighth Light, a software consulting and custom development company. One of our core principles is about learning and improving our craft, so it’s great to be here at RubyConf 2016, surrounded by all of you who are motivated to enhance your skills.
00:00:32.259 Historically, I have focused more on application development—web, mobile, desktop, and backend services. However, over the past couple of years, I’ve become increasingly interested in the systems side of things, especially around performance. I’ve found myself frequently questioning why certain applications are so slow and what improvements can be made. Through this process, I’ve realized that I held quite a few misconceptions about performance issues, making it hard to trust my own intuition when faced with these challenges.
00:01:09.369 It’s common for people like me to see a slow application and immediately jump to potential solutions such as rewriting in a different language, changing databases, or implementing more caching logic. Unfortunately, these solutions often take considerable time to implement, and even if they solve one problem, they may uncover new issues that need addressing.
00:01:38.079 Based on these experiences and reflections, I’ve started investing more time in understanding problems more fully before jumping to solutions. In this talk, I’ll discuss how I’ve been trying to better understand performance issues in particular and how DTrace can assist us in exploring these problems.
00:02:09.369 As you may have gathered from the title, we will explore how to delve into performance problems using DTrace. But first, let’s clarify what DTrace actually is. DTrace, which stands for Dynamic Tracing, is an observability system that allows you to see in-depth what’s happening on your computer at any given moment.
00:02:36.960 Originally from the Solaris world, DTrace also operates on FreeBSD and Oracle Linux, and it works with macOS too. However, there are some caveats for macOS users, especially regarding El Capitan and later versions, which may require you to jump through some hoops if you want to use it. This talk is primarily from my experience running DTrace on my personal development machine.
00:03:05.660 For those familiar with strace on Linux, you can think of DTrace as a more advanced version of that tool. For those unfamiliar with strace, it provides a way to monitor system calls made by your application code—anything from opening a file to reading from the network. Julia Evans has an excellent zine about strace that I highly recommend for anyone interested in learning more.
00:03:46.250 DTrace allows you to observe system calls in a similar manner, but it also gives you the ability to write programs for tracing events, filtering, aggregating, and performing various actions based on the events being traced. However, the programming language used for this is quite limited and designed with safety in mind. DTrace runs in the kernel, which is a heavily protected area of the operating system.
00:04:04.640 This limitation helps prevent potentially harmful operations from being executed while ensuring that the tool can safely run in production environments. For example, there are no loops in this language to prevent infinite loops. You may find this to be a major limitation at first, but it actually makes the tool safer and more reliable.
00:04:44.310 To run DTrace in practice, you execute it from your terminal, passing it the tracing program you want to run. The syntax is somewhat reminiscent of awk, and a program can trace various events, such as system call entries and exits. For instance, you can observe all system calls across the entire machine or focus on specific processes. This ability to trace at a system level is powerful, as it allows you to aggregate system calls not only by process but also by the function being called.
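The slides are not part of this transcript, so the following is only a minimal sketch of the kind of D program described here, not the speaker's actual script. It counts system calls across the whole machine, keyed by process name and system call name. The file name `syscalls.d` is an assumption.

```dtrace
#!/usr/sbin/dtrace -s

/* Fires on entry to every system call, in every process. */
syscall:::entry
{
    /* Keep one counter per (process name, system call name) pair. */
    @calls[execname, probefunc] = count();
}
```

It could be run as root with something like `sudo dtrace -s syscalls.d`. Adding a predicate such as `/execname == "ruby"/` between the probe description and the opening brace would narrow it to a single program.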
00:05:34.080 This aspect of DTrace enables you to see which system calls are made most often and, by timestamping each call’s entry and return, which are taking the most time. While tracing, you can hit Ctrl-C to end the session, and DTrace will print the aggregated results, such as the most commonly invoked system calls.
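To measure time as well as call counts, one common approach is to record a timestamp when a system call is entered and subtract it when the call returns. The sketch below assumes a single target process, supplied with `-p` or `-c`. The file name and aggregation names are illustrative.

```dtrace
#!/usr/sbin/dtrace -s

/* Remember, per thread, when the target process entered a system call. */
syscall:::entry
/pid == $target/
{
    self->start = timestamp;
}

/* On return, add the elapsed nanoseconds to that system call's total. */
syscall:::return
/self->start/
{
    @calls[probefunc] = count();
    @total_ns[probefunc] = sum(timestamp - self->start);
    self->start = 0;
}
```

A possible invocation is `sudo dtrace -s syscall_time.d -p <PID>`. The totals are elapsed wall-clock time, so they include time spent blocked, which is usually what matters when a program is waiting rather than computing.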
00:06:13.080 As we continue our exploration of DTrace, we can further investigate lower-level system resources. DTrace allows tracing of events related to disk I/O, memory usage, and much more. For instance, you can trace page faults that require I/O operations, providing valuable insights into what processes trigger these faults.
00:06:46.530 Once you've determined which application processes are causing high page faults, you can effectively focus your performance investigations on those areas, allowing for more targeted troubleshooting and optimization.
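As a rough illustration of tracing page faults that require I/O, the sketch below uses the `vminfo` provider's `maj_fault` probe and counts firings per process name. Whether this provider and probe are available depends on the operating system and version, so treat that as an assumption to check with `dtrace -l`.

```dtrace
#!/usr/sbin/dtrace -s

/* maj_fault fires when a page fault has to be satisfied by I/O. */
vminfo:::maj_fault
{
    @faults[execname] = count();
}
```

Run as root and stopped with Ctrl-C, this prints one line per process name with its count of major faults.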
00:07:18.660 Additionally, developers can create custom probes for their applications. Many popular applications and dynamic runtimes such as PostgreSQL and Ruby already contain built-in DTrace probes, which provide developers with more detailed visibility into their code.
00:07:57.210 By allowing applications to emit probes during execution, developers can give DTrace insights into internal events that occur, such as query execution in databases or function timing in other programs. This feature lets you trace highly specific operations and behaviors, making performance debugging much easier.
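As one example of using built-in application probes, Ruby 2.0 and later define a `ruby` provider when the interpreter is built with DTrace support. The sketch below counts method calls by class and method name in a single Ruby process. It assumes such a build and is not taken from the talk's slides.

```dtrace
#!/usr/sbin/dtrace -s

/*
 * ruby$target selects the ruby provider in the process given by -p or -c.
 * For method-entry, arg0 is the class name and arg1 is the method name.
 */
ruby$target:::method-entry
{
    @calls[copyinstr(arg0), copyinstr(arg1)] = count();
}
```

A possible invocation is `sudo dtrace -s ruby_methods.d -c 'ruby script.rb'`, where both file names are placeholders.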
00:08:45.140 One of the truly powerful aspects of DTrace lies in its ability to dynamically instrument running code without requiring the developers to write tracing code. For instance, you can trace arbitrary functions, even in the kernel, simply by enabling the corresponding probes. This dynamic tracing ability means that you can monitor a running system in real time, with overhead incurred only while probes are enabled.
00:09:34.290 As an example, you can track durations of kernel function calls dynamically, enabling you to identify which functions are the most time-consuming. This level of detail allows developers to pinpoint precisely where the performance bottlenecks exist.
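A minimal sketch of timing a kernel function with the `fbt` (function boundary tracing) provider follows. The function name `vm_fault` is only an example and may not exist or be traceable on a given kernel. Something like `sudo dtrace -l -n 'fbt::vm_fault:entry'` shows whether the probe is available.

```dtrace
#!/usr/sbin/dtrace -s

/* vm_fault is an example; substitute a function listed by dtrace -l. */
fbt::vm_fault:entry
{
    self->start = timestamp;
}

fbt::vm_fault:return
/self->start/
{
    /* Power-of-two histogram of call durations in nanoseconds, per process. */
    @ns[execname] = quantize(timestamp - self->start);
    self->start = 0;
}
```

Using `quantize` rather than `sum` shows the distribution of durations, which makes occasional very slow calls stand out from many fast ones.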
00:10:15.780 While we won't cover all the intricate details here, DTrace gives you a lot of power to probe into the underlying workings of your code and the performance of your system. The insights gained can be instrumental to developers tasked with debugging and optimizing their applications.
00:10:56.520 At this point, we can start to investigate actual performance problems using DTrace. Many of us have experienced situations where our code runs slower than expected, creating frustration and uncertainty. In this talk, I will illustrate a real-life performance issue my team faced, showcasing the steps I would take now with DTrace, highlighting the tools it provides to dig deeper into performance problems.
00:11:43.400 In this scenario, my team's test suite was intermittently slow—taking around 30 seconds when it should have taken milliseconds. This significant delay complicated our workflow, as we rely on fast feedback from tests to catch bugs quickly. It was unclear what was causing this delay; while we assumed it might be related to the test features, we lacked specific information on the underlying issue.
00:12:43.900 To get to the bottom of the problem, we knew we needed to create some hypotheses about possible causes, which could range from database locks to memory issues to various environmental factors. However, instead of jumping to conclusions, we wanted to use DTrace to gather more data and better understand the problem at hand.
00:13:25.560 By gathering information through DTrace, we could validate our existing hypotheses, rule out potential problems, and generate new questions to explore. Many performance issues stem from resource limitations, whether in software (like locks) or hardware (such as CPU or memory). Therefore, the next logical step is to begin investigating system resources to identify any potential constraints.
00:14:29.710 For instance, one useful resource to look at is CPU usage. Using DTrace, we can sample what is running on the CPUs at a fixed rate and count the samples for each executable. When analyzing the output, we discovered that a program called 'kernel_task' was consuming significant CPU time, which seemed suspicious and warranted further investigation.
00:15:21.920 As we drilled down into the kernel stack traces, we identified that most of the CPU time was actually attributed to idle processes, indicating there weren’t any blatant issues on the CPU itself. This realization was important as it ruled out many performance hypotheses related to CPU-intensive operations.
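The two steps described above, sampling by executable and then drilling into kernel stacks, can be sketched with the `profile` provider. The sampling rate of 997 Hz and the ten-second duration are assumptions chosen for illustration. An odd rate like 997 is often used so that samples do not fall in lockstep with periodic timers.

```dtrace
#!/usr/sbin/dtrace -s

/* Sample every CPU 997 times per second and count samples per executable. */
profile-997
{
    @by_exec[execname] = count();
}

/* For samples attributed to kernel_task, also record the kernel stack. */
profile-997
/execname == "kernel_task"/
{
    @kstacks[stack()] = count();
}

/* Stop after ten seconds; the aggregations are printed on exit. */
tick-10s
{
    exit(0);
}
```

If the most frequent stacks consist of the kernel's idle routines, the CPU is mostly waiting rather than working, which matches the conclusion described here.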
00:16:18.320 While we initially felt disappointed that we didn’t find a performance bottleneck on the CPU, this outcome was beneficial. It eliminated a large segment of possible performance problems, allowing us to focus on other potential factors affecting application performance.
00:17:31.660 Next, we decided to investigate networking, given that our application relies heavily on database calls. By tracing socket connections to see where delays might stem from, we could get a clearer picture of whether networking was impacting performance.
00:18:24.550 The results revealed that local connections to the database were performing well, but we also noticed external connections leading to an unfamiliar IP address. This discovery warranted further investigation to see why these external calls were occurring and if they were related to the performance problems.
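One way to see which addresses processes connect to is to trace the `connect` system call and decode its socket address argument. The sketch below handles IPv4 only and assumes that `AF_INET` has the value 2 and that `struct sockaddr_in` is known to DTrace on the system in question. It is not the script used in the talk.

```dtrace
#!/usr/sbin/dtrace -s
#pragma D option quiet

inline int af_inet = 2;    /* assumed value of AF_INET */

/* Copy the sockaddr argument of connect() in from user space. */
syscall::connect*:entry
{
    this->sa = (struct sockaddr_in *)copyin(arg1, sizeof (struct sockaddr_in));
    this->family = this->sa->sin_family;
}

/* For IPv4 sockets, print which process is connecting and where to. */
syscall::connect*:entry
/this->family == af_inet/
{
    this->a = (uint8_t *)&this->sa->sin_addr;
    printf("%-16s %d.%d.%d.%d:%d\n", execname,
        this->a[0], this->a[1], this->a[2], this->a[3],
        ntohs(this->sa->sin_port));
}
```

Connections to `127.0.0.1` on the database port would be the expected local traffic. Any other address in the output is a candidate for further investigation.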
00:19:29.790 To explore the unexpected external connections further, we decided to conduct DNS lookups to identify what hostnames corresponded with the problematic IP address. By tracing the DNS resolution process, we uncovered that DNS queries to this address could take up to 30 seconds to resolve.
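A hedged sketch of timing name resolution follows. It assumes the process resolves hostnames through the C library's `getaddrinfo` function and uses the `pid` provider to time each call in a target process. The output format is illustrative.

```dtrace
#!/usr/sbin/dtrace -s
#pragma D option quiet

/* getaddrinfo(node, service, hints, res): arg0 is the hostname string. */
pid$target::getaddrinfo:entry
/arg0 != 0/
{
    self->host = copyinstr(arg0);
    self->start = timestamp;
}

/* On return, print the hostname and how long the lookup took. */
pid$target::getaddrinfo:return
/self->start/
{
    printf("%-40s %d ms\n", self->host,
        (timestamp - self->start) / 1000000);
    self->host = 0;
    self->start = 0;
}
```

Run against the test process with `-p` or `-c`, a lookup that takes tens of seconds would appear as a single line with a very large millisecond value.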
00:20:31.410 This discovery was essential, as it not only explained the indirect connection delays but also pinpointed a specific cause for the slow tests we had observed. By identifying the slow DNS resolution, we had a strategy for fixing the issue, whether that meant changing DNS providers, switching references in our code, or faking out those calls completely.
00:21:21.790 This example illustrates the significance of employing a systematic approach for resolving performance problems using DTrace. By tracing the underlying causes better, we learned to not only avoid external calls in the test suite but also to consider DNS latency as a key performance factor.
00:22:12.520 Ultimately, this experience reinforced the notion that performing disciplined performance investigations can enhance our ability to troubleshoot issues effectively. DTrace, while not the only tool available, provided us with a deep well of capabilities to explore performance bottlenecks efficiently.
00:23:13.080 For those interested in further developing your skills with DTrace, I recommend exploring several valuable resources. Brendan Gregg’s website is an excellent place to start, filled with articles and guides about performance analysis and DTrace.
00:24:08.540 Additionally, investigating scripts written by others can be an educational experience. Diving into the examples in the DTrace toolkit can give you insights into how DTrace works and guide your own explorations.
00:25:04.590 As you continue to learn, keep in mind that not all scripts will work perfectly on your version of DTrace since compatibility may vary. Still, there are stable probes within DTrace you can rely on, enabling you to experiment with the tool effectively.
00:26:04.300 For those using Linux, there are alternatives like DTrace for Linux and SystemTap that offer similar functionality. As new versions of Linux are released, innovative tools are emerging that provide powerful observability capabilities.
00:27:04.120 To summarize, DTrace has been a valuable learning tool for me, providing insights into performance, operating systems, and effective problem-solving. I hope my experience can guide you as well. Here are a few resources to check out, and feel free to reach out to me through Twitter to discuss further.
00:28:00.090 Thank you for your time, and I would love to take any questions you may have. I encourage everyone to reach out if you’d like to discuss technology, particularly around how Eighth Light helps companies build and maintain reliable software. We’d love to chat!
00:29:08.460 Integrating DTrace into the testing framework is an interesting concept, but it comes with challenges. For instance, DTrace operates externally and requires root permissions. This requirement complicates running tests, as it would necessitate shelling out and elevating permissions, which is not typically advisable.
00:30:00.000 In practical applications, I’ve primarily used DTrace to observe system calls made by my applications. There are tools built into DTrace that provide visibility into the system calls occurring in real time, allowing for a better understanding of performance issues as they happen.
00:30:55.000 Beyond just tracing individual calls, tools like ‘dtruss’, a DTrace-based counterpart to strace, can show us what’s occurring behind the scenes. These insights can empower developers to solve performance issues more effectively.
00:31:48.920 I’d be happy to take any further questions or continue discussing DTrace and performance improvements. Thank you all for attending, and I appreciate your engagement!