00:00:07.820
Hello everyone, my name is Jason Clark. As introduced, I work for New Relic, specifically on the Ruby agent, which is the gem portion of New Relic that you might have installed in some of your applications. While my work won't play too deeply into our discussion today, it has provided valuable insights that I want to share. The URL displayed offers examples of what I will be discussing today. I'm especially glad that Laura presented on Docker, as all of the examples I have are set up as Dockerfiles that you can build and run with Docker.
00:00:39.210
We're going to explore a set of tools that you can use to look at Ruby's behavior from the outside. Specifically, we will look at strace, rbtrace, and tcpdump. You may have heard of these tools in different contexts but might not have had the chance to dig deeper into them. They require some setup and configuration, so the Dockerfiles in that repository will allow you to explore and experiment with these tools offline. I'd encourage everyone to start downloading those images off the conference Wi-Fi right now, though it's probably not a great idea for all of us to do that at once.
00:01:15.810
This talk is structured as a series of mysteries for us to investigate. Like any good mystery story, each of these has a moral. The overarching lesson that I want you to take away from this discussion is to dig deeper. There are always layers in your software that exist below the layer where you typically work. With the help of these tools, you can unearth and understand what those underlying layers are doing.
00:01:59.040
Let's get started with the first case, which I call the case of the missing config file. This scenario comes from a startup that was dealing with a very sensitive data stream, which I can't discuss in detail due to its confidential nature. It was crucial for us to process this data quickly. One team member, Carl, an experienced C hacker, wrote a library that could rapidly parse this data stream for us.
00:02:18.459
Carl was very helpful and wrapped it in Ruby, creating a gem around the native library he had built. We had a custom, hand-built server that performed this processing. However, as often happens when you depend on a single server for critical infrastructure, it failed. We had copies of the library, but when we tried to install and run it on a different machine, we ran into problems. The first thing the logs showed was a message from Carl's library indicating that the configuration was incorrect.
00:03:03.400
We remembered that there was some sort of configuration file on the original server, but none of us knew what it was or where it belonged. When we looked for Carl or his source code to understand what was happening under the hood, we found ourselves quite empty-handed. There are many ways to approach this problem; pulling your hair out, crying, or screaming could all be valid reactions to this situation. However, someone suggested that we try using strace to diagnose the issue.
00:03:34.209
We launched strace and ran our program under it. You run strace in front of the command you want to observe, and it prints output describing what the program did as it executed. That output can be overwhelming at first. Instead of getting lost in the flood of information, it's essential to take a step back, catch your breath, and analyze the output closely to understand how it can help us address the problem.
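As a minimal sketch of what that invocation looks like (the script name and output file here are just placeholders, not from the real system):

    strace ruby parse_stream.rb
    strace -f -o trace.out bundle exec ruby parse_stream.rb

The -f flag follows any child processes the command forks, and -o writes the trace to a file so it doesn't flood your terminal.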
00:04:08.109
Let's pick apart a line from this output to grasp its significance. One line stands out, showing a system call, or syscall, made during execution. A syscall is a command given to the operating system along with parameters, such as a file path and flags dictating how we want to open that file. By isolating individual syscalls like this, we can better understand what each one is doing.
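For illustration (this path is invented rather than taken from the real system), a single strace line for an open syscall looks roughly like this:

    open("/home/app/config/settings.yml", O_RDONLY|O_CLOEXEC) = 3

The quoted string is the path being opened, the O_* constants are the flags, and the number after the equals sign is the return value, in this case the file descriptor the kernel handed back.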
00:04:51.450
In computer science, there are two divisions of code execution: the kernel, which interacts directly with hardware and resources, and user land, where services and applications run. The kernel manages resources carefully and restricts user code actions to prevent interference among applications. The interaction between user code and the kernel is controlled through syscalls, allowing us to request actions on our behalf. For each interaction, documentation is available via the 'man' command, which provides insights into what each syscall entails.
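For example, the syscall documentation lives in section 2 of the manual, so you can read about the open call with:

    man 2 open

That page describes the parameters, return values, and error codes, which is exactly the vocabulary you need to make sense of strace output.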
00:05:54.740
strace allows us to observe these syscalls as they are executed while our program runs. The significant takeaway here is that we need to look for recognizable landmarks within the swamp of data that strace provides. There are key pieces of output, such as loading messages and error messages, that can guide us in diagnosing problems. If we can pinpoint those messages in the output, we can examine the events occurring in between and potentially identify where our issue lies.
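In practice (the filenames are again placeholders), you can narrow the flood by asking strace to record only file-related syscalls and then searching the capture for landmarks you recognize:

    strace -f -e trace=file -o trace.out ruby parse_stream.rb
    grep -n "config" trace.out

The -e trace=file filter limits the output to syscalls that take a filename, which is usually the interesting subset when you're hunting for a missing file.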
00:06:26.570
We combed through the strace output, searching for the points where the data would be written out, and discovered the key syscalls. We found an open call for a config file that failed with an error indicating there was no such file or directory. That told us the library was looking for a configuration file that simply didn't exist. strace effectively helped us identify this oversight.
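Concretely, the landmark we were hunting for looked something like this (the path is a stand-in for the real one, which I can't share):

    open("/etc/carls_parser/parser.conf", O_RDONLY) = -1 ENOENT (No such file or directory)

The -1 return value and the ENOENT error code are the giveaway: the library asked the kernel for a file at that path and the kernel reported that nothing was there, which also pointed us at where the config file was expected to live.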
00:06:46.370
While strace is a powerful tool for observing syscalls, it's not a catch-all solution. It shines at showing interactions with the operating system, but sometimes we need visibility into what's happening inside Ruby itself. This brings us to the next case I want to discuss: the case of the unused method.
00:07:15.000
In this situation, we had an application that allowed external users to write scripts interacting with our system. While these scripts were intended to be useful, certain parts of the code became burdensome for us to maintain. We began to suspect that some of these objects were no longer in use, but we couldn't confirm it. The business was cautious about removing anything without evidence to back up our theory.
00:07:58.500
Though various approaches could shed light on this issue, like adding a method tracer or some logging, our hands were tied by restrictions on deploying new versions of the software. This is the kind of circumstance that teaches you to prepare ahead of time for the access you'll need later. Luckily, someone had installed a Ruby gem called rbtrace for us, which granted us visibility into method calls as they were made within the Ruby virtual machine.
00:09:02.990
Much like strace, rbtrace allows us to observe method calls in Ruby, and it can even attach to a process that's already running. By passing the process ID to rbtrace, we could log its method calls without restarting the application. We set up our trace, watched for a specified period of time, and looked for any of the suspicious methods being invoked. Sure enough, nothing was calling the methods we had assumed were unused.
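A rough sketch of what attaching looks like (the process ID and method names are made up for illustration, and the target process needs the rbtrace gem loaded):

    rbtrace -p 1234 -m 'UserScript#process' 'UserScript#legacy_cleanup'

The -p flag names the running process to attach to, and -m lists the specific methods you want reported each time they get called.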
00:09:55.600
This method-level tracing tactic is beneficial, and rbtrace also features built-in tracers for common Ruby constructs, potentially revealing interactions with IO methods, ActiveRecord calls, and more. With rbtrace, we could determine what Ruby code was actually being executed at runtime.
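If I recall the options correctly (check the project's README for the exact names), the predefined tracers are selected with a config flag, and there is also a firehose mode that reports everything:

    rbtrace -p 1234 -c io activerecord
    rbtrace -p 1234 --firehose

The firehose output is noisy, but it's occasionally exactly what you need when you have no idea what a process is doing.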
00:10:23.000
Next, I want to talk about the case of an occasional lag in a system that operated on a hundred-millisecond timer. This service established a periodic heartbeat, sending pings across the network to another service. While the heartbeats generally arrived as expected, occasionally we noticed that one heartbeat would lag, and it would be delayed for a moment before things returned to normal.
00:10:54.060
At first, this was not alarming. However, given the reliability we had observed, we needed to understand why this phenomenon was occurring. The turning point came when we discovered that background tasks running approximately every ten minutes were coinciding with the heartbeat lag. The timing was interesting, but we were puzzled why these independent tasks might affect heartbeat performance. They each operated on separate threads, and we initially saw no direct correlation.
00:11:42.180
However, rbtrace came to the rescue in this instance as well. By utilizing a specific flag, we could trace garbage collection events. We discovered that following the execution of our background tasks, multiple garbage collections would occur, leading to pauses in our system. Ruby's garbage collector can briefly halt execution, which could explain the heartbeat inconsistency.
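The flag in question, sketched here with a placeholder process ID, is the GC tracer:

    rbtrace -p 1234 --gc

Each line of output marks a garbage collection in the attached process, so you can see when collections happen relative to the rest of your logging.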
00:12:10.540
This knowledge enabled us to look into other metrics and investigate object allocation during these periods. Indeed, it turned out that there was a spike in object allocations at the ten-minute mark, confirming our theory that excessive garbage collection was affecting the heartbeat. With this insight, we could take action to minimize object allocation during these background processes and stabilize performance.
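One simple way to check for that kind of spike, assuming Ruby 2.1 or newer where GC.stat exposes a total_allocated_objects counter (and with a made-up method standing in for the real background job), is to sample the counter before and after the work:

    before = GC.stat[:total_allocated_objects]
    run_background_tasks   # hypothetical stand-in for the ten-minute job
    allocated = GC.stat[:total_allocated_objects] - before
    puts "background tasks allocated #{allocated} objects"

GC.stat returns cumulative counters from the VM, so the difference between two samples tells you how many objects were created in between.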
00:12:53.220
If you're interested in using rbtrace, I recommend checking out its GitHub repository. It was written by Aman Gupta and is designed to have minimal overhead, making it suitable for use in production environments. Justin Weiss has written an excellent blog post that walks through the kind of setup we've discussed today.
00:13:22.410
Lastly, we come to the case of delicate negotiations. In one particular project I was working on (the Ruby agent I mentioned earlier), we would gather data every minute and transmit it to our collector service. Initially, we established a new HTTP connection every minute to send this data, and each of those connections meant an SSL negotiation, which has a cost in both time and resources.
00:14:01.470
As you might expect, this wasn't very efficient at scale. Instead, we switched to a strategy of reusing TCP connections. By reconfiguring our code to use persistent connections, we minimized that overhead and improved overall performance. However, a challenge arose: how could we confirm that our code was actually reusing connections the way we intended?
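This isn't the agent's actual code, but a minimal sketch of the idea using Ruby's standard Net::HTTP (the hostname and path are placeholders) looks like this:

    require 'net/http'

    http = Net::HTTP.new('collector.example.com', 443)
    http.use_ssl = true
    http.keep_alive_timeout = 70   # willing to reuse the socket between one-minute harvests
    http.start                     # connect and negotiate SSL once

    loop do
      http.request(Net::HTTP::Post.new('/agent_data'))   # payload omitted
      sleep 60
    end

Because the Net::HTTP object stays started, each request after the first rides on the same TCP connection instead of paying for a fresh handshake and SSL negotiation.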
00:14:49.290
To overcome this hurdle, we turned to tcpdump, a straightforward tool for monitoring network packets. By running tcpdump, we could see all the traffic and check that our connections were being reused as expected. The key takeaway is to recognize these boundaries between systems. In our case, there was a line between the Ruby layer and the network layer, and understanding that line is essential for evaluating our application's behavior.
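As a starting point (the hostname is a placeholder, and you'll typically need root), a capture scoped to just the traffic you care about looks something like this:

    sudo tcpdump -i any -n host collector.example.com and port 443

The -i any flag listens on every interface, -n skips DNS lookups so addresses print as numbers, and the host and port expressions filter out everything except the conversation with the collector.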
00:15:31.800
When you run tcpdump, you'll often face an overwhelming amount of information. However, look for specific indicators within the packet exchange to help you verify whether connection reuse is working correctly. A new connection has a distinct pattern: the SYN packets of the initial handshake, followed by the rest of the conversation. Counting those tells you how many connections are being initiated and whether they are being reused.
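One way to do that counting, for example, is to filter the capture down to just the SYN packets that start a handshake:

    sudo tcpdump -i any -n 'tcp[tcpflags] & tcp-syn != 0 and port 443'

If connection reuse is working, you should see a SYN only when a genuinely new connection is opened, not one per minute for every harvest.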
00:16:24.050
With our new connection setup in place, the tcpdump output showed the connection reuse we anticipated. Instead of a fresh handshake for every request, the system maintained its connections, confirming the efficiency improvements we were aiming for. This finding gave us confidence that our changes were effective and that the steps we took were the right ones.
00:17:15.060
As you explore tcpdump, take time to learn its options and how to refine its output for your needs. You might see a lot of background noise from other processes, but focusing on the pertinent flags and filters can streamline your diagnosis. Resources such as Benjamin Cane's quick-reference guide and talks about the lifecycle of a web request can help consolidate your understanding of how the networking layer works.
00:18:20.690
In conclusion, today's exploration of these tools, including strace, rbtrace, and tcpdump, merely scratches the surface of what's available to help you analyze and understand your software systems. Always seek to dig deeper; don't just stay at the Ruby level. Remember that you can understand what goes on beneath your code, even if that code was written by someone else.
00:19:46.310
When wading through overwhelming amounts of output, look for landmarks and fine-tune your focus on recognizable elements. Planning ahead for tool access is crucial, especially for production systems, to mitigate potential emergencies. Finally, learn to find the boundaries between systems and always keep an eye out for tools that can provide you visibility into those layers. Thank you for your attention today, and I hope you are inspired to explore the provided repository and play with these tools—and don’t forget, there's a bonus case waiting for you!