What to do when Mongrel stops responding...

by Philippe Hanrigou

In this talk, Philippe Hanrigou addresses the challenges faced when Mongrel clusters stop responding, outlining his experiences and offering solutions for troubleshooting these instances.

Key points include:
- Introduction to Mongrel Clusters: Philippe shares his positive experiences with Mongrel but emphasizes that instances can stop responding, leaving developers uncertain about how to diagnose the problem.
- Background and Expertise: Philippe introduces himself as a Ruby troubleshooting expert and provides context regarding his work with Selenium and Ruby, as well as the project that spurred his discussion.
- Problem-Solving Methodology: When faced with production issues, Philippe highlights the importance of collaboration within the software community and of looking at how other language communities approach the same problems.
- Use of System Tools: Philippe introduces various system tools, particularly GDB (GNU Debugger) and DTrace, which can provide crucial insights into both Ruby applications and the underlying operating system.
- Demonstration of GDB: He performs a demonstration using GDB to diagnose issues in a Rails application, illustrating how to obtain backtraces and work through potential causes for a locked process.
- DTrace Overview: Philippe elaborates on DTrace, explaining its role in dynamically monitoring applications with negligible overhead when probes are disabled. He makes the case for using DTrace to observe application performance metrics, capture events, and pinpoint bottlenecks.
- Conclusion and Recommendations: Philippe emphasizes not waiting for production issues to appear and instead encourages regular engagement with system tools and utility scripts to enhance debugging skills.

Overall, the takeaway from Philippe's presentation is the necessity of familiarizing oneself with system-level tools to effectively handle application failures, specifically noting the importance of early detection and proactive measures.

00:00:12.799 Good afternoon. My name is Philippe Hanrigou, and I want to tell you about my Mongrel clusters. Most of the time, I have a wonderful relationship with my Mongrel clusters, but sometimes they just stop responding to me. They don't take my requests anymore, and in those cases I'm a little lost, often wondering how I should feel about it and what I can do about it. That's pretty much what I want to cover tonight.
00:00:36.180 So, what can you do when your Mongrel cluster is misbehaving and you cannot figure out what's going on? In case you're still wondering about my accent, yes, I'm French. I work for a company called ThoughtWorks, which specializes in agile consulting and exciting technologies like Ruby. I published a book with Addison-Wesley on the topic of Ruby troubleshooting, covering how you can leverage system tools like lsof, strace, and GDB in the context of Ruby.
00:01:00.600 You might also know me as the creator and main author of Selenium Grid. By the way, we should talk about how you write your Selenium tests, because there is definitely a way not to do it. Selenium Grid is a tool that lets you run your Selenium tests in a distributed manner across multiple machines so you can run your tests in a much shorter time. Instead of waiting three hours for your test suite to finish, you might only have to wait 10 minutes.
00:01:37.860 The journey that brings me here tonight dates back to about a year and a half ago. Patrick Farley, who gave the presentation on Ruby internals, was working with me and a couple of other ThoughtWorks colleagues on a pretty ambitious web services project that was dealing with a lot of databases and multiple web services, all under a pretty aggressive user load. The velocity was quite high until we started assessing the stability of our application and realized that our instances were getting stuck and freezing quite consistently, unfortunately at random times and for random requests.
00:02:26.400 Just looking at our logs and trying to instrument Ruby, we found ourselves unable to get anywhere close to identifying the problem or the root cause. This is when Pat Farley, who was the tech lead at the time, came to me and a colleague named Jason and said, 'Hey Philippe, Jason, we need to solve this stability problem quickly.' We said, 'Of course,' while thinking, 'We don't even know where to start.'
00:03:02.520 When you have a tough problem on your hands and you don't even know where to start, it's often good to ask for a little help from your friends. We are a big software community, and there are plenty of people who could help us. So we went around and asked our PHP counterparts. We even asked Java developers, whose tools are typically very powerful but sometimes quite heavyweight.
00:03:35.580 It turns out these folks couldn't really help us much. They said, 'You need to think about who you are as Ruby developers.' So let's think about who we are. We're nimble for sure, we're moving fast, we have style, we have values and principles, and, more importantly, we have the Force! So who can we be? Well, we can be Jedi!
00:04:00.660 The cool thing about being a Jedi Ruby developer is that life is easy, right? Whenever we encounter problems, we just use the Force. A little Ruby trick here, a little script in the console, checking the logs... and we solve our problems in no time. This approach usually works for us 99% of the time, but sometimes we find ourselves in real-world trouble, usually in production environments. Not only are we in trouble, but during these cases, we need to solve the problem quickly, and the entire team is counting on us.
00:04:35.080 Does that sound familiar? It’s a bit like Luke in his mission to destroy the Death Star. When you find yourself in such a situation, it’s good to remember that even a Jedi is never alone. Even while concentrating on your task, there is always someone in the background helping out. Do you remember R2D2? We have our own little R2D2 as Ruby developers.
00:05:01.920 We have someone who keeps our Ruby applications running all the time, even if we forget about it. It's called the operating system, Unix for most of us lucky ones. It turns out that there are a lot of system tools that Ruby developers might not know about, but they're well documented. Among these tools, there are many useful ones: lsof, iostat, strace, and whatever other tool you might name.
00:05:53.700 Something you might not know about is that there are tools that provide not only system level insights but also yield information about what’s happening in Ruby applications at the same time. This capability allows you to correlate information easily and see what’s going on. Tonight, I’ll concentrate mostly on two of these tools that give you insights into what’s happening in Ruby as well as in your system.
00:06:40.980 The first tool, unexpectedly, is GDB, the GNU Debugger. As a C debugger, GDB can tell you everything that is happening at the system level, like system calls and signals, and, since the Ruby interpreter is itself a C program, essentially everything happening in your Ruby interpreter as well. With a few clever macros, you can also use GDB to obtain Ruby backtraces, raise Ruby exceptions, or even evaluate arbitrary Ruby code, which is pretty cool!
00:07:29.580 You guys want to see that tonight? Let's try to get started.
00:08:01.680 Tonight, our demonstration will mostly happen in a terminal. I have a Rails application which I'm going to launch. Here we go. I'm going to call three actions in one controller. The first one, which I'm going to call right now, basically just sleeps and returns. As you can see, my action is working; my Mongrel cluster is happy and up and running. Then something is going to happen. Actually, let me start over, because I forgot one critical step.
00:09:36.840 Now, when you send a request to your Mongrel process, it suddenly doesn't respond anymore. By the way, it is never Mongrel that is the problem. Every time I investigate these troubleshooting problems in production, it's never Mongrel; it's always your code, Rails, your database, or your system. Mongrel is pretty solid in that regard. So when you find yourself in such a situation, let's try to use GDB to attach to the process. First, we need to know the process ID (PID). I've got my PID, and now I'm going to attach to it.
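A minimal sketch of those two steps, with a hypothetical PID and port:

```
# Find the PID of the stuck Mongrel instance
$ ps aux | grep mongrel_rails
philippe  4242  ...  mongrel_rails start -p 8000

# Attach GDB to the live process (this suspends it while you poke around)
$ sudo gdb -p 4242
```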
00:10:32.160 Now we're in GDB, the C debugger, so I can see what's going on at the system level. Let's try to get a backtrace. I've got my backtrace, but the full thing won't tell me much, because it is mostly Ruby interpreter internals at the C level. What's more interesting is the top of the stack, which shows where I'm stuck. If I repeat the action, it will be easier to see.
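The GDB side of that is just a couple of built-in commands:

```
(gdb) bt              # full C-level backtrace
(gdb) frame 0         # jump to the top frame, where the process is blocked
(gdb) info threads    # list any other threads in the process
```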
00:11:31.800 Ah, that's what I'm blocked on! I'm trying to get a lock on a file. I vaguely remember that I'm using a native gem called Blackmagic, which seems to be causing the issue, but which controller triggered it? Wouldn't it be cool to get the Ruby backtrace? Nothing Ruby-level shows up in this C-level backtrace. Let me go back to my process.
00:12:38.640 I’m going to kill the process here so you can see the output. Now, there is a pretty neat macro made by Mauricio Fernandez which allows you to evaluate Ruby code right from GDB, so let's test this out. That works! I’ll show you something else.
00:13:19.740 Maybe that would be a good way to get a backtrace. If I call 'caller,' sure enough, we have our Ruby backtrace. If I scroll up, I can see that I'm in my showcase controller, at line six, in an action whose job is to provide business value. By checking my code, I'd quickly realize this is the part that uses the Blackmagic gem. You might argue that it would be nice to know which file I'm trying to lock, and that would be a great job for lsof. But we'll leave that for another discussion.
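With the macros loaded, the whole trick boils down to something like this sketch (note that the expressions avoid spaces, since GDB splits macro arguments on whitespace):

```
(gdb) eval "1+1"         # sanity check: evaluate arbitrary Ruby in the process
(gdb) eval "caller"      # the Ruby backtrace of the request that is stuck
(gdb) eval "self.class"  # which object is currently executing
```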
00:14:19.740 Now, if I can evaluate arbitrary Ruby code, I can even evaluate 'caller,' which is pretty neat. But I can take it even further! I could count every single object in my Ruby instance, for instance. That's not super useful by itself, but you get the idea. On the same principle, I could check how many classes I have in my system and walk through ObjectSpace to see how many instances of each there are.
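For instance, a plain Ruby 1.8 one-liner walking ObjectSpace from inside GDB might look like this (again written without spaces because of GDB's argument splitting):

```
# Top five classes by live instance count
(gdb) eval "h=Hash.new(0);ObjectSpace.each_object{|o|h[o.class.to_s]+=1};h.sort_by{|_,n|-n}.first(5).inspect"
```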
00:15:00.780 One thing to keep in mind when attaching to a process with GDB: if I go back to my application and try to Ctrl-C it, I won't be able to exit. That's because GDB is holding on to your process, so first you need to detach from GDB. Now, a small side note: for GDB to perform its magic, you need access to the debugging symbols of your process.
00:15:44.760 Typically, this isn't an issue if you're on Linux or Solaris. However, the out-of-the-box Ruby interpreter on Mac OS X Leopard is not compiled with debugging information. If you want to perform these kinds of tricks there, you're better off compiling Ruby with the debugging flags enabled and using that build to troubleshoot your problems.
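A sketch of such a build, with an arbitrary install prefix:

```
# Compile Ruby with debugging symbols on and optimizations off
$ CFLAGS="-ggdb -O0" ./configure --prefix=/opt/ruby-debug
$ make && sudo make install
```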
00:16:57.160 Let's review a bit. First, let's give credit where it's due: all of these amazing macros were not made by me. I would love to take credit, but I can't. They were created mainly by Jamis Buck and Mauricio Fernandez. I'll share how you can access them later; they are really awesome because they effectively reimplement eval as GDB macros.
00:18:05.240 So, how can you get these macros? You can find them in various places on the internet, but the easiest way is to visit my website, ph7spot.com. On that page, you'll find a download link to get all the macros bundled together with additional documentation. Save the file as .gdbinit in your home directory.
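The downloaded file name below is hypothetical; the point is simply where it ends up:

```
# GDB reads ~/.gdbinit at startup, so the macros become always available
$ mv ruby-gdb-macros.txt ~/.gdbinit
```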
00:18:48.480 Now, what kind of macros can you find? One of the most important ones is 'rb_finish.' It doesn't perform any action by itself; it gets the process to a point where it is safe to run arbitrary code, because you might attach to your process in a state that isn't safe for that. Another macro is 'eval,' which lets you evaluate arbitrary Ruby code, especially useful with 'caller' when you're in trouble and want a stack trace.
00:19:49.920 There's also a way to capture the Ruby backtrace, not the C one from GDB. Reconstructing the Ruby backtrace doesn't always work, so when it fails, 'eval "caller"' followed by a 'raise' is a neat trick. That's a quick run-through of the macros available to you.
00:21:12.240 So, what can you do with DTrace, and what types of providers are available? If you're on Mac OS X Leopard or on Solaris, just run 'sudo dtrace -l,' and you will see all the providers available. There are many, covering everything from kernel-level providers, such as system calls, to networking and even language interpreters.
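For example:

```
# List every probe, then just the distinct provider names
$ sudo dtrace -l | head
$ sudo dtrace -l | awk 'NR > 1 { print $2 }' | sort -u
```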
00:23:25.440 Ruby developers can trace every significant event: DTrace gives you visibility from high-level application calls all the way down to the deepest kernel internals, which is incredibly powerful.
00:24:35.560 The most exciting part of DTrace is that you only need to learn one tool to monitor everything in your system. Instead of separate tools providing information in different formats and contexts, DTrace lets you correlate all that information. It gives you the ability to answer questions that are otherwise very hard to answer when investigating production issues.
00:26:55.240 GDB is a powerful tool for examining a frozen process; DTrace is the opposite: it captures the dynamic aspects of your applications. Think of it like motion capture; probes in your applications act like the sensors on a motion capture suit.
00:28:40.200 Let me ask you, how many people here are familiar with DTrace? Not many, it seems. And how many are aware of the D scripting language? Similar situation; a few hands raised. I’m going to cover a bit of introductory material to ensure everyone is on the same page.
00:30:40.200 DTrace works with what we call probes. A probe acts as a sensor placed in your code at points of interest. You can enable or disable these probes on demand, with essentially no performance impact when they are turned off. This means you can run DTrace in production with probes disabled until you need to gather information, which makes it incredibly efficient and practical.
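A one-liner gives the flavor; the probes fire only while the command is running:

```
# Count system calls per process name until you hit Ctrl-C
$ sudo dtrace -n 'syscall:::entry { @counts[execname] = count(); }'
```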
00:31:55.680 In terms of implementation, since DTrace is open to various systems and languages, providers can define specific probes for the areas you're interested in, and you write short D scripts to capture and analyze data from your system without hurting performance.
00:33:02.280 Because DTrace lets you listen to events in your applications, you can analyze performance metrics and see where bottlenecks are occurring. You can watch where time is spent across the various processes in your application, giving clear insight into potential issues.
00:34:24.000 The great thing about this process is that you have an opportunity to refine your understanding progressively. For instance, if you notice that a request is slow, you can investigate which methods and system calls are concurrently running, drilling down until you pinpoint the source of the delay.
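As a sketch of that drill-down, here is an example (hypothetical PID) that sums the time a process spends in each system call:

```
$ sudo dtrace -p 4242 -n '
  syscall:::entry  /pid == $target/ { self->ts = timestamp; }
  syscall:::return /self->ts/ {
    @ns[probefunc] = sum(timestamp - self->ts);  /* nanoseconds per syscall */
    self->ts = 0;
  }'
```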
00:35:50.520 DTrace provides you with both the visibility to notice issues early on and the tools to analyze them effectively. For instance, if you set up probes for specific actions in your Ruby application, you could actively observe its performance metrics.
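For instance, with a DTrace-enabled Ruby build (Leopard's system Ruby shipped with probes; stock builds may not have them), counting method calls per class and method looks roughly like this:

```
$ sudo dtrace -p 4242 -n '
  ruby$target:::function-entry {
    @calls[copyinstr(arg0), copyinstr(arg1)] = count();  /* class, method */
  }'
```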
00:37:15.840 In conclusion, once you've set up this monitoring and data-capture infrastructure, you will find it illuminating for performance optimization across all aspects of your Ruby applications. As you work through these scenarios in real time, you'll become adept at diagnosing issues before they escalate into major problems.
00:38:58.080 If you take nothing else from this presentation, please remember: don't wait for production issues to land you in hot water. Don't let the emperor take away your Mongrel clusters without training up your skills to handle problems before they arise. Regularly familiarize yourself with existing system tools in your daily work.
00:39:27.720 This includes tools like lsof and iostat, which can provide useful insights. If you get used to using them, you will find it far more manageable to troubleshoot problems during critical incidents, without the pressure of unfamiliarity.
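The kind of one-liners worth having in muscle memory:

```
$ lsof -p 4242   # open files, sockets, and locks held by a process
$ iostat 5       # disk and CPU activity, refreshed every five seconds
```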
00:40:44.400 Furthermore, I recommend installing the DTrace utilities; you get a lot of ready-made scripts right out of the box, which streamlines the setup process. Explore the available documentation and tutorials, as they give you meaningful guidance on how to use these tools effectively!
00:41:50.640 Thank you very much for your time! Now, if you have any questions, please feel free to ask.