RailsConf 2017

What’s my App *Really* Doing in Production?

by Daniel Azuma

In the talk titled "What’s my App Really Doing in Production?", presented at RailsConf 2017 by Daniel Azuma, the focus is on understanding, measuring, and troubleshooting the behavior of Ruby on Rails applications in production environments. Azuma emphasizes the importance of gathering real-time data to address issues that arise when an application is under load, as user behavior can reveal unexpected code paths and performance challenges. Key points from the talk include:

  • Importance of Measurement: Throughout his career, Azuma has recognized that effective measurement and data analysis are fundamental in making informed decisions about application performance.
  • Challenges in Production: He shares an anecdote about a traffic spike his application faced unexpectedly due to national television exposure, illustrating the chaos that can ensue when systems are unprepared for sudden load increases.
  • Instrumentation Techniques: Azuma discusses different ways to instrument Rails applications for monitoring, such as utilizing the Active Support Notifications API, which allows developers to inject measurement code directly into the application code:
    • Examples of instrumentation include measuring cache hits and misses, and tracking latency for requests.
  • Keeping Performance Penalties Low: Implementing monitoring should be done carefully without introducing significant latency or affecting application behavior. Techniques such as sample data collection and focusing on critical areas of the application are recommended.
  • What to Measure: He underscores the need to track response sizes, error rates (including 400-level errors), rendering times, and interactions with external systems like databases and APIs. All of these metrics provide insight into the application's health and performance.
  • Avoiding Side Effects: Monitoring should not alter application behavior; seemingly harmless measurement code can introduce side effects, such as triggering extra database queries.
  • Instrumenting for Actionable Data: Emphasizing the need for actionable data, Azuma encourages start-ups and seasoned developers to adopt continuous measurement processes as an established practice to preemptively mitigate issues.

In conclusion, Azuma calls for developers to adopt measurement practices early in their application lifecycle, encouraging them to experiment with instrumentation and monitor the health of their applications continuously to avert major production headaches in the future.

00:00:12 Okay, well, welcome everyone. Thanks for coming. I know this is the end of a really long day, the last day of the conference. We're all a little tired, and we do have to save our energy for Aaron's keynote. It's always a lot of fun, so don't miss it. But I'm glad you're here; we have some important things to discuss.
00:00:39 So to start off, who is this person up here? My name is Daniel. I've been a Ruby developer for about 12 years or so, though I've been working with other languages for longer than that. I've spent most of that time in early-stage startups, working in really small companies. Currently, however, I work at a different kind of company, though it's also a small company.
00:01:08 As a quick note, I'm not here speaking as a representative of my employer. This is my talk, and my views are my own and not those of my employer. That said, my employer does own my code, and for whatever reason, that includes examples on the slides. I should mention that the code samples in this talk are copyrighted by Google and licensed under Apache 2.0.
00:01:31 With that out of the way, I must say I was really impressed by David’s opening keynote. I liked what he was saying about the importance of belief systems—those underlying values that should anchor and guide our decision-making. It's crucial to recognize what those values are and how they define us. I really appreciated his message.
00:02:00 One thing it made me think about was the values I've absorbed from all the various small companies, and even some that weren't so small, where I've worked. One common theme across all these organizations, big or small, especially the good ones, has been the importance of measurement and data.
00:02:24 I remember my first Rails startup back in early 2007. We launched our first Rails site using Rails 1.0, which were the good old days. A few months after launching, our CEO made an appearance on national television for something totally unrelated to our company.
00:02:57 During this interview, he accidentally mentioned the name of our site on national TV. It was completely unplanned, but since the show was quite popular at the time, as you might guess, within a few minutes, I got a page from our monitoring system, informing me that our site started to get really sluggish.
00:03:26 I looked at our logs and saw a multiple order of magnitude traffic spike. This was both a good and bad thing, but back in 2007, we didn’t have the nice auto-scaling cloud services that we have now; we were on physical servers in a colocation facility at the time.
00:03:50 So, I spent the next couple of hours logged into those servers, trying everything I could think of to increase our capacity. I was tweaking our load balancers, disabling features, and spinning up more Mongrel servers. We were running on Mongrel 1.0 at the time, which was considered state-of-the-art back then.
00:04:20 Sadly, nothing really worked; we were just extremely laggy. Eventually, after a few hours, traffic settled down a bit and the site started to recover, but in the meantime, we were flailing around in the dark.
00:04:51 Later, when we did a post-mortem investigation, we determined that our Mongrel processes were actually memory constrained. We had these in-memory caches within each Mongrel process, and as traffic went up, these caches got squeezed, which caused thrashing. My attempts to fix the problem by spinning up more Mongrels only exacerbated the memory issues.
00:05:30 The worst part was that I couldn't even tell this was happening. We had a caching strategy that we thought would work, but I didn’t have accurate data on its behavior in production, making it difficult to respond appropriately when we faced a production issue.
00:06:01 It's crucial to have real data on what your app is actually doing in production under load, whether you're just starting out, as we were, or whether you're one of the largest, most successful companies with the biggest products in the industry.
00:06:36 At the small company I work for now, we measure everything. If you're just starting to launch your first app, you also need to measure everything. So, how do you do that? First, the obvious: there are many excellent services for system and application monitoring.
00:07:00 For example, this screenshot shows Stackdriver, which, full disclosure, is developed by my current small employer. However, there are numerous other companies and products that are also very good. If you’ve spent time in the exhibit hall, you’ve probably met some of them. Performance monitoring services and error monitoring—who here uses some kind of monitoring service in production?
00:07:30 Okay, that's good! It seems like a majority. Generally, those services do a great job of collecting data and providing visualization and analysis tools. That said, there will be times when you need to customize your measurements, as a general-purpose tool might not provide everything you need out of the box.
00:07:53 This afternoon, we're going to take a peek under the hood at some of the techniques these monitoring services use to instrument your application. You will see how to perform your own monitoring or customize the existing monitoring for the services you’re using to fit your application’s needs.
00:08:20 Here’s what we’ll cover: we'll learn techniques for instrumenting your app. We'll look at how to gather data without disrupting your app’s behavior in production. Finally, we'll discuss what types of things you should be measuring.
00:08:57 To start, let’s talk about instrumentation. This is an instrument—a vintage electroretinograph machine, used to diagnose a variety of retinal problems. Modern versions of these instruments are a lot smaller and perhaps a little less intimidating, but old or modern, these machines share one commonality: they must have electrodes in direct contact with your eyes in order to take measurements.
00:09:25 A patient is usually given anesthetic eye drops, which can be a little scary but are necessary. Similarly, when collecting data from a running application, measurements need to be in direct contact with the relevant code at the point involved, which is the job of an instrumentation API.
00:09:50 An instrumentation API gives you the ability to inject measurement code at key points in your application and collect the resulting data from the running process. In Rails apps, that API is ActiveSupport::Notifications.
00:10:14 To see how this API works, let’s examine an example. Remember my imploding cache? How would you know if your cache is working as expected? In my case, caches were running out of space. When that happens, you’d typically see a lot of cache misses. Your cache hit rate is a good indicator of the health of your cache.
00:10:42 Let’s measure it by using notifications to count the cache hits and misses. Whenever Rails reads from a cache, it calls the method 'ActiveSupport::Notifications.instrument'. This call wraps the caching code, measuring how long the read took and recording other relevant information, such as the cache key and whether the read resulted in a hit or a miss.
00:11:00 Rails already does this for us: it instruments its cache code, allowing you to subscribe to the measurement. You do this by calling 'ActiveSupport::Notifications.subscribe', providing the name of the measurement you're interested in and a block. Whenever that measurement is taken, notifications will call your block, allowing you to work with the measurement data.
00:11:35 In this example, all we're doing is logging whether the cache hit or missed. This might yield log data like this over time. We can run simple tools like Grafana to analyze that data and gain useful insights. Rails instruments numerous things for you in addition to cache reads, so often all you have to do is subscribe to them.
00:12:06 You can also instrument your own code, which is particularly important when writing your own Rails plugins. It's a good idea to instrument your code so that applications using your plugins can measure its activity and performance.
00:12:54 So far, we’ve received notifications for all our cache hits and misses, and we have an overall measurement of the cache hit rate. This is already interesting data, and it’s easy to obtain with just a few lines of code. However, to make this data more useful, we sometimes need to gather a bit of context.
00:13:20 Let’s look at another old medical instrument: a vintage X-ray machine from around 1900. At that time, X-ray machines were little more than a tube of radioactive material, and it was quite risky. Fortunately, nowadays, when we X-ray, we are careful and specific about targeting only the part of the body that we need to measure.
00:13:57 This is something you also want to do when instrumenting your app. Most apps feature numerous controllers and actions. I spoke with someone a few days ago whose company had a monolith with around 400 controllers. Most of us don't have applications that large, but you often want to focus your measurements on particular controllers or even on specific actions.
00:14:32 What if I wanted to measure cache hit rates for a particular action? One option is to turn notifications on at the start of the action and turn them off after, but here's the problem: notifications are global. They apply to all threads at once, which could include threads running other requests. Many of us now run multi-threaded web servers, so this won't work.
00:15:04 So, here’s the technique we can use: let’s start with our existing cache subscriber. Right now, it logs all cache reads, regardless of which action is being executed. We can determine which action is running by subscribing to a different measurement called the 'start processing' event.
00:15:34 This measurement occurs at the start of processing a request by Action Controller. It captures various information, including the controller and action being executed. We can use this to decide when to take a cache measurement.
00:16:00 To do this, we need to communicate this information to our cache read block. Since we can't use a global variable, ActiveSupport provides thread-local module attributes (via 'thread_mattr_accessor'), which can hold a different value per thread. This is a convenience wrapper around Ruby's thread-local variables.
00:16:36 Using this pattern, we can now communicate between our subscriber blocks on a per-request basis. There are some caveats; these are still global variables, even if thread-scoped, so use them carefully. Some web servers, such as Puma, will reuse threads across multiple requests.
00:17:00 Make sure that this data is cleaned up or reset between requests. Although it's not a perfect solution, it suffices for this purpose. Now we have a technique for measuring cache hit data for a specific action.
00:17:38 Here’s another interesting instrument from around 1960, where the subject wears contact lenses with miniature lamps to capture eye movements and reflexes. This device combines data from eye measurements and machine motion to study visual perception mechanics.
00:18:06 Similarly, when instrumenting your application, it's beneficial to gather multiple sources of data. For instance, understanding your cache hit rate is useful, but you should also know how much value a cache hit brings to request latency.
00:18:34 Here’s the code we've just discussed. We're determining which action is running and measuring whether it’s a cache hit or miss while logging that information. We can obtain the request latency in several ways.
00:19:01 For example, we can subscribe to the 'process action' event, which is taken at the end of an action by Action Controller. This event provides details about the action's outcomes, including the HTTP result and latency.
00:19:30 In this case, we add a subscriber to the process action event to log the request latency. Thus, we now log both the cache hit or miss and the latency.
00:19:51 To enhance this data's usefulness, we should combine these two pieces of information and log them together. Once again, we do this by using a thread-local attribute to pass information between subscribers.
00:20:12 Now, those two data points are logged together, and when this runs in production, our log might show results indicating a clear latency benefit from cache hits, while also showing that our cache hit rates could be improved.
00:20:36 With actual, useful information collected, we can then take action to improve our cache performance. As my PM friends at my small company would say, we now have actionable data.
00:21:03 Having walked through an example using notifications, there are several other instrumentation APIs that might be useful. One of the simplest ways to instrument an app is to use controller filters.
00:21:33 For instance, a simple 'around' filter can measure action latency. This is an easy way to collect basic information about requests as a whole. You can also write a Rack middleware.
00:22:00 This is useful for measuring the behavior or latency of other middleware components; you can insert your middleware anywhere in the middleware stack. Additionally, you can use this method to install instrumentation code in other frameworks, such as Sinatra, Padrino, or Hanami.
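A latency-measuring Rack middleware can be a plain Ruby class. This sketch depends only on the call/env contract, which is why it works under Rails, Sinatra, Padrino, or Hanami alike; the injectable logger argument is our own addition for illustration:

```ruby
# Measures wall-clock latency around the rest of the middleware stack.
class LatencyMiddleware
  def initialize(app, logger: ->(msg) { puts msg })
    @app = app
    @logger = logger
  end

  def call(env)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round(1)
    @logger.call("#{env['REQUEST_METHOD']} #{env['PATH_INFO']} -> #{status} in #{ms}ms")
    [status, headers, body]
  end
end
```

In a Rails app you would install it with `config.middleware.use LatencyMiddleware`, choosing its position in the stack depending on which other middleware you want included in the measurement.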
00:22:28 Finally, there’s the TracePoint API, which I consider the 'sledgehammer' of all instrumentation APIs. It's part of the Ruby virtual machine, and it provides event triggers for debugging.
00:22:56 While TracePoint can be powerful and is often used in debugging, our small company uses it extensively to build a cloud-based debugger product, as it allows instrumentation down at the source code level.
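To make the 'sledgehammer' concrete, here is a small TracePoint sketch that counts method calls into a single class while a block runs. It fires on every Ruby-level method call, which is exactly why it is too heavy for routine monitoring:

```ruby
# Count :call events into klass, only while the given block runs.
def count_calls_to(klass)
  count = 0
  tp = TracePoint.new(:call) do |event|
    count += 1 if event.defined_class == klass
  end
  tp.enable { yield }  # enabled only for the duration of the block
  count
end

class Greeter
  def hello
    "hi"
  end
end

greeter = Greeter.new
CALLS = count_calls_to(Greeter) { 3.times { greeter.hello } }
```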
00:23:29 However, it's not something you would usually employ for monitoring due to its complexity. This leads to an important issue: safety. Here's another vintage instrument, a brainwave-measuring device from around 1940, which raises some obvious safety concerns.
00:24:02 When instrumenting in production, it's critical to gather measurements without altering your app's behavior unintentionally. It’s essential to minimize latency effects while ensuring safety in your measurements.
00:24:40 Here are some tips: isolate and spotlight the interesting use cases. Oftentimes, you don’t need data from the entire application but from just a few relevant controllers or actions. Isolate those and measure them.
00:25:04 I encourage you to experiment with new instrumentation techniques, gathering new pieces of data. However, always circle back and evaluate your measurements; if they aren't providing useful information, don’t hesitate to remove them.
00:25:30 Approach instrumentation like tests—some should live on indefinitely because they monitor critical systems, while others may only be valuable for a brief period, particularly if they were part of a single investigation.
00:25:59 Don't leave unnecessary measurements lingering, as they might slow you down just like lengthy tests might. Make judgment calls on whether or not to keep or remove them. You can often get away with measuring only one in a hundred instances or one in thousands.
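Sampling can be as simple as a random gate in front of the log call. This sketch is our own construction, not from the talk's slides; at a rate of 0.01 it records roughly one line in a hundred:

```ruby
# Keeps only ~rate of the lines it is given, to keep measurement overhead low.
class SampledLogger
  attr_reader :lines

  def initialize(rate:, rng: Random.new)
    @rate = rate
    @rng = rng
    @lines = []
  end

  def log(line)
    @lines << line if @rng.rand < @rate
  end
end

logger = SampledLogger.new(rate: 0.01, rng: Random.new(42))
10_000.times { |i| logger.log("request #{i}") }
```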
00:26:34 Trace points, in particular, can be dangerous because many of their events can fire extremely often. Remember, they can track every move to the next line in the source code, so use them only when you lack better options.
00:27:07 If you must employ TracePoint, be aware that it defaults to global behavior. However, an alternative TracePoint API exists that allows you to instrument just one thread at a time; the catch is it’s only accessible via a C API.
00:27:37 Use asynchronous methods for sending measurements, especially when reporting them remotely. Logging directly to the file system is usually fast, but as your app scales, consider sending data to an analytics service.
00:28:02 When doing so, make sure those calls don't block. API gems for monitoring services typically offer non-blocking clients. If you're making plain HTTP calls yourself, be aware that Net::HTTP posts are blocking.
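The non-blocking reporting pattern can be sketched with a queue and a background thread; the sender block here stands in for whatever client call actually ships data to your monitoring service:

```ruby
# Pushes measurements onto a queue; a background thread ships them, so the
# request thread never waits on network I/O.
class AsyncReporter
  def initialize(&sender)
    @queue = Queue.new
    @worker = Thread.new do
      while (event = @queue.pop)  # nil sentinel ends the loop
        sender.call(event)        # e.g. an HTTP POST to a metrics service
      end
    end
  end

  def report(event)
    @queue << event  # returns immediately
  end

  def shutdown
    @queue << nil
    @worker.join
  end
end
```

Real client gems add batching, retries, and bounded queues on top of this basic shape, but the principle is the same: the request path only pays the cost of a queue push.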
00:28:29 Another essential aspect of safety in instrumentation is avoiding side effects. Changes to your app's behavior can stem from your monitoring, so on one hand, it's essential to ensure measurement does not modify application states.
00:28:56 Some calls that you think are harmless could invoke parts of your system that wouldn't otherwise run, especially database queries. Therefore, you must be cautious when calling methods on Active Record models, as some may trigger additional database calls.
00:29:27 We've discussed numerous ways to measure, and we have a few more minutes to address what specific things to measure. Consider this early electrocardiograph machine from 1901—it was the first practical machine for measuring the heart's electrical activity.
00:30:02 Back then, sensors would capture signals from patients. ECGs are still widely used today, not just to diagnose heart conditions but also to monitor seriously ill patients—the heart often reveals issues before they become critical.
00:30:37 Therefore, what are the indicators for your application? What can indicate when things aren't healthy, or when something unexpected is occurring, or has critically gone wrong? You should measure various response metrics.
00:31:13 Begin with the responses you’re sending back from requests. Ensure they align with your expectations. One valuable metric to monitor is the size of your responses; ensure they stay within acceptable ranges.
00:31:41 Pay attention to error responses. While 500-level errors are clearly problematic, don’t overlook 400-level errors either. Early in my career, I often ignored 404 errors because they seemed like client-side issues, yet knowing about them can signal broken links or hacking attempts.
00:32:06 Additionally, intermediate results can also offer valuable insights. Keep an eye on rendering times; Rails templates aren’t the fastest, and rendering can significantly affect app latency.
00:32:34 For instance, I once worked with an API that occasionally ran exceedingly slow, and upon further investigation, we realized that the JSON serialization was consuming most of the latency—some odd interaction between the data structure and the JSON library we used.
00:32:59 So if behavior seems irregular, can you identify the causes? Measure interactions with external systems, including your database and APIs. While many monitoring services will capture metrics about external dependency performance, they may miss usage insights.
00:33:32 Make sure you track usage against your assumptions—are you hitting your caches as often as expected? Finally, remember that errors and exceptions are essential to track. Never discard error logging; this includes exceptions you handle internally.
00:34:06 An often overlooked metric is your retry logic. Many systems implement retry mechanisms for flaky network requests, but if you do this, ensure to log that information. If there's a spike in retry rates, it indicates an issue that warrants attention.
00:34:36 We've covered a lot, and we've only scratched the surface. I hope to have communicated the significance of measuring. My small company embraces this practice habitually. Getting started is straightforward.
00:34:53 Subscribe to an ActiveSupport notification—it's just a few lines of code. Measure something and log it. Many commercial products available offer free trials; go check one out. However you choose to do it, start measuring—this data will be interesting and could save you from headaches later.