Monitoring the Health of Your App

by Carl Lerche and Yehuda Katz

In this session at RailsConf 2013, Yehuda Katz and Carl Lerche discuss the importance of proactive application monitoring to maintain app health. They emphasize that most traditional tools merely react to problems after they occur, like a doctor diagnosing a heart attack after the fact. Instead, developers are encouraged to gather meaningful metrics to detect issues before they impact the user experience. Key points from the session include:

  • Collect Data: Emphasizing the need to measure anything that provides value, enabling quick validation of hypotheses about app performance.
  • Traffic Monitoring: Discussing the importance of tracking key business metrics such as revenue per minute to identify anomalies immediately after deploying changes.
  • Response Times: Why average response time alone is misleading, and how the median and standard deviation support more informed decisions about performance.
  • Long-Tail Distribution: Presenting research showing that web response times do not adhere to a normal distribution; instead, they exhibit long-tail behavior where outliers are more common than expected.
  • Real-World Implications: They shared a case study demonstrating how measuring metrics like the number of users adding items to the cart can pinpoint where issues are arising in the checkout process.
  • Performance Tools: Introduction of a monitoring tool, Skylight, designed to visualize app performance data using histograms to illustrate how response times are distributed across a range of interactions. This allows for deeper insights into where optimizations are needed rather than focusing solely on averages.

Ultimately, the presenters conclude that measuring every meaningful aspect of application interaction is essential to keeping an app healthy. By adopting this proactive approach to monitoring and measuring, developers can significantly enhance the user experience and debug more efficiently. Understanding and correctly interpreting metrics ultimately promotes better software development practices.

00:00:16.600 So, well again, thanks for coming! My name is Carl, I work at Tilde, and this is my esteemed colleague, Yehuda Katz. I don't believe he requires much of an introduction.
00:00:23.080 Today, we're going to talk about measuring your app. But before we get started, I just want to begin with a quick story.
00:00:29.800 This was my first job out of college. It was your basic e-commerce app. They had products for sale, you added them to your cart, checked out, and processed your order. Pretty basic stuff.
00:00:40.879 On the day in question, I was tasked with implementing a simple feature. I was only supposed to tweak the coupon page on the checkout process.
00:00:46.320 I thought it was an easy job, and I did it in just a couple of hours. I shipped it out, and QA accepted it. They had a couple of testers who’d bang on the app to ensure everything worked well for them—very scientific.
00:00:53.239 So, the app went out to production, I went home, and everything seemed fine. Then, I got a call from my boss at 8 PM. He said, "Hey, how's it going?" At that moment, I knew something was off because he never called me.
00:01:10.439 He told me he was looking at some numbers and that the sales seemed low. I thought, "Okay, sales seem low." He continued, "I was told you made some changes to the checkout process—I think you might have broken something." My boss was also the owner of the company. It wasn't too huge, but I could hear the stress in his voice.
00:01:27.600 I could imagine him at home refreshing sales reports every day, making sure nothing was broken. So, I got my computer out, thinking, "Alright, step one, let's see if I can reproduce the issue." No luck. I went through the entire process, and there was no problem.
00:01:42.640 It's late at night, and I decided the next step would be to revert my changes. Incidentally, this was a long time ago, and we didn't use source control, so that made reverting a bit tricky. It was a simple change, so I went by memory, reverted it, pushed it out, and sent an email to my boss, notifying him of the rollback.
00:02:00.240 Thirty minutes later, he called again. "Sales are still down," he said. Well, I hoped I had reverted correctly. I thought, "Something went wrong." Next, I decided to SSH into the server; that seemed like the obvious next step.
00:02:19.120 I added debug statements. At the time I was just learning Rails, but this job was a PHP shop, where adding debug statements was much easier: you could edit code on the server and it would reload automatically. Nevertheless, I couldn't find anything wrong. This took me about five hours of debugging.
00:02:39.839 It turned out that someone else had updated the Google Analytics tracking code that day, a minor JavaScript snippet. Having no source control made it complicated to track that change. On the checkout page, the JavaScript was interacting weirdly with our JavaScript in certain browsers.
00:02:57.000 As it happened, the problematic browser was Internet Explorer. I was young and didn’t think to check it. What lessons can we learn from this, besides the obvious one about using source control?
00:03:10.959 My boss had a hypothesis: sales were down. But nobody had a real way to validate that hypothesis quickly. He looked at some sales reports and kept hitting refresh multiple times. Meanwhile, I had my assumption that my coupon tweak was the culprit. Yet, I had no real way of validating that without diving into extensive debugging.
00:03:38.440 If I had been collecting even a little data beforehand, I think it would have made troubleshooting much easier. This is going to be the thesis of this talk. I hope nobody here would skip writing tests for their app; I'm going to convince you that you need to measure anything that matters in your app—anything that provides value.
00:04:06.560 This way, you can quickly detect when things break. Before we continue, let’s have a quick game: Which line is slow? On the next slide, I’ll show you a very basic Rails action with only two lines of code. This action is supposed to respond in about one second, and it’s up to you to determine which line is taking longer.
00:04:39.600 Here are the two lines. I’ll give you a minute to review them. So, who thinks line one? (One person raises their hand.) Alright, what about line two? (The vast majority raise their hands.) What does Yehuda Katz think?
00:05:03.639 I don't know. The first line of this action is 'users = User.where(admin: false)', and the action ends with 'render json: users.first'. So, in theory, what's happening here is that 'User.where' creates a scope, which is a lazy object. When you call 'first' on that lazy object, it's supposed to rewrite the SQL query to be efficient.
00:05:29.280 However, let me show you more. I'm telling you now that there is also code in the model: it defines 'self.where', overriding that method. So, knowing that, which line do you think is the culprit? (Audience response.) Line one? (Quiet response.) Line two? You're getting smarter.
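To make the puzzle concrete, here is a minimal sketch of what the action and model might look like; the class names and scaffolding are assumptions, since the slides only show the two action lines and the overridden where:

    # Hypothetical reconstruction of the slide example.
    class UsersController < ApplicationController
      def index
        # Line one: normally builds a lazy ActiveRecord::Relation.
        users = User.where(admin: false)
        # Line two: `first` would normally rewrite the SQL to add LIMIT 1.
        render json: users.first
      end
    end

    class User < ActiveRecord::Base
      # The twist: the model overrides `where`, so the relation is no
      # longer lazy and `first` can't optimize the query.
      def self.where(*args)
        super(*args).to_a
      end
    end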
00:05:55.440 Here’s the measurement: the first line takes 0.1% of the request time, while the second line takes 0.4%. So now we have some data: neither line is the problem, and we can keep looking.
00:06:22.159 Here we find a before_filter with a bit of code. Many of you have probably seen code like this before. This is a somewhat tongue-in-cheek example, but the point stands: I'm sure no one here knows every line of code running in their Rails app.
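The filter itself isn't reproduced here, but a sketch of the kind of innocuous-looking before_filter that can quietly dominate a request might be:

    class ApplicationController < ActionController::Base
      before_filter :load_recent_orders  # `before_action` in modern Rails

      private

      # Hypothetical filter: runs on every request and loads an entire
      # table into memory just to sort it, accounting for the ~99.5% of
      # request time the two action lines don't explain.
      def load_recent_orders
        @recent_orders = Order.all.to_a.sort_by(&:created_at).last(10)
      end
    end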
00:06:52.680 Even if you're the sole developer on that app, there are probably hundreds of gems you’ve pulled into it, and you don't know every line. This brings us to how we address performance and issues when things go wrong.
00:07:13.680 I was at CodeConf a few years ago and heard a talk by a gentleman named Coda Hale. What resonated with me was his statement that as developers, we aren't merely paid to code; we are paid to provide business value.
00:07:37.040 We're paid to write code, but that code must add features, improve speed, or contribute something that provides real value to the business. That means what really matters is what's happening in production, not how it runs faster on your development box.
00:08:09.720 How can we measure what’s actually happening? We need to have a correct mental model of how our code runs in production, and we achieve that by measuring.
00:08:29.440 So, what do we measure? The simple solution is to measure business value. We need to analyze our application's responsiveness and how users complete tasks that ultimately make us money.
00:08:55.680 Going back to my original story, the obvious thing I could have measured was revenue. This is probably the key metric for most businesses—unless you are a Silicon Valley VC-backed startup.
00:09:16.560 If I had a graph with a line showing dollars made per minute, I would have seen the drop-off in revenue after my changes. It could have been as straightforward as a visual cue, saying, "Oh, I deployed this change, and now the revenue is plummeting."
00:09:41.120 Moreover, we can take this data, save it, and use it as a reference to notice trends. We can create benchmarks to gauge expected revenue throughout the day and have tools to alert us anytime the current performance falls below our set thresholds.
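As a sketch of that idea (the Order model, the baseline numbers, and the alert mailer are all assumptions), a periodic job could compare the last minute's revenue against a time-of-day baseline:

    # Expected revenue per minute, keyed by hour of day (made-up baseline).
    EXPECTED_REVENUE = Hash.new(100.0)

    def check_revenue!
      actual   = Order.where(created_at: 1.minute.ago..Time.now).sum(:total)
      expected = EXPECTED_REVENUE[Time.now.hour]

      # Alert when revenue drops below half the expected baseline.
      if actual < expected * 0.5
        AlertMailer.revenue_drop(actual, expected).deliver
      end
    end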
00:10:06.880 Imagine if our boss never had to call us a few hours later about broken functionality; we would all be happier. A quick alert could notify us within five minutes of a deployment that something went wrong.
00:10:33.640 We need to go beyond a simple metric. If I know something is wrong and we’re not making money, I need to quickly make a decision on what’s broken. This requires an understanding of all the steps in the money-making process.
00:10:56.880 We should be tracking every step. For instance, let's track how many users add items to their shopping cart, how many items get added, and how many items get removed. Then we need to measure the number of users who check out and the number of users having successful orders.
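One way to instrument those steps (sketched here with the statsd-ruby gem; any metrics backend works, and the event names are made up) is to increment a counter at each point in the funnel:

    require 'statsd'
    STATS = Statsd.new('localhost', 8125)

    class CartsController < ApplicationController
      def add_item
        @cart.add_product(params[:product_id])
        STATS.increment('funnel.cart.item_added')   # items added
        redirect_to cart_path
      end

      def remove_item
        @cart.remove_product(params[:product_id])
        STATS.increment('funnel.cart.item_removed') # items removed
        redirect_to cart_path
      end
    end

    # Elsewhere: increment 'funnel.checkout.started',
    # 'funnel.checkout.succeeded', and so on for each step.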
00:11:23.640 I suspect you are using tools to measure this on the business side, but as developers, we also need tools to measure performance for ourselves. This would allow us to investigate the data when something goes wrong.
00:11:42.480 If it's important for your business, you should measure it. If it’s not measured, how do you know it’s working? Performance is also directly tied to business value.
00:12:12.320 You've probably heard studies from companies like Google and Amazon, which reveal how much response times matter. Google found that adding roughly half a second to search results caused about a 20% drop in traffic and revenue, while Amazon measured a 1% sales loss for every additional 100 milliseconds of latency.
00:12:30.560 Performance is critical for business as well. However, the good news is we already have tools to measure the average response time for our app, although there’s a caveat.
00:12:54.960 I need to venture into academia briefly. Many developers hear the term average response time and think they are getting useful information to understand how their app performs.
00:13:14.480 However, I’m here to tell you that relying solely on averages creates a misleading and broken mental model of what’s going on. Let's start from the basics: to calculate an average, you sum all the numbers in your list and then divide by the count.
00:13:51.679 That’s fairly straightforward and relatively easy to collect data on. In contrast to the average, let's discuss the median, which is more useful. The median is found by arranging numbers in order and identifying the middle number or the average of the two middle numbers.
00:14:15.279 This method isn't as easy to implement because you need to maintain all the numbers to calculate the median. Furthermore, the standard deviation is also important but harder to collect—it's a measure of how dispersed the numbers are in your data set.
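In Ruby, the three statistics for a small list of response times (the sample numbers are made up) might be computed like this:

    times = [42, 45, 47, 50, 52, 55, 480]  # ms; one long-tail outlier

    mean   = times.sum.to_f / times.size
    sorted = times.sort
    median = sorted[sorted.size / 2]  # middle element of an odd-length list
    stddev = Math.sqrt(times.map { |t| (t - mean)**2 }.sum / times.size)

    # mean is about 110 ms while the median is 50 ms: one outlier drags
    # the mean far above what a typical request actually experiences.

Notice that computing the median required keeping and sorting every sample, which is exactly why it is harder to collect than a running average.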
00:14:40.480 For instance, imagine we have a set of integers with a standard deviation of 4.96. The standard deviation measures how dispersed the values are around the mean: the farther you move from the mean, the less likely you are to hit any particular number.
00:14:59.679 Most people’s mental models are rooted in the bell curve. For example, let's say someone tells you the average response time is 50 milliseconds with a standard deviation of 10 milliseconds. Using that, you might think that 68% of your requests will be between 40 and 60 milliseconds.
00:15:29.120 Moreover, the assumption would be that as you move farther out, the likelihood of encountering a much longer response time diminishes. However, in the real world, it is often more complex, and the distributions don’t resemble normal distributions.
00:15:52.840 Take, for example, the height of people. If we say the average male height is 60 inches with a standard deviation of 3 inches, that means around 68% of all men are between 57 and 63 inches tall.
00:16:18.080 But what if I told you the average salary of workers in the U.S. is around $60,000 with a standard deviation of $30,000? Bell-curve intuition says there shouldn't be many very high earners.
00:16:44.080 However, salary is a long-tailed distribution: a small number of people earn far, far more than the mean, and those outliers occur much more often than a normal distribution would predict.
00:17:07.280 When we learn statistics, we adopt a mental model in which most values converge around the mean, with some tolerance at the edges. That model makes it easy to overlook vital information occurring beyond those limits.
00:17:27.840 If I tell you that the mean response time is 100 milliseconds with a standard deviation of 50 milliseconds, you'd expect 95% of requests to fall between 0 and 200 milliseconds, within two standard deviations of the mean.
00:17:46.960 However, the real world dictates that outliers exist. Under the bell-curve model, a 500-millisecond response would be eight standard deviations out and essentially impossible; in practice, it turns out to be not nearly as rare as we think.
00:18:08.640 This misunderstanding is hazardous because many web performance metrics don't follow normal distributions. Web response times typically present a long tail, with data extending far out to the right of the mean.
00:18:31.280 That skewed shape diverges from the normal distribution we expect, which is exactly why we need insight into the long tail and a better understanding of this reality.
00:18:52.960 Instead of the normal distribution, web responses behave more like long-tailed distributions. So, for example, with an average response time of 150 milliseconds and a standard deviation of 50, a 500-millisecond response time might actually be more common than we suspect.
00:19:14.960 Again, while this does not mean half of all requests take that long, it implies that a notable proportion will fall in that category.
00:19:36.560 To summarize, if your mental model is that things cluster around the mean, you're missing a large chunk of reality, and focusing solely on data near the mean means losing crucial insights into web response performance.
00:19:58.720 The same applies to understanding outliers when performance bottlenecks arise in your application. For instance, if I tell you that the 95th percentile is 500 milliseconds, that means one in every twenty requests is slower than half a second.
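A sketch of why keeping the whole distribution matters: sample from a long-tailed distribution and compare the mean with the 95th percentile (the distribution and its parameters here are made up):

    # Log-normal samples as a simple stand-in for long-tailed response times.
    samples = Array.new(10_000) do
      z = (1..12).map { rand }.sum - 6.0   # approximately standard normal
      (50 * Math.exp(0.8 * z)).round       # milliseconds
    end

    mean = samples.sum.to_f / samples.size
    p95  = samples.sort[(0.95 * (samples.size - 1)).round]

    # The mean lands near the bulk of fast requests (~70 ms here), while
    # p95 comes out far larger (~185 ms), and one request in twenty is
    # slower still.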
00:20:21.280 That count of slower requests should prompt you to act rather than dismiss them based on averages alone. We're developing a tool called Skylight, which focuses on accurately measuring distributions in a way that enhances your understanding of application performance.
00:20:51.040 Currently in MVP mode, Skylight aims to replace outdated methods, providing fast, practical insights into the distribution of performance metrics. Observing actual response distributions is an essential part of understanding how your application performs.
00:21:22.160 It's worth noting that a tool showing real data doesn't depict the idealized picture derived from averages or expectations alone. You can see where your slowest requests actually fall relative to the fast ones, which lets you make better decisions.
00:21:42.680 As we analyze a large number of requests, we may discover that certain requests demand more attention, leading to informed decisions about where optimizations need to take place.
00:22:04.280 It's essential to highlight the importance of understanding response distribution rather than relying on a simple average. In real-world statistics, averages often create misconceptions when navigating performance issues.
00:22:24.960 To ensure accurate metrics, Skylight separates fast requests from slow ones and filters out noise such as garbage-collection pauses, rather than collapsing everything into a single average response time.
00:22:50.560 This has resulted in more informed handling of slow requests. In short, analysis of complete distributions leads to a more thorough understanding of your application.
00:23:11.560 Lastly, we are excited to be using real data to drive improvements. Feedback from real users shifts our focus when rectifying issues, highlighting the importance of accurate metrics.
00:23:30.560 Real-world insights can help highlight patterns that guide future decisions about application performance. Bringing about a better product requires understanding what’s truly happening.
00:23:56.880 If you're ready to be part of the journey, we encourage you to visit our booth. Your feedback will ultimately shape Skylight and elevate how we measure application performance. Thank you for joining us.