RailsConf 2019

Scalable Observability for Rails Applications

by John Feminella

In his talk at RailsConf 2019, John Feminella presents 'Scalable Observability for Rails Applications,' focusing on how to effectively monitor distributed applications. He emphasizes the importance of understanding observability, distinguishing it from traditional monitoring and metrics. Here are the key points discussed throughout the talk:

  • Understanding Observability: Feminella explains that observability is about knowing the state of a system from its outputs rather than just raw metrics like CPU or memory usage.

  • Common Challenges: He highlights common pitfalls in application monitoring, such as measuring irrelevant metrics that do not illuminate application health, or drowning in alert notifications that turn out to be false positives.

  • Service Level Objectives (SLOs): He introduces the concept of SLOs, which are promises applications make regarding performance (e.g., processing requests in a timely manner). There are three components: Service Level Indicators (SLIs), SLOs, and Service Level Agreements (SLAs).

  • Tools for Observability: Feminella outlines three tools that should be part of any observability strategy:

    • Metrics: Help quantify aspects of system performance or health.
    • Logging: Provides a stream of events for debugging and understanding system issues.
    • Tracing: Captures the order of events to provide context about system operations.

  • The Four Golden Signals: He discusses four critical metrics to measure: traffic, latency, saturation, and errors, stressing that understanding these can significantly enhance application monitoring.

  • Customized Metrics: He argues against a one-size-fits-all approach to metrics, encouraging teams to establish what is specific to each application based on its unique demands and context.

  • Practical Examples: Drawing from experience with Fortune 100 companies, Feminella shares anecdotal evidence showing that using a more thorough observability strategy can reduce false positives and improve response relevance to real issues.

  • Key Takeaways: Feminella concludes that observability requires a comprehensive toolkit, a focus on meaningful metrics relative to application promises, and iteration based on real-world experiences to achieve effective monitoring results. His advice is to start with understanding what each application does and to measure what matters based on the demands on those applications.

This talk serves as a guide for developers to build healthier and more observable Rails applications, leading to fewer unnecessary alerts and better overall system health.

00:00:21.050 Welcome to 'Scalable Observability', a handy guide for getting started with distributed applications.
00:00:26.160 I'm John Feminella, and I work for a company called Pivotal; I'd be delighted to tell you more about it.
00:00:32.850 However, that's not the focus of this talk. If you have questions or comments, feel free to tweet at me, Slack me, or ask me afterwards.
00:00:40.379 You can find me in the RailsConf Slack.
00:00:48.750 Today, I want to discuss the fundamental building blocks of observability as applied to applications.
00:00:56.300 These are the kinds of applications you all probably work on, design, build, and operate.
00:01:01.350 When people hear the term 'observability', it might be confusing or new to some, or often conflated with monitoring or metrics.
00:01:09.780 I want to use this talk as an opportunity to clarify these ideas and show you how they can be applied to your own applications.
00:01:16.140 Whether you are managing one, ten, or hundreds of applications—some of which may be Rails, and some may not—I hope to provide tools to address the challenges we'll discuss.
00:01:28.259 I will recommend a specific approach that I think will be helpful if you're new to the topic of observability.
00:01:41.880 Finally, I'll conclude with actionable advice that I believe will be useful on this journey.
00:01:47.399 Let’s start with the challenges. When operating an application, one common question is whether it is healthy and meeting its intended performance.
00:01:58.080 Are we successfully processing concert ticket requests? Are we making widgets in the factory? Is the application doing what it’s supposed to do?
00:02:10.229 To answer these questions, it's tempting to measure everything we can.
00:02:17.670 For instance, we might measure disk I/O, CPU usage, and memory consumption.
00:02:30.180 Let’s say we have a system we want to understand better. We can gather various metrics to assess its health.
00:02:40.680 Imagine recording a hundred different metrics per instance of this application.
00:02:47.489 If there are three instances, that means we're recording the same hundred things for every instance.
00:02:56.300 To clarify, we’ll refer to one of those recorded metrics as a 'stream'. For every instance, if we sample that stream at one sample per second,
00:03:07.769 and record each sample as 24 bytes, then a hundred such streams generate 2,400 bytes per second per instance.
00:03:14.270 With a hundred instances, that totals about 240,000 bytes per second. Over a year, that translates to roughly 7.5 terabytes.
00:03:20.640 That seems like a hefty amount of data, and while it’s more than what you’d store on your laptop,
00:03:27.100 storing that in a cloud solution like S3 is relatively inexpensive—around two hundred bucks a year.
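The arithmetic is easy to sanity-check. Here is a quick Ruby sketch of the numbers from the talk (storage cost is left out, since it depends on the storage class you choose):

```ruby
streams_per_instance = 100  # metrics recorded per instance
bytes_per_sample     = 24   # one sample per stream, once per second
instances            = 100

bytes_per_second = streams_per_instance * bytes_per_sample * instances
# => 240_000 bytes/second across the whole fleet

bytes_per_year = bytes_per_second * 60 * 60 * 24 * 365
puts (bytes_per_year / 1e12).round(2)  # => 7.57 terabytes, matching the talk
```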
00:03:37.979 For enterprises and larger companies, the idea of storing everything seems favorable—especially for future diagnostics.
00:03:43.799 This raises a valid question: if we can measure everything cheaply, why wouldn’t we?
00:03:53.009 However, it's also essential to consider whether we are measuring the right things.
00:04:01.100 For example, in a Java application using Spring, you'll see various metrics endpoints populated.
00:04:06.589 You can easily add a gauge to measure ticket sales and track relevant events.
00:04:12.550 With Ruby applications, including those not using Rails, the runtime doesn't readily provide metrics out of the box.
00:04:20.480 There are various instrumentation and benchmarking tools available,
00:04:26.650 but Rails provides some unique choices. You have low-level tooling, such as ActiveSupport instrumentation,
00:04:33.289 which is about recording events within the Rails framework.
00:04:40.100 Then there are agent-based tools like New Relic that you add as a gem or binary to your environment.
00:04:46.089 Another option is rack middleware, which are libraries inserted into the middleware stack of your application.
00:04:52.059 Those are the three major types of tools available for measuring your applications.
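As a concrete sketch of the first, low-level option: ActiveSupport::Notifications lets you emit and subscribe to named events inside a Rails app. The event name, payload, and `order` model below are hypothetical, not from the talk:

```ruby
# In an initializer: subscribe to a custom event. The block receives the
# event name, start/finish times, a unique id, and the payload.
ActiveSupport::Notifications.subscribe("purchase.tickets") do |name, start, finish, _id, payload|
  Rails.logger.info("#{name} took #{(finish - start).round(3)}s for order #{payload[:order_id]}")
end

# At the point of interest: wrap the work you want to measure.
ActiveSupport::Notifications.instrument("purchase.tickets", order_id: order.id) do
  order.complete!  # `order` is a hypothetical model
end
```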
00:04:58.939 But there's another question: even if we agree on what to measure, how do we actually measure it?
00:05:05.179 What tools should we use? This brings up another challenge, especially when measuring applications that are fundamentally different.
00:05:12.540 In a sample of 1,326 applications from Fortune 500 companies, we found that about ten percent were Rails apps.
00:05:20.169 Ninety-four percent of them don’t utilize custom metrics and largely accept the default metrics provided by their tools.
00:05:27.320 Ninety-two percent have never altered the default metrics, relying instead on out-of-the-box settings.
00:05:33.289 This presents a problem: despite how varied our applications are, we're not measuring the things that distinguish them.
00:05:39.239 There's significant variation in system topologies and footprints, and even in the number of instances of each application.
00:05:46.819 Consider an application that depends on numerous microservices. These dependencies reveal very different relationships between applications.
00:05:52.250 Moreover, variation in runtime versions adds complexity. The system deployed in staging could differ from production.
00:06:00.320 In the middle of a blue-green deployment, we may have multiple versions running simultaneously.
00:06:07.159 Does it make sense, then, to measure the same things and accept these defaults when each application’s characteristics differ?
00:06:18.600 Even worse, we humans can fall prey to various biases.
00:06:25.039 A common fallacy is to assume that correlation equates to causation. For instance, let’s consider a dataset.
00:06:32.570 If we graph two unrelated aspects, such as spelling test scores and shoe sizes, a correlation may appear.
00:06:39.490 A possible outcome is that individuals with smaller shoe sizes may have lower spelling scores.
00:06:46.930 However, this does not mean size causes spelling ability; age can be a hidden factor influencing both.
00:06:54.120 Moreover, consider a scenario analyzing cheese consumption against people becoming entangled in their bed sheets.
00:07:02.919 A correlation may appear, but the two are not causally linked.
00:07:11.200 We also pose speculative questions of our data and then treat whatever patterns emerge as conclusions.
00:07:16.460 Even the scientific literature often explores variables that don't contribute meaningfully to outcomes.
00:07:23.539 With enough variables, some will correlate purely by chance, suggesting causation where there is only coincidence.
00:07:31.400 The complexity of applications and their dependencies can lead to oversimplification.
00:07:38.040 Often we summarize application performance into simplistic metrics like CPU and memory usage.
00:07:44.660 Our observations are constrained when we reduce complexity to single values.
00:07:52.440 Let’s visualize a dataset that represents random points across a 0 to 100 range.
00:08:00.689 In a random dataset like this, expect a correlation close to zero, since the values are unrelated.
00:08:07.300 However, another dataset whose points are arranged in an identifiable shape can produce nearly the same summary statistics.
00:08:13.889 Despite having similar statistical properties, their visual depictions convey entirely different messages.
00:08:21.150 This indicates how important it is to understand the overall shape of our data and avoid getting lost in singular metrics.
00:08:30.429 To truly grasp our systems, we need a comprehensive view rather than relying solely on simplified outputs.
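To make that concrete, here is a minimal Ruby sketch with synthetic data: a noise cloud and a ring have nearly the same correlation coefficient, yet look nothing alike when plotted:

```ruby
def pearson(xs, ys)
  mx = xs.sum.to_f / xs.size
  my = ys.sum.to_f / ys.size
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx  = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy  = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

# Uniform noise: a featureless cloud with correlation near zero.
noise = Array.new(500) { [rand * 100, rand * 100] }

# A ring: unmistakable shape on a scatterplot, yet correlation is also near zero.
ring = Array.new(500) do
  angle = rand * 2 * Math::PI
  [50 + 40 * Math.cos(angle), 50 + 40 * Math.sin(angle)]
end

puts pearson(*noise.transpose).round(2)  # ~ 0.0
puts pearson(*ring.transpose).round(2)   # ~ 0.0: same statistic, very different data
```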
00:08:37.820 Let’s explore some effective tools for achieving meaningful observability.
00:08:43.870 The first concept is that of a service level objective (SLO). This signifies a promise your application extends.
00:08:50.920 It details how the application will interact with users or upstream/downstream services.
00:08:55.180 This promise is typically expressed in three parts: the SLI (Service Level Indicator), the SLO (Service Level Objective), and the SLA (Service Level Agreement).
00:09:01.089 An SLI is the indicator you actually measure, such as how many requests per second you're serving.
00:09:07.240 The SLO details how consistently you want that state upheld, such as having 99.9% of requests complete in under 5 seconds.
00:09:12.640 Failing to uphold an SLO could necessitate consequences, which are outlined in the SLA.
00:09:20.430 In the context of an Internet Service Provider (ISP), they may promise a certain bandwidth.
00:09:27.420 If that promise is broken, there are typically repercussions described in the SLA.
00:09:33.600 For your applications, understanding these components is critical because they clarify what promises your application is making.
00:09:40.440 Consider what is crucial for your application to function as expected.
00:09:45.810 It's not incidental facts, such as staying under 90% memory usage,
00:09:50.150 but essential functions like serving patient medical records or processing ticket transactions.
00:09:56.340 This focus on SLOs helps prioritize monitoring efforts.
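As a minimal sketch of checking that kind of SLO after the fact; the `durations` sample is hypothetical data standing in for response times pulled from your metrics store:

```ruby
# Hypothetical sample of response times, in seconds.
durations = [0.4, 1.2, 0.9, 4.8, 0.3, 6.1, 0.7]

THRESHOLD_SECONDS = 5.0
TARGET            = 0.999  # 99.9% of requests under the threshold

fraction_ok = durations.count { |d| d < THRESHOLD_SECONDS } / durations.size.to_f
puts fraction_ok >= TARGET ? "SLO met" : "SLO violated"
```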
00:10:02.360 Observability can feel nebulous, but it represents our ability to understand a system based solely on its outputs.
00:10:10.304 As an analogy, think of a printer. It's a complex device with numerous moving parts and software dependencies.
00:10:16.309 When you send a print job, if the printer ceases to work, you need a way to ascertain the issue.
00:10:23.070 Basic outputs like error messages or screens indicating out-of-paper provide crucial information to resume normal operation.
00:10:30.850 The goal of observability is understanding and diagnosing when system commitments are unmet.
00:10:38.070 If we must attach a debugger or poke at the application interactively to understand it, that approach won't scale.
00:10:46.410 To achieve effective observability, we must focus on events—the currency that conveys system behavior.
00:10:55.490 Your observability toolbox can include three key instruments, mixed and matched to evaluate if SLOs are met.
00:11:02.859 First, there are metrics: point-in-time facts and aggregate statistics that describe events in your system.
00:11:11.079 Monitoring metrics is often enlightening—measuring CPU usage could reveal how busy your system is.
00:11:19.470 However, if CPU is not the primary bottleneck, it’s not as informative.
00:11:26.590 Many Rails applications experience more network I/O constraints, depending heavily on other services.
00:11:33.720 Metrics allow for flexible data slicing to reveal insights, like ticket price trends and purchasing behaviors.
00:11:40.520 The challenge remains in discerning which metrics are important, which often leads to alert fatigue.
00:11:46.800 Logging serves another purpose, allowing structured data to document events for better analysis.
00:11:55.250 Logs are straightforward: you emit data at points of interest, and they're helpful for debugging.
00:12:01.790 Unfortunately, they can become overly verbose when attempting to capture everything.
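One way to keep logs analyzable rather than verbose is to emit structured events. A minimal sketch using only the Ruby standard library; the event fields are hypothetical:

```ruby
require "json"
require "logger"
require "time"

logger = Logger.new($stdout)
logger.formatter = proc do |severity, time, _progname, msg|
  # One JSON object per line, so log aggregators can index individual fields.
  JSON.generate({ level: severity, time: time.utc.iso8601 }.merge(msg)) + "\n"
end

logger.info(event: "ticket.purchase_failed", order_id: 123, reason: "card_declined")
```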
00:12:08.940 Another tool is tracing, which tracks events with causal ordering across services.
00:12:15.880 It’s particularly useful for multi-threaded or distributed applications, ensuring a clear understanding of processes.
00:12:24.200 OpenTracing-compatible APIs are available for wiring these tools into your applications.
00:12:30.040 OpenZipkin's Ruby tracer, for example, offers middleware you can insert into your Rails applications for easy tracing.
00:12:36.470 However, tracing particularly benefits applications interacting with multiple services.
00:12:43.930 In legacy monoliths, traces may offer limited utility, since there is little inter-service communication to capture.
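For applications that do talk to other services, here is a sketch of the middleware approach using the zipkin-tracer gem; the option names follow my reading of that gem's documentation and may differ across versions:

```ruby
# config/application.rb -- verify option names against the gem version you install.
require "zipkin-tracer"

config.middleware.use ZipkinTracer::RackHandler, {
  service_name:  "tickets-api",         # how this app is labeled in traces
  json_api_host: "http://zipkin:9411",  # assumed address of a Zipkin collector
  sample_rate:   0.1                    # trace roughly 10% of requests
}
```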
00:12:52.459 The next concept refers to the four golden signals, which are key metrics to monitor application health.
00:13:02.629 Initially outlined in Google’s SRE book, these signals guide your measurement priorities.
00:13:08.220 First, measure your traffic: the volume of requests your system handles.
00:13:17.120 Second, track latency: how long requests require to receive a response.
00:13:24.750 Differentiate between successful and failed requests while measuring latency.
00:13:32.040 The third signal measures saturation—how close your system is to its capacity.
00:13:39.050 Finally, track error rates—how frequently requests fail, both explicitly and implicitly.
00:13:46.930 Observing these signals offers valuable insights into your application’s health.
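A minimal Rack middleware sketch recording three of the four signals per request (saturation is usually read from the platform rather than per request); `stats` stands in for a hypothetical statsd-style client with #increment and #timing methods:

```ruby
class GoldenSignals
  def initialize(app, stats)
    @app   = app
    @stats = stats  # hypothetical statsd-style client
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000

    @stats.increment("traffic.requests")                   # traffic
    @stats.timing("latency.request_ms", elapsed_ms)        # latency
    @stats.increment("errors.responses") if status >= 500  # explicit errors

    [status, headers, body]
  rescue
    @stats.increment("errors.exceptions")  # implicit errors (raised exceptions)
    raise
  end
end

# In a Rails app: config.middleware.use GoldenSignals, MyStatsClient.new
```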
00:13:55.250 Now that we have covered how specific tools contribute to observability, let's look at practical applications.
00:14:01.900 The approach I recommend works across the widely varying topologies you find in distributed systems.
00:14:07.050 This can apply regardless of whether you’re dealing with Kubernetes pods or instances on Heroku.
00:14:15.180 Our first instinct might be to condense the complexity into a neat dashboard, but such oversimplification is misleading.
00:14:22.940 Simplified visuals may obscure the real messiness of distributed systems.
00:14:29.030 Don't rely on identical metrics across different applications; acknowledge the inherent complexity in what you monitor.
00:14:38.080 In determining what aspects deserve monitoring, consider questions surrounding system performance.
00:14:46.440 What are the right metrics to understand system health, and under which conditions do they operate best?
00:14:51.650 Understanding the specific application landscape helps identify what to measure.
00:14:58.390 Measuring demand indicates how much work the system processes, unique to each application’s function.
00:15:05.130 Measure how much demand is satisfied: if ten tickets per second are requested, how many are actually sold?
00:15:12.160 Assess efficiency: how many resources are consumed for the output produced.
00:15:20.760 Establish capacity—does the system possess the resources required to meet demand?
00:15:29.440 In essence, thoroughly monitoring these four aspects augments your understanding of any system.
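For a hypothetical ticket-selling service, those four quantities might look like this sketch; every input is an assumed value standing in for your own metrics:

```ruby
# All inputs are hypothetical readings from your own metrics.
tickets_requested_per_sec = 10.0  # demand: how much work arrives
tickets_sold_per_sec      = 9.0   # satisfaction: how much of it completes
cpu_cores_in_use          = 4.0
max_sellable_per_sec      = 25.0  # what the current footprint could handle

efficiency = tickets_sold_per_sec / cpu_cores_in_use           # output per unit of resource
headroom   = max_sellable_per_sec - tickets_requested_per_sec  # capacity versus demand
puts "efficiency=#{efficiency} tickets/core-sec, headroom=#{headroom} tickets/sec"
```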
00:15:38.900 These quantities look different for every application, since each has its own workloads and demands.
00:15:48.400 While it requires a deeper level of understanding, acting on these insights yields significant benefits.
00:15:54.830 Setting thresholds can become tricky; instead of relying solely on fixed metrics, consider percentile distributions.
00:16:01.900 Let's take request latency as an example: service times vary widely across requests.
00:16:06.790 When plotting latency data, you can identify the critical values of the distribution.
00:16:12.640 Recognizing changes in percentiles over time can be far more revealing than fixed values.
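A minimal sketch of computing latency percentiles from a window of samples; the `latencies` array is hypothetical data in seconds:

```ruby
# Hypothetical window of request latencies, in seconds.
latencies = [0.2, 0.4, 0.3, 1.8, 0.5, 0.6, 2.9, 0.4, 0.7, 5.2]

def percentile(sorted, pct)
  rank   = (pct / 100.0) * (sorted.size - 1)
  lo, hi = sorted[rank.floor], sorted[rank.ceil]
  lo + (hi - lo) * (rank - rank.floor)  # linear interpolation between ranks
end

sorted = latencies.sort
p50, p95, p99 = [50, 95, 99].map { |pct| percentile(sorted, pct) }
puts "p50=#{p50} p95=#{p95} p99=#{p99}"
# Watch how p95/p99 drift over time rather than alerting on one fixed value.
```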
00:16:20.800 Let’s evaluate a few scenarios. Suppose query volume increases by 20%, with efficiency remaining constant.
00:16:29.390 In this context, processing rates would mirror the incoming demand.
00:16:36.230 In response to this scenario, I wouldn’t raise an alert.
00:16:44.250 However, let's evaluate the second scenario where the volume increases, but efficiency drops.
00:16:51.800 If request volume rises by 20%, and resource usage spikes by 60%, this should signal concern.
00:16:58.830 Even if you still have headroom against your SLO, resource usage growing three times faster than demand means efficiency has dropped, and alerting here is warranted.
00:17:04.450 The third scenario involves a demand rise that goes unmet: demand increases by 80%,
00:17:11.380 but you process only marginally more than before; that gap is worth investigating.
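Those scenarios reduce to comparing growth rates. A sketch, where the inputs are hypothetical week-over-week aggregates and the 1.5x and half-of-demand thresholds are illustrative assumptions, not from the talk:

```ruby
# Hypothetical week-over-week aggregates.
requests_before, requests_now   = 1000.0, 1200.0  # demand up 20%
cpu_before, cpu_now             = 50.0, 80.0      # resource use up 60%
processed_before, processed_now = 1000.0, 1050.0  # throughput up only 5%

demand_growth     = (requests_now - requests_before) / requests_before
resource_growth   = (cpu_now - cpu_before) / cpu_before
throughput_growth = (processed_now - processed_before) / processed_before

# Scenario 2: resources growing much faster than demand means efficiency fell.
puts "alert: efficiency drop" if resource_growth > demand_growth * 1.5

# Scenario 3: demand growing while throughput stays flat means demand goes unmet.
puts "alert: unsatisfied demand" if demand_growth > 0.2 && throughput_growth < demand_growth / 2
```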
00:17:16.290 Detecting errors should consistently go beyond traditional server-level signals.
00:17:23.200 Reflecting on three Fortune 100 companies that successfully implemented our strategies reveals impressive outcomes.
00:17:31.780 They witnessed nearly a 98% drop in false positives across their applications.
00:17:39.220 Furthermore, these companies reported a staggering 1500% improvement in relevant alerts.
00:17:46.190 When notifications fired, they corresponded to actual SLO violations with measurable consequences.
00:17:52.810 Moreover, these firms have since ambitiously expanded this approach to more systems, including legacy ones.
00:17:59.160 In summary, if you take away a few key aspects of observability, remember that it is not simply a silver bullet.
00:18:07.370 It requires a toolset—there’s no one-size-fits-all library that confers observability on its own.
00:18:13.410 Approaching observability necessitates a holistic understanding of existing systems and outlining what you’re measuring.
00:18:20.720 Understanding the SLOs becomes imperative for determining metrics to monitor.
00:18:27.810 If you lack clear promises to stakeholders, reconsider your application’s fundamental value.
00:18:34.070 However daunting this task may feel, with the earlier outlined strategies, you can foster better outcomes.
00:18:44.900 Thank you very much, and enjoy lunch at RailsConf!