00:00:21.050
Welcome to 'Scalable Observability', a handy guide for getting started with distributed applications.
00:00:26.160
I'm John Feminella, and I work for a company called Pivotal. I'm delighted to talk to you about it more.
00:00:32.850
However, that's not the focus of this talk. If you have questions or comments, feel free to tweet me, Slack me, or ask me afterwards.
00:00:40.379
You can find me in the RailsConf Slack.
00:00:48.750
Today, I want to discuss the fundamental building blocks of observability as applied to applications.
00:00:56.300
These are the kinds of applications you all probably work on, design, build, and operate.
00:01:01.350
When people hear the term 'observability', it might be confusing or new to some, or often conflated with monitoring or metrics.
00:01:09.780
I want to use this talk as an opportunity to clarify these ideas and show you how they can be applied to your own applications.
00:01:16.140
Whether you are managing one, ten, or hundreds of applications—some of which may be Rails, and some may not—I hope to provide tools to address the challenges we'll discuss.
00:01:28.259
I will recommend a specific approach that I think will be helpful if you're new to the topic of observability.
00:01:41.880
Finally, I'll conclude with actionable advice that I believe will be useful on this journey.
00:01:47.399
Let’s start with the challenges. When operating an application, one common question is whether it is healthy and meeting its intended performance.
00:01:58.080
Are we successfully processing concert ticket requests? Are we making widgets in the factory? Is the application doing what it’s supposed to do?
00:02:10.229
To answer these questions, it's tempting to measure everything we can.
00:02:17.670
For instance, we might measure disk I/O, CPU usage, and memory consumption.
00:02:30.180
Let’s say we have a system we want to understand better. We can gather various metrics to assess its health.
00:02:40.680
Imagine recording a hundred different metrics per instance of this application.
00:02:47.489
If there are three instances, that means we're recording the same hundred things for every instance.
00:02:56.300
To clarify, we’ll refer to one of those recorded metrics as a 'stream'. For every instance, if we sample that stream at one sample per second,
00:03:07.769
and record each sample at 24 bytes, that means we're generating 2,400 bytes per second per instance.
00:03:14.270
With a hundred instances, that totals about 240,000 bytes per second. Over a year, that translates to roughly 7.5 terabytes.
00:03:20.640
That seems like a hefty amount of data, and while it’s more than what you’d store on your laptop,
00:03:27.100
storing that in a cloud solution like S3 is relatively inexpensive—around two hundred bucks a year.
00:03:37.979
For enterprises and larger companies, the idea of storing everything seems favorable—especially for future diagnostics.
00:03:43.799
This raises a valid question: if we can measure everything cheaply, why wouldn’t we?
00:03:53.009
However, it's also essential to consider whether we are measuring the right things.
00:04:01.100
For example, in a Java application using Spring, you'll see various metrics endpoints populated.
00:04:06.589
You can easily add a gauge to measure ticket sales and track relevant events.
00:04:12.550
With Ruby applications, including those not using Rails, the runtime doesn't readily provide metrics out of the box.
00:04:20.480
There are various instrumentation and benchmarking tools available,
00:04:26.650
but Rails provides some unique choices. You have low-level tooling, such as ActiveSupport instrumentation,
00:04:33.289
which is about recording events within the Rails framework.
00:04:40.100
Then there are agent-based tools like New Relic that you add as a gem or binary to your environment.
00:04:46.089
Another option is rack middleware, which are libraries inserted into the middleware stack of your application.
00:04:52.059
Those are the three major types of tools available for measuring your applications.
00:04:58.939
But there's another question: even if we agree on what to measure, how do we actually measure it?
00:05:05.179
What tools should we use? This brings up another challenge, especially when measuring applications that are fundamentally different.
00:05:12.540
In a sample size of applications from Fortune 500 companies, we found that out of 1,326 applications, about ten percent were Rails apps.
00:05:20.169
Ninety-four percent of them don’t utilize custom metrics and largely accept the default metrics provided by their tools.
00:05:27.320
Ninety-two percent have never altered the default metrics, relying instead on out-of-the-box settings.
00:05:33.289
This presents a problem: despite our varied applications, we're not measuring vital distinctions.
00:05:39.239
There's significant variation in system topologies and footprints, and even in the number of instances of each application.
00:05:46.819
Consider a model that depends on numerous microservices. These dependencies reveal very different relationships between applications.
00:05:52.250
Moreover, variation in runtime versions adds complexity. The system deployed in staging could differ from production.
00:06:00.320
In the middle of a blue-green deployment, we may have multiple versions running simultaneously.
00:06:07.159
Does it make sense, then, to measure the same things and accept these defaults when each application’s characteristics differ?
00:06:18.600
Even worse, we humans can fall prey to various biases.
00:06:25.039
A common fallacy is to assume that correlation equates to causation. For instance, let’s consider a dataset.
00:06:32.570
If we graph two unrelated aspects, such as spelling test scores and shoe sizes, a correlation may appear.
00:06:39.490
A possible outcome is that individuals with smaller shoe sizes may have lower spelling scores.
00:06:46.930
However, this does not mean size causes spelling ability; age can be a hidden factor influencing both.
00:06:54.120
Moreover, consider a scenario analyzing cheese consumption against people becoming entangled in their bed sheets.
00:07:02.919
A correlation may appear, but the two are not causally linked.
00:07:11.200
Additionally, we often introduce speculative questions that don't lead to valid conclusions.
00:07:16.460
Scientific literature often explores variables that don’t contribute meaningfully to outcomes.
00:07:23.539
For instance, studies may suggest coincidence over actual causative effects.
00:07:31.400
The complexity of applications and their dependencies can lead to oversimplification.
00:07:38.040
Often we summarize application performance into simplistic metrics like CPU and memory usage.
00:07:44.660
Our observations are constrained when we reduce complexity to single values.
00:07:52.440
Let’s visualize a dataset that represents random points across a 0 to 100 range.
00:08:00.689
In a random dataset like this, expect close to zero correlation as values are unrelated.
00:08:07.300
However, if we view another dataset arranged in an identifiable shape, the correlation can appear different.
00:08:13.889
Despite having similar statistical properties, their visual depictions can convey entirely different messages.
00:08:21.150
This indicates how important it is to understand the overall shape of our data and avoid getting lost in singular metrics.
00:08:30.429
To truly grasp our systems, we need a comprehensive view rather than relying solely on simplified outputs.
00:08:37.820
Let’s explore some effective tools for achieving meaningful observability.
00:08:43.870
The first concept is that of a service level objective (SLO). This signifies a promise your application extends.
00:08:50.920
It details how the application will interact with users or upstream/downstream services.
00:08:55.180
Each SLO is typically broken down into three components: the SLI (Service Level Indicator), SLO (Service Level Objective), and SLA (Service Level Agreement).
00:09:01.089
An SLI indicates a desired state, such as aiming for 100 requests per second.
00:09:07.240
The SLO details how consistently you want that state upheld, such as having 99.9% of requests complete in under 5 seconds.
00:09:12.640
Failing to uphold an SLO could necessitate consequences, which are outlined in the SLA.
00:09:20.430
In the context of an Internet Service Provider (ISP), they may promise a certain bandwidth.
00:09:27.420
If that promise is broken, there are typically repercussions described in the SLA.
00:09:33.600
For your applications, understanding these components is critical because they clarify what promises your application is making.
00:09:40.440
Consider what is crucial for your application to function as expected.
00:09:45.810
It's not the trivial facts, such as using less than 90% memory,
00:09:50.150
but essential functionalities like providing patient medical records or processing ticket transactions.
00:09:56.340
This focus on SLOs helps prioritize monitoring efforts.
00:10:02.360
Observability can feel nebulous, but it represents our ability to understand a system based solely on its outputs.
00:10:10.304
For analogy, think of a printer. It’s a complex device with numerous moving parts and software dependencies.
00:10:16.309
When you send a print job, if the printer ceases to work, you need a way to ascertain the issue.
00:10:23.070
Basic outputs like error messages or screens indicating out-of-paper provide crucial information to resume normal operation.
00:10:30.850
The goal of observability is understanding and diagnosing when system commitments are unmet.
00:10:38.070
If we must attach a debugger or debug our application interactively, scaling issues arise.
00:10:46.410
To achieve effective observability, we must focus on events—the currency that conveys system behavior.
00:10:55.490
Your observability toolbox can include three key instruments, mixed and matched to evaluate if SLOs are met.
00:11:02.859
First, there are metrics that measure facts, alongside aggregate statistics describing system events.
00:11:11.079
Monitoring metrics is often enlightening—measuring CPU usage could reveal how busy your system is.
00:11:19.470
However, if CPU is not the primary bottleneck, it’s not as informative.
00:11:26.590
Many Rails applications experience more network I/O constraints, depending heavily on other services.
00:11:33.720
Metrics allow for flexible data slicing to reveal insights, like ticket price trends and purchasing behaviors.
00:11:40.520
The challenge remains in discerning important metrics, often leading to an alert fatigue.
00:11:46.800
Logging serves another purpose, allowing structured data to document events for better analysis.
00:11:55.250
Logs are straightforward, emit data at points of interest, and are helpful for debugging.
00:12:01.790
Unfortunately, they can become overly verbose when attempting to capture everything.
00:12:08.940
Another tool is tracing, which tracks events with causal ordering across services.
00:12:15.880
It’s particularly useful for multi-threaded or distributed applications, ensuring a clear understanding of processes.
00:12:24.200
Open tracing APIs are available for configuring these tools.
00:12:30.040
Rubicon, an open-zipkin framework, offers the middleware you can insert into your rails applications for easy tracing.
00:12:36.470
However, tracing particularly benefits applications interacting with multiple services.
00:12:43.930
In legacy monoliths, traces may offer limited utility since they often don’t include inter-service communication.
00:12:52.459
The next concept refers to the four golden signals, which are key metrics to monitor application health.
00:13:02.629
Initially outlined in Google’s SRE book, these signals guide your measurement priorities.
00:13:08.220
First, measure your traffic: the volume of requests your system handles.
00:13:17.120
Second, track latency: how long requests require to receive a response.
00:13:24.750
Differentiate between successful and failed requests while measuring latency.
00:13:32.040
The third signal measures saturation—how close your system is to its capacity.
00:13:39.050
Finally, track error rates—how frequently requests fail, both explicitly and implicitly.
00:13:46.930
Observing these signals offers valuable insights into your application’s health.
00:13:55.250
Now that we have covered how specific tools contribute to observability, let's look at practical applications.
00:14:01.900
The approach I recommend is examining varying topologies within distributed systems.
00:14:07.050
This can apply regardless of whether you’re dealing with Kubernetes pods or instances on Heroku.
00:14:15.180
Our first instinct might be to condense the complexity into a neat dashboard, but such oversimplification is misleading.
00:14:22.940
Simplified visuals may obscure the real messiness of distributed systems.
00:14:29.030
Don't rely on identical metrics across different applications; acknowledge the inherent complexity in what you monitor.
00:14:38.080
In determining what aspects deserve monitoring, consider questions surrounding system performance.
00:14:46.440
What are the right metrics to understand system health, and under which conditions do they operate best?
00:14:51.650
Understanding the specific application landscape helps identify what to measure.
00:14:58.390
Measuring demand indicates how much work the system processes, unique to each application’s function.
00:15:05.130
Measure how much demand is satisfied: if ten tickets per second is needed, how many are sold?
00:15:12.160
Assess efficiency to determine resource usage corresponding to output generation.
00:15:20.760
Establish capacity—does the system possess the resources required to meet demand?
00:15:29.440
In essence, thoroughly monitoring these four aspects augments your understanding of any system.
00:15:38.900
Ambiguity may arise from varied application setups, with unique workflow demands.
00:15:48.400
While it requires a deeper level of understanding, acting on these insights yields significant benefits.
00:15:54.830
Setting thresholds can become tricky; instead of relying solely on fixed metrics, consider percentile distributions.
00:16:01.900
Let’s visualize request latency as an instance—service times can vary across various requests.
00:16:06.790
When plotting latency data, you can discover critical values across various distributions.
00:16:12.640
Recognizing changes in percentiles over time can be far more revealing than fixed values.
00:16:20.800
Let’s evaluate a few scenarios. Suppose query volume increases by 20%, with efficiency remaining constant.
00:16:29.390
In this context, processing rates would mirror the incoming demand.
00:16:36.230
In response to this scenario, I wouldn’t raise an alert.
00:16:44.250
However, let's evaluate the second scenario where the volume increases, but efficiency drops.
00:16:51.800
If request volume rises by 20%, and resource usage spikes by 60%, this should signal concern.
00:16:58.830
Ideally, if your SLO sits at 50% more capacity than that, alerting here is warranted.
00:17:04.450
Notably, the third scenario may involve a demand rise that remains unaddressed—the 80% increase.
00:17:11.380
If during this instance you only process a marginal quantity while demand heightens, it's worth investigating.
00:17:16.290
Identifying errors should consistently extend beyond traditional server specs.
00:17:23.200
Reflecting on three Fortune 100 companies that successfully implemented our strategies reveals impressive outcomes.
00:17:31.780
They witnessed nearly a 98% drop in false positives across their applications.
00:17:39.220
Furthermore, these companies reported a staggering 1500% improvement in relevant alerts.
00:17:46.190
When notifications triggered, they corresponded to violations of SLOs, concerning measurable outcomes.
00:17:52.810
Moreover, these firms have now ambitiously expanded this approach to new systems, including legacy ones.
00:17:59.160
In summary, if you take away a few key aspects of observability, remember that it is not simply a silver bullet.
00:18:07.370
It requires a toolset—there’s no one-size-fits-all library that confers observability on its own.
00:18:13.410
Approaching observability necessitates a holistic understanding of existing systems and outlining what you’re measuring.
00:18:20.720
Understanding the SLOs becomes imperative for determining metrics to monitor.
00:18:27.810
If you lack clear promises to stakeholders, reconsider your application’s fundamental value.
00:18:34.070
However daunting this task may feel, with the earlier outlined strategies, you can foster better outcomes.
00:18:44.900
Thank you very much, and enjoy lunch at RailsConf!