00:00:16.490
As he said, I'm Anthony. I do site reliability at Fleet Erode, and today I'm going to talk about how to get started with site reliability for Rails applications. There's a lot to cover, it can be complicated, and it doesn't always make sense at first glance. We'll try to make sense of some of these concepts together.
00:00:38.460
As you begin to work on the reliability of your sites, you might find yourself becoming more productive. When developers first create an app and its schema, they usually start with something essentially empty; running the app at that point produces very little.
00:00:48.290
Take Twitter and GitHub as examples; there wasn't much there at their inception. They started empty, but over time, they evolved into reliable platforms. Reliability is crucial; oddly enough, its highest state is often found in the simplest form of the application.
00:01:07.729
Complexity arises quickly as more functionality is added. Initially, the focus is on financial viability, trying to build something desirable enough to attract paying customers. During this phase, reliability is secondary.
00:01:21.550
However, as more customers start using the application and its functionality expands, the equilibrium between site reliability and financial viability begins to change. Eventually, users will expect the application to function well without significant delays or disruptions.
00:01:48.880
The complexity only increases with components like web views, mobile apps, APIs, databases, and various services like Elasticsearch and Redis. This creates a "cloud of unknowing," making it difficult to see performance issues. The phrase comes from an anonymous 14th-century mystical text, and it illustrates the problem well: as we work on reliability, a lack of visibility keeps coming up as the central challenge.
00:03:08.019
Metrics give us visibility into what is happening within our systems. It's essential that we understand these metrics to separate the reliable from the unreliable components. The temptation to build our own metric systems when reliability issues arise is common, but starting with established APM (Application Performance Management) tools is advisable.
00:03:53.790
As we utilize these tools, we gain insights that allow us to better understand the complexities of our systems. One of my favorite pieces of advice for software developers is that understanding is crucial. It's essential to figure out what's happening, as that understanding enables us to take meaningful actions.
00:04:14.010
When reliability becomes an issue, our immediate focus should be on deploying metrics into the systems. We need to establish service level indicators (SLIs), which are specific metrics that reflect response times and availability.
00:04:56.180
From these SLIs, we can create service level objectives (SLOs), which define what we want those metrics to achieve, such as achieving a 95th percentile request time of under two seconds, or ensuring our application's availability exceeds 99.9%.
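An SLO like the one above can be checked with a few lines of Ruby. This is a minimal sketch, assuming response times are collected in seconds; `percentile` uses the nearest-rank method, and the method names are illustrative.

```ruby
# Nearest-rank percentile: the value below which pct% of samples fall.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.length).ceil - 1
  sorted[rank.clamp(0, sorted.length - 1)]
end

# Does the p95 response time meet the 2-second objective?
def slo_met?(response_times, pct: 95, threshold: 2.0)
  percentile(response_times, pct) <= threshold
end
```

With real data, these values would come from an APM tool rather than an in-process array, but the arithmetic is the same.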
00:05:36.330
Once we establish these measurements and objectives, we can formulate a service level agreement (SLA) with our customers. This could include metrics around availability and performance indicators. Internally, defining various SLIs can help shape our services and meet expectations.
00:06:10.130
It's critical to monitor availability closely since this metric resonates most with our customers. We typically estimate availability in terms of 'nines,' such as 99.9%, which correlates with the number of minutes our system might be down.
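The "nines" translate directly into a downtime budget. A quick sketch of that arithmetic, assuming a 30-day month:

```ruby
MINUTES_PER_MONTH = 30 * 24 * 60  # 43_200 minutes in a 30-day month

# How many minutes of downtime a given availability target allows per month.
def downtime_budget_minutes(availability)
  (1 - availability) * MINUTES_PER_MONTH
end

# 99.9%  -> about 43 minutes of downtime a month
# 99.99% -> about 4.3 minutes
```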
00:06:35.090
However, merely responding to pings for availability doesn't satisfy user needs; thus, switching to a request-based availability metric provides a clearer picture of what users experience.
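A request-based availability metric is simply the fraction of real requests that succeeded, rather than whether a health-check ping answered. A minimal sketch:

```ruby
# Request-based availability: successful requests over total requests.
def availability(total, failed)
  return 1.0 if total.zero?
  (total - failed).fdiv(total)
end
```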
00:06:57.600
Understanding the performance of our applications often requires digging through historical data to see where issues intersected with user requests. APM tools help with time travel in a sense, allowing us to clarify what happened during a spike in demand.
00:07:43.860
By recording event streams for every request and monitoring exceptions, we can recreate contexts around performance spikes. Snapshots taken at intervals allow us to build a reference point against which we can view issues over time, enabling a more holistic analysis.
00:09:21.480
However, issues related to spikes can be difficult to debug, especially when they muddy logs with too much information. By taking regular snapshots of database statistics, we can analyze queries without losing sight of broader performance trends.
00:10:21.960
Understanding performance begins with measuring and grappling with various types of errors within our applications. Asynchronous job performance, for instance, requires different metrics than web requests. Measuring wait times, service times, and response times allows us to see where jobs fit into our performance frameworks.
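The three job timings distinguished here relate to each other simply: response time is wait time plus service time. A sketch, assuming each job records its enqueue, start, and finish times (the field names are illustrative):

```ruby
JobTiming = Struct.new(:enqueued_at, :started_at, :finished_at) do
  def wait_time     = started_at - enqueued_at   # time spent queued
  def service_time  = finished_at - started_at   # time spent executing
  def response_time = finished_at - enqueued_at  # total, as the user sees it
end
```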
00:11:37.790
Every job must have its service level objectives. Monitoring elements like queue times and service durations provides insights into job performance, allowing us to strategically address the longest-running tasks that create customer dissatisfaction.
00:12:35.920
The metrics discussed apply universally to measuring job duration as well as web request performance. While averages can suggest an overall picture, they can mislead, reflecting satisfactory performance when issues persist behind the scenes.
00:13:45.230
Percentiles serve as a more reliable indicator for performance evaluation; they help us detect where outlier issues may suggest larger systemic problems. Additionally, measuring the contribution of various endpoints to the overall load helps clarify where efficiency improvements can yield significant gains.
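A small illustration of an average hiding a bad tail: with the data below, the mean stays under a 2-second target even though 5% of requests take 30 seconds, which the p99 exposes immediately.

```ruby
times = Array.new(95, 0.1) + Array.new(5, 30.0)  # mostly fast, a slow tail
mean  = times.sum / times.size                   # stays under 2 seconds
p99   = times.sort[(0.99 * times.size).ceil - 1] # 30.0 — the real story
```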
00:16:21.700
Apdex is a fascinating metric for measuring user satisfaction. By classifying interactions as satisfying, tolerating, or frustrating, we can compute a single score across all user interactions that summarizes the user experience. Identifying the interactions causing the most dissatisfaction lets us prioritize improvements effectively.
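The standard Apdex formula is short enough to sketch inline: satisfied interactions count fully, tolerating ones count half, and frustrated ones count zero.

```ruby
# Apdex = (satisfied + tolerating / 2) / total interactions
def apdex(satisfied, tolerating, frustrated)
  total = satisfied + tolerating + frustrated
  return 1.0 if total.zero?
  (satisfied + tolerating / 2.0) / total
end
```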
00:17:50.449
Similarly, tracking "agony scores" can help us manage user dissatisfaction by applying specific weights to user interactions. Such metrics make it easier to prioritize the most critical pain points users face within our applications.
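One hypothetical weighting, not a standard formula: score each endpoint by how many frustrated requests it produced times how far its p95 overran the threshold. The name, weights, and parameters here are illustrative assumptions.

```ruby
# Hypothetical "agony" score: frustrated-request count weighted by how far
# the endpoint's p95 exceeds the threshold. Weights are an assumption.
def agony_score(frustrated_count, p95_seconds, threshold_seconds = 2.0)
  overage = [p95_seconds - threshold_seconds, 0.0].max
  frustrated_count * overage
end
```

Ranking endpoints by such a score surfaces the pain points worth fixing first.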
00:20:18.890
As we consider the constructs of timeouts in our systems, we should recognize they act as safety nets against poor performance. However, timeouts should not serve as an acceptable solution while we work towards improving the speed and reliability of our applications.
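A timeout as a safety net can be sketched with Ruby's standard `Timeout` module: it bounds how long a slow dependency can hold a request, but the slowness itself still needs fixing.

```ruby
require "timeout"

# Run a block with an upper bound; degrade gracefully instead of hanging.
def with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  :timed_out
end
```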
00:21:08.130
Utilizing APM tools allows for monitoring various query metrics efficiently. They also shed light on potential performance issues tracing back to permissions or functionality that may not be apparent at surface levels. A cross-sectional view of error occurrences across many requests can reveal underlying slowdowns.
00:22:36.340
Once we identify errors, we need to address them effectively. Each error should be counted per occurrence and categorized by type. Addressing the highest-volume errors will yield the most considerable impact on overall system reliability.
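Counting per occurrence and ranking by volume is a one-liner in Ruby; this sketch assumes error class names have been collected as strings.

```ruby
# Tally errors by name so the highest-volume ones surface first.
def top_errors(error_names)
  error_names.tally.sort_by { |_name, count| -count }
end
```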
00:24:06.030
We must approach error reporting with diligence. Ignoring errors is not a constructive strategy; instead, we should investigate thoroughly, ensure understanding of the underlying issue, and act to resolve it appropriately. When measuring availability, consider the tolerance for failure we allow within our systems.
00:25:29.600
The concept of an error budget allows us to maintain a balanced approach toward reliability and feature updates. When monitored effectively, the error budget can provide a solid framework for deciding between delivering new features and improving existing infrastructure.
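The arithmetic behind an error budget is a direct consequence of the SLO: with a 99.9% availability target, 0.1% of requests may fail before the budget is exhausted. A minimal sketch:

```ruby
# Remaining error budget for a period; negative means the budget is blown
# and reliability work should take priority over new features.
def error_budget_remaining(slo, total_requests, failed_requests)
  allowed = (1 - slo) * total_requests
  allowed - failed_requests
end
```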
00:26:15.170
Combining APM tracing capabilities with dashboards can uncover production issues. By investing time in understanding the different metrics, we help our teams prioritize tasks better and address critical areas first.
00:27:02.530
Alerts should be layered intelligently. Different levels of alerting can indicate the urgency of issues, allowing our teams to prioritize response efforts effectively and preventing burnout from frequent, less significant notifications.
00:28:54.740
Filtering our metrics with custom attributes can be incredibly useful. This allows our support teams to gain insight tailored to individual queries, enhancing the customer experience, while also aiding our development teams in performance analysis.
00:30:04.330
Communication must remain a central focus as we improve site reliability. Regular updates on individual metrics and insights into efforts can create an informed organizational culture around reliability initiatives.
00:31:10.640
Continuous improvement in our understanding of system performance requires an ever-evolving strategy. Identifying gaps in data visibility should be an ongoing assessment in our technical backlogs, ensuring we never become complacent in our pursuit of reliability.
00:32:42.040
Thank you everyone for your time and attention.