Site Reliability on Rails

by Anthony Crumley

In the talk titled "Site Reliability on Rails" presented by Anthony Crumley at Birmingham on Rails 2020, the speaker discusses the intersection of Rails app development and site reliability, providing insights into enhancing the reliability of web applications. Crumley emphasizes the increasing complexity of applications and the corresponding necessity for effective site reliability practices that can evolve alongside application development.

Key Points Discussed:
- Starting with Rails Apps: The presentation begins by highlighting how many web applications, like Twitter and GitHub, start simple but quickly grow in complexity as new features and capabilities are added.
- Metrics and Visibility: Crumley stresses the importance of deploying metrics to better understand application performance. Tools mentioned can provide the visibility necessary to detect issues early, particularly in complex production environments.
- Service Level Agreements (SLAs): An explanation of SLAs as contracts with customers that are built on Service Level Indicators (SLIs) and Service Level Objectives (SLOs) helps define expectations around the application's performance.
- Understanding Performance Issues: The speaker introduces concepts such as event streams and snapshots for tracking performance over time, allowing developers to identify spikes and issues through historical data.
- Metrics for Job Performance: He delves into how different performance metrics apply to web requests and background jobs, noting that each job requires its own SLO due to variability in processing times.
- User Satisfaction Metrics: Crumley discusses application performance metrics like Apdex and Agony Score to help prioritize improvements based on user satisfaction and dissatisfaction.
- Error Budgeting: The idea of maintaining an error budget is advocated, allowing teams to balance feature development with the need to fix errors.
- Communication and Dashboards: The necessity of clear communication among team members about reliability improvements is highlighted. He suggests using dashboards and regular updates to foster visibility and ownership among the entire team.

Conclusions & Takeaways:
- Developing and maintaining a reliable Rails application requires dedicated metrics, visibility into performance, and a culture of accountability and continuous improvement across teams.
- By understanding and acting on data gathered from metrics, developers can effectively manage trade-offs between reliability and feature development, ultimately enhancing user satisfaction and application performance.

00:00:16.490 As he said, I'm Anthony. I do site reliability at Fleetio, and today I'm going to talk about how to get started with Rails applications and site reliability. There's a lot to cover, it can be complicated, and it doesn't always make sense at first glance. We'll try to make sense of some of these concepts together.
00:00:38.460 As you begin to work on the reliability of your sites, you might find yourself becoming more productive. In the beginning, as developers create apps and schemas, they often start out with something that is empty. When those apps first run, there is very little content in them.
00:00:48.290 Take Twitter and GitHub as examples; there wasn't much there at their inception. They started empty, but over time, they evolved into reliable platforms. Reliability is crucial; oddly enough, an application is often at its most reliable in its simplest form.
00:01:07.729 Complexity arises quickly as more functionality is added. Initially, the focus is on financial viability, trying to build something desirable enough to attract paying customers. During this phase, reliability is secondary.
00:01:21.550 However, as more customers start using the application and its functionality expands, the equilibrium between site reliability and financial viability begins to change. Eventually, users will expect the application to function well without significant delays or disruptions.
00:01:48.880 The complexity only increases with components like web views, mobile apps, APIs, databases, and various services like Elasticsearch and Redis. This creates a cloud of unknowing, making it difficult to see performance issues. The phrase comes from The Cloud of Unknowing, a work by an anonymous 14th-century mystic, and it illustrates this problem well; as we deal with reliability, transparency often comes up as a challenge.
00:03:08.019 Metrics give us visibility into what is happening within our systems. It's essential that we understand these metrics to separate the reliable from the unreliable components. The temptation to build our own metric systems when reliability issues arise is common, but starting with established APM (Application Performance Management) tools is advisable.
00:03:53.790 As we utilize these tools, we gain insights that allow us to better understand the complexities of our systems. One of my favorite pieces of advice for software developers is that understanding is crucial. It's essential to figure out what's happening, as that understanding enables us to take meaningful actions.
00:04:14.010 When reliability becomes an issue, our immediate focus should be on deploying metrics into the systems. We need to establish service level indicators (SLIs), which are specific metrics that reflect response times and availability.
00:04:56.180 From these SLIs, we can create service level objectives (SLOs), which define the targets we want those metrics to meet, such as a 95th percentile request time of under two seconds, or application availability above 99.9%.
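To make the latency objective concrete, here is a minimal sketch in plain Ruby. It assumes a nearest-rank percentile and made-up request durations; the method names `percentile` and `meets_latency_slo?` are hypothetical, and an APM tool would normally compute this for you.

```ruby
# Nearest-rank percentile: the smallest sample such that at least
# `pct` percent of all samples are less than or equal to it.
def percentile(values, pct)
  raise ArgumentError, "values must not be empty" if values.empty?

  sorted = values.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Assumed SLO, taken from the example above: p95 request time under two seconds.
def meets_latency_slo?(durations_in_seconds, pct: 95, threshold: 2.0)
  percentile(durations_in_seconds, pct) < threshold
end

# Made-up request durations in seconds, with one slow outlier.
durations = [0.2, 0.3, 0.25, 0.4, 1.1, 0.35, 0.5, 0.45, 0.3, 3.2]

puts "p50: #{percentile(durations, 50)}s"
puts "p95: #{percentile(durations, 95)}s"
puts "meets SLO: #{meets_latency_slo?(durations)}"
```

With only ten samples, a single slow request is enough to push the 95th percentile over the threshold, which is the kind of signal the SLO is meant to surface.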
00:05:36.330 Once we establish these measurements and objectives, we can formulate a service level agreement (SLA) with our customers. This could include metrics around availability and performance indicators. Internally, defining various SLIs can help shape our services and meet expectations.
00:06:10.130 It's critical to monitor availability closely since this metric resonates most with our customers. We typically express availability in terms of 'nines,' such as 99.9%, which translates into the number of minutes our system might be down.
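The arithmetic behind the 'nines' is simple enough to sketch. The example below assumes a 30-day window; the method name `allowed_downtime_minutes` is made up for illustration.

```ruby
MINUTES_PER_DAY = 24 * 60

# Allowed downtime, in minutes, for an availability target over a window of days.
def allowed_downtime_minutes(availability_percent, days: 30)
  (1 - availability_percent / 100.0) * days * MINUTES_PER_DAY
end

[99.0, 99.9, 99.99].each do |target|
  minutes = allowed_downtime_minutes(target)
  puts format("%.2f%% availability allows %.1f minutes of downtime per 30 days", target, minutes)
end
```

Three nines over 30 days leaves roughly 43 minutes of downtime, and each additional nine cuts that allowance by a factor of ten.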
00:06:35.090 However, merely responding to pings for availability doesn't satisfy user needs; thus, switching to a request-based availability metric provides a clearer picture of what users experience.
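A request-based availability figure can be sketched as the share of requests that were served successfully. Counting any 5xx status as a failure is an assumption made for this example, as is the method name `request_availability`.

```ruby
# Request-based availability: successful requests divided by all requests.
# Assumption for this sketch: any HTTP status of 500 or above is a failure.
def request_availability(status_codes)
  return nil if status_codes.empty? # no traffic, so nothing to measure

  failures = status_codes.count { |code| code >= 500 }
  (status_codes.length - failures).to_f / status_codes.length
end

# Made-up traffic: 9,990 successes and 10 server errors.
statuses = [200] * 9_990 + [500] * 7 + [503] * 3

puts format("availability: %.3f%%", request_availability(statuses) * 100)
```

Unlike a ping check, this number drops whenever real user requests fail, even if the server itself is still answering health checks.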
00:06:57.600 Understanding the performance of our applications often requires digging through historical data to see where issues intersected with user requests. APM tools help with time travel in a sense, allowing us to clarify what happened during a spike in demand.
00:07:43.860 By recording event streams for every request and monitoring exceptions, we can recreate contexts around performance spikes. Snapshots taken at intervals allow us to build a reference point against which we can view issues over time, enabling a more holistic analysis.
00:09:21.480 However, issues related to spikes can be difficult to debug, especially when they muddy logs with too much information. By taking regular snapshots of database statistics, we can analyze queries without losing sight of broader performance trends.
00:10:21.960 Understanding performance begins with measuring and grappling with various types of errors within our applications. Asynchronous job performance, for instance, requires different metrics than web requests. Measuring wait times, service times, and response times allows us to see where jobs fit into our performance frameworks.
00:11:37.790 Every job must have its service level objectives. Monitoring elements like queue times and service durations provides insights into job performance, allowing us to strategically address the longest-running tasks that create customer dissatisfaction.
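The relationship between wait time, service time, and response time for a job can be sketched as follows. The `JobRun` struct, the job class names, and the per-job SLO values are all hypothetical; a real job backend would supply the timestamps.

```ruby
# Hypothetical record of one background job run; times are in seconds.
JobRun = Struct.new(:job_class, :enqueued_at, :started_at, :finished_at) do
  # Time spent sitting in the queue before a worker picked the job up.
  def wait_time
    started_at - enqueued_at
  end

  # Time the worker spent actually doing the work.
  def service_time
    finished_at - started_at
  end

  # Total time from enqueue to completion.
  def response_time
    wait_time + service_time
  end
end

# Made-up per-job SLOs on response time, in seconds.
JOB_SLOS = { "ReportJob" => 60.0, "EmailJob" => 5.0 }.freeze

def within_job_slo?(run)
  run.response_time <= JOB_SLOS.fetch(run.job_class)
end

run = JobRun.new("EmailJob", 100.0, 103.0, 104.5)
puts "wait: #{run.wait_time}s, service: #{run.service_time}s, response: #{run.response_time}s"
puts "within SLO: #{within_job_slo?(run)}"
```

Separating the two components shows whether a slow job needs more workers (long wait time) or faster code (long service time).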
00:12:35.920 The metrics discussed apply universally to measuring job duration as well as web request performance. While averages can suggest an overall picture, they can mislead, reflecting satisfactory performance when issues persist behind the scenes.
00:13:45.230 Percentiles serve as a more reliable indicator for performance evaluation; they help us detect where outlier issues may suggest larger systemic problems. Additionally, measuring the contribution of various endpoints to the overall load helps clarify where efficiency improvements can yield significant gains.
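One way to look at each endpoint's contribution to overall load is to total the time spent per endpoint and express it as a share of all request time. Using total time as the measure of load is an assumption here, and the endpoint names and durations are made up.

```ruby
# Each sample is [endpoint, duration_in_seconds].
# Returns each endpoint's share of total request time, largest first.
def load_share_by_endpoint(samples)
  totals = Hash.new(0.0)
  samples.each { |endpoint, duration| totals[endpoint] += duration }

  overall = totals.values.sum
  return {} if overall.zero?

  totals.transform_values { |total| total / overall }
        .sort_by { |_endpoint, share| -share }
        .to_h
end

samples = [
  ["GET /orders", 0.5], ["GET /orders", 0.5], ["GET /orders", 0.5], ["GET /orders", 0.5],
  ["GET /reports", 4.0],
  ["POST /login", 2.0]
]

load_share_by_endpoint(samples).each do |endpoint, share|
  puts format("%-14s %5.1f%%", endpoint, share * 100)
end
```

In this made-up data the reports endpoint is called once but accounts for half of all request time, so speeding it up would pay off more than tuning the frequently called but fast endpoint.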
00:16:21.700 Apdex is a fascinating metric used to measure user satisfaction. By classifying each interaction as satisfied, tolerating, or frustrated against a target response time, we can calculate a single score that summarizes user experience. Identifying the interactions that lead to the most dissatisfaction allows us to prioritize improvements effectively.
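The standard Apdex formula counts satisfied requests in full and tolerating requests at half weight, divided by the total number of requests. A request is satisfied at or below the target time T, tolerating up to 4T, and frustrated beyond that. The sketch below uses made-up durations and an assumed T of 0.5 seconds.

```ruby
# Apdex = (satisfied + tolerating / 2) / total
#   satisfied:  duration <= t
#   tolerating: t < duration <= 4t
#   frustrated: duration > 4t (contributes nothing to the score)
def apdex(durations, t)
  return nil if durations.empty?

  satisfied  = durations.count { |d| d <= t }
  tolerating = durations.count { |d| d > t && d <= 4 * t }
  (satisfied + tolerating / 2.0) / durations.length
end

# Made-up request durations in seconds.
durations = [0.1, 0.3, 0.4, 0.9, 1.5, 2.5, 3.0, 6.0]

puts "Apdex(T=0.5s): #{apdex(durations, 0.5)}"
```

A score of 1.0 means every request was satisfying, and 0.0 means every request was frustrating.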
00:17:50.449 Simultaneously, tracking 'agony scores' can help us to manage user dissatisfaction by applying specific weights to the user interactions. Such metrics facilitate prioritization in addressing the most critical pain points users face within our applications.
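The exact weighting behind the agony score is not spelled out here, so the following is only an illustrative sketch of the general idea: assign a weight to each unsatisfying interaction and total the weights per endpoint. The weights, the reuse of Apdex-style buckets, and all names are assumptions, not the speaker's formula.

```ruby
# Hypothetical weights: tolerating requests cost 1 point, frustrated ones cost 4.
AGONY_WEIGHTS = { satisfied: 0, tolerating: 1, frustrated: 4 }.freeze

# Apdex-style bucketing against a target time t (an assumption for this sketch).
def satisfaction_bucket(duration, t)
  if duration <= t
    :satisfied
  elsif duration <= 4 * t
    :tolerating
  else
    :frustrated
  end
end

# Each sample is [endpoint, duration_in_seconds].
# Returns total weighted dissatisfaction per endpoint, worst first.
def agony_by_endpoint(samples, t)
  scores = Hash.new(0)
  samples.each do |endpoint, duration|
    scores[endpoint] += AGONY_WEIGHTS.fetch(satisfaction_bucket(duration, t))
  end
  scores.sort_by { |_endpoint, score| -score }.to_h
end

samples = [
  ["GET /search", 0.2], ["GET /search", 0.9], ["GET /search", 1.2],
  ["GET /export", 3.0], ["GET /export", 0.3]
]

agony_by_endpoint(samples, 0.5).each { |endpoint, score| puts "#{endpoint}: #{score}" }
```

Because the score is a total rather than a ratio, a moderately slow endpoint with heavy traffic can outrank a very slow endpoint that is rarely used, which is what makes it useful for prioritization.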
00:20:18.890 When we consider timeouts in our systems, we should recognize that they act as safety nets against poor performance. However, timeouts should not be treated as a substitute for improving the speed and reliability of our applications.
00:21:08.130 Utilizing APM tools allows for monitoring various query metrics efficiently. They also shed light on potential performance issues tracing back to permissions or functionality that may not be apparent at surface levels. A cross-sectional view of error occurrences across many requests can reveal underlying slowdowns.
00:22:36.340 Once we identify errors, we need to address them effectively. Each error should be counted per occurrence and categorized by type. Addressing the highest-volume errors will yield the most considerable impact on overall system reliability.
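Counting each occurrence and grouping by type can be sketched in a few lines. The error records below are made up, and the `:type` key is a hypothetical field; error trackers expose this grouping in their own way.

```ruby
# Count errors per type, most frequent first.
# Enumerable#tally requires Ruby 2.7 or newer.
def error_counts_by_type(errors)
  errors.map { |error| error[:type] }
        .tally
        .sort_by { |_type, count| -count }
        .to_h
end

# Made-up error occurrences.
errors = [
  { type: "Net::ReadTimeout",             path: "/reports" },
  { type: "ActiveRecord::RecordNotFound", path: "/orders/42" },
  { type: "Net::ReadTimeout",             path: "/reports" },
  { type: "NoMethodError",                path: "/profile" },
  { type: "Net::ReadTimeout",             path: "/exports" }
]

error_counts_by_type(errors).each { |type, count| puts "#{count} x #{type}" }
```

Sorting by volume puts the error whose fix would remove the most failures at the top of the list.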
00:24:06.030 We must approach error reporting with diligence. Ignoring errors is not a constructive strategy; instead, we should investigate thoroughly, ensure understanding of the underlying issue, and act to resolve it appropriately. When measuring availability, consider the tolerance for failure we allow within our systems.
00:25:29.600 The concept of an error budget allows us to maintain a balanced approach toward reliability and feature updates. When monitored effectively, the error budget can provide a solid framework for deciding between delivering new features and improving existing infrastructure.
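As a rough sketch of how an error budget can be tracked, the example below derives the number of failed requests an availability SLO permits and compares it with the failures observed so far. The 99.9% target, the request counts, and the method name `error_budget` are assumptions for illustration.

```ruby
# Error budget for a request-based availability SLO.
# allowed   = requests the SLO permits to fail in the window
# remaining = how much of that allowance is still unspent
def error_budget(total_requests:, failed_requests:, slo: 0.999)
  allowed = (total_requests * (1 - slo)).round
  remaining = allowed - failed_requests
  { allowed: allowed, used: failed_requests, remaining: remaining, exhausted: remaining <= 0 }
end

budget = error_budget(total_requests: 1_000_000, failed_requests: 400)
puts budget

if budget[:exhausted]
  puts "Budget spent: prioritize reliability work."
else
  puts "Budget available: feature work can continue."
end
```

The value of the budget is that it turns the features-versus-reliability debate into a number the whole team can see.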
00:26:15.170 Contrasting APM tracing capabilities with dashboard functionalities can uncover production issues. By investing time in understanding different metrics, we allow our teams to prioritize tasks better and address critical areas first.
00:27:02.530 Alerts should be layered intelligently. Different levels of alerting can indicate the urgency of issues, allowing our teams to prioritize response efforts effectively and preventing burnout from frequent, less significant notifications.
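Layered alerting can be illustrated with a small lookup that maps an observed error rate to an urgency level. The tier names and thresholds below are entirely hypothetical; real thresholds would come from your own SLOs and alerting tool.

```ruby
# Hypothetical alert tiers, ordered from most to least urgent.
# Each entry is [level, minimum error rate that triggers it].
ALERT_TIERS = [
  [:page,   0.05],  # interrupt the on-call engineer
  [:notify, 0.01],  # post to the team channel
  [:log,    0.001]  # record for the next reliability review
].freeze

# Returns the most urgent tier whose threshold the error rate reaches.
def alert_level(error_rate)
  tier = ALERT_TIERS.find { |_level, threshold| error_rate >= threshold }
  tier ? tier.first : :none
end

[0.10, 0.02, 0.002, 0.0001].each do |rate|
  puts format("error rate %.4f -> %s", rate, alert_level(rate))
end
```

Reserving the paging tier for the highest threshold is what keeps low-severity noise from wearing the team down.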
00:28:54.740 Filtering our metrics with custom attributes can be incredibly useful. This allows our support teams to gain insight tailored to individual queries, enhancing the customer experience, while also aiding our development teams in performance analysis.
00:30:04.330 Communication must remain a central focus as we improve site reliability. Regular updates on individual metrics and insights into efforts can create an informed organizational culture around reliability initiatives.
00:31:10.640 Continuous improvement in our understanding of system performance requires an ever-evolving strategy. Identifying gaps in data visibility should be an ongoing assessment in our technical backlogs, ensuring we never become complacent in our pursuit of reliability.
00:32:42.040 Thank you everyone for your time and attention.