Talks

Don't page me! How we limit pager noise at New Relic

by Chuck Lauer Vose

In his talk "Don't page me! How we limit pager noise at New Relic," Chuck Lauer Vose discusses the strategies his team used to drastically reduce the volume of pager notifications. He shares insights from his experience at New Relic, where a large monolith handles over 200,000 requests per minute and depends on numerous external services and databases, a scale that historically led to frequent incident notifications.

Key Points Discussed:
- Reduction of Pager Noise: The New Relic team successfully reduced the frequency of alerts from five times a week to about once a month. This highlighted the need to improve the clarity and trustworthiness of alert conditions.
- Process for Improvement:
  - Adding Basic Alerts: The first step is adding basic alert conditions so that customers aren't the ones informing the team of outages; these pre-built alerts then need refinement to be effective.
  - Retrospective Analysis: Chuck emphasizes the importance of conducting retrospectives on all incidents and alert notifications, which enables learning and improves alert accuracy.
  - Refining Alerts: Once sufficient data is collected, existing alert conditions should be refined by adjusting parameters and integrating more meaningful metrics based on the types of incidents encountered.
  - Utilizing Heartbeat Monitors: Heartbeat monitors automate checks by simulating customer activity, ensuring consistent traffic and helping catch issues early.

Illustrative Examples:
- Chuck describes how initial alert conditions were often misleading or non-critical, and how his team learned to distinguish genuine incidents from false positives. Error-rate thresholds were adjusted based on customer behavior patterns rather than arbitrary values, illustrating the need for careful monitoring and analysis.
- He also discusses incorporating statistical methods to refine alerts further, which led to faster responses and more relevant notifications during real incidents.

Conclusions and Takeaways:
- Effective Notifications: Notifications that reach the team during incidents deserve retrospectives to improve future responses.
- Refinement of Alerts: Default alert conditions require continuous refinement based on real operational data rather than solely relying on pre-set configurations.
- Essential Monitoring Tools: Using heartbeat monitors is crucial for ensuring system health and gaining confidence in alert conditions, facilitating timely responses to genuine incidents.

Chuck concludes by reiterating these three takeaways, aimed at transforming how engineering teams approach system monitoring and incident management, ultimately creating a more productive workflow. This approach is essential in scaling operations without drowning in unnecessary pager alerts.

00:00:12.179 Thanks everyone! Welcome to the talk titled 'Don't Page Me.' This presentation is about how my team limits pager noise.
00:00:20.400 I work at New Relic; my name is Chuck Lauer Vose, and I use he/him pronouns.
00:00:26.160 I realized when I got about halfway through that there was some tiny code on the screen that you might want to look at, so if you can't see it or if you're visually impaired for any reason, the slides are available. You can totally email me, and my email is included.
00:00:38.760 I’m at vosechu pretty much everywhere, except I found out that on the Pokémon forums, it seems like everything ending in 'chu' is taken. We have a booth open tomorrow, and I'll just mention now that we would love to help you with any of these topics, so come by. I’ll be there frequently to talk about everything we discuss today.
00:00:53.219 This is our corporate-mandated legal slide. If you’ve been to any of my talks before, you’ll know that I love a good activity. I need something to get the brain going, so I’d like to ask everyone, if you’re excited by the idea of going a week without pages, to stand up for a moment. Let’s get some blood flowing!
00:02:21.180 Three years ago, we were getting paged about five times a week, which some people might think is reasonable. Now, we get paged about once a month. We have different challenges now; for instance, we have new people coming into on-call, and I have no idea how to train them—that's success, and I'll take that problem. Also, for the first time, I get to wear the Ruby cave, which is something I’ve wanted to do for eight years.
00:03:04.140 If you remember nothing else from this talk, I hope you retain these three points, which we will discuss further: First, the notifications you receive during an incident deserve retrospectives. Second, default alert conditions that you add (you know, the ones you click on to add pre-built alerts) always require refinement and extra details before they become viable. Lastly, heartbeat monitors save lives—not just in a medical sense, but they genuinely save you.
00:03:50.220 If I were joining your team right now and we had a pager noise problem, this is where I’d spend time first: adding some basic alert conditions that are going to be noisy and terrible. It’s going to make life a lot worse initially, but we need to create a culture of retrospectives for these alerts along with the insights, then we’ll start refining them. You must have data to refine things; otherwise, it's just random noise. The last two steps involve adding heartbeat monitors, which greatly increase the confidence in alert conditions.
00:04:43.260 Finally, once we eliminate all the false positives, we'll discuss how we can actually reduce the number of incidents and their impact. Unfortunately, I'll need to rush through that portion, as those topics warrant their own talks, but we'll touch on them quickly. Now, let's clarify some terminology, because usage varies across the industry. When I say 'alert policy,' it means very little by itself; it's just a bucket for alert conditions.
00:05:39.000 Alert conditions are the crucial metrics we evaluate that trigger events. If an alert condition is triggered, we might send a message to Slack or PagerDuty—or both. Our team tends to send everything to Slack but occasionally to PagerDuty. I differentiate what we send to Slack as 'lower trust' notifications but still important, while the ones that I want to alert me in the middle of the night I describe as 'alert notifications.' It's important to note that I might mix these terms up, and I apologize in advance.
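The Slack-versus-PagerDuty split just described could be sketched as a tiny routing rule. This is my own illustration, not New Relic's code; the `:slack` and `:pagerduty` symbols are labels, not real API calls:

```ruby
# Hedged sketch: every notification goes to Slack, and only high-trust
# "alert notifications" also page someone in the middle of the night.
def notification_destinations(trust_level)
  case trust_level
  when :alert then [:slack, :pagerduty] # wake-me-up notifications
  else [:slack]                         # lower-trust, review at stand-up
  end
end

notification_destinations(:alert)      # [:slack, :pagerduty]
notification_destinations(:low_trust)  # [:slack]
```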
00:06:19.140 Let’s make things worse as the first step. Adding some default alert conditions, regardless of which APM product you use, is essential. We need to move through this quickly because we want alerts set up; otherwise, our customers will have to inform us when we’re down, which is the absolute worst! New Relic has a pre-built alerts button that’s fine to use. Generally, default alerts resemble triggers like an increased error rate or excessive request durations.
00:07:03.539 However, the problems with pre-built alerts are that they can lead to poor decisions. Single customers can spike error rates significantly. If all your customers are happy but one is not, is it truly an emergency? We encounter this constantly; major clients can have batch jobs that, when stopped, drop our throughput drastically. Is that incident-worthy, or is it just a regular customer operation? Regarding request durations, you may have certain website functionalities that take much longer, distorting your average times.
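The single-customer spike problem can be made concrete with a small sketch. The traffic numbers here are invented for illustration; the point is that a global error rate and a per-customer view tell different stories:

```ruby
# Hypothetical request log: each entry is [customer_id, error?].
# Nine customers succeed on every request; one customer fails on every request.
requests = Array.new(900) { |i| [i % 9, false] } + # nine happy customers
           Array.new(100) { [9, true] }            # one failing customer

# The global view: 100 errors out of 1,000 requests.
global_error_rate = requests.count { |_, err| err }.fdiv(requests.size)

# The per-customer view: error rate computed separately for each customer.
per_customer = requests.group_by(&:first).transform_values do |rows|
  rows.count { |_, err| err }.fdiv(rows.size)
end
affected_customers = per_customer.count { |_, rate| rate > 0.5 }

puts "global error rate: #{(global_error_rate * 100).round}%" # 10% looks alarming
puts "customers affected: #{affected_customers} of #{per_customer.size}" # but it's 1 of 10
```

Whether that deserves a page depends on who the one customer is, which is exactly the kind of judgment a default alert can't encode.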
00:08:00.360 None of these scenarios make genuinely ideal alert conditions, yet they are better than having no idea when an incident occurs. I really love this quote by Seth Godin: 'Just because something is easy to measure doesn’t mean it's important.' That resonates in our industry: easy tests and metrics may not provide valuable insight. Default alerts are just starting points, and they're limited because HTTP-level metrics alone can't furnish details specific to your organization.
00:09:02.700 Many challenges arise, particularly when incidents occur that depend on customer actions. As such, we cannot rely solely on defaults because it leads to a significant underestimation of the impact of certain metrics. Step two involves focusing on not just fixing the alert condition but also understanding the incidents. I know many of you might adjust the sensitivity of alert conditions but haven’t conducted more extensive analyses.
00:09:59.880 Don’t worry if your team is still at this stage; it's commendable you’ve recognized the problems. Every morning during our stand-up, we review every notification received, whether it’s an alert or Slack notification. We categorize notifications into four buckets: real incidents that required help, real incidents that didn’t need assistance, ones that seemed like incidents but weren’t, and false positives—these should never happen but will be our focus.
00:11:05.820 We also record metadata, which is crucial. I believe that magnets and spreadsheets can enhance nearly everything. Tracking trends over time as the number of incidents decreases is instrumental. Now, let’s talk quickly about false positives. These often feel genuine but cause significant turmoil in personal and professional lives. They can feel satisfying to respond to, but in reality, they waste time and energy.
00:12:10.620 We need to start letting automated processes handle these. By refining our alert conditions with the data collected from retrospectives, we can address alert notifications more effectively. During an incident or mini retrospective, we aim to record which condition triggered the alarm, how we categorized the incident, and how we determined when it was resolved.
00:12:55.500 Having this information recorded allows us to improve our alert conditions over time. We can clone an existing alert condition, adjust it, and monitor the clone for a while before attaching it to an alert policy. Often, we make those old conditions less sensitive, for example moving from three minutes of sustained error rate to four. Our more generalized error rate condition can stretch as far as 60 minutes; that kind of refinement reflects that we’ve gathered sufficient data.
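That kind of refinement, requiring a breach to be sustained for more consecutive minutes before firing, can be sketched in a few lines. The rates and thresholds below are illustrative, not New Relic's actual evaluation logic:

```ruby
# Fire only when the per-minute error rate stays above `threshold`
# for `minutes_required` consecutive minutes.
def sustained_breach?(per_minute_rates, threshold:, minutes_required:)
  run = 0
  per_minute_rates.each do |rate|
    run = rate > threshold ? run + 1 : 0
    return true if run >= minutes_required
  end
  false
end

# A brief two-minute spike, a recovery, then a four-minute breach:
rates = [0.01, 0.12, 0.15, 0.02, 0.11, 0.13, 0.14, 0.16]

sustained_breach?(rates, threshold: 0.10, minutes_required: 3) # true
sustained_breach?(rates, threshold: 0.10, minutes_required: 5) # false: longest run is 4
```

Lengthening `minutes_required` from three to four is exactly the "make it less sensitive" move described above: the early two-minute spike stops counting as an incident.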
00:14:02.640 By adjusting these metrics, we can examine raw incident data and assist in fine-tuning in relation to customer behavior. Ultimately, the goal is to ensure that alert notifications serve to provide valuable context during each incident.
00:14:57.180 Next, a common misconception might arise—that these alert conditions are only applicable to larger companies. But trust me, smaller companies face similar issues with limited visibility. You might think only big companies have extensive traffic; however, the number of incidents may remain consistent. Focus less on resolving a specific alert or false positive and more on gathering the necessary data for future reference. Regardless of the size of your organization, refining your alerts enhances the decision-making process in critical situations.
00:15:55.580 When configuring alert conditions, remember that they only work when there is customer engagement. Many services may experience periods of inactivity, making it difficult to establish reliability in the data. This creates a need for the next step: adding heartbeat monitors, which provide consistency in activity and alert conditions.
00:16:42.660 Heartbeat monitors serve to replicate customer actions continuously, providing insights without the need for a live customer. If something were to go awry, we would have a clearer understanding of whether it was a user error or a systemic problem. These monitors should be integrated thoughtfully and run continuously throughout the day.
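A heartbeat monitor of the kind described could look something like the sketch below. This is my own illustration, not New Relic Synthetics: the endpoint URL is hypothetical, and the probe is injectable so the escalation logic can be exercised without a live service:

```ruby
require "net/http"
require "uri"

# Minimal heartbeat sketch: a probe simulating one customer action runs on a
# schedule, and several consecutive failures escalate to an alert.
class Heartbeat
  def initialize(probe:, failures_before_alert: 3)
    @probe = probe
    @failures_before_alert = failures_before_alert
    @consecutive_failures = 0
  end

  # Run one check; returns :ok, :degraded, or :alert.
  def beat
    if @probe.call
      @consecutive_failures = 0
      :ok
    else
      @consecutive_failures += 1
      @consecutive_failures >= @failures_before_alert ? :alert : :degraded
    end
  end
end

# A real probe might issue an HTTP request (not executed here):
http_probe = lambda do
  begin
    uri = URI("https://example.com/health") # hypothetical endpoint
    Net::HTTP.get_response(uri).is_a?(Net::HTTPSuccess)
  rescue StandardError
    false
  end
end
monitor = Heartbeat.new(probe: http_probe)
# Something like `loop { monitor.beat; sleep 60 }` would run it continuously.
```

Because the heartbeat generates traffic around the clock, the error-rate and duration conditions above stay meaningful even when real customers are quiet.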
00:17:37.460 I’d like to emphasize that it is essential to trust your heartbeat monitors. I do understand the hesitations that may arise with relying on automated systems, but implementing these into your workflow provides invaluable clarity about system performance.
00:18:31.820 Automated tests let us monitor complex systems in several ways. They give us not just alerts but also activity logs that convey the health of our services. As we refine these processes and run retrospectives on alert notifications, we get a better view of the overall customer experience.
00:19:26.540 Take advantage of what automation offers. Moving from manual assessments to automated monitoring improves overall system management. Your primary focus should be building systems that communicate service status effectively, so customers never have to be the ones reporting gaps in service.
00:20:20.220 I have a few key takeaways to ensure you have a robust strategy regarding alerts: First, prioritize smaller deployments, shipped more frequently, and build Continuous Integration (CI) processes that embrace automated deployment. Second, explore canary deployments, where only a subset of your instances runs the new release, allowing errors to surface on a small slice of traffic first.
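The canary idea reduces blast radius by routing only a fraction of traffic to the new release. A minimal sketch of that routing decision, with an invented 5% fraction (not a figure from the talk):

```ruby
# Minimal canary-routing sketch: a fixed fraction of requests goes to the
# new release, so a bad deploy shows up on a small slice of traffic first.
class CanaryRouter
  def initialize(canary_fraction: 0.05, rng: Random.new)
    @canary_fraction = canary_fraction
    @rng = rng
  end

  # Decide where one request goes.
  def route
    @rng.rand < @canary_fraction ? :canary : :stable
  end
end

router = CanaryRouter.new(canary_fraction: 0.05)
router.route # :canary about 5% of the time, :stable otherwise
# In a real rollout you would compare the canary cohort's error rate to the
# stable cohort's before promoting the release to every instance.
```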
00:21:04.900 Lastly, employ circuit breakers effectively. A circuit breaker stops calling a dependency that keeps failing, giving it room to recover instead of compounding the outage. Every Rails application should implement simple circuit breakers as an initial safeguard against network failures.
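A bare-bones circuit breaker can be sketched in a few lines. This is a deliberately simplified illustration, not a production library: the threshold is arbitrary, and a real breaker would also add a timed half-open state to probe for recovery:

```ruby
# Raised when a call is skipped because the circuit is open.
class CircuitOpenError < StandardError; end

# Minimal circuit-breaker sketch: after enough consecutive failures,
# stop calling the dependency entirely instead of hammering it.
class CircuitBreaker
  def initialize(failure_threshold: 5)
    @failure_threshold = failure_threshold
    @failures = 0
  end

  def open?
    @failures >= @failure_threshold
  end

  def call
    raise CircuitOpenError, "circuit open, skipping call" if open?
    begin
      result = yield
      @failures = 0 # any success resets the count
      result
    rescue StandardError
      @failures += 1
      raise
    end
  end
end

breaker = CircuitBreaker.new(failure_threshold: 2)
# breaker.call { Net::HTTP.get(...) } would wrap a real outbound call.
```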
00:21:10.200 To summarize, ensure that your response mechanisms, monitoring systems, and alert notifications undergo continuous refinement. Your team should foster a culture inclined towards incident transparency and learning from prior occurrences. Remember: notifications during incidents deserve retrospectives, default alert conditions require constant tuning, and heartbeat monitors are essential components for reliable operations.
00:22:03.480 Thank you all for attending! If you have any questions or wish to delve into finer details, come visit us at our booth. I love sharing this knowledge and experience and appreciate your attentiveness. Thank you!