Talks

Don't page me! How we limit pager noise at New Relic

by Chuck Lauer Vose

In his talk "Don't page me! How we limit pager noise at New Relic," Chuck Lauer Vose discusses the strategies his team used to drastically reduce the volume of pager notifications. He shares insights from his experience at New Relic, where a large monolith handles over 200,000 requests per minute and depends on numerous external services and databases, a scale that historically led to frequent incident notifications.

Key Points Discussed:
- Reduction of Pager Noise: The New Relic team successfully reduced the frequency of alerts from five times a week to about once a month. This highlighted the need to improve the clarity and trustworthiness of alert conditions.
- Process for Improvement:
  - Adding Basic Alerts: The first step is adding basic alert conditions so that customers aren't the ones informing the team of outages; these pre-built alerts then need refinement to be effective.
  - Retrospective Analysis: Chuck emphasizes the importance of conducting retrospectives on all incidents and alert notifications, which enables learning and improves alert accuracy.
  - Refining Alerts: Once sufficient data is collected, existing alert conditions should be refined by adjusting parameters and integrating more meaningful metrics based on the types of incidents encountered.
  - Utilizing Heartbeat Monitors: Heartbeat monitors automate checks by simulating customer activity, ensuring consistent traffic and helping catch issues early.

Illustrative Examples:
- Chuck describes how initial alert conditions were often misleading or non-critical, and how his team learned to distinguish genuine incidents from false positives. Error-rate thresholds were adjusted based on customer behavior patterns rather than arbitrary values, illustrating the need for careful monitoring and analysis.
- He also discusses incorporating statistical methods to refine alerts further, which led to faster responses and more relevant notifications during real incidents.

Conclusions and Takeaways:
- Effective Notifications: Notifications that reach the team during incidents deserve retrospectives to improve future responses.
- Refinement of Alerts: Default alert conditions require continuous refinement based on real operational data rather than solely relying on pre-set configurations.
- Essential Monitoring Tools: Using heartbeat monitors is crucial for ensuring system health and gaining confidence in alert conditions, facilitating timely responses to genuine incidents.

Chuck concludes by reiterating these three takeaways, aimed at transforming how engineering teams approach system monitoring and incident management, ultimately creating a more productive workflow. This approach is essential in scaling operations without drowning in unnecessary pager alerts.

00:00:12.179 Thanks everyone! Welcome to the talk titled 'Don't Page Me.' This presentation is about how my team limits pager noise.
00:00:20.400 I work at New Relic; my name is Chuck Lauer Vose, and I use he/him pronouns.
00:00:26.160 I realized when I got about halfway through that there was some tiny code on the screen that you might want to look at, so if you can't see it or if you're visually impaired for any reason, the slides are available. You can totally email me, and my email is included.
00:00:38.760 I’m at vosechu pretty much everywhere, except I found out that on the Pokémon forums, it seems like everything ending in 'chu' is taken. We have a booth open tomorrow, and I'll just mention now that we would love to help you with any of these topics, so come by. I’ll be there frequently to talk about everything we discuss today.
00:00:53.219 This is our corporate-mandated legal slide. If you’ve been to any of my talks before, you’ll know that I love a good activity. I need something to get the brain going, so I’d like to ask everyone, if you’re excited by the idea of going a week without pages, to stand up for a moment. Let’s get some blood flowing!
00:02:21.180 Three years ago, we were getting paged about five times a week, which some people might think is reasonable. Now, we get paged about once a month. We have different challenges now; for instance, we have new people coming into on-call, and I have no idea how to train them—that's success, and I'll take that problem. Also, for the first time, I get to wear the Ruby cave, which is something I’ve wanted to do for eight years.
00:03:04.140 If you remember nothing else from this talk, I hope you retain these three points, which we will discuss further: First, the notifications you receive during an incident deserve retrospectives. Second, default alert conditions that you add (you know, the ones you click on to add pre-built alerts) always require refinement and extra details before they become viable. Lastly, heartbeat monitors save lives—not just in a medical sense, but they genuinely save you.
00:03:50.220 If I were joining your team right now and we had a pager noise problem, this is where I’d spend time first: adding some basic alert conditions that are going to be noisy and terrible. It’s going to make life a lot worse initially, but we need to create a culture of retrospectives for these alerts along with the insights, then we’ll start refining them. You must have data to refine things; otherwise, it's just random noise. The last two steps involve adding heartbeat monitors, which greatly increase the confidence in alert conditions.
00:04:43.260 Finally, once we eliminate all the false positives, we'll discuss how we can actually reduce the number of incidents and their impact. Unfortunately, I'll need to rush through that portion, as those topics warrant their own talks, but we'll touch on them quickly. Now, let's clarify some terminology, because usage varies across the industry. When I say 'alert policy,' it means very little by itself; it's just a bucket for alert conditions.
00:05:39.000 Alert conditions are the crucial metrics we evaluate that trigger events. If an alert condition is triggered, we might send a message to Slack or PagerDuty—or both. Our team tends to send everything to Slack but occasionally to PagerDuty. I differentiate what we send to Slack as 'lower trust' notifications but still important, while the ones that I want to alert me in the middle of the night I describe as 'alert notifications.' It's important to note that I might mix these terms up, and I apologize in advance.
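The Slack-versus-PagerDuty split just described could be sketched as a tiny routing rule. This is my own illustration, not New Relic's code; the `:slack` and `:pagerduty` symbols are labels, not real API calls:

```ruby
# Hedged sketch: every notification goes to Slack, and only high-trust
# "alert notifications" also page someone in the middle of the night.
def notification_destinations(trust_level)
  case trust_level
  when :alert then [:slack, :pagerduty] # wake-me-up notifications
  else [:slack]                         # lower-trust, review at stand-up
  end
end

notification_destinations(:alert)      # [:slack, :pagerduty]
notification_destinations(:low_trust)  # [:slack]
```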
00:06:19.140 Let’s make things worse as the first step. Adding some default alert conditions, regardless of which APM product you use, is essential. We need to move through this quickly because we want alerts set up; otherwise, our customers will have to inform us when we’re down, which is the absolute worst! New Relic has a pre-built alerts button that’s fine to use. Generally, default alerts resemble triggers like an increased error rate or excessive request durations.
00:07:03.539 However, the problems with pre-built alerts are that they can lead to poor decisions. Single customers can spike error rates significantly. If all your customers are happy but one is not, is it truly an emergency? We encounter this constantly; major clients can have batch jobs that, when stopped, drop our throughput drastically. Is that incident-worthy, or is it just a regular customer operation? Regarding request durations, you may have certain website functionalities that take much longer, distorting your average times.
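The single-customer spike problem can be made concrete with a small sketch. The traffic numbers here are invented for illustration; the point is that a global error rate and a per-customer view tell different stories:

```ruby
# Hypothetical request log: each entry is [customer_id, error?].
# Nine customers succeed on every request; one customer fails on every request.
requests = Array.new(900) { |i| [i % 9, false] } + # nine happy customers
           Array.new(100) { [9, true] }            # one failing customer

# The global view: 100 errors out of 1,000 requests.
global_error_rate = requests.count { |_, err| err }.fdiv(requests.size)

# The per-customer view: error rate computed separately for each customer.
per_customer = requests.group_by(&:first).transform_values do |rows|
  rows.count { |_, err| err }.fdiv(rows.size)
end
affected_customers = per_customer.count { |_, rate| rate > 0.5 }

puts "global error rate: #{(global_error_rate * 100).round}%" # 10% looks alarming
puts "customers affected: #{affected_customers} of #{per_customer.size}" # but it's 1 of 10
```

Whether that deserves a page depends on who the one customer is, which is exactly the kind of judgment a default alert can't encode.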
00:08:00.360 None of these scenarios make genuinely ideal alert conditions, yet they are better than having no idea when an incident occurs. I really love this quote by Seth Godin: 'Just because something is easy to measure doesn’t mean it's important.' That resonates in our industry: easy tests and metrics may not provide valuable insight. Default alerts are just starting points, and they're limited because HTTP-level metrics alone can't furnish details specific to your organization.
00:09:02.700 Many challenges arise, particularly when incidents occur that depend on customer actions. As such, we cannot rely solely on defaults because it leads to a significant underestimation of the impact of certain metrics. Step two involves focusing on not just fixing the alert condition but also understanding the incidents. I know many of you might adjust the sensitivity of alert conditions but haven’t conducted more extensive analyses.
00:09:59.880 Don’t worry if your team is still at this stage; it's commendable you’ve recognized the problems. Every morning during our stand-up, we review every notification received, whether it’s an alert or Slack notification. We categorize notifications into four buckets: real incidents that required help, real incidents that didn’t need assistance, ones that seemed like incidents but weren’t, and false positives—these should never happen but will be our focus.
00:11:05.820 We also record metadata, which is crucial. I believe that magnets and spreadsheets can enhance nearly everything. Tracking trends over time as the number of incidents decreases is instrumental. Now, let’s talk quickly about false positives. These often feel genuine but cause significant turmoil in personal and professional lives. They can feel satisfying to respond to, but in reality, they waste time and energy.
00:12:10.620 We need to start letting automated processes handle these. By refining our alert conditions with the data collected from retrospectives, we can address alert notifications more effectively. During an incident or mini retrospective, we aim to record which condition triggered the alarm, how we categorized the incident, and how we determined when it was resolved.
00:12:55.500 Having this information recorded allows us to improve our alert conditions over time. We can clone an existing alert condition, adjust it, and monitor the clone for a while before attaching it to an alert policy. Often, we make those old conditions less sensitive, for example moving from three minutes of sustained error rate to four. Our more generalized error rate condition can stretch as far as 60 minutes; that kind of refinement reflects that we’ve gathered sufficient data.
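That kind of refinement, requiring a breach to be sustained for more consecutive minutes before firing, can be sketched in a few lines. The rates and thresholds below are illustrative, not New Relic's actual evaluation logic:

```ruby
# Fire only when the per-minute error rate stays above `threshold`
# for `minutes_required` consecutive minutes.
def sustained_breach?(per_minute_rates, threshold:, minutes_required:)
  run = 0
  per_minute_rates.each do |rate|
    run = rate > threshold ? run + 1 : 0
    return true if run >= minutes_required
  end
  false
end

# A brief two-minute spike, a recovery, then a four-minute breach:
rates = [0.01, 0.12, 0.15, 0.02, 0.11, 0.13, 0.14, 0.16]

sustained_breach?(rates, threshold: 0.10, minutes_required: 3) # true
sustained_breach?(rates, threshold: 0.10, minutes_required: 5) # false: longest run is 4
```

Lengthening `minutes_required` from three to four is exactly the "make it less sensitive" move described above: the early two-minute spike stops counting as an incident.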
00:14:02.640 By adjusting these metrics, we can examine raw incident data and assist in fine-tuning in relation to customer behavior. Ultimately, the goal is to ensure that alert notifications serve to provide valuable context during each incident.
00:14:57.180 Next, a common misconception might arise—that these alert conditions are only applicable to larger companies. But trust me, smaller companies face similar issues with limited visibility. You might think only big companies have extensive traffic; however, the number of incidents may remain consistent. Focus less on resolving a specific alert or false positive and more on gathering the necessary data for future reference. Regardless of the size of your organization, refining your alerts enhances the decision-making process in critical situations.
00:15:55.580 When configuring alert conditions, remember that they only work when there is customer engagement. Many services may experience periods of inactivity, making it difficult to establish reliability in the data. This creates a need for the next step: adding heartbeat monitors, which provide consistency in activity and alert conditions.
00:16:42.660 Heartbeat monitors serve to replicate customer actions continuously, providing insights without the need for a live customer. If something were to go awry, we would have a clearer understanding of whether it was a user error or a systemic problem. These monitors should be integrated thoughtfully and run continuously throughout the day.
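A heartbeat monitor of the kind described could look something like the sketch below. This is my own illustration, not New Relic Synthetics: the endpoint URL is hypothetical, and the probe is injectable so the escalation logic can be exercised without a live service:

```ruby
require "net/http"
require "uri"

# Minimal heartbeat sketch: a probe simulating one customer action runs on a
# schedule, and several consecutive failures escalate to an alert.
class Heartbeat
  def initialize(probe:, failures_before_alert: 3)
    @probe = probe
    @failures_before_alert = failures_before_alert
    @consecutive_failures = 0
  end

  # Run one check; returns :ok, :degraded, or :alert.
  def beat
    if @probe.call
      @consecutive_failures = 0
      :ok
    else
      @consecutive_failures += 1
      @consecutive_failures >= @failures_before_alert ? :alert : :degraded
    end
  end
end

# A real probe might issue an HTTP request (not executed here):
http_probe = lambda do
  begin
    uri = URI("https://example.com/health") # hypothetical endpoint
    Net::HTTP.get_response(uri).is_a?(Net::HTTPSuccess)
  rescue StandardError
    false
  end
end
monitor = Heartbeat.new(probe: http_probe)
# Something like `loop { monitor.beat; sleep 60 }` would run it continuously.
```

Because the heartbeat generates traffic around the clock, the error-rate and duration conditions above stay meaningful even when real customers are quiet.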
00:17:37.460 I’d like to emphasize that it is essential to trust your heartbeat monitors. I do understand the hesitations that may arise with relying on automated systems, but implementing these into your workflow provides invaluable clarity about system performance.
00:18:31.820 Automated tests let us monitor complex systems in several ways. They give us not just alerts but also activity logs that convey the health of our services. As we refine these processes and run retrospectives on alert notifications, we get a better view of the overall customer experience.
00:19:26.540 Take advantage of what automation offers. Moving from manual assessments to automated monitoring improves overall system management. Your primary focus should be building systems that communicate service status effectively, so customers never have to be the ones reporting gaps in service.
00:20:20.220 I have a few key takeaways to ensure you have a robust strategy regarding alerts: First, prioritize smaller deployments, shipped more frequently, and build Continuous Integration (CI) processes that embrace automated deployment. Second, explore canary deployments, where only a subset of your instances runs the new release, allowing errors to surface on a small slice of traffic first.
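The canary idea reduces blast radius by routing only a fraction of traffic to the new release. A minimal sketch of that routing decision, with an invented 5% fraction (not a figure from the talk):

```ruby
# Minimal canary-routing sketch: a fixed fraction of requests goes to the
# new release, so a bad deploy shows up on a small slice of traffic first.
class CanaryRouter
  def initialize(canary_fraction: 0.05, rng: Random.new)
    @canary_fraction = canary_fraction
    @rng = rng
  end

  # Decide where one request goes.
  def route
    @rng.rand < @canary_fraction ? :canary : :stable
  end
end

router = CanaryRouter.new(canary_fraction: 0.05)
router.route # :canary about 5% of the time, :stable otherwise
# In a real rollout you would compare the canary cohort's error rate to the
# stable cohort's before promoting the release to every instance.
```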
00:21:04.900 Lastly, employ circuit breakers effectively. A circuit breaker stops calling a dependency that keeps failing, giving it room to recover instead of compounding the outage. Every Rails application should implement simple circuit breakers as an initial safeguard against network failures.
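A bare-bones circuit breaker can be sketched in a few lines. This is a deliberately simplified illustration, not a production library: the threshold is arbitrary, and a real breaker would also add a timed half-open state to probe for recovery:

```ruby
# Raised when a call is skipped because the circuit is open.
class CircuitOpenError < StandardError; end

# Minimal circuit-breaker sketch: after enough consecutive failures,
# stop calling the dependency entirely instead of hammering it.
class CircuitBreaker
  def initialize(failure_threshold: 5)
    @failure_threshold = failure_threshold
    @failures = 0
  end

  def open?
    @failures >= @failure_threshold
  end

  def call
    raise CircuitOpenError, "circuit open, skipping call" if open?
    begin
      result = yield
      @failures = 0 # any success resets the count
      result
    rescue StandardError
      @failures += 1
      raise
    end
  end
end

breaker = CircuitBreaker.new(failure_threshold: 2)
# breaker.call { Net::HTTP.get(...) } would wrap a real outbound call.
```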
00:21:10.200 To summarize, ensure that your response mechanisms, monitoring systems, and alert notifications undergo continuous refinement. Your team should foster a culture inclined towards incident transparency and learning from prior occurrences. Remember: notifications during incidents deserve retrospectives, default alert conditions require constant tuning, and heartbeat monitors are essential components for reliable operations.
00:22:03.480 Thank you all for attending! If you have any questions or wish to delve into finer details, come visit us at our booth. I love sharing this knowledge and experience and appreciate your attentiveness. Thank you!