Title
The Sounds of Silence: Lessons from an 18 hour API outage

Description
The Sounds of Silence: Lessons from an 18 hour API outage, by Paul Zaich

Sometimes applications behave “normally” by strict definitions of HTTP statuses, but under the surface something is terribly wrong. In 2017, Checkr’s most important API endpoint went down for 12 hours without detection. In this talk I’ll cover this incident, how we responded (what went well and what could have gone better), and explore how we’ve hardened our systems today with simple monitoring patterns.

Paul hails from Denver, CO, where he works as an Engineering Manager at Checkr. He’s passionate about building technology for the new world of work. In a former life, Paul was a competitive swimmer. He now spends most of his free time on dry land with his wife and three children.
Summary
In this talk, titled "The Sounds of Silence: Lessons from an 18 hour API outage", Paul Zaich, an engineering manager at Checkr, discusses the challenges and lessons learned from a critical API outage in October 2017 that lasted 18 hours and severely impacted customer operations.

Zaich begins by addressing the inevitability of bugs in software development, emphasizing that engineers should focus on minimizing their impact rather than attempting to eliminate them completely. He recounts how Checkr, a background check service provider, faced an outage that prevented customers from creating reports through their API, a critical component of their operations.

Key points of discussion include:

- **Incident Background**: The outage began when a script was run to migrate old records, triggering a series of events that caused significant errors in report creation.
- **Delayed Detection**: The team initially misinterpreted error alerts from an unrelated component, delaying the diagnosis of the broader API failure. Over 14 hours passed before the core issue was fully identified and addressed.
- **Root Cause Analysis**: The root cause was traced to database constraints that were not enforced during the migration, leading to null references in the records associated with reports. Resolution was further complicated by the team's reliance on informal knowledge at the time.
- **Post-Mortem Process**: Zaich highlights the importance of conducting post-mortems to learn from outages, emphasizing a blameless culture that encourages examining mistakes to prevent recurrences. The team developed a structured process for documenting incidents, focusing equally on root causes and actionable follow-ups.
- **Observability Improvements**: A major takeaway from the incident was the need to enhance monitoring and observability. Zaich discusses developing more sensitive alerting and implementing composite monitors that combine multiple metrics to better detect failures (see the sketch after this summary), and stresses the importance of avoiding overly simplistic alert thresholds.
- **Conclusion**: Zaich closes with advice for building a robust observability culture: start small, use tools like exception trackers and application performance management (APM), and iteratively improve monitoring rules so that alerts remain meaningful and actionable.

By reflecting on this incident, Zaich emphasizes that while bugs are part of software development, effectively managing and responding to them through a culture of observability is crucial for building reliable systems.
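As a rough illustration of the composite-monitor idea described above (this sketch is not from the talk; the metric names, thresholds, and helper types are hypothetical), the Python snippet below pages only when an elevated error rate coincides with a drop in successful report creations:

```python
from dataclasses import dataclass


@dataclass
class MetricWindow:
    """One metric aggregated over the alert evaluation window."""
    name: str
    value: float


def composite_alert(error_rate: MetricWindow,
                    reports_created: MetricWindow,
                    max_error_rate: float = 0.05,
                    min_reports_created: int = 10) -> bool:
    """Page only when BOTH signals look wrong.

    An error-rate threshold alone is noisy (a handful of 500s on low
    traffic can trip it), and a throughput threshold alone can miss a
    partial failure. Requiring both keeps the alert sensitive to a real
    outage while filtering out one-off blips.
    """
    too_many_errors = error_rate.value > max_error_rate
    too_few_reports = reports_created.value < min_reports_created
    return too_many_errors and too_few_reports


if __name__ == "__main__":
    # Elevated error rate AND a collapse in successful report creations:
    # this evaluation returns True, i.e. the on-call engineer gets paged.
    print(composite_alert(MetricWindow("api.requests.error_rate", 0.12),
                          MetricWindow("reports.created.count", 2)))
```

Tuning the two conditions independently is what lets such an alert stay both sensitive and quiet, which is the trade-off the talk associates with overly simplistic single-metric thresholds.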