Title
The Sounds of Silence: Lessons from an 18 hour API outage

Description
The Sounds of Silence: Lessons from an 18 hour API outage, by Paul Zaich

Sometimes applications behave “normally” by strict definitions of HTTP statuses, but under the surface something is terribly wrong. In 2017, Checkr’s most important API endpoint went down for 12 hours without detection. In this talk I’ll cover this incident, how we responded (what went well and what could have gone better), and explore how we’ve hardened our systems today with simple monitoring patterns.

Paul hails from Denver, CO, where he works as an Engineering Manager at Checkr. He’s passionate about building technology for the new world of work. In a former life, Paul was a competitive swimmer. He now spends most of his free time on dry land with his wife and three children.
Summary
In this talk, titled "The Sounds of Silence: Lessons from an 18 hour API outage", Paul Zaich, an engineering manager at Checkr, discusses the challenges and lessons learned from a critical API outage in October 2017 that lasted 18 hours and severely impacted customer operations.

Zaich begins by addressing the inevitability of bugs in software development, emphasizing that engineers should focus on minimizing their impact rather than attempting to eliminate them completely. He recounts how Checkr, a background check service provider, faced an outage that prevented customers from creating reports through their API, a critical component of their operations.

Key points of discussion include:

- **Incident Background**: The outage began when a script was run to migrate old records, triggering a series of events that caused significant errors in report creation.
- **Delayed Detection**: The team initially misinterpreted error alerts from an unrelated component, delaying the diagnosis of the broader API failure. Over 14 hours passed before the core issue was fully identified and addressed.
- **Root Cause Analysis**: The root cause was traced to database constraints that were not enforced during the migration, leading to null references in the records associated with reports. Resolution was further complicated by the team's reliance on informal knowledge at the time.
- **Post-Mortem Process**: Zaich highlights the importance of conducting post-mortems to learn from outages, emphasizing a blameless culture that encourages examining mistakes to prevent recurrences. The team developed a structured process for documenting incidents, focusing equally on root causes and actionable follow-ups.
- **Observability Improvements**: A major takeaway from the incident was the need to enhance monitoring and observability. Zaich discusses developing more sensitive alerting and implementing composite monitors that combine multiple metrics to better detect failures (see the sketch after this summary), and stresses the importance of avoiding overly simplistic alert thresholds.
- **Conclusion**: Zaich closes with advice for building a robust observability culture: start small, use tools like exception trackers and application performance management (APM), and iteratively improve monitoring rules so that alerts remain meaningful and actionable.

By reflecting on this incident, Zaich emphasizes that while bugs are part of software development, effectively managing and responding to them through a culture of observability is crucial for building reliable systems.
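As a rough illustration of the composite-monitor idea described above (this sketch is not from the talk; the metric names, thresholds, and helper types are hypothetical), the Python snippet below pages only when an elevated error rate coincides with a drop in successful report creations:

```python
from dataclasses import dataclass


@dataclass
class MetricWindow:
    """One metric aggregated over the alert evaluation window."""
    name: str
    value: float


def composite_alert(error_rate: MetricWindow,
                    reports_created: MetricWindow,
                    max_error_rate: float = 0.05,
                    min_reports_created: int = 10) -> bool:
    """Page only when BOTH signals look wrong.

    An error-rate threshold alone is noisy (a handful of 500s on low
    traffic can trip it), and a throughput threshold alone can miss a
    partial failure. Requiring both keeps the alert sensitive to a real
    outage while filtering out one-off blips.
    """
    too_many_errors = error_rate.value > max_error_rate
    too_few_reports = reports_created.value < min_reports_created
    return too_many_errors and too_few_reports


if __name__ == "__main__":
    # Elevated error rate AND a collapse in successful report creations:
    # this evaluation returns True, i.e. the on-call engineer gets paged.
    print(composite_alert(MetricWindow("api.requests.error_rate", 0.12),
                          MetricWindow("reports.created.count", 2)))
```

Tuning the two conditions independently is what lets such an alert stay both sensitive and quiet, which is the trade-off the talk associates with overly simplistic single-metric thresholds.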