Ruby Video | Making The Best of a Bad Situation

Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages

The video titled "Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages" presented by Miles McGuire at Rails World 2024, discusses the significant outage experienced by Intercom on February 22, 2024.

The key points covered in the presentation are as follows:

Introduction to the Incident: McGuire, a Staff Engineer at Intercom, recounts an incident where a 32-bit foreign key referencing a 64-bit primary key resulted in a major service outage. The discussion is framed from the perspective of the on-call engineer dealing with the chaos of the incident.
Initial Response: Upon being paged for elevated exceptions, the engineering team discovered issues related to their data model and high volumes of exceptions indicating failures to persist conversation part records due to integer limitations.
Importance of Team Collaboration: The outbreak led to a collaborative effort involving engineers, customer support, and marketing teams to manage communications and ensure a consistent response to stakeholders while working on a solution.
Multiple Solutions Explored: Initially, the team considered a time-consuming database migration and explored alternative quick fixes, including socializing workarounds with primary keys.
Resolution Attempts: After several hours of discussions and attempts to fix the issues via code, the team realized the deployment had not updated the cached processes that relied on the database schema.
Learning from the Experience: The incident prompted a discussion on socio-technical factors, emphasizing the need for better documentation and more comprehensive runbooks that account for challenges encountered during similar situations in the past.
Improvements Implemented: The team adopted several changes, including the implementation of code checks in their CI pipeline to catch potential foreign key mismatches, improving debugging messages to provide context in errors, and creating alarms that alert them when they are approaching integer limits for critical databases.

The presentation concludes by stressing the importance of thorough post-incident analysis to ensure the issues don’t recur and the significance of documentation in learning and improving processes within technical teams.

Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages
Miles McGuire • September 27, 2024 • Toronto, Canada

Incidents are an opportunity to level up, and on 22 Feb 2024 Intercom had one of its most painful outages in recent memory. The root cause? A 32-bit foreign key referencing a 64-bit primary key. Miles McGuire shared what happened, why it happened, and what they are doing to ensure it won't happen again (including some changes you can make to your own Rails apps to help make sure you don’t make the same mistakes.)

#outage #lessonslearned

Thank you Shopify for sponsoring the editing and post-production of these videos. Check out insights from the Engineering team at: https://shopify.engineering/

Stay tuned: all 2024 Rails World videos will be subtitled in Japanese and Brazilian Portuguese soon thanks to our sponsor Happy Scribe, a transcription service built on Rails. https://www.happyscribe.com/

Rails World 2024