Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages

Summarized using AI


Miles McGuire • September 26, 2024 • Toronto, Canada

In the video titled "Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages", Miles McGuire, a Staff Engineer at Intercom, discusses a significant outage that occurred on February 22, 2024. He emphasizes that such incidents present opportunities for learning. The root cause of the outage was a mismatch between foreign key and primary key data types, specifically a 32-bit foreign key referencing a 64-bit primary key. The following key points cover the timeline of events during the incident, the investigation process, and the lessons learned:

  • Incident Overview: McGuire provides an account of the incident from the perspective of an on-call engineer, outlining the steps taken from the moment the outage was detected.
  • Initial Response: Upon receiving a page, the on-call engineer noticed elevated exceptions and an increase in 500 error responses. Through the investigation, it became apparent that the conversation part records could not be persisted due to integer overflow issues.
  • Collaboration and Communication: McGuire highlights the importance of engaging various teams, including Customer Support and Marketing, to manage the incident and communicate effectively about the outage.
  • Workarounds and Solutions: The team initially kicked off a database migration; however, a workaround was needed once it became apparent that the migration would take days and that much downtime would be unacceptable.
  • Cascading Issues: Subsequent troubleshooting revealed that other models dependent on the conversation part were also impacted. The team had to consider disabling features temporarily to mitigate customer frustration.
  • Post-Incident Analysis: After resolving the immediate issues, McGuire stresses the importance of analyzing what went wrong. They found that lessons from previous outages were not adequately documented, leading to the same problems surfacing again.
  • Preventive Measures: The team established new checks within their continuous integration framework to prevent future occurrences of similar data type issues. This included implementing alarms for tables nearing primary key limits and improving error messaging with more context.
  • Importance of Documentation: Lastly, McGuire underscores the need for thorough documentation of processes and experiences in order to learn from incidents effectively.

In conclusion, this incident illustrates that being deliberate about systemic improvements and documentation in the aftermath of failures can help prevent future outages and enhance operational resilience.

Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages
Miles McGuire • September 26, 2024 • Toronto, Canada

Incidents are an opportunity to level up, and on 22 Feb 2024 Intercom had one of its most painful outages in recent memory. The root cause? A 32-bit foreign key referencing a 64-bit primary key. Miles McGuire shared what happened, why it happened, and what they are doing to ensure it won't happen again (including some changes you can make to your own Rails apps to help make sure you don’t make the same mistakes.)

#outage #lessonslearned

Thank you Shopify for sponsoring the editing and post-production of these videos. Check out insights from the Engineering team at: https://shopify.engineering/

Stay tuned: all 2024 Rails World videos will be subtitled in Japanese and Brazilian Portuguese soon thanks to our sponsor Happy Scribe, a transcription service built on Rails. https://www.happyscribe.com/

Rails World 2024

00:00:10.280 Hey everyone. So, who am I? You just heard a little bit. I'm a Staff Engineer at Intercom. I've worked here for about seven and a half years, and for the last five, I've been on our data stores team.
00:00:14.360 The data stores team at Intercom manages MySQL, Elasticsearch, DynamoDB, and also parts of our core Rails platform and various components of our application stack.
00:00:22.199 What am I here to talk to you about? Stories about outages are always interesting. That’s where we started thinking about this presentation. We're going to try to put you in the shoes of an on-call engineer responding to this incident at Intercom.
00:00:39.800 At Intercom, we use a volunteer on-call system, so any engineer could be on call for the entire company. Incidents are also a great opportunity for learning, allowing us to reflect on what happened. We'll relive the early stages of the outage in chronological order, which might seem chaotic—and that’s because it was.
00:00:54.719 I’m streamlining a lot of information here; there was a lot of involvement and parallel tracks. If you want the full story, feel free to find me later, but there’s just too much to cover right now.
00:01:10.680 First, we’ll discuss what happened, then the mistakes that led to it, and finally, the changes we made to our application and processes to ensure that something like this doesn't happen again. So, without any further ado, let’s take a step back to the morning of February 22nd.
00:01:29.439 It’s just after 8 a.m., and you are on call when you receive a page for elevated exceptions. You open Datadog and see a graph of ELB errors: 500 responses are slightly up across all web-facing fleets, but the absolute numbers look low, only about 1.5k. This isn't the largest volume ever; just a small percentage of requests are failing.
00:01:50.399 You notice exceptions such as an ActiveModel::RangeError. Next, you open Sentry and look at a specific exception. Zooming in a bit, you see a value of about 2.14 billion being reported, which is quite familiar: it’s the ceiling of a signed four-byte integer. Not a great situation.
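For reference, that familiar number is the ceiling of a signed four-byte integer:

```ruby
# Maximum value of a signed 32-bit integer (MySQL's INT type).
(2**31) - 1 # => 2_147_483_647
```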
00:02:19.080 After digging into the problem, you realize that we are failing to persist conversation part records because something is larger than a 32-bit integer. Looking back at the exception, the name of the failing field is not included, which complicates debugging further. I'm not going to dive into the full breakdown of Intercom’s data modeling; however, we have over 800 models.
00:02:47.400 What you should know is that the conversation part model represents each individual message exchange in a conversation. I realize I’ve left out explaining what Intercom is; if you're not familiar with our service, it’s a customer communication platform that enables our customers to connect with their own customers.
00:03:06.240 While the number of failing requests is low, nobody can start new conversations, effectively meaning that our product is completely down. Unfortunately, there are over 35 billion rows in the conversation part table, so simply running a migration is going to take a while.
00:03:39.360 At this point, we don’t have anything better to do, so we just start that migration. It's expected to take days at a minimum, so hopefully we can find a better solution; waiting that long isn't viable. So what do we do next? While the migration is running, we at least know that when it finishes, it should resolve our issues.
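As a sketch only (the table and column names here are illustrative, borrowed from later in the talk, and the real change ran through an online migration tool rather than a plain Rails migration), the change being kicked off is essentially a column widening:

```ruby
# A minimal sketch, not Intercom's actual migration: widen a 4-byte integer
# foreign key to an 8-byte bigint. On a table with ~35 billion rows, this kind
# of rewrite is why the estimate was "days".
class WidenConversationPartForeignKey < ActiveRecord::Migration[7.1]
  def up
    change_column :conversation_parts, :message_thread_id, :bigint
  end

  def down
    change_column :conversation_parts, :message_thread_id, :integer
  end
end
```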
00:04:09.079 However, we also need to collaborate with other teams, ensuring that everyone is up to speed and looking for alternative solutions. At this point, we've likely paged in about 10 to 15 people.
00:04:31.600 We need to engage Customer Support proactively, as none of our customers can write in to tell us they're experiencing issues; they also use Intercom. We need to clarify that something is happening and likely prepare Marketing for the eventuality that people will notice, potentially leading to social media outcry. It’s crucial to have a consistent response.
00:05:04.880 A program manager becomes involved at this stage to handle communications between various teams and escalate to our executive team since this is a serious outage. We must start pulling in more engineers, which means relevant teams from different aspects of the company are coming online.
00:05:30.160 Typically, our workday starts around 9:00 a.m., so we were fortunate to receive the page early. It's now about 9:35, and 75 minutes have passed since you were paged. We have brainstormed a few ideas.
00:05:51.960 One of our principal engineers joins the call and points out that Rails’ primary keys are signed by default. This might give you an idea of where this is heading. We have a four-byte integer, but we actually only utilize 31 of the 32 bits.
00:06:05.639 So we discuss the possibility of overriding the field, unpacking the integer, and repacking it as a negative value. While we're confident this would solve the immediate problem, we're not so sure it wouldn't cause other issues, and untangling those later would be painful.
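The bit-level trick being floated looks roughly like this; a sketch of the idea only, it was not shipped:

```ruby
# Reinterpret an unsigned 32-bit value as a signed one, wrapping it into the
# unused negative half of the existing 4-byte column.
def repack_as_signed_32bit(value)
  [value].pack("L").unpack1("l")
end

repack_as_signed_32bit(2_147_483_648) # => -2_147_483_648
```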
00:06:31.680 As a side note, I don’t really understand why Rails’ primary keys are signed by default. If anyone has insights, I’d love to hear them. However, there’s another idea: we could work around this by using other relationships. As mentioned earlier, we have over 800 models in Intercom, and our data modeling doesn’t adhere strictly to the fifth normal form of SQL.
00:06:52.400 So we come up with a plan, where we can look up values from another table. We’ll load the conversation and retrieve the necessary value from there, and we can monkey patch that over the accessors for the attribute that’s failing in the Active Record model.
00:07:04.760 This workaround is also not ideal, but we frame it as improving our model by removing a denormalized field, and we can't find a tangible reason the change would break anything. We feel this cleanup will be relatively straightforward, so we go ahead and implement it.
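A rough sketch of the shape of that workaround, with illustrative model and column names rather than Intercom's actual code:

```ruby
class ConversationPart < ApplicationRecord
  belongs_to :conversation

  # Reads come from the parent conversation instead of the overflowing
  # denormalized column on this table.
  def message_thread_id
    conversation.message_thread_id
  end

  # Writes to the denormalized column become no-ops so the 4-byte field is
  # never assigned a value it cannot hold.
  def message_thread_id=(_value)
    # intentionally dropped
  end
end
```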
00:07:29.600 After implementing the fix, we can see the exceptions graph tailing off, but it’s concerning that the number did not drop to zero post-fix. So what’s next? By now, it’s been 150 minutes on the call with many people involved. We thought we had a solution, but we’re still down.
00:08:00.800 It turns out that other models were also broken. The conversation part is critical to our function, so while the immediate impact is somewhat mitigated, there is still a cascade of issues from other parts. Customers can now communicate with us, but some functions are still impaired.
00:08:25.839 As we’re troubleshooting, we need to consider how we’re going to address this. We can run migrations for some tables, and while some are small, others are much larger. The largest has only a few hundred million rows, which feels manageable, but the key table has no viable workaround.
00:08:54.200 The impact is somewhat mitigated, but customers using features powered by this model are still down. The feature isn't part of our core flow, but the customers affected are understandably displeased.
00:09:10.160 They cannot have conversations, which frustrates them even more, so we make the call to brown out that specific feature entirely. We use feature flags at Intercom, so we can simply turn it off completely as a temporary measure.
00:09:26.279 Doing so places us back into a viable state. We have about 90 minutes to wait for the initial migration to conclude, so temporarily disabling a feature for that duration feels acceptable.
00:09:43.480 Fast forward, you’ve been on the call for five hours. It’s a long incident call. I don’t know if you’ve ever been part of a five-hour incident call, but it isn’t the most enjoyable experience.
00:10:02.640 The migration is finished, and we think it’s time to turn off that feature flag. Instantly, exceptions return, and we find ourselves down again. People begin to panic, so the team quickly re-enables the feature flag. What happened? Why didn’t the fix we were confident in actually work?
00:10:41.600 We use gh-ost, a tool for running online schema migrations on MySQL; it doesn't go through Rails’ standard migration rake tasks. The schema change had succeeded, but we were still seeing exceptions.
00:10:59.760 The issue arose because we hadn’t triggered a deployment. Intercom runs roughly 50,000 Rails processes to serve requests, and since there hadn’t been a deployment, none of those processes had restarted. They had cached the schema, so even though the database was functional, the processes were still failing.
00:11:32.360 That explained it. Fortunately, we got there quickly; we just triggered a redeployment. That seemed to resolve the issue, and we thought we were back up and everything was functioning properly.
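For context, this is standard Rails behaviour rather than anything Intercom-specific: each process caches column information per model, and a schema change made outside the usual migration path isn't picked up until the processes restart or the cache is reset.

```ruby
# Force a single model to re-read its columns from the database:
ConversationPart.reset_column_information

# Or clear the connection-level schema cache entirely:
ActiveRecord::Base.connection.schema_cache.clear!
```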
00:11:57.000 But, at this point, we must ask: does an incident end just because requests succeed again? In many respects, I believe the real work of an incident only begins once you reach operational normalcy. This next phase took weeks, while the earlier stages took hours or days.
00:12:14.360 In some sense, the real work is only beginning. Now that we’ve mitigated the issue, to derive any value from this incident, we need to learn from it and ensure it doesn’t occur again.
00:12:35.200 As one of my colleagues would say, we need to understand the socio-technical factors that contributed to the outage. So, how did it happen? This isn’t the first time we’ve encountered a big integer issue.
00:12:51.400 I believe our largest table contains about 80 billion rows, while the table I mentioned earlier had around 35 billion. Clearly, we had been through this before. When it first occurred, it was a big deal for Intercom; we pulled together a team of principal engineers and a working group to devise a plan.
00:13:20.080 We coordinated all relevant parties to ensure the plan’s successful execution. We thought we had anticipated everything that could go wrong and were able to address those issues, yet we hadn’t established or systematized our learnings, despite knowing we’d likely face this scenario again.
00:13:43.120 More than six years into operating the product, we had faced this issue before, yet we did not take the steps needed to document and repeat that success, and ultimately the same problem occurred again.
00:14:00.720 As a result, we began discussing how to create an alarm for scenarios where a table is approaching the limits of its primary keys, which would notify when a migration is needed. Every alarm in Intercom has a runbook, so that one of our on-call engineers can respond and take appropriate action.
00:14:21.840 You might wish to see the runbook for that. It stated: "This primary key for the table is approaching the maximum integer value; please migrate it to BigInteger." There were no instructions or insights about dependencies or potential issues.
00:14:49.440 In 2023, when this alarm triggered, an engineer simply acknowledged the notification and ran the migration it asked for, believing that resolved the problem. The alarm fired when the table reached 85% of the limit.
00:15:04.160 In practice, it took several months to actually reach the 100% mark. Fundamentally, we overlooked the foreign keys in our schema.
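A minimal sketch of what such an alarm check can look like; this is an illustration rather than Intercom's implementation, and it only makes sense for tables whose primary key is still a 4-byte INT:

```ruby
# Compare each table's AUTO_INCREMENT counter against the signed 32-bit ceiling
# and flag anything past the 85% threshold mentioned above.
INT32_MAX = (2**31) - 1
THRESHOLD = 0.85

rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT table_name AS table_name, auto_increment AS auto_increment
  FROM information_schema.tables
  WHERE table_schema = DATABASE() AND auto_increment IS NOT NULL
SQL

rows.each do |row|
  next if row["auto_increment"].to_i < INT32_MAX * THRESHOLD
  warn "#{row['table_name']} is past 85% of the 32-bit primary key limit"
end
```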
00:15:30.080 To ensure this doesn’t happen again, we resolved to implement checks within our continuous integration framework. When the team first tackled this, they placed some code in our RSpec setup that ran a test migration and bumped the auto-increment counter of our comments table, so that any spec that couldn't handle a large comment ID would fail and signal that we were in trouble.
00:16:04.079 However, burying that sort of test deep in the RSpec setup wasn't great: it was hard to keep track of and carried no context about why it mattered. It was easy to forget about unless you were someone who already knew its purpose.
00:16:30.720 So we looked at how to build a more resilient measure. We monkey patched the create_table method used by migrations to check whether we are in a development or test environment and, if so, hash the table name into a large starting value for the table's auto-increment counter.
00:16:55.360 If the primary key is a bigint, we push that starting value past a trillion. This way, our tests fail safely if any model tries to write a value into a field that is too small to hold it.
00:17:17.560 This approach also keeps IDs unique across tables. Previously, many tables started counting from the same point, which led to errors like passing a conversation part ID where a conversation ID was expected, since the two could hold the same value.
00:17:38.720 Ensuring that every table now starts from a unique offset reduces the chance of such errors, and it cleaned up a set of flaky tests that had been a burden.
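One hedged way to express that patch; the details here (the CRC hash, the trillion offset, the MySQL adapter hook) are illustrative rather than Intercom's exact code:

```ruby
require "zlib"

module OffsetAutoIncrement
  # In dev/test, give every newly created table a distinct, deliberately large
  # AUTO_INCREMENT start so that (a) IDs from different tables never collide in
  # specs and (b) writing a big ID into a 4-byte column fails fast.
  def create_table(table_name, **options, &block)
    result = super
    if Rails.env.development? || Rails.env.test?
      pk_name = primary_key(table_name)
      pk = pk_name && columns(table_name).find { |c| c.name == pk_name }
      if pk
        offset = Zlib.crc32(table_name.to_s) % 1_000_000               # unique-ish per table
        offset += 1_000_000_000_000 if pk.sql_type.include?("bigint")  # only fits in 8 bytes
        execute("ALTER TABLE #{quote_table_name(table_name)} AUTO_INCREMENT = #{offset + 1}")
      end
    end
    result
  end
end

ActiveRecord::ConnectionAdapters::AbstractMysqlAdapter.prepend(OffsetAutoIncrement)
```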
00:17:57.760 Additionally, we updated the error messages to include more context. Instead of simply indicating a value was out of range, the expanded message will now specify the particular field causing the issue.
00:18:14.520 Now when we see errors like the earlier example of 2.14 billion being out of range, it includes "for field message thread ID." This type of context will reduce the time needed to resolve issues during incidents, allowing us to focus on actionable tasks.
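One hedged way to add that kind of context, not necessarily the approach Intercom took, is to re-raise the range error with the attribute's name attached at the point where ActiveRecord serializes the value for the database:

```ruby
# Assumes the RangeError surfaces when the attribute is serialized for the
# database via ActiveModel::Attribute#value_for_database.
module RangeErrorWithFieldName
  def value_for_database
    super
  rescue ActiveModel::RangeError => e
    raise ActiveModel::RangeError, "#{e.message} for field #{name}"
  end
end

ActiveModel::Attribute.prepend(RangeErrorWithFieldName)
```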
00:18:40.800 I want to give a quick shout-out. When this talk was announced on the Rails World Twitter account, someone mentioned Active Record Doctor, a tool that could have helped us avoid this issue. It offers multiple features for checking database schemas and suggesting best practices.
00:19:06.640 While it includes a check for mismatched foreign key types, that check as implemented relies on conventions we don’t use and wouldn’t have fit our architecture. However, if it fits your system, it could be beneficial.
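If you want to try it, the gem goes in your development group; the specific task names shown are from memory, so verify them against the gem's documentation:

```ruby
# Gemfile
gem "active_record_doctor", group: :development

# Then, from the shell (check task names against the gem's docs):
#   bundle exec rake active_record_doctor
#   bundle exec rake active_record_doctor:mismatched_foreign_key_type
```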
00:19:25.160 That brings us to the conclusion of this incident. I think it serves as a noteworthy example of the importance of documenting processes adequately, especially when learning from outages. When writing a runbook, always be sure to include the context and potential pitfalls.
00:19:45.480 When you learn from such experiences, be intentional with your actions, ensuring that they are meaningful and practical. Thank you very much.