Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages

Summarized using AI


Miles McGuire • September 26, 2024 • Toronto, Canada

In the video titled "Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages", Miles McGuire, a Staff Engineer at Intercom, discusses a significant outage that occurred on February 22, 2024. He emphasizes that such incidents present opportunities for learning. The root cause of the outage was a mismatch between foreign key and primary key data types, specifically a 32-bit foreign key referencing a 64-bit primary key. The following key points cover the timeline of events during the incident, the investigation process, and the lessons learned:

  • Incident Overview: McGuire provides an account of the incident from the perspective of an on-call engineer, outlining the steps taken from the moment the outage was detected.
  • Initial Response: Upon receiving a page, the on-call engineer noticed elevated exceptions and an increase in 500 error responses. Through the investigation, it became apparent that the conversation part records could not be persisted due to integer overflow issues.
  • Collaboration and Communication: McGuire highlights the importance of engaging various teams, including Customer Support and Marketing, to manage the incident and communicate effectively about the outage.
  • Workarounds and Solutions: The team initially kicked off a database migration; however, a workaround was needed once it became apparent that the migration would take days and that much downtime would be unacceptable.
  • Cascading Issues: Subsequent troubleshooting revealed that other models dependent on the conversation part were also impacted. The team had to consider disabling features temporarily to mitigate customer frustration.
  • Post-Incident Analysis: After resolving the immediate issues, McGuire stresses the importance of analyzing what went wrong. They found that lessons from previous outages were not adequately documented, leading to the same problems surfacing again.
  • Preventive Measures: The team established new checks within their continuous integration framework to prevent future occurrences of similar data type issues. This included implementing alarms for tables nearing primary key limits and improving error messaging with more context.
  • Importance of Documentation: Lastly, McGuire underscores the need for thorough documentation of processes and experiences in order to learn from incidents effectively.

In conclusion, this incident illustrates that being deliberate about systemic improvements and documentation in the aftermath of failures can help prevent future outages and enhance operational resilience.

Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages
Miles McGuire • September 26, 2024 • Toronto, Canada

Incidents are an opportunity to level up, and on 22 Feb 2024 Intercom had one of its most painful outages in recent memory. The root cause? A 32-bit foreign key referencing a 64-bit primary key. Miles McGuire shared what happened, why it happened, and what they are doing to ensure it won't happen again (including some changes you can make to your own Rails apps to help make sure you don’t make the same mistakes.)

#outage #lessonslearned

Thank you Shopify for sponsoring the editing and post-production of these videos. Check out insights from the Engineering team at: https://shopify.engineering/

Stay tuned: all 2024 Rails World videos will be subtitled in Japanese and Brazilian Portuguese soon thanks to our sponsor Happy Scribe, a transcription service built on Rails. https://www.happyscribe.com/

Rails World 2024

00:00:10.280 Hey everyone. So, who am I? You just heard a little bit. I'm a Staff Engineer at Intercom. I've worked here for about seven and a half years, and for the last five, I've been on our data stores team.
00:00:14.360 The data stores team at Intercom manages MySQL, Elasticsearch, DynamoDB, and also parts of our core Rails platform and various components of our application stack.
00:00:22.199 What am I here to talk to you about? Stories about outages are always interesting. That’s where we started thinking about this presentation. We're going to try to put you in the shoes of an on-call engineer responding to this incident at Intercom.
00:00:39.800 At Intercom, we use a volunteer on-call system, so any engineer could be on call for the entire company. Incidents are also a great opportunity for learning, allowing us to reflect on what happened. We'll relive the early stages of the outage in chronological order, which might seem chaotic—and that’s because it was.
00:00:54.719 I’m streamlining a lot of information here; there was a lot of involvement and parallel tracks. If you want the full story, feel free to find me later, but there’s just too much to cover right now.
00:01:10.680 First, we’ll discuss what happened, then the mistakes that led to it, and finally, the changes we made to our application and processes to ensure that something like this doesn't happen again. So, without any further ado, let’s take a step back to the morning of February 22nd.
00:01:29.439 It’s just after 8 a.m., and you are on call when you receive a page for elevated exceptions. You open Datadog and see a graph of ELB errors: 500 responses are slightly up across all web-facing fleets, but the absolute numbers look low, only about 1.5k. This isn't the largest volume ever; just a small percentage of requests are failing.
00:01:50.399 You notice exceptions such as an ActiveModel::RangeError. Next, you open Sentry and look at a specific exception. Zooming in a bit, you see a value of about 2.14 billion being reported, which is quite familiar: it’s the ceiling of a signed four-byte integer. Not a great situation.
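For reference, that familiar number is the ceiling of a signed four-byte integer:

```ruby
# Maximum value of a signed 32-bit integer (MySQL's INT type).
(2**31) - 1 # => 2_147_483_647
```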
00:02:19.080 After digging into the problem, you realize that we are failing to persist conversation part records because something is larger than a 32-bit integer. Looking back at the exception, the name of the failing field is not included, which complicates debugging further. I'm not going to dive into the full breakdown of Intercom’s data modeling; however, we have over 800 models.
00:02:47.400 What you should know is that the conversation part model represents each individual message exchange in a conversation. I realize I’ve left out explaining what Intercom is; if you're not familiar with our service, it’s a customer communication platform that enables our customers to connect with their own customers.
00:03:06.240 While the number of failing requests is low, nobody can start new conversations, effectively meaning that our product is completely down. Unfortunately, there are over 35 billion rows in the conversation part table, so simply running a migration is going to take a while.
00:03:39.360 At this point, we don’t have anything better to do, so we just start that migration. It's expected to take days at a minimum, so hopefully we can find a better solution; waiting that long isn't viable. So what do we do next? While the migration is running, we at least know that when it finishes, it should resolve our issues.
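As a sketch only (the table and column names here are illustrative, borrowed from later in the talk, and the real change ran through an online migration tool rather than a plain Rails migration), the change being kicked off is essentially a column widening:

```ruby
# A minimal sketch, not Intercom's actual migration: widen a 4-byte integer
# foreign key to an 8-byte bigint. On a table with ~35 billion rows, this kind
# of rewrite is why the estimate was "days".
class WidenConversationPartForeignKey < ActiveRecord::Migration[7.1]
  def up
    change_column :conversation_parts, :message_thread_id, :bigint
  end

  def down
    change_column :conversation_parts, :message_thread_id, :integer
  end
end
```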
00:04:09.079 However, we also need to collaborate with other teams, ensuring that everyone is up to speed and looking for alternative solutions. At this point, we've likely paged in about 10 to 15 people.
00:04:31.600 We need to engage Customer Support proactively, as none of our customers can write in to tell us they're experiencing issues; they also use Intercom. We need to clarify that something is happening and likely prepare Marketing for the eventuality that people will notice, potentially leading to social media outcry. It’s crucial to have a consistent response.
00:05:04.880 A program manager becomes involved at this stage to handle communications between various teams and escalate to our executive team since this is a serious outage. We must start pulling in more engineers, which means relevant teams from different aspects of the company are coming online.
00:05:30.160 Typically, our workday starts around 9:00 a.m., so we were fortunate to receive the page early. It's now about 9:35, and 75 minutes have passed since you were paged. We have brainstormed a few ideas.
00:05:51.960 One of our principal engineers joins the call and points out that Rails’ primary keys are signed by default. This might give you an idea of where this is heading. We have a four-byte integer, but we actually only utilize 31 of the 32 bits.
00:06:05.639 So we discuss the possibility of overriding the field, unpacking the integer, and repacking it as a negative value. While we're confident this would solve the immediate problem, we're not so sure it wouldn't cause other issues, and untangling those later would be painful.
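The bit-level trick being floated looks roughly like this; a sketch of the idea only, it was not shipped:

```ruby
# Reinterpret an unsigned 32-bit value as a signed one, wrapping it into the
# unused negative half of the existing 4-byte column.
def repack_as_signed_32bit(value)
  [value].pack("L").unpack1("l")
end

repack_as_signed_32bit(2_147_483_648) # => -2_147_483_648
```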
00:06:31.680 As a side note, I don’t really understand why Rails’ primary keys are signed by default. If anyone has insights, I’d love to hear them. However, there’s another idea: we could work around this by using other relationships. As mentioned earlier, we have over 800 models in Intercom, and our data modeling doesn’t adhere strictly to the fifth normal form of SQL.
00:06:52.400 So we come up with a plan, where we can look up values from another table. We’ll load the conversation and retrieve the necessary value from there, and we can monkey patch that over the accessors for the attribute that’s failing in the Active Record model.
00:07:04.760 This workaround is also not ideal, but we frame it as improving our model by removing a denormalized field, and we can't find a tangible reason the change would break anything. We feel this cleanup will be relatively straightforward, so we go ahead and implement it.
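A rough sketch of the shape of that workaround, with illustrative model and column names rather than Intercom's actual code:

```ruby
class ConversationPart < ApplicationRecord
  belongs_to :conversation

  # Reads come from the parent conversation instead of the overflowing
  # denormalized column on this table.
  def message_thread_id
    conversation.message_thread_id
  end

  # Writes to the denormalized column become no-ops so the 4-byte field is
  # never assigned a value it cannot hold.
  def message_thread_id=(_value)
    # intentionally dropped
  end
end
```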
00:07:29.600 After implementing the fix, we can see the exceptions graph tailing off, but it’s concerning that the number did not drop to zero post-fix. So what’s next? By now, it’s been 150 minutes on the call with many people involved. We thought we had a solution, but we’re still down.
00:08:00.800 It turns out that other models were also broken. The conversation part is critical to our function, so while the immediate impact is somewhat mitigated, there is still a cascade of issues from other parts. Customers can now communicate with us, but some functions are still impaired.
00:08:25.839 As we’re troubleshooting, we need to consider how we’re going to address this. We can run migrations for some tables, and while some are small, others are much larger. The largest has only a few hundred million rows, which feels manageable, but the key table has no viable workaround.
00:08:54.200 The impact is somewhat mitigated, but customers using features powered by this model are still down. The feature isn't part of our core flow, but the customers affected are understandably displeased.
00:09:10.160 They cannot have conversations, which frustrates them even more, so we make the call to brown out that specific feature entirely. We use feature flags at Intercom, so we can simply turn it off completely as a temporary measure.
00:09:26.279 Doing so places us back into a viable state. We have about 90 minutes to wait for the initial migration to conclude, so temporarily disabling a feature for that duration feels acceptable.
00:09:43.480 Fast forward, you’ve been on the call for five hours. It’s a long incident call. I don’t know if you’ve ever been part of a five-hour incident call, but it isn’t the most enjoyable experience.
00:10:02.640 The migration is finished, and we think it’s time to turn off that feature flag. Instantly, exceptions return, and we find ourselves down again. People begin to panic, so the team quickly re-enables the feature flag. What happened? Why didn’t the fix we were confident in actually work?
00:10:41.600 We use gh-ost, a tool for running online schema migrations on MySQL; it doesn't go through Rails’ standard migration rake tasks. The schema change had succeeded, but we were still seeing exceptions.
00:10:59.760 The issue arose because we hadn’t triggered a deployment. Intercom runs roughly 50,000 Rails processes to serve requests, and since there hadn’t been a deployment, none of those processes had restarted. They had cached the schema, so even though the database was functional, the processes were still failing.
00:11:32.360 That explained it. Fortunately, we got there quickly; we just triggered a redeployment. That seemed to resolve the issue, and we thought we were back up and everything was functioning properly.
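For context, this is standard Rails behaviour rather than anything Intercom-specific: each process caches column information per model, and a schema change made outside the usual migration path isn't picked up until the processes restart or the cache is reset.

```ruby
# Force a single model to re-read its columns from the database:
ConversationPart.reset_column_information

# Or clear the connection-level schema cache entirely:
ActiveRecord::Base.connection.schema_cache.clear!
```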
00:11:57.000 But, at this point, we must ask: does an incident end just because requests succeed again? In many respects, I believe the real work of an incident only begins once you reach operational normalcy. This next phase took weeks, while the earlier stages took hours or days.
00:12:14.360 In some sense, the real work is only beginning. Now that we’ve mitigated the issue, to derive any value from this incident, we need to learn from it and ensure it doesn’t occur again.
00:12:35.200 As one of my colleagues would say, we need to understand the socio-technical factors that contributed to the outage. So, how did it happen? This isn’t the first time we’ve encountered a big integer issue.
00:12:51.400 I believe our largest table contains about 80 billion rows, while the table I mentioned earlier had around 35 billion. Clearly, we had been through this before. When it first occurred, it was a big deal for Intercom; we pulled together a team of principal engineers and a working group to devise a plan.
00:13:20.080 We coordinated all relevant parties to ensure the plan’s successful execution. We thought we had anticipated everything that could go wrong and were able to address those issues, yet we hadn’t established or systematized our learnings, despite knowing we’d likely face this scenario again.
00:13:43.120 More than six years into operating the product, we had faced this issue before, yet we did not take the steps needed to document and repeat that success, and ultimately the same problem occurred again.
00:14:00.720 As a result, we began discussing how to create an alarm for scenarios where a table is approaching the limits of its primary keys, which would notify when a migration is needed. Every alarm in Intercom has a runbook, so that one of our on-call engineers can respond and take appropriate action.
00:14:21.840 You might wish to see the runbook for that. It stated: "This primary key for the table is approaching the maximum integer value; please migrate it to BigInteger." There were no instructions or insights about dependencies or potential issues.
00:14:49.440 In 2023, when this alarm triggered, an engineer simply acknowledged the notification and ran the migration it asked for, believing that resolved the problem. The alarm fired when the table reached 85% of the limit.
00:15:04.160 In practice, it took several months to actually reach the 100% mark. Fundamentally, we overlooked the foreign keys in our schema.
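A minimal sketch of what such an alarm check can look like; this is an illustration rather than Intercom's implementation, and it only makes sense for tables whose primary key is still a 4-byte INT:

```ruby
# Compare each table's AUTO_INCREMENT counter against the signed 32-bit ceiling
# and flag anything past the 85% threshold mentioned above.
INT32_MAX = (2**31) - 1
THRESHOLD = 0.85

rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT table_name AS table_name, auto_increment AS auto_increment
  FROM information_schema.tables
  WHERE table_schema = DATABASE() AND auto_increment IS NOT NULL
SQL

rows.each do |row|
  next if row["auto_increment"].to_i < INT32_MAX * THRESHOLD
  warn "#{row['table_name']} is past 85% of the 32-bit primary key limit"
end
```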
00:15:30.080 To ensure this doesn’t happen again, we resolved to implement checks within our continuous integration framework. When the team first tackled this, they placed some code in our RSpec setup that ran a test migration and bumped the auto-increment counter of our comments table, so that any spec that couldn't handle a large comment ID would fail and signal that we were in trouble.
00:16:04.079 However, burying that sort of test deep in the RSpec setup wasn't great: it was hard to keep track of and carried no context about why it mattered. It was easy to forget about unless you were someone who already knew its purpose.
00:16:30.720 So we looked at how to build a more resilient measure. We monkey patched the create_table method used by migrations to check whether we are in a development or test environment and, if so, hash the table name into a large starting value for the table's auto-increment counter.
00:16:55.360 If the primary key is a bigint, we push that starting value past a trillion. This way, our tests fail safely if any model tries to write a value into a field that is too small to hold it.
00:17:17.560 This approach also keeps IDs unique across tables. Previously, many tables started counting from the same point, which led to errors like passing a conversation part ID where a conversation ID was expected, since the two could hold the same value.
00:17:38.720 Ensuring that every table now starts from a unique offset reduces the chance of such errors, and it cleaned up a set of flaky tests that had been a burden.
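One hedged way to express that patch; the details here (the CRC hash, the trillion offset, the MySQL adapter hook) are illustrative rather than Intercom's exact code:

```ruby
require "zlib"

module OffsetAutoIncrement
  # In dev/test, give every newly created table a distinct, deliberately large
  # AUTO_INCREMENT start so that (a) IDs from different tables never collide in
  # specs and (b) writing a big ID into a 4-byte column fails fast.
  def create_table(table_name, **options, &block)
    result = super
    if Rails.env.development? || Rails.env.test?
      pk_name = primary_key(table_name)
      pk = pk_name && columns(table_name).find { |c| c.name == pk_name }
      if pk
        offset = Zlib.crc32(table_name.to_s) % 1_000_000               # unique-ish per table
        offset += 1_000_000_000_000 if pk.sql_type.include?("bigint")  # only fits in 8 bytes
        execute("ALTER TABLE #{quote_table_name(table_name)} AUTO_INCREMENT = #{offset + 1}")
      end
    end
    result
  end
end

ActiveRecord::ConnectionAdapters::AbstractMysqlAdapter.prepend(OffsetAutoIncrement)
```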
00:17:57.760 Additionally, we updated the error messages to include more context. Instead of simply indicating a value was out of range, the expanded message will now specify the particular field causing the issue.
00:18:14.520 Now when we see errors like the earlier example of 2.14 billion being out of range, it includes "for field message thread ID." This type of context will reduce the time needed to resolve issues during incidents, allowing us to focus on actionable tasks.
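One hedged way to add that kind of context, not necessarily the approach Intercom took, is to re-raise the range error with the attribute's name attached at the point where ActiveRecord serializes the value for the database:

```ruby
# Assumes the RangeError surfaces when the attribute is serialized for the
# database via ActiveModel::Attribute#value_for_database.
module RangeErrorWithFieldName
  def value_for_database
    super
  rescue ActiveModel::RangeError => e
    raise ActiveModel::RangeError, "#{e.message} for field #{name}"
  end
end

ActiveModel::Attribute.prepend(RangeErrorWithFieldName)
```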
00:18:40.800 I want to give a quick shout-out. When this talk was announced on the Rails World Twitter account, someone mentioned Active Record Doctor, a tool that could have helped us avoid this issue. It offers multiple features for checking database schemas and suggesting best practices.
00:19:06.640 While it includes a check for mismatched foreign key types, that check as implemented relies on conventions we don’t use and wouldn’t have fit our architecture. However, if it fits your system, it could be beneficial.
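If you want to try it, the gem goes in your development group; the specific task names shown are from memory, so verify them against the gem's documentation:

```ruby
# Gemfile
gem "active_record_doctor", group: :development

# Then, from the shell (check task names against the gem's docs):
#   bundle exec rake active_record_doctor
#   bundle exec rake active_record_doctor:mismatched_foreign_key_type
```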
00:19:25.160 That brings us to the conclusion of this incident. I think it serves as a noteworthy example of the importance of documenting processes adequately, especially when learning from outages. When writing a runbook, always be sure to include the context and potential pitfalls.
00:19:45.480 When you learn from such experiences, be intentional with your actions, ensuring that they are meaningful and practical. Thank you very much.