00:00:10.280
Hey everyone. So, who am I? You just heard a little bit. I'm a Staff Engineer at Intercom. I've worked here for about seven and a half years, and for the last five, I've been on our data stores team.
00:00:14.360
The data stores team at Intercom manages MySQL, Elasticsearch, DynamoDB, and also parts of our core Rails platform and various components of our application stack.
00:00:22.199
What am I here to talk to you about? Stories about outages are always interesting. That’s where we started thinking about this presentation. We're going to try to put you in the shoes of an on-call engineer responding to this incident at Intercom.
00:00:39.800
At Intercom, we use a volunteer on-call system, so any engineer could be on call for the entire company. Incidents are also a great opportunity for learning, allowing us to reflect on what happened. We'll relive the early stages of the outage in chronological order, which might seem chaotic—and that’s because it was.
00:00:54.719
I’m streamlining a lot of information here; there was a lot of involvement and parallel tracks. If you want the full story, feel free to find me later, but there’s just too much to cover right now.
00:01:10.680
First, we’ll discuss what happened, then the mistakes that led to it, and finally, the changes we made to our application and processes to ensure that something like this doesn't happen again. So, without any further ado, let’s take a step back to the morning of February 22nd.
00:01:29.439
It’s just after 8 a.m., and you are on call when you receive a page for elevated exceptions. You open Datadog and see a graph of ELB errors: 500 responses are slightly up across all web-facing fleets, but the absolute numbers look low, only about 1.5k. This isn't the largest volume ever; just a small percentage of requests are failing.
00:01:50.399
You notice exceptions such as an ActiveModel::RangeError. Next, you open Sentry and look at one specific exception. Zooming in a bit, you see the value being reported is about 2.14 billion, which looks very familiar: it's the maximum of a signed four-byte integer. Not a great situation.
00:02:19.080
After digging into the problem, you realize that we are failing to persist conversation part records because a value is larger than a 32-bit integer column can hold. Unfortunately, the name of the offending field is not included in the exception, which complicates debugging further. I'm not going to dive into a full breakdown of Intercom's data modeling here; suffice it to say we have over 800 models.
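To make that concrete, here is roughly what the failure looked like. The model and column names below are stand-ins rather than our exact schema, but the error message format is Rails' own.

```ruby
# Illustrative reproduction; ConversationPart and message_thread_id are
# stand-ins for the real model and column.
part = ConversationPart.new(message_thread_id: 2_147_483_648)
part.save!
# => ActiveModel::RangeError: 2147483648 is out of range for
#    ActiveModel::Type::Integer with limit 4 bytes
# Note: no column name in the message, which is what slowed debugging down.
```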
00:02:47.400
What you should know is that the conversation part model represents each individual message exchange in a conversation. I realize I’ve left out explaining what Intercom is; if you're not familiar with our service, it’s a customer communication platform that enables our customers to connect with their own customers.
00:03:06.240
While the number of failing requests is low, nobody can start new conversations, effectively meaning that our product is completely down. Unfortunately, there are over 35 billion rows in the conversation part table, so simply running a migration is going to take a while.
00:03:39.360
At this point, we don’t have anything better to do, so we just start that migration, knowing it's expected to take days at a minimum. Hopefully we can find something faster in the meantime, because waiting days is not viable. So what do we do next? The migration is running, and we know that when it finishes it should resolve the issue.
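For reference, the schema change itself is tiny; it's the table size that makes it slow. A sketch of what we kicked off, with an assumed column name and Rails version tag (in practice we ran the change through an online schema migration tool, as you'll see later):

```ruby
# Sketch only: "message_thread_id" is an assumed name for the overflowing
# field. On a table with tens of billions of rows this takes days.
class WidenConversationPartColumn < ActiveRecord::Migration[7.0]
  def up
    change_column :conversation_parts, :message_thread_id, :bigint
  end

  def down
    change_column :conversation_parts, :message_thread_id, :integer
  end
end
```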
00:04:09.079
However, we also need to collaborate with other teams, ensuring that everyone is up to speed and looking for alternative solutions. At this point, we've likely paged in about 10 to 15 people.
00:04:31.600
We need to engage Customer Support proactively, as none of our customers can write in to tell us they're experiencing issues; they also use Intercom. We need to clarify that something is happening and likely prepare Marketing for the eventuality that people will notice, potentially leading to social media outcry. It’s crucial to have a consistent response.
00:05:04.880
A program manager becomes involved at this stage to handle communications between various teams and escalate to our executive team since this is a serious outage. We must start pulling in more engineers, which means relevant teams from different aspects of the company are coming online.
00:05:30.160
Typically, our workday starts around 9:00 a.m., so we were fortunate to receive the page early. It's now about 9:35, and 75 minutes have passed since you were paged. We have brainstormed a few ideas.
00:05:51.960
One of our principal engineers joins the call and points out that Rails’ primary keys are signed by default. This might give you an idea of where this is heading: we have a four-byte integer, but because IDs are always positive, we only ever use 31 of its 32 bits.
00:06:05.639
So we discuss the possibility of overriding the field, unpacking the integer, and repacking it as a negative value. While there's confidence that this would solve the immediate problem, we're not so sure it wouldn't cause other issues, and unwinding it later would be painful.
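To illustrate the idea (and why it felt risky), here's a toy version of that reinterpretation, not a patch we would actually ship:

```ruby
# Toy illustration of the rejected idea: values past the signed 32-bit
# maximum wrap into the unused negative half of the range.
SIGNED_INT32_MAX = 2**31 - 1 # 2_147_483_647, where the IDs were stuck

def as_signed_32bit(value)
  # Pack as unsigned 32-bit, reread as signed 32-bit.
  [value].pack("L").unpack1("l")
end

as_signed_32bit(SIGNED_INT32_MAX)     # =>  2147483647 (unchanged)
as_signed_32bit(SIGNED_INT32_MAX + 1) # => -2147483648 (wraps negative)
```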
00:06:31.680
As a side note, I don’t really understand why Rails’ primary keys are signed by default. If anyone has insights, I’d love to hear them. However, there’s another idea: we could work around this by using other relationships. As mentioned earlier, we have over 800 models in Intercom, and our data modeling doesn’t adhere strictly to the fifth normal form of SQL.
00:06:52.400
So we come up with a plan: look the value up from another table. We’ll load the conversation, retrieve the value we need from there, and monkey patch that lookup over the accessors for the failing attribute on the Active Record model.
00:07:04.760
This workaround is also not ideal, but we can frame it as improving our model by removing a denormalized field, and we can't find a tangible reason the change would break anything. The cleanup feels relatively straightforward, so we go ahead and implement it.
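A minimal sketch of what that monkey patch could look like; the model, association, and column names here are assumptions for illustration, not Intercom's actual code:

```ruby
# Hypothetical shape of the workaround: stop persisting the denormalized
# column and answer reads from the parent record instead.
class ConversationPart < ApplicationRecord
  belongs_to :conversation

  # Reader now looks the value up on the conversation.
  def message_thread_id
    conversation&.message_thread_id
  end

  # Writer becomes a no-op so we never try to push an oversized value
  # into the 32-bit column while the real migration is still running.
  def message_thread_id=(_value)
    nil
  end
end
```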
00:07:29.600
After implementing the fix, we can see the exceptions graph tailing off, but it’s concerning that the number did not drop to zero post-fix. So what’s next? By now, it’s been 150 minutes on the call with many people involved. We thought we had a solution, but we’re still down.
00:08:00.800
It turns out that other models were also broken. The conversation part model is the most critical one, so the immediate impact is largely mitigated, but there is still a cascade of issues from other parts of the product. Customers can start conversations again, but some functionality is still impaired.
00:08:25.839
As we troubleshoot, we need to decide how to address the rest. We can run migrations for some tables; some are small and others larger, but the largest has only a few hundred million rows, which feels manageable. One table, though, has no workaround like the one we used for conversation parts.
00:08:54.200
So the impact is somewhat mitigated, but customers using the feature powered by that model are still down. It isn't part of our core flow, but the customers affected are understandably displeased.
00:09:10.160
They cannot have those conversations, which frustrates them even more, so we make the call to brown out that specific feature entirely. We use feature flags at Intercom, so we can simply turn it off completely as a temporary measure.
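In spirit, the brownout is just a guard in front of the feature. The flag helper, flag name, and controller below are hypothetical placeholders, not Intercom's actual flagging system:

```ruby
# Hypothetical brownout guard; FeatureFlag.enabled? and the flag name
# are placeholders for whatever flagging system you use.
class AffectedFeatureController < ApplicationController
  before_action :enforce_brownout

  private

  def enforce_brownout
    return if FeatureFlag.enabled?(:affected_feature)

    render json: { error: "This feature is temporarily unavailable." },
           status: :service_unavailable
  end
end
```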
00:09:26.279
Doing so places us back into a viable state. We have about 90 minutes to wait for the initial migration to conclude, so temporarily disabling a feature for that duration feels acceptable.
00:09:43.480
Fast forward, you’ve been on the call for five hours. It’s a long incident call. I don’t know if you’ve ever been part of a five-hour incident call, but it isn’t the most enjoyable experience.
00:10:02.640
The migration is finished, and we think it’s time to turn that feature back on. Instantly, exceptions return, and we find ourselves down again. People begin to panic, so the team quickly turns the feature back off. What happened? Why didn’t the fix we were so confident in actually work?
00:10:41.600
We're using gh-ost, a tool designed for online schema migrations on MySQL, rather than Rails’ standard rake migration tasks. The schema change had completed successfully, but we were still seeing exceptions.
00:10:59.760
The issue was that we hadn’t triggered a deployment. Intercom runs roughly 50,000 Rails processes to serve requests, and since there hadn’t been a deployment, none of those processes had restarted. They had cached the old schema, so even though the database column was now big enough, the processes were still failing.
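This is standard Rails behaviour: each process memoizes column metadata the first time a model loads, so a schema change made outside the app stays invisible until the process restarts or the cache is reset. A console illustration, with the model and column names assumed:

```ruby
# Each Rails process caches column definitions, so an out-of-band schema
# change (for example via gh-ost) isn't seen until processes restart.
ConversationPart.columns_hash["message_thread_id"].sql_type
# => "int(11)"      # stale: cached before the online migration finished

ConversationPart.reset_column_information
ConversationPart.columns_hash["message_thread_id"].sql_type
# => "bigint(20)"   # fresh: what the database actually has now
```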
00:11:32.360
That was a clear explanation, and fortunately we got there quickly: we just triggered a redeployment. That seemed to resolve the issue, and we thought we were back up with everything functioning properly.
00:11:57.000
But, at this point, we must ask: does an incident end just because requests succeed again? In many respects, I believe incidents really only begin when you reach operational normalcy. This next phase took weeks, while the earlier stages took hours or even days.
00:12:14.360
In some sense, the real work is only beginning. Now that we’ve mitigated the issue, to derive any value from this incident, we need to learn from it and ensure it doesn’t occur again.
00:12:35.200
As one of my colleagues would say, we need to understand the socio-technical factors that contributed to the outage. So, how did it happen? This isn’t the first time we’ve encountered a big integer issue.
00:12:51.400
I believe our largest table contains about 80 billion rows, while the previous table I mentioned had around 35 billion. Clearly, we had been through this before. When it first occurred, it was a major day for Intercom; we pulled together a team of principal engineers and a working group to devise a plan.
00:13:20.080
We coordinated all relevant parties to ensure the plan’s successful execution. We thought we had anticipated everything that could go wrong and were able to address those issues, yet we hadn’t established or systematized our learnings, despite knowing we’d likely face this scenario again.
00:13:43.120
Over six years into operations, we had experienced this issue in the past and did not take the necessary steps to document and repeat our success, and ultimately the same issue occurred again.
00:14:00.720
As a result, we began discussing how to create an alarm for when a table is approaching the limit of its primary key, notifying us that a migration is needed. Every alarm at Intercom has a runbook, so that the on-call engineer can respond and take appropriate action.
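The check behind an alarm like this can be fairly simple. Here's a hedged sketch that compares MySQL's auto-increment counters against the signed 32-bit maximum; the 85% threshold matches the alarm described here, but the query aliases and reporting are illustrative, and it's only meaningful for tables whose keys are still four-byte integers:

```ruby
# Sketch of a primary-key headroom check against information_schema.
INT32_MAX = 2**31 - 1
THRESHOLD = 0.85

rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT table_name AS tbl, auto_increment AS auto_inc
  FROM information_schema.tables
  WHERE table_schema = DATABASE()
    AND auto_increment IS NOT NULL
SQL

rows.each do |row|
  usage = row["auto_inc"].to_f / INT32_MAX
  next if usage < THRESHOLD

  puts "#{row['tbl']} is at #{(usage * 100).round(1)}% of the signed INT range"
end
```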
00:14:21.840
You might wonder what the runbook for that alarm said. It stated: "The primary key for this table is approaching the maximum integer value; please migrate it to a big integer." There were no further instructions and no context about dependencies or potential pitfalls.
00:14:49.440
In 2023, when this alarm triggered, the on-call engineer acknowledged the notification, executed the command, and believed the problem was resolved. The alarm fires when a key reaches 85% of the limit.
00:15:04.160
In practice, it took several months to actually reach the 100% mark. Fundamentally, we had overlooked the foreign keys in our schema: the primary key was migrated to a big integer, but columns in other tables that store those IDs were still four-byte integers, so they overflowed as soon as the IDs crossed 2.1 billion.
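In schema terms, the gap looked something like this; the table and column names are illustrative, not our real schema:

```ruby
# Illustrative schema mismatch: the table whose IDs were growing had a
# bigint primary key, but a column elsewhere that stores those IDs was
# still a 4-byte integer, so it overflowed at 2,147,483,647.
ActiveRecord::Schema.define do
  create_table :message_threads, id: :bigint do |t|
    t.timestamps
  end

  create_table :conversation_parts, id: :bigint do |t|
    t.integer :message_thread_id # still 4 bytes: the overlooked "foreign key"
    t.timestamps
  end
end
```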
00:15:30.080
To ensure this doesn’t happen again, we resolved to implement checks in our continuous integration setup. When the team initially tackled this, they placed some code in our RSpec setup that ran a test migration and bumped the auto-increment counter of our comments table to a very large value. That way, any spec touching a comment ID would fail if some code path couldn't handle IDs that big, telling us we were in trouble.
00:16:04.079
However, burying this sort of check deep in the RSpec setup wasn't great: it was hard to keep track of and carried no context about why it mattered. It's easy to forget it exists unless you're someone who already knows its purpose.
00:16:30.720
So we investigated how we could build something more resilient. We monkey patched the create_table method used by migrations so that, in the development and test environments, it hashes the table name into a large, table-specific starting value for the auto-increment counter.
00:16:55.360
If the primary key is a big integer, we push its starting value up into the trillions. This way, our tests fail safely if any model attempts to insert one of these values into a field that’s too small.
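Here's a hedged reconstruction of that patch, not our actual code: the exact hashing, the adapter hook, and the assumption that new tables use Rails' default bigint primary keys are all mine, but it shows the shape of the idea.

```ruby
require "digest"

# Reconstruction of the idea: in dev and test, start each new table's
# auto-increment at a huge, table-specific offset so undersized columns
# and cross-table ID mix-ups fail fast. Assumes bigint primary keys.
module OffsetAutoIncrementOnCreate
  INT32_RANGE = 2**31 # first value that no longer fits a signed 4-byte INT

  def create_table(table_name, **options, &block)
    result = super
    if Rails.env.development? || Rails.env.test?
      # Hash the table name into a distinct offset in the trillions.
      offset = INT32_RANGE + (Digest::MD5.hexdigest(table_name.to_s).to_i(16) % 10**12)
      execute("ALTER TABLE #{quote_table_name(table_name)} AUTO_INCREMENT = #{offset}")
    end
    result
  end
end

ActiveRecord::ConnectionAdapters::AbstractMysqlAdapter.prepend(OffsetAutoIncrementOnCreate)
```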
00:17:17.560
This approach also gives every table a unique ID range. Previously, many models had similar names, which led to errors like passing a conversation part ID where a conversation ID was expected, since both tables produced IDs in overlapping ranges.
00:17:38.720
Ensuring that every table now starts from a unique offset reduces the chance of such mix-ups, and it effectively cleaned up some flaky tests that had been a burden.
00:17:57.760
Additionally, we updated the error messages to include more context. Instead of simply indicating a value was out of range, the expanded message will now specify the particular field causing the issue.
00:18:14.520
Now when we see an error like the earlier example of 2.14 billion being out of range, the message includes "for field message thread ID." That kind of context reduces the time needed to resolve issues during incidents, letting us focus on actionable tasks.
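I'm not showing our actual change here, but one hedged way to get that kind of context is to wrap attribute serialization so the error names the attribute; the module below is a sketch under that assumption:

```ruby
# Hypothetical sketch: annotate out-of-range errors with the attribute
# name at the point where values are serialized for the database.
module RangeErrorWithFieldName
  def value_for_database
    super
  rescue ActiveModel::RangeError => e
    raise ActiveModel::RangeError, "#{e.message} for field #{name}"
  end
end

ActiveModel::Attribute.prepend(RangeErrorWithFieldName)
```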
00:18:40.800
I want to give a quick shout-out. When this talk was announced on the Rails World Twitter account, someone mentioned Active Record Doctor, a tool that could have helped us avoid this issue. It offers multiple features for checking database schemas and suggesting best practices.
00:19:06.640
While it includes a check for mismatched foreign key types, that check relies on conventions we don't use, so it wouldn't have played well with our architecture. However, if it fits your system, it could be beneficial.
00:19:25.160
That brings us to the conclusion of this incident. I think it serves as a noteworthy example of the importance of documenting processes adequately, especially when learning from outages. When writing a runbook, always be sure to include the context and potential pitfalls.
00:19:45.480
When you learn from such experiences, be intentional with your actions, ensuring that they are meaningful and practical. Thank you very much.