00:00:10.080
hey
00:00:11.200
everyone um so who am I you just heard a
00:00:14.160
little bit I'm a Staff engineer at
00:00:15.639
intercom I've worked here for about
00:00:17.439
seven and a half years and for the
00:00:19.640
last five I've been on our data stores
00:00:21.600
team so Data Stores at Intercom owns
00:00:24.199
MySQL Memcached Elasticsearch DynamoDB but
00:00:27.599
also things like our core Rails platform
00:00:29.720
and
00:00:30.880
uh various parts of like our application
00:00:33.120
stack too
00:00:36.920
um what am I here to talk to you
00:00:39.600
about uh stories about outages are fun
00:00:43.440
that's kind of where we started with
00:00:44.879
thinking about this we're going to try
00:00:46.399
and put you in the shoes of like an on
00:00:47.960
call engineer responding to this
00:00:49.440
incident at intercom we use a volunteer
00:00:51.719
led on-call system so any engineer could
00:00:54.520
be on call for the entire
00:00:56.559
company uh outages are also a great
00:00:59.120
learning opportunity so it's a chance to
00:01:01.760
like really reflect on things um we're
00:01:05.960
going to like try and relive the early
00:01:08.439
stages of the outage so we'll go through
00:01:10.960
it in a kind of timeline order and it
00:01:12.720
might seem pretty chaotic and that's
00:01:14.240
because it was uh I'm actually editing
00:01:17.240
down a lot of stuff that happened
00:01:19.040
there were a lot of people involved
00:01:20.520
a lot of parallel tracks if you want a
00:01:22.119
full story you can come find me later
00:01:23.840
but there's way too much to get into
00:01:25.720
right
00:01:26.640
now uh we're going to talk a bit about
00:01:29.280
after we hear what happened like what
00:01:31.040
mistakes led to it like how did we get
00:01:32.680
into a situation like this and finally
00:01:35.600
we'll talk about changes we made to our
00:01:37.360
application and our processes to make
00:01:39.439
sure something like this doesn't happen
00:01:41.000
again so without any further ado it's
00:01:43.799
time to take a step back to the morning
00:01:45.560
of February
00:01:47.119
22nd it's just after 8 a.m. you're on
00:01:50.200
call and you've just been paged because
00:01:51.759
of elevated
00:01:56.159
exceptions uh you crack open Datadog
00:02:00.039
and you see this graph showing like ELB
00:02:01.960
errors this is like 500 responses and
00:02:04.000
they're up kind of from all web facing
00:02:06.159
fleets but uh the numbers look quite low
00:02:09.080
like 1.5k that's uh not the biggest uh
00:02:13.319
volume ever only a small percentage of
00:02:15.480
requests are
00:02:16.599
failing uh and you see these exceptions
00:02:18.879
saying like ActiveModel RangeError uh
00:02:21.879
you go and open up
00:02:24.239
Sentry and you see an exception
00:02:26.680
like this so uh if we zoom in a little
00:02:29.920
bit uh we see
00:02:32.000
2.14 billion sounds pretty familiar
00:02:35.800
an integer is limited to four
00:02:38.840
bytes so uh yeah not a great
00:02:42.159
situation
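For reference, that error comes straight out of ActiveModel's integer type, and you can reproduce it in isolation. A minimal sketch using plain ActiveModel, no Intercom code involved:

```ruby
require "active_model"

# 2**31 is one past the largest signed 4-byte integer (2_147_483_647),
# which is the ~2.14 billion ceiling the ID column just hit.
type = ActiveModel::Type::Integer.new(limit: 4)
type.serialize(2**31)
# raises ActiveModel::RangeError:
#   2147483648 is out of range for ActiveModel::Type::Integer with limit 4 bytes
```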
00:02:43.680
uh what do we know uh so after some time
00:02:47.200
digging into it you realize that we're
00:02:49.840
failing to persist conversation part
00:02:51.680
records because something is bigger than
00:02:53.480
a 32-bit integer um if we went back
00:02:56.280
you'd see that like the name of the
00:02:58.159
field is actually not included in the
00:02:59.480
exception which made things a little
00:03:01.720
harder to debug I'm not going to go into
00:03:04.159
a full breakdown of intercom data
00:03:06.280
modeling uh we have over 800 models but
00:03:09.720
uh what I can tell you is the
00:03:12.080
conversation part is like the model in
00:03:14.400
intercom one record for every individual
00:03:16.760
message exchanged in the conversation um
00:03:20.000
I've just realized I've kind of left out
00:03:21.519
what intercom is if you're not familiar
00:03:23.239
with our service it's a customer
00:03:25.080
communication platform so our
00:03:28.360
customers use intercom to talk to their
00:03:31.599
customers uh like we could still serve the
00:03:35.080
application and the number of
00:03:37.159
failing requests was
00:03:39.159
low but nobody could actually start new
00:03:41.840
conversations so effectively our product
00:03:44.239
was totally
00:03:45.879
down uh and unfortunately for us there
00:03:48.400
are over 35 billion rows on the
00:03:50.439
conversation part table so just kicking
00:03:53.000
off a migration is going to take a while
00:03:55.319
at this point we don't really have
00:03:56.480
anything better to do so we just start
00:03:58.519
that migration it's going to take at
00:04:00.040
least days uh but
00:04:02.720
hopefully we can come up with something
00:04:03.760
better because that's not a workable
00:04:06.159
solution
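Conceptually, the fix is a one-line migration like the sketch below, with the table name taken from the talk and everything else illustrative. In practice a blocking ALTER TABLE on 35 billion rows is a non-starter, which is why it runs through an online schema-change tool instead, as covered later.

```ruby
class ChangeConversationPartsIdToBigint < ActiveRecord::Migration[7.0]
  def up
    # Widen the signed 4-byte int (max 2_147_483_647) to an 8-byte bigint.
    # auto_increment: true keeps MySQL auto-incrementing the primary key.
    change_column :conversation_parts, :id, :bigint, auto_increment: true
  end
end
```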
00:04:08.879
uh so what are we going to do while the migration
00:04:10.879
is running we know when it eventually
00:04:13.040
finishes it should fix our
00:04:15.040
problems uh but we also need to spend
00:04:17.840
time working with other teams getting
00:04:19.519
people up to speed trying to find an
00:04:21.400
alternative solution at this point
00:04:23.680
you've probably paged in I think we had
00:04:25.759
about 10 or 15 people at this stage uh
00:04:29.360
we have to pull in customer support to
00:04:31.560
proactively work with them none of our
00:04:34.120
customers are able to write in and tell
00:04:35.919
us they're having a problem because they
00:04:37.440
also use
00:04:39.560
intercom but uh we have to
00:04:42.440
make it clear that there was
00:04:43.840
something going on uh we
00:04:47.440
also have to pull in marketing to start
00:04:49.199
preparing for the idea that like people
00:04:50.840
are going to notice eventually and
00:04:52.800
this could be a big issue like lots
00:04:54.479
of posts on social media uh we want to
00:04:56.680
have a consistent
00:04:58.120
response uh we have a program manager
00:05:00.479
involved at this stage uh she's going to
00:05:02.960
handle communications between the
00:05:04.759
different groups of people responding
00:05:06.199
and also like bubbling this up to our
00:05:07.560
exec team because this is
00:05:09.320
a serious outage and uh we have to
00:05:12.520
start pulling in more and more engineers
00:05:14.520
so at this point relevant
00:05:16.479
teams in different parts of the company
00:05:18.039
are uh kind of coming online our workday
00:05:21.199
tends to start around 9:00 a.m. so we
00:05:22.840
were pretty lucky that the page was
00:05:24.759
quite close to the beginning of the day
00:05:27.120
uh we need to start brainstorming on
00:05:28.639
what we're going to do
00:05:30.639
uh so at this point it's about 9:35 and
00:05:33.360
75 minutes have passed from the time you
00:05:35.240
were
00:05:36.960
paged uh and we have some ideas
00:05:40.039
uh one of our principal engineers joins
00:05:41.759
the call and uh he points out that
00:05:45.720
Rails primary keys are signed by
00:05:47.919
default uh maybe you can see where this
00:05:49.720
is going uh we have a 4-byte integer
00:05:52.720
but we've actually only used 31 bits out
00:05:55.039
of the 32 bits uh so we spiked out like
00:05:59.080
what if we just overrode the
00:06:02.080
field unpack the integer and then like
00:06:05.080
pack it back in as a negative value
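In plain Ruby the proposed reinterpretation looks roughly like this: treat the overflowing ID's low 32 bits as a signed value, which lands it in the unused negative half of the range.

```ruby
id = 2_147_483_648                    # 2**31, one past the signed 4-byte max

# Pack as an unsigned 32-bit word, then read the same bytes back as signed:
wrapped = [id].pack("L").unpack1("l")
# => -2147483648, which still fits the existing signed int column
```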
00:06:07.720
uh kind of gross um we're pretty
00:06:09.960
confident it would solve the problem
00:06:12.160
we're not so confident that something
00:06:14.120
else wouldn't break and cleaning it up
00:06:16.440
afterwards would suck uh as a side
00:06:19.560
note I don't really understand why
00:06:21.599
Rails primary keys are signed by default
00:06:23.680
if anyone has any ideas I would love
00:06:25.520
to hear them uh but we do like have
00:06:28.560
another idea someone else has suggested
00:06:31.080
something uh we can work around it by
00:06:33.680
using some other
00:06:35.120
relationships so like I mentioned we
00:06:38.360
have over 800 models in intercom our
00:06:41.120
data modeling isn't necessarily in like
00:06:43.039
fifth normal form if you ever read any
00:06:45.400
books about SQL uh so we have this idea
00:06:48.599
that like okay we can
00:06:51.759
look up the value from another table
00:06:53.479
like we can load the conversation and
00:06:55.360
then get the value we want from there uh
00:06:57.800
and we can just monkey patch that over
00:06:59.840
the accessors for the attribute that's
00:07:01.360
failing on the ActiveRecord model
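A sketch of the shape of that patch, with hypothetical attribute and association names since the real field wasn't named in the talk: the reader traverses the association instead of the denormalized column, and the writer persists nil so the too-big value never reaches MySQL.

```ruby
class ConversationPart < ApplicationRecord
  belongs_to :conversation

  # Hypothetical overflowing attribute: read it from the parent row
  # instead of our own (now broken) denormalized column.
  def latest_message_id
    conversation.latest_message_id
  end

  # Never persist the overflowing value; writing nil keeps saves working
  # and makes the affected rows easy to find and backfill afterwards.
  def latest_message_id=(_value)
    super(nil)
  end
end
```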
00:07:04.400
uh this is also kind of gross but you
00:07:07.319
could I guess frame it in a way where
00:07:09.039
you think that maybe this is getting
00:07:10.840
better like we're removing a
00:07:12.240
denormalized field uh and we can't
00:07:15.280
really come up with a reason why this
00:07:16.639
would break finding the nil values
00:07:18.639
afterwards if we want to clean up kind
00:07:20.560
of simple so we just go ahead and ship
00:07:23.000
it uh and we can see this graph of
00:07:26.199
exceptions tailing
00:07:28.160
off but at this
00:07:31.080
point like why did it not go to zero
00:07:34.080
when we fixed it uh a little
00:07:37.199
concerning so like what now it's been
00:07:39.759
150 minutes on the call there's a lot of
00:07:41.639
people involved we thought we had a
00:07:43.400
solution and actually we're still down
00:07:46.520
uh turns out other models were also
00:07:48.680
broken conversation part is very core so
00:07:51.720
a lot of the impact is mitigated but
00:07:53.840
actually there was a long tail of other
00:07:56.400
things so these things are kind of less
00:07:59.159
critical like customers could talk to us
00:08:01.319
now like intercom was effectively
00:08:03.919
back but uh the same fix doesn't
00:08:06.960
necessarily work for all of them so we
00:08:09.879
have to start thinking about like what
00:08:11.400
are we going to do we can run these
00:08:12.919
migrations some of the tables are small
00:08:15.039
some of the tables were not so small the
00:08:17.159
biggest one I think only had a few
00:08:18.440
hundred million rows which is still big
00:08:20.840
but not compared to some of the
00:08:22.680
large tables uh but actually the largest
00:08:25.520
one in question also had no viable
00:08:27.599
workaround like no way of uh
00:08:30.080
doing the same trick of traversing the
00:08:31.840
relationship some other way so we have
00:08:34.399
to start thinking like what do we do at
00:08:35.800
this point uh the impact is basically
00:08:40.519
mitigated uh but for the customers that
00:08:42.599
are using the feature powered by that
00:08:44.600
one model like everything is still down
00:08:47.480
for them it's kind of not part of a core
00:08:49.720
flow I don't want to get into too much
00:08:51.399
like granular detail about what intercom
00:08:53.320
is definitely customers that use this
00:08:55.600
were not happy it wasn't working but
00:08:57.600
they were much less happy that they
00:08:59.000
weren't able to have
00:09:01.040
conversations uh so we made the call
00:09:03.880
that like what if we just brown out that
00:09:05.680
feature entirely we use feature flags
00:09:07.399
in intercom so we'll just like feature
00:09:08.920
flag turning it off completely
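The brown-out itself is just a flag check at the top of the affected code path. A sketch with hypothetical controller and flag names, not our real flagging API:

```ruby
class ThreadsController < ApplicationController
  def create
    # Browned out: degrade gracefully instead of failing with RangeError.
    return head :service_unavailable unless FeatureFlags.enabled?(:threads_feature)

    # ... normal write path ...
  end
end
```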
00:09:12.200
uh and that's fine it gets us back into
00:09:14.839
a viable state we
00:09:16.480
have like 90 minutes to wait for the
00:09:18.040
migration to finish so turning off a
00:09:19.640
feature for 90 minutes is kind of
00:09:22.880
okay uh so we're nearly there you've
00:09:26.640
been on the call for five hours at this
00:09:28.640
point
00:09:29.880
uh it's a long incident call I don't
00:09:31.480
know if you've ever done a five-hour
00:09:32.640
incident call but uh not the most
00:09:35.120
fun uh the migration is finished and
00:09:39.399
we think like great time to turn off
00:09:41.600
that feature flag instantly exceptions
00:09:44.279
are back we're down again people are uh
00:09:47.320
panicking and they just quickly turn off
00:09:49.000
the feature flag so like what happened
00:09:51.240
like why did the fix
00:09:54.000
that we were certain would work not
00:09:56.680
work uh
00:09:59.959
we use gh-ost I'm not sure if
00:10:02.160
you've used it before but it's a tool
00:10:03.560
for doing online schema migrations on
00:10:05.680
MySQL it doesn't use the like normal
00:10:08.079
Rails rake tasks for running
00:10:10.320
migrations uh the schema is fixed though
00:10:13.279
like the migration has worked um we're
00:10:17.120
still seeing the
00:10:18.880
exceptions uh but the problem was
00:10:21.160
that we hadn't had a deployment so
00:10:22.800
intercom at peak runs something like
00:10:25.320
50,000 Rails processes serving requests
00:10:29.040
uh so since there had been no deployment
00:10:31.399
none of those processes had been
00:10:32.839
restarted and they had cached
00:10:35.160
the schema so even though the database
00:10:36.959
underneath was working the process would
00:10:39.000
still fail
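The redeploy works because freshly booted processes re-read the schema. ActiveRecord does expose a way to drop the cached metadata in place; a sketch of what each process would otherwise have needed:

```ruby
# Each process caches column metadata at boot. After an out-of-band
# schema change (like a gh-ost migration) that cache is stale until the
# process restarts or explicitly discards it:
ConversationPart.reset_column_information
# The table definition is re-read from MySQL on next access.
```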
00:10:42.880
uh that's a nice clear explanation fortunately we managed to
00:10:44.360
get there pretty fast uh we just trigger
00:10:46.839
a redeploy and you think done okay
00:10:49.200
we're back up everything's working again
00:10:52.040
happy but like is that the end
00:10:54.680
do incidents end when you
00:10:57.200
get requests to succeed again
00:10:59.959
like in some ways I think incidents kind
00:11:02.040
of only start at that point uh this
00:11:05.200
next phase took weeks whereas the
00:11:07.079
earlier stages had only taken hours
00:11:09.880
so in some ways the real work is
00:11:12.320
only just beginning now you've mitigated
00:11:14.920
the problem to really like get something
00:11:17.920
out of an incident you kind of need to
00:11:19.360
learn from it and make sure it doesn't
00:11:20.880
happen
00:11:22.040
again uh as one of my colleagues would
00:11:24.639
say you have to understand the socio
00:11:26.040
technical factors that led to the
00:11:28.079
outage uh so how did it
00:11:31.000
happen uh this isn't the first time
00:11:34.399
we've like tipped over into a bigint uh
00:11:37.639
I think our largest table has about 80
00:11:39.320
billion rows and then the table I
00:11:41.560
mentioned earlier had about 35 billion
00:11:43.480
so obviously we had to do this
00:11:45.279
before uh and when that happened you
00:11:48.760
know that was a red
00:11:51.200
letter day for intercom we pulled
00:11:53.079
together a team got a principal engineer a
00:11:55.200
working group we figured out what to do
00:11:57.000
we got all of our kind of top minds on
00:11:59.000
making sure that this was going to
00:12:00.360
work
00:12:01.680
right uh so we looked at all the
00:12:04.279
dependencies we made a plan we made sure
00:12:06.680
we were aligned got in people from all
00:12:08.920
the different parts of the product and
00:12:10.440
it all went perfectly like we thought of
00:12:12.880
lots of things that could go wrong
00:12:14.199
things like this issue uh and we handled
00:12:17.160
them but we didn't like systematize the
00:12:20.040
learning because even though we knew
00:12:22.120
we'd kind of have to do it again in the
00:12:23.399
future like I think this first happened
00:12:26.160
maybe six years into intercom uh and it
00:12:29.240
was like I said a big deal but it was
00:12:31.880
hard to figure out what parts of it were
00:12:33.440
going to be repeatable and eventually it
00:12:35.720
happened again a lot of the same people
00:12:37.639
were involved we did it right again
00:12:40.880
nothing went wrong so we're kind of a
00:12:42.680
victim of our own success it's really
00:12:44.720
hard to learn from problems you don't
00:12:46.519
have and easy to forget everything that
00:12:48.279
went into making sure you avoided
00:12:50.920
them having a big nasty outage as a
00:12:53.639
result sucks but like that does actually
00:12:56.480
bring things into focus and now you know
00:12:58.560
you're going to do the work to ensure
00:13:00.240
that you don't have a
00:13:01.639
recurrence so like I said we just kept
00:13:04.600
doing this and eventually someone said
00:13:06.160
like hey like if a table is about to run
00:13:09.199
out of primary keys you know maybe we
00:13:11.800
should make an alarm that says you need
00:13:14.279
to run the migration and every alarm in
00:13:17.040
intercom has a run book so that one of
00:13:18.720
our volunteer on call people can just
00:13:20.399
respond to it and figure out what to do
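The check behind an alarm like that can be as simple as comparing each table's next AUTO_INCREMENT against the signed 4-byte ceiling. A hypothetical sketch of the idea:

```ruby
# Hypothetical monitor: flag tables whose next auto-increment ID has
# used more than 85% of the signed 4-byte integer space.
MAX_INT = 2**31 - 1
THRESHOLD = 0.85

rows = ActiveRecord::Base.connection.select_rows(<<~SQL)
  SELECT TABLE_NAME, AUTO_INCREMENT
  FROM information_schema.TABLES
  WHERE TABLE_SCHEMA = DATABASE() AND AUTO_INCREMENT IS NOT NULL
SQL

rows.each do |table, next_id|
  usage = next_id.to_f / MAX_INT
  # Only meaningful for tables whose primary key is still a 4-byte int.
  puts "#{table} is at #{(usage * 100).round}% of max int" if usage >= THRESHOLD
end
```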
00:13:22.680
uh so do you want to see the run book
00:13:23.880
for
00:13:25.880
real uh like what did the run book
00:13:27.880
actually say
00:13:30.120
this is it the primary key for the table is
00:13:32.199
approaching the maximum integer value
00:13:34.600
migrate it to bigint and here's the
00:13:36.560
dashboard that's it nothing else no
00:13:39.600
mention of like all the dependencies all
00:13:41.199
the things that might go wrong all the
00:13:42.480
things you need to
00:13:43.959
know uh so in 2023 that alarm goes off
00:13:48.720
and it says like hey this table message
00:13:50.279
threads needs a
00:13:52.079
migration and the engineer just says oh
00:13:54.920
okay I see an alarm I see a run book it
00:13:57.000
tells me to do this thing and like
00:13:58.839
yeah I moved on with my life I know that
00:14:00.839
I've like saved the day fixed the
00:14:02.800
problem uh and the alarm is triggered
00:14:05.839
when it's at 85% of the limit so it took
00:14:08.480
months to actually get to 100% at that
00:14:11.519
point
00:14:13.320
uh so what has happened like
00:14:16.240
fundamentally we missed foreign
00:14:18.399
keys
00:14:21.480
uh right okay sorry so uh the
00:14:27.639
problem was that we missed foreign keys
00:14:29.320
and we needed to uh make sure that like
00:14:32.160
that doesn't happen again so like how
00:14:34.480
could we have detected that in
00:14:36.680
CI uh earlier on like when we'd done it
00:14:40.120
the first time when that team had
00:14:42.199
been like solving these problems the
00:14:44.240
first time they put some code like this
00:14:45.880
into our kind of rspec setup where we
00:14:48.279
just run a quick migration bump up the
00:14:50.440
auto increment count on this comments
00:14:52.360
table to be above the max integer the
00:14:54.959
idea being that any spec then that
00:14:56.839
requires a comment ID on another model
00:14:59.759
will fail if it doesn't fit
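Reconstructed from that description, the setup was something like this, with the table name per the talk and the rest illustrative:

```ruby
# Push the comments table's counter past the signed 4-byte max so any
# spec that copies a comment ID into a too-narrow column fails in CI.
RSpec.configure do |config|
  config.before(:suite) do
    ActiveRecord::Base.connection.execute(
      "ALTER TABLE comments AUTO_INCREMENT = 2147483648" # 2**31
    )
  end
end
```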
00:15:02.399
uh the problem with doing it this way is that
00:15:04.279
like this is just one piece of code
00:15:05.560
buried deep in some rspec setup uh you
00:15:08.120
have to remember every time to go and
00:15:10.519
like update it uh it doesn't really
00:15:12.880
mention what it's for like there's no
00:15:14.360
comment explaining it if you were lucky
00:15:16.800
enough to be one of the people that knew
00:15:18.279
why it was important then like great
00:15:20.040
good for you but that's not a very good
00:15:21.759
pattern like you could say a solution
00:15:23.720
that relies on someone remembering to do
00:15:25.440
it every time isn't really a solution
00:15:27.199
that's just like a Band-Aid for the
00:15:28.720
problem
00:15:30.639
so we had to figure out like is there a
00:15:33.120
better way to do it I'm going to put a
00:15:34.759
lot of code on the screen now uh we
00:15:38.440
monkey patched over the create_table
00:15:40.800
method in migrations that like checks if
00:15:43.440
we're in a development or a test
00:15:45.120
environment and then we hash the table
00:15:47.720
name into a large value and if
00:15:51.160
it's a bigint we like put
00:15:54.000
it up like over a trillion that way we
00:15:57.160
always know in specs when they're run
00:15:59.279
that like they're going to have a
00:16:01.040
unique per table and very large primary
00:16:04.360
key value so this way like CI will
00:16:07.639
reliably fail if any model tries to put
00:16:09.759
one of those bigint IDs into a field
00:16:11.839
that's too small
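A sketch of the shape of that patch; the details here are reconstructed from the description, not our exact code:

```ruby
require "zlib"

# In dev/test, start every newly created table's AUTO_INCREMENT at a
# huge, table-specific value: hashing the table name makes each table's
# ID range unique, and the size means any 4-byte column that receives
# one of these IDs overflows immediately in CI.
module BigAutoIncrementOnCreate
  MAX_INT = 2**31 - 1

  def create_table(table_name, **options, &block)
    result = super
    if Rails.env.development? || Rails.env.test?
      offset = Zlib.crc32(table_name.to_s) % 1_000_000
      base =
        if options[:id] == :integer
          MAX_INT - 1_000_000   # just under the 4-byte ceiling
        else
          1_000_000_000_000     # bigint PKs: "over a trillion"
        end
      execute("ALTER TABLE #{quote_table_name(table_name)} AUTO_INCREMENT = #{base + offset}")
    end
    result
  end
end

ActiveRecord::ConnectionAdapters::AbstractMysqlAdapter.prepend(BigAutoIncrementOnCreate)
```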
00:16:14.480
another side benefit of doing this is the idea that
00:16:17.560
every table's IDs are now unique so one
00:16:20.279
thing that had
00:16:22.480
happened before is that uh we have a lot
00:16:25.079
of similarly named models in intercom
00:16:27.040
which is a bit unfortunate and we have
00:16:28.480
like Conversation ConversationPart
00:16:30.279
Message MessageThread uh so we've had
00:16:33.480
people accidentally use like the wrong
00:16:35.759
ID when they're looking something up uh
00:16:38.120
so you might say like uh
00:16:42.079
Conversation.find but you'd pass in a conversation
00:16:44.160
part ID and that's an easy mistake to
00:16:46.000
make and in tests that would often work
00:16:48.160
because they all started at one like
00:16:49.639
they all started from the
00:16:50.720
beginning so doing this means that
00:16:52.800
everything is unique now and actually
00:16:54.440
those things will also fail so we
00:16:55.880
removed like a category of flaky specs
00:16:58.279
maybe not the biggest category but like one
00:17:00.160
that did bite us
00:17:02.839
occasionally uh and we also like had
00:17:04.919
that bad error message earlier so uh we
00:17:08.679
went ahead and monkey patched over
00:17:11.240
value_for_database in ActiveModel so
00:17:14.480
value_for_database is the point at which
00:17:16.039
that exception gets raised and we just
00:17:18.600
like include the field name in the message so
00:17:20.799
that you can like quickly go from
00:17:23.199
like that Sentry output I showed earlier
00:17:25.839
so when we uh think back to that like
00:17:27.760
output we saw this just says like 2.14
00:17:31.000
billion is out of range for active model
00:17:32.760
type integer with limit 4 bytes uh it
00:17:35.440
doesn't say what the field was so
00:17:38.080
including this it would now say like uh
00:17:41.320
2.14 billion is out of range for active
00:17:43.360
model type integer with limit four bytes
00:17:45.640
for field message thread ID
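The patch is small: the raise happens while the attribute serializes itself for the database, and the attribute's own name is in scope there. A reconstruction of the idea:

```ruby
# Re-raise ActiveModel::RangeError with the attribute name appended, so
# the Sentry report says which field overflowed.
module RangeErrorNamesTheField
  def value_for_database
    super
  rescue ActiveModel::RangeError => e
    raise ActiveModel::RangeError, "#{e.message} for field #{name}"
  end
end

ActiveModel::Attribute.prepend(RangeErrorNamesTheField)
```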
00:17:48.080
uh so like I said we monkey
00:17:51.120
patched over that uh it
00:17:53.280
wouldn't have saved the day it wouldn't
00:17:54.440
have been the biggest deal but it might
00:17:55.559
have saved 10 or 15 minutes of figuring
00:17:57.840
out what the problem was getting the
00:17:59.240
right people involved and there's enough
00:18:01.080
to do during an incident without having
00:18:02.760
like a fun little puzzle to solve like
00:18:04.400
what field is the
00:18:07.559
issue uh it also actually makes the
00:18:10.120
problems from CI with the other patch I
00:18:12.640
showed easier to debug because now you
00:18:14.840
see what the issue is right
00:18:17.039
away um I want to give a quick shout out
00:18:19.720
actually uh when this talk got announced
00:18:22.520
on the Rails World Twitter
00:18:24.880
account someone responded like oh hey
00:18:27.159
you know there's this gem
00:18:28.200
active_record_doctor uh it would have just
00:18:30.000
solved this problem for you um feels
00:18:32.919
like it deserves a mention here's the
00:18:34.559
link uh it has lots of cool tools in it
00:18:37.000
for checking your database schema and
00:18:38.440
making sure it's in a good state it
00:18:40.679
actually uh has a check for this for
00:18:43.080
like mismatched foreign key types
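Wiring it up is minimal; the task name below is assumed from the gem's detector naming, so check its README:

```ruby
# Gemfile — development-only dependency:
gem "active_record_doctor", group: :development

# Then run the relevant detector:
#   bundle exec rake active_record_doctor:mismatched_foreign_key_type
# It flags foreign key columns whose type doesn't match the primary key
# they reference, e.g. a bigint PK referenced by a 4-byte int column.
```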
00:18:45.000
unfortunately it uh depends on
00:18:47.559
referential integrity constraints which
00:18:49.280
we don't use in intercom and also
00:18:51.360
actually the way our database is set up
00:18:53.760
uh with like lots of different clusters
00:18:55.880
didn't play super nicely with it either
00:18:57.559
so it wouldn't have saved us but maybe
00:18:59.400
it would work for you uh it has lots of
00:19:02.039
good stuff in it um so yeah that's uh
00:19:06.840
that brings us to
00:19:08.240
the end of that incident it's like
00:19:10.159
an interesting example I think of how
00:19:12.760
you need to um really think about what
00:19:16.120
you're writing down like when you
00:19:18.000
have a problem when you have an
00:19:19.280
outage like that you want to systematize
00:19:21.000
your learnings if you write a run book
00:19:24.039
like that one I showed you earlier and
00:19:25.520
it doesn't include context you need to
00:19:27.400
know you've just made a worse outage for
00:19:29.559
yourself later uh so yeah make sure that
00:19:33.200
like when you're learning from those
00:19:35.159
kind of events that you are very
00:19:37.320
deliberate about the actions you take um
00:19:40.440
okay thanks very much