Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages

Summarized using AI


Miles McGuire • September 27, 2024 • Toronto, Canada • Talk

The video "Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages", presented by Miles McGuire at Rails World 2024, discusses the major outage Intercom experienced on February 22, 2024.

The key points covered in the presentation are as follows:

  • Introduction to the Incident: McGuire, a Staff Engineer at Intercom, recounts an incident where a 32-bit foreign key referencing a 64-bit primary key resulted in a major service outage. The discussion is framed from the perspective of the on-call engineer dealing with the chaos of the incident.

  • Initial Response: Upon being paged for elevated exceptions, the on-call engineer found a high volume of errors showing that conversation part records could not be persisted because a value had exceeded the 32-bit integer limit of a foreign key column.

  • Importance of Team Collaboration: The outage led to a collaborative effort involving engineers, customer support, and marketing teams to manage communications and ensure a consistent response to stakeholders while working on a solution.

  • Multiple Solutions Explored: The team kicked off a time-consuming database migration and, in parallel, explored quicker fixes, including a workaround exploiting the unused sign bit of Rails' signed primary keys and another that read the value through a different relationship.

  • Resolution Attempts: After several hours of discussion and code-level fixes, the team realized that because no deployment had happened, the long-running Rails processes were still using a cached copy of the old database schema.

  • Learning from the Experience: The incident prompted a discussion on socio-technical factors, emphasizing the need for better documentation and more comprehensive runbooks that account for challenges encountered during similar situations in the past.

  • Improvements Implemented: The team adopted several changes, including CI checks that catch foreign key columns too small for the primary keys they reference, clearer error messages that name the offending field, and alarms that fire when critical tables approach their integer primary key limits.

The presentation concludes by stressing the importance of thorough post-incident analysis to ensure the issues don’t recur and the significance of documentation in learning and improving processes within technical teams.

Making The Best of a Bad Situation - Lessons from one of Intercom's most painful outages
Miles McGuire • September 27, 2024 • Toronto, Canada • Talk

Incidents are an opportunity to level up, and on 22 Feb 2024 Intercom had one of its most painful outages in recent memory. The root cause? A 32-bit foreign key referencing a 64-bit primary key. Miles McGuire shared what happened, why it happened, and what they are doing to ensure it won't happen again (including some changes you can make to your own Rails apps to help make sure you don’t make the same mistakes.)

#outage #lessonslearned

Thank you Shopify for sponsoring the editing and post-production of these videos. Check out insights from the Engineering team at: https://shopify.engineering/

Stay tuned: all 2024 Rails World videos will be subtitled in Japanese and Brazilian Portuguese soon thanks to our sponsor Happy Scribe, a transcription service built on Rails. https://www.happyscribe.com/

Rails World 2024

00:00:10.080 hey
00:00:11.200 everyone um so who am I you just heard a
00:00:14.160 little bit I'm a Staff engineer at
00:00:15.639 Intercom I've worked here for about
00:00:17.439 seven seven and a half years and for the
00:00:19.640 last five I've been on our Data Stores
00:00:21.600 team so Data Stores at Intercom owns
00:00:24.199 MySQL Memcached Elasticsearch DynamoDB but
00:00:27.599 also things like our core Rails platform
00:00:29.720 and
00:00:30.880 uh various parts of like our application
00:00:33.120 stack too
00:00:36.920 um uh what am I here to talk to you
00:00:39.600 about uh stories about outages are fun
00:00:43.440 that's kind of where we started with
00:00:44.879 thinking about this we're going to try
00:00:46.399 and put you in the shoes of like an on
00:00:47.960 call engineer responding to this
00:00:49.440 incident at Intercom we use a volunteer-
00:00:51.719 led on-call system so any engineer could
00:00:54.520 be on call for the entire
00:00:56.559 company uh outages are also a great
00:00:59.120 learning opportunity so it's a chance to
00:01:01.760 like really reflect on things um we're
00:01:05.960 going to like try and relive the early
00:01:08.439 stages of the outage so we'll go through
00:01:10.960 it in a kind of timeline order and it
00:01:12.720 might seem pretty chaotic and that's
00:01:14.240 because it was uh I'm actually editing
00:01:17.240 down a lot of stuff that happened that
00:01:19.040 like there was a lot of people involved
00:01:20.520 a lot of parallel tracks if you want a
00:01:22.119 full story you can come find me later
00:01:23.840 but there's way too much to get into
00:01:25.720 right
00:01:26.640 now uh we're going to talk a bit about
00:01:29.280 after we hear what happened like what
00:01:31.040 mistakes led to it like how did we get
00:01:32.680 into a situation like this and finally
00:01:35.600 we'll talk about changes we made to our
00:01:37.360 application and our processes to make
00:01:39.439 sure something like this doesn't happen
00:01:41.000 again so without any further Ado it's
00:01:43.799 time to take a step back to the morning
00:01:45.560 of February
00:01:47.119 22nd it's just after 8 a.m. you're on
00:01:50.200 call and you've just been paged because
00:01:51.759 of elevated
00:01:56.159 exceptions uh you crack open Datadog
00:02:00.039 and you see this graph showing like ELB
00:02:01.960 errors this is like 500 responses and
00:02:04.000 they're up kind of from all web facing
00:02:06.159 fleets but uh the numbers look quite low
00:02:09.080 like 1.5k that's uh not the biggest uh
00:02:13.319 volume ever only a small percentage of
00:02:15.480 requests are
00:02:16.599 failing uh and you see these exceptions
00:02:18.879 saying like ActiveModel::RangeError uh
00:02:21.879 you go and open up
00:02:24.239 Sentry and you see an exception
00:02:26.680 like this so uh if we zoom in a little
00:02:29.920 little bit uh we see
00:02:32.000 2.14 billion sounds pretty familiar
00:02:35.800 integer is limited to four
00:02:38.840 bytes so uh yeah not a great
00:02:42.159 situation
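(For reference, the exception shown on the Sentry slide, quoted again later in the talk, has roughly this shape; the exact value below is illustrative.)

```
ActiveModel::RangeError: 2147483648 is out of range for ActiveModel::Type::Integer with limit 4 bytes
```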
00:02:43.680 uh what do we know uh so after some time
00:02:47.200 digging into it you realize that we're
00:02:49.840 failing to persist conversation part
00:02:51.680 records because something is bigger than
00:02:53.480 a 32-bit integer um if we went back
00:02:56.280 you'd see that like the name of the
00:02:58.159 field isn't actually included in the
00:02:59.480 exception which made things a little
00:03:01.720 harder to debug I'm not going to go into
00:03:04.159 a full breakdown of intercom data
00:03:06.280 modeling uh we have over 800 models but
00:03:09.720 uh what we what I can tell you is the
00:03:12.080 conversation part is like the model in
00:03:14.400 Intercom one record for every individual
00:03:16.760 message exchanged in the conversation um
00:03:20.000 I've just realized I've kind of left out
00:03:21.519 what Intercom is if you're not familiar
00:03:23.239 with our service it's a customer
00:03:25.080 communication platform so our
00:03:28.360 customers use Intercom to talk to their
00:03:31.599 customers uh like we could serve the
00:03:35.080 application like the number of
00:03:37.159 failing requests was
00:03:39.159 low but nobody could actually start new
00:03:41.840 conversations so effectively our product
00:03:44.239 was totally
00:03:45.879 down uh and unfortunately for us there
00:03:48.400 are over 35 billion rows on the
00:03:50.439 conversation part table so just kicking
00:03:53.000 off a migration is going to take a while
00:03:55.319 at this point we don't really have
00:03:56.480 anything better to do so we just start
00:03:58.519 that migration it's going to take at
00:04:00.040 least days uh but hopefully that's uh
00:04:02.720 hopefully we can come up with something
00:04:03.760 better because that's not a workable
00:04:06.159 solution uh so what are we going to do
00:04:08.879 like what do we do while the migration
00:04:10.879 is running we know when it eventually
00:04:13.040 finishes it should fix our
00:04:15.040 problems uh but we also need to spend
00:04:17.840 time working with other teams getting
00:04:19.519 people up to speed trying to find an
00:04:21.400 alternative solution at this point
00:04:23.680 you've probably paged in I think we had
00:04:25.759 about 10 or 15 people at this stage uh
00:04:29.360 we have to pull in customer support to
00:04:31.560 proactively work with them none of our
00:04:34.120 customers are able to write in and tell
00:04:35.919 us they're having a problem because they
00:04:37.440 also use
00:04:39.560 Intercom but uh we have to
00:04:42.440 make it clear to them that there was
00:04:43.840 something going on uh we
00:04:47.440 also have to pull in marketing to start
00:04:49.199 preparing for the idea that like people
00:04:50.840 are going to notice eventually and and
00:04:52.800 this this could be a big issue like lots
00:04:54.479 of posts on social media uh we want to
00:04:56.680 have a consistent
00:04:58.120 response uh we have a program manager
00:05:00.479 involved at this stage uh she's going to
00:05:02.960 handle Communications between the
00:05:04.759 different groups of people responding
00:05:06.199 and also like bubbling this up to our
00:05:07.560 exec team because this is like this is
00:05:09.320 a serious outage and uh we have to
00:05:12.520 start pulling in more and more Engineers
00:05:14.520 so at this point we're like relevant
00:05:16.479 teams in different parts of the company
00:05:18.039 are uh kind of coming online our workday
00:05:21.199 tends to start around 9:00 a.m. so we
00:05:22.840 were pretty lucky that we the page was
00:05:24.759 quite close to the beginning of the day
00:05:27.120 uh we need to start brainstorming on
00:05:28.639 what we're going to do
00:05:30.639 uh so at this point it's about 9:35 and
00:05:33.360 75 minutes have passed from the time you
00:05:35.240 were
00:05:36.960 paged uh and we have some ideas
00:05:40.039 uh one of our principal engineers joins
00:05:41.759 the call and uh he points out that
00:05:45.720 Rails' primary keys are signed by
00:05:47.919 default uh maybe you can see where this
00:05:49.720 is going uh we have a four-byte integer
00:05:52.720 but we've actually only used 31 bits out
00:05:55.039 of the 32 bits uh so we spiked out like
00:05:59.080 what if we just overrode the
00:06:02.080 field unpacked the integer and then like
00:06:05.080 packed it back in as a negative value
00:06:07.720 uh kind of gross um we're pretty
00:06:09.960 confident it would solve the problem
00:06:12.160 we're not so confident that something
00:06:14.120 else wouldn't break and cleaning it up
00:06:16.440 afterwards would suck uh as a side
00:06:19.560 note I don't really understand why
00:06:21.599 Rails primary keys are signed by default
00:06:23.680 if anyone has any ideas I would love
00:06:25.520 to hear them
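(A minimal sketch of the idea being discussed here, which never shipped: a signed 4-byte column covers -2**31 .. 2**31 - 1, so a value past 2**31 - 1 could be reinterpreted as a value in the unused negative half. The method name below is illustrative.)

```ruby
# Hypothetical sketch of the "use the sign bit" workaround (not shipped).
# A signed 32-bit column holds -2**31 .. 2**31 - 1, so overflowing values
# can be re-packed into the unused negative range and unpacked on read.
def pack_into_signed_int32(value)
  [value].pack("L").unpack1("l") # reinterpret unsigned 32-bit as signed
end

pack_into_signed_int32(2_147_483_648) # => -2147483648
pack_into_signed_int32(2_147_483_649) # => -2147483647
```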
00:06:28.560 uh but we do like have another idea someone else has
00:06:31.080 suggested something uh we can work around it by
00:06:33.680 using some other
00:06:35.120 relationships so like I mentioned we
00:06:38.360 have over 800 models in intercom our
00:06:41.120 data modeling isn't necessarily in like
00:06:43.039 fifth normal form if you ever read any
00:06:45.400 books about SQL uh so we have this idea
00:06:48.599 that like okay we can look up um we can
00:06:51.759 look up the value from another table
00:06:53.479 like we can load the conversation and
00:06:55.360 then get the value we want from there uh
00:06:57.800 and we can just monkey patch that over
00:06:59.840 accessors for the attribute that's
00:07:01.360 failing on the ActiveRecord model uh
00:07:04.400 this is also kind of gross but uh you
00:07:07.319 could I guess frame it in a way where
00:07:09.039 you think that maybe this is getting
00:07:10.840 better like we're removing a
00:07:12.240 denormalized field uh and we can't
00:07:15.280 really come up with a reason why this
00:07:16.639 would break finding the nil values
00:07:18.639 afterwards if we want to clean up kind
00:07:23.000 of simple so we just go ahead and ship
00:07:26.199 it
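(A sketch of the kind of monkey patch being described, with hypothetical model and attribute names; the talk doesn't show the real models or the exact patch.)

```ruby
# Sketch: stop reading the overflowing denormalized column on the failing
# model and traverse another relationship instead. Names are illustrative.
class ConversationPart < ApplicationRecord
  belongs_to :conversation

  # Read the value from the parent conversation instead of the 32-bit column.
  def message_thread_id
    conversation&.message_thread_id
  end

  # Drop writes so the too-small column is no longer persisted to.
  def message_thread_id=(_value)
    nil
  end
end
```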
00:07:28.160 uh and we can see this graph of exceptions tailing
00:07:31.080 off but at this
00:07:34.080 point like why did it not go to zero
00:07:37.199 when we fixed it uh a little
00:07:39.759 concerning so like what now it's been
00:07:39.759 150 minutes on the call there's a lot of
00:07:41.639 people involved we thought we had a
00:07:43.400 solution and actually we're still down
00:07:46.520 uh turns out other models were also
00:07:48.680 broken conversation part is very core so
00:07:51.720 a lot of the impact is mitigated but
00:07:53.840 actually there was a long tale of other
00:07:53.840 actually there was a long tail of other
00:07:59.159 critical like customers could talk to us
00:08:01.319 now like intercom was effectively
00:08:03.919 back but uh the same fix doesn't
00:08:06.960 necessarily work for all of them so we
00:08:09.879 have to start thinking about like what
00:08:11.400 are we going to do we we can run these
00:08:12.919 migrations some of the tables are small
00:08:15.039 some of the tables were not so small the
00:08:17.159 biggest one I think only had a few
00:08:18.440 hundred million rows which is still big
00:08:20.840 but not not compared to some of the the
00:08:22.680 large tables uh but actually the largest
00:08:25.520 one in question also had no viable
00:08:27.599 workaround like no way of uh
00:08:30.080 doing the same trick of traversing the
00:08:31.840 relationship some other way so we have
00:08:34.399 to start thinking like what do we do at
00:08:35.800 this point uh the impact is is basically
00:08:40.519 mitigated uh but for the customers that
00:08:42.599 are using the feature powered by that
00:08:44.600 one model like everything is still down
00:08:47.480 for them it's kind of not part of a core
00:08:49.720 flow I don't want to get into too much
00:08:51.399 like granular detail about what intercom
00:08:53.320 is definitely customers that use this
00:08:55.600 were not happy it wasn't working but
00:08:57.600 they were much less happy that they
00:08:59.000 weren't able to have
00:09:01.040 conversations uh so we made the call
00:09:03.880 that like what if we just brown out that
00:09:05.680 feature entirely we use feature flags
00:09:07.399 at Intercom so we'll just like feature
00:09:08.920 flag turning it off completely uh and
00:09:12.200 that's that's fine it gets us back into
00:09:14.839 uh gets us back into a viable State we
00:09:16.480 have like 90 minutes to wait for the
00:09:18.040 migration to finish so turning off a
00:09:19.640 feature for 90 minutes is is kind of
00:09:22.880 okay uh so we're nearly there you've
00:09:26.640 been on the call for five hours at this
00:09:28.640 point
00:09:29.880 uh it's a long incident call I don't
00:09:31.480 know if you've ever done a five-hour
00:09:32.640 incident call but uh not not the most
00:09:35.120 fun uh the the migration is finished and
00:09:39.399 we think like great time to turn off
00:09:41.600 that feature flag instantly exceptions
00:09:44.279 are back we're down again people are uh
00:09:47.320 panicking and they just quickly turn off
00:09:49.000 the feature flag so like what happened
00:09:51.240 like how how did we uh why did the fix
00:09:54.000 that we were certain would work not
00:09:56.680 work uh
00:09:59.959 we use gh-ost I'm not sure if
00:10:02.160 you've used it before but it's a tool
00:10:03.560 for doing online schema migrations on
00:10:05.680 MySQL it doesn't use the like normal
00:10:08.079 Rails uh rake tasks for running
00:10:10.320 migrations uh the schema is fixed though
00:10:13.279 like the migration has worked um we're
00:10:17.120 still seeing the
00:10:18.880 exceptions uh uh but the problem was
00:10:21.160 that we hadn't had a deployment so
00:10:22.800 Intercom at peak runs something like
00:10:25.320 50,000 Rails processes serving requests
00:10:29.040 uh so since there had been no deployment
00:10:31.399 none of those processes had been
00:10:32.839 restarted and they had they had cached
00:10:35.160 the schema so even though the database
00:10:36.959 underneath was working the process would
00:10:39.000 still fail uh that's a nice like clear
00:10:42.880 explanation fortunately we managed to
00:10:44.360 get there pretty fast uh we just trigger
00:10:46.839 a redeploy and you think done okay
00:10:49.200 we're back up everything's working again
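(For context: ActiveRecord memoizes column information per process, which is why tens of thousands of long-lived processes kept failing until a deploy restarted them. The redeploy is what actually fixed it here; the snippet below just illustrates the cache, with hypothetical model and column names.)

```ruby
# Each Rails process caches column metadata the first time a model is used.
# Until the process restarts (or the cache is cleared), it still believes the
# column is a 4-byte integer, even though the ALTER TABLE has already run.
ConversationPart.columns_hash["message_thread_id"].limit # => 4 (stale)

# Re-reading the schema picks up the new bigint column:
ConversationPart.reset_column_information
ConversationPart.columns_hash["message_thread_id"].limit # => 8
```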
00:10:52.040 happy but like is that the end
00:10:54.680 do incidents end when you
00:10:57.200 get the uh requests to succeed again
00:10:59.959 like in some ways I think incidents kind
00:11:02.040 of only start at that point uh this
00:11:05.200 next phase took weeks whereas the
00:11:07.079 earlier part had only taken hours
00:11:09.880 so in some ways the real work is
00:11:12.320 only just beginning now you've mitigated
00:11:14.920 the problem to really like get something
00:11:17.920 out of it incident you kind of need to
00:11:19.360 learn from it and make sure it doesn't
00:11:20.880 happen
00:11:22.040 again uh as one of my colleagues would
00:11:24.639 say you have to understand the socio-
00:11:26.040 technical factors that led to the
00:11:28.079 outage uh so how did it
00:11:31.000 happen uh this isn't the first time
00:11:34.399 we've like tipped over into a big int uh
00:11:37.639 I think our largest table has about 80
00:11:39.320 billion rows and then the table I
00:11:41.560 mentioned earlier had about 35 billion
00:11:43.480 so obviously we had to do this
00:11:45.279 before uh and when that happened you
00:11:48.760 know that was a red-
00:11:51.200 letter day for Intercom we pulled
00:11:53.079 together a team got a principal engineer a
00:11:55.200 working group we figured out what to do
00:11:57.000 we got all of our kind of top Minds on
00:11:59.000 making making sure that this is going to
00:12:00.360 work
00:12:01.680 right uh so we looked at all the
00:12:04.279 dependencies we made a plan we made sure
00:12:06.680 we were aligned got in people from all
00:12:08.920 the different parts of the product and
00:12:10.440 it all went perfectly like we thought of
00:12:12.880 lots of things that could go wrong
00:12:14.199 things like this issue uh and we handled
00:12:17.160 them and we didn't like systematize the
00:12:20.040 learning because even though we knew
00:12:22.120 we'd kind of have to do it again in the
00:12:23.399 future like I think this first happened
00:12:26.160 maybe six years into intercom uh and it
00:12:29.240 was uh like I said it was a big deal but
00:12:31.880 hard to figure out what parts of it were
00:12:33.440 going to be repeatable and eventually it
00:12:35.720 happened again a lot of the same people
00:12:37.639 were involved we did it right again
00:12:40.880 nothing went wrong so we're kind of a
00:12:42.680 victim of our own success it's really
00:12:44.720 hard to learn from problems you don't
00:12:46.519 have and easy to forget everything that
00:12:48.279 went into making sure you avoided
00:12:50.920 them having a big nasty outage as as a
00:12:53.639 result sucks but like that does actually
00:12:56.480 bring things into focus and now you know
00:12:58.560 you're going to do the work to ensure
00:13:00.240 that you don't have a
00:13:01.639 recurrence so like I said we just kept
00:13:04.600 doing this and eventually someone said
00:13:06.160 like hey like if a table is about to run
00:13:09.199 out of primary keys you know maybe we
00:13:11.800 should make an alarm that says you need
00:13:14.279 to run the migration and every alarm in
00:13:17.040 Intercom has a runbook so that one of
00:13:18.720 our volunteer on-call people can just
00:13:20.399 respond to it and figure out what to do
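(A sketch of the kind of check such an alarm could be built on, assuming MySQL and auto-increment integer primary keys; this is not Intercom's actual monitoring code.)

```ruby
# Flag tables whose int primary key AUTO_INCREMENT counter has passed 85%
# of the signed 4-byte integer ceiling. Illustrative only.
MAX_SIGNED_INT32 = 2**31 - 1
THRESHOLD = 0.85

rows = ActiveRecord::Base.connection.select_rows(<<~SQL)
  SELECT t.table_name, t.auto_increment
  FROM information_schema.tables t
  JOIN information_schema.columns c
    ON c.table_schema = t.table_schema
   AND c.table_name   = t.table_name
   AND c.column_key   = 'PRI'
   AND c.data_type    = 'int'
  WHERE t.table_schema = DATABASE()
    AND t.auto_increment IS NOT NULL
SQL

rows.each do |table, next_id|
  usage = next_id.to_f / MAX_SIGNED_INT32
  puts "#{table} is at #{(usage * 100).round(1)}% of its int primary key range" if usage >= THRESHOLD
end
```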
00:13:22.680 uh so do you want to see the runbook
00:13:23.880 for
00:13:25.880 real uh like what did the runbook
00:13:27.880 actually say
00:13:30.120 this is it the primary key for the table is
00:13:32.199 approaching the maximum integer value
00:13:34.600 migrate it to bigint and here's the
00:13:36.560 dashboard that's it nothing else no no
00:13:39.600 mention of like all the dependencies all
00:13:41.199 the things that might go wrong all the
00:13:42.480 things you need to
00:13:43.959 know uh so in 2023 that alarm goes off
00:13:48.720 and it says like hey this table message
00:13:50.279 threads needs a
00:13:52.079 migration and then the engineer just says oh
00:13:54.920 okay I see an alarm I see a runbook it
00:13:57.000 tells me to do this thing and like
00:13:58.839 yeah moved on with my life I know that
00:14:00.839 I've like saved the day fixed the
00:14:02.800 problem uh and the alarm is triggered
00:14:05.839 when it's at 85% of the limit so it took
00:14:08.480 months to actually get to 100% at that
00:14:11.519 point
00:14:13.320 uh so what has happened like
00:14:16.240 fundamentally we missed foreign
00:14:18.399 keys
00:14:21.480 uh right okay sorry so uh the
00:14:27.639 problem was that we missed a foreign key
00:14:29.320 and we needed to uh make sure that like
00:14:32.160 that doesn't happen again so like how
00:14:34.480 could we have detected that in
00:14:36.680 CI uh earlier on like when we'd done it
00:14:40.120 the first time when that team had
00:14:42.199 been like solving these problems the
00:14:44.240 first time they put some code like this
00:14:45.880 into our kind of rspec setup where we
00:14:48.279 just run a quick migration bump up the
00:14:50.440 auto-increment count on this comments
00:14:52.360 table to be above the max integer the
00:14:54.959 idea being that any spec then that
00:14:56.839 requires a comment ID on another model
00:14:59.759 will fail if it doesn't fit uh the
00:15:02.399 problem with doing it this way is that
00:15:04.279 like this is just one piece of code
00:15:05.560 buried deep in some rspec setup uh you
00:15:08.120 have to remember every time to go and
00:15:10.519 like update it uh it doesn't really
00:15:12.880 mention what it's for like there's no
00:15:14.360 comment explaining it if you were lucky
00:15:16.800 enough to be one of the people that knew
00:15:18.279 why it was important then like great
00:15:20.040 good for you but that's not a very good
00:15:21.759 pattern like you could say a solution
00:15:23.720 that relies on someone remembering to do
00:15:25.440 it every time isn't really a solution
00:15:27.199 that's just a like a Band-Aid for the
00:15:28.720 problem
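(Roughly the shape of the original one-off check described above, as a hedged reconstruction rather than Intercom's exact code: push the comments table's counter past the 32-bit maximum in the test database so any spec that copies a comment id into a too-small column blows up.)

```ruby
# spec/support/bigint_comments.rb (hypothetical location)
RSpec.configure do |config|
  config.before(:suite) do
    # 2**31 is just past the signed 4-byte integer maximum of 2**31 - 1.
    ActiveRecord::Base.connection.execute(
      "ALTER TABLE comments AUTO_INCREMENT = #{2**31}"
    )
  end
end
```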
00:15:30.639 so we had to figure out like is there a
00:15:33.120 better way to do it I'm going to put a
00:15:34.759 lot of code on the screen now uh we
00:15:38.440 monkey patched over the create_table
00:15:40.800 method of migrations that like checks if
00:15:43.440 we're in a development or a test
00:15:45.120 environment and then we hash the table
00:15:47.720 name into a large value and if
00:15:51.160 it's a bigint we like put
00:15:54.000 it up like over a trillion that way we
00:15:57.160 always know in specs when they're run
00:15:59.279 that like they're going to have a a
00:16:01.040 unique per table and very large primary
00:16:04.360 key value so this way like Ci will
00:16:07.639 reliably fail if any model tries to put
00:16:09.759 one of those bigint IDs into a field
00:16:11.839 that's too small another side benefit of
00:16:14.480 doing this uh is the is this idea that
00:16:17.560 like every table now is unique so one
00:16:20.279 thing that we'd happened before or had
00:16:22.480 happened before is that uh we have a lot
00:16:25.079 of similarly named models at Intercom
00:16:27.040 which is a bit unfortunate and we have
00:16:28.480 like Conversation ConversationPart
00:16:30.279 Message MessageThread uh so we've had
00:16:33.480 people accidentally use like the wrong
00:16:35.759 ID when they're looking something up uh
00:16:38.120 so you might call like uh Conversation.find
00:16:42.079 but you'd pass in a ConversationPart
00:16:44.160 ID and that's an easy mistake to
00:16:46.000 make and in tests that would often work
00:16:48.160 because they all started at one like
00:16:49.639 they all they all started from the
00:16:50.720 beginning so doing this means that
00:16:52.800 everything is unique now and actually
00:16:54.440 those things will also fail so we
00:16:55.880 removed like a category of flaky specs
00:16:58.279 maybe not the big category but like one
00:17:00.160 that did bite us
00:17:02.839 occasionally
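(One way to approximate the patch described above; the talk doesn't show the exact code, so the module, the seeding scheme, and the adapter hook here are assumptions.)

```ruby
require "zlib"

# Seed every table created in dev/test with a unique, very large
# AUTO_INCREMENT derived from its name, so CI fails loudly whenever an id
# is written into a column that is too small for it.
module LargePrimaryKeySeeds
  def create_table(table_name, **options, &block)
    result = super
    if Rails.env.development? || Rails.env.test?
      pk_name = primary_key(table_name)
      pk = columns(table_name).find { |c| c.name == pk_name }
      if pk&.sql_type&.include?("int")
        # bigint tables start above a trillion; int tables start near the
        # 32-bit ceiling, and the hash keeps every table's range distinct.
        base = pk.sql_type.include?("bigint") ? 1_000_000_000_000 : 2_000_000_000
        seed = base + Zlib.crc32(table_name.to_s) % 100_000_000
        execute("ALTER TABLE #{quote_table_name(table_name)} AUTO_INCREMENT = #{seed}")
      end
    end
    result
  end
end

ActiveRecord::ConnectionAdapters::AbstractMysqlAdapter.prepend(LargePrimaryKeySeeds)
```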
00:17:04.919 uh and we also like had that bad error message from earlier so uh we
00:17:08.679 went ahead and monkey patched over
00:17:11.240 ActiveModel's uh value_for_database so
00:17:14.480 value_for_database is the point at which
00:17:16.039 that exception gets raised and we just
00:17:18.600 like include the name in the message so
00:17:20.799 that we you can like quickly go from
00:17:23.199 like that Sentry output I showed earlier
00:17:25.839 so when we uh think back to that like
00:17:27.760 output we saw this just says like 2.14
00:17:31.000 billion is out of range for ActiveModel
00:17:32.760 type Integer with limit 4 bytes uh it
00:17:35.440 doesn't say what the field was so
00:17:38.080 including this it would now say like uh
00:17:41.320 2.14 billion is out of range for active
00:17:43.360 model type integer with limit four bytes
00:17:45.640 for field message thread
00:17:48.080 ID uh so we uh like I said monkey
00:17:51.120 patched over that uh it would have it
00:17:53.280 wouldn't have saved the day it wouldn't
00:17:54.440 have been the biggest deal but it might
00:17:55.559 have saved 10 15 minutes off of figuring
00:17:57.840 out what the problem was getting the
00:17:59.240 right people involved and there's enough
00:18:01.080 to do during an incident without having
00:18:02.760 like a fun little puzzle to solve like
00:18:04.400 what what field is uh is the is the
00:18:07.559 issue uh it also actually makes the
00:18:10.120 problems from CI with the other patch I
00:18:12.640 showed easier to debug because now you
00:18:14.840 see what the what the issue is right
00:18:17.039 away
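(A sketch of the shape of that error-message patch; ActiveModel::Attribute#value_for_database is where the range check raises, and the attribute knows its own name. Not Intercom's exact code.)

```ruby
# Re-raise range errors with the offending attribute's name appended, so the
# exception says which field overflowed instead of just the type and limit.
module RangeErrorsWithAttributeName
  def value_for_database
    super
  rescue ActiveModel::RangeError => e
    raise ActiveModel::RangeError, "#{e.message} for field #{name}"
  end
end

ActiveModel::Attribute.prepend(RangeErrorsWithAttributeName)
```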
00:18:19.720 um I want to give a quick shout-out actually uh when this talk got announced
00:18:22.520 on the Rails World Twitter
00:18:24.880 account someone responded like oh hey
00:18:27.159 you know there's this thing called
00:18:28.200 active_record_doctor uh it would have just
00:18:30.000 solved this problem for you um feels
00:18:32.919 like it deserves a mention here's the
00:18:34.559 link uh it has lots of Cool Tools in it
00:18:37.000 for checking your database schema and
00:18:38.440 making sure it's in a good state it
00:18:40.679 actually uh has a check for this for
00:18:43.080 like mismatched foreign key types
00:18:45.000 unfortunately it uh depends on
00:18:47.559 referential integrity constraints which
00:18:49.280 we don't use at Intercom and also
00:18:51.360 actually the way our database is set up
00:18:53.760 uh with like lots of different clusters
00:18:55.880 didn't play Super nicely with it either
00:18:57.559 so it wouldn't have saved us but maybe
00:18:59.400 it would work for you uh it has lots of
00:19:02.039 good stuff in it
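(For anyone who wants to try it: active_record_doctor is typically added as a development dependency, and, per its README, its checks, including the mismatched foreign key type detector mentioned here, run via a rake task.)

```ruby
# Gemfile
group :development do
  gem "active_record_doctor"
end

# Then run its detectors (per the gem's README):
#   bundle exec rake active_record_doctor
```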
00:19:06.840 um so yeah that's kind of the end that brings us to
00:19:08.240 the end of that incident it's like
00:19:10.159 an interesting example I think of how
00:19:12.760 you need to um really think about what
00:19:16.120 you're writing down like when you when
00:19:18.000 you have a problem when you have an
00:19:19.280 outage like that you want to systematize
00:19:21.000 your learnings if you write a runbook
00:19:24.039 like that one I showed you earlier and
00:19:25.520 it doesn't include context you need to
00:19:27.400 know you've just made a worse outage for
00:19:29.559 yourself later uh so yeah make sure that
00:19:33.200 like when you're learning from those
00:19:35.159 kind of events that you are very
00:19:37.320 deliberate about the actions you take um
00:19:40.440 okay thanks very much