Talks

Herding Cats to a Firefight

EuRuKo 2016

00:00:05.310 please take a seat, thank you very much. So, our next speaker is Grace Chang.
00:00:12.799 She's an engineer at Yammer London. She is also the on-call tech lead, as far as
00:00:20.340 I understood. She likes reliable sites, continuous development, and I was told
00:00:25.740 also GIFs and food — is that correct? OK, so here she is — enjoy.
00:00:38.690 Thank you — uh, hello, thank you again. And I
00:00:43.800 apologize ahead of time to all the Bulgarians here if I mess this up,
00:00:48.890 but: [Bulgarian greeting] — I think? Yeah, I'm told that means hello. If it means a
00:00:55.470 curse word, I apologize profusely and I blame all the organizers. No, they're
00:01:01.470 cool people. So yes, my name is Grace, and currently I work at Yammer London, but as
00:01:07.350 you can probably tell from my accent I'm originally from Vancouver Canada I just moved to London last year but I have
00:01:15.240 been doing on call with Yammer for several years now and that's what I'm here to talk about with you today in
00:01:22.079 case you're not familiar Yammer is a social network for enterprise we were acquired several years ago by Microsoft
00:01:29.990 so it's a large platform a lot of users but we're obviously trying to grow every
00:01:36.240 day. Um, and I also apologize very profusely to the internet, because
00:01:42.389 while I am talking about cats, I am personally a dog person, and I actually
00:01:48.569 don't have any GIFs in my slides, but I have hand-drawn pictures and hopefully
00:01:55.950 that will be just as good as GIFs. So I'm going to talk about herding cats, and
00:02:03.389 in case this is a phrase that people are not familiar with, it's kind of a phrase that we use to talk about trying to
00:02:10.979 gather people to do something but it's really difficult because all these people are trying to do
00:02:17.410 different things or they're just not cooperating with you and it is one of the hardest things you can possibly do
00:02:22.750 and why specifically cats I can give a story that one of my friends told me he
00:02:30.510 used to play a game when he was in uni with his roommate's cats — I think at least
00:02:35.650 two of them maybe more he would take a laser pointer and try and get one of the
00:02:40.870 cats to follow the laser pointer into a bookshelf and then while that cat is still inside the bookshelf he would try
00:02:47.290 and get the other cat into the same bookshelf obviously that doesn't work out with one laser pointer because first
00:02:53.440 cat starts coming out the other cat goes in it's a kind of a hard task but this
00:02:59.920 story isn't about impossibilities it's about being able to train teams and help
00:03:05.080 them learn new tricks when we're dealing with difficult things. So let me rewind a little bit first, all
00:03:12.940 the way back to the year 1 BC or before
00:03:18.130 cats so in the beginning there was only
00:03:23.730 darkness but suddenly out of the darkness there
00:03:30.040 came a sound. I apologize, there's no speaker setup, so
00:03:36.100 imagine pager noises going in here — it's mostly for effect anyways. So when I
00:03:41.920 started at Yammer we didn't really have a concept of on-call. I'd heard lots of stories about other teams at other
00:03:48.700 organizations having on-call teams, and obviously it sounded pretty terrible:
00:03:53.739 they would have to carry pagers — or, in modern days, mobile phones — with them at all times, just in case something went
00:04:00.520 wrong with the production environment. Actually, Yammer did have an on-call — sort
00:04:07.000 of. It was one person who was on call all day and every night of every day of
00:04:15.280 every week, for all of eternity. Not quite, but close enough.
00:04:23.640 everyone was around to help of course anybody else on the engineering team but eventually this team grew so large and
00:04:31.030 the entire environment was so unwieldy that it was unreasonable for one person to be responsible for everything and we
00:04:38.710 knew that deep inside just nobody wanted to be the first one to say it but why not why didn't we want to help
00:04:46.000 this poor soul who had the entire world of Yammer on his shoulders well one of
00:04:51.009 our leading people said and I highly paraphrase here we don't need no
00:04:57.639 stinking on-call rotation our project is so good it doesn't need a whole team to
00:05:02.740 keep it running; in fact, that would be a waste of their time. I
00:05:08.039 call BS on this. So eventually, as one would predict,
00:05:13.900 disaster happened. It was a Friday — or, for
00:05:18.940 some of us, still Thursday, because we hadn't really slept much the night before — and a massive production issue
00:05:24.400 occurred for some reason or another, and a handful of either dedicated or
00:05:29.819 insomniac engineers were struggling to keep things from falling apart. Eventually we had to call for help, and I
00:05:38.110 was the one who had to make this call, and this is roughly how it went:
00:05:43.680 "Mmm, hi, sorry to be calling at this hour. I'm from Yammer and I work with
00:05:48.880 [engineer's name] — can I please speak with him?" Again, this was three o'clock in
00:05:54.789 the morning. This guy had just gotten back from his honeymoon, and his wife answered the phone. Basically, it was
00:06:01.900 as awkward as it sounds and eventually he woke up probably tried to
00:06:08.830 get some coffee and stuff and helped us out and eventually after several hours
00:06:14.080 we finally managed to get our production environment back into a stable state but it was at this point more or less that
00:06:20.560 we decided we can't keep doing this this guy needs to sleep so the decree was made we would have an
00:06:27.520 on-call team — yay! Problem solved, right? Easier said than done, like many things.
00:06:34.650 So let me now go a little forward in
00:06:39.699 time, to the year 1 AD — After Disaster. The first difficulty that we had come
00:06:45.720 across while just putting this on-call team together was: how do we do the math? At
00:06:51.930 the time, we decided that we would have just four engineers — or cats — to be
00:06:58.979 the guinea pigs (guinea cats?) to iron out the entire process before we
00:07:04.770 rolled it out to the entire team. Out of these four, one was the
00:07:10.050 cat herder, or the tech lead, and that was my role at this time. We also had one
00:07:15.509 monolithic application written in Rails, with about 15-ish services which were
00:07:21.060 written in Java, so that was two different stacks, and we tried to figure out how best to
00:07:27.000 organize and split the responsibilities in such a way that it didn't make everybody's lives suck, and
00:07:34.759 it's not easy — it kind of ended up with just four sad cats.
00:07:40.430 This is an actual photo of my notebooks from when I was trying to figure out
00:07:46.319 how to do our scheduling so it really didn't suck for everyone — but all the
00:07:51.539 options sucked, really. And another big challenge for us was having to get
00:07:57.930 used to all these acronyms that started cropping up all around us, and we couldn't avoid them: things like MTBF,
00:08:04.279 MTTR, AAR, SLA — and they became bigger and bigger
00:08:10.590 — not, obviously, the acronyms themselves. So I'll go quickly over the
00:08:16.169 meanings of some of these acronyms that were key to us. The first one is MTBF, mean time between
00:08:22.620 failures: this is basically how long it is, roughly, between outages of your
00:08:29.120 application or site, whatever it may be. Mean time to recovery, MTTR, is related to this,
00:08:37.680 but it is actually how long it takes for you to get from that unstable state back
00:08:42.930 to a healthy state. SLA — the most important thing, also called
00:08:49.110 availability in some areas — service level agreement: this is basically the contract between your business and your users, so
00:08:57.120 you're saying to your users, we promise that our site or application or
00:09:02.160 service will be available to you this percentage out of the month. It's pretty
00:09:08.760 important because obviously more downtime means fewer users. AAR, after action review — this one
00:09:16.440 came up and I actually had to look it up again because it's kind of an older one;
00:09:21.680 basically it's pretty self-descriptive: it is what you do with the documents,
00:09:27.840 the ones that you have to write up after the whole shindig has gone down, where you have to kind of explain what
00:09:33.900 happened. Also related to that is IR, incident report — this is actually the term that we like to use more, just
00:09:40.770 because, well, it's slightly shorter.
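To make those terms concrete, here's a minimal Ruby sketch with made-up numbers (nothing from the talk's slides), showing how an availability promise turns into a monthly downtime budget:

    # Illustrative only: how an SLA target translates into allowed downtime.
    sla = 0.999                          # promised availability, e.g. "three nines"
    minutes_per_month = 30 * 24 * 60     # ~43,200 minutes in a 30-day month
    downtime_budget = (1 - sla) * minutes_per_month
    puts "Allowed downtime: #{downtime_budget.round(1)} minutes per month"
    # => Allowed downtime: 43.2 minutes per month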
00:09:46.250 but yeah there was like so many acronyms and it was just so overwhelming to us so
00:09:51.390 we started to break it down and I'll break it down here too but this is very rough um
00:09:57.530 going back to MTBF, mean time between failures, and MTTR, mean time to recovery —
00:10:02.730 I'll try my best not to use the acronyms themselves and to use the full terms here.
00:10:09.740 Anyways, so between these two, what mattered the most to us?
00:10:14.840 But really it's not a matter of what's more important, it's how do you balance
00:10:21.720 them out, and how do you improve on one while making sure that
00:10:27.030 overall your systems are still healthy, that your availability is still up. It's
00:10:32.940 kind of like — going back to the cat analogies — mean time between failures is
00:10:37.980 like how often your cat vomits on the rug; mean time to recovery is how fast you can clean it up when it does.
00:10:44.270 And obviously there were various things that we had to consider: the rough business impact of each, which is the
00:10:51.180 part in grey — so for mean time between failures, obviously, failures are less frequent; for mean
00:10:56.610 time to recovery, when you focus on that you obviously have faster recovery, so it's much better — even if you're down,
00:11:02.760 it's only for a few minutes at most. In blue is what we need to do to achieve
00:11:09.269 either of these goals for mean time between failures required more stable
00:11:14.550 systems it depends obviously on the current state of your system whether that's something reasonable at that time
00:11:20.519 or it's something that you need to do more incrementally mean time to recovery also needed some time so that your
00:11:27.540 engineers could be trained to respond well to any of these incidents so that they can act fast to anything that comes
00:11:33.899 up the benefits of these two in green for mean time between failures it meant
00:11:40.709 that engineers were interrupted less frequently you know your cat vomits less
00:11:45.959 frequently — good, that's less frequently that you have to clean up after it. While for mean
00:11:52.170 time to recovery, engineers gain a broader knowledge over the entire system, just because they have to handle all of
00:11:59.220 these incidents — but at least they'll know how to do that fast. And of course the negatives for these: for
00:12:07.290 mean time between failures, failures could possibly be more disastrous — it's kind of like, instead of smaller
00:12:13.760 breaks or outages you might possibly have kind of a big bang outage that
00:12:19.860 lasts for several minutes or, god forbid, hours. While for mean time to recovery, we
00:12:26.579 might not have that, but it could also mean that you are facing more frequent issues — you know how to respond
00:12:33.329 to them quickly, but they come up more often; maybe that's how you get used to responding to them.
00:12:39.800 So this is — I swear — the only formula in my slides, and it's not really
00:12:45.660 even a formula, but this is kind of the official definition of the relationship between mean time between
00:12:52.050 failures and mean time to recovery, and of course our ever-looming SLA, or availability.
00:13:00.320 Neither of these really depends on the other necessarily, but obviously the
00:13:06.149 SLA does depend on both of them. But it's not a scale where you have to have
00:13:11.250 either one on opposite ends; it's more like a range for either one of them. So
00:13:16.649 as one goes up, the other one doesn't have to go up —
00:13:21.730 in fact it could go down temporarily, and it doesn't have to be a hundred percent. Then, when you're happy with that one —
00:13:27.399 for example the mean time between failures here — you can go and push the
00:13:33.070 other one, the second one, up. Whatever you do, though, obviously you just need to make sure that you stick to
00:13:39.250 the important thing on the left-hand side of the equation, the availability — and this is obviously the goal that you want in the end.
00:13:45.040 So we settled on our goals. Now we had to
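The slide itself isn't reproduced in this transcript, but the standard relation between these quantities is availability = MTBF / (MTBF + MTTR). A tiny Ruby sketch with invented numbers shows how improving either term keeps the left-hand side up:

    # Illustrative only: availability as a function of MTBF and MTTR (in hours).
    def availability(mtbf, mttr)
      mtbf / (mtbf + mttr)
    end

    puts availability(500.0, 1.0)   # fail rarely, recover fast    => ~0.998
    puts availability(500.0, 4.0)   # same MTBF, slower recovery   => ~0.992
    puts availability(2000.0, 4.0)  # rarer failures compensate    => ~0.998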
00:13:52.029 kind of track all these things — the AARs, or IRs, that we are using now. We
00:13:58.660 considered using Google Docs forms at the very beginning. Obviously there are many tools
00:14:03.940 that we considered, so I'm only listing the key ones that we actually tried out. This one ended up being kind of hard to
00:14:10.870 read the reports in — if you're not familiar with the way that Google Forms does it, it
00:14:16.449 basically spits out the responses as a spreadsheet, and when you're trying to read prose it makes it really
00:14:22.779 difficult to do that. There were also Yammer notes — we have
00:14:28.569 basically rich text editing in Yammer, the product, as well, and
00:14:33.839 the good thing about this was that you can link it to different threads or
00:14:39.010 conversations that are happening about the specific incidents. However, they were very hard to analyze: you can link them
00:14:45.880 to threads, but there's no very good way of searching or querying for these
00:14:51.160 notes inside of Yammer. Lastly we had JIRA, which we were already using as a
00:14:58.470 bug tracking tool, and it wasn't perfect but it sort of worked.
00:15:06.600 In the end there were lots of different fields, as you can see from another
00:15:12.880 photo of some of the notes that I was taking at the time to try and see how we could actually build an incident report
00:15:19.149 form that would make sense to engineers, that would be simple for them to fill out, but also not lose track of all the
00:15:25.779 important things. There are a lot of details that we considered important at first; we don't have all of these fields
00:15:31.990 anymore — as we went along, obviously, some of them naturally stopped being filled out,
00:15:37.220 and we determined they're probably not as important; they've become optional now. But the key ones are
00:15:43.700 obviously still there, and we've now been using JIRA for about three years and
00:15:50.270 it's been serving us quite well. It obviously has many benefits, including tag or label tracking — you can do
00:15:57.650 queries against them. So this is what we settled on using, and
00:16:04.450 all of this took place over several months and we figured hey we're starting
00:16:09.740 to get the hang of this maybe
00:16:15.430 Not yet. We weren't where we wanted to be at the time.
00:16:21.610 At this point in time, the way that we did projects made it such that whenever we created a new service in our
00:16:29.150 microservice, service-oriented-architecture world, we would kind of dump it into the pool with all
00:16:34.910 the others so this meant that the number of services grew faster than the
00:16:39.950 engineers who maintained them and it also meant possibly more technologies
00:16:45.320 which meant more toys for cats to play with — it means that they get distracted easily as well. For example, at
00:16:53.930 this time we had about four different data stores on top of the Rails and Java
00:16:59.450 applications that we had. So imagine taking like a bunch of yarn balls, tossing them on the ground, and then
00:17:06.290 letting loose a bunch of cats — it's kind of a disaster herding them to actually, you know, get them to do whatever you
00:17:12.560 want. So despite this, we also
00:17:18.790 rotated projects and people who worked on different projects, because we wanted to prevent pockets of expertise — we wanted to
00:17:26.060 prevent silos from forming, as some people refer to them.
00:17:31.930 But with all these cats playing with different balls, unfortunately one cat
00:17:38.510 might play with just one ball. So silos formed even though you didn't want them to: some people just actually
00:17:45.440 get more used to a certain service than another person. It just happens; it's not
00:17:50.720 something that is easily avoidable. On top of this, a side effect is that you get no clarity on who's responsible for
00:17:58.370 what. So, you know, after all the cats are done playing they kind of saunter
00:18:03.649 off to do whatever their own thing is, and they kind of leave all the yarn balls for someone else to deal with — probably
00:18:09.830 you. And most importantly: burnout.
00:18:15.799 Basically, I'm sure many people feel it in their normal day-to-day, but imagine also having the responsibility
00:18:22.159 of carrying your phone or even laptop around in the evenings and
00:18:28.070 on the weekends. It's very stressful, and the responsibility level is much higher
00:18:34.070 than usual. So that was another thing that was problematic, and we didn't really have a solution for it yet. So
00:18:42.669 several months of this continued on we kept making small tweaks here and there
00:18:47.840 and eventually we got to a point where we decided all these initial efforts were kind of good enough to roll out to
00:18:55.940 the entire team — or so we thought. There were plenty of adjustments that we obviously had to
00:19:02.210 make. One of those, that we decided to do, was to do more by doing less.
00:19:09.039 As I mentioned, we had two different stacks: Rails, and Java with Dropwizard,
00:19:16.700 and that was the easiest way to draw the line between responsibilities. So we had some people being on call for
00:19:23.000 Rails while others were on call for Java, so we decided to go with that. It meant
00:19:30.049 that you as a person were probably dealing with less at any given point in time. On top of that, we added in our London
00:19:37.909 office — when all of this was taking place it was only in the San Francisco office. We added the
00:19:44.330 London office with shorter rotations, and it meant that people in San Francisco could finally sleep at night; they don't
00:19:51.200 have to go on from 9am to 9am the next day, basically 24 hours.
00:19:56.600 It also meant that, with more people joining the process and the team of being
00:20:03.470 on call, they gained more knowledge — but we had to onboard them to this process, we had to help them
00:20:10.190 gain that knowledge, because it's a lot to digest all at once. So we had to make sure we trained people and had
00:20:19.250 them be both open to helping anybody in an area that they were most familiar with — they also had to
00:20:27.350 be okay, obviously, with asking somebody more knowledgeable about some other area that might not be their forte.
00:20:36.670 Overall this is the hardest part: convincing people to do on-call, especially with engineers —
00:20:43.990 going back to the story of the cats in the bookcase, it is basically that.
00:20:49.600 but it is possible to get the whole team on board you just have to be persuasive
00:20:55.010 about it, and convincing, and I'll touch on that a little bit later. Once this was all done we also made sure
00:21:02.540 to practice everything: we did drills, we did tabletop exercises, we did
00:21:09.470 breakdowns — just practice, as with almost anything else in life.
00:21:16.900 so now that all hands were on deck we had
00:21:21.980 to also make some more adjustments to the overall process and the system and tooling. The first change we made was to move
00:21:30.020 all of the alerts into a configuration repo. This meant that, since they were
00:21:36.410 in source control, it was clear what the history of each alert was, and it was also easier to maintain: for any changes
00:21:43.580 that needed to be made, you just had to change the configuration, push it out, and the alert was changed.
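As a purely hypothetical sketch of what "alerts as configuration in a repo" can look like — this is not Yammer's actual tooling or format, just an illustration of the idea in Ruby — something like one reviewed definition per alert:

    # Hypothetical alert definition, kept in a config repo and code-reviewed;
    # changing the threshold is just a commit plus a push of the config.
    ALERT = {
      name:        "messages-api-high-latency",          # invented example
      query:       "p99(messages_api.request_time_ms)",
      threshold:   1500,                                  # fire when p99 latency exceeds 1.5s
      duration:    "5m",                                  # ...for five consecutive minutes
      severity:    :page,                                 # page the on-call engineer
      runbook_url: "https://wiki.example.com/runbooks/messages-api", # hypothetical
    }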
00:21:50.590 Another thing we did was — since engineers were busy doing their thing and managers
00:21:55.970 weren't doing too much else — make them incident managers; make them do something.
00:22:01.480 So it was the role of the managers, or the incident managers, to do all the
00:22:07.070 documentation, such as starting the incident reports and making sure that all the details were properly and correctly
00:22:14.539 filled out. They also coordinated all the necessary engineers for any issues — again, going
00:22:21.350 back to some people having more expertise in certain areas, if someone was kind of struggling with something, the
00:22:27.769 incident manager would be able to pull in someone else by calling them or messaging them, texting them, sending
00:22:34.490 carrier pigeons, I don't know — however you want to get people pulled in. Most importantly, though, they did the
00:22:41.779 external communications with the other teams at our company that had to tell
00:22:48.230 the customers the bad news that we are having issues. I can imagine — I myself
00:22:53.510 obviously would not want to be the one telling our users, hey, sorry, our site's kind of down, you're gonna have to wait
00:23:00.049 for a little bit to use it. Imagine it's just like having to tell a cat
00:23:06.139 to hang out with this specific person: it's going to have a mind of its
00:23:11.450 own, it's going to wander off and do something else. We also did runbooks for all of these
00:23:19.179 different services that we had. Originally we of course did have some form of readme or runbook for
00:23:26.149 each service, but the most important thing about the runbooks that we had was that we made them very in-depth with
00:23:34.070 regard to the alerts that were tied to the service. This is especially important
00:23:39.620 for the mean time to recovery, because it means that if there are clear steps for
00:23:44.630 resolution, all you have to do is just go: I got this alert, follow steps one, two,
00:23:50.690 and three, and hopefully it should be resolved by then. This also obviously meant that the
00:23:57.799 initial response was as simple as possible — we assumed anybody should be
00:24:03.049 able to do this even if they have as little knowledge about a certain system as, I don't know, your cat.
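For illustration only — not one of Yammer's actual runbooks — an alert-oriented runbook entry might look roughly like this:

    Alert: messages-api-high-latency (hypothetical)
    1. Check the service dashboard for the affected region.
    2. If a single host is the outlier, take it out of the load balancer.
    3. If latency is cluster-wide, restart the service with the usual deploy tooling.
    4. Not resolved after 15 minutes? Escalate to the secondary on-call.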
00:24:12.370 But even with all this, we also decided: we can't keep doing this — these people
00:24:18.620 still need to do work outside of this. So this continued for a while, and we
00:24:25.169 slowly began to hit our stride, but we obviously continued making adjustments as we discovered parts of the process that
00:24:31.379 just didn't really work out for us. And so my story comes back up to speed with
00:24:38.190 the present, and you might be thinking: we've done all these things, what's left
00:24:44.159 to fix? Well, like any real problem, person-wise or code-wise, there's no
00:24:50.669 actual end to the things that you can improve on. So these are some of the things now that
00:24:57.269 we are focusing on the first being combined schedules and keep in mind the
00:25:02.970 things that I'm talking about now are still in flux we're doing them now so
00:25:08.269 take them with a grain of salt so combining the two schedules I
00:25:14.570 mentioned previously that we had decided actively not to do this at the beginning
00:25:19.739 and we split by stacks, so we had Rails and Java. Now we had basically a single unified
00:25:27.869 team who just worked on services, so we decided that they should be responsible
00:25:34.830 for all the services regardless of stack. It sounds crazy — it probably is —
00:25:42.889 but we did this knowing that it meant that people would end up being
00:25:50.190 on-call less often: it might be a shitty week for them, but at least that shitty week will only have to happen
00:25:56.519 maybe once every nine to 18 weeks depending on obviously how many people
00:26:01.710 were in the team and this was very critical obviously for burnout we didn't want people feeling tired or stressed so
00:26:10.289 doing it less frequently, at least they could prepare ahead of time; they can kind of recoup in the time off and
00:26:17.389 get some actual work done.
00:26:24.739 This wasn't the only reasoning behind it, though — it also came out of the fact that, since these teams are unified, the
00:26:31.590 schedule should be too. So this is kind of important, based on the
00:26:36.970 fact that we believe that the engineer should be able to handle all issues regardless of their primary programming
00:26:43.059 language. We have also started doing post-mortems and retrospectives. It's basically
00:26:51.130 learning from our mistakes and the problems that we had — it's good practice for any part of life anyway, so we
00:26:57.970 applied it here. On top of the incident reports, we started doing these post-mortems
00:27:03.070 and retrospectives, and we
00:27:08.350 try to do them almost immediately after the incident, so it's fresh in our minds. Obviously, if it happens at the end of
00:27:14.559 the day then we postpone it until the next day, but if it happens in the morning and it's resolved, we try and do
00:27:20.500 it in the afternoon so that you don't forget any details. When we do them, we focus on just
00:27:27.700 a few key questions. The what: basically the timeline, what happened in
00:27:34.690 what order. This helps us to figure out what exactly happened and get into the root cause of the problem. The
00:27:41.080 where: which geographical area — since, as I mentioned, there is San Francisco and London, but sometimes things go under
00:27:47.950 the radar, something happens in one time zone that overflows into the other. So that's also important to know, so that we
00:27:54.970 know whether there were any things that we could have done to prevent this from happening. Also,
00:28:02.460 obviously, as the product is used by customers from all over the world, only some regions might be affected by
00:28:08.890 certain problems too. That's also important because we figure out whether, you know, only
00:28:14.549 Asia-Pacific was affected, or was it only North America — it kind of cuts into our overall SLA, our overall service
00:28:23.320 level agreement with our customers. The why, basically going back from the what, kind of
00:28:29.500 dives into what the actual root cause was. This might not necessarily come up in the retrospectives immediately — it
00:28:36.370 might take a little bit more investigation — but we at least make sure that we have tickets or something, and we
00:28:42.640 have people assigned to figure out what exactly went wrong. And obviously that leads into the how: how can we prevent
00:28:49.750 this from happening again in the future? Which is the most important part.
00:28:55.410 Actually, the most important thing is that we do not make this a blame game. It's not just because we don't want to
00:29:02.320 point fingers at anybody, but also it's such a massive system that a small change
00:29:07.330 here can mean disaster over there. It's a mistake that anybody can make, and we don't want to put all the pressure onto
00:29:14.440 just one person; it's collective ownership of our problems and collective ownership of the efforts to fix them.
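Pulled together, the retrospective boils down to a short, blameless template — roughly:

    What:  the timeline — what happened, in what order
    Where: which geography / region was affected
    Why:   the root cause, or a ticket and an owner to go find it
    How:   the actions that prevent it from happening again
    (names go on follow-up actions, never on blame)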
00:29:20.440 We have also started doing weekly
00:29:26.620 handovers and monthly reviews the weekly handovers happen first thing
00:29:32.140 Monday morning in both the US and the UK basically all the engineers who had been
00:29:38.140 on call the previous week get into a room with the engineers who are coming on call the current week and they do a
00:29:44.950 handover: they say what were the problems that they noticed during their stint the previous week, are there
00:29:51.400 any lingering issues, are there any current ongoing issues that might come up during
00:29:57.330 this week's on-call stint. We also make note of all of the top
00:30:02.710 alerts and what we had to do to resolve them — and we have a fancy spreadsheet I won't show here, it's pivot
00:30:10.690 tables in Excel, it's pretty boring — but we keep track of this so that we know what are the things that we need to pay
00:30:17.440 attention to the most: which services are the most noisy or active or
00:30:23.080 problematic. Also, if any runbooks were missing for these services, we need to
00:30:28.210 make note of that and get that fixed as soon as possible —
00:30:33.660 so, again, focusing on the noisiest of these services.
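As a toy version of that kind of tally — the real thing is their Excel pivot tables, not this Ruby snippet — counting which services paged the most in a week:

    # Toy example: find the noisiest services from a week of alert records.
    alerts = [
      { service: "messages-api", name: "high-latency" },
      { service: "messages-api", name: "high-latency" },
      { service: "search",       name: "queue-backlog" },
    ]

    alerts.group_by { |a| a[:service] }
          .map { |service, fired| [service, fired.size] }
          .sort_by { |_, count| -count }
          .each { |service, count| puts "#{service}: #{count} alert(s)" }
    # => messages-api: 2 alert(s)
    #    search: 1 alert(s)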
00:30:39.750 Also, because time zones are really hard — I have also had a time zone snafu coming
00:30:46.960 here, anyways — um, time zones are hard. It's really hard to do the math in your head, you know, calculating: okay, so London
00:30:53.950 is eight hours ahead of San Francisco, but there's also daylight saving — is it summer or winter? It can
00:31:01.700 be a nightmare. So we only had the on-call tech leads in each of the
00:31:06.830 offices doing the handover between the UK and US. This frees up the actual
00:31:13.190 engineers from having to stay late or having to wake up early in the mornings for these handover meetings and it saves
00:31:20.270 us a little bit of coordination because there's only one or two people in charge of making sure that all the important
00:31:26.330 things are communicated across the geographies. And it also obviously means that
00:31:32.870 not everyone has to go to either 6pm, 7pm meetings in London or, I
00:31:39.500 don't know, 8am meetings in San Francisco. Aside from the reviews, in London we've
00:31:46.940 also started a side effort to make sure that the engineers are actually feeling like their time on call
00:31:53.150 is doing more good than not. So we started doing these surveys: they are
00:31:58.730 completely optional, but the engineers coming off on-call were recommended to fill out the survey, which had the same
00:32:05.780 questions every single time. We basically asked about their mood
00:32:11.630 before and after going on call, just to make sure that they're actually
00:32:18.620 feeling pretty prepared for whatever is meant to come up for them — or, you know,
00:32:24.080 maybe they had a really shitty week and they just felt completely exhausted by the end; that's also important to note
00:32:30.169 for burnout. We also had to make sure that we're
00:32:35.480 actually improving, so we asked things like: did you feel like you were
00:32:40.520 prepared, were all the proper runbooks available to you, was your — we have a
00:32:46.250 primary and a secondary, so was the other person available when you needed them, did your schedules work out
00:32:53.270 okay, were they available to cover you when you're commuting — all of these basically more of the personal stuff rather than
00:33:00.080 the technical aspects. And we of course overall just wanted to
00:33:06.620 make sure that nobody felt burned out at the end, and we wanted to make sure that they felt heard, that their opinions mattered
00:33:14.100 too. We didn't want anyone feeling like they were overwhelmed or on their own while they were on call.
00:33:21.320 but of course the biggest and worst problem of all is the fact that we
00:33:27.930 just have so many alerts still. We need to fix them all, and
00:33:33.680 in order to do that we started to try and break them down into bigger categories of alerts, so that we can
00:33:40.080 obviously use different processes to fix them. First on the chopping
00:33:45.540 block are the noisy ones that are probably configured wrong or something:
00:33:51.360 if there was a threshold that could be bumped, just bump it; if it's not even an
00:33:57.720 alert that you need — why do we even have it? — just delete it.
00:34:02.900 Flaky alerts are a little bit harder than this: it could possibly resolve
00:34:08.010 before you even have a chance to go and look at it, graphs could be spiky
00:34:13.350 but you don't know why. And then of course there are also people working on projects that,
00:34:19.310 you know, the service kind of falls under — maybe they didn't tell you about it, which is also bad, but maybe they
00:34:27.570 could also look at it too. Of course, if it's an actual really bad
00:34:33.060 problem, we have to spend more time to dig into it.
00:34:38.990 Sorry, going back also to the flaky alerts: we don't have this yet,
00:34:46.320 but we hope to eventually have an initial line of automated defense, where our monitoring systems can trigger
00:34:54.140 scripts that try to auto-repair when an alert triggers. If this fails, then we can
00:35:01.020 actually alert the on-call engineer — but at least, by the time that a human
00:35:06.390 looks at it, we've tried all the common responses to
00:35:12.720 this type of alert, especially if it's an alert that happens frequently.
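Since they don't have this yet, anything concrete is speculative, but the idea is roughly a hook like the following hypothetical Ruby sketch: the monitoring system calls it when an alert fires, a known remediation is tried first, and a human is only paged if that fails.

    # Hypothetical auto-remediation hook — illustrative, not an existing system.
    REMEDIATIONS = {
      "worker-queue-backlog" => "scripts/restart_workers.sh",  # invented examples
      "disk-nearly-full"     => "scripts/rotate_old_logs.sh",
    }

    # Stand-in for whatever the real paging tool provides.
    def page_on_call(alert_name)
      puts "paging the on-call engineer for #{alert_name}"
    end

    def handle_alert(alert_name)
      script = REMEDIATIONS[alert_name]
      if script && system(script)          # try the common fix first
        puts "#{alert_name}: auto-repaired by #{script}"
      else
        page_on_call(alert_name)           # a human looks at it, knowing the
      end                                  # usual responses were already tried
    end

    handle_alert("worker-queue-backlog")   # pages here, since the script doesn't exist locally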
00:35:18.200 All this probably sounds completely insane, and we're probably still really
00:35:23.340 far from where we want to be, but you know what, we've come a long way, and at least we've seen enough to know better
00:35:29.990 what path we need to take to actually get to our end goal. Hell, before, we
00:35:36.060 didn't even have an end goal — but now we do. And for us, what that means is we want to get
00:35:44.370 to a point where eventually we have only one
00:35:49.560 alert per person per day, so that's an absolute maximum of seven alerts
00:35:56.190 per week. We are still quite far from that, but at least it's a
00:36:01.350 tangible goal that we can set our eyes on and kind of work towards. It would also help further with the fact
00:36:09.690 that our rotations would have fewer people on call overall,
00:36:15.000 if the service owners could be responsible just for the services that they own. We've started
00:36:21.780 changing our model a little bit where instead of having everybody working on all the different services we have now
00:36:28.320 these domains or I think another common term is like squad where a set of people
00:36:35.400 work in a fixed area or a specific business aspect of the product and they
00:36:41.850 are responsible for it. If they were to be on call twenty-four-seven, that's not
00:36:47.580 achievable at the current state we're at, so this one-alert-per-person goal is very critical to actually getting to the
00:36:54.570 second point, of having people being on call for their services no matter what.
00:37:00.470 Obviously, as a side effect of that, it means our systems are probably more stable anyways, so even if you're on call
00:37:07.880 twenty-four-seven for specific services hopefully that means you're not actually going to be woken up in the middle of
00:37:14.130 the night anyways again this all comes back down to the
00:37:21.360 mean time between failures versus mean time to recovery. While we had previously been working on
00:37:26.940 the mean time to recovery — as I mentioned, you work on one, the other
00:37:32.580 might go down by nature, but then once one is good you work on fixing the other
00:37:38.010 thing — now that we're kind of at a place where we're okay with our mean time to recovery, it's time for us to
00:37:44.460 focus on our mean time between failures. So these goals will reflect that. And of
00:37:52.230 course, once that's all done, hopefully the world will be full of kittens and everyone will be happy.
00:37:59.720 But why — going to a higher level — why go through all this trouble, why put
00:38:06.420 all the stress on people? Isn't on-call just for ops? No.
00:38:12.170 That's it? Okay, well — it's not just one team's responsibility, it's
00:38:18.540 the entire organization's responsibility as a whole to move towards being proactive rather than reactive.
00:38:26.150 Everyone should care that their product is stable, that it's not affecting users negatively. I
00:38:33.470 also feel that as engineers we all have a sense of responsibility for the code
00:38:39.720 that we ship, and we share this with every single one of our colleagues — if we don't, then... that's out of my
00:38:46.800 league. We also take pride in the code that we write, in knowing that what we deliver
00:38:52.460 is delivering value to our customers, and therefore
00:38:57.750 to our business as well, but also in the quality of our code: doing things
00:39:02.820 like pull requests, getting code reviews, testing in pre-production — all these
00:39:08.430 things are things that we do because we want to make sure that our code is as good as it could possibly be.
00:39:14.990 if we want to achieve this world full of kittens though
00:39:21.140 we have to go through some pain — no pain, no gain. It's not easy though, at all, and
00:39:28.440 we all need a little bit of direction we all need to be herded
00:39:35.630 Going back again to "isn't on-call just for ops" — this is a quote from Wikipedia,
00:39:42.000 and I didn't put the citation here because, obviously, quoting Wikipedia is a little bit of a — my professors would tell me that's
00:39:49.589 absolutely a no-no. Anyway — you'll notice I wrote "ops"
00:39:55.690 here; some people might say, what about DevOps? So this is the actual definition, according to Wikipedia, of what DevOps is:
00:40:03.430 it is a culture, movement or practice that emphasizes the collaboration and communication of both software
00:40:10.839 developers and other IT professionals while automating the process of software delivery and infrastructure changes it
00:40:18.310 aims at establishing a culture and environment where building testing and releasing software can happen rapidly
00:40:25.710 frequently and more reliably. There is nothing in there about a single team; it's talking
00:40:32.830 about both software developers and other IT professionals. So again, on-call isn't just for ops,
00:40:39.849 and DevOps should just be a goal for the organization, not a team that you have — I
00:40:46.300 think, so, anyways. So, all of this: it's not easy, it's never
00:40:53.740 gonna get easier, I think — I hope that it will, but probably not. And
00:40:59.849 because it's not easy, we've taken such a long and convoluted path to get to this place where we have some process that
00:41:07.420 works for us now. But we still need to work on it, because nothing is perfect, and we need some
00:41:15.040 direction — we need to herd all of our cats. Thank you very much for taking the time to
00:41:22.089 listen to my story.
00:41:27.579 Thank you, Grace. As usual, if you have questions you can find
00:41:34.329 her here at the stage