Talks

Herding Cats to a Firefight

EuRuKo 2016

00:00:05.310 please take a seat, thank you very much. So, our next speaker is Grace Chang.
00:00:12.799 She's an engineer at Yammer London. She is also the on-call tech lead, as far as
00:00:20.340 I understood. She likes reliable sites, continuous development, and I was told
00:00:25.740 also GIFs and food — is that correct? OK, so here she is — enjoy.
00:00:38.690 Thank you — uh, hello, thank you again. And I
00:00:43.800 apologize ahead of time to all the Bulgarians here if I mess this up,
00:00:48.890 but: [Bulgarian greeting] — I think? Yeah, I'm told that means hello. If it means a
00:00:55.470 curse word, I apologize profusely and I blame all the organizers. No, they're
00:01:01.470 cool people. So yes, my name is Grace, and currently I work at Yammer London, but as
00:01:07.350 you can probably tell from my accent I'm originally from Vancouver Canada I just moved to London last year but I have
00:01:15.240 been doing on call with Yammer for several years now and that's what I'm here to talk about with you today in
00:01:22.079 case you're not familiar Yammer is a social network for enterprise we were acquired several years ago by Microsoft
00:01:29.990 so it's a large platform a lot of users but we're obviously trying to grow every
00:01:36.240 day. Um, and I also apologize very profusely to the internet, because
00:01:42.389 while I am talking about cats, I am personally a dog person, and I actually
00:01:48.569 don't have any GIFs in my slides, but I have hand-drawn pictures and hopefully
00:01:55.950 that will be just as good as GIFs. So I'm going to talk about herding cats, and
00:02:03.389 in case this is a phrase that people are not familiar with, it's kind of a phrase that we use to talk about trying to
00:02:10.979 gather people to do something but it's really difficult because all these people are trying to do
00:02:17.410 different things or they're just not cooperating with you and it is one of the hardest things you can possibly do
00:02:22.750 and why specifically cats I can give a story that one of my friends told me he
00:02:30.510 used to play a game when he was in uni with his roommate's cats — I think at least
00:02:35.650 two of them maybe more he would take a laser pointer and try and get one of the
00:02:40.870 cats to follow the laser pointer into a bookshelf and then while that cat is still inside the bookshelf he would try
00:02:47.290 and get the other cat into the same bookshelf obviously that doesn't work out with one laser pointer because first
00:02:53.440 cat starts coming out the other cat goes in it's a kind of a hard task but this
00:02:59.920 story isn't about impossibilities it's about being able to train teams and help
00:03:05.080 them learn new tricks when we're dealing with difficult things. So let me rewind a little bit first, all
00:03:12.940 the way back to the year 1 BC or before
00:03:18.130 cats so in the beginning there was only
00:03:23.730 darkness but suddenly out of the darkness there
00:03:30.040 came a sound. I apologize, there's no speaker setup, so
00:03:36.100 imagine pager noises going in here — it's mostly for effect anyways. So when I
00:03:41.920 started at Yammer we didn't really have a concept of on-call. I'd heard lots of stories about other teams at other
00:03:48.700 organizations having on-call teams, and obviously it sounded pretty terrible:
00:03:53.739 they would have to carry pagers — or, in modern days, mobile phones — with them at all times, just in case something went
00:04:00.520 wrong with the production environment. Actually, Yammer did have an on-call — sort
00:04:07.000 of. It was one person who was on call all day and every night of every day of
00:04:15.280 every week, for all of eternity. Not quite, but close enough.
00:04:23.640 everyone was around to help of course anybody else on the engineering team but eventually this team grew so large and
00:04:31.030 the entire environment was so unwieldy that it was unreasonable for one person to be responsible for everything and we
00:04:38.710 knew that deep inside just nobody wanted to be the first one to say it but why not why didn't we want to help
00:04:46.000 this poor soul who had the entire world of Yammer on his shoulders well one of
00:04:51.009 our leading people said and I highly paraphrase here we don't need no
00:04:57.639 stinking on-call rotation our project is so good it doesn't need a whole team to
00:05:02.740 keep it running; in fact, that would be a waste of their time. I
00:05:08.039 call BS on this. So eventually, as one would predict,
00:05:13.900 disaster happened. It was a Friday — or, for
00:05:18.940 some of us, still Thursday, because we hadn't really slept much the night before — and a massive production issue
00:05:24.400 occurred for some reason or another, and a handful of either dedicated or
00:05:29.819 insomniac engineers were struggling to keep things from falling apart. Eventually we had to call for help, and I
00:05:38.110 was the one who had to make this call, and this is roughly how it went:
00:05:43.680 "Mmm, hi, sorry to be calling at this hour. I'm from Yammer and I work with
00:05:48.880 [engineer's name] — can I please speak with him?" Again, this was three o'clock in
00:05:54.789 the morning. This guy had just gotten back from his honeymoon, and his wife answered the phone. Basically, it was
00:06:01.900 as awkward as it sounds and eventually he woke up probably tried to
00:06:08.830 get some coffee and stuff and helped us out and eventually after several hours
00:06:14.080 we finally managed to get our production environment back into a stable state but it was at this point more or less that
00:06:20.560 we decided we can't keep doing this this guy needs to sleep so the decree was made we would have an
00:06:27.520 on-call team — yay! Problem solved, right? Easier said than done, like many things.
00:06:34.650 So let me now go a little forward in
00:06:39.699 time, to the year 1 AD — After Disaster. The first difficulty that we had come
00:06:45.720 across while just putting this on-call team together was: how do we do the math? At
00:06:51.930 the time, we decided that we would have just four engineers — or cats — to be
00:06:58.979 the guinea pigs (guinea cats?) to iron out the entire process before we
00:07:04.770 rolled it out to the entire team. Out of these four, one was the
00:07:10.050 cat herder, or the tech lead, and that was my role at this time. We also had one
00:07:15.509 monolithic application written in Rails, with about 15-ish services which were
00:07:21.060 written in Java, so that was two different stacks, and we tried to figure out how best to
00:07:27.000 organize and split the responsibilities in such a way that it didn't make everybody's lives suck, and
00:07:34.759 it's not easy — it kind of ended up with just four sad cats.
00:07:40.430 This is an actual photo of my notebooks from when I was trying to figure out
00:07:46.319 how to do our scheduling so it really didn't suck for everyone — but all the
00:07:51.539 options sucked, really. And another big challenge for us was having to get
00:07:57.930 used to all these acronyms that started cropping up all around us, and we couldn't avoid them: things like MTBF,
00:08:04.279 MTTR, AAR, SLA — and they became bigger and bigger
00:08:10.590 — not, obviously, the acronyms themselves. So I'll go quickly over the
00:08:16.169 meanings of some of these acronyms that were key to us. The first one is MTBF, mean time between
00:08:22.620 failures: this is basically how long it is, roughly, between outages of your
00:08:29.120 application or site, whatever it may be. Mean time to recovery, MTTR, is related to this,
00:08:37.680 but it is actually how long it takes for you to get from that unstable state back
00:08:42.930 to a healthy state. SLA — the most important thing, also called
00:08:49.110 availability in some areas — service level agreement: this is basically the contract between your business and your users, so
00:08:57.120 you're saying to your users, we promise that our site or application or
00:09:02.160 service will be available to you this percentage out of the month. It's pretty
00:09:08.760 important because obviously more downtime means fewer users. AAR, after action review — this one
00:09:16.440 came up and I actually had to look it up again because it's kind of an older one;
00:09:21.680 basically it's pretty self-descriptive: it is what you do with the documents,
00:09:27.840 the ones that you have to write up after the whole shindig has gone down, where you have to kind of explain what
00:09:33.900 happened. Also related to that is IR, incident report — this is actually the term that we like to use more, just
00:09:40.770 because, well, it's slightly shorter.
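To make those terms concrete, here's a minimal Ruby sketch with made-up numbers (nothing from the talk's slides), showing how an availability promise turns into a monthly downtime budget:

    # Illustrative only: how an SLA target translates into allowed downtime.
    sla = 0.999                          # promised availability, e.g. "three nines"
    minutes_per_month = 30 * 24 * 60     # ~43,200 minutes in a 30-day month
    downtime_budget = (1 - sla) * minutes_per_month
    puts "Allowed downtime: #{downtime_budget.round(1)} minutes per month"
    # => Allowed downtime: 43.2 minutes per month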
00:09:46.250 but yeah there was like so many acronyms and it was just so overwhelming to us so
00:09:51.390 we started to break it down and I'll break it down here too but this is very rough um
00:09:57.530 going back to MTBF, mean time between failures, and MTTR, mean time to recovery —
00:10:02.730 I'll try my best not to use the acronyms themselves and to use the full terms here.
00:10:09.740 Anyways, so between these two, what mattered the most to us?
00:10:14.840 But really it's not a matter of what's more important, it's how do you balance
00:10:21.720 them out, and how do you improve on one while making sure that
00:10:27.030 overall your systems are still healthy, that your availability is still up. It's
00:10:32.940 kind of like — going back to the cat analogies — mean time between failures is
00:10:37.980 like how often your cat vomits on the rug; mean time to recovery is how fast you can clean it up when it does.
00:10:44.270 And obviously there were various things that we had to consider: the rough business impact of each, which is the
00:10:51.180 part in grey — so for mean time between failures, obviously, failures are less frequent; for mean
00:10:56.610 time to recovery, when you focus on that you obviously have faster recovery, so it's much better — even if you're down,
00:11:02.760 it's only for a few minutes at most. In blue is what we need to do to achieve
00:11:09.269 either of these goals for mean time between failures required more stable
00:11:14.550 systems it depends obviously on the current state of your system whether that's something reasonable at that time
00:11:20.519 or it's something that you need to do more incrementally mean time to recovery also needed some time so that your
00:11:27.540 engineers could be trained to respond well to any of these incidents so that they can act fast to anything that comes
00:11:33.899 up the benefits of these two in green for mean time between failures it meant
00:11:40.709 that engineers were interrupted less frequently you know your cat vomits less
00:11:45.959 frequently — good, that's less frequently that you have to clean up after it. While for mean
00:11:52.170 time to recovery, engineers gain a broader knowledge over the entire system, just because they have to handle all of
00:11:59.220 these incidents — but at least they'll know how to do that fast. And of course the negatives for these: for
00:12:07.290 mean time between failures, failures could possibly be more disastrous — it's kind of like, instead of smaller
00:12:13.760 breaks or outages you might possibly have kind of a big bang outage that
00:12:19.860 lasts for several minutes or, god forbid, hours. While for mean time to recovery, we
00:12:26.579 might not have that, but it could also mean that you are facing more frequent issues — you know how to respond
00:12:33.329 to them quickly, but they come up more often; maybe that's how you get used to responding to them.
00:12:39.800 So this is — I swear — the only formula in my slides, and it's not really
00:12:45.660 even a formula, but this is kind of the official definition of the relationship between mean time between
00:12:52.050 failures and mean time to recovery, and of course our ever-looming SLA, or availability.
00:13:00.320 Neither of these really depends on the other necessarily, but obviously the
00:13:06.149 SLA does depend on both of them. But it's not a scale where you have to have
00:13:11.250 either one on opposite ends; it's more like a range for either one of them. So
00:13:16.649 as one goes up, the other one doesn't have to go up —
00:13:21.730 in fact it could go down temporarily, and it doesn't have to be a hundred percent. Then, when you're happy with that one —
00:13:27.399 for example the mean time between failures here — you can go and push the
00:13:33.070 other one, the second one, up. Whatever you do, though, obviously you just need to make sure that you stick to
00:13:39.250 the important thing on the left-hand side of the equation, the availability — and this is obviously the goal that you want in the end.
00:13:45.040 So we settled on our goals. Now we had to
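The slide itself isn't reproduced in this transcript, but the standard relation between these quantities is availability = MTBF / (MTBF + MTTR). A tiny Ruby sketch with invented numbers shows how improving either term keeps the left-hand side up:

    # Illustrative only: availability as a function of MTBF and MTTR (in hours).
    def availability(mtbf, mttr)
      mtbf / (mtbf + mttr)
    end

    puts availability(500.0, 1.0)   # fail rarely, recover fast    => ~0.998
    puts availability(500.0, 4.0)   # same MTBF, slower recovery   => ~0.992
    puts availability(2000.0, 4.0)  # rarer failures compensate    => ~0.998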
00:13:52.029 kind of track all these things — the AARs, or IRs, that we are using now. We
00:13:58.660 considered using Google Docs forms at the very beginning. Obviously there are many tools
00:14:03.940 that we considered, so I'm only listing the key ones that we actually tried out. This one ended up being kind of hard to
00:14:10.870 read the reports in — if you're not familiar with the way that Google Forms does it, it
00:14:16.449 basically spits out the responses as a spreadsheet, and when you're trying to read prose it makes it really
00:14:22.779 difficult to do that. There were also Yammer notes — we have
00:14:28.569 basically rich text editing in Yammer, the product, as well, and
00:14:33.839 the good thing about this was that you can link it to different threads or
00:14:39.010 conversations that are happening about the specific incidents. However, they were very hard to analyze: you can link them
00:14:45.880 to threads, but there's no very good way of searching or querying for these
00:14:51.160 notes inside of Yammer. Lastly we had JIRA, which we were already using as a
00:14:58.470 bug tracking tool, and it wasn't perfect but it sort of worked.
00:15:06.600 In the end there were lots of different fields, as you can see from another
00:15:12.880 photo of some of the notes that I was taking at the time to try and see how we could actually build an incident report
00:15:19.149 form that would make sense to engineers, that would be simple for them to fill out, but also not lose track of all the
00:15:25.779 important things. There are a lot of details that we considered important at first; we don't have all of these fields
00:15:31.990 anymore — as we went along, obviously, some of them naturally stopped being filled out,
00:15:37.220 and we determined they're probably not as important; they've become optional now. But the key ones are
00:15:43.700 obviously still there, and we've now been using JIRA for about three years and
00:15:50.270 it's been serving us quite well. It obviously has many benefits, including tag or label tracking — you can do
00:15:57.650 queries against them. So this is what we settled on using, and
00:16:04.450 all of this took place over several months and we figured hey we're starting
00:16:09.740 to get the hang of this maybe
00:16:15.430 Not yet. We weren't where we wanted to be at the time.
00:16:21.610 At this point in time, the way that we did projects made it such that whenever we created a new service in our
00:16:29.150 microservice, service-oriented-architecture world, we would kind of dump it into the pool with all
00:16:34.910 the others so this meant that the number of services grew faster than the
00:16:39.950 engineers who maintained them and it also meant possibly more technologies
00:16:45.320 which meant more toys for cats to play with — it means that they get distracted easily as well. For example, at
00:16:53.930 this time we had about four different data stores on top of the Rails and Java
00:16:59.450 applications that we had. So imagine taking like a bunch of yarn balls, tossing them on the ground, and then
00:17:06.290 letting loose a bunch of cats — it's kind of a disaster herding them to actually, you know, get them to do whatever you
00:17:12.560 want. So despite this, we also
00:17:18.790 rotated projects and people who worked on different projects, because we wanted to prevent pockets of expertise — we wanted to
00:17:26.060 prevent silos from forming, as some people refer to them.
00:17:31.930 But with all these cats playing with different balls, unfortunately one cat
00:17:38.510 might play with just one ball. So silos formed even though you didn't want them to: some people just actually
00:17:45.440 get more used to a certain service than another person. It just happens; it's not
00:17:50.720 something that is easily avoidable. On top of this, a side effect is that you get no clarity on who's responsible for
00:17:58.370 what. So, you know, after all the cats are done playing they kind of saunter
00:18:03.649 off to do whatever their own thing is, and they kind of leave all the yarn balls for someone else to deal with — probably
00:18:09.830 you. And most importantly: burnout.
00:18:15.799 Basically, I'm sure many people feel it in their normal day-to-day, but imagine also having the responsibility
00:18:22.159 of carrying your phone or even laptop around in the evenings and
00:18:28.070 on the weekends. It's very stressful, and the responsibility level is much higher
00:18:34.070 than usual. So that was another thing that was problematic, and we didn't really have a solution for it yet. So
00:18:42.669 several months of this continued on we kept making small tweaks here and there
00:18:47.840 and eventually we got to a point where we decided all these initial efforts were kind of good enough to roll out to
00:18:55.940 the entire team — or so we thought. There were plenty of adjustments that we obviously had to
00:19:02.210 make. One of those, that we decided to do, was to do more by doing less.
00:19:09.039 As I mentioned, we had two different stacks: Rails, and Java with Dropwizard,
00:19:16.700 and that was the easiest way to draw the line between responsibilities. So we had some people being on call for
00:19:23.000 Rails while others were on call for Java, so we decided to go with that. It meant
00:19:30.049 that you as a person were probably dealing with less at any given point in time. On top of that, we added in our London
00:19:37.909 office — when all of this was taking place it was only in the San Francisco office. We added the
00:19:44.330 London office with shorter rotations, and it meant that people in San Francisco could finally sleep at night; they don't
00:19:51.200 have to go on from 9am to 9am the next day, basically 24 hours.
00:19:56.600 It also meant that, with more people joining the process and the team of being
00:20:03.470 on call, they gained more knowledge — but we had to onboard them to this process, we had to help them
00:20:10.190 gain that knowledge, because it's a lot to digest all at once. So we had to make sure we trained people and had
00:20:19.250 them be both open to helping anybody in an area that they were most familiar with — they also had to
00:20:27.350 be okay, obviously, with asking somebody more knowledgeable about some other area that might not be their forte.
00:20:36.670 Overall this is the hardest part: convincing people to do on-call, especially with engineers —
00:20:43.990 going back to the story of the cats in the bookcase, it is basically that.
00:20:49.600 but it is possible to get the whole team on board you just have to be persuasive
00:20:55.010 about it, and convincing, and I'll touch on that a little bit later. Once this was all done we also made sure
00:21:02.540 to practice everything: we did drills, we did tabletop exercises, we did
00:21:09.470 breakdowns — just practice, as with almost anything else in life.
00:21:16.900 so now that all hands were on deck we had
00:21:21.980 to also make some more adjustments to the overall process and the system and tooling. The first change we made was to move
00:21:30.020 all of the alerts into a configuration repo. This meant that, since they were
00:21:36.410 in source control, it was clear what the history of each alert was, and it was also easier to maintain: for any changes
00:21:43.580 that needed to be made, you just had to change the configuration, push it out, and the alert was changed.
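As a purely hypothetical sketch of what "alerts as configuration in a repo" can look like — this is not Yammer's actual tooling or format, just an illustration of the idea in Ruby — something like one reviewed definition per alert:

    # Hypothetical alert definition, kept in a config repo and code-reviewed;
    # changing the threshold is just a commit plus a push of the config.
    ALERT = {
      name:        "messages-api-high-latency",          # invented example
      query:       "p99(messages_api.request_time_ms)",
      threshold:   1500,                                  # fire when p99 latency exceeds 1.5s
      duration:    "5m",                                  # ...for five consecutive minutes
      severity:    :page,                                 # page the on-call engineer
      runbook_url: "https://wiki.example.com/runbooks/messages-api", # hypothetical
    }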
00:21:50.590 Another thing we did was — since engineers were busy doing their thing and managers
00:21:55.970 weren't doing too much else — make them incident managers; make them do something.
00:22:01.480 So it was the role of the managers, or the incident managers, to do all the
00:22:07.070 documentation, such as starting the incident reports and making sure that all the details were properly and correctly
00:22:14.539 filled out. They also coordinated all the necessary engineers for any issues — again, going
00:22:21.350 back to some people having more expertise in certain areas, if someone was kind of struggling with something, the
00:22:27.769 incident manager would be able to pull in someone else by calling them or messaging them, texting them, sending
00:22:34.490 carrier pigeons, I don't know — however you want to get people pulled in. Most importantly, though, they did the
00:22:41.779 external communications with the other teams at our company that had to tell
00:22:48.230 the customers the bad news that we are having issues. I can imagine — I myself
00:22:53.510 obviously would not want to be the one telling our users, hey, sorry, our site's kind of down, you're gonna have to wait
00:23:00.049 for a little bit to use it. Imagine it's just like having to tell a cat
00:23:06.139 to hang out with this specific person: it's going to have a mind of its
00:23:11.450 own, it's going to wander off and do something else. We also did runbooks for all of these
00:23:19.179 different services that we had. Originally we of course did have some form of readme or runbook for
00:23:26.149 each service, but the most important thing about the runbooks that we had was that we made them very in-depth with
00:23:34.070 regard to the alerts that were tied to the service. This is especially important
00:23:39.620 for the mean time to recovery, because it means that if there are clear steps for
00:23:44.630 resolution, all you have to do is just go: I got this alert, follow steps one, two,
00:23:50.690 and three, and hopefully it should be resolved by then. This also obviously meant that the
00:23:57.799 initial response was as simple as possible — we assumed anybody should be
00:24:03.049 able to do this even if they have as little knowledge about a certain system as, I don't know, your cat.
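For illustration only — not one of Yammer's actual runbooks — an alert-oriented runbook entry might look roughly like this:

    Alert: messages-api-high-latency (hypothetical)
    1. Check the service dashboard for the affected region.
    2. If a single host is the outlier, take it out of the load balancer.
    3. If latency is cluster-wide, restart the service with the usual deploy tooling.
    4. Not resolved after 15 minutes? Escalate to the secondary on-call.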
00:24:12.370 But even with all this, we also decided: we can't keep doing this — these people
00:24:18.620 still need to do work outside of this. So this continued for a while, and we
00:24:25.169 slowly began to hit our stride, but we obviously continued making adjustments as we discovered parts of the process that
00:24:31.379 just didn't really work out for us. And so my story comes back up to speed with
00:24:38.190 the present, and you might be thinking: we've done all these things, what's left
00:24:44.159 to fix? Well, like any real problem, person-wise or code-wise, there's no
00:24:50.669 actual end to the things that you can improve on. So these are some of the things now that
00:24:57.269 we are focusing on the first being combined schedules and keep in mind the
00:25:02.970 things that I'm talking about now are still in flux we're doing them now so
00:25:08.269 take them with a grain of salt so combining the two schedules I
00:25:14.570 mentioned previously that we had decided actively not to do this at the beginning
00:25:19.739 and we split by stacks, so we had Rails and Java. Now we had basically a single unified
00:25:27.869 team who just worked on services, so we decided that they should be responsible
00:25:34.830 for all the services regardless of stack. It sounds crazy — it probably is —
00:25:42.889 but we did this knowing that it meant that people would end up being
00:25:50.190 on-call less often: it might be a shitty week for them, but at least that shitty week will only have to happen
00:25:56.519 maybe once every nine to 18 weeks depending on obviously how many people
00:26:01.710 were in the team and this was very critical obviously for burnout we didn't want people feeling tired or stressed so
00:26:10.289 doing it less frequently, at least they could prepare ahead of time; they can kind of recoup in the time off and
00:26:17.389 get some actual work done.
00:26:24.739 This wasn't the only reasoning behind it, though — it also came out of the fact that, since these teams are unified, the
00:26:31.590 schedule should be too. So this is kind of important, based on the
00:26:36.970 fact that we believe that the engineer should be able to handle all issues regardless of their primary programming
00:26:43.059 language. We have also started doing post-mortems and retrospectives. It's basically
00:26:51.130 learning from our mistakes and the problems that we had — it's good practice for any part of life anyway, so we
00:26:57.970 applied it here. On top of the incident reports, we started doing these post-mortems
00:27:03.070 and retrospectives, and we
00:27:08.350 try to do them almost immediately after the incident, so it's fresh in our minds. Obviously, if it happens at the end of
00:27:14.559 the day then we postpone it until the next day, but if it happens in the morning and it's resolved, we try and do
00:27:20.500 it in the afternoon so that you don't forget any details. When we do them, we focus on just
00:27:27.700 a few key questions. The what: basically the timeline, what happened in
00:27:34.690 what order. This helps us to figure out what exactly happened and get into the root cause of the problem. The
00:27:41.080 where: which geographical area — since, as I mentioned, there is San Francisco and London, but sometimes things go under
00:27:47.950 the radar, something happens in one time zone that overflows into the other. So that's also important to know, so that we
00:27:54.970 know whether there were any things that we could have done to prevent this from happening. Also,
00:28:02.460 obviously, as the product is used by customers from all over the world, only some regions might be affected by
00:28:08.890 certain problems too. That's also important because we figure out whether, you know, only
00:28:14.549 Asia-Pacific was affected, or was it only North America — it kind of cuts into our overall SLA, our overall service
00:28:23.320 level agreement with our customers. The why, basically going back from the what, kind of
00:28:29.500 dives into what the actual root cause was. This might not necessarily come up in the retrospectives immediately — it
00:28:36.370 might take a little bit more investigation — but we at least make sure that we have tickets or something, and we
00:28:42.640 have people assigned to figure out what exactly went wrong. And obviously that leads into the how: how can we prevent
00:28:49.750 this from happening again in the future? Which is the most important part.
00:28:55.410 Actually, the most important thing is that we do not make this a blame game. It's not just because we don't want to
00:29:02.320 point fingers at anybody, but also it's such a massive system that a small change
00:29:07.330 here can mean disaster over there. It's a mistake that anybody can make, and we don't want to put all the pressure onto
00:29:14.440 just one person; it's collective ownership of our problems and collective ownership of the efforts to fix them.
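Pulled together, the retrospective boils down to a short, blameless template — roughly:

    What:  the timeline — what happened, in what order
    Where: which geography / region was affected
    Why:   the root cause, or a ticket and an owner to go find it
    How:   the actions that prevent it from happening again
    (names go on follow-up actions, never on blame)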
00:29:20.440 We have also started doing weekly
00:29:26.620 handovers and monthly reviews the weekly handovers happen first thing
00:29:32.140 Monday morning in both the US and the UK basically all the engineers who had been
00:29:38.140 on call the previous week get into a room with the engineers who are coming on call the current week and they do a
00:29:44.950 handover: they say what were the problems that they noticed during their stint the previous week, are there
00:29:51.400 any lingering issues, are there any current ongoing issues that might come up during
00:29:57.330 this week's on-call stint. We also make note of all of the top
00:30:02.710 alerts and what we had to do to resolve them — and we have a fancy spreadsheet I won't show here, it's pivot
00:30:10.690 tables in Excel, it's pretty boring — but we keep track of this so that we know what are the things that we need to pay
00:30:17.440 attention to the most: which services are the most noisy or active or
00:30:23.080 problematic. Also, if any runbooks were missing for these services, we need to
00:30:28.210 make note of that and get that fixed as soon as possible —
00:30:33.660 so, again, focusing on the noisiest of these services.
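As a toy version of that kind of tally — the real thing is their Excel pivot tables, not this Ruby snippet — counting which services paged the most in a week:

    # Toy example: find the noisiest services from a week of alert records.
    alerts = [
      { service: "messages-api", name: "high-latency" },
      { service: "messages-api", name: "high-latency" },
      { service: "search",       name: "queue-backlog" },
    ]

    alerts.group_by { |a| a[:service] }
          .map { |service, fired| [service, fired.size] }
          .sort_by { |_, count| -count }
          .each { |service, count| puts "#{service}: #{count} alert(s)" }
    # => messages-api: 2 alert(s)
    #    search: 1 alert(s)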
00:30:39.750 Also, because time zones are really hard — I have also had a time zone snafu coming
00:30:46.960 here, anyways — um, time zones are hard. It's really hard to do the math in your head, you know, calculating: okay, so London
00:30:53.950 is eight hours ahead of San Francisco, but there's also daylight saving — is it summer or winter? It can
00:31:01.700 be a nightmare. So we only had the on-call tech leads in each of the
00:31:06.830 offices doing the handover between the UK and US. This frees up the actual
00:31:13.190 engineers from having to stay late or having to wake up early in the mornings for these handover meetings and it saves
00:31:20.270 us a little bit of coordination because there's only one or two people in charge of making sure that all the important
00:31:26.330 things are communicated across the geographies. And it also obviously means that
00:31:32.870 not everyone has to go to either 6pm, 7pm meetings in London or, I
00:31:39.500 don't know, 8am meetings in San Francisco. Aside from the reviews, in London we've
00:31:46.940 also started a side effort to make sure that the engineers are actually feeling like their time on call
00:31:53.150 is doing more good than not. So we started doing these surveys: they are
00:31:58.730 completely optional, but the engineers coming off on-call were recommended to fill out the survey, which had the same
00:32:05.780 questions every single time. We basically asked about their mood
00:32:11.630 before and after going on call, just to make sure that they're actually
00:32:18.620 feeling pretty prepared for whatever is meant to come up for them — or, you know,
00:32:24.080 maybe they had a really shitty week and they just felt completely exhausted by the end; that's also important to note
00:32:30.169 for burnout. We also had to make sure that we're
00:32:35.480 actually improving, so we asked things like: did you feel like you were
00:32:40.520 prepared, were all the proper runbooks available to you, was your — we have a
00:32:46.250 primary and a secondary, so was the other person available when you needed them, did your schedules work out
00:32:53.270 okay, were they available to cover you when you're commuting — all of these basically more of the personal stuff rather than
00:33:00.080 the technical aspects. And we of course overall just wanted to
00:33:06.620 make sure that nobody felt burned out at the end, and we wanted to make sure that they felt heard, that their opinions mattered
00:33:14.100 too. We didn't want anyone feeling like they were overwhelmed or on their own while they were on call.
00:33:21.320 but of course the biggest and worst problem of all is the fact that we
00:33:27.930 just have so many alerts still. We need to fix them all, and
00:33:33.680 in order to do that we started to try and break them down into bigger categories of alerts, so that we can
00:33:40.080 obviously use different processes to fix them. First on the chopping
00:33:45.540 block are the noisy ones that are probably configured wrong or something:
00:33:51.360 if there was a threshold that could be bumped, just bump it; if it's not even an
00:33:57.720 alert that you need — why do we even have it? — just delete it.
00:34:02.900 Flaky alerts are a little bit harder than this: it could possibly resolve
00:34:08.010 before you even have a chance to go and look at it, graphs could be spiky
00:34:13.350 but you don't know why. And then of course there are also people working on projects that,
00:34:19.310 you know, the service kind of falls under — maybe they didn't tell you about it, which is also bad, but maybe they
00:34:27.570 could also look at it too. Of course, if it's an actual really bad
00:34:33.060 problem, we have to spend more time to dig into it.
00:34:38.990 Sorry, going back also to the flaky alerts: we don't have this yet,
00:34:46.320 but we hope to eventually have an initial line of automated defense, where our monitoring systems can trigger
00:34:54.140 scripts that try to auto-repair when an alert triggers. If this fails, then we can
00:35:01.020 actually alert the on-call engineer — but at least, by the time that a human
00:35:06.390 looks at it, we've tried all the common responses to
00:35:12.720 this type of alert, especially if it's an alert that happens frequently.
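Since they don't have this yet, anything concrete is speculative, but the idea is roughly a hook like the following hypothetical Ruby sketch: the monitoring system calls it when an alert fires, a known remediation is tried first, and a human is only paged if that fails.

    # Hypothetical auto-remediation hook — illustrative, not an existing system.
    REMEDIATIONS = {
      "worker-queue-backlog" => "scripts/restart_workers.sh",  # invented examples
      "disk-nearly-full"     => "scripts/rotate_old_logs.sh",
    }

    # Stand-in for whatever the real paging tool provides.
    def page_on_call(alert_name)
      puts "paging the on-call engineer for #{alert_name}"
    end

    def handle_alert(alert_name)
      script = REMEDIATIONS[alert_name]
      if script && system(script)          # try the common fix first
        puts "#{alert_name}: auto-repaired by #{script}"
      else
        page_on_call(alert_name)           # a human looks at it, knowing the
      end                                  # usual responses were already tried
    end

    handle_alert("worker-queue-backlog")   # pages here, since the script doesn't exist locally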
00:35:18.200 All this probably sounds completely insane, and we're probably still really
00:35:23.340 far from where we want to be, but you know what, we've come a long way, and at least we've seen enough to know better
00:35:29.990 what path we need to take to actually get to our end goal. Hell, before, we
00:35:36.060 didn't even have an end goal — but now we do. And for us, what that means is we want to get
00:35:44.370 to a point where eventually we have only one
00:35:49.560 alert per person per day, so that's an absolute maximum of seven alerts
00:35:56.190 per week. We are still quite far from that, but at least it's a
00:36:01.350 tangible goal that we can set our eyes on and kind of work towards. It would also help further with the fact
00:36:09.690 that our rotations would have fewer people on call overall,
00:36:15.000 if the service owners could be responsible just for the services that they own. We've started
00:36:21.780 changing our model a little bit where instead of having everybody working on all the different services we have now
00:36:28.320 these domains or I think another common term is like squad where a set of people
00:36:35.400 work in a fixed area or a specific business aspect of the product and they
00:36:41.850 are responsible for it. If they were to be on call twenty-four-seven, that's not
00:36:47.580 achievable at the current state we're at, so this one-alert-per-person goal is very critical to actually getting to the
00:36:54.570 second point, of having people being on call for their services no matter what.
00:37:00.470 Obviously, as a side effect of that, it means our systems are probably more stable anyways, so even if you're on call
00:37:07.880 twenty-four-seven for specific services hopefully that means you're not actually going to be woken up in the middle of
00:37:14.130 the night anyways again this all comes back down to the
00:37:21.360 mean time between failures versus mean time to recovery. While we had previously been working on
00:37:26.940 the mean time to recovery — as I mentioned, you work on one, the other
00:37:32.580 might go down by nature, but then once one is good you work on fixing the other
00:37:38.010 thing — now that we're kind of at a place where we're okay with our mean time to recovery, it's time for us to
00:37:44.460 focus on our mean time between failures. So these goals will reflect that. And of
00:37:52.230 course, once that's all done, hopefully the world will be full of kittens and everyone will be happy.
00:37:59.720 But why — going to a higher level — why go through all this trouble, why put
00:38:06.420 all the stress on people? Isn't on-call just for ops? No.
00:38:12.170 That's it? Okay, well — it's not just one team's responsibility, it's
00:38:18.540 the entire organization's responsibility as a whole to move towards being proactive rather than reactive.
00:38:26.150 Everyone should care that their product is stable, that it's not affecting users negatively. I
00:38:33.470 also feel that as engineers we all have a sense of responsibility for the code
00:38:39.720 that we ship, and we share this with every single one of our colleagues — if we don't, then... that's out of my
00:38:46.800 league. We also take pride in the code that we write, in knowing that what we deliver
00:38:52.460 is delivering value to our customers, and therefore
00:38:57.750 to our business as well, but also in the quality of our code: doing things
00:39:02.820 like pull requests, getting code reviews, testing in pre-production — all these
00:39:08.430 things are things that we do because we want to make sure that our code is as good as it could possibly be.
00:39:14.990 if we want to achieve this world full of kittens though
00:39:21.140 we have to go through some pain — no pain, no gain. It's not easy though, at all, and
00:39:28.440 we all need a little bit of direction we all need to be herded
00:39:35.630 Going back again to "isn't on-call just for ops" — this is a quote from Wikipedia,
00:39:42.000 and I didn't put the citation here because, obviously, quoting Wikipedia is a little bit of a — my professors would tell me that's
00:39:49.589 absolutely a no-no. Anyway — you'll notice I wrote "ops"
00:39:55.690 here; some people might say, what about DevOps? So this is the actual definition, according to Wikipedia, of what DevOps is:
00:40:03.430 it is a culture, movement or practice that emphasizes the collaboration and communication of both software
00:40:10.839 developers and other IT professionals while automating the process of software delivery and infrastructure changes it
00:40:18.310 aims at establishing a culture and environment where building testing and releasing software can happen rapidly
00:40:25.710 frequently and more reliably. There is nothing in there about a single team; it's talking
00:40:32.830 about both software developers and other IT professionals. So again, on-call isn't just for ops,
00:40:39.849 and DevOps should just be a goal for the organization, not a team that you have — I
00:40:46.300 think, so, anyways. So, all of this: it's not easy, it's never
00:40:53.740 gonna get easier, I think — I hope that it will, but probably not. And
00:40:59.849 because it's not easy, we've taken such a long and convoluted path to get to this place where we have some process that
00:41:07.420 works for us now. But we still need to work on it, because nothing is perfect, and we need some
00:41:15.040 direction — we need to herd all of our cats. Thank you very much for taking the time to
00:41:22.089 listen to my story.
00:41:27.579 Thank you, Grace. As usual, if you have questions you can find
00:41:34.329 her here at the stage