
Herding Cats to a Firefight

EuRuKo 2016

00:00:05.310 Please take a seat. Thank you very much. So, our next speaker is Grace Chang.
00:00:12.799 She's an engineer at Yammer in London. She is also the on-call tech lead. As far as I understand, she likes reliable sites, continuous development, and I was told she also enjoys GIFs and food. Is that correct? OK, so here she is for you. Enjoy!
00:00:38.690 Thank you! Hello! Thank you again, and I apologize ahead of time to all the Bulgarians here if I mess this up.
00:00:43.800 But I've learned 'Zdraveyte', I think. I'm told that means 'hello'. If it means a curse word, I apologize profusely and blame all the organizers. They're cool people! My name is Grace, and I currently work at Yammer in London. However, as you can probably tell from my accent, I'm originally from Vancouver, Canada.
00:00:58.890 I just moved to London last year, but I have been on-call with Yammer for several years now, and that's what I'm here to talk about with you today. In case you're not familiar, Yammer is a social network for enterprises. We were acquired several years ago by Microsoft. It's a large platform with a lot of users, but we're obviously trying to grow every day.
00:01:15.240 I also apologize very profusely to the internet because while I am talking about cats, I am personally a dog person. I actually don't have any GIFs in my slides, but I have hand-drawn pictures, and hopefully that will be just as good as GIFs.
00:01:36.240 So, I'm going to talk about herding cats. If this is a phrase people are not familiar with, it's kind of a term we use to describe trying to gather people to do something, but it's really difficult because everyone is trying to do different things or they're just not cooperating with you. It is one of the hardest things you can possibly do.
00:02:03.389 Why specifically cats? I can share a story from one of my friends. When he was in university, he had several roommates with cats. He would take a laser pointer and try to get one of the cats to follow it into a bookshelf. While one cat was still inside the bookshelf, he would try to get the other cat into the same space. Obviously, that doesn't work well with one laser pointer because the first cat starts coming out while the other cat goes in. It's a challenging task, but this story isn't about impossibilities; it's about training teams to learn new tricks while dealing with difficult situations.
00:02:30.510 Now, let's rewind a bit, all the way back to the year 1 BC or even before. In the beginning, there was only darkness, but suddenly, out of the darkness, there came a sound... I apologize, there's no speaker setup, so imagine some sound effects here, it's mostly for effect. So, when I started at Yammer, we didn't really have a concept of on-call.
00:03:00.690 I'd heard tons of stories about other teams in different organizations having on-call teams, and let me tell you, it's not a pretty picture. They would have to carry pagers or modern mobile phones with them at all times, just in case something went wrong with the production environment. Yammer did have an on-call system of sorts. It was just one person who was on call all day and every night, every day of the week, for all of eternity. Well, not quite, but close enough.
00:03:31.030 Of course, anyone else on the engineering team was around to help. But eventually, the team grew so large and the environment became so unwieldy that it was unreasonable for one person to manage everything. Deep down, we knew it, but nobody wanted to be the first to say it. One of our leaders often said, and I'm paraphrasing, 'We don't need a whole on-call rotation; our project is so good that it doesn't need an entire team to keep it running.' In fact, on-call was seen as a waste of time.
00:04:31.030 As you might expect, disaster struck. It happened on a Friday— or for some of us, it was still Thursday because we hadn't slept much the night before. A massive production issue occurred for reasons unknown, and a handful of dedicated or insomniac engineers were struggling to hold things together.
00:04:55.100 Eventually, we had to call for help. I was the one who made that call, and let me tell you, it was rough. This is roughly how it went: 'Hi, sorry to be calling at this hour. I'm from Yammer and I work with [engineer’s name]. Can I please speak with him?' Keep in mind this was three in the morning, and the guy had just returned from his honeymoon.
00:05:11.460 I was chatting with his wife, and it was as awkward as it sounds. Eventually, he woke up, got some coffee, and helped us out. After several hours, we finally managed to stabilize our production environment. It was at this point that we decided we couldn't keep running it this way; this guy needed sleep. So, the decree was made: we would establish an on-call team. Problem solved, right? Easier said than done.
00:05:36.160 Let me fast forward in time to the year 180 after the disaster. The first difficulty we encountered while putting the on-call team together was figuring out the math. At that time, we chose four engineers, or 'cats', to be the guinea pigs who would iron out the process before we rolled it out to the whole team.
00:06:05.300 Out of these four, one was the cat herder, or the tech lead, which was my role. At that time, we also had a monolithic application written in Rails with about 15 services written in Java. We were trying to figure out how to organize and split the responsibilities efficiently so that it didn’t make everyone’s lives miserable.
00:06:28.660 That's not easy! It kind of ended up being four very sad cats. This is an actual photo of my notebooks when I was trying to figure out how to optimize our scheduling so it truly didn’t suck for everyone. Honestly, all the options felt somewhat terrible. Additionally, one major challenge we faced was getting used to the acronyms that started cropping up around us; we couldn't avoid them.
00:07:01.000 Acronyms like MTBF, MTTR, AAR, and SLA were becoming more prevalent. I will briefly go over some key definitions. The first one is MTBF, which stands for 'Mean Time Between Failures'. This metric tells you roughly how long you go between outages of your application or site. Mean Time to Recovery, or MTTR, is related to MTBF; it describes how long it takes to return from an unstable state back to a healthy one.
00:07:37.420 Then there's SLA—Service Level Agreement—sometimes called availability in certain contexts. This is essentially the contract between your business and your users, wherein you assure users that your site, application, or service will be available to them for a specific percentage of the month. It’s crucial because more downtime usually means fewer users.
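For a rough sense of scale (a worked example rather than a number from the talk): a 99.9% monthly availability target leaves only about 43 minutes of downtime in a 30-day month, since

    0.001 × 30 days × 24 hours × 60 minutes ≈ 43 minutes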
00:08:03.490 The After Action Review, or AAR, which is a term I actually had to look up again because it's an older one, is fairly self-descriptive. It's the write-up you produce after things have gone wrong, explaining what happened. There's also 'IR', which stands for 'Incident Report', a term we prefer simply because it is shorter.
00:08:34.220 We had a ton of acronyms, and it was overwhelming. We started breaking it down, and I'll do the same here, though it's going to be very rough. Going back to MTBF and Mean Time to Recovery, what mattered most to us wasn't one being more important than the other; it was about finding a balance and improving one without compromising the other. Equally important was making sure that our systems stayed healthy and availability stayed high.
00:09:10.850 For an analogy, think of it this way: the mean time between failures is like how often your cat vomits on the rug. Mean time to recovery is how quickly you can clean it up when it does. Of course, there are many factors to consider, one being the rough business impact of each metric.
00:09:44.120 For example, less frequent failures, meaning a longer mean time between failures, give you a more stable system overall. Meanwhile, focusing on cutting down the mean time to recovery gives you faster responses to incidents, so even when you do go down, you're only down for a few minutes at most.
00:10:05.470 To improve the mean time between failures, you invest in making your current setup more stable. To improve the mean time to recovery, you train engineers to respond rapidly to incidents as they occur. Each has its own benefit: with a longer mean time between failures, engineers are disturbed less often.
00:10:41.360 Similarly, with a lower mean time to recovery, engineers gain broader knowledge of the entire system simply because they handle incidents more often. However, these advantages come with potential downsides. If you optimize for mean time between failures, then instead of lots of small outages, you could face one significant event that lasts for several minutes or, worse, hours.
00:11:09.180 If you optimize for mean time to recovery, you might not experience prolonged outages, but you risk facing problems more frequently. The silver lining is that because you respond quickly and they happen often, you build up a lot of familiarity with how to tackle them.
00:11:24.900 So, this formula I'm about to show is the only one I have in my slides. It's not really a formula but the formal definition of the relationship between mean time between failures and mean time to recovery, alongside our overall Service Level Agreement.
00:11:50.520 These two metrics don't strictly depend on one another, but the SLA is influenced by both. It's not a push-pull relationship; rather, there's a range to work within. As one improves, the other doesn't have to decline; it can even dip temporarily and then bounce back once you're satisfied with where you are. Neither metric needs to be at one hundred percent.
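The standard way this relationship is written, which is presumably the formula on the slide, is:

    Availability = MTBF / (MTBF + MTTR)

so availability rises either when failures become rarer (a longer MTBF) or when recovery gets faster (a shorter MTTR), and you can trade one off against the other while still meeting the same SLA.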
00:12:27.160 So, we established our goals, but then we needed to track these metrics using various tools. Initially, we considered Google Docs, but that proved challenging for reading reports effectively.
00:12:41.750 Yammer Notes provided a rich text editing experience, allowing links to different threads or conversations, but it lacked good search capabilities. We also had Jira, already in use for bug tracking, and though it wasn't perfect, it worked reasonably well.
00:13:07.530 We worked through numerous iterations over several months and figured we were starting to get the hang of things. There were plenty of adjustments left to make, such as reducing the workload by dividing responsibilities based on technology stacks. We had Rails and Java, so we designated certain engineers to be on-call for one stack or the other.
00:13:40.130 This way, each person worked with fewer responsibilities at any given time. During this time, we also expanded our team, adding our London office to the mix. This helped ease the burden on the San Francisco team and allowed for shorter on-call rotations, letting them get some vital rest.
00:14:08.760 With more people joining the on-call process, it became critical to onboard them effectively. This meant they became comfortable asking for help when they were unfamiliar with certain areas and could share their own expertise accordingly.
00:14:40.109 While all of this was happening, one of the hardest parts was persuading the engineers to commit to being on-call; we were turning that cat-herding story into reality. Once we got there, we introduced drills, tabletop exercises, and breakdown scenarios, because practice is essential if you want to be effective at anything.
00:15:05.170 Once everyone was taking part in the drills, we also started adjusting the overall processes and tooling. The first major change was moving all alerts into a configuration repository.
00:15:24.000 By placing them in source control, we had a clear history of every alert, which made maintenance much smoother. Any update just required us to change the configuration and push it out.
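To make that concrete, here is a minimal Ruby sketch of what alerts-as-configuration can look like; all of the names, thresholds, and file paths are illustrative, not Yammer's actual tooling.

    # A minimal sketch, assuming alerts are kept as plain data in a repo.
    # Every name, description, and threshold here is hypothetical.
    Alert = Struct.new(:name, :description, :threshold, :team, :runbook, keyword_init: true)

    ALERTS = [
      Alert.new(
        name:        "rails_5xx_rate",
        description: "HTTP 5xx responses per minute on the Rails monolith",
        threshold:   50,
        team:        "oncall-rails",
        runbook:     "runbooks/rails_5xx.md"
      ),
      Alert.new(
        name:        "java_service_heap",
        description: "JVM heap usage percentage on a Java service",
        threshold:   90,
        team:        "oncall-java",
        runbook:     "runbooks/java_heap.md"
      )
    ]

    # A simple check that could run in CI: every alert must point at a runbook,
    # so ownership and history live in the same repository as the alert itself.
    ALERTS.each do |alert|
      raise "#{alert.name} has no runbook" if alert.runbook.to_s.empty?
    end

Because the definitions live in source control, changing a threshold becomes a reviewed commit and a push, which is the workflow described above.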
00:15:48.140 Next, we had managers take on the role of incident managers. Their responsibilities included documenting events as an incident unfolded, working out which engineers should be contacted as issues surfaced, and making sure the other affected teams were kept in the loop, particularly when it came to informing customers.
00:16:18.930 We implemented detailed runbooks for each service, ensuring the steps for resolution were clear. This arrangement was crucial for reducing the time to recovery since a clearly defined action plan made it easy to follow. We aimed to make the initial response as seamless as possible, so anyone could manage it with limited knowledge about a particular system.
00:16:49.520 Despite our organized process, people still needed to accomplish their regular duties. As we continued to refine our methodology, we encountered challenges in certain aspects of the process that simply weren’t working.
00:17:13.270 Coming back to my earlier point, you may be wondering what's left to fix after all these learnings. Like any real problem, whether with people or with code, there is no defined endpoint to the improvements we can make.
00:17:37.760 We are currently focused on combining schedules, keeping in mind that these discussions are still ongoing. Initially, we made a conscious decision to stick with separate schedules for the Rails and Java stacks, but we ended up creating a unified team that handles services across the entire stack.
00:18:02.060 This meant that people would be on-call less frequently. It might be a challenging week for someone, but that tough week would only need to occur once every nine to eighteen weeks based on team size, alleviating pressures associated with burnout.
00:18:26.290 We have also started conducting post-mortems and retrospectives after incidents to learn from our mistakes and the problems we faced. We aim to conduct them almost immediately after the incident to capture the details while they are still fresh. Of course, if something happened late in the day, we would delay until the following day.
00:18:52.380 During these reviews, we focus on a few key questions: what happened in what order, the geographical area impacted, the root cause of the issue, and how we can prevent similar incidents in the future.
00:19:18.030 It's essential to emphasize that our analysis is not about placing blame—it’s about recognizing that errors can stem from collective systems and processes, rather than pinpointing one individual.
00:19:44.240 We have also initiated weekly handovers and monthly reviews. Every Monday morning, engineers who were on call the previous week meet with the current team's engineers to discuss problems and lingering issues from last week's shift.
00:20:10.510 We carefully note the top alerts, document what actions were taken to resolve them, and maintain a spreadsheet to track these details. This documentation allows us to stay aware of which services are most problematic and whether any missing runbooks need to be addressed promptly.
00:20:37.610 We have to recognize that time zones can be tricky, especially when coordinating handovers. To streamline this, we have only the tech leads from each office meet, rather than requiring all engineers to attend meetings very late or very early in their day.
00:21:17.580 We also want to ensure that engineers feel the time they take while on-call is productive. To do this, we started optional surveys for engineers, capturing their mood before and after being on call to gauge preparedness and stress levels.
00:21:46.560 We asked about their sense of preparedness, whether the proper runbooks were available, how useful their primary and secondary contacts were during their shift, and whether their schedules worked for them, including whether they had the support they needed while commuting.
00:22:14.100 Ultimately, our goal is to avoid any feelings of burnout among our team. We want to ensure everyone has a voice and their opinions matter. We certainly do not want anyone overwhelmed or isolated while on call.
00:22:42.820 However, the biggest ongoing issue we face is the sheer volume of alerts. We are working to categorize them more effectively, grouping them into larger buckets we can deal with at once, so we can identify the most important ones and prioritize them.
00:23:11.220 With this approach, the noisy, misconfigured alerts get addressed first: if we can easily raise a threshold, we will. Alerts that aren't necessary get deleted, and for the ones that point at genuine, substantial issues, we intend to take a more structured approach.
00:23:40.500 We aim to build automated processes that can handle alerts and take initial action for common issues, hooking those processes into our monitoring systems. When an alert triggers, a script attempts an auto-repair first, and only if that fails does it page the on-call engineer.
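A minimal Ruby sketch of that auto-repair-then-page flow, with placeholder commands rather than Yammer's real scripts, might look like this:

    # Hypothetical remediation script wired to a monitoring trigger.
    def service_healthy?
      # Placeholder health check against a local endpoint.
      system("curl -sf http://localhost:8080/health > /dev/null")
    end

    def attempt_auto_repair
      # A common first action for this kind of alert: restart the service.
      system("sudo service example-app restart")
      sleep 30                      # give it time to come back up
      service_healthy?
    end

    def page_oncall_engineer(message)
      # Stand-in for a real paging integration.
      warn "PAGE: #{message}"
    end

    unless service_healthy?
      page_oncall_engineer("example-app unhealthy and auto-repair failed") unless attempt_auto_repair
    end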
00:24:07.720 Even with all the effort that has gone in, we recognize we still have a lot of ground to cover. But we have come a long way, and we now have a defined goal to target.
00:24:29.600 Our goal is to reach a point where each person gets at most one alert per day, so a maximum of seven alerts per week. We haven't hit that target yet, but it gives us tangible motivation and direction moving forward.
00:25:06.009 As we modify our approach, the teams themselves are starting to evolve. Instead of having individuals working on disparate services, we're moving to a more cohesive squad-oriented model, assigning each group a specific domain of the product that it is responsible for.
00:25:36.550 This transition is designed to ease the existing challenges in our on-call rotations and to strengthen accountability, making it easier to share responsibilities and move towards our objective.
00:26:07.540 It also brings the focus back to mean time between failures, since that metric reflects whether our systems are actually becoming more stable. As we raise our standards, we keep in mind the importance of maintaining the balance between mean time between failures and mean time to recovery.
00:26:36.360 Having gained experience with our current strategies, we intend to keep making the whole on-call experience friendlier, and with luck that effort will give us a smoother path down the road.
00:27:07.020 In conclusion, remember that while we try to reach a utopia of relaxed cats, we have to put in effort and commitment to develop that culture. Herding cats isn't a responsibility that falls solely on operations; it's a collective effort from everyone in the organization.
00:27:28.080 As developers, we have the responsibility of owning the code we ship, taking pride in delivering quality work that functions well for our customers and our overall business success.
00:28:02.740 Transforming our workplace into one bursting with positivity and stable systems requires dedication and awareness. The perseverance that goes into overcoming challenges ultimately leads to a better environment for our teams and our users.
00:28:37.100 Success is not just about the end result but about the journey, where each step taken reinforces the value of collaboration and teamwork.
00:29:05.040 Thank you very much for taking the time to listen to my story!
00:29:23.570 Thank you, Grace! As usual, if you have questions, you can find her here at the stage.