You Should Be On Call, too

by Joshua Timberman

The video titled "You Should Be On Call, too" features Joshua Timberman discussing the intersection of development and operations within the context of DevOps. The central theme revolves around the idea that developers should actively participate in on-call responsibilities, fostering collaboration between development and operations teams to enhance efficiency and accountability in handling application-related issues.

Key points discussed in the video include:

The Importance of On-Call Participation: Emphasizing that both developers and operations teams should share responsibilities to address production issues efficiently.
Collaboration Over Blame: Highlighting that the move towards developers being on call is not about assigning blame but about fostering a collaborative environment where developers assist operations in maintaining applications.
Challenges with Silos and Communication: Sharing anecdotes about how siloed responsibilities lead to inefficiencies and misunderstandings, especially when handling system alerts and notifications.
Case Studies from Industry Leaders: Mentioning companies like Opscode, Heroku, and Etsy, which have successfully integrated developers into on-call rotations, illustrating the value this brings in terms of ownership and rapid problem-solving.
Incorporation of Tools and Automation: Discussing how effective monitoring systems and automation can lead to a better shared understanding of application performance and reduce alert fatigue.
Outcome of Shared Responsibilities: Outlining benefits such as increased morale, better system reliability, and a culture of shared responsibility, which aligns with the DevOps ethos.

In conclusion, Timberman highlights that enabling developers to be part of on-call duties not only helps in solving immediate issues but also cultivates a deeper understanding of the applications they create. This intersection of roles lowers the burden on operations teams and results in improved overall business performance, ultimately contributing to a more fulfilling working environment. The talk encourages organizations to adopt these strategies as a part of their DevOps practices, enhancing collaboration and operational efficiency.

00:00:16.380 Hey, you should be on call too. This is a contentious topic. Earlier, Mike mentioned that speakers spend six to eight weeks preparing their presentations. Unfortunately, most speakers only find out that their talk has been accepted about three weeks before the conference, which means they often write their talk on the plane on the way there. This is why we're nervous: we’ve never given this talk before. We do, however, worry about it for a lot longer, about three weeks.

00:00:19.640 The timing for today's talk is interesting because we only found out about the schedule a few days ago. But I know that everyone is actually here because they want to hear Jesse talk about ChatOps at GitHub, so you'll have to bear with me for this time instead. Now, who am I? First of all, why am I qualified to talk about why you should be on call? Well, I work for Opscode, where we have people that are on call and we deal with operational tasks. We have some automation software that some of you might have heard about, and I’ll touch on that a little later. However, I’m a community manager at Opscode, which means I’m not actually on call since I'm not in operations.

00:01:02.879 Despite this, I have been a systems administrator for the majority of my career, so I’ve carried a pager. You know, the Arch Wireless ones that used to function in every data center. In my community manager role, I’m somewhat on call because I support our community and the cookbooks that I write, which are run in their infrastructures. I answer questions on Twitter, IRC, and through GitHub pull requests. If you have my Skype ID from my business card, I’m easy to find. So if you have a problem and need help with Chef, I’m somewhat available. Aside from being a systems administrator, I have a variety of interests. I'm a father, a gamer, and in the past year, I have gotten into CrossFit.

00:01:57.810 One of the things about being a systems administrator is that my career has led to a number of interruptions. For instance, today is my son's birthday, and it's the first birthday I've missed because of work. So on account of that, can I get a happy birthday for Ethan? One, two, three... thank you! I'm a big fan of audience participation, and since we've been sitting for a while, let's do a quick show of hands on this topic. Who here is a sysadmin or operations person? All right, now who is a developer? Who is a sysadmin that also writes code? Okay, that's a pretty good mix.

00:02:32.060 Now, who among you are business people, including consultants, managers, and directors? Okay. Who here writes code and is also a director that manages people? They say that managers don’t actually write code, but we know some do. Who is on call for production outages? Not as many hands. Who is using Chef? It's great to see!

00:03:15.099 Let’s talk about pagers. After all, one of the essential parts of DevOps is that every developer should carry a pager. That’s the cultural aspect of it, right? But we do need to share responsibility. This isn’t about assigning blame; it’s about collaboration. Operations teams need your help. If you're a developer, we need your assistance. It’s been mentioned before: developers can make their applications more operationally fit, which includes having the right tools. We need that support. That's not to say that we are hiring, but just a reminder that everyone is hiring!

00:04:06.600 So imagine this: you have another company running an application that developers wrote and operations teams have put into production. When developers say, "it works on my machine," it often leads to roadblocks where operations personnel say, "no, you cannot deploy." Recently, we've heard talk of doing ten deploys a day, 15 deploys a day, continuous deployment. It seems simple; just run a command in a loop, right? Yet it’s really about the need for collaboration and sharing responsibility for the applications you write and the ones we run. We’re not just separate teams; we’re collaborating to make things better.

00:05:53.160 Let me share a story about silos and separation of duties. I once worked for a large enterprise IT services company where my team was siloed. We were strictly systems administrators due to separation-of-duties protocols as outlined by COBIT or ITIL methodologies. This meant that sysadmins were on call for all OS-level matters, while other teams handled applications. Our team had a hot pager that rotated weekly; the on-call person was the primary contact for any system issues.
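The weekly "hot pager" rotation described here can be sketched in a few lines. This is only an illustration, not how that company scheduled its pager: the function name `on_call_for`, the roster names, and the start date are all made up for the example, and the roster deliberately mixes developers and sysadmins.

```python
from datetime import date


def on_call_for(day: date, roster: list[str], rotation_start: date) -> str:
    """Return who holds the pager on `day`, handing it over once per week."""
    if not roster:
        raise ValueError("roster must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Whole weeks since the rotation began decide whose turn it is.
    weeks_elapsed = (day - rotation_start).days // 7
    return roster[weeks_elapsed % len(roster)]


# Hypothetical roster and start date (a Monday), chosen only for illustration.
ROSTER = ["dev-alice", "ops-bob", "dev-carol", "ops-dave"]
START = date(2013, 3, 4)
```

With this roster, `on_call_for(date(2013, 3, 11), ROSTER, START)` picks the second person, and the cycle wraps back to the first person after four weeks.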

00:06:46.819 Let’s say an alert comes in about a full disk; the helpdesk pages the on-call person. That person investigates and finds that a customer’s data disk is full, prompting them to page the primary sysadmin of that account. Now it’s two or three in the morning, and they get woken up. The primary admin then investigates the file system and realizes there are tons of logs consuming space. This situation is common. Who knows which logs can be deleted? So we page out the application support team for the application in question, which could be either an internal team or customer support.

00:08:58.620 Repeatedly, these alerts show up, whether it's a disk full, high CPU usage, or something else. It’s easy for the helpdesk to receive these alerts and pages regarding the application without a precise understanding of how to analyze such issues. What typically happens next is they escalate to administrators who may not have full context. This can quickly lead to fatigue for both ops and development teams.

00:09:15.670 These metrics-based alerts aren't always actionable. For instance, 'High CPU usage' doesn’t mean the application is down; maybe it’s designed to utilize CPU resources. Alert fatigue can develop when alerts don’t provide clear guidance for action. This confusion complicates situations for on-call personnel when they have to assist.
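One way to picture the difference between a metric and an actionable alert is a small routing rule. The sketch below is an assumption made for illustration, not a policy from the talk: the `Alert` fields, the `route` function, and the three outcomes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    user_facing: bool   # is the application actually failing for users?
    runbook: str = ""   # what the responder should do; empty means nobody wrote it down


def route(alert: Alert) -> str:
    """Decide whether an alert should wake someone up (illustrative policy only)."""
    if not alert.user_facing:
        # A raw metric such as high CPU is recorded, not paged.
        return "ticket"
    if not alert.runbook:
        # Still page, but flag that the alert gives no guidance.
        return "page-and-write-runbook"
    return "page"


high_cpu = Alert(name="High CPU usage", user_facing=False)
site_down = Alert(name="Checkout returns 500", user_facing=True,
                  runbook="Roll back the last deploy")
```

Under this rule, `route(high_cpu)` files a ticket while `route(site_down)` pages someone who has a concrete first step to take.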

00:09:41.900 However, I love CrossFit! This is me doing thrusters. If you would like to meet at the Little America lobby tomorrow morning at six-forty-five, I’ll be doing a CrossFit workout somewhere. But anyway, let’s discuss a few case studies. By 'case studies,' I mean that I had a brief conversation with individuals from a few key companies. Keep in mind that I had limited time to gather this information.

00:10:28.440 I spoke to various directors of operations at companies like Opscode, Heroku, and Etsy. The general consensus was that several companies have successfully implemented the idea of having developers as the primary on-call personnel to respond to issues. These companies initially started this practice and sometimes rotated back for various reasons, but generally speaking, there’s been a trend of developers playing an active role in operations.

00:11:06.309 At Opscode, many employees used to work at Amazon, where it was emphasized that if you built it, you run it. Google also adopted a variant, allowing developers to maintain the applications they built for a period. Following suit, Opscode and other companies like Heroku and Etsy started ensuring that developers played operational roles for the applications they created.

00:12:09.430 This practice has benefits: it fosters a sense of ownership amongst developers regarding their applications, and it also helps bridge the gap between development and operations. The trend shows that companies employ more developers than sysadmins. So, for the typical company, having developers actively involved in operations could lessen the burden and frustration on their sysadmin colleagues.

00:13:17.470 Additionally, developers who take part in these operational tasks often feel empowered and more accountable for their work. When issues arise, they are likely to develop robust fixes on the spot, and in many cases, this leads to the incorporation of unit or integration tests. In turn, those tests ensure that similar issues won’t crop up again.

00:14:29.830 Time to talk about automation. Development teams that implement effective monitoring and alerting systems, like Nagios or PagerDuty, make it easier for everyone involved. Building clear metrics within applications also aids operations teams. By allowing operations personnel to interact with tools that developers create, it fosters a shared accountability that leads to better systems overall.
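As a rough illustration of a check that a developer could hand to operations, here is a disk check that follows the Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN). The 80% and 90% thresholds and the function names `classify` and `check_disk` are assumptions for this sketch, not values from the talk.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Map a disk usage percentage to a status code (thresholds are assumed)."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/") -> tuple[int, str]:
    """Return (status code, one-line status message) for the filesystem at `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_pct = 100.0 * usage.used / usage.total
    code = classify(used_pct)
    label = ("OK", "WARNING", "CRITICAL")[code]
    return code, f"DISK {label} - {path} is {used_pct:.1f}% full"
```

A wrapper script would print the message and exit with the returned code, which is how a monitoring system of this style reads the result.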

00:15:48.000 Effective collaboration enriches the culture, increases morale, and lightens the load on the Ops teams. Good communication between these two camps leads to better outcomes and smoother integrations, and in turn, everything runs more smoothly during outages. Developers get better insights into the nature and performance of their applications, while operations personnel can ensure that systems are robust and meeting business needs.

00:17:08.900 Earlier, Gene discussed the various approaches to optimizing application operations, and I encourage you to revisit that information for deeper understanding. To engage the audience further, by show of hands, who here writes application code that supports the business? Awesome! How many of you write code that directly runs websites? Okay, good, same hands! Now, how many are consultants that write code for clients? This discussion certainly extends to you as well, so consider what roles you play within your respective organizations.

00:18:52.810 The fundamental takeaway here is about enabling on-call participation, even if it sounds daunting at first. Developers, you can define collaborative roles that extend into operations within your teams. Writing instrumentation code, metrics, and monitoring tools can enhance the operability of applications. Enabling tools will allow you to facilitate smooth operations when production incidents occur.
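Instrumentation code of the kind mentioned here can be as small as a decorator that counts calls, errors, and time spent. This is a minimal sketch under stated assumptions: the `METRICS` dictionary, the `instrumented` decorator, and the `checkout` function with its 8% rate are hypothetical, and a real service would send these numbers to a metrics system rather than keep them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# In-process counters, kept in memory only for the sake of the example.
METRICS = defaultdict(float)


def instrumented(name: str):
    """Record call count, error count, and total seconds under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[f"{name}.errors"] += 1
                raise
            finally:
                # Runs on success and on failure, so every call is counted.
                METRICS[f"{name}.calls"] += 1
                METRICS[f"{name}.seconds"] += time.perf_counter() - started
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(cart_total: float) -> float:
    """Hypothetical business function; the 8% rate is made up."""
    if cart_total < 0:
        raise ValueError("cart total cannot be negative")
    return round(cart_total * 1.08, 2)
```

After a few calls, `METRICS["checkout.calls"]` and `METRICS["checkout.errors"]` give whoever is on call a quick view of how the function is behaving.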

00:20:04.950 Therefore, it is vital for developers to collaborate with operations teams. Adopting uniform automation practices throughout all environments ensures consistency and effective operations. Building automation that better aligns with processes is imperative. Collaboration in creating these solutions fosters an engaged culture of shared responsibility.

00:21:20.350 All of these practices contribute to the overall ethos of DevOps, aimed at enhancing collaboration, streamlining processes, and fostering an environment of shared ownership. This movement not only benefits the business but also creates a more fulfilling experience for everyone involved. By having both developers and operations share the responsibilities and create a culture of understanding and support, the organization will be better equipped to handle any issues swiftly and efficiently.

00:22:55.300 In conclusion, organizations that adopt DevOps methodologies must also enable developers to participate actively in operations. Ultimately, when developers contribute to fixing outages and improving system reliability, they cultivate a deeper understanding of their creations, and in the end everyone benefits, the business and team morale alike. Thank you all for your time!

MountainWest RubyConf 2013