GoRuCo 2017

Developer Productivity Engineering

Developer Productivity Engineering by Panayiotis Thomakos

Ruby is often praised for being a happy language. For highly motivated developers, a large part of happiness is tied to being productive. How can we extend the productivity gains we experience from writing Ruby to an entire engineering organization? At Strava we are experimenting with a framework we call Developer Productivity Engineering (DPE). DPE applies the principles of Site Reliability Engineering, developed by Google, to improving productivity through automation for both individual engineers and engineering organizations. This talk is a detailed view of the DPE framework and our experience with it so far.

00:00:16.710 So, I'll be talking about developer productivity engineering. My name is Panayiotis Thomakos, and you can find me on the internet.
00:00:19.330 I work at a company called Strava. We are a GPS-based training site and social network for athletes, and we have about 50 engineers working there. I've been there for about eight years, but more recently, I've been working in productivity engineering.
00:00:26.800 Productivity engineering effectively means that my job is to try to make other people as productive as possible. Even though I'm technically a team of one, there are many other people at Strava who spend some portion of their time working on productivity-related tasks. The takeaways from today can apply to any engineering organization or even to yourself in your personal projects.
00:00:58.769 So, what does it mean to be productive for me? Productivity is inherently tied to happiness, which really means how engaged my mind is on any particular task. The more time I get to focus on challenging and interesting problems, and the less time I have to spend on repetitive and mindless tasks, the more productive I am.
00:01:16.060 Often, this means that I'm using automation to reduce those repetitive and mindless tasks. While productivity engineering isn't solely about automation, it is a significant part of it. Today, I want to discuss automation because it often costs engineering time and effort to build it, and it's not always obvious how we prioritize that or even make a case that it's important.
00:01:38.400 You may have found yourself in a situation where you're battling with your automation, or perhaps you have so much to automate that you don't even know where to start. Alternatively, you might be working hard on your next feature, leaving you no time to think about automation. It's okay to decide that automation is not a priority for you right now.
00:02:07.250 However, it can be a bit disconcerting to feel like you can't make that decision strategically. At Strava, I've developed a framework that helps you think systematically about automation and when it might be appropriate to dedicate time and effort to automate something. It's called Developer Productivity Engineering, or DPE, named after Site Reliability Engineering developed at Google.
00:02:27.690 Google uses Site Reliability Engineering to apply engineering practices to the problems of site reliability and operations. Similarly, Developer Productivity Engineering uses engineering practices to solve the challenges of developer productivity. It can be broken down into three steps: identify, measure, and prioritize. I will tell you about each of these.
00:02:51.140 Let's start with identifying productivity bottlenecks. Sandi Metz once said, 'Duplication is far cheaper than the wrong abstraction.' When we're writing code, this means we can't just go around de-duplicating everything, because that is how wrong abstractions get introduced. Wrong abstractions have a high long-term cost; they're difficult to change and maintain.
00:03:15.020 The same concept applies to productivity engineering. Just because you've done something twice does not mean it's worth automating. We don't want to end up in a situation where the work required to maintain our automation exceeds the cost of just doing the task manually.
00:03:40.489 There is a more effective heuristic, known as toil. Site Reliability Engineering defines it precisely, and I believe that definition applies well to Developer Productivity Engineering. Toil usually refers to hard or menial labor, but that's not a rigorous definition. For our purposes, toil refers to a task that satisfies a set of six criteria.
00:04:00.270 The first criterion is that the task needs to be manual. This might seem obvious, but if a machine is already doing it, our threshold for further automation should be significantly higher. The second criterion is that the task must be repetitive. This means it needs to occur frequently enough (once or twice a week, month, or quarter) to warrant investment in automation.
00:04:24.320 The task should also be automatable. We must at least be able to envision automating the task, or have the budget to put software engineering effort into it. If we can't, it's probably not worth automating. Furthermore, the task should be tactical, not strategic, meaning it should occur in response to something measurable, like CPU load or site load.
00:04:49.650 The task should not provide enduring value. Essentially, if doing the task leaves things in the same state every time, nothing is permanently improved, which makes it a good candidate for automation. Lastly, the work should scale linearly with growth, or even faster: as we add more engineers or more commits, the task becomes more cumbersome.
00:05:10.869 These are the six criteria. I will give you a simple example from my own work at Strava. We have a Ruby CLI that we run on our machines to deploy the website and API twice a day. This deployment script manages the intricate details of changing the bytes on all our EC2 servers and restarting them.
00:05:36.270 However, developers still need to be present to ensure that everything operates smoothly and that we don't need a failover or rollback. This task is indeed manual; developers must type commands into their keyboards and monitor specific metrics to ensure nothing goes wrong. It is repetitive, as we run it twice a day.
00:06:08.250 The task is automatable to some extent; we could write a cron task to initiate the deployment, and we could develop a service that pulls metrics and sends notifications when something goes wrong, rather than expecting developers to gather that information themselves.
00:06:24.639 It is tactical since it responds to the passing of the QA suite and occurs twice a day. There is no enduring value in the deployment itself. The effort that goes into writing the code provides enduring value; changing bits and bytes on a server does not.
00:06:41.739 Moreover, the task scales linearly as we hire more people, which we are doing, increasing the number of commits and the frequency of our deployments. You should feel comfortable assessing toil. Almost anyone in your organization should feel up to the task.
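As a rough illustration of the six criteria and the deploy example above (not something shown in the talk), the checklist could be sketched in Ruby. The `Task` struct, its field names, and the rule that all six criteria must hold are assumptions made for this sketch.

```ruby
# Hypothetical checklist; names are illustrative, not from the talk.
TOIL_CRITERIA = %i[
  manual repetitive automatable tactical no_enduring_value scales_with_growth
].freeze

Task = Struct.new(:name, *TOIL_CRITERIA, keyword_init: true) do
  # Criteria this task does not satisfy.
  def unmet_criteria
    TOIL_CRITERIA.reject { |criterion| self[criterion] }
  end

  # Assumption: a task counts as toil only when every criterion holds.
  def toil?
    unmet_criteria.empty?
  end
end

# The twice-daily deploy, scored as described in the transcript.
deploy = Task.new(
  name: "twice-daily web/API deploy",
  manual: true,
  repetitive: true,
  automatable: true,
  tactical: true,
  no_enduring_value: true,
  scales_with_growth: true
)

puts "#{deploy.name}: toil? #{deploy.toil?}"
```

Listing the unmet criteria, rather than returning only a yes or no, gives a starting point for the one-on-one conversations where a task's status is discussed.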
00:07:06.680 The most effective way I've found to do this is by talking to people, whether through retrospectives or weekly and bi-weekly one-on-ones. Those conversations are usually where toil gets assessed most effectively.
00:07:27.990 Now, let's discuss measuring productivity. First, you should not track your time; it’s a tedious process and an ineffective way to measure productivity. One of the main reasons is that productive work is inherently varied, and it’s tough to correlate long-term productivity gains with initial activities.
00:07:51.880 Instead, we take a more indirect approach: we focus on the things that hurt productivity and try to minimize them. The first step is measuring toil, which we do by sending surveys and having conversations with the team. We ask people to estimate how much time they spend on undesirable or manual tasks.
00:08:26.880 Additionally, we instrument all our existing automation. If someone complains about waiting on a deploy script, it’s beneficial to know how long it takes that script to run. While this approach may seem simple and somewhat non-intuitive, it is liberating to feel that you can measure productivity and make improvements.
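One minimal way to instrument an existing script, sketched here in Ruby, is to wrap each step in a timer. The `timed` helper and the simulated deploy step are hypothetical; the talk does not describe how Strava's instrumentation is built.

```ruby
# Hypothetical helper: runs a block, logs how long it took (even if the
# block raises), and returns the block's result.
def timed(step_name, log: $stdout)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
ensure
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  log.puts format("%s took %.2fs", step_name, elapsed)
end

# Stand-in for a real deploy step; the sleep simulates work.
status = timed("deploy") do
  sleep 0.1
  :ok
end

puts "deploy finished with status #{status}"
```

In practice the durations would go to whatever metrics store the team already uses instead of standard output, so they can be compared over time.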
00:08:49.480 Let’s talk about prioritization. At some point in your organization, a few months down the road, you may have identified a significant amount of toil and measured it, but you might find you have so much toil that you don’t have time to automate it.
00:09:09.910 Determining what to work on next can be straightforward. Calculate four different costs. The first is the toil cost, framed in consistent units such as hours per week or per month spent on the task. The second is the implementation cost: how long it will take to automate a solution.
00:09:29.860 The less clear the implementation is, the higher the toil cost should be before you think about beginning. Third, consider that software maintenance incurs costs. If you spend an hour a week on upkeep, that hour should be subtracted from the time you save by no longer doing the task manually.
00:09:53.720 Finally, factor in onboarding costs. If you bring a new hire into your company, how long will it take for them to own that process? You want to avoid having one person solely responsible for a particular process.
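The talk names the four costs but gives no formula for combining them. The Ruby sketch below shows one possible way, assuming that maintenance is subtracted from the weekly toil saved and that implementation and onboarding are treated as up-front costs. The struct, its field names, and every number are hypothetical.

```ruby
# Hypothetical model of the four costs; the combination rule is an assumption.
AutomationCandidate = Struct.new(
  :name,
  :toil_hours_per_week,        # time spent doing the task manually
  :implementation_hours,       # one-off cost of building the automation
  :maintenance_hours_per_week, # ongoing upkeep of the automation
  :onboarding_hours,           # one-off cost for a new hire to own the process
  keyword_init: true
) do
  # Weekly time saved once the automation exists.
  def net_weekly_savings
    toil_hours_per_week - maintenance_hours_per_week
  end

  # Weeks until the up-front costs are repaid; nil if they never are.
  def break_even_weeks
    return nil unless net_weekly_savings.positive?

    (implementation_hours + onboarding_hours).fdiv(net_weekly_savings)
  end
end

# Made-up candidates for illustration only.
candidates = [
  AutomationCandidate.new(name: "release checklist", toil_hours_per_week: 10,
                          implementation_hours: 80, maintenance_hours_per_week: 2,
                          onboarding_hours: 8),
  AutomationCandidate.new(name: "report export", toil_hours_per_week: 1,
                          implementation_hours: 40, maintenance_hours_per_week: 1,
                          onboarding_hours: 2)
]

# Drop candidates that never pay back, then rank by fastest payback.
ranked = candidates.select(&:break_even_weeks).sort_by(&:break_even_weeks)
ranked.each do |candidate|
  puts format("%s: pays back in %.1f weeks", candidate.name, candidate.break_even_weeks)
end
```

Ranking by payback time is only one choice; a team could just as reasonably rank by absolute hours saved per week.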
00:10:17.380 We have successfully implemented Developer Productivity Engineering at Strava. Early this year, we recognized that our mobile release process was quite toilsome and calculated that it cost us approximately 20 developer hours each week.
00:10:42.000 After investing significant effort into automation, we estimate that by the end of this year, considering the implementation and maintenance costs, we will save upwards of 17 developer hours per week by automating the process.
00:11:06.960 I believe you all can apply these principles in your own lives successfully. Thank you for your attention. I work at Strava, and we are hiring! You can find my slides at this link and follow me on Twitter, GitHub, etc. Thank you.