Talks

Taming the Kraken - How Operations enables developer productivity

Taming the Kraken - How Operations enables developer productivity

by Nathen Harvey

In this presentation titled "Taming the Kraken - How Operations Enables Developer Productivity" by Nathen Harvey at Rails Conf 2012, the focus is on reducing friction between development and operations at CustomInk, allowing for smoother processes in deploying software. CustomInk, known for facilitating custom t-shirt design, has transformed its software deployment strategies to address the challenges it faced in the past, such as lengthy deployment cycles and frequent rollbacks.

Key Points Discussed:
- Process Redefinition: The team moved from bi-weekly sprints to a continuous model where code deployment occurs as soon as the code is ready. The definition of "done" was modified to mean that code is not just written, but also deployed and running in production.

  • Staging Environments: To support frequent deployments, CustomInk established disposable staging environments. They ensured each topic branch had its staging setup which enabled rapid verification and easy transitions to production.

  • Infrastructure Automation: The use of tools such as Chef and Vagrant was highlighted to automate server provisioning and manage development environments efficiently. Vagrant allows developers to spin up local environments quickly, ensuring consistency across development and production.

  • Continuous Integration and Testing: The integration of Jenkins with Capistrano facilitates testing for every code branch, minimizing integration errors. Green Screen, a visual monitoring tool, is also used to keep the team informed about the status of builds.

  • Empowering Developers: Developers are encouraged to take ownership of their code. This includes deploying their changes, which reduces reliance on the operations team and fosters a culture of accountability.

  • Communication and Culture: Establishing a culture that encourages communication between development and operations teams is vital. Transparency about deployment statuses and automated notifications helps keep all stakeholders informed and reduces the stress associated with deployments.

Examples and Anecdotes: Nathen shares stories about past deployment anxieties, likening them to high school proms due to the buildup and tension. With improvements, deploys became less fearful occurrences, leading to efficient software releases.

The conclusion of Nathen's talk emphasizes the importance of minimizing work in progress, deploying frequently, and automating processes to create a developer-friendly environment where features can be released with confidence and agility.

00:00:25.480 My name is Nathen Harvey, and I manage web operations for a company called Custom Ink. Today, I'd like to share with you some of the things that we're doing internally at Custom Ink to help make the process of getting software from developers to production smoother and keep that software running. I'd like to start off with a very brief introduction to who Custom Ink is. Custom Ink is a website that makes it easy and fun to design custom t-shirts. We have a great set of artwork and fonts that you can include in your custom-designed t-shirts, and we ship them out to you for free. We do all of this because we believe that t-shirts make events awesome, and they unite people together. If you're interested, I encourage you to visit our site, order some t-shirts, and see me afterward for discount codes if you're really interested. However, I'm not here just to sell t-shirts.
00:01:13.760 Let me talk a little bit about the scale and size of Custom Ink. We've been around since 1999, and we have not always been a Rails shop throughout our entire history. Currently, we receive about 40 million requests to our app servers each week. That's only to our app servers and not accounting for the caching layer that intercepts many of those requests. We've delivered shirts to over 1 million customers, and if you're wearing a t-shirt today, there is a chance that we printed it for you. We've shipped over 25 million t-shirts, and I'm glad to see someone here from Pragmatic Studios wearing a Custom Ink shirt.
00:02:01.399 Now, let me take you back a little bit to describe how things used to be at Custom Ink. We used to have one team working on multiple applications simultaneously. We operated in two-week sprints, and at the end of each sprint, the operations team, including myself, would deploy the code to production. While it worked okay, we faced some challenges with this process.
00:02:13.160 First of all, many of the developers would work directly in the master branch. Consequently, they would commit their code, encounter merge conflicts, and then fix those conflicts. Meanwhile, another developer might commit more code, leading to more conflicts. This cycle of resolving merge conflicts became a real pain. We've gotten a lot smarter about using Git now, especially since we moved from SVN to Git, which made a significant difference. However, small changes would often sit for days waiting to be deployed. For instance, if we decided to change the hours of our customer service, the developer would complete the code at the start of the sprint, mark it as done, but it would sit in the queue waiting for two weeks before it could be deployed.
00:03:02.360 Deploys would often be rolled back because we were deploying two weeks' worth of changes at once, and some significant feature might break in production. Unfortunately, this led to our customer service hours being inaccurate again. As our team grew, we faced more significant challenges—a few developers became about 20 developers all working on different applications. You may have similar challenges if your deploys feel like a high school prom.
00:03:39.600 What do I mean by that? Consider this: if deploys happen infrequently (perhaps once every two weeks), the buildup is always bigger than the actual result. Leading up to a deploy, people often feel anxiety and fear about what might break, rather than excitement. Surrounding each deploy was a lot of ceremony—mass emails going out to the company, warning everyone about potential issues. Rather than a celebration, it often felt like a nerve-wracking event. There were also office pools to guess how long it would take before we had to roll back the deployment, as this happened frequently.
00:04:23.240 In fact, when I first joined Custom Ink, everyone quickly labeled me as the person who informed them of the rollbacks. Being happy about a deploy was often tied to the idea of ordering custom t-shirts to celebrate, which is not something I would recommend. However, if you are working on a big project, I happen to know a guy who can help you with ordering t-shirts.
00:05:08.480 So, how can you fix the issues we faced? We decided to deploy early and often. This isn't just a talk about continuous deployment—it's about the changes we needed to implement to reach a point where we could deploy frequently. We aimed to move to a scenario where we deploy small, independent changes as soon as they are ready. The first thing we did was redefine the process and adjust the definition of done from a developer's perspective.
00:05:44.280 Now, a developer is considered done not merely when the code is written and the build passes, but when their code is running in production and verified to be working correctly. We've shifted from two-week sprints to a continuous workflow—taking an item off the list and working on it until it’s fully completed and deployed. This approach minimizes the amount of work in process and ensures that we get changes out to production as quickly as possible. We also began utilizing our product managers to help prioritize tasks and manage the release cycle.
00:07:05.040 On the operations side, we recognized the need for an easy way for developers to stage their changes. In the past, we had one or two staging environments, which were insufficient and led to queues for access. Therefore, we knew we needed disposable servers and a method to quickly spin up staging environments, deploy new code, and have it verified by business owners. Once verified, these environments could be discarded without a trace.
00:08:07.960 We decided that each topic branch within our Git repository must have its own staging environment and that all work should be done in topic branches—eliminating the use of the master branch for development. These short-lived branches required short-lived staging environments, and the only way to achieve that was through automation. We automated the process of spinning up and tearing down servers using infrastructure as code frameworks.
00:09:11.920 In doing so, we developed specifications or policies for the environments we wanted to build. For instance, our staging environment might have a caching server in front of a couple of web servers and a database server at the back end. By defining these resources, we could automate the installation and configuration processes using tools like Chef. Chef allows us not only to provision new environments but also runs continuously on our nodes for compliance.
00:10:00.800 If we need to make a change to the environment, it gets checked into source control and applied to our production nodes. This transition has not only improved our deploying capabilities but also allowed developers to get up and running much faster with the help of tools like Vagrant, which integrates perfectly with Chef.
00:11:17.679 Vagrant is an outstanding tool that allows us to manage virtual machine instances locally, enabling developers to quickly spin up local development environments. We created a repository for Vagrant configuration files, which we called the Hobo Jungle, to assist developers in locating the appropriate setup for various applications.
00:12:22.759 For each branch being tested, we developed an automated testing framework using Jenkins. By integrating Jenkins with Capistrano, developers can easily create new jobs from the command line without needing to interact with the Jenkins interface directly. This simple setup allows developers to quickly know when their branches are ready for merging.
00:13:06.000 In addition to the above systems, we also use a tool called Green Screen, which provides a visual representation of Jenkins job statuses throughout our development office. By mounting this code on a monitor, everyone can see the current state of the builds in real-time, promoting accountability and prompt communication within the team.
00:14:05.560 We've effectively reduced friction in our processes, enabled quicker deployments, and simplified our release cycle. All applications, of which we manage around 20 to 25, are deployed using Capistrano, which has also provided an opportunity to streamline our deployment procedures. Instead of replicating identical deployment logic across multiple applications, we utilized a tool called Cap Hub to standardize and centralize our deployment strategies.
00:15:40.479 Cap Hub allows us to manage our deployment logic from one repository, reducing duplication and facilitating easy updates to deployment configurations. By separating these deployment tasks into a dedicated repository, we prevent unnecessary changes to application repositories, streamlining the overall process.
00:17:02.080 Our deployment process now enables developers to merge their branches into the master, build, deploy, verify, and then efficiently get out of the way for the next team member behind them. We have managed this workflow using a shared communication channel that acknowledges each deploy in real-time, ensuring that everyone remains up to date.
00:18:15.360 To announce successful deployments, we have automated our legacy announcement mechanisms, tying everything together so that when a deploy occurs, it automatically updates our internal blog, sends relevant emails, and posts updates into our chatrooms. Overall, this has significantly reduced the ceremony and friction surrounding deploys.
00:19:08.919 As a result, we have seen notable benefits, including shorter cycle times, fewer integration bugs, and a reduction in the overall work in process. We can now change simple things, like the wording on our website, in mere minutes instead of taking weeks. However, we also acknowledge that there can be agitation if there are problems with a deploy, which remains a concern that needs to be addressed.
00:20:28.640 We believe that with proper sequencing and informed operations, our developers should take ownership of their deployments. No longer is it solely the operations team's responsibility to manage deploys; the developers themselves should operate with the same level of urgency they expect from others.
00:21:44.640 To enhance collaboration and communication between development and operations, we encourage open dialogue while leveraging the right tools. We employ Git and GitHub for version control, Vagrant for local development, Chef for configuration management, Jenkins for continuous integration, as well as other tools that promote better operational excellence.
00:22:43.440 Lastly, I encourage you to evaluate your processes and identify intentional delays. Look for opportunities to remove unnecessary components and foster a culture that embraces frequent deployments. In this endeavor, consider how to provide responsibilities to your developers, ensuring they have the correct access, tools, and information necessary to succeed.