Devops for Developers - Why You Want To Run Your Own Site, And How
Summarized using AI


by Matthew Kocher

In this talk at the MountainWest RubyConf 2012, Matthew Kocher, a software developer at Pivotal Labs, discusses the intersection of DevOps and software development, emphasizing why developers should take ownership of the sites they work on. The session covers essential strategies and considerations for effectively managing operations while maintaining a consistent flow of development.

Key Points:

- DevOps Overview: Kocher defines DevOps as a methodology where developers are responsible for the complete lifecycle of their applications, fostering a culture of accountability and prompt responses to issues.

- Story of Thanksgiving 2010 Incident: Through a personal anecdote about a critical outage on Thanksgiving, he illustrates the importance of thorough load testing and preparation, revealing how unexpected changes can lead to significant downtime despite prior testing.

- Importance of Recovery and Planning: The narrative underscores the necessity of having extra capacity to handle unforeseen spikes in traffic and draws out the lessons learned from failures, particularly regarding planning and timely responses.

- Availability vs. Consistency: Kocher explores the balance of availability (measured as uptime percentages) against the consistency of data and service, stressing that achieving high availability comes at a cost that stakeholders must understand.

- CAP Theorem: He mentions the CAP theorem, which explains the trade-offs necessary between Consistency, Availability, and Partition Tolerance, urging programmers to be strategic about which two of the three can be prioritized.

- Automation and Testing: The necessity of automation in infrastructure management is covered, suggesting tools like Puppet and Chef. Kocher underscores the value of writing tests that not only validate code but also ensure infrastructure integrity.

- Monitoring Strategies: He categorizes monitoring into site-level, server-level, and business-level, explaining the various methods to keep track of application performance and health.

- Refactoring Infrastructure: The talk stresses that rethinking and upgrading the infrastructure is as vital as software refactoring, promoting an organized and maintainable structure that supports ongoing development.

- DevOps Philosophy: Kocher discusses the community and cultural aspects of DevOps, focusing on collaboration among teams and how developers can improve their operations capabilities. Finally, he highlights the importance of clear communication about expectations and responses to failures and the need for continuous improvement in infrastructure management practices.

Conclusions:

The session concludes with the idea that while transitioning to a DevOps model may seem chaotic, it provides a holistic approach that can significantly streamline workflows and operational efficiency. Kocher encourages developers to embrace these practices to better manage their applications and improve overall system resilience.

00:00:13.650 I'm Matthew Kocher. I'm a software developer at Pivotal Labs in San Francisco.
00:00:20.480 I do some DevOps stuff, but I don't consider myself an Ops person. I'm just a software developer who happens to know something about keeping a site running.
00:00:32.119 So my goal today is to share how I approach Ops as a problem and give you some tools to learn how to run your own website.
00:00:39.480 How many of you take care of operations for the sites that you're building? Great, many of you.
00:00:48.320 I wanted to start out by telling you a story. I hope you can see this. It's certainly not big enough, but let me see if this works.
00:00:55.800 This is a graph of traffic to an API on Thanksgiving 2010. We had load tested to about the halfway point at around 6,000 requests per minute.
00:01:02.160 This graph shows the requests per minute coming in. Blue represents good 200 response codes, and anything else is bad. So we had load tested and we were ready. This was a day we had planned for for months, and everything was looking perfect.
00:01:15.439 However, what happened next was a little unfortunate. You can see our throughput decreasing dramatically as our upstream load balancer noticed the 500 errors we were generating, cutting us out of the loop entirely.
00:01:33.600 What actually happened was that the site hitting this API had rolled out a change that put the API two clicks deep in the site map instead of four. They had also significantly expanded their use of JSONP requests for this Thanksgiving sale, which introduced some caching difficulties.
00:02:03.039 While we had performed well past any load testing we had done, we faced sad times. About a minute after the site started bouncing, everyone involved received a text message, and we assembled in our chat room to investigate the problem. Various people coordinated what they would look into. After a little too long, we managed to stabilize the site, which was great.
00:02:40.159 While the graph looked better, it didn't match the first portion of the graph. However, it stabilized, and we were serving the load. Now I'm going to show you how we accomplished this miracle.
00:03:01.159 We had previously built a cluster of machines that had been sitting idle. Generally, we would perform blue-green deployments between those machines, but they were ready to serve the load when we saw that the site had gone down. We ran a single Capistrano task to touch a file on those machines so the load balancer would bring them into rotation.
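The cut-over described here follows a common pattern: the load balancer polls each host for a health-check file, and a one-line deploy task creates or removes it. A rough sketch of what that might look like as Capistrano (v2-era) tasks; the namespace, role, and file path are hypothetical, not from the talk:

```ruby
# Hypothetical Capistrano (v2-era) tasks for the health-check cutover.
namespace :lb do
  desc "Touch the file the load balancer polls so these hosts join the rotation"
  task :enable, :roles => :app do
    run "touch /var/www/app/shared/system/lb_healthcheck.txt"
  end

  desc "Remove the file to take these hosts back out of rotation"
  task :disable, :roles => :app do
    run "rm -f /var/www/app/shared/system/lb_healthcheck.txt"
  end
end
```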
00:03:22.920 Interestingly, on the graph you can see the green line representing 500 errors. This occurred because MySQL's disk and memory caches were cold, and it couldn't manage the load until those caches had been populated. Usually, we would cut over slowly, but since the site was already down, it wasn't a big deal. After this point, the site came back up, and we were serving a far greater load than before.
00:03:59.600 We were able to cut in our third environment, and about four hours later, we successfully pushed a code change that reduced the abuse on the API. Fortunately, we were able to respond much quicker than the front-end developers who had to push their JavaScript change through. You can see the load drops off, and by the end of the day, we began removing our extra capacity.
00:04:39.880 This story illustrates what it's like to be both a developer and the operations person on a project. You get the enjoyable task of writing code, while also having to wake up at 6 a.m. on the most important day of the year when the site goes down, especially when you're at your parents' house for Thanksgiving.
00:05:02.720 From this experience, we learned a couple of things. We always try to learn from our experiences. First, we learned that our planning can be flawed. We had spent a lot of money on load testing based on predictions from a reputable load-testing company, but those predictions overlooked the fact that the site would be redesigned at the same time traffic increased.
00:05:34.840 Second, we realized that our response is crucial. When the situation occurred, we evaluated the problem quickly and responded adequately. Understanding that JSONP requests weren’t cached was beneficial, something that someone unfamiliar with server setup might not have known.
00:06:01.120 We learned that it's always a good idea to have more capacity. Overprovisioning is much cheaper than downtime, which leads us to our first main topic: availability. We aim for 99.9999% availability, which is a measurable goal. It’s straightforward to determine when your site is unavailable; everyone in the company knows when it happens.
00:06:34.119 What people don’t realize is that each one of those nines costs a considerable amount of money. Last year, if you were hosted in US East, you might have experienced only three nines due to downtime. Achieving those incremental nines requires spending, and one cannot guarantee meeting those metrics. What is essential is to have a conversation with all stakeholders about the implications of downtime and how much they are willing to invest.
00:07:01.039 Once you've dealt with availability, the next challenge is consistency: keeping the site in a consistent state, with your databases synchronized and your data clean. Ralph Waldo Emerson wrote that 'a foolish consistency is the hobgoblin of little minds,' but customers prefer their data maintained consistently.
00:07:35.479 Once the site is operational, you need to ensure that the database replicates properly and that everything reflects the current state of the system. The challenge arises due to network outages. As we've discussed, network outages are a fact of life when services are distributed over a network and may be unreachable.
00:08:02.479 This leads us to the CAP theorem, which states that you can only have two of three: consistency, availability, and partition tolerance. This suggests that operations is an impossible problem because every stakeholder desires all three attributes: consistency, availability, and tolerance to network partitions, yet they usually aren’t prepared for the implications.
00:08:37.679 So, when managing operations, evaluate which trade-offs you're prepared to make, especially because making those trade-offs is a given. Automation is a trendy solution these days, with various tools at your disposal such as Puppet, Chef, and others. It doesn't matter which tool you choose to automate; the key is to ensure that subsequent teams know how to spin up your infrastructure.
00:09:07.239 Along with automation, testing is crucial, since automation itself relies on external services: bringing up an instance always depends on them. This is a dynamic space, with numerous projects emerging that focus on testing Chef and Puppet. Tools like Foodcritic lint Chef cookbooks, while minitest-chef-handler runs assertions after your Chef recipes have converged.
00:09:59.959 You can find numerous examples that help ensure the infrastructure operates as intended. A simple example is a Chef recipe to install a web server like NGINX. If your Chef recipe installs NGINX, it should also verify that NGINX is running. This method ensures those verifying steps are built into the recipe itself.
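A minimal sketch of that idea in Chef's Ruby DSL; the cookbook name and file layout are assumptions, not from the talk:

```ruby
# cookbooks/webserver/recipes/default.rb
# Install NGINX and make sure the service is enabled and started.
package 'nginx'

service 'nginx' do
  action [:enable, :start]
end
```

With minitest-chef-handler, an assertion that NGINX actually ended up installed and running can live alongside the recipe and execute after each Chef run converges:

```ruby
# e.g. cookbooks/webserver/files/default/tests/minitest/default_test.rb
describe_recipe 'webserver::default' do
  it 'installs and runs nginx' do
    package('nginx').must_be_installed
    service('nginx').must_be_running
  end
end
```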
00:10:47.240 Integration tests are more difficult because they require substantial thought about how to bootstrap your infrastructure, but they are worth it. I often advise people not to write the integration tests until they have the system working. Tools like Cucumber can help verify that your UI or API behaves appropriately end to end.
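As a rough sketch of the kind of end-to-end smoke check this implies, run against a freshly bootstrapped environment (the URL, endpoint, and use of Minitest here are assumptions; Cucumber scenarios would express the same checks in Gherkin):

```ruby
# smoke_test.rb -- hypothetical end-to-end smoke test. BASE_URL and the
# /api/status endpoint are illustrative, not from the talk.
require 'net/http'
require 'json'
require 'minitest/autorun'

class SmokeTest < Minitest::Test
  BASE_URL = ENV.fetch('BASE_URL', 'http://staging.example.com')

  def test_homepage_responds
    response = Net::HTTP.get_response(URI("#{BASE_URL}/"))
    assert_equal '200', response.code
  end

  def test_api_returns_parsable_json
    response = Net::HTTP.get_response(URI("#{BASE_URL}/api/status"))
    assert_equal '200', response.code
    refute_nil JSON.parse(response.body)
  end
end
```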
00:11:21.640 Once your infrastructure is automated and fully functional, monitoring becomes crucial. How many of you know if your site is operational right now? Not many hands go up. Ideally, you should be notified as soon as your site goes down—without depending on someone else to alert you.
00:11:54.240 I classify monitoring into three categories: site-level monitoring, server-level monitoring, and business-level monitoring. The first is site-level monitoring, which indicates whether your application is down. I recommend using services like Pingdom for this purpose. It’s the simplest form of monitoring, as you only need to load a webpage and check the response code.
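That "load a page and check the response code" check is small enough to sketch in plain Ruby; the URL and the alerting hook are placeholders:

```ruby
# site_check.rb -- minimal site-level check in the spirit of a Pingdom probe:
# fetch one page and verify the response code. The URL is a placeholder.
require 'net/http'

uri = URI('http://www.example.com/')
response = Net::HTTP.get_response(uri)

if response.code == '200'
  puts "OK: #{uri} returned 200"
else
  # In practice this branch would page someone (SMS, email, etc.)
  # rather than just printing a warning.
  warn "ALERT: #{uri} returned #{response.code}"
  exit 1
end
```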
00:12:29.880 The second level is server-level monitoring, which focuses on individual servers. While you may not want to receive alerts for every app server, it is important to maintain an awareness of your server's health. You can monitor this at regular intervals to keep track of any gradual changes in your infrastructure.
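A server-level check of this kind might be a small script run on a schedule that records a health metric and warns past a threshold; a minimal sketch, assuming Linux and an arbitrary threshold (a dedicated monitoring system covers this ground far more thoroughly):

```ruby
# server_check.rb -- hypothetical server-level check meant to run from cron:
# log the 1-minute load average and warn when it crosses a threshold.
# The threshold and log path are illustrative assumptions.
require 'time'

load1 = File.read('/proc/loadavg').split.first.to_f # Linux-specific
threshold = 4.0

File.open('/var/log/server_check.log', 'a') do |log|
  log.puts "#{Time.now.utc.iso8601} load1=#{load1}"
end

warn "WARN: 1-minute load average #{load1} exceeds #{threshold}" if load1 > threshold
```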
00:12:52.960 The third category is business-level monitoring, which isn't as enterprise-oriented as it sounds. This simply means tracking metrics that are key to your service. For instance, if you’re running an e-commerce site, you might monitor how many orders are processed within an hour.
00:13:28.480 Having tests that verify your current data against business logic can help catch issues early. If your website features APIs, regular checks can help ascertain the reliability of your service and ensure everything is functioning.
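A business-level check along the lines of the orders-per-hour example might look something like this; the Order model, the threshold, and the Rails environment path are hypothetical:

```ruby
# business_check.rb -- hypothetical business-level monitor: alert when an
# e-commerce app has processed suspiciously few orders in the last hour.
# The Order model and the threshold of 10 are illustrative assumptions.
require_relative 'config/environment' # boot the Rails app (path assumed)

MIN_ORDERS_PER_HOUR = 10

recent_orders = Order.where('created_at > ?', 1.hour.ago).count

if recent_orders < MIN_ORDERS_PER_HOUR
  warn "ALERT: only #{recent_orders} orders in the last hour"
  exit 1
else
  puts "OK: #{recent_orders} orders in the last hour"
end
```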
00:14:18.760 Once you have your monitoring set up and your configurations automated, you can refactor your infrastructure with confidence. Make sure everything is functioning correctly before proceeding with any essential restructuring of your setup.
00:14:42.120 Refactoring your infrastructure is just as valuable as refactoring your code. Many organizations' infrastructures become convoluted as they maintain legacy setups due to lack of resources or oversight. Ensuring that your infrastructure is clean and maintainable is vital.
00:15:07.680 Regarding DevOps, I have a love-hate relationship with the term because, fundamentally, it means putting the people who write the code in charge of keeping it running. This often aligns priorities as it emphasizes the importance of maintaining availability.
00:15:38.880 If your site goes down and you have a large user base, prioritizing uptime becomes crucial. DevOps allows for faster responses to issues because it enables individuals to have a complete understanding of the entire workflow instead of focusing solely on their segments.
00:16:06.720 At Pivotal, we promote full-stack development by encouraging developers to implement features completely, including installation processes for any required services. This empowers developers to solve their problems without having to rely on others for assistance, thus enhancing overall productivity.
00:17:17.240 On a more practical level, when choosing a hosting provider, it’s vital to find one that understands how to keep servers up and running. They don't need to be proficient in setting up specific applications or databases; instead, they should focus on their core competencies.
00:18:00.080 Issues frequently arise when clients ask hosting providers to handle tasks that are out of their expertise. We have seen more success by utilizing providers for services they offer consistently rather than relying on them for one-off tasks.
00:18:43.440 While many people hear critiques against platforms like Heroku, I advocate for using such services initially. At Pivotal, we utilize Heroku for many projects because it enables us to quickly address critical needs without immediately worrying about scaling.
00:19:26.000 However, as scalability becomes a priority or the application requires a unique infrastructure that hosted services cannot accommodate, taking ownership of maintenance and reconstruction becomes necessary.
00:20:06.560 Let me share another quick story before concluding today. We had a business-level test that looked for batteries in stock near Richfield, and it turned red after an upstream service change: one day, all stores near Richfield disappeared from our database completely.
00:20:38.320 After some investigation, we discovered that the upstream service we relied on had fundamentally changed: latitude and longitude were swapped in the update, so our data contained no nearby stores, which affected operations.
00:21:11.360 Fortunately, we preserved previous versions of our data, which enabled us to restore the missing data quickly. In conclusion, DevOps can often feel chaotic, but with discipline, you can maintain control and organization.
00:21:45.760 The economics of adopting DevOps practices can be beneficial, and understanding its principles can transform your workflow. This holistic approach to development can significantly enhance your operational capabilities.
00:22:27.360 Does anyone have any questions? At Pivotal, we regularly start new projects, often utilizing Heroku as our default. Eventually, we tailor our approach based on client preferences and requirements.
00:23:10.800 There are differing preferences depending on the team you're working with, which can guide your choice of infrastructure. Most individuals in operations prefer CentOS, while Rails developers lean towards Ubuntu.
00:23:38.480 Ultimately, choosing a platform that's familiar to your team can lead to more efficient outcomes. However, you can always adapt according to the project's circumstances.
00:24:02.240 If you have specific implementation questions, there are ways to mirror the data as JSON, or even to use alternatives like JSONP. Many of our tests use these techniques to maintain reliability.
00:24:40.760 Mitigating alert fatigue is important—setting thresholds for alerts can reduce unnecessary noise. We focused on addressing criteria changes as part of morning stand-ups, ensuring we discuss any critical alerts daily.
00:25:01.120 Using proper monitoring tools, we ensure critical changes don’t overwhelm us. Having clear processes in place allows your team to address issues in real-time without being overwhelmed by noise.
00:25:56.160 Finally, there are no strict rules regarding responsiveness or dealing with load fluctuations. Each application behaves differently; web applications need robust planning, especially around traffic spikes or increased interest.
00:26:56.440 A surge of public interest may fade, but it still requires some planning around capacity, ensuring that the infrastructure you're leveraging can cope with spikes. Monitoring is paramount at all times.
00:27:20.960 With that, I will conclude my prepared remarks. If there are any further questions, feel free to ask, and I'll gladly assist.