00:00:13.650
I'm Matthew Kocher. I'm a software developer at Pivotal Labs in San Francisco.
00:00:20.480
I do some DevOps stuff, but I don't consider myself an Ops person. I'm just a software developer who happens to know something about keeping a site running.
00:00:32.119
So my goal today is to share how I approach Ops as a problem and give you some tools to learn how to run your own website.
00:00:39.480
How many of you take care of operations for the sites that you're building? Great, many of you.
00:00:48.320
I wanted to start out by telling you a story. I hope you can see this. It's certainly not big enough, but let me see if this works.
00:00:55.800
This is a graph of traffic to an API on Thanksgiving 2010. We had load tested to about the halfway point at around 6,000 requests per minute.
00:01:02.160
This graph shows the requests per minute coming in. Blue represents good 200 response codes, and anything else is bad. So we had load tested and we were ready. This was a day we had spent months planning for, and everything was looking perfect.
00:01:15.439
However, what happened next was a little unfortunate. You can see our throughput decreasing dramatically as our upstream load balancer noticed the 500 errors we were generating, cutting us out of the loop entirely.
00:01:33.600
What actually happened was that the site hitting this API had rolled out a change that moved the API two clicks down instead of four clicks down in the site map. They had also significantly expanded their use of JSONP requests for this sale on Thanksgiving, which introduced some caching difficulties.
00:02:03.039
Although we had held up well past anything we had load tested, sad times followed. About a minute after the site started bouncing, everyone involved received a text message, and we assembled in our chat room to investigate. People coordinated who would look into what. After a little too long, we managed to stabilize the site, which was great.
00:02:40.159
While the graph looked better, it didn't match the first portion of the graph. However, it stabilized, and we were serving the load. Now I'm going to show you how we accomplished this miracle.
00:03:01.159
We had previously built a cluster of machines that had been sitting idle. Normally we would do blue-green deployments between those machines, but when we saw that the site had gone down, they were ready to serve the load. We ran one Capistrano (cap) task to touch a file on each machine so that the load balancer would include them.
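The flag-file mechanism described here can be sketched as a health check that reports healthy only when a particular file exists. This is a minimal sketch and not the setup from the talk: the function name `health_status` and the idea of returning 200 or 503 to the load balancer are assumptions made for illustration.

```python
from pathlib import Path


def health_status(flag_file: Path) -> int:
    """Return the HTTP status a health-check endpoint would report.

    200 tells the load balancer to send traffic to this machine;
    503 tells it to leave the machine out of rotation.
    """
    return 200 if flag_file.exists() else 503
```

With a check like this, touching the file is all it takes to bring a machine into rotation, and removing the file takes it back out.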
00:03:22.920
Interestingly, on the graph you can see the green line representing 500 errors. This occurred because MySQL on those machines couldn't handle the load until it had populated its disk and memory caches. Usually, we would cut over slowly, but since the site was down, it wasn’t a big deal. After this point, the site came back up, and we were serving a far greater load than before.
00:03:59.600
We were able to cut in our third environment, and about four hours later, we successfully pushed a code change that reduced the abuse on the API. Fortunately, we were able to respond much more quickly than the front-end developers, who had to push their JavaScript change through. You can see the load drop off, and by the end of the day, we began removing our extra capacity.
00:04:39.880
This story illustrates what it's like to be both a developer and the operations person on a project. You get the enjoyable task of writing code, while also having to wake up at 6 a.m. on the most important day of the year when the site goes down, especially when you're at your parents' house for Thanksgiving.
00:05:02.720
From this experience, we learned a couple of things. We always try to learn from our experiences. First, we learned that our planning can be flawed. We had spent too much money on load testing, which was based on predictions made by a reputable load testing company that overlooked the fact that the site would be redesigned at the same time as traffic increased.
00:05:34.840
Second, we realized that our response is crucial. When the situation occurred, we evaluated the problem quickly and responded adequately. It helped that we understood JSONP requests weren’t being cached, which someone unfamiliar with the server setup might not have known.
00:06:01.120
Third, we learned that it's always a good idea to have more capacity. Overprovisioning is much cheaper than downtime, which leads us to our first main topic: availability. We aim for 99.9999% availability, which is a measurable goal. It’s straightforward to determine when your site is unavailable; everyone in the company knows when it happens.
00:06:34.119
What people don’t realize is that each one of those nines costs a considerable amount of money. Last year, if you were hosted in US East, you might have experienced only three nines due to downtime. Achieving those incremental nines requires spending, and one cannot guarantee meeting those metrics. What is essential is to have a conversation with all stakeholders about the implications of downtime and how much they are willing to invest.
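To make the cost of each nine concrete, here is a minimal sketch of the arithmetic behind a downtime budget. It assumes a 365-day year, and the function name `allowed_downtime_minutes` is illustrative rather than something from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999, 99.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per year")
```

Under these assumptions, three nines allows about 8.8 hours of downtime per year, while six nines allows roughly half a minute, which is why the conversation with stakeholders matters.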
00:07:01.039
Once you've dealt with availability, the next challenge is consistency: maintaining the site in a consistent state, keeping your databases synchronized and your data clean. Ralph Waldo Emerson said, 'A foolish consistency is the hobgoblin of little minds,' but customers prefer their data maintained consistently.
00:07:35.479
Once the site is operational, you need to ensure that the database replicates properly and that everything reflects the current state of the system. The challenge comes from network outages, which are a fact of life when services are distributed over a network: at any moment, some of them may be unreachable.
00:08:02.479
This leads us to the CAP theorem, which states that a distributed system can guarantee only two of three properties: consistency, availability, and partition tolerance. This suggests that operations is an impossible problem, because every stakeholder wants all three, yet they usually aren’t prepared for the implications.
00:08:37.679
So, when managing operations, decide which trade-offs you're prepared to make, because you will have to make some. Automation is a trendy solution these days, with tools such as Puppet and Chef at your disposal. It doesn't matter which tool you choose; the key is to ensure that subsequent teams know how to spin up your infrastructure.
00:09:07.239
Along with automation, testing is crucial, since automation also relies on external services: bringing up an instance always depends on them. This is a dynamic space, with numerous projects emerging that focus on testing Chef and Puppet. Foodcritic lints Chef cookbooks, while Minitest Chef Handler runs assertions after your Chef recipes have executed.
00:09:59.959
You can find numerous examples that help ensure the infrastructure operates as intended. A simple one is a Chef recipe that installs a web server like NGINX: if the recipe installs NGINX, it should also verify that NGINX is running. That way the verification steps are built into the recipe itself.
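The idea of pairing an install step with a verification step can be sketched outside of Chef as well. The following is a minimal Python sketch, not a Chef recipe: it assumes that "running" means accepting TCP connections on a known port, and the function name `is_listening` is hypothetical.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

A provisioning script could call `is_listening("127.0.0.1", 80)` right after installing the web server and fail the run if it returns False. A port check only shows that something is listening, so a real verification might also request a page.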
00:10:47.240
Integration tests are harder to write, because they require substantial thought about how to bootstrap your infrastructure, but they are essential. I often advise people to hold off on writing integration tests until they have the system working. Tools like Cucumber can help verify that your UI or API behaves as expected.
00:11:21.640
Once your infrastructure is automated and fully functional, monitoring becomes crucial. How many of you know if your site is operational right now? Not many hands go up. Ideally, you should be notified as soon as your site goes down—without depending on someone else to alert you.
00:11:54.240
I classify monitoring into three categories: site-level monitoring, server-level monitoring, and business-level monitoring. The first is site-level monitoring, which indicates whether your application is down. I recommend using services like Pingdom for this purpose. It’s the simplest form of monitoring, as you only need to load a webpage and check the response code.
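Site-level monitoring as described here, loading a page and checking the response code, can be sketched in a few lines. This is an illustration using only the Python standard library, not how Pingdom works internally; the function name `site_is_up` and the choice to treat only 2xx responses as healthy are assumptions.

```python
import urllib.error
import urllib.request


def site_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP 4xx/5xx responses, DNS failures,
        # refused connections, and timeouts.
        return False
```

A check like this is only useful when it runs from outside your own infrastructure, since a monitor hosted next to the site tends to go down with it.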
00:12:29.880
The second level is server-level monitoring, which focuses on individual servers. While you may not want to receive alerts for every app server, it is important to stay aware of your servers' health. You can check this at regular intervals to track gradual changes in your infrastructure.
00:12:52.960
The third category is business-level monitoring, which isn't as enterprise-oriented as it sounds. This simply means tracking metrics that are key to your service. For instance, if you’re running an e-commerce site, you might monitor how many orders are processed within an hour.
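The orders-per-hour example can be sketched as a small check. This is a minimal sketch under stated assumptions: order timestamps are available as a list of `datetime` values, and the function names and the `minimum` threshold are hypothetical.

```python
from datetime import datetime, timedelta


def orders_in_last_hour(order_times: list[datetime], now: datetime) -> int:
    """Count orders placed in the hour leading up to `now`."""
    window_start = now - timedelta(hours=1)
    return sum(1 for placed_at in order_times if window_start < placed_at <= now)


def order_volume_is_healthy(
    order_times: list[datetime], now: datetime, minimum: int
) -> bool:
    """Return False when fewer than `minimum` orders arrived in the last hour."""
    return orders_in_last_hour(order_times, now) >= minimum
```

A sensible `minimum` depends on the time of day and the business, so in practice the threshold would be tuned against historical order volume.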
00:13:28.480
Having tests that verify your current data against business logic can help catch issues early. If your website features APIs, regular checks can help ascertain the reliability of your service and ensure everything is functioning.
00:14:18.760
Once you have your monitoring set up and your configurations automated, you can refactor your infrastructure with confidence. Make sure everything is functioning correctly before proceeding with any essential restructuring of your setup.
00:14:42.120
Refactoring your infrastructure is just as valuable as refactoring your code. Many organizations' infrastructures become convoluted as they maintain legacy setups due to lack of resources or oversight. Ensuring that your infrastructure is clean and maintainable is vital.
00:15:07.680
Regarding DevOps, I have a love-hate relationship with the term because, fundamentally, it means putting the people who write the code in charge of keeping it running. This often aligns priorities as it emphasizes the importance of maintaining availability.
00:15:38.880
If your site goes down and you have a large user base, prioritizing uptime becomes crucial. DevOps allows for faster responses to issues because it enables individuals to have a complete understanding of the entire workflow instead of focusing solely on their segments.
00:16:06.720
At Pivotal, we promote full-stack development by encouraging developers to implement features completely, including installation processes for any required services. This empowers developers to solve their problems without having to rely on others for assistance, thus enhancing overall productivity.
00:17:17.240
On a more practical level, when choosing a hosting provider, it’s vital to find one that understands how to keep servers up and running. They don't need to be proficient in setting up specific applications or databases; instead, they should focus on their core competencies.
00:18:00.080
Issues frequently arise when clients ask hosting providers to handle tasks that are out of their expertise. We have seen more success by utilizing providers for services they offer consistently rather than relying on them for one-off tasks.
00:18:43.440
While many people hear critiques against platforms like Heroku, I advocate for using such services initially. At Pivotal, we utilize Heroku for many projects because it enables us to quickly address critical needs without immediately worrying about scaling.
00:19:26.000
However, as scalability becomes a priority or the application requires a unique infrastructure that hosted services cannot accommodate, taking ownership of maintenance and reconstruction becomes necessary.
00:20:06.560
Let me share another quick story before concluding today. We had a test that looked for batteries near Richfield. One day it turned red because all the stores near Richfield had disappeared from our database completely.
00:20:38.320
After some investigation, we discovered that the service we relied on had changed fundamentally. The latitude and longitude were mixed up in the update, so our data no longer contained any stores near Richfield.
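Swapped latitude and longitude is the kind of problem a data sanity check can catch. The following is a minimal sketch, assuming each store is a (latitude, longitude) pair and that all stores should fall inside a known bounding box. The function names and the box values (a rough box around the contiguous United States) are hypothetical and not from the talk.

```python
Box = tuple[float, float, float, float]  # (min_lat, max_lat, min_lon, max_lon)

# Hypothetical bounding box, roughly the contiguous United States.
EXPECTED_BOX: Box = (24.0, 50.0, -125.0, -66.0)


def inside_box(lat: float, lon: float, box: Box) -> bool:
    """Return True if the point lies inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def looks_swapped(lat: float, lon: float, box: Box) -> bool:
    """Return True if the point is outside the box but inside it once swapped."""
    return not inside_box(lat, lon, box) and inside_box(lon, lat, box)
```

Running `looks_swapped` over each incoming record before it replaces existing data would flag an update like this one before the stores disappear.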
00:21:11.360
Fortunately, we preserved previous versions of our data, which enabled us to restore the missing data quickly. In conclusion, DevOps can often feel chaotic, but with discipline, you can maintain control and organization.
00:21:45.760
The economics of adopting DevOps practices can be beneficial, and understanding its principles can transform your workflow. This holistic approach to development can significantly enhance your operational capabilities.
00:22:27.360
Does anyone have any questions? At Pivotal, we regularly start new projects, often utilizing Heroku as our default. Eventually, we tailor our approach based on client preferences and requirements.
00:23:10.800
There are differing preferences depending on the team you're working with, which can guide your choice of infrastructure. Most individuals in operations prefer CentOS, while Rails developers lean towards Ubuntu.
00:23:38.480
Ultimately, choosing a platform that's familiar to your team can lead to more efficient outcomes. However, you can always adapt according to the project's circumstances.
00:24:02.240
If you have specific implementation questions, there are methods to mirror the data as JSON, or even to use alternatives like JSONP. Many of our tests utilize these techniques to maintain reliability.
00:24:40.760
Mitigating alert fatigue is important—setting thresholds for alerts can reduce unnecessary noise. We focused on addressing criteria changes as part of morning stand-ups, ensuring we discuss any critical alerts daily.
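One common way to set a threshold that cuts alert noise is to fire only after several failed checks in a row. This is a minimal sketch of that idea, not the team's actual alerting setup; the class name `ConsecutiveFailureAlert` and its interface are hypothetical.

```python
class ConsecutiveFailureAlert:
    """Raise an alert only after `threshold` failed checks in a row."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return True if an alert should fire now."""
        if check_passed:
            self.failures = 0
            return False
        self.failures += 1
        # Fire exactly once, when the failure streak first reaches the threshold.
        return self.failures == self.threshold
```

With a threshold of 3, a single failed check between successes never alerts, and a sustained outage alerts once rather than on every check.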
00:25:01.120
Using proper monitoring tools, we ensure critical changes don’t overwhelm us. Having clear processes in place allows your team to address issues in real-time without being overwhelmed by noise.
00:25:56.160
Finally, there are no strict rules regarding responsiveness or dealing with load fluctuations. Each application behaves differently; web applications need robust planning, especially around traffic spikes or increased interest.
00:26:56.440
A surge of public interest may fade, but it still requires some capacity planning to ensure that the infrastructure you're relying on can cope with spikes. Monitoring is paramount at all times.
00:27:20.960
With that, I will conclude my prepared remarks. If there are any further questions, feel free to ask, and I'll gladly assist.