00:00:00.120
So today, I'd like to speak about DevOps without the Ops and whether it's a fallacy, a dream, or perhaps both.
00:00:03.120
This discussion is based on my experience at a company I worked for, where we handled thousands of requests and reads per second on our Ruby on Rails stack, all without an operations team, while achieving 99.97% uptime without trying really hard.
00:00:10.519
I thought this was really interesting, so I wanted to share some insights that could help others build applications that are resilient and do not necessarily require a large Ops team.
00:00:18.760
Let me give you a quick overview of Wanelo, which is essentially the shopping mall of the future. If you're not familiar with it, think of it as a blend of Tumblr, Twitter, and Facebook, all built around shopping. You can follow people, stores, and collections, all personalized to your preferences. It's an infinite space where you can easily get lost. We represent around 350,000 stores and 20 million products, which have been saved over three billion times by users. So, think of us as something like Pinterest, but focused heavily on products.
00:00:45.000
We launched in 2012 with about 35 employees, roughly half of them engineers. I now want to talk about our technology stack. People often ask, 'What technology stack are you using?' It's a loaded question, so I've put together a slide. It's not necessarily for you to read right now, but it should show how we rely on a massive number of components, including numerous open-source tools and paid services.
00:01:01.799
Given how extensive this is, it’s somewhat silly to discuss the stack unless you have a couple of hours to spare, as there’s a lot to consider. Moving on, if you're building a Ruby on Rails application and perhaps targeting mobile users or public APIs, you will eventually deal with traffic considerations. For internal websites, you might expect something along the lines of hundreds of requests per minute. However, if your application gains popularity, you may see traffic reach up to 1,000 or 2,000 requests per minute.
00:01:20.000
Once you cross over to 100,000 requests per minute, you definitely encounter scaling issues. Top-tier services like Pinterest, Facebook, and Twitter experience tens of millions of requests per minute, but most of us don't have to worry about that level of traffic. Still, growth is essential, and understanding how to scale is crucial, even at lower levels of traffic.
00:01:40.200
My talk aims to review DevOps, cloud computing, and how it is changing the landscape of software development. I want to point out patterns that can significantly reduce the stress of software development, especially in Ruby on Rails, and question whether a new startup or small company really needs an Ops team.
00:01:58.280
To begin with, let's discuss the basic definition of DevOps. One definition from Wikipedia is very lengthy, so I’ll skip over it. A more interesting perspective comes from a recent report stating many organizations are confused about what DevOps means for them. It's interesting to see how we've combined Development and Operations, and yet many remain uncertain about the actual implications.
00:02:12.760
Another report from Puppet Labs highlights that efficient teams deploy code 30 times more frequently with 50% fewer failures. This correlation between DevOps practices and high organizational performance gives us tangible benefits to consider.
00:02:27.360
Traditional agile approaches often involve heavy processes where product pushes requirements to dev, dev hands off to QA, and QA might spend weeks testing manually before anything reaches operations. I've worked in such environments, and believe me, it can be a painful process that hinders progress. Therefore, I argue that an Ops team may not be necessary until a company reaches a certain size, especially for applications that are less regulated.
00:02:47.240
For instance, if you’re operating in the financial sector, a dedicated QA team is vital. However, for projects like a social network, a lean approach can be much more efficient. Let's take a moment to consider what traditional operations involve. They are generally responsible for uptime, stability, availability, on-call issues, backups, and disaster recovery.
00:03:01.600
With the advent of cloud computing, many of these responsibilities have evolved. This shift has allowed us to eliminate tasks like managing hardware and networking configuration, simplifying operations significantly.
00:03:16.280
In building software, we focused early on the goals of our company. Initially, we had roughly six engineers and our key objective was to maximize iteration speed. The reality of startup life was that we needed to experiment with various features and launch them rapidly.
00:03:31.760
We called this flexible approach agile, in contrast to the heavier, process-laden Agile methodologies. As we began to experience traffic spikes, it was crucial that we could scale effectively.
00:03:44.280
From early on, we made small investments in performance, focusing on aspects like caching. We implemented caching strategies early, which proved beneficial whenever traffic surged.
00:03:59.440
We wanted to cultivate an environment where experimentation was encouraged. Our final goal was to have control over our infrastructure. We evaluated various options like Heroku and found them prohibitively expensive and somewhat limiting in flexibility.
00:04:12.920
So while moving quickly, we opted not to hire an Ops team but brought in engineers who enjoyed operational tasks. However, they approached those tasks with a software engineering mindset.
00:04:27.760
This meant we had to deploy our application to the cloud. We learned to provision nodes, set up load balancers, and tune our servers, which proved vital for our success.
00:04:38.720
Fast forward to today, our application is entirely hosted on Joyent Cloud, a lesser-known but dynamic provider. We automated our infrastructure with Chef, and with that in place we handled 10,000% traffic growth in six months without downtime.
00:05:01.320
Although we experienced slower performance at times, we managed to maintain 99.97% uptime without much effort. Our focus remained on developing features that users wanted, which proved more important than flawless uptime.
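To put 99.97% in perspective, the arithmetic can be sketched as follows. The function name and the 365-day year are assumptions made for illustration only.

```python
def allowed_downtime_minutes(uptime_percent: float, days: float = 365.0) -> float:
    """Return the minutes of downtime implied by an uptime percentage over `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# 99.97% uptime leaves 0.03% of the period as downtime.
per_year = allowed_downtime_minutes(99.97)
per_30_days = allowed_downtime_minutes(99.97, days=30)
print(f"{per_year:.2f} minutes per year, {per_30_days:.2f} minutes per 30 days")
```

That works out to roughly two and a half hours of downtime over a full year, or about 13 minutes in a 30-day month.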
00:05:13.920
On our engineering team, we have a low on-call burden, receiving about one to two pages a week. This setup allows us to do extensive Ops work without a traditional Ops team.
00:05:38.000
The key takeaway is this: establishing a culture where everyone is responsible for both development and operations leads to a more resilient environment.
00:05:59.640
The next segment focuses on practical advice that could help you alleviate stress in your operations. The first key point is that infrastructure must be treated as a first-class citizen.
00:06:08.000
We consider infrastructure work as equally important to building features. For instance, if you need to add a background worker to your Ruby code, you would handle the infrastructure configuration in tandem.
00:06:29.080
Additionally, we run Chef continuously in production rather than sporadically. This helps maintain reliability and enables us to catch issues quickly without the uncertainty of when it was last applied.
00:06:44.560
Our deployment strategy is incremental rather than continuous. We roll the code out and then incrementally restart a small percentage of our servers, monitoring errors closely before proceeding.
00:07:07.960
This approach limits manual interventions to just two commands, making it efficient while allowing engineers to stay close to the infrastructure.
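The incremental restart described above can be sketched roughly like this. It is a minimal illustration, not the actual deployment tooling: `restart`, `error_rate`, the 10% batch size, and the 1% error threshold are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence


def rolling_restart(
    servers: Sequence[str],
    restart: Callable[[str], None],   # hypothetical: restarts one app server
    error_rate: Callable[[], float],  # hypothetical: current error fraction, 0.0 to 1.0
    batch_fraction: float = 0.1,      # assumed: restart about 10% of the fleet at a time
    max_error_rate: float = 0.01,     # assumed: stop if more than 1% of requests fail
) -> List[str]:
    """Restart servers in small batches and stop as soon as errors spike.

    Returns the servers that were restarted, in order.
    """
    batch_size = max(1, int(len(servers) * batch_fraction))
    restarted: List[str] = []
    for start in range(0, len(servers), batch_size):
        for server in servers[start:start + batch_size]:
            restart(server)
            restarted.append(server)
        # Check errors after every batch before touching more servers.
        if error_rate() > max_error_rate:
            break  # the remaining servers keep running the previous code
    return restarted
```

The point of the design is that a bad release only ever reaches one batch of servers before a human or a script notices the error rate and stops.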
00:07:22.720
The next crucial element is creating fault-tolerant infrastructure, something that has become significantly easier, thanks to modern cloud services.
00:07:43.040
If there's one lesson to take away from this talk, it is to put HAProxy in front of everything. This includes your application servers, database servers, and search engines.
00:08:02.160
HAProxy adds stability and speed. Because it routes requests to healthy backends efficiently, it improves the user experience without adding noticeable latency.
00:08:29.440
Other proxies, such as Marara, perform similarly and should also be incorporated into your architecture to enhance resilience.
00:08:42.320
For example, the Marara proxy can enhance your database operations by efficiently pooling connections and ensuring robust functionality in a multi-threaded environment.
00:09:08.880
In scenarios where you may face service downtime, having this kind of architecture can help mitigate customer impact, ensuring that your service remains operational.
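What a proxy in front of a group of backends contributes can be illustrated with a toy model: hand each request to the next backend in rotation and skip any that fail a health check. This is not HAProxy's implementation or configuration; the class name, the backend names, and the `is_healthy` callback are hypothetical.

```python
from typing import Callable, Sequence


class HealthCheckedPool:
    """Toy round-robin pool that skips backends failing a health check."""

    def __init__(self, backends: Sequence[str], is_healthy: Callable[[str], bool]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._is_healthy = is_healthy  # hypothetical health probe
        self._next = 0

    def pick(self) -> str:
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self._backends)):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if self._is_healthy(backend):
                return backend
        raise RuntimeError("no healthy backends")


# Example: db2 is down, so traffic alternates between db1 and db3.
down = {"db2"}
pool = HealthCheckedPool(["db1", "db2", "db3"], lambda name: name not in down)
print([pool.pick() for _ in range(4)])
```

Callers only ever talk to the pool, so a single failed backend costs capacity rather than availability.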
00:09:25.120
Monitoring plays a crucial role, and we alert only on significant business metrics while ignoring noise that doesn't directly affect customer experience.
00:09:41.240
This allows our team to focus on what truly matters, maintaining system reliability without overwhelming engineers with alerts.
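Alerting on business metrics only can be sketched as a small rule check. The metric names (`saves_per_minute`, `cpu_percent`) and the threshold are made up for illustration; the idea is that anything without a rule never pages anyone.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str     # name of a business metric (hypothetical)
    minimum: float  # page when the observed value drops below this


def pages_to_send(rules: List[Rule], observed: Dict[str, float]) -> List[str]:
    """Return one page per rule whose metric is missing or below its minimum.

    Metrics that have no rule (CPU, load average, and so on) are ignored.
    """
    pages: List[str] = []
    for rule in rules:
        value = observed.get(rule.metric)
        if value is None:
            pages.append(f"{rule.metric}: no data")
        elif value < rule.minimum:
            pages.append(f"{rule.metric}: {value} below {rule.minimum}")
    return pages


rules = [Rule("saves_per_minute", 1000.0)]
# High CPU alone does not page; a drop in user activity does.
print(pages_to_send(rules, {"saves_per_minute": 400.0, "cpu_percent": 97.0}))
```

A missing metric is treated as a page as well, on the assumption that silence from a business metric is itself a symptom worth waking up for.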
00:10:02.560
In summary, it is important to take control of your infrastructure. Hire skilled engineers who can write software rather than merely manage infrastructure, and emphasize that infrastructure work is software engineering.
00:10:29.360
Ultimately, if you insist on automation, leverage fault tolerance, monitor business metrics over noise, and partner with responsive cloud providers, you will significantly reduce operational stress.
00:10:40.400
Thank you for your attention.