Ruby Video | DevOps Without The "Ops" – A Fallacy? A Dream? Or Both?

Konstantin Gredeskoul

#devops

#ruby-on-rails

#cloud-computing

#continuous-deployment

#scaling

#infrastructure-as-code

#automation

DevOps Without The "Ops" – A Fallacy? A Dream? Or Both?

by Konstantin Gredeskoul

In the presentation "DevOps Without The 'Ops' – A Fallacy? A Dream? Or Both?" at RubyConf AU 2015, Konstantin Gredeskoul explores running software development and operations without a dedicated Ops team. The talk draws on his experience at Wanelo, a shopping platform that handled substantial traffic with little operational overhead. Gredeskoul argues that startups can thrive without a conventional Ops team and outlines a modern approach to scaling applications effectively.

Key points discussed include:

- Redefining DevOps: Gredeskoul presents various perspectives on DevOps, indicating that many organizations struggle to grasp its implications. Traditional methods bind dev and ops with strict protocols, yet flexible approaches may work for less regulated industries.
- Scaling Traffic: He highlights the importance of understanding scaling challenges as traffic increases. Initial hundreds of requests may snowball into thousands or even millions, necessitating an agile response.
- Culture of Ownership: At Wanelo, engineers were responsible for operational tasks as well as code. Blending the roles improved resilience and reduced reliance on a dedicated Ops team.
- Infrastructure as a First-Class Citizen: Viewing infrastructure configuration as equally crucial as feature development can streamline operations. Continuous monitoring and incremental deployment strategies help maintain uptime, while fault-tolerant architecture ensures stability.
- Utilizing Modern Tools: Cloud services and automation with Chef let the team absorb large traffic growth and keep uptime high without a traditional Ops structure.
- Real-World Examples: Gredeskoul describes how placing HAProxy and other proxies in front of backend services improved system speed and reliability.
- Monitoring Metrics: He advises alerting only on significant business metrics, so that engineers are not overwhelmed by noise.

To conclude, Gredeskoul emphasizes that startups can operate effectively without dedicated operations if they foster a culture where everyone shares responsibility for both development and operations. By leveraging automation, focusing on fault tolerance, and employing modern cloud technologies, companies can scale up efficiently while maintaining high service availability.

00:00:00.120 So today, I'd like to speak about DevOps without the Ops and whether it's a fallacy, a dream, or perhaps both.

00:00:03.120 This discussion is based on my experience at a company I worked for, where we handled thousands of requests and reads per second on our Ruby on Rails stack—all without an operations team and achieving 99.97% uptime without trying really hard.

00:00:10.519 I thought this was really interesting, so I wanted to share some insights that could help others build applications that are resilient and do not necessarily require a large Ops team.

00:00:18.760 Let me give you a quick overview of Wanelo, which is essentially the shopping mall of the future. If you're not familiar with it, think of it as a blend between Tumblr, Twitter, and Facebook, all built around shopping. You can follow people, stores, and collections, all personalized to your preferences. It's an infinite space where you can easily get lost. We represent around 350,000 stores and 20 million products, which have been saved over three billion times by users. So, consider us as something like Pinterest, but focused heavily on products.

00:00:45.000 We launched in 2012 with a small team of about 35 employees, roughly half of them engineers. I now want to talk about our technology stack. People often ask, 'What technology stack are you using?' It’s a loaded question, so I’ve put together a slide. You don't need to read it in detail right now, but it should show how much we rely on a large number of components, including many open-source tools and paid services.

00:01:01.799 Given how extensive this is, it’s somewhat silly to discuss the stack unless you have a couple of hours to spare, as there’s a lot to consider. Moving on, if you're building a Ruby on Rails application and perhaps targeting mobile users or public APIs, you will eventually deal with traffic considerations. For internal websites, you might expect something along the lines of hundreds of requests per minute. However, if your application gains popularity, you may see traffic reach up to 1,000 or 2,000 requests per minute.

00:01:20.000 Once you cross over to 100,000 requests per minute, you definitely encounter scaling issues. Top-tier services like Pinterest, Facebook, and Twitter experience tens of millions of requests per minute, but most of us don't have to worry about that level of traffic. Still, growth is essential, and understanding how to scale is crucial, even at lower levels of traffic.
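To make the traffic levels above easier to compare, here is a small Ruby sketch that converts requests per minute into requests per second and labels each level. The tier names and exact cut-offs are hypothetical; they only approximate the round figures mentioned in the talk.

```ruby
# Convert requests per minute (the unit used in the talk) to requests per second.
def requests_per_second(requests_per_minute)
  requests_per_minute / 60.0
end

# Hypothetical tier labels; the thresholds approximate the figures in the talk.
def traffic_tier(requests_per_minute)
  if requests_per_minute < 1_000
    :internal_site
  elsif requests_per_minute < 100_000
    :growing_app
  elsif requests_per_minute < 10_000_000
    :scaling_issues
  else
    :top_tier
  end
end

[500, 2_000, 100_000, 20_000_000].each do |rpm|
  puts format("%-10d rpm = %10.1f rps (%s)", rpm, requests_per_second(rpm), traffic_tier(rpm))
end
```

At 100,000 requests per minute the application is serving roughly 1,667 requests every second, which is why scaling problems become hard to ignore at that point.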

00:01:40.200 My talk aims to review DevOps, cloud computing, and how it is changing the landscape of software development. I want to point out patterns that can significantly reduce the stress of software development, especially in Ruby on Rails, and question whether a new startup or small company really needs an Ops team.

00:01:58.280 To begin with, let's discuss the basic definition of DevOps. One definition from Wikipedia is very lengthy, so I’ll skip over it. A more interesting perspective comes from a recent report stating many organizations are confused about what DevOps means for them. It's interesting to see how we've combined Development and Operations, and yet many remain uncertain about the actual implications.

00:02:12.760 Another report from Puppet Labs highlights that efficient teams deploy code 30 times more frequently with 50% fewer failures. This correlation between DevOps practices and high organizational performance gives us tangible benefits to consider.

00:02:27.360 Traditional agile approaches often involve heavy processes: product pushes requirements to dev, dev hands off to QA, and QA might spend weeks testing manually before anything reaches operations. I've worked in such environments, and believe me, it can be a painful process that hinders progress. Therefore, I argue that an Ops team may not be indispensable until a company reaches a certain size, especially for applications that are less regulated.

00:02:47.240 For instance, if you’re operating in the financial sector, a dedicated QA team is vital. However, for projects like a social network, a lean approach can be much more efficient. Let's take a moment to consider what traditional operations involve. They are generally responsible for uptime, stability, availability, on-call issues, backups, and disaster recovery.

00:03:01.600 With the advent of cloud computing, many of these responsibilities have evolved. This shift has allowed us to eliminate tasks like managing hardware and networking configuration, simplifying operations significantly.

00:03:16.280 In building software, we focused early on the goals of our company. Initially, we had roughly six engineers and our key objective was to maximize iteration speed. The reality of startup life was that we needed to experiment with various features and launch them rapidly.

00:03:31.760 We thought of this flexible approach as agile in the everyday sense of the word, in contrast with the traditional, process-heavy Agile methodologies. As we began to experience traffic spikes, it was crucial that we could scale effectively.

00:03:44.280 From early on, we made small investments in performance, focusing on aspects like caching. We implemented caching strategies early, which proved beneficial whenever traffic surged.
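The talk does not say how the caching was implemented. As an illustration only, here is a minimal sketch of one common pattern, a read-through cache with expiry, in plain Ruby. The class name `TtlCache`, the key format, and the `load_product` helper are all hypothetical.

```ruby
# Minimal in-memory read-through cache with per-entry expiry (illustrative only).
class TtlCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(ttl_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @entries = {}
    @lock = Mutex.new
  end

  # Returns the cached value for key, or runs the block, stores the result, and returns it.
  # The lock is held while the block runs, so concurrent loads are serialized.
  def fetch(key)
    @lock.synchronize do
      entry = @entries[key]
      return entry.value if entry && entry.expires_at > @clock.call

      value = yield
      @entries[key] = Entry.new(value, @clock.call + @ttl)
      value
    end
  end
end

cache = TtlCache.new(ttl_seconds: 60)
db_calls = 0

# Hypothetical stand-in for an expensive database query.
load_product = lambda do |id|
  db_calls += 1
  { id: id, name: "Product #{id}" }
end

2.times { cache.fetch("product:42") { load_product.call(42) } }
puts db_calls # the second fetch is served from the cache
```

In a Rails application the same shape is usually expressed with `Rails.cache.fetch(key, expires_in: ...) { ... }`, backed by a shared store rather than process memory.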

00:03:59.440 We wanted to cultivate an environment where experimentation was encouraged. Our final goal was to have control over our infrastructure. We evaluated various options like Heroku and found them prohibitively expensive and somewhat limiting in flexibility.

00:04:12.920 So while moving quickly, we opted not to hire an Ops team but brought in engineers who enjoyed operational tasks. However, they approached those tasks with a software engineering mindset.

00:04:27.760 This meant we had to deploy our application to the cloud. We learned to provision nodes, set up load balancers, and tune our servers, which proved vital for our success.

00:04:38.720 Fast forward to today: our application is hosted entirely on Joyent Cloud, a lesser-known but dynamic provider. We automated our infrastructure with Chef and absorbed 10,000% traffic growth in six months without downtime.

00:05:01.320 Although performance was slower at times, we maintained 99.97% uptime without trying especially hard. Our focus remained on building features that users wanted, which proved more important than flawless uptime.
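To put those figures in perspective, the arithmetic below works out how much downtime 99.97% uptime allows and what 10,000% growth means as a multiplier. This is a sketch of the arithmetic only; the method names are hypothetical.

```ruby
MINUTES_PER_DAY = 24 * 60

# Downtime (in minutes) allowed by an uptime percentage over a number of days.
def downtime_minutes(uptime_percent, days)
  (1 - uptime_percent / 100.0) * days * MINUTES_PER_DAY
end

# Traffic multiplier implied by a percentage growth figure.
def growth_multiplier(growth_percent)
  1 + growth_percent / 100.0
end

puts downtime_minutes(99.97, 30).round(1)   # minutes per 30-day month
puts downtime_minutes(99.97, 365).round(1)  # minutes per year
puts growth_multiplier(10_000)              # traffic relative to the starting point
```

In other words, 99.97% uptime leaves a budget of roughly 13 minutes of downtime per 30-day month, or about 2.6 hours per year, and 10,000% growth means traffic ended up about a hundred times higher than where it started.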

00:05:13.920 On our engineering team, we have a low on-call burden, receiving about one to two pages a week. This setup allows us to do extensive Ops work without a traditional Ops team.

00:05:38.000 The key takeaway is this: establishing a culture where everyone is responsible for both development and operations leads to a more resilient environment.

00:05:59.640 The next segment focuses on practical advice that could help you alleviate stress in your operations. The first key point is that infrastructure must be treated as a first-class citizen.

00:06:08.000 We consider infrastructure work as equally important to building features. For instance, if you need to add a background worker to your Ruby code, you would handle the infrastructure configuration in tandem.

00:06:29.080 Additionally, we run Chef continuously in production rather than sporadically. This helps maintain reliability and enables us to catch issues quickly without the uncertainty of when it was last applied.
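Running a configuration tool continuously is safe because each run only changes what has drifted from the declared state. The sketch below illustrates that idea in plain Ruby. It is not Chef's API, and the setting names are hypothetical.

```ruby
# Desired state is declared as data; converging only touches settings that have drifted.
def converge(desired, actual)
  changes = []
  desired.each do |key, want|
    next if actual[key] == want

    changes << "#{key}: #{actual[key].inspect} -> #{want.inspect}"
    actual[key] = want
  end
  changes
end

desired = { "nginx.worker_processes" => 4, "app.workers" => 8 }
node    = { "nginx.worker_processes" => 4, "app.workers" => 6 } # changed by hand

first_run  = converge(desired, node) # repairs the drift
second_run = converge(desired, node) # nothing left to do

puts first_run.inspect
puts second_run.inspect
```

Because a second run is a no-op, frequent runs cost little, and any manual change is either corrected or surfaces as a reported difference shortly after it is made.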

00:06:44.560 Our deployment strategy is incremental rather than continuous. We roll the code out and then incrementally restart a small percentage of our servers, monitoring errors closely before proceeding.

00:07:07.960 This approach limits manual interventions to just two commands, making it efficient while allowing engineers to stay close to the infrastructure.
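The incremental restart described above can be sketched as a loop over small batches that stops as soon as errors rise. The method name, the callbacks, and the thresholds below are hypothetical, not the team's actual tooling.

```ruby
# Restart servers in small batches, halting if the error rate rises above a limit.
# `restart` and `error_rate` are injected callables so the sketch stays self-contained.
def rolling_restart(servers, batch_fraction:, max_error_rate:, restart:, error_rate:)
  batch_size = [(servers.size * batch_fraction).ceil, 1].max
  restarted = []

  servers.each_slice(batch_size) do |batch|
    batch.each { |server| restart.call(server) }
    restarted.concat(batch)

    rate = error_rate.call
    if rate > max_error_rate
      return { status: :halted, restarted: restarted, error_rate: rate }
    end
  end

  { status: :completed, restarted: restarted }
end

servers = (1..10).map { |i| "app#{i}" }
result = rolling_restart(
  servers,
  batch_fraction: 0.2,   # restart 20% of the fleet at a time
  max_error_rate: 0.01,  # halt if more than 1% of requests fail
  restart: ->(server) { puts "restarting #{server}" },
  error_rate: -> { 0.002 } # stand-in for a real monitoring query
)
puts result[:status]
```

If the new code misbehaves, only the first batch of servers is affected, and the rest of the fleet keeps serving traffic on the previous version.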

00:07:22.720 The next crucial element is creating fault-tolerant infrastructure, something that has become significantly easier, thanks to modern cloud services.

00:07:43.040 If there's one lesson to take away from this talk, it is to put HAProxy in front of everything. This includes your application servers, database servers, and search engines.

00:08:02.160 HAProxy adds stability and speed, significantly improving user experience without adding latency since it manages backend routing efficiently.
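To illustrate what a proxy in front of a group of backends contributes, here is a simplified Ruby sketch of round-robin routing that skips backends marked as down. It is a toy model, not HAProxy's implementation; the class name and backend addresses are hypothetical.

```ruby
# Round-robin backend picker that skips backends marked unhealthy.
class BackendPool
  class NoHealthyBackend < StandardError; end

  def initialize(backends)
    @backends = backends
    @healthy = backends.to_h { |backend| [backend, true] }
    @cursor = 0
  end

  def mark_down(backend)
    @healthy[backend] = false
  end

  def mark_up(backend)
    @healthy[backend] = true
  end

  # Returns the next healthy backend, trying each backend at most once.
  def next_backend
    @backends.size.times do
      candidate = @backends[@cursor % @backends.size]
      @cursor += 1
      return candidate if @healthy[candidate]
    end
    raise NoHealthyBackend, "all backends are down"
  end
end

pool = BackendPool.new(%w[app1:3000 app2:3000 app3:3000])
pool.mark_down("app2:3000") # in a real proxy, a failed health check would do this
puts 4.times.map { pool.next_backend }.inspect
```

Callers keep talking to a single address while failed backends are routed around, which is where the stability benefit comes from.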

00:08:29.440 Other proxies, such as Marara, perform similarly and should also be incorporated into your architecture to enhance resilience.

00:08:42.320 For example, the Marara proxy can enhance your database operations by efficiently pooling connections and ensuring robust functionality in a multi-threaded environment.
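Connection pooling itself can be sketched in a few lines of Ruby. This is a generic illustration of the technique, not the implementation of any particular proxy; the class name `TinyPool` and the string "connections" are hypothetical stand-ins.

```ruby
# Fixed-size pool: callers borrow a connection and return it when their block ends.
class TinyPool
  def initialize(size:, &factory)
    @available = Queue.new
    size.times { @available << factory.call }
  end

  # Blocks until a connection is free, yields it, and always returns it to the pool.
  def with
    conn = @available.pop
    yield conn
  ensure
    @available << conn if conn
  end
end

opened = 0
pool = TinyPool.new(size: 2) do
  opened += 1
  "conn-#{opened}" # stand-in for a real database connection
end

threads = 8.times.map do |i|
  Thread.new do
    pool.with do |conn|
      sleep 0.01 # pretend to run a query
      "query #{i} via #{conn}"
    end
  end
end
threads.each(&:join)

puts opened # eight threads shared two connections
```

The database sees a small, stable number of connections regardless of how many application threads or processes are waiting.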

00:09:08.880 In scenarios where you may face service downtime, having this kind of architecture can help mitigate customer impact, ensuring that your service remains operational.

00:09:25.120 Monitoring plays a crucial role, and we alert only on significant business metrics while ignoring noise that doesn't directly affect customer experience.

00:09:41.240 This allows our team to focus on what truly matters, maintaining system reliability without overwhelming engineers with alerts.
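One way to express "alert on business metrics" is to page only when a metric that reflects customer activity falls well below its recent baseline. The sketch below assumes a hypothetical "saves per minute" metric and a 50% threshold; neither comes from the talk.

```ruby
# Page only when a business metric falls well below its recent average.
def should_page?(recent_values, current, min_ratio: 0.5)
  return false if recent_values.empty?

  baseline = recent_values.sum.to_f / recent_values.size
  current < baseline * min_ratio
end

saves_per_minute = [980, 1010, 1005, 995] # hypothetical recent samples

puts should_page?(saves_per_minute, 940) # small dip: no page
puts should_page?(saves_per_minute, 310) # sharp drop: page someone
```

A single noisy host or a brief CPU spike never triggers this check on its own; only a change that customers actually feel does.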

00:10:02.560 In summary, it is important to take control of your infrastructure. Hire skilled engineers who can write software rather than merely manage infrastructure, and emphasize that infrastructure work is software engineering.

00:10:29.360 Ultimately, if you insist on automation, leverage fault tolerance, monitor business metrics over noise, and partner with responsive cloud providers, you will significantly reduce operational stress.

00:10:40.400 Thank you for your attention.

RubyConf AU 2015