Stack Smashing

by David Czarnecki

In the video Stack Smashing, David Czarnecki discusses an internal project that consolidated a complex application environment at Major League Gaming from roughly 100 virtual machines down to just two physical machines. The consolidation was motivated by the need for a simpler infrastructure to support 16 interconnected applications, which serve millions of users with features such as single sign-on, live competition information, and online video streaming. Czarnecki details the steps taken and lessons learned during the migration and the upgrade of the application stack.

Key points discussed include:

- Network Topology Changes: A comparison of the network infrastructure before and after the migration, highlighting the interrelationships among applications and databases.
- Server Provisioning and Management: The use of Chef recipes for server management, emphasizing automation over manual installation. Czarnecki describes migrating applications in stages, starting with internal systems and ending with public-facing applications, which let the team encounter and solve issues progressively.
- Rails Upgrades: The upgrade from Rails 3 to 3.1, focusing on the benefits of the new asset pipeline, and the switch from Thin to Unicorn, which improved application performance.
- Service Configuration: Strategies for surviving service failures, including DNS aliases that avoid redeploying applications when database servers change.
- Deployment Automation: How deployment tasks were improved with Capistrano, packaging common procedures into reusable gems for consistency across applications.
- Monitoring and Continuous Deployment: The value of visual monitoring tools for historical data analysis, along with insights into the team's continuous integration and deployment processes.

Throughout the talk, several technical examples illustrate how to simplify and optimize infrastructure. Czarnecki concludes with several important takeaways:
- Simplifying infrastructure is crucial for manageability and efficiency.
- Understanding the roles of all servers in your system is essential for effective operations.
- Employing continuous integration and deployment practices can significantly reduce deployment headaches.
- Treating infrastructure management with the same rigor as application development ensures higher reliability and performance.

These strategies are not only applicable to gaming environments but can benefit any organization dealing with complex infrastructures.

00:00:10.820 Thank you.
00:00:25.439 If you are here for Stack Smashing, then you're in the right place. If not, I'd still like you to enjoy the talk. Please see me afterward and I can validate your parking.
00:00:37.800 If you go to the link speakerdeck.com, I've posted the slides online. For folks in the back, if you cannot see the two projection screens well, you can follow along at home. Hopefully, you'll stay for the entire presentation.
00:01:01.199 My name is David Czarnecki. Like most of you, I am on Twitter, at @CzarneckiD. If topics like Ruby, Rails, video games, pictures of food, snowboarding, and beekeeping are your thing, then follow me! I also have code on GitHub at czarneckid.
00:01:29.220 I work for a company called Agora Games, where we provide middleware for video games such as Guitar Hero, Call of Duty, Brink, Blur, Mortal Kombat, and Saints Row the Third. If you've played a major video game, we have data that may flow through our system.
00:01:47.880 We are also part of Major League Gaming, handling all of their online properties. Today, I will discuss simplifying your infrastructure and share key lessons learned managing over 16 interconnected applications on minimal hardware.
00:02:07.140 This initiative, which I termed "stack smashing," arose from the need to collapse nearly 100 virtual machines down to just two physical boxes. Of course, we have additional utility boxes for services like databases, memcache, and Redis, but mainly, our applications run on these two servers.
00:02:19.379 This goal was a priority on our CEO's whiteboard. Coming off a previous project, he asked me to take on initiatives that no one else was addressing. Managing server infrastructure was not something I had much experience with, but as a full-stack developer, I welcomed the opportunity to gain experience in this area.
00:02:51.360 I originally thought this project would be a one-month tour of duty, but as I learned more about the complexities of infrastructure, it has extended from last September to the present.
00:03:10.080 My key goal was to simplify all aspects of our infrastructure. We had many virtual machines for various applications, which complicated things. I aimed to create a clearer understanding of how these applications interact with each other and how to manage this configuration more efficiently.
00:03:30.660 A vital part of this process was documenting everything to ensure that the entire engineering team was aware of ongoing changes and the technologies being utilized. Here’s a brief overview of the network we're discussing for Major League Gaming.
00:04:03.840 As you can see, we have around 15 applications that are intricately intertwined. I have highlighted the larger applications we’re dealing with, such as Pro Gamer, which handles player profiles and stats; MLG TV for video on demand and user-generated content; and a StarCraft arena site for tournaments. Additionally, we offer a live event experience four to five times a year, allowing attendees to mix and match video streams.
00:04:57.920 We sell memberships and various merchandise on the web. Our single sign-on (SSO) is fundamental, enabling all network applications to identify users across the platform. When it comes to computing needs, we perform some simple calculations: the number of apps multiplied by the capacity we require indicates that a minimum of 30 virtual machines is desirable.
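The back-of-envelope capacity math can be sketched in a few lines; the inputs below are illustrative, not the exact figures used at MLG:

```ruby
# Each application needs at least two instances so that one machine can
# fail without taking the app down; the minimum VM count is just the
# product. App count and per-app instances here are illustrative.
def minimum_vms(app_count, instances_per_app)
  app_count * instances_per_app
end

minimum_vms(15, 2)  # => 30
```

Peak events push the real number higher, which is why the VM count kept creeping upward before the consolidation.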
00:05:40.620 This would provide failover support in case one machine goes down. However, demand varies based on specific applications and peak times of the year. Therefore, we often require additional capacity and, consequently, more virtual machines. Currently, we are utilizing Rackspace, hosting applications on VMs with configurations of 1 gigabyte of RAM and 40 gigabytes of disk storage. This should allow for the effective operation of several services and provides the flexibility to scale.
00:06:40.620 Our traffic numbers are significant, amounting to millions of views and page views per month, indicating that our operations are anything but negligible.
00:06:54.000 While this scalability via VMs is advantageous for handling traffic, it also incurs variable costs. Switching to physical servers would offer us fixed costs along with reliable computing resources. However, it’s essential to consider that the more servers you have, the more complex your infrastructure becomes, making it harder to manage and understand.
00:07:20.880 We have established a robust hardware profile for our application servers, featuring processors with six cores, extensive memory, and significant disk capacity. These servers also have hyper-threading enabled, providing a performance boost. We manage our server provisioning using Chef, and I can't stress enough the importance of using automated tools for managing infrastructure.
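A minimal sketch of what such a Chef recipe might look like; the package, template, and service names are illustrative, not Agora's actual cookbook:

```ruby
# Install and enable nginx, then drop an app-specific vhost that proxies
# to a Unicorn socket. All names and paths here are hypothetical.
package "nginx"

service "nginx" do
  action [:enable, :start]
end

template "/etc/nginx/sites-enabled/progamer.conf" do
  source "app.conf.erb"
  variables(app_name: "progamer", unicorn_socket: "/tmp/unicorn.progamer.sock")
  notifies :reload, "service[nginx]"
end
```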
00:08:25.200 Our Chef recipes are mostly standard, and I will go into detail on where modifications have been made and share some code samples. Alongside provisioning, I want to talk about application migration. We migrated 16 applications from virtual machines to physical servers.
00:09:06.180 The migration process began internally, using the tools of internal departments before moving on to more customer-facing applications. During this process, we encountered many challenges, allowing us to identify potential problems and edge cases. By the time we transitioned the last applications, we had resolved most of the earlier issues.
00:09:45.960 What I've learned from this iterative approach is that repetition helps illuminate patterns and abstractions that can efficiently be applied to your infrastructure.
00:10:23.820 We encourage autonomous decision-making regarding the technologies used across our applications. All applications run on Ruby 1.9.3 and Rails 3. During our infrastructure migration, we also upgraded from Rails 3 to Rails 3.1, which mostly revolves around the asset pipeline.
00:10:48.240 In the upgrade process, you check out a branch of your Rails project and point the Rails gem at the updated version. Although the asset pipeline is optional, I recommend enabling it to take full advantage of Rails' features. The steps include grouping the necessary gems in the Gemfile and adjusting application.rb to enable the asset pipeline.
00:11:39.780 Also, moving your assets from the public directory to the app/assets directory is a part of the upgrade process. While this was our upgrade path, your specific experience may vary.
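For a stock Rails 3.0 to 3.1 upgrade, the Gemfile and application.rb changes look roughly like this (gem versions are illustrative):

```ruby
# Gemfile: bump Rails and group the asset-pipeline gems under :assets.
gem "rails", "3.1.0"

group :assets do
  gem "sass-rails",   "~> 3.1"
  gem "coffee-rails", "~> 3.1"
  gem "uglifier"
end

# config/application.rb: turn the pipeline on.
config.assets.enabled = true
config.assets.version = "1.0"

# Then move stylesheets, javascripts, and images from public/ into app/assets/.
```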
00:12:07.680 Additionally, we transitioned from using Thin to Unicorn for our application server, a switch that yielded numerous benefits. Unicorn exhibits impressive performance compared to other servers, as confirmed by benchmarks.
00:12:31.680 One of the significant advantages of Unicorn is its ability to manage worker processes at the kernel level, providing a streamlined way to handle load balancing, while avoiding complexity in our codebase. It also supports rolling restarts, allowing us to maintain zero downtime during deployments.
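The zero-downtime restart relies on Unicorn's USR2 signal handling; a sketch of a unicorn.rb enabling that pattern follows, with paths and worker count as illustrative placeholders:

```ruby
# config/unicorn.rb -- illustrative paths and worker count
worker_processes 4
preload_app true
listen "/tmp/unicorn.app.sock", backlog: 64
pid "/tmp/unicorn.app.pid"

before_fork do |server, worker|
  # On USR2 a new master boots alongside the old one; once the new
  # workers fork, tell the old master to quit for a zero-downtime swap.
  old_pid = "#{server.config[:pid]}.oldbin"
  if File.exist?(old_pid) && server.pid != old_pid
    begin
      Process.kill(:QUIT, File.read(old_pid).to_i)
    rescue Errno::ESRCH, Errno::ENOENT
    end
  end
end
```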
00:13:06.020 We developed some Capistrano tasks for scaling our application workers according to traffic demand, enabling us to dynamically adjust resources instead of provisioning new VMs or servers.
00:13:39.240 Next, let’s talk about our Rails recipe for Unicorn. We set the Rails environment for specific applications based on their requirements, ensuring that our applications have a robust configuration.
00:14:01.680 In terms of service configuration, we discovered that a server failure can lead to rapid alerts, prompting direct action. A common culprit is identified as the database YAML configuration, particularly if it references a specific machine by name. Imagine facing a primary database crash with several applications relying on that configuration.
00:14:54.120 To mitigate this issue, we implemented DNS aliases for our database services. By using an alias in our database configuration, all the applications will reference the same alias, allowing us to change the target machine seamlessly without redeploying.
00:15:37.620 This DNS-driven approach greatly simplifies our operations. It allows for flexible configurations regarding Redis, Memcache, and any other services we use, thereby minimizing the need for multiple database.yaml updates across numerous applications during server transitions.
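The idea is that config/database.yml names a DNS alias rather than a physical host, so failing over means repointing a CNAME instead of redeploying every application. A hypothetical example:

```yaml
# config/database.yml -- host is a DNS alias, not a machine name.
# All names here are hypothetical.
production:
  adapter: mysql2
  host: db-primary.mlg.internal   # CNAME; repoint on failover, no redeploy
  database: progamer_production
  username: progamer
  pool: 5
```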
00:16:07.259 For offline processing, we use Resque for background task management. Our Rails recipe is tailored to support this, allowing queue management to be configured according to each application's demands.
00:16:31.740 We have set different processing intervals for the queues, adjusting them according to workload requirements. Each worker runs as a distinct service within the OS, ensuring independent operation, and they are automatically restarted if they fail.
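A sketch of the Resque convention for such jobs: a class carrying a @queue ivar and a class-level perform method. The job name, queue, and poll interval below are illustrative; enqueueing and the worker command appear only in comments because they need a running Redis:

```ruby
# A Resque-style job: @queue names the queue, self.perform does the work.
class TranscodeVideo
  @queue = :video

  class << self
    attr_reader :queue
  end

  def self.perform(video_id)
    "transcoded #{video_id}"  # stand-in for the real work
  end
end

# With the resque gem and Redis available:
#   Resque.enqueue(TranscodeVideo, 42)
# One worker per queue, each managed as its own OS-level service:
#   QUEUE=video INTERVAL=5 rake environment resque:work
TranscodeVideo.perform(42)  # => "transcoded 42"
```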
00:17:03.420 Deployment tasks are handled through Capistrano, and I often get questions about our RVM (Ruby Version Manager) configuration regarding gem isolation. We use gem sets for applications to avoid dependency issues across projects.
00:17:31.920 Capistrano plays a crucial role in deploying our apps. To avoid redundancy across various applications, we package common deployment procedures into a reusable gem, which eliminates repetitive configurations.
00:18:17.640 Our deploy.rb script maintains simplicity while including necessary elements such as error tracking, notifications, and task configurations. The goal is to keep it as homogeneous as possible across 16 applications.
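A sketch of what a slimmed-down deploy.rb looks like when the shared behavior lives in an internal gem; the gem name, repository, and hooks here are hypothetical:

```ruby
# config/deploy.rb -- most behavior is inherited from a shared gem.
require "agora/capistrano"   # hypothetical internal gem: unicorn restarts,
                             # error tracking, chat notifications

set :application, "progamer"
set :repository,  "git@github.com:example/progamer.git"
set :rvm_ruby_string, "1.9.3@progamer"   # per-app RVM gemset

# Only app-specific hooks stay local; everything else is shared.
after "deploy:restart", "progamer:warm_caches"
```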
00:19:02.520 We also monitor our applications, emphasizing the importance of having visual representations of application performance metrics. It is crucial for teams to analyze request rates and server health over time.
00:19:48.000 This visibility allows teams to identify trends and assess application performance under varying loads. Timely detection of slowdowns empowers engineers to troubleshoot before issues escalate.
00:20:26.760 At our company, we use Munin as a monitoring tool. Each server regularly checks into a central Munin server, which lets us categorize machines based on the services they run.
00:21:04.620 Some of our metrics include performance tracking of application deployments, NGINX response rates, and Redis operations. Generally, we maintain historical data comparisons to ensure everything runs smoothly.
00:21:41.580 In addition to monitoring application performance, we’ve implemented behavior-driven development (BDD) practices for our infrastructure. This systematic approach enhances our ability to validate service configurations.
00:22:25.080 Using step definitions, we validate the operational status of machines, running commands on different servers, performing DNS lookups to verify proper aliasing, and ensuring backups are timely.
00:23:09.240 This practice helps us continuously validate that our systems operate as expected, maintaining reliability in services such as MongoDB, Redis, and others across our stack.
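A pure-Ruby sketch of the kind of check such a step definition would run: resolve a DNS name and compare it to the address it should point at. In the real suite this would live inside a Cucumber step; the helper name is hypothetical:

```ruby
require "resolv"

# Returns true only if the name resolves to the expected address;
# an unresolvable name simply fails the check instead of raising.
def resolves_to?(name, expected_address)
  Resolv.getaddress(name) == expected_address
rescue Resolv::ResolvError
  false
end

resolves_to?("no-such-host.invalid", "10.0.0.1")  # => false
```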
00:23:56.160 For continuous integration and deployment, we rely on Jenkins to automatically verify developers’ code. It builds our internal gems, ensuring they remain up to date and functional throughout multiple projects.
00:24:40.680 We also implement GitHub Flow for our development process, making sure that all changes merge seamlessly into the master branch, which is always deployable. This approach involves utilizing pull requests to facilitate code reviews before merging back into the master.
00:25:39.960 Upon successful merging, our post-build script triggers automatic deployments of the applications. Notifications via email or chat help the team stay informed about deployment statuses.
00:26:07.080 To summarize, here are some key takeaways identified throughout this project. Simplify your infrastructure as much as possible. Regularly revisit the roles of your servers, ensuring team familiarity with the infrastructure.
00:27:06.000 When you perform upgrades, maintain awareness of the patterns you discover through repetition. A simple, straightforward comprehension of your services facilitates speedy recovery from failures.
00:27:51.600 Implement DRY (Don't Repeat Yourself) principles in your codebase, especially around Capistrano. Our organization has consolidated this information into a single gem, allowing clear, consistent deployment processes across multiple applications.
00:28:24.480 Establish comprehensive monitoring solutions that offer accessible visualizations. Different stakeholders in the company should comprehend performance metrics and engage in discussions around the application's health.
00:29:05.760 Adopt BDD practices for your infrastructure, just as you would for your applications. This structured approach verifies that services run correctly, adding another level of assurance to deployments.
00:29:39.960 Integrate all your systems, ensuring continuous integration processes are standard practice. This will help catch discrepancies between local development and the CI environment.
00:30:56.820 Lastly, I encourage everyone to explore the slides available at speakerdeck.com, featuring both cornflower blue and ruby red themes.
00:31:38.160 Thank you for your attention today and for attending this session.
00:32:41.460 Thank you.