The Real Black Friday aka How To Scale an Unscalable Service

by Judit Ördög-Andrási

The video titled "The Real Black Friday aka How To Scale an Unscalable Service" features Judit Ördög-Andrási discussing the challenges faced during the Black Friday rush at Quality Marcs, an email marketing platform. The talk is part of the Euruko 2017 conference and focuses on practical applications of software development principles such as Test Driven Development (TDD) and DevOps in a high-pressure environment.

Key Points:
- Introduction to Black Friday: Judit presents the significance of Black Friday as a critical time for the company, where email marketing plays a vital role in holiday marketing strategies, highlighting the substantial consumer spending of $58 billion in the U.S. on that day in 2013.
- Product Overview: Smart Insight, a Ruby on Rails application that analyzes the effectiveness of email campaigns, was under increased pressure because the customer base had nearly doubled over six months without any maintenance.
- Problem with Infrastructure: On the night before Black Friday, the team faced a significant issue where the UI of Smart Insight became unreachable due to backend requests not completing within the five-minute timeout period.
- Hotfix Strategies: Judit shares the actions taken, which included increasing the request timeout and limiting long-running requests, showcasing teamwork and immediate problem-solving techniques under pressure.
- Infrastructure Improvements: The narrative transitions to longer-term infrastructural fixes, including implementing configuration management through Puppet and considering moving to a cloud solution, while following the Twelve-Factor App principles to prepare for scalability challenges.
- Code Optimization: Judit presents a restructuring of the backend and UI processes that lets long-running tasks operate independently, employing 'dark launches' to minimize disruption during code deployment.

Conclusions and Takeaways:
- Judit emphasizes the importance of adhering to best practices and being proactive, especially during crisis moments.
- Key takeaways include understanding critical system components, the necessity of having a backup action plan, and utilizing proper operational practices to withstand unexpected demand spikes.
- Despite challenges, the company's preparations led to a highly successful Black Friday, demonstrating the effectiveness of the strategies implemented.

00:00:07.900 Hello!
00:00:13.900 Thank you for coming and thank you for being here.
00:00:17.380 Are you people excited about fall? Do you like colorful leaves, scarves, and family gatherings? What about a pumpkin spice latte? Personally, I love it. Do you know what else is in the autumn? Black Friday! So today, I'm going to share a story with you about some experiences with the team I was working with last year.
00:00:28.700 Maybe you can learn something from our story, so let's get started. Here you can see a bunch of buzzwords: TDD, XP, DevOps, clean code. These are words that always come up in discussions at conferences like this. While I enjoy talking about these concepts in theory, I often find that when they are only discussed theoretically, they are hard to grasp, and the discussions can be unfruitful. Today, I prepared something where you can see the application of some of these principles through practice and how we implement them at the company I work for.
00:00:52.300 When things go south, I think people tend to let go of the best practices they know and revert to old, familiar methods that seem less risky. However, I believe these are precisely the times when you must hold on to the best practices that everyone talks about. They can help you navigate through difficult times.
00:01:14.900 Perhaps you can learn something from our story, and maybe share a laugh or two. Now, I can laugh at it a year later! First of all, before we get started, a disclaimer: I am going to talk about a production system. It may not be perfect, but it generates real income for the company I work for and serves real customers.
00:01:29.480 So, please don't judge. Secondly, no blaming is included. The events that I'm going to describe are not the fault of the developers; they are based on various factors. However, the team always did their best. The things I am going to talk about are heavily context-dependent, so there is no silver bullet. Before applying any of these practices, please consider your context.
00:01:42.890 The company I work for is called Quality Marcs. We operate an online marketing platform for our customers. Essentially, we send emails. It's a bit more elaborate than that; we provide B2C marketing automation software that allows for one-to-one communication with our customers. We send a lot of emails—approximately 7 billion every month. That's like sending an email to every single person on Earth! You are probably receiving emails from our system as well.
00:02:03.310 So, do you guys know which is the single busiest shopping day of the whole year? The title of my talk kind of gives it away, but it's Black Friday. In 2013, in the U.S., shoppers spent fifty-eight billion dollars in stores on just that day. And that's only the offline spending; people spent much more online.
00:02:17.070 Black Friday is the day after Thanksgiving and marks the official first day of the Christmas shopping season. That fifty-eight billion dollars is fifty times the budget of Budapest as a city in 2016. That's a significant amount of money, and for an email marketing company, Black Friday is a vital holiday.
00:02:34.860 Last year, I was working on a product called Smart Insight. Smart Insight is a tool that analyzes our customers' campaigns and their impact on their customers' purchases. For instance, if you receive an email from eBay offering a 50% discount, we can tell eBay how much revenue they would generate from that email. Based on these results, they can better filter and target user groups for their new campaigns.
00:02:48.210 Smart Insight is a Ruby on Rails application. Our technology stack isn't huge, but it is substantial, and it has been around for about four to five years within our company. A lot of people have worked on it, so it has a bit of legacy code and a mixture of technologies.
00:03:03.130 Let's get on to the problem. I believe almost every startup in Budapest, Europe, and even in the U.S. is struggling to hire engineers. The team that originally worked on Smart Insight had moved on to another project and had been working on it for the six months leading up to last year's Black Friday.
00:03:20.220 When they left, the system was functioning perfectly. However, during the six months that they were not working on it, the number of customers nearly doubled. We hadn't done any maintenance on the system, nor had we added new features. We got back to Smart Insight just before Black Friday 2016.
00:03:36.750 So here we were, with no maintenance done before Black Friday and the number of customers doubled. Unexpected consequences awaited us.
00:03:54.020 A little side note—who has seen the movie Dunkirk? Hands up! Did you enjoy it? I certainly did. Do you remember the timelines in the movie? If you haven't seen it, there were three timelines that depicted different periods: the first was the air, covering one hour; the second was the sea, covering a day; and the third was the mole, spanning about a week. My war story might be less important, but I like this time structure, so I'm going to follow it in my presentation. I'll discuss the problems and solutions across three time frames: the hotfix, which will cover a few days; the infrastructure, which will span a couple of weeks; and the code, which encompasses about one to two months.
00:04:28.000 In each section, I'll explain the problems we faced, the fixes we implemented, and the takeaway lessons from these experiences.
00:04:49.500 Let's start with the hotfix. On a beautiful November morning around 2:00 a.m., we were woken up by PagerDuty due to an unpleasant issue: the UI of Smart Insight was unreachable, and a large number of backend requests could not be completed.
00:05:08.000 At that time, the UI and backend requests were served by the same processes, and we had a five-minute request timeout. We had quite a few long-running backend requests. On that particular morning, these backend requests couldn't be completed in the allotted five minutes, causing the campaigns scheduled to be sent out to get stuck.
00:05:29.600 To give you an idea of the impact of Black Friday on our PagerDuty alerts: the weekly number of errors had been steady before week 44 of last year, but after Black Friday we were inundated with PagerDuty alerts leading up to Christmas.
00:05:47.880 This was when I joined the team—quite a lucky timing! There is a picture of me from that period that my team may fondly remember as we battled with our system during that time. We looked quite harried until the company Christmas party.
00:06:12.150 So, what do you do when you're awakened in the middle of the night, and requests are timing out while the UI is unreachable? My first thought was to get a therapy dog! This is actually the dog of one of my colleagues, who was in the office with us during those stressful nights and provided significant emotional support.
00:06:35.250 But let's get serious. What can be done in a situation like this? Shout out any ideas! More servers might sound good, but I’ll explain later why that didn't work for us.
00:06:54.000 So, besides adding servers, the only immediate action available to us was to increase the request timeout from five minutes to eight minutes in hopes of getting those campaigns sent out quickly. However, raising the timeout proved insufficient because we had another problem at the same time. Not only was the length of the requests a problem, but the sheer number of them was overwhelming.
00:07:26.620 We extended the timeout to give those long-running requests a chance to complete, but at the same time, all processes were still held up, creating a bottleneck. We started investigating where those requests were coming from.
00:07:48.380 We had this tool called Automation Center, which automated tasks effectively. One of our clients had created a program that sent us long-running requests around 2,000 times a day, which compounded the issue. Increasing the timeout alone would not resolve the situation.
00:08:07.140 We needed to take further action. Besides increasing the request timeout, we had to prevent the same requests coming from the Automation Center. So we implemented a unique constraint along with the request timeout, which turned out to be sufficient to ensure the campaigns got sent out and helped us survive Black Friday.
00:08:25.810 This humorous implementation took place during a late-night session with one of my colleagues at 9:00 p.m., as we anticipated being awakened during the night if we failed to resolve the alerts.
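The talk does not say how the unique constraint was implemented. As a minimal sketch of the idea only, the Python below rejects a request while an identical one is still running. The names `InFlightRegistry`, `handle_request`, and the request key are hypothetical, and an in-memory set stands in for whatever the team actually used (possibly a database constraint).

```python
import threading


class InFlightRegistry:
    """Tracks the keys of requests that are currently being processed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()

    def try_acquire(self, key):
        """Claim the key; return False if an identical request is running."""
        with self._lock:
            if key in self._in_flight:
                return False
            self._in_flight.add(key)
            return True

    def release(self, key):
        """Free the key so the same request may be submitted again."""
        with self._lock:
            self._in_flight.discard(key)


def handle_request(registry, key, work):
    """Run work() unless a request with the same key is already in flight."""
    if not registry.try_acquire(key):
        return ("rejected", None)
    try:
        return ("done", work())
    finally:
        # Release even when work() raises, so the key never gets stuck.
        registry.release(key)
```

A key could be, for example, a tuple of customer id and query parameters. A client program that fires the same long-running request repeatedly would then occupy at most one process at a time instead of all of them.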
00:08:42.920 These hotfixes essentially helped us survive Black Friday, but we knew that Christmas was still approaching, and we had to ensure these issues wouldn’t recur. One key takeaway from this experience is to know what is critical in your system. Identify those critical microservices within your platform.
00:09:02.700 It’s equally important to be aware of your Service Level Agreements (SLAs). If you don’t know the SLAs of the services you operate, inquire with your boss or tech lead. This knowledge is crucial because, when something goes wrong, you need to know how much time you have to act. Fortunately, we were still within our SLA with the increased timeout.
00:09:23.480 Also, have a plan in advance. Problems will happen, and you don’t want to devise a solution at 3 a.m. This is my second takeaway: understand what is critical within your system and prepare an action plan for critical situations.
00:09:41.670 At my previous company, we had playbooks for situations like these. It doesn't matter what you call them, but you need to know what actions to take when crisis strikes. Additionally, if you can avoid coding during the night, it’s wise to do so unless it’s absolutely necessary. Energized work is a core principle of Extreme Programming, and fatigue can hinder your performance.
00:10:03.320 Apply the urgent fix, and then resolve the underlying problem with your team during normal working hours.
00:10:26.170 This hotfix allowed us to survive Black Friday, but we still had concerns about handling the traffic expected during the Christmas season. The second part of our approach involved fixing our infrastructure.
00:10:46.050 If you noticed, we began with a five-minute request timeout, which is significantly higher than the industry standard of 30 seconds used by hosting providers like Heroku. We couldn't scale up the services because we were on an internal infrastructure, and we only had two servers behind our load balancer. Additionally, these servers were not managed by any configuration management.
00:11:10.350 As the Christmas season approached and we needed to set up new servers, we first had to work with our systems team to bring everything under proper management. This task took several weeks for our teams.
00:11:30.000 We tackled the infrastructure issues by creating configuration management scripts using Puppet, which allowed us to install new servers and deploy them into production. However, this was easier said than done; it involved blood, sweat, and tears as we collaborated with our systems team.
00:11:53.052 There were differing responsibilities for various processes, and we needed to agree upon several aspects before we could proceed with the Puppet setup. One of the major lessons learned from this experience is to consider moving to the cloud, if possible.
00:12:24.310 At the time, we were unable to do so, which was a significant pain point for our team. Although cloud infrastructure is not a one-size-fits-all solution, it can benefit most general web applications. Avoid operating your own servers unless absolutely necessary.
00:12:42.290 Another helpful guideline is to explore the Twelve-Factor App, a set of principles established by Heroku that outlines twelve factors for building scalable, robust, and reliable web applications.
00:13:01.820 Did my team comply with all twelve factors? Almost. There was one factor related to configuration management scripts where we fell short.
00:13:16.930 The second factor pertains to dependencies, which means declaring not only the dependencies your application needs but also ensuring your data is running on the right server. Failing to do so could lead to a situation where a few engineers who know the ins and outs become single points of failure.
00:13:43.540 For example, let's say you have a team of four DevOps engineers all knowledgeable about server restarts and configurations. If one of them quits, another transfers, or someone gets ill, you risk being left with only one engineer who knows how to restart a service. This can lead to significant setbacks.
00:14:03.870 Consider implementing the Twelve-Factor App principles and transitioning to cloud solutions. Explore how to assess your services from the Twelve-Factor perspective; it can provide useful insights.
00:14:18.540 Now that we have implemented our hotfixes and scaled our infrastructure, we needed to ensure that these problems would not return. Even with the ability to scale our services, it can become expensive.
00:14:36.920 Thus, optimizing and making your code more scalable is equally important.
00:14:54.220 As I mentioned earlier, our UI and backend processes were served by the same web server, resulting in no differentiation between UI requests and long-running backend queries. Therefore, we decided to completely restructure our setup.
00:15:07.460 We aimed to ensure that UI performance remained unaffected by long-running backend tasks and to lessen our worry regarding those long tasks as well. This meant deploying backend workers to handle long-running queries while allowing web processes to manage web requests.
00:15:25.950 This restructuring was considerable work, as we aimed to maintain API integrity and avoid breaking anything for our customers during the Christmas season.
00:15:48.000 This large change could not simply be deployed hoping for the best; we needed to ensure everything was functioning correctly.
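The split between web processes and backend workers described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's Rails implementation: a thread and an in-process `queue.Queue` stand in for separate worker processes and a real job queue, and `submit`, `worker_loop`, and `status` are hypothetical names.

```python
import queue
import threading
import uuid

jobs = queue.Queue()             # stand-in for a real job queue
results = {}                     # stand-in for a shared result store
results_lock = threading.Lock()


def submit(task, *args):
    """Web side: enqueue the slow task and return a job id immediately."""
    job_id = str(uuid.uuid4())
    with results_lock:
        results[job_id] = {"status": "queued", "value": None}
    jobs.put((job_id, task, args))
    return job_id


def worker_loop():
    """Worker side: run slow tasks so the web process never waits on them."""
    while True:
        item = jobs.get()
        if item is None:         # sentinel: shut the worker down
            jobs.task_done()
            break
        job_id, task, args = item
        try:
            outcome = {"status": "done", "value": task(*args)}
        except Exception as exc:
            outcome = {"status": "failed", "value": repr(exc)}
        with results_lock:
            results[job_id] = outcome
        jobs.task_done()


def status(job_id):
    """Web side: cheap lookup that clients can poll for the result."""
    with results_lock:
        return dict(results[job_id])
```

With this shape, a UI request only ever pays for `submit` or `status`, both of which return quickly, so a slow query can no longer make the UI unreachable.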
00:16:04.640 One recommended strategy for these critical changes is known as dark launches. Dark launches were first introduced by Facebook. Essentially, it allows you to deploy your code or new features into production without releasing them to customers.
00:16:25.820 You run the new features in production behind feature switches, allowing you to monitor their performance without exposing them to customers immediately. This approach proved effective for us.
00:16:45.360 We ran both our old synchronous method and the new asynchronous background worker process in parallel for a few weeks, enabling us to install monitoring systems to compare the two.
00:17:05.780 This allowed us to catch bugs and fix them quickly before releasing the new code to customers.
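A dark launch of this kind can be sketched as a wrapper that always serves the old result and, behind a feature switch, also runs the new code path and records any difference. The sketch below is an assumption-laden illustration: `dark_launch`, `enabled`, and `mismatches` are hypothetical names, and the new path runs inline here, whereas the team ran a separate asynchronous worker process and compared the two through monitoring.

```python
import logging

logger = logging.getLogger("dark_launch")


def dark_launch(old_impl, new_impl, enabled, mismatches):
    """Wrap old_impl so new_impl runs in its shadow while enabled() is true.

    Callers always receive the old result. Differences and failures of the
    new path are only appended to `mismatches` for later inspection.
    """
    def wrapper(*args, **kwargs):
        expected = old_impl(*args, **kwargs)
        if enabled():
            try:
                actual = new_impl(*args, **kwargs)
                if actual != expected:
                    mismatches.append((args, kwargs, expected, actual))
            except Exception as exc:
                # The new path must never affect what customers see.
                logger.warning("new path failed: %r", exc)
                mismatches.append((args, kwargs, expected, exc))
        return expected
    return wrapper
```

Once the recorded mismatches stay empty for long enough, the switch can be changed to serve the new path and the old one removed.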
00:17:26.020 In summary, this was how we survived last year's Black Friday. Here, you can see a little thank-you gift from our headquarters in Vienna.
00:17:42.190 Last year was our most successful Black Friday despite the minor hiccups we encountered. Preparing for Black Friday 2017 is essential, and we hope the fixes we implemented will hold up this time as well.
00:18:03.950 So, my key takeaway for you from this presentation is to embrace your practices. Don't abandon them during a crisis; they may be your lifeline during tough times.
00:18:17.380 Thank you very much!