00:00:07.900
Hello!
00:00:13.900
Thank you for coming and thank you for being here.
00:00:17.380
Are you people excited about fall? Do you like colorful leaves, scarves, and family gatherings? What about a pumpkin spice latte? Personally, I love it. Do you know what else autumn brings? Black Friday! So today, I'm going to share a story about some experiences with the team I was working with last year.
00:00:28.700
Maybe you can learn something from our story, so let's get started. Here you can see a bunch of buzzwords: TDD, XP, DevOps, clean code. These are words that always come up in discussions at conferences like this. While I enjoy talking about these concepts in theory, I often find that when they are only discussed theoretically, they are hard to grasp, and the discussions can be unfruitful. Today, I prepared something where you can see the application of some of these principles through practice and how we implement them at the company I work for.
00:00:52.300
When things go south, I think people tend to let go of the best practices they know and revert to old, familiar methods that seem less risky. However, I believe these are precisely the times when you must hold on to the best practices that everyone talks about. They can help you navigate through difficult times.
00:01:14.900
Perhaps you can learn something from our story, and maybe share a laugh or two. Now, I can laugh at it a year later! First of all, before we get started, a disclaimer: I am going to talk about a production system. It may not be perfect, but it generates real income for the company I work for and serves real customers.
00:01:29.480
So, please don't judge. Secondly, no blaming is included. The events that I'm going to describe are not the fault of the developers; they are based on various factors. However, the team always did their best. The things I am going to talk about are heavily context-dependent, so there is no silver bullet. Before applying any of these practices, please consider your context.
00:01:42.890
The company I work for is called Quality Marcs. We operate an online marketing platform for our customers. Essentially, we send emails. It's a bit more elaborate than that: we provide B2C marketing automation software that lets our clients communicate one-to-one with their customers. We send a lot of emails, approximately 7 billion every month. That's like sending an email to every single person on Earth! You are probably receiving emails from our system as well.
00:02:03.310
So, do you guys know which is the single busiest shopping day of the whole year? The title of my talk kind of gives it away, but it's Black Friday. In 2013, in the U.S., shoppers spent fifty-eight billion dollars in stores on just that day. And that's only the offline spending; people spent much more online.
00:02:17.070
Black Friday is the day after Thanksgiving and marks the official first day of the Christmas shopping season. That fifty-eight billion dollars is roughly fifty times Budapest's city budget for 2016. It's a significant amount of money, and for an email marketing company, Black Friday is a vital holiday.
00:02:34.860
Last year, I was working on a product called Smart Insight. Smart Insight is a tool that analyzes our customers' campaigns and their impact on their customers' purchases. For instance, if you receive an email from eBay offering a 50% discount, we can tell eBay how much revenue that email generated. Based on these results, they can better filter and target user groups for their new campaigns.
00:02:48.210
Smart Insight is a Ruby on Rails application. Our technology stack isn't huge, but it is substantial, and it has been around for about four to five years within our company. A lot of people have worked on it, so it has a bit of legacy code and a mixture of technologies.
00:03:03.130
Let's get on to the problem. I believe almost every startup in Budapest, Europe, and even in the U.S. is struggling to hire engineers. Originally, the team working on Smart Insight moved on to another project before Black Friday. They had worked on something else for the previous six months leading up to last year's Black Friday.
00:03:20.220
When they left, the system was functioning perfectly. However, during the six months that they were not working on it, the number of customers nearly doubled. We hadn't done any maintenance on the system, nor had we added new features. We got back to Smart Insight just before Black Friday 2016.
00:03:36.750
So here we were: no maintenance done before Black Friday, the number of customers doubled, and unexpected consequences awaiting us.
00:03:54.020
A little side note—who has seen the movie Dunkirk? Hands up! Did you enjoy it? I certainly did. Do you remember the timelines in the movie? If you haven't seen it, there were three timelines that depicted different periods: the first was the air, covering one hour; the second was the sea, covering a day; and the third was the mole, spanning about a week. My war story might be less important, but I like this time structure, so I'm going to follow it in my presentation. I'll discuss the problems and solutions across three time frames: the hotfix, which will cover a few days; the infrastructure, which will span a couple of weeks; and the code, which encompasses about one to two months.
00:04:28.000
In each section, I'll explain the problems we faced, the fixes we implemented, and the takeaway lessons from these experiences.
00:04:49.500
Let's start with the hotfix. On a beautiful November morning around 2:00 a.m., we were woken up by PagerDuty due to an unpleasant issue: the UI of Smart Insight was unreachable, and a large number of backend requests could not be completed.
00:05:08.000
At that time, UI and backend requests were served by the same processes, and we had a five-minute request timeout. We already had quite a few long-running backend requests. On that particular morning, these backend requests couldn't be completed in the allotted five minutes, causing the campaigns scheduled to be sent out to get stuck.
00:05:29.600
To give you an idea of the impact of Black Friday on our on-call load: prior to week 44 of last year, PagerDuty alerts were rare and consistent, but after Black Friday, we were inundated with them all the way up to Christmas.
00:05:47.880
This was when I joined the team, with quite lucky timing! There is a picture of me from that period that my team may fondly remember, as we battled with our system during that time. We looked quite harried until the company Christmas party.
00:06:12.150
So, what do you do when you're awakened in the middle of the night, and requests are timing out while the UI is unreachable? My first thought was to get a therapy dog! This is actually the dog of one of my colleagues, who was in the office with us during those stressful nights and provided significant emotional support.
00:06:35.250
But let's get serious. What can be done in a situation like this? Shout out any ideas! More servers might sound good, but I’ll explain later why that didn't work for us.
00:06:54.000
So, besides adding servers, the only immediate action available to us was to increase the request timeout from five minutes to eight minutes in hopes of getting those campaigns sent out quickly. However, raising the timeout proved insufficient, as we had another problem simultaneously: not only were the requests long, but their sheer number was overwhelming.
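The shape of that first hotfix, putting a hard deadline on request handling, can be sketched in plain Ruby. In production we changed the web server's timeout configuration rather than writing code like this; `handle_request` and the tiny deadline below are purely illustrative.

```ruby
require "timeout"

# Illustrative hard deadline on a request. Our real fix raised the web
# server's request timeout from five to eight minutes; the tiny value
# here just keeps the example fast.
REQUEST_TIMEOUT = 0.05 # seconds

def handle_request(&work)
  Timeout.timeout(REQUEST_TIMEOUT, &work)
rescue Timeout::Error
  :timed_out # the request exceeded its deadline
end

fast = handle_request { :done }     # finishes within the deadline
slow = handle_request { sleep 0.2 } # blows past the deadline
```

Raising a timeout is a blunt instrument: it buys time for legitimate slow requests, but as we found, it does nothing about request volume.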
00:07:26.620
We extended the timeout to give those long-running requests a chance to complete, but at the same time, all processes were still held up, creating a bottleneck. We started investigating where those requests were coming from.
00:07:48.380
We had a tool called Automation Center, which lets our clients automate their campaign tasks. One of our clients had created a program that sent us long-running requests around 2,000 times a day, which compounded the issue. Increasing the timeout alone would not resolve the situation.
00:08:07.140
We needed to take further action. Besides increasing the request timeout, we had to prevent duplicate requests coming from the Automation Center. So we implemented a unique constraint alongside the increased timeout, which turned out to be sufficient to ensure the campaigns got sent out and helped us survive Black Friday.
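The second half of the fix, rejecting duplicate requests, can be sketched as follows. In production this was a unique constraint in the database; here an in-memory `Set` stands in for it, and the class and method names are illustrative.

```ruby
require "set"
require "digest"

# Stand-in for the database unique constraint: a request is identified
# by a digest of its customer and parameters, and only the first
# occurrence is allowed through.
class RequestRegistry
  def initialize
    @seen = Set.new
  end

  # true if this request is new and may proceed,
  # false if an identical one was already accepted.
  def accept?(customer_id, params)
    key = Digest::SHA256.hexdigest("#{customer_id}:#{params.sort.inspect}")
    @seen.add?(key) ? true : false
  end
end

registry = RequestRegistry.new
first_try  = registry.accept?(42, { report: "revenue" })
second_try = registry.accept?(42, { report: "revenue" }) # duplicate
```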
00:08:25.810
We implemented this fix during a late-night session with one of my colleagues at 9:00 p.m., since we anticipated being woken up during the night if we failed to resolve the alerts.
00:08:42.920
These hotfixes essentially helped us survive Black Friday, but we knew that Christmas was still approaching, and we had to ensure these issues wouldn’t recur. One key takeaway from this experience is to know what is critical in your system. Identify those critical microservices within your platform.
00:09:02.700
It’s equally important to be aware of your Service Level Agreements (SLAs). If you don’t know the SLAs of the services you operate, inquire with your boss or tech lead. This knowledge is crucial because, when something goes wrong, you need to know how much time you have to act. Fortunately, we were still within our SLA with the increased timeout.
00:09:23.480
Also, have a proactive plan. If problems occur, they will happen, and you don’t want to devise a solution at 3 a.m. This is my second takeaway: understand what is critical within your system and prepare an action plan for critical situations.
00:09:41.670
At my previous company, we had playbooks for situations like these. It doesn't matter what you call them, but you need to know what actions to take when crisis strikes. Additionally, avoid coding during the night unless it's absolutely necessary. Energized Work is a core principle of Extreme Programming, and fatigue can hinder your performance.
00:10:03.320
Do only what is strictly urgent at night, and resolve the rest with your team during normal working hours.
00:10:26.170
This hotfix allowed us to survive Black Friday, but we still had concerns about handling the traffic expected during the Christmas season. The second part of our approach involved fixing our infrastructure.
00:10:46.050
As you may have noticed, we began with a five-minute request timeout, significantly higher than the 30-second industry standard used by hosting providers like Heroku. We couldn't scale up the services because we were on internal infrastructure with only two servers behind our load balancer, and those servers were not under any configuration management.
00:11:10.350
As the Christmas season approached and we needed to set up new servers, we first had to work with our systems team to bring everything under proper management. This task took several weeks for our teams.
00:11:30.000
We tackled the infrastructure issues by creating configuration management scripts using Puppet, which allowed us to install new servers and deploy them into production. However, this was easier said than done; it involved blood, sweat, and tears as we collaborated with our systems team.
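To give a flavor of those configuration management scripts, here is a hypothetical Puppet manifest sketch. The module name, packages, and paths are invented for illustration; they are not our actual setup.

```puppet
# Hypothetical manifest for an application server. Everything a server
# needs is declared here, so a fresh machine can be provisioned
# reproducibly instead of by hand.
class smart_insight::app_server {
  package { ['ruby', 'nginx']:
    ensure => installed,
  }

  file { '/etc/smart_insight/app.conf':
    ensure => file,
    source => 'puppet:///modules/smart_insight/app.conf',
  }

  service { 'nginx':
    ensure    => running,
    enable    => true,
    require   => Package['nginx'],
    subscribe => File['/etc/smart_insight/app.conf'],
  }
}
```

Once servers are described this way, adding a new one to the pool is a matter of applying the manifest rather than repeating undocumented manual steps.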
00:11:53.052
There were differing responsibilities for various processes, and we needed to agree upon several aspects before we could proceed with the Puppet setup. One of the major lessons learned from this experience is to consider moving to the cloud, if possible.
00:12:24.310
At the time, we were unable to do so, which was a significant pain point for our team. Although cloud infrastructure is not a one-size-fits-all solution, it can benefit most general web applications. Avoid operating your own servers unless absolutely necessary.
00:12:42.290
Another helpful guideline is to explore the Twelve-Factor App, a set of principles established by Heroku that outlines twelve factors for building scalable, robust, and reliable web applications.
00:13:01.820
Did my team comply with all twelve factors? Almost. There was one factor related to configuration management scripts where we fell short.
00:13:16.930
The second factor pertains to dependencies: explicitly declare and isolate every dependency your application needs, rather than relying on packages that happen to be installed on a particular server. Failing to do so can leave your setup understood only by a few engineers who know the ins and outs, making them single points of failure.
00:13:43.540
For example, let's say you have a team of four DevOps engineers all knowledgeable about server restarts and configurations. If one of them quits, another transfers, or someone gets ill, you risk being left with only one engineer who knows how to restart a service. This can lead to significant setbacks.
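In a Ruby project, factor II largely comes down to a complete `Gemfile`: every library the application needs is declared and versioned there instead of living implicitly on a server. The gems below are illustrative, not our actual dependency list.

```ruby
# Gemfile: explicit, versioned dependencies (twelve-factor factor II).
source "https://rubygems.org"

gem "rails", "~> 5.1"   # web framework
gem "pg", "~> 0.21"     # database driver
gem "puma", "~> 3.10"   # web server
```

`bundle install` then resolves these into `Gemfile.lock`, so every server and every engineer runs the same dependency set, and provisioning a new machine no longer depends on anyone's memory.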
00:14:03.870
Consider implementing the Twelve-Factor App principles and transitioning to cloud solutions. Explore how to assess your services from the Twelve-Factor perspective; it can provide useful insights.
00:14:18.540
Now that we had implemented our hotfixes and scaled our infrastructure, we needed to ensure that these problems would not return. Even with the ability to scale our services, scaling out can become expensive.
00:14:36.920
Thus, optimizing and making your code more scalable is equally important.
00:14:54.220
As I mentioned earlier, our UI and backend processes were served by the same web server, resulting in no differentiation between UI requests and long-running backend queries. Therefore, we decided to completely restructure our setup.
00:15:07.460
We aimed to ensure that UI performance remained unaffected by long-running backend tasks and to lessen our worry regarding those long tasks as well. This meant deploying backend workers to handle long-running queries while allowing web processes to manage web requests.
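The split can be illustrated with a minimal in-process sketch: a web request only enqueues work and returns, while a worker (a thread here; separate worker processes in our real setup) drains the queue. All names are illustrative.

```ruby
# Web processes enqueue long-running work; a dedicated worker executes
# it. Thread and Queue stand in for real worker processes and a real
# job queue.
JOB_QUEUE = Queue.new
RESULTS   = Queue.new

worker = Thread.new do
  while (job = JOB_QUEUE.pop) # pop returns nil once the queue is closed
    RESULTS << "report for customer #{job[:customer_id]}"
  end
end

# What used to run inline for up to five minutes is now a quick enqueue.
def enqueue_report(customer_id)
  JOB_QUEUE << { customer_id: customer_id }
  :accepted
end

status = enqueue_report(7)
result = RESULTS.pop # block until the worker has finished the job
JOB_QUEUE.close      # let the worker thread exit cleanly
worker.join
```

The key property is that the web process's work is now constant-time regardless of how long the report takes, so slow backend queries can no longer starve UI requests.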
00:15:25.950
This restructuring was considerable work, as we aimed to maintain API integrity and avoid breaking anything for our customers during the Christmas season.
00:15:48.000
This large change could not simply be deployed hoping for the best; we needed to ensure everything was functioning correctly.
00:16:04.640
One recommended strategy for critical changes like this is known as a dark launch, a practice popularized by Facebook. Essentially, it allows you to deploy your code or new features into production without releasing them to customers.
00:16:25.820
You run the new features in production behind feature switches, allowing you to monitor their performance without exposing them to customers immediately. This approach proved effective for us.
00:16:45.360
We ran both our old synchronous method and the new asynchronous background worker process in parallel for a few weeks, enabling us to install monitoring systems to compare the two.
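That parallel run can be sketched with a simple feature switch: both the old and new paths execute, customers always receive the old result, and any disagreement is recorded for investigation. The class and labels below are hypothetical, not our actual code.

```ruby
# Dark-launch comparison harness: run old and new side by side, trust
# only the old result, record mismatches, and swallow new-path crashes.
class DarkLaunch
  attr_reader :mismatches

  def initialize(new_path_enabled:)
    @new_path_enabled = new_path_enabled
    @mismatches = []
  end

  def run(label, old:, new:)
    old_result = old.call
    if @new_path_enabled
      new_result = begin
        new.call
      rescue StandardError => e
        e # a crash in the new path must never reach customers
      end
      @mismatches << label unless new_result == old_result
    end
    old_result # customers always get the old, trusted result
  end
end

launch = DarkLaunch.new(new_path_enabled: true)
served = launch.run("revenue-report",
                    old: -> { 100 },
                    new: -> { 101 }) # new path disagrees by one
```

Flipping `new_path_enabled` off turns the harness back into the plain old code path, which is what makes this safe to run in production for weeks.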
00:17:05.780
This allowed us to catch bugs and fix them quickly before releasing the new code path to customers.
00:17:26.020
In summary, this was how we survived last year's Black Friday. Here, you can see a little thank-you gift from our headquarters in Vienna.
00:17:42.190
Last year was our most successful Black Friday despite the hiccups we encountered. Now we are bracing for the 2017 Black Friday, and we hope the fixes we implemented will hold up this time around.
00:18:03.950
So, my key takeaway for you from this presentation is to embrace your practices. Don't abandon them during a crisis; they may be your lifeline during tough times.
00:18:17.380
Thank you very much!