Monica Giambitto

Too Much of a Good Thing

Monica is an Italian expat living in Munich, Germany. She is an engineering manager, and in this talk she walks us through what looks like a good thing for a business – lots of usage and incoming traffic – and the tricky things that come with it, like actually handling all of that traffic.

Balkan Ruby 2024

00:00:09.240 Usually, there are about five minutes between talks, but today is going to be longer. A little bit about me: I put a bunch of stuff together because I realized I didn't have an introduction prepared. If you're curious, here are some platforms where you can find me, usually under this handle. Sometimes it's slightly different, but I'm mostly there. I don't post a lot these days for various reasons. I mean, once Twitter was gone, it was like, 'Hey, what can you do?'
00:00:22.960 You can find more information about me on my personal website, which is essentially like my visiting card. You can check it out to see what I'm currently doing. The fastest way to reach me is actually by writing an email, so feel free to do that. And if you're interested in some of my ramblings, my friend and I have a podcast. We haven't published anything lately because I'm currently in between jobs, but I plan to start working again very soon, so there will likely be new episodes coming out. We've got almost two years of episodes, so you won’t get bored.
00:01:04.920 Today, as introduced, we're going to talk about what happens when success becomes unexpected. Sometimes we think we are prepared for success, but in fact, we are not. The story I’m going to share with you today is about an experience I had six months into my first job as an engineering manager. I had never experienced anything like that before.
00:01:35.040 To be successful, you often have to be in the right place at the right time, and I think I was in that right place at the right time. At that time, I was working for an app called Freeletics, a fitness application that became quite successful in Germany. Some of you might know it. We had various features, including a nutrition application, enabling users to get healthy not only through training but also through the right nutrition.
00:02:07.159 We created personalized training plans using what would now be called AI, but we referred to it as machine learning back then. We also incorporated user feedback about the training: whether it was too hard, too easy, or whether they wanted to change workout days. This AI personalization was crucial; it was not just a generic fitness app. We also introduced mindset coaching, which was a sort of demo for a new feature that was intended to align with our personalized AI approach.
00:02:45.120 It was quite successful. In a typical month, we received around 10,000 new registrations per day and approximately 20,000 training users daily. At that time, the company had about 48 million registered users, with approximately 600,000 being active users. While it may not be Netflix, it certainly was significant.
00:03:08.400 A peak month saw slightly different metrics, with registrations doubling and training sessions significantly increasing. We were well-prepared for that traffic surge because we understood our users' patterns. For example, the peak month for us was January, the month when most people decide to join the gym.
00:03:43.760 The advantage of our app was that you could work out anywhere, anytime. You don't need a lot of equipment; it was easily accessible. This meant that in January, we would generate as much revenue as we did for the rest of the year combined. It was vital to our business.
00:04:07.680 Consequently, our DevOps team was aware they needed to enhance the infrastructure to handle the increased traffic. Our infrastructure was quite standard for a company of our size: clients hit the load balancer, and behind it Kubernetes ran the various servers for our different applications.
00:04:30.240 We had database read replicas across different regions for redundancy and caching systems to optimize performance. When I joined the company six years ago, we were using a Rails monolith. However, we realized that if we wanted to add more users and features to our application, we needed to branch out from that structure. We gravitated towards a more service-oriented architecture since microservices would have been too much overhead for our relatively small team.
00:05:19.920 We aimed to break the application down into various services representing different domains. We created separate services for our training system, authentication system, and payment system, which, despite its name, primarily functioned as an authorization system. This architecture allowed us to check user payments and subscription statuses efficiently.
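To make this concrete, here is a minimal sketch of what a synchronous check from one service against the payment (authorization) service can look like; this is not the actual Freeletics code, and the class name, URL, and path are made up:

```ruby
# Hypothetical sketch of a synchronous inter-service call: the training
# service asks the payment (authorization) service whether a user still
# has an active subscription before serving paid content.
require "net/http"
require "json"
require "uri"

class SubscriptionCheck
  # Made-up base URL; in a real setup this points at the internal payment service.
  PAYMENT_SERVICE_URL = ENV.fetch("PAYMENT_SERVICE_URL", "http://payment-service")

  def self.active?(user_id)
    uri = URI("#{PAYMENT_SERVICE_URL}/api/v1/subscriptions/#{user_id}")
    response = Net::HTTP.get_response(uri)
    return false unless response.is_a?(Net::HTTPSuccess)

    JSON.parse(response.body).fetch("active", false)
  end
end
```

Every request to a paid feature depends on a call like this succeeding, which is exactly why the synchronous approach becomes a liability later in the story.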
00:05:57.680 The nutrition component was its own app due to the initial Rails monolith. This made sense at the time, but later on, we transitioned to service reuse to minimize overhead. Mindset coaching started as a demo for an MVP, allowing us to encapsulate it early on to avoid duplicating work later.
00:06:25.960 We had systems in place for marketing, advertising, and communication. However, the infamous GDPR law hit us, requiring us to build a service around that for compliance. Initially, we used synchronous calls for most operations because we were able to manage the demand at that stage.
00:07:10.520 As the demand increased, particularly when customers began creating subscriptions and making payments, we used tools like Sidekiq to retry failed attempts. Eventually, we recognized the need for a message bus since the synchronous calls had created complexity that the operations team warned us about.
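For context, a retried job in Sidekiq looks roughly like this; the worker and the PaymentService client are hypothetical, but the `retry` option and the exponential backoff between attempts are standard Sidekiq behaviour:

```ruby
require "sidekiq"

# Hypothetical worker: if the downstream payment call fails with a transient
# error, re-raising lets Sidekiq re-enqueue the job with exponential backoff
# instead of the caller blocking on a synchronous retry loop.
class CreateSubscriptionWorker
  include Sidekiq::Worker
  sidekiq_options queue: :payments, retry: 10

  def perform(user_id, plan_id)
    PaymentService.create_subscription(user_id: user_id, plan_id: plan_id)
  rescue PaymentService::TemporaryError
    raise # Sidekiq catches this and schedules the next retry
  end
end
```

The flip side, which comes up later, is that when a downstream service stays unhealthy, those retries pile up into a backlog.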
00:08:02.200 As I mentioned, the right timing is crucial for success, and then, unexpectedly, COVID-19 hit. Initially, the impact was negligible as people continued their daily routines, but things escalated quickly after the first COVID-related fatality, and the German government announced a nationwide lockdown.
00:08:44.639 This is when we saw a significant spike in activity, particularly in registrations. On the day the lockdown was announced, we experienced a huge surge. With fitness studios closed, people turned to our app to maintain their workouts, generating a lot of revenue.
00:09:07.640 Our registrations skyrocketed to 50,000, much higher than anything we had ever encountered before. While a regular month had a certain baseline of traffic, this spike effectively doubled our training users.
00:09:45.360 Naturally, this spike in activity also led to an increase in requests per minute. Our operations team began noticing patterns tied to specific times of day when usage surged, particularly around 6 PM, which was when most people would exercise after work.
00:10:31.920 This increase happened primarily over weekends, and of course, alarms started going off for our operations team as they responded to the heavier load. A series of errors closely correlated with the rising number of requests began appearing.
00:11:12.960 The first major challenges we faced were with our DNS setup and with the synchronous calls that normally handled our operations just fine but began to falter under such high demand. Cascading failures set in: one service would start to fail, causing a ripple effect across all the applications.
00:11:58.599 For instance, during routine operations, our authentication system would validate users, check payment statuses, and make other necessary calls. But if one service went down, it would cause failures in several others, creating a backlog, especially in our Sidekiq worker processes.
00:12:42.720 As a countermeasure, we recognized that improving our autoscaler was urgent. We were using Kubernetes for our infrastructure and could spin up extra pods, but the autoscaler was simply not adequately configured for that sudden traffic increase.
00:13:24.880 While increasing the autoscaler limits worked temporarily, the relentless flow of incoming traffic kept overwhelming our operational infrastructure, leading not only to slow response times but also to a growing backlog of requests.
00:14:03.840 Following that, we implemented caching in front of various services to ease the burden of constant traffic, giving up real-time data freshness where that was acceptable. This seemed to work initially, but soon, demand on Mondays shot up even higher. It was like clockwork; Monday would bring systemic strain.
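The trade-off is serving slightly stale data instead of making a live call on every request. As a minimal sketch in Rails (the cache key and the ten-minute TTL are made up, and SubscriptionCheck is the hypothetical client from the earlier sketch):

```ruby
# Hypothetical sketch: answer from the cache when possible and only hit the
# payment service on a cache miss, accepting up to ten minutes of staleness.
def subscription_active?(user_id)
  Rails.cache.fetch("subscription_status/#{user_id}", expires_in: 10.minutes) do
    SubscriptionCheck.active?(user_id)
  end
end
```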
00:14:57.319 The operations team quickly identified a problem with our DNS service, which at the time did not support caching for Kubernetes service names. As requests increased, we also began exceeding our AWS Route 53 quotas, compounding the issues with our internal calls, which were now starting to fail.
00:15:42.720 Because our code often relied on the easy-to-remember short DNS names, every lookup required additional resolution attempts, and we saw a marked rise in internal DNS traffic. This increased traffic led to an unexpected reduction in our ability to call AWS services.
00:16:40.960 Consequently, payment processing became hobbled, and our marketing services faced interruptions, leading to further frustration, especially considering that our Sidekiq retries only added to the chaos.
00:17:23.440 In response, we had to change our approach to caching. Initially, we had a policy of limiting caching because of the complexity it can add when tracking down issues. However, as the situation escalated, our operations team swiftly switched to CoreDNS to optimize our DNS queries.
00:18:04.760 We also updated our internal API clients to resolve the longer, fully qualified names, removing the extra lookups caused by the short DNS names and minimizing the number of queries going out. This led to a significant decrease in DNS traffic volume.
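On the application side, that kind of fix can be as small as configuring clients with the fully qualified cluster-local name, so the resolver does not walk the whole Kubernetes search-domain list for every lookup. A sketch, with a made-up service and namespace:

```ruby
# Before: a short name like "payment-service" makes the resolver try every
# entry in the Kubernetes search-domain list (the ndots problem), multiplying
# the number of DNS queries behind a single call.
# PAYMENT_SERVICE_URL = "http://payment-service"

# After: the fully qualified, dot-terminated name resolves in a single query.
# "default" stands in for whichever namespace the service actually lives in.
PAYMENT_SERVICE_URL = "http://payment-service.default.svc.cluster.local."
```

Combined with CoreDNS caching the answers inside the cluster, far fewer queries ever need to reach Route 53.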
00:18:43.360 Next, we started addressing circular dependencies, which had emerged across various service domains due to our service-oriented architecture. That architecture had also tended to make teams more isolated, creating communication gaps and complicating processes.
00:19:20.640 The issues stemming from our circular dependencies remained hidden during previous periods of increased usage. However, they became apparent as we began integrating with external systems, like the CMS used for our marketing services, leading to even more complexities.
00:20:07.440 We discovered that many of our services were interdependent, with several of them requiring calls to one another to function properly. This situation forced us to focus on which services truly required immediate improvements while we began breaking those circular dependencies.
00:20:53.919 Identifying and isolating these interactions helped alleviate the load on our system. We managed to keep the situation under control by preserving the necessary interactions while removing any non-essential calls.
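The general shape of that change, with made-up controller, worker, and model names, is to keep only the call the response actually depends on in the synchronous path and push everything else onto a background job (Sidekiq, which we already had) or drop it entirely:

```ruby
# Hypothetical controller: the subscription check is essential, so it stays
# synchronous; the marketing/analytics notification is not needed to serve
# the workout, so it leaves the request path and runs as a background job.
class WorkoutsController < ApplicationController
  def show
    return head :payment_required unless SubscriptionCheck.active?(current_user.id)

    # Non-essential: enqueue instead of calling another service inline.
    TrackWorkoutViewWorker.perform_async(current_user.id, params[:id])

    render json: Workout.find(params[:id])
  end
end
```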
00:21:41.360 The engineering managers approached the situation with a sense of urgency, as did our VP of Engineering, to ensure we had a plan moving forward after the lockdown. We recognized the importance of retaining performance and uptime to preserve user satisfaction.
00:22:36.679 Just a couple of weeks later, we began to notice stability in the application and acknowledged that teams had not only stayed effective but also stayed engaged during this rushed transition to working from home.
00:23:24.960 We organized a virtual war room for team members to come together during peak usage hours. We expected 6 PM to be intensive, so having open communication channels and shared moments of collaboration were pivotal.
00:24:10.720 We adopted tools like Slack for constant communication to ensure cross-departmental visibility during these critical moments. This allowed everyone, especially marketing, to maintain their usual campaigns without interruption during our peaks.
00:24:56.240 Our overall priority was maintaining user experience for our paying customers. Ensuring good practices were used allowed us to shift back to normal operations without anyone in the company being too alarmed.
00:25:49.840 Simultaneously, we advocated for an improved infrastructure to counteract technical debt, thus presenting a case for cleaning up our codebase and scaling effectively for the long run.
00:26:39.680 It was important that our engineering department had its own OKRs, primarily focused on technical improvements to be initiated after the immediate crisis passed. This initiative allowed us to prepare well for the continuation of our successful campaigns.
00:27:23.760 We strategized ways to frame technical improvements not as a result of a crisis, but as proactive measures in anticipation of future success. This created more opportunities for cross-team collaboration and streamlined our workflow.
00:28:13.440 Ultimately, framing our problems and solutions in a business context made them easier to communicate and implement, resulting in a smoother path toward improving our system as a whole.
00:28:51.680 In hindsight, understanding our needs and mapping them to business metrics is what drove our progress in optimizing time and resources.