It's Down! Simulating Incidents in Production

00:00:00.030 All right, our next speaker has come all the way from San Francisco, where she arrived about 7 a.m. this morning on an overnight flight. She has not slept, and yet she's caffeinated and ready to go. Her name is Kelsey Pedersen.

00:00:06.960 Kelsey majored in economics and initially wanted to work in sales. However, she started finding ways to automate small tasks in her job, like batch sending emails. This led her to shadow some engineers at work, where she realized that development is a cool, creative, and interesting challenge, which are not words people generally use to describe sales.

00:00:15.389 She taught herself to code, attended a dev bootcamp, and got a job at a startup almost immediately because of her skills.

00:00:20.609 Now, Kelsey works at Stitch Fix, an online styling and shopping website, and she's hoping to bring the service to Australia. When she joined, there were about 60 engineers, and now there are around 180, showcasing the company's amazing growth. She primarily works with Ruby on Rails and React.

00:01:18.170 Hi everyone. Caitlyn, thank you so much for that introduction! When we spoke the other night, we talked for quite a long time and formed a great bond, so I was very impressed by your ability to summarize our conversation.

00:01:23.490 So, like Caitlyn said, I arrived at 7 a.m. this morning on a direct flight from SFO, so I'm highly caffeinated and excited to get started!

00:01:30.119 Today, we're going to discuss simulating incidents in production. I want to start with a story: it was 3 a.m., and I had just begun working at Stitch Fix a few months earlier. It was one of my first on-call rotations, and of course, I got paged.

00:01:38.610 I rolled out of bed, somewhat in a fog, opened up my computer, and noticed that errors were occurring in our application—it wouldn’t load. We needed to fix the issue promptly since our users rely on the software at all times.

00:01:44.730 However, at first glance, it was challenging for me to find information about the impact of the current incident. After several minutes of digging through logs, graphs, and alerts, I discovered, on a custom dashboard deep within our dashboard folders, that we were experiencing an outage due to a dependent service being down.

00:01:52.410 Unfortunately, this was a service I had never worked with before, and I wasn’t sure how to resolve the problem or how it affected our users.

00:01:59.070 This story resonates with many of us, and it's far from the only time I’ve faced such situations. In my two years at Stitch Fix, while we are expected to support applications within our teams, we often don't feel prepared or practiced in resolving these incidents.

00:02:04.619 In this scenario, we ultimately solved the issue and got everything back up and running, but only after more than an hour of downtime—a situation that could have been avoided if we had a better understanding of our systems.

00:02:11.009 As our teams and companies grow, our systems become increasingly complex, creating stress and anxiety for those of us responsible for applications that we don’t interact with daily.

00:02:18.540 As Ruby developers, we typically have two main responsibilities: one is building new features, while the other is supporting existing ones.

00:02:24.810 However, many of us focus primarily on building new features, as they directly contribute to company growth, revenue, and user satisfaction. The unintended consequence is that supporting and ensuring the resiliency of our applications often becomes an afterthought.

00:02:31.500 This neglect often leads us to feel like this sad pug dog, which is unfortunate. Today, we’re going to talk about practicing incident response in production and how it can help us not only support but also build more resilient software.

00:02:37.950 As Caitlyn mentioned, I work at Stitch Fix on the styling engineering team, responsible for building and maintaining software for our 3,000 stylists who work remotely across the country. We’re constantly seeking ways to enhance our resiliency as our systems grow more complex.

00:02:44.730 Today, we're going to discuss injecting failure to learn from it through chaos engineering.

00:02:53.930 This concept isn't new; for example, doctors undergo extensive practice in incident response during medical school and residency.

00:03:01.590 Similarly, firefighters train for months, if not years, to efficiently respond to emergencies and help people swiftly. Most engineers have some form of training, whether through college, boot camp, or online tutorials; however, very few of us have actively practiced incident response as a primary focus.

00:03:10.680 This is a skill we don’t flex very often, which is why today we want to make practicing incident response a priority within our teams to build more resilient systems.

00:03:16.200 One term that encapsulates incident response and simulating incidents in production is chaos engineering, a concept coined by Netflix about five years ago.

00:03:22.139 They developed Chaos Monkey, which automatically kills containers within their applications, forcing the applications to reboot.

00:03:28.790 At Stitch Fix, we've adopted this chaos engineering concept and placed the onus on developers to implement similar strategies. Therefore, as we discuss simulating incidents, we will also consider chaos engineering.

00:03:37.210 There are three main components to chaos engineering. First, consider what we want to simulate—what type of failure are we looking to replicate in our applications?

00:03:44.300 Second, we want to run the simulation as a team, allowing us to huddle and collaborate in a larger group.

00:03:50.880 Third, we will have a designated game day—a specific time to run the simulation together and learn from it.

00:03:57.630 We will dive into how Stitch Fix has implemented chaos engineering within our styling engineering team, focusing on three main sections of our approach.

00:04:04.110 First, we will discuss how to prepare for the simulation and game day, including the type of code needed to effectively run a simulation.

00:04:11.010 Next, we'll talk about the actual game day—gathering the team, executing the simulation live in production, and ensuring it does not cause chaos for our business.

00:04:19.030 Finally, we will explore how to extract learnings and build more resilient systems following the simulation.

00:04:26.400 To start, we need to set up the technical implementation for simulation. This preparation should occur weeks in advance, allowing time to think about what types of failures we want to simulate and to write the required code.

00:04:31.020 Considering what we aim to simulate is a critical question, as the weaknesses in our system are often team- and company-specific, dependent on specific technologies, architectures, and services.

00:04:38.130 At Stitch Fix, we decided to begin by simulating failures within our services. We have a microservice-based architecture, meaning we utilize dozens, if not hundreds, of different services that power our applications.

00:04:44.930 This structure can create potential points of failure for our systems, leading to downtime if not handled with resilience. So, today we'll focus on simulating a service failure within our application.

00:04:52.890 Specifically, we will represent this failure as a 500 status code—simulating a response from a service that would normally return a 200 success status code.

00:05:01.590 To accomplish this, we will utilize middleware. Middleware sits between every request and response that your application produces. We can create custom middleware classes to modify these requests and responses.

00:05:09.660 In this case, we will alter the response received when making service requests. To clarify, middleware is essential in simulating downtime for our services.
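The middleware idea described here can be sketched in plain Ruby without any HTTP library. This is only an illustration: the `Response` struct, the `FakeService` endpoint, and the `ForceStatus` class are hypothetical stand-ins, not the code shown in the talk.

```ruby
# Hypothetical stand-ins for illustration; no real HTTP is involved.
Response = Struct.new(:status, :body)

# The innermost "app": pretends to be a healthy internal service.
class FakeService
  def call(_request)
    Response.new(200, "client data")
  end
end

# A middleware wraps another callable. It sees the request on the way in
# and the response on the way out, and may change either.
class ForceStatus
  def initialize(app, status)
    @app = app
    @status = status
  end

  def call(request)
    response = @app.call(request) # pass the request down the stack
    response.status = @status     # rewrite the response on the way back
    response
  end
end

healthy = FakeService.new
failing = ForceStatus.new(FakeService.new, 500)

puts healthy.call({ path: "/clients/1" }).status
puts failing.call({ path: "/clients/1" }).status
```

The caller cannot tell whether the 500 came from the service or from the wrapper, which is what makes middleware a convenient place to inject a failure.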

00:05:16.250 At Stitch Fix, we already use Faraday, a Ruby HTTP client, for our requests. This makes it an optimal choice to create our custom middleware classes.

00:05:24.530 For those unfamiliar, Faraday allows developers to customize its behavior with middleware. So, we've written a custom middleware class that alters the response when an application requests data from an internal service.

00:05:31.820 What does the implementation look like? This slide shows a new Faraday object that we instantiate with an options hash containing the URL. The connection can also be configured with various request options and adapters.

00:05:39.170 We want to create this custom middleware class for altering response statuses. This is surprisingly simple: we create a new SimulateServiceFailure class that inherits from Faraday::Middleware.

00:05:46.130 In the on_complete method, we override the response status with 500, which forces every service response to come back as a 500 error.
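A minimal sketch of that status-overriding class follows. So that it runs without the Faraday gem, it inherits from `SketchMiddleware`, a hypothetical and much simpler stand-in for the role `Faraday::Middleware` plays. The hash-based `env` and the `healthy_service` lambda are also assumptions made for illustration.

```ruby
# Stand-in for the role Faraday::Middleware plays; hypothetical and much
# simpler than the real class. It calls the next app, then hands the
# response environment to an on_complete hook if the subclass defines one.
class SketchMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    env = @app.call(env)
    on_complete(env) if respond_to?(:on_complete)
    env
  end
end

# Overrides whatever status the service returned with a 500.
class SimulateServiceFailure < SketchMiddleware
  def on_complete(env)
    env[:status] = 500
  end
end

# Hypothetical healthy service that would normally answer 200.
healthy_service = ->(env) { env.merge(status: 200, body: "client data") }

stack = SimulateServiceFailure.new(healthy_service)
result = stack.call({ url: "https://example.invalid/clients/1" })
puts result[:status]
```

With the real gem, the subclass would presumably inherit from `Faraday::Middleware` and keep the same `on_complete` hook; the stand-in base class exists only to make the sketch self-contained.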

00:05:52.960 After defining this class, we can go back to our new Faraday connection and register SimulateServiceFailure on it. However, if we merged this into production as is, it would cause all service requests to fail, which isn’t ideal.

00:06:01.420 To avoid this scenario, we need to segment the simulation so it only affects a subset of users. At Stitch Fix, we accomplish this through feature flags, allowing us to control who is included in the simulation.

00:06:10.020 We've implemented feature flags through two different tables: one holds the key name for the feature flag, while the other connects user IDs to the feature flags, designating users for the simulation.
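The two-table layout can be sketched with in-memory records. The struct names, column names, flag key, and user IDs below are all hypothetical; the talk only says that one table holds the flag's key name and the other links user IDs to flags.

```ruby
# In-memory stand-ins for the two tables described; names are hypothetical.
FeatureFlag = Struct.new(:id, :key, :active)
FeatureFlagUser = Struct.new(:feature_flag_id, :user_id)

FLAGS = [FeatureFlag.new(1, "run_simulation", false)]
FLAG_USERS = [
  FeatureFlagUser.new(1, 101),
  FeatureFlagUser.new(1, 102)
]

# A user is in the simulation only if the flag is active AND a join row
# links that user to the flag.
def in_flag?(key, user_id)
  flag = FLAGS.find { |f| f.key == key }
  return false unless flag&.active

  FLAG_USERS.any? { |row| row.feature_flag_id == flag.id && row.user_id == user_id }
end

puts in_flag?("run_simulation", 101) # flag still inactive
FLAGS.first.active = true
puts in_flag?("run_simulation", 101) # flag active, user listed
puts in_flag?("run_simulation", 999) # flag active, user not listed
```

Keeping membership in a separate table means users can be added ahead of time while the flag stays inactive, and a single toggle then starts or stops the simulation for all of them.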

00:06:17.210 Once we know whether the current user is in the feature flag, we pass in a config variable called simulateFailure, setting it to true if the user is part of the run simulation feature flag.

00:06:24.630 Now, back in our Faraday connection object, we apply the SimulateServiceFailure middleware only if the user is part of the feature flag. That is all we need to implement chaos engineering within our systems at Stitch Fix.
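The conditional wiring can be sketched as follows, again in plain Ruby so it runs without the Faraday gem. `build_connection`, `HEALTHY_SERVICE`, the keyword `simulate_failure`, and the user IDs are hypothetical names chosen for this sketch.

```ruby
# Hypothetical stand-ins: a healthy service and a failure-injecting wrapper.
HEALTHY_SERVICE = ->(env) { env.merge(status: 200) }

class SimulateServiceFailure
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env).merge(status: 500)
  end
end

# Builds the call stack for one user's requests. The failure middleware is
# added only when simulate_failure is true, so everyone else is untouched.
def build_connection(simulate_failure:)
  stack = HEALTHY_SERVICE
  stack = SimulateServiceFailure.new(stack) if simulate_failure
  stack
end

flagged_users = [101, 102] # hypothetical members of the feature flag

[101, 555].each do |user_id|
  connection = build_connection(simulate_failure: flagged_users.include?(user_id))
  puts "user #{user_id}: #{connection.call({})[:status]}"
end
```

In a Faraday connection block, the equivalent step would likely be a conditional registration such as `conn.use SimulateServiceFailure if simulate_failure`, though the exact code from the talk is not reproduced in this transcript.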

00:06:31.950 Once we have the necessary code, the next step is to communicate to the organization that we will run our game day. This involves informing business partners and engineers to prepare for any potential issues.

00:06:38.540 Furthermore, we want to gather expectations from our team. When contemplating simulating failures, each team member likely has an expectation of potential outcomes.

00:06:44.800 We collected these expectations through a Slack poll, which was an easy way to assess our thoughts before the simulation. We sent it out a few hours prior to the event.

00:06:52.280 Essentially, the poll asked what impact we thought a service failure would have, specifically focusing on our client data—information displayed to our stylists.

00:06:58.800 Interestingly, the responses revealed a lack of alignment: three team members believed the page would still render, while one thought the application would crash entirely.

00:07:05.710 This disparity highlighted how differently we understood our own systems, so we began documenting these expectations.

00:07:11.700 We created a Google Doc to record conversations, expectations, and learnings throughout the process, discussing how we envisioned the app's response under various conditions.

00:07:18.240 As we discussed these expectations, we also began planning our upcoming game day. We set aside an hour-long meeting each week for the team's game day conversations.

00:07:26.760 During these discussions, we considered questions like: What alerts might we receive? What will the dashboards display? Which documents will be accessible? The aim was to ensure all needed information was readily available.

00:07:33.450 Once we completed the preparation and established infrastructure, including our feature flags, we were ready to run the simulation. We gathered the team for the execution.

00:07:40.920 We often collaborate as a remote team over video conferencing. To execute the process, I shared my screen, allowing everyone to view the Google Doc and our KPIs during the simulation.

00:07:48.530 It's essential that, as a team, we experience this together. Everyone gains valuable insights and learns from the experience.

00:07:55.690 Right before the game day, we reminded our business partners again, because communication is so important. About 99% of the time things go smoothly, but it’s critical to keep them informed.

00:08:02.130 As we prepare for the simulation, we need to ensure our feature flag is inactive. We begin adding users to the feature flag by manually adjusting settings in our console.

00:08:08.160 Now it’s time to run the command and start our simulation. We used a Rake task to easily toggle the feature flag.

00:08:14.610 This command activates the feature flag. As we run the command, we see a division of expectations within our team.
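A toggle task of this kind might look like the sketch below. The task names `chaos:enable` and `chaos:disable` and the in-memory `FLAG_STATE` hash are assumptions; the transcript does not give the real task name, and the real flag lives in the database.

```ruby
require "rake"

# Hypothetical in-memory flag store; the real one lives in database tables.
FLAG_STATE = { "run_simulation" => false }

namespace :chaos do
  desc "Activate the simulation feature flag"
  task :enable do
    FLAG_STATE["run_simulation"] = true
    puts "run_simulation is now active"
  end

  desc "Deactivate the simulation feature flag"
  task :disable do
    FLAG_STATE["run_simulation"] = false
    puts "run_simulation is now inactive"
  end
end

# From a shell this would be `rake chaos:enable`; here the tasks are
# invoked directly so the sketch runs as a plain Ruby script.
Rake::Task["chaos:enable"].invoke
Rake::Task["chaos:disable"].invoke
```

Having a matching disable task ready before the game day starts matters as much as the enable task, since it is the quickest way to end the simulation if something unexpected happens.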

00:08:20.920 About three-quarters of the team believes that the application will load, while one member thinks the application may crash entirely.

00:08:27.490 When we pull up our application, we see the message, 'Sorry, but something went wrong.' Simulating this failure resulted in a full application crash, which was unexpected for most team members.

00:08:34.390 Most anticipated that at least part of the application would load, but instead, the service failure caused a complete outage.

00:08:41.470 This initial experience provided our first learning opportunity, which was exciting for everyone!

00:08:49.050 As the event unfolded, it felt like a fireworks show—an explosion moment. Nothing was loading, reinforcing the need for effective team communication.

00:08:55.710 At this stage, we must ensure that the simulation is only affecting the users we anticipated. We avoid impacting others outside of the feature flags.

00:09:04.350 To verify, we kept one team member off the feature flag to confirm that the app wasn’t down for everyone else.

00:09:11.130 We pinged the entire engineering team to double-check the situation.

00:09:17.920 As expected, errors began rolling in. This development would usually cause concern, but in this case, it was exciting—we knew the simulation was successfully working.

00:09:24.900 Initially, we received alerts from BugSnag displaying the 500 status, which was exactly what we hoped to observe.

00:09:31.820 Next, we checked our Datadog metrics, where we saw the services in the simulation showing spikes in 500 responses.

00:09:39.520 One additional discovery was that our PagerDuty alerts did not fire. We had anticipated receiving pages as part of the simulation, but the error volume never reached our paging thresholds.

00:09:46.440 At this point in the simulation, only a couple dozen users were involved, so the error count fell below our PagerDuty thresholds.
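The arithmetic behind the missing pages can be sketched as a simple threshold check. Every number here is hypothetical except the rough scale: a couple dozen flagged users versus the roughly 3,000 stylists mentioned earlier in the talk. The real paging threshold is not stated.

```ruby
# All numbers are hypothetical; the talk does not state the real threshold.
PAGE_THRESHOLD = 100 # errors per window needed before a page is sent

def page?(error_count, threshold: PAGE_THRESHOLD)
  error_count >= threshold
end

simulated_errors = 24      # roughly "a couple dozen" flagged users failing
real_outage_errors = 3_000 # every stylist failing at once

puts page?(simulated_errors)   # the game day stays under the threshold
puts page?(real_outage_errors) # a full outage would cross it
```

One consequence is that a small simulation cannot exercise the paging path unless either the flagged group is large enough or a lower, simulation-specific threshold is configured.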

00:09:54.000 After verifying everything was functioning as expected, we saw the application was still down. It was time to conclude the game day.

00:10:02.850 We ran the Rake task to disable the simulation, turning the feature flag back to inactive status.

00:10:09.510 Now that the simulation had ended, we had several key takeaways from the experience.

00:10:16.340 Moving forward, we need to think about how to extract useful learnings from this experience to enhance our resilience within the system.

00:10:24.440 When discussing resiliency, we commonly focus on our technical systems and strive to build applications devoid of bugs or outages.

00:10:31.710 Through these simulations, we are empowered to identify issues with dependent services that could significantly impact the user experience.

00:10:39.270 This simulation taught us that if this particular service goes down, our application won’t function; it won’t load at all, revealing a critical dependency.

00:10:48.070 These insights allow us to identify new backlog items aimed at enhancing system resilience.

00:10:55.680 Additionally, we should reassess our initial expectations. Why did one person think nothing would load while three others anticipated the app would render?

00:11:03.870 Understanding these differences in expectations is crucial for identifying knowledge gaps regarding our system.

00:11:12.800 Ultimately, the goal for our technical system is to ensure increased failure tolerance.

00:11:20.130 In addition to enhancing system resilience from a technical standpoint, we must consider our process resiliency.

00:11:28.440 It’s essential to improve the tools and knowledge accessible to engineers, facilitating a better understanding of our systems when issues arise.

00:11:35.590 Reflecting on my first experiences on-call at Stitch Fix, I recall struggling to find relevant documentation, metrics, and tools when attempting to resolve issues.

00:11:42.830 As a result, we want to develop resilient processes by enhancing the discoverability of resources—such as GitHub wiki pages.

00:11:49.660 We also need to establish optimal paging thresholds and create usable dashboards.

00:11:56.470 One practical approach has been linking documentation and runbooks related to our alerts, allowing for easier access during critical moments.

00:12:04.320 For example, we also receive alerts in our team Slack channel, which include direct links to relevant runbooks, making it easy to access important information.
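Attaching a runbook link to each alert can be sketched as a lookup performed when the alert text is built. The alert name, the URL, and the message format below are hypothetical and are not taken from the team's actual Slack integration.

```ruby
# Hypothetical alert names and runbook URLs, for illustration only.
RUNBOOKS = {
  "client_service_5xx" => "https://example.invalid/wiki/runbooks/client-service-5xx"
}.freeze

# Builds the text of a chat alert, attaching the runbook link when known.
def alert_message(alert_name, detail)
  runbook = RUNBOOKS[alert_name]
  lines = ["ALERT #{alert_name}: #{detail}"]
  lines << (runbook ? "Runbook: #{runbook}" : "Runbook: none linked yet")
  lines.join("\n")
end

puts alert_message("client_service_5xx", "500 responses spiking")
puts alert_message("unknown_alert", "something else")
```

Printing an explicit "none linked yet" line for unmapped alerts makes gaps in the documentation visible at the moment someone would have needed it.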

00:12:09.150 Each incident shows us which information in a runbook is useful and which is not, so we can keep improving our runbooks over time.

00:12:16.920 Another strategy involves enhancing access to dashboards, as we recognized the importance of organized dashboard systems during the simulation.

00:12:23.370 We discovered numerous dashboards that were scattered and difficult to find, which prompted us to consolidate them into more accessible locations.

00:12:30.200 When we are in a high-stress situation, quick and easy access to relevant process information is critical.

00:12:38.270 The final factor to consider in our resilience strategy is the human element.

00:12:44.950 Improving engineers' confidence and abilities as they troubleshoot issues will result in higher efficiency.

00:12:52.710 The more we practice these simulations, the more our confidence builds over time.

00:13:00.290 Our team seeks to be prepared, to feel less stressed, and to be less anxious while on call.

00:13:07.220 Practicing incident response sometimes resembles having insurance: it may not seem necessary until an emergency arises.

00:13:14.870 When a significant outage occurs, having practiced for months can lead to swift recovery.

00:13:22.060 Additionally, we want to promote a mindset shift in how we view resilience in our systems.

00:13:30.720 The more frequently we conduct simulations, the more top-of-mind resiliency becomes.

00:13:39.240 We should design systems with the understanding that failures can happen, especially distributed systems that must account for service failures, latency, and various outages.

00:13:48.340 Through game days and incident response simulations, we enhance our knowledge of systems and ourselves, allowing us to build resilient processes, applications, and people.

00:13:57.170 Our goal is to eliminate stress, confusion, and a sense of incompetence, and instead cultivate a sense of happiness, knowledge, and empowerment.

00:14:05.040 Ultimately, we aim to construct strong technical, process, and human systems through simulating potential incidents in production. Thank you!