00:00:00.030
All right, our next speaker has come all the way from San Francisco, where she arrived about 7 a.m. this morning on an overnight flight. She has not slept, and yet she's caffeinated and ready to go. Her name is Kelsey Pedersen.
00:00:06.960
Kelsey majored in economics and initially wanted to work in sales. However, she started finding ways to automate small tasks in her job, like batch-sending emails. This led her to shadow some engineers at work, where she realized that development is a cool, creative, and interesting challenge, which is not how people generally describe sales.
00:00:15.389
She taught herself to code, attended a dev bootcamp, and got a job at a startup almost immediately because of her skills.
00:00:20.609
Now, Kelsey works at Stitch Fix, an online styling and shopping website, and she's hoping to bring the service to Australia. When she joined, there were about 60 engineers, and now there are around 180, showcasing the company's amazing growth. She primarily works with Ruby on Rails and React.
00:01:18.170
Hi everyone! Caitlyn, thank you so much for that introduction. When we spoke the other night, I was very impressed by your ability to summarize our conversation; we talked for quite a long time and formed a great bond.
00:01:23.490
So, like Caitlyn said, I arrived at 7 a.m. this morning on a direct flight from SFO, so I'm highly caffeinated and excited to get started!
00:01:30.119
Today, we're going to discuss simulating incidents in production. I want to start with a story: it was 3 a.m., and I had just begun working at Stitch Fix a few months earlier. It was one of my first on-call rotations, and of course, I got paged.
00:01:38.610
I rolled out of bed, somewhat in a fog, opened up my computer, and noticed that errors were occurring in our application—it wouldn’t load. We needed to fix the issue promptly since our users rely on the software at all times.
00:01:44.730
However, at first glance it was hard to find information about the impact of the incident. After several minutes of digging through logs, graphs, and alerts, I discovered, on a custom dashboard buried deep in our dashboard folders, that we were experiencing an outage because a dependent service was down.
00:01:52.410
Unfortunately, this was a service I had never worked with before, and I wasn’t sure how to resolve the problem or how it affected our users.
00:01:59.070
This story resonates with many of us, and it's far from the only time I’ve faced such situations. In my two years at Stitch Fix, while we are expected to support applications within our teams, we often don't feel prepared or practiced in resolving these incidents.
00:02:04.619
In this scenario, we ultimately solved the issue and got everything back up and running, but only after more than an hour of downtime—a situation that could have been avoided if we had a better understanding of our systems.
00:02:11.009
As our teams and companies grow, our systems become increasingly complex, creating stress and anxiety for those of us responsible for applications that we don’t interact with daily.
00:02:18.540
As Ruby developers, we typically have two main responsibilities: one is building new features, while the other is supporting existing ones.
00:02:24.810
However, many of us focus primarily on building new features, as they directly contribute to company growth, revenue, and user satisfaction. The unintended consequence is that supporting and ensuring the resiliency of our applications often becomes an afterthought.
00:02:31.500
This neglect often leads us to feel like this sad pug dog, which is unfortunate. Today, we’re going to talk about practicing incident response in production and how it can help us not only support but also build more resilient software.
00:02:37.950
As Caitlyn mentioned, I work at Stitch Fix on the styling engineering team, responsible for building and maintaining software for our 3,000 stylists who work remotely across the country. We’re constantly seeking ways to enhance our resiliency as our systems grow more complex.
00:02:44.730
Today, we're going to discuss injecting failure to learn from it through chaos engineering.
00:02:53.930
This concept isn't new; for example, doctors undergo extensive practice in incident response during medical school and residency.
00:03:01.590
Similarly, firefighters train for months, if not years, to efficiently respond to emergencies and help people swiftly. Most engineers have some form of training, whether through college, boot camp, or online tutorials; however, very few of us have actively practiced incident response as a primary focus.
00:03:10.680
This is a skill we don’t flex very often, which is why today we want to make practicing incident response a priority within our teams to build more resilient systems.
00:03:16.200
One term that encapsulates incident response and simulating incidents in production is chaos engineering, a concept coined by Netflix about five years ago.
00:03:22.139
They developed Chaos Monkey, which automatically kills containers within their applications, forcing the applications to reboot.
00:03:28.790
At Stitch Fix, we've adopted this chaos engineering concept and placed the onus on developers to implement similar strategies. Therefore, as we discuss simulating incidents, we will also consider chaos engineering.
00:03:37.210
There are three main components to chaos engineering. First, consider what we want to simulate—what type of failure are we looking to replicate in our applications?
00:03:44.300
Second, we want to run the simulation as a team, allowing us to huddle and collaborate in a larger group.
00:03:50.880
Third, we will have a designated game day—a specific time to run the simulation together and learn from it.
00:03:57.630
We will dive into how Stitch Fix has implemented chaos engineering within our styling engineering team, focusing on three main sections of our approach.
00:04:04.110
First, we will discuss how to prepare for the simulation and game day, including the type of code needed to effectively run a simulation.
00:04:11.010
Next, we'll talk about the actual game day—gathering the team, executing the simulation live in production, and ensuring it does not cause chaos for our business.
00:04:19.030
Finally, we will explore how to extract learnings and build more resilient systems following the simulation.
00:04:26.400
To start, we need to set up the technical implementation for simulation. This preparation should occur weeks in advance, allowing time to think about what types of failures we want to simulate and to write the required code.
00:04:31.020
Considering what we aim to simulate is a critical question, as the weaknesses in our system are often team- and company-specific, dependent on specific technologies, architectures, and services.
00:04:38.130
At Stitch Fix, we decided to begin by simulating failures within our services. We have a microservice-based architecture, meaning we utilize dozens, if not hundreds, of different services that power our applications.
00:04:44.930
This structure can create potential points of failure for our systems, leading to downtime if not handled with resilience. So, today we'll focus on simulating a service failure within our application.
00:04:52.890
Specifically, we will represent this failure as a 500 status code—simulating a response from a service that would normally return a 200 success status code.
00:05:01.590
To accomplish this, we will utilize middleware. Middleware sits between every request and response that your application produces. We can create custom middleware classes to modify these requests and responses.
00:05:09.660
In this case, we will alter the response received when making service requests. To clarify, middleware is essential in simulating downtime for our services.
00:05:16.250
At Stitch Fix, we already use Faraday, a Ruby HTTP client, for our requests. This makes it an optimal choice to create our custom middleware classes.
00:05:24.530
For those unfamiliar, Faraday allows developers to customize its behavior with middleware. So, we've written a custom middleware class that alters the response when an application requests data from an internal service.
00:05:31.820
What does the implementation look like? This slide shows a new Faraday object we instantiate with an options hash containing the URL; the hash can also include various request options and adapters.
00:05:39.170
We want to create this custom middleware class for altering response statuses. This is surprisingly simple: we create a new SimulateServiceFailure class that inherits from Faraday::Middleware.
00:05:46.130
In the on_complete method, we override the response status to 500, which forces all service responses to return a 500 error.
00:05:52.960
After defining this class, we can go back to our new Faraday connection and call SimulateServiceFailure. However, if we merged this into production, it would cause all service requests to fail, which isn’t ideal.
00:06:01.420
To avoid this scenario, we need to segment the simulation so it only affects a subset of users. At Stitch Fix, we accomplish this through feature flags, allowing us to control who is included in the simulation.
00:06:10.020
We've implemented feature flags through two different tables: one holds the key name for the feature flag, while the other connects user IDs to the feature flags, designating users for the simulation.
00:06:17.210
Once we look up the feature flag, we pass in a config variable, simulate_failure, setting it to true if the user is part of the run_simulation feature flag.
00:06:24.630
Now, back in our Faraday connection object, we can run the SimulateServiceFailure if the user is part of the feature flag, which is all we need to implement chaos engineering within our systems at Stitch Fix.
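The flag lookup can be sketched in plain Ruby. The table and column names here are hypothetical stand-ins for the two tables described above:

```ruby
# Hypothetical stand-ins for the two feature-flag tables: one holds the
# flag's key name, the other joins user ids to flags.
FEATURE_FLAGS      = [{ id: 1, key: "run_simulation" }]
FEATURE_FLAG_USERS = [{ feature_flag_id: 1, user_id: 42 }]

# True when the given user is enrolled in the named flag.
def feature_enabled?(key, user_id)
  flag = FEATURE_FLAGS.find { |f| f[:key] == key }
  return false unless flag

  FEATURE_FLAG_USERS.any? do |row|
    row[:feature_flag_id] == flag[:id] && row[:user_id] == user_id
  end
end

# The connection builder can then decide per user whether to add the
# failure middleware.
simulate_failure = feature_enabled?("run_simulation", 42)
```

Only users joined to the flag get the middleware, so everyone else keeps talking to the real service.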
00:06:31.950
Once we have the necessary code, the next step is to communicate to the organization that we will run our game day. This involves informing business partners and engineers to prepare for any potential issues.
00:06:38.540
Furthermore, we want to gather expectations from our team. When contemplating simulating failures, each team member likely has an expectation of potential outcomes.
00:06:44.800
We collected these expectations through a Slack poll, which was an easy way to assess our thoughts before the simulation. We sent it out a few hours prior to the event.
00:06:52.280
Essentially, the poll asked what impact we thought a service failure would have, specifically focusing on our client data—information displayed to our stylists.
00:06:58.800
Interestingly, the responses revealed a lack of alignment: three team members believed the page would still render, while one thought the application would crash entirely.
00:07:05.710
This disparity highlighted the importance of understanding our systems, so we will begin documenting these expectations.
00:07:11.700
We created a Google Doc to record conversations, expectations, and learnings throughout the process, discussing how we envisioned the app's response under various conditions.
00:07:18.240
As we discussed these expectations, we also began planning our upcoming game day. We set aside an hour-long meeting each week for the team's game day conversations.
00:07:26.760
During these discussions, we considered questions like: What alerts might we receive? What will the dashboards display? Which documents will be accessible? The aim was to ensure all needed information was readily available.
00:07:33.450
Once we completed the preparation and established infrastructure, including our feature flags, we were ready to run the simulation. We gathered the team for the execution.
00:07:40.920
We often collaborate as a remote team over video conferencing. To execute the process, I shared my screen, allowing everyone to view the Google Doc and our KPIs during the simulation.
00:07:48.530
It's essential that, as a team, we experience this together. Everyone gains valuable insights and learns from the experience.
00:07:55.690
Right before the game day, we remind our business partners again to emphasize the importance of communication. Generally, 99% of the time, things go smoothly, but it’s critical to keep them informed.
00:08:02.130
As we prepare for the simulation, we need to ensure our feature flag is inactive. We begin adding users to the feature flag by manually adjusting settings in our console.
00:08:08.160
Now it’s time to run the command and start our simulation. We used a Rake task to easily toggle the feature flag.
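A sketch of what such a toggle task could look like; the task names, the colon-separated argument format, and the in-memory flag store are all hypothetical stand-ins for our real feature-flag tables:

```ruby
require "rake"
include Rake::DSL

# In-memory stand-in for the feature-flag table.
SIMULATION_FLAGS = {}

namespace :simulation do
  desc "Enroll the given colon-separated user ids in the failure simulation"
  task :start, [:user_ids] do |_t, args|
    args[:user_ids].to_s.split(":").each do |id|
      SIMULATION_FLAGS[id.to_i] = true
    end
  end

  desc "End the simulation by clearing the flag for everyone"
  task :stop do
    SIMULATION_FLAGS.clear
  end
end

# Game day starts by enrolling the chosen users...
Rake::Task["simulation:start"].invoke("17:42")
# ...and ends by clearing the flag, which we'll do later in the talk.
```

Having a single command to flip the flag on and off matters: when the simulation needs to end quickly, nobody should be hand-editing rows in a console.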
00:08:14.610
This command activates the feature flag. As we run the command, we see a division of expectations within our team.
00:08:20.920
About three-quarters of the team believes that the application will load, while one member thinks the application may crash entirely.
00:08:27.490
When we pull up our application, we see the message, 'Sorry, but something went wrong.' Simulating this failure resulted in a full application crash, which was unexpected for most team members.
00:08:34.390
Most anticipated that at least part of the application would load, but instead, the service failure caused a complete outage.
00:08:41.470
This initial experience provided our first learning opportunity, which was exciting for everyone!
00:08:49.050
As the event unfolded, it felt like a fireworks show—an explosion moment. Nothing was loading, reinforcing the need for effective team communication.
00:08:55.710
At this stage, we must ensure that the simulation is only affecting the users we anticipated. We avoid impacting others outside of the feature flags.
00:09:04.350
To verify, we kept one team member off the feature flag to confirm that the app wasn’t down for everyone else.
00:09:11.130
We pinged the entire engineering team to double-check the situation.
00:09:17.920
As expected, errors began rolling in. This development would usually cause concern, but in this case, it was exciting—we knew the simulation was successfully working.
00:09:24.900
Initially, we received alerts from Bugsnag displaying the 500 status, which was exactly what we hoped to observe.
00:09:31.820
Next, we checked our Datadog metrics, where we saw the simulated components showing spikes in 500 responses.
00:09:39.520
One additional discovery was that our PagerDuty alerts did not activate. We had anticipated receiving pages as part of the simulation, but our thresholds were set too low.
00:09:46.440
At this point in the simulation, only a couple dozen people were involved, falling below our PagerDuty thresholds.
00:09:54.000
After verifying everything was functioning as expected, we saw the application was still down. It was time to conclude the game day.
00:10:02.850
We ran the Rake task to disable the simulation, turning the feature flag back to inactive status.
00:10:09.510
Now that the simulation had ended, we had significant key learnings and takeaways from the experience.
00:10:16.340
Moving forward, we need to think about how to extract useful learnings from this experience to enhance our resilience within the system.
00:10:24.440
When discussing resiliency, we commonly focus on our technical systems and strive to build applications devoid of bugs or outages.
00:10:31.710
These simulations empower us to identify issues with dependent services that could significantly impact the user experience.
00:10:39.270
This simulation taught us that if this particular service goes down, our application won’t function; it won’t load at all, revealing a critical dependency.
00:10:48.070
These insights allow us to identify new backlog items aimed at enhancing system resilience.
00:10:55.680
Additionally, we should reassess our initial expectations. Why did one person think nothing would load while three others anticipated the app would render?
00:11:03.870
Understanding these differences in expectations is crucial for identifying knowledge gaps regarding our system.
00:11:12.800
Ultimately, the goal for our technical system is to ensure increased failure tolerance.
00:11:20.130
In addition to enhancing system resilience from a technical standpoint, we must consider our process resiliency.
00:11:28.440
It’s essential to improve the tools and knowledge accessible to engineers, facilitating a better understanding of our systems when issues arise.
00:11:35.590
Reflecting on my first experiences on-call at Stitch Fix, I recall struggling to find relevant documentation, metrics, and tools when attempting to resolve issues.
00:11:42.830
As a result, we want to develop resilient processes by enhancing the discoverability of resources—such as GitHub wiki pages.
00:11:49.660
We also need to establish optimal paging thresholds and create usable dashboards.
00:11:56.470
One practical approach has been linking documentation and runbooks related to our alerts, allowing for easier access during critical moments.
00:12:04.320
For example, we also receive alerts in our team Slack channel, which include direct links to relevant runbooks, making it easy to access important information.
00:12:09.150
By optimizing our runbooks through incidents and determining useful versus unhelpful information, we can continuously improve.
00:12:16.920
Another strategy involves enhancing access to dashboards, as we recognized the importance of organized dashboard systems during the simulation.
00:12:23.370
We discovered numerous dashboards that were scattered and difficult to find, which prompted us to consolidate them into more accessible locations.
00:12:30.200
Access to process information is critical: in high-stress situations, we need to find the relevant information quickly and easily.
00:12:38.270
The final factor to consider in our resilience strategy is the human element.
00:12:44.950
Improving engineers' confidence and abilities as they troubleshoot issues will result in higher efficiency.
00:12:52.710
The more we practice these simulations, the more our confidence builds over time.
00:13:00.290
Our team seeks to be prepared, to feel less stressed, and to be less anxious while on call.
00:13:07.220
Practicing incident response sometimes resembles having insurance: it may not seem necessary until an emergency arises.
00:13:14.870
When a significant outage occurs, having practiced for months can lead to swift recovery.
00:13:22.060
Additionally, we want to promote a mindset shift in how we view resilience in our systems.
00:13:30.720
The more frequently we conduct simulations, the more top-of-mind resiliency becomes.
00:13:39.240
We should design systems with the understanding that failures can happen, especially distributed systems that must account for service failures, latency, and various outages.
00:13:48.340
Through game days and incident response simulations, we enhance our knowledge of systems and ourselves, allowing us to build resilient processes, applications, and people.
00:13:57.170
Our goal is to eliminate stress, confusion, and a sense of incompetence, and instead cultivate a sense of happiness, knowledge, and empowerment.
00:14:05.040
Ultimately, we aim to construct strong technical processes and human systems through simulating potential incidents in production. Thank you!