Ruby Video
Description
RubyConf 2018 - It's Down! Simulating Incidents in Production by Kelsey Pederson
Who loves getting paged at 3am? No one. In responding to incidents, whether at 3am or in the middle of the day, we want to feel prepared and practiced in resolving production issues. In this talk, you'll learn how to practice incident response by simulating outages in production. We'll draw on lessons from our simulations at Stitch Fix, like technical implementation strategies, key metrics to watch, and writing runbooks. You'll walk away from this talk with the superhero ability to help your team simulate incidents in production. Be prepared for your next incident!
Summary
In her talk at RubyConf 2018, Kelsey Pederson discusses the importance of simulating incidents in production to improve incident response within engineering teams. She begins by sharing her own experience during an on-call rotation at Stitch Fix, where she struggled to respond to production issues that affected users. Kelsey emphasizes that engineers need to feel prepared for these incidents, comparing their need for practice to that of firefighters and doctors, and she advocates for making incident simulations a team priority.

Kelsey outlines the following key points:

- **The Necessity of Simulation**: Simulating incidents prepares teams for real outages and makes engineers more effective at resolving issues.
- **Chaos Engineering**: Introduced by Netflix, chaos engineering involves deliberately injecting failures to test the resilience of systems.
- **Steps for Simulation**: Kelsey describes a structured approach to running simulations: define the type of failure, implement the necessary code, gather the team's expectations, and run the simulation during a designated game day.
- **Technical Implementation**: Kelsey demonstrates how Stitch Fix injected failures into its middleware using custom code, which let the team test the application under controlled failure scenarios.
- **Debriefing and Learning**: After the simulation, teams compare the outcomes with their expectations, document what they learned, and improve their processes, for example by updating runbooks and dashboards for better incident management.

Throughout her talk, Kelsey shares specific examples from Stitch Fix's chaos engineering project, noting what the team learned from a surprising application crash during a simulation. That crash underscored the need for ongoing discussion and refinement of team processes to improve system resilience.
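The middleware-based failure injection mentioned above could look roughly like the following. This is a minimal sketch, not the code shown in the talk: the class name `FailureInjector`, its `enabled`, `failure_rate`, and `status` options, and the stand-in `healthy_app` are all hypothetical, and it assumes a Rack-style application where middleware responds to `call(env)` and returns a `[status, headers, body]` triple.

```ruby
# Hypothetical Rack-style middleware for a game-day simulation.
# None of these names come from the talk; they are illustrative assumptions.
class FailureInjector
  def initialize(app, enabled: false, failure_rate: 1.0, status: 503, rng: Random.new)
    @app = app                   # the downstream app (or next middleware)
    @enabled = enabled           # off by default, so normal traffic is untouched
    @failure_rate = failure_rate # fraction of requests to fail, from 0.0 to 1.0
    @status = status             # HTTP status returned for injected failures
    @rng = rng
  end

  # Rack calling convention: take the request env, return [status, headers, body].
  def call(env)
    return @app.call(env) unless inject_failure?

    body = "Simulated failure (game day)"
    headers = { "content-type" => "text/plain", "content-length" => body.bytesize.to_s }
    [@status, headers, [body]]
  end

  private

  # rand returns a float in [0, 1), so a rate of 1.0 always fails
  # and a rate of 0.0 never does.
  def inject_failure?
    @enabled && @rng.rand < @failure_rate
  end
end

# Stand-in for the real application.
healthy_app = ->(_env) { [200, { "content-type" => "text/plain" }, ["OK"]] }

normal   = FailureInjector.new(healthy_app, enabled: false)
game_day = FailureInjector.new(healthy_app, enabled: true, failure_rate: 1.0)

puts normal.call({}).first   # status with injection disabled
puts game_day.call({}).first # status with injection enabled
```

Keeping the injector disabled by default and behind an explicit switch means a simulation can be started and stopped deliberately during the game day, which fits the controlled failure scenarios the summary describes.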
Kelsey concludes by reiterating that through regular practice and organized simulations, engineering teams can develop both technical systems and human resourcefulness, ultimately leading to more robust applications and effective incident responses.