Ruby Video

In her talk at RubyConf 2018, Kelsey Pederson discusses the importance of simulating incidents in production to improve incident response capabilities within engineering teams. She begins by sharing her own experiences during an on-call rotation at Stitch Fix, where she faced challenges responding to production issues that affected users. Kelsey emphasizes the need for engineers to feel prepared for these incidents, likening their need for practice to that of firefighters and doctors, and she advocates for prioritizing incident simulations within teams.

Kelsey outlines the following key points:  
- **The Necessity of Simulation**: Simulating incidents can prepare teams for real outages, making engineers more effective in resolving issues.  
- **Chaos Engineering**: Introduced by Netflix, chaos engineering involves deliberately injecting failures to test the resilience of systems.  
- **Steps for Simulation**: Kelsey describes a structured approach to running simulations, which includes defining the type of failure, implementing necessary code, gathering team expectations, and running the simulation during a designated game day.  
- **Technical Implementation**: Kelsey demonstrates how the Stitch Fix team injected failures at the middleware layer using custom code, which let them test the application under controlled failure scenarios.  
- **Debriefing and Learning**: Post-simulation, teams must reflect on the outcomes versus expectations, documenting insights and improving processes such as updating runbooks and dashboards for better incident management.
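
The middleware-based failure injection described above can be sketched roughly as follows. This is a minimal illustration, not the code shown in the talk: the class name `FailureInjector` and its options (`enabled`, `failure_rate`, `path_prefix`, `rng`) are hypothetical, and it assumes a Rack-style middleware interface in which `call(env)` returns a `[status, headers, body]` triple.

```ruby
# Hypothetical Rack-style middleware that injects failures during a game day.
# All names and options here are illustrative assumptions, not Stitch Fix's code.
class FailureInjector
  def initialize(app, enabled: false, failure_rate: 1.0, path_prefix: "/", rng: Random.new)
    @app = app                   # the downstream Rack application
    @enabled = enabled           # master switch, off by default
    @failure_rate = failure_rate # fraction of matching requests to fail (0.0..1.0)
    @path_prefix = path_prefix   # only requests under this path are affected
    @rng = rng                   # injectable so runs can be made deterministic
  end

  def call(env)
    if inject_failure?(env)
      body = "Simulated failure (game day)"
      headers = { "content-type" => "text/plain", "content-length" => body.bytesize.to_s }
      return [503, headers, [body]]
    end

    @app.call(env)
  end

  private

  def inject_failure?(env)
    return false unless @enabled
    return false unless env["PATH_INFO"].to_s.start_with?(@path_prefix)

    # Random#rand returns a float in [0, 1), so a rate of 1.0 always fails.
    @rng.rand < @failure_rate
  end
end

# Usage with a stand-in downstream app (any object responding to #call works).
downstream = ->(_env) { [200, { "content-type" => "text/plain" }, ["ok"]] }

healthy = FailureInjector.new(downstream, enabled: false)
failing = FailureInjector.new(downstream, enabled: true, path_prefix: "/api")

healthy_status   = healthy.call("PATH_INFO" => "/api/items").first
failing_status   = failing.call("PATH_INFO" => "/api/items").first
untouched_status = failing.call("PATH_INFO" => "/health").first
```

Keeping the switch off by default and scoping failures to a path prefix limits the impact of a simulation to the scenario the team agreed on before the game day.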

Throughout her talk, Kelsey shares specific examples from Stitch Fix's chaos engineering project, including a simulation in which the application crashed unexpectedly. The team learned from that surprise, which underscored the need for ongoing discussion and refinement of team processes to improve system resilience.

Kelsey concludes by reiterating that through regular practice and organized simulations, engineering teams can develop both technical systems and human resourcefulness, ultimately leading to more robust applications and effective incident responses.
