Description
Kelsey Pedersen

Who loves getting paged at 3am? No one. When responding to incidents, whether at 3am or in the middle of the day, we want to feel prepared and practiced in resolving production issues. In this talk, you'll learn how to practice incident response by simulating outages in your application.

Kelsey Pedersen is a software engineer at Stitch Fix on the styling engineering team. She builds the internal software that stylists use to curate clothes for their clients. She works cross-functionally with the styling community and the data science team to build new features that better assist stylists. She previously had a career in sales and was a Division I rower at UC Berkeley.

#ruby #rubyconf #rubyconfau #programming
Summary
In her talk at RubyConf AU 2019, Kelsey Pedersen discusses the importance of simulating incidents in production to better prepare teams for real downtime events. The main theme revolves around enhancing incident response practices through chaos engineering. Kelsey shares her experience working at Stitch Fix, where she encountered challenges during her first on-call rotation, emphasizing that many engineers often focus on new feature development at the expense of maintaining system resilience.

Key points discussed include:

- The common stress and anxiety faced by developers during production incidents, particularly the challenges of responding to unexpected outages.
- A focus on chaos engineering, which involves intentionally simulating failures to learn from them and improve system resilience. Kelsey draws parallels to other fields where professionals practice incident response extensively, like medicine and firefighting.
- The structured approach to chaos engineering, which involves three components: determining the failure to simulate, collaborating as a team during simulations, and designating a game day for execution.
- Kelsey's team's implementation of chaos engineering at Stitch Fix, including technical preparations like coding custom middleware to simulate service failures and the use of feature flags to control user impact.
- Real-time execution of a simulation where the team anticipated different outcomes, revealing gaps in their expectations about system responses.
- Key learnings from the simulations, including the realization that a particular service's failure can lead to a complete application crash and the importance of improving documentation and accessibility to resources during incident response.

Conclusions from Kelsey’s talk include:

- The significance of practicing incident response regularly to build confidence among engineers and enhance overall system resilience.
- The development of robust processes and documentation that can be accessed quickly during critical incidents to mitigate confusion and stress.
- A shift in mindset towards understanding that failures are part of distributed systems and preparing accordingly through ongoing simulations and practice.

Ultimately, Kelsey emphasizes that through game days and incident simulations, teams can enhance their knowledge and preparedness, fostering a culture of resilience and empowerment within engineering teams.
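
The summary mentions custom middleware that simulates service failures, gated by feature flags to limit user impact. The following is a minimal sketch of that idea, not the code shown in the talk. It assumes a Rack-style middleware (an object that responds to `call(env)`), and every name in it (`FeatureFlags`, `SimulatedOutage`, the `:simulate_outage` flag, the `X-User-Id` header) is hypothetical and chosen only for illustration.

```ruby
require "set"

# Hypothetical in-memory feature flag store. A real application would
# more likely use a flag service or library; this exists only so the
# sketch is self-contained.
class FeatureFlags
  def initialize
    @enabled = {}
  end

  # Turn a flag on for a single user.
  def enable(flag, user_id)
    (@enabled[flag] ||= Set.new) << user_id
  end

  # True only if the flag was explicitly enabled for this user.
  def enabled?(flag, user_id)
    @enabled.fetch(flag, Set.new).include?(user_id)
  end
end

# Rack-style middleware that short-circuits requests with a 503 when the
# outage flag is enabled for the requesting user. All other requests pass
# through to the wrapped app untouched.
class SimulatedOutage
  def initialize(app, flags:, flag: :simulate_outage)
    @app = app
    @flags = flags
    @flag = flag
  end

  def call(env)
    # Assumption: the user is identified by an X-User-Id request header.
    user_id = env["HTTP_X_USER_ID"]

    if @flags.enabled?(@flag, user_id)
      return [503, { "content-type" => "text/plain" }, ["Simulated outage (game day)"]]
    end

    @app.call(env)
  end
end

# Usage: wrap any Rack-compatible app and opt a single user into the failure.
app = ->(_env) { [200, { "content-type" => "text/plain" }, ["ok"]] }
flags = FeatureFlags.new
flags.enable(:simulate_outage, "stylist-42")
stack = SimulatedOutage.new(app, flags: flags)
```

Scoping the flag to individual users is one way to keep a game day from affecting everyone: only opted-in accounts see the simulated failure, and removing the flag ends the exercise.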