In her talk 'Herding Cats to a Firefight' at EuRuKo 2016, Grace Chang, an engineer and on-call tech lead at Yammer, discusses the challenges of managing on-call duties and keeping systems reliable in a growing tech environment. She uses the metaphor of 'herding cats' to illustrate how hard it is to coordinate teams in high-pressure situations where everyone's attention is divided.

Key points discussed include:

- **Understanding On-Call Dynamics**: Grace recounts the initial lack of an effective on-call system at Yammer and describes a critical incident that led to the establishment of a proper on-call team.
- **Defining Key Terms and Metrics**: She introduces Mean Time Between Failures (MTBF), Mean Time to Recovery (MTTR), Service Level Agreement (SLA), After Action Review (AAR), and Incident Report (IR). These concepts are central to evaluating system performance and recovery strategies.
- **Balancing Responsibilities**: Grace emphasizes the difficulty of balancing MTBF and MTTR to optimize both system stability and engineer workload. She compares these metrics to the frequency of cat mishaps, which makes them relatable and easier to understand.
- **Iterative Improvements**: The on-call team went through several iterations, using tools like Jira and Yammer Notes. Strategies included dividing responsibilities by technology stack and onboarding new team members so that everyone felt comfortable when on-call.
- **Conducting Reviews**: Grace stresses the importance of post-mortems and retrospectives after incidents to foster a culture of learning from errors rather than placing blame. Regular handovers and documentation support knowledge sharing and help manage potential burnout among engineers.
- **Future Goals**: Despite the progress made, Grace acknowledges ongoing challenges such as managing alert volume. She aims to reduce the number of alerts engineers encounter and to keep the workload balanced.

In conclusion, Grace argues that a robust on-call culture requires collaboration among all team members, and she encourages engineers to take pride in the quality of their code and in the significant role it plays in business success. Her closing message is that a stable system is a collective effort that depends on teamwork.