Description
RubyConf 2016 - Datacenter Fires and Other "Minor" Disasters by Aja Hammerly

Most of us have a "that day I broke the internet" story. Some are amusing and some are disastrous, but all of these stories change how we operate going forward. I'll share the amusing stories behind why I always take a database backup, why feature flags are important, the importance of automation, and how having a team with varied backgrounds can save the day. Along the way I'll talk about a data center fire, deleting a production database, and accidentally setting up a DDoS attack against our own site. I hope that by learning from my mistakes you won't have to make them yourself.
Summary
In her talk "Datacenter Fires and Other 'Minor' Disasters" at RubyConf 2016, Aja Hammerly shares her personal experiences with significant mishaps in software development and operations. The session revolves around the lessons learned from various disasters, both amusing and catastrophic, in the context of tech and team dynamics.

**Key Points Discussed:**

- **The Importance of Backups:** Aja recounts a solo release where she accidentally corrupted the production database by pushing the wrong branch. She highlights the critical need for database backups: because she had taken one beforehand, she was able to restore everything successfully.
- **Automation as a Safety Net:** She emphasizes automating release processes to minimize human error, especially during low-energy times like midnight. Automating rollbacks and migrations in Rails is presented as a crucial strategy.
- **Crisis Handling with Diverse Teams:** A fire in a data center caused outages during an important checkout process. Aja explains how having a diverse team with varied skills enabled quick recovery and smart decision-making during the crisis.
- **Learning from Incidents:** She shares a story of a fire at a colocation service that forced her company to rethink its infrastructure and backup strategies. Lessons from this incident prompted the company to improve its disaster recovery planning.
- **Value of Communication and Trust:** A significant takeaway is the importance of transparent communication and building trust within teams to foster a learning culture and improvement after a crisis.
- **Embrace Diversity:** Aja underscores that having team members with different backgrounds and skills can tremendously benefit crisis management.

**Significant Anecdotes:**

- While working as a QA engineer, Aja nearly faced disaster during her first solo release but managed to recover because she had taken a backup.
- Another fire incident required her company to manage without credit card processing for days, highlighting the value of having a fallback plan during outages.
- She emphasizes the importance of collaborative knowledge sharing, involving everyone in operational knowledge to prevent silos.

**Conclusions/Takeaways:**

Aja concludes her talk with the mantra that everyone, regardless of expertise, makes mistakes. The key lessons include automating tasks, ensuring backups, practicing disaster recovery plans, fostering team diversity, and maintaining open communication to handle crises effectively. She invites the audience to share their war stories, enriching the conversation about mishaps and learning moments in tech.