Datacenter Fires and Other 'Minor' Disasters

In this talk, Aja Hammerly, a Developer Advocate at Google Cloud Platform, shares stories from her career about disasters that can strike in tech and how to learn from them to improve processes and resilience. She emphasizes having a backup strategy, the benefits of automation, and the way team diversity improves problem-solving. Key points from her presentation include:

Emphasizing Correct Backup Protocols: Hammerly recounts corrupting a production database late at night while performing a release without a backup. The lesson is to always automate backups and to keep a safety mechanism in place, like a 'big red rollback button.'
Case Study of a Data Center Fire: Hammerly describes a data center fire at a credit card processor that disrupted service. The incident highlighted the need for systems that can isolate external dependencies to avoid complete outages.
Electrical Maintenance Incident: In another incident, an upgrade resulted in complete power loss in their section of the facility. The lesson is to have recovery processes in place and to spread hardware across different locations for resilience.
Innovation Under Pressure: Hammerly tells the story of a demo in Japan that was nearly thwarted by incompatible phone adapters; the team improvised with soldering irons to make the demo work. This underscores the importance of verifying details rather than making assumptions.
Client-Side Issues Leading to Downtime: She also recounts how malformed messages from clients overwhelmed their WebSocket application, another example of how assumptions can lead to significant disruptions.
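
The idea of isolating an external dependency, such as the credit card processor in the data center fire story, can be sketched with a feature flag. This is a minimal illustration, not code from the talk: the names `FeatureFlags`, `charge_card`, `handle_checkout`, and `PaymentsUnavailable` are all hypothetical, and the in-memory flag store is an assumption made to keep the example self-contained.

```python
# Hypothetical sketch: every name here is illustrative, not from the talk.


class PaymentsUnavailable(Exception):
    """Raised when payments are switched off or the processor cannot be reached."""


class FeatureFlags:
    """In-memory flag store (assumption: a real system would share this config)."""

    def __init__(self, **flags):
        self._flags = dict(flags)

    def enabled(self, name):
        # Unknown flags are treated as off.
        return self._flags.get(name, False)

    def disable(self, name):
        self._flags[name] = False


def charge_card(order_id, amount_cents, processor, flags):
    """Call the external processor only while the 'payments' flag is on."""
    if not flags.enabled("payments"):
        raise PaymentsUnavailable("payments are temporarily disabled")
    try:
        return processor(order_id, amount_cents)
    except ConnectionError as exc:
        # Processor unreachable: turn the flag off so later requests fail fast
        # instead of each one waiting on a dead dependency.
        flags.disable("payments")
        raise PaymentsUnavailable("payment processor unreachable") from exc


def handle_checkout(order_id, amount_cents, processor, flags):
    """Keep the rest of the site working even when payments are down."""
    try:
        receipt = charge_card(order_id, amount_cents, processor, flags)
        return {"status": "paid", "receipt": receipt}
    except PaymentsUnavailable as exc:
        return {"status": "payment_deferred", "reason": str(exc)}
```

With this shape, a processor outage degrades checkout to a "payment deferred" response rather than taking the whole site down, and an operator (or automation) can turn the flag back on once the dependency recovers.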

Throughout the talk, Hammerly stresses the importance of trust, communication, collective knowledge, ownership, and learning from failures within tech teams. These experiences and their lessons are meant to encourage developers to build resilience and adaptability into their technology environments. The concluding message highlights the value of diverse teams in enhancing problem-solving capabilities.

Aja Hammerly • February 09, 2017 • Earth

http://www.rubyconf.org.au

Most of us have a “that day I broke the internet” story. Some are amusing and some are disastrous, but all of these stories change how we operate going forward. I’ll share the amusing stories behind why I always take a database backup, why feature flags are important, the importance of automation, and how having a team with varied backgrounds can save the day. Along the way I’ll talk about a data center fire, deleting a production database, and accidentally setting up a DDoS attack against our own site. I hope that by learning from my mistakes you won’t have to make them yourself.

RubyConf AU 2017