Datacenter Fires and Other 'Minor' Disasters

http://www.rubyconf.org.au

Most of us have a “that day I broke the internet” story. Some are amusing and some are disastrous but all of these stories change how we operate going forward. I’ll share the amusing stories behind why I always take a database backup, why feature flags are important, the importance of automation, and how having a team with varied backgrounds can save the day. Along the way I’ll talk about a data center fire, deleting a production database, and accidentally setting up a DDOS attack against our own site. I hope that by learning from my mistakes you won’t have to make them yourself.

RubyConf AU 2017

00:00:11.160 Happy Friday, everyone! I'm the last speaker of the conference, which means I get to do the awesome thing and say thank you to all the organizers, volunteers, and everyone else who made this fantastic event possible. I have very much enjoyed my first time in Australia, and this conference has been amazing.
00:00:31.000 Since it is Friday, I’m looking for some friendly faces in the audience to share a Friday hug. I know some of you just came in from another track, but many of us have been sitting for a while, so let's all stand up and share a Friday hug. It's still technically Thursday night back in Seattle, where Aaron and I are from, so I’ll tweet this to them, and they will wake up to a lovely Friday hug from Australia. I may also send this to my boss to prove that there are a lot of Rubyists in Australia.
00:01:14.960 So, my name is Aja, and I like dinosaurs a lot! One of the coolest things I did while here was visit an opal museum. They had an opalized plesiosaur skeleton that made me go out of my mind with glee! I'm from Seattle, so if I’m talking funny, you’ll know why. This is what it looks like back home in the winter; that's snow, which doesn't happen normally, but it’s okay because I came to Australia early.
00:01:38.280 While I was on a beautiful white sand beach enjoying the ocean, I may have rubbed it in a bit. Seattle is beautiful, maybe not as beautiful as Melbourne, but still lovely. As mentioned, I work for Google Cloud Platform as a Developer Advocate. Someone tweeted at me to ask if I had managed to bring a data center with me, but it didn’t fit in my luggage. However, we are building a data center in Sydney this year.
00:02:20.480 If you have any questions about Google Cloud, Kubernetes, or anything else related to what we do, feel free to ask me. You can hit me up on Twitter or email me at [email protected], though my phone isn't connected to the internet during the talks, so I may be slow to respond.
00:02:52.200 Now, let’s dive into my actual story: Data center fires and other minor disasters. One of my friends once said that if nothing bad happens, you don’t have a good story to tell. I’ve noticed that when we gather as developers or engineers, we love sharing stories about the day everything went wrong.
00:03:10.000 We’ve all got our big red button stories. Many of them start similarly: 'So this one time, the CEO visited the data center,' or 'Do you know how the fire suppression system at the data center works? Let me tell you!' So grab a spot around our virtual campfire, and I’ll share my boneheaded mistakes and the disasters that have shaped how I write software and build teams.
00:03:39.000 Our first story starts at midnight, since it’s story time. Once upon a time at midnight, I was doing a release for a startup. We did our releases at that hour because ten years ago it was still generally acceptable to take your service offline to run database migrations; most folks wouldn’t put up with that nowadays, and thankfully the tooling has improved. It was also my first time doing the release solo, without a backup person available in case things went wrong.
00:04:07.560 Of course, my backup was on the Trans-Siberian Railway, likely somewhere in the Far East of Russia or Mongolia, making them unreachable if something went south. Naturally, that was the night when things went wrong. The release process was a military-style pre-flight checklist with 37 items, taking at least an hour to complete. I still had to manually cross-check every item.
00:04:41.360 That night, I sat at my desk at home, put up the maintenance page, and notified the team. Then I ran the database migrations and restarted the server with the maintenance page still active, so I could test the release before going live. Everything was standard until I got to the manual test.
00:05:30.160 Using a URL hack to bypass the maintenance page, I found random errors everywhere. I started freaking out, but eventually calmed down enough to realize I had deployed the master branch, not our release branch. That alone might have been fixable, but master contained unapproved database migrations, and those had already run against production.
00:06:05.080 At midnight, my boss was somewhere in Mongolia, and I had just corrupted the production database. Fortunately, as part of our checklist, I had taken a full database backup before starting the release! And since I had practice restoring the database from a backup, the only thing standing between me and a fix was dropping the corrupted production database.
00:06:54.199 Dropping it was incredibly terrifying, but once it was done, I could restore the database from the backup. I was nervous, petting my cat, assuring myself it would be okay while my boss urgently emailed me inquiring about the delay.
00:07:23.319 I assured him everything would be fine. The outage ended up about three times longer than planned, but since it was the middle of the night, it was manageable. I later learned that while I had made a mistake, it was one that plenty of other developers have made too.
00:08:00.000 The first lesson I learned was to automate everything. Late at night, when I’m stressed and tired, I’m not at my sharpest. Anything that could break significantly should be automated, including rollbacks. If something goes wrong, a 'big red rollback button' returns everything to normal while allowing you to deal with it calmly afterwards.
00:08:54.680 The second moral is to always have a backup of everything. Ideally that includes a backup for your on-call staff, so no single person is left to handle an incident alone.
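To make those two lessons concrete, here is a minimal sketch of what an automated, revertible release could look like, assuming PostgreSQL and Rails-style tooling; every command, task name, and path below is an illustrative assumption, not the startup’s actual release process.

```ruby
#!/usr/bin/env ruby
# A minimal sketch of an automated release with a "big red rollback button".
# Assumes a PostgreSQL database and a Rails-style app; the commands, task
# names, and paths are illustrative assumptions, not the talk's scripts.

DB     = "myapp_production"
BACKUP = "/var/backups/#{DB}-#{Time.now.strftime('%Y%m%d%H%M')}.dump"

def run!(cmd)
  puts "==> #{cmd}"
  system(cmd) or raise "command failed: #{cmd}"
end

begin
  # 1. Always take a full backup before touching anything.
  run!("pg_dump --format=custom --file=#{BACKUP} #{DB}")

  # 2. Maintenance page up, migrate, restart (hypothetical cap/rake tasks).
  run!("bundle exec cap production maintenance:enable")
  run!("bundle exec rake db:migrate RAILS_ENV=production")
  run!("bundle exec cap production deploy:restart")

  # 3. Smoke-test the site before taking the maintenance page back down.
  run!("bundle exec rake smoke_test RAILS_ENV=production")
  run!("bundle exec cap production maintenance:disable")
rescue => e
  warn "Release failed (#{e.message}); rolling back from #{BACKUP}"
  # The big red button: drop the broken database and restore the backup,
  # leaving the maintenance page up until a human verifies the result.
  run!("dropdb #{DB} && createdb #{DB}")
  run!("pg_restore --dbname=#{DB} #{BACKUP}")
end
```

The point is that the rollback path is scripted ahead of time, so at midnight the only decision left is whether to press the button.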
00:09:17.320 So, moving on to my data center fire story, one day we received a page on our old-school pager that customers were having trouble getting through the checkout process. Upon digging into the logs, we figured out it was a timeout issue with our credit card processor.
00:09:56.560 I contacted their support and found that their website was down too. Peeking through the curtains, we realized we could actually see their network operations center from our office, and it was lit up with a lot of red. It turned out there had been a significant fire at the colocation facility that housed their systems.
00:10:31.000 That facility was also home to several major news stations in Seattle, so there was a considerable evacuation, and the fire department responded with nineteen vehicles. Even after the flames were out, we faced extended downtime because the fire inspector had to certify the building was safe before anything could be powered back on.
00:11:10.080 Ultimately, once we identified the issue, we handled it with something we had already built: the ability to switch off individual external dependencies without taking the main site offline. Checkout was unavailable for several days while the processor recovered, but we kept serving our existing subscribers the whole time.
00:12:00.639 From that experience, I learned the importance of designing systems so you can isolate and switch off individual dependencies, ensuring that a single external failure doesn't bring down the entire system.
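As a rough illustration of that design, here is a minimal sketch of the kill-switch idea, assuming a Rails-style controller; `FeatureFlags`, `PaymentGateway`, and `current_cart` are hypothetical names, not the system from the talk.

```ruby
# Wrap each external dependency behind a flag so it can be turned off
# without taking the whole site down. FeatureFlags and PaymentGateway
# are hypothetical stand-ins for whatever your app actually uses.
class CheckoutController < ApplicationController
  def create
    unless FeatureFlags.enabled?(:credit_card_processor)
      # The processor is down (or on fire): degrade gracefully instead of
      # letting every checkout request time out with an error.
      return render :payments_unavailable, status: :service_unavailable
    end

    charge = PaymentGateway.charge(amount: current_cart.total,
                                   token:  params[:card_token])
    if charge.success?
      redirect_to order_path(charge.order_id)
    else
      render :payment_failed
    end
  end
end
```

Because the flag only guards the checkout path, the rest of the site keeps serving existing subscribers while the external dependency is switched off.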
00:12:59.200 The second fire incident involved a colocation provider upgrading its electrical facilities and moving everything over to diesel generators for the maintenance. They gave us plenty of notice, and we prepared for what we believed would be a seamless switch.
00:13:29.600 Everything went smoothly until, suddenly, we lost power entirely across our section of the facility, including our battery backups. Our colo provider informed us that there had been an incident during the maintenance.
00:14:07.920 Eventually, a co-worker had to drive over to coax our systems back online as we didn’t have an automatic recovery process established. When everything came back up, we received a photo of a severely damaged circuit board, with the admission that our vendor had never encountered such damage before.
00:14:54.800 This incident taught us to have processes in place for restarting our systems and to avoid siloing hardware in one location. The more we spread things across different resources, the more resilient we are.
00:15:41.760 Furthermore, I learned the importance of having collective knowledge of both software and hardware within a team. We must distribute knowledge across our teams to avoid situations where one person is solely responsible for system recovery.
00:16:31.680 Practicing disaster recovery is essential; we were down longer than necessary simply because we had never practiced the recovery plan. Cloud providers are generally much better equipped to handle this kind of hardware failure, which gives you a broader safety net.
00:17:23.200 Now, let me share one more story that isn't mine but comes from my co-workers. At a tech demo in Japan, they discovered a problem with the phones they had imported: the US and Japanese adapters didn't match. Instead of cancelling the demonstration, the team improvised and bought soldering irons.
00:18:02.280 They hand-crafted the connections they needed, and the demo went ahead successfully! From that story I learned not to make assumptions; details matter, and critical components need to be verified to make sure they actually fit together.
00:18:58.680 The last story I want to share is about being your own worst enemy.
00:19:48.720 While working on a client-server WebSockets application, we hit an issue where one client sending a malformed socket message could overload the whole system.
00:20:10.800 After several rounds of restarting everything, we got the system back into a healthy state. Once we identified the root cause, we realized we had assumed our own clients would never send malformed requests; when one did, the resulting feedback loop of errors and retries produced the DDoS-like effect against our own site.
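As a sketch of that lesson, defensively handling inbound socket messages might look something like this; the JSON message shape, the `ws` connection object, and the `dispatch` helper are assumptions made for the example, not the actual application.

```ruby
require "json"

# Never assume your own clients send well-formed messages. Validate every
# frame and drop the bad ones, so one malformed message can't crash the
# handler and kick off a reconnect-and-retry storm against your own site.
def handle_message(ws, raw)
  payload = JSON.parse(raw)
  unless payload.is_a?(Hash) && payload["type"] && payload["body"]
    ws.send({ error: "malformed message" }.to_json)
    return # drop it quietly; don't crash, don't rebroadcast
  end
  dispatch(payload["type"], payload["body"])
rescue JSON::ParserError
  # Bad JSON from a client should never take the server down.
  ws.send({ error: "invalid JSON" }.to_json)
end
```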
00:20:58.000 Now, as this is a closing talk, I want to emphasize: we all mess up! I have made many mistakes over my tech career since 2002, and we’re lucky that, for most of us, there are no severe consequences.
00:21:51.760 Trust is vital when handling a crisis wisely: you need to trust your colleagues and your tools. The middle of an incident is not the time to question a teammate's competence or start hinting that someone communicated poorly.
00:22:47.840 We also need to learn from and share each other's experiences, and keep the conversation focused on solutions rather than negativity.
00:23:27.680 Moreover, communication is key, especially under pressure. When I hire, I care more about a person's ability to keep learning than about the knowledge they already have, because that's what matters when things go wrong.
00:24:06.320 Ownership matters just as much: shared responsibility makes a team resilient. We also need blameless postmortems, so failures can be discussed openly without finger-pointing.
00:25:00.240 You want to discuss what went wrong and agree on concrete steps to improve and to prevent it from happening again.
00:26:07.760 Ultimately, diversity on teams cultivates stronger ideas. Different experiences and insights will yield a better problem-solving dynamic.
00:26:55.680 Before wrapping up, I’d like to share my contact information. I would love to connect! You can find me on Twitter as @thamer.
00:27:49.880 I enjoy talking about dinosaurs, and I have a pile of stickers and tiny plastic dinosaurs from Google Cloud, so feel free to come collect some before I head back to Seattle. I have a couple of minutes for questions before we wrap up the conference.
00:30:30.240 (In response to an audience question.) I did it as a way to bond and share experiences, and to encourage more interaction.
00:31:07.640 Yes, that's a picture of my cat.
00:31:10.640 The other question was about how Google engineers its processes to keep these issues from happening again. It comes down to building a comprehensive SRE culture and reliable processes.
00:31:27.200 And of course, we use a variety of monitoring systems and report the issues we run into publicly, learning from each one.
00:31:49.560 I also think we could always do better with our backups. And yes, that last cat picture is the highlight of my Instagram!