Team Dynamics

Datacenter Fires and Other "Minor" Disasters

by Aja Hammerly

In her talk "Datacenter Fires and Other Minor Disasters" at RubyConf 2016, Aja Hammerly shares her personal experiences with significant mishaps in software development and operations. The session revolves around the lessons learned from encountering various disasters, both amusing and catastrophic, in the context of tech and team dynamics.

Key Points Discussed:

- The Importance of Backups: Aja recounts her first solo release, during which she pushed the wrong branch and ran its migrations, corrupting the production database. A backup taken as part of the release checklist, plus her experience restoring backups as a QA engineer, let her restore everything successfully.

- Automation as a Safety Net: She emphasizes automating release processes to minimize human errors, particularly during low-energy times like midnight. Automation, particularly of rollbacks and migrations in Rails, is presented as a crucial strategy.
- Crisis Handling with Diverse Teams: A fire at a data center took down the company's credit card processor, breaking its checkout flow. Aja explains how having a diverse team with varied skills enabled quick recovery and smart decision-making during the crisis.
- Learning from Incidents: She shares a story of a fire at a colocation service that forced her company to rethink their infrastructure and backup strategies. Lessons from this incident prompted them to enhance their disaster recovery planning.
- Value of Communication and Trust: A significant takeaway is the importance of transparent communication and building trust within teams to foster learning culture and improvement post-crisis.
- Embrace Diversity: Aja underscores that having team members with different backgrounds and skills can tremendously benefit crisis management.

Significant Anecdotes:

- While working as a QA engineer, Aja nearly faced disaster during her first solo release but recovered because she had taken a backup.
- Another fire incident required her company to manage without credit card processing for days, highlighting the value of having a fallback plan during outages.
- She emphasizes the importance of collaborative knowledge sharing, involving everyone in operational knowledge to prevent silos.

Conclusions/Takeaways:

Aja concludes her talk with the mantra that everyone, regardless of expertise, makes mistakes. The key lessons include automating tasks, ensuring backups, practicing disaster recovery plans, fostering team diversity, and maintaining open communication to handle crises effectively. She invites the audience to share their war stories, enriching the conversation about mishaps and learning moments in tech.

00:00:15.340 Okay, my watch says that it's time to start, so welcome to "Datacenter Fires and Other Minor Disasters." I'm Aja Hammerly. I like it when people tweet at me during my talks; I am the thagomizer on Twitter. I submitted this talk to a track entitled "War Stories," but that track didn't end up happening. However, I know we've all had that day where we broke the internet or everything went wrong.
00:00:28.910 Lots of folks seem to have a big red button story. I was at a DevOps meet-up in Seattle, and I said, "Hey, tell me all your war stories." Many of the stories began with, "So this one time, the CEO came to the data center and asked, 'What is that big red button for?'" Eventually, the story progressed, and the entire data center went up in flames. Some of my favorite war stories begin with, "Do you know how the fire suppression system at a data center works?" Because I do. I'm going to share with you some of the boneheaded things I’ve done and some odd situations I’ve been in, and how they’ve changed how I write software and build teams.
00:01:05.640 So let's all gather around the virtual campfire because it’s story time. Once upon a time at midnight, I was doing a release for a startup I worked at. We did our releases at midnight because we couldn't do migrations and keep the site online. So we took the site down, did the release, and then brought it back up. This happened to be the first time I was doing the release completely solo because my backup, the secondary on-call, was on the Trans-Siberian Railway somewhere in Russia or Mongolia. I’m not even kidding about that.
00:01:36.740 I was pretty junior at this point; I think I’d only been in the industry for about three or four years. Luckily, my backup, who led our team, was a former military man who believed in being highly organized. He left me with a 30-plus item checklist that was basically the pre-flight checklist for a release of our product. Every single step had to be done in order, and you had to print out the checklist and check off the items while doing them. However, every single step had to be done manually, and it turns out that takes a long time.
00:02:07.580 The first thing you do is notify the team that you're going to start the release and put the maintenance page up. You then notify the team that the site is down, put the new code on the server, run the database migrations, restart the servers, start manually testing, and finally bring the site back up, manually test again, and communicate to the team that the release has been completed. Then, you watch it for 15 minutes to ensure nothing blew up. I got through that process and reached the initial round of manual testing, only to find something was wrong. Not every page was throwing the standard Rails 500 page; some were just missing assets, while others rendered in odd ways.
00:02:44.860 After a couple of minutes of looking at the logs, I realized, "Oh no! I pushed `master` instead of the release branch." While that would have been fine if I had just pushed the release branch on top of it—which would have taken five extra minutes to bring the site back up—it became a problem because I had already run the database migrations. So, I effectively corrupted our production database at 1 a.m. on my very first solo release while my backup was in Mongolia.
00:03:15.360 At this point, I started freaking out. I began instant messaging some friends from the Seattle Ruby community, saying, "Oh my god, what did I do wrong?" They replied, "You know this; you have these skills; you've done this before; you know how to fix this." I took a deep breath and remembered that one of those 30-plus pre-flight checklist steps was to take a database backup. Luckily, because I was working as a QA engineer at the time, I regularly restored backups to our staging server, so I knew exactly how to do it. I could type those commands in my sleep.
00:03:58.640 I thought, "If it works on staging, hopefully, it’ll work on production." I started the restore, and that took about 45 minutes. I had a good period to get my blood pressure down and pace the circle that was my apartment at that time—from the living room through the kitchen and back. I hoped everything would work out; I pushed the correct branch with the correct set of migrations, brought everything back up, and it was fine. I emailed the team, saying the release was successful.
00:04:34.350 I then received an email back from my boss: "We were down a lot longer than we should have been. What happened?" I explained, and he replied, "We're going to talk about that in the morning." I was actually feeling pretty good because I remembered I wasn't the only person who had ever made that particular error. I'd worked at other companies where someone else had made it, including my first release at a relatively large software company right after college, where we were down two and a half hours longer than intended because someone else made almost the same mistake on release night.
00:05:14.680 So, lesson learned: automate everything. People make stupid mistakes at midnight, and I make stupid mistakes at midnight. I wouldn’t have made that error if that lovely checklist had just been automated as a bash script. It would have been easy to do. Also, if you’re going to automate your releases, you should automate your rollback as well. That probably means you need to write your rollback migrations and down migrations in Rails because why would you ever roll back?
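A minimal sketch of that idea, with the checklist expressed as Rake tasks: the task names, commands, branch names, and backup path below are placeholders, not the actual tooling from this story.

```ruby
# Hypothetical sketch: the midnight checklist as Rake tasks instead of a
# printed list, so the steps always run in the same order with no typing
# from memory at 1 a.m. Every command here is a placeholder.
namespace :release do
  desc "Run the release checklist in order"
  task :deploy do
    sh "bundle exec rake maintenance:start"          # put the maintenance page up
    sh "pg_dump myapp_production > pre_release.sql"  # always take a backup first
    sh "git push production release-branch:master"   # push the release branch, not master
    sh "bundle exec rake db:migrate RAILS_ENV=production"
    sh "sudo service app-server restart"
    sh "bundle exec rake maintenance:stop"
  end

  desc "Undo the last release"
  task :rollback do
    # This only works if every migration has a real `down` method.
    sh "bundle exec rake db:rollback RAILS_ENV=production"
    sh "git push --force production previous-release:master"
  end
end
```

Scripting it also means the backup can never be skipped and the branch name is written down once instead of typed from memory at 1 a.m.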
00:05:54.520 Always have a backup. If I hadn't had a backup that night, I would have had to write code on the fly to undo the migrations we had just run, so having a backup is totally useful. Now, I claim this talk is about datacenter fires, so here's a datacenter fire story: once upon a time, someone sent a message to our pager saying customers were having trouble getting through our checkout process.
00:06:09.030 We tried it out and got an error that said, "We’re sorry; we have been unable to charge your credit card. Please contact your credit card provider." We knew that was the error shown no matter what error we got back from the credit card processor. We dug into the logs and saw that our site was timing out while trying to process cards, leading to the conclusion that we had broken something. So, someone suggested, "Why don’t we try processing cards manually?" Our provider had a website for manual processing. However, when trying to process cards, we encountered a timeout error on that page as well, which was concerning.
00:07:01.580 Someone suggested we call the provider, and we were put on hold. A couple of minutes later, our call disconnected, and we decided to check the news. We discovered that a relatively large data center in Seattle, Washington, had caught fire, prompting a response from 19 fire vehicles. The power had gone off, but the generators kicked in as they were supposed to. However, when the fire department arrived, they assessed it as an electrical fire and turned off the generators. Consequently, the entire facility went offline, taking with it two radio stations, a television station, and several co-location service providers, including our credit card processor.
00:07:51.300 Here's an actual picture of the damage from the fire that I found through some news sources in Seattle. Luckily, once we figured out what the issue was, it was easy for us to fix. Because of experiences at a previous job, I had insisted we have a way to turn off the store so we could deactivate credit card processing, renewals, and free trials while still allowing the site to run smoothly. We put all accounts into free-play mode with no store activity, which was great because this particular incident occurred over a holiday weekend.
00:08:37.760 The fire happened on July 3rd, and while the fire department worked to extinguish the flames and restore some facilities, the fire inspector had to check every single co-location facility and duct before giving the go-ahead to bring sections back online. Meanwhile, our credit card processor also needed time to restore their infrastructure, as they were entirely within that data center. We ended up without credit card processing for four days, and if we hadn't structured our system so that accounts up for renewal and expiring free trials kept working with the store turned off, we would have been in a painful situation.
00:09:27.210 This cemented my strong belief that you should make sure you can fall back gracefully if any of your external dependencies fail, and preferably have a way to activate that fallback without redeploying your entire system. We had a console page where an admin could log in, go to a specific URL, check a box to turn off the store, and hit submit. Everything picked up that change almost immediately, resolving the issue quickly.
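As a sketch of that kind of switch (assuming a database-backed flag; `StoreSetting` and `CheckoutsController` are illustrative names, not the actual code from this product), the whole mechanism is one flag checked at request time:

```ruby
# Hypothetical sketch of a store kill switch: one database-backed flag,
# checked on every request, so billing can be turned off without a redeploy.
class StoreSetting < ActiveRecord::Base
  # Single row with a boolean `store_enabled` column, toggled from the
  # admin console checkbox described above.
  def self.store_enabled?
    first && first.store_enabled
  end
end

class CheckoutsController < ApplicationController
  def create
    unless StoreSetting.store_enabled?
      # The processor is down, or an admin flipped the switch: degrade
      # gracefully instead of letting checkout requests time out.
      return render plain: "The store is temporarily unavailable.",
                    status: :service_unavailable
    end
    # ... charge the card through the external processor here ...
    head :ok
  end
end
```

Because the flag lives in the database, flipping the admin checkbox takes effect on the next request, with no deploy or restart.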
00:10:16.890 However, a few weeks later, someone accidentally clicked that box, believing they were in staging while testing something, and we ran without credit card processing for a couple of days. As a result, we added some very obnoxious colors to the production admin console, ensuring no one could mistake being on production. So, the title of this talk is about datacenter fires.
00:10:52.420 There’s another fire story I have from when we experienced the first fire incident with our credit card processor. We took the opportunity to check if our systems were hardened against such eventualities and decided to upgrade our facility. We moved to a really nice mom-and-pop co-location service known for their friendliness and the best Christmas trees I’ve ever seen, complete with ram and peppermint sticks.
00:11:54.220 After about a month, they sent an email announcing mandatory maintenance on the power conditioners that transform line power into something usable by our electronics. We trusted these guys and assumed everything would go smoothly. The scheduled time came, they sent out an announcement that they were on generator power, and we didn't notice even a blip, since we had battery backups in our rack. Everything ran fine for about two hours.
00:12:32.240 Then suddenly, we went down hard. The co-location facility informed us that during the maintenance there had been an incident, all power to our section had been shut off, and all personnel had been evacuated. They assured us they would try to restore our rack within the next hour. If you know anything about how colocation facilities work, when the words 'incident' and 'all personnel evacuated' appear in the same sentence, it means something caught fire and the Halon went off.
00:13:25.380 I looked at the person I worked with on the infrastructure side; we both knew something had to be fixed. He grabbed the go-bag so he'd be ready the moment we were let back into the facility, and I was told to stay at headquarters to handle communications and other company issues. I wanted to head down too, but things had to be turned on in a specific order, and when I asked whether that order was documented anywhere, it wasn't. That's when we realized we had a significant knowledge silo.
00:14:19.420 We managed to get everything back up and were only down for about an hour and a half. Unfortunately, the co-location service couldn’t say the same. A few hours later, they sent us a picture of a fried circuit board that had caught fire, accompanied by the comment that their vendor had never seen anything like that before. They ended up running on diesel for 11 days while they sourced replacement parts.
00:15:02.830 This incident prompted us to re-architect our rack. Previously, our databases and switches ran on one battery and our servers on another. We learned it made more sense to distribute functionality across the batteries so we could bring half the rack online and have a working site, rather than needing to power everything up before anything worked.
00:15:47.390 We sought to avoid any operational silos, emphasizing the need for pairing in our programming, infrastructure, and deployment processes. We began pairing on hardware, ensuring everyone learned how to operate every aspect of our systems. For instance, I rewired our cabinet so I knew how everything worked and got a second set of keys for our rack so that at least two people could enter if something went wrong.
00:16:44.230 An important lesson I learned was the necessity of having a disaster recovery plan and practicing it. We were down longer than anticipated because we hadn’t rehearsed restoring the site after a complete shutdown. Though we had successfully moved colocation facilities before, the current situation was more complicated due to evolving systems and hardware.
00:17:31.840 Now, I prefer using the cloud for managing systems because I want fire-related issues to be someone else's problem. I work with cloud providers who proactively move workloads out of data center sections undergoing maintenance. During the recent hurricane scare on the East Coast, those kinds of precautions meant no one faced issues.
00:18:21.250 Multiple regions also provide redundancy for your infrastructure. This story isn't one of mine, but it begins with "Once upon a time in Japan." I work on a team with many developer advocates at Google, and we show off this interesting demo called 'Cloud Spin.' It uses a variety of phones and elaborate tubing to take pictures and stitch them together in a unique way.
00:19:31.680 Before taking this demo to an event in Japan, we decided it was best to buy the phones locally to avoid customs issues, so someone from our Japanese office was tasked with purchasing 30 of the specific phones, while the team brought the custom parts with them. However, the evening before the event, while setting up, the team realized the Japanese version of the phone had different connectors than the US version, making the parts they had brought useless. Even in Tokyo, they couldn't find adapters after three hours of searching.
00:20:34.600 During a team meeting, someone suggested canceling the demo, but another member insisted, "Nope, I know how to solder!" The team stayed up late soldering connectors to make everything work and managed to pull it off successfully, leading to a well-received demo. This experience taught me the importance of not making assumptions about device compatibility and ensuring someone on the team possesses complementary skills.
00:21:23.720 Now, sometimes you can be your own worst enemy. Once upon a time, I was working on a web application using WebSockets, which needed between 30 and 60 frames a second to operate satisfactorily. While in beta, we noticed traffic and memory usage spiking out of control. Servers began shutting down and restarting.
00:22:35.420 After some debugging, we hit the big red button: we shut everything down and brought it back up, which got things back to normal. We spent the rest of the week adding logging so we could identify the issue if it recurred. Then I went on vacation, and the same issue resurfaced, this time causing prolonged instability while I was away.
00:23:47.760 After returning, I started analyzing logs and discovered that a malformed socket message had been saved to the database. Whenever that message hit a server, every client connected to that server was forced offline; as those clients retried their connections, they inadvertently resent the bad message, which took down each server they reconnected to in turn.
00:24:39.730 The moral of that story is you can often be your own worst enemy. As you design systems, think about all the ways things can fail, and design against those possibilities. Implementing incremental backoff strategies is also critical. I emphasize that we all mess up—many can identify with stories of breaking their site or experiencing mishaps or disasters.
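A minimal sketch of the incremental-backoff idea mentioned above (the `connect` call stands in for whatever opens the WebSocket; this is not the actual client code from that project):

```ruby
# Hypothetical sketch of incremental (exponential) backoff with jitter for a
# reconnecting client. Each failed attempt waits roughly twice as long, plus
# a random offset, so a crowd of disconnected clients does not retry in lockstep.
def connect_with_backoff(max_attempts: 8, base_delay: 1.0, max_delay: 60.0)
  attempts = 0
  begin
    connect # stand-in for opening the WebSocket
  rescue IOError, Errno::ECONNREFUSED
    attempts += 1
    raise if attempts >= max_attempts

    delay = [base_delay * (2**attempts), max_delay].min
    sleep(delay + rand * delay * 0.5) # jitter spreads the retries out
    retry
  end
end
```

The doubling delay plus jitter keeps a fleet of disconnected clients from all retrying, and resending the same bad message, at the same instant.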
00:25:33.640 What saves us in these moments is trust: trusting coworkers and trusting tools. If you can't do that, you should either learn your tools or find different coworkers. Learning from bad experiences matters, too; I insisted on having a mechanism to turn the store off precisely because of a prior issue with an external dependency.
00:26:42.540 We also need to communicate clearly and honestly within our teams. If something goes wrong, be transparent about it. Strong group ownership can counter siloed operations. New members on your team should be included on field trips to the data center, and they should shadow you during releases, so everyone understands the processes.
00:27:51.020 The value of sharing knowledge shows up in post-mortems, where everyone involved in an incident is trusted to have acted with good intentions. Trust and the absence of blame foster real learning and improvement, letting us make our systems better for the future. I share these stories as a way of showing why these ideas matter.
00:28:43.370 Having people with diverse backgrounds and skills has saved us during crises, whether it's a coworker who knows how to solder or someone who stays calm in stressful situations. It's vital that we value and cultivate diverse perspectives, which means listening and being open to every team member's suggestions so that everyone's experience can contribute to crisis management.
00:29:45.920 Thank you all for joining me today. Here's my contact information; I'm thagomizer on Twitter and GitHub, and I blog at ajahammerly.com about technology and issues in tech culture. I work on Google Cloud Platform as a developer advocate, focusing primarily on DevOps and Ruby. If you'd like to talk about running a Ruby site, implementing DevOps practices, or exploring containers, I can help! Also, I always bring stickers and tiny plastic dinosaurs!
00:30:32.070 Since this talk is right before lunch, I’d love to take questions. Additionally, please share your own war stories—those times the CEO visited the data center or incidents where the internet was broken—at my lunch table. One of my favorite parts of conferences is hearing about how people have messed up!
00:31:29.290 So, does anyone have questions? The first question was about convincing a team resistant to better engineering practices, such as gradual backoff or isolating external dependencies. I've worked on teams that didn't implement several key practices, and one piece of advice is to pick your battles wisely.
00:32:02.480 You can let them experience pain from failure to illustrate why these practices are important. It may not seem gracious, but exposing them to failure can help solidify the need for these changes in a real-world context. Thank you all! Let's go have lunch.