Y2K and Other Disappointing Disasters: How To Create Fizzle

by Heidi Waterhouse

In her talk at RubyConf 2017, Heidi Waterhouse explores the themes of risk reduction and harm mitigation, using the Y2K phenomenon as a case study. She highlights how proactive measures in technology and infrastructure can prevent disasters or minimize their impact when they do occur.

Key Points Discussed:
- Understanding Risk Reduction and Harm Mitigation: Waterhouse explains the distinction between risk reduction, which aims to prevent disasters from happening (e.g., anti-lock brakes, vaccinations), and harm mitigation, which seeks to lessen the impact of disasters that do occur (e.g., fire sprinklers, seatbelts).
- The Y2K Experience: She recounts the extensive preparations taken by tech professionals, including system upgrades and rigorous testing, which ultimately resulted in minimal disruption when the year 2000 arrived.
- The Importance of Infrastructure: Waterhouse emphasizes how societies can learn from past disasters, particularly through examples such as building codes that save lives following catastrophic events like earthquakes or fires.
- Real-Life Examples of Failure and Success: She discusses the Grenfell Tower fire in London, where lack of safety regulations led to significant loss of life, contrasting it with the successful evacuation from the Torch Tower fire in Dubai due to stricter codes.
- Complex Systems and Interconnected Failures: The talk explains that modern software systems are complex and can lead to unpredictable failures, as seen in LinkedIn's issues with simultaneous feature activation.
- Strategies for Disaster Prevention: She recommends strategies like utilizing microservices, implementing automation for recovery processes, and designing systems for better flexibility and resilience.

Conclusions and Takeaways:
- Embracing a proactive mindset towards risk management can significantly reduce the likelihood of catastrophic failures.
- It is crucial to recognize that while complete risk elimination is impossible, effective planning and mitigation strategies can prepare systems and individuals to handle adverse outcomes more effectively.
- Organizations should continuously test their systems and improve upon failures to build a culture of resilience against disasters. Waterhouse encourages professionals to think critically about potential failure points and to actively work on strategies for prevention.

00:00:10.610 Hello everyone. Today, I will discuss Y2K and other disappointing disasters.
00:00:17.850 I must give a verbal warning: I will talk about disasters and people dying, not just minor mishaps.
00:00:25.619 If that kind of topic is too much to handle right after lunch, feel free to leave. I wanted to give you a heads-up.
00:00:32.399 So how many of you were working in software or technology back in 1999?
00:00:40.980 I see a few hands raised — the few, the proud, the brave. In December 1999, a young sysadmin I knew made a bet with her boss.
00:00:49.489 She bet that their systems wouldn't go down, with her job on the line if anything failed.
00:01:00.289 Her boss insisted she stay at the servers all night to monitor them, but she wanted to attend a party.
00:01:06.200 At that time, she was in her twenties and was more excited about driving from Des Moines to Minneapolis.
00:01:12.390 It was a four-hour trip, and she wasn't concerned about Y2K at all.
00:01:18.180 Many of you remember pagers from that era. The modern ones were much better than the older versions.
00:01:27.060 Imagine this: you would wear them on your belt until you dropped them in the toilet.
00:01:33.240 When you received a page, you had to find a payphone to call back.
00:01:38.400 You would get a page with a number and maybe a little snippet of data about what went wrong.
00:01:44.759 If you were at a club, you might see someone looking haggard outside, trying to call a number.
00:01:51.420 Sometimes, you even had to hang out of car windows to use payphones.
00:01:58.829 They weren't conveniently located, and regardless of what the issue was, you had to fix it on-site.
00:02:05.250 This was long before remote administration was available, as we were still waiting on the year 2000.
00:02:12.010 Our sysadmin had her pager, but she knew it wouldn't help her four hours away.
00:02:17.110 She thought it was as good as nothing, so she headed to the party.
00:02:23.590 She knew nothing was going to go wrong because she had spent all of 1999 upgrading and hardening the systems.
00:02:29.830 This was not a trivial deployment; it required hand-wiring floppy drives to upgrade routers.
00:02:35.440 She had to reach out to people to send her physical pieces of software.
00:02:43.300 It may sound alarmist now, but we spent a lot of time on patches and fixes to prepare for Y2K.
00:02:48.850 Our sysadmin had traced back every piece of hardware, software, and firmware.
00:02:55.060 She had done everything to ensure the systems would work properly.
00:03:01.390 We are now getting further away from the reality of Y2K mitigation.
00:03:06.940 How many of you were in school during Y2K? This was not your problem.
00:03:16.450 A lot of planning went into reducing risks and ensuring systems held up.
00:03:22.690 Utilities did fail, but those failures occurred in isolation during testing, before the actual date change.
00:03:30.850 We avoided a catastrophic cascade failure because preparations had been made.
00:03:37.470 About a quarter to half of tech professionals did not celebrate but stayed on high alert.
00:03:43.000 They remained in server rooms and war rooms to ensure that the world did not end.
00:03:49.330 Back then, the idea of push updates didn’t exist, and regular updates were barely a thing.
00:03:55.660 The longer your system stayed up, the better; taking it down was a sign of failure.
00:04:05.380 People counted their server uptime in years as a badge of honor.
00:04:11.470 The space program is often romanticized as a massive success.
00:04:18.729 But the reality of managing Y2K was a nightmare compared to that.
00:04:25.370 It's easy to say Y2K was overblown now, but that’s because we worked hard to avoid issues.
00:04:32.210 Many forget disasters that never happened.
00:04:37.820 Today, we don't remember all the near-miss accidents or close calls.
00:04:43.490 All of this preparation for Y2K was risk reduction.
00:04:51.700 We acted on something that could have caused problems to prevent it.
00:04:57.890 We engage in risk reduction in many ways: vaccinations, anti-lock brakes, and more.
00:05:04.070 We constantly make trade-offs with risk, knowing we can't avoid it completely.
00:05:09.830 Though air travel is objectively safer than driving, we choose to drive.
00:05:15.500 Feeling in control influences our decisions; we must accept that risk is inherent.
00:05:20.930 As a cyclist in Minneapolis, I know the risks and choose to mitigate them, such as wearing bright clothes.
00:05:27.210 I also carry life insurance to manage the risk of cycling.
00:05:34.260 I work to ensure my family is protected, recognizing lifestyle choices come with inherent risks.
00:05:40.020 In my work, I add documentation to our APIs, understanding the associated risks.
00:05:46.020 We have a standard review process to catch any errors before pushing updates.
00:05:51.960 However, we often overlook these risk reduction strategies in our daily operations.
00:05:57.300 Consider the finite state machine; it allows us to predict outcomes and assess transitions.
00:06:03.930 By understanding potential states and transitions, we can avoid failures.
00:06:09.790 For instance, validating the input to a zip code field rejects malformed values early and reduces errors.
00:06:18.139 We need to identify states and fix potential risks, leading to more stable outcomes.
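[Editor's illustration, not from the talk] The state-machine and zip code remarks can be sketched roughly as follows. The order states, the transition table, and the helper names `is_valid_zip` and `transition` are hypothetical, and the pattern assumes US-style 5-digit or ZIP+4 codes.

```python
import re

# Hypothetical states for an order, with the only transitions we allow.
ALLOWED_TRANSITIONS = {
    "cart": {"checkout"},
    "checkout": {"paid", "cart"},
    "paid": {"shipped"},
    "shipped": set(),  # terminal state: nothing may follow
}

# Assumption: US zip codes, either 5 digits or ZIP+4 (12345-6789).
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def is_valid_zip(text: str) -> bool:
    """Return True only if the whole input is a 5-digit or ZIP+4 code."""
    return ZIP_PATTERN.fullmatch(text.strip()) is not None


def transition(current: str, target: str) -> str:
    """Move to `target` if the table allows it; otherwise refuse loudly."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Because every legal transition is listed in one table, an unexpected state change fails immediately instead of propagating bad data downstream.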
00:06:24.389 On the other hand, harm mitigation kicks in when risk reduction fails.
00:06:31.149 We install safety measures, like seat belts and fire detectors, expecting the worst.
00:06:37.360 If disaster strikes, we want to ensure we can survive it.
00:06:44.160 Building codes are a concrete example of harm mitigation.
00:06:51.660 They exist to limit the damage when something goes wrong, much as seat belts do in car accidents.
00:06:58.900 In cases of emergency, we want to minimize the damage.
00:07:05.230 Looking back at natural disasters, Mexico City suffered greatly from an 8.0 earthquake in 1985.
00:07:12.270 An estimated ten thousand people died due to poor building conditions and practices.
00:07:19.910 This led to stricter building codes that saved lives in future disasters.
00:07:26.120 During the 7.1 earthquake in 2017, newer buildings survived due to these enhanced building codes.
00:07:33.170 In fact, the estimated death toll was significantly lower because of prior learnings.
00:07:40.070 They now even have early warning systems in place to alert residents.
00:07:47.120 They recognize the threat of earthquakes and take steps to mitigate the consequences.
00:07:53.210 A counterexample, where building codes failed, is the Grenfell Tower fire in London.
00:08:01.000 This residential building suffered catastrophic consequences due to a lack of safety regulations.
00:08:10.029 Many died because of flammable cladding and insufficient emergency exits.
00:08:16.160 The building's design overlooked critical safety measures.
00:08:22.560 Lack of proper evacuation procedures only heightened the tragedy.
00:08:29.700 Contrast that with the Torch Tower in Dubai, where stricter regulations prevented loss of life.
00:08:37.080 While it also caught fire, everyone inside was able to evacuate safely.
00:08:44.930 The difference lies in building codes and emergency preparedness.
00:08:51.580 So, what does harm mitigation look like in our lives?
00:08:58.300 It involves recognizing that we can never fully eliminate risk.
00:09:05.270 As a parent, I would love to shield my children from all harm, but that's unrealistic.
00:09:11.370 Instead, I can teach them about boundaries and ensure they know safe practices.
00:09:18.320 Once we accept that risk exists, we can begin to manage it effectively.
00:09:25.900 You must understand your product's failure mechanisms for effective harm mitigation.
00:09:32.650 Different contexts lead to varying levels of risk, so consider your threat models.
00:09:39.830 A harmless bug for some could lead to catastrophic issues for others.
00:09:45.960 Failure means different things in various environments; context matters.
00:09:52.680 Thus, we need professionals focused on testing to help find issues before they arise.
00:10:02.040 They will think outside the box and identify what could go wrong.
00:10:09.000 When we discuss harm mitigation, we must identify what we aim to protect.
00:10:16.000 For instance, in nuclear power, systems automatically shut down in emergencies.
00:10:23.050 If power or control is lost, safety mechanisms activate to prevent disasters.
00:10:30.050 Conversely, a time-lock safe protects money at the expense of human safety.
00:10:37.000 Computer systems focus primarily on protecting data.
00:10:44.000 Graceful shutdowns are essential when power is lost.
00:10:50.160 Laptops often save data in critical moments, proving how much we value our information.
00:10:57.050 My experience at Microsoft taught me that access control is crucial to protecting sensitive data.
00:11:04.000 Once we secured user data, we prioritized preventing breaches over protecting hardware.
00:11:11.440 Predicting possible states lets us understand potential vulnerabilities.
00:11:18.360 Continuous improvement is key to minimizing risks in software.
00:11:24.060 Implementing kill switches enhances our ability to respond to failures.
00:11:32.090 When things spiral out of control, the option to hit the kill switch is invaluable.
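[Editor's illustration, not from the talk] A kill switch can be as small as a flag checked before a risky code path runs. The sketch below keeps the flags in memory; the `KillSwitch` class, the `recommendations` function, and the feature name are hypothetical, and a real deployment would more likely read flags from a shared store or a feature-flag service.

```python
import threading


class KillSwitch:
    """In-memory registry of features that have been switched off."""

    def __init__(self) -> None:
        self._disabled: set[str] = set()
        self._lock = threading.Lock()  # flags may be flipped from another thread

    def kill(self, feature: str) -> None:
        with self._lock:
            self._disabled.add(feature)

    def restore(self, feature: str) -> None:
        with self._lock:
            self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        with self._lock:
            return feature not in self._disabled


def recommendations(switch: KillSwitch, user_id: str) -> list[str]:
    """Hypothetical feature that degrades to an empty list when killed."""
    if not switch.is_enabled("recommendations"):
        return []  # degraded but safe: the rest of the page still renders
    return [f"item-for-{user_id}"]
```

The design choice worth noting is that the killed path returns something harmless instead of raising, so turning a feature off does not itself cause a new failure.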
00:11:39.080 Failure is inevitable; we will all encounter setbacks.
00:11:46.940 We must view disasters as a series of interconnected failures.
00:11:53.020 Understanding these nuances helps us avoid large-scale catastrophes.
00:12:01.120 Complex systems like those in modern software can lead to unpredictable failures.
00:12:08.090 For example, LinkedIn once ran into problems after activating all of its features simultaneously.
00:12:15.160 This action caused significant problems, highlighting how interconnected our systems are.
00:12:22.300 In today's software landscape, tracking data flow is more challenging than ever.
00:12:30.000 Testing must encompass microservices, connections, and their multiple permutations.
00:12:37.110 The complexity of microservices amplifies testing hurdles.
00:12:44.000 You can no longer expect complete test coverage; it's unrealistic.
00:12:51.100 Remember, every disaster results from compounded failures.
00:12:58.000 Natural disasters exemplify the many contributing factors that lead to catastrophe.
00:13:06.000 Consider an event with multiple failure modes, such as a hurricane.
00:13:13.080 Wind and storm surges can devastate infrastructure, leading to widespread chaos.
00:13:20.218 It’s not just about the immediate damage but also the failure of utilities, increasing the severity.
00:13:27.000 Disasters often stem from a variety of compounding failures that intensify the situation.
00:13:34.000 A significant example is the I-35W bridge collapse in Minneapolis, which resulted in 13 fatalities.
00:13:40.500 The primary cause? Undersized gusset plates compounded by additional road surface layers.
00:13:48.000 We discovered that several compounding factors led to one disastrous outcome.
00:13:54.000 In fact, it was only through over-engineering that it lasted 40 years.
00:14:02.000 Any combination of factors could generate disasters, prompting us to build better systems.
00:14:10.000 Chaos engineering has gained recognition, helping avoid major breakdowns.
00:14:17.000 It tests systems under stress to ensure robustness against unexpected issues.
00:14:23.000 As we identify weaknesses, we can determine if our systems are adequately prepared.
00:14:29.000 Here are strategies to reduce the potential of disasters.
00:14:35.000 Utilize microservices to promote resiliency through loose coupling.
00:14:42.000 Aim for internal APIs that minimize cascading failures.
00:14:48.850 Consider hosting your status page on separate infrastructure so you can still communicate clearly during an outage.
00:14:55.000 Maintain replicas of essential data for reliability in case of emergencies.
00:15:01.500 Test your features in production scenarios to gauge performance accurately.
00:15:09.000 All modern software must undergo live testing to validate responsiveness.
00:15:16.000 As humans are the slow, fallible part of our systems, automation becomes vital.
00:15:23.000 Incorporate circuit breakers and kill switches for rapid response in emergencies.
00:15:29.000 The automation of recovery processes enhances resilience against failures.
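[Editor's illustration, not from the talk] One common way to automate a response to a failing dependency is a circuit breaker: after repeated failures it stops calling the dependency for a cool-down period, then lets a single trial call through. This is a minimal single-threaded sketch; the class name, thresholds, and injectable `clock` parameter are assumptions made for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self._clock = clock               # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers would wrap each request to the dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve a fallback.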
00:15:35.000 For every detected issue, ensure a plan is in place to react swiftly.
00:15:41.000 Your post-failure plan should include automation for seamless recovery.
00:15:47.000 Proactively scrutinize your load testing to prepare against large traffic spikes.
00:15:54.000 The phenomenon of being slashdotted can suddenly multiply demand on your servers.
00:16:01.000 Planning leads to smoother recovery; if hiccups occur, you can manage them better.
00:16:08.000 Despite any risk, remember to expect failure and embrace flexibility.
00:16:16.000 Ultimately, equip your systems to handle problems without complete breakdowns.
00:16:23.720 I work for LaunchDarkly, specializing in feature flags as a service.
00:16:30.000 If you’re interested in a free t-shirt, please take a photo of this slide and visit our site.
00:16:37.110 Thank you for your time!