00:00:10.610
Hello everyone. Today, I will discuss Y2K and other disappointing disasters.
00:00:17.850
I must give a verbal warning: I will talk about disasters and people dying, not just minor mishaps.
00:00:25.619
If that kind of topic is too much to handle right after lunch, feel free to leave. I wanted to give you a heads-up.
00:00:32.399
So how many of you were working in software or technology back in 1999?
00:00:40.980
I see a few hands raised — the few, the proud, the brave. In December 1999, a young sysadmin I knew made a bet with her boss.
00:00:49.489
She bet that their systems wouldn't go down, with her job on the line if anything failed.
00:01:00.289
Her boss insisted she stay at the servers all night to monitor them, but she wanted to attend a party.
00:01:06.200
At that time, she was in her twenties and was more excited about driving from Des Moines to Minneapolis.
00:01:12.390
It was a four-hour trip, and she wasn't concerned about Y2K at all.
00:01:18.180
Many of you remember pagers from that era. The modern ones were much better than the older versions.
00:01:27.060
Imagine this: you would wear them on your belt until you dropped them in the toilet.
00:01:33.240
When you received a page, you had to find a payphone to call back.
00:01:38.400
You would get a page with a number and maybe a little snippet of data about what went wrong.
00:01:44.759
If you were at a club, you might see someone looking haggard outside, trying to call a number.
00:01:51.420
Sometimes, you even had to hang out of car windows to use payphones.
00:01:58.829
They weren't conveniently located, and regardless of what the issue was, you had to fix it on-site.
00:02:05.250
This was long before remote administration was available, as we were still waiting on the year 2000.
00:02:12.010
Our sysadmin had her pager, but she knew it wouldn't help her four hours away.
00:02:17.110
She thought it was as good as nothing, so she headed to the party.
00:02:23.590
She knew nothing was going to go wrong because she had spent all of 1999 upgrading and hardening the systems.
00:02:29.830
This was not a trivial deployment; it required hand-wiring floppy drives to upgrade routers.
00:02:35.440
She had to reach out to people to send her physical pieces of software.
00:02:43.300
The patches and fixes were alarmist; we spent a lot of time preparing for Y2K.
00:02:48.850
Our sysadmin had traced back every piece of hardware, software, and firmware.
00:02:55.060
She had done everything to ensure the systems would work properly.
00:03:01.390
We are now getting further away from the reality of Y2K mitigation.
00:03:06.940
How many of you were in school during Y2K? This was not your problem.
00:03:16.450
A lot of planning went into reducing risks and ensuring systems held up.
00:03:22.690
Utilities did fail, but they failed in isolation, during extensive testing ahead of the actual rollover.
00:03:30.850
We avoided a catastrophic cascade failure because preparations had been made.
00:03:37.470
Somewhere between a quarter and half of tech professionals did not celebrate; they stayed on high alert.
00:03:43.000
They remained in server rooms and war rooms to ensure that the world did not end.
00:03:49.330
Back then, the idea of push updates didn’t exist, and regular updates were barely a thing.
00:03:55.660
The longer your system stayed up, the better; taking it down was a sign of failure.
00:04:05.380
People counted their server uptime in years as a badge of honor.
00:04:11.470
The space program is often romanticized as a massive success.
00:04:18.729
But the reality of managing Y2K was a nightmare compared to that.
00:04:25.370
It's easy to say Y2K was overblown now, but that’s because we worked hard to avoid issues.
00:04:32.210
Many forget disasters that never happened.
00:04:37.820
Today, we don't remember all the near-miss accidents or close calls.
00:04:43.490
All of this preparation for Y2K was risk reduction.
00:04:51.700
We acted on something that could have caused problems to prevent it.
00:04:57.890
We engage in risk reduction in many ways: vaccinations, anti-lock brakes, and more.
00:05:04.070
We constantly make trade-offs with risk, knowing we can't avoid it completely.
00:05:09.830
Though air travel is objectively safer than driving, we often still choose to drive.
00:05:15.500
Feeling in control influences our decisions; we must accept that risk is inherent.
00:05:20.930
As a cyclist in Minneapolis, I know the risks and choose to mitigate them, such as wearing bright clothes.
00:05:27.210
I also carry life insurance to manage the risk of cycling.
00:05:34.260
I work to ensure my family is protected, recognizing lifestyle choices come with inherent risks.
00:05:40.020
In my work, I add documentation to our APIs, understanding the associated risks.
00:05:46.020
We have a standard review process to catch any errors before pushing updates.
00:05:51.960
However, we often overlook these risk reduction strategies in our daily operations.
00:05:57.300
Consider the finite state machine; it allows us to predict outcomes and assess transitions.
00:06:03.930
By understanding potential states and transitions, we can avoid failures.
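As a rough illustration of that idea, here is a minimal sketch of a finite state machine with an explicit transition table. The states (a document moving through review) and the function name `transition` are hypothetical, chosen only for this example; they are not from the talk.

```python
from enum import Enum


class State(Enum):
    # Hypothetical states for a document moving through review.
    DRAFT = "draft"
    IN_REVIEW = "in_review"
    PUBLISHED = "published"


# Every allowed transition is listed explicitly; anything else is rejected.
ALLOWED = {
    State.DRAFT: {State.IN_REVIEW},
    State.IN_REVIEW: {State.DRAFT, State.PUBLISHED},
    State.PUBLISHED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Because the table is small and explicit, every state and every transition can be enumerated and tested, which is what makes the outcomes predictable.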
00:06:09.790
For instance, validating zip code input ensures accuracy and reduces errors.
00:06:18.139
We need to identify states and fix potential risks, leading to more stable outcomes.
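The zip code case can be sketched in a few lines. This assumes US ZIP codes only (five digits, or ZIP+4); the function name `is_valid_zip` is made up for the example, and a real form would likely need to handle other countries' postal codes too.

```python
import re

# Assumption: US ZIP codes only, either 5 digits or ZIP+4 (12345-6789).
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def is_valid_zip(text: str) -> bool:
    """Accept only strings that are exactly a 5-digit ZIP or a ZIP+4."""
    return ZIP_PATTERN.fullmatch(text.strip()) is not None
```

Rejecting bad input at the boundary shrinks the set of states the rest of the system has to cope with.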
00:06:24.389
On the other hand, harm mitigation kicks in when risk reduction fails.
00:06:31.149
We install safety measures, like seat belts and fire detectors, expecting the worst.
00:06:37.360
If disaster strikes, we want to ensure we can survive it.
00:06:44.160
Building codes are a concrete example of harm mitigation.
00:06:51.660
They exist to prevent tragedies, similar to how seat belts do in car accidents.
00:06:58.900
In cases of emergency, we want to minimize the damage.
00:07:05.230
Looking back at natural disasters, Mexico City suffered greatly from an 8.0 earthquake in 1985.
00:07:12.270
Around ten thousand people died due to poor building conditions and practices.
00:07:19.910
This led to stricter building codes that saved lives in future disasters.
00:07:26.120
During the 7.1 earthquake in 2017, newer buildings survived thanks to these enhanced building codes.
00:07:33.170
In fact, the estimated death toll was significantly lower because of prior learnings.
00:07:40.070
They now even have early warning systems in place to alert residents.
00:07:47.120
They recognize the threat of earthquakes and take steps to mitigate the consequences.
00:07:53.210
An example of building safety regulation failing is the Grenfell Tower fire in London.
00:08:01.000
This residential building suffered catastrophic consequences due to a lack of safety regulations.
00:08:10.029
Many died because of flammable cladding and insufficient emergency exits.
00:08:16.160
The building's design overlooked critical safety measures.
00:08:22.560
Lack of proper evacuation procedures only heightened the tragedy.
00:08:29.700
Contrast that with the Torch Tower in Dubai, where stricter regulations prevented loss.
00:08:37.080
While it also caught fire, everyone inside was able to evacuate safely.
00:08:44.930
The difference lies in building codes and emergency preparedness.
00:08:51.580
So, what does harm mitigation look like in our lives?
00:08:58.300
It involves recognizing that we can never fully eliminate risk.
00:09:05.270
As a parent, I would love to shield my children from all harm, but that's unrealistic.
00:09:11.370
Instead, I can teach them about boundaries and ensure they know safe practices.
00:09:18.320
Once we accept that risk exists, we can begin to manage it effectively.
00:09:25.900
You must understand your product's failure mechanisms for effective harm mitigation.
00:09:32.650
Different contexts lead to varying levels of risk, so consider your threat models.
00:09:39.830
A harmless bug for some could lead to catastrophic issues for others.
00:09:45.960
Failure means different things in various environments; context matters.
00:09:52.680
Thus, we need professionals focused on testing to help find issues before users do.
00:10:02.040
They will think outside the box and identify what could go wrong.
00:10:09.000
When we discuss harm mitigation, we must identify what we aim to protect.
00:10:16.000
For instance, in nuclear power, systems automatically shut down in emergencies.
00:10:23.050
If power or control is lost, safety mechanisms activate to prevent disasters.
00:10:30.050
Conversely, a time-lock safe protects money at the expense of human safety.
00:10:37.000
Computer systems focus primarily on protecting data.
00:10:44.000
Graceful shutdowns are essential when power is lost.
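A graceful shutdown might look something like the sketch below. The `Worker` class and its in-memory `saved` list are stand-ins invented for this example; a real service would flush to disk or a database. The signal handler only sets a flag, and the `finally` block persists whatever is buffered.

```python
import signal


class Worker:
    def __init__(self, jobs):
        self.jobs = list(jobs)   # work still to do
        self.buffer = []         # results not yet persisted
        self.saved = []          # stand-in for durable storage
        self.stopping = False

    def request_stop(self, signum=None, frame=None):
        # Signal handlers should do as little as possible: just set a flag.
        self.stopping = True

    def flush(self):
        self.saved.extend(self.buffer)
        self.buffer.clear()

    def run(self):
        try:
            while self.jobs and not self.stopping:
                self.buffer.append(self.jobs.pop(0).upper())
        finally:
            # Runs on normal exit, on a stop request, and on exceptions.
            self.flush()


def install_handlers(worker):
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, worker.request_stop)
    signal.signal(signal.SIGINT, worker.request_stop)
```

Unfinished jobs stay in `jobs`, so they can be picked up again after a restart instead of being lost.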
00:10:50.160
Laptops often save data in critical moments, proving how much we value our information.
00:10:57.050
My experience at Microsoft taught me that access control is crucial to protecting sensitive data.
00:11:04.000
Once we secured user data, we prioritized preventing breaches over protecting hardware.
00:11:11.440
Predicting possible states lets us understand potential vulnerabilities.
00:11:18.360
Continuous improvement is key to minimizing risks in software.
00:11:24.060
Implementing kill switches enhances our ability to respond to failures.
00:11:32.090
When things spiral out of control, the option to hit the kill switch is invaluable.
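In code, a kill switch can be as small as a flag check around the risky path. The `FlagStore` class below is a hypothetical in-memory stand-in written for this example, not any vendor's API; a real flag store would typically be remote and cached.

```python
class FlagStore:
    """Hypothetical in-memory flag store, for illustration only."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name, default=False):
        # Unknown flags fall back to the default, which is "off" here.
        return self._flags.get(name, default)


def recommendations(user_id, flags):
    """Return recommendations, or a safe empty list if the feature is killed."""
    if not flags.is_enabled("recommendations"):
        return []
    return [f"item-for-{user_id}"]
```

Flipping the flag off takes the feature out of the request path without a deploy, and the caller still gets a valid, if empty, response.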
00:11:39.080
Failure is inevitable; we will all encounter setbacks.
00:11:46.940
We must view disasters as a series of interconnected failures.
00:11:53.020
Understanding these nuances helps us avoid large-scale catastrophes.
00:12:01.120
Complex systems like those in modern software can lead to unpredictable failures.
00:12:08.090
For example, LinkedIn once ran into trouble when it turned on all of its features at the same time.
00:12:15.160
This action caused significant problems, highlighting how interconnected our systems are.
00:12:22.300
In today's software landscape, tracking data flow is more challenging than ever.
00:12:30.000
Testing must encompass microservices, connections, and their multiple permutations.
00:12:37.110
The complexity of microservices amplifies testing hurdles.
00:12:44.000
You can no longer expect complete test coverage; it's unrealistic.
00:12:51.100
Remember, every disaster results from compounded failures.
00:12:58.000
Natural disasters exemplify the many contributing factors that lead to catastrophe.
00:13:06.000
Consider the involvement of multiple failure modes, such as a hurricane.
00:13:13.080
Wind and storm surges can devastate infrastructures, leading to widespread chaos.
00:13:20.218
It’s not just about the immediate damage but also the failure of utilities, which increases the severity.
00:13:27.000
Disasters often stem from a variety of compounding failures that intensify the situation.
00:13:34.000
A significant example is the I-35W bridge collapse in Minneapolis, which resulted in 13 fatalities.
00:13:40.500
The primary cause? Undersized gusset plates compounded by additional road surface layers.
00:13:48.000
We discovered that several compounding factors led to one disastrous outcome.
00:13:54.000
In fact, it was only because of over-engineering that the bridge lasted 40 years.
00:14:02.000
Any combination of factors could generate disasters, prompting us to build better systems.
00:14:10.000
Chaos engineering has gained recognition, helping avoid major breakdowns.
00:14:17.000
It tests systems under stress to ensure robustness against unexpected issues.
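A very small version of that idea is fault injection: wrap a dependency so that it fails on purpose, then check that the caller copes. Everything here (`inject_faults`, `fetch_price`, `price_with_fallback`) is hypothetical and written for this example; real chaos tooling works at the infrastructure level rather than inside one process.

```python
import random


def inject_faults(func, failure_rate, rng):
    """Wrap func so that a fraction of calls raise ConnectionError."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Stand-in for a call to a remote pricing service.
    return 10


def price_with_fallback(fetch, item, default):
    """The code under test: it should survive a failing dependency."""
    try:
        return fetch(item)
    except ConnectionError:
        return default
```

Passing a seeded `random.Random` makes a run repeatable, which helps when a chaos experiment turns up a failure that needs to be reproduced.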
00:14:23.000
As we identify weaknesses, we can determine if our systems are adequately prepared.
00:14:29.000
Here are strategies to reduce the potential of disasters.
00:14:35.000
Utilize microservices to promote resiliency through loose coupling.
00:14:42.000
Aim for internal APIs that minimize cascading failures.
00:14:48.850
Consider alternatives for hosting status pages to ensure clear communication.
00:14:55.000
Maintain replicas of essential data for reliability in case of emergencies.
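One way to put replicas to use is read failover: try the primary, then each replica in turn. This is a sketch under the assumption that each store is exposed as a callable that raises `ConnectionError` when it is unreachable; the name `read_with_failover` is invented for the example.

```python
def read_with_failover(key, readers):
    """Try each reader in order and return the first successful result."""
    for read in readers:
        try:
            return read(key)
        except ConnectionError:
            # This store is down; move on to the next one.
            continue
    raise ConnectionError(f"all {len(readers)} stores failed for {key!r}")
```

Replicas may lag behind the primary, so this trades freshness for availability; whether that is acceptable depends on the data.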
00:15:01.500
Test your features in production scenarios to gauge performance accurately.
00:15:09.000
All modern software must undergo live testing to validate responsiveness.
00:15:16.000
As humans are the slow, fallible part of our systems, automation becomes vital.
00:15:23.000
Incorporate circuit breakers and kill switches for rapid response in emergencies.
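For reference, a circuit breaker can be sketched as follows. This is a minimal, single-threaded illustration with made-up names and thresholds, not a production implementation: after `max_failures` consecutive failures it fails fast until `reset_after` seconds have passed, then allows one trial call.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so tests need not wait
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

The point is that the decision to stop calling a failing dependency is made by code in milliseconds rather than by a person reading a page.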
00:15:29.000
The automation of recovery processes enhances resilience against failures.
00:15:35.000
For every detected issue, ensure a plan is in place to react swiftly.
00:15:41.000
Your post-failure plan should include automation for seamless recovery.
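As one possible shape for automated recovery, the sketch below retries an action with exponential backoff and then falls back to a degraded result. The function name `recover` and its defaults are assumptions made for this example.

```python
import time


def recover(action, fallback, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry action with exponential backoff, then use fallback."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            # Wait before the next try, doubling the delay each time.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: degrade instead of crashing.
    return fallback()
```

The `sleep` parameter is injectable so the behavior can be tested without actually waiting.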
00:15:47.000
Proactively scrutinize your load testing to prepare against large traffic spikes.
00:15:54.000
Getting slashdotted can suddenly multiply the demand on your servers.
00:16:01.000
Planning leads to smoother recovery; if hiccups occur, you can manage them better.
00:16:08.000
Despite any risk, remember to expect failure and embrace flexibility.
00:16:16.000
Ultimately, equip your systems to handle problems without complete breakdowns.
00:16:23.720
I work for LaunchDarkly, specializing in feature flags as a service.
00:16:30.000
If you’re interested in a free t-shirt, please take a photo of this slide and visit our site.
00:16:37.110
Thank you for your time!