00:00:10.800
Hello, and I want to thank you for coming to my talk today titled "Surprise! Inspiring Resilience." I don't want to yell "surprise" too loudly at the beginning of the talk; that seems like a bad idea. So instead, I'll just say quietly, "surprise."
00:00:17.600
I want to tell you a little story today. The story doesn't require you to know anything about race cars or even like them, but I'm pretty sure by the end, we can all relate to the story in question.
00:00:30.080
On July 12th of 2020, it was the opening weekend of the Formula 2 racing season. Many of you have probably heard of Formula One; Formula 2 is sort of the next rung down, where you get ready to go into Formula One if you're lucky. There's a fella named Mick Schumacher; that name may sound familiar to you because of his father, Michael Schumacher. Mick is a race car driver, and he was in third place, cruising through the Styrian Grand Prix on the way to a podium finish, which is what you call it when you finish first, second, or third.
00:00:47.120
During the race, as he was doing this, a little piece of rubber popped off of his tire, came into the cockpit of his car, and hit the button on his steering wheel that deploys the fire extinguisher. Now, what you see here is a tweet from Mick Schumacher's Twitter account. Down toward the bottom, you can sort of see some spray from the fire extinguisher around the inside of his cockpit.
00:01:05.040
He did have to retire the car, but he managed to get it off the track and into the pits with no damage, other than the fire extinguisher, of course. His later tweets described this as a once-in-every-10-years event.
00:01:15.920
Now, cars in Formula 2 cost about 500,000-ish dollars, and they can reach speeds close to 200 miles per hour. The drivers are crammed into this tiny little cockpit where they have very little room. Their knees are braced up into the front, and they even sometimes have to wear pads on their knees because they rattle around in there. I don't know about you, but if I were just walking or hanging out at home and a fire extinguisher went off, I would probably at least fall down, if not completely freak out.
00:01:38.079
It blows my mind that Mick Schumacher, whizzing around a track at close to 200 miles per hour, making all these abrupt, crazy decisions, pulling lots of g's, was able to pull this off and bring the car back safely. At the time of this occurrence, Mick Schumacher had been racing cars for about 12 years, and he relied on that extensive skill and experience to get this done. He was prepared for an event that, frankly, nobody would have bothered to write a rulebook or runbook for, or perhaps even to practice.
00:02:01.680
His skills and capabilities helped him avoid a loss of control, significant financial loss, possibly even injury, or otherwise really significant damage. When I saw this happen, I was impressed. Wow! This is in a completely different industry than tech, but there are some strong parallels. What can we learn if we look outside of our industry, outside the norms and the things we think about every day, and observe how other industries or situations think about resilience?
00:02:50.480
You may be wondering who in the world I am and why you should listen to anything I say. Well, I am Cory Watson, and I work for a company called Jeli, which can be found at jeli.io. We do incident review software and help you learn from both incidents and maybe even your successes, which we'll talk about a little later. I've done reliability, resilience, and observability work for companies like Stripe, Splunk, and Twitter, and I've broken a lot of stuff in my career; some of it I even fixed. Sometimes I even did that on purpose. I joke about this, but at the end of the day, we are the things that make these systems work, and quite often, the work we do stabilizes the system or keeps it going.
00:03:47.040
However, in many of these contexts, like this talk or at conferences, we have a tendency to focus on the things we break, which I think is unfortunate because more often than not, we are fixing things instead of just breaking them. Anyway, enough about me; let's talk more about resilience.
00:04:02.159
Resilience is the ability to not fall apart whenever something bad happens, to persevere through problems or bad circumstances. The problem with resilience is that we tend to want it right now. We want it during the incident we are currently experiencing, or in the one we just had. We say we need it now, at the most pointed, sharp, and painful moment. We want it this second, which means we tend to want it too late.
00:04:38.720
We can refer back to Mick Schumacher, who will make a return later in this talk. We could talk a lot about the literature, and I love to read the literature. Some of my favorite afternoons are just curling up with a good paper, a set of highlighters, and a cup of tea, and marking my way through. In this case, we're discussing an article by David Woods called "Resilience is a Verb," but I won't make you read the whole paper. Instead, I will pull out one particular quote that's really important to me and serves as the basis for a lot of this talk today. "The ability to recognize and to stretch, extend, or change what you're doing, or what you have planned, has to be there in advance of adapting." The emphasis is mine here, where I've emboldened 'has to be there in advance.' This is the important thing.
00:05:40.720
If we want resilience, if we want to deal well with failure like Mick Schumacher did, we have to put a plan in place. We need to prepare; we cannot just hope for the best. Some of our colleagues in security have noted that 'hope is not a strategy.' I'm not sure if that's just a security saying, but it isn't a strategy. We must prepare; the ability has to be there in advance of adapting. I won't spend a lot of time on this, but a lot of the things we do for incidents involve a concept called barriers, and it deserves some attention.
00:06:45.600
Barriers are about stopping people from making mistakes. Sometimes that’s putting a guardrail around a dangerous curve or including something in your CI/CD pipeline that won't allow a build to go into production unless a certain test passes. This focuses on prevention, on stopping something from happening. This is not the same as resilience, but it’s where we spend a lot of our time. Much of the postmortem material on incidents is about creating action items for things to do to stop this from happening again.
00:07:30.240
Another piece of literature is a paper called "Accidents and Barriers" by Hollnagel from 1999. I will just pull out the important point, which is that barriers can be used as a means to prevent the same or similar accidents from occurring again. Imagine a specific test meant to stop a certain type of user input from breaking the application. If that test failed and the build went into production anyway, you can imagine that would negatively impact the user experience. A barrier that blocks the deploy prevents that.
00:08:15.840
However, the most persnickety tests often deal with timing or some sort of randomness, and suddenly we might get hung up, unable to deploy a crucial bug fix because a flaky test won’t pass. That’s why we need to make a distinction: barriers are not the same as resilience. They only prevent the same or similar accidents from happening.
00:09:09.600
So, a little more literature, because all good talks come with this. Let's talk a bit about complex systems. If you've never been to how.complexsystems.fail, I highly encourage you to go there after this talk; please pay attention to me for now. Complex systems are large assemblies of technical and social components that we combine to create an application, a web property, or even an organization or group of people. The important takeaway from Dr. Cook's perspective is that failure-free operations require experience with failure.
00:09:58.040
This is crucial. If we install all these barriers to prevent anyone from failing, when things inevitably do fail, we won't be prepared. Imagine if Mick Schumacher had never experienced a situation where the car didn't behave as expected; he would never have developed the capacity to deal with the car when a fire extinguisher went off. You have to experience failure to make progress toward failure-free operations; it's counterintuitive but really important.
00:10:31.760
If we need experience with failure to foster improvements or successes, then learning matters. We each learn individually, but what's important for us, the people who care about these outcomes in organizations, is figuring out how to learn as an organization from the failures or successes we experience. How do we "verb" our resilience? How do we prepare? If we go back to the "Resilience is a Verb" paper, the goals are to anticipate, to see signs of trouble ahead, and to synchronize, to help different levels keep track of changes.
00:11:44.320
We also need to be ready to respond, and who does not have an on-call rotation for an adequately complex or important product? Lastly, we need to proactively learn. We must monitor for brittleness, opportunity, weak signals, and symptoms, the things that smell like near misses. We will talk about these concepts further, and I will try to provide examples.
00:12:57.200
What is your organization doing to foster these types of attributes? If we return to anticipation, synchronization, readiness to respond, and proactively learning, what are we doing to encourage these? We will explore examples from outside the tech industry to inspire us to think differently about resilience.
00:13:47.360
The first example I want to provide is from the U.S. Navy. There's a concept known as the "plane guard." It's described on Wikipedia, and you can check it out for more details. During aircraft carrier flight operations, there is always a ship or helicopter tasked with recovering air crews if an aircraft ends up in the water while launching or landing.
00:14:18.720
Landing on a carrier is an incredibly dangerous operation, often described as one of the most hazardous jobs in the world. To alleviate the fear that any human might feel in this situation, having someone close by to rescue you is crucial. The plane guard concept has been in place in the U.S. Navy and many navies worldwide for years.
00:14:45.280
The idea is that there's a helicopter hovering off the back of the ship, ready to swoop in for a rescue whenever needed. This is a dedicated expense; it requires significant time, money, training, and resources to maintain this presence. If we check the statistics for Class A flight mishaps, we see that in 2020 there were only four Class A mishaps per 100,000 flight hours.
00:15:31.520
A Class A mishap is an incident resulting in significant damage, such as losing an airframe. It's interesting that despite the low mishap rate, the Navy still dedicates the time and resources to the plane guard, which prompts us to think about our own organizations' safety measures during dangerous operations.
00:16:16.240
Now, moving on to a topic we all enjoy—space! I’m actually wearing a NASA shirt today! The concept of a "Countdown Hold" is associated with the Space Shuttle. I didn’t research whether they still employ this today, but the Shuttle had a built-in countdown pause, allowing the launch team to still aim for a specific launch window based on various factors.
00:17:01.760
The countdown holds provide a cushion of time for critical testing and procedures without affecting the overall launch schedule. For example, at T-minus 27 hours, the first hold occurs and lasts four hours. During this hold, the launch pad is cleared of all non-essential personnel. The countdown pauses so the procedure has the time it needs without eating into the launch schedule.
00:17:55.200
In safety-critical operations, it's generally terrifying to think that someone is unaccounted for. Thus, having the capability to hold the countdown means even if someone is missing, we can take the necessary time to ensure everyone's safety before proceeding.
00:18:49.920
Next, the Olympics. The International Olympic Committee issues a comprehensive document called the "Host City Contract," which outlines the requirements for executing the Olympics. It's noteworthy that these guidelines had to be documented and codified over time, drawing on experiences from previous Olympics.
00:19:31.320
This document includes critical stipulations learned through experiences across Olympic history, such as ensuring full power redundancy from geographically independent substations, which is paramount for maintaining operations within data centers.
00:20:22.960
Another interesting point from the same document is that energy for field lighting should come from two independent sources, each supplying 50% of the lighting. If power fails at one source, the other still keeps half the lights on, so the event doesn't go completely dark.
00:21:10.400
Next, let's address Notre-Dame de Paris. Most of you are familiar with the devastating fire that occurred a few years back at this historic church, which housed priceless artifacts. An intriguing revelation from an article in Science inspired the foundations of this talk.
00:21:51.200
The firefighters were aware of which artworks to rescue and in what order, following a pre-established protocol developed for such disasters. They controlled their water pressure so that spraying cold water on the hot stained-glass windows wouldn't cause them to shatter. This foresight and preparation exemplify the essence of resilience.
00:22:50.240
What struck me most is the idea that someone had preemptively prepared for such a catastrophic scenario, engaging with the local fire department long before the event. This illustrates that resilience comes from planning and putting strategies in place. Their prior planning ultimately resulted in successfully salvaging many elements from the structure.
00:23:20.320
Returning to the theme of cars, the Rimac Nevera, previously known as the C_Two, is a $2.4 million electric hypercar. Even a car this expensive undergoes crash testing, which is remarkable. Engineers understand automobile engineering well enough to run simulations, yet Rimac still opts to physically crash-test these ultra-luxury vehicles.
00:24:01.760
In fact, the prototypes often cost more than the production cars because of their intricate, handcrafted, one-off components. Rimac performs various computer simulations but then crash-tests multiple full cars to observe real-world outcomes rather than relying solely on computed predictions.
00:24:54.720
Furthermore, I would like to highlight TWA Flight 800, a tragic incident in July 1996 in which 230 people perished just 12 minutes after takeoff. The intriguing part is that after the investigation concluded, the National Transportation Safety Board (NTSB) preserved the wreckage for 25 years.
00:25:41.440
During this time, the NTSB used the wreckage to train investigators, teaching them how to investigate airplane crashes effectively. This was an effort to turn a tragedy into an opportunity for learning, thus advancing the industry. The NTSB even digitized the wreckage so it can keep being used for education.
00:26:31.600
Another significant topic relates to healthcare rationing, an issue that is not new to the COVID era. Market forces and other circumstances have driven healthcare rationing historically, such as in the 40s with iron lungs, in the 60s with dialysis machines, and currently with ventilators and ICU beds.
00:27:16.560
This area raises substantial ethical questions and has financial implications, and it has led to extensive research and many papers on how these situations unfold. Emergencies will keep occurring, whether they are localized disasters or national and global pandemics. Investing proactively, before these situations arise, is incredibly valuable.
00:28:27.400
Returning to the essence of resilience, we often desire it when we need it rather than thinking about it in advance. This brings us back to the question: how are you fostering resilience in your organization? What actions are you taking to 'verb' resilience?
00:29:11.920
What prompted these practices in other organizations? Consider the NTSB's decision to preserve wreckage from TWA Flight 800; there was intent and commitment behind that decision. What is driving the commitment to crash-test the Rimac Nevera? What's inspiring the Navy's investment in their Plane Guard? Why did the Notre Dame administration prioritize discussions and plans with the fire department? These questions underline the importance of planning and foresight that all effective operations share.
00:30:05.680
If we make this sort of investment, if we pre-position resources, if we dedicate years of training, we may indeed succeed, just as Mick Schumacher did. He did not win the particular race I mentioned earlier, but he went on to win the Formula 2 Championship that year. He achieved this not through sheer domination but through consistency, preparation, and a stroke of luck. In short, he was resilient; he was prepared, he was ready, and he now drives for the Haas Racing Team in Formula One.
00:31:14.000
I hope that through resilience and investment, you too can reach the pinnacle of your industry. Thank you for attending this talk today.