Summarized using AI

Who Destroyed Three Mile Island?

Nickolas Means • April 17, 2018 • Pittsburgh, PA

The video titled "Who Destroyed Three Mile Island?" by Nickolas Means, presented at RailsConf 2018, explores the complex nuclear incident at Three Mile Island Unit #2 that occurred on March 28, 1979. This session examines the issue through a detailed analysis of the events leading to the partial meltdown, with a focus on human error and systemic failures rather than individual blame.

Key points discussed include:
- Nuclear Reactor Basics: Means begins by explaining how a nuclear power plant operates, highlighting the crucial functions of the reactor core, cooling systems, and safety protocols in place during operation.
- The Incident Timeline: The talk outlines a step-by-step timeline of the incident, beginning with a minor issue in the condensate polishers that ultimately led to the reactor's failure. It highlights the errors made in managing pressure and coolant levels during the crisis.
- Human Factors: Means emphasizes that the operators made decisions based on their training background, which was heavily influenced by past experiences with naval reactors, leading to critical misjudgments during the incident.
- Second Stories Concept: Drawing from Sidney Dekker's work, he discusses how exploring 'second stories,' the systemic and contextual reasons behind decisions, leads to a better understanding of human error than simply attributing blame to individual actions.
- Systemic Failures: The discussion points out numerous design flaws and inadequacies in operational training that contributed to the meltdown, advocating for a blameless approach in investigating incidents to improve safety and learning outcomes.

Ultimately, the conclusion underlines that asking "What destroyed Three Mile Island?" rather than "Who destroyed it?" leads to a richer understanding of the incident, helping to illuminate the broader issues within the nuclear industry and beyond. This approach encourages organizations to foster an environment where team members can learn from mistakes openly, enhancing overall operational safety and decision-making processes.

RailsConf 2018: Who Destroyed Three Mile Island? by Nickolas Means

On March 28, 1979 at exactly 4:00am, control rods flew into the reactor core of Three Mile Island Unit #2. A fault in the cooling system had tripped the reactor. At 4:02, the emergency cooling system automatically kicked in as reactor temperature continued to climb. At 4:04, one of the operators switched the emergency cooling system off, dooming the reactor to partial meltdown. Why?

Let’s let the incredibly complex failure at Three Mile Island teach us how to dig into our own incidents. We'll learn how the ideas behind just culture can help us learn from our mistakes and build stronger teams.

00:00:11 Good morning, everybody! Welcome to RailsConf. I'm glad we all made it here despite the weather. I'm curious how many of you would say you have absolutely no idea how a nuclear reactor works. Raise your hand.
00:00:28 Awesome! We're going to clear that up today. When I was a kid, my parents gave me this four-volume set of books called 'How Things Work.' My dad is a mechanical engineer by training, and he had great patience with all my questions about how various complicated things in the world worked.
00:00:36 I think he got tired of answering those questions all the time, so he bought me the set of books, in part so that I could find those answers on my own, but in part to continue inspiring that curiosity in me. I don’t have a chance to look at them very often these days, thanks to the wonders of the internet and Wikipedia, but they still have a treasured place on my bookshelf because they're a big part of why I am who I am and why I'm curious about the things I'm curious about.
00:01:01 I distinctly remember when it was all over the news in 1990 that Comanche Peak Unit One came online outside of Dallas. It was the first nuclear power plant that I could remember coming online in my lifetime, and I remember turning to pages 68 and 69 of volume one of 'How Things Work' to try to understand how it was that a nuclear reactor made electricity.
00:01:20 I think that's a good place for us to start today as well, with one of the reactor diagrams at the bottom of the page. As it turns out, the basic mechanics of a nuclear power plant are very similar to those of any other power plant. You have a heat source, in this case, it's a carefully controlled nuclear chain reaction fueled by uranium. In the case of a combustion power plant, that would be natural gas or coal.
00:01:46 High-pressure water circulating through the reactor core carries the heat to a steam generator, where it is used to boil water, converting it to steam. That steam is used to turn a turbine, which is essentially a giant fan in a tube. The turbine then turns a generator, which is where the electricity comes from.
00:02:10 The steam is then piped into a condenser, where it's cooled and turned back into water, allowing it to make another trip through the steam generator. There are two primary kinds of nuclear reactors in operation in the United States: the boiling water reactor and the pressurized water reactor. What we are dealing with here is a pressurized water reactor because that was the kind that was in operation at Three Mile Island.
00:02:29 The components that I just walked you through are on two cooling loops: the primary loop, which is orange, and a secondary loop, which is blue. The primary loop consists of the water that flows through the reactor core, gathering heat, and then through the steam generator, boiling the water in the secondary loop. The steam boiled off in the secondary loop drives the turbine to create electricity.
00:02:54 The interesting thing is that water in these two loops never combines; they are completely isolated from one another. So, what makes it pressurized? In a boiling water reactor, you have a much larger reactor pressure vessel. The reason for this is that water actually boils in the core of a boiling water reactor.
00:03:06 You must allow room in the pressure vessel for that phase change to occur. In a pressurized water reactor, the primary coolant loop is held at about 2,000 pounds per square inch (psi). The effect of that is that the water in the primary coolant loop will not boil, even at the plant’s operating temperature of 600 degrees Fahrenheit, or at least it's not supposed to boil.
00:03:40 This brings us to March 28, 1979. Three Mile Island Nuclear Generating Station is a two-unit nuclear power plant in Londonderry Township, Pennsylvania. It's about 200 miles from where we're sitting right now, built on a three-mile long sandbar in the middle of the Susquehanna River, about ten miles south of Harrisburg, the capital of Pennsylvania.
00:04:00 Unit Two is a 906-megawatt pressurized water reactor designed by Babcock & Wilcox. It went into commercial operation on December 30, 1978, and early on the morning of March 28, 1979, it was running at 97% capacity and had been for the three months since it came online. In the slang of the industry, this reactor was running hot, straight, and normal.
00:04:25 These men are the ones who were at the controls of Three Mile Island Unit Two for the overnight shift on March 28. Bill Zaily was the shift supervisor for Units One and Two; he was the most senior person on-site. Fred Shaymin was the shift foreman for Unit Two, and he was Bill Zaily's second-in-command. Ed Frederick and Craig Faust were the control room operators on duty, sitting at the controls of the reactor.
00:04:49 Everything at the plant was running perfectly normal that night, except for a small problem in the condensate polishers that the previous shift hadn't been able to solve. What are the condensate polishers? They're a set of eight filtration tanks that filter the water coming out of the condenser before it goes back into the expensive and delicate steam generators.
00:05:10 These are not the actual steam generators or the actual condensate polishers from Three Mile Island; as you can imagine, it's pretty difficult to find a picture of a specific component of a specific nuclear power plant, but this is what condensate polishers look like.
00:05:25 These tanks are filled with sticky resin beads that absorb everything but water. Any flecks of rust or dirt that happen to be circulating in the coolant water will stick to the resin beads. The problem is that these things tend to clog, and they need to be periodically backwashed, much like a pool filter.
00:05:49 At Three Mile Island, the backwash system wasn't quite powerful enough to do the job it was intended to do, and so the swing shift the night before, faced with a clogged polisher tank, turned on a secondary system, using pressurized air to try to break the clog in the number seven condensate polisher tank.
00:06:14 At 3:59 in the morning, Fred Shaymin is down in the basement of the turbine hall, looking into the viewing port of the number seven condensate polisher tank to see if they made any progress on this clog. Suddenly, everything gets incredibly silent. As you can imagine, thousands of tons of water pushing through pipes make a lot of noise, so the silence was disconcerting.
00:06:34 Shortly after that, there was a rumble, and Fred Shaymin barely jumped free as a water hammer surged through the feed lines, knocking them free of all eight condensate polisher tanks. What happened was that over about ten hours since the swing shift had started using the pressurized air system to try to break this clog, a leaky one-way check valve had allowed water to force its way from the condensate polisher tank into the air supply line.
00:06:55 At 3:59 in the morning, that water finally made it to the manifold feeding the pneumatic control valves for these condensate polisher tanks, causing all eight valves to close. Obviously, this is not good. But to help us understand why, here's a schematic of Three Mile Island Unit Two.
00:07:22 Now, it looks a little bit more complicated than the diagram we were looking at a minute ago, but it’s the same thing. All the same components are here. I have colored the primary loop in orange and the secondary loop in blue. Let me run through the components quickly. In the center is the reactor core where the nuclear chain reaction takes place and generates heat.
00:07:47 Next to it are the two steam generators where water from the primary loop boils the water in the secondary loop to create steam. Over here in the turbine building are the turbine and generator where the electricity is generated. Below them is the condenser, where the steam is turned back into liquid water, and right below that is the condensate polisher, which is completely blocked.
00:08:12 What that means is that there is no water to be pumped through the secondary cooling loop, so the main feedwater pump tripped offline at 4:00 in the morning. This marks the official start of the accident at Three Mile Island.
00:08:35 Two seconds after the main feedwater pump tripped, the turbine sensed it wasn't going to be getting any more steam, so both the turbine and generator also tripped offline. The plant's main safety valve opened, venting all the remaining steam into the early morning sky. This steam is not radioactive; it's completely safe. However, it makes a noise that could be heard from miles away.
00:09:04 In the control room, Ed Frederick and Craig Faust are getting their first indications that something has gone wrong. An alarm horn announcing the turbine trip starts going off, and several alarm indicators begin to flash. A few seconds after the turbine and generator alarms go off, the pressure in the reactor vessel starts to climb rapidly.
00:09:27 This pressure spike is expected. Without the secondary loop to remove heat, the primary loop is heating up, and as water heats up, it expands. The good news is that the plant is designed for exactly this situation to happen and it's taking action to resolve it automatically.
00:09:50 As soon as the alarms go off, the reactor's pressure control system is the first to kick into action. There are two components of the system, both of which are important to the accident. The first is the pressurizer, which does three jobs at Three Mile Island. Its first job is to regulate system pressure.
00:10:12 Because the primary coolant loop is a sealed system, the pressurizer changes the pressure of the entire system and works essentially like a piston. There's a steam bubble at the top and water at the bottom. As the water expands, it increases the pressure on the rest of the system. The second job of the pressurizer is to allow the operating crew to measure the water level.
00:10:36 When Babcock and Wilcox designed this reactor, they designed it without any water level instrumentation on the reactor core to save money. They could do it because the pressurizer is the highest point in the system, so if there's water in the pressurizer, you can infer that there's water in the primary coolant loop.
00:11:07 The third job is to absorb pressure shocks. Steam is significantly more compressible than liquid water, and the steam bubble at the top of the pressurizer is able to absorb any quick pressure spikes that happen in the system, much like the one that's happening right now. The steam absorbs the initial shock, but the pressurizer itself is really only designed for small pressure adjustments.
00:11:36 It's not designed to handle situations like this, where the reactor is more than 100 psi over its standard operating pressure. It would take the pressurizer a few minutes to make adjustments of that size.
00:12:02 So, what does the system do about big pressure changes like this? That's where the pilot-operated relief valve comes in, and if you've heard anything about the Three Mile Island accident before, you've probably heard about this component. This is the one that gets all the press.
00:12:17 In the event of a big pressure spike, the pilot-operated relief valve will open and release coolant into a drain tank on the floor of the containment building. By releasing a certain amount of coolant, it lowers the pressure of the primary cooling loop. The pilot-operated relief valve opens four seconds after the turbine and generator trip offline.
00:12:55 A few seconds later, the reactor computer senses that the reactor pressure is still continuing to rise even with the pilot-operated relief valve open, and so it takes another defensive action: it scrams the reactor.
00:13:20 The chain reaction within a nuclear reactor consists of uranium atoms and neutrons flying around, and these free neutrons like to bond with uranium nuclei. When they do, the uranium nucleus splits, releasing heat energy along with two other free neutrons. Those two free neutrons then go bond with other uranium nuclei, continuing the chain reaction.
00:13:47 The primary means of controlling this chain reaction is a set of cadmium rods that can be inserted into the core of the nuclear reactor. Normally, these rods are raised and lowered smoothly to make small adjustments in power, but in the event of a scram, the control mechanism releases the rods, allowing them to free-fall into the reactor core.
00:14:10 This happens in about three seconds and shuts down the chain reaction instantly. The problem is that shutting down the chain reaction doesn’t entirely stop the production of heat in the reactor core. Immediately after a scram, the reactor is still producing about six and a half percent of the heat it was producing before the scram.
00:14:38 That decay heat will decrease to one and a half percent over the first hour after the scram, but during that entire first hour, the reactor is still generating plenty of heat to damage the core if it's not carried away. Therefore, it’s essential to continue cooling the reactor core in the hour following a scram.
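The shape of that decay-heat curve can be sketched with a rule-of-thumb approximation. The formula below is one commonly quoted form of the Way-Wigner estimate. It is not from the talk, and both the coefficient (about 0.066) and the assumed 90 days at power are illustrative assumptions, so treat the output as an order-of-magnitude check only.

```python
def decay_heat_fraction(seconds_since_scram: float, seconds_at_power: float) -> float:
    """Approximate decay heat as a fraction of pre-scram thermal power.

    Rule-of-thumb (Way-Wigner style) estimate; the 0.066 coefficient is an
    assumption brought in for illustration, not a figure from the talk.
    """
    t = seconds_since_scram
    return 0.066 * (t ** -0.2 - (t + seconds_at_power) ** -0.2)


# Assumption: roughly three months of near-continuous operation before the scram.
SECONDS_AT_POWER = 90 * 24 * 3600

for label, t in [("1 second", 1), ("1 minute", 60), ("1 hour", 3600)]:
    fraction = decay_heat_fraction(t, SECONDS_AT_POWER)
    print(f"{label:>9} after scram: {fraction:.1%} of full power")
```

The approximation lands in the same range as the percentages quoted above rather than reproducing them exactly. The point it illustrates is that decay heat falls quickly at first but never drops to zero within the first hour.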
00:15:07 A few seconds later, back in the control room, a light on the console turns from red to green, indicating to the staff that the pilot-operated relief valve has been signaled to close. All the defensive actions that the reactor has taken have worked; the pressure spike has been contained, and everything's back to normal.
00:15:27 At this point, everything feels very much under control for the operating staff at Three Mile Island. However, that sense of control lasts only two minutes. Two minutes later, their world is thrown into chaos when the automatic emergency core cooling system kicks in: specifically, the high-pressure injection system, which dumps about a thousand gallons of water per minute into the core of the reactor.
00:15:50 This unexpected surge of water was very confusing for Bill Zaily and his crew. The plant had gone from a state they understood very well to one they had never encountered before. As soon as high-pressure injection kicked in, they were watching the water level on the pressurizer and noticed that it continued to rise.
00:16:24 Seeing the water level in the pressurizer rise told them that there was plenty of water in the system. So, why did the system think it needed more water? Fred Shaymin made the call to turn off the high-pressure injection after it had only been running for about two and a half minutes.
00:16:43 Had he not done this, the accident would have likely been a minor inconvenience, and the plant would have been back online later that week. However, Bill Zaily was in a perplexing situation. The water level in the pressurizer was continuing to rise, confirming that the primary loop had plenty of water, yet the pressure of the primary loop was continuing to drop.
00:17:06 This was problematic because if the pressure dropped too far, the water would start to boil, and if the water started to boil, it wouldn't be able to carry heat away from the reactor core effectively. He suspected that the pilot-operated relief valve might be stuck open, which could explain why the system was having trouble maintaining pressure.
00:17:38 He double-checked the indicator on the control panel, and it showed that the pilot-operated relief valve was closed. To be certain, he checked the temperature of the outlet pipe of the pilot-operated relief valve, which came back at 228 degrees Fahrenheit. According to the plant's operation manual, any temperature over 200 degrees Fahrenheit on the outlet pipe indicates that the pilot-operated relief valve has opened.
00:18:05 The procedure for dealing with that situation is to shut the manual block valve ahead of the pilot-operated relief valve. Had Bill Zaily closed the block valve, he would have stopped the incident in its tracks, but he did not; he left it open.
00:18:36 Now eleven minutes into the accident, at 4:11 in the morning, another alarm goes off, this one for the sump in the reactor containment building. The sump is a large pit at the bottom of the containment building meant to catch any water that leaks or is vented from anywhere in the system.
00:18:53 Currently, there’s enough water being released from the stuck-open pilot-operated relief valve that it has overflowed the drain tank on the floor of the containment building and filled the sump. This is a clear indication that there is a leak in the plant, but the crew misses it.
00:19:21 The core is in serious trouble at this point. Just after 5:00 a.m., the floor of the control room starts to rumble. It’s subtle at first, but it quickly becomes impossible to ignore. What's happening is that the primary coolant pumps of the reactor are designed only to pump fluid, but steam bubbles have started to form in the core.
00:19:42 As these pumps begin pumping steam in addition to the water they were designed to pump, they start vibrating. It starts subtle, but then it gets worse, and the operators know their training says to shut the pumps off when this happens. These pumps are very large and expensive, and to keep them from tearing themselves apart and creating a coolant leak, they are supposed to be shut off.
00:20:05 They hold off as long as they can, but finally, after 15 minutes, they cannot stand it any longer, so they shut off the first set of pumps. This helps for a little while, but 30 minutes later, the vibrations return, and they shut off the second set of pumps.
00:20:33 It’s less than two hours since Three Mile Island Unit Two was running at nearly full capacity, and it now has no coolant circulating through its core. It doesn’t take long for the effects of no circulation to become apparent.
00:20:57 At 6:00 a.m., precisely two hours into the accident, a radiation alarm in the containment building goes off, indicating several important things. The uranium fuel in Three Mile Island is contained in fuel rods, which are sealed so the water can circulate around them in the reactor core but not absorb any uranium.
00:21:21 The water itself does not become radioactive. However, the radiation alarm going off tells us that at least one of these fuel rods is damaged, which allows water to reach the fuel. Moreover, the water level in the core is now below the top of the nuclear fuel, meaning the core has been exposed.
00:21:47 By this point, plant leadership has started to make their way into the plant. Gary Miller is the station manager; he is the chief executive of Three Mile Island. George Condor is the Technical Support Manager for Unit Two, in charge of all technical personnel, nuclear assistants, health physicists, chemists, etc.
00:22:05 Almost as soon as they walk in the door, they join a conference call with Leland Rogers, the site representative for Babcock & Wilcox, the reactor designers. They discuss what they know, and Leland Rogers mentions that the block valve should have been closed.
00:22:29 George Condor yells to someone in the control room and asks, 'Is the block valve shut?' You can hear commotion in the background, and a couple of seconds later, the answer comes back: 'Yes, it’s shut.' Finally, at 6:22 in the morning, in response to a question from Leland Rogers, the crew closes the block valve.
00:22:51 This seals the leak of Three Mile Island Unit Two, stopping it from losing further coolant. Although this would have been a great decision 20 minutes into the accident, doing it now actually makes things worse. At this point, the only way this reactor can cool itself is by boiling coolant off through the pilot-operated relief valve.
00:23:15 By closing the block valve, the system is now sealed and all of the heat contained in the system has nowhere to go. With the block valve closed, the heat in the core intensifies rapidly, and it takes about eight minutes for the top of the core to begin to collapse.
00:23:36 Subsequent calculations show that by 7:00 in the morning, the core is two-thirds uncovered, with temperatures in the hottest part of the core around 4,000 degrees Fahrenheit—sufficiently hot to melt the cladding around the fuel, and eventually the uranium itself.
00:24:04 At 7:20 in the morning, the radiation alarm in the dome of the containment building goes off, indicating a reading of 800 REM per hour. To give you some context, if one of the operators at Three Mile Island had been standing in that 800 REM per hour radiation field, they would receive their maximum legal yearly dose of radiation in about 20 seconds.
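The "about 20 seconds" figure can be checked with a line of arithmetic. The sketch below assumes an annual occupational limit of 5 rem; the talk does not state the limit it used, so that constant is an assumption.

```python
FIELD_REM_PER_HOUR = 800.0  # containment dome reading quoted in the talk
ANNUAL_LIMIT_REM = 5.0      # assumption: the talk does not give the legal limit


def seconds_to_reach(limit_rem: float, field_rem_per_hour: float) -> float:
    """Seconds of exposure needed to accumulate limit_rem in a constant field."""
    return limit_rem / field_rem_per_hour * 3600.0


seconds = seconds_to_reach(ANNUAL_LIMIT_REM, FIELD_REM_PER_HOUR)
print(f"Annual limit reached after about {seconds:.1f} seconds")
```

Under that assumption the result is a little over 20 seconds, consistent with the figure quoted above.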
00:24:27 This represents a significant radiation threat. The crew had largely been in denial about core damage up till this point, but this alarm finally brings them back to reality. Immediately after the alarm, they attempt to turn the high-pressure injection pumps back on, but they turn them off again after about 18 minutes because they’re worried about having too much water in the system.
00:24:59 It wasn't until 8:26 in the morning, as the situation continued to worsen, that they finally re-enabled high-pressure injection, largely out of desperation; they were unsure what else to try at this point. It would take until 10:30 a.m., more than two hours later, for high-pressure injection to finally fill the pressure vessel back up and cover the core with water, ending the initial sequence of the accident.
00:25:23 Over the next few days, there was continued worry about a nuclear release at the plant. They continued monitoring the situation on the ground and flew helicopters over the plant with radiation detection equipment on board, but the redundant containment built into the plant did its job; there was never a radiation release from Three Mile Island.
00:25:48 Public worry about a hydrogen explosion from hydrogen released as the fuel cladding melted turned out to be unfounded as well. On Sunday, April 1st, four days after the accident, President Jimmy Carter and his wife, Rosalynn, visited the plant to reassure the American public about the safety of nuclear power and confirm that the situation at the plant was under control.
00:26:21 He would later convene an investigatory commission that generated a report on the accident, from which I've drawn a lot of the facts for this story. Three Mile Island Unit Two would be written off as a total loss less than three months after it went online.
00:26:55 Around 20 tons of melted uranium ended up in the bottom of the core, and another ten tons ended up suspended in the middle of the core. This is what they found when they began the cleanup in 1983: severed and melted fuel rods at the bottom of the reactor vessel. The final cost of the initial cleanup was just over one billion dollars (billion with a B), and it took 14 years. They're still not done.
00:27:23 This is a picture of Three Mile Island today. You can see Unit One on the right, still puffing out billowy steam clouds. It's still operating and generating electricity. Unit Two is the one on the left—it will finally be decommissioned and dismantled when Unit One is decommissioned, currently scheduled for 2034.
00:27:51 So, what happened? How did these four men miss so many signs that their reactor was in the midst of a loss-of-coolant accident? It was an accident that reactor operators are trained for. Why didn’t they leave the emergency core cooling system enabled? Why didn’t they close the block valve sooner?
00:28:17 We are looking at this from the wrong perspective. Sidney Dekker's wonderful book, 'The Field Guide to Understanding Human Error,' provides an in-depth guide to investigating and understanding what happens when things go wrong. He introduces the concept of first stories and second stories.
00:28:50 The story I have just told you is very much a first story of Three Mile Island. First stories focus on the humans involved in a story and what they should have done differently, often placing blame for an accident at the feet of the humans involved and the decisions they made.
00:29:13 They're called first stories because they’re the first angle that comes to mind and are easy to find. However, there are a couple of problems with this perspective, including the cognitive biases we all have. The first is hindsight bias.
00:29:38 This is the phenomenon where, when reviewing an event after it has occurred, and you know the outcome, you exaggerate your ability to have predicted and prevented that outcome. This is also referred to as the 'I knew it all along' effect.
00:30:00 An example of this is the thought, 'All that water in the sump had to be coming from somewhere. I don’t know anything about nuclear reactors, but I would have figured out that there was a leak there.' The second bias is outcome bias. Once you know the outcome of a situation, you hold that outcome against every decision that led to it, making you more likely to judge those decisions harshly.
00:30:31 A good example here is that turning off the emergency core cooling system early in the accident seems like a stupid decision when you know the outcome is a partial meltdown. So, what should we do instead? We should look for the second story.
00:30:53 In the second story, human error is seen as an effect of systems and vulnerabilities deeper within the organization—not merely a result of bad decision-making or failure to follow instructions. How do we get there?
00:31:18 We must dig into decisions from the perspective of the individuals who made them, considering the messy reality they faced, not the clean-room conditions we envision in hindsight. Additionally, we should approach this with the belief that everyone involved made the best decisions they could with the information available.
00:31:42 So, let's see if we can find some second stories from Three Mile Island. Let's start early in the accident sequence: why did Fred Shaymin make the call to turn off the emergency core cooling system five minutes into the accident? We can find our answer by examining the pressurizer.
00:32:09 In his deposition to the presidential inquiry, Fred Shaymin said he turned off the emergency core cooling system because it was causing the water level in the pressurizer to rise and he was afraid that it would 'go solid.' Now, what does that mean?
00:32:31 Remember that one of the jobs of the pressurizer is to absorb pressure shocks in the system, and steam is significantly more compressible than liquid water. Letting it go solid means allowing the pressurizer to fill with water and eliminate the steam bubble at the top, which reduces the shock absorption capabilities of the pressurizer.
00:32:54 Why would this concern for the pressurizer overcome his concern for the core? That’s where Admiral Hyman Rickover and the Navy come into play. Bill Zaily, Fred Shaymin, Ed Frederick, and Craig Faust were all former naval nuclear reactor operators, and the Navy’s reactor training emphasized keeping the pressurizer from going solid as the most crucial focus for a reactor operator.
00:33:21 The reason for this is that a 1960s era submarine reactor produced 12 megawatts of thermal energy, while Three Mile Island Unit Two had to produce 2,841 megawatts of thermal energy to generate its 906 megawatts of electricity due to losses and inefficiencies in the system.
00:33:47 This is common for nuclear reactors—there’s always a significant heat-to-electricity ratio. When you scram a reactor, primary heat production stops almost instantly. In a submarine reactor, the decay heat amount is trivial, about 780 kilowatts.
00:34:10 However, in Three Mile Island Unit Two, immediately after the scram, the reactor was still producing 185 megawatts of heat. That's enough heat to melt nuclear fuel. In a submarine, where decay heat is trivial, a water hammer with no steam bubble to absorb the shock is the worst-case scenario, which is why Navy training put the pressurizer first.
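The submarine comparison is simple arithmetic on the figures quoted in the talk. The sketch below applies the six-and-a-half percent decay-heat fraction to both reactors and also works out the heat-to-electricity ratio; the function and variable names are illustrative only.

```python
DECAY_FRACTION = 0.065  # share of thermal power still produced right after a scram

# Thermal power figures quoted in the talk, in megawatts.
THERMAL_MW = {
    "1960s submarine reactor": 12.0,
    "Three Mile Island Unit Two": 2841.0,
}
TMI_ELECTRIC_MW = 906.0


def decay_heat_mw(thermal_mw: float, fraction: float = DECAY_FRACTION) -> float:
    """Decay heat in megawatts immediately after a scram."""
    return thermal_mw * fraction


for name, thermal in THERMAL_MW.items():
    print(f"{name}: {decay_heat_mw(thermal):.2f} MW of decay heat")

efficiency = TMI_ELECTRIC_MW / THERMAL_MW["Three Mile Island Unit Two"]
print(f"Unit Two heat-to-electricity efficiency: {efficiency:.0%}")
```

The two decay-heat results differ by more than two orders of magnitude, which is why a habit that was safe on a submarine was not safe here.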
00:34:35 Fred Shaymin, faced with a rising pressurizer, inferred that the system was already full of water. Allowing the emergency core cooling system to continue injecting water would risk overfilling the pressurizer, so he turned it off, trying to keep the reactor safe.
00:35:03 Now, let’s look at another decision: why didn't Bill Zaily close the block valve when he checked the outlet temperature of the pilot-operated relief valve? If you'll remember, the outlet temperature was 228 degrees Fahrenheit, and the operations manual indicated that the block valve should be shut for any reading over 200 degrees Fahrenheit.
00:35:35 At Three Mile Island Unit Two, the pilot-operated relief valve had been leaking slightly since the plant went into service. This was considered a minor issue, and they weren’t going to address it until the next refueling shutdown. The operators had learned to work around it, adjusting the primary loop pressure ever so slightly.
00:36:06 The consequence of this was that the outlet temperature of the pilot-operated relief valve was often over 200 degrees Fahrenheit. This desensitized the operators to the importance of shutting the block valve when those temperatures were reached.
00:36:29 Bill Zewe saw the 228° reading and reasoned that, because the pilot-operated relief valve had just been venting scalding hot water, it was a perfectly reasonable temperature, so he did not shut the block valve. But there was another factor at play as well.
00:36:55 He trusted the indicator on his control panel, which showed that the pilot-operated relief valve was closed. What Bill Zewe did not know is that the indicator reflected the signal sent from the computer to the pilot-operated relief valve, not the valve's actual position.
00:37:17 When the light turned red, it indicated that the computer had told the valve to close, and when it turned green, it indicated that the computer had told the valve to open. But the only way to know the actual position of the valve was to infer it from the temperature of the outlet.
00:37:43 There was no way for Bill Zewe to confirm whether it was truly open or closed, so he reasoned that if the pressurizer did go solid, the pilot-operated relief valve could still respond to a pressure spike. Thus, he left the block valve open in a bid to keep the plant safe.
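The gap between what the panel light showed and what the valve was doing maps onto a familiar software problem: displaying the last command sent rather than the observed state. The following is a minimal sketch of that distinction, not a model of the plant's actual control system. The `ReliefValve` class, its `stuck` flag, and both indicator functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReliefValve:
    """Hypothetical valve model: tracks the last command and the real position."""
    commanded_open: bool = False  # last signal sent to the valve
    actual_open: bool = False     # physical position of the valve
    stuck: bool = False           # illustrative fault: valve ignores commands

    def command(self, open_: bool) -> None:
        self.commanded_open = open_
        if not self.stuck:
            self.actual_open = open_


def panel_light(valve: ReliefValve) -> str:
    """Indicator wired to the command signal, as described in the talk."""
    return "open" if valve.commanded_open else "closed"


def sensed_position(valve: ReliefValve) -> str:
    """What an indicator wired to a position sensor would show (hypothetical)."""
    return "open" if valve.actual_open else "closed"


valve = ReliefValve()
valve.command(True)    # pressure spike: valve is told to open, and it opens
valve.stuck = True     # the valve sticks in the open position
valve.command(False)   # valve is told to close, but it does not move

print("Panel light says:", panel_light(valve))
print("Valve actually is:", sensed_position(valve))
```

In this sketch the two readings disagree as soon as the valve sticks, and an operator who sees only `panel_light` has no way to notice.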
00:38:05 Finally, let’s address why the crew did not respond when the sump alarm went off. How did they not realize they had a coolant leak from the full sump? The answer is simple: they never received the alarm.
00:38:30 The control room relayed alarms in two ways. First, there was a series of alarm lights around the room. In total, there were about 600 lights in the control room, and when the plant was operating normally, between 40 and 50 of these lights were illuminated. This created a constant background of noise and confusion.
00:38:59 Second, there was an alarm printer; every time an alarm went off, it was sent to this printer to keep a log. The problem was that this printer was connected over a very slow 300-baud serial link that couldn't keep up. Less than an hour into the accident, over 100 alarm lights were illuminated, and it would take the printer more than two and a half hours to catch up.
00:39:29 With all this going on, there’s no way the operators could prioritize the flood of alarms coming at them, and therefore, they missed the simple indicators.
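To get a feel for why a 300-baud link falls behind, here is a rough back-of-the-envelope sketch. Only the 300-baud figure and the two-and-a-half-hour backlog come from the talk. The framing of 10 bits per character and the 80-character alarm line are assumptions chosen for illustration, and the real printer may also have been limited by its print mechanism.

```python
BAUD = 300            # from the talk
BITS_PER_CHAR = 10    # assumption: 8 data bits plus one start and one stop bit
CHARS_PER_ALARM = 80  # assumption: one full printed line per alarm message


def chars_per_second(baud: int = BAUD, bits_per_char: int = BITS_PER_CHAR) -> float:
    """Characters the serial link can carry each second."""
    return baud / bits_per_char


def backlog_seconds(queued_alarms: int, chars_per_alarm: int = CHARS_PER_ALARM) -> float:
    """Time needed to transmit a queue of alarm lines."""
    return queued_alarms * chars_per_alarm / chars_per_second()


def alarms_in_backlog(hours: float, chars_per_alarm: int = CHARS_PER_ALARM) -> float:
    """Number of queued alarm lines that would take `hours` to transmit."""
    return hours * 3600 * chars_per_second() / chars_per_alarm


print(f"Link throughput: {chars_per_second():.0f} characters per second")
print(f"100 queued alarm lines: {backlog_seconds(100) / 60:.1f} minutes to print")
print(f"A 2.5 hour backlog: about {alarms_in_backlog(2.5):.0f} queued lines")
```

Under these assumptions the link carries about 30 characters per second, so a burst of alarms that re-trigger repeatedly builds a queue far faster than it can drain.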
00:39:56 So, how do we apply this concept of first and second stories within our teams? Dr. Sidney Dekker has some helpful advice for us, and the first piece is hidden here on the cover of the book: notice the scare quotes around 'human error.' The reason for this is that human error is never the actual cause.
00:40:23 When we are trying to understand why something went wrong, we should agree on a baseline rule: human error is not the cause; it is always a symptom of an underlying systemic problem. Thus, blaming an issue on human error does nothing but prevent us from uncovering what truly went wrong.
00:40:52 A good way to frame the conversation is to ask what is responsible for an outcome, not whose fault it is. Secondly, understand why a decision made sense at the time. The people you work with do not come to work intending to do a bad job.
00:41:20 Chances are, when they make a decision that you don't understand or agree with, or when they miss something obvious, there is a good reason for what they did. Take the time to understand why it made sense to them because if it made sense to them, it will make sense to someone else later.
00:41:48 Third, seek forward accountability, not blame. Our instinct when problems arise is often to find who is responsible and punish them. When we try to steer our organizations away from blame and towards discovering second stories, one of the most common objections is, 'What about holding people accountable?'
00:42:14 Interestingly, removing punishment actually frees people to candidly share their stories of what happened, so that we can learn from them, instead of people hiding what happened to avoid punishment. This brings us to the concept of blameless post-mortems.
00:42:35 Many of you are likely familiar with the idea. The act of narrating what happened, giving an account, and taking ownership of one's part in it is often the only accountability well-meaning people require. Any punishment doled out beyond that doesn't reinforce the lesson we're trying to teach. It's better to give people the opportunity to share their story and understand what happened.
00:43:03 Backward accountability focuses on blaming someone for past events, while forward accountability motivates individuals to concentrate on the changes and improvements needed going forward. The beauty of this technique is its broad applicability. There's always a second story if you're willing to look for it.
00:43:31 This approach works when someone drops the production database, when a team misses an important deadline, when a key team member decides to leave, or when sales misses their quarterly target. There’s always a second story that can be uncovered to deepen your understanding of the situation.
00:44:01 It requires honesty and trust-building, but it's worthwhile because finding the second story can be a powerful way for your team to grow and improve. It allows you to treat your teammates with the humanity and dignity they deserve.
00:44:28 It turns out that asking who destroyed Three Mile Island isn’t even a fair question. What destroyed Three Mile Island is a much more helpful question for us to ask.
00:44:46 Thankfully, that's the question the President's Commission asked. Check out the subtitle of their report; it's full of second stories revealing weaknesses in reactor design and operator training throughout the nuclear industry.
00:45:04 By moving beyond human error to uncover the real causes, the President's Commission made the world a tangibly safer place. If you take the time to look for second stories in everything—not just when there’s an outage—you will create a safer environment for everyone to perform at their best.
00:45:25 You’ll also fix issues affecting your organization's delivery speed and delivery quality. Thank you.