Escalating Complexity: DevOps Learnings From Air France 447

Summarized using AI


Lindsay Holmwood • January 28, 2020 • Earth

In his presentation titled "Escalating Complexity: DevOps Learnings From Air France 447," Lindsay Holmwood discusses the tragic crash of Air France flight 447, which occurred on June 1, 2009, claiming the lives of all 228 passengers and crew. The talk critiques the mainstream narrative that attributes the crash solely to pilot error, arguing instead that this oversimplifies the events surrounding the incident. Holmwood emphasizes the importance of understanding the complexities of systems, and that blaming individuals overlooks the systemic issues that can lead to failures.

Key points covered in the presentation include:

  • Mainstream Narratives: Holmwood discusses how reports frequently blame pilots for incidents, focusing on human error while neglecting the broader context of the complex systems in which they operate.
  • Pilot Experience: He refutes the notion that the pilots were inexperienced, detailing their extensive flying hours and qualifications, thus challenging the argument that poor training led to the crash.
  • Complex Systems: The presentation highlights the significance of viewing pilots as participants in a larger system rather than as isolated actors. Holmwood references the BEA report indicating that different crews under similar circumstances would likely act similarly.
  • Local vs. Global Rationality: Holmwood differentiates between local rationality, which reflects the pilots' decision-making in real-time, and global rationality, which is the benefit of hindsight that investigators enjoy after the fact.
  • Systems Feedback: He stresses the importance of clear feedback mechanisms within operational systems, using the Airbus A330's flight control modes as an example. The lack of effective feedback during the flight led to critical misunderstandings by the pilots.
  • Communication: Holmwood explores the challenges of communication in high-pressure environments, stating that tactile feedback and alert mechanisms need to be enhanced to prevent critical information from being overlooked.
  • Lessons for DevOps: Drawing parallels between the aviation incident and technology operations, he advocates for comprehensive incident response plans and effective communication channels in tech environments.

Holmwood concludes by reminding the audience that while system failures are inevitable, the approach to understanding these systems can transform how we respond to and mitigate risks. The final takeaway is to avoid an anthropocentric view of systems and to recognize the role of context in failures, emphasizing that a holistic perspective can lead to a better understanding of both aviation safety and operational effectiveness in technology.

This talk encourages the audience to consider how their own systems operate and how to improve upon them to avoid tragic outcomes in their fields.


Title: Escalating complexity: DevOps learnings from Air France 447
Presented by: Lindsay Holmwood

On June 1, 2009, Air France 447 crashed into the Atlantic Ocean, killing all 228 passengers and crew. The 15 minutes leading up to the impact were a terrifying demonstration of how thick the fog of war is in complex systems.
Mainstream reports of the incident put the blame on the pilots - a common motif in incident reports that conveniently ignore a simple fact: people were just actors within a complex system, doing their best based on the information at hand.
While the systems you build and operate likely don't control the fate of people's lives, they share many of the same complexity characteristics. Dev and Ops can learn an abundance from how the feedback loops between these aviation systems are designed and how these systems are operated.
In this talk Lindsay will cover what happened on the flight, why the mainstream explanation doesn't add up, how design assumptions can impact people's ability to respond to rapidly developing situations, and how to improve your operational effectiveness when dealing with rapidly developing failure scenarios.


MountainWest RubyConf 2013

00:00:09.280 Good afternoon, everyone. My name is Lindsay Holmwood, and I'm @auxesis on Twitter.
00:00:20.400 I'm an engineering manager at Bulletproof Networks, which is a managed hosting company in Sydney, Australia. I live in an area called the Blue Mountains, which is similar to the Grand Canyon but filled with trees. You may know me from some of my other projects, like cucumber-nagios, which allows you to write Nagios checks in Cucumber; Visage, for graphing data in the browser; and Flapjack. If you're interested, you can come and talk to me about these projects later on.
00:07:01.599 The mainstream explanation for what happened during the flight was that the pilots misunderstood their circumstances. They were portrayed as being poorly trained and failing to react swiftly to alarms. This view was espoused by many major, well-known publications, which we typically trust to deliver accurate information. However, it is a very convenient narrative that simplifies the events leading up to the crash: decompose everything that happened, home in on a broken component, and declare it the root cause. The simplest explanation is that human error was at play: the pilots were negligent and did not follow their training, and that caused the crash. This notion perpetuates the idea of 'bad apples', amoral actors within our systems working against their normal functioning, and leads to the mentality that if we just remove these bad actors, the system will stabilize.
00:08:05.120 Sidney Dekker, a professor of human factors and flight safety at Lund University in Sweden, argues that what you label as the root cause is merely the point at which you stop looking for deeper issues. So let's look beyond the surface. We should examine the flight experience of the pilots operating the plane, given that they were deemed poorly trained and inexperienced. Captain Dubois had logged 10,988 flying hours, 6,258 of which were as captain on the Airbus A330. First Officer Robert had 6,547 total hours, 4,479 of which were on the Airbus A330, while the more junior pilot, Bonin, had 2,936 hours, 807 of which were on the same aircraft. This contradicts the idea that these pilots were inexperienced.
00:09:30.319 A fundamental flaw in this reasoning becomes apparent when you substitute different individuals into the same scenario and ask how they would react under similar stresses. The Bureau d'Enquêtes et d'Analyses (BEA), the French air accident investigation authority and counterpart of the US NTSB, reached the same conclusion in its report: a different crew would likely have taken the same actions, and thus we cannot solely blame this crew for the incident. This aligns with the notion that individuals act as participants within a complex system, which can be conceived of as multiple nested systems in operation. That raises questions about the validity of identifying a singular root cause.
00:10:17.760 Attributing accidents solely to human error follows a very Cartesian and Newtonian worldview in which actions have equal and opposite reactions. When we trace the linear events leading back to the beginning, we too often point to human error as the primary cause. Hindsight does not equal foresight; it transforms a once vague and unlikely future into a clear and definite past. Fortunately, with this investigation, we have clarity and all the facts laid out for us, resulting from a three-year process to locate the crash site and recover the flight data recorder.
00:11:09.040 Once the flight data recorder was analyzed, investigators had access to a wealth of information, including meteorological data. They spent three years investigating an event that unfolded in merely ten minutes. During the crash, the pilots were operating in a dense fog of war, with limited information at hand and a rapidly evolving situation. They were employing a concept known as local rationality, making what they believed were the best decisions based on the data available. In contrast, we benefit from hindsight, allowing us to achieve global rationality. We can analyze all the facts, identify failures, and locate the broken components that supposedly caused the crash.
00:12:05.760 To delve deeper, we should examine the operational modes of the systems in play, focusing on the flight control laws. The flight control computer operates under several laws and modes that determine how pilot inputs are translated into control surface movements across the different phases of flight, from takeoff through cruise to the landing flare. Most of the time the Airbus A330 operates under normal law; however, it can transition into alternate law, which significantly alters the aircraft's flight characteristics.
00:13:09.760 One critical input to the flight control computer is the pitot tube, which measures ram air pressure; compared against static pressure, this yields the aircraft's airspeed. The plane had multiple pitot tubes for redundancy: one for the captain, one for the first officer, and a standby unit. This information feeds into the flight control computer through various pneumatic and electrical channels. A significant issue, however, is how pilots are notified when the flight control computer transitions between operational modes. In hindsight, we are fortunate to have access to the information collected during the investigation, which provides insight into this aspect.
00:14:10.880 A system called ACARS (Aircraft Communications Addressing and Reporting System) transmits crucial information from the plane to a satellite, which relays it to a ground station and ultimately to the airline operator, in this case Air France. Among the data relayed are non-voice communications, operational computer messages, and maintenance messages. From these messages investigators discovered important findings, particularly that the flight control computer had shifted from normal law to alternate law, a change reflected only in a small textual warning on a massive dashboard filled with instruments. Unfortunately, the critical nature of this notification did not register with the pilots until several minutes later, as indicated by their comments just before the crash.
00:15:29.360 Bonin remarked that he had been pitching the nose up for some time, while the captain realized a couple of seconds later that they were in a stall—a realization that came only 40 seconds before the plane crashed into the ocean. This raises crucial questions: How do you provide feedback in your systems? Do you have clear reconfiguration feedback mechanisms when transitioning between operational modes? Are you aware of how those modes behave differently? This is vital for system performance, especially in high-stress environments.
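
To make those questions concrete in a DevOps setting, here is a minimal Ruby sketch (not taken from the talk) of one way to surface an operational mode change as a loud, first-class event rather than a quiet log line. The ControlModes class and the notifier hook are hypothetical names used only for illustration; a real system might page the on-call operator instead of printing.

    # Minimal sketch: treat a mode change as a loud, first-class event.
    # ControlModes and the notifier hook are hypothetical, for illustration only.
    class ControlModes
      MODES = %i[normal degraded read_only].freeze

      attr_reader :mode

      def initialize(notifier:)
        @mode = :normal
        @notifier = notifier
      end

      def transition_to(new_mode, reason:)
        raise ArgumentError, "unknown mode #{new_mode}" unless MODES.include?(new_mode)
        return if new_mode == @mode

        previous = @mode
        @mode = new_mode
        # Announce what changed, from what, to what, and why -- not just a debug line.
        @notifier.call(severity: :critical,
                       message: "mode changed: #{previous} -> #{new_mode} (#{reason})")
      end
    end

    # Usage: print the event; a real system might page the on-call operator instead.
    notifier = ->(severity:, message:) { puts "[#{severity.to_s.upcase}] #{message}" }
    modes = ControlModes.new(notifier: notifier)
    modes.transition_to(:degraded, reason: "airspeed sensors disagree")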
00:16:50.400 Clear sensory feedback is essential. For instance, changing the colors, size, and font of alerts can enhance visibility and response. It's interesting to note that around 10% of the male population is colorblind, which might hinder their ability to distinguish between alert colors. Furthermore, the readability of the text is crucial. In high-pressure situations, it's important that information is presented clearly to support quick decision-making.
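
As a small, hypothetical illustration of that point (a sketch, not anything shown in the talk), an alert renderer can encode severity with redundant cues, using a text label and a symbol as well as color, so the message stays unambiguous for colorblind readers or on a monochrome terminal:

    # Sketch: severity carried by label and symbol as well as color, so color
    # reinforces the message rather than being the only signal.
    SEVERITY_STYLES = {
      critical: { label: "CRITICAL", symbol: "!!", ansi: "\e[31m" }, # red
      warning:  { label: "WARNING",  symbol: "! ", ansi: "\e[33m" }, # yellow
      info:     { label: "INFO",     symbol: "--", ansi: "\e[36m" }  # cyan
    }.freeze

    def render_alert(severity, message)
      style = SEVERITY_STYLES.fetch(severity)
      "#{style[:ansi]}#{style[:symbol]} [#{style[:label]}] #{message}\e[0m"
    end

    puts render_alert(:critical, "flight control law changed: NORMAL -> ALTERNATE")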
00:18:02.960 Another aspect to consider is the lack of tactile feedback for the pilots. In Airbus aircraft, when one pilot moves their sidestick, the other pilot receives no tactile indication of that action. During this incident, Bonin's sustained pull on the stick pitched the nose up and increased the angle of attack until the aircraft stalled. The dilemma was exacerbated by the flight control computer averaging the inputs from both pilots, effectively neutralizing opposing commands. Individual feedback lights are present, but they are small and easily overlooked amid everything else happening in the cockpit.
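
A toy Ruby calculation (not the actual Airbus control logic) makes the danger of silently combining simultaneous inputs easy to see: when two opposing commands are averaged, they cancel each other out, and neither operator gets the behaviour they asked for.

    # Toy example only: averaging two simultaneous stick inputs.
    # +1.0 is full nose-up, -1.0 is full nose-down.
    def combined_input(stick_a, stick_b)
      (stick_a + stick_b) / 2.0
    end

    puts combined_input(1.0, -1.0) # => 0.0  opposite commands cancel out entirely
    puts combined_input(1.0,  0.0) # => 0.5  one pilot's full input is silently halved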
00:19:21.440 To avoid such failures, crews employ Crew Resource Management (CRM) to communicate effectively. However, in high-stress environments, the startle effect can cause pilots to revert to their training, making them less responsive to critical communication. This highlights why understanding your systems and communication processes is essential during incidents.
00:20:09.840 In Air France 447, if the pilots had not interacted with the controls, they might have survived. With global rationality, we can identify every fact in hindsight; however, during the incident, real-time coordination and communication are critical to avoid multiple operators clashing with their inputs or making conflicting decisions regarding the systems under duress.
00:21:09.840 Effective incident response also requires clear communication channels, including a formalized system for disseminating information to the broader business. During Tumblr's 18-hour outage, for example, updates were sparse, which highlights the importance of having a structured communication process in place. Teams should practice incident response plans against a clear set of guidelines, and core metrics should be established to guide teams when working through problems together.
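
One way to make that concrete is to commit to a fixed update cadence and a fixed structure for every update. The Ruby sketch below assumes a hypothetical post_status_update helper; the transport (status page, chat channel, email) matters less than the discipline of always stating the impact, what is being done, and when the next update will arrive.

    # Sketch of a structured incident update. post_status_update is a hypothetical
    # helper; in practice it might publish to a status page or incident channel.
    require 'time'

    StatusUpdate = Struct.new(:timestamp, :impact, :actions_in_progress, :next_update_due,
                              keyword_init: true)

    def post_status_update(update)
      # Placeholder transport: print to stdout.
      puts "#{update.timestamp.iso8601} | impact: #{update.impact} | " \
           "in progress: #{update.actions_in_progress.join(', ')} | " \
           "next update by #{update.next_update_due.iso8601}"
    end

    UPDATE_INTERVAL = 30 * 60 # commit to an update every 30 minutes, even if it is "no change"

    now = Time.now
    post_status_update(StatusUpdate.new(
      timestamp: now,
      impact: "dashboard unavailable for roughly 20% of users",
      actions_in_progress: ["failing over the primary database", "paging storage on-call"],
      next_update_due: now + UPDATE_INTERVAL
    ))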
00:22:28.880 The importance of systems thinking cannot be overstated. Instead of isolating components and seeking shallow root causes, we should understand how the systems we work with can fail or succeed in various ways. Consider scenarios where systems enable communication while simultaneously exposing weaknesses. Think about the sensitive information shared through the leaked US diplomatic cables, which were valuable internally but damaging publicly. Systems are multifaceted and can deliver both innovation and catastrophe.
00:23:54.320 Failure is inherent in complex systems, and when we accept this reality, we can learn to expect it. Recognize that while your systems may not control people's lives, they could significantly impact those who rely on them. We must move away from an anthropocentric view, which positions humans at the center of the universe, without drifting toward technocentrism, which treats humans as inconsequential components. We need to find the balance between the two, cultivating systems that bridge human operators and machines.
00:25:43.040 Chesley Sullenberger, the pilot who successfully ditched an Airbus into the Hudson River, famously noted, 'If you look at human factors alone, you’re missing two-thirds of the total system failure.' It is critical to adopt this mindset to ensure that the poignant final words of an individual facing imminent demise do not become mere headlines in a newspaper.
00:26:49.440 Thank you, and I have time for two questions.
00:29:49.520 A question arose about whether there have been any published comparisons between the Apollo 13 incident and Air France 447. I encourage anyone interested to research that further. The importance of understanding these complex systems is essential.
00:30:03.679 Additionally, someone inquired about other alert systems that might serve as better models. Each domain has unique challenges; the medical field, for example, grapples with alert overload during surgery. While improvements are ongoing, we are still at the frontier, seeking solutions. I highly recommend reading works by Sidney Dekker, especially "Drift into Failure," to delve deeper into these subjects.
00:31:17.200 Thank you very much for your attention.