RailsConf 2017

Managing Unmanageable Complexity

by Patrick Joyce

In the talk "Managing Unmanageable Complexity" at RailsConf 2017, Patrick Joyce discusses the increasing complexity of systems and how it can lead to failures, emphasizing that such failures can often be prevented. He argues that while developers are not negligent, the cognitive load of complex systems outstrips our capabilities, drawing parallels with other high-stakes fields.

Key points include:
- Complexity and Failure: With the rising sophistication of systems, the likelihood of failure increases, but many failures can be addressed through better management techniques.
- Learning from Other Fields: Joyce highlights how checklists have revolutionized crucial areas like aviation and surgery, enabling professionals to manage complexity effectively.
- Case Studies:
  - Aviation: The crash of the Boeing Model 299 due to a simple oversight led to the creation of pilot checklists, which significantly enhanced safety in aviation.
  - Medicine: Atul Gawande's work in hospitals, introducing checklists for central line insertions, resulted in a dramatic reduction in infection rates and deaths, showcasing the power of procedural checklists in high-pressure environments.
- Application to Software Development: Joyce advocates for the introduction of checklists in software development to mitigate risks and improve code quality, suggesting checkpoints before submitting pull requests, during peer reviews, and prior to deployments.
- Actionable Steps: He encourages developers at any level to start implementing checklist practices without needing formal approval, fostering a culture of thoroughness and accountability.

In conclusion, Joyce emphasizes that by adopting simple checklist methodologies, software teams can drastically enhance their performance, ensuring essential tasks are completed and facilitating improved communication among team members.

00:00:12.469 This talk starts with one of the worst days of my professional career. About five years ago, I was sitting at my desk, not feeling particularly well. I was a little sick, and I had actually decided it was time to surrender and admit defeat; it was time to go home and get some much-needed rest. Suddenly, I saw a flurry of Campfire messages firing off, Google Chat windows started popping up, and a product manager came rushing over. Everyone was asking the same question: "What's wrong with the site?" I looked, and I was logged in as Aaron, a co-founder of the company I worked at. So was everyone else visiting the site, and none of us should have been. We saw that we had just done a deploy, so we quickly rolled back, and six minutes later everything was working as it should again. However, we were a pretty high-volume site, so in the 10 or 12 minutes that the issue was live, hundreds of purchases were incorrectly applied to Aaron's account, and dozens of people added credit cards to it. It was not good.
00:01:06.540 I walked downstairs and had a quick conversation with our chief legal counsel. I also spoke with our security team, and in a moment of panic, I threw up in a trash can. Then I went back upstairs to figure out how to prevent this from happening again. I think it's helpful to start with what actually happened. The feature we were trying to roll out was integration with Passbook. This was right before the launch of iOS 6, and our Apple rep had strongly implied that if we were ready with Passbook support on day one, they would feature us on the homepage of the App Store. From prior experience, we knew that this was worth tens of thousands of installs that we wouldn't have to pay for and would gain us lots of new users. It was a big opportunity, but it had a short turnaround and came about when people were already working on other things.
00:01:39.689 So, implementing this feature fell to a junior iOS developer who was not six months removed from college. This was actually the first Rails feature he had worked on. To understand what happened, it helps to know a little about how our authentication worked; this is a simplified version. In the application controller, there was a method that checked your auth cookie, verified that it was signed properly, and confirmed that you were logged in. The problem our junior iOS developer faced was that he needed an account that had purchases on it so that he could test the Passbook integration. In development, we used a slimmed-down version of the production database that filtered out everyone except employees. He looked through it, saw that Aaron had a lot of purchases, and decided Aaron's account seemed like a good test account.
00:02:21.750 He added this line of code. Now, I'm sure there are people in the room thinking, "That is just bad code; you shouldn't have done that, it was dangerous." I would actually agree—even in development, code like this carries the risk of going to production, so you should never write it. But it's understandable how it happened. It solved his immediate problem: it let him easily log in to an account with plenty of test data. I also want to emphasize that the team wasn't lazy, reckless, or indifferent to quality. Some might think, "Well, we have tests; this would never happen to us." We actually had good unit test coverage around the authentication system. But look at what was added: a find_by_email, which returns nil when no matching account exists, combined with an or. Unless an account with that email address existed in the test database, the check would fall through and work like it always did.
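To make the failure mode concrete, here is a minimal sketch of what that kind of override can look like. This is a reconstruction for illustration only, not the actual code from the incident; the email address and the user_from_signed_cookie helper are placeholders.

```ruby
# A reconstruction for illustration only -- not the actual code from the incident.
# The email address and the user_from_signed_cookie helper are placeholders.
class ApplicationController < ActionController::Base
  def current_user
    # The added clause: convenient in development, where it logs you straight in to a
    # well-stocked test account, but in production it logs everyone in as that account.
    @current_user ||= User.find_by_email("aaron@example.com") || user_from_signed_cookie
  end

  private

  # The normal path: read the signed auth cookie and look the user up.
  def user_from_signed_cookie
    User.find_by_id(cookies.signed[:user_id])
  end
end
```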
00:03:06.540 That email address was not in our fixture data, so the tests just fell through and behaved like they always did. We also had continuous integration: the full test suite ran on a machine that was not the developer's, which gave us a false sense of security. You might think code review would have caught it, and we did have someone review the code—an extremely talented and conscientious developer actually looked it over. However, this was their first code review, it was a tight deadline, and this particular change was one line in a thousand-line diff spread across 14 work-in-progress commits. That one line was missed. This can happen; people make mistakes.
00:03:45.990 Surely, if you actually ran the code—any sort of manual testing whatsoever—you would have caught it. The developer who reviewed it did run it, but again, Aaron was a good test account with lots of test data, and the reviewer, in a less dangerous way, often used Aaron's account when testing. So seeing Aaron's name while testing didn't raise any alarm. After we communicated about the incident, spelunked through our logs to figure out where all of those hundreds of purchases were supposed to go, and cleaned up all the data, I went home much later that night, still struggling with how we could prevent this in the future. I remembered a New Yorker article I had read a year or two before by Atul Gawande, a surgeon who wrote about how surgeons handle complexity.
00:04:21.300 He expanded that article into a book, The Checklist Manifesto, which I bought and read cover to cover that night. It's only 150 pages—it's not that long. A lot of what I want to talk about today comes from that book. Let's start with another field that has had to deal with increasing complexity: aviation. This is the Martin B-10, introduced in 1934. It was the state of the art in the American military arsenal: the first all-metal monoplane bomber to go into regular production, and it revolutionized the design of large aircraft. However, in the early days of aviation, things were developing quickly. Not a year after this plane was introduced, the Army Air Corps announced a competition for a successor. They wanted a plane that had a longer range, was faster, and could carry a larger load.
00:05:07.620 The hands-down favorite was the Boeing Model 299. Where its competitors had two engines, it had four, and it was the largest land plane yet produced in the United States. The Model 299 was head and shoulders above the other competitors for this contract: it had twice the range, could carry twice the load, and was 30% faster. The Army was so excited that, after the first test flight, it entered into discussions with Boeing to purchase 65 planes before the competition had even been completed. However, in October of 1935, two very senior test pilots—a U.S. Army major and Boeing's chief test pilot—got into the plane at Wright Field in Dayton. They took off, climbed to about 300 feet, stalled, and crashed; both test pilots were killed.
00:06:02.650 Boeing could not complete the evaluation and therefore legally could not win the contract. Upon investigation, it was determined that the cause of the crash was that the pilots had failed to disengage the gust locks. Gust locks hold the control surfaces—the elevator and rudder—in place while the plane sits on the runway so they don't get damaged by the wind. Releasing them only requires flipping one switch—a very simple task. This error wasn't due to a lack of expertise; again, two of the most experienced test pilots in the world were flying the plane. It wasn't due to carelessness—if there were ever a time to be dialed in, it was when taking off in the largest experimental aircraft ever produced, with your life literally at risk. It was a simple step, but one of dozens of vital steps.
00:06:50.600 This was the most complex plane ever produced. Just five years before the Model 299's fateful crash, the A3 had been the most advanced plane in the American arsenal. There was a lot going on in the A3 cockpit, but a trained expert could handle it. The cockpit of the Model 299, in contrast, presented a level of complexity fundamentally different from the planes that came before it. Following the crash, there were major concerns that this plane was simply too difficult for people to fly. However, the Army was intrigued by its capabilities—range, load capacity, and speed—and found a way, through a contracting loophole, to place an order for 419 planes, giving Boeing time to figure out how to fly them successfully.
00:07:34.660 One thing they could have done was try to reduce the complexity to make the plane easier to fly, but given the state of the technology at the time, all of the controls were necessary. This was necessary complexity—not accidental complexity; they couldn't remove it. Instead, what Boeing and the test pilots realized was that they were running into the limits of human cognition. They produced checklists of all the steps that needed to be done for common operations: before you start the plane, these are the things you need to do; when you're starting the engines, this is what you need to do; before you take off, this is what you need to do. Armed with these checklists, a plane that had been too complicated for two of the world's most expert pilots to fly became manageable.
00:08:25.810 This proved to be crucial when World War II broke out. The Model 299 became the B-17, a long-range bomber that proved essential to the Allied campaign in Europe. Nearly 13,000 B-17s were produced, and they dropped 40% of the bombs the U.S. dropped in World War II. It's not a stretch to say the B-17, and the capability to safely fly it, were instrumental in defeating Hitler. Since the B-17, checklists have become a key part of aviation safety culture. For example, when a US Airways flight took off from LaGuardia in New York and flew into a flock of Canada geese, destroying both engines, the pilots were able to make an emergency landing in the Hudson River without losing a single passenger. They did that armed with checklists for responding to engine failure and for ditching the plane in the water.
00:09:16.600 So, checklists represent a huge improvement in aviation safety. What about another high-stakes field: medicine? Our friend Dr. Gawande is a surgeon, so let's talk about medicine. This is a central line: a catheter inserted into a large vein near the heart, enabling doctors to administer medicine and fluids directly into the bloodstream. It is an extremely common procedure in U.S. ICUs—patients spend over 15 million days a year with central lines inserted. Yet central lines are a leading cause of bloodstream infections, which are incredibly serious: thousands of people die from them each year, they add billions of dollars in costs, and they are preventable.
00:10:01.340 In 2001, a critical care doctor at Johns Hopkins, Peter Pronovost, decided to tackle this problem. He aimed to improve overall care in the ICU, and specifically to reduce the rate of central line infections. Since we know these infections are preventable, he created a simple checklist with five critical steps. Every time a central line is inserted, doctors should wash their hands with soap, clean the patient's skin with an antiseptic, cover the patient with sterile drapes, wear a mask, hat, gown, and gloves, and apply a sterile dressing over the insertion site. These are fairly simple things, and you would think that a hospital like Johns Hopkins—one of the best in the world—would already follow them every time.
00:10:49.050 Before rolling out the new checklist, he asked nurses in the ICU to observe the insertion of central lines for a month and report the results. They found that even at Hopkins, during the insertion of central lines in the ICU, where the most critical patients are cared for, in over a third of patients, one of these steps was skipped. He collaborated with the hospital administration to empower nurses to stop a doctor if they noticed they were skipping one of the steps. Nurses were also asked to check every day for any central lines that could be removed. In the year before the checklist was introduced, the 10-day line infection rate hovered around 11%. In the year after the checklist was implemented, that rate dropped to zero.
00:11:44.140 The results were so impressive that they didn’t completely believe them. They monitored the situation for another 15 months, and during that entire time, there were only two infections reported in that ICU, preventing an estimated 43 infections and eight deaths, which represented approximately $2 million in cost savings. We have seen two different fields with a huge impact from introducing checklists, but I’m willing to wager that many in this room might be skeptical about whether this could translate to software development. One common objection is that the examples I've provided are largely about ensuring that repetitive tasks are completed, and we have a solution for that—we automate things.
00:12:29.860 For instance, when deploying, we can add a restart for our workers as part of the deploy script so that we're not relying on someone to remember to do that every single time. But checklists can also help with more complex problems beyond just ensuring that simple things are done. I want to share another example from medicine: surgery. I have a healthy respect and fear of the complexity we deal with in building web applications. When considering all the systems required to place an order on an e-commerce site, it’s sort of a miracle that anything ever works. However, I also recognize that nothing we do is nearly as complex as performing surgery, where you cut open a living, breathing person and try to fix something. Surgery makes what we do look trivial.
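(On the automation point from a moment ago: a minimal sketch of baking the worker restart into a deploy task. The Rake task, the git-push deploy flow, and the host and service names are assumptions—the talk doesn't show its actual tooling.)

```ruby
# Rakefile -- a hypothetical deploy task. The git-push deploy flow, host, and
# service names are assumptions; the talk doesn't show its actual tooling.
namespace :deploy do
  desc "Deploy the application and restart background workers"
  task :production do
    sh "git push production main"                                   # ship the new release
    sh "ssh deploy@app.example.com 'sudo systemctl restart workers'" # the step nobody has to remember
  end
end
```

Once the restart lives in the task, it happens on every deploy whether or not anyone remembers it.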
00:13:16.790 Surgical procedures are incredibly varied, with thousands of commonly performed procedures and every patient being different. Tiny errors can have massive consequences: a scalpel half a centimeter to the left, an antibiotic administered five minutes too early or too late, or one of hundreds of surgical sponges forgotten and left in the body cavity can result in life-and-death outcomes. In 2006, the World Health Organization approached Dr. Gawande for help. They noted that the rate of surgery had skyrocketed worldwide, with over 230 million major surgical operations performed in 2003. However, the rate of safety had not kept pace with this increase.
00:14:02.320 We don’t have perfect statistics, but estimates indicate about 17% of those 230 million surgeries resulted in major complications. Dr. Gawande led a working group aimed at generating recommendations for interventions that could improve the standard of surgical care worldwide. This is an incredibly difficult task due to the vast number of procedures and wildly varying conditions under which they're performed. Ultimately, they decided to create a general surgery checklist—a simple document that consists of just 19 steps, fits on a single sheet of paper, and takes about two minutes to run through.
00:14:50.000 This checklist has the potential to improve safety for all 230 million annual surgeries. The first thing it does is create three pause points where key actions are checked and important conversations are prompted. Before administering anesthesia, before the first incision, and before the patient leaves the operating room, the surgical team comes together to make sure they have taken care of the basics and had the necessary conversations. This structure lets highly competent professionals do their jobs while ensuring that critical steps aren't overlooked and essential discussions take place. One of the steps, for example, is confirming that antibiotics were administered within the 60 minutes prior to incision, so they are in the bloodstream and reduce the infection rate.
00:15:39.860 Another crucial aspect is fostering communication. Before the first incision, the entire surgical team introduces themselves and clarifies their roles—something that doesn’t always happen. Additionally, the surgeon reviews potential risks and complications related to the surgery. By discussing these in advance, the team is more likely to respond effectively if any complications arise. This checklist is incredibly concise, yet its potential impact is massive. After some pilot tests in a single operating room to iron out any issues, the WHO implemented a pilot program across eight diverse hospitals in various locations.
00:16:30.540 They monitored the quality of care before and after the checklist's introduction to assess its impact. Prior to introducing the checklist, observers spent three months monitoring over 4,000 operations. During this period, 400 patients experienced serious complications, and 56 died. Following the introduction of the checklist, these statistics were remarkable—the rate of major complications dropped by 36%, and the death rate reduced by 46%. All of this came from a single-page, 19-step checklist that takes only about two minutes to run through.
00:17:12.220 So, what makes checklists effective? They ensure that simple yet critical tasks are not overlooked while also facilitating essential conversations among team members and empowering experts to make decisions. It's not about reducing the surgeon's role to merely ticking boxes; it's about ensuring that the right people are communicating and planning so that they can adeptly address unforeseen circumstances. A well-designed checklist clearly lays out the goals—be it a task-oriented list, a communication list, or a combination of both. The aviation checklist we looked at was primarily task-oriented, while the surgical checklist served as a hybrid containing both task and communication elements.
00:17:50.560 It's vital to specify who is responsible for carrying out each step. Traditionally the surgeon rules the operating room—the god complex is a real thing—but the surgeon's hands are busy, so the responsibility for making sure the checklist gets executed usually falls to the circulating nurse, who walks through the steps. You also need to identify when each step should occur, by finding natural pause points where validation can happen. Don't try to cover everything: there are far more than 19 steps in a typical surgery, and attempting to be comprehensive would make the checklist too burdensome to use. Lastly, the checklist should be iteratively improved over time.
00:18:31.830 So, returning to a few years ago, after my epiphany about these examples, I started thinking about how checklists could apply to software development. I looked for the natural pause points where something similar could fit and identified three: before submitting a pull request, while reviewing a pull request, and before deploying. Before submitting a pull request, individual developers should ask themselves certain questions: Have I examined every line of the diff? Am I sure that everything here is intentional? Is there anything in this patch unrelated to the overall change? Have I separated refactoring from feature changes, creating distinct units of work?
00:19:20.250 Have I organized the commits in a way that makes the reviewer's job easier? Did I run the code locally? It's surprising how often developers, including myself, go through the motions of submitting a change without verifying that it works. If there's a formal QA team, does this change require thorough scrutiny from someone else? Lastly, does the pull request clearly communicate its goals and provide instructions for verifying that the feature works? When reviewers receive a pull request, they should be asking themselves several other crucial questions.
00:20:06.740 Do I understand the goal of this change? If you can't grasp what the pull request aims to achieve, you cannot effectively review it. Have I thoroughly inspected every line between this branch and master? Remember the pain caused by a single missed line in a review? Have I checked that the code runs and is functional? Is further QA necessary? Are there sufficient tests accompanying the change, and how will we validate that this change accomplishes its intended goal?
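The submitter's half of that checklist is easy to turn into something you can actually run. Here is a minimal sketch, as a standalone Ruby script a developer might run before opening a pull request; the script and its exact wording are mine, not the talk's.

```ruby
#!/usr/bin/env ruby
# pr_checklist.rb -- a hypothetical helper; the talk proposes the questions, not any tooling.
# Walks through the "before submitting a pull request" questions and stops at the first "no".

QUESTIONS = [
  "Have I examined every line of the diff, and is everything in it intentional?",
  "Is this patch free of changes unrelated to its overall goal?",
  "Is refactoring separated from feature changes, as distinct units of work?",
  "Are the commits organized to make the reviewer's job easier?",
  "Did I run the code locally and verify that it works?",
  "Have I flagged whether this change needs a closer look from QA?",
  "Does the pull request explain its goal and how to verify the feature?"
].freeze

QUESTIONS.each do |question|
  print "#{question} (y/n) "
  answer = $stdin.gets.to_s.strip.downcase
  abort "Stop here and address that before opening the pull request." unless answer.start_with?("y")
end

puts "Checklist complete -- go ahead and open the pull request."
```

Many teams get a similar effect by checking a pull request template into the repository, so the same questions show up in every PR description automatically.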
00:20:52.830 After addressing the individual steps of submitting and reviewing pull requests, the next logical phase is deployment. Before deploying, both the submitter and reviewer should have a quick discussion to identify potential risks. What problems could arise? Are there performance concerns we're worried about? What factors may differ between production and development or staging? For example, are we utilizing a third-party service for the first time, necessitating different production credentials? Is it the right time to deploy the change?
00:21:39.920 If we are pushing an update related to a significant purchase flow, we may want to reconsider deploying during peak hours. Alternatively, is it feasible or preferable to release this update to a subset of users using a feature flag or an A/B test? What specific steps will our team undertake to verify that the deployment is functioning as intended? Once we deploy, how can we ascertain whether the change is operating as predicted? If complications do arise, do we know how to effectively roll back the update, or is this a situation where we've altered the database schema, making it impossible to revert to an older code version?
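As one concrete way to take the "subset of users" option, here is a minimal sketch of a percentage-based feature flag with stable per-user bucketing. The class and flag names are hypothetical; the talk doesn't prescribe any particular flagging system.

```ruby
# A hypothetical sketch -- FeatureFlag and :new_purchase_flow are illustrative names,
# not anything from the talk; teams often use a library such as Flipper instead.
require "zlib"

class FeatureFlag
  ROLLOUTS = { new_purchase_flow: 10 }.freeze # percent of users who get the new path

  # Deterministic bucketing: the same user always lands in the same bucket,
  # so they see a consistent experience while the rollout percentage grows.
  def self.enabled?(flag, user_id)
    Zlib.crc32("#{flag}:#{user_id}") % 100 < ROLLOUTS.fetch(flag, 0)
  end
end

# Route a small, consistent slice of users through the risky change first,
# watch the results, then raise the percentage (or drop it to zero to back out).
puts FeatureFlag.enabled?(:new_purchase_flow, 42) ? "new purchase flow" : "legacy purchase flow"
```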
00:22:26.320 I didn't conduct a controlled study or implement a randomized control group, nor did I monitor the exact number of issues beforehand and afterward. However, I can confidently say that we caught issues that could have otherwise slipped through the cracks. For example, we were on the verge of deploying a new third-party service and discovered we hadn’t put the production credentials in the credential management system, which would have caused significant problems. We also encountered instances where something went wrong, yet we managed to respond and recover much faster because we discussed risks beforehand.
00:23:10.280 I am undoubtedly a convert; I believe checklists can help teams ship higher-quality software more quickly and with significantly less stress. But I want to leave you with one more crucial point: one of the great advantages of checklists—and of the lightweight process they entail—is that anyone in this room can start using them without needing formal permission. If you are a developer working solo, you can identify the things to consider before deploying and start applying that discipline.
00:24:04.500 If you're an individual developer on a team, you can raise these ideas in deployment discussions and model the practice yourself. If you're leading a team, you can share these examples and the research behind them and help your team adopt these practices. Thank you very much for your time. My name is Patrick, and my personal blog, which I update occasionally, can be found at pragmatist. I am the Director of Engineering at Stitch Fix, where we are hiring, so please feel free to reach out to any of the friendly Stitch Fix folks here. I'm also on Twitter at KeeperPat. Our team shares interesting content at multithreaded.stitchfix.com. Thank you very much.