What Could Go Wrong? The Subtle Science Of Coding for Failure
Summarized using AI

by Daniel Fone

In the talk titled "What Could Go Wrong? The Subtle Science Of Coding for Failure" by Daniel Fone at RubyConf AU 2019, the speaker explores the theme of anticipating and quantifying risks in software development. Drawing from his background in environmental chemistry and human cognition, Fone emphasizes the importance of asking critical questions regarding potential failures when developing software. He introduces a simple model for assessing danger, focusing on two key factors: likelihood and severity of an outcome.

Key points discussed throughout the video include:

- The Main Question: Fone encourages developers to consistently ask, "What could go wrong?" in various coding scenarios, comparing it to navigating risks in everyday life.

- Understanding Danger: He defines danger as the combination of the possibility of harm and its severity, leading to the insight that software developers often misjudge risks based on social narratives rather than factual data.

- Micromorts: Fone introduces the concept of a micromort, representing a one-in-a-million chance of death, to illustrate the relative risks associated with various activities (e.g., skydiving versus childbirth).
- Biases and Decision Making: Emphasizing human cognitive biases, Fone explains how our decisions are influenced by evolutionary social tendencies, focusing on anecdotal narratives rather than statistical evidence, leading to misjudged risk assessments.
- Practical Coding Examples: Fone provides concrete examples from coding tasks to illustrate how to address potential failures. For instance, when fetching a file from a URL, he methodically analyzes what could go wrong, the likelihood of each issue, and its potential severity, showcasing a structured approach to risk mitigation.
- Real-World Applications: He discusses how the insights derived from assessing risks can lead to improved development processes, such as replacing conventional tests with interactive testing scripts that minimize danger without introducing new risks.

In conclusion, Fone reiterates the need for developers to remain mindful of their biases and social instincts while assessing danger in software development. He concludes with three guiding questions for every coding task:

1. What could go wrong?

2. How likely is it?

3. How bad is it?

These principles not only help in building better software but also foster a more robust conversation about risk among development teams.

00:00:06.170 This will be a simple talk, I hope. I'm going to start and finish with just one question, and if you forget everything in between, that's fine. As people who create things, I want to help us ask one simple question: what could go wrong? That's it! Every line of code, every user story, every deployment, every time you SSH into production without telling anyone: what could go wrong?
00:00:19.740 Now, obviously, there's a way to ask this rhetorically. We know what this looks like: we're about to set our friend on fire, or we need to cross a bridge with just a little more river under it than usual. I think there's a conversation happening here that seems familiar to many of us: 'We can make it.' 'No, we can't.' And I think the same conversation happens in both professional and personal contexts, whether it's parenting or pushing to production. My goal today is to help us ask that question a little more critically. What could go wrong? Not only so that we can make better decisions, but also so that we can disagree more robustly about crossing the bridge.
00:01:00.660 A long time ago, in another life, I studied environmental chemistry. I don't remember too much about my chemistry degree, but there are a couple of key things that I recall. First, everything is toxic if you have enough of it in the right attitude. Second, danger has two factors: risk and hazard. Chemistry was a long time ago, and as soon as I say that, I'm terrified that someone will come up to me afterwards and say, 'Well actually, ISO 31000 defines...' And you know what? You're probably right. So, let's be really clear and simple: the Oxford English Dictionary describes danger as the possibility of suffering harm or injury, which again has two factors: possibility and harm.
00:01:45.900 Just like risk also has two factors. The terminology doesn't matter so much, but just to appease the ISO gods, let's say instead that the dangerousness of something is determined by its likelihood and its severity. Ultimately, I want us—every time we ask what could go wrong—to also ask how likely it is and how bad it is. This is it. This is the talk. What could go wrong? How likely is it? How bad is it? So, let's break this down a little bit.
00:02:29.340 If I try to walk across a gymnast's beam about a meter or so off the ground, I'm very confident I could do that. I'd be fine. But if we put that beam at the height of the top of the forum, which is about 50 meters, no amount of money is going to convince me to do that. I hope that's intuitive for most of us. But let's think about what's actually changing here. The likelihood of falling is about the same; I'm just as capable. In fact, if anything, I'm far more motivated to step accurately as I cross the beam. But the severity changes dramatically. If I fell from that height, I'd hit the ground at about 112 kilometers an hour.
00:03:06.299 So, the scenario is more dangerous not because I'm more likely to fall, but because the severity changed from, 'That would look dumb,' to 'I'm almost certainly dead.' Let's return to our disagreement on the bridge: what are we actually disagreeing about here? Sure, everyone agrees that if the bridge collapses, that would be very bad. But maybe someone thinks it's definitely going to collapse, and someone else thinks it'll be fine, and we begin to identify the heart of the disagreement. It's about the likelihood. Or maybe everyone agrees that a wave is definitely going to hit the car, but someone thinks it won't be too bad while someone else thinks it's going to break the car. Then the disagreement is about the severity.
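The impact speed quoted here can be checked with the free-fall formula v = sqrt(2gh). The sketch below is an illustration and not something shown in the talk; it assumes no air resistance and g = 9.81 m/s², and the method name `impact_speed_kmh` is hypothetical.

```ruby
# Free-fall impact speed, ignoring air resistance (an assumption).
GRAVITY_M_PER_S2 = 9.81

# Returns the speed in km/h after falling height_m meters from rest.
def impact_speed_kmh(height_m)
  meters_per_second = Math.sqrt(2 * GRAVITY_M_PER_S2 * height_m)
  meters_per_second * 3.6 # convert m/s to km/h
end

puts impact_speed_kmh(50).round(1) # a little under 113 km/h
puts impact_speed_kmh(1).round(1)  # the one-meter beam, for comparison
```

The likelihood of falling is not part of this calculation at all, which is the point of the beam example: only the severity term changes with height.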
00:04:04.050 Now, I don't know how cars, rivers, or bridges work, but as we start to identify the things that could go wrong and we distinguish the severity and the likelihood, we can begin to disagree more rigorously about the danger. Everyone agrees it would be very bad if we died; we just disagree on how likely it is. This brings us to the concept of a micromort.
00:04:45.180 A micromort is a defined quantum of dangerousness; it's a one-in-a-million chance of death. So, the likelihood is one in a million, and the severity is a single death. Micromort data for travel suggests that, per kilometer, flying is far safer than any other form of transport. It would be interesting to see that same data per trip, though. And this is all sourced from Wikipedia, so take it with a grain of salt.
00:05:21.510 Similarly, for recreational risks, scuba diving, marathons, and skydiving are all about the same risk level, which surprises me. I've run a couple of marathons, and skydiving did not seem equally dangerous. Then we've got crazy dangerous sports, like base jumping, which is notoriously risky. In fact, climbing Everest had 223 fatalities from around 5,500 climbers.
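To make the unit concrete, here is a minimal sketch that converts a fatality count into micromorts per exposure. It uses only the Everest figures quoted above (223 deaths from around 5,500 climbers); the method name `micromorts` is hypothetical, and treating each climber as one exposure is an assumption.

```ruby
# One micromort is a one-in-a-million chance of death.
MICROMORTS_PER_CERTAIN_DEATH = 1_000_000

# Converts observed deaths over a number of exposures into micromorts
# per exposure.
def micromorts(deaths, exposures)
  deaths.to_f / exposures * MICROMORTS_PER_CERTAIN_DEATH
end

everest = micromorts(223, 5_500)
puts everest.round # => 40545
```

On those figures a single Everest attempt carries roughly 40,000 micromorts, which is why it sits in a different category from the recreational risks listed earlier.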
00:05:50.400 To put these risks in perspective, in Australia in 2016, the risk of childbirth to a mother was an order of magnitude riskier than skydiving. That's about normal; in fact, that's quite good by world standards. But it surprised me in terms of the relative risks because it didn't really match the stories we tell ourselves about these things. And it's important to remember these are population-level statistics. The more I dug into the details and data of that risk, the more shocked I was about how wrong my stories and assumptions were.
00:06:28.290 At this point, I start to realize how often my perception of danger isn't driven by knowledge, but rather by my social narratives. Underneath all this logic, we have to remember that we have the brains of social primates. Two-and-a-half thousand years ago, Aristotle wrote that a human is by nature a social animal, and modern neuroscientists like Matthew Lieberman seem to agree. He argues that the human brain has been primed by evolution to view the world predominantly in social terms.
00:07:11.470 So, our neural biology is overwhelmingly wired for social cognition and much less for the kind of statistical cognition we need for risk analysis. One of the most common ways I see this play out is the tendency to fixate on a single factor when we're thinking about danger. For example, we can easily focus on the low likelihood of something bad happening and completely discount the high hazard of a scenario. We might call this complacency.
00:08:15.599 Let's say we have an eight-seater car; even with a rear-view camera, I am ashamed to admit I have backed into more than one pole. Now, we have this little jingle when we reverse, and the kids recite this safety chant because it's easy to think, 'It'll never happen to me,' and forget how unimaginably bad it would be if something did happen. Alternatively, we can fixate on the high severity of something and completely ignore the very low risk, a tendency we might call hysteria.
00:09:00.180 A good example of this comes from Gerd Gigerenzer, a psychologist who studies decision-making. He discusses how, in the 12 months following 9/11, an estimated 1,600 Americans died on the road because many chose to drive long distances rather than fly. He suggests that the very real threat of that road toll isn't as compelling or scary to us as the potential deaths that could arise from mass violence. We can see from our collective behavior that fear and that fixation on single factors are just part of our social inheritance.
00:09:56.859 Whatever the reason, there's also a special type of fixation on likelihood, where we misjudge the probability of risks. I recently read about a six-year-old boy who got lost in the woods near his home. Rather than wait to be rescued, he hiked for two days and managed to save himself. Having a six-year-old myself, I think how you interpret that story depends largely on who you follow on Twitter.
00:10:58.070 The point is that I read that story because he survived. This is an example of survivorship bias: the fact that something went right doesn't tell us how likely it was to go wrong. The truth is, this is just how our brains work. Our instincts are often governed by social factors—your family of origin, your tribal ideology, and the beliefs built around your identity and sense of belonging.
00:12:01.860 The rational layers of our brains are relatively recent, so while we have some hurdles to overcome when thinking about danger, I believe that if we can be mindful of our biases and think about danger in a structured way, we're actually quite good at this. We may be social primates, but we went to the moon.
00:12:57.500 When we code, the answers to the question 'What could go wrong?' depend on the broad context surrounding the specific action we're trying to take. It's a little artificial to try and demonstrate this in a conference talk, so let's start at the line of code level.
00:13:10.140 Here's a simple working method: we're fetching a file from a URL. If you can't read all of the code, don't worry; what's important is the gist. What could go wrong? Let's start with 'nil.' What happens if the URL is nil? How likely is that? As a Rubyist, I can say it's extremely likely.
00:13:42.240 How bad is it? Well, if we test it, we see that we get an ArgumentError from the first method, and it's pretty clear what's gone wrong. That isn't so bad, so in this context, I'm going to say it's not dangerous enough to warrant fixing. What else? What about a blank string or a garbage string?
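The code on the slide is not captured in this transcript. As a rough stand-in, a starting point like this might look as follows, assuming it is built on `Kernel#URI` and `Net::HTTP.get_response` from the standard library. The method name `fetch_naive` is hypothetical.

```ruby
require "net/http"
require "uri"

# A deliberately naive fetch: no validation, no status check, no redirects.
# This is a guess at the shape of the original, not the speaker's code.
def fetch_naive(url)
  uri = URI(url)                          # raises ArgumentError for nil
  response = Net::HTTP.get_response(uri)  # performs the GET request
  response.body
end
```

With this shape, passing `nil` fails immediately in `URI(url)` with an `ArgumentError`, before any network access happens.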
00:14:07.380 Turns out those are all valid URLs, and instead, we get a very obscure error from the HTTP request. It's hard to see what's gone wrong here. I find this just as likely as getting a nil. So, this is worth fixing; we'll catch that scenario and raise a clear error.
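One way to catch that scenario is to check the parsed URI before making the request. This is a sketch under assumptions: the helper name `parse_http_uri` and the choice of `ArgumentError` are illustrative, not taken from the talk.

```ruby
require "uri"

# Parses url and insists on an http(s) URI with a host, so that blank or
# garbage strings fail early with a readable message.
def parse_http_uri(url)
  uri = URI(url)
  unless uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
    raise ArgumentError, "expected an http(s) URL, got #{url.inspect}"
  end
  uri
end

parse_http_uri("https://example.com/report.csv") # returns a URI::HTTPS
```

`URI::HTTPS` is a subclass of `URI::HTTP`, so the single `is_a?` check covers both schemes, while `""` and `"garbage"` parse as `URI::Generic` and are rejected.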
00:14:45.050 Now we've got a valid HTTP URL. What could go wrong next? There's a whole class of connectivity errors: DNS errors, connection timeouts, host errors. How likely is it? It's the Internet, so it's highly likely. How bad is it? Again, we test; we see the exceptions are pretty good and descriptive. We know what's gone wrong. Maybe we could catch them all and re-raise them, but I don't think that's dangerous enough to fix.
00:15:29.140 Okay, we've now gotten a response from the server. So, what's next? What if we get a 404 or a 500 error? How likely is that? Again, we're on the Internet. How bad is it? I think it's actually worse than a failed request because it doesn't fail; you think you've succeeded, and you have no indication that there's been a failure. This is likely; it's bad, and we definitely need to fix that.
00:15:51.810 It turns out Ruby gives us a value method on Net::HTTPResponse that raises an error for any non-2xx status code. Who knew? That definitely needs a good comment. What else could go wrong with the response? What happens if we get a 3xx response for a redirection? How bad is it? Right now, we raise an exception. How likely is it? Well, I only thought about that because it happened while I was testing.
00:16:33.670 So, that's likely enough for me. We'll catch redirects, check the response, and recurse with the new location. What's next? I think it's very unlikely that we will get stuck in a redirect loop, but how bad would it be if we did? This isn't some misleading success; it's not a hard-to-decipher exception; it just keeps going.
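The method being described is `Net::HTTPResponse#value`, which raises unless the response is a 2xx success. The sketch below builds response objects by hand so it runs without a network call; the helper name `ensure_success!` is hypothetical.

```ruby
require "net/http"

# Returns the response if it is a 2xx success; otherwise lets
# Net::HTTPResponse#value raise. The odd method name is why a comment
# like this one is worth having.
def ensure_success!(response)
  response.value # raises for any non-2xx status, returns nil otherwise
  response
end

ok        = Net::HTTPOK.new("1.1", "200", "OK")
not_found = Net::HTTPNotFound.new("1.1", "404", "Not Found")

ensure_success!(ok) # passes silently
begin
  ensure_success!(not_found)
rescue Net::ProtocolError => e
  puts "request failed: #{e.message}"
end
```

The exact exception class depends on the status family, but the ones raised by `value` all descend from `Net::ProtocolError`, so a single rescue covers them. A 3xx response also raises here, which matches the redirect behavior discussed next.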
00:17:10.180 If you're trying to connect to a slow server, it keeps going, again and again, and eventually, it'll run out of resources. This is the worst-case scenario I've seen so far; even though it's very unlikely, we definitely want to fix it. So, it's getting a little more complicated now.
00:17:38.110 We'll throw in a max redirects limit and raise an error if that limit is exceeded. Now we have a robust way to handle the response. We could keep going, thinking about fetching multi-gigabyte files on a 512-megabyte Heroku dyno, but I think that's enough for now.
00:17:52.710 If we compare this to our original implementation, the happy path is roughly the same, but now we've asked what could go wrong. Note that this is not purely defensive coding: there are some scenarios that we thought through and decided were not dangerous enough to warrant handling, but the real dangers have been mitigated. That's one application of that mindset at the line of code level.
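Putting the steps so far together, a hardened version might look roughly like this. It is a sketch, not the code from the slides: the names `fetch` and `TooManyRedirects`, the default of 5 redirects, and the injectable `http:` parameter (added only so the method can be exercised without a network) are all assumptions.

```ruby
require "net/http"
require "uri"

class TooManyRedirects < StandardError; end # hypothetical error class

def fetch(url, max_redirects: 5, http: Net::HTTP)
  raise TooManyRedirects, "redirect limit exceeded at #{url}" if max_redirects.negative?

  uri = URI(url) # nil still raises a clear ArgumentError here
  unless uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
    raise ArgumentError, "expected an http(s) URL, got #{url.inspect}"
  end

  # Connectivity errors (DNS, timeouts) are left to propagate unchanged.
  response = http.get_response(uri)

  if response.is_a?(Net::HTTPRedirection) && response["location"]
    # Location may be relative, so resolve it against the current URI.
    location = URI.join(uri.to_s, response["location"]).to_s
    return fetch(location, max_redirects: max_redirects - 1, http: http)
  end

  response.value # raises for any remaining non-2xx status
  response.body
end
```

The risks judged not dangerous enough (nil input, connectivity errors) are still unhandled on purpose. Only the misleading success, the obscure error, and the unbounded loop get explicit code.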
00:18:26.150 Now, here’s a bit of a counterexample. Many years ago, I needed a unique, non-guessable key for a record. It's a pretty common and straightforward scenario, and while there are better ways to do this now, back then I just added a callback to generate a UUID, which worked perfectly. So, what could go wrong? Now, I had written a blog post on this exact subject, specifically noting that you would need a hundred twelve terabytes of UUIDs before facing even a one-in-a-billion chance of a collision.
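The callback itself is not shown in the transcript. As a plain-Ruby stand-in (in a Rails model this would more likely be a `before_create` callback), generating such a key can be sketched with `SecureRandom.uuid` from the standard library. The class name `Record` and attribute `key` are hypothetical.

```ruby
require "securerandom"

# Plain-Ruby stand-in for a model that assigns itself a random key on
# creation.
class Record
  attr_reader :key

  def initialize
    @key = SecureRandom.uuid # random version 4 UUID
  end
end

puts Record.new.key
```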
00:19:12.250 Part of my brain understood the likelihood of a collision wasn’t even there. We had a unique constraint on the column; it's an exception—no one's getting hurt—but somehow, a year later, I remember trying to explain it away. There’s no way I should have been worried, but the part of my brain that lets me sleep at night isn't the same part that analyzes and quantifies danger. And that's just the way it is.
00:20:09.980 Let's step up a level from the code. Recently, we've been streaming Salesforce data. That's a pretty complex protocol, and the code is fragile and mission-critical, so it is very likely to break badly. The temptation is to ask the question 'What could go wrong?', answer it, and then say we need tests.
00:20:45.750 But we can't stop asking the question yet. What could go wrong with conventional tests? They'd be very brittle, involve lots of mocking, and need to change. They'd be very hard to read. So if automated tests are also dangerous, what else could we do? We dropped the automated testing in that part of the app and instead wrote an interactive test script.
00:21:26.340 While that wasn't conventional, it reduced the danger without introducing many new ones, as automated testing would. It's easy to get stuck in patterns like 100% automated test coverage, but by continuously asking that question, you can move beyond best practices and actually mitigate the specific dangers you are dealing with, rather than just dangers in general.
00:22:06.340 One last example before I wrap up: I've been working on some authentication issues recently. I noticed that even though the response is the same whether you get your password or email wrong, the response time was measurably different. Not noticeably different, but measurably different. So technically, you could perform user enumeration.
00:22:50.660 This is because bcrypt is intentionally slow, so if you don't have a user's password hash to check against, the response is quicker. I spent considerable effort writing constant-time password authentication. What could go wrong? Well, now the code is a lot less intuitive. Small changes affect the timing, so it's very likely that someone might break it badly.
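The general idea behind equalizing the timing is to do the slow hash work even when no user is found. The sketch below illustrates that idea only and is not the speaker's implementation: it swaps in PBKDF2 from the standard library's OpenSSL bindings in place of bcrypt so that it runs without extra gems, and `User`, `slow_hash`, `secure_compare`, `authenticate`, and the iteration count are all hypothetical.

```ruby
require "openssl"
require "securerandom"

ITERATIONS = 20_000 # illustrative work factor, not a recommendation
User = Struct.new(:email, :salt, :password_hash)

# Deliberately slow hash; a stand-in for bcrypt.
def slow_hash(password, salt)
  OpenSSL::PKCS5.pbkdf2_hmac(password, salt, ITERATIONS, 32, OpenSSL::Digest.new("SHA256"))
end

# Compares two strings without returning early on the first mismatch.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize
  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

# Used when the email is unknown, so the same hashing work still happens.
DUMMY_SALT = SecureRandom.bytes(16)
DUMMY_HASH = slow_hash(SecureRandom.hex, DUMMY_SALT)

def authenticate(users, email, password)
  user = users[email]
  salt = user ? user.salt : DUMMY_SALT
  expected = user ? user.password_hash : DUMMY_HASH
  matches = secure_compare(slow_hash(password, salt), expected)
  user && matches ? user : nil
end
```

Even in this small form, the intent is easy to break: moving the `user` check above the hash call restores the fast path for unknown emails, which is the fragility described above.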
00:23:14.600 So I began working on a specification to compare the timing. What could go wrong with that? Well, now we have a non-deterministic test—timing changes on every run, affecting whether it's more likely to give a false positive or a false negative. How bad is that? Well, we could end up with flapping tests, unreliable CI, and angry developers. We've all been there.
00:24:00.230 I spent considerable effort trying to avoid timing jitters, taking medians, interleaving iterations, and ensuring we had a statistically significant sample size—managing garbage collection—all to write a reliable timing comparison spec helper. What could go wrong with that? The thing is, user enumeration and timing attacks aren't ideal, but in our context, it's not that bad; it's low risk, low impact, and low severity. Therefore, low danger.
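A much-reduced sketch of that kind of helper is shown below, covering three of the techniques mentioned: interleaving iterations, taking medians, and pausing garbage collection while measuring. The names `median`, `time_call`, and `median_timings` and the default iteration count are hypothetical, and there is no statistical significance testing here.

```ruby
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Times a single call of the given block using a monotonic clock.
def time_call
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Runs both callables in alternation so that drift (CPU load, warm-up)
# affects them equally, then returns the median time of each in seconds.
def median_timings(first, second, iterations: 51)
  GC.start
  gc_was_disabled = GC.disable # true if GC was already disabled
  first_times = []
  second_times = []
  iterations.times do
    first_times << time_call(&first)
    second_times << time_call(&second)
  end
  [median(first_times), median(second_times)]
ensure
  GC.enable unless gc_was_disabled
end

fast, slow = median_timings(-> { 1 + 1 }, -> { sleep 0.001 }, iterations: 11)
puts format("fast: %.6fs, slow: %.6fs", fast, slow)
```

A spec built on this still has to choose a tolerance for "the same", and that threshold is where the false positives and false negatives described above come from.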
00:24:41.200 You see, the real reason I wrote these hundreds of lines of complex code is that I wanted to prove I could. My patterns of social cognition overpowered my rational self. And you never notice it at the time, you only realize it when you're writing a conference talk twelve months later.
00:25:21.640 So, the subtle science of coding for failure. Next time you're writing code, designing a system, or reversing down the driveway, I want you to ask three questions: What could go wrong? How likely is it? How bad is it?
00:25:37.360 We're sitting in a car at the bridge together—figuratively at home. Let's answer those questions as rationally as possible, but also remember that we are not perfectly rational creatures. Hence, let's also use our big social brains to be kind to each other.
00:26:16.180 If you're interested in this stuff, I've got lots of sources and much more data. I'd love to talk with you, so please come and find me afterwards or online. Thank you very much.