Daniel Fone

What Could Go Wrong? The Science of Coding For Failure

Paris.rb Conf 2020

00:00:14 Today, as Justin said, I'm Daniel, and this is by far the farthest I've been from home. So it's very good to be here. This is going to be a simple talk. I want to start and finish with one question.
00:00:27 If you forget everything in between, that's fine. But for us, as people who make things, I want to help us ask one simple question: What could go wrong?
00:00:32 That's it. Every line of code, every test, every deploy, and every new feature in Ruby; every time you SSH into production without telling anyone: what could go wrong? Now, obviously, there's a way to ask this question, right? What could go wrong? We know what this looks like.
00:00:53 We're about to set our friend on fire as we need to cross a bridge with a little more river under it than usual. I think there's a familiar conversation happening here. We can make it. No, we can't. I think we have the same conversation again and again, whether in a professional or personal context, whether we're parenting or pushing to production.
00:01:28 So my goal today is to help us ask that question a little more critically: What could go wrong? Not only so that we can make better decisions, but so that we can disagree more rigorously about crossing the bridge.
00:01:52 A long time ago, in another life, I studied environmental chemistry. I don't remember much from my chemistry degree, but I remember two things. Firstly, everything is toxic if you have enough of it and you have the right attitude. Secondly, danger has two factors: risk and hazard.
00:02:12 Now, chemistry was a long time ago, and I'm terrified that as soon as I finish, someone will come up and say, 'Actually, ISO 31000 defines...' Right? You know what? You're probably right, so let's just keep this really simple.
00:02:31 One dictionary defines danger as the possibility of suffering harm or injury. Two factors: possibility and harm. Just like risk and hazard, the exact terms don't matter, but just to appease the ISO gods, let's choose some different words. Let's make something up: The dangerousness of something is determined by its likeliness and its badness.
00:02:56 They will rhyme, and we can remember that. Ultimately, I want to ask what could go wrong, then how likely is it, and how bad would it be? This is it. This is the whole talk: What could go wrong? How likely is it? How bad is it?
00:03:29 Let's break this down a little bit. If I tried to walk across a balance beam about a meter off the floor, I'm very confident that I could do that. But if we put that beam at the height of some famous towering local landmark, like the Eiffel Tower, I would think about it pretty hard, and no amount of money would convince me to walk across it.
00:03:53 I think the difference is intuitive for most people, but let's break down what's actually changing here.
00:04:14 I think the likeliness of falling stays about the same. If anything, I am far more motivated to step accurately. But the badness changes dramatically. From that height, I'd hit the ground at about 200 km/h, so the scenario is more dangerous, not because falling is more likely, but because the badness goes from a sore ankle to just being splat.
00:04:36 Returning to a disagreement on the bridge, what are we really disagreeing about here? Sure, everyone agrees that the bridge collapsing would be really bad, but maybe one of us thinks it's not going to happen, while someone else thinks it's definitely going to collapse.
00:04:51 We get to the heart of the disagreement—it's about the likeliness. Or maybe we agree 100% that water is definitely going to hit the car, but maybe someone thinks it won't be too bad, while someone thinks it's going to break the car.
00:05:05 Now, I have no idea how cars, rivers, or bridges work, but as we identify what could actually go wrong and separate out the badness from the likeliness, we can start to disagree more robustly about the danger.
00:05:22 Everyone agrees it would be very bad if we died; we just ultimately disagree on how likely that is. Which brings me to: who's heard of micromorts? Yeah, a few hands, a few nods. A micromort is a defined, quantifiable unit of dangerousness: it represents a one-in-a-million chance of death.
00:05:52 The likeliness is one in a million, and the badness is a single death. We can get micromort data for travel, and we see here that, per kilometer, flying is by far the safest way to travel. This is all sourced from Wikipedia, so take it with a grain of salt; it’s all made up.
00:06:17 Similarly, for recreational risks, we can see here that scuba diving, marathons, and skydiving are all about the same risk, which surprises me because I have run a couple of marathons and have dived out of the sky, and they did not seem equally dangerous.
00:06:35 We've also got some crazy dangerous sports. BASE jumping is notoriously risky, and the last time I checked, Mount Everest had 223 fatalities from about five and a half thousand ascents.
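To connect those figures back to the unit, the conversion is just deaths per exposure scaled to a million. In Ruby terms, using the numbers quoted above:

    deaths  = 223
    ascents = 5_500.0
    micromorts_per_ascent = deaths / ascents * 1_000_000
    # => ~40_545, i.e. roughly 40,000 micromorts per ascent, using the talk's figures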
00:06:57 To put these risks in perspective, according to the World Health Organization, the risk of childbirth to the mother, the maternal mortality rate, is an order of magnitude riskier than skydiving.
00:07:18 That's pretty good by world standards. That relative risk surprised me; it doesn't really match the story I tell myself about these things. The more I've researched these statistics, the more surprised I've become at how wrong my stories are.
00:07:36 At this point, I start to realize how often my perception of danger isn't driven by knowledge, but rather influenced by my social narratives. Underneath all of this logic, we must remember that we have the brains of social primates.
00:08:02 Two and a half thousand years ago, Aristotle said that a human is, by nature, a social animal. Modern neuroscientists like Matthew Lieberman seem to agree. He argues that the human brain has been primed by evolution to view the world in predominantly social terms.
00:08:24 Our neural biology is much more wired for social cognition than for the statistical cognition that we need for risk assessment. One of the most common ways that I see us make this mistake is the tendency to fixate on a single factor when we're thinking about danger.
00:08:42 For example, we can easily fixate on the low likelihood of something happening and completely ignore the high badness. We might call this complacency. We have a minivan, a big car, and even with a camera in the back for reversing, it has a worrying number of dents in the rear bumper.
00:09:07 So now, when we're reversing, we've got this little safety jingle that the children sing because it's too easy to think it'll never happen and forget how unimaginably bad it would be if something tragic did happen.
00:09:30 Alternatively, we can fixate on the high badness and completely ignore how unlikely something is. We might call this hysteria, and I think we have plenty of examples of this in today's world. One example comes from a psychologist who studies decision-making.
00:09:52 He talks about how, in the year following 9/11, an estimated 1,600 more Americans died on the road because they drove long distances instead of flying. It shows how easy it is to fixate on how bad a plane crash would be while ignoring how much more likely a car crash is.
00:10:12 None of us are immune to this. We can see from our collective behavior that these fixations, these stories we tell ourselves, are part of our social inheritance.
00:10:23 Whether it's complacency, hysteria, or something else, this is just how our brains work. Like Andy said, we think the person riding the elephant is in charge, but so often the emotional (or perhaps I'll say social) elephant is the one that chooses the direction.
00:10:40 Whether that elephant is steered by our family of origin, our ideology, or our sense of belonging and identity, the rational layers of our brain are very recent evolutionary add-ons.
00:11:00 So while we have some hurdles to thinking accurately about danger, if we can be mindful of our biases, distinguish likeliness from badness, and think about danger in a structured way, experience shows we can be pretty good at this.
00:11:20 You know, we might be social primates, but we've socialized our way to the moon. Okay, let's talk about code. When we code, the answer to the question 'What could go wrong?' depends on a thousand little details for the precise thing that we're trying to do.
00:11:38 So it’s a little bit artificial to try and demonstrate it in a conference talk. Nevertheless, let’s have a look. We’ll start at the line of code level.
00:12:00 Here’s a simple working method to fetch a file from a URL. If you can't see all of the code, don’t worry; you’ll get the gist. What could go wrong? Alright, let's start with nil. What happens if the URL argument is nil? How likely is that?
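The slide code isn't captured in the transcript, but a minimal sketch of the kind of method being discussed, with an assumed name and shape, might be:

    require "net/http" # also pulls in the uri library

    # Happy path only: fetch the body of whatever lives at the given URL.
    def fetch(url)
      uri = URI(url)
      response = Net::HTTP.get_response(uri)
      response.body
    end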
00:12:21 I'm a Rubyist; I know it's very likely. And how bad is it? If we try it, we see that the URI method raises an ArgumentError. It's quite clear what's gone wrong, so I think that scenario is actually okay.
00:12:35 It's likely, but it's not too bad. We're not going to do anything about that. What else? What if we get a blank string or a junk string? Well, it turns out that all strings are valid URIs, so if we supply a junk string, we get a very obscure error from Net::HTTP.
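Roughly what those two failure modes look like (the exact low-level error from the junk string depends on your Ruby version and machine):

    require "net/http"

    begin
      URI(nil)
    rescue ArgumentError => e
      # Likely, but the failure is clear:
      puts e.message # "bad argument (expected URI object or URI string)"
    end

    junk = URI("junk") # parses "successfully": a URI::Generic with no scheme or host

    begin
      Net::HTTP.get_response(junk)
    rescue StandardError => e
      # Fails deep inside Net::HTTP with an obscure, low-level error about a
      # connection it couldn't open.
      puts "#{e.class}: #{e.message}"
    end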
00:12:59 I think it’s just as likely as getting a nil. I think it’s quite bad and quite unhelpful, so let’s do something about it. It’s worth fixing.
00:13:21 So now, we'll check the URI's scheme and raise a clear error if it's invalid. What could go wrong next? If we have a valid URL or URI, now we've got a whole class of connectivity issues to deal with.
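A sketch of that scheme check, before we get to those connectivity issues (the error message wording is an assumption):

    def fetch(url)
      uri = URI(url)
      # URI::HTTPS is a subclass of URI::HTTP, so this accepts both http and https.
      unless uri.is_a?(URI::HTTP)
        raise ArgumentError, "expected an http(s) URL, got #{url.inspect}"
      end

      Net::HTTP.get_response(uri).body
    end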
00:13:40 There are timeouts, DNS issues, and server issues, so it's highly likely that something will go wrong on the Internet. And how bad is it? If we try it, we see we get pretty good exceptions from the get_response method.
00:14:03 So again, I'm going to say this is not dangerous enough to fix. We could improve it if we're writing a library; catch it and re-raise it with a different exception, but I think it's okay.
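For what it's worth, that catch-and-re-raise might look something like this sketch, where FetchError is an assumed, library-specific exception class:

    require "net/http"

    class FetchError < StandardError; end

    def fetch(url)
      uri = URI(url)
      Net::HTTP.get_response(uri).body
    rescue SocketError, Timeout::Error, SystemCallError => e
      # Wrap DNS failures, timeouts, and refused connections in a single
      # library-level exception so callers only have to rescue FetchError.
      raise FetchError, "could not fetch #{url}: #{e.message}"
    end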
00:14:21 So, we got a response. What could go wrong next? What if we get a 404 or a 500? Right now, I think that's worse than an exception because we think we have succeeded.
00:14:36 I think that's bad, and I think it's quite likely; again, it's the Internet. Let's do something about that. Ruby gives us the value method on Net::HTTPResponse, and that will raise unless it's a 2xx response.
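Wired into the hypothetical fetch method from earlier, that looks something like:

    def fetch(url)
      uri = URI(url)
      raise ArgumentError, "expected an http(s) URL, got #{url.inspect}" unless uri.is_a?(URI::HTTP)

      response = Net::HTTP.get_response(uri)
      response.value # raises unless the response is a 2xx success
      response.body
    end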
00:14:56 So that gets a thumbs up! We now have a successful response. What else could go wrong? What about a 300? How likely is that? I only thought of this because it happened while I was writing this example.
00:15:09 So I think that’s likely enough, and it raises an exception, so that's bad. We can do better. Let's just recurse on redirect. We’ll take the new location and call the method again.
00:15:24 Next, I think it's very unlikely we will get into an infinite redirect loop, but if we do, this is worse than a misleading exception—it's worse than a silent failure.
00:15:41 We’re just stuck for ages, especially if it’s a slow server, until we get a stack error. So we'll deal with that. We'll put in a limit on the number of redirects, and we'll raise an error if we exceed it.
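Pulling the whole example together, with both the recursion on redirects and the limit, might look something like this sketch (the FetchError class and the limit of five are assumptions):

    require "net/http"

    class FetchError < StandardError; end

    def fetch(url, redirect_limit: 5)
      raise FetchError, "too many redirects" if redirect_limit.zero?

      uri = URI(url)
      raise ArgumentError, "expected an http(s) URL, got #{url.inspect}" unless uri.is_a?(URI::HTTP)

      response = Net::HTTP.get_response(uri)
      case response
      when Net::HTTPRedirection
        # Follow the redirect by recursing on the new location, with one less attempt left.
        fetch(response["Location"], redirect_limit: redirect_limit - 1)
      else
        response.value # raises unless the response is a 2xx success
        response.body
      end
    end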
00:16:02 We're getting a little bit more complicated now, but we've got a robust way of fetching a response. Now, we could keep asking that question; I'm sure you'd love me to, but I think that example is okay for now.
00:16:21 Compare this to our original implementation. The happy path is about the same, but now we’ve asked what could go wrong. We haven’t addressed every single thing that could go wrong, but we’ve thought about the consequences.
00:16:40 We’ve thought about the likeliness, and the real dangers are mitigated. So we can ask that question at the line of code level.
00:16:56 Here's a little counterexample. Years ago, I needed a unique non-guessable key for a record. That’s a pretty standard requirement. Back then, I added a callback in ActiveRecord to generate a UUID.
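The callback itself is about as simple as ActiveRecord callbacks get; a sketch, with an invented model and column name:

    require "securerandom"

    class Document < ApplicationRecord
      # Assign a unique, non-guessable key the first time the record is saved.
      before_create { self.access_key ||= SecureRandom.uuid }
    end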
00:17:12 It worked perfectly. What could go wrong? Now, I had previously written a blog post specifically noting that you would need a hundred and twelve terabytes of UUIDs before you'd even have a one-in-a-billion chance of a collision.
00:17:29 A collision isn't even that bad, right? We have a unique constraint, so it would just be an exception; no one gets hurt. Part of my brain knows how likely it is, and knows how bad it would be. But somehow, I still committed code to handle the collision anyway.
00:17:49 There's no way I should have bothered with that commit. But the part of my brain that lets me sleep at night is not the same part of my brain that calculates danger. That's just how it is.
00:18:00 Okay, let’s take a different angle. Recently, we had a project that was streaming some data from Salesforce. I don’t know if anyone has used the streaming API from Salesforce. It’s complex, it’s fragile, and this is mission-critical code, so it’s very likely to break badly.
00:18:19 Now, I think the default is just to write an automated test, write a spec, and done. But we can't stop asking the question yet: What could go wrong with automated tests?
00:18:38 We went down this path a little way, but those tests are very brittle to mock, they're very likely to change, and they're difficult to read, just because of the way the streaming protocol works.
00:19:07 We can’t use VCR or any of the other nice libraries, so we’re going to sink a lot of maintenance time into these tests. So, if there are dangers with untested code, and there are dangers with automated testing, what do we do?
00:19:24 We actually dropped automated tests and we wrote an interactive test script. You run the test, then you go to Salesforce, you do some things, and then the test continues and verifies that everything has been done correctly.
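Their actual script isn't shown, but the general shape of an interactive test like that might be something along these lines (the pause helper and the StreamedEvent check are invented for illustration):

    #!/usr/bin/env ruby
    # An interactive test: a human performs steps in Salesforce while the script
    # waits, then the script verifies that our side handled the streamed events.

    def pause(instruction)
      puts "\n==> #{instruction}"
      print "Press enter when you've done that... "
      $stdin.gets
    end

    pause "In Salesforce, create a test Opportunity and change its stage"

    # Placeholder check: the real script would query whatever records or state
    # the streaming integration is supposed to have produced.
    events = StreamedEvent.where("created_at > ?", Time.now - 60)
    abort "FAIL: no streamed events were received" if events.empty?
    puts "PASS: received #{events.count} event(s)"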
00:19:41 That’s an unconventional trade-off, but it reduces the danger of automated testing without introducing a lot of new dangers. It’s easy to get stuck in patterns; like Philip was saying, we just go for 100% automated test coverage.
00:20:01 But I think if you ask the question, ‘What could go wrong?’ and keep asking, you can move beyond best practices and mitigate your specific dangers. You can innovate around danger.
00:20:23 One last example before I wrap up, nearly lunchtime. A while back, I was working on some authentication code, and I noticed that even though the response was identical for an incorrect email or an incorrect password, the response time was measurably different.
00:20:42 Not noticeably different, but measurably different. Technically, you could perform user enumeration via a timing attack, so I spent considerable effort writing constant time password authentication. What could go wrong?
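His code isn't shown, but the core trick in any constant-time comparison is the same one used by Rack::Utils.secure_compare and ActiveSupport::SecurityUtils.secure_compare: always examine every byte, so the time taken doesn't depend on where the first mismatch occurs. A sketch:

    # Compare two strings in time that depends only on their length,
    # not on how many leading bytes happen to match.
    def secure_compare(a, b)
      return false unless a.bytesize == b.bytesize

      diff = 0
      a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
      diff.zero?
    end

In practice you would reach for one of those library methods rather than rolling your own; the harder part he's describing is keeping the whole password-checking path constant-time, which goes well beyond the comparison itself.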
00:21:04 The code is pretty unintuitive; it's hard to tell what's going on. Small tweaks to the code will affect the timing, so I think it’s very likely that someone breaks the timing safety.
00:21:29 Of course, I spent considerable effort writing a spec to compare the timing. What could go wrong? Now we have a non-deterministic test, so we have to mitigate false positives and negatives.
00:21:44 So I spent a considerable amount of effort avoiding timing jitters, taking medians, ensuring a statistically significant sample size, and even wrangling garbage collection, all to write a reliable timing comparison helper.
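A sketch of that kind of timing helper, where authenticate stands in for whatever login path is under test: warm up, disable garbage collection, take many samples, and compare medians rather than single runs.

    require "benchmark"

    # Measure the median wall-clock time of a block over many iterations,
    # with garbage collection disabled to reduce jitter.
    def median_runtime(iterations: 1_000)
      GC.start
      GC.disable
      samples = Array.new(iterations) { Benchmark.realtime { yield } }
      samples.sort[iterations / 2]
    ensure
      GC.enable
    end

    # `authenticate` is a stand-in for the real login path under test.
    wrong_email    = median_runtime { authenticate("nobody@example.com", "password") }
    wrong_password = median_runtime { authenticate("daniel@example.com", "wrong-password") }
    # The spec would then assert that these two medians differ by no more than
    # some small tolerance.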
00:21:55 Finally, nothing else can go wrong! The thing is, I never even stopped to ask the primary question in the first place. In the context of this project, user enumeration wasn’t a big deal.
00:22:22 In fact, today we use two-step authentication where you put your email in first. It’s low badness, low likelihood. This is low danger. The real reason why I wrote all of those hundreds of lines of complex and beautiful code was ego.
00:22:38 I wanted to prove I could. My patterns of social cognition overrode my rational cognition, and you don’t notice it at the time. It is only when you're looking for examples for a conference talk months later that you realize it.
00:22:58 And I can now say since the last time I did this talk, all of that code has gone.
00:23:03 So, the subtle science of coding for failure—I think that’s the title of the talk. Next time you are writing code, designing systems, or just backing down the driveway, I want us to ask three questions: What could go wrong? How likely is it? And how bad is it?
00:23:15 That’s it! So that’s not the talk; that’s the question. Sorry, I’m very close. Thirty more seconds, sir. When it says thank you, then we're done.
00:23:37 When we're in the car together at the bridge, figuratively speaking, I hope we can answer those questions as rationally as possible. But also remember, we're not perfectly rational people.
00:24:00 So we can use these big social brains to at least be kind to each other.