Keynote: I Test In Production by Charity Majors

by Charity Majors

In her keynote speech titled "I Test In Production," Charity Majors discusses the importance and inevitability of testing in production environments, arguing for a paradigm shift in how software testing is approached. The prevailing notion that testing should only occur before or after deployment is challenged by Majors, who insists that responsible engineers constantly test in production. She emphasizes that modern systems are inherently complex, and it is impractical to replicate production environments in staging. The key takeaways from her talk include:

Testing in production is often avoided due to misconceptions, but it should be integrated as a responsible practice for improving system reliability.
Traditional testing methods are not sufficient for the complexities found in contemporary software systems, which require developers to engage directly with production.
Engineering resources are finite, and it is crucial to spend developer cycles wisely, focusing on understanding real usage in production rather than getting lost in potentially misleading staging environments.
Observability and monitoring are essential tools that help developers understand system performance and issues as they arise in the wild.
Engaging with production allows developers to learn and adapt their instinct about system behaviors and failures, which is vital for their growth and the reliability of the systems they manage.
She encourages engineers to embrace failures as part of the process, creating environments where significant failures can occur without impacting user experience.
Majors critiques the heavy reliance on staging, noting that it can waste valuable time and lead to misplaced confidence in a system's stability.
The conclusion of her talk is optimistic, advocating for embracing production as a learning environment where real user interactions can lead to better software outcomes. Overall, testing in production is not just a technique but a necessary attitude shift that prioritizes real-world relevance over theoretical safety.

00:00:05.839 So with that said, anchors away! Heave-ho!

00:00:11.200 Let's welcome to the stage our speaker, Charity Majors.

00:00:20.000 She is the grand dame of infrastructure, the CTO of Honeycomb, and she even co-wrote a book on database reliability engineering. I think she knows what she's talking about.

00:00:33.360 I heard she might have insights on testing in production with giant data sets, and I was told that she does it in production, which means I'm definitely going to learn something today. Let's give a warm round of applause for Charity!

00:01:09.040 Yay! It's really my favorite part about this talk—getting to see the horror in some people's eyes when I say the phrase 'testing in production.'

00:01:20.640 All right, we’ll wait for my... thing to come up. I’ve had such an interesting experience getting here.

00:01:27.759 Directions are hard! Anyway, I don't want to procrastinate too long. I have this very pretty slide that I just finished 30 seconds ago, so I'm going to leave it up there.

00:01:40.479 But I will say that I brought a lot of really snarky Nietzsche quotes, stickers with unicorns and rainbows, and they express some rather angry thoughts about technology.

00:01:57.840 I’ll leave them in the speakers' lounge, so you should come and get some afterwards. They say things like, '20 tools and no two agree.' Yay! There we go! Isn’t that pretty?

00:02:06.840 This is kind of like... who’s in English governance? Who’s always like, 'A spoonful of sugar'? I find that a spoonful of rainbows helps the anger go down. My name is Charity.

00:02:20.879 You’ve already heard the intro. If you have the book 'Database Reliability Engineering,' you’ll notice it has a horse on the cover, but I brought stickers to fix that for you. See me after class. Apparently, you can’t have a mythical creature on your book.

00:02:43.040 Am I going to get in trouble for swearing here? It’s Europe, right? Yes, all right! I’ll do my best.

00:02:48.800 So, testing in production—some reactions that I often get are, 'That sounds like something cowboys do,' or, 'Isn’t that from the bad old days of sysadmining?' It’s gotten a bad rap.

00:03:01.680 And I blame this guy! It’s such a great meme! Who of us hasn’t had it up on their wall at some point? It’s so funny, but it’s also so wrong. Responsible engineers test in production constantly, while irresponsible engineers do too, they just don’t admit it until they do it very poorly.

00:03:20.959 First of all, it starts with kind of a false dichotomy. You know, it sets up as though you can only do one or the other. You can only test before or after, which we hope is not true.

00:03:40.480 And I want to make it clear—many people seem to take the title and run off with it, saying, 'Charity says I don’t have to test anymore.' That is not what I’m trying to say!

00:03:51.840 I do respect the traditional tests. I think you can often spend less time on that; you know, 80% of the bugs can be caused by 20% of the work. You don’t have to ask me—ask Katie McCaffrey, who’s done computer science. I’m a music major, so don’t ask me anything about math.

00:04:12.799 But when it comes to testing in production, we waste a lot of our energy. Honestly, we waste it because we want to do a good job, which is a very good motivation, but we waste it because we think there is such a thing as safety, and so we chase it.

00:04:35.120 And it’s just a fact of life that the scarcest resource we have will always be developer cycles, right? Engineering time is finite. It’s a finite set, even if you’re Google. Even then, you have a finite amount of time and energy.

00:04:54.800 It sounds great to say, 'Let’s catch all the bugs! Let’s test all the things!' But in reality, is that the best use of your time? Maybe not always. Anyway, please test. Please test!

00:05:18.080 Tests are great for catching the problems we already know about, right? They’re good for the known unknowns. As soon as you find a way that your system can fail, you write a test for it—yay! Now we’ll know if it happens again.

00:05:30.160 Over time, the set of things you know about how your system can fail should grow, and it should get better at not repeating those same issues. Tests are for developers, while monitoring checks are for ops people.

00:05:48.639 Our systems are changing in ways that make these older tools catch less and less of the problems we actually deal with. This is such a meme-heavy topic, but I just had to throw in a couple more definitions.

00:06:02.240 Testing checks the quality, right? And production is where your users are. Every interesting thing that happens comes from the intersection of code, infrastructure, a point in time, and an unpredictable user action.

00:06:26.000 Every interesting thing is an unknown unknown. The traditional deployment path goes something like this: You write your code, commit it, it gets automatically rolled out to some staging, and then runs all the tests on it before rolling up an artifact and promoting it to production.

00:06:58.000 But staging is not like production! Shocker, right? It will never be production. You cannot clone production, even if you think you do—that is a mistake you will always make. It's better to accept that it's wrong and not try.

00:07:17.039 I'll go so far as to make a maximalist point, which I won’t fully defend, and say that most of you can probably get rid of staging altogether and just run things on your laptop.

00:07:38.000 If you take that effort and apply it to production by hardening it, building guardrails, and increasing your visibility and observability, you’ll find that a lot of that time is better used elsewhere.

00:07:54.800 At its worst, staging can be worse than nothing at all. Has anyone here ever lost a whole day trying to track down a bug in staging that didn’t exist in production? Sorry, a week? Did anyone last a month? You know, this time adds up!

00:08:12.400 Every moment you spend in a false environment is not neutral time; it’s negative time. When I think of senior engineers, I think of engineers whose instincts I trust.

00:08:34.160 Yes, we should be data-driven and blah blah, but if they say, 'Oh, I have a bad feeling about this,' I want to explore that. The more time you spend in staging, the more you’re building instincts that might lead you to think actions like 'drop database' are fine.

00:08:49.919 The more time you spend running queries in staging, the more you learn about what’s fast, what’s efficient, what’s okay to do, and what’s not—but you’re learning the wrong lessons.

00:09:13.120 I believe that increasingly, every one of us, even client-side developers, should spend a lot of time immersed in production—knee-deep, if not neck-deep, because that’s where you learn the instincts. That's where you learn the lessons that actually make you good at your job.

00:09:34.399 I’m not saying that everyone has to get rid of staging, but you have to understand your code in the context of real data, real users, little agents of chaos, real traffic, real scale, real concurrency, real network deploys—all of it.

00:09:55.679 Everything should be real! Oh yeah, it’s hella expensive to spin up a copy. Now, let’s backtrack just a little bit because for a long time, this really was the best practice.

00:10:10.880 Build elaborate environments, automate the whole thing, and gain confidence by running all these tests. I’m not a math major, but I drew you a graph so you know it’s true.

00:10:28.000 By my calculations, things are getting more complicated. We toss around the word 'complexity' a lot, but what does it actually mean? What is the complexity of systems?

00:10:45.919 Let’s visualize this. There’s your LAMP stack—humble one on the right. If you can solve your problems with a LAMP stack, please do! They’re great. Unfortunately, more and more of us cannot.

00:11:08.479 The middle one is Parse. Any Facebook people here? Then I can say that I will never forgive them for shutting it down. I’ll hold that grudge for the rest of my life.

00:11:27.680 That was Parse’s infrastructure in 2015, and like that middle blob is a few hundred MongoDB replica sets, running queries that we just let developers all over the world upload, and we just had to make them work.

00:11:44.799 Co-tenancy problems? What? JavaScript? You just had to make it work! Running this system was what directly led me to starting Honeycomb.

00:12:01.519 And it gets worse! This is an actual electrical grid, and increasingly, this is how we need to look at our heads as a mental model for building systems. There are emergent behaviors that no one understands, and that's just fine—things break constantly.

00:12:18.160 Some problems are hyperlocal, like a tree falling over on Main Street in a small town in Iowa. No one could have predicted it, and no one should have tried to predict it. You just have to repair things and move on with your life.

00:12:37.440 Other problems can only be seen if you zoom way out. For example, if all the bolts manufactured in 1972 are rusting 10 times as fast, you might want to proactively replace them. There are just these categories of problems where, honestly, we shouldn’t spend effort trying to predict them.

00:12:55.679 Instead, we should invest in our observability so that we're detecting information at the right level of detail, allowing us to ask any question of our systems and understand them without relying on new code to handle occurrences.

00:13:09.920 This roughly tracks to the idea of observability for unknown unknowns, while monitoring is more about known knowns. But the overall message of this is just to stop trying—just give up.

00:13:23.680 So many catastrophic states exist in your systems right now, and that’s okay! Sleep tight, because this matters more and more to all of us.

00:13:37.440 We’re all distributed systems engineers these days, and honestly, who are the first OG distributed systems engineers? Web engineers, right? The proliferation of clients is what got us all into this mess.

00:13:57.919 The problems in our systems used to be predictable—like, you could look at a LAMP stack and predict 80% of the ways it would break and write monitoring checks for them. Over the next six months, you would learn the other 20% and write checks to handle those issues.

00:14:17.120 It was pretty rare that something genuinely new happened; usually, you'd just get paid and say, 'Ah, that again,' and you'd try to repair it. We don’t do that anymore!

00:14:31.680 We just automate those problems out of existence. Now, whenever you get paged, it should be something genuinely baffling—something you couldn’t handle automatically. It should be unknown unknowns.

00:14:47.440 And there are more and more of those because of the interactions between all of these ephemeral components blipping in and out of existence. Users are creative, and I don’t need to describe it; you’re all on the same page.

00:15:05.920 Distributed systems are incredibly hostile to being cloned, imitated, or even monitored. At Facebook, we weren’t going to spin up a copy of itself to test it—it’s not financially practical.

00:15:27.760 You’re not going to spin up a copy of the national electrical grid to monitor it. Even if you did, it would be pointless because if it’s not users bringing their chaos and trying new things, it’s boring.

00:15:45.760 You don’t care how things fail in false environments; you care intimately about how they fail in production. Distributed systems have this infinitely long list of things that almost never happen—except that one time they do, or it takes five of them at once to trigger a bug.

00:16:03.680 Then you spend a week trying to reproduce it in staging. Like I said, just give up. You don’t care; it’s probably not going to fail the same way in staging as it does in production anyway.

00:16:24.960 Staging’s not production. There will never be production. Staging is a black hole for your time and energy and a misleading set of results—mostly resulting in misplaced confidence in your code.

00:16:40.399 It will drain away your life force! That’s why I bring the rainbows. And how well do you even understand and notice problems when they do happen?

00:16:56.480 Only production is production! People will argue with me, saying, 'I work on this system where I can spin up a copy perfectly,' trying to avoid breaking anything. It’s actually impossible.

00:17:12.240 Every time a deployed version is a unique process. You can’t test everything! And by the way, deploy scripts are production code.

00:17:30.480 Not only that, but they are the most important code in your systems. I don’t understand why we give it to the interns. Why don't we give it to senior engineers?

00:17:44.160 Furthermore, I see people putting all this time and energy into staging when they can’t explain what’s going on in production. That’s where you should be putting your effort.

00:18:03.360 If you can’t explain any spike instantly, screw production! Sorry, I meant staging—getting swearing, screw staging! Until you have the level of observability that allows you to explain anything happening to your satisfaction, don’t pour more energy into staging.

00:18:22.800 Katie gave this great talk on this topic. They showed that actually, 80% of bugs are found with 20% of the effort. So I’m arguing for shifting that energy we put into staging into production.

00:18:46.720 Like I said, you must watch your code run with reality! The best software engineers I know have an IDE window open and a window open with their observability tooling. They’re in constant conversation with their code in production.

00:19:13.919 They watch users use it. As soon as they ship code, they look at it through the lens of their instrumentation to see if it’s doing what they expected. They notice anything weird.

00:19:29.680 You’ll catch 80% of the bugs you ship before users even notice! You do have to have the right tooling in place, but I’m not doing a pitch here. Suffice it to say it exists.

00:19:45.679 Our idea of the software development lifecycle is super overdue for an update. We have this idea of a switch—you flip it off, you flip it on, flip from one version to the other.

00:20:02.720 This is manifestly not true! Deploying code is not a binary switch; rather, it’s a process of increasing confidence in your code. Just like developing code, it’s not committing to master and walking away.

00:20:18.080 It is committing to master, and then owning it until it’s sufficiently stable for your users. If your model exists in most managers' minds, it’s a lovely world, but not the real one.

00:20:34.600 Instead, we have continuous deployments, rolling releases, feature flags, cherry picks, rollbacks—all this mess. And on top of that, we have chaos monkey and chaos engineering.

00:20:50.720 We have to recognize that the development process extends way into production, and not only that but that the production system extends way into development.

00:21:05.440 Every time one team uses a tool and another team uses another tool, those tool boundaries create silos. I’m so tired of having a toolset while developing and then being in a whole new land with production. That's not okay!

00:21:21.680 You should feel comfortable there. The tool sets should be knit together throughout the beginning and later stages of your code.

00:21:36.800 No two tools are going to agree on reality, and so if you’re not using the same tools, you won’t agree on reality. This will lead to conflict.

00:21:53.199 All right, if you want to gain confidence in your code, that means watching it. It means having enough tooling to look at what’s actually happening and showing some curiosity.

00:22:10.320 This may seem remedial, but I will say it again: Staging is not production! You usually can't clone the data due to security reasons. You can't just spin up a copy of it.

00:22:30.240 Yet, we have to make some kind of declaration of what we're going to do with it, right? So what should we do to test before production, and what should we test after production?

00:22:50.840 Before production, we test the basics: Does it even run? Does it fail in ways that the library of things I previously built will help predict? That’s what some people think of when they say testing.

00:23:09.039 But then there’s all this other stuff: experiments, edge cases, canaries, progressive deployments, data migrations, and load tests. I know more than two people running continuous load tests on their systems.

00:23:27.199 When they run out of room, they just turn off 20 to buy themselves some extra time. That’s a great idea, and it helps them find scaling edges really nicely.

00:23:45.440 Yeah, there are more reasons why chaos engineering outside of production doesn't make sense. I’m not really a big fan of chaos engineering; I think it's just like operations but with new branding.

00:24:03.920 By the way, if you’re doing chaos engineering and you don’t have observability like I was talking about, you're not doing chaos engineering—you're just chaosing!

00:24:22.240 Also, beta programs where customers try new features. Facebook has a robust way of rolling out changes. Whenever you commit, it merges to a 'sandcastle' to spin up a whole isolated testing environment and runs days worth of tests against it.

00:24:45.919 It then builds an artifact, deploying it first to a small section of Brazil—don't ask why Brazil—and then slowly graduates it to more of Brazil, then more of South America, and eventually the rest of the world.

00:25:05.760 It can take a few days for a deployment to go fully out, and they have dozens of versions in flight at any given time. This was terrifying to me at first, but I realized that with high cardinality tooling, it's easy to compare builds.

00:25:23.040 Deployments are often the tip of the spear when it comes to getting software engineers to do stuff in production for obvious reasons. If you're looking for a starting point, I recommend that everyone be curious about what happens to their code when it hits users.

00:25:42.400 How many of you here have root access? How many of you deploy your own code? Nice! How many of you think I’m absolutely insane and wrong about all this?

00:26:01.160 Ah, you liars! There have to be some of you here! There are some areas where you do want to be paranoid. The closer you get to the data—like, laying bits down on disk—the more paranoid you should be.

00:26:28.400 I come from the database side of things and have written a piece of code three times for three different databases that just sniffs 24 hours worth of production traffic before taking it offline.

00:26:48.960 You can then replay it and adjust concurrency levels. I have done this every time I’ve had to do a major version upgrade, and each time I’ve thought, 'Oh my god, I cannot imagine how screwed I would have been if I hadn’t done this!' Databases are terrifying.

00:27:08.720 But if you aren’t a database person, there’s basically no reason to do that. Some other terrifying things involve rewrites. For example, doing a rewrite from one language to another is justifiably terrifying.

00:27:29.760 You might want to do something like have a splitter so that when a request comes in, it hits a version of the old API server and a version of the new.

00:27:54.960 It then diffs the results and returns the older API server's result to the user so they don’t see any different outcome. You can literally review the drift, and you can even daisy chain two of those tests if you’re doing testing of mutating data endpoints.

00:28:10.560 That is pretty cool! Although, don’t rewrite if you don’t have to. Anyway, testing in staging—I’m not going to get super religious about it; you can if you want.

00:28:24.480 But I feel like we have a very low standard for our tools. How many of us have had just a wall of green dashboards while users are complaining loudly? Everyone?

00:28:45.680 I guarantee you don’t have zero problems! The real risks include exposing security vulnerabilities. You need to think carefully about namespacing.

00:29:02.880 If you run continuous load tests along with regular traffic, you’ll want to be careful to reap it, delete the data, and expire it. End-to-end checks, tenancy issues, your app could die, and you could saturate a resource.

00:29:17.760 You could arguably say that this is a good thing, replicating the situation, helping identify that you can saturate the resource, but you would want to just turn it off.

00:29:32.160 Chaos does tend to cascade, so there’s that. Much like Martin Fowler said about microservices—if you’re not this tall, you shouldn’t ride this ride!

00:29:48.240 The cardinal rule here is not that nothing can break. All kinds of things can break, and it’s fantastic how many things can break before you have to care!

00:30:05.760 Do you know how much stuff can break before a user has to notice? It’s glorious! This can become a fun activity to see what you can break without alarming any users.

00:30:22.640 Because, like I said, part of my mission in life is to make this a profession where every engineer can be on-call and support their code without it being life-impacting.

00:30:39.920 It should not be something you have to plan around, or that wakes you up at night. You should be over 40, totally happy to support your code in production.

00:30:53.280 This is an achievable mission—I have seen it achieved! It isn’t crazy, but it does require getting comfortable with failure.

00:31:12.640 We get to this glorious place not by making things break less, because that’s just not going to happen. Instead, we make it so that lots of things can break without impacting our users.

00:31:29.440 I find this to be a very optimistic worldview!

00:31:46.560 It’s completely possible! Five years ago, if you left Google or Facebook, it felt like you were flying without a net. There was no toolchain that allowed for such intricate operations.

00:32:01.760 However, you can now assemble a reasonably good composable set of a Google-ish or Facebook-ish pipeline leveraging tools like feature flags, high cardinality tooling, and shadowing systems.

00:32:16.640 Please don’t build your own. I don’t say this because I’m a vendor; I say this because it’s monumentally stupid.

00:32:35.360 Unless you have infinite engineering hours, in which case, be my guest! But seriously, who does? I assume you have better things to do with your life.

00:32:54.640 Support the ecosystem! Be less afraid! These are valuable safety nets.

00:33:09.760 Bring your designer in with you—next time you work on your deploy code, ask them how we can make unsafe things hard to do, and safe things easy.

00:33:24.960 We’ve been building systems for engineers instead of humans. Remember that whoever uses it will be awake at 4 AM and probably under the influence. Be less afraid!

00:33:36.800 The only real way to be less afraid is to get used to playing in production. Get comfortable with breaking things and fixing them; realize that the world keeps spinning and everything’s going to be okay.

00:33:56.320 Failures are not rare! No one has figured out how to make them rare, and it’s probably not going to happen for you.

00:34:11.119 But production does not have to be scary! Failures will happen; it’s not if, but when.

00:34:30.239 So everyone on your team, everyone who has the ability to merge to master, should know what normal looks like.

00:34:48.399 If you only look at your system when things are down, you don’t know what normal looks like; your intuition has not been properly trained.

00:35:06.239 If you don’t know how to deploy and roll back, if you can’t return to a known state before escalation, you will be terrified.

00:35:20.239 If you can’t slowly roll out changes under controlled circumstances and understand how to ship scary changes by debugging your code in production, then you absolutely should be terrified.

00:35:37.119 So just learn those things! You can do it; it’s not that hard! And tonight, we can test some production!

00:35:50.159 All right, I think that’s it. Yep, that’s it! Sweet, thank you so much!

00:36:06.879 Oh yeah, you're doing great—not bad!

00:36:09.679 Not bad! We’ve got some pieces of eight for you!

00:36:15.440 [Unintelligible] Presents! Thank you very much!