Testing in Production

by Aja Hammerly

In the talk "Testing in Production" presented by Aja Hammerly at RailsConf 2018, the importance and techniques of testing in production environments were discussed. Hammerly emphasizes that while Ruby developers excel in testing, they often overlook the potential benefits of testing in production settings.

Key points covered include:
- Definition of Production: Production refers to any environment that is not pre-production, where real user load can reveal issues that cannot be found in pre-production testing.
- Real User Feedback: Testing in production allows developers to uncover real bugs that occur under actual user loads, often missed in traditional testing setups.
- Blackbox Testing: The speaker shares experiences starting as a blackbox web tester and highlights the value of both automated and manual testing techniques in production.
- Methodologies for Testing in Production:
  - Canary Releases: Gradual rollout of changes to a small subset of servers or users, closely monitoring the application for errors before a full rollout.
  - Blue-Green Deployments: Maintaining two identical production environments (one live and one idle) to facilitate seamless deployments and quick rollbacks if needed.
  - User Focus Testing: Techniques such as A/B testing and beta programs are highlighted as ways to assess user experience and usability before final releases.
  - Smoke Tests: Automated basic functionality tests run in production to catch issues that may not be evident during regular monitoring.
  - Controlled Breakage Testing: Intentionally disrupting services to evaluate system recovery mechanisms and ensure robustness.
  - Disaster Recovery Testing: Regularly testing disaster recovery plans to ensure systems can recover from major failures.
  - Implicit Testing: Monitoring systems and alerts as a form of testing to ensure expectations are being met in production environments.

The presentation concludes that developers should embrace testing in production, ensuring their systems work reliably with real user interactions. The importance of planning, ethical considerations in testing, and leaving systems as they were post-tests are emphasized as critical practices for effective production testing.

00:00:10 Hi, is everyone here for testing in production? Awesome! Cool! I'm Aja. I am @the_thagomizer on Twitter.

00:00:19 I'm thagomizer on GitHub, and I blog at thagomizer.com. I did not post the slides yet, but I will post them immediately after the talk. I am a bad presenter.

00:00:32 Um, I really, really like dinosaurs, so Pittsburgh has been amazing! I landed at 2:00 a.m. after a two-hour delay in Chicago because it was snowing.

00:00:45 Going down the escalator, I saw a dinosaur there. It was amazing! I love this city. I work on the Google Cloud Platform as a developer advocate.

00:00:58 If you're interested in Google Cloud, Kubernetes, or other things like that, I'm happy to answer questions. I have plenty of opinions, but you don't have to ask me. We've got seven of us here.

00:01:14 We have a booth down in the vendor hall, so you can come say hi. I think we might be out of fidget spinners, but I have a couple stored away in my bag.

00:01:26 We’re here because Google loves Ruby, and we love Ruby. It's a group of Rubyists who work on our Ruby support, and I love my Ruby community.

00:01:39 So, the victory conditions for my talk: these are the things that I want you to feel or think about when you leave. First of all, I want folks in this talk to feel comfortable with testing in production.

00:01:58 I remember when I first heard about testing in production; I had a slightly less polite version of 'Dear heavens no.' But the more I thought about it, the more I realized that this is actually a good thing.

00:02:14 It isn’t scary because in many cases you're already doing it; you may just not be aware of it. In addition to being comfortable with the idea of testing in production, I want you to walk away from this talk with the ability to be a bit more intentional about your testing.

00:02:32 So let’s start with some quick definitions. The first definition is production. Production is any environment that is not pre-production.

00:02:46 The second definition, testing: for the purposes of this talk, testing is what we as developers call verifying our expectations. Yes, I did just use the term 'expectation'.

00:03:02 It's okay if it makes you feel better to think of it as verifying behavior instead of verifying expectations. So, as Rubyists, we test all the time.

00:03:15 One of the things I love about this community is that we test. We test a lot! You wouldn't dream of pushing a gem without at least a couple of tests.

00:03:29 That's good documentation, but we’re good at testing and we have all of our great test frameworks where we set up a scenario, do some verification, and then hopefully clean up a little bit. We call our method under test.

00:03:45 These are the traditional tests, but there's also a huge category of blackbox testing, which is how I got my career started. I was a blackbox web tester doing manual testing.

00:03:56 You can also automate blackbox tests, and I do that a fair amount. All of this is still testing, and I’m bringing this up because we’re going to use both techniques for testing in production.

00:04:09 So why should we test in production? Isn’t that naughty? The answer is because a real environment gives you real bugs.

00:04:21 You find stuff that you just can’t find in your pre-production environments. For example, production is where you have real user load.

00:04:38 While load testing is awesome and I highly recommend it, most load testing frameworks I’ve worked with can’t actually simulate real user load because humans are fantastic entropy machines.

00:04:58 I’m going to tell a quick story: talking to some of my co-workers last week, I objected to a form I had to fill out for another conference. One of my co-workers said, 'Whenever I see something I’m not quite okay with, I find a way to hack around it.'

00:05:15 She said she was signing up for a bike share and for reasons she didn’t understand, they wanted to know if she was male or female. So, she opened up the form and realized there was 'male' as one and 'female' as two. She managed to convince the server on the other end that she was a four.

00:05:30 Humans are fantastic entropy machines!

00:05:40 Other things you can only do in production: you can test your integrations. Who here uses a billing service of some sort? Audience participation is encouraged.

00:05:54 Okay, your billing provider probably has a test gateway or a test API endpoint that you can hit when you’re testing. They probably also provide some test credit cards that you may or may not be able to use against a production endpoint.

00:06:07 But how often do you point your staging environment at the real production gateway using a real credit card to run a billing transaction through? Whenever I've built something that took credit cards, we did that once or twice before the initial rollout.

00:06:22 But we didn’t do it regularly after that. So if we wanted to test that we were actually integrating with this third party correctly, we had to do it in production.

00:06:38 If you don’t have a billing service, maybe you have a third-party storage solution, or you use some cloud storage. Maybe there are other services like an image processing service or something you're using.

00:06:54 It’s important to ensure you're testing those at least occasionally. The only place you can test them for real is in production.

00:07:09 Or maybe you don’t use third-party services, but you work on a large team building a huge app. Your team might build one microservice, and there are other microservices built by different parts of the company.

00:07:23 When those services come together, that's a seam. How often do you test your seams? How often do you run an integration test across these services?

00:07:38 One of my most frustrating moments at a previous job was three days before a big GA when we were going to go into production with some new stuff, and we had two teams: client and server.

00:07:54 We hadn’t tested the two pieces to see if they could talk to each other, and out of curiosity, I spun it up. The first thing it did was crash hard.

00:08:05 Because the person orchestrating and running the client-side team and the person orchestrating the server-side team had a misunderstanding in the protocol they developed.

00:08:18 So it just exploded. You have to test your seams. I would hope you test them before production, but sometimes sneaky bugs can get in, and testing your seams in production is valuable.

00:08:30 Who has heard the term Heisenbug? For those of you who don’t know, a Heisenbug is a bug that can only be reproduced in production by that one really important client.

00:08:42 Maybe it’s an artifact of their network, their browser, or some sort of security thing they have. Testing in production allows you to find these bugs.

00:08:56 Last story: we had a very important client who said, 'It mostly works, but we do this one thing and something weird happens.' It didn’t crash, but it didn’t seem right.

00:09:10 We spent about a month and a half debugging with them remotely, and finally, we picked up laptops and did a site visit, which was about an hour away.

00:09:28 We got there, and they asked, 'What are you planning on doing?' We said, 'We’re going to run a network speed test.' It appeared they weren't getting the full download.

00:09:42 They told us it wouldn’t work because they cut off any download greater than a specific number of kilobytes. We wouldn't have found that unless we had been testing in production.

00:09:54 The second thing about testing in production: I heard this at a meetup about nine months ago. It was at my favorite meetup in Seattle, where I live, RB coffee.

00:10:07 Someone said, 'Hey, I want to talk about testing in production today.' I thought, 'Awesome! What’s testing in production?' This person went on and on, and I thought I was going to learn new stuff.

00:10:22 They were talking about monitoring, logging, tracing, blue-green deployments, and canaries. And I thought, 'Oh, there’s nothing new here. This is stuff that has been often discussed for years!'

00:10:36 Many of the techniques I talk about today are things I have seen in use since I started in tech in 2002, so I guess that means I'm old now. Preemptively, I’m telling you all to get off my lawn!

00:10:50 I've talked a little about the background, but I haven’t talked about the bulk of my talk. To keep myself on track because I’m going to discuss a lot of techniques, I’ve divided this talk into four sections.

00:11:04 Deployment testing, user focus testing, reusing tests, and my favorite, implicit testing. Let's dive in.

00:11:20 The first technique I’m going to talk about is canaries. A canary is just a phased rollout where you roll out your release gradually to some of your servers at a time over a course of minutes, hours, days, or even weeks.

00:11:34 You have a subset of your users or a subset of your servers that will receive the new code. Once you’ve rolled it out, you monitor vigorously for things like error rates, memory usage, and disk space.

00:11:50 You might also want to monitor for user-based metrics like free trial conversions or purchase path completion. If everything is thumbs up, you expand the canary group and keep monitoring.

00:12:06 You continue this process until you’ve rolled out your new release to all of your servers. That’s great, but how do you choose your canary group?

00:12:20 You can use internal users; sometimes we call this dogfooding. You can push the new version to people who don’t have a choice but to use it and find all the bugs.

00:12:34 You can choose randomly. If you have 600 servers or containers, just choose some of them. You can do it geographically. This is how a lot of folks do it.

00:12:45 They start with a small percentage of the servers in U.S. West and gradually expand. Then they move on to the next data center: U.S. East, then Europe, and then Asia.

00:13:02 You can also roll this out only to users who are new or users who log in eighteen times a day and you're not quite sure why they’re using your product so much.

00:13:18 You can ask users to sign up to get access to stuff early as well. The cool thing is you can pick any combination of these methods.

00:13:32 The goal is to start with a small group and then roll it out gradually, ensuring whatever you're doing is not toxic and does not take down your environment.
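
One way to combine the selection methods above is sketched here: internal users are always in the canary group, and everyone else is placed in a stable percentage bucket. The class name, the `example.com` domain, and the use of a CRC32 checksum are all assumptions for illustration, not something the talk prescribes.

```ruby
require "zlib"

# Hypothetical helper: decides whether a user belongs to the canary group.
# Internal users (dogfooding) are always included; everyone else is placed
# in a stable bucket from 0 to 99 based on a checksum of their id.
class CanaryGroup
  def initialize(percent:, internal_domain: "example.com")
    @percent = percent
    @internal_domain = internal_domain
  end

  def include?(user_id:, email:)
    return true if email.end_with?("@#{@internal_domain}")

    bucket(user_id) < @percent
  end

  # The same user always lands in the same bucket, so raising the
  # percentage only ever adds users to the canary group.
  def bucket(user_id)
    Zlib.crc32(user_id.to_s) % 100
  end
end

canary = CanaryGroup.new(percent: 5)
puts canary.include?(user_id: 42, email: "someone@example.org")
```

Because the bucket is derived from the user id rather than chosen at random on each request, a user who saw the new release keeps seeing it as the group expands.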

00:13:43 The second deployment strategy is blue-green deployments. You have two copies of production, one is blue and one is green. In this case, the blue is live and the green is idle.

00:13:59 One is always live, and one is always idle. When you want to roll out new code, you deploy it to the idle side (the green). Once it's up and running, you start routing traffic to the new code.

00:14:13 Now you end up switching live and idle, and you've done your deployment. The nice thing about this is if something goes wrong, it's an easy rollback.

00:14:27 The previous known good version was live just a couple of minutes ago, so you can just swap back whatever router rule you changed to move your traffic.
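
The live/idle swap can be modeled in a few lines. This is a toy sketch, assuming a hypothetical `BlueGreenRouter` class; a real setup would change a load balancer or DNS rule instead of an instance variable.

```ruby
# Hypothetical router model: tracks which environment is live and which is idle.
class BlueGreenRouter
  attr_reader :live, :idle

  def initialize(live: :blue, idle: :green)
    @live = live
    @idle = idle
    @versions = {}
  end

  # New code always goes to the idle side first.
  def deploy(version)
    @versions[@idle] = version
  end

  # Swapping the routing rule is both the release and the rollback.
  def switch!
    @live, @idle = @idle, @live
  end

  def live_version
    @versions[@live]
  end
end

router = BlueGreenRouter.new
router.deploy("v1")
router.switch!            # green is now live with v1
router.deploy("v2")
router.switch!            # blue is now live with v2
router.switch!            # rollback: green (v1) is live again
puts router.live_version  # prints "v1"
```

The point of the sketch is that release and rollback are the same operation, which is what makes the rollback cheap.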

00:14:41 Depending on how you implement your blue-green setup, it might be really good for disaster recovery if your blue and green are in different parts of the same data center.

00:14:55 If you experience a partial power failure in your data center, you might be able to move traffic back to the other half because you have two copies of everything.

00:15:09 It's fantastic having two copies of everything, but it’s not always great when doing databases. Databases with blue-green deployment can be kind of a pain.

00:15:23 Don't use databases! Or at least leave your databases out of your blue-green clusters. If you want your databases to be part of the system, you can do things with snapshotting and replication.

00:15:38 But depending on your database setup and how good you are at setting it up, you might have a bit of a blip when you switch from one database to another.

00:15:50 Or you could use a non-relational database. A lot of the problems with relational databases and replication are solved if you use a non-relational database.

00:16:05 Relational databases are awesome, but when I started my career, we used a variation of this technique. We divided our server cluster in half, and had one half A and the other half B.

00:16:20 We would deploy to A, test it behind a firewall, and once it was good, we'd route all traffic to A and deploy to B. This wasn't true blue-green though.

00:16:36 Because we couldn’t run the site successfully at peak load on just half of the cluster, we had to wait until late at night to perform the deployment.

00:16:52 Testing in production: do what works for you. Both of these techniques, plus many others I will talk about, work well in conjunction with auto rollback.

00:17:06 With auto rollback, you set thresholds on predetermined metrics, and if you hit one of those thresholds, the condition is triggered and your deployment system rolls back to a known good release.

00:17:22 To do this, you have to make sure everything is scripted. I hope most of you have scripted your deploys at this point. When I started, I was releasing based on a 34-point printed checklist.
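
A rough sketch of the auto rollback check follows. The metric names and threshold values are placeholders, and the block passed to `check_release` stands in for whatever a scripted deploy tool would actually do.

```ruby
# Hypothetical thresholds; real values depend on your own baseline.
ROLLBACK_THRESHOLDS = {
  error_rate: 0.05,      # fraction of requests that fail
  p95_latency_ms: 500,   # 95th percentile latency
  memory_mb: 1024        # resident memory per server
}.freeze

# Returns the names of every metric that has crossed its threshold.
def breached_metrics(current, thresholds = ROLLBACK_THRESHOLDS)
  thresholds.select { |name, limit| current.fetch(name, 0) > limit }.keys
end

# Yields the breached metric names to the supplied rollback step, if any.
def check_release(current)
  breached = breached_metrics(current)
  yield breached if block_given? && !breached.empty?
  breached
end

metrics = { error_rate: 0.12, p95_latency_ms: 310, memory_mb: 700 }
check_release(metrics) do |breached|
  puts "Rolling back to last known good release: #{breached.join(', ')}"
end
```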

00:17:39 If you're going to use auto rollback, you need to be conscious and careful about your data and database migrations. If you roll back, will you lose important data?

00:17:56 Will the old code actually work against the new schema? These are things you need to consider.

00:18:10 Is anyone still using session affinity or sticky sessions? I figured there would still be a couple of us out here. If you're using WebSockets, it's really hard to avoid sticky sessions.

00:18:23 If your user has to hit a specific server because that's where their connection is established, how are you going to deal with that when that server goes away?

00:18:39 The biggest thing you can do is separate your data migrations from your code pushes. Push the code, make sure it works with both versions of the schema, then do a data migration.

00:18:54 Once everything is stable, push code that can only work with the new version. It’s a really common pattern; lots of us have been doing it for years.
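
The middle step, code that works with both versions of the schema, might look like the sketch below. The column names (`name` being split into `first_name` and `last_name`) are invented for illustration, and rows are plain hashes rather than real database records.

```ruby
# Hypothetical example: a "name" column is being split into
# "first_name" and "last_name". This reader works before, during,
# and after the data migration.
def display_name(row)
  if row["first_name"]
    [row["first_name"], row["last_name"]].compact.join(" ")
  else
    row["name"]
  end
end

old_row = { "name" => "Ada Lovelace" }
new_row = { "first_name" => "Ada", "last_name" => "Lovelace" }

puts display_name(old_row)
puts display_name(new_row)
```

Once every row has been migrated and things are stable, the fallback branch can be deleted in a later code push.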

00:19:09 The second section is user focus tests. These are things that test the user experience.

00:19:20 You might think, 'I’m a developer; that’s not testing in production,' but it totally counts! You’re testing the underlying stability and correctness of your code.

00:19:36 Who’s done A/B testing? Yay! People are testing stuff in production; it's fantastic! A/B testing is an experiment. You have a control group and experimental groups.

00:19:53 You run users through different experiences, and when you have enough data to be statistically valid, you figure out if there are significant behavioral differences.

00:20:06 You then decide which experience you’re going to go with. A/B testing is different than blue-green testing because both experiences are live at the same time.

00:20:26 This means you have some interesting concerns with data integrity.
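
For the "enough data to be statistically valid" step, one common approach is a two-proportion z-test on conversion rates. That choice is an assumption here, not something the talk specifies, and the numbers below are made up.

```ruby
# Two-proportion z-test comparing a control group with a variant.
def z_score(control_conversions, control_total, variant_conversions, variant_total)
  p1 = control_conversions.to_f / control_total
  p2 = variant_conversions.to_f / variant_total
  pooled = (control_conversions + variant_conversions).to_f / (control_total + variant_total)
  standard_error = Math.sqrt(pooled * (1 - pooled) * (1.0 / control_total + 1.0 / variant_total))
  (p2 - p1) / standard_error
end

# |z| above roughly 1.96 corresponds to significance at the 5% level
# for a two-sided test.
def significant?(z, critical_value = 1.96)
  z.abs > critical_value
end

z = z_score(200, 5000, 260, 5000)
puts format("z = %.2f, significant: %s", z, significant?(z))
```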

00:20:39 Another way to do user focus testing is through betas and EAPs (Early Access Programs). For those who haven't heard the term EAP before, it’s similar to a beta but typically happens just before it.

00:20:54 These programs allow you to test the stability and usability of what you're about to push. Nothing finds edge cases like real users do.

00:21:05 If you run these programs, you need to give users enough time. I know folks say, 'We had a beta for like eight whole hours!' No, that’s not a beta.

00:21:22 You need to give people multiple weeks in many cases so they can use your product over time and ensure it works for all scenarios.

00:21:38 You also need to make sure your expectations are clear. If there's an expectation that someone participating in your EAP will provide feedback, that must be communicated upfront.

00:21:54 Also, tell them where the known issues are because every beta has some edges, and you don’t want 19 reports about the same issue.

00:22:06 The third section is reusing tests. There was a fantastic talk on Monday about checkups, and this is similar content.

00:22:19 The cool thing is that each of you can do it! Running a usability test or a beta will require cooperation of many others; however, everything in this section can be done independently.

00:22:37 Additionally, you should run smoke tests against production.

00:22:46 I worked at a relatively large company and did some manual testing, but I got permission to start some basic automated testing with a record-and-playback tool.

00:23:02 Record-and-playback tools make for brittle tests, but I would rather not run manual tests 15 times a day.

00:23:16 I sat there one day and realized I could run these smoke tests against production. I hoped they would never fail!

00:23:31 I set it up to run every four hours, and it sent me an email if it failed. For a couple of days, it worked, and I was excited.

00:23:45 Then I mostly forgot about it. One day, a couple of months later, I came back from lunch and got an email saying it failed. I thought, 'There’s no way it failed.'

00:23:58 If this was actually down for 30 minutes, someone would have noticed! So, I ran the test manually, and it actually had failed.

00:24:13 One of our suppliers wasn’t sending all the information we needed when we made a request, and normally monitoring would catch this.

00:24:27 But we got a response back and the body was empty, so we received 200s without errors. This bug would have gone unnoticed!

00:24:44 We managed to contact the third-party that we were working with and have them fix the issue before anyone noticed.

00:25:00 We wouldn’t have caught it without testing in production.
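
The failure in this story, a 200 response with an empty body, suggests that a smoke test should assert on content and not only on status. Here is a minimal offline sketch; `FakeResponse`, the field names, and the JSON payload shape are all hypothetical.

```ruby
require "json"

# Minimal stand-in for an HTTP response so the sketch runs offline.
# In a real smoke test this would come from an HTTP client.
FakeResponse = Struct.new(:status, :body)

# Hypothetical required fields for a supplier response.
REQUIRED_FIELDS = %w[id price availability].freeze

# Returns a list of problems; an empty list means the check passed.
def supplier_response_problems(response)
  return ["unexpected status #{response.status}"] unless response.status == 200
  return ["empty body"] if response.body.to_s.strip.empty?

  data = JSON.parse(response.body)
  return ["unexpected payload"] unless data.is_a?(Hash)

  missing = REQUIRED_FIELDS - data.keys
  missing.empty? ? [] : ["missing fields: #{missing.join(', ')}"]
rescue JSON::ParserError
  ["body is not valid JSON"]
end

puts supplier_response_problems(FakeResponse.new(200, "")).inspect
```

A scheduler (cron or similar) could run a check like this every few hours and send an email when the returned list is not empty.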

00:25:10 I’ve been using the term 'smoke test' for those who don’t know what it is. A smoke test is a super simple test of the core functionality of your product.

00:25:26 It comes from the idea that when there’s smoke, there’s fire—that if this test fails, something is on fire. Even with complicated products, you can have relatively few smoke tests.

00:25:41 Most places I’ve worked keep it under six; almost everyone can keep it under a dozen. You’re just testing the very basics.

00:25:58 So if you're going to do this, pick a subset of your existing tests. You probably have something that you would consider a smoke test in your integration suite.

00:26:06 Set it to run on a schedule: every hour, once a day, or once a week, whatever makes sense. Focus on things like your third-party integrations and the core functionality of your product.

00:26:21 If you use something like the VCR gem, which normally makes your tests run faster by not making requests against third-party services, consider not doing that for these tests.

00:26:34 You are not actually testing your integrations if you fake those parts out. The big thing is to Leave No Trace. Your tests should clean up after themselves.

00:26:49 Most of my smoke tests don’t do purchases. Many of the database schemas I've dealt with have not allowed updates or deletes; they only allow subsequent writes.

00:27:01 So if I did a purchase, I couldn’t delete it. If you have a system like that, ensure you have a way of testing without doing purchases, or if you do purchases, you can flag those.

00:27:17 You don’t want a test that runs every minute to make tons of money show up in your reports!

00:27:33 Now, on to my next section, which I’m calling controlled breakage. Controlled breakage involves purposefully and deliberately breaking parts of your system.

00:27:47 Take servers down, pretend that disks went bad, or that your network pipe shrunk significantly. What are you testing in this case? You’re testing your ability to respond and recover.

00:28:01 Is your system supposed to be self-healing? Is someone on call to detect these types of errors and address them? I really like this testing.

00:28:16 I started my career in testing; I fundamentally love breaking things! It’s fantastic and wonderful. Once I got permission to do this, I went nuts, finding all sorts of bugs.

00:28:30 I documented bugs that started coming back marked as 'will not fix' because like security, durability is something that can vary.

00:28:43 You can be more durable or less durable, but it’s always a trade-off between durability and the amount of engineering time you want to dedicate to it.

00:28:59 It doesn't make sense to be exceptionally durable against something incredible like lightning striking a server.

00:29:15 If you’re going to do this kind of testing, stay in scope! Stay within the scope of stuff your team has agreed should be tested.
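
Staying in scope can be enforced in the tooling itself. The sketch below only ever selects a target from services the team has agreed to break; the `Instance` fields and the opt-in flag are invented for illustration and are not how any particular chaos tool works.

```ruby
# Hypothetical inventory; names and flags are made up for illustration.
Instance = Struct.new(:name, :service, :chaos_opt_in)

# Only instances whose owning team has opted in are eligible.
def eligible_targets(instances, allowed_services)
  instances.select { |i| i.chaos_opt_in && allowed_services.include?(i.service) }
end

# Picks one target at random, or nil when nothing is in scope.
def pick_target(instances, allowed_services, random: Random.new)
  eligible_targets(instances, allowed_services).sample(random: random)
end

fleet = [
  Instance.new("web-1", "web", true),
  Instance.new("web-2", "web", true),
  Instance.new("billing-1", "billing", false)
]

target = pick_target(fleet, ["web"])
puts "Would terminate #{target.name}" if target
```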

00:29:30 I can't talk about controlled breakage without mentioning Netflix, Chaos Monkey, and the Simian Army. Go check them out; they’re open-source and very cool.

00:29:44 We do controlled breakage testing at Google as well; we call it disaster recovery testing. I haven’t personally participated in that process, but I found a fantastic talk.

00:29:59 You should go watch it; it’s by one of the SREs who started that process and has fantastic stories about testing during disaster recovery.

00:30:13 Now, let’s talk about penetration testing. Who has been able to do some pen testing? I did some about six months ago, and it was awesome!

00:30:28 I had a week where I was just trying to break things. It’s really fun to think like an evil adversary. This is another form of controlled breakage.

00:30:44 You try to figure out mistakes that you have likely made and determine if you've patched against them. However, controlled breakage needs to be ethical breakage.

00:31:02 DHH touched on this in his keynote; we have the power for both good and evil. So, make sure you use your power for good!

00:31:21 Think carefully about the potential impacts of your choices on your users, your company, and your job.

00:31:36 When doing penetration testing, there are always rules of engagement, especially for large exercises. There’s usually a proctor to ensure you’re playing fair.

00:31:49 The last form of testing in production is disaster recovery verification. Who has a disaster recovery plan? Who has tested it in the last year?

00:32:06 Congratulations; you’ve successfully tested in production. You’re doing better than the vast majority of the audience!

00:32:17 Disaster recovery is how you plan for your data center catching on fire in a way you can't predict. I did a talk on this at RubyConf.

00:32:35 There was a time when they were replacing pieces in the power conditioners, and they caught fire at the data center. We were down for an hour.

00:32:50 Then, we ran on diesel for 11 days. It was an error that the supplier of the power system had never seen before; it was not supposed to happen!

00:33:03 Disaster recovery is how you're planning to deal with things like that. If you haven’t tested your disaster recovery plan in production, you haven’t tested it.

00:33:20 By its nature, your disaster recovery plan is created for when something bad happens in production. You need to move traffic to another cluster.

00:33:36 Maybe you need to move data between data centers or restore databases from a backup. I accidentally deleted a production database at 11:00 p.m. once.

00:33:54 I ran the feature branch migrations instead of trunk migrations against it. That was fun!

00:34:07 Luckily, I had taken a database backup right before doing that, so I could restore from it. I knew how to restore from the backup because I practiced that regularly.

00:34:23 As part of DR, you need to ensure you're testing scripts that handle network migrations and database restorations.

00:34:38 But you're also testing your people! Many people say testing people isn't testing; if you’re doing it right, you're testing everybody.

00:34:57 Implicit testing! This section is called implicit testing. I originally planned to call it passive testing, but I didn’t think that sounded right.

00:35:13 Implicit testing is the testing that you are already doing but don’t consciously think of as testing. Think about monitoring.

00:35:30 So, who in the audience has monitoring? Raise your hands please. Great! Now, who has alerts set on their monitoring? Excellent!

00:35:49 Alerts are actually tests. Think about it. We think of alerts as notifying us when something is wrong.

00:36:06 But if we massage this into English: alerts tell us the system isn't meeting expectations. Remember, I defined testing as verifying that your expectations are met.

00:36:21 So by definition, alerts are testing. Still don’t believe me? Say you have an alert if latency is greater than 500 milliseconds; that’s your test!

00:36:35 If you’re vigilant, you might watch your system for a couple of weeks to see what normal looks like, and set up alerts based on that.

00:36:51 Think about how you want your system to behave. If you have an endpoint that gets 90% of your traffic, it should probably respond quickly.

00:37:06 Want your error rate to be less than 5%? Set that test up as well! Maybe you think your disk should never be more than 80% full; set that test too.
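
Written as code, the alerts above read a lot like assertions. The sketch below uses the three thresholds just mentioned; the metric names and the hash-based sample are assumptions, since a real alerting system would express this in its own configuration.

```ruby
# Each alert is an expectation about a metric, written like an assertion.
Expectation = Struct.new(:metric, :description, :check)

EXPECTATIONS = [
  Expectation.new(:latency_ms, "latency stays at or under 500 ms", ->(v) { v <= 500 }),
  Expectation.new(:error_rate, "error rate stays under 5%", ->(v) { v < 0.05 }),
  Expectation.new(:disk_used, "disk is never more than 80% full", ->(v) { v <= 0.80 })
].freeze

# Returns the descriptions of every expectation the sample violates.
def failed_expectations(sample)
  EXPECTATIONS.reject { |e| e.check.call(sample.fetch(e.metric)) }.map(&:description)
end

sample = { latency_ms: 620, error_rate: 0.01, disk_used: 0.55 }
failed_expectations(sample).each { |failure| puts "ALERT: #{failure}" }
```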

00:37:22 We just call these tests alerts! A variation on this kind of testing is looking at month-over-month or year-over-year trends.

00:37:39 This way, you can answer questions and make assertions, such as: our error rate should not get larger, and our site should not get slower.
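
A trend assertion of this kind can be sketched as a comparison between two periods. The 10% tolerance and the sample figures are arbitrary placeholders chosen for the example.

```ruby
# Compares the same metric across two periods and flags regressions
# beyond a tolerance (higher values are assumed to be worse).
def regressed?(previous, current, tolerance: 0.10)
  current > previous * (1 + tolerance)
end

last_year = { median_latency_ms: 180.0, error_rate: 0.012 }
this_year = { median_latency_ms: 185.0, error_rate: 0.019 }

last_year.each do |metric, previous|
  status = regressed?(previous, this_year.fetch(metric)) ? "regressed" : "ok"
  puts "#{metric}: #{status}"
end
```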

00:37:56 Here's a screenshot from Stackdriver showing a year-over-year comparison. You can see it's bimodal.

00:38:12 Depending on whether the blue line is the newer or the older data, it may have slowed a bit at the far end, but mostly it looks the same.

00:38:24 I feel pretty okay that my assertion about no changes is valid; I expect behavior hasn't changed, and I trust my expectations.

00:38:38 To wrap things up: I’ve thrown a lot of thoughts and ideas at you, so I’m also going to give some basic do's and don'ts for production testing.

00:38:54 At the end, there’s a cheat sheet so you don’t have to take pictures of every slide. I will publish the slides.

00:39:10 So, do have clear goals. You should go into this intentionally, figuring out your goals and expectations when picking what to monitor and test in production.

00:39:24 Don’t DDoS yourself! I was doing a disaster recovery test.

00:39:42 We took down a server holding WebSockets, and the clients were supposed to reconnect, but the fallback server promptly fell over.

00:39:57 It couldn’t handle so many simultaneous connections, so we created a fail cycle during our disaster recovery testing.

00:40:10 So, think carefully about the possible impacts of the tests you’re about to conduct.

00:40:26 As we discussed before, test your seams; test the integrations between all your services that may be physically distanced.

00:40:38 Also, do not mess with user data! Keep your tests as walled off from user data as possible to avoid corruption.

00:40:52 Lastly, clean up after yourself! Do the Girl Scout thing and Leave No Trace; be a good citizen!

00:41:06 When using alerts, ensure they are actionable! If you use alerts as tests, that’s great, but ensure they don’t wake someone up at 3:00 a.m.

00:41:18 If it’s not urgent, they shouldn’t be paged; it causes burnout. We don’t want that.

00:41:33 Verify your integrations. In my experience, that is where the bugs that nothing else caught tend to hide, so trust but verify your dependencies.

00:41:47 Commonly, the third-party might update their API, and you miss the email notification.

00:42:00 Therefore, make sure you are testing! The big takeaway is whatever you choose, act methodically.

00:42:14 Have a purpose and a plan so you know what you've done and how to undo it.

00:42:27 Here’s your cheat sheet: have clear goals, test your seams, verify your integrations, clean up after yourself, and leave user data alone!

00:42:43 Keep alerts actionable. I want to say thank you, now get off my lawn!

00:42:58 So, the question is, how do you handle testing against production servers?

00:43:10 I’ve always done it by creating a magical test account. The advantage is that everything associated with that account I can ignore.

00:43:25 I worked at a place where we used a specific last name for those magical accounts; it started with five X’s.

00:43:35 This way, we wouldn't accidentally pick up someone else's real name. This is important because part of our smoke testing involved sign-ups.
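
The magical test account convention can be checked in code so that reports skip synthetic data. This is a sketch under assumptions: the uppercase `XXXXX` prefix, the `Signup` record shape, and the revenue report are all invented for the example.

```ruby
# Marker prefix for synthetic accounts: five X's, as described above.
TEST_LAST_NAME_PREFIX = "XXXXX"

# Hypothetical record shape for illustration.
Signup = Struct.new(:first_name, :last_name, :amount_cents)

def test_account?(signup)
  signup.last_name.to_s.start_with?(TEST_LAST_NAME_PREFIX)
end

# Reports should only count real users.
def real_revenue_cents(signups)
  signups.reject { |s| test_account?(s) }.sum(&:amount_cents)
end

signups = [
  Signup.new("Ada", "Lovelace", 1_500),
  Signup.new("Smoke", "XXXXXTest", 9_900)
]
puts real_revenue_cents(signups)  # prints 1500
```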

00:43:45 There are other ways to do it, and there are tools you can use. Some companies actually offer production testing services.

00:44:01 Do the right thing for your services, because you are already testing in production. You might as well do it on purpose!

00:44:14 Thank you!