Make a Difference with Simple A/B Testing

Danielle Gordon

#test-driven-development

In the RailsConf 2021 presentation titled "Make a Difference with Simple A/B Testing," Danielle Gordon addresses common misconceptions surrounding A/B testing, emphasizing that these tests are not only feasible, but also essential in validating app improvements. Gordon, who works at the startup Nift, discusses her journey from encountering the difficulties of implementing A/B tests to developing an efficient system to conduct them without cluttering the codebase.

Key Points Covered:
- Understanding A/B Testing: A/B testing involves comparing two or more versions of a component (like buttons or web pages) to determine which performs better based on defined success metrics such as user engagement, purchases, or retention.
- Example A/B Tests: Gordon presents two initial tests: changing the color of call-to-action buttons and adding a featured dessert section to a bakery marketplace app.
- Ensuring Consistency: She explains the importance of deterministic tests where users consistently see the same variant. This is achieved by using a user ID as a seed for a random number generator, creating a predictable, reliable testing environment.
- Code Organization: Gordon outlines the need for clean code management, suggesting that each A/B test should have its own experiment class to simplify updates or deletions, thus keeping the codebase tidy.
- Data Tracking Mechanism: She introduces an experiment events table to log each user’s participation in experiments, enabling data collection to evaluate the success of the A/B tests.
- Statistical Significance: A critical conclusion of her talk revolves around statistical significance, which determines how confident one can be in the results; Gordon advocates for at least a 95% confidence level in test outcomes.
- Future Enhancements: She concludes by discussing potential system improvements, such as accommodating multiple variants in the same test and providing a user interface for non-developer stakeholders to access A/B testing data easily.

Overall, Gordon's presentation aims to demystify A/B testing, encouraging developers to integrate such methodologies into their projects to enhance user experience and make informed design decisions. She leaves the audience with the empowering message that A/B tests can be straightforward and beneficial for any project's improvement efforts.

00:00:06.319 Hi, I'm Danielle and today we're talking about A/B testing. I currently work at a small startup based in Boston called Nift. Before joining Nift, I had never written an A/B test. I had heard about A/B tests, and specifically, I heard about all the difficulties associated with them.

00:00:11.040 I heard that they took too much time to write, made your codebase messy, were difficult to remove, and especially difficult to update. All the upkeep was frustrating, and we've run into a lot of these issues as well. However, over time we've been able to develop a system that makes A/B tests quick to add, easy to keep track of, easy to update, easy to remove, while also keeping our codebase clean.

00:00:37.260 My goal today is to show you this system, which I hope you can apply to your own apps, whether those are large apps serving millions or just personal projects. Before diving into the code, let's define a few things. First of all, A/B testing is really just comparing two or more versions of something to see which one performs better. To do that comparison, we need to have an idea of what success looks like; we need a metric to track.

00:01:01.580 That metric can be anything from site engagement, where you care about likes and page visits because you want your users to be actually engaged with your website, to something like redemptions and purchases: if you have a store, you care about how much is being bought. It can even extend to things that aren't on your website, like email clicks and opens, to see if customers are paying attention to you. It could also include things like positive reviews; if you can get reliable reviews from a source outside your website, you can use them to determine whether people enjoy using your site and find it useful.

00:01:39.360 I built a quick demo app that functions like a bakery marketplace where chefs can post desserts they're willing to sell, and customers can buy them. We're going to focus on two simple A/B tests. First, let's see if changing the color of our call-to-action buttons makes a difference. Right now, we have nice yellow buttons, and we want to see if changing them to a vibrant pink will encourage people to buy more.

00:02:02.700 For our next A/B test, let's try adding a featured dessert section to our desserts list page. Now that we know what we want to test, let's dive in and see what it looks like.

00:02:18.360 The first A/B test we want to conduct is with our call-to-action buttons. Fifty percent of our users will still see the yellow buttons, while the other fifty percent will see a pink call-to-action button. The simplest solution here is to use a random number generator to randomly select either 0 or 1, helping us decide whether to show the yellow button (which will be our control) or the pink button.

00:02:57.660 However, currently we have a problem with our A/B test; it’s not deterministic for the user. Every time the user refreshes the page, there's a 50% chance they'll see a different button. We want our users to consistently see the same button. The nice thing about the random class is that it takes a number as a seed for input. This means that if you always pass the same number, it will generate the same sequence of random numbers. We could leverage this for our users: if we pass in the user ID, the same user will consistently see the same color button.
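The seeding idea described here can be sketched in plain Ruby. This is a minimal illustration rather than the code shown in the talk; the helper name `cta_color` and the color strings are assumptions made for the example.

```ruby
# Hypothetical helper: picks a button color for a user.
# Seeding Random with the user ID makes the choice repeatable,
# because the same seed always yields the same random sequence.
def cta_color(user_id)
  Random.new(user_id).rand(2).zero? ? "yellow" : "pink"
end

first_visit  = cta_color(42)
second_visit = cta_color(42)

# The same user gets the same color on every page load.
puts first_visit == second_visit
```

Creating a fresh `Random` per call, instead of using the shared global generator, is what keeps the result independent of how many other random numbers the app has drawn.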

00:03:40.260 Now that we have a more consistent test, we can replicate this code across our other two call-to-action buttons. Next, let's add our second A/B test on the desserts list page. Here, we want to add a featured dessert section for half of our users to see if this will help increase engagement.

00:04:14.040 Similar to the first A/B test, we will implement a simple 50/50 split; half of our users will see the featured section while the other half will not. There's a problem of test collisions that we have to address due to using the same seed in both A/B tests. If we get a zero, that means we’ll always see yellow call-to-action buttons with the featured dessert section. If we get a one, we’ll see pink buttons without the featured dessert section. A simple solution to this is to add a number, say one, to our user ID.
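A rough sketch of the seed offset, assuming plain integer user IDs. The constant and method names are hypothetical; the talk only says that a number such as one is added to the user ID for the second test.

```ruby
# Hypothetical per-test offsets; each test needs its own value.
CTA_SEED_OFFSET = 0
FEATURED_SEED_OFFSET = 1

def pink_cta?(user_id)
  Random.new(user_id + CTA_SEED_OFFSET).rand(2) == 1
end

def featured_section?(user_id)
  Random.new(user_id + FEATURED_SEED_OFFSET).rand(2) == 1
end

# Count how often the two tests put a user in the same bucket.
# With a shared seed the two buckets would be locked together for every user.
same_bucket = (1..10_000).count { |id| pink_cta?(id) == featured_section?(id) }
puts "same bucket for #{same_bucket} of 10000 users"
```

Each test stays deterministic per user, but knowing a user's bucket in one test no longer tells you their bucket in the other.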

00:05:00.420 This change ensures that tests remain deterministic for each user but are no longer correlated with one another. However, there are a couple of issues with the current code. First, there's an excess of duplicate code, as we are essentially doing the same thing across three separate files.

00:05:43.140 Second, at the start of this talk, I promised you that A/B tests would be easy to find, easy to update, and easy to delete; right now, they are none of these. Moreover, you must remember to assign a unique seed every time you add a new A/B test. If any of your tests share the same seed, they're essentially colliding, and you cannot ensure that your data is meaningful. Also, we need to consider how we’re capturing data.

00:06:25.260 You could run through each user and determine which bucket they belong to, but you have no guarantee that the user saw any of those pages. To address these issues, we're going to implement two changes. First, for each A/B test, we will introduce its own experiments class. This class will encapsulate all the logic pertaining to that specific A/B test. This means we now have a single location to update, making it easier to delete and locate the impacted files.

00:07:06.060 Additionally, we're going to integrate an experiment events table. Each time, or at least the first time a user enters our test, we will log them into this table. We'll not only note that they're part of this test but also whether they're a control or part of a different variant.

00:08:00.840 Let’s begin by adding a migration for our experiment events table. In this table, we're going to have a non-null reference to the user, a non-null experiment name, and a non-null bucket or variant that the user belongs to. While we're at it, we will also add a unique index on the user ID and the experiment, since there’s no reason to have more than one row per experiment per user, as a user only enters an experiment once.

00:08:46.920 Now we can create a new folder to hold all our A/B tests called 'experiments.' Let’s start adding our new experiments class. In this class, we're going to take in the user as input and assign them to a bucket, which could be control or the show section. We'll log the user’s participation in this experiment and also add functionality to easily determine whether we should show the featured dessert section.
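The shape of such an experiment class might look like the sketch below. It is not the talk's actual code: an in-memory array stands in for the experiment events table so the example runs without Rails, and the class, constant, and method names are assumptions.

```ruby
# Hypothetical stand-ins: an array plays the role of the experiment
# events table, and a Struct plays the role of the User model.
EVENTS = []
User = Struct.new(:id)

class FeaturedDessertExperiment
  NAME = "featured_dessert"
  SEED_OFFSET = 1 # must differ from every other experiment's offset

  attr_reader :user

  def initialize(user)
    @user = user
  end

  # Deterministic 50/50 bucket for this user.
  def bucket
    @bucket ||= Random.new(user.id + SEED_OFFSET).rand(2).zero? ? "control" : "show_section"
  end

  # Log participation once per user, mirroring the unique index on the table.
  def track_experiment
    return if EVENTS.any? { |e| e[:user_id] == user.id && e[:experiment] == NAME }

    EVENTS << { user_id: user.id, experiment: NAME, bucket: bucket }
  end

  # Views call this to decide whether to render the featured section.
  def show_featured_section?
    track_experiment
    bucket == "show_section"
  end
end

experiment = FeaturedDessertExperiment.new(User.new(7))
experiment.show_featured_section?
experiment.show_featured_section? # a second page view adds no new row
puts EVENTS.length
```

Because the view only ever calls `show_featured_section?`, the bucketing rule and the logging can change later without touching any template.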

00:09:33.840 Now that we have a class that contains all our A/B test logic, we just need to call it where required. Next, we can return to our desserts index HTML and replace the existing logic with our actual experiments class. We will now create another experiment class for our call-to-action buttons A/B test. What if we decide to modify our test? Perhaps we only want to run it for users who’ve already made a purchase.

00:10:21.420 Modifying it is surprisingly easy. Let’s create a trackable function to determine user eligibility, verifying if the user has prior orders. We can then utilize this function in 'track experiments.' If a user is not trackable, we will refrain from saving them in the experiment events table. Also, in our 'show Pink CTA' function, we can return false if the user is not trackable. Now we’re prepared to implement this experiment in our three view files.
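A minimal sketch of the eligibility check, again using an in-memory array instead of the real table. The `order_count` field and the method names `trackable?` and `show_pink_cta?` are assumptions based on the names spoken in the talk.

```ruby
# Hypothetical stand-ins for the events table and the User model.
EVENTS = []
User = Struct.new(:id, :order_count)

class CtaButtonExperiment
  NAME = "cta_button_color"

  def initialize(user)
    @user = user
  end

  # Only users with at least one prior order take part in the test.
  def trackable?
    @user.order_count.positive?
  end

  def bucket
    Random.new(@user.id).rand(2).zero? ? "control" : "pink"
  end

  # Ineligible users are never written to the events table.
  def track_experiment
    return unless trackable?
    return if EVENTS.any? { |e| e[:user_id] == @user.id && e[:experiment] == NAME }

    EVENTS << { user_id: @user.id, experiment: NAME, bucket: bucket }
  end

  # Ineligible users always get the existing (control) button.
  def show_pink_cta?
    return false unless trackable?

    track_experiment
    bucket == "pink"
  end
end

newcomer  = User.new(1, 0)
returning = User.new(2, 3)
CtaButtonExperiment.new(newcomer).show_pink_cta?  # false, nothing logged
CtaButtonExperiment.new(returning).show_pink_cta? # logged as control or pink
puts EVENTS.map { |e| e[:user_id] }.inspect
```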

00:11:07.860 At the beginning of this talk, we wanted to ensure that our tests were easy to remove. Because our logic is within the experiment class now, we can simply search our codebase for the class name. Anywhere we find it, we can remove that code, as it is impacted by our test.

00:11:57.720 Now that we have a functioning A/B test, let's explore how to retrieve the results. We can easily open up a Rails console to run our queries. Additionally, we can directly access the database to perform queries. However, since we are likely to check results multiple times throughout the test cycle, it makes sense to save how we get the data, and what better place to save this than in the experiments class itself?

00:12:43.080 When reviewing results for A/B tests, there are two critical factors we need to assess: how many people participated in the experiment and how many converted. In our case, we count a conversion when someone makes an order after the experiment starts. Once we have that data, we can determine not only if one variant outperformed the other but also whether the test was valuable or statistically significant.
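The two numbers described here, participants and conversions per bucket, can be illustrated with plain Ruby collections. The rows below are made-up sample data and the start date is arbitrary; in the talk this information comes from database queries, not arrays.

```ruby
# Arbitrary, illustrative start date for the experiment.
experiment_started_at = Time.utc(2021, 4, 1)

# Hypothetical rows standing in for the experiment events and orders tables.
events = [
  { user_id: 1, bucket: "control" },
  { user_id: 2, bucket: "control" },
  { user_id: 3, bucket: "pink" },
  { user_id: 4, bucket: "pink" }
]
orders = [
  { user_id: 1, created_at: Time.utc(2021, 3, 20) }, # before the test: not a conversion
  { user_id: 3, created_at: Time.utc(2021, 4, 5) },
  { user_id: 4, created_at: Time.utc(2021, 4, 6) }
]

# A user converts if they placed an order after the experiment started.
converted_ids = orders.select { |o| o[:created_at] >= experiment_started_at }
                      .map { |o| o[:user_id] }
                      .uniq

results = events.group_by { |e| e[:bucket] }.transform_values do |rows|
  {
    participants: rows.size,
    conversions: rows.count { |r| converted_ids.include?(r[:user_id]) }
  }
end

results.each do |bucket, r|
  puts "#{bucket}: #{r[:conversions]}/#{r[:participants]} converted"
end
```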

00:13:30.900 Statistical significance refers to how confident we are that if we repeat the experiment, we will obtain similar results. This percentage ranges from 0 to 100, where a higher percentage signifies greater confidence in our outcomes. For instance, if we achieve a statistical significance of 90%, we can assert that we are 90% confident control is better than the pink button. Generally, in A/B testing, we aim for a significance level of at least 95%, meaning we’re 95% sure of our results.

00:14:19.620 The formula for calculating statistical significance is quite intricate, requiring a lookup table. Therefore, we typically utilize online calculators or libraries to perform these calculations for us. Our A/B testing system is effective; all test logic is centralized, making tests straightforward to create, update, and remove. But how can we improve it?
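For readers curious what such a calculator does internally, here is a sketch of one common approach, a pooled two-proportion z-test, with Ruby's `Math.erf` taking the place of the lookup table. The talk does not say which formula or tool was used, so treat this as an assumption; online calculators may differ, for example by reporting a two-sided figure.

```ruby
# Standard normal cumulative distribution, computed from the error function.
def normal_cdf(z)
  0.5 * (1 + Math.erf(z / Math.sqrt(2)))
end

# One-sided confidence (50..100) that the better-performing variant's
# conversion rate really is higher, using a pooled two-proportion z-test.
def confidence(participants_a, conversions_a, participants_b, conversions_b)
  rate_a = conversions_a.to_f / participants_a
  rate_b = conversions_b.to_f / participants_b
  pooled = (conversions_a + conversions_b).to_f / (participants_a + participants_b)
  std_error = Math.sqrt(pooled * (1 - pooled) * (1.0 / participants_a + 1.0 / participants_b))
  return 50.0 if std_error.zero? # no variation at all: no evidence either way

  z = (rate_b - rate_a).abs / std_error
  normal_cdf(z) * 100
end

# Hypothetical counts: 100 of 1000 control users converted, 130 of 1000 variant users.
puts confidence(1000, 100, 1000, 130).round(1)
```

With these sample counts the result clears the 95% threshold mentioned above, while two identical conversion rates give 50, meaning no evidence in either direction.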

00:15:02.700 Currently, we don’t support multiple versions of the same A/B test. Oftentimes, when introducing a new variant, we may want to expose only 10% of our users to it initially. For example, if we have a million users and direct 50% of them (500,000) to a new variant that performs 10% worse than control, we could lose around 50,000 users. Therefore, we frequently start with a small percentage, say 5 or 10%, monitor the performance, and gradually increase that percentage.

00:15:51.180 Additionally, non-developer stakeholders can only access data by asking a developer. If we could anchor our tests in a database, we could develop a user interface for these stakeholders to view results. Finally, all our tests thus far have centered on users, but what if we want to test categories, such as showing descriptions on dessert pages, or sending follow-up emails after order creation? Currently, we can't do this.

00:16:30.780 To accommodate these use cases, we can create an experiments table. This table would not only facilitate the creation of a UI for non-developer stakeholders but also allow us to add a row whenever we wish to create a new version of the same A/B test, instead of requiring an entirely new commit and deployment. Furthermore, we could make experiment events polymorphic, meaning the table no longer supports only users; it can accommodate any model we care about.

00:17:31.320 So, let's work on updating our tests to support this. We're going to start by updating the experiment events table. I will roll back our last migration and modify the migration file. Initially, we're going to remove the user reference and replace it with a target. We are aiming to make the target polymorphic, supporting various models beyond just users. Next, rather than allowing experiment to be a string representing the name, we will change it to reference the new experiments table we will create.

00:18:29.760 Lastly, we’ll update our index. Instead of having a unique index on the experiment and user ID, we will create a unique index based on the experiment ID, target type, and target ID. As we prepare to run our migration, I realize that the unique index name might be too lengthy by default, so I'll manually designate a name for it.

00:19:20.940 Now, let's add a migration for our new experiments table. This table will have three primary fields: a non-null string for the name of the experiment, a JSON field to represent our variants and their respective ratios, and finally a column indicating the version. For example, if we want to conduct a test with three variants (one being the control, assigned a ratio of 50%, while the other two variants get 25% each), we can create a hash where control receives 2 and the other two variants receive 1 each.

00:20:20.280 Now that our updated experiment events table and our new experiments table are ready, we'll take a moment to clean up some code by creating a base experiments class. Instead of referencing a user class, we will take in a target. Our subclasses will be required to implement an experiments method and an experiment method, allowing us to gather all the relevant experiments and the current running experiment.

00:21:17.640 We'll move our track experiment method from the subclasses to this base class since they are performing similar functionality. This will establish an experiment event for the target, based on the chosen experiment and the current running experiment. We’ll introduce a default method for trackable, set to true.

00:22:01.440 Lastly, we’ll move the chosen experiment method from the subclasses to the base class. In essence, we have all the necessary elements to carry out our tasks since everything is saved on the model. We will update our random number generator in our chosen experiment to leverage a seed based on the class name, thus alleviating the burden of manually tracking used seeds when adding new A/B tests.
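One way to derive a seed from the class name is sketched below. The talk does not show how the name is turned into a number; using `Zlib.crc32` is an assumption made here because Ruby's built-in `String#hash` changes between processes and would break determinism across restarts. The class and method names are hypothetical.

```ruby
require "zlib"

class BaseExperiment
  def initialize(target_id)
    @target_id = target_id
  end

  # A stable per-class number plus the target ID. CRC32 returns the same
  # value in every process, unlike String#hash.
  def seed
    Zlib.crc32(self.class.name) + @target_id
  end

  # Deterministic number in 0...max for this class and target.
  def random_number(max)
    Random.new(seed).rand(max)
  end
end

# Subclasses get distinct seeds without anyone tracking offsets by hand.
class CtaButtonExperiment < BaseExperiment; end
class FeaturedDessertExperiment < BaseExperiment; end

puts CtaButtonExperiment.new(5).seed != FeaturedDessertExperiment.new(5).seed
```

One consequence of this design is that renaming an experiment class changes its seed, which reshuffles users between buckets.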

00:22:46.920 Finally, we'll use the variant and ratio hash on the experiment model to determine which variant our random number maps to. Each number in our variant hash represents the size of a range. If we stack those ranges on top of one another numerically, we can ensure that our random number will fall within exactly one of them.
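The stacked-ranges idea can be sketched as follows, using the 2/1/1 hash from the earlier example (control at 50%, two variants at 25% each). The variant names and method names are hypothetical.

```ruby
# Relative weights: control 2, the other two variants 1 each (50/25/25).
VARIANTS = { "control" => 2, "variant_a" => 1, "variant_b" => 1 }

# Walk the stacked ranges until the number falls inside one.
# With the hash above: 0 and 1 map to control, 2 to variant_a, 3 to variant_b.
def pick_variant(variants, number)
  upper = 0
  variants.each do |name, weight|
    upper += weight
    return name if number < upper
  end
  raise ArgumentError, "number must be less than #{upper}"
end

# Draw a number below the total weight from a seeded generator, then map it.
def variant_for(variants, seed)
  pick_variant(variants, Random.new(seed).rand(variants.values.sum))
end

puts (0..3).map { |n| pick_variant(VARIANTS, n) }.inspect
```

Changing the split then only requires changing the weights, for example 9 and 1 for a 90/10 rollout.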

00:23:51.300 At this point, we are prepared to update our two A/B test experiment classes. We will begin with our call-to-action buttons experiment class, extending from our base experiment. We’ll remove our attribute readers and replace them with alias methods since we will transition to using targets instead of users.

00:24:41.520 We will eliminate our experiment started attribute, as we can directly access that information from the experiment model. Next, we will implement our experiments and experiment methods, which will include lazy-loading our first variant. We'll designate that initial version and run our test with a 90/10 split, where the control group will encompass 90% of our users.

00:25:17.760 Now, let’s revise our print results function to accommodate multiple versions of tests. We will loop through all our experiments and gather information on the versions and different variants available. Utilizing the hardcoded variants in the print results function is feasible, but we could also extract the variants directly from the experiments model.

00:25:51.180 Also, since we will pivot from user IDs to targets—making it polymorphic—we'll need to modify our query accordingly. Conducting polymorphic joins is complex, so we will approach this by executing a join from the user perspective to experiment events and orders, allowing us to gather the conversion statistics effectively.

00:26:30.780 Lastly, we will update our featured desserts experiment in a similar manner to how we revised our call-to-action buttons experiment. We will adjust our attribute reader on the user, replacing it with an alias method for the target, ensuring smooth functionality through an updated initialization.

00:27:26.440 We will also add experiments and experiment methods akin to those in the CTA buttons experiment, while also removing the chosen experiment and track experiment methods from the class, since they now live in the base class.

00:28:02.880 At this juncture, we are operating with a quite flexible system. If we wanted, we could easily develop a UI for non-developer stakeholders, allowing for efficient creation, updates, and removal of A/B tests in under five minutes. More importantly, we now have a promising opportunity to gain deeper insights into our users.

00:28:37.560 I hope this presentation has shown you that A/B tests don't have to be intimidating or a nuisance. Instead, I hope you leave here confident in your ability to try them out in your own projects. If you have any questions or comments, I’ll be available in the RailsConf Discord. Thank you.

RailsConf 2021