How to A/B test with confidence

by Frederick Cheung

In the video "How to A/B Test with Confidence," presented by Frederick Cheung at RailsConf 2021, the importance of A/B testing as a method for making data-driven decisions in applications is highlighted. The session begins with a brief overview of what A/B testing involves and proceeds to discuss various common pitfalls during the setup, implementation, and analysis phases of A/B tests, as well as practical tips to enhance their reliability.

Key points include:
- Introduction to A/B Testing: A/B testing is described as a decision-making process guided by data. The speaker uses the example of testing two button texts ('Buy Now' vs. 'Order') to determine which drives more conversions.
- Statistical Concepts: Significance testing, test power, and sample size are explained, with emphasis on why understanding these concepts is critical for valid results.
- Common Pitfalls: Common errors when conducting A/B tests include improper group randomization, starting tests too early, failing to agree on metrics beforehand, and accidental differences between variants, such as bugs that affect outcomes.
- Implementation Errors: Cheung discusses how to avoid fixed biases in testing groups and the dangers of stopping tests prematurely based on initial reactions.
- Analyzing Results: The right statistical test must be applied based on the metrics of interest, and confounding factors must be carefully considered. Outliers are discussed, showing how a single extreme value can skew averages.
- Best Practices: Emphasis is placed on well-documented tests, avoiding over-testing, and ensuring neutrality when interpreting results.

Cheung concludes the talk by presenting the essential practices for successful A/B testing, recommending that testers document processes rigorously, understand their tools, and engage with users directly to complement testing data. The overarching message stresses the importance of careful planning, ongoing testing education, and adherence to established methodologies to ensure that A/B tests yield meaningful insights rather than misleading conclusions.

00:00:04.860 Hi, welcome to my talk, 'How to A/B Test with Confidence'. In this session, we'll learn a lot about A/B tests, including what can go right and what can go wrong with them.

00:00:10.740 My name is Frederick Cheung, and I am a long-time Rubyist. I started with Rails back in version 1.0, and I am currently the CTO at Dressipi.

00:00:16.680 At Dressipi, we do a lot of things, mainly helping retailers personalize their content and make smarter data-driven decisions.

00:00:22.439 As part of our work, we run numerous A/B tests to compare our personalization methods with others, exploring different types and algorithms for improvements.

00:00:28.500 So, we run a lot of A/B tests. This is my assistant Luna, who is half pillow and half cat.

00:00:35.160 Briefly, the plan for the next half hour is to explore how to conduct A/B tests effectively and how to avoid common pitfalls.

00:00:42.000 We'll start with a brief recap of what A/B testing is and why we do it. Then, we'll discuss what can go wrong during setting up, running, and analyzing tests.

00:00:48.719 To finish on a positive note, I will share best practices.

00:00:54.719 So, starting with the basics, what is an A/B test? It begins when there’s a decision you'd like to make, guided by data.

00:01:05.939 For example, you might have an e-commerce application and wonder whether a button should say 'Buy Now' or 'Order.' Both options seem plausible, but instead of relying on gut feeling, you want actual data.

00:01:12.119 To test this, you would split your users randomly into two groups, with one seeing the 'Buy Now' button and the other seeing the 'Order' button.

00:01:18.000 After running this for a few days, you would count how many orders occurred in each group, resulting in quantities like 49 orders versus 56 orders.

00:01:24.119 The real question we need to answer is whether this difference is statistically significant—does changing the text genuinely improve our results, or was it just a fluke?

00:01:29.280 Before diving into the statistical analysis, it’s important to note that A/B tests are not just limited to button text. They can encompass more complex changes in layouts, designs, or overall user flows.

00:01:40.020 You can test how different buttons impact user actions; for instance, does one button lead to a quicker purchase process if it triggers an action on the page, while another button redirects to the checkout page? Additionally, you can explore variations in algorithms or recommendation engines whenever measurable differences exist.

00:02:08.580 However, before going too deep, we need to cover some jargon that will help clarify our discussion: significance, test power, and sample size.

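00:02:16.920 Significance answers the key question: is the observed difference just a fluke? People often refer to p-values when discussing significance. Loosely speaking, a p-value of 0.05 indicates a 5% chance of a fluke: if there were really no difference between the variants, you would still see a difference at least this large about 5% of the time.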

00:02:27.819 People might also present this as a confidence level of 95%. Although these numbers are viewed differently, they represent the same idea from opposing perspectives.

00:02:33.540 When you run your test, you collect data and feed it into the appropriate statistical test, which varies depending on the type of metric. For example, you'll do a different analysis for retention rates than for signup conversions or average spend per customer.

00:02:50.280 Significance testing tells you whether there's a genuine difference between the two groups, but it does not tell you how large that difference is. If one group achieves a 10% conversion rate and the other 6%, a statistically significant result does not guarantee that the true difference is 4 percentage points.
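
To make the 10% versus 6% point concrete, here is a minimal Ruby sketch of a two-proportion z-test together with a 95% confidence interval for the difference. The group sizes (500 users each) and the method name `two_proportion_test` are assumptions made for illustration; the talk only gives the two rates.

```ruby
# Two-proportion z-test plus a confidence interval for the difference.
# Group sizes (500 each) are hypothetical; the talk only gives the rates.
def two_proportion_test(conv_a, n_a, conv_b, n_b, z_crit: 1.96)
  p_a = conv_a.to_f / n_a
  p_b = conv_b.to_f / n_b
  diff = p_a - p_b

  # Pooled standard error: used to test "there is no difference at all".
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  se_pooled = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  z = diff / se_pooled
  p_value = Math.erfc(z.abs / Math.sqrt(2)) # two-sided p-value

  # Unpooled standard error: used for the interval around the difference.
  se = Math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
  { diff: diff, p_value: p_value, ci: [diff - z_crit * se, diff + z_crit * se] }
end

result = two_proportion_test(50, 500, 30, 500) # 10% vs 6%
puts format("difference: %.1f points", result[:diff] * 100)
puts format("p-value:    %.3f", result[:p_value])
puts format("95%% CI:     %.1f to %.1f points", result[:ci][0] * 100, result[:ci][1] * 100)
```

With these assumed group sizes the result is significant at the 5% level, yet the interval runs from well under 1 point to more than 7 points, so the observed 4-point gap is only one plausible value among many.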

00:02:57.120 The counterpart to significance is test power, which concerns the likelihood of missing a real change, as opposed to the false positives that significance guards against. Power relates to how small a change you want to be able to detect: intuitively, the smaller the change, the more data you'll need.

00:03:09.959 It’s worth noting that test power also considers absolute rates, as opposed to just relative rates. For instance, moving from a 10% to a 20% conversion rate is easier to measure than from 0.1% to 0.2%, despite both cases indicating a doubling.

00:03:17.400 If you’re measuring something with a current conversion rate of just 0.1%, achieving a statistically significant result will be quite challenging. Combining significance and test power leads us to the concept of sample size.

00:03:29.880 Sample size is how many users you need in your A/B test to detect a change of a given magnitude at a given significance level and power. Online calculators can help you determine this number.

00:03:46.080 Once you have calculated your required sample size, check that the test is feasible. If you need one million users but only have a thousand, it's back to the drawing board.
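
As a rough illustration of what such a calculator does, the sketch below uses a common normal-approximation formula for comparing two proportions. The method name `sample_size_per_group` and the defaults (two-sided 5% significance, 80% power) are conventional assumptions, not values from the talk, and real calculators may use slightly different formulas.

```ruby
# Approximate users needed per group to detect a change from rate p1 to p2.
# z_alpha = 1.96 (two-sided 5% significance) and z_beta = 0.8416 (80% power)
# are conventional defaults assumed here, not values from the talk.
def sample_size_per_group(p1, p2, z_alpha: 1.96, z_beta: 0.8416)
  variance = p1 * (1 - p1) + p2 * (1 - p2)
  (((z_alpha + z_beta)**2) * variance / ((p1 - p2)**2)).ceil
end

puts sample_size_per_group(0.10, 0.20)   # a doubling from 10%
puts sample_size_per_group(0.001, 0.002) # a doubling from 0.1%
```

Under these assumptions the first case needs roughly 200 users per group, while the second needs more than 23,000, even though both represent a doubling of the conversion rate.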

00:03:55.920 When evaluating feasibility, consider cost implications as well. For instance, if you are testing a new homepage and gaining new visitors primarily through Google Ads, acquiring a significant number of users can be costly.

00:04:06.780 Once you confirm feasibility, aim to run the test without tinkering with it during the process. Try not to monitor the results until you reach the required sample size.

00:04:13.920 However, be cautious of conducting short experiments. For example, if your sample size calculation indicates you need 5,000 users and you have traffic of 5,000 users in a day, you become vulnerable to transient conditions.

00:04:25.740 If the day you conduct your test coincides with a national holiday or school breaks, the user behavior may not represent typical usage, thus resulting in skewed data.

00:04:39.000 Another approach to analyzing test outcomes is Bayesian A/B testing, which is a different statistical school of thought. While I can’t cover it in extensive detail, the main distinction lies in starting with a prior model of how you assume the world operates.

00:04:54.180 Incorporating your existing understanding and uncertainty, you update this prior with data from your experiments. Instead of deriving a single number from testing, the output becomes a probability distribution.

00:05:05.820 The output from a Bayesian A/B test typically features two curves representing the posterior probabilities for each variant. The intention is to analyze the entire range of potential outcomes and ascertain whether the green curve is likely better than the red one.

00:05:16.560 Bayesian testing allows you to incorporate existing knowledge and uncertainties, departing from a zero-knowledge perspective. This can lead to improved outcomes, especially when working with low base rates, although the underlying math is slightly more complex.
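
For a conversion-rate metric, one common way to do this is a Beta-Binomial model. The sketch below is only an illustration of that idea, not the method used in the talk: it assumes a uniform Beta(1, 1) prior, reuses the 49 and 56 orders from the earlier example, and invents a group size of 1,000 users each. The method names are hypothetical.

```ruby
# Log-density of a Beta(a, b) distribution at x, with 0 < x < 1.
def beta_log_pdf(x, a, b)
  log_norm = Math.lgamma(a + b).first - Math.lgamma(a).first - Math.lgamma(b).first
  log_norm + (a - 1) * Math.log(x) + (b - 1) * Math.log(1 - x)
end

# Probability that variant B's true rate exceeds variant A's, computed by
# numerically integrating pdf_b(x) * cdf_a(x) over a grid (midpoint rule).
def probability_b_beats_a(conv_a, n_a, conv_b, n_b, steps: 5_000)
  # Posterior is Beta(1 + conversions, 1 + non-conversions) under the
  # assumed uniform prior.
  a1 = 1 + conv_a
  b1 = 1 + n_a - conv_a
  a2 = 1 + conv_b
  b2 = 1 + n_b - conv_b

  dx = 1.0 / steps
  cdf_a = 0.0
  total = 0.0
  steps.times do |i|
    x = (i + 0.5) * dx
    mass_a = Math.exp(beta_log_pdf(x, a1, b1)) * dx
    mass_b = Math.exp(beta_log_pdf(x, a2, b2)) * dx
    total += mass_b * (cdf_a + mass_a / 2) # cdf_a evaluated at the cell midpoint
    cdf_a += mass_a
  end
  total
end

puts probability_b_beats_a(49, 1_000, 56, 1_000).round(3)
```

Instead of a yes/no verdict, the output is a probability that one variant is better than the other, which is the kind of statement the two posterior curves are meant to support.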

00:05:34.380 However, most of the discussion in this talk will apply equally to both classical and Bayesian A/B testing.

00:05:44.280 We now have a good understanding of the basics of A/B testing, so let's look at what can go wrong. Things can go wrong from the very beginning, particularly with group randomization, which is the foundation of a valid test.

00:05:55.620 If randomization is flawed, the validity of your test becomes compromised.

00:06:03.660 About ten years ago, I wrote a few lines of code to randomly assign users to test groups based on whether their user ID was even or odd. It seemed reasonable, so if your ID modulo 2 was 0, you'd see the experiment; otherwise, you'd see the control.

00:06:10.380 This solution might have sufficed for a single A/B test; however, we ran multiple tests, and the assignment was the same for every one of them. Users with even IDs were always in the experiment group, which meant they were the ones who saw every change.

00:06:18.960 In the long term, we weren't merely testing isolated experiments but rather assessing how users reacted to constant changes.

00:06:27.780 This raises an important point: constant changes can fatigue users. It’s crucial to maintain a stable testing environment.

00:06:38.760 You can avoid such pitfalls by employing a library like 'Vanity,' which is a well-regarded tool in Rails for conducting A/B tests.

00:06:49.620 In this library, instead of merely relying on user IDs, you can combine the user ID with a unique experiment identifier and hash the result to decide group assignment. However, make sure the hashing keeps the necessary statistical properties, such as spreading users evenly across groups.
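
The idea of hashing the experiment name together with the user ID can be sketched in a few lines of Ruby. This is an illustration of the general technique, not Vanity's actual implementation; the method name `variant_for`, the experiment names, and the choice of SHA-256 are all assumptions.

```ruby
require "digest"

VARIANTS = %w[control experiment].freeze

# Deterministic assignment: the same user always gets the same variant
# within one experiment, but assignments differ between experiments.
def variant_for(experiment_name, user_id)
  digest = Digest::SHA256.hexdigest("#{experiment_name}:#{user_id}")
  bucket = digest[0, 8].to_i(16) # first 32 bits of the hash as an integer
  VARIANTS[bucket % VARIANTS.size]
end

users = (1..10_000).to_a
in_button_test = users.select { |id| variant_for("button_text", id) == "experiment" }
in_both = in_button_test.count { |id| variant_for("new_checkout", id) == "experiment" }

puts "button_text experiment group: #{in_button_test.size} of #{users.size}"
puts "also in new_checkout experiment group: #{in_both}"
```

With an even/odd split, everyone in the first experiment group would also be in the second. With a per-experiment hash, only about half of them should be, so no fixed set of users ends up seeing every change.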

00:06:58.560 As a general rule, avoid creating your own custom randomization if possible; it can lead to unexpected flaws in your setup.

00:07:09.420 Moreover, even a single instance of non-random sampling can lead to significant issues. For example, we once participated in an A/B test run by a retailer where users were not assigned randomly.

00:07:21.600 Instead, newer users were placed in one group while older users were placed in another. This resulted in skewed outcomes since newer users typically displayed less loyalty.

00:07:31.920 Consequently, the test produced astonishing results; however, historical data showed that the two groups already had inherently different outcomes before the test started.

00:07:42.240 This situation complicates result extraction, making it difficult to trust the test. For instance, if there was a 5% difference before the test and a 6% difference afterward, simply subtracting the values does not yield a reliable conclusion.

00:07:53.400 Another potential mistake is commencing your test prematurely. To illustrate, suppose you want to test a change on your checkout page with 100,000 users visiting your site in one month.

00:08:02.640 You would split your users into two groups of 50,000, with both groups seeing the same homepage. Some users would proceed to the product pages, where both groups still see exactly the same thing.

00:08:10.140 Fewer users still would reach the checkout page, which is the only place where the two groups see something different.

00:08:17.339 Let's say roughly 5,000 users in each group reach checkout, and 2,600 of them convert in group A compared to 2,500 in group B. Measured against the 50,000 users assigned to each group, that is a meager 5.2% versus 5% conversion.

00:08:26.040 If we remember the earlier discussion about test power, that small change isn't statistically significant, because the effect is diluted by the large number of users who never saw any change.

00:08:36.180 Here’s what to do differently: enable 100,000 users to access your site without splitting them yet.

00:08:43.920 Allow them to view products and only allocate user groups once they proceed to add items to their cart and visit the checkout page. This way, the split occurs just when users experience differing content.

00:08:56.040 The difference is now measured only against users who actually saw the change. The same orders counted this way, 2,500 of 5,000 in group B compared to 2,600 of 5,000 in group A, give conversion rates of 50% and 52%, which is a statistically significant difference.

00:09:06.900 In summary, it’s advisable to allocate users to groups as late as possible to maximize the efficacy of the measured change.
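
The effect of late allocation can be checked with a small Ruby sketch that runs the same two-proportion z-test on both setups. It assumes 2,500 and 2,600 orders, 50,000 users per group when splitting at the homepage, and about 5,000 users per group when splitting at checkout; the method name `two_sided_p_value` is hypothetical.

```ruby
# Two-sided p-value for a difference between two conversion rates,
# using a pooled two-proportion z-test.
def two_sided_p_value(conv_a, n_a, conv_b, n_b)
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  z = (conv_a.to_f / n_a - conv_b.to_f / n_b) / se
  Math.erfc(z.abs / Math.sqrt(2))
end

# Same orders, different denominators (all counts are assumed for illustration).
early_split = two_sided_p_value(2_600, 50_000, 2_500, 50_000) # split at homepage
late_split  = two_sided_p_value(2_600, 5_000, 2_500, 5_000)   # split at checkout

puts format("split at homepage: p = %.3f", early_split)
puts format("split at checkout: p = %.3f", late_split)
```

Under these assumptions the early split gives a p-value of around 0.15, while the late split comes in just under 0.05. The orders are identical; only the number of unaffected users in the denominator has changed.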

00:09:13.500 The last category of setup error I want to address is failing to agree on your test setup beforehand.

00:09:24.300 Be sure to reach consensus on various aspects like the scope of the test, deciding which pages will be included, and whether to test all users or only logged-in users.

00:09:29.760 If your website operates differently in various regions, discuss whether to test specific locales or encompass all users.

00:09:37.380 Additionally, clarify the goal of the A/B test and how to best quantify success. For instance, when testing content sites, determining the right engagement metric is essential.

00:09:49.679 You may choose between page views or average session time, each measuring engagement in different ways, further emphasizing the importance of reaching a consensus beforehand.

00:09:59.760 Having these discussions at the outset is critical; otherwise, you could find yourself arguing after the fact about which results to favor instead of relying on the data.

00:10:07.620 I advise establishing one primary metric over which all parties agree before starting the A/B test, avoiding the scenario where results are cherry-picked based on preference or bias.

00:10:15.720 Once you establish your A/B test setup, it's essential to execute it with diligence and patience, because an A/B test measures the impact of all the differences between the two variants, not just the ones you intended.

00:10:25.979 We tend to focus on the intentional differences, but unintentional ones can skew results too. For example, we ran an A/B test on a mobile commerce app that was intended to test personalized product displays.

00:10:36.120 However, an implementation bug resulted in the variant with personalized recommendations crashing ten times more often than the control version.

00:10:45.420 When analyzing the test data at the end, we found that the original app had performed better, but we still didn't know whether the recommendations themselves were at fault or whether the result was merely skewed by the bug.

00:10:58.680 This case serves as an extreme example; however, you should remain cautious about unexpected differences in your test variants. Commonly, performance differences can create issues if one variant takes longer to load.

00:11:10.440 Be careful with significance testing, as analyzing results repeatedly during the test is an anti-pattern. It's tempting to check the results daily and stop the test as soon as they appear significant.

00:11:21.480 This practice invalidates your significance testing because you are constantly 're-rolling the dice'. Each look is another chance for a fluke, so a 5% significance threshold can quickly turn into a much higher chance of a false positive.
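
A small simulation can show the 're-rolling the dice' effect. The Ruby sketch below runs A/A tests, where both groups share the same true conversion rate, and compares checking once at the end with checking every day. All constants (rate, traffic, duration, number of runs) and method names are illustrative assumptions.

```ruby
# A/A simulation: both groups share the same true conversion rate, so every
# "significant" result is a false positive. All numbers are illustrative.
RATE = 0.10
USERS_PER_DAY = 200 # per group
DAYS = 14
RUNS = 500
Z_CRIT = 1.96 # two-sided 5% threshold

# Pooled two-proportion z-test with equal group sizes n.
def significant?(conv_a, conv_b, n)
  pooled = (conv_a + conv_b).to_f / (2 * n)
  return false if pooled.zero? || pooled == 1.0

  se = Math.sqrt(pooled * (1 - pooled) * 2.0 / n)
  ((conv_a - conv_b).to_f / n / se).abs > Z_CRIT
end

# Returns [false-positive rate when peeking daily, rate when testing once].
def simulate(rng)
  peeking_hits = 0
  final_hits = 0
  RUNS.times do
    conv_a = 0
    conv_b = 0
    flagged = false
    last = false
    DAYS.times do |day|
      USERS_PER_DAY.times do
        conv_a += 1 if rng.rand < RATE
        conv_b += 1 if rng.rand < RATE
      end
      last = significant?(conv_a, conv_b, (day + 1) * USERS_PER_DAY)
      flagged ||= last # a peeker would have stopped on the first "win"
    end
    peeking_hits += 1 if flagged
    final_hits += 1 if last
  end
  [peeking_hits.to_f / RUNS, final_hits.to_f / RUNS]
end

peeking, single_look = simulate(Random.new(2021))
puts format("false positives, one look at the end: %.1f%%", single_look * 100)
puts format("false positives, peeking every day:   %.1f%%", peeking * 100)
```

The single look should land near the nominal 5%, while daily peeking should come out several times higher, because stopping on any one of fourteen looks counts as a "win".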

00:11:33.960 Thus, to maintain integrity, you should avoid the temptation to peek at your test results before you reach your required sample size. Notably, Bayesian A/B testing approaches permit this, as they deal with distributions rather than fixed thresholds.

00:11:43.680 Lastly, once the test concludes, you must analyze it thoroughly. Use the statistical test that suits the metric you defined; the tests are not interchangeable across metric types.

00:11:55.200 Just because one metric yields a significant difference does not mean that others will follow suit. It is crucial to evaluate metrics independently.

00:12:02.640 These evaluations can often unveil outliers. For example, in an attempt to improve donation sizes for a charity, a significant outlier like a $1 million donation can heavily skew average values.

00:12:09.600 When you observe such outliers, consider whether they merit removal or if you should use alternative metrics that minimize their impact.
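
The donation example is easy to reproduce. In the Ruby sketch below, the small donation amounts are made up for illustration; only the $1 million outlier comes from the talk. The median is used here as one example of a metric that is less sensitive to outliers.

```ruby
def mean(values)
  values.sum.to_f / values.size
end

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

donations = [10, 20, 25, 30, 50, 15, 40, 20, 35] # hypothetical amounts in dollars
with_outlier = donations + [1_000_000]

puts format("mean:   %.1f -> %.1f", mean(donations), mean(with_outlier))
puts format("median: %.1f -> %.1f", median(donations), median(with_outlier))
# prints:
# mean:   27.2 -> 100024.5
# median: 25.0 -> 27.5
```

A single donor moves the mean by several orders of magnitude, while the median barely changes, which is why the choice of metric matters as much as the choice of statistical test.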

00:12:16.320 It's important to understand your domain at a fundamental level, as surprises can emerge from behaviors specific to your user base. For instance, return rates in e-commerce depend on timing.

00:12:24.180 If your A/B test runs and yields strange results, you might be dissatisfied. Noticing fluctuations in return rates reflects users having preferences that can vary significantly.

00:12:32.520 Finally, ensure there’s ample time allocated to adapt to changes. Users may take time before genuinely embracing new features; adoption can lag behind improvement.

00:12:50.280 Just as you should avoid abrupt shifts in interfaces, resist the temptation to overly dissect test results after they have concluded. Screens with poor performance may not benefit just by examining splits post-facto.

00:12:56.640 Remember that even if you focus on just mobile users and demonstrate a positive result, you may not have isolated the true factors that influenced the change.

00:13:05.880 You must also note the temptation to split results if the initial outcomes aren't promising. Think critically about who your users are and to what extent their behavior diverges.

00:13:13.020 While assessing splits within metrics can yield insights, doing so without prior definition can lead to a compromised foundation.

00:13:22.440 Lastly, ensure you maintain neutrality in evaluating results to avoid the natural emotional bias towards invested choices.

00:13:29.520 When you document your A/B tests, keep the documentation simple so that everyone involved knows the plan and can follow it.

00:13:43.440 Keep your plans documented: the metric of success, the success criteria, the variations, and the sample size calculation, so that the plan is clear to all participants.

00:13:55.380 Moreover, it’s important to choose clear names for your test variants to prevent confusion in discussions.

00:14:05.520 Thank you for listening. If you want to explore further readings on A/B testing, especially Bayesian testing, I have some recommendations for you.

00:14:12.600 Thank you very much for your time!