00:00:04.860
Hi, welcome to my talk, 'How to A/B Test with Confidence'. In this session, we'll learn a lot about A/B tests, including what can go right and what can go wrong with them.
00:00:10.740
My name is Frederick Cheung, and I am a long-time Rubyist. I started with Rails back in version 1.0, and I am currently the CTO at Dressipi.
00:00:16.680
At Dressipi, we do a lot of things, mainly helping retailers personalize their content and make smarter data-driven decisions.
00:00:22.439
As part of our work, we run numerous A/B tests to compare our personalization methods with others, exploring different types and algorithms for improvements.
00:00:28.500
So, we run a lot of A/B tests. This is my assistant Luna, who is half pillow and half cat.
00:00:35.160
Briefly, the plan for the next half hour is to explore how to conduct A/B tests effectively and how to avoid common pitfalls.
00:00:42.000
We'll start with a brief recap of what A/B testing is and why we do it. Then, we'll discuss what can go wrong when setting up, running, and analyzing tests.
00:00:48.719
To finish on a positive note, I will share best practices.
00:00:54.719
So, starting with the basics, what is an A/B test? It begins when there’s a decision you'd like to make, guided by data.
00:01:05.939
For example, you might have an e-commerce application and wonder whether a button should say 'Buy Now' or 'Order.' Both options seem plausible, but instead of relying on gut feeling, you want actual data.
00:01:12.119
To test this, you would split your users randomly into two groups, with one seeing the 'Buy Now' button and the other seeing the 'Order' button.
00:01:18.000
After running this for a few days, you would count how many orders occurred in each group, resulting in quantities like 49 orders versus 56 orders.
00:01:24.119
The real question we need to answer is whether this difference is statistically significant—does changing the text genuinely improve our results, or was it just a fluke?
00:01:29.280
Before diving into the statistical analysis, it’s important to note that A/B tests are not just limited to button text. They can encompass more complex changes in layouts, designs, or overall user flows.
00:01:40.020
You can test how different buttons impact user actions; for instance, does one button lead to a quicker purchase process if it triggers an action on the page, while another button redirects to the checkout page? Additionally, you can explore variations in algorithms or recommendation engines whenever measurable differences exist.
00:02:08.580
However, before going too deep, we need to cover some jargon that will help clarify our discussion: significance, test power, and sample size.
00:02:16.920
Significance answers the key question: is the observed difference just a fluke? People often refer to p-values when discussing significance. A p-value of 0.05 means that, roughly speaking, there's a 5% chance of seeing a difference this large by pure luck.
00:02:27.819
People might also present this as a confidence level of 95%. Although these numbers are viewed differently, they represent the same idea from opposing perspectives.
00:02:33.540
When you run your test, you collect data and feed it into the appropriate statistical test, which varies depending on the metric type. For example, you'd analyze retention rates differently from signup conversions or average spend per customer.
00:02:50.280
Significance testing tells you whether there's a genuine difference between the two groups—but it does not indicate how substantial that difference might be. If one group achieves a 10% conversion rate and another 6%, a statistically significant difference does not mean the true difference is exactly 4 percentage points.
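To illustrate the point, here is a sketch of a 95% confidence interval for the gap between two conversion rates (normal approximation; the counts below are invented for illustration):

```ruby
# Significance says "there is probably a real difference"; it does not say
# the observed gap is the true gap. A rough 95% confidence interval for the
# difference between two conversion rates, using the normal approximation.
def diff_confidence_interval(conv_a, n_a, conv_b, n_b)
  pa = conv_a.to_f / n_a
  pb = conv_b.to_f / n_b
  se = Math.sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
  diff = pb - pa
  [diff - 1.96 * se, diff + 1.96 * se]
end

# 6% vs 10% conversion on 1,000 users each: the observed gap is 4 points,
# but the plausible range for the true gap is considerably wider.
low, high = diff_confidence_interval(60, 1_000, 100, 1_000)
puts format("true gap is plausibly between %.1f and %.1f points", low * 100, high * 100)
```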
00:02:57.120
The counterpart to significance is test power, which focuses on the likelihood of missing an existing change instead of detecting a false positive. This concept relates to how small a change you want to be able to detect—intuitively, the smaller the change, the more data you'll need.
00:03:09.959
It’s worth noting that test power also considers absolute rates, as opposed to just relative rates. For instance, moving from a 10% to a 20% conversion rate is easier to measure than from 0.1% to 0.2%, despite both cases indicating a doubling.
00:03:17.400
If you’re measuring something with a current conversion rate of just 0.1%, achieving a statistically significant result will be quite challenging. Combining significance and test power leads us to the concept of sample size.
00:03:29.880
Sample size indicates how many users you need to include in your A/B test to detect a change of a specific magnitude at a designated significance level and power. Online calculators can help you determine this number.
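Those calculators typically use something close to the standard normal-approximation formula. A minimal Ruby sketch, with z values assuming a 5% significance level and 80% power:

```ruby
# Rough per-group sample size using the standard normal-approximation
# formula. z_alpha = 1.96 corresponds to 5% significance (two-sided) and
# z_beta = 0.84 to 80% power; plug in your own values for other settings.
def sample_size_per_group(base_rate, target_rate, z_alpha: 1.96, z_beta: 0.84)
  variance = base_rate * (1 - base_rate) + target_rate * (1 - target_rate)
  (((z_alpha + z_beta)**2 * variance) / (base_rate - target_rate)**2).ceil
end

puts sample_size_per_group(0.10, 0.12)    # a 20% relative lift on a 10% base
puts sample_size_per_group(0.001, 0.002)  # doubling a 0.1% base needs far more users
```

Note how the second call needs far more users per group even though the relative change is larger, echoing the point about absolute versus relative rates.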
00:03:46.080
Once you have calculated your required sample size, check that the test is actually feasible. If you need one million users but only have a thousand, it's back to the drawing board.
00:03:55.920
When evaluating feasibility, consider cost implications as well. For instance, if you are testing a new homepage and gaining new visitors primarily through Google Ads, acquiring a significant number of users can be costly.
00:04:06.780
Once you confirm feasibility, aim to run the test without tinkering with it during the process. Try not to monitor the results until you reach the required sample size.
00:04:13.920
However, be cautious of conducting short experiments. For example, if your sample size calculation indicates you need 5,000 users and you have traffic of 5,000 users in a day, you become vulnerable to transient conditions.
00:04:25.740
If the day you conduct your test coincides with a national holiday or school breaks, the user behavior may not represent typical usage, thus resulting in skewed data.
00:04:39.000
Another approach to analyzing test outcomes is Bayesian A/B testing, which is a different statistical school of thought. While I can’t cover it in extensive detail, the main distinction lies in starting with a prior model of how you assume the world operates.
00:04:54.180
Incorporating your existing understanding and uncertainty, you update this prior with data from your experiments. Instead of deriving a single number from testing, the output becomes a probability distribution.
00:05:05.820
The output from a Bayesian A/B test typically features two curves representing the posterior probabilities for each variant. The intention is to analyze the entire range of potential outcomes and ascertain whether the green curve is likely better than the red one.
00:05:16.560
Bayesian testing allows you to incorporate existing knowledge and uncertainties, departing from a zero-knowledge perspective. This can lead to improved outcomes, especially when working with low base rates, although the underlying math is slightly more complex.
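For conversion-style metrics, the Bayesian machinery can be sketched in a few lines: with a uniform Beta(1,1) prior, the posterior is a Beta distribution, and for integer parameters a Beta draw equals an order statistic of uniform draws. A toy Monte Carlo sketch, not production code:

```ruby
# Bayesian sketch: with a uniform Beta(1,1) prior, the posterior for a
# conversion rate after the test is Beta(conversions + 1, misses + 1).
# For integer a and b, Beta(a, b) is distributed as the a-th smallest of
# a + b - 1 uniform draws, giving an exact (if slow) sampler in plain Ruby.
def sample_beta(a, b, rng)
  Array.new(a + b - 1) { rng.rand }.sort[a - 1]
end

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws: 2_000, seed: 1)
  rng = Random.new(seed)
  wins = draws.times.count do
    rate_a = sample_beta(conv_a + 1, n_a - conv_a + 1, rng)
    rate_b = sample_beta(conv_b + 1, n_b - conv_b + 1, rng)
    rate_b > rate_a
  end
  wins.to_f / draws
end

# 12/100 vs 18/100 conversions: instead of a single yes/no answer, we get
# the probability that variant B is genuinely better than variant A.
puts prob_b_beats_a(12, 100, 18, 100)
```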
00:05:34.380
However, most of the discussion in this talk will apply equally to both classical and Bayesian A/B testing.
00:05:44.280
We now have a good understanding of the basics of A/B testing, so let's turn to the pitfalls. Things can go wrong from the very beginning, particularly with group randomization. This is the foundational aspect of valid testing.
00:05:55.620
If randomization is flawed, the validity of your test becomes compromised.
00:06:03.660
About ten years ago, I wrote a few lines of code to randomly assign users to test groups based on whether their user ID was even or odd. It seemed reasonable, so if your ID modulo 2 was 0, you'd see the experiment; otherwise, you'd see the control.
00:06:10.380
This solution might have sufficed for a single A/B test; however, we ran multiple tests, leading to an uneven spread of users. All users with even IDs received the experiment, which meant they would notice frequent changes.
00:06:18.960
In the long term, we weren't merely testing isolated experiments but rather assessing how users reacted to constant changes.
00:06:27.780
This raises an important point: constant changes can fatigue users. It’s crucial to maintain a stable testing environment.
00:06:38.760
You can avoid such pitfalls by employing a library like 'Vanity,' which is a well-regarded tool in Rails for conducting A/B tests.
00:06:49.620
In this library, instead of merely relying on user IDs, you can combine the user ID with a unique experiment identifier, generating a hash to arbitrate group assignment. However, ensure that hashing maintains the necessary statistical properties.
00:06:58.560
As a general rule, avoid creating your own custom randomization if possible; it can lead to unexpected flaws in your setup.
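The hash-based idea looks roughly like this — a sketch of the approach only, not Vanity's actual implementation:

```ruby
require 'digest'

# Deterministic, per-experiment assignment: hash the user ID together with
# the experiment name. The same user always lands in the same group for a
# given experiment, but the split is independent across experiments, so
# users with even IDs don't see every experiment at once.
def variant_for(user_id, experiment_name)
  digest = Digest::SHA256.hexdigest("#{experiment_name}:#{user_id}")
  digest.to_i(16).even? ? :control : :experiment
end

variant_for(42, "buy-now-button")   # stable across calls for this user
variant_for(42, "checkout-layout")  # independent of the experiment above
```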
00:07:09.420
Moreover, even a single instance of non-random sampling can lead to significant issues. For example, we once participated in an A/B test run by a retailer where users were not assigned randomly.
00:07:21.600
Instead, newer users were placed in one group while older users were placed in another. This resulted in skewed outcomes since newer users typically displayed less loyalty.
00:07:31.920
Consequently, the test produced astonishing results; however, historical data showed that the two groups already behaved differently before the test even started.
00:07:42.240
This situation complicates result extraction, making it difficult to trust the test. For instance, if there was a 5% difference before the test and a 6% difference afterward, simply subtracting the values does not yield a reliable conclusion.
00:07:53.400
Another potential mistake is commencing your test prematurely. To illustrate, suppose you want to test a change on your checkout page with 100,000 users visiting your site in one month.
00:08:02.640
You would split your users into two groups of 50,000, each experiencing the same homepage. Some users would proceed to the next site page as expected and see the same products.
00:08:10.140
As a result, only a fraction of users would ever reach the checkout page where the variants actually differ, so most of your sample experiences no change at all.
00:08:17.339
Let’s say that out of the 50,000 users in each group, 2,600 convert in group A compared to 2,500 in group B—a meager 5.2% versus 5% conversion rate.
00:08:26.040
If we remember the earlier discussion about test power, that small change isn’t statistically significant due to the large number of users who experienced no change.
00:08:36.180
Here’s what to do differently: let all 100,000 users access your site without splitting them yet.
00:08:43.920
Allow them to view products and only allocate user groups once they proceed to add items to their cart and visit the checkout page. This way, the split occurs just when users experience differing content.
00:08:56.040
This measures the difference where it actually occurs. The same orders now translate into conversion rates of roughly 50% versus 52.7%, and the same absolute difference becomes statistically significant.
00:09:06.900
In summary, it’s advisable to allocate users to groups as late as possible to maximize the efficacy of the measured change.
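The effect of dilution can be checked with a quick two-proportion z-test sketch; the counts below approximate the scenario above (using 52% rather than the quoted 52.7%):

```ruby
# Two-proportion z-test sketch: |z| > 1.96 roughly corresponds to
# significance at the 5% level (two-sided, normal approximation).
def z_score(conv_a, n_a, conv_b, n_b)
  pa = conv_a.to_f / n_a
  pb = conv_b.to_f / n_b
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  (pb - pa) / se
end

# Split at the homepage: the 100-order gap drowns among users who never
# saw any difference.
puts z_score(2_500, 50_000, 2_600, 50_000).abs > 1.96  # not significant

# Split at the checkout: the same gap, measured only where it can occur.
puts z_score(2_500, 5_000, 2_600, 5_000).abs > 1.96    # significant
```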
00:09:13.500
The last category of error I want to address involves agreeing on your test setup beforehand.
00:09:24.300
Be sure to reach consensus on various aspects like the scope of the test, deciding which pages will be included, and whether to test all users or only logged-in users.
00:09:29.760
If your website operates differently in various regions, discuss whether to test specific locales or encompass all users.
00:09:37.380
Additionally, clarify the goal of the A/B test and how to best quantify success. For instance, when testing content sites, determining the right engagement metric is essential.
00:09:49.679
You may choose between page views or average session time, each measuring engagement in different ways, further emphasizing the importance of reaching a consensus beforehand.
00:09:59.760
Engaging in these discussions at the outset is critical; if not, you could find yourself determining which results to favor instead of relying solely on data.
00:10:07.620
I advise establishing one primary metric over which all parties agree before starting the A/B test, avoiding the scenario where results are cherry-picked based on preference or bias.
00:10:15.720
Once you establish your A/B test setup, it's essential to execute it with diligence and patience. Remember that A/B tests measure the impact of every difference between the two variants, not just the one you intended.
00:10:25.979
As individuals, we often focus on the intentional differences, but other unintentional changes can skew results. For example, an A/B test we conducted on a mobile commerce app intended to personalize product displays.
00:10:36.120
However, an implementation bug resulted in the variant with personalized recommendations crashing ten times more often than the control version.
00:10:45.420
When analyzing the data at the end, we found that the control app had performed better—but we couldn't tell whether the recommendations were genuinely worse or whether the result was merely skewed by the crashes.
00:10:58.680
This case serves as an extreme example; however, you should remain cautious about unexpected differences in your test variants. Commonly, performance differences can create issues if one variant takes longer to load.
00:11:10.440
Be careful with significance testing, as conducting frequent analyses is an anti-pattern. It’s tempting to evaluate results daily, only stopping the test if the results appear significant.
00:11:21.480
This practice invalidates your significance testing because you are constantly 're-rolling the dice'. You may have set a 5% threshold for significance, but the actual chance of a false positive climbs far higher with every peek.
00:11:33.960
Thus, to maintain integrity, resist the temptation to peek at your test results before you reach your target sample size. Notably, Bayesian approaches are more tolerant of peeking, since they reason about full distributions rather than a single pass/fail threshold.
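You can see the inflation from peeking by simulating an A/A test, where the two variants are identical, so any 'significant' result is by definition a false positive. A rough sketch with invented traffic numbers:

```ruby
# Simulate an A/A test (both variants identical) and stop the moment any
# daily peek crosses the nominal 5% significance threshold. Stopping on
# the first "significant" peek yields far more than 5% false positives.
def peeking_false_positive_rate(trials: 400, days: 20, daily_n: 200, p: 0.1, seed: 7)
  rng = Random.new(seed)
  hits = trials.times.count do
    a = b = n = 0
    days.times.any? do
      daily_n.times do
        a += 1 if rng.rand < p  # conversions in group A
        b += 1 if rng.rand < p  # conversions in group B
      end
      n += daily_n
      pooled = (a + b).to_f / (2 * n)
      se = Math.sqrt(pooled * (1 - pooled) * 2.0 / n)
      se > 0 && ((b - a).to_f / n / se).abs > 1.96  # "significant" at this peek?
    end
  end
  hits.to_f / trials
end

puts peeking_false_positive_rate  # well above the nominal 5%
```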
00:11:43.680
Lastly, once the test concludes, you must analyze it thoroughly. Use the statistical test appropriate to each metric type; tests are not interchangeable.
00:11:55.200
Just because one metric yields a significant difference does not mean that others will follow suit. It is crucial to evaluate metrics independently.
00:12:02.640
These evaluations can often unveil outliers. For example, if you're trying to increase donation sizes for a charity, a single outlier like a $1 million donation can heavily skew the average.
00:12:09.600
When you observe such outliers, consider whether they merit removal or if you should use alternative metrics that minimize their impact.
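A toy example of how a single outlier flips the mean while the median stays honest (the donation amounts are invented):

```ruby
# One outlier can decide which group "wins" on the mean while the typical
# user tells the opposite story; the median is far more robust.
def mean(values)
  values.sum.to_f / values.size
end

def median(values)
  values.sort[values.size / 2]  # middle element (odd-length arrays)
end

group_a = [10, 12, 15, 11, 14, 13, 12, 10, 1_000_000] # one huge donation
group_b = [20, 22, 25, 21, 24, 23, 22, 20, 21]

puts mean(group_a) > mean(group_b)     # true, but only because of the outlier
puts median(group_a) < median(group_b) # the typical donor actually gives less in A
```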
00:12:16.320
It's important to understand your domain at a fundamental level, because oddities can emerge from behavior unique to your user base. For instance, return rates in e-commerce depend heavily on timing.
00:12:24.180
If your A/B test yields strange results in a metric like this, behavior around returns may be the cause: user preferences vary significantly, and returns take time to show up in the data.
00:12:32.520
Finally, ensure there’s ample time allocated to adapt to changes. Users may take time before genuinely embracing new features; adoption can lag behind improvement.
00:12:50.280
Just as you should avoid tinkering mid-test, resist the temptation to over-dissect the results after the test has concluded by hunting through sub-segments for a split that looks good.
00:12:56.640
Remember that even if you focus on just mobile users and demonstrate a positive result, you may not have isolated the true factors that influenced the change.
00:13:05.880
Beware, too, of the temptation to split results when the initial outcome isn't promising. Think critically about who your users are and how much their behavior genuinely diverges.
00:13:13.020
While slicing metrics can yield insights, doing so without defining the splits in advance undermines the statistical foundation of the test.
00:13:22.440
Lastly, ensure you maintain neutrality when evaluating results, to avoid the natural emotional bias towards the option you're invested in.
00:13:29.520
Finally, document your A/B tests, and keep the documentation simple enough that everyone involved knows the plan and can follow through.
00:13:43.440
Keep your plans documented: write down the success metric, the success criteria, the variants, and the sample size calculation, so that everything is clear to all participants.
00:13:55.380
Moreover, it’s important to choose clear names for your test variants to prevent confusion in discussions.
00:14:05.520
Thank you for listening. If you want to explore further readings on A/B testing, especially Bayesian testing, I have some recommendations for you.
00:14:12.600
Thank you very much for your time!