Learning Statistics Will Save Your Life

by John Paul Ashenfelter

In this video titled 'Learning Statistics Will Save Your Life', John Paul Ashenfelter discusses the importance of statistics for programmers and developers, particularly in the context of improving decision-making and validating assumptions through data.

Ultimately, the video serves as a practical guide for developers to enhance their understanding of statistics, not only as a theoretical subject but as a vital tool for informed decision-making in their applications.

00:00:22.880 I'm going to talk about statistics, and I understand that's probably not the most exciting topic. Really quick, who has a master's degree or higher in statistics in this room? Thank god I don't see any hands.

00:00:28.710 Good, so I'm really excited about that! If you've been in the Ruby community for a while, you might know this guy, Zed Shaw, who wrote Mongrel. For you young folks, Mongrel was what you used before there were Unicorns and Rainbows to run your server.

00:00:44.010 He wrote a blog post that affected me a while ago, and it was literally called 'Programmers Need to Learn Statistics, or I Will Kill Them All.' It's worth reading because he's right about a lot of things. Developers throw statistics around like they know what they're talking about, but many people use them incorrectly to justify their conclusions.

00:01:08.520 My goal today is to ensure that you don't die because you don't know statistics. I want to make sure you survive, whether you refer to it as a zombie apocalypse or like surviving World War II. My aim is to help you understand a little bit about statistics.

00:01:32.820 The things I hope to teach you will include basic descriptive statistics. I also want to try to use every meme from the top pages of meme generators, which I think I've come pretty close to achieving—statistically significant, but I’m not going to provide any data to back that up.

00:01:44.220 We'll start with how to compare things quickly. For instance, I want to talk about comparing performance tests. I spent a significant portion of my last job as a developer assigned to a marketing team, which is important to note. One does not simply run an A/B test; it's much more complex than that.

00:02:11.579 If you guys and gals are considering applying to work for Mr. Wonka at Wonka Corp, there are ridiculous questions you may encounter, and there are ways to answer those questions using statistics and Fermi estimations.

00:02:31.020 I mention Fermi estimates because often, people are unsure if their conclusions are statistically valid or just lucky guesses. If there are numbers, people tend to think that the mere presence of those numbers indicates that the person must know what they are talking about. However, I'm going to show you that there are ways to better understand those numbers.

00:02:54.030 A bit about me: I used to live in Virginia and was a chemist where I studied science and statistical mechanics. I wouldn’t recommend it unless you really enjoy Fortran or possibly Mathematica and MATLAB. I got to work on some impressive computers, including the best VAX terminal I could find and the Cray that used to be at the National Center for Supercomputing Applications.

00:03:18.959 In fact, the iPhone you have in your pocket is likely faster than that machine now, but we begged for time on that. Then, I went to an educational institution where I discovered that educators don’t know much about statistics, but they are quite adept at pretending they do using tools like tape recorders, Scantron sheets, SAS, and SPSS.

00:03:41.940 Honestly, this was quite a learning experience for me. Nowadays, I work with Ruby at Big Cartel. There's a long path that got me here. Last year, I was at Rocky Mountain Ruby, teaching people about using data. I had a slide with a bunch of gnomes sitting on a big pile of data, figuring out how to make a profit.

00:04:00.000 At that time, I mentioned that aggregates weren't all that compelling because I was discussing more advanced data science techniques I truly love. I have to admit I was wrong because I found out that people need to do a better job with statistics, and this is what I’m here to help with. There are three kinds of lies: lies, damn lies, and statistics. This saying is attributed to Mark Twain, but he didn’t really coin it—its origins trace back to 1895.

00:04:30.330 When he made this statement in London, it essentially highlighted how misleading statistics can be. Statistics can tell you all sorts of ridiculous things, much like XKCD comics illustrate.

00:04:52.800 Now, descriptive statistics are where we should begin because you have a sea of users. I used to work for an educational company that rhymes with 'C Mouse.' We had 169 users with 3,320 badges between them, which means there is an average of about 20 badges per person. We need to describe the users because we often assume data follows the normal curve (a bell curve, a Gaussian curve, and so on).

00:05:22.619 However, when I graphed our data, it was wildly different than expected. A hundred and thirty-two of the people had one badge, and I sat there, puzzled, trying to put a project together with this data and wondering what went wrong. This was a census, the entire population of people meeting a certain set of criteria.

00:05:53.039 One key point about statistics is that alongside a central value—what we were trying to achieve with the average—there's always some level of dispersion. Every statistical measurement represents a distribution, not just a singular value. Reporting a single value can be problematic.

00:06:23.369 While we usually work with central measures like the mean or average, there are also the median, which is the exact middle of a data set, and the mode, which is the most common value. For my data, using these basic central value calculations gave me an average of around 20 badges. However, the median was 2 and the mode was 1.
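
As a rough illustration of how far apart the three central measures can sit on skewed data, here is a minimal Ruby sketch. The `badges` array is invented for this example (it is not the speaker's data), and the helper names `mean`, `median`, and `mode` are assumptions chosen for readability.

```ruby
# Hypothetical, made-up badge counts: most users have one badge, a few have many.
badges = [1, 1, 1, 1, 1, 2, 3, 5, 40, 145]

def mean(values)
  values.sum.to_f / values.size
end

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  # Even-sized lists average the two middle values.
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

def mode(values)
  # Most frequent value; Enumerable#tally needs Ruby 2.7 or later.
  values.tally.max_by { |_value, count| count }.first
end

puts "mean:   #{mean(badges)}"
puts "median: #{median(badges)}"
puts "mode:   #{mode(badges)}"
```

With data shaped like this, a handful of heavy users pull the mean far above the median and the mode, which is the same pattern described in the talk.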

00:06:55.329 At 'C Mouse,' you earn one badge just for signing up, that's the newbie badge. Thus, looking at this data with a bit of prior knowledge changes our interpretation significantly. This means the average isn't an accurate representation of the data at all.

00:07:24.490 To gain a better understanding of the dispersion in the data, we can explore things like the range between the smallest and largest values and metrics such as variance and standard deviation. That’s the math we’ll dive into, and I want to point out quickly that many people overlook standard deviations, variances, or ranges.

00:07:44.360 People often present an average without providing any additional context. This practice is dangerous; averages without further details can mislead, and that average of about 20 badges didn't convey the complete picture.

00:08:08.949 So, let's quickly explore standard deviation as they teach in AP statistics. I have kids in high school, and I am amazed that their coursework reflects many of the concepts I encountered in graduate school.

00:08:29.450 Calculating standard deviation in Ruby is pretty straightforward. Standard deviation is calculated by summing up the squared differences from the mean, finding the variance, and then taking the square root. This isn't complicated, and in fact, I could utilize the descriptive statistics gem to jump right through it.
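
The calculation described above can be sketched in plain Ruby without any gem. This is a minimal sketch, assuming the data is a full census (so the variance divides by n rather than n - 1); the method names and the sample array are made up for illustration.

```ruby
# Population variance: the mean of the squared differences from the mean.
# Dividing by n treats the data as the whole population, not a sample.
def variance(values)
  mean = values.sum.to_f / values.size
  values.sum { |v| (v - mean)**2 } / values.size
end

# Standard deviation is the square root of the variance,
# which brings the result back to the original units.
def standard_deviation(values)
  Math.sqrt(variance(values))
end

badges = [1, 1, 1, 1, 1, 2, 3, 5, 40, 145] # hypothetical data
puts "range:    #{badges.min}..#{badges.max}"
puts "variance: #{variance(badges).round(1)}"
puts "std dev:  #{standard_deviation(badges).round(1)}"
```

If the data were a sample drawn from a larger population, dividing by `values.size - 1` would be the usual choice instead.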

00:09:06.320 When I took that data set, I found that the number of badges earned ranged from 1 to 227. The variance was 1,330 badges squared, which may sound strange, since variance comes out in squared units. Taking the square root brings us back to the original units, and we arrive at a standard deviation of around 36 badges.

00:09:42.720 This means our initial average of 20 badges looks very different when coupled with a standard deviation of plus or minus about 36 badges. This is a much better narrative than stating the average alone, as it leads to different decisions being made.

00:10:05.360 I concluded that the real issue was that I ran a quick SQL query and discovered that we were working with a subset of users with internal email addresses while including various junk test accounts in our database that were all tied to that newbie badge.

00:10:27.910 Even though this trivial example on its own seems irrelevant, it shows how a little extra analysis can improve results. You should not place total faith in statistics without carefully considering what they reveal or conceal.

00:10:45.680 Next, I want to talk about inferring from statistics because this is where the fun part begins. By now, most of you have done benchmarks with the Benchmark or the benchmark-ips gem. In my tests involving strings and symbols in hashes, I noted that one was faster and one was slower, which confirms different performance.

00:11:07.320 I also decided to explore performance differences between single and double quotes. Initially, this was an interesting simplification; however, as I repeated my tests, I then asked if the differences were statistically significant.

00:11:31.730 What I needed to be careful of was type one errors, which are false positives: declaring a difference when the two things are actually the same. The chance of making this error is the significance level, typically set at five percent.

00:11:53.740 On the opposite side are type two errors, which are false negatives: concluding there is no difference when one actually exists. With a typical power of 80 percent, there is still a 20 percent chance of missing a real difference. Both kinds of error are possible in any test, which is why the significance level and power matter.

00:12:12.800 To tackle these statistical errors, I want to highlight a brilliant piece of work from the Guinness brewery in the early 1900s. They sought an efficient way to discern differences between beer samples, resulting in the creation of the t-test, a tool for comparing samples to determine if they come from the same population.
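
To make the idea of comparing two samples concrete, here is a minimal Ruby sketch of a t statistic. It uses Welch's variant (which does not assume equal variances) rather than the original Student's form, and it stops at the statistic and degrees of freedom without computing a p-value. The timings and the name `welch_t` are invented for illustration.

```ruby
def mean(values)
  values.sum.to_f / values.size
end

# Sample variance (divide by n - 1), since benchmark runs are a sample.
def sample_variance(values)
  m = mean(values)
  values.sum { |v| (v - m)**2 } / (values.size - 1)
end

# Welch's t statistic and approximate degrees of freedom for two samples.
def welch_t(a, b)
  se_a = sample_variance(a) / a.size # squared standard error of mean(a)
  se_b = sample_variance(b) / b.size # squared standard error of mean(b)
  t = (mean(a) - mean(b)) / Math.sqrt(se_a + se_b)
  df = (se_a + se_b)**2 /
       (se_a**2 / (a.size - 1) + se_b**2 / (b.size - 1))
  [t, df]
end

# Hypothetical timings in seconds, invented for this example.
symbol_runs = [1.02, 0.98, 1.01, 0.99, 1.00]
string_runs = [1.52, 1.47, 1.50, 1.49, 1.53]

t, df = welch_t(symbol_runs, string_runs)
puts "t = #{t.round(2)}, df = #{df.round(1)}"
```

The statistic would then be compared against a critical value from a t table (or converted to a p-value with a statistics library) for the computed degrees of freedom; the further `t` is from zero, the less plausible it is that both samples share the same mean.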

00:12:45.160 Returning to the string vs. symbol lookup example: if we plot this using a graphing tool, we can see that they're remarkably different, with one being significantly faster than the other, and the p-value is so small that a difference this large would be very unlikely to arise by chance alone.

00:13:10.300 However, regarding single versus double quotes, the difference is a mere fraction of a percentage, which raises questions about whether that's statistically significant. When I calculated the likelihood mathematically, surprisingly, those two results were statistically considered different.

00:13:38.080 I ran the test multiple times and received different results frequently. While I can apply various stats to analyze larger sample populations, the key takeaway is to ensure that rigorous statistical methods are employed before making claims about performance.

00:14:07.650 We've established the need for a robust approach, where testing inevitably leads to failure, and failure in turn leads to a better understanding. Now, how many people have conducted A/B tests or have companies that perform A/B tests? Everyone is often excited about A/B testing because you can increase conversions relatively quickly.

00:14:56.240 Typically, traffic is diverted to the homepage. Perhaps you see a three percent conversion rate on a 'Buy Now' button and think a free trial could boost that. You want to bring about substantial change, resulting in a sensitivity increase.

00:15:25.280 Going from three to four percent is a 33 percent relative increase. However, here's the challenge: you probably aren't prepared for the math involved in making sure your statistical test has enough power to detect it.

00:15:56.380 There is an approximate way to calculate this in your head using simple formulas. The result is that with a baseline of three percent, if you want to detect a one percentage point shift, you'd need around 4,782 people.

00:16:20.320 This is a considerable amount when you think about how long it takes to generate that traffic. As a result, if your page does not have sufficient traffic, you can choose any button color or any test text you want and it won't matter, because statistics cannot help you.

00:16:54.340 Using real numbers in R, which is an excellent language for statistics, you can run a proper power calculation. When I plugged in a one percent conversion shift, a significance level of 0.05, and a target power of 0.8, it calculated that I would need about 5,300 participants.
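
For readers who would rather stay in Ruby, the same kind of power calculation can be approximated by hand. This is a minimal sketch using the standard normal-approximation formula for comparing two proportions, not the exact routine R runs; the z-values are hard-coded constants for a two-sided significance level of 0.05 and a power of 0.8, and the method name is an assumption.

```ruby
# Approximate per-group sample size for comparing two conversion rates
# with a two-sided test (normal approximation).
Z_ALPHA = 1.959964 # standard normal quantile for significance 0.05, two-sided
Z_POWER = 0.841621 # standard normal quantile for power 0.80

def sample_size_per_group(p1, p2)
  p_bar = (p1 + p2) / 2.0
  delta = (p2 - p1).abs
  numerator = Z_ALPHA * Math.sqrt(2 * p_bar * (1 - p_bar)) +
              Z_POWER * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
  (numerator**2 / delta**2).ceil
end

# Baseline of 3 percent, hoping to detect a shift to 4 percent.
puts sample_size_per_group(0.03, 0.04)
```

For a three to four percent shift this lands at roughly 5,300 visitors per variant, close to the figure quoted above. Smaller shifts drive the number up quickly, because the required sample grows with the inverse square of the difference.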

00:17:42.370 Ultimately, that's significantly more traffic than many pages generate in a timely manner. It’s unfortunate that often companies begin running A/B tests without sufficient traffic, and the designer tries to apply science without realizing it doesn't precisely compute in this case.

00:18:11.040 But the longer you run a test without gathering enough data, the more can go wrong, because someone may want your results today even when they are statistically invalid. Don't run a 50/50 split test without giving it sufficient time.

00:18:36.830 Be conservative and resist the urge to check the results frequently, as it alters the natural flow of the test. Many online tools let you keep testing until you hit significance, which is fundamentally flawed. Statistics will not provide valid results if they are manipulated in this manner.

00:19:27.770 Yet, what if I told you there exists a statistical method wherein your beliefs and the results can influence the outcomes? Enter Bayesian statistics, a concept most of you haven't encountered extensively, as it tends not to be taught widely.

00:19:50.550 Bayesian statistics fundamentally suggest that it's not only acceptable but prudent to let the prior data shape your beliefs about your hypothesis. If you ascertain that your hypothesis could be incorrect, confronting that early will speed up your journey to finding the accurate answer.

00:20:14.360 For example, imagine we conduct a click-through rate test where we try to determine if there exists a difference in performance. Rather than running 10,000 participants and taking three days to finalize conclusions, we let Bayesian methods work with the data immediately.

00:20:48.590 This data dynamically finds a narrow variance around a potential answer, demonstrating whether or not a significant difference exists.

00:21:03.310 If we conduct traditional A/B testing, we must wait until the end for final judgment. What Bayesian statistics propose is that if data suggests a strong probability, you should consider that data valid.

00:21:30.100 Traditional A/B testing has you rejecting or accepting a null hypothesis based on conclusions drawn from statistical significance. Even with less data, Bayesian methods present practical possibilities—like indicating that there is an 85% probability of the outcome, providing further insights into actions.
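
A statement such as "there is an 85% probability that B is better" can be computed directly from conversion counts. The sketch below is one simple way to do it, not the speaker's method: it assumes a uniform Beta(1, 1) prior on each conversion rate and approximates the posteriors on a grid instead of sampling. All names and the example counts are hypothetical.

```ruby
# Unnormalised log-density of a Beta(a, b) distribution at x.
def log_beta_density(x, a, b)
  (a - 1) * Math.log(x) + (b - 1) * Math.log(1 - x)
end

# Posterior over the conversion rate, discretised onto a grid in (0, 1).
# Assumes a uniform Beta(1, 1) prior, so the posterior is
# Beta(conversions + 1, misses + 1).
def posterior_grid(conversions, visitors, steps: 2000)
  a = conversions + 1
  b = visitors - conversions + 1
  logs = (1...steps).map { |i| log_beta_density(i.to_f / steps, a, b) }
  max = logs.max
  weights = logs.map { |l| Math.exp(l - max) } # subtract max to avoid underflow
  total = weights.sum
  weights.map { |w| w / total }
end

# Probability that variant B's true rate exceeds variant A's.
def prob_b_beats_a(a_conv, a_visitors, b_conv, b_visitors, steps: 2000)
  pa = posterior_grid(a_conv, a_visitors, steps: steps)
  pb = posterior_grid(b_conv, b_visitors, steps: steps)
  cdf_a = 0.0
  prob = 0.0
  pa.each_with_index do |wa, i|
    # P(A below this grid point), counting half the mass for ties.
    prob += pb[i] * (cdf_a + wa / 2.0)
    cdf_a += wa
  end
  prob
end

# Hypothetical counts: 30 of 1,000 convert on A, 45 of 1,000 on B.
puts prob_b_beats_a(30, 1000, 45, 1000).round(3)
```

Because the posterior can be recomputed whenever new visitors arrive, the probability can be read off at any point in the test, which is the property the talk highlights.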

00:21:55.600 The beauty of Bayesian statistics is that you consistently observe results. Not only does it empower you to react swiftly in your environment, but it also allows for flexible decision-making during operations.

00:22:35.760 The application of ideas like regret has notable implications, especially in life-threatening conditions such as medical research where finding a cure for diseases must include early determinations without waiting for extensive confirmation processes.

00:23:11.250 Bayesian statistics is useful for making effective decisions throughout various fields of study, establishing a framework where data-driven insights direct operations. Though complex at first glance, traversing the parameters of Bayesian analysis offers far-ranging options for improving processes.

00:23:45.630 For instance, when pondering Fermi estimation, we analyze the population of certain districts when determining the number of services available while also ensuring we evaluate the operational scale of those services.

00:24:25.120 The Fermi estimation technique, popularized by Enrico Fermi during his physics teachings, involves proposing estimates for seemingly impossible problems. It helps discover the parameter limits by leveraging rough but streamlined calculations.

00:25:02.280 Estimating how many piano tuners exist in Chicago exemplifies how to utilize similar reasoning. As a rule of thumb, based on the population size—approximately 3 million—if we assume a family unit contains four members, we’d presume about 750,000 families.

00:25:32.200 Assuming 20% of those households own a piano gives us an estimated 150,000 pianos. If each piano is tuned about once a year and a tuner services about 4 pianos a day over a standard work year of roughly 250 days (about 1,000 pianos a year), we arrive at a figure of 150 piano tuners in the city. The inputs are rough guesses, but the result gives a usable order of magnitude.
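
The chain of guesses in the piano tuner estimate can be written out as a small Ruby sketch. The population, household size, ownership share, and pianos per day come from the talk; the once-a-year tuning frequency and the 250-day work year are assumptions added here to make the arithmetic explicit, and the method name is made up.

```ruby
# Fermi estimate of piano tuners in a city. Every input is a rough guess;
# the goal is the order of magnitude, not a precise count.
def fermi_piano_tuners(population: 3_000_000,
                       people_per_household: 4,
                       piano_ownership: 0.20,
                       tunings_per_piano_per_year: 1, # assumption
                       pianos_per_tuner_per_day: 4,
                       working_days_per_year: 250)    # assumption
  households = population / people_per_household
  pianos = households * piano_ownership
  tunings_needed = pianos * tunings_per_piano_per_year
  tunings_per_tuner = pianos_per_tuner_per_day * working_days_per_year
  tunings_needed / tunings_per_tuner
end

puts fermi_piano_tuners.round
```

Keeping each guess as a named parameter makes it easy to see how sensitive the answer is: doubling the tuning frequency, for example, doubles the estimate.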

00:26:06.310 By comparison, checking various sources yields figures between 225 and 290 piano tuners, so the rough estimate lands within a factor of two of the real number. The same kind of reasoning can be used to sanity-check business strategies.

00:26:34.870 Again, utilizing Fermi estimation can help direct attention to viable business models. Having an estimated 16 million impressions to gain 1% revenue conversion allows one to identify suitable goals for individual services, backed by explicit calculations.

00:27:07.530 This calculation elucidates recurrent customer retention as it can indicate a sense of orientation within e-commerce. This means adjusting business strategies proactively to reflect market metrics aligns decision-making processes, facilitating awareness of viable business outcomes.

00:27:45.700 Therefore, it becomes evident how statistical reasoning and methods—both Bayesian and classical— impart an understanding of business direction. It also highlights the necessity for metrics as essential elements for decision-making, allowing management to create actionable objectives.

00:28:27.020 As we wrap up, it would please me to know you learned some valuable statistics at the MountainWest Ruby Conference. We explored the basics of descriptive statistics, understanding that employing standard deviation is critical in making decisions driven by correct information.

00:29:12.460 We also saw that two samples can be compared using the t-test, the method that came out of the Guinness brewery. In conclusion, I urge you to consider what Bayesian statistics can offer you in your decision-making procedures.

00:29:41.080 I want to extend my gratitude to my patrons and mention that I work at Big Cartel, a local company that believes in supporting artists and is currently hiring. My name is John Paul, and that’s my easy-to-remember Twitter handle. I'm open to questions if anyone has any, and finally, I have a few recommended books covering statistics and machine learning across a range of programming languages.

00:30:25.940 While a lot of examples you'll encounter might utilize Python as their primary language, I encourage you to explore various options like R for statistical calculations or Octave, which has its pros and cons. Thank you once more!

MountainWest RubyConf 2015