Enough Statistics So That Zed Won't Yell At You

by Devlin Daley

The video titled "Enough Statistics So That Zed Won't Yell At You" features Devlin Daley at the MountainWest RubyConf 2008, where he aims to introduce viewers to basic statistical concepts necessary for software development and analysis.

Key Points Discussed:

  • Introduction to Statistics: Daley emphasizes the importance of statistics as a tool for creating models that represent data accurately, using well-known examples like planetary motion and human body temperature.
  • Understanding Distributions: He describes the normal distribution and its role in data analysis, noting that individual body temperatures vary around the 98.6°F average, a variability captured by the bell curve.
  • Sampling Techniques: Daley explains the significance of representative sampling to draw valid conclusions about a population, exemplified with personal anecdotes about sampling family height.
  • Statistical Testing: The presentation delves into statistical tests, particularly the z-test and t-test, explaining how they are used to determine the significance of observed differences, such as distinguishing between Diet Coke and Coca-Cola Zero.
  • Application in Web Performance Measurement: He discusses the practical aspects of statistical testing, especially in evaluating web application performance, and how to use tools like httperf for this purpose.
  • Common Pitfalls: Daley warns about the dangers of confounding variables and the necessity of controlling for these in any experimental design.
  • Importance of Sample Size: He underlines the role of sample size in testing, detailing how the Central Limit Theorem supports the assumption of normal distribution for larger samples.
  • Use of Statistical Software: The application of statistical software, especially R, is highlighted as a valuable resource for conducting statistical analysis.
  • Conclusions on Methodology: He concludes that understanding and correctly applying statistical principles is essential for accurate analysis, urging the audience to maintain clarity in their definitions and methodologies to avoid misinterpretation of data.

Takeaways:

  • Proper statistical understanding is critical in software performance analysis and data-driven decision-making.
  • Sampling methods and statistical tests provide frameworks for objectively evaluating hypotheses.
  • Misinterpretation can arise from inadequate sampling or unrecognized confounding variables, emphasizing the need for careful experimental design and continuous learning from community resources.
00:00:17.840 All right, so I'm presenting today on 'Enough Statistics So That Zed Won't Yell At You.' My name is Devlin Daley.
00:00:24.660 I actually contacted Zed about this talk, and I asked him if I would be able to present enough information to prevent him from getting upset. He said probably not, but we could try, so that's what we're going to do today.
00:00:42.480 Some caveats: I am not a statistician. I'm actually into digital identity. If you want to talk about OpenID, OAuth, SRP, or permissions later on, I’d be happy to discuss that. I have taken a few classes in statistics, and Pat sort of roped me into this presentation; it's all Pat's fault.
00:01:02.399 Today, my goal is to familiarize you with some vocabulary and basic concepts related to statistics.
00:01:09.479 Statistics emerged from the need for models. Models should be familiar to us because software itself is a model. There are, for example, mathematical models describing how planets orbit the sun in elliptical paths.
00:01:21.780 While these equations can describe planetary motion accurately, the data does not always fit the model perfectly; there is an error term left over, and that error is what statistics studies.
00:01:41.280 For example, consider the temperature of the human body. This isn't just one fixed number; it varies. We typically say it is about 98.6 degrees Fahrenheit, but in reality, it is distributed in a manner described by the normal distribution, commonly represented by the bell curve.
00:02:03.299 In this distribution, 98.6 degrees would represent the average. When measured, temperatures will always be slightly above or below this average due to random fluctuations. In a normal distribution, we can determine how much data falls close to the mean.
00:02:31.020 For instance, when considering standard deviations around the mean, we expect about 68 percent of the data to fall within one standard deviation, and about 95 percent to fall within two standard deviations. This means not every data point will fit neatly within these ranges, but on average, we see these distributions hold.
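To make the 68/95 rule concrete, here is a minimal Python sketch (the talk shows no code; the 0.7°F standard deviation is an assumed figure for illustration):

```python
import numpy as np

# Simulate body temperatures: mean 98.6 F, standard deviation 0.7 F.
# (The 0.7 figure is an illustrative assumption, not from the talk.)
rng = np.random.default_rng(42)
temps = rng.normal(loc=98.6, scale=0.7, size=100_000)

mean, sd = temps.mean(), temps.std()
within_1sd = np.mean(np.abs(temps - mean) <= 1 * sd)
within_2sd = np.mean(np.abs(temps - mean) <= 2 * sd)

print(f"within 1 sd: {within_1sd:.1%}")  # ~68%
print(f"within 2 sd: {within_2sd:.1%}")  # ~95%
```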
00:02:57.360 When measuring something, we often encounter distributions that aren't perfect normal distributions, but they are still distributions. Distributions are characterized by their center—usually the mean—and by their spread, which is indicated by the standard deviation.
00:03:12.360 To further illustrate the relationship between the mean and the spread, consider two samples represented by blue and orange distributions. Both have the same average, but the orange distribution has greater spread in its data values. If we are looking at web server performance, we might misleadingly conclude that performance is the same because the averages are identical, yet the larger spread indicates potential problems.
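A small sketch of that idea with invented response-time numbers: the two samples share a mean of 100 ms, and only the standard deviation reveals the difference.

```python
import numpy as np

# Two hypothetical sets of web-server response times (ms).
# Both average 100 ms, but the orange sample is far more spread out.
blue = np.array([98, 99, 100, 100, 101, 102])
orange = np.array([60, 80, 100, 100, 120, 140])

for name, data in [("blue", blue), ("orange", orange)]:
    print(f"{name}: mean={data.mean():.1f} ms, sd={data.std(ddof=1):.1f} ms")
# Identical means; only the standard deviation exposes the difference.
```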
00:03:39.060 Returning to the example concerning human body temperature, finding the true average requires measuring everyone in the world, which is practically impossible. Therefore, we must sample the population. A sample is a subset used to estimate the characteristics of the entire population. It's crucial that the sample is representative, or the data we gather can lead to false conclusions.
00:04:49.979 For example, if I were to sample only my family to estimate the average height of people, I'd misleadingly conclude that everyone is around six feet tall, since I'm taller than my family. To infer correctly about a population, we need a good representation of it, and that typically requires us to collect multiple samples.
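A quick sketch of that pitfall with made-up height data: the biased family sample lands far from the true population mean, while a random sample of the same size is noisy but unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of adult heights (inches): mean ~66, sd ~4.
population = rng.normal(loc=66, scale=4, size=1_000_000)

# Biased sample: one unusually tall family.
family = np.array([72, 71, 74, 70, 73])

# Random sample of the same size, drawn from the whole population.
random_sample = rng.choice(population, size=5)

print("family mean:         ", family.mean())         # ~72, misleading
print("random sample mean:  ", random_sample.mean())  # noisy but unbiased
print("true population mean:", population.mean())     # ~66
```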
00:05:21.900 One critical aspect of statistics is sampling—it’s essential. Single instances of tests or benchmarks yield singular data points; we need multiple samples to detect variance and establish a distribution. I work for Phil Windley, who is known for his love of Diet Coke.
00:06:02.039 He runs a conference where he ensures there’s plenty of Diet Coke available, much more than any other beverage. For example, if we wanted to find out if Phil could distinguish between Diet Coke and Coca-Cola Zero, we could design a simple experiment.
00:06:51.840 He could taste two unlabeled cups, but a single trial proves nothing: even if he cannot tell the difference, he has a 50/50 chance of guessing correctly. To get a valid result we need a larger number of randomized trials. If Phil then guesses correctly significantly more often than 50% of the time, we can conclude he really can tell the difference.
00:07:50.400 Statistical testing, specifically the z-test, compares the proportion of correct guesses to what would happen if he were just guessing. This is how we apply logic and structure in statistical tests. We start with a null hypothesis, for instance, that Phil cannot tell the difference. The null hypothesis assumes a 50% probability of guessing correctly if he cannot tell the difference.
00:09:01.680 With statistical tests, we either reject or fail to reject the null hypothesis. If the tests show enough evidence, then we can say he might be able to distinguish between the two drinks.
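A minimal sketch of the one-proportion z-test just described, with invented trial counts (32 correct out of 40 randomized tastings):

```python
import math

# Hypothetical taste-test results: Phil identifies the drink
# correctly in 32 of 40 randomized trials.
n, correct = 40, 32
p_hat = correct / n
p0 = 0.5  # null hypothesis: he is just guessing

# One-proportion z-test: how many standard errors is p_hat from p0?
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.2f}, p = {p_value:.5f}")
# A tiny p-value means pure guessing is an unlikely explanation,
# so we reject the null hypothesis.
```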
00:09:59.880 An example of an application of statistical testing is A/B testing in conversion rates. If I run a website and change its design, I need to test if the changes actually improve customer conversion. We would need a representative sample of visitors to compare their conversion rates.
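For the A/B case, the analogous tool is a pooled two-proportion z-test. A sketch with hypothetical visitor and conversion counts:

```python
import math

# Hypothetical A/B results: conversions / visitors for each design.
conv_a, n_a = 120, 2400  # old design: 5.0% conversion
conv_b, n_b = 156, 2400  # new design: 6.5% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
pooled = (conv_a + conv_b) / (n_a + n_b)

# Pooled two-proportion z-test under the null "both rates are equal".
se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```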
00:10:40.680 Issues might arise if the new site underperforms because of a technical error, such as being incompatible with a specific browser. We have to control for such confounding factors so that any difference in conversion is attributed to the design change itself and not to unrelated technical problems.
00:12:11.220 Let's switch to performance measurement for web applications. Using a tool like httperf, we can measure how many requests per second our application can handle. A single run gives only one data point, so we need to sample repeatedly over time and analyze the resulting measurements.
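A sketch of that sampling loop in Python: httperf's --server, --port, --uri, and --num-conns flags are standard options, but the parsing of its "Request rate" summary line is a rough assumption about the report format.

```python
import re
import subprocess

def measure_request_rate():
    """Run one httperf benchmark and return its reported request rate."""
    out = subprocess.run(
        ["httperf", "--server", "localhost", "--port", "3000",
         "--uri", "/", "--num-conns", "200"],
        capture_output=True, text=True, check=True,
    ).stdout
    # httperf prints a summary line like "Request rate: 145.2 req/s (...)";
    # this regex is a rough match for that format.
    match = re.search(r"Request rate:\s+([\d.]+)", out)
    return float(match.group(1))

# One run is a single data point; repeat to build up a distribution.
samples = [measure_request_rate() for _ in range(30)]
print("mean req/s:", sum(samples) / len(samples))
```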
00:13:12.540 When using a web server like Mongrel, we can start by measuring the performance per second. After making optimizations or changes to our application, we can establish if those changes improve performance using statistical methods like the t-test, which assesses whether observed changes in application performance are statistically significant.
00:14:35.640 The t-test compares two samples to determine whether the difference between their means is statistically significant or could plausibly be random variation. When running performance benchmarks on different hardware configurations, it lets us distinguish a true performance difference from mere noise.
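The talk points to R for running this test; an equivalent sketch in Python with scipy, using invented before/after requests-per-second samples:

```python
from scipy import stats

# Hypothetical requests-per-second samples, before and after a change.
before = [142.1, 138.5, 140.2, 139.8, 141.0, 137.9, 140.6, 139.2]
after = [146.3, 144.8, 147.1, 145.0, 143.9, 146.7, 145.5, 144.2]

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the performance change is
# unlikely to be random variation alone.
```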
00:15:38.640 This significance is expressed through a p-value: the probability of seeing a difference at least as large as the observed one if randomness alone were at work. Essentially, the t-test is an approximation of a permutation test, comparing the observed result against the full range of randomized rearrangements of the data.
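The permutation-test idea itself is easy to sketch: pool the two samples, shuffle the group labels many times, and count how often a shuffled difference is as extreme as the one observed (data invented, as above):

```python
import numpy as np

rng = np.random.default_rng(1)

before = np.array([142.1, 138.5, 140.2, 139.8, 141.0, 137.9])
after = np.array([146.3, 144.8, 147.1, 145.0, 143.9, 146.7])

observed = after.mean() - before.mean()
combined = np.concatenate([before, after])

# Permutation test: if the labels do not matter (the null hypothesis),
# shuffling them should often produce differences this large.
n_perms = 10_000
count = 0
for _ in range(n_perms):
    rng.shuffle(combined)
    diff = combined[len(before):].mean() - combined[:len(before)].mean()
    if abs(diff) >= abs(observed):
        count += 1

print(f"permutation p-value: {count / n_perms:.4f}")
```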
00:16:23.040 In practical scenarios, we can use statistical software, such as R, to perform these tests easily. For example, given samples from two groups, we can determine if they come from the same distribution using t-tests to inform our decisions.
00:17:31.500 When two distributions overlap heavily, the test may return a large p-value. That tells us we cannot rule out chance as the explanation for the observed difference; it does not prove the two distributions are genuinely the same.
00:18:32.050 As we've discussed, confounding variables can obscure the results of statistical testing. To manage these, one should control experimental conditions by only changing one variable at a time.
00:19:49.200 It's critical that measurements are not affected by external factors. For performance tests, using separate machines for testing and production can minimize this risk.
00:20:34.320 Testing on a virtualized environment can also introduce variance due to shared resources. When using services like Amazon EC2, you must ensure adequate sampling to account for potential inconsistencies.
00:21:59.960 In summary, we have several statistical tools available to us, such as the z-test and t-test. To avoid mistakes in measurement interpretations, peer reviews and cross-checks are essential.
00:23:36.240 Resources like R can assist in statistical analysis, while good laboratory practices enhance the reliability and reproducibility of our tests. Engaging with the community and learning from others can also promote better understanding and application of statistical methods.
00:25:00.480 Lastly, discussions around the nature of measurements, like what defines a user in performance assessments, remind us of the importance of maintaining clarity about what we measure.
00:26:23.940 Sampling is essential; random sampling increases the likelihood of representative samples over repeated experiments. Additionally, the central limit theorem tells us that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the shape of the underlying population.
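A quick demonstration of that claim: even for a strongly skewed population (exponential response times, numbers invented), the means of repeated samples cluster into a bell shape around the population mean.

```python
import numpy as np

rng = np.random.default_rng(7)

# A heavily skewed population: exponential "response times".
population = rng.exponential(scale=100.0, size=1_000_000)

# Means of many samples of size 50 drawn from that population.
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(5_000)
])

# The sample means cluster symmetrically around the population mean,
# with spread close to sigma / sqrt(n), as the CLT predicts.
print("population mean:     ", population.mean())
print("mean of sample means:", sample_means.mean())
print("sd of sample means:  ", sample_means.std())
print("sigma / sqrt(n):     ", population.std() / np.sqrt(50))
```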
00:28:25.260 Determining the appropriate size for a sample relies on understanding the population dynamics and the required precision of results. Ensuring your statistical efforts are appropriately structured enhances their credibility.
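One common way to structure that decision is the normal-approximation sample-size formula n = ((z_alpha/2 + z_beta) * sigma / delta)^2. This is standard practice rather than something shown in the talk, and the numbers below are placeholders:

```python
import math

def required_sample_size(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size to detect a difference of `delta` with
    roughly 80% power at a 5% two-sided significance level."""
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# E.g. request-rate sd of 5 req/s; we care about a 2 req/s change.
print(required_sample_size(sigma=5.0, delta=2.0))  # 49 runs per group
```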
00:29:24.540 In conclusion, understanding and correctly applying statistical principles is crucial in analysis, especially when working with complex systems. Causation cannot be assumed from correlation alone, and careful control of parameters is needed for effective analysis.
00:30:38.340 The nuances of statistical methods, like Bayesian approaches, add depth to our understanding of data; they help define the origins of our observations within established probabilistic frameworks.