Machine Learning

Summarized using AI

Decade of Regression

Randall Thomas • March 06, 2014 • Earth

The video titled 'Decade of Regression' by Randall Thomas at the Ruby on Ales 2014 conference explores the evolution of statistics, specifically within the realm of data science. The speaker emphasizes the importance of statistical methods, particularly regression, as a universal tool for making sense of data.

Key points discussed include:
- Introduction to Statistics in Data Science: Thomas begins by humorously referencing his experience and influences in the field, highlighting the transition toward statistics within technology and programming.
- The Concept of Lines in Statistics: The 'line' is described as a fundamental component of statistics, akin to a universal tool or hammer that helps in visualizing relationships in data.
- Basic Questions of Statistics: The speaker outlines three main questions statistics seeks to answer: 'What happened?', 'What’s going to happen?', and the more challenging 'Why did it happen?'. He discusses the complexities surrounding causation and correlation.
- Case Study of Netflix Prize: Thomas presents the Netflix Prize as an example of the challenges of predicting outcomes based on data, referencing movies like 'Napoleon Dynamite' to illustrate how popularity does not always correlate with preferences.
- Simple Linear Regression: He explains the process of simple linear regression, emphasizing its simplicity and effectiveness for visualizing data relationships. He notes that intuitive understanding of linear relationships is crucial for applying statistical methods.
- Visual Representation in Data: The importance of visual aids in statistics is highlighted, with the assertion that visuals help convey complex information more effectively than equations alone, aiding comprehension for non-statisticians.
- Real-Life Implications: He shares a cautionary tale about Knight Capital, a company that lost roughly $400 million in about 45 minutes after a faulty deployment of its trading code, reinforcing the need to clearly understand the statistical methods you put into production.
- Takeaway: The talk concludes with an encouragement not to shy away from statistics, emphasizing that data should not be intimidating and that curiosity and engagement with numbers are vital for understanding the world better.

Overall, Thomas motivates the audience to embrace statistical tools and concepts as approachable means of interpreting data, rather than viewing them as complex hurdles.


TBA by Randall Thomas

Help us caption & translate this video!

http://amara.org/v/FG0o/

Ruby on Ales 2014

00:00:17.039 Okay, I would accept that too. Presumably, the line that was drawn— I assume that's the least squares line.
00:01:08.159 All right, everybody! Randall here. How many people here actually know who Draplin Design is? Anybody? Oh wow! Okay, so I'm basically going to steal something from him because this is the actual introduction to the talk.
00:01:38.400 I spent a lot of time talking about statistics, and I started doing it in 2008. Draplin basically told me that anytime you give a talk, instead of starting with something like, 'Hi, this is my name, I'm Randall,' you should start with a musical intro, which is what this is.
00:02:25.840 This is how we make statistics in data science—hardcore. These were actually all talks that I gave on topics related to statistics. It took me about five years to figure out what I was doing, and by 2013 we could remix statistics. And as you all know, then came 2014—right? That was when we started doing something special.
00:03:01.120 This is the rise of the data scientist. You all know it—kind of like how you were a Ruby programmer with RJS, but now you're a Node.js guy instead. That's right! This is the ultimate data science tool, and that's exactly what we're going to be exploring for about the next 30 minutes, or however long your attention and that beer last. All right, are you with me? You guys haven't been drinking enough!
00:03:21.440 So, let's get this thing started. Cool! This is the shameless plug portion. For those of you who don't know who I am, I'm the other black guy in the Ruby community, often confused with Bryan Liles. Do not laugh, especially anyone who was at RubyConf in Denver, where someone was giving a talk and people were confused about who it was.
00:04:10.000 Bryan and I are often mixed up—including that time at RubyConf Denver when Bryan was on stage and a guy came up to me and said, 'I love your talk.' I said, 'My talk is up next!' He insisted, 'No, no, I mean the one you just gave,' and I replied, 'That was him!' Anyway, I work at a company called Thunderbolt Labs, where we wrote a blog post titled 'Cornering the Rails Black Market.'
00:04:42.480 We think we build cool things; we mostly do this by talking about numbers and statistics. We're really focused on figuring out how to make numbers work for companies and businesses. If you guys actually have a statistical problem, it probably isn’t one, but you should hire us anyway because that’s what we do—we're consultants. I’d like to entitle this talk 'Decade of Regression.'
00:05:09.440 Does anybody know why? Aside from the line reference, you’re wearing an Engine Yard Slayer T-shirt, so you better know the reference for a decade of regression. Well played! It was because there was an album released by Slayer; their first double live album was called 'Decade of Aggression.'
00:05:33.800 I'm going to give you a little ode called 'In Praise of the Line.' When we think of lines, we often take them for granted. I’m pretty sure everyone here has probably gone through college algebra or geometry, right? We think of lines as things we stand in, as in queues. During my time in L.A., I realized that when I started discussing lines with our clients, they had something completely different in mind, unrelated to statistics.
00:06:09.440 In reality, the line is like the universal statistical hammer. Everything you do in statistics, in some way, shape, or form, comes back to a line drawn somewhere. Think about it; often, you hear people reference lines in political contexts, like the president saying, 'We have a line.' Even Gaddafi had a 'line of death.'
00:06:36.639 Math largely revolves around lines. So, the line becomes the AK-47 of statistics; eventually, if you want to try to figure out what’s going on with some data, you have to draw a line. For those of you who aren't statisticians, how many people here actually have a background in math or statistics? Got a couple of you? Okay, I apologize to those who actually know what we’re talking about because I’m about to oversimplify immensely.
00:07:22.680 That's not entirely true—it's somewhere in between, so feel free to ask questions. The two fundamental things we do with statistics are answering the question 'What happened?' and then 'What's going to happen?' The 'why' factor—why something happened—is particularly challenging with statistics. As it turns out, answering 'why' is much harder than simply stating 'what happened.'
00:07:53.440 You may have heard the phrase, 'correlation is not causation.' Oftentimes, we can describe very accurately what happened, but we might not explain why. A canonical example of this is the Netflix Prize. Does anybody remember this? Netflix was offering a ton of money for someone who could recommend a movie better than their existing algorithm. 'Napoleon Dynamite'—do you remember that?
00:08:28.240 So, there's some debate about that film. The whole point of 'Napoleon Dynamite' indicates that either it’s the most important movie in history, beating out 'Gravity,' or it means absolutely nothing regarding your movie selection preferences, and nobody can tell you why. How many people have seen 'Napoleon Dynamite?' All the hands go up! How many people liked it? Judging by the response, we might be a bit biased, as that looked like it was more than two-thirds.
00:09:01.440 In reality, there's a third question that we’re not going to address here, which is 'What is happening now?' This question is really difficult since it deals with time series and correlations. Something we often utilize in statistics is regression—regression on a variable, regression towards the mean. Statisticians like to throw around this terminology, so the question arises: What is regression?
00:09:39.840 Well, it’s pretty simple—it's just a big happy family of techniques for describing how a series of data points relate to each other. At its core, regression is nothing more than a way to describe how a set of data points connects. However, doing that can be complicated and intricate, so often when someone says, 'We just did a regression on blah blah blah,' what they’re really saying is, 'This is some hand-wavy guess.'
00:10:01.120 At the end of the day, we are trying to describe something about the data and draw some sort of inference. For instance, is anyone here a Netflix fan? 'House of Cards'—who here binge-watched all 13 episodes? I jokingly say I took up meth to do that. If you look at the statistics, about 28% of viewers binge-watched the whole season at once.
00:10:37.280 'House of Cards' was one of the first data-driven television programs. They spent over a hundred million dollars producing that show, which is typically the budget for a big-name star like Sandra Bullock. So the question is, why do you do that? Some people did see 'Gravity.' We could do a regression later about how much that movie wasn't great.
00:11:07.760 Trust me; it felt like '12 Years a Slave' an hour in, but that’s not the point. Mark is over there, and he’s here to stop me when I violate the conference's code of conduct. The whole point of this is that they looked at the data and concluded that a political thriller would resonate well with audiences. They created a TV show for a gap in the market, just like 'Orange is the New Black.'
00:11:20.760 Not all their shows have worked; frankly, the next season of 'Arrested Development' flopped. But these efforts involved a multitude of regressions on the data they analyzed, and eventually, they made predictions. They identified trends like 'What’s going on?' and 'What happened?' Around these insights, they bet a hundred million dollars and struck big.
00:11:47.440 So remember, knowing something about the data you have in your collection is vital. This could range from all sorts of metrics. It's about distilling vast data points down to something understandable. One of the challenges with data problems is that they rapidly become non-trivial; regressions can get complicated and often become a pain. This is where the line becomes vital.
00:12:09.520 The line simplifies complex analyses because it's computer-friendly. How many of you have ever tried implementing pseudo-code from a computer science paper? People fail at it regularly. With mathematics, frankly, if we can't find a library that accomplishes what we need, we won't do it at all. That sounds lazy, but it's because getting the math right is hard.
00:13:13.440 Having good-quality libraries that have been around since the '70s is invaluable. If you've ever installed a math library like R or compiled one from source, you've probably asked yourself, 'Why am I installing Fortran?' These libraries, such as LAPACK and BLAS, were written back in the '70s, and they're so intricate that nobody has wanted to touch them since. There's even a linear regression library in Brainfuck, and there's actually a Brainfuck interpreter in R.
00:13:59.600 So, theoretically, you could execute linear regression anywhere—even in COBOL or Pascal. But effectively, what you're doing is focusing on the mechanism of drawing the straight line amidst all these complexities. The power of visual representation is essential; a picture can be worth a thousand words. It’s crucial when dealing with data. There’s a saying, 'You don’t get paid for the equation; you get paid for its appeal.'
00:14:52.200 When you’re dealing with data, clients often hire us because they don’t understand statistics. If we simply present an equation filled with symbols, expecting to get paid, they won’t comprehend it. Instead, we need to create visuals. Everybody knows what a line looks like—we learned it in first grade! You can effortlessly convey it to your boss or the non-technical person who may or may not share your passion for data.
00:15:51.680 Also, humans can read the output. Has anyone dealt with machine learning in an applied sense? Has anyone actually understood the meaning behind half of the numbers that come from those libraries? Think about the probe we launched to Mars that ended up costing millions due to miscommunication over units! When you overcomplicate the language, you lose people. So let's bring this home. Does anyone know who these folks are? They're the most expensive TDD company in existence—Knight Capital.
00:17:10.720 They put production code into a test environment and lost 400 million dollars in about 45 minutes. This could have just as easily been if they had deployed an algorithm and didn’t fully comprehend how it operated. Often, the more advanced the statistical method, the more esoteric it becomes. Lines, however, are easy for people to interpret. But to be fair, they only lost 400 million in 45 minutes—some days that happens.
00:17:49.360 You have an intuitive grasp of what lines are, making it a perfect starting point for understanding more advanced statistical methods. While we generally have a grasp of how statistics works, we often lose that intuition when confronted with complex statistical theorems filled with symbols. As engineers, much of our work relies on intuition. We must find systems or datasets that help confirm or deny our intuition.
00:18:43.920 So, I am going to walk you through simple linear regression. This is where you can either grab a beer or whatever; you will be tested on this later and no beer tomorrow if you don’t know the answers. Lies, damned lies, and statistics, right? This is pretty easy: Step one, formulate a line. Step two—do this. Step three, by using calculus or geometry, it can be shown that the values of alpha and beta minimize the objective function.
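For reference—the slide formulas aren't captured in the transcript—a conventional reconstruction of those three steps looks like this (standard simple linear regression, not his actual slides):

```latex
\text{Step 1 (formulate a line):}\quad y_i = \alpha + \beta x_i + \varepsilon_i
\text{Step 2 (the objective):}\quad Q(\alpha, \beta) = \sum_{i=1}^{n} \left(y_i - \alpha - \beta x_i\right)^2
\text{Step 3 (the minimizers, via calculus or geometry):}\quad
  \hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
  \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}
```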
00:19:01.920 Except for that, we reduce it to this—voilà! We are done! You guys are now officially data scientists! You really can get a job as a data scientist now because that’s about what most of them know. If anything, that’s the short, pithy summary of how most of my statistics and mathematics classes went. Does that resonate with anyone?
00:19:30.720 I often had this experience: I’d stare blankly, confused, 'I don’t understand what you just said.' I was told to work it out during recitation or consult the TA, who often struggled with English. So, let’s approach this from a more intuitive perspective. We know we have some data, and we’ll label that data as 'x.' Likewise, we have some other data labeled 'y.'
00:20:17.440 So far, are you all with me? Here's the first tough test—what's the value? Yes, it depends on whether you're a frequentist or a Bayesian—thank you! A tough crowd indeed. Now, we intuitively fill in the gaps; your brain will plug anything in. If we graph this by saying 'x is 1, y is 1; x is 2, y is 2,' then even without a data point at 1.5, we assume a linear relationship is present.
00:20:51.520 What if we started drawing lines? Imagine saying, 'At 1.5, we can predict an outcome.' Even though there isn't a data point there, our minds trend toward assuming a linear relation. We rationalize that there's a pattern, one that looks promising. Graphing all our data points takes us down this intuitive path—our eye naturally tries to fit a line through as many of those points as possible.
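That connect-the-dots guess can be made precise. As a quick illustration (our own example, not from the talk), base R's approx() performs exactly this linear interpolation:

```r
# The connect-the-dots intuition made explicit: given the points (1, 1)
# and (2, 2), linear interpolation predicts y at the unseen x = 1.5.
x <- c(1, 2)
y <- c(1, 2)
approx(x, y, xout = 1.5)$y   # returns 1.5, the value our brains fill in
```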
00:21:31.200 We might mentally play this connect-the-dots game. This illustrates how naturally humans perceive patterns, while teaching a machine to do it remains a challenge. In essence, we draw a line through the points, and that line becomes a model of the data. Remember those questionable equations? They don't need vague labels like 'y' and other big fancy symbols. Instead, let's just call 'y' the 'data.'
00:22:03.520 We'll mark the equation—the classic mx + b formula—as our 'model' and throw everything else into one category known as 'error.' This is how mathematicians work; they invent naming systems. Let's look at a bit of R code: it implements a linear model and introduces a small amount of normal noise—that Gaussian distribution we observe everywhere, the same bell curve that got most of us through college.
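The slide code itself isn't captured in the transcript, but a minimal sketch of such a snippet—variable names are ours—might look like this:

```r
# Simulate data from a known "true" linear model, y = mx + b, plus
# normally distributed noise: data = model + error.
set.seed(42)                                       # reproducible randomness
x         <- seq(1, 10, by = 0.1)                  # predictor values
intercept <- 1                                     # the "true" b
slope     <- 2                                     # the "true" m
noise     <- rnorm(length(x), mean = 0, sd = 1.5)  # Gaussian error term
y         <- intercept + slope * x + noise         # the observed data
```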
00:22:42.720 Now let's follow along with the data. When we plot everything, we see the true model—a straight line—with the points scattered around it, each deviating a little from the line. The point is simple: a mathematician would articulate this process formally, in precise geometric terms. I can just casually say, 'Let's draw a line and randomly scatter points around it.'
00:23:32.000 Now set the true model aside and pretend we never saw it. Fitting a regression gives us a blue line that approximates those points, and the goal matches our intuition about where the points sit: we want to minimize the overall distance between the points and our line. Illustratively, it's like navigating obstacles on the way to the bar—you want as few deviations as possible.
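Continuing the sketch above (again our reconstruction, not the talk's actual slides), fitting and drawing the 'blue line' against the 'red' true model is a few lines of R:

```r
# Fit the "blue line" by ordinary least squares and compare it to the
# "true" red line from the simulation above.
fit <- lm(y ~ x)                         # least-squares fit
plot(x, y, pch = 19, col = "grey60")     # the scattered points
abline(intercept, slope, col = "red")    # the true model
abline(fit, col = "blue")                # the fitted approximation
coef(fit)                                # estimated intercept and slope
```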
00:24:21.680 That encapsulates the process: essentially, we've created an estimate of where the true line is supposed to be, and the blue line summarizing those points comes remarkably close to the actual red line. If I had prefaced all of this with 'we're just going to draw a line that matches these points,' you might feel relieved at its simplicity. Yet sometimes notation clouds an intuition that is intrinsically familiar.
00:25:08.080 On a broader scale, the equations we learned can feel daunting. We've been trained to appreciate their abstraction, but it comes back to basics: to draw a line fit to the data, all we really need is a point and a slope. That should encourage us to visualize our data and every fitted guess as we calculate, instead of hiding behind symbols.
00:25:41.520 This becomes essential when we use the fitted equation to infer predictions on unseen data: we take the equation we formed and test it by drawing. Essentially, we plug in new inputs, read off where they intersect the line, and draw conclusions against the trends we've seen. It turns into an empirical exercise—expectations for new data based on the behavior we originally explored.
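In R terms (our own continuation of the earlier sketch), that's what predict() does with a fitted model:

```r
# Use the fitted line to predict at inputs we never observed,
# e.g. x = 1.5 from the connect-the-dots example earlier.
new_x <- data.frame(x = c(1.5, 11, 12))
predict(fit, newdata = new_x)                           # point predictions
predict(fit, newdata = new_x, interval = "prediction")  # with uncertainty bounds
```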
00:26:27.520 Those predictions often demonstrate a natural tension known as the bias-variance trade-off. Conceptually, every time we reach for a more flexible model, its underlying assumptions carry considerable risk. In regression terms, our approximations will deviate from the truth, and the underlying errors creep back into our estimates, broadly influencing how well the predictions perform.
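The formula isn't in the transcript, but the standard decomposition behind that trade-off is worth writing down:

```latex
% Expected squared prediction error at a point x, for data generated by
% y = f(x) + \varepsilon with E[\varepsilon] = 0 and Var(\varepsilon) = \sigma^2:
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^{2}}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\!\left(\hat{f}(x)\right)}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible error}}
```

Flexible models drive the bias term down but inflate the variance; a simple line does the opposite, which is part of why it is such a stable default.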
00:27:12.080 Still, the trustworthiness of the method comes down to whether the relationship we've drawn captures something real. There will always be variability working against us, and that friction is worth grappling with: when the model reflects the core relationship, we can construct functional measures of the outputs that match what the pictures are telling us.
00:27:55.360 Simple linear regression is sometimes stymied by false certitude, and the wider variety of models can become complex enough to be misrepresented—or simply illegible—to outside audiences. We should hold onto the consistency of regression while acknowledging the diversity of models, which ought to fit naturally into probabilistic distributions grounded in real-life patterns. They should inform how we approach predictive outputs rather than serve as blind black boxes; the thing we keep grappling with is linearity.
00:28:33.840 There aren't many straight-line regimes in real life, which means we have to choose which behaviors our models capture without oversimplifying the details. The question isn't whether everyone understands how the world operates based on what they see, but how many are willing to adapt when they discover contradictions in the data—or even push against the prevailing interpretation. So we hold ourselves to a higher standard.
00:29:15.680 Holding to that standard pays off when navigating challenges. Allowing for inconsistencies lets us develop better proofs, and learning by trial and error during investigations fosters success when it is tightly aligned with exploratory procedure. Projections can still produce unexpected anomalies: if we misapply probability, we get less than favorable outcomes, and the gap between predicted averages and reality can badly misrepresent our core expectations.
00:29:58.960 When pitfalls arise from simple linear regression, expect them to show up when you least anticipate it. The shapes in your data reflect the interplay of your variables, so before trusting a fitted line, take the time to understand the underlying structures that produced it.
00:30:42.720 Almost predictably, as regression models grow into more dimensions, we take on a real chance of inefficiencies that resurface through trial after trial; we cycle through modeling, encounter complexity, and need deeper dives to correct what looked like success. Rather than entrenching ourselves in theoretical precedent, we should ground the models in lived experience with the data and the relationships we've actually drawn.
00:31:37.760 At times, I feel the explanation itself gets lost amid the effort to close the gap toward mutual understanding; helping isn't merely asking for input, it's coaxing the minds in the room to comprehend the challenge as it unfolds. So if probabilistic predictions are made solely from theoretical relationships, they deserve careful scrutiny before anyone finds them persuasive—that scrutiny is what earns a model respect for its validity.
00:32:22.160 Iterative solutions, drawn carefully, converge on something that resembles reality. These methods have immense capacity to bridge misconceptions, but the relationships they surface are still merely correlational, and they primarily reflect the biases of the practitioners wielding them. In short—data doesn't have to be menacing, and statistics does not have to instill terror. The more people who engage with the numbers, the more profound the insights we draw.
00:33:06.560 The truth remains that there's a lot even data scientists overlook in their designs. What we learn should lead us back toward the foundations—probability and prediction—and toward an expanded prospect: minimizing estrangement, heightening exchange, and building open comprehension through collaborative exchanges with other practices, seeking connections and relentlessly exploring the potential in the data.
00:34:59.200 In recap: don't allow data to scare you off, and don't let numbers deter you. Trust your instincts and build on your curiosity. Measure anything you're unsure about—it's great practice, and regression works in a great many situations. We can understand it better; we just have to submit to the math. Instead of considering it a pain, let your observations guide you. Engaging with complex scenarios is how understanding grows—so please do not hesitate to put this into practice until we meet next.
00:36:41.040 Stay curious; exploration sparks insight, from the classroom to professional engagements. Explore, measure, and bring people in gradually—and send us recommendations that help others navigate the subject meaningfully. You can follow and communicate with us; feel free to email us about anything that isn't covered here or to request book reviews to shape the community's discussions around data complexity. We appreciate you attending!
00:37:18.720 Thank you and let’s strive to carry forth the goal of making statistics and data approachable for everyone.