Machine Learning Explained to Humans (Part 1)

Ruby

Paolo "Nusco" Perrotta

@nusco

#machine-learning

#data-science

#artificial-intelligence-ai

#programming-concepts

#python

#ruby

by Paolo "Nusco" Perrotta

In the video titled "Machine Learning Explained to Humans (Part 1)" presented at RubyConf AU 2018 by Paolo Perrotta, the speaker aims to demystify the concept of machine learning for developers. He acknowledges the common frustrations faced by those new to the field and seeks to simplify complex topics using relatable examples and intuitive explanations.

Key points discussed include:

- Understanding Machine Learning: Perrotta shares his initial confusion and how traditional resources often cater to researchers rather than developers seeking practical insights. He emphasizes that this talk is intended for those wanting to grasp the foundational concepts of machine learning.

- Example of a Pizzeria: To illustrate the principles, he uses a small restaurant that wants to predict pizza sales based on reservation data. He explains that the relationship between these variables needs to be understood to forecast sales accurately.

- Linear Regression: The core method introduced is linear regression, which involves mapping input variables (reservations) to output variables (pizza sales) using a simple equation. He describes how to collect data, visualize it, and determine a line of best fit to make predictions.

- Parameters in Linear Regression: Perrotta discusses the parameters involved—weight (W) and bias (B)—and formalizes the equation of the line, showing how these parameters help infer the predicted output (pizza sales) based on input data (reservations).

- Loss Function: The concept of loss is introduced as a measure of error in predictions. By constructing a mean squared error function, the goal is to minimize this loss, allowing for better predictions.

- Gradient Descent: This fundamental algorithm is explained as a method to iteratively adjust W and B using calculated gradients to reach the optimal parameters, minimizing prediction errors effectively.

Overall, the video provides a basic yet practical overview of machine learning, focusing on making the concepts relatable to software developers. Perrotta concludes by reinforcing that while the mathematics may seem intimidating, the underlying principles are accessible and essential for further exploration in artificial intelligence.

This informative session highlights the importance of understanding the basics of machine learning in a developer-friendly manner, paving the way for deeper insights into artificial intelligence.

00:00:01.220 For the second session today, we have Paolo Perrotta. Is that a good pronunciation? Perfect! Everyone got that perfect pronunciation? He's a well-known author in the Ruby community. You might have read some of his books; I'm sure he'll tell you about them. He currently resides in Bologna.

00:00:07.290 Bologna is the seventh largest city by population in Italy. Did you know that? It is also where Ferrari, Maserati, and Lamborghini are all based. And I presume you have one of each?

00:00:14.099 Well, I don’t really use them much these days.

00:00:21.410 Machine learning, correct? Good! Do you have John Connor on speed dial, just in case anything goes wrong during the talk?

00:00:36.090 Thank you, thank you. That was a very flattering introduction, which is not surprising since I wrote it myself.

00:00:41.760 So, talking about machine learning, which is kind of a frustrating topic for me for a few reasons.

00:00:47.270 First, because I could not understand how it could possibly work. It’s crazy! I mean, I’ve been doing software for a very long time; you would imagine I’d know this stuff. But how can a computer caption pictures? That sounds like magic.

00:01:04.620 Then, I got around to studying it and found that it was even more frustrating. There isn’t a lack of documentation online; in fact, there is a boatload of it, but it’s just not targeted at me. It's for different people, primarily researchers and mathematicians.

00:01:12.450 So, if you want to learn something, everyone says, "Sure! Read this paper; it has everything you need." But I don't want to read the paper; I don’t want to sift through all those formulas. Can’t there be a blog post? So, I hope to save you this frustration, at least in part. This is 'Machine Learning Explained to Humans.'

00:01:26.370 And by humans, of course, I mean developers. Disclaimer: this is going to be basic. If you already know anything about machine learning, then this talk is probably not for you. On the other hand, if you wonder how the magic can happen, this is a talk that I hope will demystify machine learning.

00:01:41.310 I cannot teach you machine learning in two minutes, but maybe I can help remove some of the magic from it. There is going to be some mathematics (there's no way around that), but I will try to make it intuitive.

00:01:54.239 Okay, so let’s start with a simple example. Let’s say we have a small restaurant, a pizzeria. Every day at noon, we get a certain number of reservations; let's call that number X.

00:02:05.639 And every night, we sell a certain number of pizzas, which I would call Y. What we want to do is build a software program that can learn the relationship between reservations and pizzas, and then use that program to forecast how many pizzas we are going to sell tonight based on the reservations.

00:02:19.859 For example, this is so we can prepare the right amount of dough. This is the idea.

00:02:27.090 So, this program has to learn something. Let’s start by collecting examples by looking at what happens. Maybe we list all these examples in a text file with two columns: reservations and pizzas—X and Y. The first column is our input data, and the second column is what we are trying to forecast.

00:02:41.310 Machine learning people have a weird name for the variable that we’re trying to forecast: they call it the label. It’s like someone labeled each piece of input data with the matching number of pizzas.

00:02:54.120 Okay, weird name. The first thing we need to do is import this data so that we can use it in our code. I will use Python for that.

00:03:01.440 Sorry, I’m a certified Ruby fanboy, but this is machine learning and everybody's using Python, so I don’t want to go against the flow here. I tried to use Ruby, but then I was missing the libraries.

00:03:09.650 So it’s okay to be multilingual. In this case, we have this nice library called NumPy, a numerical library that is just fine for this.

00:03:15.270 You don’t necessarily need to understand in detail what loadtxt does; it loads this file into two arrays, X and Y, and that’s exactly what we need.
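As a rough illustration of that loading step, here is a minimal sketch. The file contents, the header row, and the variable name pizza_file are assumptions made up for this example (an in-memory stand-in replaces the real file); only np.loadtxt and the X/Y naming come from the talk.

```python
import io
import numpy as np

# Hypothetical stand-in for the two-column text file described in the talk.
# The header row and the values are made up for illustration.
pizza_file = io.StringIO(
    "Reservations Pizzas\n"
    "13 30\n"
    "2 15\n"
    "14 32\n"
    "23 41\n"
)

# skiprows=1 skips the header line; unpack=True returns one array per column.
X, Y = np.loadtxt(pizza_file, skiprows=1, unpack=True)

print(X)  # reservations (the input)
print(Y)  # pizzas (the labels)
```

With a real file on disk, the first argument would be the file's path instead of the in-memory object.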

00:03:26.370 Now, the assumption we have is that there is some relation between X and Y; otherwise, this is all for nothing. And this relation is not one-to-one; some people reserve and some don’t end up eating pizzas.

00:03:37.890 So, we need to understand that relation and it’s hard to understand it just by looking at numbers. We should plot it and then we can actually see what it looks like.

00:03:47.010 So we plot reservations against pizzas, and that’s what we get. A few dozen examples show that indeed, more reservations generally mean more pizza sold. How can we move from here to a program that can forecast the future? There is a way to do this, and it's a method statisticians have been using for quite some time.

00:04:05.639 It works in two phases. Phase one is to trace the line that follows those points, and phase two is to forget about the points entirely and just use the line.

00:04:15.120 Essentially, what we did is approximate the points with the line. We simplified our data, and now that we have this line, we can use it to make forecasts.

00:04:24.900 For example, let’s say we have 25 reservations today; how many pizzas do we expect to sell tonight? We can trace all the way up to the line and then all the way left until we cross the pizzas axis. That point of intersection will give us our answer.

00:04:37.890 In this case, the answer is 42. Of course, that number is a coincidence.

00:04:42.410 Maybe you’ve heard about this. It’s called linear regression. That’s the name—linear because we’re using a line as opposed to some complicated curve. Regression is the statistician's way of saying 'infer the value of a variable from another value,' like pizzas from reservations.

00:04:54.269 So let me repeat how linear regression works: first approximate the data with the line, and then use that line to infer the value of data that you don't have labels for, like reservations without pizza.

00:05:01.440 Now we need to put this into a form that we can actually code: a slightly more formal form, not a graphical form like I have used so far.

00:05:14.110 Let’s start with the second step first, actually, because it’s easier. Then we can go back to the first step. But let’s assume we have a line. How can we infer the number of pizzas in a way that we can code?

00:05:26.649 So first, we need to formalize this line a little bit. And to do that, we have an equation that expresses the relation between X and Y by way of those two parameters.

00:05:35.510 You might have studied this formula in high school; it’s just the equation of a line. You might call the parameters A and B.

00:05:45.699 I’m going to move around a lot. If I fall, I would like the entire room to acknowledge this event, okay?

00:05:55.220 Let’s all say it together: This is the equation. The parameters are called W and B, which stand for weight and bias. More weird machine learning names.

00:06:04.680 Intuitively, the weight is the slope of the line. You can see that for the same X, if we have a bigger W, then we get a bigger Y. Alright?

00:06:11.940 So, it’s like the line is becoming more vertical, and B is basically the distance at which this line crosses the vertical axis, the y-intercept, if you wish.

00:06:20.200 So now that we have this equation, we can write code about it. For example, let’s say we have 25 reservations, so X is 25.

00:06:29.280 If I tell you that for this particular line we have W equals 1.2 and B equals 12, which looks about right, we can just replace this stuff in the formula.

00:06:39.480 The code that does this is just a one-liner. I can write a function called infer that takes our input data and the line (remember those two parameters identify the line uniquely). It then returns the result of the formula.
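A sketch of what such a one-liner might look like. The function name infer comes from the talk; the parameter names and their order are assumptions, and the slide's actual code may differ.

```python
def infer(X, w, b):
    # Equation of a line: predicted pizzas = weight * reservations + bias
    return X * w + b

# The example from the talk: 25 reservations, W = 1.2, B = 12.
y_hat = infer(25, 1.2, 12)
print(round(y_hat, 1))
```

Because it only uses multiplication and addition, the same function should also work unchanged when X is a NumPy array of reservations rather than a single number.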

00:06:47.610 And that’s it! Okay, one little detail: I called this Y in the beginning, but then I subtly changed it and started calling it Y-hat.

00:06:57.660 I put a little hat on top of the Y. The reason for that is that I already used Y to mean the label, and this is subtly different.

00:07:05.700 This is not the label; the label is the ground truth; it's the number of pizzas we actually sold. This is the inference of the pizzas we expect to sell, so I wanted to avoid any cause for confusion.

00:07:15.360 So we have this infer method, and this is the second step in linear regression. It’s done; that’s all it takes.

00:07:23.880 Now, the first step is trickier, so I will ask you all to hold on for 10 minutes—that's how long it takes. But it’s important; it’s crucial to the entire process.

00:07:39.909 How can we go from these points to a line to W and B? Let’s start by observing that whatever line we can trace, it’s going to be wrong.

00:07:48.579 I mean, unless the points are exactly aligned, which is not going to happen. Any line has an error.

00:07:56.180 Just to prove my point here, let’s look at our original data again. Let’s pick one of these points, like the point where X is 14 and Y is 32.

00:08:03.430 The line we traced would say, okay, this is 1.2 multiplied by 14 plus 12, which is 28.8, while we actually sold 32, so it’s wrong. There is an error. I will mark this error in orange: that little space there.
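The arithmetic for that single point can be checked in a few lines, using the line from earlier in the talk (W = 1.2, B = 12). The variable names are illustrative, and the sign convention (prediction minus label) is an assumption.

```python
w, b = 1.2, 12   # the line used earlier in the talk
x, y = 14, 32    # one example: 14 reservations, 32 pizzas actually sold

y_hat = w * x + b   # what the line predicts for this example
error = y_hat - y   # negative here: the line predicts too few pizzas

print(round(y_hat, 1), round(error, 1))
```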

00:08:14.430 So now, let’s write a function that considers all the errors together. Let’s call it the loss—yet another weird machine learning name.

00:08:21.210 Every time you hear 'loss,' your brain should say 'error.' That’s what it is. This function takes our examples along with the labels and the line.

00:08:29.640 What it does first is calculate all the errors. It uses the infer method to calculate Y hat and then calculates the difference from Y.

00:08:37.260 Notice that now the errors are arrays; all these things are arrays, and that’s the reason we’re using NumPy because it makes operations on arrays very seamless.

00:08:45.180 Now, these errors: some of these errors are positive, some are negative, which doesn’t make much sense at all.

00:08:51.960 I mean, we don’t want positive and negative errors to cancel each other out. So let’s make them all positive, which is traditionally done by squaring them.

00:09:01.200 The double star operator in Python is the power operator, and now we can just average them. This is the formula we can use for the loss. There are other ways to compute a loss, but this is a very traditional one; it’s called the mean squared error.

00:09:12.389 Because that’s what it is: it’s the mean of the squared errors. Why did I take you on this merry detour? Remember where we started from?
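The loss described above could be sketched as follows. The names infer and loss follow the talk; the tiny dataset is made up so that the result is easy to check by hand, and the real slide code may be laid out differently.

```python
import numpy as np

def infer(X, w, b):
    return X * w + b

def loss(X, Y, w, b):
    errors = infer(X, w, b) - Y    # one error per example (an array)
    squared_errors = errors ** 2   # squaring makes every error non-negative
    return np.average(squared_errors)

# Made-up examples that lie exactly on the line y = 2x + 1.
X = np.array([1.0, 2.0])
Y = np.array([3.0, 5.0])

print(loss(X, Y, 2, 1))  # the perfect line: no error at all
print(loss(X, Y, 0, 0))  # a bad line: (3**2 + 5**2) / 2
```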

00:09:21.429 We wanted to find the best approximation of the points. Instead, I built a function that gives us the error of the line. That’s because now we can reframe the problem.

00:09:30.110 We want to find the values of the parameters that give us the minimum value for this function. This function looks like it has four parameters, but if you think about it, X and Y are not really parameters; they are constants.

00:09:43.170 Once we have our examples, they are done forever, so the parameters are actually W and B; that’s the line. How do we find W and B that give us the minimum value for the loss?

00:09:57.600 Hold on another few minutes. This is going to take some math. First, let’s plot it because if I don’t see things on the chart, then I have trouble wrapping my head around it.

00:10:09.540 What does the loss look like as W and B change? I tried plotting it, and here it is: a beautiful little surface. I spent like half an hour finding the exact color scheme.

00:10:18.210 That’s the weight, that’s the bias, and that’s the loss. Where we want to be is here: the minimum! We want to find those values of weight and bias; that’s the line.
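One way to get a feel for that surface without a chart is to evaluate the loss on a grid of weights and biases and look for the lowest cell. This is only a brute-force sketch: the dataset is made up (except for the 14 and 32 pair mentioned in the talk), and the grid ranges and resolution are arbitrary assumptions.

```python
import numpy as np

# Made-up reservations/pizzas data for illustration.
X = np.array([13, 2, 14, 23, 8, 1, 18, 10, 26, 5], dtype=float)
Y = np.array([30, 15, 32, 41, 20, 14, 33, 25, 44, 17], dtype=float)

def loss(X, Y, w, b):
    return np.average((X * w + b - Y) ** 2)

weights = np.linspace(0.0, 3.0, 61)   # candidate weights, step 0.05
biases = np.linspace(0.0, 30.0, 61)   # candidate biases, step 0.5

# surface[i, j] is the loss of the line with weights[i] and biases[j].
surface = np.array([[loss(X, Y, w, b) for b in biases] for w in weights])

# Locate the lowest point on the grid.
i, j = np.unravel_index(np.argmin(surface), surface.shape)
print("lowest grid point: w =", weights[i], "b =", biases[j], "loss =", surface[i, j])
```

Scanning a grid like this gets expensive quickly as the number of parameters grows, which is one reason to want a smarter search.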

00:10:32.580 How do we do that? There is a beautifully simple algorithm that is arguably the most fundamental algorithm in machine learning. It works like this: first, start wherever; any point will do.

00:10:48.700 I’m starting at that white spot there and then use this mathematical tool called the gradient. For those who don’t know, the gradient is essentially the slope of this surface.

00:10:56.700 The steeper the slope, the bigger the gradient, and it’s pointing in the direction of the maximum slope.

00:11:06.600 Let’s be concrete: first, we can calculate the gradients for W and B separately. This is a little bit like slicing the surface.

00:11:14.760 So first, I’m fixing W and varying B, the bias, which means I’m essentially slicing the surface like this.

00:11:22.800 It doesn’t have to be a katana; I mean, a katana just makes it sound cooler! Then, I’m doing the same thing in the other direction.
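The slicing idea can be approximated numerically: hold one parameter fixed, nudge the other a tiny bit in both directions, and measure how much the loss changes. This finite-difference approach is a swapped-in illustration, not the method shown in the talk, and the function names, the step size eps, and the dataset are all assumptions.

```python
import numpy as np

# Made-up reservations/pizzas data for illustration.
X = np.array([13, 2, 14, 23, 8, 1, 18, 10, 26, 5], dtype=float)
Y = np.array([30, 15, 32, 41, 20, 14, 33, 25, 44, 17], dtype=float)

def loss(X, Y, w, b):
    return np.average((X * w + b - Y) ** 2)

def slope_along_w(X, Y, w, b, eps=1e-4):
    # Slice with b fixed: how steeply does the loss change as w varies?
    return (loss(X, Y, w + eps, b) - loss(X, Y, w - eps, b)) / (2 * eps)

def slope_along_b(X, Y, w, b, eps=1e-4):
    # Slice with w fixed: how steeply does the loss change as b varies?
    return (loss(X, Y, w, b + eps) - loss(X, Y, w, b - eps)) / (2 * eps)

# At w = 0, b = 0 both slopes are negative for this data:
# increasing either parameter makes the loss smaller.
print(slope_along_w(X, Y, 0.0, 0.0), slope_along_b(X, Y, 0.0, 0.0))
```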

00:11:31.800 Now that we have these, we want to go downhill! Conventionally, the gradient points uphill, so let’s reverse them.

00:11:40.400 Another thing we want to do is multiply them by some factor to make them smaller. Why? Because we don’t want our steps to be too big.

00:11:49.300 We don’t want to overshoot the minimum and maybe find ourselves even farther away than where we started from.

00:11:57.700 Now that we have these, we can just add these amounts to W and B, which means that we move. We keep doing this iteratively until we reach the precision that we want.

00:12:06.600 Like a ball rolling over this surface, we will roll down the gradient until we reach the minimum, or we get close enough.

00:12:15.200 This algorithm is called gradient descent. It doesn’t sound like much, but it’s fundamentally vital to all artificial intelligence that you see around that does amazing things.
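Putting the steps together, a training loop might look roughly like this. It is a sketch under several assumptions: the dataset is made up, the names gradient, train, iterations, and lr are hypothetical, the starting point and learning rate are arbitrary choices that happen to work for this data, and the gradient uses the standard calculus result for the mean squared error of a line rather than anything shown on the slides.

```python
import numpy as np

# Made-up reservations/pizzas data for illustration.
X = np.array([13, 2, 14, 23, 8, 1, 18, 10, 26, 5], dtype=float)
Y = np.array([30, 15, 32, 41, 20, 14, 33, 25, 44, 17], dtype=float)

def infer(X, w, b):
    return X * w + b

def loss(X, Y, w, b):
    return np.average((infer(X, w, b) - Y) ** 2)

def gradient(X, Y, w, b):
    # Slopes of the mean squared error along w and along b.
    errors = infer(X, w, b) - Y
    w_gradient = 2 * np.average(X * errors)
    b_gradient = 2 * np.average(errors)
    return w_gradient, b_gradient

def train(X, Y, iterations, lr):
    w = b = 0.0  # "start wherever"
    for _ in range(iterations):
        w_gradient, b_gradient = gradient(X, Y, w, b)
        # Step downhill (minus sign), scaled down by the learning rate lr.
        w -= w_gradient * lr
        b -= b_gradient * lr
    return w, b

w, b = train(X, Y, iterations=20000, lr=0.001)
print("w =", round(w, 3), "b =", round(b, 3), "loss =", round(loss(X, Y, w, b), 3))
print("pizzas for 25 reservations:", round(infer(25, w, b), 1))
```

With a learning rate that is too large for the data, the updates can grow instead of shrink, so lr usually needs some experimentation.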

00:12:26.800 It’s self-correcting, for example: we might actually overstep the minimum, but then the gradient adjusts, and as we get closer and closer to the minimum, we take smaller steps.

00:12:39.600 We get more precise because, as we get closer, the surface flattens and the gradient gets smaller until finally, at the minimum, the gradient is zero because the surface is flat.

00:12:53.200 So now, I don’t know whether you’re comfortable with this. If you're not, you can safely ignore the math.

00:13:02.500 What’s uncomfortable? I left my gear out in the sun here in Sydney, and now this presenter is sticky. I can’t tell you how gross this is! If you are feeling uncomfortable with this, think about me in front of 400 people holding a sticky presenter.

00:13:09.720 So this is just the loss. I’m putting it here in mathematical form for those of you who like to see the formal shape of it. This is W times X plus B, which is Y hat, minus Y, all squared.

00:13:20.170 We sum that over all the examples and divide by the number of examples, which I call N, so it’s the mean squared error.

00:13:28.910 I’m putting it here only because some of you who know calculus might actually be able to calculate the gradients in your mind.

00:13:38.650 You probably remember this stuff under the name 'partial derivatives.'
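For readers who want the formal shape written out, here is the loss as described in words above, followed by its two partial derivatives. The derivatives are the standard calculus result for this loss, not a transcription of the slides, and the index i and the lowercase n for the number of examples are notational assumptions.

```latex
L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( (w x_i + b) - y_i \bigr)^2

\frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} x_i \bigl( (w x_i + b) - y_i \bigr)

\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \bigl( (w x_i + b) - y_i \bigr)
```

Each derivative is one of the two "slices": the first varies w with b held fixed, the second varies b with w held fixed.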

RubyConf AU 2018