00:00:01.220
For the second session today, we have Paolo Perrotta. Is that a good pronunciation? Perfect! Everyone got that perfect pronunciation? He's a well-known author in the Ruby community. You might have read some of his books; I'm sure he'll tell you about them. He currently resides in Bologna.
00:00:07.290
Bologna is the seventh largest city by population in Italy. Did you know that? It is also where Ferrari, Maserati, and Lamborghini are all based. And I presume you have one of each?
00:00:14.099
Well, I don’t really use them much these days.
00:00:21.410
Machine learning, correct? Good! Do you have John Connor on speed dial, just in case anything goes wrong during the talk?
00:00:36.090
Thank you, thank you. That was a very flattering introduction, which is not surprising since I wrote it myself.
00:00:41.760
So, talking about machine learning, which is kind of a frustrating topic for me for a few reasons.
00:00:47.270
First, because I could not understand how it could possibly work. It’s crazy! I mean, I’ve been doing software for a very long time; you would imagine I’d know this stuff. But how can a computer caption pictures? That sounds like magic.
00:01:04.620
Then, I got around to studying it and found that it was even more frustrating. There isn’t a lack of documentation online; in fact, there is a boatload of it, but it’s just not targeted at me. It's for different people, primarily researchers and mathematicians.
00:01:12.450
So, if you want to learn something, everyone says, "Sure! Read this paper; it has everything you need." But I don't want to read the paper; I don’t want to sift through all those formulas. Can’t there be a blog post? So, I hope to save you this frustration, at least in part. This is 'Machine Learning Explained to Humans.'
00:01:26.370
And by humans, of course, I mean developers. Disclaimer: this is going to be basic. If you already know anything about machine learning, then this talk is probably not for you. On the other hand, if you wonder how the magic can happen, this is a talk that I hope will demystify machine learning.
00:01:41.310
I cannot teach you machine learning in two minutes, but maybe I can help remove some of the magic from it. There is going to be some mathematics (there's no way around that), but I will try to make it intuitive.
00:01:54.239
Okay, so let’s start with a simple example. Let’s say we have a small restaurant, a pizzeria. Every day at noon, we get a certain number of reservations; let's call that number X.
00:02:05.639
And every night, we sell a certain number of pizzas, which I would call Y. What we want to do is build a software program that can learn the relationship between reservations and pizzas, and then use that program to forecast how many pizzas we are going to sell tonight based on the reservations.
00:02:19.859
For example, this is so we can prepare the right amount of dough. This is the idea.
00:02:27.090
So, this program has to learn something. Let’s start by collecting examples by looking at what happens. Maybe we list all these examples in a text file with two columns: reservations and pizzas—X and Y. The first column is our input data, and the second column is what we are trying to forecast.
00:02:41.310
Machine learning people have a weird name for the variable that we’re trying to forecast: they call it the label. It’s like someone labeled each piece of input data with the matching number of pizzas.
00:02:54.120
Okay, weird name. The first thing we need to do is import this data so that we can use it in our code. I will use Python for that.
00:03:01.440
Sorry, I’m a certified Ruby fanboy, but this is machine learning and everybody's using Python, so I don’t want to go against the flow here. I tried to use Ruby, but then I was missing the libraries.
00:03:09.650
So it’s okay to be multilingual. In this case, we have this nice library called NumPy, which is a numerical library that is just fine for this.
00:03:15.270
You don’t necessarily need to understand in detail what loadtxt does; it loads this file into two arrays, X and Y, and that’s exactly what we need.
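[Editor's sketch] The loading step described above might look roughly like this. The file contents are a hypothetical stand-in (only the point with 14 reservations and 32 pizzas is mentioned later in the talk), and an in-memory buffer replaces the real text file so the snippet runs on its own.

```python
import io

import numpy as np

# Hypothetical stand-in for the two-column text file from the talk.
# In practice you would pass a file name to np.loadtxt instead.
sample_file = io.StringIO(
    "Reservations  Pizzas\n"
    "13  33\n"
    "2   16\n"
    "14  32\n"
    "23  51\n"
)

# skiprows=1 skips the header line;
# unpack=True returns one array per column instead of one 2D array.
X, Y = np.loadtxt(sample_file, skiprows=1, unpack=True)

print(X)  # reservations (input data)
print(Y)  # pizzas (labels)
```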
00:03:26.370
Now, the assumption we have is that there is some relation between X and Y; otherwise, this is all for nothing. And this relation is not one-to-one; some people reserve and some don’t end up eating pizzas.
00:03:37.890
So, we need to understand that relation and it’s hard to understand it just by looking at numbers. We should plot it and then we can actually see what it looks like.
00:03:47.010
So we plot reservations against pizzas, and that’s what we get. A few dozen examples show that indeed, more reservations generally mean more pizza sold. How can we move from here to a program that can forecast the future? There is a way to do this, and it's a method statisticians have been using for quite some time.
00:04:05.639
It works in two phases. Phase one is to trace the line that follows those points, and phase two is to forget about the points entirely and just use the line.
00:04:15.120
Essentially, what we did is approximate the points with the line. We simplified our data, and now that we have this line, we can use it to make forecasts.
00:04:24.900
For example, let’s say we have 25 reservations today; how many pizzas do we expect to sell tonight? We can trace all the way up to the line and then all the way left until we cross the pizzas axis. That point of intersection will give us our answer.
00:04:37.890
In this case, the answer is 42. Of course, that number is a coincidence.
00:04:42.410
Maybe you’ve heard about this. It’s called linear regression. That’s the name—linear because we’re using a line as opposed to some complicated curve. Regression is the statistician's way of saying 'infer the value of a variable from another value,' like pizzas from reservations.
00:04:54.269
So let me repeat how linear regression works: first approximate the data with the line, and then use that line to infer the value of data that you don't have labels for, like reservations without pizza.
00:05:01.440
Now we need to put this into a form that we can actually code: a slightly more formal form, not a graphical one like I have used so far.
00:05:14.110
Let’s start with the second step first, actually, because it’s easier. Then we can go back to the first step. But let’s assume we have a line. How can we infer the number of pizzas in a way that we can code?
00:05:26.649
So first, we need to formalize this line a little bit. And to do that, we have an equation that expresses the relation between X and Y by way of those two parameters.
00:05:35.510
You might have studied this formula in high school; it’s just the equation of a line. You might call the parameters A and B.
00:05:45.699
I’m going to move around a lot. If I fall, I would like the entire room to acknowledge this event, okay?
00:05:55.220
Let’s all say it together: This is the equation. The parameters are called W and B, which stand for weight and bias. More weird machine learning names.
00:06:04.680
Intuitively, the weight is the slope of the line. You can see that for the same X, if we have a bigger W, then we get a bigger Y. Alright?
00:06:11.940
So, it’s like the line is becoming more vertical, and B is basically the distance at which this line crosses the vertical axis, the y-intercept, if you wish.
00:06:20.200
So now that we have this equation, we can write code about it. For example, let’s say we have 25 reservations, so X is 25.
00:06:29.280
If I tell you that for this particular line we have W equals 1.2 and B equals 12, which looks about right, we can just replace this stuff in the formula.
00:06:39.480
The code that does this is just a one-liner. I can write a function called infer that takes our input data and the line (remember, those two parameters identify the line uniquely). It then returns the result of the formula.
00:06:47.610
And that’s it! Okay, one little detail: I called this Y in the beginning, but then I subtly changed it and started calling it Y-hat.
00:06:57.660
I put a little hat on top of the Y. The reason for that is that I already used Y to mean the label, and this is subtly different.
00:07:05.700
This is not the label; the label is the ground truth; it's the number of pizzas we actually sold. This is the inference of the pizzas we expect to sell, so I wanted to avoid any cause for confusion.
00:07:15.360
So we have this infer method, and this is the second step in linear regression. It’s done; that’s all it takes.
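[Editor's sketch] A minimal version of the infer function described above. The argument names and order are assumptions; the talk only says that it takes the input data and the two line parameters.

```python
import numpy as np


def infer(X, w, b):
    """Return y_hat: the number of pizzas the line predicts for X reservations."""
    return X * w + b


# The example from the talk: 25 reservations, w = 1.2, b = 12.
y_hat = infer(25, 1.2, 12)
print(y_hat)  # about 42, up to floating-point rounding

# Because the body is plain arithmetic, it also works elementwise on a NumPy array.
y_hats = infer(np.array([14, 25]), 1.2, 12)
print(y_hats)
```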
00:07:23.880
Now, the first step is trickier, so I will ask you all to hold on for 10 minutes—that's how long it takes. But it’s important; it’s crucial to the entire process.
00:07:39.909
How can we go from these points to a line to W and B? Let’s start by observing that whatever line we can trace, it’s going to be wrong.
00:07:48.579
I mean, unless the points are exactly aligned, which is not going to happen. Any line has an error.
00:07:56.180
Just to prove my point here, let’s look at our original data again. Let’s pick one of these points, like the point where X is 14 and Y is 32.
00:08:03.430
The line we traced would say, okay, this is 1.2 multiplied by 14 plus 12, which is 28.8, not 32. So it’s wrong. There is an error. I will mark this error in orange: that little space there.
00:08:14.430
So now, let’s write a function that considers all the errors together. Let’s call it the loss—yet another weird machine learning name.
00:08:21.210
Every time you hear 'loss,' your brain should say 'error.' That’s what it is. This function takes our examples along with the labels and the line.
00:08:29.640
First, it calculates all the errors: it uses the infer method to calculate Y hat, and then calculates the difference from Y.
00:08:37.260
Notice that now the errors are arrays; all these things are arrays, and that’s the reason we’re using NumPy because it makes operations on arrays very seamless.
00:08:45.180
Now, these errors: some of these errors are positive, some are negative, which doesn’t make much sense at all.
00:08:51.960
I mean, we don’t want positive and negative errors to cancel each other out. So let’s make them all positive, which is traditionally done by squaring them.
00:09:01.200
The double star operator in Python is the power operator, and now we can just average them. This is the formula we can use for the loss. There are other ways to compute a loss, but this is a very traditional one; it’s called the mean squared error.
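[Editor's sketch] Putting those steps together, the loss function might look like this. The signature and the sample data are assumptions for illustration (only the point 14, 32 comes from the talk).

```python
import numpy as np


def infer(X, w, b):
    return X * w + b


def loss(X, Y, w, b):
    errors = infer(X, w, b) - Y        # one error per example, positive or negative
    squared_errors = errors ** 2       # ** is the power operator; squares are never negative
    return np.average(squared_errors)  # the mean of the squared errors


# Hypothetical examples; only (14, 32) is taken from the talk.
X = np.array([13.0, 2.0, 14.0, 23.0])
Y = np.array([33.0, 16.0, 32.0, 51.0])

current_loss = loss(X, Y, 1.2, 12)
print(current_loss)
```

A line that passed exactly through every example would have a loss of zero; any other line has a positive loss.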
00:09:12.389
Because that’s what it is: it’s the mean of the squared errors. Why did I take you on this merry detour? Remember where we started from?
00:09:21.429
We wanted to find the best approximation of the points. Instead, I built a function that gives us the error of the line. That’s because now we can reframe the problem.
00:09:30.110
We want to find the values of the parameters that give us the minimum value for this function. This function looks like it has four parameters, but if you think about it, X and Y are not really parameters; they are constants.
00:09:43.170
Once we have our examples, they are done forever, so the parameters are actually W and B; that’s the line. How do we find W and B that give us the minimum value for the loss?
00:09:57.600
Hold on another few minutes. This is going to take some math. First, let’s plot it because if I don’t see things on the chart, then I have trouble wrapping my head around it.
00:10:09.540
What does the loss look like as W and B change? I tried plotting it, and here it is: a beautiful little surface. I spent like half an hour finding the exact color scheme.
00:10:18.210
That’s the weight, that’s the bias, and that’s the loss. Where we want to be is here: the minimum! We want to find those values of weight and bias; that’s the line.
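[Editor's sketch] One way to compute such a loss surface is to evaluate the mean squared error on a grid of candidate weights and biases. The data here is synthetic (generated from the line with w = 1.2 and b = 12, with no noise), and the grid ranges are arbitrary choices, so the lowest grid point is known in advance.

```python
import numpy as np

# Synthetic examples lying exactly on the line y = 1.2 * x + 12.
X = np.array([2.0, 5.0, 8.0, 12.0, 14.0, 18.0, 21.0, 25.0])
Y = 1.2 * X + 12

# Candidate values for the two parameters (ranges chosen for illustration).
weights = np.linspace(0.0, 2.4, 25)   # 0.0, 0.1, ..., 2.4
biases = np.linspace(0.0, 24.0, 25)   # 0.0, 1.0, ..., 24.0
W, B = np.meshgrid(weights, biases, indexing="ij")

# Broadcasting gives one error per (weight, bias, example): shape (25, 25, 8).
errors = W[..., None] * X + B[..., None] - Y
surface = np.mean(errors ** 2, axis=-1)   # loss for every (weight, bias) pair

# Locate the lowest point of the sampled surface.
i, j = np.unravel_index(np.argmin(surface), surface.shape)
best_w, best_b = weights[i], biases[j]
print(best_w, best_b)
```

Scanning a grid like this is only practical for two parameters, which is why the talk moves on to a smarter way of finding the minimum.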
00:10:32.580
How do we do that? There is a beautifully simple algorithm that is arguably the most fundamental algorithm in machine learning. It works like this: first, start wherever; any point will do.
00:10:48.700
I’m starting at that white spot there and then use this mathematical tool called the gradient. For those who don’t know, the gradient is essentially the slope of this surface.
00:10:56.700
The steeper the slope, the bigger the gradient, and it’s pointing in the direction of the maximum slope.
00:11:06.600
Let’s be concrete: first, we can calculate the gradients for W and B separately. This is a little bit like slicing the surface.
00:11:14.760
So first, I’m fixing W and varying B, the bias, which means I’m essentially slicing the surface like this.
00:11:22.800
It doesn’t have to be a katana; I mean, a katana just makes it sound cooler! Then, I’m doing the same thing in the other direction.
00:11:31.800
Now that we have these, we want to go downhill! Conventionally, the gradient points uphill, so let’s reverse them.
00:11:40.400
Another thing we want to do is multiply them by some factor to make them smaller. Why? Because we don’t want our steps to be too big.
00:11:49.300
We don’t want to overshoot the minimum and maybe find ourselves even farther away than where we started from.
00:11:57.700
Now that we have these, we can just add these amounts to W and B, which means that we move. We keep doing this iteratively until we reach the precision that we want.
00:12:06.600
Like a ball rolling over this surface, we will roll down the gradient until we reach the minimum, or we get close enough.
00:12:15.200
This algorithm is called gradient descent. It doesn’t sound like much, but it’s fundamentally vital to all artificial intelligence that you see around that does amazing things.
00:12:26.800
It’s self-correcting, for example: we might actually overstep the minimum, but then the gradient adjusts, and as we get closer and closer to the minimum, we take smaller steps.
00:12:39.600
We get more precise because, as we get closer, the surface flattens and the gradient gets smaller until finally, at the minimum, the gradient is zero because the surface is flat.
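[Editor's sketch] The procedure described above can be written in a few lines. The gradient formulas are an assumption at this point in the talk: they are the partial derivatives of the mean squared error with respect to w and b. The function names, learning rate, iteration count, starting point, and synthetic data are also illustrative choices.

```python
import numpy as np


def infer(X, w, b):
    return X * w + b


def loss(X, Y, w, b):
    return np.average((infer(X, w, b) - Y) ** 2)


def gradient(X, Y, w, b):
    # Partial derivatives of the mean squared error with respect to w and b.
    errors = infer(X, w, b) - Y
    w_gradient = 2 * np.average(X * errors)
    b_gradient = 2 * np.average(errors)
    return w_gradient, b_gradient


def train(X, Y, iterations, lr):
    w = b = 0.0  # start wherever; any point will do
    for _ in range(iterations):
        w_gradient, b_gradient = gradient(X, Y, w, b)
        # The gradient points uphill, so step the opposite way,
        # scaled down by the learning rate lr to avoid overshooting.
        w -= w_gradient * lr
        b -= b_gradient * lr
    return w, b


# Synthetic examples lying exactly on the line y = 1.2 * x + 12.
X = np.array([2.0, 5.0, 8.0, 12.0, 14.0, 18.0, 21.0, 25.0])
Y = 1.2 * X + 12

w, b = train(X, Y, iterations=20000, lr=0.002)
print(w, b, loss(X, Y, w, b))
```

With a learning rate that is too large for the data, the steps overshoot and the loss grows instead of shrinking, which is the reason for the scaling factor.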
00:12:53.200
So now, I don’t know whether you’re comfortable with this. If you're not, you can safely ignore the math.
00:13:02.500
What’s uncomfortable? I left my gear out in the sun here in Sydney, and now this presenter is sticky. I can’t tell you how gross this is! If you are feeling uncomfortable with this, think about me in front of 400 people holding a sticky presenter.
00:13:09.720
So this is just the loss. I’m putting it here in mathematical form for those of you who like to see the formal shape of it. This is W times X plus B, which is Y hat; then minus Y; and all of that squared.
00:13:20.170
That’s something we’re dividing by the number of examples, which I call N, so it’s the mean squared error.
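[Editor's sketch] Written out, the loss described above would be the following. The name L and the index i over the N examples are notational assumptions; the slide itself is not reproduced in the transcript.

```latex
L(w, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl( (w x_i + b) - y_i \bigr)^2
```

Here w x_i + b is Y hat for the i-th example, and y_i is its label.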
00:13:28.910
I’m putting it here only because some of you who know calculus might actually be able to calculate the gradients in your mind.
00:13:38.650
You probably remember this stuff under the name 'partial derivatives.'