For the second session today, we have Paolo Perrotta. Is that a good pronunciation? Perfect! Everyone got that perfect pronunciation? He's a well-known author in the Ruby community. You might have read some of his books; I'm sure he'll tell you about them. He currently resides in Bologna.
Bologna is the seventh largest city by population in Italy. Did you know that? It is also where Ferrari, Maserati, and Lamborghini are all based. And I presume you have one of each?
Well, I don’t really use them much these days.
Machine learning, correct? Good! Do you have John Connor on speed dial, just in case anything goes wrong during the talk?
Thank you, thank you. That was a very flattering introduction, which is not surprising since I wrote it myself.
So, let’s talk about machine learning, which is kind of a frustrating topic for me for a few reasons.
First, because I could not understand how it could possibly work. It’s crazy! I mean, I’ve been doing software for a very long time; you would imagine I’d know this stuff. But how can a computer caption pictures? That sounds like magic.
Then, I got around to studying it and found that it was even more frustrating. There isn’t a lack of documentation online; in fact, there is a boatload of it, but it’s just not targeted at me. It's for different people, primarily researchers and mathematicians.
So, if you want to learn something, everyone says, "Sure! Read this paper; it has everything you need." But I don't want to read the paper; I don’t want to sift through all those formulas. Can’t there be a blog post? So, I hope to save you this frustration, at least in part. This is 'Machine Learning Explained to Humans.'
And by humans, of course, I mean developers. Disclaimer: this is going to be basic. If you already know anything about machine learning, then this talk is probably not for you. On the other hand, if you wonder how the magic can happen, this is a talk that I hope will demystify machine learning.
I cannot teach you machine learning in two minutes, but maybe I can help remove some of the magic from it. There is going to be some mathematics (there's no way around that), but I will try to make it intuitive.
Okay, so let’s start with a simple example. Let’s say we have a small restaurant, a pizzeria. Every day at noon, we get a certain number of reservations; let's call that number X.
And every night, we sell a certain number of pizzas, which I would call Y. What we want to do is build a software program that can learn the relationship between reservations and pizzas, and then use that program to forecast how many pizzas we are going to sell tonight based on the reservations.
That way we can, for example, prepare the right amount of dough. This is the idea.
So, this program has to learn something. Let’s start by collecting examples by looking at what happens. Maybe we list all these examples in a text file with two columns: reservations and pizzas—X and Y. The first column is our input data, and the second column is what we are trying to forecast.
Machine learning people have a weird name for the variable that we’re trying to forecast: they call it the label. It’s like someone labeled each piece of input data with the matching number of pizzas.
Okay, weird name. The first thing we need to do is import this data so that we can use it in our code. I will use Python for that.
Sorry, I’m a certified Ruby fanboy, but this is machine learning and everybody's using Python, so I don’t want to go against the flow here. I tried to use Ruby, but then I was missing the libraries.
So it’s okay to be multilingual. In this case, we have this nice library called NumPy, a numerical library that is just fine for this.
You don’t necessarily need to understand in detail what loadtxt does; it loads this file into two arrays, X and Y, and that’s exactly what we need.
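As a rough illustration of that loading step, here is a minimal sketch. The sample rows and the header line are made up for the example (the talk's actual data file is not shown here), and an in-memory buffer stands in for the text file on disk.

```python
import io
import numpy as np

# Hypothetical sample data standing in for the two-column text file.
# These numbers are made up; they are not the data from the talk.
sample_file = io.StringIO(
    "Reservations  Pizzas\n"
    "13  33\n"
    "2   16\n"
    "10  27\n"
    "23  51\n"
)

# skiprows=1 skips the header line; unpack=True returns one array per column.
X, Y = np.loadtxt(sample_file, skiprows=1, unpack=True)

print(X)  # reservations (the input data)
print(Y)  # pizzas (the labels)
```

With a real file, the path to that file would be passed to `np.loadtxt` in place of the buffer.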
Now, the assumption we have is that there is some relation between X and Y; otherwise, this is all for nothing. And this relation is not one-to-one: some people reserve and then don’t end up eating pizza.
So, we need to understand that relation and it’s hard to understand it just by looking at numbers. We should plot it and then we can actually see what it looks like.
So we plot reservations against pizzas, and that’s what we get. A few dozen examples show that indeed, more reservations generally mean more pizza sold. How can we move from here to a program that can forecast the future? There is a way to do this, and it's a method statisticians have been using for quite some time.
It works in two phases. Phase one is to trace the line that follows those points, and phase two is to forget about the points entirely and just use the line.
Essentially, what we did is approximate the points with the line. We simplified our data, and now that we have this line, we can use it to make forecasts.
For example, let’s say we have 25 reservations today; how many pizzas do we expect to sell tonight? We can trace all the way up to the line and then all the way left until we cross the pizzas axis. That point of intersection will give us our answer.
In this case, the answer is 42. Of course, that number is a coincidence.
Maybe you’ve heard about this. It’s called linear regression. That’s the name—linear because we’re using a line as opposed to some complicated curve. Regression is the statistician's way of saying 'infer the value of a variable from another value,' like pizzas from reservations.
So let me repeat how linear regression works: first approximate the data with the line, and then use that line to infer the value of data that you don't have labels for, like reservations without pizza.
Now we need to put this into a form that we can actually turn into code: a slightly more formal form, not a graphical form like the one I have used so far.
Let’s start with the second step first, actually, because it’s easier. Then we can go back to the first step. But let’s assume we have a line. How can we infer the number of pizzas in a way that we can code?
So first, we need to formalize this line a little bit. And to do that, we have an equation that expresses the relation between X and Y by way of those two parameters.
You might have studied this formula in high school; it’s just the equation of a line. You might call the parameters A and B.
I’m going to move around a lot. If I fall, I would like the entire room to acknowledge this event, okay?
Let’s all say it together: this is the equation. The parameters are called W and B, which stand for weight and bias: more weird machine learning names.
Intuitively, the weight is the slope of the line. You can see that for the same X, if we have a bigger W, then we get a bigger Y. Alright?
So, it’s like the line is becoming more vertical, and B is basically the distance at which this line crosses the vertical axis, the y-intercept, if you wish.
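For reference, the line equation described here can be written as follows, with w as the weight (the slope) and b as the bias (the y-intercept). The lowercase symbols are a notational choice for this sketch, not necessarily what was on the slide.

```latex
y = w x + b
```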
So now that we have this equation, we can write code about it. For example, let’s say we have 25 reservations, so X is 25.
If I tell you that for this particular line we have W equals 1.2 and B equals 12, which looks about right, we can just replace this stuff in the formula.
The code that does this is just a one-liner. I can write a function called infer that takes our input data and the line (remember those two parameters identify the line uniquely). It then returns the result of the formula.
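A minimal sketch of what such a function might look like, assuming the parameter names `X`, `w`, and `b` (the exact signature used in the talk is not shown in the transcript):

```python
def infer(X, w, b):
    # Apply the line equation: weight times input, plus bias.
    return X * w + b

# The worked example from the talk: 25 reservations, w = 1.2, b = 12.
pizzas = infer(25, 1.2, 12)
print(pizzas)  # roughly 42 pizzas
```

Because the body only uses multiplication and addition, the same function should also work when `X` is a NumPy array of reservations rather than a single number.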
And that’s it! Okay, one little detail: I called this Y in the beginning, but then I subtly changed it and started calling it Y-hat.
I put a little hat on top of the Y. The reason for that is that I already used Y to mean the label, and this is subtly different.
This is not the label; the label is the ground truth; it's the number of pizzas we actually sold. This is the inference of the pizzas we expect to sell, so I wanted to avoid any cause for confusion.
So we have this infer method, and this is the second step in linear regression. It’s done; that’s all it takes.
Now, the first step is trickier, so I will ask you all to hold on for 10 minutes—that's how long it takes. But it’s important; it’s crucial to the entire process.
How can we go from these points to a line, that is, to W and B? Let’s start by observing that whatever line we can trace, it’s going to be wrong.
I mean, unless the points are exactly aligned, which is not going to happen. Any line has an error.
Just to prove my point here, let’s look at our original data again. Let’s pick one of these points, like the point where X is 14 and Y is 32.
The line we traced would say, okay, this is 1.2 multiplied by 14 plus 12, which is 28.8, but we actually sold 32, so it’s wrong. There is an error. I will mark this error in orange: that little space there.
So now, let’s write a function that considers all the errors together. Let’s call it the loss—yet another weird machine learning name.
Every time you hear 'loss,' your brain should say 'error.' That’s what it is. This function takes our examples along with the labels and the line.
What it does is first calculate all the errors. It uses the infer method to calculate Y-hat and then calculates the difference from Y.
Notice that now the errors are arrays; all these things are arrays, and that’s the reason we’re using NumPy because it makes operations on arrays very seamless.
Now, these errors: some of these errors are positive, some are negative, which doesn’t make much sense at all.
I mean, we don’t want positive and negative errors to cancel each other out. So let’s make them all positive, which is traditionally done by squaring them.
The double star operator in Python is the power operator, and now we can just average them. This is the formula we can use for the loss. There are other ways to compute a loss, but this is a very traditional one; it’s called the mean squared error.
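The steps just described (infer, subtract the labels, square, average) could be sketched like this. The function and variable names are assumptions for the illustration; the single example point is the one mentioned earlier in the talk.

```python
import numpy as np

def infer(X, w, b):
    # Line equation: weight times input, plus bias.
    return X * w + b

def loss(X, Y, w, b):
    # Errors: one entry per example, positive or negative.
    errors = infer(X, w, b) - Y
    # Squaring makes every error positive; ** is the power operator.
    squared_errors = errors ** 2
    # Mean squared error: the average of the squared errors.
    return np.average(squared_errors)

# One example from the talk: 14 reservations, 32 pizzas, line w = 1.2, b = 12.
# The inference is about 28.8, the error about -3.2, its square about 10.24.
example_loss = loss(np.array([14.0]), np.array([32.0]), 1.2, 12)
print(example_loss)
```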
Because that’s what it is: it’s the mean of the squared errors. Why did I take you on this merry detour? Remember where we started from?
We wanted to find the best approximation of the points. Instead, I built a function that gives us the error of the line. That’s because now we can reframe the problem.
We want to find the values of the parameters that give us the minimum value for this function. This function looks like it has four parameters, but if you think about it, X and Y are not really parameters; they are constants.
Once we have our examples, they are done forever, so the parameters are actually W and B; that’s the line. How do we find W and B that give us the minimum value for the loss?
Hold on another few minutes. This is going to take some math. First, let’s plot it because if I don’t see things on the chart, then I have trouble wrapping my head around it.
What does the loss look like as W and B change? I tried plotting it, and here it is: a beautiful little surface. I spent like half an hour finding the exact color scheme.
That’s the weight, that’s the bias, and that’s the loss. Where we want to be is here: the minimum! We want to find those values of weight and bias; that’s the line.
How do we do that? There is a beautifully simple algorithm that is arguably the most fundamental algorithm in machine learning. It works like this: first, start wherever; any point will do.
I’m starting at that white spot there, and then I use this mathematical tool called the gradient. For those who don’t know, the gradient is essentially the slope of this surface.
The steeper the slope, the bigger the gradient, and it’s pointing in the direction of the maximum slope.
Let’s be concrete: first, we can calculate the gradients for W and B separately. This is a little bit like slicing the surface.
So first, I’m fixing W and varying B, the bias, which means I’m essentially slicing the surface like this.
It doesn’t have to be a katana; I mean, a katana just makes it sound cooler! Then, I’m doing the same thing in the other direction.
Now that we have these, we want to go downhill! Conventionally, the gradient points uphill, so let’s reverse them.
Another thing we want to do is multiply them by some factor to make them smaller. Why? Because we don’t want our steps to be too big.
We don’t want to overshoot the minimum and maybe find ourselves even farther away than where we started from.
Now that we have these, we can just add these amounts to W and B, which means that we move. We keep doing this iteratively until we reach the precision that we want.
Like a ball rolling over this surface, we will roll down the gradient until we reach the minimum, or we get close enough.
This algorithm is called gradient descent. It doesn’t sound like much, but it’s fundamentally vital to all artificial intelligence that you see around that does amazing things.
It’s self-correcting, for example: we might actually overstep the minimum, but then the gradient adjusts, and as we get closer and closer to the minimum, we take smaller steps.
We get more precise because, as we get closer, the surface flattens and the gradient gets smaller until finally, at the minimum, the gradient is zero because the surface is flat.
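To make the loop concrete, here is a minimal sketch of gradient descent for the mean squared error of a line. It is an illustration under stated assumptions, not the code from the talk: the names `gradient`, `train`, and `lr` (the factor that shrinks the steps), the starting point at zero, and the made-up data are all choices for this example. The gradient expressions are the standard partial derivatives of the mean squared error.

```python
import numpy as np

def infer(X, w, b):
    return X * w + b

def loss(X, Y, w, b):
    return np.average((infer(X, w, b) - Y) ** 2)

def gradient(X, Y, w, b):
    # Partial derivatives of the mean squared error with respect to w and b.
    errors = infer(X, w, b) - Y
    w_gradient = 2 * np.average(X * errors)
    b_gradient = 2 * np.average(errors)
    return w_gradient, b_gradient

def train(X, Y, iterations, lr):
    # Start wherever; here, at w = 0 and b = 0.
    w = b = 0.0
    for _ in range(iterations):
        w_gradient, b_gradient = gradient(X, Y, w, b)
        # The gradient points uphill, so subtract it to go downhill,
        # scaled by lr so the steps are not too big.
        w -= w_gradient * lr
        b -= b_gradient * lr
    return w, b

# Made-up data lying exactly on the line y = 2x + 1, so the minimum is known.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2 * X + 1

w, b = train(X, Y, iterations=2000, lr=0.05)
print(w, b, loss(X, Y, w, b))  # w should end up close to 2, b close to 1
```

A fixed number of iterations is used here for simplicity; stopping once the loss no longer improves by more than some tolerance would be another reasonable choice.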
So now, I don’t know whether you’re comfortable with this. If you're not, you can safely ignore the math.
What’s uncomfortable? I left my gear out in the sun here in Sydney, and now this presenter is sticky. I can’t tell you how gross this is! If you are feeling uncomfortable with this, think about me in front of 400 people holding a sticky presenter.
So this is just the loss. I’m putting it here in mathematical form for those of you who like to see the formal shape of it. This is W times X plus B, which is Y-hat, minus Y, squared.
We sum that over all the examples and divide by the number of examples, which I call N, so it’s the mean squared error.
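Written out, the loss described here would look roughly like this. The index i over the examples and the lowercase symbols are notational assumptions for this sketch.

```latex
L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( (w x_i + b) - y_i \bigr)^2
```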
I’m putting it here only because some of you who know calculus might actually be able to calculate the gradients in your mind.
You probably remember this stuff under the name 'partial derivatives.'
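For those who want to check their mental calculus, these are the partial derivatives of the mean squared error of a line with respect to w and b, using the same notational assumptions (n examples, indexed by i). They follow from the chain rule applied to the squared term.

```latex
\frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} x_i \bigl( (w x_i + b) - y_i \bigr)
\qquad
\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \bigl( (w x_i + b) - y_i \bigr)
```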