RubyConf AU 2018
Machine Learning Explained to Humans Part 2
Summarized using AI

Machine Learning Explained to Humans Part 2

by Paolo "Nusco" Perrotta

The video titled "Machine Learning Explained to Humans Part 2" features Paolo "Nusco" Perrotta discussing the progression of machine learning techniques, moving from basic linear regression to more complex models like logistic regression and their application in computer vision. This talk is a continuation of yesterday's session, expanding on a previously introduced machine learning tool and providing a clearer perspective on the relevance of machine learning in handling real-world problems.

Key Points Discussed:
- Linear Regression: The presentation begins with an explanation of linear regression, which involves using gradient descent to approximate data using a line. This serves as a foundation for understanding how to progress to more complex models.

- Multidimensional Spaces: As real-world problems often involve multiple variables, Perrotta illustrates how to transition from visualizing single-variable scenarios to handling multi-dimensional data using hyperplanes.

- Addition of Bias Terms: To simplify equations and manage complexity, a bias term is introduced to the model, allowing more effective handling of multiple variables in a structured manner.

- From Continuous to Discrete Variables: The discussion shifts focus to discrete label problems, exploring the use of the sigmoid function for modeling outputs. This function helps in mapping outputs into a defined range (0 to 1).

- Logistic Regression: Moving beyond linear regression, logistic regression is introduced as a more sophisticated statistical method suitable for discrete outcomes. The challenge of local minima with optimization is also discussed, suggesting alternate loss functions to enhance performance.

- Computer Vision and the MNIST Database: The talk references the MNIST database of handwritten digits, explaining its significance for benchmarking computer vision algorithms. Perrotta details the process of reshaping images into a format suitable for input into logistic regression models.

- One-hot Encoding: As a method of encoding outputs for classification tasks, one-hot encoding is explained, allowing the model to represent multiple digit classes while indicating confidence levels in its predictions.

- Complexity and Efficiency: By illustrating the increase in weight parameters with added dimensions and classes, the speaker emphasizes the model's capability to recognize patterns and the advancements within machine learning.

Conclusions:
- The session highlights the journey from simple linear regression to advanced machine learning techniques, emphasizing the mathematical principles that remain consistent throughout. The exploration of these complexities leads to significant improvements in classification tasks, particularly in computer vision, with successful outcomes exceeding 90% accuracy. Ultimately, the presentation underscores the transformative potential of machine learning applications in understanding complex datasets and future developments in the field.

00:00:01.909 Before we even start, I need to start this program running, because I want to show you its output right at the end. It takes some time, so let's hold on to that thought for now.
00:00:19.640 This is a short appendix to my talk from yesterday. I introduced a tool for machine learning, but I ended on a sour note, suggesting that it doesn't seem very impressive. This is what machine learning looks like: linear regression. It's all very cool, but how does it relate to the really interesting stuff, like computer vision? We saw how having a bunch of data allows us to approximate it with a line, a process known as training. We find that line through gradient descent, and then we use it to make predictions.
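To make that starting point concrete, here is a minimal sketch of single-variable linear regression trained by gradient descent, written in NumPy; the data values and function names are illustrative, not taken from the talk.

    import numpy as np

    # Toy data: reservations (input) and pizzas sold (output); made-up numbers.
    X = np.array([13.0, 2.0, 14.0, 23.0, 13.0, 1.0, 18.0])
    Y = np.array([33.0, 16.0, 32.0, 51.0, 27.0, 16.0, 34.0])

    def predict(X, w, b):
        # The line we are fitting: y = w * x + b
        return X * w + b

    def loss(X, Y, w, b):
        # Mean squared error between the line and the actual data points
        return np.average((predict(X, w, b) - Y) ** 2)

    def train(X, Y, iterations=20000, lr=0.001):
        # Gradient descent: repeatedly nudge w and b downhill on the loss surface
        w = b = 0.0
        for _ in range(iterations):
            w_gradient = 2 * np.average(X * (predict(X, w, b) - Y))
            b_gradient = 2 * np.average(predict(X, w, b) - Y)
            w -= lr * w_gradient
            b -= lr * b_gradient
        return w, b

    w, b = train(X, Y)
    print(predict(20, w, b))  # predicted pizzas for a day with 20 reservations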
00:00:40.650 So how do we progress from basic linear regression to something more complex, like neural networks? This presentation will be shorter than yesterday's, but I hope to give you an overview of this transition. Let's start with an example. Imagine that we have reservations and pizzas. I actually received a complaint from Pat because I pronounced 'pizza' strangely, which is probably the only word I can pronounce correctly. But in real-world problems, you typically deal with multiple variables. If a problem only has one variable, it's usually not interesting enough to need a software solution.
00:01:13.830 Real-life problems present multiple factors that impact the output of the system. Sales, for instance, may depend on several inputs, not just the number of reservations; we might also pull figures from the local tourist office's website. With multiple variables, we can represent the data in a matrix where each row is a different example and each column corresponds to a variable.
00:01:50.800 Initially, we had one input variable and one output variable. We could throw that into a chart and draw a line to approximate the points. Remember, this is the equation of a line; with a single variable, we can simply visualize it in 2D. Now, however, we have multiple variables, like reservations and pizzas, and our data points are distributed in a multi-dimensional space. Instead of a line, we need a hyperplane, the generalization of a line to higher dimensions.
00:02:22.740 For instance, let's consider two input variables, X1 and X2. X1 may represent reservations, and X2 some other factor; each variable gets its own weight, which reflects the impact of that variable on the output. The point is, as we add more dimensions, we can no longer draw the chart; a four-dimensional chart is beyond us. Nevertheless, the mathematical idea remains the same: we're still looking at a linear equation, just with multiple variables.
00:03:02.570 Now, let's consider what happens when we add another dimension. If we keep adding variables, the equation grows increasingly complex, and we can no longer manage it visually. The bias term is also the odd one out in the equation, because it isn't multiplied by any input, which complicates our formulas. Instead, we can simplify by adding an extra column of ones directly to our data matrix, so that the bias is handled just like any other weight.
00:03:42.380 We can give this new bias term a standardized name, like B, and treat it uniformly across our equations. By doing this, we end up with a single structured equation: a linear combination of weights and input variables. The good news is that I can now implement the entire prediction step succinctly, with a single line of matrix multiplication code. The concept here is not overly complex; I simply want you to grasp the transition from basic linear regression to multivariate linear regression.
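As a rough illustration of that column-of-ones trick (sketched here in NumPy; the talk's own code may differ), the bias becomes just another weight and the whole prediction step reduces to one matrix multiplication:

    import numpy as np

    # Each row is one example (one day), each column one input variable.
    # The numbers are made up purely for illustration.
    X = np.array([[13.0, 26.0],
                  [ 2.0, 14.0],
                  [14.0, 20.0]])

    # Prepend a column of ones so the bias behaves like just another weight.
    X_with_bias = np.column_stack((np.ones(X.shape[0]), X))

    # One weight per column; the first one plays the role of the bias B.
    w = np.array([2.0, 1.5, 0.5])

    # The entire prediction is a single matrix multiplication.
    predictions = np.matmul(X_with_bias, w)
    print(predictions)  # one predicted value per row of data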
00:04:16.000 As we delve further, let's shift focus to how these systems handle discrete problems rather than continuous variables. Our work becomes interesting when it involves binary labels, for example, determining whether I broke even on a particular day at my restaurant. At first glance this looks like an easy chart to handle, but approximating a set of 0-or-1 points with a straight line doesn't work well.
00:04:50.620 To fix this, we need to adjust our approach. One way is to pass our output through a different function, one that keeps it within the expected range. We want a function that is continuous, without abrupt steps or discontinuities, because that smoothness is what gradient descent relies on.
00:05:17.650 This leads us to the sigmoid function. It gives a smooth curve that squashes our outputs into a defined range, from 0 to 1, which is crucial because we will use those values later to assess accuracy. We can implement the sigmoid succinctly; it takes just a couple of lines to compute the necessary values.
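The sigmoid really is only a couple of lines. A sketch in NumPy, with the same illustrative naming as the earlier snippets:

    import numpy as np

    def sigmoid(z):
        # Smoothly squashes any real number into the range (0, 1)
        return 1 / (1 + np.exp(-z))

    def forward(X, w):
        # Weighted sum of the inputs, then squashed through the sigmoid
        return sigmoid(np.matmul(X, w))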
00:05:49.190 Using the sigmoid function keeps our outputs constrained. However, plugging it into our existing loss function makes that loss non-convex, which complicates optimization: gradient descent can get stuck in local minima. The fix is to switch to a different loss function, the logarithmic loss, which removes those local minima and gives gradient descent a smooth path to follow.
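One common way to write that logarithmic loss (often called log loss or cross-entropy) is sketched below, assuming NumPy arrays of sigmoid outputs and 0/1 labels; the exact formulation in the talk may differ.

    import numpy as np

    def log_loss(Y_predicted, Y_true):
        # Heavily penalizes confident wrong answers; combined with the sigmoid,
        # this loss leaves no spurious local minima for gradient descent to fall into.
        first_term = Y_true * np.log(Y_predicted)
        second_term = (1 - Y_true) * np.log(1 - Y_predicted)
        return -np.average(first_term + second_term)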
00:06:23.960 From here, we arrive at logistic regression, a statistical method that is a step up in sophistication from linear regression. With that in hand, we can confront the requirements of a real computer vision problem.
00:07:01.760 In the realm of computer vision, the MNIST database comes into play: a standardized database of handwritten digits. It is particularly useful for benchmarking computer vision algorithms because it contains 60,000 neatly scaled, grayscale, 28x28 pixel images of digits, which lets us test and validate how well our algorithms can decipher these inputs.
00:07:34.760 The intriguing part is how we preprocess these images into a form we can feed into our logistic regression model. Conceptually, we reshape each image into a single row of 784 variables (28 times 28 pixels). This flattening sacrifices the two-dimensional geometry of the image, but it lets us treat each pixel as just another input variable for digit classification.
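The reshaping step itself is tiny. A sketch, assuming each digit is already available as a 28x28 NumPy array of pixel values (loading MNIST is left out here):

    import numpy as np

    # Stand-in for one MNIST digit: a 28x28 grid of grayscale pixel values.
    image = np.random.randint(0, 256, size=(28, 28))

    # Flatten it into a single row of 28 * 28 = 784 input variables.
    row = image.reshape(1, 784)
    print(row.shape)  # (1, 784)

    # A dataset of N images then becomes an (N, 784) matrix, one image per row,
    # ready for the same matrix-multiplication machinery as before.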
00:08:22.740 In our logistic regression model, we also need to encode the outputs, since we're now working with ten possible digit classes. We do this with a method called one-hot encoding: each label becomes a vector of ten columns, with a 1 in the position of the true digit and 0 everywhere else.
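A sketch of one-hot encoding in NumPy; the function name is illustrative, not the speaker's:

    import numpy as np

    def one_hot_encode(labels, n_classes=10):
        # Each label becomes a row of ten columns, with a single 1
        # marking the true digit and 0 everywhere else.
        encoded = np.zeros((labels.shape[0], n_classes))
        for i, label in enumerate(labels):
            encoded[i, label] = 1
        return encoded

    print(one_hot_encode(np.array([5, 0, 3])))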
00:09:05.620 With this encoding, the model outputs a confidence level for each digit class. Each prediction yields ten values between 0 and 1, describing how certain the model is about each possible classification. For example, if the highest value sits at position 5, the model's best guess is that the image shows a 5.
00:09:48.840 Thus, we replace the binary classifier with a multi-class classifier, since we now have ten classes to evaluate. This substantially increases the number of weights: one weight per input variable, plus the bias, for each of the ten classes. Adding dimensions and classes scales the problem up, but the underlying technique stays the same.
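To get a feel for the sizes involved, here is a rough sketch: with 784 pixels plus one bias column and ten classes, the weights form a 785-by-10 matrix, and the forward pass is still a single matrix multiplication followed by the sigmoid. The batch size and values below are stand-ins.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    n_inputs, n_classes = 785, 10      # 784 pixels + 1 bias column, 10 digits
    n_images = 100                     # stand-in batch size for illustration

    X = np.ones((n_images, n_inputs))        # stand-in for the prepared MNIST matrix
    w = np.zeros((n_inputs, n_classes))      # one column of weights per digit class

    # Each row of the result holds ten confidence values, one per digit.
    confidences = sigmoid(np.matmul(X, w))
    print(confidences.shape)  # (100, 10)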
00:10:24.700 All those dimensions give the model a lot to learn from, at the cost of many more weights and calculations. The resulting model is capable of recognizing patterns across all ten classes, which shows how far these simple building blocks from machine learning theory can go.
00:10:56.830 With all these pieces in place, I should apologize for the code's legibility, because it might look cluttered. The core of it, however, is simply selecting the maximum among the ten output probabilities. However complicated the constructs appear, the underlying principles carry over quite consistently.
00:11:26.410 The resulting model classifies digits effectively even with this basic logistic regression framework: it takes an image as input, multiplies it by the weight matrix, squashes the result through the sigmoid, and picks the most confident class. It's not a simplification that weakens prediction; it remains a robust way to push data through the same pathway we built up step by step.
00:12:03.570 When we run the trained model on the test data, we get 10,000 predictions, which we compare against the true labels to measure accuracy, and the whole program stays compact, just a handful of lines of code. Achieving a classification accuracy above 90% with so little machinery shows how far each of these small steps takes us into what looks like an intimidating field.
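The final step, in code form: a sketch that assumes a trained weight matrix w and a prepared test set X_test, Y_test (random stand-ins below), picks the most confident class per image, and measures the fraction of matches.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def classify(X, w):
        # One row of ten confidences per image; argmax picks the most likely digit.
        confidences = sigmoid(np.matmul(X, w))
        return np.argmax(confidences, axis=1)

    def accuracy(X, Y_true, w):
        # Percentage of images whose predicted digit matches the true label.
        return np.average(classify(X, w) == Y_true) * 100

    # Random stand-ins: 10,000 fake "images" of 785 inputs and fake labels.
    X_test = np.random.rand(10000, 785)
    Y_test = np.random.randint(0, 10, size=10000)
    w = np.random.rand(785, 10)
    print("Accuracy: %.2f%%" % accuracy(X_test, Y_test, w))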
00:12:30.460 All of these pieces work together to take us from linear regression methodologies to a working, if basic, form of computer vision. That puts us on the threshold of the genuinely exciting advancements in machine learning, and gives us the foundations to explore them in future endeavors.
00:13:21.500 Thank you.