00:00:24.600
Hello everyone. I'm here to talk about Practical Machine Learning and Rails. My name is Andrew Cantino, and I'm the VP of Engineering at Mavenlink. This is Ryan Stout, the founder of Agile Productions. I have been doing machine learning on and off for a couple of years since I studied it in grad school, while Ryan has been working in this area full time for the last year on an upcoming project. We are going to split this talk into two parts. I will provide a theoretical introduction to machine learning, explaining what it is and why it is important. Ryan will then provide several practical examples.
00:01:10.180
This talk aims to introduce you to machine learning and help you understand its significance, as well as encourage you to consider applying it to your projects. My goal is to make you machine learning aware, meaning when you return to your companies and encounter problems well-suited for machine learning solutions, you will know what tools are available and feel confident enough to dive in. It is crucial that you at least know of its existence and are not intimidated by these techniques when addressing relevant issues.
00:01:34.500
However, let me be clear: this talk will not provide you with a PhD in machine learning. This is an enormous field that consists of many specialized areas, and people obtain PhDs for very small parts of it. While we will discuss some algorithms at a high level and provide a general sense of how to use them, we will not delve into implementation details. Additionally, we will not cover topics like collaborative filtering, optimization, clustering, or advanced statistics, as those are beyond the scope of this talk.
00:02:03.220
So, what is machine learning? Machine learning encompasses numerous algorithms that predict outcomes based on data inputs using applied statistics. At its core, machine learning is applied mathematics. However, there are many libraries available that abstract the complex mathematics from you, making it more accessible. It’s essential to understand that machine learning is not magic; it relies solely on the data available.
00:02:24.610
What kind of data are we talking about? The web is full of data—APIs, logs, click trails, user decisions—all these contribute to the data sets stored in your SQL or NoSQL databases. You can analyze this data and make predictions using machine learning. Your users leave behind trails of valuable, albeit noisy, data that can be leveraged to solve various problems.
00:02:39.220
What do we do with this data? One approach is classification. Classification involves breaking data into categories and determining which category new data belongs to. For example, consider whether an email is spam or not; you want your inbox to be full of legitimate correspondence, not spam. Other classifications could include determining if something is happy or sad, appropriate or inappropriate, and much more. Ryan will provide a hands-on example of sentiment analysis shortly.
00:02:49.019
In addition to email filtering, classification could apply to sorting documents. For instance, Gmail uses classification techniques for its importance filter while Aardvark categorizes questions. Similarly, platforms like Amazon and others classify reviews according to user interests, expertise, and even the likelihood of payment for services. You can also classify behavioral events, such as logins and clicks, based on their normal or abnormal activity.
00:03:10.360
A system can even be developed to automatically detect anomalies like potential intrusions. For example, we could create a system to warn users who might fall victim to a phishing attempt before they click on a malicious link. This indicates that classification offers practical solutions to problems you may encounter in your projects.
00:04:03.220
Let’s discuss some algorithms that can facilitate classification. One of the first algorithms is decision tree learning. What I appreciate about decision trees is their straightforwardness; they resemble flowcharts. For instance, a decision tree can classify a new email as either spam or not by considering labels like "spam" or "ham" and utilizing features from that email, such as the presence of certain keywords or the number of attachments.
00:04:33.960
To construct a decision tree, you select the feature that best separates your data into classes. In this theoretical example, you would begin with the feature that provides the most significant distinction between spam and ham. In our case, the word “Viagra” might serve as a strong predictor, determining whether the email is likely to be spam based on how often that feature appears. You can estimate the probability of an email being classified as spam depending on the presence of certain keywords or attachments, allowing you to use this decision tree in practical applications.
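To make that concrete, here is a tiny hand-written sketch of what a learned tree might look like once it is turned into code. The feature names and thresholds are made up for illustration; in practice the tree structure would be learned from your labeled emails.

```ruby
# Illustrative only: a hand-built "decision tree" for spam vs. ham.
# A real tree would be learned from training data; these splits are invented.
def classify_email(features)
  # features is a hash like { viagra_count: 2, attachment_count: 0, from_known_contact: false }
  if features[:viagra_count] > 0
    :spam
  elsif features[:attachment_count] > 3
    # Many attachments from a stranger looks suspicious; from a contact it's fine.
    features[:from_known_contact] ? :ham : :spam
  else
    :ham
  end
end

classify_email({ viagra_count: 1, attachment_count: 0, from_known_contact: false }) # => :spam
```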
00:05:12.330
Next, let's look at support vector machines, another algorithm for classification.
00:05:16.790
Support vector machines (SVMs) excel at classification tasks where the number of features is limited, roughly fewer than 2,000. For instance, while classifying documents, every word may be treated as a feature, and this can result in poor performance due to SVMs' memory-intensive nature. However, with smaller data sets or less complex tasks, SVMs function exceptionally well. Imagine a scenario where black dots represent spam and white dots represent ham. The goal is to find a line that best separates these two categories using features along two dimensions.
00:06:03.200
This separating line, called a hyperplane when there are more than two dimensions, maximizes the margin between the two classes. While a basic SVM finds a straight line, kernel functions allow it to learn curved boundaries that capture more intricate relationships between classes. There are libraries like LibSVM available, with Ruby bindings that you can incorporate into your applications for SVM implementation.
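As a rough idea of what that looks like in Ruby, here is a minimal sketch assuming the rb-libsvm gem (one set of Ruby bindings for LibSVM); the exact method names come from that gem's documentation, so check its README before relying on them. The feature vectors and labels are toy values.

```ruby
# Minimal sketch assuming the rb-libsvm gem: `gem install rb-libsvm`.
require 'libsvm'

# Each example is a point in feature space; label 1 = spam, 0 = ham (toy data).
examples = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]].map { |f| Libsvm::Node.features(f) }
labels   = [1, 1, 0, 0]

problem = Libsvm::Problem.new
problem.set_examples(labels, examples)

parameter = Libsvm::SvmParameter.new
parameter.cache_size = 1     # in megabytes
parameter.eps = 0.001
parameter.c = 10             # penalty for misclassified training points

model = Libsvm::Model.train(problem, parameter)
model.predict(Libsvm::Node.features([0.85, 0.15])) # close to the spam cluster => class 1
```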
00:06:16.280
Moving forward, I want to mention Naive Bayes, a classification algorithm that performs surprisingly well with text. It calculates probabilities based on the presence of words within a document, treating each word as an independent feature. This approach, while statistically simplistic, yields effective results given enough data. Essentially, Naive Bayes relies on the "naive" assumption that each word occurs independently of the other words, given the class of the document.
00:06:50.280
Let's consider a very simple example to illustrate this process, using a training set of 100 emails where 70 are classified as spam. We could analyze the occurrence of specific words like ‘Viagra,’ which might appear 42 times in spam emails but only once in non-spam. With suitable probabilities derived from such frequency counts, we can utilize Bayes' theorem to predict whether new emails are spam.
00:07:23.300
For instance, if 60% of spam emails contain the word 'Viagra,' we can estimate the likelihood that a new email is spam based on the features present in that email. Combining these per-word probabilities with Bayes' theorem gives us a formula for the probability that a new email containing 'Hello' and 'Viagra' is spam.
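Here is that calculation worked through in plain Ruby, using the numbers from the example above (100 emails, 70 spam, 'Viagra' in 42 spam emails and 1 ham email). It is a sketch of the single-word case; a full Naive Bayes classifier would multiply the conditional probabilities of every word in the email.

```ruby
# Working through the talk's spam example with Bayes' theorem (single word).
total_emails = 100.0
spam_emails  = 70.0
ham_emails   = 30.0

p_spam = spam_emails / total_emails          # 0.7
p_ham  = ham_emails / total_emails           # 0.3

p_viagra_given_spam = 42.0 / spam_emails     # 0.6   (42 of 70 spam emails contain "Viagra")
p_viagra_given_ham  = 1.0 / ham_emails       # ~0.033 (1 of 30 ham emails)

# P(spam | "Viagra") = P("Viagra" | spam) * P(spam) / P("Viagra")
p_viagra = p_viagra_given_spam * p_spam + p_viagra_given_ham * p_ham
p_spam_given_viagra = (p_viagra_given_spam * p_spam) / p_viagra
# => roughly 0.977 -- an email containing "Viagra" is very likely spam

# With several words (e.g. "Hello" and "Viagra"), Naive Bayes multiplies the
# per-word conditional probabilities, treating each word as independent.
```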
00:07:40.890
Next, I want to mention neural networks. You're likely familiar with neural networks, as they have become quite popular in recent years. However, I want to caution you about using them, especially given that support vector machines generally perform better for many applications. Neural networks involve multiple layers: an input layer, one or more hidden layers, and an output layer. They are loosely modeled on the human brain at a high level, making them intriguing but complex. The challenge with neural networks lies in their interpretability and their tendency to overfit data. While capable of learning complex functions, determining the right number of hidden layers and parameters can be quite difficult.
00:08:42.730
The input layer takes in features, such as the counts of certain words or the pixel colors in an image. In practice, neural networks can work well on images, yet they require considerable data and fine-tuning to achieve satisfactory performance. If you're interested in using neural nets, you should evaluate carefully whether they're necessary for your specific application or whether simpler algorithms would be more suitable.
00:09:40.950
Before I hand it over to Ryan, I want to discuss two high-level concepts that you need to understand when venturing into machine learning: the curse of dimensionality and overfitting. The curse of dimensionality refers to the fact that as you add more features, the amount of data required to learn an effective classifier increases exponentially.
00:09:55.160
For instance, more features create a high-dimensional space that requires significantly more data points to inform the learning algorithm sufficiently. Although it isn't strictly true that algorithms need to fill that entire volume of feature space, it is a good principle to keep in mind that the more features you have, the more data you’ll need to train effectively.
00:10:14.480
The other concept, overfitting, relates to the challenge of creating an algorithm that generalizes well to unseen data rather than memorizing the training data. The more parameters a model has, the more complex the functions it can learn, but with too many parameters you run the risk of the algorithm learning noise rather than useful patterns. To mitigate this risk, one effective approach is to use separate data sets for training and testing; validating your model against data it has never seen makes it much more likely that it will perform well in real-world scenarios.
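A minimal holdout-split sketch of that idea is below. The `emails`, `labels`, and `train_classifier` names are hypothetical placeholders for your own data and training step; the point is simply to hold back part of the labeled data and score the model on it.

```ruby
# Sketch of a holdout split: train on 80% of the labeled data, test on the rest.
# `emails`, `labels`, and `train_classifier` are hypothetical placeholders.
labeled_emails = emails.zip(labels).shuffle(random: Random.new(42))
cutoff = (labeled_emails.size * 0.8).floor
training_set = labeled_emails[0...cutoff]
test_set     = labeled_emails[cutoff..]

model = train_classifier(training_set)
accuracy = test_set.count { |email, label| model.predict(email) == label }.fdiv(test_set.size)
```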
00:11:06.750
Cross-validation is a technique that allows you to estimate a model's performance by splitting the original data into k subsets and iteratively training and testing the model on varying partitions. This allows you to assess how well your model generalizes to different samples while maximizing the amount of training data available.
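As a sketch of k-fold cross-validation in Ruby, again with a hypothetical `train_classifier` step, the idea is to rotate which fold is held out and average the scores:

```ruby
# K-fold cross-validation sketch: split the examples into k folds, train on
# k-1 of them, test on the held-out fold, and average the resulting scores.
# `train_classifier` and the model's `predict` method are hypothetical.
def cross_validate(examples, k: 5)
  folds = examples.shuffle(random: Random.new(42))
                  .each_slice((examples.size / k.to_f).ceil).to_a
  scores = folds.each_index.map do |i|
    test_fold  = folds[i]
    train_data = (folds[0...i] + folds[(i + 1)..]).flatten(1)
    model = train_classifier(train_data)
    test_fold.count { |features, label| model.predict(features) == label }.fdiv(test_fold.size)
  end
  scores.sum / scores.size
end
```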
00:12:00.000
Before I pass the stage to Ryan, I want to highlight that algorithms often latch onto the simplest patterns they can detect in the training data. A classic illustration is a military project that aimed to teach a classifier to detect tanks camouflaged among trees. The scientists behind the algorithm concluded that their model was working correctly, but in practice it faltered: because of the conditions under which the training images were taken, the model had keyed on characteristics unrelated to the tanks and ended up detecting cloudiness rather than the tanks themselves.
00:12:54.970
To maximize learning efficacy, you should collect diverse data from varied environmental contexts, ensuring that the model incorporates the right features. With that, I’d like to introduce Ryan, who will work through some practical examples for you.
00:14:23.769
Thank you, everyone. Can everyone hear me? I want to provide you with a couple of examples. When I was initially learning about machine learning, I watched many videos online filled with mathematical concepts that were challenging to comprehend. I find it especially helpful to see real-world applications of these ideas. Usually, when attempting to use machine learning, you won’t directly implement algorithms but rather use excellent existing tools to get started.
00:15:06.380
For example, I want to start with sentiment classification, also referred to as sentiment analysis. In this case, you analyze a body of text and determine if it is positive or negative. Companies often employ this technique to assess customer product sentiment on social media platforms like Twitter. By counting positive and negative mentions, they can gauge product performance.
00:15:51.310
To conduct sentiment analysis, we first need a training set. In this example, we will analyze tweets and label each as positive or negative. Manually labeling thousands of tweets is labor-intensive, so one efficient method is to use emoticons, like smiley faces for positive sentiments and frowning faces for negativity. By retrieving tweets including these emoticons, we can quickly assign sentiment labels.
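A small sketch of that labeling step is below. It assumes `tweets` is an array of strings already fetched from Twitter's search API, and it strips the emoticons back out of the text so the classifier can't simply learn the label from the emoticon itself.

```ruby
# Sketch: build a labeled training set by using emoticons as sentiment labels.
# Assumes `tweets` is an array of tweet strings fetched elsewhere.
POSITIVE = [':)', ':-)', ':D'].freeze
NEGATIVE = [':(', ':-('].freeze

def strip_emoticons(text)
  (POSITIVE + NEGATIVE).reduce(text) { |t, emoticon| t.gsub(emoticon, '') }
end

labeled_tweets = tweets.filter_map do |tweet|
  if POSITIVE.any? { |e| tweet.include?(e) }
    [strip_emoticons(tweet), :positive]
  elsif NEGATIVE.any? { |e| tweet.include?(e) }
    [strip_emoticons(tweet), :negative]
  end
  # Tweets with no emoticon are skipped entirely.
end
```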
00:16:30.680
Once we have our tweets and their labels, we must extract features the algorithms can use. This leads us to the 'bag of words' model, which transforms text into analyzable features. The model disregards sentence structure and word order and focuses only on word frequency. We create a dictionary of the words found in our training data set and compile counts of those words for each tweet.
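Here is a minimal bag-of-words sketch in plain Ruby. The tokenizer and dictionary construction are deliberately simple, and `tweets` is again assumed to be the array of training strings.

```ruby
# Bag-of-words sketch: build a dictionary from the training tweets, then turn
# each tweet into a vector of word counts (word order is ignored).
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

# Every distinct word seen in training becomes one position in the feature vector.
dictionary = tweets.flat_map { |t| tokenize(t) }.uniq

def bag_of_words(text, dictionary)
  counts = Hash.new(0)
  tokenize(text).each { |word| counts[word] += 1 }
  dictionary.map { |word| counts[word] }
end

bag_of_words("viagra viagra free", dictionary) # => e.g. [2, 1, 0, 0, ...]
```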