Summarized using AI

Practical Machine Learning and Rails

by Andrew Cantino and Ryan Stout

The video "Practical Machine Learning and Rails," presented by Andrew Cantino and Ryan Stout at RailsConf 2012, introduces attendees to the principles of machine learning and its application within Ruby on Rails applications. The presentation is divided into two parts: an overview of machine learning concepts, followed by practical examples of its application.

Key Points Discussed:

  • Introduction to Machine Learning:

    • Machine learning uses algorithms to make predictions from data. It rests on applied statistics and mathematics, but numerous libraries abstract those details and make the techniques accessible to developers.
  • Data as the Core of Machine Learning:

    • Essential data sources include web logs and user actions that developers can analyze for predictive modeling. The significance of having quality data for building predictive models is emphasized.
  • Classification Techniques:

    • Various classification methods are introduced, including:
    • Decision Trees: A straightforward method resembling flowcharts used to classify data based on questions about features.
    • Support Vector Machines: Ideal for low-dimensional data classification with a focus on maximizing the margin between classes.
    • Naive Bayes: A simple yet effective algorithm often used for spam filtering, which treats each word's occurrence as independent.
    • Neural Networks: Though powerful, they are complex and often produce less interpretable results than other methods.
  • Feature Design:

    • Emphasizes the importance of selecting and transforming data features for algorithms to learn effectively. Overfitting and dimensionality issues are discussed, stressing the need for adequate training datasets.
  • Practical Example - Sentiment Classification:

    • Ryan Stout provides an example of sentiment analysis using tweets, illustrating how to create training sets and extract features through methods like the bag of words model. He discusses practical tools like Weka for implementing machine learning algorithms without heavy mathematical prerequisites.
  • Improvement Strategies:

    • Suggestions include expanding the dictionary for feature extraction, utilizing bi-grams for capturing contextual relations, and incorporating expert insights to derive valuable features.

Conclusions/Takeaways:
  • Attendees should leave with an awareness of the machine learning tools available and how to begin integrating them into their projects. The video serves as a practical guide for developers looking to apply machine learning in real-world Rails applications, with a focus on overcoming the initial hurdles of implementation.

The session encourages interaction and further inquiry into machine learning concepts, promoting continued learning and exploration of this expansive field.

00:00:24.600 Hello everyone. I'm here to talk about Practical Machine Learning and Rails. My name is Andrew Cantino, and I'm the VP of Engineering at Mavenlink. This is Ryan Stout, the founder of Agile Productions. I have been doing machine learning on and off for a couple of years since I studied it in grad school, while Ryan has been working in this area full time for the last year on an upcoming project. We are going to split this talk into two parts. I will provide a theoretical introduction to machine learning, explaining what it is and why it is important. Ryan will then provide several practical examples.
00:01:10.180 This talk aims to introduce you to machine learning and help you understand its significance, as well as encourage you to consider applying it to your projects. My goal is to make you machine learning aware, meaning when you return to your companies and encounter problems well-suited for machine learning solutions, you will know what tools are available and feel confident enough to dive in. It is crucial that you at least know of its existence and are not intimidated by these techniques when addressing relevant issues.
00:01:34.500 However, let me be clear: this talk will not give you a PhD in machine learning. This is an enormous field with many specialized areas, and people earn PhDs in very small parts of it. While we will discuss some algorithms at a high level and give a general sense of how to use them, we will not delve into implementation details. We also will not cover topics like collaborative filtering, optimization, clustering, or advanced statistics, as they fall outside the scope of this talk.
00:02:03.220 So, what is machine learning? Machine learning encompasses numerous algorithms that predict outcomes based on data inputs using applied statistics. At its core, machine learning is applied mathematics. However, there are many libraries available that abstract the complex mathematics from you, making it more accessible. It’s essential to understand that machine learning is not magic; it relies solely on the data available.
00:02:24.610 What kind of data are we talking about? The web is full of data—APIs, logs, click trails, user decisions—all these contribute to the data sets stored in your SQL or NoSQL databases. You can analyze this data and make predictions using machine learning. Your users leave behind trails of valuable, albeit noisy, data that can be leveraged to solve various problems.
00:02:39.220 What do we do with this data? One approach is classification. Classification involves breaking data into categories and determining which category new data belongs to. For example, consider whether an email is spam or not: we want your inbox to be full of legitimate correspondence, not spam. Other classifications could include determining if something is happy or sad, appropriate or inappropriate, and much more. Ryan will provide a hands-on example of sentiment analysis shortly.
00:02:49.019 In addition to email filtering, classification can apply to sorting documents. For instance, Gmail uses classification for its importance filter, while Aardvark categorizes questions. Similarly, platforms like Amazon classify reviews, and you can classify users according to interests, expertise, and even the likelihood that they will pay for services. You can also classify behavioral events, such as logins and clicks, as normal or abnormal activity.
00:03:10.360 A system can even be developed to automatically detect anomalies like potential intrusions. For example, we could create a system to warn users who might fall victim to a phishing attempt before they click on a malicious link. This indicates that classification offers practical solutions to problems you may encounter in your projects.
00:04:03.220 Let's discuss some algorithms that can facilitate classification. One of the first is decision tree learning. What I appreciate about decision trees is their straightforwardness; they resemble flowcharts. For instance, a decision tree trained on emails labeled "spam" or "ham" can classify a new email using features of that email, such as the presence of certain keywords or the number of attachments.
00:04:33.960 To construct a decision tree, you select the feature that best separates your data into classes. In this theoretical example, you would begin with the feature that provides the most significant distinction between spam and ham. In our case, the word “Viagra” might serve as a strong predictor, determining whether the email is likely to be spam based on how often that feature appears. You can estimate the probability of an email being classified as spam depending on the presence of certain keywords or attachments, allowing you to use this decision tree in practical applications.
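
To make the flowchart idea concrete, here is a minimal Ruby sketch of a hand-built tree like the one described. The features and thresholds are invented for illustration; in practice the tree would be learned from labeled data (for example, with a gem such as decisiontree).

```ruby
# A hand-built decision tree resembling the flowchart described above.
# Features and thresholds are invented for illustration only.
Email = Struct.new(:word_counts, :attachment_count)

def classify(email)
  if email.word_counts.fetch("viagra", 0) > 0
    :spam                          # strong keyword predictor: branch immediately
  elsif email.attachment_count > 3
    :spam                          # unusually many attachments: likely spam
  else
    :ham                           # otherwise treat as legitimate
  end
end

email = Email.new({ "hello" => 2, "viagra" => 1 }, 0)
puts classify(email)  # => spam
```
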
00:05:12.330 Next, let's look at support vector machines, another algorithm for classification.
00:05:16.790 Support vector machines (SVMs) excel at classification tasks where the number of features is limited, roughly fewer than 2,000. For instance, while classifying documents, every word may be treated as a feature, and this can result in poor performance due to SVMs' memory-intensive nature. However, with smaller data sets or less complex tasks, SVMs function exceptionally well. Imagine a scenario where black dots represent spam and white dots represent ham. The goal is to find a line that best separates these two categories using features along two dimensions.
00:06:03.200 This separating line is called a hyperplane when the data has more than two dimensions, and the SVM chooses the one that maximizes the margin between the two classes. While the basic algorithm finds a straight line, kernel functions allow SVMs to fit more complex, curved boundaries. There are libraries like LibSVM available, with Ruby bindings you can incorporate into your applications.
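
For reference, here is a minimal sketch of training an SVM from Ruby, assuming the rb-libsvm gem (one set of Ruby bindings for LibSVM). The two-dimensional feature vectors and labels are invented for illustration.

```ruby
require 'libsvm'  # rb-libsvm gem: gem install rb-libsvm

# Two invented training points in two dimensions: label 1 = spam, -1 = ham.
examples = [[0.2, 0.8], [0.9, 0.1]].map { |ary| Libsvm::Node.features(ary) }
labels   = [1, -1]

problem = Libsvm::Problem.new
problem.set_examples(labels, examples)

parameter = Libsvm::SvmParameter.new
parameter.cache_size = 1      # kernel cache, in megabytes
parameter.eps        = 0.001  # stopping tolerance
parameter.c          = 10     # penalty for misclassified training points

model = Libsvm::Model.train(problem, parameter)
puts model.predict(Libsvm::Node.features([0.3, 0.7]))  # expected: 1.0 (spam side)
```
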
00:06:16.280 Moving forward, I want to mention Naive Bayes, a classification algorithm that performs surprisingly well with text. It calculates probabilities based on the presence of words within a document, treating each word as an independent feature. This approach, while statistically oversimplified, yields effective results with enough data to support it. Essentially, Naive Bayes relies on the premise that each word's occurrence is independent of the others given the class.
00:06:50.280 Let's consider a very simple example to illustrate this process, using a training set of 100 emails where 70 are classified as spam. We could analyze the occurrence of specific words like ‘Viagra,’ which might appear 42 times in spam emails but only once in non-spam. With suitable probabilities derived from such frequency counts, we can utilize Bayes' theorem to predict whether new emails are spam.
00:07:23.300 For instance, if 60% of spam emails contain the word 'Viagra,' we can estimate the likelihood that a new email is spam based on the features present in it. These probabilities let us apply Bayes' theorem to compute the probability that a new email containing 'Hello' and 'Viagra' is spam, and with enough training data that estimate can be quite accurate.
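
The arithmetic for this example fits in a few lines. Here is a sketch using the counts above: 70 of 100 training emails are spam, and 'Viagra' appears in 42 of the 70 spam emails but in only 1 of the 30 ham emails.

```ruby
p_spam      = 70.0 / 100   # prior probability of spam
p_ham       = 1.0 - p_spam
p_word_spam = 42.0 / 70    # P("viagra" | spam) = 0.6
p_word_ham  = 1.0 / 30     # P("viagra" | ham)

# Bayes' theorem: P(spam | "viagra")
posterior = (p_word_spam * p_spam) /
            (p_word_spam * p_spam + p_word_ham * p_ham)
puts posterior.round(3)  # => 0.977, i.e. very likely spam
```
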
00:07:40.890 Next, I want to mention neural networks. You're likely familiar with them, as they have become quite popular in recent years. However, I want to caution you about using them, especially given that support vector machines generally perform better for many applications. Neural networks involve multiple layers: an input layer, one or more hidden layers, and an output layer. They are designed to mimic the human brain at a high level, which makes them intriguing but complex. The challenge with neural networks lies in their interpretability and their tendency to overfit data. While they are capable of learning complex functions, determining the optimal number of hidden layers and parameters can be quite difficult.
00:08:42.730 The input layer takes in features, such as the count of certain words or the pixel colors in an image. In practice, neural networks can work well on images, yet they require considerable data and fine-tuning to achieve satisfactory performance. If you're interested in using neural nets, you should evaluate carefully whether they're necessary for your specific application or whether simpler algorithms would be more suitable.
00:09:40.950 Before I hand it over to Ryan, I want to discuss two high-level concepts that you need to understand when venturing into machine learning: the curse of dimensionality and overfitting. The curse of dimensionality refers to the fact that as you add more features, the amount of data required to learn an effective classifier increases exponentially.
00:09:55.160 For instance, more features create a high-dimensional space that requires significantly more data points to inform the learning algorithm sufficiently. Although it isn't strictly true that algorithms need to fill that entire volume of feature space, it is a good principle to keep in mind that the more features you have, the more data you’ll need to train effectively.
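
A back-of-the-envelope illustration of that growth: if each feature is bucketed into 10 bins, the number of distinct regions in feature space, and hence the data needed to cover it, grows exponentially with the number of features.

```ruby
bins = 10
[2, 5, 10, 20].each do |dims|
  puts "#{dims} features -> #{bins**dims} regions"
end
# 2 features  -> 100 regions
# 5 features  -> 100000 regions
# 10 features -> 10000000000 regions
# ...
```
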
00:10:14.480 The other concept, overfitting, relates to the challenge of creating an algorithm that generalizes well to unseen data while avoiding memorization of the training data. Parameters define the complex characteristics your algorithm can learn, but with too many parameters, you run the risk of the algorithm learning noise rather than useful patterns. To mitigate this risk, one effective approach is to use separate data sets for training and testing; ensuring you validate your model against unseen data increases the likelihood that it will perform well in real-world scenarios.
00:11:06.750 Cross-validation is a technique that allows you to estimate a model's performance by splitting the original data into k subsets and iteratively training and testing the model on varying partitions. This allows you to assess how well your model generalizes to different samples while maximizing the amount of training data available.
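
Here is a minimal sketch of k-fold cross-validation in Ruby. The "model" is a deliberately trivial majority-class classifier so the example runs end to end; any real training and scoring steps could be substituted.

```ruby
# Train: return the most common label in the training pairs.
def train(pairs)
  pairs.map(&:last).tally.max_by { |_, count| count }.first
end

# Score: fraction of held-out pairs whose label matches the prediction.
def accuracy(predicted_label, pairs)
  pairs.count { |_, label| label == predicted_label } / pairs.size.to_f
end

def cross_validate(pairs, k: 5)
  folds  = pairs.shuffle.each_slice((pairs.size / k.to_f).ceil).to_a
  scores = folds.each_index.map do |i|
    held_out = folds[i]                                    # test on fold i
    training = (folds[0...i] + folds[(i + 1)..-1]).flatten(1)
    accuracy(train(training), held_out)
  end
  scores.sum / scores.size                                 # mean accuracy
end

data = [["cheap pills", :spam], ["meeting at 3pm", :ham],
        ["win money now", :spam], ["lunch tomorrow?", :ham],
        ["free offer inside", :spam]]
puts cross_validate(data, k: 5)
```
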
00:12:00.000 Before I pass the stage to Ryan, I want to highlight that algorithms often latch onto the simplest patterns they can detect in the training data. An illustrative example comes from a military project that aimed to train a classifier to detect tanks camouflaged among trees. The scientists believed their model worked, but in practice it failed: it had latched onto extraneous characteristics of the photos and ended up detecting cloudiness rather than tanks, because the two sets of images had been taken under different conditions.
00:12:54.970 To maximize learning efficacy, you should collect diverse data from varied environmental contexts, ensuring that the model incorporates the right features. With that, I’d like to introduce Ryan, who will work through some practical examples for you.
00:14:23.769 Thank you, everyone. Can everyone hear me? I want to provide you with a couple of examples. When I was initially learning about machine learning, I watched many videos online filled with mathematical concepts that were challenging to comprehend. I find it especially helpful to see real-world applications of these ideas. Usually, when attempting to use machine learning, you won’t directly implement algorithms but rather use excellent existing tools to get started.
00:15:06.380 For example, I want to start with sentiment classification, also referred to as sentiment analysis. In this case, you analyze a body of text and determine if it is positive or negative. Companies often employ this technique to assess customer product sentiment on social media platforms like Twitter. By counting positive and negative mentions, they can gauge product performance.
00:15:51.310 To conduct sentiment analysis, we first need a training set. In this example, we will analyze tweets and label each as positive or negative. Manually labeling thousands of tweets is labor-intensive, so one efficient method is to use emoticons, like smiley faces for positive sentiments and frowning faces for negativity. By retrieving tweets including these emoticons, we can quickly assign sentiment labels.
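
A sketch of that emoticon trick, assuming tweets is an array of tweet strings already fetched (for example, from Twitter's search API). The emoticon is stripped from the text so it cannot leak the label to the classifier.

```ruby
POSITIVE = [":)", ":-)", ":D"].freeze
NEGATIVE = [":(", ":-("].freeze

tweets = [
  "Loving the new release :)",
  "My build is broken again :(",
  "Shipping code today"          # no emoticon: skipped
]

labeled = tweets.filter_map do |text|
  if POSITIVE.any? { |e| text.include?(e) }
    [text.gsub(Regexp.union(POSITIVE), "").strip, :positive]
  elsif NEGATIVE.any? { |e| text.include?(e) }
    [text.gsub(Regexp.union(NEGATIVE), "").strip, :negative]
  end
end

p labeled
# => [["Loving the new release", :positive], ["My build is broken again", :negative]]
```
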
00:16:30.680 Once we have our tweets and their labels, we must extract features the algorithms can utilize. This leads us to the 'bag of words' model, which transforms text into analyzable features. This model disregards sentence structure and word order, focusing only on word frequency. We create a dictionary of the words found in our training data set and compile counts for each corresponding tweet.
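
To make the bag-of-words idea concrete, a minimal sketch: build a dictionary from the training tweets, then represent each tweet as a vector of word counts. The example tweets are invented.

```ruby
tweets = ["great phone love it", "hate this phone", "love love love"]

# Dictionary: every unique word seen in the training set.
dictionary = tweets.flat_map { |t| t.downcase.split }.uniq

# One count vector per tweet, with one slot per dictionary word.
vectors = tweets.map do |tweet|
  counts = Hash.new(0)
  tweet.downcase.split.each { |word| counts[word] += 1 }
  dictionary.map { |word| counts[word] }
end

p dictionary   # => ["great", "phone", "love", "it", "hate", "this"]
p vectors[0]   # => [1, 1, 1, 1, 0, 0]
p vectors[2]   # => [0, 0, 3, 0, 0, 0]
```
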