People who liked this talk also liked ... Building Recommendation Systems Using Ruby

Ruby

Ryan Weald

#recommendation-systems

#machine-learning

#ruby

#data-science

by Ryan Weald

In this talk presented at LA RubyConf 2013, Ryan Weald delves into building recommendation systems using Ruby. He begins by establishing his credentials as a data scientist from a startup focused on content recommendations. The core aim of the presentation is to provide an overview of recommendation systems, which predict user preferences based on data from users and items.

Key Points Covered:

- Introduction to Recommendation Systems: Weald explains that recommendation systems are ubiquitous online and essential for enhancing user engagement. Famous examples include LinkedIn’s people suggestions, Netflix’s movie recommendations, and Amazon’s product links.

- Two Main Types of Recommendation Algorithms:
  - Collaborative Filtering: This algorithm predicts preferences based on the behavior of similar users. Weald elaborates on memory-based and model-based collaborative filtering, focusing primarily on the simpler memory-based methods involving user-item matrices and similarity metrics such as the Pearson correlation coefficient and cosine similarity.
  - Content-Based Recommendations: These systems recommend items based on their features rather than user behavior, using techniques like k-means clustering to categorize items and recommend based on similarity within clusters.
- Hybrid Systems: Weald suggests hybrid models that integrate both collaborative filtering and content-based approaches to enhance recommendation quality.
- Challenges in Recommendation Systems: He discusses several challenges, including the cold start problem, data sparsity, and the limitations of content-based recommendations. He emphasizes the importance of understanding the underlying algorithms rather than relying solely on libraries, which can obscure the debugging process when issues arise.
- Evaluation Metrics: To evaluate recommendation systems, Weald introduces concepts like precision and recall, along with the significance of tracking user interactions with recommendations.
- Existing Libraries: He concludes by mentioning libraries like Apache Mahout for larger datasets and SciRuby for scientific computing, providing avenues for further exploration of recommendation systems.

In summary, the talk highlights the foundational concepts of recommendation systems and encourages attendees to explore building their own systems in Ruby, stressing that these technologies are accessible without an advanced degree in the field. A question-and-answer session at the end touched on personalized, context-aware recommendations and their practical applications.

00:00:11.059 Thank you.

00:00:24.420 All right, so today I'm going to talk about building recommendation systems in Ruby. But first, you're probably wondering who I am and what I know about recommendation systems. I'm far too young to have a PhD, so what gives me the authority to speak to you about this subject?

00:00:41.700 Currently, I'm a data scientist at a startup in San Francisco called Sharethrough. We are a native advertising platform, meaning we take branded content from across the web and promote it on other websites. Our goal is to make our ads feel native, which means they have to blend in with the content on that site. Ultimately, our ad targeting comes down to content recommendations.

00:00:54.059 A large part of my job is building content recommendation systems to power the ads we hope to serve. Before I get going, I have to give you a warning: There is going to be a little bit of math coming up. I know it's late in the day, so I'll try to keep it as light as possible, but obviously recommendation systems are a math-heavy subject.

00:01:11.040 My goal for this talk is basically to start by describing what a recommendation system is. We need to understand what they are before diving deeper. Then I will look at collaborative filtering-based recommendation systems, which are probably the most common type you've heard of before. Next, I'll move on to content-based recommendation systems, and finally, we will look at hybrid algorithms, which combine both collaborative filtering and content-based systems.

00:01:31.979 I will also touch on how to evaluate recommendation systems, and finally, I will provide you with some resources and point out a couple of existing libraries so that if you want to learn more, you can find additional information on your own.

00:01:54.659 This talk is not going to cover everything there is to know about recommendation systems; that is a very complex subject with tons of people pursuing PhDs in it, and companies employing entire departments focused on it. My goal is really to give you an overview so you can have a good foundation in understanding what's going on behind the scenes of these common algorithms. This way, you'll know enough to look further on your own and won't view the entire topic as a big black box. As a result, it's also not going to be bleeding-edge machine learning; this isn't the right audience for that.

00:02:19.440 I enjoy geeking out on that, but I recognize it might not be of interest or relevant to you. Additionally, this talk is not going to discuss how to use a specific library. I believe it's essential to understand what an algorithm does at its core, otherwise you'll end up in situations where you're using a library and won't know how to troubleshoot when things go wrong.

00:02:38.240 So let's start off with what a recommendation system is. Simply put, a recommendation system is a program that predicts a user's preferences using information about other users, the user themselves, and the items in the system. These systems are prevalent throughout the web—pretty much every big company uses them across all domains.

00:02:59.940 A great example in the social space is LinkedIn, which provides recommendations for people you may know and organizations you might be interested in. Netflix famously ran a million-dollar bounty for improving their recommendation systems, and almost every time you log in, you see the top ten movies they think you're most likely to like. This helps them drive engagement and keep users watching movies.

00:03:41.580 Spotify also does recommendations; radio, at its core, is just a recommendation product. It predicts what songs you likely want to hear based on the tracks you've previously listened to. The most common example, however, is found on Amazon with their 'Customers who bought this item also bought' feature.

00:04:10.319 So, how do you build one of these things? What is really happening underneath the surface of all these products we encounter daily? It turns out that most recommendation systems are powered by two main categories of algorithms: collaborative filtering, often referred to as nearest neighbor or neighborhood-based algorithms, and content-based algorithms. Content-based algorithms are sometimes just called classification algorithms since, at their core, they essentially perform a classification task.

00:05:04.620 Let's start with collaborative filtering. Collaborative filtering is a method of predicting user preferences based on other users or similar items. For example, if I haven't rated movie A, we can use information about other users to fill in or infer my preference for that movie.

00:05:40.560 Within collaborative filtering, there are really two types of algorithms: memory-based and model-based. Memory-based systems use similarity metrics between users or items to infer preferences, while model-based systems tend to be much more complicated. They train a model offline that tries to explain the underlying phenomena, and then use that model to fill in the blanks. I won't dive into model-based algorithms today: they are too complex to cover in a short talk and, unfortunately, Ruby doesn't have great tools for them.

00:06:21.660 So, let’s focus on memory-based collaborative filtering. The most common memory-based technique is user-based collaborative filtering. This method relies on two main components: a user-item matrix and a similarity function. The user-item matrix is quite simple; it represents how users have rated certain items. The goal of collaborative filtering is to fill in the values for any missing entries in that matrix.
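As a rough illustration of the user-item matrix described above, here is a minimal sketch that stores it as a nested Ruby hash. The user names, video names, ratings, and helper methods (`all_items`, `unrated_items`, `shared_items`) are hypothetical and are not taken from the talk's slides.

```ruby
# Hypothetical ratings: user => { item => rating on a 1-5 scale }.
# Items a user has not rated are simply absent from that user's hash.
RATINGS = {
  "alice" => { "video_a" => 5, "video_b" => 3, "video_c" => 4 },
  "bob"   => { "video_a" => 4, "video_c" => 5, "video_d" => 2 },
  "carol" => { "video_b" => 2, "video_d" => 5 }
}.freeze

# Every item that appears anywhere in the matrix.
def all_items(ratings)
  ratings.values.flat_map(&:keys).uniq.sort
end

# The blanks that collaborative filtering tries to fill in for one user.
def unrated_items(ratings, user)
  all_items(ratings) - ratings.fetch(user).keys
end

# Items both users have rated; similarity metrics are computed over these.
def shared_items(ratings, user_x, user_y)
  ratings.fetch(user_x).keys & ratings.fetch(user_y).keys
end
```

A sparse hash like this only stores ratings that exist, which is one simple way to avoid materializing a mostly empty matrix.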

00:07:28.500 After obtaining the user-item matrix, you need a similarity function to determine which users are similar to the user for whom you're creating recommendations. There are two common similarity functions used here: the Pearson correlation coefficient and cosine similarity. The Pearson correlation measures how correlated two users are, whereas cosine similarity is based on the angle between two feature vectors.
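Cosine similarity is not walked through in the talk, so the following is only a sketch under one assumption: each user's ratings are a hash of item to rating, and the vectors are restricted to the items both users have rated. The method name `cosine_similarity` is ours.

```ruby
# Cosine of the angle between two users' rating vectors, computed over
# the items both users have rated. Returns 0.0 when there is no overlap
# or when either vector has zero length.
def cosine_similarity(ratings_x, ratings_y)
  shared = ratings_x.keys & ratings_y.keys
  return 0.0 if shared.empty?

  dot    = shared.sum { |item| ratings_x[item] * ratings_y[item] }
  norm_x = Math.sqrt(shared.sum { |item| ratings_x[item]**2 })
  norm_y = Math.sqrt(shared.sum { |item| ratings_y[item]**2 })
  return 0.0 if norm_x.zero? || norm_y.zero?

  dot / (norm_x * norm_y)
end
```

Unlike the Pearson correlation, this version does not subtract each user's average rating, so a user who rates everything high and one who rates everything low can still look similar.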

00:08:25.859 Let’s take a closer look at the Pearson correlation coefficient, as it's the most prevalent and provides the best results. The formula for computing the similarity between users X and Y may look intimidating at first glance, with Greek letters flying around, but when we break it down, it becomes more straightforward.
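For reference, one common way to write the formula is shown here. The notation is ours rather than the slide's: $I_{xy}$ is the set of items both users have rated, $r_{x,i}$ is user $x$'s rating of item $i$, and $\bar{r}_x$ is user $x$'s average rating, assumed here to be taken over the shared items.

$$
\mathrm{sim}(x, y) = \frac{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)^2}\,\sqrt{\sum_{i \in I_{xy}} (r_{y,i} - \bar{r}_y)^2}}
$$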

00:09:50.760 We start by looking at the Sigma notation, which essentially represents a sum—a basic for loop. The key question is: What are we summing over? It's a sum of all shared items between the two users. So, in Ruby code, we iterate over all shared items that both users have rated.

00:10:43.680 Next, we calculate averages. For each rating, we subtract the user's average rating from the observed value. In Ruby, this is a simple loop over the shared items, with each user's average computed first.

00:11:54.540 Once we've tackled the numerator, we head to the denominator, which involves squaring each rating's deviation from the user's average. As with the numerator, the Ruby remains uncomplicated because we've already calculated the averages: we iterate over the shared items, square each deviation, sum the squares, and take the square root.

00:13:00.360 The original formula for the Pearson correlation coefficient boils down to about 40 lines of Ruby code. This core metric powers a lot of the recommendations you see daily on platforms like Amazon and LinkedIn.
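The slide code is not reproduced in this transcript, so the following is a compact sketch of the steps just described rather than the speaker's implementation. It assumes each user's ratings are a hash of item to rating and that averages are taken over the shared items only; the method name `pearson` is ours.

```ruby
# Pearson correlation between two users, computed over shared items.
# Returns a value between -1.0 and 1.0, or 0.0 when it is undefined
# (fewer than two shared items, or a user whose shared ratings are all equal).
def pearson(ratings_x, ratings_y)
  shared = ratings_x.keys & ratings_y.keys
  return 0.0 if shared.size < 2

  # Average rating of each user over the shared items.
  mean_x = shared.sum { |i| ratings_x[i] }.to_f / shared.size
  mean_y = shared.sum { |i| ratings_y[i] }.to_f / shared.size

  # Numerator: sum of the products of the two users' deviations.
  numerator = shared.sum { |i| (ratings_x[i] - mean_x) * (ratings_y[i] - mean_y) }

  # Denominator: square root of each user's summed squared deviations.
  sum_sq_x = shared.sum { |i| (ratings_x[i] - mean_x)**2 }
  sum_sq_y = shared.sum { |i| (ratings_y[i] - mean_y)**2 }
  denominator = Math.sqrt(sum_sq_x) * Math.sqrt(sum_sq_y)
  return 0.0 if denominator.zero?

  numerator / denominator
end
```

Returning 0.0 for the undefined cases is a design choice made here so that such users rank as "no evidence either way" instead of raising an error.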

00:14:09.240 With this ability to compute similarity between two users, how do we turn that into a recommendation? Let’s revisit the user-video matrix as an example. The Ruby code required to turn this matrix into recommendations is straightforward: we load our user data, then compute the Pearson correlation coefficient between the user we're interested in and every other user.

00:14:49.740 We focus on the top K users—let’s say K equals two. After determining the correlation, we take the top K users and average their ratings for all the videos they have rated, then sort and return the top recommendations.
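The steps above can be sketched as follows. This is an illustration, not the code from the slides: the `recommend` method, its `k:` and `limit:` keyword arguments, and the choice to skip videos the target user has already rated are all assumptions made here.

```ruby
# Pearson correlation over shared items (0.0 when undefined).
def pearson(a, b)
  shared = a.keys & b.keys
  return 0.0 if shared.size < 2

  mean_a = shared.sum { |i| a[i] }.to_f / shared.size
  mean_b = shared.sum { |i| b[i] }.to_f / shared.size
  numerator = shared.sum { |i| (a[i] - mean_a) * (b[i] - mean_b) }
  denominator = Math.sqrt(shared.sum { |i| (a[i] - mean_a)**2 }) *
                Math.sqrt(shared.sum { |i| (b[i] - mean_b)**2 })
  denominator.zero? ? 0.0 : numerator / denominator
end

# User-based collaborative filtering:
#   1. score every other user against the target user,
#   2. keep the k most similar users,
#   3. average their ratings for items the target has not rated,
#   4. sort by that average and return the top `limit` items.
def recommend(ratings, user, k: 2, limit: 3)
  mine = ratings.fetch(user)

  neighbors = ratings.reject { |name, _| name == user }
                     .max_by(k) { |_, theirs| pearson(mine, theirs) }
                     .map { |_, theirs| theirs }

  collected = Hash.new { |hash, item| hash[item] = [] }
  neighbors.each do |theirs|
    theirs.each { |item, rating| collected[item] << rating unless mine.key?(item) }
  end

  collected.map { |item, values| [item, values.sum.to_f / values.size] }
           .sort_by { |item, average| [-average, item] }
           .first(limit)
end
```

Each result is an `[item, average_rating]` pair. A plain average treats both neighbors equally; weighting each rating by the neighbor's similarity is a common refinement that this sketch leaves out.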

00:15:56.700 However, collaborative filtering does come with challenges. The most common is the cold start problem: if your recommendation system relies on having previous ratings, it becomes difficult to recommend items if you lack sufficient data.

00:16:34.320 Another issue is data sparsity; in cases with large datasets, users typically won’t rate a huge number of items, leading to sparse data which complicates recommendations. Additionally, memory-based algorithms can be resource-intensive as they require entire matrices to be stored in memory, resulting in high computational costs.

00:17:58.260 Now, let's discuss content-based recommendations. Instead of relying on user behavior patterns, content-based recommendations focus purely on the item's features. For instance, if you have videos, you might classify based on content type, duration, or rating.

00:19:31.740 For example, using k-means clustering to group items into distinct categories allows you to recommend items within the same cluster. Again, this complex problem becomes simpler when broken down into Ruby code: you normalize the video features and then run k-means on them.
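As a rough sketch of that idea, the code below min-max normalizes numeric feature vectors, runs a small k-means (Lloyd's algorithm), and returns the other items in the same cluster. Everything here is an assumption for illustration: the method names, the choice of min-max scaling, and the deterministic seeding with the first `k` points, which real implementations usually replace with random or k-means++ initialization.

```ruby
# Scale each feature column to [0, 1] so no single feature dominates distance.
def normalize(rows)
  columns = rows.transpose
  mins = columns.map(&:min)
  maxs = columns.map(&:max)
  rows.map do |row|
    row.each_with_index.map do |value, j|
      range = maxs[j] - mins[j]
      range.zero? ? 0.0 : (value - mins[j]).to_f / range
    end
  end
end

def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

def nearest_index(point, centroids)
  centroids.each_index.min_by { |c| squared_distance(point, centroids[c]) }
end

# Returns one cluster index per point.
def kmeans(points, k, max_iterations: 100)
  centroids = points.first(k).map(&:dup) # deterministic seeding (assumption)
  assignments = nil
  max_iterations.times do
    new_assignments = points.map { |point| nearest_index(point, centroids) }
    break if new_assignments == assignments # converged

    assignments = new_assignments
    centroids = centroids.each_index.map do |c|
      members = points.each_index.select { |i| assignments[i] == c }.map { |i| points[i] }
      next centroids[c] if members.empty? # keep an empty cluster's old centroid

      members.transpose.map { |column| column.sum.to_f / column.size }
    end
  end
  assignments
end

# Other items that landed in the same cluster as `name`.
def same_cluster_items(names, assignments, name)
  cluster = assignments[names.index(name)]
  names.each_index
       .select { |i| assignments[i] == cluster && names[i] != name }
       .map { |i| names[i] }
end
```

A typical call would be `assignments = kmeans(normalize(features), 2)` followed by `same_cluster_items(names, assignments, "some_video")`, where `features` holds one numeric row per video, for example duration and average rating.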

00:20:54.600 Content-based recommendations also come with challenges. If pre-categorized data isn't available, you are limited to unsupervised learning, which is harder to get right, and creating large training data sets that are already classified can be expensive. There’s also the limitation that these recommendations don't account for user preferences, which matter when suggesting new or unexpected content.

00:22:07.020 Hybrid recommendations can address some of these limitations by combining collaborative filtering and content-based approaches. A hybrid system uses input from both algorithms and merges their recommendations to enhance overall quality.
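One simple way to merge the two sets of recommendations is a weighted sum of their scores. The talk does not specify a merging strategy, so this is only one possible sketch. It assumes both recommenders return a hash of item to score on the same 0 to 1 scale; the names `hybrid_scores`, `top_items`, and `collaborative_weight` are hypothetical.

```ruby
# Blend two score hashes. An item missing from one recommender
# contributes 0.0 from that side.
def hybrid_scores(collaborative, content_based, collaborative_weight: 0.5)
  content_weight = 1.0 - collaborative_weight
  (collaborative.keys | content_based.keys).to_h do |item|
    score = collaborative_weight * collaborative.fetch(item, 0.0) +
            content_weight * content_based.fetch(item, 0.0)
    [item, score]
  end
end

# Highest-scoring items first; ties are broken by item name.
def top_items(scores, limit)
  scores.sort_by { |item, score| [-score, item] }.first(limit).map(&:first)
end
```

Shifting `collaborative_weight` toward 0.0 leans on item features, which can help when a new user or item has few ratings; shifting it toward 1.0 leans on user behavior.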

00:23:55.380 To evaluate the quality of a recommendation system, we can consider approaches like precision versus recall, where precision measures what fraction of the recommended items are actually relevant, while recall measures what fraction of the relevant items the system manages to recommend. In practical terms, tracking how users respond to recommendations is vital, which can be measured through clicks and click-through rates.
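These metrics are short enough to sketch directly. The method names are ours, and the code assumes `recommended` and `relevant` are arrays of item identifiers without duplicates, where "relevant" means items the user actually liked in held-out data.

```ruby
# precision = hits / number of recommended items
# recall    = hits / number of relevant items
def precision_and_recall(recommended, relevant)
  hits = (recommended & relevant).size
  precision = recommended.empty? ? 0.0 : hits.to_f / recommended.size
  recall    = relevant.empty? ? 0.0 : hits.to_f / relevant.size
  { precision: precision, recall: recall }
end

# Fraction of shown recommendations that were clicked.
def click_through_rate(clicks, impressions)
  impressions.zero? ? 0.0 : clicks.to_f / impressions
end
```

The two metrics pull against each other: recommending more items tends to raise recall and lower precision, which is why they are usually reported together.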

00:25:56.460 In summary, we’ve learned that collaborative filtering relies on similar users to make recommendations, that content-based algorithms such as k-means clustering can also provide recommendations, and that combining these two methods enhances quality and performance.

00:27:03.600 I've shared some of the existing libraries that can help you with recommendations, so you don't have to write extensive boilerplate code yourself. One well-known option is Apache Mahout, a robust recommender framework that runs on Hadoop for larger datasets.

00:27:58.140 For Ruby developers, libraries like the SciRuby project can help with the scientific computing needed to build these algorithms. Lastly, if you want to benchmark various recommendation algorithms, check out recommenderlab in R for rapid, iterative exploration.

00:31:01.620 Now, as I conclude this presentation, please feel free to ask any questions!

00:31:43.140 One great question was about the state of the art in providing recommendations based on user context. More advanced collaborative filtering techniques incorporate factors such as time, weighting interactions by how recently they occurred. Adding this kind of personalization can substantially improve recommendation quality.
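The answer does not name a specific weighting scheme. One common option, shown here purely as an assumption, is exponential decay with a half-life: an interaction loses half its weight every `half_life_days`. The method names and the interaction hash keys (`:rating`, `:age_in_days`) are hypothetical.

```ruby
# Weight of an interaction that happened `age_in_days` ago.
def recency_weight(age_in_days, half_life_days: 30.0)
  0.5**(age_in_days.to_f / half_life_days)
end

# Average rating in which recent interactions count for more.
# `interactions` is an array of hashes: { rating: ..., age_in_days: ... }.
def decayed_average(interactions, half_life_days: 30.0)
  weights = interactions.map do |interaction|
    recency_weight(interaction[:age_in_days], half_life_days: half_life_days)
  end
  total = weights.sum
  return 0.0 if total.zero?

  interactions.zip(weights).sum { |interaction, weight| interaction[:rating] * weight } / total
end
```

A shorter half-life makes the system react faster to changing tastes, at the cost of discarding older but possibly still useful signal.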

00:32:30.000 If no further questions arise, thank you all for your time and attention today!

LA RubyConf 2013