
Collective Intelligence


GoRuCo 2008

00:00:18.230 One thing I noticed is that the people way in the back may not be able to read the code that I have in my slides.
00:00:25.169 If you go to my website real quick, I posted the slides in PDF format so that if you wanted to download them and follow along, you can. I made the font pretty big, but since there are only these screens... okay.
00:00:31.289 Real quick, I work for Mint Digital, and like everybody else, we're hiring! I study at Columbia University, and one of the things I'm really interested in is this topic: Collective Intelligence.
00:00:40.140 So, this is like the longest title for a talk ever. But what do I mean when I say 'collective intelligence'? Really, this talk is about recommendation systems.
00:00:48.089 If you've read the O'Reilly book, recommendation systems occupy like one or two chapters in the book. There's a lot of other stuff in there, but for this talk, I'm just going to focus on recommendation systems, also known as collaborative filtering.
00:01:00.559 The idea is that you're bringing in data from a lot of users and putting it to use to make filtering choices for everybody.
00:01:08.400 I want to ground this in reality and start with some examples of it in the wild. This is one you're probably all familiar with: Amazon's book recommender, which, in the beginning, when I first started using Amazon, was pretty good. I bought a book on JXTA, which is a Java P2P library, about six years ago, and now it's still recommending stuff related to that, so it's not always accurate.
00:01:25.979 LinkedIn has a similar feature: if you look on the right-hand side, it says 'People you may know.' By the way, Jacob Harris, why have you not linked with me? What the hell?
00:01:34.680 Facebook also has a 'People you may know' section.
00:01:40.709 Beyond just products and people, music obviously plays a role. Last.fm has a whole section dedicated to recommending music to you. There's also another example from Snooth, which deals with wine recommendations. I did a search on Pinot Grigio, and here on the left, you can see similar wines.
00:02:09.990 Yahoo Movies has something similar. Now, I don't really use Yahoo Movies much, but when I first started, they asked me to rate some films, and the movies they were asking me to rate, I hadn’t seen. So, unfortunately, there’s that issue of where you get your data from and how you make a recommendation system that's legit before you have data from specific users.
00:02:28.890 These were the recommendations it made, which I wasn’t really interested in. Again, with recommendation systems, it's kind of a sliding scale: sometimes you get good recommendations, and sometimes you don't.
00:02:39.509 There are a ton of other examples: StumbleUpon, Google Suggest, iTunes, IMDb, Rotten Tomatoes, Yelp... There was a new one that just came on the scene a week or two ago called The Filter, which is again just like a recommendation system for content. But the one recommendation system that is out in the wild that I consider the biggest, baddest, and most successful example is Google Ads.
00:03:06.360 They track click-throughs and stuff like that, so it's contextually based, but it’s also based on people clicking through on the ads. They’re recommending ads that they think you’ll click.
00:03:18.300 So, after all this, this is kind of like the sales pitch for why you should care. Why would you actually want to implement a recommendation system? This idea of serendipity—other than being an ice cream shop north of here and a movie with John Cusack—also encompasses the phenomenon of finding valuable or agreeable things not sought for.
00:03:41.350 At least in theory, I think this leads to happy users and additional sales; with Amazon, if you're selling products, the benefit is pretty obvious. It also results in more page views or visits, which is good if you're based on an ad model.
00:04:25.970 Some other ways to think about the problem of recommending stuff include thinking of it as a ranking problem. If you have, say, 500 things that you could possibly recommend to a user, how do you rank them and decide which are the few things you want to list on top? It’s a prediction problem where you're predicting whether or not a user is going to be interested in something, like movies, where you're giving it a star rating.
00:05:26.520 It could also just be a classification problem: would they be interested, or would they not? Before I get into some crazy mathematical stuff, I wanted to show a really simple example—a Twitter-like 'who you should follow' recommender.
00:05:46.820 At the time when I wrote this, there wasn’t one available, but of course now there is—just not by Twitter, but by someone else. Here’s the basic code: I’m just getting the user, getting their friends, and then for each of their friends, I’m getting their friends. So, I’m going one level deep and creating a list of those people, those IDs.
00:06:11.880 My original idea when I did this was that I could look at the friends of my friends and do a pairwise intersection: if two of my friends are following the same person, that person may be interesting for me to follow as well. The truth is, you could simply count how many of your friends are following that person. Here's the basic code: I just go through each pair and do the intersection to find the common friends.
00:06:39.380 Once I've done that, I take out all the people I'm already friends with, and of course take out myself, because realistically I'm probably going to show up in that list. Then I just sort them based on how many of my friends are following them.
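(For reference, here's a minimal sketch of that approach in Ruby. The friend_ids_for helper is hypothetical; it stands in for whatever Twitter API call returns the IDs a given user follows, and the method names here are not the actual code from the slides.)

```ruby
# Minimal sketch of the "who to follow" idea: count how many of my friends
# follow each friend-of-a-friend, then rank candidates by that count.
# `friend_ids_for` is a hypothetical helper wrapping the Twitter API call
# that returns the IDs a given user follows.

def recommend_follows(user_id, limit = 10)
  my_friends = friend_ids_for(user_id)

  # Go one level deep: for each of my friends, look at who they follow.
  counts = Hash.new(0)
  my_friends.each do |friend_id|
    friend_ids_for(friend_id).each { |candidate| counts[candidate] += 1 }
  end

  # Take out people I already follow, and take out myself.
  counts.delete(user_id)
  my_friends.each { |friend_id| counts.delete(friend_id) }

  # Sort candidates by how many of my friends follow them.
  counts.sort_by { |_id, count| -count }.first(limit)
end
```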
00:07:18.700 The question is: how good is this? Does it actually produce legitimate results? It's really simple, but I found it to be pretty effective. I ran it for Trotter, and at the time I ran it, he was already following five of the top recommendations. That, for me, was how I evaluated whether or not the algorithm was making good recommendations: I figured if they're already following some of the top recommendations I'm making, that's probably pretty good.
00:08:08.450 Just going that level deep, you can see that there are about 1,429 people that I can possibly recommend. I just picked off the top 10, and as you can see from the different names, those are all people in the Ruby community that he should probably be following.
00:08:42.910 Now, I want to talk about the actual stuff that’s a bit more complex than something like a follow recommender. When dealing with these kinds of systems, there are two different kinds of data you can get: implicit data and explicit data.
00:09:06.250 Examples of implicit data include purchased items or viewed items, such as the articles a user clicked on in a feed reader. Time spent viewing items and click-throughs also count. Last.fm has this thing called scrobbling, a plug-in for iTunes that tracks what you listen to and reports it back.
00:09:43.540 Explicit data is when you ask the user for actual information, like giving them two things and asking which of these do you like better. There's a website like command shift three that does that, and there’s the classical star rating or numerical rating—for instance, on Netflix, if you’re giving it stars or whatever. There’s also an ordered list, which is just giving you a list of items and asking you to rank them.
00:10:36.570 There are also different types of recommendations. There are content-based recommendations; part of what AdSense does is obviously content-based. I think another good example is comparing Last.fm and Pandora. Both are music sites that make recommendations, but they take different approaches: Last.fm is statistical, whereas Pandora is content-based, because they have people tagging content as it comes into the system.
00:11:13.810 There’s user-based, where these users should follow each other, and then there’s item-based, where if you bought this item or are interested in this item, these other items are similar to it.
00:11:50.040 The basic strategy for putting this together is to map the data into a Euclidean space. I’ll define that a little bit better in a second. You calculate the similarity using some sort of metric, and then you use those similarities to recommend whatever the user should be following.
00:12:28.440 That's the really high-level overview. When you’re doing this, you represent the data as vectors, which means it’s just a matrix—a table of data mapping all your data into n-dimensional Euclidean space.
00:13:05.440 To visualize, let’s take an example. In this case, I have users in a movie recommendation system: Paul, Brian, Trotter, and Josh. I have two movies that I’m comparing: Fight Club and The Matrix. The table form might look like this: I thought Fight Club was totally awesome, while Trotter thought they were both mediocre.
00:13:41.000 Brian thought they were both totally awesome, and Josh was somewhere in the middle. You might notice that this looks a lot like creating data points in a regular spreadsheet. Fight Club would be your x-axis and The Matrix would be your y-axis.
00:14:15.600 Visually, it looks like this: you have The Matrix and Fight Club on the axes, and Trotter, Josh, Brian, and Paul are represented as specific points indicating where they rank these films. You can see that Trotter and Josh are probably similar; they would likely like the same kind of content, and Brian and Paul may also have similar tastes.
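(To make that concrete, here's the same picture as data in Ruby. The numbers are made up for illustration, with 5 meaning "totally awesome" and 1 meaning "hated it".)

```ruby
# Each user is a point in 2-D space: [Fight Club rating, The Matrix rating].
# These values are illustrative, not the actual data from the slide.
ratings = {
  "Paul"    => [5, 2],
  "Brian"   => [5, 5],
  "Trotter" => [3, 3],
  "Josh"    => [3, 4]
}
# Plot Fight Club on the x-axis and The Matrix on the y-axis:
# users whose points sit close together have similar tastes.
```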
00:14:56.270 So here's another example, if you're not dealing with star ratings: people following people. In this case, person zero isn't following anyone, person one is following Trotter, and Brian is following himself.
00:15:26.650 Here’s another movie ratings example with different users rating different movies. In this case, I have floats as well as regular integer values. Each column represents a different movie. As you can imagine, that can blow up quickly.
00:15:59.240 Here’s another visualization of similar items. In this case, we’re taking distance as the metric. You might think of all the pluses down on the lower left as being similar to one another and all the circles as similar, which can represent users or movies.
00:16:35.100 How do you determine similarity? There are several metrics you can use, such as Manhattan distance, Euclidean distance, and cosine similarity. Levenshtein distance can be applied to strings to see how close they are. The only two examples I will give later on are Euclidean and cosine.
00:16:59.070 Graphically, this is what Manhattan and Euclidean look like. Manhattan distance would be the red, blue, and yellow paths, while Euclidean is just the green line, the shortest distance between the two points. Cosine similarity is just the angle between the two vectors.
00:17:36.840 The thing is, choosing cosine similarity versus Euclidean distance has different effects. If you look at point B: if it were higher up and to the right, it could have the same cosine similarity, but the Euclidean distance would be much greater. So you kind of have to play with these metrics to see what yields the best results.
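(As a reference point, here's what those two distance metrics look like for equal-length vectors in plain Ruby; this is a sketch, not the code from the talk.)

```ruby
# Manhattan distance: sum of absolute differences along each dimension.
def manhattan_distance(a, b)
  a.zip(b).sum { |x, y| (x - y).abs }
end

# Euclidean distance: square root of the sum of squared differences.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end
```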
00:18:16.370 One of the libraries I'm using is Linalg, a linear algebra package, originally written in Fortran, that I'm calling from Ruby. Calculating cosine similarity is pretty easy with it: you take the two vectors, compute their dot product, and divide by the product of their norms.
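(The same computation in plain Ruby, without assuming Linalg's exact method names: the dot product of the two vectors divided by the product of their norms.)

```ruby
# Cosine similarity between two equal-length vectors.
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end
```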
00:18:49.900 So, once you have a similarity metric, you can figure out how to find other things that are close to this thing. These are generally called nearest neighbor methods. The really simple one is just k-nearest neighbors. If I have 500 users, I just pick one user and ask who is most like this user, then look through all the other users to find, say, the three most similar.
00:19:29.220 This is easy to compute in parallel, because you can compute the similarity for each point in a different process. But it's computationally expensive, and it's also expensive in terms of data, because you have to look at all the data in your system.
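(A minimal sketch of k-nearest neighbors over a hash of user vectors, reusing the euclidean_distance helper and the illustrative ratings hash from above; the names are mine, not from the slides.)

```ruby
# Find the k users closest to the target user, excluding the target itself.
def k_nearest_neighbors(target_name, vectors, k = 3)
  target = vectors[target_name]
  vectors
    .reject  { |name, _| name == target_name }
    .map     { |name, vec| [name, euclidean_distance(target, vec)] }
    .sort_by { |_name, dist| dist }
    .first(k)
end

# With the illustrative ratings above:
# k_nearest_neighbors("Trotter", ratings)
# => [["Josh", 1.0], ["Paul", 2.23...], ["Brian", 2.82...]]
```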
00:20:07.820 So, some things you can do: there are various partitioning schemes to partition the data into separate areas, allowing you to evaluate only a subset. There’s also dimensionality reduction, which I’ll cover.
00:20:46.750 I want to use a real data example. The problem with many of the recommendation-system examples online is that they use toy datasets with only ten different users or items. Of course you can compute similarity on that, because it's such a small dataset.
00:21:27.290 A few years ago, Netflix announced a challenge to find ways to improve their recommendation algorithm for movies. They released a dataset, offering a million-dollar prize if someone could improve recommendations by ten percent.
00:22:12.080 The dataset consists of 17,000 different movies and about 450,000 users. Obviously, that’s not all the users in their systems, so they just sampled, but the sampling is supposed to represent the distribution across their user base.
00:22:42.400 At first I tried just parsing it, but then I said, okay, I'm just going to put it into a database. Here's the basic schema: I have a users table that just has a Netflix ID column, mapping my users to Netflix's IDs, with an index added, since I'm looking that up a lot.
00:23:20.920 I also created a movies table, which just has the title and whether or not I’ve imported the ratings. Then there’s the ratings table that has the user ID, the movie ID, and the rating itself. I’ve added an index on the user ID and the movie ID because I'm searching for those often.
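(A sketch of what that schema might look like inside a Rails migration; the column and index names are assumptions based on the description, not the actual code.)

```ruby
# Users map my IDs to Netflix's IDs; the index speeds up the frequent lookups.
create_table :users do |t|
  t.integer :netflix_id
end
add_index :users, :netflix_id

# Movies just carry the title and a flag for whether the ratings were imported.
create_table :movies do |t|
  t.string  :title
  t.boolean :ratings_imported, :default => false
end

# Ratings join a user and a movie with the 1-5 star value.
create_table :ratings do |t|
  t.integer :user_id
  t.integer :movie_id
  t.integer :rating
end
add_index :ratings, :user_id
add_index :ratings, :movie_id
```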
00:24:04.180 Here’s how the movie model works: it has many ratings. I order it by user ID, and the reason I’m doing that is because I have to keep track of which movie rating I'm looking at and what comparison rating I'm looking at.
00:24:57.850 Both vectors need to have the same dimensions, but the movie ratings form a sparse matrix, since not every user has rated every movie. Here's the simple algorithm for computing nearest neighbors: I grab the ratings of the movie and then loop through the others for comparison.
00:25:51.150 You might wonder why I grab ratings by movie instead of just doing dot ratings. The problem is that the dataset is so huge it can’t all fit into memory with Active Record. Eager loading creates an obscene amount of memory usage because those top-ranked movies have a couple of hundred thousand ratings.
00:26:59.740 I pull them out into the comparison rating so that on each iteration it'll release memory. I'm using a RatingsPair class to calculate the distance between two different movies: one set of ratings is for The Matrix, and the other is for whatever the other movie is.
00:27:39.300 The calculation for Euclidean distance is ridiculously simple in mathematical terms: sum the squared differences of each dimension and take the square root. But in code, I have to keep track of which movie rating I'm looking at and which comparison rating I'm looking at, and handle the mismatches, because of the nature of the sparse matrix.
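(Here's a sketch of how a RatingsPair-style distance might handle that: a merge-style walk over two rating lists, both sorted by user ID, so that only users who rated both movies contribute. How to treat users who rated just one of the movies is a design choice the talk doesn't spell out, and the method names here are assumptions.)

```ruby
# Sketch: Euclidean distance between two movies' rating lists, each sorted
# by user_id. Only users who rated both movies contribute to the distance.
class RatingsPair
  def initialize(ratings_a, ratings_b)
    @a, @b = ratings_a, ratings_b
  end

  def euclidean_distance
    i = j = 0
    sum = 0.0
    while i < @a.size && j < @b.size
      if @a[i].user_id == @b[j].user_id
        sum += (@a[i].rating - @b[j].rating)**2
        i += 1
        j += 1
      elsif @a[i].user_id < @b[j].user_id
        i += 1   # this user rated the first movie but not the second
      else
        j += 1   # this user rated the second movie but not the first
      end
    end
    Math.sqrt(sum)
  end
end

# The outer loop then fetches one comparison movie's ratings at a time,
# so memory can be reclaimed between iterations.
def nearest_movies(movie_ratings, other_movies, k = 10)
  other_movies.map { |other|
    [other, RatingsPair.new(movie_ratings, other.ratings).euclidean_distance]
  }.sort_by { |_movie, dist| dist }.first(k)
end
```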
00:28:36.350 Here are the results: I used The Matrix to see what would happen. I wanted to pick movies that I thought everybody in the room had seen, so The Fifth Element, Terminator 2, Blade, The Matrix Revolutions, Total Recall, and The Matrix Reloaded were my chosen films.
00:29:34.120 These movies had appropriate matches, but it’s not guaranteed that the recommendations will always be spot on. That’s a value judgment on whether they are considered good recommendations or not.
00:30:20.930 One issue I faced was the dataset size. I didn’t run it on all the movies; I just looked at the top 500 movies with the most ratings, as the dataset was too large. The movie with the highest number of ratings was Miss Congeniality, with around 240,000 ratings.
00:31:30.860 Even the 500th movie had about 50,000 ratings. Having all of that in memory is obviously too demanding, and similarly with CPU usage. Just grabbing the data was challenging.
00:32:14.400 Despite having an index on movie ID, when I pull back the ratings, they aren't stored in sequence on disk, so with 230,000 different records this results in a lot of I/O across different pages.
00:32:56.140 When I compared movies, it took roughly 30 minutes on my laptop with just 500 movies, so with large data volumes, it’s about bending space and time to manage that efficiently.
00:33:43.500 To reduce data loading, prototype selection methods are useful. Netflix already does this with their dataset, so they are not giving the entire dataset but a specific selection. It makes the problem more tractable and less memory intensive.
00:34:40.520 However, the accuracy of recommendations can vary based on how you select prototypes. You can also use dimensionality reduction methods like SVD and clustering to intelligently reduce the data you need to process.
00:35:35.700 Methodologies like k-means clustering allow you to find areas of data where items reside, which helps avoid exhaustive comparisons and keeps calculations isolated to relevant groups.
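(For reference, a bare-bones k-means sketch over the same kind of vectors, reusing the euclidean_distance helper from earlier. Real implementations handle empty clusters, convergence checks, and smarter initialization; the idea is just to assign each point to its nearest centroid, recompute the centroids, and repeat.)

```ruby
# Bare-bones k-means over an array of numeric vectors (a sketch, not production code).
def k_means(points, k, iterations = 10)
  centroids = points.sample(k)

  iterations.times do
    # Assign each point to its nearest centroid.
    clusters = Hash.new { |hash, key| hash[key] = [] }
    points.each do |point|
      nearest = centroids.min_by { |centroid| euclidean_distance(point, centroid) }
      clusters[nearest] << point
    end

    # Recompute each centroid as the mean of its cluster members.
    centroids = clusters.values.map do |members|
      members.transpose.map { |dimension| dimension.sum.to_f / members.size }
    end
  end

  centroids
end
```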
00:36:28.900 Lastly, if you're dealing with linear algebra in Ruby, external libraries such as Linalg and Ruby/GSL are valuable. I opted for Linalg, but it required some effort to install.
00:37:18.360 Using item-based approaches helps to move towards effective recommendations. Tracking your movie ratings involves complex matrix manipulations and can grow quite sizeable. The need to manage these efficiently is vital in a practical application.
00:38:26.790 Moreover, tracking your movies involves maintaining a reliable structure between users and movies. As you can imagine on a larger scale, your movie matrix could extend to 450,000 rows and 17,000 columns—an enormous undertaking.
00:39:20.170 While I focused on a specific outcome in managing 100 movies due to resource limitations, variations in the dataset could lead to wild differences in outcomes depending on how you aggregate or parse the data.
00:40:35.570 This leads us to the system’s scalability when assessing the ordinary structures used in queries. While basic setups can suffice for simple implementations, moving to more complex databases or external libraries becomes necessary for long-term projects.
00:41:41.780 Finally, I want to leave you with one idea: iteratively improving algorithms through testing is fundamental to refining these systems. And the opportunity is still open: Netflix is offering the prize, and so far nobody has claimed it.
00:41:58.880 Thank you all for listening!
00:42:09.320 Any questions?
00:42:11.250 Person in the audience: So, first off, I want to know how do you test like a friend recommendation engine?
00:42:52.840 One of the metrics I considered was to check how many of the recommended people the user was already following. It stands to reason that if they're already following several of the recommended people, the algorithm is likely working fairly well.
00:43:52.400 Testing Netflix's recommendations was easier, given that they have user data and a training set of ratings. You can predict how users will rate movies they haven't seen based on the training information.
00:44:55.530 For example, using squared loss as a penalty function quantifies the distance between predicted and observed ratings. Netflix scores your predictions and gives you feedback, and the goal is to see if you can at least match, and then beat, their own system's metric.
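(The Netflix Prize metric is root mean squared error, which is the square root of the average squared loss; a sketch of computing it over predicted and observed ratings looks like this.)

```ruby
# Root mean squared error between predicted and observed ratings.
def rmse(predicted, observed)
  squared_errors = predicted.zip(observed).map { |p, o| (p - o)**2 }
  Math.sqrt(squared_errors.sum / squared_errors.size.to_f)
end

# rmse([3.5, 4.0, 2.0], [4, 4, 3])  # => ~0.65
```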
00:45:34.740 Person in the audience: Are there existing libraries, like libSVM, for these types of data tasks?
00:45:57.160 Yes, but remember that even with optimized libraries, you have to load everything into memory as Ruby objects before computation. So, efficiency remains a big challenge.
00:46:32.920 These insights reveal the challenges around memory management and data processing that greatly affect systems design.
00:47:07.840 Thank you all, and if you have any more questions, feel free to ask!