Machine Learning for Fun and Profit

John Paul Ashenfelter

#recommendation-systems

#clustering

In the video Machine Learning for Fun and Profit, John Paul Ashenfelter introduces machine learning techniques in Ruby for deriving insights from user data in a Rails application. He emphasizes using the Users table not only to increase business profits but also to improve user experience through informed decision-making. The presentation covers several key machine learning concepts and practical applications that can be implemented immediately.

Key Points Discussed:
- Understanding Users: Ashenfelter encourages attendees to comprehend their users deeply, posing questions like who the users are and what insights can be gathered from their data.
- Assigning Categories: He introduces techniques for categorizing users based on their first names through the tool Sex Machine, which estimates gender rather than collecting sensitive information upfront.
- Location Awareness: Geolocation data is highlighted as another crucial factor for understanding user demographics, discussed through tools such as FreeGeoIP, which lets developers derive geographic insights from IP addresses.
- Clustering and Behavior Segmentation: He explains K-means clustering, a method used to segment users into different groups based on behaviors or spending patterns—helpful for identifying target customer segments.
- Collaborative Filtering: Ashenfelter illustrates how companies like Netflix use singular value decomposition (SVD) for making personalized recommendations, enabling users with similar preferences to connect more effectively.
- Practical Tools and Resources: Throughout the talk, he shares various Ruby gems and resources that developers can use to implement the discussed techniques.

Significant Examples & Case Studies:
- Ashenfelter draws on his experience working with companies like Treehouse, where estimating gender distribution from first names produced useful insights without asking users for additional input, demonstrating the power of data science in real-world scenarios.
- He also mentions the potential for enhancing user interactions through collaborative filtering, stemming from shared skill sets among users at Treehouse.

Conclusions and Takeaways:
- Machine learning techniques offer significant potential to turn user data into actionable insights and profits.
- The tools discussed, including gender assignment, geographic data analysis, clustering, and collaborative filtering, are accessible to developers of varying experience levels.
- Ashenfelter encourages continuous learning through resources from O'Reilly and Stanford, underscoring the value of further education in machine learning and data science.

Ultimately, the goal elucidated in the presentation transcends mere financial gain; it aims at enriching user experiences and creating supportive communities around shared knowledge and collaborative efforts.

00:00:13.759 Hey, hey, hey! Alright, so there I am, ready to discuss machine learning for fun and profit. That’s what I want to talk about today.

00:00:20.800 I've got 30 minutes to convince you why this is a great idea. In a nutshell, let’s start with the premise: the goal here is to use Ruby to answer questions about your users and make your business money. Quick hands, how many people have a Users table? Right? Yes, the folks who don't have their hands up must be doing something like Internet of Things, maybe Bitcoin or something unusual where you don’t have any real users to know about—just aliases and such. But everyone ends up having a users table somewhere in their system.

00:01:05.040 We want to be able to make money from this, whether you are a small band of bootstrappers or more traditional VCs with a business plan. That business plan may seem ridiculous, but at the end of it, there's profit, and in between, there’s this magic thing. What I want to talk to you about is that magic thing between those two points.

00:01:33.240 What I want to discuss is how collecting data can lead to profit, insights, or social good, for that matter. We can do this in a lot of different ways. The tools we are going to use are Ruby, which is exactly where we should be at this conference here in the Rocky Mountains. We will use a users table, which might look something like that, and as promised, we are going to use science.

00:01:55.640 Just to be clear, we’re not going to use this kind of science, like the crazy Mad Science. We're going to use the Kick-Ass science. It's fun to note that both of those people I mentioned are named Neil, which presents a slight disambiguation problem between the two. And the person who’s going to guide you through this is me. Just a quick note, the arrow in my presentation indicates that I did all the code for this at RailsConf, and it hasn't changed much since then. I tweeted out what this was this morning, and I'm happy to show you and help you get the code down.

00:02:24.840 I’m not going to do live coding; rather, I’m going to show you the results of the code. I’ve been doing this for a long time, and these are real conferences where I’ve presented and conducted talks. Back when building a data warehouse was like 10 gigabytes, I would say, 'Wow! You’ve got 10 gigabytes?' But now, coming from NASA, I’d tell you I’ve got a terabyte. Who doesn’t have a terabyte of data anymore?

00:02:38.680 I started programming data analysis stuff in Microsoft Visual Basic. You can't see from that picture down at the bottom, but it says 'For DOS,' which was a real product. We even had a math co-processor back then. This image is of a backpropagation neural network. I used to work on neural network problems, but the computational power wasn’t available and it wasn't trendy anymore, so I shifted my focus to other things like Ruby and Java. Now, I'm thrilled to be getting back into this field again.

00:03:17.960 I’ve been doing data science and analytics in various forms for a long time, from many different perspectives. Let’s start with this question: who are your users? How many people here think they know a lot about their users? How many feel comfortable with the analytics they currently have? There must be a few of you because some of you have to be at a start-up where you know the names of the people who use your product. If you know your users really well, then this may not be applicable to you right now.

00:03:54.480 However, it will be applicable in a few months when you have thousands or even hundreds of thousands of users. I find this kind of funny because I was a chemist prior to this and worked with absurd statistics, sometimes numbers like 10 to the 22nd or 10 to the 25th, ridiculous numbers of users that aren’t available in real life. I find myself getting wiggy about statistics that involve just a few million users or a few hundred thousand users, but I’m learning to adapt.

00:04:24.920 This is how most people look at their users. You may use Mixpanel, Heap, or other similar tools. Google can present some insights about your users. What these analytics typically reveal is a single user that is an aggregation of all your other users. But that’s not a very good story. I don’t want all my users to be the same. How can I market to my users? How can I learn something about them? How can I impact their lives based on the information I have? And if we use flawed statistics, we end up like this XKCD comic, which many of you might have seen before.

00:05:07.800 When you start looking at aggregates, it’s possible to draw some really horrible conclusions. So let’s figure out a way to do better. This is what our users look like right now. At least we have the idea that we’ve got different kinds of users. We need a plan to better fill in the blanks.

00:05:52.479 No matter how little Ruby experience you have, I know there are folks from various schools and many newcomers here. The first thing we need to do is assign gender. I hate saying that out loud because that's not really the essence of what we're doing. What we will do is take the data from the first names of our users and use a gem called Sex Machine. I did not name it; it’s a C library under the hood. What we will do is assign a sex—male or female—based on the names.

00:06:36.680 This gem is really fun to work with. Often, you'll hear people suggest just to collect gender information from users upfront, but that’s a dangerous minefield. Facebook had to create 40 different choices to handle gender effectively. A relatable example would be ordering t-shirts for a conference—you might not think to ask how many were men or women or their preferences for t-shirt cuts. Instead, you'll make an intelligent guess, because we don’t have to be exact.

00:07:06.400 So, this is a practical application of this approach: we get a detector via Sex Machine, we give it names, and it prints out genders. For instance, here's a real set of results from a data sample. Is anyone from the UK in the room? Indeed, Jamie is almost always male there. This gem is smart enough to understand local information and can adapt based on it.

00:07:33.640 By default, names it can't classify come back as androgynous, but you can also set that to something like 'unable to compute.' You can take this gem and run it against a dataset to get at least a rough idea of gender distributions. For example, at Treehouse, we were interested in how many women were among our users. Although we don’t explicitly ask for that information upfront, using first-name analysis gives us a reasonable approximation.

00:08:05.600 This method resulted in a quick estimate of our user gender ratio, which was far easier than sending out a survey. If you've ever dealt with surveys, you know it's nearly impossible to get consistent results back. That’s the first tool you can take home and start using today—just simple Ruby code.
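
A minimal sketch of the first-name tally described above, not the code from the talk's repository. The lookup table and the `guess_gender` lambda are hypothetical stand-ins so the example runs without any gems; with the Sex Machine gem installed, the lambda could wrap the gem's detector instead (assumed here to expose something like `SexMachine::Detector#get_gender`).

```ruby
# Count users per estimated category and convert the counts to shares.
# `detector` is anything that responds to #call with a first name.
def gender_distribution(first_names, detector)
  counts = Hash.new(0)
  first_names.each { |name| counts[detector.call(name)] += 1 }
  total = first_names.size.to_f
  counts.transform_values { |count| (count / total).round(3) }
end

# Hypothetical lookup table standing in for a real name database.
name_guesses = { "john" => :male, "maria" => :female, "jamie" => :mostly_female }

# Names missing from the table fall back to :andy (androgynous/unknown).
guess_gender = ->(name) { name_guesses.fetch(name.to_s.strip.downcase, :andy) }

names = ["John", "Maria", "maria ", "Jamie", "Zaphod"]
puts gender_distribution(names, guess_gender)
```

Passing the detector in as a callable keeps the tally independent of whichever name database ends up behind it.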

00:08:41.239 Next up is location awareness. Many of you likely already deal with geolocation in your applications. If you are a smaller company or want to do this for fun, you can create your own geolocation service, which gives you another piece of data about your users. There’s a buzzword-compliant tool called FreeGeoIP that’s hosted but can also be downloaded and run yourself. You’ll need Python to run the scripts, and it uses MaxMind to pull down free data.

00:09:26.479 There’s a little script in the repository where we use Faraday, so many of you might already be comfortable with it. I’m running my own copy of that IP database, which has been enhanced with additional information, such as political affiliation by state or average income by ZIP code. The request can be executed with curl from the command line, making it easy to get your data.

00:09:58.119 The most typical way to analyze user IP addresses is through a resolver. If you're using Devise, you probably already have the IP addresses in your database, as that's one of the default features. Using a tool, I can loop over all the users, and if there’s an IP address available, it will look it up and return some valuable data: average income, political leaning, latitude, and longitude, which is crucial for understanding what's near your users.
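
The loop described above might look roughly like the following sketch. It is an illustration under stated assumptions, not the script from the repository: the user records are hypothetical hashes (in a Rails app with Devise's tracking columns the IP would typically come from `last_sign_in_ip`), the resolver is a stub returning canned JSON so the example runs offline, and the JSON field names are assumed.

```ruby
require "json"

# Loop over users, skip those without an IP, and attach the parsed lookup.
# `resolver` is any callable that takes an IP string and returns a JSON body
# (or nil when nothing is known about that address).
def enrich_with_location(users, resolver)
  users.each_with_object({}) do |user, result|
    ip = user[:last_sign_in_ip]
    next if ip.nil? || ip.empty?
    body = resolver.call(ip)
    next if body.nil?
    result[user[:id]] = JSON.parse(body, symbolize_names: true)
  end
end

# Hypothetical user records using documentation-only IP ranges.
users = [
  { id: 1, last_sign_in_ip: "203.0.113.7" },
  { id: 2, last_sign_in_ip: nil },
  { id: 3, last_sign_in_ip: "198.51.100.24" }
]

# Stub resolver. A real one would issue an HTTP request (for example with
# Faraday) against your own geolocation service.
canned = {
  "203.0.113.7"   => '{"latitude": 39.74, "longitude": -104.99, "region_code": "CO"}',
  "198.51.100.24" => '{"latitude": 38.03, "longitude": -78.48, "region_code": "VA"}'
}
resolver = ->(ip) { canned[ip] }

enrich_with_location(users, resolver).each do |id, geo|
  puts "user #{id}: #{geo[:region_code]} (#{geo[:latitude]}, #{geo[:longitude]})"
end
```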

00:10:50.600 Now, what does this data look like? It provides a more realistic portrait of your users. Everyone's keeping up so far? I know I’m going quickly, so allow me to refuel briefly.

00:11:21.560 Let's delve into more advanced analytics. For those of you who might appreciate it, this is where people can find real joy. At the edge of our analytics journey, we will encounter clustering, a common problem in machine learning.

00:11:37.720 To start clustering in Ruby, we need to include relevant properties about users, such as dollars spent on your e-commerce platform. At Treehouse, we’ve dealt with aspects like the number of badges earned or points accrued, time spent on site, or anything else that opens the doors for additional insights.

00:12:01.560 There’s a gem called ai4r that wraps up a lot of complex mathematics. You want to use a gem like this instead of doing it from scratch, because attempting that can be frightening. In terms of clustering, we can apply a method known as K-means clustering, which entails deciding up front how many clusters we want to create. You might have three kinds of customers: great, mediocre, and those you really don’t want to spend time on. What if I want to determine which of my users fit into these categories based on their behavior?

00:13:05.360 At the outset, we assign users randomly into groups. It’s not unlike dealing cards; if you envision red, green, and blue cards, you’ll eventually sort them. Then we need to calculate the centroid of each cluster, effectively finding the center for each group of users.

00:13:53.240 The next step involves reassigning each user to the cluster whose centroid is closest, based on that user's properties. If we repeat the recalculate-and-reassign process enough times, we arrive at a point where users are no longer switching groups, thus accomplishing effective clustering.
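
The steps above (random assignment, centroid calculation, reassignment until nothing moves) can be sketched in plain Ruby. This is a from-scratch illustration rather than the gem-based code from the talk; the feature choice, the fixed seed, and the empty-cluster workaround are assumptions made to keep the example small and repeatable.

```ruby
# Squared Euclidean distance between two points of equal dimension.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Mean point (centroid) of a non-empty list of points.
def centroid(points)
  points.transpose.map { |column| column.sum.to_f / column.size }
end

# Index of the centroid closest to the given point.
def nearest(point, centroids)
  centroids.each_index.min_by { |i| squared_distance(point, centroids[i]) }
end

# Plain k-means: start from a random assignment, then alternate between
# recomputing centroids and reassigning points until nothing moves.
def k_means(points, k, seed: 42, max_iterations: 100)
  rng = Random.new(seed)
  assignments = points.map { rng.rand(k) }
  centroids = []

  max_iterations.times do
    centroids = (0...k).map do |cluster|
      members = points.each_index.select { |i| assignments[i] == cluster }.map { |i| points[i] }
      # An empty cluster is re-seeded with a random point (one common workaround).
      members.empty? ? points[rng.rand(points.size)].map(&:to_f) : centroid(members)
    end
    updated = points.map { |point| nearest(point, centroids) }
    break if updated == assignments
    assignments = updated
  end

  { assignments: assignments, centroids: centroids }
end

# Hypothetical features per user: [dollars_spent, badges_earned].
users = [[5, 1], [8, 2], [6, 1], [200, 40], [220, 45], [210, 38]]
result = k_means(users, 2)
p result[:assignments]
p result[:centroids]
```

Cluster labels are arbitrary and depend on the seed, so only the grouping is meaningful. With real data, features on very different scales (dollars versus badge counts) would usually be normalized first so that one of them does not dominate the distance.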

00:14:39.920 Now let’s talk about linear algebra, a critical component of this process, particularly for K-means. Linear algebra tools are necessary for diverse operations that help make sense of clusters you may create. I will emphasize the importance of understanding matrices and vectors as they are essential tools in data science.

00:15:29.680 A very real concern is that Ruby may not be the optimal language for heavy-duty data science operations. It serves as a good gateway but it’s important to recognize that other languages like Python and R perform better for math-heavy applications.

00:16:11.160 Nonetheless, it’s important to experiment in Ruby to see if it meets your needs. If it does and you want to go further, you can move on to learning other languages that are tailored for more complex tasks.

00:16:39.040 Collaborative filtering is another key concept. This technique has been successfully employed by companies like Netflix in their recommendation engines. They have used an approach called SVD, which stands for singular value decomposition.

00:17:08.080 SVD, implemented through linear algebra, allows us to distill user inputs into useful comparisons based on previous ratings. By mapping user ratings to common dimensions, we can determine which users have similar tastes, with the potential for closer interactions outside of the site.
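
To make "mapping user ratings to common dimensions" concrete, here is a hand-rolled sketch of a truncated SVD using power iteration with deflation. It is for illustration only and is not how the talk's code or any production recommender does it: the ratings matrix is hypothetical, zeros are treated as ordinary values rather than missing ratings, and the all-ones start vector can converge slowly on some inputs. A linear algebra library would be the usual choice in practice.

```ruby
# Multiply a matrix (array of rows) by a vector.
def mat_vec(matrix, vector)
  matrix.map { |row| row.zip(vector).sum { |a, b| a * b } }
end

# Euclidean length of a vector.
def norm(vector)
  Math.sqrt(vector.sum { |x| x * x })
end

# Largest singular triple (u, sigma, v) via power iteration on A^T A.
def top_singular_triple(matrix, iterations: 200)
  transposed = matrix.transpose
  columns = matrix.first.size
  v = Array.new(columns, 1.0 / Math.sqrt(columns))
  iterations.times do
    w = mat_vec(transposed, mat_vec(matrix, v))
    length = norm(w)
    break if length.zero?
    v = w.map { |x| x / length }
  end
  av = mat_vec(matrix, v)
  sigma = norm(av)
  u = sigma.zero? ? av : av.map { |x| x / sigma }
  [u, sigma, v]
end

# First `rank` singular triples, removing each found component from the
# residual before searching for the next one (deflation).
def truncated_svd(matrix, rank)
  residual = matrix.map { |row| row.map(&:to_f) }
  (1..rank).map do
    u, sigma, v = top_singular_triple(residual)
    residual = residual.each_with_index.map do |row, i|
      row.each_with_index.map { |value, j| value - sigma * u[i] * v[j] }
    end
    { u: u, sigma: sigma, v: v }
  end
end

# Each user's coordinates in the reduced space (u scaled by sigma).
def user_coordinates(factors)
  user_count = factors.first[:u].size
  (0...user_count).map { |i| factors.map { |f| (f[:u][i] * f[:sigma]).round(3) } }
end

# Hypothetical ratings: rows are users, columns are items.
ratings = [[5, 5, 0, 0], [5, 5, 0, 0], [0, 0, 4, 4]]
p user_coordinates(truncated_svd(ratings, 2))
```

Users whose coordinates land close together in the reduced space rated items in similar ways, which is the basis for treating them as having similar tastes.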

00:17:55.120 At Treehouse, we ran several analyses on our own data, seeking to connect users based on their skill sets. For example, someone excelling in HTML, JavaScript, and Ruby could be paired with someone proficient in JavaScript and CSS. The ultimate goal is to enable users to help each other, effectively creating a supportive community.
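
As a simpler stand-in for the SVD-based analysis (and not the method used at Treehouse), pairing users by skill overlap can be sketched with cosine similarity over skill vectors. The user names, skill order, and scores below are hypothetical.

```ruby
# Cosine similarity between two equal-length numeric vectors
# (0.0 when either vector is all zeros).
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  magnitude = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  magnitude.zero? ? 0.0 : dot / magnitude
end

# Score every pair of users and return [name, name, score] triples,
# most similar pair first.
def rank_pairs(skills)
  pairs = skills.keys.combination(2).map do |a, b|
    [a, b, cosine_similarity(skills[a], skills[b]).round(3)]
  end
  pairs.sort_by { |_, _, score| -score }
end

# Hypothetical skill scores per user, in the order [html, css, javascript, ruby].
skills = {
  "ana"   => [9, 1, 8, 7],
  "ben"   => [0, 8, 9, 0],
  "chris" => [8, 2, 7, 8]
}

rank_pairs(skills).each { |a, b, score| puts "#{a} <-> #{b}: #{score}" }
```

Ranking by highest similarity finds users with overlapping strengths; to pair people with complementary skills instead, the same scores could be read from the other end of the list.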

00:18:40.800 Here's where I emphasize that all of this code and these insights are available online in the repository I created. It showcases not just how to implement these ideas, but how to optimize them in meaningful ways.

00:19:21.760 Finally, we can survey the diversity of clustering algorithms and show how they can yield different outcomes. The remarkable thing about data science is that the results depend on the algorithm you choose: different algorithms, and even different initial settings for the same algorithm, can lead to different conclusions and different recommendations.

00:20:22.160 As we conclude, let's revisit the tools. You can assign gender based on names; geolocation data can provide insights on client distribution; clustering reveals user types based on behaviors, and collaborative filtering can uncover potential relationships between users. All these approaches are data-driven methods that can create improvements both in your analytics and in how users perceive their experiences.

00:21:05.680 Ultimately, this brings us to the goal: rolling in money is fun, but the genuine achievement lies in providing joy for both users and the community at large while solving relevant problems.

00:21:39.920 I want to express my gratitude to those who have supported me through this journey. The workshop I provided at previous venues was very rewarding, and I acknowledge all the wonderful people I’ve worked with at conferences. I appreciate the warmth shown to me here at Rocky Mountain Ruby, and I’m grateful for various experiences.

00:22:44.720 As I work at Treehouse, and because we are currently hiring for several developer positions, I want to invite those interested in joining us. My social media presence is limited, but I’m always willing to share resources and experiences with those seeking to learn.

00:23:50.480 Finally, I highly recommend checking out relevant literature and online courses to further your understanding of machine learning and data science. From various resources at O'Reilly to standout classes offered at institutions like Stanford, these materials are invaluable for anyone pursuing knowledge in the field.

00:24:50.760 I genuinely enjoyed discussing these topics with you all today, and I'm here for any questions you might have, whether you prefer to address them openly or quietly after the session.

00:25:10.280 Thank you!

John Paul Ashenfelter

@johnpaulashenfelter

Rocky Mountain Ruby 2014