Workshop: Machine Learning For Fun and Profit

by John Paul Ahenfelter

In the workshop titled "Machine Learning For Fun and Profit" presented by John Paul Ahenfelter at RailsConf 2014, participants are introduced to basic machine learning techniques applicable to their Ruby on Rails applications. The main theme revolves around leveraging data from user tables to generate insights that can enhance business profitability.

Key Points Discussed:
- Understanding Users: The workshop begins with the importance of user data, where Ahenfelter engages the audience about their user tables and business goals, emphasizing the necessity to understand users to retrieve meaningful data insights.
- Machine Learning Techniques: Participants learn several foundational machine learning techniques starting from categorizing users, segmenting behavior, and employing recommendation algorithms. Ahenfelter stresses the importance of using science and data analytics effectively to derive actionable business insights.
- Practical Implementation: A significant focus of the workshop is on hands-on coding examples, utilizing the 'sex machine' gem to assign gender to users based on first names and analyzing user data without extensive surveys. This approach aims to achieve better accuracy than traditional survey methods.
- Geolocation: The presenter also covers geolocation, explaining how to derive rough user locations from IP addresses using free geo-IP services, which helps in understanding user demographics and tailoring support strategies accordingly.
- Clustering Algorithms: The workshop highlights the concept of user segmentation through clustering algorithms like K-means and hierarchical clustering, which aid in identifying different user groups based on their interaction with the application. Ahenfelter provides practical demonstrations, encouraging attendees to explore these algorithms for real-time data.
- Recommender Systems: Finally, attendees are introduced to recommendations using Singular Value Decomposition (SVD) to find similar users based on their interactions, which could assist in personalizing user experience and increasing engagement.

Takeaways:
- The workshop emphasizes the need for Rails developers to become data scientists to utilize the wealth of user data effectively.
- Participants take home practical knowledge about applying machine learning in their own applications, equipped with the tools to answer user-related business questions and drive profitability.

00:00:18.279 All right, people! Is it cool if I go ahead and kick this off right on time? Everybody okay with that? Looks like we have a lot of seats filled. Thank you all so much for coming down.
00:00:23.480 I'm John, and I'm going to talk about machine learning. I know some of you might have gone to Sandy's talk right now, so I appreciate you coming to this instead.
00:00:29.599 Sandy will be great on video. She's a wonderful person, and I know that I have to deliver at least as much value as you would have gotten out of Sandy's talk, so you've set a pretty high bar for me. I appreciate it, and I hope I won't let you all down.
00:00:42.239 So what's our goal today? Three takeaways would be great, but one takeaway is even better: I want you to use Ruby to answer questions about your users and your business. That’s my goal, and we're going to employ machine learning to achieve it.
00:00:58.600 If there are some chairs down here, feel free to grab them and scoot them around somewhere. This room is kind of arranged a bit funky.
00:01:13.119 I have a question for all of you, and this is going to be interactive for a bit. How many people have a users table in their Rails app?
00:01:18.400 Okay, here's a better question: How many people do not have a users table? Alright, yeah. I'm just curious, what's the primary object in your app instead of users? You said assets? That one makes sense.
00:01:33.520 So, we are talking about machine learning for fun and profit. If your primary object is something like assets, you'll probably find the same techniques apply, but almost everybody has a users table, which is what started this discussion.
00:01:39.799 Now, what is the goal of your businesses? Anyone? Just shout it out. What is the real goal of your business? Make money! Thank you! So, we've got users and we've got profit. Who here has a plan for making money from their users? Raise your hand if that's you. Alright, you're first!
00:01:58.079 I’m going to put you on the spot. What’s your plan? How do you turn users into profit? Oh, you give loans to users, and then they pay them off? Awesome! I completely understand that business model. That is fantastic!
00:02:19.800 Does anyone work for a social network-type company in the attention economy? Yes? Okay, definitely!
00:02:31.640 I’ve done that too, and that frames the story for me. We’re probably all familiar with the concept that we have users and we want to generate profit. Everyone knows about the underwear gnomes, our friends the underwear gnomes.
00:02:43.560 There’s that hilarious part where the gnomes explain to the South Park boys that step two is a big question mark after they collect the underpants, and from that they will derive profit.
00:03:01.680 It's really strange to speculate on what types of business models you could create by collecting underpants and using machine learning on those underpants to generate profit. But we're not going to delve into that today.
00:03:14.480 Instead, we are going to figure out how to fill in that gap, how to fill in that question mark with the information that's in your users table right now that you can use to turn into money, or hopefully some form of money.
00:03:20.239 We’re going to employ a specific set of tools, most of which you’re likely already familiar with, as we discussed earlier regarding the user table.
00:03:31.480 I'm a big fan of science. I was a chemist in another life, so I appreciate scientific principles, but science can lead you down a bad path. I want to ensure that when we think about the data science we’re undertaking today, we think less like the crazy scientist, like this guy from Back to the Future, and more in terms of kickass science.
00:03:45.400 Neil deGrasse Tyson is one of my favorite examples of how science can be approached.
00:03:52.079 So, we're going to utilize our users table to determine how to make a profit with data science, and we're going to endeavor to do so with a mindset that is more kickass like Neil deGrasse Tyson than crazy.
00:04:09.760 Quickly, the obligatory introduction: I'm John Paul Ahenfelter. I work here at Treehouse. I asked earlier how many Treehouse fans are here, and a few hands went up.
00:04:27.919 Before that, I was at General Assembly, so I've covered two of the big names in education. My next stop might be Dev Bootcamp so I can keep collecting education companies.
00:04:36.880 I've got Treehouse stickers for anyone who wants them up here because we do have some pretty cool branding. You can come grab Mike the Frog or check out our new boat design. I really have no clue what the boat is for Treehouse, but it’s wonderful!
00:04:57.160 More importantly, why should you listen to me about data science? I’ve been doing this for a long time.
00:05:08.199 This is a snapshot from 2006 when I started the data warehousing track at the MySQL conference, and I've since taught it extensively at O'Reilly's Open Source Convention.
00:05:20.240 We were discussing big databases that were in the 10 to 100 gigabyte range. That was considered huge at the time—difficult to store such large amounts of data.
00:05:32.759 So, who here has a database larger than 100 gigabytes? Just curious. Quite a few of you. How many are over a terabyte?
00:05:44.120 We even have Facebook here with their exabyte data, although I don't think any Facebook folks are present since they're all into PHP.
00:05:56.680 Data has changed significantly, and so have the tools we use to handle it.
00:06:02.600 I started working with neural networks back in grad school—and even before that, back in undergrad I was doing this with Visual Basic on MS-DOS and had to buy a math co-processor for my computer.
00:06:20.800 Running numerical simulations back then used to take hours—sometimes even days! Thankfully, a lot has changed since then. So, I've been at this for quite a while.
00:06:37.840 At the same time I started my research project, Inc. magazine published a cover story that was far more interesting than what I was working on.
00:06:49.500 Our format today is going to use problem-solving with some data. We'll apply code to get some results, allowing us to learn a bit about our users and subsequently how to generate revenue.
00:07:06.840 My session is titled "Machine Learning for Fun and Profit." Recently, I’ve been reflecting on how this should be framed more as storytelling—storytelling about your users.
00:07:23.400 I believe that stories are a much more powerful metaphor. Let's start with simple stories, just like the ones you share around the campfire—stories that make people happy, stories that teach.
00:07:36.720 So, let me ask: Who here actually knows their users? How many of you really work with your users table? Perhaps you’re in marketing or business development? Any of you really feel like you know who your users are? It’s often challenging.
00:07:53.760 I bet all of you know about your users in some broad strokes. How many people are familiar with analyzing users through tools like Google Analytics or Mixpanel?
00:08:09.280 These methods tend to portray users as homogeneous entities, and you aggregate that data. A standard Google Analytics dashboard only provides surface-level insight.
00:08:26.079 Aggregates can tell you about your average user, and we all know that nobody dreams of being the average user. We should strive for something more engaging.
00:08:39.200 People want to feel special. We need to tell better stories. Aggregates can be boring, and the SQL database administrators of the past often had a tough time—the same goes for those dealing with reports.
00:08:56.919 Aggregates can still tell interesting stories, especially when we examine changes over time—seeing your user growth, engagement patterns, or revenue generation.
00:09:12.239 The context can make all the difference. We want to discover important aspects of your users that tell a compelling narrative.
00:09:26.320 I was thinking about the users in my database who spent good money at my company. I wondered how many of them are female. This thought led me toward storytelling in a meaningful way.
00:09:40.199 No one does storytelling quite like This American Life. They have a unique structure that captivates an audience.
00:09:47.919 They masterfully weave individual stories into a larger, meaningful narrative that takes you on an emotional journey.
00:10:02.760 Different methods of storytelling exist online, with headlines that bring you in. You may see phrases like "Seven Unbelievable Facts About Your Users—Click Here for More!" This is all part of how people want to receive data.
00:10:17.240 Understanding your users involves delving into their qualities. How do you find out more about your users? If you wanted to know about the male-female distribution, how would you discover that?
00:10:31.560 For instance, how many of you collect gender information at registration? Not many, right? What’s the traditional way to figure this out? Surveys! But what are the percentages typically?
00:10:46.079 Survey response rates are generally very low. You may think you have a representative sample when you don’t, which can leave you with results that are not statistically significant.
00:11:00.920 Wouldn't it be better to have more confidence and better knowledge about your users? Descriptive data can help you segment your users into different groups.
00:11:14.240 You can use lookup tables to do this, which we'll discuss shortly. You can also perform name analysis. Most of these methods are quick to execute and yield better results.
00:11:27.599 If I told you I could provide you with 80% accuracy on male and female categorization based on the first name alone, who would think that's worse than a survey? I believe it’s at least as good, likely better.
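The lookup-table idea can be sketched in a few lines of plain Ruby. This is a minimal illustration, not the gem's implementation: the `NAME_GENDER` table, the `guess_gender` and `gender_breakdown` helpers, and the sample names are all hypothetical, and a real table would be built from a much larger name dataset. It also assumes the first token of a name is the given name.

```ruby
# Hypothetical lookup table; a real one would come from a large name
# dataset. These few entries are illustrative only.
NAME_GENDER = {
  "john"  => :male,
  "paul"  => :male,
  "mary"  => :female,
  "sarah" => :female,
  "jamie" => :unknown # commonly gender-neutral
}.freeze

# Guess a gender from the first token of a name. Returns :unknown when the
# name is missing from the table rather than forcing a guess.
def guess_gender(full_name)
  first = full_name.to_s.strip.split(/\s+/).first
  return :unknown if first.nil?

  NAME_GENDER.fetch(first.downcase, :unknown)
end

# Count guesses across a list of names, for example the first_name column
# of a users table.
def gender_breakdown(names)
  names.each_with_object(Hash.new(0)) do |name, counts|
    counts[guess_gender(name)] += 1
  end
end

sample = ["John Paul", "Mary", "sarah", "Cedar", "Justice", "Jamie"]
p gender_breakdown(sample)
```

Returning `:unknown` for names outside the table keeps unusual or ambiguous names from being silently miscounted, which matters when you later report the breakdown as a percentage.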
00:11:39.640 Today, we're going to explore a couple of examples together that can be done without any complex gems or advanced linear algebra.
00:11:54.640 One tool we're going to look at is the 'sex machine' gem, which uses data to assign gender based on names. So let’s run through the code and see how it performs.
00:12:12.440 If you need to install the gems, feel free to follow along. I'll give an explanation of what we're doing and then we can test this out together.
00:12:28.679 We'll start by selecting all users by first name and then analyze it using the sex machine gem, allowing us to see how accurate it is for various names.
00:12:43.360 How many of you have the gem installed so far? Alright, there should be someone near you who can help. Let’s take a minute, and while people are setting that up, let’s check some names.
00:13:02.559 For example, what is the assigned gender for names like Cedar or Justice? For those with unusual names, these results can be quite fascinating!
00:13:17.360 Next, we’ll dive into how to assign gender to users based on the information in our database.
00:13:30.040 This ongoing story affirms how we can utilize data and machine learning to generate insights from our user data effectively.
00:13:47.680 If you have access to your users with relevant attributes, I want you to experiment with your local machine data to see how effective this can be.
00:14:02.960 You can run user analysis against your data rather than relying only on what I provided as sample data.
00:14:14.320 Let’s take five minutes to experiment and then we'll reconvene. I'd like you to try checking the gender assignments for various names and see what results you obtain.
00:14:43.960 If you’d be willing, please share your findings. For instance, I often use ‘John Paul’ for my tests and I’m curious to see how it plays out with the gender assignment.
00:15:00.920 Let’s see how the results pan out. Also, if anyone has names that are commonly gender-neutral, we should check those too.
00:15:17.440 When you run this, I'd love to hear your reactions about the outcomes, especially if they contradict your expectations.
00:15:29.000 Regarding the data we have, I can provide insight into the challenges of accurately determining gender from names, particularly in diverse datasets.
00:15:41.280 Now, let's delve into how this gender assignment can translate to user insights and how we can apply them practically.
00:15:56.080 Using a served database, we can easily track user profiles and utilize this information to improve engagement across our audience.
00:16:09.440 Continuing, I now want to explore the importance of geolocation services, especially those based on IP addresses.
00:16:27.720 Apps collecting such data can vastly improve tailored experiences, particularly for customer support based on their location.
00:16:41.520 So, how many of you currently utilize geolocation tools for user interaction or behavior tracking? Many platforms today leverage such functionality for additional insights.
00:17:02.000 For context, we at Treehouse have seen how geolocation informs our user engagement strategies.
00:17:16.000 Would it surprise you to learn about how useful this information can be for planning our support efforts?
00:17:29.600 This data aids in staffing strategies around peak usage times based on user locations, facilitating better customer experiences.
00:17:46.240 Next, let’s dive into the code that utilizes geolocation services via free APIs that help track user geolocations effectively.
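As a rough sketch of that flow, the code below turns geo-IP responses into a per-country user count. Everything specific here is an assumption: the JSON field names `country_code` and `time_zone`, the `parse_geo` and `users_by_country` helpers, and the canned responses. The fetcher is passed in as a callable so the sketch runs offline; in a real app it would wrap an HTTP request to whichever geo-IP service you choose.

```ruby
require "json"

# Parse a geo-IP JSON response body into a small hash. The field names
# "country_code" and "time_zone" are assumptions about the response shape;
# adjust them to the service you actually call.
def parse_geo(body)
  data = JSON.parse(body)
  { country: data["country_code"], time_zone: data["time_zone"] }
rescue JSON::ParserError
  { country: nil, time_zone: nil }
end

# Look up each IP with the supplied fetcher (any callable that takes an IP
# string and returns a JSON string) and count users per country.
def users_by_country(ips, fetcher)
  ips.each_with_object(Hash.new(0)) do |ip, counts|
    country = parse_geo(fetcher.call(ip))[:country] || "unknown"
    counts[country] += 1
  end
end

# Stand-in fetcher with canned responses (documentation-range IPs).
CANNED = {
  "203.0.113.5"  => '{"country_code":"US","time_zone":"America/Chicago"}',
  "198.51.100.7" => '{"country_code":"DE","time_zone":"Europe/Berlin"}',
  "192.0.2.9"    => '{"country_code":"US","time_zone":"America/New_York"}'
}.freeze
fake_fetcher = ->(ip) { CANNED.fetch(ip, "{}") }

p users_by_country(CANNED.keys + ["10.0.0.1"], fake_fetcher)
```

Counting by `time_zone` instead of `country` follows the same pattern and is closer to what you would want for planning support coverage.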
00:18:02.560 It's essential to understand the context of our user data. We’ve got users always pining for reliable support, and understanding geographical spread allows us to handle that better.
00:18:18.560 Alright, now let's code our two cases: assigning gender to users and collecting geographic data through APIs.
00:18:35.560 Remember, a lot of users in our tables can indeed look like aggregates, but we need to segment and understand those users better.
00:18:51.560 We’ve got about an hour left to dive into the remaining examples.
00:19:06.560 Now we are going to explore user segmentation through clustering. Clustering can provide insights into user behavior.
00:19:23.000 For context, K-means clustering allows you to categorize users into a fixed number of segments based on similarities among their attributes.
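To make the idea concrete, here is a minimal k-means sketch in plain Ruby (Lloyd's algorithm), not the library code used in the workshop. The feature columns (logins per week, minutes per session) and the sample rows are hypothetical. Starting from the first k points is a simplification that keeps the run deterministic; real code would usually use random or k-means++ initialization and scale the features first.

```ruby
# Squared Euclidean distance between two equal-length numeric arrays.
def sq_dist(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Mean point of a non-empty list of points.
def centroid(points)
  points.transpose.map { |dim| dim.sum.to_f / dim.size }
end

# Plain k-means. Initial centroids are the first k points. Returns
# [assignments, centroids], where assignments[i] is the cluster index
# of points[i].
def kmeans(points, k, max_iter: 100)
  centroids = points.first(k).map { |pt| pt.map(&:to_f) }
  assignments = nil
  max_iter.times do
    new_assignments = points.map do |pt|
      (0...k).min_by { |i| sq_dist(pt, centroids[i]) }
    end
    break if new_assignments == assignments # converged

    assignments = new_assignments
    k.times do |i|
      members = points.each_index.select { |j| assignments[j] == i }
                      .map { |j| points[j] }
      centroids[i] = centroid(members) unless members.empty?
    end
  end
  [assignments, centroids]
end

# Hypothetical features per user: [logins per week, minutes per session].
users = [[1, 5], [2, 4], [1, 6], [20, 60], [22, 55], [19, 62]]
labels, centers = kmeans(users, 2)
p labels
p centers
```

The number of segments k is something you choose up front; trying a few values and checking whether the resulting groups make business sense is a reasonable way to pick it.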
00:19:40.600 Let’s run a quick code demo on clustering algorithms to provide you with visual data representation, which can enhance how we understand end-users.
00:19:56.560 This is where we can dive into more meaningful insights about user activity—this is where the magic of machine learning comes into play.
00:20:09.680 To this end, the clustering we deploy will allow us to segment users better than using broad strokes.
00:20:28.360 By utilizing the information we have, we can categorize these users into casual, professional, or super-users. This deep divide can help optimize our service delivery.
00:20:45.240 The next portion of our workshop will tackle implementing clustering using Ruby's AI libraries.
00:21:05.320 Let’s see how that performs as we categorize various action metrics against clustered user trends.
00:21:20.640 Now that we have a solid understanding of data transformations, it’s time to look at the coding side of comfortable clustering.
00:21:32.920 This approach can truly bring results to the business's bottom line by harnessing user data effectively.
00:21:50.360 If you haven’t yet, please validate your understanding of these analytical tools and how they engage with your dataset.
00:22:09.840 Next, we’ll examine collaborative filtering, a popular method for personalized recommendations in real-world applications.
00:22:26.160 Collaborative filtering takes user-item interactions to recommend items based on what similar users have liked.
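A small user-based version of this can be sketched with cosine similarity over rating hashes. The `RATINGS` data, the item names, and the `cosine` and `recommend` helpers are hypothetical and only illustrate the general technique; they are not taken from the workshop materials.

```ruby
# Hypothetical interaction data: user => { item => rating }.
RATINGS = {
  "ann"  => { "ruby_basics" => 5, "rails_intro" => 4, "css_layout" => 1 },
  "bob"  => { "ruby_basics" => 4, "rails_intro" => 5, "sql_joins" => 4 },
  "cara" => { "css_layout" => 5, "html_forms" => 4 }
}.freeze

# Cosine similarity between two sparse rating hashes; items missing from a
# hash count as 0.
def cosine(a, b)
  dot = (a.keys & b.keys).sum { |item| a[item] * b[item] }
  norm = ->(h) { Math.sqrt(h.values.sum { |v| v * v }) }
  denom = norm.call(a) * norm.call(b)
  denom.zero? ? 0.0 : dot / denom
end

# Recommend items the user has not interacted with, ranked by the
# similarity-weighted ratings of every other user.
def recommend(user, ratings)
  seen = ratings.fetch(user)
  scores = Hash.new(0.0)
  ratings.each do |other, items|
    next if other == user

    sim = cosine(seen, items)
    next if sim.zero?

    items.each { |item, r| scores[item] += sim * r unless seen.key?(item) }
  end
  scores.sort_by { |_, score| -score }.map(&:first)
end

p recommend("ann", RATINGS)
```

Because ann's ratings overlap much more with bob's than with cara's, bob's unseen items carry more weight in her ranking.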
00:22:46.480 This methodology has been adopted by many tech giants in various forms. Next, let’s showcase how it's commonly executed.
00:23:04.440 What we aim to do is summarize the data using Singular Value Decomposition (SVD), a key technique for reducing dimensions in user-item matrices.
00:23:20.840 With SVD, we aim to deal with large, sparse matrices effectively, allowing us to analyze user interactions better.
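For intuition, the largest singular value and its vectors can be found with power iteration on A^T A, shown below in plain Ruby. This is only the rank-1 slice of an SVD and an illustrative sketch; a full decomposition of a large, sparse matrix would come from a linear algebra library. The helper names and the sample user-by-item matrix are hypothetical.

```ruby
# Multiply a matrix (array of rows) by a vector.
def mat_vec(m, v)
  m.map { |row| row.zip(v).sum { |a, b| a * b } }
end

# Euclidean length of a vector.
def vec_norm(v)
  Math.sqrt(v.sum { |x| x * x })
end

# Power iteration on A^T A. Returns [sigma, u, v]: the largest singular
# value with its left (u) and right (v) singular vectors.
def top_singular_triplet(a, iterations: 200)
  at = a.transpose
  v = Array.new(at.size) { |i| 1.0 / (i + 1) } # arbitrary non-zero start
  iterations.times do
    w = mat_vec(at, mat_vec(a, v))
    n = vec_norm(w)
    break if n.zero?

    v = w.map { |x| x / n }
  end
  av = mat_vec(a, v)
  sigma = vec_norm(av)
  u = sigma.zero? ? av : av.map { |x| x / sigma }
  [sigma, u, v]
end

# Hypothetical user-by-item interaction matrix (rows are users).
interactions = [
  [5, 4, 0],
  [4, 5, 0],
  [0, 1, 5]
]
sigma, u, v = top_singular_triplet(interactions)
p sigma.round(3)
p u.map { |x| x.round(3) } # each user's weight on the strongest pattern
p v.map { |x| x.round(3) } # each item's weight on the strongest pattern
```

Users with similar entries in `u` load on the same dominant interaction pattern, which is the sense in which a reduced-dimension view can surface similar users.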
00:23:38.240 Through the implementation of algorithms, you can find more tailored recommendations for users, enhancing their overall experience.
00:23:51.600 Now that we've outlined these advanced concepts in this workshop, let's run through some examples of how to practically apply them.
00:24:06.440 You can begin exploring datasets relevant to your applications to bolster your engagement efforts.
00:24:21.840 As we round out our session, the goal here was to illustrate how Ruby can be practically employed to derive actionable insights.
00:24:34.080 Let’s take a moment to delve into the tools specifically, which can supercharge what you bring back to your teams after this workshop.
00:24:48.280 We’ve discussed various levels of analysis capturing user interactions, recommendation algorithms, and clustering for segmentation.
00:25:02.320 So, moving on, the underlying message here is that there are many resources at your disposal to deepen your understanding of data science.
00:25:17.000 For those of you eager to learn more, consider diving into O’Reilly resources, especially those relevant to your technical skills.
00:25:33.560 And I highly encourage you to explore additional materials on machine learning and data science that might pertain to your needs.
00:25:52.360 The breadth of learning opportunities available is extensive, so take advantage of this by committing to a tailored resource that fits your learning style.
00:26:10.720 To conclude our workshop, I'd like to thank you all for your engagement, and I’d love to open the floor for any questions!
00:26:22.080 Feel free to reach out to me on social platforms or here after the session.
00:26:36.560 I genuinely appreciate the time you've given to this, and I hope you find immense value in applying these concepts!
00:26:50.320 Thank you so much for attending, and let’s all strive to turn our user data into insights that positively impact our businesses.