00:00:13.759
Hey, hey, hey! Alright, so there I am, ready to discuss machine learning for fun and profit. That’s what I want to talk about today.
00:00:20.800
I've got 30 minutes to convince you why this is a great idea. In a nutshell, let’s start with the premise: the goal here is to use Ruby to answer questions about your users and make your business money. Quick hands, how many people have a Users table? Right? Yes, the folks who don't have their hands up must be doing something like Internet of Things, maybe Bitcoin or something unusual where you don’t have any real users to know about—just aliases and such. But everyone ends up having a users table somewhere in their system.
00:01:05.040
We want to be able to make money from this, whether you are a small band of bootstrappers or more traditional VCs with a business plan. That business plan may seem ridiculous, but at the end of it, there's profit, and in between, there’s this magic thing. What I want to talk to you about is that magic thing between those two points.
00:01:33.240
What I want to discuss is how collecting data can lead to profit, insights, or social good, for that matter. We can do this in a lot of different ways. The tools we are going to use are Ruby, which is exactly where we should be at this conference here in the Rocky Mountains. We will use a users table, which might look something like that, and as promised, we are going to use science.
00:01:55.640
Just to be clear, we’re not going to use this kind of science, like the crazy Mad Science. We're going to use the Kick-Ass science. It's fun to note that both of those people I mentioned are named Neil, which presents a slight disambiguation problem between the two. And the person who’s going to guide you through this is me. Just a quick note, the arrow in my presentation indicates that I did all the code for this at RailsConf, and it hasn't changed much since then. I tweeted out what this was this morning, and I'm happy to show you and help you get the code down.
00:02:24.840
I’m not going to do live coding; rather, I’m going to show you the results of the code. I’ve been doing this for a long time, and these are real conferences where I’ve presented and conducted talks. Back when building a data warehouse was like 10 gigabytes, I would say, 'Wow! You’ve got 10 gigabytes?' But now, coming from NASA, I’d tell you I’ve got a terabyte. Who doesn’t have a terabyte of data anymore?
00:02:38.680
I started programming data analysis stuff in Microsoft Visual Basic. You can't see from that picture down at the bottom, but it says 'For DOS,' which was a real product. We even had a math co-processor back then. This image is of a backpropagation neural network. I used to work on neural network problems, but the computational power wasn’t available and it wasn't trendy anymore, so I shifted my focus to other things like Ruby and Java. Now, I'm thrilled to be getting back into this field again.
00:03:17.960
I’ve been doing data science and analytics in various forms for a long time, from many different perspectives. Let’s start with this question: who are your users? How many people here think they know a lot about their users? How many feel comfortable with the analytics they currently have? There must be a few of you because some of you have to be at a start-up where you know the names of the people who use your product. If you know your users really well, then this may not be applicable to you right now.
00:03:54.480
However, it will be applicable in a few months when you have thousands or even hundreds of thousands of users. I find this kind of funny because I was a chemist prior to this and worked with absurd statistics, sometimes numbers like 10 to the 202nd or 10 to the 25th—ridiculous numbers of users that aren’t available in real life. I find myself getting wiggy about statistics that involve just a few million users or a few hundred thousand users, but I’m learning to adapt.
00:04:24.920
This is how most people look at their users. You may use Mixpanel, Heap, or other similar tools. Google can present some insights about your users. What these analytics typically reveal is a single user that is an aggregation of all your other users. But that’s not a very good story. I don’t want all my users to be the same. How can I market to my users? How can I learn something about them? How can I impact their lives based on the information I have? And we have to be careful, because flawed statistics lead us astray; many of you might have seen this XKCD comic before.
00:05:07.800
When you start looking at aggregates, it’s possible to draw some really horrible conclusions. So let’s figure out a way to avoid that. This is what our users look like right now. At least we have the idea that we’ve got different kinds of users. We need a plan to better fill in the blanks.
00:05:52.479
You can do this no matter how little Ruby experience you have; I know there are folks from various schools and many newcomers here. The first thing we need to do is assign gender. I hate saying that out loud because that's not really the essence of what we're doing. What we will do is take the first names of our users and use a gem called Sex Machine. I did not name it; it’s a C library under the hood. It assigns a sex, male or female, based on the names.
00:06:36.680
This gem is really fun to work with. Often, you'll hear people suggest just to collect gender information from users upfront, but that’s a dangerous minefield. Facebook had to create 40 different choices to handle gender effectively. A relatable example would be ordering t-shirts for a conference—you might not think to ask how many were men or women or their preferences for t-shirt cuts. Instead, you'll make an intelligent guess, because we don’t have to be exact.
00:07:06.400
So, this is a practical application of this approach: we get a detector via Sex Machine, we give it names, and it prints out genders. For instance, here's a real set of results from a data sample. Is anyone from the UK in the room? Indeed, Jamie is almost always male there. This gem is smart enough to understand local information and can adapt based on it.
00:07:33.640
The default for a name it can’t place is androgynous, but you can also set it to something like 'unable to compute.' You can take this gem and run it against a dataset to get at least a rough idea of gender distributions. For example, at Treehouse, we were interested in how many women were among our users. Although we don’t explicitly ask for that information upfront, using first-name analysis gives us a reliable approximation.
00:08:05.600
This method resulted in a quick estimate of our user gender ratio, which was far easier than sending out a survey. If you've ever dealt with surveys, you know it's nearly impossible to get consistent results back. That’s the first tool you can take home and start using today—just simple Ruby code.
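To make the first-name idea concrete without depending on the gem, here is a minimal sketch. The tiny lookup table, the `guess_gender` and `gender_counts` helpers, and the `:unknown` fallback are all assumptions made up for illustration; they are not the Sex Machine gem's API, and the real gem ships a far larger name list.

```ruby
# Hypothetical stand-in for a name-based detector. The tiny table and the
# helper names are illustrative assumptions, not the Sex Machine gem's API.
NAME_TABLE = {
  "alice" => { default: :female },
  "bob"   => { default: :male },
  "jamie" => { default: :androgynous, uk: :male } # country-specific override
}.freeze

# Guess a gender from a first name. Names missing from the table fall back
# to the caller-supplied unknown value.
def guess_gender(first_name, country: nil, unknown: :unknown)
  entry = NAME_TABLE[first_name.to_s.strip.downcase]
  return unknown if entry.nil?

  entry.fetch(country, entry[:default])
end

# Tally guesses across a list of first names to get a rough ratio.
def gender_counts(first_names, country: nil)
  first_names.map { |name| guess_gender(name, country: country) }.tally
end

names = ["Alice", "Bob", "Jamie", "Zaphod"]
p gender_counts(names)               # locale-agnostic guess
p gender_counts(names, country: :uk) # Jamie is treated as male for the UK
```

The point of the fallback value is that a name the table cannot place is counted separately instead of being forced into a category, which keeps the estimate honest about what it does not know.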
00:08:41.239
Next up is location awareness. Many of you likely already deal with geolocation in your applications. If you are a smaller company or want to do this for fun, you can create your own geolocation services by obtaining another piece of data. There’s a buzzword-compliant tool called FreeGeoIP that’s hosted but can be downloaded. You’ll need Python to run the scripts, and it uses MaxMind to pull down free data.
00:09:26.479
There’s a little script in the repository where we use Faraday, so many of you might already be comfortable with it. I’m running my own copy of that IP database, which has been enhanced with additional information, such as political affiliation by state or average income by ZIP code. The request can be executed with curl from the command line, making it easy to get your data.
00:09:58.119
The most typical way to analyze user IP addresses is through a resolver. If you're using Devise, you probably already have the IP addresses in your database, as that's one of the default features. Using a tool, I can loop over all the users, and if there’s an IP address available, it will look it up and return some valuable data: average income, political leaning, latitude, and longitude, which is crucial for understanding what's near your users.
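The loop described here can be sketched roughly as follows. Everything named in it is an assumption for illustration: the `User` struct stands in for your model, `current_sign_in_ip` is assumed to be the Devise trackable column, and `GEO_FIXTURE` is a made-up in-memory table standing in for a call to your own GeoIP service.

```ruby
# Minimal sketch: enrich users that have an IP address with location data.
# User, GEO_FIXTURE, and the field names are hypothetical; a real resolver
# would query your own GeoIP service instead of an in-memory hash.
User = Struct.new(:name, :current_sign_in_ip)

# 203.0.113.0/24 and 198.51.100.0/24 are documentation-only address ranges,
# and the coordinates below are made-up sample values.
GEO_FIXTURE = {
  "203.0.113.7"  => { latitude: 40.01, longitude: -105.27, region: "CO" },
  "198.51.100.9" => { latitude: 47.61, longitude: -122.33, region: "WA" }
}.freeze

# Look up one IP; returns nil when the resolver has nothing for it.
def resolve_ip(ip, resolver: GEO_FIXTURE)
  resolver[ip]
end

# Loop over users, skip the ones without an IP, and collect what we find.
def locate_users(users, resolver: GEO_FIXTURE)
  users.each_with_object({}) do |user, found|
    ip = user.current_sign_in_ip
    next if ip.nil? || ip.empty?

    geo = resolve_ip(ip, resolver: resolver)
    found[user.name] = geo if geo
  end
end

users = [
  User.new("ann", "203.0.113.7"),
  User.new("bo", nil),          # never signed in, so no IP on record
  User.new("cy", "192.0.2.1")   # IP the resolver does not know
]
p locate_users(users)
```

Passing the resolver in as an argument keeps the loop testable offline; swapping the hash for an object that makes an HTTP request is left to the real application.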
00:10:50.600
Now, what does this data look like? It provides a more realistic portrait of your users. Everyone's keeping up so far? I know I’m going quickly, so allow me to refuel briefly.
00:11:21.560
Let's delve into more advanced analytics. For those of you who might appreciate it, this is where people can find real joy. At the edge of our analytics journey, we will encounter clustering, a common problem in machine learning.
00:11:37.720
To start clustering in Ruby, we need to include relevant properties about users, such as dollars spent on your e-commerce platform. At Treehouse, we’ve dealt with aspects like the number of badges earned or points accrued, time spent on site, or anything else that opens the doors for additional insights.
00:12:01.560
There’s a gem called AI4R (the ai4r gem) that wraps up a lot of complex mathematics. You want to use a gem like this instead of doing it from scratch, because doing it from scratch can be frightening. For clustering, we can apply a method known as K-means, which starts with deciding how many clusters we want to create. You might have three kinds of customers: great, mediocre, and those you really don’t want to spend time on. What if I want to determine which of my users fit into these categories based on their behavior?
00:13:05.360
At the outset, we assign users randomly into groups. It’s not unlike dealing cards; if you envision red, green, and blue cards, you’ll eventually sort them. Then we need to calculate the centroid of each cluster, effectively finding the center for each group of users.
00:13:53.240
The next step involves reassigning each user to the cluster whose centroid is closest, and then recalculating the centroids. If we go through this process enough times, we arrive at a point where users are no longer switching groups, and the clustering is done.
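The steps just described can be sketched in plain Ruby. This is a minimal illustration, not the AI4R implementation: the `kmeans`, `centroid`, and `squared_distance` names, the fixed seed, and the rule for handling an empty group are all assumptions of this sketch.

```ruby
# Minimal k-means sketch following the steps described in the talk.
# Points are arrays of numbers, e.g. [dollars_spent, hours_on_site].

# Mean of each coordinate across a group of points.
def centroid(points)
  points.transpose.map { |coords| coords.sum.to_f / coords.size }
end

# Squared Euclidean distance between two points.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

def kmeans(points, k, seed: 42, max_iterations: 100)
  rng = Random.new(seed)
  # Step 1: deal every point into a random group, like dealing cards.
  assignments = points.map { rng.rand(k) }
  centroids = []

  max_iterations.times do
    # Step 2: find the center of each group. An empty group borrows a
    # random point so that it always has a center (a choice of this sketch).
    centroids = (0...k).map do |cluster|
      members = points.each_index.select { |i| assignments[i] == cluster }
      if members.empty?
        points.sample(random: rng).map(&:to_f)
      else
        centroid(points.values_at(*members))
      end
    end

    # Step 3: move each point to the group with the nearest center.
    updated = points.map do |point|
      (0...k).min_by { |cluster| squared_distance(point, centroids[cluster]) }
    end

    # Stop once nobody switches groups.
    break if updated == assignments

    assignments = updated
  end

  { assignments: assignments, centroids: centroids }
end

# [dollars_spent, hours_on_site] for six made-up users.
users = [[5, 1], [8, 2], [120, 10], [130, 12], [900, 40], [950, 45]]
result = kmeans(users, 3)
p result[:assignments]
p result[:centroids]
```

Because the starting groups are random, a different seed can produce different cluster numbering, and on less clearly separated data it can produce different groupings altogether.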
00:14:39.920
Now let’s talk about linear algebra, a critical component of this process, particularly for K-means. Linear algebra tools are necessary for diverse operations that help make sense of clusters you may create. I will emphasize the importance of understanding matrices and vectors as they are essential tools in data science.
00:15:29.680
A very real concern is that Ruby may not be the optimal language for heavy-duty data science operations. It serves as a good gateway but it’s important to recognize that other languages like Python and R perform better for math-heavy applications.
00:16:11.160
Nonetheless, it’s worth experimenting in Ruby to see if it meets your needs. If you outgrow it, you may very well move on to other languages that are tailored for more complex tasks.
00:16:39.040
Collaborative filtering is another key concept. This technique has been successfully employed by companies like Netflix in their recommendation engines. They have used an approach called SVD, which stands for singular value decomposition.
00:17:08.080
SVD, implemented through linear algebra, allows us to distill user inputs into useful comparisons based on previous ratings. By mapping user ratings to common dimensions, we can determine which users have similar tastes, with the potential for closer interactions outside of the site.
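To give a feel for "which users have similar tastes," here is a small sketch. It deliberately swaps in plain cosine similarity between rating vectors, which is a simpler technique than SVD and not what the talk's repository does. The `RATINGS` data and the `cosine_similarity` and `most_similar` helpers are hypothetical.

```ruby
# Sketch of user-to-user similarity on ratings. This is plain cosine
# similarity, a simpler stand-in for the SVD approach mentioned in the talk.
# The ratings data and helper names are hypothetical.
RATINGS = {
  "ana"   => { "html" => 5, "javascript" => 4, "ruby" => 5 },
  "ben"   => { "html" => 4, "javascript" => 5, "ruby" => 4 },
  "chris" => { "css" => 5, "ios" => 4 }
}.freeze

# Cosine of the angle between two sparse rating vectors (hashes of
# item => score). Items a user has not rated count as zero.
def cosine_similarity(a, b)
  dot = (a.keys & b.keys).sum { |item| a[item] * b[item] }
  norm = ->(v) { Math.sqrt(v.values.sum { |score| score**2 }) }
  denominator = norm.call(a) * norm.call(b)
  denominator.zero? ? 0.0 : dot / denominator
end

# Rank every other user by similarity to the given one, most similar first.
def most_similar(user, ratings = RATINGS)
  mine = ratings.fetch(user)
  ratings
    .reject { |other, _| other == user }
    .map { |other, scores| [other, cosine_similarity(mine, scores)] }
    .sort_by { |_, score| -score }
end

p most_similar("ana")
```

Cosine similarity only compares users on items they have both rated, so two users with no overlap score zero. Approaches like SVD exist partly to get around that by mapping everyone onto a smaller set of shared dimensions.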
00:17:55.120
At Treehouse, we ran several analyses on our own data, seeking to connect users based on their skill sets, so that someone excelling in HTML, JavaScript, and Ruby could be paired with someone proficient in JavaScript and CSS. The ultimate goal is to enable users to help each other, effectively creating a supportive community.
00:18:40.800
Here's where I emphasize that all of this code and these insights are available online in the repository I created. It shows not just how to implement these ideas, but how to optimize them in meaningful ways.
00:19:21.760
Finally, we can survey the diversity of clustering algorithms and show how they can yield different outcomes. The remarkable thing about data science is that the resulting applications will differ based on the chosen algorithms. Different algorithms can lead to different conclusions based on initial settings and may yield various recommendations.
00:20:22.160
As we conclude, let's revisit the tools. You can assign gender based on names; geolocation data can provide insights on client distribution; clustering reveals user types based on behaviors, and collaborative filtering can uncover potential relationships between users. All these approaches are data-driven methods that can create improvements both in your analytics and in how users perceive their experiences.
00:21:05.680
Ultimately, this brings us to the goal: rolling in money is fun, but the genuine achievement lies in providing joy for both users and the community at large while solving relevant problems.
00:21:39.920
I want to express my gratitude to those who have supported me through this journey. The workshop I provided at previous venues was very rewarding, and I acknowledge all the wonderful people I’ve worked with at conferences. I appreciate the warmth shown to me here at Rocky Mountain Ruby, and I’m grateful for various experiences.
00:22:44.720
As I work at Treehouse, and because we are currently hiring for several developer positions, I want to invite those interested in joining us. My social media presence is limited, but I’m always willing to share resources and experiences with those seeking to learn.
00:23:50.480
Finally, I highly recommend checking out relevant literature and online courses to further your understanding of machine learning and data science. From various resources at O'Reilly to stand-out classes offered at institutions like Stanford, these materials are invaluable for anyone pursuing knowledge in the field.
00:24:50.760
I genuinely enjoyed discussing these topics with you all today, and I'm here for any questions you might have, whether you prefer to address them openly or quietly after the session.
00:25:10.280
Thank you!