00:00:19.439
So the other day, I was at my grandmother's house, and she needed some help going through a lot of papers that she had accumulated over the years. She invited me into this little extra room at the back of her apartment. In the corner, she had a small card table with a leatherette cover, tags still attached, and about thirty inches of paper piled up. Long before RAID levels, my grandmother knew all about redundancy: she kept three copies of absolutely everything. Three copies of her 1949 tax return, three copies of a random advertisement about glucosamine supplements. She was going through the papers one by one while I sat there losing patience. Suddenly, she came upon a 1947 playbill from when she was in the freshman play at the University of Puget Sound, and she went off on a tangent about how great that was. 'I was in the play and everybody loved it!' she exclaimed before putting it back.
00:00:43.280
My grandma is a data hoarder. She used to be a librarian and really likes to hold on to information. She likes it around her because even though that playbill is not important right now, it might be important a year from now, or maybe one day she might need to look at that 1975 tax return. You never know. I'm sure everybody here has a family member who's a hoarder and likes to hold on to different things. We might say, 'Oh, well, I don’t do that; I get rid of all of my stuff, and I’m perfect,' but I would argue that’s not necessarily the case. You may have heard of inbox zero, but a lot of people go with inbox infinity. You look over their shoulder, and you see that they have about 8,000 messages, and 3,000 of them are unread. So, why are they holding on to this? Generally speaking, it's a compulsive hoarding instinct: we feel the need to hold on to every bit of information as it comes through, just in case our boss asks us about that Nigerian bank account email we were supposed to forward to him.
00:01:57.040
It’s even worse than our own personal quirks; it’s turned into an entire industry. I'm sure many of you have heard of big data. It’s amazing that we have five exabytes of data being created every two days. We have all this information: everyone is connected on iPhones, social media, email, you name it. However, I would argue that this big data has really just become a big distraction. If any of you have ever looked at the Twitter live stream as it comes through, most of the time it is total garbage. It tends to be the same people tweeting the same thing about Justin Bieber or Kim Kardashian or something in Portuguese that you can’t quite understand. This leads us to the real point of this talk: having all this information is great, but what's more important is figuring out the 20% of it that really drives 80% of the outcomes. If any of you have heard of the 80/20 principle, it’s very simple: 20% of the inputs produce 80% of the output. I would go further and say that big data follows more of a 99/1 principle.
00:03:36.560
I’m going to go over a couple of techniques, as there are many ways to see through the noise of the data we have access to. Today, since our time is limited, I’m going to discuss backward induction, which is a game theory tactic that starts with the end in mind and works backwards. We’ll review a couple of different techniques for graphing in more than three dimensions, and then we’ll end with a very meaty portion, which is variable selection. This is a relatively new topic that uses machine learning and classification trees to determine which variables are important and which ones aren’t.
00:04:12.960
Just a quick poll: who does TDD (Test-Driven Development) here? Okay, who has ever had to explain to a boss why you should be able to do TDD? Now personally, I've had this conversation a lot of times with business leaders, saying, 'We really need to do TDD because it provides insurance; it's great because we'll have full code coverage and the ability to ensure that all our code is covered by tests.' However, they often respond, 'Oh, well, that’s going to take 30% more time, and I don't have that time.' This creates a conflict between long-term benefit and short-term cost. If you look up here (not necessarily a tree diagram but a bifurcation graph), it’s a nice example from chaos theory of how things can get totally screwed up. The way I think about it is, if you start from the left and ask, 'Should I go left or right?' at every decision and keep doing that, it just gets totally chaotic. Just like a tree that keeps branching as it grows, you'll find yourself stuck on a rutted path.
00:05:56.000
Going back to the TDD example: a business leader looks at it in the short term, saying, 'I need to spend more time and money on testing,' but the programmer might argue that the long-term benefits are much greater. This echoes what game theorists face in a game like chess; realistically, what you want to do is start ten moves ahead (ideally thirty moves ahead, if you can) and then work backwards from there to strategize your moves. So very simply, backward induction is about thinking with the end in mind and then working backwards. Instead of merely asking, 'We have this data; what are we going to do with it? Should we go left or right?', you need to focus on the goal first.
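A minimal sketch of backward induction, assuming a hypothetical two-move game (the tree and its payoffs are invented for illustration and are not from the talk). The idea is to evaluate the end states first and carry the best achievable result back up to the opening move:

```ruby
# Hypothetical game tree: inner nodes are hashes of move => subtree, and
# leaves are payoffs to the player who moves first.
tree = {
  left:  { left: 3, right: 12 },
  right: { left: 8, right: 2 }
}

# Solve from the leaves back to the root. The first player picks the move
# with the highest payoff; the opponent picks the lowest.
def backward_induction(node, maximizing = true)
  return [node, []] unless node.is_a?(Hash)

  solved = node.map do |move, subtree|
    payoff, path = backward_induction(subtree, !maximizing)
    [payoff, [move] + path]
  end
  maximizing ? solved.max_by(&:first) : solved.min_by(&:first)
end

payoff, path = backward_induction(tree)
puts "best guaranteed payoff: #{payoff} via #{path.inspect}"
```

Starting from the end states shows that the tempting payoff of 12 is never reachable against a sensible opponent, which is exactly what a move-by-move "left or right?" approach would miss.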
00:06:42.880
I’m assuming most of the people here are using Rails and developing some kind of web app. Most web apps tend to align with Dave McClure’s metrics model. If you’ve ever heard of Dave McClure, he runs something called 500 Startups, which is an incubator. He proposed that all important metrics fall into five buckets: acquisition (user growth), activation (users doing something), retention (keeping users), referral, and revenue. If you understand this framework and know what you’re trying to accomplish, then it's much simpler than just saying, 'We have the data; let's do something with it.' It becomes more a question of stating, 'I want to bring up user acquisition. How do we get there?'
00:08:18.000
To illustrate, here are some random ideas for what could potentially lead to more users: spending more money on pay-per-click advertising, retweeting about your offerings, or even the number of times you swear in a sentence. It’s a lot simpler than merely asking, 'We have Google Analytics; how do we get more users?' Instead, you're looking at the goal first. Going even further back, you can ask, 'How does pay-per-click advertising relate to user growth?' Simply graphing this relationship can provide clear insights.
00:09:07.280
A quick side note: correlation does not equal causation. Just because I woke up one morning and drank a cup of coffee doesn’t mean I made the sun rise; the two are just correlated. Unfortunately, correlation is most likely what you have to work with, while causation tends to be a qualitative judgment.
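Alongside a graph, the strength of a relationship such as pay-per-click spend versus new users can be summarized with a Pearson correlation coefficient. The sketch below uses made-up weekly figures (both series are assumptions for illustration), and a high value still says nothing about causation:

```ruby
# Pearson correlation coefficient between two equal-length numeric series.
def pearson(xs, ys)
  raise ArgumentError, 'need two equal-length series' unless xs.size == ys.size && xs.size > 1

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  covariance = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  spread_x = xs.sum { |x| (x - mean_x)**2 }
  spread_y = ys.sum { |y| (y - mean_y)**2 }
  covariance / Math.sqrt(spread_x * spread_y)
end

# Hypothetical weekly numbers, invented for this example.
ppc_spend = [100, 200, 300, 400, 500]
new_users = [12, 25, 31, 48, 55]

r = pearson(ppc_spend, new_users)
puts "correlation between spend and new users: #{r.round(3)}"
```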
00:09:45.279
Let’s do another quick poll: Who likes Star Wars more than Star Trek? Now, who likes Star Trek more than Star Wars? Alright, pretty evenly split. And who doesn't care? Most of you probably recognize this guy: if any of you have seen Star Trek: The Next Generation, this is Geordi La Forge, one of my favorite characters of all time. Some of you might know him as the 'Reading Rainbow' guy. La Forge is intriguing because he wears a visor that lets him see things that no one else can. In a couple of episodes, he looks at a bulkhead and remarks, 'Oh, there’s a crack in it.' Pretty cool, right? Plus, it's LeVar Burton, so if you watched Reading Rainbow, that’s kind of cool too!
00:10:56.160
This introduces us to the importance of visualizing data. When we can visualize, we can quickly digest information. That’s what’s so great about visualization; we can see patterns. Humans are excellent at pattern recognition, but unfortunately, we can't see more than three dimensions: depth, width, and height. However, there are ways to show more dimensions, such as color tables and glyph plots, which I will explain later, along with scatter plot matrices. When working with many variables, starting with tables is a good approach; tables are straightforward and everyone knows how to use Excel.
00:12:41.440
Once you move past tables, you can introduce color. Crazy Egg, for example, makes great use of color: it tracks user clicks on a page and aggregates that data, indicating which areas receive more interaction. This allows you to extract an extra dimension; you can graph three dimensions along with color, which is useful. Scatter plot matrices are beneficial for viewing more than four dimensions. They create a 2D plot for each pair of variables, enabling you to observe relationships among up to 100 different dimensions in a discernible manner. A great benefit is that you'll also see how these variables relate to one another.
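A scatter plot matrix is built from every pair of variables, so the number of distinct panels grows as n(n-1)/2. A small sketch, assuming four hypothetical metric names, lists the pairs that would each get a 2D plot:

```ruby
# Hypothetical metric names; each unique pair becomes one 2D scatter plot.
metrics = %w[page_views uniques signups revenue]

panels = metrics.combination(2).to_a
panels.each { |x, y| puts "#{x} vs #{y}" }

# The same count for 100 variables, to show how quickly the grid grows.
panel_count_for_100 = (1..100).to_a.combination(2).size
puts "100 variables need #{panel_count_for_100} distinct panels"
```

With 100 variables the grid is still drawable, but at several thousand panels it is worth narrowing down the variables first.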
00:13:30.080
Lastly, I want to introduce you to the glyph plot, particularly the Chernoff faces. Has anybody heard of Chernoff faces before? This was developed by a guy named Chernoff in the 1970s, who studied judges and public perceptions of them. He aimed to take twelve different attributes of these judges and map them to something easily visualizable. He ingeniously used faces, as humans can effortlessly recognize emotions through facial expressions. We are good at identifying emotions like upset, surprise, anger, or confusion, and he created an intelligent way to represent this data with faces. Since this is a Ruby conference, I thought it would be fun to show an example of Chernoff faces.
00:14:57.600
For those of you on your computers, feel free to take a look at the GitHub repository titled 'hex-canoe/chernoff.' If any of you have worked with Shoes before, it's a GUI toolkit that was started by Why the Lucky Stiff. It's excellent for creating simple graphical user interfaces with pure Ruby code. Here, I’ll quickly create an 8x12 matrix of Chernoff faces. This face painter class I used is a bit of a hack, so please excuse the messy code. All you need to know is that there are variables for head size, eyebrow slant, cross-eyedness, pupil size, and eye width, height, and area, as well as mouth depth and width, plus a smile variable, which is very important.
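The step before any drawing is mapping each data attribute onto a face feature. The sketch below is not the face painter class from the repository; it is a hypothetical, GUI-free illustration in which the record fields and the choice of which field drives which feature are assumptions. Each attribute is scaled to the range 0 to 1 so a drawing routine could use it as a feature size:

```ruby
# Scale a value into 0.0..1.0 relative to the observed minimum and maximum.
def scale(value, min, max)
  return 0.5 if max == min

  (value - min).to_f / (max - min)
end

# Turn each data row into a hash of face features, each between 0 and 1.
def face_params(rows, mapping)
  ranges = mapping.values.uniq.to_h do |field|
    [field, rows.map { |row| row[field] }.minmax]
  end

  rows.map do |row|
    mapping.to_h do |feature, field|
      min, max = ranges[field]
      [feature, scale(row[field], min, max)]
    end
  end
end

# Hypothetical records and feature assignments, invented for this example.
rows = [
  { visits: 120, signups: 4,  sentiment: -0.6 },
  { visits: 300, signups: 9,  sentiment: 0.1 },
  { visits: 480, signups: 20, sentiment: 0.8 }
]
mapping = { head_size: :visits, eye_width: :signups, smile: :sentiment }

faces = face_params(rows, mapping)
faces.each { |face| puts face.inspect }
```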
00:15:43.600
Running this results in something like this: a bunch of smiley faces. Just to prove that I'm not making this up, I can run it again, and there are even more smiley faces. It’s all well and good, but what’s the real point of using Chernoff faces? It might initially seem like a novelty, but I would argue differently. Many companies in sentiment analysis, like Radian6, take textual data and categorize it as positive, negative, or neutral. However, if you know enough about human emotions, you realize that emotions don’t exist on a one-dimensional plane. A negative emotion can be sad or angry, for instance, and Chernoff faces present that information rapidly so one can perceive emotions more deeply, rather than simply stating that the sentiment is positive or neutral.
00:17:18.120
When I was at a startup in Seattle, we gathered every Monday in this tiny room for company meetings. An analyst would spend eleven and a half minutes discussing how we did last week in terms of page views and uniques. However, instead of focusing on trends, he'd also include standard deviations for both page views and uniques, the growth rates, kurtosis, and skew: terminology that, even for a mathy type like me, was overwhelming and tedious. I couldn't understand why he didn’t just summarize how we had performed last week in terms of uniques, compare that with the previous week, and leave it at that. This situation leads us to our next topic: there are numerous variables available today, many of which might not significantly impact daily operations.
00:18:24.880
We’re going to discuss how to use classification trees to determine which variables matter and which do not. Has anyone heard of classification trees or regression trees? A few have. Very simply, here’s an example from Wikipedia illustrating survival probabilities from the Titanic. The chart asks questions such as: Are you male? Yes or no. Are you older than nine and a half? Yes or no. Is your number of siblings and spouses aboard greater than two and a half? You get the idea. The process to build these trees can vary; there is bagging, which stands for bootstrap aggregating, and there are boosted trees, but my favorite is called random forests. I always envision random forests filled with random deer and mushrooms, not sure why, but that’s why I favor it.
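A classification tree is just a sequence of yes/no questions, so the Titanic example can be written as nested conditionals. The questions below are the ones quoted above; the outcome at each leaf follows my reading of the Wikipedia figure and should be treated as an assumption, as should the field names:

```ruby
# Walk the tree from the top question down to a leaf.
# `sibsp` is the number of siblings and spouses aboard.
def titanic_survival(passenger)
  return :survived unless passenger[:sex] == :male   # first split: sex
  return :died if passenger[:age] > 9.5              # second split: age
  passenger[:sibsp] > 2.5 ? :died : :survived        # third split: family size
end

puts titanic_survival({ sex: :male, age: 40, sibsp: 0 })
puts titanic_survival({ sex: :male, age: 5, sibsp: 1 })
```

The variable asked about first (sex) affects every passenger's path, which is why questions near the top of a tree carry the most weight.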
00:19:38.880
The random forest concept originated with a guy named Leo Breiman in 2001. He was a Ph.D. nerd who later became a consultant before returning to academia. This methodology is significant: it repeatedly draws random samples from a massive dataset, grows a collection of random trees, and combines their votes, which gives more predictive power than any single tree. Pretty simple, right? There are many tools available for creating random forests, including options for Python, Mahout for Java, and a Ruby gem called Nimbus. Nimbus is geared primarily toward bioinformatics and genetics, but is also very helpful for generating random forests.
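To make the mechanics concrete, here is a deliberately tiny sketch of the idea: draw a bootstrap sample, give each tree a random subset of features, and let the trees vote. It does not use Nimbus or any library API; every "tree" is a one-question stump, and the data, function names, and parameters are all assumptions chosen for illustration:

```ruby
# Train a one-question tree (a "stump") on binary features: choose the
# feature and yes/no labelling that misclassify the fewest rows.
def train_stump(rows, labels, features)
  best = nil
  features.each do |f|
    [[1, 0], [0, 1]].each do |when_one, when_zero|
      errors = rows.each_index.count do |i|
        guess = rows[i][f] == 1 ? when_one : when_zero
        guess != labels[i]
      end
      if best.nil? || errors < best[:errors]
        best = { feature: f, when_one: when_one, when_zero: when_zero, errors: errors }
      end
    end
  end
  best
end

def predict_stump(stump, row)
  row[stump[:feature]] == 1 ? stump[:when_one] : stump[:when_zero]
end

# Each stump sees a bootstrap sample (rows drawn with replacement) and a
# random subset of the features.
def train_forest(rows, labels, trees: 25, rng: Random.new(42))
  n_features = rows.first.size
  per_tree = Math.sqrt(n_features).ceil
  Array.new(trees) do
    sample = Array.new(rows.size) { rng.rand(rows.size) }
    features = (0...n_features).to_a.sample(per_tree, random: rng)
    train_stump(sample.map { |i| rows[i] }, sample.map { |i| labels[i] }, features)
  end
end

# The forest's answer is the majority vote of its stumps.
def predict_forest(forest, row)
  votes = forest.map { |stump| predict_stump(stump, row) }
  votes.count(1) > votes.size / 2.0 ? 1 : 0
end

# Hypothetical data: the first column tracks the label, the rest are noise.
rows = [
  [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 1, 1],
  [0, 0, 1, 0], [0, 1, 0, 1], [0, 0, 0, 0], [0, 1, 1, 1]
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

forest = train_forest(rows, labels)
prediction = predict_forest(forest, [1, 0, 0, 0])
puts "forest of #{forest.size} stumps predicts #{prediction}"
```

Real implementations grow full trees rather than stumps, but the sampling and voting steps have the same shape.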
00:20:54.280
I wanted to work through another example that’s more applicable. I examined all the tweets I've ever made to identify the factors that determine my likelihood of retweeting. The idea was to analyze all the text from my Twitter account and determine which words influence my retweet behavior. The relevant code is available on GitHub. For those unfamiliar with the t gem, it’s a fantastic command line interface for Twitter, allowing you to export all your tweets into a CSV file. I used it to split the text into tokens, organize the words into a dictionary, and create variables accordingly.
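The tokenizing step can be sketched as follows. This is not the code from the repository: the tweets are invented, they are held in memory rather than read from the exported CSV, and one 0/1 variable per dictionary word is an assumption about how the variables were built:

```ruby
# Hypothetical tweets; `retweet` marks whether the tweet was a retweet.
tweets = [
  { text: 'Synergy is a verb now',            retweet: true },
  { text: 'Shipping the new release today',   retweet: false },
  { text: 'Synergy: now with more meetings',  retweet: true }
]

# Lowercase the text and pull out word-like tokens.
def tokenize(text)
  text.downcase.scan(/[a-z0-9@#_']+/)
end

# The dictionary is every distinct token seen across all tweets.
vocabulary = tweets.flat_map { |t| tokenize(t[:text]) }.uniq.sort

# One row per tweet, one 0/1 variable per dictionary word.
features = tweets.map do |t|
  words = tokenize(t[:text])
  vocabulary.map { |word| words.include?(word) ? 1 : 0 }
end

# The target: 1 for a retweet, 0 otherwise.
labels = tweets.map { |t| t[:retweet] ? 1 : 0 }

puts "#{vocabulary.size} variables, #{features.size} rows"
```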
00:21:46.960
After creating a training set consisting of ones and zeros, where one indicates that I retweeted something and zero means I didn’t, I ran Nimbus in the folder and, after some time-consuming effort, it returned a classification tree showing the questions to ask about whether something would be retweeted. From this process, I was able to identify which variables were more important than others. There are a couple of ways to assess the importance of variables within random forests. One method, ‘variable importance’ (VIMP), gauges how much sway each variable has on the resulting output. A simpler way to conceptualize this is through what’s called ‘minimum depth’: variables that appear higher up in the classification tree are the most influential.
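Minimum depth can be read straight off a tree: for each variable, record the shallowest level at which it is used to split. The sketch below assumes a hypothetical hand-written tree (the variable names and structure are invented, and this is not Nimbus output); across a whole forest these depths would typically be averaged over all trees:

```ruby
# Hypothetical tree: inner nodes name the variable they split on,
# leaves carry a 0/1 prediction.
tree = {
  var: 'from_consultant',
  yes: { var: 'contains_link', yes: { leaf: 1 }, no: { leaf: 1 } },
  no:  {
    var: 'from_wife',
    yes: { leaf: 1 },
    no:  { var: 'contains_link', yes: { leaf: 0 }, no: { leaf: 0 } }
  }
}

# Record the shallowest depth at which each variable appears.
def minimum_depths(node, depth = 0, depths = {})
  return depths if node.key?(:leaf)

  var = node[:var]
  depths[var] = depth if !depths.key?(var) || depth < depths[var]
  minimum_depths(node[:yes], depth + 1, depths)
  minimum_depths(node[:no], depth + 1, depths)
  depths
end

depths = minimum_depths(tree)
depths.each { |var, depth| puts "#{var}: first used at depth #{depth}" }
```

A smaller minimum depth means the variable is consulted earlier and therefore shapes more of the predictions.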
00:23:04.240
That’s all well and good, but let’s look at the data that determines whether I'll retweet something. The top four factors are specific individuals. The first is a management consultant who writes humorous tweets; I tend to retweet him frequently. The second is my wife: if I didn’t retweet her, I'd be sleeping on the couch! The third and fourth are Avdi Grimm and Matt Might from my Twitter feed, both of whom I admire. Examples of retweeted tweets include humorous observations or relatable moments, like when someone exclaimed, ‘Holy crap! Recruiter, do you have a job lead for me?’ or, ‘Hitler was a micromanager!’ It’s funny and relatable, highlighting how data can be compelling when presented right.
00:24:49.520
Now that you’ve identified which variables are important, it’s time to understand how they relate to earlier discussions. If you can determine which variables matter, you can visualize them more effectively. For instance, a study by a guy from the University of Florida, G. P. Ishwaran, focused on cardiovascular disease and demonstrated that by applying similar techniques, he managed to reduce 310 variables down to only 10, which is significant. By filtering out noise, it becomes easier to analyze outcomes since looking at just 10 variables is manageable and comprehensible.
00:26:17.200
When working with vast amounts of information, start with the end in mind. It’s essential in this age of smartphones, social media, email, IMs, and text messaging. Without a clear goal, information can overwhelm you. Furthermore, it’s crucial to visualize the data you encounter. When you properly visualize it, you can conceptualize and understand it much more easily. Lastly, classification trees are invaluable for identifying the importance of each variable. It’s easy to grasp—variables at the top are more influential than those at the bottom.
00:27:46.880
To conclude, remember not to be like my grandmother, clinging to data for the sake of it. It’s not about how much data you have; rather, the critical factor is how you utilize that data. Thank you.
00:31:20.880
Thank you very much.