Talks

Exploring Big Data with rubygems.org Download Data

http://rubykaigi.org/2016/presentations/the_thagomizer.html

Many people strive to be armchair data scientists. Google BigQuery provides an easy way for anyone with basic SQL knowledge to dig into large data sets and just explore. Using the rubygems.org download data we'll see how the Ruby and SQL you already know can help you parse, upload, and analyze multiple gigabytes of data quickly and easily without any previous Big Data experience.

Aja Hammerly, @the_thagomizer
Aja lives in Seattle where she is a developer advocate at Google and a member of the Seattle Ruby Brigade. Her favorite languages are Ruby and Prolog. She also loves working with large piles of data. In her free time she enjoys skiing, cooking, knitting, and long coding sessions on the beach.

RubyKaigi 2016

00:00:00 This is what happens when you've spoken at a lot of conferences: you get a little bit afraid when Jonah and Aaron say they're going to introduce you together because you're not quite sure what they're going to say.
00:00:05 But I, Aja Hammerly, am @the_thagomizer on most of the places on the internet. I love it when people tweet at me during my talks.
00:00:13 My phone is over there, so you won't distract me, and I have notifications turned off, so this should work well. I work on the Google Cloud Platform and, among other things, I have this slide.
00:00:23 I also really think this cat is cute! I work on a team at Google that is working to improve our support for Ruby. As part of that, we've been asking ourselves a lot of questions.
00:00:36 We're exploring how we should interact with the Ruby community, what the community is doing, what's important to them, and making sure that we're using the tools we have correctly.
00:00:50 Some of the questions we've been asking include: What gems are used most often? We want to make sure that we have the appropriate libraries installed on all of our base Ruby images, so that when people need specific gems, they just work.
00:01:04 We were also discussing which test framework we should write our example tests in. In the Ruby community that discussion turns into a MiniTest versus RSpec debate, so it became another question we needed to answer.
00:01:25 There was a lot of discussion about whether we needed to support Ruby 1.9, and we were trying to figure out what version of Rails is the most popular right now. Answering these questions was a little hard.
00:01:39 I will admit that for the first six months or so I was working on this, I guessed, because we were still new to the project. My guesses weren't even particularly bad; I thought Rails was one of the most popular gems.
00:01:58 I also guessed that probably more people were still on Rails 3, and because I'm from Seattle, I guessed that MiniTest was more popular than RSpec. But we are scientists, and scientists don't guess; they use data.
00:02:08 That's what this talk is about: how I used data to answer these questions and some others, so we can build the right products for the right people.
00:02:22 I use two data sources for this analysis: the rubygems.org download data, which you can find by going to rubygems.org, and there's a tiny data link in the bottom left.
00:02:40 One of my co-workers also got us a very large data dump from GitHub of all the public repositories, and I use that data to answer some of these questions.
00:02:53 So now I'm going to give you a brief overview of the data. Rubygems supplies two datasets: a PostgreSQL dataset and a Redis dataset, but I'm only using the PostgreSQL data because the Redis data wasn't useful for answering my questions.
00:03:06 The primary table in the PostgreSQL dataset is called 'RubyGems,' and it lists all the gems. This is what the schema looks like, and I ignored the slug for my purposes because answering my questions didn't require that field.
00:03:26 I want to know how popular a gem is, so I need to know how many times it's been downloaded. The 'gem downloads' table contains that information.
00:03:39 Here's the schema: you can see that it includes three foreign keys and a count. So, I'm going to need the tables those foreign keys point to in order to get any useful information. There are also dependencies and linksets tables.
00:03:53 The dependencies table wasn't particularly interesting for my analysis, nor was the linksets table. However, the versions table is very useful and quite large. Its schema includes some fields I found particularly useful, such as id, rubygem_id, and number, where number is the version string.
00:04:18 The dates are important too, as they help me figure out trends and what's happening currently relative to five years ago. I had some interesting questions about platform licenses.
00:04:34 The GitHub data has a couple of tables that I'm going to use. The first one is files, which contains all of the files in all public repositories on GitHub.
00:04:45 This is a very large set of data, and I’m mostly interested in the path and the ID so I can do joins. I'm only interested in specific files like Gemfile, Gemfile.lock, Rakefile, and .rb files.
00:05:03 The 'contents' field contains the raw content of all the files, which includes the entirety of the Gemfile, Gemfile.lock, and all of the .rb files in public repositories.
00:05:22 The 'commits' table was interesting but not particularly relevant for my analysis. The languages table tracks the repository's language and may help answer some of our questions.
00:05:41 This table has grouped nested fields, but the licenses table wasn't particularly interesting either. So, I've got all this data, but now I need to answer my questions.
00:06:07 The RubyGems download data is pretty small. I could have queried it with standard PostgreSQL on my MacBook, but the GitHub data set is huge, weighing in at about 14 terabytes.
00:06:26 If I want to see what gems people are using in their projects, I need to parse the content of all those Gemfiles or Gemfile.lock files.
00:06:44 I don't know how to parse a raw text file and get counts out with PostgreSQL, but I do know how to do it with BigQuery, which can easily handle a 14 terabyte data set.
00:07:06 So, what is BigQuery? It's a non-relational, non-indexed data warehouse tool that's part of the Google Cloud Platform. It was built to search and analyze logs at very large scale, which is part of why it's not indexed.
00:07:27 Honestly, I don't know much about how it works. I'm guessing that there's some parallel processing and MapReduce involved, but that's just from reading the documentation.
00:07:44 All I know is that I love BigQuery. I really love BigQuery because it supports standard SQL; I didn't have to learn a new programming language to use it.
00:08:03 And while there are some specific extensions, that's true with any database. I love it because it's freakishly fast; querying that GitHub dataset is usually under a minute.
00:08:21 The vast majority of the time, though, it can scale to handle super large datasets, like the GitHub data, and it's complex enough to do things like sliding window analysis and multiple levels of aggregation.
00:08:44 It can also gracefully handle data that isn't consistently well-formed, which is something I need, because, as I found out, some of this data is not particularly well-formed.
00:09:06 This is the part where I'm going to try to do a demo. I can say these words, but you won't believe me until I show you. Let's see if the internet is back.
00:09:29 What I'm doing right here is querying the repos to figure out how many repos there are for different languages. You can see the query running. In this instance, it took about 21 seconds, but it did use caching.
00:09:49 When I ran this right before my talk, it took 12 seconds. We can see that JavaScript is number one, followed by CSS, HTML, and Shell, with Python and Ruby coming in at number seven.
00:10:07 So Ruby is actually quite popular on GitHub, but that query accessed a fairly large amount of data because it looked at every single repo to get the repository's language.
00:10:24 Now back to the talk. BigQuery has a specific vocabulary you need to know. The first thing is a dataset; this is a group of related tables. If you’re used to PostgreSQL or MySQL, just think of it as a database.
00:10:42 A table is a bunch of records structured in some sort of regular way. That's all you need to know to get the data into BigQuery.
00:11:02 I will show you two ways of doing that. The GitHub data was set up by someone else, but I had to put the RubyGems data in there. Streaming is when you push records to BigQuery, adding them to your dataset in real-time.
00:11:20 There’s a tutorial on our website about streaming records in from Fluentd. Coming back to the log analysis roots of BigQuery, you could also use it for clickstream data.
00:11:39 Maybe you want your purchases to show up in real time for analysis. Those are all good reasons to use streaming. I'm using the google-cloud gem, which was renamed last week from the gcloud gem.
00:12:00 It's the same gem, same code, just a different name. I'm using the pg gem for PostgreSQL. Here's the initial code you need: you require the gem, then create a BigQuery object, initializing it from some environment variables.
00:12:21 I'm creating a BigQuery dataset. A note on naming: because I'm dealing with two separate data sources, things got super confusing, so everything related to BigQuery gets a 'BQ_' prefix.
00:12:43 For PostgreSQL data, I am creating the schema for the table in BigQuery. This looks a lot like writing a Rails migration, and that's intentional.
00:13:00 The Google Cloud gem is written by Rubyists for Rubyists, and they’re doing their best to make it easy to use. Having used a lot of different gems from large companies, the contrast between a hand-coded gem and an auto-generated one is huge.
00:13:19 I need this columns array of strings because I need to know which columns I'm pulling from the PostgreSQL table. Here’s the rest of the code. This is all the code you need to run a select star on the PostgreSQL table.
00:13:36 You iterate through each row, make a hashed row using the values and that columns array, and then insert it into the BigQuery table.
00:13:53 Some people did a code review on this, and the first line confused them. Let's take a side trip to discuss what zip and hash are doing.
00:14:09 Say you have two arrays representing the keys and values that you want to make into a hash. Zip takes elements pairwise and makes an array of pairs. If you didn't understand that, that's okay. I have an animation.
00:14:23 This leaves you with something like this, and now you can make it into a hash using Hash's square-bracket class method, Hash[].
00:14:38 By running this, that's what comes out: I'm putting my keys and my values together to make a hash. This is necessary because BigQuery expects a hash of values instead of plain rows.
00:14:52 This allows you to reorder the values if needed or to skip missing values without causing issues.
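The zip-and-Hash[] idiom she walks through can be sketched in a few lines; the column names and values here are invented for illustration:

```ruby
# Columns pulled from the PostgreSQL table, and one row of values.
columns = ["id", "name", "downloads"]
row     = [42, "rake", 1_000_000]

# zip pairs the elements up positionally:
# [["id", 42], ["name", "rake"], ["downloads", 1000000]]
pairs = columns.zip(row)

# Hash[] turns an array of pairs into a hash keyed by column name,
# which is the shape the BigQuery insert API expects for each record.
row_hash = Hash[pairs]
# => {"id" => 42, "name" => "rake", "downloads" => 1000000}
```

Because the keys are explicit, reordered or missing columns don't silently shift values into the wrong fields.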
00:15:09 Batch processing is also supported, in a couple of formats such as CSV, JSON, and Avro. I always use CSV because I'm almost always pulling from another data source.
00:15:21 Exporting to CSV from PostgreSQL is pretty easy. Once you have your CSV, you can import it. You can do this programmatically, but I have a rule against automating anything I only do once or twice, so I handle this through the UI.
00:15:38 You specify the data source, where it should end up, and a schema: column names, types, and whether each column is nullable.
00:15:56 You can also specify options such as how many header rows to skip and how many errors to allow. All these options exist because of the roots of BigQuery in log processing.
00:16:12 Log data tends to be messy, and sometimes you want to get most of it in without caring about a few dozen errors.
00:16:25 I want to note that it automatically parses timestamps. I haven't found a format that Ruby supports that it can't parse.
00:16:36 Now we're about a third of the way through the talk, and we get to answer the questions. The first question is: Which gem has the most downloads?
00:16:52 We were specifically looking for a top 100. Anyone have a guess? Just shout it out: Active Record? Rails?
00:17:05 Unfortunately, we’re all wrong! Here’s the query: a really basic SQL query that took about five seconds to run, and here’s the data that comes back from the RubyGems downloads.
00:17:25 The top downloaded gems were Rake, Rack, MultiJSON, JSON, and Bundler. I saw this and thought, 'Of course, those would be the most downloaded because everyone uses them, even in non-Rails projects.'
00:17:43 But then I remembered that the downloads table records downloads by version, and this query was only counting a single version of each gem. So I had to do a GROUP BY and some aggregation.
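The SQL isn't reproduced in the transcript, but the aggregation it performs — summing per-version counts into one total per gem, sorted descending — can be sketched in plain Ruby with invented data:

```ruby
# Per-version download counts, shaped like the gem downloads table.
downloads = [
  { name: "rake", version: "11.2.2", count: 100 },
  { name: "rake", version: "11.1.0", count: 50 },
  { name: "rack", version: "2.0.1",  count: 120 },
]

# GROUP BY name, SUM(count), ORDER BY total DESC, in Ruby terms.
totals = downloads
  .group_by { |d| d[:name] }
  .map { |name, rows| [name, rows.sum { |r| r[:count] }] }
  .sort_by { |_, total| -total }
# => [["rake", 150], ["rack", 120]]
```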
00:18:01 The query took about seven and a half seconds, but it returned the same list with slightly larger numbers after each gem name.
00:18:15 At this point, I was speaking to the team, and we were curious: how many downloads does Rails have? So, let’s figure that out by restricting to gems named 'Rails'.
00:18:36 The result was actually pretty high; it ranks within the top 50, possibly in the top 20, but it wasn’t in the top ten, which surprised me.
00:18:51 The second question: in the spirit of Rubyists and our debates over test frameworks, is MiniTest or RSpec more popular? Everyone can think privately, then I'll share results.
00:19:08 Here’s the query using HAVING. The results show that MiniTest is winning. We are still using RSpec for some of our examples and MiniTest for others.
00:19:25 But it was surprising to us that MiniTest has more downloads. Next question: Which version of Rails is most popular?
00:19:42 I hypothesized that most people were still on Rails 3. Note that the data I'm showing is about a month old, so Rails 5 is probably more popular now.
00:20:04 Here's the query: it’s big and kind of gross because there are two joins, but what's cool is I'm using regex to extract the major version from the version number.
00:20:26 I'm getting all the digits to the left of the first dot. What I’m finding is that Rails 3 and Rails 4 are about equally popular in terms of downloads.
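The extraction she describes — all the digits to the left of the first dot — is the same trick in Ruby as in BigQuery's REGEXP_EXTRACT; the exact pattern below is my reconstruction, not hers:

```ruby
# Major version: the digits before the first dot in a version string.
def major_version(number)
  number[/\A(\d+)\./, 1]
end

major_version("4.2.6")   # => "4"
major_version("3.2.22")  # => "3"
```

Grouping download counts by this extracted string is what lets Rails 3 and Rails 4 be compared as wholes.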
00:20:48 If you were paying attention, as the Seattle.rb folks were, you'd notice the download numbers didn't add up to the larger total I showed earlier: about half the downloads were missing.
00:21:13 I spent an hour earlier this week figuring out where the other half of the downloads went. It turns out the RubyGems downloads table has many Rails downloads associated with version ID 0.
00:21:32 But there is nothing in the versions table with ID 0, so I had to use a left join to pull that data in. This is an example where I assumed the data was well-formed, but it wasn't.
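A tiny Ruby model of the problem (numbers invented) shows why the inner join dropped half the downloads: an inner join keeps only downloads whose version_id exists in the versions table, while a left join keeps every download row:

```ruby
versions  = { 1 => "4.2.6", 2 => "3.2.22" }  # version_id => number
downloads = [[1, 100], [2, 80], [0, 170]]    # [version_id, count]; 0 has no version row

# INNER JOIN: the version_id 0 downloads silently disappear.
inner_total = downloads.select { |id, _| versions.key?(id) }.sum { |_, c| c }
# => 180

# LEFT JOIN: every download row is kept, matched or not.
left_total = downloads.sum { |_, c| c }
# => 350
```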
00:21:45 I don't know the actual reason; perhaps a database schema change meant some downloads were recorded before version tracking was cleaned up.
00:22:05 The lesson is to pay attention to detail: add up your values and check that they equal what you think they should.
00:22:22 Another question we had was: Do we need to support Ruby 1.9? The polite answer is maybe, but we want to know for sure.
00:22:38 A good way is to see what everyone else is doing: which versions of Ruby do gems released in the past year require?
00:22:52 I'm using the date_add function to subtract one year from the current timestamp and getting everything created since then.
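She doesn't show the query, but in BigQuery's legacy SQL the cutoff looks something like DATE_ADD(CURRENT_TIMESTAMP(), -1, "YEAR"); the Ruby equivalent, using a fixed example date so the result is deterministic, is:

```ruby
require "date"

# "One year before now" — the lower bound on created_at in her query.
cutoff = Date.new(2016, 9, 8).prev_year
# => 2015-09-08
```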
00:23:14 Here's what I found: most gems in the last year say they’ll work on anything greater than or equal to Ruby version 0.
00:23:34 Requirements get written in many different ways, though: '>= 2.0' and '>= 2.0.0' are the same for my purposes, so I needed to normalize them.
00:23:50 To get better aggregation, I used a regex to extract the version number from the requirement string, which gives better grouping. I'm going to ignore the top row ('>= 0') since I don't think that's a meaningful answer.
00:24:05 I can see various ways of specifying versions greater than 2, indicating that we might be able to avoid supporting 1.9.
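The normalization step can be sketched with a small regex — this pattern is my reconstruction of the idea, pulling the first numeric version out of a requirement string so that differently written requirements group together:

```ruby
# First numeric version in a RubyGems requirement string.
def required_version(requirement)
  requirement[/\d+(?:\.\d+)*/]
end

required_version(">= 2.0.0")  # => "2.0.0"
required_version("~> 1.9")    # => "1.9"
required_version(">= 0")      # => "0"
```

Grouping on this (or just its major component) collapses '>= 2.0', '>= 2.0.0', and friends into comparable buckets.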
00:24:22 At a meeting, we discussed that downloads don’t neatly track against actual usage. When we ask what gems are most commonly used, it doesn’t necessarily align with downloads.
00:24:39 For example, if one company spins up many servers and downloads all their gems repeatedly, that can skew the results.
00:25:01 We considered looking at the GitHub data instead, to determine which gems are used most in public repositories — which gems projects actually require, not just download.
00:25:20 I extracted all the Gemfile.lock files from GitHub by looking at the paths and placing them in another table to simplify my queries.
00:25:44 Then I wrote a query to extract gem names from the Gemfile.lock files: splitting the contents on newlines, pulling a name out of each line, grouping by name, and ordering by count.
00:26:06 This query took 14 seconds, parsing every single Gemfile.lock in every public GitHub repo that contains one.
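Her query does this with SQL string functions; the per-line extraction can be sketched in Ruby. Bundler writes top-level gems in the specs section four spaces deep as `name (version)`, with their dependencies indented further — the sample lockfile below is invented:

```ruby
lockfile = [
  "GEM",
  "  remote: https://rubygems.org/",
  "  specs:",
  "    rack (2.0.1)",
  "      json (>= 1.8)",
  "    rake (11.2.2)",
].join("\n")

# Exactly four spaces of indent, then "name (" — top-level gems only;
# the six-space dependency lines don't match.
names = lockfile.lines.filter_map { |line| line[/\A {4}(\S+) \(/, 1] }
# => ["rack", "rake"]
```

Counting these names across every lockfile is what produces the GitHub-based top-100 list.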
00:26:25 The results showed a quite different list compared to the downloads data, which can happen for several reasons.
00:26:45 It's likely many public GitHub repos are Rails projects not actually deployed anywhere; I know I have several.
00:27:01 There’s also the chance that we’re seeing dependencies because we’re examining Gemfile.lock, which displays the dependency tree.
00:27:20 With these, we had two different top-100 lists, and we ranked gems by their positions on both lists and combined the results.
00:27:41 We then focused on making sure those gems installed cleanly, without errors, on every Ruby image we have.
00:28:04 In conclusion, I learned a few things from this exercise. I learned that my intuition can be wrong, illustrating the need for data over guessing.
00:28:25 I also learned that you can’t assume data is well-formed, even when it comes from reliable sources like RubyGems.
00:28:45 Using different data sources can yield varying answers to the same question.
00:29:01 I want to highlight that Rubyists are the most consistent language community in terms of using spaces, not tabs.
00:29:14 Lastly, I work for Google Cloud Platform, and we will open a data center in Japan before the end of the year.
00:29:34 I will hold office hours during the afternoon break, and one of my colleagues who speaks Japanese will also be available to answer questions.
00:29:55 Feel free to hit me up on Twitter, and if you want to ask questions later, I'm available via email at thagomizer at google.com.
00:30:12 I have a lot of stickers to give away, and I also have plastic dinosaurs to share. Thank you all, and I’m happy to take some questions.
00:30:36 The usual question is, 'Can I have a dinosaur?' Thank you, everyone!