GoRuCo 2016

Exploring Big Data with rubygems.org Download Data by Aja Hammerly

Many people strive to be armchair data scientists. Google BigQuery provides an easy way for anyone with basic SQL knowledge to dig into large data sets and just explore. Using the rubygems.org download data we'll see how the Ruby and SQL you already know can help you parse, upload, and analyze multiple gigabytes of data quickly and easily without any previous Big Data experience.

Help us caption & translate this video!

http://amara.org/v/PsL8/


00:00:15.400 Welcome everyone to my talk about RubyGems data and exploring big data using rubygems.org download data.
00:00:18.500 I'm Aja Hammerly, and I'm on GitHub as 'thagomizer'.
00:00:20.510 I enjoy it when people tweet at me during my talk because it gives me something to do while coming down from the adrenaline. The slides for this talk are available at thagomizer.com, thanks to conference Wi-Fi that actually works. I have a strong interest in dinosaurs, and I work at Google on the Google Cloud Platform. Since Google paid for my trip out here, let me quickly pitch what we offer: VMs, container infrastructure, storage, and a lot of exciting machine learning capabilities.
00:00:35.420 If you have questions or need hosting for a website, come talk to me or reach out via Twitter or email me at '[email protected]'. As I work for a large company, I have to include this legal disclaimer: all code in this talk is copyright Google and licensed under the Apache 2.0 license.
00:00:59.750 Now, let's dive into big data. Data is at the heart of many of our favorite applications. I've been in technology for about 15 to 20 years, and I've seen how data, recommendations, and mapping technology have improved our lives. For instance, I'm originally from Seattle, and one of the applications I rely on to improve my commuting experience is one that tells me when the buses are running late.
00:01:19.579 As I engage with startups and attend events, I'm hearing more stories where data plays a critical role. Big data seems to be ubiquitous now, likely because storage costs have significantly decreased over time. Let me take you back to my first tech job, where I was installing 10-gigabyte hard drives—those were considered large back then. Now, I can purchase that same amount of storage on a flash drive from Target. Storage is now super cheap, yet I don't see many Ruby developers engaging with big data.
00:02:04.100 One reason for this might be that working with big data can be intimidating. I felt the same way when starting out because of the complexity of statistics. If you read books on data mining or machine learning, you often encounter complicated formulas and Greek letters, which can be overwhelming even for those determined to get into the field. Additionally, many think a PhD is needed to pursue machine learning, but I'm here to tell you that it’s not a prerequisite.
00:02:54.720 Today's talk is focused on exploratory data analysis and building accessible dashboards that allow you to do interesting things with the large datasets you might already have. For this data analysis talk, I’ll be using the RubyGems download data. By visiting rubygems.org, you'll find a link to their data, and I encourage you to seek out similar information available on various websites—more places are providing access to subsets of their data for public use.
00:03:39.200 The datasets from the Ruby Core team and RubyGems include two types: a weekly dump of their Postgres database and the latest data. I'll be focusing on the Postgres data because it has the information I find most useful. The main table is 'RubyGems', which lists all the gems, with roughly 125,000 records in the database. The second table is 'Downloads', which contains raw download numbers for each version of the gem—this table has about 900,000 rows, reflecting the many versions that exist.
00:04:06.840 Next is the 'Dependencies' table, which has around 3.5 million rows and connects gem IDs to their dependencies; each dependency also records a scope, such as runtime or development. The 'LinkSets' table has about 125,000 records, providing the information shown in the right-hand navigation of each gem's page on rubygems.org, such as the home page and wiki links. Lastly, the 'Versions' table contains detailed information about gem versions, with around 750,000 records. It lets us analyze trends over time and examine which authors are most prolific.
00:05:04.100 Once I’ve gathered information about these datasets, the next step is to formulate questions. This can be challenging. An effective approach is using domain knowledge to generate insights and inquiries. For example, many gems often depend on JSON libraries because I’ve found that nearly all the projects I’ve worked on rely on at least one JSON gem. Additionally, conducting formal statistical analysis requires forming a hypothesis to test against the null hypothesis, meaning that nothing interesting is happening.
00:05:58.000 For example, I could ask: which gem has the most downloads? Is Rails the most downloaded gem? Is MiniTest more popular than RSpec? A testable hypothesis might be that gems released in the last year commonly require Ruby 2.0 or higher. I want to point out that the datasets we're using aren't enormous by industry standards; they're just large enough to be relevant for our audience.
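The "most downloads" question can be sketched as a SQL query against the tables described earlier. This is a hedged sketch: the join columns and the count column are assumptions about the rubygems.org Postgres dump, not a schema confirmed in the talk.

```ruby
# Hypothetical query for the most-downloaded gems. Column names
# (d.count, d.version_id, v.rubygem_id, r.name) are assumptions
# about the rubygems.org Postgres dump, not confirmed schema.
MOST_DOWNLOADED_SQL = <<~SQL
  SELECT r.name, SUM(d.count) AS total_downloads
  FROM downloads d
  JOIN versions v ON v.id = d.version_id
  JOIN rubygems r ON r.id = v.rubygem_id
  GROUP BY r.name
  ORDER BY total_downloads DESC
  LIMIT 10
SQL
```

The aggregation and joins here are exactly the kind of work BigQuery is built for once the tables are loaded.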
00:07:12.000 To analyze this data, I'll leverage BigQuery, which is designed for large datasets. It is a non-relational, append-only database tool without indexes, created by Google as part of Google Cloud Platform. It was built to search through and analyze massive logs, which is why it copes so well with huge, messy datasets. BigQuery supports SQL queries, making it approachable, and it scales efficiently to datasets on the order of terabytes.
00:08:57.000 Now, I am going to run a live demo to demonstrate how fast it is to query data. I have a dataset containing all the stories from Hacker News in 2015. Someone can shout out a word they think might be in the title of a Hacker News story, and I'll run a query to find it. For instance, let’s use 'Bitcoin'. Running the query processes approximately 3.91 gigabytes and produces several records. The average score of stories with 'Bitcoin' in the title is 9.4.
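A minimal Ruby sketch of that demo, assuming the google-cloud-bigquery gem, configured credentials, and a hypothetical `hacker_news.stories_2015` table with `title` and `score` columns. The query itself is plain SQL; the gem call is wrapped in a method so it only runs in a configured environment.

```ruby
# The aggregation from the demo as a SQL string.
BITCOIN_SQL = <<~SQL
  SELECT AVG(score) AS avg_score
  FROM hacker_news.stories_2015
  WHERE title LIKE '%Bitcoin%'
SQL

# Hedged: requires `gem install google-cloud-bigquery` plus credentials,
# and the table name above is illustrative.
def average_bitcoin_score(project_id)
  require "google/cloud/bigquery"
  bigquery = Google::Cloud::Bigquery.new project_id: project_id
  rows = bigquery.query BITCOIN_SQL   # enumerable of row hashes
  rows.first[:avg_score]
end
```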
00:10:12.000 Next, I calculate the average score of all stories with that keyword in the title, showcasing how quickly BigQuery handles aggregations over large tables. For comparison, the average score for stories with 'Hiring' in the title was much lower, suggesting that 'Bitcoin' stories draw more interest.
00:12:07.000 Now, let’s take a moment to review basic definitions of terminology for those unfamiliar. A dataset in BigQuery is a collection of tables, where tables consist of records structured in a consistent way. To get the data into BigQuery, I’ll demonstrate two methods: streaming and batch processing.
00:12:50.000 Streaming is used to push records into BigQuery almost in real time, perfect for applications requiring quick data availability. I plan to connect to Google Cloud using a Ruby gem, allowing me to send records directly from a Ruby application. On the other hand, batch processing is best for data that doesn't need immediate availability. Common formats include CSV, JSON, and Avro, and for this demo, I’ll utilize CSV because of its simplicity.
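A hedged sketch of the streaming path, assuming the google-cloud-bigquery gem and a pre-created dataset and table (the names here are made up): records become plain hashes, and `table.insert` streams them in near real time.

```ruby
# Convert an application record into the row hash BigQuery expects.
def to_bq_row(record)
  { name: record[:name], downloads: Integer(record[:downloads]) }
end

# Hedged: assumes credentials plus a "rubygems" dataset containing a
# "downloads" table; both names are illustrative, not from the talk.
def stream_rows(project_id, records)
  require "google/cloud/bigquery"    # gem install google-cloud-bigquery
  bigquery = Google::Cloud::Bigquery.new project_id: project_id
  table = bigquery.dataset("rubygems").table("downloads")
  table.insert records.map { |r| to_bq_row(r) }
end
```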
00:14:00.000 I begin by connecting to Postgres and exporting the dependencies table to a CSV file. Once it's ready, I upload the CSV to Google Cloud Storage and load it into BigQuery, defining the schema during the load job. Specifying a schema this way will feel familiar to anyone accustomed to Rails migrations.
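The export step might look like the sketch below, using the pg gem and Ruby's CSV standard library. The column list for the dependencies table is an assumption, and the database connection is wrapped in a method so the CSV formatting stays independently testable.

```ruby
require "csv"

# Turn an enumerable of row hashes into CSV text with a header row.
def rows_to_csv(rows, headers)
  CSV.generate do |csv|
    csv << headers
    rows.each { |row| csv << headers.map { |h| row[h] } }
  end
end

# Hedged: assumes a local "rubygems" database restored from the dump;
# the column names are guesses at the dependencies table schema.
def export_dependencies(csv_path)
  require "pg"                       # gem install pg
  conn = PG.connect(dbname: "rubygems")
  headers = %w[id rubygem_id version_id scope]
  result = conn.exec("SELECT #{headers.join(', ')} FROM dependencies")
  File.write(csv_path, rows_to_csv(result, headers))
end
```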
00:15:23.000 As the final stage, I fully automate the import: select rows from Postgres, convert them to hashes, and insert them into BigQuery. Ruby's Enumerable methods make building those row hashes from the query results seamless.
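The Enumerable trick mentioned above can be sketched like this: `zip` pairs each column name with its value, and `to_h` builds the hash a BigQuery insert expects (the column names are illustrative).

```ruby
# Build row hashes from a column list and an array of value tuples,
# the shape a Postgres query result can provide.
def row_hashes(columns, tuples)
  tuples.map { |tuple| columns.zip(tuple).to_h }
end

row_hashes(%w[id name], [[1, "rails"], [2, "rake"]])
# => [{"id"=>1, "name"=>"rails"}, {"id"=>2, "name"=>"rake"}]
```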
00:16:35.000 In conclusion, we discussed how to utilize BigQuery for analyzing RubyGems data, and the challenges of making data accessible and generating insights through the analysis process. I encourage you to explore the capabilities that come with utilizing data in your applications, from analyzing trends to understanding how different libraries are being used.
00:17:58.000 Thank you all for your attention, and please feel free to grab some free stickers and Google Cloud swag after this talk. I have a high quantity of dinosaur stickers as well as GCP branded sunglasses and other goodies. I really appreciate your participation and hope you found value in this session.