Confessions of a Data Junkie: Fetching, Parsing, and Visualization
Summarized using AI

by Bobby Wilson

The video titled "Confessions of a Data Junkie: Fetching, Parsing, and Visualization" features Bobby Wilson discussing his passion for data visualization, emphasizing the process of creating meaningful insights from complex data sets. Wilson, who works at a startup named Next, shares his journey in analyzing data sourced from government repositories and code repositories.

Key points discussed in the presentation include:

- Interest in Data Visualization: Bobby highlights the growing importance of data visualization in technology, driven by the abundance of available data.

- Data Sources and Tools: He mentions various tools and libraries, including Grit for interacting with Git repositories, and points out the challenges of dealing with sparse and unstructured data, particularly from government sources.

- Data Persistence: Wilson emphasizes the need for a persistence layer in data analysis, detailing his use of MongoDB and JSON for easy data management and manipulation of commit history from code repositories.

- Simple Queries: He breaks down the process of data analysis into manageable queries, illustrating how to extract useful insights from repositories with simple MongoDB commands.

- Visualization Libraries: Various visualization libraries are discussed, notably Raphaël and Protovis, with a focus on the syntax and efficiency of creating engaging visualizations. He even introduces RubyViz, which integrates Protovis within Ruby for generating server-side SVG visualizations.

- Commit Analysis: Wilson demonstrates how to analyze commit data over time, showcasing trends in codebase growth and contributor activity through visual representations like bar charts and commit maps.

- Engagement with the Audience: He connects with the audience by contrasting the appeal of visualizations with the complexities of raw government data, making a case for visualization as a means of clearer communication.

- Practical Applications: Bobby concludes by expressing his desire to make visualization tools more accessible, suggesting that future services could automate this process for users trying to visualize their repository's data.

Ultimately, Wilson's talk emphasizes the accessibility of data visualization techniques and tools while advocating for a structured approach to transforming raw data into clear, insightful visual formats.

00:00:13.120 Alright, yes! So I guess first of all, for my first regional talk, this is a subject I'm really excited about. I have a few things to go over before I dive in. I work at a company called Next. It's a great place; it's a startup.
00:00:24.519 I've been getting interested in data visualization. It's kind of a hot topic in technology right now. There's a lot of data available: data.gov is a huge repository for all things government-related.
00:00:31.880 Some labs also provide a lot of tools and data themselves. Infochimps is a commercial offering; they have some open-source data.
00:00:37.120 I gave kind of a precursor to this talk, you know, a sort of testing-things-out talk, at my local user group, and they really liked the slides I prepared on code repositories. I think that's pretty cool too.
00:00:45.960 The means by which I was analyzing the code from my repositories was Grit. I think most of you remember Grit. It was kind of the first library that came out when GitHub was launching.
00:01:00.320 It was really cool because it was a library in Ruby that would interface with your Git repositories and provide you methods to interact with your repository.
00:01:08.119 The reason I liked using it for data visualization is that the data is very complete.
00:01:13.600 One of the things I found was that using some of the other sources, the data would often be really sparse. It might come in an Excel file or a CSV file.
00:01:20.560 One of the repositories I pulled down from data.gov was the energy data, and I found that interesting. I thought it would be cool to work with something topical.
00:01:26.479 However, I quickly discovered that there were all these really weird formats, and the data wasn’t where it was supposed to be.
00:01:34.320 You had to use six-digit codes to figure out which energy source and units were being used, so it got really complicated very quickly.
00:01:41.680 Grit also provides some nice basic built-in stats for commits, and it’s appealing because you have access to anything on GitHub concerning the code repositories.
00:01:49.240 I like having an engaging topic to talk about, and I thought code repositories would be more interesting to you than government data.
00:01:54.280 After all, government data could get weirdly political, especially if I had opinions about it.
00:02:00.960 Now, the first thing I was worried about with the data was having some sort of persistence layer.
00:02:06.960 I initially used the Grit interface to access all the commits and decided to work on a persistence method.
00:02:12.840 I thought this would be a nice way to access the data and ensure it was persisted properly.
00:02:18.480 One challenge I encountered while getting into the MongoDB stuff was messing up the data a lot.
00:02:24.080 I would overwrite existing data or create duplicates, and that sort of thing.
00:02:29.400 I thought exporting to plain text would be really nice, so I aimed to export to JSON.
00:02:35.560 This would provide easy importing into MongoDB and serve as a backup for the persisted data in case I messed anything up.
00:02:43.280 I also wanted to do some MapReduce, which is a pretty neat way to handle such tasks.
00:02:49.599 This was the short script I used for importing data. I used the Ruby MRI repository.
00:02:56.599 I pulled down the repo, and the commit count is simply the total commits.
00:03:03.239 I navigated through each commit to import the data about those commits.
00:03:09.400 At this point, I’m just writing it out to a JSON file.
00:03:16.759 This was my original intention: to have a flat file representation that I could easily retrieve.
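The flat-file export described here can be sketched in plain Ruby. This is a minimal illustration, not the script from the slide: the sample records and their field names (sha, author, date, additions, deletions) are assumptions, and the data is hard-coded rather than read through Grit. It writes one JSON document per line and reads the file back.

```ruby
require "json"

# Hypothetical commit records. In the talk these come from Grit;
# the field names here are assumptions for illustration only.
SAMPLE_COMMITS = [
  { "sha" => "a1b2c3", "author" => "alice", "date" => "2010-01-05T10:00:00Z",
    "additions" => 120, "deletions" => 4 },
  { "sha" => "d4e5f6", "author" => "bob", "date" => "2010-02-11T16:30:00Z",
    "additions" => 15, "deletions" => 310 }
].freeze

# Write one JSON document per line so the file can be imported line by line.
def export_commits(commits, path)
  File.open(path, "w") do |file|
    commits.each { |commit| file.puts(JSON.generate(commit)) }
  end
end

# Read the flat file back, for example to restore data after a bad write.
def load_commits(path)
  File.readlines(path, chomp: true)
      .reject(&:empty?)
      .map { |line| JSON.parse(line) }
end
```

Calling export_commits(SAMPLE_COMMITS, "commits.json") would produce the backup file, and load_commits("commits.json") would return the same records as hashes.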
00:03:24.759 I found that I really liked persisting data with JSON, especially given the database options available now.
00:03:32.360 For instance, CouchDB makes it easy to import and export JSON.
00:03:39.960 After writing out that file, all we had to do was run a mongoimport command and specify the database and the collection, along with the file.
00:03:46.639 It takes a line-by-line JSON file and imports it into MongoDB, giving us access to all these commits.
00:03:53.120 This was kind of the first visualization. At this point, it's just a snapshot of the data I was extracting from the repo.
00:04:04.079 I show this slide because it's so simple to do this sort of analysis.
00:04:10.879 With all the data heads here, it seems we don't spend enough time examining the factual information about our repositories.
00:04:17.680 For example, does anyone know how many commits the Ruby repo has?
00:04:24.960 It's around 22,000, but some estimates speculate it's more like half a million.
00:04:31.760 This just offers some basic information. I want to show how I gathered this data.
00:04:38.760 I was just experimenting in the MongoDB console to extract this information.
00:04:45.559 You can see the queries are really simple, even if you're not a MongoDB expert.
00:04:51.760 Even though it's not super significant to my application here, it's simply the method I chose for this instance.
00:04:56.760 One amusing quirk is the sorting: you can see date1 is in descending order while date2 is in ascending order.
00:05:04.079 In approaching this data, just start simple. Begin with small queries to get the information you want.
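The "start simple" advice can be illustrated without a database. The sketch below is an in-memory stand-in for the console queries, not the MongoDB commands from the slide; the records and field names are assumptions.

```ruby
require "time"

# Hypothetical commit records; the field names are assumptions.
COMMITS_FOR_STATS = [
  { "author" => "alice", "date" => "2009-03-01T09:00:00Z" },
  { "author" => "bob",   "date" => "2010-07-15T12:00:00Z" },
  { "author" => "alice", "date" => "2008-11-20T18:45:00Z" }
].freeze

# Answer the small questions first: how many commits, over what span of time,
# and by how many distinct authors.
def repo_snapshot(commits)
  by_date = commits.sort_by { |c| Time.parse(c["date"]) }
  {
    total_commits: commits.size,
    first_commit: by_date.first["date"], # ascending sort, so oldest is first
    last_commit: by_date.last["date"],   # same ordering, so newest is last
    authors: commits.map { |c| c["author"] }.uniq.size
  }
end
```

Each value in the returned hash corresponds to one small query, which is the kind of snapshot a first slide can be built from.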
00:05:11.960 Choosing visualization libraries is crucial. There are many options available.
00:05:19.400 Raphaël is one of the more popular ones, but I also like a project called Protovis by Mike Bostock.
00:05:26.760 It's a library that produces SVG-style visualizations.
00:05:32.960 The syntax is great and very declarative, and they provide lots of examples and good documentation.
00:05:40.560 Mike Bostock works out of Stanford's visualization group, so he definitely knows what he’s doing.
00:05:47.679 This example shows how appealing the syntax is; you can easily call out properties like method calls.
00:05:54.399 Chaining methods elegantly leads to a long sequence of commands to produce the visualization.
00:06:01.760 This way of working with visualizations is very interesting, allowing us to build on previous commands seamlessly.
00:06:07.280 In this instance, the canvas is where we visualize our data by adding bars to represent the array.
00:06:14.120 From a first glance, it’s pretty straightforward, even if you haven’t used this library before.
00:06:22.720 The function(d) syntax is a little tricky, but it’s nice since it allows closures within the chain.
00:06:29.480 You can access the current data within the array, making data manipulation efficient.
00:06:35.679 This is a very simple example of the kind of visualizations we're creating.
00:06:42.240 Now we have a slightly more complicated example, which is Ruby code showing the growth of the codebase over time.
00:06:53.720 This data reflects the size of the code and looks back over time.
00:07:01.320 In the slide, you can observe the year separations at the bottom.
00:07:09.880 The black spots indicate when more aggressive development occurred.
00:07:17.480 You can visually discern when the Ruby library was more active in its development.
00:07:24.080 I created a Sinatra app to access the data from our previous import.
00:07:31.680 It’s very simple, including MongoDB and setting up a route for repo commits.
00:07:38.760 I think at some point it would be cool to have a web service where you just input your repo and it displays this kind of data automatically.
00:07:47.480 There is one quirk in this code, which is the dollar sign followed by custom filtering for commits.
00:07:55.160 This gets all commits that change more than 300 lines of code.
00:08:02.960 Filtering out smaller commits resulted in a clearer graph, though I know dropping data isn’t ideal.
00:08:09.800 I went ahead with it to keep my slide clear while showing the most relevant details.
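The filtering step can be sketched in plain Ruby. This is an in-memory illustration rather than the MongoDB query from the slide. It assumes that "lines changed" means additions plus deletions, and the field names are hypothetical.

```ruby
# Hypothetical commit records; "additions" and "deletions" are assumed field names.
COMMITS_TO_FILTER = [
  { "sha" => "a", "additions" => 120, "deletions" => 4 },
  { "sha" => "b", "additions" => 15,  "deletions" => 310 },
  { "sha" => "c", "additions" => 300, "deletions" => 0 },
  { "sha" => "d", "additions" => 500, "deletions" => 20 }
].freeze

# Keep only commits that change more than `threshold` lines.
# Counting additions plus deletions is an assumption about how change is measured.
def large_commits(commits, threshold = 300)
  commits.select { |c| c["additions"] + c["deletions"] > threshold }
end
```

The threshold is a parameter so the trade-off between a clean graph and dropped data stays visible and easy to adjust.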
00:08:16.560 Next, I created a hash of the fields I wanted, ensuring I had more than enough data for analysis.
00:08:24.240 This view represents a snapshot of the analysis, ensuring we have a good overview.
00:08:32.480 I focused on the additions and deletions from the commits, looking for trends.
00:08:39.200 Here, we're managing running totals of changes over time.
00:08:46.880 The library I'm using provides great tools to normalize the data, making analysis more straightforward.
00:08:54.080 I’m simply pushing that data into an array and returning a consolidated view.
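A running total of codebase size can be sketched as follows. The records and field names are assumptions, and dates are taken to be ISO 8601 strings in a single format so that plain string sorting puts them in chronological order.

```ruby
# Hypothetical commit records, deliberately out of order.
COMMITS_FOR_TOTALS = [
  { "date" => "2009-06-01", "additions" => 50,  "deletions" => 30 },
  { "date" => "2010-02-01", "additions" => 10,  "deletions" => 60 },
  { "date" => "2009-01-01", "additions" => 100, "deletions" => 0 }
].freeze

# Walk the commits oldest first and accumulate net lines (additions minus deletions).
# Each entry pairs a date with the codebase size after that commit.
def running_totals(commits)
  total = 0
  commits.sort_by { |c| c["date"] }.map do |c|
    total += c["additions"] - c["deletions"]
    { "date" => c["date"], "size" => total }
  end
end
```

The returned array is the kind of consolidated view an endpoint could serialize to JSON for the charting layer.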
00:09:01.760 The other aspect of working with data is ensuring it's normalized for effective visualization.
00:09:08.960 This next section discusses another use case where I previously generated visuals.
00:09:15.760 Here, I'm focused on creating a bar chart as part of the visual representation.
00:09:23.040 I made the bars thin to illustrate the progression over time, keeping the visuals engaging.
00:09:31.280 There’s nothing particularly special about the charting process, but showing this process is vital.
00:09:37.600 I wanted to highlight that this methodology is accessible for creating such outcomes.
00:09:45.760 A challenge I faced was needing to constantly create new endpoints for different visualizations.
00:09:54.080 Eventually, I stumbled upon a library called RubyViz, which ported Protovis to Ruby.
00:10:02.080 This allows generation of SVG server-side, and the syntax is even better in my opinion.
00:10:09.200 The example code from before becomes cleaner and more concise, leveraging Ruby's block syntax.
00:10:16.680 It also produces the same results with much less effort, making the code more readable.
00:10:23.600 The advantage of RubyViz is that you don’t have to create a web service just to render visualizations.
00:10:31.040 You can just call the to_svg method, which allows for the full manipulation of your data before producing SVG output.
00:10:38.480 This creates a flat file with SVG markup that you can access easily.
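To show what such a flat SVG file contains, here is a hand-rolled sketch in plain Ruby. It does not use the RubyViz API. The function names, colors, and dimensions are all assumptions chosen for illustration.

```ruby
# Build a minimal SVG bar chart as a string. This is hand-written markup,
# not RubyViz output; it only shows the kind of file that results.
def bar_chart_svg(values, bar_width: 20, gap: 5, height: 150)
  raise ArgumentError, "need positive values" if values.empty? || values.max <= 0

  max = values.max.to_f
  bars = values.each_with_index.map do |value, index|
    bar_height = (value / max * height).round(2)
    x = index * (bar_width + gap)
    y = (height - bar_height).round(2) # SVG's y axis grows downward
    %(<rect x="#{x}" y="#{y}" width="#{bar_width}" height="#{bar_height}" fill="steelblue"/>)
  end
  width = values.size * (bar_width + gap)

  <<~SVG
    <svg xmlns="http://www.w3.org/2000/svg" width="#{width}" height="#{height}">
      #{bars.join("\n  ")}
    </svg>
  SVG
end

# Render once to disk; the file can then be opened or served as a static asset.
def write_chart(values, path)
  File.write(path, bar_chart_svg(values))
end
```

Because the output is a static file, it can be viewed in a browser without any web service running behind it.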
00:10:46.080 Additionally, once you render the visualization, you don't have to rerun the queries repeatedly.
00:10:55.040 This is significant because it saves resources and time, especially with more complex data.
00:11:01.760 If you want to see the last 20 commits, this example showcases that process.
00:11:10.080 The output shows additions in green and deletions in red, similar to console output.
00:11:16.640 The queries are straightforward; we just access our collection and filter for the recent commits.
00:11:25.920 The scaling helps ensure data isn't too large or small, providing a uniform perspective.
00:11:33.600 The scale can be adjusted to your liking, placing positive and negative values in a graphing context.
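The scaling idea can be sketched as a plain linear mapping from data values to pixels. This is not the library's scale API. The function names are hypothetical, and treating deletions as negative values around a centered zero line is an assumption about how the chart is laid out.

```ruby
# Return a lambda that maps values from [domain_min, domain_max]
# onto [range_min, range_max] by linear interpolation.
def linear_scale(domain_min, domain_max, range_min, range_max)
  span = (domain_max - domain_min).to_f
  raise ArgumentError, "domain must not be empty" if span.zero?

  ->(value) { range_min + (value - domain_min) / span * (range_max - range_min) }
end

# Additions plot upward and deletions downward, so use a symmetric domain
# that keeps zero in the middle of the chart.
def diff_scale(commits, pixels)
  extent = commits.flat_map { |c| [c["additions"], c["deletions"]] }.max
  linear_scale(-extent, extent, -pixels, pixels)
end
```

With a domain of -200 to 200 and a range of -100 to 100 pixels, a commit that deletes 100 lines maps to a bar 50 pixels below the axis.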
00:11:41.760 Here’s a view of the bar styling using Proto, and we leverage that block syntax for neatness.
00:11:49.440 I find the syntax to be practical, leading to cleaner output and improving readability.
00:11:56.800 This outputs directly to a file, generating SVG markup pretty effortlessly.
00:12:03.680 The last visualization was about committers and the amount of code they shifted.
00:12:10.560 Although the labels are a bit janky, it represents the contributions accurately.
00:12:16.640 You can see that Matz's contributions are significant, as indicated by the size of the circles.
00:12:23.680 Visualizing data in this way makes it easier to perceive relationships without clutter.
00:12:31.040 The visualization offers a clear understanding of the differences in commit sizes.
00:12:38.480 This was a fun and engaging way to analyze the Ruby repository and its contributors.
00:12:46.320 To create that visualization, I used MapReduce for more clarity on commit data by author.
00:12:53.920 The MapReduce structure required extracting data efficiently for all commits.
00:13:01.760 The map function provides the author and the reduce function sums the totals.
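The shape of that map and reduce pair can be sketched in plain Ruby. This runs in memory and is not the MongoDB MapReduce job from the talk. The field names are assumptions, as is the choice to count additions plus deletions as the amount of code an author moved.

```ruby
# Hypothetical commit records; the field names are assumptions.
COMMITS_BY_AUTHOR = [
  { "author" => "alice", "additions" => 120, "deletions" => 4 },
  { "author" => "bob",   "additions" => 15,  "deletions" => 310 },
  { "author" => "alice", "additions" => 10,  "deletions" => 5 }
].freeze

# Map step: emit a [key, value] pair of author and lines changed.
def map_commit(commit)
  [commit["author"], commit["additions"] + commit["deletions"]]
end

# Reduce step: collapse all values emitted for one key into a single total.
def reduce_totals(_author, values)
  values.sum
end

# Run map over every commit, group the emitted pairs by key, then reduce each group.
def lines_by_author(commits)
  commits.map { |c| map_commit(c) }
         .group_by(&:first)
         .map { |author, pairs| [author, reduce_totals(author, pairs.map(&:last))] }
         .to_h
end
```

The resulting hash of author to total is what would drive the circle sizes, one node per author.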
00:13:08.120 I created an array containing all the nodes for the commits to visualize effectively.
00:13:15.760 This layout helps in providing a visual perspective on the changes over time.
00:13:22.960 The fill and stroke styles make the visualization pop, thanks to Ruby's color libraries.
00:13:30.240 I filtered out a few less notable committers to keep the visualization clear.
00:13:37.600 Lastly, I added labels for each author, representing their contributions neatly.
00:13:44.640 This process demonstrates my love for RubyViz, finishing by calling the to_svg method for a static representation.
00:13:51.520 I can work through the whole visualization talk with confidence.
00:13:59.520 Are there any questions? For example, have you encountered cross-browser issues?
00:14:07.360 I haven’t researched the browser compatibility issues much, but I know there’s a solid site for tracking compatibility.
00:14:14.160 If I were to do this for a job, I'd ensure compatibility. However, this was just a fun endeavor for me.
00:14:21.760 I utilized whatever worked, and Chrome seemed to handle everything fine.
00:14:29.200 I'm planning to put everything on GitHub, and I'd love to chat with anyone interested.
00:14:36.960 Lastly, I got this shirt from doing a visualization for the Mozilla Foundation.
00:14:43.720 I didn’t win the prize, but I ended up with this quirky t-shirt.
00:14:50.480 I’m eager to continue this visualization work and connect with anyone interested. Thank you, everyone!
Explore all talks recorded at MountainWest RubyConf 2011