Demystifying Data Science: A Live Tutorial

The video titled "Demystifying Data Science: A Live Tutorial" features Todd Schneider, an engineer at Rap Genius, presenting a live coding session at RailsConf 2014. Schneider aims to illustrate the practical aspects of data science through a real-world text mining problem, specifically tracing the evolution of popular words and phrases in RailsConf talk abstracts over the years.

Key Points Discussed:

Introduction to Data Science: Schneider begins by describing his move from finance to web programming. Because data science work is often messy, he stresses the value of choosing problems that genuinely interest you.
Project Overview: He introduces his previous project, Wedding Crunchers, which analyzes the popularity of words in New York Times wedding announcements over time. This serves as the inspiration for the current project focused on RailsConf abstracts.
N-gram Analysis: Schneider explains what an n-gram is (a sequence of n consecutive words in a text) and outlines the three main steps of the project:
- Gathering data
- Analyzing data using n-gram calculations
- Creating a visual interface.
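
To make the definition concrete, here is a minimal sketch of pulling runs of consecutive words (usually called n-grams) out of a sentence. The method name `ngrams` and the sample sentence are illustrative assumptions, not code taken from the talk.

```ruby
# Return every run of n consecutive words as a space-joined string.
# `ngrams` is a hypothetical helper name chosen for this sketch.
def ngrams(words, n)
  words.each_cons(n).map { |gram| gram.join(" ") }
end

words = "rails is a web framework".split

unigrams = ngrams(words, 1)  # single words
bigrams  = ngrams(words, 2)  # pairs of adjacent words
# bigrams => ["rails is", "is a", "a web", "web framework"]
```

`Enumerable#each_cons` does the sliding-window work here, so a sentence of k words yields k - n + 1 n-grams.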
Data Gathering: Focusing on the first two steps, he describes how to scrape data from the RailsConf website using Nokogiri, a Ruby gem for parsing HTML. He demonstrates the process of identifying the structure of the web page and extracting needed attributes like titles, speakers, abstracts, and bios.
Data Analysis: After gathering the data, Schneider turns to the analysis, where he calculates the n-grams from the abstracts. He emphasizes the importance of normalizing text (e.g., downcasing, removing punctuation) before performing any analysis.
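
A minimal sketch of normalizing text and then counting word sequences is shown below. The exact cleaning rules are an assumption made for this example (keep lowercase letters, digits, and apostrophes); the talk's own code may normalize differently.

```ruby
# Downcase, replace punctuation with spaces, and split into words.
# The character whitelist is an assumption made for this sketch.
def normalize(text)
  text.downcase.gsub(/[^a-z0-9'\s]/, " ").split
end

# Count how often each run of n consecutive words occurs in the text.
# Simplification: runs are allowed to span sentence boundaries.
def ngram_counts(text, n)
  normalize(text).each_cons(n).each_with_object(Hash.new(0)) do |gram, counts|
    counts[gram.join(" ")] += 1
  end
end

counts = ngram_counts("Rails is fun. Rails is fast!", 2)
# counts["rails is"] => 2
```

Without normalization, "Rails" and "rails," would be counted as different words, which would split the totals and distort any trend over time.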
Using Redis for Storage: He discusses using Redis sorted sets, organized by year, to store and query the n-gram counts efficiently and to track how often a given word or phrase appears.
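
One way to lay this out is a sorted set per year, with each word sequence as a member and its count as the score. The sketch below assumes that layout; the key format `ngrams:<n>:<year>` and the helper names are hypothetical, not taken from the talk. The helpers accept any client that responds to `zincrby` and `zscore`, which the `redis` gem's client does.

```ruby
# Hypothetical key layout: one sorted set per n-gram size and year.
def ngram_key(n, year)
  "ngrams:#{n}:#{year}"
end

# Add counts (a Hash of phrase => occurrences) to that year's sorted set.
# ZINCRBY adds to the member's score, creating the member if needed.
def store_counts(redis, year, n, counts)
  counts.each do |gram, count|
    redis.zincrby(ngram_key(n, year), count, gram)
  end
end

# Look up one phrase across several years; a missing member counts as zero.
def counts_by_year(redis, gram, years)
  n = gram.split.size
  years.map do |year|
    score = redis.zscore(ngram_key(n, year), gram)
    [year, (score || 0).to_i]
  end.to_h
end

# Usage against a real server (needs the redis gem and a running Redis):
#   require "redis"
#   redis = Redis.new
#   store_counts(redis, 2014, 1, { "ruby" => 3, "rails" => 5 })
#   counts_by_year(redis, "rails", [2013, 2014])
```

Sorted sets also make "top phrases for a year" a single range query by score, which suits a charting front end.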
Final Visualization: Although time runs out before he can demonstrate the front-end interface in full, Schneider mentions using libraries like Highcharts to visualize the results on the web.

Conclusions and Takeaways:

Schneider stresses that the details of scraping and analyzing data are crucial to any data science project and are often its most challenging part.
He underscores that advanced algorithms are not always necessary; simple methods can yield meaningful insights.
He invites viewers to explore the full code and examples on GitHub, reinforcing the notion that Ruby and the Rails ecosystem are capable of tackling data science problems effectively.

Todd Schneider • April 22, 2014 • Chicago, IL

To get a grip on what "data science" really is, we'll work through a real text mining problem live on stage. Our mission? Trace the evolution of popular words and phrases in RailsConf talk abstracts over the years! We'll cover all aspects of the problem, from gathering and cleaning our data, to performing analysis and creating a compelling visualization of our results. People often overlook Ruby as a choice for scientific computing, but the Rails ecosystem is surprisingly suited to the problem.

Originally a "math guy", Todd spent six years working for a hedge fund building models to value mortgage-backed securities before a fortuitous foray into web programming got him tangled up with Rap Genius. His recent Rails work includes Rap Stats, Wedding Crunchers, and Gambletron 2000.

RailsConf 2014