Title
Processing data at scale with Rails
Description
More of us are building apps to make sense of massive amounts of data. From public datasets like lobbying records and insurance data, to shared resources like Wikipedia, to private data from IoT, there is no shortage of large datasets. This talk covers strategies for ingesting, transforming, and storing large amounts of data, starting with familiar tools and frameworks like ActiveRecord, Sidekiq, and Postgres. By turning a massive data job into successively smaller ones, you can turn a data firehose into something manageable.
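As a rough illustration of the "successively smaller jobs" idea, here is a minimal sketch of a fan-out pipeline in Sidekiq. The class names, XML element names, and the `Filing.create_from_xml!` helper are hypothetical, not code from the talk: a parent job downloads and unzips an archive and enqueues one job per XML file, and each file job enqueues one job per record, so no single job ever holds the whole firehose.

```ruby
# Hypothetical sketch of the "successively smaller jobs" pattern.
# Class names, the XML shape, and the helper are illustrative.
require "open-uri"
require "zip"      # rubyzip
require "nokogiri"
require "sidekiq"

class ImportArchiveJob
  include Sidekiq::Job

  # Biggest job: download the archive and enqueue one job per XML file.
  def perform(archive_url)
    Zip::File.open_buffer(URI.open(archive_url)) do |zip|
      zip.glob("*.xml").each do |entry|
        # In production you would stash each file somewhere durable and
        # pass a reference; inlining the XML keeps the sketch short.
        ImportFileJob.perform_async(entry.get_input_stream.read)
      end
    end
  end
end

class ImportFileJob
  include Sidekiq::Job

  # Medium job: parse one file and enqueue one job per record.
  def perform(xml)
    Nokogiri::XML(xml).xpath("//Filing").each do |filing|
      ImportRecordJob.perform_async(filing.to_xml)
    end
  end
end

class ImportRecordJob
  include Sidekiq::Job

  # Smallest job: persist a single record. A failure here is isolated
  # to one filing and can be retried on its own.
  def perform(filing_xml)
    Filing.create_from_xml!(filing_xml) # hypothetical helper, sketched below
  end
end
```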
Summary
The video titled "Processing data at scale with Rails" by Corey Martin, presented at RailsConf 2021, focuses on building data pipelines with familiar tools from the Ruby on Rails ecosystem. The talk aims to help developers manage large datasets effectively through strategies for ingesting, transforming, and storing data.

Key Points Discussed:

- **Introduction to Data Pipelines:** Explains data pipelines in the context of applications that deal with massive datasets, citing examples such as political campaign spending data and public health information during the COVID-19 pandemic.
- **Examples of Data Sources:** Highlights how open datasets and data from IoT devices can be leveraged to build useful applications. Corey shares examples such as OpenSecrets and CVS vaccine alerts to illustrate practical uses of data pipelines.
- **The ETL Process:** Introduces ETL (Extract, Transform, Load) as a simple yet effective way to describe data handling. Martin emphasizes that ETL pipelines can be built with open-source tools and familiar Rails frameworks, without expensive software.
- **Building a Data Pipeline with Rails Tools:** Walks through an example project, an app called "Lobby Focus" that consumes lobbying data from XML files published by the U.S. Congress. Corey details how the data is ingested, including scraping, unzipping, and processing the XML with tools like PostgreSQL, Sidekiq, and Nokogiri.
- **Job Scaling and Parallel Processing:** Discusses the advantages of breaking large data jobs into smaller ones for better manageability and fault isolation, letting developers pinpoint errors at the record level (see the sketch after this list) and improving pipeline efficiency.
- **Handling Data Changes:** Corey stresses the importance of a modular pipeline design that allows easy adjustments when the data structure or source changes, keeping the workflow adaptable and reducing maintenance headaches.
- **Conclusion and Call to Action:** Encourages developers to explore building applications on open data, reiterating that working with real datasets is a great way to build a portfolio and demonstrate skills. He emphasizes that experience constructing data pipelines can be immensely valuable professionally.
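A companion sketch of the record-level load step, again with hypothetical attribute names and assuming a `Filing` model with a unique index on `external_id`: parsing one record with Nokogiri and upserting it through ActiveRecord keeps Sidekiq retries idempotent, so an error points at exactly one record.

```ruby
# Hypothetical record-level loader; attribute names and XML shape are
# illustrative, not taken from the real lobbying disclosure schema.
class Filing < ApplicationRecord
  # Assumes a unique index on :external_id so retries are idempotent.
  def self.create_from_xml!(filing_xml)
    node = Nokogiri::XML(filing_xml).at_xpath("//Filing")
    upsert(
      {
        external_id: node["ID"],
        registrant:  node.at_xpath("Registrant/@Name")&.value,
        amount:      node["Amount"]&.to_d
      },
      unique_by: :external_id
    )
  end
end
```

Keeping the XML-to-attributes mapping in a single method is also one way to get the modularity the talk recommends: when the source's structure changes, only this mapping needs to change.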