Title
Get your Data prod ready, Fast, with Ruby Polars!
Description
Imagine you receive a CSV of data that has over 500,000 rows and 100 columns. Data is randomly missing in some places, some of the column names are wrong, and some of the columns contain mixed data types. Correcting and cleaning that data by hand could take hours. Fear not! There is a better and faster way. We will look into using Ruby Polars, a gem written in Rust with a Ruby API, to wrangle and clean tabular data to get it prod ready. By learning some basic operations used in Polars, you can greatly expedite the import process for CSV files and API data. Whether your goal is to use the data in an existing application or in a Ruby AI/machine learning project (since cleaning data is a vital first step in that process), this talk will get you well on your way!
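For a feel of what this looks like in code, here is a minimal, illustrative sketch (not taken from the talk) that loads a large CSV with the `polars-df` gem, inspects it, and fixes a mislabeled and mistyped column. The file name, column names, and types are invented for the example.

```ruby
require "polars-df" # Ruby Polars gem

# Read a large CSV quickly; parsing happens in the Rust backend.
df = Polars.read_csv("export.csv") # hypothetical file

df.shape   # => [rows, columns], e.g. [500000, 100]
df.schema  # column names mapped to their inferred data types

# Fix a mistyped column name, then coerce a column that was read
# as text into a single numeric type.
df = df.rename({"totl_price" => "total_price"})
df = df.with_columns([Polars.col("total_price").cast(Polars::Float64)])
```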
Summary
In the talk titled "Get your Data prod ready, Fast, with Ruby Polars!", presented by Paul Reece at RubyConf 2023, the focus is on using Ruby Polars, a gem that simplifies data wrangling and cleaning tasks. Reece begins by setting the context: when handling extensive datasets, such as a CSV with 500,000 rows and 100 columns riddled with missing values and incorrect data types, manual data cleaning can be overwhelmingly time-consuming.

The key points covered in the talk include:

- **Introduction to Data Cleaning**: Reece explains data cleaning, which involves removing missing values, correcting incorrect data types, reformatting, and dealing with outliers. He establishes the significance of this process for web developers and data practitioners alike.
- **Ruby Polars Overview**: Polars is introduced as a fast alternative for data manipulation in Ruby, comparing favorably against its Python counterparts. The speed of Polars comes from its Rust backend, which provides efficient parsing and DataFrame operations.
- **Data Structures**: He elaborates on the two primary structures in Polars: the Series (a one-dimensional data structure) and the DataFrame (a two-dimensional table). Both are crucial for processing data effectively (an illustrative sketch follows this summary).
- **Data Cleaning Demonstration**: Reece demonstrates the step-by-step process of converting API data into a DataFrame, removing outliers with the `filter` method, checking for missing values, and filling those gaps with reasonable estimates, showing all of these steps live in the IRB environment (sketched below).
- **Combining DataFrames**: The importance of stacking DataFrames with the `vstack` method for efficient bulk inserts into production databases is highlighted (sketched below).
- **Addressing Duplicates**: He explains how to handle duplicate entries with the `unique` method to ensure that only distinct records survive in the final dataset.
- **Final Steps and Advanced Techniques**: Reece mentions advanced operations such as extracting additional information from existing columns, calculating averages across different scales, and generating charts with the Vega library.

In conclusion, the session encapsulates impactful techniques for converting and cleaning data efficiently in Ruby using Polars, thereby making it production-ready. Reece provides valuable resources, including a data-cleaning checklist and a cheat sheet for attendees to refer to in their projects, while also inviting continued discussion within the community on the evolution of Ruby tools for data manipulation and AI applications.
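To make the two core structures concrete, here is a small sketch (not code from the talk) that builds a Series and a DataFrame with the `polars-df` gem, including a DataFrame assembled from API-style records; all names and values are invented for the example.

```ruby
require "polars-df"

# A Series is one-dimensional: a single named column of values.
prices = Polars::Series.new("price", [10.5, 12.0, 9.75])
prices.mean # => 10.75

# A DataFrame is two-dimensional: a table of named columns.
df = Polars::DataFrame.new({
  "sku"   => ["A1", "B2", "C3"],
  "price" => [10.5, 12.0, 9.75]
})

# Parsed JSON from an API usually arrives as an array of hashes;
# mapping it into columns gives Polars a DataFrame to work with.
records = [
  { "sku" => "D4", "price" => 14.25 },
  { "sku" => "E5", "price" => 8.0 }
]
api_df = Polars::DataFrame.new({
  "sku"   => records.map { |r| r["sku"] },
  "price" => records.map { |r| r["price"] }
})
```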
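The cleaning steps described above (filtering outliers, checking for missing values, filling gaps) might look roughly like this; the column names, threshold, and fill strategy are invented for the sketch rather than taken from the talk.

```ruby
require "polars-df"

# A tiny DataFrame standing in for freshly imported data.
df = Polars::DataFrame.new({
  "sku"      => ["A1", "B2", "C3", "D4"],
  "price"    => [10.5, 9_999_999.0, 12.0, 9.75], # one obvious outlier
  "quantity" => [3, 1, nil, 2]                    # one missing value
})

# Remove outliers: keep only rows whose price falls in a plausible range.
df = df.filter(Polars.col("price") < 1_000)

# Check how many missing values remain in each column (handy in IRB).
df.null_count

# Fill the remaining gap with a reasonable estimate, here the rounded mean.
mean_quantity = df["quantity"].mean.round
df = df.with_columns([Polars.col("quantity").fill_null(mean_quantity)])
```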
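Finally, combining cleaned batches and dropping duplicates before a bulk insert could be sketched as follows; `Listing` and the `insert_all` call stand in for whatever model and persistence layer an application actually uses.

```ruby
require "polars-df"

# Two cleaned batches with the same columns, e.g. from a CSV and an API.
batch_a = Polars::DataFrame.new({"sku" => ["A1", "B2"], "price" => [10.5, 12.0]})
batch_b = Polars::DataFrame.new({"sku" => ["B2", "C3"], "price" => [12.0, 9.75]})

# Stack the frames vertically, then keep only distinct rows.
combined = batch_a.vstack(batch_b).unique

# Convert the rows to hashes for a single bulk insert into the database.
records = combined.rows.map { |sku, price| { sku: sku, price: price } }
Listing.insert_all(records) # Listing is a hypothetical ActiveRecord model
```

Doing the deduplication in Polars before touching the database keeps the bulk insert to a single round trip instead of per-row validity checks.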