RubyConf 2023

Get your Data prod ready, Fast, with Ruby Polars!

by Paul Reece

In the talk titled "Get your Data prod ready, Fast, with Ruby Polars!" presented by Paul Reece at RubyConf 2023, the focus is on using Ruby Polars, a gem that simplifies data wrangling and cleaning tasks. Reece begins by setting the context: when handling extensive datasets, such as a CSV with 500,000 rows and 100 columns filled with random missing values and incorrect data types, manual data cleaning can be overwhelmingly time-consuming.

The key points covered in the talk include:

- Introduction to Data Cleaning: Reece defines data cleaning as removing missing values, correcting incorrect data types, handling outliers, and reformatting improperly structured data. He establishes the significance of this process for web developers and data practitioners alike.

- Ruby Polars Overview: Polars is introduced as a fast data-manipulation library for Ruby, with performance comparable to its Python counterpart. The speed of Polars comes from its Rust backend, which provides efficient parsing and DataFrame operations.

- Data Structures: He elaborates on two primary structures in Polars: the Series (a one-dimensional data structure) and the DataFrame (a two-dimensional table). Both are crucial for processing data effectively.

- Data Cleaning Demonstration: Reece demonstrates the step-by-step process of converting API data into a DataFrame, removing outliers with the 'filter' method, checking for missing values, and filling these gaps with reasonable estimates. He shows all these steps live in the IRB environment.

- Combining DataFrames: The importance of stacking DataFrames using the 'vstack' method for efficient bulk inserts into production databases is highlighted.

- Addressing Duplicates: He explains how to handle duplicate entries using the 'unique' method to ensure that only distinct records survive in the final dataset.

- Final Steps and Advanced Techniques: Reece mentions advanced operations such as extracting additional information from existing columns, calculating averages across different scales, and generating charts using the Vega library.

In conclusion, the session encapsulates impactful techniques for converting and cleaning data efficiently in Ruby using Polars, thereby making it production-ready. Reece provides valuable resources, including a data cleaning checklist and a cheat sheet for attendees to refer to in their projects, while also inviting continued discussions within the community on the evolution of Ruby tools for data manipulation and AI applications.

00:00:49.430 Thank you all so much for being here! How's everyone's RubyConf going so far? Awesome! I'm thrilled to be here and I'm really excited to share this talk with you today. A quick note: I will be covering a lot of material, so I won’t have time for a live Q&A. If you have any questions about the material, please take a mental note or write it down, and I'll be available off stage afterwards to answer them.
00:01:17.520 As was mentioned, my name is Paul Reece, and my pronouns are he, him, and his. You can find me on LinkedIn at Paul Hyen Reece and on X and Mastodon as Paul_Reece. Some of you may be aware that the actor strike ended last week and there was a writer strike that ended about a month ago. This means a content winter is coming for those of us who enjoy watching new TV shows and movies; it might be a year or two before we're able to watch new shows once our streaming platforms run out of back catalogs. So, we all got together and decided it would be fun to look backwards and watch the top 100 movies of all time. We thought it would be a great idea to build an app to facilitate this. To accomplish this, we needed to get some data, and we are going to ping a fictional top movies API to retrieve it.
00:02:38.959 This is a fictional API, so it will return some interesting data. For this demonstration, we will work with a subset of just three records instead of the top 100, as we will be doing various data processing and transformations on this subset. Furthermore, we can already see some missing values in this data, indicating that we need to clean it. But what is data cleaning? Data cleaning is the process of removing missing values, correcting incorrect data types, addressing outliers, and reformatting incorrectly structured data.
00:03:09.080 As web developers, we perform data cleaning to avoid bugs in production when adding new data. Data scientists and AI practitioners do this to ensure the results of their projects are more reliable and accurate. It is often said that data scientists and AI practitioners spend about 80% of their time cleaning data, making it a crucial skill to have. Today, we're going to learn how to effectively clean data. First, let's consider this question: why not use Ruby to work with data? Why not use Ruby for your AI projects? Luckily, we now have some high-quality gems available, like Polars, that help us work with and clean our data.
00:03:40.319 Furthermore, thanks to the hard work of those on the Polars team, especially Maxim, we have a Ruby version that is comparable in speed to Python. This is a tremendous opportunity for us to learn how to work with data in Ruby. Give it up for the Polars team! It’s a wonderful time to get involved and continue the conversation around using Ruby to work with data. The more we talk about this, the more mature our ecosystem becomes and the more features we get in our Ruby gems.
00:04:57.280 Today, as hinted at in the title, we will learn about Polars, a gem whose internal libraries are written in Rust and whose core API is written in Ruby. This allows us to benefit from the lower-level speed of Rust alongside the developer-friendly features of Ruby. Polars is particularly good at working with tabular data, such as Excel files, CSVs, and JSON. To effectively work with Polars, we need to understand two custom data structures: the Series and the DataFrame.
00:05:36.000 A Series is a one-dimensional data structure, similar to an array but with a name attribute. For most of the examples, we will explore and clean data in the Interactive Ruby (IRB) environment, which is familiar to many of us. In IRB, we can create a Series by calling the new method on the Series class, passing the name as the first argument and our list of values (an array) as the second.
00:06:03.000 After creating the Series, we receive information about its shape, which shows us it has three values and one dimension. We can also see its name and data type (in this case, a string). The Series serves as the core data structure for the next structure we'll learn, which is the DataFrame.
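As a minimal sketch of that Series creation in IRB (the movie titles here are illustrative):

```ruby
require "polars-df"

titles = Polars::Series.new("title", ["Jurassic Park", "Back to the Future", "The Matrix"])
# => shape: (3,)
# Series: 'title' [str]
```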
00:06:36.560 A DataFrame is a two-dimensional data structure made of rows and columns, resembling a virtual representation of a spreadsheet in code. Within this structure, we can perform a variety of different transformations and operations on data. For example, we can create a DataFrame from a hash where each key corresponds to a column name and each value is an array representing the values for that column.
00:07:18.879 When we create a DataFrame, we can see its shape, which might show, for instance, three rows and two columns, as well as the data type for each column—for example, strings for titles and 64-bit integers for release dates. One of the benefits of Polars is that it provides a nicely formatted representation of our table, which helps visualize the data cleaning steps we need to take.
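A sketch of that hash-to-DataFrame construction (column names and values are illustrative):

```ruby
movies_df = Polars::DataFrame.new({
  "title" => ["Jurassic Park", "Back to the Future", "The Matrix"],
  "year"  => [1993, 1985, 1999]
})
# => shape: (3, 2), with dtype str for title and i64 for year
```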
00:07:56.479 Speaking of data types, it is important to be aware of the specific data types available in Polars, including Boolean, float, integer, unsigned integer, string, temporal, and others. Now that we have a high-level overview of Polars and our two primary data structures, let’s take another look at our top movies app.
00:08:32.320 In our production database table for the top movies app, we have eight columns: ID, name, year, rank, director, rating, streaming, and date added. Across these columns, we have five different data types: integer, string, float, Boolean, and datetime. Understanding the expected data types in our production database is crucial before we begin cleaning our data, as it informs the necessary transformations we must apply.
00:09:05.160 Let’s convert the JSON response we received from the top movies API into a DataFrame. I will ping the top movies API using the Net::HTTP library, parse the response into Ruby, and create a DataFrame from the parsed response. We can observe that Polars quickly and conveniently creates the DataFrame for us, which has a shape of four rows and seven columns.
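A sketch of that fetch-and-convert step, assuming the fictional API returns a JSON array of movie objects (the URL is a placeholder):

```ruby
require "net/http"
require "json"
require "polars-df"

uri = URI("https://example.com/api/top_movies")  # fictional endpoint
response = Net::HTTP.get(uri)
parsed = JSON.parse(response)                    # => array of hashes, one per movie
movies_df = Polars::DataFrame.new(parsed)        # shape: (4, 7)
```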
00:09:56.640 There are two important points to note here. First, we see a null value in the streaming column. Don’t be confused by the term 'null' in Polars. In Ruby, we're accustomed to using 'nil' to represent missing values, but Polars uses 'null'. When we insert Ruby data into a DataFrame in Polars, it will show as null, but when extracted back to Ruby data structures, it will revert to nil.
00:10:32.480 Second, we have an outlier represented as 'Water World.' While it was an interesting concept, it lies outside our scope of watching the top 100 movies, so we will need to address this later. Before doing that, I want to highlight a few important methods that will be used throughout this presentation for various data cleaning operations.
00:10:57.560 The first method is 'with_columns,' which we can call on a DataFrame object. This method typically takes arguments representing data cleaning operations on one or more columns. 'with_columns' is quite useful because it returns the cleaned column along with all other columns in the DataFrame, allowing us to save our work incrementally. The second method we’ll likely use in conjunction with 'with_columns' is 'Polars.col'. This method allows us to select specific columns within the DataFrame.
00:11:43.600 In some instances, we may want a Series instead of a DataFrame, especially when performing mathematical operations (excluding comparisons) or when converting column data using another data structure like a hash. To obtain a Series, we would use the 'get_column' method on the DataFrame and provide the column name as an argument. Now that we’ve established these important methods, let’s proceed to tackle 'Water World' and remove this outlier.
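A brief sketch of how these three methods fit together (the column name is illustrative):

```ruby
# Polars.col selects a column for use inside a transformation;
# with_columns returns the whole DataFrame with the result merged back in.
movies_df = movies_df.with_columns(Polars.col("rating").round(1))

# get_column pulls a single column out of the DataFrame as a Series.
ratings = movies_df.get_column("rating")
```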
00:12:11.920 We can remove outliers easily using the 'filter' method, which keeps only the rows matching the condition we pass in. Here, we're selecting the rank column using 'Polars.col' and keeping everything less than or equal to 100. As a result, we successfully remove the outlier, which is a notable achievement.
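The filter step, as a one-line sketch:

```ruby
# Keep only rows ranked within the top 100; 'Water World' drops out.
movies_df = movies_df.filter(Polars.col("rank") <= 100)
```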
00:12:49.560 When working with different DataFrame libraries, it's common practice to append '_df' to your DataFrame variable names. This helps distinguish DataFrame objects from regular variables in your code. Additionally, if you require a more advanced method of outlier removal, you can apply statistical techniques such as the interquartile range or standard deviation. However, I won't cover those methods today, but I will share a resource repository at the end that includes examples and articles on those techniques.
00:13:24.160 Now that we have removed outliers, we need to reformat and restructure the DataFrame so that our production database can correctly understand the different pieces of data. The table on the left shows that every column name from the API differs from our production schema except the ID column. In cases like this, we can use the 'columns' method in Polars, which is the only in-place method we’ll learn about today. We set it equal to an array with the same number of elements as there are columns and fill the array with our new column names.
00:13:58.399 Now, we have columns that our production database understands. While there is a 'rename' method we could use for renaming one or two columns, we won’t focus on that today. Now that we’ve successfully addressed the outlier and restructured our data, we need to deal with the missing values. We notice a NaN value for 'Back to the Future' in the rating column and a null value in the streaming column for 'Jurassic Park'.
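A sketch of that in-place rename, assuming the seven API columns map onto our schema in this order:

```ruby
# columns= renames in place; the array length must match the column count.
movies_df.columns = ["id", "name", "year", "rank", "director", "rating", "streaming"]
```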
00:14:37.360 When dealing with missing values (pretending I didn't previously highlight these), we first need to check if these missing values exist. If there are no missing values, we can ignore this step. Luckily, Polars has a convenient 'null_count' method. When called on our DataFrame object, it returns a DataFrame-like structure that counts all the null values in each column.
00:15:08.120 Unfortunately, Polars does not currently have a NaN-count method, but I built my own to check the numeric columns, since in Polars you can only call the 'is_nan' method on numeric columns. I retrieve all the columns as Series using the 'get_columns' method, then call 'filter_map' to check whether each column is numeric; if it is, we keep that column's name.
00:15:55.040 Next, we use 'select' to pull this array of numeric columns from the DataFrame (the method can select multiple columns at once). We then call 'is_nan' on each and sum the results. Once we do this, we can see the NaN count, which confirms there is indeed a NaN value in the rating column, on the 'Back to the Future' row.
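A sketch of that hand-rolled NaN counter; checking for float dtypes explicitly is one way to express the numeric check the talk describes:

```ruby
# Collect the names of the float columns (is_nan only works on numeric columns).
nan_cols = movies_df.get_columns.filter_map do |series|
  series.name if [Polars::Float32, Polars::Float64].include?(series.dtype)
end

# Sum the NaNs per column, mirroring what null_count does for nulls.
movies_df.select(nan_cols.map { |c| Polars.col(c).is_nan.sum })
```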
00:16:27.960 Having confirmed the existence of missing values, we must decide how to address them. If a project’s scope dictates that a missing value in any column invalidates that specific row, we can use the 'drop_nulls' and 'drop_nans' methods to instantly remove rows with null and NaN values. For our project, however, we still want to watch all of the top 100 movies, so we’ll replace these missing values instead.
00:16:56.960 To handle this, we can utilize the 'fill_null' method. Once again, I am re-assigning the 'movie_data_frame' variable to save our work, calling 'with_columns' on it, and selecting the streaming column. We then call the 'fill_null' method, which replaces every null value in that column with the argument we provide.
When replacing nulls, we decided to insert 'false' for 'Jurassic Park': if we aren’t sure of its streaming availability, we would typically check it out from the library or rent it. A nuance to remember is that we need to wrap such values in 'Polars.lit', which converts them into a Polars literal expression that DataFrame methods can process. This wrapping is only required for Boolean and temporal data types like datetime.
00:18:00.580 Once we execute these commands, we successfully address our null values and insert the false value for 'Jurassic Park' streaming. For addressing NaN values, the process is quite similar: we again use 'with_columns' to select the rating column. After reviewing values in other columns, we added a 7.5 as an eyeballed estimate for its rating.
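A sketch of both fills, assuming the NaN fill uses 'fill_nan', the NaN counterpart of 'fill_null' (the 7.5 estimate comes from the talk):

```ruby
# Replace nulls in the Boolean streaming column; Boolean literals must be
# wrapped in Polars.lit to become a Polars expression.
movies_df = movies_df.with_columns(
  Polars.col("streaming").fill_null(Polars.lit(false))
)

# Replace the NaN rating for 'Back to the Future' with an eyeballed estimate.
movies_df = movies_df.with_columns(
  Polars.col("rating").fill_nan(7.5)
)
```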
00:18:27.960 Here is the current state of our DataFrame; we have effectively managed outliers and missing values, and we’ve restructured our DataFrame to better align with our production database schema. However, we need to ensure we keep track of when we add our data pieces, especially regarding the 'date added' column.
00:18:50.640 To do this, we will add this column to our DataFrame. Again calling 'with_columns', we create a column with the same value for every row; as before, the literal value needs to be wrapped in Polars.lit. In this case, we’re inserting today’s date for every row and aliasing the column as 'date added'. We can then verify that the new column has the correct datetime data type.
00:19:23.360 Now that we’re getting closer to making our data production-ready, we must remove the ID column provided by the API, which could interfere with our production database's primary key column. Thankfully, Polars provides a 'drop' method that simplifies this process; we just pass in the name of the column we wish to drop.
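A sketch of these two steps, assuming Polars.lit accepts a Ruby Date:

```ruby
require "date"

# Add a constant date_added column; alias names the new column.
movies_df = movies_df.with_columns(
  Polars.lit(Date.today).alias("date_added")
)

# Drop the API's id column so it can't collide with our primary key.
movies_df = movies_df.drop("id")
```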
00:19:58.160 The data is looking fresh and clean, indeed production-ready! However, one of our friends, who was involved in the project, decided to gather her top movie choices into a CSV file. As mentioned, we aim for it to have similarly structured data so we can incorporate her choices into our project for a fair selection process.
00:20:39.560 To do this, we will import her data. Polars provides a 'read_csv' method in which we pass the path to her CSV document. In no time, we have her data, and so far, we’ve noticed that it's well-structured with no missing values. However, upon examining the right side of her data, we see that the 'date added' column is in a string format, which must be converted to a datetime data type.
00:21:11.760 Let’s convert it! Again using 'with_columns', we select the 'date added' column and access the string ('str') namespace. Polars provides namespace classes for different data types, each exposing the methods relevant to that type. Here, I'm calling the 'to_datetime' method to convert the string into a datetime.
00:21:54.240 In this method, I also specify the string's format along with the time unit of nanoseconds. This successfully translates her 'date added' column into an accepted format for our production database.
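A sketch of the import and conversion; the CSV path and the date format string are assumptions for illustration:

```ruby
friend_df = Polars.read_csv("data/friend_top_movies.csv")

# Parse the string column into a datetime with nanosecond precision.
friend_df = friend_df.with_columns(
  Polars.col("date_added").str.to_datetime("%Y-%m-%d", time_unit: "ns")
)
```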
00:22:30.560 With everything looking good in her DataFrame, we can now proceed to combine our different DataFrames. Combining them allows for efficient bulk inserts into our production database. Polars provides a handy 'vstack' method that vertically stacks the rows of two DataFrames whose column names match.
00:23:02.800 After using 'vstack', we confirm it accomplished what we needed. Additionally, there’s also an 'hstack' method that combines the columns of two DataFrames and a more nuanced 'concat' method, as well as a 'join' method for those familiar with SQL syntax.
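The combine step, as a one-line sketch:

```ruby
# Requires both DataFrames to share the same column names and dtypes.
combined_df = movies_df.vstack(friend_df)
```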
00:23:34.080 Having combined our DataFrames, we notice one final data cleaning operation remains: we have a duplicate of 'Jurassic Park' in the dataset. As was the case when handling missing values, the scope of our project will determine how to handle duplicates.
00:24:06.720 We do not want to watch one of the top 100 movies twice, so we need to drop this duplicate. To do that in Polars, we have a 'unique' method, which works like Ruby's 'uniq' but spells out the full word. We will pass in a subset argument naming the 'name' column, as checking just that column is an efficient way to identify duplicates.
00:24:38.240 This practice is advisable in many cases: Polars examines just the name column and drops the entire row when it detects a duplicate. If the two entries had different 'date added' values, calling 'unique' without a subset would keep both, since the full rows wouldn’t technically be identical. After executing this operation, our data is fresh, clean, and truly production-ready.
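The deduplication step as a sketch:

```ruby
# Treat rows as duplicates whenever their name matches, keeping one per name.
combined_df = combined_df.unique(subset: "name")
```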
00:25:10.560 Now, we can look at inserting this data into our production database. To facilitate this, we need to extract the information from the DataFrame structure back into a more Ruby-friendly format. We will call the 'rows' method on our DataFrame.
In this case, I’m passing 'named: true', which returns an array of hashes where each key is a column name and each value is that row's value for the column. If no argument is passed, the method instead returns a 2D array in which each inner array represents a row. For this example, I am using ActiveRecord.
00:26:14.000 Having the hashes will facilitate bulk inserting these into our production database. We can easily access the Movie model and call 'insert_all' to execute the bulk insert with the prepared array of hashes. And just like that, we successfully cleaned and added our data to production!
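A sketch of the extract-and-insert step, assuming a Movie ActiveRecord model whose attributes match the DataFrame columns:

```ruby
# rows(named: true) => [{"name" => "Jurassic Park", "year" => 1993, ...}, ...]
rows = combined_df.rows(named: true)

# Bulk insert, skipping per-record validations and callbacks.
Movie.insert_all(rows)
```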
00:26:56.160 We’ve covered a lot of ground! Today you learned how to import different data types into Polars, restructure data, address missing values, and remove duplicates before inserting the cleaned data into our production database.
00:27:36.080 For most of us, this would be everything we need for everyday use as a web developer! If you have a new API to add to your application, you can open IRB, figure out your data cleaning steps, and copy and paste them into your application. I’ve also prepared a data cleaning checklist, which I have as a PDF in my resources repository.
00:28:08.880 The resources repo includes a cheat sheet covering what we accomplished today. You can follow this checklist along with the Polars cheat sheet to successfully clean your data.
00:28:41.000 Now, I’d like to share a few advanced techniques that showcase the power of the Polars library and provide those who have advanced needs with ideas for working with their data.
00:29:10.760 The first advanced use case involves extracting data from another column. We found it helpful to extract the month from our 'date added' column into a new column. Here, we select the date added column with 'Polars.col' and access the datetime ('dt') namespace, where we use the 'month' method to extract the month value.
00:29:54.520 The new column can be aliased as 'month' and Polars allows this process to occur quickly and easily. Upon reevaluating the data, we concluded it would be even more valuable to display the string month name rather than its integer representation.
00:30:22.960 We can accomplish this by implementing a hash, where every key represents the integer value for the month, and every value corresponds to the string name for that month. Creating the 'month_names' variable, we select the month column using 'get_column', and then we can invoke the 'map' method to translate the integer values to the associated string values.
00:30:49.520 We can then replace our old integer column with the new string value column by using 'with_columns' and aliasing it as 'month'. Keep in mind that aliasing an existing column name will replace the old data. If you'd like, this could be efficiently done in a single line, but when you're starting out, taking a step-by-step approach is often beneficial.
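A sketch of both steps; the hash-based month lookup mirrors the talk, built here from Ruby's Date::MONTHNAMES, and the 'map' is done on a plain Ruby array after pulling the Series out:

```ruby
require "date"

# Extract the month number (1-12) from the datetime column.
combined_df = combined_df.with_columns(
  Polars.col("date_added").dt.month.alias("month")
)

# Map integer months to their names: {1 => "January", ..., 12 => "December"}.
month_names = (1..12).to_h { |i| [i, Date::MONTHNAMES[i]] }
names = combined_df.get_column("month").to_a.map { |m| month_names[m] }

# Re-aliasing "month" replaces the old integer column with the string names.
combined_df = combined_df.with_columns(Polars::Series.new("month", names))
```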
00:31:25.640 Another advanced technique we can employ involves column-wise operations. We decided to calculate an average score using the rank and rating columns. The rank column operates on a reverse rating scale from 1 to 100, while the rating column uses a standard 10-point scale.
00:32:02.760 To find a usable average between these two rating scales, we first invert the rank column by subtracting it from 100 and dividing the result by 10. We can then create a new column for the calculated average by adding the inverted rank column and the rating column together and dividing the sum by two.
00:32:39.880 This can also be done succinctly in a one-liner, but a step-by-step approach is often more understandable when beginning with the library.
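A sketch of that column-wise math; the column name 'average_score' is illustrative:

```ruby
# Invert the 1-100 rank onto a 0-10 scale (rank 1 => 9.9, rank 100 => 0).
inverted_rank = (Polars.lit(100) - Polars.col("rank")) / 10.0

# Average the two 10-point scales into a new column.
combined_df = combined_df.with_columns(
  ((inverted_rank + Polars.col("rating")) / 2).alias("average_score")
)
```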
00:33:14.440 Finally, I want to highlight the ability to create charts from our cleaned DataFrame to showcase our findings. We use the Vega library, which requires installing a few JavaScript dependencies. In this Rails application example, I create a new DataFrame from the Movie model in the controller, since Polars works straightforwardly with ActiveRecord objects.
00:33:38.640 When we pass data to the 'plot' method, we specify the x-axis as the name, y-axis as the rank, and the plot type (line, in this case). Other chart types available include pie, bar, area, and scatter plots, demonstrating the versatility of the Vega library.
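A sketch of the charting call, assuming the vega gem is set up and that Polars::DataFrame.new accepts an ActiveRecord relation as the talk suggests:

```ruby
# In a controller action: build a DataFrame straight from the model...
movies_df = Polars::DataFrame.new(Movie.all)

# ...and produce a line chart of rank by movie name via Vega.
movies_df.plot("name", "rank", type: "line")
```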
00:34:08.840 I encourage everyone to become involved in the growing conversation around data within the Ruby community. The more we engage, the more our ecosystem can flourish and enhance Ruby's capabilities for data manipulation and AI applications.
00:34:38.960 Here is the QR code for accessing the resources repository, which contains a comprehensive cheat sheet, a data cleaning checklist, and selected resources including various AI gems. I’d like to thank Allison, Chelsea, and the entire program committee for putting on this fantastic conference and for allowing me to share my talk with you.