Processing data at scale with Rails

by Corey Martin

The video titled "Processing data at scale with Rails" by Corey Martin presented at RailsConf 2021 focuses on the construction of data pipelines using familiar tools within the Ruby on Rails ecosystem. The talk aims to empower developers in managing large datasets effectively through strategies that involve ingesting, transforming, and storing data.

Key Points Discussed:
- Introduction to Data Pipelines: Explains the concept of data pipelines in the context of building applications that deal with massive datasets, citing examples such as political campaign spending data and public health information during the COVID-19 pandemic.
- Examples of Data Sources: Highlights how open datasets and information from IoT devices can be leveraged for constructing useful applications. Corey shares anecdotes such as "openSecrets" and "CVS vaccine alerts" to illustrate the practical applications of data pipelines.
- ETL Process: Introduces the concept of ETL (Extract, Transform, Load) as a simple yet effective way to describe data handling. Martin emphasizes that building ETL pipelines can be done with open-source tools and familiar frameworks in Rails, without needing expensive software.
- Building a Data Pipeline with Rails Tools: The tutorial cuts to an example project—an app named "Lobby Focus" that consumes lobbying data from XML files published by the U.S Congress. Corey details the steps of how data is ingested, including scraping, unzipping, and processing XML data using Rails tools like PostgreSQL, Sidekiq, and Nokogiri.
- Job Scaling and Parallel Processing: Discusses the advantages of breaking down large data jobs into smaller tasks to ensure better manageability and fault isolation. This allows developers to pinpoint errors at the record level, enhancing the pipeline efficiency.
- Handling Data Changes: Corey stresses the importance of modular pipeline design that allows easy adjustments when the data structure or source changes, keeping the workflow adaptable and reduce maintenance headaches.
- Conclusion and Call to Action: Encourages developers to explore building applications using open data, reiterating that working with real datasets is a great way to build portfolios and demonstrate skills. He emphasizes that experience in constructing data pipelines can be immensely valuable professionally.

00:00:06.080 Hi, I'm Corey Martin and I'm excited to be speaking at RailsConf 2021. I'm thrilled to be with you virtually. This talk is about building a data pipeline in Ruby on Rails using the tools we know in the ecosystem.

00:00:12.960 I'll be discussing an example of using familiar tools like PostgreSQL, Sidekiq, Redis, and, of course, Rails to build a large-scale data pipeline.

00:00:18.359 So, what is a data pipeline? There are many familiar examples in the open government space. In the U.S., the federal government often publishes datasets.

00:00:30.019 One example is spending on political campaigns, which they publish in formats like large text files or through an API.

00:00:36.840 An app like Open Secrets consumes that data and makes it available to people in a more user-friendly way, and in a very timely fashion. There are sites aimed at helping people navigate the COVID-19 pandemic using open data. For instance, CVS COVID Alerts.com scrapes the CVS website to monitor vaccine appointment availability. As a user, you can input your phone number and receive a text when a vaccine appointment opens at a CVS near you.

00:00:55.379 This is a great example of utilizing a data pipeline, in this case, scraping the CVS website to serve the public. Lastly, in your job, you might deal with ingesting data from a vendor. For instance, if you pay for data, a company working with Internet of Things devices might consume a data stream from those devices and then surface that data in an app.

00:01:21.420 There are numerous use cases we might encounter in our volunteer work, day jobs, or personal projects that require bringing data in from external sources into our applications.

00:01:30.780 This can be very empowering; for example, as a developer looking to grow my portfolio, building an app based on open data ingested from a public source can make that app useful to a lot of people.

00:01:47.640 One example could be building a doctor finder with Medicare data. Examples abound of useful apps you can build with freely available data. This makes it more likely that someone looking at the app will find it useful, and it eliminates the blank slate challenge often present when building your first apps for practice or portfolio.

00:02:13.500 Therefore, I always recommend keeping in mind the amazing space of building an app based on a data pipeline from an open data source—this is exactly the example we'll be discussing.

00:02:28.020 Data pipelines often have a reputation for being expensive. The term ETL, which stands for Extract, Transform, Load, comes to mind; many people associate ETL software with high costs and big enterprises.

00:02:45.540 However, ETL simply means that you're extracting data from an external source, transforming it to fit the structure you need for your app, and then loading it into a database that your app can access.

00:03:00.300 This is a very straightforward concept that is not limited to specialized or expensive tools. In fact, you can build an ETL pipeline entirely with open-source tools, just like we will in this example.

00:03:19.440 In this specific case with Rails, there are many options. You have enterprise software, including hosted solutions from Google Cloud and AWS. There are also open-source software solutions you can host on your own infrastructure that are designed specifically for pipelines, such as Apache Spark and Apache Beam.

00:03:37.080 When it comes to Rails options, there are gems available, one example being Kiba, an ETL gem that mimics many concepts from larger enterprise tools while being tailored for Rails.

00:03:51.120 But today, we will focus on the really familiar tools from the Rails ecosystem: PostgreSQL, Sidekiq, Redis, and Rails—tools we work with day in and day out. We will apply them to the data pipeline use case.

00:04:02.640 To illustrate how we would build a data pipeline with Rails and familiar tools, let’s use a relatable example: Chipotle. Many of us love their burritos, and believe it or not, Chipotle hires lobbyists to represent them to the U.S. federal government, particularly Congress.

00:04:23.160 This year, and in the past few years, they have hired the firm DLA Piper, a large law firm located in Washington, D.C., where I live. DLA Piper represents Chipotle before the federal government.

00:04:36.840 When this lobbyist relationship was established, DLA Piper had to file a form with the U.S. Congress, which they must submit quarterly throughout the duration of their lobbying relationship with Chipotle. All these forms are made available to the public in large zip files containing numerous XML files—this is our data source for this example.

00:04:52.620 This example illustrates the relationship between Chipotle and DLA Piper in Lobby Focus, an app I created in my spare time that consumes this publicly available lobbying data and presents it in a searchable and browsable format.

00:05:09.900 For instance, you can see that Chipotle paid DLA Piper about $190,000 in 2020, along with the quarterly filings, which raise the question: what is Chipotle lobbying about? Let's open the app to find out.

00:05:29.160 In the app, I'm watching certain companies and firms, and I can see their activities in a feed. If I search for Chipotle, I can view their profile and see the details of who they've hired—just one lobbyist in Q4 2020. Clicking into that lobbyist reveals the details of the relationship.

00:05:45.720 This is a Rails app utilizing a PostgreSQL database, and we will explore how the data was ingested. For example, the latest quarterly filing links to the XML file we've ingested, hosted by the U.S. Senate.

00:06:03.900 In the filing, you can see that their lobbying focuses on the federal response to the coronavirus pandemic. This might relate to stimulus legislation and how it affects Chipotle—perhaps they are seeking funds from such legislation and want to ensure their interests are considered during the drafting process.

00:06:16.320 That's my guess regarding the filing context. I just wanted to illustrate an example of the insights you can uncover with the data we are processing.

00:06:31.560 Now, let's discuss the data source. The U.S. Senate website publishes these quarterly zip files, and you'll notice that the format will soon change, which we will discuss towards the end.

00:06:39.720 But for now, they continue to offer these zip files that contain XML files we will parse in our data pipeline.

00:06:52.260 As previously mentioned, we will use familiar tools for this example, including Sidekiq, Nokogiri, PostgreSQL, an AWS gem for interacting with S3 buckets, HTTP Party, and others.

00:07:11.520 These are readily available gems that can be integrated into your Rails app. A high-level overview of the pipeline is as follows: we begin by scraping a web page.

00:07:25.680 This is the page we reviewed earlier with the listed zip files. The next step involves downloading those zip files, extracting the XML files, retrieving the filings from the XML files, and then loading the filings into PostgreSQL.

00:07:39.120 We will examine all these steps in detail, but basically, we're extracting everything until we reach the filings we need and then storing them in PostgreSQL with related information.

00:07:58.559 The first task in the process is the scraping action, wherein we access the web page to gather the zip file locations for ingestion. We're leveraging Nokogiri to parse the HTML; during my work, there was no API available, so we had to scrape the web page.

00:08:15.780 By checking the table, we extract the zip file links using some XPath code to identify the hyperlinks.

00:08:36.960 For each of those zip links, we launch a Sidekiq job to download the zip file and unzip it before proceeding.

00:08:50.160 As a little optimization, we cache the last modified date for each zip file before processing it. After processing it, we check if we've previously processed a zip file that was modified on the same date.

00:09:07.440 If this is the case, we avoid processing it again, as there’s no new data.

00:09:20.520 To achieve this caching, we utilize Redis, a simple key-value store system. Redis is easily usable from Rails, and I believe Rails is releasing an even simpler way to use Redis for custom applications.

00:09:38.640 For others who may not know, Sidekiq is a background job processing framework that enables you to create background jobs. It manages those jobs by allowing you to queue multiple background tasks at once.

00:09:56.520 Sidekiq orchestrates the execution of these jobs over time, monitoring their progress, completion status, and failures, providing a great overview of the entire pipeline.

00:10:08.880 Once we determine we need to process a zip file, we download it and examine the XML files inside. Our goal is to process every one of these XML files.

00:10:23.240 We employ the strategy of spawning new smaller jobs, with one job created for each XML file within the zip. These XML files contain the lobbying filings which we seek.

00:10:36.780 After unzipping the file, we upload each XML file to S3 using the AWS SDK S3 gem for this purpose.

00:10:46.800 Once the file is uploaded, we retrieve the file path, which is the S3 key, and pass it as an argument to the next Sidekiq job, responsible for processing that specific XML file.

00:11:06.840 As promised, I’ll make sure we discuss all the tools listed in this slide. Ruby Zip provides a simple interface to unzip the files and process the contents.

00:11:22.260 HTTP Party allows us to retrieve the zip file from the U.S. Congress, and while there are many libraries available for similar requests, this was my preferred choice at that time.

00:11:37.560 It's important to note that everything in the pipeline operates as a Sidekiq job. Now, we have a job designated to process one XML file. Each XML file contains around a thousand lobbying filings.

00:11:54.600 Inside the XML, there are specific entities related to the filed paperwork from DLA Piper regarding their relationship with Chipotle.

00:12:08.040 Our aim is to extract all the filing elements within the XML file using Nokogiri, which plays a critical role in this pipeline due to its heavy use of XML.

00:12:23.040 We iterate through all the filing tags and record the XML of each filing within PostgreSQL. It may sound odd, but we write each filing string before thoroughly processing it.

00:12:38.760 By storing each filing’s raw XML, we can spawn a job that specifically processes that individual filing. The only requirement for this job is to reference the record ID in PostgreSQL that contains the XML data.

00:12:55.560 This approach allows us to fan out, creating a job for each filing extracted from the XML file, acknowledging that a single XML file can contain around a thousand filings.

00:13:09.840 This process of fanning out means starting with one scrape job, which generates multiple download jobs, spawning further process jobs, and so forth.

00:13:26.640 The idea is to transition from monolithic jobs—like the scrape job, which handles the entire web page content—into progressively smaller jobs until we reach a point where each job is responsible for just one record.

00:13:42.180 By doing this, if a job fails, we know precisely which record caused the issue. In a data pipeline, it’s common to find one out of many records exhibiting errant data.

00:14:01.740 Limiting the failure to a single record simplifies diagnosing the problem. If needed, you can adjust the pipeline or reach out to the data source.

00:14:16.440 The method also promotes efficient parallelization through Sidekiq: with many small jobs, Sidekiq can process thousands simultaneously.

00:14:29.460 The degree of concurrency is up to you, depending on your database’s capacity to handle all the incoming records. Sidekiq supports adding numerous workers to manage more jobs concurrently.

00:14:42.600 Ultimately, Sidekiq is highly scalable and won’t limit your processing capabilities.

00:14:54.840 Additionally, it provides a dashboard that tracks the status of running jobs, monitors queued jobs, handles failures gracefully, and presents error messages.

00:15:04.560 This means you won’t need to build extensive monitoring tools for your pipeline; you can leverage what Sidekiq provides.

00:15:20.520 Now, let's examine another perspective on the fan-out concept. As previously mentioned, the initial scrape job encompasses all lobbying records available from the U.S. Congress.

00:15:30.000 While that job might scope in on approximately one million records, all it's responsible for is reading the web page and retrieving the zip file locations.

00:15:50.520 Subsequent jobs break this task into smaller batches—each download job processes a single zip file containing around 25,000 records.

00:16:04.380 If a download job fails, it's not a major loss; you’ve only risked about 25,000 records instead of all one million.

00:16:17.640 Progressing to the process jobs, these engage with one XML file at a time—unconcerned with the individual fields until processing begins. They're designed to handle a thousand records.

00:16:31.620 Ultimately, the most likely job to fail deals with a specific filing, as it is prone to potential data inconsistencies.

00:16:44.640 For example, there could be odd date formats that create issues during loading.

00:16:58.200 However, since this load job is only focused on one record, it’s easier to identify and resolve any anomalies.

00:17:14.760 Meanwhile, all other jobs continue to run smoothly, ensuring that you minimize missing data in your final destination, the PostgreSQL database.

00:17:29.520 In our example, we may successfully load all lobbying records since 2013, with a few that encountered errors.

00:17:43.680 This is a manageable issue, as we can easily locate the problematic records in Sidekiq and address them while allowing all others to continue processing.

00:17:59.520 Regarding the load job, which operates on just one filing at a time, this specific instance represents one of Chipotle's and DLA Piper’s filings in XML format.

00:18:11.640 We extract the necessary fields from this XML filing entity, utilizing Nokogiri once again, a vital tool for this pipeline due to its XML focus.

00:18:28.920 The fields we extract include client and firm details. This filing's XML structure resembles flat data, displaying client and lobbying firm information in a straightforward manner.

00:18:44.880 However, we want to structure this into relational data within our PostgreSQL database. Hence, we will create separate tables for clients, firms, and filings.

00:19:01.080 For instance, when accessing a client's profile page in Lobby Focus, we want to show all associated firms and filings in a correlated manner; the data must be connected for effective functionality.

00:19:15.840 For this reason, we first ensure there’s a client record. We use Create or Find By to avoid duplicating records in PostgreSQL.

00:19:31.200 The same principle applies to writing the firm record, followed by the filing associated with the identified client and firm records. We utilize a custom function called Create or Update Unique.

00:19:46.920 This function checks the fields unique to a filing; in this case, each filing has a unique Senate ID. By employing this approach, we can safely run the pipeline multiple times without creating duplicates in the database.

00:20:05.520 To prevent duplicates during reprocessing, we establish a unique index on the unique filing field in PostgreSQL. The Senate ID will always remain unique.

00:20:18.060 This ensures we'll never store multiple copies of the same filing in the filings table. When we write filings initially, if that write fails due to a unique index violation, we'll simply update the existing record with any changes.

00:20:32.880 This methodology embodies idempotency, ensuring that we can run the pipeline multiple times without generating issues.

00:20:44.520 Normalization is critical, especially depending on the data source. A specific challenge in lobbying data is the lack of standardized company names.

00:20:58.680 For example, Microsoft could appear as Microsoft, Microsoft Corp, Microsoft Inc, or Microsoft Corporation—all different ways to reference the same entity.

00:21:10.680 Since lobbying firms can input any variation of company names in filings, you end up with multiple entries for what is essentially the same entity.

00:21:24.720 To address these discrepancies in a data pipeline, it's useful to utilize open-source solutions, and indeed, there's the Consistent Company gem designed for normalizing such variations.

00:21:39.420 Using a unique index on the normalized name within the clients table will enable you to associate all lobbying filings using any of the variations of Microsoft with a single Microsoft entry in your database.

00:21:53.040 You definitely don't want your users to search your app and discover five distinct listings for the same company. Therefore, normalizing the data is essential when the source data isn't standardized.

00:22:08.040 Changes can also occur in data sources, which adds complexity to managing a data pipeline.

00:22:23.520 For instance, a field might be added or removed, or a web page might transition to a JSON API. Such changes can disturb the data format you are addressing.

00:22:36.720 Modularly building your pipeline to maintain narrowly scoped responsibilities for each job will help you minimize the changes you need to make when a data source changes.

00:22:51.180 If the webpage layout changes to present zip files differently, for example, you would only have to adjust the scrape job rather than refactor the entire pipeline.

00:23:04.620 This approach prevents the pipeline from becoming brittle and overly complex. With distinct jobs carrying specific responsibilities, alterations needed due to a data source can often be isolated.

00:23:18.240 You might be wondering about scalability. This is often the question posed about Rails apps, though I propose it isn't the most relevant inquiry. Nonetheless, we need to consider scalability when deploying solutions.

00:23:32.520 As we've discussed, we're building our pipeline on Sidekiq, which is highly scalable. If data processing is lagging, you can enhance this by increasing Sidekiq workers.

00:23:48.120 For performance issues related to CPU, adding more Sidekiq workers will allow faster processing of jobs concurrently. If loading to PostgreSQL is slow, this same principle applies.

00:24:02.520 While we horizontally scale Sidekiq by adding workers, we might also need to vertically scale PostgreSQL by upgrading the instance with increased resources such as memory or CPU.

00:24:17.520 Although you will encounter limits, it’s essential to analyze data volume and the desired processing time.

00:24:31.620 Neither quantity alone should stall progress, but a combination of high data volume needing swift processing could encounter constraints in PostgreSQL or Redis.

00:24:46.020 Keep in mind that Redis stores the pending jobs list, which could run out of memory—most commonly managed through vertical scaling.

00:25:00.720 Under certain circumstances, you may have trouble getting additional memory, making it impossible to accommodate all jobs in your pipeline.

00:25:13.920 This scenario—a hard scaling limit—is rare in most projects; however, it is advantageous to have alternatives lined up.

00:25:29.460 If the Rails approach shows limitations while processing significant amounts of data, you can explore Rails gems like Kiba, which specifically tackles the ETL use case.

00:25:45.060 Another option is to investigate more enterprise-level data pipeline tools designed to manage vast amounts of data efficiently and quickly.

00:26:01.990 In summary, you might not need those advanced tools, so it’s worthwhile to consider the methods discussed focusing on the reliable tools familiar to the Rails ecosystem.

00:26:16.499 Personally, I'm enthusiastic about leveraging this approach since it offers a quick ramp-up period and exceptional scalability.

00:26:29.880 For those interested, numerous possibilities exist. If you're building a portfolio, showing your skills through a well-established app can be advantageous.

00:26:45.420 Having real data integrated makes the app substantially valuable from the get-go. The potential use cases are expansive, including scraping Medicare data for a doctor directory with Medicare ratings.

00:27:01.920 Other interesting options are scraping weather data or Wikimedia content, among others, to create a truly useful app or website utilizing a data pipeline.

00:27:17.520 You might also encounter scenarios at work demanding either the adjustment of an existing data pipeline or the construction of a new one.

00:27:33.600 Gaining experience in building a data pipeline end-to-end, even through side projects, provides invaluable lessons around scalability, identifying choke points, and managing data inconsistencies.

00:27:48.060 These insights are relevant, whether you operate in the corporate world or small-scale projects, where both often face similar challenges regarding data.

00:28:02.520 I genuinely appreciate you taking the time to join this virtual talk. I hope that next time we can connect in person, but in the meantime, I’ll be available for a Q&A.

00:28:16.920 Check the schedule for more information, and I look forward to seeing you on Discord to discuss any questions or feedback you may have regarding this talk.

00:28:30.240 Thank you so much for attending!