00:00:06.080
Hi, I'm Corey Martin and I'm excited to be speaking at RailsConf 2021. I'm thrilled to be with you virtually. This talk is about building a data pipeline in Ruby on Rails using the tools we know in the ecosystem.
00:00:12.960
I'll be discussing an example of using familiar tools like PostgreSQL, Sidekiq, Redis, and, of course, Rails to build a large-scale data pipeline.
00:00:18.359
So, what is a data pipeline? There are many familiar examples in the open government space. In the U.S., the federal government often publishes datasets.
00:00:30.019
One example is spending on political campaigns, which they publish in formats like large text files or through an API.
00:00:36.840
An app like Open Secrets consumes that data and makes it available to people in a more user-friendly way, and in a very timely fashion. There are sites aimed at helping people navigate the COVID-19 pandemic using open data. For instance, CVS COVID Alerts.com scrapes the CVS website to monitor vaccine appointment availability. As a user, you can input your phone number and receive a text when a vaccine appointment opens at a CVS near you.
00:00:55.379
This is a great example of utilizing a data pipeline, in this case, scraping the CVS website to serve the public. Lastly, in your job, you might deal with ingesting data from a vendor. For instance, if you pay for data, a company working with Internet of Things devices might consume a data stream from those devices and then surface that data in an app.
00:01:21.420
There are numerous use cases we might encounter in our volunteer work, day jobs, or personal projects that require bringing data in from external sources into our applications.
00:01:30.780
This can be very empowering; for example, as a developer looking to grow my portfolio, building an app based on open data ingested from a public source can make that app useful to a lot of people.
00:01:47.640
One example could be building a doctor finder with Medicare data. Examples abound of useful apps you can build with freely available data. This makes it more likely that someone looking at the app will find it useful, and it eliminates the blank slate challenge often present when building your first apps for practice or portfolio.
00:02:13.500
Therefore, I always recommend keeping in mind the amazing space of building an app based on a data pipeline from an open data source—this is exactly the example we'll be discussing.
00:02:28.020
Data pipelines often have a reputation for being expensive. The term ETL, which stands for Extract, Transform, Load, comes to mind; many people associate ETL software with high costs and big enterprises.
00:02:45.540
However, ETL simply means that you're extracting data from an external source, transforming it to fit the structure you need for your app, and then loading it into a database that your app can access.
00:03:00.300
This is a very straightforward concept that is not limited to specialized or expensive tools. In fact, you can build an ETL pipeline entirely with open-source tools, just like we will in this example.
00:03:19.440
In this specific case with Rails, there are many options. You have enterprise software, including hosted solutions from Google Cloud and AWS. There are also open-source software solutions you can host on your own infrastructure that are designed specifically for pipelines, such as Apache Spark and Apache Beam.
00:03:37.080
When it comes to Rails options, there are gems available, one example being Kiba, an ETL gem that mimics many concepts from larger enterprise tools while being tailored for Rails.
00:03:51.120
But today, we will focus on the really familiar tools from the Rails ecosystem: PostgreSQL, Sidekiq, Redis, and Rails—tools we work with day in and day out. We will apply them to the data pipeline use case.
00:04:02.640
To illustrate how we would build a data pipeline with Rails and familiar tools, let’s use a relatable example: Chipotle. Many of us love their burritos, and believe it or not, Chipotle hires lobbyists to represent them to the U.S. federal government, particularly Congress.
00:04:23.160
This year, and in the past few years, they have hired the firm DLA Piper, a large law firm located in Washington, D.C., where I live. DLA Piper represents Chipotle before the federal government.
00:04:36.840
When this lobbyist relationship was established, DLA Piper had to file a form with the U.S. Congress, which they must submit quarterly throughout the duration of their lobbying relationship with Chipotle. All these forms are made available to the public in large zip files containing numerous XML files—this is our data source for this example.
00:04:52.620
This example illustrates the relationship between Chipotle and DLA Piper in Lobby Focus, an app I created in my spare time that consumes this publicly available lobbying data and presents it in a searchable and browsable format.
00:05:09.900
For instance, you can see that Chipotle paid DLA Piper about $190,000 in 2020, along with the quarterly filings, which raise the question: what is Chipotle lobbying about? Let's open the app to find out.
00:05:29.160
In the app, I'm watching certain companies and firms, and I can see their activities in a feed. If I search for Chipotle, I can view their profile and see the details of who they've hired—just one lobbyist in Q4 2020. Clicking into that lobbyist reveals the details of the relationship.
00:05:45.720
This is a Rails app utilizing a PostgreSQL database, and we will explore how the data was ingested. For example, the latest quarterly filing links to the XML file we've ingested, hosted by the U.S. Senate.
00:06:03.900
In the filing, you can see that their lobbying focuses on the federal response to the coronavirus pandemic. This might relate to stimulus legislation and how it affects Chipotle—perhaps they are seeking funds from such legislation and want to ensure their interests are considered during the drafting process.
00:06:16.320
That's my guess regarding the filing context. I just wanted to illustrate an example of the insights you can uncover with the data we are processing.
00:06:31.560
Now, let's discuss the data source. The U.S. Senate website publishes these quarterly zip files, and you'll notice that the format will soon change, which we will discuss towards the end.
00:06:39.720
But for now, they continue to offer these zip files that contain XML files we will parse in our data pipeline.
00:06:52.260
As previously mentioned, we will use familiar tools for this example, including Sidekiq, Nokogiri, PostgreSQL, an AWS gem for interacting with S3 buckets, HTTP Party, and others.
00:07:11.520
These are readily available gems that can be integrated into your Rails app. A high-level overview of the pipeline is as follows: we begin by scraping a web page.
00:07:25.680
This is the page we reviewed earlier with the listed zip files. The next step involves downloading those zip files, extracting the XML files, retrieving the filings from the XML files, and then loading the filings into PostgreSQL.
00:07:39.120
We will examine all these steps in detail, but basically, we're extracting everything until we reach the filings we need and then storing them in PostgreSQL with related information.
00:07:58.559
The first task in the process is the scraping action, wherein we access the web page to gather the zip file locations for ingestion. We're leveraging Nokogiri to parse the HTML; during my work, there was no API available, so we had to scrape the web page.
00:08:15.780
By checking the table, we extract the zip file links using some XPath code to identify the hyperlinks.
00:08:36.960
For each of those zip links, we launch a Sidekiq job to download the zip file and unzip it before proceeding.
00:08:50.160
As a little optimization, we cache the last modified date for each zip file before processing it. After processing it, we check if we've previously processed a zip file that was modified on the same date.
00:09:07.440
If this is the case, we avoid processing it again, as there’s no new data.
00:09:20.520
To achieve this caching, we utilize Redis, a simple key-value store system. Redis is easily usable from Rails, and I believe Rails is releasing an even simpler way to use Redis for custom applications.
00:09:38.640
For others who may not know, Sidekiq is a background job processing framework that enables you to create background jobs. It manages those jobs by allowing you to queue multiple background tasks at once.
00:09:56.520
Sidekiq orchestrates the execution of these jobs over time, monitoring their progress, completion status, and failures, providing a great overview of the entire pipeline.
00:10:08.880
Once we determine we need to process a zip file, we download it and examine the XML files inside. Our goal is to process every one of these XML files.
00:10:23.240
We employ the strategy of spawning new smaller jobs, with one job created for each XML file within the zip. These XML files contain the lobbying filings which we seek.
00:10:36.780
After unzipping the file, we upload each XML file to S3 using the AWS SDK S3 gem for this purpose.
00:10:46.800
Once the file is uploaded, we retrieve the file path, which is the S3 key, and pass it as an argument to the next Sidekiq job, responsible for processing that specific XML file.
00:11:06.840
As promised, I’ll make sure we discuss all the tools listed in this slide. Ruby Zip provides a simple interface to unzip the files and process the contents.
00:11:22.260
HTTP Party allows us to retrieve the zip file from the U.S. Congress, and while there are many libraries available for similar requests, this was my preferred choice at that time.
00:11:37.560
It's important to note that everything in the pipeline operates as a Sidekiq job. Now, we have a job designated to process one XML file. Each XML file contains around a thousand lobbying filings.
00:11:54.600
Inside the XML, there are specific entities related to the filed paperwork from DLA Piper regarding their relationship with Chipotle.
00:12:08.040
Our aim is to extract all the filing elements within the XML file using Nokogiri, which plays a critical role in this pipeline due to its heavy use of XML.
00:12:23.040
We iterate through all the filing tags and record the XML of each filing within PostgreSQL. It may sound odd, but we write each filing string before thoroughly processing it.
00:12:38.760
By storing each filing’s raw XML, we can spawn a job that specifically processes that individual filing. The only requirement for this job is to reference the record ID in PostgreSQL that contains the XML data.
00:12:55.560
This approach allows us to fan out, creating a job for each filing extracted from the XML file, acknowledging that a single XML file can contain around a thousand filings.
00:13:09.840
This process of fanning out means starting with one scrape job, which generates multiple download jobs, spawning further process jobs, and so forth.
00:13:26.640
The idea is to transition from monolithic jobs—like the scrape job, which handles the entire web page content—into progressively smaller jobs until we reach a point where each job is responsible for just one record.
00:13:42.180
By doing this, if a job fails, we know precisely which record caused the issue. In a data pipeline, it’s common to find one out of many records exhibiting errant data.
00:14:01.740
Limiting the failure to a single record simplifies diagnosing the problem. If needed, you can adjust the pipeline or reach out to the data source.
00:14:16.440
The method also promotes efficient parallelization through Sidekiq: with many small jobs, Sidekiq can process thousands simultaneously.
00:14:29.460
The degree of concurrency is up to you, depending on your database’s capacity to handle all the incoming records. Sidekiq supports adding numerous workers to manage more jobs concurrently.
00:14:42.600
Ultimately, Sidekiq is highly scalable and won’t limit your processing capabilities.
00:14:54.840
Additionally, it provides a dashboard that tracks the status of running jobs, monitors queued jobs, handles failures gracefully, and presents error messages.
00:15:04.560
This means you won’t need to build extensive monitoring tools for your pipeline; you can leverage what Sidekiq provides.
00:15:20.520
Now, let's examine another perspective on the fan-out concept. As previously mentioned, the initial scrape job encompasses all lobbying records available from the U.S. Congress.
00:15:30.000
While that job might scope in on approximately one million records, all it's responsible for is reading the web page and retrieving the zip file locations.
00:15:50.520
Subsequent jobs break this task into smaller batches—each download job processes a single zip file containing around 25,000 records.
00:16:04.380
If a download job fails, it's not a major loss; you’ve only risked about 25,000 records instead of all one million.
00:16:17.640
Progressing to the process jobs, these engage with one XML file at a time—unconcerned with the individual fields until processing begins. They're designed to handle a thousand records.
00:16:31.620
Ultimately, the most likely job to fail deals with a specific filing, as it is prone to potential data inconsistencies.
00:16:44.640
For example, there could be odd date formats that create issues during loading.
00:16:58.200
However, since this load job is only focused on one record, it’s easier to identify and resolve any anomalies.
00:17:14.760
Meanwhile, all other jobs continue to run smoothly, ensuring that you minimize missing data in your final destination, the PostgreSQL database.
00:17:29.520
In our example, we may successfully load all lobbying records since 2013, with a few that encountered errors.
00:17:43.680
This is a manageable issue, as we can easily locate the problematic records in Sidekiq and address them while allowing all others to continue processing.
00:17:59.520
Regarding the load job, which operates on just one filing at a time, this specific instance represents one of Chipotle's and DLA Piper’s filings in XML format.
00:18:11.640
We extract the necessary fields from this XML filing entity, utilizing Nokogiri once again, a vital tool for this pipeline due to its XML focus.
00:18:28.920
The fields we extract include client and firm details. This filing's XML structure resembles flat data, displaying client and lobbying firm information in a straightforward manner.
00:18:44.880
However, we want to structure this into relational data within our PostgreSQL database. Hence, we will create separate tables for clients, firms, and filings.
00:19:01.080
For instance, when accessing a client's profile page in Lobby Focus, we want to show all associated firms and filings in a correlated manner; the data must be connected for effective functionality.
00:19:15.840
For this reason, we first ensure there’s a client record. We use Create or Find By to avoid duplicating records in PostgreSQL.
00:19:31.200
The same principle applies to writing the firm record, followed by the filing associated with the identified client and firm records. We utilize a custom function called Create or Update Unique.
00:19:46.920
This function checks the fields unique to a filing; in this case, each filing has a unique Senate ID. By employing this approach, we can safely run the pipeline multiple times without creating duplicates in the database.
00:20:05.520
To prevent duplicates during reprocessing, we establish a unique index on the unique filing field in PostgreSQL. The Senate ID will always remain unique.
00:20:18.060
This ensures we'll never store multiple copies of the same filing in the filings table. When we write filings initially, if that write fails due to a unique index violation, we'll simply update the existing record with any changes.
00:20:32.880
This methodology embodies idempotency, ensuring that we can run the pipeline multiple times without generating issues.
00:20:44.520
Normalization is critical, especially depending on the data source. A specific challenge in lobbying data is the lack of standardized company names.
00:20:58.680
For example, Microsoft could appear as Microsoft, Microsoft Corp, Microsoft Inc, or Microsoft Corporation—all different ways to reference the same entity.
00:21:10.680
Since lobbying firms can input any variation of company names in filings, you end up with multiple entries for what is essentially the same entity.
00:21:24.720
To address these discrepancies in a data pipeline, it's useful to utilize open-source solutions, and indeed, there's the Consistent Company gem designed for normalizing such variations.
00:21:39.420
Using a unique index on the normalized name within the clients table will enable you to associate all lobbying filings using any of the variations of Microsoft with a single Microsoft entry in your database.
00:21:53.040
You definitely don't want your users to search your app and discover five distinct listings for the same company. Therefore, normalizing the data is essential when the source data isn't standardized.
00:22:08.040
Changes can also occur in data sources, which adds complexity to managing a data pipeline.
00:22:23.520
For instance, a field might be added or removed, or a web page might transition to a JSON API. Such changes can disturb the data format you are addressing.
00:22:36.720
Modularly building your pipeline to maintain narrowly scoped responsibilities for each job will help you minimize the changes you need to make when a data source changes.
00:22:51.180
If the webpage layout changes to present zip files differently, for example, you would only have to adjust the scrape job rather than refactor the entire pipeline.
00:23:04.620
This approach prevents the pipeline from becoming brittle and overly complex. With distinct jobs carrying specific responsibilities, alterations needed due to a data source can often be isolated.
00:23:18.240
You might be wondering about scalability. This is often the question posed about Rails apps, though I propose it isn't the most relevant inquiry. Nonetheless, we need to consider scalability when deploying solutions.
00:23:32.520
As we've discussed, we're building our pipeline on Sidekiq, which is highly scalable. If data processing is lagging, you can enhance this by increasing Sidekiq workers.
00:23:48.120
For performance issues related to CPU, adding more Sidekiq workers will allow faster processing of jobs concurrently. If loading to PostgreSQL is slow, this same principle applies.
00:24:02.520
While we horizontally scale Sidekiq by adding workers, we might also need to vertically scale PostgreSQL by upgrading the instance with increased resources such as memory or CPU.
00:24:17.520
Although you will encounter limits, it’s essential to analyze data volume and the desired processing time.
00:24:31.620
Neither quantity alone should stall progress, but a combination of high data volume needing swift processing could encounter constraints in PostgreSQL or Redis.
00:24:46.020
Keep in mind that Redis stores the pending jobs list, which could run out of memory—most commonly managed through vertical scaling.
00:25:00.720
Under certain circumstances, you may have trouble getting additional memory, making it impossible to accommodate all jobs in your pipeline.
00:25:13.920
This scenario—a hard scaling limit—is rare in most projects; however, it is advantageous to have alternatives lined up.
00:25:29.460
If the Rails approach shows limitations while processing significant amounts of data, you can explore Rails gems like Kiba, which specifically tackles the ETL use case.
00:25:45.060
Another option is to investigate more enterprise-level data pipeline tools designed to manage vast amounts of data efficiently and quickly.
00:26:01.990
In summary, you might not need those advanced tools, so it’s worthwhile to consider the methods discussed focusing on the reliable tools familiar to the Rails ecosystem.
00:26:16.499
Personally, I'm enthusiastic about leveraging this approach since it offers a quick ramp-up period and exceptional scalability.
00:26:29.880
For those interested, numerous possibilities exist. If you're building a portfolio, showing your skills through a well-established app can be advantageous.
00:26:45.420
Having real data integrated makes the app substantially valuable from the get-go. The potential use cases are expansive, including scraping Medicare data for a doctor directory with Medicare ratings.
00:27:01.920
Other interesting options are scraping weather data or Wikimedia content, among others, to create a truly useful app or website utilizing a data pipeline.
00:27:17.520
You might also encounter scenarios at work demanding either the adjustment of an existing data pipeline or the construction of a new one.
00:27:33.600
Gaining experience in building a data pipeline end-to-end, even through side projects, provides invaluable lessons around scalability, identifying choke points, and managing data inconsistencies.
00:27:48.060
These insights are relevant, whether you operate in the corporate world or small-scale projects, where both often face similar challenges regarding data.
00:28:02.520
I genuinely appreciate you taking the time to join this virtual talk. I hope that next time we can connect in person, but in the meantime, I’ll be available for a Q&A.
00:28:16.920
Check the schedule for more information, and I look forward to seeing you on Discord to discuss any questions or feedback you may have regarding this talk.
00:28:30.240
Thank you so much for attending!