RubyKaigi 2018

Kiba 2 - Past, present & future of data processing with Ruby

Kiba ETL (http://www.kiba-etl.org) is a lightweight, generic data processing framework for Ruby, initially released in 2015 & now in v2.

In this talk, I'll highlight why Kiba was created, how it is used for low-maintenance data preparation and processing in the enterprise (illustrated by many different use cases), why and how the version 2 (leveraging Ruby's Enumerator) brings a massive improvement in authoring reusable & composable data processing components, and why I'm optimistic about the future of data processing with Ruby.

RubyKaigi 2018 https://rubykaigi.org/2018/presentations/thibaut_barrere


00:00:00.030 Hello, can you hear me? Yeah, hi, konnichiwa! I'm Thibaut, from France, and I'm going to share a little bit of perspective on my work using Ruby for data processing tasks.
00:00:09.150 In particular, I'll discuss ETL, which I will define with an easy-to-understand diagram. It won't be too technical, but I want to give you patterns and a perspective on how time flies—the past, present, and future of data processing.
00:00:20.939 Before that, I want to thank Matz and the Ruby committers (arigatou gozaimasu) for making my life nicer as a programmer.
00:00:31.949 I've been using Ruby since 2005, and I'm very happy with it. So, what is Kiba ETL? It's a lightweight, generic data processing framework for Ruby.
00:00:40.320 First, I'm going to explain what led to Kiba's creation, then how I'm using it on a daily basis. I want you to understand what solutions are available in Ruby, how you can use them, and why you can trust the future of data processing.
00:01:01.050 So, first—why Kiba? Kiba in Japanese means 'fang,' and there's a little background to it because it comes from my parents' dogs. The idea is that with your fang (kiba), you can bite into the data, allowing you to control it, manage it, and modify it.
00:01:15.240 Now, what is ETL? I really want you to underline this term because it is something, as Matz explained this morning, that if you know the concept, you can Google for it and find solutions, books, articles, and everything related. Keep that word in mind.
00:01:38.040 ETL means Extract, Transform, and Load. It's like you take some data from one place—you extract it—then you have a pipeline where you transform the extracted data, and finally, you load it into a destination. It's a commonly used term but not well-known in Ruby. I encourage you to check out my blog, where I elaborate more on this concept, how it can help you as a coder, and what types of resources are available.
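To make the term concrete, here is the three-phase idea in miniature, in plain Ruby with no framework (the data and field names are invented for illustration):

```ruby
# Extract: parse raw records (here, two CSV-ish strings) into rows (Hashes)
raw = ["alice,FR", "bob,JP"]
rows = raw.map do |line|
  name, country = line.split(",")
  { name: name, country: country }
end

# Transform: reshape each row to fit the target (capitalize the name)
rows = rows.map { |row| row.merge(name: row[:name].capitalize) }

# Load: write each row into a destination (an array standing in for a database)
destination = []
rows.each { |row| destination << row }

destination # => [{ name: "Alice", country: "FR" }, { name: "Bob", country: "JP" }]
```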
00:02:21.150 To describe it visually, imagine a pipe. Water flows from a source at the top, and the data flows through the transformations you write in Ruby. You modify it in a way that fits your needs before it reaches the destination, often referred to as a 'sink' in some paradigms. You can find this type of visual definition in many data processing tools today.
00:02:39.030 Going back in time, let’s take a look at the history of Kiba and understand why it was created in the first place. I started my journey in 2006 when I discovered a gem called ActiveWarehouse ETL, developed by Anthony Eden, now the founder of DNSimple. I fell in love with the tool.
00:03:09.270 Previously, I had been using GUI tools, which required a lot of clicking and pointing to manage data processing, and it was quite a headache. You couldn't easily customize things or efficiently manage the data you needed to process. However, this gem allowed for a much nicer DSL declaration.
00:03:39.630 It provided an imperative approach, allowing users to declare their processing steps clearly—types of sources like files from a CSV, transformations, and destination types like databases. This approach greatly aided in maintenance and data handling.
00:04:05.250 Over the long run, you could easily add or remove components, facilitating method chaining, for instance. Some of the ETL processes I built back in 2006 are still in production today, almost unchanged, and this shows the sustainability of this approach.
00:04:39.270 Even back then, on Ruby 1.8.5, the DSL supported filtering rows and similar operations, although the overall system was still somewhat difficult to customize. Ruby would interpret the declarative structure of the code, then process the data through each transform and on to the final destination.
00:05:06.539 It allowed me to escape the frustrating world of GUI tools, like Microsoft’s point-and-click systems, which could be very convenient yet limited in understanding how the processing occurred. Thanks to this gem, I was able to create many different data extraction pipelines.
00:05:56.550 Over six years, I extracted CSV data from a costly CRM to import it into MySQL, enabling basic business intelligence solutions. I even created a connector to send data to a COBOL mainframe backend.
00:06:41.350 I took over maintenance of the tool because the original author, Anthony, was moving on to other projects. Although I only utilized a small part of its features, it felt essential to preserve such a valuable solution.
00:07:08.760 The DSL offered unique long-term value at both writing and maintenance time, and I wanted to ensure its availability for my consulting clients and my own needs. However, I quickly realized that many features in the library made it overly complex and difficult for a solo developer to maintain. The specs were tightly coupled to ActiveRecord, making adoption challenging.
00:08:31.210 After some time, I had to step back because the complexity and lack of composability were hindering my efficiency. I still had my data processing needs and various client demands. Nevertheless, I couldn't afford to lose the unique syntax and ease of use of the DSL.
00:09:00.440 I remembered that a few years prior, I had written a simple reimplementation of that gem. By knitting together my experiences, I created Kiba in 2015, a gem designed to be easy to maintain and use. I focused on making it sustainable, to avoid the burnout seen by many contributors in the open-source community.
00:09:32.880 Now, let me show you the core parts of Kiba. There are five keywords: pre_process and post_process, which run at the start and end of a job, and source, transform, and destination, which make up the pipeline itself.
00:10:05.880 Kiba core does not provide components; it delivers conventions to follow, allowing for easy implementation of components on your end.
00:10:36.239 Here is an example of a job declaration. You start with the source keyword, name the class serving as your source, and pass its configuration as a hash. Next, you define the transforms and finally the destination.
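A sketch of that declaration style. The real source / transform / destination keywords come from the kiba gem's DSL; the stand-in runner and component names below (MiniPipeline, ArraySource, ArrayDestination) are invented here so the snippet runs on its own:

```ruby
# Stand-in for Kiba's DSL: just enough to make the declaration style runnable.
class MiniPipeline
  def initialize(&definition)
    @sources = []
    @transforms = []
    @destinations = []
    instance_eval(&definition)
  end

  def source(klass, **config)
    @sources << klass.new(**config)
  end

  def transform(klass = nil, **config, &block)
    @transforms << (block || klass.new(**config))
  end

  def destination(klass, **config)
    @destinations << klass.new(**config)
  end

  def run
    @sources.each do |src|
      src.each do |row|
        @transforms.each do |t|
          row = t.respond_to?(:process) ? t.process(row) : t.call(row)
          break if row.nil? # nil filters the row out
        end
        next if row.nil?
        @destinations.each { |dest| dest.write(row) }
      end
    end
  end
end

# Illustrative components
class ArraySource
  def initialize(rows:)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

class ArrayDestination
  def self.written
    @written ||= []
  end

  def write(row)
    self.class.written << row
  end
end

# The declaration itself: source, then transforms, then destination
pipeline = MiniPipeline.new do
  source ArraySource, rows: [{ name: "alice" }, { name: "bob" }]
  transform { |row| row.merge(name: row[:name].capitalize) }
  destination ArrayDestination
end
pipeline.run
ArrayDestination.written # => [{ name: "Alice" }, { name: "Bob" }]
```

The real Kiba.parse / Kiba.run pair works along the same lines, with more care around setup and teardown.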
00:11:03.790 Implementing a source involves writing the initializer, which takes arguments and prepares your source, and an each method that yields one row at a time, so arbitrarily large inputs can be streamed. The implementation is simple, with no complicated dependencies, which makes testing easier.
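As a sketch of those conventions, here is a hypothetical CSV source (the class and option names are illustrative; Ruby's standard csv library does the parsing):

```ruby
require "csv"
require "tempfile"

# Hypothetical CSV source: configuration goes through the constructor, and
# `each` yields one row (a Hash with symbol keys) at a time, so even very
# large files can be streamed without loading them into memory.
class CsvSource
  def initialize(filename:)
    @filename = filename
  end

  def each
    CSV.foreach(@filename, headers: true, header_converters: :symbol) do |row|
      yield row.to_h
    end
  end
end

# Usage: stream rows out of a small CSV file
file = Tempfile.new(["contacts", ".csv"])
file.write("name,email\nAlice,alice@example.com\n")
file.close

rows = []
CsvSource.new(filename: file.path).each { |row| rows << row }
rows # => [{ name: "Alice", email: "alice@example.com" }]
```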
00:12:01.210 Transformations modify the data, either by defining a class with a process method returning one row or filtering rows using blocks. In Kiba, you can yield multiple rows, allowing transformations to generate several output rows per input.
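Here is a sketch of both forms, with invented transform names; the second models "several output rows per input" by having process yield extra rows to a block:

```ruby
# Class form: `process` takes one row and returns it (possibly modified),
# or nil to filter the row out of the pipeline.
class NormalizeEmail
  def process(row)
    email = row[:email].to_s.strip.downcase
    return nil if email.empty? # drop rows without an email
    row.merge(email: email)
  end
end

# Multiple-output form, modeled here as `process` yielding one row per tag.
class ExplodeTags
  def process(row)
    row[:tags].each { |tag| yield row.merge(tag: tag) }
  end
end

NormalizeEmail.new.process(email: "  Alice@Example.COM ")
# => { email: "alice@example.com" }

out = []
ExplodeTags.new.process(name: "post", tags: %w[ruby etl]) { |r| out << r }
out.map { |r| r[:tag] } # => ["ruby", "etl"]
```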
00:12:53.030 Destinations follow the same approach: you implement a constructor and a write method that handles one row. For the last three years, I've used Kiba in numerous data processing scenarios.
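A minimal hypothetical destination along those lines, collecting rows in memory (a real one would write to a file, a database, an API, and so on):

```ruby
# Hypothetical destination: a constructor for configuration, `write` for one
# row at a time, and an optional `close` for final cleanup (flushing buffers,
# closing connections, ...).
class InMemoryDestination
  attr_reader :rows

  def initialize
    @rows = []
    @closed = false
  end

  def write(row)
    raise "destination already closed" if @closed
    @rows << row
  end

  def close
    @closed = true
  end
end

dest = InMemoryDestination.new
dest.write(name: "Alice")
dest.close
dest.rows # => [{ name: "Alice" }]
```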
00:13:18.140 As both a developer and user, I found components easy to maintain and write. This sustainable approach is doable for me as a solo developer and family man.
00:14:14.990 You have a reliable solution for data processing, with minimal fear of abandonment, as it's manageable for me to keep it alive.
00:14:52.470 Now, regarding sustainable project definitions, I've made improvements to Kiba with version two, which is open-source and contains helpful components. I've also begun working on Kiba Pro, offering tools for fast SQL and Amazon S3 data profiling.
00:15:12.440 This is done in a sustainable model, minimizing burnout risk. Now, let me share four patterns that indicate Kiba could be a good fit.
00:15:38.100 The first pattern is micro-batches. When handling fewer than 50,000 rows per job, Kiba facilitates near-real-time synchronization, especially for companies using a large CRM. Kiba can streamline data synchronization, transforming the internal app's data dynamically.
00:16:08.460 The second type is multi-step batch processing, which I call enterprise data aggregation. Here, various client systems each provide different formats and extraction methods. You need to conform this data into a single database for analysis.
00:16:59.300 Each step in the Kiba script can provide files to one another, employing message passing that simplifies your pipeline. Keeping things simple by managing inputs through basics such as cron jobs or Bash can be helpful.
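The file-handoff idea can be sketched like this, using JSON Lines as an illustrative interchange format (file names are invented):

```ruby
require "json"
require "tmpdir"

rows = nil
Dir.mktmpdir do |dir|
  handoff = File.join(dir, "step1_output.jsonl")

  # Step 1: extract + transform, then write one JSON document per line
  File.open(handoff, "w") do |f|
    [{ id: 1 }, { id: 2 }].each { |row| f.puts(JSON.generate(row)) }
  end

  # Step 2 (possibly a separate script run later by cron): read the handoff
  # file line by line as its own source
  rows = File.foreach(handoff).map { |line| JSON.parse(line, symbolize_names: true) }
end
rows # => [{ id: 1 }, { id: 2 }]
```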
00:17:35.580 You can even utilize Amazon S3 as an inbox to manage incoming files from various sources, which standardizes the input method and allows you to avoid a chaotic setup.
00:18:03.370 Now, regarding internal tasks, automation can be valuable for reconciling data from banks or other systems. By including human oversight in data processing, you can achieve systematic results with checks to ensure accuracy.
00:19:18.500 Finally, data migrations are significant projects that benefit from a robust ETL process side-by-side with the development of the new app. Utilizing data screens can ensure successful migrations.
00:19:43.879 Key takeaways include using bulk inserts, avoiding ActiveRecord validations during migrations to improve speed, and working with predetermined ID sequences to ensure a smooth data migration process.
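The bulk-insert takeaway can be sketched as a buffering destination: rows accumulate and are flushed in batches through an injected callable, which in a real migration would issue a multi-row SQL INSERT (all names here are hypothetical):

```ruby
# Hypothetical destination that buffers rows and flushes them in batches.
class BulkDestination
  def initialize(batch_size:, flusher:)
    @batch_size = batch_size
    @flusher = flusher # callable receiving an array of rows
    @buffer = []
  end

  def write(row)
    @buffer << row
    flush if @buffer.size >= @batch_size
  end

  def close
    flush unless @buffer.empty? # don't lose the final partial batch
  end

  private

  def flush
    @flusher.call(@buffer)
    @buffer = []
  end
end

batches = []
dest = BulkDestination.new(batch_size: 2, flusher: ->(rows) { batches << rows.size })
5.times { |i| dest.write(id: i) }
dest.close
batches # => [2, 2, 1]
```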
00:20:11.740 I believe Kiba v2, designed as a drop-in replacement for v1, is continuously evolving. I've written numerous components to guarantee flexibility and manage complex data transformations.
00:21:00.780 This toolkit has become more maintainable over time, offering great assets for data integration focused on clean code practices, reusable components, and effective data handling capabilities.
00:21:40.000 In conclusion, Ruby shines when processing data, especially with emerging performance improvements. Utilizing libraries effectively facilitates robust ETL implementations, albeit with less popularity than ecosystems like Python.
00:22:20.200 I'm satisfied with the current developments in Kiba, which aligns perfectly with my data processing needs. If you have specific questions or wish to discuss further, I invite you to reach out.
00:22:51.310 We have five minutes left; please feel free to ask any questions.
00:23:10.490 A participant asks about retry mechanisms within the Kiba framework, to which the speaker responds that it ultimately depends on how one implements retries when jobs fail. The suggestion is to fail fast, raising exceptions when necessary.
00:23:51.630 Another participant shares their positive experience with Kiba in comparison to Logstash, noting the challenges they have faced with plugins. The speaker acknowledges that while Kiba doesn’t currently have watchers implemented, it can support continuous checks for data sources.
00:24:36.720 Regarding unstructured or normalized data extraction from sources like MongoDB, the speaker recommends identifying actual schemas within the unstructured data, ensuring that different formats are appropriately handled. There's an emphasis on being adaptable to various data sources.
00:25:51.800 Thank you for your attention! If you have more questions or need further assistance, please feel free to reach out.