00:00:00.160
But first, we are going to listen to a talk about streaming data transformations with Ruby. We have a Finnish speaker here, Ville Lautanala. It's your turn, Ville.
00:00:21.039
Hi, I'm Ville Lautanala, or if you happen to run into me on the internet, I usually go by the handle 'Lautanala'. The topic for this lightning talk is streaming data transformations with Ruby.
00:00:30.960
To give an example of what I'm talking about, I might want to download a web page, decompress it, and count the number of characters on the page. In the terminal, this is rather simple to do: use curl to fetch the page, pipe the output through gunzip, and then feed that to wc to get the character count.
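Roughly, the terminal pipeline looks like this (the URL is a placeholder, and a local variant is included so the same shape can be tried offline):

```shell
# The pipeline from the talk (URL is a placeholder for the example page):
#   curl -s https://example.com/page.gz | gunzip | wc -c

# The same pipeline shape with local data, runnable without a network:
printf 'hello world' | gzip | gunzip | wc -c   # 11 bytes
```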
00:00:44.079
On the example page, we have roughly 132,000 characters. As you can see, we can add more steps if needed, and things compose nicely. The best part is that this is streaming, so even if the page were terabytes in size, it would still work; it might take a bit longer, but it wouldn't consume all of my available memory.
00:01:16.080
So how hard can it be to implement something like this in Ruby? Well, it's not too hard, but at least my first attempt ended up being a bit of a mess. It works, but it's somewhat spaghetti-like, with the steps coupled together, mainly because the gzip decoder needs to be finalized to free up its resources.
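A minimal sketch of what such a first attempt might look like (a hypothetical reconstruction, not the actual original code; the explicit close is what couples decompression to the counting):

```ruby
require "zlib"
require "stringio"

# Hypothetical sketch of the tangled first attempt: decompression and
# counting live in one method because Zlib::GzipReader must be closed
# explicitly to free its zlib resources.
def count_decompressed_chars(io)
  reader = Zlib::GzipReader.new(io)
  count = 0
  begin
    while (chunk = reader.read(16_384))  # stream in chunks, not all at once
      count += chunk.length
    end
  ensure
    reader.close  # cleanup responsibility leaks into the counting code
  end
  count
end

count_decompressed_chars(StringIO.new(Zlib.gzip("hello world")))  # => 11
```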
00:01:35.040
This made it harder to decompose the functionality into smaller pieces. It's not a significant problem with the simple pipeline I have, but what if I had more steps in the pipeline? Then it wouldn't be acceptable anymore, and I would have to find a way to break the pipeline apart and recombine it. So, back to the drawing board.
00:02:00.640
How could this be improved? Well, taking inspiration from the terminal script, I thought of creating a pipeline of steps, where each step is a lambda that takes an enumerable and returns an enumerable, and composing those together. For good measure, the page URL, the input for the first step, could be given as a function argument, allowing me to change the URL when the pipeline is invoked if necessary.
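As a sketch of that idea (the names and the compose helper here are illustrative, not the gem's actual API): each step is a lambda from enumerable to enumerable, except the last, which reduces to a value, and a small helper chains them in order.

```ruby
# Illustrative sketch: each step is a lambda from enumerable to
# (lazy) enumerable, or a final reduction; compose chains them left to right.
def compose(*steps)
  ->(input) { steps.reduce(input) { |acc, step| step.call(acc) } }
end

# Stand-in steps; a real pipeline would fetch and decompress instead.
normalize  = ->(chunks) { chunks.lazy.map(&:downcase) }
char_count = ->(chunks) { chunks.sum(&:length) }

pipeline = compose(normalize, char_count)
pipeline.call(["Hello", "World"])  # => 10
```

Because the intermediate steps return lazy enumerables, nothing is buffered: each chunk flows through the whole chain before the next one is pulled.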
00:02:50.319
This is all implemented in the Piperator gem. The HTTP fetch isn't particularly interesting; it's pretty much the same as before. For the decompression part, I really wanted to use Zlib::GzipReader, so I implemented an IO wrapper for enumerators and pass the wrapped enumerator to the GzipReader as its argument. The character count step is rather unremarkable.
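A hedged sketch of such a wrapper (my own minimal reconstruction, not the gem's code): Zlib::GzipReader only calls #read on its source, so an object that buffers chunks pulled from an enumerator is enough.

```ruby
require "zlib"

# Minimal reconstruction (not the gem's actual code) of an IO-like wrapper
# around an Enumerator. Zlib::GzipReader only calls #read on its source.
class EnumeratorIO
  def initialize(enumerator)
    @enumerator = enumerator
    @buffer = String.new("", encoding: Encoding::BINARY)
  end

  # IO#read semantics: with a length, return up to that many bytes or nil
  # at EOF; with no length, return everything that remains.
  def read(length = nil, _outbuf = nil)
    fill_buffer(length)
    return nil if length && @buffer.empty?
    @buffer.slice!(0, length || @buffer.bytesize)
  end

  private

  def fill_buffer(length)
    @buffer << @enumerator.next while length.nil? || @buffer.bytesize < length
  rescue StopIteration
    # Source exhausted; whatever is buffered is all there is.
  end
end

gzipped = Zlib.gzip("streaming data")
chunks  = gzipped.each_char.each_slice(5).map(&:join)  # simulate a chunked download
reader  = Zlib::GzipReader.new(EnumeratorIO.new(chunks.each))
reader.read  # => "streaming data"
```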
00:03:21.280
So my original goal for this talk was to show this approach in the hope that the audience finds it interesting, and to pitch the idea that we could have something like this in Ruby itself. To avoid embarrassing myself, I thought it would be good to check whether there is something similar in the latest Ruby versions, as I have not been writing much Ruby in the past two years.
00:03:45.519
It turns out that, at least for the pipeline part, since Ruby 2.6 you could have written it using the function composition operator and gotten pretty much the same code as before. The enumerator IO part is still valid, so to make my gem redundant, could we please have an enumerator-backed IO object in Ruby, just like we have StringIO? This might be something to consider for future Ruby versions.
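For instance, with the built-in Proc#>> operator (available since Ruby 2.6), the pipeline composes without any helper; the steps here are illustrative stand-ins:

```ruby
# The same pipeline idea using the built-in composition operator Proc#>>
# (Ruby 2.6+). Steps are illustrative stand-ins for fetch/decompress/count.
normalize  = ->(chunks) { chunks.lazy.map(&:downcase) }
char_count = ->(chunks) { chunks.sum(&:length) }

pipeline = normalize >> char_count
pipeline.call(["Hello", "World"])  # => 10
```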
00:04:20.720
Otherwise, I think this is a pretty neat way of composing streaming data pipelines. I can also demonstrate that it actually works: if I run the script, I get the exact same output as before. Thank you for your attention, and I hope you have a really nice day.