00:00:00.160
But first, we are going to listen to a talk about streaming data transformations with Ruby. We have a Finnish speaker here, Ville Lautanala. It's your turn, Ville.
00:00:21.039
Hi, I'm Ville Lautanala, or if you happen to run into me on the internet, I usually go by the handle 'Lautanala'. The topic for this lightning talk is streaming data transformations with Ruby.
00:00:30.960
To give an example of what I'm talking about, I might want to download a web page, decompress it, and count the number of characters on the page. In the terminal, this is rather simple to do: use curl to fetch the page, pipe the output through gunzip, and then feed that to wc to get the character count.
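Roughly, the terminal pipeline looks like this (the URL is a placeholder, and a local variant is included so the same shape can be tried offline):

```shell
# The pipeline from the talk (URL is a placeholder for the example page):
#   curl -s https://example.com/page.gz | gunzip | wc -c

# The same pipeline shape with local data, runnable without a network:
printf 'hello world' | gzip | gunzip | wc -c   # 11 bytes
```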
00:00:44.079
On the example page, we have roughly 132,000 characters. As you can see, we can add more steps if needed, and things compose nicely. The best part is that this is streaming, so even if the page were terabytes in size, it would still work; it might take a bit longer, but it wouldn't consume all of my available memory.
00:01:16.080
So how hard can it be to implement something like this in Ruby? Well, it's not too hard, but at least my first attempt ended up being a bit of a mess. It works, but it's somewhat spaghetti-like, with the steps coupled together, mainly because the gzip decoder needs to be finalized to free up its resources.
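A minimal sketch of what such a first attempt might look like (a hypothetical reconstruction, not the actual original code; the explicit close is what couples decompression to the counting):

```ruby
require "zlib"
require "stringio"

# Hypothetical sketch of the tangled first attempt: decompression and
# counting live in one method because Zlib::GzipReader must be closed
# explicitly to free its zlib resources.
def count_decompressed_chars(io)
  reader = Zlib::GzipReader.new(io)
  count = 0
  begin
    while (chunk = reader.read(16_384))  # stream in chunks, not all at once
      count += chunk.length
    end
  ensure
    reader.close  # cleanup responsibility leaks into the counting code
  end
  count
end

count_decompressed_chars(StringIO.new(Zlib.gzip("hello world")))  # => 11
```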
00:01:35.040
This made it harder to decompose the functionality into smaller pieces. It's not a significant problem with the simple pipeline I have, but what if I had more steps in the pipeline? Then it wouldn't be acceptable anymore, and I would have to find a way to break the pipeline apart and recombine it. So, back to the drawing board.
00:02:00.640
How could this be improved? Well, taking inspiration from the terminal script, I thought of creating a pipeline of steps, where each step is a lambda that takes an enumerable and returns an enumerable, and composing those together. For good measure, the page URL, the input for the first step, could be given as a function argument, allowing me to change the URL when the pipeline is invoked if necessary.
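As a sketch of that idea (the names and the compose helper here are illustrative, not the gem's actual API): each step is a lambda from enumerable to enumerable, except the last, which reduces to a value, and a small helper chains them in order.

```ruby
# Illustrative sketch: each step is a lambda from enumerable to
# (lazy) enumerable, or a final reduction; compose chains them left to right.
def compose(*steps)
  ->(input) { steps.reduce(input) { |acc, step| step.call(acc) } }
end

# Stand-in steps; a real pipeline would fetch and decompress instead.
normalize  = ->(chunks) { chunks.lazy.map(&:downcase) }
char_count = ->(chunks) { chunks.sum(&:length) }

pipeline = compose(normalize, char_count)
pipeline.call(["Hello", "World"])  # => 10
```

Because the intermediate steps return lazy enumerables, nothing is buffered: each chunk flows through the whole chain before the next one is pulled.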
00:02:50.319
This is all implemented in the Piperator gem. The HTTP fetch isn't particularly interesting; it's pretty much the same as before. For the decompression part, I really wanted to use Zlib::GzipReader, so I implemented an IO wrapper for enumerators and pass the wrapped enumerator to the GzipReader as its argument. The character count step is rather unremarkable.
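A hedged sketch of such a wrapper (my own minimal reconstruction, not the gem's code): Zlib::GzipReader only calls #read on its source, so an object that buffers chunks pulled from an enumerator is enough.

```ruby
require "zlib"

# Minimal reconstruction (not the gem's actual code) of an IO-like wrapper
# around an Enumerator. Zlib::GzipReader only calls #read on its source.
class EnumeratorIO
  def initialize(enumerator)
    @enumerator = enumerator
    @buffer = String.new("", encoding: Encoding::BINARY)
  end

  # IO#read semantics: with a length, return up to that many bytes or nil
  # at EOF; with no length, return everything that remains.
  def read(length = nil, _outbuf = nil)
    fill_buffer(length)
    return nil if length && @buffer.empty?
    @buffer.slice!(0, length || @buffer.bytesize)
  end

  private

  def fill_buffer(length)
    @buffer << @enumerator.next while length.nil? || @buffer.bytesize < length
  rescue StopIteration
    # Source exhausted; whatever is buffered is all there is.
  end
end

gzipped = Zlib.gzip("streaming data")
chunks  = gzipped.each_char.each_slice(5).map(&:join)  # simulate a chunked download
reader  = Zlib::GzipReader.new(EnumeratorIO.new(chunks.each))
reader.read  # => "streaming data"
```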
00:03:21.280
So my original goal for this talk was to show this approach in the hope that the audience finds it interesting, and to pitch the idea that we could have something like this in Ruby itself. To avoid embarrassing myself, I thought it would be good to check whether there is something similar in the latest Ruby versions, as I have not been writing much Ruby in the past two years.
00:03:45.519
It turns out that, at least for the pipeline part, since Ruby 2.6 you could have written it using the function composition operator and gotten pretty much the same code as before. The enumerator IO part is still valid, so to make my gem redundant, could we please have an enumerator-backed IO object in Ruby, just like we have StringIO? This might be something to consider for future Ruby versions.
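For instance, with the built-in Proc#>> operator (available since Ruby 2.6), the pipeline composes without any helper; the steps here are illustrative stand-ins:

```ruby
# The same pipeline idea using the built-in composition operator Proc#>>
# (Ruby 2.6+). Steps are illustrative stand-ins for fetch/decompress/count.
normalize  = ->(chunks) { chunks.lazy.map(&:downcase) }
char_count = ->(chunks) { chunks.sum(&:length) }

pipeline = normalize >> char_count
pipeline.call(["Hello", "World"])  # => 10
```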
00:04:20.720
Otherwise, I think this is a pretty neat way of composing streaming data pipelines. I can also demonstrate that it actually works: if I run the script, I get the exact same output as before. Thank you for your attention, and I hope you have a really nice day.