Ruby Unconf 2018

Understanding Unix Pipes with Ruby


00:00:12 Thank you, thank you.
00:00:14 So, yeah, it's not the only stupid job. I hope you also liked it as much as I did.
00:00:19 I was thinking about the title for this talk. Then I thought it would be even nicer to call it something else.
00:00:24 However, I remembered that Unix is a very selfish topic and went with the boring title you have all read. Anyway, my name is Sergio, and I work for Thomas.
00:00:34 I am the person in charge of the green stickers, for those of you who forgot to pick them up. It was my personal mistake, and I will personally ensure you get them in Europe.
00:00:42 Now, let's talk about pipes. It's safe to assume that most programmers know the basics of pipes. It's something you can use to combine several commands.
00:00:53 For instance, we can use a command to list files and then filter them by piping the output into another program, such as grep. In this case, we are looking for executable files.
00:01:07 We can even pipe that into another command that will count how many executable files there are. This idea of taking the output of one operation and using it as the input for the next is familiar to us as programmers.
00:01:19 We define functions, call them, get results, and pass them into other functions. So, that's how we would write the same example in Ruby. It looks a bit odd because the order is reversed, but it achieves the same goal: listing files, filtering them, and counting them.
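The slide code is not part of this transcript, so the following is only a sketch of what the nested-call version might look like. The helper names `list`, `executables`, `count`, and `count_executables` are hypothetical stand-ins for the three commands in the shell pipeline.

```ruby
# Hypothetical helpers, one per stage of the shell pipeline.

# Stage 1: list the entries of a directory as full paths.
def list(dir)
  Dir.children(dir).map { |name| File.join(dir, name) }
end

# Stage 2: keep only regular files that are executable.
def executables(paths)
  paths.select { |path| File.file?(path) && File.executable?(path) }
end

# Stage 3: count whatever is left.
def count(items)
  items.size
end

# Reads inside-out: list runs first and count runs last,
# which is the reverse of the left-to-right shell pipeline.
def count_executables(dir)
  count(executables(list(dir)))
end

puts count_executables(Dir.pwd) if __FILE__ == $PROGRAM_NAME
```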
00:01:50 I say it's a bit confusing because it's written in the opposite order. However, we can use variables to help clarify this. Now, it's not in the opposite order; it does the same thing.
00:02:07 We can even use something like method chaining to wrap it in an order that makes sense. Some languages have specific syntax that makes this behavior almost identical to the way we write traditional code.
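As a rough illustration of those two rewrites (again a sketch, not the code from the slides, with hypothetical method names), intermediate variables and method chaining both restore the left-to-right reading order:

```ruby
# Variant with intermediate variables: each step gets a name,
# and the steps read top to bottom in the order they run.
def count_executables_with_variables(dir)
  files = Dir.children(dir).map { |name| File.join(dir, name) }
  executables = files.select { |path| File.file?(path) && File.executable?(path) }
  executables.size
end

# Variant with method chaining: the closest Ruby gets to the
# shape of a shell pipeline, one stage per line.
def count_executables_chained(dir)
  Dir.children(dir)
     .map { |name| File.join(dir, name) }
     .select { |path| File.file?(path) && File.executable?(path) }
     .size
end
```

Both variants still build the complete list of files before filtering starts, which is exactly the simplification the rest of the talk questions.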
00:02:20 In my understanding of pipes over the years, it's one of the first concepts I learned when I started programming. I have continued using pipes to pass the output from one command to another without thinking much about them.
00:02:31 What I want to show you today is that this simplification, while very useful for learning, is a bit over-simplified. It ignores the interesting details. I want to show you three properties of pipes that have remained unchanged and useful for over fifty years.
00:02:57 Let’s explore those properties. To begin, we’ll start with an example. As you may know, a process has one input and two outputs.
00:03:04 Both outputs normally end up in the terminal: standard output, and standard error, which we usually use for debugging or printing messages. This is the setup I am talking about; it looks complicated with a lot of lines and connections.
00:03:19 However, thanks to pipes being such an excellent invention, it's really easy to implement that setup. We would have two processes where the output of one is the input of the other.
00:03:36 Note that since we are going to do the examples in Ruby, we need to write another Ruby program that reads the input.
00:03:50 In our first example, we are going to print five lines to both outputs: one to stderr and the other to stdout. It's something really simple.
00:04:05 For the second program, it will probably be even simpler as it will print something about the incoming lines. In this case, since we know both will go to the same terminal window, we will distinguish them by printing everything to stdout.
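The two programs are not reproduced in the transcript. A minimal sketch of what they might look like follows, written as two methods so that it fits in one block; the names `produce` and `consume` and the message texts are assumptions. In practice each method body would live in its own script and the two would be joined with a shell pipe.

```ruby
require "stringio"

# Hypothetical first program: announces each line on stderr,
# then writes the line itself to stdout.
def produce(count, out: $stdout, err: $stderr)
  count.times do |i|
    err.puts "producer: generated line #{i + 1}"
    out.puts "line #{i + 1}"
    # Ruby buffers stdout when it is not a terminal, so flush
    # to hand every line to the pipe right away.
    out.flush
  end
end

# Hypothetical second program: reports every line it receives.
def consume(input: $stdin, out: $stdout)
  input.each_line do |line|
    out.puts "consumer: received #{line.chomp}"
  end
end

if __FILE__ == $PROGRAM_NAME
  # In-memory stand-in for the pipe, just to show the wiring.
  pipe = StringIO.new
  produce(5, out: pipe)
  pipe.rewind
  consume(input: pipe)
end
```

The in-memory wiring at the bottom only shows how the pieces fit together. It does not stream, because the StringIO is filled completely before it is read.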
00:04:36 Now, what do we expect to happen when we run this? According to our simplified understanding, many of us might think that the first program's output is the input of the second one.
00:04:52 So we could expect that the result would show all the lines that the first program printed, and the second one would count and process them.
00:05:04 It's not exactly like that. This is what happens when we run those programs.
00:05:13 The second program receives the lines as soon as they are printed by the first one—there's no need to wait for all of them to be generated. This idea of streaming can also be seen as 'laziness' in the sense that as soon as something is ready, we can start processing it.
00:05:42 This has several implications, especially as the amount of data passed from one process to another increases. The bigger the data, the bigger the implications become.
00:06:03 The first implication involves memory consumption. In our simplified model, we assumed that we need enough memory to store intermediate results.
00:06:14 However, in reality, each line can be discarded from memory as soon as it's read. So, memory consumption can be as low as what’s needed to store one line.
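The difference in memory use can be sketched in plain Ruby. The method names below are hypothetical; the first one mirrors the simplified model and the second one the streaming behavior.

```ruby
require "stringio"

# Simplified model: read every line into an array first,
# so memory grows with the size of the input.
def count_matching_eager(io, pattern)
  io.readlines.grep(pattern).size
end

# Streaming model: look at one line at a time and let it go,
# so memory stays close to the size of a single line.
def count_matching_streaming(io, pattern)
  io.each_line.count { |line| line.match?(pattern) }
end
```

Both methods return the same number. They differ only in how much of the input is held in memory at once.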
00:06:27 Recently, I showed how one can process a large volume of output without needing to store everything in memory.
00:06:42 Instead of generating all lines by the first process and then processing them by the second, it's more accurate to say that lines are generated and processed continuously.
00:07:00 This is the streaming behavior that I wanted to illustrate with this example.
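One way to observe this from inside a single Ruby process is to connect a writer thread and a reader through `IO.pipe`. This is only a sketch: the method name `stream_through_pipe` and the acknowledgement queue are assumptions added to make the ordering visible. The writer refuses to send the next line until the reader confirms the previous one, which could never finish if the pipe delivered lines only after the writer was done.

```ruby
# Sends `count` lines through an OS pipe and records the order
# in which lines are sent and received.
def stream_through_pipe(count)
  reader, writer = IO.pipe
  writer.sync = true # do not hold lines back in Ruby's own buffer
  acks = Queue.new   # hypothetical back channel, only for the demo
  events = []

  producer = Thread.new do
    count.times do |i|
      events << "sent line #{i + 1}"
      writer.puts "line #{i + 1}"
      acks.pop # wait until the reader has seen this line
    end
    writer.close # signals end of input to the reader
  end

  reader.each_line do |line|
    events << "received #{line.chomp}"
    acks << :ok
  end

  producer.join
  reader.close
  events
end

puts stream_through_pipe(3) if __FILE__ == $PROGRAM_NAME
```

The recorded events alternate between sent and received, line by line.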
00:07:30 Interestingly, I have never thought of this in terms of streaming, but it's something that I have always experienced, especially when using the command 'tail -f' on a file.
00:07:51 This command monitors a file’s output, showing whatever lines get added in real-time. This is a typical practice in programming where you want to see the logs and pipe them down to another program for filtering.
00:08:10 What’s interesting about this command is that it monitors the file and never finishes; it waits indefinitely for the file to receive new lines.
00:08:31 That doesn't stop grep from filtering and printing lines for me, so we are dealing with infinite streams that never finish.
00:08:54 This practical implication allows us to have infinite streams, and the next programs can still process them without any issues.
00:09:14 Now, we have seen that the Ruby implementation of pipes might seem too simple and doesn't realistically portray what’s happening. So, what does it take to make my Ruby implementation behave like this?
00:09:41 It gets a bit more complicated. We will need to use Enumerator objects; Enumerator is a class that ships with Ruby and lets us produce and consume the output of each step one item at a time.
00:10:08 So how does this work? When we run it, we read one line from the output and then filter it, processing it right after.
00:10:25 That way, rather than having everything in one go, we can consume it line by line, as they are generated by each method.
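A sketch of that idea, assuming a hypothetical endless source and a shared log to make the order of operations visible (the slide code itself is not in the transcript):

```ruby
# Hypothetical endless source, similar in spirit to `tail -f`:
# it never finishes on its own.
def generate(log)
  Enumerator.new do |yielder|
    n = 0
    loop do
      n += 1
      log << "generate #{n}"
      yielder << n
    end
  end
end

# Lazy filter: each item is checked as soon as it is generated,
# instead of after the whole source has been consumed.
def keep_even(source, log)
  source.lazy.select do |n|
    log << "filter #{n}"
    n.even?
  end
end

log = []
first_two = keep_even(generate(log), log).first(2)
# first_two is [2, 4]; the log starts with
# "generate 1", "filter 1", "generate 2", "filter 2"
```

Without `lazy`, `select` would try to consume the whole source first and never return, which is the behavior of the simplified model applied to an infinite stream.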
00:10:43 While this code may not be particularly beautiful, it serves to illustrate that pipes aren't as simple as they may first appear. They have a lot of nuances.
00:11:00 To discover the second property, we will employ a similar schema with two processes: one generates output for the second.
00:11:23 However, this time we are going to generate the output at a slower rate. Instead of generating all the items simultaneously, we'll introduce a delay.
00:11:48 As we start calculating, we are propagating it at a slower pace. The second one will also be delayed.
00:12:05 Before, we could estimate the output from how we structured our pipes. Now, let's look at how long this takes.
00:12:28 With our naive interpretation of pipes, we might think that since there are 5 items, and each needs 2 steps, it would take 10 seconds.
00:12:47 But that's incorrect; it actually takes 6 seconds, because, as we concluded earlier, the processes can run at the same time.
00:13:02 This time, while the second process handles the first item, the first process is already generating the second item, and so on.
00:13:16 When we run the example, we will see that it reflects this concurrent processing, leading to a total execution time of 6 seconds instead of 10.
00:13:36 So, we discover that execution is concurrent and that the processing speed can be significantly different between different processes.
00:13:54 Going back to the Ruby implementation, however, it doesn't represent this behavior. We can implement threading in Ruby.
00:14:15 This will make it a bit more complicated, but essentially each method needs to spawn a thread and return immediately instead of waiting for its work to finish.
00:14:36 In this case, one method will spawn a thread that pushes output into a queue. This allows for concurrent processing.
00:14:57 The second method will read from the queue, filter, and return the output. The third needs to gather all its input before returning results.
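A sketch of that threaded version follows. The method names, the `DONE` marker, and the configurable delay are assumptions made for illustration; they are not taken from the slides.

```ruby
DONE = :done # hypothetical end-of-stream marker

# First stage: returns a queue immediately and fills it from a thread.
def generate(count, delay)
  output = Queue.new
  Thread.new do
    count.times do |i|
      sleep delay
      output << "line #{i + 1}"
    end
    output << DONE
  end
  output
end

# Second stage: also returns immediately; its thread filters items
# as they arrive, while the first stage keeps generating.
def keep_matching(input, pattern, delay)
  output = Queue.new
  Thread.new do
    while (item = input.pop) != DONE
      sleep delay
      output << item if item.match?(pattern)
    end
    output << DONE
  end
  output
end

# Third stage: has to see all of its input before it can answer.
def count_lines(input)
  total = 0
  total += 1 while input.pop != DONE
  total
end

if __FILE__ == $PROGRAM_NAME
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  total = count_lines(keep_matching(generate(5, 1), /line/, 1))
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts "#{total} lines in about #{elapsed.round} seconds"
end
```

With five items and a one-second delay in each of the two stages, the stages overlap, so the run takes roughly six seconds rather than ten.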
00:15:19 Things start to get complicated and maybe difficult to follow, but the key point is that pipes do a lot more than we give them credit for.
00:15:43 So, when pipes are implemented correctly, they exhibit concurrency and back pressure.
00:16:05 For our last property, we need to consider what we discovered about parallel processes that generate output for follow-up actions.
00:16:30 The interesting part is to understand that these processes might run at different speeds. In a case where one process generates more output than the next can keep up with, we create a bottleneck.
00:16:54 If we run this for a prolonged period, it can become a bigger problem if proper communication principles aren’t in place.
00:17:11 So where do those excess items end up? They have to accumulate somewhere, and memory is not the ideal location for excess data.
00:17:34 This is a challenge we will face if we implement a queue system; it can be a complex issue when managing back pressure between processes.
00:17:54 An interesting concept called back pressure emerges here, where one process can signal to another to slow down its output. This is vital for managing the relationship between processes.
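In a thread-based Ruby version, one way to get a similar effect is Ruby's `SizedQueue`, whose push blocks while the queue is full. The sketch below is an illustration under that assumption, with a hypothetical method name, not the mechanism pipes themselves use.

```ruby
# Runs a producer and a consumer over a bounded queue and reports
# how many items were consumed and the largest backlog observed.
def run_with_back_pressure(items:, capacity:)
  buffer = SizedQueue.new(capacity)
  max_backlog = 0

  producer = Thread.new do
    items.times do |i|
      buffer << i # blocks while the buffer is full
      max_backlog = [max_backlog, buffer.size].max
    end
    buffer.close # pop returns nil once the buffer is drained
  end

  consumed = 0
  consumed += 1 until buffer.pop.nil?
  producer.join

  { consumed: consumed, max_backlog: max_backlog }
end

p run_with_back_pressure(items: 15_000, capacity: 100) if __FILE__ == $PROGRAM_NAME
```

However fast the producer is, the backlog never exceeds the chosen capacity, because a full buffer makes the producer wait.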
00:18:21 To differentiate from pipes, we could have more robust methods of managing communication, ensuring that one process doesn't overwhelm another.
00:18:39 In the end, we return to the example of two processes arranged via a pipe, but increase the number of items to manage that back pressure effectively.
00:19:00 We will send 15,000 items through the pipe, so it will demonstrate how behavior changes in the presence of back pressure.
00:19:23 However, processing of these lines occurs at various rates, ultimately allowing us to observe how each item is handled effectively.
00:19:45 Before we could process the first 5,000 lines, we had already sent through 15,000 but without continuous growth.
00:20:04 Finally, we see that the first process doesn’t keep increasing its output rate as we'd expect—this is all thanks to the buffering implemented by the operating system.
00:20:28 Here, the buffer prevents overwhelming the receiving process, implementing the concept of back pressure effectively.
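The size of that buffer can be probed from Ruby by writing into a pipe that nobody reads until the operating system refuses to accept more. The method name is hypothetical and the result depends on the platform; on Linux the default pipe capacity is commonly 64 KiB.

```ruby
# Fills a pipe without reading from it and returns how many bytes
# the operating system accepted before the write would have blocked.
def pipe_capacity_in_bytes
  reader, writer = IO.pipe
  chunk = "x" * 1024
  written = 0
  loop do
    result = writer.write_nonblock(chunk, exception: false)
    # A blocking write would pause the writer at this point,
    # which is the back pressure described above.
    break if result == :wait_writable
    written += result
  end
  written
ensure
  reader&.close
  writer&.close
end

puts pipe_capacity_in_bytes if __FILE__ == $PROGRAM_NAME
```

A normal, blocking write simply waits at that point until the reader has drained some data, which is why the producer cannot run arbitrarily far ahead of the consumer.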
00:20:54 This illustrates that pipes can implement back pressure efficiently, ensuring that we don't face issues due to inconsistent speeds.
00:21:17 This demonstration highlights that pipes embody the concept of concurrency and can prevent resource overloads in our systems.
00:21:39 It's essential to understand that the ideal use of pipes can lead to efficient resource utilization when operated correctly.
00:22:02 The use of pipes highlights the benefits of concurrency, yet they are not a catch-all solution for every scenario.
00:22:29 I would advise using pipes where they fit best, while remaining aware of their limitations. Be mindful of their capacity and potential issues.
00:22:50 What I have shared today are insights into some design principles that are timeless in software engineering. These principles can help us build robust and efficient Ruby applications.
00:23:12 Finally, it is relevant that these design strategies and principles can last in our product lifecycle as well.
00:23:37 Thank you for your attention, and I hope you found this discussion insightful.