Lies, Damned Lies, and Substrings

by Haseeb Qureshi

In the talk titled "Lies, Damned Lies, and Substrings," presented by Haseeb Qureshi at RubyConf 2016, the focus is on understanding the complexities involved in generating all substrings of a string in Ruby, alongside exploring memory management and optimization techniques such as Copy-on-Write.

Key Points Discussed:

00:00:15.219 So, I'm Haseeb Qureshi, and I'm going to talk to you today about Ruby, particularly about a time when Ruby lied to me. Many of you may have had a similar experience before. Ruby is a fickle language to have fallen in love with. But I'm going to tell you a story about what happened to me.

00:00:38.720 A co-worker and I were arguing about an algorithm, as you do. This is an artist's depiction of people arguing about an algorithm. Right there is me, and that's him. I'm the fashionable one. We were discussing a pretty classic problem—some of you might have encountered this as an interview question—how do you generate all the substrings of a string?

00:01:08.720 Given a string like 'hello', generating all the substrings of 'hello' is pretty straightforward. You start with the string itself, then consider all the substrings of length 4, length 3, length 2, and length 1. All of these together form the set of substrings. So, how do you generate those?

00:01:27.260 It's quite simple. If you look at how a substring is generated—taking substrings of length 4—you can see that it starts at index i equals 0 and ends at index j equals 3. Then, the next substring starts at i equals 1 and j equals 4. Thus, each substring is uniquely defined by a start index and an end index. As long as you generate all unique pairs of start and end indices, you have all the substrings.

00:01:49.560 You can turn that into code using Ruby pretty easily. We're going to write a method called 'substrings', and we can use each object to collect all the substrings. Inside this nested loop, we go from zero as i all the way to the end of the string. Then we do another loop for j, which is the end index, starting at i and going to the end of the string. We append the substring from i to j into our collection of substrings.

00:02:07.680 You can see in the structure of this code that we have these two nested loops. Therefore, the inner loop runs O(N squared) many times. For those familiar with Big O notation, this should be straightforward. I say this algorithm is O(N squared), but what about what's inside the loop?

00:02:32.230 We kind of glossed over that for a second. If you look at the part where we shove into an array, it's amortized constant time. However, how long does it actually take to build this substring—taking the string from i to j? For simplicity, we're going to assume fixed-width strings, even though it's true for non-fixed widths as well.

00:02:57.220 The thing is that Ruby treats strings less than 24 characters differently. But for large n, we can ignore that. Let's look at how strings are represented in Ruby. Ruby provides a lot of abstraction over what's really happening at the machine level, but in reality, a string is just an array of characters. If you imagine an ASCII string, it's essentially an array of ASCII-encoded characters stored next to each other in memory.

00:03:22.210 To take a substring and copy it somewhere else, we have to go into a whole new place in memory and copy each character one by one. This copying process must take linear time in relation to the number of characters being copied. So, thinking about each substring, we have to consider its average length.

00:03:46.360 What's the average substring length? It's actually not obvious just by looking at the array of substrings. On one hand, the substrings could be dominated by small ones, and this makes it hard to determine their long-term average size. Assuming the average grows logarithmically is a common misconception, but it's far from obvious.

00:04:09.639 Let's do the math and actually compute it using Ruby. I'm going to take the substring code I wrote earlier and create a function called 'average_substring_ratio'. This function will compute the lengths of all the substrings and return the ratio of the average substring length to the original string's length. Essentially, I want to see how that ratio changes.

00:04:34.290 Now, I'll generate strings made of the letter 'a' with a size defined by the number you give me. Then I'll call the substrings method we wrote earlier, collect the lengths of those substrings, and calculate the average substring length simply by adding them together and dividing by the count.

00:05:03.150 Finally, I'm going to iterate from 1 to 150 in steps of 5, measuring the average substring length ratios for these strings as I go. Here we go—let’s see what the results show.

00:05:23.709 As I run these numbers, I see that the average substring length converges on one-third. It might seem counterintuitive, but with some mathematical proof, which I'm not going to display due to time constraints, you can see this is indeed true. The average substring length grows linearly with the size of the original string. This leads to the conclusion that this copy process occurring inside the loop is O(N), and if that copying happens for each substring, the entire algorithm would result in O(N cubed). Therefore, my colleague's assertion was correct.

00:05:50.530 Basically, I was wrong, but I also learned something. What I knew was about Copy-on-Write, which has a name that reflects its functionality. Copy-on-Write allows the program to share existing data until a modification is made.

00:06:07.370 So, the question now is: what exactly is Copy-on-Write? It is a form of structural sharing that can be seen in functional programming. Within the context of strings, Copy-on-Write means that instead of copying everything into a new memory location, we make a shallow object that points to the start index in the original string and retains a reference to its length.

00:06:28.140 For instance, if we have the string 'hello' and we want to create the substring 'll', Copy-on-Write will result in a shallow object that just points at the first index. This means that until one of these strings gets altered, there’s no real copying taking place.

00:06:52.800 Let me show you an example using a library called 'display_string', which allows us to inspect the structure of strings in Ruby. I will create the alphabet string and then duplicate it. The string representation will include memory addresses for the object and pointers that indicate where the string lives in memory.

00:07:17.450 When I generate two strings using a duplicate operation in Ruby, they are unique objects with different object IDs, but they share the same underlying data array. If either string is modified, Ruby's system knows to make a separate copy to avoid conflicts and maintain the integrity of the data.

00:07:39.700 This means that when we go ahead and mutate one of the strings, Ruby’s system detects this, creates a new instance, and allows the original string to remain unchanged. The pointer to the original array stays the same for the copied string until it is modified.

00:08:01.700 This shallow copy methodology allowing for Copy-on-Write leads to an efficient way to handle memory, meaning we can perform the substring algorithm in O(N squared) time rather than O(N cubed). Case closed! A couple of days later, my colleague returned with a benchmark test.

00:08:26.860 We set up a benchmark where one large string is 1024 characters and another string is twice as long, 2048 characters. We use the benchmark module to evaluate the time it takes to generate the substrings of these strings.

00:08:55.620 What I noticed was that when we doubled the size of the input, the time taken increased by a factor of eight rather than four, suggesting that the algorithm deviates from the expected quadratic behavior.

00:09:18.370 To investigate further, we benchmarked the performance of calling the substring method multiple times with both of the strings to compare their behavior. Surprisingly, I found that both strings took almost the same amount of time despite their size difference.

00:09:30.540 This occurrence prompts the question: Why didn’t this operation grow as expected with input size? After digging through the Ruby implementation, I discovered that the substrings, especially those without the last character, do not use Copy-on-Write, leading to inconsistencies in performance.

00:09:51.270 When you invoke a substring that does not include the last character, Ruby’s memory management insists on a full copy of data, resulting in linear performance degradation. Meanwhile, substrings that include the very last character properly take advantage of Copy-on-Write, hence the performance discrepancy.

00:10:13.500 My investigation led me deep into Ruby's C codebase, where I found something interesting: all strings in C end with a null terminator. Ruby, in turn, has to ensure this mechanism operates correctly across its various string manipulations. My goal was to ensure that when Ruby creates a substring, it provides a valid termination.

00:10:36.300 I attempted to compile Ruby with a modification that would set these substrings to always use Copy-on-Write consistently and ensure that any substring produced reflects this behavior. After recompiling Ruby and rerunning the benchmarks, I finally saw good results.

00:10:59.400 Now, the performance outcomes showed that all substrings appeared to efficiently utilize Copy-on-Write, confirming the O(N squared) performance I anticipated for substring generation.

00:11:23.940 So, to wrap things up, we learned that the copy-on-write optimization in MRI Ruby serves a dual-purpose not only for performance but also ensures the stability of data representations in memory.

00:11:51.800 But the takeaway applies beyond just this specific example. Copy-on-Write is a critical concept that shows how reducing unnecessary copies can significantly affect the overall performance. Just like everything in programming, there's always more to the story. That's it from my talk. Thank you for listening.

00:12:14.030 If you wish to engage further, feel free to follow me on Twitter at @haseeb, and you can find related code on GitHub. A special thanks to Ned, David, and Pat for their help with this talk, and I apologize for the technical issues we encountered. Now, if there are any questions, I'm ready to take them!

00:12:57.050 So, to clarify, if I have string 1 equal to some string and string 2 equal to string 1, will they both point to the same object in memory? If I decide to duplicate string 1 and create string 3, will those strings maintain their own copies or share the same? My guess is they will copy their own versions, as it's a rare situation to optimize for.

00:13:24.930 I appreciate your engagement, and I look forward to answering further questions. Thank you once again for listening!