Rethinking Strings

RubyKaigi 2023

00:00:02.959 Let's start. Alright, hi everyone! Thank you for coming out. I know it's the last day of the conference, and everyone might be a little tired, but I appreciate you being here. Thank you to the organizers for putting this on. This is the first RubyKaigi I’ve been able to attend in four years, and it's great to be back. For those who don't know me, I am a staff engineer at Shopify. I've been there for almost two years, but I've been working with Ruby for the last 15 years. A recurring theme in my work has been performance.

00:00:21.539 I've worked with Ruby Enterprise Edition, JRuby, TruffleRuby, and so on. I think what's great about a VM like Ruby is that if we can improve performance in the runtime, applications get faster for free. This means individual application developers don't need to spend time working around deficiencies in the VM. Their end users get a faster experience, and ideally, there's even an environmental impact, as we can do more with fewer resources. In 2014, I turned this into a full-time position when I began working on TruffleRuby, which I initially started during my time at Oracle Labs. I’m continuing that work now at Shopify.

00:01:14.520 TruffleRuby was founded in February 2013. Last year, we tragically lost our founder, Chris Seaton, which put a bit of a damper on the celebrations, but we have recently hit our 10-year anniversary. Many people have contributed to the project, and I believe we've been beneficial for the Ruby community. We aim to provide tooling that makes it easier for others to optimize Ruby in different ways. We're significant contributors to the Ruby Spec Suite, which is incredibly helpful for people trying to build new Ruby implementations. TruffleRuby has also served as a good test bed for trying different ideas, such as object shapes that have made their way into CRuby.

00:02:09.380 I've spent a lot of time examining string performance. Strings are often perceived as merely a display characteristic, so many people think they don’t need to be fast. However, we use them extensively, especially in web applications—for generating web responses, processing emails, and even handling log output. Behind the scenes, there's a myriad of concatenation operations and string interpolations occurring. If we can speed those up, we can generate responses more quickly.

00:02:47.640 It’s also crucial for APIs. We know that binary-based APIs can generally be more efficient in terms of memory and performance, but they're more challenging for developers to work with. REST and GraphQL largely use JSON as the interchange format, and we aim to make those interactions fast as well. Configuration data is another area; traditionally, we haven't placed a heavy emphasis on this because it's usually handled at startup. However, with the rise of functions as a service, we want to start that process as quickly as possible.

00:03:06.180 In Ruby, we have another aspect to consider: metaprogramming. Many of the metaprogramming calls take strings as arguments; they may also take symbols. Using a symbol is generally more efficient, but if you’re working with data generated dynamically at runtime, it will likely be a string. One area we worked hard on in TruffleRuby was eliminating the overhead of metaprogramming. Eliminating it was previously thought to be largely infeasible; if you wanted rich, expressive functionality from the VM, you would pay a performance penalty for it. However, with the JIT compiler we managed to eliminate that overhead, and doing so required a strong emphasis on string performance.
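As a small illustration of strings showing up in metaprogramming, the sketch below assembles a method name at runtime and dispatches on it. The Greeter class and the lang value are hypothetical names invented for this example.

```ruby
# Hypothetical class used only for illustration.
class Greeter
  def greet_en; "hello"; end
  def greet_ja; "konnichiwa"; end
end

greeter = Greeter.new
lang = "ja" # imagine this value arrives at runtime, e.g. from a request

# The method name is assembled dynamically, so it is a String, not a Symbol.
method_name = "greet_#{lang}"

# Metaprogramming calls accept either form.
via_string = greeter.public_send(method_name)
via_symbol = greeter.public_send(:greet_ja)

puts via_string
puts greeter.respond_to?(method_name)
```

Every such call has to resolve the string to a method, which is one reason string performance matters for metaprogramming-heavy code.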

00:03:35.640 So, the title of this talk is 'Rethinking Strings.' I’m going to walk through some of the string history, what the API looks like today, how it sets up new Ruby users for failure in various ways, and how it can impact performance. My goal is to stimulate a conversation about how we might advance strings for Ruby going forward. I've been involved in three different string implementations in TruffleRuby, and I've also investigated string implementations in CRuby, JRuby, and Rubinius extensively. Much of this talk will be informed by my background in these different systems.

00:04:34.399 The first approach to strings in TruffleRuby was a straightforward port of what CRuby does to Java. TruffleRuby is predominantly written in Ruby, so we imported some code from Rubinius and translated parts of their C++ to Java. However, for the most part, we translated CRuby's C code into Java, which gave us a high level of compatibility. Along the way, we discovered that it didn’t play well with the JIT compiler for optimizing idiomatic code. Hence, we rewrote the string implementation using a data structure called ropes, which I presented at RubyKaigi back in 2016.

00:05:08.580 While some of the finer details have changed since that presentation, the core concept remains the same. The key difference is that with the previous byte list-based approach, we were dealing with contiguous regions of memory. If you wanted to concatenate strings, you would allocate a new buffer and perform memory copies from the two strings. This process, particularly when doing many concatenations in a row, results in frequent allocations. We can, of course, implement various tricks behind the scenes to over-allocate some spare buffer space to mitigate this, but the core operation remains linear time.

00:05:51.779 With ropes, we still have contiguous regions of memory at the leaves, but we build up a tree of lazy operations. This has served us well in optimizing for various use cases, like ERB, which historically performed numerous concatenations. While it has since evolved to do much less, we observed significant performance benefits from our approach. Another notable aspect of ropes is that they are immutable, persistent data structures. This seems paradoxical since Ruby strings are generally mutable. However, by utilizing an immutable representation, we can cache and pre-calculate certain pieces of information.
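The idea can be sketched with a toy rope. This is not TruffleRuby's implementation; Leaf and Concat are hypothetical names, and the sketch assumes UTF-8 text throughout. Concatenation only builds a tree node and derives its metadata from the children in constant time; bytes are copied only when a flat string is requested.

```ruby
# Toy rope, for illustration only. Leaf and Concat are hypothetical names.
class Leaf
  attr_reader :bytesize

  def initialize(string)
    @string = string.dup.freeze        # leaves hold contiguous, immutable bytes
    @bytesize = @string.bytesize
    @ascii_only = @string.ascii_only?  # one linear scan, done once and cached
  end

  def ascii_only?
    @ascii_only
  end

  def write_to(buffer)
    buffer << @string
  end
end

class Concat
  attr_reader :bytesize

  def initialize(left, right)
    @left = left
    @right = right
    # Metadata is derived from the children in constant time; no bytes move.
    @bytesize = left.bytesize + right.bytesize
    @ascii_only = left.ascii_only? && right.ascii_only?
    freeze
  end

  def ascii_only?
    @ascii_only
  end

  def write_to(buffer)
    @left.write_to(buffer)
    @right.write_to(buffer)
    buffer
  end

  # Bytes are copied only when a flat string is actually needed.
  def flatten
    write_to(String.new(capacity: bytesize, encoding: Encoding::UTF_8))
  end
end

rope = Concat.new(Concat.new(Leaf.new("Hello, "), Leaf.new("w\u00F6rld")), Leaf.new("!"))
puts rope.bytesize
puts rope.ascii_only?
puts rope.flatten
```

Because nodes are frozen, the cached metadata can never go stale, which is the property that lets an immutable representation sit underneath a mutable string API.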

00:06:34.680 Every string in Ruby has an associated value called the code range. It influences how the string is represented and manipulated, and computing it traditionally requires a linear-time scan of the string. Moreover, any modification to the string could potentially change its code range, necessitating another linear-time scan. Likewise, determining the character length of a string can require a linear scan. With ropes, we implemented the ability to pre-calculate these values, allowing for two significant improvements. First, deriving a new value from existing ones could often be done in constant time, and second, these constant values worked well with the JIT compiler.

00:07:19.740 Despite these advantages, strings in Ruby are still fundamentally bound to a byte-oriented API. A pathological case arises when you modify every single byte in the string; with a rope, we would need to allocate a new node for each modification. This approach has notable drawbacks in terms of memory efficiency. As a result, last year, we re-implemented strings in TruffleRuby again, this time adopting a more blended approach. We now support both a rope-based representation and a mutable byte array.

00:08:01.260 We're still in the process of fully capitalizing on this new strategy, but this allows us, on a per-method call and ideally on a per-call-site basis, to determine which representation is most advantageous. Along the way, we’ve also developed additional storage strategies that facilitate representing strings more compactly, making applications more efficient, particularly for polyglot applications running on the GraalVM platform.

00:08:43.560 Next, I want to discuss the string API. It's important to remember that Ruby is a 30-year-old language, and none of this evolution occurred in a vacuum; much of it has been quite organic. As someone who's worked with Ruby for 15 years, starting with Ruby 1.8, I wasn’t initially involved with the core team or the bug tracker. Hence, I don’t possess the context for many decisions or changes that were made. I often refer back to a copy of the Ruby Programming Language book that I’ve had since 2008, which has remained my primary Ruby reference. This book was written when Ruby 1.8 was the predominant version and Ruby 1.9 was in development, covering both fairly well.

00:09:44.160 In it, there's a passage differentiating strings in Ruby 1.8 and 1.9. A key takeaway is that in Ruby 1.8, strings were sequences of bytes without defined encoding. If there was any encoding, it would be assumed to be ASCII or binary. To use something like UTF-8, one had to rely on a third-party library, which often came with performance and compatibility challenges. However, in Ruby 1.9, a substantial shift occurred: strings transitioned from being a sequence of bytes to a sequence of characters, which, while somewhat abstract, can be thought of as strings of length one. This shift also enabled encoding awareness.

00:10:38.520 While this was a significant development, it brought compatibility concerns—especially regarding how Ruby 1.8 applications would transition to the new 1.9 string ecosystem. As such, the string API today is extensive and can largely be categorized into two groups: a text-oriented API, which facilitates operations like sub-stringing, up-casing, and capitalizing; and a byte-oriented API, which retains much of what was present in Ruby 1.8, allowing for direct byte manipulation and slicing.
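A minimal sketch of the two API groups, applied to the same string. The string is written with a \u escape so that it is UTF-8 regardless of how the file is loaded.

```ruby
s = "h\u00E9llo" # "hello" with an e-acute; \u escapes always produce UTF-8

# Text-oriented API: works in characters.
char_count  = s.length          # 5
second_char = s[1]              # the e-acute, as a one-character string
upcased     = s.upcase

# Byte-oriented API: works in raw bytes.
byte_count  = s.bytesize        # 6, since the e-acute takes two bytes in UTF-8
second_byte = s.getbyte(1)      # 0xC3, the first byte of the e-acute
half_char   = s.byteslice(1, 1) # cuts the character in half

puts char_count, byte_count
puts half_char.valid_encoding?
```

The last line shows the hazard of mixing the two groups: a byte-oriented slice can split a character and leave a string that is no longer valid text.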

00:11:24.300 The string API's surface area can feel overwhelming to new Rubyists who might be uncertain about which one to utilize: the text-oriented one, which is more contemporary, or the byte-oriented one, which is often seen as more performant.

00:12:05.940 Today’s Ruby strings are interesting and complex beneath the surface. Nonetheless, Ruby has effectively concealed that intricacy, providing a flexible representation. The API accommodates both immutable and mutable operations, allowing developers to program in either a functional or imperative style based on preference. You can freeze strings for additional optimizations from the VM or mutate them in place. In contrast, many languages, like C or JavaScript, only support one of these methods.
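The two styles can be shown side by side. This is only a sketch of the standard String methods; the variable names are arbitrary.

```ruby
frozen  = "ruby".freeze
shouted = frozen.upcase       # functional style: returns a new string

buffer = String.new("ruby")   # an unfrozen, mutable copy
buffer.upcase!                # imperative style: modifies in place
buffer << "!"

mutation_rejected =
  begin
    frozen.upcase!
    false
  rescue FrozenError          # raised when mutating a frozen string
    true
  end

puts shouted, buffer, mutation_rejected
```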

00:12:50.220 Ruby strings are also encoding-aware. Unlike other VMs or languages, Ruby does not enforce an internal encoding. For instance, in JavaScript, every string is UTF-16, while Rust mandates UTF-8 for all strings. These languages still let you work with text in other character sets, but developers have to manage the transcoding themselves. In Ruby, however, you can choose the default internal encoding when starting the VM or override it at runtime, and even alter it on a per-string basis. As a result, Ruby can have strings with different encodings coexisting within the system.
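A short sketch of strings with different encodings living in the same process. The sample text and the choice of Shift_JIS and UTF-16LE are arbitrary.

```ruby
utf8  = "\u65E5\u672C\u8A9E"     # three kanji, written with \u escapes (UTF-8)
sjis  = utf8.encode("Shift_JIS") # transcoded copy
utf16 = utf8.encode("UTF-16LE")  # another transcoded copy

[utf8, sjis, utf16].each do |s|
  puts "#{s.encoding}: #{s.length} chars, #{s.bytesize} bytes"
end

round_trip = sjis.encode("UTF-8")
puts round_trip == utf8
```

All three strings report three characters, but the byte counts differ because each encoding lays the same characters out differently.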

00:13:56.460 Generally, encodings aren’t compatible with one another, meaning you can’t directly combine strings with different encodings. Ruby checks whether an operation is safe and, when feasible, handles the encoding for you. Now, to illustrate how encodings function in Ruby: they're essentially methods for converting bytes into characters. Every encoding comes with a character map; think of it as a table where you look up an index known as a code point. Each code point is designated to be represented in a certain way.

00:14:44.520 Sometimes, a character requires more than one code point, necessitating a secondary encoding function that will take a sequence of code points and convert them back to a character. By composing these functions, it's possible to directly map bytes to characters. For instance, we can pack two bytes into a string, and using different encodings, we could yield various characters; in UTF-8, we might get the copyright symbol, while in Shift-JIS, we might get something else—all while the underlying bytes remain unchanged.
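The two-byte example can be reproduced as follows. The specific bytes 0xC2 0xA9 are an assumption chosen to match the copyright-symbol case. The second half sketches the compatibility check described earlier, where mixed encodings combine only when Ruby can tell it is safe.

```ruby
bytes = [0xC2, 0xA9].pack("C*") # two raw bytes; pack("C*") yields a binary string

as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
as_sjis = bytes.dup.force_encoding(Encoding::Shift_JIS)

puts as_utf8.length                 # 1: the copyright sign, U+00A9
puts as_sjis.length                 # 2: two half-width katakana
puts as_utf8.bytes == as_sjis.bytes # true: only the interpretation differs

# Strings with different encodings combine only when Ruby can prove it is safe.
ascii_sjis = "abc".encode(Encoding::Shift_JIS)
combined = ascii_sjis + as_utf8     # fine: the Shift_JIS side is pure ASCII

incompatible =
  begin
    as_sjis + as_utf8               # both sides contain non-ASCII bytes
    false
  rescue Encoding::CompatibilityError
    true
  end

puts combined.encoding
puts incompatible
```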

00:15:21.840 Ruby comes equipped with over 100 encodings, which represents a substantial divergence from other languages that mandate a single encoding. This wide array complicates the VM, as it needs to manage diverse encodings effectively, many of which have aliases; there are 175 different aliases available for naming encodings. Nonetheless, Ruby endeavors to hide this complexity by providing default encodings. Consequently, most applications typically deal with three encodings: the default encoding for strings, which is often UTF-8; US-ASCII, to which the VM will downgrade a UTF-8 string in certain situations when it recognizes that the contents are plain ASCII; and lastly, the ASCII-8BIT encoding, used primarily for binary I/O.
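These numbers and defaults can be inspected from Ruby itself. The counts printed depend on the Ruby version and build, so the sketch prints them rather than assuming a value.

```ruby
puts Encoding.list.size    # number of encodings built into this Ruby
puts Encoding.aliases.size # number of alias names

# The three encodings most programs actually meet:
text   = "caf\u00E9"               # \u escapes produce UTF-8 text
number = 42.to_s                   # Ruby creates this string itself, as US-ASCII
binary = [0xFF, 0x00].pack("C*")   # pack("C*") produces a binary string

puts text.encoding
puts number.encoding
puts binary.encoding               # the exact name printed may vary by Ruby version

# BINARY and ASCII-8BIT are two names for the same encoding object.
puts Encoding::BINARY.equal?(Encoding::ASCII_8BIT)
```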

00:16:26.340 It's important to note that ASCII-8BIT is somewhat peculiar compared to Ruby’s other encodings. A misunderstanding arises, as many believe ASCII-8BIT is an 8-bit version of ASCII, yet true ASCII strictly fits within 7 bits. Historically, there were attempts at implementing an 8-bit character set extending ASCII, but none have matched the specifications to be termed ASCII-8BIT. This encoding is effectively for binary data and does not map to characters, and while it indicates ASCII compatibility, it is crucial to keep in mind that it doesn't represent characters in any way.

00:17:45.840 Now, addressing code ranges: Working with an encoding like UTF-8 can be complex, as each character can occupy between 1 and 4 bytes. If you use a string index operator, Ruby must ascertain the individual character boundaries. If you're dealing only with ASCII data, then each character is one byte wide, which allows Ruby to optimize that operation. To know when it can, Ruby scans the string from start to finish and records what it finds as the code range. Every string's code range starts out as unknown, and Ruby computes it lazily. Once computed, the first possible value is the 7-bit one (CR_7BIT), which indicates that the string consists purely of ASCII data.

00:18:34.320 The next level is valid, meaning that while it’s not strictly seven-bit, the bytes hold meaning with the corresponding encoding. Lastly, there’s the broken state, where the bytes and the encoding do not align, so the bytes may not yield the characters you expect. Although the code range value isn’t explicitly exposed in Ruby, you can still infer it using two methods on the string: ascii_only? tells you whether it’s ASCII-only, and valid_encoding? validates its encoding.
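A minimal sketch of that inference, using only the public ascii_only? and valid_encoding? methods. The helper name inferred_code_range and the symbols it returns are hypothetical; the real code range is an internal VM value.

```ruby
# Hypothetical helper: approximates the internal code range from public methods.
def inferred_code_range(string)
  if string.ascii_only?
    :seven_bit # every byte is below 0x80
  elsif string.valid_encoding?
    :valid     # not pure ASCII, but well-formed for its encoding
  else
    :broken    # bytes and encoding disagree
  end
end

ascii  = "hello"
valid  = "h\u00E9llo"
broken = "h\xFFllo".dup.force_encoding(Encoding::UTF_8)

puts inferred_code_range(ascii)
puts inferred_code_range(valid)
puts inferred_code_range(broken)
```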

00:19:20.520 Moving on to binary data: Ruby, as a general-purpose programming language, needs to handle binary data effectively—we work with images, create thumbnails, compress strings, and read tarballs from disk. Yet, Ruby doesn't precisely possess a dedicated data type for binary data. One method to conceptualize it is that bytes are just integers within a specific range, thus we could theoretically manage arrays of integers representing those bytes.

00:20:26.160 While this could work, it is remarkably inefficient. The reason is that Ruby does not impose guarantees regarding the byte size of an integer, leading to unintended increases in memory consumption. For example, a megabyte of bytes may actually end up occupying eight megabytes of memory. The API allows for this approach, but it results in poor cache performance and substantial memory bloat. As an alternative, one could revert to using that sequence of bytes similar to Ruby 1.8 strings. Indeed, if you allocate a megabyte of ASCII characters, it roughly occupies a megabyte of memory, considering some overhead for string metadata—but this approach leaks implementation details.
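The size difference can be observed with ObjectSpace.memsize_of from the objspace standard library. The figures are implementation-specific; the roughly eight-to-one ratio is what one would expect on 64-bit CRuby, and other Ruby implementations may report something different.

```ruby
require "objspace"

count = 1_000_000

as_integers = Array.new(count, 0) # one pointer-sized slot per byte value
as_string   = "a" * count         # one byte per ASCII character

array_bytes  = ObjectSpace.memsize_of(as_integers)
string_bytes = ObjectSpace.memsize_of(as_string)

puts "array:  #{array_bytes} bytes"
puts "string: #{string_bytes} bytes"
```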

00:21:26.760 Ruby ought not to guarantee a specific representation or memory layout for strings. This discrepancy can lead to confusion; for instance, when compressing a string using Zlib, you might obtain a string with an ASCII-8BIT encoding. While that string might include non-ASCII characters, it is still regarded as valid. For example, when splitting it into lines, Ruby may process it based on the presence of an ASCII new line byte even if no genuine line breaks exist. Actions like getting the successor of the string can yield nonsensical results because Ruby operates under the assumption that strings with an ASCII-8BIT encoding are still ASCII compatible.
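The Zlib case can be checked directly. The input text and the three-byte blob below are arbitrary values picked for the demonstration.

```ruby
require "zlib"

compressed = Zlib::Deflate.deflate("hello hello hello hello")

puts compressed.encoding        # the binary encoding
puts compressed.ascii_only?     # false: the output contains bytes above 0x7F
puts compressed.valid_encoding? # true: any byte sequence is "valid" binary

# Text operations still run, keyed off ASCII byte values.
blob = [0x00, 0x0A, 0xFF].pack("C*") # 0x0A happens to be the ASCII newline byte
pieces = blob.lines
puts pieces.length                   # 2: split at the 0x0A byte
```

Nothing in the blob is a line of text, yet the split succeeds, because the binary encoding still claims ASCII compatibility.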

00:22:57.760 Critically, binary strings and binary data are not logically equivalent, yet Ruby has been designed in such a way that developers treat them as if they are. This confusion can lead to unexpected outcomes. For example, if we allocate two strings—one using the empty string literal and the other with the String.new constructor—they may initially be considered equivalent. However, appending a byte value to each string can render them unequal. This is due to the fact that the String.new method allocates a binary string whereas the empty string literal adheres to your default encoding, which will likely be UTF-8.

00:24:16.560 Additionally, the shovel operator (<<) may not behave the way you expect: given an integer, it does not treat it as a byte value but as a code point. Consequently, if you append a hex value like 0x80, the UTF-8 string gains two bytes while the binary string gains just one. I often see developers trip on this peculiarity. If you allocate with an empty string literal and then proceed to append data, it may initially function, but subsequent operations can lead to catastrophic failures.
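The behavior described here can be reproduced in a few lines. To keep the sketch independent of how the file is loaded, the literal is explicitly encoded as UTF-8, which is what an empty string literal normally is anyway.

```ruby
# A "" literal takes the source encoding, normally UTF-8; encode makes that
# explicit here and returns an unfrozen copy.
from_literal = "".encode(Encoding::UTF_8)
from_new     = String.new # no arguments: binary encoding

equal_when_empty = (from_literal == from_new)

from_literal << 0x80 # appended as code point U+0080
from_new     << 0x80 # appended as the single byte 0x80

equal_after_append = (from_literal == from_new)

puts equal_when_empty
puts equal_after_append
puts from_literal.bytes.inspect
puts from_new.bytes.inspect
```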

00:25:26.520 Ruby has recently introduced an I/O buffer class, IO::Buffer, that represents a promising step forward, although it resides under the IO namespace, which may mislead users into thinking it's solely applicable to I/O. There are use cases in Ruby for working with binary data beyond I/O-related tasks, and this misperception can lead to confusion about the buffer's usability. Two significant issues I observe today that could be rectified are that the system still leans on binary strings, and that users are not directly exposed to I/O buffers. If you call file.read with a specified byte count, you receive a binary string rather than an I/O buffer.

00:26:44.460 Now, focusing on the broken strings specifically—broken strings refer to those where the bytes lack meaning concerning the corresponding encoding. This phenomenon is notably unique to Ruby. When working with many modern languages, invalid strings cannot be created, and performing operations like up-casing or capitalizing generates well-defined outcomes, always yielding a valid string.

00:27:09.860 Nevertheless, in Ruby, situations arise where one might encounter a broken string or even unintentionally generate one. For instance, if a user reads from a binary file, perhaps an image, using a read variant that doesn't specify a byte count, they could obtain a string that is labeled UTF-8 but is not ASCII-only and does not contain valid UTF-8. Attempting to up-case it in this scenario would raise an exception because the string is broken, which one can confirm by checking its validity.
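A sketch of both read variants against the same file. The four bytes stand in for the start of an image file, and the external encoding is set explicitly to UTF-8 so that the result does not depend on the machine's locale; both choices are assumptions made for the example.

```ruby
require "tempfile"

image_header = [0xFF, 0xD8, 0xFF, 0xE0].pack("C*") # bytes that are not valid UTF-8

whole, prefix = Tempfile.create("blob") do |file|
  file.binmode
  file.write(image_header)
  file.flush
  [
    File.open(file.path, "r:UTF-8") { |io| io.read },    # no byte count
    File.open(file.path, "r:UTF-8") { |io| io.read(4) }, # explicit byte count
  ]
end

puts whole.encoding        # UTF-8
puts whole.valid_encoding? # false: a broken string
puts prefix.encoding       # the binary encoding

upcase_failed =
  begin
    whole.upcase
    false
  rescue ArgumentError     # raised for case mapping on invalid input
    true
  end

puts upcase_failed
```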

00:28:30.060 Although one can force the encoding to binary, this change does not alter the byte content of the string. One could then up-case it without an error, even though up-casing makes no sense for binary data. Broken strings can also return nonsensical values. For example, changing the encoding of a multi-byte UTF-8 string to ASCII yields a misleading size, and the values that broken strings return can vary slightly depending on the position of the broken byte within the string. This behavior is an implementation detail that developers should not rely upon, and it complicates our ability to evolve the string API and its data representation.
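Both scenarios can be sketched with force_encoding, which changes only the encoding label and never the bytes. The sample strings are arbitrary.

```ruby
broken    = "abc\xFF".dup.force_encoding(Encoding::UTF_8)
as_binary = broken.dup.force_encoding(Encoding::BINARY)

puts as_binary.bytes == broken.bytes # true: only the label changed
puts as_binary.valid_encoding?       # true
upcased = as_binary.upcase           # no error; only ASCII letters change

# Relabeling multi-byte UTF-8 text as US-ASCII changes what length reports.
accented  = "\u00E9"                 # one character, two bytes in UTF-8
relabeled = accented.dup.force_encoding(Encoding::US_ASCII)

puts accented.length                 # 1
puts relabeled.length                # 2
puts relabeled.valid_encoding?       # false
```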

00:29:48.900 To wrap up, there are ways to address these issues, and I have some proposals for Ruby 3 or beyond. We prioritize backwards compatibility, so I'm suggesting a phased approach. For instance, we could introduce a command-line flag that allows users to opt into new behaviors. We could start with warnings, potentially progressing to exceptions, to identify problem areas in applications that could then be fixed. I'd also be supportive of deprecations, even if we never completely remove the methods, as these would signal to developers that they should avoid specific practices. Firstly, let’s invert the naming so that binary, rather than ASCII-8BIT, is the primary name of the encoding. When inspecting a binary string, let’s report its encoding as binary instead of ASCII-8BIT; this could help developers understand that they are working with binary strings.

00:31:01.920 I also propose to eliminate ASCII compatibility from the binary encoding altogether. This might disrupt applications, but I firmly believe there are few valid use cases for combining binary strings with text strings. More often, developers handle binary strings without realizing it and assume they are interchangeable with text strings, when typically only ASCII characters behave reliably. Furthermore, I would argue for the discontinuation of broken strings since they introduce numerous complications without valid use cases. Ideally, binary data handling would be centralized outside of the string API. We already have an I/O buffer; instead of returning binary strings, we should provide I/O buffers and require users to convert them to strings explicitly when they need text. We could also benefit from introducing a straightforward type designed solely for binary data handling, akin to JRuby’s internal byte list, that would adequately encapsulate that functionality.

00:32:29.520 The byte list type would maintain a smaller API, cutting out unnecessary string text capabilities that can often confuse users, thereby streamlining the string implementation solely for textual content and delivering heightened efficiency.

00:32:40.500 Lastly, I would champion making it significantly more difficult to create broken strings. A few of these recommendations might provoke controversy, and they presuppose that a binary type exists. Nevertheless, I propose deprecating setbyte.

00:32:48.059 This behavior primarily caters to those who wish to utilize strings as binary buffers, a situation we could address through less confusing alternatives.
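To see why setbyte makes broken strings easy to create, consider this minimal sketch. The byte 0xFF is chosen because it never appears in well-formed UTF-8.

```ruby
text = "abc".encode(Encoding::UTF_8) # unfrozen UTF-8 copy
valid_before = text.valid_encoding?

text.setbyte(1, 0xFF)                # overwrite one byte in place
valid_after = text.valid_encoding?

puts valid_before
puts valid_after
puts text.bytes.inspect
```

A single in-place byte write turns valid text into a broken string, with no error or warning at the point of the write.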