
Summarized using AI

The Three-Encoding Problem

Kevin Menard • February 28, 2023 • Providence, RI

In this talk, Kevin Menard addresses the complexities of Ruby's encoding system, emphasizing that Ruby supports over 100 different encodings and typically utilizes three encodings within an application without the developer realizing it. He begins by noting the significance of understanding Ruby's encoding, as mismanagement can lead to data loss or corruption.

Key Points Discussed:
- History of Encodings: Until Ruby 1.9, encodings did not exist in Ruby the way they do today; support has since grown to cover a wide range of characters beyond ASCII.
- String Representation: Ruby treats strings as arrays of bytes paired with encodings that define character mappings. This dual level is crucial, as different encodings can offer varying representations for the same character.
- Diverse Encoding Systems: The talk highlights ASCII as the initial encoding standard which, while foundational, is limited. UTF-8 emerges as a more flexible system, capable of representing numerous characters from global languages.
- Transcoding Challenges: Menard explains how encoding conversions, or transcoding, may raise exceptions such as undefined conversions or invalid byte sequences. Strategies for managing these errors, such as passing keyword arguments to String#encode, are discussed.
- Prevalence of Encoding Issues: The sociocultural aspects of encoding highlight the need for broad compatibility across encodings, especially in applications interacting with global user bases.
- Performance Optimizations: Changes in Ruby 3.2 invalidate cached code range values less often and speed up string concatenation, reducing the cost of encoding compatibility checks.

Menard concludes by underscoring the importance of being aware of encodings, encouraging developers to use Ruby’s rich encoding API thoughtfully to avoid common pitfalls around data correctness and performance. The talk closes with resources for further exploration of Ruby's encoding system and its ongoing development.

You’ve probably heard of UTF-8 and know about strings, but did you know that Ruby supports more than 100 other encodings? In fact, your application probably uses three encodings without you realizing it. Moreover, encodings apply to more than just strings. In this talk, we’ll take a look at Ruby’s fairly unique approach to encodings and better understand the impact they have on the correctness and performance of our applications. We’ll take a look at the rich encoding APIs Ruby provides and by the end of the talk, you won’t just reach for force_encoding when you see an encoding exception.

RubyConf 2022 Mini

00:00:12.740 I am a staff engineer at Shopify, and I am here today to talk to you about Ruby's encodings.
00:00:20.539 I work in a group at Shopify called the Ruby and Rails Infrastructure Team. We're basically tasked with ensuring the longevity of Ruby. One of the nifty things that we think about at Shopify is how we can ensure the company will be here 100 years from now. Since we're heavily invested in Ruby, this largely means ensuring Ruby remains a relevant technology in 100 years.
00:00:39.420 We do all sorts of things. We have people working directly on Ruby and Rails, fixing bugs, improving performance, and adding new features. We focus on the developer experience. If you caught Jenny's talk earlier about supply chain security, that's another thing our group does.
00:00:51.360 I work on TruffleRuby, which is an alternative implementation of the Ruby programming language. We emphasize very high performance, and as a new implementation, we aren't bound by some of the design decisions in CRuby. This allows us to optimize in ways that CRuby is unable to.
00:01:05.880 We try to maintain a very high degree of compatibility, to the point that we can run native extensions, and we are currently working on production deployments. So, let's dive into the three-encoding problem.
00:01:34.080 Encoding is one of those topics that I find many Rubyists don't fully grasp, and you can get quite far in Ruby without really understanding encodings. Up until Ruby 1.9, encodings really didn't exist in Ruby the way they do today. You can get quite far with it because it's a powerful abstraction, but if you don't understand encodings and run into an issue, you're bound to resolve it incorrectly, which can lead to data loss or corruption.
00:01:53.579 To get started, let's take a peek at Ruby strings with an example that is about as simple as it gets. We have a string literal here, and I'm using the magic comment syntax to set the string's encoding to US-ASCII.
00:02:05.520 Ruby provides many API calls to inspect strings and delve into their internals. Here, we can see that a string is made up of bytes. The string has three characters, namely A, B, and C, and here are the corresponding bytes. We can also obtain the associated encoding with the string. Essentially, a string in Ruby is just an array of bytes and an encoding to make sense of those bytes, resulting in an array of characters.
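The slide itself is not part of this transcript, so the following is only a minimal sketch of the kind of inspection described. It assumes the literal is "abc" and tags it as US-ASCII with String#encode rather than a magic comment, so the result does not depend on the source file's encoding.

```ruby
# Tag the string as US-ASCII explicitly (the talk uses a magic comment instead).
abc = "abc".encode(Encoding::US_ASCII)

abc.bytes     # => [97, 98, 99]
abc.chars     # => ["a", "b", "c"]
abc.encoding  # => #<Encoding:US-ASCII>
```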
00:02:31.200 At its simplest level, encoding is merely a mapping function that takes an array of bytes and produces an array of characters. Different encodings can have varying lists of characters, which may result in different byte representations. I want to clarify some terminology – throughout the talk, I'll be using somewhat loose terms because, if you dive into encoding, there are very precise definitions.
00:02:53.940 For our purposes, we'll consider encoding to be this mapping from byte representation to a character. A character, in this context, refers to anything that can be put in a string, including all major writing systems, punctuation, digits, and so on.
00:03:02.880 Then there are code points, which serve as indices to these characters. If an encoding maps bytes to characters, it follows that each encoding has a set of valid characters, with code points acting as indices. There are also code units, though we won't discuss them much; they refer to how a code point maps to bytes.
00:03:25.140 Ruby uses bytes as its code units, so if you look at some of the Unicode literature and find the term code unit, you can mentally substitute 'bytes' for that term. Revising our definition a little, we have two levels to an encoding: the mapping from bytes to a particular code point, and the mapping that takes a code point (that index in the list) and gives you a character.
00:03:49.380 Ruby allows us to check the code points for a string. For example, given the string ABC, we can derive the corresponding bytes, which must be in the same order. This gives us a list of bytes, which aligns directly with the three code points, and in this case, they share the same values. However, this is by design and not a requirement, as will be illustrated in later examples.
00:04:08.040 For each code point, we have a corresponding character. Unfortunately, understanding encodings requires looking at some history. I know it's not the most exciting topic, but bear with me. If you haven't heard of ASCII, it stands for American Standard Code for Information Interchange. This was a very popular encoding in American businesses and computer science.
00:04:25.920 ASCII comprises 128 characters, of which only 95 are printable. This is quite a restricted set, and everything fits into seven bits. When ASCII was created, fitting into seven bits was seen as an advantage over eight. The character set loosely corresponds to modern English. It can handle lowercase letters a through z, uppercase letters, digits, some punctuation, and a selection of symbols.
00:04:46.800 Additionally, there are control characters that instruct the computer to perform actions such as inserting a new line. ASCII has its own history, supporting some characters for old-style terminals. This restriction may have influenced English writing conventions; for example, prices often appear in cents, but since there's no cent symbol, they are expressed in dollars instead.
00:05:06.660 Similarly, words with accented characters, like ‘résumé,’ are often written without the accents (‘resume’) due to ASCII's limitations. ASCII was in use for a long time, and any language wishing for widespread usage should support ASCII. Ruby does so at two different levels.
00:05:29.580 You can examine an encoding and check if it's ASCII compatible, and in this case, we're using US-ASCII, which is trivially ASCII compatible. We will explore this further in a moment. The world is, however, much more diverse than we might think, and we can't even represent everything in English using ASCII.
00:05:46.920 As a result, we collectively needed to move beyond ASCII. Many different encodings have been proposed, but I want to focus on UTF-8. Let's consider the French word ‘très,’ which logically has four characters but takes up five bytes. The third character has a code point value greater than 127, which exceeds the maximum range of ASCII; this character maps to two bytes.
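As a small illustration of the ‘très’ example, the sketch below compares code points with bytes. The è is written with a \u escape so the literal is UTF-8 regardless of how the source file is saved; the variable name is chosen here for illustration.

```ruby
word = "tr\u00E8s"   # "très"; \u00E8 is è

word.length      # => 4  (characters)
word.bytesize    # => 5  (bytes)
word.codepoints  # => [116, 114, 232, 115]
word.bytes       # => [116, 114, 195, 168, 115]
```

The code point 232 (hex E8) lies outside ASCII's range of 0 to 127, and UTF-8 spreads it over the two bytes 195 and 168.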
00:06:12.480 If you're not familiar with Unicode, it is arguably the most popular encoding system in use today. There’s a large standards body overseeing Unicode, which releases versions – currently, it supports about 150,000 different characters, encompassing writing systems from various world languages. This includes control characters that allow you to flip the reading direction of text, among other whitespace characters.
00:06:30.660 One challenge with Unicode is the necessity of resolving cultural differences. Different cultures have historically utilized similar alphabets or encoding systems, but today there is disagreement on how certain aspects should be interpreted. The Unicode standards body seeks to address these issues, though it doesn't always succeed.
00:06:47.880 While Unicode is widely adopted, it’s not the only encoding system available. Adding complexity, Unicode defines transformation formats. Unicode has 150,000 code points, and it makes an effort to be ASCII-compatible, ensuring that each ASCII code point maps to the same value in Unicode, but we still need to represent these in bytes.
00:07:05.400 UTF-8 is likely the encoding you are most familiar with, and it is known as a variable-width encoding. Each character in UTF-8 can take up between one and four bytes. Conversely, UTF-32 is a fixed-width encoding where every character occupies four bytes. This allows it to represent everything in Unicode but lacks memory efficiency. UTF-16 sits somewhere in between, where 16 bits can represent around 65,000 characters, but if more characters are needed, two code units are required.
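To make the width differences concrete, here is a rough sketch that encodes a few sample characters in each transformation format and records how many bytes each takes. The sample characters and the little-endian variants (UTF-16LE, UTF-32LE) are choices made for this example, not something taken from the talk.

```ruby
# Sample characters picked for illustration: ASCII, Latin, a currency sign, an emoji.
samples = {
  "a"         => "a",
  "e-grave"   => "\u00E8",
  "euro sign" => "\u20AC",
  "emoji"     => "\u{1F600}"
}

formats = [Encoding::UTF_8, Encoding::UTF_16LE, Encoding::UTF_32LE]

# For each character, the byte size in UTF-8, UTF-16, and UTF-32.
widths = samples.transform_values do |char|
  formats.map { |encoding| char.encode(encoding).bytesize }
end

widths.each do |name, (utf8, utf16, utf32)|
  puts format("%-10s UTF-8: %d  UTF-16: %d  UTF-32: %d", name, utf8, utf16, utf32)
end
```

UTF-8 ranges from one to four bytes, UTF-32 is always four, and UTF-16 needs two code units (four bytes) only for the emoji, which lies outside the first 65,536 code points.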
00:07:30.639 In Ruby, you might not frequently use UTF-16 and UTF-32, but UTF-16 does come up in JavaScript, so that’s something you might encounter while doing web development. That’s a brief overview of encodings.
00:07:47.640 Ruby has a rich system for interfacing with encodings. Compared to many modern VM-based scripting languages, Ruby stands out because most languages implement a unified internal string representation. For instance, JavaScript internally represents all strings as UTF-16. This normalization simplifies interfacing with other encodings, but Ruby takes a more inclusive approach.
00:08:06.240 Ruby acknowledges that not everyone in the world uses UTF-8 or Unicode. Therefore, Ruby can efficiently handle various encodings without requiring conversion. However, this also means Ruby must manage multiple encodings throughout its operations.
00:08:27.720 In Ruby, there's an Encoding class that you can call the 'list' method on to retrieve a list of all the encodings. It ships with over 100 encodings out of the box. Returning to the ASCII compatibility point, here are three encodings that I happen to know are ASCII compatible. They represent different types of characters, but for each of the ASCII characters, they share the same code point values.
00:08:46.919 Importantly, they also utilize the same byte representation for each of those code points, meaning you can take an ASCII string and read it without any data loss in any of these ASCII-compatible encodings. Approximately 90% of the encodings in Ruby are ASCII-compatible, so there’s a good chance what you’re working with is ASCII-compatible.
00:09:04.620 The encodings you may run into that aren’t ASCII-compatible include UTF-16 and UTF-32. This distinction matters: because Unicode uses the same code point values as ASCII, the code points are equal across these encodings. However, looking at byte representations, UTF-16 uses two bytes per ASCII character, whereas UTF-32 uses four.
00:09:20.520 As a result, with multiple encodings, you might need to convert strings from one encoding to another, a process known as transcoding. We’ve actually been doing this throughout the talk, because each time we call String#encode, we are transcoding. Until now, we have mostly dealt with ASCII-compatible encodings and ASCII strings, meaning the bytes did not need to change.
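The ASCII-compatibility point can be checked directly. In this sketch the three ASCII-compatible encodings (US-ASCII, UTF-8, ISO-8859-1) are picked for illustration; the talk does not name the ones on the slide.

```ruby
ascii = "abc"

# How many encodings ship with this Ruby, and how many are ASCII-compatible.
total_encodings  = Encoding.list.size
compatible_count = Encoding.list.count(&:ascii_compatible?)
puts "#{compatible_count} of #{total_encodings} encodings are ASCII-compatible"

# In ASCII-compatible encodings, ASCII text has identical bytes.
byte_views = [Encoding::US_ASCII, Encoding::UTF_8, Encoding::ISO_8859_1].map do |encoding|
  ascii.encode(encoding).bytes
end
byte_views.uniq  # => [[97, 98, 99]]

# UTF-16 and UTF-32 keep the same code points but not the same bytes.
utf16 = ascii.encode(Encoding::UTF_16LE)
utf32 = ascii.encode(Encoding::UTF_32LE)

Encoding::UTF_16LE.ascii_compatible?  # => false
utf16.codepoints                      # => [97, 98, 99]
utf16.bytes                           # => [97, 0, 98, 0, 99, 0]
utf32.bytesize                        # => 12
```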
00:09:40.560 However, when transcoding to UTF-16, for instance, the byte representation converts to two bytes per character. This transcoding process can be error-prone; one common type of exception you'll encounter is the undefined conversion error. In a contrived example, I have a UTF-8 string with non-ASCII characters, and when converting it to ASCII, I encounter this error.
00:10:01.320 If you're not familiar with encodings, this message can be difficult to comprehend. It displays a funny 'U+' followed by four hex digits, along with the two encodings involved. You must understand that 'U+' represents a Unicode code point value, and you might need to convert from hexadecimal to decimal to look it up in a table.
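A sketch of the undefined conversion case, assuming the same ‘très’ string used earlier in the talk. Rather than parsing the message, the exception object can be asked for the offending character and the two encodings.

```ruby
conversion_error =
  begin
    "tr\u00E8s".encode(Encoding::US_ASCII)   # è has no US-ASCII equivalent
    nil
  rescue Encoding::UndefinedConversionError => error
    error
  end

conversion_error.error_char                 # => "è"
conversion_error.source_encoding_name       # => "UTF-8"
conversion_error.destination_encoding_name  # => "US-ASCII"

# The "U+00E8" in the message is this character's code point in hexadecimal.
conversion_error.error_char.ord             # => 232
conversion_error.error_char.ord.to_s(16)    # => "e8"
```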
00:10:15.340 If you're not familiar with encodings, you might capture the exception and try to move on. Another type of exception you may run into involves invalid byte sequences. This situation occurs when you've read data from a network but didn't fully receive the string, resulting in bytes at the end that don't fit the target encoding.
00:10:39.540 You'll again see a similar error message. This time there is no Unicode code point to show, so the message switches to showing the offending bytes in hexadecimal, which can also be overwhelming if you're not familiar with it. Ruby provides a few different ways to handle transcoding errors. On the String#encode method, you can supply a couple of keyword arguments to manage undefined characters or invalid byte sequences.
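The invalid byte sequence case can be reproduced by cutting a UTF-8 string in the middle of a multi-byte character, roughly what a partial network read would leave behind. This is a minimal sketch; the truncation point and the UTF-16LE target are assumptions made for the example.

```ruby
# "tr" plus only the first of the two bytes that make up è.
truncated = "tr\u00E8s".byteslice(0, 3)
truncated.valid_encoding?  # => false

sequence_error =
  begin
    truncated.encode(Encoding::UTF_16LE)
    nil
  rescue Encoding::InvalidByteSequenceError => error
    error
  end

# The exception exposes the raw bytes it could not interpret.
sequence_error.error_bytes.bytes     # => [195]
sequence_error.source_encoding_name  # => "UTF-8"
```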
00:10:55.680 For example, if you convert to an ASCII encoding and want to replace either character, it will show a question mark in the middle. If you’re dealing primarily with non-ASCII data and it’s Unicode, you might get a diamond with a question mark. These are ways of addressing the error, but you would be losing data in the process.
00:11:18.600 Thus, you need to be sure that this is what you intend, especially if the data you're inserting into a database is a user’s name, as you might end up with these strange characters when you read it back.
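The replacement options mentioned above are the undef:, invalid:, and replace: keyword arguments of String#encode. A short sketch, again using ‘très’ as the input, shows what gets lost:

```ruby
word = "tr\u00E8s"

# Characters with no equivalent in the target encoding become "?" ...
word.encode(Encoding::US_ASCII, undef: :replace)                # => "tr?s"
# ... or a replacement of your choosing.
word.encode(Encoding::US_ASCII, undef: :replace, replace: "e")  # => "tres"

# Invalid bytes are handled separately. When the target is a Unicode
# encoding, the default replacement is U+FFFD (the diamond with a question mark).
truncated = word.byteslice(0, 3)
repaired  = truncated.encode(Encoding::UTF_16LE, invalid: :replace)
                     .encode(Encoding::UTF_8)
repaired == "tr\uFFFD"  # => true
```

In every case the original è is gone from the result, which is the data loss the talk warns about.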
00:11:38.700 Ruby also offers a sledgehammer approach. Since a string is simply an array of bytes with an associated encoding, you can instruct Ruby, ‘Hey, I know better,’ to change the encoding. I've worked with teams that have taken this approach, and while it may eliminate the error, it likely leads to data corruption.
00:11:56.340 Such methods should be reserved for very specific use cases. Having Ruby handle this exists primarily for backward compatibility, but if you see 'force_encoding' in your codebase, it's worth investigating whether that was truly your intention.
00:12:13.920 This leads us to another concept: broken strings, which is another area where Ruby differs from its modern peers. In Ruby, strings are mutable, and you can modify bytes in them or override the encoding. There’s a possibility of having a byte array where the associated encoding becomes nonsensical.
00:12:32.220 For instance, if I take the string ‘très’ and tell Ruby that it's actually US ASCII without changing the bytes, when I perform string operations, you might receive unexpected results.
00:12:51.720 You can check each string to determine if it's broken or valid. Ruby has a method called 'valid_encoding?', and you’d expect that most strings will return true. However, operations on a broken string can return different results depending on where the invalid bytes sit, and that behavior should be treated as an implementation detail rather than something to test against.
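A sketch of the broken-string scenario described above: the bytes of ‘très’ are left untouched, but the string is relabeled as US-ASCII with force_encoding.

```ruby
word   = "tr\u00E8s"
forced = word.dup.force_encoding(Encoding::US_ASCII)

forced.bytes == word.bytes  # => true  (no bytes were changed)
word.valid_encoding?        # => true
forced.valid_encoding?      # => false (195 and 168 are not ASCII bytes)

word.length                 # => 4
forced.length               # => 5     (each stray byte is counted on its own)
```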
00:13:07.860 Now let's discuss binary strings. Much of this reasoning stems from Ruby's compatibility requirements. Since Ruby strings are essentially arrays of bytes and Ruby has no other mechanism for holding raw bytes, we have a dummy encoding called ASCII-8BIT, also known as binary, which indicates binary data.
00:13:31.200 We use this dummy encoding when reading from network sockets or files, where we read arbitrary bytes without definitive character boundaries. Unfortunately, this encoding also reports that it is ASCII compatible. This can lead to common errors when working with binary strings.
00:13:54.540 You can work with binary strings that consist solely of ASCII data like other strings without issues, but when encountering the first multi-byte character, the situation changes, and confusion ensues. Interestingly, binary strings cannot be broken, so the 'valid_encoding?' method always returns true for those.
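The following sketch illustrates those properties of binary strings. String#b returns a copy tagged as ASCII-8BIT; the comparison uses the Encoding::BINARY constant, which is an alias for the same encoding.

```ruby
word   = "tr\u00E8s"
binary = word.b   # same bytes, tagged as ASCII-8BIT (binary)

binary.encoding == Encoding::BINARY  # => true
Encoding::BINARY.ascii_compatible?   # => true
binary.valid_encoding?               # => true (binary strings are never broken)
binary.length                        # => 5    (every byte is its own "character")

# ASCII-only binary data mixes freely with other strings ...
("abc".b + word).encoding            # => #<Encoding:UTF-8>

# ... but the first non-ASCII byte makes the combination fail.
mixing_error =
  begin
    binary + word
    nil
  rescue Encoding::CompatibilityError => error
    error
  end
mixing_error.class                   # => Encoding::CompatibilityError
```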
00:14:16.380 Binary strings are beneficial for I/O operations. For illustration, let’s use StringIO. We won't hit the disk; instead, we’ll simulate reading a string containing a multi-byte character. Reading the entire string gives us a buffer encoded as UTF-8, which is the default encoding Ruby uses here and which can be overridden.
00:14:37.140 If we rewind to read from the beginning but specify the number of bytes, we retrieve an ASCII-8BIT string. This may seem contrived, but it illustrates how calling functions in specific ways can yield unexpected data. A valid use case is reading data in chunks from files or networks to minimize memory usage.
00:14:58.920 You pass fixed-byte chunks on to various parts of the system, and that part must be aware it works with binary data rather than string data. In this case, I structured it such that each of the chunks will receive one of the two bytes for that character. Consequently, both segments could be broken.
00:15:23.700 This also illustrates the rationale behind allowing broken strings in Ruby: you can concatenate them back together, resulting in a complete byte sequence that is valid UTF-8 data. As we proceed to wrap up, I want to share a fun trick to show your friends.
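The chunked-read scenario can be sketched with StringIO as follows. The one-byte chunk size and the single-character input are assumptions chosen so that each chunk receives exactly one of the two bytes of è.

```ruby
require "stringio"

io = StringIO.new("\u00E8")   # a single two-byte character, è

whole = io.read               # no length given: a character string
whole.encoding                # UTF-8 here, taken from the underlying string

io.rewind
first_chunk  = io.read(1)     # length given: raw bytes come back
second_chunk = io.read(1)
first_chunk.encoding == Encoding::BINARY  # => true

# Relabel each half as UTF-8. Both halves are now broken strings.
first_half  = first_chunk.dup.force_encoding(Encoding::UTF_8)
second_half = second_chunk.dup.force_encoding(Encoding::UTF_8)
first_half.valid_encoding?    # => false
second_half.valid_encoding?   # => false

# Concatenating the broken halves restores a valid character.
joined = first_half + second_half
joined.valid_encoding?        # => true
joined == "\u00E8"            # => true
```

This is one of the narrow cases where force_encoding is appropriate: the bytes are known to be UTF-8, and only the label on the chunk is wrong.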
00:15:38.520 Let's start by taking two strings. We can create one as a string literal and the other via the 'String.new' method. We’ll verify if they are equal, getting back the expected result. Now, we can use the shovel operator to append bytes and expand the string.
00:15:56.720 However, checking if they are equal now gives a different result, which is puzzling. On examining the byte data, one string has two bytes while the other has one. The discrepancy arises because an empty string literal and 'String.new' are not equivalent; one leads to a UTF-8 string, and the other yields a binary string.
00:16:16.260 This confusion often leads developers into errors when they allocate a buffer as an empty string without considering encoding. The shovel operator takes not byte values but code point values, and the code point 0x80 occupies two bytes in UTF-8.
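Here is a sketch of that trick. To keep the example independent of the source file's encoding, the UTF-8 buffer is created with String.new(encoding: ...) rather than an empty literal; in a UTF-8 source file an empty literal behaves the same way.

```ruby
utf8_buffer   = String.new(encoding: Encoding::UTF_8)  # like "" in a UTF-8 source file
binary_buffer = String.new                             # ASCII-8BIT by default

equal_when_empty = (utf8_buffer == binary_buffer)      # => true

# Appending an Integer appends a code point, not a byte.
utf8_buffer   << 0x80
binary_buffer << 0x80

utf8_buffer.bytes    # => [194, 128]
binary_buffer.bytes  # => [128]

equal_after_append = (utf8_buffer == binary_buffer)    # => false
```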
00:16:38.220 Because Ruby handles multiple encodings, those encodings have to interact with each other, and Ruby emphasizes the developer experience. It avoids transcoding everything to a single encoding, even though that would simplify these operations.
00:16:58.740 In reality, 90% of encodings are ASCII-compatible. As a result, operations involving ASCII strings are straightforward, regardless of other encodings. We can check this compatibility with the Encoding class’s compatible? method, which takes two objects.
00:17:18.600 The behavior of this method varies based on object type. When provided encodings, it identifies the one that contains the superset of all code points. For example, if we check if US ASCII and UTF-8 are compatible, we should find they are, which can initially confuse because it doesn't return a Boolean value but rather an encoding or nil.
00:17:37.080 Flipping the arguments yields the same encoding. If the encodings aren’t compatible, however, it returns nil. Yet, things change when you pass two strings.
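With encodings as arguments, the behavior described above looks like this. The incompatible pair (UTF-8 and UTF-16LE) is chosen for illustration.

```ruby
# Returns the encoding that can hold both, not a Boolean.
forward  = Encoding.compatible?(Encoding::US_ASCII, Encoding::UTF_8)
backward = Encoding.compatible?(Encoding::UTF_8, Encoding::US_ASCII)
forward   # => #<Encoding:UTF-8>
backward  # => #<Encoding:UTF-8>

# Incompatible encodings give nil.
mismatch = Encoding.compatible?(Encoding::UTF_8, Encoding::UTF_16LE)
mismatch  # => nil
```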
00:17:59.820 The documentation states that rules about strings depend on whether they’re concatenatable, introducing a complex set of rules. My best interpretation is that Ruby prioritizes successful operations, examining the contents of the strings along with their encodings to ascertain if concatenation can proceed.
00:18:21.060 However, the order of arguments matters: concatenating A plus B might yield a different encoding than B plus A.
00:18:41.340 In instances where two strings in different encodings are involved, they might remain compatible. This means you should be cautious about the object types you pass into the method, as merely supplying one encoding may not provide an accurate view.
00:19:02.520 If the strings are genuinely incompatible, you’ll get nil. However, exceptions occur for empty strings: if one of the two strings is empty, Ruby allows the concatenation where it otherwise wouldn't between ASCII and UTF-16.
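The string cases can be sketched as follows. The specific strings and the ISO-8859-1 and UTF-16LE encodings are chosen for this example; the talk's slides may have used different ones.

```ruby
utf8_ascii   = "abc".encode(Encoding::UTF_8)
latin1_ascii = "abc".encode(Encoding::ISO_8859_1)
utf8_word    = "tr\u00E8s"
latin1_word  = utf8_word.encode(Encoding::ISO_8859_1)

# Both strings are ASCII-only: compatible, but the first argument wins,
# so argument order changes the resulting encoding.
Encoding.compatible?(utf8_ascii, latin1_ascii)  # => #<Encoding:UTF-8>
Encoding.compatible?(latin1_ascii, utf8_ascii)  # => #<Encoding:ISO-8859-1>
(utf8_ascii + latin1_ascii).encoding            # => #<Encoding:UTF-8>
(latin1_ascii + utf8_ascii).encoding            # => #<Encoding:ISO-8859-1>

# Only one side has non-ASCII content: that side's encoding is used.
Encoding.compatible?(utf8_ascii, latin1_word)   # => #<Encoding:ISO-8859-1>

# Both sides have non-ASCII content in different encodings: incompatible.
Encoding.compatible?(utf8_word, latin1_word)    # => nil

# UTF-16 is not ASCII-compatible, so even ASCII content does not help ...
utf16_ascii = "abc".encode(Encoding::UTF_16LE)
Encoding.compatible?(utf8_ascii, utf16_ascii)   # => nil

# ... unless one of the strings is empty.
empty_utf16 = String.new(encoding: Encoding::UTF_16LE)
Encoding.compatible?(utf8_word, empty_utf16)    # => #<Encoding:UTF-8>
(utf8_word + empty_utf16) == utf8_word          # => true
```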
00:19:22.680 In summary, two encodings' compatibility often hinges on the code points they share and their byte representations. Once it comes to string operations, it further depends on the strings' contents. Thus far, we've primarily discussed encodings for strings.
00:19:43.320 However, there are encodings for other objects as well, especially symbols and regular expressions. I/O also has a related encoding, which features in the standard library. We lack the time to elaborate, but you can override default encodings when reading files or creating strings.
00:20:06.000 For example, Ruby defaults to UTF-8 for internally created strings and ASCII-8BIT, or binary, for external data. If you convert a UTF-8 string containing non-ASCII characters to a symbol, the symbol retains the UTF-8 encoding. However, if the string contains only ASCII data, Ruby gives the symbol the US-ASCII encoding, which remains attached to it.
00:20:27.360 When converting the symbol back to a string, it adopts the US ASCII encoding. Normally, this distinction doesn’t matter since ASCII and UTF-8 are compatible. However, if you change the default encoding, it could lead to complex scenarios.
00:20:48.300 This phenomenon arises in numerics as well. With digits being representable in ASCII, converting an integer or floating-point value to a string results in the US ASCII encoding. Returning to the title of this talk, consider how many encodings your application uses.
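A brief sketch of where US-ASCII strings come from without anyone asking for them. The sample values are arbitrary.

```ruby
utf8_ascii = "abc".encode(Encoding::UTF_8)
utf8_word  = "tr\u00E8s"

utf8_ascii.encoding              # => #<Encoding:UTF-8>
utf8_ascii.to_sym.encoding       # => #<Encoding:US-ASCII>
utf8_ascii.to_sym.to_s.encoding  # => #<Encoding:US-ASCII>  (the round trip changed it)

utf8_word.to_sym.encoding        # => #<Encoding:UTF-8>     (non-ASCII content keeps UTF-8)

42.to_s.encoding                 # => #<Encoding:US-ASCII>
3.14.to_s.encoding               # => #<Encoding:US-ASCII>
```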
00:21:12.780 I propose you likely have three encodings even if you’re unaware. A lot of people assume everything is UTF-8 because that's Ruby's default, but that’s only true for string literals and particular operations. When converting symbols or numbers to strings, they’ll come in as US ASCII. If you’re working with I/O at any level, which most applications that connect with the outside world do, you might also encounter binary data.
00:21:32.880 Why does this matter? Recalling that encoding negotiation process, when operating on strings of the same encoding, the derived encoding is trivially the encoding in use, making for a quick operation. However, if the encodings differ, the system must run through a long checklist determining compatibility.
00:21:54.840 That check can require examining every character in the string, which results in an unexpected linear scan simply because the encodings differ. Ruby tries to shield you from this reality; it supports over 100 encodings by default and seeks to maintain fairness, optimizing for properties of encodings rather than for specific encodings.
00:22:17.520 For instance, with US-ASCII, every code point can be represented in a single byte, making it a fixed-width encoding. Similarly, with UTF-32, every character takes four bytes. In both cases the index operator can jump straight to a character, since it only needs to compute an offset into an array of bytes.
00:22:40.500 Conversely, when using UTF-8 with multi-byte character data, determining character boundaries proves necessary. There are ways to optimize this, but if a string is broken, leveraging UTF-8’s method based on the first byte's specifics isn't an option.
00:23:04.440 Ruby maintains cached metadata for strings, termed a code range, which assures that as long as the string remains unchanged, the code range remains intact, enhancing query speed for whether the string consists solely of ASCII data and whether it is valid. When these properties hold, Ruby may optimize string operations akin to supposing a fixed-width encoding.
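The code range itself is internal and is not exposed through a Ruby method, but the two questions it answers are available as predicates. A minimal sketch, with arbitrary sample strings:

```ruby
ascii_text = "abc".encode(Encoding::UTF_8)
word       = "tr\u00E8s"
broken     = word.byteslice(0, 3)   # cut in the middle of è

# "Is this string ASCII-only?" and "is it valid?" are the cached properties.
ascii_text.ascii_only?      # => true
ascii_text.valid_encoding?  # => true

word.ascii_only?            # => false
word.valid_encoding?        # => true

broken.ascii_only?          # => false
broken.valid_encoding?      # => false
```

Per the talk, the first call on a string may need to scan it, while later calls on the unchanged string can be answered from the cached value.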
00:23:29.880 This optimization enhances performance for string concatenation substantially; a common operation that could shift from constant time for memory copying to linear time due to character scans.
00:23:52.560 Behaviorally, when dealing with encodings, it’s crucial to have alignment in encoding across the entire system. It’s ineffective to work with UTF-8 data if your database is not configured for it, leading to potential corruption.
00:24:15.840 When reading from files or processing user submissions, you must verify the encoding. Historically, ISO 8859-1 has been a popular web encoding that you might encounter in addition to database encodings. Databases also incorporate collation, which affects operations like string sorting.
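One way to verify and normalize incoming data is sketched below. The helper name normalize_to_utf8 and its signature are hypothetical, invented for this example; it assumes the caller knows (for instance from an HTTP header) which encoding the raw bytes are supposed to be in.

```ruby
# Hypothetical helper: label raw bytes with their declared encoding,
# check that the bytes are actually valid in it, then transcode to UTF-8.
def normalize_to_utf8(raw_bytes, declared_encoding)
  tagged = raw_bytes.dup.force_encoding(declared_encoding)
  unless tagged.valid_encoding?
    raise ArgumentError, "input is not valid #{tagged.encoding.name}"
  end
  tagged.encode(Encoding::UTF_8)
end

# "très" as it arrives from an ISO-8859-1 source: è is the single byte 0xE8.
latin1_bytes = "tr\xE8s".b
normalized   = normalize_to_utf8(latin1_bytes, Encoding::ISO_8859_1)

normalized == "tr\u00E8s"  # => true
normalized.bytes           # => [116, 114, 195, 168, 115]
```

Here force_encoding is used only to attach the declared label to raw bytes, and the validity check runs before any transcoding, so bad input is rejected instead of being silently stored.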
00:24:40.500 It's important to highlight that even systems implementing the same encoding can diverge, as seen in Postgres, which sorts UTF-8 strings differently from Ruby due to differing collation rules.
00:25:06.720 This discrepancy poses a challenging error. Particularly if you're an American company, thinking primarily in terms of ASCII, but then serving a global audience, you may encounter unexpected encoding when someone submits strings you didn't anticipate.
00:25:33.840 This is why you should strive to be conscious of encodings. My hope is that today’s talk equips you with better tools to solve encoding errors. I’m pleased to share that in Ruby 3.2, various changes have been implemented to enhance performance.
00:26:00.720 These efforts were driven by my colleague Jean Boussier, who goes by byroot on the Ruby commit list. One way we're improving performance is by invalidating code range values less frequently.
00:26:25.800 If we correctly determine the necessary condition, we can avoid costly scans of the strings, which also benefits copy-on-write operations since the code range value is computed lazily. Many operations have changed how they populate it, avoiding unnecessary work.
00:26:48.960 Improvements have also been made in string concatenation speed. Jean's work centers on optimizing Ruby’s encoding handling for the three most prevalent encodings, all of which are ASCII-compatible. With these enhancements, we streamline checks for encoding compatibility and subsequently speed up concatenation.
00:27:10.320 There remains significant potential for future performance optimizations in Ruby. If you look through the Ruby release notes now, you will have a clearer understanding of what those optimizations are structured around.
00:27:28.010 That covers everything I had to share today, so we’ve explored encodings and how they relate to Ruby. We discussed Ruby's various mechanisms for transcoding strings, and the historical context that makes the encoding system complex.
00:27:47.520 Now, hopefully, we have a richer appreciation for how this may affect performance and behavior. I’ve included a slide with resources that I will publish later. I wrote a blog post discussing code ranges in detail and linked to several PRs from Ruby 3.2 for your reference.
00:28:07.200 And that’s it. Thank you for your time!