Debug Hard: Ruby String Library Methods and Underlying C Implementations

By Vishal Chandnani

What if the Ruby String library ‘reverse’ method or its underlying C implementation had a bug? What if it produced unexpected results with certain types of inputs? e.g. strings with unicode characters. How would you catch and fix such a bug? How would you explain the unexpected results?

1. Relevance
The Ruby String library ‘reverse’ method is implemented in C. The debugging tools in this talk apply to Ruby programs, and help provide useful insights into the underlying C implementation.

2. Novelty and Originality
The ‘use unicode_normalize to address certain string reversal issues’ appears to be known to certain developers. The novel idea in this talk is the analysis of Ruby and C implementation to explain the problem and possible solutions.

3. Knowledge
I started my software development career at Lucent Technologies (originally Bell Labs Innovations, currently Nokia Bell Labs) and used C/C++ to develop CDMA wireless communication system software for 12 years. At The Boeing Company, I used Ruby/Rails to develop U.S. government intelligence community software for 7 years. I am fascinated that Ruby is implemented in C and am excited to share my recent debugging experiences with our community.

4. Coverage
This talk presents a step-by-step approach to debugging Ruby programs by diving into their underlying C implementation. It uses a string with unicode characters to demonstrate the problem and provides insights into the reversal process by understanding their byte-level representation.

5. Organization
This talk starts with a high-level view of the Ruby String library ‘reverse’ method implementation. It introduces the idea of using a Virtual Machine (VM) to build Ruby from source. We learn about the Unicode standard and the fundamental principles of encoding. We explore the ‘unicode_normalize’ implementation and how it addresses ‘reverse’ method problems. Along the way, we use commands/tools like grep, chars, codepoints, each_byte, printf and gdb to provide insight into Ruby library methods.

6. Bottom Line
This talk aims to improve confidence in understanding bugs and/or unexpected results in the current application language (e.g. Ruby) as well as the underlying (e.g. C) implementations. I hope to inspire the Ruby community to explore the internals of Ruby strings and provide recommendations for further exploration.

RubyConf TH 2019

00:00:06.900 Hello everyone! Sawasdee krub! Did I get that right? I've been practicing all morning.

00:00:19.650 Welcome to the first RubyConf in Thailand! This has been a long time coming. Let's dive in and debug hard together. First, let's write a test with the following assertion.

00:00:31.120 My name is Vishal Chandnani, and I wrote this test at Def Method in New York City. How's that for a test-driven introduction?

00:00:54.600 Honestly, I had this idea long before the talk itself during a long train ride into the city. Enough about me; let me talk about my family. Right in the center is my best friend of 14 years, and we have two little gems of our own.

00:01:19.659 Quick disclaimer: every talk needs a story. When I first started learning Ruby, I was fascinated that it is written almost entirely in C. After a few years of working in the language, I couldn't help but notice the implementation of methods via the 'click to toggle source' button in the documentation. I clicked on a few to study their implementations and gradually began comparing Ruby and C for simple operations like string reversals.

00:01:44.770 As I explored the topic further, I learned that certain strings, particularly those in Unicode, don't interact well with certain string methods like the reverse method. I found many articles and blogs online, accepting this as a problem and suggesting Unicode normalization as a solution.

00:02:20.290 Typically, when you're working in a language that builds on top of or uses another language underneath, you tend to suspect the underlying implementation. I put on my debugger hat expecting to find my first Ruby bug, but went down the rabbit hole of the C implementation of reverse. To my surprise, I couldn't find anything wrong with the way C implemented the solution.

00:02:56.200 I stepped back, took a closer look at the inputs being passed to the function, and realized that Ruby was incorrectly representing certain Unicode characters, at least in my mind. That whole experience changed the way I approached similar problems. This talk aims to provide a more logical approach to debugging across several languages.

00:03:29.400 Let's get started! This picture shows the moth found trapped in a relay of a Mark II computer at Harvard University way back in 1947. Meet Admiral Grace Hopper, who, by her own admission, didn’t coin the term debugging, but she used it so often that it became popular.

00:03:41.199 We're going to explore the Ruby string library, particularly its reverse method. My first programming language, C, was developed in the early '70s at Bell Labs. Coincidentally, this is also where I started my career. C was intended to be used for utilities running on UNIX, and by 1973 it was being used to rewrite the kernel of the UNIX operating system.

00:04:06.490 Imagine a language so powerful that not only does it write other languages but also operating systems! I needed an environment to build Ruby from source. I chose to use VirtualBox since it was free at the time and worked right out of the box.

00:05:00.680 For this exercise, I chose Ruby 2.5.1, since it was one of the available options. Both Ruby and C are my favorite programming languages, and I find it fascinating to see both languages come together in one file. The first tool we are going to use today is grep, a UNIX command that helps find patterns in files. The reverse function implemented in C is actually called rb_str_reverse. The first thing I did was use grep to search for that string throughout the entire directory where I had the Ruby source code. It turns out it lives in a file called string.c.

00:06:34.000 The first version of grep was written overnight by a gentleman named Ken Thompson, to help a colleague who was analyzing the text of the Federalist Papers at the time. Speaking of overnight ideas, around that same time in 1974, the classic film 'Murder on the Orient Express' was released. If you haven’t seen it, I won’t give it away.

00:07:01.270 Given my belief in test-driven development, I went looking for tests and found a few attempts to reverse certain strings. They tested simple examples, like 'beta' and a palindrome. However, I didn’t find anything complex, and nothing for Unicode. It seemed that the rb_str_reverse function uses pointers to reverse a string, copying one character at a time between the beginning and the end.
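The copy-by-character-length idea can be modeled in a few lines of Ruby. This is only a rough sketch and not the C source: the method name `reverse_by_char_length` is hypothetical, and it assumes the input is validly encoded. It walks the source from the front and writes each character's bytes, as a unit, toward the back of a new buffer.

```ruby
# Hypothetical model of a per-character reversal (not Ruby's actual C code).
def reverse_by_char_length(str)
  out = ("\0" * str.bytesize).b      # binary buffer with the same byte size
  write_pos = str.bytesize           # start writing from the end of the buffer
  str.each_char do |ch|
    clen = ch.bytesize               # length of this character in bytes
    write_pos -= clen
    out[write_pos, clen] = ch.b      # copy the character's bytes as one unit
  end
  out.force_encoding(str.encoding)   # reinterpret the bytes in the original encoding
end

puts reverse_by_char_length("beta")  # prints "ateb"
```

The bytes inside a multi-byte character keep their order; only the order of the characters changes, which is why the byte length of each character matters.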

00:07:45.679 In doing so, it also must calculate the length of each character it intends to copy. To build Ruby from source, the instructions look quite simple: configure, make, install. Even so, I wrestled the whole weekend to get a basic version of Ruby built from source in VirtualBox. I took notes and have the commands if anyone is interested.

00:08:53.370 I started with a simple string and called the reverse method to observe its results. This was an interesting case, especially with the name 'Raphaël', which includes the Latin lowercase e with diaeresis. The diaeresis affects the letter to which it is attached, indicating that it is pronounced separately even though it is preceded by a vowel. In this case, the name is written 'Raphaël'.

00:09:41.060 Other examples include 'Chloë' and 'reëntry.' Few people actually use a diaeresis in their names; one example is Raphaël Varane, a footballer who plays for the French national team and the Spanish club Real Madrid.

00:10:06.200 So when we reverse 'Raphaël,' what's wrong with this picture? The diaeresis appears on the wrong letter. Turns out, this is a known issue that many have accepted. Let's examine what happened.
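The misplaced diaeresis can be reproduced with a short snippet. This is a minimal sketch that assumes the name is spelled 'Raphaël' and stored in decomposed form, that is, a plain "e" followed by the combining diaeresis U+0308. The constant name is illustrative only.

```ruby
# "e" followed by U+0308 (combining diaeresis), written with an explicit
# escape so the decomposed form is unambiguous in the source file.
DECOMPOSED_NAME = "Raphae\u0308l"

reversed = DECOMPOSED_NAME.reverse
puts reversed
# After reversal the combining mark follows "l" instead of "e".
puts reversed == "l\u0308eahpaR"   # prints "true"
```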

00:10:34.830 The second tool we'll use today is a simple method called `chars`, applied to the string 'Raphaël.' It provides an array of all characters in that string. Note that I'm saying 'characters,' not 'bytes.' The lowercase e with diaeresis is one example.
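A minimal sketch of `chars` on such a string, again assuming the decomposed spelling "e" + U+0308 (the constant name is illustrative):

```ruby
# Split the assumed decomposed spelling into its characters.
NAME_CHARS = "Raphae\u0308l".chars

p NAME_CHARS
puts NAME_CHARS.length   # prints 8, although only 7 letters are visible
```

The extra element is the combining diaeresis, which `chars` reports as a character of its own.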

00:10:54.670 At this point, we need to learn a bit more about Unicode. Unicode is essentially a standard that allows us to encode, represent, and handle text and symbols from different parts of the world. This includes characters you cannot easily type on a keyboard. The lowercase e with diaeresis is one such example. Here's another: a fancy version of the Devanagari 'Om,' which is also part of the Unicode standard.

00:11:33.820 In fact, I have the Sanskrit version tattooed on my left arm. The concept of Unicode was proposed by a gentleman named Joe Becker from Xerox in 1988 as a unified encoding scheme. Around this time, one of my favorite movies was released: 'Die Hard', which inspired the title of this talk. The next tool is code points. Code points provide a numerical representation of the characters in a string, and we use these code points with a simple iterator to print out the value of each character in both decimal and hexadecimal forms.

00:12:28.360 We can see here that the lowercase e with diaeresis was split into two characters: the lowercase e followed by the diaeresis. In hexadecimal, the e appears as hex 65, which seems correct. The combining diaeresis has a code point of hex 308 (U+0308). This detail is my favorite because it brings us closer to the byte-level representation.
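The decimal and hexadecimal listing can be sketched as follows. The helper name `codepoint_table` is hypothetical; `String#codepoints` and `format` are standard Ruby.

```ruby
# Return one "decimal hex" row per code point in the string.
def codepoint_table(str)
  str.codepoints.map { |cp| format("%d 0x%X", cp, cp) }
end

# Decomposed "ë": a plain "e" followed by the combining diaeresis.
puts codepoint_table("e\u0308")
# prints:
# 101 0x65
# 776 0x308
```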

00:13:16.860 Previously, we were discussing characters, but the key takeaway is that a character is not necessarily a byte. Applying `each_byte` in the same way, we can iterate over each byte. The e remains hex 65, while the combining diaeresis displays a 2-byte representation of hex CC and hex 88. Looking closer, this is the UTF-8 encoding of U+0308; in UTF-16 the same code point would be hex 03 and hex 08.
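A short sketch of the byte-level view, with a hypothetical helper `hex_bytes`. It also re-encodes the combining diaeresis as UTF-16 (big-endian) so the two encodings can be compared side by side.

```ruby
# Return the bytes of a string as two-digit uppercase hex values.
def hex_bytes(str)
  str.each_byte.map { |b| format("%02X", b) }
end

p hex_bytes("e")                            # ["65"]
p hex_bytes("\u0308")                       # ["CC", "88"]  (UTF-8)
p hex_bytes("\u0308".encode("UTF-16BE"))    # ["03", "08"]  (UTF-16, big-endian)
```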

00:14:44.560 To delve into the implementation, let’s understand pointers. A pointer is a variable that points to another variable or memory location. Here’s a straightforward example: we have a character array of size 25 holding the string 'Hello, World!' The notation char *ptr declares a pointer, where the star indicates that it points to a variable of type char. We can loop through the string with that pointer to print each character.

00:15:34.410 The function `printf` is somewhat akin to `puts`, where 'printf' stands for formatted print. This function is useful for tracking where you are in your program. I'll illustrate with a simple example: I used `printf` to determine the length of a character in bytes and also to monitor where I was during execution.
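Ruby has its own `printf` and `format` with the same style of format strings, so the idea can be tried without touching the C source. The sketch below is not the instrumentation used in the talk; the helper name `char_length_report` is hypothetical, and it simply reports each character's length in bytes.

```ruby
# Build one formatted line per character: its code point and byte length.
def char_length_report(str)
  str.each_char.map do |ch|
    format("char U+%04X has length %d", ch.ord, ch.bytesize)
  end
end

puts char_length_report("e\u0308")
# prints:
# char U+0065 has length 1
# char U+0308 has length 2
```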

00:16:46.420 In the previous execution blocks, I was able to zero in on a specific line of code that calculates the length using rb_enc_mbclen and then uses memcpy to copy that many bytes into place. The length determination is crucial as it tells the code how many bytes to copy over. The origins of `printf` go back to BCPL, which had a function named writef in 1966.

00:18:06.310 The Thephasadin Stadium right here in Bangkok was built around that time, just in time for the 1966 Asian Games.

00:18:26.770 `printf` gives a general overview of where you are in program execution. However, due to the complexity of the function I examined, I rapidly found myself needing more powerful tools. That's where the GNU debugger, gdb, comes into play. You invoke it by using `gdb ruby`, set breakpoints, and run the corresponding Ruby program.

00:19:13.900 After setting several breakpoints in multiple files, I discovered that the function tasked with calculating length in bytes is located in a file named utf_8.c.

00:19:42.920 Thus far, I've set up a debugger.rb file containing a string on which I call reverse. Running it under gdb prints the debugging statements as each breakpoint is hit. Notably, the results showed that the lowercase e has a length of 1, while the combining diaeresis did not print properly using printf.

00:20:52.210 C recognizes that the length of e is indeed 1, while the length for the corresponding diaeresis stands at 2. Hence, at this point, I discovered that C is functioning properly and, at least at this stage, no apparent bug exists.

00:21:37.329 Now, let’s apply my hack concerning pointer mechanics. I believed a proper representation of the e with diaeresis should be hex C3 AB, which spans 2 bytes. I modified the bytes the pointer refers to so that they hold those values and adjusted the character assignment accordingly.

00:22:11.490 After applying the adjustments, I executed the reverse operation. It returned the correct string, displaying the diaeresis on the right letter. This introduces the ongoing debate between Ruby and C in this context. Do any movie fans recognize the famous arm-wrestling scene featuring Arnold Schwarzenegger? Thank you for guessing! But let's discuss this further later.
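The same effect can be observed from Ruby by starting with the precomposed character U+00EB, which occupies two bytes in UTF-8. This sketch does not reproduce the pointer hack in C; it only shows that the single-code-point form reverses cleanly. The constant name is illustrative, and the spelling 'Raphaël' is an assumption.

```ruby
# Precomposed "ë" is one code point (U+00EB), two bytes in UTF-8.
PRECOMPOSED_NAME = "Rapha\u00EBl"

p "\u00EB".bytes.map { |b| format("%02X", b) }    # ["C3", "AB"]
puts PRECOMPOSED_NAME.reverse == "l\u00EBahpaR"   # prints "true"
```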

00:23:53.000 There is an established solution for this Unicode problem; many recommend using Unicode normalization. It tackles anomalies in string functions. It’s essential to understand how Unicode normalization works before adopting it as a fix.

00:24:57.490 Unicode has a concept of equivalence whereby certain sequences represent the same character. For instance, the lowercase e with diaeresis can be expressed outright as a single character or as a combination of a lowercase e and a combining diaeresis.

00:25:55.220 Unicode normalization converts equivalent sequences of code points to a single canonical form. In its composed form, the lowercase e and the combining diaeresis are consolidated into the single precomposed character, so equivalent strings end up with the same representation. This helps eliminate inconsistencies.
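A minimal sketch of normalizing before reversing, using `String#unicode_normalize` with the `:nfc` (composed) form. The helper name `normalized_reverse` is hypothetical, and the spelling 'Raphaël' is an assumption.

```ruby
# Compose combining sequences first, then reverse.
def normalized_reverse(str)
  str.unicode_normalize(:nfc).reverse
end

decomposed = "Raphae\u0308l"                        # "e" + U+0308
puts decomposed.codepoints.length                   # prints 8
puts decomposed.unicode_normalize(:nfc).codepoints.length  # prints 7
puts normalized_reverse(decomposed) == "l\u00EBahpaR"      # prints "true"
```

Composition only helps when Unicode defines a precomposed code point for the sequence; a base letter and mark with no precomposed form stay as two code points after normalization.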

00:26:12.730 This concludes my presentation. Throughout this journey, I've emphasized using powerful debugging tools effectively and avoiding assumptions regarding the code behavior.

00:26:34.800 Hopefully, this talk inspires you all to debug hard. You've been a fantastic audience. Namaste!