00:00:06.900
Hello everyone! Did I get that right, Sowerby Gloom Tip? I've been practicing all morning.
00:00:19.650
Welcome to the first RubyConf in Thailand! This was a long time ago. Let's dive in and debug hard together. First, let's write a test with the following assertion.
00:00:31.120
My name is Vishal Chandnani, and I wrote this test at the DEAF method in New York City. How's that for a test-driven introduction?
00:00:54.600
Honestly, I had this idea long before the talk itself during a long train ride into the city. Enough about me; let me talk about my family. Right in the center is my best friend of 14 years, and we have two little gems of our own.
00:01:19.659
Quick disclaimer: every talk needs a story. When I first started learning Ruby, I was fascinated that it is written almost entirely in C. After a few years of working in the language, I couldn't help but notice the implementation of methods via the 'click to toggle source' button in the documentation. I clicked on a few to study their implementations and gradually began comparing Ruby and C for simple operations like string reversals.
00:01:44.770
As I explored the topic further, I learned that certain strings, particularly those containing Unicode characters, don't interact well with certain string methods like the reverse method. I found many articles and blogs online that accept this as a known problem and suggest Unicode normalization as a solution.
00:02:20.290
Typically, when you're working in a language that builds on top of or uses another language underneath, you tend to suspect the underlying implementation. I put on my debugger hat expecting to find my first Ruby bug but went down the rabbit hole of C in Reverse. To my surprise, I couldn't find anything wrong with the way C implemented the solution.
00:02:56.200
I stepped back, took a closer look at the inputs being passed to the function, and realized that Ruby was, at least in my mind, incorrectly representing certain Unicode characters. That whole experience changed the way I approached similar problems. This talk aims to provide a more logical approach to debugging across several languages.
00:03:29.400
Let's get started! This picture shows the moth found trapped in a relay of a Mark II computer at Harvard University way back in 1947. Meet Admiral Grace Hopper, who, by her own admission, didn’t coin the term debugging, but she used it so often that it became popular.
00:03:41.199
We're going to explore the Ruby string library, particularly its reverse method. My first programming language, C, was developed in the early '70s at Bell Labs. Coincidentally, this is also where I started my career. C was intended to be used for utilities running in Unix, and by 1973 it had become capable enough that it was used to rewrite the kernel of the UNIX operating system.
00:04:06.490
Imagine a language so powerful that not only does it write other languages but also operating systems! I needed an environment to build Ruby from source. I chose to use VirtualBox since it was free at the time and worked right out of the box.
00:05:00.680
For this exercise, I chose Ruby 2.5.1, as it was one of the versions available. Both Ruby and C are my favorite programming languages, and I find it fascinating to see both languages come together in one file. The first tool we are going to use today is grep, a UNIX command that finds patterns in files. The reverse function implemented in C is actually called rb_str_reverse. The first thing I did was use grep to search for that string throughout the entire directory where I had the Ruby source code. It turns out it lives in a file called string.c.
00:06:34.000
The first version of grep was written overnight by a gentleman named Ken Thompson to help a colleague, Lee McMahon, who was analyzing the text of the Federalist Papers at the time. Speaking of overnight ideas, around that same time in 1974, this classic film, `Murder on the Orient Express`, was released. If you haven’t seen it, I won’t give it away.
00:07:01.270
Given my belief in test-driven development, I went looking for tests and found a few attempts to reverse certain strings. They tested simple examples, like 'beta,' which is a palindrome. However, I didn’t find anything complex, and nothing for Unicode. It seemed that the rb_str_reverse function uses pointers to reverse a string by copying characters from the beginning of the source to the end of the result, one character at a time.
00:07:45.679
In doing so, it also must calculate the length of each character it intends to copy. To build Ruby from source, the instructions are quite simple: configure, make, install. I wrestled the whole weekend to obtain a basic version of Ruby built from source in VirtualBox. I took notes and have the commands if anyone is interested.
00:08:53.370
I started with a simple string and called the reverse method to observe its results. This was an interesting case, especially with the name 'Raphaël,' which includes the Latin lowercase e with diaeresis. The diaeresis marks the letter it sits on as pronounced separately, even when it is preceded by another vowel. In this case, the name is written 'Raphaël.'
00:09:41.060
Other examples include 'Chloë' and 'reëntry.' Few people actually use a diaeresis in their names; one example is Raphaël Varane, a footballer who plays for the French national team and the Spanish club Real Madrid.
00:10:06.200
So when we reverse 'Raphaël,' what's wrong with this picture? The diaeresis appears on the wrong letter: it ends up on the l instead of the e. Turns out, this is a known issue that many have accepted. Let's examine what happened.
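The misplaced diaeresis can be reproduced with a minimal sketch like the one below. It assumes the name was stored in decomposed form, that is, a plain e followed by the combining diaeresis U+0308; the variable names are only for illustration.

```ruby
# "Raphaël" in decomposed form: "e" followed by U+0308 COMBINING DIAERESIS.
# (Assumption: this is how the string in the talk was stored.)
name = "Raphae\u0308l"

# String#reverse reverses code point by code point, so the combining mark
# ends up right after "l" and is rendered on top of it.
reversed = name.reverse

puts reversed
p reversed.codepoints.map { |cp| format("U+%04X", cp) }
```

The first two code points of the result are the l and the combining diaeresis, which is why the mark is drawn over the wrong letter.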
00:10:34.830
The second tool we'll use today is a simple method called `chars` applied to the string 'Raphaël.' It provides an array of all characters in that string. Note that I'm saying 'characters,' not 'bytes.' The lowercase e with diaeresis is one example of a character that is not a single byte.
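As a rough illustration of the difference between characters and bytes, the sketch below calls `chars` on the same decomposed string (an assumption about the input) and compares the character count with the byte count.

```ruby
# Decomposed "Raphaël": the diaeresis is its own code point (assumed input).
name = "Raphae\u0308l"

chars = name.chars      # one array element per code point
p chars
p chars.length          # number of characters Ruby sees
p name.bytesize         # number of bytes in the UTF-8 encoding
```

Here Ruby reports 8 characters but 9 bytes, because the combining diaeresis is a separate character that takes two bytes.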
00:10:54.670
At this point, we need to learn a bit more about Unicode. Unicode is essentially a standard that allows us to encode, represent, and handle text and symbols from different parts of the world. This includes characters you cannot easily type on a keyboard. The lowercase e with diaeresis is one such example. Here's another: a fancy version of the Devanagari 'Om,' which is also part of the Unicode standard.
00:11:33.820
In fact, I have the Sanskrit version tattooed on my left arm. The concept of Unicode was proposed by a gentleman named Joe Becker from Xerox in 1988 as a unified encoding scheme. Around this time, one of my favorite movies was released: `The Goonies`, which inspired the title of this talk, 'Code Points.' A code point is the number Unicode assigns to each character in a string, and we can use a simple iterator over the code points to print out the value of each character in both decimal and hexadecimal forms.
00:12:28.360
We can see here that the lowercase e with diaeresis was split into two characters: the lowercase e followed by the diaeresis. In hexadecimal, the e appears as hex 65, which seems correct. The combining diaeresis has the Unicode code point hex 308, written U+0308. This detail is my favorite because it brings us closer to the byte-level representation.
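The iteration over code points might look something like this sketch. The output format and the `lines` variable are my own choices for illustration, not necessarily what was on the slide.

```ruby
# Decomposed "Raphaël" (assumed input): e is U+0065, the mark is U+0308.
name = "Raphae\u0308l"

# Build one line per code point: decimal value, then hexadecimal value.
lines = name.each_codepoint.map { |cp| format("%d 0x%X", cp, cp) }

puts lines
```

The e prints as 101 (hex 65) and the combining diaeresis as 776 (hex 308).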
00:13:16.860
Previously, we were discussing characters, but the key takeaway is that a character is not necessarily a byte. Applying the same approach, we can iterate over each byte. The e remains as hex 65, while the combining diaeresis displays a 2-byte representation of hex CC and hex 88. Looking closer, this is the UTF-8 encoding of code point U+0308; in UTF-16 it would simply be hex 03 and hex 08.
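To make the byte-level view concrete, here is a small sketch that prints the bytes behind each character. The UTF-16BE conversion is added only for comparison and is not something the talk necessarily showed.

```ruby
# Decomposed "Raphaël" (assumed input), stored as UTF-8.
name = "Raphae\u0308l"

# Print each character's code point next to its UTF-8 bytes.
name.each_char do |ch|
  hex = ch.bytes.map { |b| format("%02X", b) }.join(" ")
  puts format("U+%04X -> %s", ch.ord, hex)
end

utf8_mark  = "\u0308".bytes                     # bytes in UTF-8
utf16_mark = "\u0308".encode("UTF-16BE").bytes  # bytes in UTF-16 (big endian)
p utf8_mark.map { |b| format("%02X", b) }
p utf16_mark.map { |b| format("%02X", b) }
```

The combining diaeresis is CC 88 in UTF-8 and 03 08 in UTF-16BE, while the e is the single byte 65 in UTF-8.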
00:14:44.560
To delve into the implementation, let’s understand pointers. A pointer is a variable that holds the address of another variable or memory location. Here’s a straightforward example: we have a character buffer with room for 25 characters holding the string 'Hello, World!' The notation char *ptr declares a pointer, where the star indicates that it points to a variable of type char. We can loop through that string with the pointer to print each character.
00:15:34.410
The function `printf` is somewhat akin to `puts`; the name stands for 'print formatted.' This function is useful for tracking where you are in your program. I'll illustrate with a simple example: I used `printf` to determine the length of a character in bytes and also to monitor where I was during execution.
00:16:46.420
In the previous execution blocks, I was able to zero in on a specific line of code calculating the length using rstrnc_len and subsequently using memcpy to copy the characters into place based on that length. The length determination is crucial as it tells the code how many bytes to copy over. The origins of `printf` go back to BCPL, which had a function named writef in 1966.
00:18:06.310
The Thephasadin Stadium right here in Bangkok was built around that time, just in time for the 1966 Asian Games.
00:18:26.770
`printf` gives a general overview of where you are in program execution. However, due to the complexity of the function I examined, I rapidly found myself needing more powerful tools. That's where the GNU debugger, gdb, comes into play. You invoke it by using `gdb ruby`, set breakpoints, and run the corresponding Ruby program.
00:19:13.900
After setting several breakpoints in multiple files, I discovered that the function tasked with calculating length in bytes is located in a file named utf_8.c.
00:19:42.920
Thus far, I've set up a debugger.rb file containing a string on which I call reverse, so I have something to debug. Running it prints all of the debugging statements in the order they are hit. Notably, the results showed that the lowercase e has a length of 1, while the combining diaeresis did not print properly using printf.
00:20:52.210
C recognizes that the length of e is indeed 1, while the length for the corresponding diaeresis stands at 2. Hence, at this point, I discovered that C is functioning properly and, at least at this stage, no apparent bug exists.
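The copying logic described above can be approximated in Ruby to see why the C code is behaving correctly. This is only a sketch: `reverse_by_char_length` is a hypothetical helper, not Ruby's actual C implementation, and it relies on `each_char` to find character boundaries where the C code calls an encoding-specific length function.

```ruby
# Hypothetical helper: reverse a string by copying each character's bytes
# into the tail of a destination buffer, moving toward the front.
def reverse_by_char_length(str)
  src = str.b                     # binary copy, like the C char buffer
  out = ("\0" * src.bytesize).b   # destination buffer of the same size
  write_pos = src.bytesize        # write from the end toward the start
  read_pos = 0

  str.each_char do |ch|
    clen = ch.bytesize            # length of this character in bytes
    write_pos -= clen
    out[write_pos, clen] = src[read_pos, clen]  # stands in for memcpy
    read_pos += clen
  end

  out.force_encoding(str.encoding)
end

name = "Raphae\u0308l"            # decomposed input (assumption)
p "e".bytesize                    # 1
p "\u0308".bytesize               # 2
p reverse_by_char_length(name) == name.reverse
```

The sketch produces the same result as `String#reverse`, which suggests the surprising output comes from how the input is composed rather than from the byte copying itself.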
00:21:37.329
Now, let’s apply my hack concerning pointer mechanics. I believed a proper representation of the e with diaeresis should be the single code point U+00EB, which in UTF-8 is hex C3 AB and spans 2 bytes. I modified what the pointer referenced so that it held those byte values and adjusted the character assignment accordingly.
00:22:11.490
After applying the adjustments, I executed the reverse operation. It returned the correct string, displaying the diaeresis on the right letter. This introduces the ongoing debate between Ruby and C in this context. Do any movie fans recognize the famous arm-wrestling scene featuring Arnold Schwarzenegger? Thank you for guessing! But let's discuss this further later.
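The effect of that byte-level hack can be imitated in plain Ruby by starting from the precomposed character instead of patching memory. This is an approximation of the idea, not the actual change made in the C code.

```ruby
# Precomposed "Raphaël": the e with diaeresis is a single code point, U+00EB.
precomposed = "Rapha\u00EBl"

e_bytes = "\u00EB".bytes                    # UTF-8 bytes of the single character
p e_bytes.map { |b| format("%02X", b) }     # C3 AB

# With one code point for the accented letter, reverse keeps the mark on e.
fixed = precomposed.reverse
puts fixed
```

Because the letter and its mark now travel together as one character, the reversed string keeps the diaeresis on the e.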
00:23:53.000
There is an established solution for this Unicode problem; many recommend using Unicode normalization. It tackles anomalies in string functions. It’s essential to comprehend how Unicode normalization functions before adopting it as a fix.
00:24:57.490
Unicode has a concept of equivalence whereby certain sequences represent the same character. For instance, the lowercase e with diaeresis can be expressed outright as a single character or as a combination of a lowercase e and a combining diaeresis.
00:25:55.220
Unicode normalization converts equivalent sequences of code points into one agreed form, either composing a base letter and its combining mark into a single code point or decomposing it back into the pair, so that equivalent strings end up with the same representation. This helps eliminate inconsistencies.
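A minimal sketch of equivalence and normalization, using Ruby's built-in `String#unicode_normalize` (available since Ruby 2.2). The two spellings of the name are my assumption about the forms discussed in the talk.

```ruby
decomposed  = "Raphae\u0308l"   # e + combining diaeresis (two code points)
precomposed = "Rapha\u00EBl"    # single code point U+00EB

# The strings look identical on screen but are not equal code point by code point.
p decomposed == precomposed

# NFC composes, NFD decomposes; after normalizing, the two forms match.
p decomposed.unicode_normalize(:nfc) == precomposed
p precomposed.unicode_normalize(:nfd) == decomposed

# Normalizing to NFC before reversing keeps the diaeresis on the e.
normalized_reverse = decomposed.unicode_normalize(:nfc).reverse
puts normalized_reverse
```

Composing with NFC only helps when a precomposed code point exists for the letter and mark, so it is worth checking the characters you expect before relying on it as a general fix.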
00:26:12.730
This concludes my presentation. Throughout this journey, I've emphasized using powerful debugging tools effectively and avoiding assumptions regarding the code behavior.
00:26:34.800
Hopefully, this talk inspires you all to debug hard. You've been a fantastic audience. Namaste!