Unicode

How to Become an Encoding Champion

How to Become an Encoding Champion

by Deedee Lavinder

The video titled "How to Become an Encoding Champion" by Deedee Lavinder at RubyConf 2019 explores the vital topic of encoding in computer systems, aiming to demystify the subject for developers. Lavinder starts by discussing the foundational concepts of encoding, emphasizing that it is not encryption and reiterating that encoding facilitates the proper understanding and consumption of data across different systems.

Key points discussed throughout the talk include:
- History of Encoding: Lavinder provides a brief historical overview, starting from early encoding methodologies such as Morse code and Braille to ASCII and EBCDIC frameworks used in computing.
- Importance of Standards: The introduction of ASCII in the 1960s illustrates the move towards standardization in encoding, enabling compatibility across systems, while multi-byte encodings emerge to support non-English languages, leading to the creation of Unicode.
- Understanding Unicode: Lavinder highlights the significance of Unicode, which accommodates a vast amount of characters across multiple languages, adding a layer of abstraction between characters and their binary representations.
- Ruby Encoding Practices: The presentation delves into Ruby's handling of encoding, particularly the methods 'encode' and 'force_encoding', illustrating how mishandling these methods can lead to data loss and encoding issues.
- Base64 Encoding: The concept of Base64 as an encoding method is explained, describing its function in converting binary data into ASCII strings, ensuring safe data transmission.

Throughout the speak, Lavinder shares personal anecdotes related to encoding bugs she encountered, reinforcing her points with real-world implications. She concludes by stressing the following key takeaways:
- There is no such thing as plain text; all text is encoded.
- Always specify which encoding to accept from users and strive to use UTF-8 wherever possible.
- Proper parsing of files necessitates accurate knowledge of the encoding originally used to avoid data loss.

Overall, Lavinder's talk provides valuable insights into the complexities of encoding in various contexts, aimed at equipping developers with a robust understanding of how to manage encoding effectively in their applications.

00:00:13.390 Hello, everyone! My name is Didi Lavinder, and I am very happy to be here. I'm a back-end engineer at Spreadly in North Carolina, where I work in the payment space.
00:00:21.550 I am also a director for our local chapter of Women Who Code, and I'm very excited to talk about encoding today.
00:00:28.630 How many of you love encoding? Yes, I see some enthusiasts. Okay, I also saw some head shaking—who can't stand encoding? All right, more hands. Hopefully, the numbers will look slightly different by the end of this talk; that is what I'm hoping for.
00:00:37.510 Even though entire books have been written about this topic, I'm hoping to spend the next 30 minutes covering some history, foundational concepts, and a bit of Ruby-specific basics. For me, understanding encoding has been crucial, and I hope that will also be true for you.
00:00:54.190 Of course, the idea for this talk came from a bug that I encountered. How many of you have seen examples that could be included here? I learned an interesting word for this in Japanese: 'ma.'
00:01:07.660 Has anybody heard that word before? It basically means an unintelligible sequence of characters. As I dug in and started troubleshooting, I discovered some fascinating things about encoding and its history, which were instrumental in helping me debug this issue.
00:01:13.720 Recognizing that we may have a wide range of experience in the audience, let's start at the very beginning. To encode something simply means to convert it into a coded form. This concept has existed long before computers. The purpose of encoding is to transform data so that it can be properly and safely consumed by, or understood by, a different type of system.
00:01:54.940 Morse code, for example, uses dots and dashes of sound or light, along with varying length pauses, to communicate certain characters over long distances. Another example of encoding that we're all familiar with is Braille, where physical raised bumps represent letters and symbols. The goal of encoding is not to keep information secret, but rather to ensure that the information is able to be properly consumed or understood.
00:02:30.720 So encoding is not the same as encryption. These terms are sometimes mistakenly used interchangeably. While both transform data into another format, encoding uses publicly available schemes that are intended to be reversed. This is a key distinction. Decoding only requires knowing which scheme was used to encode the data. You can think of it like a map, a legend, or even a table. On the other hand, encryption is meant to protect valuable or sensitive information. Decryption is more complicated because data is transformed using mathematical algorithms and a secret key with the intention of it not being easily reversed.
00:03:30.170 We all know that computers use binary code, comprised of the digits 0 and 1, to store data. To communicate with computers, we must understand the basics of binary code. The smallest unit of data, a 1 or a 0, is called a bit, and groups of 8 bits are called bytes. To put it in perspective, 8 bits allow for 256 possible combinations. It's fascinating to recognize that two to the eighth is 256!
00:04:27.680 So, if you're new to binary, here's a quick explanation: two types of people can exist within the binary system. Using number system translations and encoding schemes allows us to communicate or share data with computers. It translates those 1s and 0s into a system or language that we understand and can process. After we go from binary to decimal, then we can map those familiar decimal numbers to our writing system or the characters of our language. This is how we communicate with computers.
00:05:32.009 Now, let's take a look at hexadecimal. The hexadecimal system—often referred to simply as 'hex'—is important for understanding number systems. Hex means 6 and decimal means 10; together, they make 16, which is represented by the digits 0 through 9 and the letters A through F. Converting from decimal to hexadecimal involves dividing by 16 using integers and recording the remainders until you reach 0. The final hexadecimal value is represented by the digit sequence of those remainders in reverse order.
00:06:42.149 In the beginning of the computer age, there were no shared standards for mapping binary to language. Many unknown encoding schemes were used in those early days, but ensuring compatibility eventually became increasingly important. In the 1960s, the organization we now know as ANSI, or the American National Standards Institute, began working on the American Standard Code for Information Interchange, commonly known as ASCII.
00:07:54.950 First published in 1963, ASCII uses seven bits to encode 128 specified characters. Remember, two to the seventh equals 128. Out of those 128 possibilities, 95 are printable characters, which include digits 0-9, uppercase and lowercase letters A-Z, and a handful of punctuation symbols. The original ASCII specification also included 33 non-printing control codes, but many of those are obsolete today.
00:08:31.720 There were many other encoding schemes at the time, one notable scheme being EBCDIC (Extended Binary Coded Decimal Interchange Code), which was used extensively in the early days by IBM. The difference between ASCII and EBCDIC is that ASCII allows for the ordering and sorting of lists alphabetically, making it a key player in computing, while EBCDIC did not.
00:09:36.900 Now, remember that ASCII only uses seven bits. The eighth bit was unassigned and can be utilized for further coding options. By using that eighth bit, it was possible to create another set of 128 byte configurations that can be assigned to different characters, which led programmers to create various characters from other languages or special symbols, presenting numerous compatibility issues.
00:10:16.790 Meanwhile, ISO-8859 was attempting to provide encoding schemes for languages other than English. Encoding for all languages was being developed, including Russian, Hindi, Arabic, Korean, and many others. It became clear that one byte per character would never be sufficient, especially considering languages like Chinese and Japanese, which contain tens of thousands of characters. Therefore, multi-byte encodings were created.
00:10:54.510 For example, a two-byte encoding using 16 bits can encode 65,536 distinct values, which was still just another specialized encoding format among many. During the late 80s, the concept of a new character encoding scheme that could handle any character on any computer emerged, leading to the invention of Unicode. The first version was introduced in 1991, adding a layer of abstraction by assigning a code point to every conceivable character. Unicode is a massive table containing over a million potential code points, allowing representation of a multitude of character sets, including those from Middle Eastern, Far Eastern, European, and prehistoric languages.
00:11:54.490 With Unicode, one can write a document encompassing virtually any language, including Klingon, and even emojis! As of May 2019, Unicode version 12.1 contains a repertoire of 137,929 characters, with numerous code points still available for additional characters. If you're thinking about creating a new writing system or language, worry not—Unicode has room for you!
00:12:24.500 A common misunderstanding is that Unicode is an encoding. It is not. Unicode does not define how to store numbers in a file; it's not binary code. Rather, it is an additional layer of abstraction between our understanding of a character and the bytes that the computer uses to represent that character.
00:13:04.600 For example, here's an illustration of an African writing system used to transcribe a Somali-language text. You can think of Unicode code points as associations or legends of characters. Interestingly, sometimes a character may have more than one code point, which brings us back to hexadecimal representations. Unicode points are often displayed in hexadecimal notation as it provides shorter ways to reference large numbers.
00:13:56.950 As we delve deeper into understanding encoding, it’s important to keep apart the characters—which are more abstract (like distinguishing between lowercase and uppercase)—from the actual encoding, which is the sequence of bytes representing those code points. There are various encodings for Unicode that fall under an umbrella called Unicode Transformation Format, or UTF. Three of the most popular UTF encodings are UTF-32, UTF-16, and UTF-8, each differing in how much space or how many bits they take to represent characters.
00:14:54.250 UTF-32 uses four bytes, hence 32 bits for every code point; UTF-16 uses between 2 and 4 bytes; while UTF-8 uses between 1 and 4 bytes per character. It's no surprise that UTF-8 is the most popular encoding. This graph may be dated, but the trajectory hasn't shifted since 2009. UTF-8 remains the dominant encoding, accounting for more than 94% of all web pages as of 2019, and an even higher percentage for the most popular web pages.
00:15:30.95 UTF-8's popularity stems from several factors, one being that it is backward-compatible with ASCII. This means that in most situations, it simply works. Moreover, given that UTF-8 uses between 1 and 4 bytes per character, it saves space compared to encodings that always use two or four bytes, such as UTF-16 or UTF-32. Additionally, it avoids complications related to endianness and byte order marks, which are required for UTF-16 and UTF-32.
00:16:27.790 Let's discuss the term 'endianness' in the context of encodings. Endianness refers to the order of bytes in a multi-byte representation of a character. Computers, like humans, can read from left to right or right to left, and in this context, there are two options known as little-endian and big-endian. Little-endian stores bytes from least to most significant. For example, date formats in Europe that write dates as day-month-year reflect this ordering. Conversely, big-endian orders bytes from most to least significant, as seen in the ISO date format year-month-day.
00:17:56.330 When interpreting a multi-byte character representation, if your computer reads based on big-endian and the data is encoded using little-endian format, it won't parse correctly. Endianness only becomes critical for multi-byte characters; single-byte characters, like ASCII, don't present ordering issues. The challenges of endianness underscore UTF-8's popularity, as it is not endianness dependent; the way bytes are interpreted is unaffected by processor architecture.
00:19:47.020 Unicode is a brilliant solution to a significant issue, though it has its imperfections. The Unicode Consortium decides what to add or change, and their choices can sometimes be perplexing, as seen with the unification of common origin characters like the Han glyphs. This merging has caused frustration for users of those characters. Furthermore, variable-width encoding isn't suitable for every application, and it can be challenging to anticipate the size of character sets. Converting between UTF encodings can also be quite complex.
00:20:38.800 Not every language receives equal attention and inclusion in Unicode, which is problematic. However, it remains the best solution we have today. As we look ahead, I’m curious about how encoding will evolve over the next ten to twenty years. Now, let’s revisit something I mentioned earlier: decoding only requires knowing which encoding scheme was used. This is especially relevant as we look at Ruby.
00:21:40.370 Ruby has an interesting history with encoding. Many have experienced the encoding frustrations that arose during Ruby's transition from ASCII to UTF-8 in versions 1.8 and 1.9. Since version 2.0, UTF-8 has been the default source encoding, which has significantly helped with these issues. Ruby provides several methods to help figure out data encoding and understand how information is represented or has been represented.
00:22:18.730 I focused on two key methods that were pivotal for me in untangling my encoding problems. The 'encode' method alters the underlying bytes of a string to a specific encoding or changes it from one encoding to another. This is essential to understand. In contrast, the 'force_encoding' method does not alter the underlying bytes but interprets the string using a different encoding scheme. Understanding when to use these methods is critical, as using 'encode' alters the data, while 'force_encoding' does not.
00:23:51.640 For example, my original bug involved a string related to language from the Scandinavian region, and 'encode' was used to change the original UTF-8 string to an ISO 8859 encoding. As a result, things became distorted, and when trying to revert to UTF-8, it didn't work as expected. The byte size of the two strings was different, demonstrating that once the data is altered through 'encode,' reverting becomes impossible unless the original data is preserved. Understanding the distinction between bytes and string representation is critical.
00:24:54.710 Additionally, in Ruby, you can call the 'bytes' method, which returns the decimal value corresponding to the binary representation. You may call 'chr' on a byte to return the character associated with that byte in the specified encoding. Conversely, if you have a character, you can use 'ord' to retrieve the corresponding byte. You can also determine a string's encoding or whether it is valid through methods like 'valid_encoding?.' These methods can help troubleshoot tricky encoding issues.
00:25:50.660 When discussing file-level encoding, Ruby’s default source encoding has been UTF-8 since version 2.0. If using an older version, you can check the default external encoding. You might encounter files with different encodings than your system's default, which can be addressed with the default internal encoding. If you need a specific encoding for one file, you can implement a magic comment on the first or second line of your code file, and this is particularly pertinent for web applications and emails.
00:26:47.130 Remember to validate the encodings you’re accepting—without knowing the encoding of the incoming data, you can't decode it accurately. For web pages, ensure to specify the meta tag immediately after the opening tag. Additionally, if working with a web app linked to a database, don’t forget to set the database encoding, as compatibility is crucial for accurate decoding.
00:27:55.260 Also, consider the gems you’re using, along with their encoding schemes. This was related to my earlier bug; the response body came back as ASCII 8-bit, while my request had specified the character set as UTF-8—another sneaky aspect to be aware of. Let’s discuss base64, which is essential to any conversation about encoding. Base64 is an encoding method that converts binary data into an ASCII string while ignoring the original data's encoding. It operates using a set of 64 characters typical across multiple encodings, which allows safe transmission of data without loss.
00:29:21.370 Base64 takes three bytes and encodes them into four ASCII characters. If the bits don't divide neatly by 6, padding is added with equal signs. In Ruby, Base64 is part of a module that needs to be required. Using 'encode64' will add line feeds to every 60 encoded characters unless you specify 'strict' encoding, which omits these newlines. Meanwhile, any character not found in the original Base64 set is ignored when decoding, and 'strict' decoding raises an error if an invalid character is detected.
00:30:41.720 Today’s main takeaways are: binary and number systems are extremely fascinating; there is no such thing as plain text as every text is encoded; be explicit in specifying which encoding to accept from users; use UTF-8 whenever possible; and remember that characters are abstract entities with potential variations. Properly parsing files necessitates knowing the encoding used during creation because misguessing could result in permanent data loss. Thank you so much for your time! I'm happy to take questions or discuss more about encoding afterward. The resources I used for this talk will be available when the slides are published online.
00:32:15.010 Thank you!