Character Encoding

Summarized using AI

"👨‍👩‍👧‍👧".length == 7

Sven Dittmer • May 25, 2019 • Hamburg, Germany

The video titled 'Family Emojis' features Sven Dittmer at the Ruby Unconf 2019, where he addresses the complexities of counting characters in strings, particularly focusing on family emojis across various platforms like Android, iOS, and web services. Dittmer begins by sharing his background as a Ruby developer at Sing is Inc, and he expresses gratitude to the audience for attending his first conference talk.

The central theme revolves around the challenge of achieving consistency in character counting, especially considering the diversity of characters such as emojis, extended Latin characters, and pictographs. Key points discussed include:

  • Character Counting Challenges: Dittmer explains that there is no single correct length for an emoji. Ruby, for example, reports a length of seven for the family emoji, so developers first have to agree on what they are counting.

  • Normal Forms: He introduces Unicode normalization, specifically the composed normal form (NFC) and the decomposed normal form (NFD). The form a string is in determines how many characters an accented letter counts as, and Ruby can convert strings between the two.

  • Encoding Formats: Different encoding formats, primarily UTF-8 and UTF-16, are explored. Dittmer notes how these formats represent characters differently, particularly emojis. For instance, an emoji takes four bytes in UTF-8, while the UTF-16 output examined in the talk takes six bytes: four for the emoji itself plus two for the byte order mark.

  • Byte Order Mark (BOM): The byte order mark's role in character representation and the discrepancies introduced by counting characters in UTF-16 are discussed, emphasizing how the BOM can lead to unexpected character totals.

  • Cross-Platform Discrepancies: The presentation highlights how different programming languages and platforms may yield varying results for the same character counts. Dittmer emphasizes the need for teams to devise strategies that ensure consistency in character counting across systems.

  • Audience Interaction: The talk also involves audience questions regarding maximum character lengths and database constraints which can clash with frontend expectations, underscoring the practical implications of these character counting challenges.

In conclusion, Dittmer stresses the importance of consistent character counting when developing cross-platform applications, warning that many programming languages do not align on specific counts for emojis. This necessitates careful consideration and adaptation among development teams to maintain uniformity in character counting logic.

Overall, this presentation offers valuable insight into the intricacies of handling character counts in modern software development, particularly with the proliferation of varied character types on different platforms.

Ruby Unconf 2019

00:00:03.920 Welcome to my talk called 'Family Emojis'. It’s my first talk at this conference, and it’s also my first time attending. I really appreciate that you voted for this talk and came here to listen. Thank you very much for that.
00:00:27.340 A little bit about me: my name is Sven, and I’m a Ruby back-end developer working at Sing is Inc since 2018. Unlike many developers who share their Twitter handles, I am not on Twitter or Facebook. If you’d like to get in touch with me or provide feedback, you can message me on the Ruby message network where my handle is ‘sweetie’. If you haven’t checked it out yet, I recommend it.
00:01:00.110 Now, let’s dive into what this talk is all about. It focuses on a challenge my team faced and what we learned from it. The key issue we encountered involves counting characters in a string that includes the family emoji.
00:01:10.429 The challenge is to count characters consistently across all platforms: Android, iOS, web, and our Ruby back-end. We realized that counting correctly is subjective; for example, one could argue about the length of a family emoji being 7. The main aspect we focused on was consistency in our character counts.
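To make the "length of 7" concrete, here is a minimal Ruby sketch (not code from the talk) that counts the family emoji three ways. The constant `FAMILY` and the helper `count_summary` are names invented for this illustration; `grapheme_clusters` assumes Ruby 2.5 or newer.

```ruby
# Family emoji spelled out as code points: man, woman, girl, girl,
# each pair joined by ZERO WIDTH JOINER (U+200D).
FAMILY = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F467}"

# Hypothetical helper: report the same string under three counting rules.
def count_summary(str)
  {
    code_points: str.length,                       # String#length counts code points for UTF-8 strings
    utf8_bytes: str.bytesize,                      # raw bytes in the UTF-8 encoding
    grapheme_clusters: str.grapheme_clusters.size  # user-perceived characters (Ruby 2.5+)
  }
end

p count_summary(FAMILY)
```

The seven code points are the four people plus the three joiners between them, which is why the answer depends on the unit a platform chooses to count.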
00:01:54.650 To illustrate, let’s consider various types of characters: extended Latin characters, emojis, pictographs, and Chinese characters. Each of these character types presents unique complexities in counting, especially emojis and Zalgo text.
00:02:03.480 For instance, you might view an accented 'e' as one character with an accent or as two characters: the accent and the 'e'. This brings us to the concept of normal forms, which help standardize character representation. Two normal forms are the Composed Normal Form (NFC) and the Decomposed Normal Form (NFD). In NFC the accented 'e' is a single character; in NFD it is two. The normal form therefore decides how many characters we count.
00:02:45.350 We researched these normal forms and discovered that, in Ruby, you can normalize strings using a function to standardize the character count. Most platforms default to NFC when counting characters, and we decided to adopt this norm.
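The transcript does not name the function. A likely candidate in Ruby is `String#unicode_normalize`, so the following is a sketch under that assumption. The constants and the helper `nfc_length` are hypothetical names for this example.

```ruby
# "é" written two ways: one precomposed code point, or "e" plus a combining accent.
COMPOSED   = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Hypothetical helper: count code points after normalizing to NFC.
def nfc_length(str)
  str.unicode_normalize(:nfc).length
end

puts COMPOSED.length                          # 1
puts DECOMPOSED.length                        # 2
puts COMPOSED == DECOMPOSED                   # false: same glyph, different code points
puts nfc_length(DECOMPOSED)                   # 1
puts COMPOSED.unicode_normalize(:nfd).length  # 2
```

Normalizing before counting makes both spellings of the accented letter count the same, which is the consistency the team was after.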
00:03:41.140 Next, let’s discuss how different encodings work in Ruby. The most commonly used one is UTF-8, but there are other encodings as well. UTF-16 is the default in JavaScript and Cocoa frameworks, while UTF-8 is standard for Ruby and most web content.
00:03:49.670 We also explored how character counting varies with encoding formats, focusing on the difference between UTF-8 and UTF-16. For instance, Ruby’s string operations on UTF-16 can behave unexpectedly, whereas JavaScript strings are UTF-16 by default and their length property counts UTF-16 code units.
00:04:00.310 When we analyzed how an emoji is represented in UTF-8 and UTF-16, we found that in UTF-8, an emoji corresponds to four bytes, while in UTF-16, it could be represented with six bytes. This led us to consider the byte order mark (BOM) and how byte order influences character representation.
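The four-versus-six-byte observation can be reproduced with a short Ruby sketch. It assumes the six bytes come from encoding to plain "UTF-16", which in Ruby prepends a two-byte byte order mark. The emoji chosen and the helper `byte_sizes` are illustrative, not taken from the talk.

```ruby
# GRINNING FACE, a code point outside the Basic Multilingual Plane.
EMOJI = "\u{1F600}"

# Hypothetical helper: byte size of the same string in three encodings.
def byte_sizes(str)
  {
    utf8: str.bytesize,                           # 4 bytes
    utf16le: str.encode("UTF-16LE").bytesize,     # 4 bytes: one surrogate pair, no BOM
    utf16_with_bom: str.encode("UTF-16").bytesize # 6 bytes: 2-byte BOM + surrogate pair
  }
end

p byte_sizes(EMOJI)
# The two extra bytes are the byte order mark at the front.
p EMOJI.encode("UTF-16").bytes.first(2)
```

Encoding to "UTF-16LE" or "UTF-16BE" names the byte order explicitly, so no BOM is written and the emoji takes four bytes.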
00:04:55.690 The byte order mark tells how the bytes should be interpreted. Understanding big-endian versus little-endian formats is crucial for character representation across different systems.
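As an illustration of byte order, the sketch below prints the UTF-16 bytes of the letter "A" and of the BOM code point (U+FEFF) in both orders. The helper `utf16_hex` is a hypothetical name for this example.

```ruby
BOM = "\uFEFF" # the code point used as the byte order mark

# Hypothetical helper: hex dump of a string in UTF-16 with the given byte order.
# `endian` is expected to be "BE" (big-endian) or "LE" (little-endian).
def utf16_hex(str, endian)
  str.encode("UTF-16#{endian}").bytes.map { |b| format("%02X", b) }.join(" ")
end

puts utf16_hex("A", "BE") # 00 41
puts utf16_hex("A", "LE") # 41 00
puts utf16_hex(BOM, "BE") # FE FF
puts utf16_hex(BOM, "LE") # FF FE
```

A reader that sees FE FF at the start of a stream treats the rest as big-endian, and FF FE as little-endian.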
00:05:05.670 Ultimately, we realized that the discrepancies arising from counting in UTF-16—particularly with emojis—led to a situation where what we expected did not match the encoded values.
00:05:52.560 Through our investigation, we confirmed that the character counts in UTF-16 include the BOM, which can result in unexpected totals when counting characters. For example, counting an emoji could yield two characters where we expected one.
00:06:37.700 During my exploration of emoji length, I often wondered if different platforms counted certain emojis consistently. This concern culminated in discussions around the use of NFC in counting and how differing methodologies lead to varied character lengths.
00:07:03.080 We found that certain emojis combined to form new characters still required careful consideration on how to count them. Each time we tried to expand on a character, we needed to accurately assess whether it would generate a new count.
00:08:05.090 As we made adjustments and changes in our approaches to counting Unicode characters, it was essential to note how different programming languages may vary in how they handle these situations. This fact posed challenges, especially when scaling across platforms.
00:08:36.889 Looking ahead, we understand the importance of maintaining consistency in character counts and sharing the counting rules across systems, so that all platforms agree with one another.
00:09:01.889 Thank you for your attention so far, and now I would like to open the floor for any questions.
00:09:31.800 One question that arose involved how to validate a maximum character length across platforms without losing consistency in the count. For instance, when evaluating counts in iOS versus JavaScript, you may find discrepancies.
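One way to see such a discrepancy from the Ruby side is to compute the number a UTF-16 based platform would report. The sketch below is not from the talk; `utf16_code_units` is a hypothetical helper that mimics what JavaScript's `length` property counts, and `FAMILY_EMOJI` is an illustrative constant.

```ruby
# Family emoji: man, woman, girl, girl joined by ZERO WIDTH JOINER (U+200D).
FAMILY_EMOJI = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F467}"

# Hypothetical helper: number of 16-bit code units in the UTF-16 form of a string.
# UTF-16LE is used so that no byte order mark is added to the count.
def utf16_code_units(str)
  str.encode("UTF-16LE").bytesize / 2
end

puts FAMILY_EMOJI.length            # 7  (code points, Ruby's default for UTF-8)
puts utf16_code_units(FAMILY_EMOJI) # 11 (each person is a surrogate pair, each joiner one unit)
```

A maximum-length check of 10 would therefore accept this string in a code point count and reject it in a UTF-16 code unit count, so both sides need to agree on one unit.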
00:10:03.430 The follow-up discussion focused on how databases count characters and how they might enforce constraints that could clash with frontend expectations. This led to a realization that adjusting validation rather than character limits might be the preferable route.
00:10:35.460 We also looked at some character tests across platforms to validate how characters, especially complex ones like emojis, are counted, and we discovered varying results. This variability adds to the challenge of establishing norms across character counting.
00:10:49.910 In conclusion, what we found during our research was that most programming languages might not agree on specific emojis when it comes to character counts. This revelation underscores the necessity for teams working with cross-platform applications to devise strategies to maintain consistency in their character counting logic.
00:11:16.590 Thank you for being such an attentive audience during this presentation. Are there any other questions?
00:11:59.200 (Applause) Thank you. If you have thoughts about the material shared or ideas on how we might refine our approach to character counting moving forward, please feel free to reach out.