Ten Unicode Characters You Should Know About as a Programmer

In the talk "Ten Unicode Characters You Should Know About as a Programmer," Jan Lelis examines Unicode and the complexities it poses for programmers. The discussion is anchored in Ruby and shows how the language's Unicode support helps when working with characters and their code points.

Key Points Discussed:

- Unicode Overview: Unicode assigns a unique code point to every character, but a single visual character can be made up of multiple code points. A Thai character is introduced as an example of this.

- Character Normalization: Using 'Ä' as an example, Lelis shows that keyboard input can produce the same visible character either as a single code point or as a sequence of code points, and that Ruby's normalization support (String#unicode_normalize) transforms such sequences into one consistent representation.

- Confusable Characters: Lelis discusses the confusion that arises between visually similar characters, like the different versions of the letter 'I', especially in relation to Turkish casing rules. These similarities create security risks, so programmers need tools like the Unicode Confusable gem to identify and handle them.

- Locale-aware Methods: Since Ruby 2.4, case-mapping methods like 'upcase' and 'downcase' accept options such as :turkic, which improves support for languages whose casing rules differ from the default.

- Control Characters: The importance of understanding control characters such as newlines is emphasized, including how they behave differently in various operating systems and affect string manipulation in Ruby.

- Legal but Unassigned Code Points: Lelis explains that some code points, like U+10FFFF, are legal in Unicode but have no character assigned to them. Knowing this can help prevent application errors when such code points appear in input.

- Regular Expressions and Unicode Properties: Ruby's regular expressions support Unicode properties, which lets developers match characters by property (for example, by script) and so detect unusual or visually ambiguous characters.
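Several of the points above can be tried directly in Ruby. The following is a minimal sketch rather than code from the talk: the helper name `same_text?` and the sample strings are assumptions made for illustration, and it only uses core String and Regexp methods.

```ruby
# Sketch only: same_text? and the sample strings are hypothetical,
# chosen for illustration; they do not come from the talk.

# "Ä" written two ways: one precomposed code point, or "A" followed
# by a combining diaeresis.
precomposed = "\u00C4"
decomposed  = "A\u0308"

# Compare two strings after bringing both into NFC form.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

precomposed.codepoints               # => [196]
decomposed.codepoints                # => [65, 776]
precomposed == decomposed            # => false
same_text?(precomposed, decomposed)  # => true

# One visible character, two code points (grapheme_clusters needs Ruby 2.5+).
decomposed.grapheme_clusters.length  # => 1

# Case mapping with the :turkic option (Ruby 2.4+).
"I".downcase                         # => "i"
"I".downcase(:turkic) == "\u0131"    # => true (dotless ı)
"i".upcase(:turkic) == "\u0130"      # => true (dotted İ)

# Line endings differ between systems; chomp strips "\n", "\r\n" and "\r".
"one\r\ntwo\n".lines.map(&:chomp)    # => ["one", "two"]

# U+10FFFF has no character assigned, yet it is a valid code point.
"\u{10FFFF}".valid_encoding?         # => true

# Unicode script properties tell look-alike letters apart:
# "C" is Latin, "\u0421" is the Cyrillic letter Es.
"C".match?(/\p{Latin}/)              # => true
"\u0421".match?(/\p{Cyrillic}/)      # => true
"C".match?(/\p{Cyrillic}/)           # => false
```

Normalizing both sides before comparing is one possible design choice; an application could instead normalize once when data enters the system.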

In conclusion, Jan Lelis highlights that understanding the intricacies of Unicode is crucial for programmers who want to build robust applications. By recognizing confusable characters and learning how to handle them correctly in Ruby, developers can avoid common pitfalls with Unicode data and help their applications behave correctly across languages and writing systems.

Jan Lelis • November 28, 2017 • Earth

RubyConf 2017: Ten Unicode Characters You Should Know About as a Programmer by Jan Lelis

There are a lot of things that can go wrong when working with Unicode data. Some examples of unintended behavior:

- You try to downcase "I" to "i", but your Turkish friends want it to be a dotless "ı"
- Your UI is broken, because people use empty usernames, despite the String#blank? check
- You think "C" is the same letter as "С", but your system does not think so and crashes

Using ten characters as representatives, I will highlight some Unicode characteristics which require a programmer's attention and demonstrate how Ruby's solid Unicode support can help!
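The empty-username and look-alike-letter problems can be reproduced in plain Ruby. This is a hedged sketch, not the approach from the talk: `blank_like?` only approximates the Rails check under the assumption that it tests for whitespace, and `visibly_empty?` and `mixed_scripts?` are hypothetical helpers whose character lists are deliberately not exhaustive.

```ruby
# Sketch only: all three helper names are hypothetical.

# Approximation of a whitespace-only check (assumption: this mirrors
# what String#blank? from Rails does for strings).
def blank_like?(str)
  str.match?(/\A[[:space:]]*\z/)
end

# Stricter check that also treats format and control characters
# (general categories Cf and Cc) as invisible. Not exhaustive.
def visibly_empty?(str)
  str.match?(/\A[\p{Space}\p{Cf}\p{Cc}]*\z/)
end

# Rough heuristic: flag strings that mix Latin and Cyrillic letters.
def mixed_scripts?(str)
  str.match?(/\p{Latin}/) && str.match?(/\p{Cyrillic}/)
end

zero_width_name = "\u200B"        # ZERO WIDTH SPACE, renders as nothing
blank_like?(zero_width_name)      # => false, so the check lets it through
visibly_empty?(zero_width_name)   # => true

latin_c    = "C"                  # U+0043
cyrillic_s = "\u0421"             # U+0421, looks identical in most fonts
latin_c == cyrillic_s             # => false
latin_c.ord                       # => 67
cyrillic_s.ord                    # => 1057

# Normalization does not help here: the two letters are distinct
# characters, not two encodings of the same one.
cyrillic_s.unicode_normalize(:nfkc) == latin_c  # => false

mixed_scripts?("Ruby\u0421onf")   # => true
mixed_scripts?("RubyConf")        # => false
```

A script-mixing heuristic like this produces false positives for legitimately multilingual text, which is one reason to prefer a dedicated confusables check where available.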
