Talks

Ups and Downs of Ruby Internationalization

http://rubykaigi.org/2016/presentations/duerst.html

Currently many of Ruby's String methods, such as upcase and downcase, are limited to ASCII and ignore the rest of the world. This is finally going to change in Ruby 2.4, where this functionality will be extended to cover full Unicode. You will get to know what will change, how your programs may be affected, and how these changes are implemented behind the scenes. We will also look at the overall state of internationalization functionality in Ruby, and potential future directions.

Martin J. Dürst @duerst
Martin is a Professor of Computer Science at Aoyama Gakuin University in Japan. He has been one of the main drivers of Internationalization (I18N) and the use of Unicode on the Web and the Internet. He published the first proposals for DNS I18N and NFC character normalization, and is the main author of the W3C Character Model and the IRI specification (RFC 3987). Since 2007, he and his students have contributed to the implementation of Ruby, mostly in the area of I18N.

RubyKaigi 2016

00:00:00 Hello everybody! Can you hear me? Thank you. Okay, let's go ahead.
00:00:10 This is an outline with about 45 parts. These are some of the conventions used.
00:00:17 It's pretty much what you need to know, but I would like to know something about you.
00:00:25 Who here writes code using more than just the letters A to Z? Okay, there are some.
00:00:39 Who uses encodings other than ASCII? That's probably about the same group of people.
00:00:47 Now, another question: who uses encodings other than UTF-8? There are still quite a few.
00:01:00 Okay, this is just a brief introduction about myself.
00:01:07 I worked at W3C, so all my slides are in HTML with not a single line of JavaScript, using the available power here.
00:01:14 Anyway, this is a summary of my contributions to Ruby.
00:01:20 As of yesterday, Ruby has been updated to Unicode version 9.0, thanks to a lot of work.
00:01:27 The process went very smoothly, and you can get the version with this RbConfig command.
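
The command on the slide is not captured in the transcript. A minimal sketch, assuming it refers to the `UNICODE_VERSION` entry of `RbConfig::CONFIG` (expected in Ruby 2.4 and later; the `'unknown'` fallback is only there so the snippet also runs on older versions):

```ruby
require 'rbconfig'

# RbConfig::CONFIG is a Hash of build-time settings.
# The 'UNICODE_VERSION' key is assumed to exist on Ruby 2.4 and later.
unicode_version = RbConfig::CONFIG.fetch('UNICODE_VERSION', 'unknown')

puts "Ruby #{RUBY_VERSION} uses Unicode #{unicode_version}"
```
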
00:01:34 Over the last 23 years, we have had a stable relationship with Unicode.
00:01:41 Unicode is usually published every summer with a new version every year.
00:01:48 Ruby is typically published around Christmas every year.
00:01:54 The release cycles are just off by a little.
00:02:00 But here are the formulas to calculate these versions if you want to extrapolate.
00:02:13 However, don’t extrapolate too far. If we go to Ruby 3.0 in 2020, these formulas will start to be inaccurate.
00:02:22 Now let’s get to the main topic of this talk.
00:02:27 This is about case conversion or case mapping in Ruby.
00:02:39 In Ruby 2.2, we have the following methods: upcase, downcase, capitalize, and swapcase.
00:02:44 They all look pretty good.
00:02:50 However, let's take a closer look at these methods.
00:02:57 I slightly updated this and most of the characters are not upcased even if I use upcase.
00:03:10 This leads to the conclusion that in Ruby up to version 2.3, case conversion for letters outside ASCII is not available.
00:03:20 However, these have been implemented in Ruby 2.4, and this talk discusses what is happening behind the scenes.
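
A small illustration of the change, written with `\u` escapes so it does not depend on the source encoding. The sample words are only examples; on Ruby 2.3 and earlier the accented letters would be left untouched.

```ruby
word = "\u00E9cole \u00FCber"   # "école über"

# On Ruby 2.4 and later, upcase covers non-ASCII letters as well.
upcased = word.upcase
puts upcased                     # "ÉCOLE ÜBER" on Ruby 2.4+
```
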
00:03:35 There are quite a few scripts that have two cases. Georgian is a special case because it has complications.
00:03:44 Some minority scripts are adopting the idea of upper cases and lower cases.
00:03:51 Historically, the distinction between upper case and lower case didn't exist.
00:03:59 It was introduced in the 15th century as a functional distinction.
00:04:07 Modern case usage varies by language.
00:04:12 Even for a single language, how you uppercase words can differ.
00:04:19 In English, whether you uppercase words in a title can depend on your location.
00:04:26 There's a famous example that we learned in schools in Germany or Switzerland.
00:04:32 We were taught to always ensure correct uppercasing and lowercasing to avoid misunderstandings.
00:04:42 This might be a bit outdated, but many people once believed ASCII was sufficient.
00:04:48 Most people no longer maintain that position, but we must consider backwards compatibility.
00:04:57 Changing our method might pose dangers. We might need to revert to older methods.
00:05:05 For those migrating, implementing a new option to use Unicode is vital.
00:05:11 This way, we can adopt new features without disrupting existing applications.
00:05:19 Consider checking your codebase for potential upcase and downcase issues.
00:05:26 Testing early is essential.
00:05:31 There should be a preview for Ruby 2.4 coming out soon.
00:05:38 What specific problems could occur? One potential issue is DNS servers.
00:05:50 DNS defines case conversion only for ASCII. This has been unchanged for ages.
00:05:57 That’s why we have some issues like Punycode.
00:06:06 In cases such as this, you should have been using ASCII since the beginning.
00:06:10 Similarly, if you allow non-ASCII characters in user IDs, you could face challenges.
00:06:18 If you decide to handle case transformations, you always run the risk of inconsistencies.
00:06:28 Ensure consistency in your database since mixing character cases can lead to matching issues.
00:06:35 This is particularly problematic after updating to Ruby 2.4.
00:06:41 Special cases also exist, particularly for Turkic languages.
00:06:52 For backwards compatibility, you can utilize an ASCII option.
00:07:00 This option allows you to revert to behaviors consistent with Ruby 2.3.
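
A short sketch of the two options mentioned here, as they are available in Ruby 2.4 and later: `:ascii` restricts the mapping to A-Z/a-z, and `:turkic` applies the dotted/dotless i rules. The sample strings are illustrative only.

```ruby
text = "\u00E9t\u00E9"                 # "été"

full_upper  = text.upcase              # full Unicode mapping: "ÉTÉ"
ascii_upper = text.upcase(:ascii)      # ASCII-only mapping:   "éTé"

# Turkic languages pair i with İ (U+0130) and ı (U+0131) with I.
turkic_upper = "i".upcase(:turkic)     # "İ"
turkic_lower = "I".downcase(:turkic)   # "ı"

puts full_upper, ascii_upper, turkic_upper, turkic_lower
```
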
00:07:08 To manage conversions, everyone generally knows how to convert between uppercase and lowercase.
00:07:15 It's a basic exercise taught in programming classes.
00:07:24 However, Unicode introduces complexities beyond basic letter transformations.
00:07:32 Unicode maintains data sets for these different transformations.
00:07:41 There are special cases where a single character may change into multiple characters.
00:07:50 For instance, the German sharp S (ß) becomes the two characters 'SS' when upcased.
00:08:00 There are also transformations that don’t revert cleanly.
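
The sharp S case can be tried directly on Ruby 2.4 or later. The sketch below shows both effects described here: the string grows by one character, and downcasing the result does not bring back the original.

```ruby
street = "stra\u00DFe"          # "straße", 6 characters

upper      = street.upcase      # "STRASSE", 7 characters
round_trip = upper.downcase     # "strasse", not "straße"

puts "#{street.length} -> #{upper.length}"
puts round_trip == street       # false: the mapping does not revert cleanly
```
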
00:08:09 For example, the final sigma of the Greek alphabet should be handled according to the context rules in the Unicode standard.
00:08:19 Currently, this is not yet implemented, but I hope to get it done.
00:08:27 Another special case to consider is called simple case mapping.
00:08:36 Unicode offers special specifications for this process.
00:08:41 In Ruby, strings often change in length.
00:08:49 Implementing length consistency while manipulating strings can be challenging.
00:08:56 The next special case involves target case that must maintain consistent casing.
00:09:06 In situations involving periods, we must ensure they don't affect case.
00:09:14 Inaccuracies in name spelling can lead to significant miscommunication.
00:09:22 This remains a serious concern, especially with punctuation.
00:09:30 Particularly, in accented characters, the accents might go unnoticed.
00:09:38 However, technology may not always capture these details.
00:09:47 This less important detail is a challenge to address.
00:09:54 We also have case folding, a critical method for equality checking.
00:10:03 This method can distort values when comparing two strings.
00:10:11 For example, a sharp S in lowercase becomes SS in uppercase.
00:10:19 Case folding is usually combined with downcase in practical applications.
00:10:28 It’s implemented using an option of the downcase method.
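
A minimal sketch of case folding for equality checks, using the `:fold` option of `downcase` as available in Ruby 2.4 and later. The two spellings are illustrative: plain `downcase` keeps the ß, while folding maps it to "ss" so the two strings compare equal.

```ruby
a = "Stra\u00DFe"   # "Straße"
b = "STRASSE"

plain_equal  = a.downcase == b.downcase                 # "straße" vs "strasse"
folded_equal = a.downcase(:fold) == b.downcase(:fold)   # "strasse" vs "strasse"

puts plain_equal    # false
puts folded_equal   # true
```
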
00:10:34 If you want to keep track of characters, ensure accurate representation.
00:10:41 There’s another special case concerning title casing.
00:10:50 When capitalizing, the first character used must conform to title case.
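
One way to see the difference between upper case and title case is a digraph character such as U+01C6 (ǆ), which has a distinct title-case form. This is a sketch assuming Ruby 2.4 or later; the choice of character is only an example.

```ruby
dz_small = "\u01C6"                # ǆ (small letter dz with caron)

upper = dz_small.upcase            # U+01C4 Ǆ: both parts upper case
title = dz_small.capitalize        # U+01C5 ǅ: only the first part upper case

puts upper.unpack("U*").map { |cp| format("U+%04X", cp) }.join(" ")
puts title.unpack("U*").map { |cp| format("U+%04X", cp) }.join(" ")
```
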
00:10:58 This month's events showcase relevant examples.
00:11:06 These variations continue, and many more special cases need attention.
00:11:13 Now, looking in detail at the implementation, it might seem straightforward.
00:11:23 However, special cases do lead to complications.
00:11:30 We have twelve methods that we need to implement.
00:11:40 There are non-destructive versions of these methods.
00:11:49 However, we also have destructive versions, which can alter values permanently.
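
The difference between the two families can be sketched as follows. The non-destructive method returns a new string; the destructive one (with `!`) changes the receiver in place and returns `nil` when nothing needed changing.

```ruby
original = "\u00E4pfel".dup        # "äpfel"; dup guarantees a mutable string

copy = original.upcase             # non-destructive: returns a new string
unchanged = original.dup           # original is still "äpfel" at this point

first_call  = original.upcase!     # destructive: modifies original in place
second_call = original.upcase!     # nothing left to change, so this is nil

puts copy, unchanged, original, second_call.inspect
```
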
00:11:56 There's also a method called casecmp, which does case-insensitive comparison and needs care.
00:12:04 Sorting by Unicode gets complex, introducing various challenges in implementation.
00:12:12 Now, if we want to find out how these methods are executed, several resources are available.
00:12:19 You can explore the source code via Subversion or check the related files.
00:12:27 Look for functions like Init_String and others.
00:12:33 The corresponding C functions perform these operations.
00:12:41 Let’s apply this to the symbol’s upcase method.
00:12:48 When executed, this function takes the symbol, converts it to a string, and performs the upcase.
00:12:55 Then it can convert it back to a symbol.
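
The steps just described can be written out at the Ruby level. This is only an illustration of the equivalence, not the C implementation itself; the symbol used is arbitrary.

```ruby
sym = "\u00E9t\u00E9".to_sym              # :été

# Spelled-out version: symbol -> string -> upcase -> symbol
via_string = sym.to_s.upcase.to_sym

# Built-in version
direct = sym.upcase

puts direct.inspect
puts direct == via_string                 # true
```
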
00:13:02 For the non-destructive version, a duplicate is made and the duplicate is altered, leaving the original untouched.
00:13:10 The destructive version modifies the original value.
00:13:20 There are a number of considerations here.
00:13:27 When implemented, there are flags to manage different requirements.
00:13:37 These flags include the upcase, downcase, and title case flags.
00:13:45 Operations based on these flags determine the specific transformations.
00:13:53 Again, these transformations will need to be applied consistently.
00:14:00 The focus is on ensuring transformations take place accurately.
00:14:07 We check the options and assign their corresponding flags.
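
A rough Ruby sketch of the idea of turning options into flags. The real work happens in C; the module name, constant names, and bit values below are hypothetical and chosen only for illustration. The rule that `:fold` is accepted only when downcasing mirrors how `String#upcase(:fold)` raises an `ArgumentError` in Ruby 2.4 and later.

```ruby
# Hypothetical flag values for illustration; not the constants used in Ruby's C source.
module CaseFlagsSketch
  UPCASE     = 1 << 0
  DOWNCASE   = 1 << 1
  TITLECASE  = 1 << 2
  ASCII_ONLY = 1 << 3
  FOLD       = 1 << 4
  TURKIC     = 1 << 5
  LITHUANIAN = 1 << 6

  OPTION_FLAGS = {
    ascii: ASCII_ONLY, fold: FOLD, turkic: TURKIC, lithuanian: LITHUANIAN
  }.freeze

  # Start from the flag for the operation (base) and OR in one flag per option.
  def self.flags_for(base, *options)
    options.reduce(base) do |flags, option|
      flag = OPTION_FLAGS.fetch(option) do
        raise ArgumentError, "invalid option: #{option.inspect}"
      end
      if option == :fold && (base & DOWNCASE).zero?
        raise ArgumentError, "option :fold only allowed for downcasing"
      end
      flags | flag
    end
  end
end

puts CaseFlagsSketch.flags_for(CaseFlagsSketch::UPCASE, :turkic).to_s(2)
```
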
00:14:14 Then the actual work of string case mapping begins.
00:14:22 Managing string expansions presents additional challenges.
00:14:29 Using a linked list of buffers is one way to address these challenges.
00:14:37 The goal is to use as few buffers as possible.
00:14:45 When buffers are filled, you must reassess your size estimates.
00:14:52 Refinement occurs through iterative size adjustments.
00:15:00 The approach involves systematic buffer actions.
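
To make the buffer idea concrete, here is a simplified Ruby model of it. It is not the C code: the `OutputChunk` struct, the method name, and the fixed `capacity` are assumptions for illustration. Output characters go into the current chunk, a new chunk is linked in when the current one is full, and all chunks are concatenated at the end.

```ruby
# Hypothetical model of a linked list of output buffers.
OutputChunk = Struct.new(:chars, :next_chunk)

def upcase_with_chunks(str, capacity: 4)
  head = current = OutputChunk.new([], nil)
  chunk_count = 1

  str.each_char do |ch|
    # A single input character may expand to several output characters.
    ch.upcase.each_char do |out|
      if current.chars.size >= capacity
        current.next_chunk = OutputChunk.new([], nil)
        current = current.next_chunk
        chunk_count += 1
      end
      current.chars << out
    end
  end

  # Walk the list once and copy everything into the final string.
  result = String.new("", encoding: str.encoding)
  node = head
  while node
    result << node.chars.join
    node = node.next_chunk
  end
  [result, chunk_count]
end

result, chunks = upcase_with_chunks("stra\u00DFe", capacity: 4)
puts "#{result} (#{chunks} chunks)"
```

A larger `capacity` means fewer chunks, which matches the stated goal of using as few buffers as possible.
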
00:15:06 Next, we utilize the above case mapping functions based on encoding.
00:15:14 Different encodings can necessitate different handling.
00:15:22 For example, variations exist between UTF-8 and the ISO-8859 encodings.
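
From the Ruby side, the encoding-specific handling is invisible: the same `upcase` call works on strings in different encodings and keeps the encoding. A sketch, assuming Ruby 2.4 or later, where non-ASCII case mapping is supported for UTF-8, UTF-16, UTF-32, and the ISO-8859 family:

```ruby
utf8   = "\u00E9"                               # "é"
latin1 = utf8.encode(Encoding::ISO_8859_1)      # one byte: 0xE9
utf16  = utf8.encode(Encoding::UTF_16LE)        # two bytes: 0xE9 0x00

results = [utf8, latin1, utf16].map do |s|
  up = s.upcase
  puts "#{up.encoding}: #{up.bytes.map { |b| format('%02X', b) }.join(' ')}"
  up
end
```
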
00:15:30 Alright, let’s explore some practical examples.
00:15:39 The Latin-1 example serves as a straightforward introduction.
00:15:47 Consider a basic loop that iterates over characters for transformations.
00:15:56 Special cases, like the sharp S, are addressed in this scenario.
00:16:05 The need to ensure cases harmonize in the system remains vital.
00:16:14 Additional character cases may have specific lower-case equivalents.
00:16:23 If a character has no upper-case equivalent, we can skip modification.
00:16:30 When converting from lower case to upper case, related calculations apply.
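
The Latin-1 loop can be sketched in Ruby as a byte-by-byte pass. This is an illustration of the idea, not the actual C function: the method name is hypothetical, and only upcasing is covered. In ISO-8859-1, lower-case letters sit 0x20 above their upper-case partners, ß (0xDF) expands to "SS", and ª (0xAA), µ (0xB5), º (0xBA), and ÿ (0xFF) have no upper-case partner inside Latin-1.

```ruby
# Hypothetical sketch of upcasing an ISO-8859-1 string byte by byte.
def latin1_upcase_sketch(str)
  out = []
  str.each_byte do |byte|
    case byte
    when 0xDF                       # sharp s expands to two bytes: "SS"
      out << 0x53 << 0x53
    when 0xAA, 0xB5, 0xBA, 0xFF     # no upper-case equivalent in Latin-1: keep
      out << byte
    when 0x61..0x7A,                # a-z
         0xE0..0xF6, 0xF8..0xFE     # à-ö and ø-þ (0xF7 is the division sign)
      out << byte - 0x20            # upper case is 0x20 below lower case
    else
      out << byte                   # everything else is unchanged
    end
  end
  out.pack("C*").force_encoding(Encoding::ISO_8859_1)
end

input  = "stra\u00DFe caf\u00E9 \u00FF".encode(Encoding::ISO_8859_1)
output = latin1_upcase_sketch(input)
puts output.encode(Encoding::UTF_8)
```
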
00:16:37 Now, let's appreciate the contributions made by students.
00:16:44 They include valuable feedback and ideas.
00:16:50 As for new characters, existing encodings should be inclusive.
00:16:58 When addressing various encodings, examine implementations.
00:17:05 Further improvements must attend to any issues discovered.
00:17:12 While we seek input, continuous focus on testing is essential.
00:17:19 Ensuring accurate test coverage protects the robustness of the implementation.
00:17:28 With over 20 million tests conducted, consistency is upheld.
00:17:35 Integration offers significant insight into maintainability.
00:17:41 Developers are urged to commit frequently.
00:17:49 Utilizing tools to prevent errors during this process is advisable.
00:17:56 For implementations, look for chances to reinforce testing.
00:18:04 Additionally, future developments should be considered for scalability.
00:18:12 As we identify flaws in the dummy encodings, we must address them.
00:18:21 The transition toward UTF-8 encoding should be gradual.
00:18:29 Any suggestions on how to approach this migration are welcomed.
00:18:36 To conclude, acknowledgments are essential for future improvements.
00:18:44 Please share your feedback and support for the new features.
00:18:52 Thank you for your attention!