00:00:00
Hello everybody! Can you hear me? Thank you. Okay, let's go ahead.
00:00:10
This is an outline with about 45 parts. These are some of the conventions used.
00:00:17
It's pretty much what you need to know, but I would like to know something about you.
00:00:25
Who here writes code using more than just the letters A to Z? Okay, there are some.
00:00:39
Who uses encodings other than ASCII? That's probably about the same group of people.
00:00:47
Now, another question: who uses encodings other than UTF-8? There are still quite a few.
00:01:00
Okay, this is just a brief introduction about myself.
00:01:07
I worked at W3C, so all my slides are in HTML with not a single line of JavaScript, using the available power here.
00:01:14
Anyway, this is a summary of my contributions to Ruby.
00:01:20
As of yesterday, Ruby has been updated to Unicode version 9.0, thanks to a lot of work.
00:01:27
The process went very smoothly, and you can look up the Unicode version through RbConfig.
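As a small sketch of that lookup, assuming Ruby 2.4 or later (where, to my knowledge, the `UNICODE_VERSION` key is available in `RbConfig::CONFIG`):

```ruby
require "rbconfig"

# Unicode version this Ruby build was compiled against, e.g. "9.0.0" on Ruby 2.4.
unicode_version = RbConfig::CONFIG["UNICODE_VERSION"]
puts "Ruby #{RUBY_VERSION} uses Unicode #{unicode_version}"
```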
00:01:34
Over the last two to three years, we have had a stable relationship with Unicode.
00:01:41
Unicode is usually published every summer with a new version every year.
00:01:48
Ruby is typically published around Christmas every year.
00:01:54
The release cycles are just off by a little.
00:02:00
But here are the formulas to calculate these versions if you want to extrapolate.
00:02:13
However, don’t extrapolate too far. If we go to Ruby 3.0 in 2020, these formulas will start to be inaccurate.
00:02:22
Now let’s get to the main topic of this talk.
00:02:27
This is about case conversion or case mapping in Ruby.
00:02:39
In Ruby 2.2, we have the following methods: upcase, downcase, capitalize, and swapcase.
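For reference, a minimal illustration of the four methods on an ASCII-only string (the sample text is just an example, not taken from the slides):

```ruby
text = "ruby Kaigi"

results = {
  upcase:     text.upcase,      # every letter to upper case
  downcase:   text.downcase,    # every letter to lower case
  capitalize: text.capitalize,  # first letter upper, the rest lower
  swapcase:   text.swapcase     # upper becomes lower and vice versa
}

results.each { |name, value| puts "#{name}: #{value}" }
```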
00:02:44
They all look pretty good.
00:02:50
However, let's take a closer look at these methods.
00:02:57
I slightly updated this and most of the characters are not upcased even if I use upcase.
00:03:10
This leads to the conclusion that in Ruby up to version 2.3, case conversion works only for the ASCII letters; letters outside ASCII are left unchanged.
00:03:20
However, these have been implemented in Ruby 2.4, and this talk discusses what is happening behind the scenes.
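A short sketch of the difference, assuming Ruby 2.4 or later and a UTF-8 source file (the sample words are my own):

```ruby
latin    = "résumé"
cyrillic = "привет"

# On Ruby 2.4+ the accented and Cyrillic letters are upcased too.
# On Ruby 2.3 and earlier only the ASCII letters would change ("RéSUMé").
latin_upcased    = latin.upcase
cyrillic_upcased = cyrillic.upcase

puts latin_upcased
puts cyrillic_upcased
```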
00:03:35
There are quite a few scripts that have two cases. Georgian is a special case because it has complications.
00:03:44
Some minority scripts are adopting the idea of upper cases and lower cases.
00:03:51
Historically, the distinction between upper case and lower case didn't exist.
00:03:59
It was introduced in the 15th century as a functional distinction.
00:04:07
Modern case usage varies by language.
00:04:12
Even for a single language, how you uppercase words can differ.
00:04:19
In English, whether you uppercase words in a title can depend on your location.
00:04:26
There's a famous example that we learned in schools in Germany or Switzerland.
00:04:32
We were taught to always ensure correct uppercasing and lowercasing to avoid misunderstandings.
00:04:42
This might be a bit outdated, but many people once believed ASCII was sufficient.
00:04:48
Most people no longer maintain that position, but we must consider backwards compatibility.
00:04:57
Changing our method might pose dangers. We might need to revert to older methods.
00:05:05
For those migrating, implementing a new option to use Unicode is vital.
00:05:11
This way, we can adopt new features without disrupting existing applications.
00:05:19
Consider checking your codebase for potential upcase and downcase issues.
00:05:26
Testing early is essential.
00:05:31
There should be a preview for Ruby 2.4 coming out soon.
00:05:38
What specific problems could occur? One potential issue is DNS servers.
00:05:50
DNS servers define case conversion only for ASCII. This has been unchanged for ages.
00:05:57
That’s why we have some issues like Punycode.
00:06:06
In cases such as this, you should have been using ASCII since the beginning.
00:06:10
Similarly, if you allow non-ASCII characters in user IDs, you could face challenges.
00:06:18
If you decide to handle case transformations, you always run the risk of inconsistencies.
00:06:28
Ensure consistency in your database since mixing character cases can lead to matching issues.
00:06:35
This is particularly problematic after updating to Ruby 2.4.
00:06:41
Special cases also exist, particularly for Turkic languages such as Turkish.
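A sketch of the Turkic special case, assuming Ruby 2.4 or later, where the case mapping methods accept a `:turkic` option. In these languages the dotted and dotless i are separate letters:

```ruby
default_lower = "I".downcase           # ordinary mapping: "i"
turkic_lower  = "I".downcase(:turkic)  # dotless "ı" (U+0131)
turkic_upper  = "i".upcase(:turkic)    # dotted "İ" (U+0130)

puts [default_lower, turkic_lower, turkic_upper].join(" ")
```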
00:06:52
For backwards compatibility, you can utilize an ASCII option.
00:07:00
This option allows you to revert to behaviors consistent with Ruby 2.3.
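For illustration, assuming Ruby 2.4 or later, the `:ascii` option restricts the conversion to the letters A to Z:

```ruby
word = "résumé"

full_unicode = word.upcase          # all letters converted
ascii_only   = word.upcase(:ascii)  # only a-z converted, as in Ruby 2.3

puts full_unicode
puts ascii_only
```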
00:07:08
To manage conversions, everyone generally knows how to convert between uppercase and lowercase.
00:07:15
It's a basic exercise taught in programming classes.
00:07:24
However, Unicode introduces complexities beyond basic letter transformations.
00:07:32
Unicode maintains data sets for these different transformations.
00:07:41
There are special cases where a single character may change into multiple characters.
00:07:50
For instance, the German sharp S (ß) transforms into two 'S' characters.
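A small sketch of this one-to-many mapping, assuming Ruby 2.4 or later. It also shows that going to upper case and back does not restore the original string:

```ruby
street = "Straße"

upcased    = street.upcase     # "ß" expands to "SS"
round_trip = upcased.downcase  # the "ß" is not recovered

puts "#{street} (#{street.length}) -> #{upcased} (#{upcased.length}) -> #{round_trip}"
```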
00:08:00
There are also transformations that don’t revert cleanly.
00:08:09
For example, the Greek capital sigma (Σ) has two lowercase forms, σ and the word-final ς, so the correct choice depends on the context rules in the Unicode Standard.
00:08:19
Currently, this is not yet implemented, but I hope to get it done.
00:08:27
Another special case to consider is called simple case mapping.
00:08:36
Unicode offers special specifications for this process.
00:08:41
In Ruby, strings often change in length.
00:08:49
Implementing length consistency while manipulating strings can be challenging.
00:08:56
The next special case involves target case that must maintain consistent casing.
00:09:06
In situations involving periods, we must ensure they don't affect case.
00:09:14
Inaccuracies in name spelling can lead to significant miscommunication.
00:09:22
This remains a serious concern, especially with punctuation.
00:09:30
Particularly, in accented characters, the accents might go unnoticed.
00:09:38
However, technology may not always capture these details.
00:09:47
This less important detail is a challenge to address.
00:09:54
We also have case folding, a critical method for equality checking.
00:10:03
This method can distort values when comparing two strings.
00:10:11
For example, a sharp S in lowercase becomes SS in uppercase.
00:10:19
Case folding is usually combined with downcasing in practical applications.
00:10:28
It’s implemented as an option on the downcase method.
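A sketch of case folding for equality checks, assuming Ruby 2.4 or later, where `downcase` accepts a `:fold` option:

```ruby
a = "Straße"
b = "STRASSE"

# Plain downcasing keeps "ß", so the two spellings still differ.
plain_equal = a.downcase == b.downcase

# Case folding maps "ß" to "ss", so both fold to the same string.
folded_equal = a.downcase(:fold) == b.downcase(:fold)

puts "plain: #{plain_equal}, folded: #{folded_equal}"
```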
00:10:34
If you want to keep track of characters, ensure accurate representation.
00:10:41
There’s another special case concerning title casing.
00:10:50
When capitalizing, the first character used must conform to title case.
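One illustration, assuming Ruby 2.4 or later, uses the digraph letter dž, which has separate upper case (Ǆ), title case (ǅ), and lower case (ǆ) forms in Unicode:

```ruby
lower = "\u01C6"  # ǆ, lower case digraph

upper = lower.upcase      # Ǆ (U+01C4), both parts upper case
title = lower.capitalize  # ǅ (U+01C5), only the first part upper case

puts "#{lower} -> upcase #{upper}, capitalize #{title}"
```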
00:10:58
This month's events showcase relevant examples.
00:11:06
These variations continue, and many more special cases need attention.
00:11:13
Now, looking in detail at the implementation, it might seem straightforward.
00:11:23
However, special cases do lead to complications.
00:11:30
We have twelve methods that we need to implement.
00:11:40
There are non-destructive versions of these methods.
00:11:49
However, we also have destructive versions, which modify the receiver in place.
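One plausible way to arrive at twelve (my own reading, not stated explicitly here) is four methods on String, their four bang variants, and four on Symbol. The sketch below checks that these exist and shows the difference between the two String variants:

```ruby
BASE_NAMES = %i[upcase downcase capitalize swapcase].freeze

string_methods = BASE_NAMES + BASE_NAMES.map { |name| :"#{name}!" }
symbol_methods = BASE_NAMES

all_defined = string_methods.all? { |m| String.method_defined?(m) } &&
              symbol_methods.all? { |m| Symbol.method_defined?(m) }
total = string_methods.size + symbol_methods.size

greeting  = "ruby".dup
copy      = greeting.upcase   # returns a new string; greeting is unchanged
before    = greeting.dup
changed   = greeting.upcase!  # modifies greeting itself and returns it
unchanged = greeting.upcase!  # nothing left to change, so this returns nil

puts "#{total} methods, all defined: #{all_defined}"
```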
00:11:56
There's also a method called casecmp, which compares two strings while ignoring case.
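A brief sketch of `casecmp`. As far as I know from the Ruby documentation, it ignores case only for the letters A to Z, so non-ASCII letters that differ in case do not compare as equal:

```ruby
ascii_result  = "ruby".casecmp("RUBY")  # 0 means equal when ignoring case
order_result  = "a".casecmp("B")        # negative: "a" sorts before "b"
accent_result = "äöü".casecmp("ÄÖÜ")    # non-zero: only A-Z is case-folded

puts [ascii_result, order_result, accent_result].inspect
```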
00:12:04
Sorting by Unicode gets complex, introducing various challenges in implementation.
00:12:12
Now, if we want to find out how these methods are executed, several resources are available.
00:12:19
You can explore the source code via Subversion or check the related files.
00:12:27
Look for functions like Init_String and others.
00:12:33
The corresponding C functions perform these operations.
00:12:41
Let’s apply this to the symbol’s upcase method.
00:12:48
When executed, this function takes the symbol, converts it to a string, and performs the upcase.
00:12:55
Then it can convert it back to a symbol.
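The steps just described can be mirrored in Ruby. This is only a sketch with a hypothetical method name; the real implementation is a C function:

```ruby
# Hypothetical Ruby-level equivalent of Symbol#upcase:
# symbol -> string -> upcased string -> symbol.
def symbol_upcase_sketch(sym)
  sym.to_s.upcase.to_sym
end

puts symbol_upcase_sketch(:ruby).inspect
```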
00:13:02
For the non-destructive version, a duplicate is made and the duplicate is altered, so the original stays untouched.
00:13:10
The destructive version modifies the original value.
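In Ruby terms, the relationship between the two versions might be sketched like this (the method name is hypothetical, and the real code is written in C):

```ruby
# Hypothetical sketch: the non-destructive version is the destructive
# version applied to a duplicate of the receiver.
def upcase_copy_sketch(str)
  copy = str.dup
  copy.upcase!  # may return nil if nothing changed, so return copy explicitly
  copy
end

original = "ruby"
result   = upcase_copy_sketch(original)
puts "#{original} -> #{result}"
```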
00:13:20
There are numerous perspective considerations here.
00:13:27
When implemented, there are flags to manage different requirements.
00:13:37
These flags include the upcase, downcase, and title case flags.
00:13:45
Operations based on these flags determine the specific transformations.
00:13:53
Again, these transformations will need to be applied consistently.
00:14:00
The focus is on ensuring transformations take place accurately.
00:14:07
We check the options and assign their corresponding flags.
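A rough sketch of how options might be turned into bit flags. All constant and method names here are hypothetical and the bit values are arbitrary; the actual C code uses its own names and layout:

```ruby
# Hypothetical flag bits, for illustration only.
FLAG_UPCASE    = 1 << 0
FLAG_DOWNCASE  = 1 << 1
FLAG_TITLECASE = 1 << 2
FLAG_FOLD      = 1 << 3
FLAG_TURKIC    = 1 << 4
FLAG_ASCII     = 1 << 5

OPTION_FLAGS = { ascii: FLAG_ASCII, turkic: FLAG_TURKIC, fold: FLAG_FOLD }.freeze

# Start from the flags for the operation and add one bit per option.
def flags_for(operation_flags, *options)
  options.reduce(operation_flags) do |flags, option|
    bit = OPTION_FLAGS.fetch(option) { raise ArgumentError, "invalid option: #{option}" }
    if option == :fold && operation_flags != FLAG_DOWNCASE
      raise ArgumentError, "option :fold is only allowed for downcasing"
    end
    flags | bit
  end
end

upcase_flags     = flags_for(FLAG_UPCASE)
capitalize_flags = flags_for(FLAG_UPCASE | FLAG_TITLECASE, :turkic)
fold_flags       = flags_for(FLAG_DOWNCASE, :fold)
```

Here capitalizing is modeled as upcase plus title case for the first character; that combination is an assumption of the sketch.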
00:14:14
Then the actual work of string case mapping begins.
00:14:22
Managing string expansions presents additional challenges.
00:14:29
Using a linked list of buffers is one way to address these challenges.
00:14:37
The goal is to use as few buffers as possible.
00:14:45
When buffers are filled, you must reassess your size estimates.
00:14:52
Refinement occurs through iterative size adjustments.
00:15:00
The approach involves systematic buffer actions.
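The buffer idea can be sketched in Ruby as follows. This is an illustration with hypothetical names and a tiny fixed capacity, not the C implementation, which estimates buffer sizes rather than using a constant:

```ruby
# One node in a singly linked list of output buffers.
BufferNode = Struct.new(:data, :next_node)

# Upcase character by character, starting a new buffer when the
# current one would exceed its capacity (measured in bytes).
def upcase_with_buffer_chain(str, capacity: 8)
  head = tail = BufferNode.new("".dup, nil)
  str.each_char do |ch|
    mapped = ch.upcase  # may be longer than ch, e.g. "ß" -> "SS"
    if !tail.data.empty? && tail.data.bytesize + mapped.bytesize > capacity
      tail.next_node = BufferNode.new("".dup, nil)
      tail = tail.next_node
    end
    tail.data << mapped
  end
  head
end

# Walk the chain once to build the final string and count the buffers.
def join_buffer_chain(head)
  result = "".dup
  count = 0
  node = head
  while node
    result << node.data
    count += 1
    node = node.next_node
  end
  [result, count]
end

joined, buffer_count = join_buffer_chain(upcase_with_buffer_chain("straße und maße"))
puts "#{joined} (#{buffer_count} buffers)"
```

The final string is only assembled once, after the total length is known, which is the point of chaining buffers instead of repeatedly resizing one.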
00:15:06
Next, we utilize the above case mapping functions based on encoding.
00:15:14
Different encodings can necessitate different handling.
00:15:22
For example, variations exist between UTF-8 and ISO 8859 encodings.
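To illustrate, assuming Ruby 2.4 or later (which, as far as I know, supports case mapping for the ISO-8859 family as well as UTF-8), the same letter is upcased within its own encoding:

```ruby
utf8   = "é"
latin1 = utf8.encode("ISO-8859-1")

utf8_bytes   = utf8.upcase.bytes    # "É" as two bytes in UTF-8
latin1_bytes = latin1.upcase.bytes  # "É" as a single byte in ISO-8859-1

puts utf8_bytes.map { |b| format("%02X", b) }.join(" ")
puts latin1_bytes.map { |b| format("%02X", b) }.join(" ")
```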
00:15:30
Alright, let’s explore some practical examples.
00:15:39
The Latin-1 example serves as a straightforward introduction.
00:15:47
Consider a basic loop that iterates over characters for transformations.
00:15:56
Special cases, like the sharp S, are addressed in this scenario.
00:16:05
The need to ensure cases harmonize in the system remains vital.
00:16:14
Additional character cases may have specific lower-case equivalents.
00:16:23
If a character has no upper-case equivalent, we can skip modification.
00:16:30
When converting from lower case to upper case, related calculations apply.
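The Latin-1 loop described above might look roughly like this in Ruby. It is a sketch with hypothetical method names; the byte ranges follow the ISO-8859-1 code chart, where most lower case letters sit 0x20 above their upper case counterparts:

```ruby
# Upcase a sequence of ISO-8859-1 bytes.
def latin1_upcase_bytes(bytes)
  bytes.flat_map do |b|
    case b
    when 0xDF                   then [0x53, 0x53]  # sharp s expands to "SS"
    when 0xF7, 0xFF             then [b]           # division sign; y-diaeresis has no upper case in Latin-1
    when 0x61..0x7A, 0xE0..0xFE then [b - 0x20]    # a-z and accented lower case letters
    else [b]                                       # everything else is left alone
    end
  end
end

# Convenience wrapper working on UTF-8 strings, for demonstration.
def latin1_upcase_sketch(str)
  bytes = str.encode("ISO-8859-1").bytes
  latin1_upcase_bytes(bytes).pack("C*").force_encoding("ISO-8859-1").encode("UTF-8")
end

puts latin1_upcase_sketch("straße é ÷ ÿ")
```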
00:16:37
Now, let's appreciate the contributions made by students.
00:16:44
They include valuable feedback and ideas.
00:16:50
As for new characters, existing encodings should be inclusive.
00:16:58
When addressing various encodings, examine implementations.
00:17:05
Further improvements must attend to any issues discovered.
00:17:12
While we seek input, continuous focus on testing is essential.
00:17:19
Accurate test coverage protects a robust implementation.
00:17:28
With over 20 million tests conducted, consistency is upheld.
00:17:35
Integration offers significant insight into maintainability.
00:17:41
Developers are urged to commit frequently.
00:17:49
Utilizing tools to prevent errors during this process is advisable.
00:17:56
For implementations, look for chances to reinforce testing.
00:18:04
Additionally, future developments should be considered for scalability.
00:18:12
As we identify flaws in the dummy encodings, we must address them.
00:18:21
The transition toward UTF-8 encoding should be gradual.
00:18:29
Any suggestions on how to approach this migration are welcomed.
00:18:36
To conclude, acknowledgments are essential for future improvements.
00:18:44
Please share your feedback and support for the new features.
00:18:52
Thank you for your attention!