Talks

Squeezing Unicode Names into Ruby Regular Expressions

RubyKaigi 2024

00:00:14.440 Good afternoon, everybody. Welcome to this talk about squeezing Unicode names into Ruby regular expressions. My name is Martin Dürst.
00:00:31.679 Before we start, I did a little search yesterday and found a snippet that said that RubyKaigi 224 Martin and others discussed the future of Ruby expressions.
00:00:39.640 If you looked at the discussion, that would mean that AI saved me a lot of work. So, maybe AI has a lot of foresight; perhaps it thought about this evening or whatever. It's one of these hallucinations or it probably just saw old RubyKaigi data and inferred that if you give a talk, then you have to say discuss because that applies to most past RubyKaigi events.
00:01:01.640 This is some information about me. I want to especially mention here that this work was done with a student of mine. Unfortunately, he isn't here; he loves trains very much and took a job with a train company. In Okayama, there are not too many trains anyway.
00:01:26.520 Today, we are talking about regular expressions and Unicode properties. Generally, you have a string and you match it with a regular expression. In the regular expression, you can use a backslash followed by 'p' and then you can use a property. In this particular case, it's actually a property value; it's the property value of the string property, and this matches a Hiragana character.
00:01:39.440 Because we have a Hiragana character here, it says it matches at position one. What I would like to do, and I would appreciate your feedback after the talk, is to match a character with the name here so that it should match this letter 'K'. However, what we currently get is an invalid character property name because it actually thinks that this is the property name. The only name would be the property name, and this would be the property value.
00:02:20.420 Now, the question arises: why do we need that? This is a valid question because we can already match like this directly on the character, or we can match with a Unicode code point escape. It is interesting in some cases but not strictly necessary.
00:02:54.800 This data is from a few years back, but the situation is essentially the same. Unicode has very good support for binary properties because the way it handles properties in the regular expression engine is well done—it is very well suited for binary properties. However, it doesn't have much support for multivalued properties; only at that time it was 7 out of 43, and it is probably still around that because Unicode likes to add properties. The name property is one of the unsupported multivalued properties.
00:03:39.240 Here are some examples of names; you see there are all kinds of names like 'Latin capital letter A', which are descriptive names. You just have the actual code point number in the name. You can still recognize it's 'A', or you have very short names even for special characters like emojis, but you can also have very long names. If you read carefully, you can figure out that it describes two diagonal lines in a square.
00:04:09.720 You can find more details about the name properties in the Unicode standard's Section 4.8. However, I want to point out that this isn't documented anywhere in the Unicode standard. When I worked on this, I figured out that the name property is the oldest character property; it existed before Unicode. It was widely used to definitively identify characters across encodings. When we had ISO 646 in Europe, different countries had different representations for characters, but the name was used to identify characters.
00:04:46.320 Nowadays, Unicode and ISO 10646 are centered on identifying characters. Unicode and ISO 10646 were developed in parallel. Unicode has a lot of properties, but ISO did not have many properties until recently. However, they always insisted on the name, and these names are also translated into French and sometimes into Russian, as ISO standards are often translated into these languages. This raises the question: do we also want to support the Japanese character names? That is not covered in this talk.
00:06:07.640 It's very important to note that Unicode character names never change. If there is an error, it is never changed; there is an alias that can be fixed, but the name is never changed. Even in the Unicode compendium, they moved around over 3,000 Korean code points, but they never changed the names. The first file with Unicode data, the UnicodeData.txt file, has the code point first, followed immediately by the name, emphasizing that this is a very important property.
00:06:26.480 On the other hand, this property is a strange kind of multivalued property because it has the most values. Each character has a different value; most other multivalued properties, like 'age' or 'block', have many characters sharing the same property. Therefore, we need a lot of data, and that is the main part of this talk. We need a function where we can give it a name and get back a code point.
00:07:13.920 The reverse is also interesting, and there are some other interesting features, but that will be discussed as future work. If we want to do this just in Ruby, we can use a hash. We could also use binary search or a search tree. For hashing, it will use lots of memory, and binary search will lose a lot of memory too. We are considering using a search tree. The search tree we will use is called a TI tree.
00:07:59.440 The way it works is we start with the empty string at the root, and then on each level, we have one child for each character that we can choose. The next character is at the next level, so to get to Hiragana, which is just the start of the name, we can imagine we have 36 different streets going up from a crossing. You take Street H, go down a block, then take Street I, and so on. Each crossing or node in this tree would look like this: we have a crossing that stands for 'Latin capital letter A'. We have already passed quite a few crossings, and here we can actually have a code point because 'Latin capital letter A' is a character name.
00:09:01.560 However, it could also be that the name is still longer, for example, 'Latin capital letter A'. We need to go the next step; we can look for the code point here. If we want to go down the A street, we have a pointer here to another node. Overall, we have 37 entries here, which is quite a bit. The size we need really depends on the code point calculation; today, usually, pointers would be 64 bits, but we will use relative pointers, which would be 32 bits or less.
00:09:47.559 To simplify things, we choose 32 bits based on the machine architecture. So, we are working with an entry size of four bytes. Here are these four bytes, or 32 bits, numbered from 0 to 31. For a code point, we would just place the data at the lowest 21 bits, which is how it shows up in the data file in our implementation. We have a lot of comments for debugging, but this is just the data. For pointer entries, we don't know exactly how many bits we need. It turned out during the implementation that we started with quite a few nodes, requiring many pointers, and then we were able to shrink that, using the freed space to implement some more features.
00:11:25.040 If we just do that very simply, we end up with an overall node size of 148 bytes, and we need about 100,000 nodes, which means we currently need around 50 megabytes—this isn't great so far. Now we come to the important part: tree structures. In computer science, we learn that in trees, most nodes are leaves. For a binary tree, about half of the nodes are leaves. However, if we have four children per node, three-quarters of the nodes are leaves. In our case, we have 36 children per node, so almost every node is a leaf, which means if we can use leaves, we can save a lot of space.
00:12:39.440 How do these leaf nodes look in our case? We have one code point, and then we have 36 null pointers, so we do not have any additional data. Therefore, instead of having a pointer, we can have the code point directly here, which we know belongs to a specific letter. By doing this, we can save a significant amount.
00:12:54.919 We still have nodes where some pointers lead to other nodes, but of the 36 different paths that diverge at a crossing, maybe only two, three, or even just one leads somewhere. Thus, we can compact these nodes. One way to achieve this is to use an additional index, which says where each entry is located. We can also use linear search with labeled entries. This is what we implement; we have these labeled entries that use six bits. Since the name characters can have only 36 different values (uppercase letters or numbers), we reduce the pointers to 21 bits.
00:14:02.639 The code point is also 21 bits, and we use six bits to indicate which direction the pointer is pointing. We can now distinguish these entries with the next bit being a code point bit. We have a list of labeled nodes, and to distinguish them from full nodes, we use a labeled first bit and a labeled last bit so we can know the source and destination of the link.
00:15:04.079 This compression reduces the nodes where we have a short branch of entries. As we observe, character names usually have a pattern. For example, when looking at 'Latin capital letter A', all letters from A to Z follow the same pattern. But in some cases, each crossing leads to only one available pathway. For instance, looking at the name 'Rolling on the floor laughing', there are many smiling faces that are represented uniquely in Unicode, but there’s only one representation of a serious face.
00:16:01.159 Therefore, if we're forming a linear path down the tree, we only need four bytes per character. However, even those four bytes are a substantial amount. We hypothesize that six bits should suffice, squeezing up to five characters into a single entry. To do that, we create radix branches; these radix trees function by having between two and eight entries followed by words. The first entry is marked with a radix flag to indicate that we are on a radix branch.
00:17:22.080 We have three skip bits to show how many entries are in this radix expansion. This is employed in the labeled mode to skip over these radix entries. We then have zero to six intermediate entries containing five characters, with the last entry allowing for one additional character if necessary and the payload for the next step, which can be a code point or another branching point. In this configuration, the first radix entry has the radix flag marked, followed by the skip bits and the first character—and so forth for characters.
00:18:44.520 The first entry signifies the number of entries in the radix branch. It’s interesting to note that these arrangements are in little-endian format. Initially, I thought big-endian was the more logical choice, but it turns out that little-endian is arguably more efficient for our code. Intermediate entries consist of multiple characters—five characters in total—and the last entry may incorporate an additional character along with the payload. This data structure is effectively shifted by six bits, leading us to efficiently account for characters.
00:20:06.640 If we look at the most efficient case, we can achieve about 0.9 bytes saved per character, which leads us to a saving of 0.99 bytes per character. Additionally, another significant reduction can occur because many names are defined algorithmically, such as Unicode's 11,000 angle and syllables. So we can treat certain characters, like decimal numbers from 1 to 8, algorithmically to produce smaller representations, which avoids needing extensive storage for unused characters.
00:21:40.880 Currently, I didn't manage to integrate this directly into regular expressions; however, I did add a method to convert Unicode names to code points directly within the String class. Thus, you can utilize the functionality without complications. This system is capable of testing all Unicode data, including all the syllables; it performs functional checks on special cases which all pass the tests with notable performance results measuring approximately 300,000 names processed per second.
00:22:40.560 Our current memory requirements are about 400,000 bytes, which is significantly less than the raw data size of around one million bytes. We are generally working with more than the gzip data size; there has been some research into utilizing gzip-like encodings while maintaining access to specific entries, although that was beyond my desired scope.
00:23:33.720 Future improvements could involve freeing up bits, or if there are additional suggestions on how to optimize memory usage, we could also explore extending functionality, such as including character sequences. In summary, we have successfully implemented a method for code lookup through Unicode character names, utilizing Ternary trees, radix trees, and carefully tailored bit-squeezing techniques to save memory. The implementation works, but it still requires significant cleanup and is currently available on GitHub; I advise not to scrutinize it too intently.
00:24:14.920 I look forward to receiving your feedback on whether you find this useful and how you would like to interact with it. Additionally, it would be enlightening to know how much memory you are willing to allocate because all the code, not the data, is read-only and with modern operating systems, it loads once for all processes. In closing, I would like to offer my acknowledgments: as usual, to the Ruby community and this time specifically to my student, Kya, for their collaboration on this project. The presentation will also be made available, linked for your convenience.