Talks

The Three-Encoding Problem

You’ve probably heard of UTF-8 and know about strings, but did you know that Ruby supports more than 100 other encodings? In fact, your application probably uses three encodings without you realizing it. Moreover, encodings apply to more than just strings. In this talk, we’ll take a look at Ruby’s fairly unique approach to encodings and better understand the impact they have on the correctness and performance of our applications. We’ll take a look at the rich encoding APIs Ruby provides and by the end of the talk, you won’t just reach for force_encoding when you see an encoding exception.

RubyConf Mini 2022

00:00:12.740 I am a staff engineer at Shopify and I am here today to talk to you about Ruby and encodings
00:00:20.539 so I work in a group at Shopify called the Ruby and rails infrastructure team we're basically tasked with ensuring the
00:00:26.939 longevity of Ruby one of the nifty things that we think about at Shopify is how can we ensure the company will be
00:00:33.480 here 100 years from now and since we're so heavily invested in Ruby that largely means how do we ensure
00:00:39.420 Ruby will be a relevant technology 100 years from now so we do all sorts of things we have
00:00:44.640 people working directly on Ruby and rails fixing bugs improving performance adding new features we focus on the
00:00:51.360 developer experience if you caught Jenny's talk earlier about supply chain security that's another thing the group does
00:00:59.340 I work on TruffleRuby which is an alternative implementation of the Ruby programming language we have an emphasis
00:01:05.880 on very high performance by being a new implementation we're not bound by some
00:01:11.400 of the design decisions in CRuby so we're able to optimize things in ways
00:01:18.299 that CRuby is unable to do so we try
00:01:23.400 to maintain a very high degree of compatibility though we can run native extensions and we're working on
00:01:28.680 production deployments right now so yeah the three encoding problem
00:01:34.080 encoding is one of those things that I find a lot of Rubyists
00:01:39.659 don't have a good handle on and you can get very far in Ruby without really understanding encodings up until Ruby
00:01:46.439 1.9 encodings really didn't even exist in Ruby the way they do today um but while you can get very far with it
00:01:53.579 because it's a very powerful abstraction if you don't understand encodings when you invariably run into an issue you're
00:02:00.060 bound to resolve it incorrectly which can lead to lost or corrupt data
00:02:05.520 so to get started we'll just take a peek at Ruby strings this is about as simple as it gets we
00:02:10.979 have a string literal here and I'm using the magic comment syntax to associate the encoding with the string as US-ASCII
00:02:19.560 Ruby gives you a lot of API calls to kind of inspect strings and dig into
00:02:24.720 their internals so here we can see that a string is made up of bytes and the string had three characters
00:02:31.200 a, b, and c, and these are the bytes for them and then we can also get the
00:02:36.239 encoding associated with the string so a string in Ruby really is just an array of bytes and an encoding to make
00:02:42.959 sense of those and what we get out of that is an array of characters
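A sketch of the inspection calls being described on the slide (I'm using a plain UTF-8 literal here rather than the magic comment, so the snippet is self-contained):

```ruby
# A string is an array of bytes plus an encoding that makes sense of them.
s = "abc"

p s.bytes     # => [97, 98, 99]
p s.length    # => 3 (characters, not bytes)
p s.encoding  # => #<Encoding:UTF-8> (the default for literals)
p s.chars     # => ["a", "b", "c"]
```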
00:02:48.319 so at the simplest level an encoding is really just a mapping function it takes
00:02:53.940 an array of bytes in and produces an array of characters there are different encodings so you can have different
00:03:00.420 lists of characters and then you can have different byte representations for them so pausing for a moment I just want to
00:03:07.260 clarify some terminology throughout the talk I'm going to be using kind of loose terms if you really get into encodings
00:03:13.560 there are some very very precise definitions on things but for our
00:03:18.659 purposes we'll consider an encoding to just be this mapping of a byte representation to a character a
00:03:25.140 character really is a difficult one to define we have printable and non-printable
00:03:30.480 characters anything that you can put in a string we'll just call a character so it
00:03:36.480 includes all the major writing systems punctuation digits and so on
00:03:43.500 then we have code points if an encoding is a map of bytes to characters then it stands to reason that an encoding
00:03:49.980 has a set of valid characters and all a code point is is an index to one of
00:03:55.019 those characters and then there are also code units which we won't talk about too much but a code
00:04:01.620 unit is just a fancy way of saying how a code point maps to bytes so Ruby just
00:04:07.799 kind of uses bytes but if you look at some of the Unicode stuff they'll use the term code unit and you can just
00:04:14.340 mentally substitute bytes for that so revising our definition a little bit we have two levels to an encoding we
00:04:22.440 have the encoding that takes a code point that index into the list and gives you a character and then we have a
00:04:28.800 mapping from bytes to a particular code point so Ruby also allows us to check the code
00:04:34.620 points for a string and you can see everything lines up here so given that string "abc" we get the set
00:04:40.320 of bytes sorry a list of bytes because order matters and there are three of those then those map directly to the three
00:04:47.880 code points and in this case they have the same values that's why I used ASCII that was by
00:04:53.699 design that's not a requirement and we'll show some examples in a bit where that doesn't line up
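The byte/code point/character alignment being shown on the slide can be checked directly (for ASCII data the byte values and code point values happen to coincide):

```ruby
s = "abc"

p s.bytes       # => [97, 98, 99]  — the raw byte list (order matters)
p s.codepoints  # => [97, 98, 99]  — same values here, by design of ASCII
p s.chars       # => ["a", "b", "c"]
```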
00:04:58.919 and then for each code point we have a corresponding character um so unfortunately to really
00:05:05.100 understand encodings we do need to take a look at some of the history here I know it's not the most exciting topic
00:05:11.040 but just bear with me if you haven't heard of it before there's ASCII which
00:05:16.680 is the American Standard code for information interchange this was a very popular encoding
00:05:23.100 particularly in American businesses and computer science and with ASCII we have
00:05:28.740 128 characters of those only 95 of them are printable so this is a really
00:05:35.100 restricted set of representations and everything fits into seven bits
00:05:40.620 because when ASCII was created being able to represent characters in seven bits was a benefit over eight bits
00:05:46.500 so this Loosely corresponds to Modern English we can handle the letters a
00:05:52.139 through z lowercase same thing uppercase digits some punctuation and a selection
00:05:58.740 of symbols and then there's other things like control characters which tell the
00:06:03.840 computer to go do something like inserting a new line and we have white space and then ASCII has its own history
00:06:10.320 that it kind of supports so there's some characters for ringing a bell uh for old style terminals
00:06:17.460 um what you can see is it it's really quite restricted and I think this has even influenced how we write
00:06:23.580 certain things in English for instance we have prices that might be more naturally represented in cents but there
00:06:29.759 is no cent symbol so we write them as dollars instead um and we can represent the word
00:06:35.880 resume but not the word résumé and I think over time at least in the US we've
00:06:42.300 just dropped putting those accented characters in uh so ASCII was around for a really long
00:06:49.860 time so any language that really wants to see widespread usage needs to support ASCII
00:06:55.800 and Ruby does this at two different levels so you can look at an encoding and ask if it's ASCII compatible and in
00:07:03.000 this case we're using US-ASCII which will trivially be ASCII compatible and we'll dig into that in a little bit
00:07:08.940 and then on a per string basis you can ask if it's made up of only ASCII characters that is are all the characters
00:07:16.139 in the string drawn from that ASCII encoding
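The two levels being described map onto two predicate methods, one on the encoding and one on the string:

```ruby
# Encoding-level check: is this encoding ASCII compatible?
p Encoding::US_ASCII.ascii_compatible?  # => true
p Encoding::UTF_8.ascii_compatible?     # => true

# String-level check: does this particular string contain only ASCII characters?
p "abc".ascii_only?   # => true
p "très".ascii_only?  # => false (è is outside ASCII)
```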
00:07:21.539 the world is much bigger than the US and we can't even represent everything that we would in English so we've
00:07:27.000 collectively needed to move beyond ASCII and there have been a lot of different ways this has been proposed there are a ton of
00:07:33.120 different encodings out there but I'm going to focus on UTF-8 so
00:07:38.160 taking this string très the French word for very it logically has four characters but it takes up five bytes so
00:07:47.340 I tried to line things up in the array here but that third character with the code point 232 you can see that value is
00:07:54.539 greater than 127 the top of ASCII's range and that maps to two
00:08:00.419 bytes which brings us to Unicode if you're not familiar with it Unicode is
00:08:06.240 probably the most popular encoding system in use today there's a big standards body for it Unicode creates
00:08:12.960 these version releases it currently supports 150,000 different characters includes writing systems for all sorts
00:08:21.180 of world languages it even has new control characters that allow you to flip the reading direction of text and
00:08:27.660 things like that additional white space characters um but one thing about Unicode is it has
00:08:35.339 to try to resolve differences because writing systems and languages are cultural and throughout time different
00:08:41.459 cultures have used roughly the same alphabet or the same encoding system
00:08:47.220 um and like in the modern era they might disagree on how certain things should be
00:08:52.620 interpreted so the Unicode standard body tries to resolve these but it often doesn't do it satisfactorily for all so
00:08:59.040 while Unicode is in widespread use it's definitely not the only encoding out there
00:09:05.220 uh complicating things a little bit is Unicode defines these things called transformation formats so Unicode has
00:09:11.820 you know these 150,000 characters so it's 150,000 code points um and Unicode tries to be ASCII
00:09:18.420 compatible in the sense that each of the ASCII code points maps to the same value in Unicode but we still have to represent
00:09:25.260 those as bytes in one way or another so UTF-8 is probably the one you're most familiar with and this is what's known
00:09:30.959 as a variable width encoding so each character in UTF-8 can take up between one and four bytes
00:09:37.740 UTF-32 goes the other way it's really simple everything takes up four bytes
00:09:42.839 that's 2 to the 32 possible values it represents everything in Unicode but the downside is every character takes up four bytes
00:09:49.440 so it's not terribly memory efficient and UTF-16 sits somewhere in between
00:09:54.959 um because it's 16 bits it can represent 65,536 values or so inside of one of these code units but if
00:10:02.459 you need to move beyond that then you need two of them you probably won't use UTF-16 and
00:10:07.620 UTF-32 that frequently in Ruby but UTF-16 comes up with JavaScript so if you're doing web development that might
00:10:13.620 be something you run into so that's kind of encodings in a nutshell
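Taking the "très" example from earlier, the transformation formats can be compared by byte size (I'm using the explicit little-endian variants so no byte-order mark is involved):

```ruby
s = "très"                       # 4 characters, the French word for "very"

p s.length                       # => 4
p s.bytesize                     # => 5  — in UTF-8, è takes two bytes
p s.codepoints                   # => [116, 114, 232, 115]

p s.encode("UTF-16LE").bytesize  # => 8  — two bytes per character here
p s.encode("UTF-32LE").bytesize  # => 16 — always four bytes per character
```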
00:10:19.140 um Ruby on top of that has a very rich system for interfacing with encodings
00:10:24.779 so Ruby is pretty unique I think at least in terms of the modern VM based kind of scripting languages
00:10:31.980 um most languages nowadays will have a unified internal representation of strings if you're working with C
00:10:39.240 that predates all of this but JavaScript for instance represents all strings internally as UTF-16
00:10:46.440 so it can interface with other encodings but it normalizes everything and does
00:10:51.779 this kind of conversion process Ruby doesn't do that for one reason Ruby has
00:11:02.820 to be a bit more inclusive than you might see with some other languages and acknowledges that the entire world isn't
00:11:09.240 using UTF-8 or Unicode so as a consequence Ruby can actually
00:11:09.240 work with these other encodings quite efficiently it doesn't have to do any kind of conversion but the flip side is
00:11:15.320 Ruby has to now handle multiple encodings pretty pervasively
00:11:20.459 um so there's this Encoding class in Ruby you can call .list on it and get
00:11:26.399 the list of all the encodings it ships with over 100 out of the box and getting back to that ASCII
00:11:33.240 compatibility I talked about here are three encodings that I just happen to know are ASCII compatible they represent
00:11:40.019 different types of characters but for each of the ASCII characters they have
00:11:45.420 the same code point values which you can see in that array and then critically each uses the same byte representation for
00:11:51.600 each of those code points so you can take an ASCII string and read it without
00:11:56.820 any loss of data in any one of these ASCII compatible encodings
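I can't see which three encodings are on the slide, but the same property can be demonstrated with any ASCII-compatible trio, for example:

```ruby
s = "abc"  # pure ASCII content

["US-ASCII", "ISO-8859-1", "Shift_JIS"].each do |name|
  enc = Encoding.find(name)
  p enc.ascii_compatible?   # => true for all three
  p s.encode(name).bytes    # => [97, 98, 99] — identical bytes in every one
end
```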
00:12:02.339 um so about 90% of the encodings in Ruby are ASCII compatible there's a really
00:12:07.800 good chance what you're working with is ASCII compatible the ones that you probably
00:12:13.800 are more likely to run into that aren't ASCII compatible are UTF-16 and UTF-32 so
00:12:19.740 again this is kind of the distinction because Unicode uses the same code point values for the ASCII code points if you
00:12:26.820 just look at the code points you'll get the same arrays but when you look at the byte representation UTF-16 will use two bytes per ASCII
00:12:34.920 character and UTF-32 four bytes per character so since we have multiple encodings
00:12:41.940 um you might need to convert strings from one encoding to another and that brings in this
00:12:48.120 process called transcoding and we've kind of looked at that throughout the talk every time we've
00:12:53.160 called String#encode that's a transcoding process up until now we've mostly looked at ASCII compatible
00:12:59.220 encodings and ASCII strings so there really is no conversion necessary
00:13:04.760 but here you can see the case with UTF-16 if we call encode it changes the
00:13:11.339 byte representation to have two bytes per character critically this transcoding process is
00:13:18.360 error prone or it can fail and this might be the first type of exception you'll run into
00:13:24.139 so here it's a bit of a contrived example but I have a UTF-8 string with non-ASCII
00:13:30.060 characters and I'm converting it to ASCII and then you get this undefined conversion error and if you're not
00:13:35.880 familiar with encodings this can be really difficult to comprehend the message has this kind of funny U plus
00:13:42.480 with four hex digits and then indicates the two
00:13:47.760 encodings involved you just have to know that that U plus means this is a
00:13:52.860 Unicode code point value and then you probably need to convert that from hexadecimal to decimal so you can look
00:13:57.959 it up in the table um and if you're not familiar with that you might just capture the exception and
00:14:04.320 try to move on one way or another the other case you might run into is an invalid
00:14:09.779 byte sequence so this is a case where maybe you've read something from a network and you didn't get the full string so you have bytes at the end
00:14:18.120 that don't really make sense for the target encoding you're trying to transcode to and here you get a very
00:14:25.320 similar type of message but because we don't know that it's Unicode the message now switches to a hexadecimal byte format and
00:14:32.459 again if you're not familiar with this stuff it can be really overwhelming
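A sketch of both failure modes, plus the replacement options discussed next (the strings here are my own stand-ins for whatever is on the slides; `String#scrub` is a related convenience for cleaning invalid bytes in place):

```ruby
# Undefined conversion: è simply has no representation in US-ASCII.
begin
  "très".encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  puts e.message            # mentions U+00E8, UTF-8, and US-ASCII
end

# Invalid byte sequence: a truncated multi-byte character, as if a network
# read stopped mid-character. 0xC3 opens a two-byte UTF-8 sequence that never finishes.
broken = "caf\xC3".dup.force_encoding("UTF-8")
begin
  broken.encode("UTF-16LE")
rescue Encoding::InvalidByteSequenceError => e
  puts e.message
end

# The escape hatches: replace what can't be converted — and lose data.
p "très".encode("US-ASCII", undef: :replace)  # => "tr?s"
p broken.scrub                                # => "caf\uFFFD" (the replacement character)
```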
00:14:37.560 uh so Ruby gives us a few different ways to handle transcoding errors on that String#encode method we can provide a
00:14:44.399 couple of keyword arguments that allow you to control what happens when you encounter undefined characters or invalid byte
00:14:51.180 sequences if you're converting to an ASCII encoding and you ask to replace either kind of character
00:14:59.699 you'll get this question mark in the middle so if you've ever seen a string that confusingly has a question mark in the middle that's probably what happened
00:15:06.779 if you're dealing with non-ASCII data and it's Unicode you'll get this replacement character that looks like a question mark in
00:15:12.420 a diamond and these are definitely ways of resolving the error but you've lost data
00:15:18.779 in the process so you really need to be sure that's what you intended to do because if you put this into a database
00:15:25.079 and it was a user's name or something then the next time you read it back out you're going to have these weird
00:15:30.660 characters in there and your customer might be curious why that is Ruby also gives you the sledgehammer
00:15:37.740 approach so because a string is just an array of bytes with an associated encoding you can just tell Ruby hey I
00:15:44.940 know better just change the encoding and I've worked with teams that have done this and it definitely gets rid of the
00:15:50.940 error but it is probably not what you wanted to do and will invariably lead to corrupt data it exists for very narrow
00:15:59.339 use cases and it is important to have in Ruby and a lot of that has to do with backwards compatibility but if you see
00:16:05.820 force_encoding being used somewhere in your code base it's worth taking another look to see if that's really
00:16:11.100 something you intended which brings us to the notion of broken strings and this is another area where
00:16:17.220 Ruby is a little bit different from its modern peers so first of all
00:16:23.699 strings in Ruby are mutable you can just randomly change bytes in them and you can override the encoding it is possible
00:16:30.060 to have a byte array for which the associated encoding is just utter nonsense
00:16:35.399 so here I took that string très and told Ruby no it's actually US-ASCII and
00:16:42.480 it doesn't change any of the bytes but now when you start performing string operations you get kind of funny answers
00:16:47.579 back um so you can check each string to see if it's broken
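A sketch of that sledgehammer in action, and the check for the resulting broken string:

```ruby
s = "très".dup
s.force_encoding("US-ASCII")  # reinterpret the same bytes; nothing is converted

p s.bytes.length     # => 5 — the bytes are untouched
p s.valid_encoding?  # => false — 0xC3 0xA8 is not valid US-ASCII, the string is broken
p s.length           # => 5 — each invalid byte now counts as a (nonsense) "character"

p "très".valid_encoding?  # => true — the honest UTF-8 version is fine
```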
00:16:52.740 um there's this method called valid_encoding? and you'd expect for most strings you work with this will return true
00:16:58.680 so you have this difference between valid and broken I think you'll find if you look into it a lot of people call them broken strings but the API for it
00:17:06.959 is valid_encoding? whether a string is broken or not is just a property of the string either it
00:17:12.900 is broken or it is not but Ruby allows broken strings to float through the system which means you can make calls on
00:17:20.579 broken strings and the result you get actually depends on where the broken character appears in the string this
00:17:26.579 should be treated as an implementation detail and not something to rely upon please do not write
00:17:33.720 specs or unit tests that check that if the string's broken you get this particular answer back because it really depends on the
00:17:40.860 operation and where the broken character appears but this brings us to binary strings
00:17:48.360 so a lot of this has to do with Ruby's backwards compatibility but because again a Ruby string is just an array of bytes and
00:17:54.539 there really is no other mechanism for an array of bytes in Ruby historically Ruby came up with this idea of well we'll
00:18:01.679 just have kind of a dummy encoding called ASCII-8BIT and it's aliased as BINARY which I
00:18:08.160 think is a more descriptive name and this is an encoding that doesn't actually map to any characters it just says this is binary data
00:18:14.700 and we use this when reading from network sockets or from files where we're just kind of reading arbitrary
00:18:20.280 bytes and we don't know if we've read the entire thing and hit character boundaries and all that
00:18:27.559 unfortunately this encoding does report that it is ASCII compatible and this creates
00:18:35.039 another kind of common situation where I see errors you can work with strings that are actually binary but if they
00:18:42.000 only consist of ASCII data you can treat them like other strings and things basically work and then the first time
00:18:48.299 you get a multi-byte character it stops working and no one knows why because this could be code that's existed for
00:18:55.740 a long time and now suddenly you have user supplied input coming through a web form or you're reading a file from
00:19:01.980 someplace and you're seeing data you didn't previously expect and all your tests might pass the code might not have
00:19:08.340 been touched for six months but confusingly you have these errors now uh interestingly
00:19:14.940 binary strings by definition cannot be broken so the valid_encoding? method will always return true in those cases
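A small sketch of those binary-string properties, including the failure mode where ASCII-only binary data mixes silently with UTF-8 until multi-byte data arrives (example strings are my own):

```ruby
bin = "\xFF\xFE".b            # String#b returns a binary (ASCII-8BIT) copy
p bin.encoding                # => #<Encoding:ASCII-8BIT>
p bin.valid_encoding?         # => true — binary strings can never be broken

# The surprising part: the binary encoding claims ASCII compatibility.
p Encoding::ASCII_8BIT.ascii_compatible?  # => true

# So ASCII-only binary data mixes with UTF-8 without complaint...
p "abc".b + "def"             # => "abcdef" — just works

# ...until multi-byte data shows up on either side:
begin
  "caf\xC3\xA9".b + "très"
rescue Encoding::CompatibilityError => e
  puts e.message
end
```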
00:19:21.620 and yeah they're really useful for I/O so to give an example we'll use StringIO
00:19:27.240 so we don't actually have to hit a disk and here we're going to wrap a multi-byte character string and if we
00:19:33.480 just read the entire string we get back like a buffer and its encoding is UTF-8
00:19:39.000 and that encoding is overridable but that's the default internal encoding Ruby uses
00:19:45.120 then we rewind it so we can read from the beginning but this time we give it a number of bytes and here I'm just going
00:19:50.340 to give it the total number of bytes of the string now we get back a string that's ASCII-8BIT and that's a bit
00:19:56.039 contrived but it gives you an idea of how making a call a certain way you can get back data you didn't quite expect
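A sketch of that StringIO behavior, including the chunked-read situation discussed next (I'm reusing "très" as the wrapped string; the slide's exact string may differ):

```ruby
require "stringio"

io = StringIO.new("très")

whole = io.read
p whole.encoding    # => #<Encoding:UTF-8> — reading everything uses the default

io.rewind
chunk = io.read(5)  # ask for a byte count instead
p chunk.encoding    # => #<Encoding:ASCII-8BIT> — length-limited reads come back binary
p chunk == whole    # => false — same bytes, but the encodings aren't comparable

# Reading in chunks can split a character: each piece is broken on its own,
# but concatenating the bytes and retagging the result restores valid UTF-8.
io.rewind
a = io.read(3)                                   # "tr\xC3" — the è is cut in half
b = io.read(2)                                   # "\xA8s"
p a.dup.force_encoding("UTF-8").valid_encoding?  # => false
p (a + b).force_encoding("UTF-8") == "très"      # => true
```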
00:20:03.480 the real case you might use this is if you're trying to read data in chunks
00:20:08.580 from either file or network often you don't want the whole thing in memory
00:20:13.860 so you read in these fixed-size byte chunks and then if you pass one on to another part of the system that system
00:20:21.360 needs to realize that it's not really working with string data it's working with binary data in this case I
00:20:29.100 structured it such that each of the chunks will get one of the two bytes for that ç the c-cedilla character
00:20:35.640 so both of the chunks will actually be broken um and this is also why Ruby does allow
00:20:41.520 broken strings to exist because you can concatenate them back together then you have the full byte sequence and it's
00:20:47.160 valid UTF-8 data so with all that we'll build up to a fun
00:20:53.940 little trick you can show your friends when you go home we'll start by taking two strings we'll
00:20:59.760 just create one as a string literal and one by calling String.new just two different ways of creating strings
00:21:05.460 we can check if they're equal and we get back what we expected I hope uh and now we're going to use that fun
00:21:12.240 little shovel operator borrowed from C++ to push data onto the end of each this will grow the string if needed and this is
00:21:18.600 typically how you'll see binary data added to a string now let's check if they're equal and
00:21:24.299 suddenly they're not so what gives take a look at the byte data now one of them has two bytes and one of them only has
00:21:30.840 one byte the problem is that an empty string
00:21:36.059 literal and String.new are not actually equivalent one will create a UTF-8
00:21:41.280 string and one will create a binary string and this is another way I've seen
00:21:46.799 people really get themselves into confusing errors because they'll want to
00:21:51.960 do something like reading bytes in chunks and they'll allocate a buffer and they're just so used to allocating it as an
00:21:59.280 empty string that they don't really think much about the encoding and what that shovel operator actually
00:22:04.980 takes is not byte values but code point values and that 0x80 code point takes
00:22:11.640 up two bytes in UTF-8
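The whole trick fits in a few lines:

```ruby
a = ""          # a UTF-8 string literal
b = String.new  # defaults to ASCII-8BIT (binary)

p a == b        # => true — both empty, so far so good
p a.encoding    # => #<Encoding:UTF-8>
p b.encoding    # => #<Encoding:ASCII-8BIT>

a << 0x80       # << takes a code point, not a byte...
b << 0x80
p a.bytesize    # => 2 — code point 0x80 is two bytes in UTF-8
p b.bytesize    # => 1 — in binary it's just the single byte 0x80
p a == b        # => false
```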
00:22:18.900 so that brings us to the next kind of semi-novel feature of Ruby because we have different
00:22:25.200 strings with different encodings floating through the system we need them to interact with each other Ruby really puts an emphasis on the
00:22:31.320 developer experience as I'm sure you're aware so Ruby does not want to force you to transcode everything to the same
00:22:37.559 encoding because if everything's the same encoding then these operations are pretty trivial
00:22:42.720 but again like 90% of the encodings are ASCII compatible and a lot of the time
00:22:48.179 you'll be working with some form of ASCII data so in those situations we can concatenate two
00:22:54.600 strings that are ASCII-only trivially regardless of what the encoding is
00:23:00.539 um so we can check this compatibility ourselves going back to the Encoding
00:23:05.580 class there's this compatible? method on it and you give it two objects
00:23:11.059 and the behavior of that method changes depending on the type of the objects
00:23:16.260 if you give it two encodings it tries to give you the one that has the superset
00:23:21.360 of all the available code points so if we check if US-ASCII and UTF-8 are
00:23:27.960 compatible we get back an answer of UTF-8 and this is the first thing that kind of throws people off even though it
00:23:33.780 ends in a question mark it doesn't return a Boolean value it returns an encoding or it returns nil
00:23:39.659 and here if we flip the arguments it doesn't matter in this argument order we get back UTF-8 and then if we have
00:23:45.419 two encodings that aren't compatible you just get back nil so this is something you could check if you wanted to be a
00:23:50.520 bit more defensive
00:23:55.679 confusingly things change when you give it two string arguments
00:24:01.559 the documentation says the rules for strings depend on whether the strings are concatenatable and that involves a kind of confusing set of
00:24:06.900 rules the best I can put it is Ruby really tries to make sure the operation succeeds so it looks at the contents of
00:24:13.260 the strings in addition to the encoding to see if the operation could proceed but here you can see that argument order
00:24:20.220 now matters whether you do a plus b or b plus a you get a different encoding
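A sketch of both modes of `Encoding.compatible?`, including the empty-string carve-out mentioned shortly (the specific strings are my own stand-ins for the slides):

```ruby
# With two encodings: returns the "superset" encoding, or nil. Order doesn't matter.
p Encoding.compatible?(Encoding::US_ASCII, Encoding::UTF_8)  # => #<Encoding:UTF-8>
p Encoding.compatible?(Encoding::UTF_8, Encoding::US_ASCII)  # => #<Encoding:UTF-8>
p Encoding.compatible?(Encoding::UTF_8, Encoding::UTF_16LE)  # => nil

# With two strings: Ruby also inspects the contents, and order DOES matter.
a = "abc"                     # UTF-8, ASCII-only
b = "abc".encode("US-ASCII")
p Encoding.compatible?(a, b)  # => #<Encoding:UTF-8>
p Encoding.compatible?(b, a)  # => #<Encoding:US-ASCII>

# Truly incompatible strings give nil — unless one of them is empty:
p Encoding.compatible?("abc", "très".encode("UTF-16LE"))  # => nil
p Encoding.compatible?("abc", "".encode("UTF-16LE"))      # => #<Encoding:UTF-8>
```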
00:24:27.179 more confusingly if you have two encodings that are not compatible
00:24:32.220 strings in those encodings might still be compatible so if you're trying to again be a bit defensive you really need to
00:24:38.700 make sure you're passing in the right type of object because just giving it an encoding can give you an incorrect
00:24:45.659 perception of how the string operation will proceed uh but if the strings truly aren't
00:24:52.679 compatible then you will get back nil but there are also exceptions carved out
00:24:58.320 for empty strings so again Ruby checks the contents of the strings US-ASCII and UTF-16 generally you can't
00:25:05.580 concatenate but if one of the two of them is empty then suddenly the operation proceeds so your notion of
00:25:12.720 whether two encodings are compatible really depends on the code points available in both encodings and whether they
00:25:19.320 use the same byte representation and then when you get to the string operation level it even depends on the contents of
00:25:26.340 those strings uh so so far we've kind of looked at just encodings for strings but there are
00:25:32.460 encodings for other objects the ones you're more likely to run into are symbols and regular expressions and then
00:25:39.900 I/O has an associated encoding like we looked at and these pop up inside the standard library in various places I
00:25:46.679 don't have time to get really into it but there are different ways to override what the default encoding is when you're
00:25:52.200 reading from files or when you're creating strings Ruby will default to UTF-8 for internally created strings and
00:25:59.220 ASCII-8BIT or that binary encoding for external data
00:26:04.440 uh but we can take a string being converted to a symbol so we've got a string it's UTF-8 we convert it to a symbol we can
00:26:10.140 check that symbol's encoding it's also UTF-8 and if we just want to round trip it and convert that symbol back to a
00:26:15.659 string we get back UTF-8 great that's what I hope everyone expected however if the string consists of only
00:26:22.440 ASCII data Ruby has an exception where it will change the encoding on the
00:26:28.260 symbol to US-ASCII and then that kind of sticks with the symbol so if you convert
00:26:33.539 the symbol back to a string that string will now have the US-ASCII encoding
00:26:39.000 and most of the time this doesn't matter US-ASCII and UTF-8 are ASCII compatible it
00:26:44.640 gets a little trickier if you have changed the default encoding but the point is you could be working with strings in different encodings and not
00:26:51.059 even really realize it this comes up with numerics as well all of the digits are representable in
00:26:57.900 ASCII so if you convert an integer or a floating point value to a string that will have the US-ASCII encoding as well
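The symbol round trip and the numeric cases can be sketched like this:

```ruby
p "très".to_sym.encoding  # => #<Encoding:UTF-8> — non-ASCII symbols keep UTF-8
p "abc".to_sym.encoding   # => #<Encoding:US-ASCII> — ASCII-only symbols get downgraded
p :abc.to_s.encoding      # => #<Encoding:US-ASCII> — and it sticks on the round trip

p 42.to_s.encoding        # => #<Encoding:US-ASCII>
p 3.14.to_s.encoding      # => #<Encoding:US-ASCII>
```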
00:27:06.299 so that brings us back to the title of the talk how many encodings do you think your application uses
00:27:13.740 I posit you probably have three even if you don't realize it so a lot of
00:27:19.020 people think everything is just UTF-8 because that's what Ruby defaults to but that's only true for string literals and
00:27:24.659 then particular operations if you're converting symbols or numbers to strings they're going to come in US-ASCII and if
00:27:31.620 you're dealing with I/O at any level which most applications that need to
00:27:36.840 interface with the outside world do then you could have binary data as well
00:27:42.299 so why does this matter well if we get back to that encoding negotiation process if the operation is on strings
00:27:50.520 with the same encoding then the derived encoding is trivially the encoding in use and that's a very quick operation
00:27:57.020 but if they're not the same encoding then we've got to run through that kind of long set of rules that checks if these
00:28:03.659 two strings are compatible and that rule involves looking at every character in
00:28:09.000 the string so you can have a linear scan of the string that you don't really expect just because you're using two
00:28:15.360 different encodings uh Ruby tries to hide some of this from
00:28:21.000 you so Ruby again supports 100-some-odd encodings out of the box but it tries to
00:28:26.340 be pretty equitable so rather than optimize for particular encodings it optimizes for properties of those
00:28:32.039 encodings so in US-ASCII every code point can be represented in a byte
00:28:37.440 so it's a fixed width encoding every character has the same width just like UTF-32 where every character is four bytes so
00:28:45.360 if you're using the index operator on a string you can really trivially figure out where to jump to because you just
00:28:51.960 have an array of bytes and you can work out the index but with UTF-8 if you have multi-byte
00:28:58.860 character data you have to figure out the character boundaries for every single character there are various ways to shortcut that
00:29:06.620 but if you're trying to get the last character you basically have to walk the whole string
00:29:12.779 additionally because Ruby allows broken strings you can't even use particular tricks in UTF-8 to hop through when you
00:29:20.039 look at the encoding for UTF-8 the first byte in any character tells you how many bytes that character will take up but that doesn't
00:29:27.059 work when you allow broken strings to float through the system um so
00:29:34.799 Ruby has this cached metadata associated with strings something called a code range and as long as the string doesn't
00:29:41.399 change the code range will stay the same and this basically just tells you is the string consisting only of ASCII data and
00:29:48.480 is the string valid and if those properties hold then Ruby is able to
00:29:54.059 optimize operations on those strings as if it were a fixed-width encoding so this is one of the things that again
00:30:01.380 could take an operation like string concatenation and change it from basically constant time aside from the memory copy
00:30:08.940 to a linear time operation because you have to scan the whole string
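A rough sketch of the fast and slow paths described above, using the public APIs that expose those cached properties:

```ruby
# ascii_only? reflects the cached code-range property the talk describes.
ascii = "world".encode(Encoding::US_ASCII)
utf8  = "héllo"
puts ascii.ascii_only?  # => true  (fixed-width fast paths apply)
puts utf8.ascii_only?   # => false (character ops may need a scan)

# ASCII-only data is compatible with any ASCII-compatible encoding:
puts Encoding.compatible?(utf8, ascii)           # => UTF-8
# Non-ASCII data in two different encodings is not:
latin1 = "héllo".encode(Encoding::ISO_8859_1)
puts Encoding.compatible?(utf8, latin1).inspect  # => nil
begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  puts e.class  # => Encoding::CompatibilityError
end
```

`Encoding.compatible?` runs the same negotiation that concatenation does, so it's a handy way to check before an operation raises.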
00:30:14.520 from a behavioral standpoint when you're dealing with encodings you need agreement on the encoding all the way
00:30:20.880 through the system so it doesn't help if you're working with utf-8 data but your database isn't because you're going to
00:30:27.240 start stuffing strings in there and they're going to get corrupted along the way and likewise if you're reading stuff
00:30:33.480 from files or if you have user forms being submitted you need to check the encoding like for a long time ISO 8859-1
00:30:41.940 was a really popular web encoding so that might be a surprising one that you run into
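One way to handle that, sketched below: declare the external encoding when reading legacy data instead of assuming UTF-8. The file name here is just a stand-in for whatever legacy export or form payload you receive.

```ruby
# Create an ISO-8859-1 file to simulate legacy input
# ("latin1.txt" is a hypothetical name for this sketch).
File.binwrite("latin1.txt", "caf\xE9")  # 0xE9 is "é" in ISO-8859-1

# "ext:int" syntax: read as ISO-8859-1, transcode to UTF-8 on the way in.
text = File.read("latin1.txt", encoding: "ISO-8859-1:UTF-8")
puts text           # => café
puts text.encoding  # => UTF-8
```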
00:30:47.580 in addition to a database encoding databases also have this notion of
00:30:52.620 collation that impacts operations like how things sort unfortunately
00:30:58.500 even systems that implement the same encoding can disagree on this so for instance if you're using Postgres it
00:31:05.220 will actually sort utf-8 strings differently than Ruby will they disagree on the relative ordering of characters
00:31:12.799 so comparisons can become another really confusing error to run into something to watch out for
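You can see Ruby's side of that disagreement without a database: String comparison is byte-by-byte with no locale-aware collation, so multi-byte UTF-8 characters sort after every ASCII letter. A Postgres database with a locale collation would typically put "éclair" before "zebra" instead.

```ruby
# Ruby sorts by raw bytes: "é" (0xC3 0xA9) > "z" (0x7A), so it goes last.
words = ["zebra", "apple", "éclair"]
puts words.sort.inspect  # => ["apple", "zebra", "éclair"]
```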
00:31:19.080 and again the way a lot of this stuff breaks is particularly if
00:31:25.440 you're an American company you're primed to think of strings as ASCII data but you're
00:31:31.200 serving an international audience and now suddenly you have someone submitting strings in an encoding you weren't
00:31:37.200 expecting and your seed data might just use all ASCII data maybe using Faker or
00:31:43.740 something and you're not forcing it to give you multi-byte characters so this
00:31:49.320 is one of the reasons I think you really should try to be aware of encodings I hope that after this talk you'll be better
00:31:55.679 equipped to solve some of these errors let's tie this all together I'm happy to say that in Ruby 3.2 there have
00:32:02.460 been changes to try to speed some of this up a lot of this was spearheaded by a colleague of mine named Jean Boussier he
00:32:08.640 goes by byroot if you look at the Ruby commit list so one of the ways we're speeding things
00:32:14.820 up is we're just invalidating the code range value less frequently so if
00:32:20.159 we can cheaply derive what it should be then we can avoid this expensive scan through the string
00:32:25.620 and this also has benefits for copy on write because the code range value is
00:32:31.740 lazily computed and a lot of operations will actually populate it so if you fork
00:32:36.899 and then you have to scan that string you'll basically trigger a metadata
00:32:41.940 write and incur a copy-on-write fault and then string concatenation is now faster this is one area that Jean's been
00:32:48.840 looking at quite a bit here we're modernizing Ruby a little bit to actually optimize for the three most
00:32:55.919 common encodings the ones that Ruby defaults to and they all happen to be ASCII compatible so in those situations
00:33:03.480 we can more trivially figure out if the encodings are compatible and that helps
00:33:09.600 the concatenation operation there's more room for optimization here so I'm really excited that there will be
00:33:16.320 more string performance coming in future Ruby releases but if you're looking at the Ruby release notes now you have a
00:33:21.899 better idea of what that work is all about and yeah that's about it
00:33:27.299 um so we've looked at encodings just the general concept how they apply to Ruby Ruby's different mechanisms for
00:33:34.080 transcoding and changing the encoding on strings there's some history here both in Ruby
00:33:39.240 and just in computing in general for why the encoding system is kind of messy and
00:33:45.419 we hopefully have a richer appreciation for how this can impact performance and behavior
00:33:51.000 I've got a slide here for some resources I'll just publish the slides but these were links I thought might be helpful I
00:33:57.120 wrote a blog post on code ranges that gets into the details of how those work and I've linked to some of the PRs from
00:34:03.539 Ruby 3.2 so you can see how some of the changes were made and that is it thank you for your time