00:00:12.740
I am a staff engineer at Shopify and I am here today to talk to you about Ruby strings and encodings
00:00:20.539
so I work in a group at Shopify called the Ruby and Rails infrastructure team we're basically tasked with ensuring the
00:00:26.939
longevity of Ruby one of the nifty things that we think about at Shopify is how can we ensure the company will be
00:00:33.480
here 100 years from now and since we're so heavily invested in Ruby that largely means how do we ensure
00:00:39.420
Ruby will be a relevant technology 100 years from now so we do all sorts of things we have
00:00:44.640
people working directly on Ruby and Rails fixing bugs improving performance adding new features we focus on the
00:00:51.360
developer experience if you caught Jenny's talk earlier about supply chain security that's another thing the group does
00:00:59.340
I work on TruffleRuby which is an alternative implementation of the Ruby programming language we have an emphasis
00:01:05.880
on very high performance by being a new implementation we're not bound by some
00:01:11.400
of the design decisions in CRuby so we're able to optimize things in ways
00:01:18.299
that CRuby is unable to do so we try
00:01:23.400
to maintain a very high degree of compatibility though we can run native extensions and we're working on
00:01:28.680
production deployments right now so yeah the three encoding problem
00:01:34.080
this talk is about encoding which is one of these things that I find a lot of Rubyists
00:01:39.659
don't have a good handle on and you can get very far in Ruby without really understanding encodings up until Ruby
00:01:46.439
1.9 encodings really didn't even exist in Ruby the way they do today um but while you can get very far with it
00:01:53.579
because it's a very powerful abstraction if you don't understand encodings when you invariably run into an issue you're
00:02:00.060
bound to resolve it incorrectly which can lead to lost or corrupt data
00:02:05.520
so to get started we'll just take a peek at Ruby strings this is about as simple as it gets we
00:02:10.979
have a string literal here and I'm using the magic comment syntax to associate the encoding with the string as US-ASCII
00:02:19.560
Ruby gives you a lot of API calls to kind of inspect strings and dig into
00:02:24.720
their internals so here we can see that a string is made up of bytes and the string has three characters
00:02:31.200
a b and c and these will be the bytes for them and then we can also get the
00:02:36.239
associated encoding of the string so a string in Ruby really is just an array of bytes and an encoding to make
00:02:42.959
sense of those and what we get out of that is an array of characters
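A minimal sketch of the inspection calls being described, using String#encode rather than a magic comment so it stays self-contained:

```ruby
# Build a US-ASCII string and inspect its internals.
str = "abc".encode(Encoding::US_ASCII)

p str.bytes     # => [97, 98, 99]
p str.chars     # => ["a", "b", "c"]
p str.encoding  # => #<Encoding:US-ASCII>
```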
00:02:48.319
so at the simplest level an encoding is really just a mapping function it takes
00:02:53.940
an array of bytes in and produces an array of characters there's different encodings so you can have different
00:03:00.420
lists of characters and then you can have different byte representations for them so pausing for a moment I just wanna
00:03:07.260
clarify some terminology throughout the talk I'm going to be using kind of loose terms if you really get into encoding
00:03:13.560
there's some very very precise definitions on things but for our
00:03:18.659
purposes we'll consider an encoding to just be this mapping of a byte representation to a character a
00:03:25.140
character that really is a difficult one we have printable and non-printable
00:03:30.480
characters like anything that you can put in a string will just call a character so it
00:03:36.480
includes like all the major writing systems punctuation digits and so on
00:03:43.500
then we have code points and if an encoding is a map of bytes to characters then it stands to reason that encoding
00:03:49.980
has a set of valid characters and all a code point is is an index to one of
00:03:55.019
those characters and then there's also code units which we won't talk about too much but code
00:04:01.620
unit is just a fancy way of saying how a code Point maps to bytes so Ruby just
00:04:07.799
kind of uses bytes but if you look at some of the Unicode stuff they'll use the term code unit and you can just
00:04:14.340
mentally substitute bytes for that so revising our definition a little bit we have two levels to an encoding we
00:04:22.440
have an encoding that takes a code point that index into the list and gives you a character and then we have a
00:04:28.800
mapping from bytes to a particular code point so Ruby allows us to also check the code
00:04:34.620
points for a string and you can see everything lines up here so given that string ABC we get the set
00:04:40.320
of bytes sorry a list of bytes because order matters and there's three of those then those map directly to the three
00:04:47.880
code points and in this case they have the same values that's why I used ASCII that was by
00:04:53.699
design that's not a requirement and we'll show some examples in a bit where that doesn't match up
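The lining-up being described can be reproduced like this (a sketch matching the slide's ASCII example):

```ruby
str = "abc".encode(Encoding::US_ASCII)

# In US-ASCII the byte values and the code point values coincide,
# which is exactly why ASCII was chosen for the example.
p str.bytes       # => [97, 98, 99]
p str.codepoints  # => [97, 98, 99]
```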
00:04:58.919
and then for each code point we have a corresponding character um so unfortunately to really
00:05:05.100
understand encodings we do need to take a look at some of the history here I know it's not the most exciting topic
00:05:11.040
but just bear with me if you haven't heard of it before there's ASCII which
00:05:16.680
is the American Standard code for information interchange this was a very popular encoding
00:05:23.100
particularly in American businesses and computer science and with ASCII we have
00:05:28.740
128 characters of those only 95 of them are printable so this is a really
00:05:35.100
restricted set of representations and everything fits into seven bits
00:05:40.620
because when ASCII was created being able to represent seven bits was a benefit over eight bits
00:05:46.500
so this Loosely corresponds to Modern English we can handle the letters a
00:05:52.139
through z lowercase same thing uppercase digits some punctuation and a selection
00:05:58.740
of symbols and then there's other things like control characters which tell the
00:06:03.840
computer to go do something like inserting a new line and we have white space and then ASCII has its own history
00:06:10.320
that it kind of supports so there's some characters for ringing a bell uh for old style terminals
00:06:17.460
um what you can see is it's really quite restricted and I think this has even influenced how we write
00:06:23.580
certain things in English for instance we have prices that might be more naturally represented in cents but there
00:06:29.759
is no Cent symbol so we write them as dollars instead um and we can represent like the word
00:06:35.880
resume but not the word résumé and I think over time at least in the US we've
00:06:42.300
just dropped putting those accented characters in uh so ASCII was around for a really long
00:06:49.860
time so any language that really wants to see widespread usage needs to support ASCII
00:06:55.800
and Ruby does this at two different levels so you can look at an encoding and ask if it's ASCII compatible and in
00:07:03.000
this case we're using US-ASCII which will trivially be ASCII compatible and we'll dig into that in a little bit
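Both levels of the check can be sketched like so:

```ruby
# Encoding-level check: is this encoding ASCII compatible?
p Encoding::US_ASCII.ascii_compatible?  # => true
p Encoding::UTF_8.ascii_compatible?     # => true
p Encoding::UTF_16LE.ascii_compatible?  # => false

# String-level check: is every character in the string drawn from ASCII?
p "abc".ascii_only?   # => true
p "très".ascii_only?  # => false
```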
00:07:08.940
and then on a per string basis you can ask if it's made up of only ASCII characters so that is all the characters
00:07:16.139
in The String are they drawn from that ASCII encoding of course
00:07:21.539
the world is much bigger than the US and we can't even represent everything that we would in English so we've
00:07:27.000
collectively needed to move beyond ASCII and there's been a lot of different ways this has been proposed there's a ton of
00:07:33.120
different encodings out there uh but I'm going to focus on utf-8 so
00:07:38.160
taking this string très the French word for very it logically has four characters but it takes up five bytes so
00:07:47.340
I tried to line things up in the array here but that third character with the code Point 232 you can see that value is
00:07:54.539
greater than 127 which is the max code point of ASCII and that maps to two
00:08:00.419
bytes which brings us to Unicode if you're not familiar with it Unicode is like
00:08:06.240
probably the most popular encoding system in use today there's a big standards body for it Unicode creates
00:08:12.960
these version releases it currently supports 150,000 different characters includes writing systems for all sorts
00:08:21.180
of world languages it even has new control characters that allow you to flip the reading direction of text and
00:08:27.660
things like that additional white space characters um but one thing about Unicode is it has
00:08:35.339
to try to resolve differences because writing systems and languages are cultural and throughout time different
00:08:41.459
cultures have used roughly the same alphabet or the same encoding system
00:08:47.220
um and like in the modern era they might disagree on how certain things should be
00:08:52.620
interpreted so the Unicode standards body tries to resolve these but it often doesn't do it satisfactorily for all so
00:08:59.040
while Unicode is in widespread use it's definitely not the only encoding out there
00:09:05.220
uh complicating things a little bit is Unicode defines these things called transformation formats so Unicode has
00:09:11.820
you know these 150,000 characters so that's 150,000 code points um and Unicode tries to be ASCII
00:09:18.420
compatible in the sense that each of the ASCII code points maps to the same value in Unicode but we still have to represent
00:09:25.260
those as bytes in one way or another so utf-8 is probably the one you're most familiar with and this is what's known
00:09:30.959
as a variable width encoding so each character in UTF-8 can take up between one and four bytes
00:09:37.740
uh utf-32 goes the other way it's really simple everything takes up four bytes
00:09:42.839
with 2 to the 32 possible values it represents everything in Unicode but the downside is every character takes up four bytes
00:09:49.440
so it's not terribly memory efficient and UTF-16 sits somewhere in between
00:09:54.959
um where like because it's 16 bits it can represent 65 000 characters or so inside of one of these code units but if
00:10:02.459
you need to move beyond that then you need two of them um You probably won't use utf-16 and
00:10:07.620
utf-32 that frequently in Ruby but utf-16 comes up with JavaScript so if you're doing web development that might
00:10:13.620
be something you run into uh so that's kind of encodings in a nutshell
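The width differences can be seen by transcoding one string into each transformation format, a sketch using the French word from earlier:

```ruby
str = "très"  # four characters

p str.bytesize                             # => 5, in UTF-8 "è" takes two bytes
p str.encode(Encoding::UTF_16LE).bytesize  # => 8, two bytes per character here
p str.encode(Encoding::UTF_32LE).bytesize  # => 16, always four bytes each
```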
00:10:19.140
um Ruby on top of that has a very rich system for interfacing with encodings
00:10:24.779
so Ruby is pretty unique I think at least in terms of the modern VM based kind of scripting languages
00:10:31.980
um most languages nowadays will have a unified internal representation of strings so if you're working with C like
00:10:39.240
that predates all of this but JavaScript for instance represents all strings internally as UTF-16
00:10:46.440
so it can interface with other encodings but it normalizes everything and does
00:10:51.779
this kind of conversion process Ruby doesn't do that for one reason Ruby has
00:10:57.060
to be a bit more inclusive than you might see with some other languages and acknowledges that the entire world isn't
00:11:02.820
using UTF-8 or Unicode so as a consequence Ruby can actually
00:11:09.240
work with these other encodings quite efficiently it doesn't have to do any kind of conversion but the flip side is
00:11:15.320
Ruby has to now handle multiple encodings pretty pervasively
00:11:20.459
um so there's this Encoding class in Ruby you can call .list on it and get
00:11:26.399
the list of all the encodings it ships with over 100 out of the box and getting back to that ASCII
00:11:33.240
compatibility I talked about uh here are three encodings that I just happen to know are ASCII compatible they represent
00:11:40.019
different types of characters but for each of the ASCII characters they have
00:11:45.420
the same code Point values which you can see in that array and then critically it uses the same byte representation for
00:11:51.600
each of those code points so you can take an ASCII string and read it without
00:11:56.820
any loss of data in any one of these ASCII compatible encodings
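A sketch of both points, with Shift_JIS standing in as a third ASCII-compatible encoding:

```ruby
p Encoding.list.size  # over 100 encodings out of the box

# ASCII text has identical bytes in any ASCII-compatible encoding:
%w[US-ASCII UTF-8 Shift_JIS].each do |name|
  p "abc".encode(name).bytes  # => [97, 98, 99] every time
end
```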
00:12:02.339
um so about 90 percent of the encodings in Ruby are ASCII compatible there's a really
00:12:07.800
good chance what you're working with is ASCII compatible uh the ones that you probably
00:12:13.800
are more likely to run into that aren't ASCII compatible are UTF-16 and UTF-32 so
00:12:19.740
again this is kind of the distinction because Unicode uses the same code Point values for the ASCII code points if you
00:12:26.820
just look at the code points you'll get the same arrays but when you look at the byte representation utf-16 will use two bytes per ASCII
00:12:34.920
character and UTF-32 will use four bytes per character so since we have multiple encodings
00:12:41.940
in the mix you might need to convert strings from one encoding to another and that brings in this
00:12:48.120
process called transcoding and we've kind of looked at that throughout the talk every time we've
00:12:53.160
called String#encode that's a transcoding process up until now we've mostly looked at ASCII compatible
00:12:59.220
encodings and ASCII strings so there really is no conversion necessary
00:13:04.760
but here you can see the case where with UTF-16 if we call encode it changes the
00:13:11.339
byte representation to have two bytes per character critically this transcoding process is
00:13:18.360
error prone or it can fail and this might be the first type of exception you'll run into
00:13:24.139
so here it's a bit of a contrived example but I have a UTF-8 string with non-ASCII
00:13:30.060
characters and I'm converting it to ASCII and then you get this undefined conversion error and if you're not
00:13:35.880
familiar with encodings this can be really difficult to comprehend the message has this kind of funny U plus
00:13:42.480
with four hex digits and then indicates the two
00:13:47.760
encodings involved you just have to know that the U plus means this is a
00:13:52.860
Unicode code point value and then you probably need to convert that from hexadecimal to decimal so you can look
00:13:57.959
it up in the table um and if you're not familiar with that you might just capture the exception and
00:14:04.320
try to move on one way or another the other case you might run into is an invalid
00:14:09.779
byte sequence so this is a case where maybe you've read something from a network and you didn't get the full string so you have like bytes at the end
00:14:18.120
that don't really make sense for the target encoding you're trying to transcode to and here you get a very
00:14:25.320
similar type of message but because we don't know that it's Unicode the message now switches to a hexadecimal format and
00:14:32.459
again if you're not familiar with this stuff it can be really overwhelming
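Both failure modes can be reproduced in a few lines (the strings here are illustrative, not the slide's exact data):

```ruby
# Undefined conversion: the target encoding has no such character.
begin
  "très".encode(Encoding::US_ASCII)
rescue Encoding::UndefinedConversionError => e
  p e.message  # mentions U+00E8, the Unicode code point for "è"
end

# Invalid byte sequence: bytes that are nonsense in the source encoding,
# e.g. a multi-byte character truncated by a partial network read.
begin
  "tr\xE8".force_encoding(Encoding::UTF_8).encode(Encoding::UTF_16LE)
rescue Encoding::InvalidByteSequenceError => e
  p e.message  # reports the offending byte in hexadecimal
end
```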
00:14:37.560
uh so Ruby gives us a few different ways to handle transcoding errors on that string encode method we can provide a
00:14:44.399
couple keyword arguments that allow you to handle whether you encounter undefined characters or invalid byte
00:14:51.180
sequences if you're converting to an ASCII encoding and you want to replace the offending character
00:14:59.699
you'll get this question mark in the middle so if you've ever seen a string that confusingly has a question mark in the middle that's probably what happened
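That ASCII-target case, sketched with the keyword arguments just mentioned:

```ruby
# undef: :replace swaps unconvertible characters for "?" in ASCII targets.
p "très".encode(Encoding::US_ASCII, undef: :replace)  # => "tr?s"

# invalid: :replace handles broken byte sequences the same way.
p "tr\xE8".force_encoding(Encoding::UTF_8)
          .encode(Encoding::US_ASCII, invalid: :replace, undef: :replace)
# => "tr?"
```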
00:15:06.779
if you're dealing with non-ascii data and it's Unicode you'll get this one where it looks like a question mark and
00:15:12.420
a diamond and these are definitely ways of resolving the error but you've lost data
00:15:18.779
in the process so you really need to be sure that's what you intended to do because if you put this into a database
00:15:25.079
and it was a user's name or something then the next time you read it back out you're going to have these weird
00:15:30.660
characters in there and your customer might be curious why that is Ruby also gives you the sledgehammer
00:15:37.740
approach so because a string is just an array of bytes with an Associated encoding you can just tell Ruby hey I
00:15:44.940
know better just change the encoding and I've worked with teams that have done this and it definitely gets rid of the
00:15:50.940
error but it is probably not what you wanted to do and will invariably lead to corrupt data it exists for very narrow
00:15:59.339
use cases and it is important to have in Ruby and a lot of it has to do with backwards compatibility but if you see
00:16:05.820
force_encoding being used somewhere in your code base it's worth taking another look to see if that's really
00:16:11.100
something you intended which brings us to the notion of broken strings and this is another area where
00:16:17.220
Ruby is a little bit different than its modern peers so first of all
00:16:23.699
strings in Ruby are mutable you can just randomly change bytes in them and you can override the encoding it is possible
00:16:30.060
to have a byte array for which the associated encoding is just utter nonsense
00:16:35.399
so here I took that string très and told Ruby no it's actually US-ASCII and
00:16:42.480
it doesn't change any of the bytes but now when you start performing string operations you get kind of funny answers
00:16:47.579
back um so you can check each string to see if it's broken
00:16:52.740
um and there's this method called valid_encoding? you'd expect for most strings you work with this will return true
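The broken-string scenario just described, sketched in code:

```ruby
str = "très".dup       # dup: literals may be frozen/chilled in newer Rubies
p str.valid_encoding?  # => true

# The sledgehammer: reinterpret the same bytes as US-ASCII.
str.force_encoding(Encoding::US_ASCII)
p str.valid_encoding?  # => false, a "broken" string
p str.bytes            # unchanged; force_encoding never touches the bytes
```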
00:16:58.680
so you have this difference between valid and broken I think you'll find if you look into it a lot of people call them broken strings but the API for it
00:17:06.959
is valid_encoding? whether a string is broken or not is just a property of the string either it
00:17:12.900
is broken or it is not but Ruby allows broken strings to float through the system which means you can make calls on
00:17:20.579
broken strings and the result you get actually depends on where the broken character appears in the string this
00:17:26.579
should be treated as an implementation detail and not something to rely upon please do not write
00:17:33.720
specs or unit tests that check that if the string's broken you get a particular answer back because it really depends on the
00:17:40.860
operation and where the broken character appears but this brings us to Binary strings
00:17:48.360
so a lot of this has to do with Ruby compatibility but because again a Ruby string is just an array of bytes and
00:17:54.539
there really is no other mechanism for an array of bytes in Ruby historically we came up with this idea of well we'll
00:18:01.679
just have kind of a dummy encoding called ASCII-8BIT which is aliased as BINARY which I
00:18:08.160
think is a more descriptive name and this is an encoding that doesn't actually map to any characters it just says this is binary data
00:18:14.700
and we use this when reading from network sockets or from files where we're just kind of reading arbitrary
00:18:20.280
bytes and we don't know if we've read the entire thing and hit character boundaries and all that
00:18:27.559
unfortunately this encoding does report itself as ASCII compatible and this creates
00:18:35.039
another kind of common situation where I see errors where you can work with strings that are actually binary but if they
00:18:42.000
only consist of ASCII data you can treat them like other strings and things basically work and then the first time
00:18:48.299
you get a multi-byte character it stops working and no one knows why because this could be a code that's existed for
00:18:55.740
a long time and now suddenly you have user supplied input coming through a web form or you're reading a file from
00:19:01.980
someplace and you're seeing data you didn't previously expect and all your tests might pass the code might not have
00:19:08.340
been touched for six months but confusingly you have these errors now uh interestingly
00:19:14.940
binary strings by definition cannot be broken so the valid_encoding? method will always return true in those cases
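A quick sketch: any byte soup counts as "valid" once it's tagged binary:

```ruby
junk = "\xFF\xFE\x80".dup.force_encoding(Encoding::BINARY)

p junk.encoding         # => #<Encoding:ASCII-8BIT> (BINARY is its alias)
p junk.valid_encoding?  # => true, every byte sequence is valid binary data
p junk.ascii_only?      # => false
```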
00:19:21.620
and yeah they're really useful for I/O so to give an example we'll use StringIO
00:19:27.240
so we don't actually have to hit a disk and here we're going to wrap a multi-byte character string and if we
00:19:33.480
just read the entire string we get back like a buffer and its encoding is UTF-8
00:19:39.000
and that encoding is overrideable but that's the default internal encoding Ruby uses
00:19:45.120
if we rewind it so we can read from the beginning but this time we give it a number of bytes and here I'm just going
00:19:50.340
to give it the total number of bytes of the string now we get back a string that's ASCII-8BIT and that's a bit
00:19:56.039
contrived but it gives you an idea of how making a call a certain way you can get back data you didn't quite expect
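A sketch of that, including the chunked reads and broken halves discussed next; I'm using the "très" string from earlier rather than the slide's exact data:

```ruby
require "stringio"

io = StringIO.new("très")  # 4 characters, 5 bytes in UTF-8
p io.read.encoding         # => #<Encoding:UTF-8>, read-all keeps the encoding

io.rewind
p io.read(5).encoding      # => #<Encoding:ASCII-8BIT>, byte-count reads are binary

# Fixed-size chunks can split the two-byte "è" across two reads:
io.rewind
a = io.read(3).force_encoding(Encoding::UTF_8)  # "tr" plus the first byte of "è"
b = io.read(2).force_encoding(Encoding::UTF_8)  # second byte of "è" plus "s"
p a.valid_encoding?        # => false, both chunks are broken on their own...
p b.valid_encoding?        # => false
p (a + b).valid_encoding?  # => true, ...but rejoined they are valid UTF-8 again
```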
00:20:03.480
the real case you might use this is if you're trying to read data in chunks
00:20:08.580
from either file or network often you don't want the whole thing in memory
00:20:13.860
so you read in these fixed-size byte chunks and then if you pass it on to another part of the system that system
00:20:21.360
needs to realize that it's not really working with string data it's working with binary data in this case I
00:20:29.100
structured it such that each of the chunks will get one of the two bytes for that c-with-cedilla character in it
00:20:35.640
so both of the chunks will actually be broken um and this is also why Ruby does allow
00:20:41.520
broken strings to exist because you can concatenate them back together then you have the full byte sequence and it's
00:20:47.160
valid utf-8 data so with all that we'll build up to a fun
00:20:53.940
little trick you can show your friends when you go home we'll start by taking two strings we'll
00:20:59.760
just create one as a string literal and one by calling String.new just two different ways of creating strings
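The whole trick, sketched (the comments give the punchline away):

```ruby
a = +""          # a string literal (unary + makes it mutable): UTF-8
b = String.new   # => an empty ASCII-8BIT (binary) string!

p a == b         # => true, both empty

a << 0x80        # << takes a code point, not a raw byte value
b << 0x80

p a == b         # => false
p a.bytes        # => [194, 128], U+0080 is two bytes in UTF-8
p b.bytes        # => [128], a single raw byte in binary
```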
00:21:05.460
we can check if they're equal we get back what we expected I hope and now we're going to use that fun
00:21:12.240
little shovel operator (<<) to shovel bytes into the end of it this will grow the string if needed and this is
00:21:18.600
typically how you'll see binary data added to a string now let's check if they're equal and
00:21:24.299
suddenly they're not so what gives take a look at the byte data now one of them has two bytes and one of them only has
00:21:30.840
one byte well what gives the problem is that an empty string
00:21:36.059
literal and String.new are not actually equivalent one will create a UTF-8
00:21:41.280
string and one will create a binary string and this is another way I've seen
00:21:46.799
people like really get themselves into confusing errors because they'll want to
00:21:51.960
do something like reading bytes in chunks and they'll allocate a buffer and they're just so used to allocating an
00:21:59.280
empty string that they don't really think much about the encoding and that shovel operator what it actually
00:22:04.980
takes is not byte values but code point values and that 0x80 hex value takes
00:22:11.640
up two bytes in utf-8 um so that brings us to the next kind of
00:22:18.900
like a semi-novel feature about Ruby because we have different encodings floating through the system different
00:22:25.200
strings with different encodings we need them to interact with each other Ruby really puts an emphasis on the
00:22:31.320
developer experience as I'm sure you're aware so Ruby does not want to force you to transcode everything to the same
00:22:37.559
encoding because if everything's the same encoding then these operations are pretty trivial
00:22:42.720
but again like 90 percent of the encodings are ASCII compatible a lot of the time
00:22:48.179
you'll be working with some form of ASCII data so in those situations you know we can concatenate two
00:22:54.600
strings that are ASCII only trivially regardless of what the encoding is
00:23:00.539
um so we can check this compatibility ourselves going back to the encoding
00:23:05.580
class there's this compatible? method on it and you give it two objects
00:23:11.059
what the behavior of that method is changes depending on the type of objects
00:23:16.260
if you give it two encodings it tries to give you the one that has the superset
00:23:21.360
of all the available code points so if we check if US-ASCII and UTF-8 are
00:23:27.960
compatible we get back an answer of utf-8 and this is the first thing that kind of throws people off even though it
00:23:33.780
ends in a question mark it doesn't return a Boolean value it returns an encoding or it returns nil
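The encoding-arguments form, sketched:

```ruby
# Not a boolean: returns the "superset" encoding, or nil.
p Encoding.compatible?(Encoding::US_ASCII, Encoding::UTF_8)
# => #<Encoding:UTF-8>
p Encoding.compatible?(Encoding::UTF_8, Encoding::US_ASCII)
# => #<Encoding:UTF-8>, argument order doesn't matter for encodings
p Encoding.compatible?(Encoding::UTF_8, Encoding::UTF_16LE)
# => nil, not compatible
```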
00:23:39.659
and here if we flip the arguments it doesn't matter in which argument order we get back UTF-8 and then if we have
00:23:45.419
two encodings that aren't compatible you just get back nil so this is something you could check if you wanted to be a
00:23:50.520
bit more defensive uh confusingly things change when you
00:23:55.679
give it two string arguments the documentation says the rules for strings depend on whether the strings
00:24:01.559
are concatenatable and that involves a kind of confusing set of
00:24:06.900
rules the best I can put it is Ruby really tries to make sure the operation succeeds so it looks at the contents of
00:24:13.260
the strings in addition to the encoding to see if the operation could proceed but here you can see that argument order
00:24:20.220
now matters whether you do a plus b or B plus a you get a different encoding
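With string arguments the contents start to matter; one possible illustration (the specific strings here are mine, not the slide's):

```ruby
utf8 = "très"                             # non-ASCII UTF-8
sjis = "abc".encode(Encoding::Shift_JIS)  # ASCII-only Shift_JIS

# An ASCII-only string is compatible with any ASCII-compatible peer:
p Encoding.compatible?(utf8, sjis)  # => #<Encoding:UTF-8>

# Truly incompatible strings give nil...
utf16 = "très".encode(Encoding::UTF_16LE)
p Encoding.compatible?(utf8, utf16)  # => nil

# ...unless one of them is empty, when the operation can proceed:
p Encoding.compatible?(utf8, "".encode(Encoding::UTF_16LE))  # not nil
```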
00:24:27.179
more confusingly kinda if you have two encodings let's say they're not compatible
00:24:32.220
strings in those encodings might be compatible so if you're trying to again be a bit defensive you really need to
00:24:38.700
make sure you're passing in the right type of object because just giving it an encoding will give you an incorrect
00:24:45.659
perception of how the string operation will proceed uh but if the strings truly aren't
00:24:52.679
compatible then you will get back nil but there are also exceptions carved out
00:24:58.320
for empty strings so again Ruby checks the contents of the strings with US-ASCII and UTF-16 generally you can't
00:25:05.580
concatenate them but if one of the two of them is empty then suddenly the operation proceeds so your notion of
00:25:12.720
whether two encodings are compatible really depends on the code points available in both encodings whether they
00:25:19.320
use the same byte representation then when you get to the string operation level it even depends on the contents of
00:25:26.340
those strings uh so so far we've kind of looked at Just encodings For Strings but there are
00:25:32.460
encodings for other objects the ones you're more likely to run into are symbols and regular expressions and then
00:25:39.900
I/O has an associated encoding like we looked at and these pop up inside the standard library in various places I
00:25:46.679
don't have time to get really into it but there's different ways to override what the default encoding is when you're
00:25:52.200
reading from files or when you're creating strings Ruby will default to utf-8 for internally created strings and
00:25:59.220
ASCII 8-bit or that binary encoding for external data
00:26:04.440
uh but we can take a string being converted to a symbol so we got a string it's UTF-8 converted to a symbol we can
00:26:10.140
check that symbol's encoding it's also UTF-8 and if we just want to round trip it and convert that symbol back to a
00:26:15.659
string we get back utf-8 great that's what I hope everyone expected however if the string consists of only
00:26:22.440
ASCII data Ruby has an exception where it will change the encoding on the
00:26:28.260
symbol to US-ASCII and then that kind of sticks with the symbol so if you convert
00:26:33.539
the symbol back to a string that string will now have the US ASCII encoding
00:26:39.000
and most of the time this doesn't matter US-ASCII and UTF-8 are ASCII compatible it
00:26:44.640
gets a little trickier if you have changed the default encoding but the point is you could be working with strings in different encodings and not
00:26:51.059
even really realize it this comes up in numerics as well so all of the digits are representable in
00:26:57.900
ASCII so if you convert an integer or a floating point value to a string that will have the US-ASCII encoding as well
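The symbol and numeric round trips, sketched:

```ruby
# Non-ASCII symbols keep UTF-8 through the round trip:
p "très".to_sym.to_s.encoding  # => #<Encoding:UTF-8>

# ASCII-only strings get special-cased to US-ASCII on the way through:
p "abc".encoding               # => #<Encoding:UTF-8>
p "abc".to_sym.encoding        # => #<Encoding:US-ASCII>
p "abc".to_sym.to_s.encoding   # => #<Encoding:US-ASCII>

# Numeric conversions come back US-ASCII as well:
p 42.to_s.encoding             # => #<Encoding:US-ASCII>
p 3.14.to_s.encoding           # => #<Encoding:US-ASCII>
```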
00:27:06.299
so which brings us back to kind of the title of the talk how many encodings do you think your application uses
00:27:13.740
I posit you probably have three even if you don't realize it so a lot of
00:27:19.020
people think everything is just utf-8 because that's what ruby defaults to but that's only true for string literals and
00:27:24.659
then particular operations if you're converting symbols or numbers to strings they're going to come back US-ASCII and if
00:27:31.620
you're dealing with I O at any level which most applications that need to
00:27:36.840
interface with the outside world do then you could have binary data as well
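All three can show up in a single method without any explicit encoding work, a sketch:

```ruby
literal  = "hello"        # string literals default to UTF-8
symbolic = :status.to_s   # symbol/number conversions come back US-ASCII
buffer   = String.new     # buffers and raw I/O data are ASCII-8BIT (binary)

p [literal, symbolic, buffer].map(&:encoding)
# => [#<Encoding:UTF-8>, #<Encoding:US-ASCII>, #<Encoding:ASCII-8BIT>]
```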
00:27:42.299
so why does this matter well if we get back to that encoding negotiation process if the operation is on strings
00:27:50.520
with the same encoding then the derived encoding is trivially the encoding in use and that's a very quick operation
00:27:57.020
but if they're not the same encoding then we got to run through that kind of long set of rules that checks if these
00:28:03.659
two strings are compatible and that rule involves looking at every character in
00:28:09.000
the string so you can have a linear scan of the string that you don't really expect just because you're using two
00:28:15.360
different encodings uh Ruby tries to hide some of this from
00:28:21.000
you so Ruby again supports 100 some odd encodings out of the box but it tries to
00:28:26.340
be pretty Equitable so rather than optimize for particular encodings it optimizes for properties of those
00:28:32.039
encodings so in US-ASCII every code point can be represented in a byte
00:28:37.440
so it's a fixed width encoding every character has the same width just like utf-32 every character's four bytes so
00:28:45.360
if you're using the index operator on a string you can really trivially figure out where to jump to because you just
00:28:51.960
have an array of bytes and you can work out the index but with utf-8 if you have multi-byte
00:28:58.860
character data you have to figure out the character boundaries for every single character there are various ways to shortcut
00:29:06.620
but if you're trying to get the last character you basically have to walk the whole string
00:29:12.779
additionally because Ruby allows broken strings you can't even use particular tricks in utf-8 to hop through when you
00:29:20.039
look at the encoding for utf-8 the first byte in any character tells you how many bytes that will take up but that doesn't
00:29:27.059
work when you allow broken strings to float through the system um so yeah I went too far so
00:29:34.799
Ruby has this cached metadata associated with strings something called a code range and as long as the string doesn't
00:29:41.399
change the code range will stay the same and this basically just tells you does the string consist only of ASCII data and
00:29:48.480
is the string valid and if those properties hold then Ruby is able to
00:29:54.059
optimize operations on those strings just like as if it were a fixed width encoding so this is one of the things that again
00:30:01.380
could take an operation like string concatenation and change it from basically constant time modulo a
00:30:08.940
memory copy to a linear time operation because you've got to scan the whole string um
00:30:14.520
from a behavioral standpoint when you're dealing with encodings you need to be in agreement on the encoding all the way
00:30:20.880
through the system so it doesn't help if you're working with utf-8 data but your database isn't because you're going to
00:30:27.240
start stuffing strings in there and they're going to get corrupted along the way and likewise if you're reading stuff
00:30:33.480
from files or if you have user forms being submitted you need to check the encoding like for a long time ISO 8859-1
00:30:41.940
was a really popular web encoding so this might be a weird one that you run into
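the corruption from exactly this kind of disagreement can be reproduced in a couple of lines (a sketch using only core String methods):

```ruby
# UTF-8 bytes for "é", mislabeled as ISO-8859-1 somewhere in the
# pipeline, then transcoded: the corruption is now baked in (mojibake).
utf8_bytes = "é".b                                 # "\xC3\xA9" as binary
mislabeled = utf8_bytes.force_encoding(Encoding::ISO_8859_1)
mislabeled.encode(Encoding::UTF_8)                 # => "Ã©"
```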
00:30:47.580
in addition to a database encoding databases also have this notion of
00:30:52.620
collation that impacts operations like how things sort unfortunately
00:30:58.500
um even systems that implement the same encoding can disagree on this so for instance if you're using postgres it
00:31:05.220
will actually sort utf-8 strings differently than Ruby will they disagree on the relative ordering of characters
00:31:12.799
so the same data can sort differently another really confusing error to run into something to watch out for
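Ruby's side of the disagreement is easy to demonstrate: String comparison is byte-wise, while a linguistic collation (for example en_US.UTF-8 in Postgres, an assumption about your database setup) would typically interleave cases:

```ruby
# String#<=> compares raw bytes, so "B" (0x42) sorts before "a" (0x61);
# a linguistic database collation would usually order these ["a", "B"].
%w[a B].sort  # => ["B", "a"]
```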
00:31:19.080
uh and again like the way a lot of this stuff breaks is particularly if
00:31:25.440
you're an American company you're primed to think of strings as ASCII data but you're
00:31:31.200
serving an international audience and now suddenly you have someone submitting strings in an encoding you weren't
00:31:37.200
expecting and your seed data might just use all-ASCII data maybe using Faker or
00:31:43.740
something and you're not forcing it to give you multi-byte characters so this
00:31:49.320
is one of the reasons I think you really should try to be aware of encodings I hope that after this talk you'll be better
00:31:55.679
equipped to solve some errors so let's tie this all together I'm happy to say that in Ruby 3.2 there have
00:32:02.460
been changes to try to speed some of this up a lot of this was spearheaded by a colleague of mine named Jean Boussier he
00:32:08.640
goes by byroot if you look at the Ruby commit list so one of the ways we're speeding things
00:32:14.820
up is we're just invalidating the code range value less frequently so if
00:32:20.159
we can instead derive what it should be then we can avoid this expensive scan through the string
00:32:25.620
and this also has benefits for copy on write because the code range value is
00:32:31.740
lazily computed and a lot of operations will actually populate it so if you fork
00:32:36.899
and then you have to scan that string you'll basically trigger a metadata
00:32:41.940
write and cause a page fault and then string concatenation is now faster this is one area that Jean's been
00:32:48.840
looking at quite a bit uh here we're modernizing Ruby a little bit to really actually optimize for the three most
00:32:55.919
common encodings the ones that Ruby defaults to and they all happen to be ASCII-compatible so in those situations
00:33:03.480
we can more trivially figure out if the encodings are compatible and that helps
00:33:09.600
the concatenation operation there's more room for optimization here so I'm really excited that there will be
00:33:16.320
more string performance coming in future Ruby releases but if you're looking at the Ruby release notes now you have a
00:33:21.899
better idea of what that work is all about and yeah that's about it
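for reference, the compatibility rule the concatenation fast path relies on is visible via Encoding.compatible? (standard Ruby, illustrating the observable behavior rather than the 3.2 internals):

```ruby
# Encoding.compatible? returns the encoding a concatenation would
# produce, or nil when mixing the strings would raise.
Encoding.compatible?("héllo", "world".b)  # => Encoding::UTF_8 (binary side is all ASCII)
Encoding.compatible?("héllo", "\xFF".b)   # => nil

"héllo" + "world".b  # fine, result stays UTF-8
# "héllo" + "\xFF".b  would raise Encoding::CompatibilityError
```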
00:33:27.299
um so we've looked at encodings just the general concept how they apply to Ruby Ruby's different mechanisms for
00:33:34.080
transcoding and changing the encoding on strings there's some history here both in Ruby
00:33:39.240
and just in Computing in general for why the encoding system is kind of messy and
00:33:45.419
we have hopefully a richer appreciation for how this can impact performance and behavior
00:33:51.000
I've got a slide here for some resources I'll just publish the slides but these were links I thought might be helpful I
00:33:57.120
wrote a blog post on code ranges that gets into the details of how those work and I've linked to some of the PRs from
00:34:03.539
Ruby 3.2 so you can see how some of the changes were made and that is it thank you for your time