00:00:17.359
okay so welcome to my talk, Splitting: The Crucial Optimization for Ruby. I'm Benoit Daloze and I've worked on TruffleRuby for quite a while actually, for eight years. I did a PhD on parallelism and concurrency in dynamic languages, I maintain ruby/spec and I'm also a CRuby committer, and you can find me on Twitter, Mastodon and everywhere
00:00:40.140
uh TruffleRuby is a high-performance Ruby implementation which aims to be the fastest Ruby out there, and a good deal of that is through the JIT compiler, the just-in-time compiler it has, which is the Graal just-in-time compiler
00:00:54.539
and TruffleRuby targets full compatibility with CRuby 3.1, including C extensions, so it really aims to be a drop-in replacement for CRuby. And so for instance recently we tried to run Mastodon, which is like a Rails app with a few hundred gems, and actually it just works on TruffleRuby: there's only like a single-line change in the Puma config, but everything is otherwise unchanged, the server starts and everything works
00:01:18.119
um it's on GitHub and there's a website and so on
00:01:21.900
so today I want to talk about splitting. But what is splitting? And to explain it I want to go back to the origin, when this concept was actually invented, and that was actually like quite a long time ago
00:01:35.159
so in 1986 some researchers created the Self language, and the Self language is a little bit similar to Smalltalk, which is quite similar to Ruby, but it's actually prototype-based, so it's a bit like JavaScript in that aspect
00:01:48.119
and the Self researchers actually had many breakthroughs, and here I just listed four of them. These breakthroughs are all still used in almost all dynamic language implementations nowadays, because they are so important, they're so fundamental to having good performance in dynamic languages
00:02:06.180
and so one of them for instance is Self maps, also known as shapes, which is a way to represent objects efficiently. That's been used by TruffleRuby since the beginning, and CRuby just recently, like a few months ago, started to use it too
00:02:19.860
and another concept they invented is deoptimization. So once we just-in-time compile Ruby code to machine code, sometimes it doesn't hold anymore, like for instance if somebody monkey-patches a method, then we cannot use this optimized code anymore. So what we do is we have to go back to the interpreter and restore all the state of the interpreter, and that's what deoptimization is, and then later we can potentially compile it again
00:02:43.920
and then a third concept is polymorphic inline caches, and this I will explain as we go through the talk, when I have a concrete example
00:02:51.900
and finally splitting, that's the subject of this talk. And splitting was invented in this paper 33 years ago, the customization and splitting paper, by Craig Chambers and David Ungar at Stanford
00:03:05.220
and what's amazing is that the example they use in this paper, which of course is Self code, can be very easily translated to Ruby, and the concepts basically hold one to one. So we're actually going to take this example and translate it to Ruby. It's very straightforward because, as you might notice, Self also has blocks and has very similar concepts overall
00:03:29.400
so what they do with this example is they define a sum_to method, and they define this on all numbers, so they define it on Numeric. And then it's very simple: we set something called sum to zero, and then we step from the current number, which is simply self, until the upper bound, included, and every time we just add the value to the sum. So it's a very trivial example, but it's a good example to illustrate splitting
00:03:51.959
and as a note, we use step here, not upto, because upto is actually only defined for Integer; it's not defined for Float and Rational and so on
00:04:00.659
and this function, you can call it on various things, right? You can call it on integers, you can just do 1.sum_to(10), which is 55, you can use floating point numbers, you can use rationals, and you can use large integers for instance. It just works for all of them
and so we want to make this function
00:04:20.340
fast right we want to see like okay oh
00:04:22.620
would like an optimizing Ruby
00:04:24.300
implementation go about this like oh can
00:04:26.220
we actually optimize this properly
00:04:28.259
and to do that we are going to just in
00:04:30.120
time compile we're gonna going to
00:04:31.680
illustrate or we would just in time
00:04:33.300
compile this sum two methods
00:04:35.759
and the sum_to method doesn't do much, right? There's the sum initialization, writing to sum, that's trivial; the main thing is it calls this step method. And the first question is: can we inline step? When we compile sum_to, can we also compile step within it and optimize them together?
00:04:53.759
and the first approach we take here is: can we know that it calls Numeric#step from a static analysis point of view? And the answer is no. Even though we know self is a Numeric, because it's a method defined on Numeric, we don't know if somebody might have defined Float#step or Integer#step or Rational#step, and that would then take over, and then it wouldn't call Numeric#step. So we couldn't just use the logic from Numeric#step and still be correct
00:05:20.220
and here we have two example call sites, 1.sum_to(10) and 1.0.sum_to(10.0), just to give an idea of what might be used
00:05:28.979
so how do dynamic language implementations deal with this? Because static analysis is basically useless in Ruby, we know very very little from it. What they do is they use inline caches. So it's a cache that's in the representation the virtual machine uses to execute: for instance in CRuby it's a cache that's directly in the bytecode, that's why it's called inline, it's like together with the representation of how we execute things. In TruffleRuby for instance it would be inside the abstract syntax tree directly
00:05:58.199
and in an inline cache, what you want to cache here is the method lookup, because method lookup can be quite expensive, right? It can be like multiple hash map lookups and so on; we don't want to do that every time. And so we have a cache that says: okay, when we had an Integer and we resolved through method lookup, the result was Numeric#step; and then when we saw a Float, from the second example, the result was also Numeric#step
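As an illustration only (not TruffleRuby's actual implementation), a lookup inline cache can be sketched in plain Ruby: the call site remembers, per receiver class, which method the lookup resolved to, so repeated calls with a known class skip the full lookup:

```ruby
# A toy lookup inline cache: maps receiver class -> resolved method.
class LookupCache
  def initialize(method_name)
    @method_name = method_name
    @entries = {}   # receiver class => UnboundMethod
  end

  def lookup(receiver)
    klass = receiver.class
    # Cache hit: return the memoized method; miss: do the slow lookup once.
    @entries[klass] ||= klass.instance_method(@method_name)
  end
end

cache = LookupCache.new(:step)
integer_step = cache.lookup(1)    # Integer receiver
float_step   = cache.lookup(1.0)  # Float receiver
# Both resolve to the method defined on Numeric:
integer_step.owner                # => Numeric
float_step.owner                  # => Numeric
```

After these two calls the cache has two entries (Integer and Float), which is exactly the polymorphism discussed later in the talk.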
00:06:22.440
and TruffleRuby has a slightly more advanced version here: we actually have two inline caches. We have the lookup cache, which we just explained, but we also have the call target cache, and the call target cache is simply: okay, after we resolved everything, which method did we end up calling?
00:06:38.100
and here it's always exactly the same method, it's Numeric#step, and because now we have a single method, we basically know: okay, most likely we are going to call Numeric#step here. It's not a guarantee, it could still change, but so far it has always been the case, and we're just gonna suppose it's gonna continue like this, right? And so because we have a single entry like this, it just makes sense to inline it, because at that point you're like, why not, right? We basically have a very good guess of what is going to be called next
00:07:05.759
so let's compile Numeric#step, let's look into it. The problem is, Numeric#step is really complicated: it doesn't fit on a slide, and this is already a simplified version, it doesn't handle keyword arguments and more things. And the reason it's so complicated is because there are many ways you can call Numeric#step
00:07:21.660
so you can call just 1.step(3), okay, just one two three, very easy, right? But you can do that with floats, you can do that with a step value of 2, so like for instance only odd numbers, or you can have -2, so you can actually go descending
00:07:35.220
or you can use keyword arguments, and if you use keyword arguments you can even omit the to: value, and then you step to infinity by the given step
00:07:43.620
and you can also just call it without a block, and then you get an enumerator. That's just to illustrate that Numeric#step can do a little bit of everything, and that's why it's quite complicated
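The call shapes just listed can be written out concretely (a quick illustration of standard Numeric#step behavior):

```ruby
# The many shapes of Numeric#step:
1.step(3).to_a          # => [1, 2, 3]         simple ascending
1.0.step(3.0).to_a      # => [1.0, 2.0, 3.0]   floats
1.step(9, 2).to_a       # => [1, 3, 5, 7, 9]   step value of 2: odd numbers
9.step(1, -2).to_a      # => [9, 7, 5, 3, 1]   negative step: descending

# Keyword arguments; `to:` can be omitted, stepping towards Infinity:
1.step(by: 2).first(3)  # => [1, 3, 5]

# Without a block you get an enumerator (an Enumerator::ArithmeticSequence):
1.step(3).is_a?(Enumerator)  # => true
```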
00:07:54.060
uh but I want this to fit on the slide, so I'm gonna simplify a few things. I'm gonna remove these first three lines, because the first line checked the enumerator case, but we're not really interested in that in this talk, and the other two lines just check that the step is not zero or nil, because those are error cases
00:08:10.080
so we just get rid of these
00:08:12.120
and then we have the descending and the ascending case, whether the step is positive or negative. The case we are interested in for sum_to is the ascending case, so the descending one we'll just put in a different method, we just put it in step_descending, to get it out of the way
00:08:28.500
and now we have something that actually fits on the slide, and we can think about how do we compile this
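The simplified slide version isn't in the transcript; reconstructed from the description (ascending case only, with `step_descending` and `float_step` as assumed helper names), it would look roughly like this:

```ruby
class Numeric
  # Simplified sketch of Numeric#step: no enumerator, keyword-argument,
  # or zero/nil-step handling; descending case moved out of the way.
  # `step_descending` and `float_step` are assumed helper names, only
  # reached in cases not exercised here.
  def step(limit = nil, step = 1)
    descending = step < 0
    return step_descending(limit, step) { |v| yield v } if descending

    limit ||= Float::INFINITY
    if self.is_a?(Float) || limit.is_a?(Float) || step.is_a?(Float)
      return float_step(limit, step) { |v| yield v }  # float-specific logic
    end

    value = self
    # The loop: where the CPU spends most of its time.
    until value > limit
      yield value
      value += step
    end
    self
  end
end

values = []
1.step(3) { |v| values << v }   # uses the simplified step above
```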
00:08:33.779
and the important part in this whole method is the loop, because that's where we're going to spend most of the time, that's where the CPU will be executing most of the time, and typically loops are good places to look for optimizations
00:08:48.180
so we have this loop, and inside this loop, what is there? There's `until value > limit`, okay, that's very simple; `value += step`; but then there is this `yield value`, right? So we call a block with the given value
00:09:00.959
and suppose we have two call sites again: we have 1.sum_to(10), but we also have a call site directly to step, 1.step(3). And that means the inline cache will actually see two blocks: it will see the block in sum_to, but it will also see the block in main, the one that does `p i`
00:09:17.820
and that's the question now: what do we do? Like, do we want to inline this block too? But which one would you inline? We could inline both blocks, that's a possibility: we say, okay, if the block is the block in sum_to, we're gonna do the logic of that block inline, and if it's the other one we do the other one, and if it's none of these we deopt, which makes it deoptimize: we just jump out of the compiled code, and we'll just recompile again if this happens
00:09:44.519
uh but there's a big problem here, because this really hurts optimization. For instance, if it's the first block, we want to do `sum += value`, but sum is a variable outside the block, so it means we cannot just access it directly; there are actually several indirections, right? That's why here I write it as block.outer.sum: there are at least two indirections, right?
00:10:06.180
and this kind of indirection we want to move out of the loop, so we have less work to do in the loop, but we cannot do that here, because we don't know which block it will be, and if we did that, basically we would make the second block slower. Um, so that's really the problem here
00:10:22.920
and then the question is: what if we don't have two blocks but we have like seven of them? Then, sure, we can have seven cases like this and inline all seven of them, but that's getting much less reasonable. And if it's actually the seventh block that's being passed, we'll have seven checks until we even get to it, and now it's getting really slow
00:10:39.899
so that's not good
00:10:41.580
so what can we do? Well, the idea is very simple: we say, what if we copy the method? Here we copy the method step. And the virtual machine does this, right? So even though I write step_1 and step_2 here to illustrate, the source code doesn't change; it's the same step we had from the beginning. It's just that the virtual machine is able to optimize all of this automatically
00:11:01.200
but so the thing it does is it takes a copy of step, and the first copy is going to be specialized for the block in sum_to, and the second copy is going to be specialized for the block in main
00:11:13.019
and then we just have a check, right, which is like: okay, is it the block we expect? And otherwise we deoptimize. And so now we can optimize this much better: for instance this outer variable read we could hoist out of the loop, and so on; I mean, the just-in-time compiler would be able to do that
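What the two specialized copies buy us can be sketched as ordinary Ruby (the method names are invented for illustration; in reality the VM copies its internal representation and the source never changes):

```ruby
# Split copy 1: only ever reached from sum_to, so the sum_to block body
# is inlined and the outer `sum` variable needs no per-iteration
# indirection.
def step_split_for_sum_to(value, limit, step)
  sum = 0
  until value > limit
    sum += value      # inlined body of sum_to's block
    value += step
  end
  sum
end

# Split copy 2: only ever reached from main, specialized for that block
# (collecting values here, standing in for the main block's body).
def step_split_for_main(value, limit, step)
  result = []
  until value > limit
    result << value   # inlined body of main's block
    value += step
  end
  result
end
```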
00:11:27.000
so that's cool, and this is splitting. It's very simple: it's a copy, basically. So graphically, we had these two call sites, right, 1.sum_to(10) and 1.step(3)
00:11:37.560
and okay, this one calls sum_to, but eventually both call step, and step was calling one of two blocks, it didn't know which one. What we did was just copy step, and now it's much simpler: now we have like a linear flow here, of course, and that's much more optimizable, right? The CPU is very happy when it has a straight sequence of instructions, it knows where to go; it's not happy when it has no idea where it is going and there are many branches everywhere
00:12:02.279
so yeah, splitting. And so the main advantage is: we take a copy of step here for each caller, and those copies can be optimized further simply because they have more information. Basically they know, okay, it's only reached through this caller, so it has fewer cases, it's more precise: we have the extra context of the caller to optimize it better
00:12:24.300
and in TruffleRuby specifically, splitting is a bit more advanced than what I explained. But first I want to define some terms: we had an inline cache before, and an inline cache can be monomorphic, polymorphic or megamorphic
00:12:38.360
monomorphic means a single entry, that's the happy case; polymorphic is two or more entries; and megamorphic is too many entries, like when we had seven blocks for instance, at some point we just can't track it anymore
00:12:50.519
and polymorphic is slower because you can optimize less; megamorphic is horribly slow because it cannot cache anymore, so for example we have to do method lookup every single time, right? So we have to say: okay, Integer, what are the methods on it? Do a hash map lookup, okay, it's not there; then we go to Numeric, what are its methods? Another hash map lookup, and so on
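That repeated full lookup can be modeled conceptually like this (a toy model, not CRuby's or TruffleRuby's real lookup code):

```ruby
# Walk the ancestors chain, one method-table (hash) lookup per module,
# until the method is found. This is roughly the work a megamorphic
# call site pays on every single call.
def full_method_lookup(receiver_class, name)
  receiver_class.ancestors.each do |mod|
    defined_here = mod.instance_methods(false).include?(name) ||
                   mod.private_instance_methods(false).include?(name)
    return mod.instance_method(name) if defined_here
  end
  nil
end

full_method_lookup(Integer, :step)  # not on Integer itself, found on Numeric
```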
00:13:08.579
and so what TruffleRuby does is, whenever it detects polymorphism or megamorphism, basically whenever it detects that these inline caches are getting two or more entries, it tries to split, to kind of fix it and be monomorphic again
00:13:21.180
and in TruffleRuby we decide to split per call site, once we've decided to split the given method. So like, once we've decided we need to split step, we're going to split every call to step, that's it
00:13:31.800
but more than that, we can actually split more than one level, and I will illustrate that to explain it. So suppose here we have two call sites: we have 1.sum_to(10) and 1.0.sum_to(10.0)
00:13:43.200
and so they both call sum_to with the block, and then that calls step. Inside step we have this loop, right, this `until value > limit`, and that `>` is actually polymorphic: it can either call Integer#> or Float#>
00:13:57.180
that's a problem, because we are in the innermost loop, the most important part of our code, and now we still have to decide between these two
00:14:04.620
so we say, okay, we could split step, but if we split step, how does sum_to know which copy to call? It just doesn't, right? So what we do is we split both of them, and then we have this again, and so this is kind of recursive splitting. TruffleRuby is able to notice this and say, okay, I need to split one level further, uh, and then we can have this, and everything is monomorphic and linear again
00:14:30.660
and so, to make the point again: what if we didn't have splitting? What happens, what are the problems, right? And here in red I put all the places that have polymorphism if we don't have splitting. There's `step < 0`: step could be a floating point number, and then we don't know, it could be two different methods here. This `>` we saw. The block could also be a different block, and the `+` as well, right? So there are already four places that are problematic if we don't have splitting
00:14:55.980
but more than that, if we consider polymorphism as a more general concept, if we consider it for branches, so for instance if we have an if/else branch and both are possible, because the big program might call step in many different ways, then basically we will explore all possibilities in this method, and then everything in red here is a place where the CPU doesn't know where to go, where it has to branch; it cannot predict where it goes, because actually all of them are possible
00:15:23.160
so that's a big problem, right, and splitting basically solves this by saying we take multiple copies, and then actually we know exactly where we'll go in all cases
00:15:34.079
so back to having splitting. Here we have the sum_to method, right, and now we've actually split the sum_to method, and this split is for this call site, 1.sum_to(10)
00:15:46.139
of course this 10 could be, I don't know, an argument like n coming from somewhere else, it doesn't have to be a literal 10. But here, this split actually profiles: it checks that the argument is always an Integer. If it's not, it deoptimizes and recompiles, but here most likely it will always be an Integer. So upper_bound is profiled as an Integer, this is just a quick check which we don't show, uh, and then after that we can just assume upper_bound is going to be an Integer inside the method
00:16:14.880
and also, in the inline cache of step, we only have one entry: we've only ever seen Integer as the receiver. So it's simple, so let's inline, let's continue and dive inside step
00:16:27.360
and inside step we also have an argument profile, and it tells us what arguments we got. So actually we have: this upper bound is the limit, and that's the only argument we passed. So limit is an Integer, and step is not passed. And that's very interesting: because step is not passed, and we also profile that it's not passed, we know that it must be the default value, and the default value is 1. So at that point we know that step is 1
00:16:52.100
so here I highlighted all occurrences of step, and we can just replace them by 1; that's what the just-in-time compiler would do
00:16:59.579
and then the next thing: is 1 smaller than 0? Well, obviously that's not true, and the just-in-time compiler can also figure that out, so we can just replace that by false
00:17:09.540
and then, well, descending is just false, so we can just fold it, right? We can just remove some branches, because we know it's not descending, it's ascending
00:17:17.640
and one more thing here is: we know limit is an Integer from the profile, so the `||=` only ever does something if limit were false or nil. But it's not, it's always an Integer, so this we can just remove as well
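Applied to the simplified step reconstructed earlier, the folding chain looks like this, written back out as source (a sketch of what the compiler does internally; the method name is invented):

```ruby
# Profile for the split reached from 1.sum_to(10):
#   self  : Integer, limit : Integer, step : not passed => default 1
#
# Substituting step = 1:
#   step < 0                  folds to false -> descending branch removed
#   limit ||= Float::INFINITY never fires    -> removed (limit is an Integer)
#
# What remains of this split copy of step:
def step_after_folding(value, limit)
  until value > limit   # provably Integer#>
    yield value
    value += 1          # provably Integer#+, step folded to the constant 1
  end
end
```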
00:17:31.740
then one more thing we have is: we notice that self is an Integer. Actually, like I said, we profile arguments, but we also profile self, just because it's useful
00:17:42.179
and so because we know self is an Integer, it means value is an Integer, and `value > limit` means we'll effectively call Integer#>. We actually already have this information through an inline cache, but still, here we can prove it, so it's even one less check: we don't need to check that it's an Integer, we know it
00:17:58.559
and then the same thing here for the increment: we know it's Integer#+
00:18:04.200
and we have some checks here that actually I rewrote a little bit to fit on the slide. It was: is value, limit, or step, any of them, a Float? If any of them is a Float, then we have different logic, because you have to be careful: when you add floats together, if you are not careful, you just end up with a lot of errors
00:18:21.480
um, but so I rewrote this from the original version, which is actually just an if with ||, right
00:18:27.780
and so we have: is value a Float? But no, value is an Integer, so it's definitely not a Float. Or is limit a Float? No, limit is an Integer. And 1 is not a Float. So all of these are false, so the condition just goes away, it's `if false`, and so we can just remove that code
00:18:42.299
okay, and now we have a fairly simple step, like a very simple, optimized definition of step, and now we can inline it, right? So what we do to inline is: we just have both methods, and we copy-paste step inside sum_to, and we adjust the variable names
00:18:59.039
so then we end up with this, right? Now we have the loop from inside step, inside sum_to
00:19:04.740
and what we still have left here is this block, right, this `sum += value` block; we still have that there
00:19:12.120
but then we immediately call it, and if we immediately call the block, basically we can see it's the same as doing the logic inline, so we can just replace it by `sum += value`
00:19:24.240
and actually what remains is that potentially this block allocation would stay, but the Graal JIT compiler is able to do escape analysis and realize, okay, nobody uses this block, so we can just remove it entirely. There's no allocation either, so it's really as simple as written here
00:19:39.179
and now we have something that's very well optimized: we know exactly what it does, it's just a simple loop. There are no calls except this very trivial arithmetic stuff, which we also typically inline
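Written back out as Ruby source, the fully split, inlined, and folded result described here would be roughly the following (a sketch; the receiver is passed explicitly since this is a plain method, and the name is invented):

```ruby
# sum_to for the 1.sum_to(n) call site after all optimizations:
# step inlined, step value folded to 1, float/descending branches gone,
# block body inlined, block allocation removed by escape analysis.
def sum_to_fully_optimized(receiver, upper_bound)
  sum = 0
  value = receiver
  until value > upper_bound
    sum += value   # Integer#+, still overflow-checked (promotes to Bignum)
    value += 1
  end
  sum
end
```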
00:19:50.820
and now what we have is actually very similar to a corresponding C program, right? So if we look between these two, it's very similar, and actually the only difference is that the Ruby version has an overflow check whenever it adds something, because we have to care: okay, if there were an overflow, we would actually need to use big integers
00:20:09.120
but this overflow check is fairly cheap, because actually it's a flag in the CPU: whenever you add, it remembers whether there was an overflow, and you can just check it, and there's an instruction to do that efficiently
00:20:19.799
and so what we did is, like, yeah, we have high-level Ruby code, with blocks and everything, and it compiles as good as the C code, basically. But this Ruby code is much better than the C code, because it works for Float, for Rational, it doesn't have overflow, and so on
00:20:34.919
so let's benchmark this, right? So here I have a few call sites to simulate being in a bigger application, which calls step and sum_to in various different ways, and then what I actually benchmark is this 1.sum_to(1000)
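The benchmark script isn't shown in the transcript; a minimal stand-in using only the standard library (the real harness presumably differs) could be:

```ruby
require "benchmark"

class Numeric
  def sum_to(upper_bound)
    sum = 0
    step(upper_bound) { |value| sum += value }
    sum
  end
end

# Extra call sites to simulate a bigger application: these are what make
# the inline caches inside step polymorphic unless the VM splits.
1.0.sum_to(10.0)
Rational(1).sum_to(3)
1.step(3) { |i| i }

# The measured workload:
seconds = Benchmark.realtime { 10_000.times { 1.sum_to(1000) } }
```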
00:20:50.220
and here I take CRuby 3.1 as the baseline, so it's always 1. And if I use TruffleRuby without splitting, it's already 50 times faster, and the reason for that is TruffleRuby uses an advanced just-in-time compiler, so it's expected to be that much faster
00:21:05.100
but if we use splitting, then we are more than 100 times faster than CRuby, uh, so actually 7.7 times faster than TruffleRuby without splitting. So that's really the benefit of splitting: a single optimization makes things 7.7 times faster, that's like crazy. And the reason for that is it's a bit like inlining: splitting triggers, it enables, a lot of other optimizations to come in
00:21:31.020
so it's really an optimization enabler: it enables other optimizations to kick in, and then we have really big speedups
00:21:37.799
again, so instead of being a very generic version which branches everywhere, like when I had all the red on the slide, uh, now it's actually like a perfectly linear version
00:21:48.780
and so if you take other benchmarks, for instance here we look at OptCarrot: TruffleRuby without splitting is already five times faster, but with splitting it's actually 7.7 times faster, so here again a big gain from splitting. And that's maybe not so expected on OptCarrot, which maybe doesn't use blocks that much, but still, it uses them to some degree
00:22:10.679
and then if we look at railsbench, the version from the yjit-bench benchmark suite, we see that TruffleRuby without splitting is 1.366 times as fast as CRuby, which is something, but with splitting it's 2.75 times as fast, and that's much better
00:22:26.700
so here on Rails it looks like splitting is really key, because it enables so many more optimizations: it enables compiling Rails and all the methods it uses in a much more efficient way
00:22:39.900
and to get an idea of the
00:22:42.000
global performance of TruffleRuby, here's
00:22:43.820
a blog post I did at the
00:22:46.140
beginning of the year
00:22:47.520
and on the 14 benchmarks of yjit-bench
00:22:50.220
we see that TruffleRuby is 6.23 times as
00:22:53.700
fast as CRuby 3.1
00:22:55.500
and so we can think a big part of
00:22:57.419
this is also splitting;
00:22:59.159
not only, of course, the Graal just-in-time
00:23:01.080
compiler also matters
00:23:03.179
um but yeah, that's part of the
00:23:04.860
overall results and we can
00:23:07.500
see TruffleRuby is basically on a different
00:23:08.760
level here: it has all these very
00:23:11.039
important optimizations that so far other
00:23:13.140
Ruby implementations don't really have
00:23:14.760
yet
00:23:17.880
so let's talk about a related paper
00:23:20.400
uh
00:23:22.140
from people at the University of Kent: Sophie
00:23:25.620
Kaleba, Octave Larose, Stefan Marr and
00:23:28.200
Richard Jones
00:23:29.700
and they use TruffleRuby to
00:23:31.440
analyze the behavior of call sites, and
00:23:33.840
so the big question of course about call
00:23:35.159
sites: are they monomorphic, polymorphic
00:23:36.720
or megamorphic, but also other things
00:23:39.840
and as this paper identifies, TruffleRuby has
00:23:41.940
two ways to reduce polymorphism and
00:23:44.640
megamorphism, and one of them is the two-
00:23:46.860
level inline cache we saw in the
00:23:48.419
beginning, where we separate the lookup
00:23:49.799
from the dispatch when we call a method,
00:23:53.580
and the other one is splitting
00:23:56.280
and there's also a blog post on Stefan
00:23:57.900
Marr's website if you want a kind
00:23:59.520
of summary of the paper
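As a rough illustration of what an inline cache does at a call site, here is a sketch of a simple one-level cache in plain Ruby. `InlineCache` is a made-up class for illustration; TruffleRuby's real two-level design additionally separates the lookup cache from the dispatch cache:

```ruby
# A sketch of a simple (one-level) inline cache in plain Ruby, just to
# illustrate the idea; InlineCache is a made-up class, not TruffleRuby's
# actual implementation.
class InlineCache
  def initialize(name)
    @name = name
    @cached_class = nil
    @cached_method = nil
  end

  def call(receiver, *args)
    if receiver.class != @cached_class
      # Cache miss: do the expensive method lookup and remember the result.
      @cached_class  = receiver.class
      @cached_method = @cached_class.instance_method(@name)
    end
    # Cache hit path: skip the lookup and dispatch directly.
    @cached_method.bind(receiver).call(*args)
  end
end

cache = InlineCache.new(:to_s)
cache.call(1)     # miss: looks up Integer#to_s, caches it
cache.call(2)     # hit: same receiver class, no lookup needed
cache.call(:sym)  # miss again: the site just became polymorphic
```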
00:24:02.280
and so I took the numbers from that
00:24:04.140
paper, and specifically the railsbench
00:24:06.179
numbers
00:24:07.140
and we see that initially railsbench
00:24:09.179
has about seven percent of the calls
00:24:10.799
which are polymorphic; that's not good
00:24:13.559
that's kind of a not-so-small number
00:24:15.659
of polymorphic calls, right? It means
00:24:17.220
seven percent of all the calls in the
00:24:19.140
benchmark actually don't know
00:24:20.760
where they're going; they have to
00:24:22.080
at least choose between two or more
00:24:23.520
possibilities
00:24:25.380
and even more worrying, half a
00:24:27.360
percent of those calls, which is still
00:24:30.000
63,000 calls, are megamorphic, which means
00:24:32.760
they can't use an inline cache: they have to
00:24:34.860
do a method lookup every single time
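The three degrees of call sites the paper measures can be sketched as a small classifier. The megamorphic cutoff of 8 here is an assumed number for illustration, not a value from the paper:

```ruby
# Classifying a call site by the receiver classes it has seen (a sketch of
# the paper's terminology; the megamorphic cutoff of 8 is an assumption).
def call_site_degree(seen_classes, megamorphic_limit = 8)
  case seen_classes.uniq.size
  when 0..1
    :monomorphic   # one receiver class: the inline cache always hits
  when 2..megamorphic_limit
    :polymorphic   # a few classes: a chain of cache entries still works
  else
    :megamorphic   # too many classes: give up caching, full lookup each time
  end
end

call_site_degree([Integer])                    # => :monomorphic
call_site_degree([Integer, Float])             # => :polymorphic
call_site_degree(Array.new(20) { Class.new })  # => :megamorphic
```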
00:24:37.980
uh but what they see is that after we
00:24:39.900
use this two-level inline cache for calls,
00:24:41.460
this lookup-and-dispatch cache, we
00:24:43.799
actually reduce the polymorphic
00:24:45.539
calls by half; now we only
00:24:47.340
have 3.5 percent
00:24:48.900
and the megamorphic calls are now very small,
00:24:51.179
like 0.004 percent
00:24:54.419
and then once we apply splitting on top
00:24:56.159
of that we actually come to a flat zero
00:24:58.200
there is no polymorphism no megamorphism
00:25:00.240
anymore
00:25:01.559
so that's really a big result, one
00:25:03.840
that I think I didn't
00:25:05.460
expect: when you have Rails you
00:25:07.559
have quite a bit of polymorphism and
00:25:09.240
megamorphism, and actually these two
00:25:10.919
optimizations remove it entirely, and this
00:25:13.620
didn't just happen on railsbench but
00:25:15.059
actually happened on all 44 benchmarks in
00:25:17.220
the paper
00:25:21.360
so in conclusion, we saw that splitting
00:25:22.980
is a technique from the Self VM research;
00:25:25.020
it was invented 33 years ago
00:25:27.539
but even though it's a bit old like this,
00:25:29.880
it actually applies very well to Ruby,
00:25:31.559
right? It applies for instance to methods
00:25:33.179
taking blocks and calling them in a
00:25:34.679
loop, but not only: there are many more
00:25:36.480
forms of polymorphism, like we saw with the
00:25:38.220
branches and so on
00:25:40.799
and we see that splitting can completely
00:25:42.900
remove polymorphism and megamorphism in
00:25:44.760
the 44 benchmarks of that paper, which is
00:25:46.919
really quite a good result, and also it
00:25:49.080
gives really big speedups, like we saw:
00:25:50.640
this 7.7 times speedup on some, 1.5 times
00:25:53.100
on OptCarrot and two times on
00:25:55.440
railsbench
00:25:57.659
and with that we want to thank you for
00:26:00.419
listening, and I'm open to any questions
00:26:10.620
thank you
00:26:11.820
right, so the question is: does it
00:26:13.200
have an impact on memory footprint?
00:26:16.980
and yes it does have an impact of course
00:26:19.200
like if we copy something we need to use
00:26:21.000
the memory for it
00:26:22.559
it has an impact which is then proportional
00:26:24.720
to the representation that the
00:26:27.659
virtual machine uses to represent code,
00:26:29.700
right, so like bytecode or whatever else,
00:26:31.559
right
00:26:32.640
so it's like, for instance, a
00:26:34.740
percentage of that. So
00:26:36.779
in TruffleRuby there is actually a
00:26:38.220
limit on how much we can copy,
00:26:40.200
if I remember correctly
00:26:41.880
so if we take the whole footprint of
00:26:44.460
the representation of the code for our
00:26:45.840
application, we don't grow it more
00:26:47.880
than 2.5 times what it was originally, so we're
00:26:50.460
allowed to split; kind of, we're allowed
00:26:52.559
to have one copy and a half extra, but on
00:26:54.360
average some things will never be split
00:26:56.340
and some things will be split many times
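The budget just described could be sketched like this. The 2.5 factor is the number from the talk; the accounting scheme and the `SplitBudget` class are my simplification, not TruffleRuby's actual code:

```ruby
# A sketch of a splitting code-size budget (the 2.5 factor is from the
# talk; this accounting is a simplification, not TruffleRuby's code).
class SplitBudget
  def initialize(base_code_size, factor = 2.5)
    @limit = base_code_size * factor  # total code may grow to 2.5x original
    @used  = base_code_size
  end

  # Allow a split only while the total code size stays under the budget.
  def try_split(method_size)
    return false if @used + method_size > @limit
    @used += method_size
    true
  end
end

budget = SplitBudget.new(1000)  # 1000 units of code, budget of 2500
budget.try_split(800)           # allowed: total would be 1800
budget.try_split(800)           # refused: total would be 2600 > 2500
```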
00:26:57.779
so there's a limit, but actually we
00:27:00.240
don't hit this limit so much. So in
00:27:02.640
practice, yes, it needs to be
00:27:05.100
controlled so it doesn't go crazy, but if
00:27:07.260
it's done in sensible places it tends to
00:27:08.820
not be too unreasonable;
00:27:11.279
compared to the application
00:27:12.720
footprint it's typically not that
00:27:14.340
important
00:27:15.480
but yeah, like every optimization it's
00:27:17.279
a trade-off; here the trade-off is a
00:27:19.080
little bit more footprint for much
00:27:20.340
faster code, right. So it was this one,
00:27:22.679
right?
00:27:24.360
yeah yeah, so the trick
00:27:26.760
here, in TruffleRuby at least, is that we
00:27:29.580
notice that step here is
00:27:32.520
called by a single caller; there's only
00:27:34.440
one thing calling step at this point in
00:27:36.299
the program
00:27:38.220
and so this is kind of the first
00:27:39.659
time we see step being used, right, and
00:27:42.179
that means, well, this can't be
00:27:44.279
the source of polymorphism, or if it is
00:27:46.260
we can't do anything about it, so
00:27:48.000
let's keep looking higher. And we
00:27:50.100
kind of keep looking higher like
00:27:51.840
this as long as there are single callers, up to a
00:27:54.240
threshold, I think it's something like five
00:27:56.460
uh and then here for instance we saw,
00:27:58.440
okay, now we have two callers, we
00:28:00.419
don't only have one, right, so this point
00:28:02.279
is probably okay to split
00:28:04.020
but even then, if it didn't work, like,
00:28:06.419
it would have split but maybe there's
00:28:07.559
still some polymorphism; it would detect
00:28:09.120
it later and then try again to split
00:28:10.679
higher potentially
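The caller-walking heuristic just described might look like this. The depth limit of five is the speaker's recollection, and the frame representation and names (`find_split_point`, `:each_slice`) are made up for illustration:

```ruby
# A sketch of the caller-walking heuristic from the answer above (depth
# limit of 5 is the speaker's "I think"; the frame hashes are made up).
MAX_SPLIT_DEPTH = 5

# Walk up the caller chain while each method has only a single caller; the
# first frame with two or more callers is a plausible source of the
# polymorphism and a reasonable place to split.
def find_split_point(frame, depth = 0)
  return nil if frame.nil? || depth > MAX_SPLIT_DEPTH
  return frame if frame[:callers] >= 2
  find_split_point(frame[:caller_frame], depth + 1)
end

# step has a single caller, whose own caller has two callers: split there.
step = { name: :step, callers: 1,
         caller_frame: { name: :each_slice, callers: 2, caller_frame: nil } }
find_split_point(step)  # => the :each_slice frame
```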
00:28:12.240
okay then thank you