Talks

Splitting: the Crucial Optimization for Ruby Blocks

Blocks are one of the most expressive parts of Ruby syntax. Many Ruby methods take a block. When a method is given different blocks, a crucial optimization is necessary to unlock the best performance. This optimization dates back to the early days of research on dynamic languages, yet it seems only a single Ruby implementation currently uses it. This optimization is called splitting: it uses different copies of a method and specializes them to the block given at different call sites. This enables compiling the method and the block together for the best performance.

RubyConf 2022

00:00:00.000 ready for takeoff
00:00:17.359 Okay, so welcome to my talk: Splitting, the crucial optimization for Ruby blocks. I'm Benoit Daloze and I have worked on TruffleRuby for quite a while, eight years actually. I did a PhD on parallelism and concurrency in dynamic languages. I maintain ruby/spec, I'm also a CRuby committer, and you can find me on Twitter, Mastodon, and everything.
00:00:40.140 TruffleRuby is a high-performance Ruby implementation which aims to be the fastest Ruby out there, and a good deal of that is through the just-in-time compiler it has, which is the Graal just-in-time compiler.

00:00:54.539 TruffleRuby targets full compatibility with CRuby 3.1, including C extensions, so it really aims to be a drop-in replacement for CRuby. For instance, recently we tried to run Mastodon, which is a Rails app with a few hundred gems, and it just works on TruffleRuby: there's only a single-line change in the Puma config, but otherwise the server starts and everything works.

00:01:18.119 It's on GitHub and there's a website and so on.
00:01:21.900 So today I want to talk about splitting. But what is splitting? To explain it, I want to go back to the origin, when this concept was actually invented, and that was quite a long time ago.
00:01:35.159 In 1986 some researchers created the Self language. Self is a little bit similar to Smalltalk, which is quite similar to Ruby, but it's actually prototype-based, so it's a bit like JavaScript in that aspect.

00:01:48.119 The Self researchers had many breakthroughs, and here I just listed four of them. These breakthroughs are all still used in almost all dynamic language implementations nowadays, because they are so important, so fundamental to having good performance in dynamic languages.
00:02:06.180 One of them, for instance, is Self maps, or shapes, which is a way to represent objects efficiently. That's been used by TruffleRuby since the beginning, and CRuby just recently, a few months ago, started to use it too.
00:02:19.860 Another concept they invented is deoptimization. Once we just-in-time compile Ruby code to machine code, sometimes it doesn't hold anymore: for instance, if somebody monkey-patches a method, we cannot use the optimized code anymore. So what we do is go back to the interpreter and restore all the state of the interpreter, which is what deoptimization is, and then later we can potentially compile it again.
00:02:43.920 A third concept is polymorphic inline caches, which I will explain as we go through the talk, when I have a concrete example. And finally splitting, which is the subject of this talk.

00:02:55.260 Splitting was invented in a paper 33 years ago: the customization and splitting paper by Craig Chambers and David Ungar at Stanford.
00:03:05.220 What's amazing is that the example they use in this paper, which of course is Self code, can be very easily translated to Ruby, and the concepts basically hold one to one. So we're going to take this example and translate it to Ruby. It's very straightforward because, as you might notice, Self also has blocks and a very similar concept overall.
00:03:29.400 What they do in this example is define a sum_to method, and they define it on all numbers, so we define it on Numeric. It's very simple: we set a sum variable to zero, and then we step from the current number, which is simply self, up to the upper bound, included, and every time we just add the value to the sum. It's a trivial example, but it's a good example to illustrate splitting. As a note, we use step here, not upto, because upto is only defined for Integer; it's not defined for Float, Rational, and so on.
00:04:00.659 And you can call this method on quite a lot of things: you can call it on integers, 1.sum_to(10), which is 55; you can use floating-point numbers; you can use rationals; and you can use large integers, for instance. It just works for all of them.
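For reference, here is the sum_to example in Ruby, reconstructed from the description above (a sketch; Numeric#step is the real Ruby method it relies on):

    class Numeric
      # Sum of all numbers from self up to and including upper_bound.
      def sum_to(upper_bound)
        sum = 0
        step(upper_bound) { |value| sum += value }
        sum
      end
    end

    1.sum_to(10)            # => 55
    1.0.sum_to(10.0)        # => 55.0
    Rational(1).sum_to(10)  # => (55/1)
    (10**20).sum_to(10**20 + 10)  # works for large Integers too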
00:04:18.419 So we want to make this method fast. We want to see how an optimizing Ruby implementation would go about this: can we actually optimize it properly? To do that, we're going to illustrate how we would just-in-time compile this sum_to method.

00:04:35.759 The method itself doesn't do much: there's the sum initialization, sum = 0, that's trivial; the main thing is that it calls this step method. And the first question is: can we inline step? When we compile sum_to, can we also compile step within it and optimize them together?
00:04:53.759 The first approach we take here is: can we know that this calls Numeric#step from a static analysis point of view? And the answer is no. Even though we know self is a Numeric, because it's a method defined on Numeric, we don't know whether somebody might have defined Float#step or Integer#step or Rational#step; that would then take over, and it wouldn't call Numeric#step. So we couldn't just use the logic from Numeric#step and still be correct.

00:05:20.220 Here I have two example call sites, 1.sum_to(10) and 1.0.sum_to(10.0), just to give an idea of what might be used.
00:05:28.979 So how do dynamic language implementations deal with this? Static analysis is basically useless in Ruby, because we learn very little from it. What they do instead is use inline caches. An inline cache is a cache that lives in the representation the virtual machine uses to execute. For instance, in CRuby it's a cache that's directly in the bytecode; that's why it's called inline: it's together with the representation of how we execute things. In TruffleRuby it would be inside the abstract syntax tree directly.

00:05:58.199 In an inline cache, what you want to cache here is the method lookup, because method lookup can be quite expensive: it can be multiple hash map lookups and so on, and we don't want to do that every time.
00:06:10.020 So we have a lookup cache. When we had an Integer receiver and resolved the method through lookup, the result was Numeric#step, and when we saw a Float, from the second example, the result was also Numeric#step.

00:06:22.440 TruffleRuby has a slightly more advanced version: we actually have two inline caches. We have the lookup cache, which we just explained, but we also have the call target cache, and the call target cache is simply: after we resolved everything, which method did we end up calling?
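As a rough illustration of the two-level idea, here is a minimal sketch of such a call site in plain Ruby (a hypothetical CallSite class for illustration only, not TruffleRuby's actual implementation):

    class CallSite
      def initialize(name)
        @name = name
        @lookup_cache = {}   # level 1: receiver class => resolved method
        @call_targets = []   # level 2: distinct methods actually called here
      end

      def call(receiver, *args, &block)
        method = @lookup_cache[receiver.class] ||=
          receiver.class.instance_method(@name)   # the expensive lookup, cached
        @call_targets << method unless @call_targets.include?(method)
        # if @call_targets has a single entry, a JIT can speculatively inline it
        method.bind_call(receiver, *args, &block)
      end
    end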
00:06:38.100 And here it's always the same method, Numeric#step. Because we now have a single method, we basically know we are most likely going to call Numeric#step here. It's not a guarantee, it could still change, but so far it has always been the case, and we're just going to assume it's going to continue like this. And because we have a single entry like this, it just makes sense to inline it, because at that point, why not? We basically have a very good guess of what is going to be called next.
00:07:05.759 So let's compile Numeric#step; let's look into it. The problem is, Numeric#step is really complicated: it doesn't fit on a slide, and this is already a simplified version that doesn't handle keyword arguments and more. The reason it's so complicated is that there are many ways you can call Numeric#step.

00:07:21.660 You can call just 1.step(3): one, two, three, very easy. But you can do that with floats; you can do it with a step value of 2, so for instance only odd numbers; or you can have -2, so you actually go descending. Or you can use keyword arguments, and with keyword arguments you can even omit the to: value, and then you step to infinity by the given step. And you can also call it without a block, and then you get an Enumerator. That's just to illustrate that Numeric#step can do a little bit of everything, and that's why it's quite complicated.
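The variants mentioned above look like this in standard Ruby:

    1.step(3)     { |i| p i }   # 1, 2, 3
    1.0.step(3.0) { |f| p f }   # works with floats: 1.0, 2.0, 3.0
    1.step(9, 2)  { |i| p i }   # step value of 2: 1, 3, 5, 7, 9
    9.step(1, -2) { |i| p i }   # negative step, descending: 9, 7, 5, 3, 1
    1.step(by: 2) { |i| break if i > 7; p i }  # no to: value, steps toward infinity
    1.step(3)                   # no block: returns an Enumerator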
00:07:54.060 But I want this to fit on the slide, so I'm going to simplify a few things. I'm going to remove the first three lines: the first line checks the Enumerator case, which we're not really interested in for this talk, and the other two lines just check that the step is not zero or nil, because those are error cases. So we just get rid of those.

00:08:12.120 Then we have the descending and the ascending cases, whether the step is negative or positive. The case we are interested in for sum_to is the ascending case, so the descending one we'll just put in a different method, step_descending, to get it out of the way.

00:08:28.500 Now we have something that actually fits on the slide, and we can think about how to compile it.
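A sketch of the simplified Numeric#step as described, reconstructed from the walkthrough that follows (step_descending is the helper from the talk; the float handling is condensed here into a placeholder float_limit helper, since the exact slide code isn't shown):

    class Numeric
      def step(limit = nil, step = 1)
        # Enumerator case and step == 0 / step == nil error checks removed
        descending = step < 0
        limit ||= descending ? -Float::INFINITY : Float::INFINITY
        if descending
          step_descending(limit, step) { |value| yield value }
        else
          value = self
          if value.is_a?(Float) || limit.is_a?(Float) || step.is_a?(Float)
            limit = float_limit(value, limit, step)  # floats need extra care
          end
          until value > limit
            yield value
            value += step
          end
        end
        self
      end
    end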
00:08:33.779 The important part in this whole method is the loop, because that's where we're going to spend most of the time; that's where the CPU will be executing most of the time. Loops are typically good places to look for optimizations.

00:08:48.180 So we have this loop, and what's inside it? until value > limit, okay, that's very simple; value += step; but then there is this yield value: we call a block with the given value.
00:09:00.959 And suppose we have two call sites again: we have 1.sum_to(10), but we also have a call site directly to step, 1.step(3). That means the inline cache will see two blocks: it will see the block in sum_to, but it will also see the block in main, the one that does p i.
00:09:19.800 do like do we want to inline this block
00:09:21.360 too but like would you do in line
00:09:23.880 so we could align both block that's a
00:09:26.519 possibility
00:09:27.660 so say okay if the block is the Block in
00:09:29.339 some tool we're gonna like do the logic
00:09:31.440 of that block in line and if it's the
00:09:33.540 other one we do the other one and if
00:09:35.160 it's none of these we do day opt which
00:09:36.959 makes it video optimized so we just
00:09:38.339 don't under the untold that in the
00:09:39.959 compile code we'll just recompile again
00:09:42.240 if this happens
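Conceptually, the compiled loop with both blocks inlined would behave like this pseudocode (BLOCK_IN_SUM_TO, BLOCK_IN_MAIN, and deoptimize! are illustrative names, not VM API):

    until value > limit
      if block.equal?(BLOCK_IN_SUM_TO)
        block.outer.sum += value   # inlined body of the sum_to block
      elsif block.equal?(BLOCK_IN_MAIN)
        p value                    # inlined body of the block in main
      else
        deoptimize!                # unexpected block: back to the interpreter
      end
      value += step
    end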
00:09:44.519 But there is a big problem here, because this really hurts optimization. For instance, take the first block: we do sum += value, but sum is a variable outside the block, so we cannot just access it directly; there are actually several indirections. That's why I write it here as block.outer.sum: there are at least two indirections. And this kind of indirection is something we want to move out of the loop, so we have less work to do in the loop. But we cannot do that here, because we don't know which block it will be, and if we did it we would make the second block slower. So that's really the problem here.
00:10:22.920 And then the question is: what if we don't have two blocks but, say, seven of them? Sure, we can have seven cases like this and inline all seven of them, but that's getting much less reasonable. And if it's actually the seventh block that's being passed, we'll go through seven checks until we even get to it, and now it's getting really slow. So that's not good.
00:10:41.580 So what can we do? Well, the idea is very simple: what if we copy the method? We copy the method step. And the virtual machine does this: even though I write them as step1 and step2 here to illustrate, the source code doesn't change; it's the same step we had from the beginning. It's just that the virtual machine is able to optimize all of this automatically. So what it does is take a copy of step, and the first copy is going to be specialized for the block in sum_to, and the second copy is going to be specialized for the block in main.
00:11:13.019 And then we just have a check: is it the block we expect? If not, we deoptimize. And now we can optimize this much better: for instance, that outer-variable access we could move out of the loop, and so on. The just-in-time compiler would be able to do that.
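Conceptually, the two split copies would behave like this pseudocode (again with illustrative names; the VM does this on its internal representation, not on Ruby source):

    # step1: copy of step specialized for the block in sum_to
    deoptimize! unless block.equal?(BLOCK_IN_SUM_TO)
    sum_holder = block.outer           # indirection hoisted out of the loop
    until value > limit
      sum_holder.sum += value          # the one known block, inlined
      value += step
    end

    # step2: copy of step specialized for the block in main
    deoptimize! unless block.equal?(BLOCK_IN_MAIN)
    until value > limit
      p value
      value += step
    end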
00:11:27.000 So that's cool, and this is splitting. It's very simple: it's basically a copy.

00:11:31.500 Graphically, we had these two call sites, 1.sum_to(10) and 1.step(3). This one calls sum_to, but eventually both call step, and step was calling one of two blocks; we didn't know which one. What we did was just copy step, and now it's much simpler: now we have a linear flow, and that's much more optimizable. The CPU is very happy when it has a straight sequence of instructions and knows where to go; it's not happy when it has no idea where it's going and there are many branches everywhere.
00:12:02.279 So yeah, splitting. The main advantage is that we take a copy of step for each caller, and those copies can be optimized further simply because they have more information: they know they're only reached through this one caller, so there are fewer cases, it's more precise; we have the extra context of the caller to optimize better.
00:12:24.300 In TruffleRuby specifically, splitting is a bit more advanced than what I explained. But first I want to define some terms. We had an inline cache before, and an inline cache can be monomorphic, polymorphic, or megamorphic. Monomorphic means a single entry in the cache; polymorphic is two or more entries; and megamorphic is too many entries, like when we had seven blocks: at some point we just can't track it anymore.

00:12:50.519 Polymorphic is slower because you can optimize less, and megamorphic is horribly slow because it cannot cache anymore, so for example we have to do method lookup every single time. We have to say: okay, Integer, what are the methods on it? Do a hash map lookup; okay, it's not there; then we go to Numeric, what are its methods? Another hash map lookup, and so on.
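For instance, hypothetical call sites for each cache state (the x.to_s send is the call site being cached):

    [1, 2, 3].each { |x| x.to_s }    # monomorphic: only Integer#to_s is seen
    [1, 2.0].each  { |x| x.to_s }    # polymorphic: Integer#to_s or Float#to_s
    # megamorphic: too many receiver classes reach this one call site
    [1, 2.0, :a, "b", [], {}, nil, true].each { |x| x.to_s }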
00:13:08.579 So what TruffleRuby does is: whenever it detects polymorphism or megamorphism, basically whenever it detects that inline caches are getting two or more entries, it tries to split to kind of fix it and be monomorphic again. And in TruffleRuby we decide to split per call site: once we've decided we need to split a given method, so once we've decided we need to split step, we're going to split every call to step. That's it.
00:13:31.800 But more than that, we can actually split more than one level, and I'll illustrate that to explain it. Suppose here we have two call sites: 1.sum_to(10) and 1.0.sum_to(10.0). They both call the same sum_to, and that calls step. Inside step we have this loop, until value > limit, and that > is actually polymorphic: it can either call Integer#> or Float#>. That's a problem, because this is the innermost loop, the most important part of our code, and we still have to decide between these two.

00:14:04.620 So we say: okay, we could split step, but if we split step, does sum_to know which copy to call? It just doesn't. So what we do is split both of them, and then we have this again, so this is kind of recursive splitting. TruffleRuby is sensible enough to notice this and say: okay, I need to split one level further. And then we can have this, and everything is monomorphic and linear again.
00:14:30.660 And to make the point again: what if we didn't have splitting? What happens, what are the problems? Here in red I put all the places that have polymorphism if we don't have splitting. There's step < 0: step could be a floating-point number, so we don't know; it could be two different methods here. There's the > we saw; the block could also be a different block; and the + in the block as well. So there are already four places that are problematic if we don't have splitting.

00:14:55.980 But more than that, if we consider polymorphism as a more general concept, if we consider it for branches, for instance an if/else branch where both sides are possible because a big program could call step in many different ways, then basically we will explore all possibilities in this method, and everything in red here is a place where the CPU doesn't know where to go, where it has to branch and cannot predict where it goes, because actually all of them are possible.

00:15:23.160 So that's a big problem, and splitting basically solves it by taking multiple copies, so that we know exactly where we'll go in all cases.
00:15:34.079 So back to having splitting. Here we have our sum_to method, and now we've actually split sum_to, and this split is for the call site 1.sum_to(10). Of course, this 10 could be, I don't know, an argument like n coming from somewhere else; it doesn't have to be a literal 10. But this split actually profiles: it checks that the argument is always an Integer. If it's not, it deoptimizes and recompiles, but here it will most likely always be an Integer. So the upper bound is profiled as an Integer, which is just a quick check that we don't show, and after that we can just assume the upper bound is going to be an Integer inside the method.
00:16:14.880 Also, the inline cache of step here has only one entry: we've only ever seen Integer as the receiver. So it's simple: let's inline, let's continue and dive inside step.
00:16:27.360 Inside step we also have an argument profile that tells us which arguments we got. Here the upper bound is the limit, and that's the only argument we passed, so limit is an Integer, and step is not passed. And that's very interesting: because step is not passed (we profile that it's not passed too), we know it must be the default value, and the default value is 1. So at that point we know that step is 1. Here I've highlighted all occurrences of step, and we can just replace them by 1; that's what the just-in-time compiler would do.
00:16:59.579 The next thing: is 1 smaller than 0? Well, obviously that's not true, and the just-in-time compiler can figure that out too, so we can just replace that with false. And then descending is just false, so we can fold it: we can remove some branches, because we know it's not descending, it's ascending.
00:17:17.640 One more thing here: we know limit is an Integer from the profile, so the ||= only ever does something if limit were false or nil, but it's not, it's always an Integer. So this we can just remove as well.
00:17:31.740 Then one more thing: we notice that self is an Integer. Like I said, we profile arguments, but we also profile self, just because it's useful. And because we know self is an Integer, value is an Integer, which means value > limit will actually always call Integer#>. We already had this information in an inline cache, but here we can prove it, so that's one less check: we don't need to check that it's an Integer, we know it. And the same thing here for the increment: we know it's Integer#+.
00:18:04.200 And we have some checks here, which I rewrote a little to fit on the slide: is value, limit, or step a Float? If any of them is a Float, we have different logic, because you have to be careful: when you add floats together, if you're not careful, you end up with a lot of rounding error. So I rewrote this from the original version as just an if with ||.

00:18:27.780 And so we have: is value a Float? No, value is an Integer, so it's definitely not a Float. Is limit a Float? No, limit is an Integer. And 1 is not a Float. So all of these are false, the condition just goes away, it's if false, and we can just remove that code.
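Putting those folds together, the split copy of step for this call site boils down to something like this (a sketch of the end result; the JIT performs this on its internal representation, not on Ruby source):

    # Split copy of Numeric#step for the call site in sum_to, given the
    # profiles: self and limit are Integers, step was not passed, so step == 1.
    def step(limit)
      value = self
      until value > limit   # provably Integer#>
        yield value
        value += 1          # provably Integer#+
      end
      self
    end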
00:18:42.299 Okay, and now we have a fairly simple step, a very simple, optimized definition of step, and now we can inline it. What we do to inline is: we take both methods, we copy-paste step inside sum_to, and we adjust the variable names. Then we end up with this: now we have the loop from inside step inside sum_to.
00:19:04.740 And what's still left here is this block, this { |value| sum += value }, and we immediately call it. Well, if we immediately call the block, we can basically see it's the same as doing the logic inline, so we can just replace it with sum += value.

00:19:24.240 And actually, what remains is that potentially the block allocation would stay, but the Graal JIT compiler is able to do escape analysis and realize that nobody uses this block, so we can just remove it entirely; there's no allocation either. So it's really as simple as written here.
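So the fully optimized split of sum_to for this call site behaves like this sketch:

    def sum_to(upper_bound)   # self and upper_bound profiled as Integers
      sum = 0
      value = self
      until value > upper_bound
        sum += value          # the block body, inlined; no block object allocated
        value += 1
      end
      sum
    end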
00:19:39.179 And now we have something that's very well optimized. We know exactly what it does: it's just a simple loop. There are no calls except the very trivial arithmetic operations, which are typically also inlined.
00:19:50.820 And what we have now is actually very similar to the corresponding C program. If we compare the two, they're very similar, and actually the only difference is that the Ruby version does an overflow check whenever it adds something, because we have to care: if there were an overflow, we'd actually need to use big integers. But this overflow check is fairly cheap, because it's a flag in the CPU: whenever you add, it remembers whether there was an overflow, and you can just check it; there's an instruction to do that efficiently.

00:20:19.799 So what we did is: we have high-level Ruby code, with blocks and everything, and it compiles as well as the C code, basically. But this Ruby code is much better than the C code, because it works for Float, for Rational, it doesn't have overflow, and so on.
00:20:34.919 So let's benchmark this. Here I have a few call sites to simulate being in a bigger application, which calls step and sum_to in various different ways, and then what I actually benchmark is 1.sum_to(1000).
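A hypothetical harness along those lines, using the benchmark-ips gem (the extra call sites exercise step and sum_to in various ways, as a bigger application would):

    require 'benchmark/ips'   # assumes sum_to from earlier is defined

    # extra call sites, as in a bigger application
    1.step(3)     { |i| i }
    1.0.step(3.0) { |f| f }
    1.step(9, 2)  { |i| i }
    1.0.sum_to(10.0)
    Rational(1).sum_to(10)

    Benchmark.ips do |x|
      x.report('1.sum_to(1000)') { 1.sum_to(1000) }
    end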
00:20:50.220 Here I take CRuby 3.1 as the baseline, so it's always 1. If I use TruffleRuby without splitting, it's already 50 times faster, and the reason for that is that TruffleRuby uses an advanced just-in-time compiler, so it's able to be that much faster. But if we use splitting, then we are more than 100 times faster than CRuby, so actually 7.7 times faster than TruffleRuby without splitting. That's really the benefit of splitting: a single optimization makes things 7.7 times faster, which is crazy; it's almost unheard of. And the reason for that is that it's a bit like inlining: splitting enables a lot of other optimizations to come in.

00:21:31.020 So it's really an optimization enabler: it enables other optimizations to kick in, and then we get really big speedups. Instead of having a very generic version which branches everywhere, like when we had all the red on the slide, we now have a perfectly linear version.
00:21:48.780 If we take other benchmarks, for instance here we look at OptCarrot: TruffleRuby without splitting is 5 times faster, but with splitting it's actually 7.7 times faster, so here again a big gain from splitting. And that's maybe not so expected on OptCarrot, which perhaps doesn't use blocks that much, but it still uses them to some degree.
00:22:10.679 And then if we look at railsbench, the version from the yjit-bench benchmark suite, we see that TruffleRuby without splitting is 1.366 times as fast as CRuby, which is something, but with splitting it's 2.75 times as fast, and that's much better. So on Rails it looks like splitting is really key, because it enables so much more optimization; it enables compiling Rails and all the methods it uses in a much more efficient way.
00:22:39.900 And to get an idea of the overall performance of TruffleRuby, here's a blog post I wrote at the beginning of the year. On the 14 benchmarks of yjit-bench, we see that TruffleRuby is 6.23 times as fast as CRuby 3.1. We can think a big part of this is also splitting; of course not only that, the Graal just-in-time compiler also matters, but it's part of the overall result. And we can see TruffleRuby is basically at a different level here: it has all these very important optimizations that so far other Ruby implementations don't really have yet.
00:23:17.880 So let's talk about a related paper, from people at the University of Kent: Sophie Kaleba, Octave Larose, Stefan Marr, and Richard Jones. They used TruffleRuby to analyze the behavior of call sites, so the big question of course is: are they monomorphic, polymorphic, or megamorphic? But also other things.

00:23:39.840 And this paper finds that TruffleRuby has two ways to reduce polymorphism and megamorphism: one of them is the two-level inline cache we saw in the beginning, where we separate the lookup from the dispatch when we call a method, and the other one is splitting. There's also a blog post on Stefan Marr's website if you want a kind of summary of the paper.
00:24:02.280 I took the numbers from that paper, specifically the railsbench numbers. We see that initially railsbench has about seven percent of calls which are polymorphic. That's not good; that's not such a small number of polymorphic calls: it means seven percent of all calls in the benchmark don't know where they're going; they have to choose between two or more possibilities.

00:24:25.380 And even more worrying: half a percent of those calls, which is still 63,000 calls, are megamorphic, which means they have no inline cache; they have to do method lookup every single time.
00:24:37.980 But what they see is that after we use the two-level inline cache for calls, this lookup-and-dispatch cache, we actually cut the polymorphic calls in half: now we only have 3.5 percent, and the megamorphic calls are now very small, like 0.004 percent.

00:24:54.419 And then, once we apply splitting on top of that, we actually get to a flat zero: there is no polymorphism, no megamorphism anymore. So that's really a big result, one that I didn't expect: when you have Rails, you have quite a bit of polymorphism and megamorphism, and actually these two optimizations remove it entirely. And this didn't just happen on railsbench; it actually happened on all 44 benchmarks in the paper.
00:25:21.360 So, in conclusion: we saw that splitting is a technique from the Self VM research; it was invented 33 years ago. But even though it's a bit old, it actually applies very well to Ruby: it applies, for instance, to methods taking blocks and calling them in a loop, but not only there; there are many more forms of polymorphism, like the branches we saw and so on. And we see that splitting can completely remove polymorphism and megamorphism on the 44 benchmarks of that paper, which is really quite a good result, and it also gives really big speedups: we saw the 7.7x speedup on sum_to, 1.5x on OptCarrot, and 2x on railsbench.

00:25:57.659 And with that, I want to thank you for listening, and I'm open to any questions.
00:26:10.620 Thank you.

00:26:11.820 Right, so the question is: does it have an impact on memory footprint? And yes, it does have an impact, of course: if we copy something, we need to use memory for it. It has an impact which is proportional to the representation the virtual machine uses to represent code, so bytecode or whatever else.

00:26:32.640 In TruffleRuby there is actually a limit, if I remember correctly: if we take the whole footprint of the code representation for the application, we don't grow it to more than 2.5 times what it was originally. So on average we're allowed to have one copy and a half; some things will never be split and some things will be split many times. So there's a limit, but actually we don't hit this limit that much. In practice, yes, it needs to be controlled so it doesn't go crazy, but if it's done in sensible places it tends not to be too unreasonable; compared to the application footprint it's typically not that important.

00:27:15.480 But yeah, like every optimization it's a trade-off; here the trade-off is a little bit more footprint for much faster code. So it was this one, right?
00:27:24.360 Yeah, so the trick here, in Truffle at least, is that we notice that step here is called by a single caller: there's only one thing calling step at this point in the program. So this is kind of the first time we see step being used, and that means: well, this can't be the source of the polymorphism, or if it is, we can't do anything about it, so let's keep looking higher. And we can keep looking higher like this, as long as there are single callers, up to a threshold; I think it's something like five.

00:27:56.460 And then here, for instance, we saw: okay, now we have two callers, we don't only have one, so this point is probably okay to split. But even then, if it didn't work, like there's no speedup and maybe there's still some polymorphism, it would detect that later and then try again to split higher, potentially.

00:28:12.240 Okay, then: thank you!