00:00:08.330
so today I want to talk about parallel and thread-safe Ruby at high speed with TruffleRuby.
00:00:13.889
My name is Benoit, it's French, I come from Belgium, and I moved to Zurich here at the
00:00:21.230
beginning of December, so it's quite recent, and I'm curious to meet you all.
00:00:32.120
First I have to show you this, which means what I talk about is a research project, and you should not make any buying decisions
00:00:38.760
based on it. So, I work at Oracle Labs,
00:00:45.809
and we are developing a Ruby implementation which is called TruffleRuby,
00:00:53.930
and I'm doing research as a PhD, for four years in industry, actually doing
00:01:00.510
research on concurrency in Ruby. I wanted to improve on that, because so far Ruby was not very good at concurrency. I've
00:01:08.310
been working on TruffleRuby since it was created, more than four years ago.
00:01:13.590
I'm also the maintainer of ruby/spec, which is the set of specs for the Ruby language itself, and I'm also an
00:01:19.680
MRI committer. So today I want to talk about two things: first I want to introduce
00:01:25.259
what TruffleRuby is and also talk about performance, because it's probably the most exciting aspect of it, and then
00:01:32.460
I want to talk about my research as a second part. First things first:
00:01:40.950
TruffleRuby intends to be high performance, to maybe reach a new level of performance that other Ruby
00:01:46.140
implementations just never reached so far. We try to be as fast as the
00:01:52.170
fastest just-in-time compilers for dynamic languages, for instance V8 for JavaScript; we try to be as fast
00:01:58.500
as that for Ruby. We use the Graal just-in-time compiler, which is what helps achieve this, and
00:02:05.740
we aim at full compatibility with CRuby, with very few exceptions; we always try to be as compatible as
00:02:11.800
possible, and that means we also run C extensions, because a lot of Ruby
00:02:17.650
applications out there use C extensions and there is just no easy replacement, so you want to run them, and we really support quite a
00:02:23.950
bit of that. It's all on GitHub, it's open source, you can check out the repository.
00:02:32.890
There are two ways to run TruffleRuby. One way is to run on the JVM, the Java Virtual
00:02:38.290
Machine, and the main advantage here is you can interoperate with other JVM languages and code in Java, and this provides
00:02:46.480
great performance. Now, the default mode we run in is what we call SubstrateVM, and
00:02:53.560
SubstrateVM is basically an ahead-of-time compiler that compiles TruffleRuby, our
00:02:59.800
Ruby implementation, and Graal, the just-in-time compiler, to a native executable. I
00:03:05.380
won't go into much detail there, but the main idea is: this whole thing is initially Java code; TruffleRuby is written in
00:03:10.870
Java, the Graal just-in-time compiler is also written in Java, and we compile all of this into one native executable, and
00:03:18.610
that compilation gives us much faster startup, because most JVMs don't start in under one second or do
00:03:23.920
anything significant in that time, but here it can start in 25 milliseconds. It also gives
00:03:30.580
faster warm-up, so the time it takes until your Ruby application is fast is also better, because here the just-in-time
00:03:36.940
compiler is precompiled itself, so it's already optimized to compile Ruby code; it's a bit meta. There is also a
00:03:45.010
lower footprint, because you don't have to do all that class loading, or load all the Java code and so on; all of this
00:03:50.410
is done ahead of time, and then basically the only memory we need is what the Ruby interpreter needs and nothing else, plus
00:03:56.739
the JIT of course. And the peak performance is similar in that configuration, so that's
00:04:02.320
why it is the default. The main advantage of the JVM mode is then if you want to interoperate with other JVM
00:04:07.900
languages; that is something else.
00:04:13.510
I guess you have heard of the Ruby 3x3 project. The goal of this is that CRuby, or MRI, 3.0 should be
00:04:21.400
three times faster than CRuby 2.0, and they want to do this with a just-in-time compiler
00:04:27.250
which is called MJIT. That's interesting, it's a good direction I think, and with it Ruby gets a
00:04:33.280
bit faster, but my question is: can Ruby be faster than three times Ruby 2.0? To
00:04:40.419
illustrate this I want to make a demonstration with what is called OptCarrot. OptCarrot is the main
00:04:47.289
CPU benchmark for Ruby 3x3; it's actually an emulator for the Nintendo
00:04:52.630
Entertainment System. Let's have a look. Okay, can you
00:05:08.080
read this? So the baseline for Ruby
00:05:13.660
3x3 is Ruby 2.0, so let's run that,
00:05:20.280
and so here we just run the benchmark, and
00:05:28.180
you can see here we have the emulator, and we have this game called Lan Master, which is used for the benchmark, and
00:05:33.880
I can actually play it, but as we see it's not really fast; when I move around you can see
00:05:41.169
there is a bit of lag. The aim of this game is very simple: it is to connect all the
00:05:46.300
computers together, like that.
00:05:52.410
The original speed of the NES is normally 60 frames per second,
00:05:57.940
but here it's only 34, so that's not fast enough. So we can use the latest
00:06:06.000
CRuby, which is 2.6, and I can run it,
00:06:13.669
and there it's 43 frames per second, so some frames per second better, but
00:06:18.689
still not 60. But now we can also use MJIT: we do this with the --jit flag
00:06:24.479
in 2.6, and now it gets actually a bit
00:06:29.759
faster, around 70 frames per second, which is nice. But then we can also run it on
00:06:39.870
TruffleRuby, and there it gets more
00:06:45.599
interesting. TruffleRuby has the two modes I
00:06:51.479
explained earlier, JVM or SVM; peak performance is a bit better on JVM,
00:06:56.900
so let's use that. So we see it starts slow,
00:07:02.159
like 10 frames per second, but it gets faster, because the JIT now learns what it is executing; it
00:07:08.219
optimizes and compiles it to machine code, and it optimizes much better than a traditional JIT. It's a bit unstable
00:07:15.629
until it learns the program, but then we get to 250 frames per second,
00:07:21.349
just like that.
00:07:26.490
If I play in this mode, the frame rate is much higher, so it means I have much less time to react, and, yeah,
00:07:34.319
now it's like we're playing something a bit different, and it's a really nice thing.
00:07:52.390
And, as you can see, I've been practicing; I used to play this on
00:08:25.280
my other laptop, which was a bit slower, so it was manageable; now it's not. Okay, I think you got the idea. So here
00:08:34.430
we can see the frames per second, right? You can see TruffleRuby is over 200, which is like three times faster
00:08:41.240
than MJIT, so that would be three times three times three. And what we saw is,
00:08:50.690
if you graph it, I can show it: TruffleRuby is the green line, and it's pretty slow in the beginning, but
00:08:56.660
once it learns the program and everything else, all the methods you use, for instance whether addition is used on small integers
00:09:02.900
or on something else like Bignums, once it knows all this it is very fast,
00:09:08.720
and this level of difference is a big margin, right? And today I
00:09:14.570
want to try to give you some insight into why we do so much better, and this is
00:09:21.230
not only on OptCarrot in particular: also on classical numeric benchmarks we perform very well,
00:09:26.810
typically between ten and fifteen times faster than MRI, and you see here
00:09:34.460
the MJIT also performs a bit better, but not everywhere.
00:09:39.730
Then, if you run MJIT's own set of micro-benchmarks (they ran them
00:09:45.410
against TruffleRuby at some point), here in the native configuration, the default one, we are
00:09:53.060
actually 30 times faster than CRuby 2.6, while MJIT itself is four times faster than MRI. We're good
00:10:01.680
at optimizing this kind of benchmark, and there's just a different level of how much you can optimize that
00:10:07.260
code. Other areas, for instance template rendering, are something
00:10:13.740
we are pretty good at: on the ERB benchmark we are about ten times faster than MRI, and this is due to
00:10:20.459
a different representation for strings. String concatenation in MRI is pretty slow, because it involves copying
00:10:27.420
and reallocating, but if you use something else like ropes, which TruffleRuby uses, then concatenation is
00:10:33.630
constant time. So what we do is: we
00:10:39.779
implement string concatenation differently. When we concatenate two strings,
00:10:45.180
instead of copying them into a new buffer, we create a new node that represents the
00:10:51.779
virtual concatenation of them, and we only actually flatten it when we write it to, for instance, the network.
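To illustrate the idea, here is a toy sketch in plain Ruby with made-up class names; it is not TruffleRuby's actual implementation, which is written in Java and has many more node kinds. Concatenation just builds a tree node in constant time, and the bytes are only copied when the rope is flattened.

```ruby
# Leaf rope: wraps an actual Ruby string.
class LeafRope
  def initialize(str)
    @str = str
  end

  def length
    @str.length
  end

  def flatten
    @str
  end
end

# Concatenation rope: records the two children, copies nothing.
class ConcatRope
  attr_reader :length

  def initialize(left, right)
    @left = left
    @right = right
    @length = left.length + right.length # O(1), no copying
  end

  # Only here do we pay for copying, e.g. just before writing to a socket.
  def flatten
    @left.flatten + @right.flatten
  end
end

rope = ConcatRope.new(LeafRope.new("Hello, "), LeafRope.new("world!"))
rope.length  # => 13, computed without building the string
rope.flatten # => "Hello, world!"
```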
00:10:59.360
There's a talk about this with more info; there's a lot more stuff we do. Now, the big
00:11:10.260
question is: can we run Rails? And the answer is yes, to some extent. A
00:11:15.959
small blog application already runs almost out of the box, and
00:11:25.020
we are trying to run Discourse, because that's one of the benchmarks for Ruby 3x3 as well.
00:11:31.560
We already have a fair bit of it working, but currently with a few patches, so we need to fix a few things
00:11:37.380
there, and we want to do it properly: fix everything in the implementation so it works without changing anything in
00:11:43.890
the application. The challenge of running a big Rails app is that it often has tons
00:11:49.020
of dependencies, over 100 gems for instance, and one gem that doesn't work is enough to not be able to run
00:11:55.860
it; then you have to start to work around it. And many of these gems also use
00:12:01.460
C extensions, which tend to be more complicated. Recently we created
00:12:07.010
a new approach for C extensions that works a lot better: before, we actually had to, for instance for
00:12:12.200
the pg driver or mysql2, patch quite a lot of places, and so
00:12:18.110
on, and this doesn't scale; if you have to patch every C extension to make it work with us, it's never going to work. So now we have a
00:12:24.500
different approach, which emulates the CRuby C API, including things like the GC, for the extensions, and
00:12:30.200
the end result is basically that many C extensions work out of the box; you just switch to TruffleRuby and that's it. So
00:12:37.430
we know we support all of these, and I guess many more. Now,
00:13:12.870
here is a very simple blog, nothing very fancy, but it works; it was created with
00:13:19.390
rails new and scaffolding, and the
00:13:24.550
Rails version here is 5. So, why is TruffleRuby
00:13:35.200
so fast? Basically, there are two main concepts. The first is partial evaluation,
00:13:40.829
which is kind of new; I think it's something that was introduced by our project, and nobody
00:13:45.880
really used it in products before, at least at this scale. And then we use, of course,
00:13:50.950
the Graal compiler, which has a lot of optimizations; as we will see, that helps a lot as well. Implementation-wise,
00:13:58.860
basic primitives like Integer#+ are written in Java, because TruffleRuby's core is written in Java,
00:14:05.680
but it cooperates with code in Ruby itself, so this is like Rubinius: a
00:14:10.899
lot of the core library is defined in Ruby itself, and just what we cannot express in Ruby is coded differently. Of course, in
00:14:17.890
the future that may change. So let's see how a Ruby interpreter can execute your program.
00:14:30.160
Here is a very simple method, foo, and what it does is map
00:14:36.850
an array of two numbers, tripling each of them. The first thing any
00:14:43.420
Ruby implementation does is parse this text into an abstract syntax tree, and actually here there are three abstract
00:14:49.720
syntax trees, because there is one for the method foo, which only calls map on
00:14:55.329
an array literal, and the only other argument is the block. The block
00:15:00.459
itself is a different AST, which multiplies its argument by
00:15:05.500
three. Finally, map is also a separate AST, if it's implemented in
00:15:11.560
Ruby, with its own logic: it reads from the original array, creates a new array, and calls the block for each element.
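The example method used in the rest of this walkthrough is essentially this (the exact literal values are reconstructed from the description):

```ruby
def foo
  # Map an array of two numbers, tripling each element.
  [1, 2].map { |n| n * 3 }
end

foo # => [3, 6]
```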
00:15:11.560
Now, what we do in TruffleRuby is very similar to this:
00:15:24.780
instead of just an abstract syntax tree, we build what we call a Truffle abstract syntax tree, and it looks similar, but
00:15:31.470
the main thing is that here each of these nodes is a Java object, and each
00:15:36.720
of these nodes has an execute method, which defines how to execute this node, i.e. its
00:15:42.030
semantics. For instance, we have a multiplication node, and what it does is:
00:15:48.240
execute the left child, execute the right child, and multiply the two together.
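A toy version of such an AST interpreter, sketched here in Ruby rather than Java (the real Truffle nodes are Java objects, and these class names are made up):

```ruby
# A literal node simply evaluates to its fixed value.
class LiteralNode
  def initialize(value)
    @value = value
  end

  def execute(frame)
    @value
  end
end

# A multiplication node executes the left child, executes the right
# child, and multiplies the results: exactly the semantics described.
class MulNode
  def initialize(left, right)
    @left = left
    @right = right
  end

  def execute(frame)
    @left.execute(frame) * @right.execute(frame)
  end
end

# The AST for the expression 2 * 3:
ast = MulNode.new(LiteralNode.new(2), LiteralNode.new(3))
ast.execute(nil) # => 6
```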
00:15:48.240
That's a very simple way to implement Ruby, but interpreting this way is actually very slow; Ruby
00:16:00.720
1.8 was a bit like this, and it wasn't very fast. But because we have this trick of partial evaluation, we
00:16:00.720
can actually make it very fast. The idea is: this magic process, which I will
00:16:07.290
explain now, called partial evaluation, takes this Truffle AST, and what we get out of it is a compiler graph
00:16:13.500
representing how to execute foo. This is basically an intermediate step, because once the JIT takes this, it
00:16:20.190
emits machine code for it; that's basically what we need to make
00:16:25.380
machine code for this Truffle AST. Now, the way it works is: we start from the
00:16:32.130
initial node, the top node, like this one for instance, and then we see okay, what is this execute
00:16:38.670
method doing: it is using this one, and this one is calling this one, and this one, and actually
00:16:45.330
we go through every node like this, we inline the entire AST, and then we get
00:16:51.840
a lot of Java code. Let's see it in action. So we have the method foo again, and
00:16:58.880
the execute method of this foo just executes the body, so it executes the child node,
00:17:05.510
and the child node here is a call node, right? And the call node has
00:17:11.550
some logic to call a Ruby method, but first, before calling anything, it has to execute the receiver and the arguments,
00:17:17.839
and executing the receiver, here an array literal, means
00:17:24.780
creating a new Ruby array with the values obtained by executing the value nodes; the values are 1 and
00:17:30.180
2, integer literals, which just return 1 and 2. Now, the interesting thing is, when we do
00:17:36.000
this process, we also do some kind of constant folding: we know we are compiling foo and not any other method,
00:17:43.730
so when we read a field, like here this child field, right, normally we would have to read this field in
00:17:49.860
plain Java code, but here we don't, because we know this AST is constant, and because
00:17:56.580
it is constant we can fold everything. So all these execute methods fold together like this, and the
00:18:04.200
equivalent thing we get after partial evaluation is this, with foo,
00:18:09.660
and everything inlined: we know which method we call, we know which block, which array, which values,
00:18:15.450
and everything. This is done for foo, but we
00:18:21.000
also do it for the other ASTs: we do it for the map AST, and we also do it for the block AST, again with the same
00:18:27.750
process. So map's execute is essentially: call the block for each element, and the block multiplies by 3. Then
00:18:37.380
another thing that happens at this point is inlining. What we see is that when this foo method calls map,
00:18:43.980
we remember which method the call node invoked, and we can tell
00:18:49.050
it always calls map, which is not very surprising, that's what the code has to do, so let's inline it. The same thing happens
00:18:54.720
when map calls the block with yield: it actually always calls the same block, so this can also be
00:19:00.300
inlined. Inlining puts all these ASTs together into one, which makes it very easy to reason about, and then we
00:19:08.730
have a lot more code we can optimize together, and we can start to optimize things that span different methods
00:19:14.460
and classes. So once we inline everything, we have something like this: from the original array with 1 and 2, we
00:19:22.200
allocate a new array of the same size, and for each element we call the block, but it is inlined, so we just multiply
00:19:27.950
the element directly. Now, at this point the
00:19:33.890
Graal just-in-time compiler kicks in, and here we start to have some classical optimizations. The first thing it will do
00:19:40.010
in this case is what is called escape analysis: it sees that this array here that we create, the one we wrap in
00:19:46.610
a Ruby array, actually never escapes the method; the return value is
00:19:51.770
a different array, so it doesn't matter. We read some stuff from the array, like the size
00:19:57.680
of the array, and we read elements from it, but the array instance itself, the
00:20:03.680
wrapper, doesn't escape, so we don't need to allocate it; the compiler can figure this out and replace every
00:20:08.750
usage with what is actually in its fields. So what happens is
00:20:14.720
this: the array becomes just its storage, which is an int[] internally; the size of the array here is two,
00:20:22.370
because the compiler sees the size of this storage is two, so that's it; and reading from the array is reading from the storage.
00:20:30.100
The next thing: we are now reading from an int[], so what we read must be an int, so this check is always true, and
00:20:37.600
there is no point keeping the other branch that does more complicated semantics. Now,
00:20:47.420
for the loop here: it always does exactly two iterations, and inside the loop there is
00:20:54.530
not so much code, so the compiler unrolls it, duplicating the code for each iteration, because there are only two of them, and then the
00:21:01.340
code is much simpler and easier to optimize. After this, the compiler sees
00:21:09.770
that we are reading storage[0] and storage[1], and our storage is local, so each read can be
00:21:16.620
linked to the write that produced it; the compiler sees what writes it and what uses it, so it can just forward the value:
00:21:23.070
storage[0] becomes 1, and so on, and the rest of the array handling is not needed anymore, so it's gone. Now we have
00:21:32.730
multiplyExact, which is like multiply but also checks for overflow, but
00:21:37.920
obviously 1 times 3 doesn't overflow, so we can just do the multiplication; we rearrange the code, and we get this, which is
00:21:44.190
basically the most optimized code you could get for what we had initially.
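In Ruby terms, after escape analysis, loop unrolling and constant folding, the compiled foo behaves as if it were just the following (an illustrative reduction, not literal compiler output):

```ruby
def foo_optimized
  # The input array, the loop and the block have all been folded away;
  # only the allocation of the result survives.
  [3, 6]
end

foo_optimized # => [3, 6]
```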
00:21:49.890
So we transform this into this Java code, which, translated back into Ruby code, basically just allocates the answer;
00:21:55.919
it is what we could reason out ourselves. This looks simple, but getting the compiler to understand it is not that easy, and
00:22:01.919
actually only TruffleRuby manages this level of understanding of Ruby semantics.
00:22:07.700
Of course this then becomes assembly, and the assembly does what
00:22:12.900
you can imagine: write 3 to memory, write 6 to memory, then return.
00:22:18.390
So that was TruffleRuby. Now, to
00:22:25.080
compare a bit, I want to compare with MJIT, the MRI JIT. The first step
00:22:30.570
is also to parse Ruby code to an AST, which is then transformed not to a Truffle AST but to bytecode; it's just another
00:22:36.809
representation, it doesn't really matter. But the point is that when a method is
00:22:41.820
called many times, MJIT will generate C code; it really prints C code
00:22:47.549
and then passes it to a C compiler, which then generates a shared library, which it then loads back. It generates
00:22:55.140
C code because that's the way they chose to communicate with the compiler; of course, to communicate with a C compiler, that's the most
00:23:01.500
convenient. And so this foo function is translated to this; it
00:23:08.010
looks a lot like C extension code, right: okay, the
00:23:13.590
values are here, and then there's an rb_ary_new_something, and then a function call with a block. They can also
00:23:20.370
compile the block separately: here the block is multiply by three, but it's inlined and specialized for
00:23:26.269
Fixnum and Float, for instance. And then at this point the C compiler kicks in,
00:23:34.309
and what it can optimize is not magic: the only thing it can figure out is that this value is not a Float,
00:23:41.330
it's a statically known integer, so it can say okay, that branch doesn't exist, remove it, and just keep this
00:23:46.399
one. But that's it, it cannot go further, because it doesn't know how to look into
00:23:52.279
this, or this, or this call: those are already compiled to native code, and it has no idea what they do.
00:23:57.409
So this is currently the main limitation of MJIT: it doesn't know anything about the rb_ary_new function, for instance,
00:24:02.889
because that's part of the Ruby binary, and the source form is missing at that
00:24:08.299
point, so it doesn't know how to look through it or inline it. So, in summary, we can
00:24:19.610
see that the performance of Ruby can be significantly improved: it can be up to ten times faster, for instance, on a
00:24:24.950
number of benchmarks, and TruffleRuby is an example of that. The message is: there is no need to rewrite your application
00:24:31.549
in another language than Ruby for speed. Ruby can be as fast as JavaScript and other dynamic languages, at least, which
00:24:37.490
is already pretty good, and that's within a factor of two of Java, which is probably good enough for most companies.
00:24:46.120
One key point is that MJIT can only optimize user code: it needs to understand code like this map function, for instance,
00:24:51.259
otherwise it's very limited. It only understands user code that never uses core library methods, but real code calls
00:24:57.409
core methods everywhere, and if it doesn't understand them, it can't do much. TruffleRuby can
00:25:04.100
also fold constants easily; for MJIT, for instance, it's a bit more complicated, because in the generated C there is
00:25:11.330
no concept of an object or its fields;
00:25:17.120
an instance variable write is just a pointer plus an offset and a write to memory, and the C compiler
00:25:22.879
doesn't know that maybe nobody will read this field from other threads, and so on. So there are
00:25:30.040
some challenges to address there: basically, being able to communicate better with the compiler.
00:25:36.340
In TruffleRuby we can do this better, because we're in Java, so there's already the concept of objects and fields, but also
00:25:42.520
we control the just-in-time compiler, and so we can give it more information. So now I
00:26:01.000
want to show what I did during my PhD, which is: I
00:26:06.100
want to make Ruby and other dynamic languages parallel. The problem I
00:26:11.200
noticed, and was motivated to work on, is that dynamic languages don't have great support for parallelism; I think none of
00:26:18.280
them has a really good solution, and often that's not actually due to the
00:26:24.940
languages themselves, it's due to the implementations. There are various categories. The first type is
00:26:32.380
the one with a global lock, and in this category there is CRuby and CPython, and JavaScript
00:26:37.840
could also be considered there to some extent. With a global lock, there is no
00:26:43.660
parallel execution of Ruby or dynamic language code, ever, in a single process, so the only possibility to achieve
00:26:50.560
parallelism is to use multiple processes, which wastes memory, the biggest resource, and means
00:26:55.660
slow communication between processes; that's kind of the workaround there to get parallelism: use multiple processes. Then the
00:27:06.190
second category, which is more interesting, is what I call unsafe, where JRuby and Rubinius are:
00:27:11.560
they allow Ruby threads to run in parallel, but they give up important
00:27:16.660
guarantees. For instance, if you call Array#append concurrently on the same array, today you might get an
00:27:22.870
exception. That's a problem, because the documentation says Array#append appends
00:27:28.390
one element to the array, not that it might throw an exception. And then we
00:27:35.380
have a third category, which avoids the problem altogether: there is no shared memory, or almost none, so this is
00:27:44.049
share-nothing or share-little. In that category is JavaScript, because it has kind of an actor-like model with
00:27:50.860
Web Workers, and Guilds, which are planned for Ruby 3, and which I will
00:27:57.850
expand on, are also in this model. The problem is you cannot pass
00:28:03.490
objects between threads; it is simply impossible, so you have to deep-copy them, or
00:28:10.299
transfer ownership, and then you can't use them anymore in the original thread. So when you use it that way, it's much more restrictive.
00:28:22.000
So let's talk about Guilds, the new concurrency model planned for Ruby 3. The advantage of
00:28:29.950
this is that there's a stronger memory model, so the semantics are simpler; there are no low-level data races, because there is no shared
00:28:36.279
memory. It's a bit like processes, but within
00:28:41.830
your single real process: every guild has a different heap of objects, and they have nothing in common. So the
00:28:50.470
big limitation here is that you cannot share or pass objects around, and, well, Ruby is an object-oriented
00:28:56.380
language, so that feels like a big deal. You can deep-copy objects, but if the object is large it takes a long time
00:29:02.679
to copy it. Or you transfer ownership, but actually there is a catch: to transfer ownership, you need to
00:29:07.899
deep-copy everything except the last layer, for implementation reasons, so it's not much faster. So this is a problem:
00:29:15.700
compared to a sequential program, when you parallelize with this, you have this copy overhead, and of course you
00:29:21.940
always have to balance it: if the copy overhead is too high, maybe the parallel version is slower than the sequential version.
00:29:30.820
It also means that shared mutable state can only be accessed indirectly: it has to be in its own actor, its own guild,
00:29:36.789
and every request to it is communicated from the outside, one at a time. So
00:29:42.010
concurrent reads from multiple threads are not possible, except
00:29:47.889
maybe with very special data structures built into the model for that, but for the
00:29:53.860
normal way, your own business logic, no. It's a different programming model, so,
00:30:02.920
for instance, if your code uses Ruby threads, it won't work with Guilds just like that; we need to adapt it, because we cannot share things anymore.
00:30:10.660
I think it's a complementary solution: some problems are very nice to express with this
00:30:17.110
share-nothing or share-little model, like, for instance, some
00:30:22.660
IRC chat bot or so on, something that tends to be very isolated, where you don't need to
00:30:28.930
share a lot of state; then it's quite nice. But some problems, like ray tracing, where you would
00:30:35.680
say, okay, I want to render this picture and I will give this part to this thread and that part to another thread,
00:30:41.620
are much easier to do with shared memory, because you don't have to copy the scene all the time; you just
00:30:47.050
write to it, and everyone writes a different part. So, I was saying JRuby and
00:30:55.360
Rubinius are unsafe, and I want to show an example. I create an array here, and I create a hundred threads, and each of
00:31:02.320
these threads is going to append 1000 integers to the array; then I wait for all the threads, and I print the
00:31:07.390
size of the array. So, you know, the program should print 100 times 1000, one hundred thousand, and
00:31:12.700
sure enough, if I run this on CRuby I get the right answer; of course nothing runs in parallel there, the threads are
00:31:19.210
just waiting for each other, but it turns out correctly. I can run this on JRuby,
00:31:25.000
and I get a random number, or I get an
00:31:32.950
exception. And if I run it on Rubinius, same thing: I get a random
00:31:38.590
number, or a different exception.
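Reconstructed from the description, the demo program is approximately:

```ruby
array = []

threads = 100.times.map do
  Thread.new do
    # Array#<< is not an atomic operation on JRuby or Rubinius.
    1000.times { |i| array << i }
  end
end
threads.each(&:join)

# On CRuby the global lock serializes the appends, so this prints 100000.
# On JRuby or Rubinius it may print a smaller number, or raise.
puts array.size
```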
00:31:43.780
That's kind of bad: if you have something in production using parallelism, maybe locally there are not enough requests for the race to happen, but then
00:31:50.980
in production it might hit, and what you get is this; it's not very nice. The
00:31:57.910
reason why we get a smaller number here is that some appends were lost: the race happened in a way that
00:32:05.070
basically two threads incremented the size at the same time, non-atomically, so
00:32:10.200
it was like: both read three, both computed three plus one, both wrote four; oops, we incremented the size by 1 but added 2 elements to the array.
00:32:16.470
So this is very bad. The
00:32:24.210
workaround on these implementations, and the only way they recommend, is to do
00:32:30.690
some synchronization yourself: you can use a Mutex like this, or you can use something like Concurrent::Array, which
00:32:37.590
comes from the concurrent-ruby gem. The main problem is that it's really easy to forget to use that, and then again it only
00:32:44.190
shows up in production and is hard to reproduce. So I don't think we want unsafe by default.
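The Mutex-based workaround looks like this:

```ruby
array = []
lock  = Mutex.new

threads = 100.times.map do
  Thread.new do
    # Every append is now protected by the lock, on any implementation.
    1000.times { |i| lock.synchronize { array << i } }
  end
end
threads.each(&:join)

puts array.size # 100000, at the cost of synchronizing every access
```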
00:32:51.710
Also, this synchronization is not necessary for sequential code, yet it adds significant overhead
00:32:57.720
when you actually only do sequential accesses to the array, which is sad. And if threads access different parts
00:33:04.350
of the array, you should be able to do that in parallel. But okay.
00:33:10.649
I think the biggest point is this: people believe Array and Hash are thread-safe on CRuby, and if they are not on another
00:33:16.139
implementation, then that implementation is "incompatible". One concrete example: in Bundler and
00:33:22.440
elsewhere, people say "this doesn't work on JRuby, because Array and Hash are not
00:33:27.450
thread-safe there". But of course JRuby doesn't do this by
00:33:32.639
intent; I mean, the reason they do this is that making Array and Hash thread-safe there would be very
00:33:37.739
expensive, and it would affect single-threaded performance, which they don't want. So the
00:33:43.649
big question is: can we make the built-in collections thread-safe, never losing updates, without giving up performance?
00:33:49.499
And one more thing is also parallel access: if I want to access
00:33:55.019
a Hash in parallel, especially accessing different keys, there is no reason I shouldn't be able to do that.
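To illustrate why access to different keys need not be sequentialized, here is a hypothetical sharded hash in plain Ruby — a toy, not TruffleRuby's actual Hash design — where each key maps to one of several independently locked shards, so threads touching different shards do not contend:

```ruby
# Hypothetical sketch: a hash split into shards, each with its own lock.
# Threads writing different keys usually hit different shards and so do
# not block each other. This is NOT how TruffleRuby implements Hash.
class ShardedHash
  def initialize(shard_count = 8)
    @shards = Array.new(shard_count) { { lock: Mutex.new, data: {} } }
  end

  def []=(key, value)
    shard = @shards[key.hash % @shards.size]   # Ruby's % is non-negative here
    shard[:lock].synchronize { shard[:data][key] = value }
  end

  def [](key)
    shard = @shards[key.hash % @shards.size]
    shard[:lock].synchronize { shard[:data][key] }
  end
end

h = ShardedHash.new
threads = 4.times.map do |t|
  Thread.new { 100.times { |i| h["t#{t}-#{i}"] = i } }
end
threads.each(&:join)
puts h["t0-99"]  # 99
```

With a single global Mutex, all eight of those writers would serialize; sharding is one classic way to let disjoint keys proceed in parallel.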
00:34:04.859
even larger than that — it's not just collections — because objects have the same problem: you can add or remove
00:34:11.320
instance variables, so an object is essentially as powerful as a hash table. Even though you don't usually see the same usage pattern,
00:34:17.669
they have the same problem. And so, because you can add or remove fields — you could
00:34:23.679
add a hundred or a thousand fields if you wanted to — we need to grow the storage for these objects dynamically: we need to
00:34:29.830
exchange the storage for a bigger one that can store more instance variables. And when we do that, with concurrent writes happening while we're extending the
00:34:36.760
storage, we might lose some writes. So here's an example: I create
00:34:42.520
an object, and then I create a thread that sets the field @a to 1 and then
00:34:48.520
updates it to 2. The second thread sets the field @b to the string "b". We wait for both threads, and one possible outcome is
00:34:56.590
that we read 1 for @a — meaning the write of 2 was completely lost. This is actually
00:35:03.820
possible on Rubinius. And the reason is: okay, initially thread one sets @a
00:35:10.390
to 1 — very good, we store it. But then, before it sets it to 2, the second thread
00:35:16.330
starts, and it says: wait, the storage only has capacity for one instance variable, so I need to grow it. So
00:35:22.000
to grow it, it allocates a bigger chunk and copies the existing content — it does this — then it writes the
00:35:28.960
new value into it, and then assigns the new storage to the object. The problem is: the first
00:35:34.450
thread wrote the 2 to the old storage, which is now lost and unused, and so
00:35:39.910
it's a lost update. Rubinius could do something there to fix it, but that's one example. So the
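The scenario from the talk can be written down directly. On a thread-safe implementation (CRuby, or TruffleRuby with safe objects) the final values are always `@a == 2` and `@b == "b"`; on old Rubinius the write of 2 could be lost while the other thread grew the object's field storage:

```ruby
# The two-thread field race from the talk. Thread 1 writes @a twice;
# thread 2 adds a new field @b, which forces the object's storage to grow.
obj = Object.new

t1 = Thread.new do
  obj.instance_variable_set(:@a, 1)
  obj.instance_variable_set(:@a, 2)
end
t2 = Thread.new do
  obj.instance_variable_set(:@b, "b")  # triggers storage growth
end
[t1, t2].each(&:join)

p obj.instance_variable_get(:@a)  # expected 2; a lost update would show 1
p obj.instance_variable_get(:@b)  # "b"
```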
00:35:48.339
main idea is: how can we address this — can we avoid the overhead for sequential programs and still
00:35:55.300
achieve thread-safe collections and objects? The solution is to only synchronize on what
00:36:00.310
we need. So my approximation here is: only synchronize on objects or
00:36:05.770
collections that are reachable by multiple threads. If an object or collection is reachable by only one thread,
00:36:11.230
there is no need to synchronize — there is no concurrency on it.
00:36:16.590
So here's an example: thread one allocates a queue and appends to it. The allocation and the writes are
00:36:23.340
plain, unsynchronized writes, because the queue is only reachable by one thread. Now, what happens if the queue
00:36:29.970
is assigned to a global variable? Then it is accessible by every other thread. Or
00:36:35.100
if we give the queue somehow to thread two. At that point the queue becomes shared, and from then on there is some
00:36:41.730
synchronization when accessing it. And not only the queue, but everything reachable through it, because of course thread two
00:36:47.400
can now put a message in the queue and access everything through it. And to do this
00:36:56.430
is actually quite simple — we don't need much. Of course, we need to track this at the right time: when an object becomes shared, we need to
00:37:02.190
track it, and the way we do this is with a write barrier. It means that when you write to a shared object or to a shared
00:37:09.330
collection, whatever we put into it is marked as shared as well, because it becomes reachable through it. On a write to
00:37:16.110
a shared field, we share whatever is written into it: in the hash case the key and the value, and in the array case the element. In
00:37:22.980
each case, the newly reachable values become shared too.
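A toy model of that local/shared write barrier — my own simplification, not TruffleRuby's code — makes the idea concrete: every tracked object starts local, and sharing propagates transitively, both when an object is first shared and on every later write into a shared container:

```ruby
# Toy model of the local/shared distinction with a write barrier.
class Tracked
  attr_reader :refs
  attr_accessor :shared

  def initialize
    @shared = false   # everything starts local to its creating thread
    @refs = []
  end

  # write barrier: storing +value+ into a shared container shares it
  def write(value)
    share(value) if @shared
    @refs << value
  end
end

# sharing is transitive: everything reachable becomes shared
def share(obj)
  return unless obj.is_a?(Tracked) && !obj.shared
  obj.shared = true
  obj.refs.each { |r| share(r) }
end

queue = Tracked.new
item  = Tracked.new
queue.write(item)    # queue is local: plain write, nothing is shared

share(queue)         # e.g. the queue is assigned to a global variable
p item.shared        # true — already-reachable objects got shared too

item2 = Tracked.new
queue.write(item2)   # write barrier fires on the now-shared queue
p item2.shared       # true
```

In the real system, only objects whose `shared` flag is set would ever pay for synchronization; local objects keep the fast unsynchronized path.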
00:37:29.160
With this approach we actually get pretty good results. This is a single-threaded benchmark — a benchmark that
00:37:36.210
heavily uses objects — and "unsafe" is TruffleRuby as it was initially, while "safe" is TruffleRuby with
00:37:42.960
thread-safe objects using the local and shared distinction. And there is basically no difference: the bars are
00:37:49.110
the same everywhere. On the other hand, if you synchronized on every write, which is
00:37:54.450
the naive approach, you get the bars at the back — up to 2.5 times slower — because every
00:38:02.400
write becomes more expensive. But here, since it's a single-threaded benchmark, all objects are local — they can be reached by one
00:38:09.090
thread — so the write barrier adds no cost and there is no race condition. And we can do the same trick for collections, of course: again,
00:38:15.510
comparing the thread-safe version and the unsafe version of the collections, it's the same performance everywhere
00:38:21.590
for single-threaded benchmarks. So now let me
00:38:27.570
detail a bit more how we deal with thread-safe arrays, because it's quite interesting. The first thing to
00:38:35.730
introduce is array strategies — storage strategies. This is an implementation technique that's used in most
00:38:42.270
state-of-the-art virtual machines, and the idea is that we represent the array with
00:38:50.550
different representations behind the scenes, to fit exactly what's in it. So initially,
00:38:58.170
for instance, I have an empty array that uses the empty strategy: there is no storage at all — it could be nil, whatever. Then when we append an
00:39:05.940
integer into it, it becomes an int array — it migrates. Then if we later append an object, it
00:39:14.460
becomes an object array. So the most general one is the object array; the reason we don't use it all the time
00:39:20.430
is that there are extra costs to it: on the JVM, an Object[] array would box primitives like int and long. And also,
00:39:30.180
it's an observation that, even though Ruby is a dynamically typed language — so we can mix types in an array — in practice
00:39:35.460
very few arrays do. In practice, almost always, if there is an integer in it,
00:39:41.010
probably all the other elements are going to be integers; it's rarely a mix. And so the advantage
00:39:47.160
then, of course: if you represent an array of ints just as an int[], each element takes four bytes; it's very compact, very
00:39:53.010
efficient, and I can access it by just reading at the right index, and that's it. But in MRI,
00:40:00.600
each element takes eight bytes, and the integer has to be encoded with the type tag and whatnot. So these are
00:40:12.600
the array strategies as they already exist.
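A simplified simulation of those storage strategies — a sketch of the idea, not TruffleRuby's implementation — shows the migrations the talk describes: empty, then a compact int representation, then the general object representation only once types actually mix:

```ruby
# Simplified simulation of storage strategies: track which specialized
# representation the array would use, migrating as elements are appended.
class StrategyArray
  attr_reader :strategy

  def initialize
    @strategy = :empty   # no storage at all for an empty array
    @storage  = nil
  end

  def <<(value)
    case @strategy
    when :empty
      @strategy = value.is_a?(Integer) ? :int : :object
      @storage  = [value]
    when :int
      @strategy = :object unless value.is_a?(Integer)  # generalize
      @storage << value
    else
      @storage << value
    end
    self
  end

  def [](index)
    @storage && @storage[index]
  end
end

a = StrategyArray.new
p a.strategy   # :empty
a << 1 << 2
p a.strategy   # :int — a compact int[] would suffice
a << "three"
p a.strategy   # :object — the general representation, only when needed
```

In a real VM, the `:int` case would be backed by an actual primitive `int[]`, which is what makes it four bytes per element and branch-free to read.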
00:40:22.440
And what we do is add thread safety: I want to make this work with multiple threads, and that is complicated, because now the array can
00:40:29.090
change its storage while other threads access it — two threads might see the old and the new storage at the same time. So, to summarize, I want
00:40:35.300
all array operations to be thread-safe — basically the guarantees that exist on MRI — I want to
00:40:43.040
preserve the optimization of strategies, and I want to still enable parallel reads
00:40:48.260
and writes to different parts of the array. And the way to do this is: let's add another dimension to these array
00:40:54.380
strategies, which I call concurrent strategies, and I designed two of them: what I call
00:40:59.990
SharedFixedStorage and SharedDynamicStorage. Now, SharedFixedStorage is
00:41:05.890
a strategy specialized for a common usage pattern. Basically it assumes: okay, this array will be
00:41:12.200
kind of a fixed-size array — you can exchange elements, you will access it by reading and writing, but you are not going to append or delete elements in it.
00:41:19.420
And by having this more specialized storage, we can access the array
00:41:24.500
fast — we have no overhead in that case over the unshared representation. But now, what if you actually append
00:41:33.260
an element, or delete an element, from one of these? Then we need to migrate to SharedDynamicStorage, and there, because there are
00:41:40.460
hundreds of methods on Array and they do all kinds of crazy things, we need to use a lock — there is no practical way to
00:41:46.910
implement and verify all of that without a lock. That's the point.
00:41:53.140
Now let's see: for instance, here we have some benchmarks that read from an array, or read and do some writes,
00:42:01.450
with a lot of different variants, and we run this from one
00:42:07.820
thread up to 44 threads — this is on a big machine with actually 44 cores. And what
00:42:14.300
we see is "local": when the array is not shared between threads, the performance is pretty good — here we
00:42:20.780
reach about 30 billion array accesses per second. But when we use the
00:42:28.760
SharedFixedStorage, which was this guy, we have the same thing, the same performance. So
00:42:34.010
that's the point: it enables direct access to the storage, without any checks or locking on it, and
00:42:40.130
it can be optimized nicely. But then, as soon as you start to use a lock — all
00:42:46.200
the other variants use some kind of locking mechanism — performance drops, bam, three times slower at least. So
00:42:54.210
then we tried different kinds of locks. Well, the reentrant lock is like a global lock for the array, so it doesn't scale:
00:43:01.230
with 44 threads it's not any faster than with one. Then there's something like StampedLock,
00:43:06.390
which is a read-write lock, which scales for reads — the dark green
00:43:11.580
curve here — but it doesn't scale for writes at all. So that's not good enough, and we needed to invent something new, and that's what we
00:43:17.250
call the LayoutLock — ours is actually just a variant of it.
00:43:22.950
It basically has three modes of access: read access, write access, and layout
00:43:28.890
change. A layout change means: okay, we change everything — we change the strategy, or we need to delete an element in
00:43:34.830
the middle — and that needs exclusive access to the array. But reads and writes can run in parallel, and so with these two modes it
00:43:42.150
scales for reads and for writes.
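A toy sketch of the LayoutLock idea — a simplification, not the real lock-free algorithm — can be built from a Mutex and a ConditionVariable: element reads and writes run concurrently with each other, while layout changes (resizes, strategy migrations) wait for exclusive access and block new accesses until they finish:

```ruby
# Toy sketch of the LayoutLock's three access modes. Reads and element
# writes share the lock with each other; layout changes are exclusive.
class LayoutLockSketch
  def initialize
    @mutex = Mutex.new
    @cond = ConditionVariable.new
    @accessors = 0          # threads currently reading/writing elements
    @layout_change = false
  end

  # reads and element writes: mutually concurrent
  def access
    @mutex.synchronize do
      @cond.wait(@mutex) while @layout_change
      @accessors += 1
    end
    yield                   # runs outside the mutex, in parallel
  ensure
    @mutex.synchronize do
      @accessors -= 1
      @cond.broadcast if @accessors.zero?
    end
  end

  # layout changes: exclusive against accesses and other layout changes
  def layout_change
    @mutex.synchronize do
      @cond.wait(@mutex) while @layout_change || @accessors.positive?
      @layout_change = true
    end
    yield
  ensure
    @mutex.synchronize do
      @layout_change = false
      @cond.broadcast
    end
  end
end

lock  = LayoutLockSketch.new
array = [1, 2, 3]
sum   = 0
t = Thread.new { lock.access { sum = array.sum } }
lock.layout_change { array << 4 }  # exclusive: safe to resize
t.join
p sum, array.size  # size is 4; sum is 6 or 10 depending on ordering
```

The real LayoutLock avoids taking a mutex on the read/write fast paths entirely; this sketch only reproduces the mode semantics, not the performance.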
00:43:48.330
Now I can run some bigger benchmarks. This is actually the NAS Parallel Benchmarks — a suite of benchmarks
00:43:54.450
originally written in Fortran; typical scientific computing. And what you see here is
00:44:03.270
scalability: how much faster we are on N threads than on one thread. And
00:44:09.450
you see that the scalability curves are basically very similar for Java, Fortran, unsafe TruffleRuby, and the thread-
00:44:17.430
safe collections — they're all the same. So you can say that this Ruby code scales like code written in
00:44:22.800
Java and Fortran, which is very good for this kind of workload. The reason here is
00:44:29.550
simple: it's a very regular workload, and we mostly use the SharedFixedStorage here.
00:44:39.400
So, it's about the end. I showed that the performance of Ruby can
00:44:46.400
be better, and TruffleRuby is an example of that; and that we can have parallelism and thread safety.
00:44:52.640
Based on the simple idea that we only synchronize on objects reachable by multiple threads,
00:44:58.130
we can actually go pretty far: we get parallel array reads and writes, and thread-safe
00:45:05.180
Array and Hash. And so, in the end, TruffleRuby is the only Ruby that runs
00:45:12.799
in parallel with basically the same strong thread-safety guarantees we have with a global lock, but still scales on
00:45:19.880
many cores. You can try TruffleRuby — it is integrated with all the Ruby managers
00:45:26.180
I could find, so with rvm, rbenv, or chruby it should just work; if it doesn't,
00:45:32.719
please report back to us. And then, yes,
00:45:38.239
TruffleRuby is part of a bigger project, which is GraalVM, and it was released
00:45:43.249
last year, actually. It doesn't
00:45:48.289
only contain Ruby, but also a JavaScript runtime, a Node.js runtime, a Python runtime, and an LLVM runtime, which
00:45:55.130
means we can run JavaScript, Ruby, Python, C, and so on in a single VM. The advantage of this is that all these
00:46:01.670
languages can interoperate: they can call each other easily and efficiently. It means that, for instance, if you call
00:46:07.549
Python from Ruby, the JIT can actually inline through it — it can see both languages and optimize them
00:46:13.339
together. So basically you can call from Ruby to Python to C, and
00:46:19.309
it is as fast as if you had written everything in C. This polyglot capability is also
00:46:28.190
what we use to run C extensions. When you
00:46:33.259
run a C extension, it's actually interpreted with our own interpreter for C code, which has a JIT as well.
00:46:39.739
So your Ruby code and the C extension code get JIT-compiled and optimized
00:46:45.289
together. If there's some code calling between Ruby and C back and forth, that is much more
00:46:51.280
efficient. And GraalVM comes in two editions: there's an open-source Community
00:46:57.070
Edition, and there's an Enterprise Edition with
00:47:03.700
additional optimizations and support.
00:47:27.910
I guess we'll get there at some point. Heroku has an interesting business model where they give you
00:47:34.180
basically very little memory — the dynos are very small — so I'm not
00:47:41.110
sure it would work there. You would definitely not want to run a JVM there, but we could run with SubstrateVM, and
00:47:47.440
for some workloads I think it could be interesting. So the answer is: we haven't
00:47:52.510
tried it, and there's no integration yet, but I think it would fit in the end. I suppose
00:47:59.590
at some point, hopefully, on Oracle Cloud or other clouds it will probably be available and integrated, so
00:48:06.850
you could just try running applications there.
00:48:21.320
[Question: what is still missing?] Rather not much, I guess. So:
00:48:35.990
running Rails, and lots of C extensions — that is the high-priority goal. And then of
00:48:42.590
course we want to have numbers for how much faster we are on Rails. If you can show you are
00:48:48.500
over two times faster on Rails, that's already something, right? So, well, one part
00:49:11.810
of the GraalVM project is the Graal compiler, which is written in Java,
00:49:17.690
and that's actually getting more and more integrated: recent releases of Java already require a part of it, and you can use it.
00:49:24.890
I think in the end the Graal JIT compiler will probably replace the old C2 compiler. The
00:49:33.140
problem with the original compiler in the JVM is that it's written in C++, it's
00:49:40.430
very complex, and nobody wants to touch it anymore. So
00:49:48.200
there are cases where Graal runs code faster — Scala, for instance, typically
00:49:54.320
runs faster on Graal. Then there's SubstrateVM,
00:50:00.950
which I mentioned in the beginning: it can actually take a Java application and compile it down to a native executable,
00:50:07.210
just like GCC would for C code. Doing that kind of thing is very
00:50:13.430
interesting for, I don't know, serverless computing and the like: you could ship your Java application as a single small binary with a
00:50:20.720
constant, small memory footprint that starts instantly. So some of this stuff
00:50:26.720
is flowing back into Java. Yes, we're trying to build a bigger ecosystem, with a lot of languages on top of
00:50:33.590
this common framework, which is called Truffle, and we're building with it, trying to make it easy for people to
00:50:39.770
actually build a new language implementation, have it fast, and still keep it not too complex. [Question:] My company
00:50:54.740
has a team that works with R and a team that works with Ruby. Would it
00:51:00.050
be possible to run both — like some R scripts — together? Yes,
00:51:09.980
absolutely. There's actually someone external working on this a bit:
00:51:20.150
they wrote a blog post about this, where they mix R and Ruby code, and there
00:51:28.160
is some integration — you can have a nice layer to interoperate. But even with a normal GraalVM you can already do
00:51:34.700
this: you can call R from Ruby, pass a Ruby
00:51:39.980
object, and so on. And that's something that didn't exist before — or if
00:51:46.040
there was something, it was only a mapping between two specific languages, made for that language pair.
00:51:51.140
So that's something that's rather interesting, but people haven't tried it too much so far,
00:51:57.380
because, I don't know, it's something new.
00:52:03.460
[Question about publications.] There are publications about this. I think this array — this collection — idea is very abstract and extremely interesting,
00:52:10.940
because you could really apply it to many languages. So yeah,
00:52:20.720
I actually have a paper about this as part of my PhD, and there are a lot of
00:52:27.140
publications as well on all the other kinds of ideas in the project. So, strategies are
00:52:33.050
actually not so new: they were introduced in PyPy a bit
00:52:38.810
after 2000, I think, and what I did here is the more general version, with concurrency. The same goes for
00:52:45.740
the object model, introduced by the Self guys in 1989 or something:
00:52:52.240
what we have is a thread-safe version of it. If you want
00:52:59.750
links about that, just go to the GitHub and the website — everything is in the list of publications. [Question:] You contribute to
00:53:08.780
CRuby, right? So what's those guys' opinion — I don't know, individual
00:53:16.730
people's opinions — are they open to maybe basically throwing away the bytecode-
00:53:22.970
generation approach and using something like partial evaluation as well? — Well, no, because the project context is
00:53:30.320
very different. First, TruffleRuby has much behind it: there's a Truffle and Graal group of maybe 50 people, with like four people on
00:53:37.610
each of the different sub-projects. But CRuby is basically four or five people full-time, and
00:53:43.430
that's it — all the rest is open-source volunteers. So there are not so many resources, and that's why, I think, they chose the
00:53:49.580
approach of emitting C code for their JIT: they don't have to write the machine-code generation themselves. Graal took
00:53:55.730
more than five years to be written, and there was already a lot of background from the Java compiler;
00:54:01.280
writing a JIT for Ruby from scratch would take — well, you'd need to be one of
00:54:07.640
these huge companies and spend a few years on it. So I guess for them it was
00:54:13.430
realistic to generate C code from the bytecode and reuse the C compiler. The main thing for us was something like:
00:54:21.320
well, okay, we don't reuse much of the MRI interpreter — we can reuse the C extensions, but not the core written in C,
00:54:27.950
because you want the JIT to be able to optimize through it, and C code is not a very good medium for that. I don't know —
00:54:35.780
sometimes they are interested: I posted on an issue, and some guy working on mathematical
00:54:41.990
libraries ran the benchmarks and was like, oh, cool. But yeah, I guess the main thing for
00:54:48.590
us is, of course, to get more adoption — more users — and then you can show off the speedups. Ruby
00:54:56.210
is one of the prime examples, because we already have a huge speedup over the reference implementation; for JavaScript, they're at the same level
00:55:02.930
of z8 put that up much faster different
00:55:10.299
person it's about four years ago so
00:55:15.950
Chris even started it as an internship and the project was taken on and develop
00:55:21.829
since then so good yeah you can see that it's big for years we have about four people working on it for years and we're
00:55:38.289
not going to rename it, I think. [Question about C extensions.] So, for a
00:55:46.309
long time, the main idea there was
00:55:52.489
about the C extensions: we would not run them natively — we interpret the C, so everything can be managed. But this is
00:55:58.130
a dream that works for C extensions that are written for performance, because then
00:56:04.160
it's just a different, faster way to run them. It doesn't work for C extensions that try to bind to a native
00:56:09.170
library, like a database driver, for instance, because of course then a lot of
00:56:16.880
the Ruby objects actually go into native code, and that's where it gets complicated: because now I have a
00:56:24.109
managed representation of my Ruby array, and I have a native representation of it, and the same for strings, because
00:56:29.719
with the C API you can get a char* for the Ruby string, and you're like: well, what do I do — do I copy the memory,
00:56:35.359
how do I keep them in sync — it's kind of complicated. The same goes for low-level operating-
00:56:43.609
system access: on Linux you have libraries that are very powerful for inter-
00:56:50.869
process communication and whatever, and if you intercept all the network and file access, you have
00:56:59.080
problems there, always. So, with the
00:57:04.420
first approach, this kind of thing didn't work very well, because we had to select specifically: okay, this thing is going to native, so we create
00:57:11.020
some kind of handle, an indirection for it, and every time you use it, we remove this indirection. That was inconvenient,
00:57:16.390
because you can't do that automatically. Now, in the new approach, every Ruby object can be converted to
00:57:22.420
native — it gets a handle — and then, when you use it from managed code, we convert it back. And then we have to do some kind of GC to
00:57:28.690
find out: okay, whatever we converted to native — once we are back from native code, and nothing refers to it anymore,
00:57:34.450
we can release the handle. So that part works now.
00:57:41.550
Then, the C extension code: we first compile it with Clang, but not to native code — to LLVM bitcode — and then we
00:57:50.980
interpret that with our bitcode interpreter. A lot of the stuff in there is very basic and just calls the operating system, so that just works. There
00:58:00.010
are some parts that are a bit more difficult, like waiting on threads: if you just do a pthread wait in native code directly,
00:58:05.050
it's actually complicated, because we need to integrate with the Java and the Ruby threads, so we
00:58:10.690
need to intercept that. But basic things, like opening a file and accessing the
00:58:17.500
file system — we can just let them go native. There is actually another mode —
00:58:24.300
in the Enterprise edition — where this interpreter is entirely managed, so it intercepts
00:58:31.720
all the system calls. But then it guarantees memory safety: it
00:58:37.990
means that things like out-of-bounds accesses in C code become impossible. It's not only
00:58:43.750
interesting that it can catch them; it also makes it safe to run code that is very
00:58:49.240
hard to audit otherwise. For that, OpenSSL — which is a lot of very old C code with assembly and whatnot in it — is
00:58:58.030
a bit harder, but it could be interesting too: if my
00:59:04.240
C extension is not very stable and segfaults sometimes, in this mode it would not be able to segfault anymore,
00:59:11.320
and we would also be able to see the bug with a nice backtrace and say: okay, this is the problem, this is the bad access.