Parallel and Thread-Safe Ruby at High-Speed with TruffleRuby

Summarized using AI

Benoit Daloze • January 29, 2019 • Zurich, Switzerland • Talk

In this presentation titled "Parallel and Thread-Safe Ruby at High-Speed with TruffleRuby," speaker Benoit Daloze discusses advancements in Ruby performance through the TruffleRuby implementation. The talk explains how TruffleRuby, developed at Oracle Labs, aims to overcome the limitations posed by traditional Ruby implementations regarding concurrency and threading, particularly with structures like Array and Hash. Daloze emphasizes the research-driven nature of the project and the goal of achieving high performance akin to JavaScript engines like V8.

Key points covered in the presentation include:

- Introduction to TruffleRuby: Daloze provides background on TruffleRuby, describing it as a high-performance Ruby implementation that utilizes a Just-In-Time (JIT) compiler to achieve significant speed improvements.

- Performance Benchmarks: Through various examples, including CPU benchmarks and game emulation, the presentation showcases TruffleRuby's performance capabilities, claiming it can be up to ten times faster than traditional MRI (Matz's Ruby Interpreter). This includes good scalability on multi-core systems.

- Thread-Safety and Concurrency: Daloze stresses the importance of making Array and Hash structures thread-safe while maintaining performance levels. He introduces a new concurrency model for Ruby that allows safe parallel access, addressing issues encountered by existing Ruby implementations with the global interpreter lock (GIL).

- Demonstrations of Performance: The talk includes practical demonstrations showcasing how TruffleRuby exceeds the performance of MRI, specifically in multi-threaded operations and standard benchmarks.

- Research Insights: Daloze shares insights into the research behind TruffleRuby, detailing strategies such as partial evaluation and specialized implementations of data structures that optimize performance without compromising thread safety.

- Challenges and Future Directions: He discusses ongoing challenges in adopting Ruby on Rails with TruffleRuby due to dependencies and the need to maintain compatibility with existing Ruby gems. He expresses ambition for broader adoption and further enhancements.

In conclusion, Daloze emphasizes that TruffleRuby not only enhances Ruby's performance but also makes it viable for concurrent programming, thus aligning it more closely with modern requirements for high-performance applications. The talk couples theoretical insights with practical demonstrations, making a compelling case for the future of Ruby under the TruffleRuby implementation.

Array and Hash are used in every Ruby program. Yet, current implementations either prevent the use of them in parallel (the global interpreter lock in MRI) or lack thread-safety guarantees (JRuby raises an exception on concurrent Array#<<). Concurrent::Array from concurrent-ruby is thread-safe but prevents parallel access.

This talk starts with an introduction to show how TruffleRuby works and then shows a technique to make Array and Hash thread-safe while enabling parallel access, with no penalty on single-threaded performance. In short, we keep the most important thread-safety guarantees of the global lock while allowing Ruby to scale up to tens of cores!
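The concurrent append scenario mentioned above can be sketched as follows. This is a minimal illustration, not code from the talk's slides: the constant names are assumptions, and the sizes (100 threads, 1,000 appends each) follow the example described later in the transcript.

```ruby
# Many threads appending to one shared Array.
# THREADS and APPENDS_PER_THREAD are illustrative names, not from the talk.
THREADS = 100
APPENDS_PER_THREAD = 1_000

array = []

# Each thread appends its own run of integers to the shared array.
threads = THREADS.times.map do
  Thread.new do
    APPENDS_PER_THREAD.times { |i| array << i }
  end
end

# Wait for every thread before inspecting the result.
threads.each(&:join)

expected = THREADS * APPENDS_PER_THREAD
puts "size: #{array.size} (expected #{expected})"
```

On MRI the global lock keeps each `Array#<<` atomic, so the size should match the expected value. On an implementation that runs threads in parallel without that guarantee, the same program may print a smaller size or raise an exception.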

https://www.meetup.com/rubyonrails-ch/events/258090195

Railshöck January 2019

00:00:08.330 So today I want to talk about parallel and thread-safe Ruby at high speed with Truffle
00:00:13.889 Ruby. My name is Benoit, it's French. I come from Belgium and I moved to Zurich here
00:00:21.230 at the beginning of December, so it's quite recent, and I'm curious to meet everyone.
00:00:32.120 First I have to show you this, which means what I talk about is a research project, and so you should not buy Oracle stock
00:00:38.760 based on that. So I now work at Oracle Labs
00:00:45.809 and we are developing a Ruby implementation which is called TruffleRuby,
00:00:53.930 and I have been doing research as a PhD student for four years. Few people actually do any
00:01:00.510 research on concurrency in Ruby, and I wanted to improve on that because Ruby was not very concurrent. I've
00:01:08.310 been working on TruffleRuby almost since it was created, for more than four years now.
00:01:13.590 I'm also the maintainer of ruby/spec, which is the set of specs for the Ruby language itself, and I'm also an
00:01:19.680 MRI committer. So today I want to talk about two things. First I want to introduce
00:01:25.259 what TruffleRuby is and also talk about performance, because it's probably the most exciting aspect of it, and then
00:01:32.460 I also want to talk about my research as a second part. First things first, so
00:01:40.950 TruffleRuby intends to be high performance, to maybe reach a new level of performance that other Ruby
00:01:46.140 implementations just never reached so far. We try to be as fast as the
00:01:52.170 fastest just-in-time compilers for dynamic languages, for instance V8 for JavaScript; we try to be as fast
00:01:58.500 as that for Ruby. We use the Graal just-in-time compiler, which is what helps achieve this, and
00:02:05.740 we target full compatibility with CRuby, with very few exceptions. We always try to be as compatible as
00:02:11.800 possible, and it means we also run C extensions, because a lot of Rails apps out there and
00:02:17.650 Ruby applications use C extensions and there is just no easy replacement, so you want to run that, and we already support quite a
00:02:23.950 bit of that. It's all on GitHub, it's open source, you can check it out. There are
00:02:32.890 two ways to run TruffleRuby. One way is to run on the JVM, the Java Virtual
00:02:38.290 Machine, and the main advantage here is you can interoperate with other JVM languages and code in Java, and this provides
00:02:46.480 great peak performance. Now the default mode that we run in is what we call Substrate VM, and
00:02:53.560 Substrate VM is basically an ahead-of-time compiler that compiles TruffleRuby, so the
00:02:59.800 Ruby implementation, and Graal, the just-in-time compiler, to a native executable. So I
00:03:05.380 won't go much into details there, but the main idea is that this whole thing is initially Java code: TruffleRuby is written in
00:03:10.870 Java, and Graal, the just-in-time compiler, is also written in Java, and we compile all of this into a native executable, and
00:03:18.610 that in addition gives us much faster startup, because most things running on the JVM don't start within one second with
00:03:23.920 anything significant, but here it can start in 25 milliseconds, and it also has
00:03:30.580 faster warm-up, so the time it takes until your Ruby application is fast is also better, because here the just-in-time
00:03:36.940 compiler is precompiled itself, so it's already optimized to compile Ruby code. It's a bit of a meta thing, yeah. It's also
00:03:45.010 a lower footprint because you don't have to do class loading or loading of Java code and so on; all of this
00:03:50.410 is done ahead of time, and then basically the only memory we need is what the Ruby interpreter needs and nothing else, and
00:03:56.739 the JIT of course. And the peak performance is similar in that configuration, so that's
00:04:02.320 why it is the default. The main advantage of the JVM is then if you want to interoperate with other JVM
00:04:07.900 stuff. This is something else. So
00:04:13.510 I guess you have heard of the Ruby 3x3 project, and the goal of this is that CRuby or MRI 3.0 should be
00:04:21.400 three times faster than CRuby 2.0, and they want to do this with a just-in-time compiler
00:04:27.250 which is called MJIT. That's interesting, it's a good direction I think; it is good that Ruby gets a
00:04:33.280 bit faster, but my question is: can we be faster than three times Ruby 2.0? And to
00:04:40.419 illustrate this I want to make a demonstration and run what is called optcarrot. Optcarrot is the main
00:04:47.289 CPU benchmark for Ruby 3x3; it's actually an emulator for the Nintendo
00:04:52.630 Entertainment System. Look, okay, can you
00:05:08.080 read this? So the baseline for Ruby
00:05:13.660 3x3 is Ruby 2.0, so let's do that,
00:05:20.280 and so here we just run the benchmark. So
00:05:28.180 you can see here we have an emulator, and we have this game called Lan Master, which is used for the benchmark, and
00:05:33.880 I can play it, but we see it is not really fast. So when I move around you can see
00:05:41.169 it has a bit of lag. And the aim of this game is very simple: it's to connect all the
00:05:46.300 computers together. Now you see, like that.
00:05:52.410 So the original speed of the NES is 60 frames per second,
00:05:57.940 but here it's only 34, so that's not fast enough. So we can use the latest
00:06:06.000 CRuby, which is 2.6, and I can run it,
00:06:13.669 and there it's 43 frames per second, so, well, ten frames per second better but
00:06:18.689 still not 60. But now we can also use MJIT, and we do this with the --jit flag
00:06:24.479 in 2.6, and now it gets actually a bit
00:06:29.759 faster, around 70 frames per second, which is nice. But then we can also run it on
00:06:39.870 TruffleRuby. Yeah, it gets more
00:06:45.599 interesting. There are the two modes I
00:06:51.479 explained earlier, with JVM or SVM, the Substrate VM, and here it's a bit better on JVM,
00:06:56.900 so, sorry. So we see it starts slow, it's
00:07:02.159 like 10 frames per second, but it gets faster because the JIT now learns what it is executing; it's
00:07:08.219 optimized and compiled to machine code, and it optimizes a lot better than a traditional JIT. It's a bit unstable
00:07:15.629 until it learns the program, but then now we get to 250 frames per second, right,
00:07:21.349 just like yeah it's where I had like I
00:07:26.490 also paid in this mode from is the time is friend day so it means I have much less time than zero and ya know again
00:07:34.319 like we surprised at it now it's like we're doing something a bit of difference and it's a really nice things but soon enough
00:07:52.390 and today I can see I've been doing a lot of research I used to play this on
00:08:25.280 my other laptop first; it was a bit slower, so it was manageable. Now it's not. Anyway, that's OK, I think you got the idea. So here
00:08:34.430 we can see the frames per second, right, so you can see TruffleRuby, which is over 200, which is like three times faster
00:08:41.240 than MJIT, so it's like Ruby three times three times three. And so what we saw,
00:08:50.690 if you want to graph it, is this: TruffleRuby is the green line, and it's pretty slow in the beginning, but
00:08:56.660 once it learns the program and how it behaves, like which methods are used, for instance whether addition is used only on integers
00:09:02.900 or on something else, like you also have Bignums, then once it learns this it's very fast,
00:09:08.720 and this level of difference is a big margin, right. And today I
00:09:14.570 want to try to give you some insight into why we do so much better. And this is
00:09:21.230 not only on one particular benchmark; for instance, on classic numeric benchmarks we also perform very well,
00:09:26.810 typically between ten and fifteen times faster than MRI. And you see here
00:09:34.460 MJIT also performs a bit better, but not everywhere.
00:09:39.730 And then if you run MJIT's own set of micro-benchmarks, like they ran
00:09:45.410 it against TruffleRuby at some point, and here it's the native configuration, so the default one, we see there we are
00:09:53.060 actually 30 times faster than CRuby 2.0, and MJIT itself is four times faster. So clearly we're good
00:10:01.680 at optimizing this kind of benchmark, and there's just a different level of how much you can optimize that
00:10:07.260 code. Other areas: for instance rendering, template rendering, is something
00:10:13.740 we are pretty good at. For instance, on a simple ERB benchmark we are about ten times faster than MRI, and this is due to
00:10:20.459 a different representation for Strings. So for instance string concatenation in MRI is pretty slow because it involves copying
00:10:27.420 and reallocating, but if you use something else like ropes, which are used in text editors, then concatenation is
00:10:33.630 constant time. So what we do is, yeah, we
00:10:39.779 implement concatenation differently. So when I concatenate two strings, actually,
00:10:45.180 instead of allocating a bigger buffer and copying the contents, we take these two and create a new node here that represents the
00:10:51.779 virtual concatenation of them, and we only actually flatten it when we write it to the network or so. Yeah,
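The rope idea described here can be sketched in a few lines of plain Ruby. This is a toy model under stated assumptions, not TruffleRuby's actual implementation (which lives inside the runtime): the class names `Leaf` and `Concat` and the helper `flatten_rope` are hypothetical.

```ruby
# A toy rope: concatenation creates a small node instead of copying characters.

# A leaf wraps an ordinary string.
class Leaf
  attr_reader :length

  def initialize(string)
    @string = string
    @length = string.length
  end

  # Append this leaf's characters to an output buffer.
  def write_to(buffer)
    buffer << @string
  end
end

# A concat node only remembers its two children: O(1) work per concatenation.
class Concat
  attr_reader :length

  def initialize(left, right)
    @left = left
    @right = right
    @length = left.length + right.length
  end

  # Write the left subtree, then the right subtree.
  def write_to(buffer)
    @left.write_to(buffer)
    @right.write_to(buffer)
  end
end

# Flatten lazily, only when the actual characters are needed
# (for example right before writing to a socket).
def flatten_rope(rope)
  buffer = String.new(capacity: rope.length)
  rope.write_to(buffer)
  buffer
end

greeting = Concat.new(Leaf.new("Hello, "),
                      Concat.new(Leaf.new("Truffle"), Leaf.new("Ruby")))
puts greeting.length
puts flatten_rope(greeting)
```

The design choice is to defer the copy: building the tree costs a constant amount per concatenation, and the single pass in `flatten_rope` copies each character once, however many concatenations produced the rope.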
00:10:59.360 there's a talk about this with more info, and a lot more stuff we do there. The big
00:11:10.260 question is: do we run Rails? And the answer is yes, to some extent. The thing
00:11:15.959 that already works almost out of the box is a small blog application, and
00:11:25.020 we are trying to run Discourse, because Discourse is one of the benchmarks also for Ruby 3x3, and
00:11:31.560 there we have a fair bit of it working already, but currently with a few patches, so we need to fix a few things
00:11:37.380 there, and we want to do it properly, that is, fix everything in the interpreter so it just works without changing anything in the
00:11:43.890 application. The challenge of running big Rails apps is that they often have tons
00:11:49.020 of dependencies, over 100 gems for instance, and if there's one that doesn't work, that's enough to not be able to run
00:11:55.860 the app; then you have to start to work around it. And many of these gems also have
00:12:01.460 C extensions, which tend to be more complicated. So recently we created
00:12:07.010 a new approach for C extensions that works a lot better. Before, we actually had to, for instance for
00:12:12.200 the pg driver or mysql2, patch quite a lot of places and so
00:12:18.110 on, and this doesn't scale: if you have to patch every C extension to make it work with us, it's never going to work. So now we have something called a
00:12:24.500 different kind of simile it simulate the the GC file right for the extension and
00:12:30.200 the end result is basically that many C extensions work out of the box; you just switch to it and that's it. And so
00:12:37.430 we support these; I know we support all of these, but I guess many more now.
00:13:12.870 That's a very simple blog, nothing very fancy, but it works. It was created with
00:13:19.390 rails new and scaffolding, and the
00:13:24.550 Rails version is 5. So, how is TruffleRuby
00:13:35.200 so fast, basically? There are two main concepts. One is partial evaluation,
00:13:40.829 which is kind of new; I think it's something that was introduced by our project and was never
00:13:45.880 really used before, at least in a JIT compiler. And then we use of course
00:13:50.950 the Graal JIT compiler, which has a number of optimizations, as we will see, that help a lot as well. Implementation-wise,
00:13:58.860 Integer#+ for instance, and basic primitives, are written in Java, because TruffleRuby itself is written in Java,
00:14:05.680 but a lot of the core library is written in Ruby itself, so this is like Rubinius. So
00:14:10.899 we have a lot of the core library defined in Ruby itself, and only what we cannot express in Ruby is written in Java, of course.
00:14:17.890 So let's see how a Ruby interpreter can execute your program,
00:14:25.000 right. So here is a very simple method foo, and what it does is it maps
00:14:30.160 an array, an array of two numbers, and triples them. And the first thing any
00:14:36.850 Ruby implementation will do is parse this text into an abstract syntax tree, and actually here it's three abstract
00:14:43.420 syntax trees, because there is one for the method, and the method only calls map, and it calls that on
00:14:49.720 an array literal, and then the only argument is the block. Then the block
00:14:55.329 itself here is also a different AST, which just multiplies the argument by
00:15:00.459 three. Then finally map is also an AST, since it's implemented in
00:15:05.500 Ruby; it has its own logic, and what it does is obviously read from the array
00:15:11.560 and fill the new array with the result of the block for each element. Now what we do in TruffleRuby is very similar to this.
00:15:17.900 Instead of just an abstract syntax tree, we actually have what we call a Truffle abstract syntax tree, and it looks similar, but
00:15:24.780 the main thing is that here each of these nodes is now a Java object, and each
00:15:31.470 of these nodes has an execute method which defines how to execute this node, so it
00:15:36.720 expresses the semantics. For instance we have a multiplication node, and what it does is
00:15:42.030 execute the left child, execute the right child, and multiply the results. But that's
00:15:48.240 a very simple way to implement an interpreter, and this way is actually very slow, and Ruby
00:15:54.420 1.8 was a bit like this; it wasn't very fast. But because we have this trick of partial evaluation we
00:16:00.720 can actually make it very fast. And the idea is this magic process, which I will
00:16:07.290 explain later, partial evaluation: it takes this Truffle AST, and what we get out of it is a compiler graph
00:16:13.500 representing how to execute the whole method, and this is basically an intermediate step, because once the JIT takes this it can
00:16:20.190 emit machine code from that. That's basically all we need to make
00:16:25.380 machine code for this Truffle AST. Now how it works is we start from the
00:16:32.130 initial node, the top node, so like this one for instance, and then we see, okay,
00:16:38.670 what is this execute method doing? The method is executing this one, and this one is executing this one and this one, and actually
00:16:45.330 we go to every node; that is, we inline the entire AST, and then we get
00:16:51.840 a lot of Java code and then we optimize it. Let's see it in action. So we have our method again, and
00:16:58.880 the execute method of this foo is just calling the execute of what is inside the body, so it executes the child node,
00:17:05.510 and the child node is a call node, right, and the call node has
00:17:11.550 some logic to call a Ruby method, but first, before calling anything, it has to execute the receiver and the arguments,
00:17:17.839 and executing the receiver means, well, the receiver is an array literal,
00:17:24.780 and this creates a new Ruby array with the values executed; the values are 1 and
00:17:30.180 2, so integer literals, and they just return 1 and 2. Now the interesting thing is, when we do
00:17:36.000 this process we also do some kind of constant folding. So we know we are compiling foo and not any other method,
00:17:43.730 so a read of a field like this child field here, normally we would have to read this field in normal
00:17:49.860 Java code, but here we don't, because we know this AST is constant. And so because
00:17:56.580 it is constant we can fold everything, and so we fold all these execute methods here all together like this. And so the
00:18:04.200 equivalent thing we get after partial evaluation is this, with a foo execute method
00:18:09.660 where everything is inlined: we know we're always calling map, we know which block, we know which array, which values,
00:18:15.450 and everything. This is done for foo, but
00:18:21.000 we also do it for the other ASTs, so we also do it for the map AST, and we also do it for the block AST, again with the same
00:18:27.750 process. So the map AST calls the block for each element, and the block just multiplies by 3. And then
00:18:37.380 another thing that happens is we do inlining at this point. So what we see is, when this foo method calls map,
00:18:43.980 if you remember, the call node records which method it calls, and it can tell
00:18:49.050 it always calls map, that's not very surprising, that's what the code has to do, so it always calls the same thing. And
00:18:54.720 then when we call the block, there we actually always call the same block. So when we see this, we can
00:19:00.300 inline; it's basically putting all these ASTs together into one, which is very easy, we just rearrange it and done. And then now we have
00:19:08.730 a lot more code we can optimize together, and we can start to optimize things that are between different methods and
00:19:14.460 classes. So once we inline everything we have something like this: so the original array with 1 and 2, we
00:19:22.200 create a new array of the same size, and then for each element we call the block, but this is inlined, so we just multiply
00:19:27.950 by three directly. Now at this point the
00:19:33.890 Graal just-in-time compiler kicks in, and here we start to have some optimization passes. The first thing it will do here
00:19:40.010 in this case is what is called escape analysis. We see that this array here, which we create and map to a
00:19:46.610 new array, actually never escapes the method; the return value would
00:19:51.770 be the new array. We don't care about its identity, and all we do is read some stuff from the array, like the size
00:19:57.680 of the array and the elements from it, but the instance itself, the Ruby
00:20:03.680 array instance, doesn't escape, so we don't need to allocate it. The compiler can figure this out and replace every
00:20:08.750 usage with what is actually in the fields. So what happens is
00:20:14.720 this: the array becomes just the storage, which is this int[] internally. The size of the array here was 2
00:20:22.370 because we see, okay, the size of this is 2, so this is it, and reading from the array is reading from the storage.
00:20:30.100 The next thing is, now we are reading here from the int[], so what we read must be an int, so this is always true, so
00:20:37.600 there is no point in having another branch that does more complicated semantics. Now
00:20:47.420 the compiler also sees there is a loop here; it always does exactly two iterations, and inside the loop there is
00:20:54.530 not so much code, so maybe we can duplicate this code for every iteration, because there are only two of them, and then the
00:21:01.340 code would be much simpler and easier to optimize. And after this, now it sees, like,
00:21:09.770 okay, actually we are reading from storage[0], and our storage is known, and the compiler graph
00:21:16.620 links values, so it can see this is what creates it and this is what uses it, so it can just forward the value, and storage[0]
00:21:23.070 would become 1, and similarly for the rest of it. We don't need the storage anymore after that, it's gone. Now the compiler
00:21:32.730 sees this multiplyExact: it's like multiply but it also checks for overflow, but
00:21:37.920 obviously one times three doesn't overflow, so we can just compute it. And we rearrange the code and we get this, and this is
00:21:44.190 basically the most optimized code you could get for what we had initially. So
00:21:49.890 we transformed this into this Java code, which basically, if we translate it back into Ruby code, just allocates the answer, which
00:21:55.919 is what we could reason out ourselves. And this looks simple, but for a compiler to understand it is not that easy, and
00:22:01.919 actually only TruffleRuby manages this level of understanding of Ruby semantics.
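The optimization steps described above can be written back as Ruby, one method per stage, to make the progression concrete. This is only an illustration: the real compiler transforms a graph representation rather than Ruby source, and all method names below are hypothetical.

```ruby
# Stage 0: the original method.
def foo_original
  [1, 2].map { |e| e * 3 }
end

# Stage 1: after partial evaluation and inlining, map and the block
# are merged into the body of foo.
def foo_inlined
  array = [1, 2]
  result = Array.new(array.size)
  i = 0
  while i < array.size
    result[i] = array[i] * 3
    i += 1
  end
  result
end

# Stage 2: escape analysis removes the temporary input array, and the
# two-iteration loop is unrolled, with the known values forwarded.
def foo_unrolled
  result = Array.new(2)
  result[0] = 1 * 3
  result[1] = 2 * 3
  result
end

# Stage 3: the multiplications are folded (no overflow is possible),
# leaving only the allocation of the answer.
def foo_folded
  [3, 6]
end

p foo_original
p foo_folded
```

Each stage returns the same value; only the amount of work left at run time changes.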
00:22:07.700 Now, yeah, of course this then becomes assembly, and the assembly does what
00:22:12.900 you can imagine, right: 3 to memory, 6 to memory, then return the new
00:22:18.390 array. So that was TruffleRuby. Now I want to
00:22:25.080 compare a bit; I want to compare with MJIT, so the MRI JIT. And there the first step
00:22:30.570 is also to parse to an AST; there it is transformed not to a Truffle AST but to bytecode; it's just another
00:22:36.809 representation, it doesn't really matter. But the point is that when a method is
00:22:41.820 called many times MJIT will generate C code, so they literally print C code
00:22:47.549 and then pass it to a C compiler, which then generates a shared library, which it then loads back. So they generate
00:22:55.140 C code; that's the way they chose to communicate with the compiler, because of course to communicate with a C compiler that's the most
00:23:01.500 convenient. And so then this Ruby function is translated to this; it
00:23:08.010 really looks like C extension code, right: okay, the
00:23:13.590 values are here, and then there's an rb_ary_new something, and then a function call with a block. Then they can also
00:23:20.370 compile the block separately, so here the code is just a multiply by three, but it's inlined and specialized for, like,
00:23:26.269 Fixnum and Float for instance. And then at this point GCC or Clang kicks in,
00:23:34.309 and what it optimizes is not magic; the only thing it can figure out is, like, is this value a Float? No, it's
00:23:41.330 not, statically it's an integer, so it can say, okay, that branch doesn't exist, remove this, and then it just keeps this
00:23:46.399 one. But that's it; it cannot go further because it doesn't know how to go into
00:23:52.279 this or this or this, because this is already compiled to native code; it has no idea what it does.
00:23:57.409 So this is currently the main limitation of MJIT: it doesn't know anything about that function for instance,
00:24:02.889 because that's part of the ruby binary, but there's no source form at run
00:24:08.299 time, so it doesn't know how to go through it or how to inline it, and so on. So in summary, we can
00:24:19.610 see the performance of Ruby can be significantly improved; it can be up to ten times faster, for instance on a
00:24:24.950 number of benchmarks, and TruffleRuby is an example of that. I think the message is that there is no need to rewrite applications
00:24:31.549 in another language than Ruby for speed: Ruby can be as fast as JavaScript or other dynamic languages at least, which
00:24:37.490 is already pretty good, and that's within a factor of two of Java, which is probably good enough for most companies.
00:24:46.120 One conclusion is that the compiler needs access to the core library, to this map function for instance;
00:24:51.259 otherwise it's very limited, because it only understands user code that never uses core library methods, but Ruby code uses
00:24:57.409 core library methods everywhere, and if it doesn't understand them it can't do much. It also
00:25:04.100 has to understand Ruby constructs easily; for MJIT for instance it's a bit more complicated, because in C there is
00:25:11.330 no concept of an object allocation; an object
00:25:17.120 allocation is just, like, a pointer and writes to memory. The C compiler
00:25:22.879 doesn't know that maybe nobody will read from this from other threads; it just has to do it. So there are
00:25:30.040 some challenges there to address, basically to be able to
00:25:36.340 communicate better with the compiler. In Graal we can do this better because we're in Java, so there's already the concept of object allocation, but also
00:25:42.520 we control the just-in-time compiler, and so we can give it more information. So now I
00:26:01.000 want to show what I did during my PhD, which is: I
00:26:06.100 want to make Ruby and other dynamic languages parallel. And the problem I
00:26:11.200 noticed and wanted to work on is that dynamic languages have poor support for parallelism; I think none of
00:26:18.280 them has a very good solution, and often it's not actually due to the
00:26:24.940 languages themselves, it's due to the implementation. And so there are three kinds of implementations. The first type is
00:26:32.380 the one with a global lock, and in this we have CRuby, or CPython, and JS
00:26:37.840 could also be considered there to some extent. And then there's a global lock, so there is no
00:26:43.660 parallel execution of Ruby or dynamic language code ever in a single process. So the only possibility there to achieve
00:26:50.560 parallelism is to use multiple processes, and that wastes memory, the biggest resource, and has
00:26:55.660 slow communication between processes. That's kind of like the last resort to do parallelism: use multiple processes. Then the
00:27:06.190 second category, which is more interesting, is what I call unsafe, where JRuby and Rubinius are. So they
00:27:11.560 allow Ruby threads to run in parallel, but they give up on important
00:27:16.660 guarantees. For instance, if you call Array#<< (append) concurrently on the same
00:27:22.870 array in JRuby today, you might get an exception. That's a problem, because in the documentation Array#<< appends
00:27:28.390 one element to the array, and not, like, oh, it might throw an exception. And then we
00:27:35.380 have a third category, which is like, okay, that's a problem, so let's not have shared memory at all, or as little as possible,
00:27:44.049 so this is like share-nothing or share-little, and in that there is JavaScript, because it has kind of an actor-like model with the
00:27:50.860 worker thing; Erlang is in that too, and Guilds, which is for Ruby 3, I will
00:27:57.850 expand on that, it's also in this model. The problem is you cannot pass
00:28:03.490 objects between threads; it is impossible. You have to either deep copy or
00:28:10.299 transfer ownership, and then you can't use it anymore in the original thread. So when you use it that way it's much more restrictive. So
00:28:22.000 yeah, let's talk a bit about Guilds, which is a new concurrency model for Ruby 3. The advantage of
00:28:29.950 this is there's a stronger memory model, so the semantics are simpler; there are no low-level data races because there is no shared
00:28:36.279 memory. It's not quite separation at the process level, but it's like, within
00:28:41.830 a single real process, every guild would have a different heap, and there's nothing in common between them. So the
00:28:50.470 problem there is you cannot share or pass objects around, and
00:28:56.380 well, these are object-oriented languages, so it's a big deal. So either you deep copy them, but if the
00:29:02.679 object is large it takes a long time to deep copy it, or you transfer ownership, but actually there is a catch: to transfer ownership you need to
00:29:07.899 deep copy everything except the last layer, for implementation reasons, so it's
00:29:15.700 not much faster. So this is a problem: compared to a sequential program, when you parallelize with this you
00:29:21.940 have this copy overhead, and of course you always have to balance it: if the copy overhead is too high, then maybe the parallel version
00:29:30.820 is slower than the sequential version. It
00:29:36.789 also means that shared mutable state can only be accessed by its owner: it has to be in its own actor, its own guild, and every time it has to be communicated with from the outside, and it's one at a time. So
00:29:42.010 something like a concurrent hash map, where you can read from multiple threads at the same time, is not possible with the actor
00:29:47.889 model; maybe for very special data structures you can provide a model for that, but in the
00:29:53.860 normal way, like for your own business logic, no. It's a different programming model, so
00:30:02.920 for instance if your code is using Ruby threads, then it won't work with guilds just like that; you need to adapt it.
00:30:10.660 I think it's a complementary solution. Some problems are very nice to express with this
00:30:17.110 share-nothing or share-little programming model. For instance, I don't know, some
00:30:22.660 IRC chat bots and so on: something that tends to be very isolated, where you don't need to
00:30:28.930 share a lot of state, then it's quite nice. But for other problems, like ray tracing, where you would
00:30:35.680 say okay, I want to render this picture and I will give this part to this thread, that part to that thread, and another part to another thread,
00:30:41.620 that's much easier to do with shared memory, because you don't have to copy the screen all the time; you just
00:30:47.050 write to it, and everyone writes a different part. So I was saying JRuby and
00:30:55.360 Rubinius are unsafe, and I want to show an example. I create an array here and I create a hundred threads, and each of
00:31:02.320 these threads is going to append a thousand integers to the array. Then I wait for all the threads and I print the
00:31:07.390 size of the array. So as the programmer you should get a hundred times one thousand, one hundred thousand, and
00:31:12.700 sure enough, if I run this on CRuby I get the right answer. Of course nothing runs in parallel, so these threads are just
00:31:19.210 waiting for each other, but it turns out correct. I can run this on JRuby
00:31:25.000 and I get a random number, or I get an
00:31:32.950 exception. And if I run this on Rubinius, same thing: I get a random
00:31:38.590 number or a different exception. That's kind of bad:
00:31:43.780 if you have something in production using parallelism, maybe locally there are not enough requests for it to happen, but then in
00:31:50.980 production it may well happen. It's not very nice. The
00:31:57.910 reason why we have a smaller number here is that some appends were lost. The race happened in a way that
00:32:05.070 basically both threads incremented the size at the same time, and not atomically. So
00:32:10.200 both read a size of three, both compute three plus one, and both write four. Oops: the size was incremented only once for two appends to the array.
00:32:16.470 So this is very bad. So the
00:32:24.210 workaround for these implementations, and the way they recommend, is: okay, you need to do
00:32:30.690 some synchronization. So you can use a Mutex like this, or you can use Concurrent::Array.new, which
00:32:37.590 is from the concurrent-ruby library. The main problem with this is that it's really easy to forget to use it, and then again, if it only
00:32:44.190 shows up in production it's hard to reproduce. So I don't think we want unsafe by default.
00:32:51.710 Also, this synchronization is not necessarily cheap; it can add significant overhead.
00:32:57.720 It actually only allows sequential access to the array, which is sad, because if I access different parts
00:33:04.350 of the array I should be able to do that in parallel. But okay.
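The experiment and the Mutex workaround described above can be sketched as follows. This is a minimal reconstruction, not the speaker's slide code: the method names and constants are chosen here for illustration. On CRuby both counts come out as 100000 because of the global lock; on an implementation with unsynchronized arrays the first count may be lower, or the append may raise.

```ruby
THREADS = 100
APPENDS = 1_000

# Every thread appends to the same Array with no synchronization.
def unsynchronized_appends
  array = []
  THREADS.times.map {
    Thread.new { APPENDS.times { |i| array << i } }
  }.each(&:join)
  array.size
end

# Workaround: guard every append with a Mutex. Correct everywhere,
# but all accesses to the array are serialized.
def mutex_appends
  array = []
  lock = Mutex.new
  THREADS.times.map {
    Thread.new do
      APPENDS.times { |i| lock.synchronize { array << i } }
    end
  }.each(&:join)
  array.size
end

begin
  puts "unsynchronized: #{unsynchronized_appends}"
rescue StandardError => e
  # An exception raised inside a thread is re-raised by join.
  puts "unsynchronized: raised #{e.class}"
end
puts "with Mutex:     #{mutex_appends}"
```

The Mutex version is easy to get right in a small example like this; the concern raised in the talk is that in a larger code base it is easy to forget one call site.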
00:33:10.649 I think the biggest point is this: Array and Hash are thread-safe on CRuby, and if they are not on another
00:33:16.139 implementation, then it is incompatible. For example, in Bundler and
00:33:22.440 RubyGems there are cases like, yeah, this doesn't work on JRuby or Rubinius because Array and Hash are not
00:33:27.450 thread-safe. But of course they don't do this
00:33:32.639 intentionally. The reason is that making Array and Hash thread-safe would be very
00:33:37.739 expensive and it would affect single-threaded performance, which they don't want. The
00:33:43.649 big question is: can we make collections thread-safe and nevertheless keep good single-threaded performance?
00:33:49.499 And one more thing: I also want parallel access to my Array. I want to access
00:33:55.019 my Hash in parallel. Especially if I access different parts, there is no reason I can't do that. I think the problem is
00:34:04.859 even larger than that. It's not just collections, because with objects in dynamic languages you can add or remove
00:34:11.320 instance variables, so an object is a bit like a hash table. Even though they are not used with the same pattern,
00:34:17.669 they have the same problem. And because you can add or remove fields, you could
00:34:23.679 add a hundred or a thousand fields if you wanted to, we need to grow the storage for this object dynamically. We need to
00:34:29.830 expand it to store more instance variables, and when you do that there is a problem with concurrent writes: while extending the
00:34:36.760 storage we might lose some writes. So here's an example: I create
00:34:42.520 an object, and then I create a thread that will set the field a to one and then
00:34:48.520 update it to two. The second thread will set the field b to the string "b". I wait for both threads, and one possible outcome is
00:34:56.590 that I get one when reading a, which means the second write was completely lost. And that's actually
00:35:03.820 possible on Rubinius. The reason is: okay, initially thread one sets a
00:35:10.390 to one, very good, we store it. But then, before it sets it to two, the second thread
00:35:16.330 starts and says: oh wait, the storage only has capacity for one instance variable, so I need to grow it. So
00:35:22.000 to grow it, it has to allocate a bigger chunk and copy the existing data. It does this, and then it sets the
00:35:28.960 new value in it and then replaces the object's storage with the new storage. The problem is that the first
00:35:34.450 thread wrote its update to the old storage, which is now lost and unused. And so this
00:35:39.910 is a real bug. It is something JRuby fixed, but that has a cost. And the
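The lost-update scenario described above can be written down in two ways. The first method is the two-thread program itself; the second replays the problematic interleaving in a single thread, modelling the object's storage as a plain Array. The class name `Box`, the method names and the storage model are assumptions made for this sketch, not TruffleRuby or Rubinius internals.

```ruby
# Hypothetical class for illustration; the talk does not name one.
class Box
  attr_accessor :a, :b
end

# The program from the talk: one thread writes a twice, another adds b.
def run_with_threads
  obj = Box.new
  t1 = Thread.new { obj.a = 1; obj.a = 2 }
  t2 = Thread.new { obj.b = "b" }
  [t1, t2].each(&:join)
  [obj.a, obj.b] # a thread-safe implementation must give [2, "b"]
end

# Single-threaded replay of the bad interleaving.
# Slot 0 holds a, slot 1 holds b.
def replay_lost_update
  storage = [1]                      # thread 1: a = 1 (capacity is 1)
  old_storage = storage
  new_storage = old_storage + [nil]  # thread 2: copy into a bigger chunk
  old_storage[0] = 2                 # thread 1: a = 2 goes to the OLD storage
  new_storage[1] = "b"               # thread 2: b = "b"
  storage = new_storage              # thread 2: publish the new storage
  storage                            # a is still 1: the update to 2 is lost
end

p run_with_threads
p replay_lost_update
```

The replay makes the window visible: any write that lands between the copy and the publication of the new storage disappears.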
00:35:48.339 main idea for how we can address this, how we can avoid overhead for sequential programs and still have
00:35:55.300 thread-safe collections and objects, is to distinguish: to only synchronize on what
00:36:00.310 we need. So my approximation here is: only synchronize on objects or
00:36:05.770 collections that are reachable by multiple threads. If an object or collection is reachable by only one thread,
00:36:11.230 there is no need to synchronize: there's no concurrency.
00:36:16.590 So here's an example. Thread one has a queue here, and thread two has another queue. All objects are
00:36:23.340 white currently, meaning unsynchronized, because each can only be accessed by one thread. Now what happens if the queue
00:36:29.970 is put in a global variable, for instance? Then it's accessible by every thread. Or
00:36:35.100 if we give the queue somehow to thread two. At that point the queue becomes shared and becomes synchronized, and there's some
00:36:41.730 synchronization when accessing it. Not only the queue, but everything reachable from it, because of course thread two
00:36:47.400 now can take a message from the queue and access everything in it. And to do this
00:36:56.430 is actually quite simple, we don't need much. Of course we need to track this at run time, so when this happens we need to
00:37:02.190 notice it, and the way we do this is with a write barrier. It means that when you write to a shared object or to a shared
00:37:09.330 collection, whatever we put in it is going to be shared as well, because it becomes reachable through it. If we write to
00:37:16.110 a shared field, then we share whatever is written to it. For a hash, the key and the value; for an array, we share the element; in
00:37:22.980 the hash case, both the key and the value. And
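The local/shared distinction and the write barrier can be modelled with a small toy class. This is only a sketch of the idea under stated assumptions: `Node`, `share!`, `write` and `sharing_demo` are hypothetical names, and in the real VM the barrier lives inside the implementation of field and element writes rather than in Ruby code.

```ruby
# Toy model: every object starts local (unsynchronized) and becomes
# shared once it is reachable from an already shared object.
class Node
  def initialize
    @fields = {}
    @shared = false
  end

  def shared?
    @shared
  end

  # Marks this object and everything reachable from it as shared.
  # The early return also stops the recursion on cycles.
  def share!
    return if @shared
    @shared = true
    @fields.each_value { |v| v.share! if v.is_a?(Node) }
  end

  # Write barrier: storing into a shared object shares the stored value.
  def write(name, value)
    value.share! if @shared && value.is_a?(Node)
    @fields[name] = value
  end
end

def sharing_demo
  globals = Node.new
  globals.share!                     # stands in for global variables
  queue = Node.new
  message = Node.new
  queue.write(:head, message)        # both still local, no barrier work
  before = [queue.shared?, message.shared?]
  globals.write(:queue, queue)       # queue is now reachable by every thread
  after = [queue.shared?, message.shared?]
  payload = Node.new
  message.write(:payload, payload)   # the barrier shares payload too
  { before: before, after: after, payload: payload.shared? }
end

p sharing_demo
```

Writes to local objects skip all of this, which is why single-threaded code pays essentially nothing for the safety.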
00:37:29.160 with this approach you actually get pretty good results. These are single-threaded benchmarks, and these benchmarks
00:37:36.210 basically use objects a lot. Here unsafe is TruffleRuby as it was initially, and safe is TruffleRuby using
00:37:42.960 thread-safe objects with this local and shared distinction, and there is no difference: it's
00:37:49.110 the same everywhere. On the other hand, if you synchronize on every write, which is like
00:37:54.450 the naive way to do it, then it's up to 2.5 times slower here, because every
00:38:02.400 object write becomes more expensive. But here, since these are single-threaded benchmarks, all objects are local: they can be reached by only one
00:38:09.090 thread, so every write goes without synchronization. And we can do the same trick for collections, of course. So again,
00:38:15.510 comparing the unsafe version and the thread-safe collections, it's the same performance everywhere
00:38:21.590 for single-threaded benchmarks. So now let me
00:38:27.570 detail a bit more how we deal with arrays and hashes, because it's quite interesting. The first thing to
00:38:35.730 introduce is array strategies. Array strategies are an implementation technique that's used in most
00:38:42.270 state-of-the-art virtual machines, and the idea is that we represent the array with
00:38:50.550 different representations behind the scenes, to fit as exactly as possible what's in it. So initially,
00:38:58.170 for instance, I have an empty array that uses this empty strategy, so there is no storage at all; it could be null or whatever. Then when we append an
00:39:05.940 integer to it, it becomes an int array; it migrates like this. Then if we later append an object, it
00:39:14.460 becomes an object array. So the most general one is this object array. The reason we don't use it all the time
00:39:20.430 is that there's an extra cost to it, because with it you would have to box primitives like int. And also,
00:39:30.180 it's an observation that even though these are dynamically typed languages, so we can mix types in an array, in practice
00:39:35.460 very few cases do so. In practice, almost always, if there is an integer in it,
00:39:41.010 probably all the other elements are going to be integers; it's rather rare to mix. And so the advantage
00:39:47.160 then, of course, is that if you represent an array just as an int array, it takes four bytes per element. It's very compact, very
00:39:53.010 efficient: I can access it directly, just read the memory and that's it. But in MRI an
00:40:00.600 integer always takes eight bytes, and I have to decode it from the tagged pointer and whatnot. And so, well, these
00:40:12.600 array strategies already exist. What we do is, first of all, I want
00:40:22.440 to add thread safety: I want this to still work and to make it thread-safe. This is complicated, because now the array can move
00:40:29.090 between strategies, change its storage and so on, for example if two threads change the storage at the same time. So we have to handle
00:40:35.300 this. And I want to make all array operations thread-safe, so basically the same guarantee that exists in MRI. I want to
00:40:43.040 preserve this optimization of strategies, and I still want to enable parallel reads
00:40:48.260 and writes to different parts of the array. And the way to do this is: let's add another dimension to these array
00:40:54.380 strategies, which I call concurrent strategies. I have two of them, which I call
00:40:59.990 SharedFixedStorage and SharedDynamicStorage. Now, SharedFixedStorage is a
00:41:05.890 strategy specialized for one usage pattern. Basically it assumes: okay, this array is going to be used like
00:41:12.200 a fixed-size array, like an array in other languages: you access it by reading and writing elements, but you're not going to append or delete elements in it.
00:41:19.420 By having this more specialized thing we can make access
00:41:24.500 faster: we have no overhead in that case over the local, unsynchronized case. But now, what if you actually append
00:41:33.260 an element to, or delete an element from, one of these? Then we need to migrate to SharedDynamicStorage, and there, because there are like
00:41:40.460 a hundred methods on Array and they do all kinds of crazy things, we need to use a lock to synchronize. There's no way around it; there is no lock-free
00:41:46.910 implementation for something like that, so we have to use a lock at that point.
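The storage strategy transitions described earlier (empty, then int, then object) can be sketched as a small state machine. `StrategyArray` is a hypothetical name, and because plain Ruby cannot express an unboxed int array, this toy only tracks which strategy would be in use; it does not reproduce the memory layout or the concurrent strategies.

```ruby
# Toy model of array storage strategies: the strategy tag follows the
# kinds of values appended so far and only ever generalizes.
class StrategyArray
  attr_reader :strategy

  def initialize
    @strategy = :empty   # no storage allocated yet
    @storage = nil
  end

  def <<(value)
    case @strategy
    when :empty
      # The first element decides the initial representation.
      @strategy = value.is_a?(Integer) ? :int : :object
      @storage = [value]
    when :int
      # A non-integer forces migration to the general representation.
      @strategy = :object unless value.is_a?(Integer)
      @storage << value
    else
      @storage << value
    end
    self
  end

  def [](index)
    @storage && @storage[index]
  end

  def size
    @storage ? @storage.size : 0
  end
end

def strategy_transitions
  array = StrategyArray.new
  seen = [array.strategy]
  array << 1
  seen << array.strategy
  array << 2
  seen << array.strategy
  array << "three"
  seen << array.strategy
  seen
end

p strategy_transitions
```

In this model the concurrent strategies from the talk would be a second tag next to the first one, recording whether the array is local, shared with a fixed layout, or shared with a dynamic layout.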
00:41:53.140 Now let's see. For instance, here we have some benchmarks which read from an array, or do reads and some writes,
00:42:01.450 in a few different mixes. And we do this from one
00:42:07.820 thread to 44 threads; this is on a big machine with actually 44 cores. What
00:42:14.300 we see is that with local, so when the array is not shared between threads, the performance is pretty good: here we
00:42:20.780 reach about 30 billion array accesses per second, which is pretty good. And when we use the
00:42:28.760 SharedFixedStorage, which is this one, we have the same thing, the same performance. So
00:42:34.010 that's why this strategy is nice: if you just access the array like this, without concurrently changing its layout,
00:42:40.130 it is detected and optimized nicely. But then as soon as you start to use a lock, and all the
00:42:46.200 other ones here are actually using some kind of locking mechanism, the performance goes, bam, at least three times lower. And so
00:42:54.210 then we tried different kinds of locks. The ReentrantLock is like a global lock on the array, so it doesn't scale:
00:43:01.230 with 44 threads it is not any faster than with one thread. Then there is something like StampedLock,
00:43:06.390 which is like a read/write lock. It scales for reads, okay, that's the dark green
00:43:11.580 here, but it doesn't scale for writes at all. So that's not good enough, and we need something new.
00:43:17.250 There's something called the Layout Lock, and what we use is just a variant of it.
00:43:22.950 Basically it has three modes of access: read access, write access and layout
00:43:28.890 change. Layout change means, okay, we change everything: we change the strategy, or we need to delete an element in
00:43:34.830 the middle. That requires exclusive access to the array, but the reads and writes can run in parallel. And so with this lock we actually
00:43:42.150 scale for reads and for writes. I can run this
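The three access modes can be illustrated with a deliberately simple lock built from a Mutex and a ConditionVariable. `ToyLayoutLock` is a hypothetical class that shows only which modes may overlap: element reads and writes run concurrently with each other, while a layout change runs alone. It is not the Layout Lock algorithm itself, and it takes a mutex on every access, so it would not scale the way the talk describes.

```ruby
class ToyLayoutLock
  def initialize
    @mutex = Mutex.new
    @cond = ConditionVariable.new
    @accessors = 0            # threads currently reading or writing elements
    @layout_changing = false
  end

  # Element access: may overlap with other element accesses.
  def access
    @mutex.synchronize do
      @cond.wait(@mutex) while @layout_changing
      @accessors += 1
    end
    begin
      yield
    ensure
      @mutex.synchronize do
        @accessors -= 1
        @cond.broadcast if @accessors.zero?
      end
    end
  end
  alias read access
  alias write access

  # Layout change (append, delete, strategy change): exclusive.
  def layout_change
    @mutex.synchronize do
      @cond.wait(@mutex) while @layout_changing || @accessors > 0
      @layout_changing = true
    end
    begin
      yield
    ensure
      @mutex.synchronize do
        @layout_changing = false
        @cond.broadcast
      end
    end
  end
end

def layout_lock_demo
  lock = ToyLayoutLock.new
  array = Array.new(8, 0)
  # Each thread updates its own slot, so element writes may overlap.
  threads = 8.times.map do |i|
    Thread.new { 1_000.times { lock.write { array[i] += 1 } } }
  end
  threads.each(&:join)
  # Appending changes the layout and therefore needs exclusive access.
  lock.layout_change { array << :appended }
  array
end

p layout_lock_demo
```

The design point is the compatibility of the modes: a read/write lock would make the eight writers exclude each other, whereas here only the append has to wait for everyone else.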
00:43:48.330 on some bigger benchmarks. These are actually the NAS Parallel Benchmarks, a suite of a whole bunch of benchmarks
00:43:54.450 originally written in Fortran; that's typical scientific computing. They were translated to Ruby, and what you see here is
00:44:03.270 scalability: how many times faster we are on n threads than on one thread. And
00:44:09.450 you see that the scalability curves are basically very similar for Java, Fortran, unsafe TruffleRuby and thread-safe
00:44:17.430 collections; they're all the same. So you can say that TruffleRuby scales as well as
00:44:22.800 Java and Fortran, which is very good for this kind of workload. The reason here
00:44:29.550 is simple: because it's a very regular workload, we use mostly this fixed storage strategy.
00:44:39.400 So that's about the end. I showed that the performance of Ruby can
00:44:46.400 be better, and TruffleRuby is an example. We can have parallelism and thread safety, and
00:44:52.640 based on the simple idea that if objects are reachable by multiple threads we synchronize,
00:44:58.130 and otherwise we don't, we can actually go pretty far, and you can have parallel access with a thread-safe
00:45:05.180 Array and Hash. And so in the end, with this, TruffleRuby runs Ruby code
00:45:12.799 in parallel, with the most important thread-safety guarantees that we have with the global lock, but still scales.
00:45:19.880 If you are curious, feel free to try TruffleRuby. I integrated it with all the Ruby managers
00:45:26.180 I could find, so ruby-install, rbenv, RVM and chruby just work. If it doesn't,
00:45:32.719 please report back to us. And then, yes,
00:45:38.239 TruffleRuby is part of a bigger project, which is called GraalVM and was released
00:45:43.249 last year, actually. It doesn't
00:45:48.289 only contain Ruby, but also a JavaScript runtime, a Node.js runtime, R, Python and an LLVM bitcode runtime, which
00:45:55.130 means we can run C, C++ and Rust, all in a single VM. The advantage of this is that all these
00:46:01.670 languages can interoperate: they can call each other easily and efficiently. It means that, for instance, if you call
00:46:07.549 Python from Ruby, the JIT can actually inline through the call, see both languages and optimize them
00:46:13.339 together. So basically you can call from Ruby to R to Python to C, and
00:46:19.309 it is as fast as if you wrote everything in C. This LLVM bitcode runtime is also
00:46:28.190 what we actually use for C extensions. So when you
00:46:33.259 run a C extension, it is actually interpreted with our own interpreter for C code, which has a JIT as well.
00:46:39.739 So we can JIT-compile C and Ruby
00:46:45.289 together. So if some code goes from Ruby to C and back, we can optimize across everything, and it's much more
00:46:51.280 efficient. GraalVM comes in two editions: there's the open-source Community
00:46:57.070 Edition, and the other one is the Enterprise Edition, if you want better performance and some
00:47:03.700 support and also like litigate to support nothing yet
00:47:27.910 I guess we'll get there at some point. Heroku has an interesting business model where they give you
00:47:34.180 basically very little memory; it's a very small amount. So I'm not
00:47:41.110 sure it would work there. You would definitely not want to run a JVM there, but we could run with SubstrateVM, and
00:47:47.440 probably for some workloads that could be interesting. So the answer is: we didn't
00:47:52.510 try it and there's no integration yet, but I think it will come eventually.
00:47:59.590 At some point, hopefully, on Oracle Cloud or other clouds it will probably be available and integrated, so
00:48:06.850 you can just try running your application there.
00:48:21.320 What is still missing for production? Well, not much, I guess.
00:48:35.990 Running Rails and lots of C extensions is a high-priority goal, and then of
00:48:42.590 course you want to have numbers on how much faster we are on Rails. If we can say, like,
00:48:48.500 over two times faster on Rails, that's already something, right? So, well, one part
00:49:11.810 of the GraalVM project is the Graal compiler, which is written in Java,
00:49:17.690 and that's actually getting more and more integrated: in recent releases of Java the Graal compiler is part of it and you can use it.
00:49:24.890 I think in the end the Graal compiler will probably replace the old JIT compiler.
00:49:33.140 The problem with the original compiler is simple: it's very
00:49:40.430 hard to maintain, the whole thing is written in C++, and nobody wants to touch it.
00:49:48.200 There are also cases where we run code faster: Scala, for instance, is known
00:49:54.320 to run faster. Then there's SubstrateVM,
00:50:00.950 which I mentioned in the beginning. It can take a Java application and compile it down to a native executable,
00:50:07.210 just like GCC would for C code; it's the same kind of thing. That's very
00:50:13.430 interesting for, I don't know, functions and serverless computing: you could ship your Java application as a simple small binary, with a
00:50:20.720 small memory footprint, that starts instantly. So some of this stuff
00:50:26.720 is flowing back into Java, yes. We're trying to be a bigger ecosystem with a lot of languages. Behind all
00:50:33.590 this is a common framework, which is called Truffle, and we're building on that, trying to make it easy for people to
00:50:39.770 actually build a new language implementation, have it be fast, and still keep it not too complex. At my company
00:50:54.740 we have a team that works with R and a team that works with Ruby. So
00:51:00.050 would it be possible to run both, like some R scripts together with Ruby? Yes,
00:51:09.980 absolutely. There is actually someone external working on this a bit.
00:51:20.150 He wrote a blog post about this: he uses some plotting that is done in R and calls it from Ruby.
00:51:28.160 So there is some integration, and you can have a nice layer to interoperate. But even on a plain GraalVM you can already do
00:51:34.700 this: you can evaluate an R function, call it and pass it a Ruby
00:51:39.980 object and so on. That's something that didn't exist before, right? Or if
00:51:46.040 it did, it was only a two-language mapping at most, not an n-language mapping.
00:51:51.140 That's something rather interesting, but people haven't tried it very much so far,
00:51:57.380 because, I don't know, it's something new.
00:52:03.460 Are there publications about this? Because I think this array, or this collection idea, is very abstract and extremely interesting,
00:52:10.940 because you can really do a lot of things with it in any language. So yeah,
00:52:20.720 I actually have a paper about this as part of my PhD, and there are a lot of
00:52:27.140 publications as well on all kinds of ideas in the project. So strategies are
00:52:33.050 actually not so new: they were introduced in PyPy a bit
00:52:38.810 after 2000, I think. And here I did the more general version, with concurrency. The
00:52:45.740 object model was introduced by the Self people in 1989 or something;
00:52:52.240 mine is a version of that where, okay, I make it thread-safe. If you want
00:52:59.750 links about that, just go to the GraalVM website; there is a list of publications. So you contribute to
00:53:08.780 CRuby, right? So what's those guys' opinion? I don't know individual
00:53:16.730 people's opinions, but are they open to maybe basically throwing away the bytecode
00:53:22.970 generation approach and using something like partial evaluation as well? Oh, no, because the project context is
00:53:30.320 very different. First, GraalVM has a dedicated team of about 50 people, with about four people on
00:53:37.610 Ruby and the rest on the different subprojects. But CRuby is basically four or five people full-time and
00:53:43.430 that's it, and all the rest is open-source volunteers. So there are not so many resources, and that's why they went with the MJIT
00:53:49.580 approach of generating C code, because then they don't have to write the compiler themselves. The Graal compiler took
00:53:55.730 more than five years to be written, and there was already a lot of background from the Java compiler. But
00:54:01.280 writing a JIT for Ruby from scratch would take a long time; you need to be one of
00:54:07.640 these huge companies and spend a few years on it. Okay, so I guess for them it's
00:54:13.430 mysterious off like follow from jetpack bit on there were source and open the main thing is like it was something like
00:54:21.320 well, okay, we don't reuse much of the MRI interpreter: we reuse the C extensions but not the core,
00:54:27.950 because we want to be able to optimize through it, and C code is not a very good medium for that. I don't know,
00:54:35.780 sometimes they are interested. For example, I posted on an issue where someone working on mathematical
00:54:41.990 libraries had a benchmark, and they were like, oh, cool. But yeah, I guess the main thing for
00:54:48.590 us is of course to get more adoption, to have more users. I think Ruby
00:54:56.210 is one of the prime examples, because we already have a huge speed-up over the reference implementation, whereas for JavaScript, well, we're at the same level
00:55:02.930 as V8, not much faster. The project was started by a different
00:55:10.299 person, about four years ago:
00:55:15.950 Chris Seaton started it as an internship, and the project was taken on and developed
00:55:21.829 since then. So yeah, you can see that it's a big effort: we have had about four people working on it for years, and we're
00:55:38.289 not going to Reno I think it so for a
00:55:46.309 long time the main idea
00:55:52.489 for C extensions was: oh, we run the C code as managed code too, so everything can be managed. But no, this is
00:55:58.130 a dream. It works for C extensions that are written for performance, because then,
00:56:04.160 okay, it's just a different way to write code. But it doesn't work for C extensions that try to bind to a native
00:56:09.170 library, like a database driver for instance, because of course then a lot of
00:56:16.880 the Ruby objects actually go into native code, and that's where it gets complicated, because now I have a
00:56:24.109 managed representation of my Ruby array and I also have a native representation of it. And the same for strings, because
00:56:29.719 the C API can hand out just a char pointer for the Ruby string, and you're like, well, what do I do with this and all the memory
00:56:35.359 behind it? It's kind of complicated. What about low-level operating
00:56:43.609 system access? On Linux you have libraries that are very powerful, for
00:56:50.869 inter-process communication and whatever, or network access. Do you have
00:56:59.080 problems with those? So for the
00:57:04.420 first approach, basically this kind of thing didn't work very well, because we had to select specifically: okay, this thing is going to native, so we create
00:57:11.020 some kind of indirection for it, and every time you use it we remove this indirection. That was inconvenient
00:57:16.390 because you can't do this automatically. Now, in the new approach, every Ruby object can be converted to
00:57:22.420 native and gets a pointer; then when you use it in managed code we convert it back. And then we have to do some kind of GC to
00:57:28.690 find out, okay, whatever we converted to native, once we are back from native code and nothing refers to it anymore,
00:57:34.450 we can release it. So that part now works.
00:57:41.550 And the C extension code: we first compile it with clang, not to native code but to bitcode, and then we
00:57:50.980 execute that with our LLVM interpreter. For a lot of very basic stuff it just calls the operating system, so that just works. There
00:58:00.010 are some things that are a bit more difficult, like waiting on threads: if you just make a pthread call directly,
00:58:05.050 it's actually complicated, because you need some integration with the Java and the Ruby threads, so that we
00:58:10.690 need to intercept. But basic things, like opening a file and low-level access to
00:58:17.500 the operating system, you can just let it do natively. There is actually another mode,
00:58:24.300 in Enterprise, where this interpreter is entirely managed, so there are restrictions around
00:58:31.720 native system calls, but then it guarantees memory safety. It
00:58:37.990 means that it's impossible to do something like an out-of-bounds access on an array or to get a segfault.
00:58:43.750 It's something interesting, but something that is very
00:58:49.240 hard to make work with the existing C ecosystem, which is a lot of very old C code with assembly and whatnot in it. So
00:58:58.030 yeah, that's a bit more new, but it could be interesting, for example if my
00:59:04.240 C extension is not very stable and causes segfaults from time to time: in that mode it would not be able to cause a segfault anymore.
00:59:11.320 We would also be able to see the bug, have a backtrace and say, okay, this is the problem: an illegal access.
Explore all talks recorded at Railshöck Meetup