All right, thank you, and thanks for coming to such a heady topic as your last talk of the day. My talk is called "In-Depth Ruby Concurrency: Navigating the Ruby Concurrency Landscape." Like Mike said, when I first found out I was doing this talk, I told my parents I was doing a talk on Ruby and they were really confused, because they thought I was doing a talk about their dog. So if you've come here to learn how to play with my parents' dog Ruby concurrently, I'm sorry, this isn't the talk for you and you may want to leave.

I'm JP Camara. I publish technical blog posts at jpcamara.com, and over the past year I've been doing a series on in-depth Ruby concurrency, basically this topic. I do some open source on GitHub, I'm active on Bluesky at jpcamara.com, and I'm a principal engineer at Wealthbox CRM, which is a CRM for financial advisors.
Today we're going to talk about a group I affectionately refer to as the Ruby concurrency crew: processes, Ractors, threads, and fibers.

When I think about Ruby concurrency, I think of it sort of like a nesting doll. As you open up pieces of the nesting doll, there are new layers of concurrency inside, and at runtime in Ruby you can introspect all of those layers really easily. You've got the process ID, which you can get from Process.pid, and then you've got Ractors, threads, and fibers all nested within there, and you can access each of them using .current to introspect information about them.
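As a rough sketch of that introspection, using only core Ruby APIs (the exact objects printed will vary per run):

    # Peek at each layer of the "nesting doll" from inside a running Ruby program.
    puts Process.pid      # the OS process ID
    puts Ractor.current   # the Ractor this code runs in (the main Ractor by default)
    puts Thread.current   # the current thread (the main thread by default)
    puts Fiber.current    # the fiber currently executing (core since Ruby 3.1)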
The first thing we're going to talk about, at the top level of that nesting doll, is the process. Processes are parallel and concurrent, and we'll dig into what that really means. They work in isolated memory spaces, meaning that when you have multiple processes they can't share memory with each other, so things like class loading can't be shared between them. They map to an actual OS process, so you can introspect them at the operating system level. And they're the most expensive form of Ruby concurrency, which we'll talk about a bit. We have a very simple example of just fork do here, which creates a process for you.
Very coincidentally, Matz did a whole section about prime numbers and how he can generate the world's largest prime number. I happened to use primes here because they're a fairly expensive operation to do in parallel. By the way, I tried Ruby head with the largest prime available, and ten minutes in I just had to shut my computer off because it was dying under the weight of it, so I wouldn't actually try running this code; you need to make it a little simpler. But basically, I've got three ranges of numbers from one up to that very large prime number, and I'm going to walk over them, create forks, and try to figure out which prime numbers are in my lists.
The first thing you'll see when I run that code is that the fork actually starts running immediately. This is true parallelism in the regular CRuby runtime. At the same time, you'll see I've got this process status command running at the bottom, and a new process got created. I set the process title to "ruby" with the index after it, and then I just start selecting primes from my numbers, and while I'm doing that I'm getting full CPU saturation. That happens for each fork I create, until I get to Process.waitall and the initial process just goes off and waits. At that point I've got three processes running at 99.9% CPU saturation at the same time.
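A minimal sketch of that forked prime selection; the ranges and titles here are my own stand-ins rather than the exact slide code, and the prime check is inlined so it is self-contained:

    # A tiny prime test so the example runs on its own (the talk used far larger ranges).
    def prime?(n)
      return false if n < 2
      (2..Integer.sqrt(n)).none? { |d| (n % d).zero? }
    end

    ranges = [2..100_000, 100_001..200_000, 200_001..300_000]

    ranges.each_with_index do |range, i|
      fork do                                       # fork is available on Unix-like systems
        Process.setproctitle("ruby #{i}")           # shows up in `ps` so you can watch each child
        primes = range.select { |n| prime?(n) }     # CPU-heavy work, truly parallel per process
        puts "fork #{i}: #{primes.size} primes"
      end
    end

    Process.waitall   # the parent just waits for all three children to finish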
00:03:48.159
that processes are parallel and
00:03:50.360
concurrent um I think people normally
00:03:52.239
think of just like threads and fibers as
00:03:53.640
being concurrent but really concurrency
00:03:56.120
uh whereas parallelism is sort of uh
00:03:57.959
running things simultaneously
00:03:59.319
concurrency is really about
00:04:00.519
orchestrating or composing tasks and you
00:04:03.319
essentially put those things into a unit
00:04:04.840
of work you hand it off to something and
00:04:06.680
you say run this for me in whatever way
00:04:08.439
makes sense um so for instance like
00:04:10.640
psychic jobs are a great example of of
00:04:13.040
concurrency and so in our case here when
00:04:15.000
we first start running because it's a
00:04:16.799
process that starts running in parallel
00:04:18.799
but in this example we've actually got
00:04:20.280
five ranges and on our weird
00:04:21.919
hypothetical computer we've only got
00:04:23.280
three cores and so as we start to go
00:04:25.840
beyond the initial three cores we have
00:04:28.000
available to us things still continue to
00:04:30.440
run things still continue to switch
00:04:31.960
around but our CPU usage for each of our
00:04:34.360
processes goes down because they're
00:04:35.680
starting to share those three CPUs and
00:04:37.960
so at this point we've got 75% and then
00:04:40.680
basically by the time we get to the end
00:04:42.000
of it we're running at about 60% CPU but
00:04:44.880
we're running concurrently we're
00:04:46.000
swapping back and forth between them
00:04:47.440
we're orchestrating these
00:04:49.600
tasks so what's an example of processes
00:04:52.160
in the real world one of my favorite
00:04:54.120
examples is a newer server from Shopify
00:04:56.400
called Pitchfork uh Pitchfork is a
00:04:58.680
multi-process what's called a refor
00:05:00.560
forking server uh it Maps web requests
00:05:04.199
one to one from a web request to a
00:05:06.120
process meaning for however many
00:05:07.840
processes you have that's how many web
00:05:09.400
requests you can run at any given time
00:05:11.600
and if you go over that then your
00:05:12.759
requests start to queue up and those
00:05:14.320
handle your parallel CPU and IO for
00:05:16.720
you but how do you decide how many
00:05:19.160
processes you should actually
00:05:21.560
use and for my answer is really ideally
00:05:24.680
as many processes as you have cores so
00:05:27.240
if you don't have as many processes as
00:05:28.720
you have cores you're ESS leaving
00:05:30.400
parallelism on the
00:05:32.080
table additionally if you're using a
00:05:34.400
server that only uses processes like
00:05:36.280
Pitchfork does and doesn't use any of
00:05:37.720
the other concurrency units we're going
00:05:38.919
to talk about you really want to up that
00:05:40.960
process count as well um more like 1.25
00:05:44.600
to two times as many processes as you
00:05:46.560
have CPUs and that's because concurrency
00:05:49.919
so when you have uh when you're only
00:05:52.560
running processes you really want to
00:05:53.960
take advantage of that concurrency some
00:05:55.720
things are going to block on things like
00:05:57.120
IO they're not going to be running in
00:05:58.360
parallel on CPU so you want to be able
00:06:00.479
to take advantage of that and and run
00:06:02.520
extra
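A rough sizing sketch; the 1.5 multiplier here is just a point inside that 1.25x to 2x range, not a rule:

    require "etc"

    cores = Etc.nprocessors   # how many CPUs the machine (or container) exposes to Ruby

    # For a process-only server (Pitchfork/Unicorn style), oversubscribe a bit so that
    # time spent blocked on IO doesn't leave cores idle.
    worker_count = (cores * 1.5).ceil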
I mentioned that processes are expensive, and one of the most expensive things about them is that they're heavy on memory. In a typical example, you've got your parent process, you call your fork do like we were doing for our prime number selection, and your child processes each have what are called pages of memory. Those pages of memory are completely unshared between them. But there are lots of things you could share in your application: you load your classes, maybe you've got YJIT bytecode; there are lots of things that are important to share, but you can't do it here.
So you really want to improve your memory usage by doing something called preloading, which you may have seen before. Preloading basically just means that before I start forking, I require some amount of code. In this example I'm requiring the Rails application code and calling initialize. If you have YJIT running you're going to get some bytecode, you're going to load a bunch of your classes, maybe some things get memoized or initializers run, and all of that gets preloaded into my parent process. Now when I fork, my child processes are able to take advantage of something called copy-on-write: there are pieces of memory that may never change in my child processes, they never have to reinitialize them, and the memory overhead of each child goes down because it shares those pages with the parent. Most servers support a preload option, so you really want to improve your copy-on-write memory by utilizing that.
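A bare-bones sketch of the idea, assuming a Rails app; the paths, worker count, and run_worker helper are placeholders:

    # Preload in the parent *before* forking (paths assume a Rails app; adjust for yours).
    require_relative "config/application"
    Rails.application.initialize!

    3.times do |i|
      fork do
        # Everything loaded above (classes, initializer results, JIT-compiled code) is
        # shared with this child via copy-on-write until the child writes to those pages.
        run_worker(i)   # hypothetical per-child work loop
      end
    end

    Process.waitall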
The most interesting thing to me about Pitchfork is that it has this newer concept called reforking. What reforking means is basically that the parent process creates a child without doing any preloading, and that child is called the mold, or the template. The template does the preloading, and from there things start to look pretty much like our previous example: we've preloaded, we've got a few pages of memory that we can use to create what are called here sibling processes, but effectively child processes. We're sharing a minimal amount of memory, we're getting some reduced overhead, we can probably use some more concurrency, and we're getting benefits. Where things really get cool is that Pitchfork has something called refork_after, which means that after a certain number of web requests it will actually fork again from that mold process. By that point we've handled lots of web requests, we've run all our initializers, memoization may have happened, a lot more YJIT compilation may have happened, and all of that is now reforked and able to be shared with the next generation of processes. In this case we went from, say, three shared pages of memory to many pages of memory and a much warmer application, and it continues doing that at different intervals; here after 50, 100, and 1,000 requests. At Shopify it reduced their memory usage by 30% and their latency by 9%.
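As a sketch of what that configuration might look like in a Pitchfork config file; the numbers are just the intervals mentioned above, and the full option set is in the Pitchfork docs:

    # pitchfork.conf.rb (illustrative)
    worker_processes 3

    # After these cumulative request counts, refork the next generation of workers
    # so warmed-up memory (initializers, memoization, JIT code) gets shared again.
    refork_after [50, 100, 1000]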
So we kind of understand what processes are good for in terms of parallelization and a certain amount of concurrency, but what is Pitchfork good for? If you're currently a Unicorn user, or if you're using Puma without threads, Pitchfork might be a good option for you. If your web requests are primarily CPU constrained, Pitchfork also might be a good option. Or if you just aren't sure whether your code is thread safe, you might want to utilize Pitchfork.
The next thing, technically, would be Ractors, but I'm going to save those for a little later, so we'll come back to them in a bit.
Next we're going to talk about threads. Threads are purely concurrent: whereas processes are concurrent and parallel, threads just have the concurrent part. They operate in a shared memory space, so they don't have to initialize a whole new block of memory every time; they can share the memory of their parent. They do allocate a little bit of memory, but nothing near what a process does. They map one to one, for the most part, with OS threads, and they're less expensive than a process. We've got a little example here of Thread.new.
Now we're going to try our same prime example here, and it's not really going to work out so well. The first thing you'll notice, as opposed to our process example where the process immediately started running the operation in parallel, is that nothing really happens here at first. We see our threads actually getting created at the operating system level, but they're sitting at 0% usage, and that's because of the GVL. You may have seen some previous talks about the GVL: it's the global VM lock that every thread needs to hold to keep the Ruby runtime consistent internally, and it means our threads can only run Ruby code one at a time. Basically, until some event comes along, something blocks, or the thread scheduler switches them, they can't do anything. In our case here, eventually our main thread says, hey threads, give me your values, which is just a way of blocking on each thread and asking for the last value that comes back from it. We set a Thread.current name, so we get those "ruby 0", "ruby 1", "ruby 2" kind of labels, and then we select our primes. But ultimately we just keep swapping back and forth under the GVL, so we never achieve more than 100% of one CPU, and each thread is operating at about 33%, effectively.
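Here's roughly what that thread version looks like as a sketch, reusing the hypothetical prime? helper from the fork sketch:

    ranges = [2..100_000, 100_001..200_000, 200_001..300_000]

    threads = ranges.each_with_index.map do |range, i|
      Thread.new do
        Thread.current.name = "ruby #{i}"        # visible in OS-level thread listings
        range.select { |n| prime?(n) }           # CPU-bound, so the GVL serializes this work
      end
    end

    # Thread#value joins each thread and returns its last expression,
    # so the three results come back one at a time.
    results = threads.map(&:value)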
With that in mind, what are threads good for? They're good for things that block: file operations, DB calls, HTTP, sleep, process management. And if you use a library like bcrypt or zlib, those will actually release the GVL as well, so you can run them in parallel. Each thread waits for the OS response in parallel.
So that ultimately means that, to me, threads are really "parallel-ish." Take an example that better utilizes them: here we're running a slow report in the database, we're retrieving a customer from Stripe, we're doing another related API call, and then, still gluttons for punishment, we're doing more prime number generation. In this case we still have to grab the GVL first, but once we hit a blocking operation we release it, and those blocking operations can continue to happen in parallel in the background. So what ends up happening, the "parallel-ish" part, is that we've got three additional things running in the background while we're doing our prime number generation in the foreground.
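A sketch of that shape of code; the report and API helpers are stand-ins for whatever blocking IO you actually do, and prime? is the same hypothetical helper as before:

    io_threads = [
      Thread.new { Report.generate_slow_report },          # hypothetical slow DB query; the GVL is released while waiting
      Thread.new { Stripe::Customer.retrieve("cus_123") }, # blocking HTTP call via the Stripe gem
      Thread.new { SomeApi.fetch_related_data }            # hypothetical related API call
    ]

    # Meanwhile the main thread holds the GVL and grinds through CPU work.
    primes = (2..200_000).select { |n| prime?(n) }

    report, customer, related = io_threads.map(&:value)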
So what's an example of threads in the real world? Sidekiq is a perfect example. Sidekiq is a multi-threaded job server; being multi-threaded was kind of one of its innovations at the time. Jobs often block on IO, so it's really valuable to have threads there: they can help you parallelize your IO, and threads are much cheaper than processes, so you can have a lot more of them. It's pretty simple to express: jobs essentially run on a thread, they pull job information from Redis, and then within your jobs, whenever you do something like an HTTP call or a database call, those can be parallelized, so you get better throughput.
But how many threads should we use, and how much should we allocate? It's a complicated question, but there is one answer. You may have seen some of Nate Berkopec's work, where he talks about Amdahl's Law. We don't really need to understand the formula per se, but we should understand a couple of its parts: S is the proportion of our program that can be made parallel, so the percentage in IO or CPU, and P is the speedup factor of the parallel portion, which is the number of processes, Ractors, threads, or fibers you're going to run in association with the kind of IO you're doing. We've got a handy table here precomputing the formula for you. Basically, at the lowest end, when you're around 10%, you're really not getting any benefit out of threads; you might as well just be running processes. When you get into the more common range of around 25 to 50% IO in your background jobs, 5 to 10 threads start to help a lot: on the low end it's about a 1.25x increase in throughput, and you get almost up to 2x throughput. As you get really, really high, which might be something more like making tons of API calls or websockets, you can get a lot of thread benefit and a lot more throughput.
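For reference, Amdahl's Law as described above is Speedup = 1 / ((1 - S) + S / P), where S is the parallelizable proportion and P is the number of concurrent units. A quick sketch that reproduces the kind of table the slide showed:

    def amdahl_speedup(parallel_fraction, units)
      1.0 / ((1 - parallel_fraction) + parallel_fraction / units.to_f)
    end

    [0.10, 0.25, 0.50, 0.90].each do |s|
      [5, 10].each do |p|
        printf("S=%.0f%% with %2d threads -> %.2fx throughput\n", s * 100, p, amdahl_speedup(s, p))
      end
    end

    # e.g. S=25% with 5 threads gives about 1.25x; S=50% with 10 threads gives about 1.8x.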
So what's Sidekiq good for? I mean, you're probably using Sidekiq right now, so you probably know what it's good for. But it's a good fit for jobs that operate with more than 10% IO and when your app is thread safe, which most apps start out as, so just try to keep them that way. And when you're using Sidekiq, five to ten threads is a safe bet.
How do I decide between processes and threads? The answer is really that you want both. So what's a good example of something that has both? Puma. Puma is a multi-process, multi-threaded server. The processes parallelize your CPU for you; the threads more efficiently parallelize your IO. Similar to Pitchfork, when your request comes in, things can only be parallelized across processes. There's a reactor in the middle for connection handling, but then we ultimately hand off to our threads, and for something in those Amdahl's Law ranges of IO, you can increase and tweak the number of threads you have to benefit your throughput more.
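In a Puma config that combination looks roughly like this; the worker and thread counts are illustrative, not a recommendation:

    # config/puma.rb
    workers 3          # one process per core for parallel CPU
    threads 5, 5       # threads inside each process for parallel-ish IO
    preload_app!       # preload before forking to maximize copy-on-write sharing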
So I've thrown this reactor thing in the middle here. What is a reactor, and why should you maybe care or understand what reactors are? So, reactors; not raptors, sorry little buddy. Basically, a reactor is a loop that runs and interacts with something called the kernel event queue. Pretty much every operating system has this very highly optimized queue where you can say, hey, I have this particular operation I want to run, I want to do some HTTP, tell me when the socket's ready to write to or read from. You hand that off to the kernel, and it can handle thousands, even hundreds of thousands, of these at a time. But it's kind of inconvenient to deal with that directly, so the reactor handles it for you. You register an event handler saying, hey, tell me when the socket is ready to write to; that hands off to the kernel event queue; you go off and do something else; and when it's done, the kernel event queue lets the reactor know, the reactor invokes your handler for you, and you can start writing to your socket or reading from your database. This is the same thing as an event loop, which you may have heard of, and it's used by tons of stuff: Puma, Action Cable, EventMachine, all sorts of things. It's a really powerful concept, and the only reason I bring it up is because we're going to use it later on.
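To make the idea concrete, here's a toy, single-purpose reactor loop built on IO.select; real reactors use epoll or kqueue via the kernel event queue and manage many kinds of handlers, but the shape is the same:

    require "socket"

    server = TCPServer.new(9090)
    handlers = { server => :accept }   # io object => what to do when it becomes ready

    loop do
      # Ask the OS which of our IOs are ready; this is the "tell me when it's ready" part.
      ready, = IO.select(handlers.keys)

      ready.each do |io|
        case handlers[io]
        when :accept
          client = io.accept
          handlers[client] = :respond          # register a handler for the new connection
        when :respond
          io.gets                               # naively read the request line
          io.write "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"
          io.close
          handlers.delete(io)
        end
      end
    end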
So what are reactors good for? They're the foundation for highly scalable IO. They're good for buffering requests and for slow clients: Puma uses one so that if a client is downloading or uploading content really slowly, your threads don't get occupied; the reactor can handle thousands of these at a time. And they're good for managing incoming connections. It's not the type of thing you'd use day to day in your own code, but tools you use will use it.
00:16:24.079
a general purpose web server you've got
00:16:25.959
a mixture of CPU and IO no one got fired
00:16:28.639
for choosing Puma it's got 44 million
00:16:30.480
downloads you're probably using it right
00:16:33.399
now all right the next concurrency unit
00:16:35.720
we're going to talk about are fibers uh
00:16:38.720
fibers are concurrent similar to threads
00:16:41.600
they're shared memory memory similar to
00:16:43.759
threads they are user space in this case
00:16:46.079
so there's no OS equivalent that we can
00:16:47.759
introspect it's something that's
00:16:49.079
essentially just managed and created by
00:16:50.959
Ruby and they're less expensive than
00:16:53.199
threads rators processes pretty similar
00:16:55.880
interface to creating them fiber. new
00:16:59.160
no sorry
00:17:02.480
buddy so in this case we're going to do
00:17:05.280
the same example again but what we'll
00:17:07.640
notice here is that unlike uh threads
00:17:09.839
threads need to have the gvl to be able
00:17:11.520
to run code but I've got that little
00:17:13.199
thread. main with the gvl acquired
00:17:15.039
already at the bottom that's because
00:17:16.640
fibers basically run inside of a thread
00:17:18.880
so they don't acquire the GBL but they
00:17:20.400
still can't run in parallel and that's
00:17:22.400
because fibers are actually considered
00:17:24.039
to be cooperative so fibers can actually
00:17:26.919
jump back and forth between each other
00:17:28.439
they can Feld and resume they have to
00:17:30.120
communicate with each other and
00:17:31.120
cooperate with each other to be able to
00:17:32.559
be created and uh and run so at the end
00:17:35.720
here our main fiber calls resume on all
00:17:38.039
of them each one of them runs range.
00:17:40.360
select and yields the value but all we
00:17:42.760
really get is this like sequential going
00:17:45.039
back and forth between them it's just a
00:17:47.960
it just sequentially runs so it's not
00:17:50.600
really that
00:17:52.360
useful so in Ruby
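A sketch of the bare-fiber version, with the same hypothetical prime? helper and ranges as before; each resume runs one fiber to completion before the next, so there's no overlap:

    fibers = ranges.map do |range|
      Fiber.new do
        Fiber.yield range.select { |n| prime?(n) }   # hands the result back to whoever resumed us
      end
    end

    # The main fiber drives them one at a time: purely sequential.
    results = fibers.map(&:resume)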
00:17:55.440
3.0 there there came along the fiber
00:17:57.679
Schuler and so it gives this ability to
00:18:00.280
basically put fibers kind of on steroids
00:18:02.919
um fibers essentially they still manage
00:18:05.159
the stack for you they can still operate
00:18:06.640
con currently but we've got the
00:18:07.840
scheduler behind the scenes that will do
00:18:10.679
a bunch of extra operations for us and
00:18:12.880
will uh unlock a lot more features for
00:18:15.600
fibers and so if we take that same
00:18:17.840
reactor example that we had before and
00:18:20.080
we apply it to fibers and we'll use the
00:18:22.320
something called The async Gem which is
00:18:23.840
kind of the primary fiber schuer
00:18:25.799
implementation the first thing you do is
00:18:27.480
you create the sync block which starts a
00:18:29.159
reactor for us behind the scenes when
00:18:31.640
you call a syn there you basically just
00:18:33.640
creating a fiber behind the
00:18:35.400
scenes when we call dbquery it just
00:18:38.600
looks like a regular synchronous call
00:18:40.159
it's just the same thing we're used to
00:18:42.440
but behind the scenes what happens is
00:18:44.200
the fiber scheduler puts that into a
00:18:45.880
reactor for us it registers the event it
00:18:48.520
then puts us into a list of blocked
00:18:50.080
fibers checks for available fibers and
00:18:52.000
moves on the same thing happens with our
00:18:53.960
stripe call
00:18:55.480
here once we get to the end eventually
00:18:58.120
those operations are finished our
00:18:59.480
synchronous calls will return and the
00:19:02.039
fiber schedule kind of handles all this
00:19:03.559
transparently for us and gives us our
00:19:05.120
result
00:19:06.799
back so once you have the fiber
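A sketch using the async gem; the DB call is a hypothetical placeholder and the Stripe call stands in for any blocking HTTP, but the point is that both read as ordinary synchronous code while the scheduler multiplexes the waiting:

    require "async"

    Sync do                                                        # starts the reactor behind the scenes
      report   = Async { Database.slow_report }                    # hypothetical blocking DB query
      customer = Async { Stripe::Customer.retrieve("cus_123") }    # blocking HTTP call via the Stripe gem

      # Each .wait parks only this fiber; the scheduler keeps other fibers running while IO is pending.
      results = [report.wait, customer.wait]
    end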
00:19:09.320
scheduler they're good for a bunch more
00:19:11.039
stuff but they basically at that point
00:19:12.960
still just kind of look like threats
00:19:15.720
each operation becomes a part of the
00:19:17.159
event Loop in this case instead of being
00:19:19.080
put to the OS in
00:19:20.799
parallel I'll go through the same kind
00:19:22.679
of parallel is example because they kind
00:19:24.679
of get that parallel is functionality
00:19:26.480
again at the same time once they have
00:19:28.480
the fiber Schuler so we do async here
00:19:31.880
and the difference when we're doing
00:19:32.840
these async calls is they actually
00:19:34.200
automatically go into that
00:19:36.480
background and so we've got these
00:19:38.919
background processes they're being put
00:19:40.919
into the reactor and handled for us and
00:19:42.600
we can still do our um useless
00:19:46.559
calculation of prime
00:19:48.559
numbers this all seems familiar and it
00:19:51.039
seems almost kind of pointless to have
00:19:52.960
two things that nearly operate the same
00:19:54.600
way but the way that I look at them is I
00:19:57.799
I kind of compare the strengths and
00:19:59.400
weaknesses of each and so for me with
00:20:01.480
fibers um they operate more
00:20:03.600
deterministically uh meaning because
00:20:05.679
they have to cooperate there's very like
00:20:07.840
uh exact seams of where things can
00:20:09.919
actually happen where things can switch
00:20:11.559
out versus threads where you basically
00:20:13.600
can have any instruction swap while
00:20:15.640
you're in the middle of running code
00:20:17.280
they're lighter weight on me on memory
00:20:18.919
and CPU and you know if you have five
00:20:21.559
threads and five rectors this really
00:20:22.960
isn't going to matter but as you scale
00:20:24.400
up a server the scalability of the
00:20:26.280
reactor goes much higher the cons is is
00:20:29.760
sometimes a big one it's that they block
00:20:31.000
on CPU if a fiber starts doing a really
00:20:33.400
heavy CPU operation um there's nothing
00:20:35.919
you can do about it without it
00:20:37.039
cooperatively scheduling another
00:20:39.280
fiber so what's an example of fibers in
00:20:41.559
the real world we've got Falcon which is
00:20:43.960
a multiprocess multifiber server built
00:20:46.360
on the fiber scheduler like our previous
00:20:48.919
examples like everything in C Ruby uh
00:20:51.240
the parallelization happens from our
00:20:53.200
processes but then everything else our
00:20:55.000
connection handling our request
00:20:56.200
buffering our parallel IO all of that
00:20:57.960
just gets handed off into the fiber
00:21:00.679
Schuler and so what's Falcon good for
00:21:03.159
it's it's good really for any web app
00:21:05.120
but it's particularly good for high iio
00:21:07.320
high connection or proxying web
00:21:09.039
applications um because of these this
00:21:11.320
fiber scheduler and the ability to scale
00:21:12.840
up these connections you can use web
00:21:14.360
sockets and http2 really easily in it I
00:21:17.440
did a benchmark recently against a
00:21:19.559
node.js websocket benchmark and Falcon
00:21:22.720
was very uh comparable in terms of the
00:21:25.039
performance which was great to see
00:21:26.159
because node.js is a highly optimized
00:21:27.799
environment for we
00:21:29.440
sockets additionally even without action
00:21:32.080
cable you can kind of just slap some
00:21:34.480
websocket functionality into your
00:21:35.799
controllers using Falcon and it just
00:21:39.159
works all right we're finally getting to
00:21:42.320
rators and giving them their time to
00:21:43.880
shine uh rators are both parallel and
00:21:46.880
concurrent they share M memory but with
00:21:49.559
a more strict interface which we're not
00:21:51.039
entirely going to get to today they map
00:21:53.159
to a pool of os threads so they're not
00:21:54.960
directly one to one but they do get to
00:21:56.400
utilize threads behind the scene and
00:21:58.080
they're less expensive than processes
00:22:00.320
they're pretty simple to
00:22:02.159
instantiate so in our case here uh
00:22:04.799
exactly the same as when we were
00:22:05.960
demonstrating processes when we start
00:22:07.840
calling that rapor new we immediately
00:22:09.559
start running in the uh parallel code in
00:22:11.559
the
00:22:13.360
background so that we can see that
00:22:15.360
behind the scenes when we do this
00:22:16.520
process status for our threads we are
00:22:18.480
getting actual CPU saturation every time
00:22:21.080
we create a new
00:22:22.279
Rector and once the main Rector calls
00:22:24.919
take on your rectors they can go through
00:22:27.279
their range select they can find our
00:22:29.159
prime numbers and then they can yield
00:22:30.640
those prime numbers at the
00:22:33.679
end they work but they do need more
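A sketch of the Ractor version, with the same stand-in ranges as the earlier sketches; the primality check is inlined because a Ractor block can only touch what is passed in to it:

    ractors = ranges.map do |range|
      Ractor.new(range) do |r|
        # Runs on its own OS-thread-backed Ractor, truly in parallel with the others.
        r.select do |n|
          next false if n < 2
          (2..Integer.sqrt(n)).none? { |d| (n % d).zero? }
        end
      end
    end

    # Ractor#take blocks until that Ractor finishes and returns its result.
    results = ractors.map(&:take)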
00:22:36.960
Community Support uh you still get a
00:22:39.240
rector's experimental warning at the
00:22:41.000
beginning you know there's certain
00:22:42.960
common libraries that because of the
00:22:44.679
strict sharing support you can't utilize
00:22:46.440
out of the box which can be a little bit
00:22:47.840
confusing and there are some API uses
00:22:50.120
that need to be
00:22:52.400
fixed is there an example of rors in the
00:22:54.760
real world there are people using them
00:22:56.480
but in terms of of Library support you
00:22:58.279
don't see a lot but there is a cool
00:23:00.679
Library called Morrow um Morrow is a
00:23:03.400
multiactor experimental server uh it
00:23:06.000
uses in front of it it uses async HTTP
00:23:08.600
which is that same underlying fiber
00:23:10.039
scheduler Falcon technology to handle
00:23:12.240
connection handling and request
00:23:14.000
buffering but then it uses rators to
00:23:16.159
handle parallel CPU and
00:23:18.520
IO so what's Morrow good for it's really
00:23:21.120
just good for experimentation right now
00:23:22.679
but I encourage you to to give it a look
00:23:24.440
and and try it out because it's one of
00:23:26.200
the the main examples out there of
00:23:27.640
something trying to use rors um in a
00:23:30.279
more seriousish
00:23:32.200
way all right that was a lot uh what are
00:23:36.039
the actual takeaways for the
00:23:38.279
talk so the first takeaway that I I want
00:23:40.960
to give you is uh maximize vertical
00:23:43.320
scale so before you start trying to
00:23:45.559
scale out horizontally you know take
00:23:47.559
advantage of everything that's exists on
00:23:49.240
your container your server whatever
00:23:51.960
always take advantage of preload you
00:23:54.000
know and if you have the option you know
00:23:55.520
use refor uh Puma actually has an
00:23:57.880
experimental feature for re working as
00:23:59.640
well um you get more space for
00:24:01.600
concurrency and more space for wet bik
00:24:03.880
code so you can run your Ruby code
00:24:05.520
faster match your Ruby process count to
00:24:08.320
available CPU count and from there you
00:24:11.360
know utilize amdall law to figure out
00:24:13.600
what kind of IO percentages you're
00:24:15.240
utilizing in each of uh and figure out
00:24:17.120
how to scale your largely your threads
00:24:19.039
from there so for instance if you have
00:24:20.960
sidekick running you know make sure you
00:24:22.760
take advantage of amd's Law and
00:24:24.320
understand what your IO percentages are
00:24:25.760
in your
00:24:26.960
jobs because if not
00:24:29.039
you're just paying for
00:24:31.840
less and then in terms of a conceptual
00:24:34.320
compression for how uh how to understand
00:24:36.760
which one to use where I would say
00:24:39.159
whenever you're trying to parallel
00:24:40.880
parallelize IO within your own code I
00:24:43.559
would suggest using the asnc fiber
00:24:45.159
scheduler uh the asyn gem has a really
00:24:47.399
great uh interface and ergonomics and
00:24:49.760
because fibers operate more
00:24:51.600
deterministically uh it's largely a
00:24:53.559
safer way um to to write your code and
00:24:56.120
have less foot guns along the way
00:24:59.039
if you're trying to paralyze something I
00:25:00.559
would encourage you to use rators there
00:25:02.120
are people out there using rators for uh
00:25:05.600
paralyzing CPU operations within their
00:25:07.640
code um there's also a gem called
00:25:09.640
parallel which will run processes for
00:25:11.919
you or threads but it can also
00:25:14.159
experimentally run rors for you so you
00:25:15.880
could use that as an abstraction and
00:25:18.000
then otherwise tune your servers and let
00:25:20.080
them do what they do best so processes
00:25:22.360
for CPU and threads fibers uh for Io and
00:25:25.960
tune them based on your server settings
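For reference, a sketch of the parallel gem approach; the worker count is illustrative, and the primality check is inlined so the block is self-contained:

    require "parallel"

    numbers = (2..200_000).to_a

    # The parallel gem fans work out across processes (or threads) with a single call.
    prime_flags = Parallel.map(numbers, in_processes: 4) do |n|
      n >= 2 && (2..Integer.sqrt(n)).none? { |d| (n % d).zero? }
    end

    primes = numbers.zip(prime_flags).select { |_, flag| flag }.map(&:first)
    # The gem also has experimental Ractor support; check its README for the current API.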
00:25:29.360
there were things I didn't get to
00:25:30.480
because I think all our heads would
00:25:31.799
explode if I tried to do that but uh
00:25:34.000
basically there's an MN thread
00:25:35.919
initiative in Ruby 3.3 its threads
00:25:38.760
essentially backed by reactor um which I
00:25:40.960
think is a really cool initiative and
00:25:42.440
and it is actually available to use
00:25:44.279
under a flag um so you could try it out
00:25:46.919
today solq which is a server you've
00:25:49.399
probably heard of it's multi-threaded
00:25:50.760
and multi-process job server which is
00:25:52.440
really nice um and then just you know
00:25:55.240
ideas for a simplified concurrency
00:25:56.760
feature but maybe that could be future
00:25:58.679
talk of some
00:25:59.760
kind uh and also I have I do have some
Also, I do have some stickers of all the Ruby concurrency mascots if you're interested. I don't have a ton, and I have more fibers than others; I don't know if that's some sort of metaphor for how much cheaper fibers are. But I do have some stickers if you're interested in that. You can find me at jpcamara.com, or on Bluesky at jpcamara.com. Thank you.