00:00:02.929
All right, hello folks.
00:00:08.599
Excellent! Oh, and sorry, I will try to speak up. Audiences tell me that I can sometimes talk too fast.
00:00:15.570
So feel free to yell out if I'm going too fast. You could say, "slow down." Well, it's wonderful to be in Fukuoka.
00:00:21.240
My family is enjoying it very much. I'm amazed that I can fly to all kinds of places around the world, and somehow there are already Ruby programmers there.
00:00:34.290
So thank you! This is great. Again, if I'm speaking too fast, let me know.
00:00:40.770
If I'm not loud enough, tell me. If you've got a methodology question, you know we're all here because we care about performance.
00:00:51.570
You folks all know that if I measure performance the wrong way, the results are no good. Methodology questions? Speak up and let me know.
00:00:57.480
It's kind of dark in here, so I can't always see your hands, so just ask. The worst that happens is I say, "Oh, that's on a later slide," or, "Grab me later, and we'll talk about it."
00:01:10.770
Earlier this week in Utah and yesterday, Matz talked about Ruby 3x3 and how we're doing. He discussed performance, and overall it sounded like he thought performance was pretty good.
00:01:23.640
We're not there yet, but pretty good. Specifically, he talked about how Ruby 2.6 is 280% the speed of 2.0 in some of the benchmarks.
00:01:37.500
For a performance talk, I guess that means we can all go home, right? Except it doesn't help with Rails yet. It may make it better, but it doesn't help a lot with Rails.
00:01:44.579
Rails is actually slower with JIT than without. I don't know if you caught Takashi Kokubun's talk; it's recorded and very good, but basically, JIT is not going to help us with Rails anytime soon.
00:02:02.280
So I'm going to look at using Rails to measure Ruby's speed. This is partly because that's what I do: I wrote and run Rails Ruby Bench.
00:02:15.390
I do a lot of that, and it's also because Rails is important code that is slow right now. A lot of people use it, and it's worth looking at.
00:02:27.630
It's also a good example of code that's hard to optimize. The JIT folks have their work cut out for them there.
00:02:35.670
The way I've been doing this is with a benchmark called Rails Ruby Bench. We'll talk a little bit about what I found there. You may have seen blog posts that I wrote with lots of graphs; I do that often.
00:03:00.140
But the short version is that if you measure with old Discourse and an old Ruby, because that's what works together, you can see there was a bit of a speed improvement.
00:03:32.610
With newer Discourse and a newer Ruby, there was another bit of speed improvement. If you multiply those together, you can get that for real-world large Rails applications.
00:03:48.220
Ruby 2.6, as it was released, is about 172% of the speed of 2.0. So compared to the very beginning of the 2.x series, we've gotten about 72% faster than we were back then.
00:04:07.560
We can just multiply the two rates, but it's actually hard for a big real-world Rails app to compare across all of them because compatibility across that range of Rubies is nearly impossible.
00:04:21.210
But I spent a couple of years working on it, and you could call this what I found during those couple of years. So how do I measure that?
00:04:43.040
Because at a performance talk, you should always be asking, "How do you measure that?" How you measure changes the results completely. A number of folks here I recognize already know this: Rails Ruby Bench is highly concurrent.
00:05:27.610
So on an EC2 dedicated m4.xlarge instance, which is a fairly big virtual machine, it uses a highly concurrent workload: ten processes and sixty threads.
00:05:43.000
It uses a simulated real-world workload, meaning it's using forum software to reproducibly generate a bunch of HTTP requests pretending to be users doing forum things.
00:05:55.000
It uses Discourse, which many of you may have heard of. It's real-world forum software and is basically one of the largest, most representative real-world Rails apps I could easily get at that actually does its work in Rails, as opposed to something like rubygems.org, which is extremely busy but S3 does a lot of the work.
00:06:24.500
Our Rails Ruby Bench is designed for throughput, not latency. This is basically its fastest configuration for throughput.
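[Editor's note: a Puma configuration in that spirit might look like the sketch below. This is not the actual file from the benchmark repo; the six-threads-per-worker split is just one way to get ten processes and sixty threads.]

```ruby
# config/puma.rb -- illustrative sketch, not the benchmark's actual config.
# Ten worker processes with a fixed pool of six threads each gives the
# "ten processes, sixty threads" shape described in the talk.
workers 10
threads 6, 6                  # min and max threads per worker
preload_app!                  # load the app once, then fork workers
port ENV.fetch("PORT", 9292)
```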
00:06:42.229
So, okay, that means we can say Ruby 2.6 is 172% of the speed of 2.0. That's great, and it only took about as many slides as you've seen so far, so hopefully that was worth four minutes of your time.
00:07:04.030
If you didn't already know it: all those times are mixed together. It runs a lot of different routes and different pieces, and uses the database and Redis.
00:07:16.890
Which part is slow? That's not really what a giant, complicated benchmark is good at. In fact, benchmarks come in very different sizes. I tend to call them micro versus macro benchmarks.
00:07:46.979
Rails Ruby Bench is very large compared to benchmarks generally. The reason you care about whether it's large or small is that small benchmarks are specific.
00:08:05.249
They're good at answering a question like, "For this one operation, how much slower or faster is it and why?" Whereas a very large benchmark that has a lot of different things mixed together can be very representative. It can feel a lot like a real-world question.
00:08:23.440
It can feel like asking how fast a big real-world Rails app is. But it's not great at answering the question of which exact part is slower or faster and why.
00:08:38.760
Additionally, it has the usual problems with high concurrency, like garbage collection happening at the same time, database operations, and whether the database is already busy.
00:09:17.460
It's hard to tease apart. For a smaller benchmark to answer more of those specific questions, what should that look like? I've spent a fair bit of time on Rails Ruby Bench, and I'm now working on a much simpler benchmark.
00:09:56.430
I'm just calling it the Rails Simpler Bench (RSB) if nobody has a wonderful creative name for it. It's going to be a lot less representative. It's not an answer to the question of what a big real-world Rails app does.
00:10:25.640
However, it's going to be a lot more specific. For small, I thought starting with a simple hello world route and something that just returns a static string was a very good place to start.
00:10:53.859
If I'm going to be specific, I need to check each layer in turn. I can make it as small as I can while using Rails. I'm starting with single process, single thread concurrency.
00:11:23.750
There will be a little bit more branching out from that later, but if you want to get simple and see what it does before you measure everything else against that, that's how you start small.
00:11:40.740
It's also good for profiling. I built it and you'll see that later on. But the first thing you could investigate here is Rails overhead: how slow is Rails in the first place if you're not doing anything else with it?
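[Editor's note: a hello-world route of the kind being measured might look like this. The route and controller names are an illustration, not necessarily what RSB uses.]

```ruby
# Illustrative only: a static-string Rails endpoint with no database work.

# config/routes.rb
Rails.application.routes.draw do
  get "/static", to: "static#index"
end

# app/controllers/static_controller.rb
class StaticController < ApplicationController
  def index
    render plain: "Hello, world!"
  end
end
```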
00:12:05.520
So I ran a very simple load test on what's the throughput using Rails on a static route, single process, single thread from Ruby 2.0 through Ruby 2.6.
00:12:31.700
If you look at those numbers, it goes from about 761 requests per second to about 1016 requests per second.
00:12:49.960
That's about a 25 to 30 percent jump, which is not bad. That's okay. Again, talking methodology because that’s what I do: this is a single unloaded Amazon m4.2xlarge EC2 instance.
00:13:10.440
So it's dedicated, not running anything else. All the networking is taking place locally rather than going over Amazon's networking.
00:13:27.650
It's very hard to get that predictable speed, so you could think of this as if you've got one CPU sitting and churning along. This is how many Rails requests it can handle for a second.
00:13:43.720
I also did a 15-second warmup before starting the benchmark. This is about a two-minute benchmark: the load tester I'm using takes a duration rather than a request count.
00:14:04.680
For the slowest Ruby, I got about 916,000 HTTP requests, and for the fastest one, it's about 1.2 million HTTP requests.
00:14:23.960
For the graphs you see here, it's going to be a varying number of HTTP requests in the same time, rather than the way I usually do it, which is a fixed number of requests in a varying time.
00:14:39.230
So OK, that graph was okay, but it didn't look a lot like the first graph. Anybody here with a great memory for graphs notice a big difference there?
00:14:50.740
Those two versions were half the improvement all by themselves, and this doesn't look like that.
00:15:04.040
So that's interesting. I measured differently and got these very different results. Some of the answer is that we're not just measuring the Ruby framework code.
00:15:18.320
We are also measuring the application server, and it would be nice to say, 'Oh, we're just measuring Ruby,' but it turns out that measuring Ruby is surprisingly hard once you're doing any real-world tasks.
00:15:35.820
We can cut out Redis, we can cut out external caches, we can cut out the database, but somehow there's always some more non-Ruby code hiding in there somewhere.
00:16:05.780
Puma is very well-tuned. It also spends a lot of its time not doing Ruby stuff, but we can account for that. The way we account for that is I've been timing a Hello World Rails server.
00:16:25.240
I can time a Hello World Rack server, which should essentially capture all the non-Rails stuff, and Puma should account for pretty much all of that time.
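[Editor's note: a Rack hello-world is small enough to show in full. A minimal sketch, assuming nothing beyond the Rack calling convention, is just an object with a `#call` method.]

```ruby
# A minimal Rack app: any object responding to #call(env) and returning
# a [status, headers, body] triple. This is the kind of endpoint that
# captures server overhead without any Rails framework code.
app = lambda do |env|
  [200, { "Content-Type" => "text/plain" }, ["Hello, world!"]]
end

# Simulating one request the way a Rack-compatible server would:
status, headers, body = app.call({})
```

In a real run you would put this in a `config.ru`, serve it with Puma, and point the load tester at it.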
00:16:43.480
You'll notice that the number of iterations per second is a lot higher because Rack is a lot faster than Rails.
00:16:59.600
But timing that barely speeds up at all; it gets a tiny bit faster, from about ten and a half thousand iterations per second to about twelve thousand, but that barely gets faster at all.
00:17:28.180
Again, that's because Puma is very well-tuned and spends a lot of its time in C libraries. For those wondering, that's part of the reason some people are careful about using Rails.
00:17:55.190
Rails can be a lot slower, and they have an API mode to reduce overhead, but overall that's what you're paying for Rails.
00:18:21.860
We can treat that Rack number as the overhead: how much of it is Puma time, how much of it is Rack time? By subtracting those microseconds from the Rails time, we can basically say, here is what Rails itself costs.
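[Editor's note: a back-of-envelope version of that subtraction, using the rough throughput figures from this talk, roughly 1,016 requests/second for Rails and roughly 12,000 for Rack on the fastest Ruby shown.]

```ruby
# Convert throughput to time per request, then subtract the Rack-only
# time (server plus Rack overhead) from the Rails time to estimate what
# Rails itself costs per request. Numbers are rough figures from the talk.
USEC_PER_SEC = 1_000_000.0

rails_rps = 1016.0
rack_rps  = 12_000.0

rails_usec = USEC_PER_SEC / rails_rps   # roughly 984 microseconds per request
rack_usec  = USEC_PER_SEC / rack_rps    # roughly 83 microseconds per request

rails_only_cost = rails_usec - rack_usec  # roughly 900 microseconds of Rails time
```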
00:18:42.340
For each version of Ruby, here's how much that cost is speeding up. You will notice that even though that's Ruby code, it's not getting as much faster as it did previously.
00:19:06.080
Ruby 2.6 gains only about 30% rather than seventy plus percent. When I measure differently, I get these very different results.
00:19:20.070
Some of the answer, again, is that we're measuring the application server along with the Ruby framework code.
00:19:58.740
So, with the GIL present, what you're really seeing here is a very different set of concurrency results than with Rails Ruby Bench.
00:20:25.660
This depends on your workload; what I'm showing you here is a Hello World app with no database access, no Redis, and very little use of C extensions.
00:20:45.940
So it's about the amount of context switching. Rails Ruby Bench has an enormous amount of non-Ruby stuff going on; it switches back and forth.
00:21:17.830
A Hello World app is almost entirely just Ruby work.
00:21:39.440
So part of what we've done is find a kind of range. For this Hello World app, the right amount of threading is about one thread, scaling up to about eight processes with Puma.
00:21:53.460
The right number of threads is about four, up through the far side of the spectrum: Rails Ruby Bench, again on a large instance, uses ten processes and sixty threads.
00:22:11.550
Nate Berkopec, who's here in the audience and does a lot of work on Rails performance for many companies, has a rule of thumb for big Rails apps: about five threads per process.
00:22:31.890
When you get as large as that, it's a very large configuration. So what you have there is kind of a range of how much threading will help you in C Ruby.
00:22:56.790
That's with the GIL, in all of these current C Ruby circumstances. So we've got that range.
00:23:14.790
I'll make one more potentially interesting observation here: those are all graphs you've already seen, placed next to each other.
00:23:40.920
For the range of Rubies from 2.0 on the left to 2.6 on the right, those graphs all do pretty much the same thing.
00:23:56.630
For any of those graphs, you can change the number of processes, change the number of threads—you're going to get about the same speed-up.
00:24:11.600
When we talk about concurrency, a lot of people say, "Can we make big fixes to how threading works or big fixes to how processes work?" To some extent, you can't.
00:24:27.780
There are changes that can help, but you might be wondering whether Ruby 2.6 has made changes to threading or processes that make a big difference in performance.
00:24:42.140
I got pretty much the same kind of graph for Rails Ruby Bench as far as I can tell. Big Rails apps, small Rails apps, or small Hello World type apps: mostly the answer is no.
00:25:00.780
There's an absolute speed-up on the operations, but 2.6 isn't a lot better at threading or multi-process than 2.0 was. They were very similar in that way.
00:25:38.780
So I've been looking at where we are right now. This is what processes do, this is what threads do, and you might reasonably ask if there's some kind of a win here around the GIL.
00:26:00.770
I'll say that the math here is completely different if you get to the point where you can execute Ruby in more than one thread at once. Right now, C Ruby doesn't do that.
00:26:41.290
Any of these ways to potentially avoid the GIL can, if they work out, do an end run around all of these. If you were to use, say, JRuby.
00:27:02.090
I've got it working for some of my benchmarks; there have been interesting configuration problems just getting everything working there.
00:27:48.790
But its threading model is a lot more compatible with keeping everything in one process, scaling the number of threads, and getting reasonably linear gains.
00:28:15.960
If there's some way we can do an end run around the GIL, then a lot of these graphs that don't move all that dramatically suddenly start to move dramatically.
00:28:40.050
You can suddenly get a lot better results. For a lot of what you're looking at here, fibers are not going to help that much, because these configurations are already saturating the CPU.
00:28:57.970
There are already cases that are doing a pretty good job of using threads and processes to get the advantage by moving processor time around.
00:29:18.780
By and large, these are doing a decent job with a good configuration of saturating the processors. It's not that there's nothing to gain there, but fibers aren't going to save us.
00:29:47.260
The GIL, on the other hand: there are some potentially large benefits there. A lot of the reason I'm giving this talk is that I've now got a better test harness for small cases.
00:30:03.140
Those of you who have tried to use Rails Ruby Bench, first off I'm so sorry, but second, you're very aware that it's hard to configure.
00:30:24.070
If you're not using the method that I use to build an AMI on Amazon EC2, it's honestly very hard to put it all together.
00:30:45.460
I'm honored that you've been willing to treat me as someone who knows what I'm talking about, but it would be even better if more of you could tell me when I'm doing it wrong.
00:31:06.250
So I'll give you a quick methodology overview and then a quick demo of the code. We're doing pretty well for time; luckily, I talked quickly.
00:31:23.990
Of course, I plan to do a lot more with this. If it works the same way as Rails Ruby Bench, I'll keep doing interesting things, and maybe that'll convince the rest of you.
00:31:43.370
If you want to go and try it with Rails 5 and up, I'd love to see it. I'd love to help you with it.
00:32:05.320
So let's talk methodology quickly, because that's a big part of what I do. Some of these things that Rails Ruby Bench did well, I carried forward.
00:32:29.890
Some of them were things Rails Ruby Bench did poorly, and I fixed those issues. Dedicated EC2 instances—it's all localhost networking.
00:32:48.090
If I'm using an EC2 instance, I can't trust the speed of the networking. I've switched to a lower overhead load tester.
00:33:09.430
Those of you looking for a load tester: I'll say wrk in general works very well, despite an interesting bug we found with JRuby.
00:33:26.390
I've been keeping an eye on it recently, but in general, it's a very good load tester, as the folks at Passenger would tell you.
00:33:41.790
Multiple consecutive batches in random order—those of you who do much benchmarking know that sometimes if you run only one process doing everything, you get a large weird correlation.
00:33:56.550
So it makes sense to run a certain number of requests and switch batches. What I'm looking for here is consistency.
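[Editor's note: the batch-shuffling idea can be sketched like this. The names and structure are illustrative and not taken from the RSB code.]

```ruby
# Run several batches per Ruby version, in shuffled order, so that slow
# drift on the machine (thermal, caching, background work) doesn't
# correlate with any one particular Ruby. Illustrative sketch only.
rubies = ["2.0.0-p0", "2.3.8", "2.6.0"]
batches_per_ruby = 3

schedule = (rubies * batches_per_ruby).shuffle

results = Hash.new { |h, k| h[k] = [] }
schedule.each do |ruby|
  # A real runner would shell out to the load tester under this Ruby;
  # here we just record that the batch ran.
  results[ruby] << :batch_complete
end
```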
00:34:15.360
I have stripped out as much of the database, Redis, and other sources of variation as reasonably possible.
00:34:33.230
I'll carefully log all the errors and the machine configuration. I'm using Rails 4.2 because that's compatible across my range of Ruby versions.
00:34:50.930
It would be really easy to change that if you wanted to play with a slightly different app. My code base is simple hello world stuff.
00:35:14.920
I've stripped out as much of the database and Redis usage that can cause variation as I can. So now let's take a look at what this looks like.
00:35:34.230
So this is what you get if you check out the repository. The things that you are most interested in are in the runners.
00:35:54.890
Just in case you're not memorizing everything here, everybody in this room is memorizing this presentation, right? You'll remember every moment.
00:36:10.670
But just to clarify, the README is pretty good at explaining why and how to use it.
00:36:28.960
And if you're curious for the significantly more complicated version, this will show you passing all those environment variables.
00:36:45.010
For a number of runs, which Ruby to run under, the duration for benchmarking, the duration for warmups—whether to debug or benchmark in Rails or Rack.
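[Editor's note: a hypothetical sketch of that kind of environment-variable-driven configuration follows. None of these variable names are from the actual RSB repository; they just show the shape.]

```ruby
# Hypothetical ENV-driven benchmark configuration; variable names are
# invented for illustration and are not the real RSB settings.
config = {
  runs:      Integer(ENV.fetch("RSB_NUM_RUNS", "1")),      # how many batches
  ruby:      ENV.fetch("RSB_RUBY_VERSION", RUBY_VERSION),  # which Ruby to run
  duration:  Integer(ENV.fetch("RSB_DURATION", "120")),    # benchmark seconds
  warmup:    Integer(ENV.fetch("RSB_WARMUP", "15")),       # warmup seconds
  framework: ENV.fetch("RSB_FRAMEWORK", "rails")           # "rails" or "rack"
}
```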
00:37:02.120
If you want a second app, you might need to change over and run it twice. But again, the Rails test app is not doing anything complicated.
00:37:17.260
It runs a real server; if you change the Rails version, it would pick that up. So, great.
00:37:32.280
And now we have something very clear.
00:37:39.160
If it sounds less complicated than most of what you've seen, that's because this is the simple runner that runs in your current Ruby.
00:37:53.280
The more complicated runner requires a number of environment variables for a number of runs.
00:38:07.390
It sets up a server, so in your case for instance with Rails 5, you'd need to make a second app and change over or just run it twice.
00:38:22.300
So the Rails test app is simple; it runs a real server in whatever environment it finds. If you change the Rails version, it would pick that up.
00:38:36.630
With that done, we've got about 10 minutes left according to that clock, and I've said everything I have to say. So, are there any questions?
00:39:06.520
Yes, I see a hand up; I have not yet tried Falcon. I saw the presentation, as you probably did, and it immediately got me curious.
00:39:17.510
I'm definitely curious about it and will check it out. It supports processes, threads, and fibers, but I don't know if it supports a hybrid process-thread model.
00:39:32.950
I'm going to check it out, though. Haven't done it yet!