Six Years of Ruby Performance: A History

Ruby keeps getting faster. And people keep asking, "but how fast is it for Rails?" Rails makes a great way to measure Ruby's speed, and how Ruby has changed version-by-version. Let's look at six years of performance graphs for apps big and small. How fast is 2.6.0? With JIT? How close is Ruby 3x3?

RubyKaigi 2019 https://rubykaigi.org/2019/presentations/codefolio.html#apr19

00:00:02.929 All right, hello folks.
00:00:08.599 Excellent! Oh, and sorry, I will try to speak up. I'm told that I can sometimes talk too fast.
00:00:15.570 So feel free to yell out if I'm going too fast. You could say, "slow down." Well, it's wonderful to be in Fukuoka.
00:00:21.240 My family is enjoying it very much. I'm amazed that I can fly to all kinds of places around the world, and somehow there are already Ruby programmers there.
00:00:34.290 So thank you! This is great. Again, if I'm speaking too fast, let me know.
00:00:40.770 If I'm not loud enough, tell me. If you've got a methodology question, you know we're all here because we care about performance.
00:00:51.570 You folks all know that if I measure performance the wrong way, the results are no good. Methodology questions? Speak up and let me know.
00:00:57.480 It's kind of dark in here, so I can't always see your hands, so just ask. The worst that happens is I say, "Oh, that's on a later slide," or, "Grab me later, and we'll talk about it."
00:01:10.770 Earlier this week in Utah and yesterday, Matz talked about Ruby 3x3 and how we're doing. He discussed performance, and overall it sounded like he thought performance was pretty good.
00:01:23.640 We're not there yet, but pretty good. Specifically, he talked about how Ruby 2.6 is 280% the speed of 2.0 in some of the benchmarks.
00:01:37.500 For a performance talk, I guess that means we can all go home, right? Except it doesn't help with Rails yet. It may make it better, but it doesn't help a lot with Rails.
00:01:44.579 Rails is mostly slower with JIT. I don't know if you caught Takashi Kokubun's talk; it's recorded and very good, but basically, JIT is not going to help us with Rails anytime soon.
00:02:02.280 So I'm going to look at using Rails to measure Ruby's speed. This is partly because that's what I do: I wrote and run Rails Ruby Bench.
00:02:15.390 I do a lot of that, and it's also because Rails is important code that is slow right now. A lot of people use it, and it's worth looking at.
00:02:27.630 It's also a good example of code that's hard to optimize. The JIT folks have their work cut out for them there.
00:02:35.670 The way I've been doing this is with a benchmark called Rails Ruby Bench. We'll talk a little bit about what I found there. You may have seen blog posts that I wrote with lots of graphs; I do that often.
00:03:00.140 But the short version is, if you measure with old Discourse and old Ruby, because that's what works together, you can see there was a bit of a speed improvement.
00:03:32.610 With newer Discourse and a newer Ruby, there was another bit of speed improvement. If you multiply those together, you get a figure for real-world large Rails applications.
00:03:48.220 Ruby 2.6, as it was released, is about 172% of the speed of 2.0. So since the very beginning of the 2.0s, we've gotten about 72% faster.
00:04:07.560 We can just multiply the two rates, but it's actually hard for a big real-world Rails app to compare across all of them because compatibility across that range of Rubies is nearly impossible.
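A minimal sketch of the "multiply the two rates" arithmetic. The two per-range ratios below are placeholders chosen only so the product lands near the quoted 172%; the talk gives just the combined figure, and `chained_speedup` is a hypothetical helper name.

```ruby
# Combine speedups measured over consecutive Ruby version ranges.
# Each ratio is "speed at end of range / speed at start of range".
def chained_speedup(*ratios)
  ratios.reduce(1.0) { |total, ratio| total * ratio }
end

# Placeholder ratios, NOT measurements from the talk.
older_range = 1.30 # older Rubies, measured with old Discourse
newer_range = 1.32 # newer Rubies, measured with newer Discourse

combined = chained_speedup(older_range, newer_range)
puts format("%.0f%% of baseline speed, i.e. %.0f%% faster",
            combined * 100, (combined - 1) * 100)
```

Note the two ways of stating the same result: "172% of the speed" and "72% faster" describe one ratio of 1.72.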
00:04:21.210 But I spent a couple of years working on it. You could call that what I found during that couple of years. How do I measure that?
00:04:43.040 Because at a performance talk, you should always be asking, "How do you measure that?" How you measure changes the results completely. For a number of folks here that I recognize already, I say it's highly concurrent.
00:05:27.610 So on an EC2 dedicated m4.2xlarge instance, which is a fairly big virtual machine, it uses a highly concurrent workload: ten processes and sixty threads.
00:05:43.000 It uses a simulated real-world workload, meaning it's using forum software to reproducibly generate a bunch of HTTP requests pretending to be users doing forum things.
00:05:55.000 It uses Discourse, which many of you may have heard of. It's real-world forum software and is basically one of the largest, most representative real-world Rails apps I could easily get at that actually does its work in Rails, as opposed to something like rubygems.org, which is extremely busy but S3 does a lot of the work.
00:06:24.500 Our Rails Ruby Bench is designed for throughput, not latency. This is basically its fastest configuration for throughput.
00:06:42.229 So, okay, that means we can say it's 172% of the speed, or 72% faster. That's great, but that uses about as many slides as we've seen here, so that was worth four minutes of your time,
00:07:04.030 if you didn't already know it. But all those times are mixed together. It runs a lot of different routes and different pieces, and it uses the database and Redis.
00:07:16.890 Which part is slow? That's not really what a giant, complicated benchmark is good at. In fact, benchmarks come in very different sizes. I tend to call them micro versus macro benchmarks.
00:07:46.979 Rails Ruby Bench is very large compared to benchmarks generally. The reason you care about whether it's large or small is that small benchmarks are specific.
00:08:05.249 They're good at answering a question like, "For this one operation, how much slower or faster is it and why?" Whereas a very large benchmark that has a lot of different things mixed together can be very representative. It can feel a lot like a real-world question.
00:08:23.440 You wonder how fast a big real-world Rails app is, and it can tell you that, but it's not great at answering the question of which exact part is slower or faster and why.
00:08:38.760 Additionally, it has the usual problems with high concurrency, like garbage collection happening at the same time, database operations, and whether the database is already busy.
00:09:17.460 It's hard to tease apart. For a smaller benchmark to answer more of those specific questions, what should that look like? I've spent a fair bit of time on Rails Ruby Bench, and I'm now working on a much simpler benchmark.
00:09:56.430 I'm just calling it the Rails Simpler Bench (RSB) if nobody has a wonderful creative name for it. It's going to be a lot less representative. It's not an answer to the question of what a big real-world Rails app does.
00:10:25.640 However, it's going to be a lot more specific. For small, I thought starting with a simple hello world route and something that just returns a static string was a very good place to start.
00:10:53.859 If I'm going to be specific, I need to check each layer in turn. I can make it as small as I can while using Rails. I'm starting with single process, single thread concurrency.
00:11:23.750 There will be a little bit more branching out from that later, but if you want to get simple and see what it does before you measure everything else against that, that's how you start small.
00:11:40.740 It's also good for profiling. I built it and you'll see that later on. But the first thing you could investigate here is Rails overhead: how slow is Rails in the first place if you're not doing anything else with it?
00:12:05.520 So I ran a very simple load test on what's the throughput using Rails on a static route, single process, single thread from Ruby 2.0 through Ruby 2.6.
00:12:31.700 If you look at those numbers, it goes from about 761 requests per second to about 1016 requests per second.
00:12:49.960 That's about a 25 to 30 percent jump, which is not bad. That's okay. Again, talking methodology because that’s what I do: this is a single unloaded Amazon m4.2xlarge EC2 instance.
00:13:10.440 So it's dedicated, not running anything else. All the networking is taking place locally rather than going over Amazon's networking.
00:13:27.650 It's very hard to get predictable speed otherwise, so you could think of this as one CPU sitting and churning along. This is how many Rails requests it can handle per second.
00:13:43.720 I also did a 15-second warmup before starting the benchmark. This is about a two-minute benchmark. With the load tester I'm using, you tell it a duration, and this is a two-minute benchmark.
00:14:04.680 For the slowest Ruby, I got about 916,000 HTTP requests, and for the fastest one, it's about 1.2 million HTTP requests.
00:14:23.960 For the graphs you see here, it's going to be a varying number of HTTP requests in the same time rather than the way I usually do it, which is a fixed number in a varying time.
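The two measurement styles just described (fixed time with a varying count, versus fixed count with a varying time), plus a discarded warmup, can be sketched in-process as follows. This is an illustration only: the talk uses an external HTTP load tester, and the helper names here are hypothetical.

```ruby
# Monotonic clock, so wall-clock adjustments cannot skew the timing.
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Call the block repeatedly for `seconds`; return how many calls completed.
def run_for(seconds)
  deadline = monotonic_now + seconds
  count = 0
  while monotonic_now < deadline
    yield
    count += 1
  end
  count
end

# Fixed time, varying count: warm up, then report iterations per second.
def fixed_time_throughput(warmup:, duration:, &work)
  run_for(warmup, &work) # warmup iterations are thrown away
  started = monotonic_now
  completed = run_for(duration, &work)
  completed / (monotonic_now - started)
end

# Fixed count, varying time: report seconds taken for `count` iterations.
def fixed_count_seconds(count)
  started = monotonic_now
  count.times { yield }
  monotonic_now - started
end

rate = fixed_time_throughput(warmup: 0.05, duration: 0.2) { "hello".upcase }
puts format("%.0f iterations/second", rate)
```

With a fixed duration, faster configurations simply complete more requests, which is why the request totals differ between Ruby versions.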
00:14:39.230 So OK, that graph was okay, but it didn't look a lot like the first graph. Anybody here with a great memory for graphs notice a big difference there?
00:14:50.740 Those two versions were half the improvement all by themselves, and this doesn't look like that.
00:15:04.040 So that's interesting. I measured differently and got these very different results. Some of the answer is that we're not just measuring the Ruby framework code.
00:15:18.320 We are also measuring the application server, and it would be nice to say, 'Oh, we're just measuring Ruby,' but it turns out that measuring Ruby is surprisingly hard once you're doing any real-world tasks.
00:15:35.820 We can cut out Redis, we can cut out external caches, we can cut out the database, but somehow there's always some more non-Ruby code hiding in there somewhere.
00:16:05.780 Puma is very well-tuned. It also spends a lot of its time not doing Ruby stuff, but we can account for that. The way we account for that is I've been timing a Hello World Rails server.
00:16:25.240 I can time a Hello World Rack server, which should essentially capture all the non-Rails stuff, and Puma should account for pretty much all of that time.
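For readers who have not seen one, a Hello World Rack app is tiny. The sketch below is an assumed minimal example, not the benchmark's actual app; `HELLO_APP` is a hypothetical name.

```ruby
# A Rack app is any object that responds to #call(env) and returns
# a three-element array: [status, headers, body].
HELLO_APP = lambda do |_env|
  [200, { "content-type" => "text/plain" }, ["Hello, World!"]]
end

# In a config.ru file you would hand it to the server with: run HELLO_APP
# Here we call it directly, with an empty env, to show the response shape.
status, headers, body = HELLO_APP.call({})
puts status
puts headers["content-type"]
puts body.join
```

Because there is no routing, middleware stack, or controller layer, nearly all the time per request goes to the application server rather than to framework code.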
00:16:43.480 You'll notice that the number of iterations per second is a lot higher because Rack is a lot faster than Rails.
00:16:59.600 But, you know, that barely speeds up at all; it gets a tiny bit faster, from about ten and a half thousand iterations per second to about twelve thousand.
00:17:28.180 Again, that's because Puma is very well-tuned and spends a lot of its time in C libraries. For those wondering, that's part of the reason they're careful using Rails.
00:17:55.190 Rails can be a lot slower, and they have an API mode to reduce overhead, but overall that's what you're paying for Rails.
00:18:21.860 We can take that overhead from Rack: how much of it is Puma time, how much of it is Rack time? By subtracting the number of microseconds, we can basically say here is what Rails costs.
00:18:42.340 For each version of Ruby, here's how fast that is speeding up. You will notice that even though that's Ruby code, it's not getting as much faster as it did previously.
00:19:06.080 Ruby 2.6 gains only about 30% rather than seventy plus percent. When I measure differently, I get these very different results.
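The subtraction can be sketched as below, using the rounded throughput figures quoted in the talk. Pairing the lower Rack figure with the oldest Ruby and the higher one with the newest is an assumption, and the helper names are hypothetical.

```ruby
# Convert throughput into time per request, in microseconds.
def micros_per_request(requests_per_second)
  1_000_000.0 / requests_per_second
end

# Time attributable to Rails alone: the Rails request time minus the
# Rack-only request time (which covers Puma plus Rack).
def rails_only_micros(rails_rps:, rack_rps:)
  micros_per_request(rails_rps) - micros_per_request(rack_rps)
end

# Rounded figures from the talk; the version pairing is assumed.
oldest = rails_only_micros(rails_rps: 761, rack_rps: 10_500)
newest = rails_only_micros(rails_rps: 1016, rack_rps: 12_000)

puts format("Rails-only cost: %.0f us -> %.0f us per request", oldest, newest)
puts format("Rails-only speedup: %.2fx", oldest / newest)
```

Subtracting times rather than throughputs matters: requests per second do not add, but microseconds per request do.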
00:19:20.070 Some of the answer is that we're measuring the Ruby framework code while also measuring the application server. It would be nice to note that.
00:19:58.740 So, with the GIL present, what you're really seeing here is that this is such a different set of results for concurrency than it was with Rails Ruby Bench.
00:20:25.660 This depends on your workload; what I'm showing you here is a Hello World app with no database access, no Redis, and very little use of C extensions.
00:20:45.940 So it's about the amount of context switching. Rails Ruby Bench has an enormous amount of non-Ruby stuff going on; it switches back and forth.
00:21:17.830 A Hello World app is almost entirely just Ruby work.
00:21:39.440 So part of what we've done is find a kind of range. Normally, the right amount of threading is one thread, but you could go up to about eight processes with Puma.
00:21:53.460 The right number of threads is about four threads up through the far side of the spectrum. Rails Ruby Bench, again for a large instance, uses ten processes and sixty threads.
00:22:11.550 Nate Berkopec, who's here in the audience and does a lot of work on Rails performance for many companies, has a rule of thumb for big Rails apps: about five threads per process.
00:22:31.890 When you get this large, like with six, it's a very large one. So what you have there is kind of a range of how much threading will help you for CRuby.
00:22:56.790 That's with the GIL, in all of these current circumstances, if you're using CRuby. So we've got that range.
00:23:14.790 I'll make one more potentially interesting observation here: those are all graphs you've already seen next to each other.
00:23:40.920 For the range of Rubies from 2.0 on the left to 2.6 on the right, those graphs all do pretty much the same thing.
00:23:56.630 For any of those graphs, you can change the number of processes, change the number of threads—you're going to get about the same speed-up.
00:24:11.600 When we talk about concurrency, a lot of people say, "Can we make big fixes to how threading works or big fixes to how processes work?" To some extent, you can't.
00:24:27.780 There are changes that can help, but if you're wondering whether Ruby 2.6 has made a lot of changes that make a big difference in performance here, mostly the answer is no.
00:24:42.140 I got pretty much the same kind of graph for Rails Ruby Bench as far as I can tell. Big Rails apps, small Rails apps, or small Hello World type apps: it's the same story.
00:25:00.780 There's an absolute speed-up on the operations, but 2.6 isn't a lot better at threading or multi-process than 2.0 was. They were very similar in that way.
00:25:38.780 So I've been looking at where we are right now. This is what processes do, this is what threads do, and you might reasonably ask if there's some kind of a win here with Guilds.
00:26:00.770 I'll say that the math here is completely different if you get to the point where you can execute Ruby in more than one thread at once. Right now, CRuby doesn't do that.
00:26:41.290 Any of these ways to potentially avoid the GIL can, if they work out, do an end run around all of these. You could use, say, JRuby.
00:27:02.090 I've got it working for some of my benchmarks; there have been interesting configuration problems just getting everything working there.
00:27:48.790 But its thread model is a lot more compatible with keeping everything in one process, scaling the number of threads, and getting reasonably linear gains.
00:28:15.960 If there's some way we can do an end run around the GIL, then a lot of these graphs that don't move all that dramatically suddenly start to move dramatically.
00:28:40.050 You can suddenly get a lot better results. For a lot of what you're looking at here, fibers are not going to help that much because they're already saturating the CPU.
00:28:57.970 There are already cases that are doing a pretty good job of using threads and processes to get the advantage by moving processor time around.
00:29:18.780 By and large, these are doing a decent job with a good configuration of saturating the processors. It's not that there's nothing to gain there, but fibers aren't going to save us.
00:29:47.260 Guilds, on the other hand: there are some potentially large benefits. A lot of the reason I'm giving this talk is that I've now got a better test harness for small cases.
00:30:03.140 Those of you who have tried to use Rails Ruby Bench, first off I'm so sorry, but second, you're very aware that it's hard to configure.
00:30:24.070 If you're not using the method that I use to build an AMI on Amazon EC2, it's honestly very hard to put it all together.
00:30:45.460 I'm honored that you've been willing to treat me as someone who knows what I'm talking about, but it would be even better if more of you could tell me when I'm doing it wrong.
00:31:06.250 So I'll give you a quick methodology overview and then a quick demo of the code. We're doing pretty well for time; luckily, I talked quickly.
00:31:23.990 Of course, I plan to do a lot more with this. If it works the same way as Rails Ruby Bench, I'll keep doing interesting things, and maybe that'll convince the rest of you.
00:31:43.370 If you want to go and try it with Rails 5 and up, I'd love to see it. I'd love to help you with it.
00:32:05.320 So let's talk methodology quickly, because that's a big part of what I do. Some of these things that Rails Ruby Bench did well, I carried forward.
00:32:29.890 Some of them were things Rails Ruby Bench did poorly, and I fixed those issues. Dedicated EC2 instances—it's all localhost networking.
00:32:48.090 If I'm using an EC2 instance, I can't trust the speed of the networking. I've switched to a lower overhead load tester.
00:33:09.430 Those of you looking for a load tester: I'll say wrk in general works very well, despite an interesting bug we found with JRuby
00:33:26.390 and keep-alive recently. But in general, it's a very good load tester, as the folks at Passenger would tell you.
00:33:41.790 Multiple consecutive batches in random order—those of you who do much benchmarking know that sometimes if you run only one process doing everything, you get a large weird correlation.
00:33:56.550 So it makes sense to run a certain number of requests and switch batches. What I'm looking for here is consistency.
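One way to sketch "multiple batches in random order" is to build every (configuration, batch) pair up front and shuffle the list, so no configuration always runs first or always runs right after the same neighbor. The version strings and the helper name below are hypothetical.

```ruby
# Build a randomized run plan: every configuration gets the same number
# of batches, and the batches are interleaved in random order.
# Passing a seed makes the order reproducible for later inspection.
def shuffled_batches(configs, batches_per_config:, seed: nil)
  rng = seed ? Random.new(seed) : Random.new
  plan = configs.flat_map do |config|
    Array.new(batches_per_config) { |i| { config: config, batch: i } }
  end
  plan.shuffle(random: rng)
end

# Hypothetical list of Rubies to compare.
rubies = %w[2.0.0 2.3.8 2.6.0]
shuffled_batches(rubies, batches_per_config: 3, seed: 42).each do |run|
  puts "ruby #{run[:config]}, batch #{run[:batch]}"
end
```

Interleaving like this spreads any slow drift in the machine (caches, background activity) across all configurations instead of letting it land on one.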
00:34:15.360 So what I want to do is to go for consistency. I have scraped out as much of the database, Redis, and other things that can cause variation as reasonably possible.
00:34:33.230 I'll carefully log all the errors and the machine configuration. I'm using Rails 4.2 because that's compatible across my range of Ruby versions.
00:34:50.930 It would be really easy to change that if you wanted to play with a slightly different app. My code base is simple hello world stuff.
00:35:14.920 I've scraped out as much of the database and Redis that can cause variation as I can, so now let's take a look at what this looks like.
00:35:34.230 We're running. This is what you get if you check out the repository. The things that you are most interested in are in the runners.
00:35:54.890 Just in case you're not memorizing everything here, everybody in this room is memorizing this presentation, right? You'll remember every moment.
00:36:10.670 But just to clarify, the README is pretty good at explaining why and how to use it.
00:36:28.960 And if you're curious about the significantly more complicated version, this will show you passing all those environment variables:
00:36:45.010 the number of runs, which Ruby to run under, the duration for benchmarking, the duration for warmups, whether to debug, and whether to benchmark Rails or Rack.
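As a rough illustration of a runner driven by environment variables, here is a sketch that reads the settings just listed. Every `EXAMPLE_*` variable name and the `runner_settings` helper are hypothetical; check the repository's README for the real names. The 15-second warmup and 120-second duration defaults mirror the figures mentioned earlier in the talk.

```ruby
# Read benchmark settings from the environment, with defaults.
# All EXAMPLE_* names are made up for illustration only.
def runner_settings(env = ENV)
  framework = env.fetch("EXAMPLE_FRAMEWORK", "rails")
  unless %w[rails rack].include?(framework)
    raise ArgumentError, "EXAMPLE_FRAMEWORK must be rails or rack"
  end

  {
    runs: Integer(env.fetch("EXAMPLE_NUM_RUNS", "10")),
    ruby: env.fetch("EXAMPLE_RUBY", RUBY_VERSION),
    warmup_seconds: Integer(env.fetch("EXAMPLE_WARMUP_SECONDS", "15")),
    benchmark_seconds: Integer(env.fetch("EXAMPLE_BENCHMARK_SECONDS", "120")),
    framework: framework,
    debug: env.fetch("EXAMPLE_DEBUG", "0") == "1"
  }
end

p runner_settings
```

Using `Integer(...)` rather than `to_i` means a typo such as `EXAMPLE_NUM_RUNS=ten` fails loudly instead of silently becoming zero runs.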
00:37:02.120 If you want a second app, you'd need to change over or run it twice. But again, the Rails test app is not doing anything complicated.
00:37:17.260 It runs a real server; if you change the Rails version, it would do that. So, great.
00:37:32.280 And now we have something very clear.
00:37:39.160 If it sounds less complicated than most of what you've seen, that's because this is the simple runner that runs in your current Ruby.
00:37:53.280 The more complicated runner requires a number of environment variables for a number of runs.
00:38:07.390 It sets up a server, so in your case for instance with Rails 5, you'd need to make a second app and change over or just run it twice.
00:38:22.300 So the Rails test app is simple; it runs a real server in whatever is found. So if you change the Rails version, it would do that.
00:38:36.630 With that done, we've got about 10 minutes left according to that clock, and I've said everything I have to say. So, are there any questions?
00:39:06.520 Yes, I see a hand up; I have not yet tried Falcon. I saw the presentation you probably did, and that immediately got me curious.
00:39:17.510 I'm definitely curious about it and will check it out. It supports processes, threads, and fibers, but I don't know if it supports a hybrid process-thread model.
00:39:32.950 I'm going to check it out, though. Haven't done it yet!