RubyKaigi 2017

How Close is Ruby 3x3 For Production Web Apps?

RubyKaigi 2017
http://rubykaigi.org/2017/presentations/codefolio.html

How much faster is current Ruby than Ruby 2.0 for a production web application? Let's look at a mixed workload in the real commercial Discourse forum software. We'll see how the speed has changed overall. We'll also examine slow requests, garbage collection, warmup iterations and more. You'll see how to use this benchmark to test your own Ruby optimizations.

00:00:07.639 So hello, I'm Noah Gibbs, a Ruby Fellow with AppFolio. I'm honored to be at RubyKaigi for the first time. Thank you very much.
00:00:12.410 I'm specifically here to talk about numbers and benchmarks, and how fast current Ruby is compared to Ruby 2.0. Ruby 3 tries to be three times faster than Ruby 2.0, which is why I'm using it as the baseline.
00:00:25.439 However, 'faster' is a difficult word. It can mean many things, so let's talk about what 'faster' means and what we should measure.
00:00:40.350 My benchmark is designed to measure web framework performance, because I believe real-world benchmarks are crucial and give valuable insights. What do I mean by 'real world'? Last year at RubyKaigi, Matthew gave a talk titled 'How Are We Going to Measure Three Times Faster?', in which he listed many types of benchmarks that were still missing, including ones focused on web frameworks.
00:01:05.070 He noted that we should have some type of real-world Ruby on Rails benchmark, which can encompass many different applications. Therefore, I decided to utilize an application called Discourse.
00:01:17.220 Discourse is forum software: open source, built with Ruby on Rails, and used by many people. So I thought it would serve as an excellent real-world web application for this benchmark. I did extensive measurements on how long it takes to serve requests and how long it takes to serve the first request, both of which are among Ruby's strengths, and I believe we must preserve them.
00:01:45.299 Matthew also talked a lot about warm-up iterations. I'll explain what those are, why they are important, and how they matter in great detail. If I start to talk too fast when I get nervous, anyone here is welcome to yell at me to slow down.
00:02:21.210 I haven't had much luck with the conference Wi-Fi, so I will play this video locally. When I mention Discourse, some of you might know exactly what I mean, while others may not. This is a recorded example of using Discourse, captured on my development laptop with the seed data that I used for my benchmark.
00:02:43.800 When it gets to topics or making comments, this video shows Discourse with the same data that you will see speed results for. This video illustrates what it would look like in a web browser instead of running it in a benchmark.
00:03:17.040 I based my benchmark on simulating user behavior. Many people are curious about how fast a Rails app performs when many users utilize it. To investigate this, I chose to execute repeatable random requests.
00:03:55.060 This means generating actions according to a random seed to send to Discourse. I checked Discourse's log files to understand the structure of the URLs and wrote a script to generate them randomly from a seed number. The same actions happen in the same order, which is crucial to validate a complex workload repeatedly.
00:04:47.120 Here's an example random seed from the random number generator with the first few requests it generates. The first request might be for /sessions, which is a login, the second for /posts, which creates a new post, and the third for getting a specific topic. The benchmark generates seed data—similar to what you saw in the video—and then generates requests against that.
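In code, that kind of seeded, repeatable request generation might look roughly like the sketch below. This is a minimal illustration only; the route list and helper names are made up, not taken from the benchmark's actual code.

    # Illustrative sketch: a seeded, repeatable stream of requests.
    # The routes and weights here are invented for the example.
    ROUTES = [
      ->(rng) { [:post, "/session"] },               # log in
      ->(rng) { [:post, "/posts"] },                 # create a new post
      ->(rng) { [:get,  "/t/#{rng.rand(1..50)}"] },  # read a specific topic
    ]

    def generate_requests(seed, count)
      rng = Random.new(seed)   # same seed => same request sequence every run
      count.times.map { ROUTES.sample(random: rng).call(rng) }
    end

    p generate_requests(12345, 3)   # the first few requests for this seed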
00:05:51.410 Of course, this is just a small fraction of the requests; many requests are made during the benchmark. Another critical aspect of this real-world benchmark was configuring it the way many companies set up their Ruby on Rails servers. I have also done this with Sinatra and other frameworks for small startups. The load testing is different, but the basic configuration with Puma is quite similar.
00:06:40.210 I used the Puma web app server, which employs both threads and processes. I set it up after extensive testing on a reasonably large virtual machine on AWS: an m4.2xlarge dedicated instance. For those unfamiliar, a dedicated instance guarantees no noisy neighbors, that is, no other customers' VMs using up resources on the same hardware.
00:07:15.860 The Puma setup of ten processes and sixty threads was settled on after running many different configurations and recording a large amount of data. The right number of threads or processes would change for a larger or smaller virtual machine, but for this machine that balance fully utilized all of its cores.
00:08:00.050 For those who have done similar configurations, you'll know that because of Ruby's global interpreter lock it's common to run many more threads than processes, a ratio like six to one. Typically it takes four, five, or six threads per process, since with the global interpreter lock Ruby threads often spend much of their time waiting on that lock rather than executing Ruby code.
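For reference, a hedged sketch of what a Puma configuration in that spirit might look like. The numbers simply mirror those mentioned above; the benchmark's real config file may differ in detail.

    # Hedged sketch of a puma.rb: 10 worker processes, 6 threads each = 60 threads.
    workers 10        # separate processes, to work around the global interpreter lock
    threads 6, 6      # minimum and maximum threads per process
    preload_app!      # load the app once, share memory between workers via copy-on-write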
00:08:59.110 One significant trade-off of this real-world benchmark is that load testing, the database, Puma, and the Ruby server all run on the same virtual machine. In a production application, you'd separate these into multiple VMs. Those familiar with AWS will understand this design choice—it helps mitigate noisy neighbor problems and addresses one of the significant sources of delay when using Amazon AWS for testing: network delay.
00:09:30.380 By co-locating everything on the same machine, I’m removing a significant uncertainty source of delay that isn't directly related to Ruby. However, the downside is that running everything on a single virtual machine isn't a typical production deployment.
00:10:03.470 That was a very quick overview of my methodology. I've shared it with some Ruby experts and invested significant time testing it. During the Q&A section later, I'd love to discuss any questions you might have about the methods, the trade-offs I'd like to explore, and I'm always open to new recommendations. I frequently adjust little elements of my approach to improve the benchmarks.
00:10:46.050 One more piece of background information before we delve into the graphs. Warm-ups, as I mentioned—some of you might be familiar with this—are additional untimed requests made with your server before you begin timing the requests.
00:11:07.090 If I want to time 3,000 HTTP requests with my benchmark, I might run 1,000 warm-ups first. The rationale behind this approach is that many software pieces require time to reach full speed due to data caching, method caching, and several intriguing dynamics within the memory system.
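As a minimal sketch of what that looks like in practice: warm-up requests run untimed, then the timed portion is measured. The URL, the counts, and the timing approach below are placeholders, not the benchmark's actual code.

    # Illustrative only: untimed warm-ups before the timed run, so caching
    # effects don't skew the measurement. URL and counts are placeholders.
    require 'net/http'

    make_request = -> { Net::HTTP.get(URI("http://localhost:3000/")) }

    1_000.times { make_request.call }                  # warm-ups: not timed

    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    3_000.times { make_request.call }                  # timed requests
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    puts format("%.1f requests/sec", 3_000 / elapsed)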
00:11:43.970 Some implementations take a while to warm up before they reach full speed. In this graph, Ruby MRI has very good startup speed, shown as a nearly flat red line. JRuby, in green, takes a little longer to reach full speed but eventually levels off, while TruffleRuby is notably slower to reach full speed because of the more complex optimization work it does.
00:12:16.300 That's a fair illustration of warm-up iterations, particularly for just-in-time (JIT) compilation systems, which compile code based on how many times it's executed and so take a while to reach optimal performance.
00:12:53.920 So when I talk about warm-up iterations, I mean this additional untimed work before the genuine measurements begin. On to another essential topic: Discourse is great for a real-world benchmark because it's real software used by genuine users, but that comes with a complication.
00:13:23.899 Discourse doesn't maintain compatibility with every Ruby version. I used an older version of Discourse for Ruby 2.0 and newer versions for Ruby 2.3 and 2.4, making sure each Ruby was tested with a version of Discourse that supports it.
00:14:00.590 This means I need both old Ruby with old Discourse and new Ruby with new Discourse, with an overlap between them, so I can assess how the speed has changed over time.
00:14:32.960 Thank you for your patience; I suspect some of you might find all this background boring! So, what can we learn from this real-world benchmark? The primary question we aim to answer is, if we are handling numerous web requests, how much faster is the new Ruby compared to the old Ruby?
00:15:19.470 I made sure to put that in the title of my talk— do you like graphs? I love them!
00:15:58.970 Each group of five bars shows, for one Ruby and Discourse version, how long load threads took to complete their one hundred HTTP requests: the leftmost bar is the fastest thread's time, then the 10th-percentile time, the median, the 90th percentile, and finally the slowest thread's time. Each five-bar group is a miniature distribution of that configuration's performance.
00:16:55.730 I find this a useful way to see the shape of the performance. Times are listed in seconds, so shorter is better, and each bar shows how long one load thread took to process its full set of requests. Thirty load threads make 3,000 requests in total, one hundred per thread, so there's plenty of room for variation between the fastest and slowest threads.
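As a rough sketch of how thirty per-thread completion times could be reduced to those five summary bars; the helper and the placeholder data are mine, not the benchmark's.

    # Rough sketch: reduce 30 per-thread completion times (seconds) to five
    # summary values: min, 10th percentile, median, 90th percentile, max.
    def percentile(sorted, pct)
      sorted[((sorted.length - 1) * pct).round]
    end

    thread_times = Array.new(30) { 25.0 + rand * 10 }   # placeholder data
    sorted = thread_times.sort

    [0.0, 0.10, 0.50, 0.90, 1.0].each do |pct|
      puts format("%3d%%: %.2fs", pct * 100, percentile(sorted, pct))
    end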
00:17:32.660 Looking at Discourse 1.5 on the left, the difference between Ruby 2.0 and 2.3.4 works out to roughly a 45% to 50% performance improvement. On the right, with Discourse 1.8, the jump from Ruby 2.3.4 to 2.4.1 is around 3% to 5%. That's decent! Ruby is already quite fast at that point, so what you mostly want to see is stability, and indeed performance did not decrease.
00:18:30.160 However, Discourse itself slowed down slightly between versions, by around 20%. That isn't a Ruby regression; it reflects changes in the application being benchmarked. If you only looked at the Ruby 2.4.1 numbers, it might mistakenly appear to be a performance regression, but in reality Ruby is improving. In summary, for processing HTTP requests, Ruby 2.4.1 runs at just over 150% of the speed of Ruby 2.0.
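To see roughly where "just over 150%" comes from, you can combine the two gains quoted above; the figures below are illustrative midpoints, not exact measurements.

    # Back-of-envelope combination of the two reported gains (illustrative):
    gain_ruby_20_to_234  = 1.47   # ~45-50% faster, Ruby 2.0 -> 2.3.4
    gain_ruby_234_to_241 = 1.04   # ~3-5% faster,  Ruby 2.3.4 -> 2.4.1
    puts gain_ruby_20_to_234 * gain_ruby_234_to_241   # ~1.53, just over 150%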
00:19:29.080 Is this as spectacular as the 300% increase? No, but it's still impressive, especially for a mature language.
00:20:09.100 Now, let's discuss warm-up iterations. Other Ruby implementations clearly need them, but does Matz's Ruby benefit from warm-up too? It's an interesting question, and it also affects how I design the benchmark, which currently uses 100 warm-up iterations, though that could change with future experimentation.
00:20:50.500 In this graph, each group of five bars means something different: the leftmost bar is zero warm-up iterations, and the following bars increase through ten, one hundred, one thousand, and five thousand warm-up iterations.
00:21:35.430 The group labeled Discourse 1.5 with Ruby 2.0.0 shows throughput, so you can see how much faster Ruby 2.0.0 gets with these various amounts of warm-up.
00:22:17.500 The earlier graph showed MRI's startup as an almost flat line, but here you can clearly see that Ruby 2.0 does gain speed with warm-up iterations. More warm-up leads to better throughput, so yes, warm-up iterations do benefit Matz's Ruby.
00:23:15.890 With 5,000 warm-up iterations, requests per second increase by about 5%. That's not a huge improvement, but it's noticeable.
00:24:31.680 You might wonder whether this warm-up effect comes from generational garbage collection or from JIT; it's well known that JIT implementations need warm-up because of how they work. But the results suggest otherwise: the slope of the warm-up benefit stayed consistent between Ruby 2.0 and Ruby 2.4.1.
00:25:05.330 In short, generational garbage collection does not appear to have changed MRI's warm-up behavior; it stays consistent across versions.
00:25:45.960 Now that we've seen how long it takes to process requests in a concurrent environment, let’s talk about the time to serve the first request. You can observe Discourse version 1.5 on the left and version 1.8 on the right, with a downward trajectory showing how time decreases—indicative of improved performance.
00:26:21.170 Going from Ruby 2.0 to Ruby 2.3 there's a 23% speed improvement, and from Ruby 2.3 to 2.4 another 10% on top of that. Comparing Discourse 1.5 to 1.8, we also see an improvement of about 15%. So Discourse isn't just maintaining its speed; it's continuously improving!
00:27:00.540 There's a gem called Bootsnap that has become popular with Rails applications. It caches precompiled Ruby code, similar to Koichi's yomikomu gem, and caches load-path lookups, which significantly speeds up booting a Rails application.
00:27:41.920 Without Bootsnap, Discourse still boots about 15% faster on average than it used to, and adding Bootsnap provides another boost, up to about a 50% overall improvement.
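For reference, this is the standard way the Bootsnap gem is enabled in a Rails application, as documented by the gem itself; it isn't specific to Discourse or to this benchmark.

    # Gemfile
    gem 'bootsnap', require: false

    # config/boot.rb, after bundler/setup is required
    require 'bootsnap/setup'   # caches compiled code and load-path lookups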
00:28:15.750 The time to serve that first request keeps decreasing, which means your edit-and-reload cycle while debugging is noticeably shorter; Ruby 2.4.1 shows about a 30% improvement over Ruby 2.0 for time to first request.
00:28:56.020 Now I've described the benchmark and presented the data it produces, but I haven't yet covered how you can use it to answer your own questions. I'm just one person, and you have no particular reason to trust my numbers.
00:29:30.420 It would be fantastic if everyone here thought, “I don’t believe him” and went home to run your own benchmarks to gather your own data regarding performance.
00:30:01.550 Unfortunately, I had issues connecting to the conference Wi-Fi, so I’ll present a local video of how this works. You can use the same images I have with Discourse installed, along with specific Ruby versions, and run an AWS instance to replicate my benchmark and results.
00:30:52.050 The terminal in the video is hard to read, but the commands are included in the slides, which I will post online, and they're also in the README for my benchmark.
00:31:42.510 If you take a look, you'll see the command line is no more sophisticated than running a script once for each Ruby configuration. The script runs a loop that executes the benchmark many times with the appropriate number of iterations.
00:32:21.620 The script writes JSON files into your working directory, which you can download from your EC2 instance with your usual file-transfer tools. Afterwards you'll want to terminate the VM to avoid unexpected charges.
00:32:50.540 The code is publicly available, and this URL is the same one from when Matz highlighted the benchmark at RubyWorld. It has evolved since then, but it keeps the same core structure. I'm constantly running the benchmark across various Ruby configurations, so please look it over and send me any critical feedback.
00:33:44.390 I also have all the JSON data behind this presentation and the scripts that generate the graphs. If that would help you, please reach out; I'd like to make sure you can produce your own JSON files by running the benchmark repeatedly.
00:34:18.850 If you'd rather not spin up an AWS instance, the benchmark is just a simple Ruby script driving a Rails app, a copy of Discourse, so you can run it locally, create the database, and seed it using the options I've shared.
00:34:57.020 Instructions are available in the README file, along with commands displayed in this slide deck which I'll provide shortly. When running from the command line, you can indicate the count of warm-up iterations with a specific flag.
00:35:45.830 You may notice it takes approximately 6 to 9 seconds per iteration when measuring a single request. That matters because you'll typically run well over 100 iterations, which can make for long test runs.
00:36:33.240 You can also configure the total number of request samples, meaning the regular timed iterations, as well as the Puma thread and process counts, to suit whatever experiment you have in mind.
00:37:15.390 As for future work, I plan to keep publishing regularly. I've been posting pieces of this data on my blog, and much of what you've seen in this presentation comes from updates shared there.
00:37:51.300 Further analysis will dig into why Ruby has gotten faster, and you'll find it isn't solely due to garbage collection. When it comes to questions about where Ruby's performance differs in practice, I believe everyone here has something to contribute.
00:38:28.800 Performance can vary with the operating system and database you use, which makes it increasingly hard to pinpoint exactly what got slower or faster.
00:39:08.000 For tracking performance changes over time, the best approach is to keep conditions consistent; using Packer scripts lets me pin specific versions of everything while varying only what I'm testing.
00:39:38.020 I build images from Packer JSON templates using the same Ubuntu version every time, and I also need overlapping configurations when measuring across Discourse versions.
00:40:00.840 To analyze substantial performance regressions and identify root causes effectively, I suggest maintaining consistency and measuring thoroughly rather than varying conditions too significantly.
00:40:26.050 Those are the fundamentals of how I track performance changes over time. Now I'd like to turn it over to your questions. Thank you all for your attention.
00:40:48.520 Yes! Could you tell me which Rails versions Discourse 1.5 and 1.8 use? If those versions differ significantly, a comparison between Ruby 2.0 and the latest or forthcoming Ruby versions should take the specific Rails versions into account.
00:41:19.230 The Rails versions are very close: 4.2.6 for Discourse 1.5 and 4.2.7 for 1.8. So the comparison is valid, as those Rails versions have negligible performance differences. Discourse is currently in the process of updating to Rails 5, and bigger differences will show up once that shift happens.
00:42:01.200 That means I probably won't be able to keep referring back to Ruby 2.0 for every result in the future, since those framework upgrades change the comparison.
00:42:39.130 Ultimately, establishing a solid baseline now helps ensure the comparisons can withstand those changes.
00:42:53.340 Thank you for your time and attention! Does anyone else have questions?
00:43:02.000 Yes! Over here!
00:43:05.260 Regarding the AWS benchmarks, how long did you run these for?
00:43:09.720 I typically execute each test case with 3,000 iterations including warm-ups.
00:43:15.370 In total, I'll run about 20 complete batches of full runs, which comes to roughly two days of AWS compute time.
00:43:26.200 A follow-up question on EC2: doesn't sharing hardware with other virtual machines affect the results?
00:43:40.880 To clarify, using dedicated instances means no other customers' workloads run on the same hardware, so only marginal variance comes from the environment. The main measurements are taken on those isolated instances.
00:43:55.620 I try to minimize the noise that shared environments introduce while making sure every metric gathered stays accurate.
00:44:06.640 With that, I appreciate your indulgence in this session! Let's give Noah a round of applause!
00:44:30.290 Thank you all for joining us today!