Talks

Ruby3x3: How are we going to measure 3x?

http://rubykaigi.org/2016/presentations/MattStudies.html

To hit Ruby 3x3, we must first figure out **what** we're going to measure and **how** we're going to measure it, in order to get what we actually want. I'll cover some standard definitions of benchmarking in dynamic languages, as well as the tradeoffs that must be made when benchmarking. I'll look at some of the possible benchmarks that could be considered for Ruby 3x3, and evaluate what they're good at measuring and what they're less good at measuring, to help the Ruby community decide what the 3x goal will be measured against.

Matthew Gaudet, @MattStudies
A developer at IBM Toronto on the OMR project, Matthew Gaudet is focused on helping to Open Source IBM's JIT technology, with the goal of making it usable by many language implementations.

RubyKaigi 2016

00:00:00.320 Hi everyone. Thanks for having me back at RubyKaigi. I'm a developer at IBM working on the Eclipse OMR project, which provides cross-platform components for building high-performance, reliable language runtimes. We hope to have some news soon, so please follow us on Twitter; I can't say anything just yet. Ruby 3x3 seems like a relatively simple goal: take the performance of Ruby 2.0 and make it three times faster. But when you look at this chart, there's a hidden detail, which is what we actually mean by performance.
00:00:19.439 Over the course of this presentation, I'm going to talk about benchmarking. I'll provide some definitions, share some philosophy, discuss pitfalls, and conclude with thoughts on how we can measure Ruby 3x3. So, let's get started. Benchmarking is kind of the bane of my existence. It's this strange combination of art and science that can drive one a little insane. The problem is that while benchmarks seem scientific and objective, they are filled with judgment calls, and the science itself is quite complex.
00:00:47.360 A benchmark is simply a piece of computer code executed to gather measurements for comparison, because unless you're comparing, there's no point in running the code. There are many things you can benchmark, and not all of them relate to time. For instance, you can benchmark the execution time of different interpreters or of different options for the same interpreter, compare the execution time of algorithms, or even benchmark accuracy for machine learning algorithms. The art of benchmarking turns into a long list of questions and decisions you must navigate, all filled with judgment calls. The first question is: what do you run? I see benchmarking as a continuum: at one end we have micro-benchmarks, at the other end entire applications, with application kernels in the middle. A micro-benchmark is a small piece of code you write, typically intended to exercise one particular part of the system you're testing. Micro-benchmarks are easy to set up and run, and because they target one specific aspect they give you data quickly. However, they tend to exaggerate effects and don't generalize well.
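As a hedged illustration of how small a micro-benchmark can be, here is a minimal sketch using only Ruby's standard `benchmark` library; it exercises exactly one narrow operation (building a string) and nothing else, which is what makes it both quick and hard to generalize:

```ruby
require 'benchmark'

# A micro-benchmark: it stresses one narrow operation (growing a
# string), so it runs quickly and gives data fast, but it tells you
# nothing about whole-application performance.
N = 20_000

Benchmark.bm(12) do |bm|
  bm.report('String#+ ') { s = ''; N.times { s = s + 'x' } }
  bm.report('String#<<') { s = ''; N.times { s << 'x' } }
end
```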
00:01:45.119 On the other end of the spectrum is the full application benchmark. The advantage here is clear: if I show that the benchmark is two times faster, then the full application is twice as fast, because the benchmark encompasses the whole thing. The downside is that a full application is inherently complex: it naturally exhibits variance, small effects can be drowned out by noise, and the whole thing can be slow to run. This leads us to the middle ground of application kernels, which are parts of an application extracted specifically for benchmarking purposes. This typically involves building scaffolding around the application's core so it can be benchmarked independently.
00:03:01.440 The advantage of this approach is that kernels are real-world code sourced from actual applications, which often leads to results that generalize better. However, deciding how much of the application to include versus what to mock out remains a crucial judgment call. When designing benchmarks, numerous pitfalls can arise. One common pitfall to avoid, unless you are deliberately porting benchmarks, is un-Ruby-like code. An old adage states that you can write Fortran in any programming language, and it holds true: you can take benchmarks designed for Fortran and port them to Ruby, but they often end up looking strange and running poorly. Another example of problematic benchmark code is code that never produces garbage. Real Ruby applications rely on the garbage collector, so benchmarks should exercise it too.
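As a small sketch of that last point, the snippet below contrasts a loop that allocates nothing with one that churns through short-lived strings; only the second gives the garbage collector any work. It relies on MRI's `GC.stat` (key names assume a reasonably recent MRI), and the loop sizes are arbitrary:

```ruby
require 'benchmark'

# Count how many GC cycles a block triggers, along with its wall-clock time.
def gc_runs
  before = GC.stat[:count]
  time = Benchmark.realtime { yield }
  [GC.stat[:count] - before, time]
end

no_garbage = gc_runs { sum = 0; 1_000_000.times { |i| sum += i } }  # no allocation
garbage    = gc_runs { 1_000_000.times { |i| "object #{i}" } }      # one String per iteration

puts "no allocation: #{no_garbage[0]} GC runs in #{no_garbage[1].round(3)}s"
puts "allocating:    #{garbage[0]} GC runs in #{garbage[1].round(3)}s"
```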
00:04:22.880 Exceptions should be accounted for in the same way. A point that's often overlooked is that the input data used in your benchmark is crucial; preparing the right input data can prevent misleading conclusions. For instance, consider an MP3 compressor benchmarked on a file consisting only of silence. That's an odd input: most MP3s aren't just silence, they contain structure that the compressor normally exploits. Odd input like that skews the code paths you execute during the test and fundamentally affects how well your results generalize.
00:05:01.440 Next, when you're benchmarking, you need to decide what metrics to measure. I am particularly focused on benchmarks relevant to Ruby 3x3, so we should consider things like time, throughput, and latency. The most obvious metric (spoiler alert) is wall clock time, which is a direct measurement against an actual clock, independent of the process being timed. For example, running a command under the `time` command gives you wall clock time. This is distinct from CPU time, which measures how much CPU the process actually used. An illustrative example: sleeping for one second takes about a second of wall clock time but almost no CPU time. Each can be a helpful measure in certain situations, and each can also be misleading.
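As a hedged sketch of the distinction, Ruby's standard `Benchmark` module reports CPU time and wall clock time separately, and the sleep example plays out just as described:

```ruby
require 'benchmark'

# Benchmark.measure reports CPU time (utime/stime) and wall-clock time
# (real) separately. Sleeping burns almost no CPU but a full second of
# wall-clock time.
tms = Benchmark.measure { sleep 1 }
puts format('user: %.2fs  system: %.2fs  real: %.2fs',
            tms.utime, tms.stime, tms.real)
# => user: 0.00s  system: 0.00s  real: 1.00s (approximately)
```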
00:06:13.600 Throughput is a common metric in web applications, representing the count of operations performed within a unit of time. In contrast, latency measures the time it takes for a response to occur after an action is initiated. For instance, in a web server, this refers to how long it takes from receiving a request to serving a response. After measuring various metrics, the next question is what to report. While the raw measurements seem like the obvious choice, we want to offer comparisons, particularly speedup, since speedup is exactly what Ruby 3x3 aims for. Speedup is simply the ratio computed by comparing a baseline measurement against an experimental one.
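To make the speedup ratio concrete, here is a tiny sketch with made-up numbers (neither figure comes from the talk):

```ruby
# Speedup is just the baseline measurement divided by the experimental one.
baseline_seconds   = 30.0  # e.g. a workload on the baseline interpreter
experiment_seconds = 10.0  # the same workload on the build under test

speedup = baseline_seconds / experiment_seconds
puts "speedup: #{speedup}x"  # => speedup: 3.0x
```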
00:07:42.880 Let's delve briefly into the science of benchmarking. While studying for my master's, I spent considerable time reading academic papers and learned how easily speedup numbers can be used to mislead. An important takeaway is that whoever controls the baseline has significant power over the perceived speedup. I often encountered papers boasting linear speedup as the number of threads increased, supported by stunning graphs that suggested high performance. Yet, upon running their code or scrutinizing numbers buried in the paper, I often found that the sequential baseline was far slower than a well-written sequential program, underscoring that linear speedup, while mathematically sound, may not translate into a real-world improvement.
00:09:08.959 The distinction between relative and absolute speedup is crucial. Relative speedup refers to improvement compared to the same program's single-threaded execution. While this can be useful in some contexts, absolute speedup, measured relative to the fastest sequential version, is usually what matters in practice. Additionally, how you measure affects what you ultimately measure: two runs of the same benchmark under different conditions can tell you very different things depending on how you set them up. Warm-up is a good example; the warm-up period can introduce significant variance in the results, and the question of when an application has finished warming up is a genuinely hard one. Even without a clear definition, simply recognizing that warm-up happens lets you be careful about how you run your benchmarks.
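A hedged illustration with invented numbers shows how far apart the two figures can be:

```ruby
# Hypothetical numbers to illustrate relative vs. absolute speedup.
fast_sequential  = 10.0  # best available sequential implementation (seconds)
slow_sequential  = 50.0  # the paper's own single-threaded baseline (seconds)
parallel_8_cores = 12.5  # the parallelized version on 8 cores (seconds)

relative = slow_sequential / parallel_8_cores  # => 4.0 ("4x scaling!")
absolute = fast_sequential / parallel_8_cores  # => 0.8 (actually slower)
puts "relative: #{relative}x, absolute: #{absolute}x"
```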
00:10:53.760 Another critical aspect is run-to-run variance: identical benchmark runs do not yield identical execution times. I might run a benchmark five times and get five different results. You also have to decide how to treat warm-up; if you choose not to control for it, you may never observe the peak performance the system is capable of. To handle run-to-run variance properly, run benchmarks multiple times and present the results statistically, for example with confidence intervals. Minimizing wild variation in execution time is the ideal, but some variance is inherent to benchmarking.
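Here is a minimal sketch of that advice: repeat a measurement and report a mean with a rough 95% confidence interval (normal approximation) rather than a single number. The workload and run count are arbitrary placeholders:

```ruby
require 'benchmark'

# Run the given block several times and return [mean, half-width of a
# rough 95% confidence interval] for wall-clock time in seconds.
def measure_with_ci(runs = 10)
  times = Array.new(runs) { Benchmark.realtime { yield } }
  mean  = times.inject(:+) / runs
  var   = times.map { |t| (t - mean)**2 }.inject(:+) / (runs - 1)
  ci    = 1.96 * Math.sqrt(var / runs)
  [mean, ci]
end

mean, ci = measure_with_ci(10) { 200_000.times { |i| i.to_s } }
puts format('%.4fs ± %.4fs (95%% CI)', mean, ci)
```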
00:12:05.760 When benchmarking a language with garbage collection like Ruby, you also need to consider the application's heap size. Using JRuby, it's easy to adjust heap size, and experiments show that reducing heap resources can impact performance. For example, an application might speed up as its JIT compiler activates, while a smaller heap leads to more frequent garbage collections, which ultimately degrades performance. This interplay between benchmarking and garbage collection became evident last year when benchmarking the Ruby OMR preview. By replacing the heap in MRI with our own and altering how objects were stored, we found significant variations in garbage collection behavior—complicating fair comparisons between versions. Even minor changes, such as the memory available on machines running the same workload, can lead to drastically different outcomes.
00:13:57.720 In conducting my master's thesis benchmarking, I often faced unexpected issues. One time, I celebrated a tenfold increase in performance only to discover I had miscalculated my input size and was accidentally running the application on one-tenth of the original data. Having a robust harness for benchmarking helps prevent these errors and keeps results reproducible. Another unexpected incident occurred while benchmarking on a laptop: a mid-day power interruption caused drastic performance variations as the machine switched between power modes. Hardware and environmental factors like these can produce perplexing performance spikes and further complicate the benchmarking process.
00:15:03.600 There are perennial hazards in this field: software features activating in the middle of a benchmark, system updates causing unexpected performance changes, screen savers interrupting computations. The authors of the paper "Virtual Machine Warmup Blows Hot and Cold" went to great lengths to mitigate these pitfalls. They disabled various hardware features and built their own benchmark runner, Krun, designed to make conditions as consistent as possible before each run. By taking all these precautions, they could accurately measure very small performance changes. My hope is that Ruby 3x3's gains will be large enough to show up clearly under practical testing conditions, with individual performance gains compounding into meaningful advances.
00:16:58.720 The journey from Ruby 2 to Ruby 3 will not be linear; it will include periods of growth interspersed with phases of slower development. Accurate performance measurement over time will be crucial in understanding how Ruby evolves. Ultimately, benchmarks drive change—the metrics we measure dictate the performance outcomes developers will pursue. There is also the inherent risk that if we neglect to measure something, it may not improve as intended. Performance enhancements can sometimes resemble squeezing a water balloon; pressure applied to one part may cause swelling in another. Consequently, it’s vital to measure associated metrics for a comprehensive view of the trade-offs involved in any development process.
00:18:35.120 As someone rooted in optimization via just-in-time compilation, I see speed as a trade-off between how fast an application starts and its peak performance after warm-up. Memory footprint has to be considered alongside that: JIT-compiled code occupies memory that could otherwise hold application data. Which trade-offs you make depends heavily on the platform; the decisions you make for a beefy server with ample resources differ from the optimizations you pursue on a limited system like a Raspberry Pi. We also see benchmarks age: over time they become less effective at measuring meaningful performance changes.
00:20:02.400 As Ruby evolves, its idiomatic practices change, so benchmarks need to adapt alongside the language. It's important to embrace new idioms and patterns when assessing performance, and benchmarks can even guide developers as they move to newer language features. For Ruby 3x3 specifically, I suggest we select nine application kernels that capture where we want CRuby to improve. The chosen benchmarks should reflect where we envision Ruby being three times faster, because selecting too many benchmarks dilutes effort across too many fronts.
00:21:47.920 Choosing nine benchmarks seems about right, since too many would scatter focus; as a VM developer, having 150 benchmarks would make it very hard to set priorities. I won't be the one to decide which nine benchmarks we ultimately choose, but I do have some thoughts on potential candidates. There should be some CPU-intensive benchmarks, since the goal of Ruby 3x3 centers on improving performance for CPU-bound applications; the Optcarrot benchmark, neural-net tests, and Monte Carlo tree searches could all serve this purpose. We also need some memory-intensive benchmarks to gauge the impact of garbage collection.
00:23:31.760 Startup time also needs monitoring, since it's an area where MRI currently has an edge over other implementations. We should pay close attention to how startup performance evolves, ensuring that speed improvements elsewhere do not disproportionately sacrifice startup. We also need benchmarks for web application frameworks, though I'm probably not the right person to design them; there are capable experts within the Ruby community who can propose methodologies for fast, reliable web-framework benchmarks. Those benchmarks should drive change in CRuby while keeping warm-up time for these applications manageable.
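As a rough sketch of watching startup time, one could simply time how long a fresh interpreter takes to run an empty program; process-spawn overhead is included, so treat the absolute number with care, and the run count here is arbitrary:

```ruby
require 'benchmark'

# Time how long it takes to launch a fresh ruby process that does
# nothing, averaged over several runs.
runs = 20
total = Benchmark.realtime { runs.times { system('ruby', '-e', '') } }
puts format('average startup: %.3fs', total / runs)
```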
00:25:12.480 I propose a standard performance harness for the RubyGems ecosystem, so that developers can ship performance benchmarks with their gems. By writing performance tests alongside their ordinary tests, gem authors would make performance easy to track. VM developers could then sample a variety of gems and aggregate the results into meaningful statements about the ecosystem, for example announcing an average performance increase across gems, and end users who run into performance problems would have a reproducible way to demonstrate them.
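No such standard harness exists yet, so the following is only a hypothetical sketch of what a gem's performance file might look like; it leans on the existing benchmark-ips gem, and the file path and the use of the standard library's JSON as a stand-in for the gem under test are both assumptions:

```ruby
# perf/parse_benchmark.rb (hypothetical layout; no standard harness exists yet)
require 'benchmark/ips'
require 'json' # stand-in for the gem being benchmarked

Benchmark.ips do |x|
  x.report('parse small document') do
    JSON.parse('{"name":"ruby","version":3}')
  end
  x.report('generate small document') do
    JSON.generate('name' => 'ruby', 'version' => 3)
  end
  x.compare!
end
```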
00:26:46.800 Thank you for your attention. If you are interested in the science of benchmarking and the myriad ways it can go awry, I encourage you to explore the Evaluate Collaboratory at the provided URL. This initiative is dedicated to enhancing reproducibility within systems evaluation in computer science, offering resources for effective benchmarking practices and cautionary tales alike.
00:28:11.840 Are there any questions? We have about five minutes.
00:29:18.720 If not, I can ask questions. I promise they won't be puns.
00:29:27.520 You mentioned Koichi's talk about guilds. What if we introduced that feature before Ruby 3.0 and executed everything in parallel? Does that count? I mean, if Matz considers Ruby 3x faster on a four-core machine because of guilds, that would be a valid outcome. From my standpoint as a JIT developer, what matters is having a clearly defined goal that resonates with the community.
00:30:24.480 For planning purposes, I find it interesting to compare performance across different Ruby implementations, such as JRuby and MRI. You can see discrepancies depending on the platform; Kevin found that JRuby was 20 times faster than MRI on Linux but only 2 times faster on macOS. That kind of variability shows how the testing environment itself can make performance comparisons inconsistent.
00:31:13.920 Let's consider the relationship between performance and power usage. Intel provides tools for measuring power consumption while running benchmarks, but what are effective methods to correlate performance improvements and their environmental impact? Academic research on this topic shows some potential. I recall projects examining Firefox's performance relative to power consumption. They measured wattage during each commit, which allowed them to assess the broad implications of changes in performance across significant user bases.
00:32:00.160 Looking at environmental impacts and cost analyses for performance regressions could provide valuable insights into how improvements affect both system performance and the ecological footprint.
00:32:44.480 Finally, I have one last question: what’s the difference between turbo boost and the turbo button? As I remember in my youth, the turbo button often slowed things down.
00:33:26.720 Thank you, Matthew.