Talks

Benchmarking your code, inside and out

Benchmarking is an important component of writing applications, gems, Ruby implementations, and everything in between. There is never a perfect formula for how to measure the performance of your code, as the requirements vary from codebase to codebase. Elastic has an entire team dedicated to measuring the performance of Elasticsearch, and the clients team has worked with them to build a common benchmarking framework for itself. This talk will explore how the Elasticsearch Ruby Client is benchmarked and highlight key elements that are important for any benchmarking framework.

RubyKaigi 2019

00:00:00.000 Hello, my name is Emily Stolfo, and I'm going to give a talk about benchmarking the Elasticsearch clients. It's called 'Benchmarking Your Code, Inside and Out.' Before I start the talk, though, I'd like to mention a few things.
00:00:05.310 The first thing is that over the last day and a half, I've attended a number of talks here about optimizing the Ruby interpreter itself. These talks have covered topics like performance, static profiling, and the garbage collector. However, this talk is taking it one level up; it's about the code that you write in Ruby and benchmarking that code.
00:00:18.060 I want to tailor this talk a little bit more to you, so by show of hands, can you tell me if you write a language other than Ruby in your job? It can be in addition to Ruby or instead of it.
00:00:30.330 That's great! This talk is about benchmarking concepts and best practices, and I think you'll find that you can apply these concepts to your Ruby code as well as the code that you write in other languages.
00:00:43.980 A little bit about me before I get into the talk itself: I'm a Ruby engineer at Elastic. By show of hands, who uses Elastic? Okay, a lot of people do, and that's cool! I work on the clients team at Elastic. The company has a number of different products, many of which are based around Elasticsearch.
00:01:11.369 The clients team develops the libraries, gems, and so on that interface with the Elasticsearch server and handle the HTTP requests sent to it.
00:01:18.270 In addition, I'm an adjunct faculty member at Columbia University, where I've taught courses on web development in Ruby on Rails and, most recently, a class on NoSQL databases. In that class, I provided a survey of different NoSQL databases and talked about various use cases and how to choose the right database for a specific use case.
00:01:37.799 I've been a resident of Berlin, Germany, for five years, but I'm originally from New York. This connection allows me to teach classes at Columbia and stay with family or friends when needed. I also maintain a list of gems related to Elasticsearch. Half of them focus on using pure Ruby with Elasticsearch, while the other half deal with integrating Elasticsearch into a Rails application.
00:01:54.549 One more thing: I previously worked at MongoDB for six years before starting at Elastic last June. I want to mention my experience at MongoDB because it is relevant to this talk. I learned a lot about benchmarking, especially with the MongoDB driver, which has served as inspiration for my current work at Elastic.
00:02:07.260 So, why do we benchmark? I think we, as humans, are inherently interested in speed and measurement. In particular, we like to compete and compare ourselves to one another, and that natural tendency extends to measuring our code and tracking it over time.
00:02:19.780 When I think about benchmarking and speed, the fable of 'The Tortoise and the Hare' comes to mind. The tortoise is one of the slowest animals, while the hare is one of the fastest. They decide to have a race, and the hare, confident in its speed, takes a nap halfway through the race.
00:02:34.770 However, the tortoise persistently marches to the finish line and actually wins because the hare doesn't wake up in time. I mention this because speed is relative, and benchmarking is more than just determining how fast you can go. It's not just about how quickly you reach the finish line; it's also about how you get there.
00:02:45.780 Now, regarding the specific case of benchmarking the Elasticsearch clients, we do it primarily to detect regression. As we make changes to the code, we want to ensure we're not introducing any performance issues. We also need to check for changes in Elasticsearch that could impact the clients.
00:03:06.790 For instance, Elasticsearch might change the response format, necessitating additional parsing on the client's side, which could introduce performance problems. Additionally, we routinely review pull requests from the community for these clients to ensure they don't introduce any performance regressions.
00:03:18.910 We also confirm assumptions about performance. We make assumptions regarding speed, but those can be tested scientifically through measurement.
00:03:31.440 Moreover, the Ruby clients allow the comparison of different pluggable HTTP adapters. I've built a benchmarking framework to compare different HTTP clients from Ruby and, while it's not intended to compete with other clients, it allows us to analyze performance thoughtfully.
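A minimal sketch of comparing pluggable adapters under the same workload. The two lambdas are hypothetical stand-ins, not real HTTP adapters; an actual comparison would send the same request through the client configured with each adapter. Only Ruby's standard `Benchmark` module is used.

```ruby
require 'benchmark'

# Hypothetical stand-ins for two HTTP adapters (assumption for illustration).
adapters = {
  adapter_a: -> { 1_000.times.reduce(:+) },
  adapter_b: -> { 2_000.times.reduce(:+) }
}

# Run the same operation `repetitions` times per adapter and return the
# total wall-clock time in seconds for each one.
def compare_adapters(adapters, repetitions: 100)
  adapters.transform_values do |operation|
    Benchmark.realtime { repetitions.times { operation.call } }
  end
end

timings = compare_adapters(adapters)
timings.each { |name, seconds| puts format('%-10s %.6fs', name, seconds) }
```

Keeping the workload identical and swapping only the adapter means any difference in the timings can be attributed to the adapter rather than to the test itself.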
00:03:51.930 When we talk about benchmarking, I generally categorize it into macro and micro benchmarking. A friend of mine in Berlin who is very passionate about benchmarking also refers to application benchmarking. This talk, in particular, focuses on macro benchmarking, which pertains to entire client-server applications.
00:04:08.000 Micro benchmarking, on the other hand, focuses on individual components. My benchmarking framework at Elastic has three main ingredients. The first is my experience from MongoDB, the second is how the Elasticsearch server team benchmarks the server, and the third is the experience gathered by the Elastic clients team. Although I am new, my colleagues have relayed valuable insights into past performance issues to look out for.
00:04:45.420 Starting with MongoDB: it differs from Elasticsearch because it does not use HTTP. Micro benchmarks are crucial there because serialization and deserialization are handled by the drivers, which is not quite the case with Elasticsearch. For the Elasticsearch clients, it is therefore more important to look at macro benchmarks.
00:05:03.160 In MongoDB, there's an open-source specifications repo that outlines driver behavior and benchmarking frameworks. I used that experience as inspiration for how to approach the problem at Elastic. I also implemented a similar framework for the Ruby driver.
00:05:22.680 The remainder of this talk will focus on the other two ingredients: how the Elasticsearch server benchmarks itself and how the clients' benchmarking framework was built. We'll discuss benchmarking methods and best practices that you can apply to any language.
00:05:34.260 Let's start with the Elasticsearch benchmarking. The Elasticsearch server team has a dedicated performance team that spends 100% of their time working on and monitoring the performance of Elasticsearch as changes are made to the codebase.
00:05:46.780 They developed a framework called Rally, which is an open-source macro benchmarking framework specifically for Elasticsearch. It is built around the idea of tracks, which can be understood as scenarios involving data sets that stress different parts of Elasticsearch.
00:06:03.440 Rally tracks put stress on the system, define the cluster configuration, and specify how the metrics are stored, either printed to standard out or indexed somewhere. Additionally, Rally publishes results, which proves valuable to both the community and the Elasticsearch team itself.
00:06:17.320 You can find these Rally tracks in a separate repo, 'rally-tracks', in the Elastic organization. The various datasets help stress Elasticsearch in different ways: some contain nested documents while others include a variety of value types. They're all open source, so you can inspect and use them in your own benchmarks.
00:06:41.500 Recently, I learned that someone on the performance team created a UI that lets users define their own Rally tracks, where previously this had to be done via the command line. I found this very exciting and got permission to share the news in this presentation.
00:07:07.200 During discussions with the Elasticsearch performance team about building a framework for the clients, I learned that their benchmark results are published and that they run on defined infrastructure following defined best practices. I was particularly interested in the fact that the infrastructure and best practices are published as well.
00:07:21.330 The Elasticsearch performance team has a Kibana application where different dashboards visualize benchmarks and stress tests implemented on the system. This application provides a plethora of collected data over several years, which is fascinating and helpful.
00:07:34.680 As for the infrastructure, they primarily use Hetzner as a hosting provider. While you can utilize any provider, it is important to note that they have dedicated machines for testing Elasticsearch, rather than using dynamic workers.
00:07:45.740 We learned that having fixed machines is crucial, especially when network latency is a variable in benchmarks. We will cover best practices in more detail later, but my colleague Daniel Mitterdorfer has written a blog post called 'Seven Tips for Better Elasticsearch Benchmarks'.
00:08:06.300 He presented at a conference last October in Sweden, which I found very educational. We will explore these tips further in the context of the client benchmarking framework.
00:08:24.080 Now, let’s examine the client benchmarking framework and its structure. The initial requirement was that it had to be published and language agnostic, allowing benchmarking for diverse clients like PHP, Python, .NET, Node.js, Go, Ruby, and more.
00:08:37.560 Another requirement was that it needs to be open-sourced, aiming to follow in the footsteps of the Elasticsearch core team. We needed a common results schema, which we will delve into shortly.
00:08:52.890 Furthermore, the framework required defined infrastructure while adhering to best practices.
00:09:05.430 The first step in defining this framework was to establish what the results schema would look like. Initially, I was skeptical about defining the results schema before the tests, but I soon realized that it made sense.
00:09:18.470 By defining the results structure upfront, I could design any test without imposing restrictions. We used something called the Elastic Common Schema, which reached general availability (GA) about a month ago.
00:09:35.470 The Elastic Common Schema is an open-source specification that defines a common set of document fields for data ingested into Elasticsearch. It helps maintain a consistent way to record events throughout the ecosystem.
00:09:52.610 Whenever the benchmarking result schema contains a field that has a corresponding entry in the Elastic Common Schema, we use that field for consistency. This approach makes the events easier to analyze, leading to improved analytics.
00:10:05.910 The benchmarking results, expressed in YAML, have three main top-level keys: event, agent, and server. I know it's a small excerpt, but it highlights the importance of consistent fields and nested documents. We capture statistics as naturally as possible.
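As an illustration of a result document with those three top-level keys, here is a minimal sketch. Only `event`, `agent`, and `server` come from the talk; every nested field name and the `build_result` helper are assumptions made for this example, not the framework's actual schema.

```ruby
require 'yaml'

# Build one benchmark result document. The nested field names are
# illustrative assumptions, not the real results schema.
def build_result(action:, durations_ns:, client_version:, server_version:)
  {
    'event' => {
      'action' => action,
      'duration' => durations_ns.sum, # total measured time in nanoseconds
      'statistics' => {
        'repetitions' => durations_ns.size,
        'mean_ns' => durations_ns.sum / durations_ns.size
      }
    },
    'agent' => { 'name' => 'ruby-client', 'version' => client_version },
    'server' => { 'version' => server_version }
  }
end

result = build_result(
  action: 'index-small-document',
  durations_ns: [1_200, 1_000, 1_100],
  client_version: '0.0.0',  # placeholder version
  server_version: '0.0.0'   # placeholder version
)
puts result.to_yaml
```

Because every client would emit the same nested structure, results from different languages could be indexed into one place and compared with the same queries.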
00:10:21.900 After defining the results schema, we turned our attention to defining the tests. I divided them into different suites: simple, complex, and parallel. Simple operations, like a ping or a document index task, are intended to be quick. The complex suite involves operations like complex queries on nested fields which require more time.
00:10:36.610 The parallel suite is particularly interesting; a client may use whatever threads or concurrency libraries its language offers to complete a given workload as quickly as possible.
00:10:50.150 Because the tests are defined in YAML, a client can either generate its tests dynamically from the YAML definitions or write the benchmarks by hand. One example of a task defined in YAML is indexing a small document several times.
00:11:04.540 I also defined several datasets, including a large document, a small document, and a Stack Overflow dataset. I borrowed the idea of the Stack Overflow dataset from Rally tracks, as its documents are rich and nested, which makes for robust testing.
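Generating tests dynamically from a YAML definition might look like the sketch below. The key names (`suite`, `tasks`, `operation`, `dataset`, `warmup_repetitions`, `repetitions`) and the `build_tasks` helper are hypothetical; the talk does not show the exact format. The operation is a stand-in that records its calls instead of contacting a server.

```ruby
require 'yaml'

# Hypothetical task definition; the key names are assumptions.
definition = <<~YAML
  suite: simple
  tasks:
    - name: index-small-document
      operation: index
      dataset: small_document
      warmup_repetitions: 2
      repetitions: 5
YAML

# Map operation names from the YAML onto callables supplied by the client.
def build_tasks(definition, operations)
  definition.fetch('tasks').map do |task|
    operation = operations.fetch(task.fetch('operation'))
    {
      name: task.fetch('name'),
      warmup: task.fetch('warmup_repetitions', 0),
      repetitions: task.fetch('repetitions'),
      run: -> { operation.call(task.fetch('dataset')) }
    }
  end
end

calls = []
# Stand-in for a real client call: it only records the dataset name.
operations = { 'index' => ->(dataset) { calls << dataset } }
tasks = build_tasks(YAML.safe_load(definition), operations)

tasks.each do |task|
  (task[:warmup] + task[:repetitions]).times { task[:run].call }
  puts "#{task[:name]}: #{task[:repetitions]} measured repetitions"
end
```

Using `fetch` for required keys makes a malformed definition fail loudly instead of silently running a different benchmark.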
00:11:19.730 Establishing best practices was crucial for our benchmarking framework. One core idea is that system setup should resemble production as closely as possible. While testing the client on localhost gives impressive speed, it does not reflect real-world scenarios.
00:11:34.910 Daniel's findings revealed that inconsistent setups could yield varied benchmarking results, as illustrated by an irregular graph showing results from earlier benchmarks.
00:11:48.260 Another point is to warm up your system properly. Initially, I overlooked warming up the tests but later added these steps after learning about best practices. This ensures clients conduct proper warm-ups, as certain languages may require more extensive warm-up phases.
00:12:05.790 It's essential to define warm-up repetitions in each test definition. Another best practice is to model your production workload accurately. A true benchmark captures real use cases rather than pushing the system to extreme limits.
00:12:20.250 It's also vital to test your benchmarking software itself. Daniel mentioned that many people skip this because they assume that precise-looking output numbers must be correct.
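A minimal sketch of separating warm-up repetitions from measured ones. The `measure` helper and its parameter names are assumptions for illustration; the monotonic clock from Ruby's `Process.clock_gettime` is used so that wall-clock adjustments do not distort the durations.

```ruby
# Run the block with unmeasured warm-up repetitions first, then time each
# measured repetition. Returns an array of durations in seconds.
def measure(warmup_repetitions:, repetitions:, &operation)
  warmup_repetitions.times { operation.call }

  Array.new(repetitions) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    operation.call
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

executions = 0
durations = measure(warmup_repetitions: 3, repetitions: 10) { executions += 1 }
puts "executed #{executions} times, measured #{durations.size}"
```

The warm-up runs execute exactly the same code path as the measured runs; only their timings are discarded.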
00:12:35.580 The Elasticsearch performance team found a bug in a load generator leading to inaccurate benchmark results, emphasizing the need for thorough testing.
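One way to test a timing harness is to inject the clock, so the harness can be checked against durations that are known in advance. This is a sketch of that general technique, not the approach used by the Elasticsearch performance team; `time_with` and `fake_clock` are hypothetical helpers.

```ruby
# Time a block using whatever clock is passed in. In production the clock
# would read a monotonic timer; in tests it can be a fake.
def time_with(clock)
  started = clock.call
  yield
  clock.call - started
end

# A fake clock that advances by a fixed step on every reading.
def fake_clock(step)
  now = 0.0
  -> { now += step }
end

elapsed = time_with(fake_clock(0.25)) { :work }
raise "harness bug: expected 0.25, got #{elapsed}" unless elapsed == 0.25
puts 'timing harness reports the expected duration'

# The same harness with a real monotonic clock.
real_clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
real_elapsed = time_with(real_clock) { 1_000.times.reduce(:+) }
puts format('real run took %.6fs', real_elapsed)
```

With the fake clock, the expected result is exact, so a bug such as reading the clock in the wrong place shows up as a failed assertion rather than as a plausible-looking number.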
00:12:53.270 Eliminating accidental bottlenecks is another practice that requires diligence, especially since we run benchmark tests through Jenkins, where the client code and Elasticsearch run on separate workers.
00:13:05.620 Having dynamic workers leads to potential variability in results due to network latency. Hence, I've been vocal about the necessity for fixed machines to ensure stable network latency across benchmark runs.
00:13:21.730 Moreover, using structured processes ensures that measurements are taken one step at a time rather than trying to bundle multiple tests in one run. This approach helps lend precision to the diagnostic effort.
00:13:35.860 One last best practice is to use statistically significant tests. Elasticsearch has a no-op plugin that returns results immediately, eliminating the work Elasticsearch would otherwise do and allowing the client code to be measured directly.
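To judge whether two sets of measurements really differ, it helps to look at their spread and not only at a single number. The sketch below summarizes repeated runs with mean, median, and sample standard deviation. The sample values are made up for illustration, and the final comparison is a rough heuristic, not a formal significance test.

```ruby
# Summarize repeated measurements (needs at least two values).
def summarize(durations)
  n = durations.size
  mean = durations.sum / n.to_f
  variance = durations.sum { |d| (d - mean)**2 } / (n - 1) # sample variance
  sorted = durations.sort
  median = n.odd? ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
  { mean: mean, median: median, stddev: Math.sqrt(variance) }
end

# Made-up durations in milliseconds for two runs of the same task.
baseline = summarize([10.0, 11.0, 9.0, 10.0, 10.0])
candidate = summarize([12.0, 13.0, 11.0, 12.0, 12.0])

difference = candidate[:mean] - baseline[:mean]
noise = [baseline[:stddev], candidate[:stddev]].max

# Rough heuristic only: flag differences well above the observed noise.
if difference.abs > 2 * noise
  puts format('difference of %.2f ms exceeds the noise (%.2f ms)', difference, noise)
else
  puts 'difference is within the noise'
end
```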
00:13:53.680 Thus, any discrepancies in numbers would reflect real differences in code performance. Now, let's dive into the implementation of the Elasticsearch Ruby client benchmarking.
00:14:08.260 The benchmarking code is encapsulated within Rake tasks, allowing execution from the command line. There are tasks for pinging the server and creating indices, among others.
00:14:26.960 These tasks are organized into suites with various provisions for testing, including the ability to execute individual tasks based on parameters.
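A sketch of how benchmark tasks could be grouped into suites with Rake namespaces and accept parameters. The task names, the `BenchmarkTasks` module, and the `repetitions` argument are hypothetical; the real client's Rakefile may be organized differently. The task body only records its runs where a real task would call the client.

```ruby
require 'rake'

module BenchmarkTasks
  extend Rake::DSL

  RUNS = []

  namespace :benchmark do
    namespace :simple do
      desc 'Ping the server a given number of times'
      task :ping, [:repetitions] do |_task, args|
        repetitions = (args[:repetitions] || 10).to_i
        # A real task would call the client here.
        repetitions.times { RUNS << :ping }
      end
    end

    desc 'Run every task in the simple suite'
    task simple: ['simple:ping']
  end
end

# Equivalent to `rake "benchmark:simple:ping[3]"` on the command line.
Rake::Task['benchmark:simple:ping'].invoke('3')
puts "ping ran #{BenchmarkTasks::RUNS.size} times"
```

Namespaces give each suite its own prefix, so a single task or a whole suite can be selected from the command line or from a CI job.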
00:14:43.050 The Jenkins job streamlines the execution of the Rake tasks from the command line. It consists of three parts: uninstalling any Elasticsearch on the machine, executing the Ruby benchmark test, and clearing the Elasticsearch data directory.
00:15:02.030 We ensure that benchmarks start from a clean state every time. The result is a seamless setup, where client code runs on one fixed machine while the Elasticsearch server runs on another, promoting consistent network latency.
00:15:18.690 Lastly, I maintain a dashboard that measures various benchmark runs over time for operations like indexing small and large documents. This acts as a heartbeat for tracking code performance on a commit basis.
00:15:35.110 Looking forward, one major goal is for the other clients to implement similar systems. We want to effectively share the setups through Jenkins to facilitate this, making the process less complicated.
00:15:52.780 We aim to establish comparative dashboards showcasing client performance changes, enabling us to pinpoint performance hotspots in various clients and identify inefficiencies.
00:16:08.470 The parallel suite still requires further definition as it's more subjective, hence we need to establish clear benchmarks and realistic workloads through collaboration.
00:16:24.370 Ultimately, we aspire to publish results similar to the Elasticsearch performance team's dashboard, allowing visibility into client performance improvements over time.
00:16:40.340 Thank you very much for your attention. I welcome any questions, and I'll be around for the next day. If you have inquiries related to Elasticsearch or benchmarking, feel free to ask.
00:17:09.160 You had a slide there where you showcased the YAML-like definitions of your benchmarks; there were three parts involving warm-ups and iterations.
00:17:31.170 How do you decide the duration for warm-up and measurement times? Warm-up is essentially about preparing the tests, while measuring refers to the actual benchmarking.
00:17:46.800 That's a great question! My choices were based on the duration of operations and the iterations needed for reliable results. It's a balance between completing enough executions to smooth out noise while not overloading the system.
00:18:10.160 As other clients implement this framework, I expect adjustments to these parameters will be necessary.
00:18:24.250 Take note that comparing clients is secondary; the primary advantage is tracking individual code performance over time.
00:18:37.780 Thank you for your questions!