Vernier: A next generation profiler for CRuby

00:00:11.200 Hello.

00:00:18.199 My name is John Hawthorn. I'm a Ruby committer, a member of the Rails core team, and a staff engineer at GitHub.

00:00:26.679 Today, I'd like to introduce you to Vernier. Vernier is a new open-source Ruby profiler for Ruby 3.2 and above. It's a tool designed to help you see where your program is spending time, as well as to better understand its performance.

00:00:36.760 To start, I'll give a quick demonstration. Vernier comes with a CLI that allows you to easily measure a command. We are going to measure 'bundle update' with a mostly empty Gemfile, so let's run that and wait for it to finish.

00:00:42.120 Once it finishes, the tool will tell us it recorded a profile and where that file is stored. Next, we need to take that profile to a viewer program to see it. We have this single-page web app hosted at v.r. It's based on the Firefox profiler, and it's basically a light reskin; in fact, the main change we've made is swapping out the word 'JavaScript' for 'Ruby.'

00:01:00.160 That may sound a bit strange, but the Firefox team designed their profiler with the intention of allowing other tools to target it. These projects aren't as different as you might think. They're both C or C++ programs that host a dynamic language, have a JIT compiler, and are multi-threaded. Therefore, looking at the features that the Firefox profiler provides and finding Ruby equivalence has been a good guide for building a solid profiler.

00:01:17.600 Once we drag the file into this viewer, we will be presented with a timeline at the top that shows a chronological representation of what your program is doing at every point throughout its execution. As you hover over it, you will see the stack, so you can pinpoint exactly where the program was, as if an exception had been raised or you'd executed a 'puts' call.

00:01:25.880 The lower part of the graph displays the same information but in a flame graph or stack chart style visualization, where components are layered on top of one another. You can browse this view to get a clear understanding of what your program is doing throughout its execution. However, this view can be quite busy and intimidating, especially on a big screen.

00:01:41.320 So, I'm going to switch over to the call tree view, which provides a summary. This shows a top-style list of which methods took the most time. The first observation it presents us is that 'main' took 100% of our time, which, frankly, we didn't need a profiler to conclude.

00:02:01.160 However, we can switch to a bottom-up view instead of a top-down view, which will show us more interesting statistics. We see that time is being spent requiring, and it’s executing some network operations that make sense for Bundler. However, the top method we are observing here might be a bit surprising: the string.split method. I don't think this is because string.split is slow; it's probably just that we're calling it numerous times with a very large string.

00:02:34.560 We can expand that to see which methods are calling it, and we find two methods inside the Bundler::CompactIndex::ClientCache class. We don’t necessarily need to know what those methods are. By double-clicking on it, we can bring up the source code. This is a feature from the FABR site that runs a proxy, fetching the code from any public gem or local profile viewer gem.

00:03:00.000 In the left-hand margin, we have annotations showing the number of samples collected for each line. You can think of this as a relative measure of performance. Now, we reach the first unique feature of Vernier: we can switch our data source from timings to allocations.

00:03:11.040 In addition to recording the time taken for operations, we've also recorded the number of allocations made. We keep both sets of data in the same profile file because I want these to be a single artifact that represents as much as possible about a program's run. Here, we see that the string.split method, in addition to being the slowest, also made almost a million allocations—definitely allocation-heavy.

00:03:39.000 If we wanted to improve this, we might consider taking those lines that split a string and note that it's only splitting the string into three parts but only taking the first and last. We don't actually need that middle string or the entire array. If we do this with a substring index, it might be faster. Let's see the result: originally, this operation took about 6 seconds, and with the slight modification, it now takes about 5 seconds—100 milliseconds faster.

00:03:55.200 I find performance work to be enjoyable because it allows us to improve Bundler without needing any deep knowledge of how it works. In fact, we've gained insight into what is important for Bundler; clearly, the CompactIndex and these string manipulations are a significant part of its operation.

00:04:01.040 I just want to point out one thing I didn't do: we didn’t use the profiler to measure the improvement, as profilers can introduce overhead and potentially skew the results. It’s vital to always measure without profiling enabled.

00:04:13.520 This is what using Vernier is like today, but this wasn't the original idea I had for this profiler. Back in 2022, when I started working on this, I faced a very different problem. My goal was to discover where a Ruby program’s memory was being utilized.

00:04:35.680 There are some good built-in tools for this. You can use `ObjectSpace.trace_allocation_start`, which instructs Ruby to start tracing where allocations occur. After that, you can load your program and run `ObjectSpace.dump_all`, which creates a heap dump—a huge file with one line for every object in your Ruby heap. For GitHub's largest application, that results in about 6 million lines yielding a file size of around 2 gigabytes.

00:05:00.000 Each of those lines provides details about the object and specifies the file and line where it was allocated. For many cases, this can be quite useful. However, in other scenarios, like with MessagePack, you don't always know where the memory is being used. MessagePack does its best to operate with whatever you have instructed it to load.

00:05:18.240 I wanted to extract more information here—ideally, a full stack trace. The issue, however, is that representing a stack trace in such a file is costly. It’s already 2 gigabytes, so I can't simply store it uncompressed.

00:05:30.440 To improve that, the first thing I wrote for Vernier was a data structure to efficiently represent stacks—the StackTable class. You can instantiate one of these to record the current stack, and then every time we do that, it returns a unique integer representing that observed stack.

00:05:49.199 If you call it repeatedly with the same stack, you will always receive the same integer. Later, you can reconstruct that back into a full stack trace. I'm sharing this because I think it is an interesting capability that may have uses beyond just profilers, so feel free to experiment with it. Another important feature of this, specifically for the retain memory profiler, is that calling current_stack never allocates a Ruby object.

00:06:09.840 Most of the time, it won't allocate at all if it has seen the stack before. If it’s a new stack, an allocation will occur in C—but not in Ruby. This way, we preserve our measurements without interfering too much with the Ruby program.

00:06:23.840 The representation of these stacks uses a tree structure where each stack points to its parent. This allows for reusing parts that are common among multiple stacks, which is quite prevalent in profiling, and we also deduplicate frames that occur multiple times. As a result, we obtain a very efficient representation that closely resembles the Firefox output format.

00:06:41.120 When you create a memory profile, unlike a time profile, the x-axis shows the number of bytes used. For cases like MessagePack, you gain a better understanding of memory usage. Instead of seeing just that MessagePack read was where memory was used, we recognize that it stemmed from loading ITN information.

00:07:02.120 After building this, I discovered it was a solid start for a more traditional time profiler. There are a couple of techniques for time-based profiling. The first is using a tracing profiler, which in Ruby is achieved using the TracePoint API.

00:07:14.840 This allows you to configure a callback to be triggered every time a method is entered and another when it exits. Taking the differences between those two times is how profile.rb, the original Ruby profiler, functions. This profiler was included with Ruby until a recent version; however, it can be quite slow, introducing substantial overhead.

00:07:31.920 For instance, if we have a Fibonacci program that takes 0.1 seconds to begin execution, running this with profile.rb makes it about 500 times slower. This makes sense considering we are executing an extra line of Ruby code for every line we run. Even if we implement this technique in C, it still ends up slower.

00:07:53.600 There are two main problems with this: first, it becomes challenging to run something this slow in production, as you might hit a timeout or impact users. Second, it doesn’t consistently slow things down. A real Ruby program is a mix of Ruby and native code execution, such as C functions like string.split and time spent idly waiting on IO.

00:08:13.920 When using a tracing profiler, the impact predominantly affects Ruby code, inflating the time measured in Ruby while leaving the other times unchanged. Therefore, I think tracing may not be the best solution, but it does have benefits; it’s the only way to capture complete information, such as counting occurrences.

00:08:36.880 Vernier utilizes a combination of both tracing and sampling, but in a very streamlined manner. For user timings, we gather data through Rails Active Support notifications, which tell us about every database query made. This allows us to record additional details.

00:09:00.880 We also subscribe to some Ruby events, for instance, internal garbage collection (GC). Vernier subscribes to the GC API and records every GC pause that occurs during execution. This offers a valuable insight into the garbage collector, often thought of as always being active and consuming a portion of time.

00:09:24.960 In reality, the garbage collector operates as a sophisticated tool featuring incremental marking, which improves visibility.

00:09:39.200 Now, let’s switch to sampling profiling, which is what most contemporary Ruby profilers employ. The key idea is to run something at intervals, which you could refer to as a sampling rate. Here, I've set it to 1 millisecond, which is very typical.

00:09:53.840 Every 1 millisecond, we collect a sample, which includes the backtrace from the main thread. As our program runs, we build a large array that illustrates where our program spends its time. This concept serves as a useful mental model for how sampling profilers function.

00:10:06.240 It explains certain quirks associated with profilers, such as potentially missing method calls if they execute too quickly or infrequently. Furthermore, we can't determine where methods start and stop during execution.

00:10:23.040 Here’s an example: if we're executing a simple program using sleep for one second, our pure Ruby sampling profiler can effectively run alongside it, waking up every millisecond to record a sample, resulting in roughly 1,000 samples for that one-second duration.

00:10:38.560 However, if we create a loop that also plays out over one full second but just keeps running, our sampling profiler in pure Ruby will experience GVL contention and will be unable to collect samples.

00:10:58.320 This is due to Ruby's Global VM Lock (GVL), which allows only one thread to execute Ruby code at a time. While this might sound more complicated than it is, we can perform IO or sometimes run C functions in parallel, we end up missing samples during a busy loop.

00:11:14.640 In Ruby 3.2, Jean Busier implemented an API for GVL instrumentation, and Vernier also subscribes to this. Here, you can configure it with an add-on hook API that lets you listen in on every time a thread transitions to another state within Ruby.

00:11:36.080 The GC information and user timing data appear in our marker chart, allowing us to see where a thread stalls. If we find our sampling profiler is installed for a significant duration, such as one entire millisecond, we can track that an external thread holds the GVL for a set amount of time.

00:12:00.960 In Vernier, we can also log a backtrace when we reach the suspended event. As a Ruby program cannot progress while suspended, we know that the backtrace compiled when it was freed from the GVL is the same state that the thread is expected to reside in until it regains that right.

00:12:22.480 This feature ended up being one of my favorites in this profiler. If you implement a well-designed C extension, it will relinquish the GVL whenever necessary for IO operations. As such, we automatically gain a precise measure of where IO occurs within our application.

00:12:43.920 For instance, this example is from Bundler—a tool we reviewed earlier that shows how it executes an HTTP request. Now, let’s switch to a more traditional profiler, StackProf, which is the profiler I've engaged with the most prior to writing Vernier.

00:13:06.640 StackProf operates using UNIX signals, triggering the signal handler code that it has set up. However, UNIX signals can be quite peculiar, which limits what you can execute within the signal handler, as it must run functions designated as async signal safe, which, notably, do not include allocating memory.

00:13:21.920 For this reason, Ruby provides a postponed job API that permits the VM to execute our code at a later time when it's safe to allocate. In this scheme, we queue a job and once it reaches a safe point, it executes further StackProf code to determine the stack being utilized, allowing memory allocations for saving samples.

00:13:43.520 This method works well with Ruby code, but has a peculiar result. Here’s an example utilizing multiple medium-expensive string operations followed by an empty method, which should ideally incur no expense. However, StackProf’s handling can yield misleading metrics.

00:14:05.360 There are only two safe points, one occurring at the return of an empty method and the other at the return of a slower method. As a result, despite the expense attached to operations in the slower method, the attribution tends to lean entirely on the execution of the empty method.

00:14:26.160 In the past, I fixed this issue in StackProf through a workaround, moving the profile frames logging from that checkpoint into the signal handler. Although RB Profile Frames is not officially async signal safe, it has proven to be so in practice since Ruby 3.0.

00:14:46.080 While this workaround has worked well lately, I think we still need more research on how to provide a more formally supported solution for accurate sampling.

00:15:06.640 Additionally, another peculiarity with the postponed job API arises during extensive while loops, as they may cause the job's execution to side exit, a case primarily seen in micro-benchmarks. This effectively renders slowdowns, yet more often than not, it won’t be detrimental in a real application.

00:15:26.480 We've now addressed the safe point biases, but there's still another hurdle in using postponed jobs. When executing a C function that takes a considerable amount of time, the signal may not come in quickly enough, leading to lost samples.

00:15:46.080 To tackle this, Vernier introduces a dedicated C thread that avoids GVL contention, responsible for sending signals to the Ruby thread when it's time to call RB Profile Frames and save the gathered samples.

00:16:06.160 Thanks to our StackTable infrastructure that does not allocate Ruby objects, it's safe to execute in a C thread, ensuring we never skip a sample, consequently boosting our accuracy and minimizing overhead by saving samples off the main thread.

00:16:26.160 Another challenge Vernier resolves is awareness of threads. For years, profilers like StackProf didn't adequately account for threads, predominantly profiling the main thread. This approach sufficed when web servers typically ran single-threaded.

00:16:47.360 Nevertheless, with most web servers now using threads, it's vital to capture information about all threads and comprehend how to enhance them. Therefore, Vernier can send signals to each running Ruby thread and gather samples from them.

00:17:07.520 While this collection process works efficiently, it introduces overhead by sampling threads, raising concerns that this would deter developers from scaling by adding more threads. It’s feasible to imagine that if 100 threads were implemented, this method might prove unwieldy.

00:17:27.680 To tackle this, we leverage the gvl API to understand when threads are inactive since idle threads cannot progress. By recording the backtrace once a thread goes idle, we no longer have to signal or profile them; we are privy to their execution state.

00:17:48.560 When dealing with iterations, multiple threads operate concurrently under the GVL, ensuring that our profiling maintains insight into all threads while avoiding needless overhead.

00:18:08.920 Acquiring data from all these threads offers an innovative way to profile. Instead of only focusing on specific requests, we are enabled to profile the entire web server over a given time span.

00:18:30.440 In a production environment, I can operate a Rails server and toggle profiling on and off by sending a signal. Here’s an illustration of my personal Mastodon web server handling multiple requests, revealing instances of GVL contention.

00:18:51.360 While it initially manages requests smoothly, it may later become stalled amid seemingly rapid requests converted into 200-millisecond response times due to network transitions.

00:19:11.360 The state showing one thread running while another is stalled indicative of non-GVL-aware profiling, could easily lead to confusion. There won't be any improvements on the stalled thread during its wait.

00:19:32.720 We can take several approaches to resolve this issue upon identifying the table's GVL contention. The conventional advice would likely be to employ isolated processes, which undoubtedly alleviates issues.

00:19:54.400 Still, we can also attempt to accelerate our profiling processes. By examining the thread concurrently utilizing the GVL and addressing any contention to reduce its time spent utilizing it.

00:20:15.760 Moreover, I propose an engaging third option: integrating short-lived Ractors. At RubyKaigi last year, I observed challenges regarding the broader acceptance of Ractors. However, GVL-aware profilers might provide the key to this dilemma.

00:20:37.680 Through profiling, we can identify precise locations responsible for the prolonged GVL locks, subsequently optimizing those segments or extracting them into Ractors, consequently enhancing concurrency.

00:21:01.040 In conclusion, Vernier is a GVL-aware profiler that can efficiently profile all Ruby threads while maintaining minimal overhead and interference. It supports GC tracing, GVL tracing, user-level tracing, and allocation tracing. I hope you give it a try; together, we can make our Ruby code faster.