Vernier: A next Generation Ruby Profiler

00:00:16.640 Hello, my name is John Hawthorn. I'm a Rails Core member, a Ruby committer, and a staff engineer at GitHub on the Ruby architecture team. GitHub is one of the largest and most complex Ruby on Rails applications out there, and my job is to make GitHub.com fast. Basically, I'm trying to avoid you seeing our equivalent of Twitter's fail whale.

00:00:19.320 Performance problems can hide anywhere, and a profiler is the best tool to solve that. Whether the trouble is hiding in a library, a database query, or a network request, a profiler will show you where your program is executing and what's taking up the time.

00:00:38.719 For the past year or so, I've been working on Vernier, a new profiler for Ruby 3.2 and greater. It aims to deliver more information more accurately than previous Ruby profilers and has a fancy new frontend that makes it easier to use, borrowing design elements from the Firefox team.

00:01:05.600 So, let's start with a quick demo to see what using a profiler is like. Vernier comes with a CLI to easily measure a command. We're going to measure 'bundle lock.' This command is similar to 'bundle update' in that it updates the versions of your packages, but it does not install them, making it a little easier for us to measure.

00:01:19.880 After running the command, Vernier will tell us that it recorded a result showing the program took about 2.2 seconds and where it wrote that profile file. To view this file, we can upload it to our viewer program, which is a single-page app hosted at ver.prof, or we can use the profile viewer gem that Aaron created. This is just a lightly modified version of the Firefox profiler, and the files are compatible with it. The key difference is that we've replaced the word 'JavaScript' with 'Ruby.'

00:01:53.840 If you want to follow along later, this is the bundler demo that you'll find on the VPR website. When we drag our file in, the first thing you will see is a timeline at the top that shows what the program did throughout its execution. You'll see the two seconds up there, and you can hover your mouse over it to show a stack trace of what was running at each point in time. The lower part of the screen shows the same information, but with a different visualization in the stack chart view, layering all the functions currently on the stack.

00:02:37.360 We can zoom in on the upper timeline and scroll through this, visualizing how your program executed during the two seconds it took to update our gems and produce this result. I find this incredibly powerful. You can really see your program doing whatever it is and notice any bottlenecks. This view is helpful to learn more about your program, but it's a little busy and can be intimidating in a presentation, so for most of this talk, I will switch back to the call tree view.

00:03:18.760 The call tree view is more like a summary. It presents a top-level list of which methods took the most time. Initially, it indicates that 'main' took 100% of our time, which is not that helpful. However, this view shows us a bottom-up perspective of our program. To get more useful insights, we can invert the stack chart, which reveals interesting methods.

00:03:50.840 The first method that shows up is 'Thread#pop,' indicating we are waiting for data from another thread, so we might ignore that for now. The next one is 'String#split,' which is a C function, and we wouldn't expect it to be slow. It is written in C, so nothing should cause it to be slow. However, in this case, it consumes 10% of our command's time, suggesting that we are either calling it with an extremely large string or, more likely, calling it far too many times.

00:05:12.160 We can expand this tree to see the locations calling the 'String#split' method. Notably, we see significant calls within the 'Bundler::CompactIndex::Client::Cache' class. Although I'm not entirely sure what this class does, we can double-click on anything here, and the viewer will bring up the source code. This shows the source code that took approximately 10% of the time of our program. Along the left margins are annotations of the sample counts, which serve as a relative measure of the time spent executing a line of code.

00:06:08.600 This doesn't indicate how many times the line was executed, as this profiler does not track that metric, but it gives insight into how much time we spent there. The 'total' value represents the sum of this line as well as anything it called into. For the 'String#split' method, the total count of 315 indicates the number of samples observed.

00:06:57.200 Taking a closer look at this, I think it's great Ruby code that I would usually like to write. However, I recently checked the number of times Bundler calls this method in the command, which is about 200,000 times. Understanding why 'String#split' appears prominently makes sense now, considering its inherent speed.

00:08:01.159 To improve this, we could aim to call it less frequently. This might be feasible, although I don't know Bundler well enough. Another option is finding a more efficient way to perform its current task. Currently, it splits a line by spaces into three strings, returning them as an array, but we only grab the first and last strings, disregarding the second. Thus, we could use a substring to achieve this more efficiently.

00:08:44.320 This change isn't as pretty, and this highlights the value of using a profiler. You can write beautiful Ruby code 99% of the time and identify the specific cases where performance matters. If we time 'bundle lock' before making changes, it takes around 2.1 seconds. After adjustments, it takes about 1.9 seconds, resulting in a 10% speed improvement. This is significant, especially for a tool like Bundler, and reveals that we can make substantial performance improvements without deep initial knowledge.

00:09:57.840 Additionally, we gain insights into Bundler's workings. I'd like to clarify something I didn't do here: I didn't use the profiler to measure the improvement. Profiler tools inherently have some level of overhead and potential accuracy issues, so we want an actual performance increase rather than just looking good to the profiler. Vernier strives to have lower overhead and greater accuracy than previous profilers.

00:11:46.320 Next, let's review the evolution of Ruby profilers and explore why accuracy can often be problematic. The most straightforward profiler to create is a tracing profiler, which can utilize callbacks on method entry and exit using the TracePoint API to measure execution time.

00:12:08.720 This profiler shipped with Ruby up until the last version that included 'profile.rb.' While convenient, it can make your program up to 500 times slower because every executed line adds overhead. Even a C version, Ruby Prof, can be about 30 times slower.

00:12:36.720 There are two significant issues with excessive overhead. The first is that you cannot run it in production, as many pages hover close to timeout already. Additionally, production is the best place to primarily run profilers. Secondly, the slowdown isn't consistent across programs. For example, a recursive Fibonacci calculation may illustrate a worst-case scenario for this type of profiler, but typical Ruby applications mix components where time is spent in various ways, including executing Ruby code, running native code, or waiting on a database query.

00:13:50.360 Due to this overhead, most Ruby profilers prefer sampling techniques, which periodically wake up to log a sample. For example, they might set a sampling interval of one millisecond and record the backtrace from the primary thread at each iteration. At the end of running the program, this creates a large array of samples reflecting what was executing at any given moment. This methodology underpins modern Ruby sampling profilers like RBSpy and StackProf.

00:15:31.520 This statistical model leads to some inherent limitations. It doesn't track how many times a method is called, and if a method executes too quickly, it might not even appear in the samples. This is why profiles can sometimes be confusing or seem incomplete, as they are based on this sampling.

00:15:58.360 Let's visualize how this works. Take a running program that does a 'sleep 1;' while another pure Ruby profiler records samples in parallel. That profiler will collect 1,000 samples during that sleep period. However, when running the profiler on more complex scenarios – such as loops that take significantly long durations without yielding – the profiler may stop functioning correctly due to Ruby’s Global VM Lock, which permits only one thread to execute at a time.

00:17:25.920 Although some perceive this global lock as a significant limitation, Ruby handles threading efficiently, and relevant threading issues should ideally be detectable through profiling. The Vernier profiler is the first to report GVL information in profiles. It monitors pure Ruby code alongside scripts, showing us where stalls occur and how time spent in those states influences performance.

00:18:34.080 In Vernier, threads can be in one of three states: running (executing Ruby code), stalled (blocked on GVL), or suspended (sleeping, engaging in I/O, or running certain C extensions). The 'stalled' state isn't ideal, as excessive duration spent in this situation suggests a problematic C extension or insufficient thread count.

00:19:43.520 Meanwhile, the 'suspended' state shows when threads await I/O activities, and Vernier provides comprehensive logging of these events. For example, bundler suspending its operation inside a TCP socket during HTTP requests provides visibility into what occurs within the program without additional instrumentation.

00:20:39.920 Next, regarding StackProf, it has been my go-to profiler for years and was a direct inspiration for Vernier. StackProf utilizes Unix signal-based timing methods. While effective, it introduces bias by only capturing data at 'safe points' in the code, which can skew reporting. We addressed bias via modifications in StackProf to extract data more accurately and ensure sample integrity.

00:21:40.080 While StackProf worked reasonably well over the years, it wasn't perfect. There could be situations where an application might not reach 'safe points' quickly enough, which could result in losing samples amidst incoming request data. Vernier solves this problem by incorporating a second thread that saves samples without direct interactions with Ruby APIs, ensuring we never miss recording samples unless our hardware performance falters.

00:23:30.320 This allows us to maintain accurate sample recordings alongside reduced overhead since less processing is required by the main Ruby thread. As threads have evolved within multiple Ruby profilers over the years, earlier implementations would only sample the main thread. However, with the increasing popularity of multithreading in web servers like Puma, it’s crucial to capture all threads, allowing better performance analysis across various code executions.

00:24:51.680 An essential goal behind Vernier was to accurately measure every thread’s performance at all times. Our trick enables us to avoid measuring idle threads that aren't executing Ruby code since they released the GVL, giving us full visibility without the overhead of unnecessary samples. Ultimately, Vernier differentiates itself by allowing users to view every thread within the application and analyze performance insights effectively.

00:25:37.360 Vernier also supports concurrent thread selection to review their performance with full timeline representation. You can pinpoint where specific threads run by utilizing the track selector feature within the visualization, making it much easier to manage and analyze performance bottlenecks.

00:25:43.840 Vernier further provides features for garbage collector instrumentation, letting users track and analyze garbage collector pausings within a request. Insightfully measured, this data assists users in identifying whether garbage collection contributes considerably to performance issues.

00:26:31.599 Vernier also accommodates allocation tracing during runtime, providing insights into allocation behaviors alongside method timings for each method executed. This capability lets you link garbage collection spikes with specific methods that lead to excess allocations.

00:27:48.760 Moreover, integration with Rails Active Support notifications framework allows users to trace database queries and directly correlate them to timing, optimizing requests effectively. Ideally, Vernier is designed to provide an in-depth understanding of the entire request handling process, identifying all performance-hindering factors comprehensively.

00:28:41.279 In conclusion, I hope you will give profiling a try, especially with Vernier. Not only can it assist in optimizing your program's performance, but it will also enhance your understanding of its intricate operations. Typically, the most critical sections of any program are those that take up time.

00:28:56.199 So, I encourage you to try this out and let’s work together to write faster Ruby code. Obrigado!