Talks

Keynote: Vernier - A next Generation Ruby Profiler

by John Hawthorn

In his keynote speech at RailsConf 2024, John Hawthorn introduces Vernier, a new open-source sampling profiler designed for Ruby 3.2 and beyond, which offers in-depth insights into program performance. This presentation emphasizes the distinction between performance work and regular feature development, highlighting key aspects of performance measurement and optimization.

Key Points Discussed:
- Understanding Performance Work: Hawthorn describes performance as a continuous measurement effort, unlike the satisfaction of completing feature tasks. He stresses that understanding performance leads to better code comprehension and potential discoveries about application security.

- Ruby and Rails Performance: The speaker defends Ruby and Rails against criticism regarding their efficiency, citing significant advancements made in newer versions and the capabilities that facilitate scalability, like multi-database and sharding support.

- Key Misconceptions: Many developers hold outdated views of Ruby’s performance, often based on older versions. Hawthorn emphasizes the importance of updating these perceptions to reflect the improvements brought by newer implementations, including JIT compiler enhancements.

- Optimizing Effectively: The talk promotes the idea that while Ruby allows for writing less efficient code, it’s vital to focus optimization efforts on parts of the code that matter. Hawthorn references Donald Knuth’s famous quote regarding premature optimization, arguing for a balanced approach.

- Introducing Vernier: Vernier combines timing and allocation data into a single profile. The tool helps developers visualize performance bottlenecks, using graphs and timelines to display where time is being spent in a Ruby application.

- Demo of Vernier: A demonstration showcases how Vernier profiles a bundle update command, revealing that certain operations like String#split can unwittingly become performance sinks due to their excessive invocation.

- Advanced Profiling Features: Vernier incorporates information about Ruby’s GVL (Global VM Lock) activity and garbage collection, enriching the profiling experience. The profiler details when threads are executing, blocked, or suspended, enabling more nuanced performance investigations.

Conclusion and Takeaways:

Hawthorn concludes by reinforcing the idea that performance profiling should be seen as an opportunity to understand and optimize applications rather than merely a challenge to overcome. He encourages developers to leverage tools like Vernier to enhance their coding efficiency and the overall effectiveness of the Ruby ecosystem, fostering a collaborative learning environment.

Overall, the keynote emphasizes the importance of performance in development, the necessity of modern profiling tools, and a proactive mindset towards optimizing Ruby applications.

00:00:11.000 And now we're getting into the meat of the day. I'm privileged to present John Hawthorn as today's keynote speaker. I first met John almost five years ago to the day in Minneapolis at RailsConf 2019. I was basically a newcomer to the community, meeting lots of interesting people.
00:00:24.880 I was talking to Matthew Draper, who was working on the gem named "gel", asking about how he got it to work so quickly. He mentioned that John is the brains behind the algorithm used in this tool, known as PubGrub. While the algorithm doesn't belong to John, he ported it and wrote it in Ruby. This algorithm is for dependency resolution, and it was the first time I heard about it. I was really impressed, and the more I talked to John, the more impressed I was by his intellect and curiosity. Later on, the PubGrub gem started being used by Bundler, meaning that today, Bundler's dependency resolution relies on the PubGrub algorithm and the gem originally created by John.
00:01:08.479 In that way, John is somewhat ubiquitous, yet his work often goes unnoticed by many of us because he works behind the scenes. He's also one of only three people in the world who are both Rails core and Ruby core members, which is a remarkable privilege. Additionally, he's a staff engineer at GitHub. With all that said, it brings me pride and privilege to invite John to the stage.
00:02:17.040 Thanks! This talk is about Ruby and Rails performance, which is where I spend most of my time working. I don’t think we always appreciate how different performance work is from the feature work we do on a day-to-day basis. When we do development, we generally look for specific outcomes, such as having our tests pass or ensuring that a page renders correctly with the right data. Once we achieve that output, we usually feel done, at least until someone changes the requirements.
00:02:29.879 However, performance is quite different. It's about this continuous measurement; the time it takes to perform an action, and something we'd like to minimize. This has two implications. First, improvements can be measured objectively—if you make something 10% faster, that is objectively 10% better. But secondly, performance work is never really done. You can always try to lower those numbers forever, albeit with diminishing returns. Performance work is not just about lowering that number; it's also about gaining a deeper understanding of your program. In other words, when talking about performance, it's not the destination, it's the journey.
00:03:36.519 A recent example of how performance can teach you about your program is the discovery of a backdoor in the xz library. This was found by a PostgreSQL developer who was conducting full system profiling and realized that OpenSSH was performing actions that it shouldn't be. While I wouldn’t say that working on performance will make your application more secure, it will certainly help you better understand it and identify sections of your code that may be doing unnecessary tasks or taking more time than expected.
00:04:31.880 I believe Ruby and Rails are often treated unfairly when it comes to performance discussions. Rails is an excellent framework for scaling up, particularly with recent versions featuring multi-database and sharding support. These are features you may not need when you first start your application, but they become available as your application grows. Furthermore, Ruby has seen significant performance improvements over the past decade. Many people still hold mental models based on Ruby 1.8, which was based on a different virtual machine architecture. Now, with Ruby’s JIT compiler, the performance characteristics have shifted considerably.
00:05:25.600 It's time for people to update their mental models of just how fast Ruby can be. There is some truth to the idea that Ruby and Rails make it easy to write code that is inefficient, but I don't believe this means we should impose severe limitations on our coding practices. Ruby and Rails allow us to express code freely, even if that code isn't the most efficient, because otherwise we would be prematurely optimizing. I'm sure everyone has heard the quote that 'premature optimization is the root of all evil.' That always seemed a bit harsh to me, and I didn’t fully embrace that idea until I read the quote in its entirety.
00:06:06.760 This quote comes from Donald Knuth in a paper he wrote about go-to statements. He advocated for the use of go-to statements, but he remarked that programmers waste enormous amounts of time worrying about the speed of non-critical sections of their programs. These attempts can negatively impact debugging and maintenance efforts. He stated that we should ignore small efficiencies 97% of the time because premature optimization is the root of all evil. However, we shouldn't overlook opportunities within that critical 3%. This emphasizes focusing your optimization efforts wisely. We only have so much time to dedicate to performance, so we should spend it where it matters and otherwise prioritize writing clean, maintainable code.
00:07:18.640 The challenge we face is knowing what is actually slow; it's notoriously difficult to determine. I spend most of my time focused on performance, yet I still find myself guessing poorly about what takes time on a specific page. I believe this is because we tend to assume the time something takes aligns with how critical it is to the end result. Yet often, our best opportunities for improvement arise when there is a mismatch between duration and utility. For instance, Nadia mentioned in her opening keynote a slow query within StoryGraph that was causing performance issues, even though the query's result wasn't displayed on the page. This is a good example of that mismatch: something that contributes nothing visible to the page can still be what slows it down.
00:08:45.200 To address this challenge, we need tools to help us identify where to focus our optimization efforts. Today, I introduce you to Vernier, a new open-source Ruby profiler for Ruby 3.2+ designed to help you understand where your program spends its time. I will start with a quick demo. Vernier comes with a command-line interface to easily measure the time taken by a command. Let’s measure ‘bundle update’ with a mostly empty gem file. I think I've just included Rack.
00:09:58.279 We can run that and wait for it to finish. It will inform us that it recorded a result and where that file went. This operation took about 0.6 seconds, which isn't particularly slow. I chose this example because Bundler is an easy way for all of us to understand how to get non-trivial work done; it serves as an excellent test project.
00:10:49.320 After obtaining this file, we need a tool to view it. This brings us to the second component of Vernier: a single-page web application. You can find it at v.r. If you'd like to follow along, this is the Bundler example, and the web viewer was the original driving force behind writing a new Ruby profiler. It provides an excellent interface for exploring these profiles. I'm comfortable praising it because I did not create it. Essentially, we adapted the Firefox Profiler front end, swapping out JavaScript terminology for Ruby. This implementation may sound strange, but the Firefox team designed this viewer with the versatility for use in this manner.
00:11:59.040 The applications are not as different as you might think. Both are C or C++ applications that are multi-threaded and utilize JIT compilation for dynamic languages, so many of the challenges encountered when profiling Firefox apply equally to Ruby. When we open a profile, here's what we see first: a timeline displayed at the top, which provides a chronological representation of what your program is doing at various points in time. We can mouse over it to see the stack at each time increment. If we executed a puts call or raised an exception, that would display the same information in the lower part of the screen.
00:12:50.559 The lower part simply presents the same data in a slightly different format, layering the stacks on top of each other. These visualizations are designed to help us identify issues within the data. I find these representations can initially seem daunting and hard to interpret, and that's somewhat by design. The idea behind these flame graph or stack chart styles is that humans are good at recognizing visual patterns. It presents a significant amount of data all at once, trusting that we’ll manage to parse out what’s happening. If this seems overwhelming or you've ever opened one of these visualizations and thought, 'Wow, that's a lot,' that's totally normal. Just take your time and trust your instincts.
00:14:24.680 However, there are more accessible options available as well, such as a call tree view, which provides a summary, a top-down list showing which methods we spent the most time in. Initially, this may seem uninformative: it indicates that the main method consumes 100% of the time, which we could have guessed. It offers much more perspective after clicking 'invert call stack', which reveals far more enlightening data.
00:15:03.440 Here, we start seeing insights: we notice the bundler beginning its processes, which reveals some requires, suggesting TCP socket activity, indicating that we must be engaging in network I/O. However, the top method of interest here is 'String#split'. Now, 'String#split' is implemented as a C function, so does that mean C is slow? Not likely. In fact, I don't believe 'String#split' is slow at all; it operates efficiently. The problem lies in either giving it a very large input or calling it many times. We can investigate further to see who is invoking it.
00:16:50.560 Looking at the output, we see two nested lines within Bundler that call 'String#split'. The first line appears to take the majority of the time, so to explore its behavior, we can double-click on it. Another notable feature of the Vernier front end is that you can view the source code of any public gems within the viewer. This capability should prove helpful for familiarizing yourself with the libraries and frameworks you are using. There's a common misconception that there’s a separation between whatever application you ship and the frameworks or libraries you use; however, when profiling, that distinction falls away.
00:17:31.200 What we see here shows the relative performance measured in samples. We don’t need to worry too much about that right now. If we examine the specific lines of code within Bundler, we're looking at a string that is split by spaces, assigning the resulting substrings into variables. There's nothing inherently wrong with this operation, but it occurs 200,000 times, which is substantial. Each time it works with the incoming strings we fetch from a hosted gem list file, and it processes these into separate variables before returning them.
00:19:04.280 The first of Vernier's distinguishing features is that it combines timing data with allocation data in a single profile. While other profilers can record allocation data, Vernier is likely the first Ruby profiler to include both in a single report. I want the profiles we generate to capture everything we can learn about a program's execution; this minimizes the need for back-and-forth re-profiling. Here we observe that our string split is executed 200,000 times and generates 890,000 allocations. That's quite a significant number!
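The scale of those allocations is easy to reproduce in plain Ruby with `GC.stat`, which exposes a cumulative allocation counter. A minimal sketch (the line format and counts here are illustrative, not Bundler's actual data):

```ruby
# GC.stat(:total_allocated_objects) is a cumulative counter, so the
# delta around a block of code gives its allocation count.
before = GC.stat(:total_allocated_objects)

10_000.times do
  # Each split allocates one Array plus one String per field, and the
  # unfrozen string literal allocates a String per iteration as well.
  "rack 3.0.8 abc123".split(" ")
end

allocated = GC.stat(:total_allocated_objects) - before
puts "allocated roughly #{allocated} objects"
```

At four-plus objects per call, seemingly innocent parsing code can dominate a profile's allocation column once it runs hundreds of thousands of times.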
00:19:56.800 Let's see what we can do about that. If I were better acquainted with Bundler, I might reduce how often this operation runs, which is the primary approach, but a simple way to lower the overhead is to avoid the extra array allocation and minimize the cost of splitting. For instance, we could use substring operations instead of performing a full split.
00:20:10.320 Adjusting that way would conceivably save us significant processing time. Initially, this was running at about 0.6 seconds; after making our changes, we improved it to approximately 0.5 seconds. This encapsulates why I find performance work so thrilling: it's not only about improving Bundler but also about gaining insights as we dive deeper into the code. We come to understand which parts of Bundler are critical and worth further exploration.
00:21:03.360 One important note is that I did not utilize a profiler to assess the improvement; profilers can introduce overhead and potentially distort results. Consequently, performing a straightforward timing test is always the best approach to ensure that an improvement is genuine rather than just a byproduct of profiling artifacts. Let's be clear: keep using 'String#split' as it isn't inherently slow; the key takeaway is to be mindful of how we utilize it. In Bundler alone, there are numerous other spots that employ 'String#split' effectively, and I've yet to experience performance issues with them.
00:21:56.760 That said, I reckon my previous example might itself count as a premature optimization. The current timings may well be acceptable, or better avenues for improvement may emerge if we determine it's truly necessary. What I've demonstrated applies to any Ruby profiler; StackProf, RBSpy, and Vernier are all great options. If you're not on Ruby 3.2 or above, feel free to use one of the earlier profilers. I firmly believe that using any profiler is beneficial, but I see Vernier advancing Ruby's profiling capabilities considerably.
00:23:06.400 Let me take a step back and clarify why performance measurement can be more complex than it seems at first glance. Say we have a snippet of code we want to measure. You might start with something straightforward: record a time before and after execution, then note the difference. While this method works adequately, there are improvements we can employ. For example, `Time.now` isn't always the best choice; instead, we should use `Process.clock_gettime` with `Process::CLOCK_MONOTONIC`.
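The before-and-after pattern with the monotonic clock looks like this (the `sleep` stands in for whatever code you are measuring):

```ruby
# Time.now follows the real-time clock, which can jump when NTP or
# the user adjusts it; the monotonic clock only ever moves forward.
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)

sleep 0.01 # stand-in for the code under measurement

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
puts "took #{(elapsed * 1000).round(2)} ms"
```

Because the monotonic clock is never adjusted, the difference between two readings is always a true duration, which is exactly what a benchmark needs.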
00:23:55.960 In the operating system, we have access to various clocks for timing tasks. The real-time clock represents human-readable time, but it can jump and drift due to time synchronization. Conversely, the monotonic clock never jumps: it counts the seconds since the machine booted and is never adjusted, making it the right choice for measuring durations on a single machine. Additionally, CPU time clocks function differently: they advance only while the program is executing, excluding periods of waiting on I/O, locks, and so on.
00:24:57.360 Profilers differ in whether they report wall time or CPU time. You might prefer CPU time if it aligns with your resource limits; however, I personally favor wall time for Ruby, as it shows where we spend our time overall, which is especially relevant when we're waiting on I/O or database queries. The next issue we face is avoiding cluttering our codebase with print statements for timing purposes.
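The wall-versus-CPU distinction is easy to see directly: time a `sleep` (a stand-in for waiting on I/O or a database) against both clocks.

```ruby
# Wall time advances while we wait; process CPU time does not.
wall0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
cpu0  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)

sleep 0.05 # stands in for waiting on I/O or a database query

wall = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall0
cpu  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu0
# wall is around 0.05s; cpu stays near zero, since sleeping burns no CPU
```

A CPU-time profile of this code would show almost nothing, while a wall-time profile would correctly attribute the 50ms to the wait, which is why wall time suits typical I/O-heavy Rails workloads.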
00:25:56.799 It's common to see guidelines advocating against littering your source code with timing print statements. A more sophisticated approach is achievable using Active Support Notifications, a built-in instrumentation framework in Rails. Rails uses this system to time various processes internally, but you can create your own topics as well. I highly recommend doing so; this framework provides an excellent way to time your applications effectively.
00:26:34.720 After instrumenting your code, you can create subscribers that receive a callback every time an event occurs, giving you the time taken for that process. However, caution is warranted with subscribers, as there is a cost to adding them. It's important not to perform heavy computation inside these listeners, which could lead to inflated and misleading timings.
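The shape of the pattern is roughly this. The sketch below is a simplified stand-in written in plain Ruby so it runs without Rails; the real API is `ActiveSupport::Notifications.instrument` and `.subscribe`, and the module, method names, and payload here are illustrative:

```ruby
# A toy pub/sub timer mimicking the ActiveSupport::Notifications
# pattern: instrument a named event; subscribers receive the event
# name, its duration, and a payload once it finishes.
module Notes
  SUBSCRIBERS = Hash.new { |h, k| h[k] = [] }

  def self.subscribe(name, &block)
    SUBSCRIBERS[name] << block
  end

  def self.instrument(name, payload = {})
    start  = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    finish = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    # Keep subscriber work cheap: heavy listeners inflate the timings.
    SUBSCRIBERS[name].each { |s| s.call(name, finish - start, payload) }
    result
  end
end

Notes.subscribe("app.fetch") do |name, duration, payload|
  puts format("%s took %.4fs (%p)", name, duration, payload)
end

Notes.instrument("app.fetch", source: "gem list") { sleep 0.01 }
```

The instrumented code stays clean because the timing and the reporting live in separate places; the subscriber can log, aggregate, or forward to a metrics system.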
00:27:11.440 You might unknowingly be familiar with Active Support Notifications due to their role in Rails logging. For instance, whenever you observe SQL select statements or Action View renders in your development console, they're being processed through Active Support Notifications, where a subscriber outputs them to the log. Another of Vernier's innovations is that it subscribes to these Rails notifications and includes their data within the profile.
00:27:56.040 Analyzing these events as a flat list could be misleading: if you summed up the times, they wouldn't accurately reflect total execution time, because some events are nested within others. A timeline view becomes very useful here, and I'd recommend this approach because it gives a clearer picture of not just which operations take time, but how they relate to each other chronologically.
00:29:03.000 In addition to timing visualizations, Vernier subscribes to a few Ruby internal APIs, which expose more information about actions performed by the Ruby VM, such as garbage collector activity. Garbage collection is often blamed prematurely as a performance bottleneck, when modern Ruby's GC actually schedules its work quite carefully. With our profiling tools, we can now see exactly where the GC was active while our applications were running.
00:30:02.400 This addresses a frequent misconception that Ruby, as a garbage-collected language, must be slow. In fact, we can demonstrate that the GC operates incrementally, marking only micro-segments of time while allowing your program to proceed between these operations. However, one weakness of the notification system is that it relies on knowing about potential slow points beforehand; thus, only recognizable areas are logged in our metrics.
00:31:00.280 Profiling serves as a complementary tool to notifications. In its simplest form, profiling could be performed by instrumenting every method call. This technique, called a tracing profiler, uses a Ruby API such as TracePoint to obtain a callback at every method entry and exit. However, this can significantly slow down execution. For example, with recursive Fibonacci code, standard execution might take 0.1 seconds, but the default Ruby profiler could make it run 500 times longer. This overhead is what makes tracing profilers impractical in production environments.
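A minimal tracing profiler built on TracePoint looks like this; counting calls to a recursive Fibonacci makes the cost obvious, because the callback fires on every single method entry:

```ruby
# A toy tracing profiler: TracePoint's :call event fires on every
# Ruby method call, which is exactly why this approach is so slow.
call_counts = Hash.new(0)
tp = TracePoint.new(:call) { |ev| call_counts[ev.method_id] += 1 }

def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

# Enabling with a block limits tracing to just this region.
tp.enable { fib(15) }

puts "fib was called #{call_counts[:fib]} times"
```

Even for this small input, the hook fires nearly two thousand times; on a real application it fires on every method call the program makes, which is where the dramatic slowdown comes from.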
00:32:26.560 That matters because real-world measurements taken on production servers yield far more applicable insights than development-environment baselines, which may not reflect actual conditions. So while tracing profilers can tell you exactly how many times each method was called, I prefer a sampling profiler, which is cheap enough to run in production. A sampling profiler gathers samples periodically, yielding a manageable and reasonably accurate representation of performance: we configure it to record a backtrace at a regular interval.
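The core of a sampling profiler can be sketched in a few lines of plain Ruby: a background thread periodically records the main thread's stack instead of hooking every call. This is a toy illustration of the idea, not how Vernier is implemented (Vernier samples in native code for accuracy):

```ruby
# A toy sampling profiler: record the main thread's top stack frame
# at a fixed interval; overhead stays low and roughly constant no
# matter how many method calls the program makes.
main    = Thread.current
samples = []

sampler = Thread.new do
  50.times do
    frame = main.backtrace&.first
    samples << frame if frame
    sleep 0.002 # 0.002s per sample is roughly 500 Hz
  end
end

# Some work to observe while the sampler runs.
100_000.times { "name version".split(" ") }
sampler.join

top = samples.tally.max_by { |_, count| count }
puts "hottest sampled frame: #{top&.first}"
```

The frames that appear most often across samples approximate where the program spends its time; methods that finish between two samples are simply never seen, which is the trade-off the next paragraph describes.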
00:33:58.080 The challenge with sampling profilers is that a method that completes quickly may be missed entirely if it isn't called often enough, and sparse samples can suggest a problem where none actually exists. When I measured idle Ruby processes, the pure-Ruby profiler worked well; however, in more complicated contexts involving threads and the GVL, measurement can suffer.
00:34:56.560 Ruby has a global VM lock (GVL) allowing only one Ruby thread to execute at a time. Although this sounds detrimental, we can still achieve parallelism to a degree through I/O operations and native C function calls. Often, developers hesitate to use threads due to uncertainty about potential GVL impacts. Vernier addresses this, using a new API introduced in Ruby 3.2 that tracks GVL state and reports what each thread is doing at any point, allowing developers to reason about their threading with confidence.
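The GVL's practical effect is easy to demonstrate at the Ruby level (the event-tracking API itself is a C-level hook, so this sketch shows the behavior rather than the API; `sleep` stands in for I/O):

```ruby
# A waiting thread releases the GVL, so threads blocked on I/O
# (here simulated with sleep) overlap instead of running serially.
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)

threads = 4.times.map { Thread.new { sleep 0.1 } }
threads.each(&:join)

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
# elapsed is around 0.1s, not 0.4s: the four waits ran concurrently
puts "four 0.1s waits took #{elapsed.round(3)}s total"
```

Pure Ruby computation in those threads would not overlap this way; that asymmetry between waiting and executing is exactly what Vernier's three GVL states make visible.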
00:35:46.560 Profiling our pure-Ruby sampler with Vernier, we could look closely at a sample and see that a thread was indeed stalled. Vernier represents the GVL in three states: executing Ruby code, blocked while trying to acquire the GVL, and suspended during I/O operations, at which point the thread releases the GVL.
00:37:17.280 Capturing the suspended state proved to be a pivotal feature of Vernier. It lets us track the stack at the moment a thread is suspended, which can yield valuable insights: we can verify whether a library or method behaves as expected by seeing how and when it interacts with the GVL. For instance, within Bundler, it showed us how the application performs network requests while managing GVL states seamlessly.
00:38:30.720 Profiling Ruby programs in production is essential for reliable, accurate measurements. By overcoming the traditional pitfalls of Ruby profiling, we can obtain results we've previously been unable to access. And by combining these sources of data, Vernier gives a complete picture of application performance rather than just a trace of Ruby execution paths.
00:39:39.680 Over the years, I'd migrated from mainstream social media to self-hosted alternatives such as Mastodon. It offers an interesting testbed for profiling due to its complexity and active community. Digging into its architecture, I ran into the same questions about thread configuration that its maintainers faced. By capturing profiles of a running instance, we can objectively analyze incoming requests based on thread activity and assess how they contend for the GVL.
00:40:54.080 Through detailed profiling, we can pinpoint bottlenecks such as N+1 queries or excessive waiting on the GVL. Realistically, if a single request is slower than expected, the cause may be stalls introduced by other requests running concurrently. This insight lets developers understand their limitations and optimize effectively.
00:42:09.060 This is my hope for Vernier: to capture these moments in time as we monitor Ruby processes, providing as much detail as possible. I hope this convincingly shows how beneficial using a profiler can be. A profiler empowers developers to write Ruby confidently without preoccupation over minor inefficiencies, improving both application performance and our understanding of the frameworks and libraries we use. Furthermore, encountering slowdowns should signal opportunities for improvement rather than merely challenges.
00:42:46.200 In conclusion, if slow processes are revealed in your profiling, view them as opportunities to enhance and optimize, understanding that performance can be an invitation to delve deeper into your application’s workings. In doing so, we cultivate a more robust community through collaborative learning and contribute to a faster, more efficient Ruby ecosystem. Thank you!