Talks

Optimizing Ruby: Building an Always-On Production Profiler

RubyKaigi 2024

00:00:11.759 Hello everyone, welcome to my talk on optimizing Ruby: building an always-on production profiler. This is the third talk at RubyKaigi this year about profiling, which I am very excited about.
00:00:18.039 I've put my slides online and will show the link at the end again. I'll also share it on Twitter, so if you want to check the slides later, they will be available.
00:00:30.359 So, who am I? My name is Ivo Anjo, and I currently work as a senior software engineer at Datadog. I've been a big fan of Ruby since 2014 when I started using it professionally.
00:00:41.760 I'm also a fan of exploring language runtimes, including the Ruby VM, the JVM, JRuby, etc. I enjoy spending time thinking about application performance, focusing on how we can enhance it and make it easier to achieve.
00:00:52.559 That's how I ended up working on tools to uncover better performance insights, essentially looking at our applications in different ways to learn more about them and what they are doing.
00:01:06.920 Today, I will talk about the continuous profiler that I work on at Datadog, but first, if you haven't heard about Datadog, we provide cloud monitoring as a service.
00:01:20.320 We offer various features including profiling, distributed tracing, security monitoring, error tracking, logging, and many other functionalities for different programming languages.
00:01:41.760 Here you can see a screenshot showcasing tracing requests across multiple services, along with nice graphs for metrics. The profiler provides a view of what the application was doing over a significant duration, whether it's an hour or a day.
00:02:05.320 In this profiling timeline view, each line represents a thread in your Ruby application, allowing you to visualize what that thread was doing over time. This is just a brief overview of what we offer.
00:02:24.480 Today, we're focusing on the Ruby profiler, which is part of the ddtrace gem. It's open source and available on GitHub.
00:02:30.760 This means you can see most of the work I've done at Datadog since I joined three years ago. The profiler only supports CRuby; the upcoming release will require versions 2.5 and above.
00:02:42.680 Currently, on the stable version, it still supports Ruby 2.3 and above, but that support for older versions is going away soon. The profiler itself primarily supports Linux, with some experimental support for macOS; if you want to run it on macOS, my recommendation is to use Docker.
00:03:02.000 The gem can be run on Linux, macOS, etc., but the profiler contains Linux-specific features that have not been ported to other operating systems.
00:03:15.920 The gem is built using a combination of Ruby, C, and Rust, which is quite interesting, and it produces data in the standard pprof format, allowing compatibility with various profiling tools beyond Datadog.
00:03:28.519 A key feature is that it is designed to be always on and in production. This is a distinguishing characteristic of this profiler.
00:03:39.879 An alternative title for this presentation could have been 'How to Gather Hundreds or Thousands of Profiling Data Points per Second with Low Overhead', or 'How Profilers Can Push the Ruby VM to Its Limits and Sometimes Beyond'.
00:03:48.159 There are many excellent Ruby profilers available; however, we started building a new profiler at Datadog because existing ones often come with caveats. Running them continuously in production can add overhead to your application, which is not ideal.
00:04:06.439 Additionally, many profilers produce a vast amount of data, which can create challenges around storage—whether in the cloud, containers, or elsewhere. Some profilers also gather sensitive data, such as heap dumps that may contain database keys or customer information, necessitating careful management.
00:04:30.960 Furthermore, certain profilers may only track one type of data at a time, such as CPU but not memory, or vice versa. Therefore, the challenge we addressed at Datadog was finding a way to run our profiler always on in production while minimizing overhead.
00:04:55.440 To achieve this, we employed sampling techniques to efficiently manage data, both in memory while gathering it and when uploading it. We ensure there is no sensitive data and can track multiple kinds of profiling data simultaneously.
00:05:16.039 For this presentation, I will share a quick outline: I will discuss a few examples of profiling, the types of profiling data we gather in the Ruby profiler, how we collect that data, and some of the challenges and workarounds we faced.
00:05:28.199 I will also explain how we use Rust in the ddtrace gem for the profiler. At the end, I'll summarize why having an always-on profiler is beneficial.
00:05:42.039 Now, let's look at a real-world example from Datadog. We utilize GitLab, which is similar to GitHub but open source; we have our own self-hosted instance.
00:05:59.919 One of the team members running GitLab, which is also built in Ruby, reached out to me about requests timing out after 60 seconds. I suggested looking at the profiler for insights.
00:06:19.199 Upon investigating the profiler, I discovered that most of the time was spent in a specific method, line 222 of the repository.rb file in GitLab.
00:06:30.800 This method retrieves branch names and tag names from the repository to identify which names are prefixes of others. It separates names with slashes from those without slashes.
00:06:52.000 For example, if you have a branch named 'foo' and another called 'foo/something', it builds regular expressions to check whether one name is a prefix of the other.
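As a rough sketch of the logic just described (with made-up branch names; this is not GitLab's actual code), it boils down to something like:

```ruby
# Simplified sketch: for each ref name without a slash, build a regular expression
# and check whether it is a prefix of any ref name that does contain a slash.
all_names = ["foo", "foo/fix-login", "release", "release/v2/hotfix"]

nested, flat = all_names.partition { |name| name.include?("/") }

ambiguous = flat.select do |name|
  prefix = /\A#{Regexp.escape(name)}\//
  nested.any? { |other| other.match?(prefix) }
end

p ambiguous # => ["foo", "release"]
```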
00:07:06.239 This operation didn’t initially appear problematic, so I wondered how a repository could involve so many branches and tags to take that long. It turned out that this was a large front-end Datadog repo featuring over 50,000 branches and 130,000 tags.
00:07:30.800 Hence, matching that many regular expressions contributed significantly to the delay. After sharing this with my colleague, we found that we didn’t actually need all those branches and tags since an outdated synchronization process failed to delete old ones.
00:07:43.680 After he performed a cleanup of the branches and tags, the system began operating efficiently again. This shows how the profiler can effectively identify performance issues.
00:07:56.480 Another example comes from metrics we publish within the gem. One metric we check is class count, or how many classes are loaded in your Ruby VM.
00:08:07.680 We validate the availability of this metric, since it might not be present in all Ruby versions or on JRuby. When I reviewed the performance profile for this, I noticed we were performing the availability check twice, which doubled the cost of collecting the metric.
00:08:19.679 To resolve this, I cached the result of the check, which improved performance significantly. You can also experiment with ddtrace locally without requiring a Datadog account.
00:08:36.640 We provide support for custom exporting mechanisms, allowing you to save a profile to a file. For instance, you can try it on a simple Fibonacci example after adding ddtrace to your Gemfile.
00:08:50.320 You can run the profiler on your app with the ddprofrb exec command, and the tool automatically compresses the profile data; note that the output is only available gzip-compressed.
00:09:02.800 You can then visualize this data using tools like go tool pprof, which can render the generated profile as a flame graph, among other views.
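As a rough sketch of that local setup (the commands, environment variable, and file names below are illustrative and may differ between gem versions, so treat them as assumptions rather than exact instructions):

```ruby
# fib.rb -- a toy workload to profile locally.
#
# Illustrative workflow (names may differ between gem versions):
#   1. Add the gem to your Gemfile:   gem 'ddtrace'
#   2. Run the app under the profiler, for example:
#        DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby fib.rb
#   3. Point the custom/file exporter mentioned above at a file, then open the
#      (gzip-compressed, pprof-format) result with:
#        go tool pprof -http=:8080 path/to/profile.pprof

def fib(n)
  n <= 1 ? n : fib(n - 1) + fib(n - 2)
end

5.times { puts fib(32) }
```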
00:09:22.040 Now, let's discuss the types of profiling data we gather in the Ruby profiler. The first type is CPU time, which measures how much time code spends executing on the CPU, and where.
00:09:39.559 This roughly correlates with the time the Ruby VM spends running code. However, it does not capture waiting times due to network access or file I/O, as those do not consume CPU.
00:09:59.919 We also track wall time, which reflects the real-world time consumed by running code. For instance, a sleep operation does not consume CPU time but does count towards wall time.
00:10:11.079 If you sleep for ten seconds, you’ll see zero CPU time but ten seconds of wall time. Wall time includes both on-CPU and off-CPU activities, such as network I/O and waiting for locks.
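You can see this difference for yourself with Ruby's built-in clocks; a minimal example:

```ruby
# CPU time vs wall time around a sleep, using Ruby's built-in clocks.
wall_before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
cpu_before  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)

sleep(10) # waits, but burns (almost) no CPU

wall_elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_before
cpu_elapsed  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_before

puts "wall: #{wall_elapsed.round(2)}s, cpu: #{cpu_elapsed.round(4)}s"
# => roughly "wall: 10.0s, cpu: 0.0s"
```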
00:10:22.399 Understanding wall time is essential because it represents the latency that users will experience. If you focus solely on CPU time, you may miss performance bottlenecks attributed to other factors.
00:10:35.519 Another important metric is object allocations, which count how many objects are created and their location. High object allocation leads to more work for the Ruby garbage collector, affecting overall performance.
00:10:51.120 Garbage collection time is another significant metric, informing you of the cost of having too many objects in memory. High object allocation correlates with increased garbage collection time.
00:11:06.360 We also track live heap objects, which monitor the objects still present in memory after garbage collection. This information is crucial for identifying memory leaks specific to Ruby objects.
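For intuition, plain Ruby already exposes similar counters through GC.stat; this small sketch (not the profiler's code, and GC.stat[:time] needs Ruby 3.1+) shows the kind of numbers involved:

```ruby
# Allocation count, GC runs, GC time, and live heap size, straight from GC.stat.
before = GC.stat

100_000.times { { a: 1, b: 2 } } # allocate a bunch of short-lived objects
GC.start

after = GC.stat
puts "objects allocated: #{after[:total_allocated_objects] - before[:total_allocated_objects]}"
puts "GC runs:           #{after[:count] - before[:count]}"
puts "GC time (ms):      #{after[:time] - before[:time]}" # Ruby 3.1+
puts "live heap slots:   #{after[:heap_live_slots]}"
```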
00:11:29.920 To gather this information, we rely on sampling, which we have previously discussed. Using sampling allows us to monitor application behavior without significant overhead.
00:11:46.040 In practice, our profiler records thread state every 10 milliseconds, capturing metrics such as CPU time, wall time, backtrace data, and active web requests.
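As a purely illustrative toy, a pure-Ruby version of that sampling loop might look like the following; the real collector lives in native code, runs outside the GVL, and records much more per sample:

```ruby
# Toy sampling profiler: every ~10ms, record the backtrace of every other thread.
samples = Hash.new(0)

sampler = Thread.new do
  loop do
    Thread.list.each do |thread|
      next if thread == Thread.current
      locations = thread.backtrace_locations(0, 64)
      samples[locations.map(&:to_s)] += 1 if locations
    end
    sleep 0.010
  end
end

sleep 1 # ... stand-in for running your actual workload ...
sampler.kill

samples.sort_by { |_stack, count| -count }.first(3).each do |stack, count|
  puts "#{count} samples at:\n  #{stack.first(5).join("\n  ")}"
end
```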
00:12:10.000 We represent this data visually using flame graphs, where wider sections indicate more time spent in specific functions, while the backtraces show how the application reached those points.
00:12:21.279 For CPU time, there is no single cross-platform operating system API, so we use Linux-specific code to measure the time each thread spends on the CPU.
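Ruby itself exposes the calling thread's CPU clock, which shows the building block involved; reading that clock for other threads is what needs the operating-system-specific native code, and is why this part is not portable:

```ruby
# CPU time consumed by the current thread, via the per-thread CPU clock.
cpu_before = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
1_000_000.times { Math.sqrt(rand) }
cpu_spent = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID) - cpu_before
puts "this thread used #{cpu_spent.round(3)}s of CPU"
```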
00:12:39.960 To record wall time, we log the elapsed time since the last sample. The timestamps allow us to create detailed timeline views representing the order of events.
00:12:54.960 This view displays active threads, enabling us to analyze their workload over time. We also utilize live web request data to filter profiles for individual requests or single endpoints.
00:13:18.360 For instance, we can extract a profile for a specific endpoint based on data collected over several hours, useful for assessing performance in multithreaded applications like Puma.
00:13:39.240 Another consideration involves the global VM lock (GVL), which might make it seem like profilers need to wait their turn to execute. However, we designed our background thread to run in C, allowing it to operate independently of the GVL.
00:14:02.560 This mechanism ensures we can retrieve sample data from the active Ruby threads without being affected by the GVL's constraints. Additionally, we adjust sampling frequency dynamically based on application demand.
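A hypothetical sketch of that dynamic adjustment, with made-up numbers and a stand-in sampling function (the real algorithm differs):

```ruby
# Keep the time spent sampling around a target fraction of wall time by stretching the interval.
TARGET_OVERHEAD = 0.01  # aim for ~1% overhead (made-up target)
MIN_INTERVAL    = 0.010 # never sample more often than every 10ms
MAX_INTERVAL    = 0.100 # never back off further than 100ms

def take_sample
  # Stand-in for the real sampling work.
  Thread.list.each { |thread| thread.backtrace_locations(0, 64) }
end

100.times do
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  take_sample
  cost = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  # Pick the next interval so that cost / interval stays around the target overhead.
  interval = (cost / TARGET_OVERHEAD).clamp(MIN_INTERVAL, MAX_INTERVAL)
  sleep interval
end
```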
00:14:18.280 This adaptation ensures that sampling overhead remains minimal, even during periods of high load. With regard to garbage collection, we utilize trace points for GC entry and exit events to assess performance.
00:14:32.399 The Ruby GC operates incrementally through multiple steps, which necessitates careful management of trace points to maintain low overhead.
00:14:51.040 Monitoring allocation events also requires strategic trace point placement, since it can significantly affect performance based on how frequently we record allocation data.
00:15:06.039 When profiling allocations, we track backtraces, class information, and the current web request. Seeking an optimal frequency for acquiring allocation data is crucial to ensuring minimal application impact.
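To get a feel for this from plain Ruby, the objspace extension that ships with Ruby can record where objects were allocated; the profiler itself hooks allocations at the VM level and samples instead of recording everything:

```ruby
require "objspace"

# Record allocation sites for objects created inside the block.
ObjectSpace.trace_object_allocations do
  obj = "hello " * 3
  puts "#{obj.class} allocated at " \
       "#{ObjectSpace.allocation_sourcefile(obj)}:#{ObjectSpace.allocation_sourceline(obj)}"
end
```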
00:15:21.639 Each profiling method reflects a balance between data accuracy and overhead mitigation. Similarly, we employ heap live tracking to identify memory leaks by correlating allocation data with live object status.
00:15:43.839 Through routine checks, we can identify objects that have not been cleared from memory after garbage collection, indicating a potential memory leak issue.
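A rough illustration of that liveness idea using ObjectSpace::WeakMap, which only holds weak references, so whatever survives a GC corresponds to objects that are still alive (the real profiler does this tracking natively, per sampled allocation):

```ruby
tracked = ObjectSpace::WeakMap.new # weak references: entries disappear when objects are collected

leaked = [] # strong references keep these alive, simulating a leak
10.times do |i|
  short_lived = "temporary #{i}"
  long_lived  = "leaked #{i}"
  leaked << long_lived

  tracked[short_lived] = short_lived
  tracked[long_lived]  = long_lived
end

GC.start
puts "tracked objects still alive after GC: #{tracked.size}" # roughly just the 'leaked' ones
```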
00:16:06.679 An important takeaway from building this profiler is that it stretches the VM to its limits in order to capture all this data while keeping overhead low. There are still additional improvements we'd like to implement.
00:16:26.360 One major area for potential enhancement is regarding the allocation and heap profiling overhead. The Ruby VM's object allocation code utilizes both fast and slow paths, leading to implications for performance measurement.
00:16:48.920 When allocation tracing is enabled, the VM falls back to the slow path, so we see some additional overhead, particularly in micro-benchmarks. Therefore, we would like to improve our profiler to allow use of the fast allocation path even when tracing.
00:17:09.000 Another consideration is the handling of Ruby Ractors: when a Ractor gets garbage collected, it can disable active trace points, which introduces inconsistency in the profiler's data collection.
00:17:24.600 There are existing open bugs and PRs addressing this issue, which we hope to drive forward. Another topic of interest is more accurate sampling while managing signal handlers.
00:17:43.240 By sampling a Ruby thread's stack directly from a signal handler, we could achieve more accurate CPU profiles, something other language runtimes already do, but which is not currently safe in Ruby because of atomicity concerns.
00:18:03.560 If we could resolve any atomicity issues, this enhancement would significantly benefit profiling efforts. Many improvements in Ruby have been made recently by the community, making profiling lives much easier.
00:18:20.879 We also use Rust in the ddtrace gem for storing and serializing the profiling data; relying on Rust for this gives us both efficiency and memory safety.
00:18:32.800 Because the Rust parts are shipped precompiled, users get those benefits without needing to have Rust installed. The binary gem lets us manage these dependencies while keeping installation easy and straightforward.
00:19:02.800 The combination of C and Rust enhances the profiler's performance and data management capabilities, and the download numbers for the ddtrace gem show that it has seen strong adoption.
00:19:21.600 In closing, I want to emphasize why always-on production profiling is a crucial tool in your development toolkit. One key advantage is its ability to investigate problems whenever they arise.
00:19:42.399 When issues occur, you can immediately analyze past performance and understand what was happening moments before the problem surfaced.
00:20:01.919 Always-on profiling allows identification of anomalies based on immediate historical data, making it essential in high-demand production environments.
00:20:25.679 Furthermore, it aids in reducing latency by providing insights into performance bottlenecks, enabling optimizations that improve user experiences.
00:20:49.640 You can also benefit from identifying cost-efficient opportunities where resource consumption can be reduced, thereby lowering operational costs.
00:21:15.080 Lastly, one of my favorite reasons for utilizing a profiler is the learning and discovery process it fosters. Understanding what your application is doing allows for deeper insights.
00:21:33.040 When you join a new team or service, profiling exposes operational behaviors, revealing important code paths and behaviors.
00:21:51.520 Even if you have been working with a service for years, comparing your mental model to profiling data can lead to new opportunities for learning.
00:22:08.320 You might discover discrepancies between assumptions and reality, highlighting areas for improvement. Thank you for your time. Let's work together to make Ruby apps even faster.
00:22:36.800 Here is the link to my slides, and I would love to discuss further. Please feel free to contact me. By the way, Datadog is hiring, and I want to thank them for sponsoring my trip here.