Daisuke Aritomo

The Depths of Profiling Ruby

RubyKaigi 2024

00:00:19.600 Hi!
00:00:19.600 Thank you. My talk is about the depths of profiling Ruby. I just built a new Ruby profiler, and today I'm going to share my journey and learnings from the experience.
00:00:26.519 Let me introduce myself first. My name is Daisuke Aritomo; people call me 'Show' on the internet, so feel free to call me that. I'm also a member of the NOC, the team running the conference Wi-Fi. I was running around the venue yesterday, so if you have any connection issues, feel free to let me know.
00:00:36.680 I'm also a big fan of Poni, which are these delicious rice dumplings. I eat them about three times a week throughout the year, so I'm quite sure I'm in the top 1% of fans among Rubyists. One advertisement: SmartBank, the company I work for, is a hydration sponsor of RubyKaigi, and we're giving out free drinks like Coca-Cola and iced tea outside the large hall. Please stop by and grab a drink!
00:01:15.520 Now, let's get back on track. What are profilers in the first place? The dictionary states that a profiler is a person whose job is to collect useful information about somebody or something so that they can provide a description of them. What does this mean in the context of programming? A profiler in programming is a tool injected into a program to analyze its performance. These tools are often used to reveal the hotspots, which are sections of code that are making your program slow.
00:01:43.799 The figure on the right is called a flame graph, which is a popular visualization created by profilers. It indicates the hotspots and heavy methods by representing time-consuming methods with horizontally longer bars. In this case, you'd want to focus on the right part of the graph to discover why your program is slow.
00:02:13.680 Today, I would like to announce my new Ruby profiler, which is called PF2. I've been working on this for a while. PF2 focuses on analyzing Ruby programs from various perspectives. Let me introduce some features. It's very easy to use; you don't need to modify your code. All you need to do is put 'pf2' in front of your Ruby command.
00:02:48.239 PF2 is capable of classic profiling: it finds time-consuming methods and displays them in a flame graph, as I introduced earlier. In this example, you can see that String#split is taking a long time. Profiling is multi-thread aware, so this flame graph shows the performance of a program running four threads. The little bars in the middle indicate when each thread was actively working.
00:03:10.799 An interesting feature of PF2 is its ability to show statistics about threads, such as how much time a thread spent in garbage collection or wasted while waiting for the Global Virtual Machine Lock (GVL). Since in Ruby only one thread can execute Ruby code at a time, it's important to know whether your threads are starving for CPU time. Based on this information, you might consider adjusting the number of threads if you're running a web server.
00:03:59.920 PF2 is also capable of profiling inside C extensions, which I believe is unique to my profiler. One widely used C extension is the mysql2 gem, which serves as a MySQL client. If a bottleneck exists inside its query method, it can be challenging to pinpoint why your query is taking a long time. With the C extension profiling feature enabled, you can see the blue section on the upper side, which represents the function calls made within C extensions.
00:04:39.160 For example, the mysql2 query method breaks down into three main parts: sending the query, waiting for the results, and receiving the results. Examining this can help you understand why a particular query is taking a long time, and it may give you hints for improving performance. This feature also helps you find methods that stay hidden in ordinary Ruby-level profiles. I've displayed a flame graph on the right side. At first glance it looks like an ordinary flame graph, but actually, some methods are not showing up here. Can anyone find them?
00:05:27.559 For instance, Hash#[] is consuming considerable time, yet it doesn't appear in the Ruby flame graph. This is because Hash#[] is implemented and optimized at the CRuby level and therefore never reveals itself as a Ruby-level frame. However, with the C extension profiling feature, you can uncover the call to rb_hash_aref, the C function behind Hash#[]. With this knowledge, you can get much closer to the actual bottlenecks in your program.
00:06:56.560 Let's talk about the visualizations PF2 offers. It supports several visualization modes provided by the Firefox Profiler. Of course, there's the flame graph, but there's also the call tree, which presents the same data as the flame graph in tree form. Additionally, there's an inverted call tree, which is useful for finding methods that are called very frequently: a single call might take almost no time, but if it is called repeatedly, it can significantly affect performance.
00:07:54.270 Why do we need so many features in a profiler? Because bottlenecks often occur in unexpected, hard-to-guess places. We programmers tend to think we know where the bottlenecks are, but that's not always true, especially when the bottleneck is not in our own code. For example, spawning 50 threads in a Puma server will not make your server 50 times faster, because of the limitations imposed by the GVL. You would need to understand both Puma and the GVL very well to guess where the problem really is.
00:08:14.960 However, if the profiler understands these limitations for you, it can guide you toward a solution. Building a profiler like this isn’t just like writing a regular Ruby program. Let's delve into the depths of profiling Ruby. First, let's understand the profiler's mission.
00:09:09.560 A profiler's mission is to provide accurate and informative data without being misleading. It also needs to minimize overhead, since the profiler itself shouldn't become a bottleneck; if collecting a particular piece of information incurs excessive overhead, then perhaps we shouldn't be collecting it. Of course, the tool should also be user-friendly.
00:09:42.240 Let's see how PF2 achieves these objectives by looking at its lifecycle. The profiler lifecycle consists of three stages: first is the sample collection stage, which records the target program's execution state while it runs. Next, after the program ends, the unwinding and aggregation stage processes the raw samples and enriches them with additional information. Finally, we move on to visualizing the collected data.
00:10:31.600 Each stage has its own challenges, starting with sample collection. A sampling profiler collects samples every few milliseconds while the target program runs. A sample captures the program's execution state at that moment: which Ruby method was running, which underlying C function it was in, and which thread held the GVL. Understanding the sampling profiler's timeline sheds light on how this works.
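To make this concrete, here is a minimal sketch of what a single sample record might contain. The struct and field names are hypothetical, not PF2's actual data layout; it simply groups the items mentioned above.

```c
/* Hypothetical sketch of one sample record (not PF2's actual layout). */
#include <stdint.h>
#include <ruby/ruby.h>

#define MAX_RUBY_FRAMES   256
#define MAX_NATIVE_FRAMES 256

struct sample {
    uint64_t  timestamp_ns;                   /* when the sample was taken */
    VALUE     thread;                         /* the Ruby Thread being sampled */
    int       ruby_depth;                     /* number of Ruby frames captured */
    VALUE     ruby_frames[MAX_RUBY_FRAMES];   /* Ruby frame handles (resolved to names later) */
    int       ruby_lines[MAX_RUBY_FRAMES];    /* line number for each Ruby frame */
    int       native_depth;                   /* number of native (C) program counters */
    uintptr_t native_pcs[MAX_NATIVE_FRAMES];  /* raw program counters, symbolized later */
};
```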
00:11:39.360 Here's a diagram illustrating the lifecycle of a program. It starts from the main function, and as functions are called, the profiler captures samples every few milliseconds. For instance, if it's configured to sample every 10 milliseconds, it logs the execution state at each of those points, and those periodic snapshots are what we later analyze.
00:12:59.790 Collecting samples is one of the most exciting aspects of building a profiler. Profilers work by observing the execution state from outside the target program. While it is technically possible to gather this information from inside a program, it’s impractical to wrap each method call with timing code and compute the elapsed time. Instead, we leverage Ruby's capabilities to gather information about the current execution state.
00:13:48.720 Ruby's C implementation, CRuby, exposes several APIs that facilitate profiling. They allow us to capture thread states, garbage collection events, and execution information without significant intrusion. Some key APIs we use are rb_profile_thread_frames(), the TracePoint API, and the thread event hook API.
00:14:52.559 The first notable API, rb_profile_thread_frames(), accepts a thread as an argument and retrieves that thread's current stack trace. The TracePoint API is valuable for monitoring events like garbage collection: CRuby fires internal events around GC activity, which tells us the garbage collection state. TracePoint hooks do add overhead, though, so we have to be careful that they don't become a performance penalty themselves.
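To give a feel for the first of these, here is a rough sketch of sampling one thread's stack in C. It assumes Ruby 3.3 or later, where rb_profile_thread_frames() is available, and it resolves frame labels immediately for readability; a real profiler would only store the raw frames here and do the name resolution later, outside the sampling hot path.

```c
#include <stdio.h>
#include <ruby/ruby.h>
#include <ruby/debug.h>

#define MAX_FRAMES 256

/* Capture and print the current stack of a given Ruby thread. */
static void
sample_thread(VALUE thread)
{
    VALUE frames[MAX_FRAMES];
    int lines[MAX_FRAMES];

    /* Ruby 3.3+: like rb_profile_frames(), but for an arbitrary thread. */
    int depth = rb_profile_thread_frames(thread, 0, MAX_FRAMES, frames, lines);

    for (int i = 0; i < depth; i++) {
        VALUE label = rb_profile_frame_full_label(frames[i]); /* e.g. "Foo#bar" */
        printf("%2d: %s:%d\n", i,
               NIL_P(label) ? "(unknown)" : StringValueCStr(label),
               lines[i]);
    }
}
```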
00:16:00.760 Finally, we use the thread event hook API to observe the state of the GVL. When a thread changes its state, for example when it starts waiting for the GVL or gets scheduled onto it, CRuby notifies the profiler. This mechanism allows us to collect a wealth of information about Ruby threads.
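As an illustration, the thread instrumentation API added around Ruby 3.2 (rb_internal_thread_add_event_hook() in ruby/thread.h) lets native code subscribe to thread state changes such as "ready" (waiting for the GVL) and "resumed" (acquired the GVL). The sketch below only counts transitions; the exact event semantics and the contents of event_data differ between Ruby versions, so treat it as an approximation rather than PF2's actual implementation.

```c
#include <stdatomic.h>
#include <ruby/ruby.h>
#include <ruby/thread.h>

/* Rough counters: how often any thread waited for, then reacquired, the GVL. */
static _Atomic long ready_count   = 0;
static _Atomic long resumed_count = 0;

static void
thread_event_cb(rb_event_flag_t event,
                const rb_internal_thread_event_data_t *event_data,
                void *user_data)
{
    if (event == RUBY_INTERNAL_THREAD_EVENT_READY)   atomic_fetch_add(&ready_count, 1);
    if (event == RUBY_INTERNAL_THREAD_EVENT_RESUMED) atomic_fetch_add(&resumed_count, 1);
}

static void
install_gvl_hook(void)
{
    /* Ruby 3.2+: be notified when threads become ready / get scheduled. */
    rb_internal_thread_add_event_hook(
        thread_event_cb,
        RUBY_INTERNAL_THREAD_EVENT_READY | RUBY_INTERNAL_THREAD_EVENT_RESUMED,
        NULL);
}
```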
00:16:54.639 Now, let's discuss collecting C-level stacks, which is required for profiling C extensions. C stacks are managed quite differently from Ruby stacks. To accurately trace native execution, PF2 records the current CPU program counter, with the help of a library called libtrace. This data is essential for reconstructing the original stack trace later.
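On Linux, one common way to grab the program counter at sampling time is from the ucontext passed to an SA_SIGINFO signal handler. The sketch below is glibc/x86-64 specific and is not necessarily how PF2 does it; other platforms use different register names.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <ucontext.h>

/* Read the interrupted program counter inside a SIGPROF handler (Linux/x86-64). */
static void
sigprof_handler(int sig, siginfo_t *info, void *ucontext)
{
    ucontext_t *uc = (ucontext_t *)ucontext;
    uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    /* A profiler would store `pc` (and the caller frames walked from here)
     * into a pre-allocated buffer; see the ring buffer sketch below. */
    (void)sig; (void)info; (void)pc;
}
```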
00:18:07.920 Capturing stack traces effectively is only half the story; scheduling defines when to collect samples. PF2 employs timers to drive the collection interval. For instance, it can ask the kernel to send a signal after every 10 milliseconds of CPU execution; when that predefined signal, such as SIGPROF, arrives, it triggers the sampling process.
00:19:41.360 A good profiler's scheduler must support multiple definitions of time, including real-world time (also called wall-clock time), CPU time, and per-kernel-thread CPU time. This granularity allows the profiler to account for work accurately across multiple threads, giving a clearer picture of program performance.
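As a rough illustration, a POSIX profiler can request both kinds of interrupts with setitimer(): ITIMER_REAL counts wall-clock time and delivers SIGALRM, while ITIMER_PROF counts CPU time and delivers SIGPROF. This is a minimal sketch that reuses the hypothetical handler from the earlier snippet; per-thread CPU timers need additional, platform-specific machinery (for example timer_create() on Linux) that is omitted here.

```c
#include <signal.h>
#include <sys/time.h>

/* Handler from the earlier sketch. */
extern void sigprof_handler(int sig, siginfo_t *info, void *ucontext);

/* Arrange for a sampling signal every 10 ms.
 * use_cpu_time = 1: count CPU time only (SIGPROF);
 * use_cpu_time = 0: count wall-clock time (SIGALRM). */
static int
start_sampling_timer(int use_cpu_time)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = sigprof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(use_cpu_time ? SIGPROF : SIGALRM, &sa, NULL) != 0) return -1;

    struct itimerval interval = {
        .it_interval = { .tv_sec = 0, .tv_usec = 10000 },  /* repeat every 10 ms */
        .it_value    = { .tv_sec = 0, .tv_usec = 10000 },  /* first tick after 10 ms */
    };
    return setitimer(use_cpu_time ? ITIMER_PROF : ITIMER_REAL, &interval, NULL);
}
```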
00:20:50.240 During signal handling, only a very limited set of operations can be performed; in particular, no memory allocation is allowed. I've designed the signal handler so that collected data is written into a pre-allocated ring buffer and every call it makes is async-signal-safe.
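A common pattern for staying async-signal-safe is a fixed-size, pre-allocated ring buffer that the handler only writes into, leaving all parsing and allocation to a consumer thread. Below is a minimal single-producer sketch (assuming the signal handler is the only writer); it is not PF2's actual data structure.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 4096  /* must be a power of two */

struct slot { uint64_t timestamp_ns; uintptr_t pc; };

static struct slot ring[RING_SIZE];        /* allocated up front; never malloc'd in the handler */
static _Atomic uint32_t write_index = 0;

/* Called from the signal handler: no locks, no allocation, no calls that
 * aren't async-signal-safe. */
static void
ring_push(uint64_t timestamp_ns, uintptr_t pc)
{
    uint32_t i = atomic_fetch_add_explicit(&write_index, 1, memory_order_relaxed);
    struct slot *s = &ring[i & (RING_SIZE - 1)];
    s->timestamp_ns = timestamp_ns;
    s->pc = pc;
}
```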
00:21:41.760 Choosing the appropriate sampling interval is vital for effective profiling. Shorter intervals yield more accuracy but also increase overhead. PF2 defaults to a 19-millisecond interval, a value chosen to avoid lockstep sampling, where samples land in sync with periodic behavior in the program and end up misrepresenting it.
00:23:12.560 Controlling overhead is paramount; I set a goal of keeping it under 5%. With a 19-millisecond interval, that gives each sampling pass a budget of about 950 microseconds; if sampling takes longer than that, the approach has to be reconsidered. As some of you may have noticed, much of this work resembles systems programming, which is why I chose Rust for implementing PF2. Rust's safety features gave a good balance between performance and safety, though some 'unsafe' code was unavoidable.
00:24:58.160 Now, let's move on to the unwinding and aggregation stage. PF2 collects minimal information during sampling to keep overhead low. After the program finishes, it unwinds the full stack traces, which is a more resource-intensive operation. We use various tools and libraries to symbolize and unwind the collected stacks, allowing us to construct coherent representations of the Ruby and C stacks.
00:25:21.560 Since the Ruby and C stacks are collected separately, we need to weave them together for a comprehensive analysis. The algorithm lines up the points where execution transitions between Ruby and C, effectively merging the two stack traces into a single representation.
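To give a feel for the idea, here is a deliberately simplified sketch, not PF2's actual algorithm: native frames that sit below the interpreter loop stay at the bottom, the Ruby frames are spliced in where the VM starts executing Ruby code, and the native frames above the last VM frame (typically C-extension internals) end up on top. The is_vm_frame() predicate is hypothetical, and real stacks with nested Ruby-to-C-to-Ruby calls need more care than this.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical predicate: does this symbolized native frame belong to the
 * CRuby interpreter loop (e.g. vm_exec_core)? Real detection is more involved. */
static int
is_vm_frame(const char *name)
{
    return strstr(name, "vm_exec") != NULL;
}

/* Print a merged stack, outermost frame first: native frames up to the VM
 * loop, then the Ruby frames, then the remaining native frames. */
static void
weave(const char **native, int n_native, const char **ruby, int n_ruby)
{
    int i = 0;
    while (i < n_native && !is_vm_frame(native[i])) printf("C    %s\n", native[i++]);
    for (int j = 0; j < n_ruby; j++)                printf("Ruby %s\n", ruby[j]);
    while (i < n_native && is_vm_frame(native[i]))  i++;   /* skip interpreter frames */
    while (i < n_native)                            printf("C    %s\n", native[i++]);
}
```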
00:26:10.520 Finally, we discuss the importance of visualization. Good visualization informs us and provides insights into performance issues. While I initially thought of implementing a custom visualizer, I opted to leverage existing ones, notably the Firefox profiler, which is well-integrated and continuously developed.
00:27:14.560 In conclusion, profilers need to collaborate with the runtime and the kernel to collect the right information. Careful scheduling balances overhead against the value of the collected data, and the heavy work of unwinding stack traces is deferred until after the program has run.
00:27:57.919 As Ruby evolves, its profiling features must also adapt to maintain relevancy in a continually advancing ecosystem. Profilers are inherently linked to the runtime implementation; thus, they must innovate alongside language developments.
00:28:07.440 One emerging trend is 'always-on' or continuous profiling. To achieve this, we need to further minimize overhead in profiling processes.
00:28:12.320 Lastly, I would like to express my gratitude to all who supported my work on PF2, especially the Ruby Association for the 2023 grant program, and my mentors Kokan and Kian for their valuable advice. Please try out PF2 and provide me with your feedback.