00:07:46.560
If you use 'estrace' on the command line, it's sort of like a wrapper around 'strace'. It doesn't provide as fine-grained output, but it can be useful. So, let's discuss how to 'estrace' Ruby.
00:08:00.000
When you 'estrace' Ruby, you may notice that SIGVTALRM signals come up frequently. What does that mean? It turns out that Ruby utilizes a system call called 'setitimer' and signals to manage green threads.
00:08:13.560
This behavior only occurs if Ruby is not built with the 'enable-pthread' option, which we will discuss shortly. When a new thread is created for the first time, Ruby calls 'setitimer' to establish a timer. It instructs the kernel to send it a SIGVTALRM signal every 10 milliseconds to keep tabs on what's happening.
00:08:31.039
When Ruby receives this signal, it triggers a handler called 'catch_timer'. The call to 'posix_signal' is responsible for attaching the handler to the signal.
00:08:43.279
Now, let's look at the Ruby VM code, which provides an abbreviated version of what's occurring. The 'rb_thread_start' function is called every time a new thread starts. The first time a thread starts, it flips the 'thread' flag, indicating that the timer has been established. The 'posix_signal' function attaches the 'catch_timer' handler, and 'RV_thread_start_timer' calls 'set_timer', affirming that we want to receive signals every 10 milliseconds to manage timing.
00:09:10.720
If you trace Ruby, you will observe this behavior. You'll see a call to 'setitimer', along with the SIGVTALRM signals as they come in. The problem arises, however, when you start one thread. Even after all your threads terminate, the timer continues to issue interrupts every 10 milliseconds, which disrupts your code—this is quite a problem. If you trace your Ruby code and notice no use of threads, you might wonder why this continues. Well, libraries like 'net-http' leverage timeouts, and 'NSP' also employs timeouts that are based on threads.
00:09:51.880
Once you initiate a single thread, the timeout mechanism continues to send SIGVTALRM signals to your Ruby process, which can interrupt it indefinitely, and that's problematic.
00:10:15.399
To address this, we developed a patch for the Ruby VM to halt the thread timer. The implementation is fairly straightforward: it verifies if it is the last thread and, if so, disables the thread timer to prevent interruptions in the execution of Ruby code.
00:10:28.360
When you trace again, you can see the timer initiating, threads spawning, alarms firing, and ultimately, the timer being turned off. This resulted in a significant performance improvement. Our code began running faster, and we were no longer concerned about timer interruptions.
00:10:43.320
The next substantial performance enhancement we implemented for threading was in our production environment utilizing Debian. We used 'estrace' on our Ruby process because we noticed it was operating exceptionally slowly. We attached 'estrace' and identified a significant number of calls to 'sigprocmask', leading us to question the reason behind such an excessive count.
00:11:13.519
Upon further investigation, we discovered that when you enable 'pthreads', it changes what the Ruby VM does. Many people mistakenly believe that enabling 'pthreads' in the configuration is directing Ruby to use native threads. However, it actually indicates using the native thread on the system for timing in the green thread implementation.
00:11:36.000
Enabling 'pthreads' is also beneficial for maintaining compatibility with libraries like Ruby Tk, among others. When you compare the differences that arise from enabling 'pthreads', you will notice additional defines come into play that subsequently create the timer thread when enabling 'pthreads'.
00:12:18.760
Among the other changes that occur is activating 'getcontext' and 'setcontext'. These functions are part of a kernel system known as 'ucontext'. Ruby can use either 'setjmp'/ 'longjmp' or 'setcontext'/'getcontext' for its threading implementation as well as for exception handling.
00:12:38.280
These functions save and restore the current CPU state, allowing a state to be saved, code to be executed, and ultimately restoring the state from before execution. 'setjmp'/'longjmp' functions operate similarly to 'getcontext'/'setcontext', but 'ucontext' provides a more advanced version that allows for modifications to the internal state.
00:12:54.079
One downside of using 'ucontext' and associated functions is that they save and restore the signal mask, which consequently results in the excessive calls to 'sigprocmask' we observed earlier.
00:13:08.240
To rectify this issue, a simple patch was formulated. It instructs Ruby to enable a new flag called 'disable_ucontext'. This means that while still using the timer thread, we will not invoke 'sigprocmask'.
00:13:27.760
With this patch, tracing again reveals that the 'sigprocmask' calls have diminished, and Ruby has become approximately 30% faster, which is quite impressive.
00:13:36.040
In addition to these enhancements, I manage the event machine gem. There has been a longstanding issue with the event machine functionality when utilized alongside IO, threads, and the Ruby VM, causing significant performance slowdowns.
00:13:51.120
I decided it was time to investigate this issue, especially since we were using 'net-http' in our code, which would enable this timer thread, causing the timer signals to disrupt the performance of our application significantly. Thus, I set out to build a simple reproduction case.
00:14:18.160
Event machine manages network IO by allocating large buffers on the stack for incoming data before transferring that data to Ruby. Therefore, I constructed a simple C extension that allocated a sizable buffer on the stack before handing control back to Ruby to execute various threading operations.
00:14:36.000
We decided to utilize Google Performance Tools as the profiler. This profiler is widely used internally at Google, and it's highly effective. You can download, compile, and link it to your application or preload it.
00:14:51.960
On Linux, you set the 'LD_PRELOAD' environment variable; on macOS, an equivalent exists. This way, any binaries you launch will load the library prior to executing anything else. Once this library is loaded, setting an environment variable named 'CPU_PROFILE' to a filename will produce statistics about performance when the binary finishes executing.
00:15:17.760
You will receive extensive statistics saved in the specified file. Running a provided Perl script on that file generates coded output that is immensely useful. Not only does the profiler generate valuable textual output, but it can also create highly informative graphs for easy analysis.
00:15:38.240
Regarding the benchmarking of the threading issues faced in event machine, I saw some strong candidates for areas causing delay, particularly 'thread_save_context' and 'thread_restore_context'. However, I found them calling 'memcpy', which was unexpected.
00:15:57.840
To verify my findings, I employed another tool called 'ltrace', which is similar to 'strace' but tracks library calls instead of system calls. The execution proved similar; 'memcpy' indeed emerged as a major point of delay.
00:16:18.080
I investigated further, following the calls from 'thread_save_context' and 'thread_restore_context' to the C code. I observed that 'thread_save_context' initiates a call to 'setjmp', which saves the CPU state. The next step corresponds to 'memcpy' and accesses the thread's stack pointer, subsequently copying a significant amount of data.
00:16:43.440
I discovered that this operation was performing a full copy of the entire thread stack, taking up considerable overhead in the process. Additionally, many global variables were being saved, which affects performance when Ruby restores the context.
00:17:10.040
The benchmark that revealed these performance pitfalls indicated how much data was being copied back and forth every time a thread switched. Through meticulous workings of GDB, I examined various thread stacks and identified a pattern.
00:17:29.560
I regularly unpacked the stack to understand its contents in-depth. Instead of delving into GDB's complex parts, we explored the stack traces to discern where performance issues were arising.
00:17:49.720
The stack traces revealed how extensive the overhead was for every call; in particular, ruby methods tend to have considerable stack allocations. Each 'RB_eval' frame accounted for a kilobyte of memory, indicating just how costly thread switching could become.
00:18:12.080
After understanding the stacking problem, we noted our Ruby process executing multiple kilobytes in a single thread switch, leading to significant processing time around context switches in heavy applications, such as Rails.
00:18:30.440
We ultimately sought to eliminate unnecessary copies, as stacks are merely memory regions. By directing the CPU at a designated heap region without requiring excessive copying, significant performance increases could occur.
00:18:55.040
During this process, the Ruby process consumes excessive time that could affect response times because all the stack frames were being swapped out. By aiming to eliminate the overhead by modifying stack frame management, we considered the implications of our changes.
00:19:12.560
With the zero-copy threading patch we created, we improved performance by ensuring that stack allocations and contexts could be swapped out directly in memory instead of being regularly copied. In doing so, we significantly reduced switching penalties.
00:19:37.760
This new mechanism for managing threads resembles a lightweight form of fibers in how they interact with memory management. The key challenge remained: how could we maintain the benefits while reducing context switching inefficiencies?
00:19:59.040
We ultimately benchmarked our thread implementations against conventional Ruby threads and demonstrated how the new zero-copy approach increased efficiency, leading to faster response times in high-load applications.
00:20:21.760
Through insightful optimization, we managed to improve Ruby's performance without restructuring the entire framework, improving thread scheduling through efficiency.
00:20:36.560
Additionally, from performance trials, we acknowledged that these advancements could aid the future of Ruby as it continues to foster significant tooling and frameworks around performance.
00:20:56.479
To maximize these evolving improvements, developers should remain informed about threading performance in their applications. Thereupon, as Ruby evolves through the ongoing development of concurrent programming, it will lead to new benchmarks.
00:21:15.840
Thank you for listening, and feel free to address any questions regarding performance or threading. Your curiosity and engagement fuel the advancement of Ruby.