LA RubyConf 2010

Everything You Ever Wanted To Know About Threads And Fibers But Were Afraid To Ask

by Joe Damato and Aman Gupta

In the video titled "Everything You Ever Wanted To Know About Threads And Fibers But Were Afraid To Ask," presented at the LA RubyConf 2010 by Joe Damato and Aman Gupta, the focus is on enhancing Ruby's threading model and addressing associated performance issues. The speakers explore various technical challenges with Ruby's implementation of threads, specifically how it manages timers and the performance overhead that arises from context switching.

Key points discussed include:

  • Using 'strace' with Ruby: The talk begins with using 'strace' to diagnose Ruby's threading performance. It reveals that Ruby receives frequent SIGVTALRM signals due to the timer it sets up for green threads when built without the '--enable-pthread' option.
  • Managing SIGVTALRM signals: The speakers explain how the timer affects Ruby code execution once threads are created, emphasizing that the timer keeps firing even after all threads have finished.
  • Patches for performance: The presenters describe a patch developed to stop the timer when the last thread terminates, significantly improving Ruby's performance by reducing unnecessary interruptions.
  • Impact of enabling 'pthreads': There is a detailed discussion of how enabling 'pthreads' affects Ruby's VM, dispelling the misconception that it simply turns on native threads; its real implications for context management and signal handling can complicate matters.
  • Functionality of 'ucontext': The video outlines how functions like 'getcontext' and 'setcontext' can lead to excessive calls to 'sigprocmask', impacting performance adversely.
  • EventMachine performance: The speakers analyze performance issues in the EventMachine gem, particularly when dealing with IO and threading, and how standard profiling tools helped them diagnose bottlenecks.
  • Identifying 'memcpy' delays: Through the use of 'ltrace', they pinpointed delays caused by 'memcpy' operations during context switching, highlighting the need for optimization.
  • Zero-copy threading patch: Finally, the presentation culminates in detailing a zero-copy threading patch aimed at avoiding unnecessary data copying during thread context switches, leading to efficiency improvements.

The main takeaway from this session is the critical understanding of Ruby's threading model and the ongoing efforts to optimize it for better performance in high-load scenarios. Developers are encouraged to remain aware of these threading issues in their applications, ensuring Ruby can evolve to meet the demands of modern programming techniques.

00:07:46.560 If you use 'strace' on the command line, it's essentially a wrapper around the kernel's 'ptrace' facility. It doesn't provide especially fine-grained output, but it can be very useful. So, let's discuss how to 'strace' Ruby.
00:08:00.000 When you 'strace' Ruby, you may notice that SIGVTALRM signals come up frequently. What does that mean? It turns out that Ruby uses a system call named 'setitimer', together with signals, to manage green threads.
00:08:13.560 This behavior only occurs if Ruby is built without the '--enable-pthread' option, which we will discuss shortly. The first time a new thread is created, Ruby calls 'setitimer' to establish a timer, instructing the kernel to send it a SIGVTALRM signal every 10 milliseconds so it can keep tabs on what's happening.
00:08:31.039 When Ruby receives this signal, it triggers a handler called 'catch_timer'. The call to 'posix_signal' is responsible for attaching the handler to the signal.
00:08:43.279 Now, let's look at the Ruby VM code, in an abbreviated version of what's occurring. The 'rb_thread_start' function runs every time a new thread starts. The first time a thread starts, it flips a flag indicating that the timer has been established. The 'posix_signal' function attaches the 'catch_timer' handler, and the timer-start routine calls 'setitimer', requesting signals every 10 milliseconds to drive scheduling.
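To make the mechanism concrete, here is a minimal sketch in C of the pattern described above; the helper names ('on_thread_start', 'thread_start_timer') are illustrative, not the actual 1.8 source:

    #include <signal.h>
    #include <sys/time.h>

    static int timer_started = 0;

    /* handler invoked on every SIGVTALRM; the real VM sets a flag here
     * that the interpreter polls to decide when to switch green threads */
    static void catch_timer(int sig)
    {
        (void)sig;
    }

    static void thread_start_timer(void)
    {
        struct itimerval tval;

        signal(SIGVTALRM, catch_timer);      /* attach the handler */
        tval.it_interval.tv_sec  = 0;
        tval.it_interval.tv_usec = 10000;    /* fire every 10 ms */
        tval.it_value = tval.it_interval;
        setitimer(ITIMER_VIRTUAL, &tval, 0); /* ITIMER_VIRTUAL delivers SIGVTALRM */
    }

    /* called whenever a thread starts; arms the timer only once */
    static void on_thread_start(void)
    {
        if (!timer_started) {
            timer_started = 1;
            thread_start_timer();
        }
    }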
00:09:10.720 If you trace Ruby, you will observe this behavior. You'll see a call to 'setitimer', along with the SIGVTALRM signals as they come in. The problem arises, however, once you start even one thread. Even after all your threads terminate, the timer continues to issue interrupts every 10 milliseconds, which disrupts your code; this is quite a problem. If you trace your Ruby code and see this despite never using threads, you might wonder why. Well, libraries like 'net/http' rely on timeouts, and the standard timeout implementation is itself based on threads.
00:09:51.880 So once a single thread has been started, the timer keeps sending SIGVTALRM signals to your Ruby process indefinitely, interrupting it, and that's problematic.
00:10:15.399 To address this, we developed a patch for the Ruby VM to halt the thread timer. The implementation is fairly straightforward: it checks whether the exiting thread is the last one and, if so, disables the thread timer so execution of Ruby code is no longer interrupted.
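The core of such a fix is small; a hedged sketch, reusing the setup above:

    /* when the last Ruby thread exits, disarm the interval timer so
     * SIGVTALRM stops firing; an all-zero itimerval disables it */
    static void thread_stop_timer(void)
    {
        struct itimerval tval;

        tval.it_interval.tv_sec  = 0;
        tval.it_interval.tv_usec = 0;
        tval.it_value = tval.it_interval;
        setitimer(ITIMER_VIRTUAL, &tval, 0);
        timer_started = 0;
    }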
00:10:28.360 When you trace again, you can see the timer initiating, threads spawning, alarms firing, and ultimately, the timer being turned off. This resulted in a significant performance improvement. Our code began running faster, and we were no longer concerned about timer interruptions.
00:10:43.320 The next substantial threading performance enhancement came from our production environment, which runs Debian. We ran 'strace' on our Ruby process because we noticed it was operating exceptionally slowly, and we identified a significant number of calls to 'sigprocmask', leading us to question the reason for such an excessive count.
00:11:13.519 Upon further investigation, we discovered that enabling 'pthreads' changes what the Ruby VM does. Many people mistakenly believe that enabling 'pthreads' at configure time directs Ruby to use native threads. It actually means using a single native thread for timing within the green thread implementation.
00:11:36.000 Enabling 'pthreads' is also beneficial for maintaining compatibility with libraries like Ruby Tk, among others. If you compare the builds, you will notice that enabling 'pthreads' brings additional defines into play, which in turn create the timer thread.
00:12:18.760 Among the other changes is the activation of 'getcontext' and 'setcontext'. These functions are part of a user-level context-switching API known as 'ucontext'. Ruby can use either 'setjmp'/'longjmp' or 'getcontext'/'setcontext' for its threading implementation as well as for exception handling.
00:12:38.280 These functions save and restore the current CPU state, allowing a state to be saved, other code to be executed, and the state from before to be restored afterward. 'setjmp'/'longjmp' operate similarly to 'getcontext'/'setcontext', but 'ucontext' is a more advanced version that allows the saved state to be modified.
00:12:54.079 One downside of using 'ucontext' and associated functions is that they save and restore the signal mask, which consequently results in the excessive calls to 'sigprocmask' we observed earlier.
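The effect is easy to reproduce outside Ruby. In this minimal, self-contained example, every 'getcontext' and 'swapcontext' call on glibc saves or restores the signal mask, so running it under 'strace' shows 'rt_sigprocmask' calls on each switch:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;
    static char co_stack[64 * 1024];

    static void coroutine(void)
    {
        puts("in coroutine");   /* runs on co_stack */
    }

    int main(void)
    {
        getcontext(&co_ctx);                 /* saves CPU state and signal mask */
        co_ctx.uc_stack.ss_sp   = co_stack;
        co_ctx.uc_stack.ss_size = sizeof(co_stack);
        co_ctx.uc_link          = &main_ctx; /* resume main when coroutine returns */
        makecontext(&co_ctx, coroutine, 0);
        swapcontext(&main_ctx, &co_ctx);     /* each swap costs sigprocmask calls */
        puts("back in main");
        return 0;
    }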
00:13:08.240 To rectify this issue, a simple patch was formulated. It adds a new flag to Ruby called 'disable_ucontext': the timer thread is still used, but 'sigprocmask' is no longer invoked on every context switch.
00:13:27.760 With this patch, tracing again reveals that the 'sigprocmask' calls have diminished, and Ruby has become approximately 30% faster, which is quite impressive.
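For contrast, POSIX specifies that '_setjmp' and '_longjmp' behave like 'setjmp'/'longjmp' except that they never manipulate the signal mask; context switching built on mask-free primitives like these avoids the 'sigprocmask' traffic entirely. A small demonstration:

    #include <setjmp.h>
    #include <stdio.h>

    int main(void)
    {
        jmp_buf env;

        if (_setjmp(env) == 0) {  /* saves registers only, no signal mask */
            puts("first pass");
            _longjmp(env, 1);     /* restores registers, no sigprocmask */
        }
        puts("second pass");
        return 0;
    }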
00:13:36.040 In addition to these enhancements, I maintain the EventMachine gem. There has been a longstanding issue where EventMachine, when used together with IO, threads, and the Ruby VM, suffers significant performance slowdowns.
00:13:51.120 I decided it was time to investigate, especially since we were using 'net/http' in our code, which would start this timer, causing the timer signals to significantly disrupt the performance of our application. So I set out to build a simple reproduction case.
00:14:18.160 EventMachine manages network IO by allocating large buffers on the stack for incoming data before transferring that data to Ruby. So I constructed a simple C extension that allocated a sizable buffer on the stack and then handed control back to Ruby to execute various threading operations.
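A hypothetical version of that reproduction (the extension and method names here are invented for illustration) might look like this:

    #include "ruby.h"

    /* allocate a large buffer on the C stack, then re-enter Ruby; any
     * green thread switch inside the block now has to carry this buffer
     * along as part of the copied stack */
    static VALUE big_stack_yield(VALUE self)
    {
        volatile char buf[64 * 1024]; /* stand-in for EventMachine's read buffer */

        buf[0] = 0;                   /* touch it so it is not optimized away */
        return rb_yield(Qnil);
    }

    void Init_repro(void)
    {
        rb_define_global_function("big_stack_yield", big_stack_yield, 0);
    }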
00:14:36.000 We decided to utilize Google Performance Tools as the profiler. This profiler is widely used internally at Google, and it's highly effective. You can download, compile, and link it to your application or preload it.
00:14:51.960 On Linux, you set the 'LD_PRELOAD' environment variable; on macOS, the equivalent is 'DYLD_INSERT_LIBRARIES'. This way, any binaries you launch will load the library before executing anything else. Once the library is loaded, setting an environment variable named 'CPUPROFILE' to a filename will produce performance statistics in that file when the binary finishes executing.
00:15:17.760 You will receive extensive statistics saved in the specified file. Running the provided 'pprof' Perl script on that file generates annotated output that is immensely useful. Not only does the profiler produce valuable textual output, but it can also create highly informative graphs for easy analysis.
00:15:38.240 Benchmarking the threading issues in EventMachine, I saw some strong candidates for the source of the delay, particularly 'thread_save_context' and 'thread_restore_context'. Unexpectedly, though, I found them spending their time in 'memcpy'.
00:15:57.840 To verify my findings, I employed another tool called 'ltrace', which is similar to 'strace' but tracks library calls instead of system calls. The results were similar: 'memcpy' indeed emerged as a major source of delay.
00:16:18.080 I investigated further, following the calls from 'thread_save_context' and 'thread_restore_context' into the C code. I observed that 'thread_save_context' first calls 'setjmp', which saves the CPU state. It then takes the thread's stack pointer and uses 'memcpy' to copy a significant amount of data.
00:16:43.440 It turned out this operation was performing a full copy of the entire thread stack, incurring considerable overhead in the process. In addition, many global variables were being saved, which costs more time again when Ruby restores the context.
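In outline, the save path looks something like this (a simplified sketch, not the actual eval.c source):

    #include <setjmp.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        jmp_buf context;  /* saved CPU registers */
        char   *stk_ptr;  /* heap copy of this thread's stack (starts NULL) */
        size_t  stk_len;  /* bytes currently saved */
    } green_thread;

    /* stack grows downward: everything between sp and stack_base is live */
    static void save_context(green_thread *th, char *stack_base, char *sp)
    {
        if (setjmp(th->context) == 0) {
            th->stk_len = (size_t)(stack_base - sp);
            th->stk_ptr = realloc(th->stk_ptr, th->stk_len);
            memcpy(th->stk_ptr, sp, th->stk_len); /* the costly copy, repeated
                                                     on every thread switch */
        }
    }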
00:17:10.040 The benchmark that revealed these performance pitfalls showed how much data was being copied back and forth every time a thread switched. Working through GDB, I examined various thread stacks and identified a pattern.
00:17:29.560 I repeatedly unpacked the stack to understand its contents in depth. Rather than delving into GDB's more complex features, we explored the stack traces to discern where the performance issues were arising.
00:17:49.720 The stack traces revealed how extensive the overhead was for every call; in particular, Ruby methods tend to have considerable stack allocations. Each 'rb_eval' frame accounted for about a kilobyte of memory, indicating just how costly thread switching could become.
00:18:12.080 Once we understood the stack problem, we could see our Ruby process copying multiple kilobytes on a single thread switch, which means significant time spent around context switches in heavy applications such as Rails.
00:18:30.440 We ultimately sought to eliminate the unnecessary copies: stacks are merely regions of memory, so by pointing the CPU at a designated heap region rather than copying data back and forth, significant performance increases become possible.
00:18:55.040 As things stood, the Ruby process was spending excessive time swapping entire stacks in and out, which hurts response times. In aiming to eliminate that overhead by changing how stack frames are managed, we had to consider the implications of our changes carefully.
00:19:12.560 With the zero-copy threading patch we created, we improved performance by ensuring that stack allocations and contexts could be swapped out directly in memory instead of being regularly copied. In doing so, we significantly reduced switching penalties.
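A minimal illustration of the heap-stack idea, using the portable 'ucontext' API for clarity (which, as noted earlier, pays its own 'sigprocmask' cost; this sketches the concept, not the patch itself):

    #include <stdlib.h>
    #include <ucontext.h>

    #define STACK_SIZE (128 * 1024)

    static ucontext_t scheduler_ctx;

    /* give each thread its own heap-allocated stack up front */
    static ucontext_t *spawn(void (*entry)(void))
    {
        ucontext_t *ctx = malloc(sizeof(*ctx));

        getcontext(ctx);
        ctx->uc_stack.ss_sp   = malloc(STACK_SIZE);
        ctx->uc_stack.ss_size = STACK_SIZE;
        ctx->uc_link          = &scheduler_ctx; /* return here when entry exits */
        makecontext(ctx, entry, 0);
        return ctx;
    }

    /* a switch now just redirects the CPU to the other stack; no
     * memcpy of stack contents in either direction */
    static void switch_to(ucontext_t *from, ucontext_t *to)
    {
        swapcontext(from, to);
    }

The scheduler would call switch_to(&scheduler_ctx, thread_ctx); 'swapcontext' fills in 'scheduler_ctx' as it leaves, so the 'uc_link' return target is valid.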
00:19:37.760 This new mechanism for managing threads resembles a lightweight form of fibers in how it interacts with memory. The key challenge remained: how could we keep the existing threading semantics while eliminating the context-switching inefficiencies?
00:19:59.040 We ultimately benchmarked our thread implementations against conventional Ruby threads and demonstrated how the new zero-copy approach increased efficiency, leading to faster response times in high-load applications.
00:20:21.760 Through these targeted optimizations, we managed to improve Ruby's thread scheduling and overall performance without restructuring the entire VM.
00:20:36.560 From our performance trials, we also recognized that these advancements can support Ruby's future as it continues to grow significant tooling and frameworks around performance.
00:20:56.479 To benefit from these evolving improvements, developers should stay informed about threading performance in their applications. As Ruby's support for concurrent programming continues to develop, it will keep setting new performance benchmarks.
00:21:15.840 Thank you for listening, and feel free to address any questions regarding performance or threading. Your curiosity and engagement fuel the advancement of Ruby.