GoGaRuCo 2013

Thread Safety First

by Emily Stolfo

The video titled "Thread Safety First" features Emily Stolfo discussing the challenges and methodologies for writing thread-safe code in Ruby programming, particularly in the context of different Ruby implementations such as JRuby and Matz's Ruby Interpreter (MRI).

Key Points Discussed:

- Introduction to Concurrency in Ruby: Emily explains the difficulty of concurrency in Ruby, attributing it partly to the Global Interpreter Lock (GIL). She emphasizes the need for developers to become more aware of concurrency due to the rise of JRuby and its threading semantics, which differ from MRI.

- Demonstration of Threading Issues: The presentation begins with a demo using 200 threads that attempt to access and modify shared mutable data structures. The surprising errors that appear when running this code under different Ruby versions illustrate the unpredictability of thread behavior across implementations.

- Historical Context: Emily shares her own experience debugging concurrency in an open-source project and relates it to the insights gained from a keynote by Jose Valim on concurrency in Ruby.

- Concurrency Primitives in Ruby: The talk explains key primitives such as Mutex, ConditionVariable, and Queue, which can help manage shared mutable data. She discusses best practices for using mutexes—for example, avoiding them during I/O operations to prevent unnecessary waiting of threads.

- The Thundering Herd Problem: Emily describes this problem, in which many threads wake at once to compete for a single lock and all but one go back to waiting. She highlights that condition variables can alleviate it by allowing threads to be woken selectively.

- Testing for Concurrency: Emily emphasizes the importance of testing code under various Ruby implementations and thread counts, as successful execution in one implementation does not ensure it will function correctly in another.

- Real-world Examples from MongoDB: Throughout her talk, Emily refers to concurrency issues encountered in the Ruby driver for MongoDB, detailing how changes in the replica set’s state can lead to unstable shared states.

- Conclusions on Writing Safe Code: In closing, she stresses that while there is no universally thread-safe Ruby code, developers can adopt practices to ensure their code remains safe and efficient when considering the nuances of each Ruby implementation.

Main Takeaways:

- Understand the threading behavior of different Ruby implementations, notably the differences in their GIL and concurrency handling.

- Use concurrency primitives cautiously and wisely to manage shared mutable data.

- Prioritize thorough testing across implementations and under different threading conditions to uncover subtle issues.

- Continuously educate oneself on concurrency issues to write safer and more effective Ruby code.

00:00:21.119 Okay, so for our next talk, we have Emily Stolfo. Emily is from New York, my other favorite city, and she's been doing Ruby for three years. She works at MongoDB, where she is involved with their Ruby drivers. Emily is also on staff at Columbia University, where she teaches Ruby on Rails. Wow, they actually teach Rails at colleges now! That's kind of cool. Well, maybe she can talk about that later. Alright, great! So, Emily is going to talk to us about threading. Let's give a big round of applause for Emily!
00:01:04.720 Hi, I'm Emily, and I'm going to talk to you about writing thread-safe code in Ruby. I'll start out with a demo. This demo will use 200 threads that alternate between trying to update and access members of a set. I will run this code on JRuby and on Ruby 2.0. To begin, I'll show you what this script looks like. We will look at it in detail later. For now, I will spin up a thread 200 times. Half of those threads will access a member of this set, while the other half will try to update it. Using Ruby 2.0, I will run this three times. It works—I have never had issues with threads running Ruby, so I expect it will work.
00:01:26.840 Now, I'm going to change the script to use 10 threads and run the same thing using JRuby. As I mentioned, I have never had problems with threads in Ruby, so I expect it will work with JRuby as well. I will run it three times, and it seems to work just fine. However, now I'm going to increase it to 200 threads and see if it works just as well. I will run it once, and hopefully, it will work twice. Okay, that worked. What’s this runtime error: 'cannot add a new key into hash during iteration'? That looks pretty scary, but I only got it once. What’s going on here? There is no such thing as thread-safe Ruby code.
00:02:34.680 My name is Emily Stolfo, and I work for MongoDB, where I co-maintain the driver for the database. I teach Ruby on Rails at Columbia University, but I am not an expert in concurrency or writing thread-safe code. I have had an experience debugging a concurrency issue with an open-source library that I maintained about a year and a half ago. So, how many of you here have debugged a concurrency issue before? Now, raise your hand if you had fun doing it or if you would like to do it again. Right? No hands are up. I do not want to do this ever again. For those of you who haven't had the experience, trust me, you don't want to. It was exhausting for me, particularly for about eight days.
00:03:46.239 This talk is also inspired by a keynote done by Jose Valim at Ruby Kaigi, where he talked about concurrency from a high level in different Ruby implementations. He explained why we, as Rubyists, need to think about concurrency and running thread-safe code. After having seen that talk, I thought, 'Hey, I actually experienced what he was warning about, and I want to share it with all of you.' This talk focuses on JRuby and MRI (Matz's Ruby Interpreter) as well as YARV (Yet Another Ruby VM). I will not mention Rubinius, not because it is unimportant, but because I am focusing on what I know and my experience. So, please, if you are looking for information about Rubinius or other implementations and their concurrency, feel free to do your own research.
00:04:41.960 Let's revisit the code from the demo, but at a slightly slower pace this time. As I mentioned, I had 200 threads; every other thread attempted to access this shared data structure—a set, which is based on a hash. All the other threads were trying to update it. In Ruby 2.0, there were no problems with 200 threads; I only ran the code three times, so it's possible I did not see the bug in those runs. However, it ran fine with JRuby at 10 threads as well. In contrast, with JRuby and 200 threads, I encountered that runtime error, which is alarming. This is the type of error you don’t want to see when running your code in JRuby.
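The demo script itself is not reproduced in this transcript. A minimal sketch of the kind of script described above might look like the following; the method name `run_demo`, the set contents, and the error collection are assumptions made for illustration, not the speaker's code.

```ruby
require 'set'

# Hypothetical reconstruction of the demo: even-numbered threads read the
# shared set, odd-numbered threads update it, with no synchronization.
def run_demo(thread_count)
  members = Set.new(1..100) # shared mutable state, deliberately unprotected
  errors  = Queue.new       # thread-safe collector for any exceptions raised

  threads = thread_count.times.map do |i|
    Thread.new do
      begin
        if i.even?
          members.each { |m| m }  # reader: iterate over the set
        else
          members << (1000 + i)   # writer: add a new member
        end
      rescue StandardError => e
        errors << e.message       # e.g. a "new key during iteration" error
      end
    end
  end

  threads.each(&:join)
  errors.size
end

error_count = run_demo(200)
puts "threads that raised: #{error_count}"
```

The number of errors is not predictable. It depends on the implementation, the thread count, and scheduling, which is the point of the demo.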
00:05:39.880 After this experience, I started pondering writing thread-safe code and why I had never really thought about it while coding in Ruby. I learned it back in school with Java, but why has this not been relevant until now? I want to explore that, starting with concurrency in Ruby. After that, I will discuss writing thread-safe code, followed by testing concurrency. Testing is arguably the most challenging aspect of concurrency. Debugging concurrency issues is difficult because you often only see the problem after it occurs. If you have a race condition and your data becomes inconsistent, you'll see the inconsistency only after experiencing the problem.
00:06:06.280 That's what makes it so hard. When testing, you should focus on the areas of code that might be thread-unsafe and encourage the threads to do the wrong thing consistently, every time you run your tests. This is really hard. Regarding concurrency in Ruby: what makes it particularly interesting, if not difficult, is that there are different Ruby implementations, each with distinct threading behaviors. I will primarily focus on MRI (Ruby 1.9 and 2.0) and JRuby in this talk. I won't touch on other Ruby implementations during this discussion.
00:07:03.240 To illustrate this, I came up with a metaphor: threads are like music. Let's say we have a conductor; we'll refer to the conductor as the Global Interpreter Lock (GIL). The GIL determines which instrument plays which notes at any given time. A note represents a green thread—something that needs to execute. In contrast, a native OS thread is scheduled by the operating system. Thus, you have two entities: green threads (notes) and native threads (instruments). In Ruby 1.8, you only had one OS thread; hence, only one instrument could play at a time, meaning notes are played serially, never simultaneously. In Ruby 1.9, you could use multiple OS threads while still having the GIL, determining which instrument plays which note when.
00:08:37.200 However, with JRuby, you can use multiple OS threads without the GIL, allowing notes to be played simultaneously across several instruments. This can lead to concurrency issues, as different Ruby implementations have different semantics. The semantics of green threads, their scheduling behavior, and methodologies can significantly differ across Ruby versions. This difference means that code may sometimes operate correctly purely out of luck. You might be fortunate that your code hasn’t run under JRuby, or that the GIL kept everything orderly until now. It’s crucial to wake up to the realities of threading in Ruby.
00:10:08.000 In order to illustrate my personal experience with concurrency, I will use the Ruby drivers as an example since it's what I know best. To explain these concurrency issues in the Ruby driver, I ought to give a brief overview of MongoDB itself. MongoDB allows you to have replica sets, which consist of multiple nodes, typically comprising one primary and several secondaries. You can only write to the primary node, while reads can occur from either the primary or secondary nodes. Additionally, the Ruby driver implements connection pooling. When a thread arrives needing to perform an operation on a node, it must consult the cached state of the replica set that the driver has saved.
00:11:08.160 This becomes challenging, though, when a primary goes offline and another primary is promoted from the secondaries. Once the original primary comes back up, it becomes a secondary. The challenge arises from the mutable shared state among threads within the Ruby driver due to this dynamic. For example, in the initialization method of the Ruby driver, a list of nodes existing in the replica set is created, which represents a shared mutable state. The method 'connect_to_nodes' connects to a single seed node and retrieves the configuration, which details the additional nodes in the set.
00:12:01.440 Furthermore, there is a method called 'choose_node.' When a thread calls this method, it looks for either a primary or a secondary, traversing the list of nodes and returning the first one of the specified type. This can lead to a concurrency issue: if multiple threads update and access the shared data structure named 'nodes' at the same time, they may run into the runtime error I mentioned earlier. We often use hashes as caches, and since Set is built on top of Hash, being aware of thread safety around mutable shared state is vital for any Rubyist who uses hashes in their code.
00:14:05.000 Whenever you add a node to the node set, many small instructions are involved. In MRI, the GIL keeps other threads from running Ruby code in parallel during those instructions. JRuby has no GIL, so the operating system can interrupt a thread at any point while it executes them, and another thread can run at the same time. For instance, if you are reading or iterating through this set, simultaneous updates to it might lead to issues you don't want to see. Therefore, being mindful of the differences in threading implementation is essential. In summary, writing thread-safe code is not about formulas, but rather about rules of thumb and practices that enhance safety.
00:15:10.579 Writing thread-safe code requires being cautious about data shared across threads. If you must share data, strive to avoid shared mutable data. If you cannot avoid it, use concurrency primitives. To illustrate, we can look at two key concurrency primitives in Ruby: Mutex and ConditionVariable. A mutex lets you mark a segment of code as mutually exclusive, meaning only one thread can execute it at a time; that segment is known as a critical section. If we wrap a mutex around the code that updates the shared state, no other thread can enter that code while the update is in progress.
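As a rough sketch of the mutex idea, assuming a simplified stand-in for the driver's cached node list: the `NodeCache` class, the `Node` struct, and the host names below are hypothetical and are not the MongoDB driver's actual API. Only `choose_node` borrows its name from the talk.

```ruby
require 'set'

# Hypothetical stand-in for one member of a replica set.
Node = Struct.new(:host, :type)

# Hypothetical cache of replica-set nodes shared by many threads.
class NodeCache
  def initialize
    @nodes = Set.new
    @lock  = Mutex.new
  end

  # Writer: the update to the shared set is a critical section.
  def add_node(node)
    @lock.synchronize { @nodes << node }
  end

  # Reader: iterate under the same lock so no writer can mutate the set
  # in the middle of the traversal.
  def choose_node(type)
    @lock.synchronize { @nodes.find { |n| n.type == type } }
  end

  def size
    @lock.synchronize { @nodes.size }
  end
end

cache = NodeCache.new
threads = 200.times.map do |i|
  Thread.new do
    if i.even?
      cache.choose_node(:primary)
    else
      type = i == 1 ? :primary : :secondary
      cache.add_node(Node.new("host#{i}", type))
    end
  end
end
threads.each(&:join)
```

Readers and writers must use the same mutex. Locking only the writes would still let a reader iterate while the set changes underneath it.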
00:16:40.200 However, while mutexes are beneficial, they also require careful consideration. Avoid locking around I/O operations on external resources, such as network requests, database queries, or file access, because while one thread holds the lock and waits on a network call, every other thread that needs the lock is stalled. In the Ruby driver, I found a method that attempted to connect to a node while inside a synchronized block. Connecting is exactly that kind of external operation, and moving it outside the synchronized block improves efficiency: threads can proceed independently instead of waiting on a task that can take an uncertain amount of time.
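A minimal sketch of that refactoring, under stated assumptions: `ReplicaSetState`, `refresh`, and `connection_for` are hypothetical names, and the block passed to `refresh` simulates a slow network connect with `sleep`. The slow call runs outside the lock, and only the short update of shared state is synchronized.

```ruby
# Hypothetical holder of per-host connections shared across threads.
class ReplicaSetState
  def initialize
    @lock        = Mutex.new
    @connections = {}
  end

  # The block stands in for a slow network call. It runs OUTSIDE the lock,
  # so other threads are not stalled while this one waits on I/O.
  def refresh(host)
    connection = yield(host)
    # Only the quick in-memory update is a critical section.
    @lock.synchronize { @connections[host] = connection }
  end

  def connection_for(host)
    @lock.synchronize { @connections[host] }
  end
end

state = ReplicaSetState.new
hosts = %w[a b c d]
hosts.map do |h|
  Thread.new do
    state.refresh(h) do |host|
      sleep 0.05            # simulated network latency
      "conn-to-#{host}"     # placeholder for a real socket
    end
  end
end.each(&:join)
```

With the connect inside `synchronize`, the four simulated connects would run one after another. Outside it, they can overlap.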
00:17:51.960 Now, let's address a more abstract problem known as the Thundering Herd problem. It occurs when many threads are waiting for the same lock or event. As soon as that event happens, every thread wakes up and tries to proceed, but only one can. Waking all of them is inefficient, because the rest consume resources only to go back to waiting. To avoid this, when you need to wake threads based on a specific condition, use a ConditionVariable. It allows a thread to signal waiting threads only once it has completed its work.
00:18:52.640 Condition variables improve coordination between threads: one thread can notify another without waking every waiter unnecessarily. I recommend signaling in a way that avoids flooding all the waiting threads when only one will be able to resume. Being diligent with mutexes and condition variables keeps you from waking more threads than can make progress, which saves system resources. It is critical to remember that adding threading mechanisms to your code does not automatically make it thread-safe. You must remain vigilant about resource management and potential issues.
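One way to picture this is a small resource pool in which each check-in wakes a single waiter with `ConditionVariable#signal` rather than waking all of them with `broadcast`. The `TinyPool` class and the socket symbols are hypothetical, a sketch rather than the driver's connection pool.

```ruby
# Hypothetical pool that hands out a fixed set of resources to threads.
class TinyPool
  def initialize(resources)
    @available         = resources.dup
    @lock              = Mutex.new
    @resource_returned = ConditionVariable.new
  end

  def checkout
    @lock.synchronize do
      # wait releases the lock while sleeping and reacquires it on wake-up.
      # The loop re-checks the condition in case another thread got there first.
      @resource_returned.wait(@lock) while @available.empty?
      @available.pop
    end
  end

  def checkin(resource)
    @lock.synchronize do
      @available.push(resource)
      # One resource came back, so wake one waiter, not the whole herd.
      @resource_returned.signal
    end
  end
end

pool    = TinyPool.new([:socket_a, :socket_b])
results = Queue.new
threads = 10.times.map do
  Thread.new do
    socket = pool.checkout
    sleep 0.01              # simulated work with the resource
    results << socket
    pool.checkin(socket)
  end
end
threads.each(&:join)
```

`broadcast` is still the right call when every waiter can make progress, for example when releasing a start gate in a test.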
00:19:55.200 There are more concurrency primitives, such as Queue and several others, worth exploring for those interested, but I want to move on to testing for concurrency. Testing is incredibly challenging, and it is vital to test your code across various implementations. Just because your code works in MRI (Ruby 2.0) doesn't mean it will function correctly in JRuby or Rubinius. Thorough testing means not only trying different Ruby implementations but also testing with many threads. Once you've done basic testing with various thread counts, use patterns that make your tests more precise, because concurrency problems tend to arise in very specific sections of code.
00:22:04.520 To put this all into practice, there is a specific test in the Ruby driver that uses a pattern to coordinate threads making requests. The threads acquire sockets, then the test pauses all of them, kills a node, and lets them continue. In this scenario, one thread handles refreshing the replica set state, and careful resource management remains vital. As we reach the conclusion, concurrency in Ruby is indeed challenging and distinct from other languages. When writing thread-safe code, it's essential to understand your implementations and the differences in their semantics. While you might be fortunate with certain Ruby versions, beware of the unpredictability with implementations such as JRuby and Rubinius. Knowledge of concurrency primitives and awareness of your system resources is paramount to writing efficient and safe code.
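The pause-and-release test pattern described above can be sketched with a gate built from a Mutex and a ConditionVariable. Everything here is hypothetical: `StartGate`, the event log, and the `:node_killed` marker stand in for the driver test's real sockets and node failure, which are not shown in the talk transcript.

```ruby
# Hypothetical gate: threads block in #wait until the test calls #open!.
class StartGate
  def initialize
    @lock    = Mutex.new
    @opened  = ConditionVariable.new
    @open    = false
    @waiting = 0
  end

  def wait
    @lock.synchronize do
      @waiting += 1
      @opened.wait(@lock) until @open
    end
  end

  # Number of threads that have reached the gate so far.
  def waiting
    @lock.synchronize { @waiting }
  end

  def open!
    @lock.synchronize do
      @open = true
      @opened.broadcast # every paused thread should resume here
    end
  end
end

gate = StartGate.new
log  = Queue.new

threads = 20.times.map do |i|
  Thread.new do
    log << [:acquired, i] # stand-in for acquiring a socket
    gate.wait             # pause until the test releases all threads
    log << [:resumed, i]  # stand-in for continuing the request
  end
end

sleep 0.001 until gate.waiting == 20 # all threads are paused at the gate
log << [:node_killed]                # stand-in for killing a node
gate.open!
threads.each(&:join)

events = []
events << log.pop until log.empty?
```

Because every thread is held at the same point, the simulated failure always lands between acquisition and resumption, which makes the risky interleaving repeatable instead of a matter of luck.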
00:24:39.599 In closing, remember there is no such thing as universally thread-safe Ruby code, but code can be thread-safe on JRuby or other implementations when it is written with care. Acknowledge the nuances of shared mutable state and adopt safe concurrent practices in your code. Thank you for listening.