Thread Safety

Summarized using AI

What does GIL really guarantee you?

Daniel Vartanov • November 28, 2017 • New Orleans, LA

In the talk "What does GIL really guarantee you?" presented by Daniel Vartanov at RubyConf 2017, the focus is on understanding the Global Interpreter Lock (GIL) in Ruby and its implications for thread safety.

Key Points Discussed:
- Introduction to GIL: Vartanov defines GIL and its fundamental role in Ruby, stating that although it allows only one thread to run at a time, it does not eliminate the risk of race conditions in multi-threaded environments.
- Race Condition Explanation: He illustrates a typical race condition using a simplified example involving a bank account scenario where multiple threads attempt to increment the account value, resulting in unexpected outcomes if not handled correctly.
- Demonstration with JRuby and MRI: Vartanov runs similar code first with JRuby (without GIL) then with Ruby MRI (with GIL). While MRI initially appears to protect against the race condition, a subtle refactoring that extracts the read and write of shared state into separate methods exposes it under MRI as well.
- Context Switching: The speaker introduces the concept of context switching, explaining that GIL allows concurrent execution but prevents parallel execution. This nuance is critical because two threads still take turns on a single core, and a switch in the middle of an operation can cause a race condition.
- Practical Example from Industry: Vartanov shares a personal anecdote about a critical error that occurred in his project due to a lack of understanding of multi-threading and GIL, demonstrating the real-world consequences of being unaware of these issues.
- Final Thoughts on GIL: He emphasizes that GIL is not intended to safeguard against race conditions in user code. Instead, it serves to maintain the integrity of built-in methods and data structures within Ruby MRI. Vartanov insists that developers must be proactive in preventing race conditions and not rely on GIL for thread safety.

The talk concludes with Vartanov advising developers to remain cautious with threading in Ruby, acknowledge the limitations of GIL, and adopt best practices to prevent concurrency issues. He encourages the audience to understand that context can switch at any time, which can lead to unpredictable behavior if not handled carefully.

What does GIL really guarantee you? by Daniel Vartanov

You have probably heard that the Global Interpreter Lock (GIL) in Ruby "allows only one thread to run at a time". While that is technically correct, it does not mean what you probably think it means. After this talk you will know why your code may not be thread-safe even with GIL enabled, and how even a subtle change in your code can make GIL fail to protect you from race conditions. This talk is junior-friendly, but even experienced Ruby devs will most probably learn something new and important about GIL in Ruby.

RubyConf 2017

00:00:10.910 Well, so hi again.
00:00:13.320 I'm going to talk about GIL, the Global Interpreter Lock in Ruby.
00:00:16.289 I appreciate you all being here on the third day when everyone is usually tired already.
00:00:19.529 I will do my best to make this talk as engaging as possible.
00:00:26.550 This talk is intended to be junior-friendly, but even if you are an experienced developer, I believe you will find something new today.
00:00:34.320 So whether you're a junior, experienced, or pretending to be experienced, this talk is for you.
00:00:44.640 I have been working in the industry for a long time, almost ten years with Ruby, and I absolutely love it.
00:00:52.199 In fact, even the text on my wedding cake was written in Ruby, even though my wife is a Pythonista.
00:01:02.670 Just like many of you, I didn’t bother researching how exactly multi-threading works. What exactly does GIL do? I had to conduct this research because a disaster happened to me and my project three years ago.
00:01:15.320 My project, called Vico, has been under development for five years, and I started building it even before it was officially founded.
00:01:30.270 Now, in 2017, we are thriving and have every chance of succeeding; however, three years ago, in 2014, we were on the brink of collapse due to a mistake in our thread safety.
00:01:49.229 I will share that story in detail later in this talk, but for now, to explain better what happened, I'll start with a simple example of what a race condition is.
00:02:05.909 I’ll give an example of how to earn a million dollars in Ruby. You have to start small; begin with ten thousand. If you have a bank account and you repeatedly add to it ten thousand times, you'll eventually reach one million.
00:02:22.379 After fifteen years in this industry, I can assure you this is the only way to earn significant money in Ruby.
00:02:31.079 I will even add tests to ensure we reach a million, but please don’t be alarmed by those symbols below—they are just there to highlight characters in the console with color.
00:02:56.310 Now, imagine that instead of running this whole cycle in a single thread, I ask a hundred threads to do the job. Don't try to learn the exact syntax of how to spawn threads from this example; it’s irrelevant to your understanding.
00:03:12.390 Just trust me that this code will run one hundred threads, each of which will perform the cycle of ten thousand. Initially, I will try to run this code with a Ruby implementation that does not have GIL.
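The slide code is not part of this transcript. A minimal sketch of what such a program might look like follows; the names `BankAccount`, `earn`, and `run_deposits` are assumptions made for illustration, not the speaker's code.

```ruby
# Hypothetical reconstruction of the bank account example.
class BankAccount
  attr_reader :balance

  def initialize
    @balance = 0
  end

  # Adds one dollar `times` times. The increment is a single line,
  # but it is a read, an addition, and a write under the hood.
  def earn(times)
    times.times { @balance += 1 }
  end
end

# Spawns `threads` threads that all deposit into the same account,
# waits for them to finish, and returns the final balance.
def run_deposits(threads: 100, per_thread: 10_000)
  account = BankAccount.new
  workers = Array.new(threads) do
    Thread.new { account.earn(per_thread) }
  end
  workers.each(&:join)
  account.balance
end

total = run_deposits
puts "expected #{100 * 10_000}, got #{total}"
```

Whether the printed balance matches the expected value depends on the Ruby implementation and on timing, which is the subject of the rest of the talk.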
00:03:45.329 Does anyone here know an implementation that doesn't have GIL?
00:03:47.340 Right! JRuby, thank you.
00:03:48.930 So I ran it with JRuby, and we got an error. We didn’t get our million; instead, we encountered a problem.
00:03:56.129 Why? Because it’s a race condition, which is quite easy to understand. The reason for this race condition boils down to one line of this code.
00:04:11.790 This simple line in the scope of this code is responsible for the entire issue we are facing. If we analyze it closely, we can expand this line of code into three distinct operations: first, reading from the instance variable (the bank account), then incrementing it by one in memory, and finally writing back to the instance variable.
00:04:27.420 Now imagine that two threads are doing this simultaneously. Each thread will have its own copy of the local variable value to work with, but they will share the common instance variable of the bank account.
00:04:51.060 Picture this: the first thread reads the value from the bank account, which is currently five. The second thread reads the same value, five.
00:05:06.300 The first thread increments its value to six, and then the second thread does the same, also getting six. The first thread saves six to the bank account, and so does the second thread.
00:05:20.490 What should have been a two-dollar increase results in only one dollar being added to the bank account. This issue is known as a race condition.
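The interleaving just described can be replayed deterministically without any threads. This is only an illustration of the ordering of the three steps; the method name `lost_update_demo` and the variable names are made up for this sketch.

```ruby
# Replays the unlucky ordering of two "threads" by hand.
def lost_update_demo(starting_balance)
  balance = starting_balance

  # Step 1: both threads read the shared balance.
  first_copy  = balance
  second_copy = balance

  # Step 2: each thread increments its own private copy.
  first_copy  += 1
  second_copy += 1

  # Step 3: both threads write back; the second write overwrites the first.
  balance = first_copy
  balance = second_copy

  balance
end

puts lost_update_demo(5)  # prints 6, although two deposits were made
```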
00:05:28.310 Now, I will try to run the same code with Ruby MRI, which does have GIL. By the way, ladies and gentlemen, this will be the first introduction of GIL in this talk.
00:05:48.760 So, behold, I’m running the same code with Ruby MRI, and it appears to be correct. No matter how many times I run it, it seems to hold up.
00:06:01.470 It seems GIL saved us from the race condition, but it's too early to celebrate.
00:06:10.450 Now, imagine that a junior developer comes in and refactors the code. This talk is junior-friendly, so I'm partially blaming juniors for what has happened.
00:06:39.880 Basically, two methods—operations of reading from and writing to the bank account—have been extracted. This is actually quite a sensible refactor, and even well-known software engineers would approve of it.
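A refactoring of this kind might look like the sketch below. The method names `read_balance` and `write_balance` are assumptions; the transcript only says that reading and writing were extracted into methods.

```ruby
# Hypothetical refactored version: the read and the write of the shared
# balance now go through separate methods.
class RefactoredAccount
  attr_reader :balance

  def initialize
    @balance = 0
  end

  def read_balance
    @balance
  end

  def write_balance(value)
    @balance = value
  end

  # Same three steps as before (read, add, write), but with method
  # calls between the read and the write.
  def earn(times)
    times.times do
      value = read_balance
      write_balance(value + 1)
    end
  end
end

def run_refactored(threads: 100, per_thread: 10_000)
  account = RefactoredAccount.new
  workers = Array.new(threads) do
    Thread.new { account.earn(per_thread) }
  end
  workers.each(&:join)
  account.balance
end

puts "expected #{100 * 10_000}, got #{run_refactored}"
```

The result can vary from run to run, so a single correct output proves nothing about thread safety.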
00:06:56.160 However, GIL does not favor this change. Look, I’m running it again, and the result is wrong. Why is that? Well, we have two questions here to address.
00:07:12.690 The first question is: why did GIL allow the race condition to occur in the first place? And the second question is: why did it only happen when the junior developer extracted methods?
00:07:37.390 Great questions, and I will answer them, but first, I need to clarify a common misconception about multi-threading in Ruby.
00:07:42.770 This laptop has eight cores. I will now turn off seven of them and run the same code with Ruby MRI with GIL once again. Over to you—what do you think will happen when I turn off all the additional cores? Will the race condition go away or not?
00:08:11.049 Okay, yes, let’s try that. Just in case, I’m checking because sometimes it fails. You may say that yes, sometimes you will get correct results; after all, it’s all about probability.
00:08:19.220 But think of it—our code becoming unreliable due to a simple refactor which seemed innocent. I will turn off the cores and run it again. I’m ensuring only one core is available.
00:08:43.060 I will run it several times... and the race condition hasn’t disappeared. Our code is still unreliable. So just to clarify, I’m turning back my cores. Now I have three questions.
00:09:05.290 Why didn’t the race condition go away when we turned off other cores?
00:09:08.350 The answer is that parallelism is not concurrency. They may seem like synonyms, but they are not. Let’s examine why. Imagine we have two threads: the first thread is colored red and the second is colored blue.
00:09:35.980 When people hear GIL allows only one thread to run at a time, they sometimes visualize it as one thread coming after another, forming a neat queue.
00:09:54.920 However, this is not how it works; running one thread to completion after another is neither concurrent nor parallel. Ideally, each thread would occupy its own core and both would execute simultaneously, which would be both concurrent and parallel.
00:10:24.240 With GIL, you can achieve concurrency but not parallelism. In this case, no matter how many cores are available, GIL will prevent you from using more than one core.
00:10:47.660 While one core remains engaged, both threads continuously contend for execution, with Ruby allocating a certain number of milliseconds to each thread before switching.
00:11:05.500 This explains why the race condition did not disappear even when we turned off the additional cores: GIL still allows concurrent execution on a single core; it only rules out parallel execution.
00:11:13.660 So, let's agree on one thing today: every time MRI switches between threads, it's referred to as a context switch. This is a crucial concept that will recur throughout this talk.
00:11:36.300 When I discuss multi-threading with juniors, they often ask why, if we can’t utilize more than one core, we would consider multi-threading in Ruby at all.
00:12:01.720 Well, let's consider a scenario: imagine you need to communicate with a very slow remote API, and you are to make 25 requests. Instead of waiting for each one sequentially, you could create 25 threads, each waiting for its corresponding response.
00:12:44.210 Waiting does not consume CPU resources, so even a single core can manage 25 threads simply waiting for responses from a sluggish API.
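The slow API scenario can be sketched as follows. The remote call is simulated with `sleep`, and the names `slow_api_call` and `fetch_all` are hypothetical; a real client would perform network I/O instead.

```ruby
# Stand-in for a slow remote API call. Using sleep here is an
# assumption for illustration; it waits without using the CPU.
def slow_api_call(id, delay)
  sleep(delay)
  "response #{id}"
end

# Starts one thread per request and collects the results.
# Thread#value waits for the thread and returns its block's result.
def fetch_all(count: 25, delay: 0.2)
  threads = Array.new(count) do |i|
    Thread.new { slow_api_call(i, delay) }
  end
  threads.map(&:value)
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
responses = fetch_all
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts "#{responses.size} responses in #{elapsed.round(2)}s"
```

Because the threads wait at the same time, the total time should be close to the delay of a single request rather than 25 times that delay.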
00:13:00.900 Returning to the prior discussion, why did the race condition occur only when the junior developer refactored code?
00:13:08.100 The answer lies in how Ruby MRI switches contexts: a context switch can be triggered when you call a method.
00:13:16.820 When the junior developer refactored, the method call was introduced into the critical section of our algorithm, leading to the race condition.
00:13:25.200 Now we’ve addressed that question, but why was GIL introduced in the first place? What was the need for it?
00:13:48.440 Here’s another example: instead of inflating bank account values, we’re trying to populate the same array with three threads, and we want to ensure we get a million elements total.
00:13:55.950 The method call array.push is quite complex. If performed concurrently, many things can break or get corrupted inside.
00:14:23.500 If I run this with Ruby MRI, it will work correctly. This is because Ruby’s GIL protects the integrity of built-in methods written in C.
00:14:39.500 It also protects your C methods, unless there are callbacks to Ruby, which is not relevant to our current discussion. GIL was created to protect the internal integrity of Ruby’s data structures.
00:14:53.460 So let’s try running the same code with JRuby, which doesn’t have GIL. You might see an error indicating invalid array content due to concurrent modifications.
00:15:09.280 In JRuby, concurrent modifications can lead to data corruption, which will never occur in Ruby MRI due to GIL.
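A sketch of the array experiment, under stated assumptions: the thread count and the per-thread count below are made up so that they multiply to a million, and `fill_array` is a hypothetical helper, not the code from the slides.

```ruby
# Several threads push into one shared array; returns the final size.
def fill_array(threads: 4, per_thread: 250_000)
  shared = []
  workers = Array.new(threads) do
    Thread.new { per_thread.times { shared.push(1) } }
  end
  workers.each(&:join)
  shared.size
end

puts "engine: #{RUBY_ENGINE}, expected #{4 * 250_000}, got #{fill_array}"
```

On MRI the size is expected to match, as described in the talk. On an implementation without GIL, the talk reports that the same kind of code can fail or produce a wrong size.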
00:15:24.300 Today’s main takeaway is that GIL is not designed for your convenience but rather for the convenience of MRI developers.
00:15:46.370 To illustrate, I've added another check that verifies not only the array size but also its contents, and regardless of how many times I run it, the array size remains consistently accurate.
00:16:01.610 However, the contents are always incorrect, which means that while GIL protects the operation of insertion into an array, it does not protect your code surrounding that insertion.
00:16:13.670 This is the answer to our first question: Why did a race condition occur at all? GIL isn't designed to protect your code from race conditions.
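The transcript does not show what the content check looked like. One hypothetical way to see the difference between "the push is safe" and "the code around the push is safe" is to push a value derived from shared state; `fill_with_indexes` and its parameters are assumptions for this sketch.

```ruby
# Each thread appends the array's current size. In a run with no
# interference, every value 0, 1, 2, ... would appear exactly once.
def fill_with_indexes(threads: 4, per_thread: 25_000)
  items = []
  workers = Array.new(threads) do
    Thread.new do
      per_thread.times do
        index = items.size  # read shared state
        items.push(index)   # act on a value that may already be stale
      end
    end
  end
  workers.each(&:join)
  items
end

items = fill_with_indexes
puts "size: #{items.size}, distinct values: #{items.uniq.size}"
```

If the two printed numbers differ, some threads pushed the same index: the array itself stayed intact, but the read-then-push sequence was interrupted. Whether that shows up in a particular run is not guaranteed.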
00:16:37.490 As a final note, I want to share a vital insight about context switching.
00:16:55.290 Let me finally fulfill my promise and share my experience regarding what happened to my project three years ago.
00:17:03.000 Like many of you, I thought this topic was irrelevant to my work, that I would never use multi-threading, and that I was not going to encounter race conditions. I was mistaken.
00:17:18.150 My project, Vico, is a platform for e-commerce retailers that aggregates orders from various platforms like Amazon and eBay.
00:17:37.290 In the background, we use frameworks for making these API calls—can anyone guess the most popular background job processing framework in Ruby?
00:17:57.780 That’s right! Sidekiq. We were utilizing it to talk to Shopify.
00:18:02.967 However, three years ago, Rails Active Resource wasn’t thread-safe due to a bug. One morning, on the day of our Christmas party, a disaster occurred.
00:18:23.389 Our users began seeing commercial orders belonging to other users, exposing sensitive data.
00:18:40.660 Three years ago, we did not have many users, so we detected and fixed the problem fairly quickly; however, if it happened again today, it could be catastrophic given our large user base.
00:19:06.400 Such mistakes in thread safety can cost a business dearly.
00:19:20.120 One important lesson is to be smart: don't be like I was. Learn your stuff in advance to avoid being in a difficult position.
00:19:36.080 When our team was at a restaurant enjoying the view, we developers had our heads down, performing surgery on the database.
00:19:52.520 I want to discuss another important concept related to context switching.
00:20:04.090 Let’s look at the very simple example from earlier. Let me remind you that, under MRI, it appeared to be thread-safe.
00:20:18.300 I will increase it to ten million to ensure no false negatives appear.
00:20:35.860 I’ll take this line and add 'if true'. It shouldn’t change anything in behavior. Running it again, it appears to be correct.
00:20:58.210 Next, I’ll convert it to 'unless false'; it shouldn’t change anything. But, boom! Out of nowhere, we have a race condition.
00:21:11.050 The only reasonable explanation is that the exact points at which Ruby MRI switches contexts are undocumented internal parts of Ruby.
00:21:43.320 You should never rely on these undocumented features; they can change from version to version without warning.
00:21:57.100 Consider this: I’m running Ruby 2.3, and I can switch to 2.4. That may look harmless; however, there were 3,000 commits between those two versions.
00:22:23.150 These changes affect internal behavior and can lead to unpredictable outcomes.
00:22:42.050 In conclusion, assume that context can be switched at any point in your Ruby code. The only safe approach is to be cautious.
00:23:08.910 You may wonder how to protect yourself from race conditions. There are many strategies, but they are far beyond the scope of this talk.
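The talk does not cover protection strategies. Purely as an illustration of one common approach, and not as the speaker's recommendation, the critical section can be guarded with a `Mutex` from Ruby's core library. The class name `SafeAccount` and the helper `run_safe_deposits` are hypothetical.

```ruby
# Bank account whose read-add-write sequence is guarded by a lock.
class SafeAccount
  attr_reader :balance

  def initialize
    @balance = 0
    @lock = Mutex.new
  end

  # Only one thread at a time can be inside the synchronize block,
  # so no other thread can read a stale balance in between.
  def earn(times)
    times.times do
      @lock.synchronize { @balance += 1 }
    end
  end
end

def run_safe_deposits(threads: 100, per_thread: 10_000)
  account = SafeAccount.new
  workers = Array.new(threads) do
    Thread.new { account.earn(per_thread) }
  end
  workers.each(&:join)
  account.balance
end

puts "expected #{100 * 10_000}, got #{run_safe_deposits}"
```

With the lock in place, the result no longer depends on where the interpreter chooses to switch threads, at the cost of some overhead for acquiring the lock on every increment.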
00:23:26.450 Initially, we reduced concurrency of workers to one in Sidekiq, manually fixed the database mess, and upgraded to Rails 4, where Active Resource became thread-safe.
00:23:48.350 Additionally, keep in mind that while Ruby 3.0 will introduce improvements, you cannot simply rely on GIL, as it won’t save you from all concurrency issues.
00:24:07.220 To reiterate, let’s highlight a few crucial points: only Ruby MRI has GIL; GIL isn’t a solution for race conditions; parallelism is not concurrency; and you cannot rely on GIL for your convenience.
00:24:43.210 GIL will not prevent race conditions, nor is it a magical tool that can eliminate such issues.
00:24:57.420 As we conclude this talk, I won’t take questions because I may not fully understand your lovely American accents.
00:25:07.590 However, feel free to catch me in the corridor; I'm very friendly and will do my best to help. Alternatively, follow me on Twitter or email me; I’m always willing to assist.
00:25:34.220 Thanks for your attention!