Off to the races

by Kyle d'Oliveira

In the video titled 'Off to the Races', presented by Kyle d'Oliveira at RailsConf 2023, the main topic revolves around race conditions, particularly in the context of Ruby on Rails applications. The speaker provides an in-depth introduction to what race conditions are, how they manifest in software running with shared resources, and strategies for testing and fixing them.

Key Points discussed in the video include:

Definition of Race Conditions: Race conditions occur when multiple processes interact with shared resources in an unexpected order, causing unanticipated behavior in applications.
Example of Dining Philosophers: This classic problem illustrates race conditions, showcasing how philosophers can't eat if they can't access both chopsticks simultaneously, leading to potential starvation if not managed correctly.
Applicability in Rails: The speaker explains that Rails applications often involve concurrent requests, leading to the possibility of race conditions, particularly when accessing shared resources like databases.
Vulnerable Code Example: A voting mechanism is presented, showing how simultaneous requests can lead to lost votes due to a read-modify-write pattern, creating a race condition.
Testing Race Conditions: Identifying critical sections is crucial for testing race conditions. Techniques such as introducing delays or using shared services like Redis are explored alongside the challenge of ensuring tests effectively represent race conditions.
Fixing Race Conditions: Different approaches to mitigate race conditions include:
- Removing Critical Sections: Refactoring code to avoid critical sections where possible.
- Atomic Operations: Utilizing database functionalities that inherently avoid race conditions.
- Optimistic and Pessimistic Locking: Techniques that allow processes to manage access to shared resources safely by leveraging locks or version control.
Real-World Implications: The speaker emphasizes understanding the consequences of race conditions and the importance of finding practical solutions that suit specific scenarios.
Conclusion: The talk emphasizes building awareness and a list of potential race conditions in teams to better manage and avoid them in the future.

The session concludes with a call for developers to approach race conditions not just as isolated problems, but as potentially impactful issues that require careful consideration in code design and execution.

00:00:26 Welcome to my talk today! I'm really happy that you all came to hear me speak. It's always nice to see the people who are attending the sessions I want to discuss.

00:00:39 My talk today is entitled "Off to the Races."

00:00:44 This will be an introduction to race conditions, general strategies you can use if you encounter them, and a bit about how to test them. My hope is that you will leave with a good understanding of what race conditions are, which will allow you to engage with the people around you and start building experience to recognize race conditions in your code and learn how to fix them.

00:01:10 A little bit about myself: My name is Kyle d'Oliveira, and I'm a principal software engineer based in Vancouver, Canada. I've been working with Ruby for almost my entire career and have worked on every single Rails version, starting with Rails 0. I think my first project involved upgrading from Rails 0 to Rails 3, which was quite an experience. I find Rails inspiring and empowering, and I love the community. The community is one of the biggest reasons I continue to work with Rails.

00:01:42 I've been working at Aha! for the past two and a half years, and it's also one of the best workplaces I've been part of. They focus a lot on the employees and the human element of our work. We help other companies build what matters for them. The team is amazing; we are fully distributed across the globe, with team members everywhere from New Zealand to Europe, and our culture is fantastic. We operate under what we call TRM, the Responsive Method, which is a framework of shared values we can all agree on.

00:02:02 This framework allows us to be interruption-driven, establish goals, and make decisions while remaining kind. This helps us move quickly and feel empowered. If you'd like to be part of this culture, please come chat with me afterwards; we are always looking for new talent.

00:02:43 When I started preparing for this talk and telling people it was about race conditions, I had many people asking how programming related to races. Without prior knowledge, the phrase 'race condition' is not very clear. So, I went on a hunt for a definition and found this one: race conditions are unanticipated behaviors caused by multiple processes interacting with shared resources in an unexpected order.

00:03:09 That's quite a mouthful. If I were learning about race conditions for the first time and read this, I would still be unclear about what it meant. Even as someone familiar with race conditions, I found this definition lacking. So, let’s look at an example. I first learned about race conditions in university, where they introduced us to a problem known as the Dining Philosophers.

00:03:40 The Dining Philosophers problem looks like this: You have a table with five philosophers seated around it, each with food in front of them. There are five chopsticks placed between each pair of philosophers, where each philosopher can only eat if they have both adjacent chopsticks. The rules state there is only one chopstick for each pair of philosophers, and chopsticks cannot be shared.

00:04:10 The challenge here is to find a process that enables all philosophers to agree so that none of them starve.

00:04:16 The term 'process' refers to an algorithm or set of rules for each philosopher to follow. They reach for the left chopstick first and if it's unavailable, they wait and then reach for the right one. Only when they have both can they eat.

00:04:37 In theory, it might look like this: a philosopher reaches out and grabs the left chopstick, then grabs the right one and eats. Once done, they put the chopsticks down and repeat the process. However, in practice, each philosopher acts independently. It could happen that multiple philosophers concurrently reach for the same chopstick.

00:05:00 For instance, philosopher one attempts to pick up the left chopstick, while philosopher two simultaneously does the same. The same occurrence continues with philosophers three, four, and five. Now, all philosophers have taken their left chopsticks. However, philosopher one, reaching for the right chopstick, finds it unavailable and is forced to wait, leading to a scenario where they all starve.

00:05:28 The Dining Philosophers is a rather abstract problem, but it illustrates the basics of what can occur with a race condition. With multiple independent processes—here represented by the philosophers—and shared resources—the chopsticks—unusual timing can create unanticipated behavior.

00:06:09 Another important concept to note is the critical section of the code or algorithm. This refers to the portion of the algorithm where the philosophers first reach out for a shared resource, continuing until they cease using that resource.

00:06:33 This context about philosophers and the way they dine is useful, but at a Rails conference, you won't find philosophers or chopsticks in your Rails applications. However, you will certainly encounter multiple processes and concurrent requests.

00:07:00 If a Rails app serves only one request at a time, it isn't functioning properly. Background jobs may also run concurrently, causing interference among them. Tasks like scheduled rake jobs can interact with shared resources. You will likely use a database, which is a common shared resource, and may rely on queues or other microservices.

00:07:20 Considering this, the possibility of race conditions in Rails is always present. Here's a simplified example of code vulnerable to a race condition.

00:07:41 Imagine an app that allows users to create ideas and vote on those ideas. We have an idea controller with a vote action. When a request comes into this action, we look up the idea using its ID; let's say it initially has zero votes.

00:08:00 We increment the vote counter and save it to the database, resulting in one vote. Then, another request comes in, doing the same thing: loading the idea, incrementing the vote, and saving it.

00:08:21 Now, we have two votes! However, what if the timing were slightly different? The first request is processing the increment when a second request hits the database and reads the initial vote count, which is still zero.

00:08:41 Both requests will save their increments, but they’re both referencing the original vote count. As a result, we would end up with just one vote when we expected two. This illustrates the confusion of losing a vote due to the race condition.

00:09:00 Let's dissect this with more detail: the first request extracts the idea and creates an in-memory representation of that idea with zero votes. It then increments the vote counter.

00:09:19 Simultaneously, a second request performs the same action and loads an identical in-memory object with zero votes since the previous change hasn't yet been saved.

00:09:42 Next, the first request attempts to save the vote count, successfully updating the database to show one vote. The second request, arriving later, also attempts to save its vote count, which reinitializes it again to one vote, losing the original vote.

00:10:00 This scenario illustrates what's known as a 'read-modify-write' race condition, a common enough pattern that it has been given a name. If your code performs operations where you read a value, modify it in memory, and then write it again, you are potentially creating a race condition.

00:10:19 If you suspect your code may exhibit a race condition, how could you go about testing for it? It can be tricky to test race conditions because they are, by nature, non-deterministic. They often require very specific timing to manifest.

00:10:44 Begin by identifying the critical section, which typically starts when you first reach for a shared resource and continues until you cease accessing that resource. In this example, the three lines handling the vote increment could be identified as the critical section.

00:11:00 One approach to reproduce a race condition is by trying to minimize the duration of the critical section. A simple method is by inserting sleep calls to reduce timing restrictions.

00:11:17 However, while this might trigger the race condition, it isn't a robust or reliable solution because you must also ensure the second request sleeps for a predictable duration.

00:11:34 Alternative approaches include creating helper methods or services that intelligently synchronize the requests, potentially using Redis to introduce flags that can manage request states effectively.

00:11:50 For example, if a request successfully sets a flag in Redis, it can sleep indefinitely. If another request can't set the flag, it will delete it, allowing both processes to continue their execution at the same time.

00:12:18 This technique provides a dependable method of having two processes reach the same critical section with a high likelihood of reproducing race conditions.

00:12:39 If you've managed to reproduce a race condition, you might explore the possibility of automated testing to identify similar concerns in the future. But be aware tests often run in a single-threaded environment, complicating the detection of race conditions that require multiple processes.

00:13:04 To mimic race conditions in a single-threaded test, you can utilize methods like `and_wrap_original` from RSpec. This enables you to stub methods and execute additional code before or after the original implementation.

00:13:25 This way, you can load an idea into memory while simultaneously making background changes to ensure the database state reflects a race condition. Testing can illustrate the problem without modifying the actual functionality.

00:13:47 Setting up tests might begin with creating an idea, utilizing the trick with `and_wrap_original`, and checking if the expected votes return two. In this case, the test would fail, highlighting the presence of the race condition.

00:14:06 Although this isn't a foolproof solution for testing, it's a useful tool to have in your toolbox. Now, having discussed testing, let's move on to fixing race conditions.

00:14:25 Once you've identified a race condition, the next step is determining how to address it. There are several approaches, though no silver bullets, highlighting the importance of evaluating what can be applied to your specific situation.

00:14:44 One of the most straightforward ways to address a race condition is by eliminating the critical section. I don't mean to simply remove the code, but rather to refactor it in a way that renders it unnecessary. However, this is not always feasible.

00:15:12 You may consider performing atomic operations, which execute as a single unit. That way, no other processes can read or modify portions of it during its execution. For example, you could instruct the database to perform the increment itself without needing to load the object into memory.

00:15:34 When this is done through SQL, it would look something like instructing the database to set the vote count to increment by one, thereby avoiding a race condition.

00:15:50 Where atomic operations exist, they may be ideal solutions. However, not all operational contexts will conveniently allow for atomic actions. This risk isn't confined solely to databases—services like Redis are susceptible as well.

00:16:09 For instance, consider a scenario in Redis where two requests attempt to increment a count concurrently, both pulling their values before one has updated it, thereby introducing the same race condition.

00:16:27 If you don't have atomic operations at your disposal, one alternative is to detect when a race condition occurs and recover from it.

00:16:44 Rails provides a robust optimistic locking feature. With optimistic locking, an update is applied to the database only if the version of the object you hold in memory matches the one in the database.

00:17:06 To enable this feature, simply add a lock_version column to your table. Rails will recognize this and start using optimistic locking. With this column in place, Rails modifies SQL executed during updates.

00:17:30 The SQL command now checks if the lock version remains the same when executing the update. If it does not match the version in memory, Rails won't perform the update.

00:17:54 While optimistic locking is a useful tool, it can be dangerous as it may silently ignore updates, which is undesirable in some contexts. Therefore, methods like `save!` should be used to ensure exceptions, like active_record_stale_object_error, highlight potential update failures.

00:18:18 When you catch the stale object error, you can take appropriate action, such as retrying the operation. However, you should ask important questions, such as how many retries are acceptable without entering an infinite loop.

00:18:34 A crucial aspect of retrying within a critical section is ensuring that any changes made can be undone if necessary—often achievable using database transactions.

00:18:54 While rolling back a transaction is straightforward, external side effects like API requests, background jobs, or emails can complicate matters, making it tough to rollback those changes.

00:19:09 If you can't detect and recover from a race condition, consider protecting the critical section. This involves setting up a contractual agreement within your code indicating that all shared resources accessed in conflict need protection.

00:19:28 Rails offers features such as pessimistic locking, where a process informs the database that once it retrieves an object, it should have exclusive access until the transaction is complete.

00:19:49 To utilize pessimistic locking, ensure your code executes within a transaction and use the lock method while querying the database. Instead of a standard SQL select statement, it will issue a lock-aware select for an update.

00:20:05 The lock ensures that as long as the object has been pulled from the database, no other process can access it until that transaction concludes.

00:20:20 In light of the race condition theme, you can think of the lock as a baton pass in a race: only the entity holding the baton can run until it passes it.

00:20:39 If a request or process tries to enter a locked critical section, it will block and have to wait until the lock is released.

00:20:57 Similar to the detect-and-recover approach, setting waits involves establishing how long processes can wait. Introducing sleep commands can lead to processor underutilization.

00:21:15 You need to consider timeouts, especially if locking rows proves challenging. Locking arbitrary blocks of code or employing algorithms like Redlock, a Redis-based distributed lock implementation, can also protect critical sections.

00:21:37 Using this gem, you can create a new Redlock client and call the lock method with an identifier or timeout, running your critical section in this context.

00:21:59 Another useful gem is 'advisory lock', where advisory locks help to apply application-defined blocks that require manual locking and unlocking.

00:22:18 This gem simplifies the process by handling lock and unlock tasks thoroughly, mirroring the structure of the Redlock example.

00:22:41 Remember, protecting the critical section is a contractual commitment. Every place dealing with increment operations must adhere to the same locking mechanics.

00:23:01 Error handling doesn’t have to be overly complex; it can be straightforward. For instance, specify a timeout, and if it fails, return to the user promptly.

00:23:22 It's crucial to carefully determine how long to wait and whether you wish to impose maximum latency limits.

00:23:43 In high-throughput areas, the holistic perspective on all processes should be maintained, as overprotecting critical sections may cause system slowdowns.

00:24:06 Another common race condition is what’s referred to as the 'check-then-act' condition, often seen in validations such as unique constraints in Rails.

00:24:29 The goal of validation uniqueness is to prevent duplicate database entries; however, an inherent race condition exists in this process.

00:24:43 With the validation first occurring followed by the insert, there exists a window where another process can concurrently insert a duplicate entry into the database.

00:25:08 This essentially makes the single line `create` in the model vulnerable to race conditions, leading to challenges with unique validations.

00:25:32 With this in mind, let’s explore strategies for addressing this specific race condition.

00:25:51 Just like before, one of the options is to remove the critical section, but completely deleting code here alters its fundamental behavior.

00:26:08 Instead, consider looking for ways to make operation atomic. Swapping out this create method for something utilizing 'upsert' could change the SQL executed into a more race-condition resilient form.

00:26:30 With 'upsert', the database is instructed on how to handle conflicts, allowing you to avoid those race conditions effectively.

00:26:48 Yet, if you need to run validations or callbacks, directly using 'upsert' would bypass that functionality.

00:27:04 Should atomic operations not be applicable, you can also move to detect and recover from such conditions by rescuing from unique index exceptions.

00:27:22 If an insert fails due to a unique index violation, you can handle it gracefully by looking up the relevant idea from the database.

00:27:40 Once you identify that the entry exists already, you can proceed without needing a second attempt.

00:28:04 Another approach is to protect the critical section with the same techniques we've previously discussed regarding advisory locks.

00:28:20 Be sure your identifier aligns with the name being used so that concurrent creations of ideas with different names can occur without conflict.

00:28:33 A sustainable solution requires that you identify all creation points for ideas and protect them consistently—a task that may not always be feasible.

00:28:53 I'll conclude with a flowchart highlighting helpful considerations while approaching race conditions.

00:29:09 The first question you should explore is whether you're dealing with shared resources that might conflict. If not, then concerns about race conditions can be set aside.

00:29:23 If multiple processes target the same shared resources, then consider the likelihood of a race condition arising.

00:29:41 Identifying if a race condition is present is quite complex, as patterns might emerge revealing issues like read-modify-write, check-then-act, deadlocks, or livelocks.

00:30:00 It takes practical experience to pinpoint whether a race condition exists, and creating a knowledge base within your organization can help newcomers assimilate quickly.

00:30:19 If after your initial exploration a race condition does exist, first consider how you might eliminate the critical section through refactoring or theoretically replacing it.

00:30:39 If the operation can be made atomic, investigate available atomic operations, but determine alternative measures if that proves impossible.

00:31:03 Should you identify recoverable mechanisms, ask how to handle retries; establish acceptable limits and review outcomes on failures.

00:31:28 Where you cannot detect a race condition, begin considering methods to protect critical sections to maintain stability and reduce disruption in the application.

00:31:49 Understanding these core concepts around race condition detection and the potential for efficient software development enables you to be proactive.

00:32:06 As I conclude, I want you to keep this flowchart in mind as you tackle software challenges and mitigate potential race conditions.

00:32:23 Now, if any of you have questions about race conditions, I'd be happy to dive deeper into the discussion.

00:32:58 So, the question is, what patterns have emerged for myself?

00:33:04 In general, I find waiting longer than a couple of seconds can negatively impact performance.

00:33:21 If you have high-throughput systems, consider scheduling background jobs that attempt retries at strategic intervals.

00:33:39 Identifying certain patterns leading to race conditions, such as concurrent reads and modifications, can prompt further examination.

00:33:57 Race conditions may not present regularly, but when they do, they can significantly impede operations.

00:34:15 The question involves gaining buy-in from companies skeptical about their occurrence. This is best achieved by emphasizing consequences.

00:34:30 For example, highlighting risks involving customer data loss can resonate more strongly with stakeholders.

00:34:45 Emphasizing the financial implications tied to race conditions can promote understanding and potentially reveal valuable backing.

00:35:00 Approaching race conditions fosters opportunities for growth and improvement, ensuring teams are equipped for future challenges.

00:35:14 The ensuing conversation tackles the subject of instrumentation for identifying race condition resolutions.

00:35:28 Generally, it's hard to establish instrumentation for proving the existence of race conditions.

00:35:43 While the challenge exists, digging deeper into inserts or executing duplicate data checks can yield insights into their frequency.

00:36:02 If you've suspected a race condition’s emergence, implementing logging can provide a useful mechanism for tracking their occurrence.

00:36:16 The final question revolves around the trade-offs between protection versus allowance for race conditions.

00:36:38 You should consider the consequences of the user experience. Minor error messages may not need excessive code changes, while larger transactions require keen oversight.

00:37:04 Ultimately, addressing how to create well-engineered solutions while grappling with potential opportunities for improvement should remain the focus.

00:37:26 Thank you very much for your time, and I hope you learned something today!