Screaming Zombies and Other Tales: Race Condition Woes

Summarized using AI

Joshua Larson • December 18, 2020 • Online

The video titled "Screaming Zombies and Other Tales: Race Condition Woes" presented by Joshua Larson at RubyConf 2020 explores the concept of race conditions in software development. A race condition occurs when multiple processes interact with shared data and depend on the timing of operations. The session begins with a whimsical question, "What is the sound of a zombie screaming?" and ties it to the underlying theme of race conditions. Key points discussed throughout the talk include:

  • Understanding Race Conditions: The session explains that operations in programming, such as incrementing a variable, may not be atomic. This lack of atomicity can lead to unexpected behaviors when multiple processes attempt to read and write shared data simultaneously, resulting in race conditions.
  • Consequences of Race Conditions: Larson highlights how race conditions can manifest as bugs, causing issues in systems, particularly in concurrent environments.
  • Examples of Race Condition Stories: Four specific stories are shared to illustrate the impact of race conditions:
    • Smurfy's Law: A scenario where simultaneous data reads caused test failures due to shared test buckets.
    • Out of the Void: A problem in transaction processing where voided transactions incorrectly settled due to timing discrepancies.
    • Ghost in the Machine: A situation with IoT devices and stale sensor data, revealing the server's inability to handle messages arriving out of order.
    • Screaming Zombies: This story details how a metrics reporting tool at Braintree produced incorrect data outputs due to improperly managed reporting processes.
  • Solutions for Managing Race Conditions: The talk emphasizes methods to mitigate race conditions, such as separating data instances, managing operation sequences, and removing redundant processes.

Joshua concludes the presentation with a reminder of the complexity of race conditions in modern software architectures and the need for a robust approach in system design, while humorously reaffirming that the scream of a zombie is ultimately silence.

What is the sound of a zombie screaming?

Race conditions are a problem that crops up everywhere. This talk will go over what a race condition is, and what it takes for a system to be vulnerable to them. Then we’ll walk through four stories of race conditions in production, including one that we named the “Screaming Zombies” bug.

You’ll leave this talk with a greater appreciation for how to build and analyze concurrent systems, and several fun stories for how things can go amusingly wrong.

And if you were wondering about the question at the top, the answer is: Silence

Josh Larson
Josh is a full-time programmer, part-time human, whose interests include weird programming, physics, math, and trying to make software reliably be better. When he’s not writing code or equations, he’s probably biking somewhere or watching something on HBO.

RubyConf 2020

00:00:01.040 What is the sound of a zombie screaming? What does that have to do with race conditions? What even is a race condition? Who is this person sitting inside your computer telling you all about race conditions, zombies, and screaming?
00:00:17.199 My name is Josh. I have been developing software with Ruby for about five years. I currently work for a company that you may have heard of and almost certainly have used, called Braintree. We deal with payment processing, usually credit cards.
00:00:33.840 Before that, I worked at an Internet of Things platform where we tried to connect various devices to the internet. We had this goal to manage all sorts of things, but mostly focused on printers. In my previous jobs, I have used both Ruby and Java fairly extensively.
00:00:51.199 This is also my first time speaking at RubyConf. I've attended many RubyConfs before, and I have always been eager to try speaking at it. And this is exactly how I imagined that it would go.
00:01:08.880 Let's talk about incrementing. You might think this is one of the simpler operations that a computer can do, and in a sense, it is. But it's not just one thing; incrementing is three operations at the computer level.
00:01:17.199 First, you load the initial value of x. Then you figure out the new value, which is x plus one. Finally, you assign that new value to the variable x, storing it in whatever location and data structure x is being held in. What makes this important is the idea of an operation being atomic. So, x equals x plus one is not atomic because it happens in multiple steps. In fact, most things aren't atomic: most database operations, most list operations, most data structure actions—most things are not atomic.
00:02:06.799 So let's look at why that matters. Here we have what's called a sequence diagram. The way we read this is that the boxes at the top represent different items, while the lines that fall down from the middle of the box represent that thing's forward motion in time.
00:02:12.160 Time starts at the top and moves towards the bottom, while the arrows flowing between the items represent the messages that are being sent or passed between them. In this case, we are seeing two running processes, each labeled "Code," executing the operation x equals x plus one.
00:02:40.640 So, the first code comes in. We start at the top, and we see that x is assigned a value of zero. Next, the code on the left springs into existence and asks x, 'Hey, what is your value?' x replies, 'Zero.' The code on the left does some complex math, figuring out that zero plus one is, in fact, one, and then assigns the value one to x.
00:03:01.120 The code on the right comes in and does essentially the same thing, except that it starts with the value of one. As a result, the code on the right assigns the value two to x. Here's why we care so much about this being atomic—or in this case, not being atomic. There's nothing that guarantees that the arrows will be clustered together. If the sequence of operations had happened in a different order, we would get the wrong value for x.
00:03:38.640 So, what happens on the left is that the code says, 'Hey x, what is your value?' It learns the value is zero, but the code on the right queries x before the code on the left has finished assigning one to it. This means the code on the right tries to update using stale data: it also computes one and assigns one, so x ends up as one instead of two. This scenario captures the secret sauce of a race condition.
00:04:17.919 What that means is that we have multiple processes interacting with the same shared data. In this case, the two processes are the codes on the left and the right, and the shared data is the value stored in the variable x. These multiple processes rely on things happening in a certain order, but actions do not necessarily happen in that order. Therefore, when things happen out of order, this results in a race condition and the associated bugs.
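The two orderings described above can be replayed by hand. This is a minimal sketch, not code from the talk: the `Counter` class and the `run_in_order` and `run_interleaved` methods are hypothetical names, and the interleaving is written out step by step rather than produced by real threads, so the result is the same on every run.

```ruby
# A shared counter whose increment is split into its separate steps.
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  # Read the current value (the "what is your value?" message).
  def load
    @value
  end

  # Write a new value back (the assignment message).
  def store(new_value)
    @value = new_value
  end
end

# Each process finishes its increment before the other one starts.
def run_in_order
  x = Counter.new
  left = x.load        # left reads 0
  x.store(left + 1)    # left writes 1
  right = x.load       # right reads 1
  x.store(right + 1)   # right writes 2
  x.value
end

# The right process reads before the left one has written.
def run_interleaved
  x = Counter.new
  left = x.load        # left reads 0
  right = x.load       # right reads 0, which is stale once left writes
  x.store(left + 1)    # left writes 1
  x.store(right + 1)   # right also writes 1, so one increment is lost
  x.value
end

puts run_in_order      # 2
puts run_interleaved   # 1
```

Both methods perform exactly the same four operations; only their order differs.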
00:04:56.479 Now, I'm going to present a quick puzzle. Imagine we started with x equal to zero, but instead of the codes on the left and right each incrementing x once, let's say they each increment x ten times. The question is, at the end, when we print the value of x, what are the possible outcomes? At this point, I would recommend muting for about 30 seconds, until I move on to the next slide, in case you want to figure this out on your own.
00:05:29.680 The answer is that the value will be any number within a certain range. That range has a maximum of 20, which makes sense because there aren't enough increments to push x above 20. The smallest possible value is 2, and I’ll leave it as an exercise for the viewer to determine what sequence of operations could lead to x being printed as two.
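The range can be checked by brute force without giving away the sequence. The sketch below is an illustration under stated assumptions, not something shown in the talk: it models each increment as a load followed by a store, and explores every reachable interleaving of two processes with a breadth-first search. The names `step` and `possible_results` are hypothetical.

```ruby
require "set"

# Advance one process by one step and return [new_register, new_x].
# Even-numbered steps load x into the process's register.
# Odd-numbered steps store register + 1 into x; the register is then
# reset to 0 because its old contents no longer matter.
def step(pc, reg, x)
  pc.even? ? [x, x] : [0, reg + 1]
end

# Returns every value x can hold after two processes each increment it n times.
# A state is [pc_a, pc_b, reg_a, reg_b, x], where pc_* counts completed steps.
def possible_results(n)
  total = 2 * n
  start = [0, 0, 0, 0, 0]
  seen = Set.new([start])
  queue = [start]
  results = Set.new

  until queue.empty?
    pc_a, pc_b, reg_a, reg_b, x = queue.shift
    if pc_a == total && pc_b == total
      results << x
      next
    end

    successors = []
    if pc_a < total
      new_reg, new_x = step(pc_a, reg_a, x)
      successors << [pc_a + 1, pc_b, new_reg, reg_b, new_x]
    end
    if pc_b < total
      new_reg, new_x = step(pc_b, reg_b, x)
      successors << [pc_a, pc_b + 1, reg_a, new_reg, new_x]
    end

    # Set#add? returns nil when the state was already visited.
    successors.each { |state| queue << state if seen.add?(state) }
  end

  results.sort
end

p possible_results(3)   # [2, 3, 4, 5, 6]
```

With three increments per process the results run from 2 to 6. Calling `possible_results(10)` explores the puzzle as stated, at the cost of a noticeably larger search.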
00:06:04.080 And now, for story time! I’m going to share four stories, each a moment in my career when I encountered a race condition that caused a bug. It might have been a bug in a test or in monitoring tools, but in each case, there was a race condition. I'll explain how we identified the race condition, the impact it had, and how we fixed it.
00:06:23.280 Story number one is called 'Smurfy's Law.' Smurf was a tool at Braintree that gathers data and sends it to one of our banking partners. This process must happen constantly in a payment processing company. Various parts of our system write data to a bucket in S3, and Smurf reads that data from the same bucket to process and format it correctly for sending.
00:06:46.960 Of course, we have code running in different environments, so to properly separate data, we have one bucket for production, one for sandbox, and one for test. The production bucket is appropriately access-restricted, the sandbox bucket has sandbox data, and the test bucket gets cleared and repopulated during integration test runs with data that the tests are aware of.
00:07:16.319 Here’s a look at a typical integration test run. One test begins by seeding some data into S3, then instructing Smurf to do its thing, followed by making assertions about Smurf's actions regarding the seeded data. One of the things we prioritize is ensuring that Smurf accurately handles changes in data. We do this in two steps, named 'step one' and 'step two.'
00:07:45.599 Here’s the bug: we started seeing sporadic failures, especially as we began developing Smurf. We would expect the step two results, but would get the step one results instead. Imagine if two runs of the tests were executed simultaneously; let's call one T1 and the other T2. Say T1 is our CI setup running tests on the main branch, and T2 is me running tests on my local machine.
00:08:18.879 If the tests on the left (T1) finished before my tests (T2) started, Smurf would read the step two data correctly, and T2 would then begin again with its own step one data. However, if T2 seeded the bucket with step one data after T1 had seeded its step two data and before T1's Smurf could read it, T1 would encounter an unexpected result.
00:09:01.839 Notice that the lines going down from the T1 and T2 boxes still perform the same operations as earlier, but the tests on the left don’t see the right data because Smurf on the left read the wrong data. This is exactly what causes such failures: we expected the step two data but got the step one data instead.
00:09:42.160 The core issue here is our single test bucket. Clearly, that is insufficient if multiple test runs can occur simultaneously. We need separate buckets to prevent different test runs from interfering with each other, which proved to be complicated to implement. However, once we had it all set up, it completely resolved the problem.
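The effect of giving each test run its own bucket can be sketched with an in-memory stand-in for S3. Everything here is an assumption made for illustration: `FakeStore`, `TestRun`, `interleaved_runs`, the bucket name `smurf-test`, and the random-suffix naming scheme are hypothetical, and the talk does not say how the real buckets were named or created.

```ruby
require "securerandom"

# In-memory stand-in for an object store: bucket name => { key => data }.
class FakeStore
  def initialize
    @buckets = Hash.new { |hash, name| hash[name] = {} }
  end

  def put(bucket, key, data)
    @buckets[bucket][key] = data
  end

  def get(bucket, key)
    @buckets[bucket][key]
  end
end

# One integration test run. Unless told to share a bucket, each run
# generates its own bucket name so other runs cannot overwrite its data.
class TestRun
  attr_reader :bucket

  def initialize(store, shared_bucket: nil)
    @store = store
    @bucket = shared_bucket || "smurf-test-#{SecureRandom.hex(8)}"
  end

  def seed(data)
    @store.put(@bucket, "input", data)
  end

  # Stands in for "Smurf does its thing": read whatever is in the bucket.
  def smurf_read
    @store.get(@bucket, "input")
  end
end

# Replays the failing interleaving: T2 seeds step one data after T1 has
# seeded step two data but before T1's Smurf reads. Returns what T1 sees.
def interleaved_runs(store, shared:)
  name = shared ? "smurf-test" : nil
  t1 = TestRun.new(store, shared_bucket: name)
  t2 = TestRun.new(store, shared_bucket: name)
  t1.seed("step two")
  t2.seed("step one")
  t1.smurf_read
end

puts interleaved_runs(FakeStore.new, shared: true)    # step one
puts interleaved_runs(FakeStore.new, shared: false)   # step two
```

The interleaving is identical in both calls. With separate buckets the processes no longer touch the same shared data, so the ordering stops mattering.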
00:10:00.560 Story number two, 'Out of the Void.' As a payment processing company, we handle transactions, which involves that moment when a customer interacts with a merchant's website or in person to make a payment. The transaction lifecycle can be simplified as follows: it starts in a pending state, and we check to ensure the credit card is valid.
00:10:40.720 Once the credit card is accepted, the transaction moves into a new state, authorized. At regular intervals, we process all transactions, marking them as settling or settled. This is when the money moves from one bank account to another.
00:11:12.000 However, if a customer or merchant decides to cancel the transaction post-authorization but before settlement, we void it instead, moving it into the voided state. Thus, the sequence diagram for this process looks like this: a customer requests to void the transaction, and upon confirming that it's voided, the settler checks to see if it can proceed.
00:12:04.079 Here’s the bug: occasionally, a voided transaction would get settled. This was confusing for any customer who saw their voided transaction still being processed. The issue emerges when the settler checks the transaction's state before the void has been recorded, sees an authorized transaction, and learns that the transaction is voided only after attempting to settle it.
00:12:50.800 At this point, both the customer and the settler think they are doing the right thing. The customer believes their transaction has been successfully voided, while the settler thinks the transaction should be settled. This becomes problematic when we have to ensure there is no transition between the settling state and the voided state.
00:13:35.760 Because voiding occurs in a web request while settling happens in a batch job, we can block a transaction from being voided if it is marked for settling. However, we cannot check every transaction right before settling as that would lead to excessive database requests and possibly freeze the system.
00:14:40.320 The clever solution we devised involved implementing a timeout mechanism. Instead of settling the transaction at once, we mark it as 'about to settle' and wait 60 seconds. A void request that arrives after the mark is blocked. Once the wait is over, the settler checks whether the transaction is still valid. If the transaction was flagged as voided in the meantime, the settler knows not to settle it.
00:15:18.720 The reason for the 60-second timeout is that our web requests need to complete within that time frame. Any void request that began before we initiated the settling process is therefore guaranteed to have completed by the time the settler checks the status.
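The mark-then-wait idea can be sketched with an in-memory transaction and explicit timestamps. This is a rough model under assumptions, not Braintree's implementation: the state names, the `Transaction` struct, the two-phase `VoidRequest`, and the `mark_for_settlement` and `settle` methods are all hypothetical, and time is passed in as plain numbers of seconds so the example stays deterministic.

```ruby
VOID_TIMEOUT = 60 # seconds a web request may take, per the talk

Transaction = Struct.new(:state, :marked_at)

# A void arriving through a web request, split into two phases so the
# sketch can place the settler's work between them.
class VoidRequest
  def initialize(txn)
    @txn = txn
    @allowed = false
  end

  # Phase 1: the request checks whether voiding is still allowed.
  def check
    @allowed = (@txn.state == :authorized)
  end

  # Phase 2: the request records the void if phase 1 allowed it.
  def commit
    @txn.state = :voided if @allowed
    @allowed
  end
end

# Batch job, first pass: mark the transaction so new voids are refused.
def mark_for_settlement(txn, now)
  return false unless txn.state == :authorized
  txn.state = :about_to_settle
  txn.marked_at = now
  true
end

# Batch job, second pass: settle only after the timeout, and only if no
# in-flight void has landed since the mark.
def settle(txn, now)
  return :voided if txn.state == :voided
  return :not_marked unless txn.state == :about_to_settle
  return :too_early if now - txn.marked_at < VOID_TIMEOUT
  txn.state = :settled
  :settled
end

# A void that was already in flight when the settler marked the transaction.
txn = Transaction.new(:authorized)
in_flight = VoidRequest.new(txn)
in_flight.check                 # allowed: the transaction is still authorized
mark_for_settlement(txn, 5)     # settler marks at t = 5
in_flight.commit                # the void lands after the mark
puts settle(txn, 65)            # voided

# A void that starts after the mark is refused, and settlement proceeds.
txn = Transaction.new(:authorized)
mark_for_settlement(txn, 0)
late = VoidRequest.new(txn)
late.check                      # refused: the transaction is about to settle
late.commit
puts settle(txn, 30)            # too_early
puts settle(txn, 60)            # settled
```

The design trades latency for safety: settlement is delayed by the timeout, but the settler never has to re-query each transaction immediately before settling it.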
00:15:56.639 Next, we have story number three: 'Ghost in the Machine.' Previously, I worked on an Internet of Things platform where various devices send data to the internet. Typically, a device transmits two types of messages: sensor data and heartbeat signals indicating its online status.
00:17:02.560 The server keeps track of this data, and it has to guard against stale data: it can happen that old data is received after the latest. When a temperature reading arrives that is older than the last reading, the stored value should not be updated.
00:18:06.040 Our bug stemmed from one particular device's separate messaging for heartbeat and sensor data. While heartbeats were sent on time, sensor data messages were delayed enough to make them stale by the time they reached the server.
00:19:03.600 The fundamental problem lay in the server’s data model not accounting for heartbeat and sensor data arriving separately or out of order. While restructuring to accommodate this is an ideal solution, it can incur significant risk and cost.
00:19:41.760 Instead, we opted not to fix the problem as it only impacted our testing devices, and since the issue was not seen in production, this approach proved reasonable.
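For illustration only, here is one way the staleness check could go wrong and what the restructuring mentioned above might look like. The talk does not describe the server's actual data model, so this sketch assumes a record that keeps a single timestamp for all message kinds; the class names, the `receive` signature, and the timestamps are hypothetical. The team did not ship a fix like this.

```ruby
# Assumed original model: one "last seen" timestamp shared by every message kind.
class SharedTimestampRecord
  attr_reader :temperature

  def initialize
    @last_seen = 0
    @temperature = nil
  end

  # Returns false when the message is rejected as stale.
  def receive(kind, sent_at, temperature = nil)
    return false if sent_at < @last_seen
    @last_seen = sent_at
    @temperature = temperature if kind == :sensor
    true
  end
end

# Possible restructuring: track staleness separately per message kind.
class PerKindTimestampRecord
  attr_reader :temperature

  def initialize
    @last_seen = Hash.new(0)
    @temperature = nil
  end

  def receive(kind, sent_at, temperature = nil)
    return false if sent_at < @last_seen[kind]
    @last_seen[kind] = sent_at
    @temperature = temperature if kind == :sensor
    true
  end
end

# A heartbeat arrives on time; a sensor reading sent earlier arrives late.
def replay(record)
  record.receive(:sensor, 10, 20.5)
  record.receive(:heartbeat, 100)
  record.receive(:sensor, 90, 23.0)
  record.temperature
end

puts replay(SharedTimestampRecord.new)    # 20.5
puts replay(PerKindTimestampRecord.new)   # 23.0
```

In the shared-timestamp model the heartbeat makes the delayed reading look stale, so the server keeps showing the older temperature. With per-kind timestamps the reading is compared only against other sensor messages.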
00:20:11.760 Now finally, we arrive at story number four: 'Screaming Zombies.' At Braintree, we had a tool called Postoff that involved Ruby code sending data to a Java process, which then formatted and sent the data to our partner. This task is fraught with challenges, especially ensuring that data is sent reliably and quickly.
00:21:04.960 To achieve this, we established monitoring for the latency of requests and how many requests succeed or fail. Using a metrics provider called StatsD, we would write the value of some variable that tracked our data.
00:22:00.480 However, the problem we encountered was that our data would disappear intermittently. We could see actual useful values in the metrics for a while, but then they would abruptly drop to zero, correlating with runs on different machines.
00:23:00.080 The issue stemmed from the way we handled our metrics reporting. Originally, the 'stop' method did nothing, so when 'start' was inadvertently called multiple times, resources were not cleaned up properly, causing the old metrics reporter to become a zombie process.
00:23:40.640 These zombie reporters with outdated metrics would drown out valid data by continuously sending zeros, leading to our monitoring system effectively reporting no useful information.
00:24:44.300 The easy fix was simply implementing a one-line change to the stop method to ensure that all previous reporters ceased reporting metrics. By removing the redundant processes, we eliminated race condition scenarios connected to shared state changes.
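The zombie behaviour can be modelled without real threads or StatsD. This is a sketch under assumptions, not the Postoff code: the `Reporter` and `Metrics` classes, the `stop_halts_reporters` switch, and the idea that `start` calls `stop` first are all hypothetical, and the periodic reporting timer is replaced by an explicit `flush` call so the output is deterministic.

```ruby
# Counts events and emits the count on every reporting interval.
class Reporter
  def initialize
    @count = 0
    @active = true
  end

  def increment
    @count += 1
  end

  def halt
    @active = false
  end

  # Emit the current count into the sink, then reset it.
  def flush(sink)
    return unless @active
    sink << @count
    @count = 0
  end
end

class Metrics
  # stop_halts_reporters: false models the original no-op stop,
  # true models the fixed stop.
  def initialize(stop_halts_reporters:)
    @stop_halts_reporters = stop_halts_reporters
    @reporters = []
  end

  def start
    stop
    @reporters << Reporter.new
  end

  def stop
    @reporters.each(&:halt) if @stop_halts_reporters
  end

  # The application only ever writes to the newest reporter.
  def increment
    @reporters.last.increment
  end

  # One reporting interval: every reporter that is still running emits a value.
  def flush
    sink = []
    @reporters.each { |reporter| reporter.flush(sink) }
    sink
  end
end

def simulate(stop_halts_reporters)
  metrics = Metrics.new(stop_halts_reporters: stop_halts_reporters)
  metrics.start
  metrics.start               # inadvertent second call
  3.times { metrics.increment }
  metrics.flush
end

p simulate(false)   # [0, 3]
p simulate(true)    # [3]
```

With the no-op `stop`, the first reporter keeps emitting zero next to the live value of three. Once `stop` halts earlier reporters, only the live value is sent.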
00:25:35.040 In conclusion, we've discussed various remedies for managing race conditions: separate data instances, controlling the order of operations, or removing duplicate processes to secure robustness in our systems.
00:26:12.680 I hope you enjoyed following along on this journey through some of the race condition woes that I’ve experienced. It is essential to recognize that race conditions can occur anywhere, especially in today’s distributed world where redundancy in application architecture is common. Remember to keep these concepts in mind to ensure that your systems function robustly.
00:27:11.520 Finally, I posed a very important question at the beginning of this talk: what is the sound of a zombie screaming? The answer is silence.