00:00:01.040
What is the sound of a zombie screaming? What does that have to do with race conditions? What even is a race condition? Who is this person sitting inside your computer telling you all about race conditions, zombies, and screaming?
00:00:17.199
My name is Josh. I have been developing software with Ruby for about five years. I currently work for a company that you may have heard of and almost certainly have used, called Braintree. We deal with payment processing, usually credit cards.
00:00:33.840
Before that, I worked at an Internet of Things platform where we tried to connect various devices to the internet. We had this goal to manage all sorts of things, but mostly focused on printers. In my previous jobs, I have used both Ruby and Java fairly extensively.
00:00:51.199
This is also my first time speaking at RubyConf. I've attended many RubyConfs before, and I have always been eager to try speaking at it. And this is exactly how I imagined that it would go.
00:01:08.880
Let's talk about incrementing. You might think this is one of the simplest operations that a computer can do, and in a sense, it is. But it's not just one thing; incrementing is three operations at the computer level.
00:01:17.199
First, you load the initial value of x. Then you figure out the new value, which is x plus one. Finally, you assign that new value to the variable x, storing it in whatever location and data structure x is being held in. What makes this important is the idea of an operation being atomic. So, x equals x plus one is not atomic because it happens in multiple steps. In fact, most things aren't atomic: most database operations, most list operations, most data structure actions—most things are not atomic.
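To make the three steps concrete, here is a minimal Ruby sketch that spells out `x = x + 1` by hand. The `Counter` class and its local variable names are made up for this illustration; they are not from the talk.

```ruby
# Hypothetical holder for the shared variable x.
class Counter
  attr_accessor :x

  def initialize
    @x = 0
  end

  # x = x + 1, written out as three separate steps.
  def increment
    loaded = x            # step 1: load the current value
    computed = loaded + 1 # step 2: figure out the new value
    self.x = computed     # step 3: store the new value back
  end
end

counter = Counter.new
counter.increment
puts counter.x
```

Anything else that touches `x` between step 1 and step 3 can slip in unnoticed, which is what "not atomic" means here.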
00:02:06.799
So let's look at why that matters. Here we have what's called a sequence diagram. The way we read this is that the boxes at the top represent different items, while the lines that fall down from the middle of the box represent that thing's forward motion in time.
00:02:12.160
Time starts at the top and moves towards the bottom, while the arrows flowing between the items represent the messages that are being sent or passed between them. In this case, we are seeing two running processes, each labeled "Code," executing the operation x equals x plus one.
00:02:40.640
So, the first code comes in. We start at the top, and we see that x is assigned a value of zero. Next, the code on the left springs into existence and asks x, 'Hey, what is your value?' x replies, 'Zero.' The code on the left does some complex math, figuring out that zero plus one is, in fact, one, and then assigns the value one to x.
00:03:01.120
The code on the right comes in and does essentially the same thing, except that it starts with the value of one. As a result, the code on the right assigns the value two to x. Here's why we care so much about this being atomic—or in this case, not being atomic. There's nothing that guarantees that the arrows will be clustered together. If the sequence of operations had happened in a different order, we would get the wrong value for x.
00:03:38.640
So, what happens on the left is that the code says, 'Hey x, what is your value?' It learns the value is zero, assigns one to it, and the code on the right queries x before the code on the left has finished assigning it. This means the code on the right tries to update using stale data. This scenario captures the secret sauce of a race condition.
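The bad interleaving described above can be replayed by hand in Ruby. This is only a single-threaded sketch with the steps written in the unlucky order; the variable names `left_loaded` and `right_loaded` are invented for the example.

```ruby
x = 0

left_loaded = x       # left asks x for its value and gets 0
right_loaded = x      # right asks before left has assigned, and also gets 0
x = left_loaded + 1   # left assigns 1
x = right_loaded + 1  # right assigns 1, computed from stale data

puts x # two increments ran, but x ends up as 1
```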
00:04:17.919
What that means is that we have multiple processes interacting with the same shared data. In this case, the two processes are the codes on the left and the right, and the shared data is the value stored in the variable x. These multiple processes rely on things happening in a certain order, but actions do not necessarily happen in that order. Therefore, when things happen out of order, this results in a race condition and the associated bugs.
00:04:56.479
Now, I'm going to present a quick puzzle. Imagine we started with x equal to zero, but instead of the codes on the left and right each incrementing x once, let's say they each increment x ten times. The question is, at the end, when we print the value of x, what are the possible outcomes? At this point, I would recommend muting for about 30 seconds, until I move on to the next slide, in case you want to figure this out on your own.
00:05:29.680
The answer is that the value can be any number within a certain range. That range has a maximum of 20, which makes sense because there aren't enough increments to push x above 20. The smallest possible value is 2. I’ll leave it as an exercise for the viewer to determine what sequence of operations could lead to x being printed as two.
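One way to convince yourself of the shape of that range, without giving away the sequence, is to brute-force a smaller version of the puzzle. The Ruby sketch below is an assumption-laden model, not anything from the talk: each increment is reduced to a load step and a store step, and each side increments only twice so that every interleaving can be tried.

```ruby
# Runs one schedule (a list of :left / :right turns) and returns the final x.
# Each turn is either a load or a store, alternating per side.
def run_schedule(schedule)
  x = 0
  pending = {} # side => value it has loaded but not yet stored

  schedule.each do |side|
    if pending.key?(side)
      x = pending.delete(side) + 1 # store step
    else
      pending[side] = x            # load step
    end
  end
  x
end

increments_per_side = 2 # kept small; ten per side has far too many schedules
steps_per_side = increments_per_side * 2
total_steps = steps_per_side * 2

# Every way of choosing which time slots belong to the left side.
schedules = (0...total_steps).to_a.combination(steps_per_side).map do |left_slots|
  (0...total_steps).map { |slot| left_slots.include?(slot) ? :left : :right }
end

outcomes = schedules.map { |schedule| run_schedule(schedule) }.uniq.sort
puts outcomes.inspect
```

With two increments per side the outcomes run from 2 up to 4, the same shape as the 2 to 20 range for ten increments per side.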
00:06:04.080
And now, for story time! I’m going to share four stories, each a moment in my career when I encountered a race condition that caused a bug. It might have been a bug in a test or in monitoring tools, but in each case, there was a race condition. I'll explain how we identified the race condition, the impact it had, and how we fixed it.
00:06:23.280
Story number one is called 'Smurfy's Law.' Smurf is a tool at Braintree that gathers data and sends it to one of our banking partners. This process must happen constantly in a payment processing company. Various parts of our system write data to a bucket in S3, and Smurf reads that data from the same bucket, then processes and formats it correctly for sending.
00:06:46.960
Of course, we have code running in different environments, so to properly separate data, we have one bucket for production, one for sandbox, and one for test. The production bucket is appropriately access-restricted, the sandbox bucket has sandbox data, and the test bucket gets cleared and repopulated during integration test runs with data that the tests are aware of.
00:07:16.319
Here’s a look at a typical integration test run. One test begins by seeding some data into S3, then instructing Smurf to do its thing, followed by making assertions about Smurf's actions regarding the seeded data. One of the things we prioritize is ensuring that Smurf accurately handles changes in data. We do this in two steps, named 'step one' and 'step two.'
00:07:45.599
Here’s the bug: we started seeing sporadic failures, especially as we began developing Smurf. We would expect the step two results, but would get the step one results instead. Imagine that two runs of the tests were executed simultaneously; let's call one T1 and the other T2. Say T1 is our CI setup running tests on the main branch, and T2 is me running tests on my local machine.
00:08:18.879
If the tests on the left (T1) finished first and Smurf read the step two data correctly, my tests (T2) would only seed their step one data afterwards, and nothing would go wrong. Conversely, if T2 seeded the bucket with step one data before Smurf could read T1's step two data, T1 would encounter an unexpected result.
00:09:01.839
Notice that the lines going down from the T1 and T2 boxes still perform the same operations as earlier, but the tests on the left don’t see the right data because Smurf on the left read the wrong data. This is exactly what causes such failures: we expected the step two data but got the step one data instead.
00:09:42.160
The core issue here is our single test bucket. Clearly, that is insufficient if multiple test runs can occur simultaneously. We need separate buckets to prevent different test runs from interfering with each other, which proved to be complicated to implement. However, once we had it all set up, it completely resolved the problem.
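As a rough sketch of the idea of one bucket per test run, the Ruby below uses an in-memory stand-in for S3 so that it runs without any AWS setup. `FakeObjectStore`, `bucket_for_run`, and the `smurf-test-` prefix are all hypothetical; the talk does not say how the real buckets were named or created.

```ruby
require "securerandom"

# In-memory stand-in for S3 (an assumption made for this sketch).
class FakeObjectStore
  def initialize
    @buckets = Hash.new { |hash, name| hash[name] = {} }
  end

  def put(bucket, key, value)
    @buckets[bucket][key] = value
  end

  def get(bucket, key)
    @buckets[bucket][key]
  end
end

# Hypothetical naming scheme: a fresh bucket name for every test run.
def bucket_for_run
  "smurf-test-#{SecureRandom.hex(8)}"
end

store = FakeObjectStore.new
t1_bucket = bucket_for_run # e.g. the CI run
t2_bucket = bucket_for_run # e.g. a run on a local machine

store.put(t1_bucket, "data", "step two")
store.put(t2_bucket, "data", "step one") # no longer overwrites T1's data

puts store.get(t1_bucket, "data")
puts store.get(t2_bucket, "data")
```

Because the two runs no longer share any data, the order in which they seed and read stops mattering.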
00:10:00.560
Story number two, 'Out of the Void.' As a payment processing company, we handle transactions, which involves that moment when a customer interacts with a merchant's website or in person to make a payment. The transaction lifecycle can be simplified as follows: it starts in a pending state, and we check to ensure the credit card is valid.
00:10:40.720
If the credit card is accepted, the transaction moves into the authorized state. At regular intervals, we process all transactions, marking them as settling or settled; this is when the money moves from one bank account to another.
00:11:12.000
However, if a customer or merchant decides to cancel the transaction post-authorization but before settlement, we void it instead, moving it into the voided state. Thus, the sequence diagram for this process looks like this: a customer requests to void the transaction, and upon confirming that it's voided, the settler checks to see if it can proceed.
00:12:04.079
Here’s the bug: occasionally, a voided transaction would get settled. This was confusing for any customer who saw their voided transaction still being processed. The issue emerges when the settler first checks the transaction's state and then gets a confirmation that a transaction is voided only after attempting to settle it.
00:12:50.800
At this point, both the customer and the settler think they are doing the right thing. The customer believes their transaction has been successfully voided, while the settler thinks the transaction should be settled. To fix this, we have to ensure that a transaction can never move between the settling state and the voided state.
00:13:35.760
Because voiding occurs in a web request while settling happens in a batch job, we can block a transaction from being voided if it is marked for settling. However, we cannot check every transaction right before settling as that would lead to excessive database requests and possibly freeze the system.
00:14:40.320
The clever solution we devised involved a timeout mechanism. Instead of settling the transaction at once, the settler marks it as 'about to settle' and then waits 60 seconds. A void request that arrives after that point is blocked. After the wait, the settler checks whether the transaction is still valid. If the transaction was voided in the meantime, the settler knows not to settle it.
00:15:18.720
The reason for the 60-second timeout is that our web requests need to complete within that time frame. Any void requests made before we initiated the settling process are guaranteed to have completed by the time the settler checks the status.
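Here is one possible reading of that mechanism as a Ruby sketch. It is an in-memory model built on assumptions: the `Transaction` struct, the state names, and the injectable `wait` callable are invented for illustration, and the real system worked against database rows rather than Ruby objects.

```ruby
# Hypothetical in-memory transaction.
Transaction = Struct.new(:id, :state)

# Web request side: a void is refused once the settler has claimed the transaction.
def void(transaction)
  return false unless transaction.state == :authorized

  transaction.state = :voided
  true
end

# Batch side: claim, wait out any in-flight web requests, then settle what is left.
# `wait` is injectable so the example does not have to sleep for a minute.
def settle_batch(transactions, wait: ->(seconds) { sleep(seconds) })
  claimed = transactions.select { |t| t.state == :authorized }
  claimed.each { |t| t.state = :about_to_settle }

  wait.call(60) # the 60-second web request timeout

  claimed.each do |t|
    t.state = :settled if t.state == :about_to_settle
  end
end

a = Transaction.new(1, :authorized)
b = Transaction.new(2, :authorized)

# Simulates a void for b that passed its check just before the claim
# and writes its result while the settler is waiting.
in_flight_void = ->(_seconds) { b.state = :voided }

settle_batch([a, b], wait: in_flight_void)
late_void = void(a) # arrives after settlement and is refused

puts a.state
puts b.state
puts late_void
```

The wait costs one fixed delay per batch instead of an extra database check per transaction, which is the trade-off the story describes.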
00:15:56.639
Next, we have story number three: 'Ghost in the Machine.' Previously, I worked on an Internet of Things platform where various devices send data to the internet. Typically, a device transmits two types of messages: sensor data and heartbeat signals indicating its online status.
00:17:02.560
The server keeps track of this data. Ordering is crucial because stale data is a problem: an old message can arrive after a newer one. When a temperature reading that is older than the last stored reading arrives, it should not overwrite the stored value.
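A minimal Ruby sketch of that staleness rule might look like the following. The `Reading` struct, the `DeviceState` class, and the idea of comparing a `recorded_at` timestamp are assumptions for illustration; the talk does not describe the platform's actual data model.

```ruby
# Hypothetical sensor reading with the time it was taken on the device.
Reading = Struct.new(:value, :recorded_at)

class DeviceState
  attr_reader :latest

  # Accepts a reading only if it is newer than the one already stored.
  def apply(reading)
    return false if @latest && reading.recorded_at <= @latest.recorded_at

    @latest = reading
    true
  end
end

state = DeviceState.new
state.apply(Reading.new(21.5, Time.at(200)))
accepted = state.apply(Reading.new(19.0, Time.at(100))) # older, but arrives later

puts state.latest.value
puts accepted
```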
00:18:06.040
Our bug stemmed from one particular device's separate messaging for heartbeat and sensor data. While heartbeats were sent on time, sensor data messages were delayed enough to make them stale by the time they reached the server.
00:19:03.600
The fundamental problem lay in the server’s data model not accounting for heartbeat and sensor data arriving separately or out of order. While restructuring to accommodate this is an ideal solution, it can incur significant risk and cost.
00:19:41.760
Instead, we opted not to fix the problem as it only impacted our testing devices, and since the issue was not seen in production, this approach proved reasonable.
00:20:11.760
Now finally, we arrive at story number four: 'Screaming Zombies.' At Braintree, we had a tool called Postoff that involved Ruby code sending data to a Java process, which then formatted and sent the data to our partner. This task is fraught with challenges, especially ensuring that data is sent reliably and quickly.
00:21:04.960
To achieve this, we established monitoring for the latency of requests and how many requests succeed or fail. Using a metrics provider called StatsD, we would write the value of some variable that tracked our data.
00:22:00.480
However, the problem we encountered was that our data would disappear intermittently. We could see actual useful values in the metrics for a while, but then they would abruptly drop to zero, correlating with runs on different machines.
00:23:00.080
The issue stemmed from the way we handled our metrics reporting. Originally, the 'stop' method did nothing, so when 'start' was inadvertently called multiple times, the previous metrics reporter was never cleaned up. It kept running as a zombie alongside the new one.
00:23:40.640
These zombie reporters with outdated metrics would drown out valid data by continuously sending zeros, leading to our monitoring system effectively reporting no useful information.
00:24:44.300
The easy fix was simply implementing a one-line change to the stop method to ensure that all previous reporters ceased reporting metrics. By removing the redundant processes, we eliminated race condition scenarios connected to shared state changes.
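The shape of that fix can be sketched in Ruby as below. Everything here is hypothetical: the `Reporter` and `Metrics` classes, the `shutdown` method, and the stop-then-start call pattern are stand-ins, since the talk does not show the real code (which ran on the Java side).

```ruby
# Hypothetical reporter; a real one would push values to StatsD on a timer.
class Reporter
  def initialize
    @running = true
  end

  def running?
    @running
  end

  def shutdown
    @running = false
  end
end

class Metrics
  attr_reader :reporters

  def initialize
    @reporters = []
  end

  def start
    @reporters << Reporter.new
  end

  def stop
    # The one-line fix: with an empty body here, old reporters live on as zombies.
    @reporters.each(&:shutdown)
  end
end

metrics = Metrics.new
metrics.start
metrics.stop
metrics.start # a second start, as in the story

puts metrics.reporters.count(&:running?)
```

With an empty `stop`, the same call sequence would leave two reporters running, and the stale one would keep reporting over the live one.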
00:25:35.040
In conclusion, we've discussed various remedies for managing race conditions: separate data instances, controlling the order of operations, or removing duplicate processes to secure robustness in our systems.
00:26:12.680
I hope you enjoyed following along on this journey through some of the race condition woes that I’ve experienced. It is essential to recognize that race conditions can occur anywhere, especially in today’s distributed world where redundancy in application architecture is common. Remember to keep these concepts in mind to ensure that your systems function robustly.
00:27:11.520
Finally, I posed a very important question at the beginning of this talk: what is the sound of a zombie screaming? The answer is silence.