00:00:11.639
We are opening today with Daniel Colson. Daniel is a Senior Software Engineer on the Ruby Architecture team at GitHub. He's worked on Rails applications of all sizes and contributed to numerous open-source projects. Daniel was formerly a composer, violist da gamba, and professor of music.
00:00:26.080
So it sounds to me like a career changer, which is really cool! How many of us here are career changers? A lot of people! I love that! I just love taking that poll at every conference.
00:00:39.760
Alright, Daniel is going to take us away and kick us off with "The Very Hungry Transaction." Give it up for Daniel!
00:00:55.079
Uh, an important bit of information before I start: I'm not the only viol da gamba player in the room! I think I see you back there. We've never met in person before, but there are two of us here!
00:01:08.720
So today, I'm going to tell you a story about Cat. Cat is a software developer working on a large Ruby on Rails application at Bug Hub. Cat has had a bit of a stressful week. Bug Hub is a large online community of bugs, and Cat has spent the past week shipping a highly successful new product within Bug Hub.
00:01:27.119
However, along the way, Cat has had to deal with a bunch of problems — everything from data consistency bugs to application slowness and, at one point, even site-wide failure. All these problems are going to trace back to a single database transaction. What started out as a simple little transaction is going to grow as there are more and more business requirements, and eventually, the transaction is going to become a liability for Cat's application.
00:01:54.200
Along the way, we're going to explore what went wrong and how Cat could have prevented some of these problems. The story starts on Monday. Cat showed up to work very hungry for business opportunities. Cat knows that a lot of the bugs on Bug Hub are very hungry for food, but they're too busy connecting on the site to go out to the grocery store.
00:02:13.680
Sitting there hungry on their computers, they want groceries delivered directly to their homes. So, Cat has decided to launch a brand new grocery delivery service integrated directly into the larger Bug Hub product — the first grocery delivery service for bugs.
00:02:37.720
We're not going to dig into the controllers or the views or even how the order gets put together today. We're only looking at how the order gets saved. We can assume that this order instance variable is an instance of an Active Record model, backed by an orders table in the database. This order is going to have various grocery products in it, identified by a SKU or a Stock Keeping Unit.
00:02:59.280
There's also a join table connecting the products to the order. Each order item also has a quantity indicating how many of that particular product was ordered. Although it's not explicit in the code here, calling save on an Active Record model opens a database transaction. If we look at the queries involved, we'll see a BEGIN statement that opens the transaction, followed by an insert for the order and some number of inserts for the order items.
00:03:31.159
Finally, there will either be a commit statement if the transaction is successful or a rollback if it's not. This is what a transaction is supposed to look like. The idea here is we want this to be atomic; we want it to be all or nothing. Either all of the inserts succeed, or none of them succeed. We don't want partial success.
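A minimal sketch of the setup being described, assuming a standard Rails app; the model and column names here are illustrative guesses, not the talk's actual code:

```ruby
class Product < ApplicationRecord
  # products table with a `sku` (Stock Keeping Unit) column
end

class OrderItem < ApplicationRecord
  # join table row with a `quantity` column
  belongs_to :order
  belongs_to :product
end

class Order < ApplicationRecord
  has_many :order_items
  has_many :products, through: :order_items
end

# Saving the order wraps all of the inserts in one transaction:
@order.save
# BEGIN
#   INSERT INTO orders ...
#   INSERT INTO order_items ...   (one per item)
# COMMIT  -- or ROLLBACK on failure
```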
00:03:54.920
On Monday, Cat ships this code, and all is well so far. However, on Tuesday, Cat discovers that bugs are ordering more items than Bug Hub has in stock. This is a pretty frustrating experience for the bugs: they get excited about their order, and then it gets cancelled because those items aren't available.
00:04:12.799
So, Cat decides to deal with this problem by keeping track of the inventory directly on the products table. This involves adding a new column to the products table, and Cat can use that information to avoid showing out-of-stock products on the site, maybe adding some validations along the way.
00:04:36.759
Then, over in the submit code, after saving the order, Cat adds the code to update the product inventory for that order. The exact queries here aren't too important, but the basic idea is that for each item in the order, there will be an update statement that subtracts the quantity ordered from the inventory for that product, identified by the corresponding SKU.
00:05:06.199
Cat wants these two things to be atomic: either the order saves and the product inventory gets updated, or neither happens. So, Cat wraps this in an explicit database transaction. Now, all the inserts and updates are happening together in this one transaction. This seems fine, except that after shipping this code, Cat starts noticing an increase in 500 responses when placing orders.
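As a rough sketch, with illustrative names rather than the actual slide code, that combined transaction might look something like this:

```ruby
ActiveRecord::Base.transaction do
  @order.save!

  @order.order_items.each do |item|
    # UPDATE products SET inventory = inventory - <quantity> WHERE sku = <sku>
    Product.where(sku: item.product.sku)
           .update_all(["inventory = inventory - ?", item.quantity])
  end
end
```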
00:05:39.880
He looks at his bug tracking software and sees a whole bunch of errors, like deadlock errors, lock timeout errors, and requests taking a long time and sometimes timing out. So, what's going on here? The underlying problem is something called lock contention. This happens when multiple processes attempt to work with the same data at the same time, leading them to compete for access to that data.
00:06:10.840
We can imagine two bugs ordering the same or similar products around the same time. Their transactions will both begin; they'll insert their orders, and then both of those transactions arrive at this update statement — both trying to update the product inventory for the same product at the same time. Well, that's not going to work.
00:06:48.759
I've been comparing database rows recently to bathrooms. A bathroom only makes sense for one person to use at a time: when you go in, you lock the door, and anybody else who wants to use it has to wait. That's kind of what happens here: one of these update statements arrives first. In order to perform the update, it needs to lock the row, which means the other transaction is going to get stuck waiting.
00:07:19.360
Even after the update succeeds, that row is still locked for as long as the transaction is open. Hence, bug two's transaction is left waiting until, eventually, bug one's transaction commits, and then bug two's transaction can proceed. At best, contention like this makes the performance of transactions worse; you can see bug two's transaction sitting there waiting.
00:07:37.680
At worst, it can lead to things like deadlock errors, where two or more transactions are waiting on opposite locks and neither can proceed, or lock timeouts where a transaction waits so long that it times out and gives up. Interestingly, the speed of a transaction affects how much opportunity there is for lock contention.
00:08:10.240
If the locks are not held for too long, there's less opportunity for contention. The chances of another transaction needing to operate on the same rows at the same time are relatively low, and even if it does happen, they won't have to wait that long for the locks. But if there's a slow transaction, there's a lot of time for another transaction to come in and need to operate on these same rows, potentially having to wait a while to get those locks.
00:08:42.200
This is precisely what was happening in Cat's case. Cat had a bunch of slow transactions creating this opportunity for contention. But what was making Cat's transactions slow? One thing is that Cat had some slow queries within these transactions. It turns out these update statements were slow because Cat forgot to add an index on the SKU column.
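For instance, a migration along these lines would likely have helped those lookups; this is a sketch, and whether the index should be unique depends on the real schema:

```ruby
class AddIndexToProductsOnSku < ActiveRecord::Migration[7.1]
  def change
    # Lets each `UPDATE products ... WHERE sku = ?` find its row without a table scan
    add_index :products, :sku, unique: true
  end
end
```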
00:09:10.680
It’s interesting to note that adding an index and speeding up a query can actually affect lock contention in your transactions! Moreover, even if Cat made all these queries fast, these very hungry bugs were submitting orders with massive numbers of products in them. Thus, having too many queries within a transaction can also slow that transaction down.
00:09:37.920
Transactions like this are especially bad for contention because they're holding a lot of locks. So, for example, if you have a thousand products being updated, that's a thousand locks held for the entire duration the transaction is open. Additionally, to make matters worse, a transaction like this is likely to get stuck waiting for locks as well.
00:10:07.600
Contention has a way of spreading: locks held by one transaction slow down another transaction, and that slowness creates more opportunities for contention with yet more transactions, and the problem can cascade. Therefore, operating on large batches of records is not something a transaction is good for.
00:10:34.840
You can probably get away with updating hundreds of records in a transaction, but if you start getting into the thousands, it might be time to reconsider the design. In this case, Cat should have validated a maximum number of products allowed in a single order.
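One possible shape for that validation, sketched with the illustrative models from earlier; the limit of 100 is made up:

```ruby
class Order < ApplicationRecord
  MAX_ITEMS = 100

  has_many :order_items

  # Reject orders with too many line items before they ever reach the database
  validates :order_items, length: { maximum: MAX_ITEMS }
end
```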
00:11:01.520
This innocent-looking addition can cause some real problems. Cat can reduce the possibility of contention by speeding up queries and limiting the number of queries in the transaction, and Cat can also prevent some of the specific errors seen, like the deadlock errors, by updating products in a consistent order.
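Sorting the items before updating is one way to get that consistent ordering; a sketch, reusing the illustrative names from before:

```ruby
ActiveRecord::Base.transaction do
  @order.save!

  # Always acquire row locks in the same order (here, by SKU) so two competing
  # transactions can't each end up holding a lock the other one needs.
  @order.order_items.sort_by { |item| item.product.sku }.each do |item|
    Product.where(sku: item.product.sku)
           .update_all(["inventory = inventory - ?", item.quantity])
  end
end
```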
00:11:35.840
Cat might also want to question his assumptions about whether the updates need to happen within a transaction at all. Perhaps there's a design that doesn't require these updates to happen in a transaction or that doesn't require updating the products table at all. Maybe Cat needs a more robust inventory management system that lives outside of this database.
00:12:08.360
On Wednesday, Cat noticed customers complaining that submitting an order was taking too long. They would click on the submit button and get stuck waiting for a spinner — and nobody likes a spinner! Well, maybe some people do, but Cat tracked the slowness back to a single after_create call that was doing some slow work, like syncing with external platforms and sending out emails.
00:12:31.280
This is stuff we often don't want inside a web request because it slows down the request. We also don't want it inside a transaction because it slows down the transaction and creates more opportunities for contention, as Cat experienced on Tuesday.
00:12:54.440
So, Cat moves this slow stuff into a background job, and all is well. Well, not exactly. It's definitely an improvement, but there’s still a potential problem here. After shipping this change, Cat started to hear reports of confirmation emails being delayed by several minutes.
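A sketch of that change, with a hypothetical OrderConfirmationJob standing in for the real slow work:

```ruby
class Order < ApplicationRecord
  after_create :enqueue_confirmation

  private

  def enqueue_confirmation
    OrderConfirmationJob.perform_later(self)
  end
end

class OrderConfirmationJob < ApplicationJob
  def perform(order)
    # sync with external platforms, send the confirmation email, etc.
  end
end
```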
00:13:16.680
Since Cat is using an after_create callback, the job gets enqueued right after inserting the order but before the transaction commits. And since Cat is not using a database-backed job backend, enqueuing the job is a separate operation from the transaction.
00:13:47.360
This means the worker can pick up this job and run it before the transaction commits. It’s not too big a deal in Cat's case because this job requires a persisted order, and the order isn't really persisted until the transaction commits. So the job is going to fail and can get retried later after the transaction commits.
00:14:05.640
That's fine — not ideal, but fine. However, this transaction then moves on to updating products, which we have seen can be quite slow in Cat's application. This leaves room for this job to retry and fail multiple times before the transaction commits.
00:14:30.720
To accommodate that, Cat has had to configure the retries for the job quite high so that eventually, when the transaction does commit, it will retry once more and succeed. But all these retries become problematic when considering that a large number of these transactions are rolling back due to contention-related errors.
00:14:56.080
In this case, the job gets enqueued but can never succeed because the order never actually gets created. These jobs end up stuck retrying and failing until eventually their retries run out. Thus, these are destined to fail but are still in the queue using up resources.
00:15:16.680
What happened in Cat's case is that these failed jobs ended up backing up the queue and delaying the work of successful job runs, which included sending confirmation emails. Cat probably needs more job capacity here. It's not great that these extra retries got in the way of successful job runs; we want our capacity determined by successful runs, not by failures.
00:15:38.560
The simplest thing Cat could have done differently here is to replace the after_create call with an after_create_commit callback. This way, the job gets enqueued after the transaction commits; it only gets enqueued if the transaction commits. If it rolls back, it won’t get enqueued at all.
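In sketch form, reusing the illustrative callback from before, the change is tiny:

```ruby
class Order < ApplicationRecord
  # Enqueue only once the transaction has committed; never enqueue on rollback.
  after_create_commit :enqueue_confirmation

  private

  def enqueue_confirmation
    OrderConfirmationJob.perform_later(self)
  end
end
```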
00:16:00.440
Now, the job will typically run exactly once, and we won't need to deal with failures and retries at all. Moreover, Cat can look forward to Rails 7.2, which will come with configuration to make this behavior the default. In that case, Cat wouldn't have to make any code changes at all and would get this improved behavior for free.
00:16:24.120
On Thursday, customers began reporting getting deliveries too late, or sometimes receiving the wrong items. It turns out Cat's team had been doing a lot of manual work to fulfill the orders, but it’s a slow and error-prone process. Cat has decided to work with an external fulfillment service to handle fulfilling the orders.
00:16:50.720
This fulfillment service has an API that Cat can use to submit the orders. So, after saving the order to Bug Hub’s database, Cat adds this call to submit the order to the fulfillment service via a fulfillment client. This HTTP call to an external service is inside a database transaction.
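Sketched out with hypothetical names (FulfillmentClient, its submit_order method, and the update_product_inventory helper are all illustrative):

```ruby
ActiveRecord::Base.transaction do
  @order.save!
  update_product_inventory(@order) # the inventory updates from earlier

  # An HTTP request to the external fulfillment service, made while the
  # database transaction is still open
  FulfillmentClient.submit_order(@order)
end
```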
00:17:20.640
If we examine the queries involved, this external call looks a lot like the job example, because it is similar. We run into some of the same problems here: for example, the external call could succeed, but the transaction could still roll back for some reason.
00:17:51.520
Worse still, after shipping this change, Cat started to see errors across all of Bug Hub, not just in the grocery delivery product. This was an issue across the entire site. While looking into these problems, Cat noticed that this external fulfillment service was experiencing some scaling issues.
00:18:20.120
Too many orders were coming in from Bug Hub, and the service wasn't able to handle them all. It makes sense that this would affect creating orders, but how did this cascade into failures across the whole site? As the service became backlogged with too many requests, it began responding very slowly or not responding at all.
00:18:51.520
This kind of intermittent slowness is somewhat expected for external calls: services can be slow or degrade, and even when they're healthy, there's network latency to deal with. This is why we often move external calls outside of web requests and into background jobs.
00:19:20.440
Regardless of whether this transaction is in a web request or in a job, when slowness occurs inside a transaction, it can threaten the health of the database. From the database's perspective, the transaction is sitting idle for the duration of this slow external call. You might think it's not doing much, but it’s still using valuable resources.
00:19:51.440
Coming back to these locks: these locks will be held for the duration of the transaction. If your transaction is idle for minutes, those locks are also held for minutes. However, that's not the biggest problem here. More importantly, an open transaction keeps the database connection open the entire time that the transaction is open.
00:20:19.920
That database connection cannot be used for anything else. So, as the fulfillment service degrades, orders keep coming in, opening transactions that get stuck sitting idle while waiting for that fulfillment service response. As more orders arrive, the number of connections in use increases, leading to a decrease in the available database connections.
00:20:50.840
Eventually, every single database connection is in use. At that point, no other requests can acquire a database connection, and any request needing access to this database across all of Bug Hub starts to fail. Even the connection pooling mechanism sitting in front of Cat's database did not help here because each transaction uses one real database connection the entire time it’s open.
00:21:12.920
This is a serious problem. One thing Cat should have done differently is to add a client timeout to this HTTP call; you never want to wait an unlimited amount of time for an external service to respond, and a timeout would have mitigated some of these issues.
00:21:45.840
Yet, there’s always some risk of having such an external call inside a transaction, and there's no real advantage. The external call has no place in the transaction. Instead, Cat should move this network call outside of the transaction. Now, if the fulfillment service has problems, it won't have an effect on the database transaction.
00:22:09.760
That transaction will only begin after the fulfillment service succeeds. While the failure of the service will impact order creation, it won’t affect the entire site. I also appreciate that now it’s visually clear that these are two separate operations, not a single atomic operation.
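A sketch of that rearrangement, again with hypothetical names, plus a client timeout so the call can never hang indefinitely:

```ruby
# External call first, with a timeout, outside of any transaction
FulfillmentClient.submit_order(@order, timeout: 5)

# The database transaction only begins once the fulfillment call has succeeded
ActiveRecord::Base.transaction do
  @order.save!
  update_product_inventory(@order)
end
```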
00:22:32.680
It is easier to see that in both versions it was possible to submit the order to the fulfillment service and then have the order transaction roll back and fail. However, this isn't the best failure mode: if we submit the order to the fulfillment service and it starts getting fulfilled, but we have no record of it in the Bug Hub database, that could be problematic.
00:23:05.440
Maybe if the order transaction rolls back, the API has a way to delete the order, but that could fail too. So then what do we do? It might actually make sense for Cat to swap the order of operations here. Now that these are two distinct operations, rearranging them is straightforward.
00:23:40.000
It is possible to create the order in the Bug Hub database and then fail to submit it to the fulfillment service. This may be easier to deal with: Cat can update the order's status after submitting it to the fulfillment service, which makes it possible to find orders that were created but never submitted and then retry them or perform some manual recovery.
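Sketching the swapped version, with a hypothetical status column used for recovery:

```ruby
ActiveRecord::Base.transaction do
  @order.save!               # the order exists in Bug Hub's database first
  update_product_inventory(@order)
end

FulfillmentClient.submit_order(@order, timeout: 5)
@order.update!(status: :submitted)
# Orders that were created but never marked as submitted can be found later
# and retried, or recovered manually.
```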
00:24:10.960
On Friday, customers began complaining about site reliability. In addition to all the previous problems we’ve discussed, users were experiencing intermittent 500 responses and slow performance, which frustrated them.
00:24:30.640
These issues stemmed from the grocery delivery service's growing popularity. The database was getting overwhelmed by writes on the orders and products tables, which in some ways is a good problem to have. Cat meets with the database team, and they decide that the best course of action is to separate the orders and products tables into distinct databases.
00:25:14.040
This leaves Cat with a dilemma: this orders transaction has queries for both orders and products. Cat restructures the operation as two nested transactions, one per database, and thinks it looks reasonable. However, this poses a serious risk: we can expect all the problems we've seen today to occur again, but now they will be worse, as they affect two databases instead of one.
00:25:42.880
If anything goes wrong within these transactions, whether it's contention problems or an external service failing, it now affects both of these databases. Furthermore, a query to one database made from inside a transaction on another database is effectively another type of external call, and it comes with all the same problems as any other external call.
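In sketch form, the nested version might look roughly like this, assuming the Order and Product models now connect to separate databases:

```ruby
Order.transaction do           # opens a transaction on the orders database
  @order.save!

  Product.transaction do       # opens a second transaction on the products database
    update_product_inventory(@order)
  end                          # the inner transaction commits here...
end                            # ...but the outer one can still roll back
```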
00:26:12.800
We’ve now tied the health of both databases together, undermining the effectiveness of splitting them in the first place. After Cat deployed this change, he noticed that specific products stopped selling because inventory levels had dropped to zero, even when the products were still in stock.
00:26:45.600
This situation is due to the fact that these two transactions are now separate operations across two distinct databases. This is not a single atomic operation. Most of the time, both transactions will commit; however, it is also possible for one transaction to commit while the other rolls back.
00:27:10.920
In Cat's case, this very scenario was occurring: the inner products transaction was committing, which decremented the product inventory, yet the order never got created as that transaction rolled back. This scenario is more common than one might think, especially when the order transaction is idle for extended periods.
00:27:39.940
Idle connections are some of the most likely to disappear. This can happen due to database timeouts or network timeouts — there are many potential issues that can cause a transaction to roll back when it was intended to commit.
00:28:16.880
Given that there are no guarantees about these transactions being atomic, the first step for Cat here is to remove the nesting and execute them sequentially. Although this might not be an easy change to make, it is essential for the health of these databases.
00:28:42.160
Doing things sequentially highlights the fact that they are two separate operations, rather than a single atomic one. Consequently, we are more likely to consider the case where the product transaction succeeds and then the order transaction fails.
00:29:12.440
Cat doesn't want the product inventory to decrease without the order getting created. Therefore, one approach could involve swapping the order of the two operations. However, this introduces another failure mode: it is now possible to create an order without decrementing the product inventory, leading to inflated inventory levels instead of depletion.
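In sketch form, the two sequential transactions with the order created first:

```ruby
Order.transaction do
  @order.save!
end

Product.transaction do
  update_product_inventory(@order)
end
# New failure mode: the order exists but the inventory was never decremented.
# Inventory drifts up rather than down, which Cat may find easier to detect
# and repair.
```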
00:29:52.240
Cat can choose which option works better. If Cat truly requires these two processes to behave as if they are atomic, it would involve significantly more coding. It’s somewhat complex, perhaps a topic for a separate discussion.
00:30:27.440
That brings us to the weekend. Cat doesn’t work on weekends; he likes to rest and reflect on what he has learned. He learned this week that external calls within a transaction are a risk. They can lead to data integrity issues because they can seem atomic, when in reality they aren’t.
00:30:58.480
They also create cascading failures, where the failure of an external service leads to a failure of the database. External calls include things like HTTP requests, sending emails, enqueuing jobs, and querying other databases. Essentially, the only things we want inside a database transaction are queries to that same database.
00:31:31.280
Additionally, Cat learned that slow transactions are also a risk. Slow transactions within a request make for slow requests; we generally want to avoid these. In fact, slow transactions anywhere can lead to issues in application performance.
00:32:04.160
Lastly, there's also the issue of resource exhaustion. Database connections and CPU time are finite resources. Slow transactions may overutilize these resources, threatening the stability of the database. Even a slow transaction in a job can affect the overall performance of requests.
00:32:38.880
In summary, these insights might help you in writing safer transactions, allowing you to avoid some of the problems Cat encountered. I work at GitHub, not Bug Hub. I've changed names, domains, and species for this talk, but this is all based on real challenges we've faced at GitHub, and I'm eager to discuss any of them with you.
00:33:01.280
Finally, I would like to thank my father-in-law and his friend for the illustrations. Thank you!