
Summarized using AI

The Very Hungry Transaction

Daniel Colson • May 24, 2024 • Detroit, MI

In this talk titled 'The Very Hungry Transaction' presented by Daniel Colson at RailsConf 2024, the narrative follows a database transaction that initially appears simple but gradually escalates into a significant liability due to evolving business requirements. The presentation highlights common pitfalls in database transaction management and offers strategies for improvement. Key points discussed include:

  • Transaction Growth: Cat, a software developer at Bug Hub, starts with a small transaction to integrate a grocery delivery service which soon becomes overly complex.
  • Data Integrity Issues: As Cat adds inventory tracking, the transaction expands, leading to locking issues, deadlocks, and timeout errors due to contention from simultaneous requests.
  • Performance Bottlenecks: Slow queries and excessive operations within transactions are identified as contributors to performance issues.
  • Separation of Concerns: The presentation stresses the importance of moving external calls (like API requests) out of transactions to maintain application performance and integrity.
  • Potential Solutions: Recommendations include using shorter transactions, limiting the number of queries, and deferring external calls until after commits.
  • Conclusion and Best Practices: Colson emphasizes keeping transactions short, avoiding external calls within them, and utilizing the upcoming tools in Rails 7.2 to manage background jobs post-transaction completion. The talk concludes with a call for cautious changes and proactive management of transactions to prevent future issues, backed by real experiences from GitHub, albeit within a fictionalized context.


The story begins with a little database transaction. As the days go by, more and more business requirements cause the transaction to grow in size. We soon discover that it isn't a little transaction anymore, and it now poses a serious risk to our application and business. What went wrong, and how can we fix it?

In this talk we'll witness a database transaction gradually grow into a liability. We'll uncover some common but problematic patterns that can put our data integrity and database health at risk, and then offer strategies for fixing and preventing these patterns.

RailsConf 2024

00:00:10.480 Our next speaker is a senior software engineer on the Ruby architecture team at GitHub. He has worked on Rails applications of all sizes and contributed to numerous open source projects. Formerly, he was a composer, string player, and professor of music. I really appreciate his ability to explain complex topics in understandable ways. Please join me in welcoming Daniel.
00:00:45.600 Thursday morning. A big crowd feels really good; there’s a lot of good energy. However, you’re all making me really nervous. So many smart people! That’s okay; I’ve got a story for you today about Cat. You might know the story of the Very Hungry Caterpillar, but this is not the same Cat. This Cat is a software developer working on a large Ruby on Rails application at Bug Hub. Cat has had a bit of a stressful week. Bug Hub is this large online community of bugs, and Cat has spent the past week building a highly successful new product within Bug Hub. Along the way, Cat ran into some issues: data consistency bugs, application slowness, and at one point, even sitewide failure. All of these problems can be traced back to a database transaction.
00:01:59.119 It started out as a small and simple transaction, but as the days go by, more and more business requirements cause this transaction to grow in size until it eventually becomes a liability for Cat’s application. Along the way, we’ll stop and explore what went wrong and how Cat could have improved some of these problems.
00:02:21.560 The story starts on Monday. Cat showed up to work very hungry for business opportunities.
00:02:33.519 Cat knows that many bugs on Bug Hub are very hungry for food, but they are too busy connecting with other bugs on Bug Hub and don’t want to be bothered to go out to the grocery store. They want groceries delivered directly to their homes. So, Cat has decided to launch a brand new grocery delivery service integrated with the larger Bug Hub product. This is the very first grocery delivery service for bugs.
00:03:03.120 We’re not going to dig into controllers or views or even how this order gets put together. We’re really only going to look today at how the order gets saved. We can assume that this order instance variable is an instance of an Active Record model, an order model backed by an orders table in the database. There’s also a product model for the various grocery products that a bug can buy, identified by a SKU (stock keeping unit). Finally, there’s a join table connecting those products to the orders, and this order item model also keeps track of how many of that particular item is in the order.
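To make the data model concrete, here is a minimal sketch of what these models might look like; the association and attribute names are assumptions, since the talk only describes them verbally.

```ruby
# Hypothetical sketch of the models described above.
class Order < ApplicationRecord
  has_many :order_items
  has_many :products, through: :order_items
end

class Product < ApplicationRecord
  # Identified by a SKU (stock keeping unit).
  has_many :order_items
end

class OrderItem < ApplicationRecord
  # Join row: which product, in which order, and how many (a quantity column).
  belongs_to :order
  belongs_to :product
end
```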
00:03:39.239 It’s not explicit in the code, but when we call save on an Active Record model, that opens up a database transaction. So, here’s our first transaction: if we look at the queries involved, we see a begin statement starting the transaction, then we insert the order and some number of order items for that order, and then we commit the transaction if it was successful, or roll back if it wasn’t. This is what a transaction is supposed to look like. The idea here is we want this to be atomic; we want it to be all or nothing. All of the inserts should succeed or none of them should succeed. This is a good use of a transaction, and there’s not too much risk here. Just a few queries inside of a transaction, not too much is going to go wrong. Cat ships this code, and all is well; the orders start rolling in.
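For reference, the save and the queries it implies look roughly like this (the SQL in the comments is approximate):

```ruby
@order.save
# BEGIN
#   INSERT INTO orders (...) VALUES (...)
#   INSERT INTO order_items (...) VALUES (...)   -- one per item
# COMMIT  -- or ROLLBACK if anything failed
```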
00:04:53.120 On Tuesday, Cat discovers that bugs are ordering more items than what Bug Hub has in stock. This is a frustrating experience because bugs get excited about their orders, but then the orders end up getting canceled. Cat decides to resolve this problem by keeping track of the inventory right here on the products table. He adds an inventory column to keep track of the quantity in stock. This way, Cat can use that data to avoid showing out-of-stock products on the site or maybe add a model validation so that it’s not possible to create orders for out-of-stock products.
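A sketch of what that change might look like; the column name, quantity attribute, and validation are assumptions based on the description above.

```ruby
# Hypothetical migration adding an inventory count to products.
class AddInventoryToProducts < ActiveRecord::Migration[7.1]
  def change
    add_column :products, :inventory, :integer, null: false, default: 0
  end
end

class OrderItem < ApplicationRecord
  belongs_to :order
  belongs_to :product

  # Reject order items that ask for more than the product has in stock.
  validate :product_in_stock

  private

  def product_in_stock
    return if product && product.inventory >= quantity

    errors.add(:base, "#{product&.sku} is out of stock")
  end
end
```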
00:05:32.160 In the submit code, Cat adds a line to update the product inventory after creating the order. The way it works is that it goes through each item in the order and performs an update by subtracting the number of items ordered from the inventory for that product. Cat wants these two operations to be atomic; they should either both succeed or neither should happen. So, he wraps this in an explicit database transaction. Now, all the inserts and updates are happening together in one transaction.
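The submit code at this point might look roughly like the following sketch (the exact method and column names are assumptions):

```ruby
# Create the order and decrement inventory atomically: all or nothing.
ApplicationRecord.transaction do
  @order.save!

  @order.order_items.each do |item|
    # Subtract the quantity ordered from that product's inventory.
    item.product.update!(inventory: item.product.inventory - item.quantity)
  end
end
```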
00:06:20.560 Unfortunately, after shipping this code, Cat starts to see a bunch of errors in the application. He checks his bug tracking software and sees a bunch of locking-related errors: deadlock errors, lock timeouts, and also requests taking too long and some of them timing out. The underlying problem here ends up being lock contention. This happens when multiple processes try to work with the same data at the same time, and they end up having to compete for access to that data.
00:07:05.440 Imagine two bugs ordering around the same time. They are ordering the same product. Their transactions begin, they insert their orders, and then they reach the part of the code that’s updating the product. Both of these transactions want to update the same product, and a locked row behaves a lot like a shared bathroom.
00:07:31.039 It generally only makes sense for one person to use it at a time. So, when you enter a bathroom, you lock it. Anyone else who wants to use it has to wait. That’s what’s happening here; only one of these updates can happen at a time. Whichever one does it will lock the row, and so the other update can’t happen until it’s released. Even after the update succeeds, that lock will still be held until the first bug's transaction is complete, which leaves the second bug waiting. Only once the first bug's transaction commits can the second bug finally get the lock and perform the operation.
00:08:28.720 Contention like this worsens the performance of transactions, as they get stuck waiting around. At worst, it can also lead to errors. Cat is encountering deadlock errors, where two or more transactions are waiting for each other to release a lock and get stuck unable to move forward. The database can also produce lock timeout errors, where a transaction ends up waiting so long for a lock that it gives up and times out. These can lead to application errors.
00:09:01.959 Interestingly, the speed of a transaction affects how much opportunity there is for contention. If the transaction is fast, the locks aren’t held for too long, and there’s not too much opportunity for this contention to happen. The chances of another transaction coming in and needing to work with the same data at the same time are relatively low. Even if one does, it won’t have to wait long for the lock.
00:09:24.760 Slow transactions, however, leave a lot of opportunity for contention. There is plenty of time for another transaction to come in and need to work with the same data, and that transaction might have to wait a long time for the lock. This is what was happening in Cat’s case: Cat had a lot of slow transactions creating this opportunity for contention.
00:09:54.640 There are many things that can make transactions slow. One of the things making Cat's transactions slow was the queries themselves. It turns out these update statements were relatively slow in Cat's application, because he forgot to add an index for the column involved. Cat could have reduced lock contention by adding an index to speed up this query. But even with fast queries, these very hungry bugs on Bug Hub were submitting orders with massive numbers of products in them.
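On the indexing point: if, for example, the update located products by SKU and that column had no index (an assumption; the talk doesn't show the exact query), a migration like this would speed the statement up and shorten how long the row locks are held.

```ruby
# Hypothetical migration: index the column used to find the product being updated.
class AddIndexToProductsSku < ActiveRecord::Migration[7.1]
  def change
    add_index :products, :sku, unique: true
  end
end
```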
00:10:20.720 Too many queries within a transaction can also make it slow. Transactions like this tend to hold a lot of locks, which creates a large opportunity for contention, and they are also likely to get stuck waiting on a lot of locks themselves. Lock contention has a tendency to spread: locks held by one transaction can slow down another transaction, and that slowness creates opportunities for even more contention with other transactions.
00:10:45.639 Processing large batches of records is not something transactions are particularly good at. While it's possible to have hundreds of updates in a transaction, if you start to get into the thousands, it might be time to rethink the design. In this case, perhaps Cat needs to add a maximum number of items that can be included in a single order.
00:11:16.839 This innocent-looking line can actually cause problems if it creates slowness in the transaction. Cat can reduce the potential for contention by speeding things up: speeding up the queries and limiting how many of them there are. Separately, Cat can also avoid specific errors; deadlock errors, for example, can be avoided by updating the products in a consistent order.
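A sketch of the consistent-ordering idea, continuing the hypothetical submit code from above: sorting by product id means concurrent orders always lock product rows in the same sequence, which removes the circular waits behind the deadlocks.

```ruby
ApplicationRecord.transaction do
  @order.save!

  # Update products in a consistent order (here, ascending product id) so two
  # overlapping transactions never wait on each other's locks in a cycle.
  @order.order_items.sort_by(&:product_id).each do |item|
    item.product.update!(inventory: item.product.inventory - item.quantity)
  end
end
```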
00:11:37.799 But Cat might also want to question whether this needs to be a transaction at all. Maybe there’s another design that doesn’t require these updates to happen in this transaction, or that doesn’t involve updating the product table at all. Perhaps Cat actually needs a more robust inventory management system that exists outside of this database.
00:12:29.279 On Wednesday, Cat noticed customers complaining that submitting an order sometimes took way too long. Cat traced the slowness back to this "after create" callback, which was doing a bunch of slow work with an external billing platform, sending confirmation emails, and other tasks. This is the type of work we often don’t want inside of our request, as it slows down the request and also complicates the transaction by creating more opportunity for lock contention.
00:12:50.079 To improve the situation, Cat moved all of the slow tasks into a background job. This is an improvement, but after shipping this change, Cat received reports of order confirmation emails being delayed by several minutes. Since Cat is using an "after create" callback, the job gets queued right after creating the order, immediately after the inserts, but before the transaction commits. Cat is not using a database-backed job queue adapter, so queuing here is not part of the current transaction. That means the worker can pick up and execute this job before the transaction commits.
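The problematic shape looks roughly like this (the job and callback are hypothetical stand-ins): the enqueue happens inside the still-open transaction.

```ruby
class Order < ApplicationRecord
  # after_create runs inside the transaction, before COMMIT, so the worker can
  # pick this job up while the order row is not yet visible to other connections.
  after_create do
    OrderConfirmationJob.perform_later(self)
  end
end
```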
00:13:45.120 This might not seem like a major issue: the job fails because it requires the order to be persisted, and it can simply retry later. But the situation is less than ideal. The transaction moves on to updating the product inventory, which we've seen can sometimes be really slow, so the job can retry and fail multiple times before the transaction itself commits. To accommodate this, Cat configured the job to retry a high number of times, ensuring that once the transaction eventually commits, one more attempt will succeed.
00:14:35.880 However, many of these transactions are rolling back due to various errors. This creates an issue because these jobs end up queued, but they will never succeed since the order will not be created. They get stuck retrying and failing until eventually, their retries run out. This results in stalled jobs remaining in the queue, not realizing that the rollback has occurred.
00:15:08.880 What happened in Cat's case is that these retries backed up the queue and interfered with the successful job runs, delaying critical tasks like sending confirmations. The simplest change Cat could have made here would have been to swap the after_create callback for an after_create_commit callback, which ensures the job doesn’t get enqueued until after the transaction commits. With this approach, if the transaction rolls back, the job never gets enqueued at all, and there is far less need for retries.
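In this sketch the fix is a one-word change: enqueue from a commit callback, so the job exists only if the order does.

```ruby
class Order < ApplicationRecord
  # Fires only after the transaction commits; if it rolls back, no job is enqueued.
  after_create_commit do
    OrderConfirmationJob.perform_later(self)
  end
end
```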
00:15:41.200 With this change, Cat would typically run the job exactly once without needing excessive retries. You can also look forward to Rails 7.2, which offers a built-in way to defer enqueuing jobs until after the transaction commits. With that in place, Cat wouldn’t have to make any additional changes at all, which would be great.
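As I understand the Rails 7.2 feature (the exact option name and accepted values are worth checking against the release notes), Active Job gains a setting that defers enqueuing until the surrounding transaction commits, so even a plain after_create enqueue becomes safe:

```ruby
class OrderConfirmationJob < ApplicationJob
  # Rails 7.2 (as I understand it): if this job is enqueued inside a transaction,
  # only hand it to the queue adapter after that transaction commits.
  self.enqueue_after_transaction_commit = :always
end
```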
00:16:16.519 On Thursday, customers started complaining about orders being delivered too late or even sometimes containing the wrong items. It turns out Cat's team was doing a lot of manual work to fulfill these orders, which had become extremely slow and prone to errors. Cat decided to collaborate with an external fulfillment service that has an API, which we can use to submit the orders.
00:16:53.639 After saving the order to Bug Hub's database, he adds a call to submit the order to this fulfillment service via an API. This is an HTTP call to an external service inside of a database transaction. If we look at the queries involved and where this external call is made, you might notice it’s similar to the job queuing example and that’s because enqueuing a job with Cat’s job queue adapter is considered an external call, leading to some of the same problems.
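Sketched with a hypothetical client and helper, the shape Cat shipped is something like this; the HTTP call sits between the inserts and the COMMIT:

```ruby
ApplicationRecord.transaction do
  @order.save!
  update_inventory!(@order) # the inventory updates from earlier (hypothetical helper)

  # HTTP call to an external service while the transaction, its locks, and its
  # database connection are all still being held open.
  FulfillmentClient.submit_order(@order)
end
```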
00:17:41.079 For example, the external call might succeed, but this transaction could still roll back. Worse, after shipping this code, Cat starts to see errors across all of Bug Hub, not just in this grocery delivery product, but across the entire site.
00:18:02.200 As he investigated, Cat discovered that this external fulfillment service was experiencing some scaling issues and couldn’t keep up with the incoming order volume. This makes sense, as a backlog would affect order creation, but he couldn’t understand why this might cause cascading failures across the entire site.
00:18:19.920 As the external service became overwhelmed with too many requests, it started responding slowly or sometimes not at all. Intermittent slow responses are expected from external services; they have occasional issues, and there’s also network latency to consider. It’s one of the reasons we often move external calls out of the request and into background jobs.
00:18:55.440 But regardless of whether the call happens in a request or in a background job, we need to be cautious about putting it inside of a transaction. When the external service slows down, the transaction sits idle for the duration of the call. From the database's perspective, the transaction remains open, consuming a connection that cannot be used for anything else.
00:19:33.360 When Cat's fulfillment service degraded, orders kept coming into Bug Hub, opening transactions, and getting stuck waiting for the external service to respond. As more and more orders queued up in this manner, the number of active database connections increased while the number of available connections dwindled until eventually every single database connection was in use.
00:20:11.800 At that point, no other requests could acquire a database connection. Even the connection pooling mechanism in front of the database couldn’t mitigate the issue because each of these transactions used actual database connections for the duration.
00:20:41.200 At a minimum, Cat should have configured a client timeout for that HTTP call. We should never wait indefinitely for an external service to respond. That would have mitigated the situation to some extent, but there is still inherent risk in having that call inside the transaction.
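For example, with Ruby's Net::HTTP the client can be told how long it is willing to wait to connect and to read, instead of waiting indefinitely; the host, path, and limits here are made up.

```ruby
require "net/http"
require "json"

http = Net::HTTP.new("fulfillment.example.com", 443)
http.use_ssl = true
http.open_timeout = 2  # seconds to wait for the connection to be established
http.read_timeout = 5  # seconds to wait for the server to respond

payload = { order_id: 123, items: [{ sku: "LEAF-42", quantity: 3 }] }.to_json

# Raises Net::OpenTimeout / Net::ReadTimeout rather than hanging forever.
response = http.post("/orders", payload, "Content-Type" => "application/json")
```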
00:21:03.360 Ultimately, Cat should move this external call outside of the transaction. Now, if the service encounters problems, it has no effect on the transaction, and the external call only happens if the transaction succeeds. A failure of that service could still affect order submission as a whole, but it wouldn’t cascade and affect the rest of the site.
00:21:45.679 I also like that visually separating these two operations makes it more evident that they are distinct processes that can each succeed or fail entirely independently. In both previous iterations, it was possible for an external service call to succeed while the transaction rolled back. This clearer separation makes it easier to handle in our code.
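Continuing the earlier sketch, the reworked version commits first and only then talks to the fulfillment service, as two visibly separate steps:

```ruby
ApplicationRecord.transaction do
  @order.save!
  update_inventory!(@order)
end

# Only reached if the transaction committed. A fulfillment failure can now be
# retried, queued, or surfaced to the user without holding a database connection.
FulfillmentClient.submit_order(@order)
```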
00:22:08.080 On Friday, customers started to complain heavily about reliability. They were facing application slowness, and intermittent 500 errors were starting to frustrate users. The root of the issue was the overwhelming popularity of this grocery delivery service, and the database was getting overwhelmed by the load.
00:22:28.760 Thinking quickly, Cat and his colleagues decided to extract the orders and product parts into separate databases. However, this implementation left Cat with a problem. In this code, we have an orders transaction, but we are performing queries related to both orders and product inventories inside of it. Therefore, Cat decides to wrap it in another transaction, generating both an orders transaction and a products transaction.
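With two databases, the code ends up nesting one transaction inside the other, roughly as below; the abstract connection classes are assumptions about how Cat might have split things up.

```ruby
# Hypothetical connection classes after the split.
class OrdersRecord < ApplicationRecord
  self.abstract_class = true
  connects_to database: { writing: :orders }
end

class ProductsRecord < ApplicationRecord
  self.abstract_class = true
  connects_to database: { writing: :products }
end

OrdersRecord.transaction do
  @order.save!

  # A second, independent transaction on a different database, held open while
  # the orders transaction waits. The two are not atomic.
  ProductsRecord.transaction do
    update_inventory!(@order)
  end
end
```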
00:23:03.679 Some of you may have tried this before and know the pitfalls that come with it. On the surface it looks reasonable and innocent enough, yet it carries significant availability risk and threatens the health of both databases. This approach guarantees Cat will run into all of the same issues he faced before, except they will likely be worse now, because they can affect two databases instead of one.
00:23:44.480 There will be slow queries, and external services can now affect both transactions. Queries to one database, made inside a transaction on another database, are just another type of external call. If one database encounters problems, it can severely impact the health of the other, so the health of these databases is now intertwined.
00:24:27.600 Moreover, after Cat ships this change, he notices certain products are no longer selling. This is due to the inventory dropping to zero even though the products are still very much in stock. This situation arises because these two transactions are not atomic; they are two separate operations. Often, they will both commit, but if the inner products transaction rolls back, it will likely cause the orders transaction to roll back as well.
00:25:02.560 What Cat is experiencing is the inner products transaction committing, and decrementing the product's inventory, while the outer orders transaction rolls back, so the order itself never gets created. There are a number of ways this can happen. During the products transaction, the orders transaction sits idle, which isn't great for other reasons, but that idle time also increases the likelihood of something going wrong with its connection.
00:25:58.639 Things can go wrong at the database layer or the network layer, timeouts for example, and there's no guarantee that both transactions will commit. The first thing I would recommend for Cat is to make this explicit by executing the two transactions sequentially. This isn't always an easy change to make, since the inner and outer transactions can be intertwined, but it's an important adjustment for the health of both databases.
00:26:28.360 Sequential execution highlights that these are indeed two distinct operations and that they are not atomic. Writing them as separate transactions encourages us to examine the failure case we just saw, where one transaction (here, the products transaction) decrements inventory while the order simply fails to be created. Cat doesn’t want inventory decremented unless the order is created.
00:26:57.560 So one thing he absolutely must do is consider the order of these transactions carefully. Of course, it’s now possible to create an order without ever decrementing the inventory, but perhaps that’s acceptable for Cat's application; it may be better for inventory to show as too high rather than too low.
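A sketch of the sequential version, ordered so that the worse outcome is inventory that reads too high rather than too low:

```ruby
# Two separate, explicitly sequential transactions. They are not atomic, so the
# ordering matters: if the second one fails, we end up with an order whose
# inventory was never decremented (too high), not a decrement without an order.
OrdersRecord.transaction do
  @order.save!
end

ProductsRecord.transaction do
  update_inventory!(@order)
end
```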
00:27:15.200 On the weekend, Cat doesn’t work—he likes to rest and reflect on what he has learned. One key takeaway for Cat is that external calls within a transaction pose significant risks; they can lead to data integrity issues because they can make it appear that the operations are atomic when they are not.
00:27:48.560 This can hide the reality that you are dealing with two separate operations and may lead to cascading failures where the failure of the external service ends up causing problems in the database itself. This includes everything from HTTP requests to jobs, sending emails, or querying other databases. Essentially, the only thing we want inside a database transaction is queries that directly modify that database.
00:28:29.880 Moreover, Cat learned that slow transactions are dangerous. We tend to think about slowness in terms of requests, but slow transactions should be avoided everywhere, because contention in one area can affect numerous other areas. Slow transactions in background jobs, for instance, can still cause lock contention and performance problems elsewhere in the application.
00:28:56.240 Cat also recognized that database connections are a finite resource, and they can easily become overutilized when transactions linger for too long, potentially threatening performance.
00:29:25.960 Returning to Monday: Cat had a restful weekend, but many of his colleagues had been working through it, and when he came back he found a transaction that had grown into something chaotic. You may have encountered transactions like this yourselves. Transactions like these pose a risk to maintainability; nobody fully understands them, and nobody wants to touch them. They also carry real risk because we tend to underestimate their impact, which can ripple through other areas of the application and, in the worst case, might even bring down the entire application.
00:30:07.080 So, what can Cat do? His first instinct might be to rewrite the entire transaction, but I would recommend proceeding more cautiously: make small incremental changes rather than attempting a massive overhaul. Perhaps he can take one external call out of that transaction or speed up a single query. The goal should be to leave the code slightly better off with each change.
00:30:31.560 There’s also the new tool coming in Rails 7.2 that defers job enqueuing until after the transaction commits, which I mentioned before. Additionally, there’s a more general-purpose callback in the works that will make it easy to defer arbitrary work until after a transaction has completed.
00:30:56.400 It’s likely that this tool will be incredibly helpful for Cat to improve transactions over time, reworking the code where necessary to minimize problematic transactions. Besides fixing the existing ones that are clearly problematic, Cat should also proactively seek out other transactions that may be causing similar issues. Queries within one part of the application may very well affect other parts.
00:31:18.720 Most databases provide tools to inspect query and transaction behaviors; for example, in MySQL, there is a performance schema that can be utilized to gather information regarding which transactions are slow, which ones are holding or waiting for locks, and which are sitting idle. Cat should prioritize his fixes accordingly.
00:31:37.520 Rails 7.1 added transaction instrumentation, which is useful for observing transactions from the application side. Gathering and analyzing that information can help Cat keep these problems from creeping back into the application.
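A minimal sketch of tapping into that instrumentation, assuming the transaction.active_record event name from the Rails 7.1 release notes; the threshold and logging are just illustrative.

```ruby
# config/initializers/transaction_instrumentation.rb (hypothetical)
ActiveSupport::Notifications.subscribe("transaction.active_record") do |event|
  # Flag transactions that stay open for more than a second.
  if event.duration > 1_000 # milliseconds
    Rails.logger.warn("slow transaction: #{event.duration.round}ms")
  end
end
```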
00:32:03.520 Finally, I would advise Cat to adopt practices to prevent problematic transactions from reappearing. There are tools available like the isolator gem that can help detect external calls within transactions. I'll leave you with some tips on writing safer database transactions.
00:32:19.600 First, keep transactions as short as possible. I mentioned one second earlier; that's somewhat arbitrary, and maybe it depends on the application, but what is important is not letting transactions stretch out to five or ten seconds or more.
00:32:32.360 Fast transactions mean having fast queries within those transactions and ensuring there aren’t too many of them. Again, that number might vary by application, but once I see more than a thousand queries within a single transaction, I begin to get concerned.
00:32:56.280 We want to avoid all external calls occurring within the transaction. The goal is to minimize operations inside of transactions; ideally, most processes should happen before or after the transaction. Be cautious of callbacks that may obscure what is happening during the transaction.
00:33:14.560 Favor using callbacks like after_commit that execute their operations outside of transactions whenever possible. Finally, consider if it’s genuinely necessary to utilize a database transaction at all. Sometimes, it can be easier to put everything in a transaction, but that may not represent the best design for your application.
00:33:35.040 For the health of your database, always treat it as a sensitive resource, and don’t overburden it unnecessarily. Hopefully, these tips will help you write safer transactions in your applications and help you avoid some of the pitfalls Cat faced throughout his week.
00:33:56.919 And by the way, I work at GitHub, not Bug Hub, but everything here is based on real problems we have encountered at GitHub, though names, domains, and characters have been altered. If anyone has further questions about any of this, I’d love to discuss it in greater detail. Thank you!