00:00:05.640
Thanks, everybody! Got to turn this on. There we go. Today I'm going to talk to you about Cat.
00:00:13.480
Cat is a software developer working on a large Ruby on Rails application at Bug Hub. Cat has had a bit of a stressful week.
00:00:19.480
Bug Hub is an online community of bugs, and Cat has spent the past week shipping a highly successful new product within Bug Hub. Along the way, Cat has had to deal with a whole bunch of problems: data consistency, bugs, slowness, and at one point, even a site-wide failure.
00:00:31.199
All these problems stem back to this one database transaction. It started out as a simple little transaction, but as the days go by, Cat is going to have more business requirements that lead him to add more code to this transaction.
00:00:38.120
Gradually, it turns into a liability for his application. Along the way, we're going to explore some of the things that went wrong and what Cat might have done differently to prevent some of these problems. The story starts on Monday, when Cat showed up to work very hungry for business opportunities.
00:01:09.560
Cat knows that many of the bugs using Bug Hub are very hungry for food, but they're too busy interacting on the site to be bothered to go out to the grocery store. They want groceries delivered directly to their homes. So, Cat has decided to launch a brand new grocery delivery service integrated directly into this larger Bug Hub product.
00:01:30.479
We won't dig too much today into the controllers or views or how the order gets put together. We're only concerned here with how the order gets saved. So, we can assume that the order instance variable here is an active record model backed by an orders table in the database.
00:01:52.920
There's also a product model, and the product will be identified by a SKU, a stock keeping unit. Finally, there's a join table connecting products to orders, which additionally has a quantity for how many of that particular item was ordered. Although it's not explicit in the code here, this call to save opens a database transaction.
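To make the setup concrete, here is a minimal sketch of what these models and the save call might look like; the class and association names are assumptions, since the talk doesn't show them.

```ruby
# Sketch of the assumed models (names are illustrative, not from the talk).
class Order < ApplicationRecord
  has_many :order_items
  has_many :products, through: :order_items
end

class Product < ApplicationRecord
  has_many :order_items
  # identified by a `sku` (stock keeping unit) column
end

class OrderItem < ApplicationRecord
  belongs_to :order
  belongs_to :product
  # has a `quantity` column for how many of this product was ordered
end

# In the submit action, saving the order (and its built order_items)
# implicitly opens a single database transaction:
@order.save
```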
00:02:22.319
If we look at the queries involved, we first see a 'BEGIN' statement starting the database transaction. Then, it will insert the orders and some number of order items, and finally there's a COMMIT if the transaction is successful, or a ROLLBACK statement if it's not. This is what a transaction is supposed to look like: it has a few queries in it and nothing else.
00:02:53.640
The idea here is that we want these inserts to be atomic: we want all or nothing. We want them all to succeed or fail together—no partial success. So, this is all good. Cat ships this code, and all is well. The orders start rolling into Bug Hub.
00:03:15.640
However, on Tuesday, Cat discovers that bugs are able to order more items than what Bug Hub has in stock. This is a frustrating experience for the bugs because they get excited about orders, but then later they get canceled. Cat decides to deal with this problem by keeping track of the inventory on the products table.
00:03:38.400
He adds an inventory column, and then uses that information to not show out-of-stock products on the site, possibly adding model validation around that as well. Over in this submit code, Cat adds a call to update the product inventory after saving the order. The way that works is that it will go through each item in the order and perform an update, decrementing the inventory by however many of that item were ordered.
00:04:08.280
Cat wants these two operations to be atomic. They should both succeed, or neither one should, so he wraps the whole thing in an explicit transaction. Now, the transaction looks something like this: there are some inserts and some updates, and they're all happening together.
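A sketch of what the submit code might look like at this point, with the save and the inventory updates wrapped in one explicit transaction; the column names are assumptions.

```ruby
ActiveRecord::Base.transaction do
  @order.save!

  # Decrement inventory for every item in the order.
  @order.order_items.each do |item|
    # Generates: UPDATE products SET inventory = inventory - ? WHERE sku = ?
    Product.where(sku: item.product.sku)
           .update_all(["inventory = inventory - ?", item.quantity])
  end
end
```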
00:04:32.320
Unfortunately, after shipping this code, Cat starts to see increased 500 responses when placing orders. He looks at his bug tracking software and sees things like deadlock errors and lock timeout errors, as well as some requests that are taking a long time and timing out. The underlying problem here stems from something related to lock contention.
00:05:06.760
This happens when multiple processes are trying to work with the same data at the same time, and they have to compete for access to that data. We can imagine two bugs ordering the same or similar products simultaneously. Their transactions begin, and they create the orders and the order items, and then we reach the code that updates the product inventory.
00:05:39.800
Both transactions want to update the same product at the same time. The metaphor I’m using now is that database rows are like bathrooms; it usually only makes sense for one person to use one at a time. If you enter a bathroom and lock the door, anyone else who wants to use it has to wait.
00:06:06.919
So that's what's going on here. One of these updates arrives first and locks the row, which means that the other one has to wait to perform the update. Even after the update succeeds, the other transaction is still left waiting until the row gets unlocked when that first transaction commits or rolls back.
00:06:30.280
At best, contention like this is going to make the performance of the transactions worse: you can see that the bug's transaction here is stuck waiting, so it's slower than it needs to be. But at worst, it can lead to errors like deadlock errors, where two or more transactions are waiting on each other's locks and neither can proceed, or lock timeouts, where a transaction waits so long for a lock that it gives up and times out.
00:07:01.960
Lock contention is undesirable. Interestingly, the speed of a transaction affects how much opportunity there is for contention like this. If the transaction is very fast, the locks aren't held for too long, and there’s not a lot of opportunity for another transaction to come in and need those same rows at the same time.
00:07:38.760
Slow transactions, on the other hand, create more opportunities for contention: there's now a lot of time for another transaction to come in and need to operate on the same rows, which leaves it waiting on those locks.
00:08:04.599
This was happening in Cat's case. Cat had a lot of slow transactions. But what was making them slow? One thing is that these update statements turned out to be very slow in Cat's application. Slow queries in a transaction mean slow transactions.
00:08:34.199
In this instance, Cat forgot to add an index on the SKU column. It might not be apparent, but you can reduce lock contention by adding an index to speed up a query. Even if all the queries were fast, these very hungry bugs at Bug Hub were submitting orders with massive numbers of products in them.
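Here is what that missing-index fix might look like as a migration (a sketch; the unique constraint assumes SKUs are unique):

```ruby
class AddIndexToProductsSku < ActiveRecord::Migration[7.1]
  def change
    # Lets each "... WHERE sku = ?" update find its row without a full scan.
    add_index :products, :sku, unique: true
  end
end
```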
00:09:06.200
So, too many queries can also make a transaction slow. Additionally, transactions like this hold many locks, which increases the possibility for lock contention. Furthermore, transactions like this, in addition to holding a lot of locks, are likely to get stuck waiting on locks as well, and so lock contention has the potential to spread.
00:09:40.480
Locks held by one transaction slow down other transactions, and that slowness creates more opportunities for contention, further slowing down even more transactions. Operating on large batches of records like this is not something a transaction is good for. You might be able to get away with hundreds of updates in a single transaction, but if you get into the thousands, you may need to reconsider the design.
00:10:13.600
In this case, perhaps Cat could validate a maximum number of products that could be included in a single order. This innocent-looking line can actually create some problems if it causes slowness in the transaction.
00:10:48.199
Cat can reduce the potential for this contention by speeding up the queries and limiting the number of queries. Deadlock errors are also fixable; in this case, Cat could update the products in a consistent order, and the deadlocks would go away.
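One way to get that consistent order is to sort the items before updating, so two concurrent transactions always lock the same product rows in the same sequence; a sketch using the names assumed earlier:

```ruby
@order.order_items.sort_by { |item| item.product.sku }.each do |item|
  Product.where(sku: item.product.sku)
         .update_all(["inventory = inventory - ?", item.quantity])
end
```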
00:11:15.679
Cat might also want to question his assumptions about whether these updates need to occur within a database transaction at all. Maybe there's another design here that doesn't require these updates to happen in a transaction and perhaps doesn't necessitate updating the products table at all. Maybe Cat needs a more robust inventory management system that operates outside of this database.
00:12:09.519
On Wednesday, Cat notices customers complaining that submitting an order takes too long. No one likes clicking on a button and waiting for a spinner. Cat tracks this slowness back to an 'after create' call that was doing a lot of slow work, like syncing with an external billing platform and sending off confirmation emails.
00:12:39.239
These are things we often do not want inside of a web request because they can slow down the request. We also don't want them inside of a transaction because they can slow down the transaction, leading to contention like we saw on Tuesday.
00:13:02.480
So, Cat moves the slow work into a background job. This is an improvement, but there are still some potential problems here. After shipping this change, Cat eventually starts getting reports of confirmation emails being delayed by several minutes.
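A sketch of that change, assuming a hypothetical OrderConfirmationJob (with hypothetical BillingPlatform and OrderMailer collaborators) that now carries the slow work:

```ruby
class Order < ApplicationRecord
  # Still an after_create callback, so the job is enqueued before
  # the surrounding transaction commits.
  after_create :queue_confirmation

  private

  def queue_confirmation
    OrderConfirmationJob.perform_later(self)
  end
end

class OrderConfirmationJob < ApplicationJob
  def perform(order)
    BillingPlatform.sync(order)                 # slow external sync (hypothetical)
    OrderMailer.confirmation(order).deliver_now # confirmation email (hypothetical)
  end
end
```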
00:13:42.639
Since Cat is using an 'after create' callback, the job is getting queued right after inserting the order but before the transaction commits. Cat happens to be using a job queue adapter that is not backed by the database, meaning that queuing the job here is not part of this transaction.
00:14:16.280
Also, because of this, a worker can pick up the job before the transaction commits. In some cases, that can lead to unwanted actions taking place before the transaction commits. In Cat's case, it was somewhat fine because the job requires a persisted order, and the order isn't persisted until the transaction commits.
00:14:49.760
So, that job just fails and can retry later. However, there are all these statements to update the product inventory, which have proven to be kind of slow. This creates potential for the job to retry and fail multiple times before this transaction commits, which is not ideal. Cat has had to configure the retries for this job rather high to accommodate that.
00:15:38.920
Eventually, when the transaction does commit, the job retries one last time and then succeeds. However, many of these transactions are rolling back because of deadlock errors and lock timeout errors we saw before. This makes the retries problematic; with these rolled back transactions, the job is queued but will never succeed because the order never gets created.
00:16:09.720
These jobs get stuck retrying until the number of retries runs out. So, they are destined to fail but are stuck in the queue using up resources. As a result, the retries for these failing jobs started backing up the queue, getting in the way of successful job runs.
00:16:35.320
The work of those successful job runs is getting delayed, such as sending confirmation emails. The simplest thing Cat could have done differently here is swap this 'after create' call with an 'after create commit'. This way, queuing the job only occurs if the transaction commits and doesn't occur if it rolls back.
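The swap itself is a one-line change in the model (sketch, same assumed names as above):

```ruby
class Order < ApplicationRecord
  # Runs only after the transaction commits; never runs on rollback.
  after_create_commit :queue_confirmation

  private

  def queue_confirmation
    OrderConfirmationJob.perform_later(self)
  end
end
```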
00:17:03.440
This adjustment means that typically, the job should run exactly once, and we won't ever need to perform this retry. In fact, Cat can look forward to Rails 7.2, which is going to introduce a feature to always queue jobs after transactions commit. This means Cat wouldn't have to change any code when upgrading Rails—he just needs to use the appropriate configuration.
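The configuration looks roughly like this; the exact option name and accepted values are an assumption based on the Rails 7.2 release, so check the release notes for your version.

```ruby
# config/application.rb (sketch; verify the option for your Rails version)
config.active_job.enqueue_after_transaction_commit = :always
```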
00:17:44.799
On Thursday, Cat sees customers complaining about orders being delivered too late or sometimes with the wrong items. It turns out Cat's team has been doing a lot of manual work to fulfill the orders, which is slow and prone to error.
00:18:08.700
As a result, Cat decides to work with an external fulfillment service that offers an API for submitting orders. After saving the order to the Bug Hub database, he adds a call to submit the order to this fulfillment service via a fulfillment client.
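A sketch of the shape this code takes, with a hypothetical FulfillmentClient wrapping the external HTTP API and a hypothetical update_product_inventory helper standing in for Tuesday's updates:

```ruby
ActiveRecord::Base.transaction do
  @order.save!
  update_product_inventory(@order)

  # HTTP call to the external fulfillment service, inside the transaction.
  FulfillmentClient.submit_order(@order)
end
```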
00:18:39.360
This is an HTTP call to an external service inside of a database transaction, and someone in the front row just gasped. Yes, that's correct; we'll get there. If we look at the queries involved and where this external call gets made, you might notice it looks quite similar to the job example, and that's because it is.
00:19:05.679
Queuing the job in Cat's case was also an external call and brings the same problems. For example, the external call here could succeed, but then this transaction could roll back. But worse than that, after Cat ships this code, he starts seeing errors across all of Bug Hub, not just in this new grocery delivery product, but throughout the entire site.
00:19:39.560
Upon investigation, Cat discovered that the new external fulfillment service was having scaling problems. Too many orders were coming in, and the service simply could not handle them all. It makes sense how this could affect order creation, but how did this cascade into failure across the whole site?
00:20:15.680
As this service became more overwhelmed with too many requests, it started responding slowly, or sometimes not at all. Intermittent slowness is somewhat expected for external calls like this; services can degrade or be slow. However, there's also the issue of network latency.
00:20:56.020
That's why we often move external calls outside of requests and into background jobs. However, when this slowness occurs within a transaction, it can threaten the health of your database. From the database perspective, the external call is just idle time, making it seem as if the database is doing nothing.
00:21:30.560
But in reality, it’s still utilizing valuable resources during that idle time. For one, it's holding locks, which creates opportunities for lock contention. The bigger problem is that these open transactions are using a database connection the entire time they are open.
00:22:07.880
This database connection cannot be used for anything else. Database connections are a finite resource. So when this fulfillment service started breaking down, orders continued flooding into Bug Hub, opening new transactions that would just sit idly, waiting for the fulfillment service to respond.
00:22:49.080
As more orders came in, the number of connections in use increased, which meant that the number of available connections decreased until eventually, every single connection was in use. At that point, no other request was able to acquire a database connection.
00:23:32.720
Any request needing access to that database started failing. Even the connection pooling mechanism in front of Cat's database couldn't help here because a transaction is using a real database connection the entire time it remains open.
00:24:00.480
One thing Cat should have done here is add a client timeout for this HTTP call. We should never wait an unlimited amount of time for a response. That would have mitigated this situation to some extent, but there's always going to be some amount of risk associated with having this external call within a transaction.
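A sketch of what client timeouts might look like if the fulfillment client uses Net::HTTP under the hood (the host and values are made up):

```ruby
require "net/http"

http = Net::HTTP.new("fulfillment.example.com", 443)
http.use_ssl = true
http.open_timeout  = 1 # seconds to wait when opening the connection
http.read_timeout  = 5 # seconds to wait for a response
http.write_timeout = 5 # seconds to wait while sending the request
```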
00:24:35.040
So the real solution is to move it outside of the transaction. Now, if this fulfillment service encounters issues, it won't affect the database at all. The transaction only begins if the external call succeeds. Failure of the service will indeed impact order creation, but it won't affect the rest of the site.
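The reordered version might look like this (sketch, same hypothetical names): the external call happens first, and the transaction only opens if it succeeds.

```ruby
# External call first, outside any transaction.
FulfillmentClient.submit_order(@order)

# Only reached if the fulfillment call succeeded.
ActiveRecord::Base.transaction do
  @order.save!
  update_product_inventory(@order)
end
```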
00:25:14.480
I also appreciate that it's now clearer visually that these are two separate operations. It's not one single atomic operation as it might have appeared before. It was always possible to submit to the fulfillment service successfully and then have the transaction roll back; that's still true now, but at least it's apparent from the code.
00:25:52.960
For instance, if the transaction rolls back, maybe we need to make another API call to delete the order, assuming the API allows for that. However, that could also fail, raising the question of whether the order should have been created successfully in the first place.
00:26:31.759
There are many ways to handle this, but yes, it requires additional code to deal with these two separate operations that aren't atomic.
00:27:02.600
On Friday, customers begin complaining about site reliability. In addition to the problems we've observed so far, they start experiencing intermittent 500 responses, slow request times, and increasing frustration. These issues stem from the grocery delivery service's popularity.
00:27:38.520
The database has become overwhelmed by writes to the orders and products tables, so in some ways, it's a good problem to have. Cat meets with the database team, and they decide to split these tables that are receiving so many writes into separate databases. Now, there will be an orders database and a products database.
00:28:14.720
However, this introduces a new problem for Cat because inside the orders transaction, there are now queries for both orders and products. To address this, Cat wraps the code in a products transaction as well.
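A sketch of the nested, cross-database shape, assuming Order and Product now connect to separate databases, so each class starts a transaction on its own connection:

```ruby
Order.transaction do            # transaction on the orders database
  Product.transaction do        # transaction on the products database
    @order.save!
    update_product_inventory(@order)   # hypothetical helper from earlier
  end
end
```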
00:28:54.080
This approach seems reasonable and innocent enough, and it appears that it should work. However, it's actually a significant availability risk. Cat can almost guarantee he'll encounter all the same problems we've seen so far, but they're even worse now because any problem affects two databases instead of just one.
00:29:30.639
We now have contention problems, slow transactions, and external services failing, and they are impacting both of these databases. Additionally, making queries to one database inside a transaction for another database behaves like another external call, exposing us to the same problems as other external calls.
00:30:05.680
It also limits how much benefit Cat gets from splitting the databases in the first place. After the change, Cat notices that certain products stop selling because their inventory has dropped to zero, even though the product remains in stock.
00:30:49.200
This occurs because the two transactions, on two different databases, are not atomic: they are two distinct operations. Most of the time both will commit, or, if the inner products transaction rolls back, the resulting error will cause the outer orders transaction to roll back as well.
00:31:33.540
However, it is still possible for one to commit while the other rolls back. In Cat's case, the products transaction committed while decrementing the inventory, but the corresponding order never actually got created; this transaction rolled back.
00:32:08.560
Interestingly, this occurrence was more common in Cat's application than one might expect.
00:32:11.680
This happens because the orders transaction is sitting idle for the duration of the products transaction, which can result in losing the database connection. In fact, it is one of the most common ways to lose the database connection.
00:32:37.239
The same risks arise: network timeouts, database timeouts, application timeouts—all sorts of things can go wrong. Given that these two transactions are not atomic, my recommendation for Cat is to split them up.
00:32:49.760
This is not always an easy change to implement, but it’s essential. This highlights the distinction between the two transactions, showing that they are separate rather than a single unit of work.
00:33:05.560
While the failure modes remain the same, it's now apparent what they are. Cat realizes he does not like the inventory being decremented too far when the order fails to be created.
00:33:38.680
Fixing that may mean flipping the order of operations, which changes the failure modes of his application. Admittedly, it may also mean more code to coordinate these two operations and give them the atomic treatment they no longer have.
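A sketch of the split with the order flipped: the order is committed first, and the inventory is only decremented afterwards, so a failed inventory update no longer means lost stock. It does mean Cat needs extra handling (a retry or cleanup job, say) instead of atomicity. Names are the same assumptions as before.

```ruby
# Two sequential transactions on two databases; clearly not atomic.
Order.transaction do
  @order.save!
end

Product.transaction do
  update_product_inventory(@order)
end
```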
00:34:09.160
In conclusion, that brings us to the weekend. Cat does not work on the weekend.
00:34:22.840
He likes to reflect on what he's learned. This week, he learned that external calls within a transaction involve risks to data integrity, as they can appear atomic when, in fact, they're not.
00:34:51.680
Cascading failures can occur when external services degrade, potentially leading to database failures. These external calls encompass various operations; ideally, the only activity in a database transaction should be queries specific to that database.
00:35:20.760
Cat learned that slow transactions carry risks as well. If they happen inside requests, they create slow requests, which are undesirable. But slow transactions are unwelcome in any context.
00:35:55.920
Contention within slow transactions can spread: a slow job may also slow down requests or cause errors in requests. The same applies to resource exhaustion. Database connections are a finite resource, and so is database CPU; transactions that are too slow can exhaust these resources and affect the rest of the application.
00:36:29.160
I added slides to this presentation after giving this talk before, as my team thought I should include more memes. Here’s one my manager suggested, featuring Sheldon from 'Big Bang Theory.' Here's another.
00:36:55.760
Well, that brings us back to Monday. Cat enjoyed a restful weekend filled with memes. However, it turns out a lot of other teams made changes to this transaction over the weekend.
00:37:23.120
People working on their own features had inadvertently added to this transaction, which raises a question about transactions like this. Who has seen transactions that look something like this? A few hands raised.
00:37:45.680
Transactions of this nature are not good for maintainability; nobody knows how they work and nobody wants to touch them. Worse, these transactions carry considerable risk, and we tend to underestimate that risk: a transaction like this could bring down the entire application.
00:38:06.320
So, what can Cat do now? Avoid rewriting it from scratch. Instead, he should strive to make small incremental changes, perhaps moving one external call outside the transaction or speeding up one query. The idea is to make minimal changes that improve the situation a little, and then repeat the process.
00:38:35.919
There are tools coming in Rails 7.2 to assist in deferring work until after transactions commit. I mentioned this change to queue jobs following a commit, but there will also be a general-purpose callback for deferring work, which can be a good way to improve transactions without entirely rewriting code.
00:39:06.480
Cat must also recognize other problematic transactions, as these can affect other areas of the application. Most databases offer tools to inspect transaction behavior. MySQL, for instance, has a performance schema that allows you to query the database for details on which transactions are slow or waiting on locks.
00:39:30.960
Additionally, Rails 7.1 introduced transaction instrumentation. This can be valuable to combine with the output from the database itself, providing backtraces to understand where in the application transactions are being opened.
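A sketch of how that instrumentation could feed a slow-transaction log; the event name "transaction.active_record" and its payload are assumptions, so check the docs for your Rails version.

```ruby
ActiveSupport::Notifications.subscribe("transaction.active_record") do |event|
  # Flag transactions that stay open longer than one second.
  if event.duration > 1_000 # milliseconds
    Rails.logger.warn("Slow transaction: #{event.duration.round}ms")
  end
end
```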
00:39:59.760
Finally, I recommend that Cat take measures to prevent transactions like this from re-entering his codebase. Tools such as the 'isolator' gem can help detect external calls within a transaction, and you can build tooling on top of the transaction instrumentation in Rails.
00:40:31.760
Now, let's conclude with some tips for writing safe transactions. Keep transactions brief; speed matters a lot in transactions. Ideally, they should complete in less than a second, though the exact threshold will vary.
00:41:07.680
If a transaction takes more than five or ten seconds, that's a concern. Fast transactions mean fast queries, and not too many of them: fewer than a hundred queries might be reasonable, but I'd be wary of anything exceeding a thousand.
00:41:43.360
Importantly, avoid external calls within transactions, as they create risks and unpleasant failure modes. Moving existing external calls outside a transaction can be a challenging change, but the benefits are substantial.
00:42:04.360
Above all, keep the code within transactions to a minimum. Aim to do as much work as possible before or after the transaction. Callbacks can complicate this, so when writing callbacks, default to ones that run outside the transaction, such as after-commit callbacks.
00:42:23.360
Lastly, consider whether a database transaction is really necessary for your operation. While it may seem easiest to group everything in a transaction for simpler reasoning, it may not be the most suitable design for your application or the health of your database.
00:42:58.159
Treat the database as a valuable resource and avoid making it do more work than necessary. Hopefully, these tips will help you write safe transactions, sidestep some of the issues that Cat has faced, and lead a more stress-free life.
00:43:26.080
Thank you. I just wanted to mention that I work at GitHub, not Bug Hub. Names and details were changed for this talk, but everything is based on real experiences we've encountered at GitHub.