RailsConf 2019

rails db:migrate:safely

by Matt Duszynski

In the talk titled "rails db:migrate:safely" at RailsConf 2019, Matt Duszynski presents valuable insights on safely handling database migrations when working with large datasets, specifically focusing on relational databases like PostgreSQL. The discussion emphasizes the risks associated with copying and modifying millions of records during migrations, urging developers to learn from real-world mistakes.

Key points discussed include:
- Understanding Locks: Migrations can obtain exclusive locks, causing system downtime when adding or modifying columns, leading to user frustrations. For example, adding a column with a default value can lock the entire table, preventing all access until the operation completes.
- Transactions and Their Role: Locks acquired inside a transaction are held until the transaction commits, so a long-running migration transaction keeps rows locked for its entire duration. The problem intensifies with large data sets, where every row touched by a bulk update stays locked until the whole batch finishes.
- Best Practices for Migrations:
- Avoid adding columns with default values.
- Perform data updates in smaller transactions, rather than a single large batch, to keep the system responsive.
- Disable the wrapping DDL transaction when backfilling data, and keep schema changes and data changes in separate migrations.
- Utilize concurrent index creation to prevent table locks during index updates.
- Anecdotal Case Study: Duszynski shares a cautionary tale about an indexing operation at Weedmaps that took down production because the team did not realize the auditing process generated read queries in addition to writes, so dropping an index made those reads catastrophically slow.
- Testing and Monitoring: Regular performance tests of database operations before, during, and after migrations are crucial to ensure stability and performance. Tools like strong_migrations can help catch unsafe migration practices early in the development cycle.

Conclusions and Takeaways:
- Developers must prioritize writing safe and efficient database migrations. By following best practices and being aware of database behavior, particularly during migration processes, one can significantly reduce risks of downtime or performance degradation during migrations. Uptime is paramount, and actions taken towards safe migrations reflect directly on user satisfaction and system reliability. Ultimately, learning from past mistakes and being proactive in testing and monitoring can lead to a pain-free migration experience.

00:00:20.689 Before we get too far in, I want to explain who I am. I'm Matt Duszynski. You can find me on Twitter and GitHub. Here’s a fun fact: I can plug a USB cord in the right way round the first time every time. Just kidding! Actually, how did I get in? As Megan mentioned, I work for a high-tech company in Southern California.
00:00:35.129 We might be the highest tech company out there, building a platform for the cannabis industry and the millions of cannabis users around the world. It sounds lofty. DHH might take me to task for that. If that sounds interesting to you, definitely stop by our booth in the exhibit hall and come talk to us. You can’t miss it; we’re always hiring, and it’s a fantastic place to work.
00:00:52.170 If you're perfectly happy with your current job, stop by the booth anyway to RSVP for our party tonight, which is going to be awesome. I hope to see all of you there! During my time at Weedmaps, which has been a bit over a year and a half, I’ve unfortunately taken down production.
00:01:06.450 But before I go into that, I want to conduct a quick audience participation exercise. Can I see a show of hands? How many of you are using a relational database such as Postgres or MySQL as your primary data store?
00:01:18.990 I figured that would be about what you’d expect. Second question: How many of you have tables with 1 million rows or more in that database? Okay, great! Then this talk is very applicable to you because the migrations that work well for 50 records do not necessarily work well for 50 million.
00:01:37.500 Let’s start with something that seems pretty simple—adding a column to the database. Let’s say we have our users' table and we have our application, which has millions of people using it because we’re a really successful startup. We want to be able to mark a user account as active or inactive.
00:01:54.540 I don’t know how you’ve got this many users without needing to mark them active, but bear with me for the sake of this example. The database structure is simple: we have a column called ‘active,’ which is a boolean that is either true or false, and we probably want to set our existing users to active so we don’t lock everyone out of the system.
00:02:04.440 To do this, you might write a migration that looks like this: 'add_column :users, :active, :boolean, default: true.' Sounds like a good way to get the job done, right? Wrong! Suddenly, no one can log in, no one can sign up, and everyone is furious. Your app gets blasted on Yahoo News, and your VCs get angry.
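A minimal sketch of the migration being described, assuming Rails 5.x-era syntax (the class name is illustrative; the table and column come from the talk):

```ruby
class AddActiveToUsers < ActiveRecord::Migration[5.2]
  def change
    # On a large table (before Postgres 11) this rewrites every row while
    # holding an ACCESS EXCLUSIVE lock, blocking reads and writes until done.
    add_column :users, :active, :boolean, default: true
  end
end
```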
00:02:19.260 It begins a downward spiral, and somehow a plane crashes into a mountain. We’re not sure how that happened. To understand what’s going on here, we have to talk about locks. Locks are the mechanism that your relational database uses to ensure that multiple operations don’t update the same row at the same time.
00:02:31.080 There are many different lock modes—unfortunately, too many to go into detail here. But to give you an idea, they range from an access share, which is basically the lowest level where I can access data, to access exclusive, where no one else can do anything with this piece of data because I’m doing something potentially dangerous, like updating or deleting.
00:02:47.670 During a database migration, even as simple as the one we discussed earlier, it will obtain an access exclusive lock on the relevant table and prevent any other connections—and the entire rest of your application—from reading or writing data to that table until the migration is complete.
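If you want to watch this happen, one way (not from the talk) is to query Postgres's lock views from a Rails console while the migration runs; the snippet below assumes Postgres and uses the standard pg_locks and pg_stat_activity catalogs:

```ruby
# Show every lock currently held or awaited on the users table, along with
# the query that owns it. Illustrative; run from `rails console`.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT l.mode, l.granted, a.query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE l.relation = 'users'::regclass
SQL
rows.each { |r| puts "#{r['mode']} granted=#{r['granted']} :: #{r['query']}" }
```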
00:03:05.910 Let’s illustrate this with a real-time example using our users’ table. You’ll have to imagine the remaining 9.999 million users; I could only fit so many on the slide. Our migration begins, and the first thing it does is obtain that access exclusive lock. It is going to add the column, and so far, so good.
00:03:26.550 Then we begin filling that column with ‘active: true,’ so people can log in. Our little elephant here is Postgres doing its thing, and we’ve marked the first user, Alice, as active. So far, so good. We move on and mark Bob as active. Uh-oh, first problem: Alice is trying to log in, but she can’t because the entire table is locked.
00:03:41.720 Your application can’t read data from it, and she’s stuck at a hung login screen or a 500 error. Or, she might go and use your competitor’s application. Overall, not good. Meanwhile, the migration is still running, and we’re marking the next user, Charlize, as active.
00:04:01.680 Now Bob’s trying to log in, but he can’t either. This problem continues as users are trying to log in, resulting in a pile-up of queries that can’t be responded to and a lot of very upset users. Finally, the migration finishes running and releases the lock once it’s done filling all the data, but what you’re left with is a cadre of dissatisfied customers.
00:04:24.060 This is probably not what you had in mind when you decided to mark users as active or inactive. So, how do we avoid this? First of all, don’t do it—don’t add columns with default values. That’s a dangerous operation because of the locking mode it uses, and it can lead to significant downtime if you have enough rows in your database and enough traffic on your system.
00:04:34.200 It is worth mentioning that Postgres 11 actually addresses this problem in certain circumstances. Adding a column with a static default no longer requires rewriting the whole table under that table-level access exclusive lock; the default is stored in the catalog instead. However, there is a caveat: this only applies to static defaults.
00:04:53.720 With a volatile default—like generating a fresh UUID for every row—Postgres still has to rewrite the table under that lock, so it remains potentially dangerous. It’s also hard to stay on the latest version. At Weedmaps, we use Aurora, and the highest version it supports is Postgres 10.6. In that case, you are at the mercy of your database, your SaaS provider, or your DBA.
00:05:12.450 So, as a general good practice, don’t add columns with default values. We know what to do now; we’re not going to add that column with the default value. Instead, we’ll add it without a default value first, and then change the default once the column has been added.
00:05:27.360 Finally, we’ll go back and update all of our users to mark them as active. Of course, we’ve split it into ‘up’ and ‘down’ because we’re good Rails developers, and our migration should be reversible. We tested this in development, and everything seemed hunky-dory.
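A sketch of that split, under the same assumptions as before (illustrative class name; note that at this point in the story the backfill still lives in the same migration):

```ruby
class AddActiveToUsers < ActiveRecord::Migration[5.2]
  def up
    add_column :users, :active, :boolean            # no default: fast, brief lock
    change_column_default :users, :active, true     # only affects new rows
    # Backfill existing rows. As the next section explains, doing this here is
    # still dangerous, because the whole migration runs in one transaction.
    User.update_all(active: true)
  end

  def down
    remove_column :users, :active
  end
end
```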
00:05:41.170 But then we ran into a problem, and now we have to talk about transactions. Transactions combine multiple database operations into a single all-or-nothing operation. Either all of your statements succeed within the transaction and are committed, or something fails, violating an integrity constraint, and the entire operation is rolled back.
00:06:03.470 Transactions provide guarantees for atomicity, consistency, isolation, and durability, which you might hear referred to by the acronym ACID. This is a topic in and of itself and probably warrants many talks, so we don’t need to go into it deeply here. Just suffice it to say that the consistency and isolation that a transaction provides are guaranteed by locks.
00:06:18.900 For example, consistency means that any operation will take your database from a valid state to another valid state. If you write a row, that row must be valid to be saved and read by subsequent queries. Isolation guarantees that uncommitted operations from one transaction cannot interfere with another.
00:06:36.750 When a row is being updated, an exclusive lock is issued to prevent other operations from updating that same row. This is important: if you and I both try to update the same data at the same time, without locking we could violate the consistency guarantees of a relational database.
00:06:53.179 So, updates to rows are always made in a serial manner. This serial order is guaranteed by locking within a transaction. Locks are issued on a first-come, first-served basis and live for the duration of the transaction, even if the statement that requested the lock has already executed.
00:07:09.479 For most of your database operations, this might not be a significant issue, because they usually complete in milliseconds. The problem begins when you have to perform millions of operations on a very large data set.
00:07:21.679 Now, how does this tie into migrations? Through a bit of that Rails magic we all know and love, everything you do in a migration is automatically wrapped inside a transaction.
00:07:35.360 Now, you can imagine what this looks like. Fortunately, you don’t have to, because I have more math tables. We’re going to start after our column’s been added, and it’s time to mark our users active, starting with row 1. We do not lock the entire table this time because we’re not adding a default value.
00:07:51.270 We’re just individually updating all of our rows. However, in order to update a row, you must obtain a lock on it. So, we mark our first row as active. We lock it, and even though that was successful, as I mentioned, that lock doesn’t get released until your transaction commits.
00:08:05.580 This can’t happen until you update all of the rows in the migration. Now we’re on row 2, obtaining another lock for that row and marking it true. We keep going in this manner, and now Alice is trying to log in, but that row is locked.
00:08:18.470 Thus, we can’t write data to it. Then Charlize’s row is locked, and we mark her as active. One key thing to note is that this is actually better than the first failure case because some of your users, those that haven’t been updated to be marked as active, are not locked yet during this migration.
00:08:34.230 This means they can successfully log in. Doug is a happy user, and he goes on his merry way to use your application, unaware of any issues. Now, we move on to his row, but as we continue marking users as true, more users are trying to log in.
00:08:48.720 Now Doug's row is also locked, so our migration completes while we still have a few angry users. If you listened closely, you might have caught that I said we can’t write data. That’s because row-level update locks don’t prevent other connections from reading the previous value of the row until the transaction is committed.
00:09:02.990 When a user logs in, we’re just reading the password, so we can still read data. Why is this a problem, then? Well, are you sure you’re not making updates when your users log in? You might want to check. Even if this particularly contrived example doesn’t apply to you, I’m sure there’s a table in your application that cannot afford to block updates.
00:09:18.929 For an e-commerce application, that might be your orders table; if you're a SaaS blogging service, it might be your posts table. There’s always some reason for your application to be writing data. While you can chance it, and maybe two unhappy users is an acceptable margin of error, it’s better to not backfill data inside a transaction.
00:09:39.260 My advice is simply: don’t do it. If only everything were so easy. Wait a second—aren’t transactions useful? They’re a great way to maintain ACID properties while manipulating a database.
00:09:57.740 We don’t want to get into an inconsistent state. We also don’t want connections to start conflicting with one another; that’s an unstable system and not acceptable. So, as you can see, 'don’t backfill data inside a transaction' is an important point.
00:10:13.000 Instead, we should pull this line out of the migration. We want our schema changes to happen safely, so we will write a completely separate migration to mark all of our users as active after the original migration.
00:10:30.500 The disable_ddl_transaction! method allows us to disable that global transaction I mentioned. It’s implicitly enabled, but you have the option to explicitly opt out of that behavior for a specific migration.
00:10:45.100 Now, within our ‘up’ method, rather than marking every single user as active inside a single transaction—something dangerous—we’ll iterate through our users in batches, wrapping each batch inside a transaction.
00:11:01.340 The batch size defaults to a thousand, of course—we can configure it based on our individual needs. The key here is that this transaction updating a thousand rows will complete and commit much faster than a transaction dealing with ten million rows.
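A sketch of that separate data migration, assuming the batching described in the talk (class name and batch size are illustrative):

```ruby
class BackfillUsersActive < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!   # opt out of the implicit wrapping transaction

  def up
    # Each batch is its own short-lived transaction, so row locks are held
    # for fractions of a second instead of for the entire backfill.
    User.in_batches(of: 1_000) do |batch|
      batch.update_all(active: true)
    end
  end

  def down
    # Data-only backfill; nothing to undo.
  end
end
```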
00:11:18.290 This changes your lag time from minutes, where your entire user base can’t log in or use the software, to seconds or less, with only a small subset of users seeing slightly delayed responses at any given moment.
00:11:32.860 In almost all circumstances, this is an extremely good trade-off, and your users likely won’t even notice that anything happened. Our rule of thumb is to not mix schema and data changes because you will wind up with a global transaction that causes extensive locks to be held for long periods.
00:11:45.660 Now that we’ve finally added our column and populated it, we only caused two production outages—pretty good, right? Now, we probably want to look up which of our users are active or not. So, let’s add an index.
00:12:04.270 Note that this says 'for Postgres only' because the internal implementation of indexing in MySQL doesn’t have this particular failure mode that we’re about to discuss. Postgres has a more complex indexing strategy that has a particular edge case.
00:12:18.840 For instance, we have a very simple migration for adding an index to the active column on users. It's straightforward; nothing could go wrong, right? Well, you’ve been here long enough to know that’s not why I put it on the screen.
00:12:33.970 In addition to running that inside a transaction, adding an index in Postgres will, by default, obtain one of those nasty table-level locks while reading all the data from the table and writing it to the index.
00:12:52.305 Just as you are blocked from updating an individual row while it is locked, a table-level lock during index creation means you can’t make any changes that would affect how the index is written—including updates to completely unrelated rows—so the table is effectively frozen for writes until the build finishes.
00:13:06.750 Fortunately, this has generally been recognized as a bad practice, even by the folks who maintain Postgres. They added an algorithm that can build an index concurrently and write it to disk without blocking writes to the table.
00:13:23.170 Thus, just like we talked about earlier, we will use this algorithm to add the index in a non-blocking manner. But remember—it’s Postgres only. If you include algorithm concurrently in your migration and you're using MySQL, it will fail with an error.
00:13:38.720 That said, you won’t break anything, but it simply won’t work. So, don't try it—or do, at your own risk! Regardless, make sure to add Postgres indexes using the concurrent option, as it prevents massive table-level locking that could disrupt your application.
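A sketch of the concurrent version (Postgres only; the concurrent algorithm cannot run inside a transaction, so the implicit DDL transaction has to be disabled as well):

```ruby
class IndexUsersOnActive < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!   # CREATE INDEX CONCURRENTLY refuses to run in a transaction

  def change
    add_index :users, :active, algorithm: :concurrently
  end
end
```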
00:13:55.330 Now that we've discussed some potential failure modes and what you shouldn't do, let’s talk about a time when I did all those things that you shouldn’t do.
00:14:09.050 At Weedmaps, we keep a significant audit trail of operations that users perform within the system. This makes sense; we want to know when and how people are logging in and making updates for compliance and security.
00:14:25.830 In a very broad sense, we write actual changes to our main Postgres database and then follow up by writing an audit record that captures what changed, including the operation (create, update, delete) and the properties of the model that were altered. We write this to a separate database.
00:14:41.780 You could write it to a separate table in the same database; that would depend on your architecture decision based on data set size and scaling needs. However, I’m looking forward to Rails 6, especially with multi-DB support because we had to implement quite a bit of hackery to make this work, but that’s neither here nor there.
00:14:57.040 Now, we needed to update a compound index on the audits table. If you haven’t heard of a compound index, it’s just like a simple index, but it’s on two columns. In this case, we needed an index on both the type and ID column because that’s how we query.
00:15:12.660 So, what do we do? We drop the old index and add the new index concurrently. No locks, no transactions, and we didn’t make any obvious mistakes with the migration; what could possibly go wrong?
00:15:27.620 A lot! We ended up taking down our entire production system for about 45 minutes—in the middle of the afternoon when people are getting off work and using our service to find cannabis.
00:15:44.980 This was a terrible time to have a massive production outage. What happened? If we look at the migration, it looks pretty sane. We aren’t in a transaction, and we’re adding the index concurrently; so what could go wrong? Why did we take down production with this indexing change?
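For reference, here is roughly what that migration looked like; the column and index names below are hypothetical, since the talk only says it was a compound index on the type and id columns of the audits table:

```ruby
class ReindexAudits < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!   # no wrapping transaction, per every rule so far

  def up
    remove_index :audits, name: "index_audits_on_auditable_id"   # hypothetical old index
    add_index :audits, [:auditable_type, :auditable_id], algorithm: :concurrently
  end
end
```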
00:16:02.110 To know why, we need to dive into how concurrency works. There's going to be some math on this next slide. That's the math—don't worry, we’ll walk through it. Little's Law is a very simple equation that governs the relationship between concurrency, throughput, and response time.
00:16:18.340 This ultimately controls how many requests your system can handle within a given time frame. For instance, if you receive 100 requests per second and your average response time is 40 milliseconds, then at any given point in time you have four requests being served by your application.
00:16:34.920 It also works in reverse. If you have four threads in Puma and your average response time is 40 milliseconds, then you can theoretically serve up to 100 requests per second. Note the relationship between response time and concurrency; adding a zero to your response time also adds a zero to your concurrent requests.
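Worked out as code, using the numbers from the slide (illustrative only):

```ruby
# Little's Law: requests in flight = throughput x average response time
throughput    = 100    # requests per second
response_time = 0.040  # seconds (40 ms)
in_flight     = throughput * response_time   # => 4.0 concurrent requests

# Rearranged: with 4 Puma threads, the most you can sustain is
threads        = 4
max_throughput = threads / response_time     # => 100.0 requests per second

# If response time grows to 400 ms, those same 4 threads cap out at 10 req/s.
```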
00:16:51.570 How many of you run 40 or more servers? Scaling from 4 to 40 is not something that can be done quickly, and it’s likely that you won’t be able to do it at all. If requests start queuing because they are arriving faster than your application or database can respond, you will experience saturation.
00:17:06.240 What happens here is that if a database operation blocks many requests for extended periods, you’ll saturate your connection pool. This means your application won’t be able to serve any requests, and the entire system will grind to a halt. It’s the worst time to have that happen—like during a conference when there’s no one on-call.
00:17:22.210 So, how was this a problem for our audits? First, we write the actual changes to the database, and then we have a tiny read query to see what version that record is at so we can write the next version to the audit database. Thus, each audited operation generates both a read and a write.
00:17:38.740 So rather than being a write-only append log, it turned out that every audited database operation was generating a read in addition to the write. What we thought was 99% writes was actually about 50% reads, which is a problem, because those reads depend on indexes to be fast.
00:17:55.250 What did we do? We dropped an index, which changed our query performance on the audit database from several milliseconds, which is generally acceptable, to several seconds, which is not. If you recall the Little’s Law multiplier, this substantially changed our level of concurrency and requests we could handle.
00:18:12.600 Instead of being able to handle our 10,000 requests per minute during production, we found ourselves with bandwidth to handle only 10 requests per minute. So, what does this mean?
00:18:28.710 In the midst of our auto-scaling group and multi-region architecture, various Rails applications were trying to generate that audit trail, but queries that had been quick suddenly became slow.
00:18:45.870 As all these requests came in, they all tried to access the audit database, saturating the connection pool, all waiting for a response from Postgres. None of the application servers could do anything else because these requests were all hung open, waiting for a response.
00:19:01.590 All the other requests come in, and before you know it, everything goes down. Even though our particular example relates to our audits database, any long-running query can saturate your database connection pool on any table and at any time.
00:19:16.860 So, what do you do? You need to test your database performance regularly. Understand how performance changes during and after migrations, so you know what your application is doing.
00:19:30.970 Had we identified the audit reads as a significant portion of our database traffic beforehand, we would likely not have written that migration—or at least not in that manner.
00:19:45.150 Even though query performance was fine before and would have been fine after the migration, the slow queries during the migration caused a cascading failure that took down our entire application.
00:20:01.250 Let’s talk about some tools and resources to help keep your database healthy while allowing changes. First, there are gems; some static analysis tools will warn you in advance about unsafe migrations.
00:20:19.190 These cover the things I’ve discussed, as well as some things I haven’t, like changing column types, dropping a column, or adding a JSON column to a Postgres database. It’s crucial to catch these problems at dev time, not deploy time.
00:20:34.600 If you write safe migrations, you will run safe migrations. I advise checking out one of two gems: strong_migrations or zero_downtime_migrations. You may find one or the other fits better depending on your database engine, version, and schema.
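A minimal sketch of wiring one of them in, assuming current strong_migrations behavior (check its README for specifics):

```ruby
# Gemfile
gem "strong_migrations"

# With the gem installed, unsafe operations (adding a column with a default on
# older Postgres, adding an index without algorithm: :concurrently, and so on)
# abort at `rails db:migrate` time with an explanation of the safer pattern.
# If you have verified an operation is safe for your data, opt out explicitly:
#
#   safety_assured { remove_column :users, :legacy_flag }   # hypothetical column
```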
00:20:52.200 Additionally, GitLab has built an exceptionally powerful migration helper system that can help with many of the issues I've discussed, but keep in mind that it hasn’t been abstracted into a gem, making it challenging to use within your project.
00:21:07.430 Next, you might consider using an APM tool to understand your performance better. There are countless options available—New Relic, Skylight, Scout, AppSignal, and many more. I hope each of you is already using one, but if not, maybe check out the sponsors of this conference.
00:21:19.860 But the most important tool in your toolbox is you! You are your own best defense against running dangerous operations—static analysis can only go so far.
00:21:37.690 As we saw in the audit case study, you can take down your database in ways the software cannot predict because it doesn’t know how your queries work and what their relative performance will be. Stay current with engine changes and ensure your database is up-to-date.
00:21:49.900 I wish we were on Postgres 11, but we're not. Practice these techniques in low-stakes environments to avoid deploying complicated migrations to multi-million row tables without prior experience.
00:22:06.520 Applying these methods even when you might not think you need them is always a good practice. Write safe migrations by default, so you never have to wonder how big your table is or whether it’s safe to run inside a transaction.
00:22:20.440 It’s always better to heed that nagging little voice in your head that asks whether you did it right, is it safe, or will it take down production? I would prefer someone to challenge me to prove that a migration is safe rather than presume it is.
00:22:34.590 This brings us to a quick recap. Commit this to memory: don’t add columns with default values, do not backfill data inside a transaction, and use multiple smaller transactions instead.
00:22:49.130 Don’t mix schema and data changes in the same migration, since this can result in a global transaction. Do add your Postgres indexes concurrently, and continuously monitor and test your database performance before, during, and after migrations. Test regularly!