RailsConf 2023

Zero downtime Rails upgrades

by Ali Ibrahim

Zero Downtime Rails Upgrades

In this insightful talk presented at RailsConf 2023 by Ali Ibrahim, the focus is on overcoming the challenges associated with upgrading Ruby on Rails applications. The speaker outlines a strategy to streamline the upgrade process, allowing developers to manage it with minimal disruption to feature work.

Key Points Discussed:

  • The Complexity of Rails Upgrades: The speaker reflects on the common challenges organizations face during Rails upgrades, which often stem from a lack of shared understanding among team members regarding the Rails codebase. Unlike predictable feature work, upgrades are fraught with uncertainty and risk due to Rails' extensive and complex code.

  • Reframing Upgrades as Incremental Work: To address the complexity, Ali proposes reframing upgrades to mirror incremental feature work. This involves breaking the upgrade into smaller, manageable tasks that can be addressed one at a time, easing the overall process.

  • Version Targeting: The upgrade should start by identifying the immediate next minor version to target, allowing teams to implement changes progressively rather than tackling multiple updates simultaneously, which can be overwhelming and risky.

  • Managing Deprecation Warnings: Ali emphasizes the importance of addressing deprecation warnings early in the upgrade process. By fixing these incrementally throughout the application, teams can better prepare for future Rails versions without delaying the entire upgrade.

  • Dual Booting Strategy: A key technique presented is the dual booting method, which allows an application to run on both the current and target versions of Rails simultaneously. This permits continuous integration of features and minimizes the risks associated with long-lived branches during the upgrade process.

  • Backporting New Features: The advantage of backporting new features to support older versions of Rails is discussed. This enables teams to adopt new coding patterns ahead of the upgrade, easing the transition to newer versions.

  • Managing Test Failures: When faced with numerous test failures, the speaker suggests grouping these failures by reason, allowing teams to address them more systematically and efficiently.

  • Testing in Production-like Environments: Before fully committing to the upgrade, conducting small experiments in environments that mimic production can help assess stability and identify remaining issues without significant risk.

Conclusion and Takeaways:

By refining the upgrade process through incremental changes, addressing deprecations proactively, utilizing dual booting, and continuously integrating new features, teams can make Rails upgrades predictable and repeatable. The overarching message is to embrace the learning curve with each upgrade to build a robust process for future upgrades, ultimately leading to zero downtime during the process.

Sharing experiences and insights learned during each upgrade can further benefit the Rails community, fostering a collaborative environment for tackling these common challenges.

00:00:19.160 Hi everybody, I'm Ali, and I'm here today to talk about everyone's favorite subject. Drum roll please... upgrades!
00:00:25.980 So, there are dozens of us! Okay, cool. I'm actually surprised; I expected more people to dislike Rails upgrades. That's fine! When I look back through past RailsConf events, I see a lot of talks about Rails upgrades: one in 2017, two in 2018, one in 2019, and yes, one last year, along with a workshop on upgrading Rails yesterday, which was a great workshop, by the way.
00:00:42.860 Since we all love working on Rails upgrades, why are we spending so much time talking about them? When I think about this question, I reflect on all the different upgrades I've worked on over the course of my career at Test Double, and one thing stands out: these organizations aren't coming to us to work on these upgrades because they're easy; they're coming to us because they're hard. But what is it about these upgrades that makes them so difficult? The first thing that might come to mind are all the technical challenges involved with working on an upgrade. For example, the first thing you might want to do to upgrade Rails is run `bundle update rails`. This could turn into a multi-day journey through pain and suffering as you try to get your gem versions to work with Bundler, and then you find out, ‘Oh, this gem is not maintained anymore,’ or ‘There’s some weird fork?’ Really, I just want to run `bundle update rails`, a standard command!
00:01:19.560 Or sometimes there are changes in the public API. For example, how many people here remember the strong parameters migration from Rails 3 to Rails 4? If you had a large app with many controllers, that was a significant change. But regardless of what the technical challenges are, all of us here can solve these problems. I say this confidently because if we couldn't, there wouldn't be any upgrades at all, and if there were no upgrades, none of us would be here, because who would be building stuff in Rails, right?
00:02:25.140 So, maybe we need to reframe our question about why it’s hard for organizations to work on these upgrades. Instead, let’s ask why is it so difficult for organizations to invest the time and money to do these upgrades? To think about that, let’s shift gears a little bit and analyze something that organizations do every day, which is feature work, and see how this compares to Rails upgrades.
00:03:02.580 When your team embarks on building feature XYZ, all the people involved, all the stakeholders—whether it’s your product managers, UI/UX, engineering, or QA—must be engaged in making that feature happen. All of you have some context and some form of shared understanding about the app you're working in or your business domain. Because of this shared understanding, it's easier to break up the work. This might mean identifying the scope: what’s our appetite here? How much of this do we want to do right now? Or maybe it’s about breaking it up into small discrete steps that you can work on bit by bit, shipping stuff incrementally. Because we can break up the work, it’s easier to track progress. You might say this feature has been divided into ten cards, and after a couple of cards, we can kind of tell if we’re on target before our deadline or if we need to have another conversation about it.
00:03:50.760 How does this compare to Rails upgrades? Well, Rails has a lot of code in it. I’ve gone deep into the weeds of Active Record a couple of times, and I would say I know maybe five percent of the code in Rails. If you saw Eileen's keynote yesterday, she mentioned that there’s no core committer who knows all of Rails. So, it's a really big codebase that no one really comprehends fully. Rails is crucial infrastructure in our applications because without it, we don’t really have an app to talk about. So now we’re talking about changing something we don’t know well; we want to change it potentially a lot. Who’s going to sign up for a project with so many unknowns? What could go wrong? How long will this take? When will it be done? No one wants to work on this risky upgrade project.
00:04:55.200 But what if we transformed how we think about Rails upgrades and treated them like feature work? We could take that big black box of a project and break it up into smaller pieces that we can work through incrementally, shipping things bit by bit. We would then utilize insights gained from shipping to inform the next steps. If we could do that, we’d have upgrades that are more predictable, and if they were more predictable, they’d be relatively more repeatable. Spoiler alert: there will always be another Rails upgrade!
00:06:16.560 So, how do we get there? How do we get to having Rails upgrades that are more predictable and repeatable? Well, first things first, we have to determine what version of Rails we are even upgrading to, which might seem like an obvious question! We just want to get to whatever the latest version of Rails is, right? Let's imagine we have an app running on Rails 5.2, and currently, Rails 7 is the latest version. We want to upgrade to Rails 7. Is there anything we can do to break this up into smaller pieces? Well, since we're coming from Rails 5, getting to Rails 7 means crossing two major versions, Rails 6 and Rails 7, and we have to get to Rails 6 first regardless.
00:07:38.400 We could try to do it in one big giant PR that changes everything, or we can focus on upgrading to Rails 6 first. Now, our black box is looking a little smaller. Is there anything else we could do to break this down even further? Well, Rails 6 is composed of two minor versions: Rails 6.0 and Rails 6.1. Just as with moving from 6 to 7, we have to get to 6.0 before we can move to 6.1. Instead of trying to do that all at once, let's just upgrade to 6.0 first. Now our big black box is looking significantly smaller.
00:08:55.860 What we’re trying to do here while determining which version we want to upgrade to is first target that next minor version. If you’re on Rails 5.0, your next minor version is 5.1. That’s your first upgrade; once you get to 5.1, your next minor is 5.2. After exhausting all minor versions, you then upgrade to the next major version. So, if you’re now on Rails 5.2, there’s no Rails 5.3; you’ll move on to that next major version, Rails 6. In doing this, we’re creating smaller upgrades. Instead of taking years of changes in Rails and applying them all at once, we’re focusing on a narrow chunk right in front of us. This will help reduce the risks of our upgrade.
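In Gemfile terms, this stepping strategy is just a series of small pin bumps. A minimal sketch, with version numbers following the talk's example app:

```ruby
# Gemfile: target one version at a time instead of jumping straight to 7.
gem "rails", "~> 5.2.0"    # where the example app is today

# Each later pin is its own small upgrade project:
# gem "rails", "~> 6.0.0"  # upgrade 1: there is no 5.3, so next is the new major
# gem "rails", "~> 6.1.0"  # upgrade 2: the next minor
# gem "rails", "~> 7.0.0"  # upgrade 3
```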
00:09:54.600 Now, we have a little more clarity regarding the target version we want to upgrade to. We're ready to run `bundle update rails`, but before we get ahead of ourselves, we need to talk about everyone's second favorite subject: deprecation warnings! I need some audience participation here: can I see a raise of hands if you've ignored deprecations in your app recently? Yeah, right? Me too! That's what I do every day, ignoring deprecations because I think, 'Well, my app is still working, and there's nothing to do about them!' But when you're starting on a Rails upgrade, it's essential to check these warnings to see what they're suggesting. For example, imagine now that we're on an app running on Rails 6.0, and we see this deprecation warning flying through our logs: 'Accessing hashes returned from config_for by non-symbol keys is deprecated and will be removed in Rails 6.1. Use symbols for access instead.'
00:10:43.140 We're on Rails 6.0, so that thing we're doing, accessing the hash with non-symbol keys, still works fine. But when we move to Rails 6.1, that access pattern no longer works. Importantly, this deprecation warning isn't saying to upgrade to Rails 6.1 and open up a giant PR that changes a ton of things. No, it's saying, 'Use symbols for access instead.' This is something you can implement in your application today, and we can do that for all the deprecation warnings. We don't have to wait for a major upgrade effort to get our app one step closer to running on the newer version of Rails.
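As a concrete sketch of what that small fix looks like (the `:payment` config name here is hypothetical), the change is just how you index into the hash that `config_for` returns:

```ruby
# On a Rails 6.0 app; config/payment.yml is a made-up example file.
settings = Rails.application.config_for(:payment)

settings["merchant_id"]  # deprecated: non-symbol key access is removed in Rails 6.1
settings[:merchant_id]   # the fix the warning asks for: use symbols for access
```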
00:11:31.080 The good news is we don't have to fix all these deprecations at once. For example, imagine you have a deprecation warning across hundreds of files in your application; it's a huge mess. You could try to fix it in one giant PR that changes that whole batch of files, but no one on your team is going to want to review that! Eventually, someone will say, 'Look, don't worry; it's just deprecations. Just give me a checkmark, please,' and you think, 'Okay, you get your checkmark!' However, when it's time to hit the merge button and deploy this to production, you'll be pretty nervous, because you changed a lot of stuff in your app.
00:12:55.680 Instead, you could tackle that deprecation warning in one file. Changing it in one file will make it easier for your team to review, because it’s a smaller change, and we can see what’s going on here! It’ll be easier to test and verify that we made this change and that it works in production as expected. With that information and the experience gained, you’ll find it easier to return confidently to update all the other files in your application.
00:13:54.120 So, we can divvy up the deprecation warnings across all the people on our team. Say, ‘Hey, everyone, pick one up and fix it here, fix it there!’ You will work very hard, diligently, bringing your app one step closer to running on a new version of Rails. Then, lo and behold, someone comes behind you and brings back a deprecation that you’ve just fixed. This can be a super frustrating experience because you’re putting in so much effort! You think, ‘Okay, I’ll fix the deprecations; this is important.’ But the question that arises is, why would a developer do this? Why are they not paying attention to all the work we're doing?
00:14:45.900 Earlier, we agreed that everyone's ignoring deprecation warnings, so can we truly blame a developer for doing something all of us do every day? Instead, we can make it easier for them to adopt the patterns we need to use going forward. Here's a blog post I wrote about one strategy I like to use when dealing with deprecation warnings. If you Google it, you'll find many different ways to handle them, but they all share a core idea: once you've fixed a deprecation, set up your app so that reintroducing it raises an error in the development and test environments. This way, any developer, like me, who is used to writing code in a particular way, will hit an error when they try to write deprecated code. The computer will raise a flag, saying, 'Nope, you can't do this anymore!' And since nobody likes being admonished by a computer, they will eventually realize, 'Okay, I need to adopt this new pattern moving forward.'
00:15:58.100 Let’s couple that with logging in production because tests aren’t perfect; we’re never going to catch 100% of everything. We’ll log these deprecations in production to provide one last safeguard ensuring no warnings slip through the cracks. In doing this, we are continually shipping code, bit by bit, getting ourselves closer to that new version of Rails. Wouldn't it be nice if we could remain in this mode for all our Rails upgrades?
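A minimal sketch of that setup, using Rails' built-in deprecation behaviors (the blog post Ali mentions covers more granular approaches, such as raising only for deprecations you have already fixed):

```ruby
# config/environments/development.rb and config/environments/test.rb:
# once a deprecation is fixed, make reintroducing it an error.
config.active_support.deprecation = :raise

# config/environments/production.rb:
# tests never catch everything, so keep a paper trail instead of crashing.
config.active_support.deprecation = :log
```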
00:16:51.979 Once you've fixed all your deprecation warnings, you're ready to run `bundle update rails`. You check out a new branch—nothing new there; we always check out a new branch to make changes in our application— but as long as this branch is active, you are increasing the risks associated with your upgrade. Let’s imagine you’re on your branch, and you spend a few days battling Bundler to get the new Rails version to install. Eventually, you get it to work! However, in all that time you’ve been working on this branch, the team has been pushing updates on the main line. It’s not a major deal; it’s only been a couple of days. You pull those changes in, and development keeps happening on main.
00:17:53.640 In the interim, a security patch might have been released for another gem, and someone updated the Gemfile on main. So now, when you return a couple of days later, after you've been trying to fix tests or get the app to boot, you run into conflicts because the Gemfile has changed in two places. You think, 'Ugh, this is frustrating,' but maybe it's not too bad. You figure it out after an hour or so, and then your boss comes to you and says, 'Whoa! Whoa! Whoa! Whoa! We have a really important, urgent thing to work on right now. Please, just get to a stopping point on this upgrade, and we'll get back to it when we have time!' You think to yourself, 'Okay, I'll push up my changes to my branch, and we'll come back to it; no big deal.' And development carries on flowing along on main.
00:18:49.440 Time passes—could be weeks, months, or even years—I’ve seen it happen! When you eventually come back, your branch is so far behind main that when you try to pull the changes in, the Gemfile has changed again. You think, ‘Ah! This is frustrating; I have to do this whole bundler dance again.’ Or those tests you were trying so hard to fix have now changed on main, so the changes you made on your branch no longer make sense. So, you scratch your head and wonder, ‘What do I do now? How do I get this to work?’ You might end up concluding that you will have to start this upgrade all over from scratch! That’s a huge risk because we’re essentially saying that if we have to pause our upgrade for any reason—life happens, things change—we could lose all of the progress we've made up to that point.
00:19:54.259 No one wants to be in that situation, so what can we do to avoid this? Well, we can adopt a strategy called dual booting. With dual booting, you set up your application on the main branch to run on either the production version of Rails or the target version you are upgrading to. You will also have some kind of switch. By default, this switch will be off, meaning your app remains on the production version of Rails. Nothing out of the ordinary there; it’s same old, same old. But when you flip that switch to on, your app will then be running the target version of Rails on the main branch.
00:20:42.780 There are various strategies for dual booting, but I personally like using Bootboot, a Bundler plugin made by Shopify. I appreciate it so much that I bent over backwards to get it functioning with Rails 4 (a story for another day). Now, when you go back to your branch and run `bundle update rails`, you won't just be doing the Bundler dance to get the new version of the gem installed; you will also be incorporating that dual booting functionality. As soon as dual booting works and you have bundled the new version of Rails, you can toggle between Rails versions and ship that code to production. No more waiting around to say, 'Oh, let me finalize these tests before I deploy. Let me make sure the app's booting.'
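Here is roughly what a Bootboot setup looks like in the Gemfile; version numbers are illustrative, and `DEPENDENCIES_NEXT` is Bootboot's default name for the switch:

```ruby
# Gemfile
plugin "bootboot", "~> 0.2.1"

if ENV["DEPENDENCIES_NEXT"]
  enable_dual_booting if Plugin.installed?("bootboot")
  gem "rails", "~> 6.1.0"  # the target version we are upgrading to
else
  gem "rails", "~> 6.0.0"  # the version production runs today
end
```

Running `bundle bootboot` once generates a second lockfile (`Gemfile_next.lock`), and prefixing any command with the switch, for example `DEPENDENCIES_NEXT=1 bundle exec rails server`, boots the app on the target version.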
00:21:28.920 From this point forward, any changes you have to make for your Rails upgrade will be completed on main. After today, I don't want to hear anyone saying they're maintaining a long-lived branch that they're rebasing every day for months for their upgrade! You don't have to do that anymore. Please use dual booting! If you do this, shipping code every day, you'll reduce the risk of your upgrade.
00:22:27.960 As you follow along with me, you may ask, 'How is this going to work exactly? There are features in the newer version of Rails that aren't in the old version, so how will code on main deploy without the new version of Rails? I'm confused about how this works.' Once you adopt the dual booting strategy, you will encounter conditionals like this in your application. Imagine we have an app running on Rails 5.2: we'll say, 'If our app is currently running on 5.2, here's the code we will run in production; otherwise, here's the code we will use on the future version of Rails.' Now, you can ship code that works for both versions of Rails to main.
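A sketch of such a conditional; the two method names are hypothetical stand-ins for whatever actually changed between your two versions:

```ruby
# Branch on the running Rails version; both paths ship to main.
if Rails.gem_version >= Gem::Version.new("6.0")
  render_report_with_new_api(params)     # runs when the dual-boot switch is on
else
  render_report_with_legacy_api(params)  # what production (Rails 5.2) runs today
end
```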
00:23:00.900 You might see these conditionals popping up in your codebase—not a big deal! Sometimes method signatures will change between Rails versions; that kind of stuff. However, if you begin to see the same pattern of if-elses repeating, it will drive up the risk of your upgrade. That’s an indicator that there is a pattern your team is used to that will not function going forward. You’ll be stuck trying to manage this game of whack-a-mole, which will be hard to win because all of these things will pop back up in your application, and you’ll never really know if you’ve caught all the cases.
00:24:02.640 So, how do we deal with this? We can adopt a strategy called backporting. What you do with backporting is take that new thing in Rails—whether it be a new feature or a new way of calling code— and you backport it to the previous version of Rails that your app is using. This allows your team to start using that new feature before you finish your upgrade. There is even an excellent open-source example of this with strong parameters! We briefly discussed this in the talk’s early stages. Strong parameters were released in Rails 4, and it was a significant change in how we handle parameters. When Rails released strong parameters in Rails 4, they also offered a gem that targeted Rails 3, which acts as our backport. This meant if you had a model in your app—say, a post—you could add that gem to your Gemfile and then include the module available to you, opting into the new behavior.
00:25:20.639 You would include it in only one model and start migrating that model to handle parameters the way Rails 4 requires. You would work on that migration, test everything, verify it, and ship it to production. As soon as you do that, no one can use parameters in the old way in that model anymore, because the module will raise errors. From that point forward, if someone adds a new model, they will have to adopt the new style for parameters. This is an example of a backport created by the Rails core team, and it's incredibly useful! Yet we can implement this ourselves; we don't have to depend on Rails to create all of our backports.
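The pieces fit together roughly like this; the model, controller, and attribute names are illustrative, while the gem and module come from the real Rails 3 backport:

```ruby
# Gemfile (Rails 3 app)
gem "strong_parameters"

# app/models/post.rb: opt just this one model into Rails 4-style protection.
class Post < ActiveRecord::Base
  include ActiveModel::ForbiddenAttributesProtection
end

# app/controllers/posts_controller.rb: mass assignment for Post must now go
# through require/permit, exactly as it will on Rails 4.
class PostsController < ApplicationController
  def create
    @post = Post.create!(post_params)
    redirect_to @post
  end

  private

  def post_params
    params.require(:post).permit(:title, :body)
  end
end
```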
00:26:30.420 I wrote a blog post detailing a similar situation I encountered during an upgrade and how we implemented backports to help our team adopt new patterns. This, ultimately, is what we aim to achieve with our backports: to guide our teams toward adopting new practices as we move forward. Depending on when you introduce a backport, your team may spend months already writing code the way the new version of Rails expects. Thus, when you switch your production environment to run that new version, nobody should be surprised! They will have already been using these updated patterns for a while.
00:27:27.360 With all that said, we’ve been doing well, steadily approaching our Rails upgrade. Yet, we still have a significant task ahead of us: managing our test failures. In all the upgrades I’ve worked on, I’ve noticed teams dedicated the majority of their time fixing failing tests. Test suites tend to be big, slow, complicated, brittle, and hard to understand—and I could keep going! So, what do our teams do when they see hundreds, or even thousands of test failures, in these large, complex test suites?
00:28:38.519 We’ve talked about breaking things down into smaller pieces we can handle incrementally, and this applies to test failures as well. If you see several thousand test failures, you could segment the work by saying, ‘Everyone take one of these failures, figure out how to fix it, and open your PRs!’ But personally? I don’t want to open thousands of PRs— or even worse, thousands of issue tickets! Nobody wants to do that. We need a middle ground between addressing a giant PR fixing multiple unrelated tests and creating many small PRs that are narrowly focused on the same issues over and over again. So, what can we do?
00:29:45.720 Well, we could try grouping our failures. Let’s say we review all of our tests and notice a cluster of tests failing because they expected a 200 response but got a 400. We could isolate these as a group, then assign two people to look into those tests to determine the root cause. After a few days of investigation, they might conclude that a batch of them failed for reason A, while the remaining ones failed for reason B. At that point, we could separate them into two sub-groups for fixing. Alternatively, if we find a bunch of tests failing due to the ‘ActiveRecord::StatementInvalid’ error, we can pull those out as a group—likely they’re also failing for the same reason. We assign someone to tackle that particular issue, while others work through the separate failures as well.
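One way to build these groups mechanically is to bucket failures by exception class. A sketch, assuming the suite was run with RSpec's JSON formatter (`bundle exec rspec --format json --out tmp/results.json`); the paths are arbitrary:

```ruby
#!/usr/bin/env ruby
# Group failing specs by exception class so related failures can be
# investigated together rather than one ticket at a time.
require "json"

results = JSON.parse(File.read("tmp/results.json"))

failures = results.fetch("examples").select { |ex| ex["status"] == "failed" }
groups = failures.group_by { |ex| ex.dig("exception", "class") || "(no exception class)" }

groups.sort_by { |_, exs| -exs.size }.each do |klass, examples|
  puts "#{klass}: #{examples.size} failures"
  examples.first(3).each { |ex| puts "  #{ex['file_path']} - #{ex['full_description']}" }
end
```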
00:30:50.640 You will continue identifying these groups of errors, and ultimately, the goal is to discover chunks of work that are just the right size—not too big with unrelated changes, and not too small where you’re seeing the same PR opened over and over again. It’s kind of a Goldilocks situation! The way you group these failures will depend directly on your application. Depending on how your upgrade goes, it could make sense to identify failing tests based on the file they belong to or, if you run your tests on CI across 40 containers and see 15 failing, you might say, ‘Everyone takes one container and makes it green again!’ This strategy ultimately depends on the specifics of your application and the nature of the upgrade you’re tackling.
00:32:12.000 So now, we've fixed all our deprecations, we've set up dual booting successfully, and we've addressed all our test failures. We're ready to make that switch from the production version of Rails to the new version. Are we, though? Even after this considerable work, assessing the remaining risks of making this change can be challenging. At this late stage of the upgrade, is there anything we can do to better mitigate the remaining risks involved in making the switch?
00:32:59.539 Well, we can conduct small experiments in production-like environments to verify if the upgrade is behaving as expected. For instance, imagine your app is running in production, and you decide to deploy this upgraded version for internal use only. We can ask all of our employees to use this upgraded version of the app. We’ll monitor our Datadog dashboards, or Rollbar errors, or whatever observability tools you have in place, to see if we’re encountering any new errors. Another effective method is to try a canary deployment where we ship the upgrade to just five percent of our traffic for about two minutes. While there might be some impact on users, it’s minimal since it will only last those two minutes. After that duration, we plan to roll it back.
00:33:58.280 In those two minutes, we're again leveraging our dashboards to monitor whether errors are coming through, confirming it's working as we expect. If after those two minutes we find several unexpected errors, we will roll back and address those problems. When we return to conduct the experiment again, we anticipate fewer errors popping up because we've fixed them! Then we try increasing to ten percent of our traffic for two minutes. Again, we can repeat that monitoring process, and if we discover more errors, we fix them and keep iterating. After a while, if everything looks promising, we could increase it to ten percent of our traffic for five minutes.
00:34:56.720 Finally, you might find yourself feeling pretty confident, so you increase the traffic to forty percent for ten minutes. We could observe that everything is performing optimally and not presenting new, unexpected errors that could warrant concern. We’re becoming more and more assured about this upgrade we’ve been working toward. Ultimately, in today’s talk, we’ve discussed various strategies aimed at minimizing the risks of our upgrades, gaining information, and ensuring we feel more confident that we’re not going to incur any incidents by deploying a Rails upgrade. If we manage this correctly, we will all be able to ship zero-downtime Rails upgrades.
00:35:57.660 The first upgrade you undertake like this will undoubtedly be the hardest! It doesn’t matter what version of Rails you’re working with or anything like that; it will be challenging because your team and organization have likely never undertaken an upgrade of this nature before. You must figure out what tools you like to use, how you want to allocate the test failures, and all that entails. This process will certainly reveal many learnings: you’ll realize what works best for deprecations or which deployment strategies went the most smoothly. Document and reflect on these experiences; they’ll contribute to developing a playbook for the next time you tackle an upgrade.
00:36:44.760 If you do this correctly, when it’s time for the second upgrade, it might feel a bit bumpy, but it will certainly be easier. Moreover, by the time you get to the third upgrade, you’ll be cruising along, feeling as though you’ve done it a hundred times before—it will become second nature! Perhaps one day you will reach a stage like GitHub, where upgrades happen weekly! Now, I don't know if anyone desires to live on the edge and always maintain the latest version of Rails forever, but this nonetheless demonstrates that once you refine your process and ascertain the best tools for the job, you can make these regular upgrades with ease.
00:37:36.360 As you traverse this journey, sharing what you learn is paramount! As I mentioned at the beginning of the talk, there seems to always be another discussion or talk dedicated to Rails upgrades every year. There will always be something new to share, so please return to share your knowledge and experiences, as you can aid others in shipping their zero-downtime Rails upgrades too! If you are not keen on public speaking, you can always opt to write blog posts, and I have plenty of those myself.
00:38:41.880 That’s all I have for today! If you have any questions regarding Rails upgrades or anything at all, feel free to email me at [email protected]. I’m also available on Slack under my full name, and again, I referenced a few blog posts during this discussion; you can find them all on the Test Double blog. There are numerous other blog posts about upgrades too, alongside plenty of other valuable content. Many of us are here from Test Double, and I genuinely enjoyed Daniel's talk yesterday about legacy code and encouraging curiosity to improve it. I’m looking forward to Landon's presentation about machine learning and Ruby in just a few minutes, and tomorrow Meg and Justin will be hosting a little town hall discussing how to make Standard even better for all of us!