Reflecting on Active Record Associations

by Daniel Colson

The video titled "Reflecting on Active Record Associations" features Daniel Colson at RailsConf 2022. The talk delves into the workings of Active Record associations in Ruby on Rails, traditionally viewed as magical due to their simplicity and effectiveness. Colson emphasizes the importance of understanding the underlying mechanics of these associations, rather than accepting their behavior at face value.

Key Points Discussed:
- Active Record Magic: Colson introduces the concept of 'Rails Magic', highlighting how simple methods like belongs_to and has_many encapsulate complex behaviors, which can lead to confusion when things don’t work as expected.
- Metaprogramming: The speaker takes a hands-on approach, showing how to create simplified versions of belongs_to and has_many associations using metaprogramming. This involves dynamically defining methods based on the context of the model.
- Association Reflection: The introduction of a Reflection class helps in storing metadata about the associations, allowing for more generic and reusable code for belongs_to. This makes it possible to retrieve the necessary class and foreign key dynamically rather than hard coding them.
- Caching Mechanism: Colson explains the necessity of implementing a caching mechanism to prevent the loading of multiple instances of the same related record, ensuring efficiency and consistency in data management.
- Handling Complex Relations: Moving on to has_many associations, the speaker demonstrates the complexities involved with Active Record relations, explaining how to create a CollectionProxy that maintains the features of relations like lazy loading while also utilizing the association caching.
- Inverse Associations: The talk wraps up with an explanation of inverse associations and their importance in maintaining consistency between bi-directional relationships. Colson illustrates this with examples, highlighting how to effectively manage the relationship and prevent redundant data loading.
- Call to Action: Colson encourages viewers to deeply study the tools they use, aiming to become 'Rails magicians' who understand the intricacies of their code, thus leading to more proficient and confident development.

In conclusion, Colson’s exploration of Active Record associations not only demystifies their operation but also provides valuable insights into effective coding practices. By understanding these underlying principles, developers can leverage the full power of Rails, reduce bugs, and write more efficient, maintainable code.

00:00:16.100 I am totally freaked out, so thank you for coming. Thank you for being here on the afternoon of the last day, and thanks for being awake. Hopefully, you're still awake by the end.

00:00:21.960 I'd like you to take a moment to think back to the first time you wrote an Active Record association. It might have been recently or many years ago. Did you feel confused or perhaps impressed by how much could be accomplished with so little code?

00:00:35.640 I remember my first time well. I was a brand new programmer working through the Hartl Rails tutorial, and I wrote my first associations having really no idea what I was doing. I didn't know that 'has_many' and 'belongs_to' were Ruby methods. I thought that they were these special Rails macros.

00:00:49.500 Okay, thanks for laughing. I was awestruck when I realized how much behavior was defined by these brief lines of code. 'belongs_to' defines eight different methods for you: methods for reading and writing the repository, and for creating new ones. But it also comes with a heap of behavior on top of that, like presence validations and a caching mechanism.

00:01:07.080 Hitting your mic with your thumb, uh, this is a ton of behavior to get with just a few lines of code. I think this is a good example of what people sometimes call Rails magic. Now, Rails magic is fantastic. Rails provides all this behavior for us without us having to really think about it. It performs a kind of vanishing act, like what we saw with Zeitwerk in the opening keynote.

00:01:39.360 The details are hidden away out of sight, so you don't have to think about them. This allows you to focus on the parts of your application that make it unique. But Rails magic comes with a downside as well. Sometimes, when things are confusing or not working the way you're expecting, it can be tempting to throw up your hands and say, 'Well, I don't know, I guess it's just Rails magic.' But instead of brushing off that confusion, I think it's worth digging into it to uncover the source of the sorcery.

00:02:02.460 Because if Rails is indeed magic, I think that makes all of us magicians. Now, I've been watching way too much 'Penn and Teller: Fool Us' while procrastinating in preparing for this talk.

00:02:21.400 It seems to me that when a magician sees a trick that confounds them, they like to study how it works and incorporate that knowledge into their own repertoire. So we can do the same kind of study with Rails magic, and that's what this talk is about. We're going to study some of the parts of Active Record associations that I've found confusing or that have interested me, with the hope that pulling back the curtain a bit will help you use associations more effectively.

00:02:42.780 I'm also hoping that this talk will encourage you to continue these studies beyond what you learn here. I'm Daniel Colson. What am I going to say about myself? I was formerly on the Rails issue team, a maintainer of Factory Bot, and nowadays I like to spend my time helping other folks get involved in contributing to Ruby open source.

00:02:57.180 So if that's you, come talk to me. My various handles are esoteric references to my former career as a professor of music. Now, I am on the Ruby Architecture team at GitHub, where I've had lots of opportunities to study Active Record associations.

00:03:07.260 Now, if you're really interested in Active Record associations and you're interested in hearing more about some experimental patterns that we've been exploring at GitHub, the very next talk in this room is about that. So you may want to stick around.

00:03:20.520 All right, for our study, we're going to define our own simplified 'has_many' and 'belongs_to' methods that are closely based on the design of ActiveRecord itself. We'll start out with a bit of metaprogramming as we define our 'belongs_to' method.

00:03:34.139 Then we're going to introduce this class called 'Reflection' that will help us define our 'belongs_to' in a generic way. Once that's complete, we'll add on a caching mechanism. Then we'll move on to the 'has_many' method, and we're going to bump into this other part of ActiveRecord called the Relation. Finally, we'll add on one more feature called Association inverses.

00:03:52.740 So first up: metaprogramming. For the example here, we're going to have a 'PullRequest' model, and we're going to define a 'belongs_to' association called 'repository.' The first time I saw this, I had no idea that 'belongs_to' was a method call. I'll just rewrite this ever so slightly to make that more explicit.

00:04:01.050 So we're calling the 'belongs_to' method on 'self.' 'Self' here is the PullRequest class. This is a class method and takes an argument of the association name.

00:04:10.800 So we can define a class method. That's not too hard. Looks like this. 'def self.belongs_to' and it takes an argument of the association name. So that's the easy part.

00:04:23.520 Now, the fun part: metaprogramming. When we call this 'belongs_to' method, we want it to define a reader and writer method for our association. So we're writing a method that's writing other methods. That's what metaprogramming is all about.

00:04:40.680 And we want the methods that this 'belongs_to' defines to be based on the name passed into it. Let's look at our reader method. Let's say we've got a PullRequest, and let's say that's got a repository whose ID is 42. I want my 'belongs_to' to define a reader method 'pull_request.repository' that will return the repository object whose ID is 42. In other words, I want the method to return the repository whose primary key matches the pull request's foreign key.

00:05:02.400 Now, when I'm doing this kind of metaprogramming, I find it easiest to start with a concrete method and then work toward a generic version. So this is what the reader would look like if I just wanted it to work for repositories and not for any possible association. I can use Active Record's 'where' method to find the repository object whose ID matches the pull request's repository ID.

00:05:13.020 And there's only going to be one repository with a given ID, so I call 'first' to return that one record. The writer method looks like this: let's again say I've got a pull request whose repository ID is 42, and I've got some other repository in memory with an ID that doesn't match. I want my 'belongs_to' to define a method 'repository=' that takes this repository and will update the pull request's repository ID to match the repository that I passed in.

00:05:27.180 So I can write that method without too much trouble. 'repository=' is the method name. It takes an argument of the repository, we read the ID off of it, and then we set the pull request's repository ID. So right now this only works for associations called 'repository.' We want this to work for any possible association, and we can do that by using Ruby's 'define_method.' So 'define_method' takes as an argument the name of the method that you want to define.

00:05:55.860 So now if I call 'belongs_to' with this argument 'repository,' I'll get a 'repository' method and, with a little string interpolation, a 'repository=' method.
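The slides are not part of this transcript, so the following is only a minimal sketch of the step just described. It assumes a tiny in-memory `Model` base class with a hypothetical `find_first_by` helper standing in for Active Record's `where(...).first`; the `Repository` class, `id`, and `repository_id` are still hard-coded in the method bodies.

```ruby
# Tiny in-memory stand-in for Active Record (an assumption for this sketch).
class Model
  def self.records
    @records ||= []
  end

  # Hypothetical helper that mimics `where(attribute => value).first`.
  def self.find_first_by(attribute, value)
    records.find { |record| record.public_send(attribute) == value }
  end

  def self.belongs_to(name)
    # Reader: the class and both keys are still hard-coded here.
    define_method(name) do
      Repository.find_first_by(:id, repository_id)
    end

    # Writer: string interpolation builds the `repository=` method name.
    define_method("#{name}=") do |repository|
      self.repository_id = repository.id
    end
  end
end

class Repository < Model
  attr_accessor :id

  def initialize(id:)
    @id = id
    self.class.records << self
  end
end

class PullRequest < Model
  attr_accessor :repository_id

  belongs_to :repository
end

repository = Repository.new(id: 42)
pull_request = PullRequest.new
pull_request.repository_id = 42
pull_request.repository.equal?(repository) # => true

pull_request.repository = Repository.new(id: 7)
pull_request.repository_id # => 7
```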

00:06:10.920 But this now works for any possible association name. I get a reader and a writer method. However, I've also hard-coded some stuff into the body of these methods—there's the repository class and there's the primary key and the foreign key ID (repository ID). So again, if I want this to work for any possible association in any possible class, I need to get these in some kind of generic way.

00:06:24.240 What I have available when I'm declaring my 'belongs_to' association is the model that I'm in, in this case, PullRequest, and the name of the association, which in this case is 'repository.' So how do I get from that information to the class, primary key, and foreign key I need?

00:06:36.780 I've heard folks say that magic is all smoke and mirrors, so what if we could solve this with a new class called 'Reflection?' I know that was a cheesy joke, but sorry. I worked hard on that. This Reflection class will store metadata about the association and then let us reflect on the association, or ask certain kinds of questions about it, like 'Hey, what's your primary key?', 'What's your foreign key?', and 'What classes are you associated with?'

00:06:52.680 This Reflection will get initialized with the Active Record model that you're defining your association in, and the name of the association. In our case that would look like this: PullRequest is the model that we’re in, and 'repository' is the name of the association.

00:07:06.960 Then we can define three methods on this Reflection to get those pieces of information that we need. So the class method, in this example, will return the repository class.

00:07:20.520 You might be able to see that that's kind of related to the name of the Reflection, so I need a class method that will transform this reflection's name into the corresponding constant. Active Support comes to the rescue. I can use these methods 'camelize', which will take a snake case string and turn it into a camel case string, and then I can call 'constantize', which will take that camel case string and turn it into the appropriate Ruby constant.

00:07:38.940 The primary key method is actually pretty simple because Active Record models have a primary_key method on them that returns the appropriate value, so I call class.primary_key. That one's pretty easy. Now for the foreign key method, in this example, I want it to return 'repository_id,' which again, you can see, is kind of related to the reflection name. I just have to tack '_id' onto the end of it, which I can do with a bit of string interpolation.

00:07:54.360 So with those three methods defined, now at the top of my 'belongs_to', I can initialize one of these Reflections. I can get the class, primary key, and foreign key off of that reflection, and all of a sudden, my 'belongs_to' methods fall into place. I now have a way to get these values. So we did it, this works for any class and it works for any association name.
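A rough sketch of such a Reflection class follows. To stay runnable without Active Support, it replaces `camelize` and `constantize` with plain-Ruby stand-ins, and it names the class lookup `klass` so it does not override `Object#class`; both choices, and the stub model classes, are assumptions of this sketch rather than the code shown in the talk.

```ruby
class Reflection
  attr_reader :active_record, :name

  def initialize(active_record, name)
    @active_record = active_record
    @name = name
  end

  # Plain-Ruby stand-in for Active Support's `name.to_s.camelize.constantize`.
  def klass
    camelized = name.to_s.split("_").map(&:capitalize).join
    Object.const_get(camelized)
  end

  def primary_key
    klass.primary_key
  end

  def foreign_key
    "#{name}_id"
  end
end

# Stub models for illustration; real Active Record models provide primary_key.
class Repository
  def self.primary_key
    "id"
  end
end

class PullRequest
end

reflection = Reflection.new(PullRequest, :repository)
reflection.klass       # => Repository
reflection.primary_key # => "id"
reflection.foreign_key # => "repository_id"
```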

00:08:13.680 I feel like I earned some water.

00:08:20.760 Thank you.

00:08:26.160 All right, now we move on to another feature: caching. Not that kind of cash. Currently, without caching, if I call the 'repository' reader method twice in a row, I get different objects each time.

00:08:34.380 Having multiple copies of the same repository object in memory is inefficient and can result in inconsistent data. So for example, if I change the name of one copy of this repository, the other copy in memory won't reflect that change, and I end up with inconsistent data.

00:08:49.740 If I then render, say, the repository name on a page somewhere, I'm going to get different results depending on which of these copies of the object I use. This sort of thing can lead to subtle and confusing bugs. So we're going to introduce a caching mechanism, and the logic is going to live in this new class called Association.

00:09:12.060 Whereas the Reflection class stored class-level metadata about the association, this Association class is going to have the instance level data. The owner here is going to be an instance of a record that we're calling an association method on, and then the target is going to be the records that we actually load for that association.

00:09:26.040 Then we'll use this 'loaded' flag to figure out whether we still need to load the target. This Association class is going to be responsible for everything to do with loading records. So we're actually going to move the 'belongs_to' reader and writer method bodies that we wrote a moment ago out and into this new class.

00:09:37.620 So I'm going to do a little cut and paste here: cut that out, define these new methods in the Association called reader and writer, and paste what I just cut into there. Then I'll go back to the 'belongs_to' definition and call these Association reader and writer methods instead.

00:09:52.380 So I haven't changed anything yet, I just moved some code around. I also cheated a little bit because I never actually defined that Association method, but basically, every record has a cache of these association objects and you can look them up by the name of the association.

00:10:04.320 So for our 'pull_request.repository' association, we'd get back an object that looks like that. The owner is the instance of a PullRequest and the association will start out as not loaded.

00:10:18.720 So now, to get our caching mechanism working, we're going to add a conditional to the reader method. That conditional is based on the loaded flag, which is going to start out as false.

00:10:30.780 So that's going to send us to the else branch of this conditional. The else branch will do the loading that we did before and then take the record that it loads and set that as the association's target. It'll also mark the association as loaded.

00:10:43.020 So now if I call this reader method a second time, 'loaded' is true, and so I don't have to load the record again. I can use this target that is already in memory.

00:10:57.060 One little detail here, to ensure that the association target is always up to date, we also need to set it whenever we're writing to the association; otherwise, we end up with stale data in the cache.

00:11:05.820 So great, we've now got a fully functioning caching mechanism. Now if I call this repository reader method twice in a row, I get the same object each time, and so there's only one copy of this repository in memory. That means that when I update one, I'm actually updating both because they're the same exact object.
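The caching logic described above might look roughly like this. It is a simplified sketch: the `loader` block is a hypothetical stand-in for the `where(...).first` query, the models are plain Structs, and the foreign key in `writer` is hard-coded where the talk's version would ask the reflection.

```ruby
Repository = Struct.new(:id, :name)
PullRequest = Struct.new(:repository_id)

class Association
  attr_reader :owner, :target

  # `loader` is a hypothetical stand-in for the `where(...).first` query.
  def initialize(owner, &loader)
    @owner = owner
    @loader = loader
    @target = nil
    @loaded = false
  end

  def loaded?
    @loaded
  end

  def reader
    if loaded?
      target # reuse the record that is already in memory
    else
      self.target = @loader.call(owner)
    end
  end

  def writer(record)
    owner.repository_id = record.id # foreign key hard-coded for brevity
    self.target = record            # keep the cache in sync with the write
  end

  # Setting the target always marks the association as loaded.
  def target=(record)
    @target = record
    @loaded = true
  end
end

association = Association.new(PullRequest.new(42)) do |owner|
  Repository.new(owner.repository_id, "rails") # a fresh object per "query"
end

association.reader.equal?(association.reader) # => true
```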

00:11:21.780 So no more problems with data consistency. This is great! We just eliminated a whole bunch of bugs, and we made our application more efficient. And Rails does this for you for free; you never have to think about it.

00:11:31.800 Relations. We're gonna, yeah, okay, get excited.

00:11:37.260 It’s not that kind of relations. I know I just put the emoji there, but whatever.

00:11:44.160 Now, onto the 'has_many' association. That 'has_many' association is going to start out very similar to 'belongs_to', but we're going to bump into this other part of Active Record called the Relation, and it's going to get a little complicated. So bear with me; we'll get through it, and I think it ends up being pretty neat at the end.

00:12:08.220 So the example we'll use here is kind of the other side of the example we used a moment ago. We've now got a repository model, and we’re defining a 'has_many' association called 'pull_requests.' Before, we were working with the PullRequest model that belongs to the Repository; it's kind of the opposite.

00:12:41.340 We can define our 'has_many' class method with the exact same code we ended up with for our 'belongs_to.' Great! We're using 'define_method', we're calling the association reader and writer methods.

00:12:55.680 But of course, it's not that simple. We can't just reuse everything. There are different association classes; there are actually different reflection classes as well. I'm not going to get into that too much. The 'belongs_to' association and 'has_many' association classes will work fairly differently.

00:13:14.640 So, let's take a look at the reader method for these two classes. The 'belongs_to' association reader looked like this: it had our caching mechanism built into it, and then the way that we loaded records was by finding the records whose primary key matched the association owner’s foreign key.

00:13:35.760 Now we can get pretty close to a 'has_many' association reader with just a few small changes to the way we load the records. Because we’re kind of on the opposite side of the relationship here, we swap the position of the foreign key and primary key.

00:13:50.460 So we’re now loading records whose foreign key matches the association owner’s primary key. Instead of calling 'first', we call 'to_a' to get an array of records. This is a collection instead of a single record.

00:14:05.580 This is pretty good! We can call 'repository.pull_requests' and that’ll load the associated pull requests for that repository, and we get back an array of pull request objects. We’ve got our same caching mechanism, so if we call the method again, we get back the array that’s already in memory, and we don't have to load it again.
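As an illustration of this array-returning version, here is a minimal sketch. The hypothetical `where_to_a` helper and the in-memory `PullRequest.all` array stand in for Active Record's `where(...).to_a`, and the key names are hard-coded instead of coming from a reflection.

```ruby
PullRequest = Struct.new(:id, :repository_id) do
  # In-memory stand-in for the pull_requests table.
  def self.all
    @all ||= []
  end

  # Hypothetical helper that mimics `where(attribute => value).to_a`.
  def self.where_to_a(attribute, value)
    all.select { |record| record[attribute] == value }
  end
end

Repository = Struct.new(:id)

class HasManyAssociation
  attr_reader :owner, :target

  def initialize(owner)
    @owner = owner
    @target = []
    @loaded = false
  end

  def reader
    return target if @loaded

    # Keys are swapped relative to belongs_to: match the records'
    # foreign key against the owner's primary key.
    @target = PullRequest.where_to_a(:repository_id, owner.id)
    @loaded = true
    target
  end
end

PullRequest.all.push(
  PullRequest.new(1, 42),
  PullRequest.new(2, 42),
  PullRequest.new(3, 7)
)

association = HasManyAssociation.new(Repository.new(42))
association.reader.map(&:id)                  # => [1, 2]
association.reader.equal?(association.reader) # => true
```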

00:14:20.760 So this works, but it's not quite as magical as what we get with a real Active Record. What I really want to do here is not return an array of records but return something called an Active Record relation.

00:14:35.520 So what is a relation? Relations are what you get if, for example, you call the 'where' method. Now, we've called the 'where' method before, but we immediately called 'first' or 'to_a' on it to get a record or an array of records. If you just call the 'where' method without chaining those things on, you get back this thing called an Active Record relation.

00:14:51.960 I think of a relation as a super-powered array of records. It has all these additional features built on top of it, and I would really like to get those features for my 'has_many' association.

00:15:06.840 One of the features is lazy loading. So if you just call 'where', it doesn't actually load the records. It's not until you call a method like 'to_a', for example, where you need an array of records, that the relation will perform a SQL query and load those.

00:15:23.220 That's neat. Relations also have a bunch of methods that standard arrays wouldn't have, like a 'create' method. The 'create' method allows you to create new records using the conditions from that relation.
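The lazy loading mentioned a moment ago can be illustrated with a toy class. This `LazyRelation` is entirely hypothetical and far simpler than a real Active Record relation: the block stands in for the SQL query, and it only runs once something actually needs the records.

```ruby
class LazyRelation
  include Enumerable

  attr_reader :query_count

  # The block is a stand-in for the SQL query the relation would run.
  def initialize(&query)
    @query = query
    @records = nil
    @query_count = 0
  end

  def loaded?
    !@records.nil?
  end

  # Methods that need actual records force the load.
  def to_a
    load_records
  end

  def each(&block)
    load_records.each(&block)
  end

  private

  def load_records
    return @records if loaded?

    @query_count += 1
    @records = @query.call
  end
end

relation = LazyRelation.new { ["PR 1", "PR 2"] }
relation.loaded?     # => false
relation.to_a        # => ["PR 1", "PR 2"]
relation.query_count # => 1
```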

00:15:39.420 And there's a whole bunch more built into these relations, probably enough for a whole separate talk on just that. So wouldn't it be cool if when I called 'repository.pull_requests', instead of immediately getting back a sort of eagerly loaded array, I could get back a relation with all these features?

00:15:57.960 We can do that! But there's a catch. We've got this association caching mechanism. So if I want this association caching mechanism to still work, then when I call a method like 'to_a' on the relation that forces it to load records, it needs to somehow load the records via the association so that the association can be marked as loaded and can put those records in its target.

00:16:05.040 But there's a problem here because relations are a totally separate part of Active Record and they don't know anything about associations, so the arrow I've drawn here doesn't make any sense. There's no way for a relation to delegate certain behavior to this Association object that it doesn't know exists.

00:16:20.760 Unless we introduce a new type of relation, a subclass of relation called collection proxy that gets initialized with a reference to the association. Then this collection proxy could delegate certain behavior to that association object.

00:16:36.060 Okay, so let's rewrite our 'has_many' association reader method to return one of these collection proxies. Now we've got all the relation features. A collection proxy is just a special type of relation.

00:16:46.860 We’re going to take what used to be in the reader method and put it into this new method called 'load_target'. We’ll mention this 'load_target' method again in a moment, so try to remember that name.

00:17:02.460 So now 'repository.pull_requests' returns a collection proxy, and the collection proxy has a reference to the association. If I now call a method like 'to_a' that forces the collection proxy to load, instead of loading the records directly like a standard relation would, it's going to load them via this association object by calling its 'load_target' method.

00:17:18.960 That way, the association will be marked as loaded; it'll set the target to be those records that were loaded. Because 'repository.pull_requests' is a collection proxy, we also have methods like 'create', but again, instead of doing it directly the way a relation would, the Collection proxy is going to delegate that work of creating a new record to the 'has_many' association.

00:17:31.920 That’s going to allow the association to update its target with the newly created record, so we won't end up with a stale cache. This is pretty cool! This is specific to the collection proxy and the 'has_many' association. Relations don’t know how to do this sort of thing at all.

00:17:51.840 Then if we call 'to_a' again on the collection proxy it again delegates that work to the association using the 'load_target' method. Since the association is already loaded, it returns the target that's already in memory.
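The delegation described above can be sketched as follows. In Active Record the collection proxy is a kind of relation; this toy `CollectionProxy` is not, and only shows the hand-off to the association. The `table` array, the `queries` counter, and the `create(title)` signature are assumptions made for illustration.

```ruby
PullRequest = Struct.new(:title, :repository_id)
Repository = Struct.new(:id)

class HasManyAssociation
  attr_reader :owner, :target, :queries

  # `table` is a hypothetical array standing in for the pull_requests table.
  def initialize(owner, table)
    @owner = owner
    @table = table
    @target = []
    @loaded = false
    @queries = 0
  end

  # The reader no longer loads anything; it hands back a proxy.
  def reader
    @proxy ||= CollectionProxy.new(self)
  end

  def load_target
    unless @loaded
      @queries += 1
      @target = @table.select { |pr| pr.repository_id == owner.id }
      @loaded = true
    end
    target
  end

  def create(title)
    record = PullRequest.new(title, owner.id)
    @table << record
    target << record if @loaded # keep the cached target from going stale
    record
  end
end

class CollectionProxy
  include Enumerable

  def initialize(association)
    @association = association
  end

  # Loading and creating are delegated to the association.
  def to_a
    @association.load_target
  end

  def each(&block)
    to_a.each(&block)
  end

  def create(title)
    @association.create(title)
  end
end

table = [PullRequest.new("Fix typo", 1), PullRequest.new("Other repo", 2)]
association = HasManyAssociation.new(Repository.new(1), table)
pull_requests = association.reader

pull_requests.to_a.map(&:title) # => ["Fix typo"]
pull_requests.create("Add docs")
pull_requests.to_a.map(&:title) # => ["Fix typo", "Add docs"]
association.queries             # => 1
```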

00:18:04.620 I feel like I was supposed to celebrate that a little bit.

00:18:20.220 Yeah! Some more water, oh yeah, oh yeah, right.

00:18:36.060 It’s like this is cool right? It’s complicated, I know! It’s complicated. And if you didn’t follow all of it that’s okay; this is a complicated part of the code base.

00:18:47.520 But anyway, I think it's really neat. We've now got a 'has_many' reader method that loads and caches associated records, but it also includes all the behavior of a relation.

00:19:01.620 One last feature: inverses come into play when you have a pair of related associations that work in opposite directions. Rails calls these bi-directional associations.

00:19:13.680 So the example we've been using: there's a Repository class and a PullRequest class, and we defined a pair of associations—a 'belongs_to' and a 'has_many' pair that get you from one of these classes to the associated records on the other class.

00:19:26.580 So you can kind of see that these are related: 'repository.pull_requests' and 'pull_request.repository', but right now there's nothing in our code that actually connects these two things.

00:19:39.780 The reflection objects don’t know about each other; the association objects don’t know about each other. And that can actually cause some problems.

00:19:49.440 So let's say that I load all the pull requests for a given repository. That'll perform a query to get all the pull requests with the repository ID that matches. Then let’s say I go in the opposite direction in this bi-directional association. I take all those pull requests that I just loaded and I call the 'repository' method on them.

00:20:03.420 I would expect that to return the same repository object that I started with; it's already in memory, all the IDs match up. We should be able to reuse that same object. But that’s not what happens.

00:20:15.140 Right now we end up loading the same repository over and over again, and so we end up with the same problem with data inconsistency that we saw when we were dealing with caching.

00:20:30.420 If I change one of these repositories in memory, the other ones aren't going to see those changes. This kind of makes sense because when we first load up all the pull requests, all the associations for those pull requests are going to be unloaded.

00:20:39.420 So if I then call the 'repository' method on each pull request, it has to load the association.

00:20:51.540 Well, we have this other association, the 'repository.pull_requests' association. Its owner is the exact repository that we want as the 'belongs_to' association's target. So what if we could just grab the owner from the 'has_many' association and shove it over there into the 'belongs_to' association? Then we wouldn't have to load it, and that's exactly what we're going to do.

00:21:05.760 That's what this 'inverses' feature is all about. If we can get these association objects to know about each other, then the 'has_many' association can kind of send its record on over to the 'belongs_to'. So that’ll look like this when we’re first loading the 'has_many' association's target.

00:21:23.520 We'll go through each record in that target, so each pull request. We'll look up the repository association for that pull request, and then we’ll set that association's target to be the owner of this 'has_many' association. The owner here is that repository.

00:21:36.720 This works, but I cheated again because I hard-coded the name of the association 'repository.' So if I want this to work for any association, I can't hard-code a value like that.

00:21:47.100 Luckily, this is another question we can ask of the reflection. We can say, 'Hey, if you looked in the mirror, what would the association on the other side look like?' That was a reflection joke, no? Okay, that's fine.

00:22:03.480 So the other side of the 'pull_requests' association would be the 'repository' association. I want to define a method called 'inverse_name' that will return the name of the association that’s on the other side. You might be able to see that that’s related to the reflection’s Active Record model.

00:22:20.220 We can write that method by getting the name of the Active Record model, which is going to be camel case, and then calling 'underscore' to make it snake case. That allows me to replace this hard-coded value with a generic value.

00:22:38.280 Now if I call 'repository.pull_requests' and load those pull requests, and then I go in the opposite direction and call the 'repository' method on each of the pull requests I loaded, there's only ever one copy of the repository in memory, so we have no more problems with data inconsistency.
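Putting the inverse step together, a compressed sketch might look like this. It folds the 'has_many' loading into a `pull_requests(rows)` method that takes hypothetical row IDs instead of running a query, uses a regular expression as a plain-Ruby stand-in for Active Support's `underscore`, and simulates a fresh query with `Repository.new`; none of these shortcuts come from the talk itself.

```ruby
class Reflection
  def initialize(active_record, name)
    @active_record = active_record
    @name = name
  end

  # Plain-Ruby stand-in for Active Support's `active_record.name.underscore`.
  def inverse_name
    @active_record.name.gsub(/([a-z])([A-Z])/, '\1_\2').downcase.to_sym
  end
end

class Association
  attr_reader :owner, :target

  def initialize(owner)
    @owner = owner
    @loaded = false
  end

  def loaded?
    @loaded
  end

  def target=(records)
    @target = records
    @loaded = true
  end
end

class Record
  # Every record keeps a cache of association objects, looked up by name.
  def association(name)
    @associations ||= {}
    @associations[name] ||= Association.new(self)
  end
end

class Repository < Record
  attr_reader :id

  def initialize(id)
    @id = id
  end

  # `rows` is a hypothetical list of pull request IDs replacing the query.
  def pull_requests(rows)
    assoc = association(:pull_requests)
    return assoc.target if assoc.loaded?

    inverse_name = Reflection.new(self.class, :pull_requests).inverse_name
    records = rows.map { |row_id| PullRequest.new(row_id, id) }
    # Hand the owner over to each record's inverse association.
    records.each { |pr| pr.association(inverse_name).target = self }
    assoc.target = records
  end
end

class PullRequest < Record
  attr_reader :id, :repository_id

  def initialize(id, repository_id)
    @id = id
    @repository_id = repository_id
  end

  def repository
    assoc = association(:repository)
    # Without the inverse, this builds a fresh copy (simulating a query).
    assoc.target = Repository.new(repository_id) unless assoc.loaded?
    assoc.target
  end
end

repository = Repository.new(42)
pull_requests = repository.pull_requests([1, 2, 3])
pull_requests.all? { |pr| pr.repository.equal?(repository) } # => true
```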

00:22:54.540 Yes, yes! Love it! If you've ever wondered what the 'inverse_of' option is all about, I should ask, have you ever wondered what the 'inverse_of' option is all about? Yeah, okay, okay.

00:23:06.840 This is it! Anyway, it took me years to understand what 'inverse_of' is all about, but it's basically about setting up this relationship between associations that are connected. Rails can sometimes guess it for you, but not always.

00:23:17.100 So that's where today's study ends, but it's certainly not where Active Record associations end. There’s all sorts of stuff that makes the real implementation more complicated and, to me, more fun. There are through associations, polymorphic associations, all the different ways that scopes can interact with associations, and all kinds of other stuff I left out.

00:23:28.980 These are great fun to study, and you have access to the source code, you can go look at this code right now. Just to help out a little bit, this is a big library, so I’ll call out some files.

00:23:43.860 'associations.rb' is where the 'has_many' and 'belongs_to' methods are defined. 'reflection.rb' is where the reflection classes are defined, and then the rest of it is in the associations directory: that’s the collection proxy, the association classes, and some other stuff as well.

00:23:59.160 I also started writing on my blog here about various things that I left out or lied about in this talk, so that may be interesting to you as well. And I don't know, maybe I’ll write more if you want me to. I'm also happy to chat with you about any of this. I think I’m also supposed to share my slides, but I have no idea how to do that, so I’ll figure that out and do that as well.

00:24:13.320 My experience has been that demystifying Active Record associations has allowed me to use them more effectively. I'm able to better leverage features that are built into Rails. I'm able to write less custom code and rely on the code that’s already in Rails.

00:24:43.560 Understanding things like 'inverse_of' lets me write more efficient code and avoid inconsistent data. Perhaps most importantly, I’m coding more confidently. I understand why I'm writing what I’m writing—like why do I need to specify this foreign key here? Why do I need to reload this record here?

00:25:02.460 So the next time you're working with Active Record, or really any other library, and you find yourself confused by something that seems magical, consider taking a moment to reflect on how that code is working. Studying your tools can help you use them more effectively and can broaden your knowledge in general.

00:25:23.040 I know being a Rails developer is great, but why not study to become a Rails magician? Thank you.