00:00:20.900
Thank you, Sebastian. Good afternoon! I'm going to present a case study of a sustained multistage effort to change and simplify the core data relationship in a production system.
00:00:26.160
Changing code has its challenges, but changing the data model of production data is even more difficult. Protecting the existing data requires caution, and exploring and testing possible solutions can be slow.
00:00:34.920
Your new data model may require data completeness or correctness that hasn't been enforced for your existing data.
00:00:41.820
However, living with a data model that mismatches your business can be painful. It introduces extra complexity, pitfalls where bugs can lurk, and poor performance.
00:00:48.120
We rarely hear about the details of such changes, so I hope this talk will be helpful for those of you facing similar challenges.
00:00:54.210
We undertook this work in four stages over two and a half years. At each stage, we addressed the biggest problem we faced, driven by business considerations, and then returned to feature development until the next significant problem became costlier to live with than to fix.
00:01:07.410
You can see that stage two began with a short spike, followed by a five-month gap before the real push started. Initially, we had no idea what lay ahead, and we might not have started if we had known the challenges involved.
00:01:19.560
Looking back, it feels amazing and as if it went quickly. We made substantial progress, and we’re glad we are where we are now.
00:01:23.429
While this talk contains quite a few technical details, it's not solely about the technicalities. Your experience will likely differ.
00:01:30.149
All four stages we went through had very different technical details. This discussion focuses on our overall strategy and how to approach these changes responsibly.
00:01:42.060
We applied these principles at every stage, and I encourage you to pay attention to how we did that throughout the presentation.
00:01:48.959
The entire effort started with a couple of significant mistakes, so I will begin by sharing those to provide the rationale for all this work.
00:01:55.830
Initially, we aimed to help real estate agents win more business by focusing their efforts on the right people at the right time.
00:02:01.890
The average realtor has a database of a few thousand contacts, and we used predictive analytics and other techniques to help agents market wisely.
00:02:07.349
Now seems like a good time to mention that we are currently hiring Rails developers. A startup's job is to try various approaches and discover what does not work as we seek what does.
00:02:12.840
When we first examined this problem, our basic data model looked like this: Realtors as customers, each having a set of contacts.
00:02:19.560
We considered all the possibilities this data presented, including knowing who was married to whom and understanding relationships between contacts.
00:02:25.530
This led to our first major mistake: we started building a social graph.
00:02:31.650
This subsequently caused our second big mistake: storing all that data in a graph database.
00:02:39.210
I don’t believe this is a mistake for everyone, but it proved to be one for us.
00:02:44.400
We realized, as we progressed, that we were not actually building a social graph. Today's focus will be on the relationships between Realtors and their contacts.
00:02:52.900
Remember that the other relationships still exist. This notion of a social graph led us to think of contact records as representing actual people.
00:02:59.640
For example, we referred to 'Abby's contact info for Nancy' rather than just identifying a simple record.
00:03:06.680
We made numerous other mistakes along the way, but let's not get bogged down in the specifics. Just remember what Jerry Weinberg said.
00:03:12.840
Keep that in mind as I describe where we were two and a half years ago.
00:03:18.990
At that time, we had what we believed was business data about users and their activity stored in Postgres, and all the contacts in our social graph were in Neo4j.
00:03:26.270
To link users to data in Neo4j, we had what I will call Realtor-prime proxy objects in Neo4j.
00:03:33.010
All those CR nodes represented contact relationships; we had reified those relationships as nodes in Neo4j.
00:03:39.620
Its query language, Cypher, encourages one to think along these lines, but I suspect you think more in terms of tables.
00:03:45.620
For convenience, let’s illustrate this with a table structure.
00:03:52.720
Here, you can see that contact relationships function as a significant join table.
00:03:58.360
Contacts were shared between Realtors, requiring us to utilize source information for every attribute, such as phone numbers and emails, to determine what each customer could see.
00:04:05.720
We would have tracked that source information anyway, but we never intended to manage data privacy concerns for our customers with it.
00:04:13.560
However, it had to be included in every query because contacts were shared between Realtors.
00:04:21.030
Here's the structure we actually need and have finally achieved today.
00:04:26.990
Note that shared contacts have now been duplicated, and the join table is gone.
00:04:33.000
Initially, we didn't know for sure that we needed to end up here.
00:04:39.210
Our big problem, so significant it overshadowed everything else, was having our data spread between two databases.
00:04:48.240
We started by trying to extract the data from Neo4j. What drove the change?
00:04:56.450
Neo4j and Cypher were not as familiar to our developers as Postgres and SQL.
00:05:04.150
The Active Model gem for Neo4j, at least at that time, was less mature and feature-rich than Active Record.
00:05:11.260
Neo4j's drivers were less mature, not as well optimized, and the database itself was a newer product.
00:05:18.830
Eventually, we developed features requiring cross-database joins, which proved to be slow and memory-intensive.
00:05:25.290
Rob, an engineer on my team, and I sat down to devise a plan.
00:05:32.570
We decided to migrate the data across Realtor by Realtor.
00:05:39.970
We began developing an importer job to import a Realtor's Neo4j data into Postgres.
00:05:49.170
The importer needed to avoid duplicating the shared data that had already been imported for another user.
00:05:55.720
We would use a feature flag to indicate whether a user had migrated their data.
00:06:03.590
We began with schema definition, knowing our data in Neo4j was messy. Neo4j's referential integrity features are weaker than those of Postgres.
00:06:11.230
We knew Neo4j's features far less well than we knew Postgres's, so we hadn't used them very effectively.
00:06:19.060
We took data integrity in our schema very seriously, using foreign keys, cascade deletes, restrict declarations, check constraints, and exclusion constraints.
00:06:26.440
We sought to encode as many of our constraints on the data as possible into the database schema itself.
00:06:34.060
This proved enormously helpful. The constraints caught a significant portion of our messy data during importer testing.
00:06:41.920
They forced us to confront and clean up the data issues upfront.
00:06:49.920
The feature flag needed to be readily available throughout our entire codebase, so we implemented middleware to set a thread-local variable early in the request cycle.
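A minimal Rack-style sketch of such middleware (the class name and flag lookup are my assumptions, not the production code):

```ruby
# Hypothetical sketch: set a thread-local feature flag early in the request
# cycle. How the current Realtor's migration state is found is injected as a
# callable here; the real app would read it from the authenticated user.
class V2ModeMiddleware
  def initialize(app, flag_lookup)
    @app = app
    @flag_lookup = flag_lookup
  end

  def call(env)
    # Make the flag visible to all model code running on this thread.
    Thread.current[:v2_mode] = @flag_lookup.call(env)
    @app.call(env)
  ensure
    # Clear it so state never leaks into the next request on this thread.
    Thread.current[:v2_mode] = nil
  end
end
```

Downstream code then only ever needs to check `Thread.current[:v2_mode]`, without knowing where the flag came from.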
00:06:58.230
However, we encountered a problem: many queries began by invoking class methods on a model class.
00:07:05.600
We needed that model class to be a Neo4j model if the current Realtor's feature flag was set, indicating their data had been migrated.
00:07:13.490
Conversely, it needed to be an Active Record model if it hadn’t yet been migrated.
00:07:20.420
How do you address this? Well, Ruby's dynamism came to the rescue.
00:07:28.290
We were able to build models that could switch back and forth based on the feature flag.
00:07:35.050
For example, the Contact class was a small class that extended a switching model module, allowing it to switch between these two models.
00:07:42.560
Contact v1 represented the Neo4j model, while Contact v2 represented the Active Record model.
00:07:51.220
This switching model stored those two models in class instance variables for future use.
00:07:57.960
There was a v2 mode query method to check the thread-local variable to see which mode we should be in.
00:08:06.320
In v2, we would query Postgres, while in v1, we would query Neo4j.
00:08:15.050
An environment variable could also override that feature flag for testing.
00:08:22.740
There was a switch method that simply returned either the v2 method or the v1 method depending on the current user's mode.
00:08:29.470
Everything else was handled by delegating with method missing.
00:08:38.660
Most of the code didn’t need to be aware of both model types; controllers could just use the models to get the data they needed.
00:08:47.920
As we began this work, numerous queries integrated Cypher fragments. We found it helpful to convert those into scopes.
00:08:56.920
This enabled us to mirror those with Active Record scopes, so our controllers wouldn't need to behave differently based on the data source.
00:09:02.730
They could simply call constructs like 'contacts.followed.non_realtors', and the scopes would hide the database dependencies.
00:09:09.400
Consequently, we built a rich vocabulary of scopes that has served us well since.
00:09:17.150
Queries in our system are generally very readable and straightforward.
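To see why this vocabulary pays off, here is a toy, database-free illustration of chainable scopes (class and scope names are assumptions; in the real app the same names were defined twice, once wrapping a Cypher fragment and once as an Active Record scope wrapping SQL):

```ruby
# Toy illustration: callers chain the same scope vocabulary regardless of
# what backs the data, so controllers never branch on the data source.
class ContactScope
  def initialize(rows)
    @rows = rows
  end

  def followed
    ContactScope.new(@rows.select { |r| r[:followed] })
  end

  def non_realtors
    ContactScope.new(@rows.reject { |r| r[:realtor] })
  end

  def names
    @rows.map { |r| r[:name] }
  end
end
```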
00:09:25.290
During this period, we constructed two different rake tasks utilizing the environment variable override of the feature flag.
00:09:34.520
We maintained separate sets of factories and had CI run both sets of specs.
00:09:40.810
For testing, we conducted extensive manual tests by comparing the old and new versions of user data.
00:09:48.730
Nearing completion, we organized a whole company QA swarm on our staging server.
00:09:55.240
Our CTO, at the time, Jess Martin, advised me to find a way to communicate our progress to the whole company.
00:10:01.690
He emphasized the importance of demonstrating that the backend work was nearing completion.
00:10:07.790
I implemented a custom RSpec formatter to count the total v2 specs and how many passed.
00:10:13.660
This output was saved as a CSV record, which was then plugged into a spreadsheet displayed in our project management system.
00:10:19.130
It visually showed that we were gradually converging toward completing the work.
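A formatter along these lines might look like the following (a sketch under assumed names; a real RSpec formatter would also register itself with `RSpec::Core::Formatters.register` and receive RSpec's own notification objects):

```ruby
# Sketch of a progress-counting formatter. The counting logic is the same
# whether or not it is wired into RSpec's formatter API.
class ProgressCountFormatter
  attr_reader :passed, :total

  def initialize(output = $stdout)
    @output = output
    @passed = 0
    @total  = 0
  end

  def example_passed(_notification)
    @passed += 1
    @total  += 1
  end

  def example_failed(_notification)
    @total += 1
  end

  def example_pending(_notification)
    @total += 1
  end

  def close(_notification = nil)
    # Emit one CSV row per run -- date, passing, total -- ready to be
    # pasted into a spreadsheet that charts convergence over time.
    @output.puts "#{Time.now.strftime('%Y-%m-%d')},#{@passed},#{@total}"
  end
end
```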
00:10:24.090
When it came time for execution, we began with employees who were not involved in sales and demos, followed by the rest of the employees.
00:10:33.120
Next, we tested with some friendly customers who wouldn't get upset if there were issues, followed by the rest of the active customers.
00:10:39.010
The entire process took about three weeks to complete.
00:10:47.420
After the initial migration of employees and selected customers, I found myself with nothing to do for the first time in about two months.
00:10:53.310
So, I thought I might as well start working on a pull request to remove all the v1 and transitional code.
00:10:59.170
Ten hours later, I submitted one of my favorite PRs of my entire career, which was a very satisfying moment.
00:11:06.720
At that point, here was our data model: contacts were still shared and the join table was still present, but it was all in Postgres.
00:11:14.660
Now it was time for stage two.
00:11:22.350
I mentioned that the importer needed to avoid duplicating shared data that had already been imported into Postgres for other Realtors.
00:11:29.510
Neo4j attaches UUIDs to all of its nodes, and I thought it might be a good idea to use that for solving duplication.
00:11:36.780
However, I soon discovered this was a significant mistake and had to rectify it.
00:11:43.630
Don’t get me wrong, Postgres UUID primary keys work fine, but they're a little harder to remember and compare in output.
00:11:50.680
The issues began when we needed to track source info for a different table with an integer primary key.
00:11:57.380
We tracked sources using a polymorphic join table.
00:12:03.620
All the source tables pointed to had UUID primary keys.
00:12:10.650
We realized we needed to track source info for this other table.
00:12:17.130
Although we could have tried a number of ways to hack around it, we chose to fix it.
00:12:25.790
Once I recognized we had to do this, I set aside a couple of evenings to conduct a spike and developed a strategy.
00:12:33.940
First, we added an integer ID column using contact names as an example, giving it a serial option in the migration.
00:12:40.960
This allowed it to auto-populate from a sequence; we then pre-populated it for the existing rows.
00:12:46.620
Next, we dropped the primary key constraint on the ID, renamed that to UID, and finally, renamed integer ID to just ID and re-established the primary key constraint.
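In raw Postgres terms, the simple case looked roughly like this (a reconstruction under assumed table and column names, not the actual migration):

```sql
BEGIN;

-- 1. Add an integer column backed by a sequence; Postgres backfills the
--    existing rows because SERIAL implies a non-null default.
ALTER TABLE contact_names ADD COLUMN int_id SERIAL;

-- 2. Swap the keys: drop the UUID primary key, rename id to uid,
--    promote the integer column to id, and restore the constraint.
ALTER TABLE contact_names DROP CONSTRAINT contact_names_pkey;
ALTER TABLE contact_names RENAME COLUMN id TO uid;
ALTER TABLE contact_names RENAME COLUMN int_id TO id;
ALTER TABLE contact_names ADD PRIMARY KEY (id);

COMMIT;
```

Because all of this is transactional DDL, a failure anywhere leaves the schema untouched.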
00:12:53.220
Fortunately, this was the simple case.
00:13:00.550
However, when foreign key references to that table are involved, things naturally become more complex.
00:13:07.270
For properties and property notes, we first added an integer ID column to properties and populated it.
00:13:14.100
Next, we added an integer property ID column to property notes and populated that by joining it to properties.
00:13:21.520
Now it was time to drop both the primary key constraint and the foreign key constraint and rename ID to UID.
00:13:29.420
At this point, we were about halfway there.
00:13:37.370
Then, we dropped the old UUID property ID column, renamed properties' integer ID to ID, and finally renamed the integer property ID to property ID.
00:13:44.580
After that, we re-established the primary key and foreign key constraints.
00:13:52.640
The more tables that have foreign key references, the more complicated this approach becomes.
00:14:01.090
This led me to develop primary key helpers and migration helpers to handle specific parts of this work.
00:14:08.730
One problem we encountered was polymorphic tables.
00:14:15.580
Polymorphic tables complicate everything, and this case was no exception.
00:14:22.080
We had to convert all of them at once in one batch, leading to five separate clusters of tables that needed conversion.
00:14:29.260
Consequently, I wrote migration helpers to manage all those details and made them reversible.
00:14:37.210
I tested these helpers to ensure that they could accomplish a round trip with both schema and data.
00:14:45.340
Just as a side note, in a relational database, you can discover anything regarding the schema from a set of internal tables and views.
00:14:51.440
Here's an example of finding all foreign keys pointing to the contacts table.
00:14:58.400
You can conduct similar inquiries for indexes and other constraints.
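A query along these lines (my reconstruction using the pg_catalog tables, not the talk's slide) lists every foreign key referencing contacts:

```sql
-- Find all foreign key constraints that point at the contacts table.
SELECT conname            AS constraint_name,
       conrelid::regclass AS referencing_table
FROM   pg_constraint
WHERE  contype = 'f'                      -- foreign keys only...
  AND  confrelid = 'contacts'::regclass;  -- ...that reference contacts
```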
00:15:05.620
The output of my spike was three complex migration helpers: for primary keys, foreign keys, and polymorphic references.
00:15:12.480
We ended up waiting five months before the pain of maintaining this outweighed the risk of proceeding with the change.
00:15:20.470
The simplest migrations were very easy, but the harder cases were considerably more involved.
00:15:27.480
The most challenging case was making changes to all those tables in production at once overnight.
00:15:36.640
As the time approached to execute this, we carefully reviewed all migrations and helpers.
00:15:44.220
We ran the migrations many, many times on production database clones, fixing errors and repeating.
00:15:51.520
I was especially grateful for Postgres's transactional DDL.
00:15:58.480
Had any migration failed during the process, the schema could have been left unchanged.
00:16:05.380
This safety net was crucial. Fixing an error usually involved reflecting on new dependencies or constraints in Postgres that I hadn't noticed before.
00:16:12.780
At times, it simply meant creating a workaround.
00:16:19.760
We approached this meticulously, running the migrations in staging to gauge timing.
00:16:27.260
As a company, we had the luxury of scheduling maintenance windows over the weekends, but we wanted to estimate how long those would need to be.
00:16:35.050
Once again, we made everything reversible, intending never to rely on that ability.
00:16:41.990
When dealing with production data, we built random spot checks into the migration helpers.
00:16:48.750
At the start of the migration helper, we would grab a random sample of records from the table we were about to change.
00:16:56.150
After conducting the migration, we would verify that they all pointed to the same things as before.
00:17:02.800
This ensured that we had some checks in place during production.
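The spot-check idea can be shown abstracted away from the database (the real checks ran as SQL inside the migration helpers; everything here is an illustrative stand-in):

```ruby
# Illustrative spot check: capture a random sample of (id, reference) pairs
# before a change, then verify each record still points at the same thing
# after the change has run.
def sample_references(rows, sample_size, &reference)
  rows.sample(sample_size).map { |row| [row[:id], reference.call(row)] }
end

def verify_references!(rows_by_id, samples, &reference)
  samples.each do |id, expected|
    actual = reference.call(rows_by_id.fetch(id))
    raise "spot check failed for record #{id}" unless actual == expected
  end
  true
end
```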
00:17:09.790
Upon reaching the conclusion of stage two, nothing dramatically changed about the data model layout.
00:17:16.900
However, everything uniformly used integer primary keys, and we were able to start tracking source info for more elements.
00:17:24.260
It was finally time for stage three: separating the contacts.
00:17:29.390
No longer would there be data shared between Realtors.
00:17:34.930
Swan, who has been my partner in this journey, grasped the situation well.
00:17:41.760
He took the lead, and I played a supporting role, primarily focused on validation.
00:17:49.690
What drove this change?
00:17:57.320
I've already mentioned that nearly every query required filtering based on where the individual pieces of data originated.
00:18:02.960
We didn’t want to unintentionally reveal Joe's unlisted number, meant only for Realtor Abby, to other Realtors.
00:18:09.780
This extra complexity was burdensome, especially with polymorphic tables, which were also slow.
00:18:15.880
Moreover, it posed a significant business risk: there was always the chance of a missed data privacy violation.
00:18:23.130
Swan performed a spike evaluation of different methods to accomplish this.
00:18:29.790
Using Ruby proved straightforward but would have taken about a day per Realtor.
00:18:37.300
In contrast, utilizing SQL required advanced skills, but it could be done in about ten minutes per realtor.
00:18:43.900
Thus, we opted for that method and decided on a realtor-by-realtor approach, migrating data one at a time.
00:18:50.780
Swan carefully planned the process and documented it for the team.
00:18:57.560
He developed a simple example where one contact shared between three Realtors was analyzed.
00:19:05.370
First, we added an old contact ID column to contact relationships, pre-populating it with the existing contact ID.
00:19:13.900
This would enable us to reference the original prior to changing contact IDs.
00:19:21.780
We added a contact relationship ID column to contacts and populated it with null.
00:19:30.610
We then enforced a uniqueness constraint to prevent duplicates in that column.
00:19:36.600
We then tackled Alice’s contacts first.
00:19:44.860
We updated the contact relationship ID for every contact owned by Alice, claiming those records.
00:19:50.540
If the contact relationship ID was null, Alice had laid claim to that record, and we proceeded with the insert.
00:19:56.010
For each of Alice's relationships, we inserted new entries into the contacts, updating as necessary.
00:20:03.270
If a conflict arose due to a uniqueness violation, we simply left the existing record unchanged.
00:20:09.300
Finally, we updated the contact relationships to point to the newly created contacts.
00:20:17.740
This entire process was executed in a single, sophisticated query.
00:20:24.570
Though simplified, it served its purpose well and produced valid results.
00:20:31.720
We selected all of Alice’s contact relationships and corresponding contacts, then used that information to establish new contact records.
00:20:39.960
Some inserts would succeed, while others would encounter a duplication conflict.
00:20:49.390
In those cases, the insert was not performed, but we updated the timestamp to reflect some activity.
00:20:55.990
The inserts were wrapped in a common table expression, returning only the newly inserted records.
00:21:02.180
We focused on ensuring that contact IDs in contact relationships pointed to these new records.
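Pieced together from the steps above, the shape of that query was roughly as follows (a simplified reconstruction: column names are assumptions, and the real process also included the claiming UPDATE described earlier):

```sql
-- Separate one Realtor's contacts in a single statement.
WITH new_contacts AS (
  INSERT INTO contacts (contact_relationship_id, name, created_at, updated_at)
  SELECT cr.id, c.name, c.created_at, now()
  FROM   contact_relationships cr
  JOIN   contacts c ON c.id = cr.old_contact_id
  WHERE  cr.realtor_id = 123   -- one Realtor at a time
  ON CONFLICT (contact_relationship_id)
    -- Already claimed: leave the record alone apart from its timestamp.
    DO UPDATE SET updated_at = now()
  RETURNING id, contact_relationship_id
)
-- Repoint this Realtor's relationships at the new (or claimed) contacts.
UPDATE contact_relationships cr
SET    contact_id = nc.id
FROM   new_contacts nc
WHERE  nc.contact_relationship_id = cr.id;
```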
00:21:09.100
With Alice, when we tried to insert it, there was a conflict arising from that uniqueness constraint.
00:21:15.350
For Bill, we too attempted to claim a contact but failed because the contact relationship ID was not null.
00:21:21.540
Then, the insert went through without issue, and the update fixed the contact relationships to point to the new entry.
00:21:29.960
We then processed all the attributes attached to contacts, copying necessary details over.
00:21:39.100
Once we were finished with Bill, we proceeded to Carl.
00:21:48.320
As before, we ran these queries numerous times against a production clone.
00:21:55.760
We compared that Realtor's data to their production data, completing a full run-through in staging.
00:22:02.290
During the run through, I plotted table count changes as a sanity check.
00:22:09.540
One table began growing much faster than the others.
00:22:16.640
Swan investigated and found an outer join that should have been an inner join.
00:22:24.200
After making that fix and re-running the full process, the output leveled out.
00:22:31.030
At this point, we executed the entire operation successfully.
00:22:39.710
At completion, our data model no longer included shared contacts.
00:22:46.890
However, the join table still existed, though no longer for any good reason.
00:22:54.360
We now sought to eliminate that join table.
00:23:02.050
The driving force for this change was the realization that our database still represented a many-to-many association.
00:23:09.320
The new rules mandated that it now had to be a one-to-many association.
00:23:17.580
This necessitated various constraints and checks that would not have been necessary without that join table.
00:23:26.970
One of my team members, Monte Johnston, challenged me to find a way to eliminate it.
00:23:34.680
I pondered this for a while and realized that opportunities existed to pursue this work opportunistically.
00:23:43.050
Instead of disrupting our workflow fully, we could proceed with adding a direct foreign key relationship from contacts to Realtors.
00:23:51.530
We populated this foreign key to align with the existing contact relationships.
00:23:59.170
Then, it became a straightforward matter to ensure consistency using triggers.
00:24:10.000
Rails developers are often cautious around stored procedures and triggers, and for justifiable reasons.
00:24:16.600
However, these were precisely what we needed in this scenario.
00:24:24.470
Though I had never written a trigger before this, I was aware of how they functioned.
00:24:31.570
Thus, I turned to the Postgres documentation to learn how triggers could be applied in this context.
00:24:39.480
When starting this work, all code operated through the contact relationships table.
00:24:46.930
For instance, when inserting a new association between Realtors and contacts, we needed a trigger to set the realtor ID on the contact.
00:24:54.460
Ultimately, this ensured the two remained in sync.
00:25:01.330
Transitioning, our code would eventually switch to directly utilizing that association.
00:25:09.110
Such cases required a trigger to insert the relevant data back into the contact relationships.
00:25:16.470
If we could ensure that one side was working as expected while the other was switched over, it could function correctly.
00:25:24.040
However, we couldn’t guarantee this, creating a potential for infinite regress.
00:25:32.360
For instance, a trigger insert invoked another trigger, and we needed to prevent an infinite loop.
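Postgres has a built-in guard for exactly this situation: `pg_trigger_depth()`, which reports the current trigger nesting level. A sketch of one side of the pairing (all table, column, and function names are my assumptions):

```sql
-- Keep contact_relationships in sync when a contact's realtor_id is set
-- directly. The WHEN guard means this fires only for statements issued by
-- the application, never for writes made by the mirror trigger on the
-- other table -- breaking the potential infinite loop.
CREATE OR REPLACE FUNCTION sync_contact_relationship() RETURNS trigger AS $$
BEGIN
  INSERT INTO contact_relationships (realtor_id, contact_id)
  VALUES (NEW.realtor_id, NEW.id)
  ON CONFLICT DO NOTHING;   -- only adjust actual discrepancies
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER contacts_sync_cr
AFTER INSERT OR UPDATE OF realtor_id ON contacts
FOR EACH ROW
WHEN (pg_trigger_depth() = 0)   -- don't re-fire from the mirror trigger
EXECUTE FUNCTION sync_contact_relationship();
```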
00:25:40.370
This complexity was challenging to conceptualize. I needed to learn a fair amount of Postgres SQL syntax to do this efficiently.
00:25:48.180
Moreover, I had to control the conditions under which the triggers would be invoked.
00:25:55.440
The triggers had to fire only when relevant data was changing, to avoid unnecessary processing.
00:26:03.540
In certain instances, the trigger needed to execute before an update, while in others, it should follow.
00:26:09.880
Additionally, the updates and inserts needed to be carefully designed to only adjust discrepancies.
00:26:16.580
Trust me, if you’ve never triggered a stack overflow exception from within your database, you have yet to live.
00:26:24.920
We devised a plan to achieve three primary goals.
00:26:32.160
First, we wanted a means of tracking progress to keep ourselves motivated.
00:26:38.450
Second, we sought an audit mechanism for the triggers to validate their activities.
00:26:44.440
We needed assurance that they performed their intended functions upon going into production.
00:26:50.910
Our third goal was to incorporate the primary key and foreign key references along with the triggers.
00:26:58.280
I redistributed individual fields from contact relationships to contacts.
00:27:04.800
Next, I shifted references for polymorphic associations and further cleanup.
00:27:10.640
Finally, I dropped the table itself.
00:27:18.060
I programmed a simple script to count references to contact relationships within our codebase.
00:27:24.560
As you can see, this number dropped significantly over a couple of months.
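Such a script can be just a few lines. This is a hypothetical reconstruction; the pattern and directory names are assumptions about the repository layout:

```ruby
# Count remaining references to the contact_relationships table across the
# codebase. Printed per run, this number should trend toward zero as the
# code stops touching the join table.
def count_references(root, pattern = /ContactRelationship|contact_relationships/)
  Dir.glob(File.join(root, "{app,lib,spec}", "**", "*.rb")).sum do |path|
    File.read(path).scan(pattern).size
  end
end
```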
00:27:30.280
I updated the triggers to log each invocation into a specific actions table.
00:27:36.620
This included information about which trigger invoked it and the corresponding timestamps.
00:27:42.890
We also recorded whether changes were made as a result of the trigger.
00:27:49.300
Occasionally, we would audit this data to ensure consistency.
00:27:56.030
If inconsistencies surfaced, we halted operations until they were resolved.
00:28:03.460
Doing the bulk of this work took eight to ten hours per week as we processed step by step.
00:28:11.210
Most deployments occurred over weekends to minimize risks.
00:28:17.330
Finally, here is our completed data model, which feels good.
00:28:23.860
It provides a structure that suits our business and has us poised for the next stage of development.
00:28:30.870
So, what can we learn from all of this?
00:28:38.870
I recommend an incremental, slow, and steady strategic approach.
00:28:44.090
This strategy helped us contain risk while allowing future development to proceed during most of the changes.
00:28:51.140
Despite the slow progress at times, we ultimately achieved enormous improvements.
00:29:00.700
We continuously looked ahead, contemplating how to enhance our system architecture.
00:29:08.700
We maintain a pain inventory of the challenges and complexities we wish to address.
00:29:15.490
Each stage presented entirely different solutions.
00:29:21.840
Stage one involved Ruby magic, while stage two tackled migrations and database reflections.
00:29:29.640
Stage three introduced sophisticated Postgres insertions, while stage four focused on triggers.
00:29:39.870
There is no universal recipe; we have to understand what works and build confidence in our choices before implementing them.
00:29:46.840
All these stages leveraged the database effectively.
00:29:54.390
Rails developers love Active Record and Arel for queries, which is acceptable, yet SQL holds extraordinary power.
00:30:01.330
The referential integrity protections offered by robust databases can save you significant headaches.
00:30:09.100
Without Postgres's transactional DDL, our risk and effort would have increased exponentially.
00:30:16.120
Even stored procedures and triggers have their rightful place in certain contexts.
00:30:22.840
As I mentioned, we had the advantage of scheduling maintenance windows on weekends.
00:30:29.860
If your business model does not allow this luxury, exploring alternative techniques and consulting with experienced database professionals is advisable.
00:30:37.140
Such tasks often benefit significantly from focused attention.
00:30:44.050
It helps to have someone on your team who is particularly detail-oriented to oversee these efforts.
00:30:51.490
Maintain perspective to avoid overcomplication and rethink strategies when necessary.
00:30:58.090
Would we approach this differently if given the chance to start over?
00:31:05.990
Had we fully grasped our end goal, we could have completed all four stages in the first phase.
00:31:11.650
If we hadn't shared contacts between Realtors when we copied them over from Neo4j, we could have avoided the join table.
00:31:18.620
Reflecting on our original goals, we became aware that our reliance on Neo4j was unwarranted.
00:31:25.470
Given our knowledge at that time, we likely would have repeated the same actions.
00:31:31.890
It's challenging to predict how your business will evolve over time.
00:31:37.440
Nonetheless, in hindsight, we realized we should not have used UUID primary keys for those tables.
00:31:43.900
There was no need to make them the primary key when it could have been an additional column.
00:31:50.590
UUID primary keys are mainly useful when you require distributed key generation.
00:31:59.780
They are also worth considering if contention on the ID sequence becomes a bottleneck, or if you have tight latency constraints on key generation.
00:32:05.810
If neither situation applies, then it’s best to stick with integer primary keys.
00:32:11.480
These changes can be costly and risky.
00:32:18.110
I do not recommend racing back to your teams to make a complete overhaul.
00:32:25.400
However, if a persistent pain point exists, you can safely implement these changes.
00:32:33.560
You can do so by applying the three principles we discussed and by creatively employing different technical solutions at each stage.
00:32:39.660
Thank you very much, and here's that jobs link again.