00:00:20.900
Thank you, Sebastian. Good afternoon! I'm going to present a case study of a sustained multistage effort to change and simplify the core data relationship in a production system.
00:00:26.160
Changing code has its challenges, but changing the data model of production data is even more difficult. Protecting the existing data requires caution, and exploring and testing possible solutions can be slow.
00:00:34.920
Your new data model may require data completeness or correctness that hasn't been enforced for your existing data.
00:00:41.820
However, living with a data model that mismatches your business can be painful. It introduces extra complexity, pitfalls where bugs can lurk, and poor performance.
00:00:48.120
We rarely hear about the details of such changes, so I hope this talk will be helpful for those of you facing similar challenges.
00:00:54.210
We undertook this work in four stages over two and a half years. At each stage, we addressed the biggest problem we faced, driven by business considerations, and then returned to feature development until the next significant problem became costlier to live with than to fix.
00:01:07.410
You can see that stage two began with a short spike, followed by a five-month gap before the real push started. Initially, we had no idea what lay ahead, and we might not have started if we had known the challenges involved.
00:01:19.560
Looking back, it feels amazing and as if it went quickly. We made substantial progress, and we’re glad we are where we are now.
00:01:23.429
While this talk contains quite a few technical details, it's not solely about the technicalities. Your experience will likely differ.
00:01:30.149
All four stages we went through had very different technical details. This discussion focuses on our overall strategy and how to approach these changes responsibly.
00:01:42.060
We applied these principles at every stage, and I encourage you to pay attention to how we did that throughout the presentation.
00:01:48.959
The entire effort started with a couple of significant mistakes, so I will begin by sharing those to provide the rationale for all this work.
00:01:55.830
Initially, we aimed to help real estate agents win more business by focusing their efforts on the right people at the right time.
00:02:01.890
The average realtor has a database of a few thousand contacts, and we used predictive analytics and other techniques to help agents market wisely.
00:02:07.349
Now seems like a good time to mention that we are currently hiring Rails developers. A startup's job is to try various approaches and discover what does not work as we seek what does.
00:02:12.840
When we first examined this problem, our basic data model looked like this: Realtors as customers, each having a set of contacts.
00:02:19.560
We considered all the possibilities this data presented, including knowing who was married to whom and understanding relationships between contacts.
00:02:25.530
This led to our first major mistake: we started building a social graph.
00:02:31.650
This subsequently caused our second big mistake: storing all that data in a graph database.
00:02:39.210
I don’t believe this is a mistake for everyone, but it proved to be one for us.
00:02:44.400
We realized, as we progressed, that we were not actually building a social graph. Today's focus will be on the relationships between Realtors and their contacts.
00:02:52.900
Remember that the other relationships still exist. This notion of a social graph led us to think of contact records as representing actual people.
00:02:59.640
For example, we referred to 'Abby's contact info for Nancy' rather than just identifying a simple record.
00:03:06.680
We made numerous other mistakes along the way, but let's not get bogged down in the specifics. Just remember what Jerry Weinberg said.
00:03:12.840
Keep that in mind as I describe where we were two and a half years ago.
00:03:18.990
At that time, we had what we believed was business data about users and their activity stored in Postgres, and all the contacts in our social graph were in Neo4j.
00:03:26.270
To link users to data in Neo4j, we had what I will call Realtor-prime proxy objects in Neo4j.
00:03:33.010
All those CR nodes represented contact relationships; we had reified those relationships as nodes in Neo4j.
00:03:39.620
Its query language, Cypher, encourages one to think along these lines, but I suspect you think more in terms of tables.
00:03:45.620
For convenience, let’s illustrate this with a table structure.
00:03:52.720
Here, you can see that contact relationships function as a significant join table.
00:03:58.360
Contacts were shared between Realtors, requiring us to utilize source information for every attribute, such as phone numbers and emails, to determine what each customer could see.
00:04:05.720
We would have tracked that source information anyway, but we never intended to manage data privacy concerns for our customers with it.
00:04:13.560
However, it had to be included in every query because contacts were shared between Realtors.
00:04:21.030
Here's the structure we actually need and have finally achieved today.
00:04:26.990
Note that shared contacts have now been duplicated, and the join table is gone.
00:04:33.000
Initially, we didn't know for sure that we needed to end up here.
00:04:39.210
Our big problem, so significant it overshadowed everything else, was having our data spread between two databases.
00:04:48.240
We started by trying to extract the data from Neo4j. What drove the change?
00:04:56.450
Neo4j and Cypher were not as familiar to our developers as Postgres and SQL.
00:05:04.150
The Active Model gem for Neo4j, at least at that time, was less mature and feature-rich than Active Record.
00:05:11.260
Neo4j's drivers were less mature, not as well optimized, and the database itself was a newer product.
00:05:18.830
Eventually, we developed features requiring cross-database joins, which proved to be slow and memory-intensive.
00:05:25.290
Rob, an engineer on my team, and I sat down to devise a plan.
00:05:32.570
We decided to migrate the data across Realtor by Realtor.
00:05:39.970
We began developing an importer job to import a Realtor's Neo4j data into Postgres.
00:05:49.170
The importer needed to avoid duplicating the shared data that had already been imported for another user.
00:05:55.720
We would use a feature flag to indicate whether a user had migrated their data.
00:06:03.590
We began with schema definition, knowing our data in Neo4j was messy. Neo4j's referential integrity features are weaker than those of Postgres.
00:06:11.230
We knew Neo4j's features far less well than we knew Postgres's, so we hadn't used them very effectively.
00:06:19.060
We took data integrity in our schema very seriously, using foreign keys, cascade deletes, restrict declarations, check constraints, and exclusion constraints.
00:06:26.440
We sought to encode as many of our constraints on the data as possible into the database schema itself.
00:06:34.060
This proved enormously helpful. The constraints caught a significant portion of our messy data during importer testing.
00:06:41.920
They forced us to confront and clean up the data issues upfront.
00:06:49.920
The feature flag needed to be readily available throughout our entire codebase, so we implemented middleware to set a thread-local variable early in the request cycle.
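A minimal Rack-style sketch of such middleware (the class name and flag lookup are my assumptions, not the production code):

```ruby
# Hypothetical sketch: set a thread-local feature flag early in the request
# cycle. How the current Realtor's migration state is found is injected as a
# callable here; the real app would read it from the authenticated user.
class V2ModeMiddleware
  def initialize(app, flag_lookup)
    @app = app
    @flag_lookup = flag_lookup
  end

  def call(env)
    # Make the flag visible to all model code running on this thread.
    Thread.current[:v2_mode] = @flag_lookup.call(env)
    @app.call(env)
  ensure
    # Clear it so state never leaks into the next request on this thread.
    Thread.current[:v2_mode] = nil
  end
end
```

Downstream code then only ever needs to check `Thread.current[:v2_mode]`, without knowing where the flag came from.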
00:06:58.230
However, we encountered a problem: many queries began by invoking class methods on a model class.
00:07:05.600
We needed that model class to be a Neo4j model if the current Realtor's feature flag was set, indicating their data had been migrated.
00:07:13.490
Conversely, it needed to be an Active Record model if it hadn’t yet been migrated.
00:07:20.420
How do you address this? Well, Ruby's dynamism came to the rescue.
00:07:28.290
We were able to build models that could switch back and forth based on the feature flag.
00:07:35.050
For example, the Contact class was a small class that extended a switching model module, allowing it to switch between these two models.
00:07:42.560
Contact v1 represented the Neo4j model, while Contact v2 represented the Active Record model.
00:07:51.220
This switching model stored those two models in class instance variables for future use.
00:07:57.960
There was a v2 mode query method to check the thread-local variable to see which mode we should be in.
00:08:06.320
In v2, we would query Postgres, while in v1, we would query Neo4j.
00:08:15.050
An environment variable could also override that feature flag for testing.
00:08:22.740
There was a switch method that simply returned either the v2 method or the v1 method depending on the current user's mode.
00:08:29.470
Everything else was handled by delegating with method missing.
00:08:38.660
Most of the code didn’t need to be aware of both model types; controllers could just use the models to get the data they needed.
00:08:47.920
As we began this work, numerous queries integrated Cypher fragments. We found it helpful to convert those into scopes.
00:08:56.920
This enabled us to mirror those with Active Record scopes, so our controllers wouldn't need to behave differently based on the data source.
00:09:02.730
They could simply call constructs like 'contacts.followed.non_realtors', and the scopes would hide the database dependencies.
00:09:09.400
Consequently, we built a rich vocabulary of scopes that has served us well since.
00:09:17.150
Queries in our system are generally very readable and straightforward.
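To see why this vocabulary pays off, here is a toy, database-free illustration of chainable scopes (class and scope names are assumptions; in the real app the same names were defined twice, once wrapping a Cypher fragment and once as an Active Record scope wrapping SQL):

```ruby
# Toy illustration: callers chain the same scope vocabulary regardless of
# what backs the data, so controllers never branch on the data source.
class ContactScope
  def initialize(rows)
    @rows = rows
  end

  def followed
    ContactScope.new(@rows.select { |r| r[:followed] })
  end

  def non_realtors
    ContactScope.new(@rows.reject { |r| r[:realtor] })
  end

  def names
    @rows.map { |r| r[:name] }
  end
end
```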
00:09:25.290
During this period, we constructed two different rake tasks utilizing the environment variable override of the feature flag.
00:09:34.520
We maintained separate sets of factories and had CI run both sets of specs.
00:09:40.810
For testing, we conducted extensive manual tests by comparing the old and new versions of user data.
00:09:48.730
Nearing completion, we organized a whole company QA swarm on our staging server.
00:09:55.240
Our CTO, at the time, Jess Martin, advised me to find a way to communicate our progress to the whole company.
00:10:01.690
He emphasized the importance of demonstrating that the backend work was nearing completion.
00:10:07.790
I implemented a custom RSpec formatter to count the total v2 specs and how many passed.
00:10:13.660
This output was saved as a CSV record, which was then plugged into a spreadsheet displayed in our project management system.
00:10:19.130
It visually showed that we were gradually converging toward completing the work.
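A formatter along these lines might look like the following (a sketch under assumed names; a real RSpec formatter would also register itself with `RSpec::Core::Formatters.register` and receive RSpec's own notification objects):

```ruby
# Sketch of a progress-counting formatter. The counting logic is the same
# whether or not it is wired into RSpec's formatter API.
class ProgressCountFormatter
  attr_reader :passed, :total

  def initialize(output = $stdout)
    @output = output
    @passed = 0
    @total  = 0
  end

  def example_passed(_notification)
    @passed += 1
    @total  += 1
  end

  def example_failed(_notification)
    @total += 1
  end

  def example_pending(_notification)
    @total += 1
  end

  def close(_notification = nil)
    # Emit one CSV row per run -- date, passing, total -- ready to be
    # pasted into a spreadsheet that charts convergence over time.
    @output.puts "#{Time.now.strftime('%Y-%m-%d')},#{@passed},#{@total}"
  end
end
```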
00:10:24.090
When it came time for execution, we began with employees who were not involved in sales and demos, followed by the rest of the employees.
00:10:33.120
Next, we tested with some friendly customers who wouldn't get upset if there were issues, followed by the rest of the active customers.
00:10:39.010
The entire process took about three weeks to complete.
00:10:47.420
After the initial migration of employees and selected customers, I found myself with nothing to do for the first time in about two months.
00:10:53.310
So, I thought I might as well start working on a pull request to remove all the v1 and transitional code.
00:10:59.170
Ten hours later, I submitted one of my favorite PRs of my entire career, which was a very satisfying moment.
00:11:06.720
At that point, here was our data model: contacts were still shared and the join table was still present, but it was all in Postgres.
00:11:14.660
Now it was time for stage two.
00:11:22.350
I mentioned that the importer needed to avoid duplicating shared data that had already been imported into Postgres for other Realtors.
00:11:29.510
Neo4j attaches UUIDs to all of its nodes, and I thought it might be a good idea to use that for solving duplication.
00:11:36.780
However, I soon discovered this was a significant mistake and had to rectify it.
00:11:43.630
Don’t get me wrong, Postgres UUID primary keys work fine, but they're a little harder to remember and compare in output.
00:11:50.680
The issues began when we needed to track source info for a different table with an integer primary key.
00:11:57.380
We tracked sources using a polymorphic join table.
00:12:03.620
All the source tables pointed to had UUID primary keys.
00:12:10.650
We realized we needed to track source info for this other table.
00:12:17.130
Although we could have tried a number of ways to hack around it, we chose to fix it.
00:12:25.790
Once I recognized we had to do this, I set aside a couple of evenings to conduct a spike and developed a strategy.
00:12:33.940
First, we added an integer ID column using contact names as an example, giving it a serial option in the migration.
00:12:40.960
This allowed it to auto-populate from a sequence; we then pre-populated it for the existing rows.
00:12:46.620
Next, we dropped the primary key constraint on the ID, renamed that to UID, and finally, renamed integer ID to just ID and re-established the primary key constraint.
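In raw Postgres terms, the simple case looked roughly like this (a reconstruction under assumed table and column names, not the actual migration):

```sql
BEGIN;

-- 1. Add an integer column backed by a sequence; Postgres backfills the
--    existing rows because SERIAL implies a non-null default.
ALTER TABLE contact_names ADD COLUMN int_id SERIAL;

-- 2. Swap the keys: drop the UUID primary key, rename id to uid,
--    promote the integer column to id, and restore the constraint.
ALTER TABLE contact_names DROP CONSTRAINT contact_names_pkey;
ALTER TABLE contact_names RENAME COLUMN id TO uid;
ALTER TABLE contact_names RENAME COLUMN int_id TO id;
ALTER TABLE contact_names ADD PRIMARY KEY (id);

COMMIT;
```

Because all of this is transactional DDL, a failure anywhere leaves the schema untouched.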
00:12:53.220
Fortunately, this was the simple case.
00:13:00.550
However, when foreign key references to that table are involved, things naturally become more complex.
00:13:07.270
For properties and property notes, we first added an integer ID column to properties and populated it.
00:13:14.100
Next, we added an integer property ID column to property notes and populated that by joining it to properties.
00:13:21.520
Now it was time to drop both the primary key constraint and the foreign key constraint and rename ID to UID.
00:13:29.420
At this point, we were about halfway there.
00:13:37.370
Then, we dropped the old UUID property ID column, renamed properties' integer ID to ID, and finally renamed the integer property ID to property ID.
00:13:44.580
After that, we re-established the primary key and foreign key constraints.
00:13:52.640
The more tables that have foreign key references, the more complicated this approach becomes.
00:14:01.090
This led me to develop primary key helpers and migration helpers to handle specific parts of this work.
00:14:08.730
One problem we encountered was polymorphic tables.
00:14:15.580
Polymorphic tables complicate everything, and this case was no exception.
00:14:22.080
We had to convert all of them at once in one batch, leading to five separate clusters of tables that needed conversion.
00:14:29.260
Consequently, I wrote migration helpers to manage all those details and made them reversible.
00:14:37.210
I tested these helpers to ensure that they could accomplish a round trip with both schema and data.
00:14:45.340
Just as a side note, in a relational database, you can discover anything regarding the schema from a set of internal tables and views.
00:14:51.440
Here's an example of finding all foreign keys pointing to the contacts table.
00:14:58.400
You can conduct similar inquiries for indexes and other constraints.
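A query along these lines (my reconstruction using the pg_catalog tables, not the talk's slide) lists every foreign key referencing contacts:

```sql
-- Find all foreign key constraints that point at the contacts table.
SELECT conname            AS constraint_name,
       conrelid::regclass AS referencing_table
FROM   pg_constraint
WHERE  contype = 'f'                      -- foreign keys only...
  AND  confrelid = 'contacts'::regclass;  -- ...that reference contacts
```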
00:15:05.620
The output of my spike was three complex migration helpers: for primary keys, foreign keys, and polymorphic references.
00:15:12.480
We ended up waiting five months before the pain of maintaining this outweighed the risk of proceeding with the change.
00:15:20.470
The simplest migrations were very easy, but the harder cases were considerably more involved.
00:15:27.480
The most challenging case was making changes to all those tables in production at once overnight.
00:15:36.640
As the time approached to execute this, we carefully reviewed all migrations and helpers.
00:15:44.220
We ran the migrations many, many times on production database clones, fixing errors and repeating.
00:15:51.520
I was especially grateful for Postgres's transactional DDL.
00:15:58.480
Had any migration failed during the process, the schema could have been left unchanged.
00:16:05.380
This safety net was crucial. Fixing an error usually involved reflecting on new dependencies or constraints in Postgres that I hadn't noticed before.
00:16:12.780
At times, it simply meant creating a workaround.
00:16:19.760
We approached this meticulously, running the migrations in staging to gauge timing.
00:16:27.260
As a company, we had the luxury of scheduling maintenance windows over the weekends, but we wanted to estimate how long those would need to be.
00:16:35.050
Once again, we made everything reversible, intending never to rely on that ability.
00:16:41.990
When dealing with production data, we built random spot checks into the migration helpers.
00:16:48.750
At the start of the migration helper, we would grab a random sample of records from the table we were about to change.
00:16:56.150
After conducting the migration, we would verify that they all pointed to the same things as before.
00:17:02.800
This ensured that we had some checks in place during production.
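The spot-check idea can be shown abstracted away from the database (the real checks ran as SQL inside the migration helpers; everything here is an illustrative stand-in):

```ruby
# Illustrative spot check: capture a random sample of (id, reference) pairs
# before a change, then verify each record still points at the same thing
# after the change has run.
def sample_references(rows, sample_size, &reference)
  rows.sample(sample_size).map { |row| [row[:id], reference.call(row)] }
end

def verify_references!(rows_by_id, samples, &reference)
  samples.each do |id, expected|
    actual = reference.call(rows_by_id.fetch(id))
    raise "spot check failed for record #{id}" unless actual == expected
  end
  true
end
```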
00:17:09.790
Upon reaching the conclusion of stage two, nothing dramatically changed about the data model layout.
00:17:16.900
However, everything uniformly used integer primary keys, and we were able to start tracking source info for more elements.
00:17:24.260
It was finally time for stage three: separating the contacts.
00:17:29.390
No longer would there be data shared between Realtors.
00:17:34.930
Swan, who has been my partner in this journey, grasped the situation well.
00:17:41.760
He took the lead, and I played a supporting role, primarily focused on validation.
00:17:49.690
What drove this change?
00:17:57.320
I've already mentioned that nearly every query required filtering based on where the individual pieces of data originated.
00:18:02.960
We didn’t want to unintentionally reveal Joe's unlisted number, meant only for Realtor Abby, to other Realtors.
00:18:09.780
This extra complexity was burdensome, especially with polymorphic tables, which were also slow.
00:18:15.880
Moreover, it posed a significant business risk: there was always the chance of a missed data privacy violation.
00:18:23.130
Swan performed a spike evaluation of different methods to accomplish this.
00:18:29.790
Using Ruby proved straightforward but would have taken about a day per Realtor.
00:18:37.300
In contrast, utilizing SQL required advanced skills, but it could be done in about ten minutes per realtor.
00:18:43.900
Thus, we opted for that method and decided on a realtor-by-realtor approach, migrating data one at a time.
00:18:50.780
Swan carefully planned the process and documented it for the team.
00:18:57.560
He developed a simple example where one contact shared between three Realtors was analyzed.
00:19:05.370
First, we added an old contact ID column to contact relationships, pre-populating it with the existing contact ID.
00:19:13.900
This would enable us to reference the original prior to changing contact IDs.
00:19:21.780
We added a contact relationship ID column to contacts and populated it with null.
00:19:30.610
We then enforced a uniqueness constraint to prevent duplicates in that column.
00:19:36.600
We then tackled Alice’s contacts first.
00:19:44.860
We updated the contact relationship ID for every contact owned by Alice, claiming those records.
00:19:50.540
If the contact relationship ID was null, Alice had laid claim to that record, and we proceeded with the insert.
00:19:56.010
For each of Alice's relationships, we inserted new entries into the contacts, updating as necessary.
00:20:03.270
If a conflict arose due to a uniqueness violation, we simply left the existing record unchanged.
00:20:09.300
Finally, we updated the contact relationships to point to the newly created contacts.
00:20:17.740
This entire process was executed in a single, sophisticated query.
00:20:24.570
Though simplified, it served its purpose well and produced valid results.
00:20:31.720
We selected all of Alice’s contact relationships and corresponding contacts, then used that information to establish new contact records.
00:20:39.960
Some inserts would succeed, while others would encounter a duplication conflict.
00:20:49.390
In those cases, the insert was not performed, but we updated the timestamp to reflect some activity.
00:20:55.990
The inserts were wrapped in a common table expression, returning only the newly inserted records.
00:21:02.180
We focused on ensuring that contact IDs in contact relationships pointed to these new records.
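Pieced together from the steps above, the shape of that query was roughly as follows (a simplified reconstruction: column names are assumptions, and the real process also included the claiming UPDATE described earlier):

```sql
-- Separate one Realtor's contacts in a single statement.
WITH new_contacts AS (
  INSERT INTO contacts (contact_relationship_id, name, created_at, updated_at)
  SELECT cr.id, c.name, c.created_at, now()
  FROM   contact_relationships cr
  JOIN   contacts c ON c.id = cr.old_contact_id
  WHERE  cr.realtor_id = 123   -- one Realtor at a time
  ON CONFLICT (contact_relationship_id)
    -- Already claimed: leave the record alone apart from its timestamp.
    DO UPDATE SET updated_at = now()
  RETURNING id, contact_relationship_id
)
-- Repoint this Realtor's relationships at the new (or claimed) contacts.
UPDATE contact_relationships cr
SET    contact_id = nc.id
FROM   new_contacts nc
WHERE  nc.contact_relationship_id = cr.id;
```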
00:21:09.100
With Alice, when we tried to insert it, there was a conflict arising from that uniqueness constraint.
00:21:15.350
For Bill, we too attempted to claim a contact but failed because the contact relationship ID was not null.
00:21:21.540
Then, the insert went through without issue, and the update fixed the contact relationships to point to the new entry.
00:21:29.960
We then processed all the attributes attached to contacts, copying necessary details over.
00:21:39.100
Once we were finished with Bill, we proceeded to Carl.
00:21:48.320
As before, we ran these queries numerous times against a production clone.
00:21:55.760
We compared that Realtor's data to their production data, completing a full run-through in staging.
00:22:02.290
During the run through, I plotted table count changes as a sanity check.
00:22:09.540
One table began growing much faster than the others.
00:22:16.640
Swan investigated and found an outer join that should have been an inner join.
00:22:24.200
After making that fix and re-running the full process, the output leveled out.
00:22:31.030
At this point, we executed the entire operation successfully.
00:22:39.710
At completion, our data model no longer included shared contacts.
00:22:46.890
However, the join table still existed, though no longer for any good reason.
00:22:54.360
We now sought to eliminate that join table.
00:23:02.050
The driving force for this change was the realization that our database still represented a many-to-many association.
00:23:09.320
The new rules mandated that it now had to be a one-to-many association.
00:23:17.580
This necessitated various constraints and checks that would not have been necessary without that join table.
00:23:26.970
One of my team members, Monte Johnston, challenged me to find a way to eliminate it.
00:23:34.680
I pondered this for a while and realized that opportunities existed to pursue this work opportunistically.
00:23:43.050
Instead of disrupting our workflow fully, we could proceed with adding a direct foreign key relationship from contacts to Realtors.
00:23:51.530
We populated this foreign key to align with the existing contact relationships.
00:23:59.170
Then, it became a straightforward matter to ensure consistency using triggers.
00:24:10.000
Rails developers are often cautious around stored procedures and triggers, and for justifiable reasons.
00:24:16.600
However, these were precisely what we needed in this scenario.
00:24:24.470
Though I had never written a trigger before this, I was aware of how they functioned.
00:24:31.570
Thus, I turned to the Postgres documentation to learn how triggers could be applied in this context.
00:24:39.480
When starting this work, all code operated through the contact relationships table.
00:24:46.930
For instance, when inserting a new association between Realtors and contacts, we needed a trigger to set the realtor ID on the contact.
00:24:54.460
Ultimately, this ensured the two remained in sync.
00:25:01.330
Transitioning, our code would eventually switch to directly utilizing that association.
00:25:09.110
Such cases required a trigger to insert the relevant data back into the contact relationships.
00:25:16.470
If we could ensure that one side was working as expected while the other was switched over, it could function correctly.
00:25:24.040
However, we couldn’t guarantee this, creating a potential for infinite regress.
00:25:32.360
For instance, a trigger insert invoked another trigger, and we needed to prevent an infinite loop.
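Postgres has a built-in guard for exactly this situation: `pg_trigger_depth()`, which reports the current trigger nesting level. A sketch of one side of the pairing (all table, column, and function names are my assumptions):

```sql
-- Keep contact_relationships in sync when a contact's realtor_id is set
-- directly. The WHEN guard means this fires only for statements issued by
-- the application, never for writes made by the mirror trigger on the
-- other table -- breaking the potential infinite loop.
CREATE OR REPLACE FUNCTION sync_contact_relationship() RETURNS trigger AS $$
BEGIN
  INSERT INTO contact_relationships (realtor_id, contact_id)
  VALUES (NEW.realtor_id, NEW.id)
  ON CONFLICT DO NOTHING;   -- only adjust actual discrepancies
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER contacts_sync_cr
AFTER INSERT OR UPDATE OF realtor_id ON contacts
FOR EACH ROW
WHEN (pg_trigger_depth() = 0)   -- don't re-fire from the mirror trigger
EXECUTE FUNCTION sync_contact_relationship();
```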
00:25:40.370
This complexity was challenging to conceptualize. I needed to learn a fair amount of Postgres SQL syntax to do this efficiently.
00:25:48.180
Moreover, I had to control the conditions under which the triggers would be invoked.
00:25:55.440
The triggers had to fire only when relevant data was changing, to avoid unnecessary processing.
00:26:03.540
In certain instances, the trigger needed to execute before an update, while in others, it should follow.
00:26:09.880
Additionally, the updates and inserts needed to be carefully designed to only adjust discrepancies.
00:26:16.580
Trust me, if you’ve never triggered a stack overflow exception from within your database, you have yet to live.
00:26:24.920
We devised a plan to achieve three primary goals.
00:26:32.160
First, we wanted a means of tracking progress to keep ourselves motivated.
00:26:38.450
Second, we sought an audit mechanism for the triggers to validate their activities.
00:26:44.440
We needed assurance that they performed their intended functions upon going into production.
00:26:50.910
Our third goal was to incorporate the primary key and foreign key references along with the triggers.
00:26:58.280
I redistributed individual fields from contact relationships to contacts.
00:27:04.800
Next, I shifted references for polymorphic associations and further cleanup.
00:27:10.640
Finally, I dropped the table itself.
00:27:18.060
I programmed a simple script to count references to contact relationships within our codebase.
00:27:24.560
As you can see, this number dropped significantly over a couple of months.
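Such a script can be just a few lines. This is a hypothetical reconstruction; the pattern and directory names are assumptions about the repository layout:

```ruby
# Count remaining references to the contact_relationships table across the
# codebase. Printed per run, this number should trend toward zero as the
# code stops touching the join table.
def count_references(root, pattern = /ContactRelationship|contact_relationships/)
  Dir.glob(File.join(root, "{app,lib,spec}", "**", "*.rb")).sum do |path|
    File.read(path).scan(pattern).size
  end
end
```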
00:27:30.280
I updated the triggers to log each invocation into a specific actions table.
00:27:36.620
This included information about which trigger invoked it and the corresponding timestamps.
00:27:42.890
We also recorded whether changes were made as a result of the trigger.
00:27:49.300
Occasionally, we would audit this data to ensure consistency.
00:27:56.030
If inconsistencies surfaced, we halted operations until they were resolved.
00:28:03.460
Doing the bulk of this work took eight to ten hours per week as we processed step by step.
00:28:11.210
Most deployments occurred over weekends to minimize risks.
00:28:17.330
Finally, here is our completed data model, which feels good.
00:28:23.860
It provides a structure that suits our business and has us poised for the next stage of development.
00:28:30.870
So, what can we learn from all of this?
00:28:38.870
I recommend an incremental, slow, and steady strategic approach.
00:28:44.090
This strategy helped us contain risk while allowing future development to proceed during most of the changes.
00:28:51.140
Despite the slow progress at times, we ultimately achieved enormous improvements.
00:29:00.700
We continuously looked ahead, contemplating how to enhance our system architecture.
00:29:08.700
We maintain a pain inventory of the challenges and complexities we wish to address.
00:29:15.490
Each stage presented entirely different solutions.
00:29:21.840
Stage one involved Ruby magic, while stage two tackled migrations and database reflections.
00:29:29.640
Stage three introduced sophisticated Postgres insertions, while stage four focused on triggers.
00:29:39.870
There is no universal recipe; we have to understand what works and build confidence in our choices before implementing them.
00:29:46.840
All these stages leveraged the database effectively.
00:29:54.390
Rails developers love Active Record and Arel for queries, which is acceptable, yet SQL holds extraordinary power.
00:30:01.330
The referential integrity protections offered by robust databases can save you significant headaches.
00:30:09.100
Without Postgres's transactional DDL, our risk and effort would have increased exponentially.
00:30:16.120
Even stored procedures and triggers have their rightful place in certain contexts.
00:30:22.840
As I mentioned, we had the advantage of scheduling maintenance windows on weekends.
00:30:29.860
If your business model does not allow this luxury, exploring alternative techniques and consulting with experienced database professionals is advisable.
00:30:37.140
Such tasks often benefit significantly from focused attention.
00:30:44.050
It helps to have someone on your team who is particularly detail-oriented to oversee these efforts.
00:30:51.490
Maintain perspective to avoid overcomplication and rethink strategies when necessary.
00:30:58.090
Would we approach this differently if given the chance to start over?
00:31:05.990
Had we fully grasped our end goal, we could have completed all four stages in the first phase.
00:31:11.650
If we hadn't shared contacts between Realtors when we copied them over from Neo4j, we could have avoided the join table.
00:31:18.620
Reflecting on our original goals, we became aware that our reliance on Neo4j was unwarranted.
00:31:25.470
Given our knowledge at that time, we likely would have repeated the same actions.
00:31:31.890
It's challenging to predict how your business will evolve over time.
00:31:37.440
Nonetheless, in hindsight, we realized we should not have used UUID primary keys for those tables.
00:31:43.900
There was no need to make them the primary key when it could have been an additional column.
00:31:50.590
UUID primary keys are mainly useful when you require distributed key generation.
00:31:59.780
They are also worth considering if contention on the ID sequence becomes a bottleneck, or if you have tight latency constraints on key generation.
00:32:05.810
If neither situation applies, then it’s best to stick with integer primary keys.
00:32:11.480
These changes can be costly and risky.
00:32:18.110
I do not recommend racing back to your teams to make a complete overhaul.
00:32:25.400
However, if a persistent pain point exists, you can safely implement these changes.
00:32:33.560
You can do so by applying the three principles we discussed and by creatively employing different technical solutions at each stage.
00:32:39.660
Thank you very much, and here's that jobs link again.