00:00:11.929
Hi folks, I'm Betsy Haibel, and welcome to Data Integrity in Living Systems. Now,
00:00:19.260
ordinarily, I'd like to launch right into the talk content here. We've got a lot to cover and that's what you all are here to see.
00:00:26.070
However, Marcos' keynote just now really hit home for me, and I wanted to follow up on that.
00:00:31.590
As a white woman, I recognize that my experiences are quite different from those of a Black man.
00:00:36.690
I don’t want to flatten those differences. Still, a lot of the ways he framed survival resonate with my own experience.
00:00:43.140
I often tell the story of my non-traditional journey into tech as a simple, happy-path narrative.
00:00:49.050
But let's get real for a second. A decade ago, as I was getting into tech, I was just a woman with no college degree, having washed out of an arts career due to a sudden onset chronic pain disorder.
00:00:56.879
Boot camps were not even an option yet, so I was learning everything on my own, which was terrifying.
00:01:04.260
I picked up some bad survival lessons during that time.
00:01:10.590
This context is relevant because throughout this talk, I’ll share moments when I was kind of a self-righteous jerk.
00:01:16.710
I want you all to remember that I was only able to grow out of that mindset after I wasn't the only woman in the room anymore.
00:01:22.640
Once I had company, I could move from surviving through arrogance toward nuance and kindness.
00:01:28.409
And that transition was possible only because I no longer had to carry the torch of being the sole representative of my gender.
00:01:33.570
Anyway, back to data integrity. This talk needs another name.
00:01:40.320
It used to be called 'Data Corruption: Stop the Evil Tribbles.' I didn't change it because the name was a bit hokey, though it was.
00:01:45.540
Changing it just because it was hokey would have implied caring about my personal dignity, which feels uncomfortable.
00:01:52.590
The tribbles are a humorous Star Trek example, and I'll keep coming back to them throughout the talk.
00:01:59.120
I changed the name because that title imposed a frame of bad data as an evil, invading force.
00:02:05.780
This framing is unproductive because it suggests that we're starting from pristine databases.
00:02:11.570
In reality, most of us are working with code bases that look very different from that.
00:02:17.330
Additionally, framing the issue adversarially encourages developers to fight against bad data.
00:02:26.870
This mindset can easily lead to seeing our users as the enemy, or, even worse, our teammates as the enemy.
00:02:35.330
It's easy to become self-righteous about data integrity issues, but that leads us away from solving the actual problems.
00:02:41.300
In my experience, the root causes of data integrity issues often stem from product changes or miscommunications between teams.
00:02:47.530
That's why I want to emphasize that data integrity is not just your responsibility!
00:02:53.840
If we're going to adopt a blameless approach, we need to look at root causes, and there's often more than one.
00:03:02.590
For instance, people often skip well-known data integrity patterns because the code base discourages using them properly.
00:03:08.000
We may run into the occasional, strange bug coming from the database due to unusual circumstances, but most issues stem from team dynamics.
00:03:16.130
We need to design our systems to be resilient against common problems like product changes or engineers making mistakes.
00:03:22.700
Focusing on resilience helps us catch and correct a wide array of data integrity bugs.
00:03:29.720
Let's consider a story. It's loosely based on my experience a few years ago at an e-commerce company.
00:03:36.500
The details aren’t exact, but they illustrate the problem effectively.
00:03:41.600
I was part of a team working on a module that processed returns and shipped products back to vendors.
00:03:46.669
Compared to the larger monolith it lived in, this module was refreshingly straightforward.
00:03:54.080
Our return-to-vendor model was simple; it cared about three things: what the product was worth, which vendor it was going back to, and its shipping status.
00:04:00.140
But of course, the world has its way of giving us partial data.
00:04:05.790
In our case, sometimes incoming data didn't include vendor information, making it very difficult to return products.
00:04:12.619
The return-to-vendor module placed a new requirement on data that had never needed vendor information before.
00:04:19.290
The upstream system's data had met our needs perfectly before; it was the product change that made it invalid.
00:04:24.590
Common data integrity issues often arise from product changes, but they can be addressed at the product level.
00:04:31.130
Instead of complicated measures like machine learning to find missing vendors, we decided to implement a simple UI change.
00:04:38.200
We updated the return display to hide units marked for return that didn't have vendor information.
00:04:45.000
This was a lot more efficient than creating a complicated technical solution.
00:04:52.260
The main takeaway from this experience: scope your data by whether it can actually progress through the workflow.
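To make that concrete, here's a rough sketch of the kind of scope I mean; the model and column names are made up for illustration, not real code from that system.

```ruby
# Hypothetical model, loosely based on the return-to-vendor unit I described.
class ReturnToVendorUnit < ActiveRecord::Base
  # Only units that know their vendor can actually move forward in the
  # return-to-vendor workflow, so the return display queries this scope
  # instead of the whole table.
  scope :returnable, -> { where.not(vendor_id: nil) }

  # Units missing vendor data aren't corrupt; they just can't progress yet,
  # so we surface them separately for manual follow-up.
  scope :awaiting_vendor, -> { where(vendor_id: nil) }
end
```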
00:04:59.300
Additionally, we should validate models contextually rather than globally, because complex operations need validation that knows where a record is in its lifecycle.
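And here's one way to make a validation context-aware instead of global; again, these names are invented for the example.

```ruby
class ReturnToVendorUnit < ActiveRecord::Base
  # Vendor information is only required once a unit is actually being
  # shipped back to its vendor, not at the moment a warehouse worker
  # first logs the return.
  validates :vendor_id, presence: true, if: :shipping_back?

  def shipping_back?
    status == "shipping_back"
  end
end
```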
00:05:05.540
Collaboration with our users can provide critical insights, so we don't have to overthink data recovery on our own.
00:05:12.210
In living software systems, users are as crucial as the code itself.
00:05:18.190
Approaching data integrity problems collaboratively with our team and user base can lead to solutions.
00:05:25.610
Earlier, I mentioned eventually finding a solution through scoping down data and creating a user-friendly interface.
00:05:32.899
However, we encountered some miscommunication before reaching that resolution.
00:05:38.690
My team had assumed that the upstream system would provide vendor data.
00:05:44.199
When we saw units without vendor information, we mistakenly believed it was a bug.
00:05:49.610
However, that was an oversight on our part: sometimes the upstream system simply doesn't have that data.
00:05:56.160
The data for the return-to-vendor process was mostly sourced from the returns handling module.
00:06:04.120
The warehouse workers logging returns often didn't have vendor information.
00:06:11.670
Picture this: the warehouse worker receives returns off a truck, processing a messy pile, with packages sometimes improperly labeled.
00:06:17.160
Despite this, most people are not jerks and do label their returns correctly.
00:06:25.960
However, there are many scenarios where boxes may be missing information.
00:06:31.160
Warehouse workers are not responsible for figuring out which vendor each returned product goes back to.
00:06:37.690
Their job, and the performance metrics they're measured on, is to log each return quickly and move on to the next one.
00:06:43.210
If we impose burdens that require them to delve deeply into every box, we are being unreasonable.
00:06:49.699
All of this was unknown to my team at the time. We were focused solely on the immediate needs of the return-to-vendor module.
00:06:56.250
Consequently, we ran migrations that inadvertently led to further complications and numerous production issues.
00:07:03.830
We operated as if the validations we'd added would catch every potential issue.
00:07:09.800
Ultimately, we assumed internal CI failures indicated new bugs.
00:07:15.041
The underlying issue, however, was our lack of humility and awareness of upstream weaknesses.
00:07:22.230
After discussing this experience in a cross-team retrospective, we learned that solutions came from communication.
00:07:29.440
If we had approached our upstream colleagues sooner, I believe we would have avoided this entire error.
00:07:35.390
It’s often said that all we need to do is talk to one another.
00:07:41.500
As the Agile Manifesto suggests, individuals and interactions are paramount.
00:07:46.660
But let’s not ignore the intricacies of communication.
00:07:54.630
Too often, the phrase 'just talk' is a way of dismissing both the complexity of processes and the effort they require.
00:08:01.300
As Camille Fournier has noted, coordinating people takes real effort, and we shouldn't pretend otherwise.
00:08:11.000
Team dynamics are the background noise that all of this communication has to cut through.
00:08:18.000
So when we prioritize individuals and interactions, we must remember that processes matter.
00:08:22.230
The goal is not to dismiss processes but to find the right balance.
00:08:27.589
Observing how the software actually behaves is more reliable than sticking strictly to documentation.
00:08:32.000
Identifying how communication unfolds is vital. For instance, when my team added validations, we could have explored the issue further.
00:08:38.150
Understanding our recent commits might have shed light on where we were going wrong.
00:08:44.390
Effective interactions also involve acknowledging the work culture that stifles communication.
00:08:50.550
Burdening teammates with loads of pressure hinders our collaborative spirit.
00:08:57.000
When the dominant attitude is that each person's own work is paramount, inter-team communication starts to look like waste.
00:09:02.600
Recognizing that cross-team interactions are always a valuable investment shifts our team's dynamics.
00:09:09.970
Further, let’s discuss the failure to leverage well-known data integrity patterns.
00:09:15.620
For example, patterns such as validations, callbacks, and transactions are often overlooked.
00:09:21.060
Neglecting these opportunities leads to moralizing and finger-pointing when issues arise.
00:09:28.150
We should focus on understanding why these errors occur within our systems.
00:09:35.200
This shifts the focus from individual blame toward systemic recovery.
00:09:41.590
My focus here is on building systems that manage recovery, rather than trying to prevent every error.
00:09:48.250
Recovery emphasizes moving forward when errors happen—a proactive stance.
00:09:53.860
Many older Rails projects accumulate considerable complexity in validations and callback mechanisms.
00:09:59.300
Consequently, serious engineering effort can be lost in trying to manage data integrity effectively.
00:10:06.090
Development teams often find themselves paralyzed by over-engineering.
00:10:12.610
Thus, the more we moralize about data integrity, the less we resolve actual problems.
00:10:18.550
Instead, we should cultivate an environment that encourages recovery and learning.
00:10:26.350
Finding balance between complex validation mechanisms and simplicity is key.
00:10:32.960
Lastly, let’s not overlook the trade-offs around where validations live: on the model, or out in a service object architecture.
00:10:40.080
Service objects can alleviate validation problems, but they increase cognitive load and can complicate the experience of everyone working in the code base.
00:10:46.969
Cognitive load goes up whenever we deviate from the Rails happy path that most developers default to.
00:10:54.420
Similarly, as we step away from familiar patterns, we might introduce risk.
00:10:59.470
So we must establish workflows that allow for discoverability and maintain consistency across the team.
00:11:06.870
Strategies may include pairing on code reviews and maintaining close communication.
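To make the service-object side of that trade-off concrete, here's a rough sketch of what I mean; the class and method names are made up, not code from a real system.

```ruby
# A workflow-specific service object: the "ship this unit back to its
# vendor" rules live here, in one discoverable place, instead of piling
# up as conditional validations on the model.
class ShipReturnToVendor
  def initialize(unit)
    @unit = unit
  end

  def call
    return failure("missing vendor") if @unit.vendor_id.nil?

    @unit.update!(status: "shipping_back")
    success
  end

  private

  def success
    { ok: true }
  end

  def failure(reason)
    { ok: false, reason: reason }
  end
end
```

The trade-off is exactly the one I just described: someone who expects a plain save to enforce everything won't find these rules unless the team's conventions point them here.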
00:11:13.199
To maintain data integrity effectively, we need to remember what Rails already gives us. Rails makes database transactions convenient to use.
00:11:21.930
Transactions help manage failures by rolling back when inconsistencies arise.
00:11:29.990
However, we still need to adhere to those practices in our code.
00:11:36.510
It's tempting to shortcut back to a series of imperative writes with no fallback if something fails partway through.
00:11:43.200
The consequence of that is data left in fragmented, half-updated states.
00:11:50.120
We need to wrap related database writes in transactions so that they succeed or fail together.
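A minimal sketch of what that wrapping looks like, with hypothetical record names:

```ruby
# Either both records change or neither does: the bang methods raise on
# failure, and ActiveRecord rolls back everything inside the block.
ActiveRecord::Base.transaction do
  return_unit.update!(status: "shipped_to_vendor")
  vendor_shipment.update!(shipped_at: Time.current)
end
```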
00:11:56.050
When dealing with third-party services, asynchronous requests have their own risks.
00:12:02.620
Transactions can't roll back an external, asynchronous call, which presents significant challenges.
00:12:08.630
Third-party services may not pass errors back cleanly, complicating debugging.
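One common way to keep a third-party call from getting tangled up in a transaction is to defer it until the transaction has actually committed, for instance by enqueueing a background job once the commit happens. Here's a rough sketch; the worker and column names are hypothetical.

```ruby
class ReturnToVendorUnit < ActiveRecord::Base
  # Don't call the carrier's API inside the transaction: the transaction
  # can roll back, but the HTTP request can't. Only reach out once the
  # database changes have committed.
  after_commit :enqueue_carrier_notification, on: :update, if: :shipped?

  def shipped?
    status == "shipped_to_vendor"
  end

  private

  def enqueue_carrier_notification
    NotifyCarrierWorker.perform_async(id) # hypothetical Sidekiq worker
  end
end
```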
00:12:15.619
Once we're coordinating with external services like this, we also have to reckon with the CAP theorem.
00:12:21.220
The CAP theorem says that a distributed system can't guarantee consistency, availability, and partition tolerance all at once.
00:12:28.230
In practice, you only get to pick two, and since network partitions are a fact of life, the real choice is between consistency and availability.
00:12:35.430
When a system prioritizes consistency, availability may take a hit and vice versa.
00:12:42.600
In conclusion: engineer applications that can adapt and recover, because that's what keeps our systems reliable.
00:12:49.780
Auditing processes at set intervals will help catch discrepancies before they spiral.
00:12:56.240
Run basic integrity checks every few hours to make sure everything still lines up.
00:13:02.740
If issues arise, escalate the problems to a human and deal with them directly.
00:13:09.410
Build a culture around addressing problems. Data integrity is about putting systems in place.
00:13:15.110
Establish regular meetings with the team to audit periodic checks and re-evaluate structures.
00:13:22.210
Tools like Sidekiq can facilitate scheduled audits that find discrepancies in your data.
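Here's roughly what one of those scheduled audit jobs might look like. The model and the query are invented for illustration; wire the scheduling and alerting into whatever your stack already has.

```ruby
class DataIntegrityAuditWorker
  include Sidekiq::Worker

  # Scheduled every few hours (cron, sidekiq-cron, clockwork, whatever).
  # It doesn't fix anything itself; it just surfaces records that can't
  # progress so a human can look at them before they pile up.
  def perform
    stuck = ReturnToVendorUnit.where(vendor_id: nil, status: "shipping_back")
    return if stuck.empty?

    Rails.logger.warn(
      "[audit] #{stuck.count} units are marked for shipping but have no vendor"
    )
    # Swap in whatever alerting you already use: email, Slack, PagerDuty...
  end
end
```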
00:13:29.990
If anomalies arise, log and assess them without immediate panic.
00:13:35.520
Address risks through continuous learning, and keep improving your frameworks as you go.
00:13:42.110
Finally, bringing support staff into the loop creates effective feedback loops, so programmers spend less time debugging.
00:13:51.800
Instead, they focus on resolving the core problems arising from data integrity.
00:13:57.280
Recognize software as a living system, embracing collaboration while evolving.
00:14:02.870
Understand that the complexity of dependencies requires a balanced perspective.
00:14:10.220
Avoid framing data integrity as developers versus users.
00:14:16.700
Acknowledge each development cycle evolves with changing business needs.
00:14:23.000
Maintain a posture of continuous improvement rather than fatalism.
00:14:29.240
Create inter-team channels to provide fertile ground for communication.
00:14:35.270
Now with time, lessons learned accumulate.
00:14:43.360
As we treat the data in our systems with care, we cultivate a more thoughtful approach overall.
00:14:49.430
The overarching theme is the necessity for collaboration and recovery, not merely prevention.
00:14:57.010
This helps reduce analysis paralysis and fosters a proactive team culture.
00:15:02.940
I am Betsy Haibel, known as betsythemuffin on Twitter. I primarily discuss programming, tech culture, and a bit of feminism.
00:15:10.390
You can find my talk content at betsyhaibel.com/talks. I work for a company called ReCertify in San Francisco.
00:15:20.130
We are currently not hiring but will soon open positions for senior developers.
00:15:25.830
If you wish to mentor juniors or mid-levels, I encourage applying.
00:15:32.500
I also co-organized 'Learn Ruby in DC', a casual office hours initiative for newer programmers.
00:15:39.090
If you're interested in doing something similar in your town, it’s not challenging. Just show up!
00:15:47.010
Now I'll take a few questions.
00:15:55.198
Q: Are there tools that can help us check our databases? A: This may sound simplistic, but Sidekiq is a great tool.
00:16:03.960
Use it to build auditing systems that send alerts if something’s wrong, scheduled for regular intervals.
00:16:13.676
Incorporate that into your workflow and adjust for complex calculations as necessary.
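If you want the schedule itself to live in code, a gem like sidekiq-cron can register the job at boot. Something along these lines, though check the gem's README for the exact API it currently exposes.

```ruby
# config/initializers/sidekiq_cron.rb -- assumes the sidekiq-cron gem
Sidekiq::Cron::Job.create(
  name:  "Data integrity audit - every 4 hours",
  cron:  "0 */4 * * *",
  class: "DataIntegrityAuditWorker"
)
```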
00:16:21.606
Q: What about using transaction scripts? A: A strong code review culture is vital.
00:16:28.000
Create a design whereby every object must inherit from a base class to ensure consistency.
00:16:36.500
This structure aids in catching oversights during code reviews.
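As a sketch of the kind of base class I mean; the names here are hypothetical, and the point is just that every service inherits a predictable shape.

```ruby
class ApplicationService
  # Every service object in the app inherits from this, so reviewers know
  # exactly what shape to expect, and a transaction helper comes along
  # by default.
  def self.call(*args)
    new(*args).call
  end

  private

  def with_transaction(&block)
    ActiveRecord::Base.transaction(&block)
  end
end
```

A service like the ShipReturnToVendor sketch from earlier would inherit from this, and a reviewer can flag anything that doesn't.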
00:16:43.500
Q: What about raw SQL bypassing Rails? A: Bypassing ActiveRecord can happen when you need the expediency, but you lose validations and callbacks when you do.
00:16:49.090
Consider implementing database checks for further integrity.
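For example, constraints that live in the database itself hold no matter which code path wrote the row. Here's a sketch of a migration along those lines, with hypothetical table and column names.

```ruby
class TightenReturnUnitConstraints < ActiveRecord::Migration
  def change
    # These rules hold even when raw SQL bypasses ActiveRecord validations
    # and callbacks.
    change_column_null :return_to_vendor_units, :status, false
    add_foreign_key :return_to_vendor_units, :vendors
  end
end
```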
00:16:55.930
Be mindful that pulling entities externally may complicate integrity oversight.
00:17:02.660
Lastly, make sure the team understands when to push validations into a service versus enforcing them in SQL.
00:17:10.000
My advice: use SQL when necessary but with consideration for loss of traceability.
00:17:17.000
I appreciate the opportunity to share my thoughts today. Thank you!