RailsConf 2017

Data Corruption: Stop the Evil Tribbles

by Betsy Haibel

In the video "Data Corruption: Stop the Evil Tribbles," speaker Betsy Haibel addresses the challenges of data integrity in software development, especially within complex systems like Rails applications. The talk emphasizes that data corruption is often inevitable and outlines strategies for recovery and prevention at various system levels. Key points discussed include:

  • Understanding Bad Data: Haibel challenges the perception of bad data as an external adversarial force. Instead, she suggests that issues often arise from product changes, team miscommunication, or architectural flaws within the codebase.

  • Promoting Communication: Effective communication among teams is crucial. Haibel recounts a case from her own work where miscommunication led to erroneous assumptions about data integrity, culminating in a destructive migration that caused production errors.

  • Building Resilient Systems: Systems should be designed to handle common data integrity issues proactively. This involves employing well-established data integrity patterns, like validations and database transactions. Haibel notes that skipping these best practices often stems from the system's architecture discouraging their use.

  • Recovery over Prevention: The focus should shift to recovery mechanisms that allow for the identification and correction of data issues. Tools like regular audits and event tracking (inspired by DevOps practices) can help manage data integrity in living systems, contributing to a more sustainable workflow.

  • Human-Centric Solutions: Haibel stresses the importance of involving people in the data integrity process rather than solely relying on code. Often, users can provide solutions that are more efficient than technical fixes when data integrity problems arise.

  • Cognitive Load: The complexity of modern systems increases cognitive load, making it easy for developers to overlook data integrity measures. To counter this, a culture of pair programming and code review can help maintain oversight and improve system reliability.

  • Adaptability to Computer Errors: The importance of adaptability in the face of bizarre computational errors is highlighted. Developers should be equipped to handle unexpected failures and fix issues promptly to minimize the impact on users.

The concluding takeaway stresses that while data integrity is a challenging aspect of software development, a collaborative culture, effective communication, and appropriate recovery systems can significantly alleviate the inherent complexities. Investing effort into ensuring that systems accommodate both technical and human aspects will pave the way for healthier software ecosystems.

Betsy Haibel encourages teams to acknowledge the messy nature of software development and prioritize strategies that allow for resilience in their systems.

00:00:11.929 Hi folks, I'm Betsy Haibel, and welcome to Data Integrity in Living Systems. Now,
00:00:19.260 ordinarily, I'd like to launch right into the talk content here. We've got a lot to cover and that's what you all are here to see.
00:00:26.070 However, Marcos' keynote just now really hit home for me, and I wanted to follow up on that.
00:00:31.590 As a white woman, I recognize that my experiences are quite different from those of a Black man.
00:00:36.690 I don’t want to flatten those differences. Still, a lot of the ways he framed survival resonate with my own experience.
00:00:43.140 I often frame my journey into tech from a non-traditional background as a simple, happy-path narrative.
00:00:49.050 But let's get real for a second. A decade ago, as I was getting into tech, I was just a woman with no college degree, having washed out of an arts career due to a sudden onset chronic pain disorder.
00:00:56.879 Boot camps were not even an option yet, so I was learning everything on my own, which was terrifying.
00:01:04.260 I picked up some bad survival lessons during that time.
00:01:10.590 This context is relevant because throughout this talk, I’ll share moments when I was kind of a self-righteous jerk.
00:01:16.710 I want you all to remember that I was only able to grow out of that mindset after I wasn't the only woman in the room anymore.
00:01:22.640 Once I had company, I could transition from survival through arrogance to nuance and kindness.
00:01:28.409 And that transition was possible only because I no longer had to carry the torch of being the sole representative of my gender.
00:01:33.570 Anyway, back to data integrity. This talk needs another name.
00:01:40.320 It used to be called 'Data Corruption: Stop the Evil Tribbles.' I didn’t change it because that name was a bit hokey.
00:01:45.540 Changing it for that reason would imply caring about my personal dignity, which feels uncomfortable.
00:01:52.590 Let's look at a humorous example from Star Trek, which I’ll be using throughout the talk.
00:01:59.120 I changed the name because the old title imposed a frame of bad data as an evil, invading force.
00:02:05.780 This perspective can be quite unproductive because it suggests we have pristine databases.
00:02:11.570 In reality, most of us are working with code bases that look very different from that.
00:02:17.330 Additionally, framing the issue adversarially encourages developers to fight against bad data.
00:02:26.870 This mindset can easily lead to seeing our users as the enemy or, even worse, our teammates.
00:02:35.330 It's easy to become self-righteous about data integrity issues, but that leads us away from solving the actual problems.
00:02:41.300 In my experience, the root causes of data integrity issues often stem from product changes or miscommunications between teams.
00:02:47.530 That's why I want to emphasize that data integrity is not just your responsibility!
00:02:53.840 If we're going to adopt a blameless approach, we need to look at root causes—many times.
00:03:02.590 For instance, people skip known data integrity patterns because their code base discourages their proper use.
00:03:08.000 We may run into the occasional, strange bug coming from the database due to unusual circumstances, but most issues stem from team dynamics.
00:03:16.130 We need to design our systems to be resilient against common problems like product changes or engineers making mistakes.
00:03:22.700 Focusing on resilience helps us catch and correct a wide array of data integrity bugs.
00:03:29.720 Let's consider a story. It's loosely based on my experience a few years ago at an e-commerce company.
00:03:36.500 The details aren’t exact, but they illustrate the problem effectively.
00:03:41.600 I was part of a team working on a module that processed returns and shipped products back to vendors.
00:03:46.669 Considering the larger monolith we were part of, this module turned out to be straightforward.
00:03:54.080 Our return-to-vendor model was simple; it prioritized three things: product worth, vendor information, and shipping status.
00:04:00.140 But of course, the world has its way of giving us partial data.
00:04:05.790 In our case, sometimes incoming data didn't include vendor information, making it very difficult to return products.
00:04:12.619 This was a new requirement, introduced by the return-to-vendor module as the product's needs grew.
00:04:19.290 The upstream data had previously met the system's needs perfectly; it became invalid only because the product changed.
00:04:24.590 Common data integrity issues often arise from product changes, but they can be addressed at the product level.
00:04:31.130 Instead of complicated measures like machine learning to find missing vendors, we decided to implement a simple UI change.
00:04:38.200 We updated the return display to hide units marked for return that didn't have vendor information.
00:04:45.000 This was a lot more efficient than creating a complicated technical solution.
00:04:52.260 The main takeaway from this experience is to scope data based on whether it can progress.
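A minimal sketch of scoping data by whether it can progress, written in plain Ruby so it runs without Rails. The `ReturnUnit` struct, its fields, and the `ReturnQueue` module are hypothetical names for illustration, not the code from the talk. In a Rails app the same idea would more likely be an ActiveRecord scope on the model.

```ruby
# Hypothetical stand-in for a return-to-vendor record; not the talk's actual model.
ReturnUnit = Struct.new(:sku, :vendor, :status, keyword_init: true) do
  # A unit can move forward in the return flow only if we know its vendor.
  def returnable?
    !vendor.nil? && !vendor.strip.empty?
  end
end

module ReturnQueue
  # Units shown on the return screen: only those that can actually progress.
  def self.actionable(units)
    units.select(&:returnable?)
  end

  # Units held back for a person to look at, rather than deleted or guessed at.
  def self.needs_vendor(units)
    units.reject(&:returnable?)
  end
end

UNITS = [
  ReturnUnit.new(sku: "A-1", vendor: "Acme", status: "received"),
  ReturnUnit.new(sku: "B-2", vendor: nil, status: "received"),
  ReturnUnit.new(sku: "C-3", vendor: "  ", status: "received")
].freeze

puts ReturnQueue.actionable(UNITS).map(&:sku).inspect   # ["A-1"]
puts ReturnQueue.needs_vendor(UNITS).map(&:sku).inspect # ["B-2", "C-3"]
```

Nothing is rejected or destroyed here: incomplete units stay in the data set and simply do not appear in the view that requires a vendor.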
00:04:59.300 Additionally, we should validate models dynamically rather than statically, because a multi-step operation needs validation that is aware of which step the record is in.
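One way to read "dynamic" validation is that validity depends on the step a record is about to enter. The sketch below assumes two hypothetical steps, `:log` and `:ship`, and a hypothetical `ReturnRecord` class; it is plain Ruby, not the talk's code. Rails offers a comparable mechanism through validation contexts (the `on:` option together with `valid?(context)`).

```ruby
# Hypothetical record whose validity depends on the step being attempted.
class ReturnRecord
  STEPS = %i[log ship].freeze

  attr_reader :vendor

  def initialize(vendor: nil)
    @vendor = vendor
  end

  # Returns the error messages that apply to the given step.
  def errors_for(step)
    raise ArgumentError, "unknown step: #{step.inspect}" unless STEPS.include?(step)

    errors = []
    # Logging a return in the warehouse does not need a vendor;
    # shipping the unit back to a vendor obviously does.
    errors << "vendor is required" if step == :ship && vendor.nil?
    errors
  end

  def valid_for?(step)
    errors_for(step).empty?
  end
end

record = ReturnRecord.new
puts record.valid_for?(:log)          # true
puts record.errors_for(:ship).inspect # ["vendor is required"]
```

A static "vendor must be present" rule would have rejected the warehouse's perfectly legitimate record at the logging step.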
00:05:05.540 Collaboration can provide critical insights from users, making it unnecessary to overthink data recovery.
00:05:12.210 In living software systems, users are as crucial as the code itself.
00:05:18.190 Approaching data integrity problems collaboratively with our team and user base can lead to solutions.
00:05:25.610 Earlier, I mentioned eventually finding a solution through scoping down data and creating a user-friendly interface.
00:05:32.899 However, we encountered some miscommunication before reaching that resolution.
00:05:38.690 My team had assumed that the upstream system would provide vendor data.
00:05:44.199 When we saw units without vendor markings, we mistakenly believed it was a bug.
00:05:49.610 However, the oversight was ours: sometimes the upstream system simply does not have that data to provide.
00:05:56.160 The data for the return-to-vendor process was mostly sourced from the returns handling module.
00:06:04.120 The warehouse workers logging returns often didn't have vendor information.
00:06:11.670 Picture this: the warehouse worker receives returns off a truck, processing a messy pile, with packages sometimes improperly labeled.
00:06:17.160 Despite this, most people are not jerks and do label their returns correctly.
00:06:25.960 However, there are many scenarios where boxes may be missing information.
00:06:31.160 Warehouse workers are not responsible for finding out the brand of each returned product.
00:06:37.690 Their job is to log each return quickly and move on to the next one, and their performance metrics reward exactly that.
00:06:43.210 If we impose burdens that require them to delve deeply into every box, we are being unreasonable.
00:06:49.699 All of this was unknown to my team at the time. We were focused solely on the immediate needs of the return-to-vendor module.
00:06:56.250 Consequently, we ran migrations that inadvertently led to further complications and numerous production issues.
00:07:03.830 We operated as if the validations we had added would catch all potential issues.
00:07:09.800 Ultimately, we assumed internal CI failures indicated new bugs.
00:07:15.041 The underlying issue, however, was our lack of humility and awareness of upstream weaknesses.
00:07:22.230 After discussing this experience in a cross-team retrospective, we learned that solutions came from communication.
00:07:29.440 If we had approached our upstream colleagues sooner, I believe we would have avoided this entire error.
00:07:35.390 It’s often said that all we need to do is talk to one another.
00:07:41.500 As the Agile Manifesto suggests, individuals and interactions are paramount.
00:07:46.660 But let’s not ignore the intricacies of communication.
00:07:54.630 Too often, the phrase 'just talk' is a way of dismissing both the complexity of processes and the effort they require.
00:08:01.300 As Camille Fournier noted, managing coordination among people is an effort we cannot overstate.
00:08:11.000 The background noise of team dynamics becomes especially relevant.
00:08:18.000 So when we prioritize individuals and interactions, we must remember that processes matter.
00:08:22.230 The goal is not to dismiss processes but to find the right balance.
00:08:27.589 Observing how the software actually behaves is more reliable than sticking strictly to documentation.
00:08:32.000 Identifying how communication unfolds is vital. For instance, when my team added validations, we could have explored the issue further.
00:08:38.150 Understanding our recent commits might have shed light on where we were going wrong.
00:08:44.390 Effective interactions also involve acknowledging the work culture that stifles communication.
00:08:50.550 Burdening teammates with loads of pressure hinders our collaborative spirit.
00:08:57.000 The dominant attitude of each person's work being paramount leads to the notion that inter-team communication is wasteful.
00:09:02.600 Recognizing that cross-team interactions are always a valuable investment shifts our team's dynamics.
00:09:09.970 Further, let’s discuss the failure to leverage well-known data integrity patterns.
00:09:15.620 For example, patterns such as validations, callbacks, and transactions are often overlooked.
00:09:21.060 Neglecting these opportunities leads to moralizing and finger-pointing when issues arise.
00:09:28.150 We should focus on understanding why these errors occur within our systems.
00:09:35.200 This shifts the focus from individual blame toward systemic recovery.
00:09:41.590 My focus here is on building in systems to manage recovery instead of preventing error.
00:09:48.250 Recovery emphasizes moving forward when errors happen—a proactive stance.
00:09:53.860 Many older Rails projects accumulate considerable complexity in validations and callback mechanisms.
00:09:59.300 Consequently, serious engineering effort can be lost in trying to manage data integrity effectively.
00:10:06.090 Development teams often find themselves in a paralysis from overengineering.
00:10:12.610 Thus, the more we moralize about data integrity, the less we resolve actual problems.
00:10:18.550 Instead, we should cultivate an environment that encourages recovery and learning.
00:10:26.350 Finding balance between complex validation mechanisms and simplicity is key.
00:10:32.960 Lastly, let’s not overlook the challenges of where validations live: on the model itself or in a service object architecture.
00:10:40.080 Service objects can alleviate validation issues, but they increase cognitive load and complicate user experiences.
00:10:46.969 Cognitive load increases when we deviate from the Rails happy path that many developers default to.
00:10:54.420 Similarly, as we step away from familiar patterns, we might introduce risk.
00:10:59.470 So we must establish workflows that allow for discoverability and maintain consistency across the team.
00:11:06.870 Strategies may include pairing on code reviews and maintaining close communication.
00:11:13.199 To shift to maintaining data integrity effectively, we need to remember how Rails facilitates certain aspects. Rails provides convenience in using database transactions.
00:11:21.930 Transactions help manage failures by rolling back when inconsistencies arise.
00:11:29.990 However, we still need to adhere to those practices in our code.
00:11:36.510 It's tempting to shortcut back to imperative programming without regard for fallbacks.
00:11:43.200 The consequences of that may lead to fragmented data states.
00:11:50.120 We need to wrap related database manipulations in transactions so that they succeed or fail together.
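To make the all-or-nothing behaviour concrete, here is a toy in-memory store that imitates a transaction by restoring a snapshot when the block raises. It is only an illustration under that assumption: `TinyStore`, `Shipping`, and `ReturnShipmentFailed` are hypothetical names, and a real database does this very differently. In a Rails app the equivalent wrapper is an `ActiveRecord::Base.transaction do ... end` block.

```ruby
# Toy store with all-or-nothing writes; imitates, but is not, a database transaction.
class TinyStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  def write(key, value)
    @rows[key] = value
  end

  # Run the block; if it raises, restore the state from before the block.
  def transaction
    snapshot = Marshal.load(Marshal.dump(@rows)) # deep copy of current state
    yield self
  rescue StandardError
    @rows = snapshot
    raise
  end
end

class ReturnShipmentFailed < StandardError; end

module Shipping
  # Hypothetical two-step operation: mark the unit shipped, then record the shipment.
  def self.ship_unit(store, sku, vendor)
    store.transaction do |tx|
      tx.write("unit:#{sku}", "shipped")
      raise ReturnShipmentFailed, "no vendor for #{sku}" if vendor.nil?

      tx.write("shipment:#{sku}", vendor)
    end
  end
end

store = TinyStore.new
Shipping.ship_unit(store, "A-1", "Acme")
begin
  Shipping.ship_unit(store, "B-2", nil)
rescue ReturnShipmentFailed => e
  puts "rolled back: #{e.message}"
end
puts store.rows.keys.sort.inspect # ["shipment:A-1", "unit:A-1"]
```

Without the wrapper, the failed call would leave `unit:B-2` marked as shipped with no matching shipment, which is exactly the fragmented state described above.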
00:11:56.050 When dealing with third-party services, asynchronous requests have their own risks.
00:12:02.620 Transactions may not recover from asynchronous failures—which presents significant challenges.
00:12:08.630 Third-party services may not pass errors back cleanly, complicating debugging.
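Because a database rollback cannot undo or retry a call to someone else's service, one hedged approach is to record the outcome of each remote call locally so that failures stay visible and recoverable. The sketch below assumes a hypothetical carrier client with a `register` method and a hypothetical `GatewayError`; none of these names come from the talk or from a real library.

```ruby
# Hypothetical error raised by a third-party client.
class GatewayError < StandardError; end

# Hypothetical local record of a shipment the carrier needs to know about.
Shipment = Struct.new(:sku, :sync_state, :last_error, keyword_init: true)

module CarrierSync
  # `client` is any object that responds to #register(sku) and may raise GatewayError.
  def self.push(shipment, client)
    begin
      client.register(shipment.sku)
      shipment.sync_state = :synced
      shipment.last_error = nil
    rescue GatewayError => e
      # Record the failure where an audit or a person can find it later.
      shipment.sync_state = :failed
      shipment.last_error = e.message
    end
    shipment
  end

  # Shipments that a retry job or a human should look at.
  def self.needing_attention(shipments)
    shipments.select { |s| s.sync_state == :failed }
  end
end

# Fake clients standing in for the remote service.
class WorkingClient
  def register(_sku)
    true
  end
end

class TimingOutClient
  def register(_sku)
    raise GatewayError, "carrier timed out"
  end
end

ok = CarrierSync.push(Shipment.new(sku: "A-1", sync_state: :pending), WorkingClient.new)
bad = CarrierSync.push(Shipment.new(sku: "B-2", sync_state: :pending), TimingOutClient.new)
puts CarrierSync.needing_attention([ok, bad]).map(&:sku).inspect # ["B-2"]
```

The design choice is recovery over prevention: the remote failure is not hidden or swallowed, it becomes data that later checks can act on.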
00:12:15.619 Once other services are involved, the constraints of the CAP theorem have to be taken into account in the architecture.
00:12:21.220 The CAP theorem concerns three properties of a distributed system: consistency, availability, and partition tolerance.
00:12:28.230 In practice, a system can fully guarantee only two of the three at once.
00:12:35.430 When a system prioritizes consistency, availability may take a hit, and vice versa.
00:12:42.600 In conclusion, engineer applications that adapt, ensuring our system’s reliability.
00:12:49.780 Auditing processes at set intervals will help catch discrepancies before they spiral.
00:12:56.240 Run basic integrity checks every few hours—ensuring everything aligns.
00:13:02.740 If issues arise, escalate them to a human and deal with them directly.
00:13:09.410 Build a culture around addressing problems. Data integrity is about putting systems in place.
00:13:15.110 Establish regular meetings with the team to audit periodic checks and re-evaluate structures.
00:13:22.210 Tools like Sidekiq can facilitate these audits, finding discrepancies within the data.
00:13:29.990 If anomalies arise, log and assess them without immediate panic.
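A basic integrity audit can be as small as a list of named checks, each returning the records that violate it. The sketch below is plain Ruby over an array of hashes; the `IntegrityAudit` module, the two checks, and the record fields are hypothetical examples, and in a real app the checks would query the database instead.

```ruby
module IntegrityAudit
  Finding = Struct.new(:check, :records)

  # Each check is a name plus a lambda that returns the offending records.
  CHECKS = {
    "shipped units must have a vendor" =>
      ->(units) { units.select { |u| u[:status] == "shipped" && u[:vendor].nil? } },
    "skus must be unique" =>
      ->(units) { units.group_by { |u| u[:sku] }.values.select { |group| group.size > 1 }.flatten }
  }.freeze

  # Run every check and keep only the ones that found something.
  def self.run(units)
    CHECKS.map { |name, check| Finding.new(name, check.call(units)) }
          .reject { |finding| finding.records.empty? }
  end

  # Turn findings into messages for a person, rather than auto-correcting data.
  def self.report(findings)
    findings.map { |f| "#{f.check}: #{f.records.size} record(s) need a human look" }
  end
end

units = [
  { sku: "A-1", vendor: "Acme", status: "shipped" },
  { sku: "B-2", vendor: nil,    status: "shipped" },
  { sku: "B-2", vendor: "Zed",  status: "received" }
]

puts IntegrityAudit.report(IntegrityAudit.run(units))
```

The audit only logs and reports; deciding what to do with each anomaly is left to a person.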
00:13:35.520 Address risks through continuous learning, and keep improving our frameworks.
00:13:42.110 Finally, integrating support personnel enables effective feedback loops. Programmers spend less time debugging.
00:13:51.800 Instead, they focus on resolving the core problems arising from data integrity.
00:13:57.280 Recognize software as a living system, embracing collaboration while evolving.
00:14:02.870 Understand that the complexity of dependencies requires a balanced perspective.
00:14:10.220 Avoid focusing solely on developers versus users to tackle data integrity.
00:14:16.700 Acknowledge each development cycle evolves with changing business needs.
00:14:23.000 Maintain a posture of continuous improvement rather than fatalism.
00:14:29.240 Create inter-team channels to provide fertile ground for communication.
00:14:35.270 Now with time, lessons learned accumulate.
00:14:43.360 As we address data in our lives sensitively, we cultivate a thoughtful approach.
00:14:49.430 The overarching theme is the necessity for collaboration and recovery, not merely prevention.
00:14:57.010 This helps reduce analysis paralysis and fosters a proactive team culture.
00:15:02.940 I am Betsy Haibel, @betsythemuffin on Twitter. I primarily discuss programming, tech culture, and a bit of feminism.
00:15:10.390 You can find my talk content at betsyhaibel.com/talks. I work for a company called ReCertify in San Francisco.
00:15:20.130 We are currently not hiring but will soon open positions for senior developers.
00:15:25.830 If you wish to mentor juniors or mid-levels, I encourage applying.
00:15:32.500 I also co-organized 'Learn Ruby in DC', a casual office hours initiative for newer programmers.
00:15:39.090 If you're interested in doing something similar in your town, it’s not challenging. Just show up!
00:15:47.010 Now I'll take a few questions.
00:15:55.198 Q: Are there tools that can help us check our databases? A: This may sound simplistic, but Sidekiq is a great tool.
00:16:03.960 Use it to build auditing systems that send alerts if something’s wrong, scheduled for regular intervals.
00:16:13.676 Incorporate that into your workflow and adjust for complex calculations as necessary.
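The answer above can be sketched as a job-shaped class with a single `perform` method that alerts only when something is wrong. This is plain Ruby under stated assumptions: `VendorAuditJob`, `fetch_units`, and `notifier` are hypothetical names, and the Sidekiq-specific parts are left out so the sketch runs on its own. In a Sidekiq app the class would also include Sidekiq's worker module, and running it at regular intervals needs a separate scheduling mechanism.

```ruby
# Shaped like a background job: it can be built with no arguments and exposes #perform.
class VendorAuditJob
  def initialize(fetch_units: -> { [] }, notifier: ->(message) { warn(message) })
    @fetch_units = fetch_units # hypothetical: callable returning the current units
    @notifier = notifier       # hypothetical: callable that alerts a person
  end

  # Returns the number of problem records and alerts only when there are any.
  def perform
    missing = @fetch_units.call.select { |u| u[:status] == "shipped" && u[:vendor].nil? }
    @notifier.call("#{missing.size} shipped unit(s) have no vendor") unless missing.empty?
    missing.size
  end
end

alerts = []
job = VendorAuditJob.new(
  fetch_units: -> { [{ sku: "B-2", vendor: nil, status: "shipped" }] },
  notifier: ->(message) { alerts << message }
)
job.perform
puts alerts.inspect # ["1 shipped unit(s) have no vendor"]
```

Injecting the data source and the notifier keeps the check itself easy to test and easy to extend with more complex calculations.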
00:16:21.606 Q: What about using transaction scripts? A: A strong code review culture is vital.
00:16:28.000 Create a design whereby every service object must inherit from a base class to ensure consistency.
00:16:36.500 This structure aids in catching oversights during code reviews.
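A sketch of what such a base class might look like, assuming its job is to wrap every service in a transaction so that a reviewer only has to check for the inheritance. The names `ApplicationService`, `MarkUnitShipped`, and `TransactionLog` are hypothetical, and `within_transaction` is a stand-in that only records events; in a Rails app it would delegate to `ActiveRecord::Base.transaction`.

```ruby
# Records what the stand-in transaction did, so the behaviour is observable.
module TransactionLog
  def self.events
    @events ||= []
  end
end

# Hypothetical base class: subclasses implement #perform, callers use .call.
class ApplicationService
  def self.call(**kwargs)
    new(**kwargs).run
  end

  def run
    within_transaction { perform }
  end

  private

  def perform
    raise NotImplementedError, "#{self.class} must implement #perform"
  end

  # Stand-in for a real database transaction so the sketch runs without Rails.
  def within_transaction
    TransactionLog.events << :begin
    result = yield
    TransactionLog.events << :commit
    result
  rescue StandardError
    TransactionLog.events << :rollback
    raise
  end
end

class MarkUnitShipped < ApplicationService
  def initialize(unit:)
    @unit = unit
  end

  private

  def perform
    raise ArgumentError, "vendor missing" if @unit[:vendor].nil?

    @unit[:status] = "shipped"
    @unit
  end
end

MarkUnitShipped.call(unit: { sku: "A-1", vendor: "Acme", status: "received" })
puts TransactionLog.events.inspect # [:begin, :commit]
```

In review, "does this class inherit from the base class?" is a much easier question than "did the author remember the transaction?".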
00:16:43.500 Q: What about raw SQL bypassing Rails? A: Bypassing Rails happens for the sake of expediency, but you lose callbacks and validations.
00:16:49.090 Consider implementing database checks for further integrity.
00:16:55.930 Be mindful that pulling entities externally may complicate integrity oversight.
00:17:02.660 Lastly, ensure understanding about when to push validations to a service versus executing them in SQL.
00:17:10.000 My advice: use SQL when necessary but with consideration for loss of traceability.
00:17:17.000 I appreciate the opportunity to share my thoughts today. Thank you!