Data Integrity
How to lose 50 Million Records in 5 minutes

Summarized using AI


Jon Druse • November 18, 2019 • Nashville, TN

In his talk titled How to Lose 50 Million Records in 5 Minutes at RubyConf 2019, Jon Druse recounts a catastrophic experience from his decade-long career as a developer, emphasizing the dangers of cutting corners in software development. He illustrates the complexity of managing data in real estate and shares the journey of building a system named Forge to handle extraction, transformation, and loading (ETL) of over 60 million records sourced from various Multiple Listing Services (MLS).

Key Points Discussed:
- Complex System Environment: Jon explains the diverse data sources and the intricacies of managing them, including handling hundreds of APIs, slow data feeds, and varied data formats.
- The Forge System: He describes the creation of Forge to streamline the ETL process, highlighting initial missteps such as not maintaining a raw data copy and the lack of a proper staging environment.
- Impact of Poor Decisions: Critical mistakes made in the development process, including ignoring tests and legacy system issues, led to a catastrophic failure of the Elasticsearch cluster, resulting in the loss of 50 million records after an upgrade attempt.
- Learning from Failure: Jon advocates for a calm and systematic approach to handling crises. He stresses the importance of taking ownership of projects, adhering to best practices, and not ignoring underlying issues that might cause problems later.
- Preventative Measures: After the disaster, Jon outlines new strategies implemented in their systems to prevent future occurrences, including better data storage practices, proactive monitoring, and the implementation of redundancy.

Significant Examples:
- Jon uses the example of a disastrous upgrade to Elasticsearch, which led to the erroneous indexing of records and the eventual loss of data, as a cautionary tale.

Conclusions and Takeaways:
- Handle crises calmly and thoroughly investigate issues rather than applying quick fixes.
- Prioritize creating robust backup strategies and test thoroughly before implementing changes.
- Emphasize the need for consistent adherence to development processes, regardless of the perceived size of a change.

Overall, Jon’s story serves as a vital lesson on the consequences of engaging in non-compliant practices in software development, encouraging attendees to learn from mistakes to fortify their own systems against similar failures in the future.


RubyConf 2019 - How to lose 50 Million Records in 5 minutes by Jon Druse

Join me in re-living the worst catastrophe of my more than a decade long career as a developer. Enough time has passed that I can laugh about it now and hopefully you will too while being inspired to stop cutting corners.

It’s like a game of Clue and all the different parts of the system are characters. Which one was the killer? Spoiler alert, it was me, in the office with the keyboard.

#confreaks #rubyconf2019


00:00:12.380 All right, so I don't know how I could possibly follow Sandi Metz. I was nervous about that even before I knew what she was going to talk about. My name is Jon, and I'm an engineering director at W.R. Studios. We make products for real estate, and we're doing the best we can.
00:00:30.510 I actually live about 20 minutes south of here in Franklin. I moved here a couple of years ago. If you haven't been down to check it out, it's pretty cool, super old, with lots of good stuff to eat and see. They have the best ice cream in the whole wide world right on Main Street.
00:00:54.319 This ice cream is so good it can cause you to make bad decisions, but we'll talk about that in a little bit. I'm going to tell you a story about how many small decisions over months of time created a catastrophe from which we essentially couldn’t recover, and we lost 50 million of our records.
00:01:04.739 First, we need to set the stage. There's a lot going on in any system, and ours is no different. Since we're in real estate, I need to explain some terms that I'll be using. An MLS, for our purposes today, is just an organization that collects and distributes listing data. A listing is a property for sale, and an agent is a licensed real estate agent; agents are our customers.
00:01:33.270 A listing is a record that acts like an ad for a property. Essentially, if you go into Zillow and click on a house, that’s a listing. A data feed is just an API offered by an MLS to provide listings for us and really any software vendor. This is our kind of landing page; we have four different products that work together as a suite.
00:01:48.720 Here's one example: a tool for agents to search listings quickly and easily. A typical listing record has three to four hundred fields, and in some cases upwards of a thousand, which causes all kinds of problems. A lot of people think that there is one MLS in the United States, but there are actually more like 700, and here's a map of all the places we currently pull listing data from.
00:02:30.630 There’s a lot going on to gain access to each of these MLSs. We have to jump through all kinds of hoops, do demos, wait for sales cycles, get board approval, and sign contracts. This can take weeks or even months, and it has taken us years to get to this point.
00:02:51.570 We have all these products and data feeds, and all these products use this listing data in some way, but it's a nightmare to deal with. So we needed something internally to manage all of this, and that’s why we built something we call Forge. I watch a lot of YouTube videos, and there's a guy who makes things out of metal. I thought that what we do is kind of like heating metal and beating it with a hammer until it's in the shape we want.
00:03:17.280 It's a Ruby app, and its job is ETL. There are currently over 60 million records in this system, which is an Elasticsearch cluster. We have about 500,000 licensed users across all four products—it's a big challenge with lots of data.
00:03:51.960 Let me break down exactly why it's challenging. First, extraction: each pin represents a particular API that we have to pull data from. Managing the configuration for hundreds of APIs is kind of a nightmare, and you would not believe how many options there are, creating lots of problems for us.
00:04:14.820 Next is the lack of speed in these APIs. For example, here’s an actual screenshot showing that we pulled 1.6 million records over seven days from this API, which works out to about 150 records a minute—that's about as fast as we can reliably pull data, which is incredibly slow.
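As a quick back-of-the-envelope check (my own arithmetic, not a figure from the talk), 1.6 million records over seven days really does land in that range:

```ruby
records = 1_600_000
minutes = 7 * 24 * 60              # seven days expressed in minutes
puts records.fdiv(minutes).round   # => 159, i.e. roughly 150-160 records per minute
```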
00:04:55.170 So to sum up the extraction problems: we deal with hundreds of API configurations, permissioning and licensing, and they’re extremely slow and changing all the time. New standards continually emerge, which we have to account for.
00:05:38.700 What about transformation? Transformation is the hardest thing we do because the data is super weird. Here are a couple of examples: this shows a very small portion of a listing record from two different sources. The data on the left kind of makes sense, but the data on the right makes absolutely no sense. It's someone’s job to sift through all of that and make sense of it.
00:06:06.990 When you consider that a listing record has four to five hundred fields, you can see how error-prone this is, and we make mistakes that have to be fixed all the time. On top of that, some records just break all the rules: whatever you think this data means, you are probably wrong, because the actual meaning turns out to be something else entirely.
00:06:35.260 Transformation is difficult because field names are totally different per feed. We see conventions that defy reason and logic, and we also have extremely opinionated customers: if they don't see the data they want in the format they want, our product becomes unusable to them. Getting feedback is difficult, and this data changes monthly; MLSs rename fields, add new options, and take options away.
00:07:09.620 Now, let’s move to loading data. Most of the issues we deal with loading are around Elasticsearch. Elasticsearch is a really awesome product—almost black magic—but it has some quirks. If you try to put records into an index that doesn’t exist already, Elasticsearch will create one for you, which is cool. However, mixed with something called dynamic schema, it can lead to problems.
00:07:38.900 By default, new fields are added to an index's mapping automatically just by indexing a document that contains them. In a production scenario, you never want this to happen, and we go to great lengths to avoid it.
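The talk doesn't show any configuration, but a common way to guard against this, sketched here with the elasticsearch-ruby client and made-up index and field names, is to create every index explicitly with a strict mapping:

```ruby
require 'elasticsearch' # elasticsearch-ruby gem

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Create the index up front with an explicit mapping. dynamic: 'strict' makes
# Elasticsearch reject any document containing a field the mapping does not
# already define, instead of silently growing the schema.
client.indices.create(
  index: 'listings',                      # hypothetical index name
  body: {
    mappings: {
      dynamic: 'strict',
      properties: {
        list_price:  { type: 'integer' }, # hypothetical listing fields
        status:      { type: 'keyword' },
        modified_at: { type: 'date' }
      }
    }
  }
)
```

Elasticsearch also has a cluster setting, action.auto_create_index, that can be disabled so a write to a missing index fails loudly instead of quietly creating one with a default dynamic mapping.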
00:08:01.570 Loading is tricky—Elasticsearch is powerful, but it’s a huge system, and there are many things to know and watch out for in production.
00:08:17.070 So we built this system we call Forge. The very first version of it was built around YAML API configurations, which we thought was the simplest thing that could work. However, we didn't think to store a local, raw, unmapped copy of the data.
00:08:52.960 The entire ETL cycle happened all at once—pulling from the API, transforming it, and then pushing it into Elasticsearch—all without a proper staging environment as we didn’t want to pull the data multiple times. This approach created some problems because we couldn’t iterate or fix mistakes quickly enough.
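Forge's code isn't shown in the talk, but the lesson has roughly this shape (a sketch with hypothetical names, not the real system): persist the raw payload before transforming it, so mapping mistakes can be fixed without pulling from the MLS again.

```ruby
# Sketch only: keep the raw, unmapped payload first, then transform and load.
# RawListing, transform, and index_listing are hypothetical stand-ins.
def ingest(payload, source:)
  raw = RawListing.create!(source: source, body: payload) # local raw copy (database, S3, etc.)
  doc = transform(raw.body)                               # field mapping and normalization
  index_listing(doc)                                      # push into Elasticsearch
end
```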
00:09:32.330 Eventually, we created a dynamic API configuration with ActiveRecord and introduced a dynamic Elasticsearch config. This allowed some options to be toggled, but we still didn’t have a local raw unmapped copy of data, which was a problem.
00:10:09.870 This dynamic configuration was built using the same system, and the API wrappers worked whether you used the YAML or the dynamic configuration, so we didn’t have to rebuild everything. However, the data structure was not backward compatible, and the output into Elasticsearch was drastically different between the two versions.
00:10:49.820 This meant we couldn’t migrate everything automatically, preventing us from fully migrating the entire system. As a result, we were dealing with this status quo where both versions were in use.
00:11:24.540 About a year later, we still had five YAML configurations because we deemed them too important to change or not worth changing. This legacy system was ignored and forgotten until one day I received a message saying, 'When I removed this YAML file, all the tests failed. What should I do?' I thought, 'I'm sure it'll be fine.'
00:11:55.890 That decision put us on a collision course with disaster, although we wouldn’t know it for months. Let me explain how you can ruin any application.
00:12:16.689 Firstly, decide to make a change; it could be small or large. Then ignore the test suite because you assume it's fine, or that the change won't really affect the tests. You end up in a weird status quo where things aren't the way you want them to be, which makes them hard to change.
00:12:35.580 As issues arise because of your poor status quo, you apply quick fixes without researching the problem thoroughly; you’re just guessing. Enough guesses can lead to a last-ditch effort to solve a problem, which can result in catastrophic failure.
00:13:07.420 We created a new version of the product but reused the existing tests and never fully migrated to the new system. We were just living in a situation where quick fixes compounded. At some point, our Elasticsearch cluster was running out of disk space.
00:13:31.269 I logged into the server and manually deleted some Elasticsearch indexes, thinking they weren't being used. Unfortunately, this did not fix the slowness that had crept in over time. We kept encountering weird errors where some systems couldn't access a listing because it wasn't ready yet.
00:14:02.840 To address this, we decided to upgrade Elasticsearch, believing it would solve our problems. Shortly after, Robert informed me there were only four million listings left, to which I nonchalantly replied, 'I'm sure it's fine, I'll check tomorrow.' I went to get some ice cream; remember, it's that good.
00:14:43.890 In the morning, we discovered that we had lost 50 million listings. This loss was compounded by the fact we couldn't easily get the data back; it would take weeks to retrieve it again. So, what happened? All we did was upgrade Elasticsearch, and upon coming back online, all of the data was gone.
00:15:44.250 The issue stemmed from using timestamps as field names in the JSON. We knew it was a bad idea to structure the data that way, and that poor decision is what led to our Elasticsearch cluster crashing. A year after making it, we were completely unprepared for the consequences.
00:16:52.670 Essentially, I created a situation where the system tried to index a document into an index that didn’t exist, which Elasticsearch recreated with dynamic schemas that no longer ignored the timestamp fields. As a result, every time we indexed a listing, it was adding new fields to the schema.
00:17:40.230 This continued for months, until the schemas were megabytes of JSON. Each day, as new records were indexed, they changed the schema again, so Elasticsearch was endlessly passing megabytes of mapping data around. Consequently, everything slowed down because Elasticsearch couldn't keep up with itself.
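The talk doesn't mention specific safeguards at this point, but Elasticsearch does let you put a ceiling on mapping growth. A sketch of how one might watch for this kind of mapping explosion (index name hypothetical):

```ruby
require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Count how many top-level fields the mapping has accumulated and complain early.
mapping = client.indices.get_mapping(index: 'listings')
fields  = mapping.dig('listings', 'mappings', 'properties') || {}
warn "listings mapping has #{fields.size} top-level fields" if fields.size > 1_000

# index.mapping.total_fields.limit makes Elasticsearch reject new fields beyond
# a ceiling, so a runaway dynamic mapping fails loudly instead of growing forever.
client.indices.put_settings(
  index: 'listings',
  body: { 'index.mapping.total_fields.limit' => 2_000 }
)
```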
00:18:27.470 In an attempt to fix slowness, we added more nodes, which didn’t help. Ultimately, we tried upgrading Elasticsearch, but due to the weird state it was in, it resulted in a catastrophic failure.
00:19:05.490 After my ice cream excursion, I told a friend that the next day would either be the worst day of my job or just another day. It ended up being quite bad: we couldn't simply pull the data again from the MLSs, so we had to restore everything from backups.
00:19:49.820 Restoring took all day, especially with terabytes of data. Once the backups were up, we faced the same issues as before. Fundamentally, we needed to fix the problem properly—not just throw data back without addressing the underlying issues that caused the failure.
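For context (not something shown in the talk), restoring an Elasticsearch snapshot is roughly a single call against a previously registered repository, but with terabytes of data it can easily run for many hours; the repository and snapshot names here are hypothetical.

```ruby
require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Restore the listing indices from a snapshot taken before the failed upgrade.
client.snapshot.restore(
  repository: 'nightly_backups',      # hypothetical repository name
  snapshot:   'snapshot_2019_10_01',  # hypothetical snapshot name
  body:       { indices: 'listings*' }
)
```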
00:20:23.670 We had to eliminate the timestamps that were causing problems and get our Elasticsearch settings back to normal. This took a few days, but we eventually restored everything and got our customers to stop yelling at us.
00:20:54.210 So, what can we learn from this? Firstly, handle catastrophes calmly. There’s a running joke at my company that our CTO is so Zen that it’s almost scary. It's essential to remain clear-headed. Yelling won’t resolve issues, and being angry won’t help you move toward a solution.
00:21:16.600 Next, work through the problem rather than guessing. When you guess, you’re just grabbing at straws with no idea of the outcome. Always have proper backups; they can save your business when everything else fails.
00:21:46.920 Now, how can we avoid catastrophes in the first place? You must take responsibility for the projects you work on. Ownership improves the project and your career. Disregarding this leads you nowhere.
00:22:22.230 Follow best practices. Our system is essentially a layer that organizes many subsystems, yet we ignored Elasticsearch's documentation, which clearly states not to use it as your primary data store. That led us into problems we couldn't fix.
00:23:01.010 We had to consider reindexing, but we couldn’t due to the performance issues. Investigate root causes thoroughly—quick fixes without understanding lead to greater issues later.
00:23:39.200 These issues existed two years ago; since then, we've started 'Forge 2.0' as a new project to remedy the fundamental problems. I was angry at the old code and decided to start fresh, without the problematic existing structures.
00:24:01.260 We now proactively store data locally. It’s essential to monitor and alert ourselves to problems; the system emails me daily about potential issues. This approach has made our system more robust.
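A minimal sketch of the kind of daily reconciliation check described here (Feed, AlertMailer, and the index layout are hypothetical, not Forge's actual code):

```ruby
require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Compare the local raw store against what Elasticsearch reports and email
# an alert when records have gone missing.
def check_counts(feed, client)
  expected = feed.raw_record_count                          # count in the local raw store
  actual   = client.count(index: feed.index_name)['count']  # count Elasticsearch reports
  return if actual >= expected

  AlertMailer.count_mismatch(
    feed: feed.name, expected: expected, actual: actual
  ).deliver_now
end
```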
00:24:44.890 In fact, this happened again just two months ago. I woke up to hundreds of emails from Forge stating that there should be two million records, but Elasticsearch returned zero. We lost many nodes, yet thanks to our improvements, nobody noticed.
00:25:29.410 We just reindexed the missing data, and by that afternoon, it was all fixed. Final thoughts: be proactive about issues in your system; they won’t disappear on their own.
00:25:50.270 You cannot ignore them; they will only get worse. As one colleague put it, we will always use the system more than we do now, which means investing in reliability is crucial.
00:26:35.570 Always keep a backup plan for unavoidable issues. We’ve faced this, and you must be prepared for every possible scenario. Most problems are avoidable, but for those that aren’t, have processes, tests, staging servers, and continuous integration in place.
00:28:53.150 I pushed a change recently that seemed harmless. I neglected to test it thoroughly, and under production traffic the job queues spiraled because the change took significantly longer to run in production than expected. Ignoring your processes, even for seemingly trivial changes, can lead to major disaster.
00:29:12.510 Take time to consider the potential consequences in your organization. I hope you can learn from my story and identify issues in your own organization so that you don't find yourself in a situation similar to mine. Thank you very much.
00:29:48.190 What’s my dog’s name? His name is Nash; he’s a Goldendoodle. I actually just got another one, but he's not in the picture yet—he's too little.
00:29:51.120 So, moving forward, how do we prevent this from happening a third time and recover from it if it does? We’ve been thinking about backup data centers to mitigate issues caused by natural disasters.
00:30:07.700 Considering redundancy in all systems is key: maintaining multiple copies and investing in more servers can facilitate rapid recovery when things go awry.
00:30:18.550 In terms of development processes, I’d like to say that we've changed them significantly, but it’s easy to fall into the trap of process theater where everyone believes they’re following processes while they are not.
00:30:46.120 Following development processes is inherently challenging, but consistent adherence to them is vital: no change is so minor that it can skip performance testing or staging. Ignoring these details can lead to drastic outcomes. Thank you very much.