00:00:12.380
All right, so I don't know how I could possibly follow Sandi Metz. I was nervous about that even before I knew what she was going to talk about. My name is Jon, and I'm an engineering director at W.R. Studios. We make software products for real estate agents, and we're doing the best we can.
00:00:30.510
I actually live about 20 minutes south of here in Franklin. I moved here a couple of years ago. If you haven't been down to check it out, it's pretty cool, super old, with lots of good stuff to eat and see. They have the best ice cream in the whole wide world right on Main Street.
00:00:54.319
This ice cream is so good it can cause you to make bad decisions, but we'll talk about that in a little bit. I'm going to tell you a story about how many small decisions, made over a period of months, created a catastrophe we very nearly couldn't recover from: we lost 50 million of our records.
00:01:04.739
First, we need to set the stage. There's a lot going on in any system, and ours is no different. Since we're in real estate, I need to explain some terms I'll be using. An MLS, for our purposes today, is just an organization that collects and distributes listing data. A listing is a property for sale, and an agent is a licensed real estate agent; agents are our customers.
00:01:33.270
A listing is a record that acts like an ad for a property: if you go to Zillow and click on a house, that's a listing. A data feed is just an API offered by an MLS that provides listings to us, and really to any software vendor. This is our landing page: we have four different products that work together as a suite.
00:01:48.720
Here's one example: a tool for agents to search listings quickly and easily. A typical listing record has three to four hundred fields, and in some cases upwards of a thousand, which causes all kinds of problems. A lot of people think there is one MLS in the United States, but there are actually more like 700, and here's a map of all the places we currently pull listing data from.
00:02:30.630
There’s a lot going on to gain access to each of these MLSs. We have to jump through all kinds of hoops, do demos, wait for sales cycles, get board approval, and sign contracts. This can take weeks or even months, and it has taken us years to get to this point.
00:02:51.570
We have all these products and data feeds, and all of the products use this listing data in some way, but it's a nightmare to deal with. So we needed something internal to manage all of this, and that's why we built something we call Forge. I watch a lot of YouTube videos, and there's a guy who forges things out of metal; I figured what we do is kind of like that: heating metal and beating it with a hammer until it's the shape we want.
00:03:17.280
It's a Ruby app, and its job is ETL: extract, transform, and load. There are currently over 60 million records in the system, stored in an Elasticsearch cluster, and we have about 500,000 licensed users across all four products. It's a big challenge with lots of data.
00:03:51.960
Let me break down exactly why it's challenging. First, extraction: each pin represents a particular API that we have to pull data from. Managing the configuration for hundreds of APIs is kind of a nightmare, and you would not believe how many options there are, creating lots of problems for us.
00:04:14.820
Next is how slow these APIs are. Here's an actual screenshot: we pulled 1.6 million records from this API over seven days, which works out to about 150 records a minute. That's about as fast as we can reliably pull data, and it's incredibly slow.
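To give a flavor of what that kind of pull looks like, here is a minimal sketch of a throttled, incremental pull loop. It is not our actual code: the `FeedPuller` class, the endpoint shape, and the field names are made up for illustration, and real feeds each have their own auth, paging, and rate-limit rules.

```ruby
require "json"
require "net/http"
require "time"

# Hypothetical throttled pull loop; endpoint and field names are illustrative.
class FeedPuller
  PAGE_SIZE = 100            # many feeds cap page sizes this low
  MIN_SECONDS_PER_PAGE = 40  # ~150 records/minute, about what these APIs sustain

  def initialize(base_url:, token:)
    @base_url = base_url
    @token = token
  end

  # Yield every listing modified since `since`, one slow page at a time.
  def each_listing(since:)
    page = 0
    loop do
      started = Time.now
      records = fetch_page(since: since, page: page)
      break if records.empty?

      records.each { |record| yield record }
      page += 1

      # Stay under the feed's rate limit instead of hammering it.
      elapsed = Time.now - started
      sleep(MIN_SECONDS_PER_PAGE - elapsed) if elapsed < MIN_SECONDS_PER_PAGE
    end
  end

  private

  def fetch_page(since:, page:)
    uri = URI("#{@base_url}/listings?modified_since=#{since.utc.iso8601}&page=#{page}&limit=#{PAGE_SIZE}")
    request = Net::HTTP::Get.new(uri, "Authorization" => "Bearer #{@token}")
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
    JSON.parse(response.body).fetch("listings", [])
  end
end
```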
00:04:55.170
So to sum up the extraction problems: we deal with hundreds of API configurations, permissioning and licensing, and they’re extremely slow and changing all the time. New standards continually emerge, which we have to account for.
00:05:38.700
What about transformation? Transformation is the hardest thing we do because the data is super weird. Here are a couple of examples: this shows a very small portion of a listing record from two different sources. The data on the left kind of makes sense, but the data on the right makes absolutely no sense. It's someone’s job to sift through all of that and make sense of it.
00:06:06.990
When you consider that a listing record has four to five hundred fields, you can see how error-prone this is; we make mistakes that have to be fixed all the time. On top of that, some feeds just break all the rules: whatever you think a piece of data means, you are probably wrong, because it actually means something else entirely.
00:06:35.260
Transformation is difficult because field names are totally different in every feed, the conventions defy reason and logic, and our customers are extremely opinionated: if they don't see the data they want in the format they want, our product is unusable to them. Getting feedback is difficult, and the data changes monthly; feeds rename fields, add new options, and take others away.
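To make that concrete, here is a minimal sketch of the kind of per-feed field mapping this forces on us. The feed names and field names are invented for illustration; the real maps cover hundreds of fields per feed.

```ruby
# Hypothetical per-feed field maps: each feed spells the same concept
# differently, so everything gets translated into one canonical shape.
FEED_FIELD_MAPS = {
  "feed_a" => { "ListPrice" => :list_price, "BedroomsTotal" => :bedrooms, "LivingArea"  => :square_feet },
  "feed_b" => { "LP"        => :list_price, "BR_COUNT"      => :bedrooms, "SQFT_APPROX" => :square_feet }
}.freeze

def transform(feed_id, raw_record)
  mapping = FEED_FIELD_MAPS.fetch(feed_id)
  raw_record.each_with_object({}) do |(source_field, value), canonical|
    target = mapping[source_field]
    next unless target # unmapped fields get reviewed by a human, not guessed at
    canonical[target] = value
  end
end

transform("feed_b", { "LP" => "350000", "BR_COUNT" => "3", "SQFT_APPROX" => "1850" })
# => { list_price: "350000", bedrooms: "3", square_feet: "1850" }
```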
00:07:09.620
Now let's move on to loading data. Most of the issues we deal with in loading are around Elasticsearch. Elasticsearch is a really awesome product, almost black magic, but it has some quirks. If you try to put records into an index that doesn't already exist, Elasticsearch will create one for you, which is cool, but mixed with something called dynamic mapping it can lead to problems.
00:07:38.900
By default, new fields get added to an index's mapping just by indexing a document that contains them. In a production scenario you almost never want that to happen, and we go to great lengths to avoid it.
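As a rough illustration of what "going to great lengths" can look like, here is one way to lock this down with the elasticsearch-ruby client: disable automatic index creation and create the index up front with an explicit, strict mapping. The index name and fields are placeholders, and the exact request bodies vary a bit between Elasticsearch versions.

```ruby
require "elasticsearch"

client = Elasticsearch::Client.new(url: ENV.fetch("ELASTICSEARCH_URL", "http://localhost:9200"))

# Don't let a stray write silently create an index with default settings.
client.cluster.put_settings(
  body: { persistent: { "action.auto_create_index" => false } }
)

# Create the index up front with dynamic mapping off, so documents containing
# unexpected fields are rejected instead of quietly growing the mapping.
client.indices.create(
  index: "listings",
  body: {
    mappings: {
      dynamic: "strict",
      properties: {
        list_price:  { type: "long" },
        bedrooms:    { type: "integer" },
        status:      { type: "keyword" },
        modified_at: { type: "date" }
      }
    }
  }
)
```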
00:08:01.570
Loading is tricky—Elasticsearch is powerful, but it’s a huge system, and there are many things to know and watch out for in production.
00:08:17.070
So we built Forge. The very first version kept the API configurations in YAML files, which we thought was the simplest thing that could possibly work. However, we didn't think to store a local, raw, unmapped copy of the data.
00:08:52.960
The entire ETL cycle happened all at once: pull from the API, transform, and push straight into Elasticsearch, with no staging step in between, because we didn't want to pull the data multiple times. That created problems, because we couldn't iterate or fix mistakes quickly; any fix meant pulling the data all over again.
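For contrast, here is a minimal sketch of what we eventually wanted: persist a raw, unmapped copy of every record locally during extraction, so transform and load can be re-run without touching the slow upstream API. The `RawListing` model, its columns, and the `ListingKey` field are hypothetical, and the `upsert` call assumes Rails 6+ on PostgreSQL with a unique index on `(feed_id, source_key)`.

```ruby
# Hypothetical ActiveRecord model holding the untouched payload from a feed:
#   create_table :raw_listings do |t|
#     t.string :feed_id,    null: false
#     t.string :source_key, null: false   # the listing's ID in the source feed
#     t.jsonb  :payload,    null: false   # raw, unmapped response body
#     t.timestamps
#   end
class RawListing < ActiveRecord::Base
  validates :feed_id, :source_key, :payload, presence: true
end

# Extraction only writes raw data; transformation and loading read from this
# table, so a mapping mistake never forces a week-long re-pull.
def store_raw(feed_id, record)
  RawListing.upsert(
    { feed_id: feed_id, source_key: record.fetch("ListingKey"), payload: record },
    unique_by: [:feed_id, :source_key]
  )
end
```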
00:09:32.330
Eventually, we created a dynamic API configuration backed by ActiveRecord and introduced a dynamic Elasticsearch config. That let us toggle some options, but we still didn't have a local, raw, unmapped copy of the data, which was a problem.
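Here is a rough sketch of what that kind of ActiveRecord-backed configuration might look like; the `FeedConfiguration` model, its columns, and the option names are made up for illustration rather than pulled from our code.

```ruby
# Hypothetical feed configuration stored as data instead of YAML files:
#   create_table :feed_configurations do |t|
#     t.string  :name,     null: false
#     t.string  :base_url, null: false
#     t.jsonb   :options,  null: false, default: {}  # per-feed toggles
#     t.boolean :enabled,  null: false, default: true
#     t.timestamps
#   end
class FeedConfiguration < ActiveRecord::Base
  validates :name, :base_url, presence: true

  scope :active, -> { where(enabled: true) }

  # Rows can be edited in an admin UI without a deploy, unlike the YAML files.
  def elasticsearch_index_name
    options.fetch("index_name", "listings_#{name}")
  end
end
```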
00:10:09.870
This dynamic configuration was built using the same system, and the API wrappers worked whether you used the YAML or the dynamic configuration, so we didn’t have to rebuild everything. However, the data structure was not backward compatible, and the output into Elasticsearch was drastically different between the two versions.
00:10:49.820
That meant we couldn't migrate feeds automatically, so we never fully migrated the system, and we settled into a status quo where both versions were in use.
00:11:24.540
About a year later, we still had five YAML configurations that we had deemed either too important to change or not worth changing. That legacy corner was ignored and forgotten until one day I got a message: 'When I removed this YAML file, all the tests failed. What should I do?' I thought, 'I'm sure it'll be fine.'
00:11:55.890
That decision put us on a collision course with disaster, although we wouldn’t know it for months. Let me explain how you can ruin any application.
00:12:16.689
First, decide to make a change; it could be small or large. Then ignore the test suite, because you assume it's fine or that the change won't really affect the tests. You end up in a weird status quo where things aren't the way you want them to be, which makes everything harder to change.
00:12:35.580
As issues arise because of your poor status quo, you apply quick fixes without researching the problem thoroughly; you’re just guessing. Enough guesses can lead to a last-ditch effort to solve a problem, which can result in catastrophic failure.
00:13:07.420
We created a new version of the product but reused the existing tests and never fully migrated to the new system. We were just living in a situation where quick fixes compounded. At some point, our Elasticsearch cluster was running out of disk space.
00:13:31.269
I logged into the server and manually deleted some Elasticsearch indexes, thinking they weren't being used. Unfortunately, that did not fix the slowness that had crept in over time, and we started hitting weird errors where some systems couldn't access a listing because it wasn't ready yet.
00:14:02.840
Realizing something was wrong, we decided to upgrade Elasticsearch, believing it would solve our problems. The next morning, Robert informed me there were only four million listings left, to which I nonchalantly replied, 'I'm sure it's fine, I'll check tomorrow.' Then I went to get some ice cream. Remember, it's that good.
00:14:43.890
In the morning, we discovered that we had lost 50 million listings. The loss was compounded by the fact that we couldn't easily get the data back; pulling it all again would take weeks. So what happened? All we did was upgrade Elasticsearch, and when the cluster came back online, the data was gone.
00:15:44.250
The issue stemmed from documents that used timestamps as JSON field names. We knew it was a bad idea to structure the data that way, and a year after that decision it helped bring our Elasticsearch cluster down; we were completely unprepared for the consequences.
00:16:52.670
Essentially, by deleting those indexes I created a situation where the system tried to index documents into an index that didn't exist. Elasticsearch recreated it automatically, with dynamic mapping enabled and without the settings that had ignored the timestamp fields. As a result, every time we indexed a listing, it added new fields to the mapping.
00:17:40.230
This went on for months, until the mappings were megabytes of JSON. Each day, newly indexed records changed the mapping again, and Elasticsearch endlessly passed those megabytes of cluster state around. Everything slowed down because Elasticsearch couldn't keep up with itself.
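To illustrate the failure mode with made-up field names: when timestamps are used as JSON keys, every document can introduce brand-new fields, and dynamic mapping dutifully adds each one. One way to defuse this, sketched below with the elasticsearch-ruby client, is to mark that subtree as `enabled: false` so it's stored but never mapped.

```ruby
# Illustrative documents: the timestamp keys mean no two listings share the
# same field names, so the mapping grows with every document indexed.
doc_one = { "photos" => { "2016-03-01T09:15:22Z" => "front.jpg" } }
doc_two = { "photos" => { "2016-03-02T11:41:07Z" => "kitchen.jpg" } }

require "elasticsearch"
client = Elasticsearch::Client.new(url: "http://localhost:9200")

# Keep the data in _source but never add its keys to the mapping.
client.indices.create(
  index: "listings",
  body: {
    mappings: {
      properties: {
        photos: { type: "object", enabled: false }
      }
    }
  }
)

# Both documents now index without touching the mapping at all.
client.index(index: "listings", id: 1, body: doc_one)
client.index(index: "listings", id: 2, body: doc_two)
```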
00:18:27.470
In an attempt to fix the slowness, we added more nodes, which didn't help. Ultimately, we tried upgrading Elasticsearch, and because of the weird state the cluster was in, the upgrade ended in catastrophic failure.
00:19:05.490
The day after my ice cream excursion, I told a friend that the next day might be the worst day of my job, or it might be just another day. It ended up being quite bad: we couldn't simply re-pull the data, so we had to restore from backups.
00:19:49.820
Restoring terabytes of data took all day, and once the backups were loaded we faced the same issues as before. Fundamentally, we needed to fix the problem properly, not just shovel the data back in without addressing the underlying issues that caused the failure.
00:20:23.670
We had to eliminate the tests that were causing problems and get our Elasticsearch settings back to normal. It took a few days, but we eventually restored everything and got our customers to stop yelling at us.
00:20:54.210
So, what can we learn from this? Firstly, handle catastrophes calmly. There’s a running joke at my company that our CTO is so Zen that it’s almost scary. It's essential to remain clear-headed. Yelling won’t resolve issues, and being angry won’t help you move toward a solution.
00:21:16.600
Next, work through the problem rather than guessing. When you guess, you're just grasping at straws with no idea of the outcome. And always have proper backups; they can save your business when everything else fails.
00:21:46.920
Now, how can we avoid catastrophes in the first place? You have to take responsibility for the projects you work on. Ownership improves both the project and your career; shrugging off that responsibility gets you nowhere.
00:22:22.230
Follow best practices. Forge is really a layer over a lot of subsystems, and we ignored Elasticsearch's documentation, which says not to rely on it as your primary data store. That left us with problems we couldn't fix.
00:23:01.010
We considered reindexing, but we couldn't because of the performance issues. Investigate root causes thoroughly; quick fixes made without understanding lead to bigger problems later.
00:23:39.200
All of that was about two years ago. Since then we've started a new project, 'Forge 2.0', to remedy these fundamental problems. I was angry at the old code and decided to start fresh, without the problematic existing structures.
00:24:01.260
We now proactively store a raw copy of the data locally. We also monitor the system and alert ourselves to problems; it emails me daily about potential issues. This approach has made the system much more robust.
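As a rough sketch of the kind of check behind those emails: compare what the local store says we should have with what Elasticsearch actually reports, and alert on any drift. The `RawListing` model is the hypothetical one sketched earlier, and `AlertMailer` is a stand-in for whatever notification channel you use.

```ruby
require "elasticsearch"

# Hypothetical daily reconciliation job: expected counts come from the local
# raw store, actual counts from Elasticsearch, and any big gap sends an alert.
class IndexCountCheck
  def initialize(client: Elasticsearch::Client.new(url: ENV["ELASTICSEARCH_URL"]))
    @client = client
  end

  def run
    RawListing.group(:feed_id).count.each do |feed_id, expected|
      actual = @client.count(index: "listings_#{feed_id}")["count"]
      next if actual >= expected * 0.99 # tolerate a little indexing lag

      AlertMailer.index_drift(feed_id, expected: expected, actual: actual).deliver_now
    end
  end
end
```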
00:24:44.890
In fact, this happened again just two months ago. I woke up to hundreds of emails from Forge stating that there should be two million records, but Elasticsearch returned zero. We lost many nodes, yet thanks to our improvements, nobody noticed.
00:25:29.410
We just reindexed the missing data, and by that afternoon, it was all fixed. Final thoughts: be proactive about issues in your system; they won’t disappear on their own.
00:25:50.270
You cannot ignore them; they will only get worse. As one colleague put it, we will never use the system less than we are using it right now, which means investing in reliability is crucial.
00:26:35.570
Always keep a backup plan for unavoidable issues. We’ve faced this, and you must be prepared for every possible scenario. Most problems are avoidable, but for those that aren’t, have processes, tests, staging servers, and continuous integration in place.
00:28:53.150
I recently pushed a change that seemed harmless. I neglected to test it thoroughly, and under production traffic it took significantly longer to run, so the job queues spiraled out of control. Skipping your processes, even for seemingly trivial changes, can lead to major disasters.
00:29:12.510
Take time to consider the potential consequences in your organization. I hope you can learn from my story and identify issues in your own organization so that you don't find yourself in a situation similar to mine. Thank you very much.
00:29:48.190
What’s my dog’s name? His name is Nash; he’s a Goldendoodle. I actually just got another one, but he's not in the picture yet—he's too little.
00:29:51.120
So, moving forward, how do we prevent this from happening a third time and recover from it if it does? We’ve been thinking about backup data centers to mitigate issues caused by natural disasters.
00:30:07.700
Considering redundancy in all systems is key: maintaining multiple copies and investing in more servers can facilitate rapid recovery when things go awry.
00:30:18.550
In terms of development processes, I’d like to say that we've changed them significantly, but it’s easy to fall into the trap of process theater where everyone believes they’re following processes while they are not.
00:30:46.120
Following development processes is inherently challenging, but consistent adherence to them is vital. No change is too minor for performance testing or a run through staging. Ignoring these details can lead to drastic outcomes. Thank you very much.