00:00:00.800
Wow, that was a great morning talk. I don't know if it's coincidental, but I'm full of energy now. Excuse me, I have to cough.
00:00:04.920
Our next speaker is Matthew Borden, or Matt, as he likes to be called. I met Matthew when he was a young boy, about five years ago, when we volunteered at CSSConf and JSConf together. That's pretty exciting! Please welcome him to the stage.
00:00:16.170
Matthew is a self-proclaimed coffee snob—hello, I am too!—and has been watching The West Wing lately. So, where are you up to? Which season are you watching? Oh, I won't spoil the show; it just got bad after a certain season.
00:00:41.399
Matt is currently working as a Site Reliability Engineer at Seer Medical, which is Australia's largest epilepsy diagnostic service. Seer diagnoses patients through seven days of continuous monitoring of their brain activity at home, so they can be monitored where they live rather than enduring a long hospital stay. This is pretty great for patients.
00:01:16.930
So, he's working on something important. We were chatting over coffee yesterday, and most of us can say that if we put a bug into production, lives aren't at stake—but in his work, they are. There can be a lot of pressure as a Site Reliability Engineer on that team. Today, though, instead of helping people, he's going to talk about murdering a MongoDB database.
00:01:35.340
So, today, we're going to go on the journey of 'murdering' a database. As Neil mentioned, I currently work at Seer Medical diagnosing epilepsy. This project actually happened at Stile Education, where I was working last year. Stile is a highly engaging learning platform that contextualizes science with the latest news and issues relevant to students.
00:02:01.890
They teach physics with driverless cars, chemistry with chocolatiers, and biology in the context of the 2014 Ebola crisis. They recently published a lesson exploring the science behind the recent bushfires and the contribution of climate change towards them.
00:02:23.290
Before we get into the murder, I wanted to introduce MongoDB for those of you who might not have used it before. It's a document-based data store, meaning you can write to it directly from your application by serializing your existing objects into JSON. In a traditional relational database, by contrast, you use SQL to write data into structures predefined by a schema.
00:02:39.239
In MongoDB, there is not necessarily a schema, and you write into it using its own query language. It has support for sharding data over multiple servers, meaning you don't require one large server to store all your data. It's usually deployed in a set of three or more servers for high availability, with one primary server and multiple secondary servers replicating data from the primary.
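To make that concrete, here's a minimal sketch using the MongoDB Ruby driver. The host names, database, and document are invented for illustration: it connects to a three-member replica set, serializes an object straight into a collection, and reads it back with MongoDB's own query language.

```ruby
require "mongo"

# Connect to a three-member replica set: one primary, two secondaries.
client = Mongo::Client.new(
  ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
  replica_set: "rs0",
  database: "stile_example"
)

# No predefined schema: an application object is serialized directly
# into a document in the "lessons" collection.
client[:lessons].insert_one(
  title: "The science of bushfires",
  subject: "chemistry",
  tags: ["climate", "2020"]
)

# Reads use MongoDB's own query language rather than SQL.
client[:lessons].find(subject: "chemistry").first
```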
00:03:10.150
There are some trade-offs when comparing MongoDB to a relational database. It only recently gained transaction support; before that, you had to implement transactions yourself through strategies like two-phase commit. It now has schema validation, but adding or updating a validator doesn't affect data already in the database. Moreover, it isn't well designed for relational data; it lacks features typical of relational databases, such as foreign keys to enforce relationships between records, or joins to connect documents to each other.
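As a rough illustration of that schema-validation caveat (collection and field names invented), a validator can be attached to an existing collection, but it only applies to inserts and updates from then on; documents already sitting in the collection are never re-checked:

```ruby
# Attach a JSON Schema validator to an existing collection. Existing
# documents are NOT re-validated; only subsequent writes are checked.
client.database.command(
  collMod: "lessons",
  validator: {
    "$jsonSchema" => {
      "bsonType" => "object",
      "required" => ["title"],
      "properties" => { "title" => { "bsonType" => "string" } }
    }
  }
)
```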
00:03:36.459
When considering using MongoDB as a database, it's worth asking yourself if the nature of the data you're storing is well-suited to it and if these trade-offs are right for your application. Today, I'll explain why Stile introduced MongoDB as our second database. I'll share some examples of use cases we had throughout its lifecycle, and I'll share the process we went through in deciding to remove it from the application. I’ll dive deep into how we safely refactored it out of our codebase, even though it was critical to our application.
00:04:26.550
So, I want to tell you a story. It starts back in 2011, before I joined the company in 2015. The original application was built in PHP, and they didn't have any real customers at the time, but as they began to get some, they knew they didn't want to continue writing the application in this spaghetti code of PHP. They asked themselves this question: if we started again, what technologies would we choose?
00:04:51.210
The developers, who were mostly recent university graduates without much experience designing production systems, decided to look through the catalogue of application designs on Hacker News. At the time, MongoDB was hugely popular, and to the founders of the company it looked like the database of the future. It also seemed like it would solve some of the problems they'd previously encountered with MySQL: one of the founders had experience with older versions of MySQL and noted that they lacked support for transactions, suffered from referential integrity issues, and were optimized for read-heavy workloads, whereas their application was going to be write-heavy.
00:05:23.250
This made MongoDB an appealing choice for a high-availability clustered database at a time when there were no robust solutions for clustering MySQL. In 2012, they rewrote the application as a set of 11 Ruby microservices communicating over a message bus. This allowed each microservice to choose which database it wanted to use, either MongoDB or MySQL. By then, though, the first version of the application had already been written against MongoDB, and they didn't want to take the time to migrate that data over.
00:06:07.860
Having significant experience with MySQL in the past, they decided to use both MongoDB and MySQL in this new web-scale architecture. Some services used MongoDB while others used MySQL, but due to a lack of operational experience and a few early outages, they tried to keep only non-critical data in the MongoDB-backed services. Even so, this led to issues as coupling crept in between the MongoDB services and the MySQL services.
00:06:50.850
They came up with makeshift ways to reduce this coupling, such as adding fallbacks so that if a MongoDB-backed service went down, the MySQL-backed services would stay up. We spent a lot of time implementing things like this to avoid affecting our use of MySQL. However, every time we had an incident, we would discover a new way the two sets of services were coupled together. For a long time, we had more microservices than developers.
00:07:39.960
In 2017, we abandoned our microservices architecture in favor of a monolith. Rather than everything having to call through this centralized message bus, developers could just make calls through the same process. This introduced new complexities around managing connections to two databases within one application. Unlike the previous architecture where if one service was down the site was ideally still up, in this new architecture, if MongoDB was down, the application wouldn't start and our site would be fully down.
00:08:05.670
Throughout this journey, we didn’t have a true impetus for using a second database. There wasn’t a technical problem we were solving with it; we could have implemented the same API using MySQL. Thus, the advantages that MongoDB presented weren’t really advantages for our application. As we began to rely on MongoDB, we had to mitigate some issues as well.
00:08:40.020
When we first introduced MongoDB, there were no hosted services for it, and we were running our own cluster on EC2 instances on AWS. This meant dealing with all the challenges of running a database yourself, such as setting it up with sensible defaults, capacity planning, and maintaining backups. One time, we scaled the database down during the school holidays and forgot to scale it back up before school resumed. When all our users came back online, we took the site down on the first day of school.
00:09:06.990
We also had to maintain our own backups and monitor the cluster. On one occasion, we were doing maintenance on a cluster that was set up on EC2 instance storage. If you've used storage on AWS before, you'll know that EBS volumes can outlive the instance they're attached to, sort of like an external hard drive. Instance store is different: it only survives as long as the instance is running. It's basically a computer that wipes its disk when you turn it off.
00:09:49.180
This setup provided some fast ephemeral storage, but it wasn’t ideal for persisting a database. During an upgrade, one of the three servers was shut down by an operator who thought they could just take the volume and reattach it to another server. However, when the instance was shut down, it deleted all the data on that volume. Fortunately, the other servers had a full copy of the database, and we were able to copy the database to a new server. However, this caused a few hours of downtime while we re-synced the data.
00:10:49.810
So, MongoDB saved the day with its replication of data across multiple servers. Still, handling these kinds of database upgrades ourselves required specialized knowledge, and we had several other outages. Most were application-level outages caused by schema mismatches after deploying new code; the new code wouldn't work with the existing data. We also had one major outage that significantly impacted us, which revolved around how our monitoring tools queried MongoDB.
00:11:42.200
Our metric server, Prometheus, asked our application every 15 seconds for the size of a collection. This became a problem when the collection scan behind that metric started taking longer than 15 seconds, so the scans began happening in parallel. The metric server kept asking, creating more and more connections, until the server had over a thousand concurrent connections, all running these parallel collection scans. As performance degraded, the server started dropping connections, but creating new connections isn't cheap.
00:12:55.360
As connections were dropped, new ones were created, causing a cascade. To explain why this matters: the load average reflects the number of threads waiting to run. It's typically between 0.5 and 4, and should ideally stay below the number of CPUs on the server. Our server had 8 CPUs, but the load average soared to 1,500 under all this connection churn.
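One mitigation for this class of problem, sketched here with invented names (this isn't the code we ran), is to compute the expensive count on a timer and serve metric scrapes from a cached value, so a slow scan can never pile up in parallel:

```ruby
require "concurrent" # concurrent-ruby

class CollectionSizeGauge
  def initialize(collection, interval: 60)
    @cached = Concurrent::AtomicFixnum.new(0)
    # Refresh the count in the background, one scan at a time.
    @task = Concurrent::TimerTask.new(execution_interval: interval) do
      @cached.value = collection.estimated_document_count
    end
    @task.execute
  end

  # Called from the /metrics endpoint; returns instantly, so scrapes
  # can never trigger overlapping collection scans.
  def value
    @cached.value
  end
end
```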
00:13:57.580
This incident showed that we hadn't spent enough time setting proper connection limits for our application. We had a lengthy discussion about whether to invest more effort in getting these settings right for MongoDB and tuning it further. Given that it had taken several years of testing our MySQL settings in production before we gained confidence in them, we decided instead to remove MongoDB and double down on our investment in MySQL.
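For reference, the connection limits in question are set on the client. A sketch with illustrative values (not the settings Stile used): capped pools turn an overloaded database into fast, visible errors rather than thousands of half-dead connections.

```ruby
client = Mongo::Client.new(
  ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
  max_pool_size: 20,            # hard cap on connections per server
  wait_queue_timeout: 1,        # seconds to wait for a free connection
  socket_timeout: 5,            # fail slow queries instead of queueing forever
  server_selection_timeout: 3   # give up quickly if no server is available
)
```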
00:14:39.630
Another challenge we faced with MongoDB was that we didn't use an ORM; we weren't using a framework like Ruby on Rails. We had to implement all the database interactions ourselves with the Ruby driver, which meant writing a lot of framework-like plumbing ourselves. Since MongoDB lacks a schema, we attempted to create one by writing YAML files that defined each collection and the types of data we were storing, and then used those files to code-generate Ruby models.
00:15:53.780
We also code-generated repository code for each collection, but these autogenerated repositories quickly became difficult to manage as developers made manual edits to them. We ended up with many near-identical files that had drifted apart in functionality, making maintenance cumbersome. Eventually, we realized that the time we spent generating these models wasn't buying us any advantage over using our existing MySQL database.
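To give a flavour of this setup (the collection and fields are invented, reconstructed from the general approach rather than the actual files):

```ruby
# A YAML definition along these lines...
#
#   collection: lesson_events
#   fields:
#     lesson_id: String
#     kind:      String
#     payload:   Hash
#
# ...was fed to a code generator that produced typed models roughly like:
class LessonEvent
  ATTRIBUTES = { lesson_id: String, kind: String, payload: Hash }.freeze

  attr_reader(*ATTRIBUTES.keys)

  def initialize(attrs)
    ATTRIBUTES.each do |name, type|
      value = attrs.fetch(name)
      raise TypeError, "#{name} must be a #{type}" unless value.is_a?(type)
      instance_variable_set("@#{name}", value)
    end
  end

  # Serialize back into a document for the Ruby driver.
  def to_document
    ATTRIBUTES.keys.to_h { |name| [name.to_s, public_send(name)] }
  end
end
```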
00:16:53.980
We were investing more effort in growing and running MySQL, where we benefited from a managed service provided by AWS; in contrast, managing MongoDB ourselves had become too much of a burden. We first considered moving it to a hosted service, but realized that removing it from our application entirely would simplify our technical design. Once we decided to remove MongoDB, we needed to do it without downtime, and without losing data during the migration.
00:18:55.310
First, we defined a MySQL schema that fit the shape of the data we were storing and handled the possibility of missing fields in existing documents. Next, we needed to migrate the data. An offline migration, where you turn your application off, migrate all the data, and then turn the application back on, would have been too risky.
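Before getting into the migration mechanics, here's roughly what that schema-definition step looks like (table and column names invented, not our actual schema): document fields become columns, and fields that might be absent in older documents become nullable columns.

```ruby
require "mysql2"

client = Mysql2::Client.new(host: "localhost", username: "app", database: "stile_example")

client.query(<<~SQL)
  CREATE TABLE lesson_events (
    id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    mongo_id    CHAR(24) NOT NULL UNIQUE,  -- original ObjectId, for reconciling the two stores
    lesson_id   VARCHAR(64) NOT NULL,
    kind        VARCHAR(32) NOT NULL,
    reliability TINYINT NULL,              -- nullable: missing from older documents
    created_at  DATETIME NOT NULL
  )
SQL
```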
00:19:28.670
With an offline migration, if corrupted data were written into MySQL, you wouldn't find out until your application was back up. Instead, we opted for a migration conducted in small steps, each of which we could roll back if something went wrong. We began writing records into both MongoDB and MySQL, and performing updates and deletes against both databases.
00:20:41.920
The advantage of this dual-writing approach is that you can keep using your old database while migrating to the new one. For each write, we wrote the record into MySQL first, then into MongoDB, ensuring both databases held identical records. A backfill script then fetched existing records from MongoDB and wrote them into MySQL. If the migration went wrong, we could simply delete all the data in MySQL and repeat the process.
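A minimal sketch of the dual-write step (the class and method names are hypothetical): every write goes to MySQL first, then MongoDB, while MongoDB remains the source of truth for reads.

```ruby
class EventRepository
  def initialize(mysql:, mongo:)
    @mysql = mysql
    @mongo = mongo
  end

  # Dual write: the new store first, then the old store, which is
  # still authoritative for reads during the migration.
  def create(event)
    @mysql.insert(event)
    @mongo.insert(event)
  end

  def update(id, changes)
    @mysql.update(id, changes)
    @mongo.update(id, changes)
  end

  def delete(id)
    @mysql.delete(id)
    @mongo.delete(id)
  end
end
```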
00:21:04.040
While this migration was running, we continued accepting writes into both data stores. We also needed a mechanism to switch where we read from, so we could move from reading documents out of MongoDB to reading rows out of MySQL. To build confidence in the new path, we ran both implementations concurrently and compared their results.
00:22:56.660
To test accuracy, we used a library called Scientist, published by GitHub. It calls the existing methods alongside the new MySQL methods for a small percentage of requests, which let us test the MySQL code incrementally. This gave us performance metrics and surfaced discrepancies between MySQL and MongoDB.
00:23:43.310
Scientist logged any differences it found, allowing us to debug and fix issues accordingly. Without this approach, we risked returning invalid data, such as deleted records, to real users.
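Scientist's basic pattern looks like this (the experiment name, methods, and sample rate are illustrative): the old MongoDB read stays authoritative, the new MySQL read runs alongside it for a sample of calls, and mismatches are reported through the experiment's publish hook.

```ruby
require "scientist"

class EventReads
  include Scientist

  def find_events(lesson_id)
    science "events-read-path" do |e|
      e.use { mongo_find_events(lesson_id) }  # control: value actually returned
      e.try { mysql_find_events(lesson_id) }  # candidate: run and compared, never returned
      e.run_if { rand < 0.05 }                # sample ~5% of calls to start
    end
  end
end
```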
00:24:01.720
One critical error we detected was an off-by-one error when indexing into an array, which caused us to write the wrong value for a reliability field into MySQL; the counts of records with a bad reliability value were vastly different between MongoDB and MySQL. Another was a configuration problem in MySQL: without strict mode, it accepted values not defined in an ENUM column, silently writing empty strings instead of raising an error.
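That second bug comes down to MySQL's permissive default SQL mode: an invalid ENUM value is coerced to the empty string with only a warning. A sketch, assuming a hypothetical ENUM column called kind:

```ruby
# Under the permissive default, this stores '' silently (plus a warning).
# With strict mode enabled, it raises instead:
client.query("SET SESSION sql_mode = 'STRICT_ALL_TABLES'")
client.query("INSERT INTO lesson_events (kind) VALUES ('not-a-real-kind')")
# => Mysql2::Error: Data truncated for column 'kind' at row 1
```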
00:24:58.270
After we fixed these issues, we successfully migrated all the data across to MySQL. Following a few weeks of running this dual-writing approach, it became clear that the new MySQL code consistently returned the same results as MongoDB, allowing us to finally switch to primarily reading from MySQL.
00:25:27.750
We kept Scientist running alongside the new MySQL code, which meant we could roll back at any time, since all records were still being written to MongoDB. Switching over simply involved changing which method we called: reading from MySQL rather than MongoDB.
00:26:38.040
At that point, we wanted to make sure no remnants of MongoDB were still lurking in the background. We used a technique we called request cursing: we added a special HTTP header to our API requests, and any code path that touched MongoDB while handling a cursed request would raise an 'unavailable' error, simulating an outage for testing purposes.
00:27:16.700
This allowed us to incrementally exercise different parts of the site and confirm they weren't using MongoDB. We also employed a tracing tool called Jaeger, wrapping calls to the Ruby driver to give us insight into which parts of the application still relied on MongoDB.
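One way to implement that kind of check, sketched with an invented header name and flag (the Ruby driver's command monitoring API is real, but this isn't our exact code): a global subscriber raises whenever any MongoDB command fires during a cursed request.

```ruby
CursedRequestError = Class.new(StandardError)

class CurseSubscriber
  def started(event)
    if Thread.current[:mongo_cursed]
      raise CursedRequestError, "MongoDB used during cursed request: #{event.command_name}"
    end
  end

  def succeeded(_event); end
  def failed(_event); end
end

Mongo::Monitoring::Global.subscribe(Mongo::Monitoring::COMMAND, CurseSubscriber.new)

# In a Rack middleware, set the flag when the special header is present:
#   Thread.current[:mongo_cursed] = env.key?("HTTP_X_CURSE_MONGODB")
```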
00:27:58.990
Once we identified all the dependencies on MongoDB and proved that the migration was successful, we could simply remove the code and the data related to MongoDB from our application.
00:29:16.600
Throughout this migration process, we had the opportunity to roll back at various stages, giving us confidence in our transition. After running MySQL in our application for months, we were able to remove MongoDB, as it was no longer required.
00:30:01.000
This success illustrates the importance of migrating safely behind a stable API, and of simplifying applications by questioning technologies that no longer serve our needs.
00:30:09.040
Thank you!