An Ode to 17 Databases in 39 minutes

Talks

Toby Hede

An Ode to 17 Databases in 39 minutes

by Toby Hede

In this talk titled "An Ode to 17 Databases in 39 minutes" by Toby Hede at RubyConf AU 2014, the speaker explores a diverse array of database technologies alongside the complexities of distributed systems, utilizing humor and analogies to engage the audience. The discussion emphasizes the importance of understanding various database options beyond the commonly used PostgreSQL and MySQL, highlighting when these alternatives could be beneficial in Ruby on Rails applications. The delivery is punctuated with animated GIFs and references to Dungeons and Dragons to illustrate points, making the talk both informative and entertaining. Key points covered include:

Database Fundamentals: An overview of ACID properties (Atomicity, Consistency, Isolation, Durability) and the CAP theorem (Consistency, Availability, Partition tolerance), explaining their significance and the trade-offs involved in database design.
Diverse Database Technologies: The session reviews various databases, starting with PostgreSQL as a reliable relational database, followed by MySQL, DynamoDB, Riak, Cassandra, and others, explaining their unique strengths and applications:
- PostgreSQL: Strength in consistency and relational capabilities, especially with JSON support.
- MySQL: Ubiquity and performance, with alternatives like MariaDB and Percona Server offering community-driven enhancements.
- DynamoDB and Riak: Examined for their innovative approaches to distributed key-value storage and tunable consistency.
- Cassandra: Noted for its distribution design and real-world application by companies like Netflix.
- Redis: Highlighted for its simplicity and in-memory data structure versatility.
- Neo4j and MongoDB: Discussed for their unique approaches to data relationships and scalability, respectively.
- Other databases include RethinkDB, Elasticsearch, and Hadoop, where the speaker mentions the importance of understanding their architectures and functionalities.
Conclusion and Takeaways: Toby emphasizes the evolution of databases, encouraging developers to broaden their perspectives on available technologies while acknowledging the challenges posed by distributed systems. The key takeaway is the recognition that while traditional databases like PostgreSQL and MySQL remain valuable, understanding the full range of modern databases can empower more effective data management in diverse applications. The talk concludes with praise for Elasticsearch's indexing capabilities and a reminder of the importance of experimenting with new database paradigms.

00:00:07.560 Um, this is a lightning talk that has run amok, basically, about 17 databases in however many minutes we have. There will be a bunch of animated GIFs and probably some Star Wars references. Actually, there's no Star Wars, so I don't know why I said that. Also, there will be some Dungeons and Dragons references for no real reason. I must confess, I am the walking embodiment of a stereotype that I created in my own mind back in 1992, and I've been keeping that dream alive ever since.

00:00:20.600 This is our cast that we're going to look at today—a bunch of different things, some of which you have probably seen and heard of, and some of which you may not. Some of these concepts are quite complicated. The general theme of this talk is that it's a bad idea, but hopefully, I won't hurt myself too much. I'm Toby, and I work at 9fold. Let's get cracking!

00:00:43.320 Distributed systems are really hard. Depicted here is, of course, a distributed system. Databases are very fun, but particularly in the Rails world, people tend to gloss over the fact that at the heart of most of your systems and code, there is a database of some description. There is always something that contains the data. Conversely, everything is hard and fun at the same time.

00:01:03.400 Quick word from people who aren't my sponsors: NoSQL or NewSQL and various other flavors of SQL drive me absolutely nuts. This is because SQL is the least of the problems that we are trying to solve in this world. Hopefully, you have heard of the concept of ACID, which is a set of properties that a relational database (like PostgreSQL or MySQL) guarantees around how your data is stored and interacted with. It stands for Atomicity, Consistency, Isolation, and Durability. These guarantees are handy; we store some data, and we can get it back out. Many people can store some data, resulting in a somewhat consistent view of what that data is. There's tons of math behind this, and people have been doing it for a very long time.

00:01:49.799 On the other side of this, where the more fashionable and trendy hipsters come in, is the CAP theorem or Brewer's conjecture. This theorem deals with Consistency, Availability, and Partition tolerance. ACID and CAP have essentially nothing to do with each other, which is why people often get confused when talking about the different properties that databases have. The consistency in CAP is not the same as the consistency in ACID, and that's a subtle but important point. I will explain what all of that means.

00:02:18.400 Brewer's conjecture and the formal proof of the CAP theorem are almost but not quite entirely alike. The problems actually started when a guy named Brewer presented a slide in a keynote he gave back in 2000. He theorized that when it comes to distributed systems, you can only have two of these three properties: Consistency, Availability, and tolerance to network partitions. There were only a few slides on this topic in that entire keynote, and that was all he really said.

00:02:50.560 In a typical way of good scientists, some people took that idea, saw a grain of truth in it, and created formal proofs to explore whether it was indeed true. The actual theorem states that it is impossible in an asynchronous network model to implement a data object that guarantees availability and atomic consistency simultanously. This is where the trouble in distributed systems really starts and ends.

00:03:06.079 Back to distributed systems: everything is distributed. The web, as a model, is inherently distributed, and that seems self-evident. You have a browser and a server, which means you are instantly building distributed systems. Your Rails app is also distributed, as you have code running on an application server talking to a database. If they are communicating across a network, even if it's local, you are building a distributed system. Most of the time, we ignore this fact and get away with it to varying degrees of success.

00:03:16.560 This is completely unreadable, but that doesn't matter. It was basically someone trying to categorize different databases in the modern era. If you spend a weekend at a hackathon, you too could start a NoSQL startup—it's a highly recommended career opportunity. They attempted to categorize different databases, thinking about how consistency ensures that data is always the same for everyone, while availability means clients can access the data. Partition tolerance is about the ability of the system to function when networks go down or systems stop communicating.

00:03:55.360 The way this gets misinterpreted is that you can pick two of these properties, but that's not quite accurate. What we are actually saying is that when there's a partition, and two systems cannot communicate, you must choose between being consistent or being available.

00:04:11.360 Let me demonstrate this with a picture. We have a cluster of nodes communicating with each other. I've worked in a company that resembled this in stand-ups, and you'd say 'no, we don't do stand-ups; it's fine.' We have two nodes, and they are inherently distributed. Data is coming into one node—that's cool. We have data being transferred to the other node, and they are synchronizing or replicating in some way.

00:04:45.280 We have read requests coming off the other node. This all seems great until we sever the communication between them. At this point, the writes are coming into one node, which knows the truth of the bank account balance or the client's phone number. At this moment, the other node cannot possibly maintain a consistent state anymore; it cannot accept the new data because it cannot communicate.

00:05:05.840 What all of this is trying to convey is that if two nodes can't communicate, they cannot attain consistency. The data can be updated in one place, and the other node won't know about it. These discussions about the CAP theorem and other insights are trying to illustrate this notion of distributed systems.

00:05:25.120 So now that we have a foundational understanding of the science behind distributed systems, let me introduce you to our adventurer party that we will be discussing today. Also, as an aside, what is the deal with the Albear? Some pointed out to me that there are no Al experts—the Albear is the French version, while the Albear is a Dungeons and Dragons monster that is a hybrid of an owl and a bear. Interestingly, it possesses the least scary properties of either animal, as if the bear could fly and had big teeth, that would certainly be scarier than a lumbering, pecking creature.

00:06:02.079 It doesn't matter, though. I realized that on Monday, which is why it's included here. So let's start with something simple: PostgreSQL. It's like MySQL for hipsters; it's totally awesome. The exciting aspect of PostgreSQL these days is its support for JSON, which is a very good thing. PostgreSQL is one of those technologies that gives people a warm glow of satisfaction that it's done right. It's solid engineering, and it's a database many of you have likely used.

00:06:42.560 It's consistent and relational. It takes data consistency as its worldview and prioritizes keeping the data consistent rather than confirming availability. This is the reason we often have single points of failure in our Rails apps: the primary database might go down, and essentially your application is down until the slave can come back up. This is the model of operations in this worldview.

00:07:16.520 HStore is an interesting feature—it's essentially a key-value store embedded in PostgreSQL, which is definitely worth looking at. Thanks to PostgreSQL 9.2 and 9.3 releases, there is now almost complete JSON query capability. I would argue that databases like PostgreSQL are not truly distributed systems. You can build variations that create a semblance of distribution with failover and replication, but it isn't quite the same.

00:07:54.880 HStore allows you to embed key-value pairs into a single column. The really cool part of this feature is that you can index it just like a normal value, enabling you to have nested structures in your HStore that you can index and utilize across tables. There's already an ActiveRecord builder for handling this regarding JSON, and developers are working on merging the JSON and HStore engines to share the same perspective.

00:08:44.560 You can store JSON objects and index the queries around them. The new capabilities allow you to query into JSON structures and engage in many of the operations databases kind of excel at. However, we have plenty more to discuss, so let's get into MySQL.

00:09:05.000 It's the same difference, right? But if you are looking to use a database operated by an evil overlord, MySQL is increasingly the choice. It’s not animated, unfortunately, but that’s the reality. MySQL’s predominant advantage, I guess, is its ubiquity. PHP can be blamed for that; it’s like a self-replicating virus where every server you spin up has been infected by PHP and MySQL.

00:09:35.720 On some benchmarks, the PHP-MySQL combination is significantly faster than native solutions in Java and various other languages, even though PHP itself isn’t remarkably performant. The optimization in this area has been extensive—even their buzzword compliance is impressive. They introduced a NoSQL API that resembles throwing together a bunch of random terms from a keyboard into MySQL to compel developers to consider selling it.”

00:10:03.720 However, many people are now exploring alternatives, such as MariaDB, which is a community-driven fork of MySQL that promotes freedom, as well as higher performance solutions like Percona Server, built for exceptional MySQL performance. TokuDB is another noteworthy option, which employs fractal indexes in MySQL, yielding tremendous performance improvements for certain workloads, especially geospatial data.

00:10:42.480 Next, we delve into our first postmodern database, DynamoDB. It emerged from a paper published by Amazon, which laid the groundwork for managing distributed systems. The upcoming databases we will discuss align with this model. DynamoDB introduced some excellent innovations in pricing; for instance, if you check the Amazon calculator, you can pay per query.

00:11:18.040 DynamoDB is essentially a massive key-value store with latency guarantees. It is highly tunable to specific workloads but, of course, as with any key-value store, you will need to jump through hoops to execute complex queries. It's like having a giant hash table in the sky. If your manager asks about monthly revenue, you might be left clueless because you have to iterate through everything.

00:11:45.440 With SQL databases, the queries are executed for you, even though there’s still iteration involved. What’s interesting about the Dynamo paper is that it popularized a model for distribution, which is critical for those interested in exploring distributed systems.

00:12:12.760 Next, we have Riak, which is essentially an implementation of the Dynamo paper. Riak is fantastic, and I've spoken to many individuals interested in it over a few beers while discussing the Albear. It would be awesome to have a problem that could be solved with Riak. It's another key-value store, but what we call tunable consistency is utilized within it.

00:12:41.080 Adding a node is straightforward as it can easily join the cluster. There is no master-slave configuration; systems are built with the understanding that they are inherently distributed. You can add more nodes, and they just work together seamlessly. Riak provides a REST API and is also able to store JSON documents.

00:13:00.160 Operations teams love Riak as it enables the advantages of inherent clustering, offering benefits that are quite interesting. Additionally, it includes cloud storage features that are comparable to Amazon S3, allowing users to store blobs of content within it. However, its productivity can be driven by a lack of focus on centralizing anything, which leads to fascinating results.

00:13:40.580 Easily gaining extra points for converging replicated data types, which is a term not widely known, Riak invests in academic research that has bled into the commercial sector in a positive fashion. Convergent replicated data types (CRDTs) are designed for distributed use, supporting data types like counters and arrays that can be distributed across nodes while ensuring eventual consistency.

00:14:05.400 Data gets hashed as it comes in. Nodes coordinate amongst themselves to distribute this data redundantly. This model basically leads to the patterns presented in the Dynamo paper, where data is hashed and then distributed around a giant ring of hashes. Each node receives a different portion of the data, which is replicated across multiple nodes, improving performance and redundancy.

00:14:32.960 As you add more nodes, you enhance performance and redundancy. Internally, Riak uses a structure called a vector clock to track versions of the same data over time. Essentially, when you write data to a relational database, it is stored as a single record, but with distributed systems like Riak, the application must manage how to handle conflicts.

00:15:06.080 It is up to the application to comprehend the conditions leading to that state where different records may exist. Thus, the model of Riak is somewhat akin to using Git for version control—while some records might have conflicting commits, the application must determine how to clarify any discrepancies.

00:15:34.320 Riak facilities the creation of an elegant and simple API allowing you to manage buckets with names, keys, and associated values, enabling document retrieval. There’s a library built into Riak, with a gem called Ripple that allows you to store JSON documents and perform additional operations. Furthermore, Riak supports secondary indexes for queries and offers a wealth of resources for those interested in distributed data storage.

00:16:01.520 Next, we have Bigtable, which is Google's response to distributed storage solutions. I'm not going to discuss it in great detail as Google has designed it to manage hundreds of petabytes of data and operate at 30 million operations per second, which many companies cannot fathom. Thus, unless you have a similar scale of data, it’s often irrelevant.

00:16:27.840 Meanwhile, Google’s subsequent creations, Spanner and F1, demonstrate that you can build relational databases that are fully distributed and atomic by using GPS in every server and atomic clocks in every data center. These solutions are proof of concept for Google—an impressively engineered approach to distributed systems, even if it is out of reach for the bulk of us.

00:16:52.480 On the topic of distributed databases, Cassandra is another option many are considering. While it's not necessarily the most inspirational engineering compared to Riak, it has become increasingly relevant. Cassandra is designed for distribution and follows principles similar to those described in the Dynamo paper; I've looked at it extensively, and while I found early implementations rather frustrating, many improvements have been made since.

00:17:22.320 Older configurations required XML for structure definitions and a restart of the server every time a modification was needed, which can feel cumbersome and very Java-like. However, recent updates have introduced a query language that resembles SQL—providing users with a familiar experience as they manage a large cluster.

00:17:47.440 Cassandra organizes its data in columns, which can be a bit complex to explain. In the past, you were forced to consider the underlying structures in great detail. In this model, you no longer view data in terms of rows and tables; instead, a column acts as a key-value namespace, albeit much more complex.

00:18:14.760 For instance, Netflix uses Cassandra for a wide array of functionalities due to its distributed nature. They manage massive clusters, and operational efficiency has become commonplace—although raising a thousand nodes is quite the feat.

00:18:42.720 Moving on, I’d like to talk briefly about Memcached. Although Memcached isn’t a database per se, I thought it was worth mentioning because it represents the transitional state of your Rails application, where distributed properties become apparent and complicate your workflow.

00:19:14.840 Many applications can seamlessly integrate Memcached without giving much thought to the underlying implications. However, you now must consider issues such as invalidating data and maintaining state between your database and Rails application, both of which are managed by a distinct process—this encapsulates the essence of distributed systems problems.

00:19:39.960 Memcached is an excellent solution due to its simplicity, boasting just a handful of methods to work with. It is essentially a large hash in the sky, and it's friendly towards Rails, particularly for managing sessions and fragments.

00:20:01.640 Redis is another fascinating topic. Redis has gained traction within the Rails community. Many applications use it for queuing, but not all utilize it as a database. With libraries like Sidekiq and Resque, we see Redis as a queue—though, in essence, it is not one, it can function as such.

00:20:27.840 Redis offers key-value storage but stands out due to its versatile data structures that enable in-memory operations. It transcends the simple model of a key mapped to a blog of data; in Redis, you have arrays, sets, and hashes, allowing you to execute an impressive array of operations.

00:21:00.640 The architecture is exceptionally fast since it operates solely in memory and is cleverly engineered. However, it’s not truly a distributed system; it can provide seamless functionality until some hidden issues erupt due to naive replication methods.

00:21:29.840 Redis's simplicity offers transaction-like features via checks and set values, enabling seamless multiple-step operations. Many developers utilize it within Rails for task management.

00:21:55.440 On to Neo4j, a graph database which I find particularly intriguing; its technology is genuinely impressive. It fundamentally regards data as a series of interconnected nodes, and their introductory material showcases impressive solutions to problems.

00:22:16.920 However, Neo4j still encounters challenges typical of single-point-of-failure databases. Data resides within the system and continually streams out; thus, it becomes essential to contend with these inconsistencies.

00:22:37.600 Now, on to MongoDB. There are some interesting points regarding MongoDB's adaptation to web scalability. Many people assume that MongoDB is a fix-all solution, similar to what people thought of MySQL in its infancy. Anyone with technical knowledge should recognize that transactions are central to database functionality.

00:23:01.480 Some early adopters invested in SQL, like Slashdot, did not understand databases either. If you think about the critiques from the academic world, you'll find they have some substantial weight about MongoDB's consistency and operational assumptions.

00:23:23.520 There's a growing number of options emerging, like RethinkDB, which approaches database construction with a focus on correctness—ultimately leading to slower performances, but a much finer tune on the data-related structure. As an academic project, RethinkDB boasts a solid foundation built upon consistent key-value storage adhering to distributed characteristics.

00:24:13.120 This database model resembles Redis, yet it starts from the ground up with distribution in mind. The approaches taken by RethinkDB offer a sense of legitimacy seen in other platforms.

00:24:54.720 As we reach the conclusion, we come to Hadoop, which you don’t have to worry about as long as you can successfully install it. I have a friend, a bit of an eccentric, who refers to Hadoop as a means to solve so-called 'small to medium data problems,' as he's had his share of experiences but doesn't shy away from its handling of expansive data sets.

00:25:27.440 Now, we turn to Uni, which has recently released a more advanced version called Vetus. It's essentially a NoSQL offering that functions similarly to MongoDB while being embeddable. This could lead to some exciting applications, but as it stands, we are still hesitant to utilize these newer frameworks.

00:26:16.640 Finally, there's Elasticsearch. I have frequent discussions about why we shouldn't default to this as a data store, even though it’s officially a search engine. Elasticsearch allows you to index documents, especially JSON documents, delivering the clustered architecture inherent to distributed systems right from setup.

00:26:59.240 The powerful querying capabilities make it exceptionally versatile, further amplified via integrations with logging platforms like Logstash and Kibana. Overall, it’s an incredibly elegant system that empowers developers with creativity for a vast array of applications.

00:27:28.640 That was my talk. I think I am right on time, as I dare to hope. And so, it has been great sharing this journey through databases with you today. Thank you!

RubyConf AU 2014