RailsConf 2014

An Ode to 17 Databases in 33 Minutes

by Toby Hede

In the video 'An Ode to 17 Databases in 33 Minutes,' Toby Hede presents an engaging overview of various database technologies. The session is characterized by humor and technical insight, aimed at giving a whirlwind introduction to both modern and less contemporary databases that can be utilized in a Rails application. Toby discusses 17 different databases while addressing underlying principles that influence their use.

Key Points Discussed:
- Introduction to Databases: Toby opens with anecdotes and humor, setting a casual tone while emphasizing the fun complexities of distributed systems and databases.
- SQL and NoSQL: He distinguishes traditional relational databases that adhere to ACID principles from NoSQL databases, introducing concepts like the CAP theorem (Consistency, Availability, Partition Tolerance) as core reasoning for choosing one type of database over another.
- Feature-Rich Technologies:
  - PostgreSQL: Highlighted as a feature-rich relational database, Toby points out its capabilities with documents through JSON features.
  - Redis and Memcached: Discussed as robust solutions for caching, with Redis also recognized for its advanced key-value store functionalities.
  - Cassandra and Riak: These databases are introduced for their eventual consistency and inherent clustering features, useful for large-scale operations.
- Emerging Technologies and Concepts: Toby mentions innovative databases such as Hyperdex and Neo4j, emphasizing their specialized functionalities in data indexing and graph databases, respectively.
- Traditional vs. Modern Solutions: The importance of understanding the trade-offs in choosing between various systems is noted, including maintaining compatibility and cost considerations.

Toby's conclusion encourages further exploration and discussion of database technologies, inviting others to engage with him during the conference. His light-hearted yet informative approach provides an engaging platform to understand the evolving landscape of database technologies and their application in real-world situations.

00:00:15.930 Good morning everybody! Yes, it's Friday, and it's been a long week. I'm excited and highly caffeinated. Without further ado, I present "An Ode to 17 Databases in 33 Minutes." I'm going to mangle a large number of metaphors and there will be a lot of animated GIFs. I've loaded a bunch this week!
00:00:28.240 This whole thing started as a joke: 17 databases in five minutes. I thought 33 minutes was worse—it's really just a catastrophe. However, we are going to cover a whole bunch of different databases, a little bit of the underlying theory, and hopefully you will walk out understanding why to use Postgres.
00:00:55.180 I'm Toby Hede, and you can find me on the Internet. I work with a company called Ninefold, and they kindly brought me here from Australia. This explains why I sound like I come from the deep south! I’ve been dealing with jetlag most of this week. Today I'm finally over that, just in time to go home and have it all over again next week.
00:01:26.730 A couple of quick facts about Australia: We use fewer syllables than you're used to! Here’s a genuine Australian politician, a mining magnate billionaire, who is currently running an MVP Jurassic theme park with giant glass dinosaurs. And I, for one, am all for it. I realized that I didn't include enough Star Wars references, so here's another gratuitous mention!
00:01:54.580 The thrust of this talk is that distributed systems are hard and databases are fun. Picture this distributed system: there are two app nodes and a master/slave type setup going on. We're going to talk about some of the complexities of running these kinds of systems, which is really fun stuff once you get under the covers and start thinking about it.
00:03:02.980 NoSQL is a term we use today, but now we also have NewSQL. I will cover various databases like Postgres, PostgreSQL, and AmbientSQL—all of which make my brain explode! The key to understanding these things is to think about what's happening underneath so you can make informed decisions about your databases.
00:04:01.769 I hope you're all familiar with some of the concepts of traditional relational databases. We have ACID compliance, which gives certain guarantees about data behavior. You can update data and be sure it was updated, ensuring that things are isolated from each other and persist over time.
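As a rough illustration of the guarantees described above, here is a minimal Python sketch using the standard library's `sqlite3` module as a stand-in for any ACID-compliant relational database. The `accounts` table and the `make_db`, `transfer`, and `balances` helpers are hypothetical names invented for this example; they are not from the talk.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two example accounts (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

Calling `transfer(conn, "alice", "bob", 500)` on a fresh database fails the balance check on the second statement, and the already-applied credit to `bob` is undone along with it, which is the "all or nothing" behavior the transcript refers to.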
00:04:20.970 Another important concept you may have heard of is the CAP theorem. This gets talked about a lot when we start discussing the new generation of databases. CAP stands for Consistency, Availability, and Partition Tolerance, which serves as a strong foundation for reasoning about how distributed systems behave and how they interoperate.
00:04:42.240 The original CAP theorem, known as Brewer's conjecture, proposed that a data store can only guarantee two of the three properties at any one time. In essence, data can be consistent, available, or capable of handling network failures, but never all three at once. Later, researchers made a formal proof stating that it's impossible in an asynchronous network model to implement a read-write data object that is simultaneously available and atomically consistent.
00:06:01.260 All discussions around NoSQL, NewSQL, and various other database technologies revolve around manipulating these different variables. While there's another concept called BASE, I won't focus on it because it’s simply a made-up acronym with no relevance to our discussion.
00:06:20.400 What we’re talking about with CAP is important because everything today is inherently distributed. For instance, you have a browser talking to a server, which is then communicating with a PostgreSQL or MySQL database or something fancier. As we increasingly rely on client-based operations, we face many of these distributed problems.
00:07:04.440 This is a simplified and somewhat inaccurate guide to NoSQL systems, splitting them into those that are available and those that are consistent. The CAP theorem states that under network failure, you get to choose between consistency and availability, but understanding this can be quite challenging.
00:07:26.780 Imagine a situation with a typical cluster of nodes working together. When a write occurs on one node, that data gets replicated across the system. If the nodes lose the ability to communicate due to a network partition, we must decide how to handle that write. You might have one node communicating while the other does not, leading to inconsistent reads. This complexity applies whether you are using master/slave setups in relational databases or additional, trickier configurations.
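The partition scenario above can be sketched as a toy model in Python. This is a deliberately simplified illustration, assuming just two replicas and a single `partitioned` flag; the `Cluster`, `Replica`, and `Unavailable` names are hypothetical, and real systems use quorums, timeouts, and conflict resolution that are not modeled here.

```python
class Unavailable(Exception):
    """Raised when a consistency-first cluster refuses a write."""


class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas; `mode` decides what happens to writes during a partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, key, value):
        peer = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Choose consistency: refuse the write rather than diverge.
                raise Unavailable("cannot reach peer replica")
            # Choose availability: accept locally; the replicas now disagree.
            replica.data[key] = value
            return
        # Healthy network: apply locally and replicate to the peer.
        replica.data[key] = value
        peer.data[key] = value

    def read(self, replica, key):
        return replica.data.get(key)
```

In `"AP"` mode a write accepted on one side of the partition leaves the other replica serving the old value, which is the inconsistent read described above. In `"CP"` mode the same write raises `Unavailable` instead.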
00:08:58.740 CAP theorem discussions often lead to claims that some database has 'defeated' it. What's crucial to remember is that when two nodes cannot communicate, keeping their data coherent is a fundamental constraint, not an implementation detail.
00:09:10.410 If you want to dive deeper into the intricacies of how these systems perform, I recommend researching a project called Jepsen. It's a rigorous examination of how various distributed systems behave under network partitions, and it can seriously expand your understanding of these concepts.
00:10:12.560 We are about to go on an adventure that’s like traversing a tortured maze of ridiculous Dungeons & Dragons metaphors! But first, let's examine a unique creature: the owlbear! The least scary aspects of a bear combined with an owl make for a whimsical, albeit peculiar, character.
00:10:52.570 Postgres, as we all know, is the 'MySQL for hipsters.' It’s a relational database with a consistent model. Under certain conditions, such as a network partition that cuts your slave off from the master, it essentially becomes unavailable.
00:11:10.320 What’s interesting about Postgres is that it includes a lightweight key-value store called Hstore. If you're already running Postgres in production, you don't need to spin up anything else. With Hstore you can perform joins and index keys as if they were full-fledged database objects.
00:11:45.720 The most exciting feature of Postgres is its support for JSON. Versions 9.2 and 9.3 added JSON support that amounts to a document database, and the upcoming 9.4 release promises high-performance JSON capabilities, so developers can work with documents very effectively. If you want a document database, you already have one with Postgres.
00:12:20.560 MySQL can be seen as akin to Postgres, but with some caveats. It's widely used and offers performance flexibility through the ability to switch storage engines. Since Oracle’s acquisition, alternatives have emerged, such as MariaDB, which is a more open fork offering high-performance features.
00:12:52.720 In addition, there's a company called Percona that provides compatibility patches, giving MySQL users additional capabilities. There's also Tokutek, known for crazy fractal indexing for large datasets, making it efficient while retaining the familiar structure. These are interesting developments in the database space.
00:14:03.670 Much of what we now call NoSQL originates from a paper released by Amazon called Dynamo, outlining how to create a distributed system. Interestingly, Riak is essentially an implementation of the underlying Dynamo theory, providing a well-engineered and simple system that inherently understands clustering.
00:14:56.370 Riak has features like cloud storage compatibility with S3 and it effectively partitions data using consistent hashing. It’s great for large-scale data because it has nice operational characteristics, making it easy to manage. The API is straightforward, enabling users to store JSON documents and utilize secondary indexes for efficient data retrieval.
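To make the consistent hashing idea concrete, here is a minimal, generic sketch in Python. It is not Riak's actual implementation: the `HashRing` class, the choice of 64 virtual nodes per server, and the use of the first 8 bytes of SHA-1 as a ring position are all assumptions made for illustration.

```python
import bisect
import hashlib


def _position(key):
    """Map a string to a point on the ring (first 8 bytes of its SHA-1 digest)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Keys and nodes share one hash space; a key belongs to the next node clockwise."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) entries
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Several virtual positions per node spread its share around the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # (pos,) sorts before any (pos, node), so this finds the first entry at or after pos.
        index = bisect.bisect_right(self._ring, (_position(key),))
        return self._ring[index % len(self._ring)][1]
```

The property that matters operationally is that removing a node only reassigns the keys that node owned; every other key stays where it was, so a cluster change does not reshuffle the whole dataset.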
00:15:23.300 Next up is Google’s BigTable. This sophisticated piece of technology came from their internal research and is accessible via some cloud properties. It acts as a sparse distributed multi-dimensional sorted map, capable of handling hundreds of petabytes of data with fantastic performance in operations.
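The phrase "sparse distributed multi-dimensional sorted map" is easier to picture with a small single-process sketch. The Python below models only the "sparse, multi-dimensional, sorted" part, assuming cells keyed by row, column, and timestamp; the `SparseSortedMap` class and its method names are hypothetical, and distribution across servers is not modeled.

```python
import bisect


class SparseSortedMap:
    """Maps (row, column, timestamp) -> value, with keys kept in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key tuple -> stored value

    def put(self, row, column, timestamp, value):
        # Negating the timestamp sorts the newest version of a cell first.
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value stored for a cell, or None if the cell is empty."""
        i = bisect.bisect_left(self._keys, (row, column))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1
```

Only cells that were actually written take up space (the "sparse" part), and because keys are sorted, a scan over a contiguous range of row keys is a single ordered walk.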
00:15:56.130 Google has developed other technologies like Spanner and F1, which provide capabilities for handling relational data across multiple data centers, pushing the boundaries of the CAP theorem. They use GPS in every server and atomic clocks in data centers to maintain accurate time.
00:16:29.120 Let's take a moment to discuss Cassandra. It's a column-oriented database focused on eventual consistency. What’s particularly intriguing about Cassandra is that it also acts as a sparse distributed multi-dimensional sorted map, making it similar to BigTable in how it structures data, allowing for greater efficiency in certain types of queries.
00:17:03.060 Column-oriented databases invert the usual table structure found in relational databases, providing significant speed advantages for time series queries and other specific data types. Although Cassandra's complexity may initially overwhelm users, it increasingly abstracts away the complex definitions behind a more familiar SQL-like structure.
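A tiny Python sketch can show what "inverting the table structure" means. The sensor readings and the `to_columns`, `average_from_rows`, and `average_from_columns` helpers are made up for illustration, and the sketch assumes every row has the same fields; real column stores add compression, on-disk layout, and much more.

```python
# Row-oriented layout: each record keeps all of its fields together.
READINGS = [
    {"ts": 1, "sensor": "a", "temp": 20.0},
    {"ts": 2, "sensor": "a", "temp": 22.0},
    {"ts": 3, "sensor": "b", "temp": 24.0},
]


def to_columns(rows):
    """Invert row records into one list of values per field."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def average_from_rows(rows, field):
    # Touches every whole record just to read one field from each.
    return sum(row[field] for row in rows) / len(rows)


def average_from_columns(columns, field):
    # Reads one contiguous list and never looks at the other fields.
    values = columns[field]
    return sum(values) / len(values)
```

Both functions return the same average, but the column layout only reads the `temp` values, which is why aggregations over a single field of a long time series tend to favor it.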
00:17:41.490 Next, we encounter Memcached. Though technically a cache and not a database, it plays a significant role in your system's distribution. It's rock-solid in its simplicity and widely used. Many projects have built similar APIs, creating a widespread familiarity with Memcached.
00:18:06.400 Redis is also of interest to the Rails community, mainly as a queue. Redis is beautifully engineered, incredibly robust, and has a simple API, although distributing it can be slightly tricky. It allows for various data structures, such as hashes, lists, and strings, and optimizes locking to handle batch operations.
00:18:58.700 Within the Redis ecosystem, users can implement background tasks easily, leveraging its capabilities for smooth integration. Redis' Lua scripting enhances its functionality, allowing users to perform complex operations efficiently.
00:19:23.840 Moving on to Neo4j, a graph database optimized for connections rather than aggregated data. It represents a collection of nodes, each potentially connected through relationships that hold semantic meanings. Neo4j significantly simplifies querying for graph problems such as social networking.
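The node-and-relationship model described above can be sketched in plain Python. This is not Neo4j's API or query language; the `build_graph` and `within_hops` functions and the example data are hypothetical, and the traversal is an ordinary breadth-first search over typed edges.

```python
from collections import deque


def build_graph(edges):
    """Turn (source, relationship, target) triples into an adjacency map."""
    graph = {}
    for source, rel, target in edges:
        graph.setdefault(source, []).append((rel, target))
    return graph


def within_hops(graph, start, rel_type, max_hops):
    """Return {node: hops} for nodes reachable from start via rel_type edges."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for rel, target in graph.get(node, []):
            if rel == rel_type and target not in seen:
                seen[target] = seen[node] + 1
                queue.append(target)
    del seen[start]
    return seen


# Illustrative data: a small social graph with two relationship types.
EDGES = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("carol", "FOLLOWS", "dave"),
    ("alice", "LIKES", "post1"),
]
```

A "friends of friends" question becomes a short walk along relationships rather than a chain of self-joins, which is the kind of query a graph database is built to make cheap.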
00:20:12.099 MongoDB is known for its scalability and is described humorously as 'web-scale.' While it has its flaws, the model of treating objects like databases promotes a developer-friendly experience. However, reliance on MongoDB as a one-size-fits-all solution without understanding its operational challenges isn't recommended.
00:21:05.000 RethinkDB approaches the database world with a focus on strong correctness first, then usability, in contrast to its counterparts. They aim to create operationally robust systems that handle queries appropriately, although the technology is still evolving.
00:21:52.000 Many commercial databases exist, often emerging from the open-source world. These include Couchbase, a hybrid of CouchDB and memcached. They're all competing for market share, each with its unique features and cost structures, yet many systems ultimately lead users back to trusted solutions like Postgres.
00:22:52.340 HyperDex is my favorite because of its unique hyperspace hashing methodology, designed to make querying extremely efficient in multi-dimensional spaces. Using such technology allows users to construct genuine queries based on indexed properties of stored objects, a fascinating advancement in database technology.
00:23:50.740 With developments happening rapidly in the database field, it’s essential to keep pace. More traditional systems may require extensive installation enhancements, similar to the complexities faced in large big data scenarios, as is the case with Apache Hadoop or HBase, which continue to gain traction among larger tech giants.
00:24:50.210 In the end, the world of databases is vast and expanding. As technology evolves, it becomes crucial to explore new possibilities and consider what might work best for your applications. I have been Toby Hede, and if you want to discuss databases, I'm here at the conference. I consider myself somewhat of a butterfly collector of databases, so feel free to come say hi!