00:00:15.930
Good morning everybody! Yes, it's Friday, and it's been a long week. I'm excited and highly caffeinated. Without further ado, I present "An Ode to 17 Databases in 33 Minutes." I'm going to mangle a large number of metaphors and there will be a lot of animated GIFs. I've loaded a bunch this week!
00:00:28.240
This whole thing started as a joke: 17 databases in five minutes. I thought 33 minutes was worse—it's really just a catastrophe. However, we are going to cover a whole bunch of different databases, a little bit of the underlying theory, and hopefully you will walk out understanding why to use Postgres.
00:00:55.180
I'm Toby Hede, and you can find me on the Internet. I work with a company called Ninefold, and they kindly brought me here from Australia. This explains why I sound like I come from the deep south! I’ve been dealing with jetlag most of this week. Today I'm finally over that, just in time to go home and have it all over again next week.
00:01:26.730
A couple of quick facts about Australia: We use fewer syllables than you're used to! Here’s a genuine Australian politician, a mining magnate billionaire, who is currently running an MVP Jurassic theme park with giant glass dinosaurs. And I, for one, am all for it. I realized that I didn't include enough Star Wars references, so here's another gratuitous mention!
00:01:54.580
The thrust of this talk is that distributed systems are hard and databases are fun. Picture this distributed system: there are two app nodes and a master/slave type setup going on. We're going to talk about some of the complexities of running these kinds of systems, which is really fun stuff once you get under the covers and start thinking about it.
00:03:02.980
NoSQL is a term we use today, but now we also have NewSQL. I will cover various databases like Postgres, PostgreSQL, and AmbientSQL—all of which make my brain explode! The key to understanding these things is to think about what's happening underneath so you can make informed decisions about your databases.
00:04:01.769
I hope you're all familiar with some of the concepts of traditional relational databases. We have ACID compliance (atomicity, consistency, isolation, durability), which gives certain guarantees about data behavior. You can update data and be sure it was updated, and you can rely on concurrent operations being isolated from each other and on the data persisting over time.
00:04:20.970
Another important concept you may have heard of is the CAP theorem. This gets talked about a lot when we start discussing the new generation of databases. CAP stands for Consistency, Availability, and Partition Tolerance, which serves as a strong foundation for reasoning about how distributed systems behave and how they interoperate.
00:04:42.240
The original CAP theorem, known as Brewer's conjecture, proposed that a data store can only guarantee two of the three properties at any one time. In essence, data can be consistent, available, or capable of handling network failures, but never all three at once. Later, Seth Gilbert and Nancy Lynch published a formal proof that it's impossible in an asynchronous network model to implement a read-write data object that is simultaneously available and atomically consistent.
00:06:01.260
All discussions around NoSQL, NewSQL, and various other database technologies revolve around manipulating these different variables. While there's another concept called BASE, I won't focus on it because it’s simply a made-up acronym with no relevance to our discussion.
00:06:20.400
What we’re talking about with CAP is important because everything today is inherently distributed. For instance, you have a browser talking to a server, which is then communicating with a PostgreSQL or MySQL database or something fancier. As we increasingly rely on client-based operations, we face many of these distributed problems.
00:07:04.440
This is a simplified and somewhat inaccurate guide to NoSQL systems, splitting them into those that are available and those that are consistent. The CAP theorem states that under network failure, you get to choose between consistency and availability, but understanding this can be quite challenging.
00:07:26.780
Imagine a situation with a typical cluster of nodes working together. When a write occurs on one node, that data gets replicated across the system. If the nodes lose the ability to communicate due to a network partition, we must decide how to handle that write. You might have one node communicating while the other does not, leading to inconsistent reads. This complexity applies whether you are using master/slave setups in relational databases or additional, trickier configurations.
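The scenario above can be sketched in a few lines of Python. This is a toy model, not how any real database replicates: the `Cluster` class, its `mode` flag, and the two-node layout are all assumptions made for illustration. In "CP" mode a write is refused during a partition; in "AP" mode it is accepted locally and the replicas drift apart.

```python
class Node:
    """A toy replica holding one key/value map."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas and a hypothetical `mode` switch.

    'CP' refuses writes while partitioned (consistent, not available).
    'AP' accepts writes locally while partitioned (available, not consistent).
    """

    def __init__(self, mode):
        self.mode = mode
        self.a = Node("a")
        self.b = Node("b")
        self.partitioned = False

    def write(self, node, key, value):
        other = self.b if node is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse rather than let replicas diverge.
                raise RuntimeError("unavailable: cannot reach the other replica")
            # Availability first: accept locally; the other replica is now stale.
            node.data[key] = value
            return
        # Healthy network: the write reaches both replicas.
        node.data[key] = value
        other.data[key] = value

    def read(self, node, key):
        return node.data.get(key)


ap = Cluster("AP")
ap.write(ap.a, "x", 1)      # replicated to both nodes
ap.partitioned = True
ap.write(ap.a, "x", 2)      # accepted by node a only
stale = ap.read(ap.b, "x")  # node b still answers with the old value
```

Real systems sit between these two extremes (for example, keeping the majority side of a partition writable), but the trade-off they are tuning is the one shown here.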
00:08:58.740
CAP theorem discussions often lead to claims that some databases have 'defeated' its principles. What's crucial to remember is that when two nodes cannot communicate, no amount of clever engineering lets them agree on the latest data. You have to give up either consistency or availability.
00:09:10.410
If you want to dive deeper into the intricacies of how these systems perform, I recommend researching a project called Jepsen. It's a rigorous examination of how various distributed systems behave when the network fails, and it can seriously expand your understanding of these concepts.
00:10:12.560
We are about to go on an adventure that’s like traversing a tortured maze of ridiculous Dungeons & Dragons metaphors! But first, let's examine a unique creature: the owlbear! The least scary aspects of a bear combined with an owl make for a whimsical, albeit peculiar, character.
00:10:52.570
Postgres, as we all know, is the 'MySQL for hipsters.' It’s a relational database with a consistent model. Under certain conditions—a network partition means your slave isn’t in contact with the master—it essentially becomes unavailable.
00:11:10.320
What's interesting about Postgres is that it includes a lightweight key-value store called hstore. If you're already running Postgres in production, you don't need to spin up anything else. With hstore you can perform joins and index keys as if they were full-fledged database objects.
00:11:45.720
The most exciting feature of Postgres is its support for JSON. Version 9.2 introduced a JSON data type and 9.3 added operators and functions for querying it, and the upcoming 9.4 release promises high-performance binary JSON storage. That lets developers work with documents very effectively. If you want a document database, you already have one with Postgres.
00:12:20.560
MySQL can be seen as akin to Postgres, but with some caveats. It's widely used and offers performance flexibility through the ability to switch storage engines. Since Oracle’s acquisition, alternatives have emerged, such as MariaDB, which is a more open fork offering high-performance features.
00:12:52.720
In addition, there's a company called Percona that provides compatibility patches, giving MySQL users additional capabilities. There's also Tokutek, known for its crazy fractal tree indexes, which stay efficient on large datasets while retaining the familiar MySQL structure. These are interesting developments in the database space.
00:14:03.670
Much of what we now call NoSQL originates from a paper released by Amazon called Dynamo, outlining how to create a distributed system. Interestingly, Riak is essentially an implementation of the underlying Dynamo theory, providing a well-engineered and simple system that inherently understands clustering.
00:14:56.370
Riak has features like cloud storage compatibility with S3 and it effectively partitions data using consistent hashing. It’s great for large-scale data because it has nice operational characteristics, making it easy to manage. The API is straightforward, enabling users to store JSON documents and utilize secondary indexes for efficient data retrieval.
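Consistent hashing is easier to see in code. The following is a generic Python sketch, not Riak's actual implementation: the `HashRing` class, the choice of SHA-1, and the 64 virtual points per node are assumptions for illustration. Each node owns several points on a ring, and a key belongs to the first node point at or after the key's hash, wrapping around at the end.

```python
import bisect
import hashlib


def ring_hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual points per node."""

    def __init__(self, nodes, points_per_node=64):
        self._points_per_node = points_per_node
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._points_per_node):
            bisect.insort(self._ring, (ring_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key):
        if not self._ring:
            raise ValueError("ring is empty")
        positions = [pos for pos, _ in self._ring]
        # First point at or after the key's position, wrapping past the end.
        idx = bisect.bisect_left(positions, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
```

The property that matters operationally is that removing a node only moves the keys that node owned; every other key keeps its owner, so a cluster change does not reshuffle the whole dataset.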
00:15:23.300
Next up is Google’s BigTable. This sophisticated piece of technology came from their internal research and is accessible via some cloud properties. It acts as a sparse distributed multi-dimensional sorted map, capable of handling hundreds of petabytes of data with fantastic performance in operations.
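"Sparse distributed multi-dimensional sorted map" is a mouthful, so here is a single-process Python sketch of the map part only (nothing distributed). The `SparseSortedMap` class and its method names are hypothetical. Cells are keyed by (row, column, timestamp), kept in sorted order so row ranges can be scanned, and only cells that were actually written take up space.

```python
import bisect


class SparseSortedMap:
    """Toy (row, column, timestamp) -> value map kept in sorted key order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same keys -> stored value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so the newest sorts first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value for a cell, or None if it was never written."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


table = SparseSortedMap()
table.put("a", "contents:html", 1, "old page")
table.put("a", "contents:html", 2, "new page")
table.put("b", "anchor:home", 1, "link text")
latest = table.get("a", "contents:html")
```

Because keys are sorted, a range scan is a binary search followed by a sequential walk, which is what makes this layout attractive for very large tables.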
00:15:56.130
Google has developed other technologies like Spanner and F1, which provide capabilities for handling relational data across multiple data centers, pushing the boundaries of the CAP theorem. They use GPS receivers and atomic clocks in their data centers to maintain accurate time.
00:16:29.120
Let's take a moment to discuss Cassandra. It's a column-oriented database focused on eventual consistency. What's particularly intriguing about Cassandra is that it acts as a sparse distributed multi-dimensional sorted map, making it similar to BigTable in how it structures data, allowing for greater efficiency in certain types of queries.
00:17:03.060
Column-oriented databases invert the usual table structure found in relational databases, providing significant speed advantages for time series queries and other specific data types. Although Cassandra's data model may initially overwhelm users, its query language increasingly abstracts away the complex definitions and provides a more familiar SQL-like structure.
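To make "inverting the table" concrete, here is a small Python sketch of the general column-store idea. It is not Cassandra's storage format; the sample readings and the `to_columns` and `average` helpers are made up for illustration, and every row is assumed to have the same fields.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"ts": 1, "sensor": "a", "temp": 20.0},
    {"ts": 2, "sensor": "b", "temp": 21.0},
    {"ts": 3, "sensor": "a", "temp": 22.0},
]


def to_columns(records):
    """Invert a list of records into one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def average(columns, name):
    """Aggregate a single field by reading only that field's values."""
    values = columns[name]
    return sum(values) / len(values)


columns = to_columns(rows)
mean_temp = average(columns, "temp")
```

An aggregate over one field touches one contiguous list instead of every record, which is where the speed-up for time series style queries comes from.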
00:17:41.490
Next, we encounter Memcached. Though technically a cache and not a database, it plays a significant role in your system's distribution. It's rock-solid in its simplicity and widely used. Many projects have built similar APIs, creating a widespread familiarity with Memcached.
00:18:06.400
Redis is familiar to the Rails community mainly as a queue backend. Redis is beautifully engineered, incredibly robust, and has a simple API, although distributing it can be slightly tricky. It allows for various data structures, such as hashes, lists, and strings, and optimizes locking to handle batch operations.
00:18:58.700
Within the Redis ecosystem, users can implement background tasks easily, leveraging its capabilities for smooth integration. Redis' Lua scripting enhances its functionality, allowing users to perform complex operations efficiently.
00:19:23.840
Moving on to Neo4j, a graph database optimized for connections rather than aggregated data. It represents a collection of nodes, each potentially connected through relationships that hold semantic meanings. Neo4j significantly simplifies querying for graph problems such as social networking.
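A classic graph problem is "friends of friends". The Python sketch below uses a plain dictionary rather than Neo4j or its query language; the `GRAPH` data, the relationship names, and both helper functions are hypothetical. It shows nodes joined by typed relationships and a query that follows only one relationship type.

```python
# Each node maps to a list of (relationship, other_node) pairs.
GRAPH = {
    "alice": [("FRIEND", "bob"), ("WORKS_AT", "acme")],
    "bob": [("FRIEND", "alice"), ("FRIEND", "carol")],
    "carol": [("FRIEND", "bob"), ("FRIEND", "dave")],
    "dave": [("FRIEND", "carol")],
}


def neighbours(graph, node, relationship):
    """Nodes reachable from `node` over one relationship of the given type."""
    return [other for rel, other in graph.get(node, []) if rel == relationship]


def friends_of_friends(graph, person):
    """People two FRIEND hops away who are not already direct friends."""
    direct = set(neighbours(graph, person, "FRIEND"))
    result = set()
    for friend in direct:
        for candidate in neighbours(graph, friend, "FRIEND"):
            if candidate != person and candidate not in direct:
                result.add(candidate)
    return result


suggestions = friends_of_friends(GRAPH, "alice")
```

In a relational schema the same question needs a self-join per hop; a graph database stores the relationships directly, so the traversal is the query.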
00:20:12.099
MongoDB is known for its scalability and is described humorously as 'web-scale.' While it has its flaws, the model of storing your objects directly as documents makes for a developer-friendly experience. However, relying on MongoDB as a one-size-fits-all solution without understanding its operational challenges isn't recommended.
00:21:05.000
RethinkDB approaches the database world with a focus on strong correctness first, then usability, in contrast to its counterparts. They aim to create operationally robust systems that handle queries appropriately, although the technology is still evolving.
00:21:52.000
Many commercial databases exist, often emerging from the open-source world. These include Couchbase, a hybrid of CouchDB and memcached. They're all competing for market share, each with its unique features and cost structures, yet many systems ultimately lead users back to trusted solutions like Postgres.
00:22:52.340
HyperDex is my favorite because of its unique hyperspace hashing methodology, designed to make querying extremely efficient in multi-dimensional spaces. Using such technology allows users to construct genuine queries based on indexed properties of stored objects, a fascinating advancement in database technology.
00:23:50.740
With developments happening rapidly in the database field, it's essential to keep pace. Some systems demand a lot of installation and operational effort, as is the case with big data platforms like Apache Hadoop or HBase, which continue to gain traction among the larger tech companies.
00:24:50.210
In the end, the world of databases is vast and expanding. As technology evolves, it becomes crucial to explore new possibilities and consider what might work best for your applications. I have been Toby Hede, and if you want to discuss databases, I'm here at the conference. I consider myself somewhat of a butterfly collector of databases, so feel free to come say hi!