00:00:07.560
Um, this is a lightning talk that has run amok, basically, about 17 databases in however many minutes we have. There will be a bunch of animated GIFs and probably some Star Wars references. Actually, there's no Star Wars, so I don't know why I said that. Also, there will be some Dungeons and Dragons references for no real reason. I must confess, I am the walking embodiment of a stereotype that I created in my own mind back in 1992, and I've been keeping that dream alive ever since.
00:00:20.600
This is our cast that we're going to look at today—a bunch of different things, some of which you have probably seen and heard of, and some of which you may not. Some of these concepts are quite complicated. The general theme of this talk is that it's a bad idea, but hopefully, I won't hurt myself too much. I'm Toby, and I work at 9fold. Let's get cracking!
00:00:43.320
Distributed systems are really hard. Depicted here is, of course, a distributed system. Databases are very fun, but particularly in the Rails world, people tend to gloss over the fact that at the heart of most of your systems and code, there is a database of some description. There is always something that contains the data. Conversely, everything is hard and fun at the same time.
00:01:03.400
Quick word from people who aren't my sponsors: NoSQL or NewSQL and various other flavors of SQL drive me absolutely nuts. This is because SQL is the least of the problems that we are trying to solve in this world. Hopefully, you have heard of the concept of ACID, which is a set of properties that a relational database (like PostgreSQL or MySQL) guarantees around how your data is stored and interacted with. It stands for Atomicity, Consistency, Isolation, and Durability. These guarantees are handy; we store some data, and we can get it back out. Many people can store some data, resulting in a somewhat consistent view of what that data is. There's tons of math behind this, and people have been doing it for a very long time.
00:01:49.799
On the other side of this, where the more fashionable and trendy hipsters come in, is the CAP theorem or Brewer's conjecture. This theorem deals with Consistency, Availability, and Partition tolerance. ACID and CAP have essentially nothing to do with each other, which is why people often get confused when talking about the different properties that databases have. The consistency in CAP is not the same as the consistency in ACID, and that's a subtle but important point. I will explain what all of that means.
00:02:18.400
Brewer's conjecture and the formal proof of the CAP theorem are almost but not quite entirely alike. The problems actually started when a guy named Brewer presented a slide in a keynote he gave back in 2000. He theorized that when it comes to distributed systems, you can only have two of these three properties: Consistency, Availability, and tolerance to network partitions. There were only a few slides on this topic in that entire keynote, and that was all he really said.
00:02:50.560
In a typical way of good scientists, some people took that idea, saw a grain of truth in it, and created formal proofs to explore whether it was indeed true. The actual theorem states that it is impossible in an asynchronous network model to implement a data object that guarantees availability and atomic consistency simultanously. This is where the trouble in distributed systems really starts and ends.
00:03:06.079
Back to distributed systems: everything is distributed. The web, as a model, is inherently distributed, and that seems self-evident. You have a browser and a server, which means you are instantly building distributed systems. Your Rails app is also distributed, as you have code running on an application server talking to a database. If they are communicating across a network, even if it's local, you are building a distributed system. Most of the time, we ignore this fact and get away with it to varying degrees of success.
00:03:16.560
This is completely unreadable, but that doesn't matter. It was basically someone trying to categorize different databases in the modern era. If you spend a weekend at a hackathon, you too could start a NoSQL startup—it's a highly recommended career opportunity. They attempted to categorize different databases, thinking about how consistency ensures that data is always the same for everyone, while availability means clients can access the data. Partition tolerance is about the ability of the system to function when networks go down or systems stop communicating.
00:03:55.360
The way this gets misinterpreted is that you can pick two of these properties, but that's not quite accurate. What we are actually saying is that when there's a partition, and two systems cannot communicate, you must choose between being consistent or being available.
00:04:11.360
Let me demonstrate this with a picture. We have a cluster of nodes communicating with each other. I've worked in a company that resembled this in stand-ups, and you'd say 'no, we don't do stand-ups; it's fine.' We have two nodes, and they are inherently distributed. Data is coming into one node—that's cool. We have data being transferred to the other node, and they are synchronizing or replicating in some way.
00:04:45.280
We have read requests coming off the other node. This all seems great until we sever the communication between them. At this point, the writes are coming into one node, which knows the truth of the bank account balance or the client's phone number. At this moment, the other node cannot possibly maintain a consistent state anymore; it cannot accept the new data because it cannot communicate.
00:05:05.840
What all of this is trying to convey is that if two nodes can't communicate, they cannot attain consistency. The data can be updated in one place, and the other node won't know about it. These discussions about the CAP theorem and other insights are trying to illustrate this notion of distributed systems.
00:05:25.120
So now that we have a foundational understanding of the science behind distributed systems, let me introduce you to our adventurer party that we will be discussing today. Also, as an aside, what is the deal with the Albear? Some pointed out to me that there are no Al experts—the Albear is the French version, while the Albear is a Dungeons and Dragons monster that is a hybrid of an owl and a bear. Interestingly, it possesses the least scary properties of either animal, as if the bear could fly and had big teeth, that would certainly be scarier than a lumbering, pecking creature.
00:06:02.079
It doesn't matter, though. I realized that on Monday, which is why it's included here. So let's start with something simple: PostgreSQL. It's like MySQL for hipsters; it's totally awesome. The exciting aspect of PostgreSQL these days is its support for JSON, which is a very good thing. PostgreSQL is one of those technologies that gives people a warm glow of satisfaction that it's done right. It's solid engineering, and it's a database many of you have likely used.
00:06:42.560
It's consistent and relational. It takes data consistency as its worldview and prioritizes keeping the data consistent rather than confirming availability. This is the reason we often have single points of failure in our Rails apps: the primary database might go down, and essentially your application is down until the slave can come back up. This is the model of operations in this worldview.
00:07:16.520
HStore is an interesting feature—it's essentially a key-value store embedded in PostgreSQL, which is definitely worth looking at. Thanks to PostgreSQL 9.2 and 9.3 releases, there is now almost complete JSON query capability. I would argue that databases like PostgreSQL are not truly distributed systems. You can build variations that create a semblance of distribution with failover and replication, but it isn't quite the same.
00:07:54.880
HStore allows you to embed key-value pairs into a single column. The really cool part of this feature is that you can index it just like a normal value, enabling you to have nested structures in your HStore that you can index and utilize across tables. There's already an ActiveRecord builder for handling this regarding JSON, and developers are working on merging the JSON and HStore engines to share the same perspective.
00:08:44.560
You can store JSON objects and index the queries around them. The new capabilities allow you to query into JSON structures and engage in many of the operations databases kind of excel at. However, we have plenty more to discuss, so let's get into MySQL.
00:09:05.000
It's the same difference, right? But if you are looking to use a database operated by an evil overlord, MySQL is increasingly the choice. It’s not animated, unfortunately, but that’s the reality. MySQL’s predominant advantage, I guess, is its ubiquity. PHP can be blamed for that; it’s like a self-replicating virus where every server you spin up has been infected by PHP and MySQL.
00:09:35.720
On some benchmarks, the PHP-MySQL combination is significantly faster than native solutions in Java and various other languages, even though PHP itself isn’t remarkably performant. The optimization in this area has been extensive—even their buzzword compliance is impressive. They introduced a NoSQL API that resembles throwing together a bunch of random terms from a keyboard into MySQL to compel developers to consider selling it.”
00:10:03.720
However, many people are now exploring alternatives, such as MariaDB, which is a community-driven fork of MySQL that promotes freedom, as well as higher performance solutions like Percona Server, built for exceptional MySQL performance. TokuDB is another noteworthy option, which employs fractal indexes in MySQL, yielding tremendous performance improvements for certain workloads, especially geospatial data.
00:10:42.480
Next, we delve into our first postmodern database, DynamoDB. It emerged from a paper published by Amazon, which laid the groundwork for managing distributed systems. The upcoming databases we will discuss align with this model. DynamoDB introduced some excellent innovations in pricing; for instance, if you check the Amazon calculator, you can pay per query.
00:11:18.040
DynamoDB is essentially a massive key-value store with latency guarantees. It is highly tunable to specific workloads but, of course, as with any key-value store, you will need to jump through hoops to execute complex queries. It's like having a giant hash table in the sky. If your manager asks about monthly revenue, you might be left clueless because you have to iterate through everything.
00:11:45.440
With SQL databases, the queries are executed for you, even though there’s still iteration involved. What’s interesting about the Dynamo paper is that it popularized a model for distribution, which is critical for those interested in exploring distributed systems.
00:12:12.760
Next, we have Riak, which is essentially an implementation of the Dynamo paper. Riak is fantastic, and I've spoken to many individuals interested in it over a few beers while discussing the Albear. It would be awesome to have a problem that could be solved with Riak. It's another key-value store, but what we call tunable consistency is utilized within it.
00:12:41.080
Adding a node is straightforward as it can easily join the cluster. There is no master-slave configuration; systems are built with the understanding that they are inherently distributed. You can add more nodes, and they just work together seamlessly. Riak provides a REST API and is also able to store JSON documents.
00:13:00.160
Operations teams love Riak as it enables the advantages of inherent clustering, offering benefits that are quite interesting. Additionally, it includes cloud storage features that are comparable to Amazon S3, allowing users to store blobs of content within it. However, its productivity can be driven by a lack of focus on centralizing anything, which leads to fascinating results.
00:13:40.580
Easily gaining extra points for converging replicated data types, which is a term not widely known, Riak invests in academic research that has bled into the commercial sector in a positive fashion. Convergent replicated data types (CRDTs) are designed for distributed use, supporting data types like counters and arrays that can be distributed across nodes while ensuring eventual consistency.
00:14:05.400
Data gets hashed as it comes in. Nodes coordinate amongst themselves to distribute this data redundantly. This model basically leads to the patterns presented in the Dynamo paper, where data is hashed and then distributed around a giant ring of hashes. Each node receives a different portion of the data, which is replicated across multiple nodes, improving performance and redundancy.
00:14:32.960
As you add more nodes, you enhance performance and redundancy. Internally, Riak uses a structure called a vector clock to track versions of the same data over time. Essentially, when you write data to a relational database, it is stored as a single record, but with distributed systems like Riak, the application must manage how to handle conflicts.
00:15:06.080
It is up to the application to comprehend the conditions leading to that state where different records may exist. Thus, the model of Riak is somewhat akin to using Git for version control—while some records might have conflicting commits, the application must determine how to clarify any discrepancies.
00:15:34.320
Riak facilities the creation of an elegant and simple API allowing you to manage buckets with names, keys, and associated values, enabling document retrieval. There’s a library built into Riak, with a gem called Ripple that allows you to store JSON documents and perform additional operations. Furthermore, Riak supports secondary indexes for queries and offers a wealth of resources for those interested in distributed data storage.
00:16:01.520
Next, we have Bigtable, which is Google's response to distributed storage solutions. I'm not going to discuss it in great detail as Google has designed it to manage hundreds of petabytes of data and operate at 30 million operations per second, which many companies cannot fathom. Thus, unless you have a similar scale of data, it’s often irrelevant.
00:16:27.840
Meanwhile, Google’s subsequent creations, Spanner and F1, demonstrate that you can build relational databases that are fully distributed and atomic by using GPS in every server and atomic clocks in every data center. These solutions are proof of concept for Google—an impressively engineered approach to distributed systems, even if it is out of reach for the bulk of us.
00:16:52.480
On the topic of distributed databases, Cassandra is another option many are considering. While it's not necessarily the most inspirational engineering compared to Riak, it has become increasingly relevant. Cassandra is designed for distribution and follows principles similar to those described in the Dynamo paper; I've looked at it extensively, and while I found early implementations rather frustrating, many improvements have been made since.
00:17:22.320
Older configurations required XML for structure definitions and a restart of the server every time a modification was needed, which can feel cumbersome and very Java-like. However, recent updates have introduced a query language that resembles SQL—providing users with a familiar experience as they manage a large cluster.
00:17:47.440
Cassandra organizes its data in columns, which can be a bit complex to explain. In the past, you were forced to consider the underlying structures in great detail. In this model, you no longer view data in terms of rows and tables; instead, a column acts as a key-value namespace, albeit much more complex.
00:18:14.760
For instance, Netflix uses Cassandra for a wide array of functionalities due to its distributed nature. They manage massive clusters, and operational efficiency has become commonplace—although raising a thousand nodes is quite the feat.
00:18:42.720
Moving on, I’d like to talk briefly about Memcached. Although Memcached isn’t a database per se, I thought it was worth mentioning because it represents the transitional state of your Rails application, where distributed properties become apparent and complicate your workflow.
00:19:14.840
Many applications can seamlessly integrate Memcached without giving much thought to the underlying implications. However, you now must consider issues such as invalidating data and maintaining state between your database and Rails application, both of which are managed by a distinct process—this encapsulates the essence of distributed systems problems.
00:19:39.960
Memcached is an excellent solution due to its simplicity, boasting just a handful of methods to work with. It is essentially a large hash in the sky, and it's friendly towards Rails, particularly for managing sessions and fragments.
00:20:01.640
Redis is another fascinating topic. Redis has gained traction within the Rails community. Many applications use it for queuing, but not all utilize it as a database. With libraries like Sidekiq and Resque, we see Redis as a queue—though, in essence, it is not one, it can function as such.
00:20:27.840
Redis offers key-value storage but stands out due to its versatile data structures that enable in-memory operations. It transcends the simple model of a key mapped to a blog of data; in Redis, you have arrays, sets, and hashes, allowing you to execute an impressive array of operations.
00:21:00.640
The architecture is exceptionally fast since it operates solely in memory and is cleverly engineered. However, it’s not truly a distributed system; it can provide seamless functionality until some hidden issues erupt due to naive replication methods.
00:21:29.840
Redis's simplicity offers transaction-like features via checks and set values, enabling seamless multiple-step operations. Many developers utilize it within Rails for task management.
00:21:55.440
On to Neo4j, a graph database which I find particularly intriguing; its technology is genuinely impressive. It fundamentally regards data as a series of interconnected nodes, and their introductory material showcases impressive solutions to problems.
00:22:16.920
However, Neo4j still encounters challenges typical of single-point-of-failure databases. Data resides within the system and continually streams out; thus, it becomes essential to contend with these inconsistencies.
00:22:37.600
Now, on to MongoDB. There are some interesting points regarding MongoDB's adaptation to web scalability. Many people assume that MongoDB is a fix-all solution, similar to what people thought of MySQL in its infancy. Anyone with technical knowledge should recognize that transactions are central to database functionality.
00:23:01.480
Some early adopters invested in SQL, like Slashdot, did not understand databases either. If you think about the critiques from the academic world, you'll find they have some substantial weight about MongoDB's consistency and operational assumptions.
00:23:23.520
There's a growing number of options emerging, like RethinkDB, which approaches database construction with a focus on correctness—ultimately leading to slower performances, but a much finer tune on the data-related structure. As an academic project, RethinkDB boasts a solid foundation built upon consistent key-value storage adhering to distributed characteristics.
00:24:13.120
This database model resembles Redis, yet it starts from the ground up with distribution in mind. The approaches taken by RethinkDB offer a sense of legitimacy seen in other platforms.
00:24:54.720
As we reach the conclusion, we come to Hadoop, which you don’t have to worry about as long as you can successfully install it. I have a friend, a bit of an eccentric, who refers to Hadoop as a means to solve so-called 'small to medium data problems,' as he's had his share of experiences but doesn't shy away from its handling of expansive data sets.
00:25:27.440
Now, we turn to Uni, which has recently released a more advanced version called Vetus. It's essentially a NoSQL offering that functions similarly to MongoDB while being embeddable. This could lead to some exciting applications, but as it stands, we are still hesitant to utilize these newer frameworks.
00:26:16.640
Finally, there's Elasticsearch. I have frequent discussions about why we shouldn't default to this as a data store, even though it’s officially a search engine. Elasticsearch allows you to index documents, especially JSON documents, delivering the clustered architecture inherent to distributed systems right from setup.
00:26:59.240
The powerful querying capabilities make it exceptionally versatile, further amplified via integrations with logging platforms like Logstash and Kibana. Overall, it’s an incredibly elegant system that empowers developers with creativity for a vast array of applications.
00:27:28.640
That was my talk. I think I am right on time, as I dare to hope. And so, it has been great sharing this journey through databases with you today. Thank you!