Data Storage: NoSQL Toasters and a Cloud of Kitchen Sinks

by Casey Rosenthal

In the presentation titled "Data Storage: NoSQL Toasters and a Cloud of Kitchen Sinks," Casey Rosenthal discusses the evolving landscape of data storage, particularly emphasizing NoSQL databases. Starting with a humorous analogy involving toasters, he highlights the oversimplification of the term 'NoSQL' and invites the audience to reframe it as an opportunity for diverse choices in data storage.

Key Points Discussed:

- NoSQL Overview: Rosenthal outlines the historical context of data storage, contrasting traditional relational databases with the new options provided by NoSQL technology, which have emerged due to the limitations of SQL databases in certain use cases.

- Common Participants: He reviews popular NoSQL databases, including:

  - MongoDB: A document database that stores data in tree-structure documents.

  - Redis: An in-memory key-value store that prioritizes speed.

  - Cassandra & HBase: Column-family databases that offer different ways of structuring data.

  - Neo4j: A graph database, making complex relationships easier to manage.

  - Riak: A distributed key-value store with flexibility in data formats.

  - Other Mentioned Databases: CouchDB, Couchbase, and DynamoDB, among others.

- Emerging Use Cases: He introduces lesser-known NoSQL databases such as Titan (a distributed graph database), eXist-db (XML storage), and Datomic (offering cache and transaction efficiency).

- Critical Reasons to Consider NoSQL:

  - Fault Tolerance: Designed to handle hardware failures and ensure data availability across multiple nodes.

  - High Availability: Supports simultaneous operations across nodes to maintain function during outages.

  - Scalability: Highlights both vertical and horizontal scaling, noting the appeal of horizontal scaling in NoSQL databases.

- Data Growth Challenges: Stresses the crucial need for scalable solutions in a landscape where data is exponentially increasing.

- Access Patterns Framework: Discusses the importance of understanding different access patterns when designing applications with NoSQL, and the need to shift dynamic patterns to static ones for efficiency.

- Conclusion: Emphasizes the absence of a universal theory of data, reinforcing that selecting the correct database relies on experience and understanding of specific application needs. Rosenthal encourages audience members to explore NoSQL solutions to better manage data effectively in modern applications.

00:00:16.400 There's been a little bit of confusion about the schedule, so my talk was moved here. This is my talk: Data Storage, subtitled 'NoSQL Toasters and a Cloud of Kitchen Sinks.' I am Casey Rosenthal.

00:00:33.920 I work for a company called Basho, which makes a distributed key-value database called Riak. It's very operations-friendly, and they also have a product called Riak CS, which is similar to private S3.

00:00:45.760 In the NoSQL world, there’s a joke that goes something like this: "Do you have a toaster?" "Yeah, I have a toaster. Everybody has a toaster, right?" The next question is, "Does your toaster run SQL?" To which you would reply, "Um, no, my toaster doesn't run SQL." And they would say, "Oh, you must have one of those newfangled NoSQL toasters!"

00:00:58.480 The joke is bad because it’s absurd, and it's absurd because the term NoSQL is itself absurd. We're in an industry where we can compute anything, given enough time and resources, but we define NoSQL by the small things that it can't do, which leaves everything else that's possible unmentioned. If you're in my field, this is kind of a problem, but we are known under the NoSQL umbrella. So, we have to embrace that term.

00:01:17.520 How do we embrace the term NoSQL? Well, first, we recognize it's just a term, and then we imbue it with our meaning. For us, NoSQL means choices. We want it to be synonymous with choices in how you store your data. For the past few decades, prior to the emergence of NoSQL, the software engineer had essentially only two ways to store data: on a file system or in a relational database.

00:02:01.360 What’s exciting about working in this world is that every year, or even every couple of months, a new NoSQL database comes out. New applications are built in ways that access data differently, allowing software engineers to tackle more problems and build new solutions.

00:02:14.879 So, the overarching theme of my talk is: Why NoSQL? Why would you want to consider NoSQL databases? This talk will be broken into three parts: Who are the NoSQL players? What kind of problems are they solving? And how do we build applications on top of NoSQL databases? Clearly, I only have 40 minutes, so I will be scratching the surface on all of these topics.

00:02:54.400 I want to be transparent: I work for a NoSQL company, Basho, which makes a NoSQL product called Riak. This gives me a certain bias. It would be easy for me to stand here and start flame wars by criticizing other NoSQL databases. Since I work with Riak, I am very aware of the criticisms about it. To avoid those flame wars, though they can be fun for some people, I will only speak positively about NoSQL databases.

00:03:39.680 This is a chart that 451 Research put out, showing the NoSQL LinkedIn Skills Index. People on LinkedIn can list their database skills, and those are summed up. From left to right, we can see the most popular NoSQL databases self-reported by engineers who have skills in them. Here are the ten databases in that index: MongoDB, Redis, Cassandra, HBase, CouchDB, MarkLogic, Neo4j, Riak, Couchbase, and DynamoDB.

00:04:10.959 I will quickly go over these ten databases, but I can't delve deeply into any of them. For each one, I will constrain myself to a single slide from GitHub showing a Ruby client for that database, to prove that yes, people are using these databases. Let's briefly describe some properties.

00:05:00.720 First, MongoDB: I don't expect you to learn how to use it just from this example. It’s a document database storage engine where a document is a data structure like a tree, where each node is a key-value pair, and the value might just be a set of other branches.
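The tree shape described here can be sketched without any database at all. The following is a minimal illustration, assuming a made-up document held in a plain Python dict; it does not use a MongoDB client, and the field names and the `leaf_paths` helper are hypothetical.

```python
# A hypothetical document, modeled as a plain Python dict (no MongoDB client involved).
document = {
    "title": "NoSQL Toasters",
    "speaker": {"name": "Casey Rosenthal", "company": "Basho"},
    "tags": ["nosql", "storage"],
}


def leaf_paths(node, prefix=""):
    """Walk the tree and yield (dotted_path, value) for every leaf."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from leaf_paths(value, f"{prefix}.{index}" if prefix else str(index))
    else:
        yield prefix, node


# Each key either holds a value directly or opens a new set of branches.
paths = dict(leaf_paths(document))
```

Here `paths["speaker.name"]` reaches a value two levels down, which is the sense in which a value "might just be a set of other branches."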

00:05:33.280 The next on the list is Redis. Redis is an in-memory key-value store that primarily stores data in RAM. It does have a couple of other data types and can persist data to disk, but serves requests from memory.

00:06:06.880 Then we have Cassandra, which falls in the column-family category of NoSQL databases. Inspired by Bigtable, these databases don't have a relational structure of tables, columns, and rows—rather, they use a different structure for column families and records. This means that they store data on disk differently than a relational database.
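As a rough mental model only, a column-family layout can be pictured as nested maps: row key, then column family, then column. The sketch below assumes that simplified logical view; it says nothing about how Cassandra actually lays data out on disk, and `put`, `get_family`, and the sample keys are hypothetical.

```python
from collections import defaultdict

# row key -> column family -> column name -> value (all names are hypothetical)
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Store one value under a row, a column family, and a column."""
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Return every column stored for one row in one column family."""
    return dict(table[row_key][family])


put("user:42", "profile", "name", "Ada")
put("user:42", "profile", "email", "ada@example.com")
put("user:42", "visits", "2013-11-08", "/home")
put("user:7", "profile", "name", "Grace")  # rows need not share the same columns
```

Unlike a relational table, nothing here forces `user:7` to carry an `email` column or any `visits` at all.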

00:06:50.400 HBase is another column-family NoSQL database, similar to Cassandra, but it distributes data differently on the back end. CouchDB is a document database that stores its documents in JSON and uses map-reduce to build different views of those documents. MarkLogic, like CouchDB, is a document database, but it stores its documents in XML.
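The idea of using map-reduce to build a view over JSON documents can be sketched in a few lines. This is a plain Python illustration of the map and reduce steps, not CouchDB's own view API; the documents, `map_talk_by_track`, and `build_view` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical JSON-like documents of mixed types.
docs = [
    {"_id": "a", "type": "talk", "track": "databases"},
    {"_id": "b", "type": "talk", "track": "testing"},
    {"_id": "c", "type": "talk", "track": "databases"},
    {"_id": "d", "type": "speaker", "name": "Casey"},
]


def map_talk_by_track(doc):
    """Map step: emit (key, value) pairs for documents that belong in the view."""
    if doc.get("type") == "talk":
        yield doc["track"], 1


def build_view(documents, map_fn, reduce_fn):
    """Group emitted values by key, then reduce each group to one value."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


talks_per_track = build_view(docs, map_talk_by_track, sum)
```

A different map function over the same documents would produce a different view, which is the flexibility the paragraph above is pointing at.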

00:07:53.600 Next is Neo4j, the first graph database on this list. Graph databases are designed to store and navigate relationships efficiently, which makes traversing tree structures or hierarchies easier than in traditional relational databases. In comparison, repeatedly self-joining a table in SQL can quickly hit resource limits, while graph databases avoid this issue.
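To make the contrast with self-joins concrete, here is a small sketch of a hierarchy traversal that follows edges directly. It assumes a toy adjacency list in memory and a breadth-first walk; it is not Neo4j's query language or storage model, and the `reports` data and `descendants` helper are hypothetical.

```python
from collections import deque

# Hypothetical "reports to" hierarchy stored as an adjacency list: manager -> direct reports.
reports = {
    "ceo": ["vp_eng", "vp_sales"],
    "vp_eng": ["dev_lead"],
    "dev_lead": ["dev_1", "dev_2"],
    "vp_sales": [],
}


def descendants(graph, start):
    """Breadth-first walk that follows edges directly, one hop at a time."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


everyone_under_ceo = descendants(reports, "ceo")
```

Each additional level of depth costs one more hop from nodes already in hand, whereas the relational equivalent would typically add another self-join per level.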

00:08:50.720 Then, there's Riak, a distributed key-value store. Riak has no intrinsic knowledge about the value that it stores and can store JSON, XML, or binaries. Couchbase can be seen as a mashup of an in-memory key-value and a persisted JSON document store. Finally, Amazon DynamoDB is another distributed key-value database.

00:09:55.200 Now you have the top ten databases in mind. You could say you're somewhat familiar with the scene. If a use case arises that could benefit from one of these databases, you can now say, "Okay, I kind of know something about what that does." Or you could just put "NoSQL" on your LinkedIn profile.

00:10:35.040 One of the reasons NoSQL emerged is that SQL was too popular. It was used in situations where it wasn't an ideal fit. I don’t want to leave you with just those top ten databases; there are dozens if not hundreds of NoSQL databases that solve different problems in unique ways. Often, those benefits can only be fully understood through exposure.

00:11:07.280 I’ll highlight three NoSQL databases I have looked at this past month to illustrate the depth of solutions available. First is Titan, a distributed graph database that is interesting because it supports more data than fits on one machine while still being eventually consistent.

00:12:12.880 Another is eXist-db, which has been around since 2001. It stores files in XML and utilizes a query language called XQuery, which is a W3C standard language that combines XPath with a JavaScript-like language. It allows you to generate queries that output HTML and dynamically update views when the underlying XML data changes.

00:12:59.520 Lastly, we have Datomic, which assumes certain things about your hardware and use case to offer unique features, such as built-in caching for queries and native data types for consistent transactions. Note that Yoko Harada will discuss Datomic and Ruby in further detail during a talk later this conference.

00:13:44.640 Overall, the NoSQL landscape is large, exciting, and full of energy. In Rails 4.0 and 5.0, we reach a maturity that won't yield major surprises in future versions. Meanwhile, the NoSQL world continues to evolve with new solutions that address problems we haven't yet considered.

00:14:11.840 Getting back to the question of why NoSQL, one reason to explore NoSQL solutions is for fault tolerance. Fault tolerance starts from the assumption that bad things will happen, like hard disk failures, server crashes, and network issues. If your application requires fault tolerance, you should seek NoSQL solutions designed from the ground up with this in mind.

00:14:54.800 Building fault tolerance on top of traditional relational databases can be challenging. Some NoSQL databases, by contrast, will automatically distribute your data across multiple nodes and facilitate data recovery when a fault occurs. This is a crucial trait for a data store, worthy of consideration if your use case demands fault tolerance.

00:15:52.559 Another benefit of distributed NoSQL databases is high availability. In such databases, operations can happen in parallel across nodes, meaning that when you store a value on multiple nodes, you can still read and write it even when there are network issues or server outages.
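The combination of fault tolerance and availability described above can be illustrated with a toy model. The sketch below assumes the simplest possible design, where every write goes to every reachable node and any live replica may answer a read; the `Node` and `ReplicatedStore` classes are hypothetical, and the sketch leaves out conflict resolution and bringing a failed node back up to date.

```python
class Node:
    """One server in the toy cluster."""

    def __init__(self, name):
        self.name, self.up, self.data = name, True, {}


class ReplicatedStore:
    """Toy store that writes each key to every reachable node (hypothetical design)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def put(self, key, value):
        written = 0
        for node in self.nodes:
            if node.up:
                node.data[key] = value
                written += 1
        return written  # number of replicas that accepted the write

    def get(self, key):
        # Any live replica that has the key can answer.
        for node in self.nodes:
            if node.up and key in node.data:
                return node.data[key]
        raise KeyError(key)


nodes = [Node("a"), Node("b"), Node("c")]
store = ReplicatedStore(nodes)
store.put("toaster", "no sql")
nodes[0].up = False  # simulate a server crash
value_after_crash = store.get("toaster")
```

The read still succeeds after node `a` goes down because two other replicas hold the value, and later writes continue on the remaining nodes.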

00:16:44.119 In some scenarios, strict consistency is essential. In strong consistency setups, every node in the cluster returns the same result when queried. Several NoSQL databases are designed to cater to strong consistency, a challenge to maintain with traditional relational database methods.

00:17:25.600 Now let’s have a word about scale. Scaling can encompass throughput, storage, and latency. While you might accomplish scaling via vertical scaling (upgrading to a bigger server) or horizontal scaling (adding more nodes), the NoSQL paradigm makes horizontal scaling particularly appealing on commodity hardware.

00:18:13.440 Horizontal scaling is achievable on relational databases with methods like sharding, but that requires careful management of application logic. Some NoSQL databases prioritize different tasks among servers, efficiently distributing workloads to enhance performance.
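To show what "managing sharding in application logic" can mean, here is a minimal sketch of key-based routing. It assumes a fixed list of shards and simple modulo hashing; the shard names and the `shard_for` helper are hypothetical, and real deployments often use more elaborate schemes.

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2"]  # hypothetical shard names


def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the key; a stable hash keeps routing consistent across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


# The application must call this before every read or write to find the right server.
home_of_user_1 = shard_for("user:1")
```

The careful management comes in when the shard list changes: with modulo hashing, adding a shard reassigns most keys, so existing data has to be moved to match.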

00:19:13.360 However, the larger challenge is the exponential growth of data. Presently, 90% of the world's data has been created within the last two years. As of last year, the estimation stood at around two to two and a half zettabytes of data stored, and projections for 2013 anticipate surpassing that amount significantly.

00:19:46.960 This has significant implications, as many businesses are eager to store data they find useful but lack the necessary infrastructure. Traditional relational databases aren't capable of scaling to meet current demands. Conversely, NoSQL databases are adapting to address this problem.

00:20:20.320 As a software engineer or consultant with insight into how NoSQL databases scale, you hold significant value in the current job market, particularly since very few can solve those scaling challenges.

00:21:02.240 One of the key takeaways from this presentation is there is no universal theory of data. There’s no single database solution that fits every need. There’s no overarching theory that helps you definitively choose the right NoSQL database for your specific application. This problem remains unsolved for now, and much relies on experience and intuition.

00:21:43.600 We must also consider access patterns. At Basho, where I am the director of professional services, we work with large clients to guide them in establishing infrastructure for managing the massive amounts of data generated by businesses. Hence, we need a framework to assess effective storage solutions based on access patterns.

00:22:37.599 For instance, we can analyze a scale from scheduled queries to spontaneous ones. Scheduled queries resemble tasks processed via cron jobs, allowing more control, while spontaneous queries depend on user traffic, which is less predictable. A second scale runs from static to dynamic access. Static access corresponds to key-value retrieval with a direct answer, while dynamic access introduces complexity by requiring a query planner.

00:23:36.640 Using these two scales, we can create a grid to determine how various databases align with specific access patterns—helping to identify suitable NoSQL or relational solutions. While most databases handle static and scheduled queries efficiently, spontaneous dynamic patterns present challenges. Application developers often gravitate toward this bottom right quadrant, leading to a mismatch.

00:24:58.080 The strategy is to shift problematic dynamic access patterns into static ones. By proactively evolving our approach to data access, we can position ourselves for a more effective integration with NoSQL databases, thus better solving related issues.

00:26:18.720 In the traditional relational model, we typically begin with what data we need to store, ensuring the relationships are normalized. If done correctly, we gain confidence in querying the correct data from the model. In contrast, for a NoSQL setup—especially with key-value structures—we should first determine what the data view should look like.

00:27:10.960 By establishing what we want from the data upfront, we can shape how it's stored to support those views. The benefit is that access patterns that were formerly spontaneous and dynamic become spontaneous and static.

00:27:38.320 For example, when analyzing user logs to count how many users visited a web page, the SQL approach involves querying columns directly at the moment the question is asked. With a key-value database, we can anticipate that question by maintaining rolling counts or averages in real time as the data arrives, saving processing time since the answers can be fetched rapidly.
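The page-visit example can be sketched side by side in both styles. This is a minimal illustration using in-memory Python structures as stand-ins for a log table and a key-value store; `record_visit`, `count_by_scan`, and the sample data are hypothetical.

```python
from collections import defaultdict

log = []                     # raw visit events, kept for the "dynamic" query
visitors = defaultdict(set)  # page -> users seen so far (stands in for a key-value store)
visit_counts = {}            # page -> precomputed unique-visitor count


def record_visit(user, page):
    """Update the answer at write time so reads become a single key lookup."""
    log.append((user, page))
    visitors[page].add(user)
    visit_counts[page] = len(visitors[page])


def count_by_scan(page):
    """Dynamic approach: scan the log every time the question is asked."""
    return len({user for user, p in log if p == page})


for user, page in [("ann", "/home"), ("bob", "/home"), ("ann", "/home"), ("ann", "/about")]:
    record_visit(user, page)
```

Both `visit_counts["/home"]` and `count_by_scan("/home")` give the same answer, but the first is a static lookup whose cost does not grow with the size of the log.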

00:28:50.960 In instances where challenges arise, consider hybrid solutions, such as combining PostgreSQL for metadata management with a key-value store like Riak. Regardless, no universal theory of data exists today, so choosing well still depends on experience and intuition.
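A hybrid arrangement of this kind might look roughly like the following. The sketch uses stand-ins so that it runs anywhere: an in-memory SQLite database plays the role of the relational metadata store, and a Python dict plays the role of the key-value store. The table layout and the `save_file` and `files_owned_by` helpers are assumptions for illustration, not a recommended schema.

```python
import sqlite3

blobs = {}                          # stands in for the key-value store
meta = sqlite3.connect(":memory:")  # stands in for the relational database
meta.execute("CREATE TABLE files (blob_key TEXT PRIMARY KEY, owner TEXT, size INTEGER)")


def save_file(key, owner, payload):
    """Store the opaque payload by key, and its queryable metadata relationally."""
    blobs[key] = payload
    meta.execute("INSERT INTO files VALUES (?, ?, ?)", (key, owner, len(payload)))


def files_owned_by(owner):
    """Dynamic query on metadata, followed by static key lookups for the payloads."""
    rows = meta.execute(
        "SELECT blob_key FROM files WHERE owner = ? ORDER BY blob_key", (owner,)
    )
    return [blobs[key] for (key,) in rows]


save_file("img/1", "ann", b"first")
save_file("img/2", "bob", b"second")
save_file("img/3", "ann", b"third")
anns_files = files_owned_by("ann")
```

The relational side answers the flexible question (which keys belong to an owner), and the key-value side only ever sees direct lookups.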

00:29:51.840 Despite the absence of a single guiding theory, we can still find harmony among the various database types available. While navigating this diverse landscape can be confusing, it is achievable. Now, I hope I have time for some questions, although I’m not sure what the format for that is. Thank you for attending my talk.

00:34:17.679 That's my talk. I'm Casey Rosenthal.