Throw Some Keys on it: Data Modeling For Key/Value Data Stores by Example

Hector Castro

#data-modeling

#distributed-systems

#ruby

In the talk titled "Throw Some Keys on It: Data Modeling for Key/Value Data Stores by Example," Hector Castro discusses the growing importance of key/value data stores in contrast to traditional relational database management systems (RDBMS). The presentation highlights the differences between these two approaches to data storage, particularly in the context of modern application requirements.

Key Points:

Transition from RDBMS to Key/Value Stores: While RDBMS are foundational due to their ability to manage complex relationships, they are not always suitable for every application, especially those needing high performance and scalability.
Advantages of Key/Value Stores: Castro emphasizes several benefits of distributed key/value stores:
- Schema-less design, which simplifies data management.
- Fast single-access reads by primary key, which benefits high-rate workloads.
- Effective performance in write-heavy scenarios, particularly because many key/value stores are append-only.
- Simplified API, avoiding cumbersome details like joins and groupings, making it easier to scale applications.
Practical Example Using Uber: To illustrate the concepts, Castro uses the Uber mobile application as a case study, demonstrating how it can effectively leverage key/value stores.
- The example begins by conceptualizing the Uber driver’s location on a coordinate grid for simplicity rather than traditional latitude and longitude.
- He discusses the process of requesting a ride and how the Uber app manages driver and passenger data using a key/value model.
Data Structures in Key/Value Stores: The talk walks through a Ruby function called 'emit_car_location', which takes a car ID, color, and coordinates and records the vehicle's location in the store. Castro discusses:
- Usage of GSet (Grow-only Set) data structures, which allow concurrent writes without risking data loss.
- GSets avoid removing data, focusing solely on adding data, which enhances availability and mitigates conflicts during concurrent operations.

Conclusions and Takeaways:

The importance of choosing the right data storage model is emphasized, particularly in applications that require robust handling of high traffic and large data volumes, such as ride-sharing services like Uber.
Key/value stores can offer solutions that outperform traditional RDBMS in specific scenarios, particularly in terms of speed, scalability, and simplicity in handling data.
Castro encourages developers to rethink their data modeling strategies and to be open to using key/value data stores for applications where relational models may carry unnecessary overhead.

Ultimately, his presentation invites a deeper understanding of data modeling in contemporary applications and highlights the potential advantages of adopting key/value data storage methods.

00:00:14.920 Hello everyone! The talk today is titled 'Throw Some Keys on It: Data Modeling for Key/Value Data Stores by Example.' My name is Hector Castro, and that's my handle pretty much everywhere. A quick disclaimer upfront: I work for Basho, where we make a distributed key/value database. Before we dive in, I want to share a quick story from my time in Mexico. If you're sitting in an open cafe, you might encounter a situation where someone approaches you, puts their hand on you, and silently stands there until you give them money. If you're with someone else, they will do the same to them. It makes you think about potential career options if software doesn't work out for us!

00:00:36.079 So, how many people here have seen something like this hanging on someone’s cube wall or printed out to explain what a database is and the relationships between different entities? Relational databases aren’t all bad; they offer many appealing features from a developer’s perspective. They allow us to establish relationships between multiple entities. When data is normalized across tables, they provide transactions that ensure modifications across multiple rows happen atomically. They allow us to define schemas with clearly defined types, which helps optimize queries. Finally, they let us run ad hoc SQL queries against all the data in the database.

00:01:14.320 However, what if your application doesn’t need most of these features, or if latency, scalability, or high availability are more critical? That's where key/value data stores come into play, specifically distributed key/value data stores. These are attractive for several reasons. They are schema-less, which we saw in a couple of talks yesterday. They serve reads with a single access by primary key, which makes operations faster, particularly in high-rate workloads. They are also very effective for write-heavy workloads, since many key/value stores are append-only, making writes very fast. Moreover, they tend to scale more easily because they have a more restricted API, avoiding the complexities of joins or grouping data across multiple shards.

00:02:39.040 For some examples, in Ruby, you can quickly create a hash and assign values to it, even complex structures. By using serialization, you can store images and other binary data. When faced with a primitive data interaction method, the initial thought may be that the application is too complex for this system, requiring tables and columns. However, the example I will walk through is relatively simple, but it gave me an 'aha moment' when modeling applications for key/value data stores. I hope it will produce the same moment for others here today. Are people familiar with the Uber application? For those who aren't, Uber is a mobile application that allows you to request drivers, typically luxury vehicles, but they are also expanding to offer standard cars.
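The hash-and-serialization idea above can be sketched in a few lines of Ruby. This is only an illustration: a plain `Hash` stands in for the key/value store, and the key names (`user:ada`, `image:logo`) and the choice of JSON as the serialization format are assumptions, not something shown in the talk.

```ruby
require "json"

# A plain Hash stands in for the key/value store (an assumption for illustration).
store = {}

# A simple value under a key.
store["greeting"] = "hello"

# A complex structure, serialized to a JSON string before it is stored.
profile = { "name" => "Ada", "languages" => ["ruby", "erlang"] }
store["user:ada"] = JSON.generate(profile)

# Binary data is just bytes to the store; here, the first four bytes of a PNG header.
store["image:logo"] = [0x89, 0x50, 0x4E, 0x47].pack("C*")

# Reading back: fetch by key, then deserialize.
restored = JSON.parse(store["user:ada"])
```

The store never inspects the value; the application decides how to serialize on write and deserialize on read.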

00:03:52.320 Payments go through credit cards, with no real interaction with the driver, effectively optimizing the taxi service experience. As we consider how this system works, we start to think about geolocation and geospatial queries. As engineers, we might instinctively reach for tools like Postgres or Solr, which have geospatial query capabilities, but today I will demonstrate how to approach this problem using a key/value data store. The Uber system comprises several key components: the driver, the passenger (you), and the entire mapping system that identifies the area around you.

00:05:02.080 Let’s start with the driver. Imagine the driver's location represented within a coordinate grid, where locations are defined as simple grid coordinates rather than traditional latitude and longitude, to keep the example general. If we identify a driver positioned in a square on this grid, we can treat their location as the coordinates of that square. To illustrate how the key/value store works, we will use date, time, and these coordinates together to request our ride. In the key/value approach within Riak, we utilize 'buckets' to namespace our data by the provided date and time.
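One way to picture the bucket and key described above is the sketch below. The helper names `bucket_for` and `key_for` are hypothetical, and the one-minute time granularity and the `"x,y"` key format are assumptions; the talk only says that the bucket comes from the date and time and the key from the grid coordinates.

```ruby
# Hypothetical helper: derive a bucket name from a timestamp.
# Truncating to the minute is an assumed granularity.
def bucket_for(time)
  time.utc.strftime("%Y-%m-%dT%H:%M")
end

# Hypothetical helper: derive a key from a position on the grid.
# Any position inside a square maps to that square's coordinates.
def key_for(x, y)
  "#{x.floor},#{y.floor}"
end

bucket = bucket_for(Time.utc(2014, 2, 20, 15, 4, 30))
key = key_for(3.7, 5.2)
```

A passenger in the same square during the same minute computes the same bucket and key, so finding nearby cars becomes a single read by primary key rather than a geospatial query.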

00:06:55.360 When working with a key/value store that lacks a built-in concept of buckets, you can simply prepend a namespace to the key. The store will hold the payload, which could include an array of vehicles within that grid square. As we walk through some Ruby code, we define a function called 'emit_car_location' which accepts a car ID, color, and its coordinates. The purpose of this data structure is to keep contextually meaningful data while avoiding loss of information when several cars write their locations to the store concurrently.
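The code from the slides is not reproduced in this transcript, so the following is only a minimal sketch of what a function like `emit_car_location` might look like. It assumes an in-memory `Hash` of `Set`s as the store, a `/` separator for the prepended namespace, and extra `store` and `bucket` parameters that the talk does not mention; `new_store` and `namespaced_key` are hypothetical helpers.

```ruby
require "set"

# Hypothetical in-memory stand-in for the store: each key holds a set of cars.
def new_store
  Hash.new { |hash, key| hash[key] = Set.new }
end

# For stores without buckets, prepend the namespace to the key.
# The "/" separator is an assumption.
def namespaced_key(bucket, key)
  "#{bucket}/#{key}"
end

# Record a car at grid coordinates (x, y). Values are only ever added,
# so two cars reporting the same square do not overwrite each other.
def emit_car_location(store, bucket, car_id, color, x, y)
  key = namespaced_key(bucket, "#{x},#{y}")
  store[key].add({ id: car_id, color: color })
  key
end

store = new_store
emit_car_location(store, "2014-02-20T15:04", "car-1", "black", 3, 5)
emit_car_location(store, "2014-02-20T15:04", "car-2", "white", 3, 5)
```

Storing a set per grid square, rather than a single value, is what lets several cars share a key; a real distributed store would also need a way to merge concurrent updates to that set.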

00:11:40.720 The function ensures that when we capture location data of the cars, we utilize a GSet data structure. GSet stands for Grow-only Set, a conflict-free replicated data type (CRDT) which allows for simultaneous writes without the risk of data loss. GSets purposely avoid removing data; you only add to the set, enhancing availability and mitigating potential conflicts arising from concurrent operations and network partitions. In a situation where two writes occur concurrently due to partitions, both values are kept in a sibling state until resolved through defined reconciliation processes.
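To make the grow-only behavior concrete, here is a small Ruby sketch of a GSet. It is not the Riak client API or the code from the talk; the class and its method names are illustrative. The only operations are add and merge, and merge is a set union, which is what makes it safe to combine two replicas (or two siblings) in any order.

```ruby
require "set"

# Illustrative grow-only set: elements can be added but never removed.
class GSet
  attr_reader :elements

  def initialize(elements = [])
    @elements = Set.new(elements)
  end

  def add(element)
    @elements.add(element)
    self
  end

  def include?(element)
    @elements.include?(element)
  end

  # Merging is set union: commutative, associative, and idempotent,
  # so replicas converge no matter the order in which they merge.
  def merge(other)
    GSet.new(@elements | other.elements)
  end
end

# Two replicas accept writes independently, for example during a partition.
replica_a = GSet.new.add("car-1").add("car-2")
replica_b = GSet.new.add("car-2").add("car-3")

# Resolving the two siblings keeps every car that either side saw.
merged = replica_a.merge(replica_b)
```

Because nothing is ever removed, there is no ordering question to settle between the two writes: the union is always a valid resolution.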

00:14:08.639 Keeping both values lets the system avoid data loss and ensures that cars present in the area are accurately logged and retrievable. This technique speaks to the design of scalable systems, especially as usage grows. The process outlined keeps the mapping of cars intact and provides reliable information to users requesting vehicles. Finally, as we anticipate increased scalability demands for services like Uber, it's essential to use data structures that perform well under considerable load, something relational databases often struggle to do at larger scales. Thank you for joining me today, and I hope you found this exploration of key/value data stores insightful.

Big Ruby 2014