00:00:14.920
Hello everyone! The talk today is titled 'Throw Some Keys on It: Data Modeling for Key/Value Data Stores by Example.' My name is Hector Castro, and that's my handle pretty much everywhere. A quick disclaimer upfront: I work for Basho, where we make a distributed key/value database. Before we dive in, I want to share a quick story from my time in Mexico. If you're sitting in an open cafe, you might encounter a situation where someone approaches you, puts their hand on you, and silently stands there until you give them money. If you're with someone else, they will do the same to them. It makes you think about potential career options if software doesn't work out for us!
00:00:36.079
So, how many people here have seen something like this hanging on someone’s cube wall or printed out to explain what a database is and the relationships between different entities? Relational databases aren’t all bad; they offer many appealing features from a developer’s perspective. They allow us to establish relationships between multiple entities. When data is normalized across tables, they provide transactions that ensure modifications across multiple rows happen atomically. They allow us to define schemas, so queries can be optimized against clearly defined types. Finally, they grant the ability to perform ad hoc queries using SQL to access all data in the database.
00:01:14.320
However, what if your application doesn’t need most of these features, or if latency, scalability, or high availability are more critical? That's where key/value data stores come into play, specifically distributed key/value data stores. These are attractive for several reasons. They are schema-less, which we saw in a couple of talks yesterday. They serve reads with a single lookup by primary key, leading to faster operations, particularly in high-rate workloads. They are also very effective for write-heavy workloads since many key/value stores are append-only, making writes very fast. Moreover, they tend to scale more easily because they have a more restricted API, avoiding the complexities of joins or grouping data across multiple shards.
00:02:39.040
For some examples, in Ruby, you can quickly create a hash and assign values to it, even complex structures. By using serialization, you can store images and other binary data. When faced with such a primitive way of interacting with data, the initial thought may be that your application is too complex for this kind of system and that it requires tables and columns. The example I will walk through is relatively simple, but it gave me an 'aha moment' when modeling applications for key/value data stores. I hope it will produce the same moment for others here today. Are people familiar with the Uber application? For those who aren't, Uber is a mobile application that allows you to request drivers, typically with luxury vehicles, but they are also expanding to offer standard cars.
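[Editor's sketch] The hash-and-serialization idea mentioned above can be illustrated as follows. This is a minimal sketch, not code from the talk: the plain Ruby Hash only stands in for a key/value store, and the key names and values are hypothetical.

```ruby
# A plain Ruby Hash standing in for a key/value store (illustration only).
store = {}

# Values can be simple strings or nested structures.
store["greeting"] = "hello"
store["car:42"] = { color: "black", position: [3, 4] }

# Marshal serializes an arbitrary Ruby object into a binary string, which is
# the kind of opaque payload a key/value store keeps under a key.
payload = Marshal.dump(store["car:42"])

# Deserializing the payload gives back an equal object.
restored = Marshal.load(payload)
```

The store never needs to understand the payload; only the application that wrote it does.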
00:03:52.320
Transactions occur through credit cards, with no real interaction with the driver, effectively optimizing the taxi service experience. As we consider how this system works, we start to think about geolocation and geospatial queries. As engineers, we might instinctively reach for systems like Postgres or Solr, which have geospatial query capabilities, but today I will demonstrate how to approach this problem using a key/value data store. The Uber system comprises several key components: the driver, the passenger (you), and the entire mapping system that identifies the area around you.
00:05:02.080
Let’s start with the driver. Imagine the driver's location represented within a coordinate grid, where, to keep things general, locations are defined as simple grid coordinates rather than traditional latitude and longitude. If we identify a driver positioned in a square on this grid, we can treat their location as the coordinates of that square. To illustrate how the key/value store works, we will use date, time, and these coordinates together to request our ride. In the key/value approach within Riak, we utilize 'buckets' to namespace our data by the provided date and time.
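[Editor's sketch] One way to derive a bucket name and a key from a date/time and a grid position is sketched below. This is an assumption-laden illustration rather than the code shown in the talk: the cell size, the minute-level bucket granularity, the string formats, and the helper names `grid_cell`, `bucket_for`, and `key_for` are all hypothetical.

```ruby
# Hypothetical width of one grid square, in the same units as the positions.
GRID_CELL_SIZE = 10

# Map a raw position to the grid square that contains it.
def grid_cell(x, y, cell_size = GRID_CELL_SIZE)
  [x.fdiv(cell_size).floor, y.fdiv(cell_size).floor]
end

# Use the date and time, truncated to the minute, as the bucket name
# (assumed granularity).
def bucket_for(time)
  time.getutc.strftime("%Y-%m-%dT%H:%M")
end

# Use the grid square's coordinates as the key within that bucket.
def key_for(cell)
  cell.join(",")
end

cell = grid_cell(23, 47)
bucket = bucket_for(Time.utc(2013, 6, 1, 12, 30, 45))
key = key_for(cell)
```

With this layout, finding the cars near a passenger becomes a primary-key read of the passenger's own grid square in the current bucket, rather than a geospatial query.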
00:06:55.360
When working with a key/value store that lacks a built-in concept of buckets, you can simply prepend a namespace to the key. The value stored under that key is the payload, which could be an array of the vehicles within that quadrant of the grid. As we walk through some Ruby code, we define a function called 'emit_car_location' which accepts a car ID, color, and its coordinates. The purpose of this data structure is to keep the data meaningful in context while not losing information when several cars write their locations into the store concurrently.
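[Editor's sketch] The namespace-prefix approach and a function shaped like 'emit_car_location' might look like the following. Only the function name and the car ID, color, and coordinate arguments come from the talk; the in-memory `STORE`, the `namespaced_key` helper, the `timestamp` argument, and the key format are assumptions made for illustration.

```ruby
# In-memory Hash standing in for a store with no bucket concept (assumption).
# Each key holds an array of the cars reported in that grid square.
STORE = Hash.new { |hash, key| hash[key] = [] }

# Prepend the namespace (here a date/time string) to the coordinate key.
def namespaced_key(namespace, x, y)
  "#{namespace}:#{x},#{y}"
end

# Hypothetical signature modeled on the talk's description; the timestamp
# parameter is an added assumption.
def emit_car_location(id, color, x, y, timestamp)
  key = namespaced_key(timestamp, x, y)
  STORE[key] << { id: id, color: color }
  key
end

key = emit_car_location("car-1", "black", 2, 4, "2013-06-01T12:30")
emit_car_location("car-2", "white", 2, 4, "2013-06-01T12:30")
cars = STORE[key]
```

In a single process this append is safe. Against a distributed store, two clients doing the same read-modify-write on one key could overwrite each other, which is the information loss the talk is concerned with.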
00:11:40.720
When the function captures the location data of the cars, it stores them in a GSet data structure. GSet stands for Grow-only Set, a conflict-free replicated data type (CRDT) which allows for simultaneous writes without the risk of data loss. GSets purposely avoid removing data; you only add to the set, which improves availability and avoids the conflicts that arise from concurrent operations and network partitions. In a situation where two writes occur concurrently due to partitions, both values are kept as siblings until they are resolved through a defined reconciliation process.
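[Editor's sketch] The grow-only set described above can be sketched in a few lines. This is a minimal in-memory illustration of the G-Set idea, not Riak's implementation or any client library's API; the class and the replica names are hypothetical. For a grow-only set, reconciling two siblings is a set union.

```ruby
require "set"

# Minimal grow-only set (G-Set) sketch for illustration.
class GSet
  attr_reader :elements

  def initialize(elements = [])
    @elements = Set.new(elements)
  end

  # The only mutation is an add; there is deliberately no remove.
  def add(element)
    @elements.add(element)
    self
  end

  # Merging two replicas is a set union, so the result does not depend on
  # merge order and merging the same state twice changes nothing.
  def merge(other)
    GSet.new(@elements | other.elements)
  end

  def include?(element)
    @elements.include?(element)
  end
end

# Two replicas accept writes on opposite sides of a partition.
replica_a = GSet.new.add("car-1")
replica_b = GSet.new.add("car-2")

# Reconciling the two sibling values keeps both cars.
merged = replica_a.merge(replica_b)
```

Because elements are never removed, neither concurrent write can be lost in the merge; the trade-off is that the set can only grow.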
00:14:08.639
Keeping both values lets the system avoid data loss and ensures that the cars present in an area are accurately logged and retrievable. This technique speaks to the design of scalable systems, especially as usage grows. The process outlined keeps the mapping of cars intact and provides reliable information to users requesting vehicles. Finally, as we anticipate increased scalability demands for services like Uber, it's essential to utilize data structures that perform well under considerable load, something relational databases often fail to do at larger scales. Thank you for joining me today, and I hope you found this exploration of key/value data stores insightful.