LoneStarRuby Conf 2011
Chronologic and Cassandra at Gowalla
Summarized using AI

by Adam Keys

In the video titled "Chronologic and Cassandra at Gowalla," Adam Keys discusses the development and functionality of Chronologic, an activity feed database built using Cassandra, Ruby, and Sinatra. The presentation covers the challenges and advantages of maintaining social activity feeds across various software applications while addressing concerns regarding storage and privacy.

Key Points Discussed:

  • Introduction to Cassandra: Keys explores what Cassandra is, describing it as a hybrid of distributed hash tables and column-oriented databases, emphasizing its masterless design and resilience against node failures.
  • Data Structure and Flexibility: He explains how Cassandra handles keys and values, hashes, sorted sets, and counters, providing flexibility in data storage with support for both structured and nested objects.
  • Use Cases at Gowalla: Gowalla encounters challenges managing rapidly growing social data, including check-ins and social graphs. Using Cassandra provides benefits such as efficient replication, scalability, and high availability without impacting performance.
  • Chronologic Overview: Chronologic serves to capture and store event-driven data from user interactions, such as check-ins and posts. It integrates seamlessly with social graphs, allowing users to retrieve activities in a consolidated manner.
  • Event Handling: The application employs a publish-subscribe model to manage the flow of events. Events and sub-events are stored systematically, allowing for organized and efficient data retrieval.
  • Implementation and API Usage: Keys illustrates how to set up bindings using the Ruby Cassandra driver, discussing simple social graphs and querying capabilities that allow manipulation of user-related data based on specific criteria.
  • Future Prospects and Enhancements: The potential for Chronologic to be used across different applications is highlighted, as well as ongoing improvements to the API documentation and features aiming to support ease of integration for developers.

Conclusions and Takeaways:

  • The distribution and flexibility of Cassandra make it suitable for high-velocity activity feed applications, such as those at Gowalla.
  • Chronologic is positioned as a versatile solution for managing event data, capable of adapting to the needs of different software environments while emphasizing user privacy.
  • The system's ability to efficiently communicate user activities via a REST JSON API streamlines operations, enhancing overall usability for applications.
00:00:20 All right, I'm going to go ahead and get started. I'm Adam Keys, and I work here in Austin at Gowalla. We've been doing a good bit of stuff with Cassandra lately, and I've been building an activity feed database called Chronologic. I'm going to introduce that to you today and then give it to you.
00:00:44 Just to show of hands, is anyone here working with Cassandra? Just tinkering or played with it? One person. Anyone have it in production? Just me? Okay, great! So we get to learn Cassandra together.
00:01:06 So, what is Cassandra? You've maybe heard about this thing that Twitter and Facebook started, and other people are using. Is it a distributed hash table that's persistent and durable? Is it a column-oriented, masterless database? Or maybe it's a Dynamo and Bigtable hybrid based on these papers that Amazon and Google published? The answer is yes; it is these things. It's Cassandra; it's whatever it wants to be, just like Zuul.
00:01:26 So, a lot of talks on Cassandra will throw you knee-deep into the data model or be like, 'Hey, it's this academic thing that makes sense if you know what Cassandra is already.' But I don't want to do that, so I'll describe it in terms of other things that you already know.
00:01:54 It stores keys and values, just like Memcache or Redis does. It also stores hashes, like Redis does when you use 'HSET' or 'HGET'. It has sorted sets, so you can throw a bunch of things into a row in Cassandra and test for membership. You can pull things out in order and page through it, etc. Additionally, it does keys and counters like Redis does. Cassandra has excellent support for atomic distributed counters, which is jargon-y but extremely useful.
00:02:36 It stores keys with values within them, just like MongoDB, allowing you to store nested document-like objects. Besides that, it operates with rows and columns like MySQL or PostgreSQL, which we've all come to know and love. The great thing about Cassandra is that it's distributed – none of the nodes are special. Any one node can go down, and no one has to wake up in the middle of the night.
00:03:01 Each server in your cluster is arranged conceptually on a ring. Every server knows about the next servers on that ring, which is used to locate data in the cluster. You have some key that you access data by. You connect to a node, and say, 'Hey, give me this data for this key,' and then that node knows how to find the data on the ring and send it back to you.
00:03:38 Data is stored along this ring and is also replicated along the same ring. As I mentioned, you can lose a node in the middle of the night, and your operations guy doesn't have to wake up, which they appreciate. You specify how many copies of every row to keep in the cluster, and Cassandra will ensure that the specified number is written somewhere in that ring.
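The ring lookup and replication described above can be sketched in a few lines of Ruby. This is a minimal illustration, not Cassandra's actual partitioner: the `Ring` class, the node names, and the use of MD5 to derive positions are all assumptions made for the example.

```ruby
require "digest"

# Hypothetical token ring for illustration; not Cassandra's real partitioner.
class Ring
  def initialize(nodes)
    # Place each node on the ring at a position derived from its name.
    @tokens = nodes.map { |node| [token_for(node), node] }.sort
  end

  # Find the first node at or after the key's position, then walk
  # clockwise and collect `copies` distinct nodes to hold the row.
  def replicas_for(key, copies)
    key_token = token_for(key)
    start = @tokens.index { |token, _| token >= key_token } || 0
    @tokens.rotate(start).first(copies).map { |_, node| node }
  end

  private

  # Map any value to a large integer position on the ring.
  def token_for(value)
    Digest::MD5.hexdigest(value.to_s).to_i(16)
  end
end

ring = Ring.new(%w[node-a node-b node-c node-d])
p ring.replicas_for("akk", 3) # three distinct nodes that would store the row
```

Because placement depends only on the key and the node list, any node can answer the question "where does this key live?", which is why no node needs to be special.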
00:04:08 You can also specify whether the replication occurs synchronously or asynchronously, which is a nice tunable parameter on a per-request basis rather than a server-global configuration. Each server in the ring is aware of every other server, and they gossip about their statuses, sharing insights about which servers are down or back up and what data needs to be sent.
00:04:43 You can connect to any server; none are special. For instance, at Gowalla, we have our Cassandra Ruby driver, which we instantiate by telling it about all the nodes in the cluster. The driver then chooses randomly which one to connect to, or you can specify one node and it can auto-discover the rest.
00:05:09 Cassandra is like all these other databases you might know in terms of consistency and the classic ACID profile. You can operate Cassandra with no guaranteed durability, where it confirms a write even though it may not be true, similar to how Memcache operates. Alternatively, you can work with full durability, meaning it does not return to your client until every copy of the data is written to disk across the cluster.
00:05:54 You have options regarding consistency as well. You can choose no consistency, meaning it writes as fast as possible without showing you the data back. You can opt for quorum consistency, ensuring that as long as a majority of nodes that are supposed to store the data agree on the value, you're good. You can even operate with full consistency, ensuring that you only receive values that are in agreement across the cluster.
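The trade-off between these levels comes down to simple arithmetic over the number of replicas. The sketch below is illustrative only: the method names and the `:one`, `:quorum`, `:all` symbols are assumptions for the example, not constants from the Ruby driver.

```ruby
# A quorum is a strict majority of the replicas that hold a row.
def quorum(replication_factor)
  replication_factor / 2 + 1
end

# How many replicas must acknowledge a request at each level.
# The level names here are illustrative, not driver constants.
def required_acks(level, replication_factor)
  case level
  when :one    then 1
  when :quorum then quorum(replication_factor)
  when :all    then replication_factor
  else raise ArgumentError, "unknown level: #{level}"
  end
end

# A read is guaranteed to see the latest write when the replicas read
# plus the replicas written exceed the total number of replicas.
def overlapping?(read_level, write_level, replication_factor)
  required_acks(read_level, replication_factor) +
    required_acks(write_level, replication_factor) > replication_factor
end

p overlapping?(:quorum, :quorum, 3) # → true  (2 + 2 > 3)
p overlapping?(:one, :one, 3)       # → false (1 + 1 <= 3)
```

With three copies, quorum reads and quorum writes always share at least one replica, while reading and writing a single replica is faster but may return stale data.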
00:06:25 Cassandra is designed to account for your network topology by allowing you to specify different groups of machines based on proximity and connectivity speed. For example, you can configure it to minimize interactions with machines connected via slower connections while ensuring replication to those machines.
00:06:54 The advantages of this design become apparent in scenarios such as when a plane lands in an Amazon data center or if a data center goes down unexpectedly. NoSQL databases each come with their own strengths and weaknesses, but Cassandra strives to do both well.
00:07:25 At Gowalla, we're using Cassandra for fast-growing data. We accumulate many check-ins, maintain a large social graph, manage photos and comments, and constantly create new data. As this data grows quickly, we use Postgres, which has okay replication support, but doesn't fit our needs for rapid access.
00:07:44 We have data that benefits from pre-materialized views. For instance, when you pull your friends' activities in Gowalla, it’s advantageous to make one request and receive all check-ins, user spot data, photos, comments, highlights, and metadata in one go rather than juggling multiple sources.
00:08:09 Our operations team greatly appreciates the masterless design of Cassandra that ensures redundancy and availability. If we lose one node overnight, the system keeps functioning without raising error rates for users. The design allows for continued operation even if additional nodes fail.
00:08:34 In the last year at this conference, I began experimenting with a method to store audit data when users change spots or various data changes occur in our system. We were previously writing an audit log to Postgres, but that table was growing very quickly. I developed a small low-risk project to store that data in Cassandra instead.
00:09:00 This was an easy project to manage because it risked little disruption to existing systems while allowing us to explore the benefits of Cassandra. It's now how we store change logs, and I've been working on the Chronologic activity stream database for activities, check-ins, and more.
00:09:31 Additionally, we cache social graphs from other networks. We extensively engage in social friend matching against Facebook and Twitter friends, which requires storing that data locally for quick access.
00:10:04 We're also planning to implement annotations, which allow users to mark locations, serving as a multi-level relationship. Moreover, we aim to move out-of-band notifications, like alerts when friends check in, into separate systems.
00:10:30 Looking ahead, while Redis is a great database with a fast and interesting data model, it primarily operates on a master-slave replicated setup. In contrast, Cassandra allows similar capabilities without the complexities of handling pre-sharding or slave reads.
00:11:04 There's also intriguing potential in using Cassandra alongside Solr for real-time index updates. At the same time, for big data applications, you can replace HDFS with Cassandra for hybrid clusters, enabling you to run Hadoop jobs without worrying about the NameNode.
00:11:47 However, there are scenarios where Cassandra may not be ideal. For example, it’s not great for geo searches or location-based queries. While it’s possible to implement geo structures, I prefer to manage that through systems like Solr.
00:12:14 Cassandra isn't as efficient for tightly normalized relational data, and you can't do joins at the database level. Additionally, if you're dealing with small data sets—around a million rows or less—then MySQL, Redis, or other systems are often better suited.
00:12:38 I wouldn't use Cassandra for prototyping when I’m unsure of what queries will be important or what data I need fast access to. It's better to stick with a system that allows for ad-hoc queries until the application's needs stabilize.
00:13:02 While explaining Cassandra's data model, I’ll illustrate how Chronologic stores some data. For instance, we’ll examine storing keys in blobs similar to Memcache for events, and how to store keys and columns like hashes in Redis.
00:13:38 Now, I’m switching over to using the Ruby Cassandra driver. I create a new connection and use an LSR keyspace. I have a hash of data that I want to store in a column, so I insert it into the example column family. The key's name is 'akk', the column name is 'data,' and the value is a JSON-encoded string.
00:14:15 To read this data back, I simply call get on the example column family, specifying the 'akk' row key. The returned result will be an ordered hash with a data column displaying the JSON string and a timestamp.
00:14:53 Using keys and columns in a hash-like structure, I can submit the entire hash where I have two columns, 'username' and 'age,' with values 'akk' and '31' respectively. This reaffirms how Cassandra allows for structure while also supporting variable schemas where any entry can look different.
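The two storage styles described here, a JSON blob in a single column and a hash spread over several columns, can be modeled without a running cluster. The `ToyKeyspace` class below is a hypothetical in-memory stand-in, not the Ruby Cassandra driver; it only mimics the shape of inserting and getting rows. The `akk-columns` row key is likewise made up for the example.

```ruby
require "json"

# Hypothetical in-memory stand-in for a keyspace; this is NOT the driver.
class ToyKeyspace
  def initialize
    # column family -> row key -> { column name => value }
    @families = Hash.new do |families, name|
      families[name] = Hash.new { |rows, key| rows[key] = {} }
    end
  end

  # Merge columns into a row. Rows in one family may have different columns.
  def insert(family, key, columns)
    @families[family][key].merge!(columns)
  end

  # Return the row's columns ordered by column name.
  def get(family, key)
    @families[family][key].sort.to_h
  end
end

store = ToyKeyspace.new

# Style 1: one 'data' column holding a JSON-encoded blob, like Memcache.
store.insert(:Example, "akk", { "data" => JSON.generate({ "username" => "akk", "age" => 31 }) })

# Style 2: one column per attribute, like a Redis hash.
store.insert(:Example, "akk-columns", { "username" => "akk", "age" => "31" })

p store.get(:Example, "akk")
p store.get(:Example, "akk-columns")
```

The blob style is simplest when the application always reads the whole object; the per-column style lets a reader or writer touch individual attributes.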
00:15:40 Next, let's discuss simple social graphs. This represents a sub-graph of my followers on Gowalla. Each user on the left side is linked to me, showing who follows me, while the right reflects the users I follow.
00:16:28 When I insert these four columns into one row for the user ID in the graph, I can retrieve the stored hash and identify my followers, excluding any empty strings.
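As a rough sketch of that row layout, assume one row per user where each column name is a follower's ID and the column value is just an empty placeholder. The helper names and user IDs below are hypothetical.

```ruby
# Hypothetical layout: one row per user, one column per follower.
# The column name carries the follower's ID; the value is an empty string.
def follower_row(follower_ids)
  follower_ids.each_with_object({}) { |id, row| row[id] = "" }
end

# The followers are simply the column names of the row.
def followers(row)
  row.keys.sort
end

# Membership test: does `candidate` appear as a column in the row?
def followed_by?(row, candidate)
  row.key?(candidate)
end

row = follower_row(%w[user_7 user_3 user_12 user_5])
p followers(row)
p followed_by?(row, "user_3")
```

Storing the interesting data in the column name rather than the value is what lets a single row act like a set.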
00:16:53 Cassandra's ability to store events as columns is crucial for timeline indexes. I’ll show how to create a column family with indexes on specific columns like age and username, allowing me to select users based on specified criteria such as age over 18.
00:17:17 This is functionality that's typically absent in Redis or Memcache. This version of Cassandra, the latest release, ships with CQL (Cassandra Query Language), which is a subset of SQL but without joins.
00:17:42 Chronologic is a service for storing event data and related objects, mapping that to social graphs, and assembling feeds of events and sub-events. I collaborated with talented artists to craft a logo for my project, representing the events generated, like statuses, check-ins, photos, or other activities.
00:18:12 These events relate to users through social graphs, indicating how users connect with one another, whether they follow each other, friend each other, or have meaningful connections. When someone posts something, all their followers see it; that is central to how the social graph operates.
00:18:53 Chronologic applications replicate their objects and social graphs, ensuring every user or repository interaction mirrors across systems. Thus, actions published in Gowalla for check-ins, like when I write a status or post something on GitHub, will show up in followers' timelines.
00:19:48 While Chronologic records events, application subscriptions manage the flow of events. For instance, if a check-in occurs, I can share visibility in a global feed or across networks, meaning any event published becomes visible to all relevant users or platforms.
00:20:25 Applications harness these feeds to retrieve a fully materialized timeline in one HTTP request, encapsulating all check-ins, pushes, related objects, and sub-events. This makes data retrieval efficient and effective, avoiding multiple system conversations.
00:21:09 Chronologic's effectiveness lies in its design; it sends objects referred to by events to applications via REST JSON API, making the system user-friendly. Transactions are processed in a time-efficient manner to maintain high performance.
00:21:53 The internal architecture of Chronologic stores replicated data into key-blobs, similar to storing user data. There are considerations for using secondary indexes in the future, where various columns can yield different events or statuses, prompting a richer application.
00:22:39 Establishing subscriptions allows specific users or groups to interact meaningfully with one another. We can set up a back-link privacy feature within Chronologic. By offering privacy checks, we ensure that when users view their feeds, irrelevant data from those not in their circle is excluded.
00:23:17 With a publish-subscribe mechanism, linking associated graphs can facilitate communication efficiency. For instance, when someone checks in to a place, it can surface in the global feed for other users looking at similar check-ins.
00:24:09 Events are systematically stored as key-column structures, employing a timestamp generation process. This hierarchy allows for better organization and sequential retrieval of data.
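One way to picture a timeline row is a set of columns whose names are sortable, time-based tokens and whose values are event keys. The sketch below is an assumption-laden illustration: the `TimelineRow` class and the token format are invented for the example and are not Chronologic's actual key scheme.

```ruby
# Hypothetical timeline row: column names are time-based tokens,
# column values are event keys. Not Chronologic's real key format.
class TimelineRow
  def initialize
    @columns = {}
    @sequence = 0
  end

  # Build a zero-padded token so string order matches chronological order.
  # The sequence suffix keeps events in the same microsecond distinct.
  def add(event_key, at: Time.now)
    @sequence += 1
    micros = (at.to_r * 1_000_000).to_i
    token = format("%020d-%06d", micros, @sequence)
    @columns[token] = event_key
    token
  end

  # Newest first. `before` pages to older entries from the last token seen.
  def page(count, before: nil)
    tokens = @columns.keys.sort.reverse
    tokens = tokens.select { |token| token < before } if before
    tokens.first(count).map { |token| [token, @columns[token]] }
  end
end

row = TimelineRow.new
row.add("checkin_1", at: Time.at(100))
row.add("photo_2", at: Time.at(200))
row.add("comment_3", at: Time.at(300))

first_page = row.page(2)
next_page = row.page(2, before: first_page.last.first)
p first_page.map(&:last) # → ["comment_3", "photo_2"]
p next_page.map(&:last)  # → ["checkin_1"]
```

Because the ordering lives in the column names, reading a page of a timeline is a single ordered slice of one row.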
00:25:13 As we explore further details of the internal API, any user publication becomes an event stored in the 'events' column, allowing an efficient query and retrieval for related timelines.
00:25:39 When users interact with an event, they can comment or share insights, creating sub-events that also appear within the original event timeline. This dynamic interaction maintains organization while allowing for a cohesive narrative.
00:26:15 On the implementation side, when publishing events, they are stored in the designated events column family while writing to all relevant timelines for each user's activity. Fetching these timelines allows for seamless engagement with events across users or systems.
00:26:55 The ability to fetch all relevant data from one event simplifies how information is disseminated within the application. Upon triggering an event publish, the system concurrently manages sub-events while linking them appropriately to ensure a rich user experience.
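The publish and fetch flow can be modeled in memory to show how one event write fans out to several timelines and how sub-events attach to their parent. Everything here is a sketch under stated assumptions: `ToyFeed`, its method names, and the choice to file sub-events under a timeline named after the parent event's key are illustrative and do not reflect Chronologic's real classes or schema.

```ruby
# Hypothetical in-memory model of publish and fetch; names are illustrative.
class ToyFeed
  def initialize
    @events = {}                                   # event key -> event data
    @timelines = Hash.new { |h, k| h[k] = [] }     # timeline -> event keys, oldest first
    @subscriptions = Hash.new { |h, k| h[k] = [] } # timeline -> subscriber timelines
  end

  # Events written to `publisher` should also reach `subscriber`.
  def subscribe(subscriber, publisher)
    @subscriptions[publisher] << subscriber
  end

  # Store the event once, then write its key to every relevant timeline.
  # With `parent:`, the event is a sub-event filed under the parent's key.
  def publish(key, data, timelines: [], parent: nil)
    @events[key] = data
    targets = parent ? [parent] : timelines.flat_map { |t| [t, *@subscriptions[t]] }
    targets.uniq.each { |timeline| @timelines[timeline] << key }
    key
  end

  # Return a fully materialized timeline, newest first, sub-events nested.
  def fetch(timeline)
    @timelines[timeline].reverse.map do |key|
      @events[key].merge("key" => key, "subevents" => fetch(key))
    end
  end
end

feed = ToyFeed.new
feed.subscribe("user_2_home", "user_1_feed")
feed.publish("checkin_1", { "type" => "checkin" }, timelines: ["user_1_feed"])
feed.publish("comment_1", { "type" => "comment" }, parent: "checkin_1")
p feed.fetch("user_2_home")
```

The work happens at write time: each publish touches every subscribed timeline, so a reader gets the whole feed, with comments attached, from a single fetch.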
00:27:39 As we review API functionality, users can publish events, mirror objects, and configure interactions. Through specified relationships, events published can translate into user actions across applications, creating synergy in user engagement.
00:28:27 Subsequent to a user's interaction, applications can capture the essence of those interactions, such as names of individuals associated with certain actions, and present them to develop a clearer narrative in app experiences.
00:29:09 Looking towards the future, we’re working on enhancing the readme, improving protocol, and creating comprehensive API documentation for Chronologic. While it operates in a production environment now, there are still numerous improvements and optimization tactics being executed.
00:29:43 After our upcoming major release at Gowalla, we plan to push out an official 1.0 version, ensuring it meets performance standards while expanding its usage across various platforms.
00:30:24 I'm excited about deploying Cassandra more widely; it alleviates many of the challenging concerns associated with database management. One innovation I'm exploring involves building a gem that integrates Cassandra effortlessly, offering features similar to Redis.
00:31:06 As interest grows around distributed databases like Cassandra, building tools and resources will enhance usability for developers. Innovations such as the hybrid enterprise setup promise to deliver reliable performance without sacrificing manageable workflows.
00:31:55 It's almost time for software releases, and as we prepare, if anyone has questions, I'm available to clarify any part of the process I rushed through. Understanding these concepts is vital for future application and development.
00:32:38 To address the query regarding whether Cassandra is suitable for a pub/sub messaging infrastructure, I believe it thrives in activity feed-related contexts. However, it might not serve best for high-frequency messaging scenarios.
00:33:16 The question about Cassandra's functionality in a graph structure leads us to explore how data is treated within the system. While it's not strictly required to fit traditional graph schemas, flexibility lies in handling various object types.
00:33:58 The potential of using Chronologic as a starter kit for photo or status-sharing services is intriguing. Developers can leverage it to help manage data feeds without needing to develop intricate tables from scratch.
00:34:34 As we think about future settings, durability and consistency settings within Cassandra have been resilient based on user testing. The distributed nature ensures performance while validating user actions in real-time.
00:35:13 By permitting flexibility in data consistency levels, users can tailor operations to fit their requirements—shaping a well-rounded experience for those managing workloads and interactions.
00:35:47 Ultimately, I would like to highlight the importance of well-considered user queries and data relationships shaping the usability of Chronologic in production settings.
00:36:05 That wraps up my presentation! Thank you for your interaction and thoughts during this session. If you have any other questions or insights, feel free to share.