Rocky Mountain Ruby 2011

Cloning Twitter: Rails + Cassandra = Scalable Sharing

by Charles Max Wood

In the talk 'Cloning Twitter: Rails + Cassandra = Scalable Sharing' by Charles Max Wood, presented at Rocky Mountain Ruby 2011, the speaker discusses the integration of Cassandra, a highly scalable NoSQL database, with Ruby on Rails to create a scalable application, particularly a clone of Twitter. The main focus of the talk is on the advantages of using Cassandra over traditional relational databases, especially in handling high-volume write operations and dynamic schema requirements.

Key points covered in the talk include:

- Introduction to Cassandra: Wood shares his background in Ruby and Rails since 2005 and briefly mentions the rise of NoSQL databases, particularly Cassandra, which was open-sourced by Facebook in 2008 and has since been supported by the Apache Software Foundation.
- CAP Theorem: The speaker frames Cassandra in terms of the CAP theorem, under which a distributed system can guarantee at most two of availability, consistency, and partition tolerance. He emphasizes that Cassandra prioritizes availability and partition tolerance, making it suitable for large-scale applications like Twitter.
- Use Cases for Cassandra: The platform is ideal for applications that require handling large data sets with high write loads and can adapt to ever-changing data schemas without predefined structures.
- Data Structure and Querying: Wood explains that data in Cassandra is organized into 'keyspaces' and 'column families,' both akin to hashes, with each row stored as a record of key-value columns. Queries look data up by a single key.
- CRUD Operations: The talk covers how CRUD operations in Cassandra are simpler through a Ruby gem than through the command line interface, because the gem handles serialization so that developers do not work with raw byte arrays.
- Scaling with Cassandra: He explains how to leverage multiple machines for scaling, set replication factors for data reliability, and adjust consistency levels based on data criticality.
- Active Model and ORM: Wood shares insights into building an ORM that follows the active record pattern while fitting Cassandra's data structure, giving Rails developers a familiar interface.
- Secondary Indexes and Key Management: He discusses the nuances of managing secondary indexes in Cassandra and the flexibility of defining diverse key types.

In conclusion, Wood advocates for the strategic use of Cassandra within the Ruby ecosystem, encouraging developers to explore how its schema-less advantages facilitate rapid development and adaptation to new requirements. His insights into the practical application of Cassandra foster a deeper understanding of the NoSQL paradigm, particularly in creating scalable applications like a Twitter clone.

00:00:09.360 Okay, so to start out, I just want to ask really quickly how many of you have actually used Cassandra in an app somewhere, even just playing with it?
00:00:12.480 We’ve got a couple here. How many of you have used it with Rails? One or two, okay.
00:00:15.679 Let’s see if this works. So, a little bit about me real quick: I’ve been doing Ruby and Rails since 2005.
00:00:21.119 Back then, I was doing tech support. I couldn't get my boss to buy us a tool, so I started building one and realized quickly that management wasn’t as fun as coding, so I switched.
00:00:25.679 I went freelance last year, and some of you may listen to some of the podcasts that I do, like 'Teach Me to Code,' 'Ruby Rogues,' and 'Rails Coach.' Those are a few of the things that I do, and I like to play with that kind of technology.
00:00:29.039 When I was getting ready to prepare this talk, I had a client who came to me and said, 'I want a Twitter clone.' I looked at him and said, 'You know, Twitter isn’t making any money, so this probably isn’t a great idea.'
00:00:35.920 However, he explained to me that he had a unique selling proposition; he wanted some functionalities that Twitter offers but didn't want Twitter itself. I figured that it was probably something that wouldn’t kill him, and he might actually be able to make it work.
00:00:43.439 He had some interesting ways of advertising on the site, so I said, 'Go ahead, I’ll do it for you.' He offered to pay me a substantial amount of money for it. A few months later, his brother-in-law, one of the founders of Dentrix, which is dental software, told me that Twitter was using this NoSQL solution to handle all of its tweets.
00:00:58.320 His brother-in-law insisted that he wanted a NoSQL solution right away, so I said, 'Okay.' I was apprehensive but agreed to go ahead with it.
00:01:10.720 As I was learning to implement Cassandra into this Rails app, I thought, 'I might as well talk about it.' A few months ago, right after I submitted this talk, he approached me and said he wanted to get the project into beta so he could start getting feedback, but I told him that some things needed to be cut from the plan.
00:01:30.720 He suggested cutting the conversion to Cassandra, which I found amusing. This conversation made me think that I might just build my own Twitter clone. How many freelancers here have time for a large project like that? I didn’t see any hands.
00:02:08.000 I've started working on a semi-functional prototype, but it’s not complete enough to demonstrate here. Speaking of Cassandra, there are a few hands raised. Most of you know it’s a NoSQL solution. Initially, I was confused by people discussing it as a column-oriented database in contrast to row-oriented databases.
00:03:01.840 We’ll discuss the schema in a bit, but generally, a column-oriented database is about how the data is conceptualized, not the structure itself. Cassandra was started by Facebook, which open-sourced it in 2008, and since then the Apache Foundation has supported it, leading to rapid development.
00:03:29.360 Cassandra is based on the CAP theorem, which states that a distributed system can guarantee at most two of three properties: availability, consistency, and partition tolerance. Availability means your client can always connect and retrieve data, while consistency means that the same query sent to different nodes yields the same result each time. Partition tolerance means the cluster keeps working when the network between its nodes fails and some machines cannot reach each other.
00:04:58.799 Cassandra typically emphasizes availability and partition tolerance rather than full consistency. Why would you use Cassandra over a relational database? While some argue relational databases are obsolete, I believe in choosing the right tool for the right problem.
00:05:16.000 Cassandra's benefits shine in large deployments, such as Twitter, which needs to handle billions of tweets. It excels in write-heavy operations and can easily integrate geographically distributed setups. If your schema is constantly evolving, Cassandra is a good fit, as it doesn’t require a predefined schema.
00:06:50.320 In Cassandra, the top-level structure is a keyspace, akin to a hash. Inside the keyspace are column families, which are also similar to hashes: the keyspace holds cluster-level settings such as replication, and each column family maps row keys to rows.
00:07:01.679 Each row is a record, and its columns are key-value pairs of column name and value. Queries in Cassandra go by key, just like a hash lookup, and you can only look up data by one key at a time.
00:08:01.040 It’s common to create an entire column family for one query; for instance, if I want all tweets from user X, I’ll set up a column family where the key is the user identifier. Note that the sort order of the columns is fixed when the column family is defined, so it’s not unusual to set up separate column families for different queries.
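
The layout described above can be pictured with plain Ruby hashes. This is only a sketch of the shape of the data, not of Cassandra itself: the column family names (`users`, `user_tweets`), the sample rows, and the helper `tweets_for` are all made up for illustration.

```ruby
# Keyspace -> column family -> row key -> { column name => value }.
# Plain hashes stand in for Cassandra here; nothing talks to a real cluster.
keyspace = {
  users: {
    "alice" => { "name" => "Alice" }
  },
  # One column family per query: all tweets by a user, keyed by user id.
  # Column names are timestamps, so sorting them gives time order.
  user_tweets: {
    "alice" => {
      "2011-09-01T10:00:00Z" => "second tweet",
      "2011-08-31T09:00:00Z" => "first tweet"
    }
  }
}

# A lookup is always "column family + one row key".
def tweets_for(keyspace, user_id)
  row = keyspace[:user_tweets].fetch(user_id, {})
  # Cassandra keeps columns sorted within a row; a Hash does not, so sort here.
  row.sort.map { |_timestamp, text| text }
end

timeline = tweets_for(keyspace, "alice")
```

Because the row key is the only way in, answering a different question (say, tweets by hashtag) would mean adding another column family keyed by that value.
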
00:09:05.840 CRUD operations in Cassandra are simpler with the Cassandra gem than with the CLI. The gem handles serialization for you, so you aren’t working with raw byte arrays. The operations include a regular get, multi-get, and remove, while insert performs both create and update.
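
To make the four operations concrete, here is a small in-memory stand-in. It is not the Cassandra gem: the class name `FakeCassandra` and the method signatures are assumptions chosen for illustration, and only the operation names (get, multi-get, remove, insert) come from the talk.

```ruby
# Hypothetical in-memory stand-in for a Cassandra client. The method names
# follow the operations named in the talk; the signatures are assumptions.
class FakeCassandra
  def initialize
    # column family => row key => { column => value }
    @data = Hash.new { |families, cf| families[cf] = {} }
  end

  # insert is an upsert: it creates the row or merges new columns into it.
  def insert(column_family, key, columns)
    row = (@data[column_family][key] ||= {})
    row.merge!(columns)
    nil
  end

  # get returns a copy of the row's columns, or an empty hash if absent.
  def get(column_family, key)
    @data[column_family].fetch(key, {}).dup
  end

  # multi_get fetches several rows in one call, keyed by row key.
  def multi_get(column_family, keys)
    keys.each_with_object({}) { |key, rows| rows[key] = get(column_family, key) }
  end

  # remove deletes the whole row.
  def remove(column_family, key)
    @data[column_family].delete(key)
    nil
  end
end

client = FakeCassandra.new
client.insert(:users, "alice", { "name" => "Alice" })     # create
client.insert(:users, "alice", { "location" => "Utah" })  # update, same call
client.insert(:users, "bob", { "name" => "Bob" })
alice = client.get(:users, "alice")
both  = client.multi_get(:users, ["alice", "bob"])
client.remove(:users, "bob")
```

The second `insert` does not replace the row; it adds a column to it, which is why a single call covers both create and update.
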
00:10:27.360 When scaling with Cassandra, leverage multiple machines in your cluster and set a replication factor that determines how many copies of your data should exist. This replication provides reliability, allowing continued querying even if one node goes down.
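
One way to picture a replication factor is a ring of nodes where each row is copied to several neighbours. The sketch below is a deliberate simplification: real Cassandra assigns tokens and uses configurable placement strategies, and the node names and the helper `replicas_for` are invented for illustration.

```ruby
require "digest/md5"

# Simplified replica placement: hash the row key onto a ring of nodes and
# keep copies on the next `replication_factor` nodes around the ring.
def replicas_for(key, nodes, replication_factor)
  if replication_factor > nodes.size
    raise ArgumentError, "replication factor exceeds node count"
  end
  start = Digest::MD5.hexdigest(key).to_i(16) % nodes.size
  Array.new(replication_factor) { |i| nodes[(start + i) % nodes.size] }
end

nodes    = %w[node-a node-b node-c node-d]
replicas = replicas_for("alice", nodes, 3)

# With three copies, the row is still readable if one replica goes down.
down      = replicas.first
available = replicas - [down]
```
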
00:11:42.080 You can also tune your consistency levels based on how critical the data is. For instance, if you want strong consistency, you might require acknowledgments from three nodes before a read counts as successful, but this slows down your reads because more replicas have to respond.
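
The trade-off can be reduced to a little arithmetic: a read is guaranteed to see the latest acknowledged write when the replicas read and the replicas written must overlap. The helper names below (`quorum`, `overlapping?`) are illustrative and not part of any library.

```ruby
# Number of replicas that form a majority for a given replication factor.
def quorum(replication_factor)
  replication_factor / 2 + 1
end

# Reads and writes are guaranteed to overlap on at least one replica when
# read acknowledgments + write acknowledgments > replication factor.
def overlapping?(read_acks, write_acks, replication_factor)
  read_acks + write_acks > replication_factor
end

rf = 3
quorum_both = overlapping?(quorum(rf), quorum(rf), rf) # 2 + 2 > 3
one_and_one = overlapping?(1, 1, rf)                   # 1 + 1 is not > 3
write_all   = overlapping?(1, rf, rf)                  # 1 + 3 > 3
```

Asking one node on each side is the fastest option but gives no overlap guarantee, which is the kind of setting that suits less critical data.
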
00:13:02.720 The Ruby ecosystem offers several gems to help you interact with Cassandra, including the Cassandra gem which has a somewhat complex API but is functional. I’ve built an ORM on top of it to streamline interactions.
00:14:21.680 Active Model is a clean choice when working with this structure. The ORM I created reflects the active record pattern but operates under Cassandra's unique constraints, allowing full DB operation with familiar syntax.
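
The talk does not show the ORM's code, so the following is only a guess at the general shape of an active-record-style wrapper over a key-value row store. Everything here is hypothetical: the class `ColumnFamilyModel`, its methods, and the in-memory `STORE` that stands in for Cassandra. Active Model itself is left out to keep the sketch dependency-free.

```ruby
# Hypothetical base class: an active-record-style wrapper around a row store.
# STORE is an in-memory hash standing in for the column families.
class ColumnFamilyModel
  STORE = Hash.new { |families, name| families[name] = {} }

  attr_reader :key, :attributes

  # Derive the column family name from the class name (Tweet -> :tweet).
  def self.column_family
    name.downcase.to_sym
  end

  # Look a record up by its row key; returns nil when the row is absent.
  def self.find(key)
    columns = STORE[column_family][key]
    columns && new(key, columns)
  end

  def initialize(key, attributes = {})
    @key = key
    @attributes = attributes.dup
  end

  # No fixed schema: any attribute name can be read or assigned.
  def [](name)
    @attributes[name.to_s]
  end

  def []=(name, value)
    @attributes[name.to_s] = value
  end

  # Create and update are the same write, mirroring insert's behaviour.
  def save
    STORE[self.class.column_family][key] = @attributes.dup
    true
  end

  def destroy
    STORE[self.class.column_family].delete(key)
    true
  end
end

class Tweet < ColumnFamilyModel; end

tweet = Tweet.new("tweet-1", { "body" => "Hello" })
tweet[:author] = "alice"
tweet.save
found = Tweet.find("tweet-1")
```
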
00:15:46.799 As for migrations, they are simple since you don’t have to handle fields like you would in a relational database. It’s primarily about creating and modifying keyspaces and column families, which can be done easily.
00:17:17.440 So far, I have found that the approach has worked well for building a clean API that resembles what you'd expect when using Rails. I also want to maintain consistency in how users interact with their data models.
00:18:23.360 Remember that due to Cassandra’s architecture, automatic detection of data types and the need for orderly entries mean careful planning is necessary when designing APIs. It’s crucial to keep a visual representation clear for when users are inputting data.
00:19:49.520 The flexibility of NoSQL lies in its schema-less advantages, which allow for dynamic changes—like adding new attributes without requiring changes to existing structures. This agility proves beneficial in quickly evolving environments.
00:20:56.480 Several projects now exist that showcase how Cassandra can be integrated within Ruby ecosystems, and I encourage exploration of those. It's an excellent learning opportunity to design code that interacts effectively with database systems.
00:22:18.200 I'm happy to answer any questions, particularly regarding Cassandra or its integration with Ruby and Rails. Your inquiries and feedback about practical experience are very welcome.
00:23:09.440 Regarding secondary indexes, they require existing data to index. If a column doesn’t exist, it isn’t indexed. The flexibility of Cassandra allows arbitrary columns but requires the right setup and understanding of your data.
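
The point that a missing column is simply not indexed can be shown with a hand-built index over plain hashes. This illustrates the behaviour described above rather than how Cassandra builds its indexes; `build_index` and the sample rows are made up.

```ruby
# Build an index from column value to row keys for one column. Rows that do
# not have the column are skipped, so they never appear in the index.
def build_index(rows, column)
  empty_index = Hash.new { |index, value| index[value] = [] }
  rows.each_with_object(empty_index) do |(key, columns), index|
    next unless columns.key?(column)
    index[columns[column]] << key
  end
end

users = {
  "alice" => { "name" => "Alice", "state" => "UT" },
  "bob"   => { "name" => "Bob" },                    # no "state" column
  "carol" => { "name" => "Carol", "state" => "UT" }
}

by_state = build_index(users, "state")
```

Here `bob` has no `state` column, so a lookup through the index can never return that row.
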
00:24:13.520 Your keys in Cassandra can be various types, including non-string types. This versatility allows for rich data handling, and the automatic sorting of keys simplifies data retrieval.
00:25:00.400 When setting up your architecture, consider how you define and manage your column families to maintain the integrity and performance capabilities of your application.
00:26:23.440 If you have further questions about setting up or optimizing your implementation of Cassandra, please let me know. Collaboration and discussion help us all grow stronger in our coding journeys.