Stratocaster: Redis Event Timeline

Rick Olson • April 07, 2011 • Earth

In the video titled "Stratocaster: Redis Event Timeline," speaker Rick Olson discusses the evolution of GitHub's event timeline, moving from a MySQL-based system to a more efficient Redis-based architecture through a new library called Stratocaster. The presentation emphasizes the challenges faced when managing large-scale event feeds and how switching to a push architecture alleviates database strain. Key topics covered include:

  • Definition of Event Feeds: Olson clarifies what event feeds are, emphasizing they represent user actions on GitHub rather than event-driven programming.
  • Feeding Frenzy Paper: He references a paper from Yahoo Research discussing event feed management, distinguishing between consumers and producers in feed systems.
  • Push vs. Pull Architecture: GitHub's push model entails proactively creating user feeds, which enables real-time updates but also increases data size.
  • Challenges with the Current System: Large-scale feeds can overload databases, particularly when popular users generate many events.
  • Data Management Improvements: Olson discusses caching strategies, bulk inserts, and the transition from ERB to Mustache templates to enhance performance and readability.
  • Introduction of Stratocaster: This new library aims to modernize event handling by leveraging Redis for efficient data storage, retaining only essential event IDs rather than duplicating data.
  • Performance Comparisons: Olson shares testing results showcasing a 10x reduction in data storage when using Redis, with tests indicating significant memory efficiency in newer Redis versions.
  • Future Directions: The session wraps up with Olson hinting at refining the Stratocaster API for public release and discussing ongoing improvements.

Overall, Olson's presentation highlights the importance of optimizing event management systems for efficiency and user experience as GitHub continues to scale its infrastructure. The move to Redis is presented as both a strategic choice and a necessary evolution in their technology landscape.

Stratocaster is an internal GitHub Ruby project written to replace the organically grown GitHub Event timeline. This is a tale of an overgrown MySQL table being replaced by a more specific Redis setup. We'll see what new possibilities that Redis is able to provide. Also, we'll go over how Mustache was able to replace Erb for event rendering.

Ruby on Ales 2011

00:00:24.039 I'm going to talk about GitHub and one of the technologies that I work with there. I'm the animated GIF artist at GitHub. I also work with Ruby, but mostly I focus on animation. Lately, I've been getting into DJing a little bit.
00:00:33.760 K mentioned the Kegerator that we have at our office. Every time we’ve had a private office party, Tom and I have been DJing. I use an app called Djay on the iPad, which is pretty awesome.
00:01:08.119 So, what is an event feed? I feel it's important to establish a common vocabulary here. If you search for events on Google, you’ll find various meanings of the word. We are not discussing event-driven programming; we are simply referring to the news feed on GitHub. It's the section on the left that provides a feed of all the actions your friends are taking on GitHub.
00:01:20.520 Every time they make a change to a repository or perform any action, it shows up there. There is a paper titled "Feeding Frenzy: Selectively Materializing Users' Event Feeds," which is authored by some people at Yahoo Research.
00:01:40.120 The paper details how Yahoo stores its feeds, as they have a lot more users and scale compared to GitHub. Despite that, there’s a lot of valuable information in that paper. In it, they discuss consumers and producers. In any feed system, you have consumers who subscribe to producers that generate the events.
00:02:06.000 The interesting part is how different apps use consumers and producers in various ways. In an RSS feed aggregation service, the producers and consumers are entirely distinct. In contrast, on a platform like Twitter, you have followers; users follow each other, making them both consumers and producers. GitHub operates similarly.
00:02:32.040 Every user is a consumer, while the repositories they push to act as producers. Moreover, users have access to multiple feeds. For example, your personal actions feed lists everything that you specifically did, while other feeds display the actions of repositories you watch or organizations you're part of.
00:02:50.840 Whenever someone performs an action in an organization I follow, I see that action in a unique feed that excludes other unrelated updates. The Feeding Frenzy paper also introduces the concepts of push versus pull in event feeds; this describes how event feeds are constructed.
00:03:14.200 The first event feed I wrote in Rails was basic; it included an event model called 'activity log' with plugins for access auditing. The producer in our case was the repository or project, which would generate a record on an after-save callback.
00:03:29.920 Creating a record for each event was straightforward, and my database was well normalized, but writing crazy queries to retrieve those events was challenging. You could imagine how cumbersome it becomes when building permission systems linking users and memberships—those nested joins could make fetching events pretty complicated.
00:03:50.420 Building an event feed on the fly at read time has its challenges, especially at larger scales. When I discussed rolling out event feeds at GitHub with Chris, we realized that this system would quickly fall apart because the queries would put too much strain on the database at scale.
00:04:19.720 At GitHub, we adopted a push architecture, where we create the event once and pre-build everyone’s feed. When an event is created in Active Record, we loop through each follower and create related entries in their timelines. The only difference between the original event and the copies is that the original has just the actor field filled, while each copy also has the user field set, indicating the audience for that event.
00:04:38.920 The events table is relatively simple, with only a few indexes on the actor and user fields. However, the push model creates its own issue, as Twitter sees with prominent figures like Ashton Kutcher, who has millions of followers. When he tweets, Twitter must update millions of feeds, which is an immense task.
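The push model described here can be sketched in a few lines of plain Ruby. This is only an illustration under assumptions: the `PushedEvent` struct, the `PushEventStore` class, and the field names are hypothetical stand-ins for the Active Record model, and an array stands in for the MySQL table.

```ruby
# Hypothetical in-memory stand-in for the events table; not GitHub's schema.
PushedEvent = Struct.new(:id, :actor, :user, :action)

class PushEventStore
  attr_reader :rows

  def initialize
    @rows = []
    @next_id = 0
  end

  # Push model: write the original event (actor only), then one copy per
  # follower with the user field set to that follower.
  def publish(actor, action, followers)
    original = insert(actor, nil, action)
    followers.each { |follower| insert(actor, follower, action) }
    original
  end

  # Reading a feed becomes a simple filter on the user field, with no joins.
  def feed_for(user)
    @rows.select { |row| row.user == user }
  end

  private

  def insert(actor, user, action)
    event = PushedEvent.new(@next_id += 1, actor, user, action)
    @rows << event
    event
  end
end

store = PushEventStore.new
store.publish("rick", "pushed to a repository", %w[alice bob])
puts store.feed_for("alice").map(&:action).inspect
```

One event with two followers produces three rows, which is the data growth that the rest of the talk deals with.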
00:05:05.520 Similarly, Charlie Sheen had a massive influx of followers, and the strain on systems was noticeable. Thankfully, GitHub's issues are different. We face challenges with users like John Resig, the creator of jQuery, who has a vast following, combined with repositories like Rails that have substantial watcher counts.
00:05:29.680 When activity occurs on Rails, it can produce upwards of ten thousand events. That is a much smaller scale than what Twitter deals with, but it is still a considerable load for us.
00:05:49.840 The push architecture saves a lot of processing at read time but can lead to large data growth. A quote from the Feeding Frenzy paper discusses a case where Digg moved to Cassandra, transitioning from a normalized database to an exploded structure, resulting in a gigantic increase in storage requirements.
00:06:11.960 While GitHub hasn’t reached such extremes, our event database still runs into the hundreds of gigabytes in size, so efficiently managing data is crucial, especially with users and repositories that have extensive followings.
00:06:31.679 To manage the load, we use a plugin called ar-extensions, which allows us to perform bulk inserts with MySQL. Instead of inserting rows one by one, we send larger batches to the database, which improves performance significantly.
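To illustrate the idea behind bulk inserts, here is a minimal sketch that builds multi-row INSERT statements by hand. It does not use or reproduce the plugin's API; the `bulk_insert_sql` helper, the table name, and the column names are all hypothetical, and the values are restricted to integers so that no string escaping is needed.

```ruby
# Builds one multi-row INSERT per batch instead of one statement per row.
# Table and column names are hypothetical; values must be integers, so this
# sketch needs no string escaping.
def bulk_insert_sql(table, columns, rows, batch_size: 1000)
  rows.each_slice(batch_size).map do |batch|
    values = batch.map { |row| "(#{row.map { |v| Integer(v) }.join(', ')})" }
    "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{values.join(', ')}"
  end
end

follower_ids = [10, 11, 12]
# Each row is [actor_id, user_id, repo_id].
rows = follower_ids.map { |follower_id| [42, follower_id, 7] }
puts bulk_insert_sql("events", %w[actor_id user_id repo_id], rows, batch_size: 2)
```

With a batch size of 2, the three rows above become two statements rather than three; with realistic batch sizes, thousands of follower rows collapse into a handful of round trips.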
00:07:00.000 Additionally, when the event feed popularity surged, we began implementing more caching strategies. Essentially, the entire feed is cached, and each item within it is also individually cached. This approach works efficiently because many users see the same items, allowing us to reuse cached data across feeds.
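The two-level caching described above can be sketched as follows. A Hash stands in for memcached, and the `FeedCache` class and its key formats are assumptions made for illustration, not the production code.

```ruby
# Hash-backed stand-in for memcached; class and key formats are hypothetical.
class FeedCache
  attr_reader :item_renders

  def initialize
    @store = {}
    @item_renders = 0
  end

  # Return the cached value for key, computing and storing it on a miss.
  def fetch(key)
    @store.fetch(key) { @store[key] = yield }
  end

  # The whole feed is cached per user and list of event IDs.
  def render_feed(user, event_ids)
    fetch("feed:#{user}:#{event_ids.join(',')}") do
      event_ids.map { |id| render_item(id) }.join("\n")
    end
  end

  # Each item is cached by event ID and shared by every feed containing it.
  def render_item(event_id)
    fetch("event:#{event_id}") do
      @item_renders += 1
      "<li>event #{event_id}</li>"
    end
  end
end

cache = FeedCache.new
cache.render_feed("alice", [1, 2])
cache.render_feed("bob", [2, 3]) # event 2 is reused from alice's feed
puts cache.item_renders
```

Because event 2 appears in both feeds, only three items are rendered for four feed entries, which is the reuse that makes per-item caching pay off.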
00:07:29.640 Recently, we've upgraded our memory capacity on our servers, specifically on our memcached servers, and for the past month, we haven't encountered any cache evictions, demonstrating the effectiveness of this approach.
00:07:55.840 We also reworked the rendering of our templates, transitioning from cumbersome ERB templates to simplified logicless Mustache templates. The previous ERB code was complex and challenging to manage.
00:08:15.679 The Mustache templates improved readability, eliminating the logic previously present. Now, I can directly test the events without needing to scrape HTML, providing a much cleaner and more efficient testing process.
00:08:39.440 Additionally, I enhanced the rendering speed substantially; Mustache manages caching the compiled templates, which was not the case with the old ERB system. This refactoring was my response to the event data explosion we were encountering.
00:09:00.000 The necessity to store events efficiently led me on a quest to create a simple library, Stratocaster, that could migrate us off our existing event infrastructure. The first adapter for Stratocaster I implemented was for Redis.
00:09:37.679 Stratocaster is designed to work with any adapter, and Redis was the initial choice because of its impressive in-memory storage capabilities. Redis is an in-memory database with persistence and efficient built-in data structures, which makes translating Ruby code into Redis commands incredibly smooth.
00:10:05.360 Rather than duplicating event data multiple times for each follower, we store an array or a set of IDs. The following commands illustrate how we manage event addition and retrieval using Redis.
00:10:32.000 Using LPUSH allows us to add events to a list while LRANGE retrieves items from this list. This method keeps our storage compact and efficient by only holding IDs of repeated events, rather than duplicating the entire event data.
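As a rough illustration of how LPUSH and LRANGE behave, the sketch below approximates their semantics with plain Ruby arrays so that it runs without a Redis server. The `FakeRedisLists` class and the `timeline:<user>` key format are hypothetical, and edge cases such as out-of-range negative indexes are not modeled exactly.

```ruby
# In-memory approximation of two Redis list commands; not a Redis client.
class FakeRedisLists
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  # LPUSH key value: prepend the value and return the new list length.
  def lpush(key, value)
    @lists[key].unshift(value.to_s)
    @lists[key].length
  end

  # LRANGE key start stop: inclusive range; negative indexes count from the end.
  def lrange(key, start, stop)
    return [] unless @lists.key?(key)
    @lists[key][start..stop] || []
  end
end

redis = FakeRedisLists.new
# One event (ID 101) is fanned out to two timelines as an ID only.
%w[alice bob].each { |user| redis.lpush("timeline:#{user}", 101) }
redis.lpush("timeline:alice", 102)

puts redis.lrange("timeline:alice", 0, 9).inspect # => ["102", "101"]
```

Each timeline holds only small ID strings, newest first, so the full event record needs to be stored just once elsewhere.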
00:11:02.480 In trials, Stratocaster operated in production for about a month, running parallel tests against the previous system. We found that roughly ten times fewer event rows needed to go into the database, allowing us to shrink the event table significantly.
00:11:30.880 For instance, in the first week of testing, we saw roughly 18 million rows produced in the database. Observing how Redis held the data, we noted that it required just around 1.2 GB in memory versus the requirements of typical MySQL setups.
00:12:01.200 Initially, Redis 2.0 performed adequately, but upon testing with Redis 2.2, we found significant optimizations that reduced the in-memory footprint to roughly 200 MB due to new encoding methods that improved efficiency.
00:12:31.320 This required re-evaluating our data model to ensure it suited our needs within the scope of event feeds. Presently, we are using MySQL as our main database due to its established integration within GitHub's infrastructure.
00:13:02.400 Despite exploring other databases, we have not migrated due to setup complexities with MySQL being already operational.
00:13:32.080 Next, I’ll share the current Ruby API for Stratocaster. It's still in early stages and not yet publicly released, but the code is available in a hidden project I created a while back.
00:14:01.680 We define what timelines to store and build Redis list keys based on repository IDs. As the events come in, they communicate the followers, allowing us to build corresponding Redis keys for timelines.
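Since the API is not shown in this transcript, the following is only a guess at the general shape of the idea: each timeline is a small rule that turns an incoming event into one or more Redis list keys. The `TimelineEvent` struct, the `TIMELINES` table, and the key formats are all hypothetical and are not the actual Stratocaster API.

```ruby
# Hypothetical event shape; the real event model is not shown in the talk.
TimelineEvent = Struct.new(:id, :actor_id, :repo_id, :follower_ids)

# Each timeline maps an event to the list keys it should be stored under.
TIMELINES = {
  actor:     ->(event) { ["timeline:actor:#{event.actor_id}"] },
  repo:      ->(event) { ["timeline:repo:#{event.repo_id}"] },
  followers: ->(event) { event.follower_ids.map { |id| "timeline:user:#{id}" } }
}

# Collect every key that should receive this event's ID.
def timeline_keys(event, timelines = TIMELINES)
  timelines.values.flat_map { |builder| builder.call(event) }
end

event = TimelineEvent.new(101, 1, 55, [7, 8])
puts timeline_keys(event).inspect
# => ["timeline:actor:1", "timeline:repo:55", "timeline:user:7", "timeline:user:8"]
```

In a design like this, supporting a new kind of feed means adding one more entry to the table rather than adding a column and an index to a large MySQL table.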
00:14:30.440 This is the code we use to store event data in Grim, operating in Active Record. It's designed to be database-agnostic, meaning it can work with any active model as long as it carries necessary identifiers.
00:14:59.640 Creating a Stratocaster instance involves defining potential timelines, and as it receives events, it fetches those timelines and stores the appropriate events, making the system flexible and adaptable.
00:15:28.919 This flexibility contrasts with the MySQL setup where heavy indexing can complicate changes, making it slow to adjust to new requirements.
00:15:59.520 With this new approach, we can easily add feeds, like views for specific events or repositories, just by introducing a new timeline object within Stratocaster.
00:16:29.840 Internally, the adapters receive timelines to store corresponding events, and the Redis adapter ensures those events are retained in their specific lists.
00:17:04.639 Building Stratocaster catered to various datasets, as both MySQL and Redis easily aligned with our infrastructure. Once we facilitate moving functionalities to full-time uses, we'll look into more experimental data solutions.
00:17:36.160 One key advantage of this data model is that there is no significant historical view directly rendered in the current setup. We only need to retain a limited recent history, making Redis an excellent fit for our needs.
00:18:02.080 We limit the events held in memory, ensuring manageable capacity while presenting users with the most recent actions. Structuring everything this way allowed us to streamline development and execution.
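One common way to cap a Redis list is to follow each LPUSH with an LTRIM that keeps only the newest entries. The talk does not say which mechanism is used, so treat this as an assumption; the sketch below mimics that pattern with a plain Ruby array, and the `CappedTimeline` class is hypothetical.

```ruby
# In-memory illustration of a capped, newest-first timeline.
class CappedTimeline
  attr_reader :ids

  def initialize(max_length)
    @max_length = max_length
    @ids = []
  end

  # Similar in spirit to LPUSH followed by LTRIM key 0 (max_length - 1):
  # prepend the new ID, then drop everything past the newest max_length.
  def push(event_id)
    @ids.unshift(event_id)
    @ids = @ids.first(@max_length)
    self
  end
end

timeline = CappedTimeline.new(3)
(1..5).each { |event_id| timeline.push(event_id) }
puts timeline.ids.inspect # => [5, 4, 3]
```

The memory used per timeline is then bounded by the cap regardless of how active the followed users and repositories are.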
00:18:31.920 Even though storing every event forever would be attractive, filtering limits helped us focus and accelerate the development of a user-friendly and efficient event feed.
00:19:02.640 As we worked on the project, I realized that reducing the complexities led to better outcomes. Over time, I rewrote Stratocaster multiple times as we refined its functionalities and responsiveness.
00:19:32.440 With every rewrite, improvements were made to align with evolving requirements, and once I discovered an existing library for key-value stores, I integrated the concept into Stratocaster to further streamline its development.
00:20:03.200 Currently, I’m compiling references that I’ll share on Twitter later. Some of these references include fascinating blog posts and papers discussing database transitions and methodologies similar to what we handle at GitHub.
00:20:34.600 One reference is from Digg, discussing their journey toward using Cassandra and the unique challenges they encountered, illustrating the broader conversations occurring in this field.
00:21:02.920 Another paper that caught my attention is one on secondary indexes in MySQL, which presents interesting techniques useful for managing large datasets.
00:21:31.240 A recent talk by Kota Hae explored advanced data management concepts, igniting curiosity about various possibilities within our platforms. I recommend checking that out as it offers unique perspectives into handling event data.
00:21:58.880 The session concluded with a lighthearted Q&A, where I discussed my passion for creating animated GIFs. I mentioned how practice has been key in honing my skills, and we had a fun exchange about the evolution of trends in animated media.
00:22:35.760 One attendee asked about the future of Stratocaster, and I expressed that I’m working on refining the API before making it open for public use. There’s still work to do, but I’m excited for it.
00:23:02.240 Afterward, I demonstrated a simple application showcasing the capabilities of Stratocaster within a practical context, including managing timelines and interactions.
00:23:27.760 The demo illustrated how easily Stratocaster integrates with various data types across platforms, highlighting its flexibility and potential applications.
00:23:55.760 I also acknowledged some aspects of the API that I still want to refine ahead of release, ensuring a smooth user experience. We wrapped up with further dialogues on data stores and future capabilities.
00:24:22.840 Attendees shared their thoughts on various database options and their pros and cons, leading to a lively discussion about the best strategies for future data management.
00:24:59.920 Before concluding, I encouraged them to follow my work and explore topics that I would share on social media. It was a pleasure interacting with everyone and discussing advancements in data paradigms.
00:25:29.760 Overall, the session emphasized the balance between efficiency and user experience when managing extensive event data. Thank you all for attending!
00:26:02.960 I hope to connect with you again soon to further discuss these ideas and explore new horizons in event management.
00:26:29.760 If you have any additional questions or topics you would like to cover, please feel free to reach out.
00:27:10.640 Thank you once again, and I'm looking forward to our next interaction.