00:00:24.039
I'm going to talk about GitHub, where I work; I'm the animated GIF artist at GitHub. I also work with Ruby, but mostly I focus on animation. Lately, I've been getting into DJing a little bit.
00:00:33.760
K mentioned the Kegerator that we have at our office. Every time we’ve had a private office party, Tom and I have been DJing. I use an app called Djay on the iPad, which is pretty awesome.
00:01:08.119
So, what is an event feed? I feel it's important to establish a common vocabulary here. If you search for events on Google, you’ll find various meanings of the word. We are not discussing event-driven programming; we are simply referring to the news feed on GitHub. It's the section on the left that provides a feed of all the actions your friends are taking on GitHub.
00:01:20.520
Every time they make a change to a repository or perform any action, it shows up there. There is a paper titled "Feeding Frenzy: Selectively Materializing Users' Event Feeds," which is authored by some people at Yahoo Research.
00:01:40.120
The paper details how Yahoo stores its feeds, as they have a lot more users and scale compared to GitHub. Despite that, there’s a lot of valuable information in that paper. In it, they discuss consumers and producers. In any feed system, you have consumers who subscribe to producers that generate the events.
00:02:06.000
The interesting part is how different apps use consumers and producers in various ways. In an RSS feed aggregation service, the producers and consumers are entirely distinct. In contrast, on a platform like Twitter, you have followers; users follow each other, making them both consumers and producers. GitHub operates similarly.
00:02:32.040
Every user is a consumer, while the repositories they push to act as producers. Moreover, users have access to multiple feeds. For example, your personal actions feed lists everything that you specifically did, while other feeds display the actions of repositories you watch or organizations you're part of.
00:02:50.840
Whenever someone performs an action in an organization I follow, I see that action in a unique feed that excludes other unrelated updates. The Feeding Frenzy paper also introduces the concepts of push versus pull in event feeds; this describes how event feeds are constructed.
00:03:14.200
The first event feed I wrote in Rails was basic: an event model, essentially an activity log, built with auditing plugins. The producer in our case was the repository or project, which would generate a record in an after-save callback.
00:03:29.920
Creating a record for each event was straightforward, and my database was well normalized, but writing the gnarly queries to retrieve those events was challenging. You can imagine how cumbersome it becomes when building permission systems linking users and memberships; those nested joins make fetching events pretty complicated.
00:03:50.420
Building an event feed on the fly at read time has its challenges, especially at larger scales. When I discussed rolling out event feeds at GitHub with Chris, we realized that this system would quickly fall apart: the queries put too much strain on the database at scale.
00:04:19.720
At GitHub, we adopted a push architecture: we create the event once and pre-build everyone's feed. When an event is created in Active Record, we loop through each follower and create related entries in their timelines. The difference between the original event and its fanned-out copies is that the original only has the actor field filled in, while each copy also has a user field indicating whose feed it belongs to.
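A minimal sketch of that fan-out-on-write, with a plain Hash standing in for the events and timelines tables (the names here are illustrative, not GitHub's actual code):

```ruby
# Illustrative fan-out-on-write: when an event is created, copy a
# reference to it into each follower's timeline. A Hash stands in for
# the real storage.
TIMELINES = Hash.new { |h, k| h[k] = [] }

def create_event(actor:, action:, followers:)
  event = { actor: actor, action: action }
  # The original event only has the actor filled in; each fanned-out
  # copy also records the user whose feed it lands in.
  followers.each do |user|
    TIMELINES[user].unshift(event.merge(user: user))
  end
  event
end

create_event(actor: "mojombo", action: "push", followers: %w[alice bob])
TIMELINES["alice"].first[:actor] # => "mojombo"
```

The write cost is one row per follower, which is exactly the trade the pull model avoids; the next sections cover why that trade is still worth it.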
00:04:38.920
The events table is relatively simple, with only a few indexes on the actor and user fields. However, this creates its own issue, as exemplified by prominent figures like Ashton Kutcher, who has millions of followers. When he takes an action, Twitter must update millions of feeds, which is an immense task.
00:05:05.520
Similarly, Charlie Sheen had a massive influx of followers, and the strain on systems was noticeable. Thankfully, GitHub's issues are different. We face challenges with users like John Resig, the creator of jQuery, who has a vast following, combined with repositories like Rails that have substantial watcher counts.
00:05:29.680
When activity occurs on Rails, it can produce upwards of ten thousand events, emphasizing the difference in scale that we deal with compared to Twitter, but nevertheless a considerable load for us.
00:05:49.840
The push architecture saves a lot of work at read time but can lead to large data growth. A quote from the Feeding Frenzy paper discusses the case where Digg moved to Cassandra, transitioning from a normalized database to a denormalized, exploded structure, resulting in a huge increase in storage requirements.
00:06:11.960
While GitHub hasn’t reached such extremes, our event database still runs into the hundreds of gigabytes in size, so efficiently managing data is crucial, especially with users and repositories that have extensive followings.
00:06:31.679
To manage the load, we utilize a plugin called ar-extensions, which allows us to perform bulk inserts with MySQL. Instead of inserting rows one by one, we send larger batches to the database, which improves performance significantly.
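The idea behind the bulk insert can be sketched without Active Record: instead of N single-row INSERTs, build one multi-row statement. ar-extensions wraps this up as an `import` call on the model; the hand-rolled SQL string below (with fake, unescaped values) just illustrates what ends up on the wire:

```ruby
# Build a single multi-row INSERT instead of one statement per event.
# Real code would escape values properly; these rows are illustrative.
def bulk_insert_sql(table, columns, rows)
  values = rows.map { |row| "(#{row.map { |v| "'#{v}'" }.join(', ')})" }
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{values.join(', ')}"
end

rows = [%w[push mojombo], %w[fork defunkt]]
puts bulk_insert_sql("events", %w[action actor], rows)
# => INSERT INTO events (action, actor) VALUES ('push', 'mojombo'), ('fork', 'defunkt')
```

One round trip and one statement parse for the whole batch is where the win comes from.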
00:07:00.000
Additionally, when the event feed popularity surged, we began implementing more caching strategies. Essentially, the entire feed is cached, and each item within it is also individually cached. This approach works efficiently because many users see the same items, allowing us to reuse cached data across feeds.
00:07:29.640
Recently, we upgraded the memory on our memcached servers, and for the past month we haven't encountered any cache evictions, demonstrating the effectiveness of this approach.
00:07:55.840
We also reworked the rendering of our templates, transitioning from cumbersome ERB templates to simplified logicless Mustache templates. The previous ERB code was complex and challenging to manage.
00:08:15.679
The Mustache templates improved readability, eliminating the logic previously present. Now, I can directly test the events without needing to scrape HTML, providing a much cleaner and more efficient testing process.
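The logicless idea can be sketched in plain Ruby. The real feed uses the Mustache library; this toy renderer only illustrates why such templates are easy to test, since you assert on the view data directly instead of scraping rendered HTML:

```ruby
# A toy logicless renderer: every {{name}} tag is replaced by the
# corresponding hash value; the template itself contains no Ruby logic.
def render(template, data)
  template.gsub(/\{\{(\w+)\}\}/) { data[Regexp.last_match(1).to_sym].to_s }
end

view = { actor: "mojombo", repo: "mojombo/grit" }
html = render("<li>{{actor}} pushed to {{repo}}</li>", view)
# Testing the event means asserting on `view`, not scraping `html`.
```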
00:08:39.440
Additionally, I enhanced the rendering speed substantially; Mustache manages caching the compiled templates, which was not the case with the old ERB system. This refactoring was my response to the event data explosion we were encountering.
00:09:00.000
The necessity to store events efficiently led me on a quest to create a simple library, Stratocaster, that could migrate us off our existing event infrastructure. The first adapter for Stratocaster I implemented was for Redis.
00:09:37.679
Stratocaster is designed to work with any adapter, and Redis was the initial choice because of its impressive in-memory storage capabilities. Redis is an in-memory database with persistence and efficient native data structures, which map naturally onto Ruby code.
00:10:05.360
Rather than duplicating event data multiple times for each follower, we store an array or a set of IDs. The following commands illustrate how we manage event addition and retrieval using Redis.
00:10:32.000
Using LPUSH allows us to add events to a list while LRANGE retrieves items from this list. This method keeps our storage compact and efficient by only holding IDs of repeated events, rather than duplicating the entire event data.
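The LPUSH/LRANGE pattern can be sketched as follows; a plain Array stands in for the Redis list so the example runs without a server (with the redis gem the calls would be `redis.lpush` and `redis.lrange`):

```ruby
# LPUSH prepends an ID to the list; LRANGE reads back a slice.
# An Array stands in for the Redis list here.
timeline = []

def lpush(list, id)
  list.unshift(id)
end

def lrange(list, start, stop)
  list[start..stop]
end

[101, 102, 103].each { |id| lpush(timeline, id) } # newest event first
page = lrange(timeline, 0, 1) # => [103, 102]
```

Because only integer IDs live in each list, a follower's timeline costs a few bytes per event instead of a full copy of the event row.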
00:11:02.480
In trials, Stratocaster ran in production for about a month, in parallel against the previous system. We found that roughly ten times fewer event rows needed to go into the database, allowing us to shrink the events table significantly.
00:11:30.880
For instance, in the first week of testing, we saw roughly 18 million rows written to the database. Storing the same data as ID lists in Redis required only around 1.2 GB of memory, a fraction of what the equivalent MySQL setup needs.
00:12:01.200
Initially, Redis 2.0 performed adequately, but upon testing with Redis 2.2, we found significant optimizations that reduced the in-memory footprint to roughly 200 MB, thanks to its new compact encodings for small lists.
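Those compact encodings are tunable in redis.conf: lists that stay under the thresholds below are stored in the space-efficient ziplist representation instead of a regular linked list (the values shown are the Redis 2.2 defaults):

```
list-max-ziplist-entries 128
list-max-ziplist-value 64
```

Since capped timelines hold only short lists of small integer IDs, they fit the ziplist encoding well, which is where most of the footprint reduction comes from.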
00:12:31.320
This required re-evaluating our data model to ensure it suited our needs within the scope of event feeds. Presently, we are using MySQL as our main database due to its established integration within GitHub's infrastructure.
00:13:02.400
Despite exploring other databases, we haven't migrated, since MySQL is already operational and any replacement would add setup complexity.
00:13:32.080
Next, I’ll share the current Ruby API for Stratocaster. It's still in early stages and not yet publicly released, but the code is available in a hidden project I created a while back.
00:14:01.680
We define which timelines to store and build Redis list keys based on repository IDs. As events come in, they carry their follower information, which lets us build the corresponding Redis keys for each timeline.
00:14:30.440
This is the code we use to store event data; it currently runs through Active Record, but it's designed to be database-agnostic: it works with any model that exposes the necessary identifiers.
00:14:59.640
Creating a Stratocaster instance involves defining potential timelines, and as it receives events, it fetches those timelines and stores the appropriate events, making the system flexible and adaptable.
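Since the library isn't public yet, the sketch below is a guess at the shape of such an API rather than Stratocaster's actual interface; every class and method name here is hypothetical. It shows the core idea: timeline objects declare which keys an event fans out to, and an adapter stores event IDs under those keys.

```ruby
# Hypothetical Stratocaster-style API sketch (not the real interface).
# Timelines map an event to storage keys; the adapter persists IDs.
class MemoryAdapter
  def initialize
    @lists = Hash.new { |h, k| h[k] = [] }
  end

  def store(key, event_id)
    @lists[key].unshift(event_id)
  end

  def fetch(key, limit = 30)
    @lists[key].first(limit)
  end
end

class RepoTimeline
  # One key per watcher of the repository the event belongs to.
  def keys_for(event)
    event[:watcher_ids].map { |id| "timeline:user:#{id}:repo:#{event[:repo_id]}" }
  end
end

class Feed
  def initialize(adapter, timelines)
    @adapter, @timelines = adapter, timelines
  end

  def receive(event)
    @timelines.flat_map { |t| t.keys_for(event) }
              .each { |key| @adapter.store(key, event[:id]) }
  end
end

adapter = MemoryAdapter.new
feed = Feed.new(adapter, [RepoTimeline.new])
feed.receive(id: 7, repo_id: 42, watcher_ids: [1, 2])
adapter.fetch("timeline:user:1:repo:42") # => [7]
```

Adding a new kind of feed under this shape means adding a new timeline object, with no schema or index changes.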
00:15:28.919
This flexibility contrasts with the MySQL setup where heavy indexing can complicate changes, making it slow to adjust to new requirements.
00:15:59.520
With this new approach, we can easily add feeds, like views for specific events or repositories, just by introducing a new timeline object within Stratocaster.
00:16:29.840
Internally, the adapters receive timelines to store corresponding events, and the Redis adapter ensures those events are retained in their specific lists.
00:17:04.639
Stratocaster was built to accommodate various data stores, and both MySQL and Redis aligned easily with our infrastructure. Once it has moved into full-time use, we'll look into more experimental data stores.
00:17:36.160
One key advantage of this data model is that the feed never renders deep history; we only need to retain a limited window of recent events, making Redis an excellent fit for our needs.
00:18:02.080
We limit the events held in memory, ensuring manageable capacity while presenting users with the most recent actions. Structuring everything this way allowed us to streamline development and execution.
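That cap maps onto Redis's LTRIM command: push, then trim back to a fixed length. A sketch of the pattern, again with an Array standing in for the list (the cap of 5 is an arbitrary illustrative value; a real feed might keep a few hundred):

```ruby
# Push-then-trim: after every insert, drop anything beyond the cap so a
# timeline never grows past a fixed number of recent events.
# (LPUSH followed by LTRIM in real Redis; an Array stands in here.)
CAP = 5 # illustrative; a real feed might keep a few hundred

def push_capped(list, id)
  list.unshift(id)
  list.slice!(CAP..) # equivalent to LTRIM key 0 CAP-1
  list
end

timeline = []
(1..8).each { |id| push_capped(timeline, id) }
timeline # => [8, 7, 6, 5, 4]
```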
00:18:31.920
Even though storing every event forever would be attractive, capping the history helped us focus and accelerated the development of a user-friendly and efficient event feed.
00:19:02.640
As we worked on the project, I realized that reducing the complexities led to better outcomes. Over time, I rewrote Stratocaster multiple times as we refined its functionalities and responsiveness.
00:19:32.440
With every rewrite, improvements were made to align with evolving requirements, and once I discovered an existing library for key-value stores, I integrated the concept into Stratocaster to further streamline its development.
00:20:03.200
Currently, I’m compiling references that I’ll share on Twitter later. Some of these references include fascinating blog posts and papers discussing database transitions and methodologies similar to what we handle at GitHub.
00:20:34.600
One reference is from Digg, discussing their journey toward using Cassandra and the unique challenges they encountered, illustrating the broader conversations occurring in this field.
00:21:02.920
Another paper that caught my attention is one on secondary indexes in MySQL, which presents interesting techniques for managing large datasets.
00:21:31.240
A recent talk by Coda Hale explored advanced data management concepts, igniting curiosity about various possibilities within our platforms. I recommend checking that out, as it offers unique perspectives on handling event data.
00:21:58.880
The session concluded with a lighthearted Q&A, where I discussed my passion for creating animated GIFs. I mentioned how practice has been key in honing my skills, and we had a fun exchange about the evolution of trends in animated media.
00:22:35.760
One attendee asked about the future of Stratocaster, and I expressed that I’m working on refining the API before making it open for public use. There’s still work to do, but I’m excited for it.
00:23:02.240
Afterward, I demonstrated a simple application showcasing the capabilities of Stratocaster within a practical context, including managing timelines and interactions.
00:23:27.760
The demo illustrated how easily Stratocaster integrates with various data types across platforms, highlighting its flexibility and potential applications.
00:23:55.760
I also acknowledged some aspects of the API that I still want to refine ahead of release, ensuring a smooth user experience. We wrapped up with further dialogues on data stores and future capabilities.
00:24:22.840
Attendees shared their thoughts on various database options and their pros and cons, leading to a lively discussion about the best strategies for future data management.
00:24:59.920
Before concluding, I encouraged them to follow my work and explore topics that I would share on social media. It was a pleasure interacting with everyone and discussing advancements in data paradigms.
00:25:29.760
Overall, the session emphasized the balance between efficiency and user experience when managing extensive event data. Thank you all for attending!
00:26:02.960
I hope to connect with you again soon to further discuss these ideas and explore new horizons in event management.
00:26:29.760
If you have any additional questions or topics you would like to cover, please feel free to reach out.
00:27:10.640
Thank you once again, and I'm looking forward to our next interaction.