Performance Optimization

Scaling 'most popular' lists: a plugin solution

by Wolfram Arnold

In this presentation at LA RubyConf 2009, Wolfram Arnold introduces a plugin he developed called 'Act As Most Popular', aimed at improving the scalability of 'most popular' lists in applications, particularly social networks. The key challenges discussed include the inefficiencies faced when tracking user activity on viewable entities such as user profiles, videos, and images. This is commonly addressed through queries that join activity with the viewable entities, often leading to performance issues as it requires multiple database accesses during user interactions.

To streamline this process, Arnold highlights the need for a caching solution that enables real-time updates to the popularity ranking without requiring a complete refresh of the data set every time a user interacts with the application. His vision involves using a caching framework that can manage data efficiently, eliminating the delays associated with frequent database queries.

Arnold introduces 'Cache Money', a modern caching framework developed by Nick Kallen, which significantly eases the management of cached data by allowing automatic interactions with models without manual cache handling. Key features of Cache Money include:
- Simplified access to cached data via methods like 'find', which seamlessly determines if data is cached or needs to come from the database.
- Automatic indexing of primary keys and the ability to monitor model changes, keeping the cached data synchronized with the database.

However, Arnold notes a limitation of Cache Money in handling joins, which is essential for computing 'most popular' rankings. To address this gap, he designed a plugin that incorporates 'get', 'set', and 'repository' methods, allowing developers to manipulate cached entries while retaining performance and ease of use.

Through implementing this plugin, Arnold details how it tracks user activity and maintains a sorted index of views for uploaded content, priming the cache upon first access and updating it with each new view. Such efficiency enables users to retrieve activity metrics easily and manage their 'most popular' lists effectively.

The presentation concludes with Arnold encouraging attendees to explore the plugin, which will soon be available on GitHub, and inviting those interested in scalability and best practices for Rails development to connect with him.

Ultimately, the session emphasizes the importance of optimizing user activity tracking in applications and the advantages of using caching frameworks to enhance performance, providing developers with tools that abstract complex data management tasks.

00:00:12.360 My name is Wolfram Arnold, and I’m here to talk with you today about a plugin I developed called 'Act As Most Popular.' I am with Ruby Focus, a small consulting and recruitment firm specializing in Ruby on Rails.
00:00:25.640 So, what is the 'most popular' list? The most popular list tracks user activity, and you see this functionality everywhere in social networks. For example, I’m currently building a social network focused on the entertainment industry with a few business associates. People often want to know whose content has been viewed, and typically, this involves some entity that's viewable, like a user profile, a video, or an image. These entities are ranked by user activity, which is usually measured by the number of viewings, comments, ratings, or something to that effect. This is a common problem in social networks and related applications.
00:01:20.000 The main issue arises from the need for a join between the viewable entity (like a user profile, image, post) and the activity (which is often stored in a separate table). You typically need to combine these tables to compile a complete list of statistics and analytics. This can be slow when users hit these tracking pages, as the database needs to be accessed each time. This may lead to long wait times, which are not acceptable in user-facing applications.
00:02:16.480 Here's an example of a query that might be run: you take the uploads table and join it on the viewings table to count the rows in the viewings table that refer back to each upload, ordering the results by that count. SQL then returns a list of uploads ordered by popularity. However, this query runs against the database on every request, and how best to optimize it depends on your specific application needs. Overall, I identified a need for optimization, as the current solution was slow.
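As a rough illustration of what that query computes, here is a minimal plain-Ruby sketch that performs the same join, count, and ordering in memory. The `Upload` and `Viewing` structs, the `upload_id` column, and the SQL in the comment are assumptions made for the example; the talk does not show the actual schema or query.

```ruby
# Plain-Ruby stand-in for the join-and-count query described above.
# Roughly equivalent SQL (table and column names are assumed):
#   SELECT uploads.*, COUNT(viewings.id) AS views
#   FROM uploads JOIN viewings ON viewings.upload_id = uploads.id
#   GROUP BY uploads.id ORDER BY views DESC
Upload  = Struct.new(:id, :title)
Viewing = Struct.new(:id, :upload_id)

# Returns [upload, view_count] pairs, most viewed first.
def most_popular_uploads(uploads, viewings)
  counts = Hash.new(0)
  viewings.each { |viewing| counts[viewing.upload_id] += 1 }

  uploads
    .select { |upload| counts.key?(upload.id) } # inner join drops unviewed uploads
    .sort_by { |upload| -counts[upload.id] }
    .map { |upload| [upload, counts[upload.id]] }
end
```

Every call walks the full list of viewings, which mirrors why running the real query on each page load becomes expensive as activity grows.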
00:03:08.239 My goal was to create a solution that would allow this 'most popular' list to scale better. I needed a caching solution that could efficiently manage this data.
00:03:31.519 What I want is a solution that lets me scale the 'most popular' list from a cache effectively. I want this cached data to be populated from the database only once, automatically. The query I mentioned earlier should run during the initialization of the cache, and every subsequent access should come from this cache. As users interact with the site, whether they are viewing objects or commenting, I want those actions to be reflected in the cache without needing to regenerate the entire dataset each time.
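The requirement described here (prime once from the database, then update in place) might be sketched as follows. `PopularityCache` and its methods are hypothetical names for illustration, and a plain Hash stands in for the real cache store.

```ruby
# Sketch: a popularity cache that is primed once and then updated in place.
class PopularityCache
  attr_reader :prime_count

  # The block stands in for the expensive join query; it must return
  # a Hash of { entity_id => view_count }.
  def initialize(&primer)
    @primer = primer
    @counts = nil
    @prime_count = 0
  end

  # Record one view without running the priming query again.
  def record_view(id)
    counts[id] = counts.fetch(id, 0) + 1
  end

  # Ids ordered by descending view count, truncated to `limit`.
  def top(limit = 10)
    counts.sort_by { |id, n| [-n, id] }.first(limit).map(&:first)
  end

  private

  # Lazily runs the primer on first access only.
  def counts
    @counts ||= begin
      @prime_count += 1
      @primer.call.dup
    end
  end
end
```

The `prime_count` reader exists only to make the "populated once" property observable; a real implementation would not need it.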
00:04:09.200 This introduces a need for a caching framework that simplifies this process, ideally one I don’t have to build from scratch. The plan was to create a plugin solution based on existing components. This way, whether users are commenting, viewing, or giving ratings, the same plugin could be reused without having to create custom solutions each time.
00:04:51.560 I’m curious: how many people here are familiar with caching frameworks? Raise your hand if you have used one before. Quite a few! That's great to know. I am particularly interested in a memcached-based caching framework, and I found one called 'Cache Money,' developed by Nick Kallen, whom I first met back in late 2006 while working at Pivotal. I hear this was extracted from the Twitter core and was part of what helped Twitter improve its performance.
00:05:36.000 Cache Money is the next generation of caching frameworks, improving upon the previous generation, such as 'Cache Fu.' The earlier designs required developers to explicitly manage the cache, using syntax like 'model_object.get_cache_key' for accessing cached data, which made the codebase complex and cumbersome. Nick wrote a nice blog post highlighting these issues.
00:06:17.120 With the evolution of technology, we're seeing a shift from manual, explicit control to a more automatic and transparent approach, similar to how automatic transmissions have gained prevalence over manual ones. What I appreciate about Cache Money is that it combines simplicity and effectiveness, allowing users to interact with models without needing to think about accessing the underlying database explicitly. The framework helps abstract these operations, enabling developers to focus on functionality rather than caching logic.
00:07:04.840 Cache Money allows methods like 'find' to seamlessly interact with cached data. For example, when you call 'User.find(id),' the framework checks whether the record is in the cache and falls back to the database only if it is not, without any extra effort from the developer.
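The lookup behavior described here is a read-through cache. Below is a minimal sketch of that pattern, not Cache Money's actual code: `ReadThroughFinder` is a hypothetical class, one Hash stands in for memcached, and another Hash stands in for the database table.

```ruby
# Sketch of a read-through lookup, assuming Hashes in place of
# memcached and the database.
class ReadThroughFinder
  attr_reader :db_reads

  def initialize(table)
    @table = table   # { id => record }, the "database"
    @cache = {}      # stand-in for memcached
    @db_reads = 0
  end

  def find(id)
    return @cache[id] if @cache.key?(id) # cache hit: no database access

    @db_reads += 1                       # cache miss: read from the database
    record = @table.fetch(id)            # raises KeyError if the row is missing
    @cache[id] = record                  # populate the cache for next time
  end
end
```

The caller only ever sees `find`; whether the record came from the cache or the database is an internal detail, which is the transparency the talk highlights.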
00:07:45.999 Upon integrating this caching framework into your application, managing indexed data becomes straightforward. The primary keys are indexed automatically, and you can index additional fields necessary for your application. Moreover, Cache Money monitors your models for changes such as additions and deletions, keeping the cache in sync with the database.
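To make the idea of automatically maintained indexes concrete, here is a small write-through sketch in plain Ruby. `IndexedStore` and its methods are hypothetical and only illustrate the pattern of updating a primary-key index and one secondary index on every addition and deletion; they are not Cache Money's API.

```ruby
# Sketch of write-through index maintenance; all names are hypothetical.
class IndexedStore
  def initialize(indexed_field)
    @field = indexed_field
    @by_id = {}                               # primary-key index
    @by_field = Hash.new { |h, k| h[k] = [] } # secondary index: value => ids
  end

  # Adding a record updates both indexes in the same step.
  def create(record)
    @by_id[record[:id]] = record
    @by_field[record[@field]] << record[:id]
    record
  end

  # Removing a record also removes it from the secondary index.
  def destroy(id)
    record = @by_id.delete(id)
    return nil unless record

    @by_field[record[@field]].delete(id)
    record
  end

  def find(id)
    @by_id[id]
  end

  def find_all_by(value)
    @by_field.fetch(value, []).map { |id| @by_id[id] }
  end
end
```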
00:08:13.520 One limitation of Cache Money is that it does not handle joins, which is crucial for my specific use case concerning 'most popular' rankings. I set about developing a plugin to fill this gap. This plugin allows each model to not only use existing methods but also introduces 'get', 'set', and 'repository' methods to effectively cache entries while still allowing access to the database when necessary.
00:09:07.399 With the 'get' and 'set' methods, developers can specifically manipulate cached values if needed, providing flexibility while keeping everything else abstracted from the caching layer. Each model now comes equipped with these additional methods, ensuring easy management of cached data.
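One way such 'get', 'set', and 'repository' class methods could fit together is sketched below. The signatures, the block-based fallback in `get`, the key format, and the `Profile` class are all assumptions for illustration; the talk does not show the real method definitions, and a Hash stands in for the cache server.

```ruby
# Sketch of class-level cache accessors; signatures are assumed.
module CacheAccessors
  # The backing store. A Hash stands in for a real cache client.
  def repository
    @repository ||= {}
  end

  # Write a value directly into the cache.
  def set(key, value)
    repository[cache_key(key)] = value
  end

  # Read a value; on a miss, run the block (if given) and cache its result.
  def get(key)
    full_key = cache_key(key)
    return repository[full_key] if repository.key?(full_key)
    return nil unless block_given?

    repository[full_key] = yield
  end

  private

  # Namespace keys by class name so models do not collide.
  def cache_key(key)
    "#{name}/#{key}"
  end
end

class Profile
  extend CacheAccessors
end
```

With this shape, a call such as `Profile.get("most_popular") { expensive_query }` would run the block only on the first access, which is how a join result could be cached even though the framework's own finders do not cover joins.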
00:09:47.600 The next step was implementation, where I integrated my caching mechanisms with the tracking of user activity. My 'most popular' list tracks views for models, like uploads. I created an additional index to keep activity counts sorted. This new index gets primed upon first access and is maintained every time a viewing occurs for an associated upload.
00:10:34.800 So when uploaded content is viewed, I ensure that this viewing count is updated appropriately. The plugin requires you to specify the class being monitored, set limitations (for instance, you typically wouldn’t want a list with 500 entries), and you also provide attributes to help prime the cache effectively.
00:12:07.000 When you run a query using these attributes, the resultant objects will have access to necessary metrics like activity counts. You can retrieve these counts efficiently and, from there, get the entries in your 'most popular' list simply by calling 'Upload.most_popular', a class method defined by the plugin.
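The declaration-plus-class-method shape described here might look roughly like the sketch below. The macro name `acts_as_most_popular`, its `limit:` option, the priming block, `record_view`, and the `Video` class are all hypothetical; the plugin's real options are not shown in the talk, and in-memory Hashes replace the database and cache.

```ruby
# Sketch of a class macro that adds a capped 'most popular' list.
module MostPopular
  # Hypothetical macro: `limit` caps the list, and the block primes the
  # counts with { id => view_count } on first access.
  def acts_as_most_popular(limit: 10, &primer)
    @popular_limit = limit
    @popular_primer = primer
    @popular_counts = nil
  end

  # Called whenever a viewing occurs for the given id.
  def record_view(id)
    popular_counts[id] += 1
  end

  # Returns [id, view_count] pairs, most viewed first, capped at the limit.
  def most_popular
    popular_counts
      .sort_by { |id, count| [-count, id] }
      .first(@popular_limit)
  end

  private

  # Primed lazily, once, from the block given to the macro.
  def popular_counts
    @popular_counts ||= begin
      counts = Hash.new(0)
      counts.update(@popular_primer.call) if @popular_primer
      counts
    end
  end
end

class Video
  extend MostPopular
  acts_as_most_popular(limit: 2) { { 1 => 4, 2 => 7, 3 => 1 } }
end
```

Calling `Video.most_popular` then returns at most two entries, and each `Video.record_view(id)` adjusts the ranking without re-running the priming block.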
00:12:57.000 This process is smooth and will work beautifully in practice. The slides will be shared publicly and the plugin will be available on GitHub soon. That brings me to the conclusion of my talk.
00:13:40.000 I am Wolfram Arnold from Ruby Focus. We are a small consulting firm specializing in Ruby and Rails. If you're interested in scalability or want to discuss best practices for Rails, please come and find me. If you're hiring or looking for a job in Rails, I'm also available for that. Thank you so much for your attention. Are there any more questions?