00:00:15.240
Hi everyone. My name is Donal McBreen, and I am a programmer at 37signals. Today, I'm going to talk about our new disk-backed Rails cache that we're calling Solid Cache.
00:00:22.119
We have been using this cache on hey.com since about February, and since the start of September on Basecamp.
00:00:34.360
The project revolves around one central question: if we start caching data on disk instead of in memory, will the performance be good enough?
00:00:41.480
The traditional approach involves using memory caches like Redis or Memcached. However, our aim was to explore disk caching, which offers the potential for significantly larger storage at a lower cost.
00:00:52.800
While we do not expect the individual operations in our cache to be faster, we hope that the overall caching mechanism will improve the application’s response time.
00:01:03.320
This is our Rails cache store quadrant. We distinguish between local and remote caches, and between memory and disk caches.
00:01:11.159
Rails already includes a file store as a built-in disk cache, but there's nothing currently available in the remote disk cache quadrant. Our goal is to fill this gap.
00:01:41.480
Before we started the project, we established some design criteria. Our first design goal was to utilize a database.
00:01:46.600
Starting with a database simplifies many aspects of implementation. It gives us access to connection libraries, SQL for operations like inserting, selecting, and deleting cache records, as well as indexing capabilities to quickly retrieve records. Additionally, it eliminates the need to interact directly with the file system, as the database handles that for us. Since we already have a database, we thought it would be sensible to utilize it for the cache, although we could also consider sharding it across multiple databases.
00:02:48.520
A major advantage of using a database is that databases typically come with built-in memory caches. MySQL, for instance, has a buffer pool that can deliver very high hit rates; at 99.8%, you would only touch the disk once in every 500 operations.
00:03:30.480
The second design criterion was to make it database-agnostic. We aimed to ensure that the cache could work with various databases, such as SQLite, MySQL, or PostgreSQL.
00:03:49.680
The third aspect was to make the cache plug-and-play. We wanted a simple installation process consisting of three main steps: install the gem, run the migrations to create the cache table, and configure the cache store in your settings.
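As a rough sketch of those three steps (the exact commands and configuration names here are assumptions and may differ from the released gem):

```ruby
# 1. Add the gem (shell, assumed command):
#      bundle add solid_cache
#
# 2. Run the migrations that create the cache table (shell):
#      bin/rails db:migrate
#
# 3. Point Rails at the new store, e.g. in config/environments/production.rb:
Rails.application.configure do
  config.cache_store = :solid_cache_store
end
```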
00:03:57.680
Furthermore, we didn't want to rely on scheduled tasks or cron jobs to manage the cache, such as expiring old data.
00:04:12.200
Finally, performance was a key criterion. We wanted the cache operations to be as fast as possible, but we recognized that we might not achieve the speed of Redis or Memcached, where operations occur in microseconds, while database operations typically take around 100 microseconds.
00:04:33.680
However, all these times are relatively small, so our goal was to see how close we could get to that performance.
00:04:50.960
Let me show you a schema that illustrates a fundamental starting point for building the cache. All that's necessary is a simple key-value store with a key, a value, and an index on the key. With that, we can insert and query records effectively.
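As a sketch, that starting schema could be written as a Rails migration like this (table and column names are illustrative, not necessarily what Solid Cache ships with):

```ruby
class CreateCacheEntries < ActiveRecord::Migration[7.1]
  def change
    create_table :cache_entries do |t|
      t.binary :key,   null: false, limit: 1024           # cache key
      t.binary :value, null: false, limit: 512.megabytes  # serialized value
    end
    # The index on the key turns reads and upserts into a single indexed lookup.
    add_index :cache_entries, :key, unique: true
  end
end
```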
00:05:16.160
However, we faced the challenge of cache expiry. This issue is significant, especially as Redis and Memcached manage this automatically: you can simply set a memory limit, and they’ll delete older data when it exceeds that limit.
00:05:43.560
On the other hand, databases act differently; they prefer to retain data and need commands to delete it. To manage expiry in our solution, we consider two important factors: the age of the items and the overall size of the cache.
00:06:10.200
First, why should we expire items based on age? Because we want to ensure our customers are not left with outdated data in the cache. For instance, Redis uses a probabilistic least recently used algorithm to randomly remove older items.
00:06:24.880
This means that when using such a cache, some old records may remain. Therefore, we wanted to build a mechanism that allows you to specify a maximum age for the data, ensuring it gets deleted after that time.
00:06:46.480
We can achieve this by indexing the age of the cache items, allowing for the quick identification of the oldest records.
00:07:13.160
The second challenge is figuring out how to expire items based on the size of the cache. Even while expiring by age, we need to account for varying growth rates within the cache.
00:07:47.640
Caches can grow unexpectedly, especially with fragment caching: if a fragment is used across many pages and a content change alters its digest, the change triggers a significant influx of new data into the cache.
00:08:25.000
We explored several strategies, including file size checks and database statistics, but neither yielded the real-time information we required.
00:08:41.000
One alternative we considered was using row counts as a proxy. However, with hundreds of millions of rows in the cache, a full count would be too slow.
00:09:32.960
Next, we examined various expiry algorithms to determine how to identify the oldest items for removal. The simplest algorithm is FIFO, or First-In-First-Out.
00:09:53.680
To implement FIFO, we would need to modify our schema to include a 'created_at' timestamp field and index it, enabling us to identify the oldest records.
00:10:29.480
However, Redis and Memcached use an alternative approach: least recently used (LRU) eviction. Instead of tracking creation times, they track access times, which for us would mean repurposing the timestamp column to record when each entry was last accessed.
00:11:06.440
With LRU, every time a cache item is accessed, we update its access timestamp. This means the newest accessed item is at the front of the index. Unfortunately, while this method maintains freshness, it comes at a cost; every read operation also results in an update operation to the database.
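To make that cost concrete, here is a toy in-memory model (not Solid Cache's actual code) showing why an LRU read is also a write: the access timestamp must be refreshed on every hit so eviction can later find the least recently used entry. It uses a logical clock in place of wall-clock time, purely so the ordering is deterministic:

```ruby
# Toy LRU store: every read updates the entry's access time.
Entry = Struct.new(:value, :touched_at)

class LruStore
  def initialize
    @entries = {}
    @clock = 0  # logical clock; a real store would use timestamps
  end

  def write(key, value)
    @entries[key] = Entry.new(value, @clock += 1)
  end

  # The LRU read performs an update: refresh the access time so the
  # eviction order stays accurate. This is the extra write FIFO avoids.
  def read(key)
    entry = @entries[key]
    return nil unless entry
    entry.touched_at = (@clock += 1)
    entry.value
  end

  # Evict the entry with the oldest access time.
  def evict_one
    key, _entry = @entries.min_by { |_, e| e.touched_at }
    @entries.delete(key)
  end
end
```

Reading a key moves it to the back of the eviction queue, which is exactly the freshness-for-write-traffic trade discussed above.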
00:12:14.760
Conversely, using FIFO allows faster reads since we can simply select from the table without additional updates.
00:12:50.320
In fact, with FIFO, the oldest entries align with the lowest IDs, allowing us to scan through IDs to find the oldest records without requiring a separate index.
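As a toy illustration of that property (again, not the gem's actual code): with auto-incrementing IDs and no in-place updates, the n oldest records are simply the n lowest IDs, so no separate timestamp index is needed:

```ruby
# rows: id => record, where ids were assigned in insertion order.
# The oldest records are exactly those with the smallest ids.
def oldest_records(rows, count)
  rows.keys.min(count).map { |id| rows[id] }
end
```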
00:13:21.200
However, a notable downside to FIFO is that it results in a lower hit rate compared to LRU, as we may evict items that are later requested.
00:13:56.720
Yet, due to cost efficiency and longer retention times—up to two months in our use case—FIFO proves to be a viable option.
00:14:34.880
Ultimately, we determined that FIFO avoids index fragmentation, makes the cache size easy to estimate, and sidesteps the performance penalty of updating a timestamp on every read.
00:14:57.520
We can then use the row ID range as a proxy for cache size: because rows are never updated in place, there is no fragmentation to skew the estimate, and we avoid the cost of a full row count.
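The estimate can be sketched like this. It is a toy model with two stated assumptions: IDs are auto-incrementing with few gaps, and row sizes are roughly uniform, so sampling a few rows gives a usable average:

```ruby
# Toy model: estimate total cache size from the id range instead of a
# full row count (too slow at hundreds of millions of rows).
def estimated_cache_bytes(rows, sample_size: 3)
  return 0 if rows.empty?
  ids = rows.keys
  approx_row_count = ids.max - ids.min + 1  # id range as row-count proxy
  sample = rows.values.first(sample_size)   # sample a few rows' byte sizes
  avg_row_bytes = sample.sum / sample.size
  approx_row_count * avg_row_bytes
end
```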
00:15:34.120
Now, let’s take a look at some initial results of our implementation. After introducing our expiry process, we observed a significant stabilization in our database size.
00:15:54.880
It's important to note that while databases rarely release storage once they have claimed it, our goal was to prevent further growth, and that was achieved.
00:16:19.800
This relies on the assumption that cache entries are reasonably uniform, so that expiry behaves consistently over time.
00:16:41.720
In terms of execution, expiry runs as a background task tied to cache writes: after every 80 writes, the cache kicks off an expiry pass.
00:17:19.680
Each pass processes the oldest 100 records, examining both their ages and the overall cache size. Records older than the specified maximum age are deleted, as are the oldest records whenever the cache is over its size target.
00:17:41.920
Because each pass can examine more records than were written since the last one (100 records per 80 writes), deletions keep pace with writes, keeping the cache in balance and preventing unbounded growth.
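A minimal sketch of that write-triggered expiry loop, using the 1-in-80 trigger and 100-record batch mentioned above. This is an in-memory stand-in; the real implementation runs the pass as a background task against the database:

```ruby
EXPIRY_EVERY = 80                 # trigger an expiry pass once per 80 writes
BATCH_SIZE   = 100                # examine the oldest 100 records per pass
MAX_AGE      = 60 * 60 * 24 * 60  # e.g. two months, in seconds

class ExpiringCache
  def initialize(max_entries:)
    @entries = {}  # key => [value, written_at]; Ruby hashes keep insertion order
    @max_entries = max_entries
    @writes = 0
  end

  def write(key, value, now: Time.now)
    @entries[key] = [value, now]
    @writes += 1
    expire_batch(now) if (@writes % EXPIRY_EVERY).zero?
  end

  def read(key)
    entry = @entries[key]
    entry && entry.first
  end

  private

  # Walk the oldest records (FIFO order), deleting those past MAX_AGE and,
  # when the cache is over its size target, the oldest overflow as well.
  def expire_batch(now)
    overflow = @entries.size - @max_entries
    @entries.first(BATCH_SIZE).each do |key, (_value, written_at)|
      next unless overflow.positive? || now - written_at > MAX_AGE
      @entries.delete(key)
      overflow -= 1
    end
  end
end
```

Because the pass only ever touches the oldest batch, each expiry is cheap regardless of total cache size.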
00:18:02.280
In terms of resilience, our system runs on several dedicated MySQL databases, each with its own memory and storage resources. The design focuses on fast retrieval, with the cached data encrypted.
00:18:51.960
The caches are configured for durability without relying on replication, allowing us to sustain a straightforward infrastructure.
00:19:32.560
Transitioning over to performance, although we anticipated the operations to be slower due to encryption, our results revealed that reads averaged around 1 millisecond, and writes around 1.4 milliseconds—values that remain efficient in our context.
00:20:15.640
Comparing storage costs: we previously relied on 1.1 terabytes of RAM for our Redis cache, but now use only 80 gigabytes of RAM with Solid Cache, resulting in substantial cost savings.
00:20:53.760
Our estimates suggest that scaling with Solid Cache could be around 20 times cheaper than with Redis, while also allowing entries to be retained for much longer.
00:21:30.960
Finally, looking at cache efficiency, the miss rate fell from 10% with Redis to approximately 7.5% with Solid Cache, a meaningful improvement.
00:22:17.160
The key takeaway from this improvement is that a cache's effectiveness hinges largely on its size: a bigger cache retains entries for longer, so more requests can be served from it.
00:23:00.160
We also examined how well the system tolerates fluctuations while relying on disk-based storage, and it has held up well.
00:23:48.440
In conclusion, not only have we confirmed that Solid Cache operates at a larger scale and is more cost-effective, but we also validated that it has indeed sped up our application.
00:24:24.560
For applications that are not heavily cached, the benefits may vary. Nevertheless, moving the cache to a database of its own still offers a significant operational gain.
00:24:55.440
As I conclude, here's the repository link for anyone interested in exploring more about our Solid Cache project. Thank you for your attention!