
Great Caching Disasters!

by Lisa van Gelder

In this talk, "Great Caching Disasters!", presented at GoRuCo 2015, Lisa van Gelder focuses on the significant challenges and failures caused by caching issues, particularly during her time at the Guardian. She shares a series of anecdotes to illustrate that the simplest caching solutions are often the most effective.

Key Points Discussed:

- Introduction to Caching Problems: Van Gelder experienced downtime at the Guardian due to faulty caching practices, notably during peak traffic hours.

- Impact of Poll Submissions: investigating logs revealed that cache clearances coincided with poll submissions, leading to service disruptions.

- Lessons Learned: the importance of closely monitoring caching mechanisms and of avoiding manual cache-clearance options due to their propensity for misuse.

- Cache Duration: a three-day cache duration led to issues when articles were deleted, reinforcing the lesson that shorter caching can be more effective.

- Case Study – Live Q&A with Julian Assange: the Guardian's comment system faltered under load despite having a short HTML fragment cache, highlighting the risks of integrating processing instructions with cached content.

- 'Woolly Rat' Phenomenon: Van Gelder refers to a viral article that caused unexpected traffic, illustrating the necessity for robust caching processes that respond automatically to traffic spikes.

- Final Recommendations: simplify caching practices, monitor performance closely, and avoid complex systems unless caching is the core business focus.

Conclusion:

Van Gelder emphasizes the importance of simplification and vigilance in caching management. By adopting practices like maintaining a one-minute cache and avoiding manual interventions, organizations can significantly reduce the likelihood of caching disasters. The talk serves as a valuable lesson for engineers and developers in managing caching systems effectively, particularly during high-traffic events.

The speaker concludes by advising against elaborate caching systems unless they are specifically needed, and by encouraging the use of Content Delivery Networks (CDNs) for effective caching, leaving complex setups to the experts.

00:00:13.410 I am Lisa van Gelder, the VP of Engineering at Stride, a consultancy company here in New York. Before I joined Stride, I worked at the Guardian newspaper. Today, I want to discuss some significant caching disasters that we experienced during my time there.
00:00:24.360 I joined the Guardian in 2008, and shortly after I started, the website began experiencing downtime every lunchtime. This was peculiar because it coincided with peak traffic, yet it was not due to higher-than-normal load. The reason was not immediately apparent, and this downtime continued for about a week until someone finally thought to look at the caching statistics.
00:00:47.579 What we discovered was that, right before the site went down, the cache was being cleared. At peak times, we had about 2,000 requests per second across 12 servers, and we were heavily relying on caching to serve this traffic quickly. If the cache was cleared during peak hours, by the time it ramped back up, the website would crash because the servers couldn’t cope with the load.
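As a rough back-of-envelope illustration of why a purge at peak traffic hurts, using the numbers above (the 200 ms uncached render cost is an assumption for illustration, not a figure from the talk):

```ruby
# Back-of-envelope arithmetic with the numbers from the talk; the 200 ms
# render cost for an uncached page is an illustrative assumption.
peak_rps       = 2_000   # requests per second across the site at lunchtime
servers        = 12
render_seconds = 0.2     # assumed cost of building a page when the cache is cold

rps_per_server = peak_rps / servers.to_f
render_work    = rps_per_server * render_seconds   # CPU-seconds needed per second

puts format("%.0f req/s per server", rps_per_server)                           # ~167
puts format("%.1f seconds of render work per wall-clock second", render_work)  # ~33.3
# With a warm cache most of those requests are near-free; right after a purge,
# every one of them pays the full render cost, and the servers fall over.
```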
00:01:13.740 Now that we understood caching was causing the problem, we started examining the logs to find out more. We noticed that each outage came just after someone had submitted a poll. For example, one such poll asked users to choose who had better hair: Jennifer Aniston or Justin Bieber.
00:01:29.920 At that point, poll results were shown as raw vote counts rather than percentages, and those counts were cached: a poll might show 25 votes while more votes were still coming in, so the results people saw were out of date. The fix for that was caching-related too: every time someone submitted a poll, the caches were cleared so the results would be accurate.
00:01:50.970 That cache clearing was what was taking the site down every lunchtime, and it left us with two valuable lessons. First, if caching is crucial to your operations, you need to monitor it closely. It took us a week to realize that caches were being cleared, which is quite embarrassing. The second lesson is never to build a system that lets you manually clear the cache. Why? Because a manual clear-cache button is a red button that someone will undoubtedly push, and that will lead to trouble.
00:02:04.229 In addition, building a cache-clearing system on top of an already complex caching setup only complicates matters further. Caching is difficult to test and vital for performance; the more complex your caching setup becomes, the greater the risk that something goes disastrously wrong.
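The monitoring lesson can start as small as counting hits and misses around whatever cache is already in place; here is a minimal sketch (an in-memory stand-in, not the Guardian's actual setup), where a sudden drop in hit rate is the signal that something is clearing the cache:

```ruby
# Minimal hit/miss instrumentation around an in-memory cache. A real setup
# would ship these counters to a metrics system instead of keeping them here.
class InstrumentedCache
  attr_reader :hits, :misses

  def initialize
    @store  = {}
    @hits   = 0
    @misses = 0
  end

  # Return the cached value, or compute, store and return it on a miss.
  def fetch(key)
    if @store.key?(key)
      @hits += 1
      @store[key]
    else
      @misses += 1
      @store[key] = yield
    end
  end

  def hit_rate
    total = @hits + @misses
    total.zero? ? 0.0 : @hits.to_f / total
  end
end

cache = InstrumentedCache.new
3.times { cache.fetch("front-page") { "rendered html" } }
puts cache.hit_rate   # => 0.666... ; a sudden drop here is the alarm bell
```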
00:02:37.360 An effective approach with caching is to cache for a short duration. Many assume that longer cache times will protect their servers from load spikes, but you must also consider how long your users are seeing stale data. When we implemented a three-day cache at the Guardian, it masked numerous poorly performing SQL queries, which were in desperate need of optimization.
00:02:59.140 When articles were deleted for legal reasons, we used our cache management system to remove those articles from the cache. One day, however, a critical deletion notification was missed, and an article was removed from the database but not the cache. With our three-day cache still active, this situation led to significant issues.
00:03:19.000 The Hibernate query cache expected the cache and the database to be in sync, so when it tried to load the article that had been deleted from the database, it threw an exception and refused to load anything. Consequently, half our website started displaying error messages, making it seem as if the site was down. We had cache poisoning: corrupted data kept being served from the cache.
00:03:39.000 Had the cache expiration been just a minute, we would have regained functionality quickly as the cache would naturally reset. Instead, we were stuck showing error pages for three entire days. Luckily, this happened in the afternoon, giving us time to identify the issue and refresh our caches before things escalated further.
00:04:04.700 From this event, we learned two crucial lessons: don't try to compensate for poorly performing queries with caching, because eventually it won't work out for you; and always set the shortest cache time feasible. Subsequently, we conducted extensive testing at the Guardian to determine the optimal cache duration.
00:04:22.120 We concluded that keeping the cache at one minute was ideal—enough time to manage server load while also being brief enough to ensure users didn’t encounter stale content.
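A sketch of the kind of short-lived fragment cache that conclusion points to; the 60-second TTL is the figure from the talk, while the in-memory store is an illustrative stand-in rather than the Guardian's implementation:

```ruby
# Tiny time-based fragment cache: entries expire after ttl seconds, so even
# stale or poisoned content can only survive for a minute.
class FragmentCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(ttl: 60)
    @ttl   = ttl
    @store = {}
  end

  def fetch(key)
    entry = @store[key]
    return entry.value if entry && Time.now < entry.expires_at

    value = yield                                   # re-render the fragment
    @store[key] = Entry.new(value, Time.now + @ttl)
    value
  end
end

cache = FragmentCache.new(ttl: 60)
html  = cache.fetch("article/42/comments") { "<ul>...rendered comments...</ul>" }
```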
00:04:38.790 In 2010, the Guardian hosted a live Q&A with Julian Assange, which was an excellent opportunity to test our comment system under load. We felt confident about our comment caching, boasting a straightforward one-minute HTML fragment cache. However, once the event began, the comment system's response time quickly deteriorated.
00:04:55.750 Upon reviewing the caching stats, we noted that the cache hit rate was satisfactory, yet the system was still lagging, and oddly it was making a high number of database calls. Ordinarily the comment system plays a minor role on the site, but now that it was the focal point of the website, we tried to keep it operational by disabling other functions.
00:05:18.720 Through sheer determination, we managed to keep everything running without going offline. Once the dust settled, we investigated why caching hadn't worked as expected. We discovered that our cached fragments contained not just static content but also processing instructions, which triggered database queries behind the scenes.
00:05:36.860 When serving cached content, the system still had to resolve those instructions, and it didn't reliably know where the other services were deployed. At the wrong moment it would attempt a database query, which contributed significantly to our slowdown. Our findings reinforced that caching works best when you can serve static cached content rapidly.
00:05:56.079 Adding processing instructions when serving cached content can severely hinder performance, as it may lead to a database call for every single piece of cached content. This defeats the purpose of caching altogether.
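One way to picture the difference: do all the work when the cache entry is written, so that serving a hit touches nothing but the cache. A small sketch of the contrast, with a hash standing in for both the database and the cache:

```ruby
require "erb"

db    = { poll_7_votes: 25 }   # stand-in for a real database lookup
cache = {}                     # stand-in for a real fragment cache

# Anti-pattern: the cached fragment still contains a processing instruction,
# so every cache hit pays for a database lookup when the template is evaluated.
template  = (cache["poll/7"] ||= "<p>Votes: <%= db[:poll_7_votes] %></p>")
html_slow = ERB.new(template).result(binding)   # database hit on EVERY request

# Better: resolve everything when the entry is written; hits are pure reads.
html_fast = (cache["poll/7/static"] ||= "<p>Votes: #{db[:poll_7_votes]}</p>")

puts html_slow   # => <p>Votes: 25</p>
puts html_fast   # => <p>Votes: 25</p>
```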
00:06:12.999 Finally, I want to talk about the concept of a 'woolly rat'—that one piece of content that goes viral unexpectedly, leading to a sudden spike in traffic. We had one such viral article in our time at the Guardian, featuring a new species of woolly rat from Papua New Guinea. The article gained massive interest and caused traffic to surge.
00:06:35.710 The lesson here is that if your servers struggle under the sudden influx of traffic due to a woolly rat, it indicates a problem with your caching system. In our case, the caching system was still manual and depended on human intervention, which proved to be unreliable during high-traffic events.
00:06:54.000 After experiencing the woolly rat surge, we automated our caching processes to enhance our system’s robustness against spikes in traffic. We established a mechanism that allowed our servers to react automatically when traffic levels became risky.
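The talk doesn't spell out the mechanism, but one common shape for "react automatically when traffic gets risky" is to keep serving the last good copy, even if it has expired, whenever the request rate crosses a threshold. A hedged sketch of that policy (the threshold and the in-memory counters are illustrative assumptions):

```ruby
# "Serve stale under load": when the request rate crosses a threshold, return
# the expired copy instead of re-rendering and hammering the backend.
class SpikeTolerantCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(ttl: 60, spike_rps: 500)
    @ttl       = ttl
    @spike_rps = spike_rps   # above this rate we stop refreshing on expiry
    @store     = {}
    @requests  = 0
    @window    = Time.now.to_i
  end

  def fetch(key)
    count_request
    entry = @store[key]
    return entry.value if entry && Time.now < entry.expires_at
    return entry.value if entry && spiking?   # expired, but traffic is risky

    value = yield
    @store[key] = Entry.new(value, Time.now + @ttl)
    value
  end

  private

  # Count requests in one-second windows to approximate requests per second.
  def count_request
    now = Time.now.to_i
    if now != @window
      @window   = now
      @requests = 0
    end
    @requests += 1
  end

  def spiking?
    @requests > @spike_rps
  end
end
```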
00:07:17.370 You may have noticed that many of the dates in these examples are quite old. That is partly because I no longer work at the Guardian, but also because we genuinely learned our lessons about caching and simplified it down to a one-minute HTML fragment cache.
00:07:38.860 Since implementing these optimizations, I can confidently say that we have avoided any significant caching disasters. To summarize my key lessons: monitor your caching practices and simplify where you can.
00:08:05.270 I also want to emphasize that if you are using a Content Delivery Network (CDN), do not attempt to build a cache-clearing system on top of it unless caching is your primary business. If caching is not the focus of your operation, it's best to leave it to the experts and keep your setup as straightforward as possible.
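In practice, leaving it to the CDN can be as small as sending a short max-age and letting content expire on its own instead of building purge tooling; a minimal Rack sketch of that idea (the app body is a placeholder):

```ruby
# config.ru -- minimal Rack app that lets a CDN in front of it do the caching.
# Run with: rackup config.ru (needs the rack / rackup gems installed)
app = lambda do |_env|
  headers = {
    "content-type"  => "text/html",
    # public: shared caches (the CDN) may store the response;
    # max-age=60 mirrors the one-minute rule, so nothing needs purging.
    "cache-control" => "public, max-age=60"
  }
  [200, headers, ["<h1>Hello from the origin</h1>"]]
end

run app
```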
00:08:24.710 Thank you for your attention.