RailsConf 2022

A Rails Performance Guidebook: from 0 to 1B requests/day

by Cristian Planas

In the presentation "A Rails Performance Guidebook: from 0 to 1B requests/day", delivered by Cristian Planas at RailsConf 2022, the focus is on improving application performance and on the complexity of scaling software systems effectively. The speaker shares insights and practical guidance drawn from managing a significant volume of data in a real-world application serving billions of requests per day.

Key points discussed include:

  • The Myth of Ruby Performance: Performance challenges are common across programming languages and frameworks, not unique to Ruby on Rails.

  • Scaling Challenges: The journey of scaling applications can reveal the intricate difficulties of software engineering, particularly as systems grow large. Planas shares his experiences from working at Zendesk, where they currently handle 16.3 billion tickets and 400 terabytes of relational database data.

  • Performance and Monitoring: The importance of monitoring performance is highlighted. Effective monitoring can help identify issues in applications, as performance problems often manifest in different ways across different systems.

  • Database Optimization: Specific strategies for improving database query performance are discussed, such as avoiding "N+1" queries and ensuring appropriate indexing. Greedy selects (e.g., using SELECT *) that fetch unneeded columns can significantly degrade performance.

  • Caching Strategies: The presentation emphasizes efficient caching methods, such as write-through caching, which helps reduce database load and improve response times (see the sketch after this list).

  • Archiving Data: A method known as 'Cold Storage' is described, allowing aged data to be archived and preventing database bloat. This approach has drastically improved their data management and performance metrics.

  • Trade-offs and User Experience: Planas discusses the necessity of making trade-offs when scaling, such as finding a balance between system limitations and user needs. Users often prefer to access older data through separate calls rather than experience slow response times.

  • Simplicity vs. Complexity: An important lesson explored is the tendency of engineers to over-engineer solutions; simplicity and understanding the context of technology usage are crucial for stable performance.
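
The write-through caching mentioned above can be illustrated with a minimal Ruby sketch (not code from the talk); the Ticket model, cache keys, and callback are hypothetical, and a real implementation would also need to consider expiry, serialization, and invalidation on destroy:

```ruby
# Minimal write-through sketch: refresh the cache whenever the record is
# written, so most reads never have to touch the database.
class Ticket < ApplicationRecord
  after_commit :write_through_cache, on: [:create, :update]

  def self.cached_find(id)
    # Falls back to the database only on a cache miss.
    Rails.cache.fetch("tickets/#{id}") { find(id) }
  end

  private

  def write_through_cache
    Rails.cache.write("tickets/#{id}", self)
  end
end
```

Caching whole ActiveRecord objects is shown here only for brevity; caching serialized attributes or rendered output is often the safer choice.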

In conclusion, the talk's core message is that performance optimization is a continuous process requiring a mindset of enjoying problem-solving, balancing trade-offs, and prioritizing user experience. Every application scales differently, and the right strategies must be devised for the unique performance challenges at hand. Overall, the presentation advocates a pragmatic, thoughtful approach to performance engineering in Rails applications.

00:00:00.900 Hello everyone, it's a pleasure to be here. My name is Cristian Planas, and I’m excited to share some insights on performance in Rails.
00:00:24.199 I've been in the audience for many years, watching presentations from my home in Spain. So, being here is truly a dream come true for me. One thing I've noticed while watching countless conferences is that great speakers often provide a quick summary in their first two slides, which allows the audience to catch key points before moving on to the next talk.
00:00:32.340 With that in mind, I’ve tried to summarize my ideas on this topic in just two slides. The first slide presents the main idea, while the second slide follows closely. Performance is a topic I think about very often, and I find it intriguing because while Ruby has its own unique challenges, many of them are shared with other programming communities like Python and Django.
00:00:49.500 In fact, I've developed microservices in Java, and ultimately you’ll discover similar bottlenecks across languages. This led me to develop an explanation of where the myth of Ruby’s performance problems comes from. But to understand it, you’ll need to accompany me on a bit of time travel. Let’s set the scene and travel back to 2010.
00:01:19.560 This is me back then, sporting a rather fabulous jumper, I must say. I was working on a photo site for movie magazines, and I wanted the critics to be able to write their own reviews, which meant I had to implement user authentication.
00:01:37.380 Authentication is a tough problem, as many of you already know. I rolled up my sleeves and conducted thorough research into this issue, eventually finding a solution. You might know the popular gem called Devise. I installed it quite simply, and although I didn’t fully understand how it worked, the point is, it worked.
00:01:58.259 For me, back when I was 22, it didn’t feel like such a hard problem at all. Fast forward a bit to 2012. I'm quite proud of this period; I started my own startup and found myself as the CTO, also known as the only engineer on the team.
00:02:11.580 We achieved success, but with that came the need to scale our application, a challenge I believe is the reason you're all here today. Again, I dove deep into researching this scaling issue before discovering a gem called ActiveRecord.
00:02:28.800 I assume everyone here is familiar with it. I initialized ActiveRecord with sufficient parameters and, surprisingly, everything just worked beautifully without it being overly complicated.
00:02:41.099 But scaling is quite a tricky topic; it can feel as though you just copy and paste an example, tweak a few names here and there, and you're done. The reality of scaling is that there are no free lunches, which is something I learned through experience.
00:03:00.000 The problem with scaling is that it often feels easy until you hit a wall. It's only once you encounter scaling challenges that some of the hardest parts of software engineering become apparent.
00:03:17.280 Now, before delving into the main body of my presentation, I’d like to share some general thoughts about the audience for this talk. The primary purpose is to introduce problems associated with scaling applications.
00:03:39.120 I want to provide some overarching ideas, particularly revolving around issues we’ve faced while scaling to a significant degree. Unfortunately, I won’t have time to go into specific details, but if you’d like to discuss more later, I’d be happy to chat.
00:03:56.700 Importantly, when it comes to scaling, there are typically no silver bullets that resolve every issue in your application. To paraphrase Tolstoy, every poorly performing system performs poorly in its own way, and without examining your application, you won’t know why it’s slow.
00:04:12.480 For instance, you might be facing an N+1 problem, or perhaps you’re simply returning too much data. Another common issue is that the number of possible performance improvements is nearly infinite: there are innumerable strategies for making an application faster.
00:04:31.800 However, we must also recognize the limitations we face: the time available to implement all those improvements, and perhaps more importantly, the constraints imposed by acceptable trade-offs.
00:04:49.560 As I've often said, if a feature is too slow and you can't change it, dropping it will make your application faster. That's usually not a viable option, but sometimes it is: you might have a feature that sees little use and hurts your database, and in that case it can be sensible to drop it.
00:05:07.860 The reason I’m sharing all of this with you today is that I believe the size of the problems we face when managing data becomes incredibly relevant, particularly in the context of a Rails application.
00:05:22.680 At Zendesk, for example, our basic unit of data is the customer ticket. Think of someone complaining, for instance, that their Uber ride never arrived. To date, we’ve logged a staggering 16.3 billion tickets, which is more than double the current global population.
00:05:39.360 To put it simply, this data is continuously growing; a third of that data was created just in the last year. This growth impacts the relational databases associated with our models, increasing them to around 400 terabytes.
00:05:53.160 Moreover, the tables themselves often have many associations, which adds to our database’s complexity. For example, the largest ticket table alone is 41 gigabytes, and our events table in a single shard amounts to two terabytes.
00:06:07.559 All these numbers illustrate the challenges we face. They are not trivial, but they do pose intriguing problems to solve.
00:06:23.499 Finally, I want to touch on what I consider the right mindset for a performance engineer. Don't worry, this isn’t going to be a lecture about hardware specifics. My belief is that performance engineering should be enjoyable.
00:06:38.160 If you don’t feel a sense of joy when making an endpoint 100 milliseconds faster, perhaps you should explore different avenues in the field. This work should indeed feel rewarding.
00:06:57.720 In summary, I’d say a performance engineer genuinely thrives on witnessing the fruits of their labor, as in this graph, which illustrates a real optimization I achieved some months ago. It’s a very satisfying feeling, a real sense of achievement.
00:07:19.680 The first element I want to discuss is perhaps the most fundamental: monitoring. Without awareness of the problems at hand, you can’t fix them.
00:07:37.260 Monitoring can be expensive, and you might run the risk of over-monitoring, leading to a flood of data that you cannot effectively analyze. However, when working in performance contexts, it’s often sufficient to sample data instead of collecting every single piece of information.
00:07:56.520 For many types of data, I’ve found that a one percent sample is adequate to reveal where the inefficient queries lie and identify the underlying issues.
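As a rough sketch of how one percent sampling might look in a Rails app (illustrative only, not the talk's actual instrumentation; StatsReporter is a hypothetical stand-in for whatever metrics backend you use):

```ruby
# e.g. config/initializers/query_sampling.rb
# Record roughly 1% of SQL events instead of every single query.
ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, started, finished, _id, payload|
  next unless rand < 0.01 # keep a ~1% sample

  StatsReporter.record(
    sql: payload[:sql],
    duration_ms: (finished - started) * 1000
  )
end
```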
00:08:16.020 Next, I want to talk about error budgets. Most of you may associate error budgets with uptime, tracking how many times your application might fail. However, you can also establish error budgets based on latency.
00:08:33.780 It’s essential to recognize that while returning a 200 response is nice, taking 30 seconds to do so isn’t acceptable. In the accompanying slide, the left column outlines uptime error budgets, while the right column details acceptable and unacceptable latencies.
00:08:51.540 Defining what constitutes acceptable and good latency is completely customizable, and it serves two important roles: first, to provide a north star for performance goals, and second, to help detect any changes that negatively impact performance without affecting uptime.
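To make a latency-based error budget concrete, here is a small illustrative Ruby sketch; the 300 ms threshold and 1% budget are made-up numbers, not values from the slide:

```ruby
# At most 1% of requests may exceed 300 ms over the measurement window.
ACCEPTABLE_MS = 300
SLOW_BUDGET   = 0.01

# Returns how much of the budget is spent: 1.0 means fully spent, above 1.0 means blown.
def latency_budget_spent(durations_ms)
  slow_fraction = durations_ms.count { |ms| ms > ACCEPTABLE_MS }.fdiv(durations_ms.size)
  slow_fraction / SLOW_BUDGET
end

latency_budget_spent([120, 250, 180, 90]) # => 0.0, no slow requests yet
```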
00:09:10.920 This brings us to the core of my discussion: database query performance. Although it dominates much of my daily work, I want to examine performance from a broader standpoint.
00:09:31.200 When I mention the term 'N+1,' it’s likely the first performance issue that comes to mind because it's a foundational problem that many overlook. It can slow down our applications dramatically unless addressed.
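For reference, here is the textbook N+1 shape and its usual fix with eager loading; the Ticket model and requester association are hypothetical examples, not Zendesk's actual schema:

```ruby
# N+1: one query for the tickets, then one extra query per ticket.
Ticket.limit(100).each do |ticket|
  puts ticket.requester.name
end

# Eager loading collapses this into two queries.
Ticket.includes(:requester).limit(100).each do |ticket|
  puts ticket.requester.name
end
```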
00:09:51.780 The next point, which I believe is even more important than N+1 issues, is the use of database indexes. I rarely see a good reason not to create carefully crafted indexes for specific queries.
00:10:13.380 While writes may become marginally slower, it’s usually a worthwhile trade-off, as applications are generally read-heavy, and having well-tuned indexes benefits overall performance.
00:10:32.880 Now, one prevalent misconception in database engineering is that the database is some kind of omniscient mechanism that automatically picks the best indexes. Unfortunately, this is often not the case. You need to validate your indexes, using tools such as ActiveRecord's explain, which wraps the database's EXPLAIN command.
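A minimal sketch of both steps, creating an index for a specific query and then checking the plan with ActiveRecord's explain; the table and column names are assumptions for illustration:

```ruby
# Add a composite index tailored to a frequent query.
class AddAccountStatusIndexToTickets < ActiveRecord::Migration[7.0]
  def change
    add_index :tickets, [:account_id, :status]
  end
end

# Verify that the query planner actually uses the new index.
puts Ticket.where(account_id: 42, status: "open").explain
```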
00:10:51.960 Another frequent issue in Rails, particularly with Active Record, is what I call greedy selects: selecting every column from a table with SELECT * when only a few attributes are actually needed.
00:11:09.600 This approach can lead to significant problems on various levels, affecting both performance in the application layer and the database layer itself.
00:11:28.800 When you have a large table with many columns, especially those that store large data types, and you fetch all available data at once, this can result in your database having difficulties allocating enough memory, thus leading to increased latency.
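A short example of trimming a greedy select; the column names are hypothetical:

```ruby
# Greedy: when enumerated, loads every column, including any large text columns.
tickets = Ticket.where(account_id: 42)

# Narrow the SELECT to the columns the caller actually needs.
tickets = Ticket.where(account_id: 42).select(:id, :subject, :status)

# Or skip model instantiation entirely when raw values are enough.
id_subject_pairs = Ticket.where(account_id: 42).pluck(:id, :subject)
```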
00:11:46.920 Earlier I showed you an example of an optimization we achieved using caching, which resulted in noteworthy improvements in response times.
00:12:05.080 In regards to caching, it's critical to remember that although using caches can enhance performance, it also adds another layer to our application architecture.
00:12:27.280 Therefore, if used incorrectly, the complexity of caching can introduce unforeseen issues into your application. I call this the 'Doraemon Paradox', after the widely popular character from the 22nd century.
00:12:44.640 Doraemon is characterized by his infinite pocket, where he can seemingly pull out any gadget to help solve problems. Some of us fall into that same mindset of thinking that we can pull out any solution without considering its implications.
00:13:02.420 In that light, while caching can be a powerful solution, it should not be approached all too casually. You need to carefully assess and choose the right moment to implement such solutions.
00:13:20.400 Now, I'd like to pivot to the final part of my talk: special interventions. In many contexts, generic solutions alone won't scale things effectively.
00:13:38.880 Often, we can adopt a more manual approach, seeking to optimize and achieve considerable performance gains for certain use cases without losing sight of quality.
00:13:55.800 One method we’ve implemented at Zendesk is splitting user accounts. When power users outgrow our original setups, we enable them to operate separate systems designed to handle their specific workloads more effectively.
00:14:14.160 Finally, I want to leave you with some key takeaways regarding performance tuning in your applications. In this realm, trade-offs are inevitable.
00:14:33.840 You must recognize that there’s no perfect system; rather, the optimal design will be the one that aligns best with the core needs of your business.
00:14:53.160 In our community, understanding of performance and trade-offs is critical, and I urge you to apply this knowledge to develop solutions that harmonize immediate user needs with system stability.
00:15:12.840 As we move forward, please consider the factors that impact the performance of your applications, always keeping trade-offs in mind.
00:15:30.000 Thank you for your time today! Let's continue to strive for performance that optimizes both system capabilities and user satisfaction.