A Rails Performance Guidebook: From 0 to 1B Requests/Day
Summarized using AI

by Cristian Planas

In the talk titled "A Rails Performance Guidebook: From 0 to 1B Requests/Day" presented at RubyConfTH 2022 in Bangkok, Cristian Planas discusses the scalability of Rails applications based on his experiences at Zendesk. He begins by addressing the common belief that Rails does not scale, presenting his theory that while scaling is a challenging endeavor, it feels harder due to the simplicity of Rails. Key points include:

  • Experience and Motivation: Cristian shares his journey from learning Rails in school to working as a CTO at a startup, illustrating the evolution of his understanding of scaling challenges. He emphasizes that despite the hurdles, Rails is capable of scaling significantly, as evidenced by Zendesk's handling of billions of requests.
  • Understanding Scaling: Cristian argues that the reality of scaling issues is often attributed to a lack of understanding rather than inherent limitations of Rails. The presentation aims to introduce various scaling strategies and offer a toolbox rather than silver bullets.
  • Impact of Large Data Sets: He highlights the rapid growth of data at Zendesk, stressing the importance of monitoring and proper data management strategies including sharding and database indexing to optimize performance.
  • Monitoring and Error Budgets: Monitoring is essential for identifying performance issues, and setting error budgets can help manage performance expectations effectively.
  • Database Performance: Cristian touches on common pitfalls like the N+1 problem and encourages the use of indexes to improve query performance. He warns against over-fetching data and emphasizes the importance of understanding p99 latencies that affect user experience.
  • Caching and Trade-offs: The complexities introduced by caching, including potential data inconsistencies and the importance of write-through caching for derived models, are discussed. Cristian also introduces the "Doraemon Paradox" to illustrate the potential misuse of technology.
  • Cold Storage: The practice of moving less frequently accessed data to cold storage is presented as a viable solution for managing large data sets, with an explanation of Zendesk’s archival process.
  • User Experience Focus: The talk reiterates that performance should prioritize user experience and that solutions should align with the users' needs. Strategies for improving workflows and managing data loading are mentioned, along with the necessity of defining limits to ensure stability.
  • Final Thoughts: Cristian concludes by reminding the audience that trade-offs are an inescapable reality in software development, and achieving balance between various aspects of application performance is key. He emphasizes that developer satisfaction is integral to a productive engineering environment and advocates for a fun approach to programming with Ruby.

This presentation provides a thorough overview of strategies and mindsets necessary for scaling Rails applications effectively while accommodating rapid growth in data and user demands.

00:00:17.240 My name is Cristian, I am from Barcelona. The first thing I wanted to say is that this is my first year doing public speaking, so I'm really sorry for the next 30 minutes. However, I felt I had a mission here. You all know it’s true; I mean, if I asked my grandma about Rails, she wouldn't know exactly what it is, but she does know it doesn’t scale. Go to any Reddit thread; it’s a fact.
00:00:35.280 Despite never having spoken in public before, I’ve been attending conferences for 10 years. So I decided to make a little summary, just like the great speakers do. You know, like a small summary of my whole talk in the first two slides. This is my first one; there’s a second one. I really mean it because I work at Zendesk.
00:01:00.600 My focus every day is on making the Rails model scale. The title of the presentation is a bit of a lie; we don't manage one billion requests per day; we manage over two. I don’t think we are doing anything incredibly special. For example, don’t take it from me; this is from a couple of weeks ago during Black Friday, when Shopify engineering commented on the incredible amount of data they were managing. My favorite piece of data is that they handled 76 million requests per minute. That’s a lot of data.
00:01:37.259 The thing is that Rails does scale; it’s happening right now. So the question is, why do people say that it doesn’t? I actually have a theory. But for me to explain my theory, we will need to do some time traveling. By the way, this is Doraemon; he is a 22nd-century robotic cat, and believe it or not, he will come back later in this presentation, I promise.
00:02:06.719 Now we’re going to time travel back to 2010. This is me in school, writing my papers. I was learning Rails then, and I wanted to add authentication. Now, authentication is a common feature, but it’s a hard problem, believe me. At the time, I had no idea, so I decided to conduct a deep investigation across various sources. Of course, I found this fantastic gem called Devise. I read the documentation, more like the first paragraph of it, and it just worked.
00:02:27.239 The point here is that authentication is indeed a hard problem, but for the 20-year-old me, it didn’t feel that way. It felt so easy, like ‘man, this computer science thing is very easy!’ Okay, for the next part, we need to go back in the time machine to 2012. This is a picture of me with the Rails merch, not the one I have here; I'm very proud of this one. At the time, I was co-founder of a startup, serving as the CTO, and like Nate, I was also the only engineer.
00:03:00.780 We were kind of popular, so I had to make the Rails app scale. Scaling is indeed a hard problem, so I decided to use the same method I had before and do a sort of investigation. Thankfully, I found a fantastic gem called ActiveScale. Now you might want to search it on GitHub. After reading the documentation on ActiveScale, my main point in this presentation is to use the parameter to scale sufficiently, but not too much; you don’t want to overdo it from the beginning.
00:04:21.600 Everything was beautiful, and nothing was hard. I think we recovered a lot of time for the coffee break, so I think we’re fine now. Okay, this was actually a joke; ActiveScale doesn’t exist, I made it up obviously. I hope that some of you were searching for it! My general point here is that scaling is a hard problem, and it’s a problem that feels hard. It’s not like authentication, where we in the Ruby community get a free lunch with Devise. I had absolutely no idea about encryption and got a state-of-the-art authentication system.
00:05:02.880 My final point, my conclusion, is that the problem is not that Rails doesn’t scale; the reality is that Rails is so good and so easy to use that, for many, the few problems left to solve are about scaling. Now, before we go further into the techniques that we use for scaling, I would like to give a few general comments on this presentation.
00:05:32.400 The purpose of this talk is to provide an introduction to the problem of scaling. We’re going to give general ideas illustrated with the issues we faced while scaling to a significant degree. An important point here is that there are no silver bullets that I can offer you. I can only provide you with a toolbox, and you can choose the tools that are right for you. As Tolstoy said, every poorly performing system performs poorly in its own way.
00:06:20.340 The reason I wanted to talk about this is because I think I’m a bit qualified. I see the amount of data that we manage at Zendesk. We are a customer support company that manages the same data units as Twitter; for us, it will be the ticket. We have over 16.3 billion tickets, and this number is actually a lie because I took this figure in January; so, now, it’s probably more like 19 or 20 billion.
00:06:44.640 Just to give you an idea, this graph represents parts of the population of the Earth. The most important thing to emphasize is that our data set is growing rapidly. Around a third of all data was created in the last 365 days, meaning something that isn’t a problem now is going to become one soon. Therefore, we need to work for the future.
00:07:03.840 This incredible growth obviously has a real impact on our physical memory usage. For example, we have over 400 terabytes of data in relational databases, and while we have various storage types, our relational databases mainly handle the tickets table, which is around three terabytes. But this is far from being our biggest table. We also have one table that is sharded and is 60 terabytes. This is actually my project; I am trying to tame this thing.
00:07:31.800 Finally, before going on to discuss the techniques, I want to share what I believe is the right mindset for a performance engineer. I’ve seen that performance work is often not considered fun; many engineers prefer to create new features, you know, new toys. I feel that performance work is fun because it has real results, like these. These are the results of a real optimization I did at Zendesk, and I think when you see something like this, you're motivated to succeed as a performance engineer.
00:08:11.520 Let’s discuss the different techniques. First of all, something incredibly basic: monitoring. You cannot fix a problem you don’t know exists. If you’re not monitoring, you won’t know you have a problem. Moreover, if you’re implementing a solution without measuring its performance, you are really not doing anything. So please, monitor everything.
00:08:45.600 There are different things you can monitor: metrics, APMs, logs. At a certain scale, you'll find that monitoring can get very expensive. I know this firsthand at Zendesk, especially when the Datadog bills arrive. But monitoring is crucial, and one thing you can do when monitoring performance work is sampling. Performance work lends itself very well to sampling because you want to capture the signal of the changes you can implement.
00:09:03.840 Typically, with a sample of 10%, 5%, or even sometimes 1%, you can gain enough insight into where you are. Error budgets are essential as well. I know what you’re thinking: error budgets are for uptime, not for latency. However, a valid argument is that if your endpoint is functioning but takes 10 seconds to respond, it’s as good as broken.
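The sampling idea above can be sketched in a few lines of plain Ruby. This is only an illustration under assumptions: `SampledTimer`, `sample_rate` and the in-memory `samples` array are hypothetical names, and a real setup would send the measurement to whatever metrics or APM client is in use rather than keep it in an array.

```ruby
# Records timings for only a fraction of calls. All names here are
# hypothetical; this is not the API of any real monitoring library.
class SampledTimer
  attr_reader :samples

  def initialize(sample_rate:, random: Random.new)
    @sample_rate = sample_rate # e.g. 0.1 for 10%, 0.01 for 1%
    @random = random
    @samples = []
  end

  # Always runs the block; measures it only for the sampled fraction of calls.
  def measure
    return yield unless @random.rand < @sample_rate

    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    @samples << Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    result
  end
end

timer = SampledTimer.new(sample_rate: 0.1)
1_000.times { timer.measure { :work } }
puts "measured #{timer.samples.size} of 1000 calls"
```

The unsampled path does no timing work at all, which is what keeps the cost of monitoring proportional to the sample rate.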
00:09:26.760 One thing you can do is set error budgets for latency, and we do this at Zendesk. This is a real budget; in the left column, you have uptime percentages, while in the middle and right columns, you have the percentage of time the API responded in an acceptable and a good amount of time. Acceptable and good are indeed arbitrary numbers that we set, but it’s still very useful to establish limits. You need to set a North Star to follow, to understand where you stand.
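A latency error budget of the kind described above can be computed from raw response times. The sketch below is a minimal illustration: the thresholds for "good" and "acceptable" and the target percentage are made-up numbers, not Zendesk's, and the method names are hypothetical.

```ruby
# Thresholds are illustrative assumptions, not values from the talk.
GOOD_MS = 300
ACCEPTABLE_MS = 1_000

# Returns the share of requests (0.0..100.0) that finished within threshold_ms.
def percent_within(durations_ms, threshold_ms)
  return 100.0 if durations_ms.empty?

  durations_ms.count { |d| d <= threshold_ms } * 100.0 / durations_ms.size
end

# Positive means budget is left; negative means the target was missed.
def budget_remaining(durations_ms, threshold_ms, target_percent)
  percent_within(durations_ms, threshold_ms) - target_percent
end

durations = [120, 250, 280, 900, 10_000]
puts percent_within(durations, GOOD_MS)              # share of "good" responses
puts percent_within(durations, ACCEPTABLE_MS)        # share of "acceptable" responses
puts budget_remaining(durations, ACCEPTABLE_MS, 99.0)
```

Tracking both thresholds separately mirrors the two columns in the budget described above: an endpoint can be fully "up" and still burn its latency budget.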
00:10:00.700 Now, one thing I always disappoint people with is database query performance, mainly because I think people expect me to talk about it more. This presentation was not designed to discuss the N+1 problem; if you have an N+1 issue, fix it. Here are a few obvious recommendations: use database indexes. A crucial finding from working with a large amount of data is that the query optimizer sometimes struggles with too much data and can have issues detecting the right index.
00:10:46.740 Sharding is another strategy that works for many companies: you divide your database into multiple databases, which grants you flexibility. I also like to emphasize the dangers of over-fetching, which ORMs such as ActiveRecord make easy. Fetching everything by default is dangerous. You might be fetching gigabytes of data when you only need a few bytes.
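As a rough picture of what dividing a database into multiple databases means for application code, here is a minimal routing sketch. Sharding by account id with a fixed modulo is an assumption made for illustration; the talk does not say which key or scheme Zendesk uses, and the shard names are invented.

```ruby
# Hypothetical shard names; in practice these would be database connections.
SHARDS = %w[shard_0 shard_1 shard_2 shard_3].freeze

# Every query for an account goes to the same shard, so a single account's
# data never has to be joined across databases.
def shard_for(account_id)
  SHARDS[account_id % SHARDS.size]
end

puts shard_for(5)
puts shard_for(8)
```

A fixed modulo is the simplest possible scheme, but adding a shard later changes where most accounts map. A lookup table from account to shard is a common alternative when accounts need to be moved individually.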
00:11:29.177 This happened in my company; we had a nice little endpoint that returned all objects of a particular table unpaginated. Initially, it was meant to return just IDs and names, so the response size seemed manageable—only a few megabytes at worst. However, one customer stored long text entries in the text areas. Thus, what seemed small quickly escalated to gigabytes of data being fetched from the database, which the database could not handle by itself, leading to memory issues.
00:12:35.579 The database was forced to spill from memory to disk mid-select, slowing down queries significantly and sometimes leading to database meltdowns. The optimization I mentioned earlier could be as simple as using `.pluck`. Finally, regarding databases, it’s crucial to note that complex queries can introduce high variability in performance.
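To make the size difference concrete without needing a database, the sketch below uses plain Ruby structs as stand-ins for rows. The `Record` struct, its `notes` column and the row counts are invented for illustration. With ActiveRecord, the slim version would correspond to something like `Model.pluck(:id, :name)`, which selects only the named columns instead of loading whole records.

```ruby
require "json"

# Stand-in for a database row; `notes` plays the role of the long text column.
Record = Struct.new(:id, :name, :notes, keyword_init: true)

# Serializes only the requested columns, like selecting specific columns in SQL.
def payload_for(rows, columns)
  rows.map { |row| columns.map { |column| row[column] } }.to_json
end

rows = Array.new(1_000) do |i|
  Record.new(id: i, name: "item-#{i}", notes: "x" * 10_000)
end

full = payload_for(rows, %i[id name notes]) # over-fetching: every column
slim = payload_for(rows, %i[id name])       # only what the endpoint returns

puts "full: #{full.bytesize} bytes"
puts "slim: #{slim.bytesize} bytes"
```

The endpoint's output is identical in both cases if it only ever exposes ids and names; the difference is entirely in how much data has to be read and held in memory.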
00:13:11.640 You may look at the mean response time for a query and think it’s fine, but you could also miss the potential high p99 latencies. P99s are critical to keeping an eye on for several reasons. First, they often contribute to database meltdowns, and second, they affect user experience—users don’t remember that a website responded in 150 milliseconds versus 160 milliseconds; they remember when they clicked and had to wait six seconds. P99s are invaluable.
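A small calculation shows how a healthy-looking mean can hide a painful p99. The latency numbers below are made up to echo the example above (most requests around 150 ms, a few at six seconds), and the percentile uses the simple nearest-rank definition.

```ruby
# Nearest-rank percentile: the smallest sample such that at least `pct`
# percent of all samples are less than or equal to it.
def percentile(values, pct)
  raise ArgumentError, "no samples" if values.empty?

  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

def mean(values)
  values.sum.to_f / values.size
end

# 985 fast requests and 15 slow ones (illustrative numbers).
latencies_ms = Array.new(985, 150) + Array.new(15, 6_000)

puts "mean: #{mean(latencies_ms)} ms"
puts "p50:  #{percentile(latencies_ms, 50)} ms"
puts "p99:  #{percentile(latencies_ms, 99)} ms"
```

With this data the mean stays in the low hundreds of milliseconds while the p99 lands on the six-second requests, which are the ones users remember.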
00:13:51.830 Next, let's get into caching. I don’t want to delve into caching in general, but I will comment on a few things that we do which differ slightly. One technique is write-through caching, where you update the cache as you update the database. On the surface, this seems like a bad idea since the goal of a cache is to provide faster access. However, in our implementation, this cache serves as a denormalized storage.
00:14:37.320 You have complex derived models that only appear in the API; they are the result of combining many tables and complex calculations. When anything that affects the response of that endpoint changes, you must pre-calculate that cache to ensure it reflects the latest data. This is one implementation we did for a high-load endpoint. It’s essential to consider how small changes can add up; a 40-millisecond improvement may not seem significant, but at scale, it matters.
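The write-through approach for a derived model can be sketched as follows. This is a simplified illustration, not Zendesk's implementation: `TicketSummaryStore` and its fields are hypothetical, and plain hashes stand in for both the database tables and the cache store.

```ruby
# Hashes stand in for the database and the cache; all names are hypothetical.
class TicketSummaryStore
  def initialize
    @tickets = {} # source of truth: id => { subject:, comments: [...] }
    @cache = {}   # derived, ready-to-serve API representation
  end

  # Every write updates the source of truth and then refreshes the cache.
  def create_ticket(id, subject)
    @tickets[id] = { subject: subject, comments: [] }
    refresh_cache(id)
  end

  def add_comment(id, body)
    @tickets.fetch(id)[:comments] << body
    refresh_cache(id)
  end

  # Reads never recompute: they return whatever the last write stored.
  def summary(id)
    @cache.fetch(id)
  end

  private

  # The derivation runs once per write instead of once per read.
  def refresh_cache(id)
    ticket = @tickets.fetch(id)
    @cache[id] = {
      id: id,
      subject: ticket[:subject],
      comment_count: ticket[:comments].size,
      last_comment: ticket[:comments].last
    }
  end
end

store = TicketSummaryStore.new
store.create_ticket(1, "Printer on fire")
store.add_comment(1, "Have you tried water?")
p store.summary(1)
```

The trade-off is visible in the code: every write path has to remember to call the refresh, and any write that forgets leaves the cached representation stale.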
00:15:04.560 One such endpoint, for example, is the tickets index, which is hit 50,000 times per day. A 40-millisecond saving there adds up to more than eight full days of computing over a year.
00:15:54.720 Now, let’s talk about trade-offs. Caching can introduce complexity; while previously all your data was in one place, now you might have duplicated data, which can lead to inconsistencies. These challenges are non-trivial, and caches are not magical solutions. In the Ruby community, we often think caching is some kind of magic storage. I often contemplate these issues, which led me to create what I call the "Doraemon Paradox." You see? I told you Doraemon would come back!
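One reading that is consistent with these figures, assuming the 40-millisecond improvement applies to every one of the 50,000 daily requests and that the total is counted over one year (the year is an assumption, not stated in the talk):

```ruby
SECONDS_PER_DAY = 86_400

# Total compute time saved per year, expressed in days.
def saved_days_per_year(saved_ms_per_request, requests_per_day)
  saved_seconds_per_day = saved_ms_per_request * requests_per_day / 1_000.0
  saved_seconds_per_day * 365 / SECONDS_PER_DAY
end

puts saved_days_per_year(40, 50_000).round(2)
```

That is 2,000 seconds saved per day, or roughly 8.4 days over a year.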
00:16:43.760 For those who don’t know, Doraemon is a robotic cat from the 22nd century who has millions of futuristic gadgets in his 4D pocket. He takes out whatever he needs, which seems illogical. This idea parallels having storage where everything fits without order. I believe we use tools like Redis in this way too.
00:17:45.660 Beyond that, I discovered that Doraemon is an incredible fable for software engineering. Each episode follows the same structure. Doraemon lives with Nobita, a clumsy ten-year-old who often gets into trouble. In each episode, he faces a problem and asks for one of Doraemon's gadgets to solve it, but as he overuses them, it leads to bigger problems than he initially faced.
00:18:25.360 The moral here is that technologies, particularly seemingly simple ones like Redis and Rails, can tempt us to misuse them. We should strive to use technology wisely; I learned this the hard way, especially when I was 22, thinking I was a master of authentication without understanding encryption.
00:19:04.860 Another concept I love is cold storage, which is like the anti-cache. Caches store frequently accessed data, while cold storage is for rarely accessed data. You might wonder why one would want additional storage for infrequently needed data, but as our data set grows, we run into problems.
00:19:50.040 Database optimizations can only go so far, but there's a solution. What if I told you that we could shrink our tables by sending data five years into the past, back when the queries were faster because we had only 200 terabytes of data? Let’s talk about how we do that.
00:20:34.525 At Zendesk, we decided that when a ticket is closed, meaning our interaction with the user is finished, we wait for four months (120 days) and then mark it as ready for archival. We then remove it from the relational database and move it to a specialized archive storage. We have used a variety of technologies like Dynamo and Riak, and I really want to use S3, but I haven’t yet convinced anyone.
00:21:34.380 You may think that this could create a huge mess by breaking various associations. To address this, we create a new object in our system, called the archive ticket, so the associations remain valid. Most of our data is archived, now over 90%. Without cold storage, we wouldn’t have three terabytes in the tickets table; we would have nearly 25 terabytes.
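The archival flow described above can be sketched with in-memory stand-ins. Everything named here is hypothetical: hashes play the role of the relational database and the archive store, and the small stub record only illustrates the idea of keeping a lightweight object behind so references to an archived ticket still resolve. The 120-day delay is the one figure taken from the talk.

```ruby
require "date"

ARCHIVE_AFTER_DAYS = 120

# Hashes stand in for the real stores; class and method names are invented.
class TicketArchiver
  attr_reader :hot, :cold, :stubs

  def initialize
    @hot = {}   # id => full ticket row (relational database)
    @cold = {}  # id => full ticket document (archive storage)
    @stubs = {} # id => small record kept so lookups by id still work
  end

  def add(id, closed_on:, body:)
    @hot[id] = { id: id, closed_on: closed_on, body: body }
  end

  # Moves tickets closed at least ARCHIVE_AFTER_DAYS ago out of the hot store.
  # Returns the ids that were archived.
  def archive!(today)
    ready = @hot.select do |_, ticket|
      ticket[:closed_on] && (today - ticket[:closed_on]) >= ARCHIVE_AFTER_DAYS
    end
    ready.each do |id, ticket|
      @cold[id] = ticket
      @stubs[id] = { id: id, archived: true }
      @hot.delete(id)
    end
    ready.keys
  end

  # Looks in the hot store first and falls back to the archive via the stub.
  def find(id)
    @hot[id] || (@stubs.key?(id) ? @cold.fetch(id) : nil)
  end
end

archiver = TicketArchiver.new
archiver.add(1, closed_on: Date.new(2022, 1, 1), body: "old ticket")
archiver.add(2, closed_on: nil, body: "still open")
p archiver.archive!(Date.new(2022, 12, 1))
p archiver.find(1)
```

Callers use `find` without knowing where the ticket lives, which is what keeps the rest of the application unaware of the move.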
00:22:14.840 However, cold storage has significant trade-offs. Now, data isn’t all in one place; you will have data in different locations, which may introduce issues. For instance, ordering all tickets can be a challenge if they're in different databases. Of course, making a sorting query across two databases is possible but not trivial. Engineers tend to strictly follow project specifications, but performance features might outweigh strict adherence to rules.
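One way to picture the ordering problem is a "top N across two stores" query. The sketch below is an assumption about how such a merge could look, not a description of Zendesk's system: each store is asked only for its own newest `limit` tickets, and the two short lists are merged in the application. Arrays of hashes stand in for the two databases.

```ruby
# Returns the `limit` newest tickets across both stores, newest first.
# The global top N is always contained in the union of each store's top N.
def newest_tickets(hot, cold, limit)
  top = ->(tickets) { tickets.max_by(limit) { |t| t[:created_at] } }
  (top.call(hot) + top.call(cold)).max_by(limit) { |t| t[:created_at] }
end

hot  = [{ id: 1, created_at: 5 }, { id: 2, created_at: 9 }]
cold = [{ id: 3, created_at: 7 }, { id: 4, created_at: 1 }]

p newest_tickets(hot, cold, 3).map { |t| t[:id] }
```

The first page is cheap this way. Deeper pages are where it stops being trivial, because each store has to return progressively more rows for the merge to stay correct.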
00:23:03.560 Ultimately, results are worth it. Designing a product around performance often involves understanding that at a certain scale, you really don’t know what might happen. Once, I built a feature that I couldn’t roll out due to excessive exceptions being thrown. We discovered a user was pasting the whole U.S. Constitution, including amendments, into a text area, making it impossible to process.
00:23:37.320 That extreme example points to how a valid customer can use a product in unconventional ways. It’s worth noting that performance limits aren’t attractive; customers don’t enjoy being told API access is limited to 1,000 requests per second. However, not imposing these limits leads to undesired consequences. Your application runs on hardware, which has physical limits. Without defined performance limits, you will face issues.
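A defined limit like the one mentioned above can be enforced with very little code. The following is a minimal fixed-window sketch with hypothetical names. A production limiter would normally keep its counters in a shared store and expire old windows, neither of which is shown here.

```ruby
# Fixed-window limiter: at most `limit` requests per client in each window.
# The current time is passed in so the example needs no real waiting.
class RateLimiter
  def initialize(limit:, window_seconds:)
    @limit = limit
    @window_seconds = window_seconds
    @counts = Hash.new(0) # [client_id, window number] => requests seen
  end

  # Returns true if the request is allowed, false if the client is over the limit.
  def allow?(client_id, now = Time.now.to_i)
    key = [client_id, now / @window_seconds]
    return false if @counts[key] >= @limit

    @counts[key] += 1
    true
  end
end

limiter = RateLimiter.new(limit: 2, window_seconds: 1)
p limiter.allow?("customer-a", 100)
p limiter.allow?("customer-a", 100)
p limiter.allow?("customer-a", 100)
```

Counting per client means one customer with an unusual usage pattern exhausts only their own allowance rather than the capacity shared by everyone.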
00:24:18.840 Next, I'd like to discuss what I call the "increasing payload" phenomenon. Many who’ve worked in the same company for years have seen this: what was once a lean, efficient endpoint becomes overloaded. Typical story: five years ago, an endpoint returned four attributes per object; now it returns 40, and you need a browser extension just to read the response.
00:25:05.040 This situation indicates rising complexity that results in slower performance because you are just sending more data. A straightforward solution is simply to send less data; limit your payload to the information you truly need. You can utilize techniques like lazy loading or GraphQL to help achieve this.
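One simple way to send less data is to let the client name the attributes it wants and default to a small set otherwise. The sketch below is illustrative only: the attribute names, the comma-separated `fields` parameter and the default set are assumptions, not part of any API mentioned in the talk.

```ruby
# A full ticket representation with made-up attributes.
FULL_TICKET = {
  id: 1,
  subject: "Printer on fire",
  status: "open",
  description: "Smoke is coming out of tray 2",
  requester_id: 7,
  tags: %w[hardware urgent]
}.freeze

DEFAULT_FIELDS = %i[id subject status].freeze

# Returns only the requested attributes, falling back to a small default set.
# Unknown field names are ignored rather than raising.
def render_ticket(ticket, fields_param = nil)
  fields = fields_param ? fields_param.split(",").map { |f| f.strip.to_sym } : DEFAULT_FIELDS
  ticket.slice(*fields)
end

p render_ticket(FULL_TICKET)
p render_ticket(FULL_TICKET, "id, tags")
```

New attributes can then be added to the full representation without growing the default response that existing clients receive.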
00:25:39.960 When it comes to slow workflows, you can split a slow call into smaller ones. Faced with an endpoint that takes three seconds to respond, splitting it into two calls to get the same data might seem counterproductive. However, implementing a very fast first query followed by a secondary query for additional data can significantly enhance user experience.
00:26:21.600 User experience is king. We create applications for users, not for numbers. The performance of an endpoint should prioritize user experience. Finally, you may face slow workflows where you’re inadvertently causing excessive inserts and updates to create one object.
00:27:04.380 Consider doing it asynchronously: when you receive a create request, just create the primary object and defer additional operations. This can evolve into a distributed system. Eventually, managing consistent states can be challenging, but it's rewarding; performance is worth the trade-offs involved.
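The asynchronous approach can be sketched like this. All names are hypothetical, and an in-memory array stands in for a background job queue. The point is only the shape: the request path does one insert and enqueues the rest, and a worker applies the deferred operations later.

```ruby
# An Array stands in for a background job queue; names are invented.
class TicketCreator
  attr_reader :tickets, :jobs

  def initialize
    @tickets = {}
    @jobs = []
  end

  # Request path: one insert, enqueue everything else, return immediately.
  def create(id, subject)
    @tickets[id] = { subject: subject, indexed: false, notified: false }
    @jobs << [:index_for_search, id] << [:notify_agents, id]
    @tickets[id]
  end

  # Worker path: runs later, outside the request.
  def drain_jobs
    until @jobs.empty?
      job, id = @jobs.shift
      case job
      when :index_for_search then @tickets[id][:indexed] = true
      when :notify_agents then @tickets[id][:notified] = true
      end
    end
  end
end

creator = TicketCreator.new
creator.create(1, "Printer on fire")
p creator.jobs.size
creator.drain_jobs
p creator.tickets[1]
```

Between `create` and `drain_jobs` the ticket exists but is not yet indexed or announced. That window is the consistency cost the paragraph above refers to.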
00:27:54.360 As a closing thought, I want to remind you that not everything scales. Despite our best efforts to be scientific, the reality is that a small number of customers can significantly impact performance. We know of companies that have caused specific incidents due to their unique usage patterns.
00:28:30.239 You may need to devote special attention to such clients. Think of Captain Ahab from Moby Dick; he’s not a typical fisherman, but rather someone dedicated to capturing the elusive white whale. It’s important to recognize that special interventions may be necessary depending on your business.
00:29:09.300 At Zendesk, we sometimes split accounts for high-activity customers. This makes resource distribution much better. In conclusion, I’d like to offer a final note on performance: it is all about trade-offs. There is no perfect system you can develop without compromise. You won’t have the simple, completely normalized database application you dreamed of as a startup; applications are complex and need to fit the specifics of the business.
00:30:00.159 An excessive pursuit of perfection can be counterproductive. You need to understand your priorities; performance is indeed a priority. I believe everyone in this room has an advantage because in the Ruby community, we understand trade-offs well. We're a community that prioritizes developer satisfaction, which, while a bit radical, is a beneficial focus; developer productivity remains of utmost importance.