A Rails performance guidebook: from 0 to 1B requests/day

by Cristian Planas

In his presentation at the wroc_love.rb 2023 conference, Cristian Planas discusses the challenges and techniques associated with scaling Ruby on Rails applications, especially in high-demand environments. Emphasizing the theme that 'Scaling is hard,' Planas reflects on his extensive experience in the tech community and his current role at Zendesk, where he deals with performance issues. He begins by highlighting the misconception that Rails does not scale, citing successful examples from companies like Shopify and GitLab.

Throughout the presentation, Planas covers several key points regarding performance optimization:

Monitoring and Error Budgets: Planas stresses the importance of identifying and monitoring performance issues using error budgets, which help define acceptable latency for key endpoints.
Database Optimization: He discusses the need for proper database indexing, the implications of query complexity, and the importance of sharding to improve performance in large datasets.
Caching Strategies: Planas explains how caching can significantly enhance application performance, advocating for techniques like write-through caching and asynchronous processing to reduce response times.
Data Management: He shares his approach to data regression, where unnecessary data from closed tickets is moved to cold storage, thus reducing load on primary databases.
Product Limits: Highlighting the necessity of setting limits on database queries and user operations to prevent abuse and system overload, he suggests implementing timeout limits.
Handling Large Payloads: Planas warns against feature creep in API responses, advocating for the delivery of only essential data while managing optional or aggregate data separately.

As he wraps up the presentation, Planas emphasizes the mindset required for engineers tackling performance issues, invoking the story of Alexander the Great and the Gordian Knot to illustrate the need for thinking outside conventional constraints. He advocates for a community spirit within the Ruby ecosystem that balances developer happiness with performance optimization. The talk concludes with an invitation for questions, showcasing an interactive engagement with the audience to further explore these complex topics in performance engineering.

00:00:05.220 Hey, my name is Cristian Planas, and one thing I wanted to say before starting is that this is my first presentation. I've done it a couple of times before, but it's still my first time presenting, so I'm really sorry for what is about to happen.

00:00:19.260 Even though it's my first time presenting, I've been going to conferences for 12 years now. I've seen all the big ones; I've seen Matz, I've seen DHH, and I've seen Aaron Patterson, who is my favorite. During this time, I've learned a few techniques, and one thing I always appreciate is when the presenter introduces the whole concept in the early slides.

00:00:37.079 So, this is my first slide. This is a very well-known claim on Twitter. I think these days it's the first thing they tell you in computer science classes: 'Scale is hard.' Who agrees with this universal truth? Okay, no one. You destroyed my joke!

00:00:56.760 This next slide is about real scale, and I emphasize 'real scale' because there are companies using Rails at scale. I work at Zendesk, and if you see the title, it says 1 billion requests per day. That's still not true; it's actually 2 billion triggers per day. Yes, I said 1 billion because it sounds nicer.

00:01:14.939 On this slide, you have data from last year showing Rails at scale at Shopify. Shopify returns 4.3 petabytes in one day. As you might think, the next slide is even better; they serve 76 million requests per minute, which is about 1.27 million requests per second. That's scaling!

00:01:39.240 Before we dive into the topic, I'd like to introduce myself a little more. I'm originally from Barcelona and work at Zendesk, where I've been for almost 10 years. I currently focus on performance issues, though my work covers a bit of everything. Performance is my passion. I just came back to Europe after spending seven years in the US, where I had the opportunity to present an earlier iteration of this talk at RailsConf. However, this is a special occasion for me as it's the first time I speak in Europe, and I am very excited.

00:02:16.680 Moreover, I'm very happy that this happens in Poland. My first Ruby conference ever took place here in Poland; in fact, in 2011, I attended a very peculiar conference called RuPy. It was a mix of Ruby and Python, essentially a dynamic languages festival. There was one talk that I think of very often. I don't know if you know who this guy is.

00:02:43.019 Does anyone know? Yes, it was José Valim. At that time, back in 2011, he was quite famous as a Rubyist. However, his presentation was not about Ruby; it was about designing a new language. This was fascinating to me because in computer science, we have a subject called compilers, where you learn how to design a language. Eventually, that same year, he released Elixir, which could be considered one of the most important languages of the last 15 years, at least in the conversation.

00:03:07.560 I'm pretty sure that this talk will not be as memorable, unfortunately, but I hope you will take something away from it because this is an important topic. Why am I talking about this? Well, I mentioned that some companies are using Rails at scale, but over and over again, this topic resurfaces.

00:03:52.920 In the last few weeks, programming Twitter has been discussing this again. I don't know if you've been following the latest drama in the Ruby community. I find this a bit funny; in the Ruby community, we have these yin and yang celebrities. We have Matz, the creator of Ruby, who is the nicest, sweetest person ever, and then we have DHH, who writes things like, 'Well, I believe Basecamp removed TypeScript, and I will let the rest of us enjoy JavaScript in all its glory, free of strong typing.' Well, what happens next is that you get a flame war.

00:04:20.840 People start attacking each other, trolling, and it escalates so much that someone decided to make a pull request that invisibly put back TypeScript. Obviously, since DHH is the creator of Rails, people had to attack Rails. He made this post listing some companies that use or have used Rails at some point, which is true. Amidst all this trolling and flaming, there were some debates comparing companies. For example, one might say, 'Hey, Shopify is scaling; why couldn't Twitter?' There's an excellent point there. While I think the use cases differ vastly, it is a good point. Shouldn't Rails allow itself to scale more than Twitter?

00:05:07.860 Interestingly, there's also the point that Rails allows startups to pivot quickly, which is incredible. Even as a Ruby person, I can’t say I would write a microservice in Ruby, except in special cases. Again, it’s an interesting discussion; my personal take on this would lean towards saying that monolithic architectures can have significant drawbacks when operating at scale.

00:05:53.640 However, even some of the leading companies would debate this. For example, GitLab wrote an article a year ago about why they are sticking with Ruby on Rails. Their point, surprisingly, was that having a monolithic structure in GitLab has been extremely beneficial for them. This is an interesting discussion, and I would appreciate it if people would engage in these discussions in a civilized manner without being absolute.

00:06:40.380 For me, I find it childish when someone claims that a particular technology is bad. Yet I believe the maximalist view that Rails does not scale is just not true, because, again, some companies are successfully using it at scale.

00:07:09.240 Arriving at this point, I often wonder why some people keep saying 'Rails doesn't scale.' I found this tweet during that flame war stating that 'every Ruby community has some deep-seated insecurity they never seem to overcome.' It’s true that Ruby is not incredibly performant, but there are other languages that are perhaps a bit faster than Ruby, and yet they are in the same ballpark, in my opinion.

00:07:22.259 These conversations seem never-ending. So, why is this happening? I have a theory, but for this, allow me to take you on a brief time travel. By the way, this is Doraemon, a robotic cat, and he will return later in the presentation. Let's travel to the year 2010.

00:07:54.180 This is me in college, writing my farewell letter, and I wanted to have authentication. Security is a complicated problem, so I had no idea what I was doing. I spoke with experts in the field, and I found this excellent gem called Devise. I simply read the documentation, and it worked perfectly. Authentication is a hard problem, and I have an interesting anecdote.

00:08:18.540 My next point takes us to 2012, when I was CTO and the only engineer at a startup that was gaining popularity. We had a lot of traffic, and I had to figure out how to scale. Scaling is a difficult issue, so I did extensive research and discovered a fantastic gem called ActiveScale. I configured it properly and things seemed beautiful and effortless. However, none of that is true: I just modified the title of ActiveRecord using Chrome tools. The point here is that scaling is a difficult problem. It often feels hard because there is no free lunch: you cannot just download a gem and expect everything to be fine. This is especially true for junior engineers, and I believe this is one reason why Rails is so popular.

00:09:40.500 Sometimes, the challenge of scaling may feel impossible. But before we dive into techniques, I want to discuss some general concerns. The purpose of this talk is to introduce the problem of scaling, and we will discuss some general ideas illustrated with issues we face when scaling significantly. It's crucial to understand that there are no silver bullets. Every system is performing poorly in its unique way.

00:10:07.920 Occasionally, someone tells me, 'This application doesn't scale; how do I fix it?' Well, I often don't know. The problems typically stem from database issues, such as excessive queries, but the reasons can differ significantly.

00:10:32.640 Now, let's discuss the context of Zendesk. We work as a customer support platform and currently manage over 20 billion tickets in our system. That means if you distributed these tickets, it would be about 2.5 tickets per person globally. Moreover, we are experiencing growth—approximately 32% of the dataset is new in the last year, which has obvious consequences for our systems.

00:11:01.620 Our relational database holds over 400 terabytes, which is immense. In some shards, we have trillions of objects, making our data structure quite complex. However, before we discuss techniques, I want to highlight the fun side of these kinds of problems.

00:11:39.420 I believe that many engineers feel performance work is a drag because they would like to build features instead. However, I propose the right mindset should be that performance work is, in fact, optimizing real solutions. For instance, we utilized caching to improve performance significantly, reducing certain endpoint response times from one second to under 100 milliseconds.

00:12:11.460 The first thing I want to emphasize is monitoring and error budgets. This is perhaps the most important thing you can do. Why? Because you can't fix problems that you don't know exist. At Zendesk, we utilize various monitoring levels, most obviously using APMs for tracing requests, but monitoring can become expensive at scale.

00:12:57.420 Often, I notice people mix metrics and logs as if they are the same. However, logs are highly beneficial as they help trace specific events in a timeline. When it comes to performance, we can take samples, perhaps 10% of the data for certain requirements, but that suffices to calculate our performance metrics adequately.

00:13:34.560 The evolution of error monitoring is what we term 'error budgets.' For instance, we can set an error budget for latency on a specific endpoint. We define several key endpoints in Zendesk, such as ticket show, create, and update. For these, we specify acceptable latency, with the ticket show endpoint being slower.

00:14:01.620 We arbitrarily define acceptable times; for example, the ticket show endpoint needs to be within an acceptable latency 99.95% of the time. Even if the exact numbers are somewhat arbitrary, having a North Star is essential to prevent any endpoint from degrading continuously without notice.
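A minimal sketch of how such a latency error budget could be checked. The 99.95% target is the figure mentioned in the talk; the class name, the 500 ms threshold, and the sample data are assumptions made for illustration, not Zendesk's implementation.

```ruby
# Hypothetical sketch of a latency error budget for one endpoint.
# The 99.95% target comes from the talk; the 500 ms threshold is an assumption.
class LatencyErrorBudget
  def initialize(threshold_ms:, target_ratio:)
    @threshold_ms = threshold_ms
    @target_ratio = target_ratio # a Rational, to avoid float rounding
  end

  # Fraction of requests that finished within the threshold.
  def compliance(latencies_ms)
    return 1.0 if latencies_ms.empty?

    latencies_ms.count { |ms| ms <= @threshold_ms }.fdiv(latencies_ms.size)
  end

  # Number of slow requests the budget tolerates for this many requests.
  def allowed_slow_requests(total_requests)
    (total_requests * (1 - @target_ratio)).floor
  end

  # True while the slow requests seen so far fit inside the budget.
  def within_budget?(latencies_ms)
    slow = latencies_ms.count { |ms| ms > @threshold_ms }
    slow <= allowed_slow_requests(latencies_ms.size)
  end
end

budget = LatencyErrorBudget.new(threshold_ms: 500, target_ratio: Rational(9995, 10_000))

# 10,000 sampled requests to the ticket show endpoint, 4 of them slow.
latencies = Array.new(9_996, 120) + Array.new(4, 900)
puts budget.allowed_slow_requests(latencies.size)
puts budget.within_budget?(latencies)
```

The target is passed as a `Rational` so that the allowed number of slow requests is computed exactly rather than being subject to floating-point rounding.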

00:14:40.920 Moreover, the final evolution in this regard is to set error budgets for specific parts of your logic. Sometimes, you'll find that you modularize processes across multiple endpoints; setting specific budgets for these can be a good approach. Now we will address the part that everyone considers to be the star of performance improvement: the database. I should mention, however, this talk's original title was 'Beyond N+1' as there are plenty of resources available on that already.

00:15:38.160 I'll discuss some obvious points like 'use database indexes.' This is something obvious, but you’d be surprised—particularly in large companies—how the responsibility of indexing a database often falls somewhere between the DBAs and engineers. This can also mean that if you're working with basic tables, don't assume your database engine will find the right index for you.

00:16:33.540 Moreover, another issue to consider is that in very large tables, the index may become ineffective. Sometimes we address tables with trillions of objects; these tables comprise extremely small objects, and the indexes may turn out to be larger than the data itself, which is something to keep in mind.

00:17:23.960 Sharding is an architecture to consider, dividing your data into independent databases. For instance, in Shopify, you operate by safeguarding customer data—each customer’s account should not access data from another account. However, what often happens is that one central database may be shared across multiple hosts.
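As a rough illustration of account-based sharding, the sketch below routes every query for an account to a single shard. The `ShardRouter` class, the shard names, and the modulo fallback are assumptions for illustration; the talk does not describe how any company assigns accounts to shards.

```ruby
# Hypothetical sketch: route every query for an account to one shard.
class ShardRouter
  def initialize(shard_names)
    @shard_names = shard_names
    @assignments = {} # account_id => shard name (an explicit directory)
  end

  # Pin an account to a shard, e.g. to isolate a very large customer.
  def assign(account_id, shard_name)
    raise ArgumentError, "unknown shard #{shard_name}" unless @shard_names.include?(shard_name)

    @assignments[account_id] = shard_name
  end

  # An explicit assignment wins; otherwise fall back to a stable modulo rule.
  def shard_for(account_id)
    @assignments.fetch(account_id) { @shard_names[account_id % @shard_names.size] }
  end
end

router = ShardRouter.new(%w[shard_0 shard_1 shard_2])
puts router.shard_for(7)    # falls back to the modulo rule
router.assign(7, "shard_2") # move one account explicitly
puts router.shard_for(7)
```

Keeping an explicit directory next to the default rule is one way to move a single heavy account without reshuffling everyone else. In a Rails application, the result of such a lookup would typically feed the framework's own shard-switching mechanism.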

00:18:19.260 Within Rails, when you write a query, you might ask for users with IDs greater than 42, but you end up fetching all columns even when you need only a few. This becomes problematic if, say, you have large blobs of data while needing only minimal fields. For example, this is something we've seen at Zendesk where simplifying our queries to return only crucial data, like IDs and titles, ended up improving performance significantly.
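The effect of fetching only the needed columns can be sketched in plain Ruby, without a database. The rows, the `body_blob` column, and the sizes below are made up for illustration; the comment shows the corresponding ActiveRecord call for a hypothetical `Ticket` model.

```ruby
require "json"

# Hypothetical rows: each has a large blob column next to small fields.
rows = (43..52).map do |id|
  { id: id, title: "Ticket #{id}", body_blob: "x" * 50_000 }
end

# Fetching every column (the SELECT * equivalent).
full_bytes = rows.to_json.bytesize

# Fetching only what the caller needs. With ActiveRecord, the equivalent for a
# hypothetical Ticket model is Ticket.where("id > ?", 42).select(:id, :title).
slim_rows = rows.map { |row| row.slice(:id, :title) }
slim_bytes = slim_rows.to_json.bytesize

puts "full: #{full_bytes} bytes, slim: #{slim_bytes} bytes"
```

The same number of rows is returned in both cases; only the amount of data read, transferred, and instantiated per row changes.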

00:19:08.000 Furthermore, it's important to remember that database query complexity can lead to high variability in response times, meaning that evaluating query performance via median or p50 isn’t sufficient; instead, you should measure p99 to be aware of potential issues that could lead to database meltdowns.
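To see why the median hides such problems, here is a small sketch that computes nearest-rank percentiles over a made-up sample of query timings. The helper name and the numbers are assumptions for illustration.

```ruby
# Nearest-rank percentile over a list of response times (milliseconds).
# pct is an Integer between 1 and 100.
def percentile(values, pct)
  raise ArgumentError, "no values" if values.empty?

  sorted = values.sort
  rank = (Rational(pct, 100) * sorted.size).ceil # 1-based rank
  sorted[[rank, 1].max - 1]
end

# Hypothetical sample: 98 fast queries and 2 pathological ones.
timings_ms = Array.new(98, 20) + [4_000, 9_000]

p50 = percentile(timings_ms, 50)
p99 = percentile(timings_ms, 99)
puts "p50=#{p50}ms p99=#{p99}ms"
```

In this sample the median looks perfectly healthy, while the p99 exposes the multi-second queries that could put the database under pressure.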

00:20:07.740 Then we also have caching. Caching entails storing data in a faster memory store to facilitate faster access. At Zendesk, we find write-through caching particularly effective, especially when normalizing objects. For instance, if you have a side load that involves computing the sum of several objects frequently, avoid recalculating constantly; doing it at write time can lead to significant performance improvement.
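A minimal sketch of the write-time computation described above: the sum is updated whenever an object is written, so reads never recompute it. The `TicketTimeStore` class and its in-memory hashes are hypothetical stand-ins for a database and a cache.

```ruby
# Hypothetical sketch: keep a derived sum up to date at write time.
class TicketTimeStore
  def initialize
    # Stand-in for the database: ticket_id => list of logged minutes.
    @entries = Hash.new { |hash, ticket_id| hash[ticket_id] = [] }
    # Stand-in for the cache: ticket_id => precomputed total.
    @totals = Hash.new(0)
  end

  # Write-through: the write path updates the primary data and the cached sum together.
  def add_time_entry(ticket_id, minutes)
    @entries[ticket_id] << minutes
    @totals[ticket_id] += minutes
  end

  # Read path: a constant-time lookup instead of summing rows on every request.
  def total_minutes(ticket_id)
    @totals[ticket_id]
  end

  # Slow path, kept to verify or rebuild the cached value.
  def recompute_total(ticket_id)
    @entries[ticket_id].sum
  end
end

store = TicketTimeStore.new
store.add_time_entry(1, 15)
store.add_time_entry(1, 30)
puts store.total_minutes(1)
```

Keeping a slow recomputation path around makes it possible to detect and repair drift between the cached value and the underlying data.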

00:20:47.040 Alternatively, I came across a scenario at Zendesk where we had huge computations happening on a side load, leading us to change our approach to calculate these asynchronously. Initially, this action took around 40 milliseconds, but after these updates, we reduced the evaluation time to under 10 milliseconds, which at scale can yield significant impacts.

00:21:35.580 As for trade-offs with caching, it brings inherent complexity: the cache is not the source of truth, so reads can return stale or inconsistent data. Since caching isn't trivial, it's not just about layering another memory section and hoping for the best.

00:22:17.220 This brings us back to Doraemon. Many people are aware of this character in Spain. For those unfamiliar, he's a robotic cat from a manga and anime series. Each episode follows a certain structure: the character Nobita, a clumsy ten-year-old, encounters a problem and seeks help from Doraemon, who magically resolves it. However, this solution only lasts a short time before Nobita finds a way to create complications, ultimately leading to the same outcome: Nobita crying. This cycle reminds me of how we interact with technology.

00:23:38.640 Many developers think, 'Oh, caching, I can store everything here!' But the underlying message is to use technology wisely. It’s common to have an aversion to using ORM systems completely, but developer satisfaction is important. Having tools that simplify query writing can considerably improve happiness. However, one must understand how each query interacts with the database.

00:24:33.720 Additionally, you must manage the life of your data carefully. You might wonder why you should care about data after writing it, but if you’re managing billions of records, this becomes crucial. Issues can arise when data accumulates, and changes can become problematic.

00:25:17.700 I’d like to highlight one technique called data regression. For example, once an interaction is concluded, it may be unnecessary to retain all data generated during this process, particularly in its original format.

00:26:06.960 A solution could involve sending this data to 'cold storage' for later access. For example, at Zendesk, we archive tickets after their activity for five months. When a ticket gets closed, we archive it and move it from our relational databases to a key-value store.

00:26:55.500 This archival process gives us flexibility and helps us scale. Most archived tickets now exist in DynamoDB instead of our traditional database. We can later refer to these archives when needed, restoring the functionality of tickets as required. This decision has proven essential in managing our growth.
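The archival flow can be sketched as follows, with plain hashes standing in for the relational database and the key-value store. The `TicketArchiver` class, the 150-day idle window, and the restore-on-read behaviour are assumptions for illustration, not a description of Zendesk's system.

```ruby
# Hypothetical sketch: move closed, idle tickets out of the primary store.
Ticket = Struct.new(:id, :status, :updated_at, :subject, keyword_init: true)

class TicketArchiver
  def initialize(primary:, cold_storage:, max_idle_seconds:)
    @primary = primary           # Hash standing in for the relational database
    @cold_storage = cold_storage # Hash standing in for a key-value store
    @max_idle_seconds = max_idle_seconds
  end

  # Archives closed tickets that have been idle long enough; returns their ids.
  def archive_stale!(now: Time.now)
    stale = @primary.values.select do |ticket|
      ticket.status == :closed && (now - ticket.updated_at) >= @max_idle_seconds
    end
    stale.each do |ticket|
      @cold_storage[ticket.id] = ticket.to_h # write the archive copy first
      @primary.delete(ticket.id)             # then drop it from the primary store
    end
    stale.map(&:id)
  end

  # Reads go to the primary store and fall back to the archive.
  def find(id)
    return @primary[id] if @primary.key?(id)

    @cold_storage.key?(id) ? Ticket.new(**@cold_storage[id]) : nil
  end
end

now = Time.at(1_700_000_000)
day = 24 * 60 * 60
primary = {
  1 => Ticket.new(id: 1, status: :closed, updated_at: now - 200 * day, subject: "Old"),
  2 => Ticket.new(id: 2, status: :closed, updated_at: now - 10 * day, subject: "Recent"),
  3 => Ticket.new(id: 3, status: :open, updated_at: now - 300 * day, subject: "Still open")
}
cold = {}
archiver = TicketArchiver.new(primary: primary, cold_storage: cold, max_idle_seconds: 150 * day)
archived_ids = archiver.archive_stale!(now: now)
puts archived_ids.inspect
```

Writing the archive copy before deleting from the primary store means a failure in between leaves a duplicate rather than a lost ticket.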

00:28:21.700 Now, I want to focus on the principle of designing a product with performance in mind. A key concept here is the importance of setting product limits. When thinking about their system, developers often believe they understand all potential use cases; however, this is rarely true.

00:29:22.740 For example, various forms of abuse can affect your platform, and there are always customers who will find unexpected ways to stress your system. If, for instance, you’re allowing customers to run complex queries, you need limits in place to prevent them from overwhelming your system.

00:30:03.240 Let me provide some examples of 'product limits.' For instance, consider setting a limit on long-running database queries. If a query is taking over a minute, you should kill it deliberately to avoid fallout. Similarly, in our system at Zendesk, we apply limits on triggers to ensure that operations don't spiral out of control.
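Both kinds of limit can be sketched in a few lines. The class and method names, the per-account cap, and the injectable clock are assumptions for illustration; in production, a query time limit is usually enforced by the database's own statement timeout rather than in application code.

```ruby
class LimitExceeded < StandardError; end

# Hypothetical cap on how many triggers one account may define.
class TriggerRegistry
  def initialize(max_per_account:)
    @max_per_account = max_per_account
    @triggers = Hash.new { |hash, account_id| hash[account_id] = [] }
  end

  def add(account_id, trigger)
    list = @triggers[account_id]
    if list.size >= @max_per_account
      raise LimitExceeded, "account #{account_id} reached #{@max_per_account} triggers"
    end

    list << trigger
  end
end

# Runs steps in order and aborts once the time budget is spent.
# The clock is injectable so the behaviour can be checked without sleeping.
def run_with_deadline(steps, max_seconds:,
                      clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  started = clock.call
  steps.each_with_index.map do |step, index|
    raise LimitExceeded, "aborted before step #{index}" if clock.call - started > max_seconds

    step.call
  end
end

registry = TriggerRegistry.new(max_per_account: 2)
registry.add(1, "email management on 'urgent'")
registry.add(1, "tag VIP tickets")
begin
  registry.add(1, "one trigger too many")
rescue LimitExceeded => e
  puts e.message
end
```

Raising a dedicated error lets the caller turn a hit limit into a clear message for the customer instead of a timeout or an overloaded system.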

00:30:49.300 As we finalize this discussion, I want to share what I refer to as the problem of 'incredibly large payloads.' Throughout each system's life cycle, you might notice the API response grows unwieldy due to a creeping requirement for more and more data being returned from a single endpoint.

00:31:38.120 This feature creep might have everything from basic user details to extensive transactional data in responses, leading to challenges like parsing excessively large JSON files. It's often much better to find ways to only return necessary data while providing optional or aggregated data through alternative means.
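One way to keep the default payload small is to return core fields always and expensive fields only when the client asks for them. The `TicketSerializer` class, its field lists, and the `extras:` parameter below are hypothetical names used for illustration.

```ruby
require "json"

# Hypothetical serializer: core fields always, expensive extras only on request.
class TicketSerializer
  CORE_FIELDS = %i[id subject status].freeze
  OPTIONAL_FIELDS = %i[comments audit_log].freeze

  def initialize(ticket)
    @ticket = ticket # a Hash standing in for a loaded record
  end

  # extras: names of optional fields the client explicitly asked for.
  def as_json(extras: [])
    requested = extras.map(&:to_sym) & OPTIONAL_FIELDS # ignore unknown names
    @ticket.slice(*CORE_FIELDS, *requested)
  end
end

ticket = {
  id: 1,
  subject: "Printer on fire",
  status: "open",
  comments: Array.new(200) { |i| "comment #{i}" },
  audit_log: Array.new(500) { |i| "event #{i}" }
}

serializer = TicketSerializer.new(ticket)
default_json = serializer.as_json.to_json
with_comments = serializer.as_json(extras: ["comments"]).to_json
puts "default: #{default_json.bytesize} bytes, with comments: #{with_comments.bytesize} bytes"
```

Clients that need the heavy fields pay for them explicitly, while every other caller gets the small default response.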

00:32:33.240 Additionally, breaking down complex read and write workflows is vital. For instance, writing derived data in a synchronous way can be a burden; alternatively, writing the basic data first and utilizing asynchronous processes for other tasks leads to more efficient systems.
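A minimal sketch of that split, using a thread and a queue from Ruby's core library: the basic record is written synchronously and the derived value is computed by a background worker. The `CommentWriter` class and the word-count example are assumptions for illustration; a Rails application would normally use a background job system for this.

```ruby
# Hypothetical sketch: write basic data synchronously, derive the rest in a worker.
class CommentWriter
  def initialize
    @comments = []    # stand-in for the primary table
    @word_counts = {} # derived data, filled in asynchronously
    @jobs = Queue.new
    @worker = Thread.new { process_jobs }
  end

  # Fast path: store the comment and enqueue the expensive work.
  def create(body)
    id = @comments.size + 1
    @comments << { id: id, body: body }
    @jobs << id
    id
  end

  # Returns nil until the worker has caught up (eventual consistency).
  def word_count(id)
    @word_counts[id]
  end

  # Drains the queue and stops the worker.
  def shutdown
    @jobs << :stop
    @worker.join
  end

  private

  def process_jobs
    while (id = @jobs.pop) != :stop
      comment = @comments.find { |c| c[:id] == id }
      @word_counts[id] = comment[:body].split.size
    end
  end
end

writer = CommentWriter.new
comment_id = writer.create("the printer is on fire again")
writer.shutdown # wait for background work before reading derived data
puts writer.word_count(comment_id)
```

Until the worker finishes, `word_count` returns `nil`, which is the eventual-consistency window the caller has to handle.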

00:33:12.500 However, keep in mind that using asynchronous processing can lead to challenges in terms of eventual consistency, which can be confusing for users who expect instant feedback. Thus, maintaining clear communication can help alleviate concerns.

00:34:01.040 As we wrap this up, I’d like to refer to a concept I call 'special interventions.' In practice, I've found that specific clients can sometimes exert disproportionate impact on service-level objectives. Often, certain large customers cause more strain on our systems than others.

00:35:02.760 In response, we have worked on solutions tailored to their unique needs, such as dividing accounts with differing requirements while creating centralized accounts for seamless data sharing. In doing so, we can drastically reduce peak usage.

00:35:51.020 Lastly, I want to emphasize the mindset required of engineers working on performance. Often, I relate to the story of Alexander the Great when he faced the Gordian Knot. He couldn’t untie it and, instead, cut it open. This story reminds me that sometimes perceived problems only seem impossible because we impose unnecessary constraints on ourselves.

00:37:13.680 We mustn't disregard unconventional solutions when attempting to scale our services. In software engineering, particularly when it comes to performance, it is all about understanding trade-offs.

00:38:00.000 Finally, I would like to celebrate the Ruby community; we have a unique culture that prioritizes developer happiness. In conclusion, programming should be as enjoyable as possible, so let’s strive to make it fun.

00:38:36.580 Thank you for attending my talk! Any questions?

00:39:02.800 You said we should not assume that the database chooses the right index. So, how do you make sure it does?

00:39:16.680 Oh, yes! You can check in your query using tools that tell you exactly which index is being used. In the Rails framework, there is typically a method for this.

00:39:40.160 About this, there has indeed been back and forth regarding ActiveRecord and its indexing methods, which change almost every version. There was a way to force an index in the past, but I think that has been phased out.

00:40:14.480 Concerning product limits, can you share examples of what limits you've found useful in the past?

00:40:56.540 It is crucial to implement timeout limits on databases. For instance, at Zendesk, we have 'triggers' in which clients can define specific conditions for actions—if a ticket comment has the word 'urgent,' it could send an email to the upper management.

00:41:19.360 If we allow unlimited possibilities, the impact on processing could escalate significantly.

00:41:49.960 For managing archived tickets, what are the types of services or databases used?

00:42:25.880 Currently, we use DynamoDB, AWS's key-value store, where we have been moving our archived tickets. Previously, we used Riak, which is another key-value store.

00:42:55.760 Well, that's great to know! Thank you for that information!

00:43:12.500 In terms of sharding, do you practice geographically aware sharding?

00:43:45.460 Yes, we do implement some geographical awareness, but not specific down to cities. Most databases we manage can cater to geographic regions.

00:44:27.920 Is there any automated tooling to detect and mitigate abuse, such as excessive data storage or inappropriate usage?

00:44:56.860 Currently, we primarily leverage manual queries to catch suspicious behavior, and we have mechanisms in place that break functionalities upon reaching certain thresholds.

00:45:22.840 I appreciate your time and attention! Thank you for your questions.

wroc_love.rb 2023