A Rails Performance Guidebook: from 0 to 1B requests/day

Summarized using AI


Cristian Planas • October 04, 2023 • Vilnius, Lithuania • Talk

In 'A Rails Performance Guidebook,' Cristian Planas shares his insights into scaling applications built with Ruby on Rails, presented at EuRuKo 2023. The presentation begins with Cristian's personal reflections on his journey through the Ruby community and emphasizes the importance of scalability against a backdrop of industry discussions, particularly complaints about Ruby's performance capabilities.

Key points discussed throughout the presentation include:

- Scale Issues: Cristian explains that working at Zendesk, where they handle approximately two billion requests daily, drives his interest in performance issues, highlighting a particular focus on scalability problems often faced by Rails developers.
- Historical Context: He reminisces about his experiences in college and learning to build Rails applications, emphasizing the challenges faced in implementing authentication.
- Performance Monitoring: Monitoring is deemed crucial for identifying performance issues before they impact users, with techniques such as sampling and establishing an 'error budget' for latency discussed.
- Database Optimization: The presentation stresses the importance of database indexing and query optimization, using 'EXPLAIN' commands to analyze query performance, and suggests using sharding to manage larger datasets effectively.
- Caching Techniques: Cristian discusses write-through caching as a method to reduce load by pre-computing complex calculations.
- Data Management: He touches upon the necessity for additional storage solutions and archiving strategies for infrequently accessed data, as well as the importance of implementing product limits to manage user interactions effectively.
- Payload Management: Avoiding unnecessarily large payloads from API endpoints is highlighted as a key optimization, advocating for lean responses and leveraging technologies like GraphQL.
- Asynchronous Processing: Utilizing background jobs for heavy write operations is recommended to enhance performance.
- Mindset for Optimization: Cristian emphasizes the importance of creative problem-solving, suggesting that unconventional approaches can lead to significant advancements in performance engineering.

The conclusion underscores that performance engineering is about balancing trade-offs and optimizing systems for specific requirements while ensuring developer satisfaction. By keeping performance in perspective, developers can prioritize making coding with Ruby enjoyable, as manifested in a quote from Matz underscoring the purpose of life and programming: to be happy.

EuRuKo 2023

00:00:16.039 Good afternoon! My name is Cristian Planas, and I work with Ruby on Rails. I would like to share that I know a little bit of Lithuanian because my wife is Lithuanian, and she will be very happy to see this video.
00:00:21.279 This is my first presentation. I have rehearsed it a few times, but this is the first time I am presenting it in public, so I apologize in advance for what is about to happen.
00:00:34.480 I have been attending conferences, particularly Ruby conferences, for around twelve years now, so I have seen some great presenters. I've seen Matz, I’ve seen DHH, and I have copied a few tricks from them for this presentation. One important lesson I learned is that it’s a good idea to summarize the presentation in the first two slides.
00:00:53.440 This presentation started with the crazy idea of going to large Ruby conferences and showcasing my slides in the biggest possible letters. The next slide emphasizes that this is a real scale issue. For example, last year, there was a thread where engineering teams shared their figures during Cyber Monday, reaching 78 million requests per minute at peak times.
00:01:18.119 That’s about 1.3 million requests per second. For me, this wasn't particularly surprising as I work at Zendesk, where I focus on performance issues. We do around two billion requests per day. While it might be less than what was reported, we're used to managing that level of traffic.
00:01:46.799 This made me reflect on why there is such an obsession with scalability or, conversely, a lack thereof. This topic has been a recurrent theme recently across social media, particularly Twitter, where a significant discussion in the Ruby community revolved around a tweet from DHH. In it, he jokingly criticized the TypeScript community, igniting discussions and flame wars.
00:02:06.880 Because DHH is the creator of Rails, there was a considerable backlash, and Rails became the focal point of the debate. One tweet that caught my attention was about insecurities inherent in programming communities, specifically mentioning Ruby's scalability issues.
00:02:44.319 It prompted me to think more about why we are constantly having these discussions. I have a theory to explain this, but it requires a bit of time traveling.
00:02:57.120 This is a character called Doraemon, a robotic cat from the 22nd century who possesses a time machine. Don’t worry; we’ll talk more about Doraemon later. For now, let's time travel back to 2010.
00:03:15.239 Back in 2010, I was in college, having just completed a course on PHP and the typical LAMP stack. This led me to discover Ruby on Rails. I decided to try creating my first Rails application, which needed authentication—a complex problem. Just discussing encryption can be dizzying.
00:03:38.319 At that time, I wanted to ensure I nailed this aspect, so I researched extensively, reading all the essential literature on the subject. Eventually, I found a library called Devise, which was easy to configure. I just had to install a couple of gems, run a few migrations, and make slight adjustments to my models, and it worked seamlessly.
00:04:24.880 Many years later, I became a tech lead for the authentication security team at Zendesk, where I learned that if you make mistakes in authentication, very serious issues can arise. But for my 22-year-old self, it was exhilarating and made me feel like an incredibly skilled engineer.
00:04:37.080 Next, we need to travel again, this time to 2012. I proudly remember meeting Matz during this time while I was the CTO of a company called Playfulbet. The CTO title stood for Chief Technical Officer, but I was essentially the only engineer in a small company, which meant I had to do everything.
00:05:14.560 Ironically, even though we weren’t financially successful, we became very popular among teenagers. This popularity led us to tackle the topic of scalability, which is what I will discuss here. I echoed the same quest for knowledge as before; I researched thoroughly about scaling. In my investigations, I stumbled upon a gem called ActiveScale.
00:05:34.800 I read the documentation meticulously and opted for the parameter: scale sufficient but not too much. I would strongly advise against using the 'go bananas' parameter as it could create a myriad of problems. After deploying this to production, everything looked promising—at least at first.
00:05:53.799 Eventually, it became clear that ActiveScale didn’t exist. I essentially altered the title of the ActiveRecord repository using Google Chrome's inspector.
00:06:03.039 My point here is that scaling appears to be a challenging problem. It's an issue that feels daunting. There is no clear path, unlike many features of Rails that grant you significant power without needing a deep understanding of how they function. What I conclude from this experience is that the issue isn't that Rails can't scale; it's that Rails is proficient in so many areas that one of the few remaining challenges for developers to tackle is scalability.
00:06:37.040 When faced with a problem that isn’t solved by Rails' default features, it can leave developers puzzled. Now, before we dive into the techniques we can use for scaling a large application, I want to provide some general observations about this presentation.
00:06:55.160 We won’t delve deeply into any single topic; rather, this will serve as an introduction to scaling issues. I’ll share general concepts illustrated with the challenges we encounter while scaling to a significant degree. The essential reason for this is that there are no silver bullets in optimization.
00:07:18.040 When someone mentions that an application is performing slowly, I typically respond, 'I don’t know what the issue is.' Commonly, the database may be involved, but there are many potential trouble spots.
00:07:45.160 I often say that every poorly performing system has its peculiarities. Let's look at the scale of the problem we have at Zendesk, a customer support company that handles ticketing. If, for instance, a user makes a complaint, that usually generates an object in our database. We currently have over 20 billion tickets.
00:08:20.040 Divided by the number of people on earth, that’s about 2.5 tickets per person—including babies and elderly individuals—so everybody would have numerous tickets associated with them. Furthermore, about a third of the data we possess was generated last year alone.
00:08:50.040 This rapid growth presents considerable implications for our systems. We utilize multiple databases, including key-value stores, but our primary relational database—integral to our monolithic Rails app—has over 400 tables.
00:09:05.880 Moreover, the distribution of this data is incredibly uneven. One table, in particular, presents me with many sleepless nights as it contains trillions of entries, making it challenging to work with.
00:09:20.000 Next, I want to discuss a technique that is crucial in managing these issues: monitoring. Why is monitoring so important? Because you cannot resolve a problem you are unaware of.
00:09:31.479 Imagine making a change that causes an endpoint that previously took 500 milliseconds to take 5 seconds instead. If you’re not monitoring, you might not notice it immediately; you would likely find out when customers start complaining.
00:09:47.360 Some customers can tolerate slow responses for a while before leaving. One frequent statement I encounter is that monitoring can be expensive, particularly at scale, but as performance engineers, we have the advantage of sampling.
00:10:05.480 Sampling can be very effective when managing performance issues, as sometimes even a 10% slice is sufficient to understand how an endpoint is performing. This concept leads to an error budget, which is extensively utilized at Zendesk.
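As a rough illustration of the sampling idea, the sketch below times only about 10% of requests. The `LatencySampler` class, its `rate` parameter, and the `/tickets` endpoint are hypothetical names chosen for this example; a real setup would normally rely on the sampling support of its monitoring agent.

```ruby
# Hypothetical sampler: records latency for roughly `rate` of all requests.
class LatencySampler
  attr_reader :samples

  def initialize(rate:, random: Random.new)
    @rate = rate       # fraction of requests to measure, e.g. 0.10
    @random = random   # injectable so runs can be made repeatable
    @samples = []
  end

  # Runs the block. The duration is stored only for sampled requests,
  # so unsampled requests pay almost no monitoring overhead.
  def measure(endpoint)
    return yield unless @random.rand < @rate

    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
    @samples << { endpoint: endpoint, ms: elapsed_ms }
    result
  end
end

sampler = LatencySampler.new(rate: 0.10, random: Random.new(42))
1_000.times { sampler.measure("/tickets") { :ok } }
# sampler.samples now holds roughly a tenth of the 1,000 requests
```

The trade-off is precision against cost: a lower rate stores less data but makes rare slow requests easier to miss.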
00:10:28.480 I believe the concept of an error budget is not widely recognized among smaller companies. Typically, when someone defines a budget, they refer to something like '99.999% uptime,' but I propose we establish a budget for latency.
00:10:50.000 For instance, knowing that 99.9762% of the requests to an endpoint complete within an acceptable time frame is an essential metric to have. This opens up the conversation about trade-offs and monitoring approaches.
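A latency budget of this kind can be sketched in a few lines. The helper names, the 500 ms threshold, and the 99% target below are assumptions made for the example, not figures from the talk.

```ruby
# Fraction of requests that finished within the latency threshold.
def latency_compliance(durations_ms, threshold_ms:)
  return 1.0 if durations_ms.empty?

  durations_ms.count { |ms| ms <= threshold_ms }.fdiv(durations_ms.size)
end

# True while the share of fast-enough requests meets the target.
def within_budget?(durations_ms, threshold_ms:, target:)
  latency_compliance(durations_ms, threshold_ms: threshold_ms) >= target
end

# Ten sampled request durations in milliseconds; one is far too slow.
durations = [120, 180, 95, 2_400, 150, 130, 110, 90, 140, 160]

compliance = latency_compliance(durations, threshold_ms: 500)
budget_ok = within_budget?(durations, threshold_ms: 500, target: 0.99)
# 9 of 10 requests met the threshold, so a 99% target is missed
```

Tracking the budget per endpoint turns "the app feels slow" into a number a team can agree to spend or protect.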
00:11:18.680 Next, let’s talk about database query performance, which tends to be a central focus for performance engineers. The majority of my time is spent understanding why certain queries are slow and how we can optimize them. However, I don’t want to only focus on this aspect.
00:11:40.640 Originally, this presentation was titled 'Beyond N+1,' but I want to share some fundamental concepts regarding database indexing. Please utilize database indexes; you will be surprised by how often I find applications that neglect to use them.
00:12:06.080 However, with database indexes, sometimes you encounter peculiar situations—especially with complex tables holding numerous entries—where the database engine may not choose the optimal index.
00:12:29.560 Using the 'EXPLAIN' or 'EXPLAIN ANALYZE' commands is essential to uncover what's happening with a query. This is particularly true with complex queries where optimization can get challenging.
00:12:48.600 Another technique we frequently use at Zendesk is sharding. The idea is simple: if you have a massive dataset, what if you partition your database into smaller, manageable ones?
00:13:06.079 In certain cases, especially in our business, customer A doesn’t care about customer B’s data, so it makes sense to separate that data into distinct databases. For example, with modulo sharding, you assign each customer to a shard by taking the customer ID modulo the number of shards, so all of a customer's data lives on a single shard.
00:13:22.160 However, manually managing shards for large customers can become cumbersome. It’s crucial to have a systematic way to manage sharding instead of relying on manual fixes.
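A minimal sketch of modulo sharding, assuming integer customer IDs and a fixed shard count. `SHARD_COUNT` and the `tickets_shard_NN` naming scheme are made up for illustration and do not describe Zendesk's actual setup.

```ruby
SHARD_COUNT = 4 # hypothetical number of shards

# Maps a customer to a shard index with a simple modulo.
def shard_for(customer_id, shard_count: SHARD_COUNT)
  customer_id % shard_count
end

# Hypothetical naming scheme for the per-shard databases.
def database_name_for(customer_id)
  format("tickets_shard_%02d", shard_for(customer_id))
end

shard_for(42)         # customer 42 always lands on the same shard
database_name_for(42) # name of the database holding that customer's data
```

The mapping is deterministic and needs no lookup table, but it spreads customers evenly, not load: one very large customer can still overwhelm its shard, and changing the shard count remaps most customers.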
00:13:47.480 Moreover, something common in Active Record that I call 'select star' needs addressing. Imagine running a query like ‘User.where("id > 42")’. This selects every column from the users table.
00:14:07.480 Sometimes, you may only need the ID or the title, but without realizing, you could be fetching gigabytes of unnecessary text or blobs. This is a frequent occurrence due to developers using Active Record without understanding the implications and complexities beneath.
00:14:37.960 Optimizing your queries by using 'select' to retrieve only the necessary attributes can yield significant performance benefits. Identifying these optimizations can lead to impressive gains, particularly on endpoints that handle considerable loads.
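In Active Record, narrowing the column list might look like the sketch below. It assumes a Rails application with a hypothetical `User` model that has `id` and `title` columns; the helper method names are invented for the example, and nothing touches the database until the methods are called.

```ruby
# Assumes a Rails app with a hypothetical `User` model (columns: id, title).

# Loads User objects populated with two columns instead of SELECT *.
def lean_users(min_id)
  User.where("id > ?", min_id).select(:id, :title)
end

# Skips model instantiation entirely and returns a plain array of ids.
def user_ids(min_id)
  User.where("id > ?", min_id).pluck(:id)
end
```

Objects loaded through `select` only carry the listed attributes, so code that later reads another column will raise an error instead of silently issuing a new query.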
00:14:54.360 Let’s go back to the concept of complex queries. Sometimes, developers run complicated queries several times in development and dismiss their performance implications based on the fact that it isn’t consistently slow.
00:15:21.120 The issue with complex database queries is their high variability: 99% of the time, they might operate smoothly; however, the remaining 1% of the time they could cause a database meltdown, which is unacceptable.
00:15:47.080 Hence, I advise focusing on 'p99' performance metrics, meaning the latency that only the slowest 1% of requests exceed, and making sure even those queries perform as expected.
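To make the p99 idea concrete, here is a small nearest-rank percentile helper run on invented data. The `percentile` function and the sample durations are illustrative assumptions; monitoring tools typically compute percentiles for you.

```ruby
# Nearest-rank percentile: the smallest value such that at least
# `pct` percent of all values are less than or equal to it.
def percentile(values, pct)
  raise ArgumentError, "no values" if values.empty?

  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# 100 request durations in ms: 98 fast ones and 2 pathological ones.
durations_ms = Array.new(98, 20) + [4_000, 5_000]

median = percentile(durations_ms, 50) # looks perfectly healthy
p99 = percentile(durations_ms, 99)    # exposes the slow tail
```

With this data the median stays at 20 ms while the p99 is 4,000 ms, which is why a query that "is usually fast" can still be the one that hurts the database.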
00:16:05.520 Another common technique worthy of mention is caching. I don’t want to focus only on traditional caching that serves frequently accessed data; instead, I would like to discuss specific caching techniques.
00:16:19.680 One technique we often use is called write-through caching. This is particularly useful for normalized objects where complex calculations would otherwise be needed on the fly.
00:16:41.560 For instance, imagine an endpoint that needs to execute very complex calculations that could be pre-calculated on a write action instead of on the read.
00:17:03.160 In this case, the customers don’t need to customize the filters, so you could pre-compute the calculations when adding data rather than calculating it every time a read occurs.
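The pattern of pre-computing on write can be sketched with plain Ruby objects. `TicketStats` and its in-memory array and hash are hypothetical stand-ins for a relational table and a cache store; the "complex calculation" here is just a count of open tickets per account.

```ruby
# Hypothetical in-memory stand-ins for a tickets table and a cache store.
class TicketStats
  def initialize
    @tickets = []              # stand-in for the relational table
    @open_counts = Hash.new(0) # stand-in for the cache: account_id => open tickets
  end

  # Write path: persist the ticket and update the cached aggregate together.
  def create_ticket(account_id:, status:)
    @tickets << { account_id: account_id, status: status }
    @open_counts[account_id] += 1 if status == :open
  end

  # Read path: a constant-time lookup instead of scanning the table.
  def open_count(account_id)
    @open_counts[account_id]
  end

  # The computation the cache replaces, kept here to check consistency.
  def open_count_uncached(account_id)
    @tickets.count { |t| t[:account_id] == account_id && t[:status] == :open }
  end
end

stats = TicketStats.new
stats.create_ticket(account_id: 1, status: :open)
stats.create_ticket(account_id: 1, status: :closed)
stats.create_ticket(account_id: 1, status: :open)
stats.open_count(1) # served from the pre-computed value
```

Every write path that can change the result (status updates, deletions) has to update the cached value too; any path that forgets is a source of stale data.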
00:17:27.520 This approach significantly reduces the load. We had a particular instance where a query that had a side load took 40 milliseconds, and by optimizing it with this technique, we reduced it to 5 milliseconds.
00:17:48.680 You might think that a 35-millisecond decrease seems negligible, but at scale, it isn’t. For example, if that endpoint is called 50,000 times per day, the saving adds up to more than seven full days of computation annually, which translates to considerable cost.
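A quick back-of-the-envelope check, using only the figures quoted in this passage (40 ms reduced to 5 ms, 50,000 calls per day), can be written as follows. The variable names are made up for the example.

```ruby
saved_ms_per_call = 40 - 5 # latency before minus latency after, in ms
calls_per_day = 50_000
seconds_per_day = 86_400

saved_seconds_per_year = saved_ms_per_call * calls_per_day * 365 / 1000.0
saved_days_per_year = saved_seconds_per_year / seconds_per_day
# 638,750 seconds per year, a little over 7 days of compute
```

With these exact inputs the saving comes to roughly 7.4 days of computation per year, and it grows linearly with the call volume.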
00:18:04.960 However, caching has its trade-offs. It adds complexity to your system, as you're potentially introducing another source of truth, duplicating data that could lead to inconsistencies later.
00:18:24.160 Returning to the theme of unconventional solutions, let me introduce you to my special guest tonight, Doraemon, a cartoon character that I watched on Catalan television growing up.
00:18:47.520 Doraemon is a futuristic robotic cat equipped with countless gadgets. His character portrays the idea of solving problems with fantastical solutions, similar to how software engineering sometimes offers seemingly miraculous fixes.
00:19:14.879 In each episode, Doraemon helps a clumsy boy named Nobita. The storyline begins with Nobita facing a problem, and Doraemon gives him a gadget that solves it. However, Nobita always misuses it, leading to catastrophic results.
00:19:44.120 This mirrors how some technologies that seem straightforward can lead to misuse. For example, frequently used gems such as Active Record can be used incorrectly, leading to performance degradation.
00:20:03.120 My advice is to be cautious and thoughtful in using technology—understand the consequences of your implementations. Now, there’s another powerful tool I want to discuss: additional storage.
00:20:35.640 In addition to caching and relational databases, introducing a third storage system may help alleviate database burdens. Why opt for additional storage? At Zendesk, we manage substantial amounts of data.
00:20:57.520 Currently, we struggle with over 20 billion tickets, including tables with trillions of rows, which complicates data management. One way to address this challenge is by archiving infrequently accessed data.
00:21:17.680 At Zendesk, once a ticket has been closed for five months, it’s marked for archival. We have a background job that removes it from the relational database and transfers it to specialized storage.
00:21:38.640 Initially, this raises concerns about losing associations with the ticket data. Hence, we create a new object, an archive ticket, retaining only essential information and associations, which facilitates further data retrieval.
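The archival flow described above might be sketched like this. The `Ticket` and `ArchivedTicket` structs, the hash-based stores, and the `archive!` helper are all hypothetical stand-ins; only the five-month rule comes from the talk.

```ruby
require "date"

ARCHIVE_AFTER_MONTHS = 5 # from the talk: closed for five months

# Hypothetical full record and the slim stub that replaces it.
Ticket = Struct.new(:id, :requester_id, :subject, :body, :closed_on, keyword_init: true)
ArchivedTicket = Struct.new(:id, :requester_id, keyword_init: true)

# A ticket qualifies once it has been closed for at least five months.
def archivable?(ticket, today: Date.today)
  !ticket.closed_on.nil? && ticket.closed_on <= (today << ARCHIVE_AFTER_MONTHS)
end

# Moves archivable tickets out of the hot store and leaves a stub behind.
def archive!(hot_store, cold_store, stubs, today: Date.today)
  hot_store.select { |_, t| archivable?(t, today: today) }.each do |id, ticket|
    cold_store[id] = ticket.to_h # full record goes to the specialized storage
    stubs[id] = ArchivedTicket.new(id: id, requester_id: ticket.requester_id)
    hot_store.delete(id)         # the row leaves the relational table
  end
end
```

The stub keeps just enough information (here, the requester) for associations to keep working, while the full record is fetched from the archive only when someone actually opens the old ticket.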
00:21:54.960 Most of our data at Zendesk is archival, representing around 90% of our total data. While maintaining multiple data sources introduces its own challenges, it is necessary for efficient data management.
00:22:16.720 It’s vital to comprehend how splitting data across different stores impacts accessibility and querying, especially as different technologies are utilized. Access patterns in product design might need revisiting due to these complexities.
00:22:36.080 I want to discuss a significant idea concerning product limits. I cannot emphasize enough how much regret we have experienced from not anticipating the need for product limits years ahead.
00:22:55.080 As a platform expands, identifying what users do becomes increasingly challenging. Sometimes you encounter misuse, even if from well-meaning users. Therefore, it is crucial to safeguard against this sort of situation.
00:23:15.880 Although setting such limits may not seem appealing—especially when dealing with significant clients—it's crucial to implement them to maintain system health.
00:23:36.200 The next technique I want to address is what I call the 'incredibly enlarging payload.' This often refers to endpoints returning substantially more data than required. Over time, as applications evolve, additional attributes can work their way into existing endpoints.
00:24:03.360 Initially, an endpoint may return four relevant attributes, but over time it could balloon to 40 attributes, significantly increasing the payload, causing slower responses and additional complexity.
00:24:18.720 In these cases, it is advisable to keep responses lean and consider using tools like GraphQL to help in fetching only necessary data.
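One lightweight way to keep responses lean, short of adopting GraphQL, is to let callers name the fields they need. The sketch below is plain Ruby, not GraphQL; `DEFAULT_FIELDS`, `serialize_ticket`, and the sample ticket are assumptions made for the example.

```ruby
require "json"

# Hypothetical default: the handful of attributes most clients need.
DEFAULT_FIELDS = %i[id subject status updated_at].freeze

# Returns only the requested attributes and ignores unknown ones.
def serialize_ticket(ticket, fields: DEFAULT_FIELDS)
  ticket.slice(*fields)
end

ticket = {
  id: 1,
  subject: "Printer on fire",
  status: "open",
  updated_at: "2023-10-04",
  description: "x" * 10_000,                       # large text column
  audit_log: Array.new(500) { |i| "event #{i}" }   # attribute added over time
}

lean_json = JSON.generate(serialize_ticket(ticket))
full_json = JSON.generate(ticket)
# lean_json is a small fraction of the size of full_json
```

Heavy attributes then become opt-in: a client that needs the description asks for it explicitly with `fields: [:id, :description]`, and everyone else stops paying for it.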
00:24:35.440 Another guideline is breaking down slower flows. For example, instead of making one call to retrieve a complex data set, consider separating the logic and making multiple calls.
00:24:48.480 While more calls may seem less efficient, it can help ensure the user is not blocked and that essential data is loaded faster, thus enhancing user experience.
00:25:05.040 Similarly, when it comes to write operations, consider making them asynchronous. Move heavy write operations to background jobs or use queues to enhance performance.
00:25:25.040 This pattern, combined with distributed queuing systems such as Kafka, can greatly improve efficiency, although it requires a shift in thinking regarding consistency.
00:25:41.680 Customers often adapt better than engineers think they will. Even if a change now takes a couple of seconds to show up, customers usually tolerate it and adjust.
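The shape of an asynchronous write can be shown with Ruby's built-in `Queue` and a worker thread. This is an in-process stand-in for a real job system such as a background job library or Kafka consumers; `update_ticket` and the `:reindex` job type are hypothetical.

```ruby
# Request path: enqueue the heavy work and return immediately.
def update_ticket(jobs, ticket_id)
  jobs << { type: :reindex, ticket_id: ticket_id }
  :accepted
end

jobs = Queue.new # in-process stand-in for a real job queue
audit_log = []

worker = Thread.new do
  # Queue#pop returns nil once the queue is closed and drained.
  while (job = jobs.pop)
    audit_log << "processed #{job[:type]} for ticket #{job[:ticket_id]}"
  end
end

response = update_ticket(jobs, 42) # the caller is not blocked by the write
jobs.close                         # no more jobs; lets the worker finish
worker.join
```

Between the `:accepted` response and the worker finishing, reads may still see the old state, which is the consistency shift mentioned above.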
00:26:03.360 As we reach the end, I would particularly emphasize the mindset necessary for a performance engineer. It’s essential to maintain a mindset similar to that of Alexander the Great.
00:26:24.160 In 333 BCE, when he faced the impossibility of untying the Gordian Knot, he made an unconventional choice and cut it instead. This challenges the notion of needing to solve problems in traditional ways.
00:26:44.560 Sometimes, adhering too rigidly to conventional solutions can obstruct progress. As software engineers, we can often devise non-linear solutions to obstacles we face.
00:27:03.440 Finally, performance engineering is about making trade-offs. There is no such thing as a flawless system, only systems optimized for specific needs.
00:27:28.680 Striving for absolute perfection is counterproductive and leads to burnout and delays in delivery. Fortunately, our community understands this concept, as we prioritize developer happiness.
00:27:47.680 I would like to conclude with a quote from Matz: 'The purpose of life is, at least in part, to be happy.' Based on this belief, we should remember that Ruby is designed not only to make programming easy but also enjoyable. Thank you!