00:00:16.039
Good afternoon! My name is Cristian Planas, and I work with Ruby on Rails. I would like to share that I know a little bit of Lithuanian because my wife is Lithuanian, and she will be very happy to see this video.
00:00:21.279
This is the first time I am presenting this talk in public, although I have rehearsed it a few times before. Despite that, I apologize in advance for what is about to happen.
00:00:34.480
I have been attending conferences, particularly Ruby conferences, for around twelve years now, so I have seen some great presenters. I've seen Matz, I've seen DHH, and I have copied a few tricks from them for this presentation. One important lesson I learned is that it's a good idea to summarize the presentation in the first two slides.
00:00:53.440
This presentation started with the crazy idea of going to large Ruby conferences and showing my slides in the biggest possible letters. The second point is that this is a problem of real scale. For example, last year there was a thread where engineering teams shared their figures during Cyber Monday, reaching 78 million requests per minute at peak times.
00:01:18.119
That's about 1.3 million requests per second. For me, this wasn't particularly surprising, as I work at Zendesk, where I focus on performance issues. We handle around two billion requests per day; while that is less than the peak reported in that thread, we're used to managing that level of traffic.
00:01:46.799
This made me reflect on why there is such an obsession with scalability or, conversely, a lack thereof. This topic has been a recurrent theme recently across social media, particularly Twitter, where a significant discussion in the Ruby community revolved around a tweet from DHH. In it, he jokingly criticized the TypeScript community, igniting discussions and flame wars.
00:02:06.880
Because DHH is the creator of Rails, there was a considerable backlash, and Rails became the focal point of the debate. One tweet that caught my attention was about insecurities inherent in programming communities, specifically mentioning Ruby's scalability issues.
00:02:44.319
It prompted me to think more about why we are constantly having these discussions. I have a theory to explain this, but it requires a bit of time traveling.
00:02:57.120
This is a character called Doraemon, a robotic cat from the 22nd century who possesses a time machine. Don't worry; we'll talk more about Doraemon later. For now, let's time travel back to 2010.
00:03:15.239
Back in 2010, I was in college, having just completed a course on PHP and the typical LAMP stack. This led me to discover Ruby on Rails. I decided to try creating my first Rails application, which needed authentication—a complex problem. Just discussing encryption can be dizzying.
00:03:38.319
At that time, I wanted to ensure I nailed this aspect, so I researched extensively, reading all the essential literature on the subject. Eventually, I found a library called Devise, which was easy to configure. I just had to install a couple of gems, run a few migrations, and make slight adjustments to my models, and it worked seamlessly.
00:04:24.880
Many years later, I became a tech lead for the authentication security team at Zendesk, where I learned that if you make mistakes in authentication, very serious issues can arise. But for my 22-year-old self, it was exhilarating and made me feel like an incredibly skilled engineer.
00:04:37.080
Next, we need to travel again, this time to 2012. I proudly remember meeting Matz around this time, while I was the CTO of a company called Playful Vet. The CTO title stood for Chief Technical Officer, but I was essentially the only engineer in a small company, which meant I had to do everything.
00:05:14.560
Ironically, even though we weren't financially successful, we became very popular among teenagers. That popularity forced us to confront scalability, which is the topic of this talk. As before, I researched the subject thoroughly, and in my investigations I stumbled upon a gem called ActiveScale.
00:05:34.800
I read the documentation meticulously and opted for the parameter: scale sufficient but not too much. I would strongly advise against using the 'go bananas' parameter as it could create a myriad of problems. After deploying this to production, everything looked promising—at least at first.
00:05:53.799
Eventually, it became clear that ActiveScale didn’t exist. I essentially altered the title of the ActiveRecord repository using Google Chrome's inspector.
00:06:03.039
My point here is that scaling appears to be a challenging problem. It's an issue that feels daunting. There is no clear path, unlike many features of Rails that grant you significant power without needing a deep understanding of how they function. What I conclude from this experience is that the issue isn't that Rails can't scale; it's that Rails is proficient in so many areas that one of the few remaining challenges for developers to tackle is scalability.
00:06:37.040
When faced with a problem that isn’t solved by Rails' default features, it can leave developers puzzled. Now, before we dive into the techniques we can use for scaling a large application, I want to provide some general observations about this presentation.
00:06:55.160
We won’t delve deeply into any single topic; rather, this will serve as an introduction to scaling issues. I’ll share general concepts illustrated with the challenges we encounter while scaling to a significant degree. The essential reason for this is that there are no silver bullets in optimization.
00:07:18.040
When someone mentions that an application is performing slowly, I typically respond, 'I don’t know what the issue is.' Commonly, the database may be involved, but there are many potential trouble spots.
00:07:45.160
I often say that every poorly performing system has its peculiarities. Let's look at the scale of the problem we have at Zendesk, a customer support company that handles ticketing: if a user files a complaint, that usually generates an object in our database. We currently have over 20 billion tickets.
00:08:20.040
Divided by the number of people on Earth, that's about 2.5 tickets per person, including babies and the elderly, so every person on the planet would have multiple tickets associated with them. Furthermore, about a third of the data we possess was generated last year alone.
00:08:50.040
This rapid growth presents considerable implications for our systems. We utilize multiple databases, including key-value stores, but our primary relational database—integral to our monolithic Rails app—has over 400 tables.
00:09:05.880
Moreover, the distribution of this data is incredibly uneven. One table, in particular, presents me with many sleepless nights as it contains trillions of entries, making it challenging to work with.
00:09:20.000
Next, I want to discuss a technique that is crucial in managing these issues: monitoring. Why is monitoring so important? Because you cannot resolve a problem you are unaware of.
00:09:31.479
Imagine making a change that causes an endpoint that previously took 500 milliseconds to take 5 seconds instead. If you're not monitoring, you might not notice it immediately; you would likely find out only when customers start complaining.
00:09:47.360
Some customers can tolerate slow responses for a while before leaving. One frequent statement I encounter is that monitoring can be expensive, particularly at scale, but as performance engineers, we have the advantage of sampling.
00:10:05.480
Sampling can be very effective when managing performance issues, as sometimes even a 10% slice is sufficient to understand how an endpoint is performing. This concept leads to an error budget, which is extensively utilized at Zendesk.
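The sampling idea can be sketched in a few lines of Ruby. This is a minimal illustration, not Zendesk's actual setup: the 10% rate and the metric name are assumptions, and a real application would send the metric to a backend such as StatsD rather than return a hash.

```ruby
# Minimal sketch of sampling: record detailed latency data for a
# random ~10% of requests (rate and metric name are illustrative).
SAMPLE_RATE = 0.10

# Returns a metric hash when this request is sampled, nil otherwise.
def sample_latency(endpoint, duration_ms, rng: Random)
  return nil unless rng.rand < SAMPLE_RATE
  { metric: "latency.#{endpoint}", value_ms: duration_ms }
end

# With a seeded RNG, roughly a tenth of 10,000 requests get recorded:
rng = Random.new(42)
sampled = (1..10_000).count { sample_latency("tickets.show", 12, rng: rng) }
puts sampled  # close to 1,000
```

Because the decision is made per request, the recorded slice stays statistically representative of the whole traffic while costing a tenth of the monitoring budget.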
00:10:28.480
I believe the concept of an error budget is not widely recognized among smaller companies. Typically, when someone defines a budget, they refer to something like '99.999% uptime,' but I propose we establish a budget for latency.
00:10:50.000
For instance, if you have an endpoint that averages 99.9762% successful transactions within an acceptable time frame, that’s an essential metric to have. This opens up the conversation about trade-offs and monitoring approaches.
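A latency budget of that kind is easy to compute from a window of observed latencies. The 500 ms target and the sample numbers below are illustrative assumptions:

```ruby
# Sketch: a latency budget as the fraction of requests served
# within a target, here 500 ms.
def within_budget_ratio(latencies_ms, target_ms)
  return 1.0 if latencies_ms.empty?
  latencies_ms.count { |l| l <= target_ms }.fdiv(latencies_ms.size)
end

latencies = [120, 340, 90, 480, 2100, 310, 505, 250, 140, 430]
ratio = within_budget_ratio(latencies, 500)
puts ratio  # => 0.8 (8 of 10 requests met the 500 ms target)
```

Tracking this ratio over time turns "the endpoint feels slow" into a concrete question: is it still inside its budget?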
00:11:18.680
Next, let’s talk about database query performance, which tends to be a central focus for performance engineers. The majority of my time is spent understanding why certain queries are slow and how we can optimize them. However, I don’t want to only focus on this aspect.
00:11:40.640
Originally, this presentation was titled 'Beyond N+1,' but I want to share some fundamental concepts regarding database indexing. Please utilize database indexes; you will be surprised by how often I find applications that neglect to use them.
00:12:06.080
However, with database indexes, sometimes you encounter peculiar situations—especially with complex tables holding numerous entries—where the database engine may not choose the optimal index.
00:12:29.560
Using the 'EXPLAIN' or 'EXPLAIN ANALYZE' commands is essential to uncover what's happening with a query. This is particularly true with complex queries where optimization can get challenging.
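In a Rails console, Active Record exposes the query plan directly through `explain`. The model, association, and index name below are hypothetical; this is a sketch of the workflow, not code from a real application:

```ruby
# `explain` prints the database's query plan for the relation,
# revealing which index (if any) the engine chose.
Ticket.where(status: "open").joins(:comments).explain

# If the planner keeps picking a bad index, MySQL lets you pin one
# (index name here is an assumption):
Ticket.where(status: "open")
      .from("tickets USE INDEX (index_tickets_on_status)")
```

Forcing an index is a last resort; it silently becomes a liability when the data distribution changes, so document why it was needed.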
00:12:48.600
Another technique we frequently use at Zendesk is sharding. The idea is simple: if you have a massive dataset, what if you partition your database into smaller, manageable ones?
00:13:06.079
In certain cases, especially in our business, customer A doesn't care about customer B's data, so it makes sense to separate that data into distinct databases. For example, with modulo sharding, you assign each account to a shard based on its ID, so all of a customer's data lives in a single shard.
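Modulo sharding reduces to one line of arithmetic. The shard count of 16 is an illustrative assumption:

```ruby
# Minimal sketch of modulo sharding: every account maps to exactly
# one shard, so all of a customer's rows live together.
NUM_SHARDS = 16

def shard_for(account_id)
  "shard_#{account_id % NUM_SHARDS}"
end

puts shard_for(42)    # => "shard_10"
puts shard_for(1042)  # => "shard_2"
```

A known trade-off: changing `NUM_SHARDS` remaps nearly every account, which is why larger systems often prefer consistent hashing or an explicit account-to-shard lookup table.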
00:13:22.160
However, manually managing shards for large customers can become cumbersome. It’s crucial to have a systematic way to manage sharding instead of relying on manual fixes.
00:13:47.480
Moreover, there is a common Active Record pitfall I call 'select star' that needs addressing. Imagine running a query like 'User.where("id > 42")'. This selects all columns from the users table.
00:14:07.480
Sometimes, you may only need the ID or the title, but without realizing, you could be fetching gigabytes of unnecessary text or blobs. This is a frequent occurrence due to developers using Active Record without understanding the implications and complexities beneath.
00:14:37.960
Optimizing your queries by using 'select' to retrieve only the necessary attributes can yield significant performance benefits. Identifying these optimizations can lead to impressive gains, particularly on endpoints that handle considerable loads.
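As a sketch, assuming a hypothetical `User` model that carries a large text column, the difference looks like this (the SQL in comments is approximately what Active Record generates; exact quoting varies by adapter):

```ruby
# Loads every column for every matching row, large fields included:
User.where("id > 42")
# ~ SELECT users.* FROM users WHERE (id > 42)

# Selecting only what the endpoint needs avoids shipping the blobs:
User.where("id > 42").select(:id, :name)
# ~ SELECT users.id, users.name FROM users WHERE (id > 42)

# When you only need raw values, `pluck` also skips building
# model instances entirely:
User.where("id > 42").pluck(:id)
```

One caveat: accessing an attribute that was excluded from `select` raises `ActiveModel::MissingAttributeError`, so trim columns on endpoints you control end to end.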
00:14:54.360
Let’s go back to the concept of complex queries. Sometimes, developers run complicated queries several times in development and dismiss their performance implications based on the fact that it isn’t consistently slow.
00:15:21.120
The issue with complex database queries is their high variability: 99% of the time they might run smoothly, but the remaining 1% of the time they could cause a database meltdown, which is unacceptable.
00:15:47.080
Hence, I advise focusing on the 'p99' performance metrics, meaning keeping an eye on your system's most unreliable queries and making sure they perform as expected.
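A p99 over a window of latencies can be computed with the nearest-rank method. The sample data below is constructed to show why averages hide the problem:

```ruby
# Sketch: nearest-rank percentile over a window of latencies (ms).
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank - 1, 0].max]
end

# 98 fast requests and 2 pathological ones:
latencies = Array.new(98) { 30 } + [3000, 4000]
puts percentile(latencies, 50)  # => 30   (the median looks healthy)
puts percentile(latencies, 99)  # => 3000 (the tail tells the real story)
```

Monitoring backends compute percentiles for you, but the principle is the same: the median can look perfect while the p99 is melting your database.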
00:16:05.520
Another common technique worthy of mention is caching. I don’t want to focus only on traditional caching that serves frequently accessed data; instead, I would like to discuss specific caching techniques.
00:16:19.680
One technique we often use is called write-through caching. This is particularly useful for denormalized values that would otherwise require complex calculations on the fly.
00:16:41.560
For instance, imagine an endpoint that needs to execute very complex calculations that could be pre-calculated on a write action instead of on the read.
00:17:03.160
In this case, the customers don’t need to customize the filters, so you could pre-compute the calculations when adding data rather than calculating it every time a read occurs.
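The shape of the pattern can be shown with an in-memory example. The `TicketStats` class and the open-count metric are illustrative stand-ins for whatever expensive aggregate your read path needs:

```ruby
# Sketch of write-through caching: a derived value that is expensive
# to compute on read is kept current by the write path instead.
class TicketStats
  def initialize
    @tickets = []
    @cache = { open_count: 0 }  # maintained on every write
  end

  # Write path: store the ticket AND refresh the derived value.
  def add_ticket(status:)
    @tickets << { status: status }
    @cache[:open_count] += 1 if status == :open
  end

  # Read path: no scan over @tickets, just a lookup.
  def open_count
    @cache[:open_count]
  end
end

stats = TicketStats.new
stats.add_ticket(status: :open)
stats.add_ticket(status: :closed)
stats.add_ticket(status: :open)
puts stats.open_count  # => 2
```

The trade works because this data is read far more often than it is written, and because the read path cannot be customized per request, so there is exactly one value to precompute.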
00:17:27.520
This approach significantly reduces the load. We had a particular instance where a query that had a side load took 40 milliseconds, and by optimizing it with this technique, we reduced it to 5 milliseconds.
00:17:48.680
You might think that a 35-millisecond decrease seems negligible, but at scale, it isn't. For example, if that endpoint is called 50,000 times per day, the saving adds up to roughly a week of computation annually, which translates to considerable cost.
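The back-of-the-envelope arithmetic is worth making explicit (the call volume and per-call saving are the figures from the talk, treated here as illustrative):

```ruby
# What a 35 ms saving per call is worth at 50,000 calls per day:
calls_per_day      = 50_000
saved_ms_per_call  = 35

saved_seconds_per_year = calls_per_day * saved_ms_per_call / 1000.0 * 365
days = saved_seconds_per_year / 86_400
puts days.round(1)  # => 7.4 days of compute saved per year
```

Multiply that by hundreds of endpoints and the savings stop being an academic exercise.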
00:18:04.960
However, caching has its trade-offs. It adds complexity to your system, as you're potentially introducing another source of truth, duplicating data that could lead to inconsistencies later.
00:18:24.160
Returning to the theme of unconventional solutions, let me introduce you to my special guest tonight: Doraemon, a cartoon character from Catalan television that I watched growing up.
00:18:47.520
Doraemon is a futuristic robotic cat equipped with countless gadgets. His character embodies the idea of solving problems with fantastical solutions, similar to how software engineering sometimes offers seemingly miraculous fixes.
00:19:14.879
In each episode, Doraemon helps a clumsy boy named Nobita. The storyline begins with Nobita facing a relatable problem, and Doraemon provides a gadget that solves it. However, Nobita always misuses the gadget, leading to catastrophic results.
00:19:44.120
This mirrors how technologies that seem straightforward can be misused. For example, frequently used gems such as Active Record can be applied incorrectly, leading to performance degradation.
00:20:03.120
My advice is to be cautious and thoughtful in using technology—understand the consequences of your implementations. Now, there’s another powerful tool I want to discuss: additional storage.
00:20:35.640
In addition to caching and relational databases, introducing a third storage system may help alleviate database burdens. Why opt for additional storage? At Zendesk, we manage substantial amounts of data.
00:20:57.520
Currently, we struggle with over 20 billion tickets, including tables with trillions of rows, which complicates data management. One way to address this challenge is by archiving infrequently accessed data.
00:21:17.680
At Zendesk, once a ticket is closed for five months, it’s marked for archival. We have a background job that removes it from the relational database but transfers it to specialized storage.
00:21:38.640
Initially, this raises concerns about losing associations with the ticket data. Hence, we create a new object, an archive ticket, retaining only essential information and associations, which facilitates further data retrieval.
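The flow can be simulated in plain Ruby. The 150-day threshold approximates the five months mentioned above; the field names, in-memory stores, and stub shape are all illustrative assumptions about what a real archival job would do:

```ruby
# Sketch of the archival flow: tickets closed long enough move out
# of the primary store; a slim stub stays behind to keep the ID and
# associations resolvable.
ARCHIVE_AFTER_DAYS = 150

def archive_old_tickets!(primary, archive, now: Time.now)
  cutoff = now - ARCHIVE_AFTER_DAYS * 86_400
  to_move, keep = primary.partition do |t|
    t[:status] == :closed && t[:closed_at] < cutoff
  end
  to_move.each do |t|
    archive << t                              # full record, cheap storage
    keep << { id: t[:id], archived: true }    # stub kept in the primary DB
  end
  primary.replace(keep)
end

now = Time.now
primary = [
  { id: 1, status: :closed, closed_at: now - 200 * 86_400, body: "..." },
  { id: 2, status: :open,   closed_at: nil,                body: "..." },
]
archive = []
archive_old_tickets!(primary, archive, now: now)
puts archive.map { |t| t[:id] }.inspect  # => [1]
puts primary.size                        # => 2 (the open ticket plus a stub)
```

In production this runs as a background job, and the archive side is a storage system optimized for cheap, infrequent reads rather than another relational database.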
00:21:54.960
Most of our data at Zendesk is archival, representing around 90% of our total data. While maintaining multiple data sources introduces its own challenges, it is necessary for efficient data management.
00:22:16.720
It’s vital to comprehend how splitting data across different stores impacts accessibility and querying, especially as different technologies are utilized. Access patterns in product design might need revisiting due to these complexities.
00:22:36.080
I want to discuss a significant idea concerning product limits. I cannot emphasize enough how much regret we have experienced from not anticipating the need for product limits years ahead.
00:22:55.080
As a platform expands, identifying what users do becomes increasingly challenging. Sometimes you encounter misuse, even if from well-meaning users. Therefore, it is crucial to safeguard against this sort of situation.
00:23:15.880
Although setting such limits may not seem appealing—especially when dealing with significant clients—it's crucial to implement them to maintain system health.
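One common form of product limit is a per-account rate limit. Below is a minimal fixed-window sketch; the window size and threshold are illustrative, and a real deployment would keep the counters in a shared store like Redis rather than in process memory:

```ruby
# Sketch of a product limit: a fixed-window rate limiter per account.
class RateLimiter
  def initialize(limit:, window_seconds:)
    @limit = limit
    @window = window_seconds
    @counts = Hash.new(0)
  end

  # Allows up to @limit calls per account within each time window.
  def allow?(account_id, now: Time.now)
    key = [account_id, now.to_i / @window]
    @counts[key] += 1
    @counts[key] <= @limit
  end
end

limiter = RateLimiter.new(limit: 3, window_seconds: 60)
t = Time.now
attempts = 4.times.map { limiter.allow?(42, now: t) }
puts attempts.inspect  # => [true, true, true, false]
```

The point is less the algorithm than the policy: a documented, enforced ceiling protects every other customer sharing the same infrastructure.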
00:23:36.200
The next technique I want to address is what I call the 'incredibly enlarging payload.' This often refers to endpoints returning substantially more data than required. Over time, as applications evolve, additional attributes can work their way into existing endpoints.
00:24:03.360
Initially, an endpoint may return four relevant attributes, but over time it could balloon to 40 attributes, significantly increasing the payload, causing slower responses and additional complexity.
00:24:18.720
In these cases, it is advisable to keep responses lean and consider using tools like GraphQL to help in fetching only necessary data.
00:24:35.440
Another guideline is breaking down slower flows. For example, instead of making one call to retrieve a complex data set, consider separating the logic and making multiple calls.
00:24:48.480
While more calls may seem less efficient, it can help ensure the user is not blocked and that essential data is loaded faster, thus enhancing user experience.
00:25:05.040
Similarly, when it comes to write operations, consider making them asynchronous. Move heavy write operations to background jobs or use queues to enhance performance.
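The shape of an asynchronous write can be sketched with Ruby's built-in `Queue` and a worker thread, as an in-process stand-in for Sidekiq, Resque, or a Kafka consumer (the job payload here is hypothetical):

```ruby
# Sketch: moving a heavy write off the request path. The request
# handler enqueues and returns; a background worker does the work.
jobs = Queue.new
results = Queue.new

worker = Thread.new do
  while (job = jobs.pop)  # pop returns nil once the queue is closed
    # Imagine an expensive write here: denormalizing, indexing, fan-out...
    results << "processed ticket #{job[:ticket_id]}"
  end
end

# The request handler just enqueues and responds immediately:
jobs << { ticket_id: 7 }

msg = results.pop  # the work completes shortly after, off the hot path
puts msg           # => "processed ticket 7"

jobs.close
worker.join
```

The price is eventual consistency: a read issued immediately after the enqueue may not see the write yet, which is the shift in thinking mentioned above.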
00:25:25.040
This pattern, combined with distributed systems like Kafka queuing, can greatly improve efficiency, although it requires a shift in thinking regarding consistency.
00:25:41.680
Customers often adapt better than engineers think they will. Even if a response now takes a couple of seconds, customers usually adjust tolerantly.
00:26:03.360
As we reach the end, I would particularly emphasize the mindset necessary for a performance engineer. It’s essential to maintain a mindset similar to that of Alexander the Great.
00:26:24.160
In 333 BCE, when he faced the impossibility of untying the Gordian Knot, he made an unconventional choice and cut it instead. This challenges the notion of needing to solve problems in traditional ways.
00:26:44.560
Sometimes, adhering too rigidly to conventional solutions can obstruct progress. As software engineers, we can often devise non-linear solutions to obstacles we face.
00:27:03.440
Finally, performance engineering is about creating trade-offs. There is no such thing as a flawless system—only systems optimized for specific needs.
00:27:28.680
Striving for absolute perfection is counterproductive; it leads to burnout and delays in delivery. Fortunately, our community understands this concept, as we prioritize developer happiness.
00:27:47.680
I would like to conclude with a quote from Matz: 'The purpose of life is, at least in part, to be happy.' Based on this belief, we should remember that Ruby is designed not only to make programming easy but also enjoyable. Thank you!