Database Performance

Summarized using AI

A Rails performance guidebook: from 0 to 1B requests/day

Cristian Planas and Anatoly Mikhaylov • November 24, 2023 • Bern, Switzerland

In the presentation titled 'A Rails performance guidebook: from 0 to 1B requests/day' by Cristian Planas and Anatoly Mikhaylov at the Helvetic Ruby 2023 event, the speakers address the crucial question of whether Ruby on Rails can scale effectively for high-traffic applications. They begin by debunking the myth that Rails cannot scale, highlighting that many successful companies, including Shopify and Zendesk, rely on Rails to manage billions of requests daily.

Key points discussed include:
- Reality of Scaling with Rails: While there is a perception that Rails doesn’t scale, many organizations successfully utilize it at scale.
- Journey through Time: Cristian shares his personal experiences in scaling applications and the challenges he faced, including his transition from novice to tech lead.
- Data Volume Challenges: Zendesk processes an overwhelming amount of data with over 20 billion support tickets and significant dataset growth, necessitating effective performance monitoring and error budgeting.
- Performance Monitoring: The importance of monitoring is emphasized as a means to identify performance issues, combined with techniques like sampling to manage cost.
- Database Performance: Anatoly discusses the growing challenges with database management at Zendesk, stressing that application performance issues often stem from underlying database problems. They developed a three-step approach for right-sizing databases and optimizing costs.
- Caching Strategies: Effective use of caching techniques, such as write-through caching, is presented as a method to drastically reduce latency in critical application parts.
- Doraemon Paradox: The talk introduces an allegory using the cartoon cat Doraemon to illustrate the challenges of technology misuse and the need to understand our tools properly.
- Cold Storage: The benefits of cold storage for data archiving are explored to maintain manageable data sizes while ensuring historical data access when necessary.
- Establishing Product Limits: Setting reasonable limits on product usage to prevent excessive resource consumption is highlighted as a key scalability strategy.
- User Experience: Improving performance considerations should also incorporate user experience, particularly with UI interactions.

The conclusion emphasizes an acceptance of trade-offs in performance engineering, advocating for embracing imperfection while optimizing for developer happiness, which ultimately leads to more productive systems. The sentiment captured reflects that successful scaling is about strategic choices and understanding the limits of technology rather than a pursuit of an elusive perfect solution.

A Rails performance guidebook: from 0 to 1B requests/day
Cristian Planas and Anatoly Mikhaylov • November 24, 2023 • Bern, Switzerland

Building a feature is not good enough anymore: all your work won't be of use if it's not performant enough. So how to improve performance? After all, performance is not an easy discipline to master: all slow applications are slow in their own way, meaning that there is no silver bullet for these kinds of problems.

In this presentation, we will guide you through patterns, strategies, and little tricks to improve performance. We will do that by sharing real stories from our daily experience facing and solving real performance issues in an application that serves billions of requests per day.

Helvetic Ruby 2023

00:00:06 Um hello, my name is Cristian and this is Anatoly. We are two engineers at a company called Zendesk.
00:00:14 It's a pretty big company. Who here uses Zendesk? A few people, nice! So, we are going to... well, first of all, I would like to say that this is our first presentation. We've presented in a few places before.
00:00:25 But the point is that I am personally very sorry for what is about to happen. Hopefully, you will forgive us. First of all, I wanted to comment on the topic itself: why this presentation? Well, this is an important presentation because we talk about one of the biggest questions for humankind: Does Rails scale?
00:00:42 And I have something to tell you. Some of you here have a secret. Some of you went to Computer Science school in college. Who went? Okay, and some didn't. For those who went to school, we have a secret that we don't share.
00:01:11 I'm going to take you down memory lane. Do you remember your first day when you arrived at school? On that first day, they taught you two very important ideas that have served me very well throughout my career. One was, 'Ruby is dead.' Believe me, I've been doing Ruby for 15 years. It was dead 15 years ago, so it's really dead now.
00:01:36 The second one, and the one we'll talk about now, is 'Rails doesn’t scale.' This is known.
00:02:00 People talk about this all the time. A couple of months ago, Programming Twitter was again discussing this due to a tweet. I don’t know if you remember what happened. DHH, our fearless leader, decided to remove TypeScript from, I think, the Basecamp repo.
00:02:19 If that wasn't provocative enough, he decided to publish it on Twitter, which caused a flame war. As you know, DHH is the founder of Rails, and people started attacking Ruby.
00:02:43 I found this tweet that I found very funny. All languages have some kind of deep insecurity, and the one in the Ruby community is that 'Ruby doesn't scale.' It's funny because similar languages, like Python, are a bit faster but are kind of in the same ballpark, I would say.
00:02:56 However, people don’t talk that much about this, but with Ruby, it’s all the time that 'Ruby doesn’t scale,' and 'Don’t start your startup in Ruby; it’s going to fail.'
00:03:06 The reality is that this is just not true. There are plenty of companies that use Rails. Some of them don’t use it anymore, but for example, Shopify uses it, and so does Zendesk.
00:03:30 It just scales. There are companies working at scale with it; it’s just a fact. It’s not debatable, really.
00:03:59 Moreover, if you want to hear more about that, it’s actually a good day because today is Black Friday. Last year, Shopify made a thread about all the scalability and performance levels that they achieved with a Rails monolith.
00:04:14 So I think I've already proved that this idea doesn't really make sense. There are people using Rails serving 75 million requests per minute.
00:04:32 So where does this idea come from? I have an explanation, but for that, you will have to time travel a little bit with me. I don't know if you know this guy; he's called Doraemon.
00:05:05 But don't worry; later we will talk about Doraemon. First, we'll travel to 2010. That was me, actually. Well, that was not me in 2010, but I like this picture, and I was writing my first app.
00:05:21 It was a very small magazine, a movie magazine. This magazine needed authentication. I needed the reviewers of the movies to have their own back offices and write reviews.
00:05:45 The thing is, authentication is a hard problem. Later on, I became the tech lead for the authentication team at Zendesk, and just encryption is mind-boggling – it’s a really hard problem.
00:06:01 But at 20 years old, I had no idea. So I thought, 'Okay, I’m going to do this deep investigation.' I checked everywhere I could, and of course, I found Devise.
00:06:23 I read the documentation, ran a couple of lines, ran a migration, if I remember correctly, and that was it. It worked.
00:06:37 Authentication is a hard problem, but for 20-year-old me, it felt super easy, and I was the master of the universe. It felt really simple.
00:07:07 The second part of my explanation requires us to move again in time to 2012. I’m using this image because I really like it. In 2012, I had the important role of Chief Technical Officer at a company called PlayfulBet.
00:07:21 I achieved this position being the only engineer. This made it even more challenging because we had to scale. We became very popular among young people in Spain.
00:07:48 We were, for a while, one of the applications with the most traffic in Spain, and I was the only engineer. Scaling is a problem, and I hope that’s probably why you are interested in this presentation.
00:08:00 So again, I followed the same strategy: do a deep investigation using the best sources I could find, and of course, I found the gem ActiveScale. Who here uses ActiveScale in production?
00:08:21 Well, okay, surprising! So I read the documentation and decided to go with the setting 'scale: sufficient, but not too much.' And later, you can change the parameter to, for example, 'go bananas.'
00:08:37 And that was it. Everything was beautiful, and nothing went wrong. Thank you. I told you at the beginning that I was sorry for what was going to happen.
00:08:53 Okay, that's not true. ActiveScale doesn’t exist. Actually, I just changed the title with Chrome, but my point here is that scaling is a hard problem.
00:09:10 But it's a problem that feels hard. You don't get a three-line solution like with Devise. I have absolutely no idea what encryption is, yet I have a really consistent solution in my system.
00:09:29 So the conclusion is not that Rails doesn’t scale; it’s that Rails is so good that one of the few problems left for you to solve is scaling.
00:09:50 But how do we make a Zendesk scale? Before going to that, I want to give some general ideas about the presentation.
00:10:06 We are going to be very general; we won't get very deep into anything, as we don’t have the time. We will give you general ideas and illustrate them with issues we found while scaling Zendesk.
00:10:28 This is because we found that there are no silver bullets. If someone comes to me after this and asks, 'How can I make my system scale?', I have no idea; I don't know your system.
00:10:45 It could be a database issue, but there are so many reasons why a system doesn't scale. As Anatoly says, every poorly performing system performs poorly in its own way.
00:11:06 Moreover, you might be asking why I think I’m qualified to comment on this. Well, at Zendesk, we manage certain performance problems.
00:11:28 Zendesk is a customer support product. Basically, users send complaints about anything—'I bought this mattress, and it's broken,' or whatever. You send an email to the company, and this generates what is called a ticket.
00:11:49 This data gives you an idea: we currently have over 20 billion tickets. Just to give you an idea, I could provide 2.5 tickets for each person on Earth, including babies, and there would still be some left.
00:12:01 Moreover, this dataset grows; around a third of the tickets were created in the last year. This has obvious effects on our system, particularly in relational databases, and we have other kinds of stores.
00:12:17 We probably have over 400 terabytes, or more like half a petabyte at this point.
00:12:33 But I want to clarify that this isn’t all our data. We have a lot of different data types, not just in the relational database. Just to give you an idea, we have a model that will have trillions of these objects.
00:12:49 We have something we use for auditing—trillions. Good luck running a query on a trillion objects. But okay, those were the general ideas I wanted to give, and now we finally get into the topic.
00:13:05 The first thing you need to know about performance is monitoring. The reason for this is that you cannot solve a problem if you don’t know you have it.
00:13:17 Sometimes you may be surprised by all the things, like, 'Oh, yeah, this endpoint is slow.' Well, actually, no, maybe you have another endpoint that is way worse.
00:13:29 So using monitoring is very important. However, you will find that when you have a very big system, monitoring can become super expensive.
00:13:45 Sometimes we found that if we misconfigured our database instance, we would exhaust all the money of the company easily.
00:14:06 One thing that works very well for monitoring performance is sampling. When you are logging, you need everything because there could be an error, so it's important to have all the data.
00:14:22 But for performance, what you want is to detect patterns. So, 10%, or even sometimes 1% of the data, can work well enough for you to diagnose what's happening in the system.
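The sampling idea can be sketched in plain Ruby. This is illustrative, not Zendesk's actual instrumentation: only a configurable fraction of requests is recorded, and each recorded sample carries a weight so that aggregates scale back up.

```ruby
SAMPLE_RATE = 0.1 # record ~10% of requests

# Record a timing sample with probability `rate`; the weight scales the
# sample back up so aggregates computed from it remain comparable.
def record_timing(endpoint, ms, sink, rate: SAMPLE_RATE)
  return unless rand < rate
  sink << { endpoint: endpoint, ms: ms, weight: 1.0 / rate }
end

sink = []
10_000.times { record_timing("tickets#show", 42, sink) }
# Roughly 1,000 of the 10,000 requests end up recorded.
```

With 10% sampling you store an order of magnitude less data, yet latency patterns (percentiles, slow endpoints) remain visible.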
00:14:42 With all that data, the next step is doing an error budget. If you are familiar with the concept of error budgeting, you may be wondering how it is related to performance.
00:14:59 Error budgeting is usually about uptime. For example, we aim for 99.99% uptime. But you can also define an error budget for latency. You can say, 'Okay, our system may have four nines of uptime, but if my endpoint is returning a 200 after one minute, maybe it's not really working.'
00:15:31 We really practice this at Zendesk, not with one-minute metrics, but with more reasonable times.
00:15:47 On this slide we have three different endpoints from our application. The left column shows the traditional uptime, the amount of time our application is up.
00:16:07 The center column shows what we have defined as an acceptable response time; I don't remember the exact number, but maybe we're talking about 500 milliseconds for a show endpoint, or 300 milliseconds.
00:16:26 And the right column shows the optimal response time.
00:16:47 I know it sounds like a difficult debate about what a good or acceptable amount of time should be, but the more important part is to have a guiding star so you can detect if something goes wrong.
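The latency error budget described above could be sketched like this. The 500 ms threshold and the 99.99% target are illustrative assumptions, not Zendesk's real numbers; the key point is that a request only counts as "good" if it both succeeds and answers in time.

```ruby
GOOD_LATENCY_MS = 500 # assumed "acceptable" threshold, not a real Zendesk figure

# A 200 that takes a minute does not spend uptime budget,
# but it does spend the latency budget.
def latency_budget(requests, threshold_ms: GOOD_LATENCY_MS, target: 0.9999)
  good  = requests.count { |r| r[:status] < 500 && r[:ms] <= threshold_ms }
  ratio = good.to_f / requests.size
  { good_ratio: ratio, within_budget: ratio >= target }
end

requests = Array.new(9_998) { { status: 200, ms: 120 } }
requests += [{ status: 200, ms: 60_000 }, { status: 200, ms: 60_000 }]
report = latency_budget(requests)
# All 10,000 requests returned 200, yet the latency budget is blown.
```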
00:17:01 And next, Anatoly will explain how to maintain those error budgets.
00:17:07 Yes, is everybody awake? That's good! I have some interesting slides about database performance.
00:17:23 I started to work with Ruby in 2008. Does anybody remember Ruby 1.8.7 and Rails 2.0.2? Okay, this talk is for you. No, I'm kidding; it's for everybody.
00:17:37 Let's talk about database performance at Zendesk and what kind of problems we are trying to solve.
00:18:02 Why do we even talk about databases in this talk? We have cost optimization to consider, and this is what we'll keep in mind while reviewing the next 20 slides.
00:18:21 What are Zendesk's current database challenges? We have a rapid dataset growth. For example, we created over 30% of tickets in the past two years, which is a lot.
00:18:44 We run our databases on AWS, including MySQL and Postgres, and more recently Redis.
00:19:04 We saved costs while maintaining performance. Let me ask you this: does anybody think that application performance problems stem from database performance problems? No?
00:19:27 Let me prove that it is the case. If you have an application the size of Zendesk, imagine having a full table scan that selects an account from a table with 100 rows, and then you have the same query for 100 million rows.
00:19:50 Does anybody think the response time would be the same? This is the hint that will help you understand.
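A toy illustration of why the same query gets slower as the table grows: a full table scan touches every row until it finds a match, while an index lookup (modeled here with a Hash, roughly what a B-tree index gives you) stays near constant time. This is a plain-Ruby sketch, not real database code.

```ruby
def full_scan(rows, id)
  rows.find { |r| r[:id] == id } # cost grows with the number of rows
end

def index_lookup(index, id)
  index[id]                      # ~O(1), like a B-tree / hash index lookup
end

small = Array.new(100)     { |i| { id: i } }
big   = Array.new(100_000) { |i| { id: i } }
index = big.to_h { |r| [r[:id], r] }

# full_scan(big, 99_999) walks 100,000 rows;
# index_lookup(index, 99_999) touches one entry.
```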
00:20:11 Next: working with databases as large as Zendesk's today means spiky workloads that you can't simply scale your way out of. We learned the hard way that spiky workloads only get higher and higher.
00:20:31 The more hardware you put towards the problem, the more it can actually degrade performance rather than improve it.
00:20:54 The volume of data does change query performance. If you have an application like Zendesk, with billions of records in the database, you will have performance issues. So here are two pills to choose from.
00:21:08 What do you think may happen with the database if you downscale? Will performance get better or worse?
00:21:23 Okay, let's move on. We developed a three-step approach that has been efficient: right-sizing the database.
00:21:34 This often means downscaling, detecting the problems, and then reacting. First, large infrastructure, with many CPU cores and a lot of memory, prevents you from seeing the root causes of many performance problems.
00:21:49 You might think it's an application problem, but it’s actually a database problem.
00:22:06 So if you have a spiky workload, as you will see, you'll understand what I'm talking about.
00:22:23 Is anybody familiar with this type of situation where the database is working fine, it's successful, and performant, and then suddenly you don't have resources left available?
00:22:37 What happens if you upscale or throw more hardware at it? What if your cloud provider doesn't have anything larger than what you already have?
00:22:54 What if this line is the maximum limit they can give you? You come to them, saying 'I need more,' and you hear silence.
00:23:06 This is the type of spiky workload that can prevent you from right-sizing your infrastructure and hide problems.
00:23:24 But to see this problem, you have to zoom in really close, to a window of maybe nine minutes or less, because you don't see this problem happening throughout the day.
00:23:42 It doesn’t happen consistently; when it happens, you don’t know why.
00:24:01 Next slide shows you an example of something that happened two months ago. We ran a very interesting experiment at Zendesk.
00:24:20 We knew that the second largest application at Zendesk had performance issues, particularly around memory drops that happened once or twice a week.
00:24:35 We didn’t know exactly why this happened, but we had some theories. On the graph you see the freeable memory, which is the amount of memory available for your database instance.
00:24:54 The first segment of the graph shows no load on the database: no traffic, no queries. A flat line is expected.
00:25:10 Then you start seeing drops. We started to move some customers and see performance issues, but they didn’t happen too often; they happened once a week.
00:25:29 Each drop indicates a database crash. When a database crashes, you can expect some application failures. Does anybody still think that a database problem isn't an application problem?
00:25:47 For any application, if you have Service Level Indicators (SLIs) and Service Level Objectives (SLOs) and they aren't green, but red, the problem is often not the application logic or business logic but the data structure.
00:26:09 Each drop shows a database crash. In the middle of the screen, you see the line that indicates the first downscale.
00:26:27 We downscaled again before we saturated the databases, to the point that problems started occurring every day, not every week, and we used downscaling as a microscope to see what exactly was happening.
00:26:50 Why did these issues arise, and what do we do about it? At this point, it became super clear that the problem wasn’t with the application; it was with the database.
00:27:05 What were we going to do with the database? The solution was to change the application, but we needed that microscope to understand the problem first.
00:27:23 Downscaling from 16x to 4x then became the obvious solution, and if you know the price of AWS Aurora, you will understand how much of a difference that makes.
00:27:43 Of course, downscaling is not a solution to every problem. If you have a compute-intensive workload, where utilization is flat and you don't have any spikes, upscale as much as you want and you will get predictable performance.
00:28:03 But if you have IO-intensive applications, think twice before throwing more hardware at the problem.
00:28:29 Also, I have some theoretical material behind this exercise, so it's not just random. Let me show you the Pareto principle, the 80/20 rule.
00:28:47 Does anybody in the room have a stateless application without a database? Just a few? Well, think about a database as a closed system with a bounded set of problems.
00:29:05 You don't have to solve them all; you only have to solve part of the problems to achieve cost reductions.
00:29:18 If you care about the cost, that’s fantastic. But if you don’t care about the cost, you may care about performance. 20% of your queries consume 80% of the database resources.
00:29:39 If the database is overloaded and spikes are frequent, it's not all queries causing it; it's just 20%. And not everything in that 20% is something your application really needs.
00:29:58 You don't need all queries; some of them you could potentially cut and stop running altogether, achieving performance gains and cost improvements.
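The 80/20 analysis could be sketched like this: group sampled queries by fingerprint and rank fingerprints by total database time consumed. The query strings and timings here are invented to show a typical Pareto-shaped distribution.

```ruby
# Return the top `fraction` of query fingerprints by total time consumed.
def top_consumers(samples, fraction: 0.2)
  totals = samples.group_by { |s| s[:fingerprint] }
                  .transform_values { |group| group.sum { |s| s[:ms] } }
  ranked = totals.sort_by { |_fingerprint, ms| -ms }
  ranked.first((ranked.size * fraction).ceil)
end

samples =
  [["SELECT * FROM tickets WHERE account_id = ?", 900],
   ["SELECT * FROM tickets WHERE account_id = ?", 850],
   ["SELECT 1", 2],
   ["SELECT * FROM users WHERE id = ?", 5],
   ["UPDATE tickets SET status = ?", 40],
   ["SELECT COUNT(*) FROM events", 90]]
    .map { |fingerprint, ms| { fingerprint: fingerprint, ms: ms } }

worst = top_consumers(samples)
# => [["SELECT * FROM tickets WHERE account_id = ?", 1750]]
```

In practice the samples would come from slow-query logs or the sampled monitoring data discussed earlier.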
00:30:12 That’s my part. Okay, so I'm back. The most important part of this presentation was the database work.
00:30:32 However, there are other techniques that can have a significant impact. The first one is caching. I think we’re all familiar with caching.
00:30:49 The idea is to have a smaller, faster storage layer, in memory or even a key-value store; there are multiple ways to do it. This makes things faster, and you use it to save certain data that you access more often.
00:31:08 One variation of this idea that has been very useful for scaling at Zendesk is write-through caching.
00:31:26 The idea here is to refresh the cache on the write instead of the read. This may sound incredibly wasteful, because at face value you are just duplicating all the data in the cache.
00:31:41 Yet, this is not supposed to be done for all writes, only for certain normalized objects. Imagine you have an account object, and this account has a calculation of customers that have been active in the last month.
00:31:55 This query can be super expensive. However, what if every time the customer became active, you would calculate this and write it to the cache?
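A minimal write-through sketch of that account example, with a plain Hash standing in for a real cache store such as Rails.cache or memcached (the class and method names are illustrative): the expensive count is recomputed on the write path, so the read path is a single cache hit.

```ruby
class AccountStats
  def initialize(cache:)
    @cache  = cache # stands in for Rails.cache / memcached
    @events = Hash.new { |h, k| h[k] = [] }
  end

  # Write path: record the activity AND refresh the cached count
  # (the write-through part).
  def record_activity(account_id, customer_id, at: Time.now)
    @events[account_id] << { customer: customer_id, at: at }
    @cache["active_count:#{account_id}"] = expensive_count(account_id)
  end

  # Read path: one cache hit instead of the expensive query.
  def active_count(account_id)
    @cache.fetch("active_count:#{account_id}", 0)
  end

  private

  # Stands in for the "super expensive" active-customers query.
  def expensive_count(account_id)
    cutoff = Time.now - 30 * 24 * 3600
    @events[account_id].select { |e| e[:at] >= cutoff }
                       .map { |e| e[:customer] }.uniq.size
  end
end
```

The trade-off mentioned later applies here: every write now does double work, and the cached value can drift if a write path forgets to refresh it.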
00:32:16 The idea here is denormalizing. This is a real example of an optimization we did on the ticket endpoint, the most important endpoint at Zendesk.
00:32:29 We had a sideload called SLAs, which basically told you how much time an agent had to solve a ticket. It was a little expensive, adding around 40 milliseconds per query.
00:32:45 Just using caching, we were able to reduce it from 40 to 5 milliseconds. Again, while 40 milliseconds may not sound significant, at scale, it absolutely adds up.
00:33:05 The other day, I heard a story about Meta. Some engineers there resolved a backend issue just by slightly optimizing some code, saving the company $2 million.
00:33:29 This can happen; in our case, this endpoint was being hit 50,000 times per day. At roughly 40 milliseconds saved per hit, over a year that adds up to about eight days less of computing.
00:33:51 AWS isn't cheap. Of course, this has trade-offs; adding caching adds complexity. You have duplicated data, and you need to perform double writes.
00:34:12 The point is that it is not trivial. Finally, with this, we reach the point many of you have been waiting for: we're going to talk now about what I call the Doraemon Paradox.
00:34:26 First, let's unveil this mystery. Who is Doraemon? Okay, very few people here know who Doraemon is; this is fascinating. I've been giving this talk at different places.
00:34:45 In Poland, almost nobody knew who he was. I did it in Bangkok, and it was crazy. They tweeted about me talking about Doraemon at the conference.
00:35:09 But let me explain who he is. Doraemon is a Japanese cartoon character, a robot cat from the 22nd century who keeps many machines in his four-dimensional pocket.
00:35:29 These machines are like magical items—he can fly, transport things, and do whatever, really. For me, what fascinated me when I was little was the four-dimensional pocket.
00:35:46 Doraemon can fit almost anything in this pocket. It's a bit like Ruby, you know? The point here is that Doraemon really works as an allegory for software engineering, because every episode shares the same structure.
00:36:03 Doraemon lives in 20th-century Japan with his friend Nobita, who is a kid. Nobita is kind of a disaster: he's clumsy and is always running into problems, whether with crushes or academics.
00:36:23 Then he asks Doraemon for a machine that will fix his problems. Doraemon hesitates but gives him the machine, and it works for a while.
00:36:42 However, Nobita, not being very smart, uses the machine in a way that he shouldn’t, leading to a bigger disaster than the one he originally had.
00:37:04 I say this because I feel this happens a lot in technology, particularly with high-level technologies like Ruby and Rails. We start misusing them in ways we don’t understand until everything breaks.
00:37:25 The problem seems to fall on the technology itself, not on us. For example, I believe we’re the only community that embraces metaprogramming to a point that it becomes chaotic.
00:37:44 When I started with Ruby, I was hacking method missing all the time, which is quite a ridiculous idea. The point is that we should use technology wisely.
00:38:02 I’m not saying you have to read the codebase of all the gems we import—that would be great—but at least read the documentation.
00:38:19 Another idea is cold storage. Is anyone familiar with cold storage?
00:38:37 Okay, that's good. Actually, you might get a little angry with this concept...
00:38:50 because it is yet another storage system, after I've been talking about keeping complexity down. However, I believe I have a good point.
00:39:06 We have a lot of data at Zendesk—20 billion tickets, trillions of objects in one table—that's extremely problematic.
00:39:18 What if there was a way that those trillions of objects wouldn’t exist in the table anymore? We can send them somewhere else, which would be ideal.
00:39:35 The basic idea of cold storage is that not all data is equal. Some data is read often—it’s hot data—and some data is kept for historical purposes.
00:39:54 What we do at Zendesk is send old data to cold storage. For example, picture a mini Zendesk, where a user has many tickets, and tickets have events, an audit trail.
00:40:12 After some time, we literally mark the data as ready for archival. Our current policy is that when a ticket is closed, after a couple of months, we mark it for archival.
00:40:30 A daily job processes the data marked for archival, which involves removing it from the relational database. But don't be scared.
00:40:52 We move it to a specialized cold storage system—currently DynamoDB. This way, our tables stay small because we have less data.
00:41:07 Of course, you might think this is quite dangerous in many ways. To minimize the danger, we create a different type of ticket called an archive ticket.
00:41:27 This is a much thinner version of the original ticket, keeping associations or basic information like the title, while most data remains in cold storage.
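The archival flow might be sketched like this, with in-memory structures standing in for the relational database and DynamoDB. The schema, field names, and retention figure are illustrative, not Zendesk's actual implementation: closed tickets past the retention window move to cold storage, leaving a thin archive ticket behind.

```ruby
RETENTION = 60 * 24 * 3600 # archive closed tickets after ~two months

# `hot` stands in for the relational database, `cold` for DynamoDB.
def archive_closed_tickets!(hot, cold, now: Time.now)
  cutoff = now - RETENTION
  hot.select { |t| t[:status] == :closed && t[:closed_at] < cutoff }.each do |ticket|
    cold[ticket[:id]] = ticket # full record, events and all, goes cold
    hot.delete(ticket)
    # Thin "archive ticket": basic info stays queryable in the hot store.
    hot << { id: ticket[:id], title: ticket[:title], archived: true }
  end
end

now  = Time.now
hot  = [
  { id: 1, title: "Broken mattress", status: :closed,
    closed_at: now - 90 * 24 * 3600, events: ["created", "solved"] },
  { id: 2, title: "Login issue", status: :open, closed_at: nil }
]
cold = {}
archive_closed_tickets!(hot, cold, now: now)
```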
00:41:47 This technique is super important at Zendesk. A majority of our data is in cold storage, around nine times more than in hot storage.
00:42:05 If you think about it, this has trade-offs; you are introducing multiple data sources, but essentially you're still dealing with the same data.
00:42:20 This might require changes in access patterns; for example, imagine an API endpoint that returns all tickets suddenly needing to query two different storages.
00:42:35 That might be complex, especially if you’re trying to order a billion tickets across different databases.
00:42:49 Let's discuss changing the product to make it scalable. Performance is not only about solving problems marked as engineering problems; it's also about negotiating the product itself.
00:43:05 Fortunately, there is something very simple to do that helps a lot. It’s a technique that helps a lot to scale: establishing product limits.
00:43:30 When a platform gets to a certain size, you have absolutely no idea what’s happening in terms of resource allocation.
00:43:51 I was reading an article stating that the majority of internet traffic is between bots, and I tend to believe it.
00:44:12 Moreover, you find customers that take your product in ways you never intended. For example, we’re a customer support company, but they might use our platform to make a logging system.
00:44:30 Again, I understand that product limits are not sexy, but you have to understand something.
00:44:50 Product limits already exist in all of your applications, whether you set them or not. Applications run in the material world, which has limitations.
00:45:06 If you don't set a limit, the effective limit is the point where the database melts down. This happens to everyone.
00:45:25 Of course, the question then becomes: What is a good product limit? If you don’t yet have a clear idea, imagine the largest possible use for your application.
00:45:48 If you think nobody will have more than 100 users in an account, then set the product limit at 1,000.
00:46:09 When someone tries to create 10,000 users, you’ll be very pleased you have that limit.
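A sketch of such a product limit check, with grandfathering for accounts that were already over the limit when it was introduced. The 1,000-user figure is the talk's example, and all names here are hypothetical, not a real Zendesk limit or API.

```ruby
MAX_USERS_PER_ACCOUNT = 1_000 # the talk's example figure

LimitExceeded = Class.new(StandardError)

# Accounts that exceeded the limit before it existed keep their
# old, higher ceiling (grandfathering).
def check_user_limit!(account)
  limit = account[:grandfathered_limit] || MAX_USERS_PER_ACCOUNT
  return if account[:user_count] < limit
  raise LimitExceeded, "account #{account[:id]} reached its user limit (#{limit})"
end

check_user_limit!(id: 1, user_count: 120)                                # fine
check_user_limit!(id: 2, user_count: 9_000, grandfathered_limit: 15_000) # fine
# check_user_limit!(id: 3, user_count: 10_000) would raise LimitExceeded
```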
00:46:29 This situation occurs repeatedly. The tricky issue is when existing customers are already past a reasonably high limit.
00:46:41 If you introduce the product limit later, you need to grandfather them in for the future.
00:46:59 You'll end up having customers working in ways that are quite strange.
00:47:21 Next, let's look at the ever-growing payload issue. I think everyone at a company has seen this.
00:47:36 You have endpoints returning way too much data. Imagine a startup where you create an endpoint that is thin and returns four attributes per object.
00:47:57 It goes really fast, but then you start adding features and attributes, and suddenly, this endpoint returns 40 attributes per object.
00:48:18 That can make it slow, just too complex to fully understand what the endpoint does, and you are sending megabytes around the internet.
00:48:36 There are solutions to this. You should define the object schema explicitly, returning only what’s needed; anything else should be optional.
00:49:00 Another solution is to consider GraphQL.
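The explicit-schema idea can be sketched as a whitelist serializer: core attributes by default, extras only when the client asks for them, similar in spirit to sparse fieldsets or GraphQL field selection. Field names here are made up for illustration.

```ruby
CORE_FIELDS = %i[id subject status created_at].freeze

# Return only the core schema by default; anything else must be
# requested explicitly via `include:`.
def serialize_ticket(ticket, include: [])
  ticket.slice(*CORE_FIELDS, *include)
end

ticket = { id: 7, subject: "Broken mattress", status: "open",
           created_at: "2023-11-24", priority: "high",
           sla_breach_at: nil, internal_notes: "private" }

serialize_ticket(ticket).keys
# => [:id, :subject, :status, :created_at]
serialize_ticket(ticket, include: [:priority]).keys.size
# => 5
```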
00:49:20 The final technique I want to comment on product-side concerns is breaking down slow workflows.
00:49:41 You can do this on the read or write side. With reads, if you have a slow endpoint that returns a lot of data, consider dividing it into three new endpoints.
00:50:04 You might think this is a silly idea, as you'll likely have to rerun the same logic for each of those three new endpoints.
00:50:22 But the reason I recommend this is that you need data fast in a UI, so users can interact with it. You don't want users to wait several seconds for obscure data.
00:50:40 Dividing requests like this improves user experience, and for most products, user experience is king.
00:50:57 For the write side, keep the basic part synchronous and handle the complex parts asynchronously.
00:51:12 This idea has spread throughout the community; many engineers widely embrace Sidekiq, and at Zendesk we use Resque.
00:51:30 The key idea is to keep the base functionality synchronous and make the rest asynchronous, which often leads to distributed systems.
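The synchronous-core / asynchronous-rest split might look like this, with a plain Ruby Queue standing in for a Sidekiq or Resque backend. The job names and ticket shape are illustrative.

```ruby
JOBS = Queue.new # stands in for a Sidekiq/Resque job queue

# Synchronous core: persist the ticket and return immediately.
# Everything slow (notifications, search indexing) is enqueued.
def create_ticket(params)
  ticket = { id: params[:id], subject: params[:subject] }
  JOBS << [:notify_agents, ticket[:id]]
  JOBS << [:reindex_search, ticket[:id]]
  ticket
end

# A worker process would drain the queue off the request path.
def drain(queue)
  jobs = []
  jobs << queue.pop until queue.empty?
  jobs
end

create_ticket(id: 1, subject: "Broken mattress")
drain(JOBS)
# => [[:notify_agents, 1], [:reindex_search, 1]]
```

The request path stays fast, at the cost of the eventual consistency discussed next: the search index and notifications lag the write slightly.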
00:51:49 For example, we use Kafka; every time there's an update in our database, it generates events processed by our monolith or separate microservices.
00:52:02 However, one thing I wanted to mention is that eventual consistency can be challenging.
00:52:23 Ironically, the people facing challenges with eventual consistency are often developers.
00:52:38 Developers tend to see systems differently from customers. A few months ago, we enabled a new feature that made certain parts of the system asynchronous.
00:52:57 Advocates were quite anxious about it. They said, 'Let’s see what’s going to happen and we’ll give you feedback.'
00:53:13 After a month, advocates came back and asked when we were enabling it—no one realized any change had occurred.
00:53:34 Okay, we are nearing the end, and I want to discuss what I believe is the right mindset for an engineer dealing with performance and scalability problems.
00:53:48 It's fairly simple: you need to be like Alexander the Great. Let me explain this—it's a quick story.
00:54:05 In 333 BC, Alexander was at war with the Persians and crossed into Gordium, where there was a legend about an impossible knot.
00:54:20 The legend claimed that whoever untied the knot would become the king of Asia. Alexander, having imperial ambitions, decided to give it a try.
00:54:40 However, after trying for quite some time, he couldn't, and eventually took his sword and cut it.
00:54:58 I admit that when I read this story as a child, I thought it meant you could cheat. But as I grew up and became an engineer, I understood.
00:55:20 There's no requirement for how to solve a problem; I think we engineers do this a lot.
00:55:42 We face a problem and impose our assumptions about how a solution must work. This is especially true when we converse with other engineers.
00:56:05 We often have different perspectives on how to solve the same problem, particularly with unconventional data solutions, because in computer science school we are taught the ideal of the immaculate, perfectly normalized relational database.
00:56:27 But trust me, you shouldn't worry about duplicating data; caching is acceptable. You need to be bold and fix the problems.
00:56:41 Moreover, it is essential to understand that engineering work, especially performance work, is about trade-offs.
00:57:01 You can often make one area run much faster by making another area operate a little slower, and that’s okay.
00:57:20 The other day, I spoke with an engineer who said they found a way to make consumption more reliable, but it increased deployment time from two to ten minutes.
00:57:39 I exclaimed, 'Who cares about a few more minutes of deployment time when we deploy twice a week?'
00:58:00 Trade-offs are inevitable.
00:58:20 Lastly, I want to emphasize that there is no such thing as a perfect system—only systems that fit the current needs of the business best.
00:58:43 I believe that an excessive pursuit of perfection can be counterproductive.
00:59:05 The good news is that in the Ruby community, we are in a great position regarding that. We are the only community that optimizes for developer happiness.
00:59:25 A happy developer becomes more productive and builds better systems. Finally, I wanted to leave you with a few words from Matz.
01:00:00 I believe that the purpose of life is partly to be happy. Based on this belief, Ruby is designed to make programming not only easy but also fun. Thank you.