Level Up Performance With Simple Coding Changes

Summarized using AI

David Henner • September 26, 2024 • Toronto, Canada

In this video titled "Level Up Performance With Simple Coding Changes," David Henner discusses how Zendesk significantly enhanced its application performance through straightforward Ruby on Rails techniques. He emphasizes that despite the complexity and challenges faced in application scaling, many performance gains can be achieved with small, targeted improvements.

Key points covered include:
- Historical Context: Henner provides an overview of Zendesk's 15-year journey, noting that for 12 years there was no dedicated performance team, which left a lot of potential for optimization.
- Sharding and Archiving: He stresses the importance of having a data management strategy to prevent bloating of databases, suggesting that unnecessary historical data should be archived or removed periodically.
- Cost Reductions: The performance enhancements led to significant reductions in infrastructure costs, potentially saving Zendesk tens of millions annually.
- Adding Observability: The introduction of Datadog for tracking performance marked a turning point, allowing the team to identify performance bottlenecks effectively.
- SQL Query Management: Henner highlights the need to monitor SQL queries and shows a test syntax for asserting how many queries a piece of code runs.
- Focus on Mundane Features: He argues that optimizing commonly used endpoints, even by milliseconds, can lead to large cumulative performance improvements.
- Caching Strategies: The application of effective caching methods is a recurring theme, and Henner explains how to strategically cache user objects to reduce redundant database queries.
- Handling Edge Cases: He provides insights into managing unexpected data inputs, such as formatting anomalies in email tickets, which can drastically influence processing times.
- Continuous Evaluation: The importance of revisiting and adjusting forced indexes as application needs evolve is underscored, suggesting that it's critical to adapt strategies over time for continued performance.

In conclusion, David Henner encourages developers to consistently evaluate their code performance, share insights about the implemented strategies, and understand that many of these optimizations arise from small changes rather than major overhauls. His detailed methodology offers practical advice for those looking to enhance their Ruby on Rails applications.

David Henner highlights some of the major improvements Zendesk has achieved using straightforward #Ruby techniques to get even better performance out of Rails, including data which illustrates how they saved thousands of years of processing time annually, leading to increased customer satisfaction and cost-effectiveness, as reflected in their AWS bills.

#railsatscale #rails #scaling

Thank you Shopify for sponsoring the editing and post-production of these videos. Check out insights from the Engineering team at: https://shopify.engineering/

Stay tuned: all 2024 Rails World videos will be subtitled in Japanese and Brazilian Portuguese soon thanks to our sponsor Happy Scribe, a transcription service built on Rails. https://www.happyscribe.com/

Rails World 2024

00:00:10.759 Okay, uh, first, we're going to start this. Everybody in the front row, I want you to all say hi to my mom because she needs a picture. Everybody ready? Three, two, one! There we go, thank you! My mom's going to be happy. Alright, my name is David Henner, and I'm going to be talking about performance today.
00:00:24.119 A lot of people have asked me in the last couple of days whether this tech talk is more geared towards junior people or more senior people. I've been telling people that it's probably more for juniors, but now that I think about it, I've seen a lot of your senior code out there too. You might know what I am going to present, but you probably don't do it. So it might be a reminder for those senior folks.
00:00:49.640 Um, I also need the clicker. When I started this talk, I wanted to ensure that my face was the same size as Aon. Spoiler: it is. I wanted to start with a whole bunch of little jokes, but then I realized that I don't have 45 minutes to talk; I have 30 minutes. Oh no, I don't want to get kicked off stage, because I think somebody might come up here and tackle me!
00:01:08.439 Anyways, I'm going to start by giving you a little history of Zendesk. Zendesk has been around for about 15 years or so. For the first 12 years, we had almost no dedicated performance team working on the code, so you can imagine there was a lot of low-hanging fruit. One of the things I need to talk about is that our endpoints aren't exactly what you might expect. They're a bit dirty, like some of yours. Our RESTful endpoints aren't really restful.
00:01:37.479 For example, in our codebase, we have endpoints that include user information in the ticketing response or comments. So, our actual endpoint response will change depending on what the user asks for. Technically, that means it’s not a RESTful response. There are many different responses that you might get depending on what the includes are. This will become really important at the end of the talk because we recently came up with ways of caching that complexity.
00:02:14.000 We've also diverged quite a bit from Ruby on Rails since then. But here's the work that my team has been doing. This is just three people, and this is P99 data, which refers to the worst customer experiences. You can see all these trends, and when we released code, each of these significant code changes made a huge difference in our infrastructure cost.
00:02:37.400 I can't tell you exactly how much our infrastructure cost improved, but think of it in the tens of millions, rather than a mere 50,000 or some smaller figure. We saved quite a bit with just three people and the help of the infrastructure team. You might be asking yourself how we achieved this. Well, the first thing I need to mention is sharding and archiving.
00:03:11.640 If you're not familiar with sharding or archiving, it's extremely important. Many people start with a small app and think, 'Okay, I'm just going to keep all my data in the main data store,' usually MySQL, and they never consider that this data becomes stale after a certain period. You should likely have an archiving strategy in place and even a complete removal strategy for data. Bring this up when you start your applications because implementing it later is generally harder.
00:03:53.400 You should be able to tell your customers, 'Hey, we're not going to keep your data for, let's say, 5 years.' For Zendesk, once a ticket is closed, most people don’t care about that ticket anymore after some time. We might say, 'You know what, let’s not keep that in MySQL; we’ll put it in DynamoDB or whatever store you might want to use.' This saves your application from having to run SQL queries against a massive data set.
00:04:30.400 How did we get here? When we started this team a few years ago, we didn't have a project manager, which was probably better because we let the data lead us instead of having a grand project. There are so many small things you can do to improve your app's performance. The first thing you really need is to add observability. At Zendesk, we use Datadog; I'm sure many of you do too, and probably have complaints about the cost.
00:05:01.280 Here’s what a typical trace used to look like before we started the team. We've added many details; anything yellow below the purple is a custom trace we added ourselves. Even so, this data left a lot of empty spaces, and filling them in would be costly due to Datadog's pricing model based on the number of traces. We needed some flexibility, and I will dive into the flexibility we incorporated very soon.
00:05:51.600 These Datadog traces are always a work in progress. New features will come up; you'll need to add more traces, especially for areas that become slow for some of our P99 customers. You can never really say that you’re done since adding this observability was absolutely crucial in finding the performance issues we faced.
00:06:34.639 Here's an example method; it's not a real method in our application, but I’m showing the syntax. You would have a method called 'rapy_dogging_trace' where you pass in a method name and, optionally, a parameter for what you want to label the trace in Datadog. The last parameter is for color coding, which we mainly keep consistent for our custom traces.
00:07:17.240 The syntax for block notation would look something like this, where you wrap the method call within the block. The drawback is that it leads to clutter in your code; you don't want these blocks everywhere. A more elegant way to handle tracing is to redefine the method in which you're wrapping the block, thus managing code cleanliness while still implementing the necessary traces.
00:08:26.639 For example, with our debugging traces, we redefine the method we want to trace and embed that block code so we can call the method cleanly. Before we began applying these traces, we accumulated a decent amount of data; but after refining our approach, we could dive straight into specific problem areas with better insights.
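The redefinition approach described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: `FakeTracer` stands in for a real APM client, and `Traceable` and `trace_method` are hypothetical names, not the helper shown in the talk.

```ruby
# Stand-in tracer (assumption): records span names and durations in memory
# instead of sending them to a real APM service.
module FakeTracer
  SPANS = []

  def self.trace(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    SPANS << [name, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
  end
end

# Hypothetical helper: `trace_method :foo` redefines `foo` so that every call
# runs inside a span, without wrapping each call site in a block.
module Traceable
  def trace_method(method_name, span_name = "#{name}.#{method_name}")
    wrapper = Module.new do
      define_method(method_name) do |*args, **kwargs, &block|
        FakeTracer.trace(span_name) { super(*args, **kwargs, &block) }
      end
    end
    prepend wrapper
  end
end

class TicketPresenter
  extend Traceable

  def render(ticket_id)
    "ticket-#{ticket_id}"
  end
  trace_method :render, "ticket_presenter.render"
end

result = TicketPresenter.new.render(42)
puts result
puts FakeTracer::SPANS.map(&:first).inspect
```

Callers keep writing `presenter.render(id)`; the tracing lives in one declaration next to the method definition, which is the cleanliness benefit the talk describes.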
00:08:54.839 Here’s where it gets interesting. While these traces give you good performance data, keep in mind these traces are not free! The block at the bottom might look like a single trace, but it actually generates multiple traces, which can add unnecessary overhead if mismanaged. I've made mistakes where I've added a plethora of traces that subsequently resulted in a significant increase in response time, and that’s something I won’t repeat.
00:09:56.360 Moving on, I want to touch upon testing for the number of SQL queries. I find that many applications overlook this aspect. At Zendesk, we employ a specific syntax to assert the number of SQL queries in our tests, which is crucial when maintaining application efficiency. You can also utilize open-source gems to manage this effectively.
00:11:06.120 The syntax could look something like this: you assert expected SQL queries against the actual queries executed, and if someone does make alterations to your code years later, they will encounter a failing test that holds them accountable. It’s essential to establish these expectations early, especially in caching scenarios, to ensure long-term consistency.
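A query-count assertion of this kind can be sketched as follows. The sketch is not Zendesk's syntax: `FakeConnection` and `assert_sql_queries` are hypothetical names used so the example runs without a database. In a Rails application, the list of executed queries is commonly collected by subscribing to the `sql.active_record` notification instead.

```ruby
# Stand-in for a database connection (assumption) that announces each query.
class FakeConnection
  def initialize
    @listeners = []
  end

  def subscribe(&listener)
    @listeners << listener
    listener
  end

  def unsubscribe(listener)
    @listeners.delete(listener)
  end

  def execute(sql)
    @listeners.each { |listener| listener.call(sql) }
    [] # a real connection would return rows
  end
end

# Hypothetical test helper: fails when the block runs a different number of
# queries than expected, and lists the queries to make the failure actionable.
def assert_sql_queries(connection, expected)
  executed = []
  listener = connection.subscribe { |sql| executed << sql }
  yield
  return executed if executed.size == expected

  raise "Expected #{expected} SQL queries, got #{executed.size}:\n#{executed.join("\n")}"
ensure
  connection.unsubscribe(listener)
end

db = FakeConnection.new
queries = assert_sql_queries(db, 2) do
  db.execute("SELECT * FROM tickets WHERE id = 1")
  db.execute("SELECT * FROM users WHERE id = 7")
end
puts queries.size
```

If a later change adds a third query inside the block, the helper raises and prints all three statements, which is the accountability the talk is after.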
00:11:57.560 Let’s transition to feature development. Many of you might be tempted to focus on features that provide dramatic changes in data for the sake of impressing your superiors, but it's the mundane endpoints receiving billions of requests each year that can lead to significant performance gains. For instance, if your application can save just one millisecond on a widely-used endpoint, it translates to days of processing time annually.
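The arithmetic behind that claim is easy to check. The request volume below is an assumed round figure (one billion requests per year), chosen only to match "billions of requests" in spirit; it is not a number from the talk.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Converts a per-request saving into days of processing time per year.
def days_saved_per_year(requests_per_year:, milliseconds_saved:)
  requests_per_year * milliseconds_saved / 1000.0 / SECONDS_PER_DAY
end

# Assumed volume: 1 billion requests per year, 1 ms saved on each.
days = days_saved_per_year(requests_per_year: 1_000_000_000, milliseconds_saved: 1)
puts format("%.1f days of processing time per year", days)
```

One billion requests times one millisecond is one million seconds, or roughly 11.6 days, and the figure scales linearly with both the request count and the saving.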
00:12:48.400 Now, here’s something interesting at Zendesk: we have what I would refer to as a god object concerning our account system. If you call for the account anywhere, whether it’s from a user’s account or a ticket's account, it references the same object. This allows for effective optimization through caching and reduces redundancy, particularly in instances where a single object ID can be referenced throughout.
00:13:19.760 When you're handling repeated queries for similar data, leveraging a single object for caching can significantly improve performance. We even modified our implementation to avoid unnecessary queries when fetching user emails, ensuring we cache effectively by referencing the same user object, thus maintaining state and saving on resources.
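The idea of always handing back the same object can be sketched with a small per-request store. `UserStore` and `User` are hypothetical names for illustration, and the loader block stands in for a database query.

```ruby
# Hypothetical per-request store: the first lookup for an id loads the record,
# and later lookups return the very same object without another load.
class UserStore
  attr_reader :loads

  def initialize(&loader)
    @loader = loader
    @users = {}
    @loads = 0
  end

  def fetch(id)
    @users.fetch(id) do
      @loads += 1
      @users[id] = @loader.call(id)
    end
  end
end

User = Struct.new(:id, :email)

# The block stands in for a database query (assumption).
store = UserStore.new { |id| User.new(id, "user#{id}@example.com") }

requester = store.fetch(7)
assignee = store.fetch(7)
puts requester.equal?(assignee)
puts store.loads
```

Because both lookups return one object, anything memoized on it, such as a loaded email address, is computed once per request rather than once per reference.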
00:14:10.720 Additionally, I want to address the common issue with returning arrays. While arrays can be useful, using a set or a hash can often lead to better performance. For instance, if you have to ensure uniqueness in a large dataset, using a set to gather unique values is far more efficient than calling 'uniq' on a huge array.
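A minimal sketch of gathering unique values into a set, using made-up tag data. The gain hedged here is mainly that no large intermediate array is built and then de-duplicated, and that later membership checks are hash lookups.

```ruby
require "set"

# Made-up sample data: tags attached to three tickets.
tag_lists = [%w[billing urgent], %w[urgent vip], %w[billing]]

# Duplicates are discarded as they arrive, so the collection never grows
# beyond the number of distinct values.
unique_tags = Set.new
tag_lists.each { |tags| unique_tags.merge(tags) }

puts unique_tags.size
puts unique_tags.include?("vip") # a hash lookup rather than an array scan
```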
00:14:44.000 In applications, I’ve often noticed developers using the 'flatten' method. However, I recommend switching to 'flat_map' where applicable; in every instance I've found, 'map.flatten' could be replaced with 'flat_map.' By doing so, you maintain the functionality while optimizing performance.
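A small illustration with made-up ticket data. One caveat worth keeping in mind: `flatten` without an argument flattens recursively, while `flat_map` removes only one level of nesting, so the swap is exact when the mapped values are flat arrays.

```ruby
# Made-up sample data.
tickets = [
  { id: 1, comments: ["Hi", "Any update?"] },
  { id: 2, comments: [] },
  { id: 3, comments: ["Thanks"] }
]

# Two passes: builds an array of arrays, then flattens it.
two_pass = tickets.map { |ticket| ticket[:comments] }.flatten

# One pass: concatenates each block result as it goes.
one_pass = tickets.flat_map { |ticket| ticket[:comments] }

puts one_pass.inspect
puts two_pass == one_pass
```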
00:15:42.200 At Zendesk, we receive a significant volume of customer tickets through email. This data can often include unexpected items, such as irregular email signatures with excessive spaces. We had to implement solutions to sanitize the inbound data to save on processing time, allowing us to focus on the necessary data only.
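The talk does not show how the inbound data is sanitized, so the following is only one possible sketch. `collapse_whitespace` is a hypothetical helper that shrinks long runs of spaces, tabs, and blank lines before any heavier processing runs.

```ruby
# Hypothetical sanitizer: collapses runs of spaces or tabs to a single space
# and runs of three or more newlines to one blank line.
def collapse_whitespace(body)
  body.gsub(/[ \t]{2,}/, " ").gsub(/\n{3,}/, "\n\n")
end

# Made-up email body with an oversized gap inside the signature.
raw = "Thanks,\n\n\n\n\nDana" + " " * 5_000 + "Support Team"
clean = collapse_whitespace(raw)

puts raw.length
puts clean.length
puts clean.inspect
```

Downstream steps then work on a few dozen characters instead of several thousand.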
00:16:01.600 Let’s discuss dropdown fields. Although we intentionally set limits on options, some customers go overboard, and the code ends up handling far more values than it was designed for. For instance, instantiating a significant number of Active Record objects due to dropdown values can considerably slow performance. Instead, using the 'pluck' method to retrieve necessary attributes can save processing time and memory by sidestepping the overhead involved in creating unnecessary Active Record instances.
00:17:02.720 Moreover, to effectively manage our application's request cycle, we have structured the ticket field store to enhance efficiency without redundantly fetching user objects multiple times through the process. By implementing a ticket field store to allow access to the necessary data without repetitive database hits, we managed to significantly decrease our processing time.
00:18:02.240 Finally, configuring strategies such as max-age can yield outstanding results. By utilizing a max-age setting in Rails, we eliminated unnecessary traffic. When a browser needs data that it already fetched within a specified time frame, let’s say 60 seconds, it reuses its cached copy and doesn't reach out to the server again; this can lead to significant reductions in processing load.
00:18:56.160 On the topic of stale, utilizing the 'stale?' mechanism in applications can greatly enhance efficiencies. At Zendesk, due to our complexities, we have to calculate the ETag at the end of our response cycle. While this can be counterproductive compared to optimal standards, we've developed ways to cache ETags to improve load times.
00:19:35.760 As an alternative, we developed a process that allows us to cache the ETag of a response, markedly improving our efficiency. Then when we encounter new requests, we can compare incoming ETags to the cached ones. If they match, we can skip a lot of unnecessary processing, returning a fast response without rendering new content.
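The compare-then-skip flow can be sketched in plain Ruby. Everything here is an assumption made for illustration: `ETAG_CACHE` is an in-memory stand-in for a shared cache, `respond` is a hypothetical helper, and the cache key is assumed to change whenever the underlying data changes, which is what keeps a cached ETag from going stale.

```ruby
require "digest"

# In-memory stand-in (assumption) for a shared cache.
ETAG_CACHE = {}

def etag_for(body)
  %("#{Digest::MD5.hexdigest(body)}")
end

# Returns [status, etag, body]. The block renders the response and is only
# called when the incoming ETag does not match the cached one.
def respond(cache_key, if_none_match)
  cached_etag = ETAG_CACHE[cache_key]
  return [304, cached_etag, nil] if cached_etag && cached_etag == if_none_match

  body = yield
  etag = ETAG_CACHE[cache_key] = etag_for(body)
  [200, etag, body]
end

renders = 0
render = -> { renders += 1; '{"tickets":[1,2,3]}' }

# Hypothetical key that embeds a data version.
key = "tickets/index/account-1/v5"

first_status, etag, _body = respond(key, nil, &render)
second_status, _etag, second_body = respond(key, etag, &render)

puts first_status
puts second_status
puts renders
```

The second request is answered with 304 Not Modified from a single cache lookup, and the expensive render runs only once.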
00:20:36.840 Utilizing cache strategies is paramount to optimizing performance. Make sure to engage your team to establish a robust caching strategy; it’s something that requires careful planning and significant discussion.
00:21:30.160 For example, if you read the same user object from the cache multiple times in a single index action, it can lead to inefficiencies. Instead, use 'read_multi' to retrieve everything at once, reducing overhead and leveraging smart caching techniques.
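The difference can be illustrated without Rails. `CountingCache` is a hypothetical stand-in that counts round trips; its `read_multi` mirrors the shape of the Rails cache method of the same name (several keys in, one hash of hits out), but it is not the Rails implementation.

```ruby
# Stand-in for a remote cache (assumption) that counts network round trips.
class CountingCache
  attr_reader :round_trips

  def initialize(data)
    @data = data
    @round_trips = 0
  end

  def read(key)
    @round_trips += 1
    @data[key]
  end

  def read_multi(*keys)
    @round_trips += 1
    @data.slice(*keys)
  end
end

entries = { "user/1" => "Ada", "user/2" => "Grace" }
user_ids = [1, 2, 1, 1, 2] # the same users appear on many tickets

# One read per ticket: five round trips.
naive_cache = CountingCache.new(entries)
user_ids.each { |id| naive_cache.read("user/#{id}") }

# De-duplicate the keys, then fetch them together: one round trip.
batched_cache = CountingCache.new(entries)
keys = user_ids.uniq.map { |id| "user/#{id}" }
users = batched_cache.read_multi(*keys)

puts naive_cache.round_trips
puts batched_cache.round_trips
puts users.inspect
```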
00:21:40.240 Lastly, let's touch on the necessity to revisit forced indexes. Over time, application changes may necessitate adjustments to indexing strategies, as outdated forced indexes can slow down performance significantly. In this past year, a big gain came from removing a forced index on our ticketing endpoints, emphasizing the need for ongoing evaluations of existing strategies.
00:22:31.120 In closing, I encourage everyone to continue evaluating your approaches and refine them where necessary to ensure optimal performance outcomes. Thank you for your attention; I truly hope you found these insights helpful.