00:00:10.759
Okay, before we start: everybody in the front row, I want you all to say hi to my mom, because she needs a picture. Everybody ready? Three, two, one! There we go, thank you! My mom's going to be happy. Alright, my name is David Henner, and I'm going to be talking about performance today.
00:00:24.119
A lot of people have asked me in the last couple of days whether this tech talk is more geared towards junior people or more senior people. I've been telling people that it's probably more for juniors, but now that I think about it, I've seen a lot of your senior code out there too. You might know what I am going to present, but you probably don't do it. So it might be a reminder for those senior folks.
00:00:49.640
I also need the clicker. When I started this talk, I wanted to ensure that my face was the same size as Aon. Spoiler: it is. I wanted to start with a whole bunch of little jokes, but then I realized that I don't have 45 minutes to talk; I have 30 minutes. I don't want to get kicked off stage, because I think somebody might come up here and tackle me!
00:01:08.439
Anyways, I'm going to start by giving you a little history of Zendesk. Zendesk has been around for about 15 years or so. For the first 12 years, we had almost no dedicated performance team working on the code, so you can imagine there was a lot of low-hanging fruit. One of the things I need to talk about is that our endpoints aren't exactly what you might expect. They're a bit dirty, like some of yours. Our RESTful endpoints aren't really restful.
00:01:37.479
For example, in our codebase, we have endpoints that include user information or comments in the ticketing response. So our actual endpoint response changes depending on what the user asks for. Technically, that means it's not a RESTful response. There are many different responses you might get depending on what the includes are. This will become really important at the end of the talk because we recently came up with ways of caching that complexity.
00:02:14.000
We've also diverged quite a bit from Ruby on Rails since then. But here's the work that my team has been doing. This is just three people, and this is P99 data, meaning 99th-percentile response times, which reflect the worst customer experiences. You can see all these trends, and each significant code change we released made a huge difference in our infrastructure cost.
00:02:37.400
I can't tell you exactly how much our infrastructure cost improved, but think of it in the tens of millions, rather than a mere 50,000 or some smaller figure. We saved quite a bit with just three people and the help of the infrastructure team. You might be asking yourself how we achieved this. Well, the first thing I need to mention is sharding and archiving.
00:03:11.640
If you're not familiar with sharding or archiving, it's extremely important. Many people start with a small app and think, 'Okay, I'm just going to keep all my data in the main data store,' usually MySQL, and they never consider that this data becomes stale after a certain period. You should likely have an archiving strategy in place and even a complete removal strategy for data. Bring this up when you start your applications because implementing it later is generally harder.
00:03:53.400
You should be able to tell your customers, 'Hey, we're not going to keep your data for more than, let's say, 5 years.' For Zendesk, once a ticket is closed, most people don't care about that ticket anymore after some time. We might say, 'You know what, let's not keep that in MySQL; we'll put it in DynamoDB or whatever store you might want to use.' This saves your application from having to run SQL queries against a massive data set.
00:04:30.400
How did we get here? When we started this team a few years ago, we didn't have a project manager, which was probably better because we let the data lead us instead of having a grand project. There are so many small things you can do to improve your app's performance. The first thing you really need is to add observability. At Zendesk, we use Datadog; I'm sure many of you do too, and probably have complaints about the cost.
00:05:01.280
Here's what a typical trace used to look like before we started the team. We've added many details since; anything yellow below the purple is a custom span we added. Still, this data left a lot of empty spaces, and filling them all in would be costly because Datadog's pricing model is based on the number of traces. We needed some flexibility, and I will dive into the flexibility we incorporated very soon.
00:05:51.600
These Datadog traces are always a work in progress. New features will come up, and you'll need to add more traces, especially for areas that become slow for some of our P99 customers. You can never really say that you're done. Adding this observability was absolutely crucial in finding the performance issues we faced.
00:06:34.639
Here's an example method; it's not a real method in our application, but I’m showing the syntax. You would have a method called 'rapy_dogging_trace' where you pass in a method name and, optionally, a parameter for what you want to label the trace in Datadog. The last parameter is for color coding, which we mainly keep consistent for our custom traces.
00:07:17.240
The syntax for block notation would look something like this, where you wrap the method call within the block. The drawback is that it clutters your code; you don't want these blocks everywhere. A more elegant way to handle tracing is to redefine the method so that its body runs inside the tracing block, which keeps the call sites clean while still producing the traces you need.
00:08:26.639
For example, with our debugging traces, we redefine the method we want to trace and embed that block code so we can call the method cleanly. Before we began applying these traces, we accumulated a decent amount of data; but after refining our approach, we could dive straight into specific problem areas with better insights.
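A minimal sketch of the method-redefinition idea, not the helper shown on the slide. `FakeTracer`, `Traceable`, `trace_method`, and `TicketPresenter` are hypothetical names; a real application would call its APM client where `FakeTracer.trace` is used.

```ruby
# Hypothetical stand-in for a tracing client; it only records span names and timings.
module FakeTracer
  SPANS = []

  def self.trace(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    SPANS << { name: name, seconds: elapsed }
  end
end

# Hypothetical helper: `trace_method :foo` wraps `foo` so every call runs inside a span.
module Traceable
  def trace_method(method_name, label: nil)
    span_name = label || "#{name}##{method_name}"
    wrapper = Module.new do
      define_method(method_name) do |*args, **kwargs, &block|
        # `super` reaches the original method because the wrapper is prepended.
        FakeTracer.trace(span_name) { super(*args, **kwargs, &block) }
      end
    end
    prepend wrapper
  end
end

class TicketPresenter
  extend Traceable

  def render(ticket_id)
    "ticket-#{ticket_id}"
  end
  trace_method :render, label: "ticket_presenter.render"
end

# Call sites stay free of tracing blocks.
TicketPresenter.new.render(42)
```

Prepending a module is one way to redefine a method without touching its body; aliasing the original method would work as well.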
00:08:54.839
Here’s where it gets interesting. While these traces give you good performance data, keep in mind these traces are not free! The block at the bottom might look like a single trace, but it actually generates multiple traces, which can add unnecessary overhead if mismanaged. I've made mistakes where I've added a plethora of traces that subsequently resulted in a significant increase in response time, and that’s something I won’t repeat.
00:09:56.360
Moving on, I want to touch upon testing for the number of SQL queries. I find that many applications overlook this aspect. At Zendesk, we employ a specific syntax to assert the number of SQL queries in our tests, which is crucial when maintaining application efficiency. You can also utilize open-source gems to manage this effectively.
00:11:06.120
The syntax could look something like this: you assert expected SQL queries against the actual queries executed, and if someone does make alterations to your code years later, they will encounter a failing test that holds them accountable. It’s essential to establish these expectations early, especially in caching scenarios, to ensure long-term consistency.
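A rough sketch of what such an assertion can look like, not Zendesk's actual helper. `FakeConnection` and `assert_sql_queries` are hypothetical; in a Rails test suite the count would typically come from subscribing to the `sql.active_record` notification rather than from a fake connection.

```ruby
# Hypothetical stand-in for a database connection that logs every statement.
class FakeConnection
  attr_reader :log

  def initialize
    @log = []
  end

  def execute(sql)
    @log << sql
    []
  end
end

class QueryCountError < StandardError; end

# Hypothetical helper: raises when the block runs a different number of
# queries than expected, and lists the queries that actually ran.
def assert_sql_queries(connection, expected)
  before = connection.log.size
  result = yield
  executed = connection.log.drop(before)
  unless executed.size == expected
    raise QueryCountError,
          "expected #{expected} queries, got #{executed.size}:\n#{executed.join("\n")}"
  end
  result
end

db = FakeConnection.new
assert_sql_queries(db, 2) do
  db.execute("SELECT * FROM tickets WHERE id = 1")
  db.execute("SELECT * FROM users WHERE id = 7")
end
```

If a later change adds a third query inside the block, the assertion fails and the message shows exactly which statements ran.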
00:11:57.560
Let’s transition to feature development. Many of you might be tempted to focus on features that provide dramatic changes in data for the sake of impressing your superiors, but it's the mundane endpoints receiving billions of requests each year that can lead to significant performance gains. For instance, if your application can save just one millisecond on a widely-used endpoint, it translates to days of processing time annually.
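To make the arithmetic concrete, here is a small calculation. The request volume is an illustrative assumption, not a Zendesk figure.

```ruby
# Assumed volume: one billion requests per year to a single endpoint.
requests_per_year = 1_000_000_000
saved_ms_per_request = 1

saved_seconds = requests_per_year * saved_ms_per_request / 1000.0
saved_days = saved_seconds / 86_400.0

puts format("%.1f days of processing time saved per year", saved_days)
# prints "11.6 days of processing time saved per year"
```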
00:12:48.400
Now, here’s something interesting at Zendesk: we have what I would refer to as a god object concerning our account system. If you call for the account anywhere, whether it’s from a user’s account or a ticket's account, it references the same object. This allows for effective optimization through caching and reduces redundancy, particularly in instances where a single object ID can be referenced throughout.
00:13:19.760
When you're handling repeated queries for similar data, leveraging a single object for caching can significantly improve performance. We even modified our implementation to avoid unnecessary queries when fetching user emails, ensuring we cache effectively by referencing the same user object, thus maintaining state and saving on resources.
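One way to picture the "same object everywhere" idea is a request-scoped registry that hands back one instance per ID. This is a sketch under that assumption; `IdentityMap` and `Account` are hypothetical names, and the talk does not describe Zendesk's implementation at this level.

```ruby
# Hypothetical record type for the example.
Account = Struct.new(:id)

# Hypothetical request-scoped registry: each ID maps to exactly one object,
# and the loader block runs only on the first lookup.
class IdentityMap
  attr_reader :loads

  def initialize(&loader)
    @loader = loader
    @objects = {}
    @loads = 0
  end

  def fetch(id)
    @objects.fetch(id) do
      @loads += 1
      @objects[id] = @loader.call(id)
    end
  end
end

accounts = IdentityMap.new { |id| Account.new(id) }

from_user = accounts.fetch(1)    # first lookup runs the loader
from_ticket = accounts.fetch(1)  # second lookup reuses the same object

from_user.equal?(from_ticket)    # same instance, so anything memoized on it is shared
```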
00:14:10.720
Additionally, I want to address a common issue with returning arrays. While arrays can be useful, using a set or a hash can often lead to better performance. For instance, if you have to ensure uniqueness in a large dataset, using a set to gather unique values as you go is more efficient than building a huge array and calling 'uniq' on it.
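A small illustration of the two approaches, using made-up tag data. The gain depends on data size and access pattern, so it is worth benchmarking on real inputs.

```ruby
require "set"

# Hypothetical input: tag lists from several tickets, with repeats.
ticket_tags = [%w[billing urgent], %w[urgent vip], %w[billing vip refund]]

# Array approach: build one big array, then de-duplicate at the end.
via_array = ticket_tags.flatten.uniq

# Set approach: duplicates are dropped as they are added, so the large
# intermediate array is never built.
via_set = Set.new
ticket_tags.each { |tags| via_set.merge(tags) }

# Membership checks on a set are hash lookups rather than linear scans.
via_set.include?("vip")
```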
00:14:44.000
In applications, I've often noticed developers using the 'flatten' method. I recommend switching to 'flat_map' where applicable; in every instance of 'map.flatten' I've found, it could be replaced with 'flat_map.' You keep the same functionality while avoiding the intermediate array.
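A short comparison with made-up ticket data. Note that `flat_map` flattens only one level, so it is a drop-in replacement for `map { ... }.flatten` only when the block returns flat arrays.

```ruby
tickets = [
  { id: 1, comments: ["hi", "thanks"] },
  { id: 2, comments: [] },
  { id: 3, comments: ["reopened"] }
]

# Two passes: builds an array of arrays, then flattens it.
with_flatten = tickets.map { |t| t[:comments] }.flatten

# One pass: concatenates each block result as it goes.
with_flat_map = tickets.flat_map { |t| t[:comments] }
```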
00:15:42.200
At Zendesk, we receive a significant volume of customer tickets through email. This data can often include unexpected items, such as irregular email signatures with excessive spaces. We had to implement solutions to sanitize the inbound data to save on processing time, allowing us to focus on the necessary data only.
00:16:01.600
Let's discuss dropdown fields. Although we intentionally set limits on options, some customers go overboard and create far more than we expected. Instantiating a large number of ActiveRecord objects for those dropdown values can slow performance considerably. Instead, using the 'pluck' method to retrieve only the attributes you need saves processing time and memory by sidestepping the overhead of creating unnecessary ActiveRecord instances.
00:17:02.720
Moreover, to manage our application's request cycle efficiently, we built a ticket field store that gives access to the necessary data without repeated database hits, so we no longer fetch the same user objects multiple times during a request. This significantly decreased our processing time.
00:18:02.240
Finally, configuring strategies such as max-age can yield outstanding results. By setting a max-age in Rails, we eliminated unnecessary traffic. When a browser asks for the same data again within the specified time frame, let's say 60 seconds, it reuses its cached copy and doesn't reach out to the server at all; this can lead to significant reductions in processing load.
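A minimal sketch of the header involved, written as a bare Rack-style callable so it runs without Rails. The payload and the `private` directive are assumptions for illustration; in a Rails controller, `expires_in 60.seconds` sets a comparable `Cache-Control` header.

```ruby
# Lifetime taken from the 60-second example above.
MAX_AGE_SECONDS = 60

# Rack-style app: a callable that returns [status, headers, body].
app = lambda do |_env|
  headers = {
    "content-type" => "application/json",
    # "private" keeps shared proxies from storing a per-user response.
    "cache-control" => "private, max-age=#{MAX_AGE_SECONDS}"
  }
  [200, headers, ['{"status":"open"}']]
end

status, headers, _body = app.call({})
puts headers["cache-control"]
# prints "private, max-age=60"
```

Within those 60 seconds the browser answers repeat requests from its own cache, so the trade-off is that users may see data that is up to a minute old.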
00:18:56.160
On the topic of 'stale?', using the stale mechanism in Rails can greatly improve efficiency. At Zendesk, due to our complexities, we have to calculate the ETag at the end of our response cycle. That is counterproductive, because most of the work has already been done by then, so we've developed ways to cache ETags to improve load times.
00:19:35.760
As an alternative, we developed a process that caches the ETag for a response, markedly improving our efficiency. When new requests come in, we compare the incoming ETag to the cached one. If they match, we can skip a lot of unnecessary processing and return a fast response without rendering new content.
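The comparison step can be sketched as follows. This is an illustration of the general technique, not Zendesk's implementation: `ETAG_CACHE`, `etag_for`, and `conditional_get` are hypothetical names, and the sketch leaves out invalidation, which a real system needs so that a cached ETag is dropped or its key changes whenever the underlying data changes.

```ruby
require "digest"

# Hypothetical in-memory store: request cache key => last ETag computed for it.
ETAG_CACHE = {}

def etag_for(body)
  %("#{Digest::MD5.hexdigest(body)}")
end

# Returns [status, headers, body]. The block that renders the response is only
# called when the client's If-None-Match value does not match the cached ETag.
def conditional_get(cache_key, if_none_match)
  cached = ETAG_CACHE[cache_key]
  if cached && cached == if_none_match
    return [304, { "etag" => cached }, []]
  end

  body = yield
  etag = etag_for(body)
  ETAG_CACHE[cache_key] = etag
  [200, { "etag" => etag }, [body]]
end

# First request renders and stores the ETag; the repeat request skips rendering.
_status, first_headers, = conditional_get("tickets/1?include=users", nil) { '{"id":1}' }
repeat_status, = conditional_get("tickets/1?include=users", first_headers["etag"]) { '{"id":1}' }
```

Including the requested includes in the cache key is one way to handle endpoints whose response shape varies per request.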
00:20:36.840
Utilizing cache strategies is paramount to optimizing performance. Make sure to engage your team to establish a robust caching strategy; it’s something that requires careful planning and significant discussion.
00:21:30.160
For example, if you read cached user objects one at a time in a single index action, every read is a separate round trip to the cache, which adds up. Instead, use 'read_multi' to retrieve them all at once and cut that overhead.
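A sketch of the difference in round trips. `CountingCache` is a hypothetical in-memory stand-in written for this example; Rails cache stores expose `read` and `read_multi` with the same call shapes used here.

```ruby
# Hypothetical cache client that counts round trips.
class CountingCache
  attr_reader :round_trips

  def initialize(data)
    @data = data
    @round_trips = 0
  end

  def read(key)
    @round_trips += 1
    @data[key]
  end

  # Returns a hash containing only the keys that were found.
  def read_multi(*keys)
    @round_trips += 1
    keys.each_with_object({}) do |key, found|
      found[key] = @data[key] if @data.key?(key)
    end
  end
end

cache = CountingCache.new({ "user/1" => "Ada", "user/2" => "Grace", "user/3" => "Linus" })
keys = %w[user/1 user/2 user/3]

one_by_one = keys.map { |key| cache.read(key) }  # three round trips
all_at_once = cache.read_multi(*keys)            # one round trip
```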
00:21:40.240
Lastly, let's touch on the necessity to revisit forced indexes. Over time, application changes may necessitate adjustments to indexing strategies, as outdated forced indexes can slow down performance significantly. In this past year, a big gain came from removing a forced index on our ticketing endpoints, emphasizing the need for ongoing evaluations of existing strategies.
00:22:31.120
In closing, I encourage everyone to continue evaluating your approaches and refine them where necessary to ensure optimal performance outcomes. Thank you for your attention; I truly hope you found these insights helpful.