Database Optimization

Cache is King: Get the Most Bang for Your Buck From Ruby

by Molly Struve

In the video "Cache is King: Get the Most Bang for Your Buck From Ruby" presented at RubyConf 2018 by Molly Struve, the focus is on enhancing application performance through effective caching strategies in Ruby. Struve, a senior site reliability engineer at Kenna Security, highlights that while optimizing slow queries is important, the quantity of fast queries can also significantly burden data stores like MySQL, Elasticsearch, and Redis, impacting performance.

Key points discussed in the video include:

- Initial Optimizations: Struve begins by sharing how her team improved performance by optimizing long-running MySQL queries and Elasticsearch searches, but still faced high loads due to excessive datastore hits.

- Concept of Local Caching: Using a relatable analogy, she explains the importance of local caching by comparing frequent database queries to repeatedly asking a volunteer for their name versus writing it down. By caching data locally, applications can retrieve it significantly faster, reducing the load on data sources.

- Bulk Serialization: Struve describes her experience with serializing over 200 million vulnerabilities a day and how switching to bulk serialization from individual queries dramatically reduced the number of database calls made, decreasing serialization time from six seconds to less than one second per batch.

- Local Caching with Redis: The talk then explores reducing GET requests to Redis by caching client index names in a local Ruby hash, which sped up vulnerability indexing by roughly 65%.

- Sharding Database and Gem Awareness: Struve talks about sharding their MySQL database by client to reduce load and the importance of understanding how gems interact with database connections to avoid unnecessary external requests.

- Preventing Unnecessary Database Hits: Finally, she emphasizes how to prevent unexpected database hits by ensuring checks are in place in the code to avoid executing unnecessary queries, ultimately enhancing overall application speed.

In conclusion, Struve advocates for using simple Ruby methods to reduce network latency and improve application performance. Key takeaways from the video include the significance of bulk processing, leveraging local caching, and understanding database interactions within Ruby applications to enhance efficiency. By applying these strategies, developers can create more responsive applications while minimizing load on their data stores.

00:00:15.740 Okay, as Valerie mentioned, my name is Molly Struve. Welcome to 'Cache is King'.
00:00:21.180 I am a senior site reliability engineer at Kenna Security, and today I'm going to show you how you can use Ruby to improve your application's performance.
00:00:28.560 When most people hear 'site reliability engineer,' they don’t necessarily think of Ruby. Instead, they think of someone working with MySQL, Redis, AWS, Elasticsearch, PostgreSQL, or some other third-party service to ensure that everything runs smoothly.
00:00:34.590 And that is, in fact, a large part of what we do. Kenna's site reliability team is still relatively young; we are about a year old. When we first formed, you can bet we did exactly what you'd expect.
00:00:49.770 We went and found those long-running, horrible MySQL queries and optimized them by adding missing indexes, using select statements to pull back only the data we needed, eliminating N+1 queries, and processing things in small batches to ensure we were never pulling too many records from the database at once.
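To make those kinds of fixes concrete, here is a rough sketch of what they typically look like in a Rails app; the table, column, and model names are illustrative, not Kenna's actual schema.

```ruby
# Hypothetical examples of the query clean-up described above.

# Add a missing index so lookups on a hot column stop scanning the table.
class AddClientIdIndexToVulnerabilities < ActiveRecord::Migration[5.2]
  def change
    add_index :vulnerabilities, :client_id
  end
end

# Eager load associations to avoid N+1 queries, select only the columns we
# need, and walk the table in small batches instead of loading it all at once.
Vulnerability.where(client_id: client.id)
             .includes(:custom_fields)
             .select(:id, :client_id, :risk_score)
             .find_in_batches(batch_size: 1_000) do |batch|
  batch.each do |vulnerability|
    # ... per-record work ...
  end
end
```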
00:01:02.370 We also had Elasticsearch searches that were constantly timing out, so we rewrote them to ensure they could finish successfully. We even overhauled how our background processing framework, Resque, was talking to Redis.
00:01:14.130 All of these changes led to big improvements in performance and stability. However, even with all these things cleaned up, we were still seeing high loads across all of our data stores, and that's when we realized something else was going on.
00:01:33.689 Rather than just tell you what was happening, I want to demonstrate it. So, for this, I need a volunteer. I know you guys are shy, but don’t worry; this is going to be really easy.
00:02:00.509 Thank you, sir. Trust me; it’ll be good. What's your name? Excellent! Okay, stand right there. Everyone, this is... What's your name?
00:02:19.819 Oh, alright, I think I got it. Let's try that one more time. What's your name? Okay, okay, one last time; what's your name?
00:02:34.459 About how annoyed are you right now? A little annoyed? How annoyed would you be if I asked you your name a million times?
00:02:46.010 Exactly! It would be pretty annoying. Not to mention, we'd be here all night long. So what’s one easy thing I could have this person do so that I don’t have to keep asking for their name? Here’s a hint: it involves a pen and a piece of paper.
00:03:03.080 Yes, someone shouted it out — bingo! Okay, if you could write your name down right there. That's totally fine. Beautiful!
00:03:20.810 Now that I have Bill's name written on this piece of paper, whenever I need it, all I have to do is read the piece of paper instead of explicitly asking Bill every time.
00:03:41.269 This is exactly what it's like for your data source. Imagine I'm your Ruby application; Bill is your data store. If your Ruby application has to ask your data store millions of times for information, eventually your data store is going to get overwhelmed.
00:03:53.359 Not to mention, it’s going to take time to do it. Instead, if you make use of a local cache, which is essentially what this piece of paper is doing, your application can retrieve the information much faster and save your data store a lot of headache.
00:04:06.889 Thank you, Bill. Let's give him a round of applause.
00:04:12.680 The moment we realized that at Kenna, the quantity of our datastore hits was wreaking havoc on our data sources, that was a big 'aha!' moment for us. We immediately started trying to figure out how we could decrease the number of datastore hits our application was making.
00:04:31.040 Before I dive into the awesome ways we used Ruby to do this, I want to provide some background on Kenna so you have context for the stories I’m going to share.
00:04:42.770 Kenna helps Fortune 500 companies manage their cybersecurity risk. The average company has 60,000 assets, which can be anything with an IP address. In addition, the average company has about 24 million vulnerabilities; a vulnerability is essentially any way you can hack an asset.
00:04:55.400 With all of this data, it can be extremely difficult for companies to know what they need to focus on and fix first, and that's where Kenna comes in. We take all this data and run it through our proprietary algorithms, which tell our clients what vulnerabilities pose the biggest risk to their infrastructure.
00:05:14.150 Now, when we initially get this asset and vulnerability data, the first thing we do is put it into MySQL. MySQL is our source of truth. From there, we then index it into Elasticsearch, which allows our clients to slice and dice their data any way they need to.
00:05:35.090 However, when indexing assets and vulnerabilities into Elasticsearch, we have to serialize them, which leads me to my first story about serialization. Since we serialize vulnerabilities the most, I want to focus on that.
00:06:00.560 When we first started serializing vulnerabilities for Elasticsearch, we were using Active Model Serializers. For those unfamiliar, Active Model Serializers hook right into your Active Record objects, so all you have to do is literally define the attributes you want to serialize, and it takes care of the rest.
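For anyone who hasn't used the gem, a typical Active Model Serializers class looks something like the following; the attribute names are made up for illustration.

```ruby
# Declare the attributes you want and the gem reads each one off the
# underlying Active Record object for you.
class VulnerabilitySerializer < ActiveModel::Serializer
  attributes :id, :client_id, :risk_score, :custom_fields

  # Attributes can also be backed by methods, which is where per-record
  # database queries can quietly creep in.
  def custom_fields
    object.custom_fields.map { |field| [field.name, field.value] }.to_h
  end
end
```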
00:06:20.180 While this was our first solution, it became less effective when we started serializing over 200 million vulnerabilities a day. As the number of vulnerabilities we were serializing increased, the rate at which we could serialize them dropped dramatically, and our database began to max out on CPU.
00:06:39.830 One screenshot we shared in Slack was captioned '11 hours and counting.' Our database was literally on fire all the time. A lot of people might look at that graph and think, 'Why not just beef up your hardware?' Unfortunately, at that point, we were already running on the largest RDS instance AWS had to offer.
00:06:58.490 When my team and I looked at that graph, we thought, 'Oh man! There must be a horrible query in there that we missed.' So off we went hunting for that elusive horrible MySQL query. Much like Liam Neeson in Taken, we were determined to find the root cause of our MySQL load.
00:07:16.819 But we never found those long-running horrible MySQL queries because they didn't exist. Instead, what we found were a lot of fast, millisecond queries happening repeatedly.
00:07:34.340 All these queries were lightning fast, but we were making so many of them at the same time that our database became overwhelmed. Therefore, we immediately started trying to figure out how to serialize all this data and make fewer database calls.
00:07:53.539 What if instead of making individual calls to MySQL for every vulnerability, we grouped all those vulnerabilities together at once and made a single MySQL call to retrieve all their data? From this idea, we developed the concept of bulk serialization.
00:08:06.800 To implement this, we started with a cache class. This cache class was responsible for taking a set of vulnerabilities and a client and then running all the MySQL lookups for them at once. We then passed this cache class to our vulnerability serializer, which held the logic needed to serialize each individual field.
00:08:19.430 Now, whenever it serialized a field, it would simply talk to our cache class instead of MySQL. Let’s look at an example in our application: vulnerabilities have a related model called custom fields, which allow us to add any attribute we want to a vulnerability.
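A rough sketch of that pattern is shown below; the class, model, and method names are hypothetical, not Kenna's actual code.

```ruby
# One query loads the custom fields for the whole batch; the serializer
# then reads from the in-memory hash instead of hitting MySQL per record.
class VulnerabilityCache
  def initialize(vulnerabilities, client)
    @client = client
    @custom_fields = CustomField
                     .where(vulnerability_id: vulnerabilities.map(&:id))
                     .group_by(&:vulnerability_id)
  end

  def custom_fields_for(vulnerability)
    @custom_fields.fetch(vulnerability.id, [])
  end
end

class BulkVulnerabilitySerializer
  def initialize(vulnerability, cache)
    @vulnerability = vulnerability
    @cache = cache
  end

  def custom_fields
    # Reads from the preloaded hash -- no database call here.
    @cache.custom_fields_for(@vulnerability)
          .map { |field| [field.name, field.value] }
          .to_h
  end
end
```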
00:08:31.550 Before, when we serialized custom fields, we had to query the database to do it. Now, we simply read from that cache. The payoff of this change was significant. Here is a console log showing how long it took to serialize 300 vulnerabilities individually: just over six seconds.
00:08:50.160 That’s a fairly generous estimate, considering it would take even longer when the database was under load. Instead, if you serialized those exact same 300 vulnerabilities in bulk—boom! Less than a second.
00:09:08.920 These speed-ups are a direct result of a decrease in the number of database hits we had to make in order to serialize these vulnerabilities. Serializing those 300 vulnerabilities individually meant we had to make 2,100 calls to the database.
00:09:21.400 Serializing those 300 vulnerabilities in bulk, however, only required us to make seven calls. As you can probably glean from the math, that's seven calls per vulnerability when serializing individually, versus seven calls total for however many vulnerabilities you can group together at once.
00:09:41.620 In our case, we usually grouped vulnerabilities in batches of a thousand. This significantly dropped the number of database requests we made per batch from 7,000 down to seven.
00:09:59.200 This dramatic reduction in MySQL requests is evident in the MySQL queries graph, where you can see the number of requests we were making beforehand compared to after we deployed the bulk serialization change.
00:10:17.290 With this large drop in database queries came a significant reduction in database load, as you can see in this CPU utilization graph. Prior to the change, we were maxing out our database; afterward, we were sitting pretty around 20 to 25 percent.
00:10:35.110 The moral of the story is this: when you are processing large amounts of data, try to find ways to use Ruby to process that data in bulk. We did this for serialization, but it can be applied any time you are processing data one at a time.
00:10:53.140 Take a step back and ask yourself if there is a way to process this data together in bulk because one MySQL call for a thousand IDs is always going to be faster than a thousand individual MySQL calls.
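One quick way to convince yourself of this, sketched here with a hypothetical Asset model, is a micro-benchmark comparing the two approaches.

```ruby
require "benchmark"

ids = Asset.limit(1_000).pluck(:id)

Benchmark.bm(12) do |x|
  x.report("one-by-one:") { ids.each { |id| Asset.find(id) } } # 1,000 queries
  x.report("in bulk:")    { Asset.where(id: ids).to_a }        # 1 query
end
```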
00:11:07.670 Unfortunately, the serialization saga doesn't end here. Once we got MySQL all happy and sorted out, Redis suddenly became very sad.
00:11:30.210 And this, folks, is the life of a site reliability engineer. Some days, it feels like you put one fire out, and you start another somewhere else. You speed one thing up, and the load transfers to another.
00:11:46.500 In this case, we transferred the load from MySQL to Redis. Let me explain why. When we index vulnerabilities into Elasticsearch, we not only have to talk to MySQL to serialize them, but we also have to communicate with Redis to know where to put them in Elasticsearch.
00:12:04.140 Our vulnerabilities are organized by client, so to determine where a vulnerability belongs, we must make a GET request to Redis to fetch the index name for that vulnerability based on its client.
00:12:22.560 When preparing vulnerabilities for indexing, we gather all the serialized vulnerability hashes, and one of the final things we do before sending them to Elasticsearch is making that Redis GET request to retrieve the index name.
00:12:42.860 These vulnerability hashes are grouped by client, so this Redis GET request is returning the same information over and over again. All these simple GET requests are blindingly fast; they take a millisecond to execute. But as I mentioned before, it doesn't matter how fast your requests are.
00:13:02.040 If you make a ton of them, it's going to take time. We were making so many of these simple GET requests that they accounted for roughly 65% of the time it took for us to index vulnerabilities.
00:13:18.780 The solution to eliminate a lot of these requests was once again Ruby. In this case, we ended up using a Ruby hash to cache the index name for each client.
00:13:35.730 Then, when looping through the serialized vulnerability hashes to send to Elasticsearch, instead of hitting Redis for every individual one, we would reference that client index hash. This means we now only have to hit Redis once per client instead of once per vulnerability.
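In code, the change looked roughly like the sketch below; the variable names and Redis key format are illustrative.

```ruby
require "redis"

# The first lookup for a client goes to Redis; every later lookup for that
# client reads the in-process Ruby hash instead.
redis = Redis.new
client_index_cache = Hash.new do |cache, client_id|
  cache[client_id] = redis.get("client:#{client_id}:index_name")
end

# serialized_vulns is assumed to be an array of hashes, each tagged with
# the client it belongs to.
batches_for_elasticsearch = Hash.new { |hash, index| hash[index] = [] }
serialized_vulns.each do |vuln|
  batches_for_elasticsearch[client_index_cache[vuln[:client_id]]] << vuln
end
```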
00:13:55.590 Let’s look at how this paid off. Given that we have these three batches of vulnerabilities, no matter how many vulnerabilities are in each batch, we are only ever going to have to make three requests to Redis to get all the data we need.
00:14:09.810 As I mentioned earlier, these batches usually contain a thousand vulnerabilities apiece, so we decreased the number of hits we were making to Redis by roughly a factor of a thousand, which resulted in a 65% increase in indexing speed.
00:14:30.510 Even though Redis is fast, a local cache is faster. To put it into perspective, getting a piece of data from a local cache is like driving from downtown LA to LAX without rush-hour traffic. To get the same piece of information from Redis is like taking a plane from LAX all the way to Denver.
00:14:51.490 Redis is so fast that it can be easy to forget you're actually making an external request, and those external requests can add up, impacting your application's performance.
00:15:07.260 So with these insights in mind, remember that Redis is fast, but a local Ruby cache—such as a simple hash cache—is always going to be faster.
00:15:24.360 Now, these are two great ways we can use simple Ruby to replace our datastore hits. Next, I want to talk about how you can use whatever framework you're working with to minimize your datastore hits.
00:15:36.990 At Kenna, we recently started sharding our main MySQL database by client, meaning each client's data lives in its own shard database.
00:15:58.640 When we need data for a client, such as an asset, we have to determine which shard database to query. This means our sharding configuration, which tells us what client belongs to what shard, needs to be easily accessible.
00:16:18.040 Initially, we used Redis to obtain this information because Redis is fast, and the configuration hash we wanted to store was small. However, that didn't last long. Eventually, that configuration hash grew and grew until it was roughly thirteen kilobytes.
00:16:37.870 Thirteen kilobytes may not seem like a lot, but when you are asking for that much data repeatedly, it adds up. We also increased the number of workers operating on these databases until we reached 285.
00:16:51.590 Now, every time one of these workers makes a MySQL request, it first has to go to Redis to retrieve that thirteen-kilobyte configuration hash. This quickly became unsustainable.
00:17:10.030 One of the first things we did to solve this issue was to examine Active Record's connection object. Active Record's connection object stores data about how to communicate with the database, so we thought it might hold a solution.
00:17:29.300 To our surprise, it wasn’t an Active Record object at all; it was an Octopus proxy object created by our Octopus sharding gem. This was a pleasant surprise, given that it contained great helper methods already storing our sharding configuration.
00:17:50.140 So, rather than hitting Redis every time we made a MySQL call, we only needed to communicate with our Active Record connection object.
00:18:03.940 One major takeaway from this experience was understanding the importance of knowing your gems. It's easy to include a gem in your Gemfile, but make sure you have a general understanding of how it operates.
00:18:26.420 You don’t need to read the source code for every gem, but consider checking it out the first time you use a gem, so you can see how it works and how it interacts with your databases in code.
00:18:47.440 If we had better understood how our Octopus gem was configured, we could have avoided the entire Redis headache. Regardless of how we arrived at the solution, using the framework you already have as a local cache is always faster and easier than making an external request.
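The general idea, independent of the Octopus-specific helper methods mentioned above, is simply to memoize the configuration in the Ruby process instead of re-fetching it from Redis before every query. A generic sketch, not Kenna's or Octopus's actual interface:

```ruby
require "json"
require "redis"

# Keep the sharding configuration in-process rather than fetching a
# multi-kilobyte hash from Redis before every MySQL call.
class ShardConfig
  class << self
    def shard_for(client_id)
      config.fetch(client_id.to_s)
    end

    private

    # Loaded once per process rather than once per query; restart or expire
    # the memo when the configuration changes.
    def config
      @config ||= JSON.parse(Redis.new.get("shard_configuration"))
    end
  end
end
```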
00:19:01.360 So those are three great ways you can use Ruby to replace your datastore hits.
00:19:11.460 Now, I want to shift gears and talk about how you can prevent unnecessary datastore hits using Ruby. I’m sure some of you are thinking, 'Duh, I already know how to do this.' But let’s hold on for a moment because it might not be as obvious as you think.
00:19:31.960 For example, how many of you have written code like this? Come on, I know you have. This code looks decent, right? If there are no user IDs, then we skip all of the user processing.
00:19:47.780 Unfortunately, that assumption is false. If you execute this `where` clause with an empty array, you're actually hitting MySQL. Notice the `where 1 = 0` statement? That’s what Active Record uses to ensure no records are returned.
00:20:06.650 Sure, it’s a fast query taking a millisecond, but if you run this query millions of times, it can easily overwhelm your database. So how do you adjust this block of code to make your site reliability engineers love you?
00:20:24.560 You have two options: first, avoid executing that MySQL lookup unless it's absolutely necessary. You can do this by adding a simple array check with Ruby, preventing unnecessary datastore hits and chaos in your MySQL.
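Concretely, the difference is between the two versions sketched below; User and the per-record work are stand-ins for real code.

```ruby
user_ids = []

# Iterating this relation still issues a query, even though it can never
# return rows:  SELECT `users`.* FROM `users` WHERE 1=0
User.where(id: user_ids).each do |user|
  # ... per-user work ...
end

# One line of plain Ruby skips the round trip to MySQL entirely.
if user_ids.any?
  User.where(id: user_ids).each do |user|
    # ... per-user work ...
  end
end
```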
00:20:46.030 By doing this, you also speed up your code. If you run this code block 10,000 times, calling that useless MySQL lookup is going to take over half a second. If you add that simple line of Ruby to prevent that MySQL lookup, it will take less than a hundredth of a second.
00:21:04.120 As you can see, there's a significant difference in performance between hitting MySQL unnecessarily 10,000 times and running a plain Ruby line 10,000 times. This difference can accumulate, impacting your application's overall speed.
00:21:23.390 Many people often look at this top chunk of code and their first thought is, 'But Ruby is slow.' That couldn't be further from the truth. Ruby is not slow; hitting the database is what slows you down.
00:21:43.430 Watch out for situations in your code where it might be making a database request when you don’t expect it. Some Rails users might be thinking, 'I chained a bunch of scopes to my where clause, so I have to pass an empty array; otherwise, my scope chain breaks.'