RailsConf 2017

The Secret Life of SQL: How to Optimize Database Performance

The Secret Life of SQL: How to Optimize Database Performance

by Bryana Knight

The video titled "The Secret Life of SQL: How to Optimize Database Performance" by Bryana Knight at RailsConf 2017 explores the intricacies of SQL database performance optimization, specifically focusing on the gray areas of database indexing, common misconceptions, and effective querying techniques. The speaker shares insights drawn from her experience on the platform data team at GitHub.

Key Points Discussed:

  • Understanding Indexes:

    • A database index is essentially a sorted copy of selected columns from a table, allowing rapid access to data. Indexes help filter data, making queries more efficient.
    • Bryana uses an analogy comparing indexes to a book's index, providing an effective way to access specific information without scanning through entire tables.
  • Common Rules of Thumb:

    • Many indexing best practices are not always applicable. For instance, while it seems logical to cover all query fields with indexes, redundancy can lead to unnecessary complexity and degrade performance during data modification operations.
  • Using Index Prefixes and Redundant Indexes:

    • The video introduces the concept of index prefixes, which allow indexing on part of a column's data, useful for large text fields where a full index would be impractical.
    • Examples are provided where certain queries can be optimized and the importance of evaluating existing index structures is emphasized.
  • Optimizing Queries with 'OR' and 'UNION':

    • The speaker explains that using 'OR' can sometimes lead to poor performance because a query may not utilize any index.
    • Instead, using 'UNION' can help in executing separate queries that utilize their respective indexes effectively.
  • Addressing Performance Issues:

    • Even with proper indexing, the query planner may not always utilize indexes optimally. The speaker advises on techniques to guide the planner towards effective indexes.
  • Data Redundancy and Normalization:

    • Bryana also discusses the balance between normalization and denormalizing data in cases where it significantly impacts performance, especially with a high read-to-write ratio.
    • Real-world examples highlight the performance improvements after modifying data structures to reduce complexity in queries.

Conclusions and Takeaways:

  • Indexing should be strategic and not excessive, as over-indexing can lead to storage costs and slowed modifications.
  • It’s crucial to understand how queries are executed under the hood to troubleshoot performance issues effectively.
  • The importance of continuous evaluation and modification of indexing strategies and query structures is critical for maintaining optimal performance.
00:00:12.139 Okay, so we're going to go ahead and get started. Thank you everyone for coming. My name is Bryana Knight, and I'm an engineer at GitHub. I work on the platform data team, which, among other projects, is responsible for reviewing every single migration pull request that comes through GitHub.
00:00:18.960 We usually see at least a couple a day, and we're also responsible for improving any performance problems that we have within our application. So what's this talk about? Today, we're going to go over some best practices that aren't always best practices.
00:00:38.280 First and foremost, we're going to talk about database indexes. So what are they and how do they work? Database indexing is fundamental to performance. Then we'll dive into some common rules of indexing, querying, and data modeling. We'll discuss what those are, when and why they break down, and what tools you can use to understand what's going on. Finally, we'll talk about how to move forward when you can't rely on those rules anymore.
00:00:50.370 So what even is an index? An index is a copy of selected columns of data from a table that can be searched quickly and efficiently. The reason why they can be searched more quickly is because an index is sorted. An index has as many rows as a table, which means there is one index row for every row in your table. You can think of indexes and tables as the same thing, but you can get to the data you need in an index faster and then look up whatever else you need in the main table, knowing exactly where to look.
00:01:14.070 To really understand indexes, we need to take a step back and consider a real-life example. Hypothetically, I'm a cat owner and have this great big book called "The Ultimate Cat Book," which contains a ton of great information about cats. One day, my cat gets sick, and I'm pretty sure my cat is having some sort of allergic reaction. Not knowing where to go in this book to find information about cat allergies, I might have to flip through every single page checking whether or not that page has the information I need about allergies.
00:01:49.860 Luckily, this book has an index, which is sorted alphabetically. So I can go to the 'A' section and see that the cat allergies chapter starts on page 88. With that information, I can flip directly to that page in the ultimate cat book, read up on cat allergies, and save my dying cat. On a very basic level, this is the same way that database indexes work. Indexes are used to filter data for most queries. We often want a subset of the data in our table, so we can use an index to get that information.
00:02:44.440 Indexes can also be used to sort data. As I mentioned, indexes are sorted and can also retrieve records. Sometimes the index will have all the information you need, and you don't even have to reference back to the main table to get additional information. At the end of the day, what indexes provide us is a way to access the smallest information as quickly as possible.
00:03:14.460 Let's take a database example. Imagine we have a 'repositories' table that has an ID, which is the primary key, the name of the repository, and the ID of the owner of that repository. We have a sample query where we want all of the repositories where the owner ID is greater than 500. Without any index, we would have to perform a full table scan.
00:03:24.040 What that looks like is we have to scan through every record in the table, asking whether or not the owner ID is greater than 500. This would take a very long time and is exactly what we're trying to avoid with the usage of indexes. Looking at another example, we have the same repositories table and the same query, but now we have an index over owner ID. The index is sorted based on owner ID, which is the leftmost column in the index and the only column in the index.
00:04:07.940 You might also notice that the ID is appended to the right of the index. With this index, we can go directly to the part where owner ID becomes larger than 500 and then look back in the main table to get that information, knowing exactly where to look. This is what makes an index faster. Essentially, our repositories table becomes our ultimate cat book, and our index over owner ID is just like the index of that book.
00:04:33.130 Now, let's examine another example. In this case, we want the names of the repositories where owner ID is greater than 500. We have an index over both owner ID and name. You'll notice that the index is still sorted based on owner ID but includes the name field as well as the ID appended at the end. We can go directly to the portion of the index that has the information we want. Since the index includes the name field, we have everything we need and don't have to reference back to the main table. This is a situation where an index can be used to retrieve data.
00:05:04.310 So, if you have a situation like this, it will be super efficient. A few more things about indexes: index contents are used left to right. As mentioned earlier, indexes are sorted based on the leftmost column. However, if you have duplicate values in that column, it will sort those records based on the next column in the index. For example, if we have two repositories with an owner ID of 745, those will be ordered based on the name since the name is also included in our index.
00:05:41.490 Let’s say we have a request for the repositories where the name is 'Rails' and it's not the leftmost column in our index. In that case, we actually couldn't use the index to resolve that query. Now that we have a better understanding of how indexes work, we’ll move on to some common rules of indexing, querying, and data modeling, discuss what they are, when and why they break down, and what tools to use to understand what's happening.
00:06:04.119 Our first rule is that any columns involved in queries should be covered by an index. Let's take a look at a sample query. We want all the labels where the label repository ID is either 4, 5, or 6. At first glance, your instinct might be to add an index on the repository ID, allowing us to access the records faster since that index will be sorted based on repository ID.
00:06:31.330 Great instinct! However, later on, if we add to this query and say we want the labels with a repository ID of four, five, or six and with the name 'feature', your first instinct might then be to add an index on repository ID and name. While that’s still a good thought, our index on repository ID is now redundant.
00:07:05.550 What is a redundant index? It's basically an index that is already contained within another index. As you can see here, both of these indexes are sorted based on repository ID, the leftmost column in each of those indexes. Thus, any query that is solely concerned with repository ID can use the index on repository ID and name to satisfy that query, and we no longer need the index that's just over repository ID.
00:07:23.580 When are we breaking this rule? Our rule was that any columns involved in queries should be covered by an index. Here we're not breaking it; we're just adding the caveat of unless there’s already a covering index. Why does this matter? Our indexes are still going to be super fast—the index that's used will be one or the other, and what's the harm? Well, indexes take up space, and we always want to be aware of that.
00:07:58.530 Most importantly, adding an index will slow down updates, inserts, and deletes because logically, if you're storing that information in more than one place, if it changes, it needs to be updated in all those places. So even though we might see a performance improvement on selects, over-indexing can lead to performance problems for other operations.
00:08:25.520 We care so much about this at GitHub that we have a command built into our Slack. The command is 'my sequel data indexes' and it takes the name of a table. It will tell us if there are any redundant or unused indexes for that table. So our rule was that any columns involved in queries should be covered by an index. Just make sure you're avoiding unused or redundant indexes.
00:08:55.750 Now let's talk about index prefixes. Sometimes it's better or necessary to use an index prefix instead of a full index. An index prefix is a subset of data within a column; for example, we have an index on 'repository name' and an index prefix on 'repository name,' which only includes the first five characters of the name of the repository. In this scenario, it could be more efficient.
00:09:38.380 Let’s look at another sample query that’s a bit more complex. It’s asking for the recent code reviewers for a certain code path. The columns that stand out to us are repository ID, path, and created at, which lead us to think our first instinct might be to add an index over all those columns. Before doing that, it's important to analyze the table we're adding an index to.
00:10:04.900 Again, let's plug for our built-in command called 'my sequel table'. It gives us valuable information about a table, including its size and the columns included in any existing indexes. Checking the columns in our table, we see that the paths column is of the datatype `bar binary 1024`, which could be problematic because it might violate our limitations for the size of the index.
00:10:37.799 When might we break our rule that any columns involved in queries should be covered by an index? When regular indexing isn't possible, we can use an index prefix. Another situation where you might prefer an index prefix is when querying the data based on a prefix scenario—like asking for usernames that start with a particular letter. If you don't care about the rest of the user name, you can put an index over that first character and reference back to the main table.
00:11:10.620 Determining how long to make the prefix is important. You want to make it long enough to differentiate values within that field. For instance, if every file in your repository starts with the same prefix and your index prefix is only over the first four characters, that index won’t be useful. A developer previously chose 255 bytes as the prefix length which allowed them to index effectively.
00:11:49.100 What do we gain from using an index prefix? First, it requires less space for the index, enabling indexing of larger data types that weren't indexable before, and we can also see comparable performance improvement with a carefully calculated length. So our rule was that any columns involved in queries should be covered by an index; just make sure you're looking out for redundant or unused indexes and using an index prefix when appropriate.
00:12:23.720 Our second rule is to use an OR to return records satisfying one or more conditions. Many of you likely have queries with OR clauses in your application, so let’s look at another sample query where we're seeking the maximum ID from users that are either spammy or suspended. This query runs slowly, taking about a minute, and that's concerning.
00:12:56.990 Typically, my first step when I see a slow query is to run an EXPLAIN on that query, which provides details about how it's being executed. The columns we focus on are possible keys—indexes that could be used—and the key column indicates the index being chosen for the query. If the key is null, it signifies that no index is being utilized, meaning this query is fully scanning the table.
00:13:38.870 When examining the possible indexes, we can see indexes over both spammy and suspended. However, the challenge is that MySQL can only use one index per table during a query, which means if we only use one or the other, we might exclude rows that should have been returned.
00:14:06.700 To rectify this, instead of relying on an index, we can modify our query to use a UNION, effectively splitting the query into two separate queries, each exploiting the best index before combining results. This new query runs in just 11 milliseconds—significantly faster and more efficient. It's worth noting that Postgres has an index merge feature which allows it to recognize two indexes it can utilize.
00:14:52.960 So, our rule regarding using an OR statement is that you should leverage it to return records satisfying one or more conditions, but if the OR restricts index usage, resulting in a full table scan, tuning your query to use a UNION instead might yield better results.
00:15:22.460 Next, our third rule states that if an index covers all the fields in your queries, you're all set! However, it can be challenging if the query planner doesn't choose to use the index. A sample query wanting all commit comments for a repository ordered by ID is slow, taking about 40 seconds, which prompts an analysis with EXPLAIN.
00:15:51.500 In examining potential indexes, we see that the planner has chosen an index involving both the repository ID and user ID, despite the fact that our query is primarily concerned with the repository ID. While one would expect it to select the repository ID index, the query planner opts for an index to avoid additional sorting.
00:16:32.009 This correctly identifies a situation where the query planner isn't selecting the most efficient index. To remedy this, we can create an explicit index directive to guide MySQL on which index to employ, resulting in the new query taking only a millisecond to execute.
00:17:03.830 Upon reviewing the SQL documentation, there are hints available, such as USE INDEX and IGNORE INDEX, which indicate explicitly what index to use or avoid; however, that requires future-proofing your queries against potential index changes, as there is no guarantee the index won't be deleted later.
00:17:42.920 Now, our fourth and final rule is to avoid redundant data across tables. This is a fundamental principle of data modeling and data storage, as it’s essential for efficiency to maintain tables that contain only relevant subject matter. Let's consider a sample database with pull requests, repositories, and users, where we maintain a column to identify spam users.
00:18:39.470 To address performance degradation, particularly in queries that require joining multiple tables—like retrieving pull requests while filtering out spammy users—we must recognize that we're unnecessarily complicating our queries by having redundant data.
00:19:20.250 So what happens if the ratio of reads to writes becomes high? In that case, we may consider denormalizing our data. By strategically adding columns to user tables, we simplify direct queries, which previously required complex joins across two or more tables. The resulting queries are streamlined and vastly faster.
00:19:57.940 This detailed explanation highlights the performance improvements after denormalizing data. However, it's important to consider potential trade-offs ahead of embarking on a denormalization strategy, including a substantial initial workload to implement migrations for adding new fields, ensuring that data is updated consistently across tables, and establishing processes to maintain data quality.
00:20:43.690 So, to recap our four rules: 1. Any columns involved in queries should be covered by an index while avoiding redundant or unused indexes; 2. Use an OR when necessary to return records satisfying multiple conditions, but switch to a UNION if index usage is hindered; 3. If there's an index covering all fields in your query, it's generally sufficient, but assist the query planner if it overlooks that index; and 4. Avoid redundant data across tables, striking a balance between read and write operations.
00:21:34.440 Some final takeaways: Index but don't over-index. Keep in mind that nothing comes for free; indexes consume space and can slow down inserts, updates, and deletes. Tune your queries to exploit all reads so you're getting the best results from your queries. There are tools available to run EXPLAIN on your queries, assisting you in comprehending their execution.
00:21:56.679 Always examine the existing columns, tables, and indexes before making alterations. It's essential to know that properly created queries can still lead to performance issues, which is why it's vital to understand the underlying mechanics of your data structure and be willing to adapt when necessary.
00:22:47.920 Lastly, I want to mention that we are hiring on the platform data team. If any of this information piqued your interest or if you have relevant experience, I'd love to discuss it with you further. You can use that link if you want to check out the job posting. Thank you!