00:00:12.139
Okay, so we're going to go ahead and get started. Thank you everyone for coming. My name is Bryana Knight, and I'm an engineer at GitHub. I work on the platform data team, which, among other projects, is responsible for reviewing every single migration pull request that comes through GitHub.
00:00:18.960
We usually see at least a couple a day, and we're also responsible for improving any performance problems that we have within our application. So what's this talk about? Today, we're going to go over some best practices that aren't always best practices.
00:00:38.280
First and foremost, we're going to talk about database indexes. So what are they and how do they work? Database indexing is fundamental to performance. Then we'll dive into some common rules of indexing, querying, and data modeling. We'll discuss what those are, when and why they break down, and what tools you can use to understand what's going on. Finally, we'll talk about how to move forward when you can't rely on those rules anymore.
00:00:50.370
So what even is an index? An index is a copy of selected columns of data from a table that can be searched quickly and efficiently. The reason why they can be searched more quickly is because an index is sorted. An index has as many rows as a table, which means there is one index row for every row in your table. You can think of indexes and tables as the same thing, but you can get to the data you need in an index faster and then look up whatever else you need in the main table, knowing exactly where to look.
00:01:14.070
To really understand indexes, we need to take a step back and consider a real-life example. Hypothetically, I'm a cat owner and have this great big book called "The Ultimate Cat Book," which contains a ton of great information about cats. One day, my cat gets sick, and I'm pretty sure my cat is having some sort of allergic reaction. Not knowing where to go in this book to find information about cat allergies, I might have to flip through every single page checking whether or not that page has the information I need about allergies.
00:01:49.860
Luckily, this book has an index, which is sorted alphabetically. So I can go to the 'A' section and see that the cat allergies chapter starts on page 88. With that information, I can flip directly to that page in the ultimate cat book, read up on cat allergies, and save my dying cat. On a very basic level, this is the same way that database indexes work. Indexes are used to filter data for most queries. We often want a subset of the data in our table, so we can use an index to get that information.
00:02:44.440
Indexes can also be used to sort data. As I mentioned, indexes are sorted and can also retrieve records. Sometimes the index will have all the information you need, and you don't even have to reference back to the main table to get additional information. At the end of the day, what indexes provide us is a way to access the smallest information as quickly as possible.
00:03:14.460
Let's take a database example. Imagine we have a 'repositories' table that has an ID, which is the primary key, the name of the repository, and the ID of the owner of that repository. We have a sample query where we want all of the repositories where the owner ID is greater than 500. Without any index, we would have to perform a full table scan.
00:03:24.040
What that looks like is we have to scan through every record in the table, asking whether or not the owner ID is greater than 500. This would take a very long time and is exactly what we're trying to avoid with the usage of indexes. Looking at another example, we have the same repositories table and the same query, but now we have an index over owner ID. The index is sorted based on owner ID, which is the leftmost column in the index and the only column in the index.
00:04:07.940
You might also notice that the ID is appended to the right of the index. With this index, we can go directly to the part where owner ID becomes larger than 500 and then look back in the main table to get that information, knowing exactly where to look. This is what makes an index faster. Essentially, our repositories table becomes our ultimate cat book, and our index over owner ID is just like the index of that book.
00:04:33.130
Now, let's examine another example. In this case, we want the names of the repositories where owner ID is greater than 500. We have an index over both owner ID and name. You'll notice that the index is still sorted based on owner ID but includes the name field as well as the ID appended at the end. We can go directly to the portion of the index that has the information we want. Since the index includes the name field, we have everything we need and don't have to reference back to the main table. This is a situation where an index can be used to retrieve data.
00:05:04.310
So, if you have a situation like this, it will be super efficient. A few more things about indexes: index contents are used left to right. As mentioned earlier, indexes are sorted based on the leftmost column. However, if you have duplicate values in that column, it will sort those records based on the next column in the index. For example, if we have two repositories with an owner ID of 745, those will be ordered based on the name since the name is also included in our index.
00:05:41.490
Let’s say we have a request for the repositories where the name is 'Rails' and it's not the leftmost column in our index. In that case, we actually couldn't use the index to resolve that query. Now that we have a better understanding of how indexes work, we’ll move on to some common rules of indexing, querying, and data modeling, discuss what they are, when and why they break down, and what tools to use to understand what's happening.
00:06:04.119
Our first rule is that any columns involved in queries should be covered by an index. Let's take a look at a sample query. We want all the labels where the label repository ID is either 4, 5, or 6. At first glance, your instinct might be to add an index on the repository ID, allowing us to access the records faster since that index will be sorted based on repository ID.
00:06:31.330
Great instinct! However, later on, if we add to this query and say we want the labels with a repository ID of four, five, or six and with the name 'feature', your first instinct might then be to add an index on repository ID and name. While that’s still a good thought, our index on repository ID is now redundant.
00:07:05.550
What is a redundant index? It's basically an index that is already contained within another index. As you can see here, both of these indexes are sorted based on repository ID, the leftmost column in each of those indexes. Thus, any query that is solely concerned with repository ID can use the index on repository ID and name to satisfy that query, and we no longer need the index that's just over repository ID.
00:07:23.580
When are we breaking this rule? Our rule was that any columns involved in queries should be covered by an index. Here we're not breaking it; we're just adding the caveat of unless there’s already a covering index. Why does this matter? Our indexes are still going to be super fast—the index that's used will be one or the other, and what's the harm? Well, indexes take up space, and we always want to be aware of that.
00:07:58.530
Most importantly, adding an index will slow down updates, inserts, and deletes because logically, if you're storing that information in more than one place, if it changes, it needs to be updated in all those places. So even though we might see a performance improvement on selects, over-indexing can lead to performance problems for other operations.
00:08:25.520
We care so much about this at GitHub that we have a command built into our Slack. The command is 'my sequel data indexes' and it takes the name of a table. It will tell us if there are any redundant or unused indexes for that table. So our rule was that any columns involved in queries should be covered by an index. Just make sure you're avoiding unused or redundant indexes.
00:08:55.750
Now let's talk about index prefixes. Sometimes it's better or necessary to use an index prefix instead of a full index. An index prefix is a subset of data within a column; for example, we have an index on 'repository name' and an index prefix on 'repository name,' which only includes the first five characters of the name of the repository. In this scenario, it could be more efficient.
00:09:38.380
Let’s look at another sample query that’s a bit more complex. It’s asking for the recent code reviewers for a certain code path. The columns that stand out to us are repository ID, path, and created at, which lead us to think our first instinct might be to add an index over all those columns. Before doing that, it's important to analyze the table we're adding an index to.
00:10:04.900
Again, let's plug for our built-in command called 'my sequel table'. It gives us valuable information about a table, including its size and the columns included in any existing indexes. Checking the columns in our table, we see that the paths column is of the datatype `bar binary 1024`, which could be problematic because it might violate our limitations for the size of the index.
00:10:37.799
When might we break our rule that any columns involved in queries should be covered by an index? When regular indexing isn't possible, we can use an index prefix. Another situation where you might prefer an index prefix is when querying the data based on a prefix scenario—like asking for usernames that start with a particular letter. If you don't care about the rest of the user name, you can put an index over that first character and reference back to the main table.
00:11:10.620
Determining how long to make the prefix is important. You want to make it long enough to differentiate values within that field. For instance, if every file in your repository starts with the same prefix and your index prefix is only over the first four characters, that index won’t be useful. A developer previously chose 255 bytes as the prefix length which allowed them to index effectively.
00:11:49.100
What do we gain from using an index prefix? First, it requires less space for the index, enabling indexing of larger data types that weren't indexable before, and we can also see comparable performance improvement with a carefully calculated length. So our rule was that any columns involved in queries should be covered by an index; just make sure you're looking out for redundant or unused indexes and using an index prefix when appropriate.
00:12:23.720
Our second rule is to use an OR to return records satisfying one or more conditions. Many of you likely have queries with OR clauses in your application, so let’s look at another sample query where we're seeking the maximum ID from users that are either spammy or suspended. This query runs slowly, taking about a minute, and that's concerning.
00:12:56.990
Typically, my first step when I see a slow query is to run an EXPLAIN on that query, which provides details about how it's being executed. The columns we focus on are possible keys—indexes that could be used—and the key column indicates the index being chosen for the query. If the key is null, it signifies that no index is being utilized, meaning this query is fully scanning the table.
00:13:38.870
When examining the possible indexes, we can see indexes over both spammy and suspended. However, the challenge is that MySQL can only use one index per table during a query, which means if we only use one or the other, we might exclude rows that should have been returned.
00:14:06.700
To rectify this, instead of relying on an index, we can modify our query to use a UNION, effectively splitting the query into two separate queries, each exploiting the best index before combining results. This new query runs in just 11 milliseconds—significantly faster and more efficient. It's worth noting that Postgres has an index merge feature which allows it to recognize two indexes it can utilize.
00:14:52.960
So, our rule regarding using an OR statement is that you should leverage it to return records satisfying one or more conditions, but if the OR restricts index usage, resulting in a full table scan, tuning your query to use a UNION instead might yield better results.
00:15:22.460
Next, our third rule states that if an index covers all the fields in your queries, you're all set! However, it can be challenging if the query planner doesn't choose to use the index. A sample query wanting all commit comments for a repository ordered by ID is slow, taking about 40 seconds, which prompts an analysis with EXPLAIN.
00:15:51.500
In examining potential indexes, we see that the planner has chosen an index involving both the repository ID and user ID, despite the fact that our query is primarily concerned with the repository ID. While one would expect it to select the repository ID index, the query planner opts for an index to avoid additional sorting.
00:16:32.009
This correctly identifies a situation where the query planner isn't selecting the most efficient index. To remedy this, we can create an explicit index directive to guide MySQL on which index to employ, resulting in the new query taking only a millisecond to execute.
00:17:03.830
Upon reviewing the SQL documentation, there are hints available, such as USE INDEX and IGNORE INDEX, which indicate explicitly what index to use or avoid; however, that requires future-proofing your queries against potential index changes, as there is no guarantee the index won't be deleted later.
00:17:42.920
Now, our fourth and final rule is to avoid redundant data across tables. This is a fundamental principle of data modeling and data storage, as it’s essential for efficiency to maintain tables that contain only relevant subject matter. Let's consider a sample database with pull requests, repositories, and users, where we maintain a column to identify spam users.
00:18:39.470
To address performance degradation, particularly in queries that require joining multiple tables—like retrieving pull requests while filtering out spammy users—we must recognize that we're unnecessarily complicating our queries by having redundant data.
00:19:20.250
So what happens if the ratio of reads to writes becomes high? In that case, we may consider denormalizing our data. By strategically adding columns to user tables, we simplify direct queries, which previously required complex joins across two or more tables. The resulting queries are streamlined and vastly faster.
00:19:57.940
This detailed explanation highlights the performance improvements after denormalizing data. However, it's important to consider potential trade-offs ahead of embarking on a denormalization strategy, including a substantial initial workload to implement migrations for adding new fields, ensuring that data is updated consistently across tables, and establishing processes to maintain data quality.
00:20:43.690
So, to recap our four rules: 1. Any columns involved in queries should be covered by an index while avoiding redundant or unused indexes; 2. Use an OR when necessary to return records satisfying multiple conditions, but switch to a UNION if index usage is hindered; 3. If there's an index covering all fields in your query, it's generally sufficient, but assist the query planner if it overlooks that index; and 4. Avoid redundant data across tables, striking a balance between read and write operations.
00:21:34.440
Some final takeaways: Index but don't over-index. Keep in mind that nothing comes for free; indexes consume space and can slow down inserts, updates, and deletes. Tune your queries to exploit all reads so you're getting the best results from your queries. There are tools available to run EXPLAIN on your queries, assisting you in comprehending their execution.
00:21:56.679
Always examine the existing columns, tables, and indexes before making alterations. It's essential to know that properly created queries can still lead to performance issues, which is why it's vital to understand the underlying mechanics of your data structure and be willing to adapt when necessary.
00:22:47.920
Lastly, I want to mention that we are hiring on the platform data team. If any of this information piqued your interest or if you have relevant experience, I'd love to discuss it with you further. You can use that link if you want to check out the job posting. Thank you!