RailsConf 2018

Dropping Into B-Trees

Dropping Into B-Trees

by David McDonald

The video "Dropping Into B-Trees" presented by David McDonald at RailsConf 2018 focuses on understanding database indexes, particularly the B-tree structure, to enhance database performance. Understanding indexes is pivotal for developers to troubleshoot and optimize database interactions effectively. The key points discussed include:

  • The Importance of Learning About Indexes: Many developers utilize databases without fully grasping how indexes work, leading to inefficiencies in their coding practices and database management. McDonald emphasizes the necessity of understanding the fundamentals behind database performance.

  • Definition and Functionality of Indexes: An index is described as a data structure that facilitates quick data retrieval, organizing references to actual data locations. B-trees are the commonly used data structures for indexing.

  • Performance Impact: The speaker discusses the cost-benefit trade-offs of using indexes. While indexes improve read efficiency on frequently queried data, they also introduce overhead due to the need for maintaining duplicate information and pointers.

  • Specific Examples and Metrics: Through practical examples using PostgreSQL, McDonald demonstrates how to analyze and interpret database query performance using the EXPLAIN command.

    • A full table scan is compared to indexed lookups, illustrating their efficiency differences.
    • The performance of integer-based indexes versus string indexes is explored, highlighting that integer comparisons are less resource-intensive than string comparisons.
  • B-tree Mechanics: McDonald provides a simplified overview of how B-trees operate, describing their balanced nature, keys distribution, and the search algorithms (like binary search) that facilitate quick data retrieval, even as data size grows.

    • He explains why indexes can slow down insert operations due to necessary updates in both the base data and the index.
  • Best Practices in Indexing: The presentation concludes with guidelines on indexing strategies, stressing that not all queries require indexes and outlining when to use them based on specific use cases and data analysis.

In conclusion, McDonald's talk underscores the significance of having a thorough comprehension of indexing, particularly B-trees, which can greatly enhance database performance and efficiency in application development. Knowledgeable index application can lead to better planning, contribution, and troubleshooting in a developer's career.

00:00:11.360 Should I put you guys on my Instagram story? You guys want to be on my Instagram story? Really quick, okay, to get started, I need two people that aren't really tired to give me a hand with something.
00:00:18.660 It's really basic, non-computing, but I do need two volunteers. Okay, I got one here. One more volunteer, just come up and help me out. Come on up, I appreciate it. Okay, so we're going to do a little competition.
00:00:29.880 I have two books here. I just need each of you to pick a book, okay? Stand on each side of me, please. Now, can you look up every instance of the word "war" in each of these books? The first one to get it just raise your hand.
00:00:45.420 Wait until I say go ahead. Just look for every instance of the word "war." Please don't use your cellphone. I've done this before, and someone used their cellphone. Come on! You're doing the right thing there, I think. You got it? Okay, so you found it.
00:01:04.680 Really quick, how many of you found none? Outside of that, that's bad! Just a book page where you want. Okay, thanks! I think you get the point here. Thank you very much for helping me out with that.
00:01:40.210 So, the talk is titled "Dropping Into B-Trees." My name is David McDonald, and I'm a developer at Weedmaps. Although this is the title of the talk, we will dive into B-trees, but we will also talk more generally about indexes, so bear with me a little bit as I set the stage.
00:01:57.320 As developers, we spend a lot of our time learning. That comes with the territory. By my estimate, I spend about 25 percent of my time learning and about 12.5 percent learning about new JavaScript, etc. A lot of my time is spent watching YouTube videos, specifically funny cat videos because they are hilarious. I feel like I always want more time.
00:03:11.209 Many of the people I talk to in the industry express a desire for more time, but it is not forthcoming, especially if you work for somebody else, as most of us do. We don’t have endless amounts of time to learn about things we would like to learn about, so we have to prioritize.
00:03:23.150 I found that we often do not get as much time as we might like to learn about tools that we use every day. This problem is what inspired me. I have had many instances in my career where I had to engage in "just-in-time learning" when something is broken.
00:03:39.340 For example, I have two hours to fix an issue, and the deadline is 3:00 p.m. Diving into just-in-time learning isn't always sufficient, especially concerning database issues, specifically indexes. It’s a vast topic—50 years of computer science manifested in one thing like a B-tree.
00:04:08.30 As Rails developers, we've likely spent much more time looking up how to use options for our stack or figuring out how to get Sprockets to cooperate with Angular than learning how our databases work. Our databases are time-tested and reliable for the most part. They often feel like plug-and-play for the average developer, with databases becoming casualties of this time crunch.
00:04:40.780 One of the things I want you to take away from this today is that it is worth diving deeper into certain topics that can yield significant value, and I feel like indexes are definitely one of those topics. There are many spaces in our stack that require our expertise, where we can focus our attention.
00:05:01.410 These topics are not just good to know but are very marketable. They can help you in any team for the rest of your career. I spent most of my career treating databases like I treat my smartphone: I know how to use it, but I don’t really know exactly what it is doing.
00:05:23.340 The average person uses their cell phone without thinking about radio waves or technical details. So, like most of us, I’ve gotten by using gems, heuristics, and online searches. But as a professional who uses these technologies daily, I should be able to dig deeper when necessary.
00:05:44.820 Focusing on indexes allows us to get the most value. Understanding indexes is key to demystifying database performance and functionality and will greatly impact your ability to plan, contribute, and troubleshoot as you advance in your career.
00:06:05.790 What are indexes, anyway? We use them all the time, but do we really stop to think about what they are? Yes, they speed up selects and queries, particularly as they start to slow down. They can help you look up information quickly depending on the size and structure of your data store.
00:06:45.410 But what are they? We know what they do, but do we really know what they are? The standard definition is that an index is a data structure stored on disk that organizes a reference to your data by maintaining a copy of the data you're indexing combined with a reference to the actual data location.
00:08:01.150 For completeness, this would be like a non-clustered form where you store your data in a separate place on disk while creating copies of your data that you want to index. Your pointer is pointing to them. These data structures provide the basis for random lookups and efficient access to ordered records.
00:08:27.850 Let’s dwell on that image for a second and allow that to sink in. When you have an index, it makes a copy of the data, so the idea that you have to deduce a lot just from knowing that is significant. Let’s go deeper with an example.
00:08:41.100 Imagine you need to add a paragraph to a book that includes the word "war." You can’t just add the word without updating the index, which keeps track of that word’s location in the book. Databases operate in the same way: when a change occurs, the index must also be updated accordingly.
00:09:39.660 At this point, you likely have more questions than when you walked in, so let's add some more. When and how should I use this index? Why is an index on an integer faster than on a string? Why do inserts bring a performance penalty?
00:10:10.890 Can you drop an index without hurting the data? Another question for you is how you can open a file in Linux, tamper with it, and still have that file open despite damaging its integrity.
00:10:54.520 We all know that as data grows and the number of queries increases, the number of rows being scanned grows. As these requirements change, we must go deeper to understand the issues that arise. Some examples will help us explore these questions.
00:11:35.190 I want to clarify that I will discuss these examples in the context of PostgreSQL. The first reason is that's what I work with every day, and the second reason is that it is simply relevant. If you use another relational database, don't worry because the principles we will cover generally apply. Let's jump into PostgreSQL with some examples, and we will revisit our questions.
00:12:33.160 I created a simple sample with data I threw up here and a Gist if you want to follow along later. In Ruby, I created 50,000 rows in a CSV file using the Faker gem. If you're not familiar with it, it's great! I created a table with an ID as our primary key, a name, an email, and a city, all strings, and then I copied that in while specifying our delimiter and where we put that file.
00:13:39.790 Now we have 50,000 records. Why 50,000? I found that to be where I started noticing performance differences. Keep this in mind: we manage several tables with millions of rows at Weedmaps.
00:14:23.320 That isn't an immense amount of data, but it illustrates how we can see performance differences even with this small dataset. We can assess how performance changes occur with small tables.
00:14:59.660 Now, this yields a table where you see an index here: a primary key B-tree on the ID. We all know looking at rows by primary key is fast and straightforward. However, remember that indexes are stored on disk, meaning this table has a stored copy and a pointer to each one of the fields.
00:15:41.440 Additionally, every insert and update creates two writes: one to the actual table for the desired field, and again to create a copy of the data included with reference and some metadata to the index. There is a small cost involved with this trade-off.
00:16:38.320 While this trade-off is generally worth it for columns utilized frequently for lookups because typically, databases perform more reads than writes, it's crucial to bear in mind the costs.
00:17:11.790 Therefore, it doesn’t seem obvious to just throw indexes on everything, which is why the best practice is to only add an index when necessary. Indexes are essential on primary keys, foreign keys, and integers used in joins, or fields queried often. String fields may or may not need them.
00:17:51.640 Thus, making decisions about indexing isn't something you can simply Google, as every use case is different. So we need to understand what these data structures are and how they work.
00:18:36.010 Let’s look at the table. The first thing to note is when we do a full table scan, we look at every row. When we do an index scan using our primary key on users, we'll notice a significant decrease in cost compared to executing a full table scan.
00:19:19.900 Now we are looking at one row, and while this setup has a ramp-up time, it’s completely different from the cost of a full table scan. The insights gained from using this explain function can be critical for making decisions regarding performance.
00:20:07.100 Now, let’s check a string column. If we were looking for an email, the sequential scan may seem costly, and it is important to note that no index was utilized here. When the database must run a sequential scan on an email lookup, it's crucial to examine how often such views occur.
00:20:59.050 In this case, we must compare our strategies, especially in cases like authentication where recurring email lookups are necessary. It’s clear that understanding how to examine your database and the performance aspects surrounding it are essential. Now, these questions remain!
00:22:15.440 Why are we witnessing these slowdown effects? You might think, "It's just an index, and these are strings!" Let's investigate this further; what's the cost attached to it? Indexes come in various types. PostgreSQL, for instance, utilizes multiple index types, mostly B-trees.
00:22:49.990 B-trees provide efficient solutions for equality and range queries across data types. They excel at handling null values, leading to impressive performance gains. B-trees work effectively because they manage their keys under a balanced tree structure, ensuring no wild variations in tree height.
00:23:36.740 Returning to our prior inquiry: why is an index on an integer faster than one on a string? Comparing two integers is more efficient due to the streamlined nature of their byte sizes. While string comparisons are inherently slower.
00:24:20.860 Additionally, every insert carries performance penalties because it generates two writes anytime a user adds or modifies data. Each time an insert occurs, it’s essential to consider how this performance weighs against overall efficiency. You can remove an index without damaging your data, reflecting much of how data operates in an indexed structure.
00:25:40.440 In summary, I hope you can take away some valuable insights on indexing that inform your decision-making process regarding what and how you index. Whether or not to index a string field requires careful consideration and often demands benchmarking.
00:26:04.180 Remember to analyze your queries effectively using tools that Rails provides or through Ruby benchmarks. By conducting such assessments, you can develop a deeper understanding of how B-tree indexes function and their performance implications.
00:26:23.000 Thank you for attending! I've left some resources for you, one of which is Pat Shaughnessy's excellent blog on B-trees from 2014. General documentation from Heroku and PostgreSQL are invaluable as well, and I hope you take the time to explore these once have returned to your work.
00:26:50.180 Thank you!