00:00:16.000
Hi, my name is Tyler. I also work at Script. I'm going to be talking a little bit about alternative data structures in Ruby. Despite what Tim said, I don't actually know anything about NoSQL, so I'm not going to talk about that.
00:00:20.119
Instead, I'm going to discuss data structures that I've found useful while dealing with a huge amount of data at Script. You might be asking yourself, why should we talk about different data structures? I mean, we already have arrays, sets, and hashes. Isn't that enough? Well, sometimes it is, but other times, you need something a little different. I've found that the normal data structures just don't always work for me. So, why would I use a different data structure? There are basically three reasons: speed, memory, and clarity.
00:01:22.960
Now, let me clarify what I mean. The thing is, there might not be anything wrong with your favorite data structure; the ones I'm discussing are just the data structures I've used in the past—and still currently. The point of this talk isn't to argue that you should use these specific data structures. It's more about encouraging you to use data structures in general and to spark your interest in them.
00:02:04.880
So, let’s get right into it. We'll start with Bloom filters. The purpose of a Bloom filter is to test for existence within a set. It essentially checks, 'Have I seen this item before?' It's a probabilistic data structure, which means it can sometimes fail. You can control how much it fails and how often, which is a tunable feature we'll discuss shortly. The main benefit of a Bloom filter is its memory usage, which is impressive.
00:02:48.760
Let's say we have 100 million strings that are about 100 characters long. If you used a traditional set to store them, you would need around 10 GB of memory—far more than most computers have nowadays. However, a Bloom filter can handle that same data with only about 280 MB of memory if you can accept a 0.00001% chance of failure. If a higher failure rate is acceptable, you could do it in approximately 170 MB, which is much more manageable.
00:03:55.680
How does a Bloom filter work? Essentially, it's just a series of bits, like checkboxes that can either be on or off. When you want to add an object to your set, such as the string 'to be or not to be', you run it through a series of hash functions. How many hash functions you use depends on how many things you plan to add to the Bloom filter and its size, but for this example, let’s assume we use two hash functions.
00:04:27.590
So, you get two numbers from these hash functions, say one and five, and you set those indexes to 'on'. Now, if you want to add another string, let’s say 'question', you run it through those same hash functions. If one of the expected index numbers is not set, you know this string has never been seen. However, if both expected bits are set, there is a chance that it might have been added previously, which is where the false positives come into play.
00:05:35.480
Bloom filters can be particularly useful. For example, if you are running a file server that is remote and contains many files, querying this file server can be costly. Suppose you find that you're receiving many 404 errors. One solution to reduce the stress on this server would be to insert a Bloom filter between incoming requests and the file server.
00:06:19.360
The Bloom filter allows you to determine what is definitely not in the file server. Even if the Bloom filter gives a false match, saving you the time and resources required for an unnecessary lookup in the file server, the server will still respond with a 404 if the item doesn't exist. In this way, the Bloom filter can significantly help eliminate a vast majority of false requests.
00:06:56.960
To sum up, the Bloom filter's purpose is testing for existence in a set. The significant advantage of using this filter lies in its memory footprint and speed. Now, let's move on to a different data structure called BK trees.
00:08:02.960
BK trees, or Burkhard-Keller trees, named after the inventors, find the best match even when an exact one doesn't exist in a set. This structure decreases the search space. Normally, to find the closest match in a set of strings, you would have to scan through all of them, which can be time-consuming. The idea behind a BK tree is to ensure you do not need to scan the entire list. However, it operates only in a metric space, which works best for certain types of data, such as dictionaries for spelling correction.
00:09:36.720
For instance, imagine we have three nodes: X, Y, and Z, and we know the distances between them. Even if the distances aren't exact, we can determine a lower bound for the distance between nodes Z and Y. This lower bound allows us to eliminate comparisons where the distance exceeds a certain threshold when searching for close matches.
00:10:55.800
Now let’s look at an example using a simple dictionary of six words: taser, past, shave, light, pastor, and pasta. When building a BK tree, you would first choose a root word, say 'paste,' and each word would then be connected based on their edit distances. When a user types a word incorrectly, such as 'pastu,' the tree allows you to quickly find similar words with lower edit distances without needing to check every single entry.
00:12:09.280
This is beneficial because you only perform queries on relevant words, thus saving time and resources. Summarizing BK trees, these are primarily used for spelling correction, but they can also be used for tasks like finding close geographical points on a map. They work in metric spaces to reduce search space and function calculations.
00:13:24.000
Next, I want to introduce a data structure known as a splay tree. But before diving into that, I want to discuss access patterns a bit. When utilizing data structures, many people assume an even distribution of queries across different keys, but this is often not the case. Instead, querying resembles a power law. In text analysis, for instance, you will typically find that certain words are significantly more common than others.
00:14:23.360
This finding is particularly relevant to splay trees, as they operate optimally when access patterns are uneven. Splay trees are self-balancing binary trees that move the most accessed items towards the root. For example, when querying for a particular number, the tree will perform rotations to bring that number to the top for faster retrieval next time.
00:14:56.360
While this might seem counterintuitive if you're used to balanced trees, unbalanced trees can effectively optimize access for frequently accessed items. The idea is that once a node has been accessed, you want it to be readily available the next time a query occurs.
00:16:14.600
The final data structure I want to cover is a trie, which is one of my favorites. Tries are cool because they offer O(1) lookup, addition, and removal times. They provide ordered traversals and prefix matching capabilities, making them more efficient than hash tables in certain scenarios.
00:16:49.600
To illustrate how a trie works, let's consider an empty trie. We can add strings to it, and if two strings share a common prefix, they will share nodes. When querying, you simply navigate down the tree, checking each letter in a sequence until you reach the string you want.
00:17:29.919
If you query a misspelled word, the search would halt as soon as an incorrect character is encountered. A practical use case for tries is in auto-completers. By maintaining a list of words in the trie, you can easily retrieve all matches based on a user-input prefix.
00:18:35.440
In conclusion, while you may not find immediate use for the data structures we've discussed, I hope this talk encourages you to consider exploring them further. Data structures are fascinating, and understanding them can help you find efficient solutions to various programming challenges.
00:18:57.440
If you have any questions, please go ahead.
00:19:53.439
The idea with the Bloom filter is not to rely on it being correct 100% of the time. In file server cases, for instance, even if it returns a false positive, you still end up correctly finding the nonexistent file. Essentially, a Bloom filter is a probabilistic tool that is useful for filtering data effectively.
00:20:16.480
That's the basic premise of these structures, and they can greatly aid in achieving efficient data handling. Thank you for your attention!