Alternative Data Structures in Ruby

Ruby

Tyler McMullen

Alternative Data Structures in Ruby

by Tyler McMullen

The video 'Alternative Data Structures in Ruby' presented by Tyler McMullen at LA RubyConf 2010 discusses the necessity and benefits of utilizing alternative data structures beyond the traditional arrays, sets, and hashes in the Ruby programming language. McMullen emphasizes that while standard data structures suffice in many cases, alternative structures can offer significant advantages in terms of speed, memory efficiency, and clarity for specific applications.

Key Points Discussed:

Introduction of Alternative Data Structures: McMullen provides insight into why developers should explore various data structures tailored for specific needs, highlighting alternatives to common choices in Ruby.
Bloom Filters:
- A probabilistic data structure used primarily for determining whether an element is in a set.
- Offers remarkable memory efficiency; storing 100 million 100-character strings in about 280 MB, compared to 10 GB required by traditional sets.
- Useful in scenarios such as optimizing requests to a file server by reducing unnecessary lookups.
BK Trees:
- Specifically designed for finding approximate matches in a set, excellent for spelling correction and finding close geographical coordinates.
- Decreases search time by limiting comparisons based on known distances between nodes.
Splay Trees:
- Self-balancing binary trees that adaptively optimize data retrieval based on access frequency, favoring the most commonly accessed elements.
- Improve efficiency by reorganizing themselves to keep frequently accessed nodes at the top of the tree.
Tries:
- Data structures that provide O(1) time complexities for lookups, additions, and deletions.
- Effective for prefix matching and ordered traversals, often used in applications such as auto-completion.

Conclusions and Takeaways:

The talk encourages programmers to delve deeper into data structures, asserting that understanding their varied functionalities can lead to more efficient programming solutions.
Despite not being imperative for immediate application, alternative data structures provide essential tools for tackling diverse programming challenges efficiently. McMullen's exploration of data structures aims to inspire attendees to appreciate and utilize these tools in their development work.

00:00:16.000 Hi, my name is Tyler. I also work at Script. I'm going to be talking a little bit about alternative data structures in Ruby. Despite what Tim said, I don't actually know anything about NoSQL, so I'm not going to talk about that.

00:00:20.119 Instead, I'm going to discuss data structures that I've found useful while dealing with a huge amount of data at Script. You might be asking yourself, why should we talk about different data structures? I mean, we already have arrays, sets, and hashes. Isn't that enough? Well, sometimes it is, but other times, you need something a little different. I've found that the normal data structures just don't always work for me. So, why would I use a different data structure? There are basically three reasons: speed, memory, and clarity.

00:01:22.960 Now, let me clarify what I mean. The thing is, there might not be anything wrong with your favorite data structure; the ones I'm discussing are just the data structures I've used in the past—and still currently. The point of this talk isn't to argue that you should use these specific data structures. It's more about encouraging you to use data structures in general and to spark your interest in them.

00:02:04.880 So, let’s get right into it. We'll start with Bloom filters. The purpose of a Bloom filter is to test for existence within a set. It essentially checks, 'Have I seen this item before?' It's a probabilistic data structure, which means it can sometimes fail. You can control how much it fails and how often, which is a tunable feature we'll discuss shortly. The main benefit of a Bloom filter is its memory usage, which is impressive.

00:02:48.760 Let's say we have 100 million strings that are about 100 characters long. If you used a traditional set to store them, you would need around 10 GB of memory—far more than most computers have nowadays. However, a Bloom filter can handle that same data with only about 280 MB of memory if you can accept a 0.00001% chance of failure. If a higher failure rate is acceptable, you could do it in approximately 170 MB, which is much more manageable.

00:03:55.680 How does a Bloom filter work? Essentially, it's just a series of bits, like checkboxes that can either be on or off. When you want to add an object to your set, such as the string 'to be or not to be', you run it through a series of hash functions. How many hash functions you use depends on how many things you plan to add to the Bloom filter and its size, but for this example, let’s assume we use two hash functions.

00:04:27.590 So, you get two numbers from these hash functions, say one and five, and you set those indexes to 'on'. Now, if you want to add another string, let’s say 'question', you run it through those same hash functions. If one of the expected index numbers is not set, you know this string has never been seen. However, if both expected bits are set, there is a chance that it might have been added previously, which is where the false positives come into play.

00:05:35.480 Bloom filters can be particularly useful. For example, if you are running a file server that is remote and contains many files, querying this file server can be costly. Suppose you find that you're receiving many 404 errors. One solution to reduce the stress on this server would be to insert a Bloom filter between incoming requests and the file server.

00:06:19.360 The Bloom filter allows you to determine what is definitely not in the file server. Even if the Bloom filter gives a false match, saving you the time and resources required for an unnecessary lookup in the file server, the server will still respond with a 404 if the item doesn't exist. In this way, the Bloom filter can significantly help eliminate a vast majority of false requests.

00:06:56.960 To sum up, the Bloom filter's purpose is testing for existence in a set. The significant advantage of using this filter lies in its memory footprint and speed. Now, let's move on to a different data structure called BK trees.

00:08:02.960 BK trees, or Burkhard-Keller trees, named after the inventors, find the best match even when an exact one doesn't exist in a set. This structure decreases the search space. Normally, to find the closest match in a set of strings, you would have to scan through all of them, which can be time-consuming. The idea behind a BK tree is to ensure you do not need to scan the entire list. However, it operates only in a metric space, which works best for certain types of data, such as dictionaries for spelling correction.

00:09:36.720 For instance, imagine we have three nodes: X, Y, and Z, and we know the distances between them. Even if the distances aren't exact, we can determine a lower bound for the distance between nodes Z and Y. This lower bound allows us to eliminate comparisons where the distance exceeds a certain threshold when searching for close matches.

00:10:55.800 Now let’s look at an example using a simple dictionary of six words: taser, past, shave, light, pastor, and pasta. When building a BK tree, you would first choose a root word, say 'paste,' and each word would then be connected based on their edit distances. When a user types a word incorrectly, such as 'pastu,' the tree allows you to quickly find similar words with lower edit distances without needing to check every single entry.

00:12:09.280 This is beneficial because you only perform queries on relevant words, thus saving time and resources. Summarizing BK trees, these are primarily used for spelling correction, but they can also be used for tasks like finding close geographical points on a map. They work in metric spaces to reduce search space and function calculations.

00:13:24.000 Next, I want to introduce a data structure known as a splay tree. But before diving into that, I want to discuss access patterns a bit. When utilizing data structures, many people assume an even distribution of queries across different keys, but this is often not the case. Instead, querying resembles a power law. In text analysis, for instance, you will typically find that certain words are significantly more common than others.

00:14:23.360 This finding is particularly relevant to splay trees, as they operate optimally when access patterns are uneven. Splay trees are self-balancing binary trees that move the most accessed items towards the root. For example, when querying for a particular number, the tree will perform rotations to bring that number to the top for faster retrieval next time.

00:14:56.360 While this might seem counterintuitive if you're used to balanced trees, unbalanced trees can effectively optimize access for frequently accessed items. The idea is that once a node has been accessed, you want it to be readily available the next time a query occurs.

00:16:14.600 The final data structure I want to cover is a trie, which is one of my favorites. Tries are cool because they offer O(1) lookup, addition, and removal times. They provide ordered traversals and prefix matching capabilities, making them more efficient than hash tables in certain scenarios.

00:16:49.600 To illustrate how a trie works, let's consider an empty trie. We can add strings to it, and if two strings share a common prefix, they will share nodes. When querying, you simply navigate down the tree, checking each letter in a sequence until you reach the string you want.

00:17:29.919 If you query a misspelled word, the search would halt as soon as an incorrect character is encountered. A practical use case for tries is in auto-completers. By maintaining a list of words in the trie, you can easily retrieve all matches based on a user-input prefix.

00:18:35.440 In conclusion, while you may not find immediate use for the data structures we've discussed, I hope this talk encourages you to consider exploring them further. Data structures are fascinating, and understanding them can help you find efficient solutions to various programming challenges.

00:18:57.440 If you have any questions, please go ahead.

00:19:53.439 The idea with the Bloom filter is not to rely on it being correct 100% of the time. In file server cases, for instance, even if it returns a false positive, you still end up correctly finding the nonexistent file. Essentially, a Bloom filter is a probabilistic tool that is useful for filtering data effectively.

00:20:16.480 That's the basic premise of these structures, and they can greatly aid in achieving efficient data handling. Thank you for your attention!

Tyler McMullen

@tyler

LA RubyConf 2010