Big O in a Homemade Hash

Talks

Nathan Long

Big O in a Homemade Hash

by Nathan Long

In the talk 'Big O in a Homemade Hash' by Nathan Long at the MountainWest RubyConf 2014, the speaker explores the intricacies of hash data structures, particularly in the context of Ruby's Hash class. The main theme revolves around re-implementing a hash to achieve O(1) performance, enhancing the understanding of Big O notation and its significance in algorithm efficiencies. Key points discussed throughout the video include:

Introduction to Hashes: A hash is described as a data structure that manages key-value pairs and is often referred to as a map or dictionary in other programming languages.
Initial Implementation: Nathan proposes an implementation of a hash using an array of arrays, termed 'TupleMap', which meets basic requirements but operates at O(n) complexity.
Understanding Big O Notation: He simplifies Big O notation, explaining how it measures the growth rates of algorithms, contrasting O(n) with O(1) to showcase the efficiency of operations regardless of input size.
Collision Handling: To achieve efficient performance in hash lookups, Nathan discusses the importance of using arrays indexed by hashed values while addressing potential collisions through bucket storage.
Dynamic Resizing: He elaborates on maintaining efficiency by growing the hash structure as needed, rehashing when the number of keys in any bucket exceeds a threshold. This involves reallocating entries in a new, larger array.
Performance Measurement: Nathan presents a performance graph highlighting the consistent read and write times across varying sizes of hashes but notes the performance dips during rehashing events.
Amortized Analysis: He concludes that despite spikes in performance during resizing, the hash maintains its O(1) performance on average, enhancing understanding of efficiency and resource allocation in programming.
Final Thoughts: Despite teaching this implementation, Nathan emphasizes the benefits of Ruby's native hash due to its optimized performance facilitated by C language efficiencies, suggesting a preference for well-established systems for production use.

The talk concludes on a humorous note, reflecting on the imagined consequences of failure to send the completed hash to the President, highlighting the importance of coding and understanding data structures in programming.

00:00:25.519 So, I got a pretty important phone call recently.

00:00:30.720 President Obama called me up. I was pretty surprised; I was brushing my teeth when he called me. He said, 'I've received great news of a grave threat unless you re-implement Ruby's Hash, alien terrorists will blow up Sea World.'

00:00:42.079 I thought that was a bit strange. I told him, 'Mr. President, I love aquatic mammals as much as you do, but that really makes no sense.' I thought it would be best to hang up and blow it off, but then I looked outside and saw a suspicious van. That made me think, maybe I should at least give it a shot.

00:01:02.960 So, everybody knows what a hash is. A hash is a data structure we use all the time. Basically, it's a place to store values that you associate with keys, and you can ask for a key to get its corresponding value back. Some languages call this a map or a dictionary. Our goal is to implement this; let’s make our own hash class that works just like that.

00:01:32.400 Here's the implementation I have in mind. Inside the guts of this thing, we have an array of arrays. We have a bunch of small arrays, where each of them contains a pair of keys and values. So, if you say, 'I would like to see the value for cat,' I can walk through these little arrays until I find the key that is 'cat' and give you back its value.

00:01:38.720 If you try to insert a value, I can simply traverse the arrays until I find the correct key to update the value. If there isn't an entry for that key, I'll insert a new tuple. This could work perfectly. I also know that Ruby has all kinds of lovely syntactic sugar. You can define your own 'equals' method, so to get the nice bracket notation, I'll define this reader and writer, and then it will look just like it's supposed to.

00:02:01.680 Now, you notice I'm calling this class 'TupleMap.' It's not really a hash, but it's close enough since it uses tuples, so that's a good name. I built this class, tested it, and everything worked just like the standard one—I could get it back out. Thumbs up! I wrote an email to the President, attached my class, and said, 'There you go, sir!' I thought I had called it a day.

00:02:41.680 However, I got another phone call. The President was clear: 'That's not good enough. My experts tell me that your hash is O(n), and it needs to be O(1). I don’t know what that means, but I want you to fix it.' I told him he was changing the requirements on me. I felt the pressure mounting. But the problem is, his experts were right.

00:02:56.879 O(n) and O(1) are very different things. But what does that mean? Now, probably many of you have computer science backgrounds, but I don’t. So, I'm going to talk to those of you who don’t. This is going to be computer science explained by noobs for noobs. I'm going to explain the concept of Big O and why it’s so important.

00:03:34.400 When we talk about Big O notation, the 'O' stands for order, as in order of growth. We're really talking about rates of growth. Let me give you some simple illustrations. If I were to do a task like shake hands with every one of you, that's O(n). If there are 10 of you, I have 10 handshakes to make. If there are 100, I have 100 handshakes to make. Very simple, right? It scales that way.

00:03:54.560 What if I were to introduce every person in this room to every other person? That would be O(n²). If there are 10 of you, there are roughly 100 introductions to be made. If there are 100, then we’re looking at roughly 10,000 introductions. That scales very poorly. But something that scales awesomely is waving at you because I don't have to worry about how many of you there are—this is O(1). The great thing is that 'n' describes the number of inputs.

00:04:17.440 You might notice that in 'O(1)', the letter 'n' does not appear because it is completely irrelevant. So when we talk about an O(1) algorithm, that means it doesn't matter how many inputs we have at all. Orders of growth describe how an algorithm scales as we throw more and more work at it. Does it scale well, or does it scale poorly? So if you have an algorithm to sort a list and I give you a list that’s twice as large, does it take you twice as long to sort it, less than twice as long, or more than twice as long?

00:04:54.320 Now, what we really care about when we look at a diagram like this is how the graph is shaped. Notice these graphs are shaped very differently. For example, we have linear growth represented by a straight line, we have exponential growth, and we have logarithmic growth. Each of these has its own shape. We might get involved in talking about how algorithms perform in absolute terms, but when we graph these things, the angle and position are things we ignore when discussing Big O.

00:05:31.680 Here’s an example of O(1). All of these lines are O(1). In absolute terms, if you have an algorithm that performs like any of these, the differences may matter to you. On the x-axis, we have the number of inputs to the algorithm; on the y-axis, we measure the workload of the algorithm. Consider the blue line compared to the red line—your algorithm takes five steps instead of nine. You might want to optimize and do fewer steps, but in terms of growth, these are equivalent.

00:06:05.280 If you throw more and more work at it, it just performs like nothing changes; it continues its performance just as it has. Linear growth is represented by a straight line with a slope. Again, in absolute terms, you might care about the differences between them. The red one might have one unit of work for each input, while the steeper line may have five or ten. If you could optimize that, you could do this more efficiently, but in Big O terms, those differences are irrelevant.

00:06:44.720 Now, why do all these rates matter? They matter because as we consider how these graphs extend, those differences will multiply and become larger and larger. A famous algorithm when talking about orders of growth is the Traveling Salesman problem. You've probably heard of it; it's when a person wants to visit a number of cities, each once, and then return home, and the goal is to find the shortest path. The brute force approach tries every possible route, ranking them and returning the best.

00:07:20.760 That brute force approach is O(n!), which means if there are 10 cities, you have 10 times 9 times 8 times 7 all the way down. If it’s 100 cities, the steps involved go to 100 times 99 down to one. It grows quickly, making it difficult for us to comprehend in our minds. Suppose you're working on an approach for five cities takes 0.12 seconds. Does anyone want to throw out a guess for how long it would take with 21 cities? It's crazy—97 billion years.

00:08:06.640 It highlights how crucial it is to understand your algorithm's scaling. Consider you work for a travel search site, and you think you can develop this feature to assist users in planning their trips. You might be able to offer this for three cities or five cities. However, offering it for ten is unrealistic; the person will have taken the trip and returned before you finish. Also, forget about offering it for 21 cities because your server farm would melt down. It’s essential to recognize what Big O category your algorithm falls into.

00:08:41.440 The key idea here is that in the long run, it doesn't matter how high those values are—any O(1) algorithm will outperform any O(n) algorithm. Likewise, any O(n) algorithm will outperform any O(n²). Optimization is beneficial within the linear category but shifting from O(n) to O(1) is a monumental win. Now, what’s our Big O problem here? Why is this O(n)?

00:09:07.600 If you think about it, it makes sense. If you ask me for the key 'cat,' I must search through every one of these inputs to find where the key is 'cat'. Obviously, if there are a thousand entries, it's going to take much longer than if there are just three. When scaling up, it's clear that this isn't satisfactory. One great feature of hashes is they're O(1), meaning no matter how many keys are in the structure, it should take the same amount of time to look up a value.

00:09:38.320 I've actually built this Tuple Map class, and I've done measurements to show that it grows—here, what I've done is on the x-axis; you see larger sizes of hashes, anywhere from 150 to about 14,500 keys. The test I conducted was quite straightforward: I inserted 150 values and then performed 150 reads, and the growth was linear as expected.

00:09:55.760 Now, how can we implement the internals of this to achieve O(1)? We need to determine what we know is O(1). One important aspect is array lookup by index. If I request item 328 or item 5, it takes the same amount of time to find them, just like that. But how does that work?

00:10:03.519 Well, RAM is random access memory, which allows quick access to any location in memory. This is unlike a spinning disk, where the time to access data varies based on distance. This consistency is what makes random access memory so quick. So, the process works by knowing where the start of your array is and how wide each slot is. You can calculate the address using the index of the desired item and access it instantly.

00:10:37.200 Our new plan is to structure our hash using a sparse array along with a digest function to determine where to place data. The digest function converts a key into a number, which helps us find the index in our sparse array. However, 'digest function' can be somewhat confusing, as we often confuse it with hashes.

00:11:03.520 In Ruby, this functionality is provided by the 'digest' library, which is a suitable name because it conveys that digest functions work one way—they can’t transform waste back into value, at least not in the same way. We need a one-way function turning our strings into numbers to get and store values. If we achieve this, we’d be able to respond efficiently when someone requests a key.

00:11:54.320 We encounter a problem with collisions—there's no guarantee that every key will yield a unique value. For example, if the hashes for the keys 'appetizer' and 'mammal' both resolve to 2, we can’t have both stored in that slot. Our solution is to use buckets; instead of having a single entry at slot number two, we’ll store a collection of key-value pairs.

00:12:21.680 However, that brings up the same performance issue: if we locate our bucket, we have to traverse through all the entries. The second challenge is waste; while we appreciate having values, many empty slots lead to wasted memory. This is typical when dealing with sparse arrays, and we want to be efficient.

00:12:57.760 One way of minimizing collisions (the more collisions there are, the more entries we must sift through) and managing memory waste is by growing as needed. We can maintain good lookup speeds while only allowing minimal time investments throughout expansion.

00:13:35.040 We can define growing in many ways, but I'll keep it simple: whenever any bucket exceeds ten keys, it’s time to grow. As we increase the size of our sparse array, the new system should distribute entries evenly. While the prior setup had everything clustered, we want the new setup to have space again, with fewer buckets overcrowded.

00:14:00.320 This also means we must go through all the keys again to rehash and organize them in the new, larger sparse array. The basic strategy is to compute the hash based on a given number of initial buckets. For each key, we apply a modulus function to determine where to deposit it. Modulo helps ensure that numeric values resolve to valid slots within the sparse array. It is also essential to increase the size of our sparse array correctly.

00:14:35.520 To maintain efficiency, we can double the size of our string every time it grows; additionally, using prime numbers prevents rehashing to the same portions of the array. Let's determine how to calculate a raw digest value from a key; initially, I aimed to create it myself.

00:15:12.560 For it to work, any object must convert to a numeral. My first experiment involved utilizing strings by converting them to Unicode code points and merging the results. However, this method falls short since we require the capability of keying by arrays. Ruby allows any object to serve as a key.

00:15:57.840 Every object has an object ID, but this presents complications. Two distinct strings must yield identical outputs; otherwise, uniquely referencing prior data proves futile. The solution Ruby employs relies on asking the object itself for equality, and thus it must uphold its own rules according to the hash value returns.

00:16:28.720 For example, two separate arrays may bear different object IDs but contain equal values, resulting in identical hash values. If you design your custom class, such as 'Person,' where equality is based on shared first and last names, you'd also need to structure your hash method to return matching values. This can take time to understand, but it’s crucial.

00:17:03.360 So, what are our findings? We learned that we convert a key into a number using the hash method, then process it with operator % based on the number of buckets to find the index. We’ve seen that growing minimizes collisions, thus ensuring performance, but this leads to an increase in memory use as our hash grows.

00:17:45.120 However, our goal was O(1) performance. If I were to graph the performance of my class, it should ideally reveal that I can read and write from a hash with a million keys at the same efficiency as a hash with ten keys. Are we excited to see the results?

00:18:23.519 Well, while it seems promising, it isn't quite perfect. Looking at the graph’s x-axis—the size of the hash ranges from 100,000 entries to 9.7 million entries. The lower line shows the time taken for reads, while the upper line shows the time for writes. Regardless of the hash size, it appears reads take about the same time, which is good. Writes also remain relatively flat.

00:19:28.480 Yet, we notice spikes, which occur whenever we redistribute—this results from rehashing all current keys every time we grow. If we analyze the native Ruby hash, it seems to follow a similar performance pattern, which suggests that redistributing shapes performance along those lines.

00:19:56.600 How significant are those spikes? So significant that they might exceed what we can visually perceive. It could take forty-eight seconds, and you'd prefer not to use my hash in a production environment. While you might not have hashes that compacts to that extent, it still shows that the native Ruby hashes perform much better.

00:20:37.120 Nonetheless, it appears that both hashes exhibit linear performance underlining that the bulk of their performance lies flat. The spikes arise linearly from the increase in keys requiring rehashing. In Big O terms, both sorts trace a linear nature. Thus, we evaluate the results carefully; both hashes provide consistent performance with their respective spikes.

00:21:02.320 Can we declare that both hashes are O(1) or not? I would assert success, as the read performance continues to have less than ten steps, with no allowance more than ten keys in any single bucket. We could argue, in simpler terms, that it would be O(10), but such a positioning isn’t relevant; what matters is maintaining shape.

00:21:43.360 Write operations also represent O(1) performance due to their fixed-step nature. Each write involves calculating the location based on its hash, and then it’s a matter of placing it in the correct position. Of course, an unfortunate write occurring just as a growth-triggering event happens will complicate matters.

00:22:19.680 That said, growth does involve n steps, but if you think about it, as it slows down, rehashing becomes less frequent. This means each growth doubles in size produces larger valleys between spikes, presenting the average amount as consistent. This concept is known as amortized analysis, where we regard all rehashing efforts leading to one specific event as 'debt' that will be compensated in the future.

00:22:58.000 So, I decided to send the email attaching my work on the fresh hash to the President. I felt all was well. However, I conceded, you wouldn’t necessarily want to use the hash I created in a production setting. It hasn't yet matured with the years of refinement found within Ruby's native hash.

00:23:38.720 Moreover, native hashes draw upon C language optimizations, ensuring quicker slopes. An array of methods could accelerate performance as well: we could trade off memory allocations more conservatively, or perhaps harness background threads to rehash entries while avoiding delays.

00:24:14.720 Every solution holds valid; it’s about shaping the implementation rather than fitting the slope to performance metrics. Applications of this knowledge stretch across data structures like Hashes—a versatile multi-tool for algorithmic challenges. If you’re tasked with deduplication, utilizing a hash allows easy checks against your list.

00:24:42.320 Distributed hash tables (DHTs), such as those in NoSQL databases like DynamoDB, Riak, and Cassandra share similar foundational theories. Their processes utilize hashes to assign where a specific key resides; thus working efficiently with distributed resources.

00:25:19.840 However, if scaling emerges as a concern, issues arise again because adding a server requires recalibrating all keys in the system—a notable overhead. Instead of incorporating one server, consider deploying multiple to alleviate resource strain while spreading cancellation penalties across several systems.

00:25:55.520 So, while reflecting on all these thoughts, I received another phone call from President Obama.

00:26:34.960 He asked, 'Dude, I thought you were sending me that hash.' I replied, 'What do you mean? I sent you an email an hour ago!' He mentioned that he had ten missed calls and revealed that my draft was stuck.

00:27:01.600 With regret, I apologized for the delay. However, it was too late. In a twist, alien terrorists had already blown up SeaWorld.

00:27:40.560 But then, I woke up. It was all a dream. Thank you very much.

00:30:41.919 You!