Ruby Internals

Let's subclass Hash - what's the worst that could happen?

by Michael Herold

In his talk at RubyConf 2018, Michael Herold explores the complexities and pitfalls of subclassing core Ruby classes, specifically Hash. He introduces the topic with light humor and connects his experiences as a maintainer of the Hashie gem, which enhances hash functionalities but also brings to light some significant issues that arise from such practices.

Key Points Discussed:
- Far-reaching Implications of Subclassing: Michael emphasizes that subclassing core Ruby classes, like Hash, can lead to unexpected bugs and performance issues. He cites the proverb, "With great power comes great responsibility," asserting that developers often overlook the responsibilities tied to extending core classes.

  • Indifferent Access Bug: The first major bug Michael discusses involves the Indifferent Access feature in Hashie. This extension allows users to access hash keys with either strings or symbols. However, a bug was uncovered where the merge method did not behave as expected due to how the underlying Ruby classes interact, leading to a NoMethodError.

  • Mash Key Collisions: Michael describes how Mash, another component of Hashie, can create conflicts with Ruby's built-in hash methods. For instance, accessing a key that collides with an enumerable method can produce perplexing outcomes, negatively impacting performance and causing confusion in user interactions with the data structure.

  • Dash Data Structure: He also discusses a third structure called Dash, which enforces specific properties within a hash. This structure complicates merging operations due to strict validations, illustrating how subclassing can unintentionally break expected behaviors within Ruby's classes.

  • Public Method Rigor: He points out that Ruby's Hash class contains around 178 public methods, introducing a vast interface that subclassing must contend with. This can lead to a higher likelihood of bugs and complexity in managing the interactions of overridden methods.

Important Conclusions/Takeaways:
- Michael cautions against the impulse to subclass core Ruby classes due to the hidden complications that may arise, emphasizing that alternative solutions, like OpenStruct, might offer simpler implementations without incurring the overhead that subclassing often entails.
- He concludes with a strong recommendation to reconsider the necessity of tools like Hashie Mash in applications, advocating for direct parsing and manipulation of data to avoid potential pitfalls.

The session wraps up with encouragement for audience members to reach out for further collaboration or contributions to Hashie, maintaining an open dialogue to foster better practices in Ruby development.

00:00:15.650 Good morning, everyone! How are we doing this morning? Did you have a good night last night? Did everyone enjoy the party? Are we all awake? Do we need to do some stretching or anything? Sirhan's talk was really good, so I think we're probably all awake.
00:00:21.330 So, how about this venue? We’re in the Crystal Ballroom, which for those of you on the live stream looks like this. It makes me a little nervous that Statler and Waldorf might pop up on one of those balconies and start heckling me.
00:00:34.680 Given where we are, I want to be crystal clear about what I’m talking about. No need to be fuzzy. Anyway, now that my terrible puns are out of the way, let’s get started.
00:00:49.680 I’m going to talk about subclassing Hash: what’s the worst that could happen? As Megan said, my name is Michael Herold. If you have any questions or anything, please tweet me at @mHerold or say hello at Michael J Herold. As Megan also said, I work at Flywheel, a delightful WordPress hosting company for designers and creatives.
00:01:09.119 Now, I said WordPress, but we do all Ruby, so don’t worry, I’m not an impostor. Also, we're looking for an engineering manager, so if you are one and looking for a new job, please come talk to me or one of my co-workers here.
00:01:38.970 This talk is about a little gem called Hashie. If you read Hashie's GitHub page, it says that Hashie is a collection of classes and mix-ins that make hashes more powerful. Let’s think about this for a minute. What pops out at you from that sentence? The phrase 'more powerful' immediately stands out to me.
00:01:58.860 Whenever I hear that phrase, it makes me think of Uncle Ben. Of course, it might also make you think of 'unlimited power,' which we know we may also get by doing this. But let’s get back to Uncle Ben. Sadly, Stan Lee passed away this week, which is a sad day. But Uncle Ben is famous for saying, 'With great power comes great responsibility.'
00:02:22.410 I’d like to juxtapose this with an Alexander Pope quote: 'To err is human.' Well, humans write computer programs, and what do computer programs have? Bugs. This talk is primarily centered around three different bugs in three portions of the Hashie library. So that’s the framework for our story.
00:02:49.980 The first bug we’re going to talk about occurs in a Hash extension that we call Indifferent Access. If you're a Rails developer, you know that there’s a hash with indifferent access in Active Support. We have an extension that provides you that power without having to use Active Support, but there’s a bug in there.
00:03:07.550 There’s also a bug in Mash keys, which I’ll talk about a lot since it’s a big part of our library. Finally, we’re going to discuss destructuring a Dash, which is another data structure within our library that I’ll tell you about.
00:03:37.530 To start out, I wake up one morning and see that there’s a bug report on GitHub. It’s a good bug report, very thorough, so let’s dig into what the reporter found. If we look at their sample code, they start out by subclassing Hash and then mix in something called the Merge Initializer.
00:03:57.030 If you know how Hash works out of the box, you can't easily pass a hash into Hash.new to have its contents merged in. The Merge Initializer gives you that ability. It looks a little bit like this: we’re going to create a new MyHash. We’re going to pass it a cat key that has 'meow' and a dog key that has another hash with a name of 'Rover' and a sound of 'woof'. We get what we would intuitively expect from Ruby's standard library.
00:04:27.230 We can look up the cat key on our MyHash and get 'meow', and we can look up the dog key with the bracket syntax and get the nested hash back. So that’s the Merge Initializer. But that’s not where a bug lives.
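The Merge Initializer behavior described above can be sketched in plain Ruby. This is only an illustration under assumptions: `MergeInit` is a hypothetical stand-in module written for this example, not Hashie's actual implementation.

```ruby
# Hypothetical stand-in for a merge-initializer mix-in (not Hashie's source).
module MergeInit
  # Accept a hash at construction time and copy its pairs into the new object.
  def initialize(hash = {})
    super()
    hash.each { |key, value| self[key] = value }
  end
end

class MyHash < Hash
  include MergeInit
end

my_hash = MyHash.new({ cat: 'meow', dog: { name: 'Rover', sound: 'woof' } })

puts my_hash[:cat]        # meow
puts my_hash[:dog][:name] # Rover
```

Without the mix-in, `Hash.new` treats its argument as a default value rather than as initial contents, which is why an extension like this is needed at all.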
00:04:59.780 The reporter also mixed in the Indifferent Access extension, and this is where the problem lies. If we create a new MyHash with Indifferent Access, we see that we can access the hash with a string key 'cat', just like we can with a symbol key. This is the Indifferent Access portion of the hash.
00:05:12.770 It makes it so you don’t have to remember if you're using string keys or symbol keys, which is particularly useful when dealing with user input from an endpoint or something similar. Intuitively, we see that accessing 'dog' with a string gives the same result as accessing it with a symbol.
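A minimal sketch of the indifferent-access idea, assuming the simplest possible strategy of storing every key as a string. The `IndifferentKeys` module is hypothetical and much smaller than the real extension.

```ruby
# Hypothetical stand-in for an indifferent-access mix-in (not Hashie's source).
module IndifferentKeys
  # Normalize every key to a String on the way in and on the way out.
  def [](key)
    super(key.to_s)
  end

  def []=(key, value)
    super(key.to_s, value)
  end
end

class MyHash < Hash
  include IndifferentKeys
end

my_hash = MyHash.new
my_hash[:cat] = 'meow'

puts my_hash['cat'] # meow
puts my_hash[:cat]  # meow
```

Note that this sketch only covers `[]` and `[]=`. Every other Hash method that touches keys (`fetch`, `key?`, `merge`, and so on) would need the same treatment, which is the maintenance burden the talk goes on to describe.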
00:05:40.130 So far, everyone with me? Awesome! Now, we’re going to create our hash again, then grab the dog hash and merge on the breed. We try to merge on the breed, which is 'blue heeler', and we receive a NoMethodError for an undefined method 'convert'. Hmm, I don’t see that anywhere. What’s going on?
00:06:02.030 When we look at the Indifferent Access extension, we see that we have a merge method that calls super and then calls convert on the result. We also have a convert method. So what’s happening? We’ve mixed this into our hash; why do we suddenly not have access to this convert method?
00:06:31.370 We check our hash to see if we respond to 'convert'. I love Ruby for this! We ask if the hash responds to convert, and it says yes. Then we check if the dog hash within the hash responds to convert, and we get true. What is happening? We need to go deeper.
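The failure can be reproduced without Hashie. The sketch below assumes a hypothetical `Indifferent` module whose `merge` calls `super` and then `convert`, mirroring the shape described in the talk, and it extends a nested hash the way an injected extension would.

```ruby
# Hypothetical stand-in for the extension (not Hashie's source).
module Indifferent
  # Rewrite all keys as strings, in place.
  def convert
    keys.each { |key| self[key.to_s] = delete(key) unless key.is_a?(String) }
    self
  end

  # Buggy on purpose: assumes the object returned by super also has #convert.
  def merge(other)
    super.convert
  end
end

dog = { name: 'Rover', sound: 'woof' }.extend(Indifferent)
puts dog.respond_to?(:convert) # true

error = begin
  dog.merge({ breed: 'blue heeler' })
  nil
rescue NoMethodError => e
  e
end

puts error.class # NoMethodError
puts error.name  # convert
```

The receiver responds to `convert`, yet the call inside `merge` still fails, because it is made on the object that `super` returned rather than on the receiver.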
00:07:06.960 So here’s an introduction to two of my favorite tools called Pry and Byebug. Does anyone here use Pry or Byebug? Yes! It makes my life so much easier. When I encounter a bug like this, I often write a failing test and then insert something that looks like this.
00:07:41.490 We’re going to call our merge method, and we’ll call super, and then we’ll tap into super. If you’re not familiar with 'tap', it’s a method on Object. All it does is pass the object you called tap on as the block parameter. Thus, the result of super becomes 'result', while self remains the Indifferent Access hash.
00:08:06.960 Now we have access to both the result and self to figure out what's going on. When we call convert on the result, we then call hash.merge with 'blue heeler', and we get dropped into a REPL (Read Eval Print Loop). Now we can type and interact with the variables.
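Here is a rough sketch of the `tap` debugging pattern. It assumes the same hypothetical `Indifferent` module as a stand-in, and it records observations into an array instead of opening a REPL so that it runs without the pry gem; with pry loaded, `binding.pry` would go where the comment indicates.

```ruby
# Hypothetical stand-in for the extension (not Hashie's source).
module Indifferent
  OBSERVATIONS = []

  def convert
    self
  end

  def merge(other)
    # tap yields the return value of super as `result` and then returns it;
    # inside the block, `self` is still the hash that received #merge.
    super.tap do |result|
      # With the pry gem loaded, `binding.pry` here would open a REPL with
      # both `self` and `result` in scope. This sketch records answers instead.
      OBSERVATIONS << { self_responds: respond_to?(:convert),
                        result_responds: result.respond_to?(:convert) }
    end
  end
end

dog = { name: 'Rover' }.extend(Indifferent)
dog.merge({ breed: 'blue heeler' })

observation = Indifferent::OBSERVATIONS.last
puts observation[:self_responds]   # true
puts observation[:result_responds] # false
```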
00:08:37.920 First thing to check is what is self. Just to make sure I know what I’m dealing with: self is a hash, okay, that makes sense. So far, we see that result is also a hash. These should behave similarly. We ask self if it responds to convert, and it says yes.
00:09:17.220 Then we ask if the result responds to convert, and we get false. This makes no sense! They should be the same thing, right? If you’re unfamiliar with the singleton class: the singleton class, also called the eigenclass, is a hidden class that belongs to one particular object and holds the behavior specific to that object.
00:09:46.110 When you call extend on an object, you modify the singleton class of that object. When you call a method on an object, Ruby crawls up the array of singleton class ancestors and checks whether each of those modules has that method. We see that we have the Indifferent Access extension in the ancestors, so that’s why self responds to convert here.
00:10:21.630 When we ask the same question of result, we see that its singleton class has no knowledge of Indifferent Access. Thus, that is the source of our bug: the result doesn’t know how to convert because of this.
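The singleton-class check can be run directly in plain Ruby. In this sketch `Indifferent` is again a hypothetical stand-in module. The point is only that `extend` changes one object's singleton class, and that the hash returned by the built-in `Hash#merge` does not inherit that change.

```ruby
# Hypothetical stand-in for the extension (not Hashie's source).
module Indifferent
  def convert
    self
  end
end

dog = { name: 'Rover' }.extend(Indifferent)

# Hash#merge (the built-in one; this module does not override it) returns a new hash.
merged = dog.merge({ breed: 'blue heeler' })

puts dog.singleton_class.ancestors.include?(Indifferent)    # true
puts merged.singleton_class.ancestors.include?(Indifferent) # false
puts merged.class                                           # Hash
```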
00:10:43.290 We need to make sure that the result of merge gets Indifferent Access set on it; without doing that, it’s just a normal hash. Because of the implementation of the merge method, super calls the hash implementation of merge, which is written in C in the Ruby VM.
00:11:01.709 So what we get back when we call super is just a normal hash, even though we’re asking the Indifferent Access extension to give us the result. To fix this bug, we’re going to change our approach. We’ll grab super as a result and then make sure to inject Indifferent Access into the result.
00:11:21.560 This means that the result's singleton class now has Indifferent Access in its ancestors list, which allows it to respond to convert. Once we make this change, we can run our test again, set up our MyHash, and try to access the dog and merge on the breed. It works!
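One way to express the fix, sketched under the assumption that "injecting" the extension amounts to calling `extend` on the result. The `Indifferent` module here is hypothetical and Hashie's real code may differ in its details.

```ruby
# Hypothetical stand-in for the extension (not Hashie's source).
module Indifferent
  # Rewrite all keys as strings, in place.
  def convert
    keys.each { |key| self[key.to_s] = delete(key) unless key.is_a?(String) }
    self
  end

  def merge(other)
    result = super
    # Make sure the new hash also carries the extension before converting it.
    result.extend(Indifferent) unless result.is_a?(Indifferent)
    result.convert
  end
end

dog = { name: 'Rover' }.extend(Indifferent)
merged = dog.merge({ breed: 'blue heeler' })

puts merged['breed']               # blue heeler
puts merged.respond_to?(:convert)  # true
```

Because the result now has the module in its singleton class, chained calls such as a second `merge` keep working as well.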
00:11:43.080 So why was this a problem? The source of the bug was the fact that we called super, which used the base class of Hash. When we call super, Hash's implementation of merge runs in that instance. We need to use super to chain multiple extensions together for them to interoperate, but when we come from Hash, the base of super is going to be the Hash class's merge method.
00:12:21.480 Hash has a significant number of public methods (approximately 178), and Aaron Patterson wrote a blog post in 2014 about how too many methods can cause issues, particularly related to a memory leak in Rails when using its Action Controller parameters class.
00:12:43.020 He explained that you need to handle all of those methods because they are part of the public interface of your class. I find it challenging to manage covering 178 implicit methods from a subclass; the likelihood of bugs increases significantly.
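The size of the interface is easy to inspect. The exact counts depend on the Ruby version and on which libraries are loaded, so this sketch prints them rather than assuming the figure quoted in the talk.

```ruby
# Every public method an instance of Hash (or a subclass) answers to,
# including those inherited from Enumerable, Object, and Kernel.
total = Hash.public_instance_methods.size

# Only the methods defined directly on Hash itself.
own = Hash.public_instance_methods(false).size

puts "Hash exposes #{total} public instance methods (#{own} defined on Hash itself)"

# Some of the interface comes from Enumerable, for example #zip.
puts Hash.public_instance_methods.include?(:zip)        # true
puts Hash.public_instance_methods(false).include?(:zip) # false
```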
00:13:10.080 That was the first problem, which was relatively easy to fix once we knew where to dig in. Let’s look at a second problem involving Mash keys.
00:13:40.780 Another morning, I get a bug report about Mash keys that collide with hash methods, producing strange results. It’s a pretty good report that clearly explains what’s happening. If you use Hashie, you might have seen it show up in your Gemfile and wondered why.
00:14:11.829 Spoiler alert: yes, Hashie has been controversial as I alluded to. In 2014, Richard Schneeman wrote a blog post titled "Hashie Considered Harmful." Although he primarily discusses Mash, his criticisms of Hashie provide valuable insights.
00:14:39.440 He noticed that after adding OmniAuth to a Rails application, every endpoint became 5% slower. This performance hit was experienced across the board, not just with endpoints associated with OmniAuth, which is an interesting observation.
00:15:08.750 So, back to Mash. It works a bit like this: if we create a Mash and ask whether it has a name property, it returns false because the property doesn’t exist yet. We can verify this by trying to fetch 'name' using a method accessor, which gives us nil. This behavior is intuitive and expected.
00:15:54.019 We can set the name of the Mash to 'my mash', and then when we ask for 'name', we get 'my mash'. We can also see that we have a name property set. This is most of how Mash is utilized. It’s also recursive, meaning that if the value for a key is itself a hash, it gets wrapped in a Mash as well.
00:16:29.839 This functionality is implemented through method_missing, which is powerful yet sharp. The implementation checks if it receives a message it doesn't recognize. If it matches a key, it returns it. If the key has a suffix, it processes accordingly based on that suffix.
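A minimal sketch of the `method_missing` approach, assuming only three cases: a plain reader, a `=` writer, and a `?` presence check. `TinyMash` is a hypothetical class written for this example. The real Mash handles more suffixes and wraps nested hashes, which is omitted here.

```ruby
# Hypothetical, heavily simplified Mash-like class (not Hashie's source).
class TinyMash < Hash
  def method_missing(name, *args)
    key = name.to_s
    if key.end_with?('=')
      self[key.chomp('=')] = args.first  # mash.name = 'x'
    elsif key.end_with?('?')
      key?(key.chomp('?'))               # mash.name?
    else
      self[key]                          # mash.name (nil when unset)
    end
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.end_with?('=', '?') || key?(name.to_s) || super
  end
end

mash = TinyMash.new
puts mash.name?        # false
puts mash.name.inspect # nil

mash.name = 'my mash'
puts mash.name         # my mash
puts mash.name?        # true
```

The important detail is that `method_missing` only runs for messages the object does not already understand, which is what sets up the collision problem discussed next.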
00:17:05.249 Mash is intended primarily for JSON responses, as the README states. I have been guilty of writing API client libraries that use Mash for this purpose. After parsing a JSON response into a hash, we wrap it in a Mash and get method accessors for everything. While it seems convenient, nothing bad could happen, right?
00:17:46.350 However, a Mash is a Hash, and it is defined to behave like one. The problem is that Hash has 178 public methods, and some of those methods could conflict with what you return from an API.
00:18:17.290 So when we try to access 'zip', for instance, we get a strange response. It turns out we receive the enumerable 'zip' instead of what we expect. Thus, when Mash has a colliding public method, it behaves unexpectedly.
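The collision is reproducible with any Hash subclass that relies on `method_missing` for readers. The sketch below uses a hypothetical `TinyMash` with a reader only; the key name `zip` comes from the talk, and the postal code value is made up for illustration.

```ruby
# Hypothetical, heavily simplified Mash-like class (not Hashie's source).
class TinyMash < Hash
  # Only reached when no real method with this name exists.
  def method_missing(name, *args)
    key?(name.to_s) ? self[name.to_s] : super
  end

  def respond_to_missing?(name, include_private = false)
    key?(name.to_s) || super
  end
end

address = TinyMash.new
address['city'] = 'Los Angeles'
address['zip']  = '90210'

puts address.city        # Los Angeles
p address.zip            # [[["city", "Los Angeles"]], [["zip", "90210"]]]
puts address['zip']      # 90210
```

`address.zip` never reaches `method_missing`, because `Enumerable#zip` already exists on every Hash, so the caller gets an array of key/value pairs instead of the stored value.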
00:18:44.200 To address this, I created the Method Access with Override extension. You can mix this into your Mash to override conventional methods that may conflict. This gives you control over how a method behaves while retaining access to the original behavior.
00:19:15.230 However, please note that this approach can bust the method cache, leading to performance issues in production environments. It serves as an interesting exploration between conflict management and understanding performance implications.
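As a rough illustration of the override idea, the sketch below defines per-object methods when a colliding key is written and keeps the original reachable under a prefixed name. Everything here is an assumption made for the example: the class name `OverridingMash`, the `__` prefix, and the choice to hook `[]=` are hypothetical and are not taken from Hashie's source.

```ruby
# Hypothetical sketch of overriding colliding methods per object.
class OverridingMash < Hash
  SIMPLE_NAME = /\A[a-z_][a-z0-9_]*\z/

  def []=(key, value)
    key = key.to_s
    # Only plain identifier-like keys are considered; this sketch does not
    # protect essential methods (such as #class or #fetch) from being shadowed.
    if key.match?(SIMPLE_NAME) && self.class.method_defined?(key) &&
       !singleton_methods.include?(key.to_sym)
      original = method(key)
      # Keep the original behavior reachable as __<name>.
      define_singleton_method("__#{key}") { |*args, &block| original.call(*args, &block) }
      # Shadow the built-in method with a reader for the stored value.
      define_singleton_method(key) { self[key] }
    end
    super(key, value)
  end
end

address = OverridingMash.new
address[:zip] = '90210'

puts address.zip  # 90210
p address.__zip   # [[["zip", "90210"]]]
```

Defining methods on individual objects at runtime like this is the kind of operation the talk warns about with respect to the method cache, so the convenience has a cost.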
00:19:47.080 Now let’s tackle another data structure called 'Dash'. A Dash is a declarative hash and offers another layer of control. Once I received a well-documented bug report explaining issues with double splat merging with a Dash.
00:20:06.169 A Dash allows you to define a hash and enforce what properties it can have, thus adding a layer of validation to ensure bad states do not occur within your hash.
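The declarative-hash idea can be sketched as a Hash subclass that rejects keys that were not declared. `TinyDash`, its `property` macro, and the choice of `ArgumentError` are all hypothetical choices for this illustration; the real Dash offers more (defaults, required properties, and so on) and may raise a different error.

```ruby
# Hypothetical, heavily simplified Dash-like class (not Hashie's source).
class TinyDash < Hash
  # Declare an allowed property on the class.
  def self.property(name)
    properties << name.to_sym
  end

  def self.properties
    @properties ||= []
  end

  def initialize(attributes = {})
    super()
    attributes.each { |key, value| self[key] = value }
  end

  # Refuse to store anything that was not declared.
  def []=(key, value)
    unless self.class.properties.include?(key.to_sym)
      raise ArgumentError, "property '#{key}' is not defined for #{self.class.name}"
    end
    super(key.to_sym, value)
  end
end

class Pet < TinyDash
  property :name
  property :sound
end

pet = Pet.new({ name: 'Rover', sound: 'woof' })
puts pet[:name] # Rover

begin
  pet[:breed] = 'blue heeler'
rescue ArgumentError => e
  puts e.message # property 'breed' is not defined for Pet
end
```

Only `[]=` is guarded in this sketch. Inherited methods such as `merge`, `store`, or `update` go around the check entirely unless each one is overridden too.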
00:20:36.800 When we try to double splat this result, the error occurs because the behavior alters from what you expect with a typical Hash. A Dash doesn’t permit undefined properties. However, when trying to merge these properties, certain behaviors break down.
00:21:09.490 Using the RubyVM instruction sequence tool, I learned that Ruby’s VM won’t call to_hash on a Dash since it assumes it’s already a hash, leading to unexpected outcomes with property accesses.
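The difference in how a double splat treats a Hash subclass versus any other object can be observed directly. In this sketch `Wrapper`, `LoudHash`, and `capture` are hypothetical names. The last line prints MRI's bytecode for a double splat so the instructions can be inspected, and it is skipped on Ruby implementations without `RubyVM::InstructionSequence`.

```ruby
# A non-Hash object: the double splat must convert it, so #to_hash is called.
class Wrapper
  def to_hash
    { via: :to_hash }
  end
end

# A Hash subclass: already a hash as far as the VM is concerned,
# so the custom #to_hash below is never consulted by the double splat.
class LoudHash < Hash
  def to_hash
    { via: :to_hash }
  end
end

def capture(**options)
  options
end

loud = LoudHash.new
loud[:a] = 1

p capture(**Wrapper.new) == { via: :to_hash } # true
p capture(**loud) == { a: 1 }                 # true

if defined?(RubyVM::InstructionSequence)
  puts RubyVM::InstructionSequence.compile('h = {}; { **h }').disasm
end
```

A subclass that customizes `to_hash` to present itself differently therefore cannot rely on that customization when it is double-splatted.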
00:21:36.359 This creates a layer of complexity whereby a method you think should exist might not, due to how Ruby interprets certain structural behaviors.
00:22:02.510 To recap, we discussed Indifferent Access and how it allows you to access hashes with strings and symbols interchangeably without loss. We also examined Mash keys and their recursive properties while understanding that they rely heavily on method_missing.
00:22:45.449 Finally, we tackled the Dash data structure and its merging behaviors that conflict with established expectations. All three of these issues stem from subclassing Hash.
00:23:07.350 I focus on Hash specifically because I am most familiar with it, but any time you subclass core Ruby classes, similar issues will arise. Classes such as String and numerous others have a vast array of public methods, many of which are implemented in C.
00:23:30.870 Attempting to override these behaviors may lead to complications you do not anticipate. As I mentioned, Aaron’s blog post discusses similar problems in Rails concerning internal classes, and these complications can manifest unexpectedly.
00:23:53.290 The key takeaway here is understanding that when subclassing these core classes, you have to contend with a potentially vast public interface and the consequences that can arise from that.
00:24:13.440 Before I close, I have a quick additional piece. My co-maintainer in Hashie, DB, has chronicled everything that has gone wrong with Hashie Mash. His entertaining blog post recounts a series of mishaps through the years.
00:24:31.740 When you look at the RubyGems database dump, one percent of the top 1,000 gems depend on Hashie, and most of those rely on it predominantly for Mash.
00:25:00.650 So, my parting PSA—question the necessity of Hashie Mash in your applications. More often than not, you can parse a JSON string directly through standard methods and work with the data directly.
00:25:36.080 Replacing Hashie Mash with OpenStruct or similar structures can lead to a more straightforward implementation and significantly reduce overhead. Also, if you need recursion, OpenStruct can deliver that extensibility.
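Both alternatives can be sketched with the standard `json` and `ostruct` libraries. The payload below is made up for illustration. Note that `ostruct` is being moved out of Ruby's default gems in recent versions, so it may need to be listed as a dependency.

```ruby
require 'json'
require 'ostruct'

# Made-up API response for illustration.
payload = '{"name":"Rover","zip":"90210","owner":{"name":"Sam"}}'

# Option 1: work with the parsed Hash directly.
data = JSON.parse(payload)
puts data['zip']               # 90210
puts data.dig('owner', 'name') # Sam

# Option 2: ask the parser to build OpenStruct objects, including nested ones,
# which gives method-style access without subclassing Hash.
pet = JSON.parse(payload, object_class: OpenStruct)
puts pet.name                  # Rover
puts pet.owner.name            # Sam
```

With plain hashes, a key named `zip` is just a key and cannot collide with a method on the container.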
00:26:09.260 If you’d like to work together, please reach out to me at Flywheel. My name is Michael Herold, and I’d be happy to discuss further questions here or online. Also, if you're interested in contributing to Hashie, please contact me!
00:26:39.080 Thank you for your attention! I appreciate your time.