RailsConf 2021

The Curious Case of the Bad Clone

The Curious Case of the Bad Clone

by Ufuk Kayserilioglu

The video titled "The Curious Case of the Bad Clone" features Ufuk Kayserilioglu's investigation into a performance issue experienced by the Shopify monolith, specifically a 150% slowdown linked to a Sorbet signature in Ruby. This journey illustrates the intricacies of debugging in the Ruby, Rails, and Sorbet ecosystems. Ufuk meticulously outlines the steps taken to identify the culprit behind the slowdown, detailing how a seemingly innocuous Sorbet signature was implicated.

Key points discussed include:

- Initial Performance Investigation: The issue arose when a performance hit from 100 to 250 milliseconds was reported due to a newly added Sorbet signature, which prompted further investigation.

- Understanding Sorbet: Sorbet is described as a gradual type checker for Ruby. Ufuk discusses how its runtime component allows type validation, which was essential for understanding the slowdown.

- Ruby's method_added Hook: The presentation dives into the method_added hook and how it relates to performance when new methods are defined, particularly those triggering type validations at runtime.

- Identifying the Performance Impact: Through profiling, Ufuk discovered that multiple unnecessary calls to method_added resulted in significant overhead. The method's behavior during cloning operations was examined as it contributed to the slowdown.

- Collaboration with Experts: Ufuk highlights his discussions with Rafael Franca, an expert in Rails internals, to uncover deeper issues related to Active Support's callback methods that were impacting performance.

- Root Cause Analysis: The investigation led to the discovery of a flaw in the Ruby cloning mechanism, specifically concerning Singleton classes. This flaw caused unexpected behavior when cloning objects, leading to performance degradation.

- Resolution in Ruby 3.0: The findings culminated in a pull request that addressed these issues, resulting in a fix for Ruby 3.0, showcasing the collaborative aspect of debugging and enhancing performance in open-source communities.

- Conclusion: The session wraps up by discussing how the changes implemented improved the performance and reliability of Ruby applications. The overall experience emphasizes the importance of detailed investigation, community collaboration, and understanding of underlying mechanisms in coding frameworks.

00:00:04.860 It was a usual day at work on Friday, September 4th, 2020. I was working on improving our RBI generator, Tapioca, reviewing some PRs and responding to emails when a colleague pinged me on a PR. The request times had gone from 100 to 250 milliseconds, and a single Sorbet signature seemed to be the culprit. I needed to investigate this.
00:01:01.500 At the time this PR was opened, I had been working with Sorbet for about a year and a half, having adopted Sorbet on our main monolith at Shopify, which we call Core. I had fine-tuned the runtime impact to decrease some of the overhead coming from the Sorbet runtime component. We hadn't received any similar reports about a runtime slowdown like this before, so it was hard to believe that this was indeed a problem caused by Sorbet. I noted down my suspicion in the PR, and we waited to see if deploying it would bring the performance back.
00:01:37.799 Lo and behold, after a couple of minutes, Basam ran a profile on the deployed version of the code, and we found that our old performance was back. The stack traces were clearly implicating Sorbet. But before we delve further into that, let's talk about Sorbet.
00:01:57.780 Sorbet is a gradual type checker for Ruby, built as a static executable in C++. The type signature annotations of Sorbet are, by design, Ruby code. You add a `sig` method call with a block to declare the types of your method parameters and return value. This allows Sorbet to stay pure Ruby. There is a small cost: we need the `sig` method to be defined by someone at runtime. Otherwise, Ruby won't know what to do with that method call.
00:02:30.420 T.sig is the module that defines the `sig` method, which is defined by the Sorbet runtime gem. Having a runtime component allows Sorbet to perform more actions than it can with just the static type checker. Since Sorbet defines the `sig` method, it can actually use it to validate the type signatures at runtime, ensuring that the types declared by a method are indeed the types being passed to it during execution.
00:02:50.700 For instance, maybe your method expects an integer, but somehow someone is passing it a string instead. Since Sorbet is a gradual type checker, this method call might not be statically analyzable. However, the runtime component can catch this case if you have tests that cover it. The signature definitions wrap your method calls and check the types.
00:03:21.000 But how does this runtime magic work? How can Sorbet's runtime associate a `sig` method call with the method that follows it and instrument the method so that it can check the parameter and return types? Enter the magic of the `method_added` hook.
00:03:40.019 In Ruby, the module class has a method called `method_added`, which is called whenever a new instance method is added if defined in a module or class. This method is passed the name of the method that is being added, allowing us to perform various interesting tasks. For example, we could build our own toy `T.sig` module called `ToySig`. First, we need to implement the `sig` method, which we can simplify for our implementation by storing the current `sig` block in an instance variable.
00:04:21.419 Now that we've done that, we need to know when a method is added to our module. Let's add a `method_added` method for this. This method should do nothing if there is no signature defined, as we don't want to intercept all methods, just those with signatures. We then get a reference to the method object for the newly added method using the `instance_method` method. Now we can replace the method itself by using `define_method` to override it, capturing all the arguments, keyword arguments, and the block passed to this method.
00:05:15.060 The first step is to evaluate the signature block, which needs to occur late in the process because it might reference classes or modules that aren't defined at the time the signature is declared. By evaluating it as late as possible, we can ensure everything resolves correctly. For this toy example, we won't do much with the signature; we can just print the arguments and block that was passed to the method.
00:05:51.720 Next, we need to call the original method. To do this, we can use the 'bind call trick' with our method object from before. This way, we bind that method to the current `self` and then invoke it with the arguments, keyword arguments, and block from our replacement method. We will store the return value in a variable and print it out. Our replacement method also needs to return the same value provided by the original method. That's our toy `sig` module done. Now, let's see how we can use it.
00:06:22.259 We can define a toy `Shop` class and add a constructor that takes some parameters and prints upon initialization. Additionally, let's add a method that accepts a block and yields to it. We will also add a method without a signature to ensure that we don't replace that method.
00:06:36.419 From our tests, as expected, we see two calls to initialize and one call to the with method. The signature block is evaluated when the methods are called and can access the actual arguments and block passed to the method. The original method call still goes through, and we can also intercept the return value before returning it from the method properly. This explains why the method became slower after adding a signature, right? Well, not exactly, because the actual Sorbet runtime implementation uses some escape hatches that we've heavily relied upon to optimize the runtime impact of all of this.
00:07:24.480 We can't completely eliminate method wrapping due to technical reasons, but we can fully remove type validation and even perform signature block evaluation earlier in our Rails application. These techniques ensure that the Sorbet runtime's impact in production is minimal while still providing value in development and testing.
00:07:40.680 So, if all this is true, who is killing our performance? The profiling points to the `method_added` method, which looks suspicious. Let's dig deeper into what’s happening here. I usually start my investigations by adding `puts` statements in relevant places. Where better to add those than in the actual `method_added` hook added by the `T.sig` module?
00:08:06.300 I'll use my favorite tool of running `bundle open sorbet-runtime` to open the source code of the currently installed version of Sorbet runtime in my editor. This way, I can make changes to the gem's source code and run my application to see those changes.
00:08:12.479 Let's navigate to the location where the methods are added and add `puts self` and the method name to see when this method is being called. Returning to our `Shop` model, we can verify that we have both the `extents` method and the signature that was removed in the PR for experimentation.
00:08:28.500 In our application, we can change the eager loading behavior by setting the eager load environment variable in development. This change ensures that I get behavior as similar as possible to production. I also want to disable Spring so that every run is fresh, which I accomplish by using the `DISABLE_SPRING` environment variable. Running `rails console` with these environment variables provides a lot of output from the `puts` statements during eager loading of the application.
00:09:19.040 Now that we've booted our application, we discover that `shop.fetch_one` seems to run just fine and doesn't even trigger any `method_added` hooks. So what's going on here? Let's revisit the `fetch` method on `Shop`. Ah, it seems there is some conditional logic depending on whether `shop.current` is set.
00:09:36.600 During our initial call, we had not set `shop.current`. So what happens if we set `shop.current` to the first `shop`? So far, so good. Let's call `shop.fetch_one` again.
00:10:00.480 Whoa, what just happened? What are all these methods, and how many are there? Let's copy them into a text editor for easier analysis. I'm going to use the Rails Runner to execute this command and then pipe it to a clipboard copy operation.
00:10:34.680 Now that we've transferred it into our text editor, we can analyze these calls. This is quite surprising; we see about 1,600 calls to `method_added` each time we call `shop.fetch_one`. Each call is doing some processing to match itself to a signature and perform additional checks. If each of these calls takes about 10 microseconds to process, and if there are about six to eight calls to `shop.fetch` during a single request, that definitely explains the 150 millisecond performance hit. It appears we have found our immediate source of the problem, but we still need to understand why this is happening.
00:11:11.880 Let's take another look at these methods. They are all class methods defined in our `Shop` class, triggering `method_added` hooks whenever the `Shop` instance is cloned. Notably, the class name that's printed is not `Shop`; instead, it is an anonymous class that's receiving methods. Which class is this? Why are class methods triggering the `method_added` method when they should be triggering the Singleton method added instead?
00:12:01.019 Let's see if someone is doing something suspicious between the time we call `shop.clone` and the moment `method_added` is invoked. For this, I'll apply my favorite trick: setting up a binding pry with a global conditional, so I can control when the binding pry executes.
00:12:29.880 With that in place, I'll start our application again and check what happens. Our application has booted, and because we haven't set the global variable yet, it hasn’t printed anything or hit any breakpoints. So let's fetch the first `shop` and clone it. However, let's first toggle our global debug switch to break inside the call to `method_added`.
00:12:55.500 Here we are; now, what does our backtrace show? Who has called us? There might be some shenanigans happening here. Using the backtrace in pry will provide that information.
00:13:20.640 It appears we're being called directly from `kernel.clone`. No one seems to be manually adding these methods, so where are they coming from? I also noted something peculiar about the class instance while inside this method.
00:13:40.080 Let's check what `self` is. That's also strange; it looks like the Singleton class of a Singleton class. Something deeper is happening here. Where in Rails are all the callback methods being added? Who is triggering them? I reached out to the person who knows Rails internals best, Rafael Franca.
00:14:10.020 In a foreign session with Rafael, I explained the situation and showed him what was occurring. His initial instinct was that this behavior resembled the model body being re-executed when `clone` was called, as the methods being added were all things like `before_save xxx` or `after_save xxx`. We dug into the internals of ActiveSupport to check where those callback methods might have originated.
00:14:47.720 Our first step was to look at the definition of `has_many`. This simple method does not do much other than calling another method. Next, we proceeded to `Builder` and `has_many.build`. Upon further inspection, we saw that this class inherited from `Builder::CollectionAssociation`, which, in turn, inherits from `Builder::Association`. To our surprise, here is the build method. Reading the code, we observed that one of its tasks is defining callbacks. This is precisely what we were investigating.
00:15:49.500 We went deeper to locate the `define_callbacks` method. We needed to trace down the inheritance chain again to `CollectionAssociation` and found the `define_callbacks` method. This calls `define_callback` for each type of callback it needs to register, which then creates methods for callbacks like `callback_name_for_name`. Bingo! This aligns with what our methods have been like.
00:16:20.220 How are these methods being defined? By using a class attribute call! We delved into how class attributes work, and Rafael realized that the writer of a class attribute always redefines the reader when the value is set. Therefore, if a callback were being set during the cloning process, we would have the callback method redefined, which would trigger the method as it took.
00:17:24.060 We tried to determine if that's what was happening, but we weren't quite sure what to check next. While examining how `CollectionAssociation` was invoking `class_attribute`, Rafael identified that it was unnecessarily generating instance predicate and accessor methods. He promptly committed a fix to Rails for this issue.
00:17:56.760 Moreover, we discovered it was not just methods registered by `class_attribute` that were causing problems; basically, any class method defined on the `Shop` model was being added during a clone process. Up to this point, we were running out of ideas about how Rails might be affecting the cloning of an Active Record model. So Rafael suggested that the problem might reside in our application code.
00:18:52.560 We agreed it was prudent to add a very lean Model in our codebase to see if we could replicate this behavior. I went ahead and created a `Foo` model in our application. Since our application contains custom sharding logic called potting, adding a simple model was a bit tricky. However, after defining two straightforward methods on the model class, I was set.
00:19:42.120 Next, I added a method with a signature to the model and a class method for good measure. I tested the clone operation with instances of this model but found that, to my surprise, I could not recreate the strange behavior.
00:20:06.900 Notably, even the class method wasn’t triggering `method_added` during cloning. I began to wonder what was different about this model. My best guess was that the model wasn't using our custom sharding code. Let's see if that changes anything by trying to include the `ShardedTable` module.
00:20:46.800 Now, let's attempt the clone operation again. I re-enabled the breakpoint; let’s try cloning the first `Foo` instance. This time, the behavior appears to mirror that of `Shop`. Hence, we concluded that it must be our potting code responsible for these peculiar results.
00:21:08.760 To confirm our findings, we commented out the `ShardedTable` include to run another test. As anticipated, the outcome differed when we reran the clone. So, where do we go from here? One way to troubleshoot is to examine the source of the `ShardedTable` implementation.
00:22:02.820 The `ShardedTable` does a few things: it registers a class attribute called `has_shop_data`, adds the model to the shard, and includes additional modules into the model which invokes class eval and adds callback methods. There are also some checks made during this process.
00:22:51.960 To effectively troubleshoot this, I resort to a binary search within the source code. By partially commenting out the methods, I can determine if the printing of `method_added` hooks is triggered. If we comment out everything, we certainly don't hit anything. If we leave only the class methods extend line, there's no odd behavior either.
00:23:42.300 Now, let’s try just adding the class attribute. Odd behavior is absent here as well. Next, I will register the model on the shard and see what that yields. The registration doesn't seem to include anything. Next, let's comment out the include.
00:24:30.660 No peculiar behavior noted upon commenting it out and now we will comment out the registration. Alright, it appears that the issue resides with `base.include`. This includes an included method that performs a class eval and registers a `before_create` and an `after_find` callback.
00:25:11.760 I will check if `class_eval` is causing the strange behavior. Running the runner, I discover there is no odd behavior there. Now to test the `before_create` — likewise, no strange behavior observed.
00:26:09.060 Finally, let’s conduct a trial with `after_find`. The moment I include `after_find`, it activates the method's `method_added` hooks with its effects.
00:27:02.340 What’s happening here? Let’s take another route: instead of including `ShardedTable`, I'll just register an `after_find` callback method directly on the model. This should yield a similar outcome, I reason.
00:27:35.280 And voila! Yes, this indeed triggers the exact same behavior. Now our model is sufficiently isolated from the application, and we can compose a reproduction script easily.
00:28:02.340 One way to achieve that is to utilize the bug templates in Rails. Rails offers a bug template for the Active Record gem, a self-contained script that uses bundler inline, initializes Active Record, defines a few tables, declares their Active Record models, and runs a test with them.
00:28:29.280 Thus, I’ll adapt that small script as a base that will allow us to avoid dealing with the complexities of our application code. Initially, if we run this, we see it prints SQL statements and all the tests pass in their default forms.
00:29:21.960 Let's analyze how this bug template script operates. It uses bundler inline to define a gem file within the script, then requires Active Record MiniTest and initializes an Active Record connection using SQLite. Next, it defines a schema, establishes two classes, and creates a MiniTest class executing a single test.
00:30:36.960 We don’t require that test, nor do we need these post and comment models. Similarly, we won’t utilize these tables either, nor do we need the Active Record-based logger, as we don’t need to see the SQL output; Active Record is not what we’re debugging.
00:31:26.520 Instead, we’ll create a schema that simply sets up a `shops` table. We'll create a `Shop` class, and instead of bringing in Sorbet runtime, we’re going to add our `method_added` hook. The rest is consistent with the full model we once had.
00:32:14.640 Next, we create a `Shop` instance, find the first shop, and clone it. After that, let's add our `after_find` callback and then find the first shop again to clone it. This should allow us to observe the stray behavior occurring after registering the `after_find` callback.
00:32:59.600 As expected, we create a shop model without any issues. Now we execute the first clone operation before registering `after_find`, and everything proceeds smoothly. Following that, we register the `after_find` and then perform the second clone after registration.
00:33:44.820 Now, we're able to replicate the peculiar behavior in this miniature script, effectively understanding its roots. How can `after_find` influence `clone`? To answer that question, I turn my attention back to the `clone` method.
00:34:35.520 The `clone` method performs a shallow copy of the given object. While it duplicates instance variables, it doesn't copy the referenced object state. The documentation discussing nuances between `clone` and `dupe` indicates that they embody different semantics across descendant classes.
00:35:14.760 And thus, while `clone` duplicates an object, including its internal state, `dupe` utilizes the class of the descended object to create a new instance, ignoring the modules that the object has been extended with.
00:36:06.960 Next, we need to dive into how `dupe` operates inside the MRI code base. It first locates the class of the given object, allocates a new instance, initializes the copy, and then executes an `init_dupe` method before returning the duplicate instance.
00:37:26.520 Now examining `clone`, we get minimal content, only calling a `mutable_ops_clone`. Understanding this completely will take some digging, so let's search through the Ruby code base to glean improved understanding.
00:37:51.760 Returning to the object.c file, we can see how it finds the class and allocates a new instance. Following that, it conducts an `init_copy` and—crucially—does a freezing operation before returning the clone. Thus, the main difference here between `clone` and `dupe` is that additional block, which may be correlated to singleton classes.
00:39:05.520 So, if this is tied to singleton classes, we need to ascertain who is allocating a singleton class. We can do this by employing the object space tracing object allocations block, which facilitates tracing object allocations within that block.
00:40:16.080 When we apply an `objectspace.allocation_source` command targeting `shop.singleton_class`, we can obtain the file and then print the associated lines directly to the console.
00:40:52.380 With this groundwork, we rerun the script to analyze the results. To our astonishment, it seems the singleton class is instantiated within ActiveSupport callback, specifically within a method named `make_lambda`. Once more, let's drop a `puts` statement inside this method to acquire fresh insights.
00:41:49.860 Rerunning the script again, we find that a `call_send` is being made on an ActiveSupport callback conditional value, along with an `instance_exec` being executed on a shop instance, defined by a proc on line 55 in test_rb. As that is precisely the block we passed onto `after_find`, we must explore what `instance_exec` achieves.
00:42:55.200 We’re aware of `instance_exec`, but let's dive into the infrastructure. As we discover, it creates a singleton class for evaluating the block. Therefore, `after_find` runs `instance_exec`, which in turn generates this singleton class, leading to a distinguishing factor between `clone` and `dupe`.
00:43:54.780 With our discoveries in mind, we can eliminate all that Active Record from our reproduction script. Transforming our `Shop` class into a simple PORO class is sufficient. We can create a new instance and conduct cloning operations with which we can pair this new functionality.
00:44:24.180 First, cloning it, followed by invoking `.singleton_class` on `shop`, which constructs the singleton class on a `Shop` instance, provided it doesn’t already exist. Then, we attempt another clone operation. If our hunch is accurate, this will affirm that creating a singleton class results in differences between the two clone calls.
00:45:20.400 And indeed, it proves correct! The second clone does encounter our `method_added` hooks. Now what is a singleton class, precisely? Let's elucidate that.
00:46:08.160 A singleton class, conceptually, provides unique behavior to a single object, differentiating it from the other objects of its class. For instance, if we create two instances of `Shop` named shop_one and shop_two, when we define a method called `hello` on the singleton class of shop_one, only that instance will respond to `hello`, while the other instance will not.
00:46:59.880 Let’s take it a step further by figuring out how Ruby internally represents these things. Initially, we delineate a simple `Shop` class and create an instance called `shop`. When querying `shop` for its superclass, we stipulate it as `Object`, whereas querying for its class returns `Class`. The instance method on `shop` should yield `name`.
00:47:55.560 Shifting gears, let’s define our `Shop` class with a class method called `fetch`. Defining a class method effectively assigns that method to the singleton class. Thus, if we seek the properties of both the classes and the objects, the resultant queries unveil oversight from Ruby regarding class structures.
00:48:31.080 If we check for instance methods on the singleton class of `Shop`, we ought to observe `fetch`. Yet querying the singleton class of a `shop` instance reveals no methods defined there.
00:49:02.520 Advancing, we engage with a case where we summon a singleton class of a singleton class, permitting us to discern all sorts of inheritance mechanics within Ruby. The behaviors of `shop`, `shop1` become revealed by checking the respective class structures.
00:49:49.200 Now, let’s grapple with how cloning should work, starting from the simplest case: when cloning the `shop` object, the clone is simply a copy of the `shop` object itself. Since the clone also points at the same class, `Shop`, there's nothing more to duplicate, as both share the same class behavior.
00:50:49.740 In a slightly more complex scenario involving the singleton class of `Shop`, any clone should replicate that singleton class alongside the object, consequently ensuring that the two objects keep aligned.
00:51:14.920 Lastly, as we explore the most complex case, we‘ll see that whenever we clone a `Shop` object, we not only need to create a clone of that, but we also should create clones of its singleton classes.
00:51:30.550 In summary, we engaged in a forward process while documenting our findings, culminating in a reproduction script demonstrating how singleton classes defined on the objects caused imperfection during cloning.
00:52:35.950 Following a couple of hours of collaborative work alongside Alan Wu from the Ruby Core team, we successfully created a reproduction script illustrating the issues we discovered. Notably, this concluded that a check was needed to ensure an object was indeed attached to its singleton class.
00:53:17.040 We devised a PR to tackle the issue, reviewed its contents, made necessary clarifications, and contributed to improvements before finally merging it. This fix was incorporated into Ruby 3.0 in late November 2020, marking our progress.
00:53:52.880 Just eight days ago, it was successfully backported to version 2.7. Now we’ll utilize our diagrams from earlier to elucidate the happenings within Ruby, utilizing the previous discussions of singleton class interactions.
00:54:26.040 In conclusion, prior instances of the problematic case demonstrate stumbling as Ruby was returning the class pointers while cloning an object, hence inadvertently cloning the singleton classes which led to a cascade of issues.
00:55:12.440 Thus, when the shop object's singleton class was cloned, it mistakenly creates methods that shouldn't have been there to begin with, marking the root of our problems as methods were constantly being added to that cloned singleton class, causing performance degradation. After implementing fixes and careful attention, we have effectively resolved the issue.
00:55:56.979 We can now shift our focus to tests confirming the absence of unexpected behavior in Ruby 3, validating our efforts and ensuring we exercise a relaxation period post our investigation.