Caching With MessagePack

RubyKaigi 2022

00:00:02.460 Good afternoon! How's everybody doing?
00:00:11.639 The title of my talk is Caching with MessagePack. My name is Chris Salzberg, and I'm also known as Shioyama or Shyam Sensei.
00:00:14.759 I am a staff developer on the Ruby and Rails core team at Shopify. Within that team, I work in a smaller group called Rails Core. Our mandate is to keep Shopify's core monolith, essentially the largest Rails production application in the world, on track, while providing an enjoyable environment for developers to work in.
00:00:21.380 The background for this slide is a picture of where I live, which is Hakodate, in the north of Japan, at the southern tip of Hokkaido. It's a great place to visit for those of you who don't know about it. I think it would also be a fantastic venue for a future RubyKaigi if the organizers are listening. Now, if the title of this talk seems familiar, that's because I gave a very similar talk at RailsConf a few months ago called Caching without Marshal. There is quite a bit of overlap, but there are also some differences. So if you've seen that one, this talk is still worth watching.
00:00:57.660 The starting point for this talk is an exception that was raised in production about a year and a half ago on the monolith. The exception was a NameError for an uninitialized constant: BetaFlagService::ActiveRecordRepository. I wasn't directly involved in the incident, but I was involved in the follow-up. The exception led to an incident that required a rollback. As anybody who ships code knows, when you ship code you're essentially dealing with two different universes: the state of the application before you ship the code and the state after you ship it. Usually, these two universes overlap almost entirely; you don't typically ship huge changes in one go.
00:01:30.479 However, in this case, a refactor was pushed that changed a number of classes around beta flags, which are quite an important concept in our core monolith. This led to a scenario where there was an instance of one of these beta flag constructs that was wrapped in another object, which was again wrapped in another object, and that final object was then stored in the cache. Meanwhile, the old version of the application was still running as the code was being deployed. When the old version accessed the cache, it pulled out this object and attempted to unwrap it, but since all the class names and methods had changed, the old version didn't recognize the object and consequently blew up. Exceptions were raised everywhere, and the incident required a rollback.
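The failure mode described here can be reproduced in plain Ruby. The following is a minimal sketch, not the code from the incident: the `BetaFlag` class and the flag name are hypothetical, and removing the constant stands in for the application version that does not know the class. Note that in plain Ruby the failure surfaces as an `ArgumentError`, whereas the talk reports a `NameError` in the Rails application.

```ruby
# Hypothetical class standing in for one of the refactored beta flag classes.
class BetaFlag
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

# One version of the application writes an instance to the cache.
cached_payload = Marshal.dump(BetaFlag.new("new_checkout"))

# Simulate the other version, where this constant does not exist.
Object.send(:remove_const, :BetaFlag)

load_error = nil
begin
  Marshal.load(cached_payload)
rescue ArgumentError => e
  # Marshal cannot rebuild the object because the class is unknown here.
  load_error = e
end

puts load_error.message
```

The payload itself is intact; it simply names a class that only exists in one of the two universes.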
00:02:14.760 When we examined the situation, we determined there were essentially two ways to avoid this issue. One option is to simply avoid putting too many objects into the cache; if you don't put much in the cache, this situation can't happen. But that isn't a viable solution, since the whole point of caching is to store and retrieve data efficiently. The other option is to avoid changing existing code. If you don't change code, this issue can't occur; you can write new code, but you can't modify existing code. That isn't a good solution either because, especially for our team, we want to encourage developers to refactor and improve their code, so that approach simply isn't acceptable.
00:02:31.740 This realization led us to ask: what is a good solution for this problem? And that's what led to this talk. I want to first explain caching as it relates to the topic here. In my RailsConf talk, I focused on Rails cache, but since this is not a Rails conference, this is RubyKaigi, I want to address caching from a more general perspective.
00:03:09.660 In any web application, there are many things you might want to cache. You might cache actions, pages, or fragments of pages, query results from the database, or responses from external API requests. These are all things you may want to cache. Furthermore, there are various places where you might store those caches, such as memory, file storage, Redis, or Memcached. There are different ways to approach this situation. One option is to create separate caches for each type of item. For example, you could implement a specific cache to store action-type objects, which could figure out how to represent an action object as a string for storage in its cache. Then you'd create another cache for pages and another for fragments. However, this approach can quickly become complicated and is not scalable. It's not how Rails operates; I imagine other frameworks have similar strategies. Rails uses a single caching layer that handles all types of caching. This simplifies the situation, as all different types of caching scenarios funnel to one cache, which then manages the back end—whether that's memory, filesystem, Redis, or Memcached. It's the back end's responsibility to manage how to cache the objects received.
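The single-layer design can be sketched as follows. This is an illustration under stated assumptions, not Rails' implementation: `MemoryBackend` and `TinyCache` are hypothetical names, and the coder defaults to Marshal only because that is what the talk says back ends typically use.

```ruby
# Hypothetical in-memory back end: stores opaque strings by key.
class MemoryBackend
  def initialize
    @data = {}
  end

  def read(key)
    @data[key]
  end

  def write(key, payload)
    @data[key] = payload
  end
end

# Hypothetical single caching layer: callers hand it any object, and the
# coder turns that object into a string the back end can store.
class TinyCache
  def initialize(backend:, coder: Marshal)
    @backend = backend
    @coder = coder
  end

  def fetch(key)
    payload = @backend.read(key)
    return @coder.load(payload) if payload

    value = yield
    @backend.write(key, @coder.dump(value))
    value
  end
end

cache = TinyCache.new(backend: MemoryBackend.new)
calls = 0
first  = cache.fetch("posts/1") { calls += 1; { "title" => "Caching with MessagePack" } }
second = cache.fetch("posts/1") { calls += 1; { "title" => "never computed" } }

puts calls
puts second.inspect
```

Because every kind of cached value passes through the one coder, swapping the serialization format is a change in a single place.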
00:04:48.300 In practice, the back end will typically handle this using Marshal. While Marshal is a powerful and convenient tool, it comes with known risks, particularly when loading user-generated data. If you examine the Marshal documentation, it specifically points out that its load method carries inherent security risks. Because of these concerns, we should investigate what exactly Marshal does before we attempt to replicate it using MessagePack.
00:05:18.600 Let’s delve into what Marshal does, using an example from my RailsConf talk. We don't need to bother with ActiveRecord specifics; we just need an object large enough for demonstration. For instance, let's consider a Post class. If we create an instance with the title 'Caching with MessagePack', we can call Marshal's dump method, which serializes the object into a binary format. When I did this previously, the result was a 1600-byte binary blob containing constants, instance variables, and values: essentially everything contained within the object. The magic lies in how, with a future call to Marshal's load method, regardless of whether it’s a day, week, or year later, you can retrieve the object just as it was originally. Marshal encodes the entire universe: it examines everything you put into it recursively, with no regard for what is public or private. This would be fine if our universe were static. Yet, at Shopify, we ship code many times a day, meaning that the universe we operate in is constantly shifting. Encoding all of this data from an ever-changing universe creates substantial risk.
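A plain Ruby class is enough to see the round trip. This is a minimal sketch: the `Post` class below is a hypothetical plain object rather than the ActiveRecord model from the talk, so its blob is far smaller than the 1600 bytes mentioned.

```ruby
# Hypothetical plain-Ruby stand-in for the Post model.
class Post
  attr_reader :title, :tags

  def initialize(title, tags)
    @title = title
    @tags = tags
  end
end

post = Post.new("Caching with MessagePack", ["ruby", "cache"])

blob = Marshal.dump(post)   # a binary String
copy = Marshal.load(blob)   # a new Post rebuilt from the blob

puts blob.encoding            # the blob is binary data
puts blob.include?("Post")    # the class name is stored in the blob
puts blob.include?("@title")  # so are the instance variable names
puts copy.title
```

The class name and instance variable names travel with the data, which is exactly what ties a cached payload to one version of the code.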
00:07:01.680 Let's look at what's actually inside Marshal. You can open up marshal.c, the file where Marshal is primarily defined in the Ruby C implementation. There, you’ll find numerous constants at the top of the file that reveal what happens behind the scenes in Marshal. The first thing you'll find is both the major and minor version of Marshal (4 and 8). These constants have remained unchanged for a long time, so you can safely assume they won't vary with Ruby version changes. Next, you will encounter a range of atomic types, which don’t contain other objects, such as nil, true, false, numbers, floats, symbols, classes, and modules. Then there are composite types: arrays, hashes, structs, and objects, which can contain other objects within them.
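These constants can be observed from Ruby itself, without reading the C source. The sketch below reads the version constants and then looks at the third byte of a few dumps, which is where the type marker sits after the two version bytes. The `Point` struct is a hypothetical example type.

```ruby
# Named struct: Marshal cannot dump anonymous classes.
Point = Struct.new(:x, :y)

version = [Marshal::MAJOR_VERSION, Marshal::MINOR_VERSION]

# After the two version bytes, the third byte identifies the type.
type_byte = ->(obj) { Marshal.dump(obj)[2] }

atomic = {
  "nil"     => type_byte.call(nil),
  "true"    => type_byte.call(true),
  "false"   => type_byte.call(false),
  "integer" => type_byte.call(1),
  "float"   => type_byte.call(1.5),
  "symbol"  => type_byte.call(:title),
  "class"   => type_byte.call(String),
  "module"  => type_byte.call(Kernel)
}

composite = {
  "array"  => type_byte.call([]),
  "hash"   => type_byte.call({}),
  "struct" => type_byte.call(Point.new(1, 2)),
  "object" => type_byte.call(Object.new)
}

puts version.inspect
puts atomic.inspect
puts composite.inspect
```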
00:07:42.120 The first thing we generally observe within Marshal is the representation of a basic object type, denoted by a lowercase ASCII 'o'. If we take the binary blob from earlier and convert it to hex, the first part represents the version number, followed by 'o' indicating that this is an object. A colon follows, indicating a symbol: the class name is stored as a symbol, with its length followed by its characters. After that comes the number of instance variables, then each variable's name followed by its value. For example, one instance variable is a symbol paired with a value indicating that this is not a new record, since we created it. This rapidly grows into a large structure, as each instance variable can itself be an object.
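The layout can be walked through byte by byte on a tiny object. This is a sketch using a hypothetical one-attribute `Post` class; the only format detail it relies on beyond the description above is that Marshal stores small positive integers (such as lengths and counts) as the value plus 5.

```ruby
# Hypothetical minimal class with a single instance variable.
class Post
  def initialize(title)
    @title = title
  end
end

blob = Marshal.dump(Post.new("Hi"))

version    = blob.bytes[0, 2]            # major and minor version
type       = blob[2]                     # "o" marks a plain object
symbol_tag = blob[3]                     # ":" marks a symbol (the class name)
# Small positive integers are stored as value + 5, so subtract 5 to decode.
name_len   = blob.getbyte(4) - 5
class_name = blob[5, name_len]
ivar_count = blob.getbyte(5 + name_len) - 5

puts blob.inspect
puts [version, type, symbol_tag, name_len, class_name, ivar_count].inspect
```

Everything after the instance variable count is the list of name and value pairs, each of which is itself a complete Marshal value.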
00:09:32.280 There's more to note regarding instance variables: objects can have them, but so can other things that you might not expect. For example, in Ruby, you can set instance variables on strings and retrieve them later. If you serialize such a string with Marshal, it faithfully tracks that information. Marshal keeps track of nearly everything, and it does so using special types, such as the type 'Ivar', represented by 'I', specifically for tracking instance variables on non-object types. Another interesting feature is how Marshal handles repeated references to the same object, including circular references. Consider an array that references itself; Marshal can serialize it without issues.
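The instance variable wrapper is easy to observe on strings. In this sketch, a binary string is dumped with a bare string marker, while a UTF-8 string is wrapped in 'I' because its encoding is recorded alongside it; a custom instance variable set on a string survives the round trip in the same way. The `@note` variable is a made-up example.

```ruby
binary_dump = Marshal.dump("foo".b)  # binary string: nothing extra to record
utf8_dump   = Marshal.dump("foo")    # UTF-8 string: encoding is recorded

# Set a made-up instance variable on a String and round-trip it.
tagged = "foo".dup
tagged.instance_variable_set(:@note, "set on a String")
copy = Marshal.load(Marshal.dump(tagged))

puts binary_dump[2]                     # type marker of the binary string
puts utf8_dump[2]                       # type marker of the UTF-8 string
puts copy.instance_variable_get(:@note)
```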
00:09:55.200 This is achieved through a special type, known as a link type, denoted by an at sign ('@'). It means a reference to an object that has already been serialized, allowing safe serialization without duplicated data. Additionally, the Marshal implementation has a type for user subclasses of Ruby's built-in core classes. Built-in core types such as Hash, Array, Regexp, and String have special handling when serialized with Marshal.
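Both behaviors show up in very small dumps. The sketch below serializes an array that contains itself and a hypothetical `Tags` subclass of Array; in the Marshal format a link is written as '@' followed by the index of the earlier object, and a subclass of a core type is prefixed with 'C' and the class name.

```ruby
array = []
array << array                      # the array contains itself

blob = Marshal.dump(array)
copy = Marshal.load(blob)

# Hypothetical subclass of a core type.
class Tags < Array; end
subclass_blob = Marshal.dump(Tags.new)

puts blob.inspect                   # version, "[", length 1, then a link
puts copy[0].equal?(copy)           # the loaded array still contains itself
puts subclass_blob.inspect
```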
00:10:25.799 The significance of engaging deeply with Marshal's operations stems from the caching process and how we often leave the quirks of Marshaling to blind faith. With a good understanding of Marshal's processes, we should be better equipped to transition toward using MessagePack effectively.
00:11:00.180 Now, let’s discuss MessagePack and how it relates to these issues. The primary problem we face is that cached items can cause conflicts between versions of the application, especially while a deploy is in progress and data is being fetched from the cache. If your cache expires after two weeks, an entry written by one version can be read by any version deployed within that window. This highlights the importance of a serialization format that does not capture the complete state of the universe. That's where MessagePack comes in. MessagePack is a highly efficient binary serialization format. It's binary, like Marshal, but differs significantly in that it's a generic format. Libraries exist for various programming languages, including Java and Python. In many ways, it behaves like Marshal.
00:12:12.600 For example, if I give MessagePack a hash containing key-value pairs, it returns a binary string, similar to how Marshal would respond. Unpacking that binary string gives us back our data structure, just as loading does with Marshal.
00:12:47.400 MessagePack's types are similar to Marshal's, with atomic types such as integers, nil, Booleans (true or false), and floats. Like Marshal, MessagePack also has composite types such as arrays and hashes. However, unlike Marshal, it does not have an object type, since it’s not a Ruby-specific format. As a result, MessagePack does not natively track instance variables. This omission serves us well: we don't want the stored data to embed knowledge of every object type, as that would lead to the same issues we have with Marshal.
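To make the type system concrete, here is a toy encoder for a small subset of the MessagePack format. It is a sketch written for illustration and is not the msgpack gem: the method name `toy_pack` is made up, and it only supports nil, Booleans, integers from 0 to 127, strings up to 31 bytes, and arrays and hashes with up to 15 entries, which are the single-byte-header cases in the MessagePack specification.

```ruby
# Toy encoder for a small subset of MessagePack (not the msgpack gem).
def toy_pack(value)
  case value
  when nil   then "\xC0".b
  when false then "\xC2".b
  when true  then "\xC3".b
  when Integer
    # positive fixint: the byte is the value itself
    raise ArgumentError, "only 0..127 supported" unless (0..127).cover?(value)
    [value].pack("C")
  when String
    # fixstr: header byte 0xA0 | length, then the raw UTF-8 bytes
    bytes = value.encode(Encoding::UTF_8).b
    raise ArgumentError, "string too long" if bytes.bytesize > 31
    [0xA0 | bytes.bytesize].pack("C") + bytes
  when Array
    # fixarray: header byte 0x90 | size, then each element
    raise ArgumentError, "array too long" if value.size > 15
    [0x90 | value.size].pack("C") + value.map { |v| toy_pack(v) }.join
  when Hash
    # fixmap: header byte 0x80 | size, then key/value pairs
    raise ArgumentError, "hash too large" if value.size > 15
    [0x80 | value.size].pack("C") + value.map { |k, v| toy_pack(k) + toy_pack(v) }.join
  else
    # There is no generic "object" type to fall back on.
    raise ArgumentError, "no MessagePack type for #{value.class}"
  end
end

packed = toy_pack({ "title" => "Foo", "tags" => ["a", true, nil] })
puts packed.inspect
```

The `else` branch is the point of the exercise: an arbitrary Ruby object has nowhere to go unless someone decides explicitly how to represent it.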
00:13:10.620 For instance, when using Marshal, dumping a simple string like 'Foo' results in a structured byte sequence that includes information about its encoding, such as the UTF-8 designation. In contrast, MessagePack outputs a much simpler byte representation, storing little more than the actual bytes of the string. MessagePack can do this because its string type is UTF-8 by definition, so the encoding never needs to be recorded. This efficiency is a major factor in our desire to bring MessagePack into our production codebase.
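The difference can be counted in bytes. In this sketch the Marshal side uses the real `Marshal.dump`, while the MessagePack side is built by hand from the specification's short-string layout (one header byte, `0xA0` plus the length, followed by the raw bytes) rather than by calling the msgpack gem.

```ruby
marshal_bytes = Marshal.dump("Foo")

# MessagePack short string, built by hand: header byte, then raw bytes.
msgpack_bytes = [0xA0 | "Foo".bytesize].pack("C") + "Foo".b

puts marshal_bytes.inspect
puts msgpack_bytes.inspect
puts marshal_bytes.bytesize
puts msgpack_bytes.bytesize
```

Most of the Marshal output is bookkeeping: the version, the instance variable wrapper, and the symbol and value that record the encoding.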
00:13:52.920 When we began searching our codebase, it quickly became clear that many cached values went beyond primitive types. One of our primary targets was ActiveRecord, which presents a number of serialization challenges. It’s important to serialize not just the record attributes but also the cached associations. If any associations were previously loaded, we want to store those in the cache so that we don't need to access the database again. In my RailsConf talk, I went into detail on this topic; I won't elaborate further now, except to say that inverse associations create circular references, which Marshal handles with some inefficiency.
00:15:04.680 However, we discovered that we could manage that issue by constructing a coder that properly serializes the records and their associations as arrays, allowing them to work seamlessly with MessagePack. When we implemented this, we noted some promising results: for a post and its comments, the MessagePack size was around 300 bytes, compared to Marshal's 4000 bytes. That is roughly a 13-fold reduction in size for complex objects.
00:15:43.260 Consequently, we observed a notable reduction in our Rails cache size within production. We were already operating at approximately 95% caching efficiency, but this adjustment provided us with meaningful improvements overall. As we explored further, we realized that our ActiveRecord objects formed just one facet of the overall challenge.
00:16:25.020 Another challenge involved handling hashes with indifferent access: hashes whose keys can be accessed interchangeably as strings or symbols. These come from ActiveSupport, where subclassing Hash is quite common, and they can lead to complications. While Marshal handles them automatically, we discovered that MessagePack needed additional handling. So, we made an extension type that handles this particular hash subclass. Our tests consistently identified when a Hash subclass would not serialize accurately with MessagePack, ensuring that we only serialized what we could serialize correctly.
00:17:23.640 For practical implementation, I developed extension types with registered type codes to serialize these derived objects. Through this, it became possible to recursively process the values within any given object before sending them through MessagePack. It became clear that developers would benefit from handling serialization in a structured and predictable manner, while keeping the flexibility to define custom behaviors and extension types.
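The idea of registered extension types with strict class checks can be sketched in plain Ruby. This is an illustration only, not the msgpack gem's API and not the speaker's implementation: `TypeRegistry`, `IndifferentHash`, `OtherHash`, and the `"__ext"` tag are all hypothetical. The real format carries a numeric type code in a dedicated wire type instead of a tagged hash, so a tag key like the one below is just a stand-in.

```ruby
# Hypothetical stand-ins for Hash subclasses found in an application.
class IndifferentHash < Hash; end
class OtherHash < Hash; end

class TypeRegistry
  PRIMITIVES = [NilClass, TrueClass, FalseClass, Integer, Float, String].freeze

  def initialize
    @by_class = {}
    @by_code = {}
  end

  def register(code, klass, packer:, unpacker:)
    @by_class[klass] = [code, packer]
    @by_code[code] = unpacker
  end

  # Convert to primitives, comparing exact classes so that an unregistered
  # subclass is rejected instead of being silently flattened.
  def to_primitives(obj)
    if (entry = @by_class[obj.class])
      code, packer = entry
      { "__ext" => code, "data" => to_primitives(packer.call(obj)) }
    elsif PRIMITIVES.include?(obj.class)
      obj
    elsif obj.class == Array
      obj.map { |v| to_primitives(v) }
    elsif obj.class == Hash
      obj.to_h { |k, v| [to_primitives(k), to_primitives(v)] }
    else
      raise ArgumentError, "unsupported type: #{obj.class}"
    end
  end

  def from_primitives(data)
    case data
    when Array
      data.map { |v| from_primitives(v) }
    when Hash
      if data.key?("__ext")
        @by_code.fetch(data["__ext"]).call(from_primitives(data["data"]))
      else
        data.to_h { |k, v| [from_primitives(k), from_primitives(v)] }
      end
    else
      data
    end
  end
end

registry = TypeRegistry.new
registry.register(
  1, IndifferentHash,
  packer: ->(hash) { hash.to_h },                 # plain Hash with the same entries
  unpacker: ->(plain) { IndifferentHash[plain] }  # rebuild the subclass
)

value = { "settings" => IndifferentHash["theme" => "dark"], "ids" => [1, 2] }
flat = registry.to_primitives(value)
restored = registry.from_primitives(flat)

puts flat.inspect
puts restored["settings"].class
```

Checking `obj.class` rather than `is_a?` is the design choice that matters here: a subclass nobody registered raises an error instead of quietly losing its class.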
00:18:40.800 We worked through many of these cases, defining where subclasses of objects might need their own serialization handling. Each object with the instance methods 'pack' and 'unpack' would be recognized as a serializable type. On the development side, we aimed to structure the core types, identifying and reacting to user needs through effective coding strategies. Creating modules took this further, letting us adapt types in production and enabling automatic serialization parameters. This allowed us to convey digest feedback from the internal states of those objects, facilitating smoother transitions between object versions within our development lifecycle.
00:19:56.760 After successfully transitioning and fully integrating MessagePack into our workflow, we can now operate without encountering these serialization challenges. This has been a real gain for the efficiency and productivity of our development team at Shopify. Most of what I talked about today is implemented in a gem called Paquito. I would like to extend my sincere gratitude to Jean Boussier, who did countless hours of extraction work and collaborated on many aspects I discussed throughout the presentation. Thank you to the organizers of RubyKaigi, and thank you all for your attention.
00:22:08.400 Thank you! Goodbye!