Samuel Giddins

Remembering (ok, not really Sarah) Marshal

RubyKaigi 2024

00:00:14.040 Okay, well, I hope everyone's not too tired after lunch to think about Ruby, code, and all that fun stuff. I will warn you, I tend to make very bad jokes. I understand that, but this time my boss is in the room, and I worry that if you don't laugh at them, I might lose my job. So please find me funny!
00:00:17.760 Today, we're going to be talking about Marshal. Now, I apologize (sorry, not sorry), the title is a really terrible pun. I have not seen the movie. I know the name, but do you know how hard it is to come up with funny titles for conference talks about binary data serialization formats? It's a challenge.
00:00:22.280 A little bit about me: I am @se_giddins pretty much everywhere on the Internet. My name is Samuel, and I'm a maintainer of RubyGems, Bundler, and RubyGems.org. I've been contributing bugs for over a decade at this point. In my day job, I'm the Security Engineer in Residence at Ruby Central, sponsored by AWS, focusing on the security of the Ruby and RubyGems ecosystem. But that's not entirely what we're going to be talking about today.
00:01:30.079 We're going to be discussing this. This is the subject of an entire 30-minute talk, and I fit it onto one slide. So, a show of hands: who here has used Marshal? A fair number of you. Now, let's do that again: who here has been told that Marshal is bad and that you shouldn't use it? Pretty much the same people. Great! So today we're going to explore Marshal in more detail. We'll look at the good, and yes, the bad, and see what we can learn from it.
00:02:00.960 To start off, I went ahead and spent way too long this morning figuring out how to use hexdump, and I printed out some Marshal for you. Here we go! Here's a hex dump of a small Marshal document, and we are going to walk through this byte by byte.
00:02:49.880 So here we are, let's start off. At byte number zero, we have the Marshal major version number, four, followed by the minor version number, eight. And before you ask how old 4.8 is, the answer is: very! Moving on, we can now look at objects. The third byte is an 'I' that gives us the type of our next object.
00:03:19.640 So, 'I': what does that mean? It means we have an object with instance variables in it. Okay, next byte. Inside the object with IVars, we have the type of the object that has the IVars, and it is a double quote. Now, okay, you could probably guess that that means it's a string. We're through four bytes now. Next, we have the length of the string, which in hex is 0x16, meaning, of course, the length of our string is 17 bytes. I know everyone here was thinking, "Sam, you didn't have to tell me that; that was obvious." So, integers in Marshal use a packed variable-length encoding.
00:05:45.600 Just so if you're looking at the hex dump, you have no trouble understanding what you're looking at. Finally, we have the contents of our string: 'hello Ruby.' Okay, we just sped through a whole bunch of bytes there, and we are almost at the end. Now we are done with our string; we're back to the object with IVars, and we can see that the number of IVars is 0x06, which means we, of course, have one instance variable. The next field shows the name of the instance variable, which is going to be a symbol, denoted by a colon.
00:06:56.720 That makes sense to us as Rubyists, as colons mean symbols. The length of it is one byte, and the name of the instance variable is 'E', which on a string means 'encoding'. It's a special instance variable name that you can only set on strings because, notice what's missing: there's no '@' sign. The value of the 'E' instance variable is a 'T', which means true.
00:07:49.679 Okay, who had fun just reading byte by byte? Yeah, that'll really wake you up after eating. Here's maybe a more helpful way of looking at that same tree. We have our Marshal major version, minor version. We have an object with IVars, which has a string in it. This string is followed by a symbol of 'E' and then 'true'. Now, how would you feel if I said that was a really simple Marshal stream?
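The byte-by-byte walkthrough can be reproduced in irb. This is a minimal sketch, not the code from the slide: the sample string "Hello, RubyKaigi!" is an assumption, chosen only because it is 17 bytes long and therefore produces the 0x16 length byte described above.

```ruby
# Assumed sample string: 17 bytes of UTF-8, so its Marshal length byte is 0x16.
str = String.new("Hello, RubyKaigi!", encoding: Encoding::UTF_8)
dumped = Marshal.dump(str)
bytes = dumped.bytes

major, minor = bytes[0], bytes[1]      # format version
ivar_tag   = bytes[2].chr              # "I": object with instance variables
string_tag = bytes[3].chr              # double quote: a string follows
length     = bytes[4] - 5              # small positive integers are stored as n + 5
contents   = dumped.byteslice(5, length)
ivar_count = bytes[5 + length] - 5     # same n + 5 packing
trailer    = dumped.byteslice(5 + length + 1, 4) # ":" + length byte + "E" + "T"

puts format("version %d.%d, tags %s %s, length %d", major, minor, ivar_tag, string_tag, length)
puts "contents: #{contents}, ivars: #{ivar_count}, trailer: #{trailer.inspect}"
```

Reading the stream by fixed offsets only works for a document this small; a real reader has to decode each length before it knows where the next field starts.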
00:08:25.440 Okay, well, while you sit with that and ponder how much more complicated this whole thing can get, let's rewind. Marshal is a binary file format that has been supported in Ruby forever. I was too lazy to look up how long it's been around for, but I'm pretty sure that we've been on Marshal version 4.8 since before I started doing Ruby, which was a pretty long time ago at this point. So, we use Marshal to serialize a Ruby object graph into a binary format.
00:09:11.920 Great! Now, the reason I care about Marshal is that it's used in certain places in RubyGems for legacy reasons, and those legacy reasons are kind of also current reasons, which boil down to the fact that Marshal is the only binary format that you can read in Ruby without requiring any gems, which is kind of important for RubyGems.
00:10:29.999 So let's set that aside; this is why I started looking into Marshal back in September and started this whole journey. Marshal belongs to a family of what are called tag-length-value formats. There's a tag that tells you how to interpret the next bit of data, a length that tells you how long that piece of data is, and then the value, which is the piece that we care about. Compare that to formats that we're more familiar with. For example, JSON is a delimited format: you have opening braces, closing braces, and commas.
00:11:38.640 You also have context-sensitive formats such as YAML, where knowing where you are in the document changes how you interpret it. Things like how far indented you are matter.
00:12:33.600 So, okay, tag-length-value. Let's have a look at the different tags that are in Marshal. Take a deep breath! We have strings, which we saw; nil gets its own tag; there are symbols, which we already saw. There are links to symbols, links to objects, false, objects with instance variables, true, user-defined Marshal classes, arrays, floats, integers, bignums, user-defined objects, hashes, hashes with default values, extended objects, classes, modules, class or module data, regular expressions, structs, and user classes.
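To see a few of these tags for yourself, you can dump some values and look at the first byte after the two version bytes. This is a small sketch covering only some of the tags; the `SAMPLES` hash and its labels are made up for illustration.

```ruby
# Illustrative values only; SAMPLES is a hypothetical name, not part of Marshal.
SAMPLES = {
  "nil" => nil,
  "true" => true,
  "false" => false,
  "small integer" => 1,
  "float" => 1.5,
  "bignum" => 2**70,
  "binary string" => "s".b,
  "UTF-8 string (wrapped so it can carry its encoding)" => String.new("s", encoding: Encoding::UTF_8),
  "symbol" => :a,
  "array" => [],
  "hash" => {},
  "hash with default value" => Hash.new(0),
  "class" => String,
  "module" => Kernel,
  "plain object" => Object.new
}

# Byte 2 is the first tag, right after the major and minor version bytes.
tags = SAMPLES.transform_values { |value| Marshal.dump(value).getbyte(2).chr }
tags.each { |label, tag| puts "#{tag.inspect}  #{label}" }
```

Note that an ordinary UTF-8 string shows up as 'I' rather than a double quote, because the string is wrapped in an "object with instance variables" to record its encoding.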
00:13:07.200 It's a lot of types; it's not the simplest format. Additionally, as I mentioned earlier, there is a special encoding for integers, which took me a while to implement properly. But what's cool about it is we can encode really small numbers in a single byte, and we can encode everything up to 32-bit integers in up to five bytes, where the first byte signals the sign and the length of the integer, and the other bytes are the integer.
00:14:27.840 If you ever wondered how to decode an integer in Marshal, there it is. I'm not going to walk through it, but suffice it to say, it allows us to encode really small numbers using fewer bytes and bigger numbers with more bytes. People have decided that this is a valuable thing for some reason. It turns out small numbers occur quite frequently in large documents, and big numbers are less frequent. You know, if you have like zero and one repeated a bunch of times, it helps to store them in a single byte, or so I'm told.
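For readers who do want to walk through it, here is a hedged sketch of the packed integer encoding as I understand it from Ruby's behavior: zero is a single zero byte, 1..122 is stored as n + 5, -123..-1 as n - 5, and anything larger gets a signed length byte followed by up to four little-endian bytes. The method names `encode_marshal_int` and `decode_marshal_int` are hypothetical; this is not the code from the slide.

```ruby
# Encode an Integer the way Marshal packs small integers (sketch, hypothetical name).
def encode_marshal_int(n)
  return [0].pack("C") if n.zero?
  return [n + 5].pack("C") if n.between?(1, 122)
  return [(n - 5) & 0xff].pack("C") if n.between?(-123, -1)

  payload = []
  rest = n
  until rest.zero? || rest == -1 # stop once only sign bits remain
    payload << (rest & 0xff)
    rest >>= 8
  end
  raise RangeError, "needs more than four bytes" if payload.size > 4

  # Header is +count for positive numbers, -count (as a signed byte) for negative ones.
  header = n.positive? ? payload.size : 256 - payload.size
  [header, *payload].pack("C*")
end

# Decode the packed form back into an Integer (sketch, hypothetical name).
def decode_marshal_int(encoded)
  header, *payload = encoded.bytes
  header -= 256 if header > 127 # reinterpret as a signed byte
  return 0 if header.zero?
  return header - 5 if header > 4
  return header + 5 if header < -4

  size = header.abs
  value = payload.first(size).each_with_index.reduce(0) { |acc, (byte, i)| acc | (byte << (8 * i)) }
  header.positive? ? value : value - (1 << (8 * size))
end

[0, 1, 17, 122, 123, 255, 256, 1_000_000, -1, -123, -124, -257].each do |n|
  puts format("%8d -> %s", n, encode_marshal_int(n).unpack1("H*"))
end
```

The n + 5 offset is why a 17-byte string has the length byte 0x16, and why a count of one shows up as 0x06.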
00:15:59.760 There are also fun features that we saw from the tag list, like symlinks. There are links in Marshal that can point to both objects and symbols. So we can see here I'm marshalling an array with some repeated strings. I haven't shown it on the slide, but let's assume we have frozen string literals enabled, and we can see in the output here that our two strings each only show up once, even though they're repeated in the array.
00:16:27.760 That's because of those '@' signs that are showing up in the hex. Those are links to the repeated objects. So let's say you have, hypothetically speaking, a file that has a list of every Ruby gem and that gem's version and platform, and all of those get repeated a bunch. Not having to spell out those strings saves a bunch of space.
00:17:20.560 As I said, fun feature! If we look at that same document here, we can see we have the array and the array has two objects with IVars, which are strings, and then we have object links that point to those repeated objects. So the object at offset one here is 'hi'; the object at offset two here is 'uest'. If you’re curious, the object at offset zero is the array.
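A small sketch of the same effect, using local variables instead of frozen string literals so that the repeated entries are guaranteed to be the same objects. The strings "hi" and "there" are placeholders, not necessarily the ones on the slide.

```ruby
hi = "hi"
there = "there"

# Each string appears twice in the array, but it is the same object both times.
dumped = Marshal.dump([hi, there, hi, there])
puts dumped.inspect

# The bytes of each string are written once; the repeats become '@' links
# that refer back to an earlier object by its index.
puts "copies of 'hi': #{dumped.scan('hi').size}, object links: #{dumped.count('@')}"

# The 'E' encoding symbol is also written once; the second string refers to it
# with a ';' symbol link instead of spelling it out again.
puts "copies of 'E': #{dumped.count('E')}"

# Loading restores the sharing: both slots point at one String instance.
loaded = Marshal.load(dumped)
puts loaded[0].equal?(loaded[2])
```

If the array held two separate but equal strings (for example `"hi".dup` twice), each would be written out in full, because links track object identity rather than equality.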
00:17:51.999 Basically, you just scan through the document, and every object gets added to the list of objects, and then you can refer to them. Now, as I said, I am a security engineer, so my focus, and the reason I ended up writing stuff around Marshal, was that Marshal as a file format, and also as a module that's in Ruby, was not designed with security over arbitrary inputs in mind.
00:18:54.120 There's this list of common web application vulnerabilities from OWASP, and I'm going to quote smarter people than me saying that it's good to avoid native deserialization formats. If you switch to a pure data format like JSON or XML, you lessen the chance of custom deserialization logic being repurposed towards malicious ends. AKA you're less likely to be hacked, owned, have a data breach, be held for ransom, and so on: all those bad things.
00:19:32.800 Additionally, the US government's National Vulnerability Database says that data which is untrusted cannot be trusted to be well-formed; malformed data or unexpected data could be used to abuse application logic, deny service, or execute arbitrary code when deserialized. That's a list of very bad things. And so in recent years, the advice has been to not use Marshal; use other formats instead, things like JSON or not XML.
00:20:40.480 Now, you may be thinking, how could this be unsafe? You just showed me a bunch of bytes, and nothing could be safer than a hex dump of a bunch of bytes! So now we get into why I spent a month looking at Marshal in the first place. In RubyGems, we load what are called quick specs from a gem server.
00:21:34.480 If you want, you can download one of them at that URL. You're going to have to inflate the contents with zlib because it's a deflate stream, and inside that there is a Marshal document with a gem specification. Now, if you Marshal.dump a gem specification, it looks a little bit like this.
00:22:37.440 This is just Gem::Specification.new being dumped. We're not going to walk through this one byte by byte; I'll spare you, but here are the important bits. There is a 'u', meaning we have a user-defined class. We can see that Gem::Specification is the class name, and Marshal then loads the class from the constant name. Because the tag, the type of the class, was 'u', we call _load on the class with the binary string that was next in the document.
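The shape of a quick spec can be imitated locally without touching a gem server. This is a sketch under assumptions: the gem name "demo" and version "1.0.0" are made up, and the deflate step stands in for what a server would send.

```ruby
require "zlib"
require "rubygems"

# Hypothetical gem name and version, used only to have something to dump.
spec = Gem::Specification.new("demo", "1.0.0")

dumped = Marshal.dump(spec)
compressed = Zlib::Deflate.deflate(dumped)    # stand-in for the served deflate stream
inflated = Zlib::Inflate.inflate(compressed)  # what a client does after downloading

# The class name is embedded as a symbol right after the 'u' tag.
puts inflated.byteslice(0, 24).inspect
puts inflated.include?("Gem::Specification")

# Marshal.load resolves that constant and calls Gem::Specification._load
# with the binary payload that follows the name.
loaded = Marshal.load(inflated)
puts "#{loaded.class}: #{loaded.name} #{loaded.version}"
```

Only run `Marshal.load` like this on bytes you produced yourself; the rest of the talk is about why.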
00:23:39.760 Now, what would happen if in that document, instead of being told we have a Gem::Specification, we have something like (just throwing a random class out there) an ActionDispatch::Routing::RouteSet::NamedRouteCollection or an ERB buffer, or a random class that someone wrote that, when you call initialize on it, will call arbitrary methods on a class based on what instance variables are set on it? This might sound hypothetical, but oh, about 10 or 11 years ago, there was a pretty big vulnerability announced in Rails that went through this exact thing, but with YAML.
00:24:59.080 If you deserialized a specially formatted document that contained an ActionDispatch::Routing::RouteSet::NamedRouteCollection, you could execute arbitrary code. This type of vulnerability took RubyGems.org down for like a week back in the day. Really bad stuff could happen. You could, let's say, dump all of the environment variables that someone has set, or make requests and send things in their environment out, or install arbitrary software on their machines. Bad stuff! And you could then use Marshal.load to execute arbitrary code despite the fact that RubyGems was only expecting a very benign gem specification.
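A harmless sketch of the underlying problem: the stream, not the caller, picks which class gets its `_load` called. `ReportCache` is a hypothetical class standing in for "some class that happens to be loaded in the process", and the stream is built by hand rather than by `Marshal.dump`.

```ruby
# Hypothetical class; imagine it lives somewhere in your app or a dependency.
class ReportCache
  CALLS = []

  def self._load(data)
    CALLS << data # a real gadget would do something harmful here
    new
  end
end

# Hand-built stream: version 4.8, 'u' tag, the class name as a symbol,
# then a length-prefixed payload. Lengths use the n + 5 packing.
name = "ReportCache"
payload = "hello"
forged = "\x04\x08u:".b +
         (name.bytesize + 5).chr + name +
         (payload.bytesize + 5).chr + payload

# The caller might expect a gem specification, but Marshal.load does what the bytes say.
result = Marshal.load(forged)
puts result.class
puts ReportCache::CALLS.inspect
```

Nothing here is dangerous on its own, but it shows why an allow-list has to be checked before the constant is resolved and before `_load` runs.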
00:26:03.440 Now it makes sense that we've moved away from Marshal because of this, but it's sad because it's a pretty cool format by itself, other than the list of 25 different tags, which is pretty long and takes a while to implement. So what's good about it? It's more compact than if you, say, gzip a JSON or CSV file.
00:26:47.440 You can deduplicate objects. But what if we could make Marshal safer? Given that it's used in RubyGems, I wondered: could we make Marshal safer? The unsurprising answer here was yes, we could!
00:27:09.760 So I wrote a bunch of gnarly code, which was heavily inspired by Psych's safe_load method, a method that was introduced due to that very same Rails YAML vulnerability I was talking about a minute ago. The way it works is a little like this: instead of Marshal.load, we call a new method, SafeMarshal.load. We pass it an IO and a list of permitted classes. We parse that stream and create a tree of objects out of the stream. We then walk that tree, transforming each of the elements in it into Ruby objects. For objects and IVars, we check if they are permitted, and we make that check before we transform them.
00:28:00.360 We then sort of close our eyes and hope we did it all correctly, tap our heels three times, and return the root object in the document.
00:28:44.200 And so, like any good compiler engineer does, I ended up writing a parser and a visitor. We're going to skip over the parser bit because there are some good resources if you just Google the Ruby Marshal file format, as I have done several times. But you know, if you think you know any corner cases about Marshal that I don't, please come and try and stop me after the talk, and then we can commiserate about how well we know Marshal.
00:29:10.160 The visitor is where all the cool safe stuff happens. The idea is we have a root visitor class; it knows how to walk the tree. We've implemented methods that say, "Hey, this is how we descend through an array." How do we walk an array? Well, we visit each one of its elements and so on. We wrote a visitor that, when it goes and visits stuff, will return a Ruby object. If you've ever contributed to Psych, this code will probably look very familiar—it's basically the same exact signature.
00:30:39.240 But our Ruby visitor gets a list of permitted classes, a list of permitted symbols, and a hash of permitted instance variables for those permitted classes. Then, you know, going off our earlier example of a gem specification, loading a user-defined object will resolve the class. This means we have a symbol, and we'll figure out what string that symbol is going to be. We check if it is permitted—if it's not, we raise an exception that says this isn't permitted.
00:31:34.240 We ensure that the symbol is something we're allowed to create; then we get the constant, we load it in. If it's undefined, we say, "Hey, it's undefined!" And then we go and call _load on it, but along the way, we've done a good amount of checks that don't happen if you're just calling Marshal.load. This way, we can check if the class is permitted before we call _load on it, and we can check that the constant is permitted before we const_get it, and we do this for all the places where mischief could happen.
00:32:52.200 Things like creating a symbol, because you could create a bunch of symbols to DoS a process, or loading a constant, because you could use it to load a class that does something bad, or if you can find a module that has overridden constant resolution, you could create a bunch of plays.
00:33:34.720 In every place we call a method, we make sure that the method is defined, and when setting an instance variable, we check, hey, do we know about this instance variable on this class? Are we expecting it to be set? This helps prevent setting instance variables that the class isn't expecting.
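The parse-then-visit idea described above can be sketched in a few dozen lines. To be clear about assumptions: `TinySafeMarshal`, `Parser`, `ToRuby`, `NotPermitted`, and `Point` are all hypothetical names, this is not the RubyGems implementation, and it supports only a handful of tags (nil, true, false, small integers, strings, symbols, arrays, the 'E' encoding ivar, and 'u' user-defined objects).

```ruby
require "stringio"

# Hypothetical module: a teaching sketch, not the RubyGems code.
module TinySafeMarshal
  class NotPermitted < StandardError; end
  Node = Struct.new(:type, :value, :extra)

  # Step 1: bytes -> tree of nodes. No constant is resolved and no method is called here.
  class Parser
    def initialize(bytes)
      @io = StringIO.new(bytes.b)
      @symbols = []
    end

    def parse
      raise ArgumentError, "unsupported version" unless @io.read(2) == "\x04\x08".b
      read_node
    end

    def read_int
      c = @io.getbyte
      c -= 256 if c > 127 # reinterpret the header as a signed byte
      return 0 if c.zero?
      return c - 5 if c > 4
      return c + 5 if c < -4
      raw = Array.new(c.abs) { @io.getbyte }
      value = raw.each_with_index.reduce(0) { |acc, (byte, i)| acc | (byte << (8 * i)) }
      c.positive? ? value : value - (1 << (8 * c.abs))
    end

    def read_node
      case (tag = @io.getc)
      when "0", "T", "F" then Node.new(:literal, { "0" => nil, "T" => true, "F" => false }[tag])
      when "i" then Node.new(:literal, read_int)
      when '"' then Node.new(:literal, @io.read(read_int))
      when ":" then Node.new(:symbol, @io.read(read_int)).tap { |node| @symbols << node }
      when ";" then @symbols.fetch(read_int)
      when "[" then Node.new(:array, Array.new(read_int) { read_node })
      when "I" then Node.new(:ivars, read_node, Array.new(read_int) { [read_node, read_node] })
      when "u" then Node.new(:user_defined, read_node.value, @io.read(read_int))
      else raise ArgumentError, "unsupported tag #{tag.inspect}"
      end
    end
  end

  # Step 2: walk the tree, consulting the allow-lists before creating anything risky.
  class ToRuby
    def initialize(permitted_classes:, permitted_symbols:)
      @permitted_classes = permitted_classes
      @permitted_symbols = permitted_symbols
    end

    def visit(node)
      case node.type
      when :literal then node.value
      when :array then node.value.map { |child| visit(child) }
      when :symbol
        name = @permitted_symbols.find { |allowed| allowed == node.value }
        raise NotPermitted, "symbol #{node.value}" if name.nil?
        name.to_sym
      when :ivars
        names = node.extra.map { |ivar, _value| ivar.value }
        raise NotPermitted, "ivars #{names.inspect}" unless names == ["E"]
        object = visit(node.value)
        utf8 = node.extra.first.last.value == true
        object.is_a?(String) && utf8 ? object.force_encoding(Encoding::UTF_8) : object
      when :user_defined
        name = @permitted_classes.find { |allowed| allowed == node.value }
        raise NotPermitted, "class #{node.value}" if name.nil? # checked before const_get
        Object.const_get(name)._load(node.extra)
      end
    end
  end

  def self.load(bytes, permitted_classes: [], permitted_symbols: [])
    tree = Parser.new(bytes).parse
    ToRuby.new(permitted_classes: permitted_classes, permitted_symbols: permitted_symbols).visit(tree)
  end
end

class Point # hypothetical permitted class
  attr_reader :x, :y
  def initialize(x, y)
    @x, @y = x, y
  end
  def _dump(_depth)
    "#{x},#{y}".b
  end
  def self._load(data)
    new(*data.split(",").map(&:to_i))
  end
end

safe = TinySafeMarshal.load(Marshal.dump([1, "two", Point.new(3, 4)]), permitted_classes: ["Point"])
puts safe.inspect
```

A stream that names any class outside `permitted_classes` raises `NotPermitted` before `const_get` or `_load` is reached, which is the ordering the talk describes. Separating parsing from visiting means the checks live in one place, at the cost of building an intermediate tree.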
00:34:29.640 Maybe you know the class has a memoized instance variable where, if it's set, it'll avoid recomputing things. I found a CVE in RubyGems.org that had an issue akin to that, but loading from YAML! Now, Marshal has a lot in common with other binary formats that have become more popular, things like Protobuf or MessagePack.
00:35:18.880 It has features like being a tag-length-value format, having compact encoding of integers, allowing you to embed strings or binary blobs without needing to escape and unescape them, and it supports arbitrary serialization of types as opaque data—all good stuff. But it also has a bunch of features that fit in really well with Ruby, like native support for hashes and arrays.
00:36:04.160 You know, also hashes with default values and stuff like that. It has multiple ways to serialize arbitrary classes, either by giving a list of instance variables or completely custom. Here's a binary string; there's a binary string; here's an object; there's an object. Do whatever you want. It has support for deduplicating objects and symbols in the stream, and built-in support for instance variables because, as Rubyists, all of our stuff is stored in instance variables eventually.
00:37:03.520 And also, we Rubyists just enjoy being able to do things at runtime. Dynamism means fun, right? Marshal's got support for that.
00:37:36.440 Okay, should you use this SafeMarshal module to make Marshal safer? Please don't! You know, I put on my official security engineer hat: please don't. It's slow, and it has a bunch of Marshal features that are intentionally unimplemented because RubyGems didn't need them, and also, I was lazy and didn't feel like implementing them.
00:38:00.000 And finally, I wrote the code, so there's really no promise at all that it's any good. But if you need to read in Marshal files anyway, it's probably better than using Marshal.load on untrusted input. And if you're interested, SafeMarshal is available in RubyGems 3.5, which is included in Ruby 3.3. So, surprise surprise! You've been using this code I've been talking about all along.
00:39:11.400 Now, I did a really good job with time! Here, I'm going to say thank you for following along and laughing at my bad jokes in the only appropriate way I could think of: with a bunch of Marshal.