00:00:01.040
When I was a little kid, my mom used to take me to the playground. I would hang out in the sandbox, and if there was a plastic shovel or a little toy in the sandbox, I would pick it up and put it in my mouth. My mom would yell at me, 'Put that down! You don't know where it's been!' I thought that would be a great name for this talk because it's about a test that I and others picked up and put into our test suites without really knowing where it came from, and it wasn't that good for us.
00:00:20.800
If you don't know me, I am Mike. I go by flavorjones, I work for Shopify, and I maintain the Nokogiri and Loofah gems. Loofah is an HTML sanitizer, which basically means it cleans up HTML to make it safe. For example, if you're familiar with blog post comments that have some styling in them, you want to let through styling like 'background: red;' if someone wants to use some CSS styling that is weird but not unsafe.
00:00:35.040
However, if someone tries to inject JavaScript into your site, you want to clean that up by either escaping it or removing it entirely. It's also important to sanitize CSS. For instance, 'background: red;' is an okay string. Calling a CSS function like 'rgb()' is safe, but calling a CSS function like 'url()' is not okay. So, we have to clean up that CSS as well, and that's actually where our story starts.
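The allow-some-functions, block-others idea can be sketched in a few lines of Ruby. This is a toy illustration only, not how Loofah implements it: the `SAFE_CSS_FUNCTIONS` list and the `safe_css_value?` helper are hypothetical names made up for this sketch, and a real sanitizer parses CSS rather than scanning it with a regular expression.

```ruby
# Hypothetical allow-list for illustration; not Loofah's actual list.
SAFE_CSS_FUNCTIONS = %w[rgb rgba hsl hsla calc].freeze

# Returns true when every function call found in the CSS value is on the
# allow-list. A value with no function calls at all is considered safe.
def safe_css_value?(value)
  called = value.scan(/([a-zA-Z-]+)\s*\(/).flatten
  called.all? { |name| SAFE_CSS_FUNCTIONS.include?(name.downcase) }
end

p safe_css_value?("red")                       # prints true
p safe_css_value?("rgb(255, 0, 0)")            # prints true
p safe_css_value?("url(javascript:alert(1))")  # prints false
```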
00:01:18.960
Rails uses the rails-html-sanitizer gem, and a CSS sanitization test started failing. This test uses Loofah, and that's why I got the call. Let's take a look at the test in question. This test is pretty gross but not that complicated. We have a background-image CSS property string, and the test is saying that when it is sanitized, it should be removed entirely. A real-world example would be a blog post comment with 'background-image: blah;'. We should remove it, and that's it. The test is failing because it's saying that this is not being removed.
00:01:45.280
So, where did this test come from? Let's take a look at the Git history. If you follow it back through Action Pack, you actually end up with the Instiki project. Instiki was a wiki written in Rails, and it makes sense that they were worried about HTML sanitization. These tests were written in 2007, meaning they have been around for 14 years, passing until today.
00:02:00.719
Thankfully, we know a little bit about the failure. It passed with Loofah 2.7 and failed with Loofah 2.9. Therefore, I can bisect that with no sweat. The actual commit has some tests that explain that it has to do with behavior around CSS functions, letting safe ones go through and blocking unsafe ones. So, a quick question: is this calling a CSS function? It's kind of hard to tell. Let's break it down and take a look at the string. We have double quotes which tell us that the backslash is going to be interpreted as an escape character.
00:02:43.760
We see these repeating sets of five characters: backslash zero zero hex hex. There are some exceptions to that pattern, like the 'dot 1027' string. This is going to be important later. There are a couple of bare quotes hanging out too. What happens if we just print the string out to take a look at what it actually appears to be? Ah, jeez, that's really gross too—this is garbage. The question is, is it potentially harmful garbage? Does it need to be removed? Maybe Loofah 2.9 is doing the right thing by leaving this string in because it's not dangerous.
00:03:04.480
But why are we testing garbage? This test claims to be doing something with Unicode-encoded strings. Is this Unicode in Ruby? No. Take backslash zero zero seven five: if you print it out, you get back backslash-a followed by five. That is, it consists of two separate characters: octal seven, which Ruby prints as backslash-a, and the character five. So, this is not Unicode in Ruby; it's two bytes.
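This is easy to check in Ruby. The sketch below uses a single escape, backslash zero zero seven five, inside a double-quoted string; the variable name is just for illustration.

```ruby
# Double quotes: Ruby reads "\007" as an octal escape (at most three octal
# digits), so the trailing "5" is a separate, ordinary character.
escaped = "\0075"

p escaped           # prints "\a5"
p escaped.bytes     # prints [7, 53]
p escaped.length    # prints 2
p escaped == "\a5"  # prints true
```

Byte 7 is the ASCII bell character, which Ruby writes as backslash-a, and byte 53 is the digit 5.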
00:03:47.680
But this is going to be in an HTML document, right? Does HTML have its own Unicode syntax that matches this? No. You can pipe it through libxml2, and it doesn't get decoded; HTML has a different syntax for hex escapes. So, what if this test is just encoded wrong? It's supposed to be in Unicode, and it's not. What if we take a wild guess and convert the two characters, like octal seven and five, into Unicode 0075? If we print that out, this is pretty close, right? We have a CSS URL being called, we've got 'javascript: alert()' being called, with some syntax errors. But for the most part, this is correct.
00:04:32.360
And I’m convinced this is just a Unicode problem. So, I have a decision to make. I can take the blue pill; I can just fix the encoding and fix the syntax errors in this test and move on with my life. But there are a couple of things that are making me really nervous about doing that. Number one: backslash-u hex doesn't work in CSS. If I print out two divs, one with 'background: red;' and one with 'background: red;' encoded in Unicode, one works and one doesn't. So, this isn't valid CSS. What are we testing then?
00:05:01.280
Number two: there are still errors in the test. There are still syntax errors: '1027x' and bare quotes are not going to actually run in a browser. So, I don't know what we're actually testing. Gosh, something is starting to feel really familiar about these tests. It’s making me worried. Let's see how far the rabbit hole goes. Where have I seen these strings before?
00:05:27.720
It turns out they are in the Loofah test suite. This is a JSON string from Loofah that we use to generate tests, and it has the same errors: 1053, 1027. But there are some small differences: instead of 0075, we have 00a5. The 7's have been transcribed into a's, which is weird. But these are related somehow because they all have the 1027 error. We can draw this heredity diagram, and there's got to be a missing link somewhere that connects them. So let's dig into Loofah's history and see if we can figure it out.
00:06:02.880
Loofah's tests came from html5lib. When I was building Loofah, I needed a corpus of working tests, and I took the ones from html5lib because they're MIT licensed. html5lib has this exact string, but I can go back through history and see that it was originally backslash a5. Someone fixed this to be Unicode, and they fixed it incorrectly; they turned the backslash a into 00a5. We know that backslash a is actually seven in octal, so this is the same octal encoding error.
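The mix-up can be reproduced in Ruby. This sketch only shows the character values involved; how the change was actually made in html5lib is known here only from the history described above, and the variable names are illustrative.

```ruby
original = "\0075"   # octal 7 followed by "5"
p original           # prints "\a5", which looks like "hex a5"

mis_fix  = "\u00a5"  # reading that printed "a5" as a hex code point
intended = "\u0075"  # the code point the digits 0075 were meant to be

p mis_fix.ord        # prints 165
p mis_fix == "¥"     # prints true
p intended == "u"    # prints true
```

So the "fixed" string starts with a yen sign where the exploit needs the letter u.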
00:06:36.880
Sure enough, it originally came from a Python test suite that had the backslash zero zero seven five, and Python has the same syntax as Ruby when it comes to string encoding. So, updating our heredity diagram, we now know that html5lib introduced the a5 error. It was originally octal encoded, and that Python test came from an earlier Ruby test which came from, that's right, Instiki. So, we can close the loop and know that all these tests are related through the common parent of Instiki.
00:07:08.800
But where did Instiki get the test from? Git is no help anymore; I've got to Google now. Googling shows a result that is the OWASP XSS exploit cheat sheet, and this is exactly our test, right? This has the 1027 error and the octal encoding. This is money! This cheat sheet is dated 2015, but we can update our heredity diagram again and see that they have to have all come from the same source. So let's keep digging.
00:07:51.600
That cheat sheet came from RSnake in 2012. Before that, it was part of the ha.ckers.org website and was a manually maintained database for over a decade. The site doesn't exist anymore, but the Internet Archive has a snapshot from June of 2006 that shows exactly the same string that we're using in Instiki, with the same errors. Great! This is actually our source material. Let's update our diagram to show everything came from ha.ckers.org.
00:08:36.120
But it has errors too. Where's the source material for that? Double-clicking one more time, we can see this was actually originally reported by Windowed Liftshits as a vulnerability in Hotmail. I can Google for that and find the original email that reports the vulnerability. This is the originally reported exploit. Two things to note: there's no 1027 error, so now we know that ha.ckers.org introduced that error. But this still has the weird octal-encoded bytes.
00:09:25.360
I feel like I'm taking crazy pills at this point. Let's update our heredity diagram, but let's circle back because I'm starting to think that maybe this encoding is telling me I'm missing something. The original report's explanation says the JavaScript must be Unicode encoded, and it occurred to me: is this Unicode in JavaScript and I just missed it? Nope, JavaScript has the same syntax as Ruby and Python; it's an octal escape in JavaScript too.
00:10:05.040
And then it occurred to me: what if this is Unicode in CSS? Sure enough, that's exactly what it is! In CSS, you represent a character by writing a bare backslash followed by the hex digits of its code point. I had no idea. And of course, because ASCII is a subset of Unicode (it's the first 128 code points), if you represent ASCII in hex, that's Unicode. So yes, this is Unicode in CSS!
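A minimal Ruby sketch of that decoding rule follows. It is a simplification: it handles a backslash, one to six hex digits, and one optional trailing whitespace character, and ignores the other cases in the CSS syntax (CRLF, out-of-range code points, non-hex escapes). `decode_css_escapes` is a hypothetical helper name, not a Loofah or Rails API.

```ruby
# Replace each CSS hex escape with the character for that code point.
def decode_css_escapes(css)
  css.gsub(/\\(\h{1,6})[ \t\n\f\r]?/) do
    [Regexp.last_match(1).hex].pack("U")
  end
end

# Single quotes keep the backslashes literal, which is what CSS would see.
p decode_css_escapes('\0072\0065\0064')       # prints "red"
p decode_css_escapes('\0075\0072\006C\0028')  # prints "url("
```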
00:10:46.440
I can prove it to you by rewriting our HTML using this encoding for red, and sure enough, both divs work. If we decode that original exploit with this encoding in mind, we can see that this is a beautiful, beautiful example of calling a CSS URL, calling 'javascript: alert()', and exfiltrating document cookies. This is a beautiful, beautiful exploit—this is exactly what we should have been testing all along, and we weren't.
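The two-div experiment can be regenerated with a short Ruby sketch. `css_hex_escape` is a hypothetical helper written for this illustration, and the markup is a guess at what such a demo page might contain, not the slide itself.

```ruby
# Encode every character as a four-digit CSS hex escape, e.g. "r" -> \0072.
def css_hex_escape(text)
  text.each_char.map { |char| format('\%04x', char.ord) }.join
end

encoded = css_hex_escape("red")
puts encoded  # prints \0072\0065\0064

# One div with the plain keyword, one with the CSS-escaped keyword.
puts %(<div style="background: red;">plain</div>)
puts %(<div style="background: #{encoded};">CSS-escaped</div>)
```

Because every character is escaped here, no escape is ever followed by a literal hex digit, so no terminating space is needed.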
00:11:21.840
Now, what actually happened here is Instiki used double quotes, which means that the backslash is interpreted as an escape character. We should have used single quotes. Gosh, now I understand everything. I understand all the problems, and we can rewrite history and see how all of these errors were introduced over time. We have our original report with no errors, ha.ckers.org introducing the 1027 error, Instiki introducing double quotes which were persisted in Python with a non-raw string.
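The difference between the two quoting styles is visible directly in Ruby. The fragment below is illustrative (it is the CSS-escaped form of `url(`), not the full string from the test.

```ruby
broken = "\0075\0072\006C\0028"  # double quotes: Ruby consumes the escapes
fixed  = '\0075\0072\006C\0028'  # single quotes: backslashes are kept

p broken.length  # prints 8
p fixed.length   # prints 20

# Other ways to write the same literal text safely:
p fixed == "\\0075\\0072\\006C\\0028"  # prints true
p fixed == %q(\0075\0072\006C\0028)    # prints true
```

Only the 20-character version still contains the backslashes that a CSS parser needs in order to decode the escapes.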
00:11:48.800
Then later, an a5 error was introduced in html5lib and carried into the Loofah tests. Awesome! And so for 14 years, this test has been propagating through various test suites, and at no point was it actually a valid test.
00:12:06.560
So, what do we do with this information? Well, first, I fixed the test in Rails to use single quotes. Second, I fixed the test in Loofah to use this encoding instead of Unicode. I plan to submit a PR to OWASP to fix the cheat sheet, and I get to annoy you all with this story. So thank you and congratulations for being part of the story.
00:12:36.480
A couple of things to meditate on before I let you go: nobody is at fault here. I'm not trying to shame anyone; something bad was bound to happen because we had a huge database of exploits and we were machine-generating tests. It's just an interesting story, and it's nobody's fault.
00:12:53.760
Security is really hard, especially when exploits are complex and non-obvious. Everyone is just going to trust that the test knows what it's doing.
00:13:05.360
Thirdly, hex encoding in CSS—who knew? I guarantee that you're not going to remember this fact when you actually need it, and it's just interesting to me that I learned that along the way.
00:13:11.920
Fourth: always take the red pill! Archaeology is fun; you might get a conference talk out of it. Finally, I just want to say thank you to Sam Ruby and Jacques Distler for maintaining security tests for so long in the Ruby community. Thank you both.
00:13:24.560
Thanks to Robert Hansen for the XSS cheat sheet, and thanks to the Internet Archive for making all this archaeology possible. If you don't donate to the Internet Archive, you should. They're a great group.
00:13:36.080
Thanks for coming along on this ride with me. I hope you're enjoying RubyConf. If you want to talk about HTML sanitization, look me up.