00:00:13.120
Hi everyone! I’m so glad so many people are interested in Rails HTML5 sanitization. Fantastic!
00:00:20.560
A quick introduction about myself: I'm a Director of Engineering at Shopify. I maintain the Rails sanitizer stack, and I’m also a member of the Rails security team.
00:00:34.840
I want to start this talk by posing a security puzzle, a hypothetical situation. Then I'll discuss the software behind that puzzle to help you understand why it matters and why it's dangerous if the puzzle can be solved.
00:00:54.199
Then, I'll explain how we're evolving Rails to protect you and your users. So, let's start with the puzzle.
00:01:02.920
Imagine you're running a website, something like a wiki, that publishes a lot of user-controlled data. In case a malicious actor tries to inject some JavaScript, you've got a sanitizer running that ensures only safe, valid HTML4 is rendered in your users' browsers.
00:01:31.759
Now, the interesting part is that most modern browsers are HTML5 compliant. The question is: can you craft a payload that will pass through an HTML4 sanitizer, and when rendered in an HTML5 browser, become a security vulnerability?
00:01:42.320
Let’s define a few important terms: sanitizer and security vulnerability. Starting with the HTML sanitizer, Wikipedia helpfully defines it as something that takes one HTML document and produces a second HTML document containing only the tags and attributes you deem safe.
00:01:48.720
Think of it like a policeman evaluating what can go through. For instance, a blog post comment that is plain text is perfectly safe. If some formatting is included, it’s probably fine, and it should be allowed. However, if it attempts to add a <script> tag aimed at redirecting a user’s browser to a malicious site, that needs to be blocked.
00:02:18.919
A sanitizer will recognize that markup and can either escape the tags, so the browser no longer treats them as markup, or remove that part of the document entirely.
00:02:40.560
The Rails HTML sanitizer specifically removes the unsafe tags and retains safe content, which is a sensible choice. For the last 10 years, the Rails HTML sanitizer stack has remained mostly the same.
00:02:59.640
The stack consists of several layers: at the top are all the Rails gems that utilize the sanitizer gem, which in turn employs Loofah for sanitizer primitives, and that leverages Nokogiri and libxml2 to ensure proper parsing of all markup.
00:03:09.239
Most attacks are crafted as payloads designed to confuse buggy parsers. OWASP has a comprehensive cheat sheet with a historical archive of real-world attacks, providing valuable insight for testing purposes. Attackers often employ tactics such as adding multiple quote characters to confuse the filter into thinking it's still in a string.
00:03:37.080
Having libxml2 at the core of the Rails sanitizer stack ensures we always obtain a well-formed tree. As an illustration, if we have an image tag and a <script> tag, the sanitizer will evaluate this and, if using Loofah, it will simply remove the <script> tags.
00:03:56.680
In summary, the sanitizer stops bad actors by first parsing correctly and then removing unsafe tags and attributes. That’s the essence of what a sanitizer does.
00:04:12.760
Now, let’s carefully define and discuss two more crucial terms: HTML4 and HTML5.
00:04:14.960
HTML4.01 was announced in December 1999, and since then evolution on HTML4 has mostly stopped. If you examine the grammar and HTML syntax definition, it appears reasonable, but it lacks guidance for error correction. This is a significant concern because there is a lot of broken markup on the internet.
00:04:39.040
Error correction refers to what a parser does with malformed HTML. For example, if you open <b> and then <i> but close them in the wrong order (<b><i>text</b></i>), the specification offers no guidance on how the parser should handle it. This lack of guidance means that every HTML4 parser behaves slightly differently, which is problematic.
00:05:04.679
Fortunately, the world has moved beyond HTML4, and we now have HTML5, which is often called The Living Standard. The HTML5 spec is a remarkable document with features like pseudo-code to define HTML syntax clearly. It specifies a state machine that dictates how browsers interpret various tags.
00:05:36.800
This consistency means that every HTML5 parser behaves in a similar manner, rendering them robust and predictable.
00:06:00.000
The specification also has an engaging tone, containing humor and quirky recommendations on reading it multiple times, which showcases the humanity behind the document.
00:06:27.360
The HTML5 working group (the WHATWG) was formed essentially due to a schism with the W3C, which was aiming for perfect, well-formed XML. The browser implementers, in contrast, wanted to improve HTML incrementally and keep it backward compatible. Consequently, the working group's guiding principles emphasize backward compatibility and ensuring specs match implementations.
00:07:11.080
This practical approach is beneficial, especially since the aging HTML4 specification left many behaviors ambiguous. Thankfully, HTML5 provides error-correction instructions for handling malformed markup, which makes parsing far more predictable.
00:08:01.360
Now, let’s return to the puzzle: I’ve given you some background about HTML standards and security vulnerabilities, and now we dive into the risks. Rails has operated using an HTML4 sanitizer for the past decade, while all current browsers are HTML5 compliant.
00:08:49.720
If we can successfully craft a payload that exploits HTML5’s behavior, it could potentially expose every Rails site to vulnerabilities. Unfortunately, there is indeed an exploit already known for this, involving a <select> element that contains a <style> and a <script> tag.
00:09:22.880
The sanitizer's HTML4 parser accepts this as valid HTML4, but when an HTML5 browser parses the sanitized output, it builds a different tree structure in which the script tag executes JavaScript, resulting in a security vulnerability.
00:09:52.920
The discrepancy arises from the way HTML4 and HTML5 specifications treat the child and sibling relationships of tags, leading to the eventual execution of a script that should have been sanitized.
00:10:23.440
We have attempted various fixes since these vulnerabilities started surfacing in 2015. One proposed fix involved recursively sanitizing CDATA, but this approach proved slow and hurt users by producing broken HTML.
00:11:00.920
So we prohibited <style> and <select> from being allowed together, but that did not address the root cause. The same problem can arise with other tags such as <svg> and <math>, whose foreign-content parsing rules complicate matters further.
00:11:35.520
Such vulnerabilities have been categorized as mutation XSS vulnerabilities, where the tree structure changes between parsers due to differences in the specifications. Finding a long-term solution for this challenge requires aligning both the client and server to use HTML5 standards, thereby eliminating this class of problems.
00:12:43.600
Evolution is necessary; the sanitizer stack consists of many libraries, making it feel overwhelming to ensure compatibility with HTML5. My approach involves starting from the top of the stack and working my way down, aiming to understand how each part can utilize HTML5.
00:13:21.080
When tackling the Rails sanitizer stack, I began by sketching an accessor for the sanitizer vendor. However, I quickly realized that I needed many components that didn't exist yet, such as a complete sanitizer class compatible with HTML5.
00:14:06.600
As I navigated through the stack from the Rails HTML sanitizer down to Loofah and Nokogiri, I planned out the transitions and sketched how I wanted these classes to function.
00:15:02.680
After extensive work, we successfully implemented a mechanism that would parse non-HTML5 markup, allowing us to use all layers in harmony. This facilitated the transition to support HTML5 seamlessly across the Rails stack.
00:15:45.240
The process took about a year, with discussions on various solutions, and realistically, libxml2 will never support HTML5 out of the box. Therefore, I initiated an RFC to explore alternatives for parsing HTML5 within Ruby.
00:16:28.560
Through collaborative discussions, we narrowed the options to several libraries, including the orphaned libgumbo library and a new pure Ruby solution. Ultimately, we decided to merge Nokogumbo and libgumbo into Nokogiri to build on their existing APIs.
00:17:57.960
A significant part of this endeavor was making sure contributors were acknowledged and the commit history of their work was preserved, which kept the collaboration smooth.
00:19:02.480
We set about addressing various licensing issues and ensuring that we still had unique functionality from the libraries while also merging their capabilities into the overarching Nokogiri document stack.
00:20:01.440
The merger process included adapting to handle namespaces and foreign contexts in the parsing process, allowing for dynamic parsing across various scenarios.
00:20:50.720
Through continued effort, we rewrote some of the functionality in C, gaining significant speed-ups and making the transition from HTML4 to HTML5 a reality.
00:21:52.560
In May this year, we shipped Rails HTML sanitizer with HTML5 support. Because upgrading changes sanitizer behavior, it was vital to introduce application configurations that keep legacy applications backward compatible while letting newer apps adopt HTML5.
00:23:31.920
New application configurations have been introduced for Action View and Action Text, along with one that keeps test behavior unchanged, allowing for seamless transitions.
00:24:36.800
In conclusion, Rails 7.1 now has HTML5 sanitizer support. If you're upgrading, be sure to check the relevant documentation for the three key configurations that are essential to your setup.
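The transcript does not name the three settings. Going by the Rails 7.1 configuration guide, they appear to be the ones below, shown as a fragment of `config/application.rb` rather than a standalone script. Treat the values as an illustration and check the guide for your version.

```ruby
# config/application.rb (fragment; assumes a Rails 7.1 application)

# Use the HTML5 sanitizer when the platform supports it, HTML4 otherwise.
# A legacy app can pin Rails::HTML4::Sanitizer instead.
config.action_view.sanitizer_vendor = Rails::HTML::Sanitizer.best_supported_vendor
config.action_text.sanitizer_vendor = Rails::HTML::Sanitizer.best_supported_vendor

# Parser used by DOM assertions in tests; :html4 keeps the previous behavior.
config.dom_testing_default_html_version = :html5
```

Apps that call `config.load_defaults 7.1` should get the HTML5-preferring values automatically, so explicit settings mainly matter when upgrading incrementally.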
00:25:36.080
I also see a promising future for sanitizer APIs as they are being standardized, which will allow for far greater ease in parsing and string behavior.
00:25:51.280
Thank you all for listening to my talk on Rails HTML5 sanitization. Together, we can continue to innovate and protect against vulnerabilities.