Rails World 2023

Rails::HTML5: The Strange and Remarkable Three-Year Journey

by Mike Dalessio

In the presentation titled "Rails::HTML5: the strange and remarkable three-year journey," Mike Dalessio, Director of Engineering at Shopify, describes the extensive journey of enhancing Rails to support HTML5 sanitization. The talk outlines the crucial evolution of Rails' security measures, moving from an HTML4 sanitizer to one that accommodates the modern demands of HTML5.

Key points discussed include:

- Security Overview: The presentation begins with a hypothetical scenario posed as a security puzzle, highlighting the vulnerabilities of running an HTML4 sanitizer against modern HTML5-compliant browsers.

- Understanding Sanitizers: Dalessio explains the function of sanitizers, using examples of how they parse and sanitize input, ensuring that only safe tags are processed.

- The Shift from HTML4 to HTML5: An essential part of the discussion contrasts the HTML4 and HTML5 specifications. HTML4 offers no error-correction guidance, so parsers behave inconsistently, whereas HTML5 defines consistent and predictable behavior.

- Identifying Vulnerabilities: The presentation draws attention to known vulnerabilities that emerged due to discrepancies between how HTML4 and HTML5 handle document trees, particularly concerning elements like <select> that could create security risks when mismanaged.

- Implementation Journey: Dalessio reveals the extensive collaborative efforts needed to migrate the sanitizer stack over several years, culminating in the release of Rails 7.1’s support for HTML5 sanitization.

- Collaborative Efforts and Solutions: The discussion covers the integration of previous libraries and the difficulties faced in ensuring backward compatibility while adopting HTML5 standards.

- Future Prospects: The talk concludes with optimism about standardizing sanitizer APIs that will simplify parsing and enhance security measures, encouraging continued innovation in protecting user data.

In conclusion, Dalessio encourages developers to review the new documentation on Rails 7.1 HTML5 support, underscoring the complexity and necessity of keeping web applications secure and compliant with modern standards.

The presentation reflects not only on technical improvements in Rails but also on the collaborative spirit needed to tackle such significant challenges in the open-source community.

00:00:13.120 Hi everyone! I’m so glad so many people are interested in Rails HTML5 sanitization. Fantastic!
00:00:20.560 A quick introduction about myself: I'm a Director of Engineering at Shopify. I maintain the Rails sanitizer stack, and I’m also a member of the Rails security team.
00:00:34.840 I wanted to start this talk by posing a security puzzle, a hypothetical situation. I'll discuss the software behind that puzzle to help you understand why it's important and why a solution to it would be dangerous.
00:00:54.199 Then, I'll explain how we're evolving Rails to protect you and your users. So, let's start with the puzzle.
00:01:02.920 Imagine you're running a website similar to a Wiki that publishes a lot of user-controlled data. If a malicious actor wants to inject some JavaScript, you've got a sanitizer running which ensures that only safe, valid HTML4 is rendered in your users' browsers.
00:01:31.759 Now, the interesting part is that most modern browsers are HTML5 compliant. The question is: can you craft a payload that will pass through an HTML4 sanitizer, and when rendered in an HTML5 browser, become a security vulnerability?
00:01:42.320 Let’s define a few important terms: sanitizer and security vulnerability. Starting with the HTML sanitizer, Wikipedia helpfully defines it as something that takes one HTML document and produces a second HTML document containing only the tags and attributes you deem safe.
00:01:48.720 Think of it like a policeman evaluating what can go through. For instance, a blog post comment that is plain text is perfectly safe. If some formatting is included, it’s probably fine, and it should be allowed. However, if it attempts to add a <script> tag aimed at redirecting a user’s browser to a malicious site, that needs to be blocked.
00:02:18.919 A sanitizer will recognize that markup and it can either escape the tags so they are no longer processed as unsafe or remove that part of the document entirely.
00:02:40.560 The Rails HTML sanitizer specifically removes the unsafe tags and retains safe content, which is a sensible choice. For the last 10 years, the Rails HTML sanitizer stack has remained mostly the same.
00:02:59.640 The stack consists of several layers: at the top are all the Rails gems that utilize the sanitizer gem, which in turn employs Loofah for sanitizer primitives, and that leverages Nokogiri and libxml2 to ensure proper parsing of all markup.
00:03:09.239 Most attacks are crafted as payloads designed to confuse buggy parsers. OWASP has a comprehensive cheat sheet with a historical archive of real-world attacks, providing valuable insight for testing purposes. Attackers often employ tactics such as adding multiple quote characters to confuse the filter into thinking it's still in a string.
00:03:37.080 Having libxml2 at the core of the Rails sanitizer stack ensures we always obtain a well-formed tree. As an illustration, if we have an image tag and a <script> tag, the sanitizer will evaluate this and, if using Loofah, it will simply remove the <script> tags.
00:03:56.680 In summary, the sanitizer stops bad actors by first parsing correctly and then removing unsafe tags and attributes. That’s the essence of what a sanitizer does.
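The "parse first, then filter the tree" idea summarized above can be sketched in a few lines of Ruby. This is a minimal illustration under stated assumptions, not how Loofah or the Rails sanitizer is implemented: the `Node` struct, the `el` and `text` helpers, and both allow-lists are hypothetical, and the input tree is built by hand instead of coming from a real HTML parser. It shows the two strategies mentioned earlier, removing unsafe elements or escaping them so they become inert text.

```ruby
require "cgi"

# Hypothetical node type standing in for a parsed HTML tree.
# A real sanitizer would receive this tree from an HTML parser.
Node = Struct.new(:name, :attrs, :children) do
  def text?
    name == :text
  end
end

def text(str)
  Node.new(:text, { value: str }, [])
end

def el(name, attrs = {}, children = [])
  Node.new(name, attrs, children)
end

# Illustrative allow-lists, not the lists any real library ships with.
SAFE_TAGS  = %w[b i em strong p].freeze
SAFE_ATTRS = %w[title].freeze

# Serialize a node without filtering (used only by the :escape strategy).
def raw_html(node)
  return node.attrs[:value] if node.text?

  attr_str = node.attrs.map { |k, v| %( #{k}="#{v}") }.join
  inner = node.children.map { |child| raw_html(child) }.join
  "<#{node.name}#{attr_str}>#{inner}</#{node.name}>"
end

# Walk the tree and serialize it. Unsafe elements are either dropped
# together with their contents (:remove) or emitted as escaped text (:escape).
# Unsafe attributes on safe elements are always dropped.
def sanitize(node, strategy: :remove)
  return CGI.escapeHTML(node.attrs[:value]) if node.text?

  unless SAFE_TAGS.include?(node.name)
    return strategy == :remove ? "" : CGI.escapeHTML(raw_html(node))
  end

  attrs = node.attrs.select { |k, _| SAFE_ATTRS.include?(k) }
  attr_str = attrs.map { |k, v| %( #{k}="#{CGI.escapeHTML(v)}") }.join
  inner = node.children.map { |child| sanitize(child, strategy: strategy) }.join
  "<#{node.name}#{attr_str}>#{inner}</#{node.name}>"
end

comment = el("p", {}, [
  text("Nice post! "),
  el("b", { "onclick" => "steal()" }, [text("Really.")]),
  el("script", {}, [text("alert(1)")])
])

removed = sanitize(comment, strategy: :remove)
escaped = sanitize(comment, strategy: :escape)
puts removed # <p>Nice post! <b>Really.</b></p>
puts escaped # <p>Nice post! <b>Really.</b>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Because the filter works on a tree and never on the raw string, the quality of the result depends entirely on the parser that produced the tree, which is the point the rest of the talk builds on.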
00:04:12.760 Now, let’s carefully define and discuss two more crucial terms: HTML4 and HTML5.
00:04:14.960 HTML4.01 was announced in December 1999, and since then evolution on HTML4 has mostly stopped. If you examine the grammar and HTML syntax definition, it appears reasonable, but it lacks guidance for error correction. This is a significant concern because there is a lot of broken markup on the internet.
00:04:39.040 Error correction refers to instances where you have malformed HTML. For example, if you have open <b> and <i> tags that are closed in the wrong order, the specification offers no guidance on how the parser should handle it. This lack of guidance means that every HTML4 parser behaves slightly differently, which is problematic.
00:05:04.679 Fortunately, the world has moved beyond HTML4, and we now have HTML5, which is often called The Living Standard. The HTML5 spec is a remarkable document with features like pseudo-code to define HTML syntax clearly. It specifies a state machine that dictates how browsers interpret various tags.
00:05:36.800 This consistency means that every HTML5 parser behaves the same way, which makes them robust and predictable.
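To make the state-machine idea concrete, here is a toy tokenizer in Ruby. It is a heavily simplified sketch, not the algorithm from the HTML Standard: the three states are loosely named after the standard's data, tag open, and tag name states, and the `tokenize` method and its token format are assumptions made for illustration. Attributes, comments, character references, and most error cases are ignored.

```ruby
# Toy tokenizer: every input character is handled by the current state,
# so any given input always produces the same tokens.
def tokenize(input)
  tokens  = []
  state   = :data
  text    = +""   # pending character data
  name    = +""   # tag name currently being read
  end_tag = false

  flush_text = lambda do
    tokens << [:text, text] unless text.empty?
    text = +""
  end

  input.each_char do |ch|
    case state
    when :data
      if ch == "<"
        state = :tag_open
      else
        text << ch
      end
    when :tag_open
      if ch == "/"
        end_tag = true
        state = :tag_name
      elsif ch.match?(/[A-Za-z]/)
        end_tag = false
        name << ch.downcase
        state = :tag_name
      else
        # Error correction: a "<" that does not start a tag is plain text.
        # (The real algorithm reconsumes the character; this toy appends it.)
        text << "<" << ch
        state = :data
      end
    when :tag_name
      if ch == ">"
        flush_text.call
        tokens << [end_tag ? :end_tag : :start_tag, name]
        name = +""
        state = :data
      else
        name << ch.downcase
      end
    end
  end

  flush_text.call
  tokens
end

p tokenize("<b><i>hi</b></i>")
# [[:start_tag, "b"], [:start_tag, "i"], [:text, "hi"], [:end_tag, "b"], [:end_tag, "i"]]
p tokenize("1 < 2")
# [[:text, "1 < 2"]]
```

The second call shows the kind of decision a specification has to spell out: what to do with a stray `<`. When every such case is written down, two independent implementations produce the same tokens for the same malformed input.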
00:06:00.000 The specification also has an engaging tone, containing humor and quirky recommendations on reading it multiple times, which showcases the humanity behind the document.
00:06:27.360 The HTML5 working group was formed essentially due to a schism with the W3C, with the latter aiming for perfect, well-formed XML. In contrast, the browser implementers sought to incrementally improve and ensure backward compatibility. Consequently, the working group's guiding principles emphasize backward compatibility and ensuring specs match implementations.
00:07:11.080 This practical approach is beneficial, especially considering the aging HTML4 specification left many behaviors ambiguous. Thankfully, HTML5 provides error correction instructions for handling malformed markup, further enhancing its reliability.
00:08:01.360 Now, let’s return to the puzzle: I’ve given you some background about HTML standards and security vulnerabilities, and now we dive into the risks. Rails has operated using an HTML4 sanitizer for the past decade, while all current browsers are HTML5 compliant.
00:08:49.720 If we can successfully craft a payload that exploits HTML5’s behavior, it could potentially expose every Rails site to vulnerabilities. Unfortunately, there is indeed an exploit already known for this, involving a <select> element that contains a <style> and a <script> tag.
00:09:22.880 An HTML4 parser translates this into valid HTML4, but when the same markup is parsed by an HTML5 parser, it produces a different tree structure where script tags can execute JavaScript, resulting in security vulnerabilities.
00:09:52.920 The discrepancy arises from the way HTML4 and HTML5 specifications treat the child and sibling relationships of tags, leading to the eventual execution of a script that should have been sanitized.
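The mismatch described above can be illustrated with two hand-written trees. Everything in this sketch is an assumption made for illustration: the transcript does not reproduce the exact payload, so `PAYLOAD` is only a plausible shape, and the two trees are approximations of what each kind of parser might build, not real parser output. The point is only that a check which passes on one tree says nothing about the other.

```ruby
# Assumed payload shape: a <select> containing a <style> and a <script>.
PAYLOAD = "<select><style><script>alert(1)</script></style></select>"

# Each element is [name, children]; text nodes are plain Strings.

# Approximation of the tree an HTML4-style parser hands to the sanitizer:
# the contents of <style> are raw text, so "<script>..." is just characters.
SANITIZER_VIEW = ["select", [["style", ["<script>alert(1)</script>"]]]]

# Approximation of the tree an HTML5 browser builds from the same markup:
# <style> is not kept inside <select>, so <script> becomes a real element.
BROWSER_VIEW = ["select", [["script", ["alert(1)"]]]]

# Collect the names of all elements in a tree, ignoring text nodes.
def element_names(node)
  return [] if node.is_a?(String)

  name, children = node
  [name] + children.flat_map { |child| element_names(child) }
end

puts "sanitizer sees: #{element_names(SANITIZER_VIEW).inspect}"
puts "browser builds: #{element_names(BROWSER_VIEW).inspect}"
puts "script element visible to sanitizer? #{element_names(SANITIZER_VIEW).include?('script')}"
puts "script element present in browser?   #{element_names(BROWSER_VIEW).include?('script')}"
```

A sanitizer that inspects only the first tree has no `script` element to remove, yet the browser ends up with one.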
00:10:23.440 We have attempted various solutions to this problem since vulnerabilities started surfacing in 2015. One proposed fix involved recursively sanitizing CDATA, but this approach proved slow and detrimental to users, resulting in broken HTML.
00:11:00.920 Thus, while we did prohibit both <style> and <select> from being used together, this did not adequately solve the root issues, which can also arise from other tags such as <svg> and <math>, which complicate parsing further due to their unique contexts.
00:11:35.520 Such vulnerabilities have been categorized as mutation XSS vulnerabilities, where the tree structure changes between parsers due to differences in the specifications. Finding a long-term solution for this challenge requires aligning both the client and server to use HTML5 standards, thereby eliminating this class of problems.
00:12:43.600 Evolution is necessary; the sanitizer stack consists of many libraries, making it feel overwhelming to ensure compatibility with HTML5. My approach involves starting from the top of the stack and working my way down, aiming to understand how each part can utilize HTML5.
00:13:21.080 When tackling the Rails sanitizer stack, I began by sketching an accessor for the sanitizer vendor. However, I quickly realized that many of the components I needed did not exist yet, such as a complete sanitizer class compatible with HTML5.
00:14:06.600 As I navigated through the stack from the Rails HTML sanitizer down to Loofah and Nokogiri, I planned out the transitions and sketched how I wanted these classes to function.
00:15:02.680 After extensive work, we successfully implemented a mechanism that would parse non-HTML5 markup, allowing us to use all layers in harmony. This facilitated the transition to support HTML5 seamlessly across the Rails stack.
00:15:45.240 The process took about a year, with discussions on various solutions. Realistically, libxml2 will never support HTML5 out of the box, so I initiated an RFC to explore alternatives for parsing HTML5 within Ruby.
00:16:28.560 Through collaborative discussions, we narrowed potential options to several libraries, including the orphaned libgumbo library and a new pure Ruby solution. Ultimately, we decided to merge Nokogumbo and libgumbo into the Nokogiri ecosystem to leverage their existing APIs.
00:17:57.960 A significant part of this endeavor was ensuring that contributions were acknowledged and preserving the history of contributions, providing a smooth collaborative environment.
00:19:02.480 We set about addressing various licensing issues and ensuring that we still had unique functionality from the libraries while also merging their capabilities into the overarching Nokogiri document stack.
00:20:01.440 The merger process included adapting to handle namespaces and foreign contexts in the parsing process, allowing for dynamic parsing across various scenarios.
00:20:50.720 Through continued effort, we rewrote some of the functionality in C, gaining significant speed-ups and making the transition from HTML4 to HTML5 a reality.
00:21:52.560 In May this year, we shipped Rails HTML sanitizer with HTML5 support. To support upgrades, it became vital to introduce application configuration options, ensuring backward compatibility for legacy applications while enabling newer apps to adopt HTML5 standards.
00:23:31.920 New application configurations have been introduced for Action View and Action Text, along with one that keeps test behavior unchanged, thereby allowing for seamless transitions.
00:24:36.800 In conclusion, Rails 7.1 now has HTML5 sanitizer support. If you're upgrading, be sure to check the relevant documentation for the three key configurations that are essential to your setup.
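The transcript does not name the three settings. Assuming they are the sanitizer and DOM testing options documented in the Rails 7.1 configuration guide, a configuration fragment might look like the sketch below. It is not runnable on its own, and the names should be checked against the guide for your Rails version.

```ruby
# Fragment for the Application class in config/application.rb.
# Assumes Rails 7.1 with a rails-html-sanitizer version that provides Rails::HTML5.

# Sanitizer used by the Action View sanitize helpers.
config.action_view.sanitizer_vendor = Rails::HTML5::Sanitizer

# Sanitizer used by Action Text.
config.action_text.sanitizer_vendor = Rails::HTML5::Sanitizer

# Parser used by the DOM assertions in tests (:html4 or :html5).
config.dom_testing_default_html_version = :html5
```

An application that needs to keep the previous behavior while upgrading could presumably point the first two settings at `Rails::HTML4::Sanitizer` and the third at `:html4`, then switch them one at a time.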
00:25:36.080 I also see a promising future for sanitizer APIs as they are being standardized, which will allow for far greater ease in parsing and string behavior.
00:25:51.280 Thank you all for listening to my talk on Rails HTML5 sanitization. Together, we can continue to innovate and protect against vulnerabilities.