LA RubyConf 2009

Journey through a pointy forest: The state of XML parsing in Ruby

by Aaron Patterson

In this presentation, titled "Journey through a pointy forest: The state of XML parsing in Ruby," Aaron Patterson surveys XML and HTML parsing in Ruby. He introduces himself as a member of Seattle.rb, known online as tenderlove, and discloses that he is the author of Nokogiri and a maintainer of Mechanize. He warns that the talk may be a bit of a downer but promises it will be over in 30 minutes.

The presentation is structured into four main parts:

  • XML Processing: Patterson discusses various XML parsing styles including SAX, push, pull, and DOM parsing. He elaborates on the importance of choosing the right parser based on factors such as document count, memory and speed constraints, and programming time. SAX parsers are praised for their speed and low memory usage, but he notes their complexity in handling states within documents.

  • HTML Processing: Patterson briefly outlines HTML processing, which closely mirrors XML processing. He introduces several Ruby libraries for HTML parsing, including Nokogiri, which corrects broken HTML before parsing it so that searches behave the way they would in a browser.

  • Data Extraction Techniques: Patterson highlights the use of CSS selectors and XPath queries for data extraction, providing examples from libraries such as Hpricot and Nokogiri. He emphasizes the straightforward application of CSS selectors compared to more complex XPath queries.

  • Namespaces and Document Correction: He dives into XML namespaces, explaining their significance in avoiding ambiguity when parsing. He also discusses the importance of HTML document correction to ensure consistent and expected behavior when querying DOM structures. Through comparing results from different parsers, he illustrates potential discrepancies and issues with libraries like Hpricot that do not align with standard behavior.

The conclusion highlights Patterson's assertion that Nokogiri is superior to other parsers, including REXML and Hpricot, owing to its broader support for CSS selectors and more idiomatic Ruby interface. He stresses the importance of using the best tool for XML and HTML parsing, advocating for Nokogiri due to its reliability and test coverage.

Overall, Patterson's detailed discussion provides valuable insights into XML and HTML processing in Ruby, offering practical knowledge for developers involved in data extraction and formatting tasks.

00:00:12.200 Hello, my name is Aaron Patterson. I work for AT&T Interactive, I live in Seattle, and I'm a member of Seattle.rb.
00:00:20.080 You might know me online as tenderlove.
00:00:26.480 My talk is called "Journey through a pointy forest: The state of XML parsing", but I don't really like this title because it makes me think of journeys or possibly unicorns.
00:00:40.840 So, I've decided to change the name of the talk to "You Suck at XML". Full disclosure: I am the author of Nokogiri and I also maintain Mechanize, which unfortunately makes me an expert in parsing terrible HTML.
00:00:49.879 Warning: this talk may be a downer, but the good news is, it will be done in 30 minutes.
00:01:02.519 The talk is divided into four parts: I'm going to talk about XML processing, HTML processing, data extraction, and HTML correction.
00:01:13.320 We will discuss a few different libraries that you can use in Ruby to accomplish these tasks.
00:01:26.560 First up is a quick XML refresher. XML has nodes that look like this; they are well balanced. They do not look like this, which is not well balanced. We all know that unbalanced cats fall.
00:01:39.600 Nodes have attributes. For example, "journey" is awesome. Documents are a collection of nodes, and an important thing to know is that everything in here is a node. The cat's tag is a node, the text is a node, even the spaces between tags are nodes. Attributes are nodes; everything here is a node.
00:01:52.360 But I shouldn't keep this up too long, because if you look at XML for too long, you'll shoot your eye out. Now, regarding XML processing, we are going to talk about a few different XML parsing styles: SAX, push, pull, and DOM parsing.
00:02:18.400 When selecting a style of parser, you need to consider various factors like the number of documents you will be parsing, memory and speed constraints, how you want to extract data from the XML, and of course, programming time.
00:02:30.920 Once you've mastered these parsing styles, you can become an XML ninja. First up are SAX parsers.
00:02:50.599 SAX parsers are event-based parsers. There are different event types that you can register, like starting or ending an element, when characters are encountered, starting a document, ending a document, etc.
00:03:05.120 You instantiate a parser, hook in the events that you're interested in, and then parse the document. The parser will send events out to your callbacks.
00:03:11.360 Current libraries that support this are REXML, libxml-ruby, and Nokogiri.
00:03:25.360 Now, REXML's syntax looks like this: you create a new document class, include a module containing all the defaults for the callbacks, and then implement the callbacks that you care about. You instantiate a new SAX parser, inform it about your document, and parse the document.
00:03:47.519 libxml-ruby has a similar style. You create a new document class and include a module of callbacks; the only notable difference is that all of these callbacks start with "on_". Additionally, you must specify what type of thing you're parsing by indicating whether it's a string, an IO, or whatever.
00:04:05.640 You set your callback object and call parse. With Nokogiri, the main difference is that rather than including a module, you inherit from a class that contains all the default callbacks. You then instantiate a new parser, provide your document, and parse the XML.
00:04:17.799 I want to show a little bit of sample SAX output. Given the XML we looked at earlier, one might write a SAX parser that simply prints out when we encounter an open tag and when we encounter a closed tag.
00:04:34.600 The output of such a program will look like 'open', 'open', etc. You can see that the parser is just moving through the document, calling your events when they occur.
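As a rough illustration of the open/close logger described above, here is a minimal sketch using REXML's SAX2 interface from the Ruby standard distribution. The sample document and the `TagLogger` class name are assumptions made for this example; they are not taken from the slides.

```ruby
require 'rexml/parsers/sax2parser'
require 'rexml/sax2listener'

# Records an "open"/"close" line for every element the parser reports.
class TagLogger
  include REXML::SAX2Listener # supplies no-op defaults for the other callbacks

  attr_reader :events

  def initialize
    @events = []
  end

  def start_element(uri, localname, qname, attributes)
    @events << "open #{qname}"
  end

  def end_element(uri, localname, qname)
    @events << "close #{qname}"
  end
end

# Hypothetical sample document, not the one shown in the talk.
xml = '<cats><cat name="journey">awesome</cat></cats>'

logger = TagLogger.new
parser = REXML::Parsers::SAX2Parser.new(xml)
parser.listen(logger) # register the listener for all events
parser.parse

puts logger.events
```

The listener never sees the document as a whole; it only receives one callback per event, in document order.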
00:04:57.840 The advantages of SAX parsers are that they are very fast and low on memory. SAX parsers are used inside of SOAP, for example.
00:05:11.600 However, the disadvantages are that searching is hard, document handlers are verbose, and programmer expenses are high. When you implement these documents, you're ultimately left with a state machine because you need to keep track of where you are within the document.
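To make the state-machine point concrete, the sketch below collects the text of `name` elements, but only when they sit directly inside a `cat` element. The handler has to maintain its own stack of open tags to know where it is. The document, the element names, and the `CatNameCollector` class are hypothetical; REXML's SAX2 interface is used only because it ships with Ruby.

```ruby
require 'rexml/parsers/sax2parser'
require 'rexml/sax2listener'

# Collects text found at .../cat/name and ignores every other <name>.
class CatNameCollector
  include REXML::SAX2Listener

  attr_reader :names

  def initialize
    @stack = [] # the handler's own record of which elements are open
    @names = []
  end

  def start_element(uri, localname, qname, attributes)
    @stack.push(qname)
  end

  def end_element(uri, localname, qname)
    @stack.pop
  end

  def characters(text)
    # Only keep text when the two innermost open elements are cat > name.
    @names << text if @stack.last(2) == %w[cat name]
  end
end

# Hypothetical sample document.
xml = '<pets><cat><name>Gorby</name></cat><dog><name>Rex</name></dog></pets>'

collector = CatNameCollector.new
parser = REXML::Parsers::SAX2Parser.new(xml)
parser.listen(collector)
parser.parse

puts collector.names.inspect
```

Even this small query needs three callbacks and explicit bookkeeping, which is the verbosity cost mentioned above.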
00:05:29.919 Next, we have push parsing. Push parsing works similarly to SAX; you give it event callbacks.
00:05:42.479 The main difference between a push parser and the previous SAX parsers is that the programmer controls the document IO. Instead of passing an IO object into the parser, you feed the data into the parser. This is useful for contexts like XMPP or Jabber clients, where you are interacting with an infinite-length document.
00:05:54.639 You don't necessarily want to pass that socket off to the parser, but rather feed that data into the parser.
00:06:12.919 The only library that implements this, to my knowledge, is Nokogiri. The document class looks exactly the same as a SAX document, because it is one.
00:06:30.840 In this example, I will illustrate that we can feed data into the parser character by character.
00:06:39.360 The programmer has fine-grain control over the IO that goes into the parser, and callbacks are called just like in the previous SAX parsers we looked at.
00:06:56.440 The advantages of this approach are low memory consumption, and it's quite fast, though not as fast as the previous SAX parsers.
00:07:09.800 The disadvantages are the same issues that arise with SAX parsers. Your document classes will end up resembling state machines.
00:07:22.400 The third style of parser we will discuss is the pull parser. The interface to a pull parser hands over the XML and yields a node object, but the node object is only yielded when the programmer actually pulls it from the parser.
00:07:39.200 The name 'pull parser' comes from this behavior.
00:07:44.840 Currently, the Ruby library that supports this pull parsing style is REXML. An example of using REXML would look like this. You instantiate a new pull parser, ask if you can get an event, and then pull the event out of the pull parser.
00:08:04.440 These work like cursors moving through your document, allowing you to encounter events and pull data as needed.
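The loop described above, asking whether another event is available and then pulling it, might look like the following with REXML's pull parser. The sample document is an assumption made for illustration.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical sample document.
xml = '<cats><cat name="journey">awesome</cat><cat name="gorby"/></cats>'

parser = REXML::Parsers::PullParser.new(xml)
names = []

# The caller drives the parser: nothing is parsed until pull is called.
while parser.has_next?
  event = parser.pull
  # For a start-element event, event[0] is the tag name and
  # event[1] is a hash of its attributes.
  next unless event.start_element? && event[0] == 'cat'
  names << event[1]['name']
end

puts names.inspect
```

Because the cursor only moves forward, anything skipped in this loop is gone unless the document is parsed again.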
00:08:17.680 The advantages of this method are low memory consumption and extremely fast operations.
00:08:27.720 According to the libxml2 team, this is the fastest XML parser available in libxml2.
00:08:40.040 The disadvantages stem from its cursor-like functionality, which isn't particularly programmer-friendly. As you navigate the document, when you need to find data, you only have one pass. If you miss the data during that pass, you have to traverse the entire document again.
00:09:01.840 Now, SAX interfaces to me are like a poke in the eye, so I'm glad we are now discussing DOM interfaces, which are my favorite and likely familiar to most of you.
00:09:16.840 Given some XML, DOM parsers build an in-memory tree, making it easy to search via XPath.
00:09:31.720 With DOM, you can query the document as much as you need, looking for the data you need using a rich searching language called XPath.
00:09:46.680 Current Ruby libraries that handle DOM parsing include REXML, libxml-ruby, and Nokogiri.
00:10:07.520 REXML makes it very easy to create in-memory documents; you simply pass your XML into REXML::Document.new, and this gives you an in-memory tree.
00:10:15.440 With XPath, you need to create a new object, and then you can search through it using the XPath object.
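A minimal sketch of the REXML DOM workflow, assuming a small sample document invented for this example: build the tree once, then query it as often as needed through `REXML::XPath`.

```ruby
require 'rexml/document'

# Hypothetical sample document.
xml = '<cats><cat name="journey">awesome</cat><cat name="gorby">fluffy</cat></cats>'

# Parsing builds the whole tree in memory.
doc = REXML::Document.new(xml)

# The same tree can be queried repeatedly with different XPath expressions.
names = REXML::XPath.match(doc, '//cat').map { |cat| cat.attributes['name'] }
gorby_text = REXML::XPath.first(doc, '//cat[@name="gorby"]').text

puts names.inspect
puts gorby_text
```

Unlike the SAX and pull examples, no state tracking is needed; the cost is that the entire document lives in memory.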
00:10:35.360 With libxml-ruby, creating an in-memory tree is slightly more complex; you must instantiate a parser first. Again, you specify the type of thing you're parsing and instantiate a parser.
00:10:51.680 You then need to call parse on the parser to get your document back. Once you have the document, you can call a find method to search through your DOM using XPath.
00:11:09.639 Hpricot's interface is much easier, as it's one call to retrieve your XML data. You can also search through it using XPath or CSS selectors.
00:11:34.639 Neither libxml-ruby nor REXML provides CSS selector search. It's important to note that to search via XPath or CSS, you use the same method, search.
00:11:45.520 Now, Nokogiri has a similar interface for fetching your DOM back, but you must specify whether you're using XPath or CSS for the search; you call xpath or css to execute your query.
00:12:05.880 The advantages of DOM-style parsers are that they enable easy data extraction and are programmer-friendly.
00:12:22.680 The disadvantages include high memory consumption and a minor speed penalty.
00:12:33.839 I mention high memory consumption because DOM parsers hold the entire document in memory, unlike SAX-style parsers.
00:12:49.679 Now, on to HTML processing. HTML-style parsers largely resemble XML parsers, but I'll only discuss a few differences because the other two types are exactly the same as XML.
00:13:00.960 The available Ruby libraries for HTML parsing are Nori, libxml-ruby, and Nokogiri.
00:13:23.000 Nori is interesting, and I assume none of you have heard of it. Nori was the first HTML parser that Mechanize used; it sits on top of REXML and corrects broken HTML before it enters the REXML parser.
00:13:38.960 Essentially, you instantiate Nori like this, creating a new HTML to XML parser. However, we actually receive a REXML document in return, allowing us to treat that document like any other REXML document.
00:13:47.360 With libxml-ruby, you use the same style of interface as the XML parser, except you call HTML::Parser::parse_string, and then parse it, after which you receive your document back.
00:14:00.080 Nokogiri is easier; one call will give you your HTML DOM back. Hpricot is even easier because you get your HTML DOM back without complications.
00:14:12.320 Next up, we will discuss data extraction techniques. We have a couple of techniques: CSS selectors and XPath queries.
00:14:30.520 In the following slides, I will mainly focus on Nokogiri and Hpricot because those are the only two libraries that provide CSS selectors.
00:14:44.960 They both provide CSS selectors and XPath queries; the other libraries only provide XPath queries.
00:15:02.400 These techniques are easy to apply across all the libraries. CSS selectors are quite straightforward.
00:15:19.360 With Hpricot, you call search, pass in your CSS selector, and you get a list of nodes back. You can then manipulate those nodes.
00:15:36.400 Nokogiri is very similar; you use the css method and get a list of nodes back. If you know CSS, just plug in your selectors, and you're good to go.
00:15:53.920 XPath queries are a bit more complex compared to CSS selectors. Not everyone is as familiar with XPath queries.
00:16:05.440 For example, this XPath expression finds all foo tags in my tree starting at the root.
00:16:14.240 The first slash means I want to search this tree starting at the root of the document.
00:16:25.360 You can also find all foo tags starting from the current reference node using a dot.
00:16:37.680 The dot signifies that you're starting from your current position in the document, seeking all foo tags that are descendants of that node.
00:16:52.560 XPath allows you to create queries like 'find all foo tags with a child bar tag', which matches XML that has a foo tag containing a child bar tag.
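The three query shapes just described can be sketched as follows. The expressions `//foo`, `.//foo`, and `//foo[bar]` are assumed spellings of the queries on the slides, and the sample document is invented; REXML is used here only because it needs no extra installation.

```ruby
require 'rexml/document'

# Hypothetical document: one foo with a bar child, one foo nested in a section.
xml = '<root><foo id="1"><bar/></foo><section><foo id="2"/></section></root>'
doc = REXML::Document.new(xml)

# Small helper to make results easy to compare.
ids = ->(nodes) { nodes.map { |node| node.attributes['id'] } }

# Leading slashes: search from the root of the document.
all_foos = ids.call(REXML::XPath.match(doc, '//foo'))

# Leading dot: search only beneath the current reference node.
section = doc.root.elements['section']
foos_in_section = ids.call(REXML::XPath.match(section, './/foo'))

# Predicate: only foo elements that have a bar child.
foos_with_bar = ids.call(REXML::XPath.match(doc, '//foo[bar]'))

puts all_foos.inspect
puts foos_in_section.inspect
puts foos_with_bar.inspect
```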
00:17:02.880 But be cautious with queries; ambiguity is the enemy.
00:17:10.560 If you were to plug this query into a browser, it would indeed work, highlighting the relevant elements.
00:17:27.840 This query is valid in both XPath and CSS, which can create confusion due to overlapping syntax.
00:17:45.440 This ambiguity is why in Nokogiri you must specify whether you want to search with XPath or CSS, which eliminates these issues.
00:18:07.000 Now I want to briefly discuss XML namespaces, something you'll inevitably encounter when parsing XML.
00:18:20.640 Let's say we have a couple of XML documents, one from Ford and the other from Schwinn, both showing different tires.
00:18:29.240 The problem arises if we're searching these documents as we can't tell the difference between the tires, since we only receive tire names.
00:18:45.960 This is where namespaces come in. We have explicit namespaces, where the Anti-Ambiguity Squad has come in to fix these XML documents.
00:19:01.360 They declared namespaces in Ford.com called "car" and in Schwinn called "bike". The important thing to understand is that these names are arbitrary; what's significant are the URLs, which must be unique.
00:19:15.360 In these documents, we now know what each tire belongs to: in the first one, the tires are associated with Ford, and in the second with Schwinn.
00:19:35.960 When searching, you'd register the URL to find your bike tire or car tire. However, the names themselves can be anything; the crucial part is the URL.
00:19:55.600 Our second option involves implicit namespaces. The Anti-Typing Too Much Squad designated all tags in this document as namespaced with Ford and Schwinn.
00:20:06.480 Instead of explicitly naming tires with the format "car:tire" or "bike:tire", we assign an implicit namespace, ensuring all tags are associated with the URL.
00:20:22.480 Now, XPath queries behave the same, whether namespaced or not. We need to differentiate between tags with no namespace and those that do.
00:20:38.400 This first document has tires that are namespaced, while the second's tires are not namespaced.
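The point that the prefix is arbitrary and only the URL matters can be sketched with REXML, which accepts a prefix-to-URL mapping as the third argument to `REXML::XPath.match`. The URLs, documents, and the `tire_names` helper below are all hypothetical stand-ins for the Ford and Schwinn slides.

```ruby
require 'rexml/document'

# Hypothetical namespace URLs.
FORD    = 'http://ford.com/'
SCHWINN = 'http://schwinn.com/'

# Explicit namespace: the tire tag carries a declared prefix.
explicit_ford = %(<parts xmlns:car="#{FORD}"><car:tire name="all weather"/></parts>)
# Implicit (default) namespace: every tag belongs to the URL, no prefix typed.
implicit_schwinn = %(<parts xmlns="#{SCHWINN}"><tire name="street"/></parts>)

# The query uses its own prefix "x"; it is bound to a URL at query time,
# so it does not need to match whatever prefix the document chose.
def tire_names(xml, url)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//x:tire', 'x' => url).map { |tire| tire.attributes['name'] }
end

ford_tires        = tire_names(explicit_ford, FORD)
wrong_url_tires   = tire_names(explicit_ford, SCHWINN)
schwinn_tires     = tire_names(implicit_schwinn, SCHWINN)

puts ford_tires.inspect
puts wrong_url_tires.inspect
puts schwinn_tires.inspect
```

The same query finds tires in both the explicit and the implicit document, as long as the registered URL matches; with the wrong URL it finds nothing.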
00:20:56.160 If we don't define namespaces, we can't distinguish between tags effectively. Libraries that support namespaces include libxml-ruby, Nokogiri, REXML, and Hpricot.
00:21:17.679 Let's take a closer look at Hpricot. Hpricot only works with explicit namespaces, utilizing the tag names, which is problematic because those names can be arbitrary.
00:21:37.120 Both documents might use the term "car", but one remains associated with Ford and the other with Schwinn, making differentiation critical.
00:21:51.760 Implicit namespaces are not supported in Hpricot, and understanding the importance of namespaces as tags is crucial when you're searching.
00:22:12.560 Now, why is HTML document correction important? We want to treat our HTML like a browser would.
00:22:32.640 This functionality leads to less unexpected behavior. When looking at a CSS selector on my webpage, I want to find the same data when searching through my document.
00:22:47.840 If I use Firebug to select an object on my page, I want the same CSS selector to find the data I’m after.
00:23:05.920 Both Nokogiri and Hpricot correct HTML, with Nokogiri sitting on top of libxml2, which actually does the HTML correction.
00:23:19.140 These two libraries have different correction schemes. To determine which is correct, we can detect DOM differences.
00:23:32.160 DOM parsers store the document in memory as a tree. For example, we can create a reference HTML and see how it is represented in memory.
00:23:50.720 We want to compare the differences between these trees. I wrote a library called TreeDiff, which you can find on GitHub.
00:24:02.800 Tree differences are interesting because they indicate that those trees return different results when queried with the same CSS selector.
00:24:27.680 We only want to analyze trees that yield different outputs to assess the accuracy of the correction algorithms.
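One simple way to detect that two in-memory trees differ is to compare the set of element paths each one contains. The sketch below is not the TreeDiff library mentioned above; it is a simplified stand-in, and both input trees are hypothetical hand-written markup rather than the real output of any parser.

```ruby
require 'rexml/document'
require 'set'

# Collects a path such as "/table/tr/td" for every element in the tree.
# Duplicate paths collapse into one entry, so this is a coarse comparison.
def element_paths(node, prefix = '', paths = Set.new)
  node.elements.each do |child|
    path = "#{prefix}/#{child.name}"
    paths << path
    element_paths(child, path, paths)
  end
  paths
end

# Two hypothetical corrections of the same broken markup.
tree_a = REXML::Document.new('<table><tr><td>Hello</td><td>World</td></tr></table>')
tree_b = REXML::Document.new('<table><tr><td>Hello<td>World</td></td></tr></table>')

paths_a = element_paths(tree_a)
paths_b = element_paths(tree_b)

only_in_a = (paths_a - paths_b).to_a
only_in_b = (paths_b - paths_a).to_a

puts "only in A: #{only_in_a.inspect}"
puts "only in B: #{only_in_b.inspect}"
```

A non-empty difference means at least one selector, here one that targets a `td` nested inside another `td`, would return different results against the two trees.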
00:24:41.360 For example, I gathered 461 random HTML files and discovered that 336 of those tree structures were different.
00:24:58.640 I examined these trees to share five significant differences with you. The first difference is what I call "Encyclopedia Brown in the case of the missing <td> tag."
00:25:10.640 I reduced these HTML examples to the minimum amount that still produced different in-memory trees.
00:25:23.680 The way this renders in-browser is interesting; the browser automatically corrects by adding two closing <td> tags after 'Hello' and 'World'.
00:25:34.320 Hpricot's correction adds the same closing tags. Nokogiri corrects it in line with the browser's behavior.
00:25:49.840 Here's what the in-memory tree looks like. The blue boxes represent nodes present in Nokogiri that Hpricot does not include.
00:26:01.040 Red boxes indicate nodes that Hpricot has that Nokogiri lacks. The gray boxes are shared nodes.
00:26:13.520 This graph was created using the TreeDiff program. The consequences of this difference are substantial; if you search for <td> in Hpricot, you would get 'Hello World' back, which is incorrect.
00:26:30.080 In Nokogiri, searching for <td> will correctly return its associated text.
00:26:45.920 The second example involves valid HTML. As pointed out, the P tag line is missing two quotes.
00:26:56.480 That is still technically valid HTML. The browser corrects this by adding the quotes.
00:27:05.680 Hpricot adds a closing <center> tag, while Nokogiri adds the quotes, resulting in a valid tree structure.
00:27:13.760 In contrast, Nokogiri keeps the original P tag under the body tag.
00:27:28.720 The third example involves discrepancies in the use of <font> tags.
00:27:39.560 There’s correctly formatted HTML, although Hpricot does add a <font> tag.
00:27:48.600 Nokogiri does not modify the original structure.
00:27:57.360 Even with best attempts of formatting HTML, we see conflicting adjustments.
00:28:07.680 Concerning attributes, consider an example with a missing equal sign.
00:28:20.640 The browser will correct it to look strange, whereas Hpricot adds a closing <table> tag.
00:28:34.760 But Hpricot presents a more serious issue, as it loses the <body> tag entirely.
00:28:47.560 If you search for a <body> tag, it returns zero results. It has essentially been treated as a text node instead.
00:29:01.480 I intended to present a fifth example, but it seems a bit redundant now.
00:29:09.440 Differences were noted in 71% of the trees examined. In all comparisons, libxml2's corrections aligned with browser behavior.
00:29:20.240 The comparison wasn't biased towards either side; if discrepancies surfaced between the trees, it meant one parser was correcting the markup differently.
00:29:33.680 In every case, libxml2 produced results in accordance with browser behavior, while Hpricot did not.
00:29:46.400 In conclusion, always use the best tool for the task at hand. It's soapbox time for me.
00:30:02.080 I believe that Nokogiri is superior to REXML and Hpricot because it uses a more broadly used XML and HTML parser.
00:30:14.760 Developers who use libxml2 are writing in a range of languages, including Perl, C, C++, Objective-C, and Python.
00:30:29.560 I also find it better than libxml-ruby due to its more idiomatic Ruby interface and its CSS selector support.
00:30:43.200 To conclude, the differences in test coverage are notable. REXML has one test, Nori has two, while they manage to bump it to 18; I'm not certain.
00:30:56.960 libxml2 has had an HTML parser since April 2000, and it has been durable and reliable across those nine years.
00:31:09.440 You can find the resources and code here. I will tweet the slides once I post them.
00:31:20.320 That concludes my presentation. Thank you!