00:00:12.200
Hello, my name is Aaron Patterson. I work for AT&T Interactive, I live in Seattle, and I'm a member of Seattle.rb.
00:00:20.080
You might know me online as tenderlove.
00:00:26.480
My talk is called "Journey through a pointy forest: The state of XML parsing", but I don't really like this title because it makes me think of journeys or possibly unicorns.
00:00:40.840
So, I've decided to change the name of the talk to "You Suck at XML". Full disclosure: I am the author of Nokogiri and I also maintain Mechanize, which unfortunately makes me an expert in parsing terrible HTML.
00:00:49.879
Warning: this talk may be a downer, but the good news is, it will be done in 30 minutes.
00:01:02.519
The talk is divided into four parts: I'm going to talk about XML processing, HTML processing, data extraction, and HTML correction.
00:01:13.320
We will discuss a few different libraries that you can use in Ruby to accomplish these tasks.
00:01:26.560
First up is a basic XML refresher. XML has nodes that look like this; they are well balanced. However, they do not look like this, which is not well balanced. We all know that unbalanced cats fall.
00:01:39.600
Nodes have attributes. For example, "journey" is awesome. Documents are a collection of nodes, and an important thing to know is that everything in here is a node. The cat's tag is a node, the text is a node, even the spaces between tags are nodes. Attributes are nodes; everything here is a node.
00:01:52.360
But I shouldn't keep this up too long, because if you look at XML for too long, you'll shoot your eye out. Now, regarding XML processing, we are going to talk about a few different XML parsing styles: SAX, push, pull, and DOM parsing.
00:02:18.400
When selecting a style of parser, you need to consider various factors like the number of documents you will be parsing, memory and speed constraints, how you want to extract data from the XML, and of course, programming time.
00:02:30.920
Once you've mastered these parsing styles, you can become an XML ninja. First up are SAX parsers.
00:02:50.599
SAX parsers are event-based parsers. There are different event types that you can register, like starting or ending an element, when characters are encountered, starting a document, ending a document, etc.
00:03:05.120
You instantiate a parser, hook in the events that you're interested in, and then parse the document. The parser will send events out to your callbacks.
00:03:11.360
Current libraries that support this are REXML, LibXML Ruby, and Nokogiri.
00:03:25.360
Now, REXML's syntax looks like this: you create a new document class, include a module containing all the defaults for the callbacks, and then implement the callbacks that you care about. You instantiate a new SAX parser, inform it about your document, and parse the document.
00:03:47.519
LibXML Ruby has a similar style. You create a new document class and include a bunch of callbacks; the only notable difference is that all these callbacks start with "on_". Additionally, you must specify what type of thing you're parsing by indicating whether it's a string, IO, or whatever.
00:04:05.640
You set your callback and call parse. Nokogiri's only real difference is that, rather than including a module, you inherit from a class that contains all your default callbacks. You then instantiate a new parser, give it your document, and parse the XML.
00:04:17.799
I want to show a little bit of sample SAX output. Given the XML we looked at earlier, one might write a SAX parser that simply prints out when we encounter an open tag and when we encounter a closed tag.
00:04:34.600
The output of such a program will look like 'open', 'open', etc. You can see that the parser is just moving through the document, calling your events when they occur.
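As a rough sketch of that open/close printer, here is one way it could look with REXML's SAX2 interface. The class name `TagLogger` and the sample XML are made up for illustration; they are not taken from the slides.

```ruby
require "rexml/parsers/sax2parser"
require "rexml/sax2listener"

# Hypothetical handler: records a line for every open and close tag.
class TagLogger
  include REXML::SAX2Listener # supplies no-op defaults for every callback

  attr_reader :log

  def initialize
    @log = []
  end

  def start_element(uri, localname, qname, attributes)
    @log << "open #{qname}"
  end

  def end_element(uri, localname, qname)
    @log << "close #{qname}"
  end
end

xml = "<cats><cat name='journey'>awesome</cat></cats>"

handler = TagLogger.new
parser  = REXML::Parsers::SAX2Parser.new(xml)
parser.listen(handler) # register the handler for all events
parser.parse

puts handler.log
```

The handler never sees the document as a whole; it only receives one event at a time, in document order.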
00:04:57.840
The advantages of SAX parsers are that they are very fast and low on memory. SAX parsers are used inside of SOAP, for example.
00:05:11.600
However, the disadvantages are that searching is hard, document handlers are verbose, and programmer expenses are high. When you implement these documents, you're ultimately left with a state machine because you need to keep track of where you are within the document.
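To make the state machine point concrete, here is a small sketch using REXML's stream listener. The `CatNameCollector` class and the pets XML are assumptions made up for this example. The handler has to keep its own stack of open tags just to answer the question "am I inside a cat right now?".

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical handler: collects the text of <name> tags, but only
# when they appear directly inside a <cat> tag.
class CatNameCollector
  include REXML::StreamListener

  attr_reader :names

  def initialize
    @stack = [] # tags currently open: this is the hand-rolled state
    @names = []
  end

  def tag_start(name, attributes)
    @stack.push(name)
  end

  def tag_end(name)
    @stack.pop
  end

  def text(content)
    # Only keep text when the two innermost open tags are cat > name.
    @names << content.strip if @stack.last(2) == ["cat", "name"]
  end
end

xml = <<~XML
  <pets>
    <cat><name>Journey</name></cat>
    <dog><name>Rex</name></dog>
  </pets>
XML

collector = CatNameCollector.new
REXML::Document.parse_stream(xml, collector)

puts collector.names
```

Every additional rule about where data may appear means more bookkeeping of this kind inside the handler.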
00:05:29.919
Next, we have push parsing. Push parsing works similarly to SAX; you give it event callbacks.
00:05:42.479
The main difference between a push parser and the previous SAX parsers is that the programmer controls the document IO. Instead of passing an IO object into the parser, you feed the data into the parser yourself. This is useful for contexts like XMPP or Jabber clients, where you are interacting with an infinite-length document.
00:05:54.639
You don't necessarily want to pass that socket off to the parser, but rather feed that data into the parser.
00:06:12.919
The only library that implements this, to my knowledge, is Nokogiri. The document class looks exactly the same as a SAX document, because it is one.
00:06:30.840
In this example, I will illustrate that we can feed data into the parser character by character.
00:06:39.360
The programmer has fine-grain control over the IO that goes into the parser, and callbacks are called just like in the previous SAX parsers we looked at.
00:06:56.440
The advantages of this approach are low memory consumption, and it's quite fast, though not as fast as the previous SAX parsers.
00:07:09.800
The disadvantages are the same issues that arise with SAX parsers. Your document classes will end up resembling state machines.
00:07:22.400
The third style of parser we will discuss is the pull parser. The interface to a pull parser hands over the XML and yields a node object, but the node object is only yielded when the programmer actually pulls it from the parser.
00:07:39.200
The name 'pull parser' comes from this behavior.
00:07:44.840
Currently, the Ruby library that supports this pull parsing style is REXML. An example of using REXML would look like this. You instantiate a new pull parser, ask if you can get an event, and then pull the event out of the pull parser.
00:08:04.440
These work like cursors moving through your document, allowing you to encounter events and pull data as needed.
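The loop described above might be sketched like this with REXML's pull parser. The sample XML and the `opened` array are made up for illustration.

```ruby
require "rexml/parsers/pullparser"

xml = "<cats><cat name='journey'>awesome</cat></cats>"
parser = REXML::Parsers::PullParser.new(xml)

opened = []
while parser.has_next?  # ask whether another event is available
  event = parser.pull   # nothing is parsed until we pull
  opened << event[0] if event.start_element? # event[0] is the tag name
end

puts opened
```

Nothing is kept behind the cursor; once an event has been pulled, the only way to see it again is to parse the document again.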
00:08:17.680
The advantages of this method are low memory consumption and extremely fast operations.
00:08:27.720
According to the LibXML team, this is the fastest XML parser available in LibXML.
00:08:40.040
The disadvantages stem from its cursor-like functionality, which isn't particularly programmer-friendly. As you navigate the document, when you need to find data, you only have one pass. If you miss the data during that pass, you have to traverse the entire document again.
00:09:01.840
Now, SAX interfaces to me are like a poke in the eye, so I'm glad we are now discussing DOM interfaces, which are my favorite and likely familiar to most of you.
00:09:16.840
Given some XML, DOM parsers build an in-memory tree, making it easy to search via XPath.
00:09:31.720
With DOM, you can query the document as much as you need, looking for the data you need using a rich searching language called XPath.
00:09:46.680
Current Ruby libraries that handle DOM parsing include REXML, LibXML Ruby, and Nokogiri.
00:10:07.520
REXML makes it very easy to create in-memory documents; you simply pass your XML to REXML::Document.new, and this gives you an in-memory tree.
00:10:15.440
With XPath, you need to create a new object, and then you can search through it using the XPath object.
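A short sketch of the REXML DOM workflow, using the class-level helpers on `REXML::XPath`. The cats XML is an assumption made up for this example.

```ruby
require "rexml/document"

xml = "<cats><cat name='journey'>awesome</cat><cat name='tom'>sleepy</cat></cats>"

# One call builds the whole in-memory tree.
doc = REXML::Document.new(xml)

# The tree can be queried as often as needed.
names      = REXML::XPath.match(doc, "//cat").map { |cat| cat.attributes["name"] }
first_text = REXML::XPath.first(doc, "//cat").text

puts names
puts first_text
```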
00:10:35.360
LibXML Ruby is slightly more complex to create in-memory trees; you must instantiate a parser first. Again, you specify the type of thing you’re parsing and instantiate a parser.
00:10:51.680
You then need to call parse on the parser to get your document back. Once you have the document, you can call a find method to search through your DOM using XPath.
00:11:09.639
Hpricot's interface is much easier, as it's one call to get your XML document back. You can also search through it using XPath or CSS selectors.
00:11:34.639
Neither LibXML Ruby nor REXML provide this functionality. It's important to note that to search via XPath or CSS, you use the same method, search.
00:11:45.520
Now, Nokogiri has a similar interface for getting your DOM back, but you must specify whether you're searching with XPath or CSS; you call the xpath or css method to execute your query.
00:12:05.880
The advantages of DOM-style parsers are that they enable easy data extraction and are programmer-friendly.
00:12:22.680
The disadvantages include high memory consumption and a minor speed penalty.
00:12:33.839
I mention high memory consumption because DOM parsers hold the entire document in memory, unlike SAX-style parsers.
00:12:49.679
Now, on to HTML processing. HTML-style parsers largely resemble XML parsers, but I'll only discuss a few differences because the other two types are exactly the same as XML.
00:13:00.960
The available Ruby libraries for HTML parsing include a REXML-based parser, LibXML Ruby, Nokogiri, and Hpricot.
00:13:23.000
The first one is interesting, and I assume none of you have heard of it. It was the first HTML parser that Mechanize used; it sits on top of REXML and corrects broken HTML before it enters the REXML parser.
00:13:38.960
Essentially, you instantiate it like this, creating a new HTML to XML parser. However, we actually receive a REXML document in return, allowing us to treat that document like any other REXML document.
00:13:47.360
With LibXML Ruby, you use the same style interface as the XML parser, except you call HTML::Parser::parse_string, and then parse it, after which you receive your document back.
00:14:00.080
Nokogiri is easier; one call will give you your HTML DOM back. Hpricot is even easier because you get your HTML DOM back without complications.
00:14:12.320
Next up, we will discuss data extraction techniques. We have a couple of techniques: CSS selectors and XPath queries.
00:14:30.520
In the following slides, I will mainly focus on Nokogiri and Hpricot because those are the only two libraries that provide CSS selectors.
00:14:44.960
They both provide CSS selectors and XPath queries; the other libraries only provide XPath queries.
00:15:02.400
These techniques are easy to apply across all the libraries. CSS selectors are quite straightforward.
00:15:19.360
With Hpricot, you call search, pass in your CSS selector, and you get a list of nodes back. You can then manipulate those nodes.
00:15:36.400
Nokogiri is very similar; you use the css method and get a list of nodes back. If you know CSS, just plug in your selectors, and you're good to go.
00:15:53.920
XPath queries are a bit more complex compared to CSS selectors. Not everyone is as familiar with XPath queries.
00:16:05.440
For example, this XPath expression finds all foo tags in my tree starting at the root.
00:16:14.240
The first slash means I want to search this tree starting at the root of the document.
00:16:25.360
You can also find all foo tags starting from the current reference node using a dot.
00:16:37.680
The dot signifies that you're starting from your current position in the document, seeking all foo tags that are descendants of that node.
00:16:52.560
XPath allows you to create queries like 'find all foo tags with a child bar tag', which matches XML that has a foo tag containing a child bar tag.
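The three kinds of query just described can be tried out with a sketch like the following, here using REXML. The document, the `id` attributes, and the `ids` helper are made up for illustration.

```ruby
require "rexml/document"

doc = REXML::Document.new(<<~XML)
  <root>
    <foo id="a"><bar/></foo>
    <section>
      <foo id="b"/>
    </section>
  </root>
XML

# Hypothetical helper: reduce a node list to its id attributes.
ids = ->(nodes) { nodes.map { |node| node.attributes["id"] } }

# Leading slashes: search from the root of the document.
all_foos = ids.(REXML::XPath.match(doc, "//foo"))

# Leading dot: search only below the current reference node.
section       = REXML::XPath.first(doc, "//section")
relative_foos = ids.(REXML::XPath.match(section, ".//foo"))

# Predicate: only foo tags that have a child bar tag.
foos_with_bar = ids.(REXML::XPath.match(doc, "//foo[bar]"))

puts all_foos.inspect, relative_foos.inspect, foos_with_bar.inspect
```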
00:17:02.880
But be cautious with queries; ambiguity is the enemy.
00:17:10.560
If you were to plug this query into a browser, it would indeed work, highlighting the relevant elements.
00:17:27.840
This query is valid in both XPath and CSS, which can create confusion due to overlapping syntax.
00:17:45.440
This ambiguity is why, in Nokogiri, you must specify whether you want to search with XPath or CSS, which eliminates these issues.
00:18:07.000
Now I want to briefly discuss XML namespaces, something you'll inevitably encounter when parsing XML.
00:18:20.640
Let's say we have a couple of XML documents, one from Ford and the other from Schwinn, both showing different tires.
00:18:29.240
The problem arises if we're searching these documents as we can't tell the difference between the tires, since we only receive tire names.
00:18:45.960
This is where namespaces come in. We have explicit namespaces, where the Anti-Ambiguity Squad has come in to fix these XML documents.
00:19:01.360
They declared namespaces in Ford.com called "car" and in Schwinn called "bike". The important thing to understand is that these names are arbitrary; what's significant are the URLs, which must be unique.
00:19:15.360
In these documents, we now know what each tire belongs to: in the first one, they're associated with Ford, and in the second with Schwinn.
00:19:35.960
When searching, you'd register the URL to find your bike tire or car tire. However, the names themselves can be anything; the crucial part is the URL.
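Here is a sketch of registering a URL for a search, using REXML. The two documents and the URLs `http://ford.com/` and `http://schwinn.com/` are assumptions standing in for the slides. Note that the query prefix `x` appears in neither document; only the URL it maps to matters.

```ruby
require "rexml/document"

ford = REXML::Document.new(<<~XML)
  <inventory xmlns:car="http://ford.com/">
    <car:tire>all weather</car:tire>
  </inventory>
XML

schwinn = REXML::Document.new(<<~XML)
  <inventory xmlns:bike="http://schwinn.com/">
    <bike:tire>street</bike:tire>
  </inventory>
XML

# The prefix "x" is arbitrary and local to the query; the URL is the identity.
namespaces = { "x" => "http://schwinn.com/" }

in_ford    = REXML::XPath.match(ford, "//x:tire", namespaces)
in_schwinn = REXML::XPath.match(schwinn, "//x:tire", namespaces)

puts in_ford.size
puts in_schwinn.map(&:text)
```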
00:19:55.600
Our second option involves implicit namespaces. The Anti-Typing Too Much Squad designated all tags in this document as namespaced with Ford and Schwinn.
00:20:06.480
Instead of explicitly naming tires with the format "car:tire" or "bike:tire", we assign an implicit namespace, ensuring all tags are associated with the URL.
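An implicit (default) namespace can be sketched the same way with REXML. The document and the URL `http://ford.com/` are made up for illustration. The tire tag carries no prefix, yet it still belongs to the URL declared on its ancestor, and a search that registers that URL finds it.

```ruby
require "rexml/document"

doc = REXML::Document.new(<<~XML)
  <inventory xmlns="http://ford.com/">
    <tire>all weather</tire>
  </inventory>
XML

# No prefix in the markup, but the tag is still namespaced.
tire = doc.root.elements[1]
tire_namespace = tire.namespace

# Register the URL under any prefix we like and search with it.
namespaces = { "car" => "http://ford.com/" }
found = REXML::XPath.match(doc, "//car:tire", namespaces)

puts tire_namespace
puts found.map(&:text)
```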
00:20:22.480
Now, XPath queries behave the same, whether namespaced or not. We need to differentiate between tags with no namespace and those that do.
00:20:38.400
This first document has tires that are namespaced, while the second's tires are not namespaced.
00:20:56.160
If we don't define namespaces, we won't distinguish between tags effectively. Libraries that support namespaces include LibXML Ruby, Nokogiri, REXML, and Hpricot.
00:21:17.679
Let's take a closer look at Hpricot. Hpricot only works with explicit namespaces, utilizing the tag names, which is problematic because those names can be arbitrary.
00:21:37.120
Both documents might use the term "car", but one remains associated with Ford and the other with Schwinn, making differentiation critical.
00:21:51.760
Implicit namespaces are not supported in Hpricot, and understanding the importance of namespaces as tags is crucial when you're searching.
00:22:12.560
Now, why is HTML document correction important? We want to treat our HTML like a browser would.
00:22:32.640
This functionality leads to less unexpected behavior. When looking at a CSS selector on my webpage, I want to find the same data when searching through my document.
00:22:47.840
If I use Firebug to select an object on my page, I want the same CSS selector to find the data I’m after.
00:23:05.920
Both Nokogiri and Hpricot correct HTML, with Nokogiri sitting on top of LibXML2, which actually does the HTML correction.
00:23:19.140
These two libraries have different correction schemes. To determine which is correct, we can detect DOM differences.
00:23:32.160
DOM parsers store the document in memory as a tree. For example, we can create a reference HTML and see how it is represented in memory.
00:23:50.720
We want to compare the differences between these trees. I wrote a library called TreeDiff, which you can find on GitHub.
00:24:02.800
Tree differences are interesting because they indicate that those trees return different results when queried with the same CSS selector.
00:24:27.680
We only want to analyze trees that yield different outputs to assess the accuracy of the correction algorithms.
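To give a feel for what comparing two in-memory trees involves, here is a deliberately naive sketch using REXML. This is not the TreeDiff library and does not claim to reproduce its algorithm; the `element_paths` helper and both sample trees are assumptions made up for this example. It lists every element as a path of tag names and reports the paths whose counts differ between the two trees.

```ruby
require "rexml/document"

# Hypothetical helper: every element as a slash-separated path of tag names.
def element_paths(element, prefix = "")
  path = "#{prefix}/#{element.name}"
  [path] + element.elements.to_a.flat_map { |child| element_paths(child, path) }
end

tree_a = REXML::Document.new(
  "<html><body><table><tr><td>Hello</td><td>World</td></tr></table></body></html>"
)
tree_b = REXML::Document.new(
  "<html><body><table><tr><td>Hello World</td></tr></table></body></html>"
)

counts_a = element_paths(tree_a.root).tally
counts_b = element_paths(tree_b.root).tally

# Paths that occur a different number of times in the two trees.
differing = (counts_a.keys | counts_b.keys).reject do |path|
  counts_a[path] == counts_b[path]
end

puts differing
```

A path that shows up here is a place where the same selector could return different results from the two trees.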
00:24:41.360
For example, I gathered 461 random HTML files and discovered that 336 of those tree structures were different.
00:24:58.640
I examined these trees to share five significant differences with you. The first difference is what I call "Encyclopedia Brown in the case of the missing <td> tag."
00:25:10.640
I reduced these HTML examples to the minimum amount that still produced different in-memory trees.
00:25:23.680
The way this renders in-browser is interesting; the browser automatically corrects by adding two closing <td> tags after 'Hello' and 'World'.
00:25:34.320
Hpricot's correction adds the same closing tags. Nokogiri corrects it in line with the browser's behavior.
00:25:49.840
Here's what the in-memory tree looks like. The blue box represents nodes present in Nokogiri that Hpricot does not include.
00:26:01.040
Red boxes indicate nodes that Hpricot has that Nokogiri lacks. The gray boxes are shared nodes.
00:26:13.520
This graph was created using the TreeDiff program. The consequences of this difference are substantial; if you search for <td> in Hpricot, you would get 'Hello World' back, which is incorrect.
00:26:30.080
In Nokogiri, searching for <td> will correctly return its associated text.
00:26:45.920
The second example involves valid HTML. As pointed out, the P tag line is missing two quotes.
00:26:56.480
That is still technically valid HTML. The browser corrects this by adding the quotes.
00:27:05.680
Hpricot adds a closing <center> tag, while Nokogiri adds the quotes, resulting in a valid tree structure.
00:27:13.760
On the contrary, Nokogiri maintains the original P tag under the body tag.
00:27:28.720
The third example involves discrepancies in the use of <font> tags.
00:27:39.560
There’s correctly formatted HTML, although Hpricot does add a <font> tag.
00:27:48.600
Nokogiri does not modify the original structure.
00:27:57.360
Even with best attempts of formatting HTML, we see conflicting adjustments.
00:28:07.680
Concerning attributes, consider an example with a missing equal sign.
00:28:20.640
The browser will correct it to look strange, whereas Hpricot adds a closing <table> tag.
00:28:34.760
But Hpricot presents a more serious issue, as it loses the <body> tag entirely.
00:28:47.560
If you search for a <body> tag, it returns zero results. It has essentially been treated as a text node instead.
00:29:01.480
I intended to present a fifth example, but it seems a bit redundant now.
00:29:09.440
Differences were noted in 71% of the trees examined. In all comparisons, LibXML2's corrections aligned with browser behavior.
00:29:20.240
It wasn't biased towards either side—if discrepancies surfaced between the trees, one parser was correcting it differently.
00:29:33.680
In every case, LibXML2 produced results in accordance with browser operations, while Hpricot did not.
00:29:46.400
In conclusion, always use the best tool for the task at hand. It's soapbox time for me.
00:30:02.080
I believe that Nokogiri is superior to REXML and Hpricot because it builds on a more widely used XML and HTML parser.
00:30:14.760
Developers who use LibXML2 are writing in a range of languages, including Perl, C, C++, Objective-C, and Python.
00:30:29.560
I also find it better than LibXML Ruby due to its more idiomatic Ruby interface and inclusion of CSS selector support.
00:30:43.200
To conclude, the differences in test coverage are notable. REXML has one test, Nori has two, while they manage to bump it to 18; I'm not certain.
00:30:56.960
LibXML2 has had an HTML parser since April 2000, durable and reliable across those nine years.
00:31:09.440
You can find the resources and code here. I will tweet the slides once I post them.
00:31:20.320
That concludes my presentation. Thank you!