Microtalk: Nokogiri - History and Future

Over the past few years, Nokogiri has slowly eclipsed older XML parsing libraries to garner nearly 10 million downloads from rubygems.org.
But why another XML parsing library? Isn't it boring? And what does "nokogiri" mean in Japanese, anyway?
These questions will be answered, and I'll do a brief dive into all the technologies that we use to make Nokogiri a fast, reliable and robust gem. Topics will include:
Origins of the project: motivation, problems and impact
Native C and Java extensions
FFI, and how to know if it's Right For You
Debugging tools (valgrind, perftools)
Packaging tools (mini_portile, rake-compiler)
Installation issues, and what we're doing to help
Feature roadmap

GoRuCo 2013

00:00:16.960 Hey guys. Check, check. Cool, nice to see you all here. My name is Mike Dalessio, and on the internet I go by flavorjones. I am one of the authors of Nokogiri, along with Tenderlove, whom you may know from his online presence.
00:00:26.240 These slides are available at this permalink if you want to check them out: bit.ly/nokigirigoroko2013. It's important to remember that this presentation is both valid and well-formed.
00:00:36.640 I work for Pivotal Labs. This is my internet avatar, and if you happen to see me online, feel free to say hi. Now, let's dive right in. As of this morning, Nokogiri has been downloaded over 11 million times, almost 12 million. That's pretty amazing, and I was really surprised by that.
00:00:49.760 For comparison, Rails has about twice that, around 24 million downloads. It's not a competition, but if we put it into perspective, Nokogiri would be like Kelly Clarkson, having shipped around 11 million units, while Rails would be akin to Kiss, who has been around longer. People tend to have a more emotional response to Kiss, while Kelly Clarkson's music is more adult contemporary.
00:01:13.760 But how did this happen? Nokogiri is just this little utility library that I didn't think would be anything special when I started. We're going to go into the history of this a bit more. First, let me address a common question: What does 'Nokogiri' mean? It's a Japanese word for a saw, you know, for cutting trees. Think of it as a pun for 'cutting through the forests of XML.' You'll get it later tonight on the boat.
00:01:39.920 Now, let's talk about the origins of the project. It started in 2008. To set the stage, that year, Rails 2.1 and Ruby 1.8.7 were released. Slumdog Millionaire was a hit, and oil prices were around $800 a barrel. In the news, Eliot Spitzer resigned in disgrace, which some of you may remember. On September 9th that year, I had an interesting email conversation with Tenderlove, who, at the time, did not really know who I was. I had sent a couple of pull requests to Mechanize.
00:02:29.120 I asked him if he was working on an XML wrapper, and he confirmed that he was. He mentioned that it was going to be better than Hpricot for handling broken HTML. I expressed my interest in helping out. He told me that he had already started a project called Nokogiri and that they would be using dl.
00:02:50.720 For those of you who don't know, dl was a standard library for calling C code from pure Ruby. However, it turned out to be really slow, so we decided we had to write a C extension instead. We began writing C code, compiling it into the gem, and calling the libxml2 library directly.
00:03:22.080 The toughest part about this was dealing with libxml2's memory management. Its model for handling nodes within a document was quite complex, and I spent about two months fixing segmentation faults and working out how memory was handled. The key tools I used for debugging were Valgrind, which catches invalid memory accesses, and perftools.rb, a Ruby port of Google's performance tools.
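As a rough illustration of what calling C from pure Ruby looks like, here is a minimal sketch using Fiddle, the standard-library successor to dl in current Rubies. It is not Nokogiri code: the `NativeString` module is a hypothetical name, libc's `strlen` stands in for a libxml2 function, and `Fiddle.dlopen(nil)` assumes a Linux or macOS process where the libc symbols are already loaded.

```ruby
require "fiddle"

# Hypothetical wrapper module, for illustration only.
module NativeString
  # Assumption: passing nil opens the symbols of the running process,
  # which includes libc on Linux and macOS.
  LIBC = Fiddle.dlopen(nil)

  # Bind C's `size_t strlen(const char *s)`.
  STRLEN = Fiddle::Function.new(
    LIBC["strlen"],          # address of the C function
    [Fiddle::TYPE_VOIDP],    # one pointer argument
    Fiddle::TYPE_SIZE_T      # returns a size_t
  )

  # Fiddle passes a pointer to the Ruby string's bytes for a
  # TYPE_VOIDP argument, so the C function sees a plain char *.
  def self.length(text)
    STRLEN.call(text)
  end
end

puts NativeString.length("nokogiri") # prints 8
```

Every call crosses from Ruby into native code through a generic dispatch layer, which is one plausible reason such bindings can be slower than a compiled C extension.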
00:04:08.080 As for the API design of Nokogiri, we aimed to take the best XML API we could find, which was Hpricot's. Those who remember Hpricot will recall that it was a lovely gem that provided a great API for handling XML, although it had its share of bugs, including off-by-one errors. It wasn't as fast or portable as I wanted it to be, which led us to create Nokogiri.
00:04:41.200 The first official release of Nokogiri was in November 2008, coinciding with the weekend of Daylight Saving Time. I was quite busy at work, so Aaron released it anyway. Interestingly, in January 2009, we had reached a point where we were evenly split between Ruby and C in our gem. We started out building a pure Ruby gem, but by then, it was about half Ruby and half C.
00:05:03.040 You may wonder how the community accepted Nokogiri initially, especially since Hpricot was so well-loved. There were performance metrics circulating that indicated Hpricot was significantly slower than regular expressions, leading to heated debates among developers about the speed of XML parsing libraries.
00:05:21.120 In August 2009, a notable tweet circulated suggesting that there was a superior alternative to Hpricot. At the time it felt bittersweet. This is the essence of software development: constantly advancing the art and improving speed. Still, I felt a sense of sadness knowing that newer software might replace older work.
00:06:17.040 Let's briefly discuss JRuby, which started gaining popularity around 2009-2010. Raise your hand if you use JRuby in production right now. It seems like a decent number of you. The challenge was that JRuby didn't support C extensions, which meant that anyone using JRuby could not use Nokogiri.
00:06:27.680 At that time, I had a dream of creating a gem that could run on MRI, Rubinius, and JRuby without modification. This marked my 'FFI phase.' FFI, or Foreign Function Interface, is similar to dl: it lets you access native code from pure Ruby, but it is optimized for performance.
00:06:55.520 Anyway, I spent a substantial part of 2009 rewriting C code in Ruby. This process was extremely painful: over 3000 lines of Ruby code were required to replicate 4000 lines of C code. It was essentially a one-to-one translation; if this returned null, then that returned one, etc. Rewriting the C code in Ruby was an arduous task.
00:07:35.680 On the positive side, it actually worked; FFI functioned across various platforms. However, I dubbed this process 'segfault-driven development.' Each attempt would yield crashes that required a debugger and Valgrind to resolve. Implementing portable string handling proved particularly challenging due to the significant differences in how the JVM manages strings.
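To give a flavor of that null-checking style, the sketch below binds C's `getenv` and checks its return value by hand, much as the equivalent C would. It uses Fiddle from the Ruby standard library as a stand-in for the ffi gem the talk describes, so the calls differ from real FFI code. The `NativeEnv` module and the variable name are hypothetical, and `Fiddle.dlopen(nil)` assumes Linux or macOS.

```ruby
require "fiddle"

# Hypothetical wrapper module, for illustration only.
module NativeEnv
  # Assumption: nil opens the running process, which exposes libc.
  LIBC = Fiddle.dlopen(nil)

  # Bind C's `char *getenv(const char *name)`.
  GETENV = Fiddle::Function.new(
    LIBC["getenv"],
    [Fiddle::TYPE_VOIDP],
    Fiddle::TYPE_VOIDP
  )

  def self.fetch(name)
    ptr = GETENV.call(name)  # a Fiddle::Pointer wrapping the char *
    # C signals "not set" with a NULL pointer, so the Ruby side has
    # to test for it explicitly before reading any memory.
    return nil if ptr.null?
    ptr.to_s                 # copy bytes up to the terminating NUL
  end
end

p NativeEnv.fetch("NO_SUCH_VARIABLE_FOR_THIS_SKETCH")
```

Forgetting the `null?` check and reading through the pointer anyway is the kind of mistake that surfaces as a crash rather than a Ruby exception.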
00:08:21.920 As a result, FFI code isn't necessarily cleaner than C; it resembles C code transposed into Ruby. The performance penalties were significant as well, and I faced criticism at conferences from people concerned about Google App Engine's lack of FFI support.
00:09:07.440 Ultimately, I decided to abandon FFI. If you are considering using FFI for your gems to support multiple platforms, pay close attention to how you maintain them as the path can be fraught with difficulties.
00:09:25.600 The story now moves on to installing Nokogiri. Many of you may have experienced installation problems; raise your hand if you have. Great! I have a fantastic announcement: These problems are about to become a thing of the past.
00:09:47.440 We are set to release Nokogiri 1.6.0 today, which will package libxml2 and libxslt inside the gem. It will compile and install those libraries automatically when you install the gem, eliminating installation issues once and for all.