Microtalk: Nokogiri - History and Future

by Mike Dalessio

In "Microtalk: Nokogiri - History and Future", Mike Dalessio discusses the evolution and future of Nokogiri, an XML parsing library that has been downloaded more than 11 million times. Dalessio, one of Nokogiri's authors, covers the library’s origins, its development, and the community's reception.

Key Points Discussed:
- Introduction to Nokogiri: Dalessio highlights Nokogiri's popularity, comparing its more than 11 million downloads with the roughly 24 million for Ruby on Rails, a success he did not expect when the project began.
- Meaning of 'Nokogiri': The name ‘Nokogiri’ translates to ‘saw’ in Japanese, symbolizing its purpose of cutting through complex XML.
- Origins of the Project: The library originated in 2008, when there was a need for a better XML parsing solution. Dalessio recounts his initial email conversation with Tenderlove (Aaron Patterson), who had already started Nokogiri, and how he came to join the project.
- Technical Development: Early development moved to a C extension for performance, and handling libxml2's memory management required extensive debugging with tools like Valgrind and perftools.rb.
- Community Reception: Nokogiri faced competition from Hpricot, a beloved XML parser at the time. The transition from Hpricot to Nokogiri sparked intense debates regarding performance in the development community.
- Challenges with JRuby Compatibility: Dalessio discusses the difficulty of making Nokogiri work on JRuby, which did not support C extensions. He attempted a rewrite using FFI, which worked but was eventually abandoned because of its complexity and performance cost.
- Installation Improvements: One of the significant announcements in the talk is regarding the release of Nokogiri 1.6.0, which aims to streamline the installation process by bundling necessary libraries, addressing common installation issues encountered by users.

In conclusion, Dalessio reflects on the progress of Nokogiri from a small utility to a widely used tool in the Ruby community. He emphasizes the importance of iterative improvement in software development and the continuous evolution of tools to meet user needs. The presentation combines technical insights with anecdotes from the development process, highlighting both challenges and triumphs in the journey of creating Nokogiri.

00:00:16.960 Hey guys, Mike Lovelock here. Check, check. Cool, nice to see you all here. My name is Mike Dalessio, and on the internet, I go by flavorjones. I am one of the authors of Nokogiri, along with Tenderlove, whom you may know from his online presence.
00:00:26.240 These slides are available at this permalink if you want to check it out: bit.ly/nokigirigoroko2013. It's important to remember that this presentation is both valid and well-formed.
00:00:36.640 I work for Pivotal Labs. This is my internet avatar, and if you happen to see me online, feel free to say hi. Now, let's dive right in. As of this morning, Nokogiri has been downloaded over 11 million times, almost 12 million. That's pretty amazing, and I was really surprised by that.
00:00:49.760 For comparison, Rails has about twice that, around 24 million downloads. It's not a competition, but if we put it into perspective, Nokogiri would be like Kelly Clarkson, having shipped around 11 million units, while Rails would be akin to Kiss, who has been around longer. People tend to have a more emotional response to Kiss, while Kelly Clarkson's music is more adult contemporary.
00:01:13.760 But how did this happen? Nokogiri is just this little utility library that I didn't think would be anything special when I started. We're going to go into the history of this a bit more. First, let me address a common question: What does 'Nokogiri' mean? It's a Japanese word for a saw, you know, for cutting trees. Think of it as a pun for 'cutting through the forests of XML.' You'll get it later tonight on the boat.
00:01:39.920 Now, let's talk about the origins of the project. It started in 2008. To set the stage, that year, Rails 2.1 and Ruby 1.8.7 were released. Slumdog Millionaire was a hit, and oil prices were around $800 a barrel. In the news, Eliot Spitzer resigned in disgrace, which some of you may remember. On September 9th that year, I had an interesting email conversation with Tenderlove, who, at the time, did not really know who I was. I had sent a couple of pull requests to Mechanize.
00:02:29.120 I asked him if he was working on an XML wrapper, and he confirmed that he was. He mentioned that it was going to be better than Hpricot for handling broken HTML. I expressed my interest in helping out. He told me that he had already started a project called Nokogiri and that they would be using dl.
00:02:50.720 For those of you who don't know, dl was a pure Ruby library meant to interface with C code. However, it turned out to be really slow, so we decided we had to write a C extension instead. We began writing C code, compiling it into the gem, and calling the libxml2 library directly.
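A minimal sketch of what calling C from pure Ruby looks like, assuming a Unix-like system. It uses Fiddle, the standard-library successor to dl (dl itself was removed in Ruby 2.2), and binds libc's `strlen` rather than anything from libxml2. The `native_strlen` helper is a hypothetical name for illustration; none of this is taken from Nokogiri's source.

```ruby
require 'fiddle'

# Passing nil opens the running process itself, which on Unix-like systems
# exposes the C standard library's symbols. (Assumption: not Windows.)
libc = Fiddle.dlopen(nil)

# Describe the C signature by hand: size_t strlen(const char *s);
STRLEN = Fiddle::Function.new(
  libc['strlen'],
  [Fiddle::TYPE_VOIDP],   # one pointer argument
  Fiddle::TYPE_SIZE_T     # size_t return value
)

# Hypothetical helper: every call crosses from Ruby into native code,
# converting the Ruby String into a C pointer on the way in.
def native_strlen(string)
  STRLEN.call(string)
end

puts native_strlen('nokogiri') # prints 8
```

Each call pays for argument conversion at the Ruby/C boundary, which gives a feel for why a binding that makes many small calls this way can be slower than a compiled C extension.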
00:03:22.080 The toughest part about this was dealing with libxml2's memory management. Its model for handling nodes within a document was quite complex, and I spent about two months fixing segmentation faults and working out how memory should be handled. The key tools I used for debugging were Valgrind, which detects invalid memory accesses, and perftools.rb, a Ruby version of Google's performance tools.
00:04:08.080 As for the API design of Nokogiri, we aimed to take the best XML API we could find, which was Hpricot. Those who remember Hpricot will recall that it was a lovely gem that provided a great API for handling XML, although it had its share of bugs, including off-by-one errors. It wasn't as fast or portable as I wanted it to be, which led us to create Nokogiri.
00:04:41.200 The first official release of Nokogiri was in November 2008, coinciding with the weekend of Daylight Saving Time. I was quite busy at work, so Aaron released it anyway. Interestingly, in January 2009, we had reached a point where we were evenly split between Ruby and C in our gem. We started out building a pure Ruby gem, but by then, it was about half Ruby and half C.
00:05:03.040 You may wonder how the community accepted Nokogiri initially, especially since Hpricot was so well-loved. There were performance metrics circulating that indicated Hpricot was significantly slower than regular expressions, leading to heated debates among developers about the speed of XML parsing libraries.
00:05:21.120 In August 2009, a notable tweet circulated suggesting that there was a superior alternative to Hpricot. At the time, that felt bittersweet. It captured the essence of software development, which is constantly advancing the art and improving speed, but I also felt a sense of sadness knowing that newer software replaces older work.
00:06:17.040 Let's briefly discuss JRuby, which started gaining popularity around 2009-2010. Raise your hand if you use JRuby in production right now. It seems like a decent number of you, but the challenge was that JRuby didn't support C extensions. This limitation meant that anyone using JRuby could not use Nokogiri.
00:06:27.680 At that time, I had a dream of creating a gem that could run on MRI, Rubinius, and JRuby without modification. This marked my 'FFI phase.' FFI, or Foreign Function Interface, is similar to dl: it lets you access native code from pure Ruby, but it is optimized for performance.
00:06:55.520 Anyway, I spent a substantial part of 2009 rewriting C code in Ruby. This process was extremely painful: over 3,000 lines of Ruby code were required to replicate 4,000 lines of C code. It was essentially a one-to-one translation; if this returned null, then that returned one, etc. Rewriting the C code in Ruby was an arduous task.
00:07:35.680 On the positive side, it actually worked; FFI functioned across various platforms. However, I dubbed this process 'segfault-driven development.' Each attempt would crash, and I needed a debugger and Valgrind to track down the cause. Implementing portable string handling proved particularly challenging due to the significant differences in how the JVM manages strings.
00:08:21.920 As a result, FFI code isn't necessarily cleaner than C; it resembles C code transposed into Ruby. The performance penalties were significant as well, and I faced criticism at conferences from people concerned about Google App Engine's lack of FFI support.
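To illustrate why such code reads like C transposed into Ruby, here is a sketch that uses the standard library's Fiddle as a stand-in for the ffi gem (an assumption made so the example needs no extra gems) and assumes a Unix-like system. It binds libc's `getenv`, which returns NULL for an unset variable, so the caller has to check the pointer by hand. The `native_getenv` helper and the environment variable names are hypothetical, and none of this is taken from Nokogiri's FFI code.

```ruby
require 'fiddle'

# Open the running process to reach libc's symbols (assumption: not Windows).
libc = Fiddle.dlopen(nil)

# Describe the C signature by hand: char *getenv(const char *name);
GETENV = Fiddle::Function.new(
  libc['getenv'],
  [Fiddle::TYPE_VOIDP],
  Fiddle::TYPE_VOIDP
)

# Hypothetical helper showing the C-style control flow in Ruby.
def native_getenv(name)
  pointer = GETENV.call(name)
  # The manual NULL check that C requires; skipping it and reading the
  # pointer anyway is the kind of mistake that ends in a segfault.
  return nil if pointer.null?

  # Copy the NUL-terminated C string into a Ruby String.
  pointer.to_s
end

ENV['NOKOGIRI_DEMO'] = 'saw'
puts native_getenv('NOKOGIRI_DEMO')
p native_getenv('NOKOGIRI_DEMO_UNSET')
```

The Ruby here carries the same pointer checks and manual string copying that the C version would, without the compiler's help in getting them right.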
00:09:07.440 Ultimately, I decided to abandon FFI. If you are considering using FFI for your gems to support multiple platforms, pay close attention to how you maintain them as the path can be fraught with difficulties.
00:09:25.600 The story now moves on to installing Nokogiri. Many of you may have experienced installation problems; raise your hand if you have. Great! I have a fantastic announcement: These problems are about to become a thing of the past.
00:09:47.440 We are set to release Nokogiri 1.6.0 today, which will package libxml2 and libxslt inside the gem. It will compile and install those libraries automatically when you install the gem, eliminating installation issues once and for all.