Nokogiri: History, Present, and Future

by Mike Dalessio

The video titled "Nokogiri: History, Present, and Future" features Mike Dalessio discussing the evolution of the Nokogiri XML parsing library, a crucial tool in the Ruby community that has seen significant adoption with nearly 12 million downloads. Mike begins by sharing his personal journey with open-source contributions and his connection to Nokogiri's development. He addresses frequently asked questions about Nokogiri, including its purpose and origin. Below are the key points discussed in the video:

  • Origins of Nokogiri: Developed in response to the limitations of existing XML parsers like REXML and HPricot, Nokogiri aims to efficiently parse both HTML and XML.
  • Popularity in Context: Mike humorously compares Nokogiri's download numbers with those of popular gems like Rails and Formtastic, emphasizing its unique niche among parsing libraries.
  • Nokogiri's name: He shares that 'Nokogiri' translates to a type of saw in Japanese, metaphorically representing the library's function of cutting through complex XML structures.
  • Technical Challenges: Mike discusses the challenges faced during development, including the use of C extensions and dynamic language bindings (DL) and later, foreign function interfaces (FFI) for cross-platform functionality.
  • Installation Issues: He highlights common installation hurdles users face, such as those encountered by Windows and Mac users, and introduces tools like MiniPortile to streamline installation processes.
  • Community Engagement: The talk underscores the value of contributing to open source, not only for personal growth but also for helping to improve the broader development community.
  • Future Directions: Mike concludes by expressing aspirations for the future of Nokogiri, particularly regarding version 2.0 and enhancing API structures, and encourages developers to engage actively in open source projects.

In summary, the presentation provides an insightful overview of Nokogiri’s developments, its importance in the Ruby ecosystem, and the valuable lessons learned from both technical and community engagement perspectives.

00:00:20.400 These days, we all use Nokogiri when parsing our HTML or XML.
00:00:26.320 Mike Dalessio is one of the authors of Nokogiri and helps run the Pivotal Labs office in New York.
00:00:32.880 He is one of the authors of Nokogiri along with Aaron Patterson and is quite an interesting figure.
00:00:38.160 Here he is to talk to you about Nokogiri. Thank you, Mike.
00:00:50.879 I wanted to start by saying thank you to Josh and the other organizers. Josh first invited me to speak about Nokogiri to a room full of strangers about five years ago at "I Can Has Ruby," and he didn’t learn from that mistake and invited me back again today, so he has my gratitude for that.
00:00:57.680 Nokogiri is just a library, but it's an interesting story. I think of Nokogiri as a second child; I have a daughter, and my personal story is wrapped up in the evolution of this gem and the Ruby community. I thought it would make for an interesting talk.
00:01:10.240 The summary of my personal story is that many years ago, I sent my first open source contribution in the pre-GitHub days, when everything was on RubyForge.
00:01:16.880 As a result, I met some interesting people like Pat Nakajima and Aaron Patterson. Aaron became a really close friend through my collaboration with him on a couple of projects.
00:01:28.640 Pat introduced me to Pivotal Labs where I learned how to write code better.
00:01:34.159 I got a job, got promoted, and eventually tricked Josh into inviting me to GoGaRuCo.
00:01:41.119 I first wanted to cover probably the five most frequently asked questions about Nokogiri in no particular order.
00:01:58.960 Number one: What is Nokogirl? Does anybody know Nokogirl? Magnus Holm wrote a gem called Nokogirl that simply installs Nokogiri as a dependency and chides you on your spelling ability.
00:02:18.720 But I think the question everyone wants to ask is: What is Nokogiri, with an 'i' and not 'l'? It’s a mark against Courier font because you can't tell the difference. Nokogiri is basically a Ruby API to parse HTML and XML.
00:02:47.680 It corrects broken markup if you need it to, and it has a SAX parser, pull parser, XSLT, and all the enterprise features you could ever dream of. I know what you might be asking yourselves: Isn't that super boring?
00:03:06.239 I don't think it's boring, but I will preface my story with a confession: I hate XML. I really, really hate it. You might wonder why I would spend my time building an XML parsing library if I hate it so much.
00:03:26.159 But it's because I actually love making painful things not painful. That's my own itch I like to scratch. When I see something that's super painful, I just want to fix it.
00:03:40.080 There are other rewarding parts of working on Nokogiri too. Not to brag, but Nokogiri has been downloaded almost 15 million times, which is amazing. I can't believe that this many people are using software that I've worked on.
00:04:04.959 For comparison, Rails has been downloaded about 28 million times, while Formtastic is around 1.7 million. This gives us a range of popular gems.
00:04:11.599 I like to think about this in terms of the recording industry numbers. Rails, with 28 million downloads, is like Kiss, which has an equal number of albums sold. It has been around forever and is now what your parents listen to, while it produced classic glam rock.
00:04:30.080 On the other end, Formtastic is like that catchy tune that everyone has heard; it's widely used and appreciated. And right in the middle is Nokogiri, which I'd compare to Kelly Clarkson: appealing and pleasantly in the background.
00:04:47.600 So, all I'm trying to say is that yes, Nokogiri does a boring thing, but it does it well. People seem to like it, which makes me really happy. This is why I'm involved in open source.
00:05:13.120 Question number four: What does Nokogiri mean? Nokogiri is a Japanese word for a specific type of saw. This is a bit of wordplay; the idea is that you've got a dense forest of XML trees that need to be cut through with a saw.
00:05:24.560 Tenderlove is known for his wordplay, and this is just another instance of that.
00:05:52.639 The fifth frequently asked question I wanted to address is probably one you've asked yourself: Why can’t I install Nokogiri on Mac OS X 10.8.4? That’ll be covered a bit later, so we’ll circle back to that.
00:06:09.600 A good question to ask is why history is important. Typically, history is considered a really boring topic, but for me, history is like the testing ground of ideas. It's an opportunity to learn what worked, what didn’t work, and where we totally failed.
00:06:33.680 My story starts in 2006, with Ruby 1.8.5 and Rails 1.1.6 out there, Subversion being the cool new thing, and NASA launching a space probe to Pluto.
00:06:50.080 At the time, searching for Aaron Patterson would yield hits primarily for a serial murderer named Aaron Patterson, not for Tenderlove, the software developer.
00:07:11.680 The state of Ruby parsers in 2006 involved REXML, which is in the standard library; it’s a pure Ruby parser, super slow and doesn't fix broken markup. As long as you didn't mind its slow performance and had perfectly formed markup, it was great.
00:07:36.560 HPricot was the de facto standard at the time; everybody loved it because it had a great API. LibXML Ruby was a close runner-up; it was a Ruby wrapper around LibXML, which is also the library Nokogiri ended up wrapping. However, it crashed at times and matched the C API exactly, so you could still run into problems, just like with C.
00:08:05.919 At that time, I was working for a company called U.S. Powergen. This was so long ago that they didn’t even have transparent GIFs, so I had to make them by hand.
00:08:31.520 I was scraping HTML with Mechanize, but more importantly, I was scraping super secret HTML. So I needed support for client-side certs when I downloaded pages and scraped data. Mechanize didn't support client-side certs at the time, so I just emailed a patch to Tenderlove.
00:09:01.040 I wrote to him saying, 'Hey dude, I like your library, here’s a patch.' That was my first pull request; back then, patches were emailed around to get things done. This was my first contribution to open source, and Tenderlove was amazing throughout the process—open, kind, supportive, and responsive.
00:09:37.599 He merged the patch, and it was really encouraging. So I sent more patches, and eventually he got tired of merging them and gave me commit privileges for Mechanize, which was incredibly exciting.
00:10:07.920 As 2008 rolled around, I was still working at my own startup called Pharos, which is a terrible name and impossible to Google. I was once again scraping HTML with Mechanize, but this time I was scraping lots of broken HTML.
00:10:31.640 I encountered problems with HPricot; it wasn't correcting markup as I expected. There were many off-by-one errors when writing CSS queries. HPricot started with a zero index, where a CSS query should start with an index of one, and it would crash if the HTML was exactly 16 KB long.
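The off-by-one problem described here comes from CSS positional selectors such as `:nth-child(n)` counting from 1, while Ruby arrays count from 0. The following is a minimal plain-Ruby sketch of the translation a selector engine has to make. The `nth_child` helper and the sample data are hypothetical and are not taken from HPricot or Nokogiri.

```ruby
# Hypothetical helper: return the child that CSS `:nth-child(n)` would match.
# CSS positions are 1-based; Ruby array indexes are 0-based.
def nth_child(children, n)
  return nil if n < 1 # CSS has no zeroth child
  children[n - 1]
end

items = %w[first second third]

one_based  = nth_child(items, 1) # => "first"  (what CSS asks for)
zero_based = items[1]            # => "second" (what a zero-indexed engine returns)

puts "nth-child(1) should be #{one_based}, a zero-based lookup gives #{zero_based}"
```

A query written for a zero-indexed engine silently selects the neighbouring element when it is run against a conforming one, which is why such bugs tend to show up as wrong data rather than as errors.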
00:10:59.680 I was unhappy, and then I started following Tenderlove on GitHub. I noticed he had a library that was XML-related and asked if LibXML would fix the markup as HPricot did.
00:11:26.720 He was indulgent in his response, saying yes, it actually fixes broken HTML better than HPricot. He had test cases to show it. He even mentioned he had submitted these test cases back but opted not to patch HPricot due to its complexities.
00:11:46.560 Let’s pause for a moment to consider that you've got Why the Lucky Stiff and Aaron Patterson, both trying to write code and not understanding each other's code. This provides hope that normal developers, like me, aren't crazy for not being able to interpret every line of code in a gem.
00:12:13.760 So, I used the five most significant words to say to an open source maintainer: 'How can I help out?' Of course, he was excited and said he started a project without the C and instead used DL, and we began coding from there.
00:12:50.400 It's worth discussing what DL is for a moment. Dynamic language binding is the ability to call C libraries without writing C code. You write Ruby, and it calls directly to the C method.
00:13:14.480 I have an example here where we define what a method looks like in Ruby, specifically xmlReadMemory, which is the key parsing method inside LibXML.
00:13:37.520 However, doing this was pretty difficult. You had to load the Linux SO, and if that didn't work, you'd have to load the dylib for Mac. Notably, you were required to pass in a description of the function, and DL wasn't very refined in its execution.
00:14:03.839 You had to conduct a lot of your own checks. For instance, if the URL was nil, you had to pass in a zero so that C understood it was a null pointer. It worked okay, but it was really slow because DL would re-parse that C description every time you made a method call.
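To give a feel for calling C from Ruby without writing C, here is a minimal sketch using Fiddle, the standard-library successor to DL. This is not the code shown in the talk: it swaps in Fiddle for DL and calls libc's `strlen` rather than LibXML's `xmlReadMemory`, so that it runs without LibXML installed. It assumes Linux or macOS, where the C library's symbols are visible in the running process.

```ruby
require 'fiddle'

# Open the symbols already loaded into the current process (this includes libc).
libc = Fiddle.dlopen(nil)

# Describe the C function by hand: size_t strlen(const char *s);
# Nothing checks this description against the real signature, so a wrong
# description fails at run time rather than at compile time.
strlen = Fiddle::Function.new(
  libc['strlen'],
  [Fiddle::TYPE_VOIDP], # one pointer argument
  Fiddle::TYPE_SIZE_T   # return type
)

length = strlen.call("nokogiri")
puts length # prints 8
```

The function description is built once and reused for every call, which avoids the per-call re-parsing cost mentioned above.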
00:14:36.640 I would have fixed it, but it was experimental code that lacked tests. So, we eventually started writing a C extension instead.
00:15:03.360 The C code looked a lot like DL, though we did have to maintain null checks. We started coding against LibXML with regards to how it managed its memory.
00:15:25.360 So, a document in our C structure can have many nodes which can have attributes or namespaces. There is a necessary relationship between these C structures and Ruby objects to ensure garbage collection occurs properly.
00:15:51.760 LibXML frees things from the top down, meaning we need to ensure that if I have a reference to a node, the document won’t get freed before I’m done using it.
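The ownership rule can be sketched in plain Ruby: if every node wrapper holds a reference to its document wrapper, the garbage collector cannot collect the document while any of its nodes is still reachable. The `Document` and `Node` classes below are hypothetical stand-ins, not Nokogiri's actual classes, and a real C extension would express the same relationship through its GC mark functions.

```ruby
# Hypothetical wrappers, for illustration only.
class Document
  attr_reader :nodes

  def initialize
    @nodes = []
  end

  def create_node(name)
    node = Node.new(name, self)
    @nodes << node
    node
  end
end

class Node
  attr_reader :name, :document

  def initialize(name, document)
    @name = name
    # Strong reference: while this node is reachable, Ruby's GC also
    # treats the owning document as reachable and will not collect it.
    @document = document
  end
end

# Only the node is kept in a variable; the document is reachable through it.
node = Document.new.create_node("title")
GC.start
puts node.document.nodes.include?(node) # prints true
```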
00:16:06.640 As work progressed, we faced additional crashes. It turns out the document has a dictionary for caching strings since many attributes tend to repeat. Moving nodes to a new document and freeing the old one would occasionally cause segmentation faults.
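A string dictionary of this kind can be sketched in a few lines of Ruby: each distinct string is stored once and every user shares that copy. The `StringDictionary` class is hypothetical and only illustrates the idea. In C, a node that still points at a string owned by the old document's dictionary would be left with a dangling pointer once that document is freed, which is presumably the kind of crash described above.

```ruby
# Hypothetical string dictionary: one shared copy per distinct string.
class StringDictionary
  def initialize
    @strings = {}
  end

  # Return the stored copy of the string, adding it on first use.
  def intern(string)
    @strings[string] ||= string.dup.freeze
  end

  def size
    @strings.size
  end
end

dict = StringDictionary.new
a = dict.intern("class")
b = dict.intern("class")

puts a.equal?(b) # prints true: both names refer to the same stored object
puts dict.size   # prints 1
```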
00:16:24.160 I won’t go into further details about this, but we did have to work through many edge cases to ensure everything remained stable.
00:16:45.840 LibXML also merges two string nodes when they are exactly adjacent, making it tricky because it means a Ruby object for each could point to the same string node.
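The merging behaviour can be illustrated with a small plain-Ruby model. The `Element` and `TextNode` types are hypothetical and do not reflect LibXML's or Nokogiri's real data structures. The point is only that adding text next to existing text may hand back the node that was already there, so two variables end up referring to one underlying node.

```ruby
# Hypothetical model of merging adjacent text nodes.
TextNode = Struct.new(:content)

class Element
  attr_reader :children

  def initialize
    @children = []
  end

  # Add a text node. If the last child is already text, the new content is
  # folded into it and the existing node is returned instead of a new one.
  def add_text(content)
    last = @children.last
    if last.is_a?(TextNode)
      last.content += content
      last
    else
      node = TextNode.new(content)
      @children << node
      node
    end
  end
end

el = Element.new
first  = el.add_text("Hello, ")
second = el.add_text("world")

puts first.equal?(second) # prints true: one underlying node
puts first.content        # prints Hello, world
puts el.children.size     # prints 1
```

Any wrapper layer has to account for this, because the object a caller thinks it just created may in fact be an older node with different content.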
00:17:01.760 We had to develop clever methods to ensure garbage collection was happening properly and we weren’t leaking memory.
00:17:30.720 I had to write detailed comments to clarify all the complicated logic we had to apply, which often seemed overwhelming. Additionally, different versions of the library come with unique bugs or behavior that required version checks all over the place.
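A version check of the sort mentioned here could look roughly like the sketch below. The `workaround_needed?` helper, the version strings, and the `fixed_in` threshold are all made up for illustration; they are not real LibXML bug boundaries. The sketch uses `Gem::Version` from RubyGems because comparing version strings directly gives wrong answers once a component reaches two digits.

```ruby
require 'rubygems'

# Hypothetical helper: decide whether a workaround applies to the loaded
# library version. The default threshold is invented for this example.
def workaround_needed?(library_version, fixed_in: "2.7.0")
  Gem::Version.new(library_version) < Gem::Version.new(fixed_in)
end

puts workaround_needed?("2.6.16") # prints true
puts workaround_needed?("2.10.0") # prints false

# A plain string comparison gets the second case wrong,
# because "1" sorts before "7" character by character.
puts "2.10.0" < "2.7.0"           # prints true
```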
00:18:01.760 In summary, if you want to write a C extension, know Valgrind! It's a tool that’ll catch invalid memory accesses and reveal leaking references.
00:18:23.519 Also, perftools.rb is a helpful extension developed by Aman Gupta on top of Google’s perftools, allowing you to analyze combined Ruby and C stack traces.
00:18:48.640 Now, let’s transition to API design. During this period, we struggled to define what a good API would look like, so we borrowed the best ones we could find—specifically, HPricot.
00:19:21.520 The first version of Nokogiri was, effectively, a ripoff of HPricot's API. For a while, we even had a shim that was backwards compatible with HPricot, so if you were using HPricot, you could drop in Nokogiri seamlessly.
00:19:40.480 We paused the shim after a few months because it became challenging to maintain due to the complexity.
00:19:57.760 The first official release of Nokogiri happened on November 17th, during Daylight Saving Time weekend. Here’s a pro tip: don’t release software during Daylight Saving Time if you have a job in energy trading.
00:20:24.480 The community's response was mixed; some people loved it, while others had adverse reactions to it. Some early adopters like Bryan Helmkamp used it to build Loofah, an HTML sanitizer on top of Nokogiri.
00:20:48.960 Bryan, who was working at Weplay at the time, needed to sanitize user-generated data. Pat Nakajima utilized it to create slide presentation software, which was also quite innovative.
00:21:08.000 The Merb team started using it in their test suite, and Yehuda Katz, known for being particular about performance, pushed us hard to address any performance bottlenecks as the test suite grew increasingly slow.
00:21:50.080 On the other hand, many users loved HPricot and viewed Nokogiri as an attack on something they cherished. Additionally, benchmarks created a lot of debate.
00:22:12.080 Even today, people care about benchmarks. It's interesting to see how little the debate has changed in the community. I’ve published metrics that in retrospect seemed pointless.
00:22:30.080 At the end of the day, we were writing C code, which was going to be inherently faster than other alternatives, so debating performance often seemed as pointless as arguing over movie star attractiveness.
00:22:50.080 In 2008, I was introduced to Pivotal Labs, where I still work, by Pat Nakajima, who had a great admiration for Nokogiri. My hiring was largely based on my contributions to Nokogiri and Mechanize.
00:23:12.000 I feel that my open source work was instrumental in helping me secure that job—I encourage everyone to engage with open source as it not only aids in job acquisition but also fosters various opportunities.
00:23:45.200 In August 2009, a tweet surfaced, claiming, 'Programming is rather thankless; you see your work replaced by superior alternatives.' This was tweeted by Why, shortly before he vanished.
00:24:03.200 Though I didn't connect it at the time, many believe that tweet referred to Nokogiri. This sentiment is sad to me because software is all about constant improvement.
00:24:21.920 Most of us are in this business to make things better, to scratch our own itches. While I can sympathize with Why, I don't agree with that viewpoint.
00:25:03.440 In late 2009 to 2010, we saw JRuby rising in street cred. I decided to take a poll: how many of you use MRI as your preferred deployment platform? That was around 80% of the room.
00:25:20.080 Then I asked how many used JRuby, with approximately 20% of you raising your hands. This was interesting, as with each talk I've given, the average tends to resemble a 10 to 1 ratio of MRI to JRuby.
00:25:39.200 In 2009, JRuby didn’t fully support the C extension API, which meant Nokogiri didn't work on JRuby. Though they've made improvements, I haven’t kept close tabs lately on how well it's working now.
00:26:12.080 The challenge remained: It was my dream to have one code base running everywhere—including Rubinius and JRuby. During this time, I entered my FFI phase, a term I borrowed from Monet's sunflower phase.
00:26:35.200 FFI, or foreign function interface, allows Ruby to call C native code directly. This is different from DL; it actually worked and was significantly faster.
00:26:58.320 I spent five months rewriting the C in Ruby using FFI. It was a painful journey as it took about a three to four ratio of Ruby FFI code to match the C code—and it looked remarkably similar.
00:27:29.920 It often felt like I was writing C in Ruby, making it clear that adapting to make Ruby align with C's efficiency wasn’t as robust as one would hope.
00:27:50.080 Additionally, segfaults were common due to FFI lacking compile-time checks. If I misdeclared a function, it was a nightmare tracking the error.
00:28:19.280 Handling strings was a challenge; FFI on MRI worked, but when I ran it on JRuby, it would crash because the JVM would free strings unceremoniously.
00:28:45.760 Moreover, the FFI code wasn’t any clearer than the original C code. I didn't want to write Ruby that resembled this complexity and still dealt with performance penalties.
00:29:06.640 At RubyConf 2009, I endured harassment at the bar about using Nokogiri on the Google App Engine, which doesn’t support shared object libraries.
00:29:30.800 After this experience, I understood that if you care about performance, you need to write native extensions. If you care about multi-platform, you need to support more than one code base.
00:30:00.480 I learned that it was better to focus on native extensions instead of attempting to make FFI work smoothly within multiple environments.
00:30:24.360 Eventually, I decided to kill the FFI port, having realized that I wasn't willing to produce a Java version of Nokogiri. Then a Spanish college student, Sergio Arbeo, took part in Google Summer of Code and pursued a pure Java port, which succeeded.
00:30:51.920 Following that success, we created a Ruby bounty of $625 to finish the Java port, which became the largest one at the time. Here is a shout-out to those who contributed funds, as you played an instrumental role in getting it finished.
00:31:16.560 Nokogiri now runs as a Java project, and we noticed that performance was significantly enhanced. During discussions, it became clear that native implementations were the way forward.
00:31:49.200 Returning to an FAQ from the beginning: 'I can't install Nokogiri on my system.' This remains a general problem that many developers face, primarily due to dependencies.
00:32:16.760 To address installation issues, I’ll briefly summarize. Everybody knows that Windows users lack a build toolchain. Maybe some have Visual Studio, but if you're a Ruby guy, it’s likely you're missing that.
00:32:40.160 To overcome the absence of LibXML2, we cross-compile these libraries into DLLs and package them into a fat binary gem, which is much larger than the standard versions.
00:33:01.680 In addition, JRuby faces the same issues, leading us to create jar files that get packed in likewise. Both scenarios involve building into the gem, so it can be used across different systems.
00:33:23.200 Mac users typically have LibXML installed in various locations, leading to compatibility issues, resulting in nearly a thousand questions about installing Nokogiri.
00:33:43.760 Lastly, LibXML2 introduced a breaking bug that recently caused many Mac installations to malfunction. To fix this, I used MiniPortile to build LibXML and libxslt during installation.
00:34:06.240 MiniPortile embeds the source code for those libraries in the gem, ensuring users compile it during installation, much like a hidden surprise. It’s a very efficient solution.
00:34:40.640 Thank you for your attention! I wanted to mention that I have been considering future goals to address APIs in version 2.0 and fix some architectural issues.
00:35:03.760 If you're a Windows MRI user, please consider getting involved—we need support and help. Also, if you're not contributing to open source yet, I encourage you to start small and engage.
00:35:25.520 Nokogiri does one boring thing, but it does it well. If something annoys you about a gem, go fix it! It’ll improve the world and you might even get to hang out with Tenderlove.
00:35:41.360 Thank you!