High Speed Parsing Massive XML Data in Ruby

Ruby

Tetsuya Hirota

High Speed Parsing Massive XML Data in Ruby

by Tetsuya Hirota

In this presentation at RubyConf Taiwan 2023, Tetsuya Hirota discusses high-speed parsing of massive XML data using Ruby, specifically focusing on the challenges associated with parsing the BioProject XML file. The presentation highlights the inefficiencies faced when using traditional parsing methods with Ruby libraries such as rexml and nokogiri, which often resulted in long processing times and excessive memory consumption. Hirota's journey began with a need to convert the extensive BioProject XML data, which contains over 700,000 records, into JSON format. This process initially took over 35 minutes and used around 70 gigabytes of memory due to the direct loading of all elements into memory.

To address these issues, Hirota drew inspiration from a Python implementation that utilized the iterparse() method for incremental parsing. He developed a similar mechanism in Ruby, which significantly improved efficiency. Key strategies discussed are:

Incremental Parsing: By parsing each top-level element individually rather than loading the entire document, the method allows for lower memory usage and faster processing speeds.
Ractor for Concurrency: The use of Ruby's Ractor feature enabled concurrent processing, which helped in overcoming performance bottlenecks commonly experienced in traditional Ruby parsers.
Modular Approach: Hirota's approach involved breaking down the XML parsing into manageable tasks. First, he extracted primary XML data before processing secondary elements, allowing for streamlined output.

The effective use of these methods resulted in a remarkable decrease in processing time to approximately five and a half minutes using an optimized setup of five CPUs and 8 GB of RAM. The presentation encapsulates a powerful takeaway for developers handling large XML data – leveraging Ruby's unique capabilities and concurrency features can lead to significant advancements in processing speed and memory efficiency.

Hirota concludes by inviting questions and expressing his openness to discuss the technologies and methodologies employed, emphasizing the continuous quest for improving bioinformatics database processing with Ruby.

00:02:52.640 Hi everyone, welcome back after lunch! I hope you enjoyed your meal. Right now, we are going to start discussing our topic today: high-speed parsing of massive XML data in Ruby. I am Tetsuya Hirota from Japan.

00:03:24.560 Today, I'll tell you more about high-speed processing of XML in Ruby. This presentation is an updated version of one I previously delivered at Ruby World Conference 2023 in Japan. First, I was requested to convert data in the BioProject XML file into JSON using Ruby.

00:04:10.760 While I was provided with a sample call to do this, the process took me more than four times longer than the sample code suggested. Therefore, I created a method similar to the Python `iterparse()`, which parsed each top-level element individually. This method, combined with Ruby on Ractor, improved performance to be about 30% faster than the original Python sample.

00:05:06.600 Today's content will include an introduction, a brief self-introduction, an overview of my company, and an explanation of my project. I will cover what I attempted to achieve and how I created a mechanism similar to Python's `iterparse()` using Ruby.

00:05:36.560 My name is Tetsuya Hirota, and I am the CTO at my company. I'm based in Japan, where I spend a lot of time developing applications for bioinformatics. I have been dedicated to our core biotic database development for quite some time.

00:06:10.760 In my work, I also teach a programming course focusing on lifelong learning related to bioinformatics. I offer valuable insights into life sciences, including data visualization to help understand what types of operations are demanded by the life science community.

00:07:02.640 Today, my focus will be on ongoing development processes for bioinformatics databases. We face challenges related to high-speed parsing of massive XML files.

00:07:25.200 The BioProject is a public database that encompasses various bioscience projects. This includes information related to genomes and RNA sequences registered in public databases, along with associated metadata such as papers, experimental conditions, and samples.

00:07:38.000 My project involves curating a metacollection about sequence analysis with robust metadata. The format of this project is primarily simple XML files that exceeded 2 gigabytes in size, containing over 700,000 records. The database is continuously growing.

00:08:13.800 Researchers often need to perform complex queries on this data. The current limitations mean heavy memory use, consuming around 70 gigabytes when parsing the files.

00:08:49.720 Initially, I attempted to parse the BioProject XML using standard methods, which resulted in non-responsive outputs that took over 35 minutes and consumed excessive memory.

00:09:05.640 This process is slow because it necessitates loading all objects into memory, and each request searches through the entire dataset. My goal has been to find a more efficient method.

00:09:49.720 I explored Python sample codes, which process data in about 8 minutes through incremental loading of individual elements.

00:10:10.240 The elements in our database are structured hierarchically with each project element averaging between 100 kilobytes to a few megabytes in size. The XML parser traditionally extracts all elements into memory, creating significant overhead.

00:11:04.360 To improve this process, I began leveraging string operations to separate data intelligently and utilize memory more efficiently. Initially, I focused on Ruby's existing capabilities but ran into performance issues, illustrating the importance of implementing solutions that avoid blocking processes.

00:12:01.720 I decided to split these tasks into more manageable parts. In the first part, I extracted the primary XML data, followed by the parsing of secondary elements, ensuring that the output was streamlined.

00:12:45.920 For outputting JSON, I gathered each project element and processed it before writing it to the resulting file.

00:13:08.680 This modular approach allows for much quicker results, reducing processing time significantly.

00:13:43.280 Utilizing Ruby on Ractor has enabled concurrent processing, which I found beneficial in optimizing the performance compared to Ruby's conventional parsing methods.

00:14:41.560 In the end, I was able to achieve a processing time of about five minutes and thirty-four seconds, which was a significant improvement from the initial methods I tried.

00:15:29.560 Moreover, during these tests, I utilized a five CPU setup with 8 gigabytes of RAM, showcasing how resource allocation impacts performance.

00:16:09.440 In conclusion, my approach to bioinformatics database processing in Ruby can be optimized greatly. It is essential that we take advantage of Ruby's unique features to sidestep potential bottlenecks.

00:17:38.840 I appreciate everyone's attention and I’m ready for any questions you might have.

00:31:10.720 Thank you for your interest and participation today! If you have any further queries about the technology or my methodologies, I welcome you to ask.

RubyConf Taiwan 2023