High Speed Parsing Massive XML Data in Ruby (#rubyconftw 2023)

A massive XML file called BioProject has been published. However, when parsing the XML as-is, REXML froze partway through, and Nokogiri ended up consuming a large amount of memory and running slowly. A sample implementation in Python uses `iterparse()` to parse each element at the first level individually, so we built a similar mechanism in Ruby and also used Ractor to speed it up.
Summary
In this presentation at RubyConf Taiwan 2023, Tetsuya Hirota discusses high-speed parsing of massive XML data using Ruby, specifically focusing on the challenges associated with parsing the BioProject XML file. The presentation highlights the inefficiencies faced when using traditional parsing methods with Ruby libraries such as REXML and Nokogiri, which often resulted in long processing times and excessive memory consumption.

Hirota's journey began with a need to convert the extensive BioProject XML data, which contains over 700,000 records, into JSON format. This process initially took over 35 minutes and used around 70 gigabytes of memory due to the direct loading of all elements into memory. To address these issues, Hirota drew inspiration from a Python implementation that utilized the `iterparse()` method for incremental parsing. He developed a similar mechanism in Ruby, which significantly improved efficiency.

Key strategies discussed are:

- **Incremental Parsing**: By parsing each top-level element individually rather than loading the entire document, the method allows for lower memory usage and faster processing speeds.
- **Ractor for Concurrency**: The use of Ruby's Ractor feature enabled concurrent processing, which helped in overcoming performance bottlenecks commonly experienced in traditional Ruby parsers.
- **Modular Approach**: Hirota's approach involved breaking down the XML parsing into manageable tasks. First, he extracted primary XML data before processing secondary elements, allowing for streamlined output.

The effective use of these methods resulted in a remarkable decrease in processing time to approximately five and a half minutes using an optimized setup of five CPUs and 8 GB of RAM. The presentation encapsulates a powerful takeaway for developers handling large XML data: leveraging Ruby's unique capabilities and concurrency features can lead to significant advancements in processing speed and memory efficiency.
Hirota concludes by inviting questions and expressing his openness to discuss the technologies and methodologies employed, emphasizing the continuous quest for improving bioinformatics database processing with Ruby.
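The Ractor-based concurrency mentioned in the summary could be sketched roughly as follows; the pipe/worker topology, the worker count, and the per-record conversion are assumptions for illustration, not details taken from the presentation.

```ruby
require "json"

# Hypothetical sketch: a "pipe" Ractor hands raw record strings to worker
# Ractors, which each convert one record to JSON independently.
pipe = Ractor.new do
  loop { Ractor.yield(Ractor.receive) }
end

workers = 4.times.map do
  Ractor.new(pipe) do |source|
    loop do
      record = source.take
      # stand-in for real per-record XML-to-JSON conversion
      Ractor.yield(JSON.generate(id: record))
    end
  end
end

records = %w[PRJ1 PRJ2 PRJ3 PRJ4]
records.each { |r| pipe.send(r) }

# collect one JSON string per record (completion order is nondeterministic)
results = records.size.times.map { Ractor.select(*workers).last }
```

Unlike ordinary threads, which in CRuby share a single global lock and so cannot parallelize CPU-bound work, each Ractor runs with its own lock, which is what makes this kind of fan-out effective for conversion-heavy workloads.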