Ruby Video | Getting Along with YAML Comments with Psych

Getting Along with YAML Comments with Psych

#data-serialization

Masaki Hara • May 15, 2024 • Naha, Okinawa, Japan

In this presentation titled "Getting Along with YAML Comments with Psych," Masaki Hara discusses his development of a library focused on retaining comments in YAML parsing. Hara introduces the complexities of YAML and its capabilities, highlighting its similarity to JSON, its roots in Perl serialization, and its powerful handling of complex data types. The presentation covers the three key steps of YAML processing: conversion of text to a tree, tree to graph, and finally to Ruby objects, detailing how comments traditionally disappear during parsing with existing libraries such as Psych.

Hara explains how his library, which operates at a mid-level API, retains comments by employing a two-pass parsing strategy. In the first pass, the library uses the original parser, while the second pass focuses on isolating comments through an algorithm that scans for them while managing potential edge cases, such as how extra hashes or trailing comments are processed.

The talk also presents challenges involved in extending the Psych library due to its reliance on a C library named LibYAML, which limits modifications. Hara elaborates on the behavior of the library's generator, distinguishing it from parsers and explaining that formatting functions are crucial for preserving spacing and indentation.

Finally, Hara summarizes the development of the psych-comments gem, explaining its purpose and future considerations for expanding the library while also acknowledging the limitations in expressing YAML comments. He concludes with insights on the balance between practical implementation and theoretical application of comments in YAML, underscoring the library's focus on maintaining the integrity of comments during parsing.

RubyKaigi 2024

00:00:08.760 Hello, I am Masaki Hara. I am a person from Wantedly.

00:00:12.080 Thank you for coming to my talk. I will guide you through its contents and the depths of YAML.

00:00:22.560 The talk is simple. I made a small library, and I will explain its features and implementation.

00:00:26.320 Alongside it, I will lead you to the depths of YAML. So, what is psych-comments?

00:00:44.399 Imagine you want to transform YAML, perhaps by adding a new element. The input may contain comments, which typically disappear. A simple solution would be to use Ruby's YAML module, which is also known as Psych. However, when using Psych, the comments are lost.
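
The comment loss described here can be reproduced with a short script. This is a minimal sketch using only Ruby's standard yaml library; the sample document and the `debug` key are made up for illustration.

```ruby
require "yaml"

source = <<~YAML
  # Application settings
  name: example # trailing note
  port: 8080
YAML

data = YAML.load(source) # comments are discarded while parsing
data["debug"] = true     # the transformation: add a new element
output = YAML.dump(data)

puts output
puts "comments kept? #{output.include?('#')}"
```

In Ruby, `YAML` and `Psych` refer to the same module, so `Psych.load` and `Psych.dump` behave identically here.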

00:01:00.000 Instead, you can use psych-comments. My library preserves comments during parsing, making it effectively a library for YAML comments.

00:01:25.159 But do you know YAML? Yes, YAML is a fancy version of JSON. In fact, YAML was built independently of JSON.

00:01:36.000 YAML was influenced by a serialization library from Perl. This means YAML is as powerful as Marshal in Ruby. It supports various complex data types such as custom objects and regex patterns.
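
As a small illustration of the "complex data types" point, the sketch below round-trips a regular expression through Psych. The pattern is arbitrary, and `permitted_classes` is needed because `Psych.safe_load` refuses non-basic classes by default.

```ruby
require "psych"

pattern = /ab+c/i

# Dumping attaches a Ruby-specific tag so the type can be restored later.
yaml = Psych.dump(pattern)
puts yaml

# Loading a Regexp has to be allowed explicitly when using safe_load.
restored = Psych.safe_load(yaml, permitted_classes: [Regexp])
puts restored == pattern
```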

00:02:00.560 YAML supports all these complex structures. Here comes the abstraction. YAML processing is split into three steps. The specification organizes it as follows: first, text converts to a tree, then the tree converts to a graph, which finally converts to a Ruby object.
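
The three steps can be observed with Psych itself. The following is a sketch under the assumption that `Psych.parse` stands for the text-to-tree step and `to_ruby` for the remaining two, which Psych performs together in one visitor; the anchor name `shared` is made up.

```ruby
require "psych"

text = <<~YAML
  base: &shared [1, 2]
  copy: *shared
YAML

# Step 1: text -> tree. The alias is still a separate leaf node here.
tree = Psych.parse(text)
kinds = tree.root.children.map { |node| node.class.name.split("::").last }
p kinds

# Steps 2 and 3: aliases are resolved (tree -> graph) and Ruby objects built.
data = tree.to_ruby
p data

# In the resulting graph both keys point at the very same Array object.
p data["base"].equal?(data["copy"])
```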

00:02:31.640 Cycles are processed in the second step, while custom object processing occurs in the third step. At a low level, YAML has only three kinds: sequences, mappings, and scalars. There are different syntaxes for each kind, but they essentially mean the same thing, while high-level tags determine their interpretation. Tags are indicated by the bang symbol, likely unfamiliar to many of you. If you omit tags, default rules apply, determined by their kinds.
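
To make the kinds and tags concrete, here is a short sketch with Psych's node classes. It assumes that the block and flow syntaxes are a fair example of "different syntaxes for each kind"; the sample values are arbitrary.

```ruby
require "psych"

# Two syntaxes, one kind: both parse to a sequence node with the same content.
block = Psych.parse("- 1\n- 2\n").root
flow  = Psych.parse("[1, 2]\n").root
p block.class
p block.to_ruby == flow.to_ruby

# An explicit tag (written with "!") overrides the default interpretation.
tagged   = Psych.parse("!!str 123\n").root
untagged = Psych.parse("123\n").root
p tagged.tag
p tagged.to_ruby
p untagged.to_ruby
```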

00:04:17.000 As I mentioned earlier, plain scalars are exceptions. Plain means no quotes or headers. In this case, the content determines the tag. This set of rules is known as a schema. Applications may extend a schema. This brings us to an example in Ruby, which is one source of incompatibility. I hope you now understand how complex YAML can be.
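
The transcript does not record which Ruby example was shown, so the following is only an assumed illustration of content-based resolution: the same characters yield different types depending on whether the scalar is plain or quoted, and Psych resolves some plain scalars (such as `:sym`) in a Ruby-specific way.

```ruby
require "psych"

text = <<~YAML
  - 123
  - "123"
  - 1.5
  - yes
  - ~
  - :sym
YAML

values = Psych.load(text)
values.each { |value| puts "#{value.inspect} (#{value.class})" }
```

Other YAML processors are free to resolve `yes` or `:sym` as plain strings, which is how the same file can load differently across languages.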

00:05:47.160 Now, let's go on to the implementation. As I said earlier, there are three steps in YAML processing, and the Psych library you usually use also has three levels: high-level, mid-level, and low-level.

00:06:04.479 My library works on the mid-level. Psych's high-level API is the one you usually call through Psych. The mid-level API is referred to as parse. My library has a similar API to this.
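
For orientation, the three levels can be compared side by side. This sketch assumes the levels map to `Psych.load` (high), `Psych.parse` (mid), and `Psych::Parser` with a handler (low), following Psych's own documentation; `EventRecorder` is a hypothetical handler written for this example.

```ruby
require "psych"

text = "- a\n- b\n"

# High level: text straight to Ruby objects.
high = Psych.load(text)

# Mid level: text to a tree of Psych::Nodes objects.
mid = Psych.parse(text)

# Low level: a stream of parser events delivered to a handler.
class EventRecorder < Psych::Handler
  attr_reader :events

  def initialize
    super
    @events = []
  end

  def start_sequence(anchor, tag, implicit, style)
    @events << :start_sequence
  end

  def scalar(value, anchor, tag, plain, quoted, style)
    @events << [:scalar, value]
  end

  def end_sequence
    @events << :end_sequence
  end
end

recorder = EventRecorder.new
Psych::Parser.new(recorder).parse(text)

p high
p mid.root.children.map(&:value)
p recorder.events
```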

00:06:27.800 What we want to achieve is to extend the mid-level API. However, we face a challenge: Psych is backed by a C library called LibYAML, which cannot be monkey patched the way Ruby code can.

00:07:31.960 To address this for the parser, I implemented a solution that reads the YAML multiple times. In the first pass, I simply use the original parser, while in the second pass, I scan for comments with my own algorithm. The process involves positioning a cursor at the start, advancing it to the next node, marking hash comments, and repeating this until the end.

00:08:22.919 However, there are some edge cases to consider: 1) Extra hash marks that appear only in scalars rather than comments can be skipped; 2) Special trailing comments before delimiters should be handled; 3) Since Psych does not have a node for key-value pairs, pairs are represented as flat arrays in my library. The key node represents the pair.
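
The flat key-value layout mentioned in the third point is visible in Psych's own tree. A minimal sketch with made-up keys:

```ruby
require "psych"

mapping = Psych.parse("name: example\nport: 8080\n").root

# There is no pair node: keys and values alternate in a single children array.
p mapping.children.size

# Pairs can be recovered by slicing the flat array two nodes at a time.
pairs = mapping.children.each_slice(2).map { |key, value| [key.value, value.value] }
p pairs
```

Node values are still raw strings at this level, which is why the port appears as "8080" rather than an integer.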

00:09:14.519 Next, for arrays, items indicated by a hyphen or minus always attach to the whole element. After addressing these edge cases, the parser took only about 100 lines of code.

00:10:04.160 On the opposite side, we also need a generator. Since we cannot extend the algorithm itself due to the C library restrictions, generators cannot be implemented the same way as parsing. We can only reuse the scalar part.

00:10:16.240 So for the other parts, let's simply reimplement it. Here, abstraction becomes important as formatting functions like print, space, newline, and indentation play a key role.

00:10:40.480 The space method prints a space without being immediate because spacing can be canceled depending on what follows. Similarly, newline manages indentation but not immediately, depending on subsequent indentation.
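
One way to picture the deferred `space` and `newline` is a printer that remembers pending whitespace and writes it only when real text follows. This is a sketch of the idea, not the gem's code; `DeferredPrinter` and `finish` are hypothetical names.

```ruby
# Hypothetical printer: whitespace is recorded and emitted lazily.
class DeferredPrinter
  def initialize
    @out = +""
    @pending = nil # nil, :space, or an Integer indentation for a newline
  end

  # Writes text, first flushing whatever whitespace is still pending.
  def print(text)
    case @pending
    when :space then @out << " "
    when Integer then @out << "\n" << " " * @pending
    end
    @pending = nil
    @out << text
  end

  # Requests a space; it is dropped if a newline is requested afterwards.
  def space
    @pending ||= :space
  end

  # Requests a line break; a later call can still change the indentation.
  def newline(indent)
    @pending = indent
  end

  # Ends the output with a bare line break and returns the text.
  def finish
    @out << "\n" if @pending.is_a?(Integer)
    @pending = nil
    @out
  end
end

printer = DeferredPrinter.new
printer.print("key:")
printer.space # canceled by the newline that follows
printer.newline(2)
printer.print("- item")
printer.newline(0)
printer.print("other:")
printer.space
printer.print("1")
printer.newline(0)
result = printer.finish
puts result
```

Deferring the write avoids trailing spaces after `key:` and lets the caller decide the indentation of the next line as late as possible.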

00:11:21.680 There are other special cases: for sequences nested in mappings, and only for that nesting pattern, indentation can be omitted. Comments should also be placed above the bullets, which is implemented with look-ahead.

00:12:23.270 Putting all these together results in a compact implementation of only around 300 lines. Let me mention an edge case related to library scope: there are two types of comments — leading and trailing comments.

00:13:01.120 Your preference may be for inline comments as well. Thankfully, someone submitted a PR, but we needed to discuss spacing issues, such as how many spaces to allocate.

00:13:45.679 There are two choices; the author wanted to preserve spaces, meaning if the original YAML had whitespace above, the output should reflect that. However, many other spacing details, such as how to indent structures, require careful consideration.

00:15:57.199 Remember there are three levels of abstraction at each step. Regarding parsing, details are removed bit by bit.

00:16:06.639 Psych retains slightly more detail than what is specified in serialization, while my library extends it by adding comments only — a conscious decision to maintain focus.

00:17:04.480 I communicated the decision to focus on comments, and it resulted in a PR that was later withdrawn. Fortunately, another developer is trying to implement similar changes.

00:17:50.479 To summarize, I developed the psych-comments gem, which is small but useful. I believe some of you may find it necessary to use right away or in the near future.

00:18:17.410 However, it is also important to note that the library may not be expanded indefinitely, leading to the need to reject some contributions that do not align with our goals.

00:19:02.120 In practice, the comments are typically attached to the next node. There are exceptions as discussed in the presentation regarding the ending delimiters, which can also be treated as trailing comments.

00:20:01.640 At present, the implementation for inline comments is not yet complete, and its behavior still needs to be defined.

00:21:30.440 As we venture forward, the heuristics for YAML comments are inherently limited due to their expressivity and how they attach to nodes.

00:22:40.080 While I am willing to maintain this for the foreseeable future, integrating Psych extensions will be part of my ongoing considerations.

00:23:15.679 In conclusion, my original concept focused on annotating nodes but diverged from using tags due to practical complexities involved.

See Slides on speakerdeck.com
