Getting Along with YAML Comments with Psych

by Masaki Hara

In this presentation titled "Getting Along with YAML Comments with Psych," Masaki Hara discusses his development of a library focused on retaining comments in YAML parsing. Hara introduces the complexities of YAML and its capabilities, highlighting its similarity to JSON, its roots in Perl serialization, and its powerful handling of complex data types. The presentation covers the three key steps of YAML processing: conversion of text to a tree, tree to graph, and finally to Ruby objects, detailing how comments traditionally disappear during parsing with existing libraries such as Psych.

Hara explains how his library, which operates at the level of Psych's mid-level API, retains comments by employing a two-pass parsing strategy. In the first pass, the library uses the original parser. In the second pass, it scans the source for comments while handling edge cases, such as hash marks that appear inside scalars and trailing comments before delimiters.

The talk also presents challenges involved in extending the Psych library due to its reliance on a C library named LibYAML, which limits modifications. Hara elaborates on the behavior of the library's generator, distinguishing it from parsers and explaining that formatting functions are crucial for preserving spacing and indentation.

Finally, Hara summarizes the development of the psych-comments gem, explaining its purpose and future considerations for expanding the library while also acknowledging the limitations in expressing YAML comments. He concludes with insights on the balance between practical implementation and theoretical application of comments in YAML, underscoring the library's focus on maintaining the integrity of comments during parsing.

00:00:08.760 Hello, I am Masaki Hara, and I am from Wantedly.
00:00:12.080 Thank you for coming to my talk. I will guide you through its contents and the depths of YAML.
00:00:22.560 The talk is simple. I made a small library, and I will explain its features and implementation.
00:00:26.320 Alongside it, I will lead you to the depths of YAML. So, what is psych-comments?
00:00:44.399 Imagine you want to transform YAML, perhaps by adding a new element. The input may contain comments, and you usually want to keep them. A simple solution would be to use Ruby's yaml library, which is also known as Psych. However, when using Psych, the comments are lost.
00:01:00.000 Instead, you can use psych-comments. My library preserves comments during parsing, making it effectively a library for YAML comments.
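The comment loss described here can be reproduced with the standard library alone. The snippet below is a minimal sketch with a made-up configuration; only `YAML.safe_load` and `YAML.dump` from Ruby's bundled Psych are used.

```ruby
require "yaml"

# Made-up input with a leading comment and an inline comment.
source = <<~YAML
  # database settings
  db:
    host: localhost # default host
YAML

# Load into plain Ruby objects, add one element, and dump again.
data = YAML.safe_load(source)
data["db"]["port"] = 5432
output = YAML.dump(data)

puts output
# Both comments are gone: they never reach the Ruby objects,
# so the dumper has nothing to write back.
```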
00:01:25.159 But do you know YAML? You might say that YAML is a fancy version of JSON. In fact, YAML was built independently of JSON.
00:01:36.000 YAML was influenced by a serialization library from Perl. This means YAML is as powerful as Marshal in Ruby. It supports various complex data types such as custom objects and regex patterns.
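As a small illustration of this Marshal-like power, the sketch below dumps a regular expression and a struct with Psych. The `Point` struct is a made-up example. Loading such values back requires explicitly permitting their classes when `safe_load` is used.

```ruby
require "yaml"

# A made-up custom type for illustration.
Point = Struct.new(:x, :y)

# Psych marks non-JSON values with Ruby-specific tags.
regex_yaml  = YAML.dump(/ab+c/i)
struct_yaml = YAML.dump(Point.new(1, 2))

puts regex_yaml   # contains the !ruby/regexp tag
puts struct_yaml  # contains the !ruby/struct:Point tag

# safe_load refuses such tags unless the class is explicitly permitted.
restored = YAML.safe_load(regex_yaml, permitted_classes: [Regexp])
```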
00:02:00.560 YAML supports all these complex structures. Here comes the abstraction. YAML processing is split into three steps. The specification organizes it as follows: first, text converts to a tree, then the tree converts to a graph, which finally converts to a Ruby object.
00:02:31.640 Cycles are processed in the second step, while custom object processing occurs in the third step. At a low level, YAML has only three kinds: sequences, mappings, and scalars. There are different syntaxes for each kind, but they essentially mean the same thing, while high-level tags determine their interpretation. Tags are indicated by the bang symbol, likely unfamiliar to many of you. If you omit tags, default rules apply, determined by their kinds.
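The steps and kinds above can be observed with Psych directly. This is a rough sketch with a made-up document: `Psych.parse` yields the tree, where an alias is still a separate node, and `to_ruby` performs the remaining steps, resolving the alias to the same object.

```ruby
require "psych"

source = <<~YAML
  base: &shared
    - 1
    - two
  copy: *shared
YAML

# Text -> tree. The root is a mapping whose children are a flat list:
# key, value, key, value.
doc   = Psych.parse(source)
root  = doc.root
kinds = root.children.map { |node| node.class.name.split("::").last }

# Tree -> graph -> Ruby objects. The alias now points at the very same Array.
data = doc.to_ruby

p kinds
p data
p data["base"].equal?(data["copy"])
```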
00:04:17.000 As I mentioned earlier, plain scalars are exceptions. Plain means no quotes or headers. In this case, the content determines the tag. This set of rules is known as a schema. Applications may extend a schema. This brings us to an example in Ruby, which is one source of incompatibility. I hope you now understand how complex YAML can be.
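To make the schema rules concrete, here is a small sketch of how Psych resolves plain, quoted, and explicitly tagged scalars. The keys are made up. The `yes` case shows one place where Psych's schema can differ from what other YAML implementations do.

```ruby
require "yaml"

values = YAML.safe_load(<<~YAML)
  plain_int: 123
  quoted: "123"
  tagged: !!str 123
  plain_bool: yes
  plain_text: hello
YAML

# Plain scalars are resolved by their content; quoted or tagged ones are not.
values.each { |key, value| puts "#{key}: #{value.inspect} (#{value.class})" }
```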
00:05:47.160 Now, let's go on to the implementation. As I said earlier, there are three steps in YAML processing, and the Psych library you usually use also has three levels: high-level, mid-level, and low-level.
00:06:04.479 My library works on the mid-level. The high-level API is what most people simply call Psych. The mid-level API is `Psych.parse`, and my library has a similar API.
00:06:27.800 What we want to achieve is to extend the mid-level API. However, we face a challenge: Psych is backed by a C library called LibYAML, which cannot be monkey patched the way Ruby code can.
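The three levels can be compared side by side using Psych's public API. This is only a sketch; the `ScalarCollector` handler is a made-up name used to show the event-based low level.

```ruby
require "psych"

source = "name: psych\n"

# High level: text straight to Ruby objects.
high = Psych.safe_load(source)

# Mid level: text to a tree of Psych::Nodes objects.
mid = Psych.parse(source)

# Low level: a stream of parser events delivered to a handler.
class ScalarCollector < Psych::Handler
  attr_reader :scalars

  def initialize
    super
    @scalars = []
  end

  # Called once per scalar event; the other events keep their no-op defaults.
  def scalar(value, anchor, tag, plain, quoted, style)
    @scalars << value
  end
end

handler = ScalarCollector.new
Psych::Parser.new(handler).parse(source)

p high
p mid.root.class
p handler.scalars
```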
00:07:31.960 To address this for the parser, I implemented a solution that reads the YAML multiple times. In the first pass, I simply use the original parser, while in the second pass, I scan for comments with my own algorithm. The process involves positioning a cursor at the start, advancing it to the next node, marking hash comments, and repeating this until the end.
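The two-pass idea can be sketched roughly as follows. This is not the psych-comments implementation: `attach_leading_comments` is a hypothetical helper that works line by line, only considers whole-line comments, and attaches them to the next scalar node, relying on the line positions that Psych records on each node.

```ruby
require "psych"

# Hypothetical helper: returns [scalar value, comments found before it] pairs.
def attach_leading_comments(source)
  lines = source.lines

  # Pass 1: the stock parser builds the tree, including node positions.
  doc = Psych.parse(source)
  scalars = doc.each
               .select { |node| node.is_a?(Psych::Nodes::Scalar) }
               .sort_by { |node| [node.start_line, node.start_column] }

  # Pass 2: move a cursor from node to node and collect comment lines
  # that sit between the previous position and the next node.
  cursor = 0
  scalars.map do |node|
    comments = []
    while cursor < node.start_line
      text = lines[cursor].strip
      comments << text if text.start_with?("#")
      cursor += 1
    end
    [node.value, comments]
  end
end

result = attach_leading_comments(<<~YAML)
  # top
  a: 1
  # about b
  b: 2
YAML

result.each { |value, comments| puts "#{value}: #{comments.inspect}" }
```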
00:08:22.919 However, there are some edge cases to consider: 1) Hash marks that appear inside scalars are not comments and must be skipped; 2) Trailing comments before closing delimiters need special handling; 3) Since Psych does not have a node for key-value pairs, pairs are represented as flat arrays, so in my library the key node represents the pair.
00:09:14.519 Next, for arrays, comments before an item indicated by a hyphen (minus sign) always attach to the whole element. After addressing these edge cases, the parser took only about 100 lines of code.
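The first edge case can be illustrated with node positions. In this sketch, `scan_comments` is a hypothetical helper, not part of psych-comments: it treats a hash mark as a comment start only when it lies outside every scalar's start and end position as reported by Psych.

```ruby
require "psych"

# Hypothetical helper: returns comment texts, ignoring "#" inside scalars.
def scan_comments(source)
  doc = Psych.parse(source)
  spans = doc.each.select { |n| n.is_a?(Psych::Nodes::Scalar) }.map do |n|
    [[n.start_line, n.start_column], [n.end_line, n.end_column]]
  end

  comments = []
  source.each_line.with_index do |line, row|
    line.each_char.with_index do |char, col|
      next unless char == "#"

      pos = [row, col]
      # A hash mark within a scalar's span is content, not a comment.
      inside = spans.any? { |from, to| (from <=> pos) <= 0 && (pos <=> to) < 0 }
      next if inside

      comments << line[col..-1].chomp
      break # the rest of the line belongs to this comment
    end
  end
  comments
end

found = scan_comments(<<~YAML)
  title: "issue #42" # tracked upstream
  color: '#fff'
YAML

p found
```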
00:10:04.160 On the opposite side, we also need a generator. Since we cannot extend the algorithm itself due to the C library restrictions, generators cannot be implemented the same way as parsing. We can only reuse the scalar part.
00:10:16.240 So for the other parts, let's simply reimplement it. Here, abstraction becomes important as formatting functions like print, space, newline, and indentation play a key role.
00:10:40.480 The space method does not print a space immediately, because the space can be canceled depending on what follows. Similarly, newline does not emit indentation immediately, because the indentation depends on what follows.
00:11:21.680 There are other special cases. For sequences nested in mappings, and only for that nesting pattern, indentation can be omitted. Comments should also be placed above bullets, which is implemented with look-ahead.
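The deferred behavior of these formatting functions can be sketched with a tiny writer. `LazyWriter` and its methods are made up for illustration and are not the gem's actual code: a pending space is dropped if a newline follows, and the indentation is decided only when the next text is printed.

```ruby
# Hypothetical writer: whitespace is recorded as pending and flushed lazily.
class LazyWriter
  def initialize
    @out = +""
    @indent = 0
    @pending = nil # nil, :space, or :newline
  end

  def print(text)
    flush
    @out << text
  end

  # Request a space; a pending newline takes priority over it.
  def space
    @pending ||= :space
  end

  # Request a newline; this cancels any pending space.
  def newline
    @pending = :newline
  end

  def indented
    @indent += 2
    yield
  ensure
    @indent -= 2
  end

  def to_s
    @out + (@pending == :newline ? "\n" : "")
  end

  private

  # Indentation is computed here, using the level in effect when text arrives.
  def flush
    case @pending
    when :space   then @out << " "
    when :newline then @out << "\n" << " " * @indent
    end
    @pending = nil
  end
end

writer = LazyWriter.new
writer.print("a:"); writer.space; writer.print("1"); writer.newline
writer.print("b:"); writer.space # canceled by the newline below
writer.indented do
  writer.newline
  writer.print("c:"); writer.space; writer.print("2")
end
writer.newline
puts writer.to_s
```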
00:12:23.270 Putting all these together results in a compact implementation of only around 300 lines. Let me mention an edge case related to library scope: there are two types of comments — leading and trailing comments.
00:13:01.120 You may want inline comments as well. Thankfully, someone submitted a PR, but we needed to discuss spacing issues, such as how many spaces to allocate.
00:13:45.679 There are two choices; the author wanted to preserve spaces, meaning if the original YAML had whitespace above a node, the output should reflect that. However, many other spacing details, such as how to indent structures, require careful consideration.
00:15:57.199 Remember there are three levels of abstraction at each step. Regarding parsing, details are removed bit by bit.
00:16:06.639 Psych retains slightly more detail than what is specified in serialization, while my library extends it by adding comments only — a conscious decision to maintain focus.
00:17:04.480 I communicated the decision to focus on comments, and it resulted in a PR that was later withdrawn. Fortunately, another developer is trying to implement similar changes.
00:17:50.479 To summarize, I developed the psych-comments gem, which is small but useful. I believe some of you may find it necessary to use right away or in the near future.
00:18:17.410 However, it is also important to note that the library may not be expanded indefinitely, leading to the need to reject some contributions that do not align with our goals.
00:19:02.120 In practice, the comments are typically attached to the next node. There are exceptions as discussed in the presentation regarding the ending delimiters, which can also be treated as trailing comments.
00:20:01.640 Currently, the implementation for inline comments is not yet complete, and its behavior still needs to be specified.
00:21:30.440 As we venture forward, the heuristics for YAML comments are inherently limited due to their expressivity and how they attach to nodes.
00:22:40.080 While I am willing to maintain this for the foreseeable future, integrating Psych extensions will be part of my ongoing considerations.
00:23:15.679 In conclusion, my original concept focused on annotating nodes but diverged from using tags due to practical complexities involved.