Masaki Hara

Getting Along with YAML Comments with Psych

RubyKaigi 2024

00:00:08.760 Hello, I am Masaki Hara. I am from Wantedly.
00:00:12.080 Thank you for coming to my talk. I will guide you through its contents and the depths of YAML.
00:00:22.560 The talk is simple. I made a small library, and I will explain its features and implementation.
00:00:26.320 Alongside it, I will lead you to the depths of YAML. So, what is psych-comments?
00:00:44.399 Imagine you want to transform YAML, perhaps by adding a new element. The input may contain comments. A simple solution would be to use the yaml library, which is also known as Psych. However, when using Psych, the comments are lost.
00:01:00.000 Instead, you can use psych-comments. My library preserves comments during parsing, making it effectively a library for YAML comments.
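The problem can be reproduced with the standard library alone. The following is a minimal sketch; the sample document and its keys are made up for illustration.

```ruby
require "yaml"

source = <<~YAML
  # The list of fruits
  fruits:
    - apple # my favorite
YAML

# Load the document, add one element, and dump it again.
data = YAML.safe_load(source)
data["fruits"] << "banana"
output = YAML.dump(data)

# Both the leading and the trailing comment are gone from the output.
puts output
```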
00:01:25.159 But do you know YAML? You may think of YAML as a fancy version of JSON. In fact, YAML was designed independently of JSON.
00:01:36.000 YAML was influenced by a serialization library from Perl. This means YAML is as powerful as Marshal in Ruby. It supports various complex data types such as custom objects and regex patterns.
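As a rough illustration of that power, Psych can dump a regular expression or a custom object and revive it later. In this sketch, `Point` is a hypothetical class introduced only for the example, and `safe_load` is used so that only explicitly permitted classes are revived.

```ruby
require "yaml"

# Point is a hypothetical class used only for this illustration.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end
end

regexp_yaml = YAML.dump(/fo+/)
point_yaml = YAML.dump(Point.new(1, 2))

# Ruby-specific tags such as !ruby/regexp carry the type information.
puts regexp_yaml
puts point_yaml

# safe_load only revives classes that are explicitly permitted.
regexp = YAML.safe_load(regexp_yaml, permitted_classes: [Regexp])
point = YAML.safe_load(point_yaml, permitted_classes: [Point])
```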
00:02:00.560 YAML supports all these complex structures. Here comes the abstraction. YAML processing is split into three steps. The specification organizes it as follows: first, text converts to a tree, then the tree converts to a graph, which finally converts to a Ruby object.
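The steps can be observed roughly with Psych. In this sketch, `Psych.parse` stands in for the text-to-tree step, and a single `to_ruby` call covers the remaining steps; the anchor and alias in the sample are there to show where sharing comes from.

```ruby
require "psych"

yaml = "a: &x [1, 2]\nb: *x\n"

# Text to tree: the alias is still a separate node that refers to the anchor.
doc = Psych.parse(yaml)
kinds = doc.root.children.map { |node| node.class.name.split("::").last }
p kinds

# Tree to graph to Ruby object: the alias now points at the very same array.
obj = doc.to_ruby
shared = obj["a"].equal?(obj["b"])
p shared
```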
00:02:31.640 Cycles are processed in the second step, while custom object processing occurs in the third step. At a low level, YAML has only three kinds of nodes: sequences, mappings, and scalars. There are different syntaxes for each kind, but they essentially mean the same thing, while high-level tags determine their interpretation. Tags are indicated by the exclamation mark (!), which is likely unfamiliar to many of you. If you omit tags, default rules apply, determined by the node's kind.
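A small sketch of both points, using only Psych: different syntaxes yield the same kind of node, and an explicit tag overrides the default interpretation. The sample inputs are made up for illustration.

```ruby
require "psych"

# Block and flow syntax, or plain and quoted scalars, give the same node kind.
samples = ["- 1\n- 2\n", "[1, 2]", "a: 1\n", "{a: 1}", "hello", "'hello'"]
kinds = samples.map { |yaml| Psych.parse(yaml).root.class.name.split("::").last }
p kinds

# Without a tag the default rules apply; with !!str the scalar stays a string.
untagged = Psych.safe_load("123")
tagged = Psych.safe_load("!!str 123")

# The !! shorthand expands to a full tag URI on the node.
tag = Psych.parse("!!str 123").root.tag
p untagged, tagged, tag
```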
00:04:17.000 As I mentioned earlier, plain scalars are exceptions. Plain means no quotes or headers. In this case, the content determines the tag. This set of rules is known as a schema. Applications may extend a schema. This brings us to an example in Ruby, which is one source of incompatibility. I hope you now understand how complex YAML can be.
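To make the schema rules concrete, here is a sketch of how Psych resolves a few plain scalars. The inputs are chosen for illustration; the symbol case is a Ruby-specific extension and therefore has to be permitted explicitly with `safe_load`.

```ruby
require "psych"

# Plain scalars are resolved by their content; quoting turns that off.
booleans = Psych.safe_load("- yes\n- no\n- 'no'\n")
number = Psych.safe_load("1.5")
quoted_number = Psych.safe_load("'1.5'")

# A Ruby-specific extension: a leading colon produces a Symbol.
symbol = Psych.safe_load(":name", permitted_classes: [Symbol])

p booleans, number, quoted_number, symbol
```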
00:05:47.160 Now, let's go on to the implementation. As I said earlier, there are three steps in YAML processing, and the Psych library you usually use also has three levels: high-level, mid-level, and low-level.
00:06:04.479 My library works on the mid-level. Psych's high-level API is Psych.load, and the mid-level API is Psych.parse. My library has a similar API to the latter.
00:06:27.800 What we want to achieve is to extend the mid-level API. However, we face a challenge: Psych is backed by a C library called LibYAML, which cannot be monkey patched the way Ruby code can.
00:07:31.960 To address this for the parser, I implemented a solution that reads the YAML multiple times. In the first pass, I simply use the original parser, while in the second pass, I scan for comments with my own algorithm. The algorithm positions a cursor at the start, advances it to the next node, collects the hash comments it passes, and repeats this until the end.
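For reference, the three levels of Psych can be sketched side by side. `ScalarLogger` is a hypothetical handler class written for this example; `Psych.safe_load`, `Psych.parse`, `Psych::Parser`, and `Psych::Handler` are part of Psych itself.

```ruby
require "psych"

yaml = "name: psych\n"

# High level: text straight to Ruby objects.
high = Psych.safe_load(yaml)

# Mid level: text to a tree of Psych::Nodes objects.
mid = Psych.parse(yaml)

# Low level: a stream of events delivered to a handler.
class ScalarLogger < Psych::Handler
  attr_reader :values

  def initialize
    @values = []
  end

  # Called once for every scalar event in the document.
  def scalar(value, anchor, tag, plain, quoted, style)
    @values << value
  end
end

handler = ScalarLogger.new
Psych::Parser.new(handler).parse(yaml)

p high, mid.root.class, handler.values
```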
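The two-pass idea can be sketched as follows. This is not the library's actual code: `each_leaf` and `leading_comments` are hypothetical helpers, only whole-line comments are recognized, and comments are attached to the next childless node, relying on the line positions that Psych records on each node.

```ruby
require "psych"

# Yield nodes without children (scalars, aliases) in document order.
def each_leaf(node, &block)
  children = node.children || []
  return yield(node) if children.empty?

  children.each { |child| each_leaf(child, &block) }
end

# Returns [[node, ["# comment", ...]], ...] for whole-line comments only.
def leading_comments(yaml)
  lines = yaml.lines
  cursor = 0 # index of the first line not yet consumed
  result = []
  # First pass: the original parser builds the tree.
  each_leaf(Psych.parse(yaml)) do |node|
    # Second pass: scan the lines between the cursor and this node.
    skipped = lines[cursor...node.start_line] || []
    comments = skipped.map(&:strip).select { |line| line.start_with?("#") }
    result << [node, comments] unless comments.empty?
    # Advance the cursor past the node so its own text is never rescanned.
    cursor = [cursor, node.end_line].max
  end
  result
end

yaml = <<~YAML
  # about a
  a: 1
  # about b
  b: "x # not a comment"
YAML

pairs = leading_comments(yaml).map { |node, comments| [node.value, comments] }
p pairs
```

Trailing comments, comments inside flow collections, and block scalars are left out here to keep the sketch short.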
00:08:22.919 However, there are some edge cases to consider: 1) Hash marks that appear inside scalars rather than starting comments must be skipped; 2) Special trailing comments before delimiters should be handled; 3) Since Psych does not have a node for key-value pairs and represents pairs as flat arrays, in my library the key node represents the pair.
00:09:14.519 Next, for sequences, comments placed before the hyphen (or minus) of an item always attach to the whole element. After addressing these edge cases, the parser took only about 100 lines of code.
00:10:04.160 On the opposite side, we also need a generator. Since we cannot extend the algorithm itself due to the C library restrictions, the generator cannot be implemented the same way as the parser. We can only reuse the scalar part.
00:10:16.240 So for the other parts, let's simply reimplement them. Here, abstraction becomes important: formatting functions like print, space, newline, and indentation play a key role.
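Two of these edge cases are easy to observe with Psych itself. The sketch below shows that a hash mark inside a quoted scalar is part of the value, and that a mapping node stores its keys and values as one flat list of children.

```ruby
require "psych"

# A hash mark inside a quoted scalar is data; one after a plain scalar is not.
loaded = Psych.safe_load("a: 'x # y'\nb: z # comment\n")
p loaded

# Mapping children alternate between keys and values; there is no pair node.
mapping = Psych.parse("a: 1\nb: 2\n").root
flat = mapping.children.map(&:value)
pairs = mapping.children.each_slice(2).map { |key, value| [key.value, value.value] }
p flat, pairs
```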
00:10:40.480 The space method prints a space, but not immediately, because the space can be canceled depending on what follows. Similarly, newline emits the line break and indentation lazily, because the indentation depends on what follows.
00:11:21.680 There are other special cases: a sequence nested in a mapping, and only that nesting pattern, allows the indentation to be omitted. Comments should also be placed above bullets, which is implemented with look-ahead.
00:12:23.270 Putting all these together results in a compact implementation of only around 300 lines. Let me mention an issue related to the library's scope. There are two types of comments: leading and trailing comments.
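The deferred behavior can be sketched with a small printer class. `LazyPrinter` and its methods are hypothetical names for this illustration and are not taken from the library; the point is only that a pending space or newline is written when the next text arrives.

```ruby
# A hypothetical printer that delays spaces and newlines until the next print.
class LazyPrinter
  attr_accessor :indent

  def initialize
    @out = +""
    @pending = nil # :space, :newline, or nil
    @indent = 0
  end

  # Writes any pending separator first, then the text itself.
  def print(text)
    flush
    @out << text
  end

  # Requests a space; an already pending newline takes priority.
  def space
    @pending ||= :space
  end

  # Requests a newline; this cancels a pending space.
  def newline
    @pending = :newline
  end

  def to_s
    @out
  end

  private

  # The indentation is read here, so it may change after newline was called.
  def flush
    case @pending
    when :space then @out << " "
    when :newline then @out << "\n" << " " * @indent
    end
    @pending = nil
  end
end

printer = LazyPrinter.new
printer.print("a:")
printer.space        # would separate "a:" from an inline value
printer.newline      # the value is a nested mapping, so the space is canceled
printer.indent = 2
printer.print("b:")
printer.space
printer.print("1")
result = printer.to_s
puts result
```

With this design the output has no trailing space after `a:`, because the space was never written.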
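The sequence-in-mapping case can be checked with Psych: both layouts below load to the same data, and Psych's own dumper uses the layout without extra indentation. The sample data is made up for illustration.

```ruby
require "yaml"

compact = "a:\n- 1\n- 2\n"
indented = "a:\n  - 1\n  - 2\n"

# Both layouts describe the same mapping of a key to a sequence.
same = YAML.safe_load(compact) == YAML.safe_load(indented)

# Psych's dumper omits the indentation for a sequence inside a mapping.
dumped = YAML.dump({ "a" => [1, 2] })
puts dumped
```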
00:13:01.120 You may want inline comments as well. Thankfully, someone submitted a PR, but we needed to discuss spacing issues, such as how many spaces to allocate.
00:13:45.679 There are two choices; the author wanted to preserve spaces, meaning if the original YAML had whitespace above, the output should reflect that. However, many other spacing details, such as how to indent structures, require careful consideration.
00:15:57.199 Remember there are three levels of abstraction at each step. Regarding parsing, details are removed bit by bit.
00:16:06.639 Psych retains slightly more detail than what is specified in serialization, while my library extends it by adding comments only, a conscious decision to maintain focus.
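One example of the extra detail Psych keeps is the style of each node. In this sketch, flow versus block style and the quoting style survive parsing, while a comment does not survive a round trip through the tree.

```ruby
require "psych"

flow = Psych.parse("[1, 2]").root
block = Psych.parse("- 1\n- 2\n").root
quoted = Psych.parse("'hello'").root

# The style is presentation detail that the mid-level tree still remembers.
p flow.style == Psych::Nodes::Sequence::FLOW
p block.style == Psych::Nodes::Sequence::BLOCK
p quoted.style == Psych::Nodes::Scalar::SINGLE_QUOTED

# Emitting the tree keeps the flow style but drops the comment.
round_trip = Psych.parse_stream("[1, 2] # note\n").to_yaml
puts round_trip
```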
00:17:04.480 I communicated the decision to focus on comments, and it resulted in a PR that was later withdrawn. Fortunately, another developer is trying to implement similar changes.
00:17:50.479 To summarize, I developed the psych-comments gem, which is small but useful. I believe some of you may find it necessary to use right away or in the near future.
00:18:17.410 However, it is also important to note that the library may not be expanded indefinitely, leading to the need to reject some contributions that do not align with our goals.
00:19:02.120 In practice, the comments are typically attached to the next node. There are exceptions as discussed in the presentation regarding the ending delimiters, which can also be treated as trailing comments.
00:20:01.640 Currently, the implementation for inline comments is not yet complete, and its behavior still needs to be defined.
00:21:30.440 As we venture forward, the heuristics for YAML comments are inherently limited due to their expressivity and how they attach to nodes.
00:22:40.080 While I am willing to maintain this for the foreseeable future, integrating Psych extensions will be part of my ongoing considerations.
00:23:15.679 In conclusion, my original concept focused on annotating nodes but diverged from using tags due to practical complexities involved.