RubyKaigi 2023

Yet Another Ruby Parser

00:00:00 Hi there, my name is Kevin Newton, and I am here to talk to you today about the Yet Another Ruby Parser project, otherwise known as YARP, that I've been working on at Shopify.
00:00:05 I've been working on this project for about six months with a couple of goals: to create a new parser for the Ruby language across implementations and tools, and to generally unify the ecosystem around one effort to expand the parser.
00:00:13 I started this project about six months ago in earnest at Shopify, but in reality, I have been working toward it for about the last five years. It began back when I first started working on the Prettier plugin for Ruby, a formatter. At that time, I was using Ripper's output to build the Abstract Syntax Tree (AST) and format it from there using TypeScript.
00:00:40 Eventually, this evolved into Syntax Tree, which was a new project built entirely in Ruby. It effectively served as a port of the TypeScript piece of it, creating an object layer on top of the Ripper AST. However, something that has always been difficult with it has been using Ripper; the interface has been quite challenging to navigate, and there are many shortcomings associated with that approach.
00:01:03 Everything is specifically tied to CRuby, and there are numerous difficulties with using the Ripper interface, which I have discussed in previous talks. I found this fantastic comic while researching for this talk. In summary, it says, "I pray for the strength to change what I can, the inability to accept what I can't, and the incapacity to tell the difference." I felt a lot like this when I was discussing rebuilding the Ruby parser with my manager at Shopify, since it seemed like an insurmountable task.
00:01:34 However, within the last six months, we have made significant progress. We can now parse 100% of Shopify's Ruby code on GitHub and the top 100 gems by download from rubygems.org. We have an experimental fork of CRuby that uses the new parser. YARP has already been merged into TruffleRuby, and that team is actively working on making it their default parser. There's also an experimental branch in JRuby, so I would say that we have come quite a long way.
00:02:02 I'm very excited to share with you today about the progress we've made and what it looks like. First, we will discuss the motivations behind building this new parser: why would we even attempt to create such a tool?
00:02:28 These motivations boil down to three key factors: error tolerance, which is the ability of the parser to continue processing even in the event of a syntax error; portability, which allows people and other tools to use this project outside of just CRuby; and maintainability, which refers to the general ease of changing, deleting, or updating the existing parser.
00:02:58 Next, we will delve into the development of the new parser and what that has looked like, specifically addressing the challenges posed by Ruby's grammar and how we have navigated these issues. Finally, we will discuss the path forward, which involves the adoption of the YARP project as well as future efforts and the direction we foresee.
00:03:26 The first topic is error tolerance. Error tolerance is a subject we could discuss for an entire talk or multiple talks, as it is a significant area of focus for many developers.
00:03:34 To give you just a small taste, we can look at the Interactive Ruby Shell (IRB). IRB has some level of error tolerance baked in. For example, if you start IRB and enter `foo + bar + baz` but include some garbage characters, IRB can still identify that 'baz' is a valid identifier and indicate that there is a syntax error. This is a prime example of error tolerance.
00:04:02 The way this works is that IRB uses Ripper to tokenize the input and discards any garbage tokens until it finds something it can understand. Another example, this time outside the Ruby ecosystem, can be seen in Rust. If you use VS Code or any other editor connected to rust-analyzer, it can handle numerous errors.
00:04:28 You might not see it on the slide I presented, but there is an extra pipe on a certain line that rust-analyzer can recover from, allowing it to continue parsing the rest of the file. This kind of error tolerance is vital for integrated development environments (IDEs) that employ tools like type checkers and refactoring tools.
00:04:58 If a syntax error occurs on line two, but the remaining 998 lines are error-free, it is critical not to discard the entire Abstract Syntax Tree (AST). Instead, we want to maintain it and make incremental updates. In this context, rust-analyzer can recover from errors, which drastically improves development speed.
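As a rough illustration of the "discard garbage and keep going" idea, here is a minimal sketch of a tolerant tokenizer. It is a hypothetical toy written with `StringScanner`, not IRB's or Ripper's actual implementation, and the method name `tolerant_tokens` is an assumption made for this example.

```ruby
require "strscan"

# Toy tokenizer (hypothetical, not IRB's code): keeps identifiers and
# operators, and records every unrecognized character as an error
# instead of giving up on the rest of the input.
def tolerant_tokens(source)
  scanner = StringScanner.new(source)
  tokens = []
  errors = []

  until scanner.eos?
    if scanner.skip(/\s+/)
      next
    elsif (text = scanner.scan(/[a-z_][a-zA-Z0-9_]*/))
      tokens << [:ident, text]
    elsif (text = scanner.scan(/[+\-*\/]/))
      tokens << [:op, text]
    else
      # Remember where the garbage was, then skip one character.
      offset = scanner.pos
      errors << [offset, scanner.getch]
    end
  end

  [tokens, errors]
end

tokens, errors = tolerant_tokens('foo + bar + $% baz')
# `baz` is still found as an identifier, and the two garbage characters
# are reported with their offsets.
```

The point of the sketch is only that an error is data to collect, not a reason to stop.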
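For contrast, this short example uses the standard library's `Ripper.sexp` to show the all-or-nothing behavior described here: one bad line out of a thousand means no tree at all. The generated source is made up for the illustration.

```ruby
require "ripper"

# 999 valid lines of simple assignments.
valid_lines = Array.new(999) { |index| "value_#{index} = #{index}" }

valid_source   = valid_lines.join("\n")
# Insert a single broken statement as line two.
invalid_source = valid_lines.dup.insert(1, "oops = )").join("\n")

valid_tree   = Ripper.sexp(valid_source)
invalid_tree = Ripper.sexp(invalid_source)

# With the default options, Ripper.sexp returns nil on a syntax error,
# so the information about the other 999 lines is lost.
puts valid_tree.class
puts invalid_tree.inspect
```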
00:05:27 For the YARP project, we have implemented some error tolerance, and we have plans for much more. At its most basic level, we can automatically insert tokens. For instance, if you have the beginning of a class declaration with `class Foo`, and you lack the terminating token, we can detect that and automatically insert the missing token.
00:06:07 A missing token will be created and an error added to the list of errors, so you will still receive a syntax error. However, tools like IDEs, linters, or any static analysis tools will be able to continue maintaining the rest of the tree without losing it.
00:06:38 Additionally, we can insert nodes in expected places within the tree. For example, if there is a line like `1 +` with no subsequent statement, we would insert a missing node into the tree, ensuring there is still a fully formed tree with an accompanying error listed.
00:07:00 This is the most fundamental point: whenever YARP parses Ruby input, it will never crash. It will always return a tree (this might just be the root node), along with a list of errors and their offsets from the beginning of the file.
00:07:31 For example, in our VS Code extension, which uses YARP to parse files, you could see an expression like a plus sign followed by an opening brace. YARP recognizes the start of a hash literal but acknowledges that we have no key, delimiter, or value.
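The idea of "always return a tree plus a list of errors" can be sketched with a tiny recursive descent parser for sums. Everything here (`IntegerNode`, `MissingNode`, `BinaryNode`, `ParseError`, `parse_sum`) is a hypothetical name chosen for this illustration and is not YARP's API or node set.

```ruby
# Hypothetical node types for illustration; these are not YARP's classes.
IntegerNode = Struct.new(:value, :offset)
MissingNode = Struct.new(:offset)
BinaryNode  = Struct.new(:left, :operator, :right)
ParseError  = Struct.new(:message, :offset)

# Parses sums such as "1 + 2 + 3". It never raises: a gap in the input
# becomes a MissingNode and is reported in the returned error list.
def parse_sum(source)
  tokens = []
  source.scan(/\d+|\+/) { tokens << [$~[0], $~.begin(0)] }
  errors = []
  position = 0

  parse_operand = lambda do
    text, offset = tokens[position]
    if text&.match?(/\A\d+\z/)
      position += 1
      IntegerNode.new(text.to_i, offset)
    else
      # Consume nothing; record the gap and keep the tree well formed.
      offset ||= source.length
      errors << ParseError.new("expected an expression", offset)
      MissingNode.new(offset)
    end
  end

  tree = parse_operand.call
  while tokens[position]&.first == "+"
    position += 1
    tree = BinaryNode.new(tree, "+", parse_operand.call)
  end

  # Anything left over is reported rather than raised.
  if tokens[position]
    errors << ParseError.new("unexpected token", tokens[position][1])
  end

  [tree, errors]
end

tree, errors = parse_sum("1 +")
# tree is a BinaryNode whose right side is a MissingNode at offset 3,
# and errors holds one entry pointing at the same offset.
```

A consumer such as a linter can walk the returned tree as usual and decide for itself how to treat `MissingNode`.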
00:07:54 Even without those components, YARP can loop back and insert four errors indicating that we are missing the closing brace, allowing us to continue parsing based on that assumption.
00:08:22 Another important feature is context-based recovery, which was inspired by Microsoft's work on building a new PHP parser for VS Code. This involves keeping a stack of contexts throughout the parsing process, allowing us to manage multiple levels of depth.
00:08:49 As an example, if we are deep inside a method definition within a class and the next line contains an `end` keyword where an expression was expected, we can use the stack to determine which enclosing context that keyword belongs to and insert a missing node where it applies.
00:09:25 This methodology allows us to better manage ambiguity and provide more accurate parsing results.
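A minimal sketch of context-based recovery, assuming a toy token stream with only two kinds of context (`def ... end` and `[ ... ]`). The tables `OPENERS` and `CLOSERS` and the method `parse_body` are hypothetical and only illustrate the stack idea, not how YARP implements it.

```ruby
# Hypothetical contexts for illustration only.
OPENERS = { "def" => :def, "[" => :array }.freeze
CLOSERS = { def: "end", array: "]" }.freeze

# Parses tokens until the current context is closed. If a token closes
# an enclosing context instead, the current one is ended early with an
# error, and the token is left for the parent to consume.
def parse_body(tokens, contexts, errors)
  context = contexts.last
  children = []

  until tokens.empty?
    token = tokens.first

    if token == CLOSERS[context]
      tokens.shift
      return children
    elsif contexts[0...-1].any? { |outer| CLOSERS[outer] == token }
      errors << "missing #{CLOSERS[context].inspect} before #{token.inspect}"
      return children
    elsif (kind = OPENERS[token])
      tokens.shift
      contexts.push(kind)
      children << [kind, parse_body(tokens, contexts, errors)]
      contexts.pop
    else
      children << tokens.shift
    end
  end

  errors << "missing #{CLOSERS[context].inspect} at end of input" unless context == :top
  children
end

errors = []
# The array literal is never closed, but `end` belongs to the enclosing def.
tree = parse_body(["def", "[", "1", "2", "end"], [:top], errors)
```

Because the stack knows that `end` terminates the surrounding `def`, the unclosed array is reported once and the method definition still parses as a whole.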
00:09:32 Moving on to portability, as discussed previously, this refers to the ability to use this parser outside of the CRuby project.
00:09:51 The challenge here is that CRuby's parser cannot easily be used outside of CRuby itself. Consequently, many different efforts have arisen in the ecosystem to develop alternative Ruby parsers. The complexity of this landscape can be daunting.
00:10:12 These parsers span a range of runtimes and tools, each maintained as a separate project. Each time a new syntax feature is introduced, such as pattern matching in Ruby 2.7, every one of these parsers must be updated, consuming valuable community resources.
00:10:52 This results in lost energy that could have been better spent enhancing runtimes or creating innovative tools. Therefore, one of the key objectives of the YARP project is to consolidate parsers and efforts to significantly streamline Ruby tooling.
00:11:14 To achieve this, we have established that YARP will have no reliance on CRuby internals. We do not link against CRuby headers and we avoid using deeper internal structures.
00:11:32 Moreover, this parser is a handwritten recursive descent parser, meaning we avoid using tools like Bison or parser combinators. The intent is to make YARP easier to embed within other projects.
00:11:57 Standardizing the parser allows for shared logic for tasks such as unescaping strings. For example, when parsing a string with YARP, you receive not only the string's range within the source file, but also its unescaped value, with an escape sequence such as `\n` already converted into a real newline character.
00:12:28 Additionally, we offer APIs for the packing and unpacking of arrays and strings (the template strings passed to `Array#pack` and `String#unpack`), which effectively act as a mini DSL, underlining the shared functionality across runtimes.
00:12:56 The way that we share this functionality is through a serialization API, which I’ll briefly discuss. Code snippets pulled from TruffleRuby demonstrate how they integrate the YARP parser.
00:13:16 They pull in our top-level header, and through straightforward calls, a new buffer is initialized. The parser then parses the source and serializes the resulting AST into that buffer, in a format that can be consumed from other languages.
00:13:39 Using this API, both TruffleRuby and JRuby can use YARP, converting the AST into a format suitable for their internals without the cumbersome process of updating grammar files for their own implementation languages.
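To make the difference between the source range and the unescaped value concrete, here is a small sketch. The `unescape` helper and its `ESCAPES` table are hypothetical, handle only a handful of escapes, and are not YARP's unescaping logic.

```ruby
# Toy unescaper (hypothetical): supports only \n, \t, \\ and \".
ESCAPES = { "n" => "\n", "t" => "\t", "\\" => "\\", "\"" => "\"" }.freeze

def unescape(body)
  # `match` is the backslash plus the following character.
  body.gsub(/\\(.)/) { |match| ESCAPES.fetch(match[1], match[1]) }
end

# Single quotes: the backslash and the n are two separate characters,
# exactly as they would appear in a source file.
source = 'puts "foo\nbar"'

start  = source.index('"') + 1
finish = source.rindex('"')
range  = (start...finish)   # where the string body sits in the source

raw   = source[range]       # 8 characters, backslash and n included
value = unescape(raw)       # 7 characters, with a real newline
```

Every tool that consumes Ruby strings needs both pieces, which is why it helps to compute them once in a shared parser.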
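For readers unfamiliar with the mini DSL in question, these are the core Ruby methods `Array#pack` and `String#unpack`. The directive strings (`"C*"`, `"nN"`) are the small language that every runtime has to interpret the same way.

```ruby
# "C*" means: any number of 8-bit unsigned integers.
bytes = [82, 117, 98, 121].pack("C*")
codes = "Ruby".unpack("C*")

# "n" is a 16-bit and "N" a 32-bit unsigned integer, both big-endian.
header = [1, 2].pack("nN")
fields = header.unpack("nN")

puts bytes          # prints "Ruby"
puts codes.inspect  # prints [82, 117, 98, 121]
puts header.bytesize
puts fields.inspect # prints [1, 2]
```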
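The serialization idea can be sketched as a round trip through a flat byte buffer. The wire format below (a 10-byte header per node) and the names `WireNode`, `TYPES`, `serialize`, and `deserialize` are invented for this example and are not YARP's actual format or API. In practice the reading side would be written in the consumer's language, such as Java, but both sides are shown in Ruby here.

```ruby
# Hypothetical wire format, not YARP's actual one:
#   type (1 byte) | start offset (u32 BE) | length (u32 BE) | child count (1 byte)
# followed by the serialized children in order.
WireNode = Struct.new(:type, :start, :length, :children)
TYPES = %i[program call integer].freeze
HEADER_SIZE = 10

def serialize(node, buffer = String.new(encoding: Encoding::BINARY))
  header = [TYPES.index(node.type), node.start, node.length, node.children.size]
  buffer << header.pack("CNNC")
  node.children.each { |child| serialize(child, buffer) }
  buffer
end

# `cursor` is a one-element array so recursive calls share the position.
def deserialize(buffer, cursor = [0])
  type, start, length, count = buffer.byteslice(cursor[0], HEADER_SIZE).unpack("CNNC")
  cursor[0] += HEADER_SIZE
  children = Array.new(count) { deserialize(buffer, cursor) }
  WireNode.new(TYPES[type], start, length, children)
end

# A hand-built tree for the source "1 + 2".
tree = WireNode.new(:program, 0, 5, [
  WireNode.new(:call, 0, 5, [
    WireNode.new(:integer, 0, 1, []),
    WireNode.new(:integer, 4, 1, [])
  ])
])

bytes = serialize(tree)
copy  = deserialize(bytes)
```

The consumer only needs to understand the byte layout, not the grammar, which is what lets the grammar live in one place.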
00:14:40 The final motivation I wish to cover is maintainability, which encompasses readability, accessibility for contributions, and the ease of updating, among other elements.
00:15:06 In this respect, YARP has great potential. Because it is a handwritten recursive descent parser, making changes to specific parse nodes is relatively straightforward. As a result, adding new syntax will become more accessible, bolstering its overall maintainability.
00:15:40 Moreover, our documentation is robust, and we maintain a test suite for every parsed node. Developers can set breakpoints across the code and trace any issues rapidly without having to generate complicated configurations.
00:16:06 To enhance maintainability further, I would urge the Ruby community to contribute to this work. A larger team with diverse perspectives can lead to a more efficient development cycle.
00:16:26 Now, let's discuss some of the challenges we faced. Ruby has a remarkably expressive grammar with many implicit and explicit features.
00:16:50 The primary challenge we encountered was grammar ambiguity, where operators, keywords, and identifiers can yield significantly different interpretations based on their contextual placements.
00:17:20 For instance, the triple dot operator can signify argument forwarding or range creation, thus generating all sorts of contextual complexity. Similarly, the do keyword can delimit several constructs in Ruby, from block definitions to loop bodies, leading to further context-specific parsing challenges.
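The following plain Ruby snippet shows both ambiguities side by side: the same `...` and `do` tokens mean different things depending on where they appear. The method names `target` and `wrapper` are placeholders for this example.

```ruby
def target(*args, **options)
  [args, options]
end

# Here `...` forwards every argument to another method.
def wrapper(...)
  target(...)
end

forwarded = wrapper(1, 2, key: :value)

# Here `...` builds a range that excludes its end.
range = (1...4)

# `do` opening a block that is passed to a method call.
doubled = [1, 2, 3].map do |number|
  number * 2
end

# `do` as the optional delimiter after a loop condition.
count = 0
while count < 3 do
  count += 1
end
```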
00:17:48 Newlines, comments, and semicolons all function as terminators but depend heavily on context. For example, a newline at the end of an expression indicates termination, but subsequent newlines within the same block might get ignored.
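A few lines of ordinary Ruby show how much the meaning of a newline depends on what precedes it:

```ruby
# The newline after a binary operator is ignored: this is one expression.
a = 1 +
  2

# Here the newline terminates the assignment. The next line is a
# separate statement (unary plus applied to 2) whose value is discarded.
b = 1
+ 2

# A leading dot on the next line continues the method chain.
c = [3, 1, 2]
  .sort
  .first

# A comment ends the line, but the dangling operator still continues it.
d = 1 + # trailing comment
  2

# Semicolons terminate statements without any newline at all.
e = 1; f = 2
```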
00:18:19 Another difficulty arises in distinguishing between local variables and method calls, since the same identifier can be either one depending on the surrounding syntax. We can define local variables through regular assignment, capturing in pattern matching, or via named capture groups in regular expressions.
00:18:50 To effectively manage this, we have built our own regular expression parser as part of the YARP project, allowing us to parse and introduce local variables from named captures without linking to CRuby.
00:19:23 Furthermore, encoding presented an additional layer of complexity. In CRuby, which bytes are valid in an identifier depends on the source encoding, so we incorporated support for several encodings, allowing YARP to recognize identifiers correctly based on the file's magic encoding comment.
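This plain Ruby example shows the three ways of introducing a local variable mentioned here, and why a parser has to track them: the bare identifier `value` is a method call on one line and a local variable two lines later. It assumes Ruby 3.0 or later for the pattern matching part.

```ruby
def value
  :from_method
end

before = value        # no local named `value` yet, so this is a method call
value  = :from_local  # assignment introduces a local variable
after  = value        # the same identifier now reads the local

# Named captures introduce locals when the regex literal is on the
# left-hand side of =~, so the parser must look inside the regex.
if /(?<year>\d{4})-(?<month>\d{2})/ =~ "2023-05"
  captured = [year, month]
end

# Pattern matching can also bind new locals.
case { name: "YARP" }
in { name: }
  matched = name
end
```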
00:19:47 To summarize, the next steps for YARP involve a gem release on RubyGems for testing in closed-source projects and addressing ongoing bugs.
00:20:10 We are also working with a fork of CRuby, ensuring one-to-one compatibility with the lexer while refining the AST for streamlined usability.
00:20:41 Additionally, both the JRuby and TruffleRuby teams are actively exploring incorporating YARP into revised architectures. Many other tools can potentially benefit from this project.
00:21:05 Lastly, I want to talk about maintaining compatibility with Ripper, which provides several interfaces that we aim to fully implement.
00:21:22 Our test suite checks for one-to-one compatibility with `Ripper.lex`, ensuring a smooth transition for existing projects that rely on it, such as IRB or RDoc.
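For reference, this is what the standard library's `Ripper.lex` returns for a tiny input: one array per token with its position, event type, text, and lexer state. This token shape is what a compatible lexer needs to reproduce.

```ruby
require "ripper"

tokens = Ripper.lex("a = 1")

# Each token looks like [[line, column], type, text, state].
tokens.each do |(line, column), type, text, state|
  puts format("%d:%d %-9s %p %s", line, column, type, text, state)
end

types = tokens.map { |token| token[1] }
texts = tokens.map { |token| token[2] }
```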
00:21:43 As we move forward, the future work focuses on enhancing error tolerance, improving memory efficiency, and expanding our ecosystem of tools leveraging YARP's capabilities.
00:22:01 I would love to see a unified framework concerning parsing updates so that future syntax changes do not require widespread revisions among individual parser implementations.
00:22:25 This would provide a significant boost to Ruby tooling as a whole, ensuring that everyone benefits from advancements in the parsing process.
00:22:43 That's the current status of YARP, and I appreciate your attention. If you wish to contribute or ask questions, feel free to reach out to me on Twitter or engage with the Ruby community.
00:23:04 The YARP project is hosted on GitHub, and I'm excited about the potential advancements we can achieve together. Thank you once again for listening!