Parsing Ruby

by Kevin Newton

In the keynote speech titled 'Parsing Ruby' at RubyConf 2021, Kevin Newton explores the various tools and methodologies used to parse Ruby code and how these concepts can be applied to individual projects using the Ripper standard library. The presentation starts with a warm welcome and an acknowledgment of the audience, before diving into the complexities of parsing Ruby.

Key points discussed include:

- Foundation of Parsing: Newton emphasizes the importance of understanding the theoretical aspects of parsing, starting with defining a grammar, which is a syntactical representation of what is allowed in a language. He constructs a simple grammar that can handle numerical expressions and demonstrates how to expand it with operations such as addition and parentheses.

- Building a Parser: He illustrates the process of creating a parser that can tokenize input strings using lexical analysis and how these tokens are processed semantically through a series of shifts and reductions to create a valid syntax tree.

- Ruby's Parsing History: A significant portion of the talk focuses on the evolution of Ruby's parsing mechanisms, starting with early implementations and transitioning to the current parser generator systems. He details the transition from using Yacc to Bison in Ruby's core implementation.

- Introduction to Ripper: Newton provides insight into the Ripper standard library, designed for easy access to parsing events. He explains how Ripper allows developers to hook into tokens and rule reductions, making it easier to build commenting and syntax tree tools.

- Community Tools: The speaker highlights various parsing tools and libraries available for Ruby, such as the parser gem and ruby_parser, discussing their functionalities, limitations, and community support.

- Future Implications: Nearing the conclusion, Newton suggests that Ruby's pace of introducing new syntax may slow down. He advocates for a standardized parser that accommodates various Ruby implementations to maintain compatibility and ease of use.

Through the exploration of these topics, Newton aims to inspire developers to engage more deeply with parsing concepts to create innovative tools and applications within the Ruby ecosystem. The presentation concludes with a call to action for the community to rally around building more developmental tools, facilitated by a robust understanding of parsing techniques.

00:00:11.200 Hello everyone! My name is Kevin Newton, and I would like to formally welcome you all to the keynote session. I want to thank Matt for the introduction. Since I'm the first person speaking here in the keynote room, I feel privileged to kick off this live keynote.
00:00:22.560 I hope that joke landed well! You can laugh if you want; a little pity laugh would be appreciated. Anyway, as I mentioned, I'm Kevin Newton, and I work at Shopify on the YJIT team, alongside Aaron, Alan, Noah, and Maxime.
00:00:34.000 If you'd like to talk about that or any other topic, feel free to find me at the booth where the nerds with the green background are located. Now, I want to open up with a quick warning: I tend to speak pretty quickly when I'm nervous, and to be honest, I am a bit nervous. I've had a lot of caffeine today.
00:01:04.399 We're going to cover a somewhat complicated topic, and I've spent hours trying to make it accessible for everyone. So, if you're a junior developer, please don't leave the room. And if you're a senior developer, please stay with me—there's content in here for you as well. It might not start from the very basics, but it will be present.
00:01:21.920 Today, I want to talk to you about parsing Ruby. We'll explore how Ruby code has been parsed over time and how we can go from plain source text to a structure we can work with. To do this effectively, I need to backtrack a bit and discuss the fundamentals underlying these concepts to provide you with a comprehensive understanding.
00:01:58.240 Here’s the game plan: we will build a grammar for a simple language, and I'll explain what that means in a moment. We'll then build a parser for this language, look at the history of the Ruby parser, and examine how it has evolved over time. Finally, we'll investigate how Ripper works, which is a standard library used to gather information during the parsing process.
00:02:09.759 To start, we need to build a grammar. A grammar is a syntactical representation of what is allowed in a given language. In this context, 'language' is a broad term; it's not limited to English or Ruby, but refers to any set of constructs that can form a series of tokens.
00:02:40.160 For instance, if we're looking at a language that only accepted a single number, the grammar would look like this: a program points to a number. Although this is slightly more complicated than necessary—it could simply state that a program consists of a number—I'm doing this for illustrative purposes. The program serves as our root node, indicating that the only acceptable entity in this grammar is a number.
00:03:01.040 The number itself is a terminal token: it appears directly in the input and is not expanded by any further rule, whereas a non-terminal such as the program is defined in terms of other symbols. Essentially, this grammar accepts a single number token. I realize I've mentioned 'token' multiple times; this setup accepts tokens like 1, 2, or 7, but is not extensive enough for us to perform operations, so I'll expand it a bit.
00:03:25.680 Now, when we introduce addition, we begin accepting expressions like 'number plus number' or just an individual number. This change allows us to accept expressions like '1' or '1 + 2', but not '1 + 2 + 3'. The reason for this is that there is no recursive structure yet.
00:03:56.080 To enable recursion, we need to extend our grammar further. This adjustment creates a left recursive structure—essentially, the expression node in the tree points back at itself—allowing for more complexity. We can accept expressions like '1', '1 + 2', and even '1 + 2 + 3'. With this change, we can add more rules to encompass additional operations like subtraction.
00:04:02.480 At this stage, we're defining the nature of expressions and terms in our grammar. A term can be a single number or an expression defined with basic arithmetic operations. We're also considering operator precedence, which dictates how expressions are evaluated. Lastly, we will include the concept of parentheses, allowing for complex nested expressions.
00:04:45.680 A language with parentheses lets us override precedence, so we can express operations like '(1 + 2) * 3'. In our grammar, an expression will either be a single term or a composition of terms connected by operations, and we can wrap these expressions in parentheses to indicate precedence. Thus, we have a complete grammar to guide our parsing process.
00:05:34.320 Next, we need to construct a parser capable of interpreting this grammar. The grammar we've established is an abstract concept, and now it's time to implement it. Imagine having a source file that contains simple expressions, for example, a .numbers file or any language you prefer.
00:05:52.639 We'll loop over this source file, processing it in Ruby. The goal is to handle it until the input string is empty. We'll utilize regular expressions to identify numbers and operators, skipping whitespace as needed, and yielding individual tokens, such as number tokens or operator tokens, based on the patterns we find in the source.
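The loop described here can be sketched with `StringScanner` from Ruby's standard library. The method name `tokenize` and the `[type, value]` token shape are assumptions chosen for illustration, not the code shown in the talk.

```ruby
require "strscan"

# Hypothetical tokenizer: yields [type, value] pairs until the input is used up.
def tokenize(source)
  return enum_for(:tokenize, source) unless block_given?

  scanner = StringScanner.new(source)
  until scanner.eos?
    if scanner.skip(/\s+/)
      # Whitespace carries no meaning in this language, so it is dropped.
    elsif (text = scanner.scan(/\d+/))
      yield [:number, text.to_i]
    elsif (text = scanner.scan(%r{[-+*/]}))
      yield [:operator, text]
    elsif (text = scanner.scan(/[()]/))
      yield [:paren, text]
    else
      raise ArgumentError, "unexpected character #{scanner.peek(1).inspect} at offset #{scanner.pos}"
    end
  end
end

p tokenize("1 + 2").to_a
# => [[:number, 1], [:operator, "+"], [:number, 2]]
```

Returning an enumerator when no block is given lets callers either stream tokens one at a time or collect them into an array.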
00:06:21.360 At this point, we've scanned the source and identified its tokens, forming what we refer to as a 'token stream'. A token stream is simply a list of tokens produced by analyzing the source. However, while we've accomplished our lexical analysis and created tokens, we don't yet have semantic meaning: nothing tells us how these tokens relate to one another.
00:07:01.680 To advance, we will run through our grammar and start accepting input. This is where the concept of a semantic stack comes into play. This stack will hold our tokens as we shift and reduce them, essentially correlating them with their meanings according to our established grammar. As we shift tokens, we recognize their equivalence within the grammar, allowing us to reduce them until we achieve a full syntactic structure that represents a valid program.
00:07:56.560 As this process continues, we transform our input tokens through shifting and reducing operations until we arrive at the final grammar structure: a complete program. Each shift and reduction helps us build up a tree structure that corresponds to the relationships illustrated in our grammar.
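The shifting and reducing described above can be sketched by hand. This is a simplified operator-precedence variant, not the table-driven algorithm a parser generator produces, and it leaves out parentheses to stay short. The token shape, the `PRECEDENCE` table, and the method names are assumptions for illustration.

```ruby
# Hypothetical token shape: [:number, Integer] or [:operator, String].
PRECEDENCE = { "+" => 1, "-" => 1, "*" => 2, "/" => 2 }.freeze

def expression?(entry)
  !entry.nil? && %i[number binary].include?(entry.first)
end

# Reduce when the top of the stack is "expression operator expression" and the
# lookahead is either absent or an operator that does not bind more tightly.
def reduce?(stack, lookahead)
  return false unless stack.size >= 3

  left, operator, right = stack.last(3)
  return false unless expression?(left) && operator.first == :operator && expression?(right)

  lookahead.nil? ||
    (lookahead.first == :operator && PRECEDENCE.fetch(lookahead[1]) <= PRECEDENCE.fetch(operator[1]))
end

def parse(tokens)
  stack = []
  queue = tokens.dup

  loop do
    if reduce?(stack, queue.first)
      left, operator, right = stack.pop(3)
      stack.push([:binary, operator[1], left, right]) # reduce three entries to one node
    elsif queue.any?
      stack.push(queue.shift) # shift the next token onto the stack
    else
      break
    end
  end

  unless stack.size == 1 && expression?(stack.first)
    raise ArgumentError, "input does not form a single expression"
  end

  stack.first
end

tokens = [[:number, 1], [:operator, "+"], [:number, 2], [:operator, "*"], [:number, 3]]
p parse(tokens)
# => [:binary, "+", [:number, 1], [:binary, "*", [:number, 2], [:number, 3]]]
```

Reducing on equal precedence is what makes the operators left associative, so `8 - 3 - 2` groups as `(8 - 3) - 2`.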
00:08:46.160 Because this entire operation is repetitive and language agnostic, parser generators emerged in the late '80s and early '90s as solutions. These programs take a grammar and the associated actions in your parsing process, effectively taking the burden of shifting and reducing off your shoulders. In Ruby, when you're building a parser, the preferred generator is called Racc.
00:09:30.720 Racc lets you declare operator precedence and attach actions that are executed when a rule is reduced. This way, you can immediately evaluate expressions when the necessary grammar rules are matched. This implementation makes it easier to work with the Ruby abstract syntax tree (AST), which reflects the structure of Ruby itself.
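To illustrate what "actions executed during the reduction of rules" means, here is a plain-Ruby stand-in. It does not use Racc's grammar-file syntax or API; the `RULES` table and `evaluate` method are hypothetical. Each operator carries a precedence and a lambda that runs at the moment its rule is reduced, so the result is computed without building a tree first.

```ruby
# Hypothetical table: operator => [precedence, action run when the rule is reduced].
RULES = {
  "+" => [1, ->(left, right) { left + right }],
  "-" => [1, ->(left, right) { left - right }],
  "*" => [2, ->(left, right) { left * right }],
  "/" => [2, ->(left, right) { left / right }]
}.freeze

# Tokens are assumed to be Integers and operator Strings, e.g. [1, "+", 2].
def evaluate(tokens)
  stack = []
  queue = tokens.dup

  loop do
    left, operator, right = stack.last(3)
    lookahead = queue.first
    ready = stack.size >= 3 && left.is_a?(Integer) && RULES.key?(operator) && right.is_a?(Integer)

    if ready && (lookahead.nil? || (RULES.key?(lookahead) && RULES[lookahead][0] <= RULES[operator][0]))
      stack.pop(3)
      stack.push(RULES[operator][1].call(left, right)) # run the action on reduce
    elsif lookahead
      stack.push(queue.shift)
    else
      break
    end
  end

  unless stack.size == 1 && stack.first.is_a?(Integer)
    raise ArgumentError, "input does not form a single expression"
  end

  stack.first
end

p evaluate([1, "+", 2, "*", 3]) # => 7
```

With a real generator, the precedence declarations and the action bodies would live in the grammar file, and the shift/reduce loop would be generated for you.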
00:10:17.440 Historically, the parser generator used in Ruby is called Yacc, which stands for 'Yet Another Compiler Compiler'. It was already well established by 1993, when the early development of Ruby began. At that time, some notable changes were being made to the structure of Ruby itself, including modifications to the basic syntax. These changes became essential as Ruby evolved into the language we know today.
00:11:05.120 Over time, Ruby has implemented various enhancements and adjustments that reflect its unique characteristics. For instance, Ruby's syntax has been compared to both Python and C++. Changes in early versions allowed for the introduction of hash literals and the correction of syntactical errors, such as misspellings in keywords.
00:11:49.680 As Ruby matured into version 1.0, its syntax began to resemble the more recognizable form we see today. Features like access to singleton classes, specific regex flags, and the introduction of keywords for true and false values showcase the growing intricacies embedded within Ruby's syntax. The changes laid the groundwork for subsequent versions, ultimately leading to the Ruby we use today.
00:12:47.440 Throughout the years, additional features were added, including binary number literals and rescue modifiers. Ruby 1.9 then moved the language onto a bytecode interpreter. Other significant milestones included the release of alternative implementations like JRuby and Rubinius, which reimagined how Ruby's parser could function in different environments.
00:13:35.520 As Ruby continued to evolve, many projects sought to capture the Ruby AST (Abstract Syntax Tree) in various forms, expanding its utility beyond execution. These efforts indicate the flexibility of Ruby's parser and the growing ecosystem surrounding language tooling. Yet, sustaining compatibility and ensuring consistent performance across all implementations has proven challenging.
00:14:23.280 Ruby 1.9 saw considerable advancements, including the merging of Ripper into the standard library. Ripper serves to provide a bridge between Ruby's parsing mechanism and external needs, allowing developers to gain valuable insights into the code structure. This functionality facilitates the development of syntax analysis tools and encourages further language experimentation.
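As a quick illustration of the kind of insight Ripper exposes, the sketch below uses two of its class methods, `Ripper.lex` and `Ripper.sexp`. The helper names `token_types` and `syntax_tree` are hypothetical wrappers added for this example.

```ruby
require "ripper"

# Ripper.lex returns one entry per token: [[line, column], type, text, state].
def token_types(source)
  Ripper.lex(source).map { |token| token[1] }
end

# Ripper.sexp returns the syntax tree as nested arrays (nil on a syntax error).
def syntax_tree(source)
  Ripper.sexp(source)
end

p token_types("1 + 2")
# => [:on_int, :on_sp, :on_op, :on_sp, :on_int]

p syntax_tree("1 + 2")
# => [:program, [[:binary, [:@int, "1", [1, 0]], :+, [:@int, "2", [1, 4]]]]]
```

The `[1, 0]` and `[1, 4]` pairs are line and column positions, which is what makes this output useful for tools that need to point back at the source.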
00:15:05.440 As we look back on the evolution of Ruby's parsing capabilities, we recognize the significance of Ripper in enabling developers to access and manipulate Ruby's internal grammar patterns. Ripper now operates as a standard library, demonstrating Ruby's flexibility in adapting various paradigms while encouraging ongoing community engagement.
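Beyond the class methods, Ripper can be subclassed so that your own code runs for each token and each reduced rule. The sketch below is a minimal example under that model: the `Calculator` class is hypothetical, and only the `on_*` hook names come from Ripper. It handles just integers, binary operators, and parentheses.

```ruby
require "ripper"

# Hypothetical subclass: each on_* method is called by Ripper during parsing,
# and its return value is passed along to the enclosing rule's handler.
class Calculator < Ripper
  # Scanner event: fired for every integer token.
  def on_int(token)
    token.to_i
  end

  # Parser event: fired when a binary-operator rule is reduced.
  def on_binary(left, operator, right)
    left.public_send(operator, right)
  end

  # Parser event: parentheses wrap a list of statements; keep the last value.
  def on_paren(contents)
    Array(contents).last
  end

  # Parser events that collect statements into a list.
  def on_stmts_new
    []
  end

  def on_stmts_add(statements, statement)
    statements << statement
  end

  # Parser event for the root rule; its result is what #parse returns.
  def on_program(statements)
    statements.last
  end
end

p Calculator.new("1 + 2 * 3").parse # => 7
```

Because the handlers return computed values instead of tree nodes, precedence is handled entirely by Ruby's own grammar: the multiplication is reduced before the addition.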
00:15:54.240 In conclusion, today we've explored the journey of Ruby's parser and the pivotal role of tools supporting language parsing. Ripper stands as an invaluable resource for developers, providing essential functionalities to dissect and interact with Ruby's syntax. I hope this talk inspires you to delve deeper into creating tools that enhance Ruby's capabilities further and that we can collectively contribute towards building a richer Ruby ecosystem.
00:16:40.560 Thank you all for attending. I appreciate your time and attention, and I'm happy to open the floor for any questions or discussions.