00:00:06.359
Video equipment rental costs paid for by PeepCode.
00:00:19.359
Good afternoon, everybody! Today, I am going to talk about yet another Domain Specific Language (DSL). This one will help you with parsing.
00:00:24.439
Let me first share a bit of the history of how I got to the point of writing this package. Many of you may have been in the same boat in the past, having written custom parsers by hand.
00:00:35.640
Typically, these parsers are referred to as recursive descent parsers, but my experience was with Perl. Eventually, I got tired of that and started exploring different tools, one of which was ANTLR for parser generation.
00:00:49.079
However, that still wasn't good enough. What I was looking for was a parser generator that used the same language for generating the parser instead of requiring a separate side file for BNF or other formats.
00:01:02.039
I also wanted something that would unify the lexer and the parser. Tools like Racc and Yacc (for parsing) and Lex (for lexing) are separate tools, and I wanted to create something unified. In my search, I landed on Ruby because it has the right mix of features for what I needed: operator overloading, blocks, and duck typing.
00:01:24.600
This is my fourth rewrite of this package. First, I want to demonstrate how the grammar works to create a parser. To parse whatever language you need, you start with grammar objects. A grammar is essentially just a Ruby object that holds information for the language you want to parse.
00:01:43.240
To build a grammar for the language you want to parse, you construct it piece by piece, starting from the lowest level and building up. You simply use regular Ruby operators and methods, along with a few blocks to get your actions in place. You generate this grammar as a representation of the language you wish to parse. Then, a grammar engine takes that grammar and generates a concrete parser that will parse your input.
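To make that bottom-up construction concrete, here is a minimal sketch of the idea, not the package's actual API: each grammar piece is a plain Ruby object, and operator overloading (`|` for alternation, `+` for sequences) combines small pieces into larger grammars. The names here (`Gram`, `char`) are hypothetical.

```ruby
# Illustrative sketch: grammar pieces as Ruby objects combined with operators.
class Gram
  def initialize(&fn)
    @fn = fn                    # fn.call(str, pos) -> [value, next_pos] or nil
  end

  def parse(str, pos = 0)
    @fn.call(str, pos)
  end

  def |(other)                  # alternation: try self, then fall back to other
    Gram.new { |s, p| parse(s, p) || other.parse(s, p) }
  end

  def +(other)                  # sequence: self followed by other
    Gram.new do |s, p|
      first = parse(s, p) or next nil
      second = other.parse(s, first[1]) or next nil
      [[first[0], second[0]], second[1]]
    end
  end

  def self.char(range)          # leaf: match one character from a range
    Gram.new { |s, p| [s[p], p + 1] if s[p] && range.include?(s[p]) }
  end
end

digit  = Gram.char("0".."9")
sign   = Gram.char("+".."+") | Gram.char("-".."-")
signed = sign + digit
```

The point is only that a handful of operator methods is enough to express a grammar in plain Ruby, starting from character-range leaves and building up.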
00:03:01.680
This entire structure consists of Ruby objects and methods. To contrast this with most commonly used parsers, which typically rely on regular expressions, let me show you a table that illustrates the equivalence between them. The left side features regular expressions that use various symbols and operations to work with strings, while on the right, we have the grammar approach using plain Ruby.
00:03:48.200
With a grammar, you don’t put the definitions into a string or a file; rather, you use a DSL built on Ruby's own operators. For example, I use the vertical bar for alternation, the plus sign for sequences, and a method for lookahead assertions. At the base of this structure are the leaves of the grammar that you start with and build up from, typically character ranges.
00:05:10.520
Next, let's discuss the input sequence. It can be anything; it just needs to resemble an external iterator. I'm using a brackets method to retrieve the next item in the sequence, which can be characters—like those used in regular expressions—or tokens resulting from lexing. The lowest level of this grammar contains what's called an element, which takes tokens from the input stream and compares them with a pattern you provide.
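A small sketch of that contract (the helper names here are hypothetical): the engine only needs something that answers a brackets call with the next item, so a lambda wrapped around a `StringIO` works as a character-level input sequence.

```ruby
require "stringio"

# Wrap any source as an external iterator: calling get_c[] returns the next
# item, or nil at the end of input.
def char_stream(string)
  io = StringIO.new(string)
  lambda { io.getc }          # a lambda answers [] with its call result
end

get_c = char_stream("hi")
```

Because the engine only relies on `[]`, the same interface could just as easily yield tokens from a lexer instead of characters.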
00:06:22.400
The pattern can be anything, but it has to work with whatever is in the input stream. The output of the parsing process varies; other parser generators typically produce abstract syntax trees, but I might describe my parsing process in terms of an abstract syntax stream. Instead of generating a tree, this engine produces a sequence of outputs. The default output stream mirrors the input stream, and various methods allow discarding or redirecting this output.
00:07:11.800
For example, you can apply a discard method to any grammar, which prevents it from reflecting the input in the output stream. You can also redirect it to a temporary buffer where you can perform actions as needed. Additionally, the group method helps create hierarchical structures within your grammar, forming an abstract syntax tree.
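An illustrative sketch of that output-stream idea (hypothetical helpers, not the package's API): a plain piece copies matched tokens to the output, a discard variant matches but emits nothing, and a group variant redirects a sub-parse into a temporary buffer that becomes a single tree node in the output.

```ruby
# Copy a matched token to the output stream.
def copy(token, out)
  out << token
end

# Match the token but emit nothing (the discard behavior).
def discard(token, out)
  nil
end

# Redirect a sub-parse into a temporary buffer, then emit one grouped node.
def group(tokens, out)
  buffer = []
  tokens.each { |t| copy(t, buffer) }
  out << buffer
end

out = []
copy("x", out)
discard("-", out)
group(["2", "3"], out)
```

After these three calls, the output stream holds `"x"` and one grouped node for the digits; the discarded `"-"` never appears, which is how hierarchical structure emerges from a flat stream.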
00:08:07.919
I want to provide some simple examples to illustrate these concepts. Here's a unit test for the grammar. At the top, I create some helper methods, like the input method that prepares the input into the format I need for my input iterator. For instance, I convert a string into a StringIO object, and the get_c method returns something resembling a lambda that retrieves the next token from my string input.
00:09:45.360
In this first example, I parse a single digit. The grammar simply specifies a character range of 0 to 9. Subsequently, I demonstrate parsing the first digit in a string and appending to the previous output. I also display some invalid conditions based on this parsing structure.
00:10:35.800
Building on that, we move to a sequence of digits. To achieve this, I implement a method called repeat_one, which allows for one or more digits. This time, I assert that the digits must be followed by an end-of-file (EOF) assertion. Therefore, an error occurs if the sequence includes unexpected characters that are not valid.
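The same behavior can be sketched with a hand-rolled check (`repeat_one` is the talk's method name; the code below is just the idea, with a hypothetical helper): require one or more digits, then assert end of input.

```ruby
# Parse one-or-more digits, then require end of input; raise otherwise.
def parse_digits!(string)
  chars = string.chars
  digits = []
  digits << chars.shift while chars.first&.match?(/[0-9]/)
  raise "expected at least one digit" if digits.empty?          # repeat-one
  raise "unexpected input: #{chars.first}" unless chars.empty?  # EOF assertion
  digits
end
```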
00:11:35.800
Furthermore, grammars can also connect the output of one grammar to the input of another. A grammar remains agnostic about the elements in these input and output streams, meaning any grammar can act as a parser or a lexer. I have one method called supply, which connects a single-token grammar (acting as a lexer) to a parser represented by another grammar. Additionally, there is a pipe method that supports multi-threaded lexer-parser combinations.
00:12:40.000
You can chain these processes together, so a pre-processor could feed a lexer, which in turn feeds a parser. Each step would take an input stream, process it, and generate an output stream, gradually refining the structure based on your requirements.
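One stage of such a chain can be sketched with a plain `Enumerator` (the stage shape is my illustration, not the package's API): a lexer consumes a character stream and yields a refined token stream, which a parser grammar could then consume the same way.

```ruby
# Turn a character stream into a token stream of [:int, n] and [:op, str].
def lexer(chars)
  Enumerator.new do |out|
    buf = +""
    chars.each do |c|
      if c =~ /[0-9]/
        buf << c                                  # accumulate digit runs
      else
        out << [:int, buf.to_i] unless buf.empty? # flush a pending integer
        buf = +""
        out << [:op, c] unless c =~ /\s/          # pass operators through
      end
    end
    out << [:int, buf.to_i] unless buf.empty?     # flush the final integer
  end
end

tokens = lexer("12 + 34".each_char)
```

Each stage takes an input stream and produces an output stream, so stacking a pre-processor in front of this lexer, or a parser behind it, is just more of the same.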
00:13:16.120
I want to talk about the type of parser this project implements, known as an LL parser. If you've written your own parser, you'll find that they typically parse from left to right, and they are often LL parsers as well. What I'm essentially doing is generating a parser that behaves like a hand-written one, parsing one character at a time, though you might use regular expressions for a lookahead in some instances.
00:14:10.160
Yet, my approach includes a backtrack method, allowing you to look further ahead when making parsing decisions. This effectively provides you with unlimited lookahead capability. Moving on to recursion, another fascinating aspect of the grammar, you can create a grammar that exhibits recursive structures. You can define inner and outer grammar sections, where inner represents the recursive part and outer serves as the culmination of that recursion.
00:15:43.400
Whenever you hit a recursion point, it refers back to the complete grammar. I also look at the positioning of the inner grammar to apply optimizations. If inner is in the middle of a parsing sequence, plain recursion is used; however, if it appears at the end of a structure without any following elements, I implement a loop, akin to tail-call optimizations used in compilers. This prevents excessive recursion that could cause the stack to explode during parsing and provides a significant speed advantage.
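The two shapes can be sketched side by side (hypothetical helpers, parsing a comma-separated list of items): when the recursion point is the last thing in the structure, the recursive form and the loop form accept exactly the same input, but the loop never grows the stack.

```ruby
# Recursive form: the recursion point sits at the end of the sequence.
def items_recursive(tokens, acc = [])
  acc << tokens.shift                 # parse one item
  if tokens.first == ","
    tokens.shift
    items_recursive(tokens, acc)      # tail position: one frame per item
  else
    acc
  end
end

# Loop form: the same grammar, with the tail call turned into iteration.
def items_loop(tokens)
  acc = [tokens.shift]
  while tokens.first == ","
    tokens.shift
    acc << tokens.shift
  end
  acc
end
```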
00:17:27.200
Left recursion is another interesting topic. Traditionally, left recursion is linked to parsers like Racc or Yacc, commonly known as LR parsers. However, my parser operates as an LL parser while incorporating left recursion. I believe there are very few LL parsers capable of handling left recursion effectively. To explain how I achieve this: when a grammar involves left recursion, I divide it into two segments. The first segment handles the non-recursive alternative, while the second evaluates the recursive part through a loop.
00:18:30.920
For a typical LL parser, the grammar would need to be restructured to eliminate left recursion, but with this LL parser you can write the grammar naturally with left recursion, which is often the most straightforward approach.
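The split-into-two-segments idea can be sketched for the classic left-recursive rule `expr -> expr '-' num | num` (plain hand-written code, standing in for the grammar transformation): parse the non-recursive alternative first, then fold the recursive part in a loop. Left associativity is preserved, so `8 - 3 - 2` is `(8 - 3) - 2`.

```ruby
# LL-style handling of a left-recursive subtraction rule.
def expr(tokens)
  value = Integer(tokens.shift)      # the non-recursive alternative
  while tokens.first == "-"          # the left-recursive part, as a loop
    tokens.shift
    value -= Integer(tokens.shift)
  end
  value
end
```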
00:19:10.280
Let's revisit the digits parsing example, expanding it a bit. Previously, when parsing the digits '23', it returned the individual characters '2' and '3', which wasn't incredibly useful. Instead, I want it to convert that to an integer. In this case, I implemented a group method to push the results of grammar parsing into a block, converting it into an integer to achieve the desired result.
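A sketch of that group-plus-action idea (the helper name is hypothetical): collect the digits a sub-grammar matched into a temporary buffer, then let an action block replace them with a single computed value on the output stream.

```ruby
# Collect matched digits into a buffer, then emit one value via the block.
def group_digits(chars, out)
  buffer = []
  buffer << chars.shift while chars.first&.match?(/[0-9]/)
  out << yield(buffer)        # the action block folds characters into a value
end

out = []
group_digits("23".chars, out) { |digits| digits.join.to_i }
```

Instead of the flat characters `"2"` and `"3"`, the output stream now carries the single integer 23.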
00:20:36.640
We can expand the grammar further to include recursion and parentheses. By encountering parentheses, we can recurse back into the original expression, ensuring that we can parse complex structures. This showcases an error condition as well; if there’s a mismatch in parentheses, that should be flagged.
00:21:27.360
Next, let's add unary operators. If we encounter a negative sign, we may want to discard it during parsing, since it's not needed in the output stream. I also show how to negate the last output if necessary, or treat it as a primary number instead.
00:22:50.760
This setup allows us to recognize left recursion where the product appears on the left side. By observing this structure, we ensure that if we see a product followed by another unary operator, it calculates the product accordingly and adds it back to the output stream.
00:23:54.879
Ultimately, we reach our final expression grammar that combines sums, products, and potential unary expressions, so the grammar reflects a complete mathematical expression including terms, products, possible unary negatives, and parentheses.
00:25:07.680
Here's a brief example of consuming the expression, where I evaluate it in its entirety. This setup grants flexibility, as you could choose to output a tree structure instead; the actions I defined have made it evaluate directly.
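The whole expression language can be sketched as a compact hand-written evaluator (plain recursive descent, standing in for the grammar the talk builds): sums, products, unary minus, and parentheses, evaluated directly rather than built into a tree.

```ruby
# Evaluate an arithmetic expression with +, -, *, /, unary minus, and parens.
def evaluate(source)
  tokens = source.scan(%r{\d+|[-+*/()]})
  expr = nil
  primary = lambda do
    if tokens.first == "("
      tokens.shift
      v = expr.call                          # recurse back into expressions
      raise "missing )" unless tokens.shift == ")"
      v
    elsif tokens.first == "-"
      tokens.shift
      -primary.call                          # unary minus
    else
      Integer(tokens.shift)                  # a plain number
    end
  end
  product = lambda do
    v = primary.call
    while %w[* /].include?(tokens.first)
      op = tokens.shift
      v = op == "*" ? v * primary.call : v / primary.call
    end
    v
  end
  expr = lambda do
    v = product.call
    while %w[+ -].include?(tokens.first)
      op = tokens.shift
      v = op == "+" ? v + product.call : v - product.call
    end
    v
  end
  expr.call
end
```

As in the talk, the same structure could just as well emit a tree: replacing the arithmetic in the action positions with node construction is the only change needed.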
00:26:15.240
Going back to the grammar engine, it acts as a visitor that traverses the tree of grammars. While traversing the grammar, it plays a role in generating a parser. I offer several engines to accomplish this. One engine is Ruby Zero, which performs parsing without generating a separate parser—this was my initial prototype.
00:27:48.000
Other engines I developed include a Ruby parser, which generates parser code for later use, and one that compiles down to C for performance improvements. I believe we can explore the possibilities further, creating engines that generate C or C++ code.
00:28:15.760
Furthermore, the grammar layer is lightweight—it’s ultimately a user-defined layer that facilitates creating grammars without burdening the generated engine, which can be of a different type entirely.
00:29:03.680
You may have noticed that the action blocks look like standard Ruby code, yet they behave differently in terms of execution. These action blocks generate code dynamically rather than executing right away. When, for example, I call a simple method, it doesn't execute at that moment; rather, it generates the corresponding code to execute later.
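A loose sketch of that behavior (this simplification just captures blocks as lambdas; the real engines go further and emit code): the block attached at grammar-definition time is stored, not run, and fires only when the generated parser later matches.

```ruby
# Definition time: store the action block instead of executing it.
actions = []
attach_action = lambda { |&block| actions << block }

attach_action.call { |digits| digits.join.to_i }   # stored, not executed

# Parse time: the stored action finally runs against matched input.
result = actions.first.call(["4", "2"])
```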
00:30:29.100
As a result, utilizing methods and blocks liberally adds flexibility. Initially, I attempted to create versatile methods that could handle multiple tasks, but I discovered it to be more efficient to maintain simple methods that do one thing well, while ensuring ease of debugging.
00:31:39.760
To achieve the performance numbers I've obtained, I focus on minimizing object creation and method calls. The parsers generated are nearly entirely flat, which boosts performance. Adhering to the principle of staying simple allows parsing of common structures without overhead.
00:32:19.920
In previous examples, we noted that parsing expressions generates clean, readable code, but care must be taken to ensure efficiency and clarity. This line demonstrates that the numeric characters are returned as digit values rather than raw ASCII character codes.
00:34:00.640
I should note that while smaller expressions can be handled easily, the overall parsing capacity remains high; the engine can process large streams with ease. For instance, I can handle gigabytes of input and manage memory accordingly without generating unnecessary overhead.
00:36:00.720
Finally, I want to extend my gratitude for the opportunity to share this work with you. As we advance, I am eager to hear from all of you with any questions you may have regarding these concepts or implementations.
00:37:09.280
I would like to thank PeepCode for providing the equipment rental costs.