Ruby on Ales 2015
Time Flies Like an Arrow, Fruit Flies Like a Banana: Parsing For Fun and Profit
Summarized using AI


by Hsing-Hui Hsu

In the talk titled "Time Flies Like an Arrow, Fruit Flies Like a Banana: Parsing For Fun and Profit," presented by Hsing-Hui Hsu at Ruby on Ales 2015, the main subject centers on parsing in both natural and computer languages. The speaker discusses how humans instinctively parse sentences, using grammar rules and context to derive meaning, and draws parallels to how parsers function in programming. Key points include:

  • Understanding Parsing: Parsing is the process of analyzing a string of symbols or words in order to understand its structure and meaning.
  • Garden Path Sentences: The speaker explains what Garden Path sentences are—sentences that lead readers to initially interpret them incorrectly due to ambiguous word placements, such as "The old man the boat."
  • Grammar Structures: Using examples, Hsu illustrates how sentences can be broken down into their grammatical components and how structure can be represented using parsing strategies, similar to computer language parsing.
  • Introduction to BNF: The talk introduces Backus-Naur Form (BNF), a notation for expressing the grammar of a language, highlighting how non-terminal and terminal symbols are used to form production rules.
  • Computer Language Parsers: The speaker explains how code is parsed by a lexer that produces tokens, which are then passed into a parser that builds syntax trees, a process essential for compilers and interpreters.
  • Practical Applications of Parsing: Hsu discusses various applications, from validating strings like URLs to converting documents from one format to another. The talk also highlights cases where parsers are preferable to regular expressions.
  • Parser Generation: The talk presents tools like yacc, bison, and Racc that help generate parsers automatically from defined grammar rules, illustrating how complex parsing tasks can be simplified.
  • Language Hierarchies: Lastly, the classification of languages into types based on their grammar complexity according to the Chomsky hierarchy is addressed, distinguishing between regular languages and context-free grammars, showcasing the limitations of regular expressions.

In conclusion, the session emphasizes that a better understanding of how parsing works—both in natural language and programming—can empower developers to implement more effective parsing tools in their code, thereby resulting in improved accuracy and performance.

00:00:29.560 Our next speaker is Hsing-Hui Hsu. It took me several tries over the course of knowing Hsing-Hui to get that name right. I'm really good at mispronouncing names, as it turns out. You will recognize her because she has too many 'H's on her name tags. In fact, her Twitter handle is 'SoManyHs,' if I am correct. Hsing-Hui is a Rubyist who works primarily in Rails these days, and she is particularly interested in parsing. This is probably due to the fact that she speaks four languages.
00:01:04.478 Can anyone guess the four languages? French is one of them. Someone in the audience already knows all the languages. Yes, she speaks Mandarin Chinese, French, and English. If you speak Cantonese, do not talk to her; you are not welcome here. She is a Mandarin speaker. But there is one last language that's super tough to guess—Dutch! Cheater! You're such a cheater. Nobody just guesses Dutch on the first try. It’s not a common thing. So, she speaks Dutch, Chinese, English, French, and Ruby, and she's here to talk to you about parsing. Please give a warm round of applause.
00:01:47.280 Hello! Wait, is this on? There we go! Okay, hello! Oh God, yes, okay, good. My name is Hsing-Hui, and as Jonan said, it doesn't look anything like it sounds. There are a lot of 'H's that are all silent, so I go by 'SoManyHs.' Today, I am going to talk about parsing. A little explanation about the title of the talk: I probably first started thinking about parsing when I was teaching English as a foreign language and had to explain why so many words in English have multiple meanings, like 'flies.' Depending on the meaning of that word, the sentence can go in very different directions.
00:02:27.080 When I first started looking into how computer languages are parsed, I noticed that there are actually a lot of similarities in how humans parse sentences in the languages they hear and read every day. I'm a relative newcomer to the programming world. I never actually learned about parsers or compilers or anything in school. So, I actually also have an alt title for this talk: 'How I Accidentally Became a Computer Scientist and So Can You.' Of course, this is not going to be a comprehensive talk on all parsers, but I wanted to share how I came to learn about them because, honestly, I really had no idea what they were until a couple of months ago and feel like I came into them by accident.
00:03:02.080 Rewinding a few months back, I had spent a year going through a program called Ada Developers Academy, where I learned, among many other things, how to build Rails apps. We were cranking out new apps every couple of weeks. One day, I got curious about how things actually worked underneath all that Rails magic, specifically how routing worked. I went to GitHub and started poking around some of Rails' source code, and I came across a file called parser.rb. I was like, okay, cool! I had written a parser to better understand what a language interpreter does before, so I thought I probably knew what it would look like. It turned out it was nothing like I expected; I didn't even recognize it.
00:03:38.640 I was like, okay, I need to look a little deeper. Then I saw that this file was actually generated from a different file called parser.y. I was like, okay, let's look at that. Still didn't make any sense—it barely even looked like any kind of Ruby code that I recognized. I was trying to figure out the best way to get myself to learn this, and I've been told that a great yet terrifying way to make yourself learn something is to submit a talk on it. I was like, okay, sure! What's the worst that can happen? If I submit a talk, either I get accepted and have to talk about it, or I get rejected.
00:04:06.480 So it turns out the worst thing that can happen is to get both rejected and accepted. So thanks a lot, Josh, for putting me through all of that! But anyway, here I am. To get warmed up to the idea of parsing, let's play a game with some audience participation. I hope most of you have played it before. It's where I ask for a type of word, and you have to fill in the blank to make a sentence. So, I need a verb.
00:04:32.240 Anyone? Okay, 'drink.' No audience plants or anything? So the young man drank. This seems like a perfectly valid sentence. If you can all remember your middle school grammar lessons, you can diagram the sentence into a subject and a verb where 'drank' is the verb. Since all sentences need a subject to actually perform the verb, we know that the subject is 'the young man.' So it makes sense, right? But are there any other possible sentences that could have started with 'the young man'? For example, 'The young man did something.' I need a noun and verb. We can use the same verb 'drank'—but drank what?
00:05:02.400 'Beer'? Great! I didn't anticipate that one at all. This is also grammatical. This time, instead of just subject plus verb, we have subject plus verb plus object, which abides by the rules of English grammar. We interpret 'the young man' as the subject, 'drank' as the verb, and 'the beer' as an object, which makes sense because objects are nouns. But can we come up with even more possibilities for a valid sentence that starts with 'the young man,' either using subject-verb or subject-verb-object? For example, can you fill in this blank with a noun? Anyone? Well, try 'the boat.' So: 'The young man the boat.' Is this grammatical, the way 'the young man drank the beer' was grammatical?
00:05:40.640 You think not, right? Because based on the previous sentences, we assume that 'the young man' is the subject, and then you have a direct object. Therefore, we think no—this is not grammatical. However, it turns out you can actually parse the sentence in a different way. So if we interpret the subject as 'the young' (as in, like, the young people), we see that it still follows the same rules as the sentence 'the young man drank beer,' which was subject-verb-object. We were initially led astray because we tend to think of 'man' as a noun, and therefore in the previous sentences, as part of the subject and not as a verb.
00:06:27.000 This kind of sentence is called a Garden Path sentence. Garden Path sentences contain a word that can be interpreted in more than one way, so that you think the sentence is going in one direction, and then it pivots on that ambiguous word and goes in a completely different direction. 'Time flies like an arrow, and fruit flies like a banana' follows that pattern. Another example is 'The man who hunts ducks out on weekends,' where you might be misled into reading 'ducks' as a direct object. In 'We painted the wall with cracks,' we might expect something after 'with' naming what we painted with, but 'cracks' describes what's on the wall. The same thing happens in 'When Fred eats food gets thrown' and 'Mary gave the child the dog bit a Band-Aid.' Lastly, 'The woman who whistles tunes pianos.'
00:07:03.680 This kind of ambiguity in sentences can also be seen in hilarious actual headlines, such as 'Grandmother of eight makes hole in one' and 'Man-eating piranha mistakenly sold as pet fish.' 'Eye drops off shelf.' And, I guess, 'Complaints about NBA referees growing ugly.' My personal favorite: 'Milk drinkers are turning to powder.' So, regardless of the ambiguity in natural languages like English, we can all agree that as speakers of a language, there are certain grammar rules we are aware of that allow us to decide whether or not a sentence is a valid expression in that language.
00:07:56.480 As we've seen with the last couple of examples, we know that one possible kind of valid English sentence consists of a subject plus verb plus stuff. Most of us probably learned that sort of thing in middle school. We know that a sentence diagram tree looks something like this, where at the top level you have the sentence, which can be broken up into its constituent parts. Here, we have the noun phrase and the verb phrase, which can then be further broken down until every word of the actual sentence is added as a leaf of the tree. In this diagram, each word of 'John lost his pants' ends up as a leaf of the tree. But the same tree shape could apply to 'The young man the boat.' These kinds of trees are constructed based on grammar rules.
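The diagram itself isn't in the transcript, but a rough reconstruction of a sentence tree for 'John lost his pants' looks something like:

                  sentence
                 /        \
        noun phrase      verb phrase
            |             /        \
          John          lost    noun phrase
                                  /     \
                                his    pants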
00:09:01.560 The convention used to describe grammars in the computing world is a notation called Backus-Naur Form (BNF), or sometimes Extended Backus-Naur Form (EBNF). In BNF notation, you have a series of rules that together constitute the language, or in other words, can define every possible sentence in that language. Two kinds of symbols are used to express each production rule: terminals and non-terminals. The basic format for each production rule is that there's a left-hand side which produces, or rather can be rewritten as, what's on the right-hand side.
00:09:47.680 On the left-hand side, you have a single non-terminal symbol or token, which produces either a terminal or a sequence of terminals and non-terminals. What does this mean? For example, a super simplified subset of production rules for English sentences, based on what we saw before, might be something like: at the top level, a sentence produces a subject plus predicate, and the predicate is a verb plus stuff. Here is an example of ten production rules that can describe both of the sentences we created earlier: 'the young man drank beer' and 'the young man the boat.' These rules make use of six non-terminals: sentence, noun phrase, verb phrase, noun, verb, and article, and seven terminals, which are simply the words: the, a, young, man, boat, beer, and drink.
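The slide with the exact ten rules isn't reproduced in the transcript, but a plausible reconstruction in BNF, using those six non-terminals and seven terminals, might look something like this:

    <sentence>    ::= <noun phrase> <verb phrase>
    <noun phrase> ::= <article> <noun> | <article> <noun> <noun> | <noun>
    <verb phrase> ::= <verb> | <verb> <noun phrase>
    <article>     ::= "the" | "a"
    <noun>        ::= "young" | "man" | "boat" | "beer"
    <verb>        ::= "man" | "drank"

Note that 'man' appears as both a noun and a verb; that double life is exactly what makes 'The young man the boat' a Garden Path sentence.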
00:10:38.240 Using this grammar, we can start at the top level and keep applying production rules until we generate the full sentence. So, we start with: the sentence becomes a noun phrase and a verb phrase. The noun phrase becomes an article and a noun, which in turn hits the terminals of 'the' and 'young' for the article and noun, respectively. Similarly, the verb phrase can then be broken down into verb and noun phrase, and so forth until you finally get the whole sentence. It's pretty straightforward. The same production rules can be applied to 'the young man drank beer,' but just in a different order.
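Written out, the derivation of 'The young man the boat' under a grammar like the one above proceeds roughly as follows:

    sentence
      -> noun phrase  verb phrase
      -> article noun  verb phrase        (noun phrase -> article noun)
      -> the young  verb phrase           (article -> "the", noun -> "young")
      -> the young  verb noun phrase      (verb phrase -> verb noun phrase)
      -> the young man  noun phrase       (verb -> "man")
      -> the young man  article noun      (noun phrase -> article noun)
      -> the young man the boat           (article -> "the", noun -> "boat")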
00:11:03.440 So, we know it's valid, but what does this have to do with computers? This is just English grammar, but it turns out the process by which we parse and understand sentences in a natural language isn't that different, at an abstract level, from what a computer language parser does. Here's the basic flowchart of what happens to your code: you have the source code, which gets passed into something called a lexer or tokenizer, which produces a bunch of tokens. It reads the sentences and figures out recognizable units, and those tokens, in turn, are fed to the parser, which, in most computer languages, spits out a syntax tree that the compiler or interpreter can then use to output CPU instructions.
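In plain text, that flowchart is roughly:

    source code --> lexer/tokenizer --> tokens --> parser --> syntax tree
                --> compiler/interpreter --> CPU instructions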
00:11:37.000 Alright, so most of us don't spend a lot of time implementing new programming languages, but by better understanding the lexer and parser steps, we may be able to see how to use parsers for simpler situations. Rather than a Ruby script, you can feed in anything: a log file, a markdown document, or even just a plain string into your parser and get some kind of output, which doesn't even necessarily have to be a syntax tree. It could be just a different data structure or even something as simple as a Boolean. So, starting with lexing or tokenizing, all this means is turning your input into a sequence of tokens that your parser can recognize.
00:12:22.320 These tokens can then be matched to the grammar that describes the language you're parsing. Let's look at a really simple example. Let's build a parser that can handle super basic binary operations in math, such as addition, subtraction, and multiplication. The entire language will consist of an integer followed by an operator, then another integer. The grammar rules here are pretty simple; we only need three of them to describe every possible sentence in our mini math language. The start symbol here is the expression, which consists of three non-terminal symbols: a number token, then an operator token, and then another number token. The number token can then be rewritten as any digit, and then the operator token can be rewritten as either a plus, minus, or multiplication sign.
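In BNF, those three rules can be written as:

    <expression> ::= <number> <operator> <number>
    <number>     ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
    <operator>   ::= "+" | "-" | "*"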
00:13:20.520 So, here's a really simple implementation of a tokenizer in Ruby, using the StringScanner library and regular expression matching—pretty much identical to the patterns we saw in our production rules—to grab tokens of either type: number or operator. It then returns them all in an array. When we input the string '3 + 7,' we should get a three-element array as our output tokens: number three, operator plus, and number seven. Now we want to pass those tokens into the parser, and ultimately we want to end up with a tree that looks something like this, so that when you feed it to the interpreter, it looks at the head and knows to execute addition on the left child and the right child.
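The slide's code isn't captured in the transcript, but a minimal Ruby sketch of such a tokenizer, using StringScanner, might look like this (the token names, and the parenthesis handling needed later, are my additions):

    require 'strscan'

    def tokenize(input)
      scanner = StringScanner.new(input)
      tokens  = []
      until scanner.eos?
        if scanner.scan(/\s+/)
          # skip whitespace between tokens
        elsif (number = scanner.scan(/\d+/))
          tokens << [:NUMBER, number.to_i]
        elsif (op = scanner.scan(/[-+*]/))
          tokens << [:OPERATOR, op]
        elsif (paren = scanner.scan(/[()]/))
          tokens << [paren == "(" ? :LPAREN : :RPAREN, paren]
        else
          raise "unrecognized input at position #{scanner.pos}"
        end
      end
      tokens
    end

    tokenize("3 + 7")
    # => [[:NUMBER, 3], [:OPERATOR, "+"], [:NUMBER, 7]]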
00:14:00.960 So, a really trivial implementation of this parser might look something like this: you have a parse method that only expects three tokens and outputs a tree with a single node containing the operator and the two children, which are the number tokens to be operated on. It's really simple, but now we have slightly harder math. Take, for example, this expression: '2 * (3 + 7).' With this expression, we need slightly more complicated rules to describe it. Whereas before our first production rule was: expression = number followed by operator followed by number; now we have, at the top level: expression = number followed by operator followed by expression, where an expression can be rewritten as either an expression in parentheses or just a number.
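In BNF, the extended grammar described here is:

    <expression> ::= <number> <operator> <expression>
                  |  "(" <expression> ")"
                  |  <number>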
00:14:47.760 The number and operator non-terminals still translate as they did before in the previous production rules. So we tokenize (to save space, the slide just shows the tokens for 'two,' 'times,' and so forth). This is ultimately the tree that we want to build, but how do we instruct the parser to generate the tree correctly? One common approach is to use a top-down parser that looks ahead one token to decide which grammar rule to follow. Like with Garden Path sentences, we need to know what comes after the current token to know what kind of rule to use to understand the sentence.
00:15:47.040 For example, knowing that 'the young man' is followed by 'the boat' allows us to know what the subject and verb of the sentence are. If you can adhere to a grammar rule when you consume each token, then you're good to go, and you can keep parsing. However, if you don't, then you hit a syntax error state and can't continue parsing. Here, we start with the first token, which is a number token. Since looking ahead at the next token tells us we have an operator coming up, we know that the only possible rule we can follow is the one that says an expression is a number followed by an operator and then another expression.
00:16:41.920 We start building our tree with node two. Shifting over to the next token, we have an operator. Looking ahead tells us that the next token is a parenthesis, and the rule that says an operator is followed by an expression still holds, because we know that an expression can also start with a parenthesis. So far, we're still in the realm of grammatical correctness, and we can keep parsing. We can now set the operator as the head of the tree. We consume the parenthesis, and we see that the next token is a number.
00:17:37.240 Since we know that a valid expression can either be a number or a number plus an operator plus an expression, we still haven't violated any grammar rules, so we can continue parsing and update the right child of the tree to be the potential expression in parentheses. We consume the number token holding the value three. We know we have two options: either interpret the number as a complete expression terminating in three, or as the start of the first production rule, which states that an expression can be a number followed by an operator followed by another expression.
00:18:29.240 By peeking ahead at the next token, which is an operator, we know which rule to follow—namely the first one and not the second one. Now we have an expression that turns into a binary operation, and following the same pattern, we know that three will be the left child of that subtree. Next, we see the operator token and know that it's a plus. Looking ahead to seven tells us it adheres to the first production rule, so we're good to keep parsing and can add the plus operator as the head of the subtree.
00:19:22.320 Then we reach the seven. We know from the previous step that this has to be an expression of some sort. Looking ahead tells us that we have a parenthesis, which completes the expression. If there had been no parenthesis at the end, then the second production rule would have been violated and we would not have been able to complete a valid syntax tree. Building from the top down, looking at the summary of the steps we took to build our syntax tree, you might have noticed that we often had to call 'expression' from within an expression. This method of parsing is also called recursive descent. This is probably the most common type of parser that people might try to write by hand, but there are a number of problems with recursive descent.
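As a sketch only, here is what such a recursive-descent parser might look like in Ruby, reusing the tokenize method from the earlier sketch; the Node struct and the method names are my own, and the parenthesized alternative is folded into a primary helper:

    # expression ::= primary operator expression | primary
    # primary    ::= "(" expression ")" | number
    Node = Struct.new(:op, :left, :right)

    class RecursiveDescentParser
      def initialize(tokens)
        @tokens = tokens
        @pos    = 0
      end

      def parse
        tree = expression
        raise "unexpected trailing tokens" if peek
        tree
      end

      private

      def peek
        @tokens[@pos]
      end

      def advance
        token = @tokens[@pos]
        @pos += 1
        token
      end

      def expression
        left = primary
        if peek && peek[0] == :OPERATOR
          op = advance[1]                 # one token of lookahead picks the rule
          Node.new(op, left, expression)  # calling expression from within expression
        else
          left
        end
      end

      def primary
        type, value = advance
        case type
        when :NUMBER then value
        when :LPAREN
          inner = expression
          raise "expected ')'" unless advance&.first == :RPAREN
          inner
        else
          raise "unexpected token: #{type.inspect}"
        end
      end
    end

    RecursiveDescentParser.new(tokenize("2 * (3 + 7)")).parse
    # => #<struct Node op="*", left=2, right=#<struct Node op="+", left=3, right=7>>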
00:20:51.360 For one, the recursive nature of the process makes it relatively inefficient, because you might end up checking a token against every single grammar rule that you have before being able to parse it. And if you're not careful with how you write your grammar rules, you'll end up in an infinite loop. For example, if you have a production rule that says an expression equals expression op expression, then you hit the first token and expand expression, which produces expression op expression, whose first symbol is again an expression, and so on forever, without ever consuming a token. Lastly, recursive descent sometimes limits how you can order or write the grammar rules for your language.
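That left-recursive trap, written out:

    <expression> ::= <expression> <operator> <expression>

A recursive-descent parser expanding <expression> here calls <expression> again before consuming a single token, which is where the infinite loop comes from.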
00:21:54.560 Another approach to parsing builds the parse tree from the bottom up. The key difference here is that this approach makes use of a stack that you can push tokens onto until there are enough tokens on the stack to fulfill a rule, at which point they get reduced and popped off. For this reason, this kind of approach to parsing is also called shift-reduce. Looking back to our slightly harder math example, we start with the first token, '2.' We push it onto our stack for the moment, and based on the grammar rules, we know that '2' is a terminal for a number token. The stack reduces it to the left-hand side symbol for 'num,' then we set it aside as a building block for our eventual tree.
00:23:00.640 We then shift over and consume the next token, which is a times sign. This gets pushed onto the stack, and then we look at the grammar. The rule that matches is operator equals multiplication sign, so we reduce that in the stack and move it over. Then we shift again and grab the parenthesis. We add it to the stack, and since this could be covered by the rule where an expression equals an expression in parentheses, we keep it on the stack for now. We then shift to the next token, which is now three, and push it as well. Applying the first grammar rule, we reduce '3' to a number on our stack and set it aside.
00:23:58.720 We shift over to the plus sign. We add that to the stack as usual. We look at the grammar and know that this is an operator token. Our next action is to reduce it, and once again, we set it aside. Our next action is to shift; we push the next token onto the stack and apply the production rule that recognizes '7,' reducing it to an expression and setting it aside. At this point, the tokens on our stack have fulfilled a rule that our grammar recognizes, namely that an expression equals a number followed by an operator followed by an expression.
00:24:57.640 Now, the stack takes the three tokens involved in the rule and reduces them to simply an expression, which can be shown now on the stack in green. The tokens that are left on the stack now show that we've reached a state again where the tokens at the top of the stack have met another grammar rule, which is expression equals number operator expression. This allows the first tokens we set aside to be added to the tree at the proper level above the subtree constructed from the previous reduction.
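Condensed, the whole shift-reduce run for '2 * (3 + 7)' looks roughly like this (num, op, and expr abbreviate the non-terminals):

    stack                  remaining input   action
    (empty)                2 * ( 3 + 7 )     shift "2"
    2                      * ( 3 + 7 )       reduce: num ::= "2"
    num                    * ( 3 + 7 )       shift "*"
    num *                  ( 3 + 7 )         reduce: op ::= "*"
    num op                 ( 3 + 7 )         shift "("
    num op (               3 + 7 )           shift "3"
    num op ( 3             + 7 )             reduce: num ::= "3"
    num op ( num           + 7 )             shift "+"
    num op ( num +         7 )               reduce: op ::= "+"
    num op ( num op        7 )               shift "7"
    num op ( num op 7      )                 reduce: num ::= "7"
    num op ( num op num    )                 reduce: expr ::= num
    num op ( num op expr   )                 reduce: expr ::= num op expr
    num op ( expr          )                 shift ")"
    num op ( expr )        (empty)           reduce: expr ::= "(" expr ")"
    num op expr            (empty)           reduce: expr ::= num op expr
    expr                   (empty)           accept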
00:25:45.840 We've reached the end of our tokens, and since we were able to perform a valid reduction, we know that the input was valid; it was a grammatically correct sentence in our math language, and we can accept it to create the final tree. Those are two types of approaches to parsing: top-down or recursive descent, which is again the most commonly written by hand, whereas the bottom-up approach, which you might sometimes hear called LR, is much harder to write by hand.
00:26:26.920 Unsurprisingly, the more grammar rules you have, the harder it is to construct this parser. Fortunately, there are tools that build parsers for you. These tools, such as yacc, Racc, and bison, are known as parser generators. All they need is a grammar file with all the rules that define the language you want to parse, and they'll generate a parser for you, which you can then feed any input and have it do the right thing.
00:27:28.360 The grammar files look like the wacky files you saw at the beginning of the talk. But now you can understand how the grammar rules are defined. It's not that different from the BNF notation we've been using. Here, you can see that the rule for a non-terminal symbol, expression, consists of a number token followed by an operator token followed by a number. When this sequence gets hit by the parser, it executes what's in the Ruby block. The grammar file is also paired with a tokenizer, but it's pretty much the same one we saw before, so I'm leaving off that slide.
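As a sketch of what such a rule looks like in Racc's grammar syntax (the action block, reusing the Node struct from the earlier sketch, is an assumption):

    class MathParser
    rule
      expression : NUMBER OPERATOR NUMBER {
                     # val holds the semantic values of the matched symbols;
                     # whatever is assigned to result becomes the rule's value
                     result = Node.new(val[1], val[0], val[2])
                   }
    end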
00:28:45.840 So again, the grammar rule is defined here simply as num op num, and the block is very similar to the trivial parser we saw in the very first simple math example. All that's left is to pass your grammar file into Racc or bison, and it will generate a file that can parse whatever input you specify for the language. You don't even need to look at the output your parser generator produces, but if you did, it would look something like this. The numbers in the various tables correspond to the different actions the parser needs to take, such as shift, reduce, or go to a specific rule. For the most part, you don't even have to worry about it, and you already have a parser.
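Assuming the grammar lives in a file called parser.y, generating the parser is a one-liner:

    racc parser.y -o parser.rb

The generated parser.rb contains the state tables described above.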
00:29:56.760 You might ask, when would I actually use a parser? That’s a great question because there are a lot of situations where you can use a parser. For example, you could use one if you're validating a string—such as a URL or an email address—or maybe extracting information from the body of an email or a log file, or even converting some kind of formatted document like markdown to HTML.
00:30:13.760 But now you might be thinking, 'Well, I've validated strings before, and I've just used regular expressions.' It’s true that for some simple string input, you might just use a regular expression, but has anyone here actually tried to validate a URL using a regular expression? Because if you did, you might end up with something terrible like this that you really never want in your code because it’s hard to read, hard to maintain, and probably doesn't cover all the cases you need it to.
00:31:25.440 The reason for this is that there's a really important distinction among languages—even string languages—that we must take into consideration, and that is that all languages belong to a hierarchy. The established hierarchy for languages was first introduced by Chomsky, and it has four types. The first type, type 0, is unrestricted, which is the category that most natural languages fall into; these are languages that do not have regular or consistent grammars and can therefore be ambiguous.
00:32:16.720 The next type, type 1, is called context-sensitive, which we don't really care about for the purposes of this talk. That brings us to the last two types. Type 2 languages are known as context-free languages, based on context-free grammars. Most programming languages are type 2, but a language doesn't need to be as complex as Ruby to be considered context-free. Finally, type 3 languages are what are called regular languages, and these are the only kinds of languages that can be properly parsed by a regular expression engine.
00:33:08.840 But what does it really mean for a language to be either regular or context-free? For the most part, it just boils down to your grammar rules. For regular languages, all grammar rules can be expressed as a single non-terminal symbol on the left-hand side and either a single terminal or nil on the right-hand side. The terminal can have a non-terminal either before or after it, but not both: you can have rules of the form 'A becomes xB' or of the form 'A becomes Bx,' but you can't mix the two. Context-free languages, on the other hand, have much more flexibility in their grammar rules, with a single non-terminal on the left-hand side and any sequence of terminals and non-terminals on the right-hand side.
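Schematically, with A and B as non-terminals and x, y as terminals:

    regular (type 3):       A ::= "x"
                            A ::= "x" B     (or A ::= B "x", but not a mix)

    context-free (type 2):  A ::= "x" B "y" ...   (any sequence of terminals
                                                   and non-terminals)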
00:34:29.520 As it turns out, most languages aren't regular. Let's take a really simple example language that you might think you could just parse with a regular expression: the 'AB' language, which consists of a series of a's followed by the exact same number of b's. Valid sentences are 'aabb,' 'aaabbb,' and so forth. However, an invalid sentence would be something like 'aa,' where you don't have the corresponding b's, or vice versa, or one where they're interspersed or mismatched. If you were to try writing a grammar for a language like this, you might start with something like a non-terminal 'A' that becomes the string 'a,' and similarly a non-terminal 'B' that becomes the string 'b.'
00:35:34.760 You can then construct a sentence as 'A' followed by 'B,' but that only produces the single sentence 'ab.' In order to produce anything else, you might think of adding a rule for each longer sentence ('aabb,' 'aaabbb,' and so on), which isn't feasible, since that leads to an infinite number of grammar rules. So what's the solution? Thanks to parser generators, you don't typically have to do much work beyond coming up with the grammar. However, sometimes the grammar itself can be a little bit tricky.
00:36:30.080 This particular problem reminds me of a logic problem that I’m sure some of you have heard before. The problem is about gnomes in a cave that have been kidnapped by an ogre. For some of these gnomes, their foreheads are painted red to be eaten right away, and some have their foreheads painted blue to be eaten later. However, the ogre gives them one chance at salvation: the gnomes are allowed to exit the cave one by one without communicating with each other, and they have to sort themselves into a contiguous line with the red-painted gnomes on one side and the blue-painted gnomes on the other. They don’t know what’s on their foreheads because the cave is pitch black. Does anyone know the answer to this one? I know some of you do.
00:37:54.000 The answer is that each gnome, as they exit the cave, inserts themselves between the red and the blue ones. If you think about it, this is how this grammar ends up working. Using BNF notation, that means a statement 'S' is basically an 'a' and a 'b' with another statement token inserted in between them. If we specify that the statement token can also be nil, that gives us the only rule we need to cover all cases of the deceptively simple 'AB' language. Remembering that a grammar rule for a regular language can't have a sequence of terminals and non-terminals on the right-hand side, we see that this language requires a context-free grammar.
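In BNF, that single rule for the 'AB' language is:

    <S> ::= "a" <S> "b" | ""

    S -> a S b -> a a S b b -> a a b b     (deriving "aabb")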
00:38:41.840 As we've seen, you don't really need that much complexity before you get out of regular-expression-parsable territory. Because of that, you are usually going to be better off figuring out the grammar rules for what you're trying to parse and using a parser. This also ensures greater accuracy in what you're trying to parse. For example, many web servers use a parser for HTTP requests, such as Mongrel in the Ruby world or Jetty in Java. But as recently as last week, there was a vulnerability uncovered in Jetty that leaked debug information in the HTTP response, which is problematic.
00:39:57.360 Another thing about parsers is that they're often a much faster way of getting done what you need. For instance, in a web framework, if you use a parser for your routes, you can build a tree of your routes, which makes lookup much faster than a linear search through all the routes. Extra credit points for whoever can tell me the time complexity of that. Of course, parsers are tough to write, but thanks to parser generators, all you need to figure out is the grammar. There were a ton of resources that I used to prepare for this talk, and I couldn't fit them all on one slide. However, here are two links if you want to see an implementation of a recursive descent parser—it happens to be in Python—and another on shift-reduce parsing.
00:41:19.760 Once again, I'm Hsing-Hui, 'SoManyHs' on Twitter and GitHub. I am a Rails developer at Carone, and we're hiring! So if you want a job, come talk to me. Thank you!