Talks

Time Flies Like An Arrow; Fruit Flies Like A Banana: Parsers for Great Good

Time Flies Like An Arrow; Fruit Flies Like A Banana: Parsers for Great Good

by Hsing-Hui Hsu

In "Time Flies Like An Arrow; Fruit Flies Like A Banana: Parsers for Great Good," Hsing-Hui Hsu explores the concept of parsing both in human language and computer programming. Hsu begins by drawing parallels between how humans interpret natural language and how computers parse code. The discussion is aimed at individuals without a computer science background, focusing on practical takeaways and insights into the functioning of parsers.

Key Points Discussed:
- Introduction to Parsing: Hsu explains parsing as a method of understanding sentence structure, using a classic wordplay saying to exemplify ambiguities in language.
- Personal Journey: Sharing her experience as a newcomer to programming, Hsu details how her curiosity about Ruby on Rails led her to explore parsing and the underlying mechanics of code.
- Grammar Rules in Language: The presentation covers grammatical structures (subject, verb, object) and how they relate to constructing meaningful sentences, reinforcing the idea that language can be diagrammed into parts.
- Backus-Naur Form (BNF): Hsu introduces BNF as a method for describing grammar rules in programming languages, consisting of terminals (actual words) and non-terminals (rules).
- Parsing Process Overview: The video explains the two essential steps involved in parsing: lexing (tokenization) and parsing, which leads to the generation of an abstract syntax tree (AST) for the code being analyzed.
- Creating a Simple Parser: An example of building a simple math parser illustrates the process of taking input strings, tokenizing them, and constructing a syntax tree, including working with more complex expressions involving parentheses.
- Top-Down vs. Bottom-Up Parsing: Hsu discusses different parsing methodologies, explaining recursive descent (top-down) and stack-based (bottom-up) parsing techniques, along with examples of how these are applied in practice.
- Practical Applications: The talk highlights various applications for parsers, including validation of patterns, data extraction, and the transformation of formats such as markdown to HTML.
- Chomsky's Hierarchy: Hsu briefly touches on the hierarchy of language types established by Noam Chomsky, illustrating the concept with example languages and their grammatical complexities.

In conclusion, Hsu emphasizes the importance of parsers in programming, suggesting that understanding grammar and syntax trees can significantly enhance coding practices. The key takeaway is that parsing is a foundational tool for both natural and computer languages that can lead to more effective code and comprehension for developers at any level.

00:00:15.379 Hello everyone, my name is Hsing-Hui Hsu. That is actually how it's spelled; all of the HS are silent. You can find me on Twitter as @singhwaysu.
00:00:23.689 Today, I'm going to talk to you about parsing.
00:00:29.029 For those of you who attended Aaron Patterson's talk this morning on the Ruby virtual machine, my talk covers many of the steps he skipped.
00:00:34.970 So if you're still feeling like you want to skip over those steps, now is your chance.
00:00:40.820 I’d like to start with an explanation of the title of this talk.
00:00:47.059 I first started thinking about parsing when I was teaching English as a foreign language.
00:00:53.030 I had to explain why many words in English, like 'flies,' have different meanings.
00:00:59.629 Depending on which meaning the word has, the sentence can go in really different directions.
00:01:05.420 As I started looking into how computer languages are parsed, I noticed a lot of similarities to how humans parse sentences.
00:01:12.080 I am a relative newcomer to the programming world and never learned about parsers or compilers in school.
00:01:18.259 So, I have an alternate title for this talk: 'How I Accidentally Became a Computer Scientist, and So Can You!'
00:01:25.910 This isn't going to be a comprehensive survey of parsers.
00:01:32.060 Instead, I want to share how I learned about them as someone without a computer science degree.
00:01:39.470 I had no idea what parsers were and sort of ran into them by accident.
00:01:45.679 After building Rails applications for a while, I got tired of blindly trusting all that Rails magic.
00:01:51.500 I wanted to see how things actually worked under the hood.
00:01:57.590 Specifically, I was trying to figure out how routing worked.
00:02:02.720 So I went to GitHub and started poking around the Rails source code.
00:02:08.289 I came across a file called 'parser2.rb' and thought, 'Cool! I've done an exercise similar to this, so I probably know what this code looks like.'
00:02:16.690 But when I looked at the file, it was nothing like what I expected.
00:02:23.350 It was just a bunch of arrays of numbers. Where was the logic?
00:02:28.480 I wondered who had written this code.
00:02:35.740 Looking closer, I saw that the previous file was actually generated using a different file called 'parser.y'.
00:02:41.020 So I went and looked at that file and still found it confusing—it hardly looked like any Ruby code I recognized.
00:02:46.810 It had semicolons—what was going on?
00:02:54.280 That's when I decided to figure out what all this meant.
00:03:00.670 To get warmed up to the idea of parsing, let's play a game.
00:03:07.340 I hope many of you have played this game before, but for those of you who haven’t, this is the word game where one person asks for various kinds of words.
00:03:13.069 For example, one person asks for a noun or an adjective to make sentences.
00:03:19.730 Let’s start with: 'I need a verb.' Can anyone suggest one?
00:03:26.209 "Drink!" Great! Now let’s form a sentence: 'The young man drank.' This is a valid sentence.
00:03:33.709 If you remember your middle school grammar lessons, we can diagram the sentence into a subject and a verb.
00:03:39.109 In this case, 'The young man' is the subject and 'drank' is the verb.
00:03:46.250 Now, can we find other possible sentences that start with 'The young man'?
00:03:52.310 For example, 'The young man did something.' Here, we need a verb and a noun.
00:03:57.680 We can stick with our original verb, 'drank.' Any suggestions for a noun?
00:04:04.030 'Beer!' That's a good one. Now we have: 'The young man drank beer.' This is also grammatical.
00:04:09.049 Instead of just 'subject + verb,' we now have 'subject + verb + object,' which abides by the rules of English grammar.
00:04:17.539 If we interpret 'The young man' as the subject, 'drank' as the verb, and 'beer' as the direct object, it all makes sense.
00:04:24.710 But can we come up with even more possibilities for valid sentences that start with 'The young man'?
00:04:32.380 For example, is this grammatical: 'A young man the boat?'
00:04:39.109 Yes or no? Is it a grammatical sentence?
00:04:46.250 At first, we might think it’s not grammatical because we assume that the subject is still 'the young man.'
00:04:52.310 However, it turns out we can interpret this sentence differently.
00:04:57.680 If we interpret 'young' as referring to young people, it can still follow the same rules as the valid sentence.
00:05:02.449 This kind of sentence is called a 'garden path sentence.' Garden path sentences contain words that could be interpreted in multiple ways.
00:05:09.139 They lead us to initially interpret them in one direction before pivoting on an ambiguous word and going a different way.
00:05:17.240 We might say: 'Time flies like an arrow; fruit flies like a banana.' The same structure applies to 'The man who hunts ducks on weekends.'
00:05:23.090 There are other humorous examples, like 'Grandmother of eight makes hole in one,' or 'Man-eating piranhas mistakenly sold as pet fish.'
00:05:28.580 These ambiguities in words or sentence structure can produce hilarious, unintentional meanings.
00:05:35.570 As speakers of a language, we are all aware of certain grammar rules.
00:05:39.110 These rules help us determine whether or not a sentence is a valid expression.
00:05:44.420 From the examples we’ve seen, we know that one valid sentence form is 'subject + verb + object.'
00:05:49.460 Most of us probably did some kind of sentence diagramming in middle school similar to this.
00:05:56.540 At the top of the tree, we have a sentence that can be broken down into its constituent parts.
00:06:02.240 Each part then gets broken down further until every word in the sentence can be represented as a leaf of that tree.
00:06:07.940 The leaves in this tree correspond to the words of the sentence, such as 'John lost his pants,' but the same tree structure could apply to 'The young man the boat.'
00:06:14.360 These trees are constructed based on grammar rules.
00:06:20.300 The conventional notation for describing grammars in the computing world is called Backus-Naur Form (BNF).
00:06:26.330 In BNF notation, we have a set of rules that describe the language, defining every possible valid sentence in that language.
00:06:35.179 There are two kinds of symbols to express grammar rules: terminals and non-terminals.
00:06:40.549 The basic format of each production rule shows a left-hand side symbol producing a right-hand side.
00:06:46.759 The left-hand side contains a single non-terminal symbol or token which produces either a terminal or a sequence of terminals and non-terminals.
00:06:52.790 For example, a simplified version of production rules for English sentences might look like this: at the top level, a sentence could become a subject followed by a predicate.
00:06:58.670 The predicate itself would become a verb followed by an object.
00:07:06.230 From these rules, we can describe both sentences we created earlier: 'The young man drank beer' and 'The young man the boat.'
00:07:12.649 The rules would use six non-terminals, including sentence, noun phrase, verb phrase, noun, verb, and article.
00:07:17.890 It would also include seven terminals, which are just the words: 'the,' 'a,' 'young man,' 'boat,' 'beer,' and 'drink.'
00:07:24.600 Using this grammar, we start at the top level and keep applying production rules to generate a full sentence.
00:07:30.540 We can rewrite our sentence as a noun phrase and a verb phrase, breaking them down into their respective components.
00:07:39.430 The noun phrase translates into an article and a noun, hitting the terminals for 'the' and 'young.'
00:07:47.190 Then, the noun phrase restructuring ends up as 'the boat,' and the same rules apply for the second sentence.
00:07:54.610 But what does this have to do with computers?
00:08:02.790 The process by which we parse and understand a sentence in natural languages isn't too different from how a computer language parser works.
00:08:09.110 Your code gets fed into a lexer or tokenizer, breaking up the string into recognizable units.
00:08:15.070 These units then go to the parser, which outputs an abstract syntax tree.
00:08:20.740 This tree can then be traversed by the compiler or interpreter to output CPU instructions.
00:08:27.460 Most of us aren't in the business of implementing new programming languages, but we can gain insights by understanding the lexer and parser steps.
00:08:34.240 By learning how to use parsers for simpler situations, like parsing a log file, markdown document, or even plain strings, we can see their usefulness.
00:08:41.530 For instance, feeding a string into a parser can yield different data structures.
00:08:49.110 It could even return a boolean indicating validity.
00:08:54.460 So, let’s delve into lexing or tokenization.
00:09:01.270 This is simply the process of turning your input into a sequence of tokens that the parser can recognize.
00:09:06.670 These tokens will match the grammar rules describing the language being parsed.
00:09:12.980 Let's see a simple example: we'll build a parser for basic binary operations in math, such as addition, subtraction, and multiplication.
00:09:20.460 This simple language will consist of an integer followed by an operator and then another integer.
00:09:29.460 The grammar rules for this language are quite straightforward; we only need three of them.
00:09:36.520 At the top level, an expression can consist of three non-terminal symbols: a number token, an operator token, and another number token.
00:09:43.530 The number token can be any digit, noted as above, using a regular expression.
00:09:51.510 The operator token can be rewritten as a plus sign, minus sign, or multiplication sign.
00:09:57.830 Here’s a simple implementation of a tokenizer in Ruby.
00:10:03.910 It uses regular expression matching to grab tokens of either type: num or op and then returns an array of them.
00:10:11.760 To handle the input string '3 plus 7,' we should get an array containing three elements: num, op, and num.
00:10:17.320 We want to pass this to the parser, ultimately aiming to create a tree structure.
00:10:23.890 The tree will have the operator as the head, with the two number tokens as children.
00:10:30.750 Here’s a trivial implementation of a parser that expects three tokens in a specific order.
00:10:37.610 It produces a tree containing the operator and the two number tokens.
00:10:44.690 Now, let’s try a slightly more complicated example with the expression '2 times (3 plus 7).'
00:10:50.410 This requires slightly more complex rules to describe.
00:10:57.920 Previously, we had the rule: expression = num op num.
00:11:05.870 Now, we have: expression = num op expression.
00:11:12.340 This allows us to also write another expression in parentheses or just a number.
00:11:20.100 The tokens we parse will generate an array of seven tokens, leading to the desired tree.
00:11:27.890 To tell the parser how to construct the tree correctly, we can use a top-down parser.
00:11:36.270 This parser looks ahead one token to decide which grammar rule to follow.
00:11:43.290 For example, knowing that 'The young man' is followed by 'the boat' helps us identify the subject and verb.
00:11:50.170 If you adhere to the grammar rules while consuming tokens, you're on the right track.
00:11:55.440 If not, you reach a narrow state, resulting in a parse error.
00:12:02.000 We start with the first token (a number token) and know we have an operator coming up.
00:12:08.290 So, the only possible rule we can follow states: expression = number op expression.
00:12:15.670 Next, we handle the operator token and see the parenthesis follows this token.
00:12:22.570 The only applicable grammar rule is still the first one: expression = number op expression.
00:12:30.080 We keep building our tree, starting with the operator, and then consume the parentheses.
00:12:37.320 Next token is a number, therefore we still obey the grammar rules.
00:12:44.320 When we consume that number token, holding the value 3, we assess our options.
00:12:50.590 To reach step two again involves interpreting the number token or engaging it in the first production rule.
00:12:57.560 Peeking ahead at the next token confirms we should follow the second option.
00:13:02.520 Deciding it represents a binary expression allows 3 to be the left child.
00:13:10.220 We now consume the operator token and again analyze what follows.
00:13:17.640 As before, this token follows our previous production rule, so we add the plus operator.
00:13:24.130 We reach token 7 and recognize it must also be a valid expression.
00:13:30.680 Upon consuming the final parenthesis, we complete the subexpression.
00:13:38.120 Through these steps, we build our syntax tree.
00:13:45.360 Many noticed the recursive nature of the steps we used to build the syntax tree.
00:13:53.290 We call this parsing method 'recursive descent,' commonly implemented by hand; however, it has its drawbacks.
00:14:01.420 The recursive nature can lead to inefficiencies or even infinite recursion if not carefully written.
00:14:08.390 For example, an expression rule that produces itself indefinitely creates an infinite loop.
00:14:14.890 Alternatively, parsing can be achieved from the bottom up, using a stack.
00:14:21.070 Tokens are pushed onto the stack until they fulfill a grammar rule, at which point they are reduced.
00:14:27.880 Now, returning to our math example, we start by pushing the first token onto the stack.
00:14:34.430 Since we recognize '2' as a number token, we reduce it to the left-hand symbol for number.
00:14:41.830 The next token to shift is the multiplication, categorized as an operator.
00:14:49.390 Pushing parentheses onto the stack allows flexibility within expression rules.
00:14:57.620 The next token is a '3,' which again gets pushed through into a number token.
00:15:04.150 Upon encountering the '+' operator, we identify it as an OP token, and again reduce it.
00:15:11.000 Next, we shift over to '7,' which forms a valid expression.
00:15:18.010 Having fulfilled another grammar rule; we can construct and evaluate a complete tree.
00:15:23.470 These parsing approaches—top-down and bottom-up—are the two primary methodologies.
00:15:31.830 Top-down parsers are easier to write by hand, whereas bottom-up models are more complex.
00:15:37.780 However, tools exist to facilitate parser creation, known as parser generators.
00:15:45.680 Parser generators only require a grammar file to outline the rules of the language.
00:15:53.340 These grammar files are quite similar to the BNF notation we've discussed.
00:16:01.210 Utilizing tools like Racc or Bison, one can input a grammar file and effortlessly generate a parser.
00:16:07.780 Now, in conclusion, there are many applications for using a parser.
00:16:16.320 For instance, you could use a parser to validate patterns in URLs or email addresses.
00:16:23.490 Additionally, they can help extract information from email bodies or logs.
00:16:30.500 Converting structured documents into different formats, like markdown to HTML, is another use.
00:16:38.600 While simple regular expressions can validate some strings, they become complicated for complex patterns.
00:16:44.860 Languages belong to a hierarchy introduced by Noam Chomsky, categorized into four types.
00:16:51.440 First is type zero, unrestricted; type one is context-sensitive.
00:16:58.540 Types two and three are context-free languages and regular languages respectively.
00:17:07.120 Most programming languages fit under type two.
00:17:13.370 To illustrate this further, consider the AAB language: sequences of A's followed by equal B's.
00:17:21.860 To create grammar rules for AAB is tricky, as it would require infinite definitions.
00:17:28.160 This problem parallels that of checking balanced parentheses.
00:17:35.240 Crafting a parser makes defining the grammar—the most challenging aspect.
00:17:43.830 Recall a puzzle involving gnomes, some with red paint and others blue, which can sort themselves.
00:17:57.370 Using BNF notation, the sentence indicates that an A or B includes another statement, allowing all string combinations.
00:18:05.220 Thus, further exploration reveals that it's context-free.
00:18:13.500 Grammars, while challenging to define, inform how to structure parsers. The hierarchy of languages provides a clear understanding of which are regular versus context-free.
00:18:21.190 Resources for parser generators, whether for understanding specifications like HTTP headers or validating input.
00:18:27.920 Programming languages often employ parsers for web frameworks, enhancing performance over traditional methods.
00:18:36.760 Parsers can often be more effective than using plain regular expressions, as they provide precision and clarity.
00:18:45.260 Many web servers depend on parsers to manage HTTP requests effectively, demonstrating their importance.
00:18:52.810 Parser generators save time and complexity, enough to offer powerful toolsets for clear application.
00:19:00.490 You can transform your grammar into actionable code, enabling creativity in programming pursuits.
00:19:09.780 I appreciate your attention and hope you walk away with insights into how parsing and grammars can enhance programming!
00:19:18.500 Thank you!