00:00:15.379
Hello everyone, my name is Hsing-Hui Hsu. That is actually how it's spelled; all of the HS are silent. You can find me on Twitter as @singhwaysu.
00:00:23.689
Today, I'm going to talk to you about parsing.
00:00:29.029
For those of you who attended Aaron Patterson's talk this morning on the Ruby virtual machine, my talk covers many of the steps he skipped.
00:00:34.970
So if you're still feeling like you want to skip over those steps, now is your chance.
00:00:40.820
I’d like to start with an explanation of the title of this talk.
00:00:47.059
I first started thinking about parsing when I was teaching English as a foreign language.
00:00:53.030
I had to explain why many words in English, like 'flies,' have different meanings.
00:00:59.629
Depending on which meaning the word has, the sentence can go in really different directions.
00:01:05.420
As I started looking into how computer languages are parsed, I noticed a lot of similarities to how humans parse sentences.
00:01:12.080
I am a relative newcomer to the programming world and never learned about parsers or compilers in school.
00:01:18.259
So, I have an alternate title for this talk: 'How I Accidentally Became a Computer Scientist, and So Can You!'
00:01:25.910
This isn't going to be a comprehensive survey of parsers.
00:01:32.060
Instead, I want to share how I learned about them as someone without a computer science degree.
00:01:39.470
I had no idea what parsers were and sort of ran into them by accident.
00:01:45.679
After building Rails applications for a while, I got tired of blindly trusting all that Rails magic.
00:01:51.500
I wanted to see how things actually worked under the hood.
00:01:57.590
Specifically, I was trying to figure out how routing worked.
00:02:02.720
So I went to GitHub and started poking around the Rails source code.
00:02:08.289
I came across a file called 'parser2.rb' and thought, 'Cool! I've done an exercise similar to this, so I probably know what this code looks like.'
00:02:16.690
But when I looked at the file, it was nothing like what I expected.
00:02:23.350
It was just a bunch of arrays of numbers. Where was the logic?
00:02:28.480
I wondered who had written this code.
00:02:35.740
Looking closer, I saw that the previous file was actually generated using a different file called 'parser.y'.
00:02:41.020
So I went and looked at that file and still found it confusing—it hardly looked like any Ruby code I recognized.
00:02:46.810
It had semicolons—what was going on?
00:02:54.280
That's when I decided to figure out what all this meant.
00:03:00.670
To get warmed up to the idea of parsing, let's play a game.
00:03:07.340
I hope many of you have played this game before, but for those of you who haven’t, this is the word game where one person asks for various kinds of words.
00:03:13.069
For example, one person asks for a noun or an adjective to make sentences.
00:03:19.730
Let’s start with: 'I need a verb.' Can anyone suggest one?
00:03:26.209
"Drink!" Great! Now let’s form a sentence: 'The young man drank.' This is a valid sentence.
00:03:33.709
If you remember your middle school grammar lessons, we can diagram the sentence into a subject and a verb.
00:03:39.109
In this case, 'The young man' is the subject and 'drank' is the verb.
00:03:46.250
Now, can we find other possible sentences that start with 'The young man'?
00:03:52.310
For example, 'The young man did something.' Here, we need a verb and a noun.
00:03:57.680
We can stick with our original verb, 'drank.' Any suggestions for a noun?
00:04:04.030
'Beer!' That's a good one. Now we have: 'The young man drank beer.' This is also grammatical.
00:04:09.049
Instead of just 'subject + verb,' we now have 'subject + verb + object,' which abides by the rules of English grammar.
00:04:17.539
If we interpret 'The young man' as the subject, 'drank' as the verb, and 'beer' as the direct object, it all makes sense.
00:04:24.710
But can we come up with even more possibilities for valid sentences that start with 'The young man'?
00:04:32.380
For example, is this grammatical: 'A young man the boat?'
00:04:39.109
Yes or no? Is it a grammatical sentence?
00:04:46.250
At first, we might think it’s not grammatical because we assume that the subject is still 'the young man.'
00:04:52.310
However, it turns out we can interpret this sentence differently.
00:04:57.680
If we interpret 'young' as referring to young people, it can still follow the same rules as the valid sentence.
00:05:02.449
This kind of sentence is called a 'garden path sentence.' Garden path sentences contain words that could be interpreted in multiple ways.
00:05:09.139
They lead us to initially interpret them in one direction before pivoting on an ambiguous word and going a different way.
00:05:17.240
We might say: 'Time flies like an arrow; fruit flies like a banana.' The same structure applies to 'The man who hunts ducks on weekends.'
00:05:23.090
There are other humorous examples, like 'Grandmother of eight makes hole in one,' or 'Man-eating piranhas mistakenly sold as pet fish.'
00:05:28.580
These ambiguities in words or sentence structure can produce hilarious, unintentional meanings.
00:05:35.570
As speakers of a language, we are all aware of certain grammar rules.
00:05:39.110
These rules help us determine whether or not a sentence is a valid expression.
00:05:44.420
From the examples we’ve seen, we know that one valid sentence form is 'subject + verb + object.'
00:05:49.460
Most of us probably did some kind of sentence diagramming in middle school similar to this.
00:05:56.540
At the top of the tree, we have a sentence that can be broken down into its constituent parts.
00:06:02.240
Each part then gets broken down further until every word in the sentence can be represented as a leaf of that tree.
00:06:07.940
The leaves in this tree correspond to the words of the sentence, such as 'John lost his pants,' but the same tree structure could apply to 'The young man the boat.'
00:06:14.360
These trees are constructed based on grammar rules.
00:06:20.300
The conventional notation for describing grammars in the computing world is called Backus-Naur Form (BNF).
00:06:26.330
In BNF notation, we have a set of rules that describe the language, defining every possible valid sentence in that language.
00:06:35.179
There are two kinds of symbols to express grammar rules: terminals and non-terminals.
00:06:40.549
The basic format of each production rule shows a left-hand side symbol producing a right-hand side.
00:06:46.759
The left-hand side contains a single non-terminal symbol or token which produces either a terminal or a sequence of terminals and non-terminals.
00:06:52.790
For example, a simplified version of production rules for English sentences might look like this: at the top level, a sentence could become a subject followed by a predicate.
00:06:58.670
The predicate itself would become a verb followed by an object.
00:07:06.230
From these rules, we can describe both sentences we created earlier: 'The young man drank beer' and 'The young man the boat.'
00:07:12.649
The rules would use six non-terminals, including sentence, noun phrase, verb phrase, noun, verb, and article.
00:07:17.890
It would also include seven terminals, which are just the words: 'the,' 'a,' 'young man,' 'boat,' 'beer,' and 'drink.'
00:07:24.600
Using this grammar, we start at the top level and keep applying production rules to generate a full sentence.
00:07:30.540
We can rewrite our sentence as a noun phrase and a verb phrase, breaking them down into their respective components.
00:07:39.430
The noun phrase translates into an article and a noun, hitting the terminals for 'the' and 'young.'
00:07:47.190
Then, the noun phrase restructuring ends up as 'the boat,' and the same rules apply for the second sentence.
00:07:54.610
But what does this have to do with computers?
00:08:02.790
The process by which we parse and understand a sentence in natural languages isn't too different from how a computer language parser works.
00:08:09.110
Your code gets fed into a lexer or tokenizer, breaking up the string into recognizable units.
00:08:15.070
These units then go to the parser, which outputs an abstract syntax tree.
00:08:20.740
This tree can then be traversed by the compiler or interpreter to output CPU instructions.
00:08:27.460
Most of us aren't in the business of implementing new programming languages, but we can gain insights by understanding the lexer and parser steps.
00:08:34.240
By learning how to use parsers for simpler situations, like parsing a log file, markdown document, or even plain strings, we can see their usefulness.
00:08:41.530
For instance, feeding a string into a parser can yield different data structures.
00:08:49.110
It could even return a boolean indicating validity.
00:08:54.460
So, let’s delve into lexing or tokenization.
00:09:01.270
This is simply the process of turning your input into a sequence of tokens that the parser can recognize.
00:09:06.670
These tokens will match the grammar rules describing the language being parsed.
00:09:12.980
Let's see a simple example: we'll build a parser for basic binary operations in math, such as addition, subtraction, and multiplication.
00:09:20.460
This simple language will consist of an integer followed by an operator and then another integer.
00:09:29.460
The grammar rules for this language are quite straightforward; we only need three of them.
00:09:36.520
At the top level, an expression can consist of three non-terminal symbols: a number token, an operator token, and another number token.
00:09:43.530
The number token can be any digit, noted as above, using a regular expression.
00:09:51.510
The operator token can be rewritten as a plus sign, minus sign, or multiplication sign.
00:09:57.830
Here’s a simple implementation of a tokenizer in Ruby.
00:10:03.910
It uses regular expression matching to grab tokens of either type: num or op and then returns an array of them.
00:10:11.760
To handle the input string '3 plus 7,' we should get an array containing three elements: num, op, and num.
00:10:17.320
We want to pass this to the parser, ultimately aiming to create a tree structure.
00:10:23.890
The tree will have the operator as the head, with the two number tokens as children.
00:10:30.750
Here’s a trivial implementation of a parser that expects three tokens in a specific order.
00:10:37.610
It produces a tree containing the operator and the two number tokens.
00:10:44.690
Now, let’s try a slightly more complicated example with the expression '2 times (3 plus 7).'
00:10:50.410
This requires slightly more complex rules to describe.
00:10:57.920
Previously, we had the rule: expression = num op num.
00:11:05.870
Now, we have: expression = num op expression.
00:11:12.340
This allows us to also write another expression in parentheses or just a number.
00:11:20.100
The tokens we parse will generate an array of seven tokens, leading to the desired tree.
00:11:27.890
To tell the parser how to construct the tree correctly, we can use a top-down parser.
00:11:36.270
This parser looks ahead one token to decide which grammar rule to follow.
00:11:43.290
For example, knowing that 'The young man' is followed by 'the boat' helps us identify the subject and verb.
00:11:50.170
If you adhere to the grammar rules while consuming tokens, you're on the right track.
00:11:55.440
If not, you reach a narrow state, resulting in a parse error.
00:12:02.000
We start with the first token (a number token) and know we have an operator coming up.
00:12:08.290
So, the only possible rule we can follow states: expression = number op expression.
00:12:15.670
Next, we handle the operator token and see the parenthesis follows this token.
00:12:22.570
The only applicable grammar rule is still the first one: expression = number op expression.
00:12:30.080
We keep building our tree, starting with the operator, and then consume the parentheses.
00:12:37.320
Next token is a number, therefore we still obey the grammar rules.
00:12:44.320
When we consume that number token, holding the value 3, we assess our options.
00:12:50.590
To reach step two again involves interpreting the number token or engaging it in the first production rule.
00:12:57.560
Peeking ahead at the next token confirms we should follow the second option.
00:13:02.520
Deciding it represents a binary expression allows 3 to be the left child.
00:13:10.220
We now consume the operator token and again analyze what follows.
00:13:17.640
As before, this token follows our previous production rule, so we add the plus operator.
00:13:24.130
We reach token 7 and recognize it must also be a valid expression.
00:13:30.680
Upon consuming the final parenthesis, we complete the subexpression.
00:13:38.120
Through these steps, we build our syntax tree.
00:13:45.360
Many noticed the recursive nature of the steps we used to build the syntax tree.
00:13:53.290
We call this parsing method 'recursive descent,' commonly implemented by hand; however, it has its drawbacks.
00:14:01.420
The recursive nature can lead to inefficiencies or even infinite recursion if not carefully written.
00:14:08.390
For example, an expression rule that produces itself indefinitely creates an infinite loop.
00:14:14.890
Alternatively, parsing can be achieved from the bottom up, using a stack.
00:14:21.070
Tokens are pushed onto the stack until they fulfill a grammar rule, at which point they are reduced.
00:14:27.880
Now, returning to our math example, we start by pushing the first token onto the stack.
00:14:34.430
Since we recognize '2' as a number token, we reduce it to the left-hand symbol for number.
00:14:41.830
The next token to shift is the multiplication, categorized as an operator.
00:14:49.390
Pushing parentheses onto the stack allows flexibility within expression rules.
00:14:57.620
The next token is a '3,' which again gets pushed through into a number token.
00:15:04.150
Upon encountering the '+' operator, we identify it as an OP token, and again reduce it.
00:15:11.000
Next, we shift over to '7,' which forms a valid expression.
00:15:18.010
Having fulfilled another grammar rule; we can construct and evaluate a complete tree.
00:15:23.470
These parsing approaches—top-down and bottom-up—are the two primary methodologies.
00:15:31.830
Top-down parsers are easier to write by hand, whereas bottom-up models are more complex.
00:15:37.780
However, tools exist to facilitate parser creation, known as parser generators.
00:15:45.680
Parser generators only require a grammar file to outline the rules of the language.
00:15:53.340
These grammar files are quite similar to the BNF notation we've discussed.
00:16:01.210
Utilizing tools like Racc or Bison, one can input a grammar file and effortlessly generate a parser.
00:16:07.780
Now, in conclusion, there are many applications for using a parser.
00:16:16.320
For instance, you could use a parser to validate patterns in URLs or email addresses.
00:16:23.490
Additionally, they can help extract information from email bodies or logs.
00:16:30.500
Converting structured documents into different formats, like markdown to HTML, is another use.
00:16:38.600
While simple regular expressions can validate some strings, they become complicated for complex patterns.
00:16:44.860
Languages belong to a hierarchy introduced by Noam Chomsky, categorized into four types.
00:16:51.440
First is type zero, unrestricted; type one is context-sensitive.
00:16:58.540
Types two and three are context-free languages and regular languages respectively.
00:17:07.120
Most programming languages fit under type two.
00:17:13.370
To illustrate this further, consider the AAB language: sequences of A's followed by equal B's.
00:17:21.860
To create grammar rules for AAB is tricky, as it would require infinite definitions.
00:17:28.160
This problem parallels that of checking balanced parentheses.
00:17:35.240
Crafting a parser makes defining the grammar—the most challenging aspect.
00:17:43.830
Recall a puzzle involving gnomes, some with red paint and others blue, which can sort themselves.
00:17:57.370
Using BNF notation, the sentence indicates that an A or B includes another statement, allowing all string combinations.
00:18:05.220
Thus, further exploration reveals that it's context-free.
00:18:13.500
Grammars, while challenging to define, inform how to structure parsers. The hierarchy of languages provides a clear understanding of which are regular versus context-free.
00:18:21.190
Resources for parser generators, whether for understanding specifications like HTTP headers or validating input.
00:18:27.920
Programming languages often employ parsers for web frameworks, enhancing performance over traditional methods.
00:18:36.760
Parsers can often be more effective than using plain regular expressions, as they provide precision and clarity.
00:18:45.260
Many web servers depend on parsers to manage HTTP requests effectively, demonstrating their importance.
00:18:52.810
Parser generators save time and complexity, enough to offer powerful toolsets for clear application.
00:19:00.490
You can transform your grammar into actionable code, enabling creativity in programming pursuits.
00:19:09.780
I appreciate your attention and hope you walk away with insights into how parsing and grammars can enhance programming!
00:19:18.500
Thank you!