RubyKaigi 2015

Time flies like an arrow; Fruit flies like a banana: Parsers for Great Good

http://rubykaigi.org/2015/presentations/SoManyHs

When you type print "Hello, world!", how does your computer
know what to do? Humans are able to naturally parse spoken language by
analyzing the role and meaning of each word in the context of its
sentence, but we usually take for granted the way computers make sense
of the code we write.

By exploring the way our brains construct grammars to parse sentences,
we can better understand how parsers are used for computering --
whether it be in the way Ruby and other languages are implemented or
in webserver routing -- and recognize when they may be the right tool
to use in our own code.


00:00:00 When you type 'print "Hello, world!"', how does your computer know what to do? Humans can naturally parse spoken language by analyzing the role and meaning of each word in the context of its sentence, but we often take for granted how computers make sense of the code we write.
00:00:09 By exploring the way our brains construct grammars to parse sentences, we can better understand how parsers are used in computing — whether it be in the way Ruby and other languages are implemented or in web server routing. We can also recognize when they may be the right tool to use in our own code.
00:00:41 My name is Hsing-Hui Hsu, which in Chinese is pronounced 'Shing Way'. I don't speak any Japanese, but I understand that Japanese kanji derive from Chinese Han characters. Although the translation isn't always perfect, my name means 'person capable of wisdom' in Chinese. However, my parents are from Taiwan, and pronunciations can differ enough that when I introduce myself to people in Beijing, they are sometimes surprised, because 'Shing Way' sounds like a homophone for a different character combination meaning 'to offer a bribe'.
00:01:03 Perhaps it was fate, having an ambiguous name, that led me to think about parsing languages and foreign languages in general. I became more interested in understanding how languages work when I was teaching English as a foreign language. I often had to explain why many English words, like 'flies', could have different meanings depending on how they are used.
00:01:30 While I was learning how to program and delving into how computer languages are parsed, I noticed similarities between human sentence parsing and the way computer languages are processed. As a relative newcomer to the programming world, I never learned about parsers or compilers in school. Thus, I like to jokingly refer to my talk as 'How I Accidentally Became a Computer Scientist'.
00:01:50 This talk won’t be a comprehensive survey of parsers, but I'd like to share how I came across them. I build Rails applications and got curious about how things work beneath the surface. I wanted to understand routing, so I explored Rails source code on GitHub and found a file named 'parser.rb'.
00:02:07 At first glance, I expected it to contain logical code, but instead, it was filled with arrays of numbers. I was perplexed — where was the logic, where was the readable code? Digging deeper, I discovered that this file was generated using another file called 'parser.y', but it looked nothing like any Ruby code I recognized.
00:02:27 To introduce the concept of parsing, let's play a game. I hope most of you have played this before, but for those unfamiliar, it's a word game where one person asks for kinds of words, such as nouns or adjectives, to construct sentences.
00:02:39 For instance, I need a verb. Anyone? Right! So, 'the young man drank' instead of 'drink'. This seems like a perfectly valid sentence based on our middle school grammar lessons. We can break it up into a subject and a verb, with 'the young man' as the subject doing the verb 'drank'.
00:03:05 However, there are other possible sentences that might start with 'the young man'. For example, we could say 'the young man did something'. This time we need a verb and a noun. We can stick with our original verb 'drank'. Any suggestions for a noun? How about 'sake'? This gives us a subject + verb + object structure, another valid type of English sentence.
00:03:50 As we explore this, we realize the structure allows for more creativity and variation in sentence formation. For example, we can have combinations such as 'the young man drank beer' or 'the young man did something clear'. This exploration of words and grammar rules forms a basis for understanding parsing.
00:04:06 The sentences we constructed abide by the rules of English grammar. If we analyze 'the young man drank sake', we can see that 'the young man' is the subject, 'drank' the verb, and 'sake' the object. But can we interpret 'the young man' in another way? If we say 'the young man the boat', we suddenly run into trouble: we have already parsed 'the young man' as the subject and expect a verb next, when in fact 'man' is the verb here, as in 'the young (people) man the boat'.
00:04:59 This peculiar kind of sentence is called a garden path sentence: one that leads you toward a false interpretation because of ambiguous grammar. Consider the phrase 'Time flies like an arrow; fruit flies like a banana.' The pivotal word here is 'like', which in the first clause signals that 'flies' is a verb, and in the second clause is itself the verb, making 'flies' part of the noun 'fruit flies'.
00:05:46 Such sentences trick the brain into following one reading before shifting unexpectedly in a new direction. For instance, 'the prime number few' misleads us because we first read 'prime number' as a noun phrase, when 'number' is actually the verb, as in 'the prime (the select) number (amount to) few'; another clever twist of grammar.
00:06:07 Similarly, the phrase 'the man who hunts ducks out on weekends' appears straightforward. Yet, we initially interpret 'ducks' as a noun, when it's actually part of a phrasal verb, 'ducks out'. This playful ambiguity extends to how we perceive adjective clauses, leading us to misinterpret action agents in sentences.
00:06:35 Consider a sentence that at first seems to say the teacher drank sake, but on a second reading reveals that it was the student who drank the sake, and the teacher was merely advising. Such ambiguity can yield unintentional, humorous meanings, especially in newspaper headlines. For example: 'Grandmother of eight makes hole in one!' or 'Complaints about NBA referees growing ugly.'
00:07:02 Through this exploration of the grammar of language, we see that despite variations and ambiguities, as speakers, we subscribe to certain grammatical rules that govern our expressions.
00:07:19 As we've seen with our previous examples, a valid sentence construction in English generally consists of a subject + verb + object. Most of us probably engaged in sentence diagramming in school, where we break sentences into noun and verb phrases, culminating in a structure akin to a tree representation.
00:07:53 In this tree, the leaves represent the words in the sentence. For instance, 'John lost his pants' can be diagrammed this way, as can the earlier example, 'the young man the boat'. These kinds of trees arise from grammatical rules, and in computing, grammars are usually written in a notation known as Backus-Naur Form (BNF).
00:08:22 BNF notation consists of a set of production rules that outline the structure of the language, allowing every potential sentence to be defined. In BNF we have two types of symbols: terminals, the actual words or tokens that appear in the input, and non-terminals, symbols that expand into further symbols. The left-hand side of a production rule produces what appears on the right-hand side.
00:08:57 For example, a simplified set of production rules for English sentences might say that a sentence becomes a subject followed by a predicate, which covers a sentence like 'the young man drank sake'. Such a grammar might use six non-terminals and seven terminals, the terminals being the actual words drawn from the sentence.
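As a rough illustration, a grammar of that shape could be written in BNF something like the following (the exact symbols and rules on the slides may differ; this is just a sketch):

    <sentence>    ::= <subject> <predicate>
    <subject>     ::= <noun-phrase>
    <predicate>   ::= <verb> | <verb> <object>
    <noun-phrase> ::= <article> <adjective> <noun> | <article> <noun>
    <object>      ::= <noun-phrase> | <noun>
    <article>     ::= "the"
    <adjective>   ::= "young"
    <noun>        ::= "man" | "sake"
    <verb>        ::= "drank"

Everything in angle brackets is a non-terminal that keeps expanding; the quoted words are the terminals that end up as the leaves of the tree.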
00:09:28 Using these rules, we can start from the top and keep applying productions until we generate the complete sentence. Working through the rules for our other example phrases, we can construct their trees in the same way.
00:09:55 So, what does all this have to do with computers? The process of parsing and understanding sentences in natural languages is not vastly different from how a computer language parser operates.
00:10:09 When you write code, it is fed into a lexer, or tokenizer, which breaks the source text into recognizable units called tokens. These tokens are then passed to a parser that produces an abstract syntax tree, which can be traversed by a compiler or interpreter to generate CPU instructions.
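You can watch Ruby itself do both stages using Ripper from the standard library; this little session is my own illustration, not something from the talk:

    require 'ripper'
    require 'pp'

    code = 'print "Hello, world!"'

    # Lexing: each entry carries a token's position, type, and string.
    pp Ripper.lex(code)

    # Parsing: a nested array that mirrors the abstract syntax tree.
    pp Ripper.sexp(code)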
00:10:41 While most of us aren't engaged in developing programming languages from scratch, understanding how lexers and parsers function enables us to apply parsing techniques in more straightforward scenarios.
00:11:05 You can use parsers not just on Ruby scripts but on all kinds of input, including log files, markdown files, or plain strings, and the output need not be a tree; it can be whatever data structure suits the job.
00:11:27 Let's look at lexing first: converting your input into a sequence of tokens that your parser can recognize. Those tokens are then matched against the grammar rules of the language you want to parse.
00:11:51 For demonstration, let’s build a parser for simple binary operations in mathematics, covering addition, subtraction, and multiplication. This language consists of an integer, an operator, and another integer, encapsulated by very basic grammar rules.
00:12:37 The grammar needs only a few production rules: an expression expands into an integer, an operator, and another integer, and the operator and integer symbols expand into the terminals themselves. The tokenizer can be implemented in Ruby using regular expression matching to return the relevant tokens.
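In BNF-style notation, that tiny language might look something like this (a sketch of mine, not the slide's exact rules):

    <expression> ::= <integer> <operator> <integer>
    <operator>   ::= "+" | "-" | "*"
    <integer>    ::= any integer literal, e.g. 0, 1, 2, ...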
00:13:11 When we feed in an expression like '3 + 7', the tokenizer should return an array of tokens. We can then pass that token array to our parser, which ultimately builds a tree structure.
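Here is a rough sketch of such a tokenizer (the method and token names are my own, not lifted from the slides), using StringScanner from Ruby's standard library for the regular expression matching; it also recognizes parentheses so it can handle the nested example that comes up next:

    require 'strscan'

    # Turns "3 + 7" into [[:INT, 3], [:OP, "+"], [:INT, 7]].
    def tokenize(input)
      scanner = StringScanner.new(input)
      tokens  = []
      until scanner.eos?
        if scanner.skip(/\s+/)                   # ignore whitespace
          next
        elsif (number = scanner.scan(/\d+/))     # integer literals
          tokens << [:INT, number.to_i]
        elsif (op = scanner.scan(/[+\-*]/))      # the three operators
          tokens << [:OP, op]
        elsif (paren = scanner.scan(/[()]/))     # parentheses, for nested expressions
          tokens << [paren == "(" ? :LPAREN : :RPAREN, paren]
        else
          raise "unexpected character: #{scanner.getch.inspect}"
        end
      end
      tokens
    end

    tokenize("3 + 7")   # => [[:INT, 3], [:OP, "+"], [:INT, 7]]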
00:13:45 For mathematical expressions such as '2 * (3 + 7)', the rules we construct need to account for operators, parentheses, and more. The grammar has to handle nesting and precedence: a parenthesized expression can appear anywhere a single integer could, and multiplication should bind more tightly than addition, so that compound expressions come out as a single well-shaped tree.
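One conventional way to layer such a grammar (not necessarily the exact rules from the talk) is to split it into expression, term, and factor, so that multiplication binds tighter than addition and parentheses can nest:

    <expression> ::= <expression> "+" <term>
                   | <expression> "-" <term>
                   | <term>
    <term>       ::= <term> "*" <factor>
                   | <factor>
    <factor>     ::= <integer>
                   | "(" <expression> ")"

With these rules, '2 * (3 + 7)' parses as a multiplication whose right-hand side is the whole parenthesized sum.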
00:14:21 The way we parse can be either top-down (recursive descent) or bottom-up (shift-reduce). A top-down parser starts from the grammar's start symbol and looks at the upcoming token to decide which production rule to follow, building the tree from the root down as it consumes tokens.
00:15:11 Within this mechanism, if the next token doesn't match anything the current rule allows, the parser raises a parse error; that is how it keeps the syntax tree it is building consistent with the grammar.
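A minimal recursive descent parser for the layered arithmetic grammar might look like the sketch below (class and method names are mine, not from the talk). It consumes the token arrays produced by the tokenizer sketch above, and the left-recursive rules are rewritten as loops, which is exactly the infinite recursion pitfall mentioned next:

    class ExpressionParser
      def initialize(tokens)
        @tokens = tokens
      end

      def parse
        node = parse_expression
        raise "parse error: unexpected #{peek.inspect}" unless @tokens.empty?
        node
      end

      private

      # expression ::= term (("+" | "-") term)*
      # (left recursion rewritten as a loop to avoid infinite recursion)
      def parse_expression
        node = parse_term
        while peek_operator?("+", "-")
          op   = advance[1]
          node = [op.to_sym, node, parse_term]
        end
        node
      end

      # term ::= factor ("*" factor)*
      def parse_term
        node = parse_factor
        while peek_operator?("*")
          advance
          node = [:*, node, parse_factor]
        end
        node
      end

      # factor ::= INT | "(" expression ")"
      def parse_factor
        type, value = advance
        case type
        when :INT
          value
        when :LPAREN
          node = parse_expression
          expect(:RPAREN)
          node
        else
          raise "parse error: unexpected #{[type, value].inspect}"
        end
      end

      def peek
        @tokens.first
      end

      def peek_operator?(*ops)
        peek && peek[0] == :OP && ops.include?(peek[1])
      end

      def advance
        @tokens.shift or raise "parse error: unexpected end of input"
      end

      def expect(type)
        token = advance
        raise "parse error: expected #{type}, got #{token.inspect}" unless token[0] == type
        token
      end
    end

    # ExpressionParser.new(tokenize("2 * (3 + 7)")).parse
    # # => [:*, 2, [:+, 3, 7]]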
00:15:56 Compared to a naive recursive top-down approach, shift-reduce parsing avoids issues like infinite recursion on left-recursive rules and makes operator precedence easier to manage.
00:16:31 A shift-reduce parser works from the bottom up: it shifts tokens onto a stack and, whenever the top of the stack matches the right-hand side of a production rule, reduces those symbols to the rule's non-terminal, reassembling the tokens into meaningful structures until the whole input has been folded into the start symbol.
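To get a feel for it, here is a hand-traced run over '3 + 7' against a simple expression grammar (my own illustration, simplified from whatever was on the slides):

    stack                      remaining input    action
    (empty)                    3 + 7              shift 3
    INT(3)                     + 7                reduce: expression ::= INT
    expression                 + 7                shift +
    expression +               7                  shift 7
    expression + INT(7)        (empty)            reduce: expression ::= INT
    expression + expression    (empty)            reduce: expression ::= expression "+" expression
    expression                 (empty)            accept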
00:16:57 Thanks to parser generators, you can construct parsers simply by determining the grammar. Rather than writing a parser from scratch, generator tools can automate the heavy lifting.
00:17:22 Tools like Racc take grammar files and produce parsers for you. Once the rules of the language are defined, Racc generates the underlying machinery for parsing input, so you can keep your own code clear and simple.
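For example, a Racc grammar file for the arithmetic language might look roughly like this (the file name, class name, and actions are my own sketch). Note that the left-recursive rules that tripped up recursive descent are fine here, because Racc generates an LALR, shift-reduce parser. Running something like 'racc arithmetic.y -o arithmetic_parser.rb' would then produce the table-driven parser, the same kind of number-filled file as the Rails parser.rb mentioned earlier:

    class ArithmeticParser
    rule
      expression : expression '+' term   { result = [:+, val[0], val[2]] }
                 | expression '-' term   { result = [:-, val[0], val[2]] }
                 | term

      term       : term '*' factor       { result = [:*, val[0], val[2]] }
                 | factor

      factor     : INT                   { result = val[0] }
                 | '(' expression ')'    { result = val[1] }
    end

    ---- inner
      # Racc expects tokens as [type, value] pairs; for the quoted literals
      # above the type is the literal itself, e.g. ['+', '+'] and ['(', '('],
      # with [:INT, 3] for integers.
      def parse(tokens)
        @tokens = tokens
        do_parse
      end

      def next_token
        @tokens.shift || [false, false]   # [false, false] marks end of input
      end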
00:18:09 A generated parser accepts exactly the input your grammar describes, while the grammar file itself stays readable, which makes constructing and validating complex languages much more manageable.
00:18:37 In programming, it is also worth knowing when to reach for a parser instead of a regular expression. Regexes for tasks like URL validation, for instance, quickly grow into complex patterns that are hard to read and maintain.
00:19:11 Parsers, in contrast, can accurately handle context-free languages and provide a more robust framework for recognition and validation. A plain regular expression cannot, for example, count arbitrarily nested parentheses, while a recursive grammar rule handles that nesting naturally.
00:19:56 Writing a parser for something like HTTP may seem daunting, but once you pin down the grammar rules (the HTTP RFCs specify them in a BNF-style notation), you can let tools automate the rest, with a real payoff in performance.
00:20:30 Parsers can also speed up routing and overall execution in web frameworks: instead of checking each route pattern in turn, a parser-based router can match requests against a tree-like structure, turning a linear scan into something closer to logarithmic time.
00:21:05 RubyGems and Bundler illustrate how different approaches lead to different outcomes depending on the parsing method: Bundler relies on regexes, while RubyGems opts for a more formal lexing and parsing approach.
00:21:24 Choosing an appropriate parser can solve many of the problems that regexes run into. By working out the grammar of your input and structuring your parser around it, you get code that is clearer about what it accepts and more efficient at handling your data.
00:21:52 In summary, parsers are a powerful mechanism for language processing and validation, and parser generators remove the drudgery of writing a parser from scratch once you have defined the necessary grammar rules.
00:22:16 In conclusion, I’d like to extend my gratitude for your attention and I hope this talk sheds light on the importance and utilization of parsers within both programming environments and natural languages.