class: center, middle

# CS-3110: Formal Languages and Automata
## Parsing
### Chapter 4.2, 4.3

---

# Parsing

Last time we talked about grammars, for example:

$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\:E\:)\\\\
X \rightarrow a\\\\
X \rightarrow b\\\\
\vdots\\\\
X \rightarrow z
$$

---

# Generation vs. Recognition

* By starting with the start symbol and applying rules, we can **generate** words from our language

* Often, though, we want to verify that a given word is **in** the language

* In practical implementations, this process is known as "parsing"

* Maybe we can use a similar idea as for DFAs and NFAs?

---

# Parsing

* Given a string, e.g. `(a + c) * d`, determine if that string is an element of our language

* Recall from automata: We read the word character by character and apply some rules (in DFAs and NFAs: the transition function)

* Let's try the same!

---

# Parsing

Parsing: `(a + c) * d`

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* "State": E

* Input: `(a + c) * d`

* Rule: `\(E \rightarrow (\: E\: )\)`

* Read `(` and remember `)` for later

* Next state: `E )`
]

---

# Parsing

Parsing: `(a + c) * d`

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* "State": `E )`

* Input: `a + c) * d`

* Rule: `\(E \rightarrow X\)`

* Next state: `X )`
]

---

# Parsing

Parsing: `(a + c) * d`

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* "State": `X )`

* Input: `a + c) * d`

* Rule: `\(X \rightarrow a\)`

* Read `a`

* Next state: `)`
]

---

# Parsing

Parsing: `(a + c) * d`

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* "State": `)`

* Input: `+ c) * d`

* We still have a `)` remembered!

* Try to read `)`

* ???
]

---
class: middle
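---
class: small

# The same attempt, as code

Here is a minimal sketch (mine, not from the book or the slides) of the strategy we just tried by hand: keep a list of grammar symbols as the "state" and always commit to the first rule that looks like it could match the next input character. The rule order and helper names below are my own choices, picked so the run mirrors the trace above.

```Python
# Sketch only: the "commit to one rule and hope" strategy from the trace above.
RULES = {
    "E": [["(", "E", ")"], ["X"], ["E", "+", "E"], ["E", "*", "E"]],
    "X": [[c] for c in "abcdefghijklmnopqrstuvwxyz"],
}

def naive_parse(word):
    state = ["E"]                          # symbols we still have to produce
    rest = [c for c in word if c != " "]   # input characters left to read
    while state:
        symbol = state.pop(0)
        if symbol in RULES:                # non-terminal: expand it greedily
            for rhs in RULES[symbol]:
                first = rhs[0]
                if first in RULES or (rest and first == rest[0]):
                    state = rhs + state    # commit to the first plausible rule
                    break
            else:
                return False
        elif rest and rest[0] == symbol:   # terminal: must match the input
            rest.pop(0)
        else:
            return False                   # stuck, e.g. expecting ')' but seeing '+'
    return not rest

print(naive_parse("(a + c) * d"))          # False -- same dead end as above
```

On `(a + c) * d` this makes exactly the choices from the previous slides and gets stuck on the remembered `)`, so it wrongly rejects a word that *is* in the language.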
---
class: medium

# Parsing: What happened?

* Maybe starting with `\(E \rightarrow (\:E\:)\)` was a bad idea?

* Or maybe it was when we had `a + c) * d` and applied `\(E \rightarrow X\)`

* Instead, like in an NFA, we have to try **all** rules that could be applicable

* Which other rules would be applicable? All that have `E` on the left side!

$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\:E\:)
$$

---

# Parsing "properly"

* Parsing `(a + c) * d`

$$
\begin{aligned}
E &\Rightarrow^\ast E \ast E \\\\
  &\Rightarrow^\ast ( E ) \ast E \\\\
  &\Rightarrow^\ast ( E + E ) \ast E \\\\
  &\Rightarrow^\ast ( X + X ) \ast X \\\\
  &\Rightarrow^\ast ( a + c ) \ast d
\end{aligned}
$$

---

# Parse Trees

Recall `(a+b)*(b+c)` as a tree:
This is "almost" a *parse tree*

---

# Parse Trees

* To actually do the parsing, we need to apply the rules from our grammar

* We said something like "To apply `\(E \rightarrow E + E\)` we'll just make a node labeled '+'"

* But what if we had a rule like `\(E \rightarrow \text{if}\:C\:\text{then}\:S\:\text{else}\:S\)`?

* Instead, in a "real" parse tree, our interior nodes are non-terminal symbols!

---

# Parse Trees

The actual parse tree of `(a+b)*(b+c)`
---

# Parse Trees

What's the parse tree of `a+b*c`?
---

# Parse Trees

What's the parse tree of `a+b*c`?
---
class: medium

# Ambiguity

* There are different orders in which we can apply rules!

* This can give the same word more than one parse tree: the grammar is **ambiguous**

* We could say that we always replace the left-most non-terminal symbol

* This will produce a **left-derivation**

* Unfortunately, even *that* is not enough in our case

* In practice, we often want something more deterministic!

---
class: center, middle

# Practical Parsing

---

# Determinism

* Next time we will look at Pushdown Automata, which can recognize context-free languages

* They work by performing basically this replacement operation for all applicable rules in parallel

* There are also Deterministic Pushdown Automata

* Unlike for Finite Automata, the deterministic and non-deterministic variants are not equally powerful!

---

# Determinism

* In practice, deterministic recognition is usually more efficient

* For grammars we actually use, we therefore strive to keep them (mostly) deterministic

* How? By making sure there is always only one rule that applies

* There are many ways to do this, e.g. by letting the recognizer "look ahead" in the string

* Often we can rewrite the grammar to achieve this

---
class: small

# Expression grammar

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* For each non-terminal we define "First" and "Follow"

* First: Set of all terminals that can appear as the left-most symbol of anything derived from the non-terminal

* Follow: Set of all terminals that appear **after** the non-terminal on the right side of a rule

* First(E) = `\(\{(, a, b, \ldots, z\}\)`

* Follow(E) = `\(\{+, \ast, )\}\)`
]

---
class: mmedium

# First Sets

To determine First(X):

* Take every rule that has X on the left side

* If the first symbol on the right side is a terminal symbol (letter) or epsilon, add it to the First-set

* If the first symbol on the right side is a non-terminal symbol, e.g. Y, add every terminal symbol (but **not** epsilon) from First(Y) to First(X)

* If First(Y) contains an epsilon, continue with the symbol after Y in the same way

* If you reach the end of the rule this way, add epsilon to First(X)

---
class: mmedium

# Follow Sets

Start by adding $ to the Follow-set of the start symbol. Then, to determine Follow(X):

* Take every rule that has X on the **right** side

* If the symbol after X is a terminal symbol, add it to Follow(X)

* If the symbol after X is a non-terminal symbol, e.g. Y, add every terminal symbol (but **not** epsilon) from First(Y) to Follow(X)

* If First(Y) contains an epsilon, continue with the symbol after Y in the same manner

* If you reach the end of the rule (e.g. X is at the end, or every symbol after it can derive epsilon), add the Follow-set of the **left** side of the rule to Follow(X)

---

# First and Follow

* Why do we need two sets?

* Because our non-terminal may "disappear" (have a rule that sets it to epsilon)

* For each symbol in First, and - if the non-terminal can go to epsilon - in Follow, we can then list the rule(s) that should be applied

* Basically, this will tell us which rule(s) we can apply at what point during parsing

* Let's look at a reduced version of our expression grammar

---

# Reduced Example

$$
E \rightarrow E + E\\\\
E \rightarrow (\:E\:)\\\\
E \rightarrow X\\\\
X \rightarrow a
$$

First we determine the First- and Follow-sets.
--

First(E) = {(, a}, Follow(E) = {$, ), +}

First(X) = {a}, Follow(X) = {$, ), +}

Now we create a table, with one row per non-terminal and one column per terminal symbol. Each entry in the table represents which rule(s) we apply if we currently have a particular non-terminal symbol and then read the given terminal symbol as the next input.

---

# Reduced Example

$$
E \rightarrow E + E\\\\
E \rightarrow (\:E\:)\\\\
E \rightarrow X\\\\
X \rightarrow a
$$
|       | a | ( | ) | + |
|-------|---|---|---|---|
| **E** | `\(E \rightarrow E + E\)`, `\(E \rightarrow X\)` | `\(E \rightarrow (\:E\:)\)` | | |
| **X** | `\(X \rightarrow a\)` | | | |
When we start with an `a`, we do not know which rule to take, because there are two entries in our table cell!

---

# What could we do?

$$
E \rightarrow E + E\\\\
E \rightarrow X\\\\
E \rightarrow (\:E\:)\\\\
X \rightarrow a
$$

First(E) is `\(\{(, a\}\)`, so we only want (at most) two rules for E, one for each case. We can then **optionally** let the `a` be followed by more tokens.

$$
E \rightarrow X F \\\\
F \rightarrow \varepsilon\\\\
F \rightarrow + E\\\\
X \rightarrow (\:E\:)\\\\
X \rightarrow a
$$

---

# Expression grammar, rewritten

The same logic can be applied to the full expression grammar:

.left-column-math[
$$
E \rightarrow X F\\\\
F \rightarrow \varepsilon\\\\
F \rightarrow +\: E\\\\
F \rightarrow *\:E\\\\
X \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column-math[
]

---
class: small

# Another Example

.smallm[
$$
S \rightarrow A B \\\\
A \rightarrow \varepsilon \\\\
A \rightarrow - \\\\
B \rightarrow 0 \\\\
B \rightarrow 1 C \\\\
C \rightarrow 0 C\\\\
C \rightarrow 1 C\\\\
C \rightarrow \varepsilon
$$
]

--

.left-column-wide[
* First(C) = `\(\{0,1,\varepsilon\}\)`, Follow(C) = `\(\{\$\}\)`

* First(B) = `\(\{0,1\}\)`, Follow(B) = `\(\{\$\}\)`

* First(A) = `\(\{-,\varepsilon\}\)`, Follow(A) = `\(\{0, 1\}\)`

* First(S) = `\(\{-,0,1\}\)`, Follow(S) = `\(\{\$\}\)`
]

.right-column-wide[
|       | - | 0 | 1 | $ |
|-------|---|---|---|---|
| **S** | `\(S \rightarrow A B\)` | `\(S \rightarrow A B\)` | `\(S \rightarrow A B\)` | |
| **A** | `\(A \rightarrow -\)` | `\(A \rightarrow \varepsilon\)` | `\(A \rightarrow \varepsilon\)` | |
| **B** | | `\(B \rightarrow 0\)` | `\(B \rightarrow 1C\)` | |
| **C** | | `\(C \rightarrow 0 C\)` | `\(C \rightarrow 1C\)` | `\(C \rightarrow \varepsilon\)` |
]
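---
class: small

# Computing First Sets

The procedure from the "First Sets" slide is easy to turn into a short program. This is only a sketch (mine, not from the book or the slides): the grammar is written as a Python dictionary, and `EPS` stands for `\(\varepsilon\)`.

```Python
# Sketch only: compute First sets with the procedure from the "First Sets" slide.
EPS = "eps"   # stands for the empty word (epsilon)

GRAMMAR = {   # non-terminal -> list of right-hand sides
    "S": [["A", "B"]],
    "A": [[EPS], ["-"]],
    "B": [["0"], ["1", "C"]],
    "C": [["0", "C"], ["1", "C"], [EPS]],
}

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                                # repeat until nothing grows any more
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                new = set(first[nt])
                for symbol in rhs:
                    if symbol not in grammar:     # terminal or epsilon: add it, stop
                        new.add(symbol)
                        break
                    new |= first[symbol] - {EPS}  # non-terminal: add its First set
                    if EPS not in first[symbol]:  # ...and only go on if it can vanish
                        break
                else:                             # every symbol could derive epsilon
                    new.add(EPS)
                if new != first[nt]:
                    first[nt], changed = new, True
    return first
```

For the grammar above, `first_sets(GRAMMAR)` produces First(S) = {-, 0, 1}, First(A) = {-, ε}, First(B) = {0, 1}, and First(C) = {0, 1, ε}, matching the sets we just listed; Follow sets can be computed with a very similar fixed-point loop.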
---

# Recursive Descent Parser

* When the grammar is not ambiguous (there is at most one rule in each table cell), writing a parser is easy!

* Each non-terminal symbol becomes a function

* This function checks the next symbol to be read, which determines which rule to apply

* Each rule is a sequence of function calls (for non-terminals) or read operations (for terminal symbols)

---

# Recursive Descent Parser

$$
E \rightarrow X F\\\\
X \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
\ldots
$$

```Python
def E(input):
    if input.next in ["(", "a", "b", ..., "z"]:
        X(input)
        F(input)

def X(input):
    if input.next == "(":
        input.read("(")
        E(input)
        input.read(")")
    elif input.next == "a":
        input.read("a")
    ...
```

---

# Sentences, words, and characters

* In this class, we build our "words" from "characters"

* In English (or other languages) we usually talk about "sentences" constructed of "words"

* In both cases we use "grammars"

* In practical applications, like programming languages, we often want to use, e.g. a "variable name" as one of our "characters"

---
class: medium

# Programming Language Parsing

* Take "variable name" again: In most languages the rules for variable names (**identifiers**) are relatively simple (e.g. "a sequence of letters, digits, and underscores that does not start with a digit")

* The syntax rules (**grammar**) of the language are something more complex, built on top of these simple building blocks

* In practice we therefore use these building blocks, called **tokens**, as terminal symbols, e.g.

$$
\mathit{MethodHeader} \rightarrow \mathit{Mods}\;\text{ident}\;\text{ident}(\mathit{Args})
$$

---
class: center, middle

# Practical Grammars

---

# Syntax

* Another "annoying" aspect of grammars is writing them down

* Math notation is aesthetically pleasing, but - LaTeX notwithstanding - hard to type

* It's also long-winded, although there is the common abbreviation using the symbol | for "or":

$$
E \rightarrow \: X\:F\\\\
F \rightarrow \varepsilon\: |\: +\:E\: |\: *\:E\\\\
X \rightarrow (\:E\:)\: | a\: |\: b\: |\: c\: \cdots\: |\: z
$$

---
class: medium

# BNF

* John Backus and Peter Naur were among the people developing ALGOL in 1958

* In order to describe the language more formally, they developed a "meta-language" called BNF ("Backus-Naur Form")

* It allows the description of "production rules"

* Non-terminal symbols are enclosed in angle brackets (<, >), terminal symbols in double quotes

* Production rules consist of a non-terminal, followed by "::=", followed by a sequence of non-terminal and terminal symbols

---

# BNF Example

Our grammar, in BNF, would look like this:

```
<E> ::= <X> <F>
<F> ::= "+" <E>
      | "*" <E>
      | ""
<X> ::= "(" <E> ")"
      | "a" | "b" | ... | "z"
```
[Railroad Diagram Generator](https://www.bottlecaps.de/rr/ui)

---

# EBNF

* There are **many** variations of BNF, including "EBNF" (Extended BNF)

* It gets rid of the angle brackets

* It includes things like `{ E }` (zero or more repetitions of E), or `[ E ]` (optional E), or similar notation

* Unfortunately, most tools have their own idea of what *exactly* is supported and how

---

# Throwback Thursday: RFC 5322

```
addr-spec     = local-part "@" domain
local-part    = dot-atom / quoted-string
domain        = dot-atom
dot-atom      = 1*atext *("." 1*atext)
atom          = 1*atext
atext         = ALPHA / DIGIT /
                "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" /
                "-" / "/" / "=" / "?" / "^" / "_" / "`" / "{" /
                "|" / "}" / "~"
qtext         = %d33 / %d35-91 / %d93-126
quoted-string = DQUOTE *([FWS] qtext) [FWS] DQUOTE
```
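---
class: small

# ABNF meets recursive descent

To tie this back to parsing: the simplified rules above can be turned into a recognizer in exactly the way we saw earlier, one function per rule. This is a rough sketch of mine (not from the RFC, and not a complete validator): it ignores `quoted-string`, `FWS`, and comments, and the function names simply mirror the ABNF rule names.

```Python
# Sketch only: recursive descent over the simplified addr-spec rules above.
ATEXT = set("abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "0123456789!#$%&'*+-/=?^_`{|}~")

def dot_atom(s, i):
    # dot-atom = 1*atext *("." 1*atext)
    start = i
    while i < len(s) and s[i] in ATEXT:
        i += 1
    if i == start:
        return None                      # need at least one atext character
    while i < len(s) and s[i] == ".":
        j = i + 1
        while j < len(s) and s[j] in ATEXT:
            j += 1
        if j == i + 1:
            break                        # "." not followed by atext: stop here
        i = j
    return i

def addr_spec(s):
    # addr-spec = local-part "@" domain, with both sides restricted to dot-atoms
    i = dot_atom(s, 0)
    if i is None or i >= len(s) or s[i] != "@":
        return False
    j = dot_atom(s, i + 1)
    return j == len(s)
```

`addr_spec("cs-3110@example.edu")` returns `True`, while strings like `"a..b@c"` or `"no-at-sign"` are rejected, just as the `dot-atom` rule demands.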