class: center, middle

# CS-3110: Formal Languages and Automata
## Parsing
### Chapter 4.2, 4.3

---

# Parsing

Last time we talked about grammars, for example:

$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\:E\:)\\\\
X \rightarrow a\\\\
X \rightarrow b\\\\
\vdots\\\\
X \rightarrow z
$$

---

# Generation vs. Recognition

* By starting with the start symbol and applying rules, we can **generate** words from our language

* Often, though, we want to verify that a given word is **in** the language

* In practical implementations, this process is known as "parsing"

* Maybe we can use a similar idea as for DFAs and NFAs?

---

# Parsing

* Given a string, e.g. `(a + c) * d`, determine if that string is an element of our language

* Recall from automata: We read the word character by character and apply some rules (in DFAs and NFAs: the transition function)

* Let's try the same!

---

# Parsing

Parsing: `(a + c) * d`

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* "State": E

* Input: `(a + c) * d`

* Rule: `\(E \rightarrow (\: E\: )\)`

* Read `(` and remember `)` for later

* Next state: `E )`
]

---

# Parsing

Parsing: `(a + c) * d`

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* "State": `E )`

* Input: `a + c) * d`

* Rule: `\(E \rightarrow X\)`

* Next state: `X )`
]

---

# Parsing

Parsing: `(a + c) * d`

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* "State": `X )`

* Input: `a + c) * d`

* Rule: `\(X \rightarrow a\)`

* Read `a`

* Next state: `)`
]

---

# Parsing

Parsing: `(a + c) * d`

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* "State": `)`

* Input: `+ c) * d`

* We still have a `)` remembered!

* Try to read `)`

* ???
]

---
class: middle
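---
class: small

# The same attempt, as code

Here is a minimal sketch (mine, not from the book or the slides) of the strategy we just tried by hand: keep a list of grammar symbols as the "state" and always commit to the first rule that looks like it could match the next input character. The rule order and helper names below are my own choices, picked so the run mirrors the trace above.

```Python
# Sketch only: the "commit to one rule and hope" strategy from the trace above.
RULES = {
    "E": [["(", "E", ")"], ["X"], ["E", "+", "E"], ["E", "*", "E"]],
    "X": [[c] for c in "abcdefghijklmnopqrstuvwxyz"],
}

def naive_parse(word):
    state = ["E"]                          # symbols we still have to produce
    rest = [c for c in word if c != " "]   # input characters left to read
    while state:
        symbol = state.pop(0)
        if symbol in RULES:                # non-terminal: expand it greedily
            for rhs in RULES[symbol]:
                first = rhs[0]
                if first in RULES or (rest and first == rest[0]):
                    state = rhs + state    # commit to the first plausible rule
                    break
            else:
                return False
        elif rest and rest[0] == symbol:   # terminal: must match the input
            rest.pop(0)
        else:
            return False                   # stuck, e.g. expecting ')' but seeing '+'
    return not rest

print(naive_parse("(a + c) * d"))          # False -- same dead end as above
```

On `(a + c) * d` this makes exactly the choices from the previous slides and gets stuck on the remembered `)`, so it wrongly rejects a word that *is* in the language.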
---
class: medium

# Parsing: What happened?

* Maybe starting with `\(E \rightarrow (\:E\:)\)` was a bad idea?

* Or maybe it was when we had `a + c) * d` and applied `\(E \rightarrow X\)`

* Instead, like in an NFA, we have to try **all** rules that could be applicable

* Which other rules would be applicable? All that have `E` on the left side!

$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\:E\:)
$$

---

# Parsing "properly"

* Parsing `(a + c) * d`

$$
\begin{aligned}
E &\Rightarrow^\ast E \ast E \\\\
  &\Rightarrow^\ast ( E ) \ast E \\\\
  &\Rightarrow^\ast ( E + E ) \ast E \\\\
  &\Rightarrow^\ast ( X + X ) \ast X \\\\
  &\Rightarrow^\ast ( a + c ) \ast d
\end{aligned}
$$

---

# Parse Trees

Recall `(a+b)*(b+c)` as a tree:
This is "almost" a *parse tree*

---

# Parse Trees

* To actually do the parsing, we need to apply the rules from our grammar

* We said something like "To apply `\(E \rightarrow E + E\)` we'll just make a node labeled '+'"

* But what if we had a rule like `\(E \rightarrow \text{if}\:C\:\text{then}\:S\:\text{else}\:S\)`?

* Instead, in a "real" parse tree, our interior nodes are non-terminal symbols!

---

# Parse Trees

The actual parse tree of `(a+b)*(b+c)`
---

# Parse Trees

What's the parse tree of `a+b*c`?
---

# Parse Trees

What's the parse tree of `a+b*c`?
---
class: medium

# Ambiguity

* There are different orders in which we can apply rules!

* This can give the same word more than one parse tree: the grammar is **ambiguous**

* We could say that we always replace the left-most non-terminal symbol

* This will produce a **left-derivation**

* Unfortunately, even *that* is not enough in our case

* In practice, we often want something more deterministic!

---
class: center, middle

# Practical Parsing

---

# Determinism

* Next time we will look at Pushdown Automata, which can recognize context-free languages

* They work by performing basically this replacement operation for all applicable rules in parallel

* There are also Deterministic Pushdown Automata

* Unlike for Finite Automata, the deterministic and non-deterministic variants are not equally powerful!

---

# Determinism

* In practice, deterministic recognition is usually more efficient

* For grammars we actually use, we therefore strive to keep them (mostly) deterministic

* How? By making sure there is always only one rule that applies

* There are many ways to do this, e.g. by letting the recognizer "look ahead" in the string

* Often we can rewrite the grammar to achieve this

---
class: small

# Expression grammar

.left-column[
$$
E \rightarrow X\\\\
E \rightarrow E + E\\\\
E \rightarrow E * E\\\\
E \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column[
* For each non-terminal we define "First" and "Follow"

* First: Set of all terminals that can appear as the left-most symbol of anything derived from the non-terminal

* Follow: Set of all terminals that appear **after** the non-terminal on the right side of a rule

* First(E) = `\(\{(, a, b, \ldots, z\}\)`

* Follow(E) = `\(\{+, \ast, )\}\)`
]

---
class: mmedium

# First Sets

To determine First(X):

* Take every rule that has X on the left side

* If the first symbol on the right side is a terminal symbol (letter) or epsilon, add it to the First-set

* If the first symbol on the right side is a non-terminal symbol, e.g. Y, add every terminal symbol (but **not** epsilon) from First(Y) to First(X)

* If First(Y) contains an epsilon, continue with the symbol after Y in the same way

* If you reach the end of the rule this way, add epsilon to First(X)

---
class: mmedium

# Follow Sets

Start by adding $ to the Follow-set of the start symbol. Then, to determine Follow(X):

* Take every rule that has X on the **right** side

* If the symbol after X is a terminal symbol, add it to Follow(X)

* If the symbol after X is a non-terminal symbol, e.g. Y, add every terminal symbol (but **not** epsilon) from First(Y) to Follow(X)

* If First(Y) contains an epsilon, continue with the symbol after Y in the same manner

* If you reach the end of the rule (e.g. X is at the end, or every symbol after it can derive epsilon), add the Follow-set of the **left** side of the rule to Follow(X)

---

# First and Follow

* Why do we need two sets?

* Because our non-terminal may "disappear" (have a rule that sets it to epsilon)

* For each symbol in First, and - if the non-terminal can go to epsilon - in Follow, we can then list the rule(s) that should be applied

* Basically, this will tell us which rule(s) we can apply at what point during parsing

* Let's look at a reduced version of our expression grammar

---

# Reduced Example

$$
E \rightarrow E + E\\\\
E \rightarrow (\:E\:)\\\\
E \rightarrow X\\\\
X \rightarrow a
$$

First we determine the First- and Follow-sets.
--

First(E) = {(, a}, Follow(E) = {$, ), +}

First(X) = {a}, Follow(X) = {$, ), +}

Now we create a table, with one row per non-terminal and one column per terminal symbol. Each entry in the table represents which rule(s) we apply if we currently have a particular non-terminal symbol and then read the given terminal symbol as the next input.

---

# Reduced Example

$$
E \rightarrow E + E\\\\
E \rightarrow (\:E\:)\\\\
E \rightarrow X\\\\
X \rightarrow a
$$
|       | a | ( | ) | + |
|-------|---|---|---|---|
| **E** | `\(E \rightarrow E + E\)`, `\(E \rightarrow X\)` | `\(E \rightarrow (\:E\:)\)` | | |
| **X** | `\(X \rightarrow a\)` | | | |
When we start with an `a`, we do not know which rule to take, because there are two entries in our table cell!

---

# What could we do?

$$
E \rightarrow E + E\\\\
E \rightarrow X\\\\
E \rightarrow (\:E\:)\\\\
X \rightarrow a
$$

First(E) is `\(\{(, a\}\)`, so we only want (at most) two rules for E, one for each case. We can then **optionally** let the `a` be followed by more tokens.

$$
E \rightarrow X F \\\\
F \rightarrow \varepsilon\\\\
F \rightarrow + E\\\\
X \rightarrow (\:E\:)\\\\
X \rightarrow a
$$

---

# Expression grammar, rewritten

The same logic can be applied to the full expression grammar:

.left-column-math[
$$
E \rightarrow X F\\\\
F \rightarrow \varepsilon\\\\
F \rightarrow +\: E\\\\
F \rightarrow *\:E\\\\
X \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
X \rightarrow b \\\\
\vdots\\\\
X \rightarrow z
$$
]

.right-column-math[
]

---
class: small

# Another Example

.smallm[
$$
S \rightarrow A B \\\\
A \rightarrow \varepsilon \\\\
A \rightarrow - \\\\
B \rightarrow 0 \\\\
B \rightarrow 1 C \\\\
C \rightarrow 0 C\\\\
C \rightarrow 1 C\\\\
C \rightarrow \varepsilon
$$
]

--

.left-column-wide[
* First(C) = `\(\{0,1,\varepsilon\}\)`, Follow(C) = `\(\{\$\}\)`

* First(B) = `\(\{0,1\}\)`, Follow(B) = `\(\{\$\}\)`

* First(A) = `\(\{-,\varepsilon\}\)`, Follow(A) = `\(\{0, 1\}\)`

* First(S) = `\(\{-,0,1\}\)`, Follow(S) = `\(\{\$\}\)`
]

.right-column-wide[
|       | - | 0 | 1 | $ |
|-------|---|---|---|---|
| **S** | `\(S \rightarrow A B\)` | `\(S \rightarrow A B\)` | `\(S \rightarrow A B\)` | |
| **A** | `\(A \rightarrow -\)` | `\(A \rightarrow \varepsilon\)` | `\(A \rightarrow \varepsilon\)` | |
| **B** | | `\(B \rightarrow 0\)` | `\(B \rightarrow 1C\)` | |
| **C** | | `\(C \rightarrow 0 C\)` | `\(C \rightarrow 1C\)` | `\(C \rightarrow \varepsilon\)` |
]
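---
class: small

# Computing First Sets

The procedure from the "First Sets" slide is easy to turn into a short program. This is only a sketch (mine, not from the book or the slides): the grammar is written as a Python dictionary, and `EPS` stands for `\(\varepsilon\)`.

```Python
# Sketch only: compute First sets with the procedure from the "First Sets" slide.
EPS = "eps"   # stands for the empty word (epsilon)

GRAMMAR = {   # non-terminal -> list of right-hand sides
    "S": [["A", "B"]],
    "A": [[EPS], ["-"]],
    "B": [["0"], ["1", "C"]],
    "C": [["0", "C"], ["1", "C"], [EPS]],
}

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                                # repeat until nothing grows any more
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                new = set(first[nt])
                for symbol in rhs:
                    if symbol not in grammar:     # terminal or epsilon: add it, stop
                        new.add(symbol)
                        break
                    new |= first[symbol] - {EPS}  # non-terminal: add its First set
                    if EPS not in first[symbol]:  # ...and only go on if it can vanish
                        break
                else:                             # every symbol could derive epsilon
                    new.add(EPS)
                if new != first[nt]:
                    first[nt], changed = new, True
    return first
```

For the grammar above, `first_sets(GRAMMAR)` produces First(S) = {-, 0, 1}, First(A) = {-, ε}, First(B) = {0, 1}, and First(C) = {0, 1, ε}, matching the sets we just listed; Follow sets can be computed with a very similar fixed-point loop.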
---

# Recursive Descent Parser

* When the grammar is not ambiguous (there is at most one rule in each table cell), writing a parser is easy!

* Each non-terminal symbol becomes a function

* This function checks the next symbol to be read, which determines which rule to apply

* Each rule is a sequence of function calls (for non-terminals) or read operations (for terminal symbols)

---

# Recursive Descent Parser

$$
E \rightarrow X F\\\\
X \rightarrow (\\:E\\:) \\\\
X \rightarrow a \\\\
\ldots
$$

```Python
def E(input):
    if input.next in ["(", "a", "b", ..., "z"]:
        X(input)
        F(input)

def X(input):
    if input.next == "(":
        input.read("(")
        E(input)
        input.read(")")
    elif input.next == "a":
        input.read("a")
    ...
```

---

# Sentences, words, and characters

* In this class, we build our "words" from "characters"

* In English (or other languages) we usually talk about "sentences" constructed of "words"

* In both cases we use "grammars"

* In practical applications, like programming languages, we often want to use, e.g. a "variable name" as one of our "characters"

---
class: medium

# Programming Language Parsing

* Take "variable name" again: In most languages the rules for variable names (**identifiers**) are relatively simple (e.g. "a sequence of letters, digits, and underscores that does not start with a digit")

* The syntax rules (**grammar**) of the language are something more complex, built on top of these simple building blocks

* In practice we therefore use these building blocks, called **tokens**, as terminal symbols, e.g.

$$
\mathit{MethodHeader} \rightarrow \mathit{Mods}\;\text{ident}\;\text{ident}(\mathit{Args})
$$

---
class: center, middle

# Practical Grammars

---

# Syntax

* Another "annoying" aspect of grammars is writing them down

* Math notation is aesthetically pleasing, but - LaTeX notwithstanding - hard to type

* It's also long-winded, although there is the common abbreviation using the symbol | for "or":

$$
E \rightarrow \: X\:F\\\\
F \rightarrow \varepsilon\: |\: +\:E\: |\: *\:E\\\\
X \rightarrow (\:E\:)\: | a\: |\: b\: |\: c\: \cdots\: |\: z
$$

---
class: medium

# BNF

* John Backus and Peter Naur were among the people developing ALGOL in 1958

* In order to describe the language more formally, they developed a "meta-language" called BNF ("Backus-Naur Form")

* It allows the description of "production rules"

* Non-terminal symbols are enclosed in angle brackets (<, >), terminal symbols in double quotes

* Production rules consist of a non-terminal, followed by "::=", followed by a sequence of non-terminal and terminal symbols

---

# BNF Example

Our grammar, in BNF, would look like this:

```
<E> ::= <X> <F>
<F> ::= "+" <E>
      | "*" <E>
      | ""
<X> ::= "(" <E> ")"
      | "a" | "b" | ... | "z"
```
[Railroad Diagram Generator](https://www.bottlecaps.de/rr/ui)

---

# EBNF

* There are **many** variations of BNF, including "EBNF" (Extended BNF)

* It gets rid of the angle brackets

* It includes things like `{ E }` (zero or more repetitions of E), or `[ E ]` (optional E), or similar notation

* Unfortunately, most tools have their own idea of what *exactly* is supported and how

---

# Throwback Thursday: RFC 5322

```
addr-spec     = local-part "@" domain
local-part    = dot-atom / quoted-string
domain        = dot-atom
dot-atom      = 1*atext *("." 1*atext)
atom          = 1*atext
atext         = ALPHA / DIGIT /
                "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" /
                "-" / "/" / "=" / "?" / "^" / "_" / "`" / "{" /
                "|" / "}" / "~"
qtext         = %d33 / %d35-91 / %d93-126
quoted-string = DQUOTE *([FWS] qtext) [FWS] DQUOTE
```
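---
class: small

# ABNF meets recursive descent

To tie this back to parsing: the simplified rules above can be turned into a recognizer in exactly the way we saw earlier, one function per rule. This is a rough sketch of mine (not from the RFC, and not a complete validator): it ignores `quoted-string`, `FWS`, and comments, and the function names simply mirror the ABNF rule names.

```Python
# Sketch only: recursive descent over the simplified addr-spec rules above.
ATEXT = set("abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "0123456789!#$%&'*+-/=?^_`{|}~")

def dot_atom(s, i):
    # dot-atom = 1*atext *("." 1*atext)
    start = i
    while i < len(s) and s[i] in ATEXT:
        i += 1
    if i == start:
        return None                      # need at least one atext character
    while i < len(s) and s[i] == ".":
        j = i + 1
        while j < len(s) and s[j] in ATEXT:
            j += 1
        if j == i + 1:
            break                        # "." not followed by atext: stop here
        i = j
    return i

def addr_spec(s):
    # addr-spec = local-part "@" domain, with both sides restricted to dot-atoms
    i = dot_atom(s, 0)
    if i is None or i >= len(s) or s[i] != "@":
        return False
    j = dot_atom(s, i + 1)
    return j == len(s)
```

`addr_spec("cs-3110@example.edu")` returns `True`, while strings like `"a..b@c"` or `"no-at-sign"` are rejected, just as the `dot-atom` rule demands.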