Lecture 4: Regular Expressions

# CS-3110: Formal Languages and Automata

## Regular Expressions

### Chapter 3.2 + 3.3

---

# An Alphabet

Let us start with our well-known alphabet:

$$
\Sigma = \\{0,1\\}
$$

Now we want to define "simple" languages over this alphabet.

---

# Regular Expressions

* We will define some operators to construct expressions that define a language

* Each regular expression `$e$` describes what form words in that language have

* We will write `$L(e)$` to represent the language (= set of words) defined by that expression

* In fact, we will **define** the regular expression operators in terms of the language they describe

---

# Regular Expressions

* `$\Phi$` is a regular expression: `$L(\Phi) = \{\}$`

* `$\varepsilon$` is a regular expression: `$L(\varepsilon) = \{\varepsilon\}$`

* For any symbol `a` in our alphabet, `$a$` is a regular expression: `$L(a) = \{a\}$`

* For our alphabet this means `$0$` and `$1$` are regular expressions, with `$L(0) = \{0\}$` and `$L(1) = \{1\}$`

---

# Regular Expression Operators

* If `$x$` is a regular expression, then so is `$x^*$`, and: `$L(x^*) = L(x)^*$`

* If `$x$` is a regular expression, then so is `$(x)$`, and: `$L((x)) = L(x)$`

* If `$x$` and `$y$` are regular expressions, then so is `$xy = x \cdot y$`, and: `$L(xy) = L(x)\cdot L(y)$`

* If `$x$` and `$y$` are regular expressions, then so is `$x|y$`, and: `$L(x|y) = L(x) \cup L(y)$`
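The definition so far can be read as a recursive data structure with a recursive function `$L$` on top of it. Below is a minimal sketch in Python, not a standard library: the class names (`Empty`, `Eps`, `Sym`, `Star`, `Concat`, `Alt`) and the function `language` are made up for illustration. Because `$L(x^*)$` is usually infinite, the sketch only enumerates words up to a chosen maximum length. Parentheses need no class of their own here, since grouping is already expressed by how the nodes are nested.

```python
from dataclasses import dataclass

# Hypothetical node types, one per rule of the definition.
@dataclass(frozen=True)
class Empty:            # Phi, with L(Phi) = {}
    pass

@dataclass(frozen=True)
class Eps:              # epsilon, with L(epsilon) = {epsilon}
    pass

@dataclass(frozen=True)
class Sym:              # a single symbol a, with L(a) = {a}
    char: str

@dataclass(frozen=True)
class Star:             # x*
    inner: object

@dataclass(frozen=True)
class Concat:           # xy
    left: object
    right: object

@dataclass(frozen=True)
class Alt:              # x|y
    left: object
    right: object


def language(e, max_len):
    """Return all words of L(e) that have at most max_len symbols."""
    if isinstance(e, Empty):
        return set()
    if isinstance(e, Eps):
        return {""}
    if isinstance(e, Sym):
        return {e.char} if max_len >= 1 else set()
    if isinstance(e, Alt):
        # L(x|y) = L(x) union L(y)
        return language(e.left, max_len) | language(e.right, max_len)
    if isinstance(e, Concat):
        # L(xy) = L(x) . L(y), cut off at max_len
        left = language(e.left, max_len)
        right = language(e.right, max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if isinstance(e, Star):
        # L(x*) = L(x)*: keep appending words of L(x) until nothing new fits
        base = language(e.inner, max_len)
        words = {""}
        frontier = {""}
        while frontier:
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - words
            words |= frontier
        return words
    raise TypeError(f"not a regular expression node: {e!r}")


# (0|(11))* written as a tree
example = Star(Alt(Sym("0"), Concat(Sym("1"), Sym("1"))))
print(sorted(language(example, 3)))
```

The bounded enumeration is a design choice for the sketch only; the formal definition places no length limit on words.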

---

# Regular Expressions: Examples

* `$0|(1(0|1)^*0)$`: All words representing even binary numbers

* `$1(0|1)^*1$`: All words that start and end with a 1

* `$10^*1$`: All words that start and end with a 1 with only 0s in between

* `$(0|(11))^*$`: All words consisting of any combination of 0 and 11

* `$0|1|10|11$`: The words `0`, `1`, `10`, and `11`
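One way to sanity-check examples like these is to try them against all short binary words. The sketch below uses Python's `re` module, which is a practical implementation rather than the formal definition, but it agrees with the formal operators for `|`, `*`, and parentheses. The helper names `words` and `matches` are made up for this example.

```python
import re
from itertools import product

def words(max_len, alphabet="01"):
    """Yield every word over the alphabet with at most max_len symbols."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)

def matches(pattern, max_len):
    """List the words up to max_len that the whole pattern describes."""
    # fullmatch requires the entire word to match, like membership in L(e)
    return [w for w in words(max_len) if re.fullmatch(pattern, w)]

print(matches(r"0|(1(0|1)*0)", 3))   # even binary numbers
print(matches(r"1(0|1)*1", 3))       # start and end with a 1
print(matches(r"10*1", 3))           # only 0s in between
print(matches(r"0|1|10|11", 3))      # exactly four words
```

Note that `re.fullmatch` is used rather than `re.search`, because a word belongs to `$L(e)$` only if the entire word has the described form.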

---

# This looks the same as before!

Have we really done anything?

- The "*" operator is basically the same as for sets

- The "|" is just a set union

- Why do we need this new notation if we could already do these things?

## Regular Expressions are more restrictive than set notation

---

# Regular Expressions: Limitations

* Regular Expressions have "no memory"

* We could define a language like `$L = \{0^n 1^n | n \in \mathbb{N}\}$` in set notation, i.e. words consisting of a number of 0s followed by **the same number** of 1s

* We cannot define this same language with regular expressions!

* So we lost something: Expressivity. What have we gained in exchange? Simplicity!
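To make "no memory" concrete, here is a small Python sketch that recognizes `$0^n 1^n$` by counting. The function name `is_0n1n` is made up, and the sketch assumes that `$n = 0$` is allowed, so the empty word is accepted. The counters are exactly the kind of memory a regular expression does not have. For comparison, the regular expression `0*1*` describes the right shape but cannot enforce equal counts.

```python
import re

def is_0n1n(word):
    """Check whether word is some number of 0s followed by equally many 1s."""
    i = 0
    zeros = 0
    while i < len(word) and word[i] == "0":   # count the leading 0s
        zeros += 1
        i += 1
    ones = 0
    while i < len(word) and word[i] == "1":   # count the 1s that follow
        ones += 1
        i += 1
    # the whole word must be consumed, and the counts must agree
    return i == len(word) and zeros == ones

for w in ["0011", "001", "10"]:
    shape_ok = re.fullmatch(r"0*1*", w) is not None
    print(w, "0*1* accepts:", shape_ok, "| in 0^n1^n:", is_0n1n(w))
```

The word `001` shows the gap: it has the form `0*1*` but is not in the language.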

---

# Regular Expressions: Applications

* Regular Expressions are often useful to find approximate strings in text

* "Approximate": We know the general structure, but not the exact string

* Think: You want to find all phone numbers, credit card numbers, addresses, ... in data a hacker got from your company to alert the affected customers

* You can define a regular expression for such a task, and there are efficient implementations to search for "matches"

---

# Practical Applications

<img src="/CS3110/assets/img/xkcdregex.png" width="60%"/><br/>
<a href="https://xkcd.com/208/">Source</a>

---

# A Problem

* You start a new job

* There is a lot of code to work through

* At some point you want to know: Where in the code is the function `formatTaxes` called with the parameter `form1040` or `form1041`?

* It could be `formatTaxes(form1040)`, or `formatTaxes(W2, form1040)` or `formatTaxes(form1041, True)`, etc.

---

# Languages

How can we solve this with what we know about languages?

1. Define a language that consists of calls to `formatTaxes` with `form1040` or `form1041` as one of the parameters

2. For every line of code check if that line is a word in our language

3. Report every line for which that check returns `True`

---

# What do we need to define?

What is the structure of a function call?

`function(param1, param2, ..., paramn)`

It could be any number of parameters, and the ones we are looking for could be at any position.

What we are looking for is therefore:

`formatTaxes(...form1040...)` 
or
`formatTaxes(...form1041...)`

We already know how to write "or" as a regular expression, but what about the "..."-part?

---

# The alphabet

- As we are talking about Java code, our alphabet consists of every character that is legal in Java

- This means our alphabet contains all Unicode characters

- We could write "..." as `(a|b|c|d....)*`, but there are *many* Unicode characters, and this would be impractical

- Instead, most practical regular expression implementations define `.` to mean "any character"

- They often have many other useful shortcuts, such as `a+` to mean `aa*`, i.e. "one or more 'a's"
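As a quick illustration of these shortcuts, the sketch below uses Python's `re` module as one example of a practical implementation; other implementations may differ in details. The helper `agree` is made up for this example and compares two patterns on a handful of sample words.

```python
import re

def agree(pattern1, pattern2, samples):
    """Check that both patterns accept exactly the same sample words."""
    return all(
        (re.fullmatch(pattern1, w) is not None)
        == (re.fullmatch(pattern2, w) is not None)
        for w in samples
    )

samples = ["", "a", "aa", "aaa", "b", "ab"]
print(agree(r"a+", r"aa*", samples))               # a+ behaves like aa*

# "." stands for any single character (in Python: any except a newline,
# unless the re.DOTALL flag is given), so ".*" covers the "..." part
print(re.fullmatch(r".*", "W2, form1040") is not None)
```

Agreement on a few samples is only a spot check, not a proof that the two patterns describe the same language.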

---

# Our Regular Expression

In practice, we would write our regular expression as:

```
(formatTaxes\(.*form1040.*\))|(formatTaxes\(.*form1041.*\))
```

Note: In programming languages we often want to look for parentheses, but they also mean something in regular expressions, so they have to be escaped with backslashes.
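The three steps (define the language, check every line, report the hits) could be sketched in Python as follows. The pattern is the one discussed on this slide, written with escaped parentheses `\(` and `\)`. The function name `find_calls` and the sample lines are made up for illustration. The sketch uses `re.search`, which looks for the pattern anywhere in the line, because a real line of code usually contains more than just the call.

```python
import re

# Escaped parentheses \( and \) match literal "(" and ")" in the code
PATTERN = re.compile(
    r"(formatTaxes\(.*form1040.*\))|(formatTaxes\(.*form1041.*\))"
)

def find_calls(lines):
    """Return (line number, line) for every line containing a match."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]

# Hypothetical sample input
code = [
    "formatTaxes(form1040)",
    "formatTaxes(W2, form1040)",
    "formatTaxes(form1041, True)",
    "formatTaxes(W2)",
    "printTaxes(form1040)",
]

for number, line in find_calls(code):
    print(number, line)
```

In practice the same search is often done directly with an editor or a command-line search tool that accepts regular expressions.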

---

# Limitations

* Our regular expression will find any line where `formatTaxes` is called with `form1040` (or `form1041`) **somewhere** between the parentheses

* What about `formatTaxes(extractInformation(form1040))`?

* Our regex will match!

* Maybe we want that (it will format something related to form 1040)

* Maybe we don't (we are investigating a bug strictly related to form 1040/1041, and this is not helping)
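This behavior is easy to confirm with Python's `re` module, used here as one example implementation. The first line is the nested call from this slide. The second is an additional, made-up line showing that `.*` can even run past the end of the `formatTaxes` call.

```python
import re

PATTERN = re.compile(
    r"(formatTaxes\(.*form1040.*\))|(formatTaxes\(.*form1041.*\))"
)

nested = "formatTaxes(extractInformation(form1040))"
print(PATTERN.search(nested) is not None)      # True: nested argument matches

# ".*" is not limited to the argument list of formatTaxes
two_calls = "formatTaxes(W2); log(form1040)"
print(PATTERN.search(two_calls) is not None)   # True: form1040 is in another call
```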

---

# Limitations

* What would be the "proper" way to handle this?

* Only accept `form1040` if it is in the outermost pair of parentheses

* Maybe we could "count" how often parentheses are opened and closed?

* Unfortunately regular expressions cannot do this at all (they have no memory)!

* We will discuss the exact limitations in a few weeks
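For contrast, here is what the counting idea could look like in ordinary code, where a counter variable provides the memory that regular expressions lack. This is only a sketch under simplifying assumptions: the function name `has_top_level_arg` is made up, only the first occurrence of `formatTaxes(` on a line is examined, and string literals or comments containing parentheses are not handled.

```python
def has_top_level_arg(line, func="formatTaxes",
                      targets=("form1040", "form1041")):
    """Check whether a target is a direct argument of the first call to func."""
    start = line.find(func + "(")
    if start == -1:
        return False
    i = start + len(func) + 1      # first character after the opening "("
    depth = 1                      # how many parentheses are currently open
    current = ""                   # text of the argument being read
    args = []
    while i < len(line) and depth > 0:
        ch = line[i]
        if ch == "(":
            depth += 1
            current += ch
        elif ch == ")":
            depth -= 1
            if depth > 0:          # a nested call closes, keep its ")"
                current += ch
        elif ch == "," and depth == 1:
            args.append(current.strip())   # argument ends at a top-level comma
            current = ""
        else:
            current += ch
        i += 1
    args.append(current.strip())
    return any(arg in targets for arg in args)

print(has_top_level_arg("formatTaxes(W2, form1040)"))
print(has_top_level_arg("formatTaxes(extractInformation(form1040))"))
```

The nested call is rejected because its only top-level argument is `extractInformation(form1040)`, not `form1040` itself.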

---

# Be careful!

We discussed three different concepts so far:

* Languages defined using set notation: `$\cdot, \cup, \cap, ^\ast, \wedge, \neg, \in, ....$`

* Regular expressions (formal): Concatenation, `|`, `*`, `()`, `$\varepsilon$`, and `$\Phi$`

* Regular expressions (in practice): `|, *, +, (), ...` (depends on the implementation)

Always note what an assignment allows you to do: When you have to write a **regular expression**, there cannot be any set or logical operations involved. Unless otherwise noted, we also assume the formal definition of regular expressions, as the other one is implementation-dependent!

---

# Formal Regular Expressions

* If `$x$` is a regular expression, then so is `$x^*$`, and: `$L(x^*) = L(x)^*$`

* If `$x$` is a regular expression, then so is `$(x)$`, and: `$L((x)) = L(x)$`

* If `$x$` and `$y$` are regular expressions, then so is `$xy = x \cdot y$`, and: `$L(xy) = L(x)\cdot L(y)$`

* If `$x$` and `$y$` are regular expressions, then so is `$x|y$`, and: `$L(x|y) = L(x) \cup L(y)$`

These are the **only** operators that exist!

---

# Why Formal Regular Expressions?

* We only have four operators: Repetition, concatenation, choice, and parentheses

* When we want to implement Regular Expressions, we only need to implement four pieces

* For proofs, we only need to consider these four operations

* In practice this is limited, but exactly which extensions exist is implementation-dependent

* For the purposes of this class, we only focus on the formal definition