class: center, middle

# CS-3110: Formal Languages and Automata

## Regular Expressions

### Chapter 3.2 + 3.3

---

# An Alphabet

Let us start with our well-known alphabet:

$$ \Sigma = \\{0,1\\} $$

Now we want to define "simple" languages over this alphabet.

---

# Regular Expressions

* We will define some operators to construct expressions that define a language
* Each regular expression `\(e\)` describes the form that the words in its language have
* We will write `\(L(e)\)` to represent the language (= set of words) defined by that expression
* In fact, we will **define** the regular expression operators in terms of the language they describe

---

# Regular Expressions

* `\(\Phi\)` is a regular expression: `\(L(\Phi) = \{\}\)`
* `\(\varepsilon\)` is a regular expression: `\(L(\varepsilon) = \{\varepsilon\}\)`
* For any symbol `a` in our alphabet, `\(a\)` is a regular expression: `\(L(a) = \{a\}\)`
* For our alphabet this means `\(0\)` and `\(1\)` are regular expressions, with `\(L(0) = \{0\}\)` and `\(L(1) = \{1\}\)`

---

# Regular Expression Operators

* If `\(x\)` is a regular expression, then so is `\(x^*\)`, and: `\(L(x^*) = L(x)^*\)`
* If `\(x\)` is a regular expression, then so is `\((x)\)`, and: `\(L((x)) = L(x)\)`
* If `\(x\)` and `\(y\)` are regular expressions, then so is `\(xy = x \cdot y\)`, and: `\(L(xy) = L(x)\cdot L(y)\)`
* If `\(x\)` and `\(y\)` are regular expressions, then so is `\(x|y\)`, and: `\(L(x|y) = L(x) \cup L(y)\)`

---

# Regular Expressions: Examples

* `\(0|(1(0|1)^*0)\)`: All words representing even binary numbers (without leading zeros)
* `\(1(0|1)^*1\)`: All words of length at least 2 that start and end with a 1
* `\(10^*1\)`: All words that start and end with a 1 with only 0s in between
* `\((0|(11))^*\)`: All words consisting of any combination of 0 and 11
* `\(0|1|10|11\)`: The words `0`, `1`, `10`, and `11`

---

# This looks the same as before!

Have we really done anything?

- The "*" operator is basically the same as for sets
- The "|" is just a set union
- Why do we need this new notation if we could already do these things?

## Regular Expressions are more restrictive than set notation

---

# Regular Expressions: Limitations

* Regular Expressions have "no memory"
* We could define a language like `\(L = \{0^n 1^n | n \in \mathbb{N}\}\)` in set notation, i.e. words consisting of a number of 0s followed by **the same number** of 1s
* We cannot define this same language with regular expressions!
* So we lost something: Expressivity. What have we gained in exchange? Simplicity!

---

class: medium

# Regular Expressions: Applications

* Regular Expressions are often useful to find approximate strings in text
* "Approximate": We know the general structure, but not the exact string
* Think: A hacker got data from your company, and you want to find all phone numbers, credit card numbers, addresses, ... in that data to alert the affected customers
* You can define a regular expression for such a task, and there are efficient implementations to search for "matches"

---

# Practical Applications
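
A minimal sketch of such a search, using Python's `re` module. The phone number format (three digits, three digits, four digits, separated by dashes) and the sample text are assumptions made up for this example; `\d`, `{3}` and `\b` are shortcuts of practical regular expression implementations, not part of the formal definition.

```python
import re

# Assumed format, for illustration only: ddd-ddd-dddd (e.g. 555-867-5309).
# Real data would need more variants (spaces, country codes, ...).
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def find_phone_numbers(text):
    """Return every substring of `text` that matches the assumed phone format."""
    return PHONE.findall(text)

# Made-up sample text.
sample = "Call 555-867-5309 or 555-123-4567; order #12345 is not a phone number."
print(find_phone_numbers(sample))  # prints ['555-867-5309', '555-123-4567']
```

We only described the general structure of a phone number; the implementation finds every exact string with that structure.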
---

class: center, middle

# Practical Applications

---

# A Problem

* You start a new job
* There is a lot of code to work through
* At some point you want to know: Where in the code is the function `formatTaxes` called with the parameter `form1040` or `form1041`?
* It could be `formatTaxes(form1040)`, or `formatTaxes(W2, form1040)` or `formatTaxes(form1041, True)`, etc.

---

# Languages

How can we solve this with what we know about languages?

--

1. Define a language that consists of calls to `formatTaxes` with `form1040` or `form1041` as one of the parameters
2. For every line of code check if that line is a word in our language
3. Report every line for which that check returns `True`

---

# What do we need to define?

What is the structure of a function call?

`function(param1, param2, ..., paramn)`

There could be any number of parameters, and the ones we are looking for could be at any position.

What we are looking for is therefore:

`formatTaxes(...form1040...)` or `formatTaxes(...form1041...)`

We already know how to write "or" as a regular expression, but what about the "..."-part?

---

class: mmedium

# The alphabet

- As we are talking about Java code, our alphabet consists of every character that is legal in Java
- This means our alphabet contains all Unicode characters
- We could write "..." as `(a|b|c|d....)*`, but there are *many* Unicode characters, and this would be impractical
- Instead, most practical regular expression implementations define `.` to mean "any character"
- They often have many other useful shortcuts, such as `a+` to mean `aa*`, i.e. "one or more 'a's"

---

# Our Regular Expression

In practice, we would write our regular expression as:

```
(formatTaxes\(.*form1040.*\))|(formatTaxes\(.*form1041.*\))
```

Note: In programming languages we often want to look for parentheses, but they also mean something in regular expressions, so they have to be escaped with backslashes.

---

# Practical Applications
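
A minimal sketch of the three steps (define the language, check every line, report the matches), assuming Python's `re` module. The regular expression for `formatTaxes` calls is written without spaces around `|`, because `re` treats a space as a literal character. The function name `matching_lines` and the sample lines are made up for this example. We use `search` rather than a full match, so a call is also found when other text surrounds it on the line.

```python
import re

# The regular expression for calls to formatTaxes with form1040 or form1041.
PATTERN = re.compile(r"(formatTaxes\(.*form1040.*\))|(formatTaxes\(.*form1041.*\))")

def matching_lines(lines):
    """Return (line number, line) pairs for all lines that contain a match."""
    return [(n, line) for n, line in enumerate(lines, start=1)
            if PATTERN.search(line)]

# Hypothetical lines of code, made up for this example.
code = [
    "formatTaxes(form1040)",
    "formatTaxes(W2, form1040)",
    "formatTaxes(form1041, True)",
    "formatTaxes(W2)",
    "printTaxes(form1040)",
]

for n, line in matching_lines(code):
    print(n, line)
# prints:
# 1 formatTaxes(form1040)
# 2 formatTaxes(W2, form1040)
# 3 formatTaxes(form1041, True)
```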
---

class: medium

# Limitations

* Our regular expression will find any line where `formatTaxes` is called with `form1040` (or `form1041`) **somewhere** between the parentheses
* What about `formatTaxes(extractInformation(form1040))`?
* Our regex will match!
* Maybe we want that (it will format something related to form 1040)
* Maybe we don't (we are investigating a bug strictly related to form 1040/1041, and this is not helping)

---

# Limitations

* What would be the "proper" way to handle this?
* Only accept `form1040` if it is in the outermost pair of parentheses
* Maybe we could "count" how often parentheses are opened and closed?
* Unfortunately regular expressions cannot do this at all (they have no memory)!
* We will discuss the exact limitations in a few weeks

---

class: medium

# Be careful!

We discussed three different concepts so far:

* Languages defined using set notation: `\(\cdot, \cup, \cap, ^\ast, \wedge, \neg, \in, ....\)`
* Regular expressions (formal): Concatenation and `|, *, (), ` `\(\varepsilon\)`
* Regular expressions (in practice): `|, *, +, (), ...` (depends on the implementation)

Always note what an assignment allows you to do: When you have to write a **regular expression**, there cannot be any set or logical operations involved.

Unless otherwise noted, we also assume the formal definition of regular expressions, as the other one is implementation-dependent!

---

# Formal Regular Expressions

* If `\(x\)` is a regular expression, then so is `\(x^*\)`, and: `\(L(x^*) = L(x)^*\)`
* If `\(x\)` is a regular expression, then so is `\((x)\)`, and: `\(L((x)) = L(x)\)`
* If `\(x\)` and `\(y\)` are regular expressions, then so is `\(xy = x \cdot y\)`, and: `\(L(xy) = L(x)\cdot L(y)\)`
* If `\(x\)` and `\(y\)` are regular expressions, then so is `\(x|y\)`, and: `\(L(x|y) = L(x) \cup L(y)\)`

These are the **only** operators that exist!

---

class: medium

# Why Formal Regular Expressions?

* We only have four operators: Repetition, concatenation, choice, and parentheses
* When we want to implement Regular Expressions, we only need to implement four pieces
* For proofs, we only need to consider these four operations
* In practice this is limited, but exactly which extensions exist is implementation-dependent
* For the purposes of this class, we only focus on the formal definition
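
---

# Formal Regular Expressions: A Sketch

To illustrate how few pieces an implementation needs, here is a minimal sketch of a membership test for formal regular expressions in Python. All class and function names (`Empty`, `Epsilon`, `Symbol`, `Concat`, `Choice`, `Star`, `ends`, `in_language`) are made up for this example. Parentheses only group, so they show up as the shape of the tree rather than as a class of their own. This is one possible approach, not an efficient implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:
    """The expression whose language is the empty set."""

@dataclass(frozen=True)
class Epsilon:
    """The expression whose language contains only the empty word."""

@dataclass(frozen=True)
class Symbol:
    char: str       # a single symbol of the alphabet

@dataclass(frozen=True)
class Concat:
    left: object
    right: object   # xy

@dataclass(frozen=True)
class Choice:
    left: object
    right: object   # x|y

@dataclass(frozen=True)
class Star:
    inner: object   # x*

def ends(e, w, i):
    """Return all positions j such that w[i:j] is a word in L(e)."""
    if isinstance(e, Empty):
        return set()
    if isinstance(e, Epsilon):
        return {i}
    if isinstance(e, Symbol):
        return {i + 1} if w[i:i + 1] == e.char else set()
    if isinstance(e, Concat):
        return {k for j in ends(e.left, w, i) for k in ends(e.right, w, j)}
    if isinstance(e, Choice):
        return ends(e.left, w, i) | ends(e.right, w, i)
    if isinstance(e, Star):
        # Repeat the inner expression until no new end position is reached.
        reached, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(e.inner, w, j)} - reached
            reached |= frontier
        return reached
    raise TypeError(f"not a regular expression: {e!r}")

def in_language(e, w):
    """Check whether the whole word w is in L(e)."""
    return len(w) in ends(e, w, 0)

# 0|(1(0|1)*0): even binary numbers without leading zeros
EVEN = Choice(Symbol("0"),
              Concat(Symbol("1"),
                     Concat(Star(Choice(Symbol("0"), Symbol("1"))), Symbol("0"))))

print(in_language(EVEN, "0"), in_language(EVEN, "110"), in_language(EVEN, "11"))
# prints: True True False
```

Each case of `ends` mirrors one line of the formal definition, which is also why proofs only need to consider these few cases.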