class: center, middle

# CS-3110: Formal Languages and Automata

## Regular Expressions

---
class: center, middle

# Example 1: Regular Expression Interpretation

---

# Regular Expression Interpretation

Given the regular expressions: `\(r_1 = ((001)|(100))^\ast, r_2 = (1)^\ast|(00)^\ast\)`, and `\(r_3 = (0(01)^\ast)|1\)`

* Give an English description of the language defined by each of the three regular expressions
* Which of these words are elements of the language defined by each of the three regular expressions: `\(001, 000, 0, 1, \varepsilon\)`

---

# Example 1: First expression

$$
r_1 = ((001)|(100))^\ast
$$

What are some valid words in the language defined by this regular expression?

$$
\varepsilon, 001, 100, 001001, 001100, 100100, 100001, \ldots
$$

This language is:

### The language consisting of all words formed by concatenating any number (including zero) of the blocks `001` and `100`, in any order

---
class: medium

# Example 1: First expression

$$
r_1 = ((001)|(100))^\ast
$$

* `\(001\)` is a valid word
* `\(000\)` is not a valid word
* `\(0\)` is not a valid word
* `\(1\)` is not a valid word
* `\(\varepsilon\)` is a valid word

---

# Example 1: Second expression

$$
r_2 = (1)^\ast|(00)^\ast
$$

What are some valid words in the language defined by this regular expression?

$$
\varepsilon, 1, 11, 111, 00, 0000, \ldots
$$

This language is:

### The language consisting of all words that consist either only of `1`s (any number, including zero) or only of `0`s (any even number, including zero)

---
class: medium

# Example 1: Second expression

$$
r_2 = (1)^\ast|(00)^\ast
$$

* `\(001\)` is not a valid word
* `\(000\)` is not a valid word (it has an odd number of `0`s)
* `\(0\)` is not a valid word
* `\(1\)` is a valid word
* `\(\varepsilon\)` is a valid word

---

# Example 1: Third expression

$$
r_3 = (0(01)^\ast)|1
$$

What are some valid words in the language defined by this regular expression?
$$
0, 1, 001, 00101, 0010101, \ldots
$$

This language is:

### The language consisting of the word `1` and all words that start with a `0` followed by any number (including zero) of repetitions of `01`

---
class: medium

# Example 1: Third expression

$$
r_3 = (0(01)^\ast)|1
$$

* `\(001\)` is a valid word
* `\(000\)` is not a valid word
* `\(0\)` is a valid word
* `\(1\)` is a valid word
* `\(\varepsilon\)` is not a valid word

---
class: center, middle

# Example 2: Regular Expression Definition

---
class: medium

# Example 2: Prologue

Write a regular expression that defines the language of all words that start and end with a `1` and contain exactly one `0` over the alphabet `\(\Sigma = \{0,1\}\)`.

* How do we encode these rules in a regular expression?
* We only have four operations: concatenation, (arbitrary) repetition, "or", and parentheses

---

# Just some warmup
---
class: medium

# Example 2: Prologue

* "starts and ends with a `1`": We can enforce this by putting a `1` in our regular expression at the beginning and the end
* "contains exactly one `0`": This means that all **other** characters are 1s
* Since no limit is given on the length of our words, this means there can be any number of characters
* We need: One `1` to start the word, then any number of `1`s until we have the (mandatory) `0`, followed by any number of additional `1`s, and finally a final `1` that ends the word

---
class: mmedium

# Example 2: Prologue

We end up with the following regular expression:

$$
r = 11^\ast{}01^\ast{}1
$$

Some words that would be valid in the language defined by this expression:

* `11011`
* `111101`
* `101`
* `10111111`

---
class: center, middle

# Example 2

## The Real Deal

---
class: mmedium

# Example 2

You work for a company producing "smart" door locks. The keycode to open the door can only contain the digits 0-5. Your boss tells you that you need to ensure "strong" (arbitrarily long) keycodes that he defined as:

* If the first digit is even, the last one has to be odd, and vice versa
* Can not start with a 2 or end with a 5
* Can only have a 0 after an even digit
* 1 can only be before a 3 or 4
* There has to be exactly one 1

---
class: medium

# "Strong" keycodes

* If the first digit is even, the last one has to be odd, and vice versa
* Can not start with a 2 or end with a 5
* Can only have a 0 after an even digit
* 1 can only be before a 3 or 4
* There has to be exactly one 1

While you search for a job at a company that actually knows something about security, you have to construct a regular expression that defines **exactly** the strong keycodes.

---
class: medium

# "Strong" keycodes

* If the first digit is even, the last one has to be odd, and vice versa
* Can not start with a 2 or end with a 5
* Can only have a 0 after an even digit
* 1 can only be before a 3 or 4
* There has to be exactly one 1

Let's solve this puzzle!
We can go rule by rule and see what each one means

---

# "Strong" keycodes

### If the first digit is even, the last one has to be odd, and vice versa

This means there are two cases:

* Starts with an even digit (0,2,4) and ends with an odd one (1,3,5)
* Starts with an odd digit (1,3,5) and ends with an even one (0,2,4)

Basic structure:

$$
((0|2|4)...(1|3|5))|((1|3|5) ... (0|2|4))
$$

---

# "Strong" keycodes

### Can not start with a 2 or end with a 5

We can just remove these options! Now we have:

$$
((0|4)...(1|3))|((1|3|5) ... (0|2|4))
$$

Time to figure out what's going on in the middle ...

---
class: medium

# "Strong" keycodes

### Can only have a 0 after an even digit.

* Sounds tricky ... Let's start with the non-0 characters
* `\((1|2|3|4|5)^*\)` allows us to use any of them, in any order
* The easiest way is to just add cases for when a 0 is allowed

$$
(1|2|3|4|5|(20^\ast)|(40^\ast))^\ast\\\\
$$

or, even shorter (since `\(20^\ast\)` and `\(40^\ast\)` already cover a lone 2 or 4):

$$
(1|3|5|(20^\ast)|(40^\ast))^\ast
$$

---

# "Strong" keycodes

### 1 can only be before a 3 or 4

* Similar idea as before!
* We take what we had, and only allow valid options

$$
((13)|(14)|3|5|(20^\ast)|(40^\ast))^\ast
$$

but we need to be careful to allow 0s after **all** 4s:

$$
((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast
$$

---

# "Strong" keycodes

### There has to be exactly one 1

How would we do this in general?

$$
(...)^\ast 1 (...)^\ast
$$

The "tricky" part here is to make sure that it plays nicely with the other rules ...

First step (1 can only come before 3 or 4):

$$
(...) ((13)|(14)) (...)
$$

Now let's look at how we combine this with the other rules

---

# Recap and Assembly

$$
((0|4)...(1|3))|((1|3|5) ... (0|2|4))\\\\
((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast\\\\
(...) ((13)|(14)) (...)
$$

We need to carefully assemble these parts. And recheck against our rules!

Let's start with the first two

---

# Assembly: Step 1

$$
((0|4)...(1|3))|((1|3|5) ... 
(0|2|4))\\\\
((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast
$$

* 0 can only come after 0, 2, 4, so a keycode cannot start with 0
* 1 can only come before 3 or 4, so a keycode cannot end with 1

$$
((40^\ast)((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast(3|13)) |\\\\
((13|140^\ast|3|5)((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast(20^\ast|40^\ast))
$$

---

# Assembly: Step 2

$$
((40^\ast)((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast(3|13)) |\\\\
((13|140^\ast|3|5)((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast(20^\ast|40^\ast))
$$

and

$$
(...) ((13)|(14)) (...)
$$

First part:

$$
((40^\ast)((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast(3|13))
$$

Two options: Either the single 1 sits somewhere in the middle (as `13` or `14`, with any 0s allowed after the 4), or the code ends with `13`.

---

# Assembly: Step 2, Part 1

First part:

$$
((40^\ast)((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast(3|13))
$$

becomes:

$$
((40^\ast)(3|5|(20^\ast)|(40^\ast))^\ast((13)|(140^\ast))(3|5|(20^\ast)|(40^\ast))^\ast 3)\\\\
|\\\\
((40^\ast)(3|5|(20^\ast)|(40^\ast))^\ast 13)
$$

And now we do the same for the second part

---

# Assembly: Step 2, Part 2

$$
((13|140^\ast|3|5)((13)|(140^\ast)|3|5|(20^\ast)|(40^\ast))^\ast(20^\ast|40^\ast))
$$

becomes

$$
((3|5)(3|5|(20^\ast)|(40^\ast))^\ast(140^\ast))\\\\
|\\\\
((3|5)(3|5|(20^\ast)|(40^\ast))^\ast((13)|(140^\ast))(3|5|(20^\ast)|(40^\ast))^\ast(20^\ast|40^\ast))\\\\
|\\\\
((13|140^\ast)(3|5|(20^\ast)|(40^\ast))^\ast(20^\ast|40^\ast))
$$

---

# The Complete Expression

$$
((40^\ast)(3|5|(20^\ast)|(40^\ast))^\ast((13)|(140^\ast))(3|5|(20^\ast)|(40^\ast))^\ast 3)\\\\
|\\\\
((40^\ast)(3|5|(20^\ast)|(40^\ast))^\ast 13)\\\\
|\\\\
((3|5)(3|5|(20^\ast)|(40^\ast))^\ast(140^\ast))\\\\
|\\\\
((3|5)(3|5|(20^\ast)|(40^\ast))^\ast((13)|(140^\ast))(3|5|(20^\ast)|(40^\ast))^\ast(20^\ast|40^\ast))\\\\
|\\\\
((13|140^\ast)(3|5|(20^\ast)|(40^\ast))^\ast(20^\ast|40^\ast))\\\\
|\\\\
140^\ast
$$

The last alternative covers keycodes that consist only of `14` followed by any number of 0s, which none of the other alternatives match.

---

# Perfection
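---

# Rechecking Against the Rules

One way to recheck an assembled expression is to code the five rules directly and compare both on every short keycode. The sketch below is only an illustration: `is_strong` and `disagreements` are hypothetical helper names, and it assumes that "1 can only be before a 3 or 4" means every 1 must be directly followed by a 3 or 4, and that a 0 can never be the first digit.

```python
import re
from itertools import product

DIGITS = "012345"
EVEN = set("024")


def is_strong(code: str) -> bool:
    """Check the five keycode rules directly, without a regular expression."""
    if not code or any(d not in DIGITS for d in code):
        return False
    # Rule 1: first and last digit must have different parity
    if (code[0] in EVEN) == (code[-1] in EVEN):
        return False
    # Rule 2: can not start with a 2 or end with a 5
    if code[0] == "2" or code[-1] == "5":
        return False
    for i, d in enumerate(code):
        # Rule 3: a 0 needs an even digit directly before it (assumed reading)
        if d == "0" and (i == 0 or code[i - 1] not in EVEN):
            return False
        # Rule 4: a 1 needs a 3 or 4 directly after it (assumed reading)
        if d == "1" and (i + 1 == len(code) or code[i + 1] not in "34"):
            return False
    # Rule 5: exactly one 1
    return code.count("1") == 1


def disagreements(pattern: str, max_len: int) -> list[str]:
    """List all keycodes up to max_len where pattern and rules disagree."""
    regex = re.compile(pattern)
    found = []
    for n in range(1, max_len + 1):
        for digits in product(DIGITS, repeat=n):
            code = "".join(digits)
            if (regex.fullmatch(code) is not None) != is_strong(code):
                found.append(code)
    return found
```

To use it, write the expression in Python syntax (`*` instead of `\ast`) and pass it as `pattern`. For example, `disagreements(r"14", 2)` is empty because `14` is the only strong keycode with at most two digits. An empty result up to some length is evidence, not a proof, that an expression matches the rules.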
---
class: medium

# Demo!

* I told you that you could earn bonus points in your presentations
* One possibility is for "interactive" demos
* So let's look at a [demo](https://regex101.com/r/OAhk5N/1) of this regular expression
* We can also [visualize](https://www.debuggex.com/) it
* Note: Since this is just using existing sites, it would be 1-3 bonus points (depending on complexity)
* Stay tuned for an even more interactive demo

---

# Bonus: Regular Expression for URLs

```
(((([a-z]+):\/\/)
([a-z0-9-]+(:[a-z0-9-]+)?@)?
([a-z0-9-]+(\.[a-z0-9-]+)*)
(:[a-z0-9-]+)?(/[a-z0-9-]*)*)
|
(([a-z0-9-]+(:[a-z0-9-]+)?@)?
((([a-z0-9-]+\.[a-z0-9-]+
(\.[a-z0-9-]+)+)
(:[a-z0-9-]+)?(/[a-z0-9-]+)*
)
|
(([a-z0-9-]+(\.[a-z0-9-]+)+)
(:[0-9]+)?(/[a-z0-9-]*)+))))/?
```

To find things like `www.google.com` or `user:password@server.com/`, but not `Booking.com` (brand name)
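---

# Bonus: Trying the URL Expression

The URL expression can be tried out locally as well. The sketch below is an illustration only: it assumes Python's `re` module with the `re.VERBOSE` flag, so the spaces and line breaks used for layout are ignored, and it checks whole strings with `fullmatch`. The names `URL_PATTERN` and `is_url` are made up for this example.

```python
import re

# The expression from the previous slide. re.VERBOSE makes the regex engine
# ignore whitespace outside of character classes, so the layout can stay.
URL_PATTERN = re.compile(
    r"""
    (((([a-z]+):\/\/)
    ([a-z0-9-]+(:[a-z0-9-]+)?@)?
    ([a-z0-9-]+(\.[a-z0-9-]+)*)
    (:[a-z0-9-]+)?(/[a-z0-9-]*)*)
    |
    (([a-z0-9-]+(:[a-z0-9-]+)?@)?
    ((([a-z0-9-]+\.[a-z0-9-]+
    (\.[a-z0-9-]+)+)
    (:[a-z0-9-]+)?(/[a-z0-9-]+)*
    )
    |
    (([a-z0-9-]+(\.[a-z0-9-]+)+)
    (:[0-9]+)?(/[a-z0-9-]*)+))))/?
    """,
    re.VERBOSE,
)


def is_url(text: str) -> bool:
    """True if the whole string matches the URL expression."""
    return URL_PATTERN.fullmatch(text) is not None


for candidate in ["www.google.com", "user:password@server.com/", "Booking.com"]:
    print(candidate, is_url(candidate))
```

Without a scheme, the expression asks for either at least three dot-separated labels or at least one `/`, which is why a two-label brand name such as `Booking.com` is rejected.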