class: center, middle

# CS-3110: Formal Languages and Automata

## The Pumping Lemma

### Chapter 3.7

---

# Regular Languages

* Recall: Regular languages have limitations
* For example, we cannot "count" arbitrarily nested parentheses (or HTML tags)
* But what *exactly* is the limitation?
* Is there a way to determine which languages are **not** regular?

---

# Pigeonhole Principle
Image author: en:User:BenFrantzDale; this image by en:User:McKay
---

# An Automaton
$$ \delta^*(\text{01001011}, q_{00}) $$

What happened?

--

Where are the pigeons?

---

# An Automaton
* If we have a "long" word we will visit the same state multiple times
* How long? At least 4 characters, one per state (in this case)
* In our example, we went:

```
q00, q10, q11, q01, q11, q10, q00, q01, q00
```

---

# An Automaton
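The run from the previous slide can be replayed in a few lines of Python. This is only a sketch: it assumes the automaton tracks the parity of the `0`s and `1`s read so far (a guess based on the state names), and the helper names `step`, `run`, and `repeated_states` are made up for illustration.

```python
# Assumption: state "qAB" means A = (number of 0s read) mod 2
# and B = (number of 1s read) mod 2. This is a guess from the state names.
def step(state, symbol):
    zeros, ones = int(state[1]), int(state[2])
    if symbol == "0":
        zeros = 1 - zeros
    else:
        ones = 1 - ones
    return f"q{zeros}{ones}"


def run(word, start="q00"):
    """Return every state visited while reading word, including the start."""
    states = [start]
    for symbol in word:
        states.append(step(states[-1], symbol))
    return states


def repeated_states(states):
    """Map each state that occurs more than once to the positions where it occurs."""
    positions = {}
    for index, state in enumerate(states):
        positions.setdefault(state, []).append(index)
    return {s: p for s, p in positions.items() if len(p) > 1}


states = run("01001011")
print(states)                    # 9 states for 8 characters, but only 4 distinct ones
print(repeated_states(states))   # the "pigeons" that share a hole
print(run("1001", start="q10"))  # a loop: starts and ends in q10
```

Any state listed by `repeated_states` marks a loop in the run; the part of the word read between two visits of the same state is a candidate for repetition.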
$$ \delta^*(\text{1001}, q_{10}) $$

* This is part of our word
* And we just ran in a circle
* Our word had some part (`0`) before it, and some part (`011`) after it

---

# What if?
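The split from the previous slide (`0` before the loop, `1001` as the loop, `011` after it) can be tested mechanically. The sketch below assumes the parity automaton again (a guess from the state names) and additionally assumes that `q00` is the only accepting state; `final_state`, `accepts`, and `pump` are illustrative names.

```python
def final_state(word):
    # Assumption: the automaton only tracks the parity of 0s and 1s.
    zeros = word.count("0") % 2
    ones = word.count("1") % 2
    return f"q{zeros}{ones}"


def accepts(word):
    # Assumption: q00 is the only accepting state.
    return final_state(word) == "q00"


def pump(x, y, z, i):
    """Build the word x y^i z."""
    return x + y * i + z


x, y, z = "0", "1001", "011"
for i in range(5):
    word = pump(x, y, z, i)
    print(i, word, accepts(word))
```

Under these assumptions every line reports `True`, including `i = 0`, where the loop is skipped entirely.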
$$ \delta^*(\text{10011001}, q_{10}) $$

* What if we repeated the loop?
* If we keep the prefix (`0`) and postfix (`011`) the same, we still have a valid word
* We can repeat the loop as often as we want!

---

# The Pumping Lemma

Let `\(L\)` be a regular language. Then there is some number `\(n > 0\)` such that any string `\(w \in L\)` whose length is greater than or equal to n can be broken down into three pieces x, y, and z, i.e. `\(w = x\cdot y\cdot z\)`, such that:

1. x and y together contain no more than n symbols
2. y contains at least one symbol
3. `\(x\cdot y^i \cdot z \in L\)` for *every* `\(i \ge 0\)`

---

# The Pumping Lemma

What does this mean?

--

Let's look at it piece by piece

---

# Minimum Length

* First, we have this mysterious `\(n\)`
* The pumping lemma applies to *all* strings in `\(L\)` whose length is at least `\(n\)`
* If our language is finite, this is vacuously true if we choose `\(n\)` to be the length of the longest word plus one: no word is long enough
* So the lemma is only really useful if the language is infinite

---

# Splitting the word

Now we have a word w with length at least n, and split it into three parts: x, y, and z, with `\(w = x\cdot y\cdot z\)`, where y is not empty.

The pumping lemma says that:

`\(x\cdot y^i \cdot z \in L\)` for *every* `\(i \ge 0\)`

This means there is some "middle part" for every long word that we can repeat ("pump") as often as we want, and the word will still be in the language. For `\(i = 0\)` we remove the middle part entirely.

---
class: medium

# Pumping

Let's look at **why** we can repeat this middle part.

As we have seen, every regular language can be recognized by a DFA. This DFA has a finite number of states, say `\(m\)`.
* When the automaton accepts a word, it passes through a sequence of its states
* If a word has at least as many characters as there are states, at least one state has to be visited at least twice
* This means, when accepting a word with length at least `\(m\)` we have some sequence `\(q_0,\ldots, q_i,\ldots, q_i, \ldots, q_f\)` that the automaton passes through
* Let us look at the part `\(q_i \ldots q_i\)` in more detail

---

# Pumping

Let's take our word `\(w = x\cdot y\cdot z\)`

* `\(x\)` is the part of the word that causes the automaton to transition from `\(q_0\)` to `\(q_i\)`
* `\(y\)` is the part of the word that causes the loop from `\(q_i\)` back to `\(q_i\)`
* `\(z\)` is the part of the word that causes the automaton to transition from `\(q_i\)` to an accepting/final state `\(q_f\)`
* What happens if we duplicate the y-part? We just run through the same loop again!
* If we take the first repetition, it happens within the first `\(m\)` characters, so x and y together contain no more than `\(m\)` symbols

---

# The Pumping Lemma: What is it good for?

We have now shown:

.center[
## If a language is regular, the pumping lemma holds
]

But this implies the contrapositive:

.center[
## If the pumping lemma does not hold for a language, it is **not** regular
]

This is the formulation we will use in practice!

---

# Non-Regular Languages

Let us now look at the language:

$$ L = \\{a^i\cdot b^i | i \ge 0 \\} $$

We mentioned that DFAs "can't count", so let us now show that this language is not regular using the pumping lemma.

---

# Proof by Contradiction

* Let's assume the language is regular
* Then the pumping lemma would hold!
* So there is a threshold value `\(n\)`, and we can "pump" all words in the language that are at least that long

---

# Choosing a word

If our threshold value is `\(n\)`, the word `\(a^n\cdot b^n\)` is longer than `\(n\)` **and** it is in our language.

That means we can pump it. First, we split it into three parts:

$$ a^nb^n = xyz $$

---

# Cases

We have three options:

* `y` only consists of `a`s
* `y` consists of some `a`s and some `b`s
* `y` only consists of `b`s

But we also have the requirement:

1. x and y together contain no more than n symbols

The first n symbols of `\(a^nb^n\)` are all `a`s, so we only need to consider the first option

---

# Case 1

* So `y` only consists of `a`s
* But what happens if we repeat `y` (and remember: the prefix and postfix stay the same!)?
* If `y` consists of `k` `a`s (with `\(k \ge 1\)`), and we repeat `y`, we end up with `k` more `a`s than `b`s
* But that word is not in the language ...

---

# The Contradiction

We had the language:

$$ L = \\{a^i\cdot b^i | i \ge 0 \\} $$

We assumed it was regular.

And we discovered that the pumping lemma doesn't hold.

Therefore, our assumption was wrong and the language is **not** regular.

---

# Another Example

Consider the language:

### R is the language of all strings over the alphabet `\(\Sigma=\{a,b\}\)` where each word has the same number of `a`s and `b`s

Show that this language is not regular

We could use the pumping lemma ...

--

... or we recall something from last time!

---

# Another Language

* Consider the regular expression: `\(a^\ast b^\ast\)`
* The language defined by this regular expression is "arbitrarily many `a`s followed by arbitrarily many `b`s", but the number of `a`s and `b`s may be different.
* Let's call this language `\(S\)`
* It is definitely regular (we just got it from a regular expression)

---
class: medium

# The two languages

* S is arbitrarily many `a`s followed by arbitrarily many `b`s
* R is words with the same number of `a`s and `b`s
* `\(R \cap S\)` is "words with a number of `a`s followed by the same number of `b`s"
* But that's the same as `\(L = \{a^i\cdot b^i | i \ge 0 \}\)`

---
class: medium

# A contradiction

* We know S is regular
* Assume that R is regular as well
* Then `\(R \cap S = L\)` is regular, because regular languages are closed under intersection
* But we showed earlier that it is not, so we have a contradiction: R cannot be regular

---

# Summary: Pumping Lemma

Let `\(L\)` be a regular language. Then there is some number `\(n > 0\)` such that any string `\(w \in L\)` whose length is greater than or equal to n can be broken down into three pieces x, y, and z, i.e.
`\(w = x\cdot y\cdot z\)`, such that:

1. x and y together contain no more than n symbols
2. y contains at least one symbol
3. `\(x\cdot y^i \cdot z \in L\)` for *every* `\(i \ge 0\)`

As I said, we often use this in the contrapositive: If these conditions do not hold, the language is not regular.

We need to be careful, though.

---

# Pitfall 1

* The pumping lemma must hold for **all** words in a regular language that are at least `\(n\)` characters long
* This means, for the contrapositive we can **choose** any word of the language as long as it is long enough
* We could have shown directly that our language with "same number of `a`s and `b`s" is not regular, because every `\(a^i b^i\)` is also a word of that language

---

# Pitfall 2

* Once we have chosen the word `w`, the lemma only says that there **exists** a split `\(w = xyz\)`
* This means, for the contrapositive we **cannot** choose the split arbitrarily!
* We only know: `\(xy\)` is at most `\(n\)` characters long, and `\(y\)` is not empty
* We need to show that pumping fails for **every** such split (including when `x` is empty)

---

# Pitfall 3

* The chosen word must be in the language
* At least one of the pumped words must not be in the language
* Sometimes we may need to **shorten** words
* Consider: `\(L = \{a^n b^m | n \gt m\}\)`
* We can always add more `a`s! But we can choose a word such that we cannot **remove** `a`s (pumping with `\(i = 0\)`)

---

# Non-Regular Languages

* The pumping lemma illustrates a fundamental limitation of regular languages
* Since we do not have (arbitrary) memory, we cannot count matches
* Why? Because there could be more things (like parentheses) to count than we have states
* Therefore we will have to look to more powerful mechanisms to handle more complex languages

---

# Future Schedule

* Tuesday, 3/19: Presentation
* Thursday, 3/21: Review
* Tuesday, 3/26: Midterm
* Thursday, 3/28: Context-Free Grammars
* Spring Break!
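---

# Sketch: Checking Every Split

The argument for `\(a^i b^i\)` and the warning from Pitfall 2 can be replayed by brute force for one fixed threshold. This is a sketch, not a proof: the threshold `n = 5` is an arbitrary stand-in, and `in_language`, `splits`, and `can_be_pumped` are hypothetical helper names.

```python
def in_language(word):
    """Membership test for L = { a^i b^i | i >= 0 }."""
    half = len(word) // 2
    return word == "a" * half + "b" * half


def splits(word, n):
    """All splits word = x y z allowed by the lemma: |xy| <= n and y not empty."""
    for end in range(1, min(n, len(word)) + 1):
        for begin in range(end):
            yield word[:begin], word[begin:end], word[end:]


def can_be_pumped(word, n, max_i=3):
    """True if some allowed split keeps x y^i z in the language for i = 0..max_i."""
    return any(
        all(in_language(x + y * i + z) for i in range(max_i + 1))
        for x, y, z in splits(word, n)
    )


n = 5  # hypothetical threshold; the proof itself works for every n
print(can_be_pumped("a" * n + "b" * n, n))  # False: every allowed split fails
```

A result of `False` is conclusive for that `n`, because every allowed split was tried and each one failed for some `i`. A result of `True` would prove nothing, since only finitely many values of `i` are checked.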