Lecture 8: The Pumping Lemma

# CS-3110: Formal Languages and Automata

## The Pumping Lemma

### Chapter 3.7

---

# Regular Languages

* Recall: Regular languages have limitations

* For example, we cannot "count" arbitrarily nested parentheses (or HTML tags)

* But what *exactly* is the limitation?

* Is there a way to determine which languages are **not** regular?

---

# Pigeonhole Principle

Author: <a href="https://en.wikipedia.org/wiki/User:BenFrantzDale">en:User:BenFrantzDale</a>; this image by <a href="https://en.wikipedia.org/wiki/User:McKay">en:User:McKay</a>

---

# An Automaton

$$
\delta^*(\text{01001011}, q_{00})
$$

What happened?

Where are the pigeons?

---

# An Automaton

* If we have a "long" word we will visit the same state multiple times

* How long? At least 4 characters (in this case): the automaton has 4 states, and reading 4 characters visits 5 states

* In our example, we went:

```
q00, q10, q11, q01, q11, q10, q00, q01, q00
```
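The state sequence can be recomputed with a few lines of Python. This is only a sketch: the diagram is not reproduced here, so the `step` function assumes an automaton whose first index flips on every `0` and whose second index flips on every `1`. The helper names `step`, `trace`, and `first_repeat` are made up for this illustration.

```python
def step(state: str, symbol: str) -> str:
    """Assumed transition: flip the first digit on '0', the second on '1'."""
    zeros, ones = int(state[1]), int(state[2])
    if symbol == "0":
        zeros ^= 1
    else:
        ones ^= 1
    return f"q{zeros}{ones}"


def trace(word: str, start: str = "q00") -> list[str]:
    """Return every state visited while reading `word`, including the start."""
    states = [start]
    for symbol in word:
        states.append(step(states[-1], symbol))
    return states


def first_repeat(states: list[str]) -> tuple[int, int]:
    """Return positions (j, k) with j < k of the earliest completed repetition."""
    seen = {}
    for k, state in enumerate(states):
        if state in seen:
            return seen[state], k
        seen[state] = k
    raise ValueError("no state repeats")


visited = trace("01001011")
print(visited)
print(first_repeat(visited))
```

The word has 8 characters, so 9 states are visited, but only 4 different states exist. Note that the earliest repetition the code reports may be a shorter loop than the one we look at next; any loop works for the argument.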

---

# An Automaton

$$
\delta^*(\text{1001}, q_{10})
$$

* This is part of our word

* And we just ran in a circle

* Our word had some part (`0`) before it, and some part (`011`) after it

---

# What if?

$$
\delta^*(\text{10011001}, q_{10})
$$

* What if we repeated the loop?

* If we keep the prefix (`0`) and postfix (`011`) the same, we still have a valid word

* We can repeat the loop as often as we want!
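We can try this out in code. The following sketch assumes a transition table `DELTA` in which the first index flips on `0` and the second index flips on `1` (a guess at the automaton from the diagram, not a given). `run` is a hypothetical helper that plays the role of `$\delta^*$`.

```python
# Assumed transition table: first index flips on '0', second index flips on '1'.
DELTA = {
    ("q00", "0"): "q10", ("q00", "1"): "q01",
    ("q10", "0"): "q00", ("q10", "1"): "q11",
    ("q01", "0"): "q11", ("q01", "1"): "q00",
    ("q11", "0"): "q01", ("q11", "1"): "q10",
}


def run(word: str, state: str) -> str:
    """Extended transition function: read `word` starting in `state`."""
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state


prefix, loop, suffix = "0", "1001", "011"

# The loop brings us back to the state where it started.
loop_closes = run(loop, "q10") == "q10"

# Repeating the loop i times (including zero times) never changes
# the state in which the whole word ends.
end_states = {run(prefix + loop * i + suffix, "q00") for i in range(6)}
print(loop_closes, end_states)
```

Every pumped word ends in the same state as the original word, so if the original word is accepted, all of them are.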

---

# The Pumping Lemma

Let `$L$` be a regular language. Then there is some number `$n > 0$` such that any string `$w \in L$` whose length is greater than or equal to `$n$` can be broken down into three pieces `$x$`, `$y$`, and `$z$`, i.e. `$w = x\cdot y\cdot z$`, such that:

1. `$x$` and `$y$` together contain no more than `$n$` symbols: `$|xy| \le n$`

2. `$y$` contains at least one symbol: `$|y| \ge 1$`

3. `$x\cdot y^i \cdot z \in L$` for *every* `$i \ge 0$` (including `$i = 0$`, which removes `$y$`)
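It can help to see the order of the quantifiers spelled out. One way to write the statement for a regular language `$L$` is:

$$
\exists n > 0 \quad \forall w \in L \text{ with } |w| \ge n \quad \exists x, y, z \text{ with } w = xyz : \quad |xy| \le n \text{ and } |y| \ge 1 \text{ and } \forall i \ge 0 : x y^i z \in L
$$

The alternation "exists, for all, exists, for all" is what makes the lemma tricky to apply correctly.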

---

# The Pumping Lemma

What does this mean?

Let's look at it piece by piece

---

# Minimum Length

* First, we have this mysterious `$n$`

* The pumping lemma applies to *all* strings in `$L$` whose length is at least `$n$`

* If our language is finite, this is vacuously true if we choose `$n$` to be the length of the longest word plus one

* So the lemma only tells us something interesting if the language is infinite

---

# Splitting the word

Now we have a word `$w \in L$` with length at least `$n$`, and split it into three parts:

`$x$`, `$y$`, and `$z$`, with `$w = x\cdot y\cdot z$`, where `$y$` is not empty.

The pumping lemma says that:

`$x\cdot y^i \cdot z \in L$` for *every* `$i \ge 0$`

This means that every sufficiently long word in the language has some "middle part" that we can repeat ("pump") as often as we want, or remove entirely, and the word will still be in the language.

---

# Pumping

Let's look at **why** we can repeat this middle part. As we have seen, every regular language can be recognized by a DFA. This DFA has a finite number of states, say `$m$`.

* When the automaton accepts a word, it passes through a sequence of its states

* If a word has at least as many characters as there are states, at least one state has to be visited at least twice (pigeonhole principle: reading `$m$` characters visits `$m + 1$` states)

* This means, when accepting a word with length at least `$m$`, the automaton passes through some sequence `$q_0,\ldots, q_i,\ldots, q_i, \ldots, q_f$`

* Let us look at the part `$q_i \ldots q_i$` in more detail

---

# Pumping

Let's take our word `$w = x\cdot y\cdot z$`

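* `$x$` is the part of the word that causes the automaton to transition from `$q_0$` to the first visit of `$q_i$`

* `$y$` is the part of the word that causes the loop from `$q_i$` back to `$q_i$`; it contains at least one symbol

* `$z$` is the part of the word that causes the automaton to transition from `$q_i$` to an accepting/final state `$q_f$`

* If we pick the earliest repetition, the loop closes within the first `$m$` characters, so `$x$` and `$y$` together contain no more than `$m$` symbols

* What happens if we duplicate the `$y$`-part? We just run through the same loop again! And if we remove it, we skip the loop and still end in `$q_f$`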
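The construction of `$x$`, `$y$`, and `$z$` can be sketched in code. The transition table `DELTA` below is an assumption (a four-state automaton whose first index flips on `0` and second index flips on `1`), and `run` and `split_at_first_loop` are hypothetical helper names.

```python
# Assumed four-state automaton: first index flips on '0', second on '1'.
DELTA = {
    ("q00", "0"): "q10", ("q00", "1"): "q01",
    ("q10", "0"): "q00", ("q10", "1"): "q11",
    ("q01", "0"): "q11", ("q01", "1"): "q00",
    ("q11", "0"): "q01", ("q11", "1"): "q10",
}
NUM_STATES = 4


def run(word: str, state: str = "q00") -> str:
    """Read `word` starting in `state` and return the final state."""
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state


def split_at_first_loop(word: str, start: str = "q00") -> tuple[str, str, str]:
    """Split `word` into (x, y, z) where y drives the earliest loop of the run."""
    seen = {start: 0}  # state -> number of symbols read at its first visit
    state = start
    for pos, symbol in enumerate(word, start=1):
        state = DELTA[(state, symbol)]
        if state in seen:
            j = seen[state]
            return word[:j], word[j:pos], word[pos:]
        seen[state] = pos
    raise ValueError("word is too short: no state repeats")


w = "01001011"
x, y, z = split_at_first_loop(w)
print(x, y, z)

# Pumping y (including i = 0) never changes the final state.
same_end = all(run(x + y * i + z) == run(w) for i in range(6))
print(len(x + y) <= NUM_STATES, len(y) >= 1, same_end)
```

Because the code stops at the earliest repetition, `x` and `y` together are never longer than the number of states, which is exactly where the first condition of the lemma comes from.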

---

# The Pumping Lemma: What is it good for?

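We have now shown:

$$
L \text{ is regular} \Rightarrow \text{the pumping lemma holds for } L
$$

But this implies the contrapositive:

$$
\text{the pumping lemma does not hold for } L \Rightarrow L \text{ is not regular}
$$

This is the formulation we will use in practice!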

---

# Non-Regular Languages

Let us now look at the language:

$$
L = \\{a^i\cdot b^i | i \ge 0 \\}
$$

We mentioned that DFAs "can't count", so let us now show that this language is not regular using the pumping lemma.

---

# Proof by Contradiction

* Let's assume the language was regular

* Then the pumping lemma would hold!

* So there is a threshold value `$n$`, and we can "pump" all words of the language that have length at least `$n$`

---

# Choosing a word

If our threshold value is `$n$`, the word `$a^n\cdot b^n$` is longer than `$n$` **and** it is in our language.

That means, we can pump it.

First, we split it into three parts:

$$
a^nb^n = xyz
$$

---

# Cases

We have three options:

* `y` only consists of `a`s

* `y` consists of some `a`s and some `b`s

* `y` only consists of `b`s

But we also have the requirement:

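1. `$x$` and `$y$` together contain no more than `$n$` symbols

Since the first `$n$` symbols of `$a^nb^n$` are all `a`s, `$y$` lies entirely within the `a`s. So, we only need to consider the first option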

---

# Case 1

* So `y` only consists of `a`s

* But what happens if we repeat `y` (and remember: the prefix and postfix stay the same!)?

* If `y` consists of `k` `a`s (with `$k \ge 1$`), and we repeat `y` once, we get `$a^{n+k}b^n$`, which has `k` more `a`s than `b`s

* But that word is not in `$L$`, even though the pumping lemma says it should be
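To make this concrete, we can enumerate every allowed split for one particular threshold and check that pumping always leaves the language. This is a sanity check for a single value, not a proof; the value `n = 5` and the helper names `in_L` and `splits` are chosen only for illustration.

```python
def in_L(word: str) -> bool:
    """Membership in L = { a^i b^i | i >= 0 }."""
    half = len(word) // 2
    return len(word) % 2 == 0 and word == "a" * half + "b" * half


def splits(word: str, n: int):
    """Yield every split word = x y z with |xy| <= n and |y| >= 1."""
    for xy_len in range(1, min(n, len(word)) + 1):
        for x_len in range(xy_len):
            yield word[:x_len], word[x_len:xy_len], word[xy_len:]


n = 5  # hypothetical threshold, for illustration only
w = "a" * n + "b" * n

# Every allowed y lies inside the block of a's ...
only_as = all(set(y) == {"a"} for _, y, _ in splits(w, n))

# ... and repeating it once always produces a word outside L.
always_breaks = all(not in_L(x + y * 2 + z) for x, y, z in splits(w, n))
print(in_L(w), only_as, always_breaks)
```

The actual proof is the argument on the slide, which works for every `$n$` at once.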

---

# The Contradiction

We had the language:

$$
L = \\{a^i\cdot b^i | i \ge 0 \\}
$$

We assumed it was regular.

And we discovered that the pumping lemma doesn't hold.

Therefore, our assumption was wrong and the language is **not** regular.

---

# Another Example

Consider the language:

### R is the language of all strings over the alphabet `$\Sigma=\{a,b\}$` where each word has the same number of `a`s and `b`s

Show that this language is not regular

We could use the pumping lemma ...

... or we recall something from last time!

---

# Another Language

* Consider the regular expression: `$a^\ast b^\ast$`

* The language defined by this regular expression is "arbitrarily many `a`s followed by arbitrarily many `b`s", but the number of `a`s and `b`s may be different.

* Let's call this language `$S$`

* It is definitely regular (we just got it from a regular expression)

---

# The two languages

* S is arbitrarily many `a`s followed by arbitrarily many `b`s

* R is words with the same number of `a`s and `b`s

* `$R \cap S$` is "words with a number of `a`s followed by the same number of `b`s"

* But that's the same as `$L = \{a^i\cdot b^i | i \ge 0 \}$`

---

# A contradiction

* We know S is regular

* Assume that R is regular as well

* Then `$R \cap S = L$` is regular as well, because regular languages are closed under intersection

* But we showed earlier that `$L$` is not regular, so we have a contradiction: R cannot be regular
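The key step is the claim `$R \cap S = L$`. A quick way to convince ourselves is to test it on all short words. The sketch below checks every word over `a`, `b` up to length 8 (an arbitrary bound); the names `in_R`, `in_S`, and `in_L` are made up for this example.

```python
import re
from itertools import product


def in_R(word: str) -> bool:
    """Same number of a's and b's, in any order."""
    return word.count("a") == word.count("b")


def in_S(word: str) -> bool:
    """Matches the regular expression a*b*."""
    return re.fullmatch(r"a*b*", word) is not None


def in_L(word: str) -> bool:
    """Membership in L = { a^i b^i | i >= 0 }."""
    half = len(word) // 2
    return len(word) % 2 == 0 and word == "a" * half + "b" * half


# All words over {a, b} with length 0 to 8.
words = ["".join(p) for k in range(9) for p in product("ab", repeat=k)]

agree = all((in_R(w) and in_S(w)) == in_L(w) for w in words)
print(len(words), agree)
```

A finite check like this is no proof, but it catches mistakes in the claimed equality before we build an argument on top of it.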

---

# Summary: Pumping Lemma

1. x and y together contain no more than n symbols

2. y contains at least one symbol

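3. `$x\cdot y^i \cdot z \in L$` for *every* `$i \ge 0$`

As I said, we often use this in the contrapositive: If, no matter which `$n$` is chosen, some word of the language with length at least `$n$` has no split satisfying all three conditions, the language is not regular. We need to be careful, though.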

---

# Pitfall 1

* The pumping lemma must hold for **all** words in a regular language that have length at least `$n$`

* This means, for the contrapositive we can **choose** any word of the language, as long as it is long enough

* We could have shown directly that our language with "same number of `a`s and `b`s" is not regular by choosing `$a^n b^n$`, because it is also a word of that language

---

# Pitfall 2

* Once we have chosen the word `w`, there **exists** a split `$w = xyz$`

* This means, for the contrapositive we **cannot** choose the split ourselves!

* We only know: `$xy$` is at most `$n$` characters long, and `$y$` is not empty

* We need to show that pumping fails for **every** such split (including when `x` is empty)
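Pitfalls 1 and 2 become easier to remember when the statement we have to prove is written with quantifiers. To show that `$L$` is not regular with the pumping lemma, we need:

$$
\forall n > 0 \quad \exists w \in L \text{ with } |w| \ge n \quad \forall x, y, z \text{ with } w = xyz \text{ and } |xy| \le n \text{ and } |y| \ge 1 \quad \exists i \ge 0 : x y^i z \notin L
$$

Wherever it says "exists", we get to choose (the word `$w$`, the exponent `$i$`). Wherever it says "for all", we have to handle every possibility (the threshold `$n$`, the split).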

---

# Pitfall 3

* The chosen word must be in the language

* At least one of the pumped words must **not** be in the language

* Sometimes we may need to **shorten** words, i.e. pump with `$i = 0$`

* Consider: `$L = \{a^n b^m | n \gt m\}$`

* We can always add more `a`s! But we can choose a word with exactly one more `a` than `b`s, so that we cannot **remove** any `a`s
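Here is a small experiment for this language. It picks a hypothetical threshold `p = 4`, takes the word with `p + 1` `a`s followed by `p` `b`s, and compares pumping up with pumping down over every allowed split. The names `in_L` and `splits` are illustrative helpers, and checking one value of `p` is not a proof.

```python
import re


def in_L(word: str) -> bool:
    """Membership in { a^n b^m | n > m }."""
    shape_ok = re.fullmatch(r"a*b*", word) is not None
    return shape_ok and word.count("a") > word.count("b")


def splits(word: str, p: int):
    """Yield every split word = x y z with |xy| <= p and |y| >= 1."""
    for xy_len in range(1, min(p, len(word)) + 1):
        for x_len in range(xy_len):
            yield word[:x_len], word[x_len:xy_len], word[xy_len:]


p = 4  # hypothetical threshold, for illustration only
w = "a" * (p + 1) + "b" * p

# Pumping up (i = 2) only adds a's, so the word stays in the language.
up_stays = all(in_L(x + y * 2 + z) for x, y, z in splits(w, p))

# Pumping down (i = 0) removes at least one a, so the word leaves the language.
down_leaves = all(not in_L(x + z) for x, y, z in splits(w, p))
print(in_L(w), up_stays, down_leaves)
```

Repeating `y` gets us nowhere here; only the case `$i = 0$` produces the word outside the language that the contradiction needs.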

---

# Non-Regular Languages

* The pumping lemma illustrates a fundamental limitation of regular languages

* Since we do not have (arbitrary) memory, we cannot count matches

* Why? Because there could be more things (like parentheses) to count than we have states

* Therefore we will have to look to more powerful mechanisms to handle more complex languages

---

# Future Schedule

* Tuesday, 3/19: Presentation

* Thursday, 3/21: Review

* Tuesday, 3/26: Midterm

* Thursday, 3/28: Context-Free Grammars

* Spring Break!