Lecture 3: Languages

# CS-3110: Formal Languages and Automata

## Languages

### Chapter 3.1

---

# Alphabets

* An alphabet, usually denoted `$\Sigma$`, is a **set** of **symbols**

* Symbols are usually individual characters, or "tokens", depending on our application

* To start with, we will use individual characters, like "a", "b", "c" or "0", "1"

$$
\Sigma = \\{0,1\\}
$$

---

# Strings

Once we have an alphabet, we can use it to make strings, or **words**.

A word is a **sequence** of symbols from our alphabet:

$$
x = a_1a_2a_3\ldots{}a_n\\\\
y = b_1b_2b_3\ldots{}b_m
$$

Words are always of finite length, but they can be arbitrarily long.

We say that words are chosen "over" an alphabet. For example "0", "00" and "0101" are all words over the alphabet `$\Sigma = \{0,1\}$`.
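
As a rough Python sketch of what "over an alphabet" means (strings play the role of words; the names `SIGMA` and `is_word_over` are made up for this illustration):

```python
SIGMA = {"0", "1"}  # the example alphabet from above

def is_word_over(word: str, alphabet: set[str]) -> bool:
    """Return True if every symbol of `word` is drawn from `alphabet`."""
    return all(symbol in alphabet for symbol in word)

print(is_word_over("0101", SIGMA))  # True
print(is_word_over("012", SIGMA))   # False: "2" is not in the alphabet
```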

---

# String Equality

Two strings/words are equal iff:
  
  * They have the same length 
  
  * The characters at each position are the same

$$
\begin{aligned}
x =&\:a_1a_2a_3\ldots{}a_n\\\\
y =&\:b_1b_2b_3\ldots{}b_m\\\\
x = y \equiv&\:n = m \wedge \forall i: a_i = b_i   
\end{aligned}
$$

$$
0 = 0\\\\
01 = 01\\\\
\neg 10 = 01\\\\
\neg 10 = 100 \equiv 10 \not= 100
$$

---

# String Length

Sometimes we want to talk about the length of a string, i.e. the number of characters it has:

$$
|x| = |a_1a_2a_3\ldots{}a_n| = n
$$

With that we can also write our string equality as:

$$
x = y \equiv|x| = |y| \wedge \forall i: a_i = b_i   
$$

$$
|a| = 1\\\\
|xy| = |x| + |y|\\\\
|101| = 3\\\\
|000| = 3
$$
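
If we model words as Python strings (an assumption for illustration only), `len` plays the role of `|x|`, and the identities above can be checked directly:

```python
x = "101"
y = "000"

assert len(x) == 3
assert len(y) == 3

# |xy| = |x| + |y|: writing two words back to back adds their lengths
assert len(x + y) == len(x) + len(y)

# Equal length is necessary for equality, but not sufficient
assert len(x) == len(y) and x != y
```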

---

# String Concatenation

We usually use the notation "xy" to mean the concatenation of two strings x and y:

$$
\begin{aligned}
x =&\:a_1a_2a_3\ldots{}a_n\\\\
y =&\:b_1b_2b_3\ldots{}b_m\\\\
xy =&\:a_1a_2a_3\ldots{}a_nb_1b_2b_3\ldots{}b_m
\end{aligned}
$$

Note that this is not particularly surprising: `$a_1a_2$` is already a concatenation of two characters, i.e. of two words of length 1.

Sometimes, when we want to be more explicit about it, we will use the concatenation operator:

$$
xy \equiv x \cdot y\\\\
01 = 0 \cdot 1
$$

---

# String Repetition

The concatenation operator looks a lot like a multiplication dot (because it is). Just as with multiplication, we can define an "exponentiation" operation:

$$
\begin{aligned}
yy = y \cdot y =&\: y^2\\\\
yyy = y \cdot y \cdot y =&\: y^3\\\\
y\ldots{}y = y \cdot y \cdot \ldots \cdot y =&\: y^n
\end{aligned}
$$

$$
0^2 = 00\\\\
(01)^3 = 010101
$$

What about

$$
y^0 = \ldots?
$$

---

# The Empty String

We will use the symbol `$\varepsilon$` to refer to the empty string (the empty word). It has zero characters and, for any word y, the following properties:

$$
\begin{aligned}
y^0 =&\: \varepsilon \\\\
y\cdot \varepsilon =&\:y\\\\
\varepsilon \cdot y =&\:y\\\\
|\varepsilon| =&\: 0
\end{aligned}
$$
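
A small Python sketch of repetition, assuming words are modeled as strings and `$\varepsilon$` as the empty string `""` (the helper name `power` is hypothetical). Starting the loop from the empty word is exactly what makes `$y^0 = \varepsilon$` fall out:

```python
EPSILON = ""  # the empty word

def power(y: str, n: int) -> str:
    """Return y^n: the word y concatenated with itself n times."""
    result = EPSILON            # zero repetitions give the empty word
    for _ in range(n):
        result = result + y     # one more copy of y
    return result

assert power("0", 2) == "00"
assert power("01", 3) == "010101"
assert power("01", 0) == EPSILON
assert "01" + EPSILON == "01" == EPSILON + "01"
assert len(EPSILON) == 0
```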

---

# String Repetition

Let's look at our string repetition again:

$$
\begin{aligned}
yy = y \cdot y =&\: y^2\\\\
yyy = y \cdot y \cdot y =&\: y^3\\\\
y\ldots{}y = y \cdot y \cdot \ldots \cdot y =&\: y^n
\end{aligned}
$$

Often we want to say something like "any number of repetitions". For this we will use the notation:

$$
y^*
$$

You can think of this as the "*" being a placeholder for any (non-negative) number.

$$
0^* = \\{\varepsilon, 0, 00, 000, \ldots\\}
$$
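
Since `$y^*$` has infinitely many elements, a program can only list them lazily. A minimal sketch using a Python generator (the function name `star` is an assumption for this example):

```python
from itertools import count, islice
from typing import Iterator

def star(y: str) -> Iterator[str]:
    """Yield y^0, y^1, y^2, ... one at a time, without end."""
    for n in count(0):
        yield y * n   # for Python strings, y * n is repeated concatenation

# Take only the first four elements of 0^*
print(list(islice(star("0"), 4)))   # ['', '0', '00', '000']
```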

---

# String Reversal

The next operation we are looking at is the reversal of a string:

$$
\begin{aligned}
x =&\:a_1a_2a_3\ldots{}a_n\\\\
x^R =&\:a_n\ldots{}a_3a_2a_1\\\\
(x^R)^R =&\: x\\\\
(xy)^R =&\: y^Rx^R\\\\
a^R =&\: a\\\\
\varepsilon^R =&\: \varepsilon
\end{aligned}
$$

$$
(baa)^R = aab\\\\
(abba)^R = abba\\\\
(abc)^R = ((ab)\cdot c)^R = c^R \cdot (ab)^R = c^R \cdot b^R \cdot a^R = cba
$$
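
The rule `$(xy)^R = y^Rx^R$` can be read as a recursive recipe. The following Python sketch (with a hypothetical function `reverse`) splits off the last symbol, just like the `(abc)^R` derivation above:

```python
def reverse(x: str) -> str:
    """Return x^R using the rule (xy)^R = y^R x^R."""
    if len(x) <= 1:
        return x                    # epsilon^R = epsilon and a^R = a
    head, last = x[:-1], x[-1]      # x = head . last, where last is one symbol
    return reverse(last) + reverse(head)

assert reverse("baa") == "aab"
assert reverse("abba") == "abba"
assert reverse("abc") == "cba"
assert reverse(reverse("0110100")) == "0110100"   # (x^R)^R = x
```

In everyday Python one would simply write `x[::-1]`; the recursive version is only meant to mirror the algebraic rules.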

---

# Palindromes

A palindrome is a word that's the same when read backwards.

For example: tacocat

We can use our notation to define the set of all palindromes as:

$$
P = \\{x | x = x^R \\}
$$

or the set of palindromes **with even length**:

$$
P_e = \\{w | \exists x: w = xx^R \\}
$$
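
Both set definitions translate into membership tests. A sketch in Python, with made-up function names; for `$P_e$` we use that `$w = xx^R$` forces `$|w| = 2|x|$`, so the only candidate for x is the first half of w:

```python
def is_palindrome(x: str) -> bool:
    """Membership in P: x equals its own reversal."""
    return x == x[::-1]

def is_even_palindrome(w: str) -> bool:
    """Membership in P_e: is there an x with w = x x^R?"""
    if len(w) % 2 != 0:
        return False                # x x^R always has even length
    x = w[: len(w) // 2]            # the only possible choice for x
    return w == x + x[::-1]

assert is_palindrome("tacocat")
assert not is_even_palindrome("tacocat")   # odd length
assert is_even_palindrome("abba")
assert not is_palindrome("abab")
```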

---

# Who needs palindromes?

* Palindromes seem, at most, like a neat gimmick

* But why would we really care about them?

* Palindromes are a simplification of something that is very common in languages: matching

* For example, in a programming language you may need to make sure that opening and closing parentheses match

<center>
<a href="https://xkcd.com/859/"><img src="/CS3110/assets/img/openparen.png" width="40%"/> <br/>(Source)</a>
</center>

---

# Languages

---

# Languages

Now we are finally ready to define what a language is:

Well, that was easy. What's for dinner?

Oh, you want a more formal treatment?

---

# The Alphabet

Recall: An alphabet, usually denoted `$\Sigma$`, is a **set** of **symbols**

"Symbols" are just "words of length 1"

So our alphabet `$\Sigma = \{0,1\}$` already **is** a language

What about "words of length 2"?

Maybe we can use the idea of "concatenation" of strings, and apply it to **all** strings of two languages?

We want:

$$
\Sigma\cdot\Sigma = \\{00, 01, 10, 11\\}\\\\
\Sigma\cdot\Sigma\cdot\Sigma = \\{000, 001, 010, 100, 011, 101, 110, 111\\}\\\\
\vdots
$$

---

# Language Concatenation

Say we have two sets of words (two **languages**) X and Y. We define their concatenation:

$$
\begin{aligned}
XY = X \cdot Y =&\: \\{w | \exists s,t: (s\in X \wedge t \in Y \wedge w = st)\\}\\\\
XY = X \cdot Y =&\: \\{st | s\in X \wedge t \in Y\\}
\end{aligned}
$$

We can also concatenate words to a language:

$$
\begin{aligned}
Xy = X \cdot y =&\: \\{sy | s\in X\\} = X\cdot \\{y\\} \\\\
yX = y \cdot X =&\: \\{ys | s\in X\\} = \\{y\\} \cdot X
\end{aligned}
$$

---

# Language Concatenation

$$
\begin{aligned}
X =&\: \\{0,1\\}\\\\
Y =&\:\\{a,b,c\\}\\\\
Z =&\:\\{aba, bcb\\}\\\\
XY =&\: \\{0a, 0b, 0c, 1a, 1b, 1c\\}\\\\
YX =&\: \\{a0, a1, b0, b1, c0, c1\\}\\\\
XZ =&\: \\{0aba, 0bcb, 1aba, 1bcb\\}
\end{aligned}
$$

And:
$$
\begin{aligned}
1X =&\: \\{10, 11\\}\\\\
0Y1 =&\: \\{0a1, 0b1, 0c1\\}\\\\
cXXc =&\: \\{c00c, c01c, c10c, c11c\\}
\end{aligned}
$$
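
For finite languages, the definition `$XY = \{st | s\in X \wedge t \in Y\}$` is a one-line set comprehension. A Python sketch (sets of strings stand in for languages; `concat` is a name chosen for this example) that reproduces the results above:

```python
def concat(X: set[str], Y: set[str]) -> set[str]:
    """Return XY: every word of X followed by every word of Y."""
    return {s + t for s in X for t in Y}

X = {"0", "1"}
Y = {"a", "b", "c"}
Z = {"aba", "bcb"}

assert concat(X, Y) == {"0a", "0b", "0c", "1a", "1b", "1c"}
assert concat(Y, X) == {"a0", "a1", "b0", "b1", "c0", "c1"}
assert concat(X, Z) == {"0aba", "0bcb", "1aba", "1bcb"}

# A single word y is treated as the one-element language {y}
assert concat({"1"}, X) == {"10", "11"}
assert concat(concat({"c"}, concat(X, X)), {"c"}) == {"c00c", "c01c", "c10c", "c11c"}
```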

---

# Language Repetition

You may have guessed this already: We can use the same notation as before to repeat languages:

$$
\begin{aligned}
YY = Y \cdot Y =&\: Y^2\\\\
YYY = Y \cdot Y \cdot Y =&\: Y^3\\\\
Y\ldots{}Y = Y \cdot Y \cdot \ldots \cdot Y =&\: Y^n\\\\
Y^0 =&\: \\{\varepsilon\\}
\end{aligned}
$$

We will also use the same notation as before to denote "any" number of repetitions:

$$
Y^* = Y^0 \cup Y^1 \cup Y^2 \cup Y^3 \cup \cdots
$$
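
`$Y^n$` can be computed for finite Y, but `$Y^*$` is an infinite union, so a program can only build it up to some bound. A Python sketch under that assumption (the names `concat`, `power` and `star_up_to` are hypothetical):

```python
def concat(X: set[str], Y: set[str]) -> set[str]:
    """Return XY: every word of X followed by every word of Y."""
    return {s + t for s in X for t in Y}

def power(Y: set[str], n: int) -> set[str]:
    """Return Y^n, starting from Y^0 = {epsilon}."""
    result = {""}
    for _ in range(n):
        result = concat(result, Y)
    return result

def star_up_to(Y: set[str], max_n: int) -> set[str]:
    """Return Y^0 ∪ Y^1 ∪ ... ∪ Y^max_n, a finite part of Y^*."""
    result: set[str] = set()
    for n in range(max_n + 1):
        result |= power(Y, n)
    return result

SIGMA = {"0", "1"}
assert power(SIGMA, 0) == {""}
assert power(SIGMA, 2) == {"00", "01", "10", "11"}
assert len(power(SIGMA, 3)) == 8
assert star_up_to(SIGMA, 2) == {"", "0", "1", "00", "01", "10", "11"}
```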

---

# Languages and the Alphabet

* Let's take our alphabet `$\Sigma$`

* `$\Sigma$` is a set of symbols, which are also "one letter words"

* So we can concatenate `$\Sigma$`, for example to make `$\Sigma^2$`, the set of all two letter words

This leads us to the following definition:

---

# Languages

Let's take our alphabet `$\Sigma = \{0,1\}$`

Then:

$$
\Sigma^* = \\{\varepsilon, 0, 1, 00, 01, 10, 11, 000, 001, \ldots \\}
$$

There are, in fact, infinitely many elements in this set.

Any **subset** of this set is a language over `$\Sigma$`

For example: The set of all strings representing even binary numbers:

$$
E = \\{0, 10, 100, 110, \ldots\\}
$$

---

# Even binary numbers

Recall: `$\Sigma^*$` is infinite, and our language has to be a subset of it

But there is a neat "trick" we can do:

$$
E = \\{0\\} \cup 1\Sigma^*0
$$

Even numbers are either "0", or they start with a 1, followed by any combination of digits (even none), and end with a 0. That is exactly what this concatenation does!

Even though we "added" something to `$\Sigma^*$`, the result is **still** a subset of `$\Sigma^*$`.

If this idea of "infinity" seems confusing (or you're just curious), take a look at <a href="https://www.ias.edu/ideas/2016/pires-hilbert-hotel">Hilbert's Hotel</a>.
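
To see that `$\{0\} \cup 1\Sigma^*0$` really picks out the even binary numbers, we can test membership for all short words. A Python sketch (the helper names `in_E` and `words_up_to` are assumptions for this example):

```python
from itertools import product

SIGMA = ("0", "1")

def in_E(w: str) -> bool:
    """Membership in E = {0} ∪ 1 Σ* 0."""
    if w == "0":
        return True
    return (
        len(w) >= 2                       # at least the leading 1 and the final 0
        and w[0] == "1"
        and w[-1] == "0"
        and all(c in SIGMA for c in w)    # the middle part is any word of Σ*
    )

def words_up_to(max_len: int):
    """Yield every word of Σ* with length <= max_len, shortest first."""
    for n in range(max_len + 1):
        for symbols in product(SIGMA, repeat=n):
            yield "".join(symbols)

print([w for w in words_up_to(3) if in_E(w)])   # ['0', '10', '100', '110']
```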

---

# Some Interesting (?) Languages

* The set of all even binary numbers probably has few actual applications

* "The set of all valid email addresses" is a language over all ASCII (or Unicode?) characters

* "The set of all valid C++ programs" is a language over the "basic source character set"

* "The set of all bug-free Java programs" is another language over all Unicode characters

---

# Languages as Sets

Since languages are just sets, we can do all the things we can do with sets with them:

* Calculate the union or intersection of two languages: Which programs are valid in both C++ and Java (<a href="http://www.nyx.net/~gthompso/poly/polyglot.htm">"polyglots"</a>)

* Calculate the difference of two languages: Which Java programs are valid, but not bug-free (i.e. they have bugs)

* Calculate the complement of a language: What are all **invalid** email addresses
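
For small finite languages, these operations are just Python's built-in set operators. In this sketch the languages `A` and `B` are made up, and the complement is taken relative to all words of length at most 2 rather than the (infinite) `$\Sigma^*$`:

```python
from itertools import product

SIGMA = ("0", "1")

# All words over SIGMA of length <= 2, standing in for Σ*
UNIVERSE = {"".join(p) for n in range(3) for p in product(SIGMA, repeat=n)}

A = {"0", "00", "10"}
B = {"10", "11"}

assert A | B == {"0", "00", "10", "11"}          # union
assert A & B == {"10"}                           # intersection
assert A - B == {"0", "00"}                      # difference
assert UNIVERSE - A == {"", "1", "01", "11"}     # (bounded) complement
```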

---

# Programmatic View

* Using set notation we can define languages

* We will also look into other ways of defining languages

* But the actual goal is a different one: *Recognizing* valid words

* Seen as a program: We want to read a string and determine if it is a valid word in the language

* Ideally, we only need to read the string **once**, character by character (and save whatever information we may need)

* This is exactly where our automata will come in!
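
As a preview of this "read once, remember little" idea, here is a Python sketch that recognizes the even binary numbers `$\{0\} \cup 1\Sigma^*0$` in a single pass. The state names are invented for this example; the only information saved is which state we are in:

```python
def accepts_even_binary(word: str) -> bool:
    """Recognize {0} ∪ 1Σ*0 in one left-to-right pass."""
    state = "start"
    for symbol in word:
        if symbol not in ("0", "1"):
            return False                      # not a word over {0, 1}
        if state == "start":
            # first symbol: a lone 0, or the leading 1
            state = "zero" if symbol == "0" else "odd"
        elif state == "zero":
            state = "reject"                  # nothing may follow a leading 0
        elif state in ("odd", "even"):
            # only the most recent symbol matters from here on
            state = "even" if symbol == "0" else "odd"
        # in state "reject" we stay rejected
    return state in ("zero", "even")

assert accepts_even_binary("0")
assert accepts_even_binary("110")
assert not accepts_even_binary("101")
assert not accepts_even_binary("010")
assert not accepts_even_binary("")
```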

---

# Language Types

* As you may imagine, not all languages are created equal

* Checking if a binary number is even is probably very easy

* Checking if a Java-program has no bugs is very hard

* Checking if a program is valid C++ code is impossible (<a href="https://blog.reverberate.org/2013/08/parsing-c-is-literally-undecidable.html">literally</a>; warning: advanced content)

We will start with "easy" languages and work our way up. We'll also define what it means for a language to be "easy".

---

# Notational Conventions

* We use lower-case letters (a, b, c, x, y, z) to represent individual objects (symbols and words)

* Letters from the beginning of the alphabet (a, b, c) are usually our symbols (or sometimes digits: 0, 1, 2...)

* Letters from the end of the alphabet (x, y, z) are usually our words

* We use upper-case letters (A, B, C, X, Y) to represent sets, like languages

* Some special entities use Greek letters, notably `$\Sigma$` and `$\varepsilon$`
 
We will see exceptions to these <s>rules</s> guidelines, especially in "real" languages, which may **need** a letter like "y".

---

# One last thing