An alphabet, usually denoted Σ, is a set of symbols
Symbols are usually individual characters, or "tokens", depending on our application
To start with, we will use individual characters, like "a", "b", "c" or "0", "1"
Σ={0,1}
Once we have an alphabet, we can use it to make strings, or words.
A word is a sequence of symbols from our alphabet:
$x = a_1 a_2 a_3 \ldots a_n$
$y = b_1 b_2 b_3 \ldots b_m$
Words are always of finite length, but they can be arbitrarily long.
We say that words are chosen "over" an alphabet. For example "0", "00" and "0101" are all words over the alphabet Σ={0,1}.
Two strings/words are equal iff:
They have the same length
The characters at each position are the same
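$x = a_1 a_2 a_3 \ldots a_n$
$y = b_1 b_2 b_3 \ldots b_m$
$x = y \equiv n = m \land \forall i: a_i = b_i$
$0 = 0$
$01 = 01$
$\lnot(10 = 01)$
$\lnot(10 = 100) \equiv 10 \neq 100$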
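The two conditions for equality can be checked one after the other. The following is a minimal Python sketch, with words modelled as Python strings; the function name `words_equal` is made up for this illustration.

```python
def words_equal(x: str, y: str) -> bool:
    """Return True iff x and y have the same length and agree at every position."""
    if len(x) != len(y):  # condition 1: n = m
        return False
    # condition 2: for all i, a_i = b_i
    return all(a == b for a, b in zip(x, y))


print(words_equal("01", "01"))   # True
print(words_equal("10", "01"))   # False
print(words_equal("10", "100"))  # False
```

In practice Python's built-in `==` on strings performs the same comparison.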
Sometimes we want to talk about the length of a string, i.e. the number of characters it has:
$|x| = |a_1 a_2 a_3 \ldots a_n| = n$
With that we can also write our string equality as:
$x = y \equiv |x| = |y| \land \forall i: a_i = b_i$
$|a| = 1$
$|xy| = |x| + |y|$
$|101| = 3$
$|000| = 3$
We usually use the notation "xy" to mean the concatenation of two strings x and y:
$x = a_1 a_2 a_3 \ldots a_n$
$y = b_1 b_2 b_3 \ldots b_m$
$xy = a_1 a_2 a_3 \ldots a_n b_1 b_2 b_3 \ldots b_m$
Note that this is not particularly surprising, as $a_1 a_2$ is already a concatenation of two characters/words of length 1.
Sometimes, when we want to be more explicit about it, we will use the concatenation operator:
$xy \equiv x \cdot y$
$01 = 0 \cdot 1$
The concatenation operator looks a lot like a multiplication dot (because it is). Just as with multiplication, we can define an "exponentiation" operation:
$yy = y \cdot y = y^2$
$yyy = y \cdot y \cdot y = y^3$
$y \ldots y = y \cdot y \cdot \ldots \cdot y = y^n$
$0^2 = 00$
$(01)^3 = 010101$
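Concatenation and "exponentiation" map closely onto Python's string operators `+` and `*`. The sketch below assumes words are modelled as Python strings; the helper `power` is a hypothetical name that spells out the repeated concatenation.

```python
def power(y: str, n: int) -> str:
    """Concatenate n copies of y, one factor at a time."""
    result = ""
    for _ in range(n):
        result = result + y  # append one more factor y
    return result


x, y = "0", "1"
print(x + y)                          # 01
print(power("0", 2))                  # 00
print(power("01", 3))                 # 010101
print(power("01", 3) == "01" * 3)     # True
print(len(x + y) == len(x) + len(y))  # True
```

The last line checks the length rule for concatenated words on one example.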
What about $y^0 = \ldots$?
We will use the symbol ε to refer to the empty string/empty word. It has zero characters, and these properties, for any word y:
$y^0 = \varepsilon$
$y \cdot \varepsilon = y$
$\varepsilon \cdot y = y$
$|\varepsilon| = 0$
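These properties can be checked on an example, assuming words are modelled as Python strings so that ε corresponds to the empty string `""`. The constant name `EPSILON` is chosen for this sketch only.

```python
EPSILON = ""  # the empty word

y = "0101"
print(y * 0 == EPSILON)  # True: zero repetitions give the empty word
print(y + EPSILON == y)  # True: ε is neutral on the right
print(EPSILON + y == y)  # True: ε is neutral on the left
print(len(EPSILON))      # 0
```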
Let's look at our string repetition again:
$yy = y \cdot y = y^2$
$yyy = y \cdot y \cdot y = y^3$
$y \ldots y = y \cdot y \cdot \ldots \cdot y = y^n$
Often we want to say something like "any number of repetitions". For this we will use the notation:
$y^*$
You can think of this as the "*" being a placeholder for any (non-negative) number.
$0^* = \{\varepsilon, 0, 00, 000, \ldots\}$
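Since "any number of repetitions" describes infinitely many words, a program can only list them lazily. The following sketch uses a generator for this; the name `repetitions` is hypothetical and words are again modelled as Python strings.

```python
from itertools import count, islice


def repetitions(y: str):
    """Yield y^0, y^1, y^2, ... without ever stopping."""
    for n in count(0):
        yield y * n


# Take only the first four words of the infinite sequence.
print(list(islice(repetitions("0"), 4)))  # ['', '0', '00', '000']
```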
The next operation we are looking at is the reversal of a string:
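$x = a_1 a_2 a_3 \ldots a_n$
$x^R = a_n \ldots a_3 a_2 a_1$
$(x^R)^R = x$
$(xy)^R = y^R x^R$
$a^R = a$
$\varepsilon^R = \varepsilon$
$(baa)^R = aab$
$(abba)^R = abba$
$(abc)^R = ((ab) \cdot c)^R = c^R \cdot (ab)^R = c^R \cdot b^R \cdot a^R = cba$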
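The reversal rules can be tried out with Python's slice notation, which is one possible way to reverse a string. The function name `reverse` is an assumption made for this sketch.

```python
def reverse(x: str) -> str:
    """Return x^R, the word x read backwards."""
    return x[::-1]


x, y = "ab", "c"
print(reverse("baa"))                              # aab
print(reverse("abba"))                             # abba
print(reverse(reverse(x)) == x)                    # True
print(reverse(x + y) == reverse(y) + reverse(x))   # True
print(reverse("") == "")                           # True
```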
A palindrome is a word that's the same when read backwards.
For example: tacocat
We can use our notation to define the set of all palindromes as:
$P = \{x \mid x = x^R\}$
or the set of palindromes with even length:
$P_e = \{w \mid \exists x: w = x x^R\}$
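Both set definitions translate into membership tests. This is a sketch under the assumption that words are Python strings; `is_palindrome` and `is_even_palindrome` are hypothetical names. For the even case, the only candidate for x is the first half of w, so the existential quantifier needs no search.

```python
def is_palindrome(x: str) -> bool:
    """Membership test for P: x equals its own reversal."""
    return x == x[::-1]


def is_even_palindrome(w: str) -> bool:
    """Membership test for P_e: w = x x^R for some word x."""
    if len(w) % 2 != 0:
        return False
    x = w[: len(w) // 2]  # the only possible choice for x
    return w == x + x[::-1]


print(is_palindrome("tacocat"))       # True
print(is_even_palindrome("tacocat"))  # False (odd length)
print(is_even_palindrome("abba"))     # True
```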
Palindromes seem, at most, like a neat gimmick
But why would we really care about them?
Palindromes are a simplification of something that is very common in languages: matching
For example, in a programming language you may need to make sure that opening and closing parentheses match
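To make the matching problem concrete, here is one common way to check parentheses with a counter. It is only a sketch for a single kind of bracket, and the function name `parens_match` is made up for this example.

```python
def parens_match(s: str) -> bool:
    """Return True iff every '(' has a later ')' and vice versa."""
    depth = 0  # number of currently open parentheses
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' without a matching '('
                return False
    return depth == 0      # nothing may remain open


print(parens_match("(a(b)c)"))  # True
print(parens_match("(()"))      # False
print(parens_match(")("))       # False
```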
Now we are finally ready to define what a language is: a language is a set of words.
Well, that was easy. What's for dinner?
Oh, you want a more formal treatment?
Recall: An alphabet, usually denoted Σ, is a set of symbols
"Symbols" are just "words of length 1"
So our alphabet Σ={0,1} already is a language
What about "words of length 2"?
Maybe we can use the idea of "concatenation" of strings, and apply it to all strings of two languages?
We want:
$\Sigma \cdot \Sigma = \{00, 01, 10, 11\}$
$\Sigma \cdot \Sigma \cdot \Sigma = \{000, 001, 010, 100, 011, 101, 110, 111\}$
$\vdots$
Say we have two sets of words (two languages) X and Y. We define their concatenation:
$XY = X \cdot Y = \{w \mid \exists s, t: (s \in X \land t \in Y \land w = st)\}$
$XY = X \cdot Y = \{st \mid s \in X \land t \in Y\}$
We can also concatenate words to a language:
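$Xy = X \cdot y = \{sy \mid s \in X\} = X \cdot \{y\}$
$yX = y \cdot X = \{ys \mid s \in X\} = \{y\} \cdot X$
$X = \{0, 1\}$
$Y = \{a, b, c\}$
$Z = \{aba, bcb\}$
$XY = \{0a, 0b, 0c, 1a, 1b, 1c\}$
$YX = \{a0, a1, b0, b1, c0, c1\}$
$XZ = \{0aba, 0bcb, 1aba, 1bcb\}$
And:
$1X = \{10, 11\}$
$0Y1 = \{0a1, 0b1, 0c1\}$
$cXXc = \{c00c, c01c, c10c, c11c\}$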
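The second form of the definition, {st | s∈X ∧ t∈Y}, reads almost like a Python set comprehension. The sketch below models finite languages as Python sets of strings; the function name `concat` is an assumption for this illustration, and a single word is concatenated by wrapping it in a one-element set.

```python
def concat(X: set[str], Y: set[str]) -> set[str]:
    """Return the language X·Y = {st | s in X and t in Y}."""
    return {s + t for s in X for t in Y}


X = {"0", "1"}
Y = {"a", "b", "c"}
Z = {"aba", "bcb"}

print(sorted(concat(X, Y)))    # ['0a', '0b', '0c', '1a', '1b', '1c']
print(sorted(concat(X, Z)))    # ['0aba', '0bcb', '1aba', '1bcb']
print(sorted(concat({"1"}, X)))  # ['10', '11']

# cXXc, built step by step from left to right
cXXc = concat(concat(concat({"c"}, X), X), {"c"})
print(sorted(cXXc))            # ['c00c', 'c01c', 'c10c', 'c11c']
```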
You may have guessed this already: We can use the same notation as before to repeat languages:
$YY = Y \cdot Y = Y^2$
$YYY = Y \cdot Y \cdot Y = Y^3$
$Y \ldots Y = Y \cdot Y \cdot \ldots \cdot Y = Y^n$
$Y^0 = \{\varepsilon\}$
We will also use the same notation as before to denote "any" number of repetitions:
$Y^* = Y^0 \cup Y^1 \cup Y^2 \cup Y^3 \cup \cdots$
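Powers of a language can be computed by repeated concatenation, starting from {ε}. The full union is infinite for any language containing a non-empty word, so this sketch only builds the union up to a chosen exponent. The names `concat`, `power` and `star_up_to` are hypothetical, and languages are modelled as Python sets of strings.

```python
def concat(X: set[str], Y: set[str]) -> set[str]:
    """Return X·Y = {st | s in X and t in Y}."""
    return {s + t for s in X for t in Y}


def power(Y: set[str], n: int) -> set[str]:
    """Return Y^n, starting from Y^0 = {ε}."""
    result = {""}
    for _ in range(n):
        result = concat(result, Y)
    return result


def star_up_to(Y: set[str], max_n: int) -> set[str]:
    """Return Y^0 ∪ Y^1 ∪ ... ∪ Y^max_n, a finite part of Y*."""
    result = set()
    for n in range(max_n + 1):
        result |= power(Y, n)
    return result


SIGMA = {"0", "1"}
print(power(SIGMA, 0))          # {''}
print(sorted(power(SIGMA, 2)))  # ['00', '01', '10', '11']
print(sorted(star_up_to(SIGMA, 2), key=lambda w: (len(w), w)))
# ['', '0', '1', '00', '01', '10', '11']
```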
Let's take our alphabet $\Sigma$
$\Sigma$ is a set of symbols, which are also "one letter words"
So we can concatenate $\Sigma$, for example to make $\Sigma^2$, the set of all two letter words
This leads us to the following definition:
A language over Σ is a subset of Σ∗
Let's take our alphabet Σ={0,1}
Then:
Σ∗={ε,0,1,00,01,10,11,000,001,…}
There are, in fact, infinitely many elements in this set.
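Because the set is infinite, a program can only enumerate it lazily. The sketch below lists the words by increasing length using `itertools.product`; the generator name `sigma_star` is hypothetical, and the alphabet is passed as a string so that the symbol order is fixed.

```python
from itertools import count, islice, product


def sigma_star(sigma: str):
    """Yield every word over the alphabet, shortest words first."""
    for n in count(0):                            # word length 0, 1, 2, ...
        for letters in product(sigma, repeat=n):  # all words of length n
            yield "".join(letters)


print(list(islice(sigma_star("01"), 9)))
# ['', '0', '1', '00', '01', '10', '11', '000', '001']
```

Every word appears after finitely many steps, even though the enumeration as a whole never ends.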
Any subset of this set is a language over Σ
For example: The set of all strings representing even binary numbers:
E={0,10,100,110,…}
Recall: Σ∗ is infinite, our language has to be a subset
But there is a neat "trick" we can do:
E={0}∪1Σ∗0
Even numbers are either "0", or they start with a 1, followed by any combination of digits (even none), and end with a 0. That is exactly what this concatenation does!
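The description of E can be turned into a membership test and compared against ordinary arithmetic. This is a sketch with a made-up function name, `in_E`; it treats words with leading zeros such as "010" as not belonging to E, which is what the definition {0} ∪ 1Σ∗0 implies.

```python
def in_E(w: str) -> bool:
    """Membership test for E = {0} ∪ 1Σ*0 over Σ = {0, 1}."""
    if any(ch not in "01" for ch in w):
        return False              # not a word over Σ at all
    if w == "0":
        return True               # the {0} part
    # the 1Σ*0 part: starts with 1, ends with 0, anything in between
    return len(w) >= 2 and w[0] == "1" and w[-1] == "0"


# Compare with the parity of the number for 0..15
print(all(in_E(bin(n)[2:]) == (n % 2 == 0) for n in range(16)))  # True
print(in_E("010"))  # False: leading zero
```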
Even though we "added" something to Σ∗, the result is still a subset of Σ∗.
If this idea of "infinity" seems confusing (or you're just curious), take a look at Hilbert's Hotel.
The set of all even binary numbers probably has few actual applications
"The set of all valid email addresses", is a language over all ASCII (or Unicode?) characters
"The set of all valid C++ programs" is a language over the "basic source character set"
"The set of all bug-free Java programs" is another language over all Unicode characters
Since languages are just sets, we can do all the things we can do with sets with them:
Calculate the union or intersection of two languages: Which programs are valid in both C++ and Java ("polyglots")
Calculate the difference of two languages: Which Java programs are valid, but not bug-free (i.e. they have bugs)
Calculate the complement of a language: What are all invalid email addresses
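The same operations can be tried on small toy languages. In the sketch below, the languages `A` and `B` are invented for illustration, and the complement is taken relative to a truncated universe (all words over {0,1} of length at most 2), since the full Σ∗ cannot be stored in a set.

```python
from itertools import product

SIGMA = "01"
# Truncated stand-in for Σ*: all words of length 0, 1 and 2
universe = {"".join(p) for n in range(3) for p in product(SIGMA, repeat=n)}

A = {w for w in universe if w.endswith("0")}    # words ending in 0
B = {w for w in universe if w.startswith("1")}  # words starting with 1

print(sorted(A | B))         # union:        ['0', '00', '1', '10', '11']
print(sorted(A & B))         # intersection: ['10']
print(sorted(A - B))         # difference:   ['0', '00']
print(sorted(universe - A))  # complement:   ['', '01', '1', '11']
```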
Using set notation we can define languages
We will also look into other ways of defining languages
But the actual goal is a different one: Recognizing valid words
Seen as a program: We want to read a string and determine if it is a valid word in the language
Ideally, we only need to read the string once, character by character (and save whatever information we may need)
This is exactly where our automata will come in!
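As a preview of the idea, here is a sketch of a recognizer for the even binary numbers E = {0} ∪ 1Σ∗0 that reads its input once, character by character, and remembers only a single state. The state names and the function name `accepts_even_binary` are assumptions made for this example, not a formal automaton definition.

```python
def accepts_even_binary(w: str) -> bool:
    """Single left-to-right pass; the only memory is the current state."""
    state = "start"
    for ch in w:
        if ch not in "01":
            return False                 # not a word over {0, 1}
        if state == "start":
            state = "zero" if ch == "0" else "odd"
        elif state == "zero":
            return False                 # nothing may follow a leading 0
        else:                            # word starts with 1
            state = "even" if ch == "0" else "odd"
    return state in ("zero", "even")     # accept "0" or 1...0


print(accepts_even_binary("0"))     # True
print(accepts_even_binary("110"))   # True
print(accepts_even_binary("101"))   # False
print(accepts_even_binary("010"))   # False
```

The amount of information saved does not grow with the length of the input, which is what makes this language easy to recognize.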
As you may imagine, not all languages are created equal
Checking if a binary number is even is probably very easy
Checking if a Java-program has no bugs is very hard
Checking if a program is valid C++ code is impossible (literally; warning: advanced content)
We will start with "easy" languages and work our way up. We'll also define what it means for a language to be "easy".
We use lower-case letters (a, b, c, x, y, z) to represent individuals
Letters from the beginning of the alphabet (a, b, c) are usually our symbols (or sometimes digits: 0, 1, 2...)
Letters from the end of the alphabet (x, y, z) are usually our words
We use upper-case letters (A, B, C, X, Y) to represent sets, like languages
Some special entities use Greek letters, notably Σ and ε
We will see exceptions to these guidelines, especially in "real" languages, which may need a letter like "y".