CS-3110: Formal Languages and Automata

Languages

Chapter 3.1

1 / 29

Alphabets

  • An alphabet, usually denoted Σ, is a set of symbols

  • Symbols are usually individual characters, or "tokens", depending on our application

  • To start with, we will use individual characters, like "a", "b", "c" or "0", "1"

Σ={0,1}

2 / 29

Strings

Once we have an alphabet, we can use it to make strings, or words.

A word is a sequence of symbols from our alphabet:

$$x = a_1 a_2 a_3 \ldots a_n$$

$$y = b_1 b_2 b_3 \ldots b_m$$

Words are always of finite length, but they can be arbitrarily long.

We say that words are chosen "over" an alphabet. For example, "0", "00" and "0101" are all words over the alphabet Σ={0,1}.
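As a minimal sketch of what "over an alphabet" means, the check below models a word as a Python `str` and an alphabet as a set of one-character strings. The names `is_word_over` and `SIGMA` are made up for this illustration.

```python
def is_word_over(word: str, alphabet: set[str]) -> bool:
    """Return True if every symbol of `word` belongs to `alphabet`."""
    return all(symbol in alphabet for symbol in word)


SIGMA = {"0", "1"}  # the example alphabet from the slide

print(is_word_over("0101", SIGMA))  # True
print(is_word_over("012", SIGMA))   # False: "2" is not in the alphabet
```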

3 / 29

String Equality

Two strings/words are equal iff:

  • They have the same length

  • The characters at each position are the same

$$x = a_1 a_2 a_3 \ldots a_n \qquad y = b_1 b_2 b_3 \ldots b_m$$

$$x = y \Leftrightarrow n = m \land \forall i: a_i = b_i$$

$$0 = 0$$

$$01 = 01$$

$$\neg(10 = 01)$$

$$\neg(10 = 100)$$

$$10 \neq 100$$
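The two conditions can be spelled out directly in code. This is only a sketch that mirrors the definition; in practice Python's built-in `==` on strings already does the same thing. The function name `words_equal` is an assumption for this example.

```python
def words_equal(x: str, y: str) -> bool:
    """Compare two words using the two conditions from the definition."""
    # Condition 1: both words have the same length.
    if len(x) != len(y):
        return False
    # Condition 2: the symbols at each position agree.
    return all(a == b for a, b in zip(x, y))


print(words_equal("01", "01"))   # True
print(words_equal("10", "01"))   # False: same length, different symbols
print(words_equal("10", "100"))  # False: different lengths
```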

4 / 29

String Length

Sometimes we want to talk about the length of a string, i.e. the number of characters it has:

$$|x| = |a_1 a_2 a_3 \ldots a_n| = n$$

With that we can also write our string equality as:

$$x = y \Leftrightarrow |x| = |y| \land \forall i: a_i = b_i$$

$$|a| = 1$$

$$|xy| = |x| + |y|$$

$$|101| = 3$$

$$|000| = 3$$

5 / 29

String Concatenation

We usually use the notation "xy" to mean the concatenation of two strings x and y:

$$x = a_1 a_2 a_3 \ldots a_n$$

$$y = b_1 b_2 b_3 \ldots b_m$$

$$xy = a_1 a_2 a_3 \ldots a_n b_1 b_2 b_3 \ldots b_m$$

Note that this is not particularly surprising, as $a_1 a_2$ is already a concatenation of two characters, i.e. two words of length 1.

Sometimes, when we want to be more explicit about it, we will use the concatenation operator:

$$x \cdot y = xy$$

$$0 \cdot 1 = 01$$

6 / 29

String Repetition

The concatenation operator looks a lot like a multiplication dot (because it is). Just as with multiplication, we can define an "exponentiation" operation:

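$$y \cdot y = yy = y^2$$

$$y \cdot y \cdot y = yyy = y^3$$

$$\underbrace{y \cdot \ldots \cdot y}_{n} = \underbrace{yy \ldots y}_{n} = y^n$$

$$0^2 = 00$$

$$(01)^3 = 010101$$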

7 / 29

What about

$$y^0 = \ldots?$$

8 / 29

The Empty String

We will use the symbol ε to refer to the empty string/empty word. It has zero characters, and these properties, for any word y:

$$y^0 = \varepsilon$$

$$y \cdot \varepsilon = y$$

$$\varepsilon \cdot y = y$$

$$|\varepsilon| = 0$$
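A small sketch may help to see why zero repetitions should give the empty word: if we build a power by starting from "nothing" and appending y once per repetition, the starting value is what remains when there are no repetitions. Here words are modelled as Python strings and ε as `""`; the names `power` and `EPSILON` are assumptions for this example.

```python
EPSILON = ""  # the empty word, modelled as the empty Python string


def power(y: str, n: int) -> str:
    """Concatenate n copies of y; zero copies leave the empty word."""
    result = EPSILON
    for _ in range(n):
        result = result + y
    return result


print(power("0", 2))              # 00
print(power("01", 3))             # 010101
print(power("01", 0) == EPSILON)  # True
print(len(EPSILON))               # 0
```

Starting from ε works because ε is neutral for concatenation, much like 1 is neutral for multiplication.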

9 / 29

String Repetition

Let's look at our string repetition again:

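$$y \cdot y = yy = y^2$$

$$y \cdot y \cdot y = yyy = y^3$$

$$\underbrace{y \cdot \ldots \cdot y}_{n} = \underbrace{yy \ldots y}_{n} = y^n$$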

Often we want to say something like "any number of repetitions". For this we will use the notation:

$$y^*$$

You can think of this as the "*" being a placeholder for any (non-negative) number.

$$0^* = \{\varepsilon, 0, 00, 000, \ldots\}$$
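Since "any number of repetitions" describes infinitely many words, a program cannot list them all, but it can produce them one at a time. The sketch below uses a Python generator for this; the name `star` is hypothetical, and Python's built-in `y * n` on strings is used as string repetition.

```python
from itertools import count, islice


def star(y: str):
    """Yield y^0, y^1, y^2, ... one at a time, without ever stopping."""
    for n in count():  # n = 0, 1, 2, ...
        yield y * n    # n copies of y; zero copies give ""


# Take only the first four words of the infinite sequence.
print(list(islice(star("0"), 4)))  # ['', '0', '00', '000']
```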

10 / 29

String Reversal

The next operation we are looking at is the reversal of a string:

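$$x = a_1 a_2 a_3 \ldots a_n$$

$$x^R = a_n \ldots a_3 a_2 a_1$$

$$(x^R)^R = x$$

$$(xy)^R = y^R x^R$$

$$a^R = a$$

$$\varepsilon^R = \varepsilon$$

$$(baa)^R = aab$$

$$(abba)^R = abba$$

$$(abc)^R = ((ab)c)^R = c^R (ab)^R = c^R b^R a^R = cba$$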
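The reversal rules are enough to compute a reversal recursively. The following sketch splits a word into its first symbol and the rest and applies the rule that the reversal of a concatenation is the concatenation of the reversals in swapped order. The function name `reverse` is an assumption; idiomatic Python would simply write `x[::-1]`.

```python
def reverse(x: str) -> str:
    """Reverse a word using only the algebraic rules for reversal."""
    if len(x) <= 1:
        # The empty word and single symbols are their own reversal.
        return x
    head, tail = x[0], x[1:]
    # (head . tail)^R = tail^R . head^R, and head^R = head
    return reverse(tail) + head


print(reverse("baa"))   # aab
print(reverse("abba"))  # abba
print(reverse("abc"))   # cba
```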

11 / 29

Palindromes

A palindrome is a word that's the same when read backwards.

For example: tacocat

We can use our notation to define the set of all palindromes as:

$$P = \{x \mid x = x^R\}$$

or the set of palindromes with even length:

$$P_e = \{w \mid \exists x: w = x x^R\}$$
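Both set definitions translate into simple membership tests. In this sketch, `is_palindrome` and `is_even_palindrome` are made-up names. For the even case, the only candidate for x is the first half of w, so the test checks whether the second half is the reversal of the first.

```python
def is_palindrome(x: str) -> bool:
    """Membership test for P: x equals its own reversal."""
    return x == x[::-1]


def is_even_palindrome(w: str) -> bool:
    """Membership test for P_e: w = x x^R for some word x."""
    half, remainder = divmod(len(w), 2)
    if remainder != 0:
        return False        # x x^R always has even length
    x = w[:half]            # the only possible choice for x
    return w[half:] == x[::-1]


print(is_palindrome("tacocat"))       # True
print(is_even_palindrome("tacocat"))  # False: odd length
print(is_even_palindrome("abba"))     # True, with x = "ab"
```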

12 / 29

Who needs palindromes?

  • Palindromes seem, at most, like a neat gimmick

  • But why would we really care about them?

  • Palindromes are a simplification of something that is very common in languages: matching

  • For example, in a programming language you may need to make sure that opening and closing parentheses match
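As a rough illustration of the matching problem, the sketch below checks whether the parentheses in a text match. It assumes only one kind of bracket, "(" and ")", and ignores every other character; real programming languages have more bracket types and more rules. The function name `parentheses_match` is hypothetical.

```python
def parentheses_match(text: str) -> bool:
    """Check that every "(" has a later ")" and vice versa."""
    depth = 0  # number of currently open parentheses
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ")" appeared before its "("
    return depth == 0         # nothing may be left open


print(parentheses_match("f(a, g(b))"))  # True
print(parentheses_match("f(a, g(b)"))   # False
print(parentheses_match(")("))          # False
```

Note that the counter can grow without bound, which hints at why matching needs more than a fixed, finite amount of memory.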


13 / 29

Languages

14 / 29

Languages

Now we are finally ready to define what a language is:

A language is a set of words

Well, that was easy. What's for dinner?

15 / 29

Oh, you want a more formal treatment?

16 / 29

The Alphabet

Recall: An alphabet, usually denoted Σ, is a set of symbols

"Symbols" are just "words of length 1"

So our alphabet Σ={0,1} already is a language

What about "words of length 2"?

Maybe we can use the idea of "concatenation" of strings, and apply it to all strings of two languages?

We want:

$$\Sigma \cdot \Sigma = \{00, 01, 10, 11\}$$

$$\Sigma \cdot \Sigma \cdot \Sigma = \{000, 001, 010, 100, 011, 101, 110, 111\}$$

17 / 29

Language Concatenation

Say we have two sets of words (two languages) X and Y. We define their concatenation:

$$X \cdot Y = XY = \{w \mid \exists s, t: (s \in X \land t \in Y \land w = st)\}$$

$$X \cdot Y = XY = \{st \mid s \in X \land t \in Y\}$$

We can also concatenate words to a language:

$$X \cdot y = Xy = \{sy \mid s \in X\} = X \cdot \{y\}$$

$$y \cdot X = yX = \{ys \mid s \in X\} = \{y\} \cdot X$$

18 / 29

Language Concatenation

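$$X = \{0, 1\}$$

$$Y = \{a, b, c\}$$

$$Z = \{aba, bcb\}$$

$$XY = \{0a, 0b, 0c, 1a, 1b, 1c\}$$

$$YX = \{a0, a1, b0, b1, c0, c1\}$$

$$XZ = \{0aba, 0bcb, 1aba, 1bcb\}$$

And:

$$1X = \{10, 11\}$$

$$0Y1 = \{0a1, 0b1, 0c1\}$$

$$cXXc = \{c00c, c01c, c10c, c11c\}$$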
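For finite languages, the definition of concatenation is a one-line set comprehension. In this sketch, languages are Python sets of strings and `concat` is a made-up helper name. A single word is concatenated to a language by first wrapping it in a one-element set.

```python
def concat(X: set[str], Y: set[str]) -> set[str]:
    """All words st with s taken from X and t taken from Y."""
    return {s + t for s in X for t in Y}


X = {"0", "1"}
Y = {"a", "b", "c"}
Z = {"aba", "bcb"}

print(sorted(concat(X, Y)))      # ['0a', '0b', '0c', '1a', '1b', '1c']
print(sorted(concat(X, Z)))      # ['0aba', '0bcb', '1aba', '1bcb']
print(sorted(concat({"1"}, X)))  # ['10', '11']
```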

19 / 29

Language Repetition

You may have guessed this already: We can use the same notation as before to repeat languages:

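$$Y \cdot Y = YY = Y^2$$

$$Y \cdot Y \cdot Y = YYY = Y^3$$

$$\underbrace{Y \cdot \ldots \cdot Y}_{n} = \underbrace{YY \ldots Y}_{n} = Y^n$$

$$Y^0 = \{\varepsilon\}$$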

We will also use the same notation as before to denote "any" number of repetitions:

$$Y^* = Y^0 \cup Y^1 \cup Y^2 \cup Y^3 \cup \ldots$$
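Language powers can be computed for finite languages, while the union over all powers is infinite in general. The sketch below therefore only approximates it by stopping at a chosen maximum power. Both `lang_power` and `star_up_to` are hypothetical helper names, and the cutoff `max_n` is an assumption of this example, not part of the definition.

```python
def lang_power(Y: set[str], n: int) -> set[str]:
    """Concatenate the language Y with itself n times."""
    result = {""}  # zero repetitions: the set containing only the empty word
    for _ in range(n):
        result = {s + t for s in result for t in Y}
    return result


def star_up_to(Y: set[str], max_n: int) -> set[str]:
    """Union of the powers of Y from 0 up to max_n (a finite cutoff)."""
    words: set[str] = set()
    for n in range(max_n + 1):
        words |= lang_power(Y, n)
    return words


print(sorted(lang_power({"0", "1"}, 2)))  # ['00', '01', '10', '11']
print(lang_power({"a", "b"}, 0) == {""})  # True
print(len(star_up_to({"0", "1"}, 3)))     # 15
```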

20 / 29

Languages and the Alphabet

  • Let's take our alphabet Σ

  • Σ is a set of symbols, which are also "one-letter words"

  • So we can concatenate Σ with itself, for example to make $\Sigma^2$, the set of all two-letter words

This leads us to the following definition:

A language $L$ over an alphabet $\Sigma$ is a subset of $\Sigma^*$

21 / 29

Languages

Let's take our alphabet Σ={0,1}

Then:

$$\Sigma^* = \{\varepsilon, 0, 1, 00, 01, 10, 11, 000, 001, \ldots\}$$

There are, in fact, infinitely many elements in this set.

Any subset of this set is a language over Σ

For example: The set of all strings representing even binary numbers:

$$E = \{0, 10, 100, 110, \ldots\}$$

22 / 29

Even binary numbers

Recall: $\Sigma^*$ is infinite, and our language has to be a subset of it

But there is a neat "trick" we can do:

$$E = \{0\} \cup 1\Sigma^*0$$

Even numbers are either "0", or they start with a 1, followed by any combination of digits (even none), and end with a 0. That is exactly what this concatenation does!

Even though we "added" something to $\Sigma^*$, the result is still a subset of $\Sigma^*$.
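To see the construction at work, the sketch below lists all words of E up to a chosen length by building "1", then a middle part, then "0" for every possible middle part. The name `even_words` and the length bound `max_len` are assumptions for this example; the language itself has no such bound.

```python
from itertools import product


def even_words(max_len: int) -> set[str]:
    """Words of E = {0} united with 1 Sigma* 0, up to length max_len."""
    words = {"0"}
    # k is the length of the middle part taken from Sigma*.
    for k in range(max_len - 1):
        for middle in product("01", repeat=k):
            words.add("1" + "".join(middle) + "0")
    return words


print(sorted(even_words(3)))  # ['0', '10', '100', '110']
# Every generated word is the binary representation of an even number.
print(all(int(w, 2) % 2 == 0 for w in even_words(6)))  # True
```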

If this idea of "infinity" seems confusing (or you're just curious), take a look at Hilbert's Hotel.

23 / 29

Some Interesting (?) Languages

  • The set of all even binary numbers probably has few actual applications

  • "The set of all valid email addresses" is a language over all ASCII (or Unicode?) characters

  • "The set of all valid C++ programs" is a language over the "basic source character set"

  • "The set of all bug-free Java programs" is another language over all Unicode characters

24 / 29

Languages as Sets

Since languages are just sets, we can do all the things we can do with sets with them:

  • Calculate the union or intersection of two languages: Which programs are valid in both C++ and Java ("polyglots")

  • Calculate the difference of two languages: Which Java programs are valid, but not bug-free (i.e. they have bugs)

  • Calculate the complement of a language: What are all invalid email addresses
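The languages in these examples are far too large to compute with, so the sketch below uses two small stand-ins: even binary numbers and binary palindromes, both restricted to words of length at most 3. This length bound and the names `UNIVERSE`, `EVEN` and `PALINDROMES` are assumptions made purely for illustration. The complement is taken relative to the bounded universe.

```python
from itertools import product

# All words over {0, 1} of length 0 to 3 (a finite stand-in for Sigma*).
UNIVERSE = {"".join(p) for n in range(4) for p in product("01", repeat=n)}

EVEN = {w for w in UNIVERSE
        if w == "0" or (len(w) >= 2 and w[0] == "1" and w[-1] == "0")}
PALINDROMES = {w for w in UNIVERSE if w == w[::-1]}

print(sorted(EVEN & PALINDROMES))  # intersection: ['0']
print(sorted(EVEN - PALINDROMES))  # difference: ['10', '100', '110']
print(len(EVEN | PALINDROMES))     # size of the union: 12
print(len(UNIVERSE - EVEN))        # size of the complement of EVEN: 11
```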

25 / 29

Programmatic View

  • Using set notation we can define languages

  • We will also look into other ways of defining languages

  • But the actual goal is a different one: Recognizing valid words

  • Seen as a program: We want to read a string and determine if it is a valid word in the language

  • Ideally, we only need to read the string once, character by character (and save whatever information we may need)

  • This is exactly where our automata will come in!
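As a preview of this idea, here is a hedged sketch of a recognizer for the even binary numbers (either "0", or a word that starts with 1 and ends with 0). It reads the input once, character by character, and remembers only a single state name between characters. The state names and the function name `recognize_even` are invented for this example.

```python
def recognize_even(text: str) -> bool:
    """Read text once, left to right, keeping only one state in memory."""
    state = "start"
    for ch in text:
        if ch not in "01":
            return False                 # not a word over {0, 1}
        if state == "start":
            state = "zero" if ch == "0" else "odd"
        elif state == "zero":
            state = "reject"             # "0" followed by more digits
        elif state in ("odd", "even"):
            state = "even" if ch == "0" else "odd"
        # once in "reject", the state never changes again
    return state in ("zero", "even")


print(recognize_even("0"))    # True
print(recognize_even("110"))  # True
print(recognize_even("101"))  # False: ends in 1
print(recognize_even("010"))  # False: leading zero
```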

26 / 29

Language Types

  • As you may imagine, not all languages are created equal

  • Checking if a binary number is even is probably very easy

  • Checking if a Java-program has no bugs is very hard

  • Checking if a program is valid C++ code is impossible (literally; warning: advanced content)

We will start with "easy" languages and work our way up. We'll also define what it means for a language to be "easy".

27 / 29

Notational Conventions

  • We use lower-case letters (a, b, c, x, y, z) to represent individuals

  • Letters from the beginning of the alphabet (a, b, c) are usually our symbols (or sometimes digits: 0, 1, 2...)

  • Letters from the end of the alphabet (x, y, z) are usually our words

  • We use upper-case letters (A, B, C, X, Y) to represent sets, like languages

  • Some special entities use Greek letters, notably Σ and ε

We will see exceptions to these guidelines, especially in "real" languages, which may need a letter like "y".

28 / 29

One last thing

29 / 29
