# Specification of Tokens in Compiler Design

## Specification of Tokens

Specification of tokens depends on the pattern of the lexeme. Here we will be using regular expressions to specify the different types of patterns that can actually form tokens.Although the regular expressions are inefficient in specifying all the patterns forming tokens. Yet it reveals almost all types of pattern that forms a token.

- String
- Language
- Regular Expression

### 1. String

- An
**alphabet**or character class is a finite set of symbols. - A
**string**over an alphabet is a finite sequence of symbols drawn from that alphabet. - In language theory, the term "word" is often used as synonyms for "string".
- The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, "banana" is a string of length six.
- The empty string, denoted
**ε**, is the string of length zero.

#### Operations on String:

**1. Prefix of String**

The prefix of the string is the preceding symbols present in the string and the string(s) itself.

*For Example:*s = abcd

The prefix of the string abcd: ∈, a, ab, abc, abcd

**2. Suffix of String**

Suffix of the string is the ending symbols of the string and the string(s) itself.

*For Example:*s = abcd

Suffix of the string abcd: ∈, d, cd, bcd, abcd

**3. Proper Prefix of String**The proper prefix of the string includes all the prefixes of the string excluding ∈ and the string(s) itself.

Proper Prefix of the string abcd: a, ab, abc

**4. Proper Suffix of String**

The proper suffix of the string includes all the suffixes excluding ∈ and the string(s) itself.

Proper Suffix of the string abcd: d, cd, bcd

**5. Substring of String**

The substring of a string s is obtained by deleting any prefix or suffix from the string.

Substring of the string abcd: ∈, abcd, bcd, abc, …

**6. Proper Substring of String**

The proper substring of a string s includes all the substrings of s excluding ∈ and the string(s) itself.

Proper Substring of the string abcd: bcd, abc, cd, ab…

**7. Subsequence of String**

The subsequence of the string is obtained by eliminating zero or more (not necessarily consecutive) symbols from the string.

A subsequence of the string abcd: abd, bcd, bd, …

**8. Concatenation of String**

If s and t are two strings, then st denotes concatenation.

s = abc t = def

Concatenation of string s and t i.e. st = abcdef

### 2. Language

A

**language**is any countable set of strings over some fixed alphabet.#### Operation on Language

As we have learnt language is a set of strings that are constructed over some fixed alphabets. Now the operation that can be performed on languages are:

Union is the most common set operation. Consider the two languages L and M. Then the union of these two languages is denoted by:

L ∪ M = { s | s is in L or s is in M}

That means the string s from the union of two languages can either be from language L or from language M.

If L = {a, b} and M = {c, d} Then L ∪ M = {a, b, c, d}

Concatenation links the string from one language to the string of another language in a series in all possible ways. The concatenation of two different languages is denoted by:

L ⋅ M = {st | s is in L and t is in M} If L = {a, b} and M = {c, d}

Then L ⋅ M = {ac, ad, bc, bd}

If L = {a, b} then L* = {∈, a, b, aa, bb, aaa, bbb, …}

It is similar to the Kleene closure. Except for the term L

If L = {a, b} then L+ = {a, b, aa, bb, aaa, bbb, …}

So, these are the four operations that can be performed on the languages in the lexical analysis phase.

A regular expression ‘r’ that denotes a language L(r) is built recursively over the smaller regular expression using the rules given below.

The following rules define the regular expression over some alphabet Σ and the languages denoted by these regular expressions.

d

d

………

**1. Union**

Union is the most common set operation. Consider the two languages L and M. Then the union of these two languages is denoted by:

L ∪ M = { s | s is in L or s is in M}

That means the string s from the union of two languages can either be from language L or from language M.

If L = {a, b} and M = {c, d} Then L ∪ M = {a, b, c, d}

**2. Concatenation**

Concatenation links the string from one language to the string of another language in a series in all possible ways. The concatenation of two different languages is denoted by:

L ⋅ M = {st | s is in L and t is in M} If L = {a, b} and M = {c, d}

Then L ⋅ M = {ac, ad, bc, bd}

**3. Kleene Closure**

Kleene closure of a language L provides you with a set of strings. This set of strings is obtained by concatenating L zero or more time. The Kleene closure of the language L is denoted by:

If L = {a, b} then L* = {∈, a, b, aa, bb, aaa, bbb, …}

**4. Positive Closure**

The positive closure on a language L provides a set of strings. This set of strings is obtained by concatenating ‘L’ one or more times. It is denoted by:

It is similar to the Kleene closure. Except for the term L

^{0}, i.e. L

^{+}excludes ∈ until it is in L itself.

If L = {a, b} then L+ = {a, b, aa, bb, aaa, bbb, …}

So, these are the four operations that can be performed on the languages in the lexical analysis phase.

### 3. Regular Expression

A regular expression is a sequence of symbols used to specify lexeme patterns. A regular expression is helpful in describing the languages that can be built using operators such as union, concatenation, and closure over the symbols.A regular expression ‘r’ that denotes a language L(r) is built recursively over the smaller regular expression using the rules given below.

The following rules define the regular expression over some alphabet Σ and the languages denoted by these regular expressions.

- ∈ is a regular expression that denotes a language L(∈). The language L(∈) has a set of strings {∈} which means that this language has a single empty string.
- If there is a symbol ‘a’ in Σ then ‘a’ is a regular expression that denotes a language L(a). The language L(a) = {a} i.e. the language has only one string of length one and the string holds ‘a’ in the first position.
- Consider the two regular expressions r and s then:

- r|s denotes the language L(r) ∪ L(s).
- (r) (s) denotes the language L(r) ⋅ L(s).
- (r)* denotes the language (L(r))*.
- (r)+ denotes the language L(r).

#### Regular Definitions:

Giving names to regular expressions is referred to as a Regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form.d

_{1}→ r

_{1}

d

_{2}→ r

_{2}

………

d

letter → A | B | …. | Z | a | b | …. | z | digit → 0 | 1 | …. | 9

id → letter ( letter | digit ) *.

_{n}→ r_{n}- Each d
_{i}is a distinct name. - Each r
_{i}is a regular expression over the alphabet Σ U {d_{1}, d_{2},. . . , d_{i-1}}.

letter → A | B | …. | Z | a | b | …. | z | digit → 0 | 1 | …. | 9

id → letter ( letter | digit ) *.