Specification of Tokens in Compiler Design

𝐀𝐝𝐦𝐢𝐧

Specification of Tokens

Specification of tokens depends on the pattern of the lexeme. Here we will be using regular expressions to specify the different types of patterns that can actually form tokens.

Although the regular expressions are inefficient in specifying all the patterns forming tokens. Yet it reveals almost all types of pattern that forms a token.

There are 3 specifications of tokens:

String
Language
Regular Expression

1. String

An alphabet or character class is a finite set of symbols.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
In language theory, the term "word" is often used as synonyms for "string".
The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, "banana" is a string of length six.
The empty string, denoted ε, is the string of length zero.

Operations on String:

1. Prefix of String

The prefix of the string is the preceding symbols present in the string and the string(s) itself.

For Example: s = abcd

The prefix of the string abcd: ∈, a, ab, abc, abcd

2. Suffix of String

Suffix of the string is the ending symbols of the string and the string(s) itself.

For Example: s = abcd

Suffix of the string abcd: ∈, d, cd, bcd, abcd

3. Proper Prefix of String

The proper prefix of the string includes all the prefixes of the string excluding ∈ and the string(s) itself.

Proper Prefix of the string abcd: a, ab, abc

4. Proper Suffix of String
The proper suffix of the string includes all the suffixes excluding ∈ and the string(s) itself.

Proper Suffix of the string abcd: d, cd, bcd

5. Substring of String
The substring of a string s is obtained by deleting any prefix or suffix from the string.

Substring of the string abcd: ∈, abcd, bcd, abc, …

6. Proper Substring of String
The proper substring of a string s includes all the substrings of s excluding ∈ and the string(s) itself.

Proper Substring of the string abcd: bcd, abc, cd, ab…

7. Subsequence of String
The subsequence of the string is obtained by eliminating zero or more (not necessarily consecutive) symbols from the string.

A subsequence of the string abcd: abd, bcd, bd, …

8. Concatenation of String
If s and t are two strings, then st denotes concatenation.

s = abc t = def

Concatenation of string s and t i.e. st = abcdef

2. Language

A language is any countable set of strings over some fixed alphabet.

Operation on Language

As we have learnt language is a set of strings that are constructed over some fixed alphabets. Now the operation that can be performed on languages are:

1. Union
Union is the most common set operation. Consider the two languages L and M. Then the union of these two languages is denoted by:

L ∪ M = { s | s is in L or s is in M}

That means the string s from the union of two languages can either be from language L or from language M.

If L = {a, b} and M = {c, d} Then L ∪ M = {a, b, c, d}

2. Concatenation
Concatenation links the string from one language to the string of another language in a series in all possible ways. The concatenation of two different languages is denoted by:

L ⋅ M = {st | s is in L and t is in M} If L = {a, b} and M = {c, d}

Then L ⋅ M = {ac, ad, bc, bd}

3. Kleene Closure
Kleene closure of a language L provides you with a set of strings. This set of strings is obtained by concatenating L zero or more time. The Kleene closure of the language L is denoted by:

If L = {a, b} then L* = {∈, a, b, aa, bb, aaa, bbb, …}

4. Positive Closure
The positive closure on a language L provides a set of strings. This set of strings is obtained by concatenating ‘L’ one or more times. It is denoted by:

It is similar to the Kleene closure. Except for the term L⁰, i.e. L⁺ excludes ∈ until it is in L itself.

If L = {a, b} then L+ = {a, b, aa, bb, aaa, bbb, …}

So, these are the four operations that can be performed on the languages in the lexical analysis phase.

3. Regular Expression

A regular expression is a sequence of symbols used to specify lexeme patterns. A regular expression is helpful in describing the languages that can be built using operators such as union, concatenation, and closure over the symbols.

A regular expression ‘r’ that denotes a language L(r) is built recursively over the smaller regular expression using the rules given below.

The following rules define the regular expression over some alphabet Σ and the languages denoted by these regular expressions.

∈ is a regular expression that denotes a language L(∈). The language L(∈) has a set of strings {∈} which means that this language has a single empty string.
If there is a symbol ‘a’ in Σ then ‘a’ is a regular expression that denotes a language L(a). The language L(a) = {a} i.e. the language has only one string of length one and the string holds ‘a’ in the first position.
Consider the two regular expressions r and s then:

r|s denotes the language L(r) ∪ L(s).
(r) (s) denotes the language L(r) ⋅ L(s).
(r)* denotes the language (L(r))*.
(r)+ denotes the language L(r).

Regular Definitions:

Giving names to regular expressions is referred to as a Regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form.

d₁ → r₁
d₂ → r₂
………

d_n → r_n

Each d_i is a distinct name.
Each r_i is a regular expression over the alphabet Σ U {d₁, d₂,. . . , d_i-1}.

Example: Identifiers is the set of strings of letters and string beginning with a letter. Regular definition for this set:

letter → A | B | …. | Z | a | b | …. | z | digit → 0 | 1 | …. | 9

id → letter ( letter | digit ) *.

Specification of Tokens in Compiler Design