Efficient Construction of the Equation Automaton

Faissal Ouardi; Zineb Lotfi; Bilal Elghadyry

doi:10.3390/a14080238

,

and

¹

Department of Computer Science, Faculty of Sciences, Mohammed V University in Rabat, Rabat 10000, Morocco

²

EA 4491-LISIC-Lab., Laboratoire d’Informatique Signal et Image de la Côte d’Opale Université Littoral Côte d’Opale, 62100 Calais, France

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Algorithms2021, 14(8), 238;https://doi.org/10.3390/a14080238

Version Notes

Order Reprints

Abstract

This paper describes a fast algorithm for constructing directly the equation automaton from the well-known Thompson automaton associated with a regular expression. Allauzen and Mohri have presented a unified construction of small automata and gave a construction of the equation automaton with time and space complexity in

O (m log m + m^{2})

, where m denotes the number of Thompson automaton transitions. It is based on two classical automata operations, namely epsilon-removal and Hopcroft’s algorithm for deterministic Finite Automata (DFA) minimization. Using the notion of c-continuation, Ziadi et al. presented a fast computation of the equation automaton in

O (m^{2})

time complexity. In this paper, we design an output-sensitive algorithm combining advantages of the previous algorithms and show that its computational complexity can be reduced to

O (m \times | Q_{\equiv_{e}} |)

, where

| Q_{\equiv_{e}} |

denotes the number of states of the equation automaton, by an epsilon-removal and Bubenzer minimization algorithm of an Acyclic Deterministic Finite Automata (ADFA).

Keywords:

regular expressions; finite automata; efficient algorithms

1. Introduction

The equation automaton (also known as derived terms automaton or Antimirov automaton) was first introduced in Mirkin’s paper [1]. In [2], Antimirov introduced the notion of partial derivative of a regular expression, that lead to another definition and construction of the equation automaton. It is an

ϵ

-free NFA which has in general smaller number of states and transitions than the well-known position automaton [3,4,5]. The complexity of the original construction algorithm of [2], which is based on the computation of the set of partial derivatives of the expression, is in

O (n^{5})

, where n denotes the size of the regular expression. In 2001, Champarnaud and Ziadi [6] introduced the notion of canonical derivatives and constructed a new automaton called the c-continuation automaton. They also proved that this automaton is isomorphic to the position automaton and that the equation automaton is its quotient for some equivalence relation.

The notion of c-derivative has been introduced in [6] to derive the equation automaton from the position automaton via the c-continuation automaton. A unique regular expression over indexed and ordered letters, called c-continuation, is assigned to each state of the position automaton. The resulting automaton is called the c-continuation automaton [6]. After that, one can define the equivalence relation between two c-continuations i.e., two states of the c-continuation automaton as follows: if deleting the indices of letters from two c-continuations results in the same regular expression, they correspond to the same partial derivative. Hence, the equation automaton would be a quotient of the c-continuation automaton w.r.t. the previously defined equivalence relation. From the algorithmic point of view, this result allows the construction of the equation automaton in

O (n^{2})

time and space [6,7]. Therefore, this improves the Antimirov’s algorithm by a factor of

O (n^{3})

.

In [8], Allauzen and Mohri present simple and unified constructions of the position automata [3,4,5], follow automata [9,10], and the equation automata [2,6,7] from regular expressions. Their algorithms are based on two standard automata operations applied to the Thompson automata [11,12] called: epsilon-remove and Hopcroft’s algorithm for DFA minimization [13]. The complexity of their construction for the equation automaton is in

O (m log m + m^{2})

, where m is the number of Thompson automaton transitions. Notice that, by construction the number of transitions of the Thompson automaton m and the size of a regular expression n are proportional. Thus, we have

m = O (n)

.

To improve the time complexity of computing the equation automaton from a regular expression E, we design an algorithm, combining advantages of previous methods [6,7,8], with a worst-case time complexity in

O (m \times | Q_{\equiv_{e}} |)

, where

| Q_{\equiv_{e}} |

denotes the number of its states. Our approach is based on Bubenzer minimization of an acyclic DFA instead Hopcroft’s algorithm for DFA minimization step used in Allauzen and Mohri’s method. The main idea is to associate implicitly each c-continuation to a corresponding state, called position state in the Thompson automaton by a special marking of

ε

-transitions. As a consequence of this marking, the right language of each position state in the Thompson automaton represents implicitly its c-continuation, called pseudo-continuation. After that, we disable temporarily the cyclic

ε

-transition in the Thompson automaton and perform Bubenzer minimization of an acyclic DFA to compute efficiently partial derivatives equivalence relation over the set of position states. Finally, we remove indexed

ε

-transitions, enable the cyclic

ε

-transition and then compute the

ε

-closure of states in the produced automaton from the previous step to get the equation automaton. The implementation of the proposed algorithm is available under the repository https://github.com/FaissalOuardi/Equation-automaton, (accessed on 27 May 2021).

The paper is organized as follows. Section 2 contains some basic definitions and necessary preliminaries. Section 3 summarizes theoretical results that lead to c-continuations of a regular expression, and their relations with the partial derivatives. The definition of the c-continuation automaton is recalled, as well as the way it is connected to the equation automaton. Section 4 is a recall to the algorithm due to Allauzen and Mohri. We detail then in Section 5 the algorithmic refinements leading to an

O (m \times | Q_{\equiv_{e}} |)

time complexity of the efficient construction of the equation automaton where

| Q_{\equiv_{e}} |

is the number of its states.

2. Preliminaries

In this section, we introduce briefly the notion of finite automata. For further details on formal aspects of finite automata theory, we particularly recommend reading classical books [14,15].

2.1. Regular Expressions and Finite Automata

Let A be a non-empty finite set of letters, called an alphabet. The set of all words over A is denoted by

A^{*}

.

ε

is the empty word. A language over A is a subset of

A^{*}

.

2.1.1. Regular Expressions and Languages

A regular expression over the alphabet A is a term of the algebra

T_{r e g (A)}

defined over the set

A \cup {0, 1}

with the symbols of functions

*, +, \cdot,

where ∗ is unary and + and · are binary. Properties of the constants

0, 1

and the operators

*, +

, and · lead to identities on this algebra. Each regular expression denotes a language. L is the function that assigns to each regular expression the regular language it denotes.

L : T_{r e g (A)} ⟶ r e g (A^{*})

is defined as follows:

\begin{matrix} L (0) & = & \emptyset \\ L (1) & = & {ε} \\ L (a) & = & a, for each a in A \\ L (F + G) & = & L (F) \cup L (G) \\ L (F \cdot G) & = & L (F) L (G) \\ L (F^{*}) & = & L {(F)}^{*} \end{matrix}

The following identities are classically used:

\begin{matrix} 0 + E = E = E + 0, 1 \cdot E = E = E \cdot 1, 0 \cdot E = 0 = E \cdot 0 . \end{matrix}

Let E be a regular expression. The set of letters occurring in E is denoted by

A_{E}

. To specify their position in the expression, letters are subscripted following the order of reading. The resulted expression is the linearized form of E, denoted by

\bar{E}

. For example, starting from

E = {(a + b)}^{*} a b a + 1

, one obtains the linearized version

\bar{E} = {(a_{1} + b_{2})}^{*} a_{3} b_{4} a_{5} + 1

of E. The subscripted letters are called positions; the set of all position in the expression E is denoted by

pos (E)

. For the previous example, we have

pos (E) = {a_{1}, b_{2}, a_{3}, b_{4}, a_{5}}

. If F is a subexpression of E, we denote by

{pos}_{E} (F)

the subset of positions of E that are letters of F. We say that a regular expression is in linear form if each letter of the expression occurs only once. We denote by h the function that maps each position in

pos (E)

to the letter of

A_{E}

that appears at this position in E. For

E = {(a + b)}^{*} a b a + 1

, we have

h (a_{1}) = h (a_{3}) = h (a_{5}) = a

and

h (b_{2}) = h (b_{4}) = b

. The size of the regular expression E, denoted by

| E |

, is the number of nodes in its syntax tree. We call alphabetic width of E, denoted by

| | E | |

, the number of occurrences of letters in the expression i.e., the cardinality of

pos (E)

. The alphabetic width of the expression

{(a + b)}^{*} a b a + 1

is equal to 5; its size is equal to 12.

Notice that the alphabetic width and the size of a regular expression are independent parameters. Therefore complexities are expressed w.r.t. both of these two parameters. However, it is usual to preprocess the input expression in order to reduce its size and to make its size proportional to its alphabetic width. So, if we consider a reduced regular expression E w.r.t. the following rules:

$1 + 1 = 1$ ,
$1 + E + 1 = 1 + E$ ,
E is in Star-Normal Form (SNF) [16].

Thus, we have in this case

| E | = O (| | E | |)

. It is known that regular expressions can be transformed to SNF in linear time [16].

λ (E)

denote the null term of E, that is

λ (E) = \{\begin{matrix} 1 & if ε \in L (E), \\ 0 & otherwise . \end{matrix}

By

T (E)

we denote the syntax tree associated with the regular expression E. A node in

T (E)

will be denoted by

ν

. We write

Nodes (E)

for the set of nodes of

T (E)

. If

ν \in Nodes (E)

is a node in

T (E), sym (ν), father (ν)

and

right (ν)

denote respectively the symbol, the father and the right son of the node

ν

. If

sym (ν)

is an operator,

E_{ν}

will denote the subexpression that corresponds to the subtree with the root

ν

.

2.1.2. Finite Automata and Recognizable Languages

A nondeterministic finite automaton (NFA) is a quintuple

A = ⟨ Q, A, q_{0}, δ, F ⟩

where Q is a finite set of states, A is the alphabet,

q_{0} \in Q

is the initial state,

F \subseteq Q

is the set of final states, and

δ : Q \times (A \cup {ε}) \to 2^{Q}

is the transition function. The size of an automaton

A

, denoted by

| A |

, is the number of its states. The automaton

A

is called deterministic (DFA) if there is only one initial state,

| δ (q, ε) | = 0

and

| δ (q, a) | = 1

, for any

q \in Q

, for any

a \in A

. A path in

A

is a sequence

(q_{i}, a_{i}, q_{i + 1}), i = 1, \dots, n

, of consecutive transitions. Its label is the word

w = a_{1} a_{2} \dots a_{n}

. A word

w = a_{1} a_{2} \dots a_{n}

is recognized by the automaton

A

if there exists a path labeled w such that

q_{1} = q_{0}

and

q_{n + 1} \in F

.

The language recognized by the automaton

A

, denoted by

L (A)

, is the set of words it recognizes. The right language of a state q in the automaton

A

, denoted by

{\vec{L}}_{q} (A)

, is obtained by setting q to be the initial state, i.e.,

{\vec{L}}_{q} (A) = {w \in A^{*} | δ (q, w) \cap F \neq \emptyset}

.

We say that

A

is acyclic if the underlying graph is acyclic. The language associated with an acyclic automaton is finite.

Let ∼ be an equivalence relation over Q. For

q \in Q

,

[q]

denotes the equivalence class of q w.r.t. ∼ and, for

C \subseteq Q, C /_{\sim}

denotes the quotient set

C /_{\sim} = {[q] | q \in C}

. We say that ∼ is right invariant w.r.t.

A

if and only if the following conditions hold:

$\sim \subseteq {(Q - F)}^{2} \cup F^{2}$ (final and non-final states are not ∼-equivalent),
for any $p, q \in Q$ , $a \in A$ , if $p \sim q$ , then $δ (p, a) /_{\sim} = δ (q, a) /_{\sim}$ .

2.2. Thompson Automaton

In [12], Thompson gave a linear time and space algorithm to convert a regular expression E to an NFA with

ε

-transitions, denoted by

T_{E}

. The recursive steps of the construction of Thompson NFA are pictured in Figure 1.

Figure 1. Thompson construction of an NFA.

Example 1.

Let us consider the regular expression

E = {(a^{*} + b a^{*} + b^{*})}^{*}

. The Thompson automaton

T_{E}

associated with E is shown in the Figure 2.

Figure 2. Thompson automaton

T_{E}

.

There are some disadvantages of Thompson’s NFA when it is used in practice: it has many redundant states and

ε

-transitions, its number of states is in

O (| E |)

while other constructions offer NFAs with

O (| | E | |)

states.

In the next section, we will present the construction of a reduced

ε

-free automaton, named equation automaton, sometimes called Antimirov automaton or derived terms automaton.

3. Equation Automaton

The equation automaton has been introduced for the first time by Mirkin in [1]. In 1996, Antimirov introduced the notion of partial derivatives and used it to define the equation automaton [2]. Champarnaud and Ziadi [6] defined the notion of canonical derivatives of a linear expression and constructed a new automaton called the c-continuation automaton. They also proved that this automaton is isomorphic to the position automaton in the sense that the two automata have identical sets of states, identical initial and final states, and transitions Theorem 6 in [6]. Using an equivalence relation over the set of states of the c-continuation automaton, they derive the equation automaton in quadratic time.

The definition of the equation automaton of a regular expression is based on that of the partial derivatives of regular expressions, which are multisets of regular expressions over A. The partial derivative of E with respect to

a \in A

is defined recursively on the structure of E as follows:

\begin{matrix} \partial_{a} (0) & = & \partial_{a} (1) = \emptyset, \\ \partial_{a} (x) & = & if a = x then {1} else \emptyset, \\ \partial_{a} (F + G) & = & \partial_{a} (F) \cup \partial_{a} (G), \\ \partial_{a} (F \cdot G) & = & \partial_{a} (F) \cdot G \cup λ (F) \cdot \partial_{a} (G), \\ \partial_{a} (F^{*}) & = & \partial_{a} (F) \cdot F^{*} . \end{matrix}

The partial derivative of E with respect to the string

u \in A^{*}

is denoted by

\partial_{u} (E)

and recursively defined by

\partial_{ε} (E) = E

and

\partial_{u a} (E) = \partial_{a} (\partial_{u} (E))

.

Let

D (E) = {E} \cup {E^{'} | E^{'} \in \partial_{u} (E) with u \in A^{*}}

.

Theorem 1.

(Antimirov [2]). The cardinality of the set

D (E)

of all partial derivatives of a regular expression E is less than or equal to

| | E | | + 1

.

The equation automaton

E_{E} = ⟨ D (E), A_{E}, E, δ, F ⟩

of E is defined by:

$δ (E^{'}, a) = {E^{″} \in D (E) | E^{″} \in \partial_{a} (E^{'})}$ ,
$F = {E^{'} \in D (E) | λ (E^{'}) = 1}$ .

Example 2.

Let us consider the regular expression

E = {(a^{*} + b a^{*} + b^{*})}^{*}

. The partial derivatives of E are as follows:

\begin{matrix} \partial_{a} (E) & = & {a^{*} {(a^{*} + b a^{*} + b^{*})}^{*}} \\ \partial_{b} (E) & = & {a^{*} {(a^{*} + b a^{*} + b^{*})}^{*}, b^{*} {(a^{*} + b a^{*} + b^{*})}^{*}} \end{matrix}

The computation of the transitions of the equation automaton

E_{E}

are as follows:

\begin{matrix} \partial_{a} (a^{*} {(a^{*} + b a^{*} + b^{*})}^{*}) & = \partial_{a} (a^{*}) {(a^{*} + b a^{*} + b^{*})}^{*} \cup \partial_{a} ({(a^{*} + b a^{*} + b^{*})}^{*}) \\ = \partial_{a} (a) a^{*} {(a^{*} + b a^{*} + b^{*})}^{*} \cup (\partial_{a} (a^{*}) \cup \partial_{a} (b a^{*}) \cup \partial_{a} (b^{*})) {(a^{*} + b a^{*} + b^{*})}^{*} \\ = {a^{*} {(a^{*} + b a^{*} + b^{*})}^{*}} \\ \partial_{b} (a^{*} {(a^{*} + b a^{*} + b^{*})}^{*}) & = \partial_{b} (a^{*}) {(a^{*} + b a^{*} + b^{*})}^{*} \cup \partial_{b} ({(a^{*} + b a^{*} + b^{*})}^{*}) \\ = \partial_{b} (a) a^{*} {(a^{*} + b a^{*} + b^{*})}^{*} \cup (\partial_{b} (a^{*}) \cup \partial_{b} (b a^{*}) \cup \partial_{b} (b^{*})) {(a^{*} + b a^{*} + b^{*})}^{*} \\ = (\partial_{b} (b) a^{*} \cup \partial_{b} (b) b^{*}) {(a^{*} + b a^{*} + b^{*})}^{*} \\ = {a^{*} {(a^{*} + b a^{*} + b^{*})}^{*}, b^{*} {(a^{*} + b a^{*} + b^{*})}^{*}} \\ \partial_{a} (b^{*} {(a^{*} + b a^{*} + b^{*})}^{*}) & = \partial_{a} (b^{*}) {(a^{*} + b a^{*} + b^{*})}^{*} \cup \partial_{a} ({(a^{*} + b a^{*} + b^{*})}^{*}) \\ = (\partial_{a} (a^{*}) \cup \partial_{a} (b a^{*}) \cup \partial_{a} (b^{*})) {(a^{*} + b a^{*} + b^{*})}^{*} \\ = {a^{*} {(a^{*} + b a^{*} + b^{*})}^{*}} \\ \partial_{b} (b^{*} {(a^{*} + b a^{*} + b^{*})}^{*}) & = \partial_{b} (b^{*}) {(a^{*} + b a^{*} + b^{*})}^{*} \cup \partial_{b} ({(a^{*} + b a^{*} + b^{*})}^{*}) \\ = \partial_{b} (b) b^{*} {(a^{*} + b a^{*} + b^{*})}^{*} \cup (\partial_{b} (a^{*}) \cup \partial_{b} (b a^{*}) \cup \partial_{b} (b^{*})) {(a^{*} + b a^{*} + b^{*})}^{*} \\ = {b^{*} {(a^{*} + b a^{*} + b^{*})}^{*}} \cup (\partial_{b} (b) a^{*} \cup \partial_{b} (b) b^{*}) {(a^{*} + b a^{*} + b^{*})}^{*} \\ = {b^{*} {(a^{*} + b a^{*} + b^{*})}^{*}, a^{*} {(a^{*} + b a^{*} + b^{*})}^{*}} \end{matrix}

The equation automaton

E_{E}

associated with E is shown in Figure 3.

Figure 3. The equation automaton

E_{E}

.

In the following, we recall the definition and properties of the c-continuation automaton. Next, we show how it can be bound to the equation automaton.

3.1. C-Continuation Automaton

This automaton has been introduced by Champarnaud and Ziadi [6] to efficiently compute the equation automaton. Let us recall the notion of c-derivative, c-continuation and c-continuation automaton.

Definition 1.

(c-derivative with respect to a letter). The c-derivative of a regular expression E with respect to a letter a is the regular expression

d_{a} (E)

defined by:

\begin{matrix} d_{a} (0) & = & d_{a} (1) = 0, \\ d_{a} (x) & = & if a = x then 1 else 0, \\ d_{a} (F + G) & = & if d_{a} (F) \neq 0 then d_{a} (F) else d_{a} (G), \\ d_{a} (F \cdot G) & = & if d_{a} (F) \neq 0 then d_{a} (F) \cdot G else λ (F) \cdot d_{a} (G), \\ d_{a} (F^{*}) & = & d_{a} (F) \cdot F^{*} . \end{matrix}

The c-derivative with respect to a word

u = u_{1} \dots u_{n}

is defined recursively by the rules:

d_{ε} (E) = E

and

d_{u_{1} \dots u_{n}} (E) = d_{u_{2} \dots u_{n}} (d_{u_{1}} (E)) .

Theorem 2.

(Theorem 4 in [6]). Let E be a linear regular expression and a be a letter from E. Then all non-zero c-derivatives of the form

d_{u a} (E)

, where u is an arbitrary word, are equal.

Theorem 2 allows us to define the c-continuation

c_{a} (E)

of a in a linear expression E as the unique value of the non-zero c-derivatives

d_{u a} (E)

.

Proposition 1.

(Proposition 6 in [6]). For every letter a of a linear expression E, the c-continuation

c_{a} (E)

is such that:

\begin{matrix} c_{a} (a) & = & 1, \\ c_{a} (F + G) & = & if c_{a} (F) exists then c_{a} (F) else c_{a} (G), \\ c_{a} (F \cdot G) & = & if c_{a} (F) exists then c_{a} (F) \cdot G else c_{a} (G), \\ c_{a} (F^{*}) & = & c_{a} (F) \cdot F^{*} . \end{matrix}

Corollary 1.

(Corollary 5 in [6]). For every letter a of a linear expression E, the c-continuation

c_{a} (E)

is either 1 or a subexpression of E or a product of subexpressions.

More precisely, for a linear regular expression E, we have

c_{a} (E) = H_{0} \dots H_{k}

, where

H_{i}

is a subexpression of E, for all

0 \leq i \leq k

.

We now consider a regular expression E over A. Let

\bar{E}

be the linearized form of E over

pos (E)

and h be the mapping from

pos (E)

onto

A_{E}

.

In order to simplify the writing for a regular expression E, we consider by convention that

c_{0} (E) = d_{ε} (\bar{E}) = \bar{E}

and

c_{x} (E)

will denote

c_{x} (\bar{E})

.

Definition 2.

(c-continuation automaton) The c-continuation automaton of E,

C_{E} = ⟨ Q, A_{E}, i, δ

,

F ⟩

, is defined by:

$Q = {(x, c_{x} (E)) | x \in pos (E) \cup {0}}$ ,
$i = (0, c_{0} (E))$ ,
$F = {(x, c_{x} (E)) | λ (c_{x} (E)) = 1}$ ,
$δ ((x, c_{x} (E)), a) = {(y, c_{y} (E) | h (y) = a a n d d_{y} (c_{x} (E)) \equiv c_{y} (E)}, \forall x \in pos (E) \cup {0} a n d \forall a \in A_{E}$ .

We note that the number of states of

C_{E}

is exactly

| | E | | + 1

.

Corollary 2.

(Corollary 7 in [6]). Let E be a regular expression. One has:

L (C_{E}) = L (E)

.

Example 3.

Let us consider the regular expression

E = {(a^{*} + b a^{*} + b^{*})}^{*}

from Example 1. The linearized form of E is

\bar{E} = {(a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*})}^{*}

and the c-continuations of E are as follows:

\begin{matrix} c_{0} (E) & = & {(a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*})}^{*} \\ c_{a_{1}} (E) & = & a_{1}^{*} {(a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*})}^{*} \\ c_{b_{2}} (E) & = & a_{3}^{*} {(a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*})}^{*} \\ c_{b_{4}} (E) & = & b_{4}^{*} {(a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*})}^{*} \end{matrix}

\begin{matrix} c_{a_{3}} (E) & = & c_{a_{3}} (a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*}) \cdot {(a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*})}^{*} \\ = & c_{a_{3}} (b_{2} a_{3}^{*}) \cdot {(a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*})}^{*} \\ = & c_{a_{3}} (a_{3}^{*}) \cdot {(a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*})}^{*} \\ = & a_{3}^{*} {(a_{1}^{*} + b_{2} a_{3}^{*} + b_{4}^{*})}^{*} \end{matrix}

The outgoing transitions from the state

(0, c_{0} (E))

are computed using the c-derivatives of

c_{0} (E)

as follows:

\begin{matrix} d_{a_{1}} (c_{0} (E)) = c_{a_{1}} (E), d_{b_{2}} (c_{0} (E)) = c_{b_{2}} (E), d_{a_{3}} (c_{0} (E)) = 0, d_{b_{4}} (c_{0} (E)) = c_{b_{4}} (E) \\ d_{a_{1}} (c_{a_{1}} (E)) = c_{a_{1}} (E), d_{b_{2}} (c_{a_{1}} (E)) = c_{b_{2}} (E), d_{a_{3}} (c_{a_{1}} (E)) = 0, d_{b_{4}} (c_{a_{1}} (E)) = c_{b_{4}} (E) \end{matrix}

\begin{matrix} d_{a_{1}} (c_{b_{2}} (E)) = c_{a_{1}} (E), d_{b_{2}} (c_{b_{2}} (E)) = c_{b_{2}} (E), d_{a_{3}} (c_{b_{2}} (E)) = c_{a_{3}} (E), d_{b_{4}} (c_{b_{2}} (E)) = c_{b_{4}} (E) \end{matrix}

\begin{matrix} d_{a_{1}} (c_{a_{3}} (E)) = c_{a_{1}} (E), d_{b_{2}} (c_{a_{3}} (E)) = c_{b_{2}} (E), d_{a_{3}} (c_{a_{3}} (E)) = c_{a_{3}} (E), d_{b_{4}} (c_{a_{3}} (E)) = c_{b_{4}} (E) \end{matrix}

\begin{matrix} d_{a_{1}} (c_{b_{4}} (E)) = c_{a_{1}} (E), d_{b_{2}} (c_{b_{4}} (E)) = c_{b_{2}} (E), d_{a_{3}} (c_{b_{4}} (E)) = 0, d_{b_{4}} (c_{b_{4}} (E)) = c_{b_{4}} (E) \end{matrix}

Then we get the c-continuation automaton

C_{E}

in Figure 4.

Figure 4. The c-continuation automaton

C_{E}

.

3.2. Equation Automaton as a Quotient of C-Continuation Automaton

Champarnaud and Ziadi [6] have proved that the equation automaton is a quotient of the c-continuation automaton. Let us consider the equivalence relation

\equiv_{e}

defined by

\begin{matrix} (x, c_{x} (E)) \equiv_{e} (y, c_{y} (E)) \Leftrightarrow h (c_{x} (E)) \equiv h (c_{y} (E)) \end{matrix}

(1)

Sometimes we write

x \equiv_{e} y \Leftrightarrow h (c_{x} (E)) \equiv h (c_{y} (E))

.

Proposition 2.

The relation

\equiv_{e}

is right-invariant, i.e., for all letters a in A, for all pairs of states

(x, c_{x} (E)), (y, c_{y} (E))

in Q such that

(x, c_{x} (E)) \equiv_{e} (y, c_{y} (E))

, we have:

δ ((x, c_{x} (E)), a) /_{\equiv_{e}} = δ ((y, c_{y} (E)), a) /_{\equiv_{e}}

.

Moreover, if two states are equivalent w.r.t.

\equiv_{e}

, then they are either both final or both non-final, since

(x, c_{x} (E)) \in F \Leftrightarrow λ (c_{x} (E)) = 1 \Leftrightarrow λ (h (c_{x} (E))) = 1

.

The equivalence class of the state

(x, c_{x} (E))

is represented by

C_{x} = h (c_{x} (E))

. Since the relation

\equiv_{e}

is right-invariant, we can define the quotient automaton

C_{E} /_{\equiv_{e}} = ⟨ Q_{\equiv_{e}}, A_{E}, q_{0}, δ, F ⟩

as follows:

$Q_{\equiv_{e}} = {C_{x} | x \in pos (E) \cup {0}}$ ,
$q_{0} = C_{0}$ ,
$F = {C_{x} | λ (c_{x} (E)) = 1}$ ,
$δ (C_{x}, a) = {C_{y} | h (y) = a and d_{y} (c_{x} (E)) \equiv c_{y} (E)}, \forall C_{x} \in Q_{\equiv_{e}} and \forall a \in A_{E}$ .

Theorem 3.

(Theorem 10 in [6]). Let E be a regular expression. The automaton

C_{E} /_{\equiv_{e}}

deduced from the c-continuation automaton is isomorphic to the equation automaton

E_{E}

.

We note that the number of states of

E_{E}

is majorized by

| | E | | + 1

.

Example 4.

Let us consider the regular expression

E = {(a^{*} + b a^{*} + b^{*})}^{*}

from Example 1. There are three

\equiv_{e}

-equivalence classes when applying the function h that remove indices from letters for different c-continuations of E:

\begin{matrix} h (c_{0} (E)) = {(a^{*} + b a^{*} + b^{*})}^{*} \\ h (c_{a_{1}} (E)) = h (c_{b_{2}} (E)) = h (c_{a_{3}} (E)) = a^{*} {(a^{*} + b a^{*} + b^{*})}^{*} \\ h (c_{b_{4}} (E)) = b^{*} {(a^{*} + b a^{*} + b^{*})}^{*} \end{matrix}

The c-continuation automaton

C_{E}

and the quotient automaton

C_{E} /_{\equiv_{e}}

which is isomorphic to the equation automaton

E_{E}

are schematized in Figure 5:

Figure 5. (a) The c-continuation automaton

C_{E}

versus (b) The quotient automaton

C_{E} /_{\equiv_{e}}

.

4. Allauzen and Mohri’s Algorithm

In [8], Allauzen and Mohri compute the equation automaton from the Thompson automaton of a regular expression E in

O (m log m + m^{2})

time. Their algorithm is based on some combinations of

ε

-transitions removal and Hopcroft’s algorithm for DFA minimization to the classical Thompson automata [13]. In the next, we briefly describe their method.

Let

\hat{A} = A_{E} \cup {ε_{+}^{1}, ε_{+}^{2}, ε_{*}^{1}, ε_{*}^{2}}

. We denote by

\hat{T_{E}}

the automaton over

\hat{A}

obtained by recursively marking some of the

ε

-transitions of the Thompson automaton

T_{E}

as follows:

Allauzen and Mohri have shown that the equation automaton can be obtained using some

ε

-transitions marking of the Thompson automaton and then apply two classical automata operations, namely epsilon removal, denoted by the function

r m e p s

(resp. the function

\hat{r m e p s}

for marked epsilon removal) and the Hopcroft’s algorithm for DFA minimization [13], denoted by

m i n_{B}

.

Proposition 3.

(Proposition 3 in [8]). We have

E_{E} = \hat{r m e p s} (m i n_{B} (r m e p s (\hat{T_{E}})))

.

Note that after removing

ε

-transitions from the automaton

\hat{T_{E}}

, we obtain a deterministic finite automaton

r m e p s (\hat{T_{E}})

. After that, the Hopcroft’s algorithm for DFA minimization is applied to derive the automaton

m i n_{B} (r m e p s (\hat{T_{E}}))

such that the set of its states is in bijection with the set of partial derivatives of E. Finally, to compute transitions of the equation automaton from

m i n_{B} (r m e p s (\hat{T_{E}}))

, marked

ε

-transitions are removed using

\hat{r m e p s}

operation.

Theorem 4.

(Theorem 3 in [8]). Let E be a regular expression over A. The equation automaton of E can be computed in

O (m log m + m^{2})

time.

5. Efficient Conversion Algorithm

In this section, we will show that the equation automaton

E_{E}

of a regular expression E can be deduced from the associated Thompson automaton in

O (| E | \cdot | Q_{\equiv_{e}} |)

time, where

| Q_{\equiv_{e}} |

denotes the number of states of

E_{E}

. Algorithm 1 summarizes the different steps of our approach.

Algorithm 1 Computation of the equation automaton.

input: The Thompson automaton

T_{E} = ⟨ Q, A_{E}, I, δ, F ⟩

associated with a regular expression E.
output: The equation automaton

E_{E}

associated with E.
/* Computation of states */

Compute

I d (T_{E}

):

Compute

C_{\equiv_{e}} (I d (T_{E}))

:

• Compute pseudo-continuations for all position states of

I d (T_{E})

.
• Merge equivalent states having the same pseudo-continuation.
/* Computation of transitions and final states /*

Compute

r m e p s (C_{\equiv_{e}} (I d (T_{E})))

:

• Perform epsilon removal operation using

r m e p s ()

function over

C_{\equiv_{e}} (I d (T_{E}))

.

For convenience, we assume that the k states of a given finite automaton are identified by the integers

1, \dots, k

.

From Corollary 1, the c-continuation

c_{x} (E) = H_{0} \dots H_{l}

associated with a position x is a concatenation of distinct subexpressions

H_{i}

of E, possibly reduced to a single subexpression or to 1. In the Thompson automaton

T_{E}

, we can associate a position x to a particular state q, called position state and define the associated pseudo-continuation

C (q) = N (H_{0}) \dots N (H_{l})

, where

N (H_{i})

denotes the integer that identify the initial state of the Thompson automaton

T_{H_{i}}

. So, the first step, compute

I d (T_{E})

, of our algorithm consists on computing the function

N (.)

such that for two subexpressions

H_{i}

and

H_{j}

of E, we have:

H_{i} \equiv H_{j} \Leftrightarrow N (I_{T_{H_{i}}}) = N (I_{T_{H_{j}}})

. This step can be done using a special marking of the

ε

-transitions of

T_{E}

that makes it acyclic and deterministic and such that the right languages of its states represent the structure of the corresponding subexpressions. In the next step, Compute

C_{\equiv_{e}} (I d (T_{E}))

, we re-mark the

ε

-transitions such that the resulted automaton is acyclic and deterministic and the right language of a position state in

I d (T_{E})

represents a pseudo-continuation. After that, one can merge equivalent position states having the same right language. The final step is the computation of final states and transitions of the equation automaton using an epsilon removal operation, denoted by

r m e p s (.)

, from the resulted automaton in the previous step.

In the next, we will show that the equation automaton

E_{E}

can be computed efficiently from the Thompson automaton

T_{E}

using the following operations

r m e p s (C_{\equiv_{e}} (I d (T_{E})))

.

5.1. Computation of States

In the following, We will show that the computation of the relation

\equiv_{e}

over the states of the Thompson automaton can be performed in linear time w.r.t. the size of the expression using the minimization of an acyclic deterministic finite automaton. This minimization can be performed efficiently in

O (| E |)

time using Bubenzer’s algorithm [17,18].

Before computing the equivalence classes

C_{\equiv_{e}}

over states of the Thompson automaton, we will perform a preprocessing step to identify all identical sub-expressions of E. In the next, we will show that this identification can be done in

O (| E |)

time.

5.1.1. Sub-Expressions Identification

Let Exp the set of all subexpressions of E. In this preprocessing step, we will mark each state in the Thompson automaton by a unique letter in the set

{1, 2, \dots, | Exp |}

.

Let us define a bijection N between the set Exp and a finite set of letters

{1, 2, \dots, | Exp |}

. Consequently, if

E_{1}

and

E_{2}

are two sub-expressions of E, then we have:

E_{1} \equiv E_{2} \Leftrightarrow N (E_{1}) = N (E_{2})

(2)

Based on the parsing method, introduced in Section 6 in [19], that derive an equivalent regular expression from Thompson automaton, each subexpression

H_{i}

of E is associated with an integer identifying the initial state

I_{T_{H_{i}}}

in the Thompson automaton

T_{E}

.

Let q be a state in

T_{E}

, we denote by

E_{q}

the subexpression associated with q, if it exists. For abbreviation,

N (q)

represents

N (E_{q})

.

In the following, we will show that the computation of the function N over the states of

T_{E}

turns into a minimization of the acyclic deterministic sub-automaton of the Thompson automaton,

I d (T_{E}) = ⟨ Q^{'}, A_{E}, I, δ^{'}, F ⟩

, defined by:

$A = {ε_{l +}, ε_{\cdot}, ε_{r +}, ε_{*}^{1}, ε_{*}^{2}} \cup A_{E}$ where l (resp. r), denote left (resp. right),
$Q^{'} = {(q, N (q)) | q \in Q}$ , i.e., a state in $T_{E}$ is augmented by the letter $N (q)$ .
The transition function $δ^{'}$ is defined over the Thompson automaton as follows:

Notice that this automaton is an acyclic deterministic sub-automaton of the Thompson automaton where

ε

-transitions are indexed and the cyclic transitions in the case when

E = F^{*}

are temporarily disabled.

To compute identical subexpressions, we define the equivalence relation ∼ over the states of

I d (T_{E})

as follows:

\begin{matrix} q \sim q^{'} \Leftrightarrow N (q) = N (q^{'}) \end{matrix}

(3)

Thus we have:

\begin{matrix} {[q]}_{\sim} = {[q^{'}]}_{\sim} \Leftrightarrow N (E_{q}) = N (E_{q^{'}}) \end{matrix}

(4)

Lemma 1.

Let q and

q^{'}

be two states in

I d (T_{E})

. We have:

{\vec{L}}_{q} (I d (T_{E})) = {\vec{L}}_{q^{'}} (I d (T_{E})) \Leftrightarrow E_{q} \equiv E_{q^{'}}

Proof.

Obvious, by construction. □

Proposition 4.

The function

N (.)

can be computed over

I d (T_{E})

in

O (| E |)

time.

Proof.

Let q and

q^{'}

two states in

I d (T_{E})

. One has:

\begin{matrix} q \sim q^{'} & \overset{(3)}{⟺} & N (q) = N (q^{'}) \\ \overset{(2)}{⟺} & E_{q} \equiv E_{q^{'}} \\ \overset{L e m m a 1}{⟺} & {\vec{L}}_{q} (I d (T_{E})) = {\vec{L}}_{q^{'}} (I d (T_{E})) \end{matrix}

Thus, the equivalence relation ∼ coincides with Myhill-Nerode equivalence relation [20,21] over the states of

I d (T_{E})

. Since the automaton

I d (T_{E})

is deterministic and acyclic, its minimization using Bubenzer’s algorithm [17] requires

O (| E |)

time and space complexity. □

Example 5.

The automaton

I d (T_{E})

obtained after performing the subexpression identification step for the regular expression

E = {(a^{*} + b a^{*} + b^{*})}^{*}

through

T_{E}

.

As shown in Figure 6, for the states 3 and 9 we have:

\begin{matrix} {\vec{L}}_{3} (I d (T_{E})) & = {ε_{*}^{1} a ε_{*}^{1} ε_{*}^{1}, ε_{*}^{2} ε_{*}^{1}} \\ {\vec{L}}_{9} (I d (T_{E})) & = {ε_{*}^{1} a ε_{*}^{1} ε_{*}^{1}, ε_{*}^{2} ε_{*}^{1}} \end{matrix} ⟺ 3 \sim 9 ⟺ N (3) = N (9)

Figure 6. The automaton

I d (T_{E})

associated with

E = {(a^{*} + b a^{*} + b^{*})}^{*}

.

As a consequence, we have

E_{3} \equiv E_{9}

.

5.1.2. $\equiv_{e}$ -Equivalent States Merging

Let us now turn to the computation of the set of states of the equation automaton. From Corollary 1, the c-continuation

c_{x} (E)

is a concatenation of distinct subexpressions of E, possibly reduced to a single subexpression or to 1. The following proposition shows that the c-continuation

c_{x} (E)

can be computed over the syntactic tree

T (\bar{E})

associated with the linearized version

\bar{E}

.

Proposition 5.

(Ref. [6]). Let E be a regular expression and x a position in

pos (E)

. Let

ν_{x}

be a node in

T (\bar{E})

such that

sym (ν_{x}) = x

. The c-continuation

c_{x} (E)

is as follows:

c_{x} (E) = ⨀_{\begin{matrix} ν_{x} ⪯ ν ⪯ ν_{\bar{E}} \\ f (ν) \neq ⊥ \end{matrix}} E_{f (ν)}

where ⊙ is the concatenation operator.

The function

f : Nodes (E) \cup {⊥} ⟶ Nodes (E) \cup {⊥}

is defined as follows:

f (ν) = \{\begin{matrix} father (ν) & i f sym (father (ν)) = * a n d ν \neq ν_{\bar{E}} \\ right (father (ν)) & i f sym (father (ν)) = \cdot \\ ⊥ & o t h e r w i s e . \end{matrix}

with ⊥ is an artificial node such that

f (⊥) = ⊥

.

Using Proposition 5, the computation of the set of states requires

O (| E |^{3})

time and space complexity. This is due to the fact that the size of a c-continuation is in

O (| E |^{2})

. In order to reduce this complexity, we introduce a modified definition of pseudo-continuation introduced in [6] over an acyclic deterministic sub-automaton of the Thompson automaton, denoted by

C_{\equiv_{e}} (I d (T_{E}))

. When merging

\equiv_{e}

-equivalent states over

I d (T_{E})

, we get the automaton

C_{\equiv_{e}} (I d (T_{E}))

. This step requires a linear time w.r.t the size of E using Bubenzer’s algorithm [17], since the automaton

I d (T_{E})

is acyclic and deterministic.

In the following, a state

(q, N (q))

in the automaton

I d (T_{E})

is called a position state, if there exists

a \in A_{E}

and

(q^{'}, N (q^{'})) \in Q^{'}

such that

δ^{'} ((q^{'}, N (q^{'})), a) = (q, N (q))

. The state

(0, N (0))

is also considered as a position state.

It is obvious to see that each position state is associated with a unique position in

pos (E) \cup {0}

and then it’s can be associated with a c-continuation.

Let

(q, N (q))

and

(q^{'}, N (q^{'}))

two position states associated with the positions x and

x^{'}

in

pos (E)

. One can extend the

\equiv_{e}

relation over position states in

I d (T_{E})

as follows:

\begin{matrix} (q, N (q)) \equiv_{e} (q^{'}, N (q^{'})) ⟺ h (c_{x} (E)) \equiv h (c_{x^{'}} (E)) \end{matrix}

In the next, we will prove that the computation of the equivalence classes

C_{\equiv_{e}}

can be performed in a linear time w.r.t. the size of the regular expression over

I d (T_{E})

using the notion of pseudo-continuations.

For abbreviation, a state

(q, N (q))

in

I d (T_{E})

will be denoted by q. We denote by

C (q)

the pseudo-continuation associated with the position state q which is an implicit representation of its c-continuation

c_{x} (E)

, where x is the position letter of q. We will show that the computation of the equivalence classes

C_{\equiv_{e}}

turns on the computation of pseudo-continuation

C (q)

over a particular

ε

-transitions marking of the automaton

I d (T_{E})

.

Definition 3.

The pseudo-continuation

C (q)

associated with a position state

(q, N (q)) \in Q^{'}

in

I d (T_{E})

is recursively defined by:

\begin{matrix} C (q) & = & \{\begin{matrix} N (q) & i f δ^{'} (q, ε_{l +}) = q^{'} a n d δ^{'} (q, ε_{r +}) = q^{″}, \\ N (q^{'}) \cdot C (q^{″}) & i f δ^{'} (q, ε_{1}^{*}) = q^{″} a n d δ^{'} (q^{'}, a) = q f o r s o m e a \in A_{E}, \\ N (q^{'}) \cdot C (q^{″}) & i f δ^{'} (q, ε_{1}^{*}) = q^{'} a n d δ^{'} (q, ε_{2}^{*}) = q^{″}, \\ a \cdot C (q^{'}) & i f δ^{'} (q, a) = q^{'}, f o r a l l a \in A_{E}, \\ C (q^{'}) & i f δ^{'} (q, ε) = q^{'} . \end{matrix} \end{matrix}

(5)

In order to compute efficiently the set of pseudo-continuations associated with the position states in

I d (T_{E})

, we define an acyclic deterministic sub-automaton

C_{\equiv_{e}} (I d (T_{E})) = ⟨ Q_{\equiv_{e}}, A_{E}, I, δ^{″}, F ⟩

of the Thompson automaton as follows:

$A = {ε_{0}, \dots, ε_{| Exp |}} \cup A_{E}$ .
$Q_{\equiv_{e}} = {(q, C (q)) | (q, N (q)) \in Q^{'}}$ , i.e., a state in $I d (T_{E})$ is replaced by $(q, C (q))$ .
The transition function $δ^{″}$ is defined as follows:

Let us define the equivalence relation ≈ over the position states of

C_{\equiv_{e}} (I d (T_{E}))

as follows:

\begin{matrix} (q, C (q)) \approx (q^{'}, C (q^{'})) \Leftrightarrow C (q) \equiv C (q^{'}) \end{matrix}

(6)

Thus we have:

\begin{matrix} {[q]}_{\approx} = {[q^{'}]}_{\approx} \Leftrightarrow C (q) \equiv C (q^{'}) \end{matrix}

(7)

Let

\bar{h}

be the application that maps a letter

ε_{i}

to the letter i. By construction, the following proposition holds.

Proposition 6.

Let

(q, C (q))

be a position state in

C_{\equiv_{e}} (I d (T_{E}))

. We have:

\bar{h} ({\vec{L}}_{q} (C_{\equiv_{e}} (I d (T_{E})))) = C (q)

Example 6.

Let us consider the Thompson automaton

T_{E}

defined in previous examples. Figure 7 schematizes the derived automaton from

I d (T_{E})

after pseudo continuations computation for position states.

Figure 7. Pseudo-continuations computation for position states in

I d (T_{E})

.

Notice that dotted ε-transitions are temporarily disabled and dashed ones are temporarily added.

The position states in the automaton

I d (T_{E})

are

{0, 5, 8, 11, 16}

. According to the definition of a pseudo continuation (see Formula (7)), the pseudo-continuations associated with position states are computed over the automaton

I d (T_{E})

as follows:

\begin{matrix} C (5) = C (8) = C (11) = 10 \cdot 1 \\ C (16) = 15 \cdot 1 \end{matrix}

On the other hand, we have:

\begin{matrix} {\vec{L}}_{5} (C_{\equiv_{e}} (I d (T_{E}))) = {\vec{L}}_{8} (C_{\equiv_{e}} (I d (T_{E}))) = {\vec{L}}_{11} (C_{\equiv_{e}} (I d (T_{E}))) = ε_{10} \cdot ε_{1} \\ {\vec{L}}_{16} (C_{\equiv_{e}} (I d (T_{E}))) = ε_{15} \cdot ε_{1} \end{matrix}

The following proposition is fundamental to prove that the equivalence relation

\equiv_{e}

using the notion of c-continuation is the same when using pseudo-continuations

C (q)

.

Proposition 7.

Let q (resp.

q^{'}

) be a position state associated with a position

x \in pos (E)

(resp.

x^{'} \in pos (E)

) in

C_{\equiv_{e}} (I d (T_{E}))

. One has:

C (q) \equiv C (q^{'}) \Leftrightarrow h (c_{x} (E)) \equiv h (c_{x^{'}} (E))

As a consequence, the following proposition holds.

Proposition 8.

Let q and

q^{'}

be two position states in

C_{\equiv_{e}} (I d (T_{E}))

. One has:

q \equiv_{e} q^{'} \Leftrightarrow C (q) \equiv C (q^{'})

Theorem 5.

Let E be a regular expression and

T_{E}

the associated Thompson automaton. The relation

\equiv_{e}

can be computed over

C_{\equiv_{e}} (I d (T_{E}))

in

O (| E |)

time.

Proof.

From Proposition 8, one can deduce that the computation of the equivalence relation

\equiv_{e}

, turn to apply the Myhill-Nerode relation on the states of the automaton

C_{\equiv_{e}} (I d (T_{E}))

. By definition, this last is acyclic and deterministic. Then, its minimization using Bubenzer’s algorithm [17] requires

O (| E |)

time and space complexity. □

Example 7

(Continues). The automaton

C_{\equiv_{e}} (I d (T_{E}))

, schematized in the Figure 8, is obtained from

I d (T_{E})

after merging

\equiv_{e}

-equivalent position states

5, 8

and 11.

Figure 8. The automaton

C_{\equiv_{e}} (I d (T_{E}))

associated with

E = {(a^{*} + b a^{*} + b^{*})}^{*}

.

The next step of our approach consists of the transitions and final states computation of the equation automaton

E_{E}

using epsilon removal operation, denoted by

r m e p s ()

, over the automaton

C_{\equiv_{e}} (I d (T_{E}))

.

5.2. Computation of Transitions and Final States

After merging

\equiv_{e}

-equivalent states in the previous step, we obtain a reduced automaton

C_{\equiv_{e}} (I d (T_{E}))

having the same set of states as the equation automaton. To compute the transition function, we first enable the cyclic transitions previously disabled in the case when

E = F^{*}

on

C_{\equiv_{e}} (I d (T_{E}))

.

Let

Q_{\equiv_{e}}

be the set of states of

C_{\equiv_{e}} (I d (T_{E}))

. Recall that epsilon removal operation is denoted by,

r m e p s (p)

for a state

p \in Q

and

r m e p s (A)

denotes the resulted automaton after removing marked and non-marked

ε

-transitions from the automaton

A

.

As a consequence of Lemma 5 from [8], the following Lemma yeilds.

Lemma 2.

(Lemma 5 in [8]). Let q and

q^{'}

be two position states in

Q_{\equiv_{e}}

associated respectively with the positions x and

x^{'}

in

pos (E) \cup {0}

, we have:

q \in r m e p s (q^{'}) i f f h (c_{x} (E)) \in \partial_{a} (h (c_{x^{'}} (E))), f o r s o m e a \in A_{E} .

The set of destination states of the outgoing transitions from a state

q \in Q_{\equiv_{e}}

is then equal to

{p \in Q_{\equiv_{e}} | p \in r m e p s (q) and p is a position state}

Lemma 3.

Let q be a position state in

Q_{\equiv_{e}}

associated with a position x in

pos (E) \cup {0}

, we have:

F \in r m e p s (q) i f f λ (c_{x} (E)) = 1 .

Proposition 9.

We have

E_{E} = r m e p s (C_{\equiv_{e}} (I d (T_{E})))

Example 8.

(Continues). Let us consider the automaton

C_{\equiv_{e}} (I d (T_{E}))

of the Example 7. The final states and the transitions of the equation automaton are computed over

C_{\equiv_{e}} (I d (T_{E}))

using epsilon removal operation

r m e p s

as follows:

The set of states of the equation automaton are ${0, {5, 8, 11}, 16}$ .
Since the final state of $C_{\equiv_{e}} (I d (T_{E}))$ is the state 19 and $19 \in r m e p s (0)$ , $19 \in r m e p s ({5, 8, 11})$ , and $19 \in r m e p s (16)$ , then the set of final states are ${0, {5, 8, 11}, 16}$ .
There are two paths in $C_{\equiv_{e}} (I d (T_{E}))$ from the state 0 to the state ${5, 8, 11}$ labeled respectively by $ε \cdot ε \cdot ε \cdot ε \cdot a$ and $ε \cdot ε \cdot ε \cdot b$ , then ${5, 8, 11} \in r m e p s (0)$ . Consequently, two transitions $(0, a, {5, 8, 11})$ and $(0, b, {5, 8, 11})$ are added to the equation automaton. The same process will be applied for other transitions.

Since there are

O (| E |)

states in

C_{\equiv_{e}} (I d (T_{E}))

and the operation

r m e p s (C_{\equiv_{e}} (I d (T_{E})))

is performed on exactly

| Q_{\equiv_{e}} |

states, the following theorem holds.

Theorem 6.

Let E be a regular expression. The equation automaton of E can be computed in

O (| E | \cdot | Q_{\equiv_{e}} |)

.

6. Conclusions

In this paper, we presented a fast and sophisticated construction of the equation automaton from a regular expression over its associated Thompson automaton. The time complexity of our algorithm is at least as favorable as that of the best previously known algorithm. It is based on the minimization of acyclic deterministic finite automata and epsilon removal operations. This allowed us a construction of the equation automaton in

O (| E | \cdot | Q_{\equiv_{e}} |)

time and space complexity where

| Q_{\equiv_{e}} |

denotes the number of transitions of the produced automaton. The implementation of the proposed algorithm is available under the following repository: https://github.com/FaissalOuardi/Equation-automaton (accessed on 27 May 2021).

Author Contributions

Conceptualization, F.O., Z.L. and B.E.; methodology, F.O., Z.L. and B.E.; validation, F.O., Z.L. and B.E.; formal analysis, F.O., Z.L. and B.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Source code can be found under the following link: https://github.com/FaissalOuardi/Equation-automaton (accessed on 27 May 2021).

Acknowledgments

We wish to thank the referees for the care they put into reading the previous versions of this manuscript. Their comments were invaluable in depth and detail, and the current version owes much to their efforts.

Conflicts of Interest

The authors declare no conflict of interest.

References

Mirkin, B.G. Novyj algoritm postroéniá bazisa v ázyké régularnyh vyražénij. Izvéstiá Akadémii Nauk SSSR. Engineering cybernetics, no. 5 (1966). English translation of the preceding: Brzozowski, J. An algorithm for constructing a base in a language of regular expressions. pp. 110–116. J. Symb. Log. 1971, 36, 694. [Google Scholar]
Antimirov, V. Partial derivatives of regular expressions and finite automaton constructions. Theor. Comput. Sci. 1996, 155, 291–319. [Google Scholar] [CrossRef]
Glushkov, V.M. The abstract theory of automata. Russ. Math. Surv. 1961, 16, 1–53. Available online: https://iopscience.iop.org/article/10.1070/RM1961v016n05ABEH004112 (accessed on 27 May 2021). [CrossRef]
McNaughton, R.F.; Yamada, H. Regular expressions and state graphs for automata. IEEE Trans. Electron. Comput. 1960, 9, 39–57. [Google Scholar] [CrossRef]
Ziadi, D.; Ponty, J.-L.; Champarnaud, J.-M. A New Quadratic Algorithm to Convert a Regular Expression into an Automaton. In Proceedings of the Workshop on Implementing Automata, London, ON, Canada, 29–31 August 1996; pp. 109–119. [Google Scholar]
Champarnaud, J.-M.; Ziadi, D. Canonical derivatives, partial derivatives and finite automaton constructions. Theor. Comput. Sci. 2002, 289, 137–163. [Google Scholar] [CrossRef]
Khorsi, A.; Ouardi, F.; Ziadi, D. Fast equation automaton computation. J. Discret. Algorithms 2008, 6, 433–448. [Google Scholar] [CrossRef][Green Version]
Allauzen, C.; Mohri, M. A Unified Construction of the Glushkov, Follow, and Antimirov Automata. In Proceedings of the International Conference of Mathematical Foundations of Computer Science, Stará Lesná, Slovakia, 28 August–1 September 2006; pp. 110–121. [Google Scholar]
Ilie, L.; Yu, S. Follow automata. Inf. Comput. 2003, 186, 140–162. [Google Scholar] [CrossRef]
Champarnaud, J.-M.; Nicart, F.; Ziadi, D. From the ZPC Structure of a Regular Expression to its Follow Automaton. IJAC 2006, 16, 17–34. [Google Scholar] [CrossRef]
Kleene, S. Representation of Events in Nerve Nets and Finite Automata; Automata Studies, Ann. Math. Studies 34; Princeton University Press: Princeton, NJ, USA, 1956; pp. 3–41. [Google Scholar]
Thompson, K. Regular Expression Search Algorithm. Commun. ACM 1968, 11, 410–422. [Google Scholar] [CrossRef]
Hopcroft, J. An n log n Algorithm for Minimizing States in a Finite Automaton; Technical Report; Stanford University, CS Dept.: Stanford, CA, USA, 1971. [Google Scholar]
Hopcroft, J.E.; Ullman, J.D. Introduction to Automata Theory, Languages and Computation; Addison-Wesley: Reading, MA, USA, 1979. [Google Scholar]
Sakarovitch, J.; Thomas, R. Elements of Automata Theory; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
Brüggemann-Klein, A. Regular expressions into finite automata. Theor. Comp. Sci. 1993, 120, 117–126. [Google Scholar] [CrossRef]
Bubenzer, J. Cycle-aware minimization of acyclic deterministic finite-state automata. J. Discret. Appl. Math. 2014, 163, 238–246. [Google Scholar] [CrossRef]
Revuz, D. Minimization of acyclic deterministic automata in linear time. Theor. Comput. Sci. 1992, 92, 181–189. [Google Scholar] [CrossRef]
Giammarresi, D.; Ponty, J.-L.; Wood, D.; Ziadi, D. A characterization of Thompson digraphs. Discret. Appl. Math. 2004, 134, 317–337. [Google Scholar] [CrossRef]
Myhill, J. Finite automata and the representation of events. In WADD TR-57-624; Wright Patterson AFB: Dayton, OH, USA, 1957; pp. 112–137. [Google Scholar]
Nerode, A. Linear automata transformation. Proc. AMS 1958, 9, 541–544. [Google Scholar] [CrossRef]

Figure 1. Thompson construction of an NFA.

Figure 2. Thompson automaton

T_{E}

.

Figure 3. The equation automaton

E_{E}

.

Figure 4. The c-continuation automaton

C_{E}

.

Figure 5. (a) The c-continuation automaton

C_{E}

versus (b) The quotient automaton

C_{E} /_{\equiv_{e}}

.

Figure 6. The automaton

I d (T_{E})

associated with

E = {(a^{*} + b a^{*} + b^{*})}^{*}

.

Figure 7. Pseudo-continuations computation for position states in

I d (T_{E})

.

Figure 8. The automaton

C_{\equiv_{e}} (I d (T_{E}))

associated with

E = {(a^{*} + b a^{*} + b^{*})}^{*}

.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Efficient Construction of the Equation Automaton

Abstract

1. Introduction

2. Preliminaries

2.1. Regular Expressions and Finite Automata

2.1.1. Regular Expressions and Languages

2.1.2. Finite Automata and Recognizable Languages

2.2. Thompson Automaton

3. Equation Automaton

3.1. C-Continuation Automaton

3.2. Equation Automaton as a Quotient of C-Continuation Automaton

4. Allauzen and Mohri’s Algorithm

5. Efficient Conversion Algorithm

5.1. Computation of States

5.1.1. Sub-Expressions Identification

5.1.2. $\equiv_{e}$ -Equivalent States Merging

5.2. Computation of Transitions and Final States

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Efficient Construction of the Equation Automaton

Abstract

1. Introduction

2. Preliminaries

2.1. Regular Expressions and Finite Automata

2.1.1. Regular Expressions and Languages

2.1.2. Finite Automata and Recognizable Languages

2.2. Thompson Automaton

3. Equation Automaton

3.1. C-Continuation Automaton

3.2. Equation Automaton as a Quotient of C-Continuation Automaton

4. Allauzen and Mohri’s Algorithm

5. Efficient Conversion Algorithm

5.1. Computation of States

5.1.1. Sub-Expressions Identification

5.1.2. ≡ e -Equivalent States Merging

5.2. Computation of Transitions and Final States

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

5.1.2. $\equiv_{e}$ -Equivalent States Merging