Article

Efficient Algorithms for Computing the Inner Edit Distance of a Regular Language via Transducers

1 School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada
2 Department of Mathematics and Computing Science, Saint Mary’s University, Halifax, NS B3H 3C3, Canada
3 Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada
* Author to whom correspondence should be addressed.
Algorithms 2018, 11(11), 165; https://doi.org/10.3390/a11110165
Submission received: 6 October 2018 / Revised: 16 October 2018 / Accepted: 18 October 2018 / Published: 23 October 2018
(This article belongs to the Special Issue String Matching and Its Applications)

Abstract

The concept of edit distance and its variants has applications in many areas such as computational linguistics, bioinformatics, and synchronization error detection in data communications. Here, we revisit the problem of computing the inner edit distance of a regular language given via a Nondeterministic Finite Automaton (NFA). This problem relates to the inherent maximal error-detecting capability of the language in question. We present two efficient algorithms for solving this problem, both of which execute in time $O(r^2 n^2 d)$, where $r$ is the cardinality of the alphabet involved, $n$ is the number of transitions in the given NFA, and $d$ is the computed edit distance. We have implemented one of the two algorithms and present here a set of performance tests. The correctness of the algorithms is based on the connection between word distances and error detection and the fact that nondeterministic transducers can be used to represent the errors (resp., edit operations) involved in error-detection (resp., in word distances).

1. Introduction

The concept of edit distance and its variants has applications in many areas such as computational linguistics [1], bioinformatics [2], and synchronization error detection in data communications [3]. The edit distance of a language $L$ with at least two words, also referred to as the inner edit distance of $L$, is the minimum edit distance between any two different words in $L$. In [4], the author considers the problem of computing the edit distance of a regular language, which is given via a Nondeterministic Finite Automaton (NFA), or a Deterministic Finite Automaton (DFA). For a given automaton $\mathbf{a}$ with $n$ transitions and an alphabet of $r$ symbols, the algorithm proposed in [4] has worst-case time complexity
$$O(r^2 n^2 q^2 (q + r)), \qquad (1)$$
where $q$ is either the number of states in $\mathbf{a}$ (if $\mathbf{a}$ is a DFA), or the square of the number of states in $\mathbf{a}$ (if $\mathbf{a}$ is an NFA). If the size of the alphabet is ignored and the automaton in question has only states that can be reached from the start state, then the number of states is $O(n)$ and the worst-case time complexity shown in (1) can be written as
$$O(n^5) \text{ for DFAs, and } O(n^8) \text{ for NFAs}.$$
In this paper, we present two efficient algorithms to compute the inner edit distance of a regular language given via an NFA with $n$ transitions (see Theorems 1 and 3). Both algorithms, which are called DistErrDetect and DistInpAlter, have the same worst-case time complexity
$$O(n^2 d r^2),$$
where $d$ is the computed distance, which is a significant improvement over the original algorithm in [4].
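For reference, the two bounds quoted above for the algorithm of [4] can be recovered by a direct substitution. The following is only a sketch of that arithmetic, under the assumptions already stated in the text: the alphabet size $r$ is treated as a constant, and $q = O(n)$ for a DFA while $q = O(n^2)$ for an NFA.

$$O\bigl(r^2 n^2 q^2 (q + r)\bigr) = O(n^2 q^3) = \begin{cases} O(n^2 \cdot n^3) = O(n^5), & q = O(n) \text{ (DFA)}, \\ O\bigl(n^2 \cdot (n^2)^3\bigr) = O(n^8), & q = O(n^2) \text{ (NFA)}. \end{cases}$$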
Our first algorithm, DistErrDetect, is based on the general method of [5] for computing distances via the error-detection property. Now, however, we have an efficient way of realizing algorithmically that general method using an incremental construction of a (nondeterministic) transducer and the test of [6] for partial identity for transducers. In our second algorithm, DistInpAlter, the idea is to model the edit operations of the desired distance using an efficient, in terms of size, input-altering transducer (a transducer whose output is always different from the input used). Please see subsequent sections for definitions of terms. For clarity of presentation, we give in detail not only the new algorithms, but also their preliminary versions PrelimDistErrDetect and PrelimDistInpAlter that could possibly be applied to other types of distances. We have implemented the preliminary and final versions of the second algorithm (PrelimDistInpAlter and DistInpAlter) in Python using the well maintained, open source package FAdo for automata [7]. We have also tested our implementation experimentally, and we present in this paper the outcomes of the tests.
We note that some related problems involving distances between words and languages can be found in [8,9] (edit distance between a word and a language), and in [10,11,12,13,14] (various distances between two languages). Also in [15], the newer concept of edit distance with moves is investigated. The problem considered here is technically different, however, as the desired distance involves different words within the same language. More specifically, if we used directly the tools of [10,11], for instance, to compute an edit string with minimal number of errors between the given language and itself, then that string would simply be an edit string of zero errors, as the edit distance between any word and itself is zero. We also note that the inner prefix distance of a regular language, which is quite different from the inner edit distance, is considered in [16] and computed in time $O(n^2 \log n)$.
The paper is organized as follows. The next section contains basic notions on languages, word relations, finite-state machines and edit-strings. Section 3 describes the approach of computing the desired edit distance via the concept of error-detection and presents the preliminary version PrelimDistErrDetect of the first algorithm. Then, Section 4 explains the improved and final version DistErrDetect of the first algorithm. In Section 5, it is shown that the edit distance is definable via an efficient input-altering transducer—see Theorem 2—and then the second algorithm DistInpAlter is presented. Section 6 discusses the implementation and testing of the second algorithm and its preliminary version. The last section contains a few concluding remarks. The appendix contains the proofs of two technical lemmata.

2. Notation, Background and Preliminary Results

This section contains basic terminology about formal languages, automata, transducers, and edit strings. Most of the basic notions presented here can be found in various texts such as [17,18,19,20,21].

2.1. Sets, Words, Languages, Channels

The set of positive integers is denoted by $\mathbb{N}$. Then, $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$. If $S$ is any set, the expression $|S|$ denotes the cardinality of $S$. We use standard basic notation and terminology for alphabets, words and languages (see [22], for instance). For example, $\Sigma$ denotes an alphabet, $\Sigma^+$ the set of nonempty words, $\lambda$ the empty word, $\Sigma^* = \Sigma^+ \cup \{\lambda\}$, and $|w|$ the length of the word $w$. We write $u \le_p w$ to indicate that the word $u$ is a prefix of $w$, that is, $w = uv$ for some word $v$. Then, $u <_p w$ means that $u$ is a proper prefix, that is, $u \le_p w$ and $u \ne w$. We use the concepts of (formal) language and concatenation between words, or languages, in the usual way. We say that $w$ is an $L$-word if $w \in L$ and $L$ is a language.
A binary word relation $\rho$ on $\Sigma^*$ is any subset of $\Sigma^* \times \Sigma^*$. The domain of $\rho$ is $\{u \mid (u, v) \in \rho \text{ for some } v \in \Sigma^*\}$. A channel $\gamma$ is a binary relation on $\Sigma^*$ that is input-preserving; that is, $\gamma \subseteq \Sigma^* \times \Sigma^*$ and $(w, w) \in \gamma$ for all words $w$ in the domain of $\gamma$. When $(u, v) \in \gamma$, we say that $u$ can be received as $v$ via the channel $\gamma$, or $v$ is a possible output of $\gamma$ when $u$ is used as input. If $v \ne u$, then we say that $u$ can be received (via $\gamma$) with errors. Here, we only consider the channel $\mathrm{sid}(k)$, for some $k \in \mathbb{N}$, such that $(u, v) \in \mathrm{sid}(k)$ if and only if $v$ can be obtained by applying at most $k$ errors in $u$, where an error could be a deletion of a symbol in $u$, a substitution of a symbol in $u$ with another symbol, or an insertion of a symbol in $u$ (see further below for a more rigorous definition via edit strings).

2.2. NFAs and Transducers

A Nondeterministic Finite Automaton with empty transitions, $\lambda$-NFA for short, or just automaton, is a quintuple $\mathbf{a} = (Q, \Sigma, T, s, F)$ such that $Q$ is the finite set of states, $\Sigma$ is the alphabet, $s \in Q$ is the start (or initial) state, $F \subseteq Q$ is the set of final states, and $T \subseteq Q \times (\Sigma \cup \{\lambda\}) \times Q$ is the finite set of transitions or edges. Let $(p, x, q)$ be a transition of $\mathbf{a}$. Then, $x$ is called the label of the transition, and we say that the transition goes out of $p$. We also use the notation
$$p \xrightarrow{x} q$$
for a transition $(p, x, q)$. The $\lambda$-NFA $\mathbf{a}$ is called an NFA, if no transition label is empty, that is, $T \subseteq Q \times \Sigma \times Q$. A Deterministic Finite Automaton, DFA for short, is a special type of NFA in which, for each state $p$, there are no two transitions with equal labels going out of $p$.
A path of $\mathbf{a}$ is a finite sequence of consecutive transitions:
$$(p_0, x_1, p_1)(p_1, x_2, p_2) \cdots (p_{\ell-1}, x_\ell, p_\ell), \qquad (4)$$
for some nonnegative integer $\ell$, where we use concatenation of these transitions to denote the path. Then, if $P_1$ and $P_2$ are two paths such that the last state of $P_1$ is equal to the first state of $P_2$, $P_1 P_2$ denotes the path resulting by concatenating the transitions of $P_1$ and $P_2$.
The word $x_1 \cdots x_\ell$ is called the label of the path in (4). We write $p_0 \xrightarrow{x} p_\ell$ to indicate that there is a path with label $x$ from $p_0$ to $p_\ell$. A path as above is called a computation of $\mathbf{a}$ if $p_0$ is the start state. It is called an accepting path/computation if $p_0$ is the start state and $p_\ell$ is a final state. The language accepted by $\mathbf{a}$, denoted as $L(\mathbf{a})$, is the set of labels of all the accepting paths of $\mathbf{a}$. The automaton $\mathbf{a}$ is called trim, if every state appears in some accepting path of $\mathbf{a}$.
A (finite nondeterministic) transducer [17,20] is a quintuple $\mathbf{t} = (Q, \Sigma, T, s, F)$ such that $Q, s, F$ are exactly the same as those in $\lambda$-NFAs, $\Sigma$ is the alphabet, and $T \subseteq Q \times (\Sigma \cup \{\lambda\}) \times (\Sigma \cup \{\lambda\}) \times Q$ is the finite set of transitions or edges. (In the literature, a transducer also has an output alphabet $\Gamma$, but we consider here that $\Gamma$ is the same as the input alphabet $\Sigma$. Without further mention all transducers considered here are nondeterministic.) We write $(p, x/y, q)$, or $p \xrightarrow{x/y} q$, for a transition; the label here is $(x/y)$, with $x$ being the input and $y$ being the output label of the transition. The concepts of path, computation, accepting path, and trim transducer are similar to those in $\lambda$-NFAs. However, the label of a transducer path $(p_0, x_1/y_1, p_1) \cdots (p_{\ell-1}, x_\ell/y_\ell, p_\ell)$ is the pair $(x_1 \cdots x_\ell, y_1 \cdots y_\ell)$ of the two words consisting of the concatenations of the input and output labels in the path, respectively. The relation realized by the transducer $\mathbf{t}$, denoted by $R(\mathbf{t})$, is the set of labels in all the accepting paths of $\mathbf{t}$. We write $\mathbf{t}(u)$ for the set of possible outputs of $\mathbf{t}$ on input $u$, that is,
$$v \in \mathbf{t}(u) \text{ if and only if } (u, v) \in R(\mathbf{t}).$$
The transducer $\mathbf{t}$ is called functional, if the relation $R(\mathbf{t})$ is a function, that is, $\mathbf{t}(u)$ consists of at most one word, for all input words $u$. We say that $\mathbf{t}$ realizes a partial identity, if $v \in \mathbf{t}(u)$ implies that $v = u$.
If $\mathbf{m}$ is an automaton or a transducer, then the size of $\mathbf{m}$, denoted by $|\mathbf{m}|$, is the number of states plus the number of transitions in $\mathbf{m}$. We shall write
$$Q_{\mathbf{m}},\ T_{\mathbf{m}} \text{ for the sets of states and transitions of } \mathbf{m}, \text{ respectively}.$$
If $\mathbf{m}$ is trim then $|Q_{\mathbf{m}}| \le |T_{\mathbf{m}}| + 1$; thus,
$$\text{if } \mathbf{m} \text{ is trim then } |\mathbf{m}| = O(|T_{\mathbf{m}}|).$$
We recall that making an automaton or transducer $\mathbf{m}$ trim can be done in linear time $O(|\mathbf{m}|)$.
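The trimming step mentioned above can be sketched as two graph searches, one forward from the start state and one backward from the final states. The representation below (a set of `(p, x, q)` triples) and the function name `trim` are assumptions made for illustration; they are not taken from the paper or from FAdo.

```python
from collections import deque

def trim(transitions, start, finals):
    """Keep only states lying on some accepting path.

    transitions: set of (p, x, q) triples; start: a state; finals: set of states.
    Returns (useful_states, kept_transitions, kept_finals).
    """
    succ, pred = {}, {}
    for p, _, q in transitions:
        succ.setdefault(p, set()).add(q)
        pred.setdefault(q, set()).add(p)

    def reach(sources, edges):
        # Plain breadth-first search over the given adjacency map.
        seen = set(sources)
        queue = deque(seen)
        while queue:
            p = queue.popleft()
            for q in edges.get(p, ()):
                if q not in seen:
                    seen.add(q)
                    queue.append(q)
        return seen

    # A state is useful if it is reachable from the start state
    # and can itself reach some final state.
    useful = reach([start], succ) & reach(finals, pred)
    kept = {(p, x, q) for (p, x, q) in transitions if p in useful and q in useful}
    return useful, kept, set(finals) & useful
```

Each search visits every state and transition at most once, which is consistent with the linear bound recalled above.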

2.3. Edit Strings and Edit Distance

The alphabet $E_\Sigma$ of the (basic) edit operations, which depends on the alphabet $\Sigma$ of ordinary symbols, consists of all symbols $(x/y)$ such that $x, y \in \Sigma \cup \{\lambda\}$ and at least one of $x$ and $y$ is in $\Sigma$. If $(x/y) \in E_\Sigma$ and $x$ is not equal to $y$, then $(x/y)$ is called an error [23]. The edit operations $(a/b)$, $(\lambda/a)$, $(a/\lambda)$, where $a, b \in \Sigma \setminus \{\lambda\}$ and $a \ne b$, are called substitution, insertion, deletion, respectively. We write $(\lambda/\lambda)$ for the empty word over the alphabet $E_\Sigma$. We note that $\lambda$ is used as a formal symbol in the elements of $E_\Sigma$. For example, if $a, b \in \Sigma$, then $(\lambda/a)(b/b) \ne (b/a)(\lambda/b)$. The elements of $E_\Sigma^*$ are called edit strings. The weight of an edit string $h$, denoted by $\mathrm{weight}(h)$, is the number of errors occurring in $h$. For example, for
$$g = (a/a)(a/\lambda)(b/b)(b/a)(b/b),$$
$\mathrm{weight}(g) = 2$. The input and output parts of an edit string $h = (x_1/y_1) \cdots (x_n/y_n)$ are the words (over $\Sigma$) $x_1 \cdots x_n$ and $y_1 \cdots y_n$, respectively. We write $\mathrm{inp}(h)$ for the input part and $\mathrm{out}(h)$ for the output part of $h$. For example, for the $g$ shown above, $\mathrm{inp}(g) = aabbb$ and $\mathrm{out}(g) = abab$. The inverse of an edit string $h$ is the edit string resulting by inverting the order of the input and output parts in every edit operation in $h$. For example, the inverse of $g$ shown above is
$$(a/a)(\lambda/a)(b/b)(a/b)(b/b).$$
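The notions of weight, input part, output part, and inverse can be made concrete with a few lines of Python. This is a minimal sketch that assumes an edit string is stored as a list of `(x, y)` pairs with the empty Python string standing for $\lambda$; the names `weight`, `inp`, `out`, and `inverse` mirror the text, but the representation is an illustrative choice.

```python
LAMBDA = ""  # stands for the empty word λ inside an edit operation

def weight(h):
    """Number of errors, i.e., edit operations (x/y) with x != y."""
    return sum(1 for x, y in h if x != y)

def inp(h):
    """Input part: concatenation of the left components."""
    return "".join(x for x, _ in h)

def out(h):
    """Output part: concatenation of the right components."""
    return "".join(y for _, y in h)

def inverse(h):
    """Swap input and output in every edit operation."""
    return [(y, x) for x, y in h]

# The edit string g from the text: (a/a)(a/λ)(b/b)(b/a)(b/b)
g = [("a", "a"), ("a", LAMBDA), ("b", "b"), ("b", "a"), ("b", "b")]
```

With this encoding, `weight(g)`, `inp(g)`, and `out(g)` reproduce the values 2, `aabbb`, and `abab` given in the text.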
The channel $\mathrm{sid}(k)$ can be defined more rigorously via edit strings:
$$\mathrm{sid}(k) = \{(u, v) \mid u = \mathrm{inp}(h),\ v = \mathrm{out}(h), \text{ for some } h \in E_\Sigma^* \text{ with } \mathrm{weight}(h) \le k\}.$$
The edit (or Levenshtein) distance [24] between two words $u$ and $v$, denoted by $\delta(u, v)$, is the smallest number of errors (substitutions, insertions and deletions) that can be used to transform $u$ to $v$. More formally,
$$\delta(u, v) = \min\{\mathrm{weight}(h) \mid h \in E_\Sigma^*,\ \mathrm{inp}(h) = u,\ \mathrm{out}(h) = v\}.$$
We say that an edit string $h$ realizes the edit distance between two words $u$ and $v$, if $\mathrm{weight}(h) = \delta(u, v)$ and, either $\mathrm{inp}(h) = u$ and $\mathrm{out}(h) = v$, or $\mathrm{inp}(h) = v$ and $\mathrm{out}(h) = u$. For example, for $\Sigma = \{a, b\}$, we have that $\delta(ababa, babbb) = 3$ and the edit string
$$h = (a/\lambda)(b/b)(a/a)(b/b)(a/b)(\lambda/b)$$
realizes $\delta(ababa, babbb)$. Note that several edit strings can realize the distance $\delta(u, v)$. If $L$ is a language containing at least two words, then the edit distance of $L$ is
$$\delta(L) = \min\{\delta(u, v) \mid u, v \in L \text{ and } u \ne v\}.$$
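For a finite language given as an explicit set of words, both definitions can be evaluated directly. The sketch below uses the textbook dynamic-programming recurrence for the Levenshtein distance and then minimizes over all pairs of distinct words; it only illustrates the definitions and is not one of the algorithms of this paper, which work on an NFA and also handle infinite languages. The function names are hypothetical.

```python
from itertools import combinations

def edit_distance(u, v):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(v) + 1))          # distances from the empty prefix of u
    for i, x in enumerate(u, start=1):
        cur = [i]
        for j, y in enumerate(v, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # keep or substitute
        prev = cur
    return prev[-1]

def inner_edit_distance(words):
    """Minimum distance over all pairs of distinct words of a finite language."""
    words = set(words)
    if len(words) < 2:
        raise ValueError("the language must contain at least two words")
    return min(edit_distance(u, v) for u, v in combinations(words, 2))
```

The pairwise minimum costs quadratic time in the number of words, which is one reason an automaton-based method is needed for regular languages in general.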
Testing whether a given NFA accepts at least two words is not a concern in this paper, but we note that this can be done efficiently (in linear time via a breadth first search type algorithm) [25].
The next lemma comes from [4]. The bound $D_{\mathbf{a}}$ is always less than or equal to the number of states in the NFA $\mathbf{a}$. Moreover, there are NFAs for which this bound is tight (see Section 6).
Lemma 1.
For every NFA $\mathbf{a}$ accepting at least two words, we have that
$$\delta(L(\mathbf{a})) \le D_{\mathbf{a}},$$
where $D_{\mathbf{a}}$ is the number of states in the longest path in $\mathbf{a}$ from the start state having no repeated state.
However, the bound $D_{\mathbf{a}}$ is of no use in our context, as the problem of determining the length of a longest path in a given automaton, or a graph in general, is NP-complete since an algorithm solving this problem can be used to decide the existence of a Hamiltonian path; see for example [26]. There are many ways to obtain an efficiently computable upper bound on the edit distance of $L(\mathbf{a})$ that is always at most equal to the number of states in $\mathbf{a}$. For example, that distance is always less than or equal to the distance of the two shortest accepted words. We agree to use this as a working upper bound:
Lemma 2.
For every NFA $\mathbf{a}$ accepting at least two words, we have that
$$\delta(L(\mathbf{a})) \le B_{\mathbf{a}},$$
where $B_{\mathbf{a}}$ is the edit distance of two shortest words in $L(\mathbf{a})$.

3. Edit Distance via Error-Detection

In [5], the authors discuss a conceptual method for computing integral distances of regular languages—integral means that all distance values are nonnegative integers—via the property of error-detection. In this section, we review that method and produce a concrete preliminary algorithm for computing the edit distance of a regular language.
A language $L$ is error-detecting for a channel $\gamma$ [27], if no $L$-word can be received as a different $L$-word via $\gamma$; that is, for any words $u$ and $v$,
$$u, v \in L \text{ and } (u, v) \in \gamma \;\rightarrow\; u = v. \qquad (6)$$
(The definition of error-detection in [27] uses $L \cup \{\lambda\}$ instead of $L$ in (6). This slight change makes the presentation here simpler and has no bearing on any existing results regarding error-detecting languages.)
Remark 1.
The error-detection method of [5] for computing inner distances of regular languages is based on the following observations, where $\mathbf{a}$ is an NFA and $\mathbf{t}$ is an input-preserving transducer.
1. A language $L$ is error-detecting for $\mathrm{sid}(m)$, if and only if $\delta(L) > m$.
2. $\delta(L)$ is equal to the positive integer $k$ such that $L$ is error-detecting for $\mathrm{sid}(k-1)$ and $L$ is not error-detecting for $\mathrm{sid}(k)$.
3. We have the following facts from [27]. A language $L$ is error-detecting for a channel $\gamma$ if and only if the following relation is a function
$$\gamma \cap (L \times \Sigma^*) \cap (\Sigma^* \times L). \qquad (7)$$
Moreover, if $\mathbf{a}$ accepts $L$ and $\mathbf{t}$ realizes $\gamma$, then a transducer $(\mathbf{t} \downarrow \mathbf{a})$ realizing $\gamma \cap (L \times \Sigma^*)$ can be constructed in time $O(|\mathbf{t}||\mathbf{a}|)$ and, analogously, a transducer $(\mathbf{t} \uparrow \mathbf{a})$ realizing $\gamma \cap (\Sigma^* \times L)$ can be constructed in time $O(|\mathbf{t}||\mathbf{a}|)$. Both constructions are cross-product constructions. In each case, the resulting transducer has $O(|Q_{\mathbf{a}}||Q_{\mathbf{t}}|)$ states and $O(|T_{\mathbf{a}}||T_{\mathbf{t}}|)$ transitions. Thus, the transducer
$$(\mathbf{t} \downarrow \mathbf{a} \uparrow \mathbf{a})$$
realizes relation (7) and can be constructed in time $O(|\mathbf{t}||\mathbf{a}|^2)$.
4. There is an $O(|T_{\mathbf{s}}|^2 + r|Q_{\mathbf{s}}|^2)$ time algorithm that decides whether a given transducer $\mathbf{s}$ is functional [6,28], where $r$ is the size of the alphabet.
Using the above observations, we present first a preliminary error-detection-based algorithm for computing the desired edit distance.
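    Algorithm PrelimDistErrDetect
0. Input: NFA $\mathbf{a}$
1. Let $B_{\mathbf{a}}$ be the edit distance bound in Lemma 2
2. Let $\min \leftarrow 1$ and $\max \leftarrow B_{\mathbf{a}} - 1$
3. Perform binary search to find the largest $k$ in $\{\min, \ldots, \max\}$ for which $L(\mathbf{a})$ is error-detecting for $\mathrm{sid}(k)$ as follows:
   while ($\min \le \max$)
   a) Let $k \leftarrow \lfloor (\min + \max)/2 \rfloor$
   b) Construct the transducer $\mathbf{sid}_k$ realizing the channel $\mathrm{sid}(k)$ (see Figure 1)
   c) Construct the transducer $\mathbf{t}_k \leftarrow (\mathbf{sid}_k \downarrow \mathbf{a} \uparrow \mathbf{a})$
   d) If $L(\mathbf{a})$ is error-detecting for $\mathrm{sid}(k)$, let $\min \leftarrow k + 1$
      Else let $\max \leftarrow k - 1$
4. return $\min$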
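The control flow of the binary search can be sketched independently of how step (3d) is decided. In the Python sketch below, `is_error_detecting` is a caller-supplied predicate; the helper `brute_force_oracle` is a hypothetical stand-in that decides error-detection for a finite language by comparing all word pairs, so that the loop can be exercised without constructing any transducer. Neither name comes from the paper.

```python
def edit_distance(u, v):
    """Levenshtein distance (standard dynamic program)."""
    prev = list(range(len(v) + 1))
    for i, x in enumerate(u, start=1):
        cur = [i]
        for j, y in enumerate(v, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def brute_force_oracle(words):
    """Hypothetical replacement for step 3d on a finite language:
    L is error-detecting for sid(k) iff all distinct words differ by more than k."""
    words = sorted(set(words))
    def is_error_detecting(k):
        return all(edit_distance(u, v) > k
                   for i, u in enumerate(words) for v in words[i + 1:])
    return is_error_detecting

def prelim_dist(upper_bound, is_error_detecting):
    """Binary search of steps 2-4: returns 1 + the largest k in
    {1, ..., upper_bound - 1} for which the predicate holds (1 if none)."""
    lo, hi = 1, upper_bound - 1
    while lo <= hi:
        k = (lo + hi) // 2
        if is_error_detecting(k):
            lo = k + 1      # distance is larger than k
        else:
            hi = k - 1      # distance is at most k
    return lo
```

The search is valid because error-detection for $\mathrm{sid}(k)$ is monotone in $k$: by the first observation of Remark 1 it holds exactly when $k < \delta(L)$.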
Remark 2.
Step (3d) of the above algorithm can be computed using the transducer functionality algorithm on $\mathbf{t}_k$, which leads again to a polynomial but expensive algorithm. It turns out, however, using standard logical arguments, that
$$\text{Condition (6) is equivalent to whether } (\mathbf{t} \downarrow \mathbf{a} \uparrow \mathbf{a}) \text{ realizes a partial identity},$$
when $\mathbf{t}$ realizes $\gamma$ (in the above algorithm, $\mathbf{t}$ is $\mathbf{sid}_k$). Moreover, by [6], there is an
$$O(|T_{\mathbf{t}}| + r|Q_{\mathbf{t}}|)$$
time algorithm that tests whether a given transducer $\mathbf{t}$ realizes a partial identity, where $r$ is the size of the alphabet.
Corollary 1.
Consider the algorithm PrelimDistErrDetect. Using the partial identity test for $\mathbf{t}_k$ in step 3d, the algorithm computes the edit distance of a language given via a trim NFA $\mathbf{a}$ in time
$$O(n^2 r^2 B_{\mathbf{a}} \log B_{\mathbf{a}}),$$
where $r$ is the cardinality of the alphabet used in $T_{\mathbf{a}}$, and $n = |T_{\mathbf{a}}|$.
Proof. 
The correctness of the algorithm follows from Remarks 1 and 2. For the time complexity, the whole loop will perform $O(\log B_{\mathbf{a}})$ iterations. In each iteration, the value $k$ is used to construct the transducer $\mathbf{sid}_k$ shown in Figure 1 with alphabet being the set of alphabet symbols appearing in the definition of $\mathbf{a}$. Then, the transducer $\mathbf{t}_k$ is constructed having $O(k|Q_{\mathbf{a}}|^2)$ states and $O(k r^2 |T_{\mathbf{a}}|^2)$ transitions. Then, the partial identity of $\mathbf{t}_k$ is tested in time $O(|T_{\mathbf{a}}|^2 k r^2)$. As $k < B_{\mathbf{a}}$, it follows that the total time complexity is as required. □
We note that, in the worst case, $B_{\mathbf{a}}$ is of order $O(n)$ and, assuming a fixed alphabet, the above algorithm operates in time $O(n^3 \log n)$, which is asymptotically better than the time complexity of the algorithm in [4], even when the given automaton is a DFA.

4. An $O(n^2 d)$ Algorithm for Edit Distance via Error-Detection

In this section, we observe that the algorithm of the previous section repeats a lot of computations, and we eliminate those repeated computations to arrive at an improved algorithm that computes the edit distance $d$ of a trim NFA $\mathbf{a}$ in time $O(n^2 d r^2)$, where $r$ is the cardinality of the alphabet used in $T_{\mathbf{a}}$, and $n = |T_{\mathbf{a}}|$. The improved algorithm is based on the following two observations:
  • The previous algorithm starts the binary search loop by constructing the transducer $\mathbf{t}_{\lfloor B_{\mathbf{a}}/2 \rfloor}$, but the edit distance might be much smaller than $B_{\mathbf{a}}/2$. It turns out that it is more efficient in the end to construct in turn $\mathbf{t}_1, \mathbf{t}_2, \ldots$, until the first $\mathbf{t}_d$ that does not realize a partial identity.
  • If $\mathbf{t}_k$ has been constructed and found to realize a partial identity, then, in a straightforward approach, the transducer $\mathbf{t}_{k+1}$ is constructed from scratch and the partial identity test is repeated for the part of $\mathbf{t}_{k+1}$ that corresponds to $\mathbf{t}_k$. To avoid this, we shall define the transducer $\mathbf{t}'_{k+1}$ to be the part that is added to $\mathbf{t}_k$ in order to obtain $\mathbf{t}_{k+1}$, plus some initial state. Moreover, we shall show that, if $\mathbf{t}_k$ realizes a partial identity, then $\mathbf{t}_{k+1}$ realizes a partial identity if and only if $\mathbf{t}'_{k+1}$ does so. Thus, the partial identity test in each step will apply only to the new part that is added to the transducer of the previous step.
We proceed with details based on the above observations.
Product construction of the trim transducer $\mathbf{t}' = \mathbf{t} \downarrow \mathbf{a} \uparrow \mathbf{a}$, given transducer $\mathbf{t}$ and NFA $\mathbf{a}$. As usual in cross product constructions, the states of $\mathbf{t}'$ are triples of the form $(\varphi, q, q')$, where $\varphi$ is a state of $\mathbf{t}$, and $q$, $q'$ are states of $\mathbf{a}$. The initial state of $\mathbf{t}'$ is $(\varphi_0, q_0, q_0)$, where $\varphi_0$ is the initial state of $\mathbf{t}$ and $q_0$ is the initial state of $\mathbf{a}$. The construction is incremental, starting with the creation of $(\varphi_0, q_0, q_0)$; then:
  • If state $(\varphi, q, q')$ has been created, and there are transitions $\varphi \xrightarrow{x/y} \psi$, $q \xrightarrow{x} r$, $q' \xrightarrow{y} r'$ of $\mathbf{t}$, $\mathbf{a}^\lambda$ and $\mathbf{a}^\lambda$, respectively, then the transition
    $$(\varphi, q, q') \xrightarrow{x/y} (\psi, r, r')$$
    is added in $\mathbf{t}'$. Here, $\mathbf{a}^\lambda$ is the $\lambda$-NFA that results if we add in $\mathbf{a}$ the loop transitions $(q, \lambda, q)$ to all states $q$ of $\mathbf{a}$.
The final states of $\mathbf{t}'$ are those constructed triples consisting of final states in $\mathbf{t}$ and $\mathbf{a}$. In the end, we also make $\mathbf{t}'$ trim.
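The incremental product construction can be sketched as a breadth-first exploration of triples. The dictionary-based representation, the function `product`, and the helper `one_error_transducer` (a small hand-written transducer allowing at most one error, used only as example input in place of the transducer of Figure 1, which is not reproduced here) are all assumptions made for illustration. The sketch builds only the part reachable from the initial triple and omits the final trimming step.

```python
from collections import deque

LAMBDA = ""  # empty label

def one_error_transducer(alphabet):
    """Hypothetical example input: state 0 = no error yet, state 1 = one error."""
    trans = set()
    for s in alphabet:
        trans |= {(0, s, s, 0), (1, s, s, 1),            # copy a symbol
                  (0, s, LAMBDA, 1), (0, LAMBDA, s, 1)}  # deletion, insertion
        trans |= {(0, s, t, 1) for t in alphabet if t != s}  # substitution
    return {"start": 0, "finals": {0, 1}, "trans": trans}

def product(t, a):
    """Reachable part of the product of transducer t with NFA a on both
    the input side and the output side."""
    def moves(q, x):
        # Moves of a with an added λ-loop on every state.
        if x == LAMBDA:
            return {q}
        return {r for (p, z, r) in a["trans"] if p == q and z == x}

    start = (t["start"], a["start"], a["start"])
    states, trans = {start}, set()
    queue = deque([start])
    while queue:
        phi, q, q2 = queue.popleft()
        for (p, x, y, psi) in t["trans"]:
            if p != phi:
                continue
            for r in moves(q, x):
                for r2 in moves(q2, y):
                    target = (psi, r, r2)
                    trans.add(((phi, q, q2), x, y, target))
                    if target not in states:
                        states.add(target)
                        queue.append(target)
    finals = {(phi, q, q2) for (phi, q, q2) in states
              if phi in t["finals"] and q in a["finals"] and q2 in a["finals"]}
    return {"start": start, "finals": finals, "trans": trans, "states": states}
```

Scanning all transducer transitions for every triple is wasteful; indexing transitions by source state would be the natural refinement, but it is left out to keep the sketch short.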
Optimized construction of $\mathbf{t}_{k+1}$ and $\mathbf{t}'_{k+1}$ from the trim $\mathbf{t}_k$. Suppose that $\mathbf{t}_k$ has been constructed, where initially $\mathbf{t}_1 = \mathbf{sid}_1 \downarrow \mathbf{a} \uparrow \mathbf{a}$. Constructing $\mathbf{t}_{k+1}$ using $\mathbf{t}_k$ will be done again incrementally. The first phase of the incremental construction is to add the new transitions
$$([k], q, q') \xrightarrow{x/y} ([k+1], r, r'), \qquad (8)$$
where $x/y$ is an error and $q \xrightarrow{x} r$, $q' \xrightarrow{y} r'$ are transitions in $\mathbf{a}^\lambda$. There will be no new transitions of the form $([i], q, q') \xrightarrow{x/y} ([k+1], r, r')$ for $i < k$, because the transducer $\mathbf{sid}_{k+1}$ has no transitions from any state $[i]$ with $i < k$ to state $[k+1]$. Note that the numbers of new transitions and new states created as in (8) are $O(|T_{\mathbf{a}}|^2 r^2)$ and $O(|Q_{\mathbf{a}}|^2)$, respectively.
After the first phase, the incremental construction proceeds from the new states $([k+1], r, r')$ in (8). Any new transition must be of the form
$$([k+1], q, q') \xrightarrow{\sigma/\sigma} ([k+1], r, r'), \qquad (9)$$
where $\sigma \in \Sigma$. This is because the transducer $\mathbf{sid}_{k+1}$ has only transitions of the form $[k+1] \xrightarrow{\sigma/\sigma} [k+1]$ going out of the state $[k+1]$. The process ends when no new states are created. The transitions and final states of the transducer $\mathbf{t}_{k+1}$ are those in $\mathbf{t}_k$ plus the newly created ones, after removing any new states that cannot reach a final state (thus, also $\mathbf{t}_{k+1}$ is trim). The transducer $\mathbf{t}'_{k+1}$ has as transitions and final states only the newly created ones, and has as initial state a new state $[-1]$ with transitions $[-1] \xrightarrow{\lambda/\lambda} ([k], q, q')$, for all states of the form $([k], q, q')$. □
Lemma 3.
Suppose the trim transducer $\mathbf{t}_k$ realizes a partial identity.
  • If $C_1$ is a computation of $\mathbf{t}_k$ ending at a state of the form $([k], p, p')$, then the label of $C_1$ is of the form $(w_1, w_1)$.
  • $\mathbf{t}_{k+1}$ realizes a partial identity if and only if $\mathbf{t}'_{k+1}$ does so.
Proof. 
For the first statement, consider any computation $C_1$ of $\mathbf{t}_k$ having some label $(w_1, w_1')$ and ending at a state of the form $([k], p, p')$. We show that $w_1 = w_1'$. If the state $([k], p, p')$ is final, then $C_1$ is an accepting computation, which implies $w_1 = w_1'$, as $\mathbf{t}_k$ realizes a partial identity. If $([k], p, p')$ is not a final state, then, as $\mathbf{t}_k$ is trim, there is a path $C_1'$ from $([k], p, p')$ to a final state of $\mathbf{t}_k$, where all states of that path are of the form $([k], r, r')$ and all labels of that path are of the form $\sigma/\sigma$; this is because any transition of $\mathbf{sid}_k$ from state $[k]$ can only go to state $[k]$ and can only have a label of the form $\sigma/\sigma$. Thus, there is an accepting path of $\mathbf{t}_k$ of the form $C_1 C_1'$ with label $(w_1 z, w_1' z)$ for some nonempty word $z$. Then, as $\mathbf{t}_k$ realizes a partial identity, we have that $w_1 z = w_1' z$, which implies $w_1 = w_1'$, as required.
For the ‘only if’ part of the second statement, assume that $\mathbf{t}_{k+1}$ realizes a partial identity. Consider any accepting computation $C_2$ of $\mathbf{t}'_{k+1}$ with some label $(w_2, w_2')$. We show that $w_2 = w_2'$. Let $[-1] \xrightarrow{\lambda/\lambda} ([k], p, p')$ be the first transition of $C_2$. Let $C_2'$ be the path that results when we remove the first transition of $C_2$. By the construction of $\mathbf{t}'_{k+1}$, there is a computation $C_1$ of $\mathbf{t}_k$ that ends at state $([k], p, p')$. By the first statement, the label of $C_1$ is of the form $(w_1, w_1)$. Then, $C_1 C_2'$ is an accepting computation of $\mathbf{t}_{k+1}$ with label $(w_1 w_2, w_1 w_2')$. As $\mathbf{t}_{k+1}$ realizes a partial identity, $w_1 w_2 = w_1 w_2'$, which implies $w_2 = w_2'$, as required.
For the ‘if’ part of the second statement, assume that $\mathbf{t}'_{k+1}$ realizes a partial identity. Consider any accepting computation $C$ of $\mathbf{t}_{k+1}$. We show that the label of $C$ must be of the form $(w, w)$. If $C$ is already a computation of $\mathbf{t}_k$, then this holds, as $\mathbf{t}_k$ realizes a partial identity. Now suppose that $C = C_1 C_2$ such that $C_1$ is a computation of $\mathbf{t}_k$ and $C_2$ is a path in $\mathbf{t}_{k+1}$ that starts with a transition as in (8) and then uses transitions as in (9). Let $([k], p, p')$ be the last state of $C_1$, which is also the first state of $C_2$. Then, by the first statement, $C_1$ has some label $(w_1, w_1)$. In addition, the path
$$\bigl([-1] \xrightarrow{\lambda/\lambda} ([k], p, p')\bigr)\, C_2$$
is an accepting computation of $\mathbf{t}'_{k+1}$, which implies that it has some label $(w_2, w_2)$. Hence, the label of $C$ is $(w_1 w_2, w_1 w_2)$ and, therefore, $\mathbf{t}_{k+1}$ realizes a partial identity. □
The improved algorithm is shown next:
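    Algorithm DistErrDetect
0. Input: NFA $\mathbf{a}$
1. Construct the transducer $\mathbf{sid}_1$ realizing the channel $\mathrm{sid}(1)$ (see Figure 1)
2. Construct the trim transducer $\mathbf{t}_1 = \mathbf{sid}_1 \downarrow \mathbf{a} \uparrow \mathbf{a}$
3. Let $k \leftarrow 1$
4. Let $\mathbf{s} \leftarrow \mathbf{t}_1$
5. while ($\mathbf{s}$ realizes a partial identity)
   a) Construct $\mathbf{t}_{k+1}$ and $\mathbf{t}'_{k+1}$ from $\mathbf{t}_k$ using the optimized construction
   b) Let $\mathbf{s} \leftarrow \mathbf{t}'_{k+1}$
   c) Let $k \leftarrow k + 1$
6. return $k$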
Theorem 1.
Algorithm DistErrDetect computes the edit distance of a language given via a trim NFA $\mathbf{a}$ in time
$$O(n^2 d r^2),$$
where $d$ is the computed edit distance, $n = |T_{\mathbf{a}}|$, and $r$ is the cardinality of the alphabet used in $T_{\mathbf{a}}$.
Proof. 
The correctness of the algorithm follows from the optimized construction and the above lemma. For the time complexity of the algorithm, we note the following. First, $\mathbf{t}_1$ is constructed in time $O(|\mathbf{a}|^2 r^2)$. Then, $\mathbf{t}'_2, \ldots, \mathbf{t}'_d$ are constructed according to the optimized construction. Each of these is constructed in time $O(|\mathbf{a}|^2 r^2)$ and has $O(|Q_{\mathbf{a}}|^2)$ states and $O(|T_{\mathbf{a}}|^2 r^2)$ transitions. In addition, each $\mathbf{t}'_k$ is tested for partial identity in time $O(|T_{\mathbf{a}}|^2 r^2 + |Q_{\mathbf{a}}|^2 r)$, which is $O(|\mathbf{a}|^2 r^2)$. As $\mathbf{a}$ is trim, $|\mathbf{a}| = O(n)$, so each of the $d$ rounds costs $O(n^2 r^2)$ and the total time is $O(n^2 d r^2)$. □

5. An $O(n^2 d)$ Algorithm for Edit Distance via Input-Altering Transducers

In this section, we present another algorithm for computing the desired edit distance via input-altering transducers (see Theorem 3 and the associated algorithm). A transducer $\mathbf{t}$ is called input-altering, if
$$w \notin \mathbf{t}(w), \text{ for all words } w,$$
that is, the output of $\mathbf{t}$ is never equal to the input used.
We explain now how input-altering transducers are related to edit-distance and error-detection. Let $\mathbf{t}$ be a transducer. A language $L$ is $\mathbf{t}$-independent [29,30], if
$$u, v \in L \text{ and } u \in \mathbf{t}(v) \;\rightarrow\; u = v. \qquad (10)$$
Of course, when $R(\mathbf{t})$ is input-preserving, then $\mathbf{t}$-independence is the same as error-detection for the channel $R(\mathbf{t})$, and condition (10) can be tested as explained in Remark 2. On the other hand, if the transducer $\mathbf{t}$ is input-altering, then, by [30], condition (10) is equivalent to
$$\mathbf{t}(L) \cap L = \emptyset. \qquad (11)$$
If $L$ is accepted by some NFA $\mathbf{a}$, then the above condition can be tested using two product constructions: first, construct an NFA $\mathbf{b}$ accepting $\mathbf{t}(L)$, then construct an NFA $\mathbf{c}$ by intersecting $\mathbf{b}$ with $\mathbf{a}$, and then test whether there is a path from the start to a final state of $\mathbf{c}$. Thus, condition (11) can be tested in time
$$O(|\mathbf{a}|^2 |\mathbf{t}|).$$
Certain types of input-altering transducers are useful in constructing maximal t -independent languages [30]. In Theorem 2, we show how an input-altering transducer can be used to model the edit operations used in the definition of the edit distance.

5.1. An Input-Altering Transducer for Edit-Distance

We shall define the input-altering transducer $\mathbf{ia}_k$, which is partially shown in Figure 2. The value $i$ in a state $[i]$ or $[i, a]$ is called the error counter, meaning that any path from $[0]$ to a state with error counter $i$ has to be labeled $u/v$ such that $\delta(u, v) \le i$. More precisely, we will define the edges such that a state $[i, a]$ can be reached from $[0]$ via a path with label $u/v$ if and only if $u = vax$ for some word $x$ and $i = |ax|$; thus, $v$ is a proper prefix of $u$ and state $[i, a]$ remembers the left-most letter of $u$ that occurs after its prefix $v$. A state $[i]$ with $i \ge 1$ can only be reached via a path labeled $u/v$ from $[0]$ if $1 \le \delta(u, v) \le i$; thus, $u \ne v$. Furthermore, we make sure that for $u \ne v$ such that neither $u \le_p v$ nor $v \le_p u$ there is a path from $[0]$ to $[\delta(u, v)]$, which is labeled by $u/v$ or $v/u$.
Definition 1.
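The transducer
$$\mathbf{ia}_k = (Q, \Sigma, E, [0], F)$$
is defined as follows. The set of states is
$$Q = \{[i] \mid 0 \le i \le k\} \cup \{[i, a] \mid 1 \le i \le k,\ a \in \Sigma\}$$
with all but the initial state $[0]$ being final states:
$$F = Q \setminus \{[0]\}.$$
The transitions in $\mathbf{ia}_k$ can be divided into the four sets of edges $E = E_0 \cup E_s \cup E_i \cup E_d$. The transitions from $E_0$ do not introduce any error, edges from the other sets model one substitution ($E_s$), insertion ($E_i$), or deletion ($E_d$):
$$E_0 = \{[i] \xrightarrow{\sigma/\sigma} [i] \mid \sigma \in \Sigma,\ 0 \le i \le k\} \cup \{[i, a] \xrightarrow{\sigma/\sigma} [i] \mid a, \sigma \in \Sigma,\ a \ne \sigma,\ 1 \le i \le k\},$$
$$E_s = \{[i] \xrightarrow{\sigma/\tau} [i+1] \mid \sigma, \tau \in \Sigma,\ \sigma \ne \tau,\ 0 \le i < k\},$$
$$E_i = \{[i] \xrightarrow{\lambda/\sigma} [i+1] \mid \sigma \in \Sigma,\ 1 \le i < k\},$$
$$E_d = \{[0] \xrightarrow{a/\lambda} [1, a] \mid a \in \Sigma\} \cup \{[i] \xrightarrow{\sigma/\lambda} [i+1] \mid \sigma \in \Sigma,\ 1 \le i < k\} \cup \{[i, a] \xrightarrow{\sigma/\lambda} [i+1, a] \mid a, \sigma \in \Sigma,\ 1 \le i < k\}.$$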
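Definition 1 translates almost line by line into code. The sketch below enumerates the states and the four transition sets for given $k$ and alphabet; the tuple encoding of states (`(i,)` for $[i]$ and `(i, a)` for $[i, a]$) and the function name `build_ia` are illustrative assumptions, and the empty string plays the role of $\lambda$.

```python
LAMBDA = ""  # empty label

def build_ia(k, alphabet):
    """Return (states, transitions, start, finals) of ia_k.
    A transition is a tuple (source, input_label, output_label, target)."""
    E = set()
    for i in range(0, k + 1):
        for s in alphabet:
            E.add(((i,), s, s, (i,)))                   # E_0: [i] --s/s--> [i]
    for i in range(1, k + 1):
        for a in alphabet:
            for s in alphabet:
                if a != s:
                    E.add(((i, a), s, s, (i,)))         # E_0: [i,a] --s/s--> [i]
    for i in range(0, k):
        for s in alphabet:
            for t in alphabet:
                if s != t:
                    E.add(((i,), s, t, (i + 1,)))       # E_s: substitution
    for i in range(1, k):
        for s in alphabet:
            E.add(((i,), LAMBDA, s, (i + 1,)))          # E_i: insertion
            E.add(((i,), s, LAMBDA, (i + 1,)))          # E_d: deletion from [i]
            for a in alphabet:
                E.add(((i, a), s, LAMBDA, (i + 1, a)))  # E_d: deletion from [i,a]
    for a in alphabet:
        E.add(((0,), a, LAMBDA, (1, a)))                # E_d: first deletion
    states = ({(i,) for i in range(k + 1)}
              | {(i, a) for i in range(1, k + 1) for a in alphabet})
    finals = states - {(0,)}
    return states, E, (0,), finals
```

For an alphabet of size $r$ this yields $(k + 1) + kr$ states, and every loop above contributes at most $kr^2$ transitions.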
Terminology. If $\mathbf{t} = (Q, \Sigma, T, q_0, F)$ is a transducer in standard form, then we write $\mathbf{t}^e$ for the NFA
$$\mathbf{t}^e = (Q, E_\Sigma, T, q_0, F)$$
over the edit alphabet $E_\Sigma$, where the labels of the transitions in $\mathbf{t}$ are viewed as elements of $E_\Sigma$. Note that the label of a path $P$ in $\mathbf{t}$ is a pair of words $(u, v)$, whereas the label of the corresponding path in $\mathbf{t}^e$, which we denote as $P^e$, is an edit string $h$ such that $\mathrm{inp}(h) = u$ and $\mathrm{out}(h) = v$. This type of NFA is called an eNFA in [23].
Definition 2.
An edit string $h$ of nonzero weight is called reduced if (a) the first error in $h$ is not an insertion, and (b) if the first error in $h$ is a deletion of the form $(a/\lambda)$, then the first non-deletion edit operation that follows $(a/\lambda)$ in $h$ (if any) is of the form $\sigma/\sigma$ with $\sigma \in \Sigma \setminus \{a\}$.
Example. The edit string $(a/a)(a/b)(a/\lambda)(\lambda/a)$ is reduced, as its first error is a substitution. The edit string $(a/a)(a/\lambda)(b/b)(b/a)$ is reduced as well: its first error is the deletion $(a/\lambda)$, and the next non-deletion operation is $(b/b)$ with $b \ne a$. The edit string $(\lambda/a)(a/a)$ is not reduced, as it starts with an insertion, and the edit string $(a/\lambda)(b/a)(b/b)$ is not reduced either, as the deletion $(a/\lambda)$ is followed by the substitution $(b/a)$.
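The definition and the four examples can be checked mechanically. The following is a small sketch in Python, assuming an edit string is encoded as a list of pairs `(x, y)` with the empty string standing for $\lambda$; the helper names `inp`, `out`, `weight`, and `is_reduced` mirror the notation of the text but are otherwise hypothetical.

```python
LAMBDA = ""  # stands for the empty word (hypothetical encoding)


def inp(h):
    """Input word of the edit string h."""
    return "".join(x for x, _ in h)


def out(h):
    """Output word of the edit string h."""
    return "".join(y for _, y in h)


def weight(h):
    """Number of errors (substitutions, insertions, deletions) in h."""
    return sum(1 for x, y in h if x != y)


def is_reduced(h):
    """Check conditions (a) and (b) of Definition 2.

    Returns False for edit strings of weight zero, for which the
    notion is not defined.
    """
    errors = [j for j, (x, y) in enumerate(h) if x != y]
    if not errors:
        return False
    first = errors[0]
    x, y = h[first]
    if x == LAMBDA:              # (a) the first error is an insertion
        return False
    if y != LAMBDA:              # the first error is a substitution
        return True
    # The first error is the deletion (x / lambda): skip further deletions
    # and inspect the first non-deletion operation, if any.
    for p, q in h[first + 1:]:
        if q == LAMBDA:
            continue
        return p == q and p != x
    return True
```

With this encoding, the first example of the text is the list `[("a", "a"), ("a", "b"), ("a", ""), ("", "a")]`, whose input word is `aaa` and whose output word is `aba`.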
The proofs of the next two lemmata are given in the appendix.
Lemma 4.
Let $x, y, u, v$ be words. The following statements hold true:
1. $\delta(xuy, xvy) = \delta(u, v)$.
2. If $v <_p u$, then $\delta(u, v) = |u| - |v|$.
3. If $u \ne v$, then there is a reduced edit string $h$ realizing $\delta(u, v)$.
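The first two statements are easy to try out on concrete words. Below is a minimal sketch that computes $\delta$ with the standard dynamic-programming recurrence for the edit (Levenshtein) distance; this is the textbook method, not one of the transducer-based algorithms of this paper, and the function name `edit_distance` is an assumption.

```python
def edit_distance(u, v):
    """Edit distance between words u and v (unit-cost substitution,
    insertion, and deletion), computed row by row."""
    prev = list(range(len(v) + 1))           # distances from "" to prefixes of v
    for i, x in enumerate(u, 1):
        cur = [i]                            # distance from u[:i] to ""
        for j, y in enumerate(v, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # keep or substitute
        prev = cur
    return prev[-1]
```

For instance, wrapping the words `kitten` and `sitting` in a common prefix and a common suffix leaves their distance unchanged (statement 1), and the distance between `abcde` and its proper prefix `abc` is the length difference (statement 2).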
Lemma 5.
Let $k \in \mathbb{N}$ and let $u, v$ be words. The following statements hold true with respect to the transducer $\mathbf{ia}_k$.
1. In $\mathbf{ia}_k^e$, every path from the start state $[0]$ to any state $[i]$ or $[i,a]$ has as label a reduced edit string whose weight is equal to $i$.
2. If $1 \le \delta(u, v) \le k$ and $h$ is a reduced edit string realizing $\delta(u, v)$, then $h$ is accepted by $\mathbf{ia}_k^e$.
3. If $v \in \mathbf{ia}_k(u)$, then $1 \le \delta(u, v) \le k$.
Theorem 2.
For each $k \in \mathbb{N}$, the transducer $\mathbf{ia}_k$ is input-altering and of size $O(kr^2)$, where $r$ is the cardinality of the alphabet, and satisfies the following condition, for any language $L$ containing at least two words:
$$\mathbf{ia}_k(L) \cap L = \emptyset \quad \text{if and only if} \quad \delta(L) > k. \tag{20}$$
Proof. 
By construction, it follows that $\mathbf{ia}_k$ is trim and has $O(rk)$ states and $O(kr^2)$ transitions. Hence, it is indeed of size $O(kr^2)$. The third statement of Lemma 5 implies that the transducer is input-altering. Next, we show that (20) is true for all languages $L$ containing at least two words.
First, for the ‘if’ part, assume $\delta(L) > k$ and consider any words $u, v \in L$. We need to prove $v \notin \mathbf{ia}_k(u)$. If $u = v$, then this holds as $\mathbf{ia}_k$ is input-altering. Otherwise, $\delta(u, v) \ge \delta(L) > k$, and the claim follows from the third statement of Lemma 5. Now, for the ‘only if’ part, assume
$$\mathbf{ia}_k(L) \cap L = \emptyset, \tag{21}$$
but, for the sake of contradiction, suppose there are different words $u, v \in L$ such that $1 \le \delta(u, v) \le k$. Let $h$ be a reduced edit string realizing $\delta(u, v)$. By the second statement of Lemma 5, $h$ is accepted by $\mathbf{ia}_k^e$ via some path $P^e$ and, therefore, one of $(u, v)$ and $(v, u)$ is the label of the corresponding path $P$ of $\mathbf{ia}_k$; that is, we have $u \in \mathbf{ia}_k(v)$ or $v \in \mathbf{ia}_k(u)$, which contradicts (21). □
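Theorem 2 can be illustrated on a small finite language by simulating the input-altering transducer directly on word pairs. The sketch below is written under stated assumptions: the transducer is never materialized, the helper `ia_moves` generates the edges of Definition 1 on the fly, states $[i]$ and $[i,a]$ are encoded as `(i,)` and `(i, a)`, and all function names are hypothetical. It is a brute-force check for tiny inputs, not one of the algorithms of this paper.

```python
from collections import deque


def edit_distance(u, v):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(v) + 1))
    for i, x in enumerate(u, 1):
        cur = [i]
        for j, y in enumerate(v, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def ia_moves(state, k, sigma):
    """Yield (input, output, next_state) for the edges leaving `state`."""
    if len(state) == 1:
        (i,) = state
        for s in sigma:
            yield s, s, (i,)                       # E_0
        if i < k:
            for s in sigma:
                for t in sigma:
                    if s != t:
                        yield s, t, (i + 1,)       # E_s
            if i == 0:
                for a in sigma:
                    yield a, "", (1, a)            # E_d from [0]
            else:
                for s in sigma:
                    yield "", s, (i + 1,)          # E_i
                    yield s, "", (i + 1,)          # E_d
    else:
        i, a = state
        for s in sigma:
            if s != a:
                yield s, s, (i,)                   # E_0, only for s != a
            if i < k:
                yield s, "", (i + 1, a)            # E_d


def ia_accepts(u, v, k, sigma):
    """True iff v is a possible output of the transducer on input u."""
    start = ((0,), 0, 0)
    seen, queue = {start}, deque([start])
    while queue:
        state, i, j = queue.popleft()
        if i == len(u) and j == len(v) and state != (0,):
            return True                            # every state but [0] is final
        for x, y, nxt in ia_moves(state, k, sigma):
            if u.startswith(x, i) and v.startswith(y, j):
                conf = (nxt, i + len(x), j + len(y))
                if conf not in seen:
                    seen.add(conf)
                    queue.append(conf)
    return False


def ia_hits_language(words, k, sigma):
    """True iff some output of the transducer on a word of L is again in L."""
    return any(ia_accepts(u, v, k, sigma) for u in words for v in words)


def inner_distance(words):
    """Brute-force inner edit distance of a finite language."""
    return min(edit_distance(u, v) for u in words for v in words if u != v)
```

For the four-word language `{"0000", "1001", "0110", "1111"}` (the Levenshtein code of length 4), the brute-force distance is 2, and `ia_hits_language` reports no hit for $k = 1$ and a hit for $k = 2$, as condition (20) predicts.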
Corollary 2.
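For each NFA $\mathbf{a}$ accepting at least two words and for each transducer $\mathbf{ia}_k$, with $k \in \mathbb{N}$, the following condition is satisfied:
$$R(\mathbf{ia}_k \downarrow \mathbf{a} \uparrow \mathbf{a}) = \emptyset \quad \text{if and only if} \quad \delta(L(\mathbf{a})) > k.$$
Proof. 
The statement follows from the above theorem and the fact (based on standard logic arguments) that $R(\mathbf{ia}_k \downarrow \mathbf{a} \uparrow \mathbf{a}) = \emptyset$ is equivalent to $\mathbf{ia}_k(L(\mathbf{a})) \cap L(\mathbf{a}) = \emptyset$. □
The reason why the condition $R(\mathbf{ia}_k \downarrow \mathbf{a} \uparrow \mathbf{a}) = \emptyset$ is preferred to the equivalent one in Theorem 2 is explained further below in the remark that follows Theorem 3.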

5.2. The Second $O(n^2 d)$ Algorithm for Edit Distance

Here, we use the results of the previous subsection to arrive at the second algorithm for computing the desired edit distance. Corollary 2 implies that the preliminary algorithm PrelimDistInpAlter shown below correctly computes the desired edit distance. Moreover, by reasoning as in the proof of Corollary 1, it follows that this algorithm also executes in time $O(n^2 r^2 B_{\mathbf{a}} \log B_{\mathbf{a}})$, where $r$ is the cardinality of the alphabet used in $T_{\mathbf{a}}$, and $n = |T_{\mathbf{a}}|$.
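    Algorithm PrelimDistInpAlter
0. Input: NFA $\mathbf{a}$
1. Let $B_{\mathbf{a}}$ be the bound in Lemma 2
2. Let $\min \leftarrow 1$ and $\max \leftarrow B_{\mathbf{a}} - 1$
3. Perform binary search to find the largest $k$ in $\{\min, \ldots, \max\}$ for which $L(\mathbf{a})$ is error-detecting for $\mathrm{sid}(k)$, as follows:
   while ($\min \le \max$)
     a) Let $k \leftarrow \lfloor (\min + \max)/2 \rfloor$
     b) Construct the transducer $\mathbf{ia}_k$ (see Figure 2)
     c) Construct the transducer $\mathbf{t}_k \leftarrow \mathbf{ia}_k \downarrow \mathbf{a} \uparrow \mathbf{a}$
     d) If $R(\mathbf{t}_k) = \emptyset$ let $\min \leftarrow k + 1$
        Else let $\max \leftarrow k - 1$
4. return $\min$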
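The binary search in step 3 can be sketched in a few lines of Python. In this sketch the emptiness test of steps b) to d) is abstracted into a caller-supplied predicate; the names `prelim_dist`, `bound`, and `relation_is_empty` are hypothetical, and the predicate is assumed to return true exactly when the transducer built for $k$ realizes the empty relation, that is, when the distance exceeds $k$.

```python
def prelim_dist(bound, relation_is_empty):
    """Binary search of PrelimDistInpAlter.

    bound             -- an upper bound on the edit distance (the B of step 1)
    relation_is_empty -- predicate on k; True iff the distance is greater than k
    """
    lo, hi = 1, bound - 1
    while lo <= hi:
        k = (lo + hi) // 2
        if relation_is_empty(k):
            lo = k + 1        # the language is error-detecting for k errors
        else:
            hi = k - 1        # some pair of words is within distance k
    return lo                 # largest error-detecting k, plus one
```

For example, with a bound of 5 and a predicate that holds exactly for $k < 3$, the search probes $k = 2$ and then $k = 3$ and returns 3.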
We now discuss how to improve the above algorithm. The two observations we made at the beginning of Section 4 apply here as well if, instead of partial identity of $\mathbf{t}_k$, we talk about the emptiness of $\mathbf{t}_k$. Thus, we want the improved algorithm to construct in turn $\mathbf{t}_1, \mathbf{t}_2, \ldots$ until the first $\mathbf{t}_d$ with $R(\mathbf{t}_d) \ne \emptyset$. Moreover, when $\mathbf{t}_k$ has been constructed and realizes $\emptyset$, we continue in the next step with new transitions added to $\mathbf{t}_k$ in order to get $\mathbf{t}_{k+1}$.
Optimized construction of $\mathbf{t}_{k+1}$ from the trim $\mathbf{t}_k$. Suppose that the trim $\mathbf{t}_k$ has been constructed, where initially $\mathbf{t}_1 = \mathbf{ia}_1 \downarrow \mathbf{a} \uparrow \mathbf{a}$. Constructing $\mathbf{t}_{k+1}$ using $\mathbf{t}_k$ will be done again incrementally. The first phase of the incremental construction is to add two sets of new transitions: the new transitions
$$([k], q, q') \xrightarrow{x/y} ([k+1], r, r'),$$
where $x/y$ is an error and $q \xrightarrow{x} r$, $q' \xrightarrow{y} r'$ are transitions in $\mathbf{a}^\lambda$; and the new transitions
$$([k,a], q, q') \xrightarrow{\sigma/\lambda} ([k+1,a], r, r'),$$
where $a, \sigma \in \Sigma$, and $q \xrightarrow{\sigma} r$, $q' \xrightarrow{\lambda} r'$ are transitions in $\mathbf{a}^\lambda$. Note that the total numbers of new transitions and states created in the first phase are $O(|T_{\mathbf{a}}|^2 r^2)$ and $O(|Q_{\mathbf{a}}|^2 r)$, respectively.
After the first phase, the incremental construction proceeds from the new states $([k+1], r, r')$ and $([k+1,a], r, r')$. Any new transition must be of the form
$$([k+1], r, r') \xrightarrow{\sigma/\sigma} ([k+1], t, t') \quad \text{or} \quad ([k+1,a], r, r') \xrightarrow{\sigma/\sigma} ([k+1], t, t'),$$
where $\sigma \in \Sigma$ and, in the second case above, $\sigma \ne a$. This is because the only transitions of the transducer $\mathbf{ia}_{k+1}$ going out of the states $[k+1]$ and $[k+1,a]$ are of the form $[k+1] \xrightarrow{\sigma/\sigma} [k+1]$ and $[k+1,a] \xrightarrow{\sigma/\sigma} [k+1]$, with $\sigma \ne a$. The incremental process ends when no new states are created. The transitions and final states of the transducer $\mathbf{t}_{k+1}$ are those in $\mathbf{t}_k$ plus the newly created ones, after removing any new states that cannot reach a final state (thus, $\mathbf{t}_{k+1}$ is also trim). □
Remark 3.
If the trim transducer $\mathbf{t}_k$ has no final states, then $\mathbf{t}_{k+1}$ has no final states if and only if none of the newly created states in the optimized construction is a final state.
    Algorithm DistInpAlter
0. Input: NFA $\mathbf{a}$
1. Construct the transducer $\mathbf{ia}_1$ (see Figure 2)
2. Construct the trim transducer $\mathbf{t}_1 \leftarrow \mathbf{ia}_1 \downarrow \mathbf{a} \uparrow \mathbf{a}$
3. Let $k \leftarrow 1$
4. Let $s$ be the set of states of $\mathbf{t}_1$
5. while ($s$ contains no final states)
     a) Construct $\mathbf{t}_{k+1}$ from $\mathbf{t}_k$ using the optimized construction
     b) Let $s$ be the set of new states in the optimized construction
     c) Let $k \leftarrow k + 1$
6. return $k$
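The layer-by-layer idea behind the algorithm can be sketched in plain Python as a search over triples (transducer state, NFA state, NFA state), grouped by error counter. This is a simplified illustration and not the authors' FAdo-based implementation: it does not trim, it takes an explicit cap `max_k` on the error counter instead of growing the transducer, and it assumes an NFA without empty-word transitions given as a set of triples `(p, symbol, q)` with a single initial state. The names `ia_moves`, `inner_edit_distance`, and `max_k` are hypothetical.

```python
def ia_moves(state, k, sigma):
    """Yield (input, output, next_state) for the edges of Definition 1,
    with [i] encoded as (i,) and [i,a] as (i, a)."""
    if len(state) == 1:
        (i,) = state
        for s in sigma:
            yield s, s, (i,)
        if i < k:
            for s in sigma:
                for t in sigma:
                    if s != t:
                        yield s, t, (i + 1,)
            if i == 0:
                for a in sigma:
                    yield a, "", (1, a)
            else:
                for s in sigma:
                    yield "", s, (i + 1,)
                    yield s, "", (i + 1,)
    else:
        i, a = state
        for s in sigma:
            if s != a:
                yield s, s, (i,)
            if i < k:
                yield s, "", (i + 1, a)


def inner_edit_distance(transitions, initial, finals, sigma, max_k):
    """Smallest error counter at which a final product state is reachable,
    or None if there is none up to max_k."""

    def step(q, x):
        # The empty label behaves like a self-loop on every NFA state.
        if x == "":
            return [q]
        return [r for (p, s, r) in transitions if p == q and s == x]

    layer = {((0,), initial, initial)}
    seen = set(layer)
    for k in range(max_k + 1):
        stack = list(layer)              # product states with error counter k
        next_layer = set()               # product states with counter k + 1
        while stack:
            st, q, q2 = stack.pop()
            if st != (0,) and q in finals and q2 in finals:
                return k                 # first final state found at counter k
            for x, y, nst in ia_moves(st, max_k, sigma):
                for r in step(q, x):
                    for r2 in step(q2, y):
                        conf = (nst, r, r2)
                        if conf in seen:
                            continue
                        seen.add(conf)
                        if nst[0] == k:
                            stack.append(conf)     # non-error edge, same layer
                        else:
                            next_layer.add(conf)   # error edge, next layer
        layer = next_layer
    return None
```

As a usage example, the cyclic automaton with transitions `{(0, "0", 1), (1, "0", 2), (2, "0", 3), (3, "1", 0)}`, initial state `0`, and final state `3` accepts words whose lengths differ by multiples of 4, and the sketch returns 4 for it when `max_k` is at least 4.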
Theorem 3.
Algorithm DistInpAlter computes the edit distance of the language given via a trim NFA $\mathbf{a}$ in time
$$O(n^2 d r^2),$$
where $d$ is the computed edit distance, $n = |T_{\mathbf{a}}|$, and $r$ is the cardinality of the alphabet used in $T_{\mathbf{a}}$.
Proof. 
The correctness of the algorithm follows from the above optimized construction and Corollary 2. For the time complexity of the algorithm, we note the following: first, $\mathbf{t}_1$ is constructed in time $O(|\mathbf{a}|^2 r^2)$. Then, $\mathbf{t}_2, \ldots, \mathbf{t}_d$ are constructed according to the optimized construction. Each of these is constructed in time $O(|\mathbf{a}|^2 r^2)$ and has $O(|Q_{\mathbf{a}}|^2 r)$ states and $O(|T_{\mathbf{a}}|^2 r^2)$ transitions. In addition, each $\mathbf{t}_k$ is tested for final states in linear time. □
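The bound of the theorem can be read as a sum over the $d$ constructed transducers. The following worked line is a sketch under the assumption that $|\mathbf{a}| = O(n)$ for a trim NFA, since every state other than the initial one then has an incoming transition:

```latex
\underbrace{O(n^2 r^2)}_{\text{constructing } \mathbf{t}_1}
\;+\; \sum_{k=2}^{d} \underbrace{O(n^2 r^2)}_{\text{extending } \mathbf{t}_{k-1} \text{ to } \mathbf{t}_k}
\;=\; O(n^2 d\, r^2),
\qquad\text{compared with}\qquad
O\bigl(n^2 r^2 B_{\mathbf{a}} \log B_{\mathbf{a}}\bigr)
\text{ for PrelimDistInpAlter.}
```

The incremental algorithm therefore depends on the computed distance $d$ rather than on the bound $B_{\mathbf{a}}$, and it avoids the logarithmic factor of the binary search.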
Remark 4.
When a final state $f$, say, of $\mathbf{t}_k$ is created, then we know that there is an accepting path of $\mathbf{ia}_k \downarrow \mathbf{a} \uparrow \mathbf{a}$ ending at $f$. The label of that path is a word pair $(u, v)$ such that $\delta(u, v) = k$. Thus, the above algorithm can be modified to return not only the edit distance of $L(\mathbf{a})$, but also a witness pair for that distance.

6. Implementation and Testing

As both algorithms DistErrDetect and DistInpAlter have the same theoretical complexity, we chose to implement one of the two. We chose to implement DistInpAlter because it requires a simpler test for each constructed transducer (although $\mathbf{ia}_k$ is slightly more complex than $\mathbf{sid}_k$, the test for partial identity [6] is more sophisticated than testing merely for the existence of final states). We have also implemented the preliminary algorithm PrelimDistInpAlter.
Our implementation uses the FAdo package for automata, version 1.3.5.1 [7], which is well maintained and provides several useful tools for manipulating automata. We have performed several tests for the correctness of these algorithms, as well as two sets of tests for the time complexity, which confirm the theoretical result that DistInpAlter is indeed faster than PrelimDistInpAlter. All tests were performed on a laptop with the following specification. Make: Apple, Model: MacBook Pro, Processor: 2.5 GHz Intel Core i7, Memory (RAM): 16.00 GB, Operating System: macOS High Sierra Version 10.13.6. The two sets of tests correspond to two lists of DFAs, $(\mathbf{a}_n)$ and $(\mathbf{b}_n)$, shown in Figure 3 and Figure 4. The first test set is such that the desired distance is equal to $n$, for each DFA $\mathbf{a}_n$; that is, the distance grows with $n$ and, in fact, it is a worst-case scenario where the distance is equal to the number of states of the automaton. The second test set is such that the desired distance is fixed, equal to 2, for all $n$.
Table 1 shows the actual running times (in seconds) of the two algorithms on the DFAs $\mathbf{a}_{28}, \mathbf{a}_{41}, \mathbf{a}_{56}, \mathbf{a}_{76}, \mathbf{a}_{100}, \mathbf{a}_{124}, \mathbf{a}_{152}, \mathbf{a}_{184}$. The number in parentheses next to each $\mathbf{a}_n$ indicates the number of transitions in $\mathbf{a}_n$. The column $d$ shows the computed edit distance, and the column $B_{\mathbf{a}_n}$ shows the computed upper bound on the edit distance (used in algorithm PrelimDistInpAlter).
Table 2 shows the actual running times (in seconds) of the two algorithms on the DFAs $\mathbf{b}_6, \ldots, \mathbf{b}_{13}$. Again, the number in parentheses next to each $\mathbf{b}_n$ indicates the number of transitions in $\mathbf{b}_n$.
In both test sets, the empirical outcomes confirm the asymptotic outcome that the improved algorithm based on the optimized construction is faster than the preliminary one.

7. Conclusions

This paper represents a significant improvement in the time complexity of computing the inner edit distance of a given regular language. The performance tests of the implemented algorithm show that in practice the algorithm is reasonably fast for moderate size automata. As discussed in [4], this problem is related to the inherent capability of a language to detect substitution, insertion, and deletion errors.
The two preliminary algorithms can be applied to different distances as long as these distances can be related to appropriate transducers. For some of those distances, the idea in the optimized algorithms can also be used. For example, one can construct a transducer similar to $\mathbf{sid}_k$ for insertion/deletion only errors. A direction for future research is to investigate to what extent the methods used here can be extended to compute inner weighted distances or the inner edit distance with moves.

Author Contributions

All the authors contributed equally to this research. The outcome would not be possible without the contributions of all authors.

Funding

This research was supported by the Natural Science and Engineering Research Council of Canada (NSERC) Discovery Grants R2824A01 to Lila Kari and 220259 to Stavros Konstantinidis.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In this appendix, we present the proofs of Lemmas 4 and 5.
Proof. 
(Of Lemma 4) The first statement already appears in [24]. The second statement is rather folklore, but we provide a proof here for the sake of completeness. Let $u = \sigma_1 \cdots \sigma_n$ and $v = \sigma_1 \cdots \sigma_m$, where $m, n \in \mathbb{N}_0$ and $m < n$ and all $\sigma_i$’s are in $\Sigma$. Then, the edit string
$$h = (\sigma_1/\sigma_1) \cdots (\sigma_m/\sigma_m)(\sigma_{m+1}/\lambda) \cdots (\sigma_n/\lambda)$$
has weight $n - m$ and $\mathrm{inp}(h) = u$ and $\mathrm{out}(h) = v$. We show that $h$ realizes $\delta(u, v)$ by proving that, for any edit string $g$ realizing $\delta(u, v)$, $\mathrm{weight}(g) = n - m$. Indeed, first note that $\mathrm{weight}(g) \le \mathrm{weight}(h) = n - m$. Let $i$ and $d$ be the number of insertions and deletions in $g$. Then, $|v| = |u| + i - d$, which implies $n - m = d - i$. Now, $\mathrm{weight}(g) \ge d + i \ge d - i = n - m$, as required.
For the third statement, let $g_0$ be any edit string realizing $\delta(u, v)$. The following process can be used to obtain the required reduced edit string $h$:
  • If the first error in $g_0$ is a substitution, then $h = g_0$.
  • If the first error in $g_0$ is an insertion, then set $g_0$ to the inverse of $g_0$ and continue with the next step.
  • If the first error in $g_0$ is a deletion $(a/\lambda)$, then $g_0$ is of the form
    $$g_0 = (e_1 \cdots e_r)(a/\lambda)(a_1/\lambda) \cdots (a_d/\lambda)\, g_0',$$
    where the $e_i$’s are non-errors, $d \in \mathbb{N}_0$ and each $(a_j/\lambda)$ is a deletion, and $g_0'$ does not start with a deletion. We have the following subcases:
    • If $g_0'$ is empty or starts with an edit operation $(\sigma/\sigma)$ in which $\sigma \in \Sigma \setminus \{a\}$, then the required $h$ is $g_0$.
    • If $g_0'$ starts with an edit operation $(x/\tau)$ in which $\tau \in \Sigma \setminus \{a\}$ and $x \in \Sigma \cup \{\lambda\}$, then $g_0'$ is of the form $g_0' = (x/\tau)\, g_1'$, and the required $h$ is
      $$h = (e_1 \cdots e_r)(a/\tau)(a_1/\lambda) \cdots (a_d/\lambda)(x/\lambda)\, g_1'.$$
    • If $g_0'$ starts with an edit operation $(x/a)$ in which $x \in \Sigma \cup \{\lambda\}$, then it is of the form $g_0' = (x/a)\, g_1'$, and the edit string
      $$g_1 = (e_1 \cdots e_r)(a/a)(a_1/\lambda) \cdots (a_d/\lambda)(x/\lambda)\, g_1',$$
      realizes $\delta(u, v)$, as $\mathrm{weight}(g_1) = \mathrm{weight}(g_0)$. The process now continues from the first step using $g_1$ for $g_0$.
As the edit string $g_0$ is finite, the above process terminates with a reduced edit string $h$, as required. □
Proof. 
(Of Lemma 5) The first statement follows when we note that the definition of $\mathbf{ia}_k$ and $\mathbf{ia}_k^e$ implies the following facts: (a) an edge exists from a state with error counter $i$ to one with error counter $i + 1$ if and only if the label of that edge is an error; thus, in any path from $[0]$ to $[i]$ or $[i,a]$, the label of that path consists of exactly $i$ errors; (b) any edit string accepted by $\mathbf{ia}_k^e$ is indeed reduced.
For the second statement, consider any reduced edit string $h$ realizing $\delta(u, v)$. We have two cases for the first error of $h$. If the first error in $h$ is a deletion, then $h$ is of the form
$$h = (e_1 \cdots e_r)(a/\lambda)(b_1/\lambda) \cdots (b_d/\lambda)\, h',$$
where each $e_i$ is a non-error edit operation of the form $(\sigma_i/\sigma_i)$, $(a/\lambda)$ is a deletion error, $d \in \mathbb{N}_0$ and each $(b_j/\lambda)$ is a deletion error, and $h'$ is an edit string that is either empty or starts with a non-error $(\sigma/\sigma)$ such that $\sigma \ne a$. Consider the following path of $\mathbf{ia}_k^e$:
$$P_1 = [0] \xrightarrow{(e_1 \cdots e_r)} [0] \xrightarrow{(a/\lambda)(b_1/\lambda) \cdots (b_d/\lambda)} [1+d, a].$$
If $h'$ is empty, then $P_1$ is an accepting path of $\mathbf{ia}_k^e$. If $h'$ is nonempty, then it is of the form $h' = (\sigma/\sigma)\, h''$, for some $\sigma \in \Sigma \setminus \{a\}$. Then, by definition of $\mathbf{ia}_k^e$, the following is a path of $\mathbf{ia}_k^e$
$$P_1\; [1+d, a] \xrightarrow{\sigma/\sigma} [1+d] \xrightarrow{h''} [1 + d + \mathrm{weight}(h'')]$$
accepting $h$. For the case where the first error in $h$ is a substitution, one verifies that again $h$ is accepted by $\mathbf{ia}_k^e$.
For the third statement, if $v \in \mathbf{ia}_k(u)$, then $(u, v)$ is the label of a path $P$ from $[0]$ to a final state $[i]$ or $[i,a]$, with $0 < i \le k$. As the label of the path $P^e$ has exactly $i$ errors, it follows that $\delta(u, v) \le i \le k$.
We also need to show that $\delta(u, v) \ge 1$, that is, $u \ne v$. First, consider the case where the path $P$ ends at $[i,a]$, with $1 \le i \le k$. Then, the label of $P^e$ is an edit string of the form
$$h = (\sigma_1/\sigma_1) \cdots (\sigma_r/\sigma_r)(a/\lambda)(b_1/\lambda) \cdots (b_d/\lambda)$$
and $u = \mathrm{inp}(h) = \sigma_1 \cdots \sigma_r\, a\, b_1 \cdots b_d$ and $v = \mathrm{out}(h) = \sigma_1 \cdots \sigma_r$. Hence, $u \ne v$. Now, consider the case where the path $P$ ends at state $[i]$. There are two cases: (a) the states used in the path are $[0], [1], \ldots, [i]$; (b) the states used in $P$ are $[0], [1,a], \ldots, [r,a], [r], \ldots, [i]$, for some appropriate $[r]$. In both cases, one verifies that $u \ne v$. For example, in case (b), $u$ must be of the form $x a \sigma_1 \cdots \sigma_{r-1} \sigma y$ and $v$ of the form $x \sigma z$, where the $\sigma_j$’s are symbols, $x, y, z$ are words, and $\sigma$ is a symbol other than $a$; hence, $u \ne v$. □

References

  1. Sankoff, D.; Kruskal, J.B. (Eds.) Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison; CSLI Publications: Stanford, CA, USA, 1999.
  2. Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology; Cambridge University Press: New York, NY, USA, 1997.
  3. Paluncic, F.; Abdel-Ghaffar, K.; Ferreira, H. Insertion/deletion detecting codes and the boundary problem. IEEE Trans. Inf. Theory 2013, 59, 5935–5943.
  4. Konstantinidis, S. Computing the edit distance of a regular language. Inf. Comput. 2007, 205, 1307–1316.
  5. Konstantinidis, S.; Silva, P. Computing maximal error-detecting capabilities and distances of regular languages. Fundam. Inform. 2010, 101, 257–270.
  6. Allauzen, C.; Mohri, M. Efficient algorithms for testing the twins property. J. Autom. Lang. Comb. 2003, 8, 117–144.
  7. FAdo. Tools for Formal Languages Manipulation. Available online: http://fado.dcc.fc.up.pt/ (accessed on 20 October 2018).
  8. Wagner, R. Order-n correction for regular languages. Commun. ACM 1974, 17, 265–268.
  9. Pighizzini, G. How hard is computing the edit distance? Inf. Comput. 2001, 165, 1–13.
  10. Mohri, M. Edit-distance of weighted automata: General definitions and algorithms. Intern. J. Found. Comput. Sci. 2003, 14, 957–982.
  11. Kari, L.; Konstantinidis, S.; Perron, S.; Wozniak, G.; Xu, J. Finite-State Error/Edit-Systems and Difference-Measures for Languages and Words; Report 2003-01; Mathematics and Computing Science, Saint Mary’s University: Halifax, NS, Canada, 2003.
  12. Benedikt, M.; Puppis, G.; Riveros, C. The cost of traveling between languages. In ICALP 2011, Part II. LNCS 6756; Aceto, L., Henzinger, M., Sgall, J., Eds.; Springer: Heidelberg, Germany, 2011; pp. 234–245.
  13. Han, Y.S.; Ko, S.K.; Salomaa, K. Computing the edit-distance between a regular language and a context-free language. In DLT 2012. LNCS 7410; Yen, H.C., Ibarra, O., Eds.; Springer: Heidelberg, Germany, 2012; pp. 85–96.
  14. Han, Y.S.; Ko, S.K.; Salomaa, K. Approximate matching between a context-free grammar and a finite-state automaton. In CIAA 2013. LNCS 7982; Konstantinidis, S., Ed.; Springer: Heidelberg, Germany, 2013; pp. 146–157.
  15. Takabatake, Y.; Nakashima, K.; Kuboyama, T.; Tabei, Y.; Sakamoto, H. siEDM: An Efficient String Index and Search Algorithm for Edit Distance with Moves. Algorithms 2016, 9, 26.
  16. Ng, T. Prefix Distance Between Regular Languages. In Proceedings of the 21st CIAA, Seoul, South Korea, 19–22 July 2016; Volume 9705, pp. 224–235.
  17. Berstel, J. Transductions and Context-Free Languages; B.G. Teubner: Stuttgart, Germany, 1979.
  18. Wood, D. Theory of Computation; John Wiley & Sons: New York, NY, USA, 1987.
  19. Rozenberg, G.; Salomaa, A. (Eds.) Handbook of Formal Languages, Vol. I; Springer: Berlin, Germany, 1997.
  20. Yu, S. Regular Languages. In Handbook of Formal Languages, Vol. I; Springer: Berlin, Germany, 1997; pp. 41–110.
  21. Sakarovitch, J. Elements of Automata Theory; Cambridge University Press: Cambridge, UK, 2009.
  22. Mateescu, A.; Salomaa, A. Formal Languages: an Introduction and a Synopsis. In Handbook of Formal Languages, Vol. I; Springer: Berlin, Germany, 1997; pp. 1–39.
  23. Kari, L.; Konstantinidis, S. Descriptional Complexity of Error/Edit Systems. J. Autom. Lang. Comb. 2004, 9, 293–309.
  24. Levenshtein, V.I. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
  25. Yang, M. Application and Implementation of Transducer Tools in Answering Certain Questions about Regular Languages. Master’s Thesis, Department Mathematics and Computing Science, Saint Mary’s University, Halifax, NS, Canada, 2012.
  26. Schrijver, A. Combinatorial Optimization: Polyhedra and Efficiency; Springer Science & Business Media: Berlin, Germany, 2003.
  27. Konstantinidis, S. Transducers and the Properties of Error-Detection, Error-Correction and Finite-Delay Decodability. J. Univ. Comput. Sci. 2002, 8, 278–291.
  28. Béal, M.; Carton, O.; Prieur, C.; Sakarovitch, J. Squaring transducers: An efficient procedure for deciding functionality and sequentiality. Theor. Comput. Sci. 2003, 292, 45–63.
  29. Shyr, H.; Thierrin, G. Codes and Binary Relations. In Séminaire d’Algèbre Paul Dubreil, Paris 1975–1976 (29ème Année); Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1975; pp. 180–188.
  30. Konstantinidis, S. Applications of Transducers in Independent Languages, Word Distances, Codes. In Proceedings of the DCFS 2017, Milano, Italy, 3–5 July 2017; Volume 10316, pp. 45–62.
Figure 1. The input-preserving transducer $\mathbf{sid}_k$ realizing the channel $\mathrm{sid}(k)$. Each edge label $\sigma/\sigma$ represents many transitions, one for each symbol $\sigma$ of the alphabet, and similarly for $\sigma/\lambda$ and $\lambda/\sigma$. Each edge label $\sigma/\tau$ represents many transitions, one for each pair of distinct symbols $\sigma$ and $\tau$ from the alphabet. Thus, if the alphabet size is $r$, then the transducer has $O(k)$ states and $O(r^2 k)$ transitions.
Figure 2. A segment of the input-altering transducer $\mathbf{ia}_k$: for each $a \in \Sigma$ the complete transducer has $k$ states of the form $[i,a]$. The labels $\sigma$ and $\tau$ on an edge mean: one edge for each $\sigma, \tau \in \Sigma$ with $\sigma \ne \tau$; for some edge sets, additional restrictions apply, denoted, for example, by $\sigma \ne a$.
Figure 3. The automaton $\mathbf{a}_n$ accepting the language $0^{n-1}(10^{n-1})^*$.
Figure 4. The automaton $\mathbf{b}_n$ accepting the Levenshtein code, which consists of all binary words $b_1 \cdots b_n$ of length $n$ such that $\left(\sum_{i=1}^{n} i \cdot b_i\right) \,\%\, (n+1) = 0$, where ‘%’ is the integer division remainder operation. This code has edit distance equal to 2. On the other hand, its distance for insertion/deletion errors only is 3. The automaton (before making it trim) has $n^2 + n + 1$ states: $[n, 0]$ and $[i, s]$, with $0 \le i \le n - 1$ and $0 \le s \le n$. The meaning of state $[i, s]$ is that the automaton has read $i$ bits $b_1 \cdots b_i$ and $s = 1 \cdot b_1 + \cdots + i \cdot b_i$. We have that $f(i, s) = [i + 1, (s + i + 1) \,\%\, (n + 1)]$.
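The two families of test automata can be sketched as plain Python dictionaries. The code below follows the captions of Figures 3 and 4 under stated assumptions: the DFA encoding (a transition dictionary, an initial state, and a set of final states) and the names `build_a`, `build_b`, and `accepts` are hypothetical, and the formula $f(i, s)$ of Figure 4 is read here as the transition taken on bit 1, with bit 0 leaving $s$ unchanged.

```python
def build_a(n):
    """DFA for 0^(n-1) (1 0^(n-1))^*: a cycle with states 0..n-1."""
    trans = {(i, "0"): i + 1 for i in range(n - 1)}
    trans[(n - 1, "1")] = 0
    return trans, 0, {n - 1}


def build_b(n):
    """Untrimmed DFA for the Levenshtein code of length n.

    State (i, s) means: i bits read, weighted sum so far is s modulo n + 1.
    The only state at level n is (n, 0), which is the final state.
    """
    trans = {}
    for i in range(n):
        for s in range(n + 1):
            for bit in "01":
                target = (i + 1, (s + (i + 1) * int(bit)) % (n + 1))
                if i + 1 < n or target[1] == 0:   # keep only existing states
                    trans[((i, s), bit)] = target
    return trans, (0, 0), {(n, 0)}


def accepts(dfa, word):
    """Run the (partial) DFA on word."""
    trans, state, finals = dfa
    for symbol in word:
        state = trans.get((state, symbol))
        if state is None:                         # undefined transition
            return False
    return state in finals
```

For $n = 4$, `build_a(4)` has 4 transitions and accepts `000` and `0001000`, while `build_b(4)` accepts exactly the words `0000`, `1001`, `0110`, and `1111`.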
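Table 1. Outcomes of performance tests on the automata $(\mathbf{a}_n)$.

| DFA (transitions) | $d$ | $B_{\mathbf{a}_n}$ | PrelimDistInpAlter | DistInpAlter |
|---|---|---|---|---|
| $\mathbf{a}_{28}$ (28) | 28 | 28 | 0.696 s | 0.078 s |
| $\mathbf{a}_{41}$ (41) | 41 | 41 | 3.977 s | 0.267 s |
| $\mathbf{a}_{56}$ (56) | 56 | 56 | 12.811 s | 0.691 s |
| $\mathbf{a}_{76}$ (76) | 76 | 76 | 52.086 s | 1.885 s |
| $\mathbf{a}_{100}$ (100) | 100 | 100 | 159.370 s | 4.841 s |
| $\mathbf{a}_{124}$ (124) | 124 | 124 | 354.306 s | 10.643 s |
| $\mathbf{a}_{152}$ (152) | 152 | 152 | 998.438 s | 21.991 s |
| $\mathbf{a}_{184}$ (184) | 184 | 184 | 2294.636 s | 43.484 s |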
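Table 2. Outcomes of performance tests on the automata $(\mathbf{b}_n)$.

| DFA (transitions) | $d$ | $B_{\mathbf{b}_n}$ | PrelimDistInpAlter | DistInpAlter |
|---|---|---|---|---|
| $\mathbf{b}_{6}$ (28) | 2 | 3 | 0.112 s | 0.076 s |
| $\mathbf{b}_{7}$ (41) | 2 | 3 | 0.302 s | 0.196 s |
| $\mathbf{b}_{8}$ (56) | 2 | 4 | 0.731 s | 0.436 s |
| $\mathbf{b}_{9}$ (76) | 2 | 3 | 1.621 s | 0.927 s |
| $\mathbf{b}_{10}$ (100) | 2 | 3 | 3.223 s | 1.844 s |
| $\mathbf{b}_{11}$ (124) | 2 | 4 | 5.673 s | 3.238 s |
| $\mathbf{b}_{12}$ (152) | 2 | 5 | 61.416 s | 5.892 s |
| $\mathbf{b}_{13}$ (184) | 2 | 4 | 16.624 s | 9.272 s |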

Share and Cite

MDPI and ACS Style

Kari, L.; Konstantinidis, S.; Kopecki, S.; Yang, M. Efficient Algorithms for Computing the Inner Edit Distance of a Regular Language via Transducers. Algorithms 2018, 11, 165. https://doi.org/10.3390/a11110165
