This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Several variants of the edit distance problem with block deletions are considered. Polynomial time optimal algorithms are presented for the edit distance with block deletions allowing character insertions and character moves, but without block moves. We show that the edit distance with block moves and block deletions is NP-complete (that is, any proposed solution can be verified in polynomial time, and every problem in NP can be reduced to it in polynomial time), and that it can be reduced to the problem of non-recursive block moves and block deletions within a constant factor.

The traditional edit distance problem considers character insertions, deletions and sometimes exchanges, in order to find the minimum number of such operations required to convert a given string to another. Although a simple quadratic time dynamic programming algorithm can be used for the edit distance problem (see the book of Gusfield [

Given two strings,

Section 2 summarizes related research; Section 3 presents formal definitions of the generalized edit distance operations. In Section 4 we prove that edit distance with block moves and block deletions is NP-hard, and show that eliminating recursion can be done within a constant factor. In Section 5 we present optimal algorithms for solving variations of edit distance with block deletions.

Ukkonen [^{2}·min(^{2},^{2})). Our algorithms improve this running time to ^{2}·m

Muthukrishnan and Sahinalp [^{2}), using an embedding of strings into the vector space. Cormode and Muthukrishnan [_{1} vector space. Cormode,

Shapira and Storer [^{0.43}) and ^{0.69}). An implication of this bound is that the log(^{0.46}).

Ann

Bafna and Pevzner [

Durand

The classical algorithm for computing the similarity between two sequences uses a dynamic programming matrix, and compares two strings of size ^{2}) time. Masek and Paterson [^{2}/log ^{2}/log ^{2}/log ^{2}/log

We now give a formal description of the operations involved:

Let _{1}⋯_{n}

_{1}⋯_{p−1}_{p}_{n}

_{1}⋯_{p−1}_{p+ℓ}⋯_{n}

_{1} ≠ _{2} ≤ _{1}, _{2}) moves the string at position _{1} of length ℓ to position _{2}. After performing the move operation, if _{1} < _{2} and ℓ ≤ _{2} − _{1} then _{1}⋯_{p1−1}_{p1+ℓ}⋯_{p2−1}_{p1}⋯_{p1+ℓ−1}_{p2}⋯_{n}_{2} < _{1} and 1 ≤ ℓ ≤ _{1} + 1, _{1}⋯_{p2−1}_{p1}⋯_{p1+ℓ−1}_{p2}⋯_{p1−1}_{p1+ℓ}⋯_{n}

When ℓ = 1 we refer to the operations as _{1}, _{2}), where
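The three operations can be illustrated with a minimal Python sketch. The function names `insert_char`, `delete_block` and `move_block`, the argument order, and the convention that an inserted character ends up at position p are assumptions made for illustration; positions are 1-based, and the block move follows the two cases given above (the moved block is placed immediately before the character originally at position p2).

```python
def insert_char(s: str, p: int, ch: str) -> str:
    """Insert the single character ch so that it occupies position p (1-based).

    The placement convention is an assumption of this sketch.
    """
    if len(ch) != 1 or not 1 <= p <= len(s) + 1:
        raise ValueError("invalid character insertion")
    return s[:p - 1] + ch + s[p - 1:]


def delete_block(s: str, p: int, length: int) -> str:
    """Delete the block of the given length that starts at position p (1-based)."""
    if length < 1 or p < 1 or p + length - 1 > len(s):
        raise ValueError("invalid block deletion")
    return s[:p - 1] + s[p - 1 + length:]


def move_block(s: str, p1: int, p2: int, length: int) -> str:
    """Move the block of the given length at position p1 to position p2 (1-based)."""
    if length < 1 or p1 < 1 or p1 + length - 1 > len(s) or not 1 <= p2 <= len(s) + 1:
        raise ValueError("invalid block move")
    block = s[p1 - 1:p1 - 1 + length]
    if p1 < p2:
        if length > p2 - p1:
            raise ValueError("destination lies inside the moved block")
        # s_1..s_{p1-1}  s_{p1+l}..s_{p2-1}  block  s_{p2}..s_n
        return s[:p1 - 1] + s[p1 - 1 + length:p2 - 1] + block + s[p2 - 1:]
    if p2 < p1:
        # s_1..s_{p2-1}  block  s_{p2}..s_{p1-1}  s_{p1+l}..s_n
        return s[:p2 - 1] + block + s[p2 - 1:p1 - 1] + s[p1 - 1 + length:]
    raise ValueError("source and destination must differ")
```

With a block length of 1, `delete_block` and `move_block` reduce to the character versions of the operations.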

In this section we first note that finding the minimum edit-distance to transform

We first prove the NP-completeness of edit distance with block moves, block deletions, and character insertions.

Given two strings

Since a non-deterministic algorithm need only guess the operations and check in polynomial time that

Using the notation above, Theorem 1 uses a reduction from the edit distance problem with block moves but with no insertions and no deletions, <-/-/b>, to show that <c/b/b> is also NP-hard. The same proof applies to seven more versions of the edit distance problem, proving that they are also NP-complete: <-/c/b>, <-/b/b>, <c/-/b>, <c/c/b>, <b/-/b>, <b/c/b>, and <b/b/b> (the last three allow block insertions).

Recursive operations allow one to deal with substrings that have already been dealt with, and which do not occur consecutively in the original string. In the following, we define when moves and deletes are considered recursive. Ukkonen [

A sequence of operations applied to a string is

A sequence of operations including character insertions, block moves, and block deletions is

For example, if

If there is a sequence of

By induction on

_{1}, _{2}, ⋯, _{r}_{i}_{i}_{i}_{−1} and
_{j}_{+1}. The symbols
_{x}, i_{i}_{−1} and _{j}_{+1} were moved within the first 3(_{i}_{−1} and _{j}_{+1}(if _{i}_{−1} and _{j}_{+1} were not moved in previous operation, this additional cost is left for future cost). We show that the inner blocks {_{i}_{j}

The bound of Theorem 2 is tight.

The following example builds two strings _{n}_{n}_{n}_{n}

First consider the alphabet Σ = {_{0}, _{1}, _{0}, _{0}}. Let _{1} = _{1}_{0}_{0}_{0} and _{1} = _{1}_{0}_{0}_{0}. Suppose _{1} is a substring of _{1} of _{0}) and _{1}_{0}_{0}_{0})) and the non-recursive edit distance is 4 (_{1}), _{0}), _{0}) and _{0})).

In general, we use the following definition:

We get that

Using the above definition of _{n}_{n}_{n}_{n}_{n}

The recursive edit distance of _{n}_{n}

We still have to define _{n}_{n}_{n}_{n}_{n}_{n}

A recursive sequence of

Given a sequence of

In the previous section we showed that the problem of finding the edit distance with block moves and block deletions is NP-complete. In this section we consider several variations of the edit distance problem with block deletions but without block moves, and show that for each of the following sets of operations the problem can be solved optimally in polynomial time, using variations of the traditional dynamic programming method:

Block deletions.

Block deletions and character insertions.

Block deletions, character insertions, and character moves.

In the last case we prove that we can apply the dynamic programming algorithm for block deletions and character insertions and modify the way we calculate the cost.

Given two strings

The block deletion problem can be solved in polynomial time using dynamic programming. Allowing character insertions in addition to block deletions requires small changes in the algorithm. Let _{c}_{c}

Let _{1}⋯_{n}_{1}⋯_{m}_{1,…}_{k}_{k}_{n}_{1,…}_{k}_{sn}_{1}⋯_{n}_{−1} into _{1}⋯_{n}_{−1} into _{1}⋯_{m}_{−1}.

Assume that _{k}_{n}

If _{k}_{n}_{k}_{1,…}_{k}_{−1}} is an optimal sequence of block deletions which converts _{1}⋯_{n}_{−1} into _{1,…}_{ℓ}} be an optimal sequence of block deletions converting _{1}⋯_{n}_{−1} into _{1,…}_{ℓ}, _{k}

If _{k}_{n}_{n}_{n}_{k}_{n}_{1,…}_{k}_{sn}_{1}⋯_{n−1} into _{1,…}_{ℓ}} be an optimal sequence of block deletions converting _{1}⋯_{n}_{−1} into _{i}_{i}_{i}_{k}_{1,…}_{i−1}, _{i}

Assume that _{k}_{n}

In this case _{n}_{m}_{1}⋯_{n}_{−1} into _{1}⋯_{m}_{−1}. Otherwise, if _{1,…}_{ℓ}} be an optimal sequence of block deletions which converts _{1}⋯_{n}_{−1} into _{1}⋯_{m}_{−1} with cost less than

The last lemma implies the following recurrence formula for the edit distance problem with block deletions, <-/b/->,

The dynamic programming algorithm is given in Algorithm 2. It uses a table _{1}⋯_{i}_{1}⋯_{j}_{i}_{i}_{0} < _{i0} starts a block deletion, _{0},_{i}_{+1} follows a block deletion, there is no need in recharging a cost of a block deletion if _{i}_{+1} should be deleted too, it could be deleted “for free” with the cost applied to the first character, _{i0}, of the block deletion. The

After the dynamic programming table is initialized, the cost of converting _{1}⋯_{i}_{1}⋯_{j}_{i}_{j}_{1}⋯_{i}_{1}⋯_{j}_{1}⋯_{i}_{−1} into _{1}⋯_{j}_{−1}, unless it is cheaper to apply a block deletion to the character _{i}_{i}

_{i}

_{j}

Two cases are distinguished, the case for which its previous character was deleted, and the case its previous character was not deleted. In the former case there is no new charge for deleting _{i}_{i}_{i}_{j}_{i}_{i}^{2} m
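The distinction between a character whose predecessor was deleted and one whose predecessor was kept can be sketched in Python as follows. This is not the paper's Algorithm 2: it is a simplified two-table formulation under unit cost per block deletion, and the function name `block_deletion_distance` and the tables `keep` and `dele` are assumptions of this sketch. Extending an open deletion is free, while starting a new block after a kept character costs 1.

```python
import math


def block_deletion_distance(s: str, t: str) -> float:
    """Minimum number of block deletions turning s into t (math.inf if impossible)."""
    n, m = len(s), len(t)
    inf = math.inf
    # keep[i][j]: cheapest way to turn s[:i] into t[:j] with s[i-1] kept (or i == 0)
    # dele[i][j]: cheapest way to turn s[:i] into t[:j] with s[i-1] deleted
    keep = [[inf] * (m + 1) for _ in range(n + 1)]
    dele = [[inf] * (m + 1) for _ in range(n + 1)]
    keep[0][0] = 0
    for i in range(1, n + 1):
        for j in range(m + 1):
            # Deleting s[i-1]: free if the previous character was deleted too,
            # otherwise a new block deletion is charged.
            dele[i][j] = min(dele[i - 1][j], keep[i - 1][j] + 1)
            # Keeping s[i-1] is possible only if it matches t[j-1].
            if j >= 1 and s[i - 1] == t[j - 1]:
                keep[i][j] = min(keep[i - 1][j - 1], dele[i - 1][j - 1])
    return min(keep[n][m], dele[n][m])
```

For instance, `block_deletion_distance("axbxc", "abc")` needs two separate deletions, whereas `block_deletion_distance("abcabc", "abc")` needs only one.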

Running Algorithm 2 on the above example,

The block deletion problem is now extended to include character insertions, and requires small modifications to the algorithm presented in Algorithm 2. An optimal solution to the edit distance problem of character insertions and block deletions will not insert a character

Let _{1}⋯_{n}_{1}⋯_{m}_{1,…}_{k}

If _{k}_{m}_{1,…}_{k}_{−1}} is an optimal sequence of character insertions and block deletions which converts _{1} ⋯ _{m}_{−1}. Otherwise, if _{k}_{m}_{1}⋯_{n}_{−1} into _{1}⋯_{m}_{−1}.

If _{k}_{n}_{1,…}_{k}_{sn}_{1}⋯_{n}_{−1} into _{1}⋯_{n}_{−1} into _{1}⋯_{m}_{−1}.

If _{k}_{1,…}_{k}_{−1}} is an optimal sequence of block deletions and character insertions.

If _{k}

If _{k}_{m}_{1,…}_{k}_{−1}} is an optimal sequence of character insertions and block deletions which converts _{1}⋯_{m}_{−1}. Otherwise, if _{1,…}_{ℓ}} be an optimal sequence of block deletions and character insertions converting _{1}⋯_{n}_{−1} into _{1,…}_{ℓ}, _{k}

If _{k}_{m}_{n}_{n}_{m}_{1}⋯_{n}_{−1} into _{1}⋯_{m}_{−1}. Otherwise, if _{1,…}_{ℓ}} be an optimal sequence of block deletions which converts _{1}⋯_{n}_{−1} into _{1}⋯_{m}_{−1} with cost less than

The last lemma implies the following recurrence formula for the edit distance problem with block deletions and character insertions, <c/b/->,

_{i}

_{j}

While the algorithm of Algorithm 2 deals only with block deletions, the one in Algorithm 3 also permits character insertions to be applied to _{j}_{1}⋯_{i}_{i}_{1}⋯_{i}_{−1} to _{1}⋯_{j}_{−1} in case the characters are equal. The running time of the algorithm is ^{2} m
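A hedged sketch of the extension to character insertions is given below. It is not the paper's Algorithm 3, but a simplified two-table formulation under unit costs; the function name `block_delete_insert_distance` and the tables `keep` and `dele` are assumptions of this sketch. An insertion costs 1 and does not interrupt an open block deletion, since the deletion can be performed before the insertion.

```python
import math


def block_delete_insert_distance(s: str, t: str) -> float:
    """Minimum number of block deletions plus character insertions turning s into t."""
    n, m = len(s), len(t)
    inf = math.inf
    # keep[i][j]: the most recently consumed character of s was kept (or i == 0)
    # dele[i][j]: the most recently consumed character of s was deleted
    keep = [[inf] * (m + 1) for _ in range(n + 1)]
    dele = [[inf] * (m + 1) for _ in range(n + 1)]
    keep[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i >= 1:
                # Delete s[i-1]: extend an open block for free or open a new one.
                dele[i][j] = min(dele[i][j], dele[i - 1][j], keep[i - 1][j] + 1)
                # Keep s[i-1] when it equals t[j-1].
                if j >= 1 and s[i - 1] == t[j - 1]:
                    keep[i][j] = min(keep[i][j], keep[i - 1][j - 1], dele[i - 1][j - 1])
            if j >= 1:
                # Insert t[j-1]; the keep/delete state of s is unchanged.
                keep[i][j] = min(keep[i][j], keep[i][j - 1] + 1)
                dele[i][j] = min(dele[i][j], dele[i][j - 1] + 1)
    return min(keep[n][m], dele[n][m])
```

Unlike the deletion-only variant, every pair of strings has a finite distance here, since one block deletion followed by insertions of all characters of the target is always possible.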

Running Algorithm 3 on the example,

We now consider the minimum number of block deletions, character insertions and character moves applied to

Shapira and Storer [^{P}^{P}

For any two paths

The proof appears in Shapira and Storer [

Denote by

The following lemma considers only the edit distance problem with character insertions, character deletions and character moves. Character moves are irrelevant unless the cost of a move is lower than the combined cost of a character insertion and a character deletion. Block deletion is then added to the set of operations, and a dynamic programming algorithm is presented, which is optimal under the assumption of unit costs.

For any positive costs of character insertion, character deletion, and character move, denoted

A character move is deleting the character from its source location and inserting it again at its destination. By splitting characters according to the number of operations performed on them in path

Let ^{P}^{P′}

By ^{P}^{P}′

For unit costs of character insertion, block deletion, and character move, the minimal edit distance including character moves can be reconstructed from every optimal path of the edit-distance table constructed using the algorithm presented in Algorithm 3 of character insertions and block deletions.

Let

Lemma 4 also applies when the cost of a character move is less than the combined cost of a character insertion and a character deletion, and the cost of a character insertion equals the cost of a character move. If an optimal path leading to a cell contains both an insertion and a deletion of the same character, and neither operation has yet been converted into a move in the current path, the cost of the cell can be reduced. To determine the final cost of the cell, we refer to the (two) characters associated with it and distinguish between two cases.

Consider first the case where the character is deleted together with other characters in a block deletion, and the same character is inserted within the same path. Here it is not worth replacing the pair by a character move, since the cost of a character insertion is less than the cost of a character move plus the cost of a block deletion (assuming the cost of a block deletion is positive). The reason is that a character deleted as part of a block incurs no additional charge, since all characters of the block are deleted in a single operation. Converting the deletion of this character into a move, however, splits the deleted block into two sub-blocks and thus adds another block deletion, unless the character lies at one of the block ends. If the character is not at a block end, this additional charge makes the conversion more expensive than the solution without the move. If the character is at a block end, shortening the block deletion to exclude it leaves the cost unchanged, by the assumption that the cost of a character insertion equals the cost of a character move.

The situation is different when the character was deleted alone and the same character is inserted in the same path. In this case we subtract the cost of an insertion and a deletion from the cost of the current cell and add the cost of a move, since the character was charged once for its deletion and once for its insertion but should be charged only for the move. This reduces the cost whenever a move is cheaper than an insertion plus a deletion. For simplicity, Algorithm 4 is given for the case of uniform costs, but it can easily be adapted to any costs satisfying the assumptions above.
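The cost adjustment described above can be sketched in Python. This is an illustration rather than the paper's Algorithm 4: the function `cost_with_moves`, its parameters, and the representation of a path as a list of inserted characters plus a list of deleted blocks are assumptions of this sketch. Only characters deleted alone (blocks of length 1) are paired with insertions of the same character; characters deleted inside longer blocks are left untouched.

```python
from collections import Counter


def cost_with_moves(inserted, deleted_blocks, c_ins=1, c_del=1, c_move=1):
    """Cost of a path after converting matching insert/delete pairs into moves.

    inserted       -- list of inserted characters on the path
    deleted_blocks -- list of deleted blocks (strings) on the path
    """
    if c_move >= c_ins + c_del:
        raise ValueError("a move must be cheaper than an insertion plus a deletion")
    # Cost of the path before any conversion.
    base = c_ins * len(inserted) + c_del * len(deleted_blocks)
    ins_count = Counter(inserted)
    # Only characters that were deleted alone are candidates for a move.
    single_del = Counter(b for b in deleted_blocks if len(b) == 1)
    pairs = sum(min(k, single_del[ch]) for ch, k in ins_count.items())
    # Each pair was charged an insertion and a deletion; charge a move instead.
    return base - pairs * (c_ins + c_del - c_move)
```

With unit costs, each converted pair lowers the path cost by exactly one.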

Using the reasoning above regarding the conversion of insert and delete character into a move character, only in the case that the character was not deleted within a block deletion, the optimal algorithm for the edit distance problem with block deletions, character insertions, and character moves first computes the edit distance with character insertions and block deletions presented in Algorithm 3, and then chooses any optimal path in the constructed table, and reduces its final cost by exchanging the cost of an insert and a delete of the same character by the cost of a move. The following algorithm shows the way to traverse the dynamic programming table constructed by the Delete-Insert algorithm, based on Lemma 4. It uses two associative arrays Insert and Delete, for storing the information gathered during traversal. For simplicity the algorithm only outputs the operations for converting _{j}_{j}_{i}_{i}^{2}·m^{2}·m

[Algorithm 4: traversal of the dynamic programming table constructed by Algorithm 3, using the associative arrays Insert and Delete. Listing not preserved.]

The

Assume, by _{1,…}_{k}_{i}_{i}

If for all _{i}

Let

We have shown that the edit distance problem restricted to block deletions can be solved in polynomial time using dynamic programming, and that adding character insertions requires only small changes to the algorithm.

Edit distance Example: (

Dynamic Table for the Edit Distance of Block Deletions and Character Moves for the strings

Summary of Results.

Problem | Complexity | Reference |
<c/c/-> | ^{2}/log( | Masek and Paterson [ |
<c/c/b> | NPC | Shapira and Storer [ |
<c/b/b> | NPC | Current paper (4.1) |
<-/b/-> | ^{2} · | Current paper (5.1) |
<c/b/-> | ^{2} · | Current paper (5.1) |
<c/c/c> | ^{2} · | Shapira and Storer [ |
<c/b/c> | ^{2} · | Current paper (5.2) |