Edit Distance with Block Deletions

Abstract: Several variants of the edit distance problem with block deletions are considered. Polynomial time optimal algorithms are presented for the edit distance with block deletions allowing character insertions and character moves, but without block moves. We show that the edit distance with block moves and block deletions is NP-complete (a problem is NP-complete if any given solution to it can be verified in polynomial time and any NP problem can be converted into it in polynomial time), and that it can be reduced to the problem of non-recursive block moves and block deletions within a constant factor.


Introduction
The traditional edit distance problem considers character insertions, deletions and sometimes exchanges, in order to find the minimum number of such operations required to convert a given string to another. Although a simple quadratic time dynamic programming algorithm can be used for the edit distance problem (see the book of Gusfield [1] for more information), the problem becomes NP-complete when it is generalized to include the operations block move, block delete, and block copy. Here we consider block deletions with or without character insertions, character moves and block moves. These variations are shown to be solved optimally in polynomial time using dynamic programming, except when block moves are included, in which case the problem is NP-hard, and we show how to reduce the problem within a constant factor to the problem of non-recursive block moves and block deletions.
Muthukrishnan and Sahinalp [7,8] consider the edit distance with block moves, deletes, copies and reversals. They approximate this problem, also called the nearest neighbors problem, to a factor of O(log n · (log* n)^2), using an embedding of strings into a vector space. Cormode and Muthukrishnan [9] achieve a factor of O(log n · log* n) when considering the edit distance with moves only. They embed the strings into the L_1 vector space. Cormode et al. [10] consider the Hamming and Levenshtein distances, and define the LZ distance, which employs block copies and block deletions, together with the usual single character operations. Given two strings S and T, they consider the problem of starting from an empty string and producing T using copies from S, in a minimum number of operations. Here, copies are not allowed, and we consider the minimum number of operations performed on S in order to produce T. Ergun et al. [11] consider the edit distance with block moves, deletes and copies and present a polynomial time algorithm that approximates the optimum within a factor of 12.
Shapira and Storer [4] consider the standard edit distance problem augmented with block moves, where a sequence of characters is moved from one location in the string to another with constant cost, and present a log(n) factor approximation algorithm for a sub-family of strings (finding an optimal solution is NP-hard [4]). The proposed greedy algorithm reduces the two given strings to (possibly) shorter strings by repeatedly replacing a longest common substring by a new single character. The traditional edit distance is then applied to the new strings. Chrobak et al. [12] consider the Minimum Common String Partition (MCSP) problem, which receives two input strings and asks for partitions of both strings into the same collection of substrings, minimizing the number of substrings. They refer to several versions of this problem by limiting the number of times each character can occur in both input strings, and study the approximation ratio of a greedy algorithm for MCSP that at each step extracts a longest common substring from the given strings. In the case of 2-MCSP, where each character can occur at most twice in each input string, an approximation ratio of 3 is shown. For 4-MCSP the approximation ratio of the greedy algorithm is at least Ω(log n), and for the general problem they present an approximation ratio between Ω(n^0.43) and O(n^0.69). An implication of this bound is that the log(n)-approximation greedy algorithm presented in Shapira and Storer [4] for a sub-family of strings cannot be extended to provide a log(n)-approximation for all inputs. Shafrir and Kaplan [13] improve the lower bound on the approximation guarantee of the greedy algorithm for MCSP to Ω(n^0.46).
Ann et al. [6] present polynomial-time optimization algorithms for the edit distance problem which involves character insertions, character deletions, block copies and block deletions, when some restrictions are applied, such as allowing only left-to-right non-recursive operations to be applied on the source string in order to get the target string. They refer to different versions of edit problems where internal copies (copy of a substring of the current string), external copies (copy of a substring of the original string), and shifted copies (copy of a shifted substring) are allowed. They also consider various costs of the operations such as constant, linear (the cost of a block operation is proportional to the length of the block) and nested costs (in which they allow the copied substring to be further edited).
Bafna and Pevzner [14] consider the case that S is a permutation of the integers 1 through n, and give a 1.5 approximation algorithm for minimizing the number of moves needed to transform S into another permutation T. The restriction that all characters are distinct makes the problem easier. Lopresti and Tompkins [15] compare two strings by extracting collections of substrings and placing them into the corresponding order. Tichy [16] considers block copies and looks for the minimal covering set of one string with respect to another. In Hannenhalli [17], one can only swap a prefix or suffix of one chromosome with a prefix or suffix of another.
Durand et al. [18] consider the best alignment. There are several types of two-sequence alignments. The global alignment is the traditional edit distance, where the goal is to match the entire sequence. The local alignment is to see whether a substring in one sequence aligns well with a substring in the other; thus, deleting prefixes or suffixes is free. The semi-global alignment is when the input is constructed from two sequences, one short and one long, and the goal is to find whether the short one is a part of the long one. These variations are addressed by Smith and Waterman [19]. In the alignment problem, the term gap is used instead of a block deletion in the edit distance problem. A gap is a maximal consecutive run of spaces in a single string of a given alignment. Under an affine gap penalty, the cost of k consecutive deletions is not simply the cost of k individual single character deletions: the penalty combines the cost of opening a gap and k times the cost of extending a gap. The introduction of gaps into sequence alignments allows the alignment to be extended into regions where one sequence may have lost or gained sequence characters not found in the other. The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment when it is possible to achieve a good alignment at some other point in the sequence. When aligning two sequences it is often required to insert gaps in them in order to optimize the alignment.
The classical algorithm for computing the similarity between two sequences uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. Masek and Paterson [2] apply the "Four-Russians technique" and reduce the running time to O(n^2/log n) by exploiting repetitions in the strings. A drawback of the Masek and Paterson algorithm is that it can only be applied when the given scoring function is rational. Crochemore et al. [3] also present an O(n^2/log n) time algorithm by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv-Welch parsing of both strings. For most texts, the time complexity is O(hn^2/log n) where h ≤ 1 is the entropy of the text. Hermelin et al. [20] generalize the idea of applying edit distance on the compressed forms of the strings and present a generic platform for popular compression schemes such as the LZ-family, Run-Length Encoding, Byte-Pair Encoding and dictionary methods. In this paper we use basic dynamic programming which can be adapted to an O(n^2/log n) processing time by applying compression techniques to the underlying strings.
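As a point of reference, the classical quadratic dynamic program can be sketched as follows. This is a minimal illustration under unit costs, not code from any of the cited works; the function name edit_distance and the flag allow_substitution are illustrative assumptions. With the flag switched off, only character insertions and deletions are permitted.

```python
def edit_distance(s, t, allow_substitution=True):
    """Unit-cost edit distance between s and t in O(len(s) * len(t)) time.

    Illustrative sketch: only two rows of the table are kept in memory.
    """
    n, m = len(s), len(t)
    prev = list(range(m + 1))          # row for the empty prefix of s
    for i in range(1, n + 1):
        cur = [i] + [0] * m            # converting s[:i] into the empty string
        for j in range(1, m + 1):
            # delete s[i-1] or insert t[j-1]
            best = min(prev[j] + 1, cur[j - 1] + 1)
            if s[i - 1] == t[j - 1]:
                best = min(best, prev[j - 1])        # match, no cost
            elif allow_substitution:
                best = min(best, prev[j - 1] + 1)    # exchange one character
            cur[j] = best
        prev = cur
    return prev[m]
```

For instance, edit_distance("kitten", "sitting") evaluates the variant with exchanges, while passing allow_substitution=False restricts the operations to insertions and deletions.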

Definitions of the Generalized Operations
We now give a formal description of the operations involved: insert, delete and move, all of which are with respect to strings of characters over a finite alphabet. Let […]
Move: Given two distinct positions […]
When l = 1 we refer to the operations as delete-character and move-character, and when l > 1 we refer to the operations as block-deletion and block-move. Alternatively, we sometimes write delete(str, p) / move(str, p_1, p_2), where str specifies the string which is deleted or moved, rather than its length. We even use delete(str) / move(str) when the indices are clear.

Edit Distance with Block Moves, Block Deletions, Character Insertions
In this section we first note that finding the minimum edit distance to transform S to T using character insertions, block deletions and block moves is NP-complete. Second, we introduce the notion of recursive moves and recursive-moves-deletes, and simplify the problem of finding an approximation algorithm by showing that eliminating recursive-moves-deletes cannot change the edit distance by more than a constant factor.

NP-Completeness
We first prove the NP-completeness of edit distance with block moves, block deletions, and character insertions.
Theorem 1. Given two strings S and T, and a positive integer l, performing only the three unit-cost operations character insertions, block deletions, and block moves on S, it is NP-complete to determine if S can be converted to T with cost ≤ l.
Proof: Since a non-deterministic algorithm need only guess the operations and check in polynomial time that S is converted to T with cost ≤ l, the problem is in NP. Shapira and Storer [4] show that the edit distance problem with only block moves is NP-complete. In other words: suppose you are given a string S', a permutation T' of S' and a positive integer k. Determining whether S' can be converted to T' at a cost less than or equal to k, performing only move operations, is NP-complete. We employ a transformation from the edit distance problem with only moves. Given an instance S', T' (which is a permutation of the characters in S'), and an integer k, of the edit distance problem with only moves, let: S = S'; T = T'; l = k. As T' is a permutation of the characters of S', every deletion of a character would require a matching insertion of the same character. This means that character insertions and deletions (especially block deletions) are useless in the sense of minimizing the edit distance. Therefore, S can be converted to T with a cost ≤ l using character inserts, block moves and block deletes, if and only if S' can be converted to T' with cost ≤ k using only moves.
In the notation <insertions/deletions/moves>, where each entry is c (character operations), b (block operations) or - (operation not allowed), Theorem 1 uses a reduction from the edit distance problem with no insertions, no deletions, and block moves, <-/-/b>, to show that <c/b/b> is also NP-hard. The same proof can be applied to 7 more versions of the edit distance problem, proving they are also NP-complete: <-/c/b>, <-/b/b>, <c/-/b>, <c/c/b>, <b/-/b>, <b/c/b>, and <b/b/b> (the last three allow block insertions).

Non Recursive Block Moves and Deletions
Recursive operations allow one to deal with substrings that have already been dealt with, and which do not occur consecutively in the original string. In the following, we define when moves and deletes are considered recursive. Ukkonen [5] gives a similar definition of non-recursive block operations, called "restricted editing sequences".
Definition: A sequence of operations applied to a string is recursive with respect to move operations if it contains an operation which moves a substring whose characters do not occur consecutively in the original string.For example, if S is the string abcde and the character b is moved to obtain the string T = acdbe, then moving the substring dbe or ac are both considered as recursive moves.
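The example in this definition can be replayed with a short sketch. The helper names move and is_recursive_block are hypothetical, as is the convention that the destination index refers to the string with the block already removed. The substring test is only valid under the assumption that every character occurs at most once in the original string; with repeated characters one would have to track original positions instead.

```python
def move(s, length, src, dst):
    """Move the block s[src:src+length]; dst indexes the string without the block."""
    block = s[src:src + length]
    rest = s[:src] + s[src + length:]
    return rest[:dst] + block + rest[dst:]


def is_recursive_block(original, block):
    """True if the characters of block are not consecutive in the original string.

    Assumes all characters of original are distinct.
    """
    return block not in original


current = move("abcde", 1, 1, 3)   # move the character b, as in the example
```

Under these assumptions, moving dbe or ac out of the resulting string touches characters that were not adjacent in abcde, whereas a block such as cd is still consecutive in the original.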
Definition: A sequence of operations including character insertions, block moves, and block deletions is recursive with respect to move and delete operations if it moves or deletes a substring whose characters do not occur consecutively in the original string.
For example, if S is the string abcde and the character b is moved to obtain the string T = acdbe, then deleting the substring dbe or the substring ac is recursive with respect to move and delete operations.
Theorem 2. If there is a sequence of k recursive move operations that converts S into T, then there is a non-recursive sequence of no more than 3k moves that converts S to T.

Proof: By induction on k.
Base case: k = 1. One move operation converts S to T. Since this move is not recursive, S is converted to T in fewer than three non-recursive operations.
Inductive hypothesis: Assume that for every recursive sequence of l < k moves that converts S to T, there is a non-recursive sequence of no more than 3l moves that converts S to T. We prove that this holds for l = k.
Induction step: Consider any recursive sequence of k moves that convert S to T. Remove the last move of this sequence. Denote the obtained string by T̂. There is a recursive sequence of k − 1 moves that convert S to T̂. Using the inductive hypothesis, a non-recursive sequence converts S to T̂ using no more than 3(k − 1) operations. These 3(k − 1) operations introduce a parsing of S into r blocks A_1, …, A_r, where each block A_i contains consecutive characters in S. A move operation of a block b introduces at most two boundaries in the source location, where the substring to the left of the left boundary and the substring to the right of the right boundary can be either blocks that were originally there, or blocks that were transferred to that location by previous operations. A move operation of b also introduces a boundary in its destination. Therefore not all substrings A_i are necessarily substrings that were moved. This way a move operation introduces 3 additional blocks, two in its source location and one in its destination. More formally, b is of the form A_{i−1}^2 A_i ⋯ A_j A_{j+1}^1, where A_{i−1}^1 and A_{j+1}^2 denote the remaining prefix and suffix of the corresponding blocks, respectively. Every block A_x, i ≤ x ≤ j, which was already moved by non-recursive moves, could be moved straight to its final location. If blocks A_{i−1} and A_{j+1} were moved within the first 3(k − 1) operations, then when non-recursive operations are considered, only the partial blocks A_{i−1}^1 and A_{j+1}^2 are moved. However, the blocks A_{i−1}^2 and A_{j+1}^1 must move to their final location, which adds a cost of two to the number of operations by splitting the original blocks A_{i−1} and A_{j+1} (if A_{i−1} and A_{j+1} were not moved in a previous operation, this additional cost is left as a future cost).
We show that the inner blocks {A_i, …, A_j} of b, which were not moved within the first 3(k − 1) operations, are moved to their final location with no additional cost. Blocks which were never moved are located in different blocks since there was a block which was moved in between them or from between them in one of the previous operations. In the case a block was moved out from between two blocks, this operation was "charged" by three non-recursive moves instead of one (since this is a non-recursive move to start with), so these blocks are now moved for free. In case a block was moved in between two blocks that were never moved, without loss of generality, the block to the left of the border can be moved for free using the cost that was charged for this operation in its destination. By assigning the charge for each border to the block to its left, we are left with at most a single block for which the block to its right was moved out (otherwise it is a block to the left of a border of a block which was moved in, and is moved for free) or this block is one of the ends of b. In any of these cases the block can be moved for free using the charge that was applied to a block move source. Using the induction hypothesis, we have "charged" this move by three non-recursive moves instead of one. Therefore, we can move the two adjacent blocks without charging it again. The worst case uses three more operations in order to convert T̂ to T. Adding this to the number of operations executed in order to convert S into T̂ we get 3 + 3(k − 1) = 3k operations.

Theorem 3:
The bound of Theorem 2 is tight.

Proof:
The following example builds two strings S and T with substrings of the form S_n and T_n, respectively, where S_n and T_n are defined recursively, such that the non-recursive edit distance of S and T is three times the recursive edit distance of these strings. For simplicity, because each character will not occur more than once in both S and T, for this example we use the notation move(s) for moving a substring s of S, without stating its source and destination locations.

First consider the alphabet […] S_1 is a substring of S, not aligned with the substring T_1 of T, and therefore must be moved. The recursive edit distance is 2 (move(M_0) and move(L_1 M_0 L_0 R_0)) and the non-recursive edit distance is 4 (move(L_1), move(M_0), move(L_0) and move(R_0)).
In general, we use the following definition: […] Using the above definition of S_n and T_n, we continue taking out a mid part of a block (thus breaking a continuous string into three parts), and putting it into another block, again breaking a continuous string. This way we add three blocks by each move operation. If S_n should be recursively moved (given the assumption that S_n and T_n are not aligned in the entire strings S and T), these previous move operations turn into recursive ones, and cannot be performed by the non-recursive sequence of moves. A non-recursive sequence of moves must move each block separately to its final location. Since every recursive operation adds three more blocks, the non-recursive edit distance increases by three.
The recursive edit distance of S and T (which include S_n and T_n as non-aligned substrings) is n + 1 (including the move of the entire block). We have 3n blocks in addition to the one block we started with, which require 1 + 3n non-recursive moves. If n is unbounded, the ratio (1 + 3n)/(n + 1) approaches 3. We still have to define S and T so that S_n must be moved in S in order to be aligned with T_n in T. As we would like the other parts of S not to be moved instead of the substring S_n, moving them must cost more than moving S_n. In order to achieve this we construct 2n strings, {S […]
Theorem 4. A recursive sequence of k block moves and block deletes that converts S to T can be implemented by a non-recursive sequence of no more than 3k operations.
Proof: Given a sequence of k recursive block moves and block deletes that converts S to T, we construct a sequence of k recursive block moves that converts Ŝ to T̂ in the following way. For simplicity let $ be a character which does not occur in the alphabet of S and T. Define Ŝ = S$ and T̂ to be the string starting with T$ as its prefix followed by the concatenation of all substrings that were deleted in the original recursive sequence, in the exact order of the block deletions. We use the same sequence of recursive operations, but instead of performing a block deletion we move the block to the end of the generated string. At the end of this process we obtain T̂ using only recursive moves. By Theorem 2, the recursive sequence of k moves that converts Ŝ to T̂ can be implemented by a non-recursive sequence, Â, of no more than 3k operations. We use Â to construct a non-recursive sequence of moves and deletes that converts S to T. We denote by T̃ the suffix of T̂ that remains after eliminating its prefix T$, i.e., T̂ = T$T̃. Note that T̃ is obtained by moving substrings that are deleted from S (and therefore from Ŝ), and all characters that moved to T̃ remain there until the end of the process. We use one block deletion for each of these special block moves, thus the number of operations remains the same. Hence a non-recursive sequence of block moves yields a non-recursive sequence of block moves and block deletes with the same number of operations, at most 3k.
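The first step of this construction, replaying the operations on S$ while turning each block deletion into a move to the end of the string, can be illustrated by the following sketch. The tuple encoding of operations and the function names apply_ops and deletes_as_moves are assumptions made for the illustration; positions are 0-based, and the destination of a move is assumed to lie in the part of the string before $.

```python
def apply_ops(s, ops):
    """Apply moves and deletions to s; return the result and the deleted blocks in order.

    Each op is ("delete", pos, length) or ("move", pos, length, dst), where dst
    indexes the string with the moved block already removed.
    """
    deleted = []
    for op in ops:
        kind, pos, length = op[0], op[1], op[2]
        block = s[pos:pos + length]
        rest = s[:pos] + s[pos + length:]
        if kind == "delete":
            deleted.append(block)
            s = rest
        else:
            dst = op[3]
            s = rest[:dst] + block + rest[dst:]
    return s, deleted


def deletes_as_moves(s, ops, sentinel="$"):
    """Replay ops on s + sentinel, replacing every deletion by a move to the end."""
    cur = s + sentinel
    for op in ops:
        kind, pos, length = op[0], op[1], op[2]
        block = cur[pos:pos + length]
        rest = cur[:pos] + cur[pos + length:]
        dst = len(rest) if kind == "delete" else op[3]
        cur = rest[:dst] + block + rest[dst:]
    return cur


ops = [("move", 1, 1, 3), ("delete", 0, 2)]   # abcde -> acdbe -> dbe
t, deleted = apply_ops("abcde", ops)
t_hat = deletes_as_moves("abcde", ops)
```

In this small example the string produced with moves only consists of the target, the sentinel, and the deleted blocks in deletion order, which is the shape of T̂ described in the proof.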

Edit Distance with Block Deletions
In the previous section we have shown that the problem of finding the edit distance with block moves and block deletions is NP-complete. In this section we consider several variations of the edit distance problem with block deletions without block moves and show that for the following sets of operations, the problem can be solved optimally, in polynomial time, using variations of the traditional dynamic programming method: (1) block deletions; (2) block deletions and character insertions; (3) block deletions, character insertions, and character moves.
In the last case we prove that we can apply the dynamic programming algorithm for block deletions and character insertions and modify the way we calculate the cost.

Block Deletions with or without Character Insertions
Given two strings S and T, where T is a sub-sequence of S (i.e., it is possible to delete characters from S to obtain T), we are interested in the smallest integer k such that T is partitioned into k blocks (substrings), and the blocks of T are substrings of S, and occur in S in the same order as in T. If k = 1 this is the traditional pattern matching problem, which returns whether the pattern T occurs in S or not. Consider for example S = bcxyabczfdlmefij and T = abcdef. In this case there are three blocks in S occurring in the same order as in T. Define S' = $S$, where $ is a new character occurring neither in S nor in T. Finding the minimum number of blocks of S so that the blocks of T occur in S in the same order as in T is equivalent to finding the minimum number of block deletions minus 1, performed on S' so that S' becomes identical to T. The difference of 1 in the cost arises because every block deletion of S' can be matched with the corresponding preceding block of S, except for the first block of S'.
The block deletion problem can be solved in polynomial time using dynamic programming. Allowing character insertions in addition to block deletions requires small changes in the algorithm.
Lemma 1: […] is an optimal sequence of block deletions which converts s_1 ⋯ s_{n−1} into T. Otherwise, O is an optimal sequence of block deletions which converts s_1 ⋯ s_{n−1} into t_1 ⋯ t_{m−1}.

Proof:
(1) Assume that o_k ends with the character s_n: […] is an optimal sequence of block deletions which converts s_1 ⋯ s_{n−1} into T. Otherwise, […] let O' = […] be an optimal sequence of block deletions converting […]. Let i be the minimum index, 1 ≤ i ≤ […], so that the block deletion o'_i deletes some character which is also deleted by o_k. Define […]
The last lemma implies the following recurrence formula for the edit distance problem with block deletions, <-/b/->, where BD is a block deletion cost which is 1 for the character that opens the block and 0 otherwise:
A[0,0] = 0; A[0,j] = ∞ for j > 0; A[i,0] = 1 for i > 0;
A[i,j] = min{A[i − 1,j − 1], A[i − 1,j] + BD} if s_i = t_j, and A[i,j] = A[i − 1,j] + BD otherwise, for i, j > 0.

The dynamic programming algorithm is given in Algorithm 2. It uses a table A[i,j], where rows correspond to S and columns correspond to T. An auxiliary function, During_Deletion, is given in Algorithm 1. It receives the current cell in the dynamic programming table, A, indexed by (i,j), and returns whether the cheapest sequence of operations converting s_1 ⋯ s_i into t_1 ⋯ t_j ends with a block deletion, i.e., whether A[i,j] = A[i − 1,j] + 1 (s_i starts a block deletion) or A[i,j] = A[i − 1,j] (s_i is an internal character of a block deletion). In the latter case it should be verified that there exists an index 0 < i_0 < i such that s_{i_0} starts a block deletion, i.e., such that A[i_0,j] = A[i_0 − 1,j] + 1. If s_{i+1} follows a block deletion, there is no need to charge the cost of a block deletion again if s_{i+1} should be deleted too; it can be deleted "for free" with the cost applied to the first character, s_{i_0}, of the block deletion. The Delete function presented in Algorithm 2 is given two strings S and T of lengths n and m, respectively, and returns the minimum number of block deletions applied to S in order to convert it into T. If both strings are empty there is no cost to convert S into T. If only S is empty, the cost is infinity, since block deletions are applied to S. If T is empty, the cost is 1 since a single block deletion of the entire string S converts it into T.
Algorithm 1. During_Deletion function: returns whether cell (i,j) is involved in a block deletion operation.
After the dynamic programming table is initialized, the cost of converting s_1 ⋯ s_i into t_1 ⋯ t_j is calculated and stored in cell A[i,j] of the dynamic programming table. If the characters s_i and t_j are identical, the cost of converting s_1 ⋯ s_i into t_1 ⋯ t_j is equal to that of converting s_1 ⋯ s_{i−1} into t_1 ⋯ t_{j−1}, unless it is cheaper to apply a block deletion to the character s_i. The character s_i can be deleted together with other characters, and the cost of a block deletion is constant, no matter how many characters are deleted.

Algorithm 2. Edit distance with block deletion. function Delete(S,T) { […] }
Two cases are distinguished: the case in which the previous character was deleted, and the case in which it was not. In the former case there is no new charge for deleting s_i since the block deletion is charged to the first character of the block. In this case A[i,j] has the same cost as A[i − 1,j]. Otherwise, a block deletion starts with s_i, charging the first character of the block. In this case A[i,j] is equal to A[i − 1,j] + 1 (we assume that ∞ + 1 = ∞). If the characters s_i and t_j are not identical, s_i should be deleted. In order to determine the cost, it has to be checked whether s_i is the first character of a block deletion or not, as done previously. Since the During_Deletion function runs in time linear in the size of S, the total running time of the edit distance with block deletion algorithm is O(n^2 m).
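For experimentation, the same quantity can be computed with a slightly different formulation than the one described above. The sketch below is not the Delete function of Algorithm 2: instead of calling During_Deletion, it keeps two tables that record explicitly whether the last character of the prefix of S is kept or deleted, which gives O(nm) time. The function and table names are illustrative assumptions.

```python
import math


def block_deletion_distance(s, t):
    """Minimum number of block deletions turning s into t (math.inf if impossible)."""
    n, m = len(s), len(t)
    inf = math.inf
    # keep[i][j]: cost of s[:i] -> t[:j] when s[i-1] is kept and matched with t[j-1];
    #             keep[0][0] = 0 stands for the empty start.
    # dele[i][j]: cost of s[:i] -> t[:j] when s[i-1] is the last character of a deleted block.
    keep = [[inf] * (m + 1) for _ in range(n + 1)]
    dele = [[inf] * (m + 1) for _ in range(n + 1)]
    keep[0][0] = 0
    for i in range(1, n + 1):
        for j in range(m + 1):
            # extend the current block for free, or open a new block at cost 1
            dele[i][j] = min(dele[i - 1][j], keep[i - 1][j] + 1)
            if j > 0 and s[i - 1] == t[j - 1]:
                keep[i][j] = min(keep[i - 1][j - 1], dele[i - 1][j - 1])
    return min(keep[n][m], dele[n][m])
```

On the example strings S = bcxyabczfdlmefij and T = abcdef this formulation charges one unit per maximal run of deleted characters; the boundary cases follow the conventions stated for the Delete function (cost 0 for two empty strings, infinity when only S is empty, and 1 when only T is empty).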
Running Algorithm 2 on the above example, S = bcxyabczfdlmefij and T = abcdef, produces Figure 1a. The optimal path from the bottom right cell (corresponding to S and T) to the upper left cell (corresponding to the empty prefixes of S and T) is illustrated by gray colored cells. The number of block deletions is 4, and the deleted blocks are {bcxy, zf, lm, ij}.
The block deletion problem is now extended to include character insertions, which requires small modifications to Algorithm 2. An optimal solution to the edit distance problem of character insertions and block deletions will not insert a character c into S followed by a block deletion that includes c, since the insertion of c is unnecessary. Thus, block deletions and character insertions can be ordered so that all insertions follow block deletions. The following lemma refers to the case of character insertions and block deletions, but does not separate them, and considers both operations in every step, so that the two-dimensional table is traversed only once. The lemma is followed by a recurrence formula which is the basis of the polynomial time dynamic programming algorithm presented in Algorithm 3.

Lemma 2:
Let S = s 1 s n and T = t 1 t m be two strings of lengths n and m, respectively, and let   is an optimal sequence of character insertions and block deletions which converts   be an optimal sequence of block deletions and character insertions converting s 1 s n−1 into T with cost less than O .=Define   The last lemma implies the following recurrence formula for the edit distance problem with block deletions and character insertions, <c,b,->, where BD is a block deletion cost and is applied only for the character that opens the block.Algorithm 3. Edit Distance with Block Deletions and Character Insertion.
While Algorithm 2 deals only with block deletions, Algorithm 3 also permits character insertions to be applied to S. The algorithm is similar to the previous one, with the following difference: at each stage it also allows inserting t_j right after s_1 ⋯ s_i, if this is cheaper than deleting s_i within a block deletion or, when the characters are equal, than keeping the cost of converting s_1 ⋯ s_{i−1} to t_1 ⋯ t_{j−1}. The running time of the algorithm is O(n^2 m) since During_Deletion is linear.
Running Algorithm 3 on the example, S = bcxyabczfdlmefij and T = abcdefg, produces the table presented in Figure 1b. The number of character insertions and block deletions is 5, and corresponds to deleting the blocks {bcxy, zf, lm, ij} and inserting the character "g".
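The extension to character insertions can also be sketched in code. As before, this is a two-table formulation under unit costs, swapped in for the During_Deletion check, and is not a transcription of Algorithm 3; the names are illustrative assumptions. It relies on the observation that an insertion does not interrupt a block deletion, since the deleted characters remain consecutive in S.

```python
import math


def block_delete_insert_distance(s, t):
    """Minimum number of block deletions plus character insertions turning s into t."""
    n, m = len(s), len(t)
    inf = math.inf
    # keep[i][j]: cost of s[:i] -> t[:j] when s[i-1] is kept (or i == 0)
    # dele[i][j]: cost of s[:i] -> t[:j] when s[i-1] belongs to a deleted block
    keep = [[inf] * (m + 1) for _ in range(n + 1)]
    dele = [[inf] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                keep[0][0] = 0
                continue
            if i > 0:
                # extend the current block for free, or open a new one at cost 1
                dele[i][j] = min(dele[i - 1][j], keep[i - 1][j] + 1)
                if j > 0 and s[i - 1] == t[j - 1]:
                    keep[i][j] = min(keep[i - 1][j - 1], dele[i - 1][j - 1])
            if j > 0:
                # insert t[j-1] right after s[:i]; the status of s[i-1] is unchanged
                keep[i][j] = min(keep[i][j], keep[i][j - 1] + 1)
                dele[i][j] = min(dele[i][j], dele[i][j - 1] + 1)
    return min(keep[n][m], dele[n][m])
```

For S = bcxyabczfdlmefij and T = abcdefg the sketch counts the four deleted blocks and the single inserted character of the example.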

Block Deletions with Character Moves
We now consider the minimum number of block deletions, character insertions and character moves applied to S in order to attain T. We use the dynamic programming algorithm introduced in the previous section, taking into account character moves. A character move is an insert and a delete of the same character. Therefore, when performing dynamic programming, we try to reduce the cost by exchanging insertions and deletions with moves. We first consider the case of character deletions, character insertions, and character moves for general costs. Replacing the character deletion with block deletions is considered for uniform costs and is solved using dynamic programming.
Shapira and Storer [4] showed that in the case of uniform costs of character insertions, character deletions and character moves, the minimal edit distance occurs in any optimal path transforming S to T of the traditional edit distance. We now generalize this to all cases. We use the following notations: P and P̂ denote paths in a constructed dynamic programming table, starting at the upper leftmost cell and ending at the lower rightmost cell (the final cell). I_σ^P and D_σ^P denote the number of character insertions and character deletions of σ ∈ Σ, respectively, when converting S into T in path P, and c(x) denotes the cost of operation x, where x is either an insert character, a delete character or a move character. The traditional edit distance, including only character insertions and character deletions, is denoted by ed, and the edit distance of S and T including character moves is denoted by edm. Thus, ed_P and edm_P refer to the cost of path P in the traditional and move edit distance, respectively, i.e., if path P in the model of edit distance with moves includes both an insertion and a deletion of some character σ, the cost is calculated as the cost of a move, subtracting the cost of a delete and an insert.
Fact: For any two paths P and P̂ and every σ ∈ Σ, I_σ^P − D_σ^P = I_σ^P̂ − D_σ^P̂. The proof appears in Shapira and Storer [4], and is included here for completeness.
Proof: Denote by n_σ^S and n_σ^T the number of appearances of a character σ in the strings S and T respectively. For any path P which converts S into T, I_σ^P − D_σ^P = n_σ^T − n_σ^S.
The following lemma only considers the edit distance problem with character insertions, character deletions and character moves. Character moves are irrelevant unless the cost of a move is cheaper than the combined cost of a character insert and a character delete. Then block deletion is added to the set of operations, and a dynamic programming algorithm is presented, which is optimal under the assumption of unit costs.
Lemma 3. For any positive costs of character insertion, character deletion, and character move, denoted c(insert), c(delete) and c(move) respectively, such that c(move) < c(insert) + c(delete), the minimal edit distance including character moves can be reconstructed from every optimal path of the traditional edit distance.
Proof: A character move is deleting the character from its source location and inserting it again at its destination. By splitting characters according to the number of operations performed on them in path P, we have: […] Let P and P̂ be two paths converting S into T, such that ed_P ≤ ed_P̂. By using Equations 1 and 2 we find that: […] which implies that: […] By using Equations 1 and 3 we conclude that edm_P ≤ edm_P̂.
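One way to see why the ordering of paths is preserved is the following derivation. It is a restatement under stated assumptions and does not claim to reproduce the numbered equations of the proof: we write c_I, c_D, c_M for c(insert), c(delete), c(move), set d_σ = n_σ^T − n_σ^S, and assume that on every path each character that is both inserted and deleted is converted into a move, so that path P contains m_σ^P = min(I_σ^P, D_σ^P) moves of σ.

```latex
% Assumed shorthand: c_I = c(insert), c_D = c(delete), c_M = c(move),
% d_\sigma = n^T_\sigma - n^S_\sigma, m^P_\sigma = \min(I^P_\sigma, D^P_\sigma).
\begin{align*}
I^P_\sigma &= m^P_\sigma + \max(d_\sigma, 0), \qquad
D^P_\sigma = m^P_\sigma + \max(-d_\sigma, 0), \\
ed_P &= \sum_{\sigma} \bigl( c_I\, I^P_\sigma + c_D\, D^P_\sigma \bigr)
      = C + (c_I + c_D) \sum_{\sigma} m^P_\sigma, \\
edm_P &= ed_P - (c_I + c_D - c_M) \sum_{\sigma} m^P_\sigma
       = C + c_M \sum_{\sigma} m^P_\sigma, \\
C &= \sum_{\sigma} \bigl( c_I \max(d_\sigma, 0) + c_D \max(-d_\sigma, 0) \bigr).
\end{align*}
```

The first line uses the fact that I_σ^P − D_σ^P = d_σ on every path. Since C does not depend on the path and all costs are positive, both ed_P and edm_P grow with the same sum of m_σ^P, so a path that is no more expensive under ed is no more expensive under edm.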

Lemma 4.
For unit costs of character insertion, block deletion, and character move, the minimal edit distance including character moves can be reconstructed from every optimal path of the edit-distance table constructed by Algorithm 3 for character insertions and block deletions.

Proof:
Let $a \in \Sigma$. Since the cost of an insert is the same as the cost of a move, replacing an insertion of a and a deletion of a within a block deletion by a move of a has the same cost, unless a is deleted by a character deletion. In the case of a character deletion, Lemma 3 proves that the minimal edit distance including character moves can be reconstructed from every optimal path of the traditional edit distance, which is a special case of the algorithm of character insertions and block deletions. Otherwise, since there are no changes in the costs, the minimal edit distance including character moves can be reconstructed from every optimal path of the edit-distance table constructed using the algorithm of character insertions and block deletions.

Lemma 4 can also be applied when the cost of a character move is less than the combined cost of a character insertion and a delete operation, and the cost of a character insertion is equal to the cost of a character move. If there is an optimal path that leads to a cell and contains an insertion and a deletion of the same character, and neither of these operations has yet been converted into a move in the current path, we reduce the cost of the cell (assuming that a move is cheaper than an insertion plus a deletion). To determine the final cost of the cell, we refer to the (two) characters associated with it. We distinguish between two cases.

Consider first the case where the character is deleted among other characters, and the same character is inserted within the same path. In this case it is not worth replacing this character by a character move, since the cost of a character insertion is less than the cost of a character move together with the cost of a block deletion (assuming the cost of a block deletion is positive). The reason is that when a character is deleted among other characters within a block deletion, it is not charged an additional cost for this deletion, since all characters are deleted in a single operation. However, when converting the deletion of this character into a move, we must partition the deleted block into two sub-blocks, adding another block deletion, unless the character occurs at one of the block ends. If the character is not at one of the block ends, this additional charge makes this case more expensive than the one without moves whenever a character insertion is cheaper than a character move plus a block deletion. If the character is deleted at one of the block ends, shortening the block deletion so that it does not include that character does not change the cost, under the assumption that the cost of a character insertion is equal to the cost of a character move.

The situation is different when converting a character deletion and a character insertion (of the same character), both performed in the same path, into a move, when this character was deleted alone. In this case we subtract the cost of an insert and a delete from the cost of the current cell and add the cost of a move, since the character was charged both when it was deleted and when it was inserted, but should be charged only for the move. This is valid when the cost of a move is less than the combined cost of an insert and a delete. For simplicity, the algorithm presented in Algorithm 4 is given for the case of uniform costs, but it can easily be adapted to any costs that satisfy the above assumptions.
Using the reasoning above, an insertion and a deletion of the same character are converted into a character move only if the character was not deleted within a block deletion. The optimal algorithm for the edit distance problem with block deletions, character insertions, and character moves therefore first computes the edit distance with character insertions and block deletions using Algorithm 3, then chooses any optimal path in the constructed table, and reduces its final cost by exchanging the cost of an insert and a delete of the same character for the cost of a move.

The following algorithm shows how to traverse the dynamic programming table constructed by the Delete-Insert algorithm, based on Lemma 4. It uses two associative arrays, Insert and Delete, for storing the information gathered during the traversal. For simplicity, the algorithm only outputs the operations for converting S into T. By storing the indices of the characters, more precise information can be output, such as the source and destination locations of a move operation.

After applying the Delete-Insert algorithm of Algorithm 3 and constructing a dynamic programming table A, the algorithm traverses A starting at cell A[n,m] and ending at cell A[0,0]. At each cell the algorithm reconstructs the optimal operation that led to that cell. If the cell was determined from the cell to its left, this indicates a character insertion of $t_j$, and this information is saved in the Insert table at the location for $t_j$. If the cell was determined from the cell above, this indicates a block deletion of $s_i$. The table is checked to see whether this character is deleted alone. If so, this information is saved in the Delete table at the location for $s_i$. Otherwise, a block deletion (of at least two characters) is identified and output. Once cell A[0,0] is reached, the information stored in the Delete and Insert tables is retrieved, and the number of character insertions, character deletions and character moves is computed and output.
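The two phases described above can be sketched in Python as follows. This is a minimal illustration under unit costs, not the paper's Algorithm 3 or Algorithm 4: the recurrence used in `delete_insert_table`, the tie-breaking order in `traverse`, and all function and variable names are assumptions made for this sketch. Only characters deleted alone are converted into moves, and the traversal follows a single optimal path chosen by the tie-breaking rule.

```python
from collections import Counter


def delete_insert_table(S, T):
    """A[i][j] = minimum number of unit-cost operations (character insertions
    and block deletions) that turn S[:i] into T[:j]. Runs in O(n^2 * m)."""
    n, m = len(S), len(T)
    INF = float("inf")
    A = [[INF] * (m + 1) for _ in range(n + 1)]
    A[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = INF
            if j > 0:
                best = A[i][j - 1] + 1             # insert t_j (cell to the left)
            if i > 0 and j > 0 and S[i - 1] == T[j - 1]:
                best = min(best, A[i - 1][j - 1])  # s_i is kept and matches t_j
            for k in range(i):                     # delete the block s_{k+1}..s_i
                best = min(best, A[k][j] + 1)
            A[i][j] = best
    return A


def traverse(S, T, A):
    """Walk one optimal path from A[n][m] back to A[0][0] and count operations.
    Tie-breaking (match, then insert, then delete) is an arbitrary choice."""
    i, j = len(S), len(T)
    insert, delete = Counter(), Counter()  # the two associative arrays
    blocks = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i - 1] == T[j - 1] and A[i][j] == A[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif j > 0 and A[i][j] == A[i][j - 1] + 1:
            insert[T[j - 1]] += 1
            j -= 1
        else:
            # Find the start k of the deleted block s_{k+1}..s_i.
            k = next(k for k in range(i) if A[k][j] + 1 == A[i][j])
            if i - k == 1:
                delete[S[i - 1]] += 1      # character deleted alone
            else:
                blocks += 1                # block of at least two characters
            i = k
    # Each insert/delete pair of the same character becomes one move.
    moves = sum(min(insert[c], delete[c]) for c in insert)
    return {
        "moves": moves,
        "insertions": sum(insert.values()) - moves,
        "deletions": sum(delete.values()) - moves,
        "block_deletions": blocks,
    }


A = delete_insert_table("ab", "ba")
print(A[-1][-1], traverse("ab", "ba", A))
```

The cost of the traversed path after conversion is the sum of the four counts, which equals `A[n][m]` minus the number of moves. Storing the indices `i` and `j` alongside each counted character would allow the source and destination of every move to be reported as well.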

Lemma 5:
The algorithm presented in Algorithm 4 produces an optimal solution for the edit distance problem with character insertions, block deletions and character moves.

Proof:
Assume, by contradiction, that the algorithm presented in Algorithm 4 is not optimal, and let O be an optimal sequence of operations which converts S into T.

Figure 2 presents the dynamic programming table constructed after applying the algorithm of Algorithm 4 for the edit distance with unit-cost operations (insert, delete, and move) to the strings S = abcbcbcabcabcaa and T = bcabcabcyabca. The light and dark gray cells represent an optimal path. The light gray cells represent an insert (vertical) and a delete (horizontal) of the character a. Before moves are taken into account, the cost of this path is 4: a block deletion of abc, a character insertion of a, a character insertion of y and a character deletion of a. Replacing the character insertion and the character deletion of a by a single character move of a reduces the cost to 3, which is the final cost of the edit distance including character moves. Notice that another, more expensive option is to pair the insertion of a with the a inside the deleted block, shortening the block deletion so that it deletes only bc and moving that a instead, but this is not optimal.
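The arithmetic of the Figure 2 example can be checked with a short Python sketch. The positions at which the four operations are applied, the operation labels and the helper `cost_with_moves` are assumptions chosen for illustration; the text above only names the operations themselves.

```python
from collections import Counter

S, T = "abcbcbcabcabcaa", "bcabcabcyabca"

# Replay the four unit-cost operations of the path described above.
s = S[3:]                  # block deletion of the leading "abc"
s = s[:2] + "a" + s[2:]    # character insertion of a
s = s[:8] + "y" + s[8:]    # character insertion of y
s = s[:-1]                 # character deletion of the final a
assert s == T

ops = [("block_delete", "abc"), ("insert", "a"), ("insert", "y"), ("delete", "a")]


def cost_with_moves(ops):
    """Unit-cost total after pairing each inserted character with a deletion
    of the same character that was performed alone (not inside a block)."""
    inserted = Counter(ch for kind, ch in ops if kind == "insert")
    deleted_alone = Counter(ch for kind, ch in ops if kind == "delete")
    moves = sum(min(inserted[ch], deleted_alone[ch]) for ch in inserted)
    return len(ops) - moves


print(len(ops), cost_with_moves(ops))  # 4 3
```

The a inside the deleted block abc is deliberately not counted in `deleted_alone`, which mirrors the rule that a character deleted within a block is not converted into a move.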

Conclusions
We have shown that the problem of finding the edit distance considering only block deletions can be solved in polynomial time using dynamic programming, and that adding character insertions requires only small changes to the algorithm. Adding character moves is handled by the same dynamic programming algorithm followed by a traversal of any optimal path in order to reduce its cost. However, adding block moves to the set of operations makes the problem NP-complete, and it can be reduced to the problem of non-recursive block moves and block deletions within a constant factor. An interesting open question is whether the problem remains NP-complete when block sizes are restricted in some fashion.
For a string S and a position 1 p n   , the operation σ to the pth position of S. After performing this operation,

Lemma 1:
Let o be a block deletion which deletes a substring that ends with a character c; $o|_c$ denotes the substring o after truncating its last character. If o does not end with c, $o|_c$ is equal to o. The following lemma refers to the case of block deletions only, and implies a recurrence formula which is the basis of the dynamic programming algorithm presented in Algorithm 2. Let $S = s_1 \cdots s_n$ and $T = t_1 \cdots t_m$ be two strings of lengths n and m, respectively, and let $O = o_1, \ldots, o_k$ be an optimal sequence of block deletions which converts S into T. If $o_k$ ends with the character $s_n$, then

Figure 1. Edit distance example: (a) with block deletion for S = bcxyabczfdlmefij and T = abcdef; (b) with character insertions and block deletion for S = bcxyabczfdlmefij and T = abcdefg.

ioo..
for deleting the character a. Obviously, the operations of O′ convert S into T. From the fact above, any sequence of operations which converts S into T has the same difference between the number of its insertions and the number of its deletions for every character $\sigma \in \Sigma$. Formally, stand for the number of insertions and deletions of σ in O′, and denote the number of insertions and deletions of an optimal conversion of S into T using only character insertions and block deletions (e.g., an output of Algorithm 3). If for all $\sigma \in \Sigma$, . Thus, the algorithm presented in Algorithm 4 gives the same cost as the optimal sequence of operations (converting back into a move character $o_i$) and it is optimal, which contradicts our assumption. Therefore, there exists a character σ in Σ such that , We show that there exists a character $\sigma \in \Sigma$ for which of <b,c,-> (Lemma 2). Let σ be such a character in Σ for which We now construct a third sequence of operations O″ for converting S into T, which uses character insertions, block deletions and character moves. O″ includes the same operations as O except those operations where σ is involved. Instead, the operations applied to σ are replaced by character moves of σ from the location where it is deleted to the location where it is inserted. Thus O″ contradicts the optimality of O and the algorithm presented in Algorithm 4 is optimal.

Figure 2. Dynamic table for the edit distance of block deletions and character moves for the strings S = abcbcbcabcabcaa and T = bcabcabcyabca.
The sequence O″ of block deletions is a sequence of block deletions which converts S into T with cost less than O, which contradicts the optimality of O.
If $o_k$ deletes $s_n$, and $s_n$ is deleted within a block deletion of length greater than 1, the block deletion charge is applied to some previous character, and $s_n$ is deleted for free. Thus, shortening $o_k$ to not include $s_n$ is done within the same cost, and be an optimal sequence of block deletions converting $s_1 \cdots s_{n-1}$ into T with cost less than O. Define $\hat{o}_i$ to be a block deletion of the contiguous substring starting with characters n:
In this case $s_n$ is equal to $t_m$, and O is also an optimal sequence of block deletions which converts $s_1 \cdots s_{n-1}$ into $t_1 \cdots t_{m-1}$. Otherwise, if O is not optimal, let be an optimal sequence of block deletions which converts $s_1 \cdots s_{n-1}$ into $t_1 \cdots t_{m-1}$ with cost less than O. Then O′ is a sequence of block deletions which converts S into T with cost less than O, which contradicts the optimality of O.
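The case analysis above (the last character $s_n$ is either removed by a block deletion or kept and matched with $t_m$) suggests a simple table-filling procedure for block deletions only. The following Python sketch is one way to realize it under unit costs; it is an assumption-based illustration rather than a transcription of Algorithm 2, and the function name and the choice to return infinity when T cannot be obtained are hypothetical.

```python
def block_deletion_distance(S, T):
    """Minimum number of block deletions that turn S into T, or infinity
    if T is not a subsequence of S (no insertions are available)."""
    n, m = len(S), len(T)
    INF = float("inf")
    # A[i][j] = minimum number of block deletions turning S[:i] into T[:j].
    A = [[INF] * (m + 1) for _ in range(n + 1)]
    A[0][0] = 0
    for i in range(1, n + 1):
        for j in range(m + 1):
            best = INF
            if j > 0 and S[i - 1] == T[j - 1]:
                best = A[i - 1][j - 1]          # s_i is kept and equals t_j
            for k in range(i):                  # s_{k+1}..s_i deleted as one block
                best = min(best, A[k][j] + 1)
            A[i][j] = best
    return A[n][m]


print(block_deletion_distance("bcxyabczfdlmefij", "abcdef"))
```

Trying every block start `k` for each cell gives a running time of O(n²·m) for this sketch. For the strings of Figure 1(a) the only way to keep abcdef is to delete the four runs bcxy, zf, lm and ij.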
of block deletions and character insertions which converts S into T.
If $o_k$ is a character insertion: then if it is an insertion of the character $t_m$, then Otherwise, if $o_k$ is a character insertion of a character different from $t_m$, then O is an optimal sequence of block deletions and character insertions which converts $s_1 \cdots s_{n-1}$ into $t_1 \cdots t_{m-1}$.
insertions and block deletions which converts $s_1 \cdots s_{n-1}$ into T. Otherwise, O is an optimal sequence of character insertions and block deletions which converts $s_1 \cdots s_{n-1}$ into $t_1 \cdots t_{m-1}$.
The Delete-Insert function runs in time $O(n^2 \cdot m)$. Traversing the dynamic programming table takes $O(n \cdot m)$; therefore, the total running time is $O(n^2 \cdot m + |\Sigma|)$, where Σ is the alphabet.
Edit Distance with Block Deletions, Character Insertion and Character Moves.