A Quaternary Code Correcting a Burst of at Most Two Deletion or Insertion Errors in DNA Storage

Due to the properties of DNA data storage, errors that occur in DNA strands make error correction an important and challenging task. In this paper, a new quaternary code design suitable for DNA storage is proposed to correct at most two consecutive deletion or insertion errors. Decoding algorithms for the proposed code are presented for the cases of one and two deletion or insertion errors, and it is proved that the proposed code can correct at most two consecutive errors. Moreover, lower and upper bounds on the cardinality of the proposed quaternary code are evaluated, and the redundancy of the proposed code is shown to be roughly $2\log_4 8n$.


Introduction
In recent years, because of its huge capacity and excellent durability, deoxyribonucleic acid (DNA) storage is becoming attractive for future long-term data storage [1][2][3]. However, during the processes of DNA storage, the molecule can be faced with errors that do not normally occur in traditional storage devices such as deletion and insertion errors [4]. Therefore, research to address deletion and insertion errors is extremely significant in DNA storage, and error-correcting codes for the errors have been studied. Our work focuses on the codes capable of correcting multiple deletion or insertion errors in DNA storage.
For correcting one deletion or insertion error in binary codes, Varshamov-Tenengolts (VT) codes were first proposed in [5], and in the same year a modified VT code construction was provided in [6] to correct a single deletion, insertion, or substitution error. Shortly thereafter, to deal with more than a single error, Levenshtein extended the VT code to a binary code that can correct at most two consecutive deletion or insertion errors [7]. In [8], a binary codeword was arranged as an array with b rows, each row being a binary VT codeword, so that the construction could correct a burst of exactly b deletion or insertion errors (for any fixed b ≥ 2). Then, the authors of [9] proposed a binary shifted-Varshamov-Tenengolts (SVT) code to obtain an improved construction that still corrects exactly b errors but with lower redundancy than the one in [8]. Motivated by the efficient correction and low redundancy of the VT codes, the authors of [10,11] proposed linear-time encoders that implement the binary VT code while satisfying the homopolymer run and Guanine-Cytosine (GC)-content constraints [12,13], which are among the important properties of a DNA strand. However, the binary VT codes used in these linear-time encoders correct only a single nucleotide of a DNA strand. With an approach similar to [10,11], but to correct a burst of exactly b deletions or insertions of DNA symbols, the authors of [14] applied the encoders of the binary modified VT code in [6] and the binary SVT codes in [9]. Then, by interleaving bits of binary VT codewords and binary SVT codewords, the work [9] obtained a binary code construction that can correct a burst error of size exactly 2b, and finally, the codeword of this construction was translated to DNA symbols.
A non-binary VT code was first proposed in [15], and a non-binary SVT code was proposed in [16]. The codes were defined over a q-ary alphabet for any q > 2. Similar to the binary codes, the q-ary VT and q-ary SVT codes can correct a single deletion or insertion of a symbol. To correct multiple errors, the construction in [8,9] can be applied to obtain a q-ary code that can correct a burst of exactly b deletion or insertion errors. However, designing q-ary VT codes that can correct multiple deletion or insertion errors has remained an interesting problem [17]. Recently, some works [18][19][20][21] have focused on code designs that correct an exact number of multiple errors, but the efficient design of q-ary codes (or even quaternary codes) that can correct a burst of at most b deletion or insertion errors is still an open problem. The authors of [22] proposed a non-binary code correcting at most two consecutive deletions with redundancy log n + log q log (log n + 6) + log 6 + 3. In [22], the authors used the construction method in [9] with one binary code from [7] and a modified version of it in the interval P. In contrast, we propose a quaternary code that is suitable for robust DNA storage and can correct at most two consecutive deletion or insertion errors with a direct construction. Moreover, the redundancy of the proposed code is improved compared to [22].
Regarding the cardinality of VT codes, a lower bound on the size of the best class of VT codes has been available for about 50 years, but an upper bound is rarely provided, even in the binary case. The author of [23] used a Mixed Integer Linear Programming (MILP) relaxation technique to obtain a tighter upper bound for the binary VT code; for example, for the length n = 11, the maximum size of a single-deletion code was calculated as 173. Moreover, a conjecture about the maximum size of the VT code for all n was also provided. However, in this work, we focus on the error correction capability of the proposed code design, and we use the previous methods in [7,15] to evaluate the lower bound and upper bound of the proposed code design.
In our work, we extend the binary codes of [7] by adding two constraints that determine the exact values and positions of the errors in the quaternary sequence. By mathematically analyzing the possible error cases, we propose decoding algorithms that prove the error correction capability of this code design. We note that the main concern of this work is the error correction capability of the quaternary code design, not the constraints of DNA storage. It is assumed that the combined design of the error correction code and the constraints of DNA storage is handled by other algorithms [11,24]. The main contributions of this paper can be summarized as follows.

• We propose a quaternary code design that is suitable for the deletion or insertion channel, especially for the mapping 0 ↔ A, 1 ↔ C, 2 ↔ T, and 3 ↔ G. This proposed design is directly applicable to sequencing in DNA storage. Furthermore, the proposed code can correct at most two consecutive deletion or insertion errors.
• We propose two decoding algorithms for the proposed code to correct one deletion and two consecutive deletion errors. For the decoding of insertion errors, the differences between the deletion and insertion cases are shown, and the important functions for correcting insertion errors are presented in the Appendix.
• We provide the lower bound and evaluate the upper bound of the cardinality of the proposed code design. The redundancy of the proposed code design is also calculated to be at most $2\log_4 8n$.
This paper is organized as follows. In Section 2, we list basic notations and definitions used in the rest of the paper and we briefly present previous binary and quaternary code constructions to correct one and two consecutive deletions. Then, Section 3 contains the proposed code construction, a proof of the correction capability, and the bounds of the cardinality for the proposed quaternary code. Section 4 provides a discussion and, finally, conclusion is presented in Section 5 of this paper.

Notation and Definition
Let $\mathbb{F}_2^n$ and $\mathbb{F}_4^n$ be the sets of binary and quaternary sequences of length $n$, respectively. Let a quaternary codeword of length $n$ be defined as $c = (c_1, c_2, \ldots, c_n) \in \mathbb{F}_4^n$. Then, a modified sequence $c^{n-b}_l$ of the sequence $c$ is defined as $c^{n-b}_l = (c_1, c_2, \ldots, c_{l-2}, c_{l-1}, c_{l+b}, c_{l+b+1}, \ldots, c_n) \in \mathbb{F}_4^{n-b}$, where $c_l, c_{l+1}, \ldots, c_{l+b-1}$ are deleted from $c$. Similarly, a sequence $c^{n+b}_l$ of the sequence $c$ is defined as $c^{n+b}_l = (c_1, c_2, \ldots, c_{l-1}, h_1, h_2, \ldots, h_b, c_l, c_{l+1}, \ldots, c_n) \in \mathbb{F}_4^{n+b}$, where $h_1, h_2, \ldots, h_b$ are inserted starting from the $l$-th position of $c$. For a binary sequence $x = (x_1, x_2, \ldots, x_n) \in \mathbb{F}_2^n$, we consider the sequence $0x$ of length $n + 1$, where $0x = (0, x_1, x_2, \ldots, x_n) \in \mathbb{F}_2^{n+1}$. For simplicity, the sequence $0x$ of length $n + 1$ is regarded as having the starting value $x_0 = 0$.
For example, a binary sequence $x$ of length 10 is given as $x = (0, 0, 1, 0, 0, 1, 1, 1, 1, 1)$. For convenience, the binary sequence can also be written as $x = 0010011111$. In the rest of this paper, these two notations are used interchangeably, so the binary sequence $0x$ of length 11 is $0x = 00010011111$. The run-length vector $r$ collects the lengths of the runs of zeros and ones in $0x$. Here, the binary sequence $0x$ is composed of four runs $u_0 u_1 u_2 u_3$, which are $u_0 = 000$, $u_1 = 1$, $u_2 = 00$, and $u_3 = 11111$. For a non-negative integer $k$, the runs of zeros and ones are denoted by $u_{2k}$ and $u_{2k+1}$, respectively. Then, the run-length vector of $0x$ is $r = (r_0, r_1, r_2, r_3) = (3, 1, 2, 5)$.
Let $|r|$ be the total number of elements in the run-length vector $r$ of $0x$, corresponding to the total number of runs in $0x$. Then, from the run-length vector, the run-syndrome of the binary sequence $0x$ is defined as
$$\mathrm{Rsyn}(0x) = \sum_{i=0}^{|r|-1} i\, r_i. \tag{1}$$
In the previous example, for $0x = 00010011111$, since the run-length vector $r$ is $(3, 1, 2, 5)$, $\mathrm{Rsyn}(0x) = \sum_{i=0}^{4-1} i\, r_i = 20$. If the $j$-th bit of $0x$ belongs to the $m$-th run $u_m$, we define $k_{0x}(j) = m$ as the index of the run, for $1 \le j \le n$. Since the total number of elements of the run-length vector cannot exceed the length of $0x$, $|r|$ is bounded as
$$|r| \le n + 1, \tag{2}$$
where equality is satisfied if the binary sequence is $0x = 0101010\cdots$. In the previous example, the binary sequence $0x = 00010011111$ of length 11 has the run-length vector $r = (r_0, r_1, r_2, r_3) = (3, 1, 2, 5)$, and the total number of elements of the run-length vector of $0x$ is $|r| = 4 < n + 1 = 11$. Since the third bit $x_3 = 1$ belongs to the run $u_1 = 1$, the index of the run to which the third bit belongs is $k_{0x}(3) = 1$.

Previous Works
In the binary case, to the best of our knowledge, the VT code in [5] is the best code to correct a single deletion or insertion error, and the modified VT code in [6] is the best code to correct a single deletion, insertion, or substitution error. To correct more than a single error, we briefly recap the binary code correcting at most two consecutive deletions from [7]. Moreover, we briefly present the deletion correction capability of the binary code in [7] when a single deletion or two consecutive deletions occur.

Definition 1.
For $0 \le d \le 2n - 1$, the binary code $C(n, 2)$ in [7] with length $n$ is given as
$$C(n, 2) = \{\, g \in \mathbb{F}_2^n : \mathrm{Rsyn}(0g) \equiv d \pmod{2n} \,\}. \tag{3}$$
The correction capability of the code in Definition 1 was also proved in [7]. From the length of the received sequence $y$, we know whether one bit or two consecutive bits were removed from the codeword $g$. If one deletion at the $j$-th bit or two consecutive deletions at the $j$-th and $(j + 1)$-th bits occur, then $y$ is $g^{n-1}_j$ or $g^{n-2}_j$, respectively.
To determine the position of the deleted bit, we first calculate the difference of the run-syndrome as $\Delta = (d - \mathrm{Rsyn}(0y)) \bmod 2n$. If one deletion error occurs, $\Delta = (d - \mathrm{Rsyn}(0g^{n-1}_j)) \bmod 2n$, and if two consecutive deletions occur, $\Delta = (d - \mathrm{Rsyn}(0g^{n-2}_j)) \bmod 2n$. These values are used to identify the value and position $j$ of the deleted bit if there is one deletion in the codeword $g$, or the values and positions $j$ and $j + 1$ of the two deleted bits if two consecutive deletions occur in the codeword $g$.
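The correction capability of the binary code can be illustrated with a small exhaustive check. The sketch below is not the decoder of [7], which uses $\Delta$ to locate the error directly; it is a brute-force search, assuming the membership condition $\mathrm{Rsyn}(0g) \equiv d \pmod{2n}$, that re-inserts the missing burst in every possible way and keeps the results that are codewords. The names `in_code` and `decode_burst_deletion` are hypothetical.

```python
from itertools import groupby, product

def rsyn(x):
    """Run-syndrome of 0x."""
    runs = [len(list(g)) for _, g in groupby([0] + list(x))]
    return sum(i * r for i, r in enumerate(runs))

def in_code(x, d):
    """Assumed membership test for C(n, 2): Rsyn(0x) = d (mod 2n)."""
    return rsyn(x) % (2 * len(x)) == d

def decode_burst_deletion(y, n, d):
    """All codewords reachable from y by inserting n - len(y) consecutive bits."""
    k = n - len(y)                      # 1 or 2 deleted bits
    found = set()
    for pos in range(len(y) + 1):
        for burst in product((0, 1), repeat=k):
            cand = tuple(y[:pos]) + burst + tuple(y[pos:])
            if in_code(cand, d):
                found.add(cand)
    return found

n, d = 3, 0
code = [g for g in product((0, 1), repeat=n) if in_code(g, d)]
print(code)                             # [(0, 0, 0), (1, 0, 1)]

# Every burst of one or two deletions leads back to exactly one codeword.
for g in code:
    for k in (1, 2):
        for j in range(n - k + 1):
            y = g[:j] + g[j + k:]
            assert decode_burst_deletion(y, n, d) == {g}
```

The search is exponential only in the burst length, but it visits every position; the syndrome difference $\Delta$ is what allows a practical decoder to avoid this scan.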
In the quaternary case, codes that correct a single deletion or insertion error also exist. An overview of q-ary insertion and deletion-correcting codes with length $n$ is briefly presented in Definitions 2 and 3. The VT code family, known as the set of the most basic codes for correcting a single deletion or insertion, is defined as follows [15].

Definition 2.
For $0 \le a < n$ and $0 \le e < q$, the q-ary VT code with length $n$, $\mathrm{VT}_{a,e}(n, q)$, is defined as
$$\mathrm{VT}_{a,e}(n, q) = \Big\{\, c \in \mathbb{F}_q^n : \sum_{i=1}^{n} (i-1)\,\alpha_i \equiv a \pmod{n},\ \sum_{i=1}^{n} c_i \equiv e \pmod{q} \,\Big\},$$
where $\alpha_1 = 1$ and
$$\alpha_i = \begin{cases} 1, & c_i \ge c_{i-1}, \\ 0, & c_i < c_{i-1}, \end{cases} \qquad 2 \le i \le n.$$
From Definition 2, since the binary sequence $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_n)$ is strongly related to the q-ary sequence $c$, a deletion of the $j$-th symbol in the codeword $c$ also leads to a deletion of the $j$-th bit in the binary sequence $\alpha$. Hence, with the help of the binary sequence $\alpha$, the q-ary sequence is finally corrected.
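As an illustration of how the binary sequence $\alpha$ is derived from a q-ary sequence, the following minimal Python sketch computes $\alpha$ and the pair $(a, e)$ for a given sequence. It assumes the q-ary VT definition of [15] in its commonly stated form ($\alpha_1 = 1$, $\alpha_i = 1$ if and only if $c_i \ge c_{i-1}$, and weights $i - 1$ in the syndrome); the function names are hypothetical.

```python
def alpha(c):
    """Binary companion sequence: alpha_1 = 1, alpha_i = 1 iff c_i >= c_{i-1}."""
    return [1] + [int(c[i] >= c[i - 1]) for i in range(1, len(c))]

def vt_parameters(c, q):
    """Pair (a, e) such that c lies in VT_{a,e}(n, q) under the assumed definition."""
    n = len(c)
    # enumerate() starts at 0, which is the weight (i - 1) for the 1-based index i.
    a = sum(i * bit for i, bit in enumerate(alpha(c))) % n
    e = sum(c) % q
    return a, e

c = [0, 3, 0, 0, 0, 1, 1, 3, 2, 2]
print(alpha(c))             # [1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
print(vt_parameters(c, 4))  # (5, 0)
```

Every q-ary sequence of length $n$ belongs to exactly one code $\mathrm{VT}_{a,e}(n, q)$, which is why `vt_parameters` can simply read off the parameters.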
Similarly, the authors of [16] proposed a single deletion-correcting code that defined the q-ary SVT code.

Definition 3.
For $0 \le a \le P$, $0 \le e < q$, and $f \in \{0, 1\}$, the q-ary SVT code $\mathrm{SVT}_{a,e,f}(n, P, q)$ with length $n$ is defined as in [16].
Compared to the construction of the q-ary VT code, since mod $(P + 1)$ is used in the constraint (6) in Definition 3 instead of mod $n$, the redundancy of the q-ary SVT code is reduced from $\log_q (n + 1)$ to $\log_q (2P + 2) + 1$. The constraint (8) is added to imply that the binary sequence $\alpha$ belongs to the binary SVT code. Hence, similar to the correcting method in Definition 2, the q-ary SVT code in Definition 3 can correct one deletion in any position.
However, the q-ary VT code in Definition 2 and the q-ary SVT code in Definition 3 correct only a single deletion or insertion error and cannot correct consecutive deletions or insertions in the sequence. To overcome this drawback, the idea in [8,9] of constructing binary codes that correct a burst of exactly $b$ deletion or insertion errors, for $b \ge 2$, can be carried over to the q-ary case. The q-ary codeword $c$ with length $n$ is treated as a codeword array $A_b(c)$ of size $b \times \frac{n}{b}$, and the codeword is arranged column-by-column. Then, to reduce the redundancy compared to [8], the first row and each of the other $(b - 1)$ rows in the codeword array are encoded by a q-ary VT code and a q-ary SVT code, respectively. With this construction, one deletion or insertion error in each row can be corrected by the q-ary VT code or q-ary SVT code, such that a burst of $b$ consecutive deletion or insertion errors can be corrected.
For example, for correcting a burst of deletions of size two, the q-ary codeword $c$ with length $n$ is presented as a $2 \times \frac{n}{2}$ array $A_2(c)$, which is given by
$$A_2(c) = \begin{bmatrix} c_1 & c_3 & \cdots & c_{n-1} \\ c_2 & c_4 & \cdots & c_n \end{bmatrix}.$$
Since each row of $A_2(c)$ is protected by the q-ary VT code or SVT code with length $\frac{n}{2}$, the code from $A_2(c)$ can correct exactly two consecutive deletions.
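The reason why the array arrangement works can be seen in a few lines of Python. The sketch below, with the hypothetical helpers `to_array` and `is_single_deletion`, arranges a sequence column-by-column into $b$ rows and checks that a burst of exactly $b = 2$ deletions removes exactly one symbol from each row, which is the error that the row codes are designed to handle.

```python
def to_array(c, b):
    """Arrange c column-by-column into b rows: row k holds c_k, c_{k+b}, ..."""
    assert len(c) % b == 0
    return [list(c[k::b]) for k in range(b)]

def is_single_deletion(after, before):
    """True if `after` equals `before` with exactly one element removed."""
    return any(before[:i] + before[i + 1:] == after for i in range(len(before)))

c = list(range(1, 9))              # stands for (c_1, ..., c_8)
print(to_array(c, 2))              # [[1, 3, 5, 7], [2, 4, 6, 8]]

# Delete every possible burst of two consecutive symbols.
for j in range(len(c) - 1):
    y = c[:j] + c[j + 2:]
    for before, after in zip(to_array(c, 2), to_array(y, 2)):
        assert is_single_deletion(after, before)
```

The same check fails for a burst of a single deletion, because the received length is then odd and the rows no longer line up; this is why the array construction handles bursts of exactly $b$ errors rather than at most $b$.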
To sum up, the q-ary VT and SVT codes can be used to correct one or exactly two consecutive deletion or insertion errors. However, a quaternary code that corrects at most two consecutive deletion or insertion errors has not been developed. In the following section, we propose a new design of quaternary codes that are suitable for DNA storage and can correct at most two consecutive deletion or insertion errors.

Proposed Code Design
This section provides a new design for a quaternary code to correct at most two consecutive deletions or insertions of symbols. The construction of the proposed code is given in Section 3.1. Sections 3.2 and 3.3 prove the correction capabilities of the presented code if one deletion occurs or two consecutive deletion errors occur, respectively. The decoding of insertion errors is presented in Section 3.4. The lower bound and the upper bound on the cardinality of the proposed code are derived in Section 3.5.

Code Construction
To obtain a quaternary code that corrects one or two consecutive symbol errors, we state the proposed code design in the following definition. In the proposed code design, a binary sequence that has the same length as the quaternary sequence and is derived from it is used to construct the constraints for the proposed code.

Definition 4.
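For $0 \le a \le 8n$, $0 \le d \le 2n - 1$, and $0 \le e < 4$, a quaternary code $C(n, 4)$ has a codeword $c = (c_1, c_2, \cdots, c_n)$. First, we consider a mapping from the quaternary codeword $c$ to a binary sequence $x = (x_1, x_2, \cdots, x_n)$, for $1 \le i \le n$, as
$$x_i = \begin{cases} 0, & c_i \in \{0, 1\}, \\ 1, & c_i \in \{2, 3\}. \end{cases} \tag{10}$$
Then, the quaternary code $C(n, 4)$ which satisfies the following three conditions can correct at most two consecutive deletion or insertion errors:
$$\mathrm{Rsyn}(0x) \equiv d \pmod{2n}, \tag{11}$$
$$\sum_{i=1}^{n} i\, c_i \equiv a \pmod{8n + 1}, \tag{12}$$
$$\sum_{i=1}^{n} c_i \equiv e \pmod{4}. \tag{13}$$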
The basic idea of the mapping (10) is that the quaternary codeword $c$ corresponds to a binary sequence $x$ with the same length $n$. Therefore, a deletion in the $j$-th position of the codeword $c$ also leads to a deletion in the $j$-th position of the binary sequence $x$. For example, if the received sequence is $y = c^{n-1}_j \in \mathbb{F}_4^{n-1}$, then after using the mapping (10), we obtain the binary sequence $x^{n-1}_j \in \mathbb{F}_2^{n-1}$, which has one deletion error in the $j$-th position. In Definition 4, the condition (11) is the same as the condition (3) in Definition 1 for $C(n, 2)$, which means that the sequence $x$ is a binary codeword of $C(n, 2)$. Therefore, decoding of the binary sequence $x$ can be used to find the positions of the deleted symbols and to narrow down the values of the deleted symbols of the codeword $c$.
The two constraints (12) and (13) in Definition 4, which are not in Definition 1, are used to obtain the correcting property in the quaternary regime. From the constraint (11), the possible positions of the deletion errors can be obtained; however, when more than one candidate satisfies the constraint (11), the constraints (12) and (13) are used to remove the invalid candidates. The constraint (13) determines exactly the value of a deleted symbol, or the sum of two consecutive deleted symbols. Then, finally, the position and the value of the symbols that satisfy the three constraints (11), (12), and (13) are unique, and the resulting quaternary sequence is corrected. For example, for n = 10, d = 0, a = 0, and e = 0, suppose that the binary sequence is corrected as x = 0100000111, where the second and third bits are the bits inserted to correct x. From the mapping (10), the possible quaternary sequences are c = 0300011322, c = 0200011322, c = 0310011322, and c = 0210011322. Without the constraints (12) and (13), the decoder cannot decide which quaternary sequence is correct. The constraints (12) and (13) exclude the invalid quaternary sequences as described in Table 1, so that the output is the unique sequence c = 0300011322.
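The example above can be reproduced with a short membership test. The sketch below is hedged: the concrete forms of the mapping (10) and of the constraints (11) to (13) are assumptions reconstructed from the worked example and from the parameter ranges used for the cardinality bound (symbols 0, 1 map to bit 0 and symbols 2, 3 to bit 1; the position-weighted sum is taken modulo $8n + 1$). The function names are hypothetical.

```python
from itertools import groupby

def to_binary(c):
    """Assumed mapping (10): symbols 0, 1 -> bit 0 and symbols 2, 3 -> bit 1."""
    return [s // 2 for s in c]

def rsyn(x):
    """Run-syndrome of 0x."""
    runs = [len(list(g)) for _, g in groupby([0] + list(x))]
    return sum(i * r for i, r in enumerate(runs))

def is_codeword(c, d, a, e):
    """Check the three (assumed) constraints of Definition 4."""
    n = len(c)
    run_ok = rsyn(to_binary(c)) % (2 * n) == d                                # (11)
    pos_ok = sum(i * s for i, s in enumerate(c, start=1)) % (8 * n + 1) == a  # (12)
    sum_ok = sum(c) % 4 == e                                                  # (13)
    return run_ok and pos_ok and sum_ok

for text in ["0300011322", "0200011322", "0310011322", "0210011322"]:
    c = [int(ch) for ch in text]
    print(text, to_binary(c), is_codeword(c, d=0, a=0, e=0))
# Only 0300011322 passes; all four candidates share the binary image 0100000111.
```

In this sketch, 0200011322 and 0310011322 are rejected by the checksum (13), while 0210011322 passes the checksum and is rejected only by the position-weighted sum (12), which shows why both additional constraints are needed.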

Decoding Procedure for One Deletion Error
It is assumed that the transmitter and the receiver share the parameters $n$, $d$, $a$, $e$ of the code $C(n, 4)$ in Definition 4. We first consider the case in which one deletion error occurs in the codeword $c$. For $1 \le j \le n$, if the $j$-th symbol in $c$ is removed, we obtain a received sequence $y = c^{n-1}_j \in \mathbb{F}_4^{n-1}$ with length $n - 1$. If the symbol at the $j$-th position is deleted, the constraint (13) can be rewritten as
$$c_j + \sum_{i=1,\, i \ne j}^{n} c_i \equiv e \pmod{4}.$$
From the received sequence $y = c^{n-1}_j \in \mathbb{F}_4^{n-1}$, the constraint is given as
$$c_j + \sum_{i=1}^{n-1} y_i \equiv e \pmod{4}.$$
Thus, the value of the deleted symbol $c_j$ is calculated as $c_j = \big(e - \sum_{i=1,\, i \ne j}^{n} c_i\big) \bmod 4 = \big(e - \sum_{i=1}^{n-1} y_i\big) \bmod 4$.
Next, we need to find the deletion position $j$. Applying the mapping (10) to the received sequence $y$ gives the binary sequence $x^{n-1}_j$ with length $n - 1$, and $0x^{n-1}_j$ is obtained as $0x^{n-1}_j = (0, x_1, x_2, \ldots, x_{j-2}, x_{j-1}, x_{j+1}, x_{j+2}, \ldots, x_n)$. Then, the run-length vector $r$ is determined from $0x^{n-1}_j$, and $\mathrm{Rsyn}(0x^{n-1}_j) = \sum_{i=0}^{|r|-1} i\, r_i \bmod 2n$ as in the constraint (11). As mentioned for Definition 1, when one deletion error occurs, the run-syndrome decreases by $\Delta = (d - \mathrm{Rsyn}(0x^{n-1}_j)) \bmod 2n$. To provide a proof of the correction capability of the proposed quaternary code in Definition 4, we develop Algorithm 1 as a correcting method for the case of one deleted symbol.
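The two steps of this procedure can be sketched in Python. The block below is not Algorithm 1: the value of the deleted symbol is obtained from the checksum exactly as described, but the position is found by a plain exhaustive search instead of the run-syndrome difference $\Delta$. The mapping and the constraints are the same assumed reconstructions as in the membership example (symbols 0, 1 to bit 0, symbols 2, 3 to bit 1, position-weighted sum modulo $8n + 1$), and all function names are hypothetical.

```python
from itertools import groupby

def rsyn(x):
    runs = [len(list(g)) for _, g in groupby([0] + list(x))]
    return sum(i * r for i, r in enumerate(runs))

def is_codeword(c, d, a, e):
    """Assumed constraints (11)-(13) with the assumed mapping (10): x_i = c_i // 2."""
    n = len(c)
    return (rsyn([s // 2 for s in c]) % (2 * n) == d
            and sum(i * s for i, s in enumerate(c, start=1)) % (8 * n + 1) == a
            and sum(c) % 4 == e)

def decode_one_deletion(y, n, d, a, e):
    """Return the deleted value and every codeword obtained by re-inserting it."""
    deleted_value = (e - sum(y)) % 4          # step 1: checksum constraint (13)
    found = set()
    for pos in range(n):                      # step 2: try all n insertion positions
        cand = y[:pos] + [deleted_value] + y[pos:]
        if is_codeword(cand, d, a, e):
            found.add(tuple(cand))
    return deleted_value, found

c = [0, 3, 0, 0, 0, 1, 1, 3, 2, 2]            # codeword for n = 10, d = a = e = 0
y = c[:1] + c[2:]                             # the symbol 3 at position 2 is deleted
print(decode_one_deletion(y, 10, 0, 0, 0))
# (3, {(0, 3, 0, 0, 0, 1, 1, 3, 2, 2)})
```

When the deleted symbol lies inside a run of equal symbols, several insertion positions pass the test, but they all produce the same sequence, so the returned set still contains a single codeword.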

Decoding Procedure for Two Deletion Errors
Suppose that the received sequence is $y = c^{n-2}_j \in \mathbb{F}_4^{n-2}$ with length $n - 2$, where two consecutive symbols in the $j$-th and $(j + 1)$-th positions of the codeword $c \in C(n, 4)$ are deleted.
Algorithms 1 and 2 are proved using an exhaustive search strategy to show that the proposed code can correct at most two consecutive deleted symbols. However, as mentioned in [9], deletion-correcting codes are not always successful in identifying the exact location of the deleted symbols. For example, if an all-zero codeword is sent and one deletion error occurs, finding the value of the deleted symbol is easy, but it is impossible to find its exact position. Even though the exact position cannot be detected, the codeword can be successfully recovered by inserting a zero symbol in any position. This means that when the exact index of the deleted symbol is not detected but the index of the run to which the deleted symbol belongs is determined, the codeword can be successfully recovered by inserting one symbol in any position in the run.
If a codeword with a long run is sent and one deletion occurs in the long run, the proposed algorithm can always determine the value and the run index of the deleted symbol but can rarely find the exact position of the deleted symbol in the run. In this case, the proposed algorithm outputs the first index of the detected run. Therefore, when a deletion error occurs in a long run and it is not possible to find the exact position in the codeword, the codeword of the proposed code is still successfully decoded by inserting the deleted symbol at the first index of the run.
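The position ambiguity discussed above also shows up for two consecutive deletions. The following brute-force sketch is not Algorithm 2; it assumes the same reconstructed mapping and constraints as before (symbols 0, 1 to bit 0, symbols 2, 3 to bit 1, position-weighted sum modulo $8n + 1$), uses hypothetical function names, and records every insertion position that leads to a codeword.

```python
from itertools import groupby

def rsyn(x):
    runs = [len(list(g)) for _, g in groupby([0] + list(x))]
    return sum(i * r for i, r in enumerate(runs))

def is_codeword(c, d, a, e):
    """Assumed constraints (11)-(13) with the assumed mapping (10): x_i = c_i // 2."""
    n = len(c)
    return (rsyn([s // 2 for s in c]) % (2 * n) == d
            and sum(i * s for i, s in enumerate(c, start=1)) % (8 * n + 1) == a
            and sum(c) % 4 == e)

def decode_two_deletions(y, n, d, a, e):
    """Return (codewords, positions): all recoveries and the 1-based positions used."""
    pair_sum = (e - sum(y)) % 4               # sum of the two deleted symbols, from (13)
    pairs = [(h1, h2) for h1 in range(4) for h2 in range(4)
             if (h1 + h2) % 4 == pair_sum]
    codewords, positions = set(), []
    for pos in range(n - 1):                  # candidate position of the first symbol
        for h1, h2 in pairs:
            cand = y[:pos] + [h1, h2] + y[pos:]
            if is_codeword(cand, d, a, e):
                codewords.add(tuple(cand))
                positions.append(pos + 1)
    return codewords, positions

c = [0, 3, 0, 0, 0, 1, 1, 3, 2, 2]
y = c[:1] + c[3:]                             # symbols at positions 2 and 3 are deleted
codewords, positions = decode_two_deletions(y, 10, 0, 0, 0)
print(codewords)                              # {(0, 3, 0, 0, 0, 1, 1, 3, 2, 2)}
print(positions)                              # [1, 2]
```

Two different positions are reported, yet both reconstruct the same sequence, so choosing the first one, in the spirit of the first-index rule described above, loses nothing.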

Decoding Procedure for Insertion Errors
Since the procedure is similar to the case of deletion errors, the correction capability of the proposed code for insertion errors is only briefly presented in this subsection. If one or two consecutive insertion errors occur, the received quaternary sequence y has a length that is one or two symbols larger than n. Table 4 summarizes the differences in the decoding computations between insertion and deletion errors.

Table 4. The differences between insertion and deletion errors.

Content | One Insertion Error | Two Consecutive Insertion Errors | One Deletion Error | Two Consecutive Deletion Errors
Length of the received sequence | n + 1 | n + 2 | n − 1 | n − 2
Difference of run-syndrome (mod 2n) | Rsyn(0x^{n+1}_j) − d | Rsyn(0x^{n+2}_j) − d | d − Rsyn(0x^{n−1}_j) | d − Rsyn(0x^{n−2}_j)

Correcting one Insertion Error
It is assumed that the received sequence with length $n + 1$ is $y = c^{n+1}_j = (c_1, c_2, \ldots, c_{j-1}, h_1, c_j, c_{j+1}, \ldots, c_n) \in \mathbb{F}_4^{n+1}$, which means that one symbol $h_1$ is inserted at the $j$-th position of the codeword $c \in C(n, 4)$. The process to correct the received sequence $y$ can be briefly presented by the following steps.
The first step is calculating the value of the inserted symbol $h_1$ in $y$. The sum of all symbols of the received sequence $y$ is computed as
$$\sum_{i=1}^{n+1} y_i = h_1 + \sum_{i=1}^{n} c_i \equiv h_1 + e \pmod{4},$$
so that $h_1 = \big(\sum_{i=1}^{n+1} y_i - e\big) \bmod 4$.
The second step is determining the insertion position $j$. From the mapping (10), we obtain the binary sequence $0x^{n+1}_j$. From the binary sequence $0x^{n+1}_j$, we obtain the run-length vector of $0x^{n+1}_j$ and then calculate the difference of the run-syndrome as $\Delta = (\mathrm{Rsyn}(0x^{n+1}_j) - d) \bmod 2n$. To determine the position $j$ of the inserted symbol $h_1$, we provide Algorithm A1 and Function 3 in Appendix A for this step, and the output is the corrected quaternary sequence.
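Both steps can be mirrored in a short Python sketch. This is not Algorithm A1 or Function 3 of Appendix A: the value of the inserted symbol is taken from the checksum as described, and the position is found by removing each matching symbol in turn and testing the result. The mapping and constraints are the same assumed reconstructions as elsewhere in these examples (symbols 0, 1 to bit 0, symbols 2, 3 to bit 1, position-weighted sum modulo $8n + 1$), and the function names are hypothetical.

```python
from itertools import groupby

def rsyn(x):
    runs = [len(list(g)) for _, g in groupby([0] + list(x))]
    return sum(i * r for i, r in enumerate(runs))

def is_codeword(c, d, a, e):
    """Assumed constraints (11)-(13) with the assumed mapping (10): x_i = c_i // 2."""
    n = len(c)
    return (rsyn([s // 2 for s in c]) % (2 * n) == d
            and sum(i * s for i, s in enumerate(c, start=1)) % (8 * n + 1) == a
            and sum(c) % 4 == e)

def decode_one_insertion(y, d, a, e):
    """Return the inserted value and every codeword obtained by removing it once."""
    inserted_value = (sum(y) - e) % 4         # step 1: checksum constraint (13)
    found = set()
    for pos, symbol in enumerate(y):          # step 2: try every matching position
        if symbol != inserted_value:
            continue
        cand = y[:pos] + y[pos + 1:]
        if is_codeword(cand, d, a, e):
            found.add(tuple(cand))
    return inserted_value, found

c = [0, 3, 0, 0, 0, 1, 1, 3, 2, 2]
y = c[:3] + [2] + c[3:]                       # the symbol 2 is inserted at position 4
print(decode_one_insertion(y, 0, 0, 0))
# (2, {(0, 3, 0, 0, 0, 1, 1, 3, 2, 2)})
```

Knowing the inserted value first cuts the search down to the positions that actually hold that symbol, which is the role played by the first step of the procedure.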

Correcting Two Consecutive Insertion Errors
If two consecutive insertion errors occur at the $j$-th and $(j+1)$-th positions of the codeword $c \in C(n, 4)$, the received sequence is $y = c^{n+2}_j = (c_1, c_2, \ldots, c_{j-1}, h_1, h_2, c_j, c_{j+1}, \ldots, c_n) \in \mathbb{F}_4^{n+2}$ with length $n + 2$. From the received sequence $c^{n+2}_j$ and an analysis similar to the one-insertion case, the sum of the two inserted symbols is obtained as
$$h_1 + h_2 \equiv \sum_{i=1}^{n+2} y_i - e \pmod{4}.$$
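A brute-force counterpart for two consecutive insertions is sketched below. It is not Algorithm A2: it only uses the checksum to obtain $h_1 + h_2 \bmod 4$ and then removes every adjacent pair with a matching sum. As in the other examples, the mapping (symbols 0, 1 to bit 0, symbols 2, 3 to bit 1) and the modulus $8n + 1$ of the position-weighted sum are assumptions reconstructed from the text, and the function names are hypothetical.

```python
from itertools import groupby

def rsyn(x):
    runs = [len(list(g)) for _, g in groupby([0] + list(x))]
    return sum(i * r for i, r in enumerate(runs))

def is_codeword(c, d, a, e):
    """Assumed constraints (11)-(13) with the assumed mapping (10): x_i = c_i // 2."""
    n = len(c)
    return (rsyn([s // 2 for s in c]) % (2 * n) == d
            and sum(i * s for i, s in enumerate(c, start=1)) % (8 * n + 1) == a
            and sum(c) % 4 == e)

def decode_two_insertions(y, d, a, e):
    """Return (h1 + h2) mod 4 and every codeword left after removing one adjacent pair."""
    pair_sum = (sum(y) - e) % 4               # from the checksum constraint (13)
    found = set()
    for pos in range(len(y) - 1):
        if (y[pos] + y[pos + 1]) % 4 != pair_sum:
            continue                          # this pair cannot be the inserted one
        cand = y[:pos] + y[pos + 2:]
        if is_codeword(cand, d, a, e):
            found.add(tuple(cand))
    return pair_sum, found

c = [0, 3, 0, 0, 0, 1, 1, 3, 2, 2]
y = c[:5] + [1, 2] + c[5:]                    # symbols 1, 2 inserted at positions 6, 7
print(decode_two_insertions(y, 0, 0, 0))
# (3, {(0, 3, 0, 0, 0, 1, 1, 3, 2, 2)})
```

Only the sum of the two inserted symbols is needed here, because the decoder removes symbols rather than guessing their values.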

Cardinality of the Proposed Code
Since our main contribution is the error correction capability of the proposed code, the lower bound and the upper bound on the cardinality of this code design are evaluated based on the previous methods in [7,15].

Lower Bound of the Code Cardinality
In Section IV of [15], the lower bound of the code cardinality was determined by the number of possible values of the syndrome and checksum in the code construction. Hence, with a similar approach, by applying $d \in [0, 2n - 1]$, $a \in [0, 8n]$, and $e \in [0, 3]$, we obtain the lower bound for the cardinality $m(n, 4)$ of the proposed code as
$$m(n, 4) \ge \frac{4^n}{8n(8n + 1)}.$$
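The counting argument behind this bound is a pigeonhole argument: every quaternary sequence of length $n$ belongs to exactly one parameter triple $(d, a, e)$, and there are $2n \cdot (8n + 1) \cdot 4 = 8n(8n + 1)$ triples. The Python sketch below checks this for a small length and compares the resulting redundancy with $2\log_4 8n$. The mapping and the modulus $8n + 1$ are assumptions reconstructed from the parameter ranges above, and `parameters` is a hypothetical helper.

```python
from collections import Counter
from itertools import groupby, product
from math import log

def parameters(c):
    """Triple (d, a, e) of the unique code C(n, 4) containing c (assumed constraints)."""
    n = len(c)
    runs = [len(list(g)) for _, g in groupby([0] + [s // 2 for s in c])]
    d = sum(i * r for i, r in enumerate(runs)) % (2 * n)
    a = sum(i * s for i, s in enumerate(c, start=1)) % (8 * n + 1)
    e = sum(c) % 4
    return d, a, e

n = 4
sizes = Counter(parameters(c) for c in product(range(4), repeat=n))
bound = 4 ** n / (8 * n * (8 * n + 1))
print(sum(sizes.values()) == 4 ** n)      # the codes partition all sequences
print(max(sizes.values()) >= bound)       # so the largest one meets the bound

# Redundancy implied by the bound, in quaternary symbols, for a longer length.
n = 100
redundancy = log(8 * n * (8 * n + 1), 4)
print(redundancy, 2 * log(8 * n, 4))      # the two values are nearly equal
```

For very small $n$ the bound is below one and therefore uninformative; it becomes meaningful only as $n$ grows, which is also the regime in which the redundancy approaches $2\log_4 8n$.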

Upper Bound of the Code Cardinality
Define $|M(n, 4)|$ as the cardinality of the quaternary code of length $n$ with the maximum possible number of codewords that can correct at most two consecutive deletion or insertion errors. Similar to the method in [7], the upper bound of the cardinality $|M(n, 4)|$ is evaluated as
$$|M(n, 4)| \le |M_1(n, 4)| + |M_2(n, 4)|, \tag{16}$$
where $|M_1(n, 4)|$ is the number of codewords with length $n$ such that the number of runs is larger than $(r + 1)$ (where $r$ is an arbitrary number) and $|M_2(n, 4)|$ is the number of codewords with length $n$ such that the number of runs is not larger than $(r + 1)$. The inequality (16) can then be expressed as |M(n, 4)| ≤ 4 n−2 2(r + 1) + 1 If we set r= and let n tend to infinity, then 2(r + 1) + 1 ≈ , and the upper bound of the cardinality of the proposed code can be written as

Discussion
In this section, we explain the results of our proposed code design and then discuss the applications of the proposed code.
We provide a new design of quaternary codes to correct at most two consecutive deletion or insertion errors. The correction capabilities of this design for deletion and insertion errors are proved by Algorithms 1 and 2 and Appendixes A and B. With the proposed code, we can use the mapping 0 ↔ A, 1 ↔ C, 2 ↔ T, and 3 ↔ G to directly construct or sequence the DNA strands.
To deal with a burst of at most b deletion or insertion errors (for any fixed b ≥ 3), the proposed code can be intersected with a quaternary code that corrects exactly b − 2 consecutive deletion or insertion errors.
For example, to correct a burst error of size at most b = 3, we first create the code C(n, 4) from Definition 4, which takes care of one or two consecutive deletion or insertion errors. Then, we create a q-ary VT code VT a,e (n, 4) from Definition 2 with q = 4 to correct a single error. By intersecting C(n, 4) and VT a,e (n, 4), we obtain the desired quaternary code with length n to deal with a burst error of size at most three. For a given b > 3, we use the array code construction described in Section 2.2 to create a quaternary code that can correct exactly b − 2 consecutive deletion or insertion errors. Through the intersection of this code and our proposed code, a quaternary code that can correct at most b > 3 consecutive deletion or insertion errors can be obtained.

Conclusions
In this paper, we propose a new design of a quaternary code to correct at most two consecutive deletion or insertion errors with a redundancy of at most $2\log_4 8n$ symbols. We also develop decoding algorithms for correcting one and two consecutive deletion or insertion errors in any quaternary sequence. Even though the results in this work provide significant applications for DNA storage and the correction of multiple quaternary errors, there are still several open problems, such as code constructions that can correct at most b non-consecutive deletion or insertion errors and codes that can correct at most b deletion, insertion, and substitution errors, for arbitrary b. Moreover, the optimal design for the concatenation of a constrained code and our proposed code for DNA-based data storage also needs to be considered.

Conflicts of Interest:
The authors declare no conflict of interest.


Appendix A. One Insertion Error Correction
As mentioned in Section 3.4, Algorithm A1 and the function ins1_correct1 which is used in Algorithm A1 are given as follows.
To determine the position of the inserted symbol, the value of Syn_new in line 2 of Function 3 indicates the syndrome of the received quaternary sequence $c^{n+1}_j$ when the inserted symbol $c^{n+1}_j(j)$ is not considered. Thus, in the second term of the right-hand side, the coefficient needs to be $(i - 1)$. This syndrome is compared to the constraint (12) to obtain the position of the inserted symbol and, finally, to remove the inserted symbol.

Algorithm A2 Correct two consecutive insertion symbols
Input: $n$, $a$, $y$, $0x^{n+2}_j$.
Output: $c = (c_1, c_2, \ldots, c_n) \in C(n, 4)$.