Abstract
Owing to the properties of DNA data storage, errors that occur in DNA strands make error correction an important and challenging task. In this paper, a new design of a quaternary code suitable for DNA storage is proposed to correct at most two consecutive deletion or insertion errors. Decoding algorithms for the proposed code are presented for the cases of one and two deletion or insertion errors, and it is proved that the proposed code can correct at most two consecutive errors. Moreover, lower and upper bounds on the cardinality of the proposed quaternary code are evaluated, and the redundancy of the proposed code is shown to be roughly .
1. Introduction
In recent years, because of its huge capacity and excellent durability, deoxyribonucleic acid (DNA) storage is becoming attractive for future long-term data storage [,,]. However, during the processes of DNA storage, the molecule can be faced with errors that do not normally occur in traditional storage devices such as deletion and insertion errors []. Therefore, research to address deletion and insertion errors is extremely significant in DNA storage, and error-correcting codes for the errors have been studied. Our work focuses on the codes capable of correcting multiple deletion or insertion errors in DNA storage.
For correcting one deletion or insertion error with binary codes, Varshamov–Tenengolts (VT) codes were first proposed in [], and in the same year a modified VT code construction was provided in [] to correct a single deletion, insertion, or substitution error. Shortly thereafter, to deal with more than a single error, Levenshtein extended the VT code to a binary code that can correct at most two consecutive deletion or insertion errors []. In [], a binary codeword was arranged as an array with b rows, each row being a binary VT codeword, so that the construction could correct a burst of exactly b deletion or insertion errors (with any fixed ). Then, the authors of [] proposed a binary shifted-Varshamov–Tenengolts (SVT) code to obtain an improved construction that still corrects exactly b errors but with lower redundancy than the one in []. Motivated by the efficient correction and low redundancy of VT codes, the authors of [,] proposed linear-time encoders for binary VT codes that satisfy the homopolymer run and Guanine-Cytosine (GC)-content constraints [,], which are important properties of a DNA strand. However, the binary VT codes used in these linear-time encoders correct only a single nucleotide of a DNA strand. With an approach similar to [,], but aiming to correct a burst of exactly b deletions or insertions of DNA symbols, the authors of [] applied the encoders of the binary modified VT code in [] and the binary SVT codes in []. Then, by interleaving the bits of binary VT codewords and binary SVT codewords, the work [] obtained a binary code construction that can correct a burst error of size exactly , and finally the codeword of this construction was translated to DNA symbols.
A non-binary VT code was first proposed in [], and a non-binary SVT code was proposed in []. These codes are defined over a q-ary alphabet for any . As with their binary counterparts, the q-ary VT and q-ary SVT codes can correct a single deleted or inserted symbol. To correct multiple errors, the construction in [,] can be applied to obtain a q-ary code that corrects a burst of exactly b deletion or insertion errors. However, designing q-ary VT codes that can correct multiple deletion or insertion errors has remained an interesting problem []. Recently, several works [,,,] have focused on code designs that correct an exact number of multiple errors, but an efficient design of q-ary codes (or even quaternary codes) that can correct a burst of at most b deletion or insertion errors is still an open problem. The authors of [] proposed a non-binary code correcting at most two consecutive deletions with redundancy . In [], the authors used the construction method in [] with one binary code in [] and a modified version of it in interval P. In contrast, we propose a quaternary code that is suitable for robust DNA storage and can correct at most two consecutive deletion or insertion errors with a direct construction. Moreover, the redundancy of the proposed code is lower than that in [].
As for the cardinality of VT codes, a lower bound on the size of the best class of VT codes has been known for about 50 years, but an upper bound is rarely provided, even in the binary case. The author of [] used a Mixed Integer Linear Programming (MILP) relaxation technique to obtain a tighter upper bound for the binary VT code; for example, for length n = 11, the maximum size of a single-deletion code was calculated as 173. Moreover, a conjecture about the maximum size of the VT code for all n was also provided. In this work, however, we focus on the error correction capability of the proposed code design, and we use the previous methods in [,] to evaluate lower and upper bounds for it.
In our work, we extend binary codes based on the results of [] by adding two constraints that determine the exact values and positions of the errors in the quaternary sequence. By mathematically analyzing the possible error cases, we propose decoding algorithms that prove the error correction capability of this code design. We note that the main concern of this work is the error correction capability of the quaternary code design, not the constraints of DNA storage. It is assumed that the combination of the error correction code with the constraints of DNA storage is handled by other algorithms [,]. The main contributions of this paper can be summarized as follows.
      
- We propose a quaternary code design that is suitable for the deletion or insertion channel, especially for mapping , , , and . This design is directly applicable to sequencing in DNA storage. Furthermore, the proposed code can correct at most two consecutive deletion or insertion errors.
- We propose two decoding algorithms for the proposed code to correct one deletion and two consecutive deletion errors. For the decoding of insertion errors, the differences between the deletion and insertion cases are shown, and the important functions for correcting insertion errors are presented in Appendix A.
- We provide a lower bound and evaluate an upper bound on the cardinality of the proposed code design. The redundancy of the proposed code design is also calculated to be at most .
 
This paper is organized as follows. In Section 2, we list the basic notations and definitions used in the rest of the paper and briefly present previous binary and quaternary code constructions for correcting one and two consecutive deletions. Section 3 contains the proposed code construction, a proof of its correction capability, and bounds on the cardinality of the proposed quaternary code. Section 4 provides a discussion, and the conclusion is presented in Section 5.
2. Preliminaries and Previous Works
2.1. Notation and Definition
Let  and  be the set of binary and quaternary sequences of length n, respectively. Let a quaternary codeword with length n be defined as  = . Then, a modified sequence  of the sequence  is defined as  = , where  are deleted in . Similarly, a sequence  of the sequence  is also defined as  = , where  are inserted from l-th position in .
For a binary sequence  = , we can consider a sequence  with length , where  = . For simplicity, the sequence  with length  is regarded as having a starting value of  = 0. For example, a binary sequence  with length 10 is given as  = . For convenience, the binary sequence notation can be changed to  = 0010011111. In the rest of this paper, these two notations are used interchangeably, so the corresponding binary sequence  with length 11 is  = 00010011111. The run-length vector  lists the lengths of the runs of zeros and ones in . Here, the binary sequence  is composed of four runs, namely 000, 1, 00, and 11111. For a non-negative integer k, runs of k zeros and k ones are denoted  and , respectively. Then, the run-length vector  of  is  = (3,1,2,5).
Let  be the total number of elements in the run-length vector  of , corresponding to the total number of runs in . Then, from the run-length vector, the run-syndrome of the binary sequence  is defined as
        
      
        
      
      
      
      
    
In the previous example, for  = 00010011111, since the run-length vector  is (3,1,2,5),  = = 20.
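The worked value above can be checked with a few lines of Python. This is a minimal sketch rather than the paper's notation: the function names are hypothetical, and the weighting (each run length multiplied by a 0-based run index, so the first run contributes nothing) is an assumption chosen because it reproduces the run-syndrome 20 for the run-length vector (3,1,2,5).

```python
from itertools import groupby


def run_lengths(bits: str) -> list[int]:
    """Lengths of the maximal runs of identical bits, read left to right."""
    return [len(list(group)) for _, group in groupby(bits)]


def run_syndrome(bits: str) -> int:
    """Run lengths weighted by a 0-based run index (assumed convention)."""
    return sum(index * length for index, length in enumerate(run_lengths(bits)))


padded = "0" + "0010011111"    # prepend the starting 0 to the length-10 sequence
print(run_lengths(padded))     # [3, 1, 2, 5]
print(run_syndrome(padded))    # 0*3 + 1*1 + 2*2 + 3*5 = 20
```

The same helper gives 18 for 0010000111 and 15 for 001000011, the run-syndromes quoted in Examples 1 and 2 later in the paper.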
If the j-th bit of  belongs to the m-th run , we define  as the index of the run and , for . Since the total number of elements of the run-length vector cannot exceed the length of ,  is bounded as
        
      
        
      
      
      
      
    
        where the equality is satisfied if the binary sequence  = 0101010⋯.
From the previous example, the binary sequence  = 00010011111, with length 11, has the run-length vector  = (3,1,2,5), and the total number of elements of the run-length vector of  is  = 4 ≤ 11. For , since the third bit in  belongs to the run  = 1, the index of the run to which the third bit belongs is  = 1.
2.2. Previous Works
In the binary case, to the best of our knowledge, the VT code in [] is the best code for correcting a single deletion or insertion error, and the modified VT code in [] is the best code for correcting a single deletion, insertion, or substitution error. To correct more than a single error, we briefly recap the binary code correcting at most two consecutive deletions from []. We also briefly present the deletion correction capability of the binary code in [] when a single deletion or two consecutive deletions occur.
Definition 1.  
For , the binary code  in [] with length n is given as
      
        
      
      
      
      
    
The correction capability of the code in Definition 1 was also proved in []. From the length of the received sequence , we know whether one or two consecutive bits were removed from the codeword . If one deletion at the j-th bit or two consecutive deletions at the j-th and (j+1)-th bits occur, then  can be  or .
To determine the position of the deleted bit, we first calculate the difference of the run-syndrome as  = . If one deletion error occurs,  =  and if two consecutive deletions occur,  = . These values are used to identify the value and position j of the deleted bit if there is one deletion in the codeword  or the values and positions j and  of the two deleted bits in the case that two consecutive deletions occur in the codeword .
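To make the role of the run-syndrome concrete, the sketch below recovers a binary codeword by brute force: it re-inserts the missing one or two consecutive bits at every position and keeps the candidates whose run-syndrome matches. This is not the decoding procedure of [], which locates the error directly from the run-syndrome difference. The function names are hypothetical, and both the 0-based run weighting and the modulus 2n are assumptions, since the defining equations are not reproduced here. The inputs are the binary sequences that appear in Examples 1 and 2 of this paper (n = 10, a = 0).

```python
from itertools import groupby, product


def run_syndrome(bits: str) -> int:
    """Run lengths weighted by a 0-based run index (assumed convention)."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    return sum(index * length for index, length in enumerate(runs))


def candidates(received: str, n: int, a: int, modulus: int) -> set[str]:
    """Length-n sequences obtained by re-inserting the deleted consecutive
    bits whose run-syndrome (with a leading 0) is congruent to a."""
    missing = n - len(received)                 # 1 or 2 deleted bits
    found = set()
    for pos in range(len(received) + 1):
        for fill in product("01", repeat=missing):
            guess = received[:pos] + "".join(fill) + received[pos:]
            if run_syndrome("0" + guess) % modulus == a % modulus:
                found.add(guess)
    return found


n, a = 10, 0
modulus = 2 * n                                  # assumed modulus
print((a - run_syndrome("0" + "010000111")) % modulus)   # 2, one deletion
print((a - run_syndrome("0" + "01000011")) % modulus)    # 5, two deletions
print(candidates("010000111", n, a, modulus))    # {'0100000111'}
print(candidates("01000011", n, a, modulus))     # {'0100000111'}
```

In both cases exactly one candidate survives, which is the uniqueness property the code relies on; an efficient decoder reaches the same result without enumerating every position.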
In the quaternary case, there exist codes that correct a single deletion or insertion error. An overview of q-ary insertion- and deletion-correcting codes with length n is briefly presented in Definitions 2 and 3. The VT code family, known as the most basic set of codes for correcting a single deletion or insertion, is defined as follows []:
Definition 2.  
For  and , the q-ary VT code with length n,  is defined as
      
        
      
      
      
      
    
      
        
      
      
      
      
    
          where  and  for .
From Definition 2, since the binary sequence  =  is strongly related to the q-ary sequence , a deletion of the j-th symbol in the codeword  also leads to a deletion of the j-th bit in the binary sequence . Hence, with the help of the binary sequence , the q-ary sequence can be corrected.
Similarly, the authors of [] proposed a single-deletion-correcting code, the q-ary SVT code, defined as follows.
Definition 3.  
For , , and , the q-ary SVT code, , with length n is defined as
      
        
      
      
      
      
    
      
        
      
      
      
      
    
      
        
      
      
      
      
    
          where  and  for .
Compared to the construction of the q-ary VT code, since mod  is used in the constraint (6) in Definition 3 instead of mod n, the redundancy of the q-ary SVT code is reduced from  to . The constraint (8) is added to imply that the binary sequence  belongs to the binary SVT code. Hence, similar to the correcting method in Definition 2, the q-ary SVT code in Definition 3 can correct one deletion in any position.
However, the q-ary VT code in Definition 2 and the q-ary SVT code in Definition 3 correct only a single deletion or insertion error and cannot correct consecutive deletions or insertions in the sequence. To overcome this drawback, we can carry over to the q-ary case the idea in [,] of constructing binary codes that correct a burst of deletion or insertion errors of size exactly b for . The q-ary codeword  with length n is treated as a codeword array  with size , and the codeword is arranged column by column. Then, to reduce the redundancy compared with [], the first row and each of the other  rows in the codeword array are encoded by a q-ary VT code and a q-ary SVT code, respectively. With this construction, one deletion or insertion error in each row can be corrected by the q-ary VT code or q-ary SVT code, so that a burst of b consecutive deletion or insertion errors can be corrected.
For example, for correcting a burst of deletions of size two, the q-ary codeword  with length n is presented as a  array , which is given by
        
      
        
      
      
      
      
    
Since each row of  is protected by the q-ary VT code or SVT code with length , the code from  can correct exactly two consecutive deletions.
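The reason the array view works is that a burst of exactly b consecutive deletions removes exactly one symbol from each of the b rows. The following minimal sketch illustrates this for b = 2; the helper names and the sample sequence are hypothetical, and the sequence is an arbitrary quaternary word rather than a codeword of any particular code.

```python
def rows(seq: list[int], b: int) -> list[list[int]]:
    """Rows of the array with b rows filled column by column: row i is seq[i::b]."""
    return [list(seq[i::b]) for i in range(b)]


def is_one_deletion(original: list[int], received: list[int]) -> bool:
    """True if `received` equals `original` with exactly one element removed."""
    return any(original[:k] + original[k + 1:] == received
               for k in range(len(original)))


sequence = [0, 1, 2, 3, 3, 2, 1, 0, 2, 0]        # arbitrary quaternary word, n = 10
start = 3                                         # burst removes 0-based positions 3 and 4
received = sequence[:start] + sequence[start + 2:]

for before, after in zip(rows(sequence, 2), rows(received, 2)):
    print(before, "->", after, is_one_deletion(before, after))
```

Each printed row of the received array is the corresponding original row with a single symbol missing, so a single-deletion code per row is enough for a burst of exactly two deletions. A single deletion, by contrast, shifts the rows against each other, which is why this construction does not handle bursts of at most two.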
In summary, the q-ary VT and SVT codes can be used to correct one or exactly two consecutive deletion or insertion errors. However, a quaternary code that corrects at most two consecutive deletion or insertion errors has not been developed. In the following section, we propose a new design of quaternary codes that are suitable for DNA storage and can correct at most two consecutive deletion or insertion errors.
3. Proposed Code Design
This section provides a new design for a quaternary code to correct at most two consecutive deleted or inserted symbols. The construction of the proposed code is given in Section 3.1. Section 3.2 and Section 3.3 prove the correction capabilities of the presented code when one deletion occurs or two consecutive deletion errors occur, respectively. The decoding of insertion errors is presented in Section 3.4. Lower and upper bounds on the cardinality of the proposed code are derived in Section 3.5.
3.1. Code Construction
To obtain a new design for a quaternary code that corrects one or two consecutive symbols, we present the proposed code design in the following definition. In the proposed design, a binary sequence that has the same length as, and is derived from, the quaternary sequence is used to construct the constraints of the code.
Definition 4.  
For , , and , a quaternary code  has a codeword  = . First, we can consider a mapping from the quaternary codeword  to a binary sequence  =  for  as,
      
        
      
      
      
      
    
Then, the quaternary code  which satisfies the following three conditions can correct at most two consecutive deletion or insertion errors.
      
        
      
      
      
      
    
      
        
      
      
      
      
    
      
        
      
      
      
      
    
The basic idea of the mapping (10) is that the quaternary codeword  corresponds to the binary sequence  with the same length n. Therefore, a deletion in the j-th position of the codeword  also leads to a deletion in the j-th position of the binary sequence . For example, if the received sequence is  = , after using the mapping (10), we can obtain the binary sequence , which has one deletion error in the j-th position.
In Definition 4, the condition (11) is the same as the condition (3) in Definition 1 for , which means that the sequence  is protected by a binary codeword of . Therefore, decoding of the binary sequence  can be used for finding the positions of the deleted symbols and guessing the values of deleted symbols of codeword .
The two constraints (12) and (13) in Definition 4, which are not in Definition 1, are used to obtain the correcting property in the quaternary regime. Constraint (11) yields the possible positions of the deletion errors; however, when more than one value satisfies constraint (11), constraints (12) and (13) are used to remove the invalid candidates. Constraint (13) is added to determine exactly the value of a deleted symbol or the sum of the values of two consecutive deleted symbols. Finally, the positions and values of the symbols that satisfy all three constraints (11), (12), and (13) are unique, and the quaternary sequence is corrected. For example, for  and , the binary sequence is corrected as , where the underlined bits are the bits inserted to correct . From the mapping (10), the possible quaternary sequence can be , , , or . Without constraints (12) and (13), the decoder cannot decide which quaternary sequence is correct. Therefore, constraints (12) and (13) exclude the invalid quaternary sequences as described in Table 1, and the output is the unique sequence .
       
    
Table 1. Correction capability of constraints (12), (13) when two consecutive deletions occur.
  
3.2. Decoding Procedure for One Deletion Error
It is assumed that a transmitter and receiver share the parameters  of the code  in Definition 4. Then, we first consider a case that one deletion error occurs in the codeword . For , if the j-th symbol in  is removed, we obtain a received sequence , with length .
If the symbol at the j-th position is deleted, the constraint (13) can be rewritten as . From the received sequence , the constraint is given as . Thus, the value of the deleted symbol  is calculated as .
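The value recovery step can be illustrated as follows. This is a sketch under a stated assumption: constraint (13) is taken to be a checksum of the form "sum of all symbols congruent to e modulo 4", which matches the description of a sum over the symbols but is not reproduced from the paper. The function name and the sample word are hypothetical, and the word is chosen arbitrarily without regard to constraints (11) and (12).

```python
def deleted_symbol_sum(received: list[int], e: int, q: int = 4) -> int:
    """Sum (mod q) of the deleted symbols, assuming sum(codeword) % q == e."""
    return (e - sum(received)) % q


word = [0, 3, 1, 0, 1, 1, 0, 2, 3, 1]     # hypothetical quaternary word, n = 10
e = sum(word) % 4                          # checksum parameter this word satisfies

one_gone = word[:5] + word[6:]             # sixth symbol (value 1) deleted
two_gone = word[:6] + word[8:]             # seventh and eighth symbols (0 and 2) deleted

print(deleted_symbol_sum(one_gone, e))     # 1, the deleted symbol itself
print(deleted_symbol_sum(two_gone, e))     # 2, the sum of the two deleted symbols mod 4
```

With one deletion the checksum gives the symbol value directly; with two consecutive deletions it gives only their sum, so the individual values still have to be separated by the remaining constraints.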
Next, we need to find the deletion position j. From the mapping (10) for the received sequence  to acquire the binary sequence  with length ,  is obtained as  = . Then, the run-length vector  is determined from  and  =  in the constraint (11). As mentioned in Definition 1, when one deletion error occurs, the run-syndrome decreases by  = .
To provide a proof for the correction capabilities of the proposed quaternary code in Definition 4, we develop Algorithm 1 as a correcting method in the case of one deletion symbol.
        
Algorithm 1. Correct one deletion symbol.
Function 1 provides function del_correct1 for Algorithm 1 to determine the deletion position, and then the output of Function 1 is the corrected quaternary sequence. In addition, in Function 1, Syn_new stands for the syndrome of the quaternary sequence after inserting the lost symbol  in the j-th position of .
		
Example 1: Let , and e be 10, 0, 0, and 0, respectively. Assume that one deletion occurs at the sixth position of the codeword  = . The received sequence  is  =  = . As mentioned in Algorithm 1, the value of the lost symbol is  = . From the mapping (10), we obtain the binary sequence  = 010000111. Then, the run-length vector of  is  so  = 4 and the run-syndrome of  is  =  = 18. The change of the run-syndrome is computed as  = .
For , since , following Algorithm 1, when j = 6, then  =  = 2. Inserting the lost symbol with  = 1 in the sixth position of the received sequence as  gives a quaternary sequence whose syndrome is  =  = 0, which equals a. Thus, the deletion error of the quaternary sequence is recovered correctly.
3.3. Decoding Procedure for Two Deletion Errors
Suppose that the received sequence  =  with length , where two consecutive symbols in the j-th and -th positions of codeword  are deleted.
The constraint (13) in Definition 4 can be rewritten as , and it is easy to obtain as  = , corresponding to  = . Since  = , we can rewrite  as  = .
From the mapping (10) for the received sequence , the binary sequence with length  is obtained as  = . Then, the run-length vector  of  also can determine the run-syndrome of  as  = . Thus, similar to the approach mentioned in Definition 1, the difference of the run-syndrome is computed as  = .
To recover two deletion errors, we first recover the binary sequence  with length n from the binary sequence . The authors of [] suggested eight possible instances when two consecutive bits are deleted, as summarized in Table 2. In this work, however, we consider 16 instances in total; the remaining eight instances are listed in Table 3. Please note that in Algorithms 2 and A2, the notation  is used to indicate that the inverted value of the bit at the -th position is assigned to the bit at the j-th position. Thus, the two notations  and  have the same meaning, namely that the two neighboring -th and j-th bits have different values. For example, if , then  or .
       
    
Table 2. The eight possible instances in [] for two consecutive deletion errors.
  
       
    
Table 3. The eight possible instances added in this work for two consecutive deletion errors.
  
In an analysis approach similar to [], for , there are four possible deleted bit pairs  = . Then, if 2 , we combine four possible cases of  with neighboring bits  =  and (1,1), and we need to consider 16 instances of .
From the above analysis, we develop Algorithm 2 for the proposed code to correct two consecutive deletion errors. In addition, in Algorithm 2, though it was not mentioned, the bit  is mathematically analyzed as an accompanied pair with , as described above, to obtain the conditions, such as lines 9, 16, 27, 34 to determine the deleted positions.
        
Algorithm 2. Correct two consecutive deletion symbols.
To clarify the function del_correct2 used by Algorithm 2 in Section 3.3, we provide the details in Function 2. In Function 2, Syn_new denotes the syndrome of the quaternary sequence after inserting the lost symbols  and  in the j-th and (j+1)-th positions of . If the value of Syn_new equals the syndrome parameter a, we infer that the quaternary sequence has been retrieved successfully.
		
Example 2: Let , and e be 10, 0, 0, and 0, respectively. It is assumed that two consecutive deletions occur at the seventh and eighth positions of the codeword  = , and the received quaternary sequence is  =  = . As mentioned in Algorithm 2, the sum of the values of the two deleted symbols is  =  =  = 0.
From the mapping (10) for , the binary sequence  and  are  = 001000011 and  = , respectively. Then,  = 4 and the run-syndrome  = 15. The difference of run-syndrome is calculated as  =  = 5.
From Algorithm 2, since  and , for , the value  satisfies the equation  =  =  = 5. Thus, as mentioned in line 28 of Algorithm 2, we obtain  and , and the corrected binary sequence is 0100000111.
Applying mapping (10) to the binary sequence  and , the two deleted symbols  are determined as . The syndrome Syn_new of the quaternary sequence when inserting = (1,3) into  is 0, which equals the syndrome a of codeword . Thus, finally the recovered quaternary sequence is .
Algorithms 1 and 2 are proved using an exhaustive search strategy to show that the proposed code can correct at most two consecutive deleted symbols. However, as mentioned in [], deletion-correcting codes do not always identify the exact location of the deleted symbols. For example, if an all-zero codeword is sent and one deletion error occurs, finding the value of the deleted symbol is easy, but finding its exact position is impossible. Even though the exact position cannot be detected, the codeword can be successfully recovered by inserting a zero symbol in any position. In general, when the exact index of the deleted symbol is not detected but the index of the run to which it belongs is determined, the codeword can be successfully recovered by inserting the symbol in any position within that run.
If a codeword with a large run was sent and one deletion occurs in the large run, the proposed algorithm can always determine the value and the run index of the deleted symbol but rarely find the exact position of the deleted symbol in the run. In this case, we prioritize the proposed algorithm to output the first index in the detected run. Therefore, when a deletion error occurs in a large run and it is not possible to find the exact position in a codeword, the codeword of the proposed code will be successfully decoded by inserting the deleted symbol in the first index of the run.
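The position ambiguity inside a run is harmless, as the following small sketch shows. The received string and the run boundaries are hypothetical values chosen for illustration; `insert_at` is a helper name introduced here.

```python
def insert_at(seq: str, pos: int, symbol: str) -> str:
    """Return seq with `symbol` inserted before 0-based index `pos`."""
    return seq[:pos] + symbol + seq[pos:]


received = "0100001233"        # hypothetical: one '0' was lost from the run of zeros
run_start, run_end = 2, 5      # the run of zeros occupies 0-based indices 2..5

# Try every insertion point in or adjacent to the run, including its first index.
repaired = {insert_at(received, pos, "0") for pos in range(run_start, run_end + 2)}
print(repaired)                # a single sequence, whichever position is chosen
```

Because every choice yields the same sequence, always reporting the first index of the detected run, as the proposed algorithm does, loses nothing.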
3.4. Decoding Procedure for Insertion Errors
Since there is a similarity to the case of deletion errors, in this subsection, the correction capability of this proposed code for insertion errors is briefly presented. The received quaternary sequence  has a length that is one or two symbols larger than n, if one or two consecutive insertion errors occur. Table 4 summarizes the different computations of decoding between insertion and deletion errors.
       
    
Table 4. The differences between insertion and deletion errors.
  
3.4.1. Correcting One Insertion Error
It is assumed that the received sequence with length  is  =  = , which means that one symbol  is inserted at the j-th position of the codeword . The process of correcting the received sequence  can be summarized in the following steps.
The first step is calculating the value of the inserted symbol  in . The received sequence  has a sum of total symbols computed as  =  = , and then  =  = . The value of the inserted symbol  is calculated by  = .
The second step is determining the insertion position j. From mapping (10), we obtain the binary sequence . From the binary sequence , we obtain the run-length vector of  and then calculate the difference of the run-syndrome by  = . To determine the position j of the inserted symbol , in Appendix A, we provide Algorithm A1 and Function 3 for this step and the output is the corrected quaternary sequence.
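The two steps can be sketched in Python as below. This is an illustration under assumptions, not Algorithm A1: the checksum is assumed to be "sum of symbols congruent to e modulo 4", the run-syndrome uses a 0-based run weighting with modulus 2n, and the position search is a brute-force removal of each candidate bit. All function names and the quaternary sample word are hypothetical; the binary codeword 0100000111 is the one from the examples in Section 3.

```python
from itertools import groupby


def run_syndrome(bits: str) -> int:
    """Run lengths weighted by a 0-based run index (assumed convention)."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    return sum(index * length for index, length in enumerate(runs))


def inserted_symbol_sum(received: list[int], e: int, q: int = 4) -> int:
    """Sum (mod q) of the inserted symbols, assuming sum(codeword) % q == e."""
    return (sum(received) - e) % q


def removal_candidates(received: str, n: int, a: int, modulus: int) -> set[str]:
    """Length-n sequences obtained by removing the extra consecutive bits
    whose run-syndrome (with a leading 0) is congruent to a."""
    extra = len(received) - n                    # 1 or 2 inserted bits
    found = set()
    for pos in range(len(received) - extra + 1):
        guess = received[:pos] + received[pos + extra:]
        if run_syndrome("0" + guess) % modulus == a % modulus:
            found.add(guess)
    return found


# Step 1: value of the inserted symbol from the checksum (hypothetical word, e = 0).
word = [0, 3, 1, 0, 1, 1, 0, 2, 3, 1]
print(inserted_symbol_sum(word[:4] + [2] + word[4:], e=0))   # 2

# Step 2: position search on the binary sequence, a '1' inserted after the fifth bit.
print(removal_candidates("01000100111", n=10, a=0, modulus=20))  # {'0100000111'}
```

The same search also covers two consecutive insertions, since `extra` is derived from the received length.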
3.4.2. Correcting Two Consecutive Insertion Errors
If two consecutive insertion errors occur at the j-th and (j+1)-th positions of the codeword , the received sequence is  =  =  with length . From the received sequence  and an analysis similar to the one-insertion case, the sum of the two inserted symbols is obtained as  =  = .
From the mapping (10) in Definition 4, since the insertion of two consecutive symbols  in  also inserts two corresponding consecutive bits in the binary sequence , we can obtain the binary sequence . Thus, the run-syndrome of  is calculated by  = . Algorithm A2 and Function 4 in Appendix B are provided to determine the exact values of  and the positions j and j+1 of the two consecutive insertion errors. Finally,  and  are removed from the sequence  to retrieve the codeword .
3.5. Cardinality of the Proposed Code
Since our main contribution is the error correction capability of the proposed code, the lower and upper bounds on the cardinality of this code design are evaluated based on the previous methods in [,].
3.5.1. Lower Bound of the Code Cardinality
In Section IV of [], the lower bound on the code cardinality was determined by the possible values of the syndrome and checksum in the code construction. Hence, with a similar approach, by applying  and , we obtain the lower bound on the cardinality  of the proposed code as
          
      
        
      
      
      
      
    
The redundancy of the proposed code is therefore at most
          
      
        
      
      
      
      
    
3.5.2. Upper Bound of the Code Cardinality
Define  as the cardinality of the quaternary code of length n, with a maximum possible number of codewords, which can correct at most two consecutive deletion or insertion errors. Similar to the method in [], the upper bound of the cardinality of  is evaluated as
          
      
        
      
      
      
      
    
where  is the number of codewords with length n such that the number of runs is larger than  (where r is an arbitrary number) and  is the number of codewords with length n such that the number of runs is not larger than . Equation (16) can be rewritten as
          
      
        
      
      
      
      
    
Setting r =  and letting n tend to infinity, we have . Therefore, with , the upper bound on the cardinality of the proposed code can be written as
          
      
        
      
      
      
      
    
4. Discussion
In this section, we explain the results of our proposed code design and then discuss its applications.
We provide a new design of quaternary codes to correct at most two consecutive deletion or insertion errors. The correction capabilities of this design for deletion and insertion errors are proved by Algorithms 1 and 2 and Appendices A and B. With this proposed code, we can consider , , , and  to directly construct or sequence DNA strands.
To deal with a burst of at most b (for any fixed ) deletion or insertion errors, we can intersect the proposed code with quaternary codes that correct exactly  consecutive deletion or insertion errors.
For example, to correct a burst error of size at most , we first create the code  from Definition 4, which handles one or two consecutive deletion or insertion errors. Then, we create a q-ary VT code  from Definition 2 with  to correct a single error. By intersecting  and , we obtain the desired quaternary code with length n that deals with a burst error of size at most three. For a given , we use the array code construction described in Section 2.2 to create a quaternary code that can correct exactly  consecutive deletion or insertion errors. By intersecting this code with our proposed code, a quaternary code that can correct at most  consecutive deletion or insertion errors can be obtained.
5. Conclusions
In this paper, we propose a new design of a quaternary code to correct at most two consecutive deletion or insertion errors with redundancy of at most  symbols. We also develop decoding algorithms for correcting one and two consecutive deletion or insertion errors in any quaternary sequence. Even though the results of this work have significant applications in DNA storage and in the correction of multiple quaternary errors, several open problems remain, such as code constructions that can correct at most b non-consecutive deletion or insertion errors and codes that can correct at most b deletion, insertion, and substitution errors, for arbitrary b. Moreover, the optimal design for concatenating a constrained code with our proposed code for DNA-based data storage also needs to be considered.
Author Contributions
All authors discussed the contents of the manuscript and contributed to its presentation. T.-H.K. designed and implemented the proposed code construction and algorithms and wrote the paper under the supervision of S.K. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1802-09.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
   The following abbreviations are used in this manuscript:
      
| DNA | Deoxyribonucleic acid | 
| VT | Varshamov–Tenengolts | 
| SVT | shifted-Varshamov–Tenengolts | 
| GC-content | Guanine-Cytosine content | 
| MILP | Mixed Integer Linear Programming | 
Appendix A. One Insertion Error Correction
As mentioned in Section 3.4, Algorithm A1 and the function ins1_correct1 which is used in Algorithm A1 are given as follows.
        
Algorithm A1. Correct one insertion symbol.
Algorithm A1 finds the possible position j of the inserted symbol in steps 6, 13, and 19, then calls the function ins1_correct1, given as Function 3, to check whether this value of j satisfies constraint (12).
To determine the position of the inserted symbol, the value Syn_new in line 2 of Function 3 is the syndrome of the received quaternary sequence  computed without the inserted symbol . Thus, in the second term of the right-hand side, the coefficient needs to be . This syndrome is compared with constraint (12) to obtain the position of the inserted symbol, and finally the inserted symbol at the j-th position of  is removed. Therefore, a quaternary code whose codewords satisfy constraints (11), (12), and (13) can correct any single insertion error.
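The idea of testing candidate positions against the code constraints can be illustrated with a brute-force reference decoder. The following is only a minimal sketch and not Algorithm A1 or Function 3: it tries every position instead of narrowing the search first, and because constraints (11), (12), and (13) are not reproduced here, it uses Tenengolts' nonbinary single-insertion/deletion code as a stand-in membership test. The names `tenengolts_member` and `correct_one_insertion` and the parameters `a`, `b`, `q` are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def tenengolts_member(x: Sequence[int], a: int, b: int, q: int = 4) -> bool:
    """Stand-in membership test (Tenengolts' q-ary code), NOT constraints (11)-(13).

    x is a codeword when its symbol sum is b modulo q and the VT-style
    weighted sum of its binary signature is a modulo len(x).
    """
    n = len(x)
    # Signature bit i (1-indexed) is 1 when x[i] >= x[i-1], for i = 1..n-1.
    alpha = [1 if x[i] >= x[i - 1] else 0 for i in range(1, n)]
    vt_sum = sum(i * bit for i, bit in enumerate(alpha, start=1))
    return vt_sum % n == a and sum(x) % q == b


def correct_one_insertion(
    y: Sequence[int],
    is_codeword: Callable[[Tuple[int, ...]], bool],
) -> List[Tuple[int, ...]]:
    """Return every distinct codeword obtained by deleting one symbol of y."""
    candidates = set()
    for j in range(len(y)):
        # Hypothesis: the inserted symbol sits at 0-indexed position j.
        x = tuple(y[:j]) + tuple(y[j + 1:])
        if is_codeword(x):
            candidates.add(x)
    return sorted(candidates)


if __name__ == "__main__":
    member = lambda x: tenengolts_member(x, a=0, b=0, q=4)
    received = (0, 0, 3, 0, 0, 0)  # all-zero codeword of length 5 with a 3 inserted
    print(correct_one_insertion(received, member))  # [(0, 0, 0, 0, 0)]
```

When the underlying code corrects a single insertion, the returned list contains exactly one sequence. Deleting either symbol of a repeated pair gives the same candidate, which is why candidates are collected in a set.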
Appendix B. Two Consecutive Insertion Errors Correction
To correct a quaternary sequence in which two consecutive insertion errors have occurred, we provide the details of the correction procedure in Algorithm A2 and Function 4.
Algorithm A2 is constructed based on the analysis in Section 3.5. Steps 6, 13, 24, and 31 yield the possible positions j and  of two consecutive insertion errors in the related binary sequence . However, since mapping (10) maps quaternary symbols to binary bits, different quaternary symbols can be mapped to the same binary bits, so we need to verify the exact quaternary values corresponding to the j-th and -th positions.
		
  | 
The function ins2_correct2 in Algorithm A2 is provided as Function 4 and outputs the unique sequence that satisfies the three constraints (11), (12), and (13). Steps 1 and 2 in Function 4 correspond to the comparisons with constraints (13) and (12), respectively. These comparisons determine the exact values and positions of the inserted symbols, as mentioned in Section 3.1.
Similarly to Function 3 in Appendix A, Syn_new is the syndrome of  computed without the symbols in the j-th and -th positions. Consequently, the coefficient in the second term of Syn_new is , meaning that the symbols after the -th symbol are shifted to the left by 2 positions. The syndrome value Syn_new is compared with the value a of constraint (12) to determine the exact values and positions of the inserted symbols. The output sequence therefore satisfies all three constraints in Definition 4. Finally, the two consecutive inserted symbols at the j-th and -th positions of  are removed. The output of Function 4 is the corrected quaternary sequence.
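The left shift by 2 positions can be checked numerically. The sketch below assumes a VT-style weighted syndrome of the form sum of i times s_i over 1-indexed positions. The exact syndrome of constraint (12) and its modulus are not reproduced here, so this form, and the names `weighted_syndrome` and `syn_skip_two`, are illustrative assumptions rather than the definition used in Function 4.

```python
from typing import Sequence


def weighted_syndrome(x: Sequence[int]) -> int:
    """Assumed VT-style syndrome: sum of i * x_i over 1-indexed positions."""
    return sum(i * s for i, s in enumerate(x, start=1))


def syn_skip_two(y: Sequence[int], j: int) -> int:
    """Weighted sum of y ignoring the 1-indexed positions j and j + 1.

    Symbols before position j keep their weight i; symbols after
    position j + 1 get weight i - 2, as if shifted left by 2 positions.
    """
    head = sum(i * y[i - 1] for i in range(1, j))
    tail = sum((i - 2) * y[i - 1] for i in range(j + 2, len(y) + 1))
    return head + tail


if __name__ == "__main__":
    received = [1, 0, 1, 1, 0, 1]
    # Skipping positions 3 and 4 leaves [1, 0, 0, 1], whose weighted sum is 5.
    print(syn_skip_two(received, 3))  # 5
```

Computing the sum this way avoids building the shortened sequence for each candidate position; a decoder would reduce the result modulo the value used in constraint (12) before comparing it with a.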
        
| Algorithm A2: Correct two consecutive insertion symbols | 
  | 
References
- Goldman, N.; Bertone, P.; Chen, S.; Dessimoz, C.; LeProust, E.M.; Sipos, B.; Birney, E. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 2013, 494, 77–80.
- Blawat, M.; Gaedke, K.; Hütter, I.; Chen, X.; Turczyk, B.; Inverso, S.; Pruitt, B.; Church, G. Forward error correction for DNA data storage. Procedia Comput. Sci. 2016, 80, 1011–1022.
- Erlich, Y.; Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 2016, 355, 950–954.
- Heckel, R.; Mikutis, G.; Grass, R. A characterization of the DNA data storage channel. Sci. Rep. 2019, 9, 1–12.
- Varshamov, R.; Tenengolts, G. A code that corrects single asymmetric errors. Autom. Telemkhanika 1965, 26, 288–292.
- Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
- Levenshtein, V.I. Asymptotically optimum binary codes with correction for losses of one or two adjacent bits. Syst. Theo. Res. 1970, 19, 298–304.
- Cheng, L.; Swart, T.; Ferreira, H.; Abdel-Ghaffar, K. Codes for correcting three or more consecutive deletions or insertions. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 1246–1250.
- Schoeny, C.; Wachter-Zeh, A.; Gabrys, R.; Yaakobi, E. Codes correcting a burst of deletions or insertions. IEEE Trans. Inf. Theory 2017, 63, 1971–1985.
- Chee, Y.; Kiah, H.; Nguyen, T. Linear-time encoders for codes correcting a single edit for DNA-based data storage. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 September 2019; pp. 773–776.
- Nguyen, T.; Cai, K.; Immink, K.; Kiah, H. Capacity-approaching constrained codes with error correction for DNA-based data storage. IEEE Trans. Inf. Theory 2021, 67, 5602–5613.
- Bornholt, J.; Lopez, R.; Carmean, D.; Ceze, L.; Seelig, G. A DNA-based archival storage system. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, Atlanta, GA, USA, 2–6 April 2016; pp. 637–649.
- Ross, M.; Russ, C.; Costello, M.; Hollinger, A.; Lennon, N.; Hegarty, R.; Nusbaum, C.; Jaffe, D. Characterizing and measuring bias in sequence data. Genome Bio. 2013, 14, R51.
- Cai, K.; Chee, Y.; Gabrys, R.; Kiah, H.; Nguyen, T. Correcting a single indel/edit for DNA-based data storage: Linear-time encoders and order-optimality. IEEE Trans. Inf. Theory 2021, 67, 3438–3451.
- Tenengolts, G. Nonbinary codes, correcting single deletion or insertion. IEEE Trans. Inf. Theory 1984, 30, 766–769.
- Schoeny, C.; Sala, F.; Dolecek, L. Novel combinatorial coding results for DNA sequencing and data storage. In Proceedings of the 2017 51st Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, USA, 29 October–1 November 2017; pp. 511–515.
- Palunčić, F.; Swart, T.; Weber, J.; Ferreira, H.; Clarke, W. A note on non-binary multiple insertion/deletion correcting codes. In Proceedings of the 2011 IEEE Information Theory Workshop, Paraty, Brazil, 16–20 October 2011; pp. 683–687.
- Sima, J.; Raviv, N.; Bruck, J. Two deletion correcting codes from indicator vectors. IEEE Trans. Inf. Theory 2020, 66, 2375–2391.
- Sima, J.; Gabrys, R.; Bruck, J. Optimal codes for the q-ary deletion channel. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 740–745.
- Sima, J.; Gabrys, R.; Bruck, J. Optimal systematic t-deletion correcting codes. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 769–774.
- Sima, J.; Bruck, J. On optimal k-deletion correcting codes. IEEE Trans. Inf. Theory 2020, 67, 3360–3375.
- Wang, S.; Sima, J.; Farnoud, F. Non-binary codes for correcting a burst of at most 2 deletions. In Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 12–20 July 2021; pp. 2804–2809.
- No, A. Nonasymptotic upper bounds on binary single deletion codes via mixed integer linear programming. Entropy 2019, 21, 1202.
- Immink, K.; Cai, K. Properties and constructions of constrained codes for DNA-based data storage. IEEE Access 2020, 8, 49523–49531.
 
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).