Bio-Constrained Codes with Neural Network for Density-Based DNA Data Storage

Abstract: DNA has emerged as a cutting-edge medium for digital information storage due to its extremely high density and durable preservation, which accommodate the data explosion. However, DNA strands are prone to errors during the hybridization process. In addition, DNA synthesis and sequencing come with a cost that depends on the number of nucleotides involved. An efficient model that stores a large amount of data in a small number of nucleotides is therefore essential, and it must control the hybridization errors among the base pairs. In this paper, a novel computational model is presented to design large DNA libraries of oligonucleotides. It is established by integrating a neural network (NN) with combinatorial biological constraints, including constant GC-content and satisfaction of the Hamming distance and reverse-complement constraints. We develop a simple and efficient implementation of NNs to produce optimal DNA codes, which opens the door to applying neural networks to DNA-based data storage. Further, the combinatorial bio-constraints are introduced to improve the lower bounds and to avoid the occurrence of errors in the DNA codes. Our goal is to compute large DNA codes in shorter sequences that avoid non-specific hybridization errors by satisfying the bio-constrained coding. The proposed model yields a significant improvement in the DNA library by explicitly constructing larger codes than the previously published codes.


Introduction
The exponential increase in big data demands high-density, high-capacity storage. Inspired by nature, DNA (deoxyribonucleic acid) has various features applicable to digital data storage. DNA comprises four bases: adenine (A), guanine (G), cytosine (C), and thymine (T), collectively called nucleotides. DNA data storage has three key steps [1][2][3][4][5][6][7]: (i) Digital data are converted into binary data, which are encoded into DNA strands with quaternary-alphabet (A, C, T, and G) strings/sequences that are called DNA codes or codewords. (ii) These strands are synthesized (data writing) into oligonucleotides by a DNA synthesizer, and the data are stored. (iii) DNA strands are decoded by DNA sequencing (data reading) to retrieve the data. These key steps come under the big umbrella of DNA computing, in which DNA data storage is partially based on information technology (IT) and biotechnology (BT). In IT, data encoding and decoding techniques are employed, comprising computational and mathematical models. In BT, DNA synthesis, storage, and sequencing are carried out with the base pairs (A, C, G, and T) in a DNA molecule. It is essential for any DNA computing model to select the DNA molecules and code them efficiently to attain maximum storage density [8].
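As a toy illustration of step (i), digital bits can be mapped two at a time onto the four bases. The table below is one simple convention chosen for illustration only; it is not the specific encoding scheme used later in this paper.

```python
# Minimal sketch of step (i): binary data -> quaternary DNA string.
# The 2-bits-per-base table is an illustrative convention, not the
# paper's actual encoding scheme.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(bits: str) -> str:
    """Map a binary string (even length) to a DNA strand."""
    assert len(bits) % 2 == 0
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> str:
    """Recover the binary string from a DNA strand."""
    return "".join(BASE_TO_BITS[base] for base in strand)
```

Decoding (step iii) simply inverts the table, so `decode(encode(bits))` returns the original bits.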
in GC-content, which propagated severe errors, including mismatches, deletions, and insertions, during the decoding process. In addition, no theoretical lower or upper bounds were presented for those constraints.
In 2018, ref. [10] proposed a novel altruistic algorithm with lower bounds to generate constraint-based stable DNA codes. It also used constant GC-content and minimum Hamming distance and reported an improved number of DNA codewords. However, storage efficiency was not sufficiently considered for density-based DNA data storage. In 2020, the authors of [17] proposed a damping multi-verse optimizer algorithm to design DNA coding sets by constructing the GC-content with no-run-length constraints. Their results revealed 4-16% better DNA coding than that of [10], which suggests that increasing the constraints can improve the codes for high-density DNA data storage. In 2021, our previous paper [12] extended the work of [10] by proposing a novel algorithm to construct DNA coding sets with improved lower bounds. The proposed algorithm was applied with the GC-content and no-run-length constraints and achieved 30% better lower bounds. However, besides the insertion and deletion errors in DNA codes, another issue, secondary structures (SS), occurs during the reading process [19]. An SS is a base-pairing contact of a single-strand sequence that folds back on itself, as presented in [11] (Figure 1). Any DNA sequence with an SS shape consumes extra resources and energy to be unfolded, which slows the chemical reaction immensely. Therefore, DNA needs to be free from the SS shape before reading DNA sequences in the wet lab. There are few studies on eliminating this severe issue. The authors in [11,20] introduced the RC constraint to overcome the SS issue. They subjected the GC-content and RC constraints together to improve the DNA coding sets. Their studies furnish the basic idea of combinatorial constraints to generate DNA codes with minimum errors. Although the literature mentioned above [10][11][12]17,20] achieved high-storage DNA code sets and coding rates, these studies do not provide a sufficient method to design larger DNA codes in shorter sequences that satisfy the biological constraints, which is enormously important for a stable density-based DNA storage system.
This paper introduces a more efficient coding technique with a novel computational model that is based on biologically inspired computing, because it uses a neural network (NN) with biological constraints to obtain high-density DNA data storage. In the proposed model, an LSTM, as an NN with a forward pass, is utilized to open a new door in using NNs for DNA code construction. Firstly, the binary data are converted into premiere DNA bases by using the scheme of [3]. Then, the yielded premiere DNA strings are passed through the NN model with the forward-passing mechanism. A particular criterion trains the activation functions to randomly generate DNA codes. If those DNA codes pass that criterion, we term them optimal DNA codes. Then, the combinatorial constraints are utilized to concatenate these optimal DNA codes. The combinatorial constraints, including GC-content and the RC constraint with Hamming distance, are computed to generate a DNA library that is used to store the digital information, for which different propositions and theorems are constructed in the Magma program and proved in this paper. GC-content and Hamming distance are computed, and results are obtained with improved lower bounds. Meanwhile, the RC constraint with Hamming distance is constructed to avoid secondary structures, and it is concatenated with GC-content to generate the DNA library with the best-known codes. These codes are generated by Magma with different inequalities. These inequalities are based on the previous studies that are used for the comparison of our results. Furthermore, the results are analyzed by the coding rate formula, which helps us to evaluate the data storage density in DNA media.
In general, there are two goals to be delivered for high-density DNA data storage with the following features:
1. To improve the net information density by storing a large amount of digital data in shorter DNA sequences.
2. To construct DNA codes that satisfy the combinatorial bio-constraints to overcome the reading errors.
In this scenario, these goals are accomplished by the following significant contributions:
• A novel computational model based on the LSTM neural network with a forward pass is proposed to generate the optimal DNA codes from the premiere DNA bases. To the best of our knowledge, such a model has not been studied in prior work.
• The combinatorial bio-constraints, including GC-content, the RC constraint, and Hamming distance, are constructed for the optimal DNA codes to avoid non-specific hybridization by overcoming sequencing errors and secondary structures.
• The results yield many DNA coding sets that satisfy the bio-constraints and significantly improve the DNA coding rates compared to the existing studies.
The structure of the rest of the paper is as follows: Section 2 delivers the prior work on deep neural networks and combinatorial constraints for DNA data storage. Section 3 presents the preliminaries and notations, Section 4 introduces the proposed model, Section 5 elaborates on the results, and Section 6 concludes this work.

Literature Review
This section is divided into two subsections to emphasize our paper's contributions based on neural networks for DNA codes and DNA coding with combinatorial constraints.

Deep Neural Networks for DNA Codes
DNA computing has successfully impacted human life due to well-known computation tools from the machine learning and deep learning community. With the rapid generation of digital data, efficient and effective deep learning architectures (DLAs) have been constructed to compute big data [21]. DLAs have been applied in a variety of domains with significant accuracies and predictions. In this article, we consider applying a deep neural network based on a DLA. Recurrent neural networks (RNNs) provide connections between nodes to form a directed graph along a temporal sequence. The graph exhibits a short-term memory that allows RNNs to remember information from the previous state to the next state [22]. Long short-term memory (LSTM) is a variant of the RNN that efficiently learns long-term dependencies. It has three gates: the input, output, and forget gates. An LSTM unit has a node or cell that accounts for the values over particular time intervals while the gates regulate the information [23,24].
Various deep neural networks have been applied with different methods and models in various natural computing studies. In 2015, a novel method was proposed for the transformation of DNA sequences into numerical sequences. This method was based on a pulse-coupled neural network and Huffman coding, which used triplet codes to encode DNA sequences of different lengths [25]. Another study also attempted to encode the data with that method, but it found that encoded sequences are compressed at a close distance, making the results less informative [26]. Although numerous studies have been conducted with deep neural networks for natural computing, deep neural networks are still new models for the DNA data storage system. For instance, in 2021, a GRU (gated recurrent unit)-based deep learning model was presented for DNA information storage with next-generation sequence prediction [14]. Similarly, the DeepMod system was proposed, which integrates the RNN and LSTM models to perceive the DNA codes from various Oxford Nanopore sequences. While RNNs are employed to capture the Nanopore sequencing, LSTM overcomes the vanishing gradient issues in the training of RNNs. The proposed system collectively achieves better DNA codes from the given sequences compared to others [15]. In addition, in [16], the problem of DNA synthesis cost was addressed by delivering a high-density DNA data storage system. To achieve a high storage density, convolutional neural networks were embedded to generate the DNA bases. It employed a DNA-mapping method that consisted of GC-content and homopolymer length constraints to design the DNA codebook. It was reported that the proposed scheme efficiently stored and retrieved the information from the DNA storage system by integrating a deep neural network with the DNA-mapping method.
These studies provide the motivation for an integrated system of deep neural networks with DNA coding and combinatorial bio-constraints.

DNA Coding with Combinatorial Bio-Constraints
In DNA synthesis and sequencing, various errors occur, which are combated with different coding techniques. For instance, error-correction coding and bio-constraint coding are mainly used in DNA-based information storage systems [27]. Bio-constraint coding has been practically applied in mass data storage, i.e., magnetic and optical data recording [28]. There are different types of DNA constraint coding reported in the existing literature. Researchers [2,29,30] have formulated single constraints and/or combined constraints to attain the targeted results by preventing DNA sequence (DNA code) errors.
G. M. Church delivered a pivotal work on DNA data storage by converting 8-bit ASCII (0 to A or C and 1 to G or T) into DNA bases. It considered the GC-content constraint and the homopolymer run-length constraint by disallowing a run length greater than 3. It was a groundbreaking step toward storing digital information in DNA, but it was plagued by huge errors and a lack of competency [7]. N. Goldman presented a different method by compressing the raw data into DNA sequences with differential coding and a Huffman coding scheme. It employed a run-length constraint of at most 1 and achieved an effective coding rate. It suggested considering the GC-content for better constraint satisfaction [6]. Similarly, R. N. Grass combined different constraints to deliver an error-correcting scheme. It used Reed-Solomon codes for error control [5]. In comparison, M. Blawat offered a seminal study for DNA data storage by proposing a forward error-correction mechanism. It provided the codewords (codes) that avoided deletion and substitution errors by utilizing the GC-content constraint [4]. Y. Erlich and D. Zielinski proposed another benchmark study by designing a fountain code that considered the GC-content and run-length for DNA synthesis and sequencing. Their study achieved significantly better coding rates compared to existing work; however, in the decoding process, error propagation was found in the information retrieval stage [3]. W. Song and Y. Wang also conducted research on DNA data storage by presenting a mathematical method for DNA code generation that preserves the GC-content and Hamming distance constraints [9,27]. D. Limbachiya constructed an altruistic algorithm to create DNA codewords of a specific length. The algorithm also formalized the Hamming distance for each code to satisfy the constraints [10]. In our previous work, a novel algorithm was developed to generate DNA codes. The obtained DNA codes' errors were corrected to a limited extent with the GC-content and no-run-length constraints [12].
In conjunction with GC-content, no-run-length, and Hamming distance, a few inspiring and influential constraint studies deal with the reverse-complement constraint. In 2005, Oliver D. King reported theoretical lower and upper bounds for the maximum size of DNA codes. The report created the codes with the minimum Hamming distance and the reverse-complement of any code with the least distance. It stated that the obtained DNA codes were larger than ever [18]. In 2010, A. Niema accommodated Oliver D. King's bounds to design DNA codewords with the GC-content, Hamming distance, and reverse-complement (RC) constraints, avoiding non-specific hybridizations. They employed the RC constraint to handle the searching of codes related to bases with 0 and 1 points. Many new codes for DNA data storage were obtained [20]. In 2021, K. G. Benerjee exhibited families of DNA codes that avoid secondary structures, combining the RC constraint with homopolymer run-lengths to construct dissimilar DNA codes [11].
The prior work on bio-constrained coding established the idea of combinatorial constraints with different mathematical methods and formulations. These studies achieved many DNA codes with their particular methods. However, we have found that the lower and upper bounds can still be improved for the generation of high-density DNA codes. Motivated by the few studies on deep neural networks that impact the construction of DNA codes, in this paper, we propose a novel model that integrates a deep neural network with combinatorial constraints to design DNA codewords.

Preliminaries and Notations
According to the fundamental bio-constraints (Section 1), we design the oligos as sequences α over ∑ = {A, C, G, T}. If α ∈ ∑^n, the alphabet symbol at position i in the sequence α is written α_i; thus, a sequence α = α_1 α_2 … α_n ∈ ∑^n is generated. In the same way, another sequence β = β_1 β_2 … β_n ∈ ∑^n is possible if the Hamming distance [31] between the two sequences (α, β ∈ ∑^n), denoted d_H(α, β), satisfies the following:

d_H(α, β) = |{i : α_i ≠ β_i}| ≥ d.

Apart from d_H(α, β), the sequences α, β ∈ ∑^n must satisfy the GC-content and reverse-complement constraints (Section 3) to produce the DNA library L. Hereafter, oligos are denoted by Greek letters, and ξ denotes a generic set of sequences α, β ∈ ∑^n. Here, we need to provide the definition of the DNA library [31].

Definition 1. A set of DNA bases {A, C, G, T} with n-mer oligos ξ ⊆ ∑^n that satisfies the constant GC-content constraint, reverse-complement constraint, and Hamming distance constraint is called a DNA code/codeword (n, d, ω), and such codes collectively form a DNA library L = ∑(n, d, ω).
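A minimal Python sketch of the three constraints in Definition 1 (Hamming distance, GC-content, and reverse-complement); the membership test `in_library` is our own hypothetical helper for illustration, not the authors' implementation:

```python
# Sketch of the three bio-constraints from Definition 1.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def hamming(a: str, b: str) -> int:
    """Number of positions where the two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def gc_content(seq: str) -> int:
    """Count of G and C bases in the sequence."""
    return sum(seq.count(b) for b in "GC")

def reverse_complement(seq: str) -> str:
    """Watson-Crick reverse complement of the sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def in_library(code: str, library: list, d: int, omega: int) -> bool:
    # Accept a candidate if it has GC-content exactly omega and keeps
    # distance >= d to every stored codeword and to each codeword's
    # reverse complement (hypothetical greedy-style filter).
    if gc_content(code) != omega:
        return False
    return all(hamming(code, c) >= d and
               hamming(code, reverse_complement(c)) >= d
               for c in library)
```

Note that ACGT is its own reverse complement, a small example of the self-hybridizing sequences the RC constraint is meant to exclude.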
If L_k(n) denotes the number of k-constrained sequences of q-ary strings initiated with a non-zero symbol, Shannon's relationship [32] can be written as a recurrence relation. As the number of codes n increases, L_k(n) grows exponentially as

L_k(n) ∼ c · Γ_k^n,

where c ∼ 1 is a constant and Γ_k is the exponential growth factor, which is a real root of the characteristic equation of the recurrence. In order to store digital data in nucleotides, this expression leads to the definition of the DNA data density [32].

Definition 2. The maximum number of digital data bits (b) stored per nucleotide (nt) is termed the data density, denoted D_k and defined as

D_k = lim_{n→∞} (1/n) log_2 L_k(n) = log_2 Γ_k.

In high-density DNA storage, there is a probability of secondary structures. In experiments, the Nussinov-Jacobson (NJ) algorithm is employed to predict the secondary structures approximately [33]. During the chemical reaction, a DNA sequence α = α_1, α_2, …, α_n releases energy to attain stability after forming secondary structures. This form can be calculated by a DNA property called free energy (E). This energy relies on the sequence pair (α_i, β_j), where 1 ≤ i < j ≤ n, and the pair releases its energy, which is termed the interaction energy ϕ(α_i, β_j). Note that ϕ(α_i, β_j) between α_i and β_j in any pair (α_i, β_j) is independent of the other sequence pairs. In the NJ algorithm, the interaction energies depend on the selected sequence pairs (α_i, β_j) as non-positive values, while, for independent interaction energies, the NJ algorithm assumes a minimum free energy E_{i,j} for a DNA subsequence under particular conditions, with E_{l,l} = E_{l−1,l} = 0 for l = 1, 2, …, n [34].
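Definition 2 can be made concrete numerically: a small dynamic program counts the constrained sequences, the ratio of consecutive counts approximates Γ_k, and log_2 Γ_k gives the density. The sketch below uses a homopolymer run-length ≤ 3 constraint over the 4-letter alphabet purely as an example constraint of our choosing:

```python
# Hedged numerical sketch of Definition 2: estimate the growth factor
# Gamma and density D = log2(Gamma) for one example constraint
# (homopolymer runs of at most 3 over {A, C, G, T}).
import math

def count_constrained(n: int, max_run: int = 3, q: int = 4) -> int:
    """Count length-n q-ary strings whose runs never exceed max_run."""
    # f[r] = number of valid strings whose final run has length r
    f = {1: q}
    for _ in range(n - 1):
        g = {1: (q - 1) * sum(f.values())}   # switch to a new symbol
        for r in range(1, max_run):
            g[r + 1] = f.get(r, 0)           # extend the current run
        f = g
    return sum(f.values())

# Ratio of consecutive counts converges to the growth factor Gamma_k.
gamma = count_constrained(40) / count_constrained(39)
density = math.log2(gamma)   # bits per nucleotide for this constraint
```

For this constraint the growth factor is just below 4, so the density is just below the unconstrained 2 bits/nt.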

Proposed Model
In this paper, the concept of a neural network is embedded with the combinatorial constraints A_4^{GC,RC}(n, d, ω) to design the DNA codes (n, d, ω) of nucleotides that preserve the GC-content, Hamming distance d_H(α, β), and RC constraints. The proposed model is built on the three layers listed below:
1. Transform the digital data into the sequence of bases (A, C, G, and T).
2. Encode the DNA bases into optimal DNA codes.
3. Create the bio-constraint codes for the DNA library construction.


NN-Based DNA Codes
The encoded premiere DNA codes ξ ⊆ ∑^n are moved through the NN model to obtain the optimal DNA codes Φ_DNA. This model is based on 4 layers (an encoding layer, 2 LSTM layers, and a forward pass). In this neural network, 128 LSTM units are considered, the number of hidden units is 4 times that of the LSTM units, and the dropout rate is set to 0.5 to avoid overfitting. This rate results in a 50% decrease in the number of neurons for the repeated oligonucleotides. It sets weight = 0 if a code is in a forward pass for a single iteration. During the training process, the trainable parameter is automatically set to false to prevent weight updates. The model is trained on forward and reverse input sequences (α, β) to append the DNA codes, which must have different oligos in front of each other. In this paper, various primer templates of length 9 bases are created from ξ ⊆ ∑^n to make model learning efficient. The learning sequences are essential to attain DNA codewords that avoid identical bases. A sequence α = α_1 α_2 … α_n, learned from one layer, is concatenated to another layer according to the forward-pass mechanism. In the encoding layer, two single-stranded sequences (α, β) are concatenated by inserting a particular connector token <c>, which also serves as an ending token. In addition, another special token <b> is appended at the beginning of the sequence. Each encoded base E_i is indexed and fed through the LSTM layers. These layers are double stacked, and the unique tokens are transferred through the dense layers. Each sequence α or β in these layers interacts with two-headed arrows to present the bi-directional LSTM for readability. All sequence nodes are initiated from 0 and updated based on the next nucleotide's information. The hidden LSTM nodes predict the potential patterns, and the forget gates update all nucleotides in the given DNA sequence. In the last layer, the final sequence update is passed through the forward pass of the LSTM to identify the forward-base DNA code Φ_DNA.
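The token handling of the encoding layer described above can be sketched as follows; the vocabulary and the index order are our assumptions for illustration:

```python
# Sketch of the encoding layer: two single strands are joined with a
# connector token <c> (which also ends the sequence) after a begin
# token <b>, then mapped to integer indices for the LSTM layers.
# The vocabulary and its ordering are illustrative assumptions.
VOCAB = {"<b>": 0, "<c>": 1, "A": 2, "C": 3, "G": 4, "T": 5}

def encode_pair(alpha: str, beta: str) -> list:
    """Tokenize a forward/reverse strand pair into integer indices."""
    tokens = ["<b>"] + list(alpha) + ["<c>"] + list(beta) + ["<c>"]
    return [VOCAB[t] for t in tokens]
```

The resulting index list is what would be fed, base by base, through the stacked LSTM layers.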
To construct the NN model that permits the constant flow of sequences (α, β) through self-connected units, each oligonucleotide is protected with a self-linear unit j. The input gate in_j is responsible for protecting the linear unit j from connections with other, irrelevant units. Next, the critical unit, a memory cell, is designed for each DNA sequence with a linear unit j to stop it from connecting with different DNA sequences. The memory cell of unit j is indicated by c_j, with the current net input net_{c_j}, and c_j receives input from the multiplicative unit out_j, which is considered the output gate in the LSTM model. The activations of the input gate in_j and the output gate out_j at iteration time t are indicated by y^{in_j}(t) and y^{out_j}(t), respectively, which can be defined [23]:

y^{in_j}(t) = f_{in_j}(net_{in_j}(t)), where net_{in_j}(t) = ∑_u w_{in_j u} y^u(t − 1),

y^{out_j}(t) = f_{out_j}(net_{out_j}(t)), where net_{out_j}(t) = ∑_u w_{out_j u} y^u(t − 1),

where w_{lu} stands for the entry of the weight matrix on the connection from unit u to unit l, and y^u is the activation of an arbitrary unit u. These activation functions enable the network to learn the complex features of each DNA sequence at the input gate in_j and output gate out_j. Although there are other weight and vector formulas for the LSTM gates [35,36], we omit them in this paper. However, we generalize the above activation functions to architect the forward pass for the final output. These functions learn the DNA bases to satisfy the following criteria for primer design.

• The DNA primer length is generally 15-30 nt [6]. The best length for PCR amplification primers is usually 20 nt; we also train our model at this limit.
• The length of repeated bases in a primer is generally ≤4 nt [5]. The consecutive appearance of any particular base makes the DNA structure unstable. We set consecutive base lengths to ≤3 nt.
• The GC ratio of the primers should be 45-55% [3]. The bases A and T are linked by 2 hydrogen bonds, and the bases C and G are connected by 3 hydrogen bonds (see Figure 6 [37]). We also consider the GC-content to be 45-55% in this work.
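The three primer criteria above can be checked with a small helper; the thresholds are the ones stated in the text (length 20 nt, consecutive runs ≤ 3 nt, GC ratio 45-55%), and `primer_ok` is our illustrative name:

```python
# Sketch of the three primer-design criteria stated above.
def max_run(seq: str) -> int:
    """Length of the longest homopolymer run in the sequence."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def primer_ok(seq: str) -> bool:
    """Check length == 20 nt, runs <= 3 nt, and GC ratio 45-55%."""
    gc = 100.0 * sum(seq.count(b) for b in "GC") / len(seq)
    return len(seq) == 20 and max_run(seq) <= 3 and 45.0 <= gc <= 55.0
```

A primer that fails any of the three tests would be altered one base at a time, as described next.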
If a primer does not satisfy the above criteria, we alter one base of the primer. For example, if a primer with the sequence AGGTCATC does not satisfy the conditions, we alter the first base 'A' to 'T', because A and T are hydrogen-bonded partners, and reconfirm the criterion. If the primer meets the criteria, the premiere DNA codes ξ ⊆ ∑^n are trained to the multiplicative units. The j-th memory cell block c_j, which receives input from the multiplicative units in_j and out_j, has a v-th unit c_j^v with net input at time t:

net_{c_j^v}(t) = ∑_u w_{c_j^v u} y^u(t − 1).

The internal state s_{c_j^v} and output activation y^{c_j^v} of the v-th unit of memory cell block c_j at time t are:

s_{c_j^v}(t) = s_{c_j^v}(t − 1) + y^{in_j}(t) g(net_{c_j^v}(t)),

y^{c_j^v}(t) = y^{out_j}(t) h(s_{c_j^v}(t)).

The final net input for an index k, which ranges over the output units, and the final output activation at time t are:

y^k(t) = f_k(net_k(t)), where net_k(t) = ∑_{u: u not a gate} w_{ku} y^u(t − 1).
Note that each memory cell has its own weights w for the final net input net_k(t). The DNA sequence is updated with the latest bases to design the DNA library. Finally, the LSTM cell determines the output by assigning these updates to the output gate out_j. The out_j gate computes the final output activation y^{c_j^v}, which is passed through the cell as the final optimal DNA code Φ_DNA.
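A NumPy sketch of one forward-pass step of the memory cell described by the equations above (input-gate activation, output-gate activation, internal state); the weight shapes and the choices g = h = tanh are our assumptions, and the weights here are placeholders rather than trained values:

```python
# Hedged sketch of one LSTM memory-cell forward step, following the
# gate/state equations above. Shapes and g = h = tanh are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, s_prev, W_in, W_out, W_c):
    """One forward step: returns (new internal state, cell output)."""
    y_in = sigmoid(W_in @ x)                 # input-gate activation y^{in_j}(t)
    y_out = sigmoid(W_out @ x)               # output-gate activation y^{out_j}(t)
    s = s_prev + y_in * np.tanh(W_c @ x)     # internal state s_{c_j^v}(t)
    y_cell = y_out * np.tanh(s)              # cell output y^{c_j^v}(t)
    return s, y_cell
```

With all weights zero the gates sit at 0.5 and tanh(0) = 0, so the state and output stay at zero, which is a quick sanity check of the update rule.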

Combinatorial Constraints
This section deliberates the coding method to map the optimal DNA codes Φ_DNA with sequence length k(2n − 1) to the DNA library L_k(n) with sequence length kn such that the combinatorial constraints (GC, d_H(α, β), RC) are satisfied. The basic idea is to combine or concatenate the k optimal sequences (α, β) of length n into a sequence of length kn by constructing adjacency relations. For instance, if α = α_1 α_2 … α_n and β = β_1 β_2 … β_n are sequences of length n, then αβ = α_1 α_2 … α_n β_1 β_2 … β_n is the concatenation of α and β. Given the prescribed parameters k and n > 3, for the concatenated sequence to be optimal, it is necessary that α_{i−1} α_i is an optimal sequence for all i ∈ {2, 3, …, k}, where α_{i−1} indicates the sequence that must contain 3 symbols of α_{i−1}.
The constant GC-content ω can be presented analogously as A_4^{GC}(n, d, ω) for A_4(n, d, ω) if all DNA codewords in Φ_DNA have similar melting temperatures and each code is desired to have GC-content ω. The following are the upper- and lower-bound constraints for the DNA library L_k(n) construction. Proposition 1 is based on upper bounds with modified variables for the Hamming distance d, while the original proposition [18] considered only the number of sequences n. Due to the new variables, the proof is presented with new codes for this work.

Proposition 1. For the sequences (α, β) with the number of codewords n > 0, with constraints 0 ≤ d ≤ n and 0 ≤ ω ≤ n, the upper bound holds.

Proof. If there are 3 codewords having GC-content ω < n/3, there will be some position i where none of the words has C or G; thus, 2 of the 3 words must agree in that position. Hence, A_4^{GC}(n, d, ω) ≤ 2, and if ω < n/3, then the 2 codewords will be C^ω A^{n−ω} and G^ω T^{n−ω}. In contrast, if there are 4 codewords and no codes agree at any position i, then all 4 nucleotides occur in each position i. Thus, the average GC-content will be n/2, which is based on A_4(n, d, ω) ≤ 3, with 3 corresponding codewords. Similarly, if there are 2 codewords and no agreement in any position, there can be 4 codes according to the pigeonhole principle. Thus, the average GC-content will be ω for A_4^{GC}(n, d, ω) ≤ 4, with 4 corresponding codewords.

From this proposition, Theorem 1 is derived by considering the code length n − 1 to generate the improved DNA coding sets. In contrast, Theorem 2 is an explicit condition for Hamming distance d − 1 with constant GC-content to produce the DNA coding sets that satisfy both constraints.

Theorem 1. A code with length n can be smaller than a code with length n − 1 with a minimum Hamming distance 0 ≤ d ≤ n and 0 < ω < n.
Proof. In the case of Equation (13), for the sequence α with α_1 words of length n, Hamming distance d_H(α, β), and GC-content ω, there will be a position j where ωα_1/2n codewords have a C nucleotide, or, at some position, a G; otherwise, the average GC-content would be less than ω. Considering those codewords and deleting position j generates codes of length n − 1 and GC-content ω − 1 with minimum d_H(α, β). In contrast, Equation (14) is analogous and only differs in the GC-content for some position where (n − ω)α_1/2n of the codewords have A's or T's.
The inequalities in Equations (13) and (14) are applied to achieve the upper bounds on A_4^{GC}(n, d, ω) under the conditions n = d, n = ω, or ω = 0. Similarly, different bounds can also be obtained by varying the order; for instance, at constant n = d, Equation (13) can still be used after n = ω, and Equation (14) after ω = 0.

Theorem 2. A lower bound holds for the maximum code length n with minimum distance d − 1 for the GC-content ω.

Proof. In Equation (15), the numerator (n choose ω) 2^n provides the total number of codewords with GC-content ω. The denominator yields the codewords within distance d − 1 of a sequence α, while the ratio gives the lower bound, i.e., the codewords of a sequence β with GC-content ω that must satisfy d_H(α, β) to avoid the error r.
Apart from the GC-content, a reverse-complement constraint is integrated in this paper, since we employ the NJ algorithm for the interaction energies ϕ(α_i, β_j) (Section 2) to control the free energy of the secondary structures. To unfold the secondary structures before reading, let us consider a set of codewords {AG, AC, TC, CA, TT} ∈ ∑^n. Any DNA code ξ ⊆ ∑^n in a DNA codebook of length 2n is constructed by defining a bijective map ϕ between the quinary alphabet Z_5 and ∑^n; the net code rate (R = log_4(k)/n, where k is the size of the DNA coding set and n is the sequence length) is, in this case, (log_4 5)/2 ≈ 0.58. The bounds on the free energy of a DNA code ξ ⊆ ∑^n are presented in Proposition 2 to determine the secondary structure in a DNA sequence.
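The net code-rate formula can be checked directly on the quinary example above (k = 5 codewords of length n = 2):

```python
# Net code rate R = log_4(k) / n, as defined in the text.
import math

def code_rate(k: int, n: int) -> float:
    """Net code rate for a coding set of size k with sequence length n."""
    return math.log(k, 4) / n

rate = code_rate(5, 2)   # the quinary example {AG, AC, TC, CA, TT}
```

For k = 5 and n = 2 this evaluates to (log_4 5)/2 ≈ 0.58, matching the value quoted in the text.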
From this proposition, the free energy E_{i,j} is reduced for DNA sequences over ξ. Hence, any DNA sequence α or β in ξ ⊆ ∑^n will avoid secondary structures. For the above coding sets of length 2n, E_{1,2n} ≥ −5(2n/2) = −5n. Now, from this proposition, we provide Theorems 3 and 4 to construct the model with which we remove the secondary structure from any sequence.

Theorem 3. Any DNA sequence (α or β) of length 2n in ξ ⊆ ∑^n is free from secondary structure if the stem length l is more than 1 and the minimum Hamming distance is d_H = d.
Proof. Note that if any DNA sequence has a secondary structure of stem length l, then there are 2 disjoint sub-sequences (α and β) of length l with α = β^{sc}. The result follows by contraposition: if a DNA sequence is free from a secondary-complement (SC) sub-sequence of length l, then it is free from a secondary structure with a stem length of more than one.
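The stem condition used in this proof can be tested mechanically: a sequence admits a stem of length l exactly when it contains two disjoint length-l substrings, one equal to the reverse complement of the other. A brute-force sketch, with `has_stem` as our illustrative helper:

```python
# Brute-force check of the stem condition from Theorem 3's proof:
# a stem of length l exists iff two disjoint length-l substrings are
# reverse complements of each other.
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def revcomp(s: str) -> str:
    return "".join(COMP[b] for b in reversed(s))

def has_stem(seq: str, l: int) -> bool:
    """True if seq contains disjoint substrings alpha, beta with
    alpha == revcomp(beta), i.e., seq can fold into a stem of length l."""
    n = len(seq)
    for i in range(n - l + 1):
        for j in range(i + l, n - l + 1):   # j >= i + l keeps them disjoint
            if seq[i:i + l] == revcomp(seq[j:j + l]):
                return True
    return False
```

The O(n^2 l) scan is only for illustration; the NJ dynamic program mentioned earlier predicts full secondary structures more efficiently.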

Theorem 4. For any codeword over ξ, the following holds.

Proof. The DNA code length and size follow the complement of the DNA sequence under the RC constraint for a given codeword over ξ, ξ_DNA, with minimum Hamming distance min{d_H, n} = d_H for the RC constraint. The result follows the distance property of d_H.
After constructing Propositions 1 and 2 for the improvement of the upper and lower bounds and the avoidance of secondary structure, respectively, we present the combinatorial constraints by utilizing Proposition 3 [20]. This proposition leads to Theorems 5 and 6, which improve the lower bounds of the k-constraint length and avoid particular errors from the number of sequences with errors r.

Proposition 3. Suppose the bounds over A_4^{GC,RC}(n, d, ω) are concatenated with the GC(α, β) and RC(α, β) constraints with Hamming distance d for 0 ≤ d ≤ n and 0 ≤ ω ≤ n. We have 2 cases, according to whether n is odd or even.

Proof. For any set of codewords of length n, if all integers in any subset are replaced by their complements, the GC-content will be maintained due to the existence of a Hamming distance d between each codeword pair α and β. However, the reverse or reverse-complement and the Hamming distance between the codewords are not maintained in general. Subsequently, if n is even, we can replace a codeword α_i by its complement in the first n/2 coordinates to generate a new codeword β_i, and then proceed for all codewords α_i and β_j. In contrast, if n is odd, we can replace a codeword α_i by its complement in the first (n − 1)/2 coordinates to generate a new codeword β_i, and then the difference is at most 1 for all codewords α_i and β_j [20].
Theorem 5. The code with the combinatorial constraint is optimal for maximum code length n and minimum distance d = 2 for the GC-content ω in the lower bounds if 0 ≤ ω ≤ n.

Proof. The claim follows from Theorem 2 for A^{GC,RC}_4(n, d, ω) and, similarly, from Proposition 3 (Equation (16)) for A^{GC,RC}_4(n, 2d, ω), together with Theorem 4.5 of [38], which gives A^R_2(n, 2) = 2^{n−2}. In this argument, the set of 2^{n−1} binary words of odd Hamming weight M contains no palindromes, since the reverse of an odd-weight word still has odd weight when n is even; thus, the 2^{n−1} words are distributed into 2^{n−2} pairs (α, α^R), and choosing one word from each pair shows that A^R_2(n, 2) = 2^{n−2}. The Hamming distance between two distinct words of odd weight M is at least 2, and the resulting inequality determines the Halving bound.

The lower bounds with deletion or substitution errors ε and with d ≥ 2 are not tight enough to generate the DNA library for high-density data storage. We can improve the lower bounds on the maximum number of sequences without errors r by constructing explicit DNA codes with redundancy (r/2) log M, following Shannon's relationship [32] (Equation (2)). The purpose of Theorem 6 is to improve these lower bounds on sequences without errors r; to this end, the lower bounds with a fixed number of deletion and substitution errors ε are considered.

Theorem 6. Let M, k, r, and ε be positive integers with r and ε fixed. Suppose that K > 3 log M + ε. Then the redundancy of the improved lower bounds is (r/2) log M + (r/2) ε − O(1).
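Returning to the Halving bound in the proof of Theorem 5: the counting step — 2^{n−1} odd-weight binary words, none of them palindromes for even n, splitting into 2^{n−2} reverse pairs — can be verified exhaustively for small even n. An illustrative check, not part of the original proof:

```python
from itertools import product

def reverse_pairs(n: int) -> int:
    """Count the (word, reversed word) pairs among the odd-weight
    binary words of even length n; should equal 2**(n - 2)."""
    words = [w for w in product("01", repeat=n) if w.count("1") % 2 == 1]
    assert len(words) == 2 ** (n - 1)        # all odd-weight words
    assert all(w != w[::-1] for w in words)  # no palindromes when n is even
    return len({frozenset((w, w[::-1])) for w in words})

print(reverse_pairs(4), 2 ** 2)  # 4 4
print(reverse_pairs(6), 2 ** 4)  # 16 16
```

Picking one word per pair then realizes A^R_2(n, 2) = 2^{n−2}, as used above.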
Proof. For a sequence α ∈ Γ^{L+2}_M, we consider the sequence α_1 α_2 α_3 … α_M in descending lexicographic order. Each sequence contains a discrete code, so each subsequence of length Kε occurs at most 2ε times. Hence, the number of equivalence classes m is exactly the number of odd weights M with 2Kε runs; this number is known (see [39,40], page 360).
This expression for m is inconvenient to work with, so we instead derive a lower bound on m. Without loss of generality, consider indices 1 ≤ i ≤ m_1, where m_1 ≤ m, and m_1 < i ≤ m with weight M of discrete codes, while the number of equivalence classes with repetitions is bounded by an expression in which the 2^{Kε} term gives the number of choices of the K discrete codes and K^{M−K} counts the remaining M − K sequences as repetitions of the K discrete ones. Since L > 3 log M + ε, when K ≤ M − 2, the bound in Equation (18) is larger than that in Equation (19) with respect to the discrete codes in each given sequence k.

Now, let Š be an error-correcting code. By the pigeonhole principle, at least one equivalence class Ȿ has size at least |Š|/m, which indicates Ȿ ≜ Š. Let Σ ≜ {0, 1}^ε and define the code L_k(n) over Σ^M from the classes in Ȿ. While α_1 α_2 … α_M ∈ Ȿ is a coding set, we use the lexicographic order to assign the indices of the corresponding codewords in Σ^M.

We claim that L_k(n) ⊆ Σ^M is a code of minimum Hamming distance at least d; otherwise, if two codewords in L_k(n) had Hamming distance at most d − 1, the two concerned codewords in Ȿ would be confusable. Thus, after deleting the length-ε suffixes, the concerned codes remain distinct in L_k(n). Applying the Hamming bound to |L_k(n)|, which equals |Ȿ|, and combining Equations (20) and (21), the theorem follows. □

In this paper, biologically constrained quaternary codes are considered in order to use DNA primers economically. The pseudo-code of Algorithm 1 is utilized to generate the DNA library L_k(n), which is based on the optimal DNA codes Φ_DNA designed by a neural network. This algorithm produces codes that satisfy the GC-content ω, the reverse constraint, and the Hamming distance d_H(α, β) using the quaternary encoding.
Algorithm 1. Proposed algorithm to construct the DNA library L_k(n).
2. Initiate the NN with the activation gates y^in_j(t) (Equation (6)) and y^out_j(t) (Equation (7)) to encode the primers;
3. Generate the optimal DNA codes Φ_DNA by the output activation y^cv_j(t) (Equation (11)) and the LSTM layers;
4. Remove from Φ_DNA the codewords that do not satisfy the GC-content ω (Proposition 1 and Theorem 2);
5. Reverse the DNA codes that enable secondary structures (Theorems 3 and 4) and discard the codes that do not satisfy d_H(α, β) ≥ (d − 1);
6. Concatenate the bio-constraints A^{GC,RC}_4(n, d, ω);
7. Construct the error-correcting codes to produce the final DNA library L_k(n) (Theorem 6);
return: DNA library L_k(n) for DNA data storage.
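The constraint-filtering steps of Algorithm 1 reduce to plain filtering and greedy selection once candidate codewords are given. A minimal sketch, with exhaustive candidates standing in for the NN output Φ_DNA and all names ours:

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def revcomp(s: str) -> str:
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(s))

def build_library(candidates, w, d):
    """Greedy stand-in for the filtering steps: keep codewords with
    GC-content w whose Hamming and reverse-complement distances to all
    previously kept codewords are at least d."""
    library = []
    for cand in candidates:
        if sum(b in "GC" for b in cand) != w:  # GC-content constraint
            continue
        if all(hamming(cand, c) >= d and
               hamming(cand, revcomp(c)) >= d for c in library):
            library.append(cand)
    return library

# Exhaustive candidates at a toy size; the paper's model would supply
# the NN-generated codes Phi_DNA here instead.
cands = ("".join(p) for p in product("ACGT", repeat=4))
lib = build_library(cands, w=2, d=3)
print(len(lib))
```

The greedy pass gives a valid (not necessarily maximal) coding set; the paper's bounds concern how large such sets can be made.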

Result Evaluations
This section elaborates on the improved lower bounds and DNA coding sets obtained by the proposed model of the NN and combinatorial bio-constraints. Figure 2 illustrates a random sample of forward and reverse primers for the optimal DNA codes received after the NN implementation. These random DNA sequences were programmed in the Magma program [41] with different sequence lengths and a minimum Hamming distance. The aforementioned propositions and theorems were considered for program construction. As a result, we received .cod files with different lower bounds of DNA codes, satisfying the combinatorial bio-constraints for particular n and d. The codes in the .cod files were tabulated as reported in the tables. In addition, Figure 3 illustrates the numerical analysis considering the coding rate and storage density of the lower bounds given in these tables. Tables 1 and 2 present the lower bounds obtained by our model of GC-content ω and d_H(α, β) with the NN. In each row, the upper entries in Table 1 are directly taken from [10], while the upper entries in Table 2 belong to [17]; the lower entries are our outputs, used for comparison with these existing studies. The superscript i represents improved lower bounds and d indicates decreased lower bounds, while the remaining bounds are almost the same as those of [10] and [17], respectively. In Table 1, the lower bounds are based on the GC-content ω and d_H(α, β) by deriving Proposition 1 and Theorem 1, and they are compared with Table 1 of [10], which uses the inequalities 4 ≤ n ≤ 13 and 3 ≤ d ≤ 10
to construct the DNA codes. We have compared our proposed model's results with [10], considering the GC-content ω and d_H(α, β) with the NN. In this comparison, 51% of the bounds are improved, 5% have decreased, and 44% are almost the same as the lower bounds in [10].
Similarly, in Table 2, the lower bounds are based on the RC constraint and d_H(α, β) by deriving Proposition 1 and Theorem 2, and they are compared with Table 7 of [17], which considers the inequalities 4 ≤ n ≤ 10 and 3 ≤ d ≤ n to design the DNA codes. In this comparison, 64% of the bounds are improved in our work, while 11% have decreased and 25% are almost the same as the lower bounds of [17]. However, these bounds could be further improved by constructing new theorems or by modifying Proposition 1 with different values of n and ω.
Apart from the lower bound improvements for the given constraints, the coding rates (R = (1/n) log_4 L, where n is the sequence length and L is the total number of codewords in the lower bound for that sequence) have also been improved at a shorter sequence length (n − 1). For instance, ref. [10] reported R = 0.3036 for n = 8 and d = 5, while our work attains almost the same coding rate (0.3034) with a shorter sequence, n = 7 and d = 5. Similarly, ref. [17] obtained R = 0.4881 for n = 6 and d = 3; in contrast, this work reaches a comparable coding rate (0.4857) for n = 5 and d = 3. The improved lower bounds have a beneficial influence on the DNA library L_k(n) generation, indicating the proposed model's effectiveness for DNA code construction. Moreover, the 95% confidence interval (CI) of the mean of the received bounds (Figure 3a) reflects this improvement in coding for DNA data storage: the larger the interval, the more significant the coding improvement. As the purpose of the individual RC constraint is to avoid secondary structures in the DNA sequences, the RC constraint is not used to generate lower bounds separately. However, the studies [11,13,18,20] motivate integrating the RC constraint with the GC-content and Hamming distance constraints in an assembled format to design new DNA coding sets. Taking advantage of their work, we generalize the RC constraint via Proposition 3 and Theorems 5 and 6 to generate the new DNA codes.
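The coding-rate comparison above follows directly from R = (1/n) log_4 L. For instance, code sizes L = 29 and L = 19 reproduce the quoted rates 0.3036 (n = 8, from [10]) and 0.3034 (n = 7, this work). A small check; note the L values here are inferred from the quoted rates rather than read off the tables:

```python
import math

def coding_rate(L: int, n: int) -> float:
    """R = (1/n) * log4(L) for a code of size L and sequence length n."""
    return math.log(L, 4) / n

# Code sizes inferred from the quoted rates (not taken from the tables):
print(round(coding_rate(29, 8), 4))  # 0.3036, the rate reported in [10] at n = 8
print(round(coding_rate(19, 7), 4))  # 0.3034, this work's rate at n = 7
```

This makes explicit how a shorter sequence with a smaller code can still match the coding rate of a longer one.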
Table 3 collects the lower bounds with combinatorial constraints. Each column has upper and lower entries; the former are taken from Table 8 of [20], and the latter are attained by our proposed computational model. The bold entries indicate where our proposed model outperforms [20]. Likewise, Figure 3b compares the coding rates of our lower bounds at n = 8 with those of [20] at n = 9; our model designs codes with almost the same R using n − 1 sequences. In addition, the underlined entries indicate the best-known codes of this work, nine bounds in total. In [20], Tables 8 and 9 present the best-achieved bounds satisfying the GC-content, Hamming distance, and RC constraints; we compare our results only with Table 8 because of its particular inequalities (i.e., 3 ≤ d ≤ 11). As Tables 1 and 2 are also based on these inequalities, we focus on this particular inequality for all the results in this paper.
The new lower bounds delivered by this work are better than those of the prior work. For instance, for n = 10 and d = 5, the size of our DNA codes is 22% greater than that of [20]. In another scenario, considering all the sequences at d = 6, the new improved DNA codes are still 36% better than those of [20]. These significant improvements are based on our proposed computational model, which integrates a neural network with combinatorial bio-constraints. In addition, the size of these DNA codes can still increase toward that of the best-known codes for the highest storage density. The storage density with our DNA codes for n = 9 to n = 12 and d = 3 to d = 8 is given in Figure 3c. The highest storage density is attained at lower Hamming distances, and it also depends on the DNA coding sets of each sequence length. For the particular lower bounds in Figure 3c, the highest density of 4.41 is attained for d = 3. Furthermore, the improvements of these lower bounds for any sequence length pioneer the DNA coding rates. A general analysis of Table 3 indicates that the same coding rate (R) is found in 73% of the lower bounds with shorter sequences. For example, ref. [20] received R = (1/12) log_4 87 = 0.2684 when n = 12 and d = 7, while this work acquires almost the same coding rate (0.2673) when n = 11 and d = 7. Similarly, in another example, when n = 10 and d = 4, ref.
[20] reported a coding rate of 0.4874; in contrast, this work delivers R = 0.4860 with a sequence length of n = 9 at the same Hamming distance. In the case of the best-known codes (bold underlined entries), our coding rate is better than that of [20] with a shorter sequence (n − 1): [20] reports n = 8, d = 3, and R = 0.5652, while this work reports R = 0.5660 for n = 7 and d = 3. Thus, these analytical results show that shorter sequences can achieve the same DNA storage density as longer sequences. The improved lower bounds across the various coding sets indicate a reduction in insertion and deletion errors in the DNA sequences, which enables the proposed computational model to avoid the non-specific hybridization process. In addition, Table 4 presents the DNA library L_k(n) satisfying the A^{GC,RC}_4(n, d, ω) constraints for n = 10 and d = 7, as in Table 3. Satisfying the combinatorial constraints over the optimal DNA codes from the NN's output jointly improved the lower bounds of the DNA coding sets, which underlines our proposed computational model.

Conclusions
An exciting research challenge in DNA data storage systems is to explore improved lower bounds while avoiding non-specific errors, so as to enable high-density storage that can hold a large amount of information in a shorter sequence. In this paper, a novel computational model is offered to construct an extensive DNA library of oligonucleotides. It is accomplished by presenting a three-layer model that integrates a neural network (LSTM) with combinatorial bio-constraints, including the GC-content, Hamming distance, and reverse-complement constraints. We derive recursive expressions in the propositions and theorems to attain all possible large DNA coding sets satisfying the combinatorial constraints.
All DNA codewords in Tables 1 and 2 satisfy the GC-content and Hamming distance constraints and improve 51% and 64% of lower bounds compared to [10] and [17], respectively.The lower bounds presented in Table 3 are single error-correcting codes based on the concatenation constraints, while the underlined bounds exhibit the DNA sequences that have avoided secondary structures.Furthermore, the improvements in the lower bounds directly impact the coding rate.For example, results in Section 3 report that the shorter sequences can achieve the same DNA storage density as the longer sequences.It is concluded that the proposed computational model can store a large amount of data in a small number of DNA nucleotides that can improve the data density and reduce the DNA synthesis and sequence cost for a DNA-based data storage system.
In our results, there are still lower bounds that could be improved by mutation strategies for high-density data storage. Similarly, insertion and deletion errors can be further controlled by experimenting with application-oriented bio-constraints, e.g., run-length constraints [42].

Figure 1. The proposed computational model with NN and combinatorial bio-constraints for DNA data storage.

Figure 3's analyses were drawn using the Prism program.

Figure 2. A sample of received primers for the optimal DNA codes.

Figure 3. Lower bounds acquired by coding constraints with d_H: (a) The CI mean with lower and upper bounds of coding constraints with GC for n = 8. (b) The coding rate comparison between lower bounds obtained by RC for our work (n = 8) and that of [20] (n = 9). (c) The storage density with our DNA codes for n = 9 to n = 12 and d = 3 to d = 8.

Table 4 .
DNA coding sets for DNA library L k (n) retrieved when n = 10 and d = 7.