Genetic Similarity Analysis Based on Positive and Negative Sequence Patterns of DNA

Abstract: Similarity analysis of DNA sequences can clarify the homology between sequences and predict their structure and the relationships between them. At the same time, the frequent patterns of biological sequences not only explain the genetic characteristics of an organism but also serve as markers for certain events in biological sequences. However, most existing biological sequence similarity analysis methods target the entire sequential pattern and therefore ignore missing gene fragments that may induce disease. The similarity analysis of sequences that contain a missing gene item remains unexplored. Consequently, some sequences with missing bases are ignored or not effectively analyzed. This paper therefore presents a new method for DNA sequence similarity analysis. Using this method, we first mined not only positive sequential patterns but also sequential patterns that were missing some of the base terms (collectively referred to as negative sequential patterns). Subsequently, we used these frequent patterns for similarity analysis on a two-dimensional plane. Several experiments were conducted to verify the effectiveness of the algorithm. The experimental results demonstrated that the algorithm can obtain various results through the selection of frequent sequential patterns and that accuracy and time efficiency were improved.


Introduction
In recent years, a large volume of biological sequence data has been generated. When a new DNA sequence is obtained, similarity analysis is used to determine whether it is similar to a known sequence. If the two are homologous, the time and effort of re-determining the function of the new sequence are saved. In bioinformatics research, similarity analysis of biological sequences is by no means a straightforward mechanical comparison; rather, numerous mathematical and statistical methods are used to assist the analysis. In sequence similarity analysis, alignment and other classical methods are the most common. In sequence alignment, two factors directly affect the similarity score: the substitution matrix and the gap penalty. The gap penalty is used to compensate for the influence of insertions and deletions on sequence similarity, but no suitable theoretical model exists to describe gaps. Therefore, gap penalty scores lack a sound theoretical basis and are subjective.
These drawbacks of sequence alignment have led researchers to explore other methods for comparing DNA sequences. For example, various mathematical schemes have been devised. In particular, the graphical representation of biological sequences can reveal the information content of a sequence and thereby help biologists choose a more complex theoretical or experimental method for further study.

Related Work
This section is divided into two parts: the first describes the pattern similarity analysis of biological sequences and the second describes the biological sequential pattern mining technique that we used.

Pattern Similarity Analysis of Biological Sequences
In recent decades, numerous DNA sequence similarity analysis methods have been proposed. At present, DNA similarity analysis [8,9] primarily focuses on elucidating homologous relationships between sequences or on predicting the structure and function of unidentified sequences from known sequences. A DNA sequence consists of four bases, adenine, thymine, cytosine, and guanine, which are represented by the letters A, T, C, and G, respectively. Most methods cannot process this sequence of letters directly, so the letters need to be converted into a sequence of numbers. According to the numerical form they use, DNA similarity analysis methods can be divided into graphical representations and other schemes. Of these, graphical representation is a popular research field in DNA sequence similarity analysis. This method was first proposed by Hamori and Ruskin [10] and has since been widely used.
The graphics-based approach can be further divided into several categories based on the spatial dimension of the representation, from two-dimensional (2-D) to three-dimensional (3-D), among others. The two-dimensional graphical representation of DNA sequences is a useful method for studying gene sequences [11]. Its primary objective is to represent nucleotides as numerical vectors and to map the DNA sequence onto curves on a two-dimensional plane, from which numerical characteristics of the DNA sequence can be obtained. Gong et al. proposed new DNA sequence descriptors [12] that are derived from the geometric concepts of curvature approximation and eigenvalue absence and whose complexity grows linearly with sequence length. Guo et al. proposed another similarity analysis method for DNA sequences that is built on a two-dimensional graphical representation [13], in which one DNA sequence corresponds to 24 different curves organized in a two-dimensional Cartesian coordinate system. However, this method ignores the chemical structure of the DNA sequence and omits most of its chemical information. In 2014, Ma et al. introduced a new type of iterated function system to outline the two-dimensional graphical representation of protein sequences [14], which combines the various physicochemical properties of amino acids. Lee et al. proposed a similarity measure that is capable of handling non-overlapped data and analyzed its characteristics on data distributions [15]. To obtain discriminative similarity values for non-overlapped data, Lee considered two approaches. The first was to adopt a traditional similarity measure after preprocessing the non-overlapping data. The second was to consider neighbor data information when designing the similarity measure, taking into account the relationship between specific data and the residual data information.
In 2018, Xie et al. proposed the F-B curve and its corresponding single-base correlation 2-D curve method [16]. The construction of these graphic curves is based on assigning the individual bases to four different sine (or tangent) functions. In 2019, Abo-Elkhier et al. numerically represented each amino acid in the protein sequence and proposed a new 2-D graphical representation method [17]. They introduced a new descriptor, a vector $(\bar{A}_t, SA_t)$ consisting of the mean and standard deviation computed over the total number of protein sequences. In addition, numerous 3-D methods exist. For example, Jafarzadeh et al. proposed a 3-D representation method (C-Curve) based on the 64 codons and the chemical properties of the four nucleotides [18]. However, such a three-dimensional approach may require more storage space and pose a larger computational challenge than a 2-D approach. Numerous other types of graphical representations also exist, such as those proposed by Liao et al. [19]. According to the classification of the four bases of DNA, the primary sequence is converted into a structure diagram. Invariants, such as topological indices, were extracted from the graphical representation of these primary DNA sequences and then used to compute the similarity among 11 species.
None of the above methods can effectively analyze sequences with missing bases. Furthermore, in order to effectively analyze DNA sequence similarity, several key issues need to be considered: (1) how to effectively represent a DNA sequence as a numerical sequence; (2) how to select appropriate descriptors that can be regarded as DNA sequence characteristics and then compute them from the numerical sequence; and (3) how to effectively process DNA sequences of various lengths while maintaining consistency. In this regard, we propose graphically representing the maximal frequent sequential patterns on a two-dimensional plane and analyzing the similarity of the DNA sequences so represented.

Biological Sequential Pattern Mining Technique
In this section, we begin with positive sequential pattern (PSP) mining of biological sequences and then introduce negative sequential pattern (NSP) mining. Biological sequential pattern mining, which mines frequent sub-sequences in biological sequence databases as patterns, is a major research topic with a wide range of application prospects; an early example is the STAR algorithm [20]. Kurtz et al. proposed the REPuter algorithm, which is based on the suffix tree [21]. It overcomes the limitation on input sequence size, but because it is based on paired sub-sequences, it has difficulty finding repeat sequences that occur with high frequency in DNA sequences. Deng et al. [22] proposed a new method for frequent pattern mining in DNA sequences that is based on two levels of nested hash table data structures and set operations. Scanning the DNA sequence once reveals all frequent patterns and their positions in the DNA sequence. In 2018, Zhang proposed MulMer [23], which effectively mines all distinct multi-mers. MulMer first uses the inverted-index technique to project the original sequence and then adopts pattern growth to generate potential multi-mers; each multi-mer accurately records its location in the original sequence.
Few papers exist on NSP mining, and none have applied NSP mining to DNA and protein sequences. We briefly introduce the existing work below. Hsueh and colleagues designed an NSP mining method named PNSP [24], which comprises three steps. The first step uses traditional algorithms to mine PSPs. The second step derives the negative item sets from the positive item sets. The third step joins the positive and negative item sets to generate positive and negative candidate sequential patterns using an Apriori-like joining method. Finally, the support of each candidate sequence is obtained by repeatedly scanning the database, and the PSPs and NSPs are determined. Zheng et al. introduced a GSP-based method for determining NSPs, referred to as Negative-GSP [25]. It first discovers PSPs through GSP and then uses a modified joining method and pruning operation to generate and trim negative sequential candidates (NSCs). The negative patterns are then generated by rescanning the database to calculate the support of the NSCs. Ouyang et al. proposed an algorithm that mines NSPs of the forms (¬A, B), (A, ¬B), and (¬A, ¬B), which closely resembles the mining of negative association rules. This method requires that A ∩ B = ∅. It first obtains all frequent item sets and then uses them to generate frequent and infrequent sequences. The NSPs are then mined from the infrequent positive sequences. The work published in [26] raised the issue of NSPs but did not provide a concrete solution; the work published in [27] used the same definition of NSPs as the existing literature and applied it to an incremental database. Lin proposed an NSP mining algorithm named NSPM [28]; however, the NSP defined in this algorithm only allows the last element of the sequence to be a negative term, and all other elements must be positive terms.
Repetitive sequence patterns capture repetitions of sequence patterns in various sequences, and understanding their behavior from the repetition relationships between them is crucial. Therefore, Dong et al. proposed an effective algorithm, called e-RNSP [29], to mine repetitive NSPs (RNSPs). This method converts repeated negative constraints into repeated positive constraints and quickly calculates repetition supports using only the corresponding RPSP information, without rescanning the entire database. However, NSP mining is still in its infancy and faces numerous challenging problems, one of which is how to select useful NSPs. To address this problem, Dong et al. proposed the Topk-NSP algorithm [30], together with three optimization strategies, to mine the k most frequent negative patterns. Topk-NSP was the first algorithm capable of mining the top-k NSPs. The e-NSP algorithm performs fairly well, but it has drawbacks, which were mentioned in the first section and are not repeated here. We selected f-NSP to mine DNA sequences because it efficiently mines not only frequent positive sequential patterns but also numerous NSPs, which are crucial for the subsequent similarity analysis of DNA.
The current similarity analysis methods of biological sequences continue to be of interest to researchers. Numerous approaches have been proposed, but room for improvement still exists. In particular, no method exists for similarity analysis of NSPs. This paper proposes the adoption of frequent sequence patterns for measuring similarity.

Basic Principles
In this section, we introduce several basic principles and related instructions.

Definition
Definition 1. A DNA sequence, which is also known as a gene sequence, is the first order structure of a real or hypothetical DNA molecule that carries genetic information, represented by a string of letters.

Definition 2. Maximal frequent patterns. Given a DNA sequence S, a sub-sequence of S is a frequent pattern if its support is no less than min-sup. A maximal frequent pattern is a frequent pattern none of whose super-sequences is frequent, while all of its sub-sequences are frequent [31].

Definition 3. Dynamic time warping (DTW) has a simple purpose and has been widely used in the field of speech recognition. It is a dynamic programming technique that combines time warping and distance measurement in order to calculate the maximum similarity, namely the minimum distance, between two time series.
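Definition 2 can be illustrated with a short sketch. It assumes that patterns are plain strings, that a sub-sequence need not be contiguous, and that the set of frequent patterns has already been mined; the function names and the example set are hypothetical and serve only to show the filtering step, not the mining procedure itself.

```python
def is_subsequence(short, long):
    """Return True if `short` can be obtained from `long` by deleting items."""
    remaining = iter(long)
    # `item in remaining` advances the iterator, so order is respected.
    return all(item in remaining for item in short)


def maximal_patterns(frequent):
    """Keep the frequent patterns that have no frequent proper super-sequence."""
    frequent = set(frequent)
    return sorted(
        p for p in frequent
        if not any(p != q and is_subsequence(p, q) for q in frequent)
    )


# Hypothetical set of frequent patterns, for illustration only.
example_frequent = {"A", "T", "AT", "ATG", "C", "G", "CG"}
example_maximal = maximal_patterns(example_frequent)
```

In this example only the patterns that are not contained in any other frequent pattern survive the filter.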

Data Sets of DNA Sequence
At present, few DNA sequence data sets are available for studying sequence similarity, and finding a more suitable set of DNA sequences is still a problem. The β-globin genes from 15 different species are the most commonly used DNA sequences [32]. These data sets can be found at https://www.ncbi.nlm.nih.gov/genbank/.

Similarity Distances
Calculating the distance between DNA sequences is essential for DNA similarity analysis. The Euclidean distance and the correlation angle are the most commonly used measures. The smaller the Euclidean distance or the correlation angle between two sequences, the more similar, that is, the more homologous, the sequences are.
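The two measures can be sketched as follows. The sketch assumes that each sequence has already been reduced to a real-valued descriptor vector of the same length and that the correlation angle is the angle between the two vectors; the function names are illustrative.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length descriptor vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_angle(u, v):
    """Angle in radians between two equal-length, non-zero descriptor vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)
```

Both functions return 0 for identical vectors, so a smaller value indicates greater similarity under either measure.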

Output Data
Generally speaking, a distance matrix is used to represent the output data of DNA sequence similarity analysis. Phylogenetic trees are often constructed based on the distance matrix in order to better show the homology relationship between various species.

F-NSP Algorithm Based on Biological Sequences
We use the f-NSP [7] to mine negative sequence patterns. In order to provide the readers with better understanding of the algorithm, we will briefly describe the process of the algorithm below.

Preprocessing
Each sequence or genome is preprocessed before frequent pattern mining. First, the letters of the data set are replaced with numbers. Subsequently, the sequence is broken into blocks, each consisting of the same number of bases; when the DNA sequence is long, this reduces the memory and time consumption of sequence processing. Unlike the FPE method, we do not discard any bases when we block the sequences of a species. The length of the blocks is chosen. We used our lab's f-NSP algorithm to mine frequent DNA sequential patterns because it is currently a relatively fast algorithm and is able to mine negative DNA sequential patterns.
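The preprocessing steps can be sketched as follows. The particular letter-to-number encoding, the block length, and the decision to keep a shorter final block (so that no base is discarded) are assumptions made for illustration; the text does not specify them.

```python
# Assumed encoding; the source only states that letters are replaced with numbers.
BASE_TO_NUMBER = {"A": 1, "T": 2, "C": 3, "G": 4}


def encode(sequence):
    """Replace each base letter with its assumed numeric code."""
    return [BASE_TO_NUMBER[base] for base in sequence.upper()]


def split_into_blocks(encoded, block_length):
    """Split an encoded sequence into consecutive blocks without discarding bases.

    Assumption: if the length is not a multiple of block_length, the final
    block is kept even though it is shorter than the others.
    """
    if block_length <= 0:
        raise ValueError("block_length must be positive")
    return [
        encoded[start:start + block_length]
        for start in range(0, len(encoded), block_length)
    ]
```

For instance, `split_into_blocks(encode("ATGGTGCACC"), 4)` yields two full blocks followed by a final block holding the last two bases.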

The Main Idea and Data Structure of f-NSP
The main idea of f-NSP is as follows: (1) the GSP algorithm is used to obtain all positive frequent sequences, and the bitmap corresponding to each sequence is stored in a hash table; (2) the corresponding NSCs are generated from all the positive sequences; and (3) the support of each NSC is calculated by bit operations. If the support of an NSC is greater than min_sup, then it is a frequent sequential pattern. In short, f-NSP creates a bitmap for each PSP to store its information and then calculates the support of each NSC through the related bit operations. If a positive sequence is contained in the i-th data sequence, the i-th position of its bitmap is set to 1; otherwise, it is set to 0. The length of each bitmap equals the number of data sequences in the database. Table 1 shows the data set; for example, the bitmap AT | 1 | 1 | 1 | 1 | 0 | indicates that AT is contained in the first four data sequences.
The generation process of the f-NSP data structure can be referred to in [7].
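The bitmap of a positive sequence can be sketched as follows. The example database is hypothetical (Table 1 is not reproduced here) and was chosen only so that the pattern AT yields the bitmap quoted in the text; containment is assumed to mean that the pattern occurs as a not necessarily contiguous sub-sequence.

```python
def contains(pattern, data_sequence):
    """Return True if `pattern` occurs in `data_sequence` as a sub-sequence."""
    remaining = iter(data_sequence)
    return all(item in remaining for item in pattern)


def build_bitmap(pattern, database):
    """Bit i is 1 if the i-th data sequence contains the pattern, else 0."""
    return [1 if contains(pattern, seq) else 0 for seq in database]


def support(bitmap):
    """The support of a positive sequence is the number of 1s in its bitmap."""
    return sum(bitmap)


# Hypothetical database of five data sequences, for illustration only.
example_database = ["ATCG", "GATC", "AGT", "CATG", "GGCC"]
bitmap_at = build_bitmap("AT", example_database)
```

Storing the bitmaps in a dictionary keyed by pattern would play the role of the hash table mentioned in step (1).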

Calculating the Supports of Negative Sequences in f-NSP
We have adopted a new bitmap storage structure in which the bit OR operation replaces the original union operation. Let s be a positive sequence, let B(s) denote its bitmap, and let N(B(s)) denote the number of 1s in the bitmap. Then, for a negative sequence ns of m-size and n-neg-size, the support is

$$\mathrm{sup}(ns) = \mathrm{sup}(MPS(ns)) - N\Big(\mathop{\mathrm{OR}}_{i=1}^{n} B\big(p(1\text{-}negMS_i)\big)\Big) \qquad (1)$$

Figure 1 explains the bit OR operation. Suppose that a positive sequence is <G C T A> and that sup(<C A>) = 5. According to the negative candidate generation method, a negative candidate sequence ns is <¬G C ¬T A>. The corresponding MPS(ns) = <C A>, p(1-negMS_1) = <G C A>, and p(1-negMS_2) = <C T A>. Assume that B(<G C A>) = |1|0|0|1|0| and B(<C T A>) = |1|1|0|1|1|. Figure 1 then shows the union bitmap B(<G C A>) OR B(<C T A>) = |1|1|0|1|1|. Therefore, we can easily obtain N(unionbitmap) = 4 and, from Equation (1), sup(<¬G C ¬T A>) = 5 − 4 = 1.
If ns contains only one negative element, the support of ns is obtained by Equation (2):

$$\mathrm{sup}(ns) = \mathrm{sup}(MPS(ns)) - \mathrm{sup}(p(ns)) \qquad (2)$$

In particular, the support of a single-element negative sequence <¬G> is obtained by Equation (3), where |D| is the number of data sequences:

$$\mathrm{sup}(\langle \neg G \rangle) = |D| - \mathrm{sup}(\langle G \rangle) \qquad (3)$$
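The worked example with B(<GCA>) and B(<CTA>) can be checked with a short sketch. Bitmaps are represented here as lists of 0/1 values and the function names are illustrative; this mirrors the arithmetic of the example rather than the internal data structure of f-NSP.

```python
def bitmap_or(*bitmaps):
    """Position-wise OR of equal-length bitmaps."""
    return [int(any(bits)) for bits in zip(*bitmaps)]


def count_ones(bitmap):
    """N(B): the number of 1s in a bitmap."""
    return sum(bitmap)


def negative_support(sup_mps, one_neg_bitmaps):
    """sup(ns) = sup(MPS(ns)) - N(OR of the bitmaps of all p(1-negMS_i))."""
    return sup_mps - count_ones(bitmap_or(*one_neg_bitmaps))


# Values taken from the worked example in the text.
b_gca = [1, 0, 0, 1, 0]   # B(<G C A>)
b_cta = [1, 1, 0, 1, 1]   # B(<C T A>)
sup_ca = 5                # sup(<C A>)

union_bitmap = bitmap_or(b_gca, b_cta)
sup_ns = negative_support(sup_ca, [b_gca, b_cta])   # sup(<¬G C ¬T A>)
```

Because the OR works directly on stored bitmaps, the support of a negative candidate is obtained without rescanning the database.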

2-D Representation of Negative DNA Sequential Patterns
Bai et al. [33] proposed a similarity analysis method for positive sequential patterns. Building on this work, we propose, for the first time, a similarity analysis method for negative sequential patterns. We constructed a purine-pyrimidine diagram on the complex plane, as shown in Figure 2. The first and third quadrants contain the purines (A, ¬A, G, and ¬G), and the second and fourth quadrants contain the pyrimidines (T, ¬T, C, and ¬C). The eight nucleotides A, ¬A, G, ¬G, C, ¬C, T, and ¬T are represented by unit vectors that are related as follows: A and T are conjugate, ¬A and ¬T are conjugate, C and G are conjugate, and ¬C and ¬G are conjugate. Here, A, T, C, and G stand for bases that are present, whereas ¬A, ¬T, ¬C, and ¬G stand for bases that should have appeared but did not (the missing bases) and are termed negative bases. By this means, we can convert each frequent sequential pattern into a set of vectors. We numbered the positions of the DNA sequence and regarded them as a finite totally ordered set with t elements, that is, [t] = {1, 2, ..., t}.
Here, j = 0, 1, 2, ..., n indexes the positions in the sequence S, and n is the length of the DNA sequence being studied. We can uniquely recover the original DNA sequence from the DNA diagram by connecting the points on the curve.
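A possible realization of this representation is sketched below. The specific angles are assumptions chosen only to satisfy the stated constraints (purines in the first and third quadrants, pyrimidines in the second and fourth, and the four conjugate pairs); the actual unit vectors of Figure 2 may differ. Forming the curve as a running sum of unit vectors is likewise an assumption.

```python
import cmath
import math

# Assumed angles in degrees; only the quadrant and conjugacy constraints
# come from the text.
ANGLES_DEG = {
    "A": 30, "¬A": 60,      # purines, first quadrant
    "G": 210, "¬G": 240,    # purines, third quadrant
    "C": 150, "¬C": 120,    # pyrimidines, second quadrant (conjugates of G, ¬G)
    "T": -30, "¬T": -60,    # pyrimidines, fourth quadrant (conjugates of A, ¬A)
}

UNIT_VECTORS = {
    symbol: cmath.exp(1j * math.radians(angle))
    for symbol, angle in ANGLES_DEG.items()
}


def pattern_to_curve(pattern):
    """Map a pattern (list of symbols) to points on the complex plane.

    Assumption: the curve starts at the origin and each symbol moves the
    current point by that symbol's unit vector.
    """
    points = [0j]
    for symbol in pattern:
        points.append(points[-1] + UNIT_VECTORS[symbol])
    return points


example_curve = pattern_to_curve(["¬G", "C", "¬T", "A"])
```

Since the eight unit vectors are distinct, each step of the curve identifies its symbol, which is what allows the pattern to be read back from the plotted points.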

Algorithm Principle of DTW Distance
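Let the two time series be $S_1(t) = \{s^1_1, s^1_2, \ldots, s^1_m\}$ and $S_2(t) = \{s^2_1, s^2_2, \ldots, s^2_n\}$, with lengths m and n, respectively. With their elements ordered by position in time, construct the m × n matrix $A_{m \times n}$, in which each element $a_{ij} = d(s^1_i, s^2_j)$ is the distance between $s^1_i$ and $s^2_j$. In this matrix, a set of adjacent matrix elements is called a warping path, denoted as $W = w_1, w_2, \ldots, w_K$, where the k-th element of W is $w_k = (a_{ij})_k$. The path must satisfy the standard conditions:

(1) boundedness: $\max(m, n) \le K \le m + n - 1$;
(2) boundary: $w_1 = a_{11}$ and $w_K = a_{mn}$;
(3) continuity and monotonicity: if $w_k = a_{ij}$ and $w_{k+1} = a_{i'j'}$, then $0 \le i' - i \le 1$ and $0 \le j' - j \le 1$.

The DTW algorithm applies the idea of dynamic programming to find the optimal path with the smallest warping cost, namely,

$$D(i, j) = a_{ij} + \min\{D(i-1, j-1),\; D(i, j-1),\; D(i-1, j)\},$$

where i = 2, 3, ..., m and j = 2, 3, ..., n, with $D(1, 1) = a_{11}$, $D(i, 1) = a_{i1} + D(i-1, 1)$, and $D(1, j) = a_{1j} + D(1, j-1)$. D(m, n) is the minimum cumulative value of the warping path in $A_{m \times n}$.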
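The dynamic programming idea can be sketched as follows. This is the textbook DTW recurrence with the absolute difference as the point-wise distance, which also works for complex-valued series; it is an illustration under those assumptions, not necessarily the exact implementation used in the experiments.

```python
def dtw_distance(s1, s2):
    """Minimum cumulative warping cost D(m, n) between two time series."""
    m, n = len(s1), len(s2)
    if m == 0 or n == 0:
        raise ValueError("both time series must be non-empty")
    inf = float("inf")
    # D has an extra row and column of infinities so that the first row and
    # column of the real table need no special handling.
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(s1[i - 1] - s2[j - 1])      # a_ij = d(s1_i, s2_j)
            D[i][j] = cost + min(
                D[i - 1][j - 1],   # diagonal step
                D[i][j - 1],       # step along the second series
                D[i - 1][j],       # step along the first series
            )
    return D[m][n]
```

Two series that differ only by repeated values, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, have a DTW distance of zero, which is why the measure tolerates patterns of different lengths.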

Similarity Analysis of Negative DNA Sequences
Because each DNA sequence corresponds one-to-one to its time series [33], the similarity of DNA sequences can be compared simply by comparing the similarity of their corresponding time series. The DTW algorithm is one of the classical methods for measuring the similarity of time series, and the DTW distance is used here to compare the similarity of DNA sequences.

Experiment Results
We first used the f-NSP algorithm to obtain frequent sequential patterns and then used the mined maximal frequent sequential patterns for similarity analysis. All of the experiments were performed on an Intel Core i5 computer with a 2.4-GHz CPU and 8 GB of memory, running the Windows 7 operating system.

Experiment Data Set
We compared the results of frequent pattern mining on the first exon of the β-globin gene of 10 different species based on our proposed graphical representation. Table 2 shows the coding sequences of the first exon of the β-globin gene of the 10 species, and Table 3 lists the sequence information.

Result of Mining Patterns
For each of the 10 species, two positive and one negative maximal frequent sequential patterns were selected as the data set, as shown in Table 4. The min_sup was set to 0.3 during mining.

DNA Sequence Similarity Analysis
First, we used Equations (4) and (5) to convert the 30 sequential patterns into time series. Subsequently, we used the DTW distance algorithm to calculate the distance between each pair of sequences. Finally, we obtained the distance matrices between the frequent patterns of the 10 species, as shown in Tables 5 and 6.
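Assembling the pairwise distances into a matrix can be sketched as follows. The function accepts any symmetric distance function, such as a DTW implementation; the name `distance_matrix` and its signature are illustrative.

```python
def distance_matrix(series, distance):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(series)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(series[i], series[j])
            # The distance is assumed symmetric, so each pair is computed once.
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix
```

With one selected pattern per species, the 10 time series give a 10 × 10 matrix from which a phylogenetic tree can then be built.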
Here, we describe the similarity analysis process in detail. For example, the sequence Human1 is first converted into a complex number sequence through Equations (4) and (5). Similarly, we obtained the time series for the other 29 frequent sequences and used our method to calculate the similarity for the different data groups listed in Table 4; the results are given in Tables 5 and 6.
The phylogenetic tree was generated according to the distance matrix. A phylogenetic tree is a tree-like branching graph that summarizes the genetic or evolutionary relationships of various organisms. Here, we used MEGA-X to construct our phylogenetic trees. The trees could be reasonably constructed, as shown in Figure 3: different sequence combinations provided different results, but all of them were consistent with the evolutionary genetic relationships among organisms. For example, the phylogenetic tree of Hum1, Opo2, Rat2, Chi2, Gal2, Goa2, Gor2, Lem2, Mou2, and Rab2 was the same as that in [34], which introduced a group representation vector to represent each protein sequence and generated a similarity/dissimilarity vector rather than a regular similarity/dissimilarity matrix. The phylogenetic tree of Hum2, Opo1, Rat1, Chi1, Gal1, Goa1, Gor1, Lem1, Mou1, and Rab1 was similar to that in [16]. The phylogenetic tree of Hum1, Opo2, Rat2, Chi1, Gal2, Goa2, Gor2, Lem1, Mou2, and Rab2 was the same as that in [35], which constructs a graphical representation of the DNA sequence according to the Fermat spiral curve; to account for the local characteristics of the DNA sequence, each point on the Fermat spiral curve is related to a corresponding mass according to the relationships between four adjacent nucleotides. The homology obtained with the selected NSP combination was similar to the result presented in [16], but a certain gap remained between them in terms of the evolutionary matrix.
Because more than one maximal frequent pattern was mined, particularly in the case of NSPs, we could derive more pattern combinations and, thus, more evolutionary relationships between species. Even species whose sequences were missing some of their bases could still be effectively partitioned.
We compared the first group of frequent pattern combinations obtained above with three existing methods and BLASTn [36]. BLASTn returns a score, which can be generated directly with the software; readers can refer to https://blast.ncbi.nlm.nih.gov/. The higher the score, the better the homology and the closer the distance. Of the other three methods, the first was proposed by Mo et al. [35] in 2018 and the second by Yu [37]. The third is FPE, proposed by Xie et al. [31], which uses the PrefixSpan algorithm to find the maximal frequent patterns, calculates the entropy of each block according to the probability of the patterns, and finally forms the vector components of the sequence from the obtained entropies. MEGA is a well-known alignment-based tool, so we also used the results of the MEGA [38] software as a benchmark. Molecular Evolutionary Genetics Analysis version 5 (MEGA5), user-friendly software for mining online databases, was used to build sequence alignments and phylogenetic trees. It is available free of charge from http://www.megasoftware.net. MEGA software development is currently supported by research grants from the National Institutes of Health. Table 7 lists the distances between humans and the other species obtained with the six methods. The values in brackets are the true distances normalized to a range between 0 and 1. We processed the score data of BLASTn according to the method proposed by Xie [31]. Finally, the Pearson correlation coefficients between the MEGA results and the results of our method and of the four comparison methods were calculated.
Our method had the highest correlation coefficient with MEGA, indicating that it calculated the similarity between DNA sequences more accurately than the comparison methods. In addition, Figure 4 shows that the curve of our method was the closest to the curve obtained with MEGA, which again reflects this high correlation. The overall variation of our method was consistent with that of the other comparison methods, so the method proposed in this paper is effective and feasible. The experiments indicated that our method was more accurate than the other methods and that it is applicable to both short and long sequences. Because the data we used were frequent patterns obtained by mining, the sequences used for comparison were generally shorter than the originals while retaining their characteristics. The calculation was therefore simple, and memory consumption was reduced. In addition, more than one maximal frequent pattern was mined, particularly in the case of NSPs, so more pattern combinations could be derived. By comparing the similarities among the 10 species, we saw that various combinations of patterns yielded distinct results, which may be useful for different purposes.
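The normalization and correlation steps can be sketched as follows. Min-max scaling is an assumption, since the text states only that the distances are normalized to a range between 0 and 1; the function names are illustrative.

```python
import math


def min_max_normalize(values):
    """Scale values linearly so that the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant list")
    return [(v - lo) / (hi - lo) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least two items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)
```

Here `x` would hold the human-to-species distances from one method and `y` those from MEGA. Because the Pearson coefficient is unchanged by positive linear rescaling, normalizing the distances affects how they are displayed but not the correlation.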

Conclusions and Future Work
We proposed a DNA sequence representation and similarity analysis method based on frequent patterns, which are represented by eight vectors in a 2-D space. The frequent patterns comprise both ordinary frequent patterns and frequent patterns with some bases missing. Different pattern combinations yield distinct evolutionary results, which can adequately classify species. Some noise can be tolerated because we consider only maximal frequent patterns, which retain the characteristics of the sequence. Because the compared patterns are shorter than the original sequences, our method reduces memory consumption and keeps the calculations simple. Tests on the β-globin gene of 10 species showed that our method produced results similar to those of several recently developed alignment-free methods. Crucially, the correlation comparison of several methods with MEGA showed that our results had the highest correlation, indicating that our method calculated the similarity between DNA sequences more accurately.
Our future work will be to find a more effective way to mine biological sequences that not only maintains the continuity of biological sequences but also effectively mines NSPs. In addition, we aim to find a method for selecting optimal frequent patterns in order to reduce the errors in similarity analysis.

Funding: This paper was partly supported by the National Natural Science Foundation of China (62076143, 61806105) and the Natural Science Foundation of the Shandong Province (ZR2017LF020).