Classification of Protein Sequences by a Novel Alignment-Free Method on Bacterial and Virus Families

The classification of protein sequences provides valuable insights into bioinformatics. Most existing methods are based on sequence alignment algorithms, which become time-consuming as the size of the database increases. Therefore, there is a need to develop an improved method for effectively classifying protein sequences. In this paper, we propose a novel accumulated natural vector method to cluster protein sequences at a lower time cost without reducing accuracy. Our method projects each protein sequence as a point in a 250-dimensional space according to its amino acid distribution. Thus, the biological distance between any two proteins can be easily measured by the Euclidean distance between the corresponding points in the 250-dimensional space. The convex hull analysis and classification perform robustly on virus and bacteria datasets, effectively verifying our method.


Introduction
Proteins, the highly complex substances that are the basic organic matter of life, have been a hot topic for bioinformatics [1]. Proteins play different roles in the processes of life [2], such as enzymes [3]. Proteins carry out the duties specified by gene information. Thus, the research on proteins and protein sequences can reveal the evolutionary relationships of different species. From the molecular biology perspective, a protein sequence is a series of amino acids bonded via peptide bonds, and protein structures vary. There are 20 different types of amino acids that can be combined to make a protein [4]; the sequence of the amino acids determines each protein's unique three-dimensional structure and specific function. These sequences contain the distribution information of 20 types of amino acids, and a single nucleotide polymorphism (SNP) may result in a change in a protein's function [5]. There are many reasons for protein diversity [6]. One reason is the diversity of the sequences; proteins are diverse both within and between families, which makes classifying different virus or bacteria families based on protein sequences reliable.
Studying the relationship among different protein sequences is now a matter of great concern in related research. The methods for sequence similarity analysis commonly depend on a multiple sequence alignment, which usually requires a long computation time to obtain results. Therefore, alignment-free methods are proposed to overcome this ineffectiveness. Currently published alignment-free methods include graphical representation [7,8], probabilistic measure [9,10], k-mer [11,12], etc. Furthermore, sequence vector representation methods without alignment are also popular, such as the moment vector [13] and natural vector [14,15].
In this paper, we propose a novel alignment-free method for protein sequences. The accumulated natural vector for genome sequences is a previously published method that performs well on many datasets [16]. However, it ignores the vast field of protein sequences. Similar to the 18-dimensional vector of nucleotide sequences [16], the accumulated natural vector of protein sequences we designed also covers the number, average position, variance, and covariance information of amino acids. Detailed information is introduced in the next chapter. Our method not only considers the basic properties of each amino acid but also the covariance between them. Each protein sequence is in one-to-one correspondence with a point in a 250-dimensional space. Subsequently, the biological distance between two protein sequences is measured by the Euclidean distance between the corresponding points. Therefore, this approach can classify sequences into the correct cluster at a lower time cost without reducing accuracy. We also performed a convex hull analysis, which states that the convex hull formed from the same family's protein corresponding points do not intersect with other families' convex hulls. The classification results with high and robust accuracy further validate our methods effectively.

Accumulated Natural Vector for Protein Sequences
Since the distribution of amino acids determines a protein sequence, the use of appropriate models to describe the distribution of amino acids is an important issue. Former discrete models, such as the "pseudo amino acid composition" (PseAAC) model [17], have been applied to the prediction of various protein attributes. The information that can be extracted from the distribution of amino acids is very diverse. Our accumulated natural vector method is a natural description of amino acid distribution information. Assume S = (s 1 , s 2 , s 3 , · · · , s N ) is a protein sequence of length N, i.e., s i ∈ {A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}, i = 1, 2, · · · , N. Each letter represents a type of amino acid, and there are 20 letters representing 20 types of amino acids. For convenience, the set of these 20 letters is denoted as A. This subsection defines the accumulated natural vector of the protein sequence S.

Related Definitions
We first define the indicator function of these 20 amino acids, respectively: where α ∈ A, i = 1, 2, · · · , N Next, we define the accumulated indicator function of each amino acid: We have defined the indicator function and accumulated indicator function and noticed that there is an obvious property: n α , the total amount of the amino acid α in the sequence S, is the last column.
Now we define the average position of the amino acid α in the sequence S: For the two different amino acids α and β, define their covariance in the protein sequence S as: where α, β ∈ A, θ α = ∑ N i=1Ĩ α (i)/N, θ β = ∑ N i=1Ĩ β (i)/N. Note that the definition of θ α here is different from the average position above.
Then, the variance of the amino acid α is the special case of α = β in (5): Thus, we have defined every concept needed in the accumulated natural vector.

Accumulated Natural Vector
Now we can build up the accumulated natural vector of the protein sequence S.
For example, take the protein sequence S = (ARRN ADCDCC) of length 10. The indicator functions and the accumulated indicator functions are shown in Table 1 and Table 2, respectively. Here, except for A, R, N, D, and C, the functions of 15 amino acids are all 0.
Next, we calculate the variance and covariance.

Convex Hull Method
In the previous section, we introduce how a protein sequence is represented by a 250-dimensional vector. Therefore, the distance between two vectors represents the biological distance of the corresponding two protein sequences. The convex hull principle for protein states that convex hulls corresponding to different families are disjoint with each other [15,18]. The relationship among point sets can be better observed by constructing their convex hulls.
Given a finite point set A = {a 1 , a 2 , · · · , a n } in R k space, we define the convex hull of A as: Generally speaking, the convex hull of a given point set is the smallest convex set that contains this point set.
There are many ways to judge whether two convex hulls intersect. Considering that the construction of a convex hull in a high-dimensional space is computationally intensive and time-consuming, we do not directly consider two convex hulls here. Instead, we consider two point sets to determine whether the convex hull formed by them intersects.
Given two finite point sets in the R k spaces A = {a 1 , a 2 , · · · , a n } and B = {b 1 , b 2 , · · · , b m }, we want to determine whether the convex hulls of A and B intersect. From the definition given in (13), if some sets of coefficients satisfy: Then the two convex hulls of A and B intersect; otherwise, the two convex hulls are disjoint [18].

Convex Hull Distance
The distance between the two sequences we have defined is the Euclidean distance between their corresponding natural vectors. Then, the distance between two convex hulls corresponding to the two families is the Euclidean distance between the centers of the two convex hulls constructed by these natural vectors.
A, B are the two sets of 250-dimensional accumulated natural vectors of the two families. Here, meanvector(A) represents the center of the convex hull composed of set A. The expression of meanvector(A) is: where n is the number of vectors in set A, A i is a vector in set A. With this convex hull distance definition, it is easy to calculate the distance of two convex hulls if we know their point sets.

Linear Discriminant Analysis
With the help of the accumulated natural vector and the convex hull method, we can construct convex hulls and study their intersection in a 250-dimensional space. When we want to visualize convex hulls and research their properties in a lower-dimensional space, especially in a 2-dimensional space, the dimensional reduction method in cluster analysis is indispensable. First, we introduce linear discriminant analysis, which is the widely known method.
We define X j (j = 0, 1) as the set of samples in Class j. In our experiment, the two sets of samples are two point sets of the convex hulls whose intersection is our study subject.
Our goal is to find the projection direction ω that maximizes the sample distance between the two projected categories (0,1).
We define N j (j = 0, 1) as the number of samples in Class j, µ j (j = 0, 1) as the mean value of the samples in Class j, and ∑ j as the covariance matrix of the samples in Class j. µ j (j = 0, 1) can be calculated by: ∑ j can be calculated by: As previously mentioned, the LDA (short for linear discriminant analysis) algorithm should make two projected classes of datasets (X 1 , X 2 ) in the projection direction ω as far as possible, so we need to maximize ω T µ 0 − ω T µ 1 2 2 (using the Euclidean distance here) and minimize the within-group covariance in the meantime.
The within-class scatter matrix S ω is defined as: The between-class scatter matrix S b is defined as: Then, the two-class LDA can be formulated as an optimization problem to find a set of linear combinations with the coefficient ω.
The expression is the Rayleigh Quotient R(A,x). The solution ω to the above optimization function J(ω) is the eigenvector corresponding to the maximum eigenvalue of the matrix S −1 ω S b . In the case of projecting to two dimensions, we need to calculate the two eigenvectors that correspond to the two largest eigenvalues of the matrix S −1 ω S b . This step can be accomplished by MATLAB programming or another program.

Maximum Margin Criterion
When we use LDA to complete the dimensionality reduction, we may meet the small sample size problem [19], which is caused by the singular S ω . The maximum margin criterion (MMC) is proposed to avoid the small sample size problem and calculate the most discriminant vectors. The MMC method is simple, efficient, and stable compared to the PCA+LDA method [20].
Consider a linear mapping W ∈ R D×d , where D and d are the dimensionalities of the data before and after the projection. The MMC introduces a new objective function: We can suppose that ω T ω = 1 because ω can be multiplied by any constant to make it be the unit vector. Then, we just need to solve the following constrained optimization: The small sample size problem is avoided by the formation S b − S ω instead of S −1 ω S b of the LDA.

Knn Classification
The classification performance of the 250-dimensional accumulated natural vector is also required in this paper. The K-nearest neighbor is a simple classification method and has been developed successfully in real applications [21]. The choice of the k-value has a huge impact on the final classification results. In practical applications, cross-checking is usually required to select the optimal k-value [22]. In particular, we achieved good results when we chose k = 1 in our work.
Suppose the training set contains N samples {x 1 , x 2 , · · · , x N }, which belong to t classes {A 1 , A 2 , · · · , A t }. Define the Euclidean distance from the unclassified sample x to the training set sample x i as d(x, x i ), if: Then, the nearest neighbor classification decision is: x ∈ A j .

Convex Hull Analysis of Bacterial Families
Bacteria are almost everywhere on earth and play an important role in many research fields [23]. The number of bacteria-related protein sequences is enormous and still rising. As of May 2020, the Reference Sequence Database (RefSeq) on NCBI had a total of 140 million bacterial protein sequences. Therefore, we selected 117 bacterial families from 13 different phyla and 150 protein sequences per these families for a total of 117 × 150 = 17, 550 protein sequences. Table S1 provides the complete list of the 117 bacterial families.
For each protein sequence, we calculated the 250-dimensional accumulated natural vector. Thus, each protein sequence corresponded to a point in a 250-dimensional space, and 117 different bacteria families corresponded to 117 finite point sets in a 250-dimensional space. We considered the convex hull of these 117 finite point sets and verified whether they intersect in pairs. There were ( 117 2 ) = 6786 pairs of convex hulls. No intersection was observed using the method in Section 2.2. We were not surprised because the convex hull method had similar results when applied to other datasets in previous work [15]. We applied the linear discriminant analysis method to visualize the results. For example, Figure 1 shows that the Acetobacteraceae family disjoints the Acidiferrobacteraceae family. The y-axis projection direction by LDA method Convex hulls of the two groups using LDA method Acetobacteraceae Acidiferrobacteraceae Convex hull of Acetobacteraceae Convex hull of Acidiferrobacteraceae This result shows that our proposed 250-dimensional accumulated natural vector is very effective when applied to protein sequences. The disjointness property reveals that our method accurately clusters protein sequences from the same family. It is worth noting that the advantage of using the convex hull method is that even if two points are closer together, it does not prevent them from being in different convex hulls. We can suppose that the convex hulls of the different bacterial families are pairwise disjoint. For a new, unclassified protein sequence, we can calculate its 250-dimensional accumulated natural vector in the same way and then analyze in which convex hull the point is located. Subsequently, this new protein is classified into the corresponding family. This is the most intuitive application of our method.

Classification of Protein Enzyme Classes
We performed a classification analysis on protein enzyme classes using the accumulated natural vector. Four datasets were selected from UniProt. According to UniProt, each dataset is composed of seven enzyme classes: Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases, Ligases, and Translocases. We used the one-nearest neighbor classification method mentioned in Section 2.6 to divide these protein sequences into the seven enzyme classes. All protein sequences appeared in the training set and the test set. By predicting the enzyme classes in the test set and comparing these with the training set, we calculated the accuracy of our classification method.
E. coli (Escherichia coli) is a type of bacteria closely related to human life [24]. The E. coli dataset contained 12284 protein sequences. A total of 11788 protein sequences are classified accurately, and the total accuracy is 11,782/12,284(95.9%). The detailed results of this classification are shown in Table 3. Using the one-nearest neighbor classification method, one sequence is incorporated into its nearest convex hull, which represents one bacterial family. The high accuracy rate results indicate that our method combined with the one-nearest neighbor algorithm performed robustly in the classification of protein enzyme classes. Although the accuracy cannot reach 100%, it is still a good result. We must admit that our dataset does not guarantee that every point with the nearest distance is put into the correct convex hull, which is equivalent to saying that two points that are close together can be in different convex hulls. This non-ideal situation is possible but at a low frequency. This is also another piece of evidence to support the notion that accumulated natural vectors provide a good representation of protein space.

Intersection of Virus Families in a 250-Dimensional Space
We chose 73 virus families and downloaded all of their reviewed protein sequences. Detailed information about the dataset is shown in Table S5. Then, we obtained the accumulated natural vector of each protein sequence in a 250-dimensional space. These vectors of different families constituted the different vector sets. We wanted to determine whether the convex hulls of these vectors intersected in 250 dimensions. Here, we introduce the Baltimore virus taxonomy, which divides viruses into seven categories based on differences in the gene expressions of different viruses.
Our expected result was that all virus families under the same Baltimore virus category would be disjoint.
For one Baltimore classification, we tested the intersection of the convex hulls under each Baltimore classification. The test method follows the theorem in Section 2.2; the input was the two "point set" (vector set), and the output was the intersection of the convex hull pairs. The intersection results of the 73 virus families are shown in Table 4. A total of 738 pairs of convex hull pairs were counted. Among these, 727 pairs were disjoint in 250 dimensions, and 11 pairs intersected in 250 dimensions. The disjoint convex hull pairs accounted for 0.9851 of all convex hull pairs. Among the seven Baltimore taxonomies, several convex hulls intersect under the dsDNA Baltimore taxonomy, and all convex hulls disjoint under the other six Baltimore taxonomies. There are 11 intersecting dsDNA virus family pairs in the 250-dimensional space, as shown in Table S6.
Then, we used the distances of the convex hulls corresponding to these 11 virus family pairs to perform a cluster analysis. The convex hull distances calculated here were the Euclidean distance between the two centers of the two convex hulls defined in Section 2.3. The result of our cluster analysis is in Figure 2. From Figure 2, we can see that the three nearest virus families are Podoviridae, Siphoviridae and Myoviridae, which all belong to Caudovirales and are parasitic in bacteria. Their convex hulls have closer distances, so we deduce that these three families have closer evolutionary distances.
Similarly, we performed a cluster analysis of all virus families in the dataset by calculating the distance matrix of their convex hulls. The result is that all virus families are divided into three clusters. Cluster 2 includes Togaviridae, Tobaniviridae, Secoviridae, Potyviridae, Picornaviridae and Hypoviridae. Cluster 3 includes Iflaviridae and Flaviviridae.
To visualize the result of the cluster analysis, we represent each virus family by the mean vector of its corresponding convex hull, and we project these points onto a 2-dimensional plane. The result is shown in Figure 3.
As shown in Figure 3, we can see that all virus families are divided into three clusters in a 2-dimensional space, and the virus families in each cluster remain consistent with those in a 250-dimensional space.
Above all, most of the convex hull pairs of the 73 virus families are disjoint in a 250-dimensional space. Those virus families whose convex hulls intersect with those of the other families are all from the dsDNA Baltimore taxonomy. Although these virus families have intersecting convex hulls with each other, their evolution relationships can also be reflected by the convex hull distances.

Intersection of Virus Families in a 2-Dimensional Space
Visualization of the results of classification is very important. Convex hulls in a 2-dimensional space are much more intuitive than those in 250-dimensional space. Thus, we project the convex hulls constructed above onto 2-dimensional space to observe their intersection. The method used is the maximum margin criterion, instead of the traditional method of linear discriminant analysis [25], to avoid the small sample size problem [22]. A part of the results of the classification is shown in Table 5, and the complete result is shown in Table S7. Compared to our newly proposed 250-dimensional natural vector of protein sequences, the intersection result of a 60-dimensional natural vector [26] is bad in a 2-dimensional space. All convex hulls intersect in a 2-dimensional space. The detailed results are shown in Table S8. This illustrates that the introduction of covariance between the amino acids is an improvement for the intersection analysis. The percentage of non-intersecting convex hull pairs is nearly 0.85 for most of the virus families, except for Adenoviridae, Herpesviridae, Iridoviridae, Mimiviridae, Myoviridae, Podoviridae, Poxviridae, Reoviridae, and Siphoviridae.
We dropped the above nine families whose convex hulls intersect with the convex hulls of other families in a 250-dimensional space, and then the percentage of non-intersecting convex hull pairs is 0.9588.
In conclusion, most of the convex hulls of the virus families disjoint both in a 250-dimensional space and a 2-dimensional space. The percentage of non-intersection of the virus families may be improved by incorporating other elements in the accumulated natural vector; for example, we can incorporate higher-order moments of the 20 amino acids. Correspondingly, the addition of more elements will add to the dimensions of the vectors and make calculations more complicated.

Conclusions
In this paper, the 250-dimensional accumulated natural vector method is proposed by describing the distribution of 20 amino acids within a protein sequence. Protein sequences with similar properties correspond to closer points, so proteins in the same family tend to cluster together. Our proposed method makes it easy to classify the protein sequences since it avoids the high computational complexity associated with sequence alignment and takes advantage of mathematical concepts only. Its applications to real datasets suggest that the accumulated natural vector method is a powerful tool for the classification of protein sequences. However, there is still a lot of room for improvement of this novel method by incorporating other elements.  Table S7: The results of convex hull intersection in a 2-dimensional space: Percentage of non-intersection is the percentage of disjoint convex hull pairs of all convex hull pairs under one virus family, Table S8. The results of convex hull intersection in 2-dimension space based on 60-dimensional natural vector: Percentage of non-intersection is the percentage of disjoint convex hull pairs of all convex hull pairs under one virus family. All convex hulls intersect in 2-dimension space based on 60-dimensional natural vector. So some families are omitted.