Numerical characterization of protein sequences based on the generalized Chou's pseudo amino acid composition

The technique of comparison and analysis of biological sequences is playing an increasingly important role in the field of Computational Biology and Bioinformatics. One of the key steps in developing the technique is to identify an appropriate manner to represent a biological sequence. In this paper, on the basis of three physical-chemical properties of amino acids, a protein primary sequence is reduced into a six-letter sequence, and then a set of elements which reflect the global and local sequence-order information is extracted. Combining these elements with the frequencies of 20 native amino acids, a (21+λ) dimensional vector is constructed to characterize the protein sequence. The utility of the proposed approach is illustrated by phylogenetic analysis and identification of DNA-binding proteins.


Introduction
In the comparison and analysis of biological sequences, choosing a type of DNA/protein representation is an important step. The usual representation of the primary structure of DNA is a string of four letters: A (adenine), G (guanine), C (cytosine), and T (thymine). This expression is called a letter sequence representation (LSR) or a DNA primary sequence. Similarly, a protein primary sequence is usually expressed as a series of 20 letters, which denote the 20 different amino acids. The sequence encodes information about the corresponding structure and function in a living organism. However, it is difficult to obtain this information directly from the primary sequence. Therefore, various sequence representation techniques have been developed for encoding bio-sequences and extracting the hidden information.
Graphical representation of DNA is a useful tool for visualizing and analyzing DNA sequences. With this tool, one can condense the information coded by DNA primary sequences into a set of invariants [1,2]. Early attempts at graphical representations of DNA were made by Hamori and Ruskin in 1983 [3], Hamori in 1985 [4], and Gates in 1985 [5]. Afterwards, many more graphical representations of DNA sequences were developed [1,2,6–15]. In comparison with DNA, graphical representations of proteins emerged only very recently [2,16–27]. As a matter of fact, most graphical representations of DNA involve some degree of arbitrariness, such as the selection of the directions assigned to individual bases. For a string such as a DNA sequence over an alphabet of size 4, there are 4! = 24 possible ways of assigning 4 directions to the 4 nucleic acid bases. If these methods are directly extended to protein sequences, the corresponding figure is 20! ≈ 2.433 × 10^18. It is impracticable to represent one protein sequence by such an enormous number of graphs. This is probably the most important reason why protein graphical representations have not advanced as far [19,23]. Reducing the alphabet or fixing the directions assigned to amino acid residues plays an important role in addressing this problem. For details, we refer to some recent publications [2,16,21,23,24,28].
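The two counts quoted above can be checked directly. The short Python sketch below is only an arithmetic illustration; the function name `direction_assignments` is ours and does not come from the cited works.

```python
import math

def direction_assignments(alphabet_size):
    """Number of ways to assign distinct directions to the letters of an alphabet."""
    return math.factorial(alphabet_size)

dna_count = direction_assignments(4)       # 4 nucleic acid bases
protein_count = direction_assignments(20)  # 20 native amino acids

print(dna_count)               # 24
print(f"{protein_count:.3e}")  # 2.433e+18
```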
Matrix representation of a biological sequence is another powerful tool for the characterization and comparison of sequences. These matrices include the frequency matrix, the Euclidean-distance matrix (ED), the graph theoretical distance matrix (GD), the line distance matrix (LD), the quotient matrices (D/D, M/M, L/L), and their "higher order" matrices [1,2,12,13,20,21,27,29,30]. Among them, ED, GD, L/L, etc., are derived from a graphical representation. For example, L/L is a symmetric matrix whose diagonal entries are zero, while the other entries are defined as the quotient of the Euclidean distance between two points of the graph and the sum of the geometrical lengths of the edges between the two points. Once the matrix is given, some of its invariants can be used as descriptors of the sequence. The eigenvalues of a matrix are among the best-known matrix invariants [31]. In fact, two graphs are isomorphic if and only if their adjacency matrices are permutation-similar, and similar matrices have the same eigenvalues. Among all the eigenvalues, the leading eigenvalue often plays a special role and has been widely used in biological science and chemistry. However, calculating the eigenvalues becomes more and more difficult as the order of the matrix grows. The ALE-index is an alternative invariant that we proposed in 2005 [32]. It can be viewed as an Approximation of the Leading Eigenvalue (ALE) of the corresponding matrix (hence the name 'ALE'-index), while being much simpler to calculate. Therefore, it may be more economical to adopt the ALE-index when one is interested only in the leading eigenvalue.
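To make the idea of approximating the leading eigenvalue concrete, the following Python sketch compares the two quantities on a toy matrix. It assumes that the ALE-index of an order-n matrix M is the mean of two bounds on the leading eigenvalue, namely (1/n)·||M||_m1 and sqrt((n−1)/n)·||M||_F; this form is our recollection of [32] and should be checked against that reference. The toy matrix and the power-iteration helper are illustrative only.

```python
import math

def ale_index(matrix):
    """Assumed ALE-index: mean of a lower and an upper bound on the leading eigenvalue."""
    n = len(matrix)
    m1_norm = sum(abs(x) for row in matrix for x in row)           # ||M||_m1
    f_norm = math.sqrt(sum(x * x for row in matrix for x in row))  # ||M||_F
    lower = m1_norm / n                      # Rayleigh quotient of the all-ones vector
    upper = math.sqrt((n - 1) / n) * f_norm  # holds for symmetric matrices with zero trace
    return 0.5 * (lower + upper)

def leading_eigenvalue(matrix, iterations=500):
    """Power iteration; adequate for small symmetric matrices with positive off-diagonals."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))  # Rayleigh quotient of the unit vector v

# Toy symmetric, zero-diagonal, nonnegative matrix (illustrative values only).
toy = [
    [0.0, 1.0, 0.8, 0.6],
    [1.0, 0.0, 1.0, 0.7],
    [0.8, 1.0, 0.0, 1.0],
    [0.6, 0.7, 1.0, 0.0],
]
print(ale_index(toy), leading_eigenvalue(toy))
```

The ALE-index needs only two sums over the matrix entries, whereas the eigenvalue requires an iterative procedure, which is the economy referred to above.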
The third method for formulating a protein sequence is the pseudo amino acid composition (PseAAC), which has the advantage of avoiding loss of the sequence-order information. Ever since the concept of PseAAC [33,34], or Chou's PseAAC [35,36], was proposed, it has rapidly penetrated into nearly all fields of computational proteomics (see the long list of papers cited in [36,37]). Stimulated by the success of PseAAC in dealing with protein/peptide sequences, the concept has been extended [38–42] to cover DNA/RNA sequences as well, in the form of PseKNC (pseudo K-tuple nucleotide composition) [43,44], which has proven very useful in studying many important genome analysis problems, as summarized in a recent review paper [45]. Also, because the concept of PseAAC has been increasingly and widely used in both computational proteomics and genomics, a web-server called "Pse-in-One" [46] was established that can generate the pseudo components for both protein/peptide and DNA/RNA sequences.
In this paper, we modify the method of Chou's PseAAC and propose a novel approach for numerically characterizing a protein sequence. We characterize a protein sequence by a (21 + λ) dimensional vector, whose first 20 components are the occurrence frequencies of the 20 native amino acids, while the last λ + 1 components are based on a six-letter sequence derived from the protein primary sequence. The former reflect the effect of the amino acid composition, and the latter reflect the effect of sequence order and the properties of the residues. It is well known that a sequence naturally contains two pieces of information: the elements of the sequence and the order of the elements. Methodologies based on the amino acid composition alone capture only the first of these, and are therefore worthy of further investigation. However, as pointed out by Chou [33,34], it is not feasible to completely include all sequence-order patterns. It was encouraging to see that Chou developed the approach mentioned above to extract important features beyond the amino acid composition. Our scheme is similar to, but different from, that of Chou. Experiments on phylogenetic analysis of two datasets and on the identification of DNA-binding proteins illustrate the utility of the proposed method.

Methods
A protein sequence can be viewed as a string over 20 amino acids. Without loss of generality, we index the 20 native amino acids by 1, 2, ..., 20 according to the alphabetical order of their single-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. The frequencies of appearance of the 20 amino acids in a protein sequence are then often used to construct a vector $[f_1, f_2, \ldots, f_{20}]$. This is the conventional amino acid composition. The advantage of such a vector representation is that it is easy to treat statistically, but it cannot reflect the effects of sequence order and residue properties. In what follows, we take these effects into account through a set of elements in addition to the 20 components.
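As a minimal illustration of the conventional amino acid composition, the Python sketch below computes the 20 frequencies in the alphabetical order given above. It assumes that "frequency" means the count divided by the number of recognized residues; the function name and the example strings are ours.

```python
from collections import Counter

# Single-letter codes in the alphabetical order used for the indices 1..20.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return [f_1, ..., f_20], the relative frequencies of the 20 native amino acids."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acid letters")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Two different sequences with identical composition: the vector ignores order.
print(amino_acid_composition("ACCA") == amino_acid_composition("CAAC"))  # True
```

The final comparison shows the limitation noted in the text: rearranging the residues leaves the composition vector unchanged.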
Hydrophobicity, isoelectric point (pI), and relative distance (RD) are three important physicochemical properties of the 20 native amino acids. Here RD can be viewed as an integration of the information on three side chain properties: composition, polarity, and molecular volume, where composition is defined as the atomic weight ratio of hetero (noncarbon) elements in end groups or rings to carbons in the side chain (for details, see [47]). Listed in Table 1 are the original numerical values for hydrophobicity, pI and RD. As can be seen from Table 1, the values of $P^0_1$ (hydrophobicity) lie in the range −2.53 to 1.38, the values of $P^0_2$ (isoelectric point) lie in the range 2.97 to 10.76, and $P^0_3$ (relative distance) varies between 1469 and 3355. Because these ranges differ by orders of magnitude, the values need to be normalized. Here we normalize them by the formula below:

For each amino acid (AA), we associate it with a triple (t(1), t(2), t(3)), where

All the amino acids with the same triple form a group. In this way, the 20 native amino acids can be classified into 6 groups:

For each group, the first amino acid is selected as the representative. That is, A, C, D, E, H and P are used to stand for the six groups, respectively. The value of a property of a group is defined as the average value of that property over the amino acids belonging to the group. Listed in Table 3 are the corresponding values of the six groups. A protein primary sequence can then be reduced to a six-letter sequence by replacing each element in the protein sequence with its representative letter.

Suppose $S = S_1 S_2 \cdots S_L$ is a given six-letter sequence. We inspect it by stepping one element at a time. For step k (k = 1, 2, ..., L), a 3-D space point $q_k = (x_k, y_k, z_k)$ can be constructed as follows:

where $(x_0, y_0, z_0) = (0, 0, 0)$. When k runs from 1 to L, we get L points $q_1, q_2, \ldots, q_L$. Connecting these points sequentially with straight lines, a three-dimensional curve can be drawn. One can further associate the graph with some structural matrices. Here we adopt the L/L matrix and denote it by M, whose (i,j)-entry is defined as follows:

$$M_{ij} = \begin{cases} \dfrac{d(i,j)}{\sum_{k=i}^{j-1} d(k,k+1)}, & i < j, \\ 0, & i = j, \\ M_{ji}, & i > j, \end{cases}$$

where d(i, j) is the Euclidean distance between points $q_i$ and $q_j$. It is not difficult to see that $\lim_{t \to +\infty} {}^{t}M$ is a (0,1) matrix; here ${}^{t}M$ stands for the Hadamard product of the matrix M with itself t times. In this paper, we call the limit matrix the generalized adjacency matrix (GAM) generated by the points $q_1, q_2, \ldots, q_L$, and denote it by $M_G$. Obviously, $[M_G]_{ij} = 1$ if and only if $q_i$ and $q_j$ lie on a straight line in the graph.
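The L/L matrix and its limiting (0,1) matrix can be sketched in a few lines of Python. The sketch takes the points of the curve as given, assumes that consecutive points are distinct, and uses the quotient definition of L/L stated in the Introduction; the function names, the tolerance, and the toy points are ours. Because every entry of M lies in [0, 1], repeated Hadamard multiplication sends entries below 1 to 0 and keeps entries equal to 1, so the limit can be read off directly.

```python
import math

def ll_matrix(points):
    """L/L matrix: Euclidean distance divided by the path length along the curve."""
    n = len(points)
    # prefix[k] is the length of the curve from points[0] to points[k].
    prefix = [0.0]
    for a, b in zip(points, points[1:]):
        prefix.append(prefix[-1] + math.dist(a, b))
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            path = prefix[j] - prefix[i]
            if path == 0.0:
                raise ValueError("consecutive points must be distinct")
            m[i][j] = m[j][i] = math.dist(points[i], points[j]) / path
    return m

def generalized_adjacency(m, tol=1e-12):
    """Limit of the Hadamard powers of m: 1 where the entry equals 1, else 0."""
    return [[1 if abs(x - 1.0) <= tol else 0 for x in row] for row in m]

# Toy curve: three collinear points followed by a turn (illustrative only).
curve = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0)]
gam = generalized_adjacency(ll_matrix(curve))
for row in gam:
    print(row)
```

In the toy curve the first three points are collinear, so the corresponding off-diagonal entries of the GAM are 1, while pairs separated by the turn receive 0.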
As mentioned above, once a symmetric matrix is given, one can calculate its ALE-index by the following formula:

where L is the order of the matrix, and $\|\cdot\|_{m_1}$ and $\|\cdot\|_F$ are the m1- and F-norms of a matrix, respectively. In order to reduce variations caused by comparing matrices of different sizes, we use a normalized ALE-index instead of the ALE-index itself. In addition, following a procedure similar to that used for capturing the sequence-order information of a protein [33,34], for the six-letter sequence $S = S_1 S_2 \cdots S_L$ we extract a set of new order-correlated factors as defined below:

where $\theta_k$ (k = 1, 2, ..., λ) is called the k-th tier correlation factor, and $g_n(S, k)$ represents the coupling mode function as given by

Factor $\theta_1$ reflects the coupling mode between the most contiguous elements along a six-letter sequence (Figure 1a); $\theta_2$ reflects the coupling mode between the second-most contiguous elements (Figure 1b); $\theta_3$ reflects the coupling mode between the third-most contiguous elements (Figure 1c), and so on. λ is the highest rank of the coupling mode. Consequently, a protein sequence can be characterized by a (21 + λ) dimensional vector V:

where

Here $w_1$ and $w_2$ are weight factors. It is easy to see that the first 20 components reflect the effect of the amino acid composition, whereas the last λ + 1 components reflect the effect of sequence order and the properties of the residues. For convenience, a set of such 21 + λ components as formulated by Equations (8) and (9) is called the generalized pseudo amino acid composition of a protein sequence, and is denoted by G-PseAAC.
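The tier correlation factors can be illustrated with a short Python sketch. Two assumptions are made here and are not taken from the text above: the k-th factor is the average of the coupling values over the L − k pairs of elements that are k positions apart, and the coupling value of a pair is the mean squared difference of their property values, modeled on the original PseAAC of [33,34]. The property table in the example is a placeholder and does not reproduce Table 3.

```python
def tier_correlation_factors(reduced_seq, group_properties, max_tier):
    """Return [theta_1, ..., theta_max_tier] for a reduced (six-letter) sequence.

    Assumed form: theta_k is the mean, over all pairs (S_n, S_{n+k}), of the mean
    squared difference between the property values of the two groups.
    """
    length = len(reduced_seq)
    if max_tier >= length:
        raise ValueError("max_tier must be smaller than the sequence length")
    factors = []
    for k in range(1, max_tier + 1):
        total = 0.0
        for n in range(length - k):
            p = group_properties[reduced_seq[n]]
            q = group_properties[reduced_seq[n + k]]
            total += sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)
        factors.append(total / (length - k))
    return factors

# Placeholder property triples for two group letters (NOT the values of Table 3).
placeholder_properties = {"A": (0.0, 0.0, 0.0), "C": (1.0, 1.0, 1.0)}
print(tier_correlation_factors("ACAC", placeholder_properties, 3))  # [1.0, 0.0, 1.0]
```

For the alternating toy sequence, neighbors at odd separations always differ and neighbors at even separations always agree, which is why the factors alternate between 1 and 0.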

Results
In this section, we illustrate the use of the new quantitative characterization of protein sequences with two experiments. As can be seen from Equations (8) and (9), the G-PseAAC has three adjustable parameters: λ, $w_1$, and $w_2$. It is not known beforehand which values of λ, $w_1$, and $w_2$ are best for a given problem. Three datasets are considered in this paper. The first is used for determining these parameters and the others for testing.

Experiment I: Phylogenetic Analysis on Two Datasets
The first dataset used in this paper is composed of the β-globin proteins of 17 species (see Table 4). According to the proposed method, we associate each of the 17 protein sequences with a τ = 21 + λ dimensional vector. These vectors are then used to define a pairwise evolutionary distance between any two protein sequences i and j:

where $V_i = (v_{i1}, v_{i2}, \ldots, v_{i\tau})$ and $V_j = (v_{j1}, v_{j2}, \ldots, v_{j\tau})$ are the corresponding vectors for sequences i and j, respectively. Thus, a 17 × 17 real symmetric matrix $D_{17}$ is obtained. On the basis of the distance matrix $D_{17}$, a phylogenetic tree can be constructed using the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) program included in the MEGA4 package. We find that, when λ = 7 and $w_1 = w_2 = 1.6$, the non-mammals (Guttata, Gallus and Muscovy duck) cluster together and stay outside the mammals, while Opossum is distinguished from the remaining mammals. In addition, the Primate group {Human, Chimpanzee, Gorilla}, the Cetartiodactyla group {Cattle, Banteng, Sheep, Goat}, the Lagomorpha group {Rabbit, European hare}, and the Rodentia group {House mouse, Western wild mouse, Spiny mouse, Norway rat} form separate branches (cf. Figure 2). This result is in accordance with the accepted taxonomy and the literature [1,12,30].

Using the above-determined values for λ, $w_1$, and $w_2$, we infer the relationship of 72 coronavirus spike (S) proteins. The coronavirus, whose name is derived from its crown-like shape, is a positive-sense, single-stranded RNA virus in the family Coronaviridae. It was first identified in the 1960s from the nasal cavities of patients with the common cold. Most coronaviruses are not dangerous, but some strains can cause severe, sometimes fatal, diseases in humans and other animals. The MERS coronavirus (commonly shortened to MERS-CoV) is the virus that causes the Middle East respiratory syndrome (MERS). MERS was first reported in 2012 in Saudi Arabia and then in other countries in the Middle East, Africa, Asia, Europe and America. As of July 2016, 1769 laboratory-confirmed cases of MERS-CoV infection, including at least 630 related deaths (a case fatality rate of >30%), have been reported in over 27 countries (http://www.who.int/emergencies/mers-cov/en/). People also died from severe acute respiratory syndrome (SARS), which first emerged in 2002 in Guangdong Province, China, and then spread globally. SARS resulted in more than 8000 infections with a case fatality rate of ~10%. The virus that causes SARS is officially called SARS coronavirus (SARS-CoV). Both MERS-CoV and SARS-CoV are members of the beta group of coronaviruses, Betacoronavirus, although they are distinct from each other. The name, accession number, and abbreviation of the 72 sequences are listed in Table 5. According to the existing taxonomic groups, sequences 1-5 belong to group alpha (formerly known as Coronavirus group 1 (CoV-1)), sequences 6-8 are members of group gamma (formerly CoV-3), and the remaining sequences belong to group beta (formerly CoV-2). Refer to Table 5 for details.

The corresponding phylogenetic tree constructed by our method is shown in Figure 3. In Figure 3, TGEVG, TGEV, PEDVC, PEDV and HCoV-229E, which belong to group alpha, are clearly clustered together, as are the three gamma coronaviruses IBV, IBVBJ and IBVC. In the subtree of group beta, the MERS-CoVs cluster together and the SARS-CoVs are situated on an independent branch, while BCoV, BCoVM, BCoVQ, BCoVE, BCoVL, HCoV-OC43, MHV, MHVA, MHVM, MHVP and MHVJHM form a separate branch. The resulting clusters agree well with the established taxonomic groups.
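The step from feature vectors to a distance matrix, which underlies both trees, can be sketched as follows. The sketch assumes, for illustration only, that the pairwise distance is the Euclidean distance between the vectors; the function name and the toy vectors are ours, and tree construction itself is left to a UPGMA implementation such as the one in MEGA4.

```python
import math

def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances (Euclidean distance is an assumption)."""
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(vectors[i], vectors[j])
    return d

# Toy 2-D vectors standing in for the (21 + lambda)-dimensional G-PseAAC vectors.
toy_vectors = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
for row in distance_matrix(toy_vectors):
    print(row)
```

The resulting matrix is real and symmetric with a zero diagonal, which is the form required as input by UPGMA.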

Experiment II: Identification of DNA-Binding Proteins
Numerous biological mechanisms depend on nucleic acid-protein interactions. The first step in understanding these mechanisms is to identify the interacting molecules. There are different strategies for determining DNA sequences that bind specifically to a known protein. However, it is difficult to accurately identify DNA-binding proteins [50]. Existing experimental techniques have low practical value because they are time-consuming and expensive [51]. Therefore, developing an efficient computational approach for identifying DNA-binding proteins is becoming increasingly important. In this section, we explore the application of the G-PseAAC to the identification of DNA-binding proteins. The parameters λ, $w_1$, and $w_2$ used here are the same as those determined in Section 3.1.
The dataset used here is taken from [51]. Its original version was created in 2009 by Kumar et al. [52], in which the DNA-binding proteins were extracted from the Pfam database [53] with the keywords "DNA-binding domain" and a pairwise sequence identity cutoff of 25%, while the non DNA-binding domains were randomly selected from Pfam protein families that are unrelated to the DNA-binding protein family. Xu et al. [51] removed some sequences from the original dataset, and its current version is composed of 1585 protein sequences. This benchmark dataset contains 770 DNA-binding proteins and 815 non DNA-binding proteins, which form the positive sample set and the negative sample set, respectively. We randomly divide the 770 DNA-binding proteins into two parts, one with 410 sequences and the other with 360 sequences. Likewise, we randomly divide the 815 non DNA-binding proteins into parts of 410 and 405 sequences. We thus construct two data sets. Set I contains 410 DNA-binding proteins and 410 non DNA-binding proteins and serves as the training set. The remaining protein sequences (360 DNA-binding proteins and 405 non DNA-binding proteins) form Set II, which serves as the test set.
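The random division into Set I and Set II can be sketched in Python as follows. The function name, the fixed seed, and the placeholder identifiers are ours; the sketch only reproduces the set sizes stated above and says nothing about which sequences were actually drawn.

```python
import random

def split_benchmark(positives, negatives, n_train_pos=410, n_train_neg=410, seed=0):
    """Randomly split labeled samples into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = [(s, 1) for s in pos[:n_train_pos]] + [(s, 0) for s in neg[:n_train_neg]]
    test = [(s, 1) for s in pos[n_train_pos:]] + [(s, 0) for s in neg[n_train_neg:]]
    return train, test

# Placeholder identifiers standing in for the 770 positive and 815 negative sequences.
positive_ids = [f"pos{i}" for i in range(770)]
negative_ids = [f"neg{i}" for i in range(815)]
set_one, set_two = split_benchmark(positive_ids, negative_ids)
print(len(set_one), len(set_two))  # 820 765
```

Shuffling each class separately keeps the class sizes of both sets fixed, which a single shuffle of the pooled data would not guarantee.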

Selection of Properties for Amino Acids
In addition to the three physical-chemical properties mentioned above, both the hydrophilicity and the molecular weight of amino acids can play important roles in the characterization of proteins. Therefore, one can consider r-combinations of the five properties to describe a protein sequence. The purpose of this paper is to find an appropriate way of converting a protein sequence over 20 kinds of amino acids into a string over a "small" alphabet. If we take r to be 3, then by the scheme described in Section 2 the triple (t(1), t(2), t(3)) has at most 2^3 = 8 different forms. This means that the 20 native amino acids can be classified into no more than eight groups. If instead a 5-combination or 4-combination is selected, then by a similar scheme (t(1), t(2), ..., t(r)) will have 2^5 = 32 or 2^4 = 16 possible forms. Compared with 20, this figure is not "small." Therefore, r is taken to be 3 in this paper. The same experiments are performed with each of the 3-combinations of the five properties. As a result, we find that hydrophobicity, isoelectric point, and relative distance form the best 3-combination.
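The counting argument above, and the way a binary tuple partitions residues into groups, can be illustrated with the Python sketch below. The rule used to obtain each binary component (1 if the normalized value is at least a threshold, 0 otherwise) is an assumption made for the example, and the residue names and values are placeholders rather than data from Table 2.

```python
from collections import defaultdict
from itertools import combinations

PROPERTIES = ["hydrophobicity", "isoelectric point", "relative distance",
              "hydrophilicity", "molecular weight"]

def max_groups(r):
    """Upper bound on the number of groups produced by a binary r-tuple."""
    return 2 ** r

# All candidate 3-combinations of the five properties.
three_combinations = list(combinations(PROPERTIES, 3))

def group_by_binary_tuple(normalized, threshold=0.0):
    """Group residues by the binary tuple of their normalized property values."""
    groups = defaultdict(list)
    for residue in sorted(normalized):
        key = tuple(1 if v >= threshold else 0 for v in normalized[residue])
        groups[key].append(residue)
    return dict(groups)

# Placeholder residues and values (hypothetical, for illustration only).
toy_values = {"x1": (0.5, -0.2, 1.0), "x2": (0.1, -0.9, 0.3), "x3": (-1.0, -1.0, -1.0)}
print(len(three_combinations), max_groups(3), max_groups(4), max_groups(5))  # 10 8 16 32
print(group_by_binary_tuple(toy_values))
```

There are 10 candidate triples to compare, and each yields at most 8 groups, whereas 4 or 5 properties would allow up to 16 or 32 groups.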

Feature Analysis
As can be seen from Equations (8) and (9), the 28-D feature vector consists of three parts: 20 amino acid compositions, 7 correlation factors, and 1 ALE-index. One may be interested in knowing whether or not the last two parts are significant. First and foremost, let us see what would happen if only the first part were used. Without loss of generality, suppose S is a protein sequence and the counts of the 20 native amino acids are $n_1, n_2, \ldots, n_{20}$, respectively. Then we have a multi-set $M(S) = \{n_1 \cdot A, n_2 \cdot C, \ldots, n_{20} \cdot Y\}$. From elementary combinatorics, it is not difficult to see that there are a total of $\dfrac{(n_1 + n_2 + \cdots + n_{20})!}{n_1!\, n_2! \cdots n_{20}!}$ different sequences possessing the same amino acid composition. This suggests that the amino acid composition alone is not sufficient to represent and compare protein sequences.

What would happen if only the first two parts were used (i.e., without the ALE-index)? Experiments I and II are repeated using the vector with the first 27 components. For the first dataset, there is no significant difference between the tree constructed with the 27-D vector and that constructed with the 28-D vector. For the second dataset, the corresponding relationship tree of the coronavirus spike proteins is shown in Figure 4. From Figure 4, it is easy to see that the MERS-CoVs, which belong to Betacoronavirus, cluster together with the three Gammacoronaviruses instead of the other Betacoronaviruses, which contradicts the established taxonomy. For the third dataset, we repeat the three cross-validation tests with the 27-D vector and list the corresponding results in Table 7. Comparing Table 7 with Table 6, we find that the prediction quality diminishes slightly. These results indicate that the ALE-index makes a positive contribution to the performance in these experiments.
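The number of distinct sequences sharing one amino acid composition is a multinomial coefficient, which the following Python sketch evaluates for short toy strings. The function name and the example strings are ours.

```python
import math
from collections import Counter

def sequences_with_same_composition(sequence):
    """Multinomial coefficient L! / (n_1! n_2! ... ) for the letter counts of the sequence."""
    total = math.factorial(len(sequence))
    for count in Counter(sequence).values():
        total //= math.factorial(count)  # each intermediate quotient is an integer
    return total

print(sequences_with_same_composition("AACD"))  # 12
print(sequences_with_same_composition("ACDE"))  # 24
```

Even for a four-residue string the count reaches double digits, and it grows very quickly with sequence length, which is why the composition alone cannot separate sequences.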

Conclusions
By means of three important physicochemical properties of amino acids, we first classify the 20 native amino acids into six groups and assign to each group a representative symbol. Then, by substituting each letter with its representative letter, we convert a protein primary sequence into a six-letter sequence, which can be regarded as a coarse-grained description of the protein primary sequence. In comparison with the string composed of 20 kinds of amino acids, the reduced sequence not only makes the generalization from representations of DNA sequences to those of proteins easier, but also enables us to focus more on the information of interest. On the basis of the six-letter sequence, we obtain a generalized adjacency matrix (GAM) and then its normalized ALE-index. We also extract λ order-correlated factors from the reduced sequence. Combining these elements with the frequencies of occurrence of the 20 native amino acids, we construct a (21 + λ) dimensional vector to characterize a protein sequence. Our method is tested by phylogenetic analysis and by the identification of DNA-binding proteins. The feature analysis implies that the λ + 1 components beyond the amino acid composition play very important roles in the performance of the experiments. As shown in a series of recent publications (see, e.g., [58,67–72]) demonstrating new methods or approaches, user-friendly and publicly accessible web-servers significantly enhance their impact [73]. In our future work, we will make efforts to further improve our method and to provide a web-server for it.

Figure 1. A schematic diagram showing (a) the first-tier, (b) the second-tier, and (c) the third-tier sequence-order correlation mode along a sequence. The regular hexagon indicates that each element of the sequence corresponds to one of the six amino acid groups.

Figure 2. The relationship tree of 17 species.

Figure 3. The relationship tree of 72 coronavirus spike proteins.

Figure 4. The relationship tree of the coronavirus spike proteins with the 27-D vector.

Table 1. The original numerical values for the properties of the 20 native amino acids.

Table 2. The normalized values for the properties of the 20 native amino acids. The last row in this table gives the average values.

Table 3. The values for properties of the six groups.

Table 5. The accession number, name and abbreviation for 72 coronavirus spike proteins.

Table 6. The results of three different cross-validation tests.

Table 7. Results of the three cross-validation tests with the 27-D vector.