Identifying Phage Virion Proteins by Using Two-Step Feature Selection Methods

Accurate identification of phage virion protein is not only a key step for understanding the function of the phage virion protein but also helpful for further understanding the lysis mechanism of the bacterial cell. Since traditional experimental methods are time-consuming and costly for identifying phage virion proteins, it is extremely urgent to apply machine learning methods to accurately and efficiently identify phage virion proteins. In this work, a support vector machine (SVM) based method was proposed by mixing multiple sets of optimal g-gap dipeptide compositions. The analysis of variance (ANOVA) and the minimal-redundancy-maximal-relevance (mRMR) with an increment feature selection (IFS) were applied to single out the optimal feature set. In the five-fold cross-validation test, the proposed method achieved an overall accuracy of 87.95%. We believe that the proposed method will become an efficient and powerful method for scientists concerning phage virion proteins.


Introduction
The bacteriophage, which is also known as the phage, is a kind of virus that infects bacteria and can complete growth and reproduction in bacteria [1]. The bacteriophage consists of an outer protein coat and a single genetic material (DNA or RNA) inside [2] and is the most widely distributed group of the virus, which is usually found in places full of bacterial communities such as soil and animals' intestine.
Proteins are the major components of viruses including structural proteins (namely, phage virion proteins) and non-structural proteins (namely, phage non-virion proteins). Non-structural proteins are proteins encoded by viral genes that function in the process of viral gene expression, but they don't bind to viral particles. Structural proteins are the necessary proteins to form mature and infectious virus particles including shell proteins, envelope proteins, and virus particle enzymes, etc. Among them, the shell proteins are the proteins that constitute the structure of the virus capsid and their main function is protecting the viral nucleic acid, participating in the adsorption and invasion of bacteriophages, and more. Envelope proteins are the proteins that constitute the viral envelope structure and the main function is to act as a viral surface antigen, which maintains the viral structure, participates in virus budding, and more. Due to the clear differences in the function of structural proteins and non-structural proteins, the correct identification of them will be helpful to further understand the molecular mechanisms of bacteriophage genetics and the development of antimicrobial drugs.
The traditional method for the identification of the phage virion and non-virion proteins is mass spectrometry (MS) [3]. However, it has not kept pace with the explosive growth of protein sequences in the post-genome era. Therefore, it is necessary to adopt machine learning methods to identify phage virion proteins. In 2013, Feng et al. proposed a Naïve Bayes classifier to identify phage virion proteins [4]. Afterward, Ding et al. developed an SVM-based method to identify phage virion proteins in which the proteins were encoded using the optimal features obtained by using the ANOVA feature selection technique [5]. In 2015, Zhang et al. introduced an ensemble method for predicting phage virion proteins from phage protein sequences by combining the CTD, bi-profile Bayes, PseAAC, and PSSM [6]. Subsequently, Manavalan et al. proposed a method called PVP-SVM, which adopted the SVM classifier with multiple feature extraction methods [7]. More recently, Fan et al. proposed a novel method called PhagePred to predict phage virion proteins by using the Multinomial Naïve Bayes classifier combined with the g-gap features tree [8]. However, these current feature extraction methods do not describe the protein sequences completely. Therefore, it is necessary to apply new feature extraction and selection methods to investigate the identification of phage virion proteins further.
In this paper, by using a new feature extraction method and two feature selection methods (ANOVA and mRMR) to select optimal features, we proposed an SVM-based method to identify phage virion proteins. As shown in Figure 1, the rest of the paper is organized based on the following aspects: (1) collection of raw data sets and processing of raw data sets, (2) feature extraction, (3) feature selection, (4) classifier algorithm, and (5) performance evaluation.

Benchmark Dataset
The raw positive and negative datasets adopted in this research were extracted from the Universal Protein Resource (UniProt) [9]. In order to obtain a high-quality and reliable benchmark dataset, the following steps were performed. First, only the phage virion proteins and phage non-virion proteins that have been experimentally confirmed can be included. Second, the protein sequences that are fragments of other proteins were excluded. Third, protein sequences containing nonstandard letters such as 'B,' 'U,' 'X,' or 'Z' were eliminated because their meanings are ambiguous. By following these three rigorous screening processes, a total of 121 phage virion protein sequences and 231 phage non-virion protein sequences were obtained. In order to obtain a high quality benchmark dataset, the CD-HIT [10] program was used by setting the cutoff threshold of the protein sequence identity to 40%. Lastly, we obtained 99 phage virion protein sequences and 208 phage non-virion protein sequences.
It is necessary to evaluate the proposed model by using an independent dataset to check whether the prediction model has a generalization capability. In this study, the independent dataset constructed by Manavalan et al. [7] was used, which can be downloaded from http://www.thegleelab.org/PVP-SVM/SVM-PVPData.html. It is a reliable independent dataset containing 30 phage virion protein sequences and 64 phage non-virion protein sequences.

The g-Gap Dipeptide Composition
After the benchmark dataset was built, we needed to describe the protein sequences using a computer-readable form. Proteins are formed with the use of 20 amino acids, according to a certain order and space structure. The most common and simplest method is to formulate the sample protein P with L residues with its entire amino acid sequence, which is shown below.
where A 1 represents the first amino acid residue of the sample protein P, A 2 represents the second amino acid residue of the sample protein P, and so forth. Another straightforward method to formulate the protein sequence is amino acid composition (AAC). In order to obtain the sequence-related information, the AAC is replaced by the adjacent dipeptide composition to represent the protein sequences [4]. However, the adjacent dipeptide composition can only express the correlation between two adjacent amino acid residues. In fact, in three-dimensional space, two amino acids with g-gap residues may be adjacent. In order to find the important correlation in protein sequences, we applied the g-gap dipeptide composition, which is extended from the adjacent dipeptides [8,11,12] and is a kind of mode of PseAAC [13][14][15]. By adopting this method, a sample protein P can be formulated by using the equation below.
where the symbol T represents the transposition of the vector while the v g i represents the frequency of the i-th (i = 1, 2, . . . , 400) g-gap dipeptide and can be formulated by using the equation below.
where n g i represents the number of the i-th g-gap dipeptide, L represents the length of the protein P, g represents the number of amino acid residues separated by two amino acid residues, g = 1 indicates the correlation between two amino acid residues with the interval of one residue, g = 2 indicates the correlation between two amino acid residues with the interval of two residues, and so forth.

The Analysis of Variance (ANOVA)
In general, if a model was built on a low-dimensional feature subset, the robustness of the model will be excellent. However, the low dimensionality of the feature subset will not provide enough information, which results in a poor performance of the model. On the contrary, a high-dimensional features subset can lead to information redundancy and overfitting problems. Both of these problems will lead to low accuracy of the cross-validation prediction and a poor predictor generalization ability. In order to overcome these shortcomings, the best way is to pick out the best feature subset, but it is time-consuming to investigate the performance of all feature subsets by the computer [16][17][18][19]. For example, if the amino acid composition contains a 400-dimension feature vector, the number of all possible combinations for the 400-D vector is C 1 400 + C 2 400 + C 3 400 + . . . + C 399 400 + C 400 400 = 2.58 × 10 120 . Therefore, the analysis of variance (ANOVA) method with the incremental feature selection (IFS) [20][21][22] process was applied to investigate the optimal feature set with the maximum accuracy.
The ANOVA method can score each feature according to a unified standard and then reduce the features according to their contribution. This will not only save the calculation time but will also improve the model's performance. According to the principle of ANOVA, the score (F) of the i-th g-gap dipeptide can be formulated by using the formula below.
where S 2 B (i) represents the variance between groups (MSB) of i-th feature in the sample and S 2 W (i) represents the variance within groups (MSW) of i-th feature in the sample, which are calculated by using the equations below.
where d f B = K − 1 and d f W = M − K represent the degree of freedom for MSB and MSW, respectively. K and M represent the number of group (here K = 2) and the number of samples (here M = 307), respectively. SS B (i) and SS W (i) represent the sum of MSB and MSW, respectively, and are calculated by using the formula below.
where f g i (s, j) represents the frequency of the i-th g-gap dipeptide of the j-th sample in the s-th group. A feature with a high F(i)-value means that its ability to identify the sample is excellent, which is more conducive to building a highly robust model. Therefore, we ranked all features according to their F(i)-values from high to low and obtained new feature vectors, which are shown below.
By using the ANOVA method, we have a clear understanding of each feature's capabilities for the model. Therefore, we don't have to exhaust all the feature subsets but instead selectively construct feature subsets according to their F(i)-values in this paper. The first feature subset contains the feature with the highest F(i)-value, P g = v 1,g . The second feature subset adds the second highest F(i)-value to the first subset, P g = v 1,g , v 2,g . The third feature subset adds the third highest F(i)-value to the second subset, P g = v 1,g , v 2,g , v 3,g . The procedure was repeated until the accuracy of the model no longer increased.

Minimal-Redundancy-Maximal-Relevance (mRMR)
The combination of some of the best-performing features does not mean that the best predictive effect can be achieved. The main reason for this phenomenon is that these features are likely to have a high degree of correlation, which leads to more redundant information in the feature vector. To solve this problem, Peng et al. proposed the mRMR algorithm [23]. MRMD [24] is another tool similar to mRMR. The main idea of the algorithm is to filter out some of the most relevant features in the subset to achieve the goal of minimizing information redundancy and then obtain the most 'concise' subset of features in theory. Therefore, when using the mRMR-ranked feature benchmark dataset with a smaller dimension, it can still effectively represent a dataset with a larger dimension, which can ensure that the feature dimension and the time of the training model are greatly reduced with almost no loss of effective information.
The mRMR algorithm is often used to select discretization features and continuity features. Based on the issues involved in this paper, the following is a description about discretization features. Given two random discrete variables x and y, the mutual information I(x, y) between them can be calculated by using the formula below.
where the mutual information I(x, y) is a measure of the degree of correlation between two random variables x and y and p(x), p(y), p(x, y) denote the probabilistic density functions, respectively. The metrics of the mRMR algorithm are Max-Relevance and Min-Redundancy and they can be described by the equations below.
1. Max-Relevance: 2. Min-Redundancy: where x i represents the i-th feature attribute, c represents the category variable, S represents the feature subset, and |S| represents the size of the feature subset. Furthermore, if the two metrics are considered equally important, we can abbreviate it by using the formula below.

Support Vector Machine (SVM)
The support vector machine (SVM) is a widely used binary classification model, which has been widely used in bioinformatics [25][26][27][28][29][30][31][32][33][34][35][36][37][38]. It is a supervised machine learning method and its main idea is to map the input features from low-dimensional space to a high-dimensional space through nonlinear transformation and then find the optimal linear classification surface in this high-dimensional space. For convenience, SVM software packages LibSVM can be download from https://www.csie.ntu.edu.tw/~cjlin/libsvm/. In the current study, the LibSVM package was adopted to investigate the performance of identifying the phage virion proteins. Since it has been widely adopted in bioinformatics, the radical basis function kernel was selected to perform predictions.

Performance Evaluation
Three cross-validation methods, which are called the independent dataset test, the sub-sampling test, and the jackknife test, are widely used to evaluate the predictive ability of a predictor in practical application [4,28,[39][40][41][42][43][44][45][46][47][48][49]. Among these three methods, the jackknife test was considered to be the most rigorous one that can get a unique outcome in statistical prediction. This test has been widely used by investigators to assess the performance of the predictor [4,37,46,[50][51][52][53][54]. In this paper, in order to save computational time, the five-fold cross-validation method was used to tune the parameters C and g in the SVM.
In this paper, we adopted five evaluation indexes to evaluate the model. Sensitivity (S n ) is used to evaluate the model's ability to predict positive samples. Specificity (S p ) is used to evaluate the model's ability to predict negative samples. Overall Accuracy (Acc) reflects the proportion of the entire benchmark dataset that can be correctly predicted. The Matthew correlation coefficient (Mcc) is an indicator used to reflect the reliability of the algorithm. Its value is between −1 and 1 and the high value of Mcc indicates that the model has a good prediction performance. The four indexes are defined below.
where TP, TN, FP, and FN represent the number of the correctly recognized phage virion proteins, the number of the correctly recognized phage non-virion proteins, the number of phage non-virion proteins recognized as phage virion proteins, and the number of phage virion proteins recognized as phage non-virion proteins, respectively.
The ROC (receiver operating characteristic) curve is a more intuitive way to demonstrate the performance of the model. Therefore, we plotted the ROC and calculated the area under the ROC curve (auROC). The high value of auROC indicates that the model has a good classification ability and deserves our trust.

Prediction of Phage Virion Proteins Based on Single Kind of g-Gap Dipeptides
The ANOVA method with the IFS process was applied to investigate the optimal feature set with the maximum accuracy. The details of the optimization process can be referred to Section 2.3. We chose residue parameter g from 0 to 9 to estimate the performance for all the 10 × 400 = 4000 feature subsets using SVM until all the 4000 Accs (overall accuracies) were calculated. In addition, we plotted 10 curves by setting the overall Acc as an ordinate and the number of features as abscissa shown in Figure 2. Additionally, the highest predictive accuracies for g-gap dipeptides are shown in Table 1.

Prediction of Phage Virion Proteins Based on Fusing Features
For different g (from 0 to 9) values, each optimum subset represents the best characterization of proteins at different levels. If the 10 best feature subsets were fused together, each protein in the benchmark dataset can be encoded using a 107 + 213 + 135 + 87 + 42 + 89 + 70 + 174 + 255 + 94 = 1266dimensional feature vector, which would be more comprehensive in representing protein sequences. We investigated the performance of identifying phage virion proteins based on the 1266 features. An Acc of 85.02% was achieved by adopting SVM in the five-fold cross-validation method. It is worthy of further investigation because the prediction performance was still far from satisfactory.
However, there exists noise in such a set of features. Therefore, we used the mRMR to score the 1266 features and ranked the features according to their scores. The details of the optimization process is outlined in Section 2.4. In order to show the contributions of each feature to the prediction, we made a heat map for the 1266 features based on their Z-scores, which is shown in Figure 3. The scores can be calculated with the help of the formula below.
where represents mRMR score for the i-th feature s, represents the overall average of 1266 mRMR scores, and represents the standard deviation of 1266 mRMR scores.

Prediction of Phage Virion Proteins Based on Fusing Features
For different g (from 0 to 9) values, each optimum subset represents the best characterization of proteins at different levels. If the 10 best feature subsets were fused together, each protein in the benchmark dataset can be encoded using a 107 + 213 + 135 + 87 + 42 + 89 + 70 + 174 + 255 + 94 = 1266-dimensional feature vector, which would be more comprehensive in representing protein sequences. We investigated the performance of identifying phage virion proteins based on the 1266 features. An Acc of 85.02% was achieved by adopting SVM in the five-fold cross-validation method. It is worthy of further investigation because the prediction performance was still far from satisfactory.
However, there exists noise in such a set of features. Therefore, we used the mRMR to score the 1266 features and ranked the features according to their scores. The details of the optimization process is outlined in Section 2.4. In order to show the contributions of each feature to the prediction, we made a heat map for the 1266 features based on their Z-scores, which is shown in Figure 3. The scores can be calculated with the help of the formula below.
where x i represents mRMR score for the i-th features, µ represents the overall average of 1266 mRMR scores, and σ represents the standard deviation of 1266 mRMR scores. By using the IFS process to investigate the performance for all the subsets, the Acc reached its peak (87.95%) when the top ranked 368 dipeptides were used to build the model (Figure 4). In this case, the , , and are 83.84%, 89.90%, and 0.761%, respectively, with the auROC at 0.915 ( Figure 5). This result indicates that the performance of the proposed model is smart and reliable for identifying phage virion proteins. By using the IFS process to investigate the performance for all the subsets, the Acc reached its peak (87.95%) when the top ranked 368 dipeptides were used to build the model (Figure 4). In this case, the S n , S p , and Mcc are 83.84%, 89.90%, and 0.761%, respectively, with the auROC at 0.915 ( Figure 5). This result indicates that the performance of the proposed model is smart and reliable for identifying phage virion proteins. By using the IFS process to investigate the performance for all the subsets, the Acc reached its peak (87.95%) when the top ranked 368 dipeptides were used to build the model (Figure 4). In this case, the , , and are 83.84%, 89.90%, and 0.761%, respectively, with the auROC at 0.915 ( Figure 5). This result indicates that the performance of the proposed model is smart and reliable for identifying phage virion proteins.

Comparision with Other Published Methods
It is necessary to compare the methods used in this article with other published methods. Table  2 shows the detailed predictive results from the published papers. Based on the same benchmark dataset, Feng et al. proposed a Naïve Bayes-based method to predict bacteriophage virion proteins by using the amino acid composition and the dipeptide composition and obtained an overall accuracy of 79.15% [4]. Ding et al. adopted the ANOVA feature selection method to select optimal 1-gap dipeptides and obtained an overall accuracy of 85.02% [5]. Manavalan et al. proposed a novel method

Comparision with Other Published Methods
It is necessary to compare the methods used in this article with other published methods. Table 2 shows the detailed predictive results from the published papers. Based on the same benchmark dataset, Feng et al. proposed a Naïve Bayes-based method to predict bacteriophage virion proteins by using the amino acid composition and the dipeptide composition and obtained an overall accuracy of 79.15% [4]. Ding et al. adopted the ANOVA feature selection method to select optimal 1-gap dipeptides and obtained an overall accuracy of 85.02% [5]. Manavalan et al. proposed a novel method called 'PVP-SVM' in which the AAC, ATC, CTD, DPC, and PCP were used to represent the protein sequences and got an overall accuracy of 87.0% in the jackknife test [7]. Pan et al. adopted a Multinomial Naïve Bayes classifier based on the discrete features obtained from the g-gap feature tree. They achieved a superb overall accuracy of 98.37% in the 10-fold cross-validation [8]. In this work, we achieved an overall accuracy of 87.95%, which is the second-best prediction result until now. Compared to the results with reference [4,5,7] except for the value of S p , which is slightly lower than that of Manavalan's [7], the values of S n , Acc, Mcc and auROC are significantly higher than published results. These comparisons demonstrate a better performance of our proposed methods.

Performance Evaluation Using an Independent Dataset
In general, the best model for the training dataset is not the optimal model for the independent dataset. Therefore, we repeated the feature selection procession, retrained the model, and validated the model on an independent dataset. The results for different classifiers were listed in Table 3. As indicated in Table 3, among the three compared methods, our method obtained the highest S n while the S p , Acc, and MCC is similar to the PVP-SVM, which obtained the best predictive results on the independent. This indicates that our method can play complementary roles to existing methods for identifying phage virion proteins.

Conclusions
In this work, we investigated the accuracies of different features for identifying phage virion proteins. The maximum overall accuracy (87.95%) was obtained by fusing 10 optimal g-gap (0 to 9) dipeptide compositions, which was obtained by fusing ANOVA and mRMR feature selection methods. Compared with the existing methods, the proposed model improved the overall accuracy. Therefore, the method can be used as a reliable tool for accurately predicting phage virion proteins. In our study, there are many problems worth investigating for phage virion protein prediction. For example, in order to build a high quality dataset, greater attention should be paid to the dynamic changes of the database. Second, the biological meanings of the selected optimal features also need to be clarified.
Third, considering the promising performance of the ensemble classification methods [55] and the deep learning technique [56][57][58] in bioinformatics, we will integrate multiple classification algorithms to build the model for identifying phage virion proteins. User-friendly and publicly accessible web tools including predictors [25,27,50,59,60] or databases [61][62][63][64] represent the future direction for developing a more useful bioinformatics method. In the future, we will establish a powerful tool for phage virion protein prediction. The feature selection strategy can be extended to other fields.

Conflicts of Interest:
The authors declare that there is no conflict of interest.