iPVP-MCV: A Multi-Classiﬁer Voting Model for the Accurate Identiﬁcation of Phage Virion Proteins

: The classic structure of a bacteriophage is commonly characterized by complex symmetry. The head of the structure features icosahedral symmetry, whereas the tail features helical symmetry. The phage virion protein (PVP), a type of bacteriophage structural protein, is an essential material of the infectious viral particles and is responsible for multiple biological functions. Accurate identiﬁcation of PVPs is of great signiﬁcance for comprehending the interaction between phages and host bacteria and developing new antimicrobial drugs or antibiotics. However, traditional experimental approaches for identifying PVPs are often time-consuming and laborious. Therefore, the development of computational methods that can efﬁciently and accurately identify PVPs is desired. In this study, we proposed a multi-classiﬁer voting model called iPVP-MCV to enhance the predictive performance of PVPs based on their amino acid sequences. First, three types of evolutionary features were extracted from the position-speciﬁc scoring matrix (PSSM) proﬁles to represent PVPs and non-PVPs. Then, a set of baseline models were trained based on the support vector machine (SVM) algorithm combined with each type of feature descriptors. Finally, the outputs of these baseline models were integrated to construct the proposed method iPVP-MCV by using the majority voting strategy. Our results demonstrated that the proposed iPVP-MCV model was superior to existing methods when performing the rigorous independent dataset test.


Introduction
The bacteriophages, also known informally as phages, are a form of viruses that infect and replicate within bacteria and archaea. Bacteriophages are among the most common and diverse entities on Earth, usually found wherever bacteria exist. Their interactions with microbial communities profoundly influence microbial ecology and biogeochemical cycling in various ecosystems [1]. The classic structure of a bacteriophage is commonly characterized by complex symmetry. The head of the structure features icosahedral symmetry, whereas the tail features helical symmetry. Recently, it has been shown that the abundant bacteriophages existing in the human gut microbiota heavily impact human metabolism and immunity [2], with evident therapeutic implications for some diseases [3]. Bacteriophages are composed of proteins that encapsulate a DNA or RNA genome [4]. They replicate within the bacterium following the injection of the genome into bacterial cytoplasm. Owing to their properties, no toxicity for human cells, harmless to normal flora, and their potential against antibiotic-resistant bacteria, phages are expected to become an alternative to antibiotics [5].
Bacteriophage proteins are fundamental materials of the infectious viral particles and are responsible for important biological functions in the interaction between bacteriophages and host cells. They are divided into two categories: structural proteins, also called phage virion proteins (PVPs), and nonstructural proteins, also called phage non-virion proteins (non-PVPs). PVPs recognize the host, bind to its surface receptors, and deliver the phage's genome into the host's cell, while non-PVPs play crucial roles in the biological process of viral genome replication and expression [6][7][8]. The accurate identification of PVPs will not only contribute to further comprehending the molecular mechanisms of phage genetics but also be helpful in the development of antimicrobial drugs [9]. However, it is often quite expensive and time-consuming to annotate new PVPs based on the current experimental methods, such as mass spectrometry, protein arrays, and electron microscopy [10,11]. Therefore, with the explosive extension of protein sequence data, there is an urgent need to exploit computational methods to identify PVPs.
In a few years of recent, several sequence-based computational methods have been developed to predict PVPs, including PVPred [12], PVP-SVM [13], PhagePred [14], Pred-BVP-Unb [15], PVPred-SCM [16], Meta-iPVP [17], and so on [18][19][20][21][22]. For instance, Feng et al. [19] constructed a benchmark dataset with 99 PVPs and 208 non-PVPs and applied a naïve Bayes (NB) classifier to predict PVPs by using amino acid composition (AAC) and dipeptide composition (DPC). Based on jackknife cross-validation (CV), their method could distinguish the PVPs from the non-PVPs with an accuracy (ACC) of 79.15% [19]. In 2018, Manavalan et al. [13] developed a support vector machine (SVM)-based classifier, namely PVP-SVM, which extracted AAC, DPC, atomic composition (ATC), physicochemical properties (PCP), and chain-transition-distribution (CTD) as preliminary input features and then performed a feature selection protocol to identify the optimal features. PVP-SVM achieved an ACC of 79.80% when tested on the independent dataset [13]. Subsequently, Arif et al. [15] proposed an unbiased predictor called Pred-BVP-Unb for PVPs prediction, which utilized three protein encoding strategies, i.e., composition and translation (CT), split AAC, and bi-profile position-specific scoring matrix (PSSM). In addition, they applied the synthetic minority oversampling technique to reduce the bias of the unbalanced benchmark dataset and employed the recursive feature elimination algorithm to select the optimal feature subset [15]. Pred-BVP-Unb obtained the highest ACC value of 83.06% on the same independent dataset with PVP-SVM [15]. Recently, Charoenkwan et al. [16] developed an interpretable PVPs classifier named PVPred-SCM in a systematic manner based on the scoring card method combined with DPC, with an ACC of 77.66% on the independent dataset. Later, the same research group [17] first established a balanced benchmark dataset which contained 313 PVPs and 313 non-PVPs, and then explored a meta-predictor termed Meta-iPVP to distinguish whether a query protein is a PVP or not. Meta-iPVP adopted seven feature encodings and four machine learning algorithms to construct the final ensemble classifier, which performed well on the independent test. Many other reports about the computational identification of PVPs can be seen in the recent review articles [23][24][25].
As we all know, PSSM profiles, which represent the evolutionary information of protein sequence, have been widely applied and proved to have a remarkable contribution to protein function and attribute predictions [26]. However, it appears that evolutionary features derived from the PSSM profile have rarely been used to identify PVPs. Thus, there is still room for further improvement in PVPs prediction by extracting more effective evolutionary features from the PSSM profile. To this end, we present an ensemble model, called iPVP-MCV, to increase the annotation levels of PVPs by combining PSSM profiles and a multi-classifier voting scheme. The overall framework of the iPVP-MCV method is illustrated in Figure 1. First, three types of feature encodings were utilized to convert protein sequences into fixed-length feature vectors, including PSSM-AAC, PSSM-composition, and DP-PSSM. Second, these features were input into the baseline SVM classifiers to perform the prediction. Finally, iPVP-MCV integrated the outputs of these baseline models to predict whether the query protein was a PVP or not by using the majority voting strategy. Experimental results from the independent dataset test indicated that iPVP-MCV performed better than most of the current existing methods and could serve as a useful tool to help enhance the prediction performance of PVPs. The overall framework of the proposed iPVP-MCV model. First, the query protein sequence was converted into its PSSM profile by running the PSI-BLAST program against the UniRef50 database. Second, three types of evolutionary features were extracted from the PSSM profile to characterize the query protein. Third, these features were input into the baseline SVM classifiers to perform the prediction. Finally, the outputs of three baseline models were integrated to predict whether the query protein was a PVP or not by using the majority voting strategy.

Datasets
A reliable and high-quality benchmark dataset is crucial for training and verifying the proposed prediction model. In this study, the benchmark dataset, which contained the training subset (termed D1_Train) and the independent subset (termed D1_Test), was directly collected from previous work to validate the proposed model [13]. Specifically, the training dataset contained 208 non-PVPs and 99 PVPs, while the independent dataset included 64 non-PVPs and 30 PVPs.
To objectively assess the robustness of the proposed method, another dataset constructed by Charoenkwan et al. [17] was further studied. This dataset first expanded the benchmark dataset and then randomly divided it into two parts: The first one included 250 PVPs and 250 non-PVPs (termed D2_Train), which was applied to train the model and perform the CV; and the other one consisted of 63 PVPs and 63 non-PVPs (termed D2_Test), which was used for the independent test.
The main reasons why we adopted the above two benchmark datasets were as follows: (i) they were extracted from the Universal Protein Resource (UniProt) [27]; (ii) they eliminated the sequences with nonstandard letters, such as "B", "X", or "Z", and (iii) they removed protein sequences with more than 40% similarity to avoid misleading results with the overestimated ACC.

PSSM Profiles
A growing number of studies have found that evolutionary information embedded in the PSSM profiles highly represents the sequence information [28]. The PSSM profile of a query protein with the length of L can be represented as an L × 20 matrix, which was generated by performing the PSI-BLAST search against the UniRef50 [29] database with three iterations and a threshold value of 0.001 [30]. The (i, j)th element in the matrix denotes the log odds of amino acid type j to appear at position i of the protein sequence. To facilitate the further analysis, the elements of the PSSM profile were normalized between 0 and 1 using the following formula: The original and standardized PSSM profiles are respectively denoted as follows:

PSSM-AAC
AAC is a commonly used feature descriptor that computes the fractions of the 20 standard amino acid residues in each protein sequence [31]. The PSSM-AAC features derived from the PSSM profile can be defined as: where Note that the AAC features based on the sequence (termed Seq-AAC) can be calculated by the following formula: where c j is the number of amino acid of type j in the query protein sequence.

PSSM-Composition
PSSM-composition transforms the PSSM profile into a 20 × 20 matrix by summing up all rows for each naturally occurring amino acid type [32]: where R i represents the ith row of the converted matrix, r k denotes the kth row of the normalized PSSM profile, p k denotes the kth amino acid in the original sequence, and a i denotes the ith of 20 standard amino acids. Finally, a 400-dimensional numeric vector was generated by flattening the 20 × 20 matrix.

DP-PSSM
DP-PSSM is a similarity-based protein descriptor that reflects the hidden sequential order information in the real number and can avoid the cancellation of positive and negative terms in the average process [33]. The whole process is described below.
First, the original PSSM profile S was normalized to a matrix T by using the following three equations: Second, a 40-dimensional feature vector T was generated by computing the average of positive and negative terms in each column of the normalized matrix T, as shown in the Formulas (11) and (12): where NP j and NN j represent the numbers of positive and negative terms in the jth column of the matrix T, respectively.
Third, a feature vector G with the dimension of 20 × (α × 2) was obtained by calculating the average of squared differences between entries corresponding to amino acids at the positions i and i + k in each column of matrix T, as shown below: where NDP j and NDN j are the numbers of positive and negative terms of Finally, the DP-PSSM descriptor was defined as a 40 + 40 × α dimensional feature vector by combining the generated T and G .
In this study, the value of the parameter α was set to 4, and we obtained a 200-dimensional numeric vector for each protein sequence. The experimental results of parameter selection are listed in Supplementary Table S1.

Support Vector Machine
Support vector machine (SVM), which is considered one of the most robust prediction algorithms [34], has been successfully used to solve the protein classification problems in bioinformatics [35,36]. Theoretically, SVM maps the training examples into a highdimensional space and then finds an optimal hyperplane that maximized the margin between two classes. New test examples that are mapped into the same high-dimensional space will be predicted based on which side of the hyperplane they fall into. In addition to solving the linear classification problem, SVM can efficiently handle the non-linear problem by using the kernel trick. Due to its better performance, the radial basis kernel function (RBF) was adopted in this work. The effectiveness of the RBF-based SVM depends on the values of the kernel parameter γ and the regularization parameter C. In this study, to avoid overfitting to the testing set, we only performed a grid search method to select the best parameters C ∈ 2 −5 , 2 −3 , . . . , 2 13 , 2 15 and γ ∈ 2 3 , 2 5 , . . . , 2 −13 , 2 −15 of each SVM-based PVPs predictor on the training set by using the nested 10-fold CV.

Performance Evaluation
In this study, the proposed method was rigorously validated based on three widely used validation tests: the 10-fold CV, the jackknife CV, and the independent test. All results were reported by applying the four performance measurements: sensitivity (SN), specificity (SP), accuracy (ACC), and Matthew's correlation coefficient (MCC), formulated as follows: where TP, FP, TN, and FN represent the instances of true positive, false positive, true negative, and false negative, respectively. Moreover, we exploited a receiver operating characteristic (ROC) curve to illustrate the diagnostic ability of the proposed method. The ROC curve was created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. Note that TPR and FPR were equal to SN and 1-SP, respectively. The area under the ROC curve (AUC) was also calculated for the model evaluation. The high value of AUC means that the model has a good performance.

Performance Comparison of Various Classifiers
In this section, five commonly used machine learning algorithms, including SVM, random forest (RF), extreme gradient boosting machine (XGB), extremely randomized trees (ERT), and NB, were adopted to compare their performance for the PVP's prediction by using three different feature encoding methods. All the experiments were performed on the D1_Train dataset by utilizing the jackknife CV, and the results are illustrated in Figures 2 and 3.   As can be seen from Figure 2, the SVM, ERT, RF, and XGB classifiers obtained satisfactory ACC values with three different encoding schemes, while the NB algorithm performed worse in this task. Specifically, the SVM classifier reached the highest ACC values based on the PSSM-AAC and PSSM-composition descriptors and showed comparable performance with the XGB model when combined with the DP-PSSM encoding method. Accordingly, the SVM algorithm was selected as the final predictor in the subsequent analysis due to its robustness. In addition, the models based on the PSSM-AAC and DP-PSSM descriptors performed slightly better than the ones based on the PSSM-composition descriptor. This suggested that these three types of features may provide important clues for the identification of PVPs, and the predictive ability could be further improved by constructing the ensemble model. Furthermore, the ROC curves associated with these models were plotted in Figure 3, which led to similar conclusions as Figure 2.

Performance Comparison between the Base Models and the Ensemble Models
In this section, we assessed the performance of six SVM-based models on the D1_Train dataset by using the jackknife CV, including four base models with different feature encodings (i.e., Seq-AAC, PSSM-AAC, DP-PSSM, and PSSM-composition) and two ensemble models with different integration strategies (one was multi-feature fusion, and the other one, multi-classifier voting). The four metrics, including ACC, sensitivity (SN), specificity (SP), and Matthew's correlation coefficient (MCC), were applied to evaluate the performance of these models, and the corresponding results are reported in Table 1. As shown in Table 1, three PSSM-based models were superior to the one based on the Seq-AAC descriptor in terms of ACC, SN, SP, and MCC, which reconfirmed that the evolutionary information in the form of PSSM profiles may be more informative than the sequence information alone when tested on the PVPs identification. Among three PSSM-based models, the PSSM-AAC model achieved the best ACC value (0.857), the DP-PSSM model obtained the highest SP value (0.913), and the PSSM-composition model could bear comparison with the DP-PSSM model in terms of SN. In addition, we attempted to further improve the prediction performance of PVPs by using two different ensemble methods. The first one constructed an SVM model by fusing three PSSM-based feature spaces, which outperformed the DP-PSSM model and the PSSM-composition model in terms of ACC and MCC but lagged behind the PSSM-AAC model. The second one (i.e., iPVP-MCV) combined the outputs of three individual SVM models into a final prediction by the majority vote, which performed better than the other predictors listed in Table 1 in terms of ACC (0.879), SP (0.928), and MCC (0.720). This suggested that the multi-classifier voting strategy was superior to the multi-feature fusion strategy for this task. The possible cause could be that either redundant or irrelevant features existed in the fused feature space and had an adverse effect on the prediction of PVPs.

Performance Comparison with Existing Methods
To objectively assess the performance of iPVP-MCV, we made fair comparisons with several existing methods on the two benchmark datasets by using the same validation schemes with previous works [13,17]. These methods included Feng et al.'s method [19], PVPred [12], PVP-SVM [13], Tan et al.'s method [21], PVP-SCM [16], and Meta-iPVP [17]. For this experiment, the jackknife CV was first performed on the D1_Train dataset to verify the performance of iPVP-MCV, and then the iPVP-MCV model trained with the D1_Train dataset was examined on the D1_Test dataset. Moreover, we carried out the 10-fold CV on the D2_Train dataset to validate the robustness of iPVP-MCV. The iPVP-MCV model trained with the D2_Train dataset was further evaluated on the D2_Test dataset. The prediction results of our method and several existing models were documented in Tables 2 and 3. Note that the prediction values were collected from the previous studies [15][16][17]. The results of two validation tests on the two training datasets were also listed in Supplementary Table  S2 and the ROC curves of iPVP-MCV classifier on the independent testing datasets were illustrated in Supplementary Figure S1. As shown in Table 2, the proposed iPVP-MCV model, together with the PVP-SVM model, exhibited the best performance with the ACC value higher than 0.87, the SP value better than 0.92, and the MCC value near or greater than 0.70 during the jackknife test on the D1_Train dataset. It is worth mentioning that the PVP-SVM predictor first extracted multiple features, including AAC, ATC, CTD, DPC, and PCP, to train an SVM model and then employed a feature selection protocol to remove the noisy features and enhance the predictive ability. Unlike the PVP-SVM predictor, our method utilized the majority voting strategy to improve the performance of base classifiers. The comparison results on the D1_Test dataset also validated the feasibility and effectiveness of two methods for the prediction of PVPs. In particular, the iPVP-MCV model achieved the highest ACC (0.840), SP (0.922), and MCC (0.621). Additionally, the best SN value was reached by the PVPred-SCM model, which adopted the scoring card method in conjunction with DPC to identify the PVPs.
To further evaluate the robustness of our method, two larger training and testing datasets were adopted to compare the iPVP-MCV method with the recent Meta-iPVP method and other existing tools. From Table 3, our model was slightly superior to the Meta-iPVP model in terms of ACC, SP, and MCC not only on the 10-fold CV but also on the independent test. The SN values of the two models were also level pegging. It should be noted that the Meta-iPVP model combined seven sequence-based feature encodings and four machine learning algorithms to construct a meta-predictor. These experiment results indicated that ensemble learning methods could indeed obtain better predictive performance than those based on the single feature encoding or the individual classifier. Additionally, our method showed substantial improvements for identifying PVPs, which may be attributed to the helpful evolutionary information extracted from the PSSM profile and the efficient multi-classifier voting technique. We anticipated that our method could become a useful tool or at least play a complementary role to existing methods for increasing the annotation levels of PVPs.

Conclusions
The medical and commercial value of PVPs has motivated the development of computational tools that facilitate the accurate annotation of PVPs based on their protein sequences. In this study, we proposed an ensemble model iPVP-MCV to further enhance the predictive accuracy of PVPs by using a multi-classifier voting strategy combined with evolutionary features extracted from the PSSM profile. First, three types of feature descriptors were designed to characterize all the PVPs and non-PVPs from the working datasets, including PSSM-AAC, PSSM-composition, and DP-PSSM. Second, a set of baseline models were trained based on each type of feature descriptor. Third, iPVP-MCV integrated the outputs of these baseline models to perform the prediction of PVPs by using the majority voting strategy. Finally, both the CV and the independent test were adopted to verify the performance of iPVP-MCV on the two benchmark datasets. The comparison results with the existing tools demonstrated that the proposed method was efficient, robust, and promising for the annotation of PVPs and could serve as an alternative tool to identify putative PVPs. The source code and all the datasets are freely available at https://github.com/taigangliu/iPVP-MCV, accessed on 14 August 2021. In the future, we will develop a user-friendly web server for the public use.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.339 0/sym13081506/s1, Table S1: The performance of DP-PSSM with different α on the D1_Train dataset using the 10-fold CV, Table S2: Prediction comparison on the D1_Train and D2_Train datasets by using the 10-fold CV and the jackknife CV, Figure S1: The ROC curves of iPVP-MCV on the independent testing datasets.

Data Availability Statement:
The data used to support the findings of this study are freely available to the academic community at https://github.com/taigangliu/iPVP-MCV, accessed on 14 August 2021.