Highly Accurate Prediction of Protein-Protein Interactions via Incorporating Evolutionary Information and Physicochemical Characteristics

Protein-protein interactions (PPIs) occur at almost all levels of cell function and play crucial roles in various cellular processes. Thus, identification of PPIs is critical for deciphering molecular mechanisms and providing insight into biological processes. Although a variety of high-throughput experimental techniques have been developed to identify PPIs, the PPI pairs identified experimentally cover only a small fraction of the whole PPI network; moreover, those approaches have inherent disadvantages: they are time-consuming and expensive and have a high false positive rate. Therefore, there is a pressing need for automatic in silico approaches that predict PPIs efficiently and accurately. In this article, we propose a novel feature extraction method that mixes physicochemical and evolutionary information for predicting PPIs using our newly developed discriminative vector machine (DVM) classifier. The improvements of the proposed method consist mainly in an effective feature extraction method that captures discriminative features from evolutionary information and physicochemical characteristics, followed by a powerful and robust DVM classifier. To the best of our knowledge, this is the first time that the DVM model has been applied to the field of bioinformatics. When applying the proposed method to the Yeast and Helicobacter pylori (H. pylori) datasets, we obtain prediction accuracies of 94.35% and 90.61%, respectively. The computational results indicate that our method is effective and robust for predicting PPIs and can serve as a useful supplementary tool to traditional experimental methods in future proteomics research.


Introduction
Proteins are the building blocks of any living organism. Protein-protein interactions (PPIs) occur at almost all levels of cell function in organisms [1]. Identification of PPIs is essential for deciphering molecular mechanisms and providing insight into various biological processes [2][3][4]. The analysis of disease-related PPIs can speed up new drug development and therapeutic breakthroughs [5]. Recently, a variety of high-throughput experimental technologies, such as two-hybrid-based screens [6,7], protein chips [8] and spectrometric protein complex identification [9], have been proposed for large-scale PPI detection. However, these experimental techniques suffer from inherent disadvantages: they are time-consuming and expensive, and they have very low coverage and a high false positive rate [10,11]. Therefore, it is highly desirable to develop efficient and accurate computational approaches to facilitate the prediction of novel PPIs [12].
In general, computational approaches for PPI detection contain two critical steps: feature extraction and classification [13,14]. Feature extraction is the foundation of the overall prediction process. If the extracted features are highly discriminative, they help the subsequent steps and significantly improve the success rate of PPI prediction. Numerous feature extraction approaches have been proposed to improve the performance of PPI prediction. For example, Shen et al. developed a conjoint triad feature extraction method using only protein sequence information for predicting PPIs and PPI networks [15]. Guo et al. adopted the auto covariance of protein sequences to construct feature vectors and obtained promising prediction results [2]. Zhou et al. employed local descriptors to capture continuous and discontinuous binding patterns of protein sequences [16]. In addition, evolutionary features of protein sequences have also been widely used in PPI prediction. Zahiri et al. extracted evolutionary features based on the position-specific scoring matrix (PSSM) of protein sequences [17]. Jia et al. incorporated seven physicochemical properties and the wavelet transform to detect interactions between proteins [4]. Although the aforementioned techniques have been demonstrated to be successful in PPI analysis, they utilized only partial information of protein sequences (sequential information, evolutionary information, or physicochemical characteristics). Considering that the fusion of several kinds of information may reveal implicit correlations in protein sequences and provide more discriminative information, we select four representative physicochemical characteristics and integrate them with PSSM-based evolutionary information of protein sequences to improve the prediction performance of PPIs.
Besides feature extraction, the subsequent classification step is also critical. Many machine learning techniques have been employed for classification, such as support vector machine (SVM) [2,16,18], artificial neural network (ANN) [19,20], relevance vector machine (RVM) [21,22], collaborative filtering (CF) [23], weighted sparse representation [1,24] and ensemble classifiers [4,25,26]. In this work, our newly developed discriminative vector machine (DVM) [27,28] classifier is used. To the best of our knowledge, this is the first time that the DVM model has been applied to the field of bioinformatics. More specifically, we first use the position-specific scoring matrix (PSSM) to represent each protein sequence and calculate the corresponding PSSM probabilities. Second, each probabilistic residue product is calculated. Third, the autocorrelation coefficients are calculated and the final 160-dimensional vector for each protein sequence is constructed accordingly. The proposed method is evaluated on two different PPI datasets: Yeast and Helicobacter pylori (H. pylori). The computational results show that our method yields good prediction accuracy. To further validate the performance of our method, it is compared with the state-of-the-art SVM classifier. The results demonstrate that the proposed method is superior to SVM in prediction performance. Finally, the proposed method is compared with other previous works.

Performance of the Proposed Method on Yeast and Helicobacter pylori (H. pylori) Datasets
In this step, to minimize data dependence and avoid over-fitting of the prediction model, fivefold cross-validation was adopted. As described in the Materials and Methods section, the final Yeast dataset contains 11,188 protein pairs, half from the negative dataset and half from the positive dataset.
Here, four-fifths of the protein pairs (8950 protein pairs), drawn from the negative and positive datasets, were randomly chosen to train the prediction model, and the remaining one-fifth (2238 protein pairs) was used for testing. To validate the robustness of the proposed approach, the random selection of training set and test set was repeated five times, yielding five training sets and five test sets. Therefore, five prediction models on the Yeast dataset were generated accordingly.
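The split described above can be sketched in a few lines of Python. This is only an illustration under the assumption of a standard stratified partition into five disjoint folds; the exact sampling procedure used in the experiments is not specified in the text, and the function name `stratified_five_fold` and the seed are hypothetical.

```python
import numpy as np

def stratified_five_fold(labels, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class split evenly across folds."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Shuffle the indices of each class separately, then cut them into n_folds chunks.
    per_class_chunks = []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        per_class_chunks.append(np.array_split(idx, n_folds))
    for fold in range(n_folds):
        # The test set of a fold takes one chunk from every class.
        test_idx = np.concatenate([chunks[fold] for chunks in per_class_chunks])
        train_idx = np.setdiff1d(np.arange(labels.size), test_idx)
        yield train_idx, test_idx

# Toy labels with the Yeast class sizes: 5594 negative (0) and 5594 positive (1) pairs.
labels = np.array([0] * 5594 + [1] * 5594)
folds = list(stratified_five_fold(labels))
fold_sizes = [(len(train), len(test)) for train, test in folds]
```

Because 5594 is not divisible by five, the folds cannot all have identical sizes; in this sketch four folds hold 2238 test pairs and one holds 2236.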
The H. pylori dataset was processed in the same way as the Yeast dataset. To facilitate comparison between experiments, the four physicochemical properties of protein sequences and the parameters of the DVM predictor were kept the same for the Yeast and H. pylori datasets. The RBF function was chosen as the kernel function. The results of the proposed method on the Yeast and H. pylori datasets are shown in Tables 1 and 2.
When applying the proposed approach to the Yeast dataset, we obtained an average accuracy (Acc), sensitivity (Sen), precision (Pre), and Matthews's correlation coefficient (MCC) of 94.35%, 92.97%, 96.52%, and 89.07%, respectively. The corresponding standard deviations were 0.68%, 0.65%, 1.17%, and 1.56%. Similarly, the average accuracy, sensitivity, precision, and MCC on the H. pylori dataset reached 90.61%, 91.32%, 90.74%, and 82.79%, with standard deviations of 1.55%, 1.48%, 1.81%, and 1.47%, respectively. The results in Tables 1 and 2 indicate that the DVM-based prediction model combining the four physicochemical properties with PSSM evolutionary information is accurate, effective and robust for the prediction of PPIs. The likely reasons for this performance are the highly discriminative hybrid features and the choice of the powerful DVM classifier. The proposed feature extraction method is novel and effective. As a representation of a protein sequence, the PSSM not only retains the probability of any given amino acid occurring at a particular position in the sequence but also holds prior evolutionary information. In addition to the PSSM, we extracted four selected physicochemical attributes, which also retain highly discriminatory information. Incorporating evolutionary information and physicochemical characteristics thus yields highly discriminatory features.

Comparison with SVM-Based Method
To further evaluate the performance of the proposed method, we also constructed a state-of-the-art support vector machine (SVM) classifier. Here, we used the LIBSVM toolbox [29] as the SVM classifier to predict PPIs. To be fair, the two prediction models used the same hybrid features extracted from the Yeast dataset. A general grid search scheme was employed to optimize LIBSVM's two parameters (regularization parameter C and kernel width parameter γ), which were tuned to 0.7 and 0.3, respectively. The Gaussian function was chosen as the kernel function. For both the DVM and SVM classifiers, all input vectors were normalized to the range [−1, 1].
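The normalization to [−1, 1] mentioned above can be illustrated with a per-feature linear rescaling. The sketch below assumes that the minimum and maximum of each feature are estimated on the training data and then reused for the test data; the text does not say whether scaling was fitted per fold or on the whole dataset, and the helper names `fit_minmax` and `scale_to_pm1` are hypothetical.

```python
import numpy as np

def fit_minmax(train):
    """Return the per-feature minimum and maximum of the training matrix."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def scale_to_pm1(X, lo, hi):
    """Linearly map each feature from [lo, hi] to [-1, 1]; constant features map to 0."""
    X = np.asarray(X, dtype=float)
    # Avoid division by zero for features that never vary in the training data.
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = 2.0 * (X - lo) / span - 1.0
    return np.where(hi > lo, scaled, 0.0)

# Tiny example: the second feature is constant and is therefore mapped to 0.
train = np.array([[0.0, 5.0], [10.0, 5.0], [5.0, 5.0]])
lo, hi = fit_minmax(train)
scaled = scale_to_pm1(train, lo, hi)
```

Test vectors scaled with training statistics may fall slightly outside [−1, 1]; whether to clip them is a design choice that the text leaves open.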
The final prediction results of the two methods are given in Table 3 and the corresponding receiver operating characteristic (ROC) curves are shown in Figure 1. From Table 3, the average prediction accuracy, sensitivity, precision and MCC of the SVM method were 85.77%, 85.38%, 86.46%, and 75.65%, respectively. The corresponding values for DVM were 94.35%, 92.97%, 96.52%, and 89.07%, which indicates that our method is significantly better than SVM for predicting PPIs. Furthermore, as shown in Figure 1, the ROC curve of the DVM-based prediction model is superior to that of the SVM-based classifier, which suggests that the proposed method is more effective and robust. There are two possible explanations for these results. (1) Based on k nearest neighbors (kNNs), the robust M-estimator and manifold regularization, DVM reduces the effect of outliers and overcomes the requirement that the kernel function satisfy Mercer's condition. (2) Although there are three parameters (δ, γ, and θ) in the DVM model, those parameters affect the performance of DVM only slightly if they are set within appropriate ranges. Therefore, the DVM-based model is more suitable for PPI prediction than the SVM-based method.

Comparison with Other Methods
So far, numerous classification methods for predicting PPIs have been developed. To further validate the advantage of our approach, we compared its predictive performance with that of other existing methods. The fivefold cross-validation results of the different methods on the Yeast and H. pylori datasets are shown in Tables 4 and 5. In Table 4, the prediction accuracy of previous methods on the Yeast dataset varies from 75.08% to 93.92%, while the proposed method achieves a higher value of 94.35%. Similarly, the sensitivity, precision and MCC of our method are also higher than those of the other methods. Moreover, the corresponding standard deviations demonstrate that the proposed method is stable and robust. Although the RF + PR-LPQ method has smaller standard deviations, an ensemble classifier usually predicts better than a single classifier, so our method can still be considered one of the most competitive computational methods for predicting PPIs. Similar results on the H. pylori dataset can be found in Table 5. The highest prediction accuracy among the six other methods is 89.47%, which is lower than the 90.61% of the proposed method. The same is true for precision, sensitivity and MCC. All prediction results in Tables 4 and 5 indicate that the DVM classifier incorporating evolutionary information and physicochemical characteristics improves prediction performance compared with state-of-the-art methods. The high prediction performance of our method may be attributed to the novel feature extraction method, which extracts highly discriminative information, and to the DVM classifier, which has been demonstrated to be robust and powerful [27].

Dataset
In this work, we evaluate the proposed method on two high-confidence benchmark PPI datasets, Yeast and H. pylori, which are gathered from the publicly available Database of Interacting Proteins (DIP), version DIP_20070219 [34]. Protein pairs in the datasets with fewer than 50 residues are excluded because they might be fragments. All protein pairs are aligned using a multiple sequence alignment tool, cd-hit [35]. Protein pairs with high sequence identity are generally considered homologous, so pairs having ≥40% sequence identity are also removed. After this preprocessing, each dataset is divided into two subsets: a negative dataset (non-interacting pairs) and a positive dataset (interacting pairs). In the Yeast dataset, we select 5594 negative protein pairs as the negative dataset and 5594 positive protein pairs as the positive dataset. In the same way, 1458 negative protein pairs and 1458 positive protein pairs are selected from the H. pylori dataset to form its negative and positive datasets. Therefore, the Yeast dataset consists of 11,188 protein pairs and the H. pylori dataset includes 2916 protein pairs.

Feature Extraction
In this work, we aim to demonstrate that the prediction performance of PPIs can be improved by incorporating the physicochemical properties and evolutionary information of amino acids. Although Taguchi and Gromiha held the view that physicochemical-based features do not carry important discriminative information [36], we believe that combining physicochemical properties with evolutionary information can provide highly discriminatory features for PPI prediction. However, there are more than 544 physicochemical characteristics [37,38]. Fortunately, according to Gaurav Raicar et al., not all physicochemical properties play the same role in predicting PPIs [39]. Gaurav Raicar ranked the physicochemical characteristics by their frequency counts over all the datasets and identified a subset of them. Here, through extensive experiments, four physicochemical characteristics are selected for the calculations: hydrophobicity (H), polarity (P), polarizability (Z), and van der Waals volume (V). The numerical indices of the four physicochemical characteristics for the 20 amino acids are shown in Table 6. Since protein sequences differ in length, the physicochemical characteristics and evolutionary information cannot be merged directly. Based on pseudo amino acid composition (PseAAC) [40,41], we propose a novel feature extraction method which integrates position-specific scoring matrix (PSSM) probabilities with the four physicochemical properties.
PSSM is a representation of a protein sequence which defines the probability of any given amino acid occurring at a particular position in the sequence and carries the evolutionary information of the protein sequence [39]. In this work, we adopt the position-specific iterated BLAST (PSI-BLAST) tool to create PSSMs for all protein sequences of the Yeast and H. pylori datasets, using three iterations and an E-value cutoff of 0.001 for the query protein sequence against multiple sequence alignment [10,42]. The PSSM $P$ of a query protein sequence is an $L \times 20$ matrix ($P = (P_{ij})$, $i = 1, 2, \ldots, L$, $j = 1, 2, \ldots, 20$), where $L$ is the length of the protein sequence and 20 denotes the 20 native amino acids. $P_{ij}$ is the score for the $j$th amino acid in the $i$th position of the given protein sequence [13]. The residue index $R_m$ for the $m$th physicochemical property is a $20 \times 1$ column vector (as described in Table 6). Therefore, the probabilistic expression $F_m$ ($m = 1, 2, \ldots, 4$) of the residues for the $m$th physicochemical property can be defined as

$$F_m = P R_m \quad (1)$$

where $F_m$ is a vector of size $L \times 1$. It should be pointed out that the order of the amino acids in matrix $P$ and vector $R_m$ must remain consistent. Then the hybrid features based on physicochemical characteristics and evolutionary information are calculated as autocorrelation coefficients of the probabilistic expressions ($F_m$) of the protein sequence. The calculation is given by Equation (2), where $F_m^j$ is the $j$th probabilistic residue of $F_m$ for the $m$th physicochemical property in a protein sequence and $\mu$ is the average value of all $F_m^j$ ($j = 1, 2, \ldots, L$). In this work, we use lags $i = 1, 2, \ldots, 40$, thus producing 40 autocorrelation coefficient features for the $m$th physicochemical property. Therefore, each protein sequence is converted to a $4 \times 40 = 160$ dimensional feature vector.
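To make the construction of the 160-dimensional vector concrete, the following Python sketch computes the probabilistic expressions and their autocorrelation coefficients. It rests on two assumptions that should be checked against the original equations: that $F_m$ is the matrix-vector product $P R_m$ implied by the stated dimensions, and that the lag-$i$ coefficient is the mean-centered product $\frac{1}{L-i}\sum_{j=1}^{L-i}(F_m^j-\mu)(F_m^{j+i}-\mu)$. The function name `hybrid_features` and the random inputs are hypothetical placeholders, not the values of Table 6 or a real PSSM.

```python
import numpy as np

def hybrid_features(pssm, properties, max_lag=40):
    """Map an L x 20 PSSM and a 20 x M property table to an M * max_lag vector."""
    pssm = np.asarray(pssm, dtype=float)
    properties = np.asarray(properties, dtype=float)
    length = pssm.shape[0]
    if length <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    # Assumed form of the probabilistic expression: column m holds F_m = P @ R_m.
    F = pssm @ properties                  # shape (L, M)
    centered = F - F.mean(axis=0)          # subtract the mean mu of each property
    feats = []
    for m in range(properties.shape[1]):
        f = centered[:, m]
        for lag in range(1, max_lag + 1):
            # Assumed normalisation: average of the lagged products over L - lag terms.
            feats.append(np.dot(f[:-lag], f[lag:]) / (length - lag))
    return np.array(feats)

rng = np.random.default_rng(0)
pssm = rng.random((60, 20))        # hypothetical PSSM of a 60-residue protein
properties = rng.random((20, 4))   # placeholder for the H, P, Z, V indices
vec = hybrid_features(pssm, properties)   # vec has 4 * 40 = 160 entries
```

The length check reflects the fact that a lag of 40 needs more than 40 residues, which the 50-residue minimum applied to the datasets guarantees.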

Discriminative Vector Machine
Classification is a fundamental issue in the field of pattern recognition, and numerous classification algorithms exist for different recognition tasks. In this work, our newly developed discriminative vector machine (DVM) classifier is adopted for classification. To the best of our knowledge, this is the first time that the DVM model has been applied to the field of bioinformatics. DVM is a probably approximately correct (PAC) learning algorithm which can reduce the generalization error and has strong robustness [27]. Given a test sample $y$, the first step of DVM is to find its $k$ nearest neighbors (kNNs) to suppress the effect of outliers. The kNNs of $y$ can be expressed as $X_k = [x_1, x_2, \ldots, x_k]$, where $x_i$ is the $i$th nearest neighbor. For convenience, $X_k$ is also represented by $X_k = [x_{k,1}, x_{k,2}, \ldots, x_{k,c}]$, where $x_{k,j}$ denotes the sample vectors from the $j$th class. The objective of DVM is then to solve the minimization problem of Equation (3), where $(y - X_k\beta_k)_i$ is the $i$th element of $y - X_k\beta_k$ and $\beta_k$ is denoted as $[\beta_k^1, \beta_k^2, \ldots, \beta_k^k]$ or $[\beta_{k,1}, \beta_{k,2}, \ldots, \beta_{k,c}]$, where $\beta_{k,i}$ is the coefficient from the $i$th class. $\phi$ is a robust M-estimator that improves the robustness of DVM. The M-estimator is a generalized maximum likelihood operator proposed by Huber to estimate parameters under a cost function [43]. There are a variety of alternative robust estimators, such as the Welsch M-estimator, the MBA (Median Ball Algorithm) estimator and the Cauchy M-estimator [44].
In this work, a robust Welsch M-estimator ($\phi(x) = \frac{1}{2}\left(1 - \exp(-x^2)\right)$) is adopted to attenuate large error terms so that outliers have less impact on classification. $\|\beta_k\|$ is a norm of $\beta_k$, and the $l_2$-norm is employed in our calculation. The last term of Equation (3) is the manifold regularization, where $w_{pq}$ is the similarity between the $p$th and the $q$th nearest neighbor (NN) of $y$. In this work, $w_{pq}$ is defined as the cosine distance between the $p$th and the $q$th NN of $y$. The corresponding Laplacian matrix $L$ is then

$$L = D - W \quad (4)$$

where $W$ is the similarity matrix whose elements are $w_{pq}$ ($p = 1, 2, \ldots, k$; $q = 1, 2, \ldots, k$) and $D$ is a diagonal matrix whose $i$th element $d_i$ is the sum of $w_{iq}$ ($q = 1, 2, \ldots, k$). According to Equation (4), the last term of Equation (3) can be written as $\gamma\beta_k^T L \beta_k$. Construct a diagonal matrix $P = \mathrm{diag}(p_i)$ whose elements $p_i$ ($i = 1, 2, \ldots, d$) are given by Equation (5), where $\sigma$ is the kernel size, calculated by Equation (6), in which $\theta$ is a constant to suppress the effect of outliers. In this work, it is set to 1.0 as in the literature [45]. Based on Equations (4)-(6), the minimization of Equation (3) can be converted to the problem of Equation (7). According to the theory of half-quadratic minimization, the global solution $\beta_k$ of Equation (7) is given by Equation (8). After the coefficients for each class are calculated, the test sample $y$ is assigned to the $i$th class if the residual $\|y - X_{ki}\beta_{ki}\|$ is the minimum.
As can be seen, DVM uses the robust M-estimator and manifold regularization to suppress the effect of outliers and improve its discriminatory ability; therefore, it has better robustness and higher generalization ability than kNNs. In this work, there are two classes to be identified: non-interacting protein pairs (class 1) and interacting protein pairs (class 2). If the residual $R_1$ is the minimum, the test sample $y$ is classified as a non-interacting protein pair (class 1); otherwise, it is classified as an interacting protein pair (class 2). For the three free parameters ($\delta$, $\gamma$, $\theta$) of the DVM model, directly searching for their optimal values is time-consuming. Fortunately, the DVM algorithm is so stable that these parameters affect the performance only slightly if they are set within feasible ranges. Based on the above and through grid search, the parameters $\delta$ and $\gamma$ are set to $1 \times 10^{-3}$ and $1 \times 10^{-4}$, respectively. As described before, $\theta$ is a constant and is set to 1 throughout. For large datasets, the DVM classifier needs relatively more time to find the representative vectors, so multi-dimensional indexing techniques can be adopted to speed up the search to a certain extent.
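Two ingredients of this formulation are easy to check numerically: the Welsch function and the graph Laplacian built from the neighbor similarities. The sketch below is an illustration only. It assumes that $w_{pq}$ is the cosine of the angle between two neighbors (our reading of the similarity described in the text) and that the Laplacian is the usual $L = D - W$; it does not implement the DVM solver itself, and all names and the random neighbors are hypothetical.

```python
import numpy as np

def welsch(x):
    """Welsch M-estimator: phi(x) = 0.5 * (1 - exp(-x^2))."""
    return 0.5 * (1.0 - np.exp(-np.square(x)))

def cosine_similarity_matrix(neighbors):
    """Pairwise cosine similarity between the rows of a k x d neighbor matrix."""
    unit = neighbors / np.linalg.norm(neighbors, axis=1, keepdims=True)
    return unit @ unit.T

def graph_laplacian(W):
    """L = D - W, where D is diagonal and holds the row sums of W."""
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
neighbors = rng.random((5, 8))     # k = 5 hypothetical neighbors in d = 8 dimensions
W = cosine_similarity_matrix(neighbors)
L = graph_laplacian(W)
beta = rng.standard_normal(5)

# For symmetric W the quadratic form equals half the weighted sum of squared differences.
quadratic_form = beta @ L @ beta
pairwise_sum = 0.5 * np.sum(W * (beta[:, None] - beta[None, :]) ** 2)
```

The last two quantities agree, which is why a penalty on differences between coefficients of similar neighbors can be written compactly as a quadratic form in the Laplacian. The Welsch function is bounded by 0.5, so a very large residual contributes no more to the objective than a moderately large one.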
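The final decision rule, assigning the test sample to the class with the smallest reconstruction residual, can be sketched as follows. The sketch assumes that the coefficient vector has already been obtained by some solver, that the neighbors are stored as columns of the matrix, and that each neighbor carries a class label; the function name `classify_by_residual` and the toy numbers are hypothetical.

```python
import numpy as np

def classify_by_residual(y, X_k, beta_k, neighbor_labels):
    """Return the class whose neighbors reconstruct y with the smallest residual."""
    neighbor_labels = np.asarray(neighbor_labels)
    residuals = {}
    for cls in np.unique(neighbor_labels):
        mask = neighbor_labels == cls
        # Reconstruct y using only the neighbors (columns) and coefficients of this class.
        residuals[int(cls)] = float(np.linalg.norm(y - X_k[:, mask] @ beta_k[mask]))
    predicted = min(residuals, key=residuals.get)
    return predicted, residuals

# Toy case: three neighbors in four dimensions, one from class 1 and two from class 2.
X_k = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.0, 0.0, 0.0]])
y = np.array([1.0, 0.1, 0.0, 0.0])
beta_k = np.array([0.9, 0.1, 0.0])   # assumed to come from the DVM solver
predicted, residuals = classify_by_residual(y, X_k, beta_k, [1, 2, 2])
```

In this toy case the class 1 neighbor explains most of the sample, so class 1 (non-interacting in the labeling used here) has the smaller residual.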

Procedure of the Proposed Method
In this study, the procedure of the proposed approach consists mainly of two steps: feature extraction and classification. Feature extraction is divided into three substeps: (1) the PSI-BLAST tool is used to represent each protein sequence and the corresponding PSSMs are obtained; (2) based on the PSSM and physicochemical characteristics, each probabilistic residue expression $F_m$ is calculated; (3) each autocorrelation feature vector $V_i$ is established according to Equation (2). Classification includes two substeps: (1) as described before, each dataset is divided into a training set and a test set, and the training set is used to train the DVM model; (2) the trained DVM model is employed to predict PPIs on the Yeast and H. pylori datasets and the performance of the algorithm is evaluated. Similarly, the SVM model is also constructed for predicting PPIs on the Yeast dataset. The flow chart of our proposed approach is shown in Figure 2.

Performance Evaluation
To evaluate the predictive performance of the proposed approach, four evaluation metrics were calculated: accuracy (Acc), sensitivity (Sen), precision (Pre), and Matthews's correlation coefficient (MCC). They are defined as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sen} = \frac{TP}{TP + FN}$$

$$\mathrm{Pre} = \frac{TP}{TP + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where FP, FN, TP and TN denote false positive, false negative, true positive and true negative, respectively. More specifically, FP is the number of non-interacting protein pairs that are falsely predicted to be interacting protein pairs, and FN is the number of interacting protein pairs that are falsely predicted to be non-interacting protein pairs. Similarly, TP is the number of interacting protein pairs predicted correctly, while TN is the number of non-interacting protein pairs predicted correctly. Furthermore, the receiver operating characteristic (ROC) curve is employed to compare the performance of SVM and the proposed method.
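The four metrics can be computed directly from the confusion counts using their standard definitions. The short Python sketch below illustrates this; the function name `ppi_metrics` and the counts in the example are hypothetical and are not taken from the experiments.

```python
import math

def ppi_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, precision and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)          # fraction of interacting pairs that are recovered
    pre = tp / (tp + fp)          # fraction of predicted interactions that are correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sen": sen, "Pre": pre, "MCC": mcc}

# Hypothetical counts for a balanced test fold of 200 pairs.
m = ppi_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these counts the accuracy is 0.85 and the sensitivity is 0.9. MCC lies in [−1, 1]; multiplying it by 100 gives the percentage form used when reporting results.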

Conclusions
In this work, we propose a novel computational method for predicting PPIs using a hybrid feature that incorporates the evolutionary information and physicochemical characteristics of protein sequences. To minimize data dependence and avoid over-fitting, fivefold cross-validation is adopted. When applied to the Yeast and H. pylori datasets, the proposed method achieves good prediction accuracies of 94.35% and 90.61%, respectively. To further evaluate the performance of the proposed method, it is compared with the SVM model and other previous works. The results show that our proposed method is very competitive for predicting PPIs and can serve as a useful supplementary tool to traditional experimental methods in future proteomics research.
