Predicting Protein–Protein Interactions Based on Ensemble Learning-Based Model from Protein Sequence

Simple Summary: Most traditional high-throughput experiments are tedious and laborious when identifying potential protein–protein interactions. To improve the accuracy of protein–protein interaction prediction, we propose a novel computational method that can identify unknown protein–protein interactions efficiently, and we hope this method provides a helpful idea and tool for proteomics research.

Abstract: Protein–protein interactions (PPIs) play an essential role in many biological cellular functions. However, identifying protein–protein interactions through traditional experimental methods is still tedious and time-consuming. For this reason, it is imperative to develop a computational method for predicting PPIs efficiently. This paper explores a novel computational method for detecting PPIs from protein sequences; the approach combines a feature extraction method, Locality Preserving Projections (LPP), with a classifier, Rotation Forest (RF). Specifically, we first employ the Position Specific Scoring Matrix (PSSM), which retains the evolutionary information of a protein, to represent each protein sequence efficiently. Then, the LPP descriptor is applied to extract feature vectors from the PSSM, and the feature vectors are fed into the RF to obtain the final results. Applied to two datasets, Yeast and H. pylori, the proposed method obtained average accuracies of 92.81% and 92.56%, respectively. We also compare it with K nearest neighbors (KNN) and the support vector machine (SVM) to better evaluate its performance. In summary, all experimental results indicate that the proposed approach is stable and robust for predicting PPIs and promises to be a useful tool for proteomics research.


Introduction
Protein-protein interactions (PPIs) play a crucial role in almost all cellular processes and functions, such as DNA transcription and replication, immune response, signal transduction, and gene expression [1,2]. Thus, correctly detecting and characterizing potential protein interactions is significant for understanding the properties of biological processes. Recently, a number of innovative high-throughput biological experimental technologies, including the yeast two-hybrid screen (Y2H) [3,4], protein chips [5], and tandem affinity purification tagging (TAP) [6], among other methods, have been proposed to detect interactions between proteins systematically. With the development of biotechnology, the amount of PPI data is accumulating quickly. For this reason, multiple databases have been built to record PPI data efficiently: the Biomolecular Interaction Network Database (BIND) [7], the Database of Interacting Proteins (DIP) [8], and the Molecular Interaction database (MINT) [9] are the databases most used by researchers. However, traditional high-throughput methods still have drawbacks: they are costly and labor-intensive and produce a high rate of false positives. The known PPI pairs that have been validated through biological experiments account for only a small portion of the whole PPI network [10,11]. As a result, developing a novel computational method is conducive to inferring potential PPIs.
Up to now, a great number of computational techniques have been proposed for predicting potential PPIs [12][13][14]. Generally, existing methods for predicting PPIs can be treated as binary classification problems that adopt different features to represent protein pairs [15][16][17]. Different feature sources or protein attributes, such as protein domains, phylogenetic profiles, and protein structure information, are employed to detect potential protein interactions. There also exist methods that utilize interaction information from several different protein features [18,19]. However, these approaches are not easy to implement unless prior knowledge of the protein pairs is available.
Recently, several computational methods based mainly on protein sequences have been proposed, since protein sequences are the easiest data to obtain [20][21][22]. Many researchers have engaged in the development of sequence-based methods for detecting potential PPIs [23][24][25][26], and a variety of experimental results have indicated that information from amino acid sequences alone is sufficient to predict PPIs [27][28][29][30][31]. For instance, Xia et al. [32] proposed that the Moran autocorrelation descriptor can effectively depict the level of correlation between two protein sequences for a specific physicochemical property, and used the rotation forest to predict PPIs. Shen et al. [33] proposed a computational method that extracts features with the conjoint triad (CT), which considers the local environments of residues, and then uses a support vector machine (SVM) to predict PPIs; this method achieved an average accuracy of 83.9%. You et al. [34] developed a novel computational method that used multi-scale continuous and discontinuous (MCD) descriptors to represent protein sequences and achieved excellent results on the Yeast dataset. Chen et al. [35] reported an approach that uses XGBoost to reduce feature noise and adopts StackPPI, a stack of several ensemble classifiers, to detect interactions between protein pairs. Zhao et al. [36] proposed an ensemble method that also obtained good performance. Yousef et al. [37] developed a sequence-based, fast, and adaptive PPI prediction method, which employed principal component analysis (PCA) for feature extraction and adaptive learning vector quantization (LVQ) to predict different PPI datasets; this method achieved average accuracies of 93.88% and 90.03% on S. cerevisiae and H. pylori, respectively. Wang et al. [38] reported an approach using only protein sequence information, combining continuous and discrete wavelet transforms with a weighted sparse representation-based classifier for predicting PPIs. Zahiri et al. [39] proposed a novel evolutionary-based algorithm called PPIevo, which extracts features from the PSSM for predicting protein-protein interactions. In general, previous works illustrate that feature extraction and classification are the two most important steps in predicting PPIs.
In this paper, by fully using the evolutionary information of protein sequences, we report a novel computational method that obtains a numerical representation via the Position Specific Scoring Matrix (PSSM), extracts feature vectors with locality preserving projections (LPP), and predicts with the rotation forest (RF) classifier. More specifically, the first step transforms each protein sequence into a numerical representation, the PSSM. Second, LPP is applied to each PSSM to obtain a low-dimensional feature vector. Finally, the feature descriptors are fed into the RF classifier to infer potential protein-protein interactions. Two datasets, Yeast and H. pylori, are used to evaluate the proposed method; the fivefold cross-validation results give average accuracies of 92.81% and 92.56%, respectively. The performance of our approach is better than that of the support vector machine (SVM) and K nearest neighbors. We also performed extensive experiments on four cross-species independent datasets. Experimental results show that the proposed method outperforms other existing methods, and we hope this approach can provide a solution for inferring potential PPIs.

Datasets
The choice of dataset determines how well the performance of the proposed method can be assessed. In this study, we adopt two benchmark datasets to evaluate the model. The first, the Yeast PPI dataset, was collected from the DIP database [8]. To enhance credibility, protein pairs whose sequences were shorter than 50 residues or shared more than forty percent sequence identity were removed. The resulting Yeast dataset consists of 11,188 protein pairs: a positive set of 5594 protein pairs and a negative set of 5594 protein pairs. The second dataset, the H. pylori PPI dataset, is described by Martin et al. [40]; the whole dataset consists of 2916 protein pairs (1458 interacting pairs and 1458 non-interacting pairs).

Position Specific Scoring Matrix
The Position-Specific Scoring Matrix (PSSM) [41] is widely used to transform a biological sequence into a numerical representation [42,43]. Given a protein sequence of length N, the PSSM can be represented as an N × 20 matrix:

PSSM = [α_{i,j}], i = 1, …, N, j = 1, …, 20,

where α_{i,j} denotes the probability of the i-th residue being mutated into the j-th of the 20 native amino acids during the evolutionary process of the protein, derived from multiple sequence alignments. In this step, the tool mentioned in [44] is employed to convert each protein sequence into a PSSM.
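As a concrete illustration, the ASCII PSSM written by PSI-BLAST (e.g., via its `-out_ascii_pssm` option) can be loaded into an N × 20 matrix with a short parser. This is only a sketch assuming the standard BLAST+ ASCII layout; header details can vary between versions, and the function name `parse_ascii_pssm` is our own.

```python
import numpy as np

def parse_ascii_pssm(text):
    """Parse an ASCII PSSM (as written by PSI-BLAST's -out_ascii_pssm)
    into an N x 20 NumPy array of position-specific scores.
    Data rows look like: "<pos> <residue> <20 log-odds> <20 percentages> ..."
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # keep only data rows: position index, residue letter, >= 40 numbers
        if len(fields) >= 22 and fields[0].isdigit() and fields[1].isalpha():
            rows.append([float(v) for v in fields[2:22]])  # first 20 columns
    return np.array(rows)
```

The resulting array is the N × 20 matrix [α_{i,j}] that the LPP step consumes.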

Locality Preserving Projections
We aim to extract feature vectors by reducing the dimensionality of the original matrix, thereby reducing the influence of noise. In this section, the Locality Preserving Projections (LPP) algorithm [45] is adopted to extract feature vectors from each PSSM. LPP is a linear approximation to the Laplacian eigenmap and preserves the local neighborhood structure of the data, so it is widely used in data processing and analysis applications. Given training samples X = [x_1, x_2, …, x_n] ∈ R^{D×n}, where D is the feature dimension and n is the number of samples, LPP seeks a projection matrix W that maps the high-dimensional input X to a low-dimensional representation Y = [y_1, y_2, …, y_n] with y_i = W^T x_i. The objective function of LPP is defined as

min_w Σ_{i,j} (y_i − y_j)^2 P_{ij},

where the weight matrix P is built from the heat kernel:

P_{ij} = exp(−‖x_i − x_j‖^2 / t) if x_i and x_j are linked (neighbors), and P_{ij} = 0 otherwise,

in which w denotes a transformation vector and the parameter t is the scale size (Equation (3)). Substituting y_i = w^T x_i, the objective can be rewritten as

(1/2) Σ_{i,j} (w^T x_i − w^T x_j)^2 P_{ij} = w^T X L X^T w,

where D is the diagonal matrix with D_{ii} = Σ_{j=1}^{n} P_{ij}, and L = D − P is the Laplacian matrix. Minimizing this quantity subject to the constraint

w^T X D X^T w = 1

leads to the generalized eigenvalue problem

X L X^T w = λ X D X^T w.

Solving this problem (Equation (8)) yields k eigenvalues λ_0, λ_1, …, λ_{k−1}, sorted from small to large, with corresponding eigenvectors {w_0, w_1, …, w_{k−1}}. The first l eigenvectors are selected to form the projection matrix W = [w_0, w_1, …, w_{l−1}]. As a result, the embedded descriptors are y_i = W^T x_i.
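The steps above can be sketched in code. The following is a minimal NumPy/SciPy implementation under simplifying assumptions (a symmetrized kNN heat-kernel graph and a small regularizer on the right-hand matrix for numerical stability); it is illustrative, not the authors' exact implementation.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, n_components=2, n_neighbors=5, t=1.0):
    """Locality Preserving Projections (sketch).
    X: (n_samples, n_features). Returns projection matrix W and embedding X @ W."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)               # exclude self-neighbors
    # heat-kernel weights P_ij on a kNN graph, then symmetrize
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    P = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    P[rows, idx.ravel()] = np.exp(-d2[rows, idx.ravel()] / t)
    P = np.maximum(P, P.T)
    D = np.diag(P.sum(axis=1))                 # degree matrix, D_ii = sum_j P_ij
    L = D - P                                  # graph Laplacian
    # generalized eigenproblem  X^T L X w = lambda X^T D X w
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-8 * np.eye(X.shape[1])  # regularized for stability
    vals, vecs = eigh(A, B)                    # eigenvalues in ascending order
    W = vecs[:, :n_components]                 # keep the l smallest
    return W, X @ W
```

Note the convention difference: here X stacks samples as rows, so the eigenproblem appears as X^T L X w = λ X^T D X w, which matches the column-stacked form in the text.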

Rotation Forest
Rodriguez et al. [46] proposed the rotation forest (RF), a typical ensemble learning algorithm widely used in classification tasks. The RF algorithm first divides the feature set into K subsets by randomly splitting the features of the samples. Principal component analysis (PCA) is then used to transform the data while retaining all of the information in the original data. As a result, RF improves classification performance by amplifying the differences between base classifiers.
Let the training sample set S be an M × m matrix, where M is the number of samples and m is the length of each sample's feature vector. Let X be the feature set and Y = (y_1, y_2, …, y_M)^T the corresponding labels. Assume the ensemble contains L decision trees, denoted Q_1, Q_2, …, Q_L. In this algorithm, the complete feature set is divided equally and randomly into K subsets. The processing steps for a single classifier Q_i can be summarized as follows: (1) Randomly divide the feature set X into K disjoint subsets; each subset contains C = m/K features. (2) For the j-th subset, select the corresponding feature columns from the training set S, and draw a bootstrap sample of seventy-five percent of the original training samples to generate a new matrix S_{i,j}. (3) Apply PCA to S_{i,j}; the obtained principal component coefficients are stored and denoted γ_{i,j}. (4) Construct a sparse (block-diagonal) rotation matrix F_i whose blocks contain the coefficients γ_{i,j}, and rearrange its rows to match the original order of the features, giving F_i^a; classifier Q_i is then trained on S F_i^a.

Given a test sample x in the prediction phase, let d_{i,j}(x F_i^a) denote the probability assigned by classifier Q_i to the hypothesis that x belongs to class j. The confidence for class j is computed by the average combination method:

μ_j(x) = (1/L) Σ_{i=1}^{L} d_{i,j}(x F_i^a).

Thus, the class with the greatest confidence can easily be assigned, and the final predicted label is obtained.
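The four steps and the averaging rule can be sketched with scikit-learn building blocks. This is a minimal illustration in the spirit of Rodriguez et al., not the paper's exact code: the class name `RotationForestSketch` and all parameter defaults are our own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

class RotationForestSketch:
    """Rotation Forest sketch: K random feature subsets, PCA per subset on a
    75% bootstrap sample, a block-diagonal rotation matrix, one tree per rotation."""

    def __init__(self, n_trees=5, n_subsets=5, sample_frac=0.75, seed=0):
        self.n_trees, self.n_subsets = n_trees, n_subsets
        self.sample_frac, self.seed = sample_frac, seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        n, m = X.shape
        self.trees_, self.rotations_ = [], []
        for _ in range(self.n_trees):
            perm = rng.permutation(m)              # random split of the m features
            R = np.zeros((m, m))                   # sparse rotation matrix F_i
            for cols in np.array_split(perm, self.n_subsets):
                rows = rng.choice(n, int(self.sample_frac * n), replace=True)
                pca = PCA().fit(X[np.ix_(rows, cols)])     # coefficients gamma_ij
                R[np.ix_(cols, cols)] = pca.components_.T  # block-diagonal fill
            tree = DecisionTreeClassifier(random_state=0).fit(X @ R, y)
            self.trees_.append(tree)
            self.rotations_.append(R)
        return self

    def predict(self, X):
        # average combination: mean of per-tree class probabilities d_ij(x F_i^a)
        probs = np.mean(
            [t.predict_proba(X @ R) for t, R in zip(self.trees_, self.rotations_)],
            axis=0)
        return self.trees_[0].classes_[np.argmax(probs, axis=1)]
```

Because the features are permuted before splitting, indexing the rotation matrix with the same column set on both axes keeps each PCA block aligned with its original feature positions.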

Evaluation Criteria
To evaluate the predictive performance of the model more intuitively, four criteria are used: precision (Prec.), accuracy (Accu.), Matthews correlation coefficient (MCC), and sensitivity (Sen.), defined respectively as

Accu. = (TP + TN) / (TP + FP + TN + FN),
Prec. = TP / (TP + FP),
Sen. = TP / (TP + FN),
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where FP is the number of false positives, TN true negatives, FN false negatives, and TP true positives. The ROC curve describes the relative trade-off between the true positive rate and the false positive rate: its x-axis is the false positive rate (FPR), or 1 − specificity, and its y-axis is the true positive rate (TPR), or sensitivity.
The ROC curve of the best possible classifier passes through a point close to the coordinate (0,1), the upper-left corner of the ROC space, which represents the highest specificity and sensitivity. In this paper, the area under the ROC curve (AUC) is computed to summarize the performance of the proposed method in numerical form.
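The four criteria follow directly from the confusion-matrix counts; a minimal sketch (the function name `ppi_metrics` is our own):

```python
import math

def ppi_metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity, and MCC from confusion-matrix counts."""
    accu = (tp + tn) / (tp + fp + tn + fn)
    prec = tp / (tp + fp)
    sen = tp / (tp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accu, prec, sen, mcc
```

For example, `ppi_metrics(50, 10, 40, 0)` gives an accuracy of 0.9, a precision of 50/60, and a sensitivity of 1.0.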

Prediction Ability Assess
In this section, we applied the proposed method to two datasets: Yeast and H. pylori. Five-fold cross-validation was adopted to assess the method and avoid over-fitting; in this way, five models are trained on five groups of training data. To obtain the best feature representation in the LPP algorithm, we evaluated feature vectors of different dimensions (40, 60, 80, 100, 120, and 140) for predicting protein interactions. We repeated this experiment several times, and the optimization results are shown in Table 1. With 40-dimensional feature vectors, the accuracy on the Yeast dataset reached 92.81%; with 80-dimensional feature vectors, the accuracy on the H. pylori dataset reached 92.56%. We also plotted the accuracy on the Yeast and H. pylori datasets in Figure 1, which clearly illustrates that the best performance is obtained with 40-dimensional feature vectors on the Yeast dataset and 80-dimensional feature vectors on the H. pylori dataset. It is worth noting, however, that accuracy decreases from 80 to 100 dimensions on the Yeast dataset; we attribute this to 100-dimensional feature vectors extracted from the PSSM carrying more redundant and noisy information than 80-dimensional ones. The accuracy gap between 80 and 100 dimensions is 0.70%, and accuracy is the main criterion we focus on. Thus, 40-dimensional feature vectors were selected for the Yeast dataset and 80-dimensional feature vectors for the H. pylori dataset.

In the rotation forest algorithm, there are two main parameters: K, the number of feature subsets, and L, the number of decision trees. After experimentation, we set both K and L to 5. The results on the two datasets are shown in Tables 2 and 3.
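The dimension scan described above amounts to a 5-fold cross-validation loop over candidate feature dimensions. The sketch below uses synthetic data, PCA as a stand-in projection, and a decision tree as a stand-in classifier; the data, models, and dimensions are placeholders, not the paper's.

```python
import numpy as np
from sklearn.decomposition import PCA  # stand-in for the LPP projection
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic two-class feature matrix standing in for PSSM-derived vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (150, 200)), rng.normal(0.4, 1, (150, 200))])
y = np.array([0] * 150 + [1] * 150)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for dim in (40, 60, 80, 100):          # candidate reduced dimensions
    Xd = PCA(n_components=dim, random_state=0).fit_transform(X)
    scores[dim] = cross_val_score(DecisionTreeClassifier(random_state=0),
                                  Xd, y, cv=cv).mean()
best_dim = max(scores, key=scores.get)  # dimension with highest mean accuracy
```

For strictness, the projection should be refit inside each fold; fitting it once on all data, as here, keeps the sketch short at the cost of a mild information leak.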
When predicting PPIs on the Yeast dataset, the proposed method yielded an average accuracy of 92.81%, precision of 96.80%, sensitivity of 88.55%, and MCC of 86.61%, with corresponding standard deviations of 0.66%, 0.68%, 0.95%, and 1.15%. On H. pylori, the average accuracy, precision, sensitivity, and MCC are 92.56%, 94.11%, 90.82%, and 86.22%, with corresponding standard deviations of 0.86%, 0.99%, 0.93%, and 1.47%. The values of AUC were also computed; the ROC curves of the two datasets are shown in Figures 2 and 3. As the figures show, the AUC value is 0.9506 for the Yeast dataset and 0.9463 for the H. pylori dataset. In conclusion, these promising results demonstrate that the proposed method is stable and effective for predicting PPIs.


Performance Comparison of RF with Other Models
We compared the proposed method with the K nearest neighbor (KNN) and support vector machine (SVM) classifiers to evaluate it further. The KNN algorithm is widely used in machine learning due to its efficiency and simplicity; its parameter k needs to be optimized to obtain the best performance, and here k is set to 2. For the SVM model, the LIBSVM tool is adopted to predict PPIs, and two parameters, c and g, need to be optimized. Using the same feature vectors (40 dimensions for Yeast and 80 dimensions for H. pylori), we tuned several parameter settings to find the best performance of the SVM classifier: on the Yeast dataset, c and g are set to 1 and 4, and on the H. pylori dataset, c = 5 and g = 0.1. The performance results of the SVM are shown in Tables 4 and 5. When using the SVM on the Yeast PPI dataset, the average accuracy is 80.72%, precision 81.39%, sensitivity 79.66%, MCC 68.87%, and AUC 0.8804, with corresponding standard deviations of 0.81%, 1.16%, 0.98%, 0.98%, and 0.0067. When the SVM is used on H. pylori, the average accuracy, precision, sensitivity, MCC, and AUC are 88.71%, 91.86%, 85.06%, 79.91%, and 0.9438, respectively. The ROC curves of the SVM classifier on the Yeast and H. pylori datasets are shown in Figures 4 and 5. Furthermore, the prediction experiment with KNN was carried out, with average results obtained using the same feature extraction method. Table 6 summarizes the prediction results of the different models. The results of RF are significantly better than those of SVM and KNN on the Yeast dataset; for example, the accuracy gaps between RF and SVM are 12.09% on the Yeast dataset and 3.85% on the H. pylori dataset.
Similarly, the accuracy gaps between RF and KNN are 18.08% on the Yeast dataset and 1.51% on the H. pylori dataset. As a result, we conclude that the RF classifier is more accurate than SVM and KNN. To evaluate the performance more intuitively, we plotted the ROC curves of RF, SVM, and KNN, shown in Figure 6; the higher the AUC value, the better the method performs. For instance, the AUC gaps between RF and SVM are 0.0702 on the Yeast dataset and 0.0025 on the H. pylori dataset; similarly, the AUC gaps between RF and KNN are 0.2034 on the Yeast dataset and 0.0359 on the H. pylori dataset.
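The baseline comparison can be reproduced in spirit with scikit-learn stand-ins for LIBSVM and KNN. The synthetic features below are placeholders for the LPP vectors, and `C`/`gamma` mirror the paper's c and g settings for Yeast; this is a sketch, not the original experiment.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# placeholder 40-dimensional features for two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 40)), rng.normal(1.5, 1, (200, 40))])
y = np.array([0] * 200 + [1] * 200)

# SVM with c=1, g=4 (sklearn: C, gamma) and KNN with k=2,
# each scored by 5-fold cross-validation
svm_acc = cross_val_score(SVC(C=1, gamma=4), X, y, cv=5).mean()
knn_acc = cross_val_score(KNeighborsClassifier(n_neighbors=2), X, y, cv=5).mean()
```

On the real LPP features the tuned values of C and gamma matter considerably, which is why the paper optimizes them per dataset.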

Sensitivity
To examine the classifiers more closely, in this section we fed the same feature vectors to KNN, SVM, and RF. SVM has many unique advantages for small-sample, nonlinear, and high-dimensional pattern recognition and has shown state-of-the-art performance in many previous works. KNN is a common supervised learning method in which k is usually selected manually. In this study, RF obtained better accuracy than KNN and SVM, which demonstrates that when predicting protein-protein interactions, RF captures more useful information and is less influenced by noise than SVM and KNN. Thus, we conclude that the proposed method has better prediction performance for PPIs.


Performance on Independent Dataset
Although our method has already achieved satisfactory results, we carried out extensive additional experiments. Four independent PPI datasets, H. pylori, H. sapiens, C. elegans, and M. musculus, were selected to evaluate the predictive capacity of the proposed model. The experiment is based on the hypothesis that many interacting proteins in one organism co-evolve, and that their respective orthologs in other organisms also interact. Specifically, we first used the Yeast PPI dataset as the training set after optimizing the parameters. The same feature extraction method was applied to the four independent PPI datasets, and the resulting feature vectors were treated as test data. The results are summarized in Table 7: the average accuracy on the four cross-species datasets varies from 88.60% to 97.44%. In line with our hypothesis, the accuracy is 88.60% for the H. sapiens dataset and 97.44% for the M. musculus dataset; we surmise that, with the Yeast dataset as the training set, the H. sapiens dataset shows a lower correlation, while, on the contrary, the M. musculus dataset shows a higher one. All in all, this demonstrates that the proposed model has good predictive and generalization capabilities and can be applied to different protein-interaction prediction problems.
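The cross-species protocol (train once on Yeast, score each independent species' features as held-out data) can be outlined schematically. The data below are synthetic placeholders, and a random forest stands in for the trained ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in ensemble classifier

rng = np.random.default_rng(1)
# placeholder "Yeast" training features/labels
X_train = np.vstack([rng.normal(0, 1, (200, 40)), rng.normal(2, 1, (200, 40))])
y_train = np.array([0] * 200 + [1] * 200)
# placeholder features/labels for one independent species
X_test = np.vstack([rng.normal(0, 1, (50, 40)), rng.normal(2, 1, (50, 40))])
y_test = np.array([0] * 50 + [1] * 50)

# fit on the training species only; never refit on the independent set
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
acc = clf.score(X_test, y_test)  # accuracy on the independent dataset
```

The key constraint is that all parameter optimization happens on the training species; the independent datasets are touched only at scoring time.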

Comparison with Other Methods
Many related works have been proposed to improve prediction performance. To assess whether our method is efficient, we compare it with previous works on the Yeast and H. pylori PPI datasets; Tables 8 and 9 list the comparison results on the two datasets. Table 8 clearly illustrates that our proposed method achieved the best results in accuracy, precision, sensitivity, and MCC. In particular, on the MCC criterion our method is 8.09% higher than the ensemble ELM method, and our model's accuracy is 5.06% higher than that of ensemble ELM. Overall, our method ranks first, obtaining the highest prediction accuracy on H. pylori.
Table 9 lists the results of previous work. Our approach achieved the highest result on several criteria: specifically, the proposed method achieved 92.81% accuracy, which is 3.48% higher than the best result, in Guo's work. Our model's sensitivity reached only 88.55%, which is 9.35% lower than the third-highest result, in Yang's work. It is worth noting that both the feature extraction method and the classifier contribute greatly to the excellent performance. Generally speaking, the proposed method is superior to the other methods in the table and is effective in predicting PPIs.

Conclusions
In this article, we reported a novel computational method combining locality preserving projections and rotation forest for inferring potential PPIs. It is worth noting that the feature extraction method is crucial for predicting PPIs: the main improvement in this work is that locality preserving projections (LPP) are insensitive to outliers and better maintain the intrinsic local structure of the data, and the final prediction is then obtained with the rotation forest classifier. The method achieved average prediction accuracies of 92.81% on Yeast and 92.56% on H. pylori. We further compared the prediction performance of the rotation forest with that of the support vector machine and K nearest neighbor, and extensive experiments were also carried out on four independent datasets. The experimental results show that our model performs well. However, the method still has some drawbacks: the feature vectors extracted by LPP carry residual noise, and the evolutionary information cannot be retained completely. In future studies, we will continue to investigate more efficient descriptors for predicting PPIs.