An Ensemble Classifier with Random Projection for Predicting Protein–Protein Interactions Using Sequence and Evolutionary Information

Abstract: Identifying protein–protein interactions (PPIs) is crucial to comprehending various biological processes in cells. Although high-throughput techniques generate many PPI data for various species, they cover only a small fraction of the entire PPI network. Furthermore, these approaches are costly and time-consuming and have a high error rate. Therefore, it is necessary to design computational methods for efficiently detecting PPIs. In this study, a random projection ensemble classifier (RPEC) was explored to identify novel PPIs using evolutionary information contained in protein amino acid sequences. The evolutionary information was obtained from a position-specific scoring matrix (PSSM) generated by PSI-BLAST. A novel feature fusion scheme was then developed by combining discrete cosine transform (DCT), fast Fourier transform (FFT), and singular value decomposition (SVD). Finally, using the random projection ensemble classifier, the performance of the presented approach was evaluated on the Yeast, Human, and H. pylori PPI datasets with 5-fold cross-validation. Our approach achieved high prediction accuracies of 95.64%, 96.59%, and 87.62%, respectively, effectively outperforming other existing methods. Overall, our approach is quite promising and supplies a practical and effective method for predicting novel PPIs.


Introduction
Proteins are fundamental to human life and seldom function as single units; they interact with each other in specific ways to perform cellular processes [1]. Consequently, the analysis of protein–protein interactions (PPIs) can help researchers reveal tissue functions and structures, identify the pathogenesis of human diseases, and find drug targets for gene therapy. Various high-throughput experimental techniques have been developed for PPI detection, including the yeast two-hybrid system [2], immunoprecipitation [3], and protein chips [4]. However, these biological experiments are generally costly and time-consuming. Moreover, both the false negative and false positive rates of these methods are very high [2,5]. Therefore, the development of reliable computational models for the prediction of PPIs has great practical significance.
To construct a computational method for PPI prediction, the most important factor is to extract highly discriminative features that can effectively describe proteins. So far, protein feature extraction methods are based on many data types, such as genomic information [6,7], structural information [8,9], evolutionary information [10,11], and amino acid sequence information. Of these approaches, sequence-based methods are the most readily available, and it has been demonstrated that protein amino acid sequence information is important for detecting PPIs [12][13][14][15][16]. Martin et al. used a descriptor called a "signature product" to discover PPIs [12]; the signature product is a product of sub-sequences from a protein sequence and extends the signature descriptor from chemical information. Shen et al. presented a conjoint triad (CT) approach that takes the characters of amino acids and their adjacent amino acids into account [13]. Guo et al. proposed an auto covariance (AC) descriptor to represent an amino acid sequence based on seven physicochemical scales [14] and achieved promising prediction accuracy when detecting Yeast PPIs. Wong et al. employed the physicochemical property response matrix combined with the local phase quantization descriptor (PR-LPQ) to extract protein features [15]. Considering the evolutionary information of proteins, Huang et al. adopted a substitution matrix representation (SMR) based on BLOSUM62 to construct a feature vector and achieved promising prediction accuracy [16]. Ding et al. proposed a novel matrix-based protein sequence representation to predict PPIs via an ensemble classification method [17]. Wang et al. proposed a computational method based on a probabilistic classification vector machine (PCVM) model and a Zernike moment (ZM) descriptor to identify PPIs from amino acid sequences [18]. Lei et al.
employed the neighbor affinity-based core-attachment method (NABCAM) to identify protein complexes from dynamic PPI networks [19]. Nanni et al. summarized and evaluated several feature extraction methods for describing protein amino acid sequences by verifying them on multiple datasets, and they constructed an ensemble of classifiers for sequence-based protein classification, which not only performed well on many datasets but was also, under certain conditions, superior to the state of the art [20][21][22][23].
Next, the computational methods for PPI prediction can be formulated as a binary classification problem. A number of machine learning-based computational models for PPI prediction have emerged. Ronald et al. proposed a technique of applying Bayesian networks to detect PPIs on the Yeast dataset [24]. Qi et al. employed several classifiers, including support vector machine (SVM), decision tree (DT), random forest (RF), and logistic regression (LR), to compare their performances in predicting PPIs [25].
Among machine-learning-based computational models for PPI prediction, one of the most important challenges is that high-dimensional features may include unimportant information and noise, leading to over-fitting of the classification system [26,27]. Previous works have shown that random projection (RP) is an efficient and sufficiently precise approach for reducing the dimensionality of many high-dimensional datasets [28][29][30]. However, the performance of a single RP is poor [31] because of its instability. Therefore, in our study, an RP ensemble method was designed to predict PPIs.
The capacity of the integrated model was better and more effective than that of separate runs of the RP approach. Moreover, the ensemble algorithm achieved results superior to those of similar schemes that use principal component analysis (PCA) to reduce the dimensionality of the dataset [32]. The RP-ensemble-based classifier has several useful features [33]. Firstly, RP maintains the geometric structure of the dataset with a bounded distortion rate when its dimensionality is reduced; this reduces the complexity of the classifier and the difficulty of classifying new samples. In addition, dimensionality reduction can eliminate redundant information and reduce generalization error. Finally, instead of relying on a single classifier, the RP ensemble method combines several classifiers, which is superior to any single classifier and leads to better and more stable classification results.
In this paper, we propose an RPEC-based approach for detecting PPIs by combining a protein sequence with its evolutionary information. Firstly, a position-specific scoring matrix (PSSM) is used to express the amino acid sequence. Secondly, three 400-dimensional feature vectors are extracted from the PSSM by using DCT, FFT, and SVD, respectively, so that each protein pair is described by a 2400-dimensional feature vector. Then, a 1000-dimensional reduced feature vector is obtained via PCA.
Finally, the RP ensemble model is built by employing the feature matrix of the protein pairs as input to predict PPIs. Our method was evaluated on the Yeast, Human, and H. pylori PPI datasets and yields high prediction accuracies of 95.64%, 96.59%, and 87.62%, respectively. Compared with an SVM classifier fed with the same feature vectors, the accuracies of our method are higher by 0.58%, 1.40%, and 2.57%, respectively. We also compared our method with other approaches; the obtained results demonstrate that our method predicts PPIs better than previous works.

Position-Specific Scoring Matrix
A position-specific scoring matrix (PSSM) can be applied to search for distantly related proteins. It is derived from a group of sequences previously aligned by structural or sequence similarity [34]. There are many methods of calculating distances and metric spaces [35,36]. Here, some research on PSSM methods and their relation to amino acids is discussed. Liu et al. [37] showed that the PSSM profile provides a useful source of information for improving the prediction performance of protein structural classes. Wang et al. [38] proposed two fused feature representations, DipPSSM and PseAAPSSM, to integrate the PSSM with DipC and PseAAC, respectively. In our method, we predict PPIs based on the PSSM. Thus, each protein can be converted into a PSSM by using the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) [39].
The PSSM can be described as P = (N_1, N_2, ..., N_20), where N_m = (N_1m, N_2m, ..., N_Lm)^T (m = 1, 2, ..., 20). The PSSM thus has size L × 20, where L represents the length of the amino acid sequence and 20 is the number of standard amino acids. In our research, we generated the PSSMs for the experimental datasets using PSI-BLAST. To gather a broad set of homologous sequences, the e-value parameter of PSI-BLAST was set to 0.001 with three iterations, and the other parameters were kept at their defaults. Finally, the PSSM of a given protein sequence was expressed as an L × 20 matrix, where L denotes the number of residues.

Discrete Cosine Transform
Discrete cosine transform (DCT) is a classical orthogonal transformation, first proposed in the 1970s by Ahmed [40]. It is used in lossy image compression because of its strong energy-compaction performance. DCT has better energy aggregation than other transforms; it converts spatial signals into the frequency domain and thus works well for de-correlation. For a 1-D signal x(n) of length N, the DCT can be defined as y(k) = w(k) Σ_{n=1}^{N} x(n) cos(π(2n − 1)(k − 1)/(2N)), k = 1, 2, ..., N, where w(1) = 1/√N and w(k) = √(2/N) for 2 ≤ k ≤ N. The L × 20 PSSM is the input signal. After applying the DCT feature descriptor, we retain 400 coefficients as the protein feature vector, so each protein pair yields an 800-dimensional feature vector via DCT.
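As an illustration, the DCT step can be sketched in Python. The PSSM here is random stand-in data, and the choice of keeping the low-frequency 20 × 20 block of coefficients is our assumption: the paper states that 400 coefficients are retained but not which ones.

```python
import numpy as np
from scipy.fft import dctn

# Toy PSSM: L residues x 20 amino acids (random stand-in data).
rng = np.random.default_rng(0)
pssm = rng.standard_normal((137, 20))

# 2-D DCT-II of the PSSM; most energy concentrates in the
# low-frequency corner, so we keep the 20 x 20 top-left block,
# flattened into a 400-dimensional feature vector.
coeffs = dctn(pssm, norm="ortho")
feature = coeffs[:20, :20].ravel()
print(feature.shape)  # (400,)
```

With `norm="ortho"`, the (0, 0) coefficient equals the sum of the matrix divided by the square root of its size, which makes the transform orthogonal and easy to invert.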

Fast Fourier Transform
Fast Fourier transform (FFT) is a feature extraction method. FFT-based algorithms evaluate a simplified energy function over the space of mutual orientations of the protein partners. The mesh displacements of one protein centroid with respect to the other represent the translational space [41].
Here we describe the simplified energy scoring function. PPIs are defined on a mesh, and the score sums M correlation functions over all possible translations (l, m, n) (assuming that one protein is the ligand and the other is the receptor): E(l, m, n) = Σ_{p=1}^{M} Σ_{a,b,c} Lp(a, b, c) × Rp(a + l, b + m, c + n), where Lp(a, b, c) and Rp(a, b, c) are the p-th correlation-function components defined on the ligand and the receptor grids, respectively. We can thus use M forward and one inverse fast Fourier transform to calculate this expression efficiently. Denoting the forward and inverse transforms by FT and IFT, each correlation can be computed as E = IFT(conj(FT(Lp)) × FT(Rp)), where the transform kernels use the imaginary unit i = √−1 and N_1, N_2, and N_3 are the grid sizes along the three coordinates. If N_1 = N_2 = N_3 = N, the complexity of this method is O(N^3 log(N^3)). Thus, FFT lets us compute the correlation of Lp with the pre-calculated function of Rp; the final sum gives the scoring values for all possible translations of the ligand. Finally, we obtain and sort the results from different rotations.
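The correlation-via-FFT trick described above can be demonstrated on toy grids; the functions Lp and Rp are random stand-ins here, and a single correlation term (M = 1) is shown:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
Lp = rng.standard_normal((N, N, N))  # ligand grid function (toy data)
Rp = rng.standard_normal((N, N, N))  # receptor grid function (toy data)

# Scores for all N^3 translations at once via the convolution theorem:
# corr(l, m, n) = sum_{a,b,c} Rp(a, b, c) * Lp(a + l, b + m, c + n)
#              = IFT( conj(FT(Rp)) * FT(Lp) ),
# costing O(N^3 log N^3) instead of O(N^6) for the direct sum.
corr = np.fft.ifftn(np.conj(np.fft.fftn(Rp)) * np.fft.fftn(Lp)).real

# Direct evaluation of one translation (l, m, n) = (1, 2, 3) as a check.
direct = np.sum(Rp * np.roll(Lp, shift=(-1, -2, -3), axis=(0, 1, 2)))
```

The FFT version computes the scores for every translation simultaneously, which is exactly why the docking literature favors this formulation.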

Singular Value Decomposition
It is a challenge for bioinformatics to explore effective methods for analyzing global gene expression data. Singular value decomposition (SVD) is a common technique for multivariate data analysis [42]. Let M be an m × n matrix. Its SVD can be expressed as M = U S V*, where U is an m × m unitary matrix, S is a positive semi-definite m × n diagonal matrix, V is an n × n unitary matrix, and V* is the conjugate transpose of V. As a result, we can obtain the singular values of the protein matrix. The columns of U form an orthogonal basis for the input (analysis) space of M and are called left singular vectors. The rows of V* form an orthogonal basis for the output space of M and are called right singular vectors. The diagonal elements of S are the singular values of M.
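The decomposition can be verified numerically on a toy PSSM-sized matrix (random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((137, 20))  # toy L x 20 matrix

# M = U @ S @ V*, with singular values returned as a descending vector s.
U, s, Vh = np.linalg.svd(M, full_matrices=True)

# Rebuild the diagonal matrix S and check the factorization.
S = np.zeros(M.shape)
np.fill_diagonal(S, s)
reconstructed = U @ S @ Vh
```

The singular values in `s` come back sorted in decreasing order, so taking the leading ones gives the most informative components of the matrix.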

Principal Component Analysis
Principal component analysis (PCA) is a dimensionality reduction method. PCA is widely used for data analysis because most of the information carried by the original variables persists in the reduced dataset [43,44]. It embeds samples from a high-dimensional space into a low-dimensional space, and the reduced data represent the original data as closely as possible. The PCA of a data matrix determines the main information in the matrix through a complementary set of score and loading diagrams.
Furthermore, PCA converts the primitive variables into a set of linear combinations, the principal components (PCs), which capture the data variance, are linearly independent, and are ordered by decreasing variance coverage [45]. This reduces the data dimension directly by discarding low-variability components. In this way, the original α-dimensional dataset can be optimally embedded in a feature space of lower dimension.
The concept and calculation of PCA are simple. Given M = A_ij (i = 1, 2, ..., α; j = 1, 2, ..., β), where A_ij denotes the value of the i-th feature of the j-th sample, PCA proceeds as follows. Firstly, the α-dimensional mean vector µ and the α × α covariance matrix of the full dataset are calculated. Secondly, the eigenvectors and eigenvalues of the covariance matrix are computed and sorted in decreasing order of eigenvalue: F_1 with eigenvalue λ_1, F_2 with eigenvalue λ_2, and so on. The k eigenvectors with the largest eigenvalues are then retained; the largest eigenvalues correspond to the directions of largest variance in the dataset. Finally, an α × k matrix X is constructed whose columns are these k eigenvectors, and the data are projected into the lower-dimensional feature space of dimension k (k < α) by N = X^T M. It can be shown that this representation minimizes the squared-error criterion.
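The steps above can be sketched directly with NumPy; the data matrix is a random stand-in, and k is chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 50))  # 200 samples x 50 features (toy data)
k = 10

# 1. Centre the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigendecompose and sort eigenvectors by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1][:k]
components = eigvecs[:, order]              # 50 x k projection matrix

# 3. Project onto the k leading principal components.
X_reduced = Xc @ components                 # 200 x k
```

The variance of the projected data decreases column by column, mirroring the sorted eigenvalues.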

Random Projection Ensemble Classifier
Machine learning has been extensively applied in many fields. In mathematical statistics, RP is a method for reducing the dimensionality of a set of points that lie in Euclidean space. Compared with other methods, the RP method is simpler and has lower output error. It has been successfully used in the reconstruction of sparse signals, facial recognition, and textual and visual information retrieval. Now we introduce the RP algorithm in detail. Let Γ = {x_i}_{i=1}^{N} be a set of column vectors in the original high-dimensional data space, where x_i ∈ R^n, n is the high dimension, and N denotes the number of vectors. Dimensionality reduction embeds the vectors into a space R^q of lower dimension than R^n, where q << n. The output is the set of embedded column vectors {x̃_i}_{i=1}^{N} in the lower-dimensional space, where q is close to the intrinsic dimensionality of Γ [46,47]. The embedded vectors correspond one-to-one to the vectors in the set Γ.
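A quick numerical check of this distance-preserving property, with arbitrary toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 1000, 50                      # original and reduced dimensions (toy)
x1, x2 = rng.standard_normal((2, n))

# Gaussian random projection, scaled so that squared lengths are
# preserved in expectation (Johnson-Lindenstrauss).
R = rng.standard_normal((q, n)) / np.sqrt(q)
y1, y2 = R @ x1, R @ x2

# Pairwise distance before and after projection stays close.
ratio = np.linalg.norm(y1 - y2) / np.linalg.norm(x1 - x2)
print(ratio)  # close to 1.0
```

Even though the dimension drops from 1000 to 50, the distance between the two projected points deviates from the original by only a few percent on average, which is what makes RP usable as a cheap embedding for classifiers.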
Here, if we want to employ the RP model to reduce the dimensionality of Γ, a random vector set γ = {r_i}_{i=1}^{n} must be constructed first, where r_i ∈ R^q. There are two choices for constructing the random basis: (1) the vectors {r_i}_{i=1}^{n} are uniformly distributed on the q-dimensional unit sphere; (2) the components of {r_i}_{i=1}^{n} follow a Bernoulli ±1 distribution, and the vectors are normalized such that ‖r_i‖_2 = 1 for i = 1, ..., n.
We then generate a q × n matrix R whose columns are the vectors in γ; n was defined in the previous paragraph. The projection result x̃_i can be obtained by x̃_i = R x_i. In our method, RP is used to construct the training sets on which the classifiers are trained; this use of RP lays the foundation for our ensemble model. Now, we illustrate the theory of our ensemble method. A training set Γ is given as above. We form an n × N matrix G whose columns are the vectors in Γ.
We then construct {R_i}_{i=1}^{k}, each of size q × n, where q and n are as introduced above and k is the number of ensemble classifiers. The columns are standardized so that their l_2 norm is 1.
For the ensemble, we construct the training sets {R_i G}_{i=1}^{k}. They are then fed into the base classifier, and the outcomes are a group of classifiers {h_i}_{i=1}^{k}. To classify a new sample u with classifier h_i, u is first embedded into the target space R^q via R_i, that is, ũ = R_i u, where ũ is the result of the embedding. The classification of ũ is then obtained from h_i. In this ensemble algorithm, the RP ensemble classifier uses the classification results of all classifiers {h_i}_{i=1}^{k} on ũ to decide the final label with a majority voting scheme.
In this study, the 1000 coefficients were divided into 100 non-overlapping blocks of size 10. From each block, we chose the projection that obtained the smallest leave-one-out test error estimate. We used k-Nearest Neighbor (KNN) as the base classifier, where k = seq(1, 25, by = 3). The prior probability of interacting pairs estimated from the training samples was used as the voting threshold.
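A minimal sketch of the ensemble, assuming a Gaussian random projection matrix with unit-norm columns, a fixed k for KNN, and plain majority voting (the paper additionally selects projections by leave-one-out error and uses a prior-based voting threshold, which we omit here):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def rp_ensemble_predict(X_train, y_train, X_test, q=10, n_clf=5, seed=0):
    """Train n_clf KNN classifiers on independent random projections
    into R^q and combine their 0/1 predictions by majority vote."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[1]
    votes = []
    for _ in range(n_clf):
        R = rng.standard_normal((q, n))
        R /= np.linalg.norm(R, axis=0)          # columns with unit l2 norm
        clf = KNeighborsClassifier(n_neighbors=5)
        clf.fit(X_train @ R.T, y_train)         # train on projected data
        votes.append(clf.predict(X_test @ R.T))
    return (np.mean(votes, axis=0) >= 0.5).astype(int)

# Toy, well-separated binary data: two Gaussian clouds in 40 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((60, 40)),
               rng.standard_normal((60, 40)) + 2.0])
y = np.array([0] * 60 + [1] * 60)
pred = rp_ensemble_predict(X, y, X, q=10, n_clf=5)
```

Because each base classifier sees a different random view of the data, their individual instabilities tend to cancel out under the vote, which is the motivation given in the text.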

Datasets Construction
We collected the highly reliable Saccharomyces cerevisiae PPI dataset from the DIP database, version DIP_20071007 (http://dip.doe-mbi.ucla.edu) [48]. Protein sequences with fewer than 50 residues may be fragments, so we removed these protein pairs directly. In addition, protein pairs with high sequence identity are usually deemed homologous; to eliminate the effect of homologous sequence pairs, those with ≥40% sequence identity were also deleted. Finally, we used the remaining 5594 protein pairs as the positive PPI dataset and constructed a negative dataset with 5594 additional pairs from distinct subcellular localizations. The final Yeast PPI dataset in our experiment was composed of 11,188 protein pairs with 50% positive samples and 50% negative samples.
To evaluate the generality of our model, we also applied the proposed method to Human and Helicobacter pylori datasets. The Human dataset was gathered from the Human Protein Reference Database (HPRD). We removed protein pairs with ≥25% sequence identity. To construct a gold-standard positive dataset, we chose the remaining 3899 interacting protein pairs among 2502 different Human proteins. Because proteins in different subcellular fractions cannot interact with each other, we built a gold-standard negative dataset by selecting 4262 protein pairs among 661 distinct Human proteins [49]. Finally, the Human dataset was composed of 8161 protein pairs. Another PPI dataset used in this study consists of 2916 Helicobacter pylori protein pairs, as described by Martin et al. [12]; it contains 1458 interacting pairs and 1458 non-interacting pairs.

Evaluation Measurements
In order to assess the capability of the RP classifier, the accuracy (Acc), sensitivity (Sen), precision (PE), and Matthews correlation coefficient (MCC) were used as evaluation indexes. They are defined as follows:

Acc = (TP + TN)/(TP + FP + TN + FN)
Sen = TP/(TP + FN)
PE = TP/(TP + FP)
MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where true positive (TP) is the number of interacting pairs correctly predicted to interact; false negative (FN) is the number of interacting pairs predicted not to interact; false positive (FP) is the number of non-interacting pairs predicted to interact; and true negative (TN) is the number of non-interacting pairs correctly predicted not to interact.
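These definitions translate directly into code; the counts below are arbitrary illustrative numbers:

```python
def ppi_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, precision and MCC from confusion counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)
    pe = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / (
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    ) ** 0.5
    return acc, sen, pe, mcc

# Example: 50 TP, 10 FP, 30 TN, 10 FN.
acc, sen, pe, mcc = ppi_metrics(50, 10, 30, 10)
print(round(acc, 3), round(sen, 3), round(pe, 3), round(mcc, 3))
# 0.8 0.833 0.833 0.583
```

Note that MCC is undefined when any of the four marginal sums is zero; real evaluation code should guard that case.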

Experimental Environment
In this study, the presented sequence-based PPI prediction system was implemented using MATLAB (R2014a, The MathWorks, Inc., Natick, MA, USA) and the R programming language (x64 3.3.1, Copyright © 2016 The R Foundation for Statistical Computing). The experiments were run on a machine with a 2.4 GHz 2-core CPU and 8 GB of memory running Windows. An RP ensemble classifier was applied to train the datasets, with k-Nearest Neighbor (KNN) as the base classifier, where k = seq(1, 25, by = 3).

Performance of PPI Prediction
Three different PPI datasets were applied to estimate the results of our presented model. They are Yeast, Human and H. pylori PPI datasets, respectively.

Performance of the Proposed Method with Three Diverse PPI Datasets
In our problem of PPI prediction, the dimension of the input features is 2400, which may contain unimportant information and noise. Thus, the PCA algorithm is used to eliminate noise in the dataset. However, it is hard to determine the optimal number of features to use. Here, we carried out experiments to find the optimal PCA dimension. For example, for the H. pylori PPI dataset, the prediction performance of different PCA dimensions is shown in Table 1. As a result, the favorable number of PCA dimensions is 1000.

To avoid over-fitting and to verify the stability of the model, 5-fold cross-validation was used, which is a sub-sampling test method. More specifically, in 5-fold cross-validation, the entire dataset is split into 5 parts, where 4 parts are used as training samples and 1 part is used as testing samples. In this way, we obtain five models from the datasets, and each model is a separate experiment. The prediction results of the RP ensemble classifier on the three datasets, based on protein sequences and evolutionary information, are shown in Tables 2-4. The Receiver Operating Characteristic (ROC) curves for the Yeast, Human, and H. pylori PPI datasets with 5-fold cross-validation are shown in Figures 1-3, respectively. We computed the average AUC (area under the ROC curve) values for the Yeast, Human, and H. pylori PPI datasets to be 0.9570, 0.9615, and 0.8287, respectively. In conclusion, the high accuracies and low standard deviations of these values indicate that our presented approach is feasible and reasonable for detecting PPIs. Compared with the Human and Yeast PPI datasets, the prediction performance on the H. pylori dataset is lower. It should be noticed that the sample sizes of the Human, Yeast, and H. pylori datasets are 8161, 11,188, and 2916, respectively.
We found that the prediction performance, as indicated by the accuracy score, improves as the size of the samples increases.
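The 5-fold split used above can be sketched with scikit-learn; the feature matrix and labels here are random stand-ins:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 8))   # toy feature matrix
y = rng.integers(0, 2, size=100)    # toy labels

# Each sample lands in exactly one test fold; the other four folds train.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_sizes = []
for train_idx, test_idx in kf.split(X):
    assert set(train_idx).isdisjoint(test_idx)
    test_sizes.append(len(test_idx))
print(test_sizes)  # [20, 20, 20, 20, 20]
```

Averaging the metric over the five held-out folds gives the mean and standard deviation figures reported in Tables 2-4.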

Performance Comparison between the RP Classifier and the SVM Model
Many machine learning techniques and algorithms are employed to predict PPIs. We compared our RP model with the state-of-the-art SVM model. During the experiment, we extracted the feature values with the same method to ensure fairness. We used the LIBSVM toolbox from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. The radial basis function (RBF) kernel was applied in our experiments. A grid search was employed to optimize the two kernel parameters C and g.
For the Yeast PPI dataset, we used the optimized parameters C = 0.3 and g = 1. The obtained results are shown in Table 5. For the Human and H. pylori PPI datasets, the optimized penalty parameters are 0.06 and 0.5, and the kernel function parameters are 2 and 0.3, respectively. When we predicted PPIs by applying the SVM classifier to the Yeast dataset, we obtained an average Acc, PE, Sen, and MCC of 95.06%, 95.35%, 94.76%, and 90.60%, respectively; compared with the SVM classifier, the accuracy of our method is higher by about 0.58%. On the Human PPI dataset, the SVM classifier achieved an average accuracy, precision, sensitivity, and MCC of 95.19%, 94.91%, 95.04%, and 90.84%, respectively; compared with the SVM classifier, the accuracy of our method was higher by about 1.40%. On the H. pylori dataset, the SVM classifier achieved an 85.05% average accuracy and a 91.92% precision, with 76.80% sensitivity and 74.27% MCC; compared with the SVM classifier, the accuracy of our method was higher by about 2.57%.
Based on the data in Table 5, the average accuracy, precision, and sensitivity values of our proposed model are much higher than the averages attained by the SVM approach. The higher the standard deviation, the more unstable the algorithm. Furthermore, we plotted the ROC curves for the three datasets using the SVM classifier, as shown in Figures 4-6. It can be seen that the AUC areas yielded by our method are higher than those of the SVM classifier.

Comparison with Other Methods
There have been many prediction approaches developed for detecting PPIs. To further estimate the capacity of the proposed model, we compared our method with other existing methods. Table 6 shows the results of diverse approaches on the Yeast PPI dataset. The accuracies obtained by previous methods range from 86.15% to 94.72%, whereas our method achieves an average accuracy of 95.64%. Our method also obtained a higher PE and Sen than the eight other methods. Similarly, we compared our method with five other existing methods on the Human dataset. Based on Table 7, the accuracies obtained by previous methods range from 89.30% to 95.70%, whereas our method achieves an average accuracy of 96.59%. Accordingly, our method outperforms most other approaches, including those also based on ensemble classifiers.

Discussion and Conclusions
In the post-genome era, it is quite important to predict PPIs using computational techniques. In this study, we proposed a PPI prediction model that extracts evolutionary information from the position-specific scoring matrix (PSSM) generated by PSI-BLAST. An RP ensemble classifier is then used to perform the PPI prediction. We conducted experiments on the Yeast, Human, and H. pylori PPI datasets. To evaluate the capacity of our model, we compared our approach with an SVM-based model as well as other existing methods. The results of our model are quite promising; our model is a beneficial supplement to traditional experimental methods for PPI prediction. Moreover, the RPEC method may also be employed to solve other classification problems.