A Novel Ensemble Learning-Based Computational Method to Predict Protein-Protein Interactions from Protein Primary Sequences

Simple Summary
Protein–protein interactions (PPIs) play a central role in the evolution and progression of various biological processes. In this article, we constructed a novel ensemble-learning-based model that predicts potential PPIs using only protein sequence information. The presented method used the Discrete Hilbert transform to extract amino acid sequence information from position-specific scoring matrices. These extracted features were then fed into a rotation forest for training and prediction. When applying our method to three datasets (Yeast, Human, and Oryza sativa) for detecting PPIs, we obtained prediction accuracies above 91%. Furthermore, the comparison results indicated that our computational model is effective and robust in predicting potential PPI pairs.

Abstract
Protein–protein interactions (PPIs) are crucial for understanding cellular processes, including signal cascades, DNA transcription, metabolic cycles, and repair. In the past decade, a multitude of high-throughput methods have been introduced to detect PPIs. However, these techniques are time-consuming and laborious, and they often suffer from high false-negative rates. Therefore, there is a great need for new computational methods as a supplemental tool for PPI prediction. In this article, we present a novel sequence-based model to predict PPIs that combines the Discrete Hilbert transform (DHT) and Rotation Forest (RoF). The method contains three stages. First, a Position-Specific Scoring Matrix (PSSM) was used to represent each amino acid sequence, which captures rich information about protein evolution. Then, a 400-dimensional DHT descriptor was constructed for each protein pair. Finally, these feature descriptors were fed to the RoF classifier to identify potential PPIs. When evaluated on the Yeast, Human, and Oryza sativa PPI datasets, the proposed model yielded prediction accuracies of 91.93, 96.35, and 94.24%, respectively. In addition, we conducted experiments on cross-species PPI datasets, on which the method also showed strong predictive capacity. To further assess the prediction ability of the proposed approach, we compared RoF with four powerful classifiers: Support Vector Machine (SVM), Random Forest (RF), K-nearest Neighbor (KNN), and AdaBoost. We also compared it with several existing state-of-the-art methods. These comprehensive experimental results further confirm the effectiveness and feasibility of the proposed approach. In future work, we hope it can serve as a supplemental tool for proteomics analysis.


Introduction
Predicting protein-protein interactions (PPIs) is essential for elucidating protein functions and understanding the biological structures in cells [1]. Additionally, the prediction of PPIs not only helps researchers examine how proteins exert their various functions, but also provides crucial information for the design of targeted drugs. Over the past decades, many biological experimental approaches, including mass spectrometry [2], tandem affinity purification [3], and yeast two-hybrid screening [4], have been extensively studied. However, these conventional approaches have drawbacks: they are costly and time-intensive, and they suffer from high rates of false positives and false negatives. Accordingly, the development of novel computational approaches to predict potential PPI pairs would be of enormous value to biologists [5].
To date, several computational methods for PPI prediction have been presented. In general, these methods can be broadly grouped into three types: ligand-based approaches, structure-based methods, and sequence-based methods. Typically, the sequence-based methods do not perform as well as the first two types, while the ligand-based and structure-based approaches usually require a priori information about the proteins, which poses a problem when such information is unavailable. In recent years, following the advancement of genome technologies, a large amount of protein sequence data has been collected and deposited in databases. Therefore, sequence-based methods for identifying PPIs have attracted increasing attention. The vast majority of existing computational methods are based on machine learning algorithms, including rotation forest [6], support vector machine [7,8], and Naive Bayes [9]. For example, Huang et al. [10] adopted discrete cosine transform descriptors and a weighted sparse representation model to predict PPIs from protein sequences. You et al. [11] proposed a method called PCA-EELM, which utilized four different types of sequence information to predict PPIs. Li et al. [12] proposed a method called PSIPEL that combined a novel feature extraction approach, Low Rank Approximation, with Rotation Forest to predict PPIs from protein primary sequences. Zeng et al. [13] developed a deep learning framework to predict PPIs, which employed a sliding window and a text convolutional neural network to capture local contextual and global sequence features from target proteins, respectively. Chen et al. [14] applied the Fast Fourier Transform to capture protein feature descriptors and fed them to Random Projection for training and detecting self-interacting proteins [15].
Unlike traditional machine-learning-based methods, deep-learning-based approaches can not only extract feature vectors from the protein sequence directly, but also capture their nonlinear relationships to improve prediction performance. As a consequence, deep learning algorithms have also been widely employed in PPI prediction in recent years. For example, Sun et al. [16] were the first to adopt a deep learning technique, the stacked autoencoder, for predicting human PPIs from amino acid sequences. Zhang et al. [17] presented Ensemble Deep Neural Networks (EnsDNN), a neural-network-based method that employs different protein descriptors to detect PPIs. Yao et al. [18] designed a novel method called Res2vec to represent protein sequences; the residue representation was then integrated into a deep neural network for training and prediction. Hashemifar et al. [19] developed a method named DPPI, which combined data augmentation, a convolutional neural network, and random projection to predict PPIs. Richoux et al. [20] compared two powerful deep learning models and discussed the points that require attention when applying deep learning algorithms to PPI prediction. Despite these achievements, there is still considerable room for improvement in these computational approaches [21].
Inspired by these works, we herein attempted to develop a new computational model to predict potential PPIs from amino acid sequence information. Specifically, we first transformed each sequence into a position-specific scoring matrix (PSSM), which preserves the evolutionary information of the primary protein sequence. Then the Discrete Hilbert transform (DHT) algorithm was adopted to capture feature descriptors from the PSSM. Finally, the Rotation Forest (RoF) classifier was used for training and for determining whether two proteins interact. To assess the predictive ability of our approach, we evaluated it on the Yeast, Human, and Oryza sativa PPI datasets, obtaining prediction accuracies of 91.93, 96.35, and 94.24%, respectively. Moreover, we compared our approach with several existing sequence-based methods. We also applied it to four independent PPI datasets. Experimental results demonstrated that our method is effective for identifying whether protein pairs interact, and it can be considered a supplemental tool to the commonly used experimental methods.

Protein Interaction Dataset
In this article, the presented approach was first validated on a high-confidence PPI dataset named Yeast, which was selected from the Database of Interacting Proteins (DIP) [22] by Guo et al. [23]. This dataset was collected from the Saccharomyces cerevisiae core subset, which contains 5996 interaction pairs. To remove redundant information, CD-Hit [24,25] was employed in this work. CD-Hit is a sequence clustering tool that can be used to remove homologous sequences. After removing the protein pairs that had ≥40% sequence identity and the fragments with fewer than 50 residues, we obtained 5594 protein pairs as the positive samples. To construct the negative dataset, we randomly chose 5594 additional Yeast pairs whose proteins come from different subcellular compartments. Accordingly, the final Yeast PPI dataset contained 11,188 protein pairs.
To demonstrate the generality of the proposed approach, we also evaluated it on the Human and Rice (Oryza sativa) PPI datasets. The Human dataset was selected from the Human Protein Reference Database (HPRD) [26]. After removing sequences with greater than 25% sequence identity, we used 3899 interaction pairs, collected from 2502 different human proteins, as the positive samples. The negative samples of the Human dataset were constructed with the same approach as for the Yeast dataset; the resulting negative set consisted of 4262 pairs from 661 proteins. In addition, the Oryza sativa dataset was collected from the PRIN [27] database. It consists of 4800 positive samples and 4800 negative samples.
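The negative-set construction described above (random pairs whose two proteins lie in different subcellular compartments) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the `compartments` mapping, and the assumption that each protein carries a single compartment label are all hypothetical.

```python
import random


def sample_negative_pairs(compartments, positive_pairs, n_pairs, seed=0):
    """Randomly draw non-interacting candidate pairs.

    compartments   : dict mapping protein ID -> subcellular compartment
                     (hypothetical input; one label per protein is assumed)
    positive_pairs : iterable of (protein_a, protein_b) known interactions
    n_pairs        : number of negative pairs to draw
    """
    rng = random.Random(seed)
    proteins = sorted(compartments)
    known = {frozenset(pair) for pair in positive_pairs}

    # Keep only pairs from different compartments that are not known positives.
    candidates = [
        (a, b)
        for i, a in enumerate(proteins)
        for b in proteins[i + 1:]
        if compartments[a] != compartments[b] and frozenset((a, b)) not in known
    ]
    if n_pairs > len(candidates):
        raise ValueError("not enough candidate pairs to sample from")
    return rng.sample(candidates, n_pairs)
```

Enumerating every candidate pair is quadratic in the number of proteins, which is acceptable for a sketch; a rejection-sampling loop would scale better to proteome-sized inputs.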

Encoding Amino Acid Sequences as a Data Matrix
The Position-Specific Scoring Matrix (PSSM) was adopted to represent the protein sequence. It was introduced by Gribskov et al. [28] to analyze sequence similarities among proteins. PSSMs have produced excellent results in many fields, such as protein secondary structure prediction [29], disorder region prediction [30], and DNA function prediction [31]. A PSSM is a matrix that can be represented as $\mathrm{PSSM} = \{\phi_{m,n} : m = 1, \dots, b \text{ and } n = 1, \dots, 20\}$, where $b$ denotes the length of the protein sequence and the number 20 corresponds to the 20 amino acids. Each entry $\phi_{m,n}$ is computed from $P(m, q)$, the frequency of the $q$-th amino acid at position $m$ of the probe, and $w(n, q)$, the value of the Dayhoff mutation matrix between the $n$-th and $q$-th amino acids. The main appeal of the PSSM is that it scores each sequence position against the alignment: a high score indicates a highly conserved position, while a low score indicates a weakly conserved position.
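For reference, the PSSM entry can be written out under the assumption that it follows the standard Gribskov profile form, taking the first argument of $P$ as the sequence position $m$ and the first argument of $w$ as the amino-acid index $n$. This is a reconstruction from the definitions given here, not a formula quoted from the source:

```latex
\phi_{m,n} = \sum_{q=1}^{20} P(m, q)\, w(n, q),
\qquad m = 1, \dots, b, \quad n = 1, \dots, 20
```

Under this reading, each row of the PSSM is a 20-dimensional score vector for one sequence position, so a protein of length $b$ yields a $b \times 20$ matrix.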
In this work, the PSI-BLAST tool was applied to transform each protein sequence into a PSSM. BLAST is a widely used tool for finding regions of local similarity between sequences. It compares protein or nucleotide sequences against sequence databases and computes the statistical significance of the matches, which helps infer functional and evolutionary relationships between sequences. PSI-BLAST is an enhanced BLAST technique that can robustly identify novel proteins in distantly related organisms. Its main improvement is that it uses a profile to search the non-redundant SWISS-PROT database, then uses the search results to rebuild the profile, and repeats this process until no new results are generated. SWISS-PROT is an annotated protein sequence database whose entries are manually reviewed by expert curators. To make the best use of the PSI-BLAST algorithm, we ran three iterations, set the e-value parameter to 0.001, and selected PAM as the scoring matrix. The other parameters were left at their default values.
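The PSI-BLAST settings above can be expressed as a command line for the NCBI BLAST+ `psiblast` program. The sketch below only assembles the argument list; the file names and database name are placeholders, mapping the stated e-value to `-evalue` (rather than `-inclusion_ethresh`) is an assumption, and the scoring matrix is left as an optional argument because the text does not say which PAM matrix was used.

```python
def build_psiblast_command(query_fasta, database, pssm_out, matrix=None):
    """Assemble a psiblast call with three iterations and e-value 0.001.

    query_fasta, database and pssm_out are placeholder paths/names.
    matrix is passed through to -matrix only when given, since the
    specific PAM matrix is not stated in the text.
    """
    cmd = [
        "psiblast",
        "-query", query_fasta,        # single-sequence FASTA file
        "-db", database,              # pre-formatted protein database
        "-num_iterations", "3",       # three PSI-BLAST rounds
        "-evalue", "0.001",           # assumed mapping of the e-value setting
        "-out_ascii_pssm", pssm_out,  # write the final PSSM as text
    ]
    if matrix is not None:
        cmd += ["-matrix", matrix]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and a formatted database are installed.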

Discrete Hilbert Transform
In this work, the Discrete Hilbert transform (DHT) [32] was adopted to extract feature values from the PSSM and generate the feature vectors used for prediction. The DHT was originally used to analyze signals in the time and frequency domains. Before introducing the 2-D DHT, we first describe the 1-D DHT in the spatial and frequency domains. Let $f(a)$ represent a discrete signal and $\hat{f}(a)$ its discrete Hilbert transform. In the frequency domain, $\hat{f}(a)$ is obtained by applying the Inverse Discrete Fourier transform (IDFT) [33] to the product of a transfer function $H(j\Omega)$ and $F(j\Omega)$, where $F(j\Omega)$ and $\hat{F}(j\Omega)$ denote the Fourier transforms of $f(a)$ and $\hat{f}(a)$, respectively. The function $H(j\Omega)$ is defined in terms of the angular frequency $\Omega$ and the finite discrete signum function $\mathrm{sgn}(\Omega)$. To better capture feature vectors from the PSSM, we applied the 2-D DHT [34], which is defined in the frequency domain. The odd and even parts of the PSSM features in the frequency domain refer to the highly conserved order of amino acids within a particular protein sequence. Suppose that the odd and even parts of the PSSM features in the frequency domain are denoted by $f_o(x, y)$ and $f_e(x, y)$, respectively. The 2-D DHT is then expressed in terms of $\mathrm{bdy}(x, y)$, which adjusts the boundary, and the finite discrete signum function $\mathrm{sgn}(x, y)$; $H_1$ and $H_2$ represent the sizes of $f_o(x, y)$ and $f_e(x, y)$, respectively.
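The frequency-domain construction of the 1-D DHT can be illustrated with a short sketch. It assumes the standard Hilbert transfer function $H(j\Omega) = -j\,\mathrm{sgn}(\Omega)$ and uses a naive DFT from the standard library so that it stays self-contained. It covers only the 1-D case and is not the authors' 2-D descriptor; all function names are illustrative.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), adequate for a sketch."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def discrete_sgn(k, n):
    """Finite discrete signum over DFT bins.

    +1 for positive-frequency bins, -1 for negative-frequency bins,
    0 at the DC bin and (for even n) at the Nyquist bin.
    """
    if k == 0 or (n % 2 == 0 and k == n // 2):
        return 0
    return 1 if k < n / 2 else -1


def hilbert_1d(signal):
    """1-D DHT: IDFT of H(jw) * F(jw) with the assumed H(jw) = -j * sgn(w)."""
    n = len(signal)
    spectrum = dft(signal)
    shifted = [-1j * discrete_sgn(k, n) * spectrum[k] for k in range(n)]
    # For a real input the result is real up to rounding error.
    return [value.real for value in idft(shifted)]
```

A quick sanity check of the sketch is that a sampled cosine with an integer number of periods maps to the corresponding sine, which is the textbook behaviour of the Hilbert transform.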

Ensemble-Learning-Based Classifier
Rotation forest (RoF) is an ensemble learning algorithm introduced by Rodriguez et al. [35] to improve both the diversity and the accuracy of the individual classifiers in an ensemble. The main contribution of the RoF algorithm is that it applies principal component analysis (PCA) to construct a rotation matrix, which transforms the initial variables into new variables that are used to build independent decision trees. Moreover, the PCA algorithm ensures the diversity of the classifiers while retaining most of the evolutionary information in the protein feature descriptors [36]. The framework of this algorithm is summarized as follows.
Let $T$ represent the training sample set, $H$ denote the feature set, and $E$ be the corresponding labels. Let $\alpha = \{\alpha_1, \alpha_2\}$ be the set of class labels from which $E$ takes values. Assume that $T$ is an $N \times n$ matrix, where $N$ and $n$ are the numbers of training samples and features in the PPI dataset, respectively. The feature set is divided randomly into $K$ subsets of approximately equal size, and the ensemble contains $L$ decision trees, denoted $D_1, \dots, D_L$. In the RoF algorithm, $K$ and $L$ are the two parameters that must be optimized in advance. The RoF algorithm proceeds as follows:
(1) Divide the feature set $H$ randomly into $K$ subsets. Assuming that $K$ is a factor of $n$, each subset contains $u = n/K$ features.
(2) Let $H_{ij}$ represent the $j$-th subset of features for training classifier $D_i$, and let $T_{ij}$ be the dataset $T$ restricted to the features in $H_{ij}$. A bootstrap subset containing 75% of the samples is drawn from $T_{ij}$ to form a new training set $T'_{ij}$. PCA is then applied to $T'_{ij}$, and the resulting coefficients are stored in a matrix $C_{ij}$.
(3) Use the coefficients in $C_{ij}$ to build a sparse rotation matrix $R_i$, whose columns are then rearranged to match the original feature order, giving $R_i^a$.
In the classification stage, for a given test sample $x$, let $d_{ij}(x R_i^a)$ denote the probability assigned by classifier $D_i$ to the hypothesis that $x$ belongs to class $\alpha_j$. The confidence level of each class is then obtained by mean combination, that is, by averaging $d_{ij}(x R_i^a)$ over the $L$ classifiers, and $x$ is assigned to the class with the highest confidence.
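Two steps of the RoF procedure lend themselves to a compact sketch: assembling the sparse rotation matrix from per-subset PCA coefficients, and combining the trees' class probabilities by their mean. The code below assumes the PCA coefficient blocks have already been computed elsewhere; the function names and the list-of-lists matrix layout are illustrative choices, not the authors' implementation.

```python
def build_rotation_matrix(n_features, subsets, coefficient_blocks):
    """Place each subset's PCA coefficients into one sparse rotation matrix.

    subsets[j]            : feature indices belonging to subset j
    coefficient_blocks[j] : PCA coefficients for subset j as a list of rows,
                            one row per feature in subsets[j] (same order),
                            one column per retained component
    Rows of the result follow the original feature order, so a sample x in
    the original feature order can be rotated directly.
    """
    total_components = sum(len(block[0]) for block in coefficient_blocks)
    rotation = [[0.0] * total_components for _ in range(n_features)]
    column = 0
    for features, block in zip(subsets, coefficient_blocks):
        for row, feature in enumerate(features):
            for comp, value in enumerate(block[row]):
                rotation[feature][column + comp] = value
        column += len(block[0])
    return rotation


def rotate(sample, rotation):
    """Compute the row-vector product sample @ rotation."""
    n_columns = len(rotation[0])
    return [sum(sample[f] * rotation[f][c] for f in range(len(sample)))
            for c in range(n_columns)]


def mean_combination(probabilities):
    """Average per-tree class probabilities and pick the most confident class.

    probabilities[i][j] is the probability tree i assigns to class j.
    Returns (confidences, index_of_best_class).
    """
    n_trees = len(probabilities)
    n_classes = len(probabilities[0])
    confidences = [sum(p[j] for p in probabilities) / n_trees
                   for j in range(n_classes)]
    best = max(range(n_classes), key=lambda j: confidences[j])
    return confidences, best
```

In a full implementation each tree $D_i$ would be trained on the rotated training set and queried with `rotate(x, rotation_i)`; a decision-tree learner and a PCA routine would have to be supplied separately.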

Evaluation Measures
In this study, to prevent over-fitting from distorting the estimated performance of the proposed method, we used five-fold cross-validation (five-fold CV). All samples were randomly split into five subsets, of which four were used as the training set and the remaining one as the test set. This procedure was performed five times so that each subset was used exactly once as the test set. The averages and standard deviations over these five experiments were taken as the final results. Several evaluation criteria were employed to assess the predictive ability of the proposed model, including accuracy (ACC), sensitivity (Sen.), specificity (Spec.), precision (PR), and Matthews' correlation coefficient (MCC). They are calculated as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$
$$\mathrm{Spec.} = \frac{TN}{TN + FP}$$
$$\mathrm{PR} = \frac{TP}{TP + FP}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where true positive (TP) is the number of true interacting pairs that are correctly identified; false positive (FP) is the number of true non-interacting pairs falsely predicted to be PPIs; true negative (TN) is the number of true non-interacting pairs that are correctly identified; and false negative (FN) is the number of true interacting pairs incorrectly categorized as non-interacting. Additionally, receiver operating characteristic (ROC) curves were plotted to illustrate the predictive power of our method, and the area under the ROC curve (AUC) was calculated to summarize each curve as a single value.
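The evaluation protocol above can be sketched in a few lines: one helper computes the five measures from the confusion counts using their standard definitions, and another produces the shuffled five-fold index splits. The function names and the fixed random seed are illustrative assumptions, not details taken from the text.

```python
import math
import random


def classification_metrics(tp, fp, tn, fn):
    """Return ACC, Sen., Spec., PR and MCC from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes occur and
    the model predicts both classes at least once.
    """
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "Sen": tp / (tp + fn),
        "Spec": tn / (tn + fp),
        "PR": tp / (tp + fp),
        "MCC": (tp * tn - fp * fn) / mcc_denominator,
    }


def k_fold_splits(n_samples, k=5, seed=0):
    """Shuffle sample indices and return k (train_indices, test_indices) pairs.

    Each sample lands in exactly one test fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

Averaging the per-fold dictionaries and taking their standard deviations then gives the kind of summary reported for each dataset.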

Prediction Performance on Three PPIs Datasets
In this study, we first validated our model on the Yeast dataset; Table 1 summarizes the results of the five-fold cross-validation (five-fold CV) experiment. As Table 1 shows, the average accuracy, sensitivity, specificity, precision, and MCC values were 91.93%, 89.78%, 94.05%, 93.82%, and 85.14%, with standard deviations of 0.69%, 0.79%, 1.30%, 1.19%, and 1.15%, respectively. The proposed method was then applied to the Human PPI dataset. As shown in Table 2, it again performed well, with average accuracy, sensitivity, specificity, precision, and MCC values of 96.35%, 95.76%, 96.87%, 96.57%, and 92.95%, and standard deviations of 0.56%, 0.78%, 0.71%, 0.64%, and 1.03%, respectively. To further demonstrate the robustness of the proposed model, we finally applied it to a plant PPI dataset, Oryza sativa. On this dataset, the average accuracy, sensitivity, specificity, precision, and MCC values, shown in Table 3, were 94.24%, 94.50%, 94.02%, 94.02%, and 89.14%, with standard deviations of 0.37%, 0.97%, 0.82%, 1.03%, and 0.66%, respectively. The receiver operating characteristic (ROC) curves for the three benchmark datasets are shown in Figures 1-3. To further evaluate the predictive power of our model, we also calculated the area under the ROC curve (AUC) for these three PPI datasets, obtaining 0.9586, 0.9831, and 0.9667, respectively.

Comparison with Different Classifier Models
To date, many machine learning algorithms have been developed for detecting PPIs. To further verify the prediction accuracy of the proposed model, we compared it with several popular classifiers, including Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), and AdaBoost. Specifically, we used the same DHT descriptors for all classifiers and compared their predictive performance with that of RoF. We used the LIBSVM tool to train the SVM-based model and make predictions, and a grid search was adopted to select the SVM parameters c and g. We set c = 13, g = 0.0006 for the Yeast dataset and c = 3, g = 0.0005 for the Human dataset. For the Oryza sativa dataset, we set c = 7, g = 0.0009. The parameter K of the RF classifier for the Yeast, Human, and Oryza sativa datasets was 27, 7, and 17, respectively. The parameters of the KNN model are the number of neighbors and the distance measure. In this article, all the experiments used the Manhattan distance, and the number of neighbors for the three PPI datasets was 15, 17, and 4, respectively. Table 4 details the prediction results of these four state-of-the-art classifiers on the Yeast, Human, and Oryza sativa datasets. To identify any potential overfitting or underfitting problems in the proposed model, we also used a train/test/validation process for predicting these datasets. The experimental results based on this approach can be seen in our Supplementary Materials Tables S1-S4. As shown in Table 4, the proposed method provided the best results on the three PPI datasets in terms of all the metrics. Relative to the best competing classifier, the smallest accuracy improvement was 7.49% on the Yeast dataset, 1.03% on the Human dataset, and 8.66% on the Oryza sativa dataset. The smallest AUC improvement was 3.34% on the Yeast dataset, 0.13% on the Human dataset, and 4.17% on the Oryza sativa dataset.
For visual analysis, Figure 4 shows a histogram of the ACC and AUC values produced by these classifiers. These experimental results further demonstrated that rotation forest performed best among the compared classifiers on the features that we introduced.
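Of the baseline classifiers, KNN with the Manhattan distance is simple enough to sketch directly. The code below is a minimal majority-vote implementation for illustration only; the function names are hypothetical, and the actual experiments may have relied on a library implementation with different tie-breaking.

```python
from collections import Counter


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_features, train_labels, query, n_neighbors):
    """Predict the label of `query` by majority vote among its nearest neighbors.

    Distances are measured with the Manhattan metric. Ties in the vote are
    resolved in favour of the label seen first among the nearest neighbors.
    """
    ranked = sorted(
        zip(train_features, train_labels),
        key=lambda pair: manhattan_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]
```

With the 400-dimensional DHT descriptors as `train_features`, `n_neighbors` would be set per dataset, for example to the values 15, 17, and 4 reported above.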

Evaluation of Prediction Ability on Four Independent Datasets
Although the proposed model achieved satisfactory results on the Yeast, Human, and Oryza sativa PPI datasets, we also applied it to four independent datasets, H. sapiens, H. pylori, M. musculus, and C. elegans, to further demonstrate the suitability of our method. In this experiment, we used the entire Yeast dataset as the training set and the four independent datasets as test sets in order to verify the robustness of the proposed method. In addition, we compared its predictive performance with that of several existing approaches. Table 5 summarizes the accuracy comparison between our model and these existing methods on the four datasets. The prediction accuracies of our method on the H. sapiens, H. pylori, M. musculus, and C. elegans datasets were all higher than 91%: 94.27, 91.67, 93.12, and 92.14%, respectively. These experimental results further indicated that our method has a strong generalization ability for predicting PPIs. (In Table 5, N/A means not available.)

Comparison with Existing Methods
In recent years, various computational methods have been proposed for predicting potential protein-protein interactions. Here, we compared the prediction ability of the proposed model with that of several popular methods on the Yeast and Human datasets, which were also evaluated using five-fold cross-validation. Tables 6 and 7 list the predictive performance of these methods under several common evaluation criteria, including accuracy, precision, sensitivity, and MCC. Table 6 shows that our method achieved an accuracy of 91.93% on the Yeast dataset, with a precision of 93.82%, a sensitivity of 89.78%, and an MCC of 85.14%. The average accuracies of the selected methods on the Yeast dataset are all lower than that of our method. Table 7 summarizes the average results of the collected approaches on the Human dataset; their accuracies lie between 90.57 and 96.09%, while the average accuracy of our method is 96.35%. These results further indicated that combining the DHT descriptor with the rotation forest classifier is effective for PPI prediction. (In Tables 6 and 7, N/A means not available.)

Discussion
The identification of protein-protein interactions (PPIs) can provide a novel perspective for clinical diagnosis and treatment. PPIs also play an important role in inter-cellular and intra-cellular functions and in inter-molecular connectivity. In this article, we presented a novel ensemble-learning-based method to predict potential PPIs that uses only amino acid sequence information. There are four reasons why the proposed model performs well. First, all protein sequence data were preprocessed to remove short fragments and redundant sequences. Second, the target protein sequences were converted into features by the PSSM technique, which embeds the evolutionary information in the form of a matrix. Third, the Discrete Hilbert transform (DHT) algorithm was employed to extract the feature descriptors from the PSSM. In this way, the proposed model can capture high-dimensional and complex latent information to improve the prediction performance. Finally, the ensemble-learning-based classifier, rotation forest (RoF), was used to handle the classification problem. We evaluated our method on three PPI datasets (Yeast, Human, and Oryza sativa) under five-fold cross-validation. To further demonstrate the prediction ability of our method, we also applied it to four independent cross-species datasets and compared it with several existing methods. The comprehensive experimental results indicated that our model can serve as a powerful tool to guide researchers in studying the functions and roles of proteins. However, our work still has a limitation. The negative datasets were randomly selected from pairs presumed to be non-interacting, so they may include false negatives, that is, pairs that do in fact interact. This has the potential to affect the prediction accuracy of the developed model.
In future work, we will further investigate the DHT algorithm, which is well suited to problems involving large feature dimensions and a small number of training samples, in the hope of better solving the problem of protein-protein interaction prediction.

Conclusions
In this study, we proposed a novel ensemble-learning-based model for sequence-based PPI prediction. We conducted comprehensive experiments on three gold-standard datasets. Furthermore, we performed independent validation on four cross-species PPI datasets. Experimental results based on cross-validation and on comparisons with other classifiers and existing methods indicated that our method is effective and robust in predicting PPIs.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biology11050775/s1, Table S1. The optimal model parameters (K and L) on the three PPIs datasets.

Author Contributions:
Conceptualization, J.P. and Y.S.; methodology, software, validation, and formal analysis, C.Y. and Z.Y.; investigation, resources, and data curation, L.L. and S.W.; writing (original draft preparation), J.P.; writing (review and editing), J.P.; visualization, J.P. and Y.S.; supervision, Z.Y.; project administration, L.L.; funding acquisition, Z.Y. and Y.S. All authors have read and agreed to the published version of the manuscript.