Article

A Novel Ensemble Learning-Based Computational Method to Predict Protein-Protein Interactions from Protein Primary Sequences

1
Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, College of Life Science, Northwest University, Xi’an 710069, China
2
School of Information Engineering, Xijing University, Xi’an 710123, China
3
College of Grassland and Environment Sciences, Xinjiang Agricultural University, Urumqi 830052, China
4
School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
*
Authors to whom correspondence should be addressed.
Biology 2022, 11(5), 775; https://doi.org/10.3390/biology11050775
Submission received: 13 April 2022 / Revised: 10 May 2022 / Accepted: 11 May 2022 / Published: 19 May 2022
(This article belongs to the Special Issue Intelligent Computing in Biology and Medicine)

Abstract

Simple Summary

Protein–protein interactions (PPIs) play a central role in the development and progression of various biological processes. In this article, we constructed a novel ensemble-learning-based model that predicts potential PPIs using only protein sequence information. The method uses the Discrete Hilbert transform to extract features from the position-specific scoring matrices of amino acid sequences. The extracted features are then fed into a rotation forest classifier for training and prediction. When we applied the method to three datasets (Yeast, Human, and Oryza sativa), it achieved excellent prediction performance. Furthermore, the comparison results indicate that our computational model is effective and robust in predicting potential PPI pairs.

Abstract

Protein–protein interactions (PPIs) are crucial for understanding cellular processes, including signal cascades, DNA transcription, metabolic cycles, and repair. In the past decade, a multitude of high-throughput methods have been introduced to detect PPIs. However, these techniques are time-consuming and laborious, and they often suffer from high false negative rates. Therefore, there is a great need for new computational methods as a supplementary tool for PPI prediction. In this article, we present a novel sequence-based model to predict PPIs that combines the Discrete Hilbert transform (DHT) with Rotation Forest (RoF). The method consists of three stages. First, each amino acid sequence is transformed into a Position-Specific Scoring Matrix (PSSM), which contains rich information about protein evolution. Then, a 400-dimensional DHT descriptor is constructed for each protein pair. Finally, these feature descriptors are fed to the RoF classifier to identify potential PPIs. On the Yeast, Human, and Oryza sativa PPI datasets, the proposed model yielded prediction accuracies of 91.93%, 96.35%, and 94.24%, respectively. We also conducted experiments on cross-species PPI datasets, on which the method likewise showed high predictive capacity. To further assess the prediction ability of the proposed approach, we compared RoF with four powerful classifiers: Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), and AdaBoost. We also compared it with several existing methods. These comprehensive experimental results further confirm the effectiveness and feasibility of the proposed approach. In future work, we hope it can serve as a supplementary tool for proteomics analysis.

1. Introduction

Predicting protein–protein interactions (PPIs) is essential for elucidating protein functions and understanding the biological structures in cells [1]. Additionally, the prediction of PPIs not only helps researchers to examine how proteins exert their various functions, but also provides crucial information for the design of targeted drugs. Over the past decades, many experimental approaches, including mass spectrometry [2], tandem affinity purification [3], and yeast two-hybrid screening [4], have been extensively studied. However, these conventional techniques have drawbacks: they are costly and time-intensive, and they suffer from high rates of false positives and false negatives. Accordingly, the development of novel computational approaches to predict potential PPI pairs would be of enormous value to biologists [5].
To date, several computational methods for PPI prediction have been presented. In general, these methods can be broadly grouped into three types: ligand-based approaches, structure-based methods, and sequence-based methods. Typically, the sequence-based methods do not perform as well as the first two types, but the ligand-based and structure-based approaches usually need a priori information about the proteins and cannot be applied when this information does not exist. In recent years, following the advancement of genome technologies, a large amount of protein sequence data has been collected and deposited in databases. Therefore, sequence-based methods for identifying PPIs have attracted increasing attention. The vast majority of the existing computational methods are based on machine learning algorithms, including rotation forest [6], support vector machine [7,8], and Naive Bayes [9]. For example, Huang et al. [10] adopted discrete cosine transform descriptors and a weighted sparse representation model to predict PPIs from protein sequences. You et al. [11] proposed a method called PCA-EELM, which utilized four different types of sequence information to predict PPIs. Li et al. [12] proposed a method called PSPEL that combined a novel feature extraction approach, Low Rank Approximation, with Rotation Forest to predict PPIs from protein primary sequences. Zeng et al. [13] developed a deep learning framework to predict PPIs, which employed a sliding window and a text convolutional neural network to capture local contextual and global sequence features from target proteins, respectively. Chen et al. [14] applied the Fast Fourier Transform to capture protein feature descriptors and fed them to Random Projection for training and detecting self-interacting proteins [15].
Unlike traditional machine-learning-based methods, deep-learning-based approaches can not only extract feature vectors from the protein sequence directly, but also capture their nonlinear relationships to improve the prediction performance. As a consequence, deep learning algorithms have also been widely employed in PPI prediction in recent years. For example, Sun et al. [16] first adopted a deep learning technique, the stacked autoencoder, for predicting human PPIs from amino acid sequences. Zhang et al. [17] presented Ensemble Deep Neural Networks (EnsDNN), a neural-network-based method that employs different protein descriptors to detect PPIs. Yao et al. [18] designed a novel method called Res2vec to represent protein sequences; the residue representation was then integrated into a deep neural network for training and prediction. Hashemifar et al. [19] developed a method named DPPI, which combined data augmentation, a convolutional neural network, and random projection to predict PPIs. Richoux et al. [20] compared two powerful deep learning models and discussed the points that require attention when applying deep learning algorithms to PPI prediction. Despite these achievements, there is still considerable room for improvement in these computational approaches [21].
Inspired by these works, we attempted to develop a new computational model to predict potential PPIs from amino acid sequence information. Specifically, we first transformed each sequence into a position-specific scoring matrix (PSSM), which preserves the evolutionary information of the primary protein sequence. The Discrete Hilbert transform (DHT) algorithm was then adopted to capture feature descriptors from the PSSM. Finally, the Rotation Forest (RoF) classifier was used for training and for determining whether two proteins interact. To assess the predictive ability of our approach, we applied it to the Yeast, Human, and Oryza sativa PPI datasets, on which it yielded high prediction accuracies of 91.93%, 96.35%, and 94.24%, respectively. Moreover, we compared our approach with several existing sequence-based methods and applied it to four independent PPI datasets. The experimental results demonstrate that our method is effective for identifying whether protein pairs interact, and it can be considered a supplementary tool to the commonly used experimental methods.

2. Materials and Methodology

2.1. Protein Interaction Dataset

In this article, the presented approach was first validated on a high-confidence PPI dataset named Yeast, which was selected from the Database of Interacting Proteins (DIP) [22] by Guo et al. [23]. This dataset was collected from the Saccharomyces cerevisiae core subset, which contains 5996 interaction pairs. To remove redundant information, CD-Hit [24,25] was employed in this work. CD-Hit is a sequence clustering tool for removing homologous sequences. After removing the protein pairs that had ≥40% sequence identity or contained fragments with fewer than 50 residues, we obtained 5594 protein pairs as the positive samples. To construct the negative dataset, we randomly chose 5594 additional Yeast protein pairs whose members are located in different subcellular compartments. Accordingly, the final Yeast PPI dataset contained 11,188 protein pairs.
To demonstrate the generality of the proposed approach, we also validated it on the Human and rice (Oryza sativa) PPI datasets. The Human dataset was selected from the Human Protein Reference Database (HPRD) [26]. After removing sequences with greater than 25% sequence identity, we used 3899 interaction pairs, collected from 2502 different human proteins, as the positive samples. The negative samples of the Human dataset were constructed with the same approach as for the Yeast dataset; the resulting negative set consisted of 4262 pairs from 661 proteins. In addition, the Oryza sativa dataset was collected from the PRIN [27] database and consists of 4800 positive samples and 4800 negative samples.
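The negative-set construction described above (random pairs whose two proteins sit in different subcellular compartments) can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_negative_pairs`, the toy protein IDs, and the compartment labels are hypothetical and are not taken from the authors' pipeline.

```python
import random


def sample_negative_pairs(locations, positives, n_pairs, seed=0):
    """Draw candidate non-interacting pairs from different compartments.

    locations: dict mapping protein ID -> subcellular compartment label
    positives: set of frozenset({a, b}) for known interacting pairs
    """
    rng = random.Random(seed)
    proteins = sorted(locations)
    negatives = set()
    attempts, max_attempts = 0, 1000 * n_pairs
    while len(negatives) < n_pairs and attempts < max_attempts:
        attempts += 1
        a, b = rng.sample(proteins, 2)
        pair = frozenset((a, b))
        # Reject pairs that share a compartment or are known to interact.
        if locations[a] == locations[b] or pair in positives:
            continue
        negatives.add(pair)
    if len(negatives) < n_pairs:
        raise ValueError("not enough valid negative pairs available")
    return sorted(tuple(sorted(p)) for p in negatives)


# Hypothetical toy data, for illustration only.
toy_locations = {"P1": "nucleus", "P2": "nucleus", "P3": "cytoplasm",
                 "P4": "mitochondrion", "P5": "cytoplasm"}
toy_positives = {frozenset(("P1", "P3"))}
toy_negatives = sample_negative_pairs(toy_locations, toy_positives, n_pairs=3)
```

Sampling the same number of negatives as positives, as the text does with 5594 pairs each, keeps the two classes balanced.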

2.2. Encoding Amino Acid Sequence as Data Matrix

The Position-Specific Scoring Matrix (PSSM) was adopted to represent the protein sequence. It was introduced by Gribskov et al. [28] to analyze the sequence similarities of proteins. PSSMs have produced excellent results in many fields, such as protein secondary structure prediction [29], disordered region prediction [30], and DNA function prediction [31]. A PSSM is a matrix that can be represented as $PSSM = \{\varphi_{m,n} : m = 1, \ldots, b \text{ and } n = 1, \ldots, 20\}$, where $b$ denotes the length of the protein sequence and the 20 columns correspond to the 20 amino acids. The entry $\varphi_{m,n}$ can be expressed as follows:
$$\varphi_{m,n} = \sum_{q=1}^{20} P(m,q) \times w(n,q), \quad m = 1, \ldots, b, \quad n = 1, \ldots, 20 \tag{1}$$
where $P(m,q)$ indicates the frequency of the $q$th amino acid at position $m$ of the probe, and $w(n,q)$ indicates the value of the Dayhoff mutation matrix between the $n$th and $q$th amino acids. The key property of the PSSM is that, when a sequence is matched against the alignment, conserved positions receive higher scores: a high score indicates a highly conserved position, whereas a low score indicates a weakly conserved position.
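As a small worked illustration of the formula above, each PSSM entry is a frequency-weighted sum of substitution scores. The sketch below uses a toy three-letter alphabet instead of the 20 amino acids, and the frequency and weight values are made up for the example; `profile_scores` is a hypothetical helper name.

```python
def profile_scores(freqs, weights):
    """Return phi where phi[pos][aa] = sum_q freqs[pos][q] * weights[aa][q].

    freqs:   one row per sequence position, one column per residue type
    weights: square substitution matrix (residue type x residue type)
    """
    return [
        [sum(p * w for p, w in zip(freq_row, weight_row)) for weight_row in weights]
        for freq_row in freqs
    ]


# Toy values (hypothetical): 2 positions, 3 residue types.
toy_freqs = [[1.0, 0.0, 0.0],
             [0.5, 0.5, 0.0]]
toy_weights = [[2, -1, 0],
               [-1, 3, 1],
               [0, 1, 4]]
toy_phi = profile_scores(toy_freqs, toy_weights)
# toy_phi == [[2.0, -1.0, 0.0], [0.5, 1.0, 0.5]]
```

The first position is fully conserved, so its row simply reproduces the substitution scores of the conserved residue; the second position mixes two residues and yields averaged scores.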
In this work, the PSI-BLAST tool was applied to transform each protein sequence into a PSSM. BLAST is a tool for finding regions of local similarity between sequences. It compares nucleotide or protein sequences against sequence databases and computes the statistical significance of the matches, which helps to infer functional and evolutionary relationships between sequences. PSI-BLAST is an enhanced BLAST technique that can robustly identify novel proteins in distantly related organisms. Its main improvement is that it uses a profile to search the non-redundant SWISS-PROT database, employs the search results to rebuild the profile, and repeats this process until no new results are generated. SWISS-PROT is an annotated protein sequence database whose entries are curated by expert biologists. To better exploit the PSI-BLAST algorithm, we used three iterations, set the e-value parameter to 0.001, and selected PAM as the scoring matrix. The other parameters were set to their default values.
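For readers who want to reproduce this step, the settings above map naturally onto the `psiblast` program of the NCBI BLAST+ suite. The sketch below only assembles the command line; it assumes BLAST+ is installed and that a local SWISS-PROT database exists, and the file and database names are placeholders. The text does not say which PAM matrix or which PSI-BLAST release was used, so the scoring-matrix option is deliberately left out rather than guessed.

```python
def build_psiblast_command(fasta_path, db_name, pssm_path,
                           iterations=3, evalue=0.001):
    """Assemble a BLAST+ psiblast call that writes an ASCII PSSM file."""
    return [
        "psiblast",
        "-query", fasta_path,            # single-sequence FASTA file
        "-db", db_name,                  # local protein database (placeholder)
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-out_ascii_pssm", pssm_path,    # PSSM written after the last round
    ]


# Placeholder names, for illustration only.
cmd = build_psiblast_command("protein.fasta", "swissprot", "protein.pssm")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is available.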

2.3. Discrete Hilbert Transform

In this work, the Discrete Hilbert transform (DHT) [32] was adopted to extract feature values from the PSSM and generate the feature vectors used for prediction. The DHT was originally employed to analyze signals in the time and frequency domains. Before introducing the 2D DHT, we first describe the 1D DHT in the spatial and frequency domains. Let $f(a)$ represent the discrete signal; its DHT $\hat{f}(a)$ can be written as:
$$\hat{f}(a) = f(a) * p(a) \tag{2}$$
where $*$ denotes convolution and:
$$p(a) = \frac{1 - (-1)^{a}}{a\pi}, \quad a = 1, 2, 3, \ldots \tag{3}$$
After applying the Fourier transform (FT), $\hat{f}(a)$ can be represented as:
$$\hat{f}(a) = \mathrm{IDFT}\left[\hat{F}(j\Omega)\right] = \mathrm{IDFT}\left[F(j\Omega)\left(-j\,\mathrm{sgn}(\Omega)\right)\right] \tag{4}$$
In Equation (4), IDFT represents the Inverse Discrete Fourier transform [33], and the Fourier transforms of $f(a)$ and $\hat{f}(a)$ are denoted by $F(j\Omega)$ and $\hat{F}(j\Omega)$, respectively. The function $H(j\Omega)$ can therefore be written as:
$$H(j\Omega) = -j\,\mathrm{sgn}(\Omega) = \begin{cases} -j, & \Omega > 0 \\ j, & \Omega < 0 \end{cases} \tag{5}$$
where $\Omega$ is the angular frequency and $\mathrm{sgn}(\Omega)$ denotes the finite discrete signum function. To better capture feature vectors from the PSSM, we applied the 2D DHT [34], defined in the frequency domain, to extract features from the PSSM. The odd and even parts of the PSSM features in the frequency domain refer to the highly conserved order of amino acids within a particular protein sequence. Suppose that the odd and even parts of the PSSM features in the frequency domain are denoted by $f_o(x,y)$ and $f_e(x,y)$, respectively. The 2D Discrete Hilbert transform can then be written as:
$$f_o(x,y) = \left[\mathrm{sgn}(x,y) + bdy(x,y)\right] f_e(x,y) \tag{6}$$
$$\mathrm{sgn}(x,y) = \begin{cases} 1, & 0 < x < \frac{H_1}{2},\ 0 < y < \frac{H_2}{2} \\ -1, & \frac{H_1}{2} < x < H_1,\ \frac{H_2}{2} < y < H_2 \\ 0, & \text{elsewhere} \end{cases} \tag{7}$$
where $bdy(x,y)$ is employed to adjust the boundary and $\mathrm{sgn}(x,y)$ is the finite discrete signum function. $H_1$ and $H_2$ represent the size of $f_o(x,y)$ and $f_e(x,y)$. Given an image $P(x,y)$, the 2D DHT of $P(x,y)$ in the frequency domain can be expressed as:
$$\hat{P}(x,y) = \left[\mathrm{sgn}(x,y) + bdy(x,y)\right] \cdot P(x,y) \tag{8}$$
where $x = 0, 1, \ldots, H_1 - 1$ and $y = 0, 1, \ldots, H_2 - 1$; $H_1$ and $H_2$ are the dimensions of the input image.
Let $T(i,j)$ represent an image; the 2D DHT of $T(i,j)$ in the spatial domain can be defined as:
$$\hat{T}(i,j) = T(i,j) * R(i,j) \tag{9}$$
$$R(i,j) = \frac{\cot\left(\frac{\pi}{H_1} i\right) + \cot\left(\frac{\pi}{H_2} j\right)}{2 H_1 H_2} \tag{10}$$
where $*$ denotes 2D convolution, $i = 0, 1, 2, \ldots, H_1 - 1$, and $j = 0, 1, 2, \ldots, H_2 - 1$. Because the 1D and 2D DHT share the same mathematical principle, the image $f(i,j)$ can be expanded as a 2D Fourier series:
$$f(i,j) = \frac{1}{ZQ} \sum_{\alpha=0}^{Z-1} \sum_{\beta=0}^{Q-1} F(\alpha,\beta) \sin\left(\phi_{\alpha,\beta}(i,j)\right) \tag{11}$$
where $F(\alpha,\beta) = \sum_{i=0}^{Z-1} \sum_{j=0}^{Q-1} f(i,j)\, e^{-j 2\pi \left(\frac{\alpha i}{Z} + \frac{\beta j}{Q}\right)}$, $\alpha = 0, 1, 2, \ldots, Z-1$, and $\beta = 0, 1, 2, \ldots, Q-1$; $Z$ and $Q$ are the dimensions of the image.
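The frequency-domain view of the DHT can be made concrete with a short, dependency-free sketch. It rests on several assumptions that are not confirmed by the text: the spectrum is multiplied by $-j$ times a signum mask that is $+1$ on the first half (or first quadrant) of the frequency indices and $-1$ on the mirrored half (or quadrant), the boundary-adjustment term $bdy(x,y)$ is dropped, and a naive $O(N^2)$ DFT stands in for a fast transform. It does not reproduce the authors' 400-dimensional descriptor; it only shows the transform itself.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a 1D sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        out.append(total / n if inverse else total)
    return out


def signum(k, n):
    """Finite discrete signum: +1 on (0, n/2), -1 on (n/2, n), 0 elsewhere."""
    if 0 < k < n / 2:
        return 1
    if n / 2 < k < n:
        return -1
    return 0


def dht_1d(signal):
    """1D DHT: scale the spectrum by -j * sgn, then transform back."""
    n = len(signal)
    shifted = [-1j * signum(k, n) * c for k, c in enumerate(dft(signal))]
    return [c.real for c in dft(shifted, inverse=True)]


def dft_2d(matrix, inverse=False):
    """2D DFT computed row by row, then column by column."""
    rows = [dft(row, inverse) for row in matrix]
    cols = [dft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def quadrant_sign(x, y, h1, h2):
    """2D signum mask: +1 in the first quadrant, -1 in the mirrored one."""
    if 0 < x < h1 / 2 and 0 < y < h2 / 2:
        return 1
    if h1 / 2 < x < h1 and h2 / 2 < y < h2:
        return -1
    return 0


def dht_2d(matrix):
    """2D DHT of a real matrix via the masked spectrum (no boundary term)."""
    h1, h2 = len(matrix), len(matrix[0])
    spectrum = dft_2d(matrix)
    masked = [[-1j * quadrant_sign(x, y, h1, h2) * spectrum[x][y]
               for y in range(h2)] for x in range(h1)]
    return [[c.real for c in row] for row in dft_2d(masked, inverse=True)]
```

A quick sanity check of this convention is that a sampled cosine is mapped to the corresponding sine, in both the 1D and the 2D case. In practice a PSSM (or a fixed-size matrix derived from it) would be passed to `dht_2d`.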

2.4. Ensemble-Learning-Based Classifier

Rotation forest (RoF) is an ensemble learning algorithm, which was introduced by Rodriguez et al. [35] to improve the diversity and accuracy of each classifier in the ensemble system. The main contribution of the RoF algorithm is that it applies the principal component analysis (PCA) technique to construct a rotational matrix, which can then transform initial variables into new variables to construct new independent decision trees. Moreover, PCA algorithm ensures the diversity of the classifier, and it retains most of the evolutionary information of the protein feature descriptors [36]. The specific framework of this algorithm is summarized as follows.
Let $T$ represent the training sample set, $H$ the feature set, and $E$ the corresponding labels. Let $\alpha = \{\alpha_1, \alpha_2\}$ be the set of class labels from which $E$ takes its values. Assume that $T$ is an $N \times n$ matrix, where $N$ and $n$ are the numbers of training samples and features in the PPI dataset, respectively. The feature set is divided randomly into $K$ subsets of approximately equal size, and the ensemble contains $L$ decision trees, denoted $D_1, \ldots, D_L$. In the RoF algorithm, $L$ and $K$ are the two parameters that must be optimized in advance. The RoF algorithm proceeds as follows:
(1) Divide the feature set $H$ randomly into $K$ subsets. Assuming that $K$ is a factor of $n$, each subset contains $u = n / K$ features.
(2) Let $H_{ij}$ represent the $j$-th subset of features for training classifier $D_i$, and let $T_{ij}$ denote the dataset $T$ restricted to the features in $H_{ij}$. A bootstrap sample containing 75% of the data is drawn from $T_{ij}$ to form a new training set $T'_{ij}$. PCA is then applied to $T'_{ij}$, and the resulting coefficients are stored in a matrix $C_{ij}$ whose columns are denoted $a_{ij}^{(1)}, \ldots, a_{ij}^{(M_j)}$, each of size $u \times 1$.
(3) Use the coefficients in $C_{ij}$ to build a sparse rotation matrix $R_i$, which can be expressed as follows:
$$R_i = \begin{bmatrix} a_{i1}^{(1)}, \ldots, a_{i1}^{(M_1)} & 0 & \cdots & 0 \\ 0 & a_{i2}^{(1)}, \ldots, a_{i2}^{(M_2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{iK}^{(1)}, \ldots, a_{iK}^{(M_K)} \end{bmatrix} \tag{12}$$
In the classification stage, given a test sample $x$, let $d_{ij}(x R_i^{a})$ denote the probability assigned by classifier $D_i$ to the hypothesis that $x$ belongs to class $\alpha_j$. The confidence of each class is then computed by the mean combination rule:
$$\lambda_j(x) = \frac{1}{L} \sum_{i=1}^{L} d_{ij}(x R_i^{a}) \tag{13}$$
Finally, the test sample $x$ is assigned to the class with the highest confidence.
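Two of the steps above are easy to illustrate without any numerical library: the random partition of the feature set into $K$ subsets and the mean combination of the per-tree class probabilities. The sketch below covers only these two steps; the bootstrap sampling, the PCA rotation, and the training of the decision trees are omitted. The function names, the choice of $K = 20$ for a 400-dimensional descriptor, and the toy probabilities are illustrative assumptions.

```python
import random


def split_features(n_features, k_subsets, seed=0):
    """Randomly partition feature indices into K disjoint, equal-size subsets."""
    if n_features % k_subsets != 0:
        raise ValueError("this sketch assumes K divides the number of features")
    indices = list(range(n_features))
    random.Random(seed).shuffle(indices)
    size = n_features // k_subsets
    return [sorted(indices[s:s + size]) for s in range(0, n_features, size)]


def mean_combination(tree_probabilities):
    """Average class probabilities over all trees and pick the best class.

    tree_probabilities: one list per tree, holding one probability per class.
    Returns (confidences, index of the class with the highest confidence).
    """
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    confidences = [sum(p[j] for p in tree_probabilities) / n_trees
                   for j in range(n_classes)]
    best = max(range(n_classes), key=confidences.__getitem__)
    return confidences, best


# Hypothetical example: 400 features, K = 20, and three trees voting.
subsets = split_features(400, 20)
confidences, predicted = mean_combination([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
```

Here the three trees give average confidences of about 0.3 and 0.7 for the two classes, so the sample is assigned to the second class.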

3. Results

3.1. Evaluation Measures

In this study, to prevent over-fitting from distorting the estimated predictive ability of the proposed method, we used five-fold cross-validation (five-fold CV). All samples were randomly split into five subsets, of which four were used as the training set and the remaining one as the test set. This procedure was performed five times so that each subset was used exactly once as the test set. The averages and standard deviations over these five experiments were taken as the final results. Several evaluation criteria were employed to assess the predictive ability of the proposed model: accuracy (ACC), sensitivity (Sen.), specificity (Spec.), precision (PR), and Matthews' correlation coefficient (MCC). The corresponding formulae are as follows:
$$ACC = \frac{TP + TN}{TN + FP + TP + FN} \tag{14}$$
$$Sen. = \frac{TP}{FN + TP} \tag{15}$$
$$Spec. = \frac{TN}{TN + FP} \tag{16}$$
$$PR = \frac{TP}{FP + TP} \tag{17}$$
$$MCC = \frac{TN \times TP - FN \times FP}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \tag{18}$$
where true positive (TP) is the number of true interacting pairs that are correctly identified; false positive (FP) is the number of true non-interacting pairs falsely predicted to be PPIs; true negative (TN) is the number of true non-interacting pairs that are correctly identified; and false negative (FN) is the number of true interacting pairs incorrectly classified as non-interacting. Additionally, receiver operating characteristic (ROC) curves were plotted to illustrate the predictive power of our method, and the area under the ROC curve (AUC) was calculated to summarize each curve as a single value.
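The five criteria can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name `ppi_metrics` and the example counts are illustrative and do not come from the reported experiments.

```python
import math


def ppi_metrics(tp, tn, fp, fn):
    """Compute ACC, Sen., Spec., PR, and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "sen": tp / (tp + fn),
        "spec": tn / (tn + fp),
        "pr": tp / (tp + fp),
        # MCC is undefined when a marginal count is zero; return 0.0 then.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Illustrative counts only.
example = ppi_metrics(tp=90, tn=95, fp=5, fn=10)
```

With these counts the accuracy is 0.925, the sensitivity 0.90, the specificity 0.95, the precision about 0.947, and the MCC about 0.851. The values are fractions; multiply by 100 to obtain percentages.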

3.2. Prediction Performance on Three PPIs Datasets

In this study, we first validated our model on the Yeast dataset; Table 1 summarizes the results of the five-fold cross-validation (five-fold CV) experiment. As Table 1 shows, the average accuracy, sensitivity, specificity, precision, and MCC values are 91.93%, 89.78%, 94.05%, 93.82%, and 85.14%, with standard deviations of 0.69%, 0.79%, 1.30%, 1.19%, and 1.15%, respectively. The proposed method was then applied to the Human PPI dataset, where it also yielded excellent results (Table 2), with average accuracy, sensitivity, specificity, precision, and MCC values of 96.35%, 95.76%, 96.87%, 96.57%, and 92.95%, and standard deviations of 0.56%, 0.78%, 0.71%, 0.64%, and 1.03%, respectively. To further demonstrate the robustness of the proposed model, we finally applied it to a plant PPI dataset, Oryza sativa. On this dataset, the average accuracy, sensitivity, specificity, precision, and MCC values (Table 3) are 94.24%, 94.50%, 94.02%, 94.02%, and 89.14%, with standard deviations of 0.37%, 0.97%, 0.82%, 1.03%, and 0.66%, respectively. The receiver operating characteristic (ROC) curves for the three benchmark datasets are shown in Figure 1, Figure 2 and Figure 3. We also calculated the area under the ROC curve (AUC) for these three PPI datasets to further evaluate the predictive power of our model; the values were 0.9586, 0.9831, and 0.9667, respectively.
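The averages and standard deviations reported above come from a five-fold split of each dataset. The procedure can be sketched as below; the helper names are hypothetical, the shuffling seed is arbitrary, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import random
import statistics


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def summarize(fold_scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# 11,188 is the size of the Yeast dataset described in Section 2.1.
splits = list(five_fold_indices(11188))
```

A model would be trained on each `train` list and scored on the matching `test` list, and `summarize` would then be applied to the five fold-level values of each metric.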

3.3. Comparison with Different Classifier Models

To date, many machine learning algorithms have been developed for detecting PPIs. To further verify the prediction accuracy of the proposed model, we compared it with several popular classifiers: Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), and AdaBoost. Specifically, we used the same DHT descriptors and compared the predictive performance of RoF with that of these classifiers. We used the LIBSVM tool to train the SVM-based model and make predictions, and the grid search method was adopted to select the best values of the SVM parameters c and g. We set c = 13 and g = 0.0006 for the Yeast dataset, c = 3 and g = 0.0005 for the Human dataset, and c = 7 and g = 0.0009 for the Oryza sativa dataset. The parameter K of the RF classifier for the Yeast, Human, and Oryza sativa datasets was 27, 7, and 17, respectively. The parameters of the KNN model are the number of neighbors and the distance measure. In this article, all experiments used the Manhattan distance, and the number of neighbors for the three PPI datasets was 15, 17, and 4, respectively. Table 4 details the prediction results of these four classifiers on the Yeast, Human, and Oryza sativa datasets. To identify any potential overfitting or underfitting problems in the proposed model, we also used a train/test/validation process on these datasets. The experimental results based on this approach can be seen in Supplementary Materials Tables S1–S4.
As shown in Table 4, the proposed method provided the best results on the three PPI datasets in terms of all metrics. Compared with the other classifiers, the smallest accuracy improvement was 7.49% on the Yeast dataset, 1.03% on the Human dataset, and 8.66% on the Oryza sativa dataset. The smallest AUC improvement was 3.34% on the Yeast dataset, 0.13% on the Human dataset, and 4.17% on the Oryza sativa dataset. For visual comparison, Figure 4 shows a histogram of the ACC and AUC values produced by these classifiers. These experimental results further demonstrate that, among the classifiers tested, rotation forest is the best suited to the features that we introduced.
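To make the KNN baseline concrete, the sketch below implements a plain K-nearest-neighbor vote with the Manhattan distance, the measure named above. It is a toy illustration with made-up two-dimensional points, not the implementation used in the experiments, where the inputs would be the DHT descriptors and k would be 15, 17, or 4 depending on the dataset.

```python
from collections import Counter


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def knn_predict(train_x, train_y, query, k):
    """Return the majority label among the k nearest training samples."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda item: manhattan(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): label 0 near the origin, label 1 near (5, 5).
toy_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(toy_x, toy_y, [0.5, 0.5], k=3)
near_far = knn_predict(toy_x, toy_y, [5, 5.5], k=3)
```

The first query falls among the points labeled 0 and the second among those labeled 1, so the two predictions are 0 and 1.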

3.4. Evaluation of Prediction Ability on Four Independent Datasets

Although the proposed model achieved satisfactory results on the Yeast, Human, and Oryza sativa PPI datasets, we also applied it to four independent datasets (H. sapiens, H. pylori, M. musculus, and C. elegans) to further demonstrate the suitability of our method. In this experiment, the entire Yeast dataset was used as the training set and the four independent datasets were used as test sets to verify the robustness of the proposed method. We also compared the predictive performance with that of several existing approaches. Table 5 summarizes the accuracy comparison between our model and these methods on the four datasets. The prediction accuracies of our method on the H. sapiens, H. pylori, M. musculus, and C. elegans datasets were all higher than 91%: 94.27%, 91.67%, 93.12%, and 92.14%, respectively. These experimental results further indicate that our method has a strong generalization ability for predicting PPIs. (N/A means not available.)

3.5. Comparison with Existing Methods

In recent years, various computational methods have been proposed for predicting potential protein–protein interactions. Here, we compared the prediction ability of the proposed model with that of some popular methods on the Yeast and Human datasets, the same datasets used in our five-fold cross-validation experiments. Table 6 and Table 7 list the predictive performance of these methods in terms of several common evaluation criteria, including accuracy, precision, sensitivity, and MCC. Table 6 shows that our method produced an accuracy of 91.93% on the Yeast dataset, with a precision of 93.82%, a sensitivity of 89.78%, and an MCC of 85.14%. The average accuracies of the selected methods are all lower than that of our method on the Yeast dataset. Table 7 summarizes the average results of these approaches on the Human dataset: their accuracies range from 90.57% to 96.09%, whereas the average accuracy of our method is 96.35%. These results further indicate that combining the DHT descriptor with the rotation forest classifier is effective for PPI prediction. (N/A means not available.)

4. Discussion

The identification of protein–protein interactions (PPIs) can provide a novel perspective for clinical diagnosis and treatment. PPIs also play an important role in inter-cellular and intra-cellular functions and in inter-molecular connectivity. In this article, we presented a novel ensemble-learning-based method that predicts potential PPIs using only amino acid sequence information. There are four reasons why the proposed model achieves excellent prediction performance. First, all protein sequence data were preprocessed to remove short fragments and redundant sequences. Second, the target protein sequences were converted into features by the PSSM technique, which embeds the evolutionary information in the form of a matrix. Third, the Discrete Hilbert transform (DHT) algorithm was employed to extract the feature descriptors from the PSSM. In this way, the proposed model can capture high-dimensional and complex latent information to improve the prediction performance. Finally, the ensemble-learning-based classifier, rotation forest (RoF), was used to solve the classification problem. We evaluated our method on three PPI datasets (Yeast, Human, and Oryza sativa) under five-fold cross-validation. To further demonstrate the prediction ability of our method, we also applied it to four independent cross-species datasets and compared it with some existing methods. The comprehensive experimental results indicate that our model can serve as a powerful tool to guide researchers in studying the functions and roles of proteins. However, there are still some limitations in our work. The negative datasets were randomly selected from the non-interacting pairs and may therefore include false negatives, which has the potential to affect the prediction accuracy of the developed model.
In future work, we will further investigate the DHT algorithm, which is more appropriate for problems involving large feature dimensions and a small number of training samples; through this, we hope to better address the problem of protein–protein interaction prediction.

5. Conclusions

In this study, we proposed a novel ensemble-learning-based model that can greatly improve sequence-based PPI prediction. We conducted comprehensive experiments on three gold standard datasets. Furthermore, we performed independent validation on four cross-species PPI datasets. Experimental results based on cross-validation and comparison indicate that our method is effective and robust in predicting PPIs.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biology11050775/s1, Table S1. The optimal model parameters (K and L) on the three PPIs datasets. Table S2. The prediction results obtained by (train/test/validation) based method on the Yeast dataset. Table S3. The prediction results obtained by (train/test/validation) based method on the Human dataset. Table S4. The prediction results obtained by (train/test/validation) based method on the Oryza sativa dataset.

Author Contributions

Conceptualization, J.P. and Y.S.; methodology, software, validation, and formal analysis C.Y. and Z.Y.; investigation, resources, and data curation, L.L. and S.W.; writing—original draft preparation, J.P.; writing—review and editing, J.P.; visualization, J.P. and Y.S.; supervision, Z.Y.; project administration, L.L.; funding acquisition, Z.Y. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the NSFC Program, under Grant 62072378, 61873212, 32170114 and 62002297. This work is also supported by Natural Science Basic Research Program of Shaanxi (Program No.2022JQ-700) and the Science and Technology Innovation 2030-New Generation Artificial Intelligence Major Project (No.2018AAA0100103).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in https://github.com/jie-pan111/Biology (accessed on 14 May 2022).

Acknowledgments

The authors would like to thank all editors and reviewers for their helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Izoré, T.; Cryle, M.J. The many faces and important roles of protein–protein interactions during non-ribosomal peptide synthesis. Nat. Prod. Rep. 2018, 35, 1120–1139.
  2. Yakubu, R.R.; Nieves, E.; Weiss, L.M. The methods employed in mass spectrometric analysis of posttranslational modifications (PTMs) and protein–protein interactions (PPIs). In Advancements of Mass Spectrometry in Biomedical Research; Advances in Experimental Medicine and Biology book series; Springer: Berlin/Heidelberg, Germany, 2019; pp. 169–198.
  3. Carnes, R.M.; Kesterson, R.A.; Korf, B.R.; Mobley, J.A.; Wallis, D.J.G. Affinity purification of NF1 protein–protein interactors identifies keratins and neurofibromin itself as binding partners. Genes 2019, 10, 650.
  4. Castel, P.; Holtz-Morris, A.; Kwon, Y.; Suter, B.P.; McCormick, F.J. DoMY-Seq: A yeast two-hybrid–based technique for precision mapping of protein–protein interaction motifs. J. Biol. Chem. 2021, 296, 100023.
  5. Pan, J.; You, Z.-H.; Yu, C.-Q.; Li, L.-P.; Zhan, X.-K. Predicting Protein-Protein Interactions from Protein Sequence Information Using Dual-Tree Complex Wavelet Transform. In Proceedings of the International Conference on Intelligent Computing, Bari, Italy, 2–5 October 2020; pp. 132–142.
  6. Wang, L.; You, Z.-H.; Yan, X.; Xia, S.-X.; Liu, F.; Li, L.-P.; Zhang, W.; Zhou, Y.J. Using two-dimensional principal component analysis and rotation forest for prediction of protein-protein interactions. Sci. Rep. 2018, 8, 12874.
  7. Romero-Molina, S.; Ruiz-Blanco, Y.B.; Harms, M.; Münch, J.; Sanchez-Garcia, E.J. PPI-detect: A support vector machine model for sequence-based prediction of protein–protein interactions. J. Comput. Chem. 2019, 40, 1233–1242.
  8. Chakraborty, A.; Mitra, S.; De, D.; Pal, A.J.; Ghaemi, F.; Ahmadian, A.; Ferrara, M. Determining Protein–Protein Interaction Using Support Vector Machine: A Review. IEEE Access 2021, 9, 12473–12490.
  9. Lin, X.; Chen, X.W. Heterogeneous data integration by tree-augmented naïve Bayes for protein–protein interactions prediction. Proteomics 2013, 13, 261–268.
  10. Huang, Y.-A.; You, Z.-H.; Gao, X.; Wong, L.; Wang, L. Using weighted sparse representation model combined with discrete cosine transformation to predict protein-protein interactions from protein sequence. BioMed Res. Int. 2015, 2015, 902198.
  11. You, Z.-H.; Lei, Y.-K.; Zhu, L.; Xia, J.; Wang, B. Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinform. 2013, 14, S10.
  12. Li, J.-Q.; You, Z.-H.; Li, X.; Ming, Z.; Chen, X. PSPEL: In silico prediction of self-interacting proteins from amino acids sequences using ensemble learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 14, 1165–1172.
  13. Zeng, M.; Zhang, F.; Wu, F.-X.; Li, Y.; Wang, J.; Li, M. Protein–protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics 2020, 36, 1114–1120.
  14. Chen, Z.-H.; You, Z.-H.; Li, L.-P.; Wang, Y.-B.; Wong, L.; Yi, H.-C. Prediction of self-interacting proteins from protein sequence information based on random projection model and fast Fourier transform. Int. J. Mol. Sci. 2019, 20, 930.
  15. Pan, J.; Li, L.-P.; Yu, C.-Q.; You, Z.-H.; Ren, Z.-H.; Tang, J.-Y. FWHT-RF: A Novel Computational Approach to Predict Plant Protein-Protein Interactions via an Ensemble Learning Method. Sci. Program. 2021, 2021, 1607946.
  16. Sun, T.; Zhou, B.; Lai, L.; Pei, J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinform. 2017, 18, 277.
  17. Zhang, L.; Yu, G.; Xia, D.; Wang, J. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing 2019, 324, 10–19.
  18. Yao, Y.; Du, X.; Diao, Y.; Zhu, H. An integration of deep learning with feature embedding for protein–protein interaction prediction. PeerJ 2019, 7, e7126.
  19. Hashemifar, S.; Neyshabur, B.; Khan, A.A.; Xu, J. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics 2018, 34, i802–i810.
  20. Richoux, F.; Servantie, C.; Borès, C.; Téletchéa, S. Comparing two deep learning sequence-based models for protein-protein interaction prediction. arXiv 2019, arXiv:1901.06268.
  21. Pan, J.; You, Z.-H.; Li, L.-P.; Huang, W.-Z.; Guo, J.-X.; Yu, C.-Q.; Wang, L.-P.; Zhao, Z.-Y. DWPPI: A Deep Learning Approach for Predicting Protein–Protein Interactions in Plants Based on Multi-Source Information With a Large-Scale Biological Network. Front. Bioeng. Biotechnol. 2022, 10, 807522.
  22. Salwinski, L.; Miller, C.S.; Smith, A.J.; Pettit, F.K.; Bowie, J.U.; Eisenberg, D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004, 32, D449–D451.
  23. Guo, Y.; Yu, L.; Wen, Z.; Li, M. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 2008, 36, 3025–3030.
  24. Li, W.; Jaroszewski, L.; Godzik, A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 2001, 17, 282–283.
  25. Li, W.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659.
  26. Keshava Prasad, T.; Goel, R.; Kandasamy, K.; Keerthikumar, S.; Kumar, S.; Mathivanan, S.; Telikicherla, D.; Raju, R.; Shafreen, B.; Venugopal, A. Human protein reference database—2009 update. Nucleic Acids Res. 2009, 37, D767–D772.
  27. Gu, H.; Zhu, P.; Jiao, Y.; Meng, Y.; Chen, M. PRIN: A predicted rice interactome network. BMC Bioinform. 2011, 12, 161.
  28. Gribskov, M.; McLachlan, A.D.; Eisenberg, D. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 1987, 84, 4355–4358.
  29. Wang, Y.; Cheng, J.; Liu, Y.; Chen, Y. Prediction of protein secondary structure using support vector machine with PSSM profiles. In Proceedings of the 2016 IEEE Information Technology, Networking, Electronic and Automation Control Conference, Chongqing, China, 20–22 May 2016; pp. 502–505.
  30. Zhao, T.-H.; Jiang, M.; Huang, T.; Li, B.-Q.; Zhang, N.; Li, H.-P.; Cai, Y.-D. A novel method of predicting protein disordered regions based on sequence features. BioMed Res. Int. 2013, 2013, 414327.
  31. Gelfand, M.S. Prediction of function in DNA sequence analysis. J. Comput. Biol. 1995, 2, 87–115.
  32. Cizek, V. Discrete Hilbert transform. IEEE Trans. Audio Electroacoust. 1970, 18, 340–343.
  33. Ponomareva, O.; Ponomarev, A.; Ponomarev, V. Evolution of forward and inverse discrete Fourier transform. In Proceedings of the 2018 IEEE East-West Design & Test Symposium (EWDTS), Kazan, Russia, 14–17 September 2018; pp. 1–5.
  34. Read, R.R.; Treitel, S. The stabilization of two-dimensional recursive filters via the discrete Hilbert transform. IEEE Trans. Geosci. Electron. 1973, 11, 153–160.
  35. Rodriguez, J.J.; Kuncheva, L.I.; Alonso, C.J. Rotation forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1619–1630.
  36. Good, R.P.; Kost, D.; Cherry, G.A. Introducing a unified PCA algorithm for model size reduction. IEEE Trans. Semicond. Manuf. 2010, 23, 201–209.
  37. Ding, Y.; Tang, J.; Guo, F. Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinform. 2016, 17, 398.
  38. Zhan, X.-K.; You, Z.-H.; Li, L.-P.; Li, Y.; Wang, Z.; Pan, J. Using Random Forest Model Combined with Gabor Feature to Predict Protein-Protein Interaction From Protein Sequence. Evol. Bioinform. 2020, 16, 1176934320934498.
  39. Wang, Y.-B.; You, Z.-H.; Li, L.-P.; Huang, Y.-A.; Yi, H.-C. Detection of interactions between proteins by using Legendre moments descriptor to extract discriminatory information embedded in PSSM. Molecules 2017, 22, 1366.
  40. Yang, X.; Yang, S.; Li, Q.; Wuchty, S.; Zhang, Z.J.C. Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method. Comput. Struct. Biotechnol. J. 2020, 18, 153–161.
  41. Wang, X.; Yu, B.; Ma, A.; Chen, C.; Liu, B.; Ma, Q. Protein–protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique. Bioinformatics 2019, 35, 2395–2402.
  42. Zhou, Y.Z.; Gao, Y.; Zheng, Y.Y. Prediction of protein-protein interactions using local description of amino acid sequence. In Advances in Computer Science and Education Applications; Springer: Berlin/Heidelberg, Germany, 2011; pp. 254–262.
  43. An, J.-Y.; Zhou, Y.; Zhao, Y.-J.; Yan, Z.-J. An efficient feature extraction technique based on local coding PSSM and multifeatures fusion for predicting protein-protein interactions. Evol. Bioinform. 2019, 15.
  44. Li, Y.; Wang, Z.; Li, L.-P.; You, Z.-H.; Huang, W.-Z.; Zhan, X.-K.; Wang, Y.-B. Robust and accurate prediction of protein–protein interactions by exploiting evolutionary information. Sci. Rep. 2021, 11, 16910.
  45. Pan, X.-Y.; Zhang, Y.-N.; Shen, H.-B. Large-Scale prediction of human protein–protein interactions from amino acid sequence based on latent topic features. J. Proteome Res. 2010, 9, 4992–5001.
  46. Li, Z.-W.; You, Z.-H.; Chen, X.; Li, L.-P.; Huang, D.-S.; Yan, G.-Y.; Nie, R.; Huang, Y.-A. Accurate prediction of protein-protein interactions by integrating potential evolutionary information embedded in PSSM profile and discriminative vector machine classifier. Oncotarget 2017, 8, 23638.

Figure 1. ROC curves generated by the proposed model on the Yeast dataset.
Figure 2. ROC curves generated by the proposed model on the Human dataset.
Figure 3. ROC curves generated by the proposed model on the Oryza sativa dataset.
Figure 4. Comparison of the results produced by different classifier models on the three benchmark datasets. (a) Accuracy obtained on the three benchmark datasets. (b) AUC obtained on the three benchmark datasets.
Table 1. Five-fold CV results performed by the proposed model on the Yeast PPIs dataset.

| Dataset | ACC (%) | Sen. (%) | Spec. (%) | PR (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| 1 | 90.88 | 89.13 | 92.67 | 92.57 | 83.42 | 0.9562 |
| 2 | 91.55 | 90.31 | 92.79 | 92.55 | 84.52 | 0.9581 |
| 3 | 92.40 | 89.42 | 95.37 | 95.04 | 85.93 | 0.9581 |
| 4 | 92.49 | 90.90 | 94.15 | 94.20 | 86.10 | 0.9608 |
| 5 | 92.31 | 89.15 | 95.30 | 94.73 | 85.75 | 0.9599 |
| Average | 91.93 ± 0.69 | 89.78 ± 0.79 | 94.05 ± 1.30 | 93.82 ± 1.19 | 85.14 ± 1.15 | 0.9586 ± 0.0018 |

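The Average row of Table 1 appears consistent with the mean and the sample standard deviation taken over the five folds. The following minimal Python sketch checks this for the ACC column only. The use of the sample standard deviation (n − 1 in the denominator) is an assumption inferred from the reported numbers, not something stated in the table, and the helper name `summarize` is purely illustrative.

```python
from statistics import mean, stdev

# ACC (%) for each of the five folds, copied from Table 1 (Yeast dataset).
acc_per_fold = [90.88, 91.55, 92.40, 92.49, 92.31]


def summarize(values):
    """Return (mean, sample standard deviation), each rounded to two decimals.

    Assumption: the table reports the sample standard deviation (n - 1).
    """
    return round(mean(values), 2), round(stdev(values), 2)


acc_mean, acc_std = summarize(acc_per_fold)
print(f"ACC = {acc_mean:.2f} ± {acc_std:.2f}")
```

The same helper can be applied to any other column of Tables 1 to 3 by substituting the per-fold values.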
Table 2. Five-fold CV results performed by the proposed model on the Human PPIs dataset.

| Dataset | ACC (%) | Sen. (%) | Spec. (%) | PR (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| 1 | 96.20 | 95.23 | 97.08 | 96.72 | 92.67 | 0.9834 |
| 2 | 95.47 | 95.23 | 95.69 | 95.47 | 91.34 | 0.9808 |
| 3 | 96.94 | 97.10 | 96.78 | 96.62 | 94.06 | 0.9850 |
| 4 | 96.63 | 95.73 | 97.40 | 96.89 | 93.44 | 0.9817 |
| 5 | 96.51 | 95.53 | 97.41 | 97.14 | 93.24 | 0.9846 |
| Average | 96.35 ± 0.56 | 95.76 ± 0.78 | 96.87 ± 0.71 | 96.57 ± 0.64 | 92.95 ± 1.03 | 0.9831 ± 0.0018 |

Table 3. Five-fold CV results performed by the proposed model on the Oryza sativa PPIs dataset.

| Dataset | ACC (%) | Sen. (%) | Spec. (%) | PR (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| 1 | 93.91 | 94.64 | 93.22 | 92.94 | 88.55 | 0.9635 |
| 2 | 94.38 | 94.64 | 94.13 | 93.83 | 89.38 | 0.9656 |
| 3 | 94.79 | 95.09 | 94.48 | 94.70 | 90.12 | 0.9674 |
| 4 | 94.22 | 95.28 | 93.17 | 93.22 | 89.10 | 0.9628 |
| 5 | 93.91 | 92.84 | 95.08 | 95.40 | 88.54 | 0.9689 |
| Average | 94.24 ± 0.37 | 94.50 ± 0.97 | 94.02 ± 0.82 | 94.02 ± 1.03 | 89.14 ± 0.66 | 0.9667 ± 0.0022 |

Table 4. Predictive performance comparison among four different classifiers.

| Dataset | Method | ACC (%) | Sens. (%) | Spec. (%) | PR (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|---|
| Yeast | SVM | 84.44 ± 0.84 | 83.14 ± 1.01 | 85.77 ± 1.40 | 85.37 ± 1.68 | 73.71 ± 1.17 | 0.9149 ± 0.0061 |
| Yeast | RF | 81.97 ± 0.41 | 80.26 ± 1.27 | 83.68 ± 0.48 | 83.09 ± 0.84 | 70.41 ± 0.55 | 0.8979 ± 0.0038 |
| Yeast | KNN | 81.39 ± 1.07 | 75.19 ± 2.16 | 87.63 ± 1.17 | 85.88 ± 1.21 | 69.47 ± 1.37 | 0.8967 ± 0.0057 |
| Yeast | AdaBoost | 78.15 ± 1.82 | 76.88 ± 1.90 | 79.45 ± 2.95 | 85.46 ± 2.85 | 65.87 ± 1.97 | 0.8546 ± 0.0120 |
| Yeast | RoF | 91.93 ± 0.69 | 89.78 ± 0.79 | 94.05 ± 1.30 | 93.82 ± 1.19 | 85.14 ± 1.15 | 0.9586 ± 0.0018 |
| Human | SVM | 87.93 ± 0.86 | 85.78 ± 1.28 | 89.89 ± 1.37 | 88.59 ± 1.53 | 78.69 ± 1.31 | 0.9446 ± 0.0069 |
| Human | RF | 95.32 ± 0.96 | 92.63 ± 1.93 | 97.82 ± 0.94 | 97.50 ± 1.03 | 91.04 ± 1.74 | 0.9804 ± 0.0016 |
| Human | KNN | 87.92 ± 1.19 | 76.67 ± 2.44 | 98.23 ± 0.49 | 97.51 ± 0.74 | 78.10 ± 1.96 | 0.9758 ± 0.0046 |
| Human | AdaBoost | 75.64 ± 1.69 | 71.36 ± 3.87 | 79.53 ± 3.04 | 76.19 ± 2.29 | 62.88 ± 1.83 | 0.8362 ± 0.0170 |
| Human | RoF | 96.35 ± 0.56 | 95.76 ± 0.78 | 96.87 ± 0.71 | 96.57 ± 0.64 | 92.95 ± 1.03 | 0.9831 ± 0.0018 |
| Oryza sativa | SVM | 85.58 ± 1.27 | 84.06 ± 1.08 | 87.16 ± 2.46 | 86.73 ± 2.63 | 75.32 ± 1.78 | 0.9246 ± 0.0085 |
| Oryza sativa | RF | 84.19 ± 0.92 | 81.71 ± 1.23 | 86.68 ± 1.03 | 85.99 ± 0.85 | 73.34 ± 1.24 | 0.9070 ± 0.0096 |
| Oryza sativa | KNN | 76.51 ± 0.70 | 85.19 ± 0.88 | 67.82 ± 0.91 | 72.58 ± 1.10 | 63.50 ± 0.77 | 0.8327 ± 0.0040 |
| Oryza sativa | AdaBoost | 80.82 ± 1.37 | 81.50 ± 1.87 | 80.16 ± 1.61 | 80.40 ± 2.00 | 69.01 ± 1.67 | 0.8876 ± 0.0132 |
| Oryza sativa | RoF | 94.24 ± 0.37 | 94.50 ± 0.97 | 94.02 ± 0.82 | 94.02 ± 1.03 | 89.14 ± 0.66 | 0.9667 ± 0.0022 |

Table 5. Prediction accuracy on the four independent datasets.

| Species | Test Pairs | Our Method | Ding et al. [37] | Huang et al. [10] | Zhan et al. [38] | Wang et al. [39] |
|---|---|---|---|---|---|---|
| H. sapiens | 1412 | 94.29% | 90.23% | 82.22% | 91.93% | 80.10% |
| H. pylori | 1420 | 91.67% | 90.34% | 82.18% | 91.34% | N/A |
| M. musculus | 313 | 93.12% | 91.37% | 79.87% | 94.89% | 89.14% |
| C. elegans | 4013 | 92.14% | 86.72% | 81.19% | 93.20% | 92.96% |

Table 6. Performance comparisons of computational methods on the Yeast dataset.

| Author | Method | ACC (%) | PR (%) | Sens. (%) | MCC (%) |
|---|---|---|---|---|---|
| Guo et al. [23] | ACC + SVM | 89.33 | 89.93 | 88.87 | N/A |
| Yang et al. [40] | LD + KNN | 86.15 | 90.24 | 81.30 | N/A |
| Wang et al. [41] | 3-MER + CNN | 90.26 | 91.65 | 88.14 | 82.38 |
| Zhou et al. [42] | LD + SVM | 88.56 | 89.50 | 87.37 | 77.15 |
| An et al. [43] | PSSMMF + SVM | 90.48 | 90.58 | 90.26 | 82.84 |
| You et al. [11] | PCA + EELM | 87.00 | 87.59 | 86.15 | 77.36 |
| Our method | DHT + RoF | 91.93 | 93.82 | 89.78 | 85.14 |

Table 7. Performance comparisons of computational methods on the Human dataset.

| Author | Method | ACC (%) | PR (%) | Sens. (%) | MCC (%) |
|---|---|---|---|---|---|
| Ding et al. [37] | MMI + RF | 96.08 | 96.67 | 95.05 | 92.17 |
| Li et al. [44] | OLPP + RoF | 96.09 | 96.56 | 95.20 | 92.47 |
| Pan et al. [45] | LDA + SVM | 90.70 | N/A | 89.7 | 81.3 |
| Li et al. [46] | IWLD + SVM | 90.57 | 89.01 | 91.61 | 81.22 |
| Our method | DHT + RoF | 96.35 | 96.57 | 95.76 | 92.95 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Pan, J.; Wang, S.; Yu, C.; Li, L.; You, Z.; Sun, Y. A Novel Ensemble Learning-Based Computational Method to Predict Protein-Protein Interactions from Protein Primary Sequences. Biology 2022, 11, 775. https://doi.org/10.3390/biology11050775
