Prediction of Self-Interacting Proteins from Protein Sequence Information Based on Random Projection Model and Fast Fourier Transform

Zhan-Heng Chen; Zhu-Hong You; Li-Ping Li; Yan-Bin Wang; Leon Wong; Hai-Cheng Yi

doi:10.3390/ijms20040930

¹ The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China

² University of Chinese Academy of Sciences, Beijing 100049, China

* Author to whom correspondence should be addressed.

Int. J. Mol. Sci. 2019, 20(4), 930; https://doi.org/10.3390/ijms20040930

This article belongs to the Section Molecular Biophysics


Abstract

Predicting self-interacting proteins (SIPs) is an important problem in bioinformatics. SIPs are proteins of which two or more identical copies, expressed from the same gene, can interact with each other. They play a major role in the evolution of protein-protein interactions (PPIs) and in cellular functions. Because experimental identification of SIPs is limited, it is increasingly important to develop computational tools that predict SIPs from protein sequence information. We therefore propose a novel prediction model, RP-FFT, that combines the Random Projection (RP) model with the Fast Fourier Transform (FFT) to detect SIPs. First, each protein sequence was transformed into a Position Specific Scoring Matrix (PSSM) using Position Specific Iterated BLAST (PSI-BLAST). Second, features were extracted from the PSSM by the FFT method. Lastly, we evaluated the performance of RP-FFT and compared the RP classifier with the state-of-the-art support vector machine (SVM) classifier and other existing methods on the human and yeast datasets. Under five-fold cross-validation, the RP-FFT model obtained average accuracies of 96.28% and 91.87% on the human and yeast datasets, respectively. The experimental results demonstrate that the RP-FFT prediction model is reasonable and robust.

Keywords:

self-interacting proteins; position-specific scoring matrix; fast Fourier transform; random projection

1. Introduction

Proteins are an essential component of all cells. They are organic macromolecules, the basic material of life, and the main carriers of cellular activity. Most proteins work together with a partner or with other proteins. Some can interact with one or more copies of themselves; these are termed self-interacting proteins (SIPs). Determining whether a protein can self-interact, however, is difficult for most researchers. SIPs play a key role in the development of protein interaction networks (PINs) [1,2]. The functions of many proteins, such as those that control the transport of ions and small molecules across cell membranes, depend on their homo-oligomers [3]. Ispolatov et al. discovered that the average quantity of SIPs is more than twice that of other proteins in the PINs [4]. Knowing whether a protein can self-interact is crucial for elucidating its function; it also gives insight into the regulation of protein function and can help us better understand disease mechanisms [5]. Over the past few years, many studies have shown that homo-oligomerization plays an important role in many biological processes, such as signal transduction, gene expression regulation, immune response, and enzyme activation [6,7,8,9]. Self-interaction can also improve stability and protect proteins against cellular stress and denaturation by reducing their surface area [10].

Many computational methods have been developed in bioinformatics [11,12,13,14,15,16] and genomics [17,18,19,20,21,22], and a number of methods for predicting PPIs have been put forward. For example, Pitre et al. [23] proposed the Protein-Protein Interaction Prediction Engine (PIPE), which predicts PPIs for any target pair of S. cerevisiae (yeast) proteins from their primary structure, without any additional information or predictions about the proteins. Xia et al. [24] put forward a sequence-based multi-classifier system that applies an auto-correlation descriptor to encode a protein interaction pair and uses rotation forest as the classifier to infer PPIs. These methods work well for PPI detection [25], but they have limitations for self-interaction detection: they rely on correlations between the two proteins of a pair, for example co-expression, co-localization, and co-evolution, and this information is uninformative when both partners are the same protein. Moreover, the datasets for PPI detection are balanced, whereas those for SIPs are unbalanced. In addition, PPI prediction datasets contain no interactions between identical partners. For these reasons, the above computational models are not suitable for SIP detection, and it is becoming more and more important to develop an effective computational method to predict SIPs.

In this study, we propose a random projection (RP) method that predicts SIPs from protein sequence information using the Fast Fourier Transform (FFT). The method has four main steps: (1) the protein sequence information is represented as a Position-Specific Scoring Matrix (PSSM); (2) the FFT method is used to extract feature vectors from the PSSM; (3) Principal Component Analysis (PCA) is applied after FFT to reduce the dimensionality of the data and suppress noise, so that the pattern in the data is easier to find; (4) the RP algorithm is employed to build the training sets on which the classifier is trained. In detail: first, applying FFT to the PSSM of each protein sequence yields a 400-dimensional feature vector; then, PCA reduces the dimension of the FFT vector to 300 to improve prediction performance; finally, the RP classifier performs classification on the yeast and human datasets. The results show that this method outperforms the SVM-based approach and six other existing methods, which indicates that the proposed model is suitable for predicting SIPs and performs well.

2. Results and Discussion

2.1. Performance Evaluation

In this study, to assess the stability and validity of our prediction model, we used five measures that are commonly used in binary classification tasks: accuracy (Acc.), sensitivity (Sen.), specificity (Spe.), Matthews correlation coefficient (MCC) [26,27,28,29,30,31,32], and balanced accuracy (B_Acc.) [33]. They are defined as follows:

\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}

(1)

\mathrm{Sen} = \frac{TP}{TP + FN}

(2)

\mathrm{Spe} = \frac{TN}{TN + FP}

(3)

\mathrm{MCC} = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP + FN) (TN + FP) (TP + FP) (TN + FN)}}

(4)

\mathrm{B\_Acc} = \frac{\mathrm{Sen} + \mathrm{Spe}}{2} = \frac{2 \, TP \cdot TN + TP \cdot FP + TN \cdot FN}{2 (TP + FN) (TN + FP)},

(5)

where TP is the number of true positives, that is, the number of truly interacting pairs predicted correctly. FP is the number of false positives, that is, truly non-interacting pairs predicted to be interacting. TN is the number of true negatives, that is, truly non-interacting pairs predicted correctly. FN is the number of false negatives, that is, truly interacting pairs predicted to be non-interacting. On the basis of these quantities, a receiver operating characteristic (ROC) curve was plotted to assess the performance of the random projection approach, and the area under the curve (AUC) was computed to estimate the quality of the classifier.
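As an illustration of Equations (1)–(5), the following minimal sketch computes the five measures from the four confusion-matrix counts. The function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the experiments reported here.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return Acc, Sen, Spe, MCC and B_Acc as defined in Equations (1)-(5)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Guard against a zero denominator (e.g. when one class is never predicted).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    b_acc = (sen + spe) / 2
    return {"Acc": acc, "Sen": sen, "Spe": spe, "MCC": mcc, "B_Acc": b_acc}


# Hypothetical confusion counts, chosen only for illustration.
m = binary_metrics(tp=80, tn=900, fp=20, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

On an imbalanced sample such as this one, Acc. is dominated by the negative class, whereas B_Acc. weights sensitivity and specificity equally.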

2.2. Performance of the Proposed Method

To evaluate the performance of the presented model and avoid overfitting, we applied the RP-FFT model to the human dataset. In statistical prediction, three validation methods are frequently used to estimate the expected success rate of a predictor: the independent dataset test, the sub-sampling (or k-fold CV) test, and the leave-one-out CV (LOOCV) test [34,35,36,37,38]. Among the three, the LOOCV test is deemed the least arbitrary and most objective, as demonstrated by Equations (28)–(32) of [39], and hence it has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors [38,40,41]. However, it is time- and resource-consuming. Thus, we used 5-fold CV to examine the proposed models. In 5-fold CV, the benchmark dataset is randomly partitioned into five subsets. One subset is used as the test set and the remaining four subsets are used as the training set. This procedure is repeated five times, so that each subset is used exactly once as the test set. The five results are averaged to give the performance of the classifier. To assess the feasibility and stability of our prediction method, we also estimated the prediction performance of the RP-FFT model on the yeast dataset.
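The 5-fold partitioning described above can be sketched as follows. This is only an illustrative splitter under simple assumptions (a seeded shuffle, near-equal fold sizes); the helper name `k_fold_indices` is hypothetical and does not refer to the software used in our experiments.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Every k-th shuffled index goes to the same fold: k disjoint subsets.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Example with a hypothetical dataset of 23 samples.
splits = list(k_fold_indices(23, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so averaging the five per-fold scores uses every sample once for testing.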

To ensure a fair experiment, we tuned several parameters of the RP-FFT prediction model and used the same settings for the human and yeast datasets. We classified the training and test sets using B1 = 10 independent projections, each one chosen from a block of B2 = 30 candidate projections. We chose the K-Nearest Neighbor (KNN) classifier as the base classifier with the leave-one-out test error estimate, where k = seq(1, 40, by = 3), that is, k ∈ {1, 4, 7, …, 40}.

Our model can not only deal with balanced data, but can also solve the imbalanced data problem to some extent. At first, we employed the undersampling technique, as mentioned in [18], to solve the imbalanced dataset problem. The human dataset included 1441 SIPs as positives and 1441 non-SIPs as negatives. Using the same strategy, the yeast dataset contained 710 positive samples and 710 negative samples. The experimental results can be seen in Table 1 and Table 2.

Table 1. The results of the RP-FFT method with 5-fold cross-validation on the human dataset.

Table 2. The results of the RP-FFT method with 5-fold cross-validation on the yeast dataset.

In addition, the initial imbalanced data collected from DIP, BioGRID, IntAct, InnateDB, and MatrixDB were also used to compare our proposed method with previous work. Reconstructing the dataset by undersampling substantially reduces the size of the initial imbalanced data. We therefore also ran our proposed model on the initial imbalanced data, as shown in Table 3 and Table 4.

Table 3. The results of the RP-FFT method with 5-fold cross-validation on the human dataset.

The experimental results of the RP-FFT prediction model on the human and yeast datasets are listed in Table 3 and Table 4. As Table 3 shows, on the human dataset the proposed model obtained an average accuracy (Acc.), sensitivity (Sen.), specificity (Spe.), Matthews correlation coefficient (MCC), and balanced accuracy (B_Acc.) of 96.28%, 81.48%, 97.62%, 76.46%, and 89.55%, with standard deviations of 0.22%, 2.43%, 0.35%, 1.29%, and 1.08%, respectively. Similarly, Table 4 shows that on the yeast dataset the average Acc., Sen., Spe., MCC, and B_Acc. were 91.87%, 48.81%, 97.42%, 54.62%, and 73.12%, with standard deviations of 0.82%, 4.50%, 0.45%, 4.25%, and 2.30%, respectively.

Table 4. The results of the RP-FFT method with 5-fold cross-validation on the yeast dataset.

These results indicate that the proposed method achieves good performance for SIP prediction, which we attribute to suitable feature extraction and a suitable classifier. The main strengths of our feature extraction technique are as follows: (1) The PSSM gives the score for finding a particular matching amino acid at each position of a target protein sequence. It not only represents the protein sequence information but also retains sufficient prior information, so a PSSM contains the major information of a protein sequence needed for detecting SIPs. (2) We extracted the features from the PSSM by using the Fast Fourier Transform (FFT) method, which further improves the performance of the RP-FFT model. (3) While preserving the integrity of the information in the FFT feature vector, we used Principal Component Analysis (PCA) to reduce the dimension of the data and the influence of noise, which helps reveal the pattern in the data. The experimental results show that the feature vector extracted by applying FFT to the PSSM is well suited to SIP detection.

2.3. Comparison with Other Feature Extraction Methods

In this section, in order to illustrate the use of the FFT feature extraction method, we compared the FFT method with SVD (Singular Value Decomposition), DCT (Discrete Cosine Transform), and COV (Covariance) [42,43] on the Random Projection classifier. The results of RP classifier based on different feature extraction methods with 5-fold cross-validation on the yeast dataset are shown in Table 5. On the whole, it can be seen that the FFT feature extraction method works better than other methods for the yeast dataset.

Table 5. The results of RP classifier based on different feature extraction methods on the yeast dataset.

2.4. Comparison with the SVM-Based Method

Although the RP-FFT model achieved good performance for predicting SIPs, its classifier still needs to be assessed further. We therefore compared the accuracy and stability of the RP classifier with those of the state-of-the-art SVM method, using the same feature extraction approach on the yeast and human datasets. We used the LIBSVM package [44] to run the classification. Before the experiment, several parameters of the SVM classifier had to be optimized. We chose a radial basis function (RBF) as the kernel function and used grid search to optimize its parameters, which were set to c = 0.03 and g = 1200.

As shown in Table 6 and Table 7, we employed 5-fold cross-validation to train and compare the RP and SVM models on the human and yeast datasets, respectively. On the human dataset (Table 6), the SVM classifier obtained an average Acc., Sen., Spe., MCC, and B_Acc. of 93.68%, 23.80%, 100.00%, 47.13%, and 61.90%, respectively, whereas the RP classifier obtained 96.28% average Acc., 81.48% average Sen., 97.62% average Spe., 76.46% average MCC, and 89.55% average B_Acc. Similarly, on the yeast dataset (Table 7), the SVM classifier obtained an average Acc., Sen., Spe., MCC, and B_Acc. of 90.63%, 17.79%, 100.00%, 39.95%, and 58.90%, whereas the RP classifier obtained 91.87% average Acc., 48.81% average Sen., 97.42% average Spe., 54.62% average MCC, and 73.12% average B_Acc. Overall, the RP classifier performs much better than the SVM method, most notably in sensitivity, MCC, and balanced accuracy.

Table 6. Performance comparison of RP and SVM on the human dataset.

Table 7. Performance comparison of RP and SVM on the yeast dataset.

Meanwhile, the ROC curves of RP and SVM on the human and yeast datasets are displayed in Figure 1 and Figure 2. Figure 1 shows that on the human dataset the average area under the curve (AUC) of the SVM classifier is 0.6190 and that of the RP classifier is 0.8955. Figure 2 shows that on the yeast dataset the average AUC of the SVM classifier is 0.5890 and that of the RP classifier is 0.7312. The average AUC of the RP method is thus larger than that of the SVM method on both datasets, which indicates that Random Projection is an accurate and robust method for SIP detection.

Figure 1. Comparison of ROC curves between RP and SVM on human (5-fold cross validation). (a) is the ROC curve of SVM method on human dataset by 5-fold cross validation. (b) is the ROC curve of RP classifier on human dataset by 5-fold cross validation.

Figure 2. Comparison of ROC curves between RP and SVM on yeast (5-fold cross validation). (a) is the ROC curve of SVM method on yeast dataset by 5-fold cross validation. (b) is the ROC curve of RP classifier on yeast dataset by 5-fold cross validation.

2.5. Comparison with Other Existing Methods

We also compared the presented model, RP-FFT, with other existing models on the yeast and human datasets. The comparison results are shown in Table 8 and Table 9. Table 8 shows that the RP-FFT model obtained a higher average accuracy than the other existing models on the yeast dataset. It is also clear that the other six methods got lower specificity and sensitivity than our proposed model for the same dataset. Likewise, as is apparent from Table 9, the overall results of our prediction model are also significantly better than those of the other six models on the human dataset. To sum up, the experimental results show that RP-FFT predicts SIPs more accurately than the six other approaches. We attribute this to its combination of an effective feature extraction method and a suitable classifier, which further illustrates that the RP-FFT model is suitable for predicting SIPs.

Table 8. Comparison of RP-FFT with the other existing models on the yeast dataset.

Table 9. Comparison of RP-FFT with the other existing models on the human dataset.

3. Materials and Methodology

3.1. Datasets

We obtained 20,199 curated human protein sequences from the UniProt database [49]. The PPI data were collected from a variety of sources, including DIP [50], BioGRID [51], IntAct [52], InnateDB [53], and MatrixDB [54]. In this experiment, we built the dataset from PPIs that involve two identical interacting proteins and whose interaction type was described as ‘direct interaction’ in the corresponding databases. On this basis, 2994 human SIPs were obtained.

We then built the datasets used to evaluate our prediction method in three steps [48]: (1) protein sequences shorter than 50 residues or longer than 5000 residues were removed from the human proteome; (2) to build the human positive dataset, we selected high-quality SIPs that met one of the following requirements: (a) the self-interaction was detected by at least one small-scale experiment or two types of large-scale experiments; (b) the protein was annotated as a homo-oligomer (including homodimer and homotrimer) in UniProt; (c) the self-interaction was reported by at least two publications; (3) to build the human negative dataset, we removed from the whole human proteome all SIPs (including proteins labeled as ‘direct interaction’ and the much wider ‘physical association’) as well as the SIPs predicted in the UniProt database. Eventually, the human dataset contained 1441 SIPs as positives and 15,938 non-SIPs as negatives [48].

In addition, the yeast dataset was also built to further illustrate the cross-species performance of the RP-FFT model, which included 710 SIPs samples and 5511 non-SIPs samples [48] via the same strategy mentioned above.

3.2. Position-Specific Scoring Matrix

The Position-Specific Scoring Matrix (PSSM) [55,56,57] is a helpful tool for detecting distantly related proteins. A PSSM can be generated from each protein sequence by applying Position-Specific Iterated BLAST (PSI-BLAST) [58]. Each protein sequence is thereby transformed into an N × 20 PSSM as follows:

M = \{ M_{α β} : α = 1, \dots, N, \; β = 1, \dots, 20 \},

(6)

where N is the length of the protein sequence and the 20 columns correspond to the 20 types of amino acids. For the query protein sequence, the PSSM assigns a score M_αβ to the β-th amino acid at position α. M_αβ can be described as:

M_{α β} = \sum_{k = 1}^{20} p (α, k) \times q (β, k),

(7)

where p(α, k) is the occurrence frequency score of the k-th amino acid at position α of the probe, and q(β, k) is the value of Dayhoff’s mutation matrix between the β-th and k-th amino acids. Accordingly, a high value indicates a strongly conserved position, and a low value indicates a weakly conserved position.
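Equation (7) is a weighted sum over amino acid types, which the following toy sketch makes explicit. To keep the example readable it uses a hypothetical 3-letter alphabet instead of 20 amino acids, and the frequency and substitution values are invented for illustration; they are not entries of Dayhoff's matrix.

```python
def pssm_scores(p, q):
    """Compute M[a][b] = sum_k p[a][k] * q[b][k], following Equation (7)."""
    return [[sum(pa[k] * qb[k] for k in range(len(qb))) for qb in q] for pa in p]


# Hypothetical occurrence frequencies p(alpha, k) for two sequence positions.
p = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0]]
# Hypothetical symmetric substitution scores q(beta, k).
q = [[2, -1, 0],
     [-1, 3, -2],
     [0, -2, 4]]

M = pssm_scores(p, q)
print(M)
```

In this toy case the second position contains only the third letter, so its row of scores equals the third row of q, and its largest score marks the conserved letter.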

In summary, the PSSM is a helpful tool for predicting self-interacting proteins. We generated the PSSM of each protein sequence with PSI-BLAST. To obtain a high degree and a wide range of homologous information, we used three iterations and set the e-value of PSI-BLAST to 0.001. Consequently, the PSSM of each protein sequence is a matrix of N × 20 elements, where the N rows correspond to the residues of the protein and the 20 columns correspond to the 20 different kinds of amino acids.

3.3. Fast Fourier Transform

The Fast Fourier Transform (FFT) [59] was first applied in digital signal processing in a number of diverse areas. It was later used in image processing to describe a curve C that forms a closed boundary. Along the curve there is a data sequence F(t), 0 ≤ t < T. Since F(t) is a periodic function, F(t) = F(t + nT). In this study, we used it to extract feature values. We expand F(t) into a Fourier series, which can be described as follows:

F(t) = \sum_{n = -\infty}^{\infty} ω_{n} e^{2 α π n t / T},

(8)

where ω_n are the Fourier coefficients of F(t):

ω_{n} = \frac{1}{T} \int_{0}^{T} F (t) e^{(- 2 α π n t / T)} d t, n \in ℤ

(9)

The discrete Fourier transform is given by

ω_{n} = \frac{1}{N} \sum_{t = 0}^{N - 1} F (t) e^{(- 2 α π n t / N)}, n = 0, 1, \dots, N - 1,

(10)

where $α = \sqrt{-1}$, $N = 2^{n}$, and $n = 1, 2, \dots, n_{\max}$. F(t) is commonly called the shape signature, which can be any one-dimensional function that represents the shape boundary. Because the Fourier transform captures only the structural characteristics of a shape, it is important to derive the FFT from a perceptually meaningful shape signature. FFT derived from the centroid distance function is superior to FFT derived from other shape signatures. Given the centroid (x_c, y_c) of the shape, the centroid distance function r(t) is defined as the distance of the boundary points from the centroid:

r(t) = \left( [x(t) - x_{c}]^{2} + [y(t) - y_{c}]^{2} \right)^{1/2},

(11)

where $x_{c} = \frac{1}{N} \sum_{t = 0}^{N - 1} x(t)$, $y_{c} = \frac{1}{N} \sum_{t = 0}^{N - 1} y(t)$, and N is the number of boundary points.
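Equations (10) and (11) can be checked on a small example. The sketch below evaluates the discrete Fourier coefficients directly from the definition, which costs O(N²) operations; a true FFT computes the same coefficients faster. The boundary used here (eight points on a circle) is a hypothetical shape chosen because its centroid distance is constant, so only the zeroth coefficient should be non-zero.

```python
import cmath
import math


def dft(signal):
    """Discrete Fourier coefficients with the 1/N normalisation of Equation (10)."""
    n_pts = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * n * t / n_pts)
            for t in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]


def centroid_distance(xs, ys):
    """Centroid distance signature r(t) of Equation (11)."""
    xc = sum(xs) / len(xs)
    yc = sum(ys) / len(ys)
    return [math.hypot(x - xc, y - yc) for x, y in zip(xs, ys)]


# Hypothetical boundary: 8 points on a circle of radius 2 centred at (1, 1).
N = 8
xs = [1 + 2 * math.cos(2 * math.pi * t / N) for t in range(N)]
ys = [1 + 2 * math.sin(2 * math.pi * t / N) for t in range(N)]

r = centroid_distance(xs, ys)
omega = dft(r)
print([round(abs(w), 6) for w in omega])
```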

Extracting informative features is of great importance for machine learning approaches. Because protein sequences contain different numbers of amino acids, the PSSMs produced by PSI-BLAST have different numbers of rows, so using them directly would give feature vectors of different lengths. To solve this problem, we multiply the transpose of the PSSM by the PSSM to obtain a 20 × 20 matrix, and then apply the fast Fourier transform to generate the feature vector from the PSSM profile. In this way, each protein sequence from the yeast and human datasets was transformed into a 400-dimensional vector.
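The following sketch illustrates how the product of the transposed PSSM with the PSSM removes the dependence on sequence length. How the transform is applied to the 20 × 20 matrix is not specified in detail above, so the row-wise transform and the use of coefficient magnitudes are assumptions made only for this example; the function names and the random PSSM values are likewise hypothetical.

```python
import cmath
import math
import random


def dft_magnitudes(row):
    """Magnitudes of the discrete Fourier coefficients of one row."""
    n_pts = len(row)
    return [
        abs(sum(row[t] * cmath.exp(-2j * math.pi * n * t / n_pts)
                for t in range(n_pts)) / n_pts)
        for n in range(n_pts)
    ]


def pssm_to_feature(pssm):
    """Map an N x 20 PSSM to a fixed-length 400-dimensional vector."""
    cols = len(pssm[0])
    # PSSM^T * PSSM is 20 x 20 whatever the sequence length N is.
    gram = [[sum(row[a] * row[b] for row in pssm) for b in range(cols)]
            for a in range(cols)]
    feature = []
    for row in gram:
        # Assumption: transform each row and keep the magnitudes.
        feature.extend(dft_magnitudes(row))
    return feature


rng = random.Random(0)
# Two hypothetical PSSMs for proteins of length 60 and 150.
short = [[rng.randint(-5, 5) for _ in range(20)] for _ in range(60)]
long_ = [[rng.randint(-5, 5) for _ in range(20)] for _ in range(150)]
f_short, f_long = pssm_to_feature(short), pssm_to_feature(long_)
print(len(f_short), len(f_long))
```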

To retain the most important information and improve prediction accuracy, we used Principal Component Analysis (PCA) to reduce the feature dimension of the yeast and human datasets from 400 to 300. Reducing the dimensionality of the datasets can also lower the complexity of the classifier and reduce the generalization error.
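A reduction from 400 to 300 dimensions can be sketched with a PCA computed from the singular value decomposition. This is a generic illustration that assumes NumPy is available; the function `pca_reduce` and the random data are hypothetical, and the sketch does not reproduce the exact PCA implementation used in our experiments.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 400))   # hypothetical: 500 proteins, 400 FFT features
Z = pca_reduce(X, 300)
print(Z.shape)
```

In a cross-validation setting the principal directions would normally be estimated on the training folds only and then applied to the test fold.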

3.4. Support Vector Machine

The support vector machine (SVM) was proposed by Cortes and Vapnik [60] in 1995. An SVM inherently performs binary classification. It is a method based on statistical learning theory and is mainly used in the field of pattern recognition. The purpose of an SVM is to find the hyperplane that maximizes the margin between the two classes, which can be formulated as a convex quadratic programming problem. This idea can be expressed formally as follows:

\begin{matrix} \min_{w, b, ξ} & \frac{w^{T} w}{2} + C \sum_{i = 1}^{l} ξ_{i} \\ \text{subject to} & y_{i} (w^{T} ϕ(x_{i}) + b) \geq 1 - ξ_{i}, \\ & ξ_{i} \geq 0, \end{matrix}

(12)

where (x_i, y_i), i = 1, ..., l, is a training set of instance-label pairs with x_i ∈ Rⁿ and y ∈ {1, −1}^l. The vectors x_i are mapped into a higher-dimensional space by the function ϕ. The kernel function is defined as K(x_i, x_j) ≡ ϕ(x_i)^T ϕ(x_j). Four basic kernels can be found in [61]:

(1): Linear: K(x_i, x_j) = x_i^T x_j.
(2): Polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0.
(3): Radial basis function (RBF): K(x_i, x_j) = exp(−γ||x_i − x_j||²), γ > 0.
(4): Sigmoid: K(x_i, x_j) = tanh(γ x_i^T x_j + r).

Here, γ, r, and d are kernel parameters. In our experiment, we chose RBF as the kernel function.
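For reference, the four kernels listed above can be written as plain functions. This is a minimal sketch for illustration; the function names and default parameter values are hypothetical and are unrelated to the LIBSVM settings used in our experiments.

```python
import math


def dot(x, y):
    """Inner product x^T y of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=2):
    return (gamma * dot(x, y) + r) ** d


def rbf_kernel(x, y, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x, y) + r)


x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y), polynomial_kernel(x, y, r=1.0), rbf_kernel(x, y, gamma=0.5))
```

The RBF kernel depends only on the distance between the two vectors, so it equals 1 for identical inputs and decays towards 0 as they move apart; γ controls how fast it decays.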

3.5. Random Projection Classifier

In mathematics and statistics, random projection (RP) is a technique for reducing the dimensionality of a set of points in Euclidean space. The key idea is that N points in an N-dimensional space can almost always be projected onto a space of dimension C log N with control over the distortion of pairwise distances and the error [62]. This method has been successfully applied to the reconstruction of frequency-sparse signals [63,64], facial recognition [65,66,67], protein mapping [68], and textual and visual information retrieval [69].

Next, we formally describe the random projection technique in detail. First, let

Γ = \{A_{i}\}_{i = 1}^{N}, \quad A_{i} \in R^{n}

(13)

be the original high-dimensional dataset, where n is the dimension of the original space and N is the number of samples in the dataset. The goal of dimensionality reduction is to embed the vectors from the high-dimensional space Rⁿ into a lower-dimensional space R^q, where q << n. The output is represented as follows:

\tilde{Γ} = \{\tilde{A}_{i}\}_{i = 1}^{N}, \quad \tilde{A}_{i} \in R^{q},

(14)

where q approaches the intrinsic dimensionality of Γ. The vectors of Γ̃ are regarded as the embedded vectors.

To reduce the dimension of Γ via random projection, a set of random vectors $γ = \{r_{i}\}_{i = 1}^{n}$ must first be constructed, where r_i ∈ R^q. The random basis can be obtained by two common choices, as follows [62]:

(1): The vectors $\{r_{i}\}_{i = 1}^{n}$ are normally distributed over the q-dimensional unit sphere.
(2): The components of the vectors $\{r_{i}\}_{i = 1}^{n}$ are drawn from the Bernoulli +1/−1 distribution and the vectors are normalized so that $\|r_{i}\|_{l_{2}} = 1$ for i = 1, …, n.

The columns of the q × n matrix R consist of the vectors in γ. The embedding Ã_i of A_i is obtained by

\tilde{A_{i}} = R \cdot A_{i}

(15)
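Equation (15) can be illustrated with a small sketch that builds a q × n matrix whose columns are Gaussian vectors scaled to unit l₂ norm (a normalised Gaussian vector is uniformly distributed on the unit sphere) and then embeds one vector. The helper names, the seed, and the sizes n = 10 and q = 3 are hypothetical choices for illustration only.

```python
import math
import random


def random_projection_matrix(q, n, rng):
    """Return a q x n matrix whose columns are random unit vectors in R^q."""
    cols = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(q)]
        norm = math.sqrt(sum(c * c for c in v))
        cols.append([c / norm for c in v])
    # Transpose the list of columns into row-major form.
    return [[cols[j][i] for j in range(n)] for i in range(q)]


def project(R, a):
    """Embed an n-dimensional vector a into R^q: a_tilde = R . a."""
    return [sum(r_ij * a_j for r_ij, a_j in zip(row, a)) for row in R]


rng = random.Random(42)
n, q = 10, 3
R = random_projection_matrix(q, n, rng)
a = [float(i) for i in range(n)]
a_tilde = project(R, a)
print(len(a), "->", len(a_tilde))
```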

In our proposed method, random projection is employed to build the training sets on which the classifiers are trained. Using random projections increases the diversity among the members of the ensemble.

Next, the dimension of the target space was set to a fraction of the dimension of the space in which the training samples reside. We built an n × N matrix G whose columns are the column vectors in Γ, the training set given in Equation (13):

G = (A_{1} | A_{2} | \dots | A_{N})

(16)

Then, we construct k random matrices $\{R_{i}\}_{i = 1}^{k}$ of size q × n, where q and n are as defined above and k is the number of classifiers in the ensemble. The columns of each matrix are normalized so that their l₂ norm is 1.

Then, we constructed the training sets $\{T_{i}\}_{i = 1}^{k}$ by projecting G onto the k random matrices $\{R_{i}\}_{i = 1}^{k}$, which can be represented as follows:

T_{i} = R_{i} \cdot G, i = 1, \dots, k .

(17)

The training sets are fed into an inducer, which outputs a set of classifiers $\{ℓ_{i}\}_{i = 1}^{k}$. To classify a new instance I with classifier ℓ_i, we first embed I into the reduced space R^q by projecting it with the random matrix R_i:

\tilde{I} = R_{i} \cdot I,

(18)

where Ĩ is the embedding of I. The classification of I is then given by the classification of Ĩ by ℓ_i. In this ensemble method, the random projection classifier applies a data-driven voting threshold to the classification results of all classifiers $\{ℓ_{i}\}_{i = 1}^{k}$ for Ĩ to produce the final classification of I.

In this experiment, the random projections were divided into non-overlapping blocks: we used B1 = 10 projections, each chosen from a block of B2 = 30 candidate projections as the one that achieved the smallest estimate of the test error. We chose the k-Nearest Neighbor (KNN) classifier as the base classifier with the leave-one-out test error estimate, where k = seq(1, 40, by = 3), that is, k ∈ {1, 4, 7, …, 40}. The prior probability of interacting pairs in the training set was taken as the voting parameter. Our classifier aggregates the results of applying the base classifier to the chosen projections, with the data-driven voting threshold determining the final class assignment.
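The selection-and-voting procedure described above can be sketched in a few dozen lines. This is a simplified illustration, not the implementation used in our experiments: the projections use unnormalised Gaussian entries, the target dimension q is fixed by hand, the voting threshold is set to the positive-class proportion of the training set, and the toy data, sizes (B1 = 3, B2 = 4), and function names are all hypothetical.

```python
import math
import random


def project(R, x):
    """Embed x with the random matrix R (one output value per row of R)."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest neighbours (ties go to class 0)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return int(2 * sum(train_y[i] for i in order[:k]) > k)


def loo_error(X, y, k):
    """Leave-one-out error of the k-NN classifier on (X, y)."""
    wrong = 0
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        wrong += knn_predict(rest_X, rest_y, X[i], k) != y[i]
    return wrong / len(X)


def fit_rp_ensemble(X, y, q, B1, B2, k_grid, rng):
    """For each of B1 blocks keep the best of B2 random projections."""
    n = len(X[0])
    members = []
    for _ in range(B1):
        best = None
        for _ in range(B2):
            R = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(q)]
            Z = [project(R, x) for x in X]
            err, k = min((loo_error(Z, y, k), k) for k in k_grid)
            if best is None or err < best[0]:
                best = (err, k, R, Z)
        members.append(best)
    return members


def predict_rp_ensemble(members, y, x, threshold):
    """Average the base-classifier votes and compare with the threshold."""
    votes = [knn_predict(Z, y, project(R, x), k) for _, k, R, Z in members]
    return int(sum(votes) / len(votes) >= threshold)


# Toy data: two well-separated Gaussian classes in 6 dimensions.
rng = random.Random(1)
X = ([[rng.gauss(0, 1) for _ in range(6)] for _ in range(15)]
     + [[rng.gauss(8, 1) for _ in range(6)] for _ in range(15)])
y = [0] * 15 + [1] * 15

members = fit_rp_ensemble(X, y, q=2, B1=3, B2=4, k_grid=[1, 4, 7], rng=rng)
threshold = sum(y) / len(y)  # assumed: positive-class proportion
pred_pos = predict_rp_ensemble(members, y, [8.0] * 6, threshold)
pred_neg = predict_rp_ensemble(members, y, [0.0] * 6, threshold)
print(pred_pos, pred_neg)
```

Choosing the best projection within each block discards projections that happen to collapse the two classes onto each other, which is why the ensemble can outperform a single random projection.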

4. Conclusions

In this study, we developed a new prediction model that detects SIPs from protein sequence information. The model, termed RP-FFT, combines the Position-Specific Scoring Matrix with the Fast Fourier Transform and a Random Projection classifier. A key aspect of the experiment is that the datasets used by the classifier are unbalanced. The main strengths of the presented model are: (1) it uses a reasonable feature extraction method that captures the main information in the data and thereby improves performance; (2) the RP classifier is well suited to SIP prediction. To summarize, the experimental results on the yeast and human datasets indicate that our method performs clearly better than the SVM-based method and six other existing models. In the future, more feature extraction techniques and machine learning or deep learning methods will be explored for detecting SIPs.

Author Contributions

Conceptualization, Z.-H.C. and Z.-H.Y.; methodology, L.-P.L.; software, Z.-H.C.; validation, Z.-H.C., Y.-B.W. and H.-C.Y.; formal analysis, L.W.; investigation, Y.-B.W.; resources, L.-P.L.; data curation, Z.-H.Y.; writing—original draft preparation, Z.-H.C.; writing—review and editing, Y.-B.W.; visualization, L.W.; supervision, L.-P.L.; project administration, H.-C.Y.; funding acquisition, Z.-H.Y.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61373086. The authors would like to thank all the guest editors and anonymous reviewers for their constructive advice.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Chen, Z.-H.; You, Z.-H.; Li, L.-P.; Wang, Y.-B.; Li, X. RP-FIRF: Prediction of Self-interacting Proteins Using Random Projection Classifier Combining with Finite Impulse Response Filter. In Proceedings of the International Conference on Intelligent Computing, Wuhan, China, 15–18 August 2018; pp. 232–234.
2. Liu, Z.; Guo, F.; Zhang, J.; Wang, J.; Lu, L.; Li, D.; He, F. Proteome-wide prediction of self-interacting proteins based on multiple properties. Mol. Cell. Proteom. 2013.
3. Marianayagam, N.J.; Sunde, M.; Matthews, J.M. The power of two: Protein dimerization in biology. Trends Biochem. Sci. 2004, 29, 618–625.
4. Ispolatov, I.; Yuryev, A.; Mazo, I.; Maslov, S. Binding properties and evolution of homodimers in protein–protein interaction networks. Nucleic Acids Res. 2005, 33, 3629–3635.
5. Wang, Y.-B.; You, Z.-H.; Li, L.-P.; Huang, Y.-A.; Yi, H.-C. Detection of interactions between proteins by using legendre moments descriptor to extract discriminatory information embedded in pssm. Molecules 2017, 22, 1366.
6. Woodcock, J.M.; Murphy, J.; Stomski, F.C.; Berndt, M.C.; Lopez, A.F. The dimeric versus monomeric status of 14-3-3ζ is controlled by phosphorylation of Ser58 at the dimer interface. J. Biol. Chem. 2003, 278, 36323–36327.
7. Baisamy, L.; Jurisch, N.; Diviani, D. Leucine zipper-mediated homo-oligomerization regulates the Rho-GEF activity of AKAP-Lbc. J. Biol. Chem. 2005, 280, 15405–15412.
8. Katsamba, P.; Carroll, K.; Ahlsen, G.; Bahna, F.; Vendome, J.; Posy, S.; Rajebhosale, M.; Price, S.; Jessell, T.; Ben-Shaul, A. Linking molecular affinity and cellular specificity in cadherin-mediated adhesion. Proc. Natl. Acad. Sci. USA 2009, 106, 11594–11599.
9. Koike, R.; Kidera, A.; Ota, M. Alteration of oligomeric state and domain architecture is essential for functional transformation between transferase and hydrolase with the same scaffold. Protein Sci. 2009, 18, 2060–2066.
10. Miller, S.; Lesk, A.M.; Janin, J.; Chothia, C. The accessible surface area and stability of oligomeric proteins. Nature 1987, 328, 834.
11. Zeng, X.; Liao, Y.; Liu, Y.; Zou, Q. Prediction and validation of disease genes using HeteSim Scores. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 2017, 14, 687–695.
12. Zou, Q.; Wan, S.; Ju, Y.; Tang, J.; Zeng, X. Pretata: Predicting TATA binding proteins with novel features and dimensionality reduction strategy. BMC Syst. Biol. 2016, 10, 114.
Nanni, L.; Lumini, A.; Brahnam, S. A set of descriptors for identifying the protein–drug interaction in cellular networking. J. Theor. Biol. 2014, 359, 120–128. [Google Scholar] [CrossRef]
Nanni, L.; Brahnam, S. Set of approaches based on 3D structure and Position Specific Scoring Matrix for predicting DNA-binding proteins. Bioinformatics 2018. [Google Scholar] [CrossRef]
You, Z.-H.; Huang, Z.-A.; Zhu, Z.; Yan, G.-Y.; Li, Z.-W.; Wen, Z.; Chen, X. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput. Biol. 2017, 13, e1005455. [Google Scholar] [CrossRef]
You, Z.-H.; Lei, Y.-K.; Gui, J.; Huang, D.-S.; Zhou, X. Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics 2010, 26, 2744–2751. [Google Scholar] [CrossRef]
Zou, Q.; Li, J.; Song, L.; Zeng, X.; Wang, G. Similarity computation strategies in the microRNA-disease network: A survey. Brief. Funct. Genom. 2015, 15, 55–64. [Google Scholar] [CrossRef]
Manavalan, B.; Shin, T.H.; Kim, M.O.; Lee, G. PIP-EL: A new ensemble learning method for improved proinflammatory peptide predictions. Front. Immunol. 2018, 9, 1783. [Google Scholar] [CrossRef]
Wang, Y.-B.; You, Z.-H.; Li, X.; Jiang, T.-H.; Cheng, L.; Chen, Z.-H. Prediction of protein self-interactions using stacked long short-term memory from protein sequences information. BMC Syst. Biol. 2018, 12, 129. [Google Scholar] [CrossRef]
Yi, H.-C.; You, Z.-H.; Huang, D.-S.; Li, X.; Jiang, T.-H.; Li, L.-P. A Deep Learning Framework for Robust and Accurate Prediction of ncRNA-Protein Interactions Using Evolutionary Information. Mol. Ther. Nucleic Acids 2018, 11, 337–344. [Google Scholar] [CrossRef]
You, Z.-H.; Zhou, M.; Luo, X.; Li, S. Highly efficient framework for predicting interactions between proteins. IEEE Trans. Cybern. 2017, 47, 731–743. [Google Scholar] [CrossRef]
Wang, L.; You, Z.-H.; Xia, S.-X.; Liu, F.; Chen, X.; Yan, X.; Zhou, Y. Advancing the prediction accuracy of protein-protein interactions by utilizing evolutionary information from position-specific scoring matrix and ensemble classifier. J. Theor. Biol. 2017, 418, 105–110. [Google Scholar] [CrossRef] [PubMed]
Pitre, S.; Dehne, F.; Chan, A.; Cheetham, J.; Duong, A.; Emili, A.; Gebbia, M.; Greenblatt, J.; Jessulat, M.; Krogan, N. PIPE: A protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinform. 2006, 7, 365. [Google Scholar] [CrossRef] [PubMed]
Xia, J.-F.; Han, K.; Huang, D.-S. Sequence-based prediction of protein-protein interactions by means of rotation forest and autocorrelation descriptor. Protein Pept. Lett. 2010, 17, 137–145. [Google Scholar] [CrossRef] [PubMed]
Wang, Y.-B.; You, Z.-H.; Li, X.; Jiang, T.-H.; Chen, X.; Zhou, X.; Wang, L. Predicting protein–protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol. BioSyst. 2017, 13, 1336–1344. [Google Scholar] [CrossRef] [PubMed]
Basith, S.; Manavalan, B.; Shin, T.H.; Lee, G. iGHBP: Computational identification of growth hormone binding proteins from sequences using extremely randomised tree. Comput. Struct. Biotechnol. J. 2018, 16, 412–420. [Google Scholar] [CrossRef] [PubMed]
Manavalan, B.; Subramaniyam, S.; Shin, T.H.; Kim, M.O.; Lee, G. Machine-learning-based prediction of cell-penetrating peptides and their uptake efficiency with improved accuracy. J. Proteome Res. 2018, 17, 2715–2726. [Google Scholar] [CrossRef] [PubMed]
Wei, L.; Hu, J.; Li, F.; Song, J.; Su, R.; Zou, Q. Comparative analysis and prediction of quorum-sensing peptides using feature representation learning and machine learning algorithms. Brief. Bioinform. 2018. [Google Scholar] [CrossRef]
Manavalan, B.; Shin, T.H.; Kim, M.O.; Lee, G. AIPpred: Sequence-Based Prediction of Anti-inflammatory Peptides Using Random Forest. Front. Pharmacol. 2018, 9, 276. [Google Scholar] [CrossRef]
Wei, L.; Luan, S.; Nagai, L.A.E.; Su, R.; Zou, Q. Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species. Bioinformatics 2018. [Google Scholar] [CrossRef]
Manavalan, B.; Govindaraj, R.G.; Shin, T.H.; Kim, M.O.; Lee, G. iBCE-EL: A new ensemble learning framework for improved linear B-cell epitope prediction. Front. Immunol. 2018, 9, 1695. [Google Scholar] [CrossRef]
Wei, L.; Chen, H.; Su, R. M6APred-EL: A sequence-based predictor for identifying N6-methyladenosine sites using ensemble learning. Mol. Ther. Nucleic Acids 2018, 12, 635–644. [Google Scholar] [CrossRef]
Gabere, M.N.; Noble, W.S. Empirical comparison of web-based antimicrobial peptide prediction tools. Bioinformatics 2017, 33, 1921–1929. [Google Scholar] [CrossRef]
Manavalan, B.; Shin, T.H.; Lee, G. PVP-SVM: Sequence-based prediction of phage virion proteins using a support vector machine. Front. Microbiol. 2018, 9, 476. [Google Scholar] [CrossRef]
Wei, L.; Zhou, C.; Chen, H.; Song, J.; Su, R. ACPred-FL: A sequence-based predictor based on effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 2018, 34, 4007–4016. [Google Scholar] [CrossRef]
Manavalan, B.; Shin, T.H.; Lee, G. DHSpred: Support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest. Oncotarget 2018, 9, 1944. [Google Scholar] [CrossRef]
Wei, L.; Tang, J.; Zou, Q. SkipCPP-Pred: An improved and promising sequence-based predictor for predicting cell-penetrating peptides. BMC Genom. 2017, 18, 1. [Google Scholar] [CrossRef]
Manavalan, B.; Basith, S.; Shin, T.H.; Choi, S.; Kim, M.O.; Lee, G. MLACP: Machine-learning-based prediction of anticancer peptides. Oncotarget 2017, 8, 77121. [Google Scholar] [CrossRef]
Chou, K.-C. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 2011, 273, 236–247. [Google Scholar] [CrossRef]
Dao, F.-Y.; Lv, H.; Wang, F.; Feng, C.-Q.; Ding, H.; Chen, W.; Lin, H. Identify origin of replication in Saccharomyces cerevisiae using two-step feature selection technique. Bioinformatics 2018. [Google Scholar] [CrossRef]
Manavalan, B.; Lee, J. SVMQA: Support–vector-machine-based protein single-model quality assessment. Bioinformatics 2017, 33, 2496–2503. [Google Scholar] [CrossRef]
Nanni, L.; Lumini, A.; Brahnam, S. An empirical study of different approaches for protein classification. Sci. World J. 2014, 2014, 236717. [Google Scholar] [CrossRef] [PubMed]
Nanni, L.; Brahnam, S.; Lumini, A. Wavelet images and Chou’s pseudo amino acid composition for protein classification. Amino Acids 2012, 43, 657–665. [Google Scholar] [CrossRef] [PubMed]
Chang, C.-C.; Lin, C.-J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2011, 2, 27. [Google Scholar] [CrossRef]
Du, X.; Cheng, J.; Zheng, T.; Duan, Z.; Qian, F. A novel feature extraction scheme with ensemble coding for protein–protein interaction prediction. Int. J. Mol. Sci. 2014, 15, 12731–12749. [Google Scholar] [CrossRef] [PubMed]
Zahiri, J.; Yaghoubi, O.; Mohammad-Noori, M.; Ebrahimpour, R.; Masoudi-Nejad, A. PPIevo: Protein–protein interaction prediction from PSSM based evolutionary information. Genomics 2013, 102, 237–242. [Google Scholar] [CrossRef] [PubMed]
Zahiri, J.; Mohammad-Noori, M.; Ebrahimpour, R.; Saadat, S.; Bozorgmehr, J.H.; Goldberg, T.; Masoudi-Nejad, A. LocFuse: Human protein–protein interaction prediction via classifier fusion using protein localization information. Genomics 2014, 104, 496–503. [Google Scholar] [CrossRef] [PubMed]
Liu, X.; Yang, S.; Li, C.; Zhang, Z.; Song, J. SPAR: A random forest-based predictor for self-interacting proteins with fine-grained domain information. Amino Acids 2016, 48, 1655–1665. [Google Scholar] [CrossRef]
Consortium, U. UniProt: A hub for protein information. Nucleic Acids Res. 2014, 43, D204–D212. [Google Scholar] [CrossRef]
Salwinski, L.; Miller, C.S.; Smith, A.J.; Pettit, F.K.; Bowie, J.U.; Eisenberg, D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004, 32, D449–D451. [Google Scholar] [CrossRef]
Chatr-Aryamontri, A.; Oughtred, R.; Boucher, L.; Rust, J.; Chang, C.; Kolas, N.K.; O’Donnell, L.; Oster, S.; Theesfeld, C.; Sellam, A. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 2017, 45, D369–D379. [Google Scholar] [CrossRef]
Orchard, S.; Ammari, M.; Aranda, B.; Breuza, L.; Briganti, L.; Broackes-Carter, F.; Campbell, N.H.; Chavali, G.; Chen, C.; Del-Toro, N. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2013, 42, D358–D363. [Google Scholar] [CrossRef] [PubMed]
Breuer, K.; Foroushani, A.K.; Laird, M.R.; Chen, C.; Sribnaia, A.; Lo, R.; Winsor, G.L.; Hancock, R.E.; Brinkman, F.S.; Lynn, D.J. InnateDB: Systems biology of innate immunity and beyond—recent updates and continuing curation. Nucleic Acids Res. 2012, 41, D1228–D1233. [Google Scholar] [CrossRef] [PubMed]
Chautard, E.; Fatoux-Ardore, M.; Ballut, L.; Thierry-Mieg, N.; Ricard-Blum, S. MatrixDB, the extracellular matrix interaction database. Nucleic Acids Res. 2010, 39, D235–D240. [Google Scholar] [CrossRef] [PubMed]
Gribskov, M.; McLachlan, A.D.; Eisenberg, D. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 1987, 84, 4355–4358. [Google Scholar] [CrossRef] [PubMed]
Wang, Y.; You, Z.; Li, X.; Chen, X.; Jiang, T.; Zhang, J. PCVMZM: Using the Probabilistic Classification Vector Machines Model Combined with a Zernike Moments Descriptor to Predict Protein–Protein Interactions from Protein Sequences. Int. J. Mol. Sci. 2017, 18, 1029. [Google Scholar] [CrossRef] [PubMed]
Wang, Y.-B.; You, Z.-H.; Li, L.-P.; Huang, D.-S.; Zhou, F.-F.; Yang, S. Improving Prediction of Self-interacting Proteins Using Stacked Sparse Auto-Encoder with PSSM profiles. Int. J. Biol. Sci. 2018, 14, 983–991. [Google Scholar] [CrossRef] [PubMed]
Altschul, S.F.; Koonin, E.V. Iterated profile searches with PSI-BLAST—A tool for discovery in protein databases. Trends Biochem. Sci. 1998, 23, 444–447. [Google Scholar] [CrossRef]
Ahmed, N.; Rao, K.R. Orthogonal Transforms for Digital Signal Processing; Springer Science & Business Media: Berlin, Germany, 2012. [Google Scholar]
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. A Practical Guide to Support Vector Classification; National Taiwan University: Taipei, Taiwan, 2003. [Google Scholar]
Schclar, A.; Rokach, L. Random projection ensemble classifiers. In Proceedings of the International Conference on Enterprise Information Systems, Milan, Italy, 6–10 May 2009; pp. 309–316. [Google Scholar]
Candès, E.J.; Romberg, J.; Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 2006, 52, 489–509. [Google Scholar] [CrossRef]
Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306. [Google Scholar] [CrossRef]
Goel, N.; Bebis, G.; Nefian, A. Face recognition experiments with random projection. Proc. SPIE 2005, 5779, 426–438. [Google Scholar]
Lumini, A.; Nanni, L.; Brahnam, S. Ensemble of texture descriptors and classifiers for face recognition. Appl. Comput. Inf. 2017, 13, 79–91. [Google Scholar] [CrossRef]
Nanni, L.; Lumini, A.; Brahnam, S. Ensemble of texture descriptors for face recognition obtained by varying feature transforms and preprocessing approaches. Appl. Soft Comput. 2017, 61, 8–16. [Google Scholar] [CrossRef]
Linial, M.; Linial, N.; Tishby, N.; Yona, G. Global self-organization of all known protein sequences reveals inherent biological signatures1. J. Mol. Biol. 1997, 268, 539–556. [Google Scholar] [CrossRef] [PubMed]
Bingham, E.; Mannila, H. Random projection in dimensionality reduction: Applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001; pp. 245–250. [Google Scholar]