Prediction of Self-Interacting Proteins from Protein Sequence Information Based on Random Projection Model and Fast Fourier Transform

In the field of bioinformatics, predicting self-interacting proteins (SIPs) is significant for understanding biological cells. SIPs arise when two or more identical protein copies, expressed by one gene, interact with each other; this plays a major role in the evolution of protein-protein interactions (PPIs) and cellular functions. Owing to the limitations of the experimental identification of self-interacting proteins, it is increasingly important to develop a useful computational tool for the prediction of SIPs from protein sequence information. Therefore, we propose a novel prediction model called RP-FFT that merges the Random Projection (RP) model and the Fast Fourier Transform (FFT) for detecting SIPs. First, each protein sequence was transformed into a Position Specific Scoring Matrix (PSSM) using Position Specific Iterated BLAST (PSI-BLAST). Second, features of the protein sequences were extracted by applying the FFT to the PSSM. Lastly, we evaluated the performance of RP-FFT and compared the RP classifier with the state-of-the-art support vector machine (SVM) classifier and other existing methods on the human and yeast datasets; under five-fold cross-validation, the RP-FFT model obtained high average accuracies of 96.28% and 91.87% on the human and yeast datasets, respectively. The experimental results demonstrate that our RP-FFT prediction model is reasonable and robust.


Introduction
Protein is an important component of all cells. It is an organic macromolecule, a basic material of life, and the main agent of cellular activity; without protein, there is no life. Most proteins work together with a partner or with other proteins. A protein that can interact with two or more copies of itself is termed a self-interacting protein (SIP). However, for most researchers, determining whether proteins can interact with each other is difficult. SIPs play a key role in the development of protein interaction networks (PINs) [1,2]. The functions of many proteins that control the transport of ions and small molecules across cell membranes depend on their homo-oligomers [3]. Ispolatov et al. discovered that the average number of interactions of SIPs is more than twice that of other proteins in PINs [4]. Comprehending whether a protein can self-interact is crucial for elucidating the functions of SIPs; it also gives us insight into the regulation of protein function and helps us achieve a better understanding of disease mechanisms [5].

Evaluation Criteria
To quantify prediction performance, the balance accuracy is defined as

B_Acc = (Sen + Spe) / 2 = (2·TP·TN + TP·FP + TN·FN) / (2(TP + FN)(TN + FP)), (5)

where TP represents the count of true positives, that is, the number of real interacting pairs predicted correctly; FP is the count of false positives, defined as the number of real non-interacting pairs mis-predicted as interacting; TN is the count of true negatives, i.e., the number of real non-interacting pairs predicted correctly; and FN is the count of false negatives, i.e., true samples predicted as negative. On the basis of these parameters, a Receiver Operating Characteristic (ROC) curve was plotted to assess the performance of the random projection approach, and the area under the curve (AUC) was computed to estimate the quality of the classifier.
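As a concrete check of these definitions, the measures can be computed directly from confusion-matrix counts; the counts used below are hypothetical, not results from the paper.

```python
# Evaluation measures computed from confusion-matrix counts.
# The counts used here are invented for illustration only.
import math

def metrics(tp, fp, tn, fn):
    sen = tp / (tp + fn)                   # sensitivity
    spe = tn / (tn + fp)                   # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    b_acc = (sen + spe) / 2                # balance accuracy, Equation (5)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sen, spe, mcc, b_acc

acc, sen, spe, mcc, b_acc = metrics(tp=80, fp=10, tn=90, fn=20)
```

Note that (Sen + Spe)/2 agrees numerically with the expanded fraction (2·TP·TN + TP·FP + TN·FN) / (2(TP + FN)(TN + FP)), since Sen = TP/(TP + FN) and Spe = TN/(TN + FP).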

Performance of the Proposed Method
In order to evaluate the performance of the presented model and avoid overfitting, we applied the RP-FFT model to the human dataset. In statistical prediction, three cross-validation (CV) methods are frequently used to estimate the expected success rate of a developed predictor: the independent dataset test, the sub-sampling (or k-fold CV) test, and the leave-one-out CV (LOOCV) test [34][35][36][37][38]. Among the three, the LOOCV test is deemed the least arbitrary and most objective, as demonstrated by Equations (28)-(32) of [39], and hence it has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors [38,40,41]. However, it is time- and resource-consuming. Thus, we used 5-fold CV to examine the proposed models. In 5-fold CV, the benchmark dataset is randomly partitioned into five subsets. One subset is used as the test set and the remaining four subsets are used as the training set. This procedure is repeated five times, so that each subset is used once as a test set, and the five corresponding results are averaged to give the performance of the classifier. To assess the feasibility and stability of our prediction method, we also estimated the prediction performance of the RP-FFT model on the yeast dataset.
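The 5-fold protocol just described can be sketched as follows; the nearest-centroid classifier and the synthetic two-cluster data are stand-ins for RP-FFT and the real feature vectors, used only to make the example self-contained.

```python
# A minimal sketch of 5-fold cross-validation with a toy classifier.
import numpy as np

def five_fold_indices(n, seed=0):
    """Shuffle indices 0..n-1 and split them into 5 roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), 5)

def nearest_centroid_cv_accuracy(X, y, folds):
    accs = []
    for i, test_idx in enumerate(folds):
        # The remaining four folds form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        c0 = X[train_idx][y[train_idx] == 0].mean(axis=0)
        c1 = X[train_idx][y[train_idx] == 1].mean(axis=0)
        pred = (np.linalg.norm(X[test_idx] - c1, axis=1)
                < np.linalg.norm(X[test_idx] - c0, axis=1)).astype(int)
        accs.append((pred == y[test_idx]).mean())
    return float(np.mean(accs))  # average over the five folds

# Two well-separated Gaussian clusters as stand-in feature vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(5, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
folds = five_fold_indices(len(y))
avg_acc = nearest_centroid_cv_accuracy(X, y, folds)
```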
To ensure the fairness of the experiment, we optimized a number of parameters of the RP-FFT prediction model, using the same parameter settings for the human and yeast datasets. Specifically, we built the training and test sets from B1 = 10 independent projections, each carefully chosen from a block of size B2 = 30, and chose the K-Nearest Neighbor (KNN) algorithm as the base classifier with the leave-one-out test error estimate, where k = seq(1, 40, by = 3), i.e., the candidate neighborhood sizes 1, 4, 7, ..., 40 in R notation.
Our model can not only deal with balanced data but can also, to some extent, solve the imbalanced data problem. At first, we employed the undersampling technique mentioned in [18] to address the imbalanced dataset problem. The human dataset included 1441 SIPs as positives and 1441 non-SIPs as negatives; using the same strategy, the yeast dataset contained 710 positive samples and 710 negative samples. The experimental results can be seen in Tables 1 and 2. In addition, the initial imbalanced data collected from DIP, BioGRID, IntAct, InnateDB, and MatrixDB were also used to compare our proposed method with previous work; if the undersampling technique were used to reconstruct these datasets, their size would be substantially reduced. As shown in Tables 2 and 3, we therefore also ran our proposed model on the initial imbalanced data. The experimental results of the RP-FFT prediction model on the human and yeast datasets are listed in Tables 3 and 4. Table 3 shows that the proposed model obtained average Accuracy (Acc.), Sensitivity (Sen.), Specificity (Spe.), Matthews correlation coefficient (MCC), and Balance accuracy (B_Acc.) of 96.28%, 81.48%, 97.62%, 76.46%, and 89.55% on the human dataset, with standard deviations of 0.22%, 2.43%, 0.35%, 1.29%, and 1.08%, respectively. In the same way, Table 4 reports good results for the yeast dataset: average Acc., Sen., Spe., MCC, and B_Acc. of 91.87%, 48.81%, 97.42%, 54.62%, and 73.12%, with standard deviations of 0.82%, 4.50%, 0.45%, 4.25%, and 2.30%, respectively. From these data, it is clear that the proposed method achieves good outcomes for SIP prediction thanks to suitable feature extraction and a suitable classifier. The main strengths of our feature extraction technique are the following: (1) The PSSM gives the score for finding a particular matching amino acid at each position of a target protein sequence.
It is a good tool that not only represents the protein sequence information but also preserves sufficient prior information; therefore, a PSSM contains the major information of a protein sequence for detecting SIPs. (2) We extracted features from the protein sequence using the Fast Fourier Transform (FFT) method, which further increases the performance of the RP-FFT model. (3) While preserving the integrity of the FFT feature vector, we used Principal Component Analysis (PCA) to reduce the dimensionality of the data and the influence of noise, so that the underlying pattern in the data can be found. Experimental results revealed that the eigenvector extracted by applying FFT to the PSSM is quite suitable for SIP detection.

Comparison with Other Feature Extraction Methods
In this section, in order to demonstrate the merit of the FFT feature extraction method, we compared FFT with SVD (Singular Value Decomposition), DCT (Discrete Cosine Transform), and COV (Covariance) [42,43] using the Random Projection classifier. The results of the RP classifier based on the different feature extraction methods, with 5-fold cross-validation on the yeast dataset, are shown in Table 5. On the whole, the FFT feature extraction method works better than the other methods for the yeast dataset.

Comparison with the SVM-Based Method
Though the RP-FFT model achieved good performance for predicting SIPs, we still needed to assess it against a strong baseline. The accuracy and stability of the RP classifier were compared with those of the state-of-the-art SVM method using the same feature extraction approach on the yeast and human datasets, respectively. We applied the LIBSVM package [44] to run the classification. Before the experiment, several parameters of the SVM classifier had to be optimized. In this paper, we chose a radial basis function (RBF) as the kernel function and then used a grid search to optimize its parameters, which were set to c = 0.03 and g = 1200.
As shown in Tables 6 and 7, we employed 5-fold cross-validation to train and compare the RP and SVM models on the yeast and human datasets. The average Acc., Sen., Spe., MCC, and B_Acc. of the SVM classifier are 93.68%, 23.80%, 100.00%, 47.13%, and 61.90% on the human dataset (Table 6), whereas the RP classifier obtained 96.28% average Acc., 81.48% average Sen., 97.62% average Spe., 76.46% average MCC, and 89.55% average B_Acc. on the same dataset. Similarly, the average Acc., Sen., Spe., MCC, and B_Acc. of the SVM classifier are 90.63%, 17.79%, 100.00%, 39.95%, and 58.90% on the yeast dataset (Table 7), whereas the RP classifier achieved 91.87% average Acc., 48.81% average Sen., 97.42% average Spe., 54.62% average MCC, and 73.12% average B_Acc. on the yeast dataset. In a word, the overall prediction result of the RP classifier is much better than that of the SVM method. Meanwhile, the ROC curves of RP and SVM on the human and yeast datasets are displayed in Figures 1 and 2. From Figure 1, the average area under the curve (AUC) of the SVM classifier is 0.6190 and that of the RP classifier is 0.8955; from Figure 2, the average AUC of the SVM classifier is 0.5890 and that of the RP classifier is 0.7312.
It is obvious that the average AUC of the RP method is larger than that of the SVM method, so Random Projection is an accurate and robust method for SIP detection.

Comparison with Other Existing Methods
In our study, we compared the presented model, RP-FFT, with other existing models on the yeast and human datasets to further prove that it achieves good results. The comparison results on the yeast and human datasets are shown in Tables 8 and 9. From Table 8, it is obvious that the RP-FFT model obtained a higher average accuracy than the other existing models on the yeast dataset; the other six methods also achieved lower specificity and sensitivity than our proposed model on the same dataset. Likewise, as is apparent from Table 9, the overall outcomes of our prediction model are significantly better than those of the other six models on the human dataset. To sum up, the experimental results prove the accuracy of the RP-FFT model for predicting SIPs compared with the six approaches; its superiority stems from a good feature extraction method combined with a suitable classifier. This further illustrates that the RP-FFT model is well suited for predicting SIPs.

Datasets
The datasets were derived from the UniProt database [49], which includes 20,199 curated human protein sequences. The PPI data were collected from a variety of sources, including DIP [50], BioGRID [51], IntAct [52], InnateDB [53], and MatrixDB [54]. In this experiment, we built a PPI dataset containing pairs of two identical interacting proteins whose interaction type was described as 'direct interaction' in the relevant databases. On this foundation, 2994 human SIPs were obtained.
We built the datasets to estimate the performance of our prediction method in three steps [48]: (1) protein sequences with fewer than 50 or more than 5000 residues were removed from the human proteome; (2) to build the human positive dataset, we picked out high-quality SIP data that met at least one of the following requirements: (a) the self-interaction was reported by at least one small-scale experiment or two types of large-scale experiments; (b) the protein is annotated as a homo-oligomer (comprising homodimer and homotrimer) in UniProt; (c) the self-interaction has been reported by at least two publications; (3) for the human negative dataset, we removed all known SIPs from the whole human proteome (including proteins labeled as 'direct interaction' and the much wider 'physical association') as well as proteins whose self-interaction is predicted in the UniProt database. Eventually, the human dataset contained 1441 SIPs as the positive dataset and 15,938 non-SIPs as the negative dataset [48].
In addition, the yeast dataset was built via the same strategy to further illustrate the cross-species performance of the RP-FFT model; it includes 710 SIP samples and 5511 non-SIP samples [48].
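Step (1) of this construction, the 50-5000 residue length filter, can be sketched as a simple predicate; the sequence dictionary below is invented for illustration.

```python
# Keep only sequences whose length lies in [50, 5000] residues,
# mirroring step (1) of the dataset construction. Toy data only.
def length_filter(seqs, lo=50, hi=5000):
    return {name: s for name, s in seqs.items() if lo <= len(s) <= hi}

proteome = {"P1": "A" * 30, "P2": "M" * 120, "P3": "K" * 6000}
kept = length_filter(proteome)
```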

Position-Specific Scoring Matrix
We discovered distantly related proteins by applying the Position-Specific Scoring Matrix (PSSM) [55][56][57], which is a helpful tool. A PSSM can be generated from each protein sequence by applying the Position-Specific Iterated BLAST (PSI-BLAST) [58]. Each protein sequence is thereby transformed into an N × 20 PSSM matrix M = {M_αβ : α = 1, ..., N; β = 1, ..., 20}, where N is the length of the protein sequence and the 20 columns correspond to the 20 types of amino acids. For a query protein sequence, the PSSM assigns the value M_αβ to the β-th amino acid at position α, which can be described as M_αβ = Σ_{k=1}^{20} p(α, k) × q(β, k), where p(α, k) is the occurrence frequency score of the k-th amino acid at position α of the probe, and q(β, k) is the value of Dayhoff's mutation matrix between the β-th and k-th amino acids. Accordingly, a high value indicates a strongly conserved position, while a low value indicates a weakly conserved position.
In conclusion, the PSSM is a helpful tool for predicting self-interacting proteins. Each PSSM was generated from the protein sequence by employing PSI-BLAST for SIP detection. To obtain a high degree and a wide range of homologous information, we used three iterations and set the e-value of PSI-BLAST to 0.001. Consequently, the PSSM of each protein sequence can be expressed as a matrix of M × 20 elements, where the number of rows M is the number of residues of the protein and the 20 columns correspond to the 20 different kinds of amino acids.

Fast Fourier Transform
The Fast Fourier Transform (FFT) [59] was first applied in digital signal processing across a number of diverse areas. It was later used in image processing for a given curve C whose shape forms a closed region. At a certain time t, there is a data sequence F(t), 0 ≤ t < T; since F(t) is periodic, F(t) = F(t + nT). In this study, we used the FFT to extract eigenvalues. Hence, we expand F(t) into a Fourier series, F(t) = Σ_n ω_n e^{i2πnt/T}, where the ω_n are the Fourier coefficients of F(t).
The discrete Fourier transform is given by ω_n = (1/N) Σ_{t=0}^{N−1} F(t) e^{−α2πnt/N}, where α = √−1, N = 2^n, and n = 1, 2, ..., n_max. F(t) is commonly called the shape signature, a one-dimensional function representing the shape boundary. Since the Fourier transform can only capture the structural characteristics of a shape, it is important to derive the FFT from a perceptually meaningful shape signature; the FFT derived from the centroid distance function is superior to FFTs derived from other shape signatures. Given the centroid (x_c, y_c) of the shape, the centroid distance function r(t) is defined by the distance of the boundary points (x(t), y(t)) from the centroid: r(t) = √((x(t) − x_c)² + (y(t) − y_c)²).

It is a matter of great significance to extract informative characteristics for machine learning approaches. In our study, because each protein sequence is composed of a different number of amino acids, a feature vector cannot be obtained directly from the PSSM produced by PSI-BLAST, as this would lead to eigenvectors of varying length. To solve this problem, we multiply the transpose of the PSSM by the PSSM itself to obtain a 20 × 20 matrix, and the fast Fourier transform is then employed to generate characteristic vectors from this matrix. In the end, each protein sequence from the yeast and human datasets was transformed into a 400-dimensional vector by the FFT.
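This feature extraction step can be sketched as follows. Taking the magnitude of the complex spectrum is an assumption on our part, since the text does not state how the complex FFT output is reduced to real-valued features, and the random matrices stand in for real PSI-BLAST PSSMs.

```python
# Collapse an N x 20 PSSM into a fixed 20 x 20 matrix via PSSM^T @ PSSM,
# apply a 2-D FFT, and flatten to a 400-dimensional feature vector.
import numpy as np

def fft_feature(pssm):
    m = pssm.T @ pssm                # 20 x 20, independent of sequence length N
    spectrum = np.fft.fft2(m)        # 2-D fast Fourier transform (complex)
    return np.abs(spectrum).ravel()  # magnitude spectrum -> 400-dim vector

rng = np.random.default_rng(0)
short_protein = rng.normal(size=(60, 20))   # N = 60 residues
long_protein = rng.normal(size=(900, 20))   # N = 900 residues
f1, f2 = fft_feature(short_protein), fft_feature(long_protein)
```

The key point is that sequences of different lengths (60 and 900 residues here) both yield vectors of the same fixed dimension.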
In our study, in order to retain the most important information and improve the prediction accuracy, we used the Principal Component Analysis (PCA) approach to reduce the feature dimensionality of the yeast and human datasets from 400 to 300. Furthermore, reducing the dimensionality of the datasets lowers the complexity of the classifier and improves its generalization error.
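A minimal NumPy-only PCA sketch of the 400-to-300 reduction; the SVD-based implementation and the random stand-in feature matrix are illustrative, not the authors' code.

```python
# PCA via SVD: center the features, then project onto the top components.
import numpy as np

def pca_reduce(X, n_components=300):
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T              # keep top n_components

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 400))           # 500 proteins x 400 FFT features
reduced = pca_reduce(features, 300)
```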

Support Vector Machine
The support vector machine (SVM) was first proposed by Cortes and Vapnik [60] in 1995. SVMs are inherently binary classifiers grounded in statistical learning theory and are mainly used in the field of pattern recognition. The purpose of an SVM is to find the hyperplane that maximizes the margin between the two classes, which can be transformed into a convex quadratic programming problem: min_{w,b,ξ} (1/2)wᵀw + C Σ_{i=1}^{l} ξ_i subject to y_i(wᵀØ(x_i) + b) ≥ 1 − ξ_i and ξ_i ≥ 0, where (x_i, y_i), i = 1, ..., l, is a training set of instance-label pairs, the x_i ∈ Rⁿ are mapped into a higher-dimensional space by the function Ø, and y ∈ {1, −1}^l. Furthermore, the kernel function can be described as K(x_i, x_j) = Ø(x_i)ᵀØ(x_j). Four basic kernels can be found in [61]: (1) linear: K(x_i, x_j) = x_iᵀx_j; (2) polynomial: K(x_i, x_j) = (γx_iᵀx_j + r)^d; (3) radial basis function (RBF): K(x_i, x_j) = exp(−γ||x_i − x_j||²); (4) sigmoid: K(x_i, x_j) = tanh(γx_iᵀx_j + r). Here, γ, r, and d are kernel parameters. In our experiment, we chose the RBF as the kernel function.
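The four basic kernels can be written out directly; the function names and parameter defaults below are illustrative.

```python
# The four standard SVM kernels as plain NumPy functions.
import numpy as np

def linear(x, z):
    return float(x @ z)

def poly(x, z, gamma=1.0, r=0.0, d=3):
    return float((gamma * (x @ z) + r) ** d)

def rbf(x, z, gamma=0.5):
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

def sigmoid(x, z, gamma=1.0, r=0.0):
    return float(np.tanh(gamma * (x @ z) + r))

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
```

Note that the RBF kernel always equals 1 when x = z, and decays with the squared distance between the two vectors.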

Random Projection Classifier
In mathematics and statistics, random projection (RP) is a technique for reducing the dimensionality of a set of points in Euclidean space. The key property of the RP method is that N points in a high-dimensional space can almost always be projected onto a space of dimension C log N while controlling the distortion of pairwise distances [62]. This method has been successfully applied to the reconstruction of frequency-sparse signals [63,64], facial recognition [65][66][67], protein mapping [68], and textual and visual information retrieval [69].
Next, we formally describe the random projection technique in detail. Let Γ = {A_i}_{i=1}^{N} ⊂ Rⁿ be the original high-dimensional dataset, where n is the dimension and N is the number of samples. The goal of dimensionality reduction is to embed the eigenvectors from the high-dimensional space Rⁿ into a lower-dimensional space R^q, where q << n. The output is represented as Γ̃ = {Ã_i}_{i=1}^{N} ⊂ R^q, where q approaches the inherent dimensionality of Γ; the vectors Ã_i are regarded as the embedding vectors.
To reduce the dimension of Γ via the random projection method, a set of random vectors γ = {r_i}_{i=1}^{k} with r_i ∈ R^q must first be constructed. The random basis can be obtained by two common choices [62]: (1) the vectors {r_i}_{i=1}^{k} are normally distributed over the q-dimensional unit sphere; (2) the components of the vectors {r_i}_{i=1}^{k} are drawn from a Bernoulli +1/−1 distribution and the vectors are normalized so that ||r_i||_{l2} = 1 for i = 1, ..., k.
The columns of the q × n matrix R consist of the vectors in γ. The embedding result Ã_i of A_i can then be obtained as Ã_i = R·A_i. In our proposed method, random projection is employed to build the training sets on which the classifiers are trained, which enriches the diversity of the ensemble method.
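The embedding Ã_i = R·A_i can be sketched with the Bernoulli +1/−1 construction described above; the dimensions q = 30 and n = 300 are arbitrary example values.

```python
# Build a q x n random projection matrix with unit-l2-norm columns from
# Bernoulli +/-1 entries, then embed one high-dimensional vector.
import numpy as np

def random_projection_matrix(q, n, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(q, n))   # Bernoulli +/-1 entries
    return R / np.linalg.norm(R, axis=0)       # normalize each column to norm 1

q, n = 30, 300
R = random_projection_matrix(q, n)
A = np.random.default_rng(1).normal(size=n)    # one high-dimensional sample
A_tilde = R @ A                                # embedded q-dimensional vector
```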
Next, the dimension of the target space was set close to the intrinsic dimension of the space in which the training members reside. We built an n × N matrix G whose columns are the column eigenvectors of the training set Γ given in Equation (7).
Then, we construct k random matrices {R_i}_{i=1}^{k} of size q × n, where q and n are as in the preceding paragraph and k is the number of ensemble classifiers. The columns of these matrices are normalized so that their l2 norm is 1.
Using our method, we then constructed k training sets {R_i·G}_{i=1}^{k}, one per random matrix. These training sets are fed into an inducer, and the outputs are a set of classifiers {C_i}_{i=1}^{k}. How do we classify a new sample I with classifier C_i? First, we embed I into the reduced space R^q via the random matrix R_i, obtaining Ĩ = R_i·I; the classification of I is then obtained as the classification of Ĩ by C_i. In this ensemble method, the random projection classifier applies a data-driven voting threshold to the classification results of all the classifiers {C_i}_{i=1}^{k} to produce the ultimate classification result.
In this experiment, the random projections were divided into non-overlapping blocks, with B1 = 10 projections retained, each carefully chosen from a block of size B2 = 30 as the one achieving the smallest estimate of the test error. We chose the k-Nearest Neighbor (KNN) algorithm as the base classifier with the leave-one-out test error estimate, where k = seq(1, 40, by = 3). The prior probability of interacting pairs in the training dataset was taken as the voting parameter. Our classifier aggregates the results of the base classifier applied to the chosen projections, with the data-driven voting threshold determining the final decision.
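A condensed sketch of the whole ensemble follows, assuming a plain majority vote in place of the paper's data-driven threshold and omitting the B1/B2 block-selection step; the 1-NN base classifier and the toy two-cluster data are stand-ins chosen to keep the example self-contained.

```python
# Project the data through several random matrices, train a 1-NN base
# classifier on each projection, and combine the votes by simple majority.
import numpy as np

def one_nn_predict(Xtr, ytr, Xte):
    # Label of each test point = label of its nearest training point.
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    return ytr[d.argmin(axis=1)]

def rp_ensemble_predict(Xtr, ytr, Xte, q=5, k=11, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(k):
        R = rng.normal(size=(Xtr.shape[1], q)) / np.sqrt(q)  # random projection
        votes.append(one_nn_predict(Xtr @ R, ytr, Xte @ R))
    return (np.mean(votes, axis=0) > 0.5).astype(int)        # majority vote

rng = np.random.default_rng(2)
Xtr = np.vstack([rng.normal(0, 1, (40, 20)), rng.normal(4, 1, (40, 20))])
ytr = np.array([0] * 40 + [1] * 40)
Xte = np.vstack([rng.normal(0, 1, (10, 20)), rng.normal(4, 1, (10, 20))])
pred = rp_ensemble_predict(Xtr, ytr, Xte)
```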

Conclusions
In our study, we developed a new prediction model, termed RP-FFT, which detects SIPs from protein sequence information by combining the Position-Specific Scoring Matrix with the Fast Fourier Transform and a Random Projection classifier. A notable aspect of the experiments is that the datasets used by the classifier are unbalanced. The main strengths of the presented model are: (1) a reasonable feature extraction method that captures the main information of the data and improves efficiency; and (2) an RP classifier that is strongly suited to SIP prediction. To summarize, the experimental results achieved by the presented method on the yeast and human datasets indicate that its prediction performance is clearly better than that of the SVM-based method and six other existing models. In the future, more feature extraction techniques and machine learning or deep learning methods will be explored for detecting SIPs.