Detection of Interactions between Proteins through Rotation Forest and Local Phase Quantization Descriptors

Protein-Protein Interactions (PPIs) play a vital role in most cellular processes. Although much effort has been devoted to detecting protein interactions with high-throughput experiments, these methods are expensive and tedious. To address these disadvantages, this study develops a novel, efficient, and accurate computational method that predicts PPIs from protein sequence information. The improvement mainly comes from the use of the Rotation Forest (RF) classifier and the Local Phase Quantization (LPQ) descriptor extracted from the Physicochemical Property Response (PR) Matrix of protein amino acids. On three PPI datasets, Saccharomyces cerevisiae, Homo sapiens, and Helicobacter pylori, the method achieved average accuracies of 93.8%, 97.96%, and 89.47%, respectively, which are higher than those reported in previous studies. Extensive validation was also performed to compare the Rotation Forest ensemble classifier with the state-of-the-art Support Vector Machine classifier. These promising results indicate that the proposed method might play a complementary role in future proteomics research.


Introduction
As a necessary component of all organisms, proteins are involved in most processes of living cells. Because proteins usually function in pairs, knowledge of protein interactions can provide insight into many biological functions [1,2]. As a hotspot in proteomics research, the detection of protein-protein interactions (PPIs) helps in understanding disease mechanisms and in developing drugs for specific diseases. In recent years, many innovative techniques based on biological experiments have been developed for detecting PPIs. Valuable PPI data on diverse species have been accumulated by high-throughput experimental technologies, such as protein chips [3,4], yeast two-hybrid (Y2H) systems [5-7], tandem affinity purification (TAP) [8], mass spectrometry protein complex identification (MS-PCI) [9], and correlated mRNA expression profiling [10]. These data support further studies, even though the PPI data obtained so far through biological approaches cover only a small fraction of the complete PPI

Evaluation Measures
To measure the prediction performance of the proposed method, we calculated the overall accuracy, sensitivity, precision, and Matthews correlation coefficient (MCC), together with the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). The measures are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}} \tag{4}$$
where true positive (TP) is the number of interacting protein pairs correctly predicted as interacting; false negative (FN) is the number of interacting protein pairs incorrectly predicted as non-interacting; false positive (FP) is the number of non-interacting protein pairs incorrectly predicted as interacting; and true negative (TN) is the number of non-interacting protein pairs correctly predicted as non-interacting. MCC is a correlation coefficient that measures the quality of binary classifications in machine learning. In addition, the ROC curve plots sensitivity against 1 - specificity for a binary classifier as its decision threshold varies, and the AUC, a threshold-independent measure, assesses performance by the normalized area under the ROC curve.
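As a minimal sketch of how Equations (1) to (4) can be computed from the four confusion-matrix counts, the following Python function may help. The function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, precision and MCC from confusion counts.

    Assumes at least one actual positive (tp + fn > 0) and at least one
    predicted positive (tp + fp > 0).
    """
    total = tp + tn + fp + fn
    mcc_denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,          # Equation (1)
        "sensitivity": tp / (tp + fn),          # Equation (2)
        "precision": tp / (tp + fp),            # Equation (3)
        # Equation (4); by convention MCC is set to 0 if the denominator is 0.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical counts for illustration only (not results from the paper).
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print(metrics)
```

Note that this function returns MCC as a value between -1 and 1, whereas some tables report it as a percentage.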

Parameter Selection
The number of feature subsets K and the number of decision trees L are crucial for the performance of the Rotation Forest classifier, so these two parameters must be set in advance. Because of the randomness in the ensemble, the values that give the best performance are difficult to determine analytically. A higher value of K means more subsets, each with fewer features, and a higher value of L means more base classifiers in the ensemble.
We therefore evaluate the overall classification accuracy on the Helicobacter pylori dataset for different values of K and L in the first computational validation. The parameter selection strategy has two steps: first, L is fixed at 20 and K is tuned from 10 to 70 in steps of 5; then K is set to the value obtained in the first step and L is tuned from 10 to 70 in steps of 5.
The prediction results for Helicobacter pylori are shown in Figure 1. Figure 1a shows that K = 55 gives a good result, with an accuracy of 89.54% at L = 20. We then fix K at 55 and increase L from 10 to 70 in steps of 5, which yields the results shown in Figure 1b. From these, the optimal value of L is determined to be 50.
The same parameter selection strategy is adopted when exploring the other two datasets. The proposed method on the Human dataset yields an accuracy of 97.91% with the optimized settings (K = 25; L = 40). For the Saccharomyces cerevisiae dataset, it achieves the best accuracy of 94.32% with the optimized settings (K = 65, L = 40).
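The two-step strategy described in this section can be sketched as follows. This is only an illustration: `two_step_search` and `toy_accuracy` are hypothetical names, and `toy_accuracy` is a synthetic stand-in for training and cross-validating a Rotation Forest with the given K and L. Its peak is placed at K = 55, L = 50 purely for demonstration.

```python
def two_step_search(evaluate, k_grid, l_grid, initial_l=20):
    """Tune K with L fixed, then tune L with K fixed at its best value.

    `evaluate(k, l)` is assumed to return a score (e.g. accuracy) to maximize.
    """
    # Step 1: fix L and scan K.
    scores_k = {k: evaluate(k, initial_l) for k in k_grid}
    best_k = max(scores_k, key=scores_k.get)
    # Step 2: fix K at the value from step 1 and scan L.
    scores_l = {l: evaluate(best_k, l) for l in l_grid}
    best_l = max(scores_l, key=scores_l.get)
    return best_k, best_l, scores_l[best_l]


def toy_accuracy(k, l):
    """Synthetic score surface (an assumption, not measured accuracy)."""
    return 0.9 - 0.0001 * ((k - 55) ** 2 + (l - 50) ** 2)


grid = range(10, 71, 5)  # 10, 15, ..., 70, as in the text
best_k, best_l, best_score = two_step_search(toy_accuracy, grid, grid)
print(best_k, best_l, best_score)
```

This search evaluates 26 parameter settings instead of the 169 required by a full grid, at the cost of possibly missing combinations in which K and L interact.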

Prediction Performance of Proposed Model
To validate the proposed model, we apply it to three widely used PPI datasets: the Helicobacter pylori dataset, the Homo sapiens dataset, and the Saccharomyces cerevisiae dataset. To avoid over-fitting, five-fold cross-validation is used for performance evaluation. We also train a support vector machine (SVM) to compare its performance with that of the proposed model.
The results on the Helicobacter pylori and Saccharomyces cerevisiae datasets are shown in Tables 1 and 2, which list the overall accuracy, sensitivity, precision, MCC, and AUC. The ROC curves are plotted in Figures 2 and 3. Table 1 shows that the proposed method yields a high average accuracy of 89.47% on the Helicobacter pylori dataset. The average AUC is close to 0.90, which indicates that the method discriminates well between interacting and non-interacting pairs. The standard deviations of the accuracy, precision, sensitivity, MCC, and AUC are 1.05%, 1.77%, 1.41%, 0.0167, and 0.0145, respectively. On the Saccharomyces cerevisiae dataset, the proposed method yields an AUC of 0.93 with a high accuracy of 93.80%, and the values of precision and sensitivity are 96.66% and 90.64%, respectively. The standard deviations of accuracy, precision, sensitivity, MCC, and AUC are 0.50%, 0.62%, 0.87%, 0.009, and 0.002, respectively.
Because the Support Vector Machine (SVM) is a state-of-the-art classification model, we compare the Rotation Forest classifier with the SVM model on the Human dataset. The experimental results are shown in Table 3. The proposed method yields average values of accuracy, precision, sensitivity, and MCC as high as 97.96%, 98.35%, 97.32%, and 0.96, respectively. With the SVM model, the average values of accuracy, precision, sensitivity, and MCC are 90.21%, 93.00%, 85.96%, and 0.82, respectively. The ROC curves in Figures 4 and 5 also show that the average AUC of the proposed method is 0.9792, whereas that of the SVM is 0.8996. In addition, the standard deviations of accuracy, sensitivity, and MCC yielded by the proposed method are as low as 0.22%, 0.73%, and 0.0042, respectively, which are lower than the values of 0.46%, 0.99%, and 0.0077 obtained by the SVM model.
In conclusion, the experimental results above suggest that the proposed method performs considerably better than the SVM-based method.
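The five-fold cross-validation protocol and the reporting of averages with standard deviations can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the shuffling seed is arbitrary, and the sample standard deviation is used because the text does not say which estimator the authors applied. The sample count 2916 is the size of the Helicobacter pylori dataset.

```python
import random
import statistics


def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for k, fold in enumerate(folds) if k != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


splits = k_fold_splits(2916)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so every protein pair is used once for testing and four times for training.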

Comparison with Other Methods
Many methods have been proposed for predicting PPIs. Here, we compare the prediction performance of the proposed method with that of existing approaches. The results of the different methods on the Saccharomyces cerevisiae dataset are shown in Table 4. Among the earlier methods, Zhou's work has the lowest standard deviation of accuracy (0.33%), Guo's work reaches an accuracy of 89.33%, and Yang's work reaches a precision of 90.24%. Notably, the proposed method yields the best performance in terms of sensitivity, precision, accuracy, and MCC, at 90.64%, 96.66%, 93.80%, and 88.35%, respectively. The corresponding standard deviations are 0.87%, 0.62%, 0.50%, and 0.87%, respectively. These results show that the proposed method outperforms the compared methods on this dataset.
We also compare the proposed method with other methods on the Helicobacter pylori dataset; the results are shown in Table 5. The performances of the classifiers differ considerably. The worst result, yielded by the phylogenetic bootstrap, has an accuracy of 75.80%, a precision of 80.20%, and a sensitivity of 69.80%. HKNN achieves 84.00% accuracy, 84% precision, and 86% sensitivity. In contrast, the proposed method achieves an accuracy of 89.47%, a precision of 89.63%, a sensitivity of 89.18%, and an MCC of 81.16%. These results indicate that the proposed method is promising and performs well for PPI prediction.

Generation of the Data Sets
The first dataset is derived from Saccharomyces cerevisiae, for which we selected the core subset of the Database of Interacting Proteins (DIP). We implemented a data preprocessing program to remove redundant protein pairs. More specifically, protein pairs with more than forty percent sequence identity or fewer than fifty residues are removed. The final positive set comprises 5594 protein pairs, and the final negative set, built from proteins with different sub-cellular localizations, contains the same number of pairs. The final dataset therefore consists of 11,188 protein pairs.
The Homo sapiens dataset is generated from the Human Protein Reference Database (HPRD). After filtering out pairs with more than 25% sequence identity, the original dataset has 3899 interacting pairs and 4262 non-interacting pairs. More specifically, the interacting protein pairs are generated from 2502 different human proteins, and the non-interacting protein pairs from 661 proteins. However, seven of the sequences are too long and exceed our computational capacity when using the proposed protein representation method. The final Homo sapiens dataset contains 3892 positive samples and 4262 negative samples. The Helicobacter pylori dataset, described by Martin et al. [25], consists of 2916 protein pairs, of which half are positive and the rest negative.

Representation for Protein
To borrow feature extraction techniques from image processing, each amino acid sequence must first be transformed into a matrix. The Physicochemical Property Response Matrix (PR) method [21] is used to represent the protein sequence. For a given protein $P = (p_1, p_2, \ldots, p_N)$ and a selected physicochemical property $d$, the response matrix $\mathrm{PRM}_d \in \mathbb{R}^{N \times N}$ is calculated, so its size depends on the length of the protein sequence. The entry $\mathrm{PRM}_d(i, j)$ is set to the sum of the index values of the amino acids at positions $i$ and $j$:
$$\mathrm{PRM}_d(i, j) = \mathrm{index}(p_i) + \mathrm{index}(p_j), \quad i, j = 1, \ldots, N$$
where index(p) denotes the value of the selected property in AAIndex for amino acid p.
In the proposed method, we employ the hydrophobicity index as the physicochemical property. Table 6 shows the values of the hydrophobicity index for each amino acid. For instance, for the amino acid sequence p = "ARND", writing $h(\cdot)$ for the hydrophobicity index, the PRM is the $4 \times 4$ matrix
$$\mathrm{PRM} = \begin{pmatrix} 2h(A) & h(A) + h(R) & h(A) + h(N) & h(A) + h(D) \\ h(R) + h(A) & 2h(R) & h(R) + h(N) & h(R) + h(D) \\ h(N) + h(A) & h(N) + h(R) & 2h(N) & h(N) + h(D) \\ h(D) + h(A) & h(D) + h(R) & h(D) + h(N) & 2h(D) \end{pmatrix}$$
After the matrix is calculated, it is compressed if its size is larger than $250 \times 250$. Because the physicochemical property response matrix is two-dimensional, its size grows quadratically with the sequence length; a handful of excessively long sequences exceed our computational capacity and are therefore ignored.
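The construction of the response matrix can be sketched in a few lines of Python. The function name `response_matrix` is hypothetical, and the values in `placeholder_index` are made-up placeholders chosen only to make the arithmetic easy to follow. They are not the hydrophobicity values of Table 6 or AAIndex.

```python
def response_matrix(sequence, index):
    """Build the N x N matrix with entries index(p_i) + index(p_j)."""
    values = [index[residue] for residue in sequence]
    return [[vi + vj for vj in values] for vi in values]


# Placeholder property values (assumption for illustration, not real indices).
placeholder_index = {"A": 1.0, "R": 2.0, "N": 3.0, "D": 4.0}

prm = response_matrix("ARND", placeholder_index)
for row in prm:
    print(row)
```

Because addition is commutative, the matrix is symmetric, and its diagonal holds twice the index value of each residue.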

Feature Vector Extraction
Many methods exist for extracting features from images. Local Phase Quantization (LPQ) [21] is a common and efficient texture descriptor that uses the Fourier transform to analyze the information in a matrix. It is based on the blur invariance property of the Fourier phase spectrum. The observed image is modeled as the original image after blurring:
$$g(x) = (f * h)(x)$$
where $g(x)$, $f(x)$, and $h(x)$ denote the observed image, the original image, and the blur function, respectively, and $*$ denotes two-dimensional convolution. In the Fourier domain, this relation becomes
$$G(u) = F(u) \cdot H(u)$$
where $G(u)$, $F(u)$, and $H(u)$ are the Fourier transforms of $g(x)$, $f(x)$, and $h(x)$, respectively.
To capture local information effectively, LPQ applies the Fourier transform locally, on the $m \times m$ neighborhood $N_x$ located at each position $x$:
$$F(u, x) = \sum_{y \in N_x} f(x - y)\, e^{-j 2 \pi u^{T} y}$$
This is the two-dimensional short-term Fourier transform (STFT), computed over a rectangular neighborhood of each pixel position, from which the local phase information is extracted. Because the output of the Fourier transform and its phase are continuous, the LPQ method considers only four two-dimensional frequencies $u_1, \ldots, u_4$. The STFT therefore outputs four complex coefficients at each position:
$$F_x^c = [F(u_1, x), F(u_2, x), F(u_3, x), F(u_4, x)]$$
and
$$F_x = [\mathrm{Re}\{F_x^c\}, \mathrm{Im}\{F_x^c\}]^T \tag{11}$$
where Re and Im denote the real part and the imaginary part. A binary coding scheme then quantizes these eight real values into an integer between 0 and 255. Writing each coefficient as $F(u_k, x) = w_{u_k}^T f_x$, where $w_{u_k}$ is the STFT basis vector at frequency $u_k$ and $f_x$ holds the $m^2$ samples in $N_x$, the corresponding transform is
$$w = [\mathrm{Re}\{w_{u_1}, w_{u_2}, w_{u_3}, w_{u_4}\}, \mathrm{Im}\{w_{u_1}, w_{u_2}, w_{u_3}, w_{u_4}\}]^T$$
The feature vector utilized in the experiment is a normalized histogram of such coefficients calculated from STFT. As a protein pair contains two parts, the final feature vector of an interaction pair is constructed by concatenating the descriptors of two proteins.
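A basic version of the descriptor can be sketched with NumPy as follows. This is an illustration under several assumptions rather than the implementation used in the paper: the window size m = 3, the frequency points (a, 0), (0, a), (a, a), (a, -a) with a = 1/m, and the sign-based quantization follow the commonly described LPQ formulation, and the decorrelation step used in some LPQ variants is omitted. The function names are hypothetical.

```python
import numpy as np


def lpq_histogram(matrix, m=3):
    """Normalized 256-bin histogram of basic LPQ codes (m must be odd)."""
    matrix = np.asarray(matrix, dtype=float)
    r = (m - 1) // 2
    offsets = np.arange(-r, r + 1)       # window offsets y' = -r, ..., r
    a = 1.0 / m                          # lowest non-zero frequency
    # Four frequency points u_1..u_4 as (row, column) components (assumed).
    frequencies = [(a, 0.0), (0.0, a), (a, a), (a, -a)]
    # windows[i, j] is the m x m neighborhood with top-left corner (i, j).
    windows = np.lib.stride_tricks.sliding_window_view(matrix, (m, m))
    components = []
    for u_row, u_col in frequencies:
        # Summing f(x + y') * exp(+j 2 pi u.y') equals the STFT sum over y = -y'.
        phase = u_row * offsets[:, None] + u_col * offsets[None, :]
        kernel = np.exp(2j * np.pi * phase)
        components.append(np.einsum("ijpq,pq->ij", windows, kernel))
    stacked = np.stack(components)
    # Eight real values per position: four real parts, then four imaginary parts.
    parts = np.concatenate([stacked.real, stacked.imag])
    bits = (parts >= 0).astype(np.int64)
    weights = (2 ** np.arange(8)).reshape(8, 1, 1)
    codes = (bits * weights).sum(axis=0)  # integers in 0..255
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()


def pair_feature(matrix_a, matrix_b, m=3):
    """Concatenate the descriptors of the two proteins in a pair."""
    return np.concatenate([lpq_histogram(matrix_a, m), lpq_histogram(matrix_b, m)])


rng = np.random.default_rng(0)
demo_a = rng.normal(size=(12, 15))  # random stand-ins for response matrices
demo_b = rng.normal(size=(9, 9))
feature = pair_feature(demo_a, demo_b)
print(feature.shape)
```

Under these assumptions each protein contributes a 256-dimensional histogram, so a protein pair is described by a 512-dimensional vector regardless of the sequence lengths.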

Rotation Forest
An ensemble classifier usually performs better than a single base classifier. In this study, a new classification model, Rotation Forest (RF), is employed to predict PPIs based on a novel quantitative description of the protein amino acid sequence. Rotation Forest exhibits excellent classification performance and is widely applied as a classifier in data mining [26].
Let $X$ be an $N \times n$ matrix holding $N$ training samples, each with $n$ features. Let $Y = [y_1, \ldots, y_N]^T$ be the $N \times 1$ label vector, whose entries take the value 1 or $-1$ to indicate whether a protein pair is interacting: 1 represents a PPI and $-1$ a non-PPI. Let $K$ denote the number of subsets of the feature set $F$ and $L$ the number of decision trees in the Rotation Forest. An individual decision tree is denoted $D_i$. Note that the parameters $K$ and $L$ must be set in advance.
The training procedure for an individual decision tree classifier $D_i$ is as follows:
Step 1: Split the feature set $F$ at random into $K$ subsets. Each subset must hold $M = n/K$ features; if the last subset has fewer than $M$ features, it is filled with zero vectors.
Step 2: Let $F_{ij}$ be the $j$th feature subset for training classifier $D_i$, and let $X_{ij}$ be the dataset restricted to the features in $F_{ij}$. Form a new set $X'_{ij}$ by drawing a bootstrap sample of three-fifths of $X_{ij}$. Apply PCA to $X'_{ij}$ and store the coefficients of the principal components, $a_{ij}^{(1)}, \ldots, a_{ij}^{(M_j)}$, each of size $M \times 1$, in the matrix $C_{ij}$.
Step 3: Arrange the matrices $C_{ij}$ into a sparse rotation matrix $R_i$ as follows:
$$R_i = \begin{pmatrix} a_{i1}^{(1)}, \ldots, a_{i1}^{(M_1)} & 0 & \cdots & 0 \\ 0 & a_{i2}^{(1)}, \ldots, a_{i2}^{(M_2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{iK}^{(1)}, \ldots, a_{iK}^{(M_K)} \end{pmatrix}$$
The columns of $R_i$ must then be rearranged to match the order of the feature set $F$, giving the matrix $R_i^a$ (of size $n \times n$, so that the product $X R_i^a$ is defined). Finally, $(X R_i^a, Y)$ is used as the training set for decision tree $D_i$.
After the $L$ decision trees have been trained, a given test sample $x$ is classified as follows. Each decision tree $D_i$ outputs the probability $d_{i,j}(x R_i^a)$ that $x$ belongs to class $j$. The average probability $\mu_j(x)$ is then calculated as
$$\mu_j(x) = \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x R_i^a), \quad j = 1, \ldots, c$$
where $c$ is the number of classes.
Finally, assign x to the class with the highest confidence.
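The construction of one rotation matrix and the final averaging step can be sketched with NumPy as below. This is a simplified illustration, not the authors' implementation: uneven feature subsets are handled with `np.array_split` rather than zero padding, PCA is computed from the eigenvectors of the covariance matrix, the bootstrap is read as sampling three-fifths of the rows with replacement, and the training of the decision trees themselves is left out. All function names are hypothetical.

```python
import numpy as np


def build_rotation_matrix(X, K, rng, sample_fraction=0.6):
    """Return an n x n rotation matrix aligned with the original feature order."""
    N, n = X.shape
    # Step 1: random split of the feature indices into K disjoint subsets.
    subsets = np.array_split(rng.permutation(n), K)
    R = np.zeros((n, n))
    for idx in subsets:
        # Step 2: bootstrap sample of the rows, restricted to this subset.
        rows = rng.choice(N, size=max(2, int(sample_fraction * N)), replace=True)
        sample = X[np.ix_(rows, idx)]
        # PCA coefficients = eigenvectors of the sample covariance matrix.
        cov = np.atleast_2d(np.cov(sample, rowvar=False))
        _, components = np.linalg.eigh(cov)
        # Step 3: place the block at the rows and columns of its own features,
        # which yields the rearranged (feature-ordered) matrix directly.
        R[np.ix_(idx, idx)] = components
    return R


def average_vote(tree_probabilities):
    """Average per-tree class probabilities and return (mu, predicted class)."""
    mu = np.mean(np.asarray(tree_probabilities, dtype=float), axis=0)
    return mu, int(np.argmax(mu))


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 7))   # random stand-in for a feature matrix
R_demo = build_rotation_matrix(X_demo, K=3, rng=rng)
X_rotated = X_demo @ R_demo         # training input for one decision tree
mu_demo, label_demo = average_vote([[0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
print(X_rotated.shape, mu_demo, label_demo)
```

Because every block consists of orthonormal eigenvectors, the resulting matrix is orthogonal, so the rotation changes the axes seen by each tree without distorting distances between samples.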

Conclusions
Predicting interactions between protein pairs is of great importance for understanding the molecular basis of complex cellular processes. This article reports a novel computational method for predicting protein-protein interactions using only the protein sequence. Three large, real, public PPI datasets, Saccharomyces cerevisiae, Homo sapiens, and Helicobacter pylori, are used to evaluate the prediction performance of the proposed method. The validation results show that the proposed model achieves better performance than the existing methods it was compared with. The improvement mainly comes from the use of the Rotation Forest (RF) classifier and the Local Phase Quantization (LPQ) descriptor extracted from the Physicochemical Property Response Matrix (PR). Therefore, the proposed method can be used to guide related experimental validations and as a supplementary tool for proteomics research.