PCLPred: A Bioinformatics Method for Predicting Protein–Protein Interactions by Combining Relevance Vector Machine Model with Low-Rank Matrix Approximation

Protein–protein interactions (PPIs) are key to protein function and regulation in the cell cycle, DNA replication, and cellular signaling. Detecting whether a pair of proteins interacts is therefore of great importance in molecular biology. As researchers have become aware of the value of computational methods for predicting PPIs, many techniques have been developed for this task. However, few of them fully meet the needs of their users. In this paper, we develop a novel and efficient sequence-based method for predicting PPIs. Evolutionary features are extracted from the position-specific scoring matrix (PSSM) of each protein. The features are then fed into a robust relevance vector machine (RVM) classifier to distinguish between interacting and non-interacting protein pairs. To verify the performance of our method, five-fold cross-validation tests are performed on the Saccharomyces cerevisiae dataset. An accuracy of 94.56%, with 94.79% sensitivity and 94.36% precision, was obtained. The experimental results indicate that the proposed approach can extract the most significant features from each protein sequence and can be a promising and useful tool for proteomics research.


Introduction
Protein-protein interactions (PPI) are a key step in the realization of protein function within cell cycle progression, DNA replication, and signal transmission [1][2][3]. With the development of high-throughput biological technologies, including a yeast two-hybrid screen (Y2H) [4], protein chip technology [5], mass spectrometry [6], and tandem affinity purification tagging (TAP) [7], more PPI data have been accumulated [8]. PPI datasets have been stored in a number of constructed databases, such as the Molecular Interaction database (MINT), the Database of Interacting Proteins (DIP), and the Biomolecular Interaction Network Database (BIND) [8][9][10]. However, experimental methods are labor-intensive and time-consuming. The number of PPIs that are validated by these methods represents only a small portion of the entire PPI network. Moreover, the experimental methods are usually associated with a high rate of both false negative and false positive predictions. All of these drawbacks encourage further research into a computational approach for identifying PPIs.
Previous experimental methods have yielded different kinds of protein data, such as the primary, secondary, and tertiary structures of proteins. To utilize this wealth of data, numerous machine learning approaches have been designed to infer new PPIs. Among these approaches, it is popular to predict PPIs from protein structure information. For example, Agrawal et al. [11] proposed a computational tool, named the spatial interaction map (SIM), that utilizes the structure of unbound proteins to detect the residues involved in PPIs. Qiu et al. [12] presented a novel residue characterization model, based on 3D structures, for detecting PPIs. These structure-based computational methods identify the interaction domain by analyzing the hydrophobicity, solvation, protrusion, and accessibility of residues. Because the volume of newly discovered protein sequence data is increasing exponentially, the gap between the volume of complex protein structure data and that of protein sequence data keeps widening [13,14]. Structure-based PPI prediction does not meet the needs of the many biochemists who have sequences but no structural data. Therefore, it is important to develop effective computational models based on protein sequence data.
Currently, a number of different computational methods have been designed to predict PPIs from sequence data [15][16][17][18][19][20][21][22][23]. Computational models for PPI prediction are commonly composed of two key parts, namely, protein feature representation and sample classification. The purpose of the first part is to represent the proteins with useful attributes and transform the samples into feature vectors whose size matches the classifier's input. Effective feature descriptors can play an important role in improving the prediction performance of the system.
Previous studies have shown that the evolutionary information of proteins may play a crucial role in predicting PPIs [24,25]. However, it is not easy to capture evolutionary information from a protein sequence [26][27][28]. There is currently no single protein representation method that takes full advantage of protein evolutionary information. Additionally, sequence evolutionary information is difficult to use because protein sequences differ in length. Given these difficulties, how can the evolutionary information of proteins be used to predict PPIs efficiently? To address this problem, we propose a novel scheme that uses a position-specific scoring matrix (PSSM) to translate the protein sequence into a matrix that includes both the evolutionary information and the amino acid composition. We then introduce a low-rank approximation (LRA) method to find the lowest-rank representation of all of the candidates and accurately recover the row space of the data to achieve high precision.
With regard to the second issue, machine learning algorithms such as random forests, neural networks, ensemble classifiers, random projections, and Naïve Bayes classifiers have been proposed for detecting PPIs and improving the accuracy of the prediction model [29][30][31]. The main trend in computational PPI detection is to pursue the highest precision, rather than speed, when training the classification model. The relevance vector machine (RVM) is a more recent statistical learning technique that produces probabilistic classification outputs and uses Bayesian inference to obtain a sparse solution for regression and classification [32]. Unlike support vector machines (SVMs), RVM classifiers, with fewer input variables, provide better classification estimates for small, high-dimensional datasets [33]. In this paper, the performances of RVMs and SVMs for classifying PPIs are compared. Using the PPI dataset, we show that the proposed method can quickly and effectively distinguish interacting protein pairs in large-scale data. The experimental results indicate that the proposed technique can complement experimental approaches for identifying PPIs.
In this paper, we propose a novel protein representation method that uses protein evolutionary information. The main improvement is attributed to the combined use of LRA, the PSSM, and RVMs. Specifically, we first apply the LRA method to the PSSM, which represents a protein in matrix form, to obtain the feature vector of the protein. The principal component analysis (PCA) method is then employed to eliminate some of the noise and reduce the dimensions of the feature vectors. Finally, we use RVM classifiers to carry out the prediction. The proposed method was evaluated on the Yeast PPI dataset. The experimental results show that it is superior to SVM-based methods and to other previously developed methods. Therefore, this approach is well suited to predicting PPIs. Additionally, a user-friendly web server for predicting PPIs, PCLPred, was developed for academic users at http://219.219.62.123:8888/pclpred/.
The rest of this paper is organized as follows: Section 2 introduces the test results obtained from applying the proposed method, the SVM-based method, and several other existing methods. Section 3 describes the proposed approach. Section 4 summarizes the work presented in this paper.

Five-Fold Cross-Validation
In this study, five-fold cross-validation was used to compare the performance of this model with that of other competing approaches. The whole PPI dataset is randomly divided into five subsets of roughly equal size, each containing approximately equal numbers of interacting and non-interacting protein pairs. Four of the subsets are used for training and the remaining one is used for testing. This process is repeated five times, with a different subset serving as the test set each time. The five results are then averaged so that the comparison is as fair as possible.
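The splitting scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stratified_five_fold`, the random seed, and the toy labels are assumptions introduced here, and only NumPy is used.

```python
import numpy as np

def stratified_five_fold(labels, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) index arrays for stratified k-fold CV.

    Each class is shuffled and dealt into n_folds parts, so every fold
    holds roughly the same share of interacting (1) and
    non-interacting (0) pairs.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for k, part in enumerate(np.array_split(idx, n_folds)):
            folds[k].extend(part.tolist())
    for k in range(n_folds):
        test_idx = np.array(sorted(folds[k]))
        train_idx = np.array(sorted(
            i for j in range(n_folds) if j != k for i in folds[j]))
        yield train_idx, test_idx

# Toy example: 10 interacting and 10 non-interacting pairs (hypothetical).
toy_labels = [1] * 10 + [0] * 10
splits = list(stratified_five_fold(toy_labels))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

In each round a model would be trained on `train_idx` and scored on `test_idx`; the five scores are then averaged.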

Comparison with the SVM-Based Approach Using the Same Feature Representation
In order to effectively assess the performance of the RVM classifier, we compared it with a state-of-the-art SVM classifier using the same feature extraction method on the Yeast dataset [34]. The LIBSVM (A Library for Support Vector Machines) tool provides an interface that facilitates the use of the SVM classifier. A cross-validation strategy was employed to optimize the SVM parameters; as a result, the parameters (c, g) were set to 0.8 and 0.4, respectively, and the radial basis function was taken as the kernel function.
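For reference, the role of the parameter g can be illustrated with a short sketch of the radial basis function kernel, K(u, v) = exp(-g * ||u - v||^2), which is the form LIBSVM uses for its RBF kernel. The function name `rbf_kernel` and the toy points are assumptions made for this example; the penalty parameter c does not enter the kernel and is not shown.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.4):
    """Gram matrix of the RBF kernel exp(-gamma * ||u - v||^2)."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    # Squared Euclidean distances between all rows of a and all rows of b.
    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] \
        - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

points = np.array([[0.0, 0.0], [1.0, 0.0]])  # two toy feature vectors
gram = rbf_kernel(points, points, gamma=0.4)
print(gram)
```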
The results of applying the two methods to the Yeast dataset are presented in Table 1, and the corresponding receiver operating characteristic (ROC) curves are shown in Figure 1. As Table 1 shows, the SVM classifier achieved 89.4% accuracy, 88.5% sensitivity, 90.3% specificity, and a Matthews correlation coefficient (MCC) of 81.1%. The RVM classifier achieved an average accuracy of 94.6% (5.2 percentage points higher than the SVM classifier) and an average sensitivity of 94.8% (6.3 percentage points higher than the SVM classifier). Several other performance indicators of the RVM classifier shown in Table 1 are about 4.0 percentage points above those of the SVM classifier. This comparison indicates that the RVM classifier predicts PPIs clearly better than the SVM classifier. The ROC curves in Figure 1 likewise show that the RVM classifier is more powerful in detecting PPIs than the SVM classifier.
We attribute the better classification results of this method to the following points: (1) because it builds the learning machine on a Bayesian framework, the RVM classifier is conducive to making more principled decisions based on the available information; (2) in the choice of kernel function, the RVM classifier is not limited by Mercer's theorem and can use any kernel function; (3) there is no need to set a penalty factor. The penalty factor in the SVM classifier is a constant that balances the empirical risk and the confidence interval, and the experimental results are very sensitive to its value; an improper setting may cause over-learning and other problems. The parameters of the RVM classifier, in contrast, are assigned automatically; (4) compared with the SVM classifier, the RVM classifier is sparser, which shortens the test time and makes it more suitable for online testing. It is well known that the number of SVM support vectors grows linearly with the number of training samples, which is inconvenient when the training set is very large. Although the number of relevance vectors in the RVM also increases with the number of training samples, it grows much more slowly than the number of SVM support vectors; and (5) previous research indicates that the RVM classifier generalizes better than the SVM classifier. Additionally, the RVM classifier not only produces a binary output but also provides the probability of that output.

A Comparison of the Proposed Method with Other Methods
Many methods based on machine learning theory have been proposed for sequence-based PPI prediction. To assess the ability of the proposed approach, several existing techniques [35] were applied to the Yeast dataset and their results were compared with those of our method. The comparison is listed in Table 2. Table 2 indicates that the proposed method achieved the highest average accuracy (94.6%) of all of these methods. The sensitivity and precision of the proposed technique are also superior to those of the other techniques. These results indicate that the RVM classifier, using the feature vectors extracted by the PSSM, LRA, and PCA methods, can substantially improve the quality of PPI prediction. This is mainly because of the efficient feature extraction strategy and the powerful classifier.

An Assessment of the Prediction Performance on the Helicobacter pylori PPI Dataset
In order to further investigate the prediction performance of our approach, we also compared it with several other existing methods on the Helicobacter pylori PPI dataset. The prediction results for these methods are reported in Table 3. To reduce the effect of random variation, we averaged each measure over five runs. Table 3 shows that the proposed method achieves a good result, with 84.7% accuracy, 85.9% precision, and 84.4% sensitivity. Notably, the precision and accuracy achieved by the proposed method are superior to those of the other methods.

Dataset
In this paper, the proposed approach was verified on the high-confidence Yeast and Helicobacter pylori PPI datasets. We gathered the Yeast dataset from the publicly available Database of Interacting Proteins (DIP) [8]. To ensure the validity of the experiment, we removed the protein pairs with fewer than fifty residues or with greater than 40% sequence identity. After this screening, the remaining 5594 protein pairs were used to build the positive dataset. An additional 5594 non-interacting protein pairs, with different subcellular localizations, were then used to build the negative dataset. As a result, the whole Yeast dataset consisted of 11,188 protein pairs. In order to further verify the general applicability of the proposed method, we also evaluated it on the Helicobacter pylori PPI dataset, which contains 1458 positive samples and 1458 negative samples, as described by Martin et al. [39].

Position Specific Scoring Matrix (PSSM)
The PSSM is a type of scoring matrix that was proposed by Gribskov et al. [24]. It is used in BLAST (Basic Local Alignment Search Tool) searches, where amino acid substitution scores are assigned to each position in a multiple sequence alignment of proteins. Because it contains the evolutionary information of proteins, it has been successfully applied in various fields of bioinformatics. A PSSM is represented as a T × 20 matrix, $M = \{c_{i,j} : i = 1, \dots, T;\ j = 1, \dots, 20\}$, where T is the length of the protein sequence. The PSSM is written as follows:

$$M = \begin{bmatrix} c_{1,1} & c_{1,2} & \cdots & c_{1,20} \\ c_{2,1} & c_{2,2} & \cdots & c_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ c_{T,1} & c_{T,2} & \cdots & c_{T,20} \end{bmatrix} \quad (1)$$

The elements of this matrix are generally expressed as integers (negative or positive). A higher score indicates that a given amino acid substitution occurs frequently in the alignment, while a lower score indicates a lower frequency of the substitution.
We created the PSSM using Position-Specific Iterated BLAST (PSI-BLAST, Bethesda, MD, USA), which finds protein sequences similar to the query sequence and then constructs the PSSM from the resulting alignment. In this work, we set the number of iterations to three and the e-value to 0.001 in order to obtain broad, highly homologous sequences.
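As an illustration of these settings, the sketch below assembles a command line for the `psiblast` program from NCBI BLAST+. Several points are assumptions rather than details given in the text: that the BLAST+ executable is used, that the stated e-value maps to the `-evalue` option (it may instead correspond to the inclusion threshold, `-inclusion_ethresh`), and the file and database names, which are hypothetical placeholders. The command is only built and printed, not executed.

```python
import shlex

def build_psiblast_command(query_fasta, database, pssm_out,
                           iterations=3, evalue=0.001):
    """Return the argument list for a PSI-BLAST run that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,            # FASTA file with one protein sequence
        "-db", database,                  # hypothetical protein database name
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-out_ascii_pssm", pssm_out,      # text file holding the T x 20 scores
    ]

command = build_psiblast_command("protein.fasta", "my_protein_db", "protein.pssm")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` on a machine where BLAST+ and a formatted protein database are available.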

Low-Rank Approximation (LRA)
LRA is a widely used method for matrix analysis, where the cost function measures the fit between an approximation matrix (the optimization variable) and a given sparse matrix, constrained by the reduced rank of the approximation matrix [40,41]. In this case, applying LRA to the PSSM of a protein sequence yields a descriptor containing evolutionary information that is used to represent the protein. For a 20 × L feature matrix N, the LRA problem is written as follows:

$$\min_{\hat{N}} \ \| N - \hat{N} \|_F \quad \text{subject to} \quad \operatorname{rank}(\hat{N}) \le r \quad (2)$$

where $\|\cdot\|_F$ represents the Frobenius norm. Formula (2) is solved using the singular value decomposition (SVD) method. Let $N = U \Sigma V^T \in \mathbb{R}^{m \times n}$ be the SVD of N and partition U, $\Sigma =: \operatorname{diag}(\sigma_1, \sigma_2, \sigma_3, \dots, \sigma_{20})$, and V as follows:

$$U =: \begin{bmatrix} U_1 & U_2 \end{bmatrix}, \quad \Sigma =: \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix}, \quad V =: \begin{bmatrix} V_1 & V_2 \end{bmatrix}$$

where $\Sigma_1$ is an r × r square matrix, and $U_1$ and $V_1$ are matrices of sizes m × r and n × r, respectively. The rank-r matrix is then obtained as follows:

$$\hat{N}^* = U_1 \Sigma_1 V_1^T$$

The matrix $\Sigma_1^{1/2}$, with dimensions r-by-r, is obtained by computing the square root of the reduced matrix $\Sigma_1$, in which the sequence order information of the protein is contained. It is noteworthy that the feature matrices N of different proteins may have different numbers of columns, because protein sequences have unequal lengths. However, $U_1 \Sigma_1^{1/2}$ always has a fixed size (a 20 × r matrix).
We form a feature vector from the matrix $U_1 \Sigma_1^{1/2}$ by concatenating all of its rows, from row 1 to row 20. The feature descriptor therefore consists of a total of 20 × r values. Considering the trade-off between the computational cost of extracting the protein features and the overall prediction accuracy, the optimal rank is 5, which gives 100 values per protein. We concatenate the descriptors of the two protein sequences to represent an interaction pair.
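A minimal NumPy sketch of this descriptor is given below. It assumes that the PSSM is supplied as a T × 20 array and is transposed to the 20 × L layout before the SVD; the function names `lra_descriptor` and `pair_descriptor` and the random toy matrices are introduced here for illustration only.

```python
import numpy as np

def lra_descriptor(pssm, rank=5):
    """Flatten U1 * sqrt(Sigma1) of the 20 x L matrix into 20 * rank values."""
    n = np.asarray(pssm, dtype=float).T            # 20 x L feature matrix
    u, s, _ = np.linalg.svd(n, full_matrices=False)
    scaled = u[:, :rank] * np.sqrt(s[:rank])       # U1 @ diag(sqrt(sigma_1..r))
    return scaled.reshape(-1)                      # rows 1..20 concatenated

def pair_descriptor(pssm_a, pssm_b, rank=5):
    """Concatenate the descriptors of two proteins into one pair vector."""
    return np.concatenate([lra_descriptor(pssm_a, rank),
                           lra_descriptor(pssm_b, rank)])

# Toy PSSMs of different lengths (random integers, not real profiles).
rng = np.random.default_rng(0)
pssm_a = rng.integers(-10, 11, size=(60, 20))
pssm_b = rng.integers(-10, 11, size=(137, 20))
features = pair_descriptor(pssm_a, pssm_b)
print(features.shape)
```

Both proteins map to vectors of the same length even though their sequence lengths differ. Singular vectors are only defined up to sign, so a practical implementation may need a sign convention to make descriptors comparable across proteins; none is applied in this sketch.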

Properties of the Proposed Algorithm
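Based on orthogonal triangular decomposition theory and LRA theory, the properties of the PSSM feature extraction algorithm are deduced.

Lemma 1. The rank-r matrix $\hat{N}^* = U_1 \Sigma_1 V_1^T$ obtained from the truncated SVD of N is an optimal solution of (2). If $\sigma_r > \sigma_{r+1}$, it is the unique solution.

Proof. Let $\hat{N}^*$ be an optimal solution of (2), and let $\hat{N}^* := U^* \Sigma^* (V^*)^T$ be an SVD of $\hat{N}^*$. Based on the unitary invariance of the Frobenius norm, we have

$$\| N - \hat{N}^* \|_F = \| (U^*)^T (N - \hat{N}^*) V^* \|_F = \| \tilde{N} - \Sigma^* \|_F, \quad \text{where } \tilde{N} := (U^*)^T N V^*.$$

Partition $\tilde{N} =: \begin{bmatrix} \tilde{N}_{11} & \tilde{N}_{12} \\ \tilde{N}_{21} & \tilde{N}_{22} \end{bmatrix}$ conformably with $\Sigma^* =: \begin{bmatrix} \Sigma_1^* & 0 \\ 0 & 0 \end{bmatrix}$ and observe that

$$\operatorname{rank}\begin{bmatrix} \Sigma_1^* & \tilde{N}_{12} \\ 0 & 0 \end{bmatrix} \le r \ \text{ and } \ \tilde{N}_{12} \ne 0 \ \Longrightarrow \ \left\| \tilde{N} - \begin{bmatrix} \Sigma_1^* & \tilde{N}_{12} \\ 0 & 0 \end{bmatrix} \right\|_F < \left\| \tilde{N} - \begin{bmatrix} \Sigma_1^* & 0 \\ 0 & 0 \end{bmatrix} \right\|_F,$$

which would contradict the optimality of $\hat{N}^*$. Thus, $\tilde{N}_{12} = 0$. Similarly, $\tilde{N}_{21} = 0$. Observe also that

$$\operatorname{rank}\begin{bmatrix} \tilde{N}_{11} & 0 \\ 0 & 0 \end{bmatrix} \le r \ \text{ and } \ \tilde{N}_{11} \ne \Sigma_1^* \ \Longrightarrow \ \left\| \tilde{N} - \begin{bmatrix} \tilde{N}_{11} & 0 \\ 0 & 0 \end{bmatrix} \right\|_F < \left\| \tilde{N} - \begin{bmatrix} \Sigma_1^* & 0 \\ 0 & 0 \end{bmatrix} \right\|_F.$$

Thus, $\tilde{N}_{11} = \Sigma_1^*$. Therefore,

$$\tilde{N} = \begin{bmatrix} \Sigma_1^* & 0 \\ 0 & \tilde{N}_{22} \end{bmatrix}.$$

Let $\tilde{N}_{22} = U_{22} \Sigma_{22} V_{22}^T$ be the SVD of $\tilde{N}_{22}$. Then the matrix

$$\begin{bmatrix} I & 0 \\ 0 & U_{22}^T \end{bmatrix} \tilde{N} \begin{bmatrix} I & 0 \\ 0 & V_{22} \end{bmatrix} = \begin{bmatrix} \Sigma_1^* & 0 \\ 0 & \Sigma_{22} \end{bmatrix}$$

has the optimal rank-r approximation $\Sigma^* = \begin{bmatrix} \Sigma_1^* & 0 \\ 0 & 0 \end{bmatrix}$, such that

$$\min\left(\operatorname{diag}(\Sigma_1^*)\right) \ge \max\left(\operatorname{diag}(\Sigma_{22})\right).$$

Therefore,

$$N = U^* \begin{bmatrix} I & 0 \\ 0 & U_{22} \end{bmatrix} \begin{bmatrix} \Sigma_1^* & 0 \\ 0 & \Sigma_{22} \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & V_{22}^T \end{bmatrix} (V^*)^T$$

is an SVD of N. Thus, if $\sigma_r > \sigma_{r+1}$, the rank-r truncated SVD

$$\hat{N}^* = U^* \begin{bmatrix} \Sigma_1^* & 0 \\ 0 & 0 \end{bmatrix} (V^*)^T$$

is unique and $\hat{N}^*$ is the unique solution of LRA.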
The salient feature of Lemma 1 is that, although the rank constraint is highly non-convex and non-linear, (2) can still be solved efficiently using the SVD method. Additionally, the solution that is optimal under the Frobenius norm is also optimal under any unitarily invariant norm.
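The statement of the lemma can be checked numerically. The sketch below is an illustration on a random 20 × 60 matrix, not part of the original analysis: it relies on the standard property that the Frobenius error of the rank-r truncated SVD equals the square root of the sum of the squared discarded singular values, and compares the truncated SVD with an arbitrary rank-r competitor. All names and sizes are assumptions made for the example.

```python
import numpy as np

def truncated_svd(matrix, rank):
    """Return the rank-`rank` truncated SVD approximation and all singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank], s

rng = np.random.default_rng(0)
n = rng.normal(size=(20, 60))                 # stand-in for a 20 x L matrix
rank = 5

approx, singular_values = truncated_svd(n, rank)
svd_error = np.linalg.norm(n - approx, "fro")
tail_energy = np.sqrt(np.sum(singular_values[rank:] ** 2))

# An arbitrary rank-5 competitor built from random factors.
competitor = rng.normal(size=(20, rank)) @ rng.normal(size=(rank, 60))
competitor_error = np.linalg.norm(n - competitor, "fro")

print(svd_error, tail_energy, competitor_error)
```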

Relevance Vector Machine (RVM) Model
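The RVM model is a probabilistic model under a Bayesian framework, developed by Tipping et al. [32,33,42]. It has been widely applied to classification and regression problems. For binary classification problems, assume that the training dataset is $\{x_n, t_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ is a training sample and $t_n \in \{0, 1\}$ denotes its label. The label is modeled as $t_i = y_i + \varepsilon_i$, where $y_i = w^T \varphi(x_i) = \sum_{j=1}^{N} w_j K(x_i, x_j) + w_0$ is the classification model and $\varepsilon_i$ is additive noise with a mean value of zero and a variance of $\sigma^2$, so that $\varepsilon_i \sim N(0, \sigma^2)$ and $t_i \sim N(y_i, \sigma^2)$. The training samples are assumed to be independent and identically distributed. The observation vector t then follows the distribution:

$$p(t \mid w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left( -\frac{\| t - \Phi w \|^2}{2\sigma^2} \right)$$

where $\Phi$ is the N × (N + 1) design matrix defined as follows:

$$\Phi = \begin{bmatrix} 1 & K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & K(x_N, x_1) & \cdots & K(x_N, x_N) \end{bmatrix}$$

The RVM predicts the label $t^*$ of a test sample from:

$$p(t^* \mid t) = \int p(t^* \mid w, \sigma^2) \, p(w, \sigma^2 \mid t) \, dw \, d\sigma^2 \quad (17)$$

In order to reduce the computational complexity of the kernel function and to ensure that most components of the weight vector are zero, the weight vector w is limited by an extra condition: each weight $w_i$ is given a zero-mean Gaussian prior with variance $a_i^{-1}$, so that $p(w \mid a) = \prod_{i=0}^{N} N(w_i \mid 0, a_i^{-1})$, where a is the vector of hyperparameters.
We obtain $p(w \mid t, a, \sigma^2)$ using the Bayesian formula:

$$p(w \mid t, a, \sigma^2) = \frac{p(t \mid w, \sigma^2) \, p(w \mid a)}{p(t \mid a, \sigma^2)}$$

This posterior is Gaussian with covariance $\Sigma = (\sigma^{-2} \Phi^T \Phi + A)^{-1}$ and mean $\mu = \sigma^{-2} \Sigma \Phi^T t$, where $A = \operatorname{diag}(a_0, a_1, \dots, a_N)$. The integral of the product of $p(t \mid w, \sigma^2)$ and $p(w \mid a)$ is given by

$$p(t \mid a, \sigma^2) = \int p(t \mid w, \sigma^2) \, p(w \mid a) \, dw = (2\pi)^{-N/2} \, |\Omega|^{-1/2} \exp\left( -\frac{1}{2} t^T \Omega^{-1} t \right), \quad \Omega = \sigma^2 I + \Phi A^{-1} \Phi^T$$

Because $p(a, \sigma^2 \mid t) \propto p(t \mid a, \sigma^2) \, p(a) \, p(\sigma^2)$ cannot be evaluated in closed form, the maximum likelihood method is used to approximate its maximizer, which is represented by

$$(a_{MP}, \sigma^2_{MP}) = \arg\max_{a, \sigma^2} \ p(t \mid a, \sigma^2)$$

The iterative process for $a_{MP}$ and $\sigma^2_{MP}$ is as follows:

$$a_i^{new} = \frac{\gamma_i}{\mu_i^2}, \quad (\sigma^2)^{new} = \frac{\| t - \Phi \mu \|^2}{N - \sum_i \gamma_i}, \quad \gamma_i = 1 - a_i \Sigma_{i,i} \quad (19)$$

where $\Sigma_{i,i}$ represents the ith element on the diagonal of $\Sigma$. Starting from initial values of a and $\sigma^2$, the approximations of $a_{MP}$ and $\sigma^2_{MP}$ are obtained by repeatedly applying Formula (19).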
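The iterative re-estimation described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the standard update rules of Tipping's formulation with Gaussian noise, that is, the targets are treated as regression values; a full RVM classifier would typically use a logistic link with a Laplace approximation, which is not shown. The function name `fit_rvm_regression`, the clipping constants, the iteration count, and the toy data are assumptions introduced here.

```python
import numpy as np

def fit_rvm_regression(phi, t, n_iter=100, alpha_max=1e9):
    """Re-estimate the hyperparameters a_i and the noise variance sigma^2.

    phi : (N, M) design matrix, t : (N,) targets.
    Returns the posterior mean of the weights, the a_i, and sigma^2.
    """
    phi = np.asarray(phi, dtype=float)
    t = np.asarray(t, dtype=float)
    n_samples, n_basis = phi.shape
    alpha = np.ones(n_basis)                       # initial a_i
    sigma2 = max(0.1 * np.var(t), 1e-6)            # initial noise variance
    for _ in range(n_iter):
        # Posterior covariance and mean of the weights.
        cov = np.linalg.inv(phi.T @ phi / sigma2 + np.diag(alpha))
        mu = cov @ phi.T @ t / sigma2
        # gamma_i = 1 - a_i * Sigma_ii, clipped for numerical safety.
        gamma = np.clip(1.0 - alpha * np.diag(cov), 0.0, 1.0)
        # a_i <- gamma_i / mu_i^2 ; large a_i effectively switch a weight off.
        alpha = np.clip(gamma / (mu ** 2 + 1e-12), 1e-9, alpha_max)
        residual = t - phi @ mu
        sigma2 = max(residual @ residual / max(n_samples - gamma.sum(), 1e-6), 1e-6)
    return mu, alpha, sigma2

# Toy data: only the first of three features carries signal.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
design = np.hstack([np.ones((100, 1)), x])         # bias column plus features
target = 2.0 * x[:, 0] + 0.05 * rng.normal(size=100)
weights, alphas, noise_var = fit_rvm_regression(design, target)
rmse = float(np.sqrt(np.mean((design @ weights - target) ** 2)))
print(weights, alphas, noise_var, rmse)
```

On data like this, the hyperparameters of the uninformative columns grow large, which shrinks their weights toward zero; this is the sparsity mechanism referred to in the text.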

Procedure of the Proposed Method
The workflow of the PCLPred method is presented in Figure 2. First, the protein amino acid sequence datasets are downloaded from DIP. The CD-HIT (Cluster Database at High Identity with Tolerance) and PSI-BLAST programs are then used to remove sequence redundancy and to generate the PSSMs, respectively [43]. Following this, LRA is employed to obtain the feature representation from the PSSM, which contains a large volume of valuable evolutionary knowledge for PPI prediction. After dimensionality reduction using the PCA technique, the significant features are extracted and used as input features to train the RVM classifier. Finally, the prediction performance is evaluated using five-fold cross-validation [44][45][46][47][48].
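The dimensionality reduction step of this workflow can be illustrated with a small PCA sketch based on the SVD of the centered feature matrix. The number of retained components is not stated in the text, so `n_components=10` below is a hypothetical value, as are the function name `pca_fit_transform` and the random toy features.

```python
import numpy as np

def pca_fit_transform(features, n_components):
    """Project samples onto the leading principal components.

    Returns the projected data, the component directions (as rows),
    the feature means, and the variance explained by each component.
    """
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    centered = x - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained_variance = s[:n_components] ** 2 / (len(x) - 1)
    return centered @ components.T, components, mean, explained_variance

# Toy input: 50 protein pairs with 200-dimensional descriptors.
rng = np.random.default_rng(0)
pair_features = rng.normal(size=(50, 200))
reduced, components, mean, explained = pca_fit_transform(pair_features, 10)
print(reduced.shape)
```

New samples would be projected with the stored `mean` and `components`, so that training and test pairs share the same reduced feature space.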

Performance Evaluation
In order to evaluate the performance of the designed model, a number of validation measures are employed.
(1) Overall prediction accuracy: $Acc = \dfrac{TP + TN}{TP + FP + TN + FN}$
(2) Sensitivity: $Sn = \dfrac{TP}{TP + FN}$
(3) Specificity: $Sp = \dfrac{TN}{TN + FP}$
(4) Positive predictive value: $PPV = \dfrac{TP}{TP + FP}$
(5) Negative predictive value: $NPV = \dfrac{TN}{TN + FN}$
(6) F-score: $F = \dfrac{2 \times Sn \times PPV}{Sn + PPV}$
(7) Matthews correlation coefficient: $MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$
where TP (true positive) is the number of interacting protein pairs that are predicted correctly; FP (false positive) is the number of non-interacting protein pairs that are predicted as interacting; FN (false negative) is the number of interacting protein pairs that are predicted as non-interacting; and TN (true negative) is the number of non-interacting protein pairs that are predicted correctly. Additionally, the ROC curve is adopted to evaluate the prediction performance of the different methods [49,50].
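These measures follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions of each measure; the function name `evaluation_measures` and the example counts are hypothetical and are not taken from the reported experiments.

```python
import math

def evaluation_measures(tp, fp, tn, fn):
    """Compute the seven validation measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    mcc_denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "f_score": 2 * sensitivity * ppv / (sensitivity + ppv),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Hypothetical counts for one test fold.
measures = evaluation_measures(tp=90, fp=10, tn=80, fn=20)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```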

Conclusions
In this study, we proposed a novel computational method for automated PPI prediction that combines the RVM model with the LRA method and the PSSM. More specifically, LRA is employed to obtain the feature representation from the PSSM, which contains a large volume of valuable evolutionary knowledge for PPI prediction. The RVM classifier is then applied to predict novel PPIs. Extensive computational experiments were performed on several PPI datasets in order to evaluate the PPI identification ability of the developed approach. The experimental results indicate that the PPI identification ability of this approach is clearly stronger than that of the SVM-based method and several other existing approaches. These promising results suggest that the proposed method is an efficient and reliable approach to detecting PPIs. It is also a practical tool that can help to advance research in the field of bioinformatics.