Prediction of Protein–Protein Interactions with Clustered Amino Acids and Weighted Sparse Representation

With the completion of the Human Genome Project, bioscience has entered the genome and proteome era, and research on protein–protein interactions (PPIs) is becoming more and more important. PPIs are inseparable from life activities such as DNA synthesis, gene transcription activation and protein translation. Although many methods based on biological experiments and machine learning have been proposed, they tend to be time-consuming to train and to yield imprecise accuracy; how to efficiently and accurately predict PPIs is still a big challenge. To take up this challenge, we developed a new predictor by incorporating reduced amino acid alphabet (RAAA) information into the general form of pseudo-amino acid composition (PseAAC) and combining it with weighted sparse representation-based classification (WSRC). A remarkable advantage of introducing the reduced amino acid alphabet is that it avoids the notorious dimensionality disaster and overfitting problems in statistical prediction. Additionally, experiments have shown that our method achieves good performance in both low- and high-dimensional feature spaces. Among all of the experiments performed on the PPI data of Saccharomyces cerevisiae, the best one achieved 90.91% accuracy, 94.17% sensitivity, 87.22% precision and an 83.43% Matthews correlation coefficient (MCC). To evaluate the prediction ability of our method, extensive experiments were performed to compare it with the state-of-the-art technique, the support vector machine (SVM). The results show that the proposed approach is very promising for predicting PPIs and can be a helpful supplement for PPI prediction.

There are two key factors affecting the prediction performance of PPIs, i.e., feature extraction and the sample classifier. Obviously, feature extraction is vital for all kinds of classifiers. In genome analysis, a well-selected feature generally helps to reveal hidden relationships between proteins and their biological activities. There already exists a wide range of feature selection approaches for extracting features from amino acid compositions, but few of them consider the effect of amino acid sequence order. Chou et al. [21] proposed the pseudo-amino acid composition for further analysis. In their approach, 20 factors reflect only the influence of the amino acid composition, whereas the remaining factors are influenced by the sequence order. This approach has since been widely adopted by many protein attribute-based prediction methods, such as predicting gamma-aminobutyric acid type A (GABA(A)) receptor proteins [22], predicting protein folding rates [23], identifying cyclin proteins [2], predicting supersecondary structure [24] and predicting a protein's subcellular location [25,26]. More similar works can be found in the review paper of Gonzalez-Diaz et al. [27]. Recently, two open-access tools were released to generate various modes of Chou's pseudo-amino acid composition [28–30]. One of the most useful pseudo-amino acid composition (PseAAC) modes is the so-called n-peptide composition; however, the dimension of the feature space increases exponentially with n. To avoid this dimensionality explosion problem, we, in this paper, integrate the reduced amino acid alphabet approach [31–33] with the general form of PseAAC to constrain the growth of the feature dimension.
For the sample classifier, the most popular choice is the SVM; however, it requires a careful parameter tuning process to achieve good performance, and this tuning demands extra effort [34,35]. To cope with this issue, sparse representation-based classification (SRC) [36], originally proposed for face recognition, can be adopted. As indicated in [37], the original SRC ignores the relationship between features in a low-dimensional space, which is undesirable. Therefore, we adopt a weighted sparse representation-based classification approach, which does not require a lengthy parameter tuning process yet delivers comparably good classification accuracy.
The rest of this paper is organized as follows: (1) we first generate a benchmark dataset for the validation of our proposed method; (2) we then introduce an efficient feature extraction approach that can discover the intrinsic correlation between proteins; (3) we adopt a powerful classification algorithm to predict PPIs; (4) we execute cross-validation tests to evaluate the prediction accuracy; and (5) we further evaluate our method on other datasets and compare it with other feature extraction methods and classifiers.

Evaluation Criteria
To evaluate the performance of our approach, the following criteria are chosen in the experiments: the accuracy (ACC), sensitivity (SN), precision (PE) and Matthews correlation coefficient (MCC), written as:

ACC = (TP + TN) / (TP + TN + FP + FN)
SN = TP / (TP + FN)
PE = TP / (TP + FP)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively. In addition, the receiver operating characteristic (ROC) curve is adopted to evaluate the prediction performance. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold varies. The area under the ROC curve (AUC) falls into (0,1); the larger the AUC, the better the prediction performance.
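The four criteria above can be computed directly from the confusion-matrix counts; a minimal sketch:

```python
from math import sqrt

def confusion_metrics(tp, tn, fp, fn):
    """Compute ACC, SN, PE and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)      # accuracy
    sn = tp / (tp + fn)                         # sensitivity (recall)
    pe = tp / (tp + fp)                         # precision
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sn, pe, mcc
```

For example, `confusion_metrics(50, 40, 10, 0)` yields an accuracy of 0.9 with perfect sensitivity, illustrating how MCC penalizes the false positives that accuracy alone hides.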

Assessment of Prediction Ability
The Gaussian kernel width σ and the tolerance threshold ε are two parameters that need to be tuned in the experiments. After a careful parameter tuning, we set σ = 50 and ε = 0.05 in the rest of the experiments.
In the following experiments, five-fold cross-validation was used. Several models were constructed and evaluated on five different cluster profiles (shown in Table 1) for the n-peptide cases n = 1, 2, 3 (shown in Table 2). The prediction results are reported in Table 3. It is obvious that as n increases, the prediction performance of our approach improves, achieving its best when n = 3. From Table 3, we can observe that the best prediction result is obtained for CP(8) with dimension = 512: 90.91% accuracy, 94.17% sensitivity, 87.22% precision and an 83.43% MCC value. In general, the PPI prediction accuracy of our approach is higher than 70%, and most of the 15 groups exceed 80%. Moreover, we plotted the ROC curve of CP(8) with dimension = 512, as shown in Figure 1; its AUC value is 0.97307, which is close to one.

Table 1. Scheme for the reduced amino acid alphabet based on the protein blocks method.

Robustness Performance Evaluation
In order to evaluate the robustness of our approach, we further tested our algorithm on an H. pylori dataset. This dataset includes 1365 interacting protein pairs, as well as some noise; proteins with fewer than 50 amino acids were removed from the dataset. The prediction classifier was built with 1365 × 2 = 2730 protein pairs, keeping the same parameter settings. The prediction results are reported in Table 4. The accuracy of our approach is highest (83.04%) for CP(13) with dimension = 169. From this table, the accuracy does not always improve as n increases; compare, for example, CP(11) at dimensions 121 and 1331 and CP(13) at dimensions 169 and 2197. The reason is the "overfitting" or "high-dimensionality disaster" problem. From this observation, we conclude that by properly choosing the feature dimension, we can save computational cost as well as improve the prediction accuracy.

Comparison with Other Methods
To further demonstrate the effectiveness of our approach, several related state-of-the-art approaches are performed in this experiment for performance comparison. We first implement several weighted sparse representation-based classification (WSRC) classifiers using different feature extraction approaches, and we also perform the SVM classifier with the pseudo-amino acid composition and reduced amino acid alphabet feature.
For feature extraction, we chose the most popular methods in PPI prediction: auto covariance, Moran autocorrelation, Geary autocorrelation, conjoint triad and pseudo-amino acid composition. The experiments are also performed using five-fold cross-validation. As shown in Table 5, the WSRC classifier with the pseudo-amino acid composition and reduced amino acid alphabet feature achieves better results than the other methods in terms of accuracy, sensitivity, precision and MCC values. The SVM classifier with the same feature set is then compared with the WSRC classifier. The parameters of the SVM classifier need to be tuned, namely the soft-margin parameter C and the Gaussian kernel parameter g; in this experiment, we set C = 16 and g = 16. As shown in Table 6, the SVM classifier achieves its best results for CP(13) with dimension = 2197, with accuracy, sensitivity, precision and MCC of 92.04%, 90.37%, 93.51% and 85.35%, respectively, slightly higher than the best results shown in Table 5. However, the WSRC classifier achieves higher accuracies in most of the 15 tested groups. In particular, WSRC significantly outperforms the SVM classifier in the low-dimensional data space. In the high-dimensional data space, the SVM performs very well, but at a higher computational cost. In fact, the performance of the SVM relies heavily on its parameters and can vary greatly if they are not properly tuned, whereas the performance of the WSRC classifier is quite stable and varies only slightly when its parameters change. To emphasize once more, the proposed WSRC achieves comparable classification performance in most of the experimental settings and is especially good in the low-dimensional data space.
Moreover, by performing the PPI prediction in a low-dimensional data space, we can largely save computational cost while preserving the relationships between non-continuous protein pairs. The SVM-based classifier, by contrast, achieves better performance only in a high-dimensional data space, which implies a much higher time complexity to reach its best performance. In practice, it is desirable for a PPI classifier to achieve good classification performance at a much lower cost. Apparently, the proposed WSRC is superior to the SVM, as well as to WSRC with other feature extraction methods. We further compared our best result with previously reported work, such as Guo et al. [20], Zhou et al. [38] and Yang et al. [39]. Additionally, we implemented sparse representation-based classification (SRC) and the SVM under the same setting as WSRC, i.e., CP(8) with dimension = 512. The results are shown in Table 7. As we can see, WSRC attains the best performance in accuracy, sensitivity and MCC, with values of 90.91%, 94.17% and 83.43%, respectively. Though WSRC does not beat Yang's work in precision, it still reveals the potential of our method for predicting PPIs.

Dataset
The PPI dataset was manually extracted from the S. cerevisiae subset of the public Database of Interacting Proteins (DIP) [20]. First, to reduce the size of the dataset, we removed protein pairs in which either protein is shorter than 50 residues, and we also removed pairs with more than 40% sequence identity. A positive sub-dataset (protein pairs with interaction) was thus generated, consisting of 5594 protein pairs. To generate the negative sub-dataset (protein pairs without interaction), we assume that proteins from different subcellular locations do not interact with each other. Accordingly, a negative dataset of 5594 protein pairs was generated, yielding a whole evaluation dataset of 5594 positive and 5594 negative protein pairs.
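The negative-set rule above (proteins from different subcellular locations are assumed not to interact) can be sketched as a sampling routine. This is an illustrative sketch only: the helper name `sample_negative_pairs` and the location dictionary are hypothetical, not from the paper.

```python
import random

def sample_negative_pairs(by_location, n_pairs, seed=0):
    """Draw protein pairs from two different subcellular locations,
    assuming such pairs do not interact (the paper's negative-set rule).

    by_location: dict mapping a location name to a list of protein ids.
    """
    rng = random.Random(seed)
    locations = list(by_location)
    pairs = set()
    while len(pairs) < n_pairs:
        # Pick two distinct compartments, then one protein from each.
        loc_a, loc_b = rng.sample(locations, 2)
        pair = (rng.choice(by_location[loc_a]), rng.choice(by_location[loc_b]))
        pairs.add(pair)
    return sorted(pairs)
```

Deduplicating via a set keeps the negative set the same size as the positive one without repeated pairs.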

Pseudo-Amino Acid Composition and Reduced Amino Acid Alphabet
Given a protein sequence, how to properly represent it is one of the key issues in protein–protein interaction prediction. Conventionally, a protein can be represented by its entire amino acid sequence, as it contains complete information. For a protein amino acid sequence with L residues, its representation can be written as:

P = R1 R2 R3 ··· RL

where R1 denotes the first residue of protein P, and there are a total of L residues in P. As proposed in BLAST [40], the similarity between a series of amino acid sequences is calculated as the basis for the later prediction. However, the prediction fails when the query protein has no significant homology to the known proteins. Alternatively, discrete models have been proposed, which do not rely on the order of amino acid sequences for the similarity calculation. One of the most commonly adopted discrete models uses the amino acid composition (AAC) to represent proteins, which is formulated as:

P = [f1, f2, ···, f20]^T    (6)

where fi (i = 1···20) is the normalized occurrence frequency of the i-th of the 20 native amino acids in P, and T is the transpose operator. As a result, once the protein sequence is known, its amino acid composition can be easily calculated. In the formulation of Equation (6), however, the effect of sequence order is not considered; thus, we propose a new approach based on pseudo-amino acid composition (PseAAC), which makes use of the characteristics of the amino acid composition as well as the order of the amino acid sequence. Among all PseAAC modes, the simplest is the so-called n-peptide composition. When n = 1, 2, 3, this model degenerates to the AAC model, the dipeptide composition [41–43] and the tripeptide composition [44], respectively. In this way, some sequence order information can be preserved. However, when n ≥ 2, the number of components increases rapidly.
For example, when n = 2, around 20^2 = 400 features are generated for the prediction, whereas when n = 3 the prediction must use over 20^3 = 8000 features. This exponential increase in features not only leads to an unacceptably long model training process, but also makes the corresponding biological experiments very expensive in terms of both materials and procedure. In addition, a high-dimensional feature set generates a series of other issues, such as: (i) the overfitting problem, which causes poor model generalization; and (ii) the sparse representation problem, which causes seriously biased results and a poor ability to handle missing data.
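The growth of the feature dimension, |A|^n for an alphabet of size |A|, can be tabulated for the native 20-letter alphabet and the reduced cluster profiles used later in the paper:

```python
# Feature dimension of an n-peptide composition is |A|**n, where |A| is
# the alphabet size: 20 for the native amino acids, fewer for the
# reduced cluster profiles CP(13)...CP(5).
alphabet_sizes = {"native": 20, "CP(13)": 13, "CP(11)": 11,
                  "CP(9)": 9, "CP(8)": 8, "CP(5)": 5}

for name, size in alphabet_sizes.items():
    dims = [size ** n for n in (1, 2, 3)]
    print(f"{name:>7}: n=1 -> {dims[0]}, n=2 -> {dims[1]}, n=3 -> {dims[2]}")
```

For instance, CP(8) at n = 3 gives 8^3 = 512 features, versus 20^3 = 8000 for the native alphabet, which is exactly the gap the RAAA approach exploits.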
To avoid this well-known issue, we propose to use the reduced amino acid alphabet (RAAA) approach, clustering the 20 native amino acids into a smaller number of groups, each with a representative residue. Compared with the traditional AAC approach, the RAAA not only considers the protein composition, but also preserves sequence order information. By assuming that different amino acids might be responsible for the same biological activity, it groups similar amino acids together, and this clustering process greatly reduces the dimension of the features used. De Brevern et al. proposed a structural alphabet called protein blocks (PBs) [45,46], which has been widely applied in computational proteomics [47–49]. To further improve efficiency, Etchebest et al. proposed a novel RAAA based on PBs [50], in which the 20 native amino acids are grouped into five different cluster profiles, namely CP(13), CP(11), CP(9), CP(8) and CP(5), as shown in Table 1.
Accordingly, the proteins are now encoded with the RAAA by a discrete feature vector P written as:

P = [f1, f2, ···, fD]^T    (7)

where T is the transpose operator, fi is the occurrence frequency of the i-th n-peptide over the reduced alphabet, and D is the number of such n-peptides for the chosen cluster profile. For instance, using CP(13) and n = 1, G forms the first group, I/V fall into the second group, and so on. As I and V are in the same group, their occurrence frequencies are counted together in the second group. With 13 clusters, the feature vector of CP(13) has dimension 13, much smaller than the 20 of the traditional AAC features. When n = 2, GG is in the first group, GI/GV/IG/VG are in the second group, and similarly for the rest: the elements of the 13 clusters permute to form 169 new groups. The dimensions of the feature vector of Equation (7) for the different cluster profiles and values of n are reported in Table 2. Apparently, the dimension of the feature vector generated by the adopted approach is much smaller than those generated by the traditional dipeptide and tripeptide compositions. Having determined the approach for representing amino acid features, we next introduce an effective algorithm for PPI prediction.
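The encoding of Equation (7) can be sketched as follows. Note that the 5-cluster mapping below is a hypothetical illustration of the idea, not the paper's exact CP(5) grouping from Table 1:

```python
from itertools import product

# Illustrative 5-cluster profile (hypothetical grouping, NOT the exact
# CP(5) table of the paper): native amino acid -> cluster label.
CP5 = {"G": "a", "A": "a", "V": "a", "L": "a", "I": "a",
       "F": "b", "W": "b", "Y": "b", "M": "b", "C": "b",
       "S": "c", "T": "c", "N": "c", "Q": "c",
       "D": "d", "E": "d", "K": "d", "R": "d", "H": "d",
       "P": "e"}

def raaa_npeptide_freq(seq, mapping, n):
    """Normalized n-peptide frequencies over the reduced alphabet."""
    reduced = "".join(mapping[r] for r in seq)      # map residues to clusters
    alphabet = sorted(set(mapping.values()))
    npeps = ["".join(p) for p in product(alphabet, repeat=n)]
    total = len(reduced) - n + 1                    # number of n-peptide windows
    counts = {p: 0 for p in npeps}
    for i in range(total):
        counts[reduced[i:i + n]] += 1
    return [counts[p] / total for p in npeps]       # dimension |alphabet|**n

vec = raaa_npeptide_freq("GAVLIFWY", CP5, 2)        # 5**2 = 25 features
```

Residues in the same cluster share a label, so their n-peptide counts merge, which is what shrinks the dimension from 20^n to |A|^n.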

Weighted Sparse Representation Based Classification
Sparse representation-based classification (SRC) assumes a training sample matrix X ∈ R^(d×n) with n samples of d-dimensional feature vectors. Let c_l denote the l-th sample (column) of X, with all samples divided into K object classes. Assuming there are n_i samples belonging to the i-th class, X_i = [c_i1 ··· c_in_i], and the whole dataset can be rewritten as X = [X_1 ··· X_K]. Suppose a new testing sample y ∈ R^d belongs to the i-th class; sparse representation seeks a coefficient vector [α_i1 ··· α_in_i] such that:

y = α_i1 c_i1 + α_i2 c_i2 + ··· + α_in_i c_in_i    (8)

In terms of a linear representation coefficient vector α_0 ∈ R^n over all training samples, y can be written as:

y = X α_0    (9)

According to the sparse representation approach, only the entries of α_0 corresponding to the same class as y are nonzero. Then, we have:

α_0 = [0, ···, 0, α_i1, ···, α_in_i, 0, ···, 0]^T    (10)

SRC aims to solve the following l0 minimization problem:

α̂_0 = argmin ||α||_0   subject to   Xα = y    (11)

Theoretically, solving Equation (11) is NP-hard [51]. To cope with this, the equation is replaced with its convex surrogate, as shown by Candes [52]: the l1 minimum solution approximates the l0 solution. Therefore, Equation (11) is rewritten as:

α̂_1 = argmin ||α||_1   subject to   Xα = y    (12)

To handle the occlusion problem, the l1 norm minimization is further extended to the following stable l1 norm minimization problem:

α̂_1 = argmin ||α||_1   subject to   ||Xα − y||_2 ≤ ε    (13)

where ε > 0 is a pre-defined threshold. This minimization can be solved using standard linear programming methods [53]. After obtaining the sparsest solution α̂_1, SRC uses the following classification criterion:

g_k(y) = ||y − X δ_k(α̂_1)||_2,   k = 1 ··· K    (14)

where δ_k(α̂_1) keeps the entries of α̂_1 associated with class k and sets the rest to zero, and y is assigned to the class k that minimizes Equation (14). As mentioned before, only when y belongs to class k do the entries of α̂_1 associated with class k take nonzero values. As g_k represents the residual, we assign y to the class with the smallest residual.
However, as pointed out in [37], SRC achieves good performance only when the data are represented in a high-dimensional feature space, which conflicts with our method's low-dimensional representation of amino acids. The weighted sparse representation-based classification (WSRC) approach proposed in [37] solves this problem well and also guarantees good performance in the low-dimensional data space. To adopt this approach, we need to study how to properly weight the training samples, as the weights determine the residuals. The goal of WSRC is to find a proper way to evaluate the relationship between training and testing samples, and the natural metric is the distance between them. The algorithm has two main steps: the first is to calculate the distance between the existing training samples and the given test sample, from which the sample weights are obtained; the second is to run the traditional SRC algorithm on the weighted training set.
In this paper, we choose the Gaussian kernel to compute the data distance, as it can capture nonlinear relationships in the dataset. With this Gaussian kernel distance, we can then calculate the similarity between training and testing data. Let x, y ∈ R^d denote a training and a testing sample, respectively; their Gaussian kernel distance can be written as:

d_g(x, y) = exp(−||x − y||^2 / (2σ^2))    (15)

where σ is the scale parameter of the Gaussian kernel. Compared with traditional distance measurements [54], such as the Euclidean distance, the Gaussian kernel distance preserves the neighborhood relationship well in a nonlinear data space. This is also the first attempt to adopt the Gaussian kernel distance for calculating the weights of SRC. As the Gaussian kernel distance lies between 0 and 1, we can directly use it as the weight of a training sample: for a training sample x_i ∈ R^d, its weight is d_g(x_i, y). By weighting each training sample, a new weighted training dataset is generated, denoted X' = [X'_1, ···, X'_K], where for the k-th class X'_k = [w_k1 x_k1, ···, w_kn_k x_kn_k] and n_k is the number of training samples in the k-th class. Consequently, the new WSRC classifier can be learned; the corresponding procedure is illustrated in Algorithm 1.
Algorithm 1: The WSRC Algorithm
1: Normalize the columns of X to have unit l2-norm.
2: Given a new test sample y, calculate the distance from each training sample x via the Gaussian kernel d_g(x, y) = exp(−||x − y||^2 / (2σ^2)). Use the results as weights to generate the new training set X'.
3: Find a coefficient vector α̂_1 satisfying:

α̂_1 = argmin ||α||_1   subject to   ||X'α − y||_2 ≤ ε    (17)

4: Compute the residuals:

g_k(y) = ||y − X' α̂_1k||,   k = 1 ··· K    (18)

where α̂_1k is the representation coefficient vector associated with class k.
5: Output the class label of y as identity(y) = argmin_k g_k(y).

Compared with traditional SRC, WSRC exploits the similarity between training and testing samples within a linear sparse feature representation, which significantly improves both the robustness and the accuracy of PPI prediction. For parameter learning, only the Gaussian kernel width σ and the threshold ε need to be carefully tuned. In the following section, we discuss how the experiments were performed, together with the related discussion.
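The steps of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name and parameters are hypothetical, and an l1-regularized least-squares (scikit-learn's Lasso) is used as a stand-in for the stable l1 minimization of Equation (17).

```python
import numpy as np
from sklearn.linear_model import Lasso

def wsrc_predict(X, labels, y, sigma=1.0, lasso_alpha=0.01):
    """Classify test sample y (shape (d,)) against training matrix X
    (shape (d, n), one sample per column) with class ids `labels`."""
    labels = np.asarray(labels)
    # Step 1: normalize training columns to unit l2-norm.
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    # Step 2: Gaussian-kernel similarity of y to each training column,
    # used directly as per-sample weights.
    w = np.exp(-np.sum((Xn - y[:, None]) ** 2, axis=0) / (2.0 * sigma ** 2))
    Xw = Xn * w  # weighted training set X'
    # Step 3: sparse coefficients via an l1-regularized surrogate.
    coef = Lasso(alpha=lasso_alpha, fit_intercept=False,
                 max_iter=10000).fit(Xw, y).coef_
    # Steps 4-5: class-wise residuals; predict the class with the smallest.
    best, best_res = None, np.inf
    for k in np.unique(labels):
        part = np.zeros_like(coef)
        part[labels == k] = coef[labels == k]
        res = np.linalg.norm(y - Xw @ part)
        if res < best_res:
            best, best_res = k, res
    return best
```

On well-separated toy data (e.g. class-0 columns near one basis vector and class-1 columns near another), the residual of the correct class is far smaller than that of the other, so the argmin rule recovers the label.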

Conclusions
In this paper, we proposed a new PPI prediction method that makes full use of protein sequence order information. The prediction model was built on the WSRC classifier, integrated with both the pseudo-amino acid composition and the reduced amino acid alphabet feature. Unlike the plain n-peptide method, combining the PseAAC and RAAA features yields feature sets of different dimensions, and this novel feature extraction method effectively reduces both the data dimension and the computational cost. In the experiments, the performance of WSRC with the features extracted by our method is superior to that of the state-of-the-art SVM in terms of accuracy, sensitivity, precision and MCC value. This is the first attempt to adopt WSRC to predict PPIs. The promising experimental results demonstrate that the WSRC approach can achieve good performance not only with the PseAAC feature combined with the RAAA, but also with other features, which shows the practical significance of our method for predicting PPIs.