A Novel Feature Extraction Scheme with Ensemble Coding for Protein–Protein Interaction Prediction

Protein–protein interactions (PPIs) play key roles in most cellular processes, such as cell metabolism, immune response, endocrine function, DNA replication, and transcription regulation. PPI prediction is one of the most challenging problems in functional genomics. Although PPI data have been increasing because of the development of high-throughput technologies and computational methods, many problems are still far from being solved. In this study, a novel predictor was designed by using the Random Forest (RF) algorithm with the ensemble coding (EC) method. To reduce computational time, a feature selection method (DX) was adopted to rank the features and search for the optimal feature combination. The resulting DXEC method integrates many features and physicochemical/biochemical properties to predict PPIs. On the Gold Yeast dataset, the DXEC method achieves 67.2% overall precision, 80.74% recall, and 70.67% accuracy. On the Silver Yeast dataset, it achieves 76.93% precision, 77.98% recall, and 77.27% accuracy. On the human dataset, the prediction accuracy of the DXEC-RF method reaches 80%. We extended the experiment to larger and more realistic datasets, on which the method maintains 50% recall (Yeast All dataset) and 80% recall (Human All dataset). These results show that the DXEC method is suitable for performing PPI prediction. The prediction service of the DXEC-RF classifier is available at http://ailab.ahu.edu.cn:8087/DXECPPI/index.jsp.


Introduction
Protein-protein interactions (PPIs) [1,2] play crucial roles in virtually every biological function. Proteins interact with each other to form protein-protein complexes and perform different biological processes, including metabolism, immune response, endocrine function, and DNA replication [3,4]. Various experimental and computational methods (e.g., two-hybrid systems [5,6], mass spectrometry [7], and protein chip technology [8]) have been developed to detect PPIs. PPIs have generally been studied individually by small-scale biochemical and biophysical experimental techniques. However, these experimental approaches are usually time-consuming and expensive. In recent years, high-throughput biology experimental methods [9,10] have been developed to produce PPIs.
In this study, a novel method called DXEC-RF was developed to predict PPIs. The DXEC method incorporates six coding methods and a feature selection method to construct a classifier. The experiments demonstrate that the ensemble coding (EC) method based on the feature extraction scheme contributes to PPI prediction and outperforms other well-known methods on the yeast and human datasets.

Performance Evaluation
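PPI prediction is a binary classification problem. In this experiment, precision, recall, accuracy, F-measure, and Matthews correlation coefficient (MCC) were employed to measure the performance of classifiers:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{F-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP denotes the number of true interaction pairs (interacting pairs predicted as interacting), TN denotes the number of true non-interaction pairs, FP denotes the number of false interaction pairs (non-interacting pairs predicted as interacting), and FN denotes the number of false non-interaction pairs.

The ROC (Receiver Operating Characteristic) curve is often used to evaluate classifier performance [40]. A classifier makes predictions on the basis of a threshold, which is generally set to 0.5. Each threshold value yields a new set of predictions and therefore one point of the true positive rate (TPR) plotted against the false positive rate (FPR); varying the threshold traces out the ROC curve.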
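The five metrics named above can be computed directly from the four confusion counts. The following is a minimal sketch that assumes the standard definitions of precision, recall, accuracy, F-measure, and MCC; the function name `classification_metrics` and the convention of returning 0 when a denominator vanishes are illustrative choices, not taken from the paper.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification metrics from confusion counts.

    tp/tn/fp/fn are the numbers of true positives, true negatives,
    false positives, and false negatives. Undefined ratios (zero
    denominator) are reported as 0.0, which is an assumption of this sketch.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # MCC uses all four counts, so it stays informative on imbalanced data.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }

# Usage with made-up counts (not results from the paper).
metrics = classification_metrics(tp=80, tn=60, fp=40, fn=20)
```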
The area under the ROC curve (AUC) is also used. A predictor whose AUC is larger than that of another predictor is considered the better of the two.
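The threshold sweep and the area computation can be illustrated with a short sketch. It assumes that each prediction comes with a real-valued score, that labels are coded 1 (interacting) and -1 (non-interacting), and that both classes are present; the AUC is approximated with the trapezoidal rule. The function names `roc_points` and `auc_trapezoid` are hypothetical.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (FPR, TPR) points.

    Lowering the decision threshold past each distinct score adds the
    corresponding examples to the predicted-positive set. Requires at
    least one positive (1) and one negative (-1) label.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Usage with toy scores (not data from the paper).
curve = roc_points([0.9, 0.4, 0.6, 0.2], [1, 1, -1, -1])
area = auc_trapezoid(curve)
```

Comparing two predictors then amounts to comparing their `area` values on the same test set.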

The DX Result
To combine the different coding methods, we need to calculate the score of each feature of AC, LD, CT, GAC, MAC, and NMBAC. Figure 1A shows the DX score distribution of each feature of the different methods on the Gold Yeast dataset. A larger DX score corresponds to a greater separation of positives and negatives in a feature. To reduce complexity, the top-ranked 150 features from each method were used to construct EC. The DX method was also adopted in EC to rank each feature. Figure 1B shows the DX score distribution of each feature according to the EC method on the Gold Yeast dataset. On the whole, the DX scores of the EC features are larger than those of the individual coding methods (AC, LD, CT, GAC, MAC, and NMBAC). Figures 2-4 show the DX score distribution on the Silver Yeast, Gold Human, and Silver Human datasets, respectively. Table S1 shows the result of using the DX method on the 7 coding methods (AC, LD, CT, GAC, MAC, NMBAC, and EC). It ranks all features according to the DX criterion, so the features nearer the top of the DX feature table are more important for PPI prediction.
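The construction of EC from the per-method rankings can be sketched as follows. This is only an illustration under stated assumptions: DX scores are taken as already computed, each coding method is represented by a plain list of feature values for one protein pair, and the function name `build_ensemble_coding` and the dictionary layout are hypothetical.

```python
def build_ensemble_coding(vectors, dx_scores, top_k=150):
    """Concatenate the top_k highest-scoring features of each coding method.

    vectors:   dict mapping method name -> feature values for one protein pair
    dx_scores: dict mapping method name -> DX score of each feature
               (same length and order as the corresponding vector)
    Returns the combined EC vector and a parallel list of feature names.
    """
    ec_vector, ec_names = [], []
    for method, values in vectors.items():
        scores = dx_scores[method]
        # Feature indices ordered from highest to lowest DX score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for i in ranked[:top_k]:
            ec_vector.append(values[i])
            ec_names.append(f"{method}_{i}")
    return ec_vector, ec_names

# Toy usage with two methods and top_k=2 (values are made up).
ec, names = build_ensemble_coding(
    {"AC": [1.0, 2.0, 3.0], "CT": [4.0, 5.0, 6.0]},
    {"AC": [0.1, 0.9, 0.5], "CT": [0.3, 0.2, 0.8]},
    top_k=2,
)
```

With six coding methods and `top_k=150`, the combined vector has 900 entries, which can then be re-ranked by DX score as a single feature set.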

Feature Importance Evaluation
To obtain the best feature space for PPI prediction, we constructed the classifier by using the RF algorithm with the DX feature selection method and EC encoding (DXEC-RF). After ranking each EC feature, incremental feature selection (IFS) [41] was adopted for optimal feature set selection. During the IFS procedure, features were added sequentially from high to low ranking according to the DX score table. In this way, 900 individual predictors corresponding to 900 feature subsets were constructed and trained on the dataset by using DXEC-RF. The average results of the 900 predictors under 5-fold cross-validation (i.e., the process is performed 5 times) are presented in Table S2. This feature selection process is illustrated in Figures 5 and 6. The DXEC-RF predictor achieves its highest MCC of 0.4279 with the top-ranked 470 features on the Gold Yeast dataset (Figure 5A) and 0.5518 with the top-ranked 34 features on the Silver Yeast dataset (Figure 5B). On the Gold Human dataset, the highest MCC (0.6339) is achieved with the top-ranked 532 features (Figure 6A), and on the Silver Human dataset, the highest MCC (0.6448) is achieved with the top-ranked 872 features (Figure 6B).
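The IFS loop described above can be sketched independently of the classifier. In this minimal sketch, `evaluate` is a placeholder callback standing in for "train RF on the given features and return the mean MCC under 5-fold cross-validation"; both it and the function name `incremental_feature_selection` are assumptions made for illustration.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the feature set one ranked feature at a time and keep the best.

    ranked_features: feature identifiers ordered from highest to lowest DX score
    evaluate:        callable taking a list of features and returning a score
                     (for example, mean cross-validated MCC)
    Returns (best_subset, best_score, history), where history[k-1] is the
    score obtained with the top-k features.
    """
    best_subset, best_score = [], float("-inf")
    history = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = evaluate(subset)
        history.append(score)
        # Strict comparison keeps the smallest subset among equal scores.
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score, history

# Toy usage: a made-up additive score in which feature 2 hurts performance.
gains = {0: 0.3, 1: 0.2, 2: -0.1, 3: 0.05}
subset, score, history = incremental_feature_selection(
    [0, 1, 2, 3], lambda feats: sum(gains[f] for f in feats)
)
```

For 900 ranked EC features, the loop trains 900 predictors, and `history` corresponds to the kind of curve plotted against the number of features.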

Comparison of Prediction Performance by Using Different Methods
After the optimal feature subset was confirmed, an experiment was conducted to evaluate the performance of the DXEC method against other coding methods. We performed 7 experiments, one for each of AC, LD, CT, GAC, MAC, NMBAC, and EC, by using five-fold cross-validation on the Yeast/Human Gold/Silver datasets; this process can be conducted 10 times. The detailed results are presented in Figures 7-18. The EC method obtains the highest accuracy of 70.67%, which is higher by 7.44%, 5.95%, 7.5%, 2.45%, 7.16%, and 1.54% than AC (63.  Figure 8A). The EC method obtains the highest precision of 67.21%, which is higher than the precision of other methods by 1% to 6% (Figure 8B). These results show that the EC method can effectively reduce false-positive PPI predictions.
The EC method scores the highest F-measure of 73.35%, which is higher than AC (66.

Comparison with Other Methods on the All Interaction Datasets
To further assess the performance of the DXEC-RF method, we also tested the trained classifiers on the Yeast/Human All Interaction datasets, which contain all yeast/human protein interactions from the source databases and consider random protein pairs as negatives. Classifiers trained on the Gold and Silver Yeast/Human datasets were tested separately. The performance of DXEC-RF is summarized in Tables 1-4. The other six methods were also implemented on the All Interaction datasets. The DXEC-RF method obtains the highest scores on all metrics except recall (Table 1). The ROC area, MCC, accuracy, and F-measure of DXEC-RF are approximately 1% to 4%, 2% to 6%, 1% to 5%, and 1% to 3% higher than those of the other methods, respectively. Table 2 leads to a similar conclusion. DXEC-RF obtains a higher recall, F-measure, MCC, and ROC area than the other methods (Table 3). When the Silver Human model was used to evaluate the Human All Interaction dataset, DXEC-RF again performed better than the other methods (Table 4). Figures 19 and 20 show the ROC curves of each method. As shown in the four graphs, the DXEC-RF method is better than the other methods for both the Yeast and Human All Interaction datasets.

Preparation of Datasets
This study is divided into 3 phases: the feature selection phase, training phase, and testing phase. We used the Yeast and Human protein interaction datasets derived by Indrajit Saha et al. [27]. The data were extracted from the DIP, MINT, BioGrid, and IntAct databases. We constructed 3 types of datasets, namely, the Gold, Silver, and All Interaction datasets. The Gold dataset contains the PPIs confirmed at least twice by using 2 different experimental methods: the yeast 2-hybrid method and an affinity-based method. The Silver dataset contains PPIs that were confirmed more than once but not necessarily with different experimental methods. The All Interaction dataset contains PPIs confirmed by at least one experimental method. On the basis of these datasets, we retained the PPIs with protein sequence lengths larger than 50.
Selecting non-interacting pairs is more complex and more important than selecting positive datasets. In other studies, random pairs that were not in the positive dataset were used as non-interacting datasets. In the current study, we created a dataset of negative examples by pairing protein entries randomly selected from UniProtKB. These random pairs were then cross-checked against PPI datasets to remove any true positives. Interacting and non-interacting pairs were labeled as the "positive class" (denoted as 1) and "negative class" (denoted as −1), respectively. Table 5 presents the number of instances in each dataset.
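The negative-set construction described above can be sketched as rejection sampling. The sketch assumes protein identifiers are available as a list and treats pairs as unordered; the function name `sample_negative_pairs`, the attempt limit, and the fixed seed are illustrative choices rather than details from the paper.

```python
import random

def sample_negative_pairs(protein_ids, positive_pairs, n_pairs, seed=0):
    """Draw random protein pairs that do not appear among known interactions.

    protein_ids:    list of unique protein identifiers to draw from
    positive_pairs: iterable of (a, b) pairs known to interact
    Returns a sorted list of ((a, b), -1) tuples; fewer than n_pairs may be
    returned if the attempt limit is reached first.
    """
    rng = random.Random(seed)
    # Unordered pairs, so (a, b) and (b, a) count as the same interaction.
    known = {frozenset(pair) for pair in positive_pairs}
    negatives = set()
    attempts, max_attempts = 0, 100 * n_pairs
    while len(negatives) < n_pairs and attempts < max_attempts:
        attempts += 1
        a, b = rng.sample(protein_ids, 2)  # two distinct proteins
        candidate = frozenset((a, b))
        # Cross-check against the positive set and against earlier draws.
        if candidate in known or candidate in negatives:
            continue
        negatives.add(candidate)
    return sorted((tuple(sorted(pair)), -1) for pair in negatives)

# Toy usage with hypothetical identifiers.
negatives = sample_negative_pairs(
    ["P1", "P2", "P3", "P4", "P5"], [("P1", "P2"), ("P3", "P2")], n_pairs=4
)
```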

AARC (Amino Acid Residue Change) Features
The AARC (Amino Acid Residue Change) features mainly reflect amino acid characteristics. The physicochemical properties considered include hydrophobicity, hydrophilicity, polarity, polarizability, solvation free energy, graph shape index, transfer free energy, amino acid composition, CC in regression analysis, residue accessible surface area in tripeptide, partition coefficient, and formation entropy. These features were also used by Shi et al. [18]. The 12 physicochemical properties for each amino acid are shown in Table S3. Table S4 shows the values of the features of the amino acid factors. Each amino acid residue has different properties that can influence the specificity and diversity of protein structure and function. Atchley et al. [42] performed feature analysis on the AAIndex [43] by using a multivariate statistical method and then transformed the AAIndex into five multidimensional and highly interpretable numeric patterns of attribute covariation that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. The five numerical pattern scores (denoted as "amino acid factors") were adopted in the current study to represent the respective properties of each amino acid in a given protein.

Feature Selection (DX)
Many features are irrelevant or redundant for a given prediction problem. Therefore, a number of feature selection methods have been proposed to remove the most irrelevant and redundant features and thereby improve the performance of learning models. We used the DX score [44] for this purpose. The author of this method adopted the DX score to select the most relevant bigram features. The DX score assesses the discrimination power of a feature in general cases. According to [45], the DX score can be defined as follows:

$$\mathrm{DX} = \frac{(\mathrm{average\_pos} - \mathrm{average\_neg})^{2}}{\mathrm{var\_pos} + \mathrm{var\_neg}}$$

where average_pos denotes the mean value of the feature in the interaction pairs of the training dataset; average_neg denotes the mean value of the feature in the non-interaction pairs of the training dataset; and var_pos and var_neg denote the variance of the feature in the interaction pairs and non-interaction pairs of the training dataset, respectively.
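A per-feature ranking by DX score can be sketched in a few lines. The sketch assumes the score takes the form (average_pos − average_neg)² / (var_pos + var_neg), built from the four quantities defined above, and it uses the population variance; whether the original work used population or sample variance is not stated here. The function names are hypothetical.

```python
from statistics import mean, pvariance

def dx_score(pos_values, neg_values):
    """Squared difference of class means divided by the sum of class variances.

    Uses population variance (an assumption). If both variances are zero,
    returns infinity when the means differ and 0.0 when they coincide.
    """
    numerator = (mean(pos_values) - mean(neg_values)) ** 2
    denominator = pvariance(pos_values) + pvariance(neg_values)
    if denominator == 0:
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator

def rank_features_by_dx(rows, labels):
    """Return (feature indices sorted by descending DX score, all scores).

    rows:   list of feature vectors, one per protein pair
    labels: 1 for interacting pairs, -1 for non-interacting pairs
    """
    scores = []
    for j in range(len(rows[0])):
        pos = [row[j] for row, y in zip(rows, labels) if y == 1]
        neg = [row[j] for row, y in zip(rows, labels) if y == -1]
        scores.append(dx_score(pos, neg))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order, scores

# Toy usage: feature 0 separates the classes, feature 1 does not.
rows = [[1, 10], [2, 11], [3, 12], [4, 10], [5, 11], [6, 12]]
order, scores = rank_features_by_dx(rows, [1, 1, 1, -1, -1, -1])
```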

Ensemble Coding Scheme
The key issue for sequence-based methods in PPI prediction is how to encode protein sequences by using protein properties. Many studies have proposed effective coding methods, including AC, CT, LD, GAC, MAC, and NMBAC. These methods use the interaction information among amino acids and the amino acid composition according to the protein characteristics. Figure 21 provides the framework of the EC scheme for protein sequence pairs. Take the following protein interaction pair Pa-Pb as an example: Pa, MTASVSNTQNKLNELLDAIRQEF; Pb, MNPGGEQTI.
First, the sequence pair Pa-Pb is transformed to six vectors according to the AC, LD, CT, GAC, MAC, and NMBAC methods. AC features describe the level of correlation between two protein sequences in terms of a specific physicochemical property and are defined on the basis of the distribution of amino acid properties along the sequence. AC can be computed according to Equation (9).
GAC can be computed according to Equation (10).
MAC can be computed according to Equation (11).
NMBAC can be computed according to Equation (12).
where j represents one descriptor, i is the position of the residue in sequence X, n is the length of the sequence, and lag is the distance between a residue and its neighboring residue. In this case, lag is set to 30 for the AC, MAC, GAC, and NMBAC methods. In the LD method, each protein is divided into 10 local regions, each of which includes 3 local descriptors: composition, transition, and distribution. The CT method considers any 3 continuous amino acids as a unit in the sequence. The details of such units are provided in the Supplementary Information [11,12,18,30-32,46-49].
Second, the DX method was used to rank each feature (from high to low) for the 6 vectors generated in the first step. Third, given that the dimensions of the AC, LD, CT, GAC, MAC, and NMBAC methods are large, we selected the top-ranked 150 features in each of the 6 coding methods to reduce complexity and combined these features into 1 vector (EC). Fourth, the DX method was adopted to rank each feature in the EC. Fifth, RF was used to evaluate the importance of each feature by using 5-fold cross-validation and by adding features sequentially. After the optimal features were confirmed, the model produced by the optimal features was applied to classify the test dataset.
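As an illustration of the autocorrelation-style descriptors, the sketch below computes auto covariance for one descriptor j. It assumes a commonly used form, AC(lag, j) = 1/(n − lag) · Σ_{i=1}^{n−lag} (X_{i,j} − mean_j)(X_{i+lag,j} − mean_j), which may differ in detail from the paper's Equation (9). The input is the list of property values along a sequence; the function name `auto_covariance` and the toy values are hypothetical.

```python
def auto_covariance(values, max_lag):
    """Auto covariance of one property along a sequence for lags 1..max_lag.

    values:  property value of descriptor j at each residue position (length n)
    max_lag: largest distance between the two residues being compared
    Returns a list with one feature per lag.
    """
    n = len(values)
    if max_lag >= n:
        raise ValueError("sequence must be longer than max_lag")
    mu = sum(values) / n
    centered = [v - mu for v in values]
    features = []
    for lag in range(1, max_lag + 1):
        # Average product of centered values that are `lag` positions apart.
        total = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))
    return features

# Toy usage with made-up property values (not a real amino acid scale).
ac = auto_covariance([1.0, 2.0, 3.0, 4.0], max_lag=2)
```

Under this form, a maximum lag of 30 yields 30 values per descriptor for each protein, and the sequence must contain more than 30 residues.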

Conclusions
In this study, a new method was developed for PPI prediction. The EC was constructed by using six coding methods, namely, AC, CT, LD, MAC, GAC, and NMBAC. To extract important features, feature selection (DX score) was used to rank each feature. RF was then adopted to find the optimal feature subset. After the optimal feature subset was confirmed, a model was produced on the basis of this subset and then applied to the Yeast/Human All Interaction datasets. Our results show that DXEC-RF is more suitable for PPI prediction than the other methods compared.