Protein Subnuclear Localization Based on Radius-SMOTE and Kernel Linear Discriminant Analysis Combined with Random Forest

Abstract: Protein subnuclear localization plays an important role in proteomics and can help researchers understand the biological functions of the nucleus. To date, most protein datasets used in studies are unbalanced, which reduces the prediction accuracy of protein subnuclear localization, especially for the minority classes. In this work, a novel method is therefore proposed to predict the protein subnuclear localization of unbalanced datasets. First, the position-specific score matrix (PSSM) is used to extract the feature vectors of two benchmark datasets, and the useful features are then selected by kernel linear discriminant analysis (KLDA). Second, Radius-SMOTE is used to expand the samples of the minority classes to deal with the problem of imbalance in the datasets. Finally, the optimal feature vectors of the expanded datasets are classified by random forest (RF). To evaluate the performance of the proposed method, four evaluation indices are calculated by the Jackknife test. The results indicate that the proposed method achieves better results than other conventional methods, and it also effectively improves the accuracy for both majority and minority classes.


Introduction
A biological cell is a highly ordered whole that can be divided into different organelles, such as the cytoplasm and nucleus, according to spatial distribution and function. The proteins in cells strongly correlate with life activities because proteins can perform their biological functions only when they are transported to the correct location in the nucleus or the cell [1,2]. Correct protein subnuclear localization can be used to annotate the structure and function of a protein, and it also contributes to the development of new drugs for genetic diseases and even cancer [3].
With the development of the life sciences, traditional experimental techniques such as cell fractionation and electron microscopy cannot meet the challenge of protein subnuclear localization because of the rapid growth in the number of protein samples in datasets [4]. To better solve this problem, computational intelligence can be used for protein subnuclear localization [5]. The critical issues in protein subnuclear localization using computational intelligence generally involve two aspects: extracting useful features from protein sequences, and selecting an appropriate classification algorithm and evaluating the results [6].
During the last two decades, many techniques for the feature extraction of protein sequences have been proposed; as early as 1994, Nakashima et al. established a prediction method of this kind. In this work, the features of protein sequences are first extracted by the position-specific score matrix (PSSM). Second, since a PSSM is a high-dimensional matrix that contains much redundant information and many nonlinear characteristics, KLDA is well suited to processing this kind of biological data. Third, since the protein datasets are seriously imbalanced, the proposed Radius-SMOTE is used to alleviate the problem of imbalanced data by creating samples of the minority classes. Finally, all test samples are classified by RF. RF can balance the errors of different classes to a certain extent, which further reduces the negative effects of dataset imbalance.

Position-Specific Score Matrix
Many homologous proteins may have the same structures and functions; thus, the PSSM is used in this study to extract the evolutionary information of protein sequences [28,29]. The PSSM, denoted by P, is defined by Equation (1):

$$P = \begin{bmatrix} E_{1\to 1} & E_{1\to 2} & \cdots & E_{1\to 20} \\ E_{2\to 1} & E_{2\to 2} & \cdots & E_{2\to 20} \\ \vdots & \vdots & & \vdots \\ E_{L\to 1} & E_{L\to 2} & \cdots & E_{L\to 20} \end{bmatrix} \quad (1)$$

where $E_{i\to j}$ represents the score of the amino acid residue at the $i$-th position of protein P being mutated to amino acid type $j$ during the evolution process; L is the length of the protein sequence; and the numbers 1 to 20 represent the native amino acid types. In this work, P is generated by PSI-BLAST from the protein sequence and the non-redundant (NR) database, with the E-value and the number of iterations set to 0.001 and 3, respectively. According to Equation (1), P is an L × 20 matrix. Because the value of L differs between protein sequences, Chou et al. proposed a representation method to standardize the representation of P for each sample [19], as shown in Equations (2) and (3).
where $\bar{E}_j$ represents the average score of the amino acid residues of protein P being mutated to amino acid type $j$, i.e., $\bar{E}_j = \frac{1}{L}\sum_{i=1}^{L} E_{i\to j}$. According to Equations (2) and (3), the generated P is a 20 × 20 matrix [30].
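As a concrete illustration of this stage, the sketch below calls PSI-BLAST with the parameters stated above (E-value 0.001, 3 iterations, NR database) and parses the resulting ASCII PSSM. Since Equations (2) and (3) are not reproduced in this excerpt, the final 20 × 20 form shown here, a length-normalized product of the column-centered PSSM with its transpose, is only an assumed convention; the file paths and helper names are likewise illustrative.

```python
# Hedged sketch: produce a PSSM with PSI-BLAST and turn the L x 20
# matrix into a fixed-size representation. The 20 x 20 form used here
# is an assumption, not the paper's exact Equations (2)-(3).
import subprocess
import numpy as np

def run_psiblast(fasta_path, pssm_path, db="nr"):
    # parameters follow the text: E-value 0.001, 3 iterations, NR database
    subprocess.run(["psiblast", "-query", fasta_path, "-db", db,
                    "-num_iterations", "3", "-evalue", "0.001",
                    "-out_ascii_pssm", pssm_path], check=True)

def read_ascii_pssm(pssm_path):
    rows = []
    with open(pssm_path) as fh:
        for line in fh:
            tok = line.split()
            # data rows start with the residue index, then the residue
            # letter, then 20 log-odds scores
            if len(tok) >= 22 and tok[0].isdigit():
                rows.append([int(v) for v in tok[2:22]])
    return np.asarray(rows, dtype=float)          # L x 20 matrix of E_{i->j}

def fixed_size_feature(pssm):
    L = pssm.shape[0]
    col_mean = pssm.mean(axis=0)                  # average score per amino acid type
    centered = pssm - col_mean
    return centered.T @ centered / L              # 20 x 20 (assumed form)
```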

Kernel Linear Discriminant Analysis
Kernel linear discriminant analysis (KLDA) [31], a dimensionality reduction algorithm based on the kernel method, is used to solve the problem of linear inseparability of data in the original space; it is a nonlinear extension of linear discriminant analysis (LDA). LDA is a dimensionality reduction algorithm whose essence is to map data from a high-dimensional space to a low-dimensional linear subspace through a linear combination of features. However, the useful features of real-world data are usually not linear combinations of the original features. When there are many nonlinear structures in the dataset, a linear dimensionality reduction mapping cannot preserve this structural information, so the kernel method is used to transform the problem of linear inseparability in the original space into a linearly separable problem in a high-dimensional feature space.
Next, the idea of KLDA is described in detail. In this study, X denotes the protein dataset containing N samples divided into k classes, X = {x_1, x_2, ..., x_N} = C_1 ∪ C_2 ∪ ... ∪ C_k, where x_i (i = 1, 2, ..., N) and C_i (i = 1, 2, ..., k) denote the individual samples and classes of X, respectively. The process of dimension reduction can be divided into five steps (a code sketch is given after the list):

1. Map the input samples x_1, x_2, ..., x_N into a higher-dimensional space F by a nonlinear mapping function ∅; the mapped samples can be expressed as ∅(x_1), ∅(x_2), ..., ∅(x_N).

2. Calculate the mean m^∅ of all mapped samples and the mean m_i^∅ of the mapped samples of class C_i by the following formulas:

$$m^{\emptyset} = \frac{1}{N} \sum_{l=1}^{N} \emptyset(x_l) \quad (4)$$

$$m_i^{\emptyset} = \frac{1}{N_i} \sum_{x \in C_i} \emptyset(x) \quad (5)$$

where N_i is the number of samples belonging to class C_i.

3. Calculate the intraclass covariance matrix S^∅_intra and the interclass covariance matrix S^∅_inter of the whole set of mapped samples using the following formulas:

$$S^{\emptyset}_{intra} = \sum_{i=1}^{k} \sum_{x \in C_i} \left(\emptyset(x) - m_i^{\emptyset}\right)\left(\emptyset(x) - m_i^{\emptyset}\right)^T \quad (6)$$

$$S^{\emptyset}_{inter} = \sum_{i=1}^{k} N_i \left(m_i^{\emptyset} - m^{\emptyset}\right)\left(m_i^{\emptyset} - m^{\emptyset}\right)^T \quad (7)$$

4. Find the optimal projection direction ν by minimizing the intraclass distance and maximizing the interclass distance, which can be expressed as Equation (8):

$$\nu^{*} = \arg\max_{\nu} \frac{\nu^T S^{\emptyset}_{inter}\, \nu}{\nu^T S^{\emptyset}_{intra}\, \nu} \quad (8)$$

Moreover, ν is a linear combination of ∅(x_1), ∅(x_2), ..., ∅(x_N), which can be expressed as follows:

$$\nu = \sum_{l=1}^{N} a_l\, \emptyset(x_l) \quad (9)$$

In Equation (8), ∅ is unknown and the feature space F may not be unique, which means that ν cannot be computed directly. Thus, the kernel trick $K(x, y) = \langle \emptyset(x), \emptyset(y) \rangle$ is introduced to solve this problem, and Equations (4) and (5) can be transcribed as Equations (10) and (11):

$$\nu^T m^{\emptyset} = \frac{1}{N} \sum_{j=1}^{N} \sum_{l=1}^{N} a_j K(x_j, x_l) \quad (10)$$

$$\nu^T m_i^{\emptyset} = \frac{1}{N_i} \sum_{j=1}^{N} \sum_{x \in C_i} a_j K(x_j, x) \quad (11)$$

Combining Equations (6)–(11), the final criterion function of the dimension reduction can be rewritten as follows:

$$a^{*} = \arg\max_{a} \frac{a^T M a}{a^T N a} \quad (12)$$

where M and N denote the interclass and intraclass matrices expressed through the kernel function.

5. Obtain the final dimension-reduced projection matrix Y by Y = a^T X.
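As a concrete illustration of Steps 1–5, the following is a minimal NumPy/SciPy sketch of multiclass kernel Fisher discriminant analysis. The RBF kernel, its width gamma, the ridge term that keeps the intraclass matrix invertible, and all variable names are illustrative assumptions rather than settings reported in this paper.

```python
# Minimal KLDA sketch: kernel matrix, kernelized means, and the
# generalized eigenproblem of Eq. (12). Kernel choice and ridge
# regularization are assumptions, not the paper's settings.
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(XA, XB, gamma=0.01):
    # K(x, y) = exp(-gamma * ||x - y||^2), a common kernel choice (assumed)
    d2 = (np.sum(XA**2, axis=1)[:, None] + np.sum(XB**2, axis=1)[None, :]
          - 2.0 * XA @ XB.T)
    return np.exp(-gamma * d2)

def klda_fit(X, y, n_components=2, gamma=0.01, ridge=1e-6):
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)              # N x N kernel matrix
    m_star = K.mean(axis=1)                  # kernelized global mean, cf. Eq. (10)
    M = np.zeros((N, N))                     # interclass matrix of Eq. (12)
    W = np.zeros((N, N))                     # intraclass matrix of Eq. (12)
    for c in np.unique(y):
        Kc = K[:, y == c]                    # kernel columns of class c
        Nc = Kc.shape[1]
        m_c = Kc.mean(axis=1)                # kernelized class mean, cf. Eq. (11)
        diff = m_c - m_star
        M += Nc * np.outer(diff, diff)
        W += Kc @ (np.eye(Nc) - 1.0 / Nc) @ Kc.T   # within-class centering
    W += ridge * np.eye(N)                   # regularization (assumed)
    vals, vecs = eigh(M, W)                  # generalized eigenproblem, Eq. (12)
    return vecs[:, ::-1][:, :n_components]   # leading projection coefficients a

def klda_transform(A, X_train, X_new, gamma=0.01):
    # project samples into the reduced space: Y = K(X_new, X_train) A
    return rbf_kernel(X_new, X_train, gamma) @ A
```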

The Proposed Radius-SMOTE
SMOTE was proposed by Chawla et al. in 2002 [32], and it is used in this study to address the problem of imbalanced data. Suppose that each sample in a minority class is expressed as x_i (i = 1, 2, ..., n), the k (k < n) nearest neighbors of x_i are expressed as x_i^j (j = 1, 2, ..., k), the imbalance rate of the minority class is r, and each generated sample is y_i^s (s = 1, 2, ..., r). The original SMOTE executes three steps to generate a new instance, as shown in Figure 2a. First, it chooses a random minority sample x_i; second, it randomly selects an instance x_i^j among the k nearest minority-class neighbors of x_i; finally, a new instance y_i^s is generated between x_i and x_i^j by

$$y_i^s = x_i + rand(0, 1) \times (x_i^j - x_i)$$

where rand(0, 1) is a random number between 0 and 1. However, the original SMOTE does not consider the distribution of nearby majority-class samples, so the synthesized instances may fall into majority-class regions or amplify noise. To address this problem, Radius-SMOTE is proposed to synthesize better samples of the minority classes. The process of synthesizing new samples by Radius-SMOTE can be divided into five steps (a hedged code sketch follows the list):

1. Calculate the imbalance rate r:

$$r = \frac{n_{Majority} - n_{Minority}}{n_{Minority}}$$

5. According to the imbalance rate, repeat Steps 2 to 4 r times. Finally, n × r samples can be synthesized by Radius-SMOTE, as shown in Figure 3b.
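Steps 2–4 are not shown above, so the sketch below fills them with one common radius-based rule: a new point is interpolated from a minority sample toward one of its k nearest minority neighbors, and the step is clipped so that it stays within a safe radius given by the distance to the nearest majority-class sample. Everything about Steps 2–4 in this sketch is therefore an assumption, not the paper's exact procedure; only Steps 1 and 5 follow the text above.

```python
# Hedged sketch of a Radius-SMOTE-style oversampler. Steps 2-4 are
# assumed (see the lead-in); Steps 1 and 5 follow the text above.
import numpy as np

def radius_smote(X_min, X_maj, k=5, seed=0):
    rng = np.random.default_rng(seed)
    n_min, n_maj = len(X_min), len(X_maj)
    r = (n_maj - n_min) // n_min                     # Step 1: imbalance rate
    d_min = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    d_maj = np.linalg.norm(X_min[:, None] - X_maj[None, :], axis=2)
    np.fill_diagonal(d_min, np.inf)                  # exclude self-neighbors
    neighbors = np.argsort(d_min, axis=1)[:, :k]     # k nearest minority neighbors
    radius = d_maj.min(axis=1)                       # assumed safe radius (Step 3)
    synthetic = []
    for _ in range(r):                               # Step 5: repeat r times
        for i in range(n_min):                       # Step 2 (assumed): each sample
            j = neighbors[i, rng.integers(k)]
            step = rng.random() * (X_min[j] - X_min[i])   # SMOTE interpolation
            norm = np.linalg.norm(step)
            if norm > radius[i]:                     # Step 4 (assumed): clip step
                step *= radius[i] / norm
            synthetic.append(X_min[i] + step)
    return np.asarray(synthetic)                     # n_min * r new samples
```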

Random Forest
Random forest (RF) is an ensemble-based algorithm [33] that constructs many decision trees from the original data and classifies samples by combining the results of the individual trees. The construction of an RF can be divided into five steps (a usage sketch follows the list):

1. Randomly select training subsets from the original dataset;
2. Build a decision tree for each training subset; no decision tree needs to be pruned;
3. Construct the RF model as a forest composed of tens or hundreds of decision trees;
4. To classify a new sample, let each decision tree in the forest give an individual result;
5. Count the votes for each class and take the class with the most votes as the final result.
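These five steps correspond to what off-the-shelf implementations already provide. The toy example below uses scikit-learn's RandomForestClassifier; the tree count of 210 mirrors the best k reported for Dataset 1 later in this paper, and the synthetic data merely stand in for the KLDA-reduced features.

```python
# Illustrative use of an off-the-shelf random forest; the synthetic
# data stand in for the KLDA-reduced, oversampled protein features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=210, random_state=0)  # k = 210 trees
rf.fit(X_train, y_train)             # each tree is grown without pruning
print(rf.score(X_test, y_test))      # accuracy of the majority vote
```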

Experiments
In this section, the experimental results of two datasets based on our proposed method are introduced and analyzed.

Evaluation Indexes
In this work, the Jackknife test was used to examine the performance of the proposed model; it is considered the most reasonable cross-validation method [35,36]. For a dataset of n samples, the Jackknife test selects each sample in turn as the test sample, with the remaining n − 1 samples used as training samples.
In order to evaluate the degree of dataset imbalance and the performance of the proposed method, five indices are used with the Jackknife test. With n_i denoting the number of samples in the i-th class, the imbalance ratio is defined as

$$IR = \frac{\min_i n_i}{\max_i n_i}$$

where a smaller IR means that the data are more imbalanced. For each class, with TP, FP, TN and FN denoting the true positives, false positives, true negatives and false negatives, the sensitivity (Se), specificity (Sp), accuracy (ACC) and Matthews correlation coefficient (MCC) are

$$Se = \frac{TP}{TP + FN}, \quad Sp = \frac{TN}{TN + FP}, \quad ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

When the value of MCC is equal to 1, the prediction is perfect; when it is equal to 0, the prediction is no better than random; when it is equal to −1, the prediction is completely wrong.
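Assuming the standard one-vs-rest counts used above, the indices and the Jackknife test can be computed as in the following sketch; the RF settings inside the loop are placeholders, not the paper's tuned values.

```python
# Per-class Se, Sp, ACC, MCC from one-vs-rest counts, the IR index,
# and the Jackknife test realized as leave-one-out cross-validation.
import numpy as np
from math import sqrt
from collections import Counter
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import RandomForestClassifier

def imbalance_rate(y):
    counts = Counter(y).values()
    return min(counts) / max(counts)        # smaller IR = more imbalanced

def per_class_metrics(y_true, y_pred, cls):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == cls) & (y_pred == cls)))
    fn = int(np.sum((y_true == cls) & (y_pred != cls)))
    fp = int(np.sum((y_true != cls) & (y_pred == cls)))
    tn = int(np.sum((y_true != cls) & (y_pred != cls)))
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) or 1.0
    mcc = (tp * tn - fp * fn) / denom
    return se, sp, acc, mcc

def jackknife_predictions(X, y):
    y = np.asarray(y)
    preds = np.empty(len(y), dtype=y.dtype)
    for train, test in LeaveOneOut().split(X):   # each sample tested once
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[train], y[train])
        preds[test] = clf.predict(X[test])
    return preds
```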

The Analysis of Unbalanced Datasets
The imbalance degree of Dataset 1 was calculated in two cases and is shown in Figure 4. In the first case, based on the original data, the IR was 0.042; in the second case, based on the data expanded by SMOTE and Radius-SMOTE, the IR of Dataset 1 increased to 0.769 and 0.727, respectively, which means that Radius-SMOTE was effective for reducing the dataset imbalance.

The Overall Accuracy Analysis of the Proposed Method
In this study, the features of protein sequences were extracted by PSSM and the prediction of subnuclear localization was obtained by RF. To demonstrate the performance of the proposed method, three experiments were performed on each of the two benchmark datasets. The experimental results obtained by the Jackknife cross-validation are shown in Figure 6. From Figure 6, the maximum accuracies for Dataset 1 and Dataset 2 were 96.1% and 95.7%, respectively. The classification accuracies of RF + SMOTE were all higher than those of RF alone, which means that the oversampling method could effectively improve the classification accuracy. The accuracies of RF + Radius-SMOTE were in turn all higher than those of RF + SMOTE, which means that Radius-SMOTE could achieve even better classification performance.

The Relationship between k in Radius-SMOTE with Overall Accuracy
In the process of expanding datasets, the selection of the k nearest neighbors plays an important role in classification. The experimental results for different values of k in Radius-SMOTE are shown in Figure 7: the blue polyline denotes the overall accuracy of Dataset 1 and the red polyline that of Dataset 2. In this study, k ranges over [1, 9]. For Dataset 1, the minimum accuracy was 84.8% (k = 1); as k increased, the accuracy rose to a maximum of 96.1% (k = 7). For Dataset 2, the minimum accuracy was 84.3% (k = 1) and the maximum accuracy was 95.7% (k = 5). In general, the accuracy on each dataset was lowest at k = 1, which shows that the selection of the nearest neighbors in Radius-SMOTE is significant.

The Relationship between k in RF with Overall Accuracy
RF was used for classification in this study. The parameter k of RF represents the number of trees used in the classification process, and it influences the overall accuracy of the proposed method. In Figure 8, k was varied in the range [30, 210]. For Dataset 1, the overall accuracy was lowest at 95.1% (k = 30); the accuracy gradually stabilized once k exceeded 30 and reached its highest value of 96.71% (k = 210). For Dataset 2, the overall accuracy was highest at 95.7% (k = 150); in addition, the spread of accuracy caused by k was 4.5 percentage points larger than that of Dataset 1, which means that the effect of k on Dataset 2 was more obvious. Figure 8 also shows that the blue polyline lies above the red polyline, which means that Dataset 1 had better classification performance than Dataset 2.

The Analysis for Evaluation Indexes of Different Methods
In this part, three experiments were performed on each of the two benchmark datasets. All datasets were mapped into a low-dimensional space by KLDA and then classified by RF. Four evaluation indices were calculated on the basis of the Jackknife tests and are shown in Tables 3 and 4. In the first experiment, the values of the evaluation indices, based on the original data, are given in the third column of Tables 3 and 4; in the second experiment, the values based on the data expanded by SMOTE are given in the fourth column; and in the third experiment, the values based on the data expanded by Radius-SMOTE are given in the fifth column.
As shown in Table 3, in the first experiment the class Nl (307 samples) had the highest Se value of 0.958 and the lowest Sp value of 0.766, while the class Nb (13 samples) had the lowest Se value of 0.154 and the highest Sp value of 1, which means that the classification results were easily affected by the number of samples in the different categories. In addition, most of the MCC values were less than 0.5, which means that the classification performance of RF without oversampling was not good. In the second experiment, the MCC values were above 0.83 (except for Nl, at 0.76) and the ACC values were above 0.94, which demonstrates that the classification performance with SMOTE was good. In the third experiment, the Se values of all classes were higher than 0.92 (except for Nl, at 0.876), the Sp values were higher than 0.988, the ACC values were higher than 0.98 and thus very close to 1, and the MCC values were higher than 0.9, which shows that the classification performance with Radius-SMOTE was better than that of the others.
As shown in Table 4, in the first experiment the Se values of Ne and Nm were only 0.118 and 0.11, respectively, and the MCC values of all classes except Nl (0.759) were lower than 0.52, which means that the classification performance without oversampling was poor. In the second experiment, the ACC values of all classes were higher than 0.94; the MCC values of Cn, Co and Nl were between 0.74 and 0.88, and those of the remaining seven classes were higher than 0.92. The MCC and ACC values of the second experiment were larger than those of the first, which illustrates that the classification with SMOTE was better.

Comparisons with Other Methods
We compared the performance of our proposed method with state-of-the-art methods on the two benchmark datasets. The comparison experiments were divided into two categories: comparisons with SMOTE variants and comparisons with existing methods of protein subnuclear localization. Based on the same datasets, the Jackknife test was used to compare the performance of the proposed method with the other methods, as described in detail in this section.

Comparison of Dataset 1
Based on Dataset 1, the proposed method was first compared with other oversampling methods, as shown in Table 5. It was found that Radius-SMOTE had better performance for protein subnuclear localization. The proposed method was then compared with six state-of-the-art methods of protein subnuclear localization, as shown in Table 6. Through this comparison, the highest classification accuracy on Dataset 1, 96.1%, was obtained by the proposed method.

Table 6. Prediction results of Dataset 1 obtained by different methods of protein subnuclear localization.

Methods (Jackknife Test)                       Overall Accuracy (%)
Fusion of PsePSSM and PseAAC-KNN [19]          67.4
PseAAPSSM-LDA-KNN [6]                          88.1
DipPSSM-LDA-KNN [6]                            95.9
AACPSSM with fused kernel-KLDA-KNN [40]        94.7
PseAAC-A hybrid-classifier-based SVM [...]     ...
The proposed PSSM-Radius-SMOTE-KLDA-RF         96.1

The comparison of MCC between the proposed method and other prediction methods of protein subnuclear localization can be found in Figure 9. For the four methods compared, the largest differences in MCC across classes were 0.39, 0.69, 0.45 and 0.093, respectively. The smallest spread, 0.093, was obtained by the proposed method, which means that its predictions for the majority and minority classes are more uniformly accurate than those of the other methods (Figure 9).

Comparison of Dataset 2
For Dataset 2, the four oversampling methods compared were the same as those for Dataset 1 (Table 7). The performance of Radius-SMOTE on Dataset 2 was also the best, which illustrates the effectiveness of Radius-SMOTE.

Oversampling Methods            Overall Accuracy (%)
Borderline_SMOTE1 [37]          93.8
Borderline_SMOTE2 [37]          93.6
SVM_balance [38]                95.6
NRAS [39]                       88.1
The proposed Radius-SMOTE       95.7

At the same time, the proposed method was also compared with three methods of protein subnuclear localization (Table 8). The results illustrate that the proposed method performed better than the other methods.

Table 8. Prediction results of Dataset 2 obtained by different methods of protein subnuclear localization.

Methods (Jackknife Test)                       Overall Accuracy (%)
SSLD and AAC-SVM [34]                          81.5
PseAAPSSM-LDA-KNN [6]                          84
CoPSSM-KLDA-based DGGA-KNN [5]                 87.4
The proposed PSSM-Radius-SMOTE-KLDA-RF         95.7

As shown in Figure 10, the MCC of each class obtained by the proposed method was higher than that of the other three methods. Thus, the classification performance of the proposed method on Dataset 2 was better than that of the other methods.

Conclusions
This study proposes an effective protein subnuclear localization method with the aim of overcoming the imbalance of protein datasets and improving the prediction accuracy of protein subnuclear localization. First, the features of proteins are represented by PSSM, which extracts the evolutionary information of proteins. Second, the dimensionality of the feature vectors is reduced by KLDA, which removes redundant information from the protein dataset. Third, Radius-SMOTE, which is based on SMOTE, is used to solve the imbalance problem of the protein datasets. Finally, the subnuclear localization of proteins is predicted by RF.
According to the Jackknife test, the overall accuracy of the proposed method on the two benchmark datasets reaches 96.1% and 95.7%, respectively. From the experimental results, the following conclusions can be drawn:

1. The imbalance of protein datasets has a great impact on the prediction accuracy of protein subnuclear localization;
2. The proposed method can efficiently improve the prediction accuracy of protein subnuclear localization by solving the imbalance problem of protein datasets;
3. The combination of KLDA and RF can improve the classification accuracy of proteins at the subnuclear level.