Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA

An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. Existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and the position specific scoring matrix (PSSM), each describe a protein sequence from a single perspective and are therefore insufficient on their own. This paper proposes two fusion feature representations, DipPSSM and PseAAPSSM, which integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce balance factors to weight the importance of its components. The optimal values of the balance factors are sought by a genetic algorithm. Because the proposed representations are high dimensional, linear discriminant analysis (LDA) is used to find their important low-dimensional structure, which is essential for classification and location prediction. Numerical experiments on two public datasets with a KNN classifier and cross-validation tests show that, in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusion representations outperform the traditional representations in protein sub-nuclear localization, and that the representations treated by LDA outperform the untreated ones.


Introduction
It is well known that if proteins are wrongly located or accumulate in large quantities in improper parts of the nucleus, genetic diseases, and even cancer, can result [1]. Nuclear proteins therefore play a very important role in research on disease prevention and clinical medicine, where correct protein sub-nuclear localization is essential. Over the past two decades, researchers have made great progress in the study of protein representation methods and sub-cellular localization prediction [2]. Since the nucleus is the largest cell organelle and guides the life processes of the cell, researchers have focused on identifying the location(s) of a query protein in the nucleus so as to explore its function. The traditional approach is to conduct a series of biological experiments at the cost of much time and money [3]. However, with a large number of protein sequences having been generated, faster localization methods are required. An attractive route in recent studies is to use machine learning for protein sub-nuclear localization [4].
The core problem of protein sub-nuclear localization using machine learning has two aspects: constructing good representations that capture as much protein sequence information as possible, and developing effective models for prediction. Several representations that provide abundant discriminative information for improving prediction accuracy have been reported. Nakashima and Nishikawa propose the well-known amino acid composition (AAC) [5], which describes the occurrence frequency of the 20 kinds of essential amino acids in the protein sequence. However, AAC loses much of the information in the protein sequence. Dipeptide composition (DipC) was then presented, which considers the amino acid composition together with the local order of the amino acids [6]. Subsequently, taking into account both sequence order and length information, Chou et al. introduce pseudo-amino acid composition (PseAAC) [7][8][9][10][11]. In addition, the position-specific scoring matrix (PSSM) was proposed to capture evolutionary information that is helpful for protein sub-nuclear localization [12]. Many other representation approaches can be found in [13,14].
After obtaining a good representation, researchers need to develop models for predicting protein sub-nuclear localization. Shen and Chou [15] utilize an optimized evidence-theoretic k-nearest neighbor classifier based on PseAAC to predict protein sub-nuclear locations. Mundra et al. report a multi-class support vector machine classifier employing AAC, DipC, PseAAC and PSSM [16]. Kumar et al. describe a method called SubNucPred, which combines the presence or absence of unique Pfam domains with an amino acid composition based SVM model [17]. Jiang et al. [18] report an ensemble classification method for sub-nuclear locations on the datasets in [19,20] using decision stumps, the fuzzy k-nearest neighbors algorithm and radial basis SVMs.
However, current works have two drawbacks: they lack a representation with sufficient information, and they do not consider the relationship between the representation and the prediction model. A single representation describes a protein sequence from only one point of view, which is insufficient and can lead to poor performance in protein sub-nuclear localization. Representations that draw information from multiple aspects are therefore worth studying for improving prediction accuracy. On the other hand, simplicity is also an important principle in machine learning: a compact representation can yield a preferred prediction model [21]. Therefore, this paper first proposes two effective fusion representations, each combining two single representations, and then uses the dimension reduction method of linear discriminant analysis (LDA) to arrive at an optimal expression for the k-nearest neighbors (KNN) classifier. In the first step, we take into account both DipC and PSSM to form a new representation, dubbed DipPSSM, and consider both PseAAC and PSSM to construct another representation, called PseAAPSSM. In this way, the two proposed representations contain more protein sequence information and can describe protein data sufficiently. However, it is difficult to reach a suitable trade-off between DipC and PSSM in DipPSSM and between PseAAC and PSSM in PseAAPSSM, so we adopt a genetic algorithm to find a set of balance factors that solves this problem.
In Section 2, we review three single representations, DipC, PseAAC and PSSM. In Section 3, we propose two representations and use a genetic algorithm to obtain the balance factors of the proposed representations. In Section 4, we perform LDA on the proposed representations, followed by the KNN classification algorithm. In Section 5, experiments with two benchmark datasets are performed. Section 6 gives the concluding remarks. For the convenience of readers, we list all abbreviations used in this paper in Table 1.

Table 1. The corresponding relationship between abbreviation and full name.

Code  The Full Name                                             Abbreviation
1     Dipeptide composition                                     DipC
2     Pseudo-amino acid composition                             PseAAC
3     Position specific scoring matrix                          PSSM
4     The proposed representation by fusing DipC and PSSM       DipPSSM
5     The proposed representation by fusing PseAAC and PSSM     PseAAPSSM
6     Linear discriminant analysis                              LDA
7     k-nearest neighbors                                       KNN
8     True positive                                             TP
9     True negative                                             TN
10    False positive                                            FP
11    False negative                                            FN
12    Matthews correlation coefficient                          MCC

The Related work
In this section, three single representations, DipC, PseAAC and PSSM, are introduced to prepare for our proposed fusion representations.

Dipeptide Composition (DipC)
DipC reflects both the amino acid composition and the local order of residues in the sequence. It records the occurrence frequencies of pairs of consecutive residues in the primary sequence over the 400 possible dipeptides, and hence forms a 400D feature vector [6]. In this work, we add 20 elements, representing the frequencies of the 20 kinds of amino acids in the protein sequence, to the DipC vector to better reflect the amino acid composition. The final protein sequence is therefore expressed as a 420-dimensional vector, i.e., a point in 420D Euclidean space. We denote this feature representation of a protein sample as $P_{DipC}$, whose first 20 dimensions give the amino acid composition and whose last 400 dimensions give the dipeptide composition. For a protein P whose sequence length is L (i.e., P has L amino acids), we have
$$P_{DipC} = [p_1, p_2, \cdots, p_{20}, p_{21}, \cdots, p_{420}]^T, \quad p_i = \begin{cases} aa_i / L, & i = 1, \cdots, 20 \\ cr_i / (L-1), & i = 21, \cdots, 420 \end{cases} \tag{1}$$
where $aa_i$ is the number of type-i amino acids and $cr_i$ is the number of occurrences of the corresponding pair of consecutive residues.
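The construction of the 420D vector can be illustrated with a minimal Python sketch. The function name `dipc_vector`, the alphabetical residue ordering, and the normalization by L and L−1 are assumptions made for illustration; residues outside the 20 standard types are simply skipped.

```python
from itertools import product

import numpy as np

# Assumed ordering: the 20 standard residues, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipc_vector(sequence):
    """Return a 420-D vector: 20 amino acid frequencies, then 400 dipeptide frequencies."""
    aa_index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    pair_index = {a + b: k for k, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}
    length = len(sequence)
    if length < 2:
        raise ValueError("sequence must contain at least two residues")

    vec = np.zeros(420)
    # Count single residues (first 20 entries).
    for aa in sequence:
        if aa in aa_index:
            vec[aa_index[aa]] += 1
    # Count overlapping pairs of consecutive residues (last 400 entries).
    for k in range(length - 1):
        pair = sequence[k:k + 2]
        if pair in pair_index:
            vec[20 + pair_index[pair]] += 1

    vec[:20] /= length        # a sequence of length L has L residues
    vec[20:] /= length - 1    # and L - 1 consecutive pairs
    return vec
```

For example, `dipc_vector("ACAC")` gives frequency 0.5 for both A and C, 2/3 for the dipeptide AC and 1/3 for CA.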

Pseudo Amino acid Composition (PseAAC)
PseAAC, put forward by Chou et al., represents a protein sequence by its sequence composition and order information in a single vector [7]. In PseAAC, the first 20 elements denote the frequencies of the 20 kinds of essential amino acids, and the remaining elements are sequence-order correlation factors computed from the hydrophobicity and hydrophilicity of the amino acids [15]. The general PseAAC is written as:
$$P_{PseAAC} = [p_1, \cdots, p_{20}, \cdots, p_{20+\lambda}, \cdots, p_{20+2\lambda}]^T \tag{2}$$
In this paper, we transform each protein sequence into its PseAAC representation with the online tool provided by the Pattern Recognition and Bioinformatics Group of Shanghai Jiaotong University. Note that we empirically set the parameter λ to 10 and obtain a 40D feature vector $P_{PseAAC}$ for representing the protein sequence P.

Position Specific Scoring Matrix (PSSM)
Protein sequences undergo various changes during biological evolution, for instance, the insertion, substitution or deletion of one or several amino acid residues [21]. With the long-term accumulation of these changes, the similarity between the original and the newly synthesized proteins gradually decreases, but such homologous proteins may still exhibit remarkably similar structures and functions [22]. As one sub-nuclear location may contain highly homologous proteins with similar biological functions, we employ PSSM to capture the evolutionary information of protein sequences. Here, we obtain the PSSM with the PSI-BLAST search tool provided online by the National Center for Biotechnology Information, using three iterations and an E-value cutoff of 0.001 for the query sequence of the protein P against multiple sequence alignment. The protein sequence P is then represented as the matrix shown in Equation (3).
$$P_{PSSM} = \begin{bmatrix} P(1,1) & P(1,2) & \cdots & P(1,20) \\ P(2,1) & P(2,2) & \cdots & P(2,20) \\ \vdots & \vdots & \ddots & \vdots \\ P(L,1) & P(L,2) & \cdots & P(L,20) \end{bmatrix} \tag{3}$$
where P(i,j) is the score for the amino acid at position i of the sequence being substituted by the type-j amino acid [23], i = 1,2, . . . , L; j = 1,2, . . . , 20. Here, the numerical codes from 1 to 20 denote the 20 native amino acid types, ordered alphabetically by their single-character codes. The L × 20 PSSM matrices differ in size for proteins with different sequence lengths L, and so cannot be processed directly by general machine learning methods. To obtain a uniform dimension, we define a new matrix $M = P_{PSSM}^T \cdot P_{PSSM}$, which is a symmetric matrix containing 20 × 20 = 400 elements [24,25]. Because M is symmetric, only the 20 × 21/2 = 210 elements on and above (or, equivalently, on and below) its diagonal are needed, as in Equation (4).
Then the general protein sample P can be formulated as:
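The length-independent PSSM feature can be sketched in a few lines of Python with NumPy. The function name `pssm_to_vector` and the row-major ordering of the upper-triangular elements are assumptions for illustration; the paper's own ordering of the 210 elements may differ.

```python
import numpy as np


def pssm_to_vector(pssm):
    """Map an L x 20 PSSM to a 210-D vector via M = P^T P (upper triangle of M)."""
    pssm = np.asarray(pssm, dtype=float)
    if pssm.ndim != 2 or pssm.shape[1] != 20:
        raise ValueError("expected an L x 20 matrix")
    m = pssm.T @ pssm                 # 20 x 20 symmetric matrix, independent of L
    rows, cols = np.triu_indices(20)  # indices on and above the diagonal (210 pairs)
    return m[rows, cols]
```

Two proteins of different lengths, for example 7 and 50 residues, both map to vectors of length 210, so they can be compared directly by a classifier.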

Two Fusion Representations, DipPSSM and PseAAPSSM, and the Optimization Algorithm
In this section, two fusion representations are introduced and then genetic algorithm is used to seek out the optimal weight coefficients in the fusing process.

Two Fusion Representations DipPSSM and PseAAPSSM
Although both DipC and PseAAC contain information on the amino acid composition and the sequence order, they reflect different essential features of protein samples. PSSM, on the other hand, represents a protein's evolutionary information, which DipC and PseAAC do not possess. We therefore combine PSSM with DipC and PseAAC to form two new representations, called DipPSSM and PseAAPSSM, respectively. Both DipPSSM and PseAAPSSM contain much more protein information than their component representations. Specifically, DipPSSM includes amino acid composition information, sequence order information and evolutionary information. PseAAPSSM contains amino acid composition information, sequence order information, the chemical and physical properties of the amino acids and evolutionary information. We now describe in detail how the fusion representations DipPSSM and PseAAPSSM are generated. Suppose that we have a dataset of N proteins belonging to n sub-nuclear locations. First, we transform the protein sequences of the i-th sub-nuclear location into two representations $A_i$ and $B_i$, i = 1,2, . . . , n, where $A_i$ means DipC or PseAAC and $B_i$ means PSSM. $A_i$ and $B_i$ contain different context information, leading to their different effects on protein sub-nuclear localization. Denote A and B as follows [7,15,24].
Then, we employ the weight coefficient vector R to balance the two representations, which is an important idea for combining representations. R can be written as follows:
$$R = \{r_1, r_2, r_3, \cdots, r_{n-1}, r_n\} \tag{8}$$
where $r_i \in (0,1)$ (i = 1,2, . . . , n) represent the relative importance of the two representations in each sub-nuclear location, and are here also called the balance factors. We present the final form of the fusion representation in Equation (9).
In much of the current literature, the components of a fusion representation are considered equally important, which is a special case of Equation (9) with $r_i$ = 0.5 (i = 1,2, . . . , n). Since the fused representation in Equation (9) weights the characteristics of the two single representations, it contains more protein sequence information and reflects the degree of influence of each single representation. Note that the balance factors for different sub-nuclear locations are not all the same. Moreover, since the sub-nuclear locations form an organic whole in the cell nucleus and sub-nuclear proteins interact with each other, it is reasonable to assume that the n balance factors $r_i$ (i = 1,2, . . . , n) are correlated with each other. Selecting an optimal value of R is therefore a complex task. In the next subsection, we discuss how to choose a proper value of R.
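The role of the balance factors can be illustrated with a short Python sketch. It assumes that the fusion is a weighted concatenation, with $A_i$ scaled by $r_i$ and $B_i$ scaled by $1 - r_i$; this specific form, and the names `fuse` and `fuse_dataset`, are assumptions for illustration and may differ from Equation (9).

```python
import numpy as np


def fuse(a, b, r):
    """Assumed fusion: concatenate r * a with (1 - r) * b, for r in (0, 1)."""
    if not 0.0 < r < 1.0:
        raise ValueError("balance factor must lie in (0, 1)")
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.concatenate([r * a, (1.0 - r) * b])


def fuse_dataset(A, B, labels, R):
    """Fuse row-wise, using the balance factor of each sample's location.

    A: (N, dA) array (DipC or PseAAC), B: (N, dB) array (PSSM),
    labels: location index 0..n-1 per sample, R: n balance factors.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    r = np.asarray(R, dtype=float)[np.asarray(labels)]  # one factor per sample
    return np.hstack([r[:, None] * A, (1.0 - r)[:, None] * B])
```

With r = 0.5 both components receive the same weight, which corresponds to the equal-importance special case mentioned above.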

Genetic Algorithm-The Optimization Algorithm
A genetic algorithm is an adaptive method that imitates the evolution of biological organisms in nature and can be used to solve search and optimization problems [26], especially combinatorial optimization problems with high computational complexity that traditional methods cannot cope with [27]. In this paper, we employ a genetic algorithm to seek the balance factors $r_i$ (i = 1,2, . . . , n) of the proposed representations. The procedure is as follows.
The first, and generally the most difficult, step of a genetic algorithm is to create an initial population, which is a pre-determined number of individuals, each encoded so as to map a problem solution into a genetic string, or chromosome [28]. All individuals share the same structure, determined by the coding method and principle, which carries the genetic information of the population. The second step is to conduct selection, crossover, mutation and replacement according to the fitness error, under the constraints of the population. The final step is to stop iterating when the stopping criterion is met.
In this paper, we put forward an initial-population selection strategy to greedily produce the initial population. Its detailed process is as follows.
(1) Generate a random permutation of the integers from 1 to n (n is the number of sub-nuclear locations), which is the tuning order of the balance vector R.
(2) Set 0.5 as the initial value for all elements in R.
(3) For each $r_i$, search from 0 to 1 in steps of 0.01 to find the value giving the highest prediction accuracy.
(4) Repeat step (3) for all the elements of R according to the order in step (1).
(5) Repeat steps (1)-(4) 50 times to get 50 balance vectors R. We save these balance vectors as the initial population.
Note that in Step (5), because the genetic algorithm is unstable, we run the procedure multiple times and select the best solution as the final balance factors. Specifically, we repeat it 50 times to generate the initial population. In theory, the greater the number of repetitions, the better the result becomes. In practice, the results tend to be stable when the number of repetitions exceeds 50. Therefore, we choose 50 as a reasonable number given the cost of computation. After creating the initial population, we calculate the balance factors by minimizing the fitness error for predicting the sub-nuclear localization. We implement the computation in MATLAB to work out the balance factors $r_i$ (i = 1,2, . . . , n) delivered by the minimum fitness error.
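Steps (1)-(5) of the initial-population strategy can be sketched in Python as a coordinate-wise grid search. The callable `evaluate`, which should return the prediction accuracy obtained with a given balance vector, is a hypothetical placeholder supplied by the user, and the function names are assumptions; the subsequent selection, crossover and mutation stages of the genetic algorithm are not shown.

```python
import numpy as np


def greedy_balance_vector(evaluate, n, rng, step=0.01):
    """One pass of steps (1)-(4): tune each r_i in a random order."""
    grid = np.linspace(0.0, 1.0, int(round(1.0 / step)) + 1)
    order = rng.permutation(n)   # step (1): random tuning order
    R = np.full(n, 0.5)          # step (2): start every factor at 0.5
    for i in order:              # steps (3)-(4): grid search on one factor at a time
        scores = []
        for value in grid:
            trial = R.copy()
            trial[i] = value
            scores.append(evaluate(trial))
        R[i] = grid[int(np.argmax(scores))]
    return R


def initial_population(evaluate, n, size=50, seed=0):
    """Step (5): repeat the greedy pass `size` times to build the population."""
    rng = np.random.default_rng(seed)
    return np.array([greedy_balance_vector(evaluate, n, rng) for _ in range(size)])
```

Each call to `evaluate` would typically run a full cross-validated KNN prediction, so one greedy pass costs about 101 × n evaluations; this is consistent with the high computational cost noted for the method.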

Dimension Reduction Method and Classifier Algorithm
In this section, we first introduce the dimension reduction algorithm and then describe the KNN classifier and cross-validation methods.

Linear Discriminant Analysis (LDA)
It is well known that high-dimensional data not only increase the complexity of a classifier, but also increase the risk of overfitting [12]. The increase in information and dimensionality of our proposed fusion representations also leads to an increase in noise [29]. Moreover, each representation has an intrinsic dimensionality for classification that is often much lower than the dimensionality of the observation vector. Hence, the dimensionality reduction algorithm linear discriminant analysis (LDA) [22,30] is employed in this work. LDA is a well-known supervised method in pattern recognition, with applications such as speech recognition, face recognition and protein classification. A concise description of LDA is given below.
Assume that dataset X contains N proteins and is the union of C subsets, i.e., $X = \bigcup_{i=1}^{C} X_i$, where the subset $X_i$ of class i contains $N_i$ proteins $x^i_j$ (j = 1, 2, . . . , $N_i$). To obtain the optimal solution of LDA, we maximize the criterion J(W) in Equation (10) and find the projection matrix W*. We can realize the ideal linear projection with the projection matrix W*.
$$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|} \tag{10}$$
where $S_W$ and $S_B$ denote the within-class scatter matrix and the between-class scatter matrix, respectively, which are formulated as follows.
$$S_W = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (x^i_j - \mu_i)(x^i_j - \mu_i)^T \tag{11}$$
$$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T \tag{12}$$
where $\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x^i_j$ is the class mean vector and $\mu = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i$ is the total mean vector. As it is not the focus of this paper, we omit the derivation and calculation of the matrix W*. According to [31,32], for multi-class pattern classification, such as a C-class problem, the orthonormal columns $w_k$ of W* must satisfy Equation (13), which is a generalized eigenvalue problem.
$$S_B w_k = \lambda_k S_W w_k, \quad k = 1, 2, \cdots, C-1 \tag{13}$$
Hence, the eigenvectors of $S_W^{-1} S_B$ corresponding to the largest C−1 eigenvalues are the columns of the optimal projection matrix W*, on the condition that $S_W$ is nonsingular.
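The computation of the projection matrix from the scatter matrices can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used in the experiments; the function name `lda_projection` is an assumption, and the pseudo-inverse is used in place of the inverse of the within-class scatter matrix so that the sketch also runs when that matrix is singular.

```python
import numpy as np


def lda_projection(X, y, n_components=None):
    """Return a d x (C-1) projection matrix from the two scatter matrices."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)                        # total mean vector
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                 # class mean vector
        centered = Xc - mu_c
        S_W += centered.T @ centered           # within-class scatter
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * (diff @ diff.T)       # between-class scatter
    if n_components is None:
        n_components = len(classes) - 1
    # Assumption: pinv instead of inv, to tolerate a singular S_W.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real
```

A dataset `X` of shape (N, d) is then reduced with `Z = X @ lda_projection(X, y)`, which gives at most C−1 dimensions, matching the eight and nine dimensions used for the nine-class and ten-class datasets.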

k-Nearest Neighbors Algorithm
For the protein sub-nuclear localization and classification problem, one classic and simple method is k-nearest neighbors (KNN). The KNN classifier predicts each unlabeled sample by the majority label of its nearest neighbors in the training set [33]. Despite its simplicity, KNN often yields competitive results, and in this paper, when combined with the dimension reduction algorithm, it significantly improves the classification accuracy [23]. Before applying the KNN classifier for protein sub-nuclear localization, we transform each protein sequence into a vector of fixed dimension. Then we classify each sequence according to the class memberships of its k nearest neighbors [34,35]. The cosine similarity, $\cos(u, v) = \frac{u \cdot v}{\|u\| \times \|v\|}$, is chosen to measure the closeness of two proteins u and v, where $\|\cdot\|$ is the vector norm. The value of cos(u, v) lies in [−1, 1]; the closer it is to 1, the closer u and v are to each other.
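A KNN classifier with the cosine measure can be sketched as follows. The function name `knn_predict_cosine` and the tie-breaking rule (the tied label that appears first among the nearest neighbors wins) are assumptions for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict_cosine(X_train, y_train, X_test, k=4):
    """Predict labels by majority vote among the k most cosine-similar training samples."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)

    def normalize(M):
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        return M / np.where(norms == 0, 1.0, norms)   # leave all-zero rows unchanged

    # Entry (t, s) is cos(test sample t, training sample s).
    sims = normalize(X_test) @ normalize(X_train).T
    neighbors = np.argsort(-sims, axis=1)[:, :k]      # most similar first
    preds = [Counter(y_train[row].tolist()).most_common(1)[0][0] for row in neighbors]
    return np.array(preds)
```

The default k = 4 mirrors the neighborhood size used in the dimension-reduction experiments; in practice k would be chosen from 1 to 10 by cross-validation.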

Cross-Validation Methods
Traditionally, in the context of statistical prediction and classification, cross-validation is used to estimate the performance of the final classifier or predictor. The independent dataset test, the jackknife test and K-fold cross-validation are three popular cross-validation methods [35]. K-fold cross-validation gives an approximately unbiased estimate of the prediction error even in complicated situations [36]. Thus K-fold cross-validation is employed in this paper to examine the anticipated performance of the KNN classifier, where K is a positive integer satisfying K ≤ N and N denotes the size of the benchmark dataset. The case K = N is identical to the leave-one-out or jackknife test. The jackknife test can deliver high variance because the N training sets are very similar to one another [37]. Moreover, its computational cost is high, requiring N iterations of the learning approach. Usually, 10-fold cross-validation is a preferred route for pursuing a good trade-off: the benchmark dataset is randomly partitioned into ten equal-size subsets, each of which preserves the original class proportions. In each run, one subset is used for testing and the remaining subsets are used for training, so that each subset is used as the test set exactly once over the ten runs. To obtain a reliable result, we repeat the experiment 50 times and calculate the average of the test accuracies. In addition, since the jackknife test is objective and less arbitrary, in that it always yields a unique result for a given dataset, it has been adopted to estimate the performance of predictors [38]. It is therefore also considered in Section 5.2.4 to compare the overall success rates of predictors.
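The stratified 10-fold procedure can be sketched in Python as follows. The function names and the round-robin assignment of each class to the folds are assumptions for illustration; `predict` is a hypothetical callable that takes training data, training labels and test data, and returns predicted labels.

```python
import numpy as np


def stratified_kfold_indices(y, n_folds=10, seed=0):
    """Split sample indices into folds that keep the class proportions."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for c in np.unique(y):
        shuffled = rng.permutation(np.flatnonzero(y == c))
        for pos, sample in enumerate(shuffled):
            folds[pos % n_folds].append(int(sample))   # deal the class out round-robin
    return [np.array(sorted(f), dtype=int) for f in folds]


def cross_validate(X, y, predict, n_folds=10, seed=0):
    """Overall success rate: each fold serves as the test set exactly once."""
    X = np.asarray(X)
    y = np.asarray(y)
    correct = 0
    for test_idx in stratified_kfold_indices(y, n_folds, seed):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        preds = predict(X[train_mask], y[train_mask], X[test_idx])
        correct += int(np.sum(np.asarray(preds) == y[test_idx]))
    return correct / len(y)
```

Averaging `cross_validate` over 50 different values of `seed` would correspond to the repeated experiments described above.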

Numerical Results
In this section, we introduce the two sub-nuclear location datasets and then give the numerical results and analysis.

Description of Datasets and Experimental Procedure
In order to validate the efficiency of the proposed method, two public datasets are adopted in this paper. One is Nuc-PLoc [7], constructed in 2007 by Shen and Chou, which contains 714 proteins located at nine sub-nuclear locations, listed in Table 2. The other is SubNucPred [17], constructed by Ravindra Kumar et al. in 2014, which contains proteins from ten sub-nuclear locations and is listed in detail in Table 3.

Table 2. Protein benchmark Dataset 1 of nine sub-nuclear locations.

Code  Sub-nuclear location    Number
1     …                       …
2     …                       …
3     Nuclear envelope        61
4     Nuclear matrix          29
5     Nuclear pore complex    79
6     Nuclear speckle         67
7     Nucleolus               307
8     Nucleoplasm             37
9     Nuclear PML body        13
      Overall                 714

Table 3. Protein benchmark Dataset 2 of ten sub-nuclear locations.

Code  Sub-nuclear location    Number
1     Centromere              86
2     Chromosome              113
3     Nuclear speckle         50
4     …                       …

(1) Represent the protein sequences using DipC, PseAAC, and PSSM. To provide an intuitive view, these processes are shown in Figure 1.

Feature Fusion Representations
A comparison of fusion and single representations: In this subsection, we compare our proposed representations PseAAPSSM and DipPSSM with their single components on the prediction success rates of protein sub-nuclear locations. Tables 4 and 5 show the experimental results for every sub-nuclear location on Datasets 1 and 2, respectively. Note that we take the average of fifty random success rates under 10-fold cross validation as the prediction success rate (SR), where the neighborhood size k of KNN is the one giving the highest overall success rate for k from 1 to 10. The success rate and the overall success rate are calculated by Equations (15) and (16), respectively.
$$\text{success rate}(i) = T(i)/N(i) \quad (i = 1, 2, \cdots, n) \tag{15}$$
$$\text{overall success rate} = \sum_{i=1}^{n} T(i) \Big/ \sum_{i=1}^{n} N(i) \tag{16}$$
where T(i) is the number of correctly predicted proteins belonging to location i and N(i) is the total number of proteins at location i. Note that the success rate here can also be understood as the sensitivity defined in much of the literature, which is discussed in Section 5.2.3. For the two proposed fusion representations DipPSSM and PseAAPSSM, the optimal balance factor vector R is also listed in the tables. According to Tables 4 and 5, it is clear that our proposed fusion representations consistently outperform the single representations.
Table 4. Prediction success rate (SR) and the optimal R of Dataset 1 for protein sub-nuclear localization by 10-fold cross validation with various representations.
Figure 2 describes the success rate curves on Dataset 1 of DipPSSM and PseAAPSSM, where each subplot corresponds to a sub-nuclear location. For each subplot, the horizontal axis represents a balance factor $r_i$ and the vertical axis is the prediction success rate. Note that in each subplot, when $r_i$ varies from 0 to 1 with step 0.1, the remaining n−1 balance factors are fixed at the values in Table 4.
The numerical experiment shown in Figure 3 is the same as that in Figure 2, except that it uses Dataset 2. From Figures 2 and 3 it is clear that the parameters $r_i$ (i = 1, 2, . . . , n) have a significant influence on protein sub-nuclear localization. In particular, Figure 2 shows that the success rates jump sharply when $r_i$ is around 0.9 in each subplot (i = 1, 2, . . . , n), probably suggesting that, for Dataset 1, dipeptide composition or pseudo-amino acid composition is more important than the position specific scoring matrix in the fusion representations.

Dimensionality Reduction
3D visualization: In this subsection, we employ LDA to present visualization results. We give 3D scatter plots of DipPSSM and PseAAPSSM for both datasets, so as to observe the data distribution in three-dimensional space after dimension reduction by LDA. Figures 4 and 5 show the results for Datasets 1 and 2, respectively, where the three axes represent the first three components of LDA, corresponding to the three largest eigenvalues.
In Figure 4, we use nine colors, coded from 1 to 9 according to Table 2, to represent the proteins of the nine sub-nuclear locations of Dataset 1. In Figure 5, we use ten colors, coded from 1 to 10 according to Table 3, to represent the proteins of the ten sub-nuclear locations of Dataset 2. In Figures 4b and 5b, some data points can hardly be distinguished at those scales. Therefore, we provide a high-resolution patch in Figures 4c and 5c for those data points. These results suggest that LDA can improve the classification performance by separating the data points from different classes.
Parameter effects: With 10-fold cross-validation, Figure 6 shows the overall success rates against the number of dimensions retained by LDA from DipPSSM and PseAAPSSM, respectively, where the neighborhood size k is set to 4, a choice that performs well among the values 1 to 10. From Figure 6, we can see that most of the information in the original high-dimensional protein data can be summarized by a low-dimensional structure, suggesting the efficiency of LDA for protein sub-nuclear localization. Figure 7 further compares the success rates on the reduced data and the original data when the neighborhood size k changes from 1 to 10. It is easily seen from Figure 7 that, for each fixed k, both DipPSSM with LDA and PseAAPSSM with LDA significantly improve the success rate of sub-nuclear location prediction compared with DipPSSM and PseAAPSSM. Interestingly, Figures 4-7 show that, for both datasets, the reduction effect on DipPSSM seems a little better than that on PseAAPSSM.

Analysis of numerical Results
From another perspective, it is indicated in the current literature that the following indexes (Equations (17)-(20)) are often used to evaluate the performance of a predictor. We calculate these indexes of 10-fold cross validation to compare different representations together with dimension reduction method.
$$SE(i) = \frac{TP(i)}{TP(i) + FN(i)} \quad (i = 1, 2, \cdots, n) \tag{17}$$
$$SP(i) = \frac{TN(i)}{TN(i) + FP(i)} \quad (i = 1, 2, \cdots, n) \tag{18}$$
$$ACC(i) = \frac{TP(i) + TN(i)}{TP(i) + FP(i) + TN(i) + FN(i)} \quad (i = 1, 2, \cdots, n) \tag{19}$$
$$MCC(i) = \frac{TP(i) \times TN(i) - FP(i) \times FN(i)}{\sqrt{(TP(i) + FP(i)) \times (TP(i) + FN(i)) \times (TN(i) + FP(i)) \times (TN(i) + FN(i))}} \quad (i = 1, 2, \cdots, n) \tag{20}$$
In these equations, TP (true positive) and TN (true negative) are the numbers of proteins that were correctly located, while FP (false positive) and FN (false negative) are the numbers of proteins that were wrongly located. SE (sensitivity) denotes the rate of positive samples correctly located, and its value is equal to the success rate in Equation (15). SP (specificity) denotes the rate of negative samples correctly located. ACC (accuracy) is the rate of correctly located samples. MCC is the Matthews correlation coefficient, which takes a value in [−1, 1] that reflects the quality of the prediction: 1 denotes a perfect prediction, 0 a random prediction and −1 a bad prediction. Although no single number can perfectly summarize the confusion matrix of true and false positives and negatives, MCC is generally regarded as one of the best such measures [39].
Table 6 gives the values of the four indexes in Equations (17)-(20) for the nine sub-nuclear locations in Dataset 1 using the three single representations PseAAC, DipC and PSSM, the two fusion representations DipPSSM and PseAAPSSM, and their combination with the dimension reduction method LDA, where both PseAAPSSM and DipPSSM are reduced to eight dimensions. Table 7 uses the same experimental design as Table 6 but on Dataset 2, where PseAAPSSM and DipPSSM are reduced to nine dimensions. From these results, we come to the following conclusions. The sensitivity (SE), specificity (SP), accuracy (ACC) and MCC obtained with the fusion representations are better than those obtained with the single representations in most locations.
Furthermore, the fusion representations with the LDA treatment outperform those without. Note that, due to the randomness of 10-fold cross validation, the values of the four indexes SE, SP, ACC and MCC vary slightly from run to run. This is also why the sensitivity and the success rate of each sub-nuclear location differ between Tables 4 and 6, and between Tables 5 and 7, although in theory Equations (15) and (17) should produce the same value.
Table 8 compares the overall success rates on Dataset 1 of our protein sub-nuclear localization methods and the Nuc-PLoc predictor [7] with the jackknife test. For each sub-nuclear location of Dataset 1, Figure 8 compares the Matthews correlation coefficient (MCC) indexes [7] of our methods and of Nuc-PLoc. From Table 8 and Figure 8, it is clear that the success rates of our protein sub-nuclear localization predictors are much higher than those of Nuc-PLoc.
Next, we present another comparison, of our methods with the SubNucPred method [17] on Dataset 2. The four indexes of sensitivity (SE), specificity (SP), accuracy (ACC) and MCC for each sub-nuclear location are calculated and shown in Figure 9, where 10-fold cross validation was used. It can be seen from Figure 9 that our methods, DipPSSM with LDA and PseAAPSSM with LDA, outperform SubNucPred.
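The four per-location indexes in Equations (17)-(20) can be computed in a one-versus-rest fashion, as in the following Python sketch. The function name `per_location_indexes` and the convention of returning 0 when a denominator is zero are assumptions for illustration.

```python
import numpy as np


def per_location_indexes(y_true, y_pred, location):
    """SE, SP, ACC and MCC for one location, treating all other locations as negative."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos_true = y_true == location
    pos_pred = y_pred == location
    tp = int(np.sum(pos_true & pos_pred))
    tn = int(np.sum(~pos_true & ~pos_pred))
    fp = int(np.sum(~pos_true & pos_pred))
    fn = int(np.sum(pos_true & ~pos_pred))

    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = float(np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}
```

For instance, with true labels [0, 0, 0, 1, 1, 2] and predictions [0, 0, 1, 1, 1, 2], location 0 has TP = 2, FN = 1, FP = 0 and TN = 3, which gives SE = 2/3, SP = 1, ACC = 5/6 and MCC = 6/√72 ≈ 0.707.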

Conclusions
Following the completion of the Human Genome Project, bioscience has stepped into the era of the genome and proteome [40][41][42][43][44]. A large number of computational methods have been presented to deal with prediction tasks in bioscience [45][46][47][48]. The nucleus is highly organized and is the largest organelle in eukaryotic cells. Hence, determining protein sub-nuclear localization is important for understanding the biological functions of the nucleus. Many current studies discuss protein sub-nuclear localization prediction [49,50]. This paper proposes a different route to identify protein sub-nuclear localization by first developing two fusion representations, DipPSSM and PseAAPSSM. We then conduct experiments based on 10-fold cross validation on two datasets to verify the superiority of the proposed representations and their applicability to predicting protein sub-nuclear localization. From the present study, we conclude that our fusion representations can greatly improve the success rate in predicting protein sub-nuclear localization, which indicates that the fusion representations reflect more of the overall sequence pattern of a protein than a single representation does.
However, choosing proper balance factors when constructing the fusion representations is difficult. In this paper, we use a genetic algorithm to produce approximately optimal values of the weight coefficients (balance factors), running the genetic algorithm multiple times to compute the average weight coefficients that give rise to the ideal performance. The time complexity of this method is high, so in future research we will try other search methods for obtaining the weight coefficients.
Because our proposed fusion representations have high dimensionality, which might have negative effects on KNN prediction, we employ LDA to process the representations before using the KNN classifier to predict protein locations. Note that, in current pattern recognition research, many other useful dimension reduction methods, such as kernel discriminant analysis and fuzzy LDA, have emerged. How to effectively use these methods, their improved variants, or other more suitable dimension reduction methods in the sub-nuclear localization field is still an open problem. In addition, it remains an interesting challenge to obtain better representations for protein sub-nuclear localization and to study other machine learning classification algorithms.