
Int. J. Mol. Sci. 2015, 16(12), 30343-30361; https://doi.org/10.3390/ijms161226237

Article
Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA
School of Information Science and Engineering, Yunnan University, Kunming 650504, China
* Author to whom correspondence should be addressed.
Academic Editor: Mark L. Richter
Received: 9 October 2015 / Accepted: 11 December 2015 / Published: 19 December 2015

Abstract

An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position specific scoring matrix (PSSM), are insufficient to represent a protein sequence because each captures only a single perspective. Thus, this paper proposes two fusion feature representations, DipPSSM and PseAAPSSM, which integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce balance factors to weight the importance of its components. The optimal values of the balance factors are sought by a genetic algorithm. Because the proposed representations are high-dimensional, linear discriminant analysis (LDA) is used to find their important low-dimensional structure, which is essential for classification and location prediction. Numerical experiments on two public datasets with a KNN classifier and cross-validation tests show that, in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusion representations outperform the traditional representations in protein sub-nuclear localization, and the representation treated by LDA outperforms the untreated one.
Keywords:
protein sub-nuclear localization; DipPSSM; PseAAPSSM; linear discriminant analysis; KNN classifier

1. Introduction

It is well known that if proteins are wrongly located or accumulate in large amounts in improper parts of the nucleus, genetic diseases, and even cancer, may be caused [1]. Thus, nuclear proteins play a very important role in research on disease prevention and clinical medicine, where correct protein sub-nuclear localization is essential. Researchers in the past two decades have made great progress in the study of protein representation methods and sub-cellular localization prediction [2]. Since the nucleus is the largest cell organelle and guides the life processes of the cell, researchers have focused on finding the location(s) of a query protein in the nucleus so as to explore its function. The traditional approach is to conduct a series of biological experiments at the cost of much time and money [3]. However, with the large number of protein sequences that have been generated, the task requires faster localization methods. An attractive route in recent studies is to utilize machine learning for protein sub-nuclear localization [4].
The core problem of protein sub-nuclear localization using machine learning has two aspects: constructing good representations that collect as much protein sequence information as possible, and developing effective models for prediction. Some good representations that provide abundant discriminative information for improving prediction accuracy have been reported. Nakashima and Nishikawa propose the well-known representation, amino acid composition (AAC) [5], which describes the occurrence frequency of the 20 kinds of essential amino acids in the protein sequence. However, AAC discards much of the information in the protein sequence, notably the order of the residues. Dipeptide composition (DipC) was then presented, which considers the amino acid composition together with the local order of amino acids [6]. Subsequently, taking into account both sequence order and length information, Chou et al. introduce pseudo-amino acid composition (PseAAC) [7,8,9,10,11]. Besides, the position-specific scoring matrix (PSSM) was proposed to capture the evolutionary information that is helpful for protein sub-nuclear localization [12]. Many other representation approaches can be found in [13,14].
After obtaining a good representation, researchers need to develop models for predicting protein sub-nuclear localization. Shen and Chou [15] utilize an optimized evidence-theoretic k-nearest neighbor classifier based on PseAAC to predict protein sub-nuclear locations. Mundra et al. report a multi-class support vector machine based classifier employing AAC, DipC, PseAAC and PSSM [16]. Kumar et al. describe a method, called SubNucPred, that combines the presence or absence of unique Pfam domains with an amino acid composition based SVM model [17]. Jiang et al. [18] report an ensemble classification method for sub-nuclear locations on the datasets in [19,20] using decision stumps, the fuzzy k-nearest neighbors algorithm and radial basis SVMs.
However, current works have two drawbacks: the lack of a representation with sufficient information, and no consideration of the relationship between the representation and the prediction model. A single representation, built from one point of view, is insufficient for expressing a protein sequence, which can lead to poor performance in protein sub-nuclear localization. Representations with more information from multiple aspects are worth studying for improving prediction accuracy. On the other hand, simplicity is also an important principle in machine learning. A compact representation can yield a preferred prediction model [21]. Therefore, this paper first proposes two fusion representations, each combining two single representations, and then uses the dimension reduction method of linear discriminant analysis (LDA) to arrive at an optimal expression for the k-nearest neighbors classifier (KNN). In the first step, we take into account both DipC and PSSM to form a new representation, dubbed DipPSSM, and consider both PseAAC and PSSM to construct another representation, called PseAAPSSM. In this way, the two proposed representations contain more protein sequence information and can describe protein data more fully. However, it is difficult to reach a suitable trade-off between DipC and PSSM in DipPSSM and between PseAAC and PSSM in PseAAPSSM, so we adopt a genetic algorithm to determine a set of balance factors that solves this problem.
Table 1. The corresponding relationship between abbreviation and full name.

| Code | The Full Name | Abbreviation |
|---|---|---|
| 1 | Dipeptide composition | DipC |
| 2 | Pseudo-amino acid composition | PseAAC |
| 3 | Position specific scoring matrix | PSSM |
| 4 | The proposed representation by fusing DipC and PSSM | DipPSSM |
| 5 | The proposed representation by fusing PseAAC and PSSM | PseAAPSSM |
| 6 | Linear discriminant analysis | LDA |
| 7 | k-nearest neighbors | KNN |
| 8 | True positive | TP |
| 9 | True negative | TN |
| 10 | False positive | FP |
| 11 | False negative | FN |
| 12 | Matthews correlation coefficient | MCC |
In Section 2, we review three single representations, DipC, PseAAC and PSSM. In Section 3, we propose two fusion representations and use a genetic algorithm to obtain their balance factors. In Section 4, we apply LDA to the proposed representations, followed by the KNN classification algorithm. In Section 5, experiments with two benchmark datasets are performed. Section 6 gives the concluding remarks. For the convenience of readers, all abbreviations used in this paper are listed in Table 1.

2. Related Work

In this section, three single representations, DipC, PseAAC and PSSM, are introduced to prepare for our proposed fusion representations.

2.1. Dipeptide Composition (DipC)

DipC reflects the amino acid composition and the local order of the amino acids in the sequence. It records the occurrence frequencies of pairs of consecutive residues in the primary sequence over the 400 possible ordered pairs of amino acids, and hence forms a 400D feature vector [6]. In this work, we add 20 elements, representing the frequencies of the 20 kinds of amino acids in the protein sequence, to the DipC vector to better reflect the amino acid composition. The protein sequence is therefore expressed as a 420-dimensional vector, i.e., a point in 420D Euclidean space. We denote this feature representation of a protein sample as $P_{DipC}$, whose first 20 dimensions give the amino acid composition and whose last 400 dimensions give the dipeptide composition. For a protein P whose sequence length is L (i.e., P has L amino acids), we have
$$P_{DipC} = [p_1, p_2, \ldots, p_{20}, p_{21}, \ldots, p_{420}]^T, \quad p_i = \begin{cases} aa_i / L, & i = 1, 2, \ldots, 20 \\ cr_i / (L - 1), & i = 21, 22, \ldots, 420 \end{cases} \qquad (1)$$
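where $aa_i$ is the number of amino acids of type i in the sequence and $cr_i$ is the number of occurrences of the pair of consecutive residues (dipeptide) indexed by i; a sequence of length L contains L − 1 such pairs.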
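As an illustration of Equation (1), the following is a minimal Python sketch, not the authors' implementation. The function name `dipc_vector`, the alphabetical ordering of the amino acids and the resulting ordering of the 400 dipeptides are assumptions made for this example; residues outside the 20 standard letters are simply not counted.

```python
from itertools import product

# Assumption: the 20 standard residues in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: the 400 ordered pairs enumerated in the same alphabetical order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipc_vector(seq):
    """Return the 420-D vector of Equation (1) for one protein sequence.

    Elements 1-20 are amino acid frequencies (count / L); elements 21-420
    are dipeptide frequencies (count / (L - 1)).
    """
    length = len(seq)
    if length < 2:
        raise ValueError("sequence must contain at least two residues")

    aa_counts = {a: 0 for a in AMINO_ACIDS}
    for residue in seq:
        if residue in aa_counts:
            aa_counts[residue] += 1

    dip_counts = {d: 0 for d in DIPEPTIDES}
    for i in range(length - 1):
        pair = seq[i:i + 2]
        if pair in dip_counts:
            dip_counts[pair] += 1

    composition = [aa_counts[a] / length for a in AMINO_ACIDS]
    dipeptides = [dip_counts[d] / (length - 1) for d in DIPEPTIDES]
    return composition + dipeptides
```

For the toy sequence `"ACAC"`, the vector has 420 entries; the A and C frequencies are both 2/4, and the dipeptides AC and CA have frequencies 2/3 and 1/3.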

2.2. Pseudo-Amino Acid Composition (PseAAC)

PseAAC, put forward by Chou et al., represents a protein sequence by its sequence composition and order information in a single vector [7]. In PseAAC, the first 20 elements denote the frequencies of the 20 kinds of essential amino acids, and the remaining elements are sequence-order correlation factors computed from the hydrophobicity and hydrophilicity of the amino acids [15]. The general PseAAC is written as:
$$P_{PseAAC} = [p_1, \ldots, p_{20}, \ldots, p_{20+\lambda}, \ldots, p_{20+2\lambda}]^T \qquad (2)$$
In this paper, we transform each protein sequence into its PseAAC representation with the online tool provided by the Pattern Recognition and Bioinformatics Group of Shanghai Jiao Tong University. Note that we empirically set the parameter λ to 10 and obtain a 40D feature vector $P_{PseAAC}$ (20 + 2λ = 40 elements) for representing the protein sequence P.

2.3. Position Specific Scoring Matrix (PSSM)

Protein sequences undergo various changes during biological evolution, for instance, the insertion, substitution or deletion of one or several amino acid residues in the sequence [21]. With the long-term accumulation of these changes, the similarity between the original and the newly synthesized proteins gradually decreases, but these homologous proteins may still exhibit remarkably similar structures and functions [22]. As one sub-nuclear location may contain highly homologous proteins with similar biological functions, we employ PSSM to collect the evolutionary information of protein sequences. Here, we obtain the PSSM with the PSI-BLAST search tool provided online by the National Center for Biotechnology Information, using three iterations and an E-value cutoff of 0.001 for the multiple sequence alignment against the query sequence of the protein P. The protein sequence P is then represented as the matrix shown in Equation (3).
$$P_{PSSM} = \begin{bmatrix} p(1,1) & p(1,2) & \cdots & p(1,20) \\ p(2,1) & p(2,2) & \cdots & p(2,20) \\ \vdots & \vdots & \ddots & \vdots \\ p(L,1) & p(L,2) & \cdots & p(L,20) \end{bmatrix} \qquad (3)$$
where p(i,j) is the score for substituting the i-th amino acid of the sequence by an amino acid of type j [23], i = 1,2,…, L; j = 1,2,…, 20. Here, the numerical codes from 1 to 20 denote the 20 native amino acid types, ordered alphabetically by their single-character codes. The L × 20 PSSM matrices have different sizes for proteins with different sequence lengths L, so they cannot be processed directly by general machine learning methods. To obtain a uniform dimension, we define a new matrix $M = P_{PSSM}^T P_{PSSM}$, which is a symmetric matrix containing 20 × 20 = 400 elements [24,25]. Because M is symmetric, its lower triangle including the diagonal, 20 × 21 / 2 = 210 elements, carries all of its information. Writing m(i,j) for the (i,j)-th element of M, we number these elements row by row as in Equation (4).
$$\begin{bmatrix} m(1,1) & & & \\ m(2,1) & m(2,2) & & \\ \vdots & \vdots & \ddots & \\ m(20,1) & m(20,2) & \cdots & m(20,20) \end{bmatrix} \triangleq \begin{bmatrix} p_1 & & & \\ p_2 & p_3 & & \\ \vdots & \vdots & \ddots & \\ p_{191} & p_{192} & \cdots & p_{210} \end{bmatrix} \qquad (4)$$
Then the general protein sample P can be formulated as:
$$P_{PSSM} = [p_1, p_2, p_3, \ldots, p_{210}]^T \qquad (5)$$
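The reduction from an L × 20 PSSM to a fixed 210-D vector can be sketched as follows. This is an illustrative example under stated assumptions rather than the authors' code: the function name `pssm_to_vector` is hypothetical, NumPy is assumed to be available, and the input is assumed to be an already computed L × 20 score matrix (obtaining it from PSI-BLAST is outside the sketch).

```python
import numpy as np


def pssm_to_vector(pssm):
    """Map an L x 20 PSSM to the 210-D vector of Equations (4) and (5).

    M = P^T P is 20 x 20 and symmetric, so only its lower triangle
    (diagonal included) is kept, read row by row.
    """
    pssm = np.asarray(pssm, dtype=float)
    if pssm.ndim != 2 or pssm.shape[1] != 20:
        raise ValueError("expected an L x 20 matrix")
    m = pssm.T @ pssm                 # 20 x 20, independent of L
    rows, cols = np.tril_indices(20)  # row-major order: (0,0), (1,0), (1,1), ...
    return m[rows, cols]
```

Because the product $P_{PSSM}^T P_{PSSM}$ always has shape 20 × 20, proteins of different lengths map to vectors of the same length, which is the property the text relies on.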

3. Two Fusion Representations, DipPSSM and PseAAPSSM, and the Optimization Algorithm

In this section, two fusion representations are introduced and then genetic algorithm is used to seek out the optimal weight coefficients in the fusing process.

3.1. Two Fusion Representations DipPSSM and PseAAPSSM

Although both DipC and PseAAC contain information on the amino acid composition and the sequence order, they reflect different essential features of protein samples. PSSM, on the other hand, represents a protein's evolutionary information, which DipC and PseAAC do not possess. We therefore combine PSSM with DipC and with PseAAC to form two new representations, called DipPSSM and PseAAPSSM, respectively. Both DipPSSM and PseAAPSSM contain more protein information than their component representations. Specifically, DipPSSM includes amino acid composition information, amino acid sequence order information and evolutionary information. PseAAPSSM contains amino acid composition information, amino acid sequence order information, the chemical and physical properties of amino acids and evolutionary information.
We now describe in detail how the fusion representations DipPSSM and PseAAPSSM are generated. Suppose that we have a dataset of N proteins belonging to n sub-nuclear locations. First, we transform the protein sequences of the i-th sub-nuclear location into two representations $A_i$ and $B_i$, i = 1,2,…, n, where $A_i$ is DipC or PseAAC and $B_i$ is PSSM. $A_i$ and $B_i$ contain different information and therefore have different effects on protein sub-nuclear localization. Denote A and B as follows [7,15,24].
$$A = \{A_1, A_2, A_3, \ldots, A_{n-1}, A_n\} \qquad (6)$$
$$B = \{B_1, B_2, B_3, \ldots, B_{n-1}, B_n\} \qquad (7)$$
Then, we employ a vector of weight coefficients R to balance the two representations, which is an important idea for combining representations. R can be written as follows:
$$R = \{r_1, r_2, r_3, \ldots, r_{n-1}, r_n\} \qquad (8)$$
where $r_i \in (0,1)$ (i = 1,2,…, n), here called the balance factors, represent the relative importance of the two representations in each sub-nuclear location. We present the final form of the fusion representation in Equation (9).
$$V_i = [\, r_i A_i, \; (1 - r_i) B_i \,] \qquad (i = 1, 2, \ldots, n) \qquad (9)$$
In much of the current literature, the components of a fusion representation are considered equally important, which is the special case of Equation (9) with $r_i$ = 0.5 (i = 1,2,…, n). Because the fused representation in Equation (9) weights the two single representations, it contains more protein sequence information and reflects the degree of influence of each single representation. Note that the balance factors for different sub-nuclear locations are not all the same. Besides, since the sub-nuclear locations form an organic whole in the cell nucleus and the sub-nuclear proteins interact with each other, it is reasonable to assume that the n balance factors $r_i$ (i = 1,2,…, n) are correlated with each other. Selecting an optimal value of R is therefore a complex task. In the next subsection, we discuss how to choose a proper value of R.
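The weighted concatenation in Equation (9) can be illustrated with a short Python sketch. The function name `fuse` is hypothetical, and the sketch takes the balance factor r as given for one feature pair; how a balance factor is assigned to a query protein whose location is not yet known is not specified in the text and is not modeled here.

```python
def fuse(a, b, r):
    """Return the fused vector [r * a, (1 - r) * b] of Equation (9).

    a: first representation (e.g., DipC or PseAAC) as a list of floats
    b: second representation (e.g., the 210-D PSSM vector)
    r: balance factor; the closed interval [0, 1] is accepted here so that
       the end points of a grid search can also be evaluated (assumption)
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("balance factor must lie in [0, 1]")
    return [r * x for x in a] + [(1.0 - r) * y for y in b]
```

With r = 0.5 both components are scaled equally, which corresponds to the equal-importance special case mentioned above; the fused DipPSSM vector would have 420 + 210 = 630 entries and the fused PseAAPSSM vector 40 + 210 = 250 entries.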

3.2. Genetic Algorithm—The Optimization Algorithm

A genetic algorithm imitates the evolutionary process of biological organisms in nature. It is an adaptive method for solving search and optimization problems [26], especially combinatorial optimization problems with high computational complexity that traditional methods cannot cope with [27]. In this paper, we employ a genetic algorithm to find the balance factors $r_i$ (i = 1,2,…, n) of the proposed representations. The procedure is as follows.
The first and generally the most difficult step of the genetic algorithm is to create an initial population, which is a pre-determined number of individuals, each encoded to map a candidate solution into a genetic string, or chromosome [28]. In a genetic algorithm, all individuals share the same structure, determined by the coding method, which carries the genetic information of the individuals in the population. The second step is to conduct selection, crossover, mutation and replacement according to the fitness error, under the constraints on the population. The final step is to stop the iteration when the stopping criterion is met.
In this paper, we put forward an initial-population selection strategy to greedily produce initial population. Its detailed process is as follows.
(1) Generate a random permutation of the integers from 1 to n (n is the number of sub-nuclear locations), which gives the tuning order of the elements of the balance vector R.
(2) Set 0.5 as the initial value for all elements in R.
(3) For each ri, search from 0 to 1 in steps of 0.01 for the value giving the highest prediction accuracy.
(4) Repeat step (3) for all the elements of R according to the order in step (1).
(5) Repeat steps (1)–(4) 50 times to get 50 balance vectors R. We save these balance vectors as the initial population.
Note that in Step (5), because the genetic algorithm is not stable from run to run, we run the experiment multiple times and select the best solution as the final balance factors. Specifically, we repeat the procedure 50 times to generate the initial population. In theory, the greater the number of repetitions, the better the result. In practice, the results tend to be stable once the number of repetitions exceeds 50, so we choose 50 as a reasonable number given the cost of computation. After creating the initial population, we calculate the balance factors by minimizing the fitness error for predicting the sub-nuclear localization. We implement the computation in MATLAB to obtain the balance factors ri (i = 1,2,…, n) that give the minimum fitness error.
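Steps (1)–(5) of the initial-population strategy can be sketched in Python as a coordinate-wise grid search. This is a minimal illustration, not the authors' MATLAB code: the names `greedy_balance_vector` and `initial_population` are hypothetical, and `accuracy` stands for any user-supplied function that returns the prediction accuracy for a given balance vector (in the paper this would involve the KNN classifier and cross-validation). The subsequent selection, crossover and mutation stages of the genetic algorithm are not shown.

```python
import random


def greedy_balance_vector(n, accuracy, rng, step=0.01):
    """One pass of steps (1)-(4): tune each r_i in a random order."""
    order = list(range(n))
    rng.shuffle(order)                       # step (1): random tuning order
    R = [0.5] * n                            # step (2): initial values
    grid = [k * step for k in range(int(round(1.0 / step)) + 1)]
    for i in order:                          # step (4): visit every element
        best_r, best_acc = R[i], accuracy(R)
        for r in grid:                       # step (3): search from 0 to 1
            R[i] = r
            acc = accuracy(R)
            if acc > best_acc:
                best_r, best_acc = r, acc
        R[i] = best_r                        # keep the best value found
    return R


def initial_population(n, accuracy, size=50, seed=0):
    """Step (5): repeat the greedy pass to obtain `size` balance vectors."""
    rng = random.Random(seed)
    return [greedy_balance_vector(n, accuracy, rng) for _ in range(size)]
```

Each greedy pass evaluates `accuracy` about n × 101 times, so with an expensive cross-validated accuracy the 50 repetitions dominate the cost, which is consistent with the computational concern raised in the text.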

4. Dimension Reduction Method and Classifier Algorithm

In this section, we first introduce the dimension reduction algorithm and then describe the KNN classifier and cross-validation methods.

4.1. Linear Discriminant Analysis (LDA)

It is well known that high-dimensional data not only increases the complexity of a classifier but also increases its risk of overfitting [12]. The increase in information and dimensionality of our proposed fusion representations also leads to an increase in noise [29]. Moreover, each representation has an intrinsic dimensionality for classification that is often much lower than the dimensionality of the observation vector. Hence, the dimensionality reduction algorithm linear discriminant analysis (LDA) [22,30] is employed in this work. LDA is a well-known supervised method in pattern recognition tasks such as speech recognition, face recognition and protein classification. A concise description of LDA is given below.
Assume that dataset X contains N proteins and is a union of C subsets, i.e., $X = X_1 \cup X_2 \cup \cdots \cup X_C = \{x_1, x_2, \ldots, x_N\}$, where $X_i$ contains $N_i$ proteins $x_1^i, x_2^i, \ldots, x_{N_i}^i$, i = 1,2,…, C. Thus, $N = \sum_{i=1}^{C} N_i$. Suppose $X_i \cap X_j = \varnothing$ for $i, j = 1, 2, \ldots, C$, $i \neq j$. To obtain the optimal solution of LDA, we maximize the criterion J(W) in Equation (10) to find the projection matrix W*, which realizes the ideal linear projection.
$$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|}, \quad W^* = \arg\max_W J(W) \qquad (10)$$
where $S_W$ and $S_B$ denote the within-class scatter matrix and the between-class scatter matrix, respectively, which are defined as follows.
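$$S_W = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (x_j^i - \mu_i)(x_j^i - \mu_i)^T \qquad (11)$$
$$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad (12)$$
where $\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_j^i$ is the class mean vector and $\mu = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i$ is the total mean vector.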
To keep the focus of this paper, we do not describe the derivation and calculation of the matrix W* in detail. According to [31,32], for a multi-class pattern classification problem with C classes, the orthonormal columns $w_i$ of W* must satisfy Equation (13), which is a generalized eigenvalue problem.
$$S_B w_i = \lambda_i S_W w_i, \quad i = 1, 2, \ldots, C - 1 \qquad (13)$$
Hence, provided that $S_W$ is nonsingular, the eigenvectors of $S_W^{-1} S_B$ corresponding to the largest C − 1 eigenvalues are the columns of the optimal projection matrix W*.
Finally, we obtain the projection $Y = (y_1, y_2, \ldots, y_{C-1})$ through Equation (14):
$$Y = (W^*)^T X \qquad (14)$$
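Equations (11)–(14) can be illustrated with a compact NumPy sketch. It is a hedged example rather than the authors' implementation: the function name `lda_projection` is hypothetical, samples are assumed to be stored as rows (so the projection is computed as `X @ W` instead of $(W^*)^T X$), and $S_W$ is assumed to be nonsingular, as required in the text.

```python
import numpy as np


def lda_projection(X, y):
    """Return the d x (C - 1) projection matrix of Equations (11)-(13).

    X: (N, d) array with one sample per row
    y: (N,) array of class labels
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)                      # total mean vector
    S_w = np.zeros((d, d))                   # within-class scatter, Eq. (11)
    S_b = np.zeros((d, d))                   # between-class scatter, Eq. (12)
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)               # class mean vector
        centered = Xc - mu_c
        S_w += centered.T @ centered
        diff = (mu_c - mu).reshape(-1, 1)
        S_b += Xc.shape[0] * (diff @ diff.T)
    # Eigenvectors of S_w^{-1} S_b solve the generalized problem of Eq. (13).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][: len(classes) - 1]
    return eigvecs[:, order].real
```

For the fused representations, whose dimensionality (for example 630 for DipPSSM) is comparable to the number of proteins in the datasets, $S_W$ may be close to singular; a regularized or pseudo-inverse variant would then be needed, which this sketch does not cover.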

4.2. k-Nearest Neighbors (KNN) Algorithm and Cross-Validation Methods

4.2.1. k-Nearest Neighbors Algorithm

For the protein sub-nuclear localization and classification problem, one classic and simple method is k-nearest neighbors (KNN). The KNN classifier predicts each unlabeled sample by the majority label of its nearest neighbors in the training set [33]. Despite its simplicity, KNN often yields competitive results, and in this paper, when combined with the dimension reduction algorithm, it significantly improves the classification accuracy [23]. Before applying the KNN classifier to protein sub-nuclear localization, we transform each protein sequence into a vector of fixed dimension. Then we classify each sequence according to the class memberships of its k nearest neighbors [34,35]. The cosine measure, $\cos(u, v) = \frac{u \cdot v}{\|u\| \times \|v\|}$, is chosen to measure the closeness of two proteins u and v, where $\|\cdot\|$ denotes the Euclidean norm. The value of cos(u, v) lies in [–1,1]; the closer its absolute value is to 1, the closer u and v are to each other.
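A minimal pure-Python sketch of a KNN classifier with the cosine measure is given below. The names `cosine` and `knn_predict` are hypothetical. As a simplifying assumption, neighbors are ranked by the signed cosine value (largest first) rather than by its absolute value, and ties in the vote are resolved in favor of the label whose neighbor is ranked nearest.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,                        # most similar first
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

In the pipeline described in this paper, `train_X` and `query` would be the LDA-projected fusion vectors, and k would be chosen from 1 to 10 by cross-validation.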

4.2.2. Cross-Validation Methods

Traditionally, in the context of statistical prediction and classification, cross-validation is utilized to estimate the performance of the final classifier or predictor. The independent dataset test, the jackknife test, and K-fold cross-validation are three popular cross-validation methods [35]. K-fold cross-validation gives an approximately unbiased estimate of the prediction error even in much more complicated situations [36]. Thus K-fold cross-validation is employed in this paper to examine the anticipated performance of the KNN classifier, where K is a positive integer satisfying K ≤ N and N denotes the size of the benchmark dataset. The case K = N is identical to the leave-one-out or jackknife test. The jackknife test can deliver high variance because the N training sets are very similar to one another [37]. Moreover, its computational cost is high, as it requires N iterations of the learning approach. Usually, 10-fold cross-validation is preferred as a good trade-off: the benchmark dataset is randomly partitioned into ten equal-size subsets, each of which preserves the original class proportions. For each experiment, we carry out ten runs. In each run, one subset is used for testing and the remaining subsets are used for training, so that each subset is used as the testing set exactly once. To obtain a reliable result, we repeat the experiment 50 times and calculate the average of the test accuracies. In addition, the jackknife test is objective and less arbitrary because it always yields a unique result for a given dataset, and it has therefore been adopted to estimate the performance of predictors [38]; it is also considered in Section 5.2.4 to compare the overall success rate of predictors.
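The stratified 10-fold protocol described above can be sketched in pure Python. The names `stratified_folds` and `cross_validate` are hypothetical, and `fit_predict` stands for any user-supplied function that trains on the given training indices and returns predicted labels for the test indices; class labels are assumed to be sortable so that the fold assignment is reproducible.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the members round-robin, continuing where the last class stopped.
        for pos, idx in enumerate(members):
            folds[(offset + pos) % n_folds].append(idx)
        offset += len(members)
    return folds


def cross_validate(labels, fit_predict, n_folds=10, seed=0):
    """Return the overall accuracy when each fold is used once for testing."""
    folds = stratified_folds(labels, n_folds, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = fit_predict(train_idx, test_idx)
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(labels)
```

Repeating `cross_validate` with 50 different seeds and averaging the results would mirror the 50 repetitions mentioned in the text; setting `n_folds` equal to the number of samples gives the jackknife (leave-one-out) test.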

5. Numerical Results

In this section, we introduce the two sub-nuclear location datasets and then give the numerical results and analysis.

5.1. Description of Datasets and Experimental Procedure

In order to validate the efficiency of the proposed method, two public datasets are adopted in this paper. One is Nuc-PLoc [7], constructed in 2007 by Shen and Chou, which contains 714 proteins located at nine sub-nuclear locations, listed in Table 2. The other is SubNucPred [17], constructed by Ravindra Kumar et al. in 2014, which contains 669 proteins located at ten sub-nuclear locations, listed in detail in Table 3.
Table 2. Protein benchmark Dataset 1 of nine sub-nuclear locations.

| Code | Sub-Nuclear Location | Number |
|---|---|---|
| 1 | Chromatin | 99 |
| 2 | Heterochromatin | 22 |
| 3 | Nuclear envelope | 61 |
| 4 | Nuclear matrix | 29 |
| 5 | Nuclear pore complex | 79 |
| 6 | Nuclear speckle | 67 |
| 7 | Nucleolus | 307 |
| 8 | Nucleoplasm | 37 |
| 9 | Nuclear PML body | 13 |
| Overall | | 714 |
Table 3. Protein benchmark Dataset 2 of ten sub-nuclear locations.

| Code | Sub-Nuclear Location | Number |
|---|---|---|
| 1 | Centromere | 86 |
| 2 | Chromosome | 113 |
| 3 | Nuclear speckle | 50 |
| 4 | Nucleolus | 294 |
| 5 | Nuclear envelope | 17 |
| 6 | Nuclear matrix | 18 |
| 7 | Nucleoplasm | 30 |
| 8 | Nuclear pore complex | 12 |
| 9 | Nuclear PML body | 12 |
| 10 | Telomere | 37 |
| Overall | | 669 |
The procedure of numerical experiment is as follows.
(1) Represent the protein sequences using DipC, PseAAC, and PSSM.
(2) Fuse DipC and PSSM to get DipPSSM and fuse PseAAC and PSSM to get PseAAPSSM.
(3) Employ LDA to reduce the dimensionality of DipPSSM and PseAAPSSM.
(4) Train KNN classifier for prediction.
To provide an intuitive view, these processes are shown in Figure 1.
Figure 1. A flowchart of the prediction process.

5.2. Numerical Results and Analysis

5.2.1. Feature Fusion Representations

A comparison of fusion and single representations: In this subsection, we compare our proposed representations PseAAPSSM and DipPSSM with their single component representations on the prediction success rates of protein sub-nuclear locations. Table 4 and Table 5 show the experimental results for every sub-nuclear location on Datasets 1 and 2, respectively. Note that we take the average of fifty success rates, each obtained by 10-fold cross-validation with a random partition, as the prediction success rate (SR), where the neighborhood size k of KNN is the value from 1 to 10 that gives the highest overall success rate. The success rate and the overall success rate are calculated by Equations (15) and (16), respectively.
$$\text{success rate}(i) = T(i) / N(i) \qquad (i = 1, 2, \ldots, n) \qquad (15)$$
$$\text{overall success rate} = \sum_{i=1}^{n} T(i) \Big/ \sum_{i=1}^{n} N(i) \qquad (16)$$
where T(i) is the number of correctly predicted proteins belonging to location i and N(i) is the total number of proteins at location i. Note that the success rate here can also be understood as the sensitivity defined in much of the literature, which will be discussed in Section 5.2.3. For the two proposed fusion representations DipPSSM and PseAAPSSM, the optimal balance factor vector R is also listed in the tables.
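Equations (15) and (16) amount to per-class and pooled accuracy. The following short Python sketch, with the hypothetical function name `success_rates`, shows the computation from lists of true and predicted locations.

```python
def success_rates(true_labels, predicted_labels):
    """Return (per-location success rates, overall success rate).

    Implements success rate(i) = T(i) / N(i) and
    overall success rate = sum_i T(i) / sum_i N(i).
    """
    total = {}    # N(i): number of proteins at location i
    correct = {}  # T(i): number of correctly predicted proteins at location i
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == pred else 0)
    per_location = {loc: correct[loc] / total[loc] for loc in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_location, overall
```

Because the overall rate pools all proteins, large classes dominate it: in Dataset 1, for instance, the nucleolus accounts for 307 of the 714 proteins, so its success rate weighs heavily in the overall figure.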
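According to Table 4 and Table 5, our proposed fusion representations achieve higher overall success rates than the single representations on both datasets, and higher success rates for most individual sub-nuclear locations; the exceptions are nuclear speckle and nuclear envelope on Dataset 2, where a single representation performs best.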
Table 4. Prediction success rate (SR) and the optimal R of Dataset 1 for protein sub-nuclear localization by 10-fold cross validation with various representations.

| Sub-Nuclear Location | PseAAC SR (k = 9) | DipC SR (k = 8) | PSSM SR (k = 3) | PseAAPSSM SR (k = 3) | PseAAPSSM R | DipPSSM SR (k = 3) | DipPSSM R |
|---|---|---|---|---|---|---|---|
| 1. Chromatin | 0.4867 | 0.5437 | 0.5690 | 0.7622 | 0.7500 | 0.7683 | 0.7470 |
| 2. Heterochromatin | 0.2130 | 0.2113 | 0.4020 | 0.5650 | 0.8219 | 0.5613 | 0.8196 |
| 3. Nuclear envelope | 0.2678 | 0.2169 | 0.3872 | 0.4657 | 0.2500 | 0.4530 | 0.2458 |
| 4. Nuclear matrix | 0.1333 | 0.1567 | 0.3850 | 0.7777 | 0.9978 | 0.8007 | 0.9976 |
| 5. Nuclear pore complex | 0.5480 | 0.5760 | 0.6108 | 0.7251 | 0.1500 | 0.7231 | 0.1489 |
| 6. Nuclear speckle | 0.2926 | 0.3355 | 0.3303 | 0.5216 | 0.0600 | 0.5235 | 0.0583 |
| 7. Nucleolus | 0.7952 | 0.7713 | 0.7756 | 1.0000 | 0.9989 | 1.0000 | 0.9997 |
| 8. Nucleoplasm | 0.0577 | 0.0700 | 0.2937 | 0.7032 | 0.9978 | 0.7553 | 0.9973 |
| 9. Nuclear PML body | 0.0830 | 0.0920 | 0.3820 | 0.4130 | 0.0400 | 0.3830 | 0.0401 |
| Overall | 0.5365 | 0.5389 | 0.5929 | 0.7971 | | 0.8002 | |
Table 5. Prediction success rate (SR) and the optimal R of Dataset 2 for protein sub-nuclear localization by 10-fold cross validation with different representations.

| Sub-Nuclear Location | PseAAC SR (k = 9) | DipC SR (k = 9) | PSSM SR (k = 6) | PseAAPSSM SR (k = 4) | PseAAPSSM R | DipPSSM SR (k = 4) | DipPSSM R |
|---|---|---|---|---|---|---|---|
| 1. Centromere | 0.2495 | 0.0916 | 0.6088 | 0.7908 | 0.9911 | 0.7889 | 0.9901 |
| 2. Chromosome | 0.3397 | 0.3861 | 0.4819 | 0.9299 | 0.9976 | 0.9279 | 0.9980 |
| 3. Nuclear speckle | 0.3188 | 0.3164 | 0.3504 | 0.3460 | 0.6983 | 0.3416 | 0.7000 |
| 4. Nucleolus | 0.8679 | 0.8692 | 0.8301 | 0.9360 | 0.2504 | 0.9337 | 0.2498 |
| 5. Nuclear envelope | 0.2670 | 0.0980 | 0.0070 | 0.0640 | 0.1978 | 0.0060 | 0.2000 |
| 6. Nuclear matrix | 0.1880 | 0.1660 | 0.2630 | 0.3110 | 0.2391 | 0.3170 | 0.2400 |
| 7. Nucleoplasm | 0.0313 | 0.0307 | 0.1667 | 1.0000 | 0.9992 | 1.0000 | 0.9998 |
| 8. Nuclear pore complex | 0.4110 | 0.4750 | 0.3210 | 0.5080 | 0.2187 | 0.5190 | 0.2206 |
| 9. Nuclear PML body | 0.0010 | 0.0020 | 0.0260 | 0.0850 | 0.2079 | 0.0660 | 0.2100 |
| 10. Telomere | 0.0998 | 0.0873 | 0.3923 | 0.4738 | 0.1213 | 0.4725 | 0.1200 |
| Overall | 0.5168 | 0.5025 | 0.5931 | 0.7874 | | 0.7855 | |
Balance factor vector R: Figure 2 shows the success rate curves of DipPSSM and PseAAPSSM on Dataset 1, where each subplot corresponds to a sub-nuclear location. In each subplot, the horizontal axis represents the balance factor ri and the vertical axis is the prediction success rate. Note that in each subplot, while ri varies from 0 to 1 in steps of 0.1, the remaining n−1 balance factors are fixed at the values in Table 4.
Figure 2. Success rate comparison for different ri with our representations on Dataset 1, where each subplot, from (a) to (i), respectively represents each sub-nuclear location.
The numerical experiment shown in Figure 3 is the same as that in Figure 2, except that it is carried out on Dataset 2. From Figure 2 and Figure 3, it is clear that the parameters ri (i = 1, 2,…, n) have a significant influence on protein sub-nuclear localization. In particular, Figure 2 shows that when ri is around 0.9 in each subplot (i = 1, 2,…, n), the success rate jumps sharply, probably suggesting that for Dataset 1, dipeptide composition or pseudo-amino acid composition is more important than the position specific scoring matrix in the fusion representations.
Figure 3. Success rate comparison for different ri with our representations on Dataset 2, where each subplot, from (a) to (j), respectively represents each sub-nuclear location.

5.2.2. Dimensionality Reduction

3D visualization: In this subsection, we employ LDA to visualize the data. We give 3D scatter plots of DipPSSM and PseAAPSSM for both datasets, so as to observe the data distribution in three-dimensional space after dimension reduction by LDA. Figure 4 and Figure 5 show the results for Datasets 1 and 2, respectively, where the three axes represent the first three components of LDA, corresponding to the three largest eigenvalues.
In Figure 4, we use nine colors, coded from 1 to 9 according to Table 2, to represent the proteins of the nine sub-nuclear locations of Dataset 1. In Figure 5, we use ten colors, coded from 1 to 10 according to Table 3, to represent the proteins of the ten sub-nuclear locations of Dataset 2. In Figure 4b and Figure 5b, some data points can hardly be distinguished at the plotted scale. Therefore, we provide a magnified view of those data points in Figure 4c and Figure 5c. These results suggest that LDA can improve the classification performance by separating the data points of different classes.
Figure 4. 3D scatter on Dataset 1 with X-, Y- and Z-axes representing the first three components of LDA, respectively: (a) DipPSSM; (b) PseAAPSSM and (c) the patch of high resolution for the indicated region in (b).
Figure 5. 3D scatter on Dataset 2 with X-, Y- and Z-axes representing the first three components of LDA, respectively: (a) DipPSSM; (b) PseAAPSSM and (c) the patch of high resolution for the indicated region in (b).
Parameter effects: With 10-fold cross-validation, Figure 6 shows the overall success rates against the number of dimensions retained by LDA for DipPSSM and PseAAPSSM, where the neighborhood size k is set to 4, a value that performs well among the candidates 1 to 10. Figure 6 shows that most of the information in the original high-dimensional protein data can be captured by a low-dimensional structure, which suggests that LDA is effective for protein sub-nuclear localization.
Figure 7 further compares the success rates on the reduced data and the original data as the neighborhood size k varies from 1 to 10. For each fixed k, both DipPSSM with LDA and PseAAPSSM with LDA achieve markedly higher success rates in sub-nuclear location prediction than DipPSSM and PseAAPSSM alone. Interestingly, Figure 4, Figure 5, Figure 6 and Figure 7 indicate that, for both datasets, LDA reduction works slightly better on DipPSSM than on PseAAPSSM.
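The sweep over the neighborhood size k under 10-fold cross-validation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed shuffling seed and the toy two-cluster data are assumptions made for illustration, and a real run would use the (LDA-reduced) protein feature vectors instead.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(xi, x)), yi)
        for xi, yi in zip(train_X, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_success_rate(X, y, k, n_folds=10, seed=0):
    """Overall success rate of KNN under n_folds-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)          # random partition; seed is a hypothetical choice
    folds = [idx[f::n_folds] for f in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in fold)
    return correct / len(X)

# Toy data (assumption): two well-separated Gaussian clusters, 30 points each.
rng = random.Random(1)
X = [[rng.gauss(c, 0.3), rng.gauss(c, 0.3)] for c in (0.0, 3.0) for _ in range(30)]
y = [c for c in (0, 1) for _ in range(30)]

# Success rate for each neighborhood size k = 1, ..., 10.
rates = {k: cross_val_success_rate(X, y, k) for k in range(1, 11)}
```

Changing `seed` changes the fold partition, so the measured rates can vary slightly from run to run on less cleanly separated data.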
Figure 6. The overall success rates at different dimensions, reduced by LDA, from DipPSSM and PseAAPSSM, respectively: (a) Dataset 1 and (b) Dataset 2.
Figure 7. Comparison of success rates among different k values by DipPSSM, PseAAPSSM, DipPSSM with LDA and PseAAPSSM with LDA, respectively: (a) Dataset 1 and (b) Dataset 2.

5.2.3. Analysis of Numerical Results

The following indexes (Equations (17)–(20)) are commonly used in the literature to evaluate the performance of a predictor. We compute them under 10-fold cross-validation to compare the different representations, with and without the dimension reduction method.
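$$\mathrm{SE}(i) = \frac{\mathrm{TP}(i)}{\mathrm{TP}(i) + \mathrm{FN}(i)} \qquad (i = 1, 2, \ldots, n) \tag{17}$$
$$\mathrm{SP}(i) = \frac{\mathrm{TN}(i)}{\mathrm{TN}(i) + \mathrm{FP}(i)} \qquad (i = 1, 2, \ldots, n) \tag{18}$$
$$\mathrm{ACC}(i) = \frac{\mathrm{TP}(i) + \mathrm{TN}(i)}{\mathrm{TP}(i) + \mathrm{FP}(i) + \mathrm{TN}(i) + \mathrm{FN}(i)} \qquad (i = 1, 2, \ldots, n) \tag{19}$$
$$\mathrm{MCC}(i) = \frac{\mathrm{TP}(i) \times \mathrm{TN}(i) - \mathrm{FP}(i) \times \mathrm{FN}(i)}{\sqrt{(\mathrm{TP}(i) + \mathrm{FP}(i)) \times (\mathrm{TP}(i) + \mathrm{FN}(i)) \times (\mathrm{TN}(i) + \mathrm{FP}(i)) \times (\mathrm{TN}(i) + \mathrm{FN}(i))}} \qquad (i = 1, 2, \ldots, n) \tag{20}$$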
In these equations, for the i-th sub-nuclear location, TP (true positive) and TN (true negative) are the numbers of proteins that are correctly located, while FP (false positive) and FN (false negative) are the numbers of proteins that are wrongly located. SE (sensitivity) denotes the rate of positive samples correctly located, and its value equals the success rate in Equation (14). SP (specificity) denotes the rate of negative samples correctly located. ACC (accuracy) is the rate of correctly located samples overall. MCC is the Matthews correlation coefficient, which takes values in [−1, 1]: a value of 1 denotes a perfect prediction, 0 a prediction no better than random, and −1 total disagreement between prediction and observation. No single number can fully describe the confusion matrix of true and false positives and negatives, but the MCC is generally regarded as one of the best such measures [39].
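As an illustration of these four indexes, the following minimal sketch computes them for one location in a one-vs-rest fashion from lists of true and predicted labels. The function name, the toy labels and the convention of returning 0 when a denominator vanishes are assumptions made here, not details taken from the paper.

```python
import math

def one_vs_rest_metrics(y_true, y_pred, label):
    """Return (SE, SP, ACC, MCC), treating `label` as positive and all other labels as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    tn = sum(1 for t, p in pairs if t != label and p != label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention: 0 when undefined
    return se, sp, acc, mcc

# Toy example (assumption): three locations labelled 1, 2 and 3.
y_true = [1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 1, 3]
se, sp, acc, mcc = one_vs_rest_metrics(y_true, y_pred, label=1)
```

For location 1 in this toy example, TP = 2, FN = 1, FP = 1 and TN = 4, which gives SE = 2/3, SP = 0.8, ACC = 0.75 and MCC = 7/15.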
Table 6 gives the values of the four indexes in Equations (17)–(20) for the nine sub-nuclear locations in Dataset 1 using three single representations (PseAAC, DipC and PSSM), two fusion representations (DipPSSM and PseAAPSSM) and the fusion representations combined with the dimension reduction method LDA, where both PseAAPSSM and DipPSSM are reduced to eight dimensions. Table 7 follows the same experimental design as Table 6 but uses Dataset 2, where PseAAPSSM and DipPSSM are reduced to nine dimensions. From these results, we draw the following conclusions. The sensitivity (SE), specificity (SP), accuracy (ACC) and MCC obtained with the fusion representations are better than those obtained with the single representations in most locations. Furthermore, the fusion representations with the LDA treatment generally outperform those without, although PseAAPSSM with LDA is weaker for a few locations (e.g., nuclear matrix and nucleoplasm in Table 6). Note that because the samples are partitioned randomly in 10-fold cross-validation, the values of the four indexes SE, SP, ACC and MCC vary slightly from run to run. This is also why the sensitivity and the success rate for each sub-nuclear location differ between Table 4 and Table 6, as well as between Table 5 and Table 7, although theoretically Equations (15) and (17) should produce the same value.
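The target dimensions of eight and nine are consistent with a general property of LDA, sketched below. The symbols ($C$ classes, $N_c$ samples and mean $\mu_c$ in class $c$, overall mean $\mu$, between-class scatter $S_b$) are notation assumed here for illustration and may differ from the notation used in the LDA section of this paper.

$$S_b = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^{T}, \qquad \sum_{c=1}^{C} N_c (\mu_c - \mu) = 0 \;\Longrightarrow\; \operatorname{rank}(S_b) \le C - 1$$

Because the $C$ weighted mean deviations sum to zero, they span at most $C - 1$ directions, so LDA yields at most $C - 1$ discriminant components with nonzero eigenvalue: at most 8 for the nine locations of Dataset 1 and at most 9 for the ten locations of Dataset 2.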
Table 6. Performance of various representations on Dataset 1.
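| Sub-Nuclear Location | Index | PseAAC | DipC | PSSM | PseAAPSSM | DipPSSM | PseAAPSSM with LDA | DipPSSM with LDA |
|---|---|---|---|---|---|---|---|---|
| 1. Chromatin | SE | 0.4545 | 0.5354 | 0.5556 | 0.7475 | 0.8081 | 0.9293 | 0.8889 |
| | SP | 0.8472 | 0.8488 | 0.9154 | 0.9252 | 0.9154 | 0.9919 | 0.9789 |
| | ACC | 0.7927 | 0.8053 | 0.8655 | 0.9762 | 0.9006 | 0.9832 | 0.9664 |
| | MCC | 0.2633 | 0.3291 | 0.4560 | 0.6217 | 0.6441 | 0.9291 | 0.8605 |
| 2. Heterochromatin | SE | 0.2727 | 0.1364 | 0.4091 | 0.5909 | 0.5909 | 0.5455 | 1 |
| | SP | 0.9884 | 0.9928 | 0.9812 | 0.9812 | 0.9855 | 0.9957 | 0.9971 |
| | ACC | 0.9664 | 0.9664 | 0.9636 | 0.9608 | 0.9734 | 0.9818 | 0.9972 |
| | MCC | 0.3255 | 0.2120 | 0.3903 | 0.5278 | 0.5642 | 0.6520 | 0.9560 |
| 3. Nuclear envelope | SE | 0.2623 | 0.2131 | 0.3443 | 0.4754 | 0.4590 | 0.9508 | 0.9344 |
| | SP | 0.9893 | 0.9893 | 0.9709 | 0.9470 | 0.9770 | 1 | 1 |
| | ACC | 0.9272 | 0.9230 | 0.9174 | 0.9538 | 0.9328 | 0.9958 | 0.9944 |
| | MCC | 0.3983 | 0.3429 | 0.3831 | 0.5116 | 0.5123 | 0.9729 | 0.9637 |
| 4. Nuclear matrix | SE | 0.1379 | 0.2069 | 0.4138 | 0.6552 | 0.6027 | 0.3793 | 0.5517 |
| | SP | 0.9927 | 0.9942 | 0.9737 | 0.9869 | 0.9912 | 0.9869 | 0.9839 |
| | ACC | 0.9580 | 0.9622 | 0.9510 | 0.9230 | 0.9762 | 0.9622 | 0.9964 |
| | MCC | 0.2311 | 0.3377 | 0.3813 | 0.6529 | 0.6702 | 0.4381 | 0.5543 |
| 5. Nuclear pore complex | SE | 0.5316 | 0.5949 | 0.6456 | 0.7215 | 0.7342 | 1 | 1 |
| | SP | 0.9370 | 0.9323 | 0.9496 | 0.9622 | 0.9606 | 1 | 1 |
| | ACC | 0.8922 | 0.8950 | 0.9160 | 0.9356 | 0.9356 | 1 | 1 |
| | MCC | 0.4611 | 0.4983 | 0.5825 | 0.6763 | 0.6800 | 1 | 1 |
| 6. Nuclear speckle | SE | 0.2985 | 0.3582 | 0.3284 | 0.4925 | 0.5075 | 1 | 1 |
| | SP | 0.9737 | 0.9675 | 0.9536 | 0.9675 | 0.9691 | 1 | 1 |
| | ACC | 0.9104 | 0.9104 | 0.8950 | 0.9734 | 0.9258 | 1 | 1 |
| | MCC | 0.3581 | 0.3909 | 0.3164 | 0.5074 | 0.5256 | 1 | 1 |
| 7. Nucleolus | SE | 0.7915 | 0.7752 | 0.7590 | 0.9772 | 0.9967 | 0.9349 | 1 |
| | SP | 0.6216 | 0.6536 | 0.7125 | 0.9361 | 0.9730 | 0.8649 | 0.9926 |
| | ACC | 0.6947 | 0.7059 | 0.7325 | 0.9314 | 0.9832 | 0.8950 | 0.9958 |
| | MCC | 0.4117 | 0.4254 | 0.4669 | 0.9077 | 0.9662 | 0.7925 | 0.9915 |
| 8. Nucleoplasm | SE | 0.0541 | 0.0811 | 0.2703 | 0.3784 | 0.6757 | 0.2703 | 0.9730 |
| | SP | 0.9852 | 0.9867 | 0.9838 | 0.9926 | 0.9941 | 0.9808 | 1 |
| | ACC | 0.9370 | 0.9398 | 0.9468 | 0.9692 | 0.9776 | 0.9440 | 0.9986 |
| | MCC | 0.0677 | 0.1169 | 0.3333 | 0.5110 | 0.7521 | 0.3152 | 0.9857 |
| 9. Nuclear PML body | SE | 0.0769 | 0.1538 | 0.3077 | 0.3846 | 0.3077 | 1 | 1 |
| | SP | 1 | 0.9971 | 0.9929 | 0.9872 | 0.9929 | 1 | 1 |
| | ACC | 0.9832 | 0.9818 | 0.9804 | 0.9006 | 0.9804 | 1 | 1 |
| | MCC | 0.2750 | 0.2705 | 0.3602 | 0.3585 | 0.3602 | 1 | 1 |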
Table 7. Performance of various representations on Dataset 2.
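| Sub-Nuclear Location | Index | PseAAC | DipC | PSSM | PseAAPSSM | DipPSSM | PseAAPSSM with LDA | DipPSSM with LDA |
|---|---|---|---|---|---|---|---|---|
| 1. Centromere | SE | 0.2209 | 0.1163 | 0.5930 | 0.8023 | 0.8256 | 0.6163 | 1 |
| | SP | 0.9705 | 0.9828 | 0.9314 | 0.9743 | 0.9760 | 0.9674 | 0.9949 |
| | ACC | 0.8744 | 0.8714 | 0.8879 | 0.9522 | 0.9567 | 0.9223 | 0.9955 |
| | MCC | 0.2845 | 0.1948 | 0.5120 | 0.7844 | 0.8056 | 0.6304 | 0.9805 |
| 2. Chromosome | SE | 0.3363 | 0.3805 | 0.5044 | 0.9027 | 0.8850 | 0.8761 | 1 |
| | SP | 0.8867 | 0.8525 | 0.9047 | 0.9910 | 0.9874 | 0.9011 | 1 |
| | ACC | 0.7937 | 0.7728 | 0.8371 | 0.9761 | 0.9701 | 0.8969 | 1 |
| | MCC | 0.2333 | 0.2240 | 0.4135 | 0.9135 | 0.8917 | 0.6917 | 1 |
| 3. Nuclear speckle | SE | 0.2600 | 0.3400 | 0.3600 | 0.3200 | 0.3000 | 0.7800 | 1 |
| | SP | 0.9774 | 0.9709 | 0.9645 | 0.9742 | 0.9758 | 0.9742 | 1 |
| | ACC | 0.9238 | 0.9238 | 0.9193 | 0.9253 | 0.9253 | 0.9596 | 1 |
| | MCC | 0.3172 | 0.3672 | 0.3599 | 0.3625 | 0.3504 | 0.7220 | 1 |
| 4. Nucleolus | SE | 0.8810 | 0.8707 | 0.8231 | 0.9422 | 0.9422 | 0.9320 | 0.9830 |
| | SP | 0.4427 | 0.4907 | 0.6480 | 0.8000 | 0.8053 | 0.9867 | 0.9840 |
| | ACC | 0.6353 | 0.6577 | 0.7250 | 0.8625 | 0.8655 | 0.9626 | 0.9836 |
| | MCC | 0.3504 | 0.3809 | 0.4710 | 0.7377 | 0.7428 | 0.9247 | 0.9666 |
| 5. Nuclear envelope | SE | 0.2941 | 0.1176 | 0.0017 | 0.0588 | 0.1176 | 1 | 0.8235 |
| | SP | 0.9939 | 0.9954 | 0.9939 | 0.9954 | 0.9939 | 1 | 0.9939 |
| | ACC | 0.9761 | 0.9731 | 0.9686 | 0.9716 | 0.9716 | 1 | 0.9895 |
| | MCC | 0.3934 | 0.2066 | −0.0125 | 0.1107 | 0.1861 | 1 | 0.7950 |
| 6. Nuclear matrix | SE | 0.1111 | 0.1667 | 0.2222 | 0.3889 | 0.3333 | 0.8889 | 0.8333 |
| | SP | 0.9985 | 0.9985 | 0.9954 | 0.9892 | 0.9908 | 0.9985 | 0.9969 |
| | ACC | 0.9746 | 0.9761 | 0.9746 | 0.9731 | 0.9731 | 0.9955 | 0.9925 |
| | MCC | 0.2654 | 0.3466 | 0.3460 | 0.4275 | 0.3951 | 0.9124 | 0.8537 |
| 7. Nucleoplasm | SE | 0.0016 | 0.0011 | 0.1667 | 0.9667 | 0.9667 | 0.2333 | 1 |
| | SP | 0.9969 | 0.9984 | 0.9969 | 0.9937 | 0.9906 | 0.9890 | 1 |
| | ACC | 0.9522 | 0.9537 | 0.9596 | 0.9925 | 0.9895 | 0.9552 | 1 |
| | MCC | −0.0119 | −0.0084 | 0.3326 | 0.9179 | 0.8898 | 0.3215 | 1 |
| 8. Nuclear pore complex | SE | 0.4167 | 0.5000 | 0.2500 | 0.5000 | 0.5000 | 1 | 1 |
| | SP | 1 | 0.9939 | 0.9924 | 0.9939 | 0.9939 | 1 | 1 |
| | ACC | 0.9895 | 0.9851 | 0.9791 | 0.9851 | 0.9851 | 1 | 1 |
| | MCC | 0.6421 | 0.5402 | 0.2960 | 0.5402 | 0.5402 | 1 | 1 |
| 9. Nuclear PML body | SE | 0.0020 | 0.0012 | 0.0011 | 0.0833 | 0.0833 | 1 | 1 |
| | SP | 0.9924 | 0.9985 | 0.9985 | 0.9954 | 0.9954 | 1 | 1 |
| | ACC | 0.9746 | 0.9806 | 0.9806 | 0.9791 | 0.9791 | 1 | 1 |
| | MCC | −0.0117 | −0.0052 | −0.0052 | 0.1356 | 0.1356 | 1 | 1 |
| 10. Telomere | SE | 0.1351 | 0.1351 | 0.4324 | 0.4865 | 0.4595 | 1 | 0.7568 |
| | SP | 0.9873 | 0.9747 | 0.9826 | 0.9826 | 0.9794 | 1 | 0.9921 |
| | ACC | 0.9402 | 0.9283 | 0.9522 | 0.9552 | 0.9507 | 1 | 0.9791 |
| | MCC | 0.2028 | 0.1440 | 0.4820 | 0.5265 | 0.4847 | 1 | 0.7904 |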

5.2.4. Comparison with Existing Prediction Results

Table 8 compares the overall success rates on Dataset 1 of our protein sub-nuclear localization methods and of the Nuc-PLoc predictor [7] under the jackknife test. For each sub-nuclear location of Dataset 1, Figure 8 compares the Matthews correlation coefficient (MCC) of our methods with that of Nuc-PLoc [7]. Table 8 and Figure 8 show that our predictors achieve much higher success rates than Nuc-PLoc (95.94% and 88.1% versus 67.4% overall).
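In the jackknife (leave-one-out) test, each protein is held out in turn and predicted from all remaining proteins. The sketch below shows the procedure under stated assumptions: the function names and toy data are hypothetical, and a simple 1-nearest-neighbor rule stands in for the actual predictor.

```python
def jackknife_success_rate(X, y, classify):
    """Leave-one-out test: predict each sample from all the others and count the hits."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # every sample except the i-th
        train_y = y[:i] + y[i + 1:]
        correct += classify(train_X, train_y, X[i]) == y[i]
    return correct / len(X)

def nearest_neighbor(train_X, train_y, x):
    """1-NN rule, used here only as a stand-in predictor (assumption)."""
    dists = [sum((a - b) ** 2 for a, b in zip(xi, x)) for xi in train_X]
    return train_y[dists.index(min(dists))]

# Toy data (assumption): two one-dimensional groups.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = ["a", "a", "a", "b", "b", "b"]
rate = jackknife_success_rate(X, y, nearest_neighbor)
```

Unlike 10-fold cross-validation, the jackknife test involves no random partition, so repeated runs on the same data give the same success rate.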
Table 8. Comparison of the overall success rates by jackknife test on Dataset 1.
| Algorithm | Representation | Overall Success Rate |
|---|---|---|
| Nuc-PLoc | Fusion of PsePSSM and PseAAC | 67.4% |
| Our methods | DipPSSM with LDA | 95.94% |
| Our methods | PseAAPSSM with LDA | 88.1% |
Figure 8. Comparison of MCC performance on Dataset 1 between our proposed methods and Nuc-PLoc.
Next, we compare our methods with the SubNucPred method [17] on Dataset 2. The four indexes of sensitivity (SE), specificity (SP), accuracy (ACC) and MCC for each sub-nuclear location, computed with 10-fold cross-validation, are shown in Figure 9. Figure 9 shows that our methods, DipPSSM with LDA and PseAAPSSM with LDA, outperform SubNucPred.
Figure 9. Comparison of our proposed methods with SubNucPred on Dataset 2: (a) Sensitivity (SE); (b) Specificity (SP); (c) Accuracy (ACC) and (d) Matthews Correlation Coefficient (MCC).

6. Conclusions

Following the completion of the Human Genome Project, bioscience has stepped into the era of the genome and proteome [40,41,42,43,44]. A large number of computational methods have been presented to deal with prediction tasks in bioscience [45,46,47,48]. The nucleus is the largest organelle in eukaryotic cells and is highly organized. Hence, understanding protein sub-nuclear localization is important for understanding the biological functions of the nucleus. Many current studies address protein sub-nuclear localization prediction [49,50]. This paper proposes a different route: we first develop two fusion representations, DipPSSM and PseAAPSSM, and then conduct experiments with 10-fold cross-validation on two datasets to verify the superiority of the proposed representations and their applicability to predicting protein sub-nuclear localization. We conclude that the fusion representations can greatly improve the success rate of protein sub-nuclear localization prediction, which suggests that they capture more of the overall sequence pattern of a protein than any single representation does.
However, choosing proper balance factors when constructing the fusion representations is difficult. In this paper, we use a genetic algorithm to produce approximately optimal values of the weight coefficients (balance factors): we run the genetic algorithm multiple times and take the average weight coefficient, which yields the best performance. The time complexity of this method is high, so in future research we will try other search methods for obtaining the weight coefficients.
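The idea of running a genetic algorithm several times and averaging the resulting weight can be sketched as follows. This is a generic real-coded GA written for illustration, not the authors' configuration: the selection, crossover and mutation operators, all parameter values, and the toy fitness function (a stand-in for the cross-validated success rate as a function of one balance factor) are assumptions.

```python
import random

def run_ga(fitness, rng, pop_size=20, generations=40, mutation_sd=0.05):
    """Minimal real-coded GA over one weight in [0, 1]; returns the best weight found."""
    pop = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        next_pop = ranked[:2]                      # elitism: keep the two best unchanged
        while len(next_pop) < pop_size:
            # Tournament selection of two parents (best of three random candidates).
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            alpha = rng.random()
            child = alpha * p1 + (1 - alpha) * p2  # blend crossover
            child += rng.gauss(0.0, mutation_sd)   # Gaussian mutation
            next_pop.append(min(1.0, max(0.0, child)))  # clip to [0, 1]
        pop = next_pop
    return max(pop, key=fitness)

def toy_fitness(w):
    # Hypothetical stand-in for the success rate; peaks at w = 0.7.
    return 1.0 - (w - 0.7) ** 2

# Run the GA several times with different seeds and average the weights found.
runs = [run_ga(toy_fitness, random.Random(seed)) for seed in range(5)]
avg_weight = sum(runs) / len(runs)
```

In a real setting each fitness evaluation would require a full cross-validation run, which is consistent with the high time cost noted above.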
Because the proposed fusion representations have high dimensionality, which might harm KNN prediction, we use LDA to process the representations before the KNN classifier predicts protein locations. Note that many other useful data reduction methods, such as kernel discriminant analysis and fuzzy LDA, have emerged in current pattern recognition research. How to use these methods, their improved variants or other more suitable dimension reduction methods effectively in the sub-nuclear localization field is still an open problem. In addition, it remains an interesting challenge to obtain better representations for protein sub-nuclear localization and to study other machine learning classification algorithms.

Acknowledgments

This research is fully supported by grants from the National Natural Science Foundation of China (11261068, 11171293).

Author Contributions

Shunfang Wang designed the research. Shuhui Liu performed the numerical experiments. Shunfang Wang and Shuhui Liu analyzed the data and wrote the paper. The authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mei, S.Y.; Fei, W. Amino acid classification based spectrum kernel fusion for protein subnuclear localization. BMC Bioinform. 2010, 11, S17.
  2. Nancy, Y.; Wagner, J.; Laird, M.; Melli, G.; Rey, S.; Lo, R.; Brinkman, F. PSORTb 3.0: Improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics 2010, 26, 1608–1615.
  3. Chou, K.C. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteom. 2009, 6, 262–274.
  4. Zuo, Y.C.; Peng, Y.; Liu, L.; Chen, W.; Yang, L.; Fan, G.L. Predicting peroxidase subcellular location by hybridizing different descriptors of Chou’pseudo amino acid patterns. Anal. Biochem. 2014, 458, 14–19.
  5. Nakashima, H.; Nishikawa, K. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol. 1994, 238, 54–61.
  6. Ding, Y.; Cai, Y.; Zhang, G.; Xu, W. The influence of dipeptide composition on protein thermostability. FEBS Lett. 2004, 569, 284–288.
  7. Shen, H.B.; Chou, K.C. Nuc-PLoc: A new web-server for predicting protein subnuclear localization by fusing PseAA composition and PsePSSM. Protein Eng. Des. Sel. 2007, 20, 561–567.
  8. Du, P.; Gu, S.; Jiao, Y. PseAAC General: Fast building various modes of general form of chou’s pseudo-amino acid composition for large-scale protein datasets. Int. J. Mol. Sci. 2014, 15, 3495–3506.
  9. Cao, D.S.; Xu, Q.S.; Liang, Y.Z. Propy: A tool to generate various modes of Chou’s pseAAC. Bioinformatics 2013, 29, 960–962.
  10. Du, P.; Wang, X.; Xu, C.; Gao, Y. PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou’s pseudoamino acid compositions. Anal. Biochem. 2012, 425, 117–119.
  11. Li, L.Q.; Yu, S.J.; Xiao, W.D.; Li, Y.S.; Li, M.L.; Huang, L.; Zheng, X.Q.; Zhou, S.W.; Yang, H. Prediction of bacterial protein subcellular localization by incorporating various features into Chou’s PseAAC and a backward feature selection approach. Biochimie 2014, 104, 100–107.
  12. Wang, T.; Yang, J. Using the nonlinear dimensionality reduction method for the prediction of subcellular localization of Gram-negative bacterial proteins. Mol. Divers. 2009, 13, 475–481.
  13. Chou, K.C. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 2011, 273, 236–247.
  14. Mandal, M.; Mukhopadhyay, A.; Maulik, U. Prediction of protein subcellular localization by incorporating multiobjective PSO-based feature subset selection into the general form of Chou’s PseAAC. Med. Biol. Eng. Comput. 2015, 53, 331–344.
  15. Shen, H.B.; Chou, K.C. Predicting protein subnuclear location with optimized evidence-theoretic k-nearest classifier and pseudo amino acid composition. Biochem. Biophys. Res. Commun. 2005, 337, 752–756.
  16. Mundra, P.; Kumar, M.; Kumar, K.K.; Jayaraman, V.K.; Kulkarni, B.D. Using pseudo amino acid composition to predict protein subnuclear localization: Approached with PSSM. Pattern Recognit. Lett. 2007, 28, 1610–1615.
  17. Kumar, R.; Jain, S.; Kumari, B.; Kumar, M. Protein Sub-Nuclear Localization Prediction Using SVM and Pfam Domain Information. PLoS ONE 2014, 9, e98345.
  18. Jiang, X.; Wei, R.; Zhao, Y.; Zhang, T. Using Chou’s pseudo amino acid composition based on approximate entropy and an ensemble of AdaBoost classifiers to predict protein subnuclear location. Amino Acids 2008, 34, 669–675.
  19. Li, F.; Li, Q. Using pseudo amino acid composition to predict protein subnuclear location with improved hybrid approach. Amino Acids 2008, 34, 119–125.
  20. Lei, Z.; Dai, Y. An SVM-based system for predicting protein subnuclear localizations. BMC Bioinform. 2005, 6, 291.
  21. Wang, Z.; Zou, Q.; Jiang, Y.; Ju, Y.; Zeng, X. Review of Protein Subcellular Localization Prediction. Curr. Bioinform. 2014, 9, 331–342.
  22. Xiao, X.; Wu, Z.C.; Chou, K.C. iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. J. Theor. Biol. 2011, 284, 42–51.
  23. Wang, T.; Yang, J. Predicting subcellular localization of gramnegative bacterial proteins by linear dimensionality reduction method. Protein Pept. Lett. 2010, 17, 32–37.
  24. Gao, Q.B.; Wang, Z.Z.; Yan, C.; Du, Y.H. Prediction of protein subcellular location using a combined feature of sequence. FEBS Lett. 2005, 579, 3444–3448.
  25. Chou, K.C.; Wu, Z.C.; Xiao, X. iLoc-Euk: A Multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS ONE 2011, 6, e18258.
  26. Sun, J.; Xhafa, F. A genetic algorithm for ground station scheduling. Complex, Intelligent and Software Intensive Systems (CISIS). In Proceedings of the 2011 International Conference on IEEE, Seoul, Korea, 30 June–2 July 2011; pp. 138–145.
  27. Mühlenbein, H. Parallel genetic algorithms, population genetics and combinatorial optimization. In Parallelism, Learning, Evolution, 1st ed.; Becker, J.D., Eisele, I., Mündemann, F.W., Eds.; Springer Berlin Heidelberg: Neubiberg, Germany, 1991; pp. 398–406.
  28. Li, L.; Weinberg, C.R.; Darden, T.A.; Pedersen, L.G. Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001, 17, 1131–1142.
  29. Welling, M. Fisher linear discriminant analysis. In Department of Computer Science; University of Toronto: Toronto, ON, Canada, 2005; p. 3.
  30. Heo, G.; Gader, P. Robust kernel discriminant analysis using fuzzy memberships. Pattern Recognit. 2011, 44, 716–723.
  31. Martínez, A.M.; Kak, A.C. PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 228–233.
  32. Zhang, Y.P.; Xiang, M.; Yang, B. Linear dimensionality reduction based on Hybrid structure preserving projection. Neurocomputing 2015.
  33. Zhang, H.; Berg, A.C.; Maire, M.; Malik, J. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; pp. 2126–2136.
  34. Chou, K.C. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst. 2013, 9, 1092–1100.
  35. Lin, W.Z.; Fang, J.A.; Xiao, X.; Chou, K.C. iLoc-Animal: A multi-label learning classifier for predicting subcellular localization of animal proteins. Mol. Biosyst. 2013, 9, 634–644.
  36. Efron, B.; Gong, G. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am. Stat. 1983, 37, 36–48.
  37. Refaeilzadeh, P.; Tang, L.; Liu, H. Cross validation. In Encyclopedia of Database Systems, 1st ed.; Springer US: New York, NY, USA, 2009; pp. 532–538.
  38. Chen, Y.K.; Li, K.B. Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou’s pseudo amino acid composition. J. Theor. Biol. 2013, 318, 1–12.
  39. Powers, D.M. Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63.
  40. Huang, Q.Y.; You, Z.H.; Zhang, X.F.; Zhou, Y. Prediction of Protein–Protein Interactions with Clustered Amino Acids and Weighted Sparse Representation. Int. J. Mol. Sci. 2015, 16, 10855–10869.
  41. Georgiou, D.N.; Karakasidis, T.E.; Megaritis, A.C. A short survey on genetic sequences, chou’s pseudo amino acid composition and its combination with fuzzy set theory. Open Bioinform. J. 2013, 7, 41–48.
  42. Georgiou, D.N.; Karakasidis, T.E.; Nieto, J.J.; Torres, A. Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou’s pseudo amino acid composition. J. Theor. Biol. 2009, 257, 17–26.
  43. Georgiou, D.N.; Karakasidis, T.E.; Nieto, J.J.; Torres, A. A study of entropy/clarity of genetic sequences using metric spaces and fuzzy sets. J. Theor. Biol. 2010, 267, 95–105.
  44. Nieto, J.J.; Torres, A.; Georgiou, D.N.; Karakasidis, T.E. Fuzzy polynucleotide spaces and metrics. Bull. Math. Biol. 2006, 68, 703–725.
  45. Mohabatkar, H.; Beigi, M.M.; Abdolahi, K.; Mohsenzadeh, S. Prediction of allergenic proteins by means of the concept of Chou’s pseudo amino acid composition and a machine learning approach. Med. Chem. 2013, 9, 133–137.
  46. Liao, B.; Jiang, Y.; Yuan, G.; Zhu, W.; Cai, L.; Cao, Z. Learning a weighted meta-sample based parameter free sparse representation classification for microarray data. PLoS ONE 2014, 9, e104314.
  47. Liu, B.; Fang, L.; Liu, F.; Wang, X.; Chen, J.; Chou, K.C. Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS ONE 2015, 10, e0121501.
  48. Yang, R.; Zhang, C.; Gao, R.; Zhang, L. An effective antifreeze protein predictor with ensemble classifiers and comprehensive sequence descriptors. Int. J. Mol. Sci. 2015, 16, 21191–21214.
  49. Fan, Y.N.; Xiao, X.; Min, J.L.; Chou, K.C. iNR-Drug: Predicting the interaction of drugs with nuclear receptors in cellular networking. Int. J. Mol. Sci. 2014, 15, 4915–4937.
  50. Han, G.S.; Yu, Z.G.; Anh, V.; Krishnajith, A.P.D.; Tian, Y.C. An ensemble method for predicting subnuclear localizations from primary protein structures. PLoS ONE 2013, 8, e57225.
Int. J. Mol. Sci. EISSN 1422-0067, Published by MDPI AG, Basel, Switzerland