Int. J. Mol. Sci. 2017, 18(12), 2718; https://doi.org/10.3390/ijms18122718
Article
Protein Subcellular Localization with Gaussian Kernel Discriminant Analysis and Its Kernel Parameter Selection
^{1} Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China
^{2} School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming 650221, China
* Correspondence: [email protected] (S.W.); [email protected] (K.Y.); [email protected] (Y.F.); Tel.: +8613769196517 (S.W.); +8687165933093 (K.Y.); +8613987659718 (Y.F.)
^{†} These authors contributed equally to this work.
Received: 16 October 2017 / Accepted: 5 December 2017 / Published: 15 December 2017
Abstract
Kernel discriminant analysis (KDA) is a dimension reduction and classification algorithm based on the nonlinear kernel trick, which can be used to treat high-dimensional and complex biological data before classification processes such as protein subcellular localization. Kernel parameters have a great impact on the performance of the KDA model. In particular, for KDA with the popular Gaussian kernel, selecting the scale parameter remains a challenging problem. This paper therefore introduces the KDA method and proposes a new method for Gaussian kernel parameter selection, based on the fact that, for a suitable kernel parameter, the difference between the reconstruction errors of edge normal samples and those of interior normal samples should be maximized. Experiments with several standard protein subcellular localization data sets show that the overall accuracy of protein classification prediction with KDA is much higher than without it. Moreover, the kernel parameter of KDA has a great impact on efficiency, and the proposed method produces an optimum parameter that makes the new algorithm perform as effectively as the traditional ones while reducing the computational time, thus improving efficiency.
Keywords:
protein subcellular localization; kernel parameter selection; kernel discriminant analysis (KDA); Gaussian kernel function; dimension reduction
1. Introduction
Some proteins function in only one specific location in the cell, while others act in several locations [1]. Generally, a protein can function correctly only when it is localized to the correct subcellular location [2]. Therefore, protein subcellular localization prediction is an important research area of proteomics: it helps to predict protein function and to understand the interaction and regulation mechanisms of proteins [3]. Many experimental methods have been used to determine protein subcellular location, such as green fluorescent protein labeling [4] and mass spectrometry [5]. However, these traditional experimental methods usually have technical limitations and are costly in time and money. Thus, prediction of protein subcellular location based on machine learning has become a research focus in bioinformatics [6,7,8].
When we use machine learning methods to predict protein subcellular location, we must first extract features of protein sequences. Feature extraction yields vectors, which are then processed by a classifier. However, these vectors are usually complex due to their high dimensionality and nonlinear properties. In order to improve the prediction accuracy of protein subcellular location, an appropriate nonlinear method for reducing data dimension should be used before classification. Kernel discriminant analysis (KDA) [9] is a nonlinear dimension reduction algorithm based on the kernel trick that has been used in many fields, such as facial recognition and fingerprint identification. The KDA method not only reduces data dimensionality but also makes use of class information. This paper newly introduces the KDA method to predict protein subcellular location. The KDA algorithm first maps sample data to a high-dimensional feature space via a kernel function and then performs linear discriminant analysis (LDA) in that feature space [10], which means that kernel parameter selection significantly affects the algorithm performance.
There are classical algorithms for selecting the kernel parameter, such as the genetic algorithm and the grid-searching algorithm. These methods are accurate but computationally expensive. To reduce this computational burden, Xiao et al. recently proposed a method based on the reconstruction errors of samples and used it to select the parameters of Gaussian kernel principal component analysis (KPCA) for novelty detection [11]. Their method was applied to toy data sets and UCI (University of California, Irvine) benchmark data sets to demonstrate its correctness. However, their innovation addresses the KPCA method, which aims at dimension reduction rather than discriminant analysis, leading to unsatisfactory classification prediction accuracy. Thus, it is necessary to improve the efficiency of the method in [11], especially for complex data such as biological data.
In this paper, an improved algorithm for selecting the Gaussian kernel parameter in KDA is proposed to analyze complex protein data and predict subcellular location. By maximizing the difference in reconstruction errors between edge normal samples and interior normal samples, the proposed method not only matches the effectiveness of the traditional grid-searching method, but also reduces the computational time and thus improves efficiency.
2. Results and Discussion
In this section, the proposed method (in Section 3.4) and the grid-searching algorithm (in Section 4.4) are both applied to predict protein subcellular localization. We use two standard data sets as the experimental data. The two feature expressions used are generated from the PSSM (position-specific scoring matrix) [12]: the PsePSSM (pseudo position-specific scoring matrix) [12] and the PSSM-S (AAO + PSSM-AAO + PSSM-SAC + PSSM-SD) [13]. Here, AAO denotes the consensus-sequence-based occurrence, PSSM-AAO the evolutionary-based occurrence (semi-occurrence) of the PSSM, PSSM-SD the segmented distribution of the PSSM, and PSSM-SAC the segmented auto-covariance of the PSSM. The k-nearest neighbors (KNN) algorithm is used as the classifier, with Euclidean distance adopted as the distance between samples. The flow of the experiments is as follows.
First, for each standard data set, we use the PsePSSM algorithm and the PSSM-S algorithm to extract features, respectively. In total, we obtain four sample sets: GN1000 (Gram-negative with PsePSSM, which contains 1000 features), GN220 (Gram-negative with PSSM-S, which contains 220 features), GP1000 (Gram-positive with PsePSSM, which contains 1000 features) and GP220 (Gram-positive with PSSM-S, which contains 220 features).
Second, we use the proposed method to select the optimum kernel parameter for the Gaussian KDA model and then use KDA to reduce the dimension of the sample sets. The same procedure is also carried out with the traditional grid-searching method for comparison with the proposed method.
Finally, we use the KNN algorithm to classify the dimension-reduced sample sets and use several criteria to evaluate and compare the results.
Some details of the experiments are as follows. For every sample set, we choose the class that contains the most samples to form the training set [8]. Let $\mathrm{S}=\left[0.1,0.2,0.3,0.4,1,2,3,4\right]$ be a candidate set of the Gaussian kernel parameter, chosen empirically. When we use the KDA algorithm to reduce dimension, the number of retained eigenvectors must be less than or equal to $\mathrm{C}-1$ ($\mathrm{C}$ is the number of classes). Therefore, for sample sets GN1000 and GN220, the number of retained eigenvectors, denoted as $\mathrm{d}$, can range from 1 to 7. For the sample sets GP1000 and GP220, $\mathrm{d}$ can be 1, 2, or 3. As for the parameter $\mathrm{u}$, good classification can be achieved when it is 5–8% of the average number of samples [14]. Besides, we demonstrate the robustness of the proposed method under variation of $\mathrm{u}$ in Section 2.2, so here we simply pick a typical value for $\mathrm{u}$, namely 8. To sum up, in the following experiments, when certain parameters need to be fixed, their default values are as follows: $\mathrm{d}$ is 7 for sample sets GN1000 and GN220 and 3 for GP1000 and GP220; $\mathrm{u}$ is 8; and the $\mathrm{k}$ value in the KNN classifier is 20.
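As context for the setup above, the KNN classification step with Euclidean distance can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `knn_predict` and its arguments are our own:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=20):
    """Minimal k-nearest-neighbors classifier with Euclidean distance,
    as used in the experiments (k = 20 by default)."""
    preds = []
    for x in X_test:
        # Euclidean distances from the query to every training sample
        d = np.linalg.norm(X_train - x, axis=1)
        # labels of the k closest training samples
        nearest = y_train[np.argsort(d)[:k]]
        # majority vote over the k neighbors
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```

A library implementation such as scikit-learn's `KNeighborsClassifier` would serve equally well; the sketch only makes the distance metric and voting rule explicit.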
2.1. The Comparison Results of the Overall Accuracy
2.1.1. The Accuracy Comparison between the Proposed Method and the GridSearching Method
In this section, first, the proposed method and the grid-searching method are respectively used in the prediction of protein subcellular localization with different $\mathrm{d}$ values. The experimental results are presented in Figure 1.
Figure 1 shows, for all four sample sets, that when the KDA algorithm is used to reduce dimension, the larger the number of retained eigenvectors, the higher the accuracy. The overall accuracy of the proposed method is always the same as that of the grid-searching method, regardless of the value of $\mathrm{d}$, so the proposed method is effective for selecting the optimal Gaussian kernel parameter.
The analyses and experiments further show that the main advantage of the proposed method is its low runtime, as demonstrated in Table 1 and Figure 2.
In Table 1, ${\mathrm{t}}_{1}$ and ${\mathrm{t}}_{2}$ are the runtimes of the proposed method and the grid-searching method, respectively. The overall accuracy and the ratio of ${\mathrm{t}}_{1}$ to ${\mathrm{t}}_{2}$ are presented in both Table 1 and Figure 2, from which we can see that, for each sample set, the accuracy of the proposed method is always the same as that of the grid-searching method; meanwhile, the runtime of the former is about 70–80% of that of the latter, indicating that the proposed method is more efficient than the grid-searching method.
2.1.2. The Comparison between Methods with and without KDA
In this experiment, we compare the overall accuracies with and without the KDA algorithm, with the $\mathrm{k}$ value of the KNN classifier varying from 1 to 30. The experimental results are shown in Figure 3.
For each sample set, Figure 3 shows that the accuracy with KDA-based dimension reduction is higher than without it. However, the kernel parameter has a great impact on the efficiency of the KDA algorithm, and the proposed method can be used to select the optimum parameter that makes KDA perform at its best. Therefore, accuracy can be improved by using the proposed method to predict protein subcellular localization.
2.2. The Robustness of the Proposed Method
In the proposed method, the value of $\mathrm{u}$ affects the radius of the neighborhood and hence the number of selected internal and edge samples. Figure 4 shows the experimental results when the value of $\mathrm{u}$ ranges from 6 to 10, giving the overall accuracies of the proposed method and the grid-searching method.
It is easily seen from Figure 4 that the accuracy remains unchanged for different $\mathrm{u}$ values; the number of selected internal and edge samples has little effect on the performance of the proposed method. Therefore, the method proposed in this paper is robust.
2.3. Evaluating the Proposed Method with Some Regular Evaluation Criteria
3. Methods
3.1. Protein Subcellular Localization Prediction Based on KDA
To improve the localization prediction accuracy, it is necessary to reduce the dimension of high-dimensional protein data before subcellular classification. The flow of protein subcellular localization prediction is presented in Figure 5.
As shown in Figure 5, first, for a standard data set, features of protein sequences such as PSSM-based expressions are extracted to form the sample sets; the specific feature expressions used in this paper are discussed in Section 4.2. Second, the kernel parameter is selected from an interval, based on the sample sets, to reach its optimal value in the KDA model. Third, with this optimal value, we use KDA to reduce the dimension of the sample sets. Lastly, the low-dimensional data are processed by a classifier to produce the final prediction.
In the whole process of Figure 5, dimension reduction with KDA is very important; within it, kernel selection is a key step and constitutes the research focus of this paper. Kernel selection includes the choice of the type of kernel function and the choice of the kernel parameters. In this paper, the Gaussian kernel function is adopted for KDA because of its good analytical properties, learning performance, and universality. So, the emphasis of this study is on deciding the scale parameter of the Gaussian kernel, which plays an important role in the process of dimensionality reduction and has a great influence on prediction results. We put forward a method for selecting the optimum Gaussian kernel parameter, starting from the reconstruction error idea in [15].
3.2. Algorithm Principle
The kernel method constructs a subspace in the feature space by the kernel trick, which makes normal samples lie in or near this subspace, while novel samples are far from it. The reconstruction error is the distance of a sample in the feature space from the subspace [11], so the reconstruction errors of normal samples should differ from those of novel samples. In this paper, we use Gaussian KDA as the dimension reduction algorithm. Since the values of the reconstruction errors are influenced by the Gaussian kernel parameter, suitable parameters should differentiate the reconstruction errors of normal samples from those of novel samples [11].
In the input space, we usually call the samples on the boundary edge samples and those within the boundary internal samples [16,17]. The edge samples are much closer to novel samples than the internal samples, while the internal samples are much closer to normal states than the edge samples [11]. Since there are no novel samples in the data sets, we usually treat the internal samples as the normal samples and the edge samples as the novel samples. Therefore, the principle is that the optimal kernel parameter makes the reconstruction errors of the internal samples and the edge samples reasonably different.
3.3. Kernel Discriminant Analysis (KDA) and Its Reconstruction Error
KDA is an algorithm obtained by applying the kernel trick to linear discriminant analysis (LDA). LDA is a linear dimensionality reduction algorithm combined with discriminative classification, which aims to find a direction that maximizes the between-class scatter while minimizing the within-class scatter [18]. In order to extend the LDA theory to nonlinear data, Mika et al. proposed the KDA algorithm, which makes nonlinear data linearly separable in a much higher-dimensional feature space [9]. The principle of the KDA algorithm is as follows.
Suppose the $\mathrm{N}$ samples in $\mathrm{X}$ can be divided into $\mathrm{C}$ classes and the $\mathrm{i}$th class contains ${\mathrm{N}}_{\mathrm{i}}$ samples satisfying $\mathrm{N}=\sum_{\mathrm{i}=1}^{\mathrm{C}}{\mathrm{N}}_{\mathrm{i}}$. The between-class scatter matrix ${\mathrm{S}}_{\mathrm{b}}^{\varphi}$ and the within-class scatter matrix ${\mathrm{S}}_{\mathrm{w}}^{\varphi}$ of $\mathrm{X}$ are defined in the following equations, respectively:
where ${\mathrm{m}}_{\mathrm{i}}^{\mathsf{\varphi}}=\frac{1}{{\mathrm{N}}_{\mathrm{i}}}{\displaystyle \sum _{\mathrm{j}=1}^{{\mathrm{N}}_{\mathrm{i}}}\mathsf{\varphi}\left({\mathrm{x}}_{\mathrm{j}}^{\mathrm{i}}\right)}$ is the mean vector of the $\mathrm{i}$th class, and ${\mathrm{m}}^{\mathsf{\varphi}}=\frac{1}{\mathrm{N}}{\displaystyle \sum _{\mathrm{i}=1}^{\mathrm{N}}\mathsf{\varphi}\left({\mathrm{x}}_{\mathrm{i}}\right)}$ is the total mean of $\mathrm{X}$. To find the optimal linear discriminant, we need to maximize $\mathrm{J}\left(\mathrm{W}\right)$ as follows:
where $\mathrm{W}={[{\mathrm{w}}_{1},{\mathrm{w}}_{2},\cdots ,{\mathrm{w}}_{\mathrm{d}}]}^{\mathrm{T}}\left(1\le \mathrm{d}\le \mathrm{C}-1\right)$ is a projection matrix, and ${\mathrm{w}}_{\mathrm{k}}\left(\mathrm{k}=1,2,\cdots ,\mathrm{d}\right)$ is a column vector with $\mathrm{N}$ elements. Through some algebra, it can be deduced that $\mathrm{W}$ is made up of the eigenvectors corresponding to the top $\mathrm{d}$ eigenvalues of ${\left({\mathrm{S}}_{\mathrm{w}}^{\varphi}\right)}^{-1}{\mathrm{S}}_{\mathrm{b}}^{\varphi}$. Also, the projection vector ${\mathrm{w}}_{\mathrm{k}}$ can be represented by a linear combination of the samples in the feature space:
where ${\mathrm{a}}_{\mathrm{j}}^{\mathrm{k}}$ is a real coefficient. The projection of the sample $\mathrm{X}$ onto ${\mathrm{w}}_{\mathrm{k}}$ is given by:
$${\mathrm{S}}_{\mathrm{b}}^{\varphi}=\sum_{\mathrm{i}=1}^{\mathrm{C}}{\mathrm{N}}_{\mathrm{i}}\left({\mathrm{m}}_{\mathrm{i}}^{\varphi}-{\mathrm{m}}^{\varphi}\right)\left({\mathrm{m}}_{\mathrm{i}}^{\varphi}-{\mathrm{m}}^{\varphi}\right)^{\mathrm{T}}$$
$${\mathrm{S}}_{\mathrm{w}}^{\varphi}=\sum_{\mathrm{i}=1}^{\mathrm{C}}\sum_{\mathrm{j}=1}^{{\mathrm{N}}_{\mathrm{i}}}\left[\varphi\left({\mathrm{x}}_{\mathrm{j}}^{\mathrm{i}}\right)-{\mathrm{m}}_{\mathrm{i}}^{\varphi}\right]\left[\varphi\left({\mathrm{x}}_{\mathrm{j}}^{\mathrm{i}}\right)-{\mathrm{m}}_{\mathrm{i}}^{\varphi}\right]^{\mathrm{T}}$$
$$\mathrm{m}\mathrm{a}\mathrm{x}\mathrm{J}\left(\mathrm{W}\right)=\frac{{\mathrm{W}}^{\mathrm{T}}{\mathrm{S}}_{\mathrm{b}}^{\mathsf{\varphi}}\mathrm{W}}{{\mathrm{W}}^{\mathrm{T}}{\mathrm{S}}_{\mathrm{w}}^{\mathsf{\varphi}}\mathrm{W}}$$
$${\mathrm{w}}_{\mathrm{k}}=\sum_{\mathrm{j}=1}^{\mathrm{N}}{\mathrm{a}}_{\mathrm{j}}^{\mathrm{k}}\varphi\left({\mathrm{x}}_{\mathrm{j}}\right)$$
$${\mathrm{w}}_{\mathrm{k}}^{\mathrm{T}}\varphi\left(\mathrm{x}\right)=\sum_{\mathrm{j}=1}^{\mathrm{N}}{\mathrm{a}}_{\mathrm{j}}^{\mathrm{k}}\mathrm{K}\left(\mathrm{x},{\mathrm{x}}_{\mathrm{j}}\right)$$
Let $\mathrm{a}=\left[{\mathrm{a}}^{1},{\mathrm{a}}^{2},\cdots ,{\mathrm{a}}^{\mathrm{d}}\right]{}^{\mathrm{T}}$ be the coefficient matrix where ${\mathrm{a}}^{\mathrm{k}}=\left[{\mathrm{a}}_{1}^{\mathrm{k}},{\mathrm{a}}_{2}^{\mathrm{k}},\cdots ,{\mathrm{a}}_{\mathrm{N}}^{\mathrm{k}}\right]{}^{\mathrm{T}}$ is the coefficient vector. Combining Equations (1)–(5), we can obtain the linear discriminant by maximizing the function $\mathrm{J}\left(\mathrm{a}\right)$:
where $\tilde{\mathrm{M}}=\sum_{\mathrm{i}=1}^{\mathrm{C}}{\mathrm{N}}_{\mathrm{i}}\left({\mathrm{M}}_{\mathrm{i}}-\mathrm{M}\right)\left({\mathrm{M}}_{\mathrm{i}}-\mathrm{M}\right)^{\mathrm{T}}$, $\tilde{\mathrm{L}}=\sum_{\mathrm{i}=1}^{\mathrm{C}}{\mathrm{K}}_{\mathrm{i}}\left(\mathrm{E}-\frac{1}{{\mathrm{N}}_{\mathrm{i}}}\mathrm{I}\right){\mathrm{K}}_{\mathrm{i}}^{\mathrm{T}}$, the kth component of the vector ${\mathrm{M}}_{\mathrm{i}}$ is ${\left({\mathrm{M}}_{\mathrm{i}}\right)}_{\mathrm{k}}=\frac{1}{{\mathrm{N}}_{\mathrm{i}}}\sum_{\mathrm{j}=1}^{{\mathrm{N}}_{\mathrm{i}}}\mathrm{K}\left({\mathrm{x}}_{\mathrm{k}},{\mathrm{x}}_{\mathrm{j}}^{\mathrm{i}}\right)\left(\mathrm{k}=1,2,\cdots ,\mathrm{N}\right)$, the kth component of the vector $\mathrm{M}$ is ${\left(\mathrm{M}\right)}_{\mathrm{k}}=\frac{1}{\mathrm{N}}\sum_{\mathrm{j}=1}^{\mathrm{N}}\mathrm{K}\left({\mathrm{x}}_{\mathrm{k}},{\mathrm{x}}_{\mathrm{j}}\right)\left(\mathrm{k}=1,2,\cdots ,\mathrm{N}\right)$, ${\mathrm{K}}_{\mathrm{i}}$ is an $\mathrm{N}\times {\mathrm{N}}_{\mathrm{i}}$ matrix with ${\left({\mathrm{K}}_{\mathrm{i}}\right)}_{\mathrm{m}\mathrm{n}}=\mathrm{K}\left({\mathrm{x}}_{\mathrm{m}},{\mathrm{x}}_{\mathrm{n}}^{\mathrm{i}}\right)$, $\mathrm{E}$ is the ${\mathrm{N}}_{\mathrm{i}}\times {\mathrm{N}}_{\mathrm{i}}$ identity matrix, and $\frac{1}{{\mathrm{N}}_{\mathrm{i}}}\mathrm{I}$ is the ${\mathrm{N}}_{\mathrm{i}}\times {\mathrm{N}}_{\mathrm{i}}$ matrix whose elements are all $\frac{1}{{\mathrm{N}}_{\mathrm{i}}}$ [9]. Then, the projection matrix $\mathrm{a}$ is made up of the eigenvectors corresponding to the top $\mathrm{d}$ eigenvalues of ${\tilde{\mathrm{L}}}^{-1}\tilde{\mathrm{M}}$.
$$\mathrm{m}\mathrm{a}\mathrm{x}\mathrm{J}\left(\mathrm{a}\right)=\frac{{\mathrm{a}}^{\mathrm{T}}\tilde{\mathrm{M}}\mathrm{a}}{{\mathrm{a}}^{\mathrm{T}}\tilde{\mathrm{L}}\mathrm{a}}$$
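To make the derivation concrete, the eigenproblem above can be sketched in Python with NumPy. This is a minimal illustration under our own naming (`gaussian_kda`, `kda_project`), and the small ridge term added to $\tilde{\mathrm{L}}$ for numerical stability is our assumption rather than part of the paper:

```python
import numpy as np

def gaussian_kda(X, y, s=1.0, d=None, reg=1e-6):
    """Sketch of Gaussian-kernel KDA. Returns the coefficient matrix
    `a` (N x d), whose columns span the discriminant directions in
    feature space, together with the kernel matrix K. `reg` is a
    small ridge added to L~ for numerical stability (our choice)."""
    N = len(X)
    classes = np.unique(y)
    C = len(classes)
    d = d or C - 1                        # at most C-1 discriminants
    # Gaussian kernel matrix: K(xi, xj) = exp(-||xi - xj||^2 / s^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / s ** 2)
    M = K.mean(axis=1)                    # (M)_k averages over all samples
    Mt = np.zeros((N, N))                 # M~ (between-class term)
    Lt = np.zeros((N, N))                 # L~ (within-class term)
    for c in classes:
        idx = np.where(y == c)[0]
        Ni = len(idx)
        Ki = K[:, idx]                    # N x Ni block against class c
        Mi = Ki.mean(axis=1)
        diff = Mi - M
        Mt += Ni * np.outer(diff, diff)
        center = np.eye(Ni) - np.full((Ni, Ni), 1.0 / Ni)  # E - (1/Ni) I
        Lt += Ki @ center @ Ki.T
    # top-d eigenvectors of L~^{-1} M~
    vals, vecs = np.linalg.eig(np.linalg.solve(Lt + reg * np.eye(N), Mt))
    order = np.argsort(-vals.real)[:d]
    return vecs[:, order].real, K

def kda_project(a, K):
    """Project the training samples: row i of the result is t(x_i)."""
    return (a.T @ K).T                    # N x d reduced representation
```

The sketch follows the construction of $\tilde{\mathrm{M}}$ and $\tilde{\mathrm{L}}$ given above; a production implementation would handle ill-conditioning of $\tilde{\mathrm{L}}$ more carefully.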
According to the KDA algorithm principle in (3) or (6), besides the Gaussian kernel parameter $\mathrm{s}$, the number of retained eigenvectors $\mathrm{d}$ also affects the algorithm performance. In this paper, the proposed method is mainly used to screen an optimum $\mathrm{s}$ under a predetermined $\mathrm{d}$ value.
The Gaussian kernel function is defined as follows:
where $\mathsf{\sigma}$ is the scale parameter which is generally estimated by $\mathrm{s}$. Note that ${\Vert \mathsf{\varphi}\left(\mathrm{x}\right)\Vert}^{2}=\mathrm{K}\left(\mathrm{x},\mathrm{x}\right)=1$.
$$\mathrm{K}\left({\mathrm{x}}_{\mathrm{i}},{\mathrm{x}}_{\mathrm{j}}\right)=\mathrm{exp}\left(-\frac{{\Vert {\mathrm{x}}_{\mathrm{i}}-{\mathrm{x}}_{\mathrm{j}}\Vert}^{2}}{{\sigma}^{2}}\right)$$
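A direct transcription of this kernel (a trivial sketch; the name `gaussian_kernel` is ours) also makes the property $\mathrm{K}(\mathrm{x},\mathrm{x})=1$ easy to check:

```python
import numpy as np

def gaussian_kernel(xi, xj, s=1.0):
    """Gaussian kernel K(xi, xj) = exp(-||xi - xj||^2 / s^2), with s
    standing in for the scale parameter sigma."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.sum(diff ** 2) / s ** 2))
```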
The kernelbased reconstruction error is defined in the following equation:
where t(x) is the vector obtained by projecting $\mathsf{\varphi}\left(\mathrm{x}\right)$ onto a projection matrix $\mathrm{a}$.
$$\mathrm{R}\mathrm{E}\left(\mathrm{x}\right)={\Vert \varphi\left(\mathrm{x}\right)-\mathrm{W}\,\mathrm{t}\left(\mathrm{x}\right)\Vert}^{2}={\Vert \varphi\left(\mathrm{x}\right)\Vert}^{2}-{\Vert \mathrm{t}\left(\mathrm{x}\right)\Vert}^{2}=\mathrm{K}\left(\mathrm{x},\mathrm{x}\right)-{\Vert \mathrm{t}\left(\mathrm{x}\right)\Vert}^{2}$$
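Given the identity above (which holds when the projection directions are orthonormal in the feature space), the reconstruction error for a Gaussian kernel reduces to $1-{\Vert \mathrm{t}(\mathrm{x})\Vert}^{2}$. A minimal sketch, assuming a precomputed coefficient matrix `a` and kernel vector `k_x` (both hypothetical inputs, with our own names):

```python
import numpy as np

def reconstruction_error(a, k_x):
    """RE(x) = K(x, x) - ||t(x)||^2, with K(x, x) = 1 for the Gaussian
    kernel. Here t(x) = a^T k_x, where k_x[i] = K(x, x_i) against the
    N training samples and `a` is the N x d coefficient matrix."""
    t = a.T @ np.asarray(k_x, dtype=float)   # d-dimensional projection
    return 1.0 - float(t @ t)
```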
3.4. The Proposed Method for Selecting the Optimum Gaussian Kernel Parameter
The method of kernel parameter selection relies on the reconstruction errors of the internal samples and the edge samples. Therefore, first we find a method to select the edge samples and the interior samples, then we propose the method for selection of the Gaussian kernel parameter.
3.4.1. The Method for Selecting Internal and Edge Samples
Li and Maguire presented a border-edge pattern selection method (BEPS) to select edge samples based on local geometric information [16]. Xiao et al. [11] modified the BEPS algorithm so that it can select both edge samples and internal samples. However, their algorithm risks making every sample in the training set an edge sample. For example, when all samples are distributed on a spherical surface in a three-dimensional space, every sample in the data set will be selected as an edge sample, since its neighbors all lie on one side of its tangent plane. In order to solve this problem, this paper combines the ideas in [19,20] to select the internal and edge samples, respectively, without depending on local geometric information. The main principle is that an edge sample is usually surrounded by samples belonging to other classes, while an internal sample is usually surrounded by samples of its own class. Further, edge samples are usually far from the centroid of their class, while internal samples are usually close to it. So, a sample is selected as an edge sample if it is far from the centroid of its class and there are samples around it that belong to other classes; otherwise, it is selected as an internal sample.
Specifically, suppose the $\mathrm{i}$th class ${\mathrm{X}}_{\mathrm{i}}=\left\{{\mathrm{x}}_{1},{\mathrm{x}}_{2},\cdots ,{\mathrm{x}}_{{\mathrm{N}}_{\mathrm{i}}}\right\}$ in the sample set $\mathrm{X}$ is picked out as the training set. Denote by ${\mathrm{c}}_{\mathrm{i}}$ the centroid of this class:
$${\mathrm{c}}_{\mathrm{i}}=\frac{1}{{\mathrm{N}}_{\mathrm{i}}}\sum_{\mathrm{j}=1}^{{\mathrm{N}}_{\mathrm{i}}}{\mathrm{x}}_{\mathrm{j}}$$
We use the median value $\mathrm{m}$ of the distances from all samples in a class to its centroid to decide whether a sample is far from the centroid of this class. A sample is considered far from the centroid if its distance to the centroid is greater than the median value; otherwise, the sample is considered close to the centroid.
Denote $\mathrm{d}\mathrm{i}\mathrm{s}\mathrm{t}\left({\mathrm{x}}_{\mathrm{i}},{\mathrm{x}}_{\mathrm{j}}\right)$ as the distance between any two samples ${\mathrm{x}}_{\mathrm{i}}$ and ${\mathrm{x}}_{\mathrm{j}}$, and ${\mathrm{N}}_{\epsilon}\left(\mathrm{x}\right)$ as the $\epsilon$-neighborhood of $\mathrm{x}$:
$${\mathrm{N}}_{\epsilon}\left(\mathrm{x}\right)=\left\{\mathrm{y}\mid \mathrm{d}\mathrm{i}\mathrm{s}\mathrm{t}\left(\mathrm{x},\mathrm{y}\right)\le \epsilon,\ \mathrm{y}\in \mathrm{X}\right\}$$
The neighborhood radius $\epsilon$ is given as follows. Let $\mathrm{u}$ be a given number satisfying $0<\mathrm{u}<{\mathrm{N}}_{\mathrm{i}}$. $\mathrm{D}\mathrm{e}\mathrm{n}\mathrm{s}\mathrm{i}\mathrm{t}{\mathrm{y}}_{\mathrm{u}}\left({\mathrm{X}}_{\mathrm{i}}\right)$ is the mean neighborhood radius of ${\mathrm{X}}_{\mathrm{i}}$ for the given number $\mathrm{u}$:
where $\mathrm{d}\mathrm{i}\mathrm{s}{\mathrm{t}}_{\mathrm{u}}\left({\mathrm{x}}_{\mathrm{i}}\right)$ is the distance from ${\mathrm{x}}_{\mathrm{i}}$ to its uth nearest neighbor. So, $\mathrm{D}\mathrm{e}\mathrm{n}\mathrm{s}\mathrm{i}\mathrm{t}{\mathrm{y}}_{\mathrm{u}}\left({\mathrm{X}}_{\mathrm{i}}\right)$ is used as the value of $\mathsf{\epsilon}$ for the training set ${\mathrm{X}}_{\mathrm{i}}$. The flow for the selection of the internal and edge samples is shown in Table 4.
$$\mathrm{D}\mathrm{e}\mathrm{n}\mathrm{s}\mathrm{i}\mathrm{t}{\mathrm{y}}_{\mathrm{u}}\left({\mathrm{X}}_{\mathrm{i}}\right)=\frac{1}{{\mathrm{N}}_{\mathrm{i}}}\sum_{\mathrm{j}=1}^{{\mathrm{N}}_{\mathrm{i}}}\mathrm{d}\mathrm{i}\mathrm{s}{\mathrm{t}}_{\mathrm{u}}\left({\mathrm{x}}_{\mathrm{j}}\right)$$
In Table 4, a sample $\mathrm{x}$ is considered an edge sample when its distance to the centroid is larger than the median $\mathrm{m}$ and ${\mathrm{N}}_{\epsilon}\left(\mathrm{x}\right)$ contains samples belonging to other classes. A sample $\mathrm{x}$ is considered an internal sample when its distance to the centroid is less than $\mathrm{m}$ and all samples in ${\mathrm{N}}_{\epsilon}\left(\mathrm{x}\right)$ belong to its own class.
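The selection flow described above (and summarized in Table 4) can be sketched as follows. This is our own reading of the procedure; the function name and the exact handling of ties at the median are our choices:

```python
import numpy as np

def select_internal_edge(X, y, target_class, u=8):
    """Sketch of the internal/edge sample selection: a sample of
    `target_class` is an edge sample if it is farther from the class
    centroid than the median distance AND its eps-neighborhood contains
    samples of other classes; it is internal if it is no farther than
    the median AND its neighborhood is pure."""
    idx = np.where(y == target_class)[0]
    Xi = X[idx]
    c = Xi.mean(axis=0)                          # class centroid
    d_cent = np.linalg.norm(Xi - c, axis=1)
    m = np.median(d_cent)                        # median distance to centroid
    # eps = Density_u(Xi): mean distance to the u-th nearest neighbor
    D = np.linalg.norm(Xi[:, None, :] - Xi[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # exclude self-distances
    eps = np.mean(np.sort(D, axis=1)[:, u - 1])
    edge, internal = [], []
    for j, x in zip(idx, X[idx]):
        dists = np.linalg.norm(X - x, axis=1)
        neigh = np.where(dists <= eps)[0]
        neigh = neigh[neigh != j]                # exclude the sample itself
        mixed = np.any(y[neigh] != target_class)
        far = np.linalg.norm(x - c) > m
        if far and mixed:
            edge.append(j)
        elif not far and not mixed:
            internal.append(j)
    return np.array(internal), np.array(edge)
```

Samples that satisfy neither condition (e.g., far from the centroid but with a pure neighborhood) are simply left out, matching the two-condition rule stated above.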
3.4.2. The Proposed Method
In order to select the optimum kernel parameter, it is necessary to propose a criterion that distinguishes the reconstruction errors of the edge samples from those of the internal samples. A suitable parameter not only maximizes the difference between the reconstruction errors of the internal samples and those of the edge samples, but also minimizes the variance (or standard deviation) of the reconstruction errors of the internal samples [11]. According to this rule, an improved objective function is proposed in this paper. The optimal Gaussian kernel parameter $\mathrm{s}$ is selected by maximizing this objective function.
where ${\Vert \cdot \Vert}_{\infty}$ is the infinity norm, which computes the maximum absolute component of a vector, and $\mathrm{s}\mathrm{t}\mathrm{d}(\cdot )$ is the standard deviation. Note that in the objective function $\mathrm{f}\left(\mathrm{s}\right)$, our key improvement is to use the infinity norm to measure the size of the reconstruction-error vector, since it leads to a higher accuracy than many other measures, which has been verified by a series of our experiments. The reason is probably that the maximum component evaluates the size of a reconstruction-error vector more reasonably than alternatives such as the $1$-norm, the $\mathrm{p}$-norm $\left(1<\mathrm{p}<+\infty \right)$, and the minimum component used in [11].
$$\mathrm{s}=\mathrm{arg}\,\underset{\mathrm{s}}{\mathrm{max}}\,\mathrm{f}\left(\mathrm{s}\right)=\mathrm{arg}\,\underset{\mathrm{s}}{\mathrm{max}}\,\frac{{\Vert \mathrm{R}\mathrm{E}\left({\Omega}_{\mathrm{e}\mathrm{d}}\right)\Vert}_{\infty}-{\Vert \mathrm{R}\mathrm{E}\left({\Omega}_{\mathrm{i}\mathrm{n}}\right)\Vert}_{\infty}}{\mathrm{s}\mathrm{t}\mathrm{d}\left\{\mathrm{R}\mathrm{E}\left({\Omega}_{\mathrm{i}\mathrm{n}}\right)\right\}}$$
According to (8), once the number of retained eigenvectors is determined, we can select the optimum parameter $\mathrm{s}$ from a candidate set using the proposed method. The optimum parameter ensures that the Gaussian KDA algorithm performs well in dimensionality reduction, which improves the accuracy of protein subcellular location prediction. The proposed method for selecting the Gaussian kernel parameter is presented in Table 5.
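The selection rule can be expressed as a short search over the candidate set. This is a sketch under our own naming; `re_fn` is a hypothetical callback that, for a given parameter s, would fit the Gaussian KDA model and return the reconstruction errors of the edge and internal samples:

```python
import numpy as np

def objective(re_edge, re_in):
    """f(s): gap between the infinity norms of the edge and internal
    reconstruction-error vectors, normalized by the standard deviation
    of the internal errors."""
    return (np.max(np.abs(re_edge)) - np.max(np.abs(re_in))) / np.std(re_in)

def select_kernel_parameter(candidates, re_fn):
    """Return the candidate s maximizing f(s). `re_fn(s)` is assumed
    to return the pair (RE of edge samples, RE of internal samples)."""
    scores = {s: objective(*re_fn(s)) for s in candidates}
    return max(scores, key=scores.get)
```

Compared with grid searching over accuracy, each candidate here costs only one model fit plus the reconstruction-error evaluation, which is the source of the runtime savings reported in Section 2.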
To conclude this section, we summarize once more the position of the proposed method in protein subcellular localization. First, two regularization forms of the PSSM are used to extract features from protein amino acid sequences. Then, the KDA method is applied to the extracted features for dimension reduction and discriminant analysis, according to the KDA algorithm principle in Section 3.3 with Formulas (1)–(6). During the KDA procedure, the novelty of our work is a new method for selecting the Gaussian kernel parameter, summarized in Table 5. Finally, we choose the k-nearest neighbors (KNN) classifier to classify the dimension-reduced data after KDA.
4. Materials
In this section, we introduce the other steps in Figure 5 besides the KDA model and its parameter selection, which are necessary materials for the whole experiment.
4.1. Standard Data Sets
In this paper, we use two standard data sets that have been widely used in the literature for Gram-positive and Gram-negative subcellular localization [13], whose protein sequences all come from the Swiss-Prot database.
For the Gram-positive bacteria, the standard data set found in the literature [13,14,21] is publicly available at http://www.csbio.sjtu.edu.cn/bioinf/Gposmulti/Data.htm. There are 523 locative protein sequences in the data set, distributed in four different subcellular locations. The number of proteins in each location is given in Table 6.
For the Gram-negative bacteria, the standard data set of subcellular localizations is presented in the literature [13,22] and can be downloaded freely from http://www.csbio.sjtu.edu.cn/bioinf/Gnegmulti/Data.htm. The data set contains 1456 locative protein sequences located in eight different subcellular locations. The number of proteins in each location is shown in Table 7.
4.2. Feature Expressions and Sample Sets
In the prediction of protein subcellular localization with machine learning methods, feature expressions are important pieces of information extracted from protein sequences by proper mathematical algorithms. Many efficient algorithms have been used to extract features of protein sequences; two of them, PsePSSM [12] and PSSM-S [13], are used in this paper. Both methods rely on the position-specific scoring matrix (PSSM), which is obtained by using the PSI-BLAST algorithm to search the Swiss-Prot database with the E-value parameter set to 0.01. The PSSM is defined as follows [12]:
$${\mathrm{P}}_{\mathrm{P}\mathrm{S}\mathrm{S}\mathrm{M}}=\left[\begin{array}{cccc}{\mathrm{M}}_{1\to 1}& {\mathrm{M}}_{1\to 2}& \cdots & {\mathrm{M}}_{1\to 20}\\ {\mathrm{M}}_{2\to 1}& {\mathrm{M}}_{2\to 2}& \cdots & {\mathrm{M}}_{2\to 20}\\ \vdots & \vdots & \ddots & \vdots \\ {\mathrm{M}}_{\mathrm{i}\to 1}& {\mathrm{M}}_{\mathrm{i}\to 2}& \cdots & {\mathrm{M}}_{\mathrm{i}\to 20}\\ \vdots & \vdots & \ddots & \vdots \\ {\mathrm{M}}_{\mathrm{L}\to 1}& {\mathrm{M}}_{\mathrm{L}\to 2}& \cdots & {\mathrm{M}}_{\mathrm{L}\to 20}\end{array}\right]$$

where ${\mathrm{M}}_{\mathrm{i}\to \mathrm{j}}$ represents the score obtained when the $\mathrm{i}$-th amino acid residue of the protein sequence is mutated to amino acid type $\mathrm{j}$ during the evolutionary process [12].
Note that multiple alignment methods are usually used to calculate the PSSM, and their chief drawback is that they are time-consuming. The reasons why we select the PSSM in this paper, rather than a simple multiple alignment, to form the total normalized information content are as follows. First, since our focus is to demonstrate the effectiveness of the dimension reduction algorithm, we need to construct high-dimensional feature expressions such as PsePSSM and PSSM-S, whose dimensions are as high as 1000 and 220, respectively. Second, the PSSM has many advantages, such as those described in [23]. As far as information features are concerned, the PSSM has produced the strongest discriminating features between fold members of protein sequences. Despite the time-consuming nature of constructing a PSSM for a new sequence, the feature vectors extracted from it are so informative that they are worth the cost of their preparation [23]. Besides, for a new protein sequence, the PSSM only needs to be constructed once and can then be reused to produce new normalization forms such as PsePSSM and PSSM-S.
4.2.1. Pseudo PositionSpecific Scoring Matrix (PsePSSM)
Let $\mathrm{P}$ be a protein sample of length $\mathrm{L}$. The PsePSSM of $\mathrm{P}$ is defined as follows [12]:

$${\mathrm{P}}_{\mathrm{P}\mathrm{s}\mathrm{e}\mathrm{P}\mathrm{S}\mathrm{S}\mathrm{M}}^{\mathsf{\xi}}={\left[{\overline{\mathrm{M}}}_{1}\ {\overline{\mathrm{M}}}_{2}\cdots {\overline{\mathrm{M}}}_{20}\ {\mathrm{G}}_{1}^{\mathsf{\xi}}\ {\mathrm{G}}_{2}^{\mathsf{\xi}}\cdots {\mathrm{G}}_{20}^{\mathsf{\xi}}\right]}^{\mathrm{T}}\quad(\mathsf{\xi}=0,1,2,\cdots ,49)$$

$${\overline{\mathrm{M}}}_{\mathrm{j}}=\frac{1}{\mathrm{L}}{\displaystyle \sum _{\mathrm{i}=1}^{\mathrm{L}}{\mathrm{M}}_{\mathrm{i}\to \mathrm{j}}}\quad(\mathrm{j}=1,2,\cdots ,20)$$

$${\mathrm{G}}_{\mathrm{j}}^{\mathsf{\xi}}=\frac{1}{\mathrm{L}-\mathsf{\xi}}{\displaystyle \sum _{\mathrm{i}=1}^{\mathrm{L}-\mathsf{\xi}}{\left[{\mathrm{M}}_{\mathrm{i}\to \mathrm{j}}-{\mathrm{M}}_{\left(\mathrm{i}+\mathsf{\xi}\right)\to \mathrm{j}}\right]}^{2}}\quad(\mathrm{j}=1,2,\cdots ,20;\ \mathsf{\xi}<\mathrm{L})$$

where ${\mathrm{G}}_{\mathrm{j}}^{\mathsf{\xi}}$ is the correlation factor obtained by coupling PSSM scores that are $\mathsf{\xi}$ positions apart along the sequence [22]. According to the definition of PsePSSM, a protein sequence can be represented by a 1000-dimensional vector.
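The formulas above translate directly into code. The sketch below assumes the PSSM is stored as an $\mathrm{L}\times 20$ NumPy array and reads the $\mathsf{\xi}=0$ term as contributing the 20 averages ${\overline{\mathrm{M}}}_{\mathrm{j}}$, with lags $\mathsf{\xi}=1,\cdots ,49$ concatenated afterwards to reach the stated 1000 dimensions; the function name `pse_pssm` and this reading of the lag range are our own assumptions.

```python
import numpy as np

def pse_pssm(M, xi_max=49):
    """Compute a PsePSSM-style feature vector from an L x 20 PSSM `M`.

    Returns the 20 column averages followed by, for each lag
    xi = 1..xi_max, the 20 averaged squared differences
    G_j^xi = (1/(L-xi)) * sum_i (M[i,j] - M[i+xi,j])**2,
    for a total of 20 + 20*xi_max dimensions (1000 when xi_max = 49).
    Assumes L > xi_max.
    """
    feats = [M.mean(axis=0)]                    # the 20 averages M̄_j
    for xi in range(1, xi_max + 1):
        diff = M[:-xi] - M[xi:]                 # M_{i→j} − M_{(i+ξ)→j}, L−ξ rows
        feats.append((diff ** 2).mean(axis=0))  # mean over i divides by L − ξ
    return np.concatenate(feats)
```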
4.2.2. PSSMS
Dehzangi et al. [13] put forward a new feature extraction method, PSSM-S, which combines four components: AAO, PSSM-AAO, PSSM-SD, and PSSM-SAC. According to the definition of PSSM-S, a protein sequence can be represented as a feature vector with 220 (20 + 20 + 80 + 100) elements.
4.2.3. Sample Sets
For the two benchmark data sets, PsePSSM and PSSM-S are used to extract features, respectively. In this way, we obtain four experimental sample sets, GN1000, GN220, GP1000, and GP220, shown in Table 8.
4.3. Evaluation Criterion
To evaluate the performance of the proposed method, we use jackknife cross-validation (also known as leave-one-out cross-validation), which has been widely used in protein subcellular localization [13] and is regarded as the most objective and rigorous cross-validation procedure for examining the accuracy of a predictor [24,25]. In the jackknife test, each of the $\mathrm{N}$ proteins in the sample set is removed in turn and used as the test sample, while the predictor is trained on the remaining $\mathrm{N}-1$ proteins [26,27]. In addition, we use several criteria to assess the experimental results, defined as follows [12]:
$$\mathrm{M}\mathrm{C}\mathrm{C}\left(\mathrm{k}\right)=\frac{\mathrm{T}{\mathrm{P}}_{\mathrm{k}}\times \mathrm{T}{\mathrm{N}}_{\mathrm{k}}-\mathrm{F}{\mathrm{N}}_{\mathrm{k}}\times \mathrm{F}{\mathrm{P}}_{\mathrm{k}}}{\sqrt{\left(\mathrm{T}{\mathrm{P}}_{\mathrm{k}}+\mathrm{F}{\mathrm{N}}_{\mathrm{k}}\right)\left(\mathrm{T}{\mathrm{P}}_{\mathrm{k}}+\mathrm{F}{\mathrm{P}}_{\mathrm{k}}\right)\left(\mathrm{T}{\mathrm{N}}_{\mathrm{k}}+\mathrm{F}{\mathrm{P}}_{\mathrm{k}}\right)\left(\mathrm{T}{\mathrm{N}}_{\mathrm{k}}+\mathrm{F}{\mathrm{N}}_{\mathrm{k}}\right)}}\times 100\%$$

$$\mathrm{S}\mathrm{e}\mathrm{n}\left(\mathrm{k}\right)=\frac{\mathrm{T}{\mathrm{P}}_{\mathrm{k}}}{\mathrm{T}{\mathrm{P}}_{\mathrm{k}}+\mathrm{F}{\mathrm{N}}_{\mathrm{k}}}\times 100\%$$

$$\mathrm{S}\mathrm{p}\mathrm{e}\left(\mathrm{k}\right)=\frac{\mathrm{T}{\mathrm{N}}_{\mathrm{k}}}{\mathrm{T}{\mathrm{N}}_{\mathrm{k}}+\mathrm{F}{\mathrm{P}}_{\mathrm{k}}}\times 100\%$$

$$\mathrm{Q}=\frac{{\displaystyle \sum _{\mathrm{k}=1}^{\mathrm{C}}\mathrm{T}{\mathrm{P}}_{\mathrm{k}}}}{\mathrm{N}}\times 100\%$$

where $\mathrm{T}{\mathrm{P}}_{\mathrm{k}}$ is the number of true positives, $\mathrm{T}{\mathrm{N}}_{\mathrm{k}}$ the number of true negatives, $\mathrm{F}{\mathrm{P}}_{\mathrm{k}}$ the number of false positives, and $\mathrm{F}{\mathrm{N}}_{\mathrm{k}}$ the number of false negatives for class $\mathrm{k}$ [12]. The Matthews correlation coefficient ($\mathrm{M}\mathrm{C}\mathrm{C}$) varies between −1 and 1, from complete disagreement to perfect classification. The values of specificity (Spe), sensitivity (Sen), and the overall accuracy (Q) all vary between 0 and 1; the closer they are to 1, the better the classification, and the closer to 0, the worse [13].
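These criteria are straightforward to compute from one-vs-rest counts for a given class. The following sketch uses the standard definitions (in particular, Spe is computed as TN/(TN + FP)); the function name `class_metrics` is our own.

```python
import numpy as np

def class_metrics(y_true, y_pred, k):
    """Per-class Sen, Spe, and MCC for class k (one-vs-rest counts),
    plus the overall accuracy Q over all classes."""
    t = np.asarray(y_true)
    p = np.asarray(y_pred)
    TP = np.sum((t == k) & (p == k))   # true positives for class k
    TN = np.sum((t != k) & (p != k))   # true negatives
    FP = np.sum((t != k) & (p == k))   # false positives
    FN = np.sum((t == k) & (p != k))   # false negatives
    sen = TP / (TP + FN)
    spe = TN / (TN + FP)
    mcc = (TP * TN - FN * FP) / np.sqrt(
        (TP + FN) * (TP + FP) * (TN + FP) * (TN + FN))
    q = np.mean(t == p)                # overall accuracy over all classes
    return sen, spe, mcc, q
```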
4.4. The Grid Searching Method Used as Contrast
In this section, we introduce a standard algorithm for searching the parameter $\mathrm{s}$, the grid-searching algorithm, which serves as a baseline for comparison with the algorithm proposed in Section 3.4.
The grid-searching method is commonly used to select the optimum parameter. For a candidate parameter set $\mathrm{S}$, its steps are as follows [28].
1. Compute the kernel matrix $\mathrm{K}$ for each parameter ${\mathrm{s}}_{\mathrm{i}}\in \mathrm{S},\ \mathrm{i}=1,2,\cdots ,\mathrm{m}$.
2. Use the Gaussian KDA to reduce the dimension of $\mathrm{K}$.
3. Use the KNN algorithm to classify the dimension-reduced samples.
4. Calculate the classification accuracy.
5. Repeat the above four steps until all parameters in $\mathrm{S}$ have been traversed. The parameter corresponding to the highest classification accuracy is selected as the optimum parameter.
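The loop above can be sketched compactly. In this sketch, `reduce_and_score` is a hypothetical stand-in for the per-parameter work (build the kernel matrix for $\mathrm{s}_{\mathrm{i}}$, reduce with Gaussian KDA, classify with KNN, and return the classification accuracy); only the outer parameter sweep is shown.

```python
import numpy as np

def grid_search_s(candidates, X, y, reduce_and_score):
    """Plain grid search over Gaussian kernel scale parameters.

    `reduce_and_score(s, X, y)` must return the classification
    accuracy obtained with parameter s; the parameter achieving the
    highest accuracy is selected as the optimum."""
    scores = [reduce_and_score(s, X, y) for s in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]
```

This exhaustive sweep is what makes grid searching expensive: every candidate pays the full cost of kernel construction, KDA, and classification, which is the overhead the proposed method avoids.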
5. Conclusions
Biological data are usually high-dimensional. As a result, it is necessary to reduce the dimension to improve the accuracy of protein subcellular localization prediction. Kernel discriminant analysis (KDA) based on the Gaussian kernel function is a suitable algorithm for dimension reduction in such applications. As is well known, the selection of the kernel parameter affects the performance of KDA, so it is important to choose a proper parameter that makes the algorithm perform well. To handle this problem, we propose a method for optimum kernel parameter selection that relies on the reconstruction error [15]. First, we select the edge and internal samples of the training set. Second, we compute the reconstruction errors of the selected samples. Finally, we select the kernel parameter that maximizes the objective function as the optimum.
The proposed method is applied to the prediction of protein subcellular locations for Gram-negative and Gram-positive bacteria. Compared with the grid-searching method, the proposed method achieves the same accuracy with higher efficiency.
Since the performance of the proposed method largely depends on the selection of the internal and edge samples, future work may focus on selecting more representative internal and edge samples from the biological data set to further improve the prediction accuracy of protein subcellular localization. Besides this, it is also worthwhile to investigate how to extend the proposed method so that it is suitable for selecting the parameters of other kernels.
Acknowledgments
This research is supported by grants from the National Natural Science Foundation of China (No. 11661081, No. 11561071, and No. 61472345) and the Natural Science Foundation of Yunnan Province (2017FA032).
Author Contributions
Shunfang Wang and Bing Nie designed the research and the experiments. Wenjia Li, Dongshu Xu, and Bing Nie extracted the feature expressions from the standard data sets, and Bing Nie performed all the other numerical experiments. Shunfang Wang, Bing Nie, Kun Yue, and Yu Fei analyzed the experimental results. Shunfang Wang, Bing Nie, Kun Yue, and Yu Fei wrote this paper. All authors read and approved the final manuscript.
Conflicts of Interest
The authors declare that they have no competing interests.
References
1. Chou, K.C. Some Remarks on Predicting Multi-Label Attributes in Molecular Biosystems. Mol. Biosyst. 2013, 9, 1092–1100.
2. Zhang, S.; Huang, B.; Xia, X.F.; Sun, Z.R. Bioinformatics Research in Subcellular Localization of Protein. Prog. Biochem. Biophys. 2007, 34, 573–579.
3. Zhang, S.B.; Lai, J.H. Machine Learning-based Prediction of Subcellular Localization for Protein. Comput. Sci. 2009, 36, 29–33.
4. Huh, W.K.; Falvo, J.V.; Gerke, L.C.; Carroll, A.S.; Howson, R.W.; Weissman, J.S.; O'Shea, E.K. Global analysis of protein localization in budding yeast. Nature 2003, 425, 686–691.
5. Dunkley, T.P.J.; Watson, R.; Griffin, J.L.; Dupree, P.; Lilley, K.S. Localization of organelle proteins by isotope tagging (LOPIT). Mol. Cell. Proteom. 2004, 3, 1128–1134.
6. Hasan, M.A.; Ahmad, S.; Molla, M.K. Protein subcellular localization prediction using multiple kernel learning based support vector machine. Mol. Biosyst. 2017, 13, 785–795.
7. Teso, S.; Passerini, A. Joint probabilistic-logical refinement of multiple protein feature predictors. BMC Bioinform. 2014, 15, 16.
8. Wang, S.; Liu, S. Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA. Int. J. Mol. Sci. 2015, 16, 30343–30361.
9. Baudat, G.; Anouar, F. Generalized Discriminant Analysis Using a Kernel Approach. Neural Comput. 2000, 12, 2385–2404.
10. Zhang, G.N.; Wang, J.B.; Li, Y.; Miao, Z.; Zhang, Y.F.; Li, H. Person re-identification based on feature fusion and kernel local Fisher discriminant analysis. J. Comput. Appl. 2016, 36, 2597–2600.
11. Xiao, Y.C.; Wang, H.G.; Xu, W.L.; Miao, Z.; Zhang, Y.; Li, H. Model selection of Gaussian kernel PCA for novelty detection. Chemometr. Intell. Lab. 2014, 136, 164–172.
12. Chou, K.C.; Shen, H.B. MemType-2L: A Web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem. Biophys. Res. Commun. 2007, 360, 339–345.
13. Dehzangi, A.; Heffernan, R.; Sharma, A.; Lyons, J.; Paliwal, K.; Sattar, A. Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou's general PseAAC. J. Theor. Biol. 2015, 364, 284–294.
14. Shen, H.B.; Chou, K.C. Gpos-PLoc: An ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins. Protein Eng. Des. Sel. 2007, 20, 39–46.
15. Hoffmann, H. Kernel PCA for novelty detection. Pattern Recogn. 2007, 40, 863–874.
16. Li, Y.; Maguire, L. Selecting Critical Patterns Based on Local Geometrical and Statistical Information. IEEE Trans. Pattern Anal. 2010, 33, 1189–1201.
17. Wilson, D.R.; Martinez, T.R. Reduction Techniques for Instance-Based Learning Algorithms. Mach. Learn. 2000, 38, 257–286.
18. Saeidi, R.; Astudillo, R.; Kolossa, D. Uncertain LDA: Including observation uncertainties in discriminative transforms. IEEE Trans. Pattern Anal. 2016, 38, 1479–1488.
19. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 2010, 31, 651–666.
20. Li, R.L.; Hu, Y.F. A Density-Based Method for Reducing the Amount of Training Data in kNN Text Classification. J. Comput. Res. Dev. 2004, 41, 539–545.
21. Chou, K.C.; Shen, H.B. Cell-PLoc 2.0: An improved package of web-servers for predicting subcellular localization of proteins in various organisms. Nat. Sci. 2010, 2, 1090–1103.
22. Chou, K.C.; Shen, H.B. Large-Scale Predictions of Gram-Negative Bacterial Protein Subcellular Locations. J. Proteome Res. 2006, 5, 3420–3428.
23. Kavousi, K.; Moshiri, B.; Sadeghi, M.; Araabi, B.N.; Moosavi-Movahedi, A.A. A protein fold classifier formed by fusing different modes of pseudo amino acid composition via PSSM. Comput. Biol. Chem. 2011, 35, 1–9.
24. Shen, H.B.; Chou, K.C. Nuc-PLoc: A new web-server for predicting protein subnuclear localization by fusing PseAA composition and PsePSSM. Protein Eng. Des. Sel. 2007, 20, 561–567.
25. Wang, T.; Yang, J. Using the nonlinear dimensionality reduction method for the prediction of subcellular localization of Gram-negative bacterial proteins. Mol. Divers. 2009, 13, 475.
26. Wei, L.Y.; Tang, J.J.; Zou, Q. Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information. Inform. Sci. 2017, 384, 135–144.
27. Shen, H.B.; Chou, K.C. Gneg-mPLoc: A top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins. J. Theor. Biol. 2010, 264, 326–333.
28. Li, B.; Yao, Q.Z.; Luo, Z.M.; Tian, Y. Grid-pattern method for model selection of support vector machines. Comput. Eng. Appl. 2008, 44, 136–138.
| Sample Sets | Method | Overall Accuracy | Ratio (${\mathrm{t}}_{1}/{\mathrm{t}}_{2}$) |
| --- | --- | --- | --- |
| GP220 (PSSM-S) | The proposed method | 0.9924 | 0.7087 |
| GP220 (PSSM-S) | Grid-searching method | 0.9924 | |
| GP1000 (PsePSSM) | The proposed method | 0.9924 | 0.7362 |
| GP1000 (PsePSSM) | Grid-searching method | 0.9924 | |
| GN220 (PSSM-S) | The proposed method | 0.9801 | 0.7416 |
| GN220 (PSSM-S) | Grid-searching method | 0.9801 | |
| GN1000 (PsePSSM) | The proposed method | 0.9574 | 0.7687 |
| GN1000 (PsePSSM) | Grid-searching method | 0.9574 | |
| Sample Set | Cell Membrane | Cell Wall | Cytoplasm | Extracell |
| --- | --- | --- | --- | --- |
| *Sensitivity* | | | | |
| GP220 | 1 | 0.9444 | 0.9904 | 0.9919 |
| GP1000 | 0.9943 | 0.9444 | 1 | 0.9837 |
| *Specificity* | | | | |
| GP220 | 0.9943 | 1 | 1 | 0.9950 |
| GP1000 | 0.9971 | 1 | 0.9937 | 0.9925 |
| *Matthews correlation coefficient (MCC)* | | | | |
| GP220 | 0.9914 | 0.9709 | 0.9920 | 0.9841 |
| GP1000 | 0.9914 | 0.9709 | 0.9921 | 0.9840 |
| *Overall accuracy (Q)* | | | | |
| GP220 | 0.9924 | | | |
| GP1000 | 0.9924 | | | |
| Sample Set | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Sensitivity* | | | | | | | | |
| GN220 | 1 | 0.9699 | 1 | 0 | 0.9982 | 0 | 0.9677 | 1 |
| GN1000 | 1 | 0.9323 | 1 | 0 | 0.9659 | 0 | 0.9516 | 0.9556 |
| *Specificity* | | | | | | | | |
| GN220 | 0.9924 | 0.9902 | 1 | 1 | 0.9978 | 1 | 1 | 0.9953 |
| GN1000 | 0.9608 | 0.9872 | 1 | 1 | 0.9967 | 1 | 1 | 0.9992 |
| *Matthews correlation coefficient (MCC)* | | | | | | | | |
| GN220 | 0.9866 | 0.9324 | 1 | – | 0.9956 | – | 0.9823 | 0.9814 |
| GN1000 | 0.9346 | 0.8957 | 1 | – | 0.9681 | – | 0.9733 | 0.9712 |
| *Overall accuracy (Q)* | | | | | | | | |
| GN220 | 0.9801 | | | | | | | |
| GN1000 | 0.9574 | | | | | | | |
(1) Cytoplasm, (2) Extracell, (3) Fimbrium, (4) Flagellum, (5) Inner membrane, (6) Nucleoid, (7) Outer membrane, (8) Periplasm.
Input: the training set $\mathrm{X}=\left\{{\mathrm{X}}_{1},{\mathrm{X}}_{2},\cdots ,{\mathrm{X}}_{\mathrm{C}}\right\}$ with ${\mathrm{X}}_{\mathrm{i}}=\left\{{\mathrm{x}}_{1},{\mathrm{x}}_{2},\cdots ,{\mathrm{x}}_{{\mathrm{N}}_{\mathrm{i}}}\right\}$ $\left(1\le \mathrm{i}\le \mathrm{C}\right)$.
1. Calculate the radius of the neighborhood $\mathsf{\epsilon}$ using Equation (11).
2. Calculate the centroid ${\mathrm{c}}_{\mathrm{i}}$ of the $\mathrm{i}$-th class according to Equation (9).
3. Calculate the distances $\mathrm{d}\mathrm{i}\mathrm{s}{\mathrm{t}}_{\mathrm{j}}\ (\mathrm{j}=1,2,\cdots ,{\mathrm{N}}_{\mathrm{i}})$ from all samples in the training set to ${\mathrm{c}}_{\mathrm{i}}$, respectively, and their median value $\mathrm{m}$.
4. For each training sample ${\mathrm{x}}_{\mathrm{j}}$ of the set ${\mathrm{X}}_{\mathrm{i}}$
Output: the selected internal sample set ${\mathsf{\Omega}}_{\mathrm{i}\mathrm{n}}$ and the selected edge sample set ${\mathsf{\Omega}}_{\mathrm{e}\mathrm{d}}$.
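Algorithm 1's split of a class into internal and edge samples can be sketched as below. Equations (9) and (11) lie outside this excerpt, so the centroid and the median-distance threshold used here are simplified stand-ins for the paper's exact rule, and the function name `split_edge_internal` is our own.

```python
import numpy as np

def split_edge_internal(Xi):
    """Simplified sketch of Algorithm 1 for one class Xi (n x d array).

    Samples farther from the class centroid than the median distance
    are treated as edge samples; the rest are internal samples."""
    c = Xi.mean(axis=0)                    # class centroid (analogue of Equation (9))
    dist = np.linalg.norm(Xi - c, axis=1)  # distances from each sample to the centroid
    m = np.median(dist)                    # median distance, used as the threshold
    internal = Xi[dist <= m]               # Omega_in
    edge = Xi[dist > m]                    # Omega_ed
    return internal, edge
```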
Input: a reasonable candidate set $\mathrm{S}=\left\{{\mathrm{s}}_{1},{\mathrm{s}}_{2},\cdots ,{\mathrm{s}}_{\mathrm{m}}\right\}$ for the Gaussian kernel parameter, the training set $\mathrm{X}=\left\{{\mathrm{X}}_{1},{\mathrm{X}}_{2},\cdots ,{\mathrm{X}}_{\mathrm{C}}\right\}$ with ${\mathrm{X}}_{\mathrm{i}}=\left\{{\mathrm{x}}_{1},{\mathrm{x}}_{2},\cdots ,{\mathrm{x}}_{{\mathrm{N}}_{\mathrm{i}}}\right\}$ $\left(1\le \mathrm{i}\le \mathrm{C}\right)$, and the number of retained eigenvectors $\mathrm{d}$.
1. Get the internal sample set ${\mathsf{\Omega}}_{\mathrm{i}\mathrm{n}}$ and the edge sample set ${\mathsf{\Omega}}_{\mathrm{e}\mathrm{d}}$ from the training set ${\mathrm{X}}_{\mathrm{i}}$ using Algorithm 1.
2. For each parameter ${\mathrm{s}}_{\mathrm{i}}\in \mathrm{S},\text{\hspace{1em}}\mathrm{i}=1,2,\cdots ,\mathrm{m}$
3. Select the optimum parameter $\mathrm{s}=\mathrm{a}\mathrm{r}\mathrm{g}\underset{{\mathrm{s}}_{\mathrm{i}}\in \mathrm{S}}{\mathrm{m}\mathrm{a}\mathrm{x}}\,\mathrm{f}\left({\mathrm{s}}_{\mathrm{i}}\right)$.
Output: the optimum Gaussian kernel parameter $\mathrm{s}$.
| No. | Subcellular Localization | Number of Proteins |
| --- | --- | --- |
| 1 | cell membrane | 174 |
| 2 | cell wall | 18 |
| 3 | cytoplasm | 208 |
| 4 | extracell | 123 |
| No. | Subcellular Localization | Number of Proteins |
| --- | --- | --- |
| 1 | cytoplasm | 410 |
| 2 | extracell | 133 |
| 3 | fimbrium | 32 |
| 4 | flagellum | 12 |
| 5 | inner membrane | 557 |
| 6 | nucleoid | 8 |
| 7 | outer membrane | 124 |
| 8 | periplasm | 180 |
| Sample Sets | Benchmarks for Subcellular Locations | Feature Extraction Method | Number of Classes | Dimension of Feature Vector | Number of Samples |
| --- | --- | --- | --- | --- | --- |
| GN1000 | Gram-negative | PsePSSM | 8 | 1000 | 1456 |
| GN220 | Gram-negative | PSSM-S | 8 | 220 | 1456 |
| GP1000 | Gram-positive | PsePSSM | 4 | 1000 | 523 |
| GP220 | Gram-positive | PSSM-S | 4 | 220 | 523 |
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).