Hyperspectral Image Classification Based on Two-Stage Subspace Projection

Abstract: Hyperspectral image (HSI) classification is a widely used application that provides important land-cover information. Each pixel of an HSI has hundreds of spectral bands, which are often treated as features. However, these features are frequently highly correlated and nonlinearly related. To address these problems, we propose a new discriminant analysis framework for HSI classification based on Two-stage Subspace Projection (TwoSP) in this paper. First, the proposed framework projects the original feature data into a higher-dimensional feature subspace by exploiting kernel principal component analysis (KPCA). Then, a novel discrimination-information based locality preserving projection (DLPP) method is applied to the KPCA feature data. Finally, an optimal low-dimensional feature space is constructed for the subsequent HSI classification. The main contributions of the proposed TwoSP method are twofold: (1) discrimination information is utilized to minimize the within-class distance within a small neighborhood, and (2) the subspace found by TwoSP separates the samples better than applying DLPP directly to the original HSI data. Experimental results on two real-world HSI datasets demonstrate the effectiveness of the proposed TwoSP method in terms of classification accuracy.


Introduction
With their rapid development, hyperspectral images (HSIs) play a significant role in various hyperspectral remote sensing applications, e.g., military applications [1], astronomy [2], and classification [3][4][5]. Among these tasks, HSI classification is a fundamental yet important application that provides primary information for subsequent tasks, and it is the main focus of this paper.
The goal of HSI classification is to determine the land-cover type of each pixel, which often has hundreds of spectral bands [6]. Although high-dimensional features may enable more accurate classification, the Hughes phenomenon [7] still arises in the classification process. An effective remedy is to perform dimensionality reduction on these features before HSI classification.
Existing dimensionality reduction methods can be divided into two categories: feature selection [8,9] and feature extraction [10,11]. Feature selection aims to select a subset of valuable features from the original HSI data. By contrast, feature extraction projects the original high-dimensional feature data into an optimal low-dimensional subspace, constructing valuable features through the projective transformation. Consequently, many feature extraction methods have been presented [12][13][14][15][16][17][18]. Several popular ones are principal component analysis (PCA) [19], independent component analysis (ICA) [20], linear discriminant analysis (LDA) [21], and locality preserving projection (LPP) [22]. In general, PCA is a popular global dimensionality reduction method, while LPP is an effective local one. Both global and local structures are important for projecting the original high-dimensional data into a low-dimensional subspace while preserving the valuable information. LDA constructs a linear transformation by minimizing the within-class scatter and maximizing the between-class distance. Many extended versions of LDA have been presented to improve the classification performance. For instance, a regularized version of LDA (RLDA) [23] was proposed by exploiting a regularized within-class scatter matrix, and Wang et al. [24] proposed an HSI classification method that constructs the scatter matrix from a small neighborhood. However, since the global structure of HSI data may be inconsistent with the local structure, the classification accuracy of LDA-based methods is low for HSI data [25]. To deal with the nonlinear problem in the original feature space, the kernel technique has been widely used, which projects the data from the original feature space into a kernel-induced space.
The corresponding kernel-based versions include kernel principal component analysis (KPCA) [26], kernel independent component analysis (KICA) [27], and kernel discriminant analysis (KDA) [28]. To alleviate the nonlinearity and inconsistency problems, Li et al. [29] proposed an unsupervised subspace projection method for single-image super-resolution. However, it is difficult for these methods to obtain better classification performance because the discrimination information is underutilized.
The classifiers for HSI data fall into two categories: generative and discriminative [30]. Generative classifiers learn the joint probability density of the feature data and the label information and then compute the posterior probabilities via a naive Bayesian [31] or Gaussian mixture model [32]. Although the generative model exploits the feature data exhaustively, it still lacks discriminative information. Discriminative classifiers find the optimal decision boundaries among different classes; examples include nearest neighbors (NN) [33], logistic regression [34], support vector machine (SVM) [35], and random forest (RF) [36]. Compared with generative classification methods, the discriminative model is better able to distinguish between classes. Hence, we use the discriminative model for HSI classification.
Due to the high dimensionality of HSI data, the relationship between the features is often nonlinear. Directly applying a linear transformation method to the high-dimensional feature data may lead to over-fitting during training and low accuracy in classification. Kernel-based techniques [37] are designed to deal with the nonlinear problem. We choose the most popular kernel-based feature representation method, i.e., KPCA, as the first-stage subspace projection. Notably, the original feature data are projected onto a higher-dimensional subspace to acquire an approximately linear relationship. To practically address the dimensionality reduction problem, we propose a new discrimination-information based locality preserving projection (DLPP) method that computes kernel distances within a k-nearest neighborhood of the preceding KPCA-projected data when the training samples belong to the same class. Overall, the proposed two-stage subspace projection framework first applies KPCA to the original high-dimensional HSI data and then applies the proposed DLPP method to the resulting KPCA feature data, thereby preserving both global and local structures.
In this work, we propose a new dimensionality reduction method for supervised HSI classification, termed Two-stage Subspace Projection (TwoSP). To exploit the global and local structures of the original HSI data, the proposed TwoSP framework first projects the data onto the KPCA space to preserve the global structure and alleviate the nonlinear problem. Then, the discrimination information within a k-nearest neighborhood of the within-class samples is used to learn the DLPP transformation matrix in the training process. The final subspace found by TwoSP substantially separates the test samples of different classes. The main contributions of this paper are as follows: (1) The discrimination information is utilized to compute the samples' kernel distances within a k-nearest neighborhood in the subspace, so the local structure of the original HSI data can be captured adaptively.
(2) The proposed TwoSP framework is an effective approach to supervised HSI classification that combines the existing KPCA and the proposed DLPP methods; it not only extracts nonlinear features but also exploits the discrimination information to preserve both global and local structures of the original HSI data. In the resulting optimal low-dimensional subspace, the classification boundary can be found for the HSI data.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the traditional KPCA method. In Section 3, we describe the proposed DLPP method in detail. Section 4 summarizes the proposed TwoSP framework for HSI data. Section 5 evaluates TwoSP on two real-world HSI datasets in comparison with several previous works. Finally, we conclude this work in Section 6.

Kernel Principal Component Analysis
In this section, we briefly introduce the kernel principal component analysis (KPCA) method [26,29]. Let $X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{d \times n}$ be the original feature data, where $d$ is the data dimensionality and $n$ is the number of input samples. We then consider the nonlinear problem in a feature subspace $\mathcal{F}$ induced by a mapping function $\phi : \mathbb{R}^d \to \mathcal{F}$. The projected feature data becomes linearly related in $\mathcal{F}$, which is also known as a reproducing kernel Hilbert space.
First, the mapped data $\phi(x_i)$ should be zero-mean normalized, i.e., centered by subtracting the mean vector $u = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)$. Similar to principal component analysis (PCA) [19], the covariance matrix $C$ for the KPCA transformation can be defined as

$$C = \frac{1}{n}\sum_{i=1}^{n} \bar{\phi}(x_i)\bar{\phi}(x_i)^T, \qquad (1)$$

where $\bar{\phi}(x_i) = \phi(x_i) - u$ denotes the centered mapping. We seek the projection vector $v$ that maximizes the variance after projection, i.e., we solve the eigenvalue problem

$$C v = \lambda v, \qquad (2)$$

where $\lambda$ and $v$ are the eigenvalue and the corresponding eigenvector, respectively. The eigenvector $v$ can be expanded as

$$v = \sum_{i=1}^{n} \alpha_i \bar{\phi}(x_i), \qquad (3)$$

where $\alpha_i$ is the $i$-th weighting coefficient and all $n$ weights are grouped into $\alpha = [\alpha_1, \alpha_2, \cdots, \alpha_n]^T$. Substituting Equations (1) and (3) into Equation (2), the eigenvalue problem reduces to the following equation (for details, see [29]):

$$\tilde{K}\alpha = n\lambda\alpha, \qquad (4)$$

where $\tilde{K}$ is the centered kernel matrix of size $n \times n$. The relationship between $\tilde{K}$ and the kernel matrix $K$ of the original feature data is $\tilde{K} = GKG$, where $G = I_n - \frac{1}{n}\mathbf{1}_n$, $I_n$ is the $n \times n$ identity matrix, and $\mathbf{1}_n$ is the $n \times n$ all-ones matrix. The Gaussian radial basis function is the most popular kernel and is used in this paper; it is defined as $K_{i,j} = \exp(-\|x_i - x_j\|_2^2 / \sigma)$, where $\sigma$ is the kernel parameter [38].
The optimal KPCA projection matrix $W$ for problem (4) is formed by the $r$ eigenvectors of the centered kernel matrix $\tilde{K}$ associated with the $r$ largest eigenvalues. Therefore, the original feature data can be projected onto the KPCA subspace:

$$X_r = W^T \tilde{K}. \qquad (5)$$

The original features are thus mapped from the $d$-dimensional space $\mathbb{R}^d$ to an $r$-dimensional subspace $\mathbb{R}^r$, where the optimal value of $r$ may be larger than $d$. Although the original features may be projected onto a higher-dimensional feature space, the linear relationship and global structure can be captured, which is important for the subsequent dimensionality reduction.
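The first-stage projection above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function name `kpca` and the default value of the kernel parameter `sigma` are assumptions.

```python
import numpy as np

def kpca(X, r, sigma=1.0):
    """Project the columns of X (d x n) onto the top-r KPCA components.

    A minimal sketch of Equations (1)-(5) above; `sigma` is the Gaussian
    kernel parameter (assumed default).
    """
    n = X.shape[1]
    # Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / sigma)
    sq = np.sum(X**2, axis=0)
    D = sq[:, None] + sq[None, :] - 2 * X.T @ X
    K = np.exp(-D / sigma)
    # Center the kernel matrix: K_tilde = G K G with G = I - (1/n) * ones
    G = np.eye(n) - np.ones((n, n)) / n
    K_tilde = G @ K @ G
    # Eigendecomposition; numpy returns eigenvalues in ascending order,
    # so the last r eigenvectors correspond to the r largest eigenvalues
    vals, vecs = np.linalg.eigh(K_tilde)
    W = vecs[:, -r:][:, ::-1]            # n x r, largest eigenvalue first
    return W.T @ K_tilde                 # r x n projected data (Eq. (5))
```

Note that the output dimensionality `r` can exceed the input dimensionality `d`, matching the discussion above.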

Discrimination-Information-Based Locality Preserving Projection
In this section, the proposed discrimination-information based locality preserving projection (DLPP) method is introduced in detail.
After applying the first-stage KPCA subspace projection, we obtain the mapped data $X_r = [x_1^r, x_2^r, \cdots, x_n^r] \in \mathbb{R}^{r \times n}$, which preserves the global structure of the original high-dimensional feature data. However, in real-world HSI classification, the local structure may be inconsistent with the global structure. Therefore, the extraction of the local structure should be taken into consideration in the process of dimensionality reduction.
Denote another nonlinear mapping function by $\psi : \mathbb{R}^r \to \mathcal{F}$, and let $K^{\psi}_{i,j} = \psi(x_i^r)^T \psi(x_j^r)$ be the corresponding kernel matrix. As before, we choose the Gaussian kernel function in the computation of the kernel distances, i.e.,

$$K^{\psi}_{i,j} = \exp(-\|x_i^r - x_j^r\|_2^2 / \rho), \qquad (6)$$

where $\rho$ is the corresponding kernel parameter. We can then compute the kernel distances among the KPCA mapping data $X_r$. To further preserve the local structures, an adjacency matrix $S$ is designed to measure the similarity between feature vectors $x_i^r$ and $x_j^r$ that come from the same class, i.e.,

$$S_{i,j} = \begin{cases} K^{\psi}_{i,j}, & \ell(x_i^r) = \ell(x_j^r),\; x_i^r \in \mathcal{N}(x_j^r),\; x_j^r \in \mathcal{N}(x_i^r), \\ 0, & \text{otherwise}, \end{cases} \qquad (7)$$

where $\mathcal{N}(x_i^r)$ and $\mathcal{N}(x_j^r)$ denote the $k$-nearest neighbors of $x_i^r$ and $x_j^r$, respectively, and $\ell(\cdot)$ is the function that returns the class label of the input feature vector. That is to say, when the two feature vectors $x_i^r$ and $x_j^r$ share the same class label and lie in each other's neighborhood, the entry of the adjacency matrix $S$ is given by the corresponding kernel distance; otherwise, it is zero.
The proposed DLPP adjusts these adjacency relations during dimensionality reduction. Let $P$ be the projection matrix that embeds the first-stage KPCA mapping data $X_r$ into an optimal low-dimensional subspace via $X_m = P^T X_r$, which yields the following objective:

$$\min_{P} \; \frac{1}{2} \sum_{i,j} \left\|P^T x_i^r - P^T x_j^r\right\|_2^2 S_{i,j}. \qquad (8)$$

Problem (8) can be further reduced to

$$\min_{P} \; \mathrm{tr}\!\left(P^T X_r Z_s X_r^T P - P^T X_r S X_r^T P\right) = \min_{P} \; \mathrm{tr}\!\left(P^T X_r L_s X_r^T P\right), \qquad (9)$$

where $L_s = Z_s - S$ and $Z_s = \mathrm{diag}(S\mathbf{1})$, $\mathbf{1}$ is the $n \times 1$ all-ones vector, and $\mathrm{tr}(\cdot)$ and $\mathrm{diag}(\cdot)$ are the trace and diagonal functions. To obtain an optimal solution of problem (9), a normalized scale constraint is imposed:

$$P^T X_r Z_s X_r^T P = I_m, \qquad (10)$$

where $I_m$ is the $m \times m$ identity matrix. Therefore, the optimal DLPP projection matrix $P^*$ can be computed as

$$P^* = \arg\min_{P} \; \mathrm{tr}\!\left(P^T X_r L_s X_r^T P\right) \quad \text{s.t.} \quad P^T X_r Z_s X_r^T P = I_m, \qquad (11)$$

which can be solved analytically through generalized eigenvalue decomposition between $X_r L_s X_r^T$ and $X_r Z_s X_r^T$; $P^*$ is then formed by the $m$ eigenvectors corresponding to the $m$ smallest eigenvalues. Once the optimal DLPP projection matrix $P^*$ is obtained, the second-stage projected feature data can be computed as

$$X_m = (P^*)^T X_r. \qquad (12)$$
After the two-stage subspace projection, the dimensionality of the final projected feature data is $m$, which is significantly lower than the original dimensionality, i.e., $m \ll d$.
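The DLPP steps above can likewise be sketched with a generalized eigensolver. This is a minimal illustration, not the authors' code; the function name, the small ridge term added for numerical stability, and the defaults for `k` and `rho` are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def dlpp(Xr, labels, m, k=5, rho=1.0):
    """Second-stage DLPP projection (sketch of Equations (6)-(12)).

    Xr: r x n KPCA feature data; labels: length-n class vector.
    Returns the m x n projected data X_m = P^T X_r.
    """
    r, n = Xr.shape
    # Gaussian kernel matrix, Equation (6)
    sq = np.sum(Xr**2, axis=0)
    D = sq[:, None] + sq[None, :] - 2 * Xr.T @ Xr
    K = np.exp(-D / rho)
    # k-nearest neighbours by kernel similarity (larger K means closer)
    nbrs = np.argsort(-K, axis=1)[:, 1:k + 1]
    # Adjacency matrix, Equation (7): mutual same-class neighbours
    S = np.zeros((n, n))
    for i in range(n):
        for j in nbrs[i]:
            if labels[i] == labels[j] and i in nbrs[j]:
                S[i, j] = K[i, j]
    S = (S + S.T) / 2                         # keep S symmetric
    Zs = np.diag(S.sum(axis=1))
    Ls = Zs - S
    A = Xr @ Ls @ Xr.T
    B = Xr @ Zs @ Xr.T + 1e-8 * np.eye(r)     # ridge for positive definiteness
    # Generalized eigenproblem of (11); eigh returns ascending eigenvalues
    vals, vecs = eigh(A, B)
    P = vecs[:, :m]                           # m smallest eigenvalues
    return P.T @ Xr                           # Equation (12)
```

The ridge on `B` is an assumed safeguard for the case where the neighbourhood graph is sparse; the paper itself solves the generalized eigenproblem directly.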

The Proposed Framework
In this section, we show how the proposed TwoSP method is applied to the HSI data. According to the class information, sorted in ascending order, the input HSI feature data are partitioned into $c$ class subsets $X_1, X_2, \cdots, X_c$, where $n = n_1 + n_2 + \cdots + n_c$ and $n_0 = 0$ (i.e., the samples with class label 0 are excluded). These samples are then further partitioned into training and test samples. Following the dataset partition method in [39], a small portion of training samples is enough for good classification performance. For each subset $X_i$ ($i = 1, 2, \cdots, c$), we randomly select 5% of the samples to compose the training subset, and all remaining samples are used as the test subset. Therefore, the training dataset is denoted as $X_s = [X_{s,1}, X_{s,2}, \cdots, X_{s,c}]$ with $n_{s,i} = \lceil n_i \times 5\% \rceil$, and the test dataset as $X_t = [X_{t,1}, X_{t,2}, \cdots, X_{t,c}]$ with $n_{t,i} = n_i - n_{s,i}$. Moreover, the numbers of training and test samples are denoted as $n_s = n_{s,1} + n_{s,2} + \cdots + n_{s,c}$ and $n_t = n_{t,1} + n_{t,2} + \cdots + n_{t,c}$.
For simplicity, the training dataset and its class information are denoted as $X_s$ and $Y_s = [y_1^s, y_2^s, \cdots, y_{n_s}^s]$, where $y_i^s \in \{1, 2, \cdots, c\}$. Likewise, the test class information is denoted as $Y_t = [y_1^t, y_2^t, \cdots, y_{n_t}^t]$, where $y_i^t$ also belongs to $\{1, 2, \cdots, c\}$. The details of the whole framework are described in Algorithm 1.
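The per-class 5% partition described above can be sketched as follows. The helper name and the ceiling rule for very small classes are assumptions for illustration.

```python
import numpy as np

def split_per_class(X, y, frac=0.05, seed=0):
    """Per-class random split: for each class, ceil(frac * n_i) samples go
    to the training subset and the rest to the test subset (a sketch).

    X: d x n feature matrix (samples as columns); y: length-n labels.
    """
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        n_tr = int(np.ceil(frac * idx.size))
        train_idx.extend(idx[:n_tr])
        test_idx.extend(idx[n_tr:])
    return X[:, train_idx], y[train_idx], X[:, test_idx], y[test_idx]
```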

Algorithm 1: The Proposed Framework for HSI classification.
Input: training dataset $X_s$, training class information set $Y_s$, test dataset $X_t$, the number of neighbors $k$ in DLPP, first-stage dimensionality $r$, and second-stage dimensionality $m$.
Output: estimated test class information set $Y_t$.

First-Stage Subspace Projection:
1. merge $X_s$ and $X_t$ into the whole dataset $X$; set $n = n_s + n_t$;
2. compute the kernel matrix $K$ of $X$;
3. compute the centered kernel matrix $\tilde{K} = GKG$, where $G = I_n - \frac{1}{n}\mathbf{1}_n$;
4. solve the eigenvalue problem (4);
5. select the $r$ eigenvectors of $\tilde{K}$ corresponding to the $r$ largest eigenvalues to construct the first-stage subspace projection matrix $W$;
6. obtain the KPCA feature data $X_r$ according to Equation (5);

Second-Stage Subspace Projection:
7. extract the projected training samples $X_{s,r} = [x_1^{s,r}, x_2^{s,r}, \cdots, x_{n_s}^{s,r}]$ and compute the corresponding kernel matrix;
8. compute the adjacency matrix $S$ over the $k$-nearest neighbors with the same class using Equation (7), where $y_i^s = \ell(x_i^{s,r})$;
9. solve problem (11), where $Z_s = \mathrm{diag}(S\mathbf{1})$ and $L_s = Z_s - S$;
10. choose the $m$ eigenvectors corresponding to the $m$ smallest eigenvalues to construct the second-stage subspace projection matrix $P^*$;
11. for the training samples, obtain the DLPP feature data using Equation (12), i.e., $X_{s,m} = (P^*)^T X_{s,r}$;

Classification: for $i = 1, 2, \cdots, n_t$ do
12. according to Equation (12), compute the final projected feature vector of each test sample as $x_i^{t,m} = (P^*)^T x_i^{t,r}$;
13. using the NN classifier, find the nearest training sample in the optimal low-dimensional feature space, i.e., $j^* = \arg\min_j \|x_i^{t,m} - x_j^{s,m}\|_2^2$, where $j = 1, \cdots, n_s$;
14. obtain the corresponding class information, i.e., $y_i^t = \ell(x_{j^*}^{s,m})$; set $i = i + 1$;
end
Result: the class estimates for all the test samples, i.e., $Y_t$.
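The nearest-neighbour classification stage of Algorithm 1 reduces to a single vectorized distance computation. The sketch below is illustrative, with an assumed function name; it operates on the final m-dimensional features with samples as columns.

```python
import numpy as np

def nn_classify(X_train_m, y_train, X_test_m):
    """Nearest-neighbour assignment in the final subspace (a sketch of the
    classification loop in Algorithm 1, done for all test samples at once)."""
    # squared Euclidean distances between every test and training sample
    d = (np.sum(X_test_m**2, axis=0)[:, None]
         + np.sum(X_train_m**2, axis=0)[None, :]
         - 2 * X_test_m.T @ X_train_m)
    # each test sample inherits the label of its nearest training sample
    return y_train[np.argmin(d, axis=1)]
```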

Experimental Results
To validate the effectiveness of the proposed algorithm for hyperspectral image classification, experiments are conducted to compare with several existing dimensionality reduction methods including principal component analysis (PCA) [19], independent component analysis (ICA) [20], linear discriminant analysis (LDA) [21], kernel discriminant analysis (KDA) [28], kernel principal component analysis (KPCA) [26], the proposed discrimination-information based locality preserving projection (DLPP), and the proposed two-stage subspace projection (TwoSP). In addition, we also compare the classification results of the raw spectral features (RAW).

Experimental Setting
In this paper, we conduct the experiments on two real-world HSI datasets, i.e., Indian Pines and Kennedy Space Center (KSC) datasets [40], to demonstrate the effectiveness of the proposed TwoSP method compared with the existing dimensionality reduction algorithms.
The Indian Pines dataset, in its corrected version, consists of 145 × 145 pixels and 200 spectral bands; it was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Northwestern Indiana. After removing the background (class label 0), 10,249 pixels are annotated with 16 classes. The KSC dataset was also acquired by the AVIRIS sensor, over the Kennedy Space Center, Florida. The size of this HSI is 512 × 614 pixels, and each pixel has 176 spectral bands annotated with 13 classes. After removing the background pixels, 5211 labeled pixels remain. Table 1 shows the land-cover types and the sizes of the training and test subsets selected for the two HSI datasets. For classification, we use the nearest neighbor (NN), support vector machine (SVM), and random forest (RF) classifiers to obtain the estimated classes of the test samples. For SVM, we use the "libsvm" toolbox in its Matlab version with a linear kernel [41]. We then select three widely used classification measurements, i.e., average accuracy (AA), overall accuracy (OA), and the kappa coefficient (KC) between the estimates and the ground truths of all classes, to objectively assess the performance of HSI classification. All the experiments were performed on a personal computer with an Intel Xeon CPU E5-2643 v3 at 3.40 GHz, 64 GB memory, and 64-bit Windows 7, using Matlab R2017b.
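For reference, the three measurements (OA, AA, and KC) can all be computed from the confusion matrix. The following is a minimal sketch for illustration, not the evaluation code used in the experiments.

```python
import numpy as np

def classification_scores(y_true, y_pred):
    """Compute OA, AA, and the kappa coefficient from the confusion matrix.

    OA: fraction of correctly classified samples overall.
    AA: mean of the per-class recall values.
    KC: agreement corrected for chance, (OA - p_e) / (1 - p_e).
    """
    classes = np.unique(y_true)
    C = np.zeros((classes.size, classes.size))
    for i, a in enumerate(classes):
        for j, b in enumerate(classes):
            C[i, j] = np.sum((y_true == a) & (y_pred == b))
    n = C.sum()
    oa = np.trace(C) / n
    aa = np.mean(np.diag(C) / C.sum(axis=1))
    pe = np.sum(C.sum(axis=0) * C.sum(axis=1)) / n**2  # chance agreement
    kc = (oa - pe) / (1 - pe)
    return oa, aa, kc
```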

Performance on Hyperspectral Image Datasets
To alleviate the random error caused by randomly selecting 5% of the samples as the training set and all remaining samples as the test set, we repeated the experiments five times and report the average accuracy for each class, the average AA over all classes, and the average OA and KC over all test samples, together with the corresponding standard deviations (STDs). The quantitative classification accuracies of these methods are given in Tables 2 and 3 for the Indian Pines and KSC datasets, respectively. Each method uses its best reduced dimensionality, shown in brackets.
From the two tables, we can see that TwoSP outperforms all the competitors in terms of AA, OA, and KC. In particular, the classification performance of TwoSP is better than directly applying the proposed DLPP method to the original HSI feature data. PCA achieves dimensionality reduction but neglects the nonlinear relationships in the original high-dimensional feature data. Due to the nonlinear problem, which may lead to a one-to-many relationship among the feature vectors, ICA is unable to capture the global and local structures. LDA preserves the local manifold structure by exploiting the discrimination information of the training samples; however, the global structure is lost in the dimensionality reduction. KDA considers both the global and local data relationships in the discriminant analysis, but with a small number of training samples its classification accuracy is often low. For instance, for class No. 9 in Table 2, the average OA (%) of KDA is 6.3, while that of our TwoSP reaches 30.5 despite there being only one training sample. To achieve dimensionality reduction, KDA introduces the discrimination information of all the within-class and between-class samples. However, within-class samples that are far away from the reference sample may have a negative effect on the final classification accuracy. Therefore, the discrimination information in a small neighborhood should be taken into account in the dimensionality reduction process, as proposed in this paper. KPCA largely alleviates the nonlinear problem of the HSI data, but it cannot preserve the local data relationships together with the valuable discrimination information. The proposed DLPP method applied directly to the original feature data is better than the traditional linear feature transformation approaches, i.e., PCA, ICA, and LDA.
The proposed TwoSP method investigates the local structure of the HSI data adaptively and preserves the global structure in the first-stage subspace projection. Therefore, TwoSP achieves the best classification performance in a large proportion of the quantitative results.

Table 2. Performance (%) of different methods (with the best reduced dimensionality in brackets) on the Indian Pines dataset using the NN classifier.

Furthermore, the classification maps of the aforementioned methods on the Indian Pines and KSC datasets are visualized in Figures 1 and 2. RAW, PCA, ICA, and KPCA have low classification performance because they cannot use the discrimination information in the subspace projection process. Although LDA and KDA use the discrimination information, they introduce all the within-class and between-class samples into the discriminant analysis, which makes it difficult to preserve the local manifold structure. Subfigures (f) and (g) of Figures 1 and 2 show clear classification maps for several classes. TwoSP, which enforces the discrimination information within a small neighborhood, shows better visual quality than the others, since it not only preserves the global structure in the first-stage KPCA subspace projection but also learns the local data relationship in the second-stage DLPP subspace projection. This also demonstrates that utilizing discrimination information within a small neighborhood can largely improve the classification performance in the corresponding nonlinear feature space.

Table 3. Performance (%) of different methods (with the best reduced dimensionality in brackets) on the KSC dataset using the NN classifier.

To further validate the effectiveness of the proposed method, we conducted an additional experiment with an effective cross-validation method, i.e., the McNemar test [42].
Table 4 shows the results of the McNemar test for the proposed method and the baselines, where the methods in the vertical direction are the test methods, while those in the horizontal direction are the reference methods. When the value is greater than zero, the classification performance of the test method is better than that of the reference method; otherwise, the reference method has the advantage. Generally, the significance threshold in the McNemar test is set to 0.05; accordingly, when the absolute value is larger than 1.96, the two methods differ significantly. From Table 4, we can see that TwoSP has the highest classification performance compared with the existing approaches. In addition, the simplified version of our TwoSP, i.e., DLPP, also performs well to some extent.
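For completeness, the signed McNemar statistic used above can be computed from per-sample correctness vectors. The sketch below is illustrative; the function name and input convention are assumptions.

```python
import numpy as np

def mcnemar_z(correct_test, correct_ref):
    """Signed McNemar statistic between a test method and a reference method.

    Inputs are boolean arrays marking which test samples each method
    classified correctly. z > 0 favours the test method; |z| > 1.96
    indicates a significant difference at the 0.05 level.
    """
    f12 = np.sum(correct_test & ~correct_ref)   # test right, reference wrong
    f21 = np.sum(~correct_test & correct_ref)   # reference right, test wrong
    return (f12 - f21) / np.sqrt(f12 + f21)
```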

Discussion on Computational Cost
In this subsection, we discuss the computational cost of the proposed method compared with the state-of-the-art approaches. Considering the main steps in Algorithm 1, the proposed HSI classification method consists of three parts: KPCA subspace projection, DLPP subspace projection, and classification.
Let $n_s$ and $n_t$ be the numbers of training and test samples, respectively, with $n = n_s + n_t$. The computational complexity of the first-stage subspace projection matrix $W$ is $O(n^3)$; from it, we obtain the KPCA feature data $X_r$ with dimensionality $r$. Let $k$ denote the number of neighbors in the computation of the adjacency matrix; the computational complexity of the second-stage subspace projection is then $O(rkn_s + r^3)$. After the two-stage subspace projection, the original HSI data are projected onto an optimal low-dimensional feature space with dimensionality $m$. For classification with the NN classifier, each test sample is compared with all the training samples to find the nearest neighbor, so the computational complexity of classification is $O(mn_t n_s)$. Table 5 shows the computational time (in seconds) of the proposed method and the baselines, where T1 represents the computation of the projection for each method, and T2, T3, and T4 are the classification times obtained by the NN, RF, and SVM classifiers, respectively. Since the first-stage projection of TwoSP needs to compute the kernel matrix $K$ over all training and test samples, its T1 is larger than that of the other methods. The T1 values of KPCA and TwoSP are similar because both involve the computation of a large kernel matrix. KDA uses the discriminant information of only a small portion of training samples, so the computation of its kernel matrix is small. RAW feeds the original HSI data directly into the classification process, so its T1 is null. Comparing the T2, T3, and T4 values of the different methods, we can see that the RF classifier needs more time to classify all test samples. Considering classification performance and computational time simultaneously, we chose the NN classifier in the experiments.

Performance of Reduced Dimensionality
Each method is evaluated with different reduced dimensionalities. To determine the optimal reduced dimensionality for each method, the classification accuracy in terms of overall accuracy is obtained with the NN classifier as the dimensionality varies within {2, 3, · · · , 30} on Indian Pines and KSC, respectively. Figure 3 shows the curves of OA versus the reduced dimensionality on the two HSI datasets. The quantitative results are obtained with a random dataset partition. The proposed TwoSP method consistently achieves the highest OA. In particular, the OA obtained by TwoSP exceeds that of the other methods by a large margin once the reduced dimensionality is greater than 7. As shown in Figure 3, the classification performance becomes stable when the dimensionality increases beyond a certain value. In most cases, the performance with low-dimensional projected data is better than with the original high-dimensional data, which validates that dimensionality reduction does improve the classification accuracy.

Analysis of Classifier
To evaluate the classification performance of each dimensionality reduction method with three different classifiers, i.e., NN, RF, and SVM, we randomly selected five different dataset partitions and computed the average OA values. After the projected feature data were obtained by each dimensionality reduction method, the classes of the test set were predicted by the NN, RF, and SVM classifiers, respectively. Figure 4 shows the classification results versus the different classifiers on the two HSI datasets. In Figure 4a,b, the proposed TwoSP method presents the best classification performance for all classifiers compared with the other dimensionality reduction methods, which also demonstrates that TwoSP is robust across classifiers. In most cases on the Indian Pines dataset, the NN and RF classifiers generate better OAs than the SVM classifier. On the other hand, the RF classifier achieves better results on the KSC dataset, while the results with NN are superior to RF for LDA and DLPP. This also illustrates that the classification accuracy of different classifiers on different datasets may differ considerably. To unify the classification model, we applied the NN classifier when evaluating the per-class classification performance in Tables 2 and 3.

Analysis of Parameters
In our proposed TwoSP method, three parameters must be set first, as shown in the input of Algorithm 1. Therefore, the first- and second-stage subspace projection dimensionalities r and m and the number of neighbors k are studied experimentally. We randomly choose one training/test partition and mainly take the Indian Pines dataset as an example; the parameter values are varied during the analysis. Figure 5a shows the overall accuracy obtained by simultaneously varying the dimensionalities r and m. Since the first-stage subspace projection is designed to preserve the global structure of the original HSI data, the dimensionality m of the second-stage subspace projection should be smaller than r. From this subfigure, we can see that TwoSP is robust to r and m over a wide range. When r and m increase to certain values, the classification performance in terms of OA is the best. Therefore, we select r = 45 and m = 20 for the Indian Pines dataset in the experiments. Figure 5b shows the OA values versus the number of neighbors in the second-stage subspace projection. TwoSP achieves the highest OA when the number of neighbors is k = 200, which is the value chosen in this paper. Note that the neighbors are first selected by comparing the kernel distances; only the within-class samples are then used to compute the entries of the adjacency matrix, which are zero otherwise. Therefore, in practice only a small number of training samples are used to learn the local structure of the first-stage projected data.

Conclusions
In this paper, we proposed the TwoSP method, which preserves both global and local structures to learn an optimal low-dimensional feature space for HSI classification. TwoSP first applies the traditional KPCA method to address the nonlinear problem that often exists in HSI data. However, the dimensionality of the first-stage subspace projection is not low enough, so the second-stage subspace projection is applied to the preceding projected features. TwoSP exploits the prior knowledge of the training samples to construct an adjacency matrix for the within-class samples, which enhances the discrimination information of the projected features.
Compared with the state of the art, TwoSP is better able to learn the data manifold relationship in the desired feature space, which creates various valuable features for HSI classification. In addition, TwoSP strongly retains the local smoothness within a small neighborhood. Through the experiments on two real-world HSI datasets, i.e., Indian Pines and KSC, TwoSP provides better classification performance than the existing dimensionality reduction methods, which also validates the effectiveness of the proposed method.
Our future work will focus on how to extend the proposed method to train the optimal transformation matrix for each test sample quickly. It is desirable to improve the classification performance and increase the computation efficiency of the dimensionality reduction.