1. Introduction
As one of the most significant remote sensing techniques, hyperspectral imaging produces Hyperspectral Images (HSIs) that have been successfully applied to real-world tasks such as target detection and crop yield estimation [1,2,3]. Typically, each pixel in an HSI consists of hundreds of spectral bands/features, so HSIs provide far richer information about remote sensing objects than multispectral or conventional color images with only a few bands; consequently, HSI classification has attracted much attention over the last several decades. Although many classification models have been introduced, such as the Support Vector Machine (SVM), Deep Learning (DL), and representation-based classification [4,5,6,7], the curse of dimensionality stemming from the high-dimensional features/bands can largely compromise the performance of classification algorithms, especially when only a small number of labeled training samples is available. Therefore, many Dimensionality Reduction (DR) methods have been introduced to preprocess HSI data so that noisy and redundant features are removed and discriminative features are extracted in a low-dimensional subspace. To some extent, DR has become fundamental for HSI data analysis [8,9,10,11,12].
Generally speaking, DR models for HSIs can be divided into two categories: band selection and feature extraction [8,12]. The former focuses on choosing a representative subset of the original spectral bands, which retain physical meaning, according to some criterion, while the latter learns new features by transforming the observed high-dimensional spectral bands/features into low-dimensional features with discriminative and structural information. As stated in [8], discovering optimal bands from the enormous number of possible band combinations by feature selection is typically suboptimal; thus, in this paper we focus only on feature-extraction-based DR methods for HSIs.
Over the past decades, various feature extraction algorithms have been introduced for HSI data analysis. They fall roughly into two major classes: linear and nonlinear DR models. Principal Component Analysis (PCA) may be the most classic linear DR method; it linearly projects observed data into the low-dimensional subspace where the data variance is maximized. Probabilistic PCA (PPCA) [13] extends PCA into the Bayesian framework, and other extensions include Kernel PCA (KPCA), Robust PCA, Sparse PCA, and Tensor PCA [14,15,16,17,18]. However, the linear assumption means that PCA and its extensions can fail when nonlinear structures are embedded in the high-dimensional observations. Thus, many nonlinear DR models have been developed, among which manifold-learning-based models are representative because they can capture intrinsic nonlinear structures (e.g., multiple clusters, subspace structures, and manifolds) in high-dimensional observations [12,19,20].
Most manifold-learning-based DR techniques model the local geometric structure of the data using graph theory [19,21]. The performance of these methods mainly depends on two aspects: (i) the design of the similarity graph, and (ii) the embedding of new test data, i.e., the out-of-sample problem. For the first aspect, different methods construct similarity graph matrices in distinct ways, such as Locally Linear Embedding (LLE), Laplacian Eigenmap (LE), Neighborhood Preserving Embedding (NPE), and Locality Preserving Projection (LPP) [19,22]. In [19], a general Graph Embedding (GE) framework was introduced to unify these manifold learning algorithms, and various methods for constructing similarity graph matrices were compared. Lunga et al. [20] also reviewed typical manifold-learning-based DR methods for HSI classification. Recently, representation-based approaches have been introduced into the manifold learning framework to build the similarity graphs [12]. For example, Sparse Representation (SR), Collaborative Representation (CR), and Low-Rank Representation (LRR) [7] were utilized to construct the sparse graph (${\ell}_{1}$ graph), collaborative graph (${\ell}_{2}$ graph), and low-rank graph, giving rise to Sparsity Preserving Projection (SPP) [23], Collaborative Representation based Projection (CRP) [24], and Low-Rank Preserving Projections (LRPP) [25], respectively. For the second aspect, to address the out-of-sample problem encountered by many manifold learning models, explicit mappings that project new samples to the low-dimensional manifold subspace were added to existing algorithms; a linear mapping is typically used, as in LPP and NPE [19].
The aforementioned techniques are unsupervised DR models, which means that labels cannot be utilized even when they are available, as is typical for HSI data. By making use of label information, unsupervised DR methods can be extended to supervised settings, which improves their discriminating power. Linear Discriminant Analysis (LDA) is representative of this line of work. Compared to its unsupervised counterpart PCA, LDA performs DR by simultaneously maximizing the between-class scatter and minimizing the within-class scatter, leading to more discriminative dimensionality-reduced features. However, LDA and its extensions, including Generalized Discriminant Analysis (GDA), can extract at most $C-1$ features, with C being the number of classes [8], and suffer from the Small-Sample-Size (SSS) problem. Nonparametric Weighted Feature Extraction (NWFE) and its kernel extension, Kernel NWFE, address these issues by using weighted means to evaluate nonparametric scatter matrices, which allows more than $C-1$ features to be learned [8]. Other LDA extensions include Regularized LDA (RLDA) [26] and Modified FLDA (MFLDA) [27].
PCA has also been extended to supervised versions that exploit labels, such as Supervised PPCA (SPPCA) [16]. However, these methods are generally incapable of discovering the complex nonlinear geometric structure of the numerous HSI spectral features, which can deteriorate DR performance. Therefore, many supervised manifold learning models capable of capturing the local geometric structure of neighboring pixels have been developed, such as Supervised LPP (SLPP) [19], Local Fisher Discriminant Analysis (LFDA) [28], and Local Graph Discriminant Embedding (LGDE) [29]. The simplest way to make a manifold learning model supervised is to construct the similarity graph matrices from neighboring samples belonging to the same class. Alternatively, representation-based models have been extended to supervised settings, such as Sparse Graph-based Discriminant Analysis (SGDA) [30], Weighted Sparse Graph-based Discriminant Analysis (WSGDA) [31], Collaborative Graph-based Discriminant Analysis (CGDA) [32], Laplacian-regularized CGDA (LapCGDA) [7], Discriminant Analysis with Graph Learning (DAGL) [33], and Sparse and Low-Rank Graph-based Discriminant Analysis (SLGDA) [34].
To add further nonlinearity to supervised manifold models, kernel tricks have also been utilized, resulting in Kernel LFDA (KLFDA) [28], Kernel CGDA (KCGDA) [32], and Kernel LapCGDA (KLapCGDA) [7]. A recent survey [12] termed this kind of technique Graph-Embedding Discriminant Analysis; the basic idea behind these models is to construct discriminative intrinsic graphs with different criteria, such as the ${\ell}_{1}$ and ${\ell}_{2}$ graphs employed in SGDA and CGDA, respectively. Although these models have demonstrated their effectiveness for extracting discriminative HSI features in terms of classification accuracy, their performance can be unsatisfactory because spatial information is not considered.
Recently, researchers have shown that spatial information in HSIs can be efficiently utilized to boost DR models, leading to joint spectral-spatial DR techniques. Most of these models can be roughly divided into two classes, as stated in [35]:
 (i)
Spectral-spatial information can be used in preprocessing. For example, a Gaussian-weighted local mean operator was employed to extract spectral-spatial features in Composite Kernels Discriminant Analysis (CKDA) [36]. A robust spectral-spatial distance, rather than a simple spectral distance, was adopted in Robust Spatial LLE (RSLLE) [37]. A Local Graph-based Fusion (LGF) [38] method conducts DR by simultaneously considering the spectral information and the spatial information extracted by Morphological Profiles (MPs). Recently, Tensor PCA (TPCA) [18], deep-learning-based models [5], superpixels [29,39], and propagation filter [40] techniques have also been successfully utilized to extract spectral-spatial features.
 (ii)
Spectral-spatial information can be used as a spatial constraint/regularization. For instance, the authors of [41] proposed Spatial and Spectral Regularized Local Discriminant Embedding (SSRLDE), where local similarity information is encoded by a spectral-domain regularized local preserving scatter matrix and a spatial-domain local pixel neighborhood preserving scatter matrix. Spectral-Spatial LDA (SSLDA) [42] uses a local scatter matrix constructed from a small neighborhood as a regularizer, which makes the samples approximate the local neighborhood mean in the low-dimensional feature space. Spectral-Spatial Shared Linear Regression (SSSLR) [35] uses a convex set to describe the spatial structure and a shared structure learning model to learn a more discriminant linear projection matrix for classification. A spatial-spectral hypergraph is used in Spatial-Spectral Hypergraph Discriminant Analysis (SSHGDA) [43] to construct complex intraclass and interclass scatter matrices that describe the local geometric similarity of HSI data. Superpixel-level graphs and local reconstruction graphs serve as the spatial regularization in Spatial-regularized LGDE (SLGDE) [29] and Local Geometric Structure Fisher Analysis (LGSFA) [44], respectively. Recently, He et al. [45] reviewed many state-of-the-art spectral-spatial feature extraction and classification methods, demonstrating that spatial information can be beneficial to HSI feature extraction and classification.
In this paper, we focus on the Graph-Embedding Discriminant Analysis framework because of its outstanding performance and low complexity. Although progress has been made within this framework, e.g., SGDA, CGDA, LapCGDA, and SLGDA, the performance of these models can be further improved by efficiently incorporating spatial information. Motivated by the recently introduced Joint CR (JCR) [46] and Spatial-aware CR (SaCR) [10] algorithms, we propose Spatial-aware Collaborative Graph-based Discriminant Analysis (SaCGDA), in which spectral-spatial features are first preprocessed by the average filtering of JCR, and the spatial information is then encoded as a spatial regularization term in CR to construct the spectral-spatial similarity graph in a simple and efficient way. To further improve the model, inspired by LapCGDA, a spectral Laplacian regularization term is also introduced into SaCGDA, giving rise to Laplacian-regularized SaCGDA (LapSaCGDA), which exploits the spectral-spatial information more effectively.
The rest of the paper is organized as follows. Section 2 briefly reviews related work, including Collaborative Representation (CR) and the Graph-Embedding Discriminant Analysis framework. The proposed SaCGDA and LapSaCGDA are introduced in Section 3. Three HSI datasets are then used to evaluate the effectiveness of the proposed algorithms in Section 4. Finally, concluding remarks are given in Section 5.
2. Related Works
In this section, we review related work, including the typical CR model and its variants JCR and SaCR, as well as the Graph-Embedding Discriminant Analysis framework.
For the sake of consistency, we use the following notation throughout this paper: $X=[{\mathit{x}}_{1},\dots ,{\mathit{x}}_{N}]$ denotes the observed (input) data, with each sample ${\mathit{x}}_{n}$ lying in a high-dimensional space ${\mathcal{R}}^{D}$; $\mathit{y}=[{y}_{1},\dots ,{y}_{N}]$ denotes the observed outputs (labels), with each ${y}_{n}$ being a discrete class label in $\{1,2,\dots ,C\}$, where C is the number of classes; $Z=[{\mathit{z}}_{1},\dots ,{\mathit{z}}_{N}]$ denotes the dimensionality-reduced variables/features in a low-dimensional space ${\mathcal{R}}^{d}$ ($d\ll D$), with each ${\mathit{z}}_{n}$ corresponding to ${\mathit{x}}_{n}$ and/or ${y}_{n}$. For convenience, we further regard X as a $D\times N$ matrix, $\mathit{y}$ as an $N\times 1$ vector, and Z as a $d\times N$ matrix, so the training data can be written as $\mathcal{D}={\left\{({\mathit{x}}_{n},{y}_{n})\right\}}_{n=1}^{N}$. Let ${n}_{l}$ be the number of training samples from the lth class, with ${\sum}_{l=1}^{C}{n}_{l}=N$. As in LDA, the goal of discriminant-analysis-based DR models is to find the low-dimensional projection Z via the typical linear mapping

${\mathit{z}}_{n}={P}^{T}{\mathit{x}}_{n},\phantom{\rule{2em}{0ex}}(1)$

where the transformation matrix $P\in {\mathcal{R}}^{D\times d}$ linearly projects each sample ${\mathit{x}}_{n}$ into the low-dimensional subspace.
2.1. Collaborative Representation (CR)
Representation-based models such as Sparse Representation (SR), Collaborative Representation (CR), and Low-Rank Representation (LRR) have been successfully applied in many tasks [7]. Compared to the computationally expensive SR, CR provides comparable classification performance with a very efficient closed-form solution for the model parameters. Given a test sample ${\mathit{x}}^{*}$, all the training data are employed to represent it by

$\underset{\mathit{w}}{min}{\parallel {\mathit{x}}^{*}-X\mathit{w}\parallel}_{2}^{2}+\alpha \mathsf{\Omega}\left(\mathit{w}\right),$

where the regularization parameter $\alpha $ balances the trade-off between the reconstruction residual and the regularization of the representation coefficients. Typically, the ${\ell}_{2}$ norm is used, with $\mathsf{\Omega}\left(\mathit{w}\right)={\parallel \mathit{w}\parallel}_{2}^{2}$, although other regularizers can also be adopted [47,48]. For example, the ${\ell}_{1}$ norm instead of the ${\ell}_{2}$ norm induces sparsity in the coefficients, resulting in SR. A locality regularization [49], $\mathsf{\Omega}\left(\mathit{w}\right)={\parallel \mathsf{\Gamma}\mathit{w}\parallel}_{2}^{2}$, gives different freedom to each training sample according to its Euclidean distance from ${\mathit{x}}^{*}$, where each entry of the diagonal matrix $\mathsf{\Gamma}$ is defined by ${\mathsf{\Gamma}}_{ii}={\parallel {\mathit{x}}^{*}-{\mathit{x}}_{i}\parallel}_{2}$. For the classic CR model, by solving the least-squares problem, the optimal representation coefficients ${\mathit{w}}^{*}$ can be obtained in closed form as

${\mathit{w}}^{*}={\left({X}^{T}X+\alpha I\right)}^{-1}{X}^{T}{\mathit{x}}^{*},$

and the class label of the test sample ${\mathit{x}}^{*}$ can then be predicted according to the minimum residual

$class\left({\mathit{x}}^{*}\right)=\underset{l}{argmin}{\parallel {\mathit{x}}^{*}-{X}_{l}{\mathit{w}}_{l}^{*}\parallel}_{2},$

where ${X}_{l}$ and ${\mathit{w}}_{l}^{*}$ are the subsets of X and ${\mathit{w}}^{*}$ corresponding to class l, respectively.
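The closed-form solution and the minimum-residual classification rule above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the helper names `cr_coefficients` and `cr_classify` are our own:

```python
import numpy as np

def cr_coefficients(X, x_star, alpha=1e-3):
    # Closed-form CR solution: w* = (X^T X + alpha * I)^(-1) X^T x*
    N = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(N), X.T @ x_star)

def cr_classify(X, y, x_star, alpha=1e-3):
    # Predict the label of x_star by the minimum class-wise residual
    # ||x* - X_l w_l*||_2 over the class labels l.
    w = cr_coefficients(X, x_star, alpha)
    labels = np.unique(y)
    residuals = [np.linalg.norm(x_star - X[:, y == l] @ w[y == l])
                 for l in labels]
    return labels[int(np.argmin(residuals))]

# Toy data: two well-separated classes in R^5 (columns are samples)
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0.0, 0.1, (5, 10)),
               rng.normal(3.0, 0.1, (5, 10))])
y = np.array([0] * 10 + [1] * 10)
```

Note that only one linear solve is needed per test sample, which is the efficiency advantage of CR over the iterative solvers required by SR.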
As mentioned above, spatial information is significant for HSI data analysis, and two simple but efficient spatial CR models have recently been developed. The first, termed Joint CR (JCR), uses spatial information as a preprocessing step, based on the fact that neighboring pixels usually share similar spectral characteristics with high probability. The spatial correlation across neighboring pixels is incorporated indirectly by a joint model in which each training and testing sample is represented by the spatial average of its neighboring pixels, similar to the convolution operation in a Convolutional Neural Network (CNN). The second model, Joint Spatial-aware CR (JSaCR), further utilizes spatial information as a spatial constraint in a simple and effective way:

$\underset{\mathit{w}}{min}{\parallel {\tilde{\mathit{x}}}^{*}-\tilde{X}\mathit{w}\parallel}_{2}^{2}+\alpha {\parallel \tilde{\mathsf{\Gamma}}\mathit{w}\parallel}_{2}^{2}+\beta {\parallel \mathrm{diag}\left(\mathit{s}\right)\mathit{w}\parallel}_{2}^{2},\phantom{\rule{2em}{0ex}}(6)$

where ${\tilde{\mathit{x}}}^{*}$ and each column of $\tilde{X}$ are the averaged spectral features of the test sample ${\mathit{x}}^{*}$ and of each training sample ${\mathit{x}}_{n}$, computed over a small window with m neighbors; $\tilde{\mathsf{\Gamma}}$ is defined analogously by ${\tilde{\mathsf{\Gamma}}}_{ii}={\parallel {\tilde{\mathit{x}}}^{*}-{\tilde{\mathit{x}}}_{i}\parallel}_{2}\phantom{\rule{4pt}{0ex}}(i=1,2,\dots ,N)$; each element of $\mathit{s}=[{s}_{1},{s}_{2},\dots ,{s}_{N}]$ is associated with one training sample and encourages the representation coefficients $\mathit{w}$ to be spatially coherent w.r.t. the training data; and $\mathrm{diag}\left(\mathit{s}\right)$ is the square diagonal matrix with the elements of $\mathit{s}$ on its main diagonal. To explain the spatial constraint in the third term of Equation (6), first denote the pixel coordinates of the testing sample ${\mathit{x}}^{*}$ and of each training pixel ${\mathit{x}}_{n}$ by $({p}_{*},{q}_{*})$ and $({p}_{n},{q}_{n})$, respectively. The spatial relationship between them is measured with a smoothing parameter t by

${s}_{n}={\left[\mathrm{dist}(({p}_{*},{q}_{*}),({p}_{n},{q}_{n}))\right]}^{t},\phantom{\rule{2em}{0ex}}(7)$

where $\mathrm{dist}(\xb7,\xb7)$ denotes the Euclidean distance and the smoothing parameter t adjusts the distance decay speed of the spatial constraint. Each element is then normalized to $s\in (0,1]$ by dividing $\mathit{s}$ by $max\left(\mathit{s}\right)$. A large value of t implicitly means that pixels spatially far from the test sample are penalized by pushing their coefficients toward 0. As can be seen from the JSaCR formulation, the locality constraint (second term) and the spatial constraint (third term) are controlled by the regularization parameters $\alpha $ and $\beta $, respectively.
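The spatial weight vector s can be computed directly from pixel coordinates. The sketch below is illustrative only (the helper name `spatial_weights` is our own) and follows the distance-decay and max-normalization described above:

```python
import numpy as np

def spatial_weights(pq_star, pq_train, t=2.0):
    # s_n = dist((p*, q*), (p_n, q_n))^t, then normalised by max(s),
    # so far-away training pixels receive penalties close to 1 and are
    # pushed toward zero coefficients by the diag(s) regularizer.
    d = np.linalg.norm(np.asarray(pq_train, float) - np.asarray(pq_star, float),
                       axis=1)
    s = d ** t
    return s / s.max()

# Test pixel at (0, 0); training pixels increasingly far away
coords = [(0, 1), (2, 2), (5, 5), (10, 10)]
s = spatial_weights((0, 0), coords, t=2.0)
```

Increasing t sharpens the decay, so spatially distant training pixels are penalized even more strongly relative to nearby ones.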
2.2. Graph-Embedding Discriminant Analysis
Graph-Embedding Discriminant Analysis [12] seeks to learn the projection matrix P under the linear model assumption ${\mathit{z}}_{n}={P}^{T}{\mathit{x}}_{n}$ by preserving the similarities of samples in the original observation space. Many manifold learning and discriminant analysis methods can be unified in this framework, such as LPP, NPE, LDA, and the recently proposed SGDA and CGDA.
Typically, a general Graph-Embedding Discriminant Analysis model finds a low-dimensional subspace in which the local neighborhood relationships of the high-dimensional observations are retained. Based on the linear mapping ${\mathit{z}}_{n}={P}^{T}{\mathit{x}}_{n}$, the objective function of Graph-Embedding Discriminant Analysis is

$\underset{P}{min}\sum_{i,j}{\parallel {P}^{T}{\mathit{x}}_{i}-{P}^{T}{\mathit{x}}_{j}\parallel}_{2}^{2}{W}_{ij}\phantom{\rule{1em}{0ex}}\mathrm{s}.\mathrm{t}.\phantom{\rule{4pt}{0ex}}{P}^{T}X{L}_{p}{X}^{T}P=I,\phantom{\rule{2em}{0ex}}(8)$

where the similarity graph matrix W is built from pairwise distances between observed samples to represent the local geometry, with each ${W}_{ij}$ being the similarity/affinity between samples ${\mathit{x}}_{i}$ and ${\mathit{x}}_{j}$; $L=T-W$ is the Laplacian matrix of the graph G; T is the diagonal matrix whose ith diagonal element is ${T}_{ii}={\sum}_{j=1}^{N}{W}_{ij}$; and ${L}_{p}$ is a sample-scale normalization constraint or the Laplacian matrix of a penalty graph ${G}_{p}$. The objective function can be simply reformulated as

$\underset{P}{min}\phantom{\rule{4pt}{0ex}}\mathrm{tr}\left({P}^{T}XL{X}^{T}P\right)\phantom{\rule{1em}{0ex}}\mathrm{s}.\mathrm{t}.\phantom{\rule{4pt}{0ex}}{P}^{T}X{L}_{p}{X}^{T}P=I.\phantom{\rule{2em}{0ex}}(9)$

The optimal projection matrix P can then be obtained by solving the generalized eigenvalue decomposition problem

$XL{X}^{T}\mathit{p}=\lambda X{L}_{p}{X}^{T}\mathit{p},\phantom{\rule{2em}{0ex}}(10)$

where $P\in {\mathcal{R}}^{D\times d}$ is constructed from the d eigenvectors corresponding to the d smallest nonzero eigenvalues. As these formulations show, the performance of Graph-Embedding Discriminant Analysis algorithms largely depends on how the similarity/affinity matrix is constructed.
Let us first take the unsupervised LPP model as a classic example, where the similarity between two samples ${\mathit{x}}_{i}$ and ${\mathit{x}}_{j}$ is typically measured by the heat kernel if they are neighbors:

${W}_{ij}={e}^{-\frac{{\parallel {\mathit{x}}_{i}-{\mathit{x}}_{j}\parallel}^{2}}{r}},\phantom{\rule{2em}{0ex}}(11)$

with a pre-specified parameter r.
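A k-nearest-neighbour heat-kernel graph in this spirit can be sketched as follows. This is an illustrative helper of our own, with the neighbourhood size k and kernel width r as free parameters:

```python
import numpy as np

def heat_kernel_graph(X, r=1.0, k=3):
    # W_ij = exp(-||x_i - x_j||^2 / r) if x_j is among the k nearest
    # neighbours of x_i (or vice versa), and 0 otherwise.
    N = X.shape[1]
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    mask = np.zeros((N, N), dtype=bool)
    nn = np.argsort(D2, axis=1)[:, 1:k + 1]  # skip self (distance 0)
    for i in range(N):
        mask[i, nn[i]] = True
    # symmetrise the neighbourhood relation before applying the kernel
    W = np.where(mask | mask.T, np.exp(-D2 / r), 0.0)
    return W
```

The resulting W is symmetric with a zero diagonal, as required by the Laplacian construction $L=T-W$.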
Compared to unsupervised LPP, supervised discriminant analysis models such as LDA, SGDA, and CGDA construct the similarity matrix in a supervised manner, which provides better discriminant ability. For example, W in LDA, SGDA, or CGDA is typically a block-diagonal matrix

$W=\mathrm{diag}\left({W}^{1},{W}^{2},\dots ,{W}^{C}\right),\phantom{\rule{2em}{0ex}}(12)$

where each ${\left\{{W}^{l}\right\}}_{l=1}^{C}$ is the within-class similarity matrix of size ${n}_{l}\times {n}_{l}$, built from the ${n}_{l}$ training samples of the lth class only. Different strategies have been applied to construct the within-class similarity matrix ${W}^{l}=[{\mathit{w}}_{1}^{l};{\mathit{w}}_{2}^{l};\dots ;{\mathit{w}}_{{n}_{l}}^{l}]\in {\mathcal{R}}^{{n}_{l}\times {n}_{l}}$. As can be seen from Equation (13), the ${\ell}_{1}$, ${\ell}_{2}$, nuclear $+{\ell}_{1}$, and manifold $+{\ell}_{2}$ norms are adopted in the representation-based problems of SGDA, CGDA, SLGDA, and LapCGDA, respectively:

$\begin{array}{ll}\mathrm{SGDA}:\hfill & \underset{\mathit{w}}{min}{\parallel {\mathit{x}}_{n}^{l}-{X}_{n}^{l}\mathit{w}\parallel}_{2}^{2}+\alpha {\parallel \mathit{w}\parallel}_{1},\hfill \\ \mathrm{CGDA}:\hfill & \underset{\mathit{w}}{min}{\parallel {\mathit{x}}_{n}^{l}-{X}_{n}^{l}\mathit{w}\parallel}_{2}^{2}+\alpha {\parallel \mathit{w}\parallel}_{2}^{2},\hfill \\ \mathrm{SLGDA}:\hfill & \underset{{W}^{l}}{min}{\parallel {X}^{l}-{X}^{l}{W}^{l}\parallel}_{F}^{2}+\alpha {\parallel {W}^{l}\parallel}_{*}+\beta {\parallel {W}^{l}\parallel}_{1},\hfill \\ \mathrm{LapCGDA}:\hfill & \underset{\mathit{w}}{min}{\parallel {\mathit{x}}_{n}^{l}-{X}_{n}^{l}\mathit{w}\parallel}_{2}^{2}+\alpha {\parallel \mathit{w}\parallel}_{2}^{2}+\beta {\mathit{w}}^{T}{H}_{n}\mathit{w},\hfill \end{array}\phantom{\rule{2em}{0ex}}(13)$

where ${\mathit{x}}_{n}^{l}$ is a training sample from the lth class, ${X}_{n}^{l}$ denotes all training samples from the lth class excluding ${\mathit{x}}_{n}^{l}$, ${\parallel \xb7\parallel}_{*}$ in SLGDA denotes the nuclear norm, and ${H}_{n}$ in LapCGDA is the Laplacian matrix constructed via Equation (11).
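For concreteness, the CGDA variant (${\ell}_{2}$ regularizer) of this per-class construction can be sketched as follows. The block-diagonal structure of Equation (12) emerges because each sample is represented only by samples of its own class; the helper name `cgda_graph` is our own:

```python
import numpy as np

def cgda_graph(X, y, alpha=1e-3):
    # Each row of the within-class block W^l holds the l2-regularised
    # representation coefficients of one class-l sample over the other
    # class-l samples; cross-class entries stay zero.
    N = X.shape[1]
    W = np.zeros((N, N))
    for l in np.unique(y):
        idx = np.where(y == l)[0]
        for j, n in enumerate(idx):
            others = np.delete(idx, j)          # exclude x_n itself
            Xl = X[:, others]
            w = np.linalg.solve(Xl.T @ Xl + alpha * np.eye(len(others)),
                                Xl.T @ X[:, n])
            W[n, others] = w
    return W
```

Swapping the ridge solve for an ${\ell}_{1}$ or nuclear-norm solver would give the SGDA or SLGDA graphs instead, at a higher computational cost.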
Although these models have demonstrated their effectiveness, their performance can be unsatisfactory because the spatial information in HSIs is not utilized. Compared to spatial constraints such as the local reconstruction point, superpixel-based, and hypergraph-based spatial regularization models in LGSFA [44], SLGDE [29], and SSHGDA [43], the simple spatial prior in JSaCR has been proven fast and efficient. Therefore, in this paper, we introduce this spatial prior into the CGDA model because of its outstanding performance and low complexity.
3. Laplacian-Regularized Spatial-Aware CGDA
In this section, motivated by JSaCR and LapCGDA, we propose Laplacian-regularized Spatial-aware CGDA (LapSaCGDA), which simultaneously introduces a spatial prior and a spectral manifold regularization into CGDA. This significantly increases the discriminant ability of the learned similarity/affinity matrix and thereby boosts the performance of the DR model.
First, based on the fact that the average filtering in JCR can smooth the random noise in HSI data, we preprocess all the HSI data in the same way, which leads to the new notations ${\tilde{\mathit{x}}}_{n}$ and $\tilde{X}$ for a training sample and all training data after preprocessing, respectively.
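This average-filtering preprocessing can be sketched as a per-band mean filter over a small spatial window. The following is a naive illustrative version (our own helper; replicate padding at the borders is one possible choice, not necessarily the one used in JCR):

```python
import numpy as np

def average_filter(cube, w=7):
    # Mean-filter each spectral band of an HSI cube of shape (H, W, D)
    # with a w x w spatial window (replicate padding at the borders).
    H, Wd, D = cube.shape
    r = w // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode='edge')
    out = np.empty_like(cube, dtype=float)
    for i in range(H):
        for j in range(Wd):
            out[i, j] = padded[i:i + w, j:j + w].mean(axis=(0, 1))
    return out
```

Because this step is independent of the graph construction, it can be run once, offline, for all training and testing pixels.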
Then, based on the JSaCR model in Equation (6), it is straightforward to use JSaCR instead of the typical CR in CGDA to construct the within-class similarity matrices ${\left\{{W}^{l}\right\}}_{l=1}^{C}$ for the ${n}_{l}$ labeled samples of the lth class by

$\underset{\mathit{w}}{min}{\parallel {\tilde{\mathit{x}}}_{n}^{l}-{\tilde{X}}_{n}^{l}\mathit{w}\parallel}_{2}^{2}+\alpha {\parallel \tilde{\mathsf{\Gamma}}\mathit{w}\parallel}_{2}^{2}+\beta {\parallel \mathrm{diag}\left(\mathit{s}\right)\mathit{w}\parallel}_{2}^{2},\phantom{\rule{2em}{0ex}}(17)$

where ${\tilde{\mathit{x}}}_{n}^{l}$ is a preprocessed training sample from the lth class, ${\tilde{X}}_{n}^{l}$ contains all preprocessed training samples from the lth class excluding ${\tilde{\mathit{x}}}_{n}^{l}$, and $\tilde{\mathsf{\Gamma}}$ and $\mathit{s}$ are defined analogously by ${\tilde{\mathsf{\Gamma}}}_{ii}={\parallel {\tilde{\mathit{x}}}_{n}^{l}-{\tilde{\mathit{x}}}_{i}^{l}\parallel}_{2}$ and ${s}_{i}={\left[\mathrm{dist}(({p}_{n},{q}_{n}),({p}_{i},{q}_{i}))\right]}^{t}$, with $({p}_{i},{q}_{i})$ being the pixel coordinates of the samples in the lth class $(i=1,2,\dots ,{n}_{l})$.
The closed-form solution of the optimization problem in Equation (17) is easily obtained as

${\mathit{w}}^{*}={\left({\left({\tilde{X}}_{n}^{l}\right)}^{T}{\tilde{X}}_{n}^{l}+\alpha {\tilde{\mathsf{\Gamma}}}^{T}\tilde{\mathsf{\Gamma}}+\beta \mathrm{diag}{\left(\mathit{s}\right)}^{2}\right)}^{-1}{\left({\tilde{X}}_{n}^{l}\right)}^{T}{\tilde{\mathit{x}}}_{n}^{l}.\phantom{\rule{2em}{0ex}}(18)$
As stated in JSaCR, the locality prior (second term) and the spatial prior (third term) in Equation (17) are efficient at extracting the spectral-spatial structure in HSI data. However, the simple locality prior in the second term of Equation (17) can be less effective at exploiting intrinsic geometric information than manifold priors such as the Laplacian regularization used in LapCGDA. Hence, we further propose Laplacian-regularized Spatial-aware CGDA (LapSaCGDA) by introducing Laplacian regularization into SaCGDA. To enable a comparison between the two spectral priors, we add the spectral Laplacian constraint to SaCGDA as a third regularization term rather than replacing the spectral locality constraint.
Similar to SaCGDA, the within-class similarity matrices ${\left\{{W}^{l}\right\}}_{l=1}^{C}$ of LapSaCGDA can be constructed by solving the following optimization problem:

$\underset{\mathit{w}}{min}{\parallel {\tilde{\mathit{x}}}_{n}^{l}-{\tilde{X}}_{n}^{l}\mathit{w}\parallel}_{2}^{2}+\alpha {\parallel \tilde{\mathsf{\Gamma}}\mathit{w}\parallel}_{2}^{2}+\beta {\parallel \mathrm{diag}\left(\mathit{s}\right)\mathit{w}\parallel}_{2}^{2}+\gamma {\mathit{w}}^{T}H\mathit{w},\phantom{\rule{2em}{0ex}}(19)$

where ${H}_{ni}={e}^{-\frac{{\parallel {\tilde{\mathit{x}}}_{n}^{l}-{\tilde{\mathit{x}}}_{i}^{l}\parallel}^{2}}{r}}$ with a pre-specified parameter r.
Mathematically, the closed-form solution of this optimization problem is

${\mathit{w}}^{*}={\left({\left({\tilde{X}}_{n}^{l}\right)}^{T}{\tilde{X}}_{n}^{l}+\alpha {\tilde{\mathsf{\Gamma}}}^{T}\tilde{\mathsf{\Gamma}}+\beta \mathrm{diag}{\left(\mathit{s}\right)}^{2}+\gamma H\right)}^{-1}{\left({\tilde{X}}_{n}^{l}\right)}^{T}{\tilde{\mathit{x}}}_{n}^{l}.\phantom{\rule{2em}{0ex}}(20)$
It is worth noting that LapSaCGDA reduces to SaCGDA when $\gamma =0$, to LapCGDA when $\beta =0$, and to CGDA when $\beta =\gamma =0$; thus, the proposed LapSaCGDA is a general model that unifies the three with different regularization parameter configurations.
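This unifying property can be checked numerically: each additional prior only adds a term to the matrix being inverted, so setting the corresponding parameter to zero recovers the simpler model. The sketch below is illustrative, under the assumption that the quadratic term $\gamma {\mathit{w}}^{T}H\mathit{w}$ enters the normal equations as $\gamma H$; the helper name is our own:

```python
import numpy as np

def lapsacgda_coefficients(X, x, gamma_diag, s, H, alpha, beta, gamma):
    # w* = (X^T X + alpha*Gamma^2 + beta*diag(s)^2 + gamma*H)^(-1) X^T x.
    # gamma = 0 recovers SaCGDA, beta = 0 a LapCGDA-style model,
    # and beta = gamma = 0 plain CGDA.
    A = (X.T @ X
         + alpha * np.diag(np.asarray(gamma_diag) ** 2)
         + beta * np.diag(np.asarray(s) ** 2)
         + gamma * H)
    return np.linalg.solve(A, X.T @ x)
```

Because all three regularizers add positive semidefinite terms, the system stays well conditioned for any nonnegative parameter configuration.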
After obtaining all the within-class similarity matrices ${\left\{{W}^{l}\right\}}_{l=1}^{C}$ by SaCGDA or LapSaCGDA, the block-diagonal similarity matrix $W=\mathrm{diag}({W}^{1},{W}^{2},\dots ,{W}^{C})$ can be constructed as in Equation (12). Finally, based on the Graph-Embedding Discriminant Analysis framework, the optimal projection matrix P is obtained by solving the eigenvalue decomposition in Equation (10).
The complete LapSaCGDA algorithm is outlined in Algorithm 1.
Algorithm 1 LapSaCGDA for HSI Dimensionality Reduction and Classification
Input: high-dimensional training samples $X\in {\mathcal{R}}^{D\times N}$, training ground truth $y\in {\mathcal{R}}^{N}$, prefixed latent dimensionality d, testing pixels ${X}^{*}\in {\mathcal{R}}^{D\times M}$, testing ground truth ${y}^{*}\in {\mathcal{R}}^{M}$, and pre-specified regularization parameters $\alpha ,\beta ,\gamma $.
Output: {$Acc$, P}.
1: Preprocess all the training and testing data by average filtering;
2: Evaluate the similarity matrix W by solving Equation (20);
3: Evaluate the optimal projection matrix P by solving the eigenvalue decomposition in Equation (10);
4: Evaluate the low-dimensional features of all the training and testing data by ${\mathit{z}}_{n}={P}^{T}{\mathit{x}}_{n}$;
5: Perform KNN and/or SVM in the low-dimensional feature space and return the classification accuracy $Acc$.
As for model complexity, Algorithm 1 shows that, compared to competing models such as CGDA and LapCGDA, the proposed SaCGDA and LapSaCGDA add no extra complexity except the average-filtering preprocessing step, which can be conducted before the algorithm starts.
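Under the assumptions above, the core of Algorithm 1 (minus the classifier) can be condensed into a short end-to-end sketch: build a block-diagonal graph, solve the generalized eigenproblem, and project. All names are our own, and a simple ${\ell}_{2}$ ridge graph stands in for the full LapSaCGDA graph:

```python
import numpy as np

def fit_graph_embedding(X, y, d, alpha=1e-3, eps=1e-8):
    # Step 2: block-diagonal within-class graph via ridge representation
    N = X.shape[1]
    W = np.zeros((N, N))
    for l in np.unique(y):
        idx = np.where(y == l)[0]
        for j, n in enumerate(idx):
            others = np.delete(idx, j)
            Xl = X[:, others]
            W[n, others] = np.linalg.solve(
                Xl.T @ Xl + alpha * np.eye(len(others)), Xl.T @ X[:, n])
    W = (np.abs(W) + np.abs(W.T)) / 2          # symmetrise the graph
    # Step 3: generalized eigenproblem X L X^T p = lambda X T X^T p
    T = np.diag(W.sum(axis=1))
    L = T - W
    B = X @ T @ X.T + eps * np.eye(X.shape[0])
    bvals, bvecs = np.linalg.eigh(B)
    Bi = bvecs @ np.diag(1.0 / np.sqrt(bvals)) @ bvecs.T
    _, vecs = np.linalg.eigh(Bi @ (X @ L @ X.T) @ Bi)
    P = Bi @ vecs[:, :d]
    # Step 4: project the training data, Z = P^T X
    return P, P.T @ X
```

Plugging in the spatial and Laplacian terms of Equation (20) only changes the per-sample linear solve in step 2; the eigendecomposition and projection steps are unchanged.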
4. Experiments
In this section, we validate the effectiveness of the proposed SaCGDA and LapSaCGDA for HSI feature reduction and classification. The novel models are compared to the classic supervised DR models NWFE [8] and SPPCA [16], plus the recently developed Graph-Embedding Discriminant Analysis based DR techniques LapCGDA [7], SLGDA [34], and LGSFA [44], on three typical HSI datasets in terms of classification accuracy. In addition, the classic SVM applied in the original high-dimensional spectral feature space is included for comparison. Since all the HSI datasets are initially preprocessed by average filtering, the original spectral features can also be regarded as spectral-spatial features.
For a fair comparison, all data are first preprocessed by average filtering with a $7\times 7$ spatial window; then, K-Nearest Neighbors (KNN) with the Euclidean distance and SVM with a Radial Basis Function (RBF) kernel are chosen as classifiers in the dimensionality-reduced space. All DR models are assessed in terms of the classification Overall Accuracy (OA), Average Accuracy (AA), and Kappa Coefficient (KC), as well as the True Positive Rate (TPR) and False Positive Rate (FPR). The parameter K in KNN is set to 5, and the optimal SVM kernel parameters are selected by ten-fold cross-validation over the set $\{0.001,0.01,0.1,1,10,100,1000,10000\}$. The regularization parameters $\alpha ,\beta ,\gamma $ in LapCGDA, SLGDA, and the proposed SaCGDA and LapSaCGDA are first chosen by grid search over the set $\{{10}^{-6},{10}^{-5},\dots ,{10}^{4}\}$. Then, since the optimal projection dimensionality of each DR model must be pre-specified, for each HSI dataset, with the same randomly chosen training and testing data, we select the best dimensionality of each DR model in the range 1–30 according to the KNN classification accuracy in the projected low-dimensional feature space. Finally, with the selected optimal dimensionality for each dataset and each DR algorithm, we further compare all DR techniques as the number of training samples increases. Classification maps are also compared when limited training data are available. All experiments are repeated ten times, and the averaged results are reported with Standard Deviations (STDs).
4.1. Data Description
Three HSI datasets, the Pavia University, Salinas, and Indian Pines scenes from [50], are employed in our experiments.
The Pavia University scene (PaviaU) was captured by the Reflective Optics System Imaging Spectrometer (ROSIS-3) sensor over Pavia University, Italy, in 2002. It has 103 spectral bands and an image size of $610\times 340$ pixels after removing uninformative pixels and noisy bands, with a geometric resolution of $1.3$ m. The dataset contains nine land cover types; the False Color Composite (FCC) and Ground Truth (GT) are shown in Figure 1.
The Salinas scene (Salinas) was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Salinas Valley, California, in 1998, with a geometric resolution of $3.7$ m. The spatial size of the data is $512\times 217$ pixels with 224 spectral bands; after removing 20 water absorption and atmospheric effect bands, 204 spectral bands remain. Sixteen ground truth classes are labeled in this dataset; the false color composite and ground truth are shown in Figure 2.
The Indian Pines data (IndianPines) is a scene over Northwest Indiana captured by the AVIRIS sensor in 1992; it consists of $145\times 145$ pixels and 200 spectral bands after removing bands affected by noise and water absorption. Sixteen ground truth classes are considered in this dataset; the false color composite and ground truth are shown in Figure 3.
4.2. Experiments on the Pavia University Data
To demonstrate the effectiveness of the proposed models, we first analyze parameter sensitivity and select the optimal parameters for each DR model, because the regularization parameters $\alpha ,\beta ,\gamma $ in LapCGDA, SLGDA, and the proposed SaCGDA and LapSaCGDA significantly affect performance. In addition, for the proposed SaCGDA and LapSaCGDA, the smoothing parameter t in the spatial prior is also searched over the set $\{2,4,6,8\}$. In this experiment, 30 samples are randomly picked from each class as training data, and the projection dimensionality is fixed to 30 for the sensitivity analysis.
As can be seen from
Figure 4a for SaCGDA, the parameter $\beta $ w.r.t. the spatial prior affects the OAs much more strongly than the parameter $\alpha $ w.r.t. the spectral locality prior, and the best smooth parameter t in the spatial prior is 2 in our experiment. Similar results can also be found in the parameters’ sensitivity experiment for LapSaCGDA, which are only partially shown in
Figure 4b because a 4D plot over three parameters is impractical. Here, we simply set the parameter $\alpha =0.0001$ w.r.t. the spectral locality prior for both SaCGDA and LapSaCGDA because the OAs are insensitive to $\alpha $ according to the sensitivity results. The best regularization parameters for LapCGDA, SLGDA, SaCGDA and LapSaCGDA are listed in
Table 1 and will be used in the following experiments. In addition, it can be found from
Figure 4b and
Table 1 that the best parameter $\gamma $ w.r.t. the spectral manifold prior in LapSaCGDA is larger than the optimal parameter $\alpha $ w.r.t. the spectral locality prior, meaning that the manifold prior has more influence on the DR model than the simple locality prior.
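To make the manifold prior concrete, the sketch below builds a graph Laplacian from a k-nearest-neighbor affinity graph over spectral vectors; this is the standard construction, and the actual edge weighting used in LapSaCGDA may differ (synthetic data, illustrative only).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Synthetic pixels: 50 samples with 10 spectral bands (stand-in data).
X = np.random.default_rng(0).normal(size=(50, 10))

# Binary k-NN affinity graph over spectral vectors, then symmetrized;
# LapSaCGDA's actual edge weights may differ from this illustration.
W = kneighbors_graph(X, n_neighbors=5, mode="connectivity").toarray()
W = np.maximum(W, W.T)

# Graph Laplacian L = D - W; a regularizer of the form trace(Z L Z^T)
# penalizes neighboring pixels whose low-dimensional representations
# in Z differ, which is what "similar pixels share similar
# representations" means in matrix form.
L = np.diag(W.sum(axis=1)) - W
```

L is symmetric positive semi-definite with zero row sums, which is what makes the associated quadratic regularizer well behaved.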
Based on these optimal parameters selected by the grid search, we further choose the best dimensionality of the projection space in terms of the OAs of KNN and SVM performed in the low-dimensional projection space. The best embedding dimensionality is searched over the range 1–30.
What can be seen from
Figure 5 is that the proposed SaCGDA and LapSaCGDA outperform the other DR models in almost all low-dimensional projection spaces, and the optimal dimensionality for each DR model on this HSI data is 30. It should also be highlighted that, in the low-dimensional projection space, the OAs of KNN applied to features from the proposed SaCGDA and LapSaCGDA significantly exceed those of SVM applied to the original high-dimensional features, meaning that the learned low-dimensional features are highly discriminative.
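The dimensionality sweep can be sketched as follows, using PCA as a stand-in for the compared DR models (their solvers are not reproduced here) and synthetic data in place of the HSI pixels; the loop structure mirrors the 1–30 search scored by KNN and SVM OAs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic stand-in data: 300 "pixels" with 50 "bands", 3 classes.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=0.5, stratify=y,
                                      random_state=0)

best_dim, best_oa = None, -1.0
for d in range(1, 31):                      # candidate embedding dims 1..30
    pca = PCA(n_components=d).fit(Xtr)      # PCA stands in for the DR model
    Ztr, Zte = pca.transform(Xtr), pca.transform(Xte)
    for clf in (KNeighborsClassifier(1), SVC()):
        oa = clf.fit(Ztr, ytr).score(Zte, yte)   # overall accuracy
        if oa > best_oa:
            best_dim, best_oa = d, oa
print("best dimensionality:", best_dim)
```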
Furthermore, we also show the experimental results when different numbers of training data are randomly chosen. Specifically, 20–80 samples are randomly chosen from each class, and the remainder are used for testing. It can be seen from
Table 2 that the proposed SaCGDA and LapSaCGDA outperform the other state-of-the-art DR techniques and SVM significantly. Moreover, the fact that LapSaCGDA is superior to SaCGDA in terms of both KNN and SVM OAs demonstrates that the spectral Laplacian prior can improve the performance of SaCGDA.
Finally, in order to show the classification results of different methods for each class when limited training data are available, we randomly select 20 samples from each class to compare the eight algorithms. The results are objectively and subjectively shown in
Table 3 and
Figure 6, respectively. What can be seen from
Table 3 is that the proposed SaCGDA and LapSaCGDA outperform the other models in terms of AA, OA, and KC, as well as TPR and FPR in most classes, and LapSaCGDA achieves higher accuracy than SaCGDA, which clearly indicates that the Laplacian regularization in LapSaCGDA is beneficial to the DR model. Accordingly,
Figure 6 subjectively shows that the classification maps from the proposed methods w.r.t. the testing data are more accurate than those of the contrastive approaches.
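The reported metrics (OA, AA, KC, TPR, FPR) all derive from the confusion matrix, as the following minimal sketch on toy labels shows; the labels are invented for illustration and are unrelated to the Pavia results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

# Toy ground truth and predictions for three classes (illustration only).
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred)        # rows: true, cols: predicted
oa = np.trace(cm) / cm.sum()                 # OA: overall accuracy
per_class_tpr = np.diag(cm) / cm.sum(axis=1) # TPR: per-class recall
aa = per_class_tpr.mean()                    # AA: average (per-class) accuracy
kc = cohen_kappa_score(y_true, y_pred)       # KC: kappa coefficient

# FPR per class in one-vs-rest form: FP / (FP + TN).
fp = cm.sum(axis=0) - np.diag(cm)
tn = cm.sum() - (cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm))
per_class_fpr = fp / (fp + tn)
```

Here OA weights classes by their test-set sizes while AA weights them equally, which is why both are reported for imbalanced HSI classes.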
4.3. Experiments on the Salinas Data
To further verify the proposed models on data from a different hyperspectral sensor, we utilize the Salinas data captured by the AVIRIS sensor to demonstrate the effectiveness of the novel methods. Similarly, we first select the optimal regularization parameters such as $\alpha ,\beta ,\gamma $ in LapCGDA, SLGDA, and the proposed SaCGDA and LapSaCGDA, as well as the smooth parameter t in the spatial prior for SaCGDA and LapSaCGDA, by grid search. The embedding dimensionality is set to 30 and 30 training samples are randomly picked from each class so that the parameter sensitivity can be compared.
According to
Figure 7a, the OA becomes sensitive to the regularization parameter $\alpha $ in SaCGDA, but the optimal value is still less than 0.001; thus, we set $\alpha =0.0001$ for both SaCGDA and LapSaCGDA. In addition, it can be seen from
Figure 7b that the optimal parameter $\gamma $ w.r.t. the spectral manifold prior in LapSaCGDA is 0.1, which means that this manifold prior affects the DR model much more than the parameter $\alpha $ w.r.t. the spectral locality prior. Additionally, for this data set, the best smooth parameter t in the spatial prior in SaCGDA and LapSaCGDA is 4 by grid search.
Table 4 displays the best parameters obtained from the parameter sensitivity experiments, which will be used in the following experiments.
Once these optimal parameters are selected by the grid search, we then choose the best dimensionality of the projection space in terms of the OAs of KNN and SVM performed in the low-dimensional projection space. What can be seen from
Figure 8 is that the proposed SaCGDA and LapSaCGDA again outperform the other DR models in almost all low-dimensional projection spaces, and the optimal dimensionality for each DR model on this HSI data is 30.
Furthermore, based on the optimal projection dimensionality, we also objectively show the experimental results as the amount of training data increases. Specifically, 20–80 samples are randomly picked from all the labeled data, and the rest of the samples are used for testing. We can conclude from
Table 5 that the proposed SaCGDA and LapSaCGDA achieve the best OAs, especially when only a few training samples are available.
Finally, we select 20 samples from each class to compare all the models in terms of the classification accuracy for each class. As can be seen from
Table 6, the proposed two models obtain the highest accuracy in almost every class, and LapSaCGDA is slightly superior to SaCGDA due to the Laplacian prior. The classification maps in
Figure 9 support the same conclusion.
4.4. Experiments on the Indian Pines Data
Another challenging HSI data set is the Indian Pines data, also captured by the AVIRIS sensor. The optimal regularization parameters such as
$\alpha ,\beta ,\gamma $ in the proposed SaCGDA and LapSaCGDA can be obtained from the parameters’ sensitivity analysis experiments in
Figure 10, and the best smooth parameter t in the spatial prior is 2. Thirty training samples are randomly picked from each class in these experiments, and the projection dimensionality is also fixed to 30 so that the parameter sensitivity analysis can be conducted.
What can be seen from
Figure 10a is that the OAs are not sensitive to the parameter $\alpha $ w.r.t. the spectral locality prior in SaCGDA, because the optimal parameter $\beta =5000$ is so large that the spatial prior dominates the model and the other constraints are effectively ignored. For LapSaCGDA, the best parameter $\gamma $ w.r.t. the manifold prior is also small according to
Figure 10b. The best regularization parameters for LapCGDA, SLGDA, SaCGDA and LapSaCGDA are listed in
Table 7 for the following experiments.
Based on these optimal parameters selected by the grid search, the best dimensionality of the projection space for each DR model can be obtained from
Figure 11 in terms of the OAs of KNN and SVM performed in the low-dimensional projection space. Once again, the optimal dimensionality for almost every DR model on this HSI data is 30, and the proposed SaCGDA and LapSaCGDA are the best among all models, although LGSFA shows comparable OAs based on KNN in high-dimensional embedding spaces.
Moreover, we also show the experimental results when different amounts of training data are selected. Specifically, 20–80 samples are randomly picked from all the labeled data, and the remaining samples are used for testing. For classes with fewer samples available, no more than
$60\%$ of the samples belonging to those classes are randomly chosen. It can be seen from
Table 8 that LGSFA achieves OAs comparable to the proposed SaCGDA and LapSaCGDA based on KNN, while the two novel models outperform LGSFA and the other methods in terms of the SVM classification accuracy. As the spatial prior with the large regularization parameter
$\beta =5000$ dominates LapSaCGDA, the two spectral constraints are effectively ignored, so LapSaCGDA reduces to SaCGDA and produces the same results in this case.
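The capped per-class sampling described above can be sketched as follows; the function name and the 60% cap parameterization are illustrative choices, not code from the paper.

```python
import numpy as np

def sample_per_class(y, n_per_class, cap_frac=0.6, seed=0):
    """Randomly pick up to n_per_class training indices per class, never
    exceeding cap_frac of the samples available in that class."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        n = min(n_per_class, int(cap_frac * len(idx)))  # apply the 60% cap
        train_idx.extend(rng.choice(idx, size=n, replace=False))
    return np.array(train_idx)

y = np.array([0] * 100 + [1] * 25)     # class 1 has only 25 samples
idx = sample_per_class(y, n_per_class=20)
# class 0 contributes 20 samples; class 1 is capped at 0.6 * 25 = 15
```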
Finally, 20 samples are randomly selected from each class to similarly show the classification accuracy for each class based on SVM. It can be seen from
Table 9 that the proposed SaCGDA and LapSaCGDA are superior to the other models even though effectively only the spatial prior is utilized, which demonstrates the effectiveness of the spatial prior once again. Accordingly,
Figure 12 shows that the classification maps from the proposed SaCGDA and LapSaCGDA are more accurate than those of the other approaches.
4.5. Discussion
Based on the above experimental results, we can make the following observations:
 (i)
The proposed SaCGDA significantly outperforms SVM, SPPCA, NWFE, SLGDA, LapCGDA and LGSFA in terms of OA, AA, KC, TPR and FPR in most scenarios, and accordingly the classification maps based on the new model are smoother and more accurate than the results from the other supervised DR algorithms. These results clearly demonstrate that the introduced spatial constraint is very effective at modeling the spatial relationship between pixels.
 (ii)
The Laplacian regularized SaCGDA (LapSaCGDA) improves on SaCGDA by further introducing a Laplacian constraint, which encourages similar pixels to share similar representations. The experimental results verify that LapSaCGDA achieves higher accuracy than SaCGDA in most cases because LapSaCGDA can reveal the intrinsic manifold structures of HSI data more efficiently.
 (iii)
It is also worth pointing out that the SVM classifier consistently provides better classification results than the KNN classifier. However, if we compare the OAs of KNN based on the proposed two models to the OAs of SVM based on other DR techniques such as NWFE, SLGDA, LapCGDA and LGSFA, the KNN results are still better, which further proves that the proposed two algorithms are more effective at extracting discriminative low-dimensional features than the contrastive approaches.