Dimensionality Reduction of Hyperspectral Image with Graph-Based Discriminant Analysis Considering Spectral Similarity

Abstract: Recently, graph embedding has drawn great attention for dimensionality reduction in hyperspectral imagery. For example, locality preserving projection (LPP) uses the Euclidean distance in a heat kernel to create an affinity matrix and projects the high-dimensional data into a lower-dimensional space. However, the Euclidean distance is not sufficiently correlated with the intrinsic spectral variation of a material, which may result in an inappropriate graph representation. In this work, a graph-based discriminant analysis with a spectral similarity measurement (denoted as GDA-SS) is proposed, which explicitly accounts for how the spectral curve changes across bands. Experimental results based on real hyperspectral images demonstrate that the proposed method is superior to traditional methods, such as supervised LPP, and to the state-of-the-art sparse graph-based discriminant analysis (SGDA).


Introduction
Remote sensing big data typically cover a large spatial scale. Hyperspectral remote sensing imagery, especially for Earth observation, involves dense spectral sampling, resulting in a large spectral dimension as well. In hyperspectral image analysis, the rich spectral information, obtained at the cost of high spectral dimensionality, makes it possible to better classify the materials in an observed area. However, high dimensionality leads to the curse of dimensionality, which causes classification performance to deteriorate, especially when the number of available labeled training samples is limited [1][2][3][4][5][6].
Dimensionality reduction is usually applied as a preprocessing step in hyperspectral image analysis to remove redundant features and preserve useful information in a low-dimensional subspace. Projection-based strategies are a common technique for dimensionality reduction; their essence is to seek an optimal mapping matrix and then project the original data into a lower-dimensional subspace. This strategy includes unsupervised techniques, such as principal component analysis (PCA) [7] and the maximum-noise-fraction (MNF) transform, and supervised approaches, such as linear discriminant analysis (LDA) and local Fisher discriminant analysis (LFDA) [8,9]. PCA seeks a linear transformation that maximizes the variance in the projected subspace, whereas LDA maximizes the trace ratio of the between-class scatter to the within-class scatter.
In the past few years, graph theory [10], which describes the geometric structure of data, has been successfully applied to dimensionality reduction. The main idea of graph-based discriminant analysis (GDA) is to solve a sparse eigenvalue problem built on a block-diagonal affinity matrix, with one block per class label, whose nonzero elements represent the relationship between pairs of data points that share the same label. Depending on the affinity matrix, a series of algorithms such as local linear embedding (LLE) [11], Laplacian Eigenmap (LE) [12], and locality preserving projection (LPP) [13,14] can be derived for different tasks such as data visualization and subspace learning. In [10], a general graph-embedding (GE) framework was proposed to unify many existing manifold learning algorithms. It was noted that the key to GE is to construct a similarity graph that reflects the critical information in the original data. Besides the aforementioned algorithms, popular graph-based algorithms include unsupervised discriminant projection (UDP) [15], marginal Fisher analysis (MFA) [10], linear discriminant projection (LDP) [16], sparse preserving projection [17], and various extensions [18][19][20][21].
Unlike PCA and LDA, these graph-embedding algorithms do not assume that the data obey a Gaussian distribution; thus, they are more suitable for discriminant analysis. The essential difference among the aforementioned graph-based algorithms lies in how the similarity graph is constructed. In the existing literature, there are mainly two popular approaches for graph construction. One is based on pairwise distance (e.g., Euclidean distance); the other is based on reconstruction coefficients (e.g., sparse representation). The former has been successfully used in ISOMAP, supervised LPP (SLPP) [13,14], etc., and achieves excellent performance. The latter has attracted a lot of interest because of the wide application of the $\ell_p$-norm. Recently, sparse graph-based discriminant analysis (SGDA) [22], collaborative graph-based discriminant analysis (CGDA) [23], and semi-supervised double sparse graphs (sDSG) [24] have demonstrated their effectiveness.
Unlike traditional imagery, hyperspectral remote sensing imagery has a vital feature: each pixel is a high-dimensional vector. Such a vector directly reveals the spectral reflectance of the objects in different wavebands. In an ideal situation, the same objects have the same spectral signatures. However, in the real world, hyperspectral data may be corrupted to some extent by the sensor or by external factors such as atmosphere and illumination. Euclidean distance is usually used to evaluate the similarity between two vectors, but it is easily disturbed when a vector contains extreme values. Motivated by the aforementioned algorithms and the special intrinsic feature of hyperspectral data, a novel graph-based discriminant analysis via spectral similarity (denoted as GDA-SS) is proposed in this work. The spectral similarity measurement uses spectral characteristics to construct a similarity graph. The proposed method uses the absolute difference between a pair of pixels and sets a threshold to evaluate their similarity. The main contributions of this work are summarized as follows: (1) GDA-SS takes full advantage of spectral characteristics, which is sensible given the large number of bands in hyperspectral imagery; and (2) GDA-SS directly evaluates the similarity on the spectral bands and uses a proportionality coefficient to represent a discriminant graph, which makes the similarity easy to interpret.
The remainder of this paper is organized as follows. Section 2 reviews the graph-embedding dimensionality reduction framework and the similarity graphs in SLPP and SGDA. Section 3 describes the proposed GDA-SS algorithm in detail and analyzes its feasibility. Section 4 validates the proposed approach and reports classification results, comparing them to several state-of-the-art alternatives. Section 5 summarizes this work.

Graph-Embedding Dimensionality Reduction Framework
Let a hyperspectral dataset with $M$ samples be denoted as $X = \{x_i\}_{i=1}^{M}$ in an $\mathbb{R}^{d \times 1}$ feature space, where $d$ is the number of bands. In graph theory, an intrinsic graph among the pixels is denoted as $G = \{X, W\}$ with $W$ being an affinity matrix, and a penalty graph is represented as $G_p = \{X, W_p\}$ with $W_p$ being a penalty weight matrix. Let $C$ be the number of classes, $m_l$ be the number of available labeled samples in the $l$th class, and $\sum_{l=1}^{C} m_l = M$. The graph-embedding dimensionality reduction framework [10,25] seeks a $d \times K$ projection matrix $P$ (with $K \ll d$), which results in a low-dimensional subspace $Y = P^T X$. The goal is to maintain class separability by preserving the relationship of data points in the original space. The objective function can be formulated as

$$P^{*} = \arg\min_{P^T X L_p X^T P = I} \sum_{i \neq j} \left\| P^T x_i - P^T x_j \right\|^2 W_{i,j} = \arg\min_{P^T X L_p X^T P = I} \operatorname{tr}\left(P^T X L X^T P\right), \qquad (1)$$

where $L$ is the Laplacian matrix of graph $G$, $L = D - W$, $D$ is a diagonal matrix with the $i$th diagonal element being $D_{ii} = \sum_{j=1}^{M} W_{i,j}$, and $L_p$ may be the Laplacian matrix of the penalty graph $G_p$ or a simple scale normalization constraint [10]. The optimal projection matrix $P$ can be obtained as

$$P^{*} = \arg\min_{P} \frac{\left| P^T X L X^T P \right|}{\left| P^T X L_p X^T P \right|}, \qquad (2)$$

which can be solved as a generalized eigenvalue decomposition problem,

$$X L X^T P = X L_p X^T P \Lambda, \qquad (3)$$

where $\Lambda$ is a diagonal eigenvalue matrix. The $d \times K$ projection matrix $P$ is constructed from the $K$ eigenvectors corresponding to the $K$ smallest nonzero eigenvalues. Note that the performance of graph-embedding-based dimensionality-reduction algorithms mainly depends on the choice of $G$.
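The generalized eigenvalue step above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `graph_embedding_projection`, the small ridge term `reg` (added so that $X L_p X^T$ is positive definite), the tolerance `tol` for "nonzero" eigenvalues, and the default $L_p = I$ (a simple scale normalization) are all assumptions made for this example.

```python
import numpy as np

def graph_embedding_projection(X, W, K, Lp=None, reg=1e-6, tol=1e-10):
    """Sketch of the graph-embedding projection.

    X : (d, M) data matrix, one sample per column.
    W : (M, M) symmetric affinity matrix of the intrinsic graph.
    K : number of projection directions to keep.
    Lp: (M, M) penalty Laplacian; identity is ASSUMED if omitted.
    Returns (P, eigenvalues) with P of shape (d, <=K).
    """
    d, M = X.shape
    L = np.diag(W.sum(axis=1)) - W           # graph Laplacian L = D - W
    if Lp is None:
        Lp = np.eye(M)                        # assumption: scale normalization
    A = X @ L @ X.T
    B = X @ Lp @ X.T + reg * np.eye(d)        # ridge keeps B positive definite
    A, B = (A + A.T) / 2, (B + B.T) / 2       # remove round-off asymmetry

    # Reduce A p = lambda B p to a standard symmetric problem with B = C C^T.
    C = np.linalg.cholesky(B)
    C_inv = np.linalg.inv(C)
    S = C_inv @ A @ C_inv.T
    evals, evecs = np.linalg.eigh((S + S.T) / 2)   # ascending eigenvalues
    V = C_inv.T @ evecs                            # generalized eigenvectors

    keep = np.flatnonzero(evals > tol)[:K]         # K smallest nonzero ones
    return V[:, keep], evals[keep]
```

The reduced data would then be obtained as `Y = P.T @ X`. The Cholesky reduction is used here only to keep the sketch dependency-free; a dedicated generalized symmetric eigensolver would serve the same purpose.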

Similarity Graph in LPP and SGDA
Recently, various graph-based algorithms have been shown to be effective for dimensionality reduction of high-dimensional data [26][27][28][29]. The construction of the similarity graph plays a vital role in these algorithms. Their performance largely hinges on whether the graph can accurately distinguish similarity from dissimilarity among data points, even when the data contain noise. In this section, two popular approaches to constructing affinity graphs are summarized.
The first approach is pairwise distance. Here, the most popular metric is the Euclidean distance with a heat kernel, typically used in LPP [13], i.e.,

$$\operatorname{sim}(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{\tau} \right), \qquad (4)$$

where $\operatorname{sim}(\cdot)$ represents the similarity function, $x_i$ and $x_j$ denote data points (vectors), and the parameter $\tau$ denotes the width of the heat kernel. This metric has been applied in various domains such as face recognition [30] and anomaly detection [31]. However, it is well known that the pairwise distance is very sensitive to noise and outliers because the measurement depends only on the two data points involved. Thus, algorithms based on this first strategy may fail to handle noise-corrupted data.
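As a rough illustration of the pairwise-distance approach, the sketch below builds a supervised heat-kernel affinity matrix in which only same-class pairs receive a nonzero weight. The exact kernel form $\exp(-\|x_i - x_j\|^2/\tau)$, the function name `heat_kernel_affinity`, and the choice to zero the diagonal are assumptions for this example rather than details confirmed by the text.

```python
import numpy as np

def heat_kernel_affinity(X, labels, tau):
    """Supervised heat-kernel affinity (assumed form exp(-||xi - xj||^2 / tau)).

    X      : (d, M) data matrix, one sample per column.
    labels : length-M array of class labels.
    tau    : kernel width parameter.
    """
    labels = np.asarray(labels)
    sq_norms = np.sum(X ** 2, axis=0)
    # Squared Euclidean distances between all pairs of columns.
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    dist2 = np.maximum(dist2, 0.0)            # clip tiny negative round-off
    W = np.exp(-dist2 / tau)
    same_class = labels[:, None] == labels[None, :]
    W = W * same_class                        # keep same-class pairs only
    np.fill_diagonal(W, 0.0)                  # no self-loops (assumption)
    return W
```

Because every entry depends on all bands through a single squared distance, one extreme band value in either pixel changes the weight directly, which is the sensitivity the paragraph above describes.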
The other approach for building graphs uses reconstruction coefficients, as is typical in SGDA [22]. Sparse representation uses a few bases to represent each data point and has been successfully used in data representation. The original formula is expressed as

$$\arg\min_{W} \| W \|_1 \quad \text{s.t.} \quad XW = X, \ \operatorname{diag}(W) = 0, \qquad (5)$$

where $W$ is the affinity matrix and $\| \cdot \|_1$ denotes the $\ell_1$-norm. Because the labeled samples are grouped by class, $W$ can be written as

$$W = \begin{bmatrix} W^{(1)} & & 0 \\ & \ddots & \\ 0 & & W^{(C)} \end{bmatrix}, \qquad (6)$$

where $W^{(i)}$ is the sparse representation matrix of size $M_i \times M_i$ for the samples in the $i$th class, computed using only the $M_i$ samples belonging to class $C_i$.

Proposed GDA-SS
In this section, the proposed method, i.e., GDA-SS, is introduced in detail. GDA-SS is motivated by simple spectral operations, as illustrated in Figure 1. The training samples are randomly chosen to construct a similarity graph using the proposed spectral similarity measurement; then, the graph-embedding dimensionality reduction framework is applied to project the samples into a lower-dimensional subspace. Because each spectral vector carries the spectral information over a certain wavelength range, the proposed approach can translate this characteristic into a similarity graph effectively.

GDA-SS
Considering that $x_i$ and $x_j$ are hyperspectral samples in the same class, the difference between these two samples can be written as

$$x_{\mathrm{sub}} = | x_i - x_j |, \qquad (7)$$

where $|\cdot|$ denotes the element-wise absolute value. This subtraction reveals the difference between two spectral pixels. A threshold is then needed to constrain the similarity distance.
In fact, pixels may be disturbed by sensor noise to some degree. Thus, in order to reduce the effect of the noise, a ratio of the average difference is used to measure the similarity. That is, the threshold $T_d$ is represented as

$$T_d = \eta \cdot \operatorname{avg}(x_{\mathrm{sub}}), \qquad (8)$$

where $\operatorname{avg}(\cdot)$ denotes the average value of the elements in $x_{\mathrm{sub}}$, and $\eta$ is an adjustment parameter.
In experiments, the average value is replaced by that from a set of pair-wise differences in the same class.
Once the threshold $T_d$ is fixed, the similarity can be calculated by comparing $x_{\mathrm{sub}}$ with $T_d$. The number of elements whose values are less than $T_d$ is counted. Then, the similarity between $x_i$ and $x_j$ is determined as

$$W_{i,j} = \frac{\left| \{ k : x_{\mathrm{sub}}(k) < T_d \} \right|}{d}, \qquad (9)$$

which is the $ij$th element in matrix $W$, and $d$ is the number of bands. To make $W$ even more sparse, the elements in $W$ less than another given threshold $T_s$ are set to zero. To account for the differences among individual classes, a separate threshold $T_s^l$ can be set for each class from $W^l$ and $\gamma$, where $W^l$ denotes the pair-wise similarity of samples in the $l$th class, and $\gamma$ is a sparsity-controlling parameter.
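The graph construction described above (absolute difference, threshold $T_d$ as $\eta$ times the average difference, fraction of bands below the threshold, then per-class sparsification) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name `gda_ss_affinity` is hypothetical, the average is taken per pair rather than over a set of same-class differences, and, because the exact form of $T_s^l$ is not recoverable from the text, the sketch ASSUMES $T_s^l = \gamma \cdot \max(W^l)$.

```python
import numpy as np

def gda_ss_affinity(X, labels, eta, gamma):
    """Sketch of a spectral-similarity graph in the spirit of GDA-SS.

    X      : (d, M) data matrix, one pixel spectrum per column.
    labels : length-M array of class labels.
    eta    : adjustment parameter for the difference threshold T_d.
    gamma  : sparsity-controlling parameter.
    """
    labels = np.asarray(labels)
    d, M = X.shape
    W = np.zeros((M, M))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        Xc = X[:, idx]
        # x_sub for every same-class pair: shape (d, m, m).
        diff = np.abs(Xc[:, :, None] - Xc[:, None, :])
        # T_d = eta * avg(x_sub), computed per pair here.
        Td = eta * diff.mean(axis=0)
        # Fraction of bands whose difference is below the threshold.
        Wc = (diff < Td[None, :, :]).sum(axis=0) / d
        np.fill_diagonal(Wc, 0.0)
        # ASSUMPTION: per-class sparsity threshold T_s^l = gamma * max(W^l).
        Ts = gamma * Wc.max()
        Wc[Wc < Ts] = 0.0
        W[np.ix_(idx, idx)] = Wc              # block for class c
    return W
```

The intermediate array has size $d \times m \times m$ per class, which is acceptable for the small training sets used for graph construction but would need chunking for large classes.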

Analysis on GDA-SS
For a hyperspectral image, the spectrum is the key characteristic that makes pixel-level classification possible. However, each pixel may be corrupted by noise (such as inevitable random noise). The parameters are therefore very important for adapting to data-dependent features. The benefit of the proposed approach is that the spectral similarity between two pixels is calculated from selected bands rather than all bands, by thresholding the spectral difference. With the selected bands, trivial spectral variations and additive noise can be alleviated, resulting in a better representation of spectral similarity.
In GDA-SS, there are two important parameters, i.e., η and γ, which control the spectral similarity and the sparseness, respectively. We use three-class synthetic data (here, three classes are chosen from the University of Pavia data that will be introduced in Section 4) to demonstrate the sensitivity to these two parameters. A standard support vector machine (SVM) [32,33] is employed to measure the classification accuracy. Gaussian noise [34] is simulated at signal-to-noise ratios (SNRs) of 20 dB and 30 dB, along with an infinite SNR (here, Inf means that no additional noise is added). Figure 2 illustrates the graph matrix learned by GDA-SS with the preset parameters. When the dimensionality is reduced to 25, the best classification accuracies are 98.25%, 99.25%, and 99.50%, respectively, and the corresponding controlling parameters η and γ are shown in Figure 2. Note that when the SNR is smaller, the resulting parameter η is larger. This is because too much noise can affect the threshold T d; in that situation, η needs to be adjusted accordingly. In general, the higher SNR needs larger η because Equation (9) requires a little higher tolerance. As for parameter γ, its role is to control the sparseness of the graph matrix. Comparing Figure 2a and Figure 2c, even though the γ value is the same, the η value is significantly different, which results in the sparsity in Figure 2a being worse than that in Figure 2c. This demonstrates that the controlling parameters γ and η can adaptively tune the sparsity, and that when the SNR is larger, the sparsity may be worse.
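For readers who want to reproduce a noise study of this kind, the sketch below adds zero-mean Gaussian noise at a requested SNR. The text does not state how the SNR was defined, so this example ASSUMES the ratio of mean signal power (over the whole array) to noise variance; the function name `add_gaussian_noise` is hypothetical.

```python
import numpy as np

def add_gaussian_noise(X, snr_db, rng=None):
    """Return X plus white Gaussian noise at the given SNR in dB.

    ASSUMPTION: SNR = mean(X**2) / noise_variance, computed over the
    whole array. An infinite SNR returns an unmodified copy.
    """
    rng = np.random.default_rng() if rng is None else rng
    if np.isinf(snr_db):
        return X.copy()
    signal_power = np.mean(X ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return X + rng.normal(0.0, np.sqrt(noise_power), size=X.shape)
```

Calling it with `snr_db` set to 20, 30, and `np.inf` would produce the three conditions discussed above.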

Hyperspectral Data
In the experiments, real hyperspectral datasets are used to test the proposed method. The first dataset (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes) was acquired using the National Aeronautics and Space Administration's (NASA) Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Salinas Valley, Central Coast of California, in 1998. The image comprises 512 × 217 pixels with a high spatial resolution of 3.7 m and 204 bands after 20 water-absorption bands are removed. It mainly contains vegetables, bare soils, and vineyard fields. There are 16 classes, and the numbers of training and testing samples are listed in Table 1, where 5% of the labeled samples in each class are randomly chosen as training samples and the rest are used for testing. The second dataset was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the city of Pavia, northern Italy. It is the Pavia University scene, which covers 610 × 340 pixels. The dataset has 103 spectral bands after water-band removal, with a spectral coverage from 0.43 to 0.86 µm and a spatial resolution of 1.3 m. A total of 42,776 labeled pixels in nine classes are taken from the ground truth map. In this dataset, 8% of the labeled samples are randomly selected for training and the rest are used for testing. The numbers of training and testing samples are summarized in Table 2.

Parameter Tuning
The classical SVM is employed to validate the aforementioned dimensionality-reduction methods, including LDA, SLPP, SGDA, and GDA-SS. A fivefold cross-validation strategy is employed for tuning parameters in the classification tasks.
Figure 3 illustrates the sensitivity of the proposed GDA-SS to its two parameters (i.e., η and γ) in Equations (8) and (10). In the experiment, η is chosen from 0.1 to 1.3 with an interval of 0.2, and γ is chosen from 0 to 0.9 with an interval of 0.1. Note that η can be chosen greater than 1 because the data may be pure. However, η should not be much larger than that; otherwise, the measurement may contain more errors. When γ is chosen as 0, the similarity matrix is theoretically no longer "sparse". Optimal η and γ are determined for GDA-SS from the results in Figure 3. For example, according to the validation classification accuracy, the best η of GDA-SS is 0.7 and the best γ is 0.9 for the Salinas data; for the University of Pavia dataset, η is set to 0.3 and γ is set to 0.7. It is worth mentioning that a nonzero value of γ verifies that the "sparseness" ratio can have an impact on the dimensionality reduction process.
To demonstrate the effect of the dimensionality of the projected subspace on the performance of the proposed method, Figure 4 illustrates the classification accuracy as a function of the reduced dimensionality K for LDA, SLPP, SGDA, and GDA-SS. SLPP is chosen for comparison so that all of the methods are supervised. The performance tends to be stable when the dimensionality is larger than a certain value. For the Salinas dataset, a reduced dimension of 25 appears to be sufficient, whereas approximately 10 is enough for the University of Pavia dataset. Based on the curves in Figure 4, classification accuracy is often not high at low dimensionality, while that of GDA-SS is always better than those of LDA, SLPP, and SGDA. For the Salinas data, when the reduced dimensionality is more than 25, the performance of SGDA tends to decline, whereas GDA-SS tends to be stable. Furthermore, when the reduced dimensionality is smaller than 7, the proposed GDA-SS is superior to SGDA. Thus, this result further confirms that the proposed strategy is able to find a transform that effectively reduces the dimensionality while enhancing class separability.
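The parameter sweep described above can be organized as a plain grid search. In the sketch below, the two grids follow the ranges stated in the text, while `grid_search` and the user-supplied `score_fn(eta, gamma)` (for example, a fivefold cross-validated SVM accuracy after GDA-SS projection) are hypothetical names introduced for illustration.

```python
import numpy as np
from itertools import product

# Grids as stated in the text: eta in 0.1..1.3 (step 0.2), gamma in 0..0.9 (step 0.1).
ETA_GRID = np.round(np.arange(0.1, 1.3 + 1e-9, 0.2), 1)
GAMMA_GRID = np.round(np.arange(0.0, 0.9 + 1e-9, 0.1), 1)

def grid_search(score_fn, etas=ETA_GRID, gammas=GAMMA_GRID):
    """Return (best_eta, best_gamma, best_score) over the parameter grid.

    score_fn is a hypothetical callable mapping (eta, gamma) to a
    validation score where larger is better.
    """
    best = (None, None, -np.inf)
    for eta, gamma in product(etas, gammas):
        score = score_fn(float(eta), float(gamma))
        if score > best[2]:
            best = (float(eta), float(gamma), score)
    return best
```

With 7 values of η and 10 values of γ, the sweep evaluates 70 parameter pairs per dataset.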

Classification Performance
To further evaluate the performance of GDA-SS, we compare the proposed method with the traditional LDA and SLPP and with the state-of-the-art SGDA, each at its optimal dimensionality. Tables 1 and 2 list the class-specific accuracy, overall accuracy, and average accuracy for the experimental datasets. The traditional LDA and SLPP are usually a little worse than the state-of-the-art SGDA, since the $\ell_1$-norm can better capture the data structure. However, the proposed GDA-SS with the sparsity-controlling parameter γ can be better than SGDA. For example, in Table 2, GDA-SS (i.e., 94.02%) yields over 1% higher accuracy than SGDA (i.e., 92.56%). Meanwhile, the value of γ, which is set to 0.7, verifies that the similarity is "sparse".
Figures 5 and 6 further illustrate the thematic maps. We produce ground-cover maps of the entire image scene for these images (including unlabeled pixels). However, to facilitate comparison between methods, only areas for which we have ground truth are shown in these maps. These maps are consistent with the results listed in Tables 1 and 2, respectively. Some areas in the classification maps produced by GDA-SS are noticeably less noisy than those of SGDA, e.g., the regions of Bare soil and Bricks in Figure 6.
Figure 7 further compares the proposed GDA-SS with the traditional methods for different numbers of training samples. For the Salinas data, the training size is varied from 0.01 to 0.05 (note that 0.05 is the ratio of the number of training samples to the total labeled data). The classification performance of the proposed GDA-SS is competitive with the state-of-the-art SGDA. For the University of Pavia data, the improvement remains about 1% throughout. In Table 3, the standardized McNemar's test [35] is employed to verify the improvement. A Z value of McNemar's test larger than 2.58 means that two classification results are statistically different at the 99% confidence level. According to our experimental results, the Z values between GDA-SS and SGDA, SLPP, and LDA are always larger than 2.58, which confirms that the differences in classification results between GDA-SS and the other methods are statistically significant. For example, even though the classification accuracies of SGDA and GDA-SS are close for the Salinas data, the Z value between these two methods is 4.91, which indicates that the improvement is significant.
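The accuracy figures reported in the tables follow standard definitions: overall accuracy (OA) is the fraction of all test samples classified correctly, and average accuracy (AA) is the mean of the class-specific accuracies. A minimal sketch, with the hypothetical function name `overall_and_average_accuracy`:

```python
import numpy as np

def overall_and_average_accuracy(y_true, y_pred):
    """Return (OA, AA, per_class_accuracy) for integer or string labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    # Class-specific accuracy: fraction of each true class predicted correctly.
    per_class = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    oa = float(np.mean(y_pred == y_true))     # overall accuracy
    aa = float(per_class.mean())              # average of class accuracies
    return oa, aa, per_class
```

OA is dominated by large classes, whereas AA weights every class equally, which is why both are usually reported for imbalanced scenes such as these.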
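The standardized McNemar statistic is commonly computed as $Z = (f_{12} - f_{21}) / \sqrt{f_{12} + f_{21}}$, where $f_{12}$ counts samples classified correctly by the first method but incorrectly by the second, and $f_{21}$ counts the reverse. The sketch below assumes this common form; the function name `mcnemar_z` and the convention of returning 0 when the two methods never disagree are choices made for this example.

```python
import numpy as np

def mcnemar_z(y_true, pred_a, pred_b):
    """Standardized McNemar Z between two classifiers on the same test set.

    Positive Z favors classifier A; |Z| > 2.58 corresponds to a
    difference at the 99% confidence level.
    """
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true
    f_ab = int(np.sum(a_ok & ~b_ok))          # A right, B wrong
    f_ba = int(np.sum(~a_ok & b_ok))          # A wrong, B right
    if f_ab + f_ba == 0:
        return 0.0                            # no disagreement (assumed convention)
    return (f_ab - f_ba) / np.sqrt(f_ab + f_ba)
```

Because the statistic depends only on the samples where the two methods disagree, two classifiers with close overall accuracies can still differ significantly on a large test set.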

More Robustness Test of GDA-SS
This section presents additional discussion on graph construction with distance similarity. For graph-based dimensionality reduction methods, the most important part is to construct an informative graph. Here, several distance-similarity approaches, including cosine, Jaccard, and correlation coefficient, are employed in place of the spectral similarity measurement under the framework of GDA-SS, with results given in Table 4. Compared with the proposed measurement, these traditional distance-similarity metrics provide worse performance, although all the accuracy values are higher than 90%. The experiment verifies that the proposed method is more effective in measuring spectral similarity. Furthermore, considering that hyperspectral spectra contain noise, the aforementioned dimensionality reduction methods are compared after noise filtering techniques are applied. Here, two commonly used filtering methods (i.e., local average filtering and wavelet denoising) are employed as preprocessing for the experimental datasets. Table 5 shows that denoising has no obvious impact on these algorithms for the University of Pavia dataset. However, for the Salinas dataset, the accuracies of SGDA and LDA are slightly improved, and GDA-SS still maintains a high accuracy, which demonstrates that GDA-SS is less sensitive to noise.

Conclusions
In this paper, a graph-based discriminant analysis via spectral similarity (GDA-SS) framework was proposed. In this method, spectral similarity using selected band information was incorporated into the affinity matrix, so that the similarity measurement is less affected by trivial spectral variation and noise. The controlling parameters η and γ were validated to be effective for constructing the affinity matrix, from the perspectives of spectral similarity and sparseness. The results on real hyperspectral images demonstrated that the proposed GDA-SS is superior to the traditional LDA and SLPP and to the state-of-the-art SGDA, even in small-sample-size situations. Moreover, the computational cost of GDA-SS is much lower than that of SGDA because only simple arithmetic operations are involved during graph construction. This makes it potentially more suitable for big data problems.

Figure 1. The flowchart and the motivation of the proposed GDA-SS.

Figure 3. Parameter tuning of η and γ for the proposed GDA-SS using two experimental datasets. (a) Salinas; (b) Pavia University.

Figure 4. Classification accuracy versus reduced dimensionality K for methods using the experimental datasets. (a) Salinas; (b) Pavia University.

Figure 7. Classification performance of methods with different training sample sizes using the experimental datasets. (a) Salinas; (b) Pavia University.

Table 1. SVM class-specific accuracy (%), overall accuracy (OA) and average accuracy (AA) of different techniques for the Salinas dataset.

Table 2. SVM class-specific accuracy (%), overall accuracy (OA) and average accuracy (AA) of different techniques for the University of Pavia dataset.

Table 3. Statistical significance from the standardized McNemar's test of the difference between methods.

Table 4. Classification evaluation on graph construction with different distance-similarity metrics.

Table 5. Classification results after applying noise filtering techniques.