Discriminant Analysis with Graph Learning for Hyperspectral Image Classification

Abstract: Linear Discriminant Analysis (LDA) is a widely used technique for dimensionality reduction, and has been applied in many practical applications, such as hyperspectral image classification. Traditional LDA assumes that the data obey the Gaussian distribution. However, in real-world situations, high-dimensional data may follow various kinds of distributions, which restricts the performance of LDA. To alleviate this problem, we propose the Discriminant Analysis with Graph Learning (DAGL) method in this paper. Without any assumption on the data distribution, the proposed method learns the local data relationship adaptively during the optimization. The main contributions of this research are threefold: (1) the local data manifold is captured by learning the data graph adaptively in the subspace; (2) the spatial information within the hyperspectral image is utilized with a regularization term; and (3) an efficient algorithm is designed to optimize the proposed problem, with convergence proved experimentally. Experimental results on hyperspectral image datasets show the promising performance of the proposed method and validate its superiority over the state-of-the-art.


Introduction
A Hyperspectral Image (HSI) provides hundreds of spectral bands for each pixel and conveys abundant surface information. Hyperspectral image classification aims to distinguish the land-cover type of each pixel, with the spectral bands considered as features. However, the large number of bands significantly increases the computational complexity [1]. Moreover, some bands are highly correlated, leading to feature redundancy. Consequently, it is critical to perform dimensionality reduction before classification. The goal of dimensionality reduction is to project the original data into a low-dimensional subspace while preserving the valuable information.
Dimensionality reduction techniques can be roughly classified into two categories: feature selection [2,3] and feature extraction [4][5][6][7][8][9]. Feature selection methods select the most relevant feature subset from the original feature space, while feature extraction methods exploit a low-dimensional subspace that contains the valuable information. Compared to feature selection, feature extraction is able to create meaningful features through transformations of the original ones. Consequently, plenty of techniques have been put forward for feature extraction [9][10][11][12][13]; Principal Component Analysis (PCA) [14] and Linear Discriminant Analysis (LDA) [15] are among the most popular. The main contributions of this work are as follows: (1) the affinity graph is built according to the samples' distances in the subspace, so the local data structure is captured adaptively; (2) the proposed formulation perceives the spatial correlation within HSI data, and avoids the ill-posed and over-reducing problems naturally; (3) an alternating optimization algorithm is developed to solve the proposed problem, and its convergence is proved experimentally.

Linear Discriminant Analysis Revisited
In this section, Linear Discriminant Analysis is briefly reviewed as a preliminary. Given an input data matrix X = [x_1, x_2, · · · , x_n] ∈ R^{d×n} (d is the data dimensionality and n is the number of samples), LDA defines the between-class scatter S_b and within-class scatter S_w as

S_b = Σ_{k=1}^{c} n_k (μ_k − μ)(μ_k − μ)^T,  S_w = Σ_{k=1}^{c} Σ_{x_i ∈ X_k} (x_i − μ_k)(x_i − μ_k)^T,  (1)

where n_k is the number of samples in class k, c is the number of classes, X_k is the set of samples in class k, μ_k is the mean of the samples in class k, and μ is the mean of all the samples. With the above definitions, LDA aims to learn a linear transformation W ∈ R^{d×m} (m ≪ d) that maximizes the between-class difference while minimizing the within-class separation:

max_W Tr( (W^T S_w W)^{−1} (W^T S_b W) ),  (2)

where Tr(·) indicates the trace operator. With the optimal transformation W*, a data sample x_i can be projected to an m-dimensional feature vector W*^T x_i. As shown in Equation (1), LDA assumes that the data distribution is Gaussian, so that the between-class divergence can be reflected by the subtraction of the class means. This assumption is unsuitable for HSI data and makes LDA insensitive to the local manifold.
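As a concrete reference for Equations (1) and (2), the sketch below computes the scatter matrices and a projection with NumPy. The function name is illustrative, and a small ridge is added to S_w (as in regularized LDA) to keep it invertible; this is a minimal sketch, not the paper's implementation:

```python
import numpy as np

def lda_transform(X, y, m):
    """Classical LDA sketch: return a d x m projection matrix W.

    X : (d, n) data matrix, y : (n,) integer labels, m : target dimensionality.
    Solves the generalized eigenproblem S_b w = lambda * S_w w by forming
    S_w^{-1} S_b and keeping the eigenvectors of the m largest eigenvalues.
    """
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[:, y == k]                      # samples of class k
        mu_k = Xk.mean(axis=1, keepdims=True)  # class mean
        S_b += Xk.shape[1] * (mu_k - mu) @ (mu_k - mu).T
        S_w += (Xk - mu_k) @ (Xk - mu_k).T
    # Small ridge keeps S_w invertible (the remedy used by regularized LDA).
    vals, vecs = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(d), S_b))
    order = np.argsort(-vals.real)[:m]
    return vecs[:, order].real                 # columns are projection directions
```

Note that the ridge term is exactly the point where plain LDA becomes ill-posed when S_w is singular, which is one of the problems DAGL avoids by construction.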

Discriminant Analysis with Graph Learning
In this section, the Discriminant Analysis with Graph Learning (DAGL) method is introduced, and an optimization method is proposed to get the optimal solution.

Graph Learning
In real-world tasks, such as HSI classification, the local manifold may be inconsistent with the global structure. Thus, it is necessary to take the local data relationship into consideration.
In the past few decades, numerous algorithms have been proposed to explore the data structure. Some of them [27][28][29][30] first construct an affinity graph with various kernels (Gaussian kernel, linear kernel, 0-1 weighting), and then perform clustering or classification according to the spectrum of the predefined graph. However, the choice of kernel scales and categories is still an open issue. Therefore, graph learning methods [25,[31][32][33][34][35]] have been developed to learn the data graph automatically. One of the most popular graph learning techniques is Sparse Representation [31,32], which aims to learn a sparse graph from the original data. Sparse Representation assumes that a data sample can be roughly represented by a linear combination of the others. Defining a coefficient matrix S ∈ R^{n×n}, the optimal S should minimize the reconstruction error as follows:

min_S ||X − XS||_F^2 + λ||S||_1,  s.t. diag(S) = 0,  (3)

where λ is a regularization parameter. If x_i and x_j are similar, S_ij will be large. Thus, S can be considered as the affinity graph.
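As an illustration, a sparse-representation graph in the spirit of Equation (3) can be built by solving a lasso problem per column. The sketch below uses plain proximal-gradient (ISTA) steps; the function name, λ, and iteration count are illustrative choices, not those of [31,32]:

```python
import numpy as np

def sparse_graph(X, lam=0.1, n_iter=500):
    """Sparse-representation affinity graph sketch.

    For each column j, minimize 0.5 * ||x_j - A s||^2 + lam * ||s||_1 over s,
    where A holds all samples except x_j (so the diagonal of S stays zero),
    using ISTA: a gradient step followed by soft thresholding.
    """
    d, n = X.shape
    S = np.zeros((n, n))
    for j in range(n):
        A = np.delete(X, j, axis=1)            # all samples except x_j
        b = X[:, j]
        t = 1.0 / np.linalg.norm(A, 2) ** 2    # step = 1/L, L = spectral norm^2
        s = np.zeros(n - 1)
        for _ in range(n_iter):
            s = s - t * (A.T @ (A @ s - b))    # gradient step on the quadratic
            s = np.sign(s) * np.maximum(np.abs(s) - t * lam, 0.0)  # soft threshold
        S[np.arange(n) != j, j] = s
    return (np.abs(S) + np.abs(S).T) / 2       # symmetrize into an affinity graph
```

The symmetrization in the last line is one common convention for turning the coefficient matrix into an undirected affinity graph.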

Methodology
As shown in Equation (3), Sparse Representation exploits the data relationship in the original data space, so data noise may affect the graph quality adversely. To alleviate this problem, we propose to adjust the data graph during the discriminant analysis, which yields the formulation in problem (4), where I ∈ R^{m×m} is the identity matrix and α is a parameter. When the linear transformation W is learned, the first term of problem (4) enforces S_ij to be small/large for the within-/between-class samples with large transformed distances. In this way, the data graph is optimized in the subspace. Similarly, when S is fixed, the transformed distance ||W^T(x_i − x_j)||_2^2 will be small/large for the within-/between-class samples with large S_ij. Consequently, the within-/between-class similar samples are ensured to be close/far away in the transformed subspace. However, it is difficult to optimize problem (4) directly because S is involved in both the numerator and denominator of the first term. Supposing the minimum value of the first term is γ, the optimal W and S should make the gap between this value and γ close to 0. Thus, problem (4) is equivalent to problem (5), where γ can be set as a small value. Denoting a class indicator matrix Z ∈ R^{n×n} as

Z_ij = 1 if x_i and x_j are from the same class, and Z_ij = −γ otherwise,  (6)

problem (5) can be simplified into problem (7).

In HSI data, the pixels within a small region may be highly correlated and belong to the same class, so the spatial information is essential for accurate classification. Given a test sample t ∈ R^{d×1}, we find its surroundings within an r × r region, denoted as [t_1, t_2, · · · , t_{r²−1}]. We encourage these samples to be close to each other in the desired subspace, which yields problem (8); problem (8) can be further reduced to problem (9). Finally, by integrating problems (7) and (9) together, we obtain the objective function of the proposed DAGL method, where α and β are trade-off parameters.
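The class-indicator matrix Z of Equation (6) is simple to construct; a minimal sketch (function name illustrative):

```python
import numpy as np

def class_indicator(y, gamma=1e-3):
    """Build Z from Eq. (6): Z_ij = 1 for same-class pairs, -gamma otherwise.

    y : (n,) integer label vector; gamma : the small constant of Eq. (6).
    """
    same = y[:, None] == y[None, :]        # boolean same-class mask
    return np.where(same, 1.0, -gamma)
```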
Since DAGL does not need to calculate the inverse of the within-class scatter matrix, the ill-posed problem is avoided naturally. In addition, the projected dimensionality m can be any value less than d, so the over-reducing problem does not occur. With the proposed objective function, the local data relationship is investigated, and the spatial correlation between the pixels is also captured.

Optimization Algorithm
Problem (11) involves two variables to be optimized, so we fix one and update the other iteratively. The data graph S is first initialized with an efficient method [33].
When S is fixed, problem (11) becomes problem (12). Denoting a scatter matrix S_z as in Equation (13), problem (12) is converted into the minimization problem (14). According to spectral clustering [36], the optimal W for problem (14) is formed by the m eigenvectors of the matrix (S_z + βS_t) corresponding to the m smallest eigenvalues.
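The W-step thus reduces to a symmetric eigendecomposition; a minimal sketch, where M stands for the combined matrix (S_z + βS_t):

```python
import numpy as np

def update_W(M, m):
    """W-step: return the m eigenvectors of M with the smallest eigenvalues.

    The columns of W satisfy W^T W = I and minimize Tr(W^T M W).
    M is symmetrized defensively before the decomposition.
    """
    vals, vecs = np.linalg.eigh((M + M.T) / 2)  # eigh sorts eigenvalues ascending
    return vecs[:, :m]
```

Because `eigh` returns eigenvalues in ascending order, taking the first m columns directly yields the minimizer of the trace objective.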
When W is fixed, by removing the irrelevant terms from problem (11), we obtain a subproblem in S. Fixing the diagonal elements of S to 0, this subproblem is equivalent to a column-wise formulation, where s_j ∈ R^{n×1} is the j-th column of S. Since s_j is independent for different j, we can solve the resulting problem for each j separately, where 1 ∈ R^{n×1} is a column vector with all elements equal to 1. Because (U + X^T X) is a positive definite matrix, problem (18) can be readily solved by the Augmented Lagrange Method (ALM) [37].
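The paper solves this step with ALM [37]. Since the exact constraint set of problem (18) is not reproduced here, the sketch below is a hypothetical alternative: it handles a positive-definite quadratic program under the graph-learning constraints commonly used in this setting (s ≥ 0, 1^T s = 1) by projected gradient descent, where A plays the role of (U + X^T X):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the simplex {s : s >= 0, sum(s) = 1}."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def solve_s(A, b, n_iter=300):
    """Projected gradient for min_s s^T A s + b^T s, s.t. s >= 0, 1^T s = 1.

    A : (n, n) positive definite (stand-in for U + X^T X); b : (n,).
    """
    n = A.shape[0]
    s = np.full(n, 1.0 / n)                   # feasible start: uniform weights
    step = 1.0 / (2 * np.linalg.norm(A, 2))   # 1/L for the gradient 2As + b
    for _ in range(n_iter):
        s = project_simplex(s - step * (2 * A @ s + b))
    return s
```

Either solver works here because the quadratic is strongly convex; ALM is the authors' choice, and projected gradient is shown only as a compact stand-in.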
In the above optimization procedure, the original problem (11) is decomposed into two sub-problems. When solving W, a local optimal value is obtained. When solving S, the ALM algorithm is employed, whose convergence is already proved. Thus, the objective value of problem (11) decreases monotonically in each iteration, and finally converges to a local optimum. The convergence behavior of the proposed algorithm will be shown in Section 4.3. The details of the whole framework are described in Algorithm 1.

Algorithm 1 Discriminant Analysis with Graph Learning
Input: training set, testing set, parameters K, r, α and β.
1: For each test sample:
2:   Construct the training sub-set X by choosing the K nearest neighbors from the training set.
3:   Find the surroundings of the test sample within the r × r region, and obtain S_t.
4:   Initialize the data graph S.
5:   Repeat:
6:     Update W by minimizing problem (14).
7:     Update S by solving problem (18).
8:   Until convergence.
9: End
Output: optimal transformation matrix W* for each test sample.

Experiments
In this section, experiments are conducted on one toy and two hyperspectral image datasets. The convergence behavior and parameter sensitivity of the proposed method are also discussed.

Performance on Toy Dataset
A toy dataset is constructed to demonstrate that the proposed Discriminant Analysis with Graph Learning (DAGL) can capture the local data structure.
Dataset: as visualized in Figure 1a, the toy dataset consists of two-dimensional samples from two classes. Samples from the first class obey the Gaussian distribution, and those from the second class are distributed in the two-moon shape. The coordinates of the samples are taken as the features.
Performance: we transform the samples into a one-dimensional subspace with regularized Linear Discriminant Analysis (RLDA) [18] and the proposed DAGL. For DAGL, β is set to 0 since the spatial distance is equivalent to the feature distance on this dataset. Figure 1a shows the learned projection directions. It is manifest that DAGL finds the correct projection direction, while RLDA fails. On this dataset, the local data structure is inconsistent with the global structure, and the mean values of the samples cannot reflect their real relationship. Thus, RLDA is unable to project the data correctly, as shown in Figure 1b. In contrast, the proposed DAGL does not rely on any assumption about the data distribution and learns the local data manifold adaptively, so it finds a discriminative subspace where the samples are linearly separable, as shown in Figure 1c.

Performance on Hyperspectral Image Datasets
In this part, experiments are conducted on hyperspectral image datasets. The data samples are projected into the subspace and then classified by a Support Vector Machine (SVM) classifier. The parameters of the SVM are selected by grid search within {2^0, 2^1, · · · , 2^10} and {2^0, 2^1, · · · , 2^20}. Three widely used measures, overall accuracy (OA), average accuracy (AA) and the kappa statistic (κ), are adopted as evaluation criteria.
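The grid search itself is an exhaustive loop over the two parameter grids; a minimal sketch with a stand-in scoring function (in practice, the score would be the cross-validation accuracy of the SVM for a given parameter pair):

```python
from itertools import product

def grid_search(score_fn, C_grid, gamma_grid):
    """Exhaustive search: return the (C, gamma) pair with the highest score.

    score_fn stands in for an SVM cross-validation accuracy.
    """
    return max(product(C_grid, gamma_grid), key=lambda p: score_fn(*p))

# The grids used in the experiments: {2^0, ..., 2^10} and {2^0, ..., 2^20}.
C_grid = [2.0 ** k for k in range(11)]
gamma_grid = [2.0 ** k for k in range(21)]
```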
Datasets: two hyperspectral image datasets are employed in the experiments, namely the Indian Pines and KSC [16] datasets.
The Indian Pines dataset was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over northwestern Indiana, and annotates 10,249 pixels from 16 classes. Each pixel has 220 spectral bands. In the experiments, only 200 bands are used because the other 20 bands are affected by water absorption. The spatial resolution of this dataset is 20 m.
The KSC dataset was captured by the AVIRIS sensor over the Kennedy Space Center (KSC), Florida. After removing the water-absorption and low-SNR bands, 176 bands remain. In addition, 5211 pixels from 13 classes, which represent the various land-cover types, are used for classification.
For each dataset, we randomly select 5% of the samples as the training set and use all the remaining samples as the test set. To alleviate the random error caused by the dataset partition, we repeat the experiments five times and report the average results. The sizes of the training and test sets for the two datasets are exhibited in Tables 1 and 2. Through experiments, we have found that a small portion of the training set is enough for a good performance. When classifying a test sample, we simply select its 50 nearest neighbors (in feature space) from the training set and use them to train the proposed DAGL model.

Competitors: for a quantitative comparison, six dimensionality reduction algorithms are taken as competitors, including regularized LDA (RLDA) [18], Semi-supervised Discriminant Analysis (SDA) [20], Block Collaborative Graph-based Discriminant Analysis (BCGDA) [25], Spectral-Spatial LDA (SSLDA) [21], and Locality Adaptive Discriminant Analysis (LADA) [22]. To demonstrate the usefulness of dimensionality reduction, the classification result with all features is taken as a baseline, denoted as RAW.
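Selecting the 50 nearest training neighbors for each test sample can be sketched as follows (function name illustrative):

```python
import numpy as np

def knn_training_subset(train_X, train_y, t, K=50):
    """Return the K training samples nearest (in feature space) to test pixel t.

    train_X : (d, n) training matrix, train_y : (n,) labels, t : (d,) test pixel.
    This sub-set is what DAGL is trained on for each test sample.
    """
    dist = np.linalg.norm(train_X - t[:, None], axis=0)  # Euclidean distances
    idx = np.argsort(dist)[:K]                           # K closest columns
    return train_X[:, idx], train_y[idx]
```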
The curves of OA versus the reduced dimensionality on the different datasets are shown in Figure 2. The proposed DAGL achieves the highest OA consistently. In particular, on the Indian Pines dataset, DAGL exceeds the second-best method by a large margin when the reduced dimensionality is less than 4. In Figure 2, the performance becomes stable once the dimensionality increases to a certain value. This phenomenon implies that a low-dimensional subspace is sufficient for retaining the valuable information. Compared with RAW, the performance with projected data is better in most cases, which validates that dimensionality reduction does improve the classification accuracy.

The quantitative results of the methods are given in Tables 3 and 4. Each method uses its optimal reduced dimensionality. It can be seen that DAGL outperforms all the competitors in terms of OA, AA and κ. RLDA neglects the local data relationship, so it cannot capture the manifold structure. SDA and SSLDA preserve the local data relationship with a predefined data graph; however, their performance may be adversely affected by the graph quality. BCGDA learns the affinity graph from the original data by sparse representation; because the data graph is fixed during the discriminant analysis, the data relationship in the desired subspace cannot be exploited. LADA does not have this problem since it integrates graph learning and discriminant analysis jointly, but it only learns the within-class correlation and fails to discover similar samples from different classes. The proposed DAGL investigates the local data relationship adaptively and pushes the between-class similar samples apart. Therefore, it achieves the best performance in all cases. Furthermore, the classification maps of the different methods on Indian Pines are visualized in Figure 3. SSLDA, LADA and DAGL, which enforce the spatial smoothness within a small region, show better visualization quality than the others.
Thus, the utilization of spatial information improves the classification performance. It is worth mentioning that the methods with spatial constraints are time-consuming, as shown in Tables 3 and 4, since they need to find the surroundings and train the model for each sample. Compared to SSLDA and LADA, DAGL is more efficient because the optimization method converges fast.
Similar to the experiments on the toy dataset, we also visualize the two-dimensional subspace learned from the Indian Pines dataset. Taking the 5% training samples from the Corn-notill, Grass-tree and Soybeans-notill classes, we project the data into a two-dimensional subspace with SDA, SSLDA, LADA and the proposed DAGL. In this experiment, the spatial-smoothness terms of SSLDA, LADA and DAGL are removed so that the models need not be trained for each sample separately. Figure 4 shows the projected data; the subspace found by DAGL separates the samples from different classes most clearly. This result explains the good performance of DAGL on the Indian Pines dataset when the reduced dimensionality is low.

Convergence and Parameter Sensitivity
The convergence behavior of the proposed optimization algorithm is studied experimentally. We randomly choose two test samples from the Indian Pines and KSC datasets, and plot the changes of the objective values during the optimization. From Figure 5, we can see that the objective values of problem (11) converge within five iterations, which verifies that the optimization algorithm is effective and efficient.
In addition, the parameter sensitivity of DAGL is also discussed. The objective function (11) contains two parameters, α and β: α affects the learning of the data graph, while β controls the weight of the spatial-smoothness term. With varying α and β, the variation of OA is shown in Figure 6. We can see that DAGL is robust to α and β over a wide range. When α and β become very small, the performance drops because the graph quality decreases and the spatial smoothness cannot be guaranteed.

Conclusions
In this paper, we propose a new supervised dimensionality reduction method, known as Discriminant Analysis with Graph Learning (DAGL). DAGL learns the data graph automatically during the discriminant analysis. It pulls the within-class similar samples together while pushing the between-class similar samples far away. Compared with LDA and its graph-based variants, DAGL is able to learn the data relationship within the desired subspace, which contains more valuable features and less noise. In addition, DAGL ensures the smoothness within the neighborhood, so it can discover the spatial correlation within hyperspectral images. Experiments on the Indian Pines and KSC datasets show that DAGL provides better classification results than the state-of-the-art competitors.
In future work, we would like to generalize the proposed method to a kernel version and learn nonlinear transformations of HSI data. It is also desirable to improve the optimization algorithm to increase the computational efficiency.