Spatial-Spectral Graph Regularized Kernel Sparse Representation for Hyperspectral Image Classification

This paper presents a spatial-spectral method for hyperspectral image classification in the regularization framework of kernel sparse representation. First, two spatial-spectral constraint terms are appended to the sparse recovery model of kernel sparse representation. The first one is a graph-based spatially-smooth constraint which is utilized to describe the contextual information of hyperspectral images. The second one is a spatial location constraint, which is exploited to incorporate the prior knowledge of the location information of training pixels. Then, an efficient alternating direction method of multipliers is developed to solve the corresponding minimization problem. At last, the recovered sparse coefficient vectors are used to determine the labels of test pixels. Experimental results carried out on three real hyperspectral images point out the effectiveness of the proposed method.


Introduction
Hyperspectral imaging sensors capture digital images in hundreds of narrow and contiguous spectral bands spanning the visible to infrared spectrum.Each pixel of a hyperspectral image can be represented by a vector whose entries correspond to various spectral-band responses.Different materials usually reflect electromagnetic energy differently at specific wavelengths.The wealth of spectral information of hyperspectral images promotes the development of many application domains, such as military [1], agriculture [2], and mineralogy [3].Among the information processing procedures of these applications, classification is an important one, where pixels are assigned to one of the available classes according to a set of given training pixels.Various classifiers have been developed for hyperspectral image classification and have shown good performance.Some popular classifiers include support vector machine (SVM) [4], multinomial logistic regression [5], and sparse representation classification (SRC) [6].
Hyperspectral images usually have large homogeneous regions within which the neighboring pixels consist of the same type of materials and have similar spectral characteristics.Previous works have highlighted that hyperspectral image classification should not only focus on analyzing spectral features, but also incorporate the information of spatially-adjacent pixels.For instance, the authors in [7] incorporate the spectral and spatial information into SVM by taking advantage of the composite kernel technique.In [5], the spatial information is incorporated into multinomial logistic regression by some spatial feature extraction techniques.In [6], a joint sparsity model is introduced to SRC, where the neighborhood pixels are assumed to have the same labels.Very recently, some deep models (e.g., stacked autoencoder [8] and convolutional neural network [9][10][11][12]) have been proposed for the spatial-spectral classification of hyperspectral images and have shown good performance in terms of accuracy and flexibility.
Among the aforementioned classifiers, SRC is an excellent one and can directly assign a class label to a test pixel without a training process.Although SRC can achieve good performance in hyperspectral image classification [6], it is hard to classify the data that is not linearly separable.To overcome this drawback, kernel SRC (KSRC) is proposed in [13] to capture the nonlinear similarity of features.In view of this, this paper considers using KSRC for the classification of hyperspectral images.However, KSRC is also a pixel-wise classifier, which does not consider the spectral similarity of spatially-adjacent pixels.In the literature, some methods have been proposed to incorporate the spatial information into KSRC.In [13], a joint sparsity model is introduced to KSRC, which can be seen as a kernel extension of [6].In [14], the spatial-spectral information is incorporated into KSRC by taking advantage of the neighboring filtering kernel technique.In [15], a centralized quadratic constraint that describes the spectral similarity of spatially-adjacent pixels is appended to the sparse recovery model of KSRC.
This paper considers incorporating the spatial-spectral information of hyperspectral images in the regularization framework of KSRC.Different from the relevant methods, a spatial-spectral graph is designed to describe the contextual information of hyperspectral images, where the spectral similarity of spatially-adjacent pixels (i.e., nodes) is measured by the weight of the edge.In addition, the acquisition of labeled pixels in most hyperspectral scenes is often an expensive procedure.Thus, the number of training pixels that can be used for hyperspectral image classification is often very limited.To deal with this problem, we assume that the training pixels are labeled by experts and then exploit the wealth of these training pixels.In detail, a spatial location constraint is introduced to capture the prior knowledge of the location information of training pixels and is thereby appended to the sparse recovery model of KSRC.As for the final classification procedure, the sparse coefficient vectors obtained by solving the corresponding regularization problem are used to determine the labels of test pixels.
The remainder of this paper is organized as follows.Section 2 briefly reviews the formulations of KSRC.In Section 3, we present the formulations of the proposed SSGL method and the corresponding optimization algorithm.The effectiveness of the proposed method is demonstrated in Section 4 by conducting experiments on three real hyperspectral images.Finally, Section 5 discusses and concludes this paper.

KSRC
This section briefly introduces the notations of kernel sparse representation for hyperspectral image classification.As with the assumption of SRC, KSRC also assumes that the features belonging to the same class approximately lie in the same low-dimensional subspace [16,17].Suppose the given hyperspectral scene includes C classes, and that there exists a feature mapping function φ which maps an L-band test pixel x ∈ R L and a J-sized training dictionary A = [a 1 , ..., a J ] ∈ R L×J to the high-dimensional feature space: where the columns of A are composed of J training pixels from all classes.Then, for an unknown test pixel x ∈ R L , it can be represented as follows: where s ∈ R J denotes an unknown sparse coefficient vector.Suppose the dictionary A is given; the unknown sparse coefficient vector s can be obtained by solving the following optimization problem: where λ > 0 is used to control the sparsity of s.Then, the class label of x can be determined by the following classification rule: where δ c (•) is the characteristic function that selects coefficients related with the c-th class and makes the rest zero.Since all φ mappings used in KSRC occur in the form of inner products, the kernel function K can be defined for any two pixels x i ∈ R L and x j ∈ R L as follows: Some commonly-used kernel functions are: linear kernel (K(x i , x j ) = x i , x j ), polynomial kernel (K(x i , x j ) = ( x i , x j + 1) d , d ∈ Z + ), and Gaussian radial basis function (RBF) kernel (K(x i , 2 ), γ ∈ R + ).In this paper, only the RBF kernel is considered.After introducing (4) into (2), the corresponding optimization problem can be rewritten as follows: where Const = (1/2)K(x, x) is a constant that can be dropped in the optimization problem (5), Q = Φ(A), Φ(A) is a J × J positive semi-definite matrix with entry Q ij = K(a i , a j ), and p = [K(a 1 , x), ..., K(a J , x)] T ∈ R J .Analogously, the classification rule (3) can be rewritten as follows:

Proposed Model
Suppose the observed hyperspectral image X consists of I pixels {x i ∈ R L } I i=1 ; i.e., X = [x 1 , ..., x I ] ∈ R L×J .Then, the corresponding sparse recovery model of (2) for the observed hyperspectral image X can be written as follows: where || • || F is the Frobenius norm, ||S|| 1,1 = ∑ j,i |S ji |, Φ(X) = [φ(x 1 ), ..., φ(x I )] denotes the corresponding data of X under the mapping function φ and S = [s 1 , ..., s I ] ∈ R J×I is the unknown sparse coefficient matrix, with s i ∈ R J being the sparse coefficient vector of pixel x i .Similarly, the corresponding sparse recovery model of ( 5) can be written as follows: where the constant term is dropped and Tr(•) denotes the trace of a matrix, P = Φ(A), Φ(X) = [p 1 , ..., p I ] ∈ R J×I with entry P ij = K(a i , x j ).The corresponding classification rule of (6) can be written as follows: The matrix X is not just a set of pixels; it is essentially a three-dimensional image; since for any i in the complete set I = {1, 2, • • • , I}, s i is the sparse coefficient vector of pixel x i .Then, the matrix S can also be seen as a three-dimensional image.That is to say, the spatial relationship between every two pixels x i and x j is also suitable for that of the sparse coefficient vectors s i and s j .It is natural to incorporate the spatial information of X by enforcing S.However, directly enforcing S is too strict; it does not consider the variation of training pixels within each class.In view of this, the summation of sparse coefficients that belong to each class is considered.Specifically, the spatial information is incorporated by enforcing TS, where T ∈ R C×I is defined as: Let us define a spatial-spectral graph as G = {V, E, W}, where V = [v 1 , ..., v I ] and E are the sets of vertices and edges, respectively, and W ∈ R I×I is a weight matrix on the graph.In this paper, every node v i is connected to its eight spatially-adjacent nodes (see Figure 1), and the weight between v i and v i is defined as follows: where β > 0 and xi and xj are the pixels of the first three principal components of the hyperspectral image X.It is apparent that for any two similar nodes, a large weight will be given.In this paper, the graph-based spatially-smooth constraint term is defined as follows: where L = D − W is the graph Laplacian and D is a diagonal matrix whose entries are column sums of W (i.e., D ii = ∑ j W ij ).After appending (12) to the sparse recovery model ( 8), we can get the following optimization problem: where α > 0. In order to effectively work with limited training pixels, the prior knowledge of the location information of training pixels is included by a spatial location constraint term.The final sparse recovery model can be written as follows: Tr(S T QS) − Tr(S T P) where Λ ⊂ I is a set with S Λ being the corresponding coefficient matrix of the labeled pixels A and ι 0 (•) denotes the indicator function; i.e., ι 0 (•) is zero if the given matrix is equal to zero, and infinity otherwise.By using the location information of training pixels in the model ( 14), these training pixels can be treated as anchor samples whose coefficients are fixed throughout the classification process, and the graph-based spatially-smooth constraint can spread their label information to their neighbors until a global stable state is achieved on the whole data set.

Optimization Algorithm
By introducing two variables M ∈ R J×I and N ∈ R C×I , the proposed optimization model ( 14) can be rewritten as follows: The optimization problem (15) accords with the framework of the alternating direction method of multipliers (ADMM) [18,19].The augmented Lagrangian function of ( 15) can be written as follows: where µ > 0 is the penalty parameter and H ∈ R J×I and B ∈ R C×I are two auxiliary variables.
The corresponding iteration procedure can be written as follows: where t > 0 is the iteration number.The first step of ( 18) is to solve the S-subproblem, which can be derived as: where I is the identity matrix.The second step of ( 18) is to solve the M-subproblem, which is the well-known soft threshold [20]: where soft(•, τ) denotes the component-wise application of the soft-threshold function y = sign(y) max{|y| − τ, 0}.The third step of ( 18) is to solve a linear system, which can be written as: min If removing the last term, the solution of this system can be written as: Let Λ be the complementary set of Λ under the complete set I.Then, the optimization problem (21) can be rewritten as: where and N Λ = T.The solution of ( 23) can be derived as: For both ( 22) and ( 24), one has to solve a large linear system.Fortunately, L is a sparse and diagonally dominant matrix, and one can take the Gauss-Seidel method to get approximate solutions to these two systems.The last two steps of ( 18) are designed to update the auxiliary variables.The procedure of the proposed algorithm is detailed as follows: 1. Input: A training dictionary A ∈ R L×J and a hyperspectral data matrix X ∈ R L×I .2. Choose β, and compute the weight matrix W according to (11). 3. Select the γ parameter for the RBF kernel and compute the matrices Q and P.

Analysis and Comparison
The proposed spatial-spectral graph regularization with location (SSGL) method is designed by appending two spatial-spectral constraint terms to the sparse recovery model of KSRC.One is a graph-based spatially-smooth constraint, and the other is a spatial location constraint.When using the two terms, the following two issues should be mentioned.
1.The graph-based spatially-smooth constraint is proposed by measuring the spatial relationship between every two spatially-adjacent pixels.If using this term, the test pixels should be arranged in the form of images.2. The spatial location constraint is exploited by integrating the location information of anchor samples, which are assumed to be taken from the test area and labeled by experts.
As described in the introduction part, deep models attract a lot of attention and provide competitive classification results very recently.Although the proposed SSGL method and the deep learning methods belong to two distinct techniques, we show the difference between them considering a further improvement of this work.In Table 1, we compare the proposed SSGL method and the deep learning methods in the respects of parameter, training sample, computational cost, flexibility, and accuracy.Restricted to the test area.
Accuracy High Moderate

Datasets
To test the performance of the proposed method, three real hyperspectral images have been considered.
(  (2) Reflective Optics System Imaging Spectrometer (ROSIS) University of Pavia: This dataset is an urban image acquired by the ROSIS sensor, with spectral coverage ranging from 0.43-0.86µm.The ROSIS sensor has a spatial resolution of 1.3 m per pixel, with 115 spectral bands.This image, with a size of 610 × 340 pixels, contains 103 spectral bands after the removal of noisy bands.There are nine ground truth classes of interest.The false color image and the ground truth map are shown in Figure 3.
In the experiments, we randomly chose 40 samples per class for training and used the rest for testing, as shown in Table 3.  (3) AVIRIS Kennedy Space Center: This image was collected by the AVIRIS sensor over the Kennedy Space Center, Florida, in March 1996.There are 224 bands in the image, covering the wavelength range of 0.4-2.5 µm.The spectral and spatial resolutions are 10 nm and 18 m, respectively.This image acquired over an area of 512 × 614 pixels contains 176 spectral bands after removing water absorption and low signal-to-noise bands.Thirteen ground-truth classes ranging from 105-927 pixels in size that occur in this scene are defined for this image.The false color image and the ground truth map are shown in Figure 4.In the experiments, 5% of the labeled pixels were randomly chosen for training, and the rest were used for testing, as shown in Table 4.

Model Development and Experimental Setup
In the experiments, we compared the proposed SSGL method with four widely-used spatial-spectral classification methods based on the KSRC classifier.The first one is the majority voting (MV) method that performs a pixel-wise KSRC followed by majority voting within superpixel regions [21,22], where the superpixel regions are obtained by the entropy rate superpixel (ERS) algorithm [23].The second one is the kernel joint SRC (KJSRC) method using kernel joint sparse representation classification [13,24], where the superpixel-based adaptive neighborhoods are used [24].The third one is the condition random fields (CRF) method that combines pixel-wise KSRC and superpixel segmentation by using condition random fields [25].The last one is the composite kernel with extended multi-attribute profile features (CKEMAP) method that integrates a composite kernel [7] into KSRC for the incorporation of spatial-spectral information, where the spatial features are extracted by EMAP [26,27].In order to show the influence of the proposed two terms, we included a degeneration version (i.e., SSG) of SSGL for comparison, where the location information constraint term was dropped.Moreover, the pixel-wise KSRC [17] that only uses spectral features was also included as a baseline classifier.The corresponding optimization problems of the compared methods were all solved by ADMM.For the pixel-wise KSRC, the parameter γ for the RBF kernel was experimentally set to two for the AVIRIS Indian Pines dataset, 0.5 for the ROSIS University of Pavia dataset, and 0.125 for the AVIRIS Kennedy Space Center dataset, and the parameters λ and µ were empirically set to 10 −4 and 10 −3 , respectively.For further details, one can refer to the existing literature [14].Considering that all compared methods are based on the KSRC classifier, their parameters brought by KSRC were experimentally set to be the same as those of the pixel-wise KSRC.For MV, the additional parameters brought by the ERS segmentation were carefully optimized by reference to [28].For both KJSRC and CRF, the parameters brought by the ERS segmentation were fixed to be the same as those of MV, and the remaining parameters were obtained by cross-validation.For CKEMAP, the EMAP features were extracted from the first three principal components of the hyperspectral image, and the area and standard deviation attributes were considered to build EMAP, as reported in [29].More specifically, for the area attribute the following values were selected for references: 49, 169, 361, 625, 961, 1369, 1849, and 2401.For the standard deviation, the EMAPs were built according to the following reference values with respect to the mean of the individual features: 2.5%, 5%, 7.5%, and 10%.As for the weight parameter and the kernel parameters in the composite kernel framework [7], they were obtained by cross-validation.For the proposed method, µ was set to 10 −4 , α was set to 1, and β was experimentally set to 50 for the AVIRIS Indian Pines dataset and 100 for both the ROSIS University of Pavia dataset and the AVIRIS Kennedy Space Center dataset.As for SSG, the same parameters were used as SSGL.
Before the following experiments, the original data were scaled in the range [0, 1].The classification accuracies are assessed with OA (overall accuracy), AA (average accuracy), and KA (kappa coefficient of agreement).The quantitative measures were obtained by averaging ten runs to avoid any bias induced by random sampling.All experiments were carried out in a 64-b quad-core CPU 2.40-GHz processor with 8 GB of memory.

Numerical and Visual Comparisons
Table 5 summarizes the class-specific and global classification accuracies for the AVIRIS Indian Pines dataset.The processing times in seconds are also included for reference.Figure 5 shows the corresponding classification maps.From Table 5, it can be seen that all of the spatial-spectral methods yielded higher classification accuracies when compared to the pixel-wise KSRC.SSGL gave the highest global and most of the best class-specific accuracies, and CKEMAP performed the second highest.For the proposed methods SSG and SSGL, the improvement of SSGL over SSG was significant.From Figure 5, it can be seen that the maps of the spatial-spectral methods contain more homogeneous regions when compared with the map of the pixel-wise KSRC.The misclassified regions of SSGL and CKEMAP were smaller than those of the others.The class-specific and global classification accuracies for the ROSIS University of Pavia dataset are provided in Table 6, as well as the processing times.The corresponding classification maps are shown in Figure 6.From the table, it is clear that SSGL performed best followed by SSG, and the performances of the spatial-spectral methods were better than those of the pixel-wise method.The numerical comparisons are confirmed by inspecting the classification maps.Table 7 shows the class-specific and global classification accuracies and the processing times for the AVIRIS Kennedy Space Center dataset.It can be seen that Table 7 reveals almost the same results as Table 5.The classification maps shown in Figure 7 confirm the numerical comparisons further.In this set of experiments, it can be seen that SSGL performed best in all scenes.This demonstrates the superiority of the spatial location constraint as well as the graph-based spatially-smooth constraint.However, for SSG, it is more suitable to classify the ROSIS scene.This is because the spatial resolution of the ROSIS scene is higher than those of the two AVIRIS scenes, and thus the spectral similarity of the spatially-adjacent pixels of which is more suitable for SSG to measure.

Analysis of Parameters
Among the parameters of the proposed method, β was set experimentally and differently for the three given datasets.In this set of experiments, we evaluated the influence of β by varying it from 10-200.Figure 8 shows the classification accuracies for both SSGL and SSG when applied to the three given datasets.It can be seen that when β is small, the classification performance of SSG increases monotonically as β increases; and the results typically saturate for large βs.When β is relatively large, the accuracies of SSGL decrease slightly.In all cases, we can choose β in a wide optimal region.Thus, the choice of β is robust.

Influence of Anchor Samples
In SSGL, the spatial location constraint is built on the prior knowledge of the location information of training samples.That is to say, all training samples are treated as anchor samples.In this set of experiments, we built the spatial location constraint term by using different percentages of training sample to show the influence of anchor samples.Figure 9 shows the classification accuracies for SSGL when applied to the three given datasets.From the figure, we can see that the accuracies increase monotonically as the percentage increases, but the accuracies reach the optimal values asymptotically when a moderate number of anchor samples are used.Thus, we can build the spatial location term by using a relatively small number of anchor samples.

Different Numbers of Training Samples
In this set of experiments, we tested the performance of the compared classification methods in an ill-posed scenario, where different numbers of training pixels are used.Specifically, for both the AVIRIS Indian Pines dataset and the AVIRIS Kennedy Space Center dataset, we randomly chose 1-20% of the labeled pixels per class for training and the remaining pixels for testing.For the classes with limited labeled pixels, at least two pixels were chosen for training.For the ROSIS University of Pavia dataset, we built training sets by randomly choosing 10, 20, 40, 60, 80, and 100 training pixels per class and used the rest as test sets.Figure 10 shows the mean value and standard deviation of OA as a function of the number of training pixels for the three given datasets.From the figures, it can be seen that the proposed SSGL outperformed the compared methods in almost all cases.As expected, the performance of the spatial-spectral methods was better than that of the pixel-wise KSRC, and the classification accuracies decreased when the number of training pixels was reduced.When compared with the AVIRIS scene, SSG is more suitable for the ROSIS scene.

Discussion and Conclusions
A regularized kernel sparse representation method for spatial-spectral classification of hyperspectral images has been presented in this paper, where the spatial-spectral information of hyperspectral images is incorporated by appending two spatial-spectral constraint terms to the sparse recovery model of KSRC.One is the graph-based spatially-smooth constraint, which is designed to measure the spectral similarity of spatially-adjacent pixels.The other is the spatial location constraint, which is used to exploit the wealth of training pixels and thereby to reduce the cost of acquiring a large number of labeled pixels.By introducing the two constraints, the spatial and label information of the hyperspectral pixels-especially that of the training pixels-can be spread to their neighbors, which is very useful for the discrimination of each class.Experiments conducted on three real hyperspectral images have demonstrated that the proposed method can yield accurate classification results.Although the results obtained by the proposed method are very encouraging in hyperspectral image classification, further enhancements such as more robust constraint terms and more reasonable regularization strategies should be pursued in future developments.

Figure 1 .
Figure 1.Illustration of the spatially-adjacent nodes.The yellow one denotes the given node and the black ones denote the eight spatially-adjacent nodes.
T QS) − Tr(S T P)

Figure 3 .
Figure 3. Reflective Optics System Imaging Spectrometer (ROSIS) University of Pavia dataset.(a) RGB composite image of three bands.(b) Ground-truth map.

Figure 4 .
Figure 4. AVIRIS Kennedy Space Center dataset.(a) RGB composite image of three bands.(b) Ground-truth map.

Figure 8 .
Figure 8. OA as a function of β.

Figure 9 .
Figure 9. OA as a function of the number of anchor samples.

Figure 10 .
Figure 10.OA as a function of the number of training samples for different classification methods when applied to (a) the AVIRIS Indian Pines dataset, (b) the ROSIS University of Pavia data set, and (c) the AVIRIS Kennedy Space Center dataset.

Table 1 .
Difference between the proposed spatial-spectral graph regularization with location information (SSGL) method and the deep learning methods.

Table 2 .
Sixteen ground-truth classes in the AVIRIS Indian Pines dataset and the number of training and test pixels used in the experiments.

Table 3 .
Nine ground-truth classes in the ROSIS University of Pavia dataset and the number of training and test pixels used in the experiments.

Table 4 .
Thirteen ground-truth classes in the AVIRIS Kennedy Space Center dataset and the number of training and test pixels used in the experiments.

Table 5 .
Classification accuracies for the AVIRIS Indian Pines dataset using different classification methods.The best results are highlighted in bold.AA: average accuracy; CKEMAP: composite kernel with extended multi-attribute profile features; CRF: condition random fields; KA: kappa coefficient of agreement; KSRC: kernel sparse representation classification; KJSRC: kernel joint SRC; MV: majority voting; OA: overall accuracy; SSG: spatial-spectral graph regularization with location information; SSGL: spatial-spectral graph regularization with location information.

Table 6 .
Classification accuracies for the ROSIS University of Pavia dataset using different classification methods.The best results are highlighted in bold.

Table 7 .
Classification accuracies for the AVIRIS Kennedy Space Center dataset using different classification methods.The best results are highlighted in bold.