Hyperspectral Image Classification with Spatial Filtering and ℓ2,1 Norm

Recently, the sparse representation based classification methods have received particular attention in the classification of hyperspectral imagery. However, current sparse representation based classification models have not considered all the test pixels simultaneously. In this paper, we propose a hyperspectral classification method with spatial filtering and ℓ2,1 norm (SFL) that can deal with all the test pixels simultaneously. The ℓ2,1 norm regularization is used to extract relevant training samples among the whole training data set with joint sparsity. In addition, the ℓ2,1 norm loss function is adopted to make it robust for samples that deviate significantly from the rest of the samples. Moreover, to take the spatial information into consideration, a spatial filtering step is implemented where all the training and testing samples are spatially averaged with its nearest neighbors. Furthermore, the non-negative constraint is added to the sparse representation matrix motivated by hyperspectral unmixing. Finally, the alternating direction method of multipliers is used to solve SFL. Experiments on real hyperspectral images demonstrate that the proposed SFL method can obtain better classification performance than some other popular classifiers.


Introduction
Over the past few decades, hyperspectral imagery has been widely used in different remote sensing applications owing to its high-resolution spectral information of the materials in the scene [1][2][3]. Various hyperspectral image classification techniques have been presented for a lot of real applications including material recognition, urban mapping and so on [4][5][6][7][8].
To date, a lot of hyperspectral image classification methods have been presented. Among them, the most representative method is the support vector machine (SVM) [9], which has shown desirable hyperspectral image classification performance. Recently, the sparse representation based classification methods have received a lot of attention in the area of image analysis [10][11][12][13][14], particularly in the classification of hyperspectral image. Chen et al. introduced a dictionary-based sparse representation framework for hyperspectral classification [15]. To be specific, a test pixel is sparsely represented by a few labeled training samples, and the class is determined as the one with the minimal class-specific representation error. In addition, Chen et al. also proposed the simultaneous orthogonal match pursuit (SOMP) to utilize the spatial information of hyperspectral data [15]. To take the additional structured sparsity priors into consideration, Sun et al. reviewed and compared several structured priors for sparse representation based hyperspectral image classification [16], which can exploit both the spatial dependences between the neighboring pixels and the inherent structure of the dictionary. In [17], Chen et al. extended the joint sparse representation to the kernel version for hyperspectral image classification, which can provide a higher classification accuracy than the conventional linear sparse representation algorithms. In addition, Liu et al. proposed a class-specific sparse multiple kernel learning framework for hyperspectral image classification [18], which determined the associated weights of optimal base kernels for any two classes and led to better classification performances. To take other spectral properties and higher order context information into consideration, Wang et al. proposed the spatial-spectral derivative-aided kernel joint sparse representation for hyperspectral image classification [19], and the derivative-aided spectral information can complement traditional spectral features without inducing the curse of dimensionality and ignoring discriminating features. Moreover, Li et al. proposed the joint robust sparse representation classification (JRSRC) method to take the sparse representation residuals into consideration, which can deal with outliers in hyperspectral classification [20]. To integrate the sophisticated prior knowledge about the spatial nature of the image, Roscher et al. proposed constructing a novel dictionary for sparse-representation-based classification [21], which can combine the characteristic spatial patterns and spectral information to improve the classification performance. In order to adaptively explore the spatial information for different types of spatial structures, Fu et al. proposed a new shape-adaptive joint sparse representation method for hyperspectral image classification [22], which can construct a shape-adaptive local smooth region for each test pixel. In order to capture the class-discriminative information, He et al. proposed a group-based sparse and low-rank representation to improve the dictionary for hyperspectral image classification [23]. To take different types of features into consideration, Zhang et al. proposed an alternative joint sparse representation by the multitask joint sparse representation model [24]. To overcome the high coherence of the training samples, Bian et al. proposed a novel multi-layer spatial-spectral sparse representation framework for hyperspectral image classification [25]. In addition, to take the class structure of hyperspectral image data into consideration, Shao et al. proposed a probabilistic class structure regularized sparse representation method to incorporate the class structure information into the sparse representation model [26].
It had been argued in [27] that the collaborative representation classification can obtain very competitive classification performance, while the time consumption was much lower than that of sparse representation. Thus, various collaborative representation methods had been proposed for hyperspectral image classification. Li et al. proposed the nearest regularized subspace (NRS) classifier by using the distance-weighted Tikhonov regularization [28]. Then, the Gabor filtering based nearest regularized subspace classifier had been proposed to exploit the benefits of using spatial features [29]. Collaborative representation with Tikhonov regularization (CRT) had also been proposed for hyperspectral classification [30]. The main difference between NRS and CRT was that the NRS only used within-class training data for collaborative representation while the latter adopted all the training data simultaneously [30]. In [31], the kernel version of a collaborative representation was proposed and denoted as kernel collaborative representation classifier (KCRC). In addition, Li et al. proposed proposed combining the sparse representation and collaborative representation for hyperspectral image classification to make a balance between sparse representation and collaborative representation in the residual domain [32]. Moreover, Sun et al. combined the active learning and semi-supervised learning to improve the classification performance when given a few initial labeled samples, and proposed the extended random walker [33] algorithm for the classification of hyperspectral image.
Very recently, some deep models had been proposed for hyperspectral image classification [34]. To the best of our knowledge, Chen et al. proposed a deep learning method named stacked autoencoder for hyperspectral image classification in 2014 [35]. Recently, convolutional neural networks have been very popular in pattern recognition, computer vision and remote sensing. Convolutional neural networks usually contained a number of convolutional layers and a classification layer, which can learn deep features from the training data and exploit spatial dependence among them. Krizhevsky et al. trained a large convolutional neural networks to classify the 1.2 million high-resolution images in the ImageNet, which had obtained superior image classification accuracy [36]. Since then, convolutional neural networks had been applied for hyperspectral image classification [37,38], which had achieved desirable classification performance. To take the spatial information into consideration, a novel convolutional neural networks framework for hyperspectral image classification using both spectral and spatial features was presented [39]. In addition, Aptoula et al. proposed a combined strategy of both attribute profiles and convolutional neural networks for hyperspectral image classification [40]. To overcome the imbalance between dimensionality and the number of available training samples, Ghamisi et al. proposed a self-improving band selection based convolutional neural networks method for hyperspectral image classification [41]. In addition, some patch based convolutional neural networks hyperspectral image classification methods had also been proposed, such as the method in [42,43]. In order to achieve low computational cost and good generalization performance, Li et al. proposed combining convolutional neural networks with extreme learning machines for hyperspectral image classification [44]. Furthermore, Shi et al. proposed a 3D convolutional neural networks (3D-CNN) method for hyperspectral image classification that can take both the spectral and spatial information into consideration [45].
However, all of the above mentioned methods, whether they are based on sparse representation, collaborative representation or deep models, adopt the pixel-wise classification strategy, i.e., they do not consider all the pixels simultaneously. In [46], theoretical work has demonstrated that multichannel joint sparse recovery is superior to applying standard sparse reconstruction methods to each single channel individually, and the probability of recovery failure decays exponentially with the increase in the number of channels. In addition, the probability bounds still hold true even for a small number of signals. For the classification of hyperspectral images, the multichannel means recovering multi hyperspectral pixels simultaneously. Therefore, inspired by the theoretical work in [46], in this paper, we propose a hyperspectral classification method with spatial filtering and 2,1 norm (SFL) to deal with all the test samples simultaneously, which can not only take much less time but also obtain comparable good or better classification performance. First, the 2,1 norm regularization is adopted to select correlated training samples among the whole training data set. Meanwhile, the 2,1 norm loss function which is robust for outliers is also implemented. Second, we adopt the simple strategy in [47] to exploit the local continuity, and all the training and testing samples are spatially averaged with their nearest neighbors to take the spatial information into consideration, which can be seen as spatial filtering. Third, the non-negative constraint is added in the sparse representation coefficient matrix motivated by hyperspectral unmixing. Finally, to solve SFL, we use the alternating direction method of multipliers [48], a simple but powerful algorithm that is well suited to distributed convex optimization.
The main contribution of this work lies in proposing an SFL for hyperspectral classification that can deal with all the test pixels simultaneously. Experiments on real hyperspectral images demonstrate that the proposed SFL method can obtain better classification performance than some other popular classifiers.

Related Work
In this section, we briefly introduce the classical sparse representation for the classification of hyperspectral images, which can be found in [16]. It is assumed that the pixels in the same class lie in the same low-dimensional subspace, and it has K different classes. Therefore, for an unknown test sample y ∈ R B , where B denotes the the number of bands, y is assumed to lie in the union of the K different subspaces, which can seen as the sparse linear combination of all the training samples Therefore, given the dictionary of training samples A ∈ R B×M , where M is the number of training samples. For an unknown test sample y, the sparse representation coefficient vector x ∈ R M can be obtained by solving the optimization problem as follows: where A consists of the class subdictionaries {A k } k=1,··· ,K , and λ is the regularization parameter. In addition, Equation (2) can be solved by the alternating direction method of multipliers in [49]. Thus, the class label of x is determined as the one with the minimal class-specific reconstruction residual:

Proposed Classifiers
In [46], it has been proved that, with the increase in the number of channels, the failure probability of sparse reconstruction decreases exponentially. Thus, multichannel sparse reconstruction is superior to single channel sparse reconstruction. In addition, the probability bounds are valid even for a small number of signals. Based on this theory, we deal with all the test samples simultaneously, and the proposed SFL classification method will be briefly described.
Let Y = [y 1 , y 2 , · · · , y N ] ∈ R B×N , where {y n } n=1,··· ,N denotes the columns of Y, and N denotes the number of test pixels. To deal with all the test pixels simultaneously, it is natural that the sparse representation coefficient matrix X = [x 1 , x 2 , · · · , x N ] ∈ R M×N for all the test pixels can be obtained by solving the optimization problem as follows: which also can be solved by the alternating direction method of multipliers in [49]. . F represents the matrix Frobenius norm, which is equal to the Euclidean norm of the vector of singular values, i.e., where σ i (i = 1, ..., r) denotes the singular value of X. After the optimizedX is obtained, the classes of all test pixels can be obtained by the minimum class reconstruction error: However, Equation (4) adopts the pixel-wise independent regression, which ignores the correlation among the whole training data set. Recent research shows that the high-dimensional data space is smooth and locally linear, and it has been versified in image reconstruction and classification problems [50,51]. For joint consideration of the classification of neighborhoods, in this paper, we introduce the 2,1 norm regularization and adapt it to extract correlated training samples among the whole training data set with joint sparsity, which is defined as follows: The 2,1 norm was first introduced by Ding et al. [52], which makes the traditional principal component analysis more robust for outliers. The outliers are defined as data points that deviate significantly from the rest of data. Traditional principal component analysis optimizes the sum of squared errors, since the few data points that have large squared errors will dominate the sum. Therefore, the traditional principal component analysis is sensitive to outliers. It has been shown that minimizing the 1 norm is more robust and can resist a larger proportion of outliers compared with quadratic 2 norms [53]. The 2,1 norm is identical to a rotational invariant 1 norm, and the solution of 2,1 norm based robust principal component analysis is the principal eigenvectors of a more robust re-weighted covariance matrix, which can alleviate the effects of outliers. In addition, the 2,1 norm has the advantage of being rotation invariant compared with the 1 norm [52,54,55], i.e., applying the same rotation to all points has no effect on its performance. Due to the above-mentioned advantages, the 2,1 norm has been applied in feature selection [56], multi-task learning [57], multi-kernel learning [58], and non-negative matrix factorization [59]. Nie et al. [56] introduced the 2,1 norm to feature selection, and they used 2,1 norm regularization to select features across all data points with joint sparsity. The 2,1 norm based loss function is used to remove outliers, and the feature selection process is proved to be effective and efficient.
Similarly, we adopt the 2,1 norm regularization to select correlated training samples among the whole training data set with joint sparsity for hyperspectral image classification. Thus, the corresponding optimization problem is as follows: which can be solved by the alternating direction method of multipliers in [60]. This model can be seen as an instance of the methodology in [61], which can impose sparsity across the pixels both at the group and individual levels. In addition, to make it more robust for outliers, the 2,1 norm loss function is adopted. Thus, the corresponding optimization problem is as follows: Due to limited resolution of hyperspectral image sensors and the complexity of ground materials, mixed pixels can easily be found in hyperspectral images. Therefore, a hyperspectral unmixing step is needed [62,63]. Hyperspectral unmixing is a process to identify the pure constituent materials (endmembers) and estimate the proportion of each material (abundance) [64]. The linear mixture model has been prevalently used in hyperspectral unmixing, and the abundance is considered to be non-negative in a linear mixture model [65]. If we deem A as the spectral library consisting of endmembers, then X can be seen as the abundance matrix. Therefore, X is also non-negative. When adding the non-negative constraint into the sparse representation matrix, the corresponding optimization problem is as follows: In addition, since the spectral signatures of neighboring pixels are highly correlated, which make them belong to the same material with high probability, we thus adopt the simple strategy in [47] to exploit the local continuity, and all the training and testing samples are spatially averaged with their nearest neighbors to take the spatial information into consideration, which can be seen as spatial filtering. Moreover, when N=1, it is easy to see that Equation (8) reduces to Equation (2), and Equation (9) reduces to the optimization problem as follows: To sum up, the detailed procedure of our proposed method can be seen from Figure 1. Finally, to solve the optimization problem from Equation (9) to Equation (12), Equation (10) can be solved by the alternating direction method of multipliers in [60], and Equations (9) and (12) are special cases of Equation (11). Thus, it comes down to solving Equation (11). For simplification, Equation (11) can be written as: min where l R + (X) = ∑ P i=1 l R + (X i ) is the indicator function of nonnegative quadrant R + , and X i is the i-th column of X. If X i belongs to the nonnegative quadrant, then l R + (X i ) is zero. If not, it is +∞. In order to solve Equation (11), the alternating direction method of multipliers [48] method is implemented. By introducing auxiliary variables P, Q and W, Equation (11) could be rewritten as: A compact version of it is: min where g(V) = P 2,1 + λ V ≡ (P, W, X), and I is the unit matrix. Thus, the augmented Lagrangian function could be expressed as: where µ > 0, Λ/µ stands for the Lagrange multipliers. In order to update P, we solve P k+1 = arg min and its solution is the famous vector soft threshold operator [10], which updates each row independently P k+1 (r, :) = vect-soft(ζ(r, :), 1 µ ), where ζ = AQ k − Y − Λ k 1 , and the vect-soft-threshold function g(b, τ) = b max{ b 2 −τ,0} max{ b 2 −τ,0}+τ . To update W, we solve and its solution is also the vector soft threshold operator [10]: where γ = Q k − Λ k 2 . To update X, we solve To update Q, we solve The stopping criterion is where ε is the error threshold, and J and K are the number of rows and columns of Z. µ is updated in the same way as [48], which keeps the ratio between the alternating direction method of multiplier primal norms and dual residual norms within a given positive interval. Based on this, we can get Proposition 1, whose proof of convergence is given in [48]. (15) is closed, proper, and convex. If there exist solutions V * and Q * , then iterative sequence {V k } and {Q k } converge to V * and Q * , respectively. If not, at least one of {V k } and {Q k } diverge [48].  Table 1. Thus, the rest is used for testing.

Experimental Data
The second dataset is Pavia University obtained by a Reflective Optics System Imaging Spectrometer (ROSIS) in 2001 at Paiva University, Pavia, Italy. The size of the image is 610 × 340 with a spatial resolution of 1.3 m. The number of bands is 103, and the ground truth image is shown in Figure 2b. There are nine classes and 42,776 labeled samples, 426 of them (about 1%) are chosen as the training data, and the others are used as test data, as shown in Table 2.

Parameter Setting
In experiments, we mainly compare the classification performance when using the pixel-wise strategy and dealing with all the test pixels simultaneously. In addition, we also made a step-by-step comparison by adding or removing spatial filtering and/or constraints to see which step's contribution is more important. For these methods, there are mainly five parameters: i.e., the neighbor size T, the regularization parameter λ, the Lagrange multiplier regularization parameter µ, the error tolerance ε and the maximum number of iteration. The neighbor size T and the regularization parameter λ play an important role in the proposed method, which control the size of spatial filtering and the trade-off between fidelity to the data and sparsity of the solution, respectively. While the Lagrange multiplier regularization parameter µ, the error tolerance ε and the maximum number of iteration, which have lesser impact on the efficiency of the corresponding algorithms, are set to a fixed value, i.e., µ = 10 −2 , ε = 10 −6 , and the maximum number of iteration is 1000. For the neighbor size T, we use the same parameter setting in [16]. For the Indian Pine data set, a spatial window of 9 × 9 (T = 81) is adopted, which is due to this image consisting of mostly large homogeneous regions. For the University of Pavia data set, a spatial window of 5 × 5 (T = 25) is used, which is due many narrow regions being present in this image. The regularization parameter λ is chosen from the given intervals {10 −6 , 10 −5 , 10 −4 , 10 −3 , 10 −2 , 10 −1 }. Figure 3 shows the performance of overall accuracy as a function of the regularization parameter λ using the hyperspectral image of Indian Pines and Pavia University. For convenience, the "Spatial Filtering" and "Non-negative Constraint" are abbreviated as "SF" and "NC", respectively. For example, for the " 2,1 + 2,1 +SF+NC", the first " 2,1 " denotes the loss function norm, the second " 2,1 " denotes the regularization term norm, "SF" denotes using the spatial filtering, and "NC" denotes using the non-negative constraint. Thus, they are the same as the abbreviation of the other compared methods. It can be seen from Figure 3 that the overall accuracy remains stable when ε < 10 −2 . It then decreases when ε > 10 −2 . In addition, " 2,1 + 2,1 +SF+NC" and " 2,1 + 2,1 +SF" have much better overall accuracy than " 2,1 + 2,1 +NC" and " 2,1 + 2,1 ", respectively, which demonstrate that it is significant to improve the overall accuracy when taking the spatial filtering into consideration. Moreover, " 2,1 + 2,1 +SF+NC" and " 2,1 + 2,1 +NC" have better overall accuracy than " 2,1 + 2,1 +SF" and " 2,1 + 2,1 ", respectively, which demonstrate that it helps to improve the overall accuracy when taking the non-negative constraint into consideration. Furthermore, the elevation of overall accuracy when using the spatial filtering is much larger than those when using the non-negative constraint, which suggests that the spatial filtering has a larger effect on the overall accuracy than the non-negative constraint.

Classification Performance
The experiments are performed on a desktop with 3.5 GHz Intel Core CPU, 64 GB memory and Matlab Code. To evaluate the classification performance of different methods, the overall accuracy, average accuracy and kappa statistic [16] are used to evaluate the performances of these methods. Tables 3 and 4 show the classification performances for Indian Pines data set when using the pixel-wise strategy and dealing with all the test pixels simultaneously, respectively. It can be seen from Tables 3 and 4 that methods using the spatial filtering generally obtain better overall accuracy, average accuracy and kappa statistics than those without spatial filtering. For example, " 2 + 1 +SF+NC" and " 2 + 1 +SF" have much better overall accuracy than " 2 + 1 +NC" and " 2 + 1 ", respectively, which demonstrates that it helps a lot to improve overall accuracy by using the spatial filtering. In addition, methods using the non-negative constraint generally obtain better overall accuracy than those without non-negative constraints. For example, " 1 + 1 +SF+NC" and " 1 + 1 +NC" have better overall accuracy than " 1 + 1 +SF" and " 1 + 1 ", respectively, which demonstrates that it helps to improve overall accuracy by using the non-negative constraint. It also can be clearly seen that the spatial filtering has a larger effect on the classification performance than the non-negative constraint. Moreover, methods using 2,1 norm regularization term can generally obtain better classification performance than methods using 1 norm regularization term, for example, "F+ 2,1 +SF+NC" and "F+ 2,1 " generally have better overall accuracy than "F+ 1 +SF+NC" and "F+ 1 ", respectively, which demonstrate that it is beneficial to select correlated training samples among the whole training data set, and can impose sparsity across the pixels both at the group and individual levels. Furthermore, methods using 2,1 norm loss function can generally obtain better classification performance than methods using F norm loss function. For example, " 2,1 + 2,1 +SF+NC" and " 2,1 + 2,1 " generally have better overall accuracy than "F+ 2,1 +SF+NC" and "F+ 2,1 ", respectively, which demonstrate that the 2,1 norm loss function is more robust for outliers than F norm loss function. Tables 5 and 6 show the classification performances for Pavia University data set when using the pixel-wise strategy and dealing with all the test pixels simultaneously, respectively. We can also obtain the above-mentioned conclusion from Tables 5 and 6 when using the Pavia University data. In addition, from Tables 3-6, it can be observed that these methods when dealing with all the test pixels simultaneously can obtain comparable or better overall accuracy than these regression based pixel-wise sparse representation methods, and they are much faster than these pixel-wise sparse representation methods, which demonstrates that it is significant to considerer all the test pixels simultaneously. Figures 4 and 5 show the classification maps for the Indian Pines and Pavia University data sets, respectively, which can give a visual comparison between different methods.     (r) 2,1 + 2,1 +NC; (s) 2,1 + 2,1 +SF; (t) 2,1 + 2,1 +SF+NC (SFL).  We also choose other eight methods for comparison, i.e., SVM [9,66], NRS [28,67], CRT [30,67], KCRC [31,68], OMP [15], SOMP [15], JRSRC [20] and 3D-CNN [45,69]. The SVM is a very popular classifier, the 3D-CNN is a deep neural network based classifier, and the other six compared methods are collaborative representation and sparse representation based classifiers. Tables 7 and 8 show the classification performances of the proposed SFL and eight compared methods using the Indian Pines and Pavia University data set, respectively. In addition, Figures 6 and 7 show the classification maps of the Indian Pines and Pavia University data set when using the proposed SFL and eight compared methods, which can give a visual comparison between different methods. From Tables 7  and 8, it can be clearly seen that the proposed SFL can obtain the best classification performance, which demonstrates that our proposed SFL is efficient for hyperspectral image classification. In addition, the SVM is the fastest, the reason lies in that it is implemeted in C Lagnuage which is much faster than Matlab. NRS, CRT and KCRC are very fast due to the fact that they are collaborative representation methods, and they have closed solutions, which do not need iteration. The OMP and SOMP are also very fast due to the fact that they are greedy sparse representation methods, while the JRSRC method is very time-consuming due to the fact that JRSRC is a regression based sparse representation method. In addition, the 3D-CNN is not fast because the main time-consuming aspect lies in the training. Our proposed method is also a regression based method, which takes more time than the collaborative representation methods and greedy sparse representation methods. There are several possible ways for us to improve the time consumed in the process. One way is to use C Language and graphic processing unit for fast implementation. Another way is to use the first-order primal-dual algorithm in [70] to achieve faster convergence.

Conclusions
In this paper, we propose an SFL method for a hyperspectral image classification method based on the multichannel joint sparse recovery theory in [46], which can deal with all the test pixels simultaneously. The proposed SFL can not only obtain comparably good or better classification performance than using the pixel-wise classification strategy but also takes much less time. In addition, spatial filtering and the non-negative constraints are both adopted to improve the classification performance, and the spatial filtering has a larger effect on the classification than the non-negative constraint. Moreover, methods using 2,1 norm regularization term can generally obtain better classification performance than methods using an 1 norm regularization term, which demonstrate that it is beneficial to select correlated training samples among the whole training data set, and the 2,1 norm regularization term can impose sparsity across the pixels both at the group and individual levels. Furthermore, methods using 2,1 norm loss function can generally obtain better classification performance than methods using F norm loss function, which demonstrate that the 2,1 norm loss function is more robust for outliers than F norm loss function. Finally, experiments on two real hyperspectral image data sets demonstrate that the proposed SFL method outperforms some other popular classifiers. In our future work, we can adopt the CNN framework to extract deep features of hyperspectral images, which can be integrated into our method to improve the classification performance.