Tensor Block-Sparsity Based Representation for Spectral-Spatial Hyperspectral Image Classification

Abstract: Recently, sparse representation has yielded successful results in hyperspectral image (HSI) classification. In sparse representation-based classifiers (SRCs), a more discriminative representation that preserves the spectral-spatial information can be exploited by treating the HSI as a whole entity. Based on this observation, a tensor block-sparsity based representation method is proposed in this paper for spectral-spatial classification of HSI. Unlike traditional vector/matrix-based SRCs, the proposed method consists of tensor block-sparsity based dictionary learning and class-dependent block sparse representation. By naturally regarding the HSI cube as a third-order tensor, small local patches centered at the training samples are extracted from the HSI to maintain the structural information. All the patches are then partitioned into a number of groups, on which a dictionary learning model is constructed with a tensor block-sparsity constraint. A test sample is also expressed as a small local patch, and block sparse representation is then performed in a class-wise manner to take advantage of the class label information. Finally, the category of the test sample is determined by the minimal residual. Experimental results on two real-world HSIs show that the proposed method greatly improves the classification performance of SRC.


Introduction
A hyperspectral image (HSI) recorded by sensors contains hundreds of contiguous narrow spectral bands spanning the visible to infrared electromagnetic spectrum, providing detailed spectral information about the physical nature of distinct materials. Due to the abundance of information contained in HSI, hyperspectral imaging has opened new avenues in remote sensing [1][2][3][4][5]. One of the most important tasks in HSI analysis is pixel-oriented classification [6][7][8][9], where each pixel is assigned to one of the classes based on the training samples given for each class.
Much work has been performed to construct suitable classifiers for HSI. Among available methods, the support vector machine (SVM) [10][11][12] and the sparse representation-based classifier (SRC) [13][14][15][16][17] are two state-of-the-art approaches that have yielded impressive results. The SVM, which is insensitive to the curse of dimensionality (i.e., the Hughes phenomenon), has achieved great success in supervised classification over the past few decades. Moreover, sparked by the emergence of compressed sensing, SRC has attracted extensive attention in various applications and has become mainstream in HSI classification.
Incorporating spatial information into pixel-wise classifiers has also led to improvements recently, because it exploits additional relevant information from the spatial domain [18]. On the one hand, several refined versions of SVM have been proposed. In [19], a family of composite kernels is proposed to improve the classification performance by identifying the spatial correlation of neighboring samples or pixels. A weighted combination of basic kernels is involved in multiple kernel learning (MKL) [20][21][22][23], which enables encoding local neighboring details of a scene. Moreover, Markov random field (MRF)-based regularization, which integrates spatial and edge information into the SVM, is presented in [24,25]. On the other hand, significant efforts have been dedicated to improving the SRC. For instance, the joint sparsity model (JSM) simultaneously represents neighboring pixels using a linear combination of a few atoms from the dictionary and exploits the spatial correlation between neighboring pixels via the discriminative graphical model [26] or a nonlocal weighting scheme [27]. Many structured priors, including the Laplacian sparsity prior [28], group sparsity prior [29], low-rank group prior [30] and total variation (TV) prior [31], are introduced into the SRC to capture the spatial dependences of neighboring pixels. A spatial Bayesian network is employed to enhance the spatial homogeneity of SRC in [32]. Moreover, collaborative representation based classification (CRC) [33], which represents the test sample in a least squares sense, can achieve performance comparable to that of SRC.
Notably, the dictionary constructed from all of the training samples in SRC-based methods is unable to capture the crucial class-discriminative information, so one needs to learn a proper dictionary that can effectively represent the given samples. In general, dictionaries emerge from two sources: (1) Building a dictionary via mathematical model-based methods. Several traditional dictionaries [34] proposed by earlier works, including Fourier, wavelet, and discrete cosine transform (DCT) based ones, belong to this category. Although this type of method is characterized by analytic formulation and fast implementation, the dictionaries are fixed and cannot adaptively represent the HSI. (2) Learning a dictionary that has optimal performance on the training samples. Learning compact and discriminative dictionaries has attracted much interest recently. This type of dictionary can adapt to specific data and has exhibited promising results in recent years. In [32], an online dictionary is specifically designed for patch-based SRC by learning vector quantization (LVQ), while spatial-aware dictionary learning (SADL) [29] is proposed by partitioning the pixels of the HSI into a number of square patches called contextual groups. Another limitation of SRC is that the class labels of training samples are only utilized to calculate the residuals for each class but are ignored in the process of determining the sparse codes. To make full use of the class label information, a class-dependent SRC (cdSRC) is proposed in [35] and dictionary learning is performed in a class-oriented manner in [31].
Note that the aforementioned techniques treat the HSI as first-order/second-order data. However, in reality, an HSI data set is a three-dimensional (3-D) cube that contains one spectral dimension and two spatial dimensions. In this regard, many researchers have taken the HSI as a third-order tensor and attempted to develop spectral-spatial methods under the umbrella of tensor theory. For instance, 3-D wavelet based feature extraction methods are proposed in [36][37][38] to generate joint spectral-spatial texture features. A tensor discriminative locality alignment (TDLA) method [39] is developed to remove redundant information of the HSI. The 3-D gray-level co-occurrence [40] is presented to extract discriminant co-occurrence features for better classification accuracy. A compressive hyperspectral imaging method based on sparse tensors and nonlinear compressed sensing is proposed in [41]. Moreover, a local tensor discriminative analysis technique (LTDA) [42] is presented to integrate spectral-spatial features and tensor discriminant analysis for HSI classification. Some studies also focus on the tensor extension of SRC/dictionary learning. Tensor-based dictionary learning methods are proposed in [43][44][45], while compressed sensing is extended to the multidimensional scenario in [46][47][48]. However, to the best of our knowledge, no research has been found regarding tensor-based SRC and dictionary learning for HSI classification.
In this paper, we propose a tensor block-sparsity based representation method [43,47,48] for spectral-spatial classification of HSI. This method consists of two important steps: tensor block-sparsity based dictionary learning and class-dependent block sparse representation. In the first step, by naturally regarding the HSI cube as a third-order tensor, we extract small local patches from the HSI, each consisting of a training sample and its spatial neighborhood. All of the patches are then partitioned into a number of groups according to the class labels of the training samples. A dictionary learning model is constructed on those groups with the tensor block-sparsity constraint. In the second step, a test sample is also expressed as a small local patch with several neighboring pixels. We then perform block sparse representation in a class-wise manner to incorporate the class label information of the training samples. The class label of the test sample is finally determined by the minimal residual between the test sample and its approximations.
Compared with the existing hyperspectral classification literature, the advantages of this paper are as follows:
1. Spectral-spatial information of pixels is preserved by treating the HSI cube as a third-order tensor. Compared to the vector-based methods, the proposed tensor-based method is capable of maintaining the structural information.
2. Proper dictionaries are provided by tensor block-sparsity based learning. Instead of using all of the training samples to construct the dictionaries, the proposed method detects class-discriminative information for classification.
3. Class label information is fully exploited by class-dependent block sparse representation. The proposed method learns the sparse coefficients by taking advantage of the class labels, while in general SRC methods, the label information is only used to calculate the residuals.
The layout of this paper is as follows. Section 2 briefly describes the tensor notations and preliminaries. Section 3 illustrates the basic principles of SRC. Section 4 presents the proposed tensor block-sparsity based representation method in detail. Experimental results on two benchmark HSIs are reported in Section 5. Finally, conclusions are drawn in Section 6.

Tensor Notations and Preliminaries
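A tensor of order $N$ (i.e., an $N$-dimensional data array) is denoted by an underlined boldface capital letter, e.g., $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. A matrix (i.e., a two-dimensional (2-D) array) is denoted by a boldface uppercase letter and a vector by a boldface lowercase letter, e.g., $\mathbf{A} \in \mathbb{R}^{I_1 \times I_2}$ and $\mathbf{a} \in \mathbb{R}^{I}$ represent a matrix and a vector, respectively. The element $(i_1, i_2, \ldots, i_N)$ of a tensor $\underline{\mathbf{A}}$ is expressed as $a_{i_1, i_2, \ldots, i_N}$, where $1 \le i_n \le I_n$. The Frobenius norm of a tensor $\underline{\mathbf{A}}$ is defined as
$$\|\underline{\mathbf{A}}\|_F = \sqrt{\sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} a_{i_1, i_2, \ldots, i_N}^2}.$$
A sub-tensor can be formed by restricting the indices to a certain subset of values. In particular, given a tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, its mode-$n$ fiber is a vector defined by fixing all indices except $i_n$. Mode-$n$ unfolding (i.e., mode-$n$ matricization) of a tensor $\underline{\mathbf{A}}$ yields a matrix $\mathbf{A}_{(n)} \in \mathbb{R}^{I_n \times \bar{I}_n}$ with $\bar{I}_n = \prod_{m \neq n} I_m$, whose columns are the mode-$n$ fibers of $\underline{\mathbf{A}}$, i.e., $\mathbf{A}_{(1)} \in \mathbb{R}^{I_1 \times I_2 I_3 \cdots I_N}$, $\mathbf{A}_{(2)} \in \mathbb{R}^{I_2 \times I_1 I_3 \cdots I_N}$, etc. The $n$-rank of $\underline{\mathbf{A}}$, referred to as $r_n$, is defined as the rank of the mode-$n$ unfolding matrix, i.e., $r_n = \mathrm{rank}(\mathbf{A}_{(n)})$.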
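The product between two matrices can be extended to the product of a tensor and a matrix. Given a tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and a matrix $\mathbf{B} \in \mathbb{R}^{J \times I_n}$, the mode-$n$ tensor-by-matrix product yields $\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_n \mathbf{B} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$, whose entries are given by
$$c_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_n, \ldots, i_N} \, b_{j, i_n}$$
with $i_k = 1, 2, \ldots, I_k$ ($k \neq n$) and $j = 1, 2, \ldots, J$. It is worth stressing that the mode-$n$ product $\underline{\mathbf{C}}$ corresponds to the product of the matrix $\mathbf{B}$ with each of the mode-$n$ fibers of $\underline{\mathbf{A}}$, because
$$\mathbf{C}_{(n)} = \mathbf{B} \mathbf{A}_{(n)}.$$
The Kronecker product of two matrices, denoted by "⊗", is an important mathematical operation utilized in this paper. Given two matrices $\tilde{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2}$ and $\tilde{\mathbf{B}} \in \mathbb{R}^{I_3 \times I_4}$, the Kronecker product $\tilde{\mathbf{A}} \otimes \tilde{\mathbf{B}} \in \mathbb{R}^{I_1 I_3 \times I_2 I_4}$ is defined as
$$\tilde{\mathbf{A}} \otimes \tilde{\mathbf{B}} = \begin{bmatrix} \tilde{a}_{1,1} \tilde{\mathbf{B}} & \cdots & \tilde{a}_{1,I_2} \tilde{\mathbf{B}} \\ \vdots & \ddots & \vdots \\ \tilde{a}_{I_1,1} \tilde{\mathbf{B}} & \cdots & \tilde{a}_{I_1,I_2} \tilde{\mathbf{B}} \end{bmatrix}.$$
An HSI can be regarded as a 3-D data array $\underline{\mathbf{A}} \in \mathbb{R}^{L_w \times L_h \times l_s}$, where $L_w$, $L_h$ and $l_s$ indicate the number of rows, columns and spectral bands, respectively. By virtue of tensor algebra, one can perform spectral-spatial classification by treating the HSI as a whole entity. Interested readers can consult [39,41,49-51] for more details. Moreover, as shown in Figure 1, the original spatial structure is preserved by the tensor, while the spatial connectivity among local neighborhoods is lost in the vector representation. That is, rather than neglecting the 3-D structure of the HSI, the proposed method implicitly exploits information from the spatial domain through tensors.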
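The operations above can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the helper names `unfold`, `fold` and `mode_n_product` are hypothetical, and the column ordering of the unfolding (lower modes vary fastest) is an assumed convention. It verifies that the mode-n product satisfies C_(n) = B A_(n) and illustrates the block structure of the Kronecker product.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: rows index `mode`, columns are the mode-n fibers."""
    # Assumed convention: Fortran order, so lower modes vary fastest.
    moved = np.moveaxis(tensor, mode, 0)
    return np.reshape(moved, (tensor.shape[mode], -1), order="F")

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor with the given full shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    moved = np.reshape(matrix, [shape[mode]] + rest, order="F")
    return np.moveaxis(moved, 0, mode)

def mode_n_product(tensor, matrix, mode):
    """Mode-n product A x_n B, computed as B @ A_(n) and folded back."""
    new_shape = list(tensor.shape)
    new_shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, new_shape)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))      # third-order tensor
B = rng.standard_normal((3, 5))         # acts on the second mode (axis 1)
C = mode_n_product(A, B, 1)             # shape (4, 3, 6)

# Frobenius norm: square root of the sum of squared entries.
frob = np.sqrt(np.sum(A ** 2))

# Kronecker product: block (i, j) equals At[i, j] * Bt.
At = rng.standard_normal((2, 3))
Bt = rng.standard_normal((4, 5))
K = np.kron(At, Bt)                     # shape (2*4, 3*5)
```

Computing the product through the unfolding keeps the code close to the identity C_(n) = B A_(n); an `einsum` call would give the same tensor.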

Sparse Representation Classifier
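SRC is a compressed sensing-inspired technique that has recently been developed as a powerful tool in HSI classification. Relying on the assumption that hyperspectral pixels of the same class generally lie in the same low-dimensional subspace, an unknown test sample $\mathbf{x} \in \mathbb{R}^{l_s}$ can be represented as a sparse linear combination of all of the training samples
$$\mathbf{x} = \mathbf{D} \mathbf{z}$$
where the columns of the dictionary $\mathbf{D}$ are the training samples and $\mathbf{z}$ is the sparse coefficient vector. With a sparsity level $S$, the coefficient vector is estimated by
$$\hat{\mathbf{z}} = \arg\min_{\mathbf{z}} \|\mathbf{x} - \mathbf{D} \mathbf{z}\|_2 \quad \text{s.t.} \quad \|\mathbf{z}\|_0 \le S \quad (4)$$
where the $l_0$-norm $\|\cdot\|_0$ refers to the number of non-zero entries in $\mathbf{z}$. Problem (4) is nondeterministic polynomial-time hard (NP-hard), so no polynomial-time algorithm for solving it exactly is known. Fortunately, it can be approximately solved by greedy algorithms, such as orthogonal matching pursuit (OMP) [52]. Moreover, the $l_0$-norm problem can also be replaced with an $l_1$-norm problem and be approximately solved by basis pursuit (BP) [53].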
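As an illustration of the greedy approach, the following minimal NumPy sketch runs a basic OMP loop and then assigns the class whose atoms give the smallest reconstruction residual. It is a sketch under stated assumptions rather than the implementation used in the paper: the function names `omp` and `src_classify`, the toy dictionary with orthonormal atoms, and the label vector are all hypothetical.

```python
import numpy as np

def omp(D, x, sparsity):
    """Greedy OMP: approximate min ||x - D z||_2 s.t. ||z||_0 <= sparsity."""
    residual = x.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(sparsity):
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0        # never select an atom twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit of x on the atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    z = np.zeros(D.shape[1])
    z[support] = coef
    return z

def src_classify(D, labels, x, sparsity):
    """Assign x to the class whose atoms yield the minimal residual."""
    z = omp(D, x, sparsity)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(x - D[:, labels == k] @ z[labels == k])
                 for k in classes]
    return classes[int(np.argmin(residuals))], z

# Toy data (assumption): 8 orthonormal atoms in R^12, 4 atoms per class.
rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((12, 8)))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x = 0.7 * D[:, 4] + 0.3 * D[:, 6]             # built from class-1 atoms only
predicted, z = src_classify(D, labels, x, sparsity=2)
```

In this toy case the two selected atoms both belong to class 1, so the class-1 residual is numerically zero while the class-0 residual equals the norm of `x`.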
Having evaluated $\hat{\mathbf{z}}$, the class label of the test sample $\mathbf{x}$ is determined by the minimal error between $\mathbf{x}$ and its sub-dictionary estimations
$$\mathrm{class}(\mathbf{x}) = \arg\min_{k=1,2,\ldots,c} \|\mathbf{x} - \mathbf{D}_k \hat{\mathbf{z}}_k\|_2 \quad (5)$$
where $\mathbf{D}_k$ indicates the sub-dictionary of $\mathbf{D}$ belonging to the $k$th class and $\hat{\mathbf{z}}_k$ denotes the collection of coefficients in $\hat{\mathbf{z}}$ belonging to the $k$th class.

The two above-mentioned steps in the proposed method naturally treat the 3-D HSI data as a third-order tensor to perform the tensor-based dictionary learning and sparse representation. In this regard, the spectral and spatial information can be simultaneously preserved. Moreover, the class-dependent dictionaries generated in the first step help to implement the block sparse representation in a class-wise manner. Since the traditional SRC ignores the class label information in sparse coding, the proposed method can improve the classification performance by taking full advantage of the class labels.

Tensor Block-Sparsity Based Dictionary Learning
In this paper, the HSI cube is modeled as a third-order tensor $\underline{\mathbf{H}} \in \mathbb{R}^{L_w \times L_h \times l_s}$, and a small patch is composed of a training sample and its corresponding $l_w \times l_h$ spatial neighborhood. Suppose $\underline{\mathbf{X}}_{k,j} \in \mathbb{R}^{l_w \times l_h \times l_s}$ denotes the $j$th patch in the $k$th group, where $l_w$ and $l_h$ refer to the width and height of the patch, respectively, $l_s$ represents the number of spectral bands, $j$ denotes the index of the patch, and $k = 1, 2, \ldots, c$ is the index of the group. We then express $\{\underline{\mathbf{X}}_{k,j}\}_{j=1}^{n_k}$ as the $k$th group, which collects the patches belonging to the $k$th class, where $n_k$ represents the number of patches in the $k$th group. For notational convenience, the patches of the $k$th group $\{\underline{\mathbf{X}}_{k,j}\}_{j=1}^{n_k}$ are stacked to generate a fourth-order tensor $\underline{\mathbf{X}}^{(k)} \in \mathbb{R}^{l_w \times l_h \times l_s \times n_k}$, whose fourth mode indexes the patches of the $k$th class.
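The grouping step can be sketched as follows. This is an illustrative NumPy example, not the paper's code: the function name `group_patches`, the square odd patch size, and the mirror padding at the image borders are assumptions, since the text does not say how border pixels are handled.

```python
import numpy as np

def group_patches(hsi, coords, labels, size):
    """Stack the patches of each class into a tensor of shape (size, size, bands, n_k)."""
    half = size // 2
    # Assumption: mirror-pad the two spatial axes so border pixels get full patches.
    padded = np.pad(hsi, ((half, half), (half, half), (0, 0)), mode="reflect")
    groups = {}
    for k in sorted(set(labels)):
        # Pixel (r, c) of the original cube sits at (r + half, c + half) after
        # padding, so the patch centered on it starts at (r, c) in `padded`.
        patches = [padded[r:r + size, c:c + size, :]
                   for (r, c), lab in zip(coords, labels) if lab == k]
        groups[k] = np.stack(patches, axis=-1)
    return groups

# Toy cube (assumption): 6 rows, 7 columns, 3 bands, three labeled pixels.
hsi = np.arange(6 * 7 * 3, dtype=float).reshape(6, 7, 3)
coords = [(0, 0), (2, 3), (5, 6)]
labels = [1, 2, 1]
groups = group_patches(hsi, coords, labels, size=3)
```

Each `groups[k]` plays the role of the fourth-order tensor of the kth class, with the last axis indexing the patches.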
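With $\underline{\mathbf{X}}^{(k)}$, $k = 1, 2, \ldots, c$, the optimization problem for dictionary learning can be modeled as
$$\min_{\{\mathbf{D}_k^w, \mathbf{D}_k^h, \mathbf{D}_k^s, \underline{\mathbf{Z}}^{(k)}\}_{k=1}^{c}} \sum_{k=1}^{c} \left\| \underline{\mathbf{X}}^{(k)} - \underline{\mathbf{Z}}^{(k)} \times_1 \mathbf{D}_k^w \times_2 \mathbf{D}_k^h \times_3 \mathbf{D}_k^s \right\|_F^2 \quad \text{s.t.} \quad \|\underline{\mathbf{Z}}^{(k)}\|_B \preceq (r_k^w, r_k^h, r_k^s), \; k = 1, 2, \ldots, c \quad (6)$$
where $\mathbf{D}_k^w \in \mathbb{R}^{l_w \times m_w}$, $\mathbf{D}_k^h \in \mathbb{R}^{l_h \times m_h}$ and $\mathbf{D}_k^s \in \mathbb{R}^{l_s \times m_s}$ are the dictionaries of the $k$th group, $\underline{\mathbf{Z}}^{(k)} \in \mathbb{R}^{m_w \times m_h \times m_s \times n_k}$ is the corresponding coefficient tensor, $r_k^w$, $r_k^h$ and $r_k^s$ are the tensor block-sparsity parameters, $\|\cdot\|_F$ denotes the Frobenius norm, i.e., the square root of the sum of the absolute squares of the elements, $\mathbf{a}_1 \preceq \mathbf{a}_2$ implies that each entry of $\mathbf{a}_1$ is not greater than the corresponding entry of $\mathbf{a}_2$, and $\|\underline{\mathbf{Z}}^{(k)}\|_B$ represents the "tensor block-sparsity" of $\underline{\mathbf{Z}}^{(k)}$. The tensor block-sparsity of $\underline{\mathbf{Z}}^{(k)}$ is equal to $(r_k^w, r_k^h, r_k^s)$ if and only if there exist index sets $I_w$, $I_h$ and $I_s$ with $|I_w| = r_k^w$, $|I_h| = r_k^h$ and $|I_s| = r_k^s$ such that $z^{(k)}_{i_1, i_2, i_3, j} = 0$ for all $(i_1, i_2, i_3) \notin I_w \times I_h \times I_s$.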
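Note that the dictionary learning problem in Equation (6) can be decoupled into $c$ sub-problems. As such, the dictionary learning can be performed independently for each class to yield class-dependent dictionaries, which helps to achieve class-dependent block sparse representation. Specifically, the $k$th element in the objective function of Equation (6) can be formulated as
$$\left\| \underline{\mathbf{X}}^{(k)} - \underline{\mathbf{Z}}^{(k)} \times_1 \mathbf{D}_k^w \times_2 \mathbf{D}_k^h \times_3 \mathbf{D}_k^s \right\|_F^2 = \left\| \underline{\mathbf{X}}^{(k)} - \underline{\mathbf{Y}} \times_1 \mathrm{sub}(\mathbf{D}_k^w) \times_2 \mathrm{sub}(\mathbf{D}_k^h) \times_3 \mathrm{sub}(\mathbf{D}_k^s) \right\|_F^2 \quad (7)$$
where $\mathrm{sub}(\mathbf{D}_k^w) = \mathbf{D}_k^w(:, I_w)$, $\mathrm{sub}(\mathbf{D}_k^h) = \mathbf{D}_k^h(:, I_h)$ and $\mathrm{sub}(\mathbf{D}_k^s) = \mathbf{D}_k^s(:, I_s)$ collect the atoms indexed by the support sets, and the tensor block-sparsity constraints of Equation (6) are implicitly satisfied by setting $\underline{\mathbf{Y}} \in \mathbb{R}^{r_k^w \times r_k^h \times r_k^s \times n_k}$. In this regard, the original dictionary learning model in Equation (6) can be divided into a number of unconstrained problems imposed on the $c$ groups
$$\min_{\mathrm{sub}(\mathbf{D}_k^w), \mathrm{sub}(\mathbf{D}_k^h), \mathrm{sub}(\mathbf{D}_k^s), \underline{\mathbf{Y}}} \left\| \underline{\mathbf{X}}^{(k)} - \underline{\mathbf{Y}} \times_1 \mathrm{sub}(\mathbf{D}_k^w) \times_2 \mathrm{sub}(\mathbf{D}_k^h) \times_3 \mathrm{sub}(\mathbf{D}_k^s) \right\|_F^2 \quad (8)$$
Moreover, Equation (8) can be equivalently represented as
$$\min_{\underline{\mathbf{G}}, \mathbf{U}_k^w, \mathbf{U}_k^h, \mathbf{U}_k^s, \mathbf{U}_k^n} \left\| \underline{\mathbf{X}}^{(k)} - \underline{\mathbf{G}} \times_1 \mathbf{U}_k^w \times_2 \mathbf{U}_k^h \times_3 \mathbf{U}_k^s \times_4 \mathbf{U}_k^n \right\|_F^2 \quad (9)$$
where $\underline{\mathbf{G}} \in \mathbb{R}^{r_k^w \times r_k^h \times r_k^s \times r_k^n}$ denotes the core tensor and $\mathbf{U}_k^w$, $\mathbf{U}_k^h$, $\mathbf{U}_k^s$ and $\mathbf{U}_k^n$ contain the basis vectors in the four modes of $\underline{\mathbf{X}}^{(k)}$. The tensor block-sparsity parameters $r_k^w$, $r_k^h$, $r_k^s$ and $r_k^n$ can be obtained via the minimum description length (MDL) method [54]. Comparing Equations (8) and (9), we observe that
$$\mathrm{sub}(\mathbf{D}_k^w) = \mathbf{U}_k^w, \quad \mathrm{sub}(\mathbf{D}_k^h) = \mathbf{U}_k^h, \quad \mathrm{sub}(\mathbf{D}_k^s) = \mathbf{U}_k^s, \quad \underline{\mathbf{Y}} = \underline{\mathbf{G}} \times_4 \mathbf{U}_k^n.$$
Since the problem of Equation (9) can be effectively solved by the Tucker decomposition (http://www.sandia.gov/~tgkolda/TensorToolbox/index-2.6.html) [55], the solution of Equation (8) is also determined accordingly.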
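To make the link between Equation (9) and the learned sub-dictionaries concrete, the sketch below computes a truncated higher-order SVD (HOSVD) in NumPy. This is a swapped-in technique: truncated HOSVD is a simple non-iterative Tucker approximation, not the Tensor Toolbox routine referenced above, and the names `truncated_hosvd` and `reconstruct`, the toy sizes and the ranks are assumptions chosen for illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding with the mode-n fibers as columns."""
    moved = np.moveaxis(tensor, mode, 0)
    return np.reshape(moved, (tensor.shape[mode], -1), order="F")

def mode_n_product(tensor, matrix, mode):
    """Mode-n product: contract the columns of `matrix` with axis `mode`."""
    moved = np.tensordot(matrix, tensor, axes=(1, mode))   # new axis comes first
    return np.moveaxis(moved, 0, mode)

def truncated_hosvd(X, ranks):
    """Tucker approximation: leading left singular vectors of every unfolding."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = X
    for mode, U in enumerate(factors):
        core = mode_n_product(core, U.T, mode)             # project onto the bases
    return core, factors

def reconstruct(core, factors):
    out = core
    for mode, U in enumerate(factors):
        out = mode_n_product(out, U, mode)
    return out

# Toy class tensor (assumption): 5 x 5 patches, 10 bands, 6 patches,
# generated with exact multilinear rank (2, 2, 3, 4).
rng = np.random.default_rng(0)
ranks = (2, 2, 3, 4)
true_core = rng.standard_normal(ranks)
true_factors = [rng.standard_normal((n, r)) for n, r in zip((5, 5, 10, 6), ranks)]
X_k = reconstruct(true_core, true_factors)

G, (U_w, U_h, U_s, U_n) = truncated_hosvd(X_k, ranks)
X_hat = reconstruct(G, [U_w, U_h, U_s, U_n])
```

Under the correspondence stated above, `U_w`, `U_h` and `U_s` would serve as the class-k sub-dictionaries, and the product of `G` with `U_n` along the fourth mode as the coefficient tensor. An iterative Tucker solver would typically refine this initial estimate when the data are not exactly low rank.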

Class-Dependent Block Sparse Representation Classifier
Analogous to the training samples, a test sample $\mathbf{x} \in \mathbb{R}^{l_s}$ can also be extended to a third-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{l_w \times l_h \times l_s}$ by utilizing the $l_w \times l_h$ spatial neighboring pixels of $\mathbf{x}$. Notice that the traditional SRC only incorporates the class label information when computing the reconstruction errors (see Equation (5)), and generally does not include it in the sparse coding stage (see Equation (4)). To fully exploit the class label information in the coding process, we propose a class-dependent block sparse representation classifier. The idea is to successively represent $\underline{\mathbf{X}}$ by the $k$th ($k = 1, 2, \ldots, c$) group dictionaries $(\mathbf{D}_k^w, \mathbf{D}_k^h, \mathbf{D}_k^s)$ and impose block-structured sparsity on the core tensor $\tilde{\underline{\mathbf{Z}}}^{(k)}$. This means that the sparse coding is obtained by solving
$$\min_{\tilde{\underline{\mathbf{Z}}}^{(k)}} \left\| \underline{\mathbf{X}} - \tilde{\underline{\mathbf{Z}}}^{(k)} \times_1 \mathbf{D}_k^w \times_2 \mathbf{D}_k^h \times_3 \mathbf{D}_k^s \right\|_F^2 \quad \text{s.t.} \quad \|\tilde{\underline{\mathbf{Z}}}^{(k)}\|_B \preceq (\tilde{r}_k^w, \tilde{r}_k^h, \tilde{r}_k^s) \quad (10)$$
The constraint of Equation (10) restricts the non-zero entries of the core tensor $\tilde{\underline{\mathbf{Z}}}^{(k)}$ to be located in a sub-tensor of size $\tilde{r}_k^w \times \tilde{r}_k^h \times \tilde{r}_k^s$. The main advantage of this class-wise sparse coding is that it can improve the discrimination of the classifier. If the test sample $\mathbf{x}$ belongs to class $i$, $\underline{\mathbf{X}}$ can most likely be represented with satisfactory accuracy by a few atoms from $(\mathbf{D}_i^w, \mathbf{D}_i^h, \mathbf{D}_i^s)$, while more atoms from other classes (e.g., $(\mathbf{D}_j^w, \mathbf{D}_j^h, \mathbf{D}_j^s)$) are needed to represent $\underline{\mathbf{X}}$ with the same accuracy. Therefore, under a given sparsity constraint, the representation error of $\underline{\mathbf{X}}$ by $(\mathbf{D}_i^w, \mathbf{D}_i^h, \mathbf{D}_i^s)$ will be smaller than that obtained with the dictionaries of other classes. Moreover, we visually illustrate the unstructured and block sparsity of $\tilde{\underline{\mathbf{Z}}}^{(k)}$ in Figure 3, from which one can observe that the structural information is effectively incorporated into the block-sparsity-based method, whereas no such information is contained in the unstructured method. Notably, the additional assumption about the location of the nonzero sparse coefficients (i.e., "structured sparsity") reduces the complexity of solving Equation (10).
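In this paper, we adopt the N-way block OMP (NOMP) to solve Equation (10). NOMP is a greedy algorithm proposed by Caiafa et al. [47,48]. As a tensor extension of the traditional one-dimensional (1-D) OMP, NOMP iteratively selects the dictionary elements most correlated with the current residual and is computationally efficient because it fully exploits the sparsity structure under the tensor block-sparsity assumption. A detailed classification process is illustrated in Algorithm 1, whose main steps for each class $k$ can be summarized as follows:
1. Set $m = 1$, initialize the residual $\underline{\mathbf{R}} = \underline{\mathbf{X}}$ and start from empty index sets $I_1$, $I_2$ and $I_3$.
2. Select the index triple $(i_1, i_2, i_3) = \arg\max \left| \underline{\mathbf{R}} \times_1 \mathbf{D}_k^w(:, i_1)^T \times_2 \mathbf{D}_k^h(:, i_2)^T \times_3 \mathbf{D}_k^s(:, i_3)^T \right|$.
3. Update the support set $I^{(k)} = (I_1, I_2, I_3)$ with the selected indices.
4. Identify the vectorized version of $\tilde{\underline{\mathbf{Z}}}^{(k)}$ by least squares with respect to $\mathbf{D}_k^s(:, I_3) \otimes \mathbf{D}_k^h(:, I_2) \otimes \mathbf{D}_k^w(:, I_1)$.
5. Update the residual $\underline{\mathbf{R}} = \underline{\mathbf{X}} - \tilde{\underline{\mathbf{Z}}}^{(k)} \times_1 \mathbf{D}_k^w(:, I_1) \times_2 \mathbf{D}_k^h(:, I_2) \times_3 \mathbf{D}_k^s(:, I_3)$, set $m = m + 1$ and repeat from step 2 until the block-sparsity level is reached.
The class label of the test sample is then given by the class with the minimal residual.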
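The class-wise coding and minimal-residual decision can be illustrated with a simplified NumPy sketch. It is not NOMP: the greedy index selection is skipped and the support is assumed to be known, i.e., every column of the supplied sub-dictionaries is used, so only the least-squares step with the Kronecker product and the residual comparison are shown. The function names and the toy dictionaries are hypothetical.

```python
import numpy as np

def block_code(X, Dw, Dh, Ds):
    """Least-squares core tensor Z for X ~ Z x1 Dw x2 Dh x3 Ds (fixed support)."""
    # Column-major vectorization gives vec(X) = (Ds kron Dh kron Dw) vec(Z).
    K = np.kron(Ds, np.kron(Dh, Dw))
    z, *_ = np.linalg.lstsq(K, X.reshape(-1, order="F"), rcond=None)
    return z.reshape((Dw.shape[1], Dh.shape[1], Ds.shape[1]), order="F")

def reconstruct(Z, Dw, Dh, Ds):
    """Evaluate Z x1 Dw x2 Dh x3 Ds."""
    return np.einsum("ijk,ai,bj,ck->abc", Z, Dw, Dh, Ds)

def classify(X, class_dicts):
    """Return the class with the minimal representation residual."""
    residuals = {}
    for k, (Dw, Dh, Ds) in class_dicts.items():
        Z = block_code(X, Dw, Dh, Ds)
        residuals[k] = np.linalg.norm(X - reconstruct(Z, Dw, Dh, Ds))
    return min(residuals, key=residuals.get), residuals

# Toy setup (assumption): 3 x 3 patches with 8 bands and two classes whose
# sub-dictionaries have 2, 2 and 3 atoms in the three modes.
rng = np.random.default_rng(0)
class_dicts = {
    k: (rng.standard_normal((3, 2)),
        rng.standard_normal((3, 2)),
        rng.standard_normal((8, 3)))
    for k in (1, 2)
}
Z_true = rng.standard_normal((2, 2, 3))
X_test = reconstruct(Z_true, *class_dicts[2])   # patch generated by class 2
label, residuals = classify(X_test, class_dicts)
```

Forming the Kronecker matrix explicitly is only practical for small sub-dictionaries; it is used here because it mirrors the vectorized least-squares step described in the text.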

A Synthetic Example
Having presented the tensor block-sparsity based representation method, we provide a synthetic example in this subsection to demonstrate the parameterization, learning and utilization of the spatial information. The HSI data set used in this example is cropped from the whole Salinas-A scene (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes), i.e., a sub-region of columns and rows with a size of 36 × 36 × 204 that contains 3 classes of interest (c = 3). From each class, 5% of the samples are chosen for training while the remaining ones are taken as test samples. Without loss of generality, we demonstrate how to identify the class label of a test sample located at (23,15) (belonging to class 2). As shown in Figure 4, tensor block-sparsity based dictionary learning is first performed on the small local patches (neighborhood size = 7 × 7) of the training samples. In this step, spatial information is incorporated in each small 3-D patch.
Results of the learned dictionaries
As analyzed above, both training and test samples are represented as third-order tensors, each of which consists of the training (or test) sample located at the center and several spatial neighbors. Therefore, rather than neglecting the 3-D structure of the HSI, the proposed method maintains spectral-spatial information simultaneously by taking a pixel and its neighborhood as a whole entity.

Experimental Section
In this section, we perform experiments on two real-world data cubes to assess the effectiveness of the proposed method. The experimental results are compared visually and quantitatively with those obtained from several state-of-the-art methods, including the classical SVM [10], the SVM with composite kernel (SVMCK) [19], CRC [33], the JSM solved by simultaneous versions of OMP (denoted by SOMP) [13], SADL [29], 3-D discrete wavelet transform (3D-DWT) [36], LTDA [42], cdSRC [35], and patch-based learning SRC with spatial smooth (pLSRC-S) [32]. Moreover, we abbreviate the proposed tensor block-sparsity based representation classifier as tbSRC for simplicity.

Data Sets
Two real-world HSI data sets, namely Indian Pines data and University of Pavia data, with various spectral and spatial resolutions reflecting different environments of remote sensing are utilized in the experiments.

Experimental Design
To demonstrate the performance of the proposed tbSRC, nine widely used methods are considered for comparison:
1. SVM: the classical SVM [10] with a single radial basis function (RBF) kernel;
2. SVMCK: the SVM [19] with a composite of a spectral kernel and a spatial kernel, where the spatial feature is extracted by the extended morphological profile (EMP) [56];
3. CRC: the test sample is approximated by a linear combination of the training samples in a least squares sense [33];
4. SOMP: the spectral-spatial SRC incorporating spatial information by JSM and solved by SOMP [13];
5. SADL: the spatial-aware classification technique [29] whose dictionary is obtained by structured dictionary learning and whose sparse coefficients are classified by a linear SVM. Unlike SVM, SVMCK, 3D-DWT and LTDA, SADL applies the linear SVM for two reasons: (1) the SADL of this paper is consistent with [29], which employs a linear SVM; (2) the sparse codes of SADL are discriminative enough to be well classified by the linear SVM;
6. 3D-DWT: the texture features are obtained by 3D-DWT and the classification results are given by an RBF-based SVM [36];
7. LTDA: the features are extracted by EMP [56] and LTDA [42] and classified by SVM;
8. cdSRC: the SRC for each class is solved by OMP and the class label is jointly determined by the residual and the Euclidean distance [35];
9. pLSRC-S: the spectral-spatial SRC [32] with spatial smoothing, whose dictionary is learned by a modified patch-based learning SRC.
Note that: (1) the above-mentioned SVM, CRC and cdSRC are classification methods that use spectral information, while the other ones are spectral-spatial methods; (2) the CRC, SOMP, cdSRC, pLSRC-S and tbSRC are based on collaborative/sparse representation, while the other ones apply SVM for classification; (3) the 3D-DWT, LTDA and tbSRC are tensor-based methods, while the remaining methods ignore the 3-D structure of HSI.
Two experiments are designed in this paper: (1) we perform detailed comparisons on the two data sets with fixed ratios of labeled samples; (2) we also conduct additional comparisons to validate the tbSRC using various small numbers of labeled samples. In experiment 1, approximately 5% of the samples from each class in the Indian Pines data (see the 3rd column of Table 1) are randomly chosen as training samples and the remaining ones as test samples. Since the available training samples of the University of Pavia data are provided separately from the full set of labeled samples, only 5% of the available training samples (see the last column of Table 1) are randomly selected for training in this data set. In experiment 2, we choose 0.5%, 1%, 2%, 3%, 4% and 5% of the samples from each class in the Indian Pines data and 0.5%, 1%, 2%, 3%, 4% and 5% of the available training samples in the University of Pavia data as training samples, while the rest are treated as test samples. The aforementioned methods are compared numerically (overall accuracy (OA) and average accuracy (AA)) and statistically (kappa coefficient (κ)). All the experiments in this section are implemented with MATLAB on a platform with an Intel(R) Xeon(R) CPU (3.3 GHz), 8 GB RAM and a Windows 7 operating system, and the results are averaged over 10 independent trials to alleviate possible bias.
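For reference, the three measures can be computed from a confusion matrix as in the following minimal NumPy sketch. The function name `accuracy_metrics` and the toy labels are hypothetical, and the code assumes that every class occurs at least once among the true labels.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, num_classes):
    """Return overall accuracy, average accuracy and the kappa coefficient."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                        # rows: true class, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                # fraction of correctly labeled samples
    per_class = np.diag(cm) / cm.sum(axis=1) # accuracy of each class
    aa = per_class.mean()                    # unweighted mean over the classes
    # Agreement expected by chance, from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy labels (assumption): 10 test samples from 3 classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]
oa, aa, kappa = accuracy_metrics(y_true, y_pred, num_classes=3)
```

In this toy case 8 of the 10 samples are correct, the per-class accuracies are 0.75, 1 and 0.75, and the chance agreement is 0.34, so OA = 0.8, AA ≈ 0.833 and κ ≈ 0.697. AA weights every class equally, which is why it reacts more strongly than OA to errors in small classes.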
Moreover, several parameters should be tuned in the experiments. For the SVM-based methods (i.e., SVM, SVMCK, SADL, 3D-DWT and LTDA), the RBF kernel is used in SVM, SVMCK, 3D-DWT and LTDA, while the linear kernel is adopted in SADL. The RBF parameter γ is tuned in the range γ ∈ {2^−8, 2^−7, ..., 2^8}, the composite weight η in SVMCK is varied in steps of 0.1 in the range 0 to 1, and the penalty term C is set to 60. Figure 7 plots the effect of C (C ∈ {1, 2, ..., 100} with γ = 1) on the OA of SVM for the Indian Pines data. As shown in Figure 7, the parameter C does not seriously affect the classification accuracy of SVM when C is larger than 10. Therefore, we set C = 60 in the experiments. For the sparse representation-based methods (i.e., SOMP, cdSRC and tbSRC), the sparsity level S ranges from 10 to 100, and the neighborhood of both SOMP and tbSRC ranges from 3 × 3 to 11 × 11. Figure 8 shows the optimal OA for various neighborhood sizes in the two HSI data sets. It can be observed that the 9 × 9 neighborhood yields high accuracy for the Indian Pines data, while smaller sizes are more suitable for the University of Pavia data, which lacks large spatially homogeneous regions. According to [29], the patch size of SADL is taken as 8 × 8 for the Indian Pines data and 16 × 16 for the University of Pavia data. The neighborhood size of pLSRC-S [32] is set as 7 × 7 for the Indian Pines data and 21 × 21 for the University of Pavia data. Although different methods take different patch sizes, it still makes sense to compare those methods. On the one hand, although a larger patch size is somewhat similar to using more samples, the numbers of training and test samples are the same for all methods. On the other hand, the patch size of each patch-based method is set to give the best (or at least comparable) performance, and it is reasonable to compare all of the patch-based methods under their best (or at least comparable) settings. Moreover, the HSI is decomposed into 2 levels in the 3D-DWT, and the dimension is reduced to 30 in the local Fisher discriminant analysis (LFDA) step of cdSRC. In LTDA, 5 and 2 principal components are retained to obtain the EMP features in the two data sets, respectively.

Classification Results and Discussions of Experiment 1
We first illustrate the block-structured sparsity derived from the real-world Indian Pines data set.Without loss of generality, we choose the test sample (belonging to class 1 alfalfa) from the spatial coordinate (70, 97) and let the neighborhood size be 9 × 9 and k = 1.As plotted in Figure 9, the normalized small local patch X ∈ R 9×9×200 , which comprises the test sample and its 9 × 9 neighboring pixels, can be approximated by the 1st group dictionaries (D w 1 , D h 1 , D s 1 ) and the core tensor Z(1) It is observed from Figure 9 that the non-zero elements of Z(1) are concentrated in a few locations and exhibit block structure, which is consistent with the ideal situation shown in Figure 3b.In the first experiment, different approaches are compared in Tables 2 and 3, where the classification accuracy of each class, OA, AA, κ, standard deviation and computational time are displayed.The classification maps of a trail are depicted in Figures 10 and 11.Based on the classification results shown in Tables 2 and 3, and Figures 10 and 11, we make the following discussions: 1. SVM, CRC and cdSRC provide a more salt-and-pepper-like appearance than other methods.
As shown in Figures 10 and 11, there are many scattered salt-and-pepper-like errors in SVM, CRC and cdSRC, while the classification errors of SVMCK, SOMP, SADL, 3D-DWT, LTDA, pLSRC-S and tbSRC are spatially concentrated.This is because SVM, CRC and cdSRC only use the spectral information of the HSI, while the others integrate additional relevant information (i.e., spatial information) and develop it into spectral-spatial methods.As displayed in Tables 2 and 3, the OAs of SVM are almost 10% to 20% lower than those of the other ones.More specifically, SVMCK provides much better results than SVM.This validates the advantage of spatial information for HSI classification; 2. Although cdSRC is based only on spectral characteristics, it still yields impressive classification results.As shown in Table 2, the OA, AA and κ of cdSRC achieve 90.35%, 89.35%, and 88.98%, respectively, which are comparable to those of the SADL and 3D-DWT.Similar properties can also be found in Table 3.Such phenomena imply that incorporating the class label information in the process of calculating the sparse coefficients helps to improve the classification performance; 3. 
For the Indian Pines data, SOMP achieves classification accuracies comparable to those of SVMCK, which indicates that the training samples inside a window surrounding a central pixel are taken as part of the best atoms. As shown in Table 2, the OA and κ of SOMP are respectively 1.02% and 1.11% higher than those of SVMCK, but the AA of SOMP is 3.08% lower than that of SVMCK. Because the 9th class (i.e., oats) of the Indian Pines data covers a narrow area (see Figure 5b), SOMP offers poor results for this class. For the University of Pavia data, SOMP falls far behind the other methods. As displayed in Table 3, the OA, AA and κ of SOMP are as low as 56.92%, 59.47% and 46.23%, respectively. SOMP may perform poorly here because the available training samples (see Figure 6c) are composed of small patches, and thus the window around a pixel may contain no training samples. More particularly, the 1st, 7th and 8th classes (i.e., asphalt, bitumen and bricks) of the University of Pavia data cover very narrow regions (see Figure 6b). As a consequence, the classification results of those classes are particularly poor (see Table 3);
4. Among the collaborative/sparse representation-based methods (i.e., CRC, SOMP, cdSRC, pLSRC-S and tbSRC), pLSRC-S and tbSRC yield better classification performance than CRC, SOMP and cdSRC. This is partly because the dictionary learned by pLSRC-S and tbSRC can effectively represent the test samples, while the dictionary in CRC, SOMP and cdSRC is conventionally formed by all of the training samples and thus is not well suited to capturing crucial class-discriminative information. As shown in Table 3, the OA of tbSRC is 25.27%, 32.11% and 3.25% higher than that of CRC, SOMP and cdSRC, respectively. Moreover, we can also observe that SADL attains high classification accuracies because a structured dictionary is effectively learned. For instance, the OA of SADL reaches 91.66% in Table 2. Those phenomena highlight the importance of dictionary learning;
5. The tensor or 3-D based methods (i.e., 3D-DWT, LTDA and tbSRC) generally lead to better or comparable performance to that of SADL. As shown in Table 3, the OAs of 3D-DWT and tbSRC are 0.79% and 2.75% higher than that of SADL, respectively, whereas the OA of LTDA is 0.33% lower than that of SADL. Similar results can also be found in Table 2. Therefore, the classification results in Tables 2 and 3 validate the excellent ability of 3D-DWT, LTDA and tbSRC in identifying spectral-spatial structures of HSI cubes. Moreover, tbSRC provides the best performance among all the above-mentioned methods. As depicted in Figures 10 and 11, the classification maps of tbSRC are closer to the ground truth (see Figures 5b and 6b) than those of other methods. Without 3-D tensors for spatial treatment, the proposed tbSRC roughly reduces to a special case of cdSRC. Based on the experimental results (see Tables 2 and 3) on the two real-world HSIs, the spatial treatment improves the OA by about 3% over using the spectral information alone;
6. As displayed in Tables 2 and 3, the standard deviation of OA for tbSRC is slightly lower than those of the other methods, which indicates that tbSRC is stable.
Regarding the computational efforts, the tensor block-sparsity based dictionary learning can be effectively implemented by
1. As expected, the classification accuracy increases when the number of training samples increases. As depicted in Figure 12a, the OA of tbSRC is much lower than 90% when only 0.5% of labeled samples of the Indian Pines data are selected as training samples, while the OA is more than 90% with 5% of labeled samples as training samples;
2. tbSRC is demonstrated to be superior to other methods with a small number of labeled samples, while SADL, 3D-DWT, LTDA, cdSRC and pLSRC-S trail marginally behind tbSRC. As observed from Figure 12a, although only 1% of labeled samples of the Indian Pines data are selected as training samples, the OA of tbSRC almost reaches 80%, while the OAs of SADL, 3D-DWT, LTDA, cdSRC and pLSRC-S are slightly lower than that of tbSRC. Moreover, it is interesting to note that SADL, 3D-DWT, LTDA, cdSRC and pLSRC-S achieve comparable performance. As shown in Figure 12a, the variations among the OAs of those five methods are quite small. Similar results can also be discerned from Figure 12b;
3. For the Indian Pines data, SOMP exhibits classification accuracies comparable to those of SVMCK. As illustrated in Figure 12a, the gap between SOMP and SVMCK is narrow, regardless of what ratio of samples is chosen as training samples. However, SOMP delivers the worst results compared with the other seven methods for the University of Pavia data. Figure 12b reveals that the OA of SOMP is lower than 60% even when 5% of the available training samples are selected, whereas the OAs of the other methods (except CRC) are higher than 60% when using as little as 2% of the available training samples.
In a nutshell, the classification results of experiments 1 and 2 for two benchmark HSI data sets demonstrate the effectiveness of our proposed tbSRC in improving the classification performance.
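The comparisons above are stated in terms of OA, AA and κ. The following is a minimal sketch of how these three scores are commonly computed from a confusion matrix, assuming the standard definitions (overall accuracy, mean per-class accuracy and Cohen's kappa). The function name `classification_scores` and the row/column convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def classification_scores(confusion):
    """Return (OA, AA, kappa) from a confusion matrix.

    Assumed convention: confusion[i, j] counts samples of true class i
    that were predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    # Overall accuracy: fraction of all samples on the diagonal.
    oa = np.trace(confusion) / total
    # Average accuracy: mean of the per-class accuracies (recalls).
    aa = np.mean(np.diag(confusion) / confusion.sum(axis=1))
    # Expected chance agreement from the row and column marginals.
    pe = np.sum(confusion.sum(axis=0) * confusion.sum(axis=1)) / total**2
    # Cohen's kappa: agreement beyond chance, normalized.
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

Because AA weights every class equally, a method can have a higher OA but a lower AA than a competitor when it fails on a small class, which is the pattern reported for SOMP on the oats class.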

Conclusions
In this paper, we have proposed a tensor block-sparsity based representation method (i.e., tbSRC) for spectral-spatial classification of HSI. The proposed tbSRC treats the entire HSI cube as a third-order tensor and linearly represents a hyperspectral sample by a few atoms learned from the data. To that end, we perform tensor block-sparsity based dictionary learning and class-dependent block sparse representation on the HSI. Compared against techniques in the literature, our proposed tbSRC provides significant improvements and demonstrates promising results. On the one hand, tbSRC can effectively learn the dictionary with the tensor block-sparsity constraint for a number of groups, which consist of several local patches centered at the training samples and are grouped by the class labels of the training samples. The learned dictionary can effectively characterize the subspace structure of different classes and is much better than a dictionary directly constructed from the training samples. On the other hand, block sparse representation is performed on the small local patch centered at the test sample in a class-wise manner to make full use of the class label information. Experiments conducted on two benchmark data sets have consistently confirmed the effectiveness of tbSRC, even in scenarios with a small number of training samples. Quantitatively, the OA of tbSRC improves by about 2% to 30% compared to other state-of-the-art methods. In future work, we will investigate the kernel variant of tbSRC to pursue further improvement.

Figure 1. Vector and tensor representations of a 5 × 5 local patch from the original HSI data set.

Figure 2 depicts the schematic diagram of the proposed tensor block-sparsity based representation method, which consists of two main components. The first one is the tensor block-sparsity based dictionary learning. In this step, a number of small patches are extracted from the original HSI. Each patch is a third-order tensor consisting of a training sample located at the center of the patch and several spatial neighbors of the training sample. The patches are then partitioned into c groups. As shown in Figure 2, patches are assigned to the same group if the training samples at their centers belong to the same class. Accordingly, the dictionaries can be learned on the c groups with a tensor block-sparsity constraint. The second step is class-dependent block sparse representation. Similar to the training samples, a test sample is expressed as a small local patch with several neighboring pixels surrounding it. The sparse coding of the test sample is then obtained by performing block sparse representation in a class-wise manner. The class label is finally determined by the minimal residual between the test sample and its approximation.
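The patch extraction and class-wise grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `extract_patch` and `group_patches_by_class` are hypothetical, the HSI cube is assumed to be a NumPy array of shape (rows, columns, bands), the patch size is assumed odd, and mirror padding at the image border is an assumption, since the text does not say how border pixels are handled.

```python
import numpy as np

def extract_patch(cube, row, col, size):
    """Return the size x size x bands patch centered at pixel (row, col).

    The cube is mirror-padded so that border pixels also yield full
    patches (assumed behavior; not specified in the text).
    """
    half = size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    # Pixel (row, col) sits at (row + half, col + half) in the padded cube.
    return padded[row:row + size, col:col + size, :]

def group_patches_by_class(cube, coords, labels, size):
    """Stack the training patches of each class into one 4-D array."""
    groups = {}
    for (row, col), label in zip(coords, labels):
        groups.setdefault(label, []).append(extract_patch(cube, row, col, size))
    # Each group has shape (number of patches, size, size, bands).
    return {label: np.stack(patches) for label, patches in groups.items()}
```

Each resulting group is the input on which one class-specific set of dictionaries would be learned; the dictionary learning step itself is not sketched here.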

Figure 2. Flowchart of the proposed algorithm. In each class, $\mathcal{Z}^{(k)}$ share the same atoms of dictionaries $(D^w_k, D^h_k, D^s_k)$, thus making the dictionary learning process tensor block-sparse.

In Algorithm 1, steps 3-10 aim at identifying the core tensor $\mathcal{Z}^{(k)}$ and the residual $e_k$. Analogous to SRC, the class label of x is determined by the minimal residual

$$\mathrm{class}(x) = \arg\min_{k=1,2,\ldots,c} e_k$$
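The decision rule can be illustrated with a short sketch. It assumes the class-wise approximations of the test patch have already been computed by some representation step; that step is not reproduced here. The function name `classify_by_min_residual` and the dictionary-of-arrays interface are illustrative assumptions.

```python
import numpy as np

def classify_by_min_residual(patch, reconstructions):
    """Return the label with the smallest residual norm, plus all residuals.

    `reconstructions` maps each class label k to that class's approximation
    of `patch` (same shape). The residual e_k is the 2-norm of the
    vectorized difference between the patch and the approximation.
    """
    residuals = {
        k: float(np.linalg.norm((patch - approx).ravel()))
        for k, approx in reconstructions.items()
    }
    label = min(residuals, key=residuals.get)
    return label, residuals
```

For example, with three candidate classes the call returns the label whose approximation lies closest to the patch in the Euclidean sense, mirroring the arg min above.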

... vec(·) denotes the vectorization operator and I ...

Algorithm 1: Class-Dependent Block Sparse Representation Classifier
Require: Dictionaries $(D^w_k, D^h_k, D^s_k)$, k = 1, 2, ..., c, sparsity level S, test sample x and its corresponding third-order tensor $\mathcal{X}$ composed of the neighboring pixels
Ensure: The class label of test sample x
1: for all k = 1, 2, ..., c do
2: ...
8: m = m + 1
9: end while
10: Calculate the norm of the kth residual $e_k = \|e_k^{m-1}\|_2$, where $e_k^{m-1}$ is the vectorization of $\mathcal{E}_k^{m-1}$
11: end for
12: Determine the class label of x by $\mathrm{class}(x) = \arg\min_{k=1,2,\ldots,c} e_k$

... 2, 3 are shown in the top right corner of Figure 4. The size of each dictionary is also displayed. Subsequently, class-dependent block sparse representation is performed on the small local patch of the test sample $\mathcal{X} \in \mathbb{R}^{7 \times 7 \times 204}$. The core tensors $\mathcal{Z}^{(k)}$, k = 1, 2, 3, which have obvious structured sparsity, are plotted in the lower left corner of Figure 4. Based on the minimal residuals, the class label of the test sample is finally set as 2.

Figure 4. A synthetic example conducted on a subset of the Salinas-A scene.

1. Indian Pines data: this data set was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over a mixed forest/agricultural region in northwestern Indiana in June 1992. The image size is 145 × 145 × 220 ($L_w = 145$, $L_h = 145$, $l_s = 220$), with 145 × 145 pixels and 220 spectral bands. The spatial resolution is 20 m per pixel and the 220 spectral bands cover the 0.4-2.5 µm range, of which 20 noisy and water-vapor absorption bands (bands 104-108, 150-163, and 220) are removed so that 200 bands are retained for experiments. Figure 5 displays the three-band false color composite image along with the corresponding ground truth. This data set contains 16 classes of interest and 10,366 labeled pixels, with highly unbalanced class sizes ranging from 20 to 2468, which poses a big challenge for the classification problem. The number of samples for each class is listed in Table 1, whose background color corresponds to different classes of land covers.
2. University of Pavia data: this data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over the urban site of the University of Pavia, northern Italy, in July 2002. The original size is 610 × 340 × 115 ($L_w = 610$, $L_h = 340$, $l_s = 115$), with 610 × 340 pixels and 115 spectral bands. The spatial resolution is 1.3 m per pixel and the 115 spectral bands cover 0.43-0.86 µm, of which the 12 noisiest channels are removed and 103 spectral bands remain for experiments. Figure 6 shows the false color composite image, the ground truth data and the available training samples.
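The band removal described for the Indian Pines data can be checked with a short sketch. It assumes the cube is stored as a NumPy array with the spectral axis last and that the band numbers quoted in the text are 1-based; both are assumptions, as is the helper name `remove_bands`.

```python
import numpy as np

def remove_bands(cube, removed_1based):
    """Drop the listed 1-based band indices from the last axis of the cube."""
    drop = np.asarray(sorted(removed_1based)) - 1  # convert to 0-based
    keep = np.setdiff1d(np.arange(cube.shape[-1]), drop)
    return cube[..., keep]

# Bands listed in the text for Indian Pines: 104-108, 150-163 and 220
# (assumed to be 1-based band numbers).
INDIAN_PINES_REMOVED = list(range(104, 109)) + list(range(150, 164)) + [220]
```

The three ranges contain 5, 14 and 1 bands respectively, so removing them from the 220-band cube leaves the 200 bands used in the experiments.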

Figure 5. Indian Pines image: (a) Three-band false color composite; (b) Ground truth data with 16 classes.

Figure 6. University of Pavia image: (a) Three-band false color composite; (b) Ground truth data with 9 classes; (c) Available training samples.

Figure 9. Illustration of the block-structured sparsity derived from the real-world Indian Pines data set. Without loss of generality, the test sample (belonging to class 1, alfalfa) is chosen at the spatial coordinate (70, 97), with a neighborhood size of 9 × 9 and k = 1. The normalized small local patch of the test sample $\mathcal{X} \in \mathbb{R}^{9 \times 9 \times 200}$ can be approximately represented by the 1st group dictionaries $(D^w_1, D^h_1, D^s_1)$ and the core tensor $\mathcal{Z}^{(1)}$, with $D^w_1, D^h_1 \in \mathbb{R}^{9 \times 9}$, $D^s_1 \in \mathbb{R}^{200 \times 129}$, and $\mathcal{Z}^{(1)} \in \mathbb{R}^{9 \times 9 \times 129}$. Notably, the non-zero elements of $\mathcal{Z}^{(1)}$ have a block structure, which is consistent with the ideal situation shown in Figure 3b.
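The shapes quoted in this caption can be checked with a small sketch of the reconstruction, assuming the usual Tucker form in which the core tensor is multiplied along its three modes by the width, height and spectral dictionaries. The function name `tucker_reconstruct` is hypothetical and the random arrays below only stand in for learned dictionaries.

```python
import numpy as np

def tucker_reconstruct(core, d_w, d_h, d_s):
    """Compute core x_1 d_w x_2 d_h x_3 d_s using mode-n products.

    core: (a, b, c); d_w: (I, a); d_h: (J, b); d_s: (K, c).
    Returns an array of shape (I, J, K).
    """
    return np.einsum("abc,ia,jb,kc->ijk", core, d_w, d_h, d_s)

# Shapes taken from the caption; the values are random placeholders.
rng = np.random.default_rng(0)
core = rng.standard_normal((9, 9, 129))
d_w = rng.standard_normal((9, 9))
d_h = rng.standard_normal((9, 9))
d_s = rng.standard_normal((200, 129))
patch_approx = tucker_reconstruct(core, d_w, d_h, d_s)  # shape (9, 9, 200)
```

With a block-sparse core, only the atoms indexed by the non-zero block contribute to the reconstruction, which is what the figure visualizes.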

Figure 12. Overall accuracy (%) of different methods with various numbers of labeled samples for (a) Indian Pines data and (b) University of Pavia data.

Table 1 lists the number of samples for each class together with the available training samples. Analogous to the Indian Pines data, the background color also corresponds to different classes of land covers. As shown in Table 1, this image consists of 9 classes of land covers and each class contains more than 900 pixels. However, fewer than 600 training samples are available for each class.

Table 1. Number of samples (No.S) and available training samples (No.ATS) used in the experiments.
Tucker decomposition, and tbSRC spends most of its computational cost in class-dependent block sparse representation, which requires no more than O(3cnl s RR 3 ) operations with n denoting the number of test samples, R = max respectively. In the experiments, the computation time of tbSRC is comparable with that of other methods. As shown in Table 2, although tbSRC takes a little more time than other methods, it can complete the classification task in no longer than 5 min.

Table 2. Classification accuracy (%) and standard deviation (in brackets) of different methods for the Indian Pines data.

Table 3. Classification accuracy (%) and standard deviation (in brackets) of different methods for the University of Pavia data.