Spectral-Spatial Response for Hyperspectral Image Classification

This paper presents a hierarchical deep framework called Spectral-Spatial Response (SSR) to jointly learn spectral and spatial features of Hyperspectral Images (HSIs) by iteratively abstracting neighboring regions. SSR forms a deep architecture and is able to learn discriminative spectral-spatial features of the input HSI at different scales. It includes several existing spectral-spatial-based methods as special scenarios within a single unified framework. Based on SSR, we further propose the Subspace Learning-based Networks (SLN) as an example of SSR for HSI classification. In SLN, the joint spectral and spatial features are extracted using templates learned by Marginal Fisher Analysis (MFA) and Principal Component Analysis (PCA). Key to the success of SLN is the exploitation of the label information of training samples and the local spatial structure of the HSI. Extensive experimental results on four challenging HSI datasets taken from the Airborne Visible-Infrared Imaging Spectrometer (AVIRIS) and Reflective Optics System Imaging Spectrometer (ROSIS) airborne sensors show the implementational simplicity of SLN and verify the superiority of SSR for HSI classification.


Introduction
Hyperspectral Image (HSI) classification has recently gained popularity and attracted interest in many fields, including assessment of environmental damage, growth regulation, land use monitoring, urban planning and reconnaissance [1][2][3][4][5]. Although high spectral resolution opens the door to many applications, the high dimensionality poses new challenges for HSI classification [3].
In the past few years, many methods have been proposed for HSI classification. In order to deal with the problems arising as the data dimensionality increases, many Dimensionality Reduction (DR) methods [4,6-10] have been adopted for HSI classification. These methods fall into three categories: unsupervised, supervised and semi-supervised. Additionally, they can ameliorate statistically ill-posed problems and improve the classification performance [2]. Michele et al. proposed a semi-supervised multiview feature extraction method based on multiset regularized kernel canonical correlation analysis for the classification of HSIs [11,12]. Apart from feature extraction, designing an effective classifier is also an important way to promote the classification accuracy. For example, the Support Vector Machine (SVM) and Relevance Vector Machine (RVM) have been successfully used for HSI classification [13,14]. Recently, the Kernel-based Extreme Learning Machine (KELM) [15,16] was also applied to HSI classification [17]. KELM trains a single-hidden-layer feedforward neural network whose hidden-node parameters are randomly generated from certain probability distributions. This idea was originally proposed in [18] and further developed in [19,20]. A similar idea of randomly generating the node parameters based on sparse representation has also been investigated in matching problems, such as in [21,22]. The neural network is an important machine learning method that has attracted more and more attention recently [23]. However, conventional methods only exploit the spectral information of HSIs, and the spatial structure is ignored. As a result, their classification results may contain salt-and-pepper noise [24].
Recently, spectral-spatial-based methods have attracted great interest and improved the HSI classification accuracy significantly [25][26][27][28][29][30][31]. Camps-Valls et al. [32] proposed a Composite Kernel (CK) that easily combines spatial and spectral information to enhance the classification accuracy of HSIs. Li et al. extended CK to a generalized framework, which exhibits great flexibility in combining the spectral and spatial information of HSIs [33]. Fauvel et al. introduced the Morphological Profile (MP), which is widely used for modeling structural information [34]. Li et al. proposed the Maximizer of the Posterior Marginal by Loopy Belief Propagation (MPM-LBP) [35], which exploits the marginal probability distribution using both the spectral and spatial information. Zhong et al. developed a discriminant tensor spectral-spatial feature extraction method for HSI classification [24]. Kang et al. [36] proposed a spectral-spatial classification framework based on Edge-Preserving Filtering (EPF), where the filtering operation achieves a local optimization of the probabilities. Two-dimensional Gabor features extracted from selected bands and the Local Binary Pattern (LBP) were introduced for extracting local spatial features of HSIs in [37] and [38], respectively. Li et al. proposed to combine LBP and ELM (LBP-ELM) for HSI classification [38]. Feng et al. [25] defined Discriminate Spectral-Spatial Margins (DSSMs) to reveal the local information of hyperspectral pixels and explored the global structures of both labeled and unlabeled data via low-rank representation. Zhou et al. proposed a Spatial and Spectral Regularized Local Discriminant Embedding (SSRLDE) method for DR of HSIs [2]. He et al.
proposed Spatial Translation-Invariant Wavelet (STIW)-based Sparse Representation (STIW-SR) for extracting spectral-spatial features [39]. STIW can reduce the spectral observation noise and the spatial nonstationarity while maintaining the class-specific true spectra. Soltani-Farani et al. presented the Spatially-Aware Dictionary Learning (SADL) method [40], a structured dictionary-based model for hyperspectral data that incorporates both the spectral and contextual characteristics of spectral samples. Sun et al. presented the Sparse Multinomial Logistic Regression and Spatially-Adaptive Total Variation (SMLR-SpATV) classifier [41], which uses SpATV regularization to enforce spatial smoothness. These methods have achieved promising results [1][2][3]27,28,42]. Furthermore, Li et al. proposed the Multiple Feature Learning (MFL) framework with state-of-the-art performance [30,43]. However, most of these methods extract spectral-spatial features using shallow architectures, which offer limited complexity and non-linearity.
In [44], a deep learning-based HSI classification method was proposed, where spectral and spatial information is extracted separately and then processed via stacked autoencoders. Similarly, Li et al. proposed to use Deep Belief Networks (DBN) for HSI classification [45]. Yue et al. explored spatial and spectral features at higher levels by using a deep CNN framework for hyperspectral image classification [46]. Unsupervised sparse features were learned via a deep CNN in a greedy layer-wise fashion for pixel classification in [47], and a CNN was utilized to automatically find high-level spatial-related features in a subspace obtained by local discriminant embedding [48]. Very recently, a regularized deep Feature Extraction (FE) method based on a Convolutional Neural Network (CNN) was presented for HSI classification [49]. These works demonstrate that deep learning opens a new window for future research and showcase the huge potential of deep learning-based methods. However, how to design a proper deep network is still an open problem in the machine learning community [50,51]. Generally, HSI classification aims at assigning each pixel to its correct class. However, pixels in smooth homogeneous regions usually exhibit high within-class spectral variations. Consequently, it is crucial to exploit the nonlinear characteristics of HSIs and to reduce intraclass variations. The difference between natural image classification and HSI classification lies in that the former learns a valid representation for each image, while the latter learns an effective representation for each pixel in an HSI. Moreover, the high dimensionality of HSIs in the spectral domain raises theoretical and practical problems. Furthermore, each pixel in an HSI will likely share similar spectral characteristics or have the same class membership as its neighboring pixels. Using spatial information can reduce the uncertainty of samples and suppress salt-and-pepper noise in the
classification results. In order to make use of this nature of HSIs, we intend to learn discriminative spectral-spatial features using a hierarchical deep architecture in this paper. More specifically, we learn effective spectral-spatial features by iteratively abstracting neighboring regions. In this way, the intraclass variations can be reduced, and the classification maps become smoother. Meanwhile, the label information of the training samples can also be used to learn discriminative spectral features at different scales [44,52-56].
Consequently, this paper proposes a hierarchical deep learning framework, called Spectral-Spatial Response (SSR), for HSI classification. SSR can jointly extract spectral-spatial features by iteratively abstracting neighboring regions and recomputing representations for the new regions. It can exploit different spatial structures at varied spatial sizes. Using SSR, we develop a novel spectral-spatial-based method, Subspace Learning-based Networks (SLN), for HSI classification. It utilizes Marginal Fisher Analysis (MFA) and Principal Component Analysis (PCA) to learn discriminative spectral-spatial features. The main difference between the proposed framework and general deep learning frameworks is that the discriminative convolutional filters are learned directly from the images rather than by the stochastic gradient descent method used in general deep learning. Moreover, several advantages are highlighted as follows:

• SSR provides a new way to simultaneously exploit discriminative spectral and spatial information in a deep hierarchical fashion. The stacking of joint spectral-spatial feature learning units can produce intrinsic features of HSIs.

• SSR is a unified framework for designing new joint spectral-spatial feature learning methods for HSI classification. Several existing spectral-spatial-based methods are its special cases.

• As an implementation example of SSR, SLN is further introduced for HSI classification with a small number of training samples. It is easy to implement and has low sample complexity.
The remainder of this paper is organized as follows. In Section 2, the general framework SSR is presented, and its relationship with other methods is given. A new implementation of SSR called SLN is presented in detail in Section 3. Section 4 provides the experimental evaluation of the proposed framework using four widely-used HSI datasets collected by the Airborne Visible-Infrared Imaging Spectrometer (AVIRIS) and the Reflective Optics System Imaging Spectrometer (ROSIS). Comparison results with state-of-the-art methods are also reported. Finally, Section 5 concludes with some remarks and possible future research directions.

Spectral-Spatial Response
The proposed framework aims at learning effective features that maximize the difference between classes and minimize the difference within each class. The learned features are expected to be more discriminative for classification. In fact, spatially adjacent pixels usually share similar spectral characteristics and have the same label, and using spatial information can reduce the uncertainty of samples and suppress the salt-and-pepper noise in classification results [57]. Furthermore, recent studies have found that higher layers of a deep hierarchical model produce increasingly abstract representations that are increasingly invariant to certain transformations [58,59]. Consequently, we present a new framework that jointly learns the spectral-spatial features via a deep hierarchical architecture. Using this architecture, the spectral-spatial features can be learned recursively.

Definition of Spectral-Spatial Response
To define SSR, we need the following two ingredients:

• A finite number of nested cubes that define the hierarchical architecture (see Figure 1). The SSRs on different layers can be learned from different cubes. The sizes of the cubes determine the sizes of the neighborhoods in the original HSI.

• A set of templates (or filters) that extract the spectral and spatial features. These templates can be learned from the training samples or from the cubes centered at the positions of the training samples.

The hierarchical architecture is composed of n cubes H_1, H_2, ..., H_n, as shown in Figure 1. In this section, the definition of the SSR based on three cubes (H_1, H_2 and H_3) is given as an illustrative example, where H_1 corresponds to a small cube in the HSI, H_2 is a larger cube and H_3 represents the whole HSI. Let I ∈ R^{m×n×d} be an HSI to be processed, where m, n and d are the height of the HSI, the width of the HSI and the number of spectral bands, respectively. The construction of SSR is given in a bottom-up fashion.

(1) The first layer SSR: The computation procedure of the first layer SSR is shown in Figure 2. Let I_{i,j} be the central pixel of H_1; then, the spectral-spatial feature of I_{i,j} can be jointly learned as follows.
First, the spectral features can be learned from each pixel. The reproducing kernel, denoted by K_1(I_{ii,jj}, t^1_l), can be used to learn the spectral features, where I_{ii,jj} is a pixel in H_1 and t^1_l is the l-th spectral template in T_1. The reproducing kernel produces the spectral features by encoding the pixel with a set of learned templates. For example, one could choose the simple linear kernel, namely K_1(I_{ii,jj}, t^1_l) = ⟨I_{ii,jj}, t^1_l⟩. The spectral templates are spectral feature extractors learned from training pixels; this operation is intended to reduce the spectral redundancy. In this way, the pixel I_{ii,jj} is transformed into a |T_1|-dimensional feature vector, where |T_1| is the cardinality of T_1 and |T_1| < d. Similarly, the pixels in H_1 ∈ R^{v_1×v_1×d} (the pixel I_{i,j} and its neighborhood) can be transformed into a new cube in R^{v_1×v_1×|T_1|}, where v_1 × v_1 is the size of the neighborhood. This cube contains |T_1| matrices of size v_1 × v_1, called the first layer spectral feature maps and denoted by g^1_i (i = 1, ..., |T_1|). Note that at this stage, we can move H_1 pixel by pixel. Second, we learn the spatial features based on the outputs g^1_i of the previous stage. In this stage, our objective is to incorporate the spatial contextual information within the neighborhood into the processed pixel. For each map, the spatial response K̄_1(g^1_i, t̄^1_l) is computed, where the spatial template t̄^1_l can be learned from the first layer spectral feature maps of the training samples. In this way, for each I_{i,j}, we obtain a new feature vector R_1(I_{i,j}), called the first layer SSR of I_{i,j}. This operation can be considered as the "convolution" in a conventional deep learning model; in this way, the spatial information of the local region is learned. Consequently, the first layer SSR is obtained by jointly exploiting both the spectral and spatial information of the HSI.
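To make the two stages concrete, here is a minimal NumPy sketch of one joint spectral-spatial unit with linear kernels. The function name, patch handling and template shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def first_layer_ssr(hsi, spectral_templates, spatial_templates, v=5):
    """One joint spectral-spatial unit with linear kernels (illustrative sketch).

    hsi: (m, n, d) cube; spectral_templates: (d, k1); spatial_templates: (v*v, k2).
    """
    m, n, d = hsi.shape
    # Spectral stage: encode every pixel with the spectral templates
    # (linear kernel <pixel, template>), giving k1 spectral feature maps.
    maps = hsi.reshape(-1, d) @ spectral_templates            # (m*n, k1)
    maps = maps.reshape(m, n, -1)
    # Spatial stage: encode each v x v neighborhood of every map
    # with the spatial templates ("convolution" with learned filters).
    r = v // 2
    padded = np.pad(maps, ((r, r), (r, r), (0, 0)), mode="edge")
    k1 = maps.shape[2]
    out = np.zeros((m, n, k1 * spatial_templates.shape[1]))
    for i in range(m):
        for j in range(n):
            patch = padded[i:i + v, j:j + v, :]               # (v, v, k1)
            vecs = patch.reshape(v * v, k1)                   # one column per map
            out[i, j] = (vecs.T @ spatial_templates).ravel()  # joint response
    return out
```

Stacking such units, each followed by concatenation with the normalized input, yields the deep architecture described in the text.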
Finally, we concatenate R_1(I_{i,j}) and I_{i,j} into a new spectral-spatial feature vector (see Figure 3). This concatenation provides more spectral information; in this way, the spectral features are enhanced, and the oversmoothing problem can be alleviated. After processing all of the pixels in I ∈ R^{m×n×d}, a new feature cube, denoted by Ĩ, is obtained. Note that each feature vector in Ĩ is learned from a region of size H_1 in the original HSI.

(2) The second layer SSR: Similarly, we can define the second layer SSR on a region corresponding to H_2 in the original image I. In this case, the two template sets are denoted by T_2 and T̄_2, respectively. For each pixel Ĩ_{ii,jj} of the new feature cube, we apply the linear kernel K_2(Ĩ_{ii,jj}, t^2_l) = ⟨Ĩ_{ii,jj}, t^2_l⟩.
K_2(Ĩ_{ii,jj}, t^2_l) can be regarded as a pooling operation over all spectral bands. The outputs of this operation at all positions are called the second layer spectral feature maps, denoted by g^2_i (i = 1, 2, ..., |T_2|). These maps are then convolved with the learned templates (or filters) t̄^2_i (i = 1, 2, ..., |T̄_2|), which gives the response at each position (i, j). Consequently, the second layer SSR at position (i, j), denoted by R_2(Ĩ_{i,j}), is defined by this joint operation, where t̄^2 can be learned from all feature maps g^2_i of the training samples. Similarly, the final output is obtained by concatenating R_2(Ĩ_{i,j}) and Ĩ_{i,j} into a new spectral-spatial feature.

(3) Extension to n layers: The output of the previous step is a new feature cube. The definition given above can thus be easily generalized to an n-layer architecture defined by the sub-cubes H_1, ..., H_n. Based on the above descriptions, the flowchart of the proposed deep hierarchical framework is shown in Figure 4, where the SSRs are concatenated with the normalized HSI to prevent oversmoothing. It is composed of stacked joint feature learning units. Once a stacked architecture has been built, its highest-level spectral-spatial features can be used as the input of a supervised learning algorithm, for example an SVM or a KELM. The framework can learn substantially more effective features with increasing depth. Different layers of SSRs are defined on different "receptive fields". From the definition of the SSR, we can see that different learning methods can be flexibly applied in the template learning modules. Consequently, we can design different algorithms based on the proposed framework. The advantage of the hierarchical framework is that it can effectively learn spectral-spatial features layer by layer. As mentioned above, the framework can be extended to deeper layers by defining the SSR on the new feature cube. If we denote the original HSI as the first "feature cube", then the new feature cube obtained by concatenating the
first layer SSR and the normalized HSI can be denoted as the second one, and so on. Then, the proposed framework can be reformulated from the perspective of the kernel.
Proposition 1. Let {x^n_1, x^n_2, ..., x^n_l} be l feature vectors (corresponding to l pixels in the original HSI) from a sub-cube in the n-th feature cube. Then, the n-th layer SSR can be written as R_n = K̄_n(K_n({x^n_i}, T_n), T̄_n), where T_n = {t^n_1, ..., t^n_{|T_n|}} is the n-th spectral template set, T̄_n = {t̄^n_1, ..., t̄^n_{|T̄_n|}} is the n-th spatial template set, n = 1, 2, 3, 4, ..., K_n is the kernel function on the spectral domain, K̄_n is the kernel function on the spatial domain and R_n is the n-th layer SSR.
Proof. We only need to prove the case of n = 1 and generalize it to R_n. First, we have a pixel group {x^1_1, ..., x^1_l} from a sub-cube in the first feature cube (the original HSI). The first layer spectral feature maps can be obtained by F_1 = K_1({x^1_i}, T_1); in this case, each row of F_1 is a vectorized local feature map. Then, we can obtain the first layer SSR based on T̄_1, that is, R_1 = K̄_1(F_1, T̄_1). In a similar way, we can prove that R_n = K̄_n(K_n({x^n_i}, T_n), T̄_n), where n = 2, 3, ... and x^n_i (i = 1, ..., l) come from the n-th feature cube obtained by concatenating the (n − 1)-th layer SSR and the normalized HSI. This proposition indicates that SSRs are obtained by abstracting the features from the previous layer, where the transform matrices are learned from the training data on each layer. The kernel computes the inner product in the induced feature space. Many kernels can be used, and the linear kernel is the simplest one. The following conclusions can be drawn from Proposition 1:

• The proposed framework shares similarities with deep learning models. If the kernel functions K_1 and K̄_1 jointly learn spectral-spatial features on the first layer, then the iterated mapping in Equation (10) corresponds to the multilayer feature learning in a deep model. Consequently, as the depth increases, the receptive field becomes larger and larger. In this way, the hierarchical architecture can propagate local information to a broader region. Thus, this framework can learn spectral-spatial features of the HSI at multiple levels of abstraction.

• The proposed framework is designed for HSI classification. Proposition 1 shows that K_n and K̄_n can learn spectral and spatial features jointly. Such kernels are not considered in conventional deep learning models, which are very popular in the computer vision community. These kernel functions can be viewed as inducing a nonlinear mapping from inputs to feature vectors. K_n can learn spectral features and overcome the high-dimensionality problem, while K̄_n can learn spatial features and decrease the intraclass variations. Consequently, the proposed SSR suits the nature of HSIs.

Remarks:

• An HSI usually contains homogeneous regions. Consequently, we assume that each pixel in an HSI will likely share similar spectral characteristics or have the same class membership as its neighboring pixels. This is the reason why spatial information can be used in the proposed framework SSR.

• In SSR, the template plays an important role; it can be a filter or an atom of a dictionary. Consequently, constructing the template sets is an interesting problem to be further investigated.

• Stacking joint spectral-spatial feature learning units leads to a deep architecture. Different feature learning methods (linear and nonlinear) can be embedded into the proposed SSR. This flexibility offers the possibility of systematically and structurally incorporating prior knowledge; for example, MFA can be used to learn discriminative features.

Special Scenarios
SSR is a spectral-spatial-based deep learning framework for HSI classification. Moreover, several existing spectral-spatial-based HSI classification methods can be derived from SSR. As discussed above, we can use different methods to obtain the template sets. For example, we can learn a set of spectral templates by using techniques such as PCA and Linear Discriminant Analysis (LDA). In this way, we can remove the spectral redundancy of the HSI and obtain spectral features suitable for classification. Similarly, the spatial templates can have different forms, such as Principal Components (PCs), which can learn the spatial features of the HSI.

PCA+Gabor
If PCs and Gabor filters are selected as T_1 and T̄_1, respectively, SSR becomes the method proposed by Chen et al. [42]. That is, the image pixels are first processed by PCA, then a two-dimensional Gabor filter [37,42,60] is used to extract the spatial information in the PCA-projected subspace. Finally, the spatial feature and the spectral feature are concatenated. In this case, the spatial templates take the form of the Gabor function

g(x, y) = exp(−(x'^2 + γ²y'^2)/(2δ²)) cos(2πx'/λ + ψ),

with x' = x cos θ + y sin θ and y' = −x sin θ + y cos θ, where λ, θ, δ, γ and ψ are the wavelength, orientation, standard deviation of the Gaussian envelope, aspect ratio and phase offset, respectively.
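A minimal sketch of this real-valued Gabor template generator, with parameter names matching λ, θ, δ, γ and ψ above; the fixed grid size is an assumption for illustration:

```python
import numpy as np

def gabor_kernel(lam, theta, delta, gamma, psi, size=31):
    """Real 2-D Gabor filter: lam wavelength, theta orientation, delta std of
    the Gaussian envelope, gamma aspect ratio, psi phase offset."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)     # rotated coordinates
    yp = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xp**2 + (gamma * yp)**2) / (2 * delta**2))
    return envelope * np.cos(2 * np.pi * xp / lam + psi)
```

A bank of such kernels at several orientations θ can then be convolved with each PCA-projected band to extract the spatial features.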

Edge-Preserving Filtering
If we set the templates in T_1 as SVM classifiers and the templates in T̄_1 as edge-preserving filters, SSR reverts to Kang's spectral-spatial method with EPF [36]. In this method, each pixel is first classified by a classifier, e.g., an SVM, to generate multiple probability maps. Edge-preserving filtering is then applied to each map with the help of a guidance image. Finally, the class of each pixel is determined by the maximum probability across the filtered probability maps. In this case, the spectral response of a pixel x takes the form of the SVM decision function

f(x) = Σ_{i=1}^{N_s} α_i y_i K(x_i, x) + b,

where x_i, y_i, α_i and b are the i-th support vector, its label, the i-th Lagrange multiplier and the bias, respectively, and N_s is the number of support vectors. Note that the output for each pixel has only one non-zero entry. In summary, SSR provides a unified framework in which existing shallow spectral-spatial methods correspond to particular choices of templates for feature learning at each layer. More importantly, SSR provides guidance for designing new hierarchical deep learning methods. As an example, we present an implementation of SSR in Section 3.
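The filter-then-argmax structure of the EPF pipeline can be sketched as follows; note that a plain box filter stands in here for the guided (edge-preserving) filter of [36], so this is only an illustration of the pipeline shape, not the original method:

```python
import numpy as np

def box_filter(p, r=1):
    """Simple mean filter; a stand-in for the guided (edge-preserving) filter."""
    padded = np.pad(p, r, mode="edge")
    out = np.zeros_like(p)
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            out[i, j] = padded[i:i + 2 * r + 1, j:j + 2 * r + 1].mean()
    return out

def epf_style_classify(prob_maps):
    """Sketch of the EPF pipeline: filter each per-class probability map,
    then assign each pixel to the class with the maximum filtered probability."""
    filtered = np.stack([box_filter(p) for p in prob_maps])
    return filtered.argmax(axis=0)
```

Replacing `box_filter` with a guided or bilateral filter driven by a guidance image recovers the edge-preserving behavior described in the text.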

Subspace Learning-Based Networks
Because using different template learning methods in SSR leads to different algorithms, there are many ways to implement SSR (as mentioned in Section 2.2). In this section, we propose SLN as an implementation example of SSR. SLN uses subspace learning methods (MFA and PCA) to learn the templates and KELM as the classifier to further promote the classification performance. SLN maximizes the spectral difference between classes using MFA and incorporates spatial information during the feature learning step using PCA. Both are quite effective for HSI classification. Consequently, the proposed method is suitable for HSI classification.
Let I_tr = {I_1, I_2, ..., I_N} be the training set, where I_i ∈ R^d (i = 1, 2, ..., N), and the samples belong to C classes. For simplicity, SLN with one joint feature learning unit is presented here. It performs a normalization of the image, uses MFA to learn discriminative spectral templates, employs PCA [61] to learn spatial templates and then uses KELM to assign a label to each pixel. The detailed description of SLN is given as follows.
(1) Preprocessing: The image preprocessing normalizes the data values into [0, 1] by Ī = (I − I_min)/(I_max − I_min), where I_min and I_max are the minimum and maximum data values. (2) Joint spectral-spatial feature learning: First, discriminative spectral features are desired for classification; consequently, the label information of the training samples can be used. In SLN, a supervised subspace learning method, MFA [62], is used to construct T_1. MFA searches for projection directions on which the marginal sample pairs of different classes are far away from each other, while data points of the same class are required to be close to each other [63]. Here, the projection directions of MFA are taken as the templates in T_1; they can be obtained by solving the generalized eigenvalue problem Ī_tr L Ī_tr^T t = λ Ī_tr B Ī_tr^T t, where Ī_tr is the normalized training set, L is the Laplacian matrix and B is the constraint matrix (refer to [62]). Once the templates are given, the normalized HSI can be projected onto the templates pixel by pixel. As described in Section 2, each template produces a feature map; in this way, we obtain |T_1| spectral feature maps.
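The MFA template learning step can be sketched as a generalized eigenproblem, assuming the Laplacian L and constraint matrix B have already been built from the intrinsic and penalty graphs as in [62]; the helper name and the ridge term are illustrative assumptions:

```python
import numpy as np

def mfa_templates(X, L, B, k, eps=1e-6):
    """Spectral templates from MFA: eigenvectors of the generalized problem
    (X L X^T) t = lambda (X B X^T) t with the k smallest eigenvalues.

    X: (d, N) normalized training pixels; L, B: (N, N) graph matrices."""
    A = X @ L @ X.T
    C = X @ B @ X.T + eps * np.eye(X.shape[0])   # small ridge for stability
    vals, vecs = np.linalg.eig(np.linalg.solve(C, A))
    order = np.argsort(vals.real)                # ascending: most discriminative first
    return vecs[:, order[:k]].real               # (d, k) spectral template matrix
```

Projecting every pixel of the normalized HSI onto these k directions produces the |T_1| spectral feature maps described above.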
Second, spatial information within the neighborhood is expected to be incorporated into the processed pixel. In SLN, PCA, acting as a linear autoencoder, is used to construct T̄_1. In this way, the template learning method is simple and fast. The templates in T̄_1 can be learned as follows.
We crop v_1 × v_1 image patches centered at each training sample in the i-th spectral feature map. Because there are N training samples, we can collect N patches from each map. These cropped patches are vectorized to form a matrix, from which the mean values are removed to obtain X. The construction of the template set is then the optimization problem max_{t̄} t̄^T X X^T t̄ subject to t̄^T t̄ = 1, where t̄ is a spatial template; that is, the templates are the |T̄_1| principal eigenvectors of X X^T [64].
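This spatial template learning step can be sketched as follows (a minimal eigendecomposition sketch; the function name and the row-wise patch layout are assumptions):

```python
import numpy as np

def pca_spatial_templates(patches, k):
    """Learn spatial templates as the k principal eigenvectors of X X^T.

    patches: (N, v*v) vectorized patches cropped around the training samples."""
    X = patches - patches.mean(axis=0)   # remove mean values
    cov = X.T @ X                        # (v*v, v*v); X X^T in the text, with patches as columns
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]          # top-k eigenvectors as templates
```

Each returned column is one spatial template t̄, and the columns are mutually orthonormal.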
The patches cropped from each band of the training samples can then be encoded by the |T̄_1| templates.
In this way, a feature cube can be obtained, where the number of its feature maps is |T_1| × |T̄_1|. (3) Classification based on KELM: After L alternations of the joint spectral-spatial feature learning process, SLN obtains spectral-spatial features that are then classified by the KELM classifier [15,17,65].
Let the features of the training samples be {x_i, y_i} (i = 1, ..., N), where x_i ∈ R^{|T_L|·|T̄_L|+d} and y_i = (y_{i,1}, ..., y_{i,C}) ∈ R^C indicates one of the C classes, with y_{i,j} = 1 if x_i belongs to the j-th class and y_{i,j} = 0 otherwise.
The output of the KELM classifier is f(x_t) = [K(x_t, x_1), ..., K(x_t, x_N)] (I/ρ + Ω)^{−1} Y, where x_t is the feature of the test sample, Ω is the N × N kernel matrix with Ω_{i,j} = K(x_i, x_j), ρ is the regularization parameter, Y = [y_1, y_2, ..., y_N]^T and K is the kernel function (this paper uses the Radial Basis Function (RBF) kernel). Finally, the class of the test sample is determined by the index of the output node with the highest output value [65].
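The closed-form KELM step can be sketched as follows; the default values of ρ and the RBF width follow the parameters reported in this paper, but the helper name and defaults are assumptions for illustration:

```python
import numpy as np

def kelm_fit_predict(X, Y, X_test, rho=1e5, sigma=0.1):
    """Closed-form KELM with an RBF kernel (sketch, not the authors' code).

    X: (N, D) training features, Y: (N, C) one-hot labels, X_test: (M, D)."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    omega = rbf(X, X)                                # (N, N) kernel matrix
    beta = np.linalg.solve(np.eye(len(X)) / rho + omega, Y)
    scores = rbf(X_test, X) @ beta                   # (M, C) output nodes
    return scores.argmax(axis=1)                     # highest output node wins
```

The single linear solve replaces iterative training, which is why KELM is fast compared with gradient-trained networks.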

The pseudocodes of the training and testing procedures of SLN are given in Algorithms 1 and 2, respectively. In Algorithm 2, y_t is the predicted class label of the test sample. In SLN, MFA is used to maximize the difference between classes and to minimize the difference within each class. Pixels with the same label may occur in spatially separated locations, and MFA can decrease this type of intraclass variation. The spatial feature learning (PCA in SLN) can further reduce the intraclass variations while making use of the local structure information. Moreover, the learning methods in SLN are adopted according to the nature of the HSI. Remarks:

• As one implementation of SSR, SLN is a deep learning method. Similarly, other hierarchical methods can be obtained by applying different kinds of templates and kernels in SSR.

• The templates in SLN are learned by MFA and PCA, which are simple and perform well in feature learning. Consequently, SLN is a simple and efficient method for jointly learning the spectral-spatial features of HSIs.

Experimental Results and Discussions
In this section, we provide an experimental evaluation of the presented framework and SLN using four real HSIs. In our experiments, the classification results are compared visually and quantitatively, where the quantitative comparisons are based on the class-specific accuracy, Overall Accuracy (OA), Average Accuracy (AA) and the κ coefficient [66]. Note that the kernel parameters in the KELM are set to 0.1. All experiments are performed using MATLAB R2014a on an Intel i7 quad-core 2.10-GHz machine with 8 GB of RAM.
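The three quantitative measures used throughout this section can be computed from a confusion matrix; a small helper sketch (the function name is illustrative):

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA and the kappa coefficient from a confusion matrix."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total                       # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))      # mean of class-specific accuracies
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)                    # agreement beyond chance
    return oa, aa, kappa
```

Unlike OA, κ discounts the agreement expected by chance, which is why it is used for the significance tests below.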

Datasets and Experimental Setups
Four HSI datasets, including AVIRIS Indian Pines, ROSIS University of Pavia, ROSIS Center of Pavia and Kennedy Space Center (KSC), are employed to evaluate the effectiveness of the proposed method.

• The Indian Pines dataset was acquired by the AVIRIS sensor over the Indian Pines test site in 1992 [67]. The image scene contains 145 × 145 pixels and 220 spectral bands. The available ground truth is divided into 16 classes. In the experiments, the number of bands was reduced to 200 by removing bands affected by atmospheric effects. This scene is challenging because of the significant presence of mixed pixels and the unbalanced number of available labeled pixels per class [35]. A three-band false color image and the ground-truth image are shown in Figure 5.

Experiments with the AVIRIS Indian Pines Dataset
For the first dataset, we randomly select 10% of the labeled samples from each class for training and use the rest for testing. Table 1 shows the class-specific accuracies, OAs, AAs and κ coefficients of the different methods, where MH-KELM is the Multi-Hypothesis-based KELM [42,68] and SC-MK is the Superpixel-based Classification via Multiple Kernels [69]. PCA+Gabor, EPF, MH-KELM, MPM-LBP, LBP-ELM, MFL, SC-MK and SADL are spatial-based methods; SVM only uses the spectral information. The experimental results given in Table 1 are averaged over 10 runs, where the proposed SLN has five layers of SSRs. A summary of the important parameters is given in Table 2, where v_i is the spatial size in the i-th layer. These parameters were determined experimentally. In the proposed method, ρ = 100,000. For PCA+Gabor, γ = 0.5, λ = 26, δ = 14.6, θ ∈ {0°, 22.5°, 45°, 67.5°, 90°, 112.5°, 135°, 157.5°} and ψ = 0. The experimental results in Table 1 show that methods making use of spectral-spatial information perform better than spectral-based methods. This is consistent with previous studies and demonstrates the advantage of exploiting spatial information in HSI classification. Another interesting observation is that MFL performs poorly on the oats class; the reason may be the unbalanced training samples (only three samples were used for training). It is easy to see that SLN performs best on this dataset, owing to the hierarchical joint spectral-spatial feature learning in SLN. The κ coefficient is a robust measure that takes into account the possibility of good classification occurring by random chance. To test the statistical significance of the accuracy differences, we conducted a t-test (at the 95% level) between the κ coefficients of each pair of compared classification results. Table 3 shows the p-values corresponding to the different methods; based on these results, we can conclude that the improvements are statistically significant.
In real applications, users are usually interested in the full classification map of the scene rather than the ground truth, which is already known. Consequently, Figure 9 illustrates the full classification maps obtained by different methods; each map is obtained from one of the random runs conducted on the Indian Pines dataset. As can be seen from Figure 9, SLN gives satisfactory results on smooth homogeneous regions by making use of the spatial information hierarchically. We can also see that the maps obtained by SVM and KELM have a heavily noisy appearance. One possible reason is that spectral-based methods cannot make use of the spatial correlation of the HSI. Although MH-KELM, PCA+Gabor, EPF, MPM-LBP, MFL and SADL improve the classification results, their classification maps still present noise. As further confirmed by the results in Table 1, the proposed SLN not only reduces noise, but also provides a higher OA than the other methods. Although LBP-ELM achieves a high OA, it leads to oversmoothing. Next, we show how the number of training samples affects the accuracy of different methods. In each test, we randomly select 2% to 10% of the labeled samples from each class to form the training set, and the remainder forms the test set. The quantitative results averaged over 10 runs for various methods are given in Figure 10. As can be seen, OAs increase monotonically as the percentage of training samples increases. With relatively limited training samples (2% of the ground truth), SLN can obtain an OA over 92%, which is around 4% higher than that obtained by MH-KELM. In this case, the proposed method obtains effective features using the hierarchical spectral-spatial learning model. Finally, the resulting features of different steps are given in Figure 11, where false color images are formed from the first three feature maps. Figure 11 shows that the discriminability of the learned features becomes stronger as the depth increases. Note that the learned spectral templates produce feature maps that preserve edges, while the spatial templates lead to feature maps that are smoother in homogeneous regions. With the increase of depth, local regions with pixels belonging to the same class become smoother while edges are preserved. Therefore, the proposed deep model can abstract intrinsic features from the HSI. This also reveals why the deep architecture makes the proposed method fit for HSI classification.

Experiments with the ROSIS University of Pavia Dataset
First, for each of the nine classes, 1% of the labeled pixels were randomly sampled for training, while the remaining 99% were used for testing. The experiment is repeated 10 times using different randomly-chosen training sets to avoid any bias induced by random sampling. In the proposed method, ρ = 100. Table 4 shows the averaged OAs, AAs, κ coefficients and individual class accuracies obtained in our comparisons (see the parameters in Table 2). As we can observe, SVM obtains poor results because it only uses the spectral information. Table 4 also shows that the proposed SLN outperforms the other compared methods, demonstrating that SLN can exploit the spatial information effectively. Figure 12 shows the full classification maps obtained by different methods. Again, we find that the proposed SLN leads to a better classification map, while LBP-ELM suffers from oversmoothing. We further perform the paired t-test on the κ coefficients between SLN and the other compared methods. The p-values in Table 5 demonstrate the effectiveness of the proposed method. Second, we examine the effect of the number of training samples on the classification accuracy of different methods. In this experiment, we randomly select 1% to 3% (with a step of 0.5%) of the labeled samples from each class to form the training set, and the remainder forms the testing set. The experimental results, averaged over 10 runs, are given in Figure 13. Again, the proposed SLN outperforms the other methods with a small number of training samples. Figure 13 also shows that the spectral-spatial-based methods significantly outperform the spectral-based methods. Finally, we use about 9% of all data for training (3921 samples) and the rest for testing (40,002 samples). The sample distributions can be found in Figure 14, where the training set is provided by Prof. P. Gamba. This fixed training set is challenging because it is made up of small patches, and most of the patches in an HSI contain no training samples. The proposed method is compared with SVM with Composite Kernel (SVM-CK) [70], SADL [40], Simultaneous Orthogonal Matching Pursuit (SOMP) [70], Learning Sparse-Representation-based Classification with Kernel-smoothed regularization (LSRC-K) [26], MPM-LBP and SMLR-SpATV. The experimental results are given in Table 6, where some results come from the related references (parameters are given in the fourth column of Table 2). The proposed method achieves significant gains in the OA and κ and high accuracy on the meadows class. We can conclude that SLN makes full use of the limited spatial information, which further demonstrates its advantage. In the last experiment on ROSIS data, our method is evaluated on the Center of Pavia dataset and compared with the state-of-the-art methods mentioned above. There are 5536 samples for training and the rest for testing, where the training set is again provided by Prof. P. Gamba. In the proposed method, ρ = 1,000,000. In this dataset, the training samples are relatively centralized in their spatial distribution, so less discriminative position information can be exploited. Experimental results in Table 7 show that SLN performs the best in terms of the OA and κ (parameters are shown in Table 2). This confirms that SSR is an effective strategy to learn spectral-spatial features.
Finally, the full classification maps of the methods listed in Table 7 are illustrated in Figure 15. From visual inspection of the maps, we find that the proposed SLN outperforms the other methods because its resulting classification map is smoother (with reduced salt-and-pepper classification noise). We also note that LBP-ELM not only obtains lower accuracy, but also suffers from a serious oversmoothing problem.

Experiments with the Kennedy Space Center Dataset
First, we randomly choose 25 samples from each class as training samples, and the remaining samples compose the test set. In the proposed method, ρ = 10,000. Experimental results in Table 8 show that SLN performs the best. Note that SVM performs poorly on this dataset. This experiment shows that SLN can achieve a high classification accuracy even when the training samples are relatively centralized in their spatial distribution and less discriminative position information can be used; the same conclusion can be drawn as on the Center of Pavia dataset. The p-values in Table 9 also demonstrate that the improvement is significant. Second, the full classification maps of the methods in Table 8 are given in Figure 16, where the advantages of the proposed framework can be visually appreciated. Although LBP-ELM obtains a high OA, AA and κ (see Table 8), its classification map is oversmoothed. Figure 16f shows that MH-KELM performs poorly on the water class. Finally, we perform an experiment to examine how the number of training samples affects the results of our method compared to the other methods. Here, we randomly choose 5 to 25 labeled samples from each class to form the training set, and the remainder forms the testing set. The experimental results, averaged over 10 runs, are given in Figure 17. Again, the proposed SLN outperforms the other methods with a small number of training samples.
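The per-class random split used throughout these experiments can be sketched as follows; the function name `split_per_class` and the toy labels are illustrative, not from the paper's code:

```python
import numpy as np

def split_per_class(labels, n_train, rng=None):
    """Randomly pick n_train samples per class for training; rest for testing.
    `labels` is a 1-D array of class ids for all labeled pixels."""
    rng = np.random.default_rng(rng)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Toy labels: 3 classes with 100 labeled pixels each.
labels = np.repeat([0, 1, 2], 100)
tr, te = split_per_class(labels, n_train=25, rng=0)
print(len(tr), len(te))  # 75 training samples (25 per class), 225 for testing
```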

Discussion
To further analyze the proposed SLN, we conduct additional experiments. In this section, the experimental results on the Indian Pines dataset are reported; the same conclusions can be drawn on the other datasets.
First, comparisons with the multi-class SVM and the soft-max classifier are given in Table 10. As shown in the table, KELM achieves higher classification accuracy and is fast; consequently, KELM is used in the final stage of SLN. However, the SVM can obtain comparable results. Second, we show the effect of the depth on the classification accuracy. A series of SLNs with different depths was trained, and their experimental results are shown in Figure 18. It can be observed that more layers usually lead to a higher classification accuracy. However, this does not mean that deeper is always better. These results help us determine how many layers are needed to obtain a high classification accuracy. The number of layers is related to the dataset, and determining it remains an open problem; in this paper, it was determined experimentally. Third, we compare SLN with the recently-proposed deep learning-based methods [49,71]. The experimental results are reported in Table 11, where SSDCNN is the Spectral-Spatial Deep Convolutional Neural Network, Deep_o is deep features with orthogonal matching pursuit and SSDL is Low Spatial Sampling Distance. From the results, we can see that the classification accuracies of the proposed method in terms of the OA and κ coefficient are higher than those of the other deep learning-based classification methods. We also note that 3D-CNN-LR has higher sample complexity.
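For reference, a minimal KELM classifier can be written in a few lines, since its training reduces to a closed-form kernel ridge solution rather than iterative optimization. The sketch below follows the standard KELM formulation; the RBF kernel, regularization constant and toy data are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Pairwise squared Euclidean distances, then a Gaussian kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class KELM:
    """Kernel Extreme Learning Machine: closed-form ridge solution in
    kernel space, beta = (Omega + I/C)^{-1} T, with one-hot targets T."""
    def __init__(self, C=100.0, gamma=1.0):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X = X
        T = np.eye(int(y.max()) + 1)[y]          # one-hot target matrix
        Omega = rbf_kernel(X, X, self.gamma)     # n x n kernel matrix
        self.beta = np.linalg.solve(Omega + np.eye(len(X)) / self.C, T)
        return self

    def predict(self, Z):
        K = rbf_kernel(Z, self.X, self.gamma)    # test-vs-train kernel
        return K.dot(self.beta).argmax(axis=1)

# Toy two-class problem: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
y = np.repeat([0, 1], 50)
clf = KELM(C=100.0, gamma=1.0).fit(X, y)
print((clf.predict(X) == y).mean())
```

The absence of iterative training is what makes KELM fast relative to SVM in Table 10.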
Finally, we show the effect of different spatial filters in the proposed SLN. Here, the Gabor filter (with 12 orientations) was used as the baseline. The experimental results are given in Figure 19. These results show that the learned filters achieve better classification accuracy in most cases. However, the SLN using learned filters performs worse than that using predefined filters when only 2% of the labeled samples are used for training. The reason may be that PCA cannot learn effective filters from a small number of training samples. Nevertheless, both variants still outperform the other compared methods shown in Figure 10, which demonstrates the effectiveness of the proposed framework.
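A 12-orientation Gabor filter bank of the kind used as the baseline can be sketched as follows; the kernel size, wavelength and envelope width are illustrative values, not the exact settings of the paper:

```python
import numpy as np

def gabor_kernel(size, theta, lam, sigma, gamma=0.5, psi=0.0):
    """Real (cosine) Gabor kernel with the usual parameterization:
    theta = orientation, lam = wavelength, sigma = envelope width,
    gamma = spatial aspect ratio, psi = phase offset."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    env = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    return env * np.cos(2 * np.pi * xr / lam + psi)

# A bank of 12 orientations evenly spaced over [0, pi).
thetas = np.arange(12) * np.pi / 12
bank = [gabor_kernel(size=15, theta=t, lam=8.0, sigma=4.0) for t in thetas]
print(len(bank), bank[0].shape)  # 12 kernels of shape (15, 15)
```

Each feature map would then be convolved with every kernel in the bank, in the same way the learned spatial templates are matched against local regions.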

Conclusions and Future Work
In this paper, a novel framework, called SSR, has been proposed for HSI classification. A main contribution of the presented framework is the use of a hierarchical deep architecture to learn joint spectral-spatial features, which are well suited to the nature of HSIs. It uses a new strategy to learn the spectral-spatial features of HSIs. Other deep learning-based HSI classification methods usually need to learn more parameters and have a higher sample complexity; SSR can overcome these problems. In addition, SSR is designed according to the characteristics of the HSI, rather than directly applying conventional deep learning models as some other methods do. Consequently, it can jointly learn spectral and spatial features of HSIs. Furthermore, a hierarchical spectral-spatial-based HSI classification method called SLN has been presented as an implementation example of SSR. SLN uses templates learned directly from HSIs to learn discriminative features. In SLN, the discriminative spectral-spatial features on each scale are learned by MFA and PCA, and KELM is embedded into SLN as the final classifier. Extensive experiments on four HSI datasets have validated the effectiveness of SLN. The experimental results also show that the hierarchical spectral-spatial feature learning is useful for classification, and SLN is promising for the classification of HSIs with a small number of training samples. The experimental results of SLN also verify the effectiveness of the proposed SSR. Our future research will follow three directions:

1. In SSR, the spectral-spatial features are jointly learned through template matching. Different template sets may lead to different features. Consequently, it is interesting to design new template learning methods.

2. As shown in Section 3, SSR exploits the spatial information by matching each feature map using two-dimensional templates. This operation leads to high-dimensional features. We will study a tensor SSR that replaces the two-dimensional templates with tensor-like templates [53,74].

3. SLN achieves a high classification accuracy by stacking joint spectral-spatial feature learning units. It is interesting to mathematically analyze and justify its effectiveness according to SSR.

Figure 2 .
Figure 2. Computation procedure of the first layer Spectral-Spatial Response (SSR).
M_x = max(I(:)), M_n = min(I(:)) and I_ij(m) is the m-th band of pixel I_ij. The normalized HSI is denoted as Ī.

1. Assume that there are |T_1| templates and T_1 ∈ R^(d×|T_1|) is the template set. The spectral templates are the |T_1| eigenvectors corresponding to the largest eigenvalues of the generalized eigenproblem I_tr L I_tr^T t_1 = λ I_tr B I_tr^T t_1.
(Algorithm 1 fragment: extract the region for the i-th training pixel on the j-th map, vectorize it as x_ij, append it as X = [X x_ij], and learn the template set T_l by PCA.)
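The generalized eigenproblem above can be solved directly with SciPy's `eigh`. The sketch below is structural only: the graph Laplacians L and B are random stand-ins for the MFA intrinsic and penalty graphs, and the small ridge keeps the right-hand matrix positive definite:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, n, k = 20, 50, 5                  # bands, training pixels, templates
X = rng.normal(size=(d, n))          # training pixels as columns (I_tr)

# Random symmetric affinities -> graph Laplacians (stand-ins for MFA graphs).
W = rng.random((n, n))
W = (W + W.T) / 2
L = np.diag(W.sum(1)) - W            # "intrinsic" Laplacian
Wp = rng.random((n, n))
Wp = (Wp + Wp.T) / 2
B = np.diag(Wp.sum(1)) - Wp          # "penalty" Laplacian

A_mat = X @ L @ X.T
B_mat = X @ B @ X.T + 1e-6 * np.eye(d)   # ridge: keep B_mat positive definite
vals, vecs = eigh(A_mat, B_mat)          # solves A t = lambda B t, ascending
T1 = vecs[:, -k:]                        # k eigenvectors with largest eigenvalues
print(T1.shape)  # (20, 5): template set T_1 in R^{d x |T_1|}
```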

Algorithm 2. SLN: test procedure. Require: Ī ∈ R^(m×n×d). (Fragment: compute the response of the pixel in the i-th row and j-th column using the learned T_l.)

Figure 5.
Figure 5. (a) False color composition of the AVIRIS Indian Pines scene; (b) reference map containing 16 mutually-exclusive land cover classes. • The second dataset was gathered by the ROSIS sensor over Pavia, Italy. This image has 610 × 340 pixels (covering the wavelength range from 0.4 to 0.9 µm) and 115 bands. In our experiments, 12 bands are removed due to noise, so 103 bands are retained. There are nine ground-truth classes with 43,923 labeled samples in total. Figure 6 shows a three-band false color image and the ground-truth map.

Figure 10 .
Figure 10. OAs of different methods under different numbers of training samples.

Figure 11 .
Figure 11. Resulting features of different steps. (a) Features encoded by T_1; (b) features encoded by T̃_1; (c) features encoded by T_2; (d) features encoded by T̃_2; (e) features encoded by T_3; (f) features encoded by T̃_3; (g) features encoded by T_4; (h) features encoded by T̃_4; (i) features encoded by T_5; (j) features encoded by T̃_5.

Figure 13 .
Figure 13. OAs of different methods under different numbers of training samples.

Figure 14 .
Figure 14. The training and testing sets used in our experiments. (a) Training set; (b) testing set.

Figure 17 .
Figure 17. OAs of different methods under different numbers of training samples.

Figure 18 .
Figure 18. Effect of different depths on OAs on the Indian Pines dataset.

Figure 19 .
Figure 19. OAs of SLN using different spatial filters under different numbers of training samples.
After that, we concatenate the obtained feature cube and the normalized HSI into a new feature cube, where the height of the new feature cube is |T_1| × |T̃_1| + d. Similarly, higher layer features can be obtained by extending this architecture to n layers. Consequently, SLN can extract deep hierarchical features.
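The concatenation step can be sketched with NumPy; the spatial size, band count and template counts below are arbitrary illustrative values:

```python
import numpy as np

m, n, d = 10, 12, 20          # hypothetical spatial size and band count
n_spec, n_spat = 4, 3         # illustrative |T_1| spectral and |T~_1| spatial templates
hsi = np.random.rand(m, n, d)                       # normalized HSI
responses = np.random.rand(m, n, n_spec * n_spat)   # layer's response maps

# Stack the response cube and the normalized HSI along the band axis,
# giving n_spec * n_spat + d bands per pixel.
new_cube = np.concatenate([responses, hsi], axis=2)
print(new_cube.shape)  # (10, 12, 32): 4 * 3 + 20 bands
```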

Table 1 .
Class-specific classification accuracies (in percentage), OA (in percentage), AA (in percentage) and kappa coefficient for the AVIRIS Indian Pines dataset. (The best results are highlighted in bold typeface.)

Table 2 .
Summary of the parameters on different datasets.

Table 3 .
p-values corresponding to different methods for the Indian Pines dataset.

Table 4 .
Class-specific classification accuracies (in percentage), OA (in percentage), AA (in percentage) and kappa coefficient for the Pavia University dataset. (The best results are highlighted in bold typeface.)

Table 5 .
p-values corresponding to different methods for the University of Pavia dataset.

Table 6 .
Class-specific classification accuracies (in percentage), OA (in percentage), AA (in percentage) and kappa coefficient for the Pavia University dataset with the fixed training and testing set. CK, Composite Kernel; SOMP, Simultaneous Orthogonal Matching Pursuit; LSRC-K, Learning Sparse-Representation-based Classification with Kernel-smoothed regularization; SMLR-SpATV, Sparse Multinomial Logistic Regression and Spatially-Adaptive Total Variation. (The best results are highlighted in bold typeface.)

Table 7 .
Class-specific classification accuracies (in percentage), OA (in percentage), AA (in percentage) and kappa coefficient for the Center of Pavia dataset with the fixed training and testing set. (The best results are highlighted in bold typeface.)

Table 9 .
p-values corresponding to different methods for the KSC dataset.

Table 10 .
Comparisons of different classifiers on the Indian Pines dataset.

Table 11 .
Comparisons of different deep learning-based methods on the Indian Pines dataset. (The best results are highlighted in bold typeface.)