Spatial-Aware Network for Hyperspectral Image Classification

Abstract: Deep learning is now receiving widespread attention in hyperspectral image (HSI) classification. However, due to the imbalance between the huge number of weights and the limited number of training samples, many problems and difficulties have arisen when deep learning methods are applied to HSI classification. To handle this issue, an efficient deep learning-based HSI classification method, namely the spatial-aware network (SANet), is proposed in this paper. The main idea of SANet is to exploit discriminative spectral-spatial features by incorporating prior domain knowledge into the deep architecture, where edge-preserving side window filters are used as the convolution kernels. Thus, SANet has a small number of parameters to optimize, which makes it fit for small sample sizes. Furthermore, SANet is able not only to perceive local spatial structures using the side window filtering framework, but also to learn discriminative features by making use of the hierarchical architecture and limited label information. The experimental results on four widely used HSI data sets demonstrate that the proposed SANet significantly outperforms many state-of-the-art approaches when only a small number of training samples are available.


Introduction
Hyperspectral images (HSIs) usually contain abundant spectral information [1][2][3][4]. Such rich spectral information makes it possible to distinguish subtle spectral differences between different materials. Consequently, HSI classification has been widely used in a variety of applications, including environmental management [5], agriculture [6], and surveillance [7]. Over the past few decades, a large number of methods have been proposed to predict the class label of each pixel in HSI. However, HSI classification is still a challenging task [4,[8][9][10][11][12][13][14][15]. One of the most challenging issues is the limited number of available training samples, because it is difficult and time-consuming to collect a large number of training samples [4,16].
The early research on HSI classification mainly focuses on exploring spectral features directly from the spectrum, such as methods based on dimensionality reduction [17,18] and band selection [19]. However, due to the spatial variability and spectral heterogeneity of land covers, spectral-only methods usually fail to provide satisfactory performance. It is well known that neighboring pixels in HSI are highly correlated [20]. As a result, a lot of effort has been dedicated to integrating the spatial and spectral information for HSI classification [21]. In the past few years, a wide variety of spectral-spatial methods have been developed [20,22,23]. For example, extended morphological profiles (EMPs) are proposed to integrate spectral and spatial information in HSI [24]. In addition, a discriminative low-rank Gabor filtering method is also used to extract spatial-spectral features of HSI [25]. In [26], edge-preserving filtering (EPF) is used for obtaining spectral-spatial features of HSI for the first time. Random field techniques have also been widely used for incorporating spatial information into HSI classification, such as Markov random fields (MRFs) [27] and conditional random fields (CRFs) [28]. Moreover, sparsity has been used as a constraint to extract spectral-spatial features in different ways [29][30][31][32][33][34][35][36][37]. Recently, segmentation-based strategies have been used to produce more spatially homogeneous classification maps [38][39][40][41]. Zehtabian et al. propose an approach for the development of automatic object-based techniques for HSI classification [41]. Zheng et al. design a spectral-spatial HSI classification method based on superpixel segmentation and a distance-weighted linear regression classifier to tackle the small labeled training sample size problem [40].
Although these methods can extract spectral-spatial features for HSI classification, low-level features are more sensitive to local changes occurring in HSI, especially in the case of small training samples [42].
The past few years have witnessed a surge of interest in deep learning [2,43]. Motivated by its success in natural image processing, a growing number of deep learning methods have been designed for HSI classification [4,44,45]. In the early stages, stacked autoencoders (SAEs) and deep belief networks (DBNs) were used for HSI classification in [46,47], respectively. This research attempts to intuitively feed a vector-based input into unsupervised deep learning models. However, this strategy suffers from spatial information loss. In order to address this issue, convolutional neural networks (CNNs) are used to learn effective spectral-spatial features residing in HSI [4]. For instance, Cao et al. adopted the CNN to extract deep spatial features [48]. In [49], a 3D CNN is also adopted to extract spectral-spatial features. Furthermore, a 3D generative adversarial network (3D-GAN) has been applied to HSI classification [50], where adversarial samples are used as training samples. Although remarkable progress has been made in deep learning-based HSI classification, there still exist some important issues that should be dealt with [4,12]. One of the most important is the imbalance between a large number of weights and a small number of available samples [4,12,51]. That is, the training samples are usually very limited, while deep learning models usually require a large number of samples to train [4,11,21]. Although some methods have been designed to deal with this problem, it is still an open problem. For example, RPNet has been proposed in [52], but the discrimination ability of the random patches is not guaranteed.
As is well known, one of the essential ideas of deep learning is to learn effective features using a hierarchical architecture [11,21]. Furthermore, existing research shows that incorporating prior domain knowledge into deep learning models can promote their performance and reduce sample complexity [53]. Consequently, designing a deep learning method that incorporates the prior knowledge of HSI is a feasible way to promote classification performance. In HSI classification, structure preservation is one kind of well-known prior information, which has been widely used to extract spatial information through filtering [26,[54][55][56]. However, traditional smoothing filters, whose centers are aligned with the pixels being processed, usually lead to edge blurring and loss of spatial information [57]. Recently, side window filtering (SWF), which aligns the window's side or corner with the pixel being processed, has been proposed to handle the edge blurring problem of traditional filters [57]. As a result, SWF may outperform traditional smoothing filters in HSI analysis and classification.
Based on the discussion above, this paper presents a deep learning model called the spatial-aware network (SANet) for HSI classification. The proposed method can learn spectral-spatial features with a small amount of training samples. In summary, the major contributions of the proposed SANet are twofold.

• This paper incorporates, for the first time in the literature, a side window filtering framework into a deep architecture for HSI classification. We utilize SWF to effectively discover the spatial structure information in HSIs.
• This paper proposes an effective deep learning method with only a very small number of parameters that need to be determined. Thus, the proposed method can efficiently learn spectral-spatial features by using a small number of training samples.
This paper is organized in the following manner: Section 2 gives a detailed description of the SANet for HSI classification. Section 3 validates the effectiveness of the proposed SANet on four typical data sets, and a comprehensive comparison with several state-of-the-art methods is also presented. Finally, conclusions are drawn in Section 4.

Methodology
2.1. Spatial-Aware Network for Feature Learning

Figure 1 details the overall flowchart of the proposed SANet, where only six hidden layers are shown for clarity. In contrast to traditional deep learning methods, which adopt an end-to-end approach to learning, there is no back propagation during the implementation of SANet. Instead, by using predefined convolutional filters (side window filters), SANet can extract deep features efficiently. By convolving side window filters and integrating them into a deep hierarchical architecture to form SANet, we not only retain the characteristics of side window filtering but also exploit the powerful feature learning ability of deep learning. As shown in Figure 1, SANet has a hierarchical architecture that includes the F layer, P layer, and R layer: the F layer combines its inputs with different filters to extract spatial features, the P layer outputs the structure-preserving responses, and the R layer processes its inputs with a supervised dimension reduction method, thereby increasing discriminability and reducing redundancy. These three layers form one spectral-spatial feature learning unit. The detailed descriptions are as follows.

F_1 Layer: Let X ∈ R^{m×n×b} be the original HSI, where m, n, and b are the row number, column number, and band number, respectively. Let X_i represent the ith band of the input HSI, and X_i(x, y) the reflectance value at position (x, y). X_i can be filtered using different side windows with different radii. Figure 2 shows the definition of a continuous side window, where (x, y) is the position of the pixel being processed, θ denotes the window orientation, r represents the radius of the window (which can be predefined by users), and ρ ∈ [0, r]. By changing θ and ρ, different side windows can be obtained, where the pixel being processed lies on the side or corner of the window.
In practice, only a limited number of side windows can be used; Figure 3 shows eight of them. These side windows correspond to θ = kπ/2 (k ∈ {0, 1, 2, 3}) and ρ ∈ {0, r}. Letting ρ = r, we obtain four side windows, which are shown in Figure 3a,b. If we set ρ = 0, we obtain another four side windows, which can be found in Figure 3c. In this paper, the eight side windows shown in Figure 3 are used for exploring the spatial information in HSI. Note that more side windows with different sizes, shapes, and orientations can be designed by changing r, θ, and ρ.

Note that the side window technique can be embedded into a wide variety of filters, such as the Gaussian filter, median filter, bilateral filter, and guided filter. In this paper, the box filter is used for simplicity. Then, the output on the ith band is given by

X_{i,j,d}^{F_1} = F(X_i, r_j, d),  d ∈ {L, R, U, D, SW, SE, NE, NW},

where F stands for the filtering operation, r_j is the radius of the window with j = 1, 2, ..., r̄ (r̄ is the total number of radii), and d indexes the eight side windows denoted in Figure 3. Here, different r_j correspond to different scales. On each scale, the eight side windows have been used in this paper, so we obtain eight feature maps per scale. Note that only one scale is used in Figure 1 for illustration purposes. Figure 4 shows the multiscale filtering on the ith band. Carrying out this operation on all of the bands, the total number of feature maps is N = r̄ × b × 8. Finally, the output of this layer is denoted by X^{F_1} ∈ R^{m×n×N}.

P_1 Layer: In this layer, MIN pooling is operated over the different maps belonging to the same scale. That is, we keep only the minimum output over the eight feature maps belonging to the same scale at each position:

P_{i,j}^{P_1}(x, y) = min_d X_{i,j,d}^{F_1}(x, y),

where (x, y) indexes the xth row and yth column. Hence, one scale band is formed from the eight feature maps with the same scale in the output of the F_1 layer (see Figure 5). Finally, we obtain the feature map P^{P_1} ∈ R^{m×n×br̄}.
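As a concrete illustration, the F_1 and P_1 layers described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names `side_window_masks` and `f_and_p_layer` are ours, the box filter is realized as a correlation with a normalized binary mask, and the boundary handling (reflect padding) is an assumption.

```python
import numpy as np
from scipy.ndimage import correlate

def side_window_masks(r):
    """Eight normalized box-filter masks of size (2r+1, 2r+1), one per
    side window (L, R, U, D, NW, NE, SW, SE). In each mask, the pixel
    being processed sits on a side or corner of the averaging window."""
    k, c = 2 * r + 1, r
    masks = {d: np.zeros((k, k)) for d in
             ('L', 'R', 'U', 'D', 'NW', 'NE', 'SW', 'SE')}
    masks['L'][:, :c + 1] = 1        # left half window
    masks['R'][:, c:] = 1            # right half window
    masks['U'][:c + 1, :] = 1        # upper half window
    masks['D'][c:, :] = 1            # lower half window
    masks['NW'][:c + 1, :c + 1] = 1  # quarter windows (corners)
    masks['NE'][:c + 1, c:] = 1
    masks['SW'][c:, :c + 1] = 1
    masks['SE'][c:, c:] = 1
    return {d: m / m.sum() for d, m in masks.items()}

def f_and_p_layer(band, radii):
    """F layer (eight side-window box filters per radius r_j) followed
    by P layer (pixel-wise MIN pooling over the eight responses)."""
    pooled = []
    for r in radii:
        responses = [correlate(band, m, mode='reflect')
                     for m in side_window_masks(r).values()]
        pooled.append(np.min(responses, axis=0))  # MIN pooling per scale
    return np.stack(pooled, axis=-1)              # (m, n, len(radii))
```

Applying `f_and_p_layer` to every band of X ∈ R^{m×n×b} and stacking the results along the last axis would yield the P^{P_1} cube described above.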
R_1 Layer: In this layer, feature fusion is carried out to reduce the redundant information residing in the high-dimensional P^{P_1}; thus, this layer is also called the feature fusion layer. This feature fusion operation can not only reduce redundancy but also speed up the subsequent feature learning. In theory, any feature fusion method can be used in this layer. For simplicity, linear discriminant analysis (LDA) is adopted in this paper. It can not only reduce the dimension of the data but also increase the discriminability of the learned spectral-spatial features. Therefore, the stacked feature maps in P^{P_1} are fused as

Y^{R_1}(x, y) = A^T p(x, y),

where p(x, y) ∈ R^{br̄} is the stacked feature vector at position (x, y), and A is the LDA projection matrix learned from the training samples, consisting of S projection directions. Finally, the output of this layer is denoted by Y^{R_1} ∈ R^{m×n×S}.

Feature Learning in Deep Layers: Let Y^{R_{h-1}} denote the output of the R_{h-1} layer. To learn spectral-spatial features in higher layers, we take Y^{R_{h-1}} as the input data and perform the same operations as in the F_1, P_1, and R_1 layers. In this way, we obtain spectral-spatial features from different layers as {Y^{R_1}, Y^{R_2}, ..., Y^{R_h}}. In this paper, F_i, P_i, and R_i, which are carried out successively, form the ith spectral-spatial feature learning unit. The number of feature learning units represents the depth of the proposed deep model. Consequently, we can learn deep spectral-spatial features with increasing depth.
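The R layer can be sketched with scikit-learn's LDA. This is an illustrative sketch, assuming our own hypothetical helper name `r_layer`; the projection matrix A is learned only from the labelled training pixels and then applied to every pixel of the cube.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def r_layer(feature_cube, train_idx, train_labels, n_components):
    """R layer sketch: fuse the stacked feature maps by LDA, reducing
    an (m, n, B) cube to (m, n, S) with S = n_components. The
    projection is fit on the labelled training pixels only."""
    m, n, B = feature_cube.shape
    flat = feature_cube.reshape(-1, B)      # one row per pixel
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    lda.fit(flat[train_idx], train_labels)  # learn projection matrix A
    return lda.transform(flat).reshape(m, n, n_components)
```

Note that LDA allows at most (number of classes − 1) projection directions, which bounds the choice of S.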

Classification Layer
In this layer, the learned spectral-spatial features are fed into classifiers. Previous studies have shown that different layers of a deep model extract different levels of spectral-spatial features [58]. Low-level features always contain more detailed information, while high-level features are more invariant. All of these features are very important for HSI classification, so it is reasonable to use both high-level and low-level features for robust HSI classification. Based on this observation, {Y^{R_1}, Y^{R_2}, ..., Y^{R_L}} are concatenated for HSI classification, where L is the number of spectral-spatial feature learning units. Finally, these concatenated spectral-spatial features are fed into an SVM for classification.
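The classification layer then amounts to concatenating the unit outputs and training an SVM on the labelled pixels. A minimal sketch follows; the function name and the RBF kernel settings are illustrative assumptions, not the hyperparameters reported in the paper.

```python
import numpy as np
from sklearn.svm import SVC

def classify(deep_features, train_idx, train_labels):
    """Concatenate {Y^{R_1}, ..., Y^{R_L}} along the feature axis and
    classify every pixel with an SVM trained on the labelled pixels."""
    stacked = np.concatenate(deep_features, axis=-1)  # (m, n, L*S)
    m, n, d = stacked.shape
    flat = stacked.reshape(-1, d)
    svm = SVC(kernel='rbf', gamma='scale')  # settings are illustrative
    svm.fit(flat[train_idx], train_labels)
    return svm.predict(flat).reshape(m, n)  # full classification map
```

Because the SVM predicts every pixel, the output can be rendered directly as a full classification map for visual comparison.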
The pseudocode of SANet is detailed in Algorithm 1, where T_r consists of the indexes and labels of the training samples. For convenience, the original input HSI is denoted by Y^{R_0}. As can be seen, the proposed SANet is easy to implement.

Algorithm 1 SANet
Input: original HSI Y^{R_0} = X, training set T_r, depth L, radii {r_1, ..., r_r̄}
1: for l = 1 : L do
2:   for j = 1 : r̄ do
3:     Filtering on the F_l layer with radius r_j and the eight side windows, obtain X^{F_l}
4:   end for
5:   for j = 1 : r̄ do
6:     MIN pooling on each scale of X^{F_l}, obtain P^{P_l}
7:   end for
8:   Feature fusion on the R_l layer, obtain Y^{R_l} ∈ R^{m×n×S}
9: end for
10: Concatenate {Y^{R_1}, Y^{R_2}, ..., Y^{R_L}}
11: Use SVM for classification
In summary, the proposed method is not only simple but also effective in making use of the spectral-spatial information. In SANet, label information is used in each feature fusion layer, so the proposed method has high discriminability. In addition, the proposed method has a small number of parameters to be determined. Consequently, the proposed method is fit for HSI classification with a small number of training samples.

Experimental Results and Analysis
In this section, experimental results on four real HSI data sets are given to comprehensively demonstrate the effectiveness of SANet. For comparison, we also present experimental results of several state-of-the-art methods, including the hybrid spectral convolutional neural network (HybridSN) [59], the convolutional neural network with Markov random fields (CNN-MRF) [48], local covariance matrix representation (LCMR) [51], the superpixel-guided training sample enlargement and distance-weighted linear regression-based method (STSE_DWLR) [40], and the random patches network (RPNet) [52], where CNN-MRF adopts a data augmentation strategy. Three commonly used performance indexes, namely overall accuracy (OA), average accuracy (AA), and the κ coefficient [60], are used to evaluate the performance of the different methods. All experiments were repeated 10 times, and the average OAs, AAs, and κ coefficients are reported. Furthermore, full classification maps of the different methods are also given for visual analysis.

Data Sets
In our experiments, four well-known publicly available data sets, including Indian Pines, Pavia University, Salinas, and Kennedy Space Center, have been used to verify the effectiveness of SANet. Figure 6 shows a brief summary of these HSIs. The detailed information about them is given as follows.

Indian Pines
This image was gathered by the AVIRIS sensor over the Indian Pines test site in northwestern Indiana. This image comprises 145 × 145 pixels, with a total of 224 spectral bands [61]. In addition, the spatial resolution is 20 meters per pixel (mpp). In our experiments, 200 spectral channels are retained by removing four null bands and 20 corrupted bands [62]. The name and quantity of each class are reported in the first column of Figure 6, where 10,249 samples contain ground truth information, and they belong to 16 different classes. HSI classification on this data set is challenging, due to the presence of mixed pixels and imbalanced class distribution [63].

Pavia University
The second image was recorded by the ROSIS sensor during a flight campaign over Pavia, northern Italy. The size of it is 610 × 340 × 115, and its spatial resolution is 1.3 mpp.
In our experiments, 103 out of the 115 bands are kept after having removed 12 noisy bands. The total number of the labeled pixels is 42,776. In addition, these labeled pixels belong to 9 land-cover classes (see the second column of Figure 6).

Salinas
The third HSI used in the experiments was acquired by the 224-band AVIRIS sensor over Salinas Valley, California. This image contains 512 × 217 pixels with a spatial resolution of 3.7 mpp. After discarding the water absorption bands and noisy bands, 204 bands have been used in the experiments. A total of 54,129 pixels labeled in 16 classes are used for classification. Finally, the false color image and ground truth map are presented in the third column of Figure 6.

KSC
The last data set was also acquired by AVIRIS, but over the Kennedy Space Center (KSC), Florida, in 1996. After removing water absorption and low signal-to-noise ratio channels, 176 of the 224 bands were used in our experiments. There are 13 land cover classes with 5211 labeled pixels. A three-band false color image and the ground-truth map are shown in Figure 6.

SANet Structure Analysis
The proposed SANet is a deep learning-based method that learns spectral-spatial features in a hierarchical way. Thus, the setting of the structure parameters of SANet plays an important role in feature learning. Here, we investigate how the depth and the number of scales in each layer influence the classification performance. In this section, only the experimental results on Indian Pines are given, since the same conclusions can be drawn on the other data sets. Here, only 2% of the labeled samples per class are used as training samples.

Depth Effect
The depth of representations is of central importance for deep learning methods. In the proposed method, the depth determines the abstraction level of the learned spectral-spatial features. Consequently, extensive experiments have been conducted to show the relationship between the number of spectral-spatial learning units and the classification performance. In these experiments, we change only the depth and fix the other parameters. Figures 7a and 8 show the quantitative results and the classification maps with different depths. It is obvious that performance improves as the depth increases. However, deeper is not always better, as can be found from the curves in Figure 7a: an overly deep network could lead to oversmoothing problems. Consequently, the number of spectral-spatial feature learning units, which defines the depth, is set to five for all data sets.

Scale Effect
The number of scales is also an important factor related to the classification performance of SANet, since it also controls the architecture of the network. In order to show how to determine the number of scales, we change only the scale number in each layer and fix the other parameters. As shown in Figure 7b, as the number of scales grows, the classification accuracies tend to rise first and then decline. We can see that SANet obtains the highest κ coefficient when the number of scales is 3. Figure 9 also gives the classification maps corresponding to different scales. It can be observed that SANet obtains a better map when r̄ is set to 3. According to these observations, we experimentally set the number of scales, r̄, to 3 for all data sets, with the radii of the side windows set to {3, 5, 7}. A large scale number usually incurs high computational cost; thus, setting r̄ to 3 is also a trade-off between accuracy and computational expense.

Comparison with State-of-the-Art Methods
In order to demonstrate the advantages of our SANet, we mainly consider the case of a small training set in this section.

Experimental Results on Indian Pines Data Set
The first experiment is conducted on the Indian Pines data set, where the number of training samples per class is set to 2% of the labeled samples, and the remaining samples are used for testing. Table 1 reports the quantitative results obtained by the proposed SANet and five state-of-the-art methods. Note that the best results are highlighted in bold font. We can observe that SANet achieves the best performance in terms of OA, AA, and κ. In addition, the proposed method obtains the best results on most of the classes. These experimental results also show that CNN-MRF achieves the lowest OA and κ coefficient among the six methods, mainly because CNN-MRF is directly trained with a small number of training samples, which may lead to overfitting problems. The OA, AA, and κ of the CNN-MRF approach are 77.39%, 69.10%, and 0.740, respectively, while the results obtained by the proposed method are 93.97%, 91.95%, and 0.931, respectively. This indicates that the average improvement is more than 15%. LCMR achieves better results than HybridSN, STSE_DWLR, CNN-MRF, and RPNet, since it is a handcrafted feature extraction method designed for a small training set. However, the performance of LCMR is still inferior to that of the proposed method. The main reason is that LCMR only considers spectral-spatial features on a single scale, while the proposed method integrates multiscale spectral-spatial features using a deep architecture. Apart from the quantitative comparisons, we also verify the effectiveness of the proposed method from a visual perspective. Figure 10 presents the full classification maps of the different methods, produced in one of the random experiments. We can easily observe that the proposed method obtains the best performance. HybridSN, STSE_DWLR, and CNN-MRF lead to oversmoothed classification maps, while LCMR and RPNet result in noisy classification maps.
Our method can not only preserve the structures in accordance with the false color image (see Figure 6) but also produce smoother results. The main reasons for its good performance are threefold. First, by using the side window filtering principle, the edges and boundaries of the HSI can be preserved while homogeneous regions are smoothed. Second, SANet can effectively exploit spectral-spatial features from different scales. Third, LDA is used to remove the redundancy residing in the data, so the proposed method can extract more discriminative features. Furthermore, we analyze the effect of the number of training samples. We use 2%, 4%, 6%, 8%, and 10% randomly selected samples for each class in the Indian Pines data set. We can observe from Figure 11 that the performance of both the compared methods and the proposed method improves as the size of the training set increases. It is also noteworthy that SANet performs the best, especially when the number of training samples is small. This implies that the proposed approach is more suitable for HSI classification, since labeled samples are often difficult and expensive to collect in practice. Finally, the running times of different methods are presented in Table 2. These results show that the proposed method has low computational complexity, while the other deep learning-based methods are time-consuming. We can draw the same conclusion on the other three data sets.

Experimental Results on the Pavia University Data Set

The second experiment is performed on the Pavia University data set (see Figure 6). In this case, 1% of the labeled samples are randomly selected to form the training set. Table 3 presents the quantitative results with respect to different metrics. We again conclude that SANet achieves the best performance compared with the other state-of-the-art spectral-spatial methods. As can be seen, except for the proposed method, the deep learning-based methods achieve lower accuracies because of the limited training samples.
Figure 12 visualizes the classification results of all the methods. We can observe in Figure 12b that STSE_DWLR leads to oversmoothing. In contrast, the proposed method can preserve the details of the HSI. Figure 13 shows the influence of the training sample size (ranging from 1% to 3% per class, with a step of 0.5%) on classification performance. Most of the compared methods achieve similar accuracies when the number of training samples is large enough. In contrast, the proposed method has obvious advantages in the case of small samples.

Experimental Results on the Salinas Data Set
The third experiment is performed on the Salinas data set. Similar conclusions can be made from Table 4 and Figure 14, where 1% of the labeled samples are randomly selected per class for training. These results also demonstrate that the proposed SANet delivers the best performance in terms of OA, AA, and κ. Figure 14 also shows that HybridSN and CNN-MRF yield oversmoothed classification maps, while STSE_DWLR, LCMR, and RPNet produce maps with much noise. Finally, Figure 15 shows the influence of the number of training samples (ranging from 1% to 3% per class, with a step of 0.5%) on the performance of the different methods. Similarly, we can conclude that the proposed method achieves the best results with limited training samples. Furthermore, we can also find that all the methods show comparable performance only when enough training samples are available. This is due to the fact that conventional deep learning methods usually require a large number of training samples to obtain optimal parameter values.

Experimental Results on the KSC Data Set
The fourth experiment is performed on the KSC data set. Similar conclusions can be made from Table 5 and Figure 16, where five labeled samples are randomly selected per class for training. These results also demonstrate that the proposed SANet delivers the best performance in terms of OA and κ. Figure 16 also shows that HybridSN and CNN-MRF yield oversmoothed classification maps, while LCMR and RPNet produce maps with much noise. Finally, Figure 17 shows the influence of the number of training samples (ranging from 5 to 25, with a step of 5 per class) on the performance of the different methods. Similarly, we can conclude that the proposed method achieves the best results with limited training samples. Furthermore, we can also find that all the methods show comparable performance only when enough training samples are available. The reason for this is that conventional deep learning methods have high sample complexity.

To sum up, the experimental results on four typical data sets show that the proposed approach provides better results than the other tested methods. The proposed method can alleviate the potential overfitting problems that deep learning-based methods usually suffer from when dealing with HSI classification.

Conclusions and Future Research
This paper develops an efficient deep spectral-spatial feature learning method for HSI classification. The proposed approach can obtain high accuracy with limited labeled samples by introducing the side window filtering principle into deep feature learning and by integrating the spatial and spectral information contained in the original HSIs. Our results also corroborate that incorporating prior domain knowledge into a deep architecture can deal with the small sample problem in HSI classification. Future work will focus on improving the performance of the proposed method from the viewpoint of filtering; for example, more effective filtering technologies could be applied within the side window filtering framework.