Hyperspectral Images Classification Based on Dense Convolutional Networks with Spectral-Wise Attention Mechanism

Abstract: Hyperspectral image (HSI) data, typically presented in a 3-D format, offer an opportunity for 3-D networks to extract spectral and spatial features simultaneously. In this paper, we propose a novel end-to-end 3-D dense convolutional network with a spectral-wise attention mechanism (MSDN-SA) for HSI classification. The proposed MSDN-SA exploits 3-D dilated convolutions to capture spectral and spatial features at different scales simultaneously, and densely connects all 3-D feature maps with each other. In addition, a spectral-wise attention mechanism is introduced to enhance the distinguishability of spectral features, which improves the classification performance of the trained models. Experimental results on three HSI datasets demonstrate that our MSDN-SA achieves competitive performance for HSI classification.


Introduction
Hyperspectral images (HSIs) have hundreds of continuous observation bands throughout the electromagnetic spectrum with high spectral resolution [1]. Based on such abundant spectral bands, HSIs have been widely used in various applications, including agriculture development [2], mineral resource exploitation [3], and environmental earth sciences [4]. Supervised land cover classification is one of the most significant topics in hyperspectral remote sensing. However, the redundancy of spectral band information combined with limited training samples [5,6] often poses a challenge to HSI classification.
Conventional HSI supervised classification models are often based on spectral information. Typical classifiers include those based on distance measures [7], k-nearest neighbors [8], the maximum likelihood criterion [9], and logistic regression [10]. To improve classification performance, Random Forests (RF) [11] and AdaBoost [12], both ensemble-learning (multiple-classifier) methods, have been found to be effective for HSI classification.
However, classification algorithms that exploit only spectral information fail to capture the important spatial variability perceived in high-resolution data. Furthermore, as HSIs are typically presented as 3-D cubes, it is reasonable to combine the abundant spectral features and spatial features in a complementary fashion to improve HSI classification performance. For example, spectral-spatial combined features can be extracted from the

Residual Connections and Dense Connectivity in CNNs
Previous research [33][34][35] has shown that utilizing multi-level features in CNNs through skip connections is effective for various vision tasks. Residual connections and dense connectivity are two different connectivity patterns; both have been used successfully in challenging natural image processing tasks. Here, we briefly introduce these two connection concepts.
In principle, a residual connection (Figure 1a) adds a skip connection that bypasses the nonlinear transformations with an identity mapping, so that the layers explicitly learn residual functions with reference to their inputs [19]. A residual connection can be formally expressed as:

x_l = h(x_{l−1}) + F(x_{l−1}), (1)

where x_{l−1} and x_l refer to the input and output of the l-th layer, respectively, h(x_{l−1}) = x_{l−1} is an identity mapping function, and F(·) represents a non-linear transformation, which can be a composite function of operations such as Convolution (Conv), Batch Normalization (BN) [36], Rectified Linear Units (ReLU) [37], or Pooling [38]. By using residual connections, the gradient can flow directly from later to earlier layers through the identity function [19].

In order to maximize the flow of information between network layers, Gao et al. [32] proposed dense connectivity, which connects any layer to all subsequent layers (Figure 1b). Differing from residual connections, which combine features through summation, dense connectivity combines features by concatenating them. Specifically, all previous feature maps x_0, . . . , x_{l−1} are used to compute the output of the l-th layer:

x_l = F([x_0, . . . , x_{l−1}]), (2)

where [x_0, . . . , x_{l−1}] is the concatenation of all previous feature maps. Each layer thus has direct access to the gradients from the loss function and to the original input signal, leading to an implicit deep supervision. Dense connectivity also allows features to be reused, while adding only a small set of feature maps to the network [32]. In addition, dense connectivity has a regularizing effect that reduces overfitting on tasks with small training sets. Therefore, dense connectivity can be beneficial for HSI classification, especially with small training data.
Remote Sens. 2018, 10, x FOR PEER REVIEW
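To make the two connectivity patterns concrete, the following NumPy sketch contrasts them on toy feature maps. This is an illustration only: the function F below is a random linear map followed by ReLU, standing in for a learned Conv-BN-ReLU block, and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x, out_channels=4):
    """Stand-in for a nonlinear transformation (e.g., Conv-BN-ReLU):
    a random linear map over channels followed by ReLU."""
    w = rng.standard_normal((x.shape[-1], out_channels))
    return np.maximum(x @ w, 0.0)

x0 = rng.standard_normal((8, 8, 4))  # toy 8x8 feature map with 4 channels

# Residual connection: features are combined by SUMMATION, so the
# output keeps the same channel count as the input.
x1_residual = x0 + F(x0)                   # x_l = h(x_{l-1}) + F(x_{l-1})

# Dense connectivity: the input of layer l is the CONCATENATION of all
# previous feature maps, so the input channel count grows with depth.
x1 = F(x0)
x2 = F(np.concatenate([x0, x1], axis=-1))  # x_l = F([x_0, ..., x_{l-1}])

print(x1_residual.shape)  # (8, 8, 4)
print(x2.shape)           # (8, 8, 4) -- but its input had 4 + 4 channels
```

The summation in the residual case constrains all layers to a common width, whereas concatenation lets each layer see every earlier feature map directly, which is what enables feature reuse with few kernels per layer.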


Attention Mechanism
Evidence from human perception process [39] demonstrates the importance of an attention mechanism, which usually uses top-level information to guide a bottom-up feedforward process. An attention mechanism can be viewed as a tool to bias the allocation of available processing resources towards the most informative components of an input signal.
Recently, researchers have applied attention mechanisms in deep neural networks. Generally, such work falls into two categories: spatial-based attention and channel-wise attention. Spatial-based attention mostly focuses on specific locations in a scene. Such mechanisms were the focus of early systems for image categorization [40], and were later shown to yield significant improvements for Visual Question Answering (VQA) and captioning [41][42][43]. Channel-wise attention is a complementary form of attention, which involves learning a task-specific modulation that is selectively applied to individual feature maps across the entire scene. Among them, the Squeeze-and-Excitation (SE) network (recognized as an ImageNet Large Scale Visual Recognition Competition (ILSVRC) winner) produces significant performance improvements for state-of-the-art deep architectures at only slightly greater computational cost. The SE block [44] is a novel architectural unit designed to improve the representational capacity of a network by enabling it to perform dynamic channel-wise feature recalibration.
On the basis of attention mechanisms, many deep neural network structures have been proposed and widely used in various applications [43][44][45][46][47]. For example, Fu et al. [45] introduced a recurrent attention CNN, which locates the discriminative region recurrently for fine-grained image recognition. Li et al. [46] introduced a global attention upsample module to guide the integration of low- and high-level features in semantic segmentation. Wang et al. [47] proposed a residual attention network, built with a trunk-and-mask [48] attention mechanism, to generate attention-aware features for image classification. It is worth mentioning that the SE network is a lightweight gating mechanism [44], specialized to model channel-wise relationships in a computationally efficient manner and designed to enhance the representational power of modules throughout the network.
In the case of HSI, hundreds of spectral bands are directly used as input data for convolution, which inevitably includes some noisy bands. Therefore, we are more concerned with the correlation of the spectral-wise features of an HSI, which are based on the 3-D feature maps. Inspired by the SE network, we propose a spectral-wise attention mechanism, which is discussed in detail in the next section.

Proposed Methods
In this section, we describe a novel 3-D network for HSI classification. There are two key components in our proposed method: a dense convolutional network with dilated convolutions, and a spectral-wise attention mechanism. The first component applies dilated convolutions [49] to 3-D patches of an HSI and uses dense connectivity to extract spectral-spatial features simultaneously; the dense connectivity serves to derive multi-level features in the network. In the second component, inspired by the successful application of attention mechanisms in deep neural networks, a spectral-wise attention mechanism is added to the proposed network. The proposed framework is illustrated schematically in Figure 2. Firstly, we extract the S × S × D neighborhood of the center pixel across all spectral bands, together with its corresponding category label, as a sample, where S × S denotes the neighborhood spatial size and D is the spectral depth. Once 3-D samples are extracted from an HSI, they are fed into the MSDN-SA model to obtain the classification results. There are seven convolutional layers in the MSDN-SA. In Figure 2, the colored lines in the convolutional layers represent 3 × 3 × 7 dilated convolutions, with each color representing a different channel through layers of different dilation. Note that a "channel" in this paper refers to a filter, such that the total number of "channels" stands for the dimensionality of the output space, i.e., the number of output filters in the convolution. Features are refined at each layer by a spectral-wise attention mechanism. Then, an average pooling layer and a fully connected (FC) layer follow. Finally, a soft-max activation function is used in the final output layer.
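The sample-extraction step described above can be sketched in NumPy as follows. The border handling (zero-padding, so edge pixels also yield full-sized cubes) is our assumption, since the text does not specify it.

```python
import numpy as np

def extract_patch(hsi, row, col, S):
    """Extract the S x S x D neighborhood of the pixel at (row, col).
    The image is zero-padded spatially so that every pixel, including
    those near the borders, yields a full-sized 3-D sample."""
    pad = S // 2
    padded = np.pad(hsi, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    return padded[row:row + S, col:col + S, :]

# Toy cube standing in for an HSI with 200 bands (spatial size reduced
# here purely for illustration).
hsi = np.random.rand(20, 20, 200)
patch = extract_patch(hsi, 0, 0, S=13)
print(patch.shape)  # (13, 13, 200) -- one 3-D training sample
```

Each such cube, paired with the label of its center pixel, forms one training or test sample for the MSDN-SA.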



Dense Convolutional Network with Dilated Convolution
To take advantage of the capability of 3-D spatial filtering, we propose a novel learning architecture with dense connectivity for automated classification from 3-D HSI. In the inspiring work of [32] the dense connectivity was used within layers at a single scale, with transition layers to acquire information at different scales. Differing from the dense convolutional network (DenseNet) as per [32], we combine dense connectivity between the multi-scale feature maps, enabling dense connectivity between the feature maps of the entire network. This facilitates more efficient use of all feature maps and an even larger reduction of the number of required parameters. The presented module uses 3-D dilated convolutions to systematically aggregate multi-scale contextual information without losing spatial resolution.
Instead of using transition layers to capture features at different scales, the proposed MSDN-SA uses dilated convolutions. A dilated convolution [49] W_{h,d} with dilation d ∈ Z+ uses a dilated kernel h that is nonzero only at distances that are a multiple of d pixels from the center. In the multi-scale approach, each individual channel of a feature map within a single layer operates at a different scale. Specifically, we associate the convolution operations for each channel of the output image of a certain layer with a different dilation. The setting of dilations is given in Section 4.2. Formally, the k-th feature map of the l-th layer, x_l^k, is a dilated convolution of the feature cubes of the previous layer with the k-th kernel of the l-th layer:

x_l^k = g(W_{h,d}^k, x_{l−1}), k = 1, . . . , c_l, (3)

where g(·) is the dilated convolution operation, x_l^k denotes the feature map of the l-th layer obtained by convolving with the k-th kernel, and c_l is the number of kernels in the l-th layer.
When using dilated convolutions, the multi-scale approach has an additional advantage over traditional scaling. All feature maps have the same number of rows and columns as the input and output images, for all layers; hence, when computing a feature map for a specific layer, we are not restricted to using only the output of the previous layer. Instead, we use all previously computed feature maps by densely connecting the network, as described by Equation (2). Thus, we change the dilated convolution operation of Equation (3) to one with dense connectivity:

x_l^k = g(W_{h,d}^k, [x_0, x_1, . . . , x_{l−1}]), k = 1, . . . , c_l, (4)

where x_i, i = 0, . . . , l − 1, are the feature maps of the previous layers.
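The two properties used above, that the kernel taps sit d pixels apart and that the output keeps the input's size regardless of d, can be illustrated with a pure-NumPy 1-D dilated convolution (our simplification of the paper's 3-D case):

```python
import numpy as np

def dilated_conv1d(x, h, d):
    """'Same'-size 1-D dilated convolution: kernel taps are placed at
    multiples of the dilation d from the center, so the receptive field
    grows with d while the output length matches the input length."""
    k = len(h) // 2
    out = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        for j, w in enumerate(h):
            t = i + (j - k) * d          # tap at distance (j - k) * d
            if 0 <= t < len(x):          # taps outside the signal are zero
                out[i] += w * x[t]
    return out

x = np.arange(16, dtype=float)
h = np.array([1.0, 1.0, 1.0])            # a 3-tap kernel
# Whatever the dilation, the output keeps the input's size, so feature
# maps at different scales can be concatenated for dense connectivity.
for d in (1, 2, 4):
    assert dilated_conv1d(x, h, d).shape == x.shape
```

Because no downsampling is involved, multi-scale context is aggregated without losing spatial resolution, which is exactly what makes Equation (4)'s concatenation across all previous layers well defined.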

Spectral-Wise Attention Mechanism
For HSI classification based on a 3-D convolutional network, hundreds of spectral bands are directly used as input data for convolution, which inevitably includes some noisy bands. To mitigate this problem, we use the SE block [44] to recalibrate spectral-wise feature responses by modelling interdependencies between spectral features. We model the interdependencies between spectral-wise features based on all the bands of each 3-D feature map, which we call a spectral-wise attention mechanism. This mechanism aims to selectively emphasize informative spectral features and suppress less useful ones. The basic structure of the spectral-wise attention mechanism is illustrated in Figure 3. Here, the neighborhood spatial size and spectral depth are denoted by S × S and D, respectively, such that the input feature map U ∈ R^{S×S×D}. Our starting point is the spectral-wise attention, denoted as F_spectral in our model, which yields the spectral feature attention vector v ∈ R^{1×1×D}.
It is worth mentioning that the attention mechanism is applied independently to the eight channels that are the outputs of the convolutional layers in our algorithm.
Following the definition of the channel-wise attention model, the spectral-wise attention model on each 3-D feature map can be informally described as follows. "Summary statistics" are calculated per-band, and then transformations are applied to first shrink and then expand the dimensionality of these statistics.
Formally, summary statistics are computed with a global average pooling applied to the individual spectral-wise feature channels U = [u_k], k = 1, . . . , D, yielding the vector P = (p_k), k = 1, . . . , D. A shrinking operation S_shrink ∈ R^{(D/r)×D} first compresses P into a lower-dimensional space, and an expansion operation S_expand ∈ R^{D×(D/r)} then maps it back to the original, higher-dimensional space:

v = σ(S_expand(δ(S_shrink(P)))), (5)

where δ refers to the ReLU function and σ refers to a sigmoid activation, with the "reduction ratio" r of the shrinking operation empirically set to 4. The final output of the spectral-wise attention, U′, is obtained by rescaling the transformation output U with the activations:

u′_k = v_k · u_k, k = 1, . . . , D. (6)

To illustrate the application of the spectral-wise attention mechanism to our dense network, Figure 4 depicts the schema of the proposed approach. The spectral-wise attention weight is applied after the 3-D dilated convolution operation, but before the connection operation. For each feature map, a global average pooling layer transforms the S × S × D-sized feature map into a 1 × 1 × D-sized feature vector, which corresponds to P = (p_k) in Equation (5). Next, a fully connected layer generates an output vector of size 1 × 1 × D/r; this shrinking operation corresponds to S_shrink in Equation (5). After applying the ReLU function, which corresponds to δ in Equation (5), an expansion operation is performed: a second fully connected layer generates an output vector of size 1 × 1 × D, corresponding to S_expand in Equation (5). Lastly, a sigmoid activation is employed, which corresponds to σ in Equation (5).
Figure 4. Dense network with spectral-wise attention mechanism.


Network Implementation Details
By combining the dense convolutional network and the spectral-wise attention mechanism, a new network is formed. Details of the layers of the proposed MSDN-SA are described in Table 1. The implementation of MSDN-SA is given as follows.

Table 1. Network architecture details of the proposed novel end-to-end 3-D dense convolutional network with spectral-wise attention mechanism (MSDN-SA) for the Indian Pines dataset.

[Table 1 columns: Layer | Kernel Size | Network Output Size]

Taking the Indian Pines dataset as an example, 3-D samples of size 13 × 13 × 200 are used as the input data. The MSDN-SA has seven layers. Each feature map is the result of applying the dense connection operation given by Equation (4) to all previous feature maps: 3-D dilated convolutions (3-D-DConv) with 3 × 3 × 7-pixel filters and a channel-specific dilation, followed by batch normalization. We denote this operation as DConv-BN. Following each dense connection layer, the spectral-wise attention mechanism is applied to each 3-D feature map and added in accordance with Figure 4. It is worth mentioning that "1 × 1 × 200, 8" denotes the eight attention weights obtained by eight independent channels. Finally, an average pooling layer and a fully connected (FC) layer transform a 5 × 5 × 48 spectral-spatial feature into a 1 × 1 × L output feature vector, where L represents the number of neurons. For the Indian Pines dataset, we select L = 360. Note that all layers in our MSDN-SA, including the convolutional and average pooling layers, are implemented in a 3-D manner. Therefore, when extracting features and making predictions, the MSDN-SA can completely retain and utilize the 3-D spectral-spatial information.
Network implementation details for other datasets are carried out in a similar manner and hence, are omitted.

Experimental Results
In order to evaluate the effectiveness of the proposed method, we tested it on three hyperspectral datasets. Class accuracy, overall accuracy (OA), average accuracy (AA), and the kappa coefficient (κ) were adopted to assess the classification results. We ran 10 trials of hold-out cross-validation for each dataset; the mean values and standard deviations are reported. For each trial, a limited number of training samples was randomly selected from each class, and the remaining samples were used as a blind test set. The training sample sizes were set at a minimal level to make the classification task more challenging [50].
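For reference, OA, AA, and κ can all be computed from a confusion matrix. The sketch below uses the standard definitions (it is not code from the paper): OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and κ corrects OA for chance agreement.

```python
import numpy as np

def classification_scores(conf):
    """OA, AA, and kappa from a confusion matrix whose rows are
    true classes and whose columns are predicted classes."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                            # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))     # mean per-class accuracy
    pe = (conf.sum(axis=0) @ conf.sum(axis=1)) / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy 2-class example: 48 + 45 of 100 samples correct.
conf = [[48, 2],
        [5, 45]]
oa, aa, kappa = classification_scores(conf)
```

With balanced classes, as in the toy example, OA and AA coincide; on the strongly imbalanced HSI datasets they can differ substantially, which is why both are reported.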

Datasets
To evaluate the performance of the proposed method for HSI classification, we use the following three datasets:
(1) Indian Pines Dataset: The Indian Pines image was gathered by the AVIRIS sensor during a flight over the Indian Pines site in Northwestern Indiana and includes 16 vegetation classes. It contains 145 × 145 pixels and 220 spectral bands in the range of 0.4-2.5 µm. Due to water absorption, 20 spectral bands were removed, and the remaining 200 spectral bands were used for classification.
(2) University of Pavia Dataset: The University of Pavia image was recorded by the ROSIS sensor over Pavia, northern Italy; it includes nine urban land-cover classes and has 610 × 340 pixels. It contains 115 spectral reflectance bands in the wavelength range 0.43-0.86 µm. Twelve spectral bands were removed due to noise, and the remaining 103 spectral bands were used for the experiments.
(3) University of Houston Dataset: The University of Houston image was acquired in 2012. It contains 15 land cover classes and 349 × 1905 pixels; 144 bands with wavelengths ranging from 0.36 to 1.05 µm are used for assessment.

Experimental Setting
We followed a previous study [51] and adopted the same weight initialization method. In all experiments, dilations were evenly distributed s ij ∈ [1, 10] by setting the dilation of channel j of layer i equal to s ij = ((iw + j)mod 10) + 1, where w is the number of kernels in the convolutional layers. The soft-max activation function was used in the final output layer and the Nesterov Stochastic Gradient Descent (SGD) optimization method was employed during training to minimize the cross-entropy between labels of samples and network outputs. We set the batch size to 16, with the network trained over 100 epochs on three HSI datasets (60 epochs with learning rate 0.01 and 40 epochs with learning rate 0.001). Then, we analyzed two factors that control the training and classification performance of MSDN-SA: (1) number of kernels in the convolutional layers, and (2) size of input spatial cubes.
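The dilation rule above can be reproduced directly. For example, with w = 8 kernels per layer and the seven convolutional layers used in MSDN-SA:

```python
# Dilation assigned to channel j of layer i, following the rule
# s_ij = ((i * w + j) mod 10) + 1, with w = 8 kernels per layer.
w = 8
dilations = [[((i * w + j) % 10) + 1 for j in range(w)] for i in range(7)]

for row in dilations:
    print(row)
# Layer 0 gets [1, 2, 3, 4, 5, 6, 7, 8]; successive layers continue
# cycling through 1..10, so each channel operates at a different scale.
```

Because the rule cycles through all ten dilations across layers and channels, every scale from 1 to 10 is represented somewhere in the network even though each layer has only eight channels.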
First, we experimentally verified the number of kernels in the convolutional layers. We tested different kernel numbers from 4 to 20 with fixed intervals of four in each convolutional layer. Classifications were performed on three datasets with only 20 training samples per class using a different number of kernels. The results are shown in Figure 5a. The network with eight kernels in each convolutional layer obtained the best performance in the Indian Pines dataset and University of Pavia dataset, and the network with 20 kernels achieved the highest classification accuracy in the University of Houston dataset, though only marginally higher than the results with eight kernels. For the sake of consistency, we used eight kernels for all datasets. Note that this verifies that dense connectivity allows features to be reused, and a small number of kernels are sufficient.
Second, to obtain an optimal size of the spatial neighborhood in the MSDN-SA, we assessed 5 × 5, 9 × 9, 13 × 13, 17 × 17, and 21 × 21 neighborhoods. Figure 5b shows the classification performance on the three HSI datasets using different spatial neighborhood sizes with only 20 samples per class as training samples. We can see from Figure 5b that, initially, as the spatial size increases, the accuracy increases rapidly; however, when the spatial size reaches 13 × 13, the accuracy stabilizes. Therefore, to balance the classification accuracy and the amount of data involved in computation, we empirically chose 13 × 13 as the spatial neighborhood size.

Second, to obtain an optimal size of the spatial neighborhood in the MSDN-SA, we assessed 5 × 5, 9 × 9, 13 × 13, 17 × 17, and 21 × 21 neighborhoods. Figure 5b shows the classification performance on the three HSI datasets using different spatial neighborhood sizes, with only 20 samples per class as training samples. We can see from Figure 5b that the accuracy initially increases rapidly as the spatial size grows; however, once the spatial size reaches 13 × 13, the accuracy stabilizes. Therefore, to balance classification accuracy against the amount of data involved in computation, we empirically chose 13 × 13 as the spatial neighborhood size.

In our experiments, the performance of MSDN and MSDN-SA is compared with four recently proposed supervised HSI classification methods. The algorithms compared in this paper are summarized as follows:
(1) CCF [52]: Canonical Correlation Forests based on spectral features, with 100 trees.
(3) CNN-transfer [23]: A CNN with a two-branch architecture based on spectral-spatial features, using a transfer learning strategy. Specifically, the source dataset for pretraining on Indian Pines is Salinas Valley, which was collected by the same sensor (AVIRIS), and the source dataset for pretraining on University of Pavia is Pavia Center, which was collected by the same sensor (ROSIS).
(4) 3D-CNN [25]: A 3D-CNN framework with two 3-D convolutional layers and a fully connected layer. The network structure is set as given in [25].
(5) SSRN [31]: The architecture of the SSRN is set out in [31]. The spectral feature learning part includes two convolutional layers and two spectral residual blocks; the spatial feature learning part comprises one 3-D convolutional layer and two spatial residual blocks. Finally, an average pooling layer and a fully connected layer output the results.
Below we compare the above algorithms with our proposed method. For a fair comparison, the network structures were set to the same width and depth on the three HSI datasets. Additionally, we used the same input volume size of 13 × 13 × D for our proposed method on all datasets.
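A patch-based pipeline of this kind typically extracts a size × size × D spatial neighborhood around each labeled pixel. Below is a minimal pure-Python sketch under our own assumptions (zero-padding at image borders; the paper does not specify its border handling):

```python
def extract_patch(cube, row, col, size=13):
    """Extract a size x size x D spatial neighborhood centered on (row, col).
    Out-of-bounds positions are filled with zero spectra (an assumption;
    the paper does not state its padding strategy)."""
    H, W, D = len(cube), len(cube[0]), len(cube[0][0])
    r = size // 2
    patch = []
    for i in range(row - r, row + r + 1):
        row_pixels = []
        for j in range(col - r, col + r + 1):
            if 0 <= i < H and 0 <= j < W:
                row_pixels.append(cube[i][j])
            else:
                row_pixels.append([0.0] * D)  # zero-padded border
        patch.append(row_pixels)
    return patch

# Toy 4 x 4 cube with D = 2 bands; a 3 x 3 patch at a corner pixel is zero-padded.
cube = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
patch = extract_patch(cube, 0, 0, size=3)
```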

Results of Indian Pines Dataset
The 10-time average classification accuracies and the corresponding standard deviations on the Indian Pines dataset are reported in Table 2, and the classification maps of the different methods are shown in Figure 6c-g. For this dataset, SSRN and MSDN-SA outperform the other methods, with MSDN-SA achieving an advantage of approximately 1% to 12%. Note that the numbers of class samples in this dataset are quite unbalanced; in particular, the classes Alfalfa, Grass-pasture-mowed, and Oats have very few samples. Except for SSRN, all compared methods perform well on these classes. Although SSRN is the best competitor in terms of classification accuracy, with performance close to the proposed method, it does not handle classes with few samples well. In contrast, our algorithm is both stable and accurate across these classes.

Results of University of Pavia Dataset
The results on the University of Pavia dataset are reported in Table 3, and the classification maps of the different methods are shown in Figure 7. With nine classes and 20 samples per class, a total of 180 pixels were used for training. This number is smaller than that used for the other two datasets (320 for Indian Pines and 300 for University of Houston). Among all five methods, only SSRN and MSDN-SA achieve more than 90% OA, AA, and κ. With limited training samples, only well-suited deep features are able to exploit the 115-dimensional spectral space. From this table we can see that MSDN-SA reports at least 87% accuracy for all classes, with an AA significantly higher than that achievable by the compared methods. Note that SSRN also performs well on this dataset when more adequate training samples are available for each class. Compared with SSRN, the proposed MSDN-SA still performs marginally better in both OA and κ.
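For reference, the three reported metrics (OA, AA, and Cohen's kappa κ) can be computed from a confusion matrix as in the sketch below (illustrative code; the helper name `metrics` and the toy matrix are ours, not the paper's):

```python
def metrics(conf):
    """OA, AA, and Cohen's kappa from a square confusion matrix
    (rows = true classes, columns = predicted classes)."""
    k = len(conf)
    n = sum(sum(row) for row in conf)
    diag = sum(conf[i][i] for i in range(k))
    oa = diag / n  # overall accuracy: fraction of correctly classified samples
    # AA: mean of per-class recalls (producer's accuracies).
    aa = sum(conf[i][i] / sum(conf[i]) for i in range(k)) / k
    # kappa: agreement corrected for chance (pe = expected agreement).
    pe = sum(sum(conf[i]) * sum(conf[j][i] for j in range(k))
             for i in range(k)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy two-class example.
conf = [[45, 5],
        [10, 40]]
oa, aa, kappa = metrics(conf)  # oa = 0.85, aa = 0.85, kappa = 0.70
```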

Results of University of Houston Dataset
The results of the experiments on the University of Houston dataset are listed in Table 4. As this dataset is large, given space limitations we show only the classification maps of the two best-performing algorithms, in Figure 8c,d. MSDN-SA again works well with few training samples. The advantage of MSDN-SA on this dataset is more significant compared to the other two algorithms, especially regarding OA and κ. In terms of class accuracy, MSDN-SA performs well in all classes and achieves the highest accuracy in nine of them.
Overall, across the three HSI datasets, the proposed MSDN-SA achieves better performance than the compared methods. We observe that the skip-connection networks used in both SSRN and MSDN-SA show good results, indicating that this connection strategy has a positive effect on feature propagation when training with a very small number of samples.

Effect of Training Samples
The above experimental results have shown that the proposed MSDN-SA method performs well in HSI classification, especially with small training sets. In this part, we further investigate scenarios with extremely scarce training samples. The curves of AA with respect to different numbers of training samples are shown in Figure 9.

As expected, the accuracy increases as the number of training samples increases. We can see from Figure 9 that MSDN-SA outperforms the other methods in most cases. On the Indian Pines and University of Pavia datasets, using only five training samples per class, MSDN-SA achieves average accuracies of more than 80% and 83%, respectively. Although classification of the University of Houston dataset is more challenging, with 10-50 training samples per class MSDN-SA scores significantly higher than the other compared methods.

Effect of Spectral-Wise Attention Mechanism
To validate the effectiveness of the spectral-wise attention mechanism, we tested and compared the proposed network with and without it. Figure 10 demonstrates this effectiveness: it shows the OA on the three datasets with 20 training samples per class. Spectral-wise attention clearly improves the classification results for all three datasets, with performance boosted by a larger margin on the Indian Pines and University of Houston datasets than on the University of Pavia dataset. The effect of spectral-wise attention is thought to be related to the redundancy of the input bands. We also investigated the weights generated by the spectral-wise attention mechanism in different layers. Taking Indian Pines as an example, the average weights generated by the spectral-wise attention mechanism in the first layer and the penultimate layer are shown in Figure 11. From Figure 11, we can see that the attention mechanism has less influence on the shallow features, where the weights are concentrated near 0.5. As the number of layers increases, the weights generated by the attention mechanism provide stronger guidance for the deep features, emphasizing informative spectral features and suppressing less useful ones, as illustrated in Figure 11.
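The gating behaviour described above (weights near 0.5 leaving features nearly unchanged, weights toward 0 or 1 suppressing or emphasizing them) can be illustrated with a toy sigmoid gate. This is not the paper's learned attention module, only a hand-rolled sketch that gates each band on its mean activation:

```python
import math

def spectral_attention(band_features):
    """Toy spectral-wise gate: squeeze each band's features to a scalar (mean),
    map it through a sigmoid to a weight in (0, 1), and rescale the band.
    A weight near 0.5 leaves the band roughly unchanged; weights toward 1
    emphasize it and weights toward 0 suppress it. (Illustrative only: the
    paper's gate is learned, here it is the sigmoid of the mean activation.)"""
    weights = [1.0 / (1.0 + math.exp(-(sum(f) / len(f)))) for f in band_features]
    reweighted = [[w * x for x in f] for w, f in zip(weights, band_features)]
    return weights, reweighted

# Three toy "bands": zero-mean, strongly positive, and strongly negative.
bands = [[0.0, 0.0], [2.0, 4.0], [-3.0, -1.0]]
weights, out = spectral_attention(bands)
# The zero-mean band gets weight 0.5, i.e. it is neither emphasized nor suppressed.
```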

Effect of Dilated Convolution
In this part, we validate the effectiveness of the dilated convolution. First, we replace each dilated convolution in the proposed method with a traditional three-dimensional convolution, and denote this model as DN-SA. Then, we compare DN-SA with SVM-3DG, SSRN, and MSDN-SA, which are the top three performing methods in Section 4. Figure 12 demonstrates the effectiveness of dilated convolution: it shows the OA on the three datasets with 20 training samples per class. Dilated convolutions clearly improve the classification results for all three datasets, with performance boosted by a larger margin on the University of Pavia and University of Houston datasets than on the Indian Pines dataset. Our work shows that the dilated convolution operator is particularly suited to dense prediction due to its ability to expand the receptive field without losing resolution.
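The point about expanding the receptive field without losing resolution can be illustrated with a 1-D analogue: a "same"-padded dilated convolution keeps the output length equal to the input length, while its receptive field grows to (k - 1)·d + 1 for kernel size k and dilation d. A minimal sketch (our own illustration, not the paper's 3-D operator):

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Same'-padded 1-D dilated convolution: the output keeps the input
    length, while the receptive field grows to (len(kernel) - 1) * dilation + 1."""
    k, n = len(kernel), len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for t in range(k):
            # Sample input positions spaced `dilation` apart around i.
            j = i + (t - k // 2) * dilation
            if 0 <= j < n:  # zero-padding outside the signal
                acc += kernel[t] * signal[j]
        out.append(acc)
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
# An averaging kernel with dilation 2 sees samples two steps apart,
# yet the output resolution matches the input.
out = dilated_conv1d(signal, [1 / 3, 1 / 3, 1 / 3], dilation=2)
assert len(out) == len(signal)
```

With k = 3 and d = 2 the receptive field is already 5 samples wide, without any pooling or striding reducing the resolution.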

Figure 11. Average weights generated by the spectral-wise attention mechanism in the first layer and the penultimate layer on the Indian Pines dataset.

Conclusions
In this paper, we have proposed a network architecture specifically designed for 3-D patches of hyperspectral datasets. Specifically, we have proposed a novel dense convolutional network that uses dilated convolutions instead of traditional scaling operations to learn features at different scales. It uses multiple scales in each layer and computes the feature map of each layer using all the feature maps of earlier layers, resulting in a densely connected network. Furthermore, a spectral-wise attention mechanism, adding soft weights on features, was proposed to enhance the distinguishability of spectral features. By combining the dense convolutional network with dilated convolution and spectral-wise attention, the resulting MSDN-SA architecture enables accurate training with relatively small training sets. Experimental results on three popular HSI benchmark datasets demonstrate that MSDN-SA performs consistently, offering the highest classification accuracy.
In terms of future research, we plan to investigate how to select samples that are most informative for training the network, which remains an active research topic.