One-Shot Dense Network with Polarized Attention for Hyperspectral Image Classification

Abstract: In recent years, hyperspectral image (HSI) classification has become a hot research direction in remote sensing image processing. Benefiting from the development of deep learning, convolutional neural networks (CNNs) have shown extraordinary achievements in HSI classification. Numerous methods combining CNNs and attention mechanisms (AMs) have been proposed for HSI classification. However, to fully mine the features of HSI, some of the previous methods apply dense connections to enhance the feature transfer between convolution layers. Although dense connections allow these methods to extract features fully from only a few training samples, they decrease model efficiency and increase the computational cost. Furthermore, to balance model performance against complexity, the AMs in these methods compress a large number of channels or spatial resolutions during the training process, which results in a large amount of useful information being discarded. To tackle these issues, in this article, a novel one-shot dense network with polarized attention, namely, OSDN, is proposed for HSI classification. More precisely, since HSI contains rich spectral and spatial information, the OSDN has two independent branches to extract spectral and spatial features, respectively. Similarly, the polarized AMs contain two components: channel-only AMs and spatial-only AMs. Both polarized AMs use a specially designed filtering method to reduce the complexity of the model while maintaining high internal resolution in both the channel and spatial dimensions. To verify the effectiveness and lightness of OSDN, extensive experiments were carried out on five benchmark HSI datasets, namely, Pavia University (PU), Kennedy Space Center (KSC), Botswana (BS), Houston 2013 (HS), and Salinas Valley (SV). Experimental results consistently showed that the OSDN can greatly reduce computational cost and parameters while maintaining high accuracy with only a few training samples.


Introduction
Benefiting from the increased spectral resolution of remote sensing sensors, the hyperspectral imaging technique shows great potential for obtaining high-quality land-cover information. A hyperspectral image (HSI) contains rich spectral and spatial information, and each pixel contains hundreds of continuous and narrow spectral bands ranging from the visible to the near-infrared. Therefore, it has been widely used in many fields, such as urban planning [1], precision agriculture [2], and mineral exploration [3]. Among these applications, HSI classification is an important technical tool that aims to assign a unique class to each pixel [4]. However, due to insufficient labeled samples and highly redundant information, HSI classification remains a challenging task [5].
In the last decade, various methods have been proposed for HSI classification. These classification methods can be divided into two main categories: traditional machine-learning-based (ML-based) and modern deep-learning-based (DL-based) methods [6].
Generally, in ML-based methods, researchers first perform feature extraction on the raw HSI and then use classifiers to classify the extracted features. According to the types of features, these methods can be further divided into spectral-based methods and spatial-spectral-based methods. Commonly, spectral-based methods directly classify the spectral vector of each pixel, using classifiers such as random forest [7], k-nearest neighbors [8], and the support vector machine (SVM) [9]. Moreover, many methods focus on reducing the redundant spectral dimensions by mapping the high-dimensional spectral vector into a low-dimensional space, such as principal component analysis (PCA) [10], linear discriminant analysis [11], and independent component analysis [12]. However, it is difficult to identify land-cover types using spectral features alone, and the classification results are often contaminated with salt-and-pepper noise. Alternatively, many researchers have discovered that spatial features can provide additional useful information for classification tasks. On the basis of this consideration, researchers have proposed a series of spatial-spectral-based methods for HSI classification, such as the Gabor wavelet transform [13], local binary patterns [14], and morphological profiles [15]. Although the above methods can improve classification accuracy, the feature extraction process relies on a priori knowledge and appropriate parameter settings. These limitations may affect the robustness and discrimination of the extracted features, making it difficult to achieve satisfactory results in complex scenarios [16].
In recent years, with the continuous improvement of computing power, the development of deep learning techniques has been greatly promoted. Deep neural network models can automatically extract highly robust and discriminative features from the raw data. They have made significant breakthroughs in many computer vision tasks, including image classification [17], semantic segmentation [18], and remote sensing image processing [19]. Naturally, in the field of HSI classification, research methods are gradually converging to state-of-the-art deep learning techniques. Currently, many effective classification models based on deep learning methods have been proposed. Chen et al. [20] proposed a stacked autoencoder deep neural network for spatial-spectral classification. It was the first application of DL-based methods to HSI classification. After that, many DL-based classification methods were proposed, and convolutional neural networks in particular have attracted much attention.
A convolutional neural network (CNN) with multiple hidden layers has a powerful feature learning capability. It can provide more discriminative features with fine quality for HSI classification. Hu et al. [21] first used a one-dimensional (1-D) CNN to extract deep spectral features from each pixel for HSI classification. In addition, Yu et al. [22] proposed an improved 1-D CNN framework, which embeds pre-extracted hashing features in the network. To fully utilize the spatial context information, two-dimensional (2-D) CNNs have been applied to HSI classification and achieved desirable performance. Chen et al. [23] extracted the first principal component from the HSI data by PCA along the spectral dimension and then fed it into a 2-D CNN model to extract the spatial depth features. Yu et al. [24] applied multiple 2-D CNN layers with 1 × 1 convolutional kernels to extract deep spatial features for HSI classification. However, the high spectral dimension in HSI may increase the number of learnable parameters of the 2-D CNN model, and the correlation of local spectra may be neglected. Compared with the 2-D CNN model, the three-dimensional (3-D) CNN model can simultaneously extract joint spatial-spectral features. Mei et al. [25] proposed an unsupervised 3-D convolutional autoencoder to extract the joint spatial-spectral feature. Roy et al. [26] proposed a hybrid 3-D and 2-D CNN model for HSI classification (HYSN). This model first uses a 3-D CNN to extract shallow joint spatial-spectral features and then uses a 2-D CNN to extract more abstract spatial texture features. Moreover, to reduce the computational cost of 3-D CNNs, Zhang et al. [27] proposed a 3-D depth-wise separable CNN for HSI classification. Recently, inspired by the residual network [28], Zhong et al. [29] proposed a spectral-spatial residual network (SSRN), which uses spectral and spatial 3-D residual blocks to learn deep-level features of HSI. Subsequently, inspired by SSRN and DenseNet [30], Wang et al. [31] proposed an end-to-end fast densely connected spectral-spatial classification framework (FDSS), which can more effectively reuse features with only a few training samples. Although these CNN-based classification models can extract rich spatial and spectral features of HSI, the convolution kernel is localized, so the receptive field must be expanded by stacking convolution layers, which may cause a large number of useless features to propagate to the deeper convolutional layers. These useless features affect the learning efficiency of the model and eventually lead to a decrease in classification accuracy. Thus, finding and focusing on the discriminative features of HSI is an important problem.
Inspired by the human visual system, many researchers have introduced the attention mechanism to computer vision tasks, such as object detection [32], image captioning [33], and image enhancement [34]. Since the attention mechanism can pay attention to valuable features or regions in the feature map, some researchers have successfully introduced it to HSI classification. Fang et al. [35] proposed a densely connected spectral-wise attention mechanism network, in which the squeeze-and-excitation (SE) attention module [36] is applied to recalibrate each spectral contribution. Later, many similar spectral attention modules were introduced for HSI classification to highlight valuable spectral bands and suppress useless ones. For example, Li et al. [37] proposed a spectral band attention module based on adversarial learning, in which the attention module can explore the contribution of each band and avoid spectral distortion. Roy et al. [38] proposed a fused SE attention module, in which two different squeezing operations, global pooling and max pooling, are used to generate the excitation weight. To make the network simultaneously boost and suppress features in both spectral and spatial dimensions, many networks based on spectral-spatial attention modules have been proposed for HSI classification. Inspired by SSRN and the convolutional block attention module (CBAM) [39], Ma et al. [40] proposed a double-branch multi-attention network (DBMA), in which the spectral and spatial branches are equipped with spectral-wise attention and spatial-wise attention, respectively. Subsequently, Li et al. [41] constructed a double-branch dual attention (DBDA) network for HSI classification, in which the dual attention network (DANet) [42] is inserted separately into the two branches. Compared with CBAM, DANet can adaptively integrate local features and global dependencies. In addition, to obtain long-distance spatial and spectral features, Shi et al. [43] proposed a 3-D coordination attention mechanism network, whose 3-D attention module is better adapted to the 3-D structure of HSI. Li et al. [44] proposed a spectral-spatial global context attention [45] network (SSGC) with less time cost to capture more discriminative features. Moreover, in [46], Shi et al. proposed a pyramidal convolution and iterative attention network (PCIA), in which each branch can extract hierarchical features. Although the above three attention-based methods can achieve good classification results, they heavily compress the spatial or spectral resolution when obtaining the attention feature map. Meanwhile, their feature extraction process requires a high computational cost because they simply apply dense connection modules.
To solve the above problems, inspired by the latest techniques and previous works, we propose a one-shot dense network with polarized attention for HSI classification. Instead of following the 3-D dense connection method used by previous works to extract features from HSI, we propose a one-shot dense connection block that maintains good classification accuracy at a lower computational cost. Meanwhile, we add residual connections to this block, enhancing feature transfer and mitigating the vanishing gradient problem. In addition, the recently proposed polarized attention mechanism (PAM) [47] is introduced in the network to mine finer and higher-quality features. Compared with other attention mechanisms [36,39,42,45], it can maintain a relatively high resolution in the spectral and spatial dimensions and thus reduce the loss of features. Furthermore, the proposed network is composed of two branches that perform feature extraction in the spectral and spatial domains, respectively. The channel-only and spatial-only attention mechanisms are inserted into each branch to recalibrate the feature maps. After extracting the enhanced features from the two branches, we fuse them with a concatenation operation to obtain the spectral-spatial features. Finally, the fused features are fed into the fully connected layer to obtain the classification results. The main contributions of this paper are summarized as follows: (1) We propose a novel spectral-spatial network based on a one-shot dense block and polarized attention for HSI classification. The proposed network has two independent feature extraction branches: the spectral branch with channel-only polarized attention applied to obtain spectral features, and the spatial branch with spatial-only polarized attention used to capture spatial features. (2) With the one-shot dense block, the number of parameters and the computational complexity of the network are greatly reduced.
Meanwhile, a residual connection is added to the block, which can alleviate the performance saturation and vanishing gradient problems. (3) We apply both channel-only and spatial-only polarized attention in the proposed network. The channel-only polarized attention emphasizes valuable channel features and suppresses useless ones. The spatial-only attention focuses on areas with more discriminative features. In addition, this attention mechanism preserves more resolution in both the channel and spatial dimensions and consumes less computational cost. (4) Some advanced techniques, including the cosine annealing learning rate, the Mish activation function [48], Dropout, and early stopping, are employed in the proposed network. For reproducibility, the code of the proposed network is available at https://github.com/HaiZhu-Pan/OSDN (accessed on 5 May 2022).
To show the effectiveness of the proposed network, a large number of experiments were carried out on five real-world HSI datasets, namely, PU, KSC, BS, HS, and SV. The experimental results consistently demonstrate that the proposed network can achieve better accuracy than several widely used ML- and DL-based methods with only a few training samples and limited computational resources.
The remainder of this article is structured as follows: Some close backgrounds are reviewed in Section 2. In Section 3, our proposed network is presented with three parts in detail. In Sections 4 and 5, comparative experiments and ablation analyses are performed to demonstrate the effectiveness of the proposed network. Finally, Section 6 provides some concluding remarks and suggestions for future work.

Background
In this section, we briefly introduce some important background techniques involved in the proposed HSI classification model, including the 3-D convolution operation, ResNet and DenseNet, and the attention mechanism.

3-D Convolution Operation
Generally, convolution operations are the core of CNNs. At present, three types of convolution operations are used in CNN-based HSI classification models: 1-D, 2-D, and 3-D convolutions. Using a 1-D or 2-D CNN alone has drawbacks, such as a lack of spatial relationship features or very complex networks [26]. The main reason is that HSI is a 3-D data cube enriched with a large amount of spatial and spectral information. A 1-D CNN alone cannot extract good discriminative features from the spatial dimension. Similarly, a deep 2-D CNN is more computationally complex and may miss some spectral information between adjacent bands. This motivates our use of the 3-D convolution operation, which can make up for the shortcomings of the first two. The process of the 3-D convolution operation is shown in Figure 1.
As shown in Figure 1, the input data for the 3-D convolution operation is a 4-D tensor $h_x \in \mathbb{R}^{h_n \times h_n \times s_n \times k_n}$, where $h_n \times h_n \times s_n$ is the size of the input data and $k_n$ is the number of channels (feature maps). The 3-D convolution operation contains $k_{n+1}$ convolutional kernels of size $\alpha_{n+1} \times \alpha_{n+1} \times d_{n+1}$, and the stride of subsampling is $(s, s, s_1)$. The output of the 3-D convolution operation is also a 4-D tensor $h_{x+1} \in \mathbb{R}^{h_{n+1} \times h_{n+1} \times s_{n+1} \times k_{n+1}}$. More specifically, the spatial size of the output data is $h_{n+1} = 1 + \lfloor (h_n - \alpha_{n+1})/s \rfloor$, and the depth is $s_{n+1} = 1 + \lfloor (s_n - d_{n+1})/s_1 \rfloor$. The 3-D convolution operation is defined as follows:

$$v_{l,j}^{x,y,z} = M\left(b_{l,j} + \sum_{m} \sum_{h=0}^{H_l - 1} \sum_{w=0}^{W_l - 1} \sum_{d=0}^{D_l - 1} k_{l,j,m}^{h,w,d}\, v_{l-1,m}^{x+h,\, y+w,\, z+d}\right)$$

where $M$ is the Mish activation function, $v_{l,j}^{x,y,z}$ is the value at position $(x, y, z)$ of the $j$th feature map in the $l$th layer, and $b_{l,j}$ is the bias. In addition, the height, the width, and the depth of the convolution kernel are denoted by $H_l$, $W_l$, and $D_l$, respectively. Furthermore, $k_{l,j,m}^{h,w,d}$ is the kernel weight at position $(h, w, d)$ connected to the $m$th feature map of the previous layer.
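As a quick sanity check, the output-size formulas above can be reproduced with PyTorch's `nn.Conv3d` (the example sizes below mirror the PU patch used later in the paper; this is an illustrative sketch, not the authors' code):

```python
import torch
import torch.nn as nn

# Example sizes: a 7 x 7 x 103 patch with 1 input channel, 24 kernels of
# size 1 x 1 x 7, stride (1, 1, 2), no padding.
h_n, s_n, k_n = 7, 103, 1          # spatial size, spectral depth, input channels
alpha, d, k_next = 1, 7, 24        # kernel spatial size, kernel depth, kernels
stride = (1, 1, 2)                 # (s, s, s_1)

conv = nn.Conv3d(k_n, k_next, kernel_size=(alpha, alpha, d), stride=stride)
x = torch.randn(1, k_n, h_n, h_n, s_n)   # (batch, channels, H, W, depth)
y = conv(x)

h_out = 1 + (h_n - alpha) // stride[0]   # spatial formula: 1 + floor((h_n - a)/s)
s_out = 1 + (s_n - d) // stride[2]       # depth formula:   1 + floor((s_n - d)/s_1)
print(y.shape)  # torch.Size([1, 24, 7, 7, 49])
```

The result matches the (7 × 7 × 49, 24) feature maps quoted later for the PU example.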

ResNet and DenseNet
Commonly, a trained deep neural network extracts features layer by layer to complete the classification task. However, as the number of convolutional layers increases, two main problems arise: gradient vanishing/explosion and network degradation. Numerous studies have shown that ResNet [28] and DenseNet [30] can alleviate these problems and achieve feature reuse.
As illustrated in Figure 2, a shortcut connection is added to the base CNN structure in the residual block. The shortcut connection, also known as identity mapping, enables input features to be passed from a lower level to a higher level in a summative way. The output features of the lth residual block are defined as follows:

$$x_l = f_l(x_{l-1}) + x_{l-1}$$

where $f_l(\cdot)$ denotes the hidden layers, including convolution, batch normalization (BN), and Mish activation layers. To further promote the flow of features in the network, Huang et al. [30] proposed a densely connected network, in which shortcut connections are used to concatenate the input features and output features at each layer. This structure is shown in Figure 3. The output features of the lth dense block are computed as follows:

$$x_l = D_l([x_0, x_1, \ldots, x_{l-1}])$$

where $D_l(\cdot)$ includes BN, the Mish activation function, and a convolution operation, and $[\cdot]$ is the concatenation operation. In particular, a DenseNet with $l$ layers has $l(l+1)/2$ connections, while a plain CNN with the same number of layers has only $l$ connections.
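The two connection patterns can be contrasted in a few lines of PyTorch (a minimal sketch with hypothetical 16-channel 2-D layers, not the paper's blocks):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 7, 7)

# Residual step: output is ADDED to the input, so channels stay unchanged.
f = nn.Conv2d(16, 16, kernel_size=3, padding=1)   # stands in for f_l(.)
res_out = f(x) + x                                # x_l = f_l(x_{l-1}) + x_{l-1}

# Dense steps: every layer sees the CONCATENATION of all earlier outputs.
feats = [x]
for _ in range(3):                                # three dense "layers"
    d = nn.Conv2d(16 * len(feats), 16, kernel_size=3, padding=1)
    feats.append(d(torch.cat(feats, dim=1)))      # x_l = D_l([x_0, ..., x_{l-1}])
dense_out = torch.cat(feats, dim=1)               # channels grow: 16 * 4 = 64

print(res_out.shape, dense_out.shape)
```

The growing channel count of the dense variant illustrates why dense connections increase memory and computation, which is the cost the one-shot design targets.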

Attention Mechanism
The attention mechanism is a common data processing method in deep learning. It helps the model assign different weights to each part of the feature maps to extract more critical and discriminative features, thereby enabling the model to make more accurate judgments without imposing more overhead on the computation and storage of the model. Existing attention mechanisms can be roughly divided into two types: soft attention and hard attention. The former pays more attention to the channel or spatial information of the image, while the latter focuses on the information at a certain position in the image. Most importantly, the soft attention mechanism is differentiable, so its weight parameters can be updated by backpropagation during the training process. Therefore, soft attention is widely used in the field of computer vision. For example, the SE attention module [36] can recalibrate each channel's contribution to the network. GCNet [45] not only extracts global contextual information but is also lightweight like SENet. In addition, CBAM [39] and DANet [42] can extract attention maps in both the channel and spatial dimensions. However, these attention models have a low internal attention resolution, which discards a large amount of channel or spatial information. Moreover, these attention models are computationally intensive when attending to both the channel and spatial dimensions. To alleviate these problems, the PAM [47] employs a distinctive filtering method to reduce the complexity of the model while maintaining a high internal attention resolution in both the channel and spatial dimensions. The detailed implementations of the channel-only PAM and the spatial-only PAM are described in Sections 3.1 and 3.2.

Channel-Only Polarized Attention Mechanism
As shown in Figure 4, the channel-only PAM is constructed using the channel relations of the feature map. We assume that the input feature maps $A_c \in \mathbb{R}^{h \times w \times c}$ are independent, where $h$, $w$, and $c$ denote height, width, and channel, respectively. First, $A_c$ is fed into a 2-D convolution layer with a kernel size of 1 × 1, generating a new feature map $B_c \in \mathbb{R}^{h \times w \times c/2}$. After that, $B_c$ is reshaped to $D_c \in \mathbb{R}^{n \times c/2}$, where $n = h \times w$. Simultaneously, $A_c$ is also fed into a 2-D convolution layer with a kernel size of 1 × 1, generating a new feature map $C_c \in \mathbb{R}^{h \times w \times 1}$. Then, $C_c$ is reshaped to $E_c \in \mathbb{R}^{1 \times 1 \times n}$, and the SoftMax function is applied to enhance the attention scope. Subsequently, matrix multiplication is performed on $D_c$ and $E_c$, generating the feature map $F_c \in \mathbb{R}^{1 \times 1 \times c/2}$. After that, $F_c$ is fed into a bottleneck feature transform layer, which consists of two 1 × 1 convolution layers, a layer normalization operation, and a ReLU activation function, to obtain the dependency of each channel and raise the channel dimension from $c/2$ to $c$. Next, the Sigmoid function is used to keep the channel weights $G_c \in \mathbb{R}^{1 \times 1 \times c}$ between 0 and 1. Finally, a channel-wise multiplication is performed between $G_c$ and $A_c$ to generate the final channel-only polarized attention map $H_c \in \mathbb{R}^{h \times w \times c}$. The overall channel-only PAM can be defined as follows:

$$G_c = F_{SG}\left[ W_3\left( \zeta_1(W_1(A_c)) \times F_{SM}(\zeta_2(W_2(A_c))) \right) \right]$$

where $W_1$, $W_2$, and $W_3$ are 1 × 1 convolution layers; $\zeta_1$ and $\zeta_2$ are two tensor transformation (reshape) operations; $F_{SG}(\cdot)$ is the Sigmoid activation function; and $F_{SM}(\cdot)$ is the SoftMax activation function. The internal channel resolution between $W_1|W_2$ and $W_3$ is $c/2$. The final output of the channel-only PAM is formulated as

$$H_c = G_c \odot^{ch} A_c$$

where $\odot^{ch}$ is the channel-wise multiplication operation.
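A minimal PyTorch sketch of the channel-only PAM, following the description above (the module and layer names are ours, and the exact bottleneck wiring is our assumption, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelOnlyPAM(nn.Module):
    """Sketch of the channel-only polarized attention described above;
    module/layer names are ours, not the authors'."""
    def __init__(self, c):
        super().__init__()
        self.w1 = nn.Conv2d(c, c // 2, 1)        # -> B_c (c -> c/2)
        self.w2 = nn.Conv2d(c, 1, 1)             # -> C_c (c -> 1)
        self.fc1 = nn.Conv2d(c // 2, c // 2, 1)  # bottleneck, first 1x1 conv
        self.ln = nn.LayerNorm([c // 2, 1, 1])   # layer normalization
        self.fc2 = nn.Conv2d(c // 2, c, 1)       # raises c/2 back to c

    def forward(self, x):                        # x: (batch, c, h, w)
        n, c, h, w = x.shape
        d = self.w1(x).reshape(n, c // 2, h * w)           # D_c
        e = F.softmax(self.w2(x).reshape(n, h * w, 1), 1)  # E_c with SoftMax
        f = torch.matmul(d, e).reshape(n, c // 2, 1, 1)    # F_c: (n, c/2, 1, 1)
        g = torch.sigmoid(self.fc2(F.relu(self.ln(self.fc1(f)))))  # G_c in (0, 1)
        return x * g                                       # H_c = G_c (x) A_c

out_c = ChannelOnlyPAM(24)(torch.randn(2, 24, 7, 7))
print(out_c.shape)  # torch.Size([2, 24, 7, 7])
```

Note that only one branch is squeezed to a single channel ($C_c$), while the other keeps $c/2$ channels, which is what preserves the high internal channel resolution.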

Spatial-Only Polarized Attention Mechanism
As shown in Figure 5, the spatial-only PAM is constructed from the spatial contextual position relationships of the feature map, given an input tensor $A_s \in \mathbb{R}^{h \times w \times c}$. First, $A_s$ is fed into two 1 × 1 convolution layers to generate the feature maps $B_s \in \mathbb{R}^{h \times w \times c/2}$ and $C_s \in \mathbb{R}^{h \times w \times c/2}$, respectively. Next, $B_s$ is reshaped to $D_s \in \mathbb{R}^{c/2 \times n}$, where $n = h \times w$. Second, a global average pooling operation is applied to $C_s$ to compress the global spatial features into a feature vector $E_s \in \mathbb{R}^{1 \times 1 \times c/2}$; meanwhile, since the spatial features of $C_s$ are compressed, the SoftMax function is used to perform feature enhancement on $E_s$. After that, a matrix multiplication is conducted on the attention maps $E_s$ and $D_s$, generating the feature map $F_s \in \mathbb{R}^{1 \times n}$. Through reshape and Sigmoid operations, the spatial attention weight $G_s \in \mathbb{R}^{h \times w \times 1}$ is generated. The overall spatial-only PAM can be defined as follows:

$$G_s = F_{SG}\left[ \zeta_2\left( F_{SM}\left( F_{GP}(W_2(A_s)) \right) \times \zeta_1(W_1(A_s)) \right) \right]$$

where $W_1$ and $W_2$ are two standard 1 × 1 convolution layers, $\zeta_1$ and $\zeta_2$ are two tensor transformation (reshape) operations, $F_{GP}(\cdot)$ is the global average pooling operation, $F_{SM}(\cdot)$ is the SoftMax operation, and $F_{SG}(\cdot)$ is the Sigmoid operation. The final output of the spatial-only PAM is formulated as

$$H_s = G_s \odot^{sp} A_s$$

where $\odot^{sp}$ is the spatial-wise multiplication operation.
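The spatial branch admits an equally short PyTorch sketch (again, names and exact wiring are our reading of the description, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialOnlyPAM(nn.Module):
    """Sketch of the spatial-only polarized attention described above;
    module/layer names are ours, not the authors'."""
    def __init__(self, c):
        super().__init__()
        self.w1 = nn.Conv2d(c, c // 2, 1)   # -> B_s
        self.w2 = nn.Conv2d(c, c // 2, 1)   # -> C_s

    def forward(self, x):                   # x: (batch, c, h, w)
        n, c, h, w = x.shape
        d = self.w1(x).reshape(n, c // 2, h * w)        # D_s: (n, c/2, h*w)
        e = F.adaptive_avg_pool2d(self.w2(x), 1)        # E_s via global avg pool
        e = F.softmax(e.reshape(n, 1, c // 2), dim=-1)  # SoftMax enhancement
        f = torch.matmul(e, d)                          # F_s: (n, 1, h*w)
        g = torch.sigmoid(f.reshape(n, 1, h, w))        # G_s: per-position weights
        return x * g                                    # H_s = G_s (x) A_s

out_s = SpatialOnlyPAM(24)(torch.randn(2, 24, 7, 7))
print(out_s.shape)  # torch.Size([2, 24, 7, 7])
```

Here the polarization is mirrored: only one branch is pooled to 1 × 1 spatially, while the other keeps the full h × w grid, preserving the internal spatial resolution.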

One-Shot Dense Network with Polarized Attention
In this section, we describe the proposed network in detail; it consists of spectral feature extraction, spatial feature extraction, and spectral-spatial feature fusion. The structure of the proposed network is shown in Figure 6. In the following, we use the PU dataset as an example to illustrate the three components of the proposed network in detail.

Spectral and Spatial Feature Extraction of One-Shot Dense Block
As shown in Figure 6, this part contains two independent feature extraction processes: the spectral feature extraction process and the spatial feature extraction process. For feature extraction, inspired by ResNet and DenseNet, we propose a one-shot dense block. Unlike the dense block, the feature maps produced by each convolution (Conv) layer are concatenated only once, and each Conv layer has an equal number of input and output feature maps. Furthermore, we also insert a skip connection in the one-shot dense block, which enables this block to extract deeper features of HSI. Instead of an individual pixel vector, we first randomly select a 3-D patch cube of size 7 × 7 × 103 from the PU dataset as the network's input. In this way, the network can consider both the spatial background information and the spectral information around the central pixel of the 3-D patch cube during the classification process. Before the spectral one-shot dense block, we first use a 3-D Conv layer with BN and Mish to reduce the spectral dimension of the input data. The kernel size is (1 × 1 × 7), the number of filters is 24, the stride is (1, 1, 2), and no padding is applied. After that, the generated feature maps have a size of (7 × 7 × 49, 24). Next, they are fed into the spectral one-shot dense block, which consists of a one-shot connected part and a residual connected part. The kernel size, number of filters, stride, and padding of all 3-D Conv layers in the one-shot connected part are (1 × 1 × 7), 12, (1, 1, 1), and (0, 0, 3), respectively. Then, we concatenate the generated feature maps along the channel dimension, producing feature maps of size (7 × 7 × 49, 60). Meanwhile, to implement the residual connected part, we use a 1 × 1 × 1 3-D Conv layer to reduce the channel dimension from 60 to 24 and then add it to the last feature maps of the one-shot connected part.
Finally, after the last 3-D Conv layer with a kernel size of (1 × 1 × 49), a (7 × 7 × 1, 24) feature map is generated.
Similar to the spectral feature extraction process, we only focus on the spatial features of the input data in the spatial feature extraction process. The input data size of the spatial one-shot dense block is (7 × 7 × 1, 24). All hyperparameters are the same as the spectral one-shot dense block except that the kernel size of the spatial one-shot dense block is (3 × 3 × 1). The detailed spectral and spatial feature extraction processes are listed in Tables 1 and 2.
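The spectral one-shot dense block can be sketched in PyTorch as follows; the layer count and the exact residual wiring are our reading of the text, so treat this as an assumption-laden illustration rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class SpectralOneShotBlock(nn.Module):
    """Sketch of the spectral one-shot dense block as we read the description;
    three 12-filter layers and the residual wiring are our assumptions."""
    def __init__(self):
        super().__init__()
        k, p = (1, 1, 7), (0, 0, 3)                 # spectral kernel, depth padding
        self.c1 = nn.Conv3d(24, 12, k, padding=p)
        self.c2 = nn.Conv3d(12, 12, k, padding=p)
        self.c3 = nn.Conv3d(12, 12, k, padding=p)
        self.reduce = nn.Conv3d(60, 24, 1)          # 1x1x1 conv: 60 -> 24
        self.last = nn.Conv3d(24, 24, (1, 1, 49))   # collapses the spectral depth

    def forward(self, x):                           # x: (n, 24, 7, 7, 49)
        f1 = self.c1(x)
        f2 = self.c2(f1)
        f3 = self.c3(f2)
        cat = torch.cat([x, f1, f2, f3], dim=1)     # one-shot concat: 24+3*12=60
        return self.last(self.reduce(cat) + x)      # residual add, then (1,1,49)

out = SpectralOneShotBlock()(torch.randn(1, 24, 7, 7, 49))
print(out.shape)  # torch.Size([1, 24, 7, 7, 1])
```

The single concatenation (instead of one per layer) is what keeps the channel count, and hence the parameter count, from growing layer by layer as in a standard dense block.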

Spectral and Spatial Feature Enhancement of Polarized Attention Mechanism
After the spectral and spatial feature extraction process, the feature maps are enriched with a large amount of spectral and spatial information. However, different channels and positions in these feature maps may make different contributions to the classification results. Therefore, as shown in Figure 6, to enhance valuable features and suppress less valuable ones, the feature maps are fed into the channel-only polarized attention (COPA) block and the spatial-only polarized attention (SOPA) block. The input size of both the COPA block and the SOPA block is (7 × 7 × 24). A detailed description of these two attention mechanisms is given in Sections 3.1 and 3.2. In addition, the detailed implementation of the feature enhancement process is listed in Tables 3 and 4.

Spectral and Spatial Feature Fusion and Classification
After the spectral and spatial feature enhancement process, the resulting feature maps are separately fed into an adaptive average pooling (AdaptiveAvgPool) layer with BN and Mish. Compared with a fully connected layer, the AdaptiveAvgPool layer reduces the computational cost. The output size of this layer is (1 × 24). Finally, we fuse the two feature maps along the channel dimension and then feed the fused feature maps into a linear layer to obtain the classification results. Since we use the cross-entropy loss in PyTorch as the loss function of the network, which internally applies SoftMax to produce the probability distribution over the labels, we do not add a separate SoftMax layer to obtain the final classification results. The detailed implementation of the feature fusion and classification process is listed in Table 5.
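A sketch of this fusion head in PyTorch (batch size 32 and the 9 classes of the PU dataset are used for illustration; random tensors stand in for the enhanced branch features):

```python
import torch
import torch.nn as nn

spec = torch.randn(32, 24, 7, 7)                   # enhanced spectral features
spat = torch.randn(32, 24, 7, 7)                   # enhanced spatial features

pool = nn.AdaptiveAvgPool2d(1)                     # each branch -> (32, 24, 1, 1)
fused = torch.cat([pool(spec).flatten(1),
                   pool(spat).flatten(1)], dim=1)  # channel fusion -> (32, 48)
logits = nn.Linear(48, 9)(fused)                   # linear classification head

# PyTorch's CrossEntropyLoss applies log-softmax internally,
# so no explicit SoftMax layer is needed at the output.
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 9, (32,)))
print(logits.shape)  # torch.Size([32, 9])
```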

Hyperspectral Dataset Description
In this paper, we employed five well-known HSI datasets, namely, PU, KSC, BS, HS, and SV, to validate the generality and effectiveness of our proposed method. A detailed description of these datasets is presented as follows: SV: The SV dataset was also gathered by the AVIRIS sensor, but it was collected in the Salinas Valley region of California. Its spatial dimensions and resolution are 512 × 217 and 3.7 m, respectively. The raw SV dataset has 224 spectral bands ranging from 400 to 2500 nm. Twenty water absorption bands are discarded; therefore, this article uses 204 bands for the experimental dataset. This dataset contains 16 land-cover types with 54,129 labeled samples.

Experimental Evaluation Indicators
In this work, three evaluation indicators, namely, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa), are used to assess the classification performance of the proposed method [49]. OA refers to the percentage of correctly classified labeled samples out of the total labeled samples. AA is the average accuracy over all classes, which assigns the same importance to each category. Kappa measures the consistency between the classification results and the ground truth. Its value ranges from −1 to 1, but it usually falls between 0 and 1. Overall, the closer the above three indicators are to 1, the better the classification model.
To explain the above three evaluation indicators more intuitively, we first define the confusion matrix. In the confusion matrix, each column represents the predicted label, and each row represents the actual label. The composition of the confusion matrix $A_{n \times n}$ is defined as follows:

$$A_{n \times n} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}$$

where the element $a_{ij}$ indicates the number of samples of class $i$ classified as class $j$, and $n$ is the number of classes.
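From such a confusion matrix, the three indicators can be computed directly; here is a worked example with a toy 2-class matrix (rows = actual, columns = predicted):

```python
import numpy as np

# Toy confusion matrix: 50 samples per class, 85 classified correctly overall.
A = np.array([[40, 10],
              [ 5, 45]], dtype=float)

N = A.sum()                                        # total labeled samples
oa = np.trace(A) / N                               # OA: correct / total
aa = np.mean(np.diag(A) / A.sum(axis=1))           # AA: mean per-class accuracy
pe = np.sum(A.sum(axis=0) * A.sum(axis=1)) / N**2  # chance agreement
kappa = (oa - pe) / (1 - pe)                       # Kappa coefficient

print(oa, aa, kappa)  # 0.85 0.85 0.7
```

For this matrix the per-class accuracies are 0.8 and 0.9, so OA and AA coincide at 0.85, while Kappa (0.7) discounts the 0.5 agreement expected by chance.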

Experimental Setting
The experiments were implemented on a deep learning workstation with two Intel Xeon E5-2680 v4 processors (35 MB of L3 cache, a clock speed of 2.4 GHz, and 14 physical cores/28 threads each). Furthermore, it is equipped with 128 GB of DDR4 RAM and eight NVIDIA GeForce RTX 2080 Ti graphics processing units (GPUs) with 11 GB of memory each. The software environment is CUDA v11.2, PyTorch 1.1.0, and Python 3.7.
To validate the effectiveness of our proposed method, we selected eight representative methods for comparison: one traditional ML-based method and seven state-of-the-art DL-based methods. All comparison methods are briefly described as follows: (1) SVM: The SVM with a radial basis function (RBF) kernel is employed as a representative traditional method for HSI classification. It is implemented with scikit-learn [50]. Each labeled sample in the HSI has a continuous spectral vector, which is directly fed into the SVM without feature extraction or dimensionality reduction. The penalty parameter C and the RBF kernel width σ are selected by GridSearchCV, both in the range of $(10^{-2}, 10^{2})$. (2) HYSN [26]: The HYSN model has three 3-D convolution layers, one 2-D convolution layer, and two fully connected layers. The convolution kernel sizes of the 3-D convolution layers are 3 × 3 × 7, 3 × 3 × 5, and 3 × 3 × 3, respectively. The convolution kernel size of the 2-D convolution layer is 3 × 3. (3) SSRN [29]: The SSRN model consists of two residual convolutional blocks with convolution kernel sizes of 1 × 1 × 7 and 3 × 3 × 1, respectively. They are connected sequentially to extract deep-level spectral and spatial features, with BN and ReLU added after each convolutional layer. (4) FDSS [31]: The network structure of FDSS consists of three connected convolutional parts: a densely connected spectral feature extraction part, a dimension reduction part, and a densely connected spatial feature extraction part. The shapes of the convolution kernels of the three parts are 1 × 1 × 7, 1 × 1 × b (where b is the spectral depth of the generated feature map), and 3 × 3 × 1, respectively. Moreover, BN and ReLU are added before each convolutional layer. (5) DBMA [40]: The DBMA model is designed with a two-branch network structure.
Each branch has a dense block and an attention block. Its dense block is the same as in FDSS. Moreover, the attention block is inspired by CBAM [39].
(6) DBDA [41]: The DBDA model uses DANet [42] as the attention mechanism, and the rest of the network structures are the same as DBMA. In particular, it adopts the Mish as the activation function. (7) PCIA [46]: The PCIA model uses an iterative approach to construct an attention mechanism. This network structure also consists of two branches, but each branch uses a pyramid convolution module to perform feature extraction. (8) SSGC [44]: The GCNet [45] attention mechanism is introduced to the SSGC. The rest of the network architecture is the same as DBMA.
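The SVM baseline in (1) can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the data here are random stand-ins for labeled HSI spectra, and the grid resolution (five logarithmic steps) is our assumption; only the RBF kernel and the [10⁻², 10²] search range for C and γ come from the text.

```python
# Hedged sketch: raw spectral vectors fed to an RBF-kernel SVM, with C and
# gamma tuned by GridSearchCV over [1e-2, 1e2] (grid density assumed).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))        # 60 labeled "pixels", 8 spectral bands (toy)
y = rng.integers(0, 2, size=60)     # binary labels for illustration only

param_grid = {"C": np.logspace(-2, 2, 5), "gamma": np.logspace(-2, 2, 5)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
best_C, best_gamma = search.best_params_["C"], search.best_params_["gamma"]
```

In practice, each pixel's full spectral vector would replace the random `X`, with no prior feature extraction or dimensionality reduction, as stated above.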
To ensure the impartiality of the comparison experiments, we used the same hyperparameters for all methods. To train the proposed method, we applied the Adam optimizer [51] for 100 training epochs with an initial learning rate of 0.0005 for all datasets. The learning rate is dynamically adjusted every 25 epochs by cosine annealing [52]. Furthermore, if the loss on the validation set does not improve within 10 epochs, training stops early and the network proceeds to testing. To balance efficiency and effectiveness, the spatial size of the HSI patch cube was set to 7 × 7, and the batch size was set to 32. Tables 6-10 provide the detailed distribution of the training, validation, and testing samples of the PU, KSC, BS, HS, and SV datasets. For reproducibility, the proposed network code is publicly available at https://github.com/HaiZhu-Pan/OSDN (accessed on 5 May 2022).
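The cosine annealing schedule described above can be written out explicitly. The initial rate (0.0005) and 25-epoch cycle come from the text; the minimum rate `eta_min = 0` is our assumption, not stated in the paper.

```python
# Sketch of cosine annealing with restarts every 25 epochs, starting from
# lr = 0.0005. eta_min = 0 is an assumption for illustration.
import math

def cosine_annealing_lr(epoch, base_lr=5e-4, cycle=25, eta_min=0.0):
    t = epoch % cycle  # position inside the current annealing cycle
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / cycle))

# Learning rate decays within each 25-epoch cycle, then restarts at 0.0005.
lrs = [cosine_annealing_lr(e) for e in range(100)]
```

The rate thus decays smoothly toward `eta_min` inside each cycle and jumps back to the initial value at epochs 25, 50, and 75, which is the "dynamic adjustment every 25 epochs" the text refers to.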

Experimental Results
Tables 11-15 report the classification accuracy of each category, together with the OA, AA, and Kappa, on the five datasets. The proposed OSDN clearly produces the best OA, AA, and Kappa and provides a significant improvement over the other methods on all datasets. For example, when 1% of the samples are randomly chosen for training on the PU dataset (Table 11), the improvements in OA over the SVM, HYSN, SSRN, FDSS, DBMA, DBDA, PCIA, and SSGC methods are 9.96%, 5.87%, 3.60%, 1.72%, 2.16%, 2.06%, 2.99%, and 1.45%, respectively. Specifically, since SVM only uses spectral information to perform classification, its accuracy on all datasets is much lower than that of the other methods. Conversely, the eight DL-based methods (i.e., HYSN, SSRN, FDSS, DBMA, DBDA, SSGC, PCIA, and OSDN) all achieved good classification results on the five datasets because they can automatically extract deep, high-level, and discriminative spatial-spectral information from the 3-D patch cube. Furthermore, compared to SSRN and HYSN, the OA of FDSS improved by approximately 1-8% on all datasets, which indicates that the densely connected structure can extract features more adequately from a few training samples. In addition, the network structures of DBMA, DBDA, PCIA, and SSGC are very similar; their classification models are based on two main ideas: a dual-branch 3-D dense convolution block and a dual-branch attention mechanism. Among these dual-branch attention models, SSGC achieved the best classification results on most datasets due to its ability to focus on global contextual information. Moreover, the classification accuracy obtained by OSDN was higher than that of FDSS and SSGC because the PAM module in OSDN not only retains a large amount of spectral and spatial resolution but also dynamically enhances the feature maps. Finally, compared with the best comparison method on each of the five datasets, the OA of OSDN improved by 1.45%, 1.86%, 1.46%, 1.62%, and 0.82%, respectively.
At the same time, the AA and Kappa improved to different degrees on the five datasets. Figures 7-11 show the ground truth, false-color image, and classification maps of all methods on the five datasets. In general, the outline of each category is smoother and clearer in the classification maps of the proposed OSDN on all datasets. Because the SVM method cannot effectively extract spatial features, its classification maps contain a large amount of salt-and-pepper noise on the five datasets (Figures 7b, 8b, 9b, 10b and 11b). In addition, benefiting from the PAM module, the proposed OSDN was significantly better than the other methods at predicting unlabeled categories. Taking the PU dataset as an example, a careful look at Figure 7k suggests that there may be several trees (C4) in the area below the bare soil (C6). However, none of the comparison methods could identify most of the trees in this area. In contrast, Figure 7j clearly shows that the proposed OSDN predicted eight trees in this area. Similarly, in the area to the left of these eight trees, the proposed OSDN delineated the region more completely than the other methods. These observations validate that the proposed OSDN can accurately predict labeled categories and reasonably predict unlabeled categories on all datasets. Moreover, the above results further verify that the proposed one-shot dense connection can extract sufficient features from a few training samples, while the PAM module focuses on extracting finer features for classification.

Comparison of Different Spatial Patch Size
In this subsection, we explore the relationship between the spatial patch size and the classification accuracy of the proposed network. In general, if the spatial patch size is too small, it will not contain enough spatial features, and the classification performance may degrade. Conversely, if the spatial patch size is too large, it will contain more mixed pixels and increase the computational cost. Therefore, an appropriate spatial patch size should be determined by both classification accuracy and computational cost. Figure 12 depicts the OA with different spatial patch sizes ranging from 3 to 13 at 2-pixel intervals. According to Figure 12, as the spatial patch size increased, the classification accuracy on the five datasets gradually increased, and the best OA was obtained with a patch size of 7 × 7. This phenomenon indicates that more spatial features were included in the data cubes as the spatial size increased, so the classification results improved to some extent. However, as the patch size increased further, the OA on most datasets showed a decreasing trend. In conclusion, to balance OA and computational cost, we used 7 × 7 as the spatial patch size on the five datasets.
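The patch cubes discussed above can be extracted as in the following sketch. The paper only fixes the patch size itself; the HSI cube layout (H, W, B) and the edge-padding choice for border pixels are our assumptions for illustration.

```python
# Minimal sketch of extracting a spatial patch cube centered on a labeled
# pixel. Edge padding (a common choice, assumed here) lets border pixels
# still yield full-size patches.
import numpy as np

def extract_patch(cube, row, col, patch=7):
    m = patch // 2
    # Pad only the two spatial axes; the spectral axis is left untouched.
    padded = np.pad(cube, ((m, m), (m, m), (0, 0)), mode="edge")
    # After padding, (row, col) in the original cube maps to (row + m, col + m),
    # so this slice is centered on the target pixel.
    return padded[row:row + patch, col:col + patch, :]

hsi = np.arange(10 * 10 * 4).reshape(10, 10, 4).astype(float)  # toy (H, W, B) cube
p = extract_patch(hsi, 0, 0)   # a corner pixel still yields a full 7 x 7 patch
```

Varying the `patch` argument over 3, 5, ..., 13 reproduces the sweep summarized in Figure 12.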

Comparison of Different Training Sample Proportions
It is well known that deep learning is a data-driven approach. In this subsection, we randomly chose 1%, 1.5%, 2%, 3%, 5%, 7%, 9%, and 60% of the samples from each dataset for training to explore the classification performance of the different models under different training sample proportions. As shown in Figure 13, when the training samples were sufficient, these models maintained classification results above 99% on all five datasets. However, obtaining enough training samples is a time-consuming and labor-intensive task. Therefore, one of the motivations of the proposed OSDN was to obtain good classification results with a few training samples. Compared with the other methods, the proposed OSDN consistently maintained the highest OA across all proportions of training samples; in particular, with insufficient training samples, it achieved the highest OA on all five datasets.

Comparison of Computational Cost and Complexity
One of the purposes of this article is to reduce the computational cost and complexity of the proposed network. Therefore, Table 16 compares the number of parameters, floating-point operations (FLOPs), training time, and testing time of the different methods on the five datasets. FLOPs are an indicator of model complexity used to measure the computational cost of a model. All methods were measured at their best-accuracy configurations and trained with the same samples. Overall, Table 16 shows that the proposed OSDN achieved good results on all four metrics. Specifically, HYSN had the largest number of parameters and the highest FLOPs among the DL-based methods, owing to its deeper network structure. Compared with SSRN, although FDSS achieved good classification results (see Figure 14), it had more parameters and FLOPs due to its dense connections. In addition, it is worth noting that DBMA, DBDA, PCIA, and SSGC share a similar feature extraction backbone. Among these four methods, DBMA had the largest number of parameters and the highest FLOPs since its attention module contains fully connected operations. Furthermore, DBDA, PCIA, and SSGC had roughly the same number of parameters; however, PCIA required fewer FLOPs because of its multiscale pyramidal feature extraction block and iterative attention module. The proposed OSDN required the fewest parameters and the lowest FLOPs on the five datasets due to its lightweight one-shot dense block and effective PAM module. In addition, since SVM contains few parameters, it did not consume much time for training and testing. As for the time efficiency of OSDN, it is very competitive with the other similar comparison models (i.e., DBMA, DBDA, PCIA, and SSGC). Finally, combining Table 16 and Figure 14 with the above analysis, we can conclude that the proposed OSDN delivers satisfying classification accuracy with less computational cost and complexity.
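The two indicators in Table 16 can be estimated per layer with standard counting formulas. The layer shapes below are hypothetical and do not correspond to any model in the table; only the formulas themselves (kernel volume × channels for parameters, and one multiply plus one add per kernel element per output position for FLOPs) are standard.

```python
# Back-of-the-envelope counters for 3-D convolution layers. Shapes are
# illustrative assumptions, not the actual OSDN configuration.
def conv3d_params(cin, cout, k, bias=True):
    kh, kw, kd = k
    return kh * kw * kd * cin * cout + (cout if bias else 0)

def conv3d_flops(cin, cout, k, out_shape):
    kh, kw, kd = k
    h, w, d = out_shape
    # Each output position costs one multiply and one add per kernel element.
    return 2 * kh * kw * kd * cin * cout * h * w * d

# Example: a 3 x 3 x 7 kernel, 1 input channel, 24 output channels:
# 3*3*7*1*24 weights + 24 biases = 1536 parameters.
p = conv3d_params(1, 24, (3, 3, 7))
```

Summing these over all layers gives the parameter and FLOP totals reported per model; this is the arithmetic behind why dense connections (which inflate each layer's input channels) raise both counts.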

Comparison of Different Dense Connections
To verify the effectiveness and lightness of the proposed one-shot dense block (OSDB), we compared it with two other classical dense blocks, namely, the dense block (DB) and the weak dense block (WDB) [53], as shown in Figure 15. Note that the overall structure of the proposed OSDN remained unchanged; only the feature extraction blocks of OSDN were replaced by DB and WDB, respectively. The numbers of parameters, FLOPs, and OA of the three dense blocks are listed in Tables 17-21. The experimental results show that OSDB had fewer parameters and FLOPs while maintaining acceptable OA on the five datasets. Although DB achieved the highest OA in these five tables, it had a large number of parameters and FLOPs, which increased the complexity of the model. In addition, since WDB only retains the skip connections between adjacent pairs of Conv layers in DB, its parameters and FLOPs were reduced to some extent, but its OA was also reduced. Lastly, the proposed OSDB not only connects the subsequent feature maps at once but also incorporates residual connections, enabling it to maintain accuracy while reducing the amount of computation. In conclusion, although the proposed OSDB did not achieve the best results on all indicators, it offers an acceptable and reasonable trade-off given our motivation of reducing computational cost and complexity.
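The cost difference between DB and OSDB can be seen from their input-channel arithmetic alone. The sketch below contrasts the connectivity patterns; the stem width `c0` and growth rate `g` are hypothetical, and the OSDB interpretation (each layer sees only its predecessor, with one concatenation at the end of the block) is our reading of "connects the subsequent feature maps at once".

```python
# Input channel counts per layer under the two connectivity patterns.
# c0 = stem channels, g = per-layer output (growth) channels -- both assumed.
def db_input_channels(c0, g, layers):
    # Classic dense block: layer i concatenates the stem and all i previous
    # outputs, so its input width grows linearly with depth.
    return [c0 + i * g for i in range(layers)]

def osdb_input_channels(c0, g, layers):
    # One-shot variant: each layer consumes only its predecessor's output;
    # all outputs are concatenated once at the block's end (plus a residual).
    return [c0] + [g] * (layers - 1)

db = db_input_channels(24, 12, 4)     # [24, 36, 48, 60] -- widths keep growing
osdb = osdb_input_channels(24, 12, 4) # [24, 12, 12, 12] -- widths stay flat
```

Since convolution cost scales with input width (see the counting formulas above, for 3-D convolutions), the flat widths of OSDB directly explain its lower parameter and FLOP counts in Tables 17-21.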

Ablation Analysis toward the Attention Module
This subsection describes an ablation analysis of the attention module on the five datasets. For a fair comparison, all the networks were trained with the same hyperparameters and samples, as described in Section 4.3. As shown in Figure 16, "Model 0" denotes the OSDN without the PAM; "Model 1" and "Model 2" denote the network with only the spatial-only PAM and only the channel-only PAM, respectively; and "Model 3" denotes the OSDN with both the spatial-only and channel-only PAMs. According to the results, both Model 1 and Model 2 effectively improved the OA over Model 0 on the five datasets. It is worth noting that even though the OA of Model 0 was already very high on the SV dataset, Model 1, Model 2, and Model 3 improved the OA by 0.38%, 0.57%, and 0.76%, respectively. The experimental results consistently show that, compared with using Model 1 alone or Model 2 alone, Model 3 achieved the best OA on all datasets. Furthermore, we analyzed the impact of the attention module (Model 3) on the computational cost of the OSDN. After extensive experiments, we observed that the computation times before and after incorporating the attention module into the OSDN were ≈0.0051 and ≈0.0064 s, respectively. In addition, the FLOPs and parameters of the attention module were 0.04 M and 0.001 M, respectively. Therefore, the introduced attention mechanism adds negligible computational cost and complexity to the OSDN, while selecting the important channel and spatial features to improve its classification performance.

Conclusions
In this article, we constructed an OSDN to address the high complexity and inadequate feature extraction of current CNN-based HSI classification models in the case of small training samples. By incorporating the one-shot dense block, the number of parameters and the computational cost of the network were significantly reduced while an excellent feature extraction ability was retained. Moreover, to fully extract refined and discriminative features, polarized AMs were introduced into the proposed OSDN. Compared to other AMs previously used in HSI classification models, the polarized AMs maintain high channel and spatial resolution during training. In addition, several advanced techniques, including the BN layer, the Mish activation function, the cosine annealing learning rate, the dropout layer, and early stopping, were used in the OSDN to prevent overfitting and accelerate network convergence.
The experiments demonstrated the effectiveness of the two crucial parts of the OSDN, namely, the one-shot dense block and the polarized AMs. Moreover, several state-of-the-art models, such as HYSN, SSRN, FDSS, DBMA, DBDA, PCIA, and SSGC, were used for comparison on five HSI datasets. In the case of a few training samples, the classification results consistently demonstrated that the OSDN not only accurately predicted the labeled samples but also reasonably predicted the unlabeled samples. At the same time, compared with the other comparison models, the proposed OSDN is an efficient, lightweight model that achieves good classification performance with less computational cost even under limited training samples. In our future work, we will investigate more effective and lightweight models to extract discriminative features for HSI classification. Finally, the code developed for OSDN is available at https://github.com/HaiZhu-Pan/OSDN (accessed on 5 May 2022).