Double-Branch Network with Pyramidal Convolution and Iterative Attention for Hyperspectral Image Classification

Abstract: Deep-learning methods, especially convolutional neural networks (CNNs), have become the first choice for hyperspectral image (HSI) classification to date. A common procedure is to crop small cubes from hyperspectral images and feed them into CNNs. However, standard CNNs find it difficult to extract discriminative spectral–spatial features. How to obtain finer spectral–spatial features to improve classification performance is now a hot research topic. In this regard, the attention mechanism, which has achieved excellent performance in other computer vision tasks, holds exciting prospects. In this paper, we propose a double-branch network consisting of a novel convolution named pyramidal convolution (PyConv) and an iterative attention mechanism. Each branch concentrates on exploiting spectral or spatial features with different PyConvs, supplemented by an attention module for refining the feature map. Experimental results demonstrate that our model yields competitive performance compared to other state-of-the-art models.


Introduction
Hyperspectral remote sensing, containing a rich triad of spatial, radiometric and spectral information, is a frontier area of remote-sensing technology. Hyperspectral remote sensors, with the remarkable features of high spectral resolution (5–10 nm) and a wide spectral range (0.4–2.5 µm), can use dozens or even hundreds of narrow spectral bands to collect information. All the bands can be arranged together to form a continuous and complete spectral curve, which covers the full range of electromagnetic radiation from the visible to the near-infrared wavelengths. Hyperspectral images (HSIs) effectively integrate the spatial and spectral information of remote-sensing data and thus support important remote-sensing applications, e.g., agriculture [1], environmental monitoring [2], and physics [3].
Traditional spectral-based methods such as k-nearest neighbors [4], multinomial logistic regression (MLR) [5], and support vector machines (SVM) [6] tend to treat the raw pixels directly as input. However, given the numerous spectral bands in HSI, the classifier is confronted with high-dimensional features, and the limited number of samples makes it difficult to train a classifier with high accuracy. This problem is known as the curse of dimensionality, or the Hughes phenomenon. To tackle it, dimensionality reduction such as feature selection [7] or feature extraction [8] is a common tactic. Moreover, considering that neighboring pixels probably belong to the same class, another line of research focuses on spatial information. Gu et al. [9] and Fang et al. [10] used SVM as a classifier with a multiple-kernel learning strategy to process the HSI data and obtained the desired results. In [11], the original HSI data were fused with multi-scale superpixel segmentation maps and then fed into an SVM for processing.
In this paper, we propose a double-branch network in which the two branches extract spectral and spatial features, respectively. In each branch, pyramidal convolution is introduced to exploit abundant features at different scales. Then, a novel iterative attention mechanism is applied to refine the feature maps. The double-branch features are fused by concatenation or weighted addition. Finally, the fused spectral–spatial features are fed into the fully connected layer to obtain classification results with the SoftMax function. The main contributions of this article are as follows: (1) A new double-branch model based on pyramidal 3D convolution is proposed for HSI classification. The two branches can separately and efficiently extract spatial and spectral features. (2) A new iterative attention mechanism, expectation-maximization attention (EMA), is introduced to HSI classification.
It can refine the feature map by highlighting relevant bands or pixels and suppressing the interference of irrelevant bands or pixels. The rest of this paper is organized as follows: In Section 2, we briefly describe the related work. Our proposed architecture is described in detail in Section 3. In Sections 4 and 5, we conduct several experiments and analyze the experimental results. Finally, conclusions and future work are presented in Section 6.

Related Work
In this section, before introducing the proposed HSI classification framework, we briefly review several closely related techniques: pyramidal convolution (PyConv), ResNet and DenseNet, and the attention mechanism.

A Multi-Scale 3D Convolution-PyConv
As mentioned in the preceding section, the 3D-CNN-based approach has carved out a niche for itself in HSI classification. Considering that the spectral dimension of HSI is abundant with detailed information of land covers, 3D convolution is an appealing operation in exploiting the spatial and spectral information in HSI for classification.
Based on the standard 3D convolution [36], several offshoots have evolved [37][38][39]. Among them, multi-scale 3D convolution is of interest in this paper. In [40], a multi-scale 3D convolution named pyramidal convolution (PyConv) was proposed, as illustrated in Figure 1. Using a pyramid with different types of kernels, PyConv can process the input feature maps FM_in at different scales, resulting in a series of output feature maps FM_out with complementary information. Generally, PyConv is a hierarchical structure that stacks 3D convolution kernels of different sizes. At each level of PyConv, the spatial size of the kernels varies, increasing from the bottom of the pyramid to the top. As the spatial size increases, the depth of the kernel simultaneously decreases. Consequently, as shown in Figure 1, this leads to two pyramids facing opposite directions: one pyramid is wide at the bottom and narrow at the top in terms of kernel depth, and the other, inverted, pyramid is narrow at the bottom and wide at the top in terms of the spatial size of the kernel. This pyramidal structure provides a pool of combinations of different types and sizes of kernels. Thanks to this, the network can acquire complementary information, since kernels with smaller receptive fields focus on small objects and details, while kernels with larger receptive fields can concentrate on larger objects and contextual information.
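As an illustration, a minimal PyTorch sketch of one pyramidal convolution layer is given below. The class name, channel counts and kernel sizes are illustrative only; the actual PyConv additionally varies the kernel depth via grouped convolutions, which this sketch omits.

```python
import torch
import torch.nn as nn

class PyConv3d(nn.Module):
    """Minimal pyramidal 3D convolution: parallel kernels of different
    sizes, with smaller kernels assigned more output channels."""
    def __init__(self, in_ch, out_chs=(16, 8, 4), kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding = k // 2 keeps the spatial/spectral size unchanged
            nn.Conv3d(in_ch, oc, kernel_size=k, padding=k // 2)
            for oc, k in zip(out_chs, kernels)
        ])

    def forward(self, x):
        # concatenate the complementary multi-scale responses along channels
        return torch.cat([b(x) for b in self.branches], dim=1)

x = torch.randn(2, 1, 9, 9, 20)   # (batch, channel, H, W, bands)
y = PyConv3d(1)(x)
print(y.shape)                    # torch.Size([2, 28, 9, 9, 20])
```

The concatenated output carries 16 + 8 + 4 = 28 channels while preserving the input's spatial and spectral extent.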


ResNet and DenseNet
Deep networks can lead to better performance, but optimizing deep networks is very difficult. To combat this dilemma, ResNet and DenseNet are powerful tools.
Inspired by residual representations in image recognition, ResNet introduces shortcut connections to the network. As shown in Figure 2a, H denotes hidden layers, including convolution layers, activation function layers, and batch normalization (BN) layers. In the original ResNet paper, shortcut connections simply perform identity mapping, enabling information or gradients to pass directly without travelling through intermediate layers. To formalize residual learning mathematically, identity mapping by shortcuts is integrated into a basic block in ResNet, which can be defined as:

x_l = x_(l−1) + H_l(x_(l−1)), (1)

where x_(l−1) and x_l denote the input and output of the block, respectively.

Based on ResNet, DenseNet connects all layers directly with each other to ensure maximum information flow through the network at all times. To maintain the feed-forward nature, each layer concatenates the outputs of all previous layers as its input in the channel dimension and transmits its own feature maps to all subsequent layers. Figure 2b illustrates this layout. Accordingly, the output x_l of the l-th layer can be formulated as:

x_l = H_l([x_0, x_1, · · · , x_(l−1)]), (2)

where H_l refers to a module consisting of convolution layers, activation layers, and BN layers, and [x_0, x_1, · · · , x_(l−1)] denotes the concatenation of the feature maps generated by all preceding layers.
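The residual and dense connectivity patterns can be contrasted with a toy NumPy sketch, where a tanh stands in for the conv-BN-activation module H:

```python
import numpy as np

def H(x):
    # stand-in for a conv + BN + activation module
    return np.tanh(x)

# Residual connectivity: each layer adds its transform to its input,
# so the feature width stays constant.
x = np.full(4, 0.5)
for _ in range(3):
    x = x + H(x)

# Dense connectivity: each layer consumes the concatenation of ALL
# previous outputs, so the feature width grows layer by layer.
feats = [np.full(4, 0.5)]
for _ in range(3):
    feats.append(H(np.concatenate(feats)))

print(x.shape, feats[-1].shape)   # (4,) (16,)
```

The sketch makes the practical difference visible: the residual path keeps a fixed width, while the dense path's input width doubles at every layer here (4, 8, then 16).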

Attention Mechanism
Since the same object tends to show different spectral responses in different bands, the discriminative ability of the bands varies. In addition, different areas of the data cube contain different semantic information. Such prior information can boost the competence of the model once it is fully exploited. The attention mechanism is exactly the technique that meets these demands. The essence of the attention mechanism is to obtain a new representation by linear weighting based on the correlations between objects, which can be interpreted as a method of feature transformation. To date, the attention mechanism has been successfully applied to various tasks, such as video classification [41], machine translation [42] and scene segmentation [43].
Among the diverse attention models, self-attention [42], which computes a weighted summation of location contexts, is popular. Non-local networks [40] first introduced the self-attention mechanism to computer vision tasks. DANet [34] treated the non-local operation as spatial attention and further proposed channel attention, integrating the two branches into an overall framework. A²-Net [44] used a dual-attention block to gather crucial features from the entire spatio-temporal space into a compact set and then adaptively distribute them to each position.
However, these methods tend to drive each pixel to capture global information, resulting in attention maps with high time and space complexity. Motivated by the success of attention in the above works, EMANet [45] rethought the attention mechanism from the perspective of the expectation-maximization (EM) algorithm and computed attention maps in an iterative manner, significantly alleviating the computational burden. As shown in Figure 3, a set of bases representing the input feature is initialized first; then, following the EM algorithm, the attention maps are updated in the E step and the bases are updated in the M step. The two steps are conducted alternately until convergence. Such a mechanism can be integrated into a unit called the Expectation-Maximization Attention Unit (EMAU), which can be conveniently inserted into CNNs.
Suppose an input feature map is X ∈ R^(N×C) and the bases are initialized as B = {β_1, β_2, · · · , β_K} ∈ R^(K×C). In the E step, the bases are used to generate the attention maps Y ∈ R^(N×K) according to the following formulations:

y_nk = exp(x_n β_k^T) / Σ_{j=1}^{K} exp(x_n β_j^T), (3)

Y = softmax(X B^T), (4)

where y_nk represents the weight of the contribution of the k-th base β_k to the n-th pixel x_n. Equation (4) is the matrix form of Equation (3) and is what is actually applied in the experiments.
In the M step, the attention maps are used to update the bases:

β_k = Σ_{n=1}^{N} y_nk x_n / Σ_{n=1}^{N} y_nk, (5)

where each base β_k is a weighted sum of the pixels x_n; this keeps the bases and the input in the same representation space, which guarantees the robustness of the iterations.
After the two steps are executed alternately T times, B and Y converge approximately, which is guaranteed by the properties of the EM algorithm. Experimental results also demonstrate that the required number of iterations T is a small constant, i.e., expectation-maximization attention converges quickly. Then, the final B and Y are used to reconstruct X. The new X, denoted X̃, can be formulated as:

X̃ = Y B, (6)

where X̃ can be deemed a low-rank version of X.
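The alternating E and M steps and the low-rank reconstruction can be sketched in a few lines of NumPy; the feature dimensions, the number of bases K and the iteration count T below are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def em_attention(X, K=4, T=3, rng=np.random.default_rng(0)):
    """Expectation-maximization attention on a flattened feature map X (N, C)."""
    B = rng.standard_normal((K, X.shape[1]))     # initial bases
    for _ in range(T):
        Y = softmax(X @ B.T, axis=1)             # E step: attention maps (N, K)
        B = (Y.T @ X) / Y.sum(axis=0)[:, None]   # M step: weighted means of X
    return Y @ B                                 # reconstruction: X_tilde = Y B

X = np.random.default_rng(1).standard_normal((100, 8))
X_tilde = em_attention(X)
print(X_tilde.shape, np.linalg.matrix_rank(X_tilde) <= 4)   # (100, 8) True
```

Because X̃ is a product of an N×K and a K×C matrix, its rank is at most K, which is exactly the low-rank property noted above.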


Methodology
This section is structured as follows. First, we introduce the framework of the proposed method. Second, two branches respectively focusing on spectral information and spatial information are described in detail. Third, fusion operations of spectral and spatial branches are discussed. Finally, several techniques aimed at boosting the network performance are covered.

Framework of the Proposed Model
The flowchart in Figure 4 depicts the proposed model for HSI classification. Generally, it consists of two branches: the spectral branch and the spatial branch. Moreover, Expectation-Maximization attention modules are incorporated into both branches to apply attention-based feature refinement. Concatenation or weighted sum are implemented subsequently to fuse bipartite features. Finally, classification is performed with the SoftMax function.
Concretely, let the HSI data set be H ∈ R^(h×w×d), where h, w and d denote the height, the width and the number of spectral bands, respectively. Assume that H is composed of N labeled pixels X = {x_1, x_2, · · · , x_N} ∈ R^(1×1×d) and that the corresponding category label set is Y = {y_1, y_2, · · · , y_N} ∈ R^(1×1×C), where C represents the number of land-cover classes.
To effectively exploit the inherent information in HSI, a common practice is to form a 3D patch cube from the pixels surrounding a given pixel. In this manner, X can be decomposed into a new data set Z = {z_1, z_2, · · · , z_N} ∈ R^(w×w×d), where w is the spatial width of the cubes. If the target pixel is on the edge of the image, the values of the adjacent missing pixels are set to zero. Then, Z is randomly divided into training, validation and testing sets, denoted by Z_train, Z_val and Z_test. Accordingly, their corresponding label sets are Y_train, Y_val and Y_test. For each configuration of the model, the training set is used to optimize the parameters, while the validation set is used to supervise the training process and select the best-trained model. Finally, the test set is used to verify the performance of the best-trained model.
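The patch-cube construction with zero padding at image borders can be sketched as follows; the function name is hypothetical, and the image size matches an Indian-Pines-sized cube only as an example.

```python
import numpy as np

def extract_cubes(H, coords, w=9):
    """Crop a w x w x d cube around each labeled pixel; pixels falling
    outside the image are zero-filled, as described above."""
    r = w // 2
    Hp = np.pad(H, ((r, r), (r, r), (0, 0)))   # zero-pad spatial borders only
    return np.stack([Hp[i:i + w, j:j + w, :] for i, j in coords])

H = np.random.default_rng(0).random((145, 145, 200))     # example h x w x d cube
Z = extract_cubes(H, [(0, 0), (72, 72), (144, 144)], w=9)
print(Z.shape)   # (3, 9, 9, 200)
```

After padding, a corner pixel such as (0, 0) sits at the center of its 9 × 9 window, with the out-of-image neighbors contributing zeros.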

Pyramidal Spectral Branch and Pyramidal Spatial Branch
As shown in Figure 4, the spectral branch and the spatial branch consist of PyConv and EMA. First, the pyramidal blocks used in two branches will be described in detail.
Generally, a 3D convolutional layer is first applied to perform a feature transformation on the HSI cube in the spectral dimension, reducing the computational overhead. Then, a pyramidal spectral block is attached. As shown in Figure 5, each layer in the pyramidal convolution consists of three 3D convolution operations with decreasing kernel sizes in the spectral dimension, distinguished by blue, yellow and red, respectively. The kernel sizes of the 3D convolution operations are set to 1 × 1 × 7, 1 × 1 × 5 and 1 × 1 × 3, respectively. Furthermore, to make the network powerful and help it converge rapidly, each convolution is followed by a batch normalization (BN) layer for regularization and the Mish activation function [46] to learn a non-linear representation. The number of output channels in each level is consistent and can be set to k; the number of channels of the final output of the block can then be formulated as:

n_o = n + 3k, (7)

where n is the number of output channels of the preceding 3D convolution layer and k is the number of 3D convolution kernels per level. Since only the spectral dimension of these convolution kernels varies and is never equal to 1, it can be assumed that mainly the spectral information is explored.

Similar to the pyramidal spectral block, the pyramidal spatial block is built by leveraging the inter-spatial relationships of the feature maps. As illustrated in Figure 6, in contrast to the pyramidal spectral block, the kernel size of the pyramidal spatial block changes in the spatial dimension while remaining fixed in the spectral dimension. Moreover, a 3D convolution layer is applied beforehand to compact the spectral dimension of the HSI cube, as exhibited in Figure 4. Again, each layer in the block not only includes a 3D convolutional layer but is also combined with a batch normalization layer and a Mish activation function layer. The relationship between the input and output of the pyramidal spatial block is aligned with Equation (7).
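A PyTorch sketch of the two pyramidal blocks is given below. The channel counts n and k are illustrative, and concatenating the block input with the three level outputs reflects our reading of Equation (7); the authors' exact layer configuration may differ.

```python
import torch
import torch.nn as nn

def level(in_ch, out_ch, k_spec, k_spat):
    """One pyramid level: Conv3d + BN + Mish with 'same' padding."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, (k_spat, k_spat, k_spec),
                  padding=(k_spat // 2, k_spat // 2, k_spec // 2)),
        nn.BatchNorm3d(out_ch),
        nn.Mish(),
    )

n, k = 24, 8
x = torch.randn(2, n, 9, 9, 20)                  # (batch, n, H, W, bands)

# Spectral block: kernels 1x1x7, 1x1x5, 1x1x3 vary only along the bands.
spec = [level(n, k, ks, 1)(x) for ks in (7, 5, 3)]
# Spatial block: kernels vary along H x W and stay fixed along the bands.
spat = [level(n, k, 1, ks)(x) for ks in (7, 5, 3)]

y = torch.cat([x] + spec, dim=1)                 # n + 3k channels
print(y.shape)                                   # torch.Size([2, 48, 9, 9, 20])
```

With n = 24 and k = 8, the fused output carries 24 + 3 × 8 = 48 channels, matching Equation (7).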


Expectation-Maximization Attention Block
After attaching the pyramidal spectral or spatial block, a 3D convolutional layer is needed to 'resize' the intermediate feature maps for subsequent input to the EMA block. The EMA block then follows to refine the feature maps. For the same object, the spectral response may vary dramatically across different bands. In addition, different positions of the extracted feature maps can provide different semantic information for HSI classification. The performance of HSI classification can be improved if such prior information is properly taken into account; therefore, the EMA block is introduced. The two EMA blocks located in the spectral and spatial branches are designed with a similar structure. The EMA block in the spectral branch iterates the attention map along the spectral dimension (denoted as spectral attention), while the EMA block in the spatial branch iterates the attention map along the spatial dimension (denoted as spatial attention).
As shown in Figure 7, given an intermediate feature map X as input, a compact base set is initialized with Kaiming's initialization [47]. Then, attention maps are generated in the E step and the base set is updated in the M step, as described in Section 2.3. After a few iterations, with the converged bases and attention maps, a new refined feature map X̃ can be obtained. Instead of outputting X̃ directly, a small factor α is adopted to equilibrate X with X̃: multiplying X̃ by α and then adding it to X produces the final output. This operation facilitates the stability of training, and empirical performance validates its potency.
The initialization of the bases is actually a key point. The procedure described above only portrays the steps to implement EMA on a single image; however, thousands of images must be processed in the HSI classification task. The spectral and spatial feature distributions are distinct for each image, so the bases β computed on one image should not be the paradigm for all images. In this paper, we choose to run EMA on each image and continually update the initial values of the bases β_0 during the training process with the following strategy:

β_0 ← γ β_0 + (1 − γ) β_T, (8)

where β_0 represents the initial values of the bases, β_T is generated after iterating over an image, and γ ∈ [0, 1].
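The residual blending with α and the moving-average update of the initial bases can be sketched as follows; the values of α and γ and the exact update form are illustrative assumptions, not the authors' reported settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))               # feature map entering the EMA block
X_tilde = 0.1 * rng.standard_normal((100, 8))   # refined (low-rank) version from EMA

alpha = 0.25
out = X + alpha * X_tilde       # residual-style blend stabilizes training

# Moving-average update of the shared initial bases across training images:
beta0 = rng.standard_normal((4, 8))   # initial bases beta_0
betaT = rng.standard_normal((4, 8))   # bases after T iterations on one image
gamma = 0.9
beta0 = gamma * beta0 + (1 - gamma) * betaT
print(out.shape, beta0.shape)   # (100, 8) (4, 8)
```

Keeping γ close to 1 lets β_0 track the per-image bases slowly, so no single image dominates the shared initialization.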

Figure 7. Overall structure of the expectation-maximization attention (EMA) block.

Fusion of Spectral and Spatial Branches
With the aid of the spectral and spatial branches, multiple feature maps are generated. How to fuse them to obtain a desirable classification result is then a problem. Generally, there are two options: addition or concatenation. Here, the spatial and spectral features are added with a certain weight, which is constantly adjusted by back-propagation during the training process. Both fusion operations are tested experimentally, and the results are detailed in Section 5.5. Once the fusion is finished, the feature maps flow through the fully connected layer and the SoftMax activation function, and finally the classification result is obtained.
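The two fusion options can be sketched as follows; the scalar weight w is an illustrative stand-in for the learnable weight adjusted by back-propagation, and whether the learned weight is a scalar or per-channel is an assumption here.

```python
import numpy as np

f_spec = np.ones(64)        # pooled spectral-branch features (toy values)
f_spat = np.full(64, 2.0)   # pooled spatial-branch features (toy values)

# Option 1: concatenation doubles the feature length.
fused_cat = np.concatenate([f_spec, f_spat])

# Option 2: weighted addition keeps the length and mixes the branches.
w = 0.3
fused_add = w * f_spec + (1 - w) * f_spat

print(fused_cat.shape, fused_add[0])   # (128,) 1.7
```

Concatenation preserves both feature sets at the cost of a wider fully connected layer, while weighted addition keeps the classifier input compact.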



A New Activation Function
The activation function is an important element in a deep neural network, and the rectified linear unit (ReLU) is often favored. Recently, Mish [46], a self-regularized non-monotone activation function, has received increasing attention. The formula for Mish is as follows:

Mish(x) = x · tanh(ln(1 + e^x)), (9)

where x is the input of the activation function.
The graphs of Mish and ReLU can be seen in Figure 8. Unlike ReLU, Mish allows small negative inputs to flow in, improving model performance while preserving network sparsity, instead of pruning all negative inputs. Moreover, Mish is a smooth, continuously differentiable function, which is beneficial to optimization and generalization.
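A NumPy sketch of Mish next to ReLU makes the difference concrete: small negative inputs survive Mish but are zeroed by ReLU. (The naive softplus below can overflow for very large inputs; it suffices for illustration.)

```python
import numpy as np

def mish(x):
    # Mish(x) = x * tanh(softplus(x)); log1p(exp(x)) is the softplus term
    return x * np.tanh(np.log1p(np.exp(x)))

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.0])
print(mish(0.0))    # 0.0
print(mish(x))      # small negatives pass through slightly attenuated
print(relu(x))      # all negatives are pruned to zero
```

Note that Mish(-0.5) is a small negative value rather than zero, which is exactly the non-monotone behavior the text credits for retaining information from negative pre-activations.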


Other Training Tricks
To mitigate the overfitting problem, dropout [48] is a typical strategy. Given a percentage p, which is selected as 0.5 in the proposed model, the network would drop out hidden or visible units temporarily. In the case of stochastic gradient descent, a new network is trained in each mini-batch due to the property of random dropping. Moreover, dropout can make only a few units in the network possess high activation ability, which is conducive to the sparsity of the network. In our framework, a dropout layer is applied after the EMA block.
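An inverted-dropout sketch with the p = 0.5 used in the proposed model follows; the function name is hypothetical.

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout: kept units are scaled by 1/(1-p) during training,
    so no rescaling is needed at inference time."""
    if not train:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p   # drop each unit with probability p
    return x * mask / (1.0 - p)

x = np.ones(1000)
y = dropout(x, p=0.5, rng=np.random.default_rng(0))
print(dropout(x, train=False) is x)   # True: inference leaves the input untouched
```

With p = 0.5, each surviving unit is doubled so the expected activation matches the inference-time behavior.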
In addition, the early stopping strategy and the dynamic learning rate adjustment method are also adopted to accelerate network training. Specifically, early stopping means halting the training if the loss function no longer decreases for a number of consecutive epochs (20 in our method). Dynamic learning rate adjustment means that we vary the learning rate during training to prevent the model from being trapped in a local optimum. Herein, we use the cosine annealing [49] strategy, which is formulated as follows:

η_t = η^i_min + (1/2)(η^i_max − η^i_min)(1 + cos(π · T_cur / T_i))

where η_t is the learning rate for the i-th run, η^i_min and η^i_max are the lower and upper bounds of the learning rate, T_cur denotes how many epochs have been executed since the last restart, and T_i represents the number of epochs in one restart cycle.
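The schedule above can be written as a small helper (an illustrative sketch; the variable names and the default η_max of 0.0005, matching our initial learning rate, are our own choices):

```python
import math

def cosine_annealing_lr(t_cur, t_i, eta_min=0.0, eta_max=5e-4):
    """Cosine annealing with warm restarts (SGDR): the learning rate
    decays from eta_max to eta_min over one restart cycle of t_i
    epochs, where t_cur counts epochs since the last restart."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

At the start of a cycle (t_cur = 0) the rate equals η_max; at the end (t_cur = t_i) it reaches η_min, after which a restart resets it.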

Datasets Description
In the experiments, four publicly available datasets are used: the Indian Pines (IP) dataset, the Pavia University (UP) dataset, the Salinas Valley (SV) dataset and the Botswana (BS) dataset. The performance of deep-learning-based models strongly depends on the data: generally, the more labeled data used for training, the better the model performs. Currently, many HSI classification methods can achieve almost 100% accuracy with sufficient training samples, so model performance under a lack of training samples is particularly noteworthy. Therefore, the sizes of the training and validation sets in the experiments are set relatively small to challenge the proposed model. In addition, to compare conveniently with previous methods, we follow the settings in [35], i.e., the proportions of samples for training and for validation are both set to 3% for IP, 0.5% for UP and SV, and 1.2% for BS.

Experimental Configuration
All experiments were executed on the same platform, configured with an Intel Core i7-8700K processor at 3.70 GHz, 32 GB of memory and an NVIDIA GeForce GTX 1080Ti GPU. The software environment is Windows 10 Home (64-bit) with the PyTorch deep-learning framework.
Optimization is performed by the Adam optimizer with a batch size of 16 and a learning rate of 0.0005. To assess the results quantitatively, three metrics are adopted: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient.
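All three metrics can be computed from the confusion matrix; the following NumPy sketch (the function name is ours) reflects the standard definitions we assume:

```python
import numpy as np

def metrics(confusion):
    """OA, AA and Kappa from a square confusion matrix whose rows are
    ground-truth classes and columns are predicted classes."""
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    oa = np.trace(c) / n                          # overall accuracy
    aa = np.mean(np.diag(c) / c.sum(axis=1))      # mean per-class accuracy
    pe = (c.sum(axis=0) * c.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                  # Cohen's kappa
    return oa, aa, kappa
```

OA measures global agreement, AA averages the per-class recalls (so rare classes count equally), and Kappa discounts the agreement expected by chance.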
To assess the effectiveness of our approach, several methods are adopted for comparison. The SVM with a radial basis function (RBF) kernel [6] is selected as a representative of the traditional methods. CDCNN [22], SSRN [28] and FDSSC [29] are chosen on behalf of the deep-learning-based approaches. DBMA [33] and DBDA [35], which have a two-branch structure similar to our model, are selected as state-of-the-art double-branch models. The parameters of each model are set according to the original papers. Since the codes of these methods are publicly available, the classification results on the four datasets are obtained from our own replication. For a fair comparison, all algorithms are executed ten times and the best results are retained.

Classification Results for the IP Dataset
The accuracy for the IP dataset obtained by different methods is shown in Table 1, where the best accuracy is in bold for each category and for the three metrics. The corresponding classification maps are also illustrated in Figure 9.
The proposed model yields the best results, i.e., 95.90% in OA, 96.19% in AA and 0.9532 in Kappa, as shown in Table 1. CDCNN obtains the lowest accuracy since the training samples are too limited for the 2DCNN-based model. Compared with CDCNN, SVM performs a little better; however, the salt-and-pepper noise is quite severe, as shown in Figure 9b. Owing to the integration of spatial and spectral information by 3DCNN, both SSRN and FDSSC are far superior to SVM and CDCNN, exceeding them by almost 20% in OA. Furthermore, FDSSC draws on the dense connection, resulting in better performance. DBMA and DBDA follow basically the same idea, i.e., two branches are used to extract spectral and spatial features and an attention mechanism is introduced. However, they are prone to overfitting when the training samples are limited. Moreover, the attention mechanisms they use are simple and cannot distinguish different classes well. In contrast, our proposed model not only uses two branches to extract features, but also introduces an attention mechanism based on the EM algorithm, which can iteratively update the attention map and reduce the intra-class feature variance, thus making it easier to distinguish targets of different classes. As can be seen in Table 1, our model performs in a balanced and excellent manner on every category, with no extremely low scores. This demonstrates the superior discriminative capability of our model for each category.

Classification Results for the UP Dataset
The accuracy for the UP dataset obtained by different methods is shown in Table 2, where the best accuracy is in bold for each category and for the three metrics. The corresponding classification maps are also illustrated in Figure 10. As shown in Table 2, our method achieves the best results on all three metrics. In particular, the improvement over the second-best model, DBDA, is +1.29% in OA, +1.37% in AA and +1.74% in Kappa. For individual classes, our method obtains the best results in 5 out of 9 classes. It is worth noting that class 8, represented by the dark gray line in Figure 10a, is the most difficult to classify because its regions are too slender for models to capture. Only DBMA, DBDA and our method achieve an accuracy of over 80% on this category, which illustrates the advantage of the attention mechanism in capturing fine features. Moreover, only our method exceeds 90%, indicating that the attention mechanism adopted by our model stands out.

Classification Results for the SV Dataset
The accuracy for the SV dataset obtained by different methods is shown in Table 3, where the best accuracy is in bold for each category and for the three metrics. The corresponding classification maps are also illustrated in Figure 11. Again, the proposed model obtains the best results, with 98.33% OA, 98.91% AA and 0.9814 Kappa. On class 15, none of the methods achieves over 90% accuracy except ours. This can be observed in Figure 11: if we concentrate on the yellow area and the gray area in the upper left corner of the classification maps, it can be found that these two areas interfere with each other severely in all the models except ours.

Classification Results for the BS Dataset
The accuracy for the BS dataset obtained by different methods is shown in Table 4, where the best accuracy is in bold for each category and for the three metrics. The corresponding classification maps are also illustrated in Figure 12.
Since the BS dataset is small and only with 3248 labeled samples, training samples may be scarce for the model. Nevertheless, the proposed method yields the best results, which demonstrates the competency of our method in exploiting spectral information and spatial information.

Discussion
In this part, more experiments are carried out to comprehensively discuss the impacts and capabilities of the relevant components in the proposed model.

Investigation of the Proportion of Training Samples
It is well known that the amount of training samples significantly affects the performance of deep-learning models. In this section, we randomly select 0.5%, 1%, 3%, 5% and 10% of samples as training sets to investigate the performance of different models with different proportion of training data. The experimental results are illustrated as Figure 13.
As expected, the OA of all methods improves as the percentage of training data increases. Moreover, all three approaches using 3DCNN consistently outperform CDCNN, which uses only 2DCNN, and the traditional SVM. Also, the discrepancy between these methods narrows as the amount of training samples increases. Please note that our proposed method obtains the best results regardless of the proportion, especially when the samples are insufficient.

Investigation of the Attention Mechanism
Our model integrates spectral attention and spatial attention. In this section, we test the effectiveness of these attention modules. Specifically, we consider a PyConv-only network without any attention module as a baseline (denoted as Plain). It is a simple double-branch model that extracts spatial and spectral features separately. Moreover, we denote the three derivatives, i.e., the subnetwork integrated with spectral attention, the subnetwork integrated with spatial attention and the subnetwork integrated with both, as Plain+SpeAtt, Plain+SpaAtt and Plain+SSAtt, respectively. Figure 14 shows the comparison of the classification results of the different networks in terms of OA, AA and Kappa, where different colors indicate different subnetworks. From the figure, we can see that either spectral attention or spatial attention, once integrated into the network, contributes to the performance of the original network. This confirms the effectiveness of the proposed attention module. In addition, we can observe that Plain+SSAtt outperforms all the other subnetworks. This implies that spectral attention and spatial attention complement each other to contribute more to the final classification decision.
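To make the iterative refinement concrete, the following is a deliberately simplified NumPy sketch of the EM-attention idea (the names, shapes and normalization step are our own illustrative choices, not the paper's exact implementation):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def em_attention(x, mu, n_iter=3):
    """Schematic expectation-maximization attention.

    x:  (N, C) flattened feature map; mu: (K, C) bases.
    The E-step estimates each basis's responsibility for each feature;
    the M-step re-estimates the bases as responsibility-weighted means.
    The refined map is the low-rank reconstruction z @ mu."""
    for _ in range(n_iter):
        z = softmax(x @ mu.T, axis=1)           # E-step: (N, K)
        mu = z.T @ x / z.sum(axis=0)[:, None]   # M-step: weighted mean
        mu /= np.linalg.norm(mu, axis=1, keepdims=True) + 1e-6  # stabilize
    return z @ mu                               # refined features
```

Because K is much smaller than N, the reconstruction suppresses intra-class variation, which is the effect the ablation above measures.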

Ablation Study for Iteration Number of the Attention Map
The number of iterations in the EMA block affects the performance of the model. We plot the trend of the three metrics OA, AA and Kappa against the iteration number in Figure 15. We fix the number of bases at 16 and vary the iteration number to investigate its effect on model performance. From Figure 15, it can be seen that the model performance basically converges at first, and then the metrics start to oscillate. Specifically, the three metrics on the IP dataset are very unstable due to the unique nature of IP, i.e., its spatial size is relatively small but it contains 16 categories, which makes the classification more arduous. Accordingly, the iteration number is set to 2 on IP, and to 3 on UP, SV and BS.

Comparison of the Activation Function
In this article, a new activation function, Mish, is introduced to enhance the performance of the model. The comparison of performance between Mish and ReLU is illustrated in Figure 16. It is obvious that if the proposed model adopts Mish as the activation function instead of ReLU, the OA is improved to a certain extent.

Comparison of Different Feature Fusion
As illustrated in Figure 17, concatenation performs better overall. This is as expected, since the spectral and spatial features lie in independent feature spaces. If the addition operation is adopted instead of concatenation, information from different domains tends to be intermixed or even to interfere. However, we can see that the addition operation outperforms the concatenation operation on the IP dataset. At first this was attributed to coincidence; nonetheless, the results remained the same after repeated experiments. Such results may be caused by the specificity of the IP dataset, which has been mentioned in Section 5.3. In contrast to the concatenation operation, weighted addition can use weights to adjust the influence of spectral and spatial information on the classification results. This property is probably more useful for the IP dataset, since its spatial size is the smallest while its numbers of spectral bands and categories are large.
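The two fusion strategies can be sketched as follows (a minimal NumPy illustration with our own function name; in the model the fusion is applied to the learned branch features and the addition weight is tunable):

```python
import numpy as np

def fuse(spectral, spatial, mode="concat", w=0.5):
    """Fuse the two branch outputs before the fully connected layer.

    'concat' keeps spectral and spatial features in separate
    dimensions; 'add' mixes them with a weight w, trading off the
    influence of the two domains."""
    if mode == "concat":
        return np.concatenate([spectral, spatial], axis=-1)
    return w * spectral + (1.0 - w) * spatial
```

Concatenation doubles the feature dimension but keeps the domains separate, whereas weighted addition keeps the dimension fixed at the cost of mixing them.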

Investigation of Running Time
A decent method should achieve a favorable accuracy-efficiency trade-off. In this section, we compare the time costs of the seven algorithms on the IP, UP, SV and BS datasets. Tables 5-8 show the training time and test time of the seven methods on the four datasets. From Tables 5-8, it can be observed that SVM spends less training and testing time than the DL-based methods in most cases. Furthermore, since 2D convolution requires fewer parameters and less computation than 3D convolution, CDCNN, as a representative of 2D-CNN-based methods, is faster than the 3D-CNN-based methods. Among the five 3D-CNN-based methods, the training and testing time of our method is moderate. Considering that the accuracy of our method is promising, it can be concluded that the proposed method strikes a better balance between accuracy and efficiency.

Conclusions
In this paper, we propose a novel HSI classification method that consists of a double-branch network with pyramidal convolution and an iterative attention mechanism. First, the input of the whole framework is not subjected to dimensionality reduction such as PCA; the original 3D data is cropped into 3D cubes as input. Then, two branches are constructed with two novel techniques, namely pyramidal convolution and an iterative attention mechanism, EM attention, to extract spectral features and spatial features, respectively. Meanwhile, a new activation function, Mish, is introduced to accelerate network convergence and improve network performance. Finally, with several experiments, we analyze our model from multiple perspectives and demonstrate that the proposed model yields the best or competitive results on four datasets in comparison with other algorithms.
A future direction of our work is to explore better attention mechanisms to obtain finer feature representations. Furthermore, it seems interesting to further reduce the data requirements with new techniques such as few-shot learning or zero-shot learning.

Conflicts of Interest:
The authors declare no conflict of interest.