A Hybrid-Order Spectral-Spatial Feature Network for Hyperspectral Image Classification

Abstract: Convolutional neural networks are widely applied in hyperspectral image (HSI) classification and show excellent performance. However, two challenges remain: first, fine features are generally lost in the process of depth transfer; second, most existing studies resort to first-order features and rarely consider second-order representations. To tackle these two problems, this article proposes a hybrid-order spectral-spatial feature network (HS²FNet) for hyperspectral image classification. This framework consists of a precedent feature extraction module (PFEM) and a feature rethinking module (FRM). The former is constructed to capture multiscale spectral-spatial features and to adaptively recalibrate channel-wise and spatial-wise feature responses, achieving first-order spectral-spatial feature distillation. The latter is devised to heighten the representative ability of HSI by capturing the importance of feature cross-dimension interactions, while learning more discriminative representations by exploiting the second-order statistics of HSI, thereby improving the classification performance. Extensive experiments demonstrate that the proposed network achieves promising results compared with state-of-the-art classification methods.


Introduction
A hyperspectral image (HSI), which is generally captured by imaging spectrometers or hyperspectral remote sensing sensors, comprises hundreds of narrow spectral bands and abundant spatial distribution information [1,2]. Because of this rich spatial and spectral information, HSI has been widely applied in various fields, such as mineral exploitation [3], environmental science [4], military defense [5] and urban development [6]. HSI classification, which aims to assign an accurate ground-truth label to each hyperspectral pixel, has become a hot topic in the field of remote sensing in recent years [7][8][9][10].
In the initial phase, a substantial number of classification methods using machine learning (ML) were presented. These methods usually perform feature extraction and then feed the obtained features into classifiers [11], such as a support vector machine (SVM) [12], multinomial logistic regression [13] or random forest [14], for training. Yuan et al. utilized a clustering method to split the spectral bands into several sets and selected important bands to accomplish classification tasks [15]. An SVM with a nonlinear kernel projection method was proposed [16]. The Markov random field (MRF) was combined with band selection for classification [17]. However, these traditional spectral-feature-based classification methods do not fully exploit the spatial information of HSI. Therefore, classification methods based on spectral-spatial features have been proposed to improve HSI classification performance. A set of complex Gabor wavelets was used to capture spectral-spatial features [18]. Li et al. integrated local spatial features, global spatial features and original spectral features for HSI classification [19]. To obtain spectral and spatial information, extended morphological profiles (EMPs) [20], multi-kernel learning (MKL) [21] and the sparse representation-based classifier (SRC) [22] introduced spatial information into the training process. He et al. systematically reviewed the conventional spectral-spatial-based classification methods [23]. Paul et al. proposed a particle swarm optimization-based unsupervised dimensionality reduction method for HSI classification, where spectral and spatial information is utilized to select informative bands [24]. Nevertheless, these ML-based HSI classification methods, whether based on spectral features or on spectral-spatial features, rely on handcrafted features with limited representation ability, which leads to poor representation and generalization ability.
With the breakthrough of deep learning in the field of artificial intelligence, deep learning (DL)-based HSI classification methods have attracted substantial attention and have become a hotspot. Compared with ML-based classification methods, DL-based methods not only extract abstract features from the low level to the high level but also transform images into more recognizable features, which can provide a fine classification result. Typical DL-based methods involve deep belief networks (DBNs) [25], stacked autoencoders (SAEs) [26], recurrent neural networks (RNNs) [27] and convolutional neural networks (CNNs) [28], which greatly enhance the HSI classification performance. Due to the characteristics of local connection and weight sharing, CNNs have been gradually introduced into HSI classification and have shown promising performance. Hu et al. utilized a 1-D CNN to extract spectral information to accurately classify HSI [29]. A unified framework that combined SAM with a CNN was presented to adaptively learn weight features for each hyperspectral pixel by 1-D convolutional layers [30]. However, the input data of 1-D CNN-based methods must be flattened into 1-D vectors, which disregards the rich spatial features of HSI.
To further boost the classification performance, many 2-D CNN-based and 3-D CNN-based methods have been proposed to capture spectral and spatial features. Yang et al. designed a two-branch CNN to learn joint spectral and spatial features [31]. Zhang et al. constructed a novel spectral-spatial-semantic network with multiway attention mechanisms for HSI classification [32]. Li et al. built a deep multilayer fusion dense network to obtain the correlation between spectral information and spatial information [33]. A supervised 2-D CNN consisting of three 2-D convolutional layers and a 3-D CNN composed of three 3-D convolutional layers were introduced for classification [34]. Although these classification methods have achieved considerable progress, further enhancing HSI classification accuracy with limited training samples is still challenging. Inspired by residual learning [35] and dense connections [36], Meng et al. presented two mixed link networks to obtain more representative features from HSI, aggregating the characteristics of feature reuse and exploration [37]. Paoletti et al. utilized a dense and deep 3-D CNN to take full advantage of HSI information [38]. Li et al. introduced the maximum correntropy criterion to generate a robust 3D-CapsNet [39]. To deal with the spectral similarity between HSI cubes of spatially adjacent categories, Mei et al. built a cascade residual capsule network for HSI classification [40]. To obtain global information and reduce computational cost, Yu et al. designed a dual-channel convolutional network for HSI classification [41]. However, the spectral-spatial features processed by convolutional layers may include much disturbing or unimportant information. Therefore, focusing on informative regions while suppressing useless regions is still an essential problem.
Recently, inspired by the attention mechanism of human visual perception, many effective and classical attention modules have been introduced into CNNs to ameliorate HSI classification performance. Xi et al. designed 3-D squeeze-and-excitation residual blocks in each stream of their framework to learn spectral and spatial features from the low level to the high level [42]. To obtain a complex spectral-spatial distribution, Feng et al. devised a symmetric GAN by utilizing an attention mechanism and collaborative learning [43]. Sun et al. applied attention modules to extract features from HSI cubes [44]. Ma et al. proposed a double-branch multi-attention mechanism network [45]. Xiang et al. constructed an end-to-end multilevel hybrid attention framework consisting of a 3D-CNN, a grouped residual 2D-CNN and coordinate attention [46]. Zhang et al. devised a spectral partitioning residual network with a spatial attention mechanism [47]. Ge et al. introduced an adaptive hash attention mechanism to properly extract the spectral and spatial information [48]. Wang et al. presented a spatio-spectral attention to boost the expression of the important characteristics among all pixels [49]. To enhance robustness to HSI rotation, Yang et al. designed a cross-attention spectral-spatial network [50]. Although these methods can achieve good classification results, they usually only consider first-order statistical features and rarely learn second-order representations to enhance HSI classification performance.
Modeling second-order or higher-order statistics to obtain discriminative representations has attracted considerable interest in deep CNNs. In particular, global second-order pooling can fully utilize the correlation information between different channels, achieving significant performance improvement [51][52][53]. Moreover, global second-order pooling has attracted much attention in the field of HSI classification. He et al. employed multiscale covariance maps to fully exploit the spectral and spatial information [54]. Zheng et al. proposed a mixed CNN with covariance pooling to integrate spectral and spatial features [55]. Xue et al. designed a novel second-order pooling network based on an attention mechanism [56].
In this paper, motivated by the abovementioned advanced CNN models, we propose a hybrid-order spectral-spatial feature network (HS²FNet) constructed from a precedent feature extraction module (PFEM) and a feature rethinking module (FRM) for hyperspectral image classification. The former is composed of several symmetrical feature extraction blocks (SFEBs) and a distillation block (DB), which captures first-order spectral-spatial features. The latter models more discriminative second-order spectral-spatial representations, which further refine the first-order spectral-spatial features obtained from the PFEM, thereby enhancing the classification performance. Specifically, we first design an SFEB to extract spectral-spatial features from different scales and layers. Considering that the connected multiscale features are beneficial for HSI classification, a DB is then built to adaptively recalibrate hierarchical features by strengthening meaningful channels and attending to the informative regions of the spatial dimension, which eliminates redundant information and achieves first-order feature distillation. Furthermore, to improve classification accuracy, we design an FRM to model more representative second-order spectral-spatial features, which exploits the importance of feature cross-dimension interactions and the second-order statistics of HSI. Finally, we utilize two fully connected layers, two dropout layers and a soft-max layer to perform classification.
The main contributions of this paper are summarized as follows:
1. We design a symmetrical feature extraction block to capture spectral-spatial features from different scales and layers, while maximizing the use of HSI feature flows between different scales.
2. To dispel redundant information and noise, a distillation block is devised, which adaptively recalibrates channel-wise and spatial-wise feature responses to achieve first-order spectral-spatial feature distillation.
3. We build a feature rethinking module to model more discriminative second-order spectral-spatial features, which further refines the first-order features by capturing the importance of feature cross-dimension interactions and exploiting the second-order statistics of HSI, thereby improving the classification performance.
The rest of this paper is organized as follows. In Section 2, we introduce the proposed hybrid-order spectral-spatial feature network in detail. In Section 3, the experimental results and discussion are presented. Finally, the conclusion and future research direction are given in Section 4.

Method
The overall structure of the proposed network, using the Indian Pines (IP) dataset as an example, is illustrated in Figure 1. This work utilizes the precedent feature extraction module (PFEM) to obtain first-order spectral-spatial information. The PFEM consists of several symmetrical feature extraction blocks (SFEBs) and a distillation block (DB). The feature rethinking module (FRM) is employed to model more discriminative second-order spectral-spatial representations. The FRM can further refine and sublimate the first-order spectral-spatial features captured by the PFEM, thereby boosting accurate and efficient classification.
In Figure 1, a PCA transformation is first applied to the raw HSI to retain the important spectral bands and effectively reduce the memory required for computation. To make full use of the properties of HSI, we extract a 3-D image cube of size 21 × 21 × 40 as the input of our proposed method. Then, the 21 × 21 × 40 image cube is sent to the initial module, comprising a 3 × 3 convolutional layer, a batch normalization (BN) layer and a rectified linear unit (ReLU), to obtain the general feature maps. Furthermore, the acquired feature maps are fed into nine cascaded SFEBs to detect multiscale spectral-spatial features. The multiscale spectral-spatial contextual information is beneficial for HSI classification. However, if the multiscale features obtained from the nine SFEBs are simply concatenated together, noise and unnecessary information may be introduced. Therefore, we design a DB to adaptively recalibrate hierarchical features by strengthening meaningful channels and attending to the informative regions of the spatial dimension. Next, the first-order hierarchical distillation features are transmitted to the FRM to acquire more discriminative second-order representations by capturing the importance of feature cross-dimension interactions and exploiting the second-order statistics of HSI.
Finally, classification is performed through two fully connected layers, two dropout layers and a soft-max layer. More details on PFEM and FRM are provided next.
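The end-to-end flow described above can be sketched as a minimal PyTorch skeleton. The SFEB, DB and FRM internals are simplified to stand-in layers here (their structures are detailed in the following subsections), and all channel widths are illustrative assumptions; only the order of operations and the 21 × 21 × 40 input shape follow the text.

```python
import torch
import torch.nn as nn

class HS2FNetSkeleton(nn.Module):
    """Shape-level sketch of the HS²FNet pipeline (stand-in sub-blocks)."""

    def __init__(self, bands=40, feats=64, num_classes=16):
        super().__init__()
        # initial module: 3x3 conv + BN + ReLU
        self.init_module = nn.Sequential(
            nn.Conv2d(bands, feats, 3, padding=1),
            nn.BatchNorm2d(feats),
            nn.ReLU(inplace=True),
        )
        # nine cascaded SFEBs (stand-ins: 3x3 conv blocks)
        self.sfebs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(feats, feats, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(9)
        )
        # distillation block stand-in: 1x1 conv fusing the concatenated hierarchy
        self.db = nn.Conv2d(9 * feats, feats, 1)
        # feature rethinking module stand-in
        self.frm = nn.Conv2d(feats, feats, 1)
        # classifier: two FC layers with two dropout layers (soft-max via logits)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(feats * 21 * 21, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):                      # x: (B, 40, 21, 21) PCA cube
        x = self.init_module(x)
        hierarchy = []
        for sfeb in self.sfebs:                # collect the hierarchical features
            x = sfeb(x)
            hierarchy.append(x)
        x = self.db(torch.cat(hierarchy, dim=1))
        x = self.frm(x)
        return self.classifier(x)              # class logits
```

A forward pass on a batch of PCA-reduced 21 × 21 × 40 cubes yields one logit vector per pixel-centered patch.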

Precedent Feature Extraction Module
Spectral-spatial feature extraction is the most crucial step in HSI classification, and the features obtained from different layers have different characteristics. Low-level features have higher resolution and contain more details but are full of noise, while high-level features have stronger semantic information but lack the perception capability for details. Low-level and high-level features are complementary, and combining the two is helpful for HSI classification. Therefore, we build an SFEB to capture multiscale spectral-spatial features and maximize the use of HSI feature flows at different scales. Multi-level features are significantly different from each other, and taking full advantage of hierarchical spectral-spatial features is also vital for HSI classification. Unfortunately, hierarchical features may introduce noise and redundant information, which makes the network more difficult to train and degrades the classification performance. To address this problem, we design a DB that effectively utilizes hierarchical features while dispelling noise and redundant information, thereby achieving first-order spectral-spatial feature distillation.

Symmetrical Feature Extraction Block
With the development of DL networks, a progressive structure named DenseNet was proposed to alleviate the problems of overfitting and gradient vanishing [36]. Specifically, the feature maps learned by each layer are connected and input into all subsequent layers, which enhances information flow. ResNet can be built by stacking micro-blocks sequentially, which not only solves the degradation problem but also increases training speed [35]. Many studies have proved that making the most of multiscale spectral-spatial features can effectively improve HSI classification performance [57][58][59]. Inspired by the above advantages of DL networks, we propose an SFEB comprising three parts: a symmetrical multiscale dense link unit to integrate spectral-spatial features from different scales and layers, cross transmission to facilitate the propagation of information between different scales, and local skip transmission to avoid unnecessary loss of previous features. The proposed SFEB is provided in Figure 2.


Symmetrical Multiscale Dense Link Unit
As shown in Figure 2, the SFEB utilizes two convolutional kernels of different sizes, 3 × 3 and 5 × 5, to acquire spectral-spatial features at different scales. To better explain the working mechanism, we divide the SFEB into two parts: a Top-Link unit and a Bottom-Link unit. These units not only reduce the depth of the network but also apply 1 × 1 convolutional layers for feature fusion. In the operations of the Top-Link unit, M_{n−1} and Z_{sout} denote the input and output of the Top-Link unit, respectively, and ω stands for the weights of a convolutional layer; superscripts refer to the layer index, and subscripts refer to the size of the convolutional kernel used in that layer.
To obtain spectral-spatial features at a different scale, we use 5 × 5 convolutional layers in place of all the 3 × 3 convolutional layers in the Bottom-Link unit, whose operations mirror those of the Top-Link unit; here, M_{n−1} and Z_{pout} denote the input and output of the Bottom-Link unit, respectively.

Cross Transmission
The transfer and fusion of features at different scales has a significant impact on feature extraction. Therefore, we introduce cross transmission into the Top-Link and Bottom-Link units, which transmits the spectral-spatial features of each scale to the other subset. Moreover, to reduce the dimension and achieve multiscale spectral-spatial feature fusion, we employ a 1 × 1 convolutional layer; in these operations, M_{n−1} and Z_1 denote the input and output of the SFEB, respectively. The cross transmission and feature fusion facilitate the exchange of spectral-spatial features at different scales and achieve multiscale feature fusion.

Local Skip Transmission
We also introduce local skip transmission into our constructed SFEB to achieve reasonable feature reuse and strengthen information propagation. The output of the SFEB is obtained by adding the fused multiscale features Z_1 to the block input M_{n−1}.
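Putting the three parts above together, one SFEB can be sketched as follows. The number of dense layers per link (two here) and the channel width are illustrative assumptions; the paper's Figure 2 fixes the actual counts.

```python
import torch
import torch.nn as nn

class SFEB(nn.Module):
    """Sketch of a symmetrical feature extraction block: a 3x3 Top-Link unit
    and a 5x5 Bottom-Link unit with cross transmission between the scales,
    1x1 fusion, and a local skip connection."""

    def __init__(self, ch=64):
        super().__init__()
        self.top1 = nn.Conv2d(ch, ch, 3, padding=1)       # Top-Link, 3x3 scale
        self.top2 = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.bot1 = nn.Conv2d(ch, ch, 5, padding=2)       # Bottom-Link, 5x5 scale
        self.bot2 = nn.Conv2d(2 * ch, ch, 5, padding=2)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)              # 1x1 multiscale fusion
        self.act = nn.ReLU(inplace=True)

    def forward(self, m):                                 # m: input M_{n-1}
        t1 = self.act(self.top1(m))
        b1 = self.act(self.bot1(m))
        # cross transmission: each scale also receives the other scale's output
        t2 = self.act(self.top2(torch.cat([t1, b1], dim=1)))
        b2 = self.act(self.bot2(torch.cat([b1, t1], dim=1)))
        z1 = self.fuse(torch.cat([t2, b2], dim=1))        # fused features Z_1
        return z1 + m                                     # local skip transmission
```

Because the skip connection preserves the channel count, nine such blocks can be cascaded directly, as in the PFEM.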

Distillation Block
During the training process, spectral-spatial features gradually disappear as the network depth increases. Therefore, fully exploiting the concatenated features is conducive to improving classification accuracy. However, only using 1 × 1 convolutional layers to concatenate and compress these hierarchical features may produce massive redundant information and be adverse to HSI classification. In this section, we present a DB to effectively utilize hierarchical spectral-spatial features and achieve first-order feature distillation. The DB consists of two principal branches: a channel-wise recalibrate branch and a spatial-wise recalibrate branch. The former adaptively recalibrates channel-wise features by strengthening meaningful channels. The latter adaptively recalibrates spatial-wise features by attending to the informative regions of the spatial dimension. The proposed DB is provided in Figure 3.
As shown in Figure 3, first, to reduce the number of feature maps without losing fine features, we introduce two 1 × 1 convolutional layers at the head and tail of the DB, respectively. Then, the DB splits the spectral-spatial features obtained from the first 1 × 1 convolutional layer into two branches, i.e., a channel-wise recalibrate branch and a spatial-wise recalibrate branch. Finally, we employ a simple concatenation operation to integrate the two branches into one new group.
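The wiring just described (head 1 × 1 conv, channel split, two branches, concatenation, tail 1 × 1 conv) can be sketched as follows; the two branch internals are detailed in the subsections below and are left here as pluggable stand-ins, with illustrative channel counts.

```python
import torch
import torch.nn as nn

class DistillationBlock(nn.Module):
    """Skeleton of the DB: 1x1 head conv, channel split into the two
    recalibrate branches, concatenation, and a 1x1 tail conv."""

    def __init__(self, in_ch, g, channel_branch=None, spatial_branch=None):
        super().__init__()
        self.head = nn.Conv2d(in_ch, g, 1)     # compress the concatenated hierarchy
        self.channel_branch = channel_branch or nn.Identity()
        self.spatial_branch = spatial_branch or nn.Identity()
        self.tail = nn.Conv2d(g, g, 1)

    def forward(self, x):
        x = self.head(x)
        x1, x2 = torch.chunk(x, 2, dim=1)      # split along the spectral dimension
        y = torch.cat([self.channel_branch(x1),
                       self.spatial_branch(x2)], dim=1)
        return self.tail(y)
```

Passing the real branch modules in place of the identity stand-ins yields the full DB.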


Feature Splitting
The input spectral-spatial features of the DB are denoted as X ∈ R^(H×W×g), where H and W refer to the height and width of the spatial dimension, respectively, and g represents the number of channels. The DB divides X into two branches along the spectral dimension, X = [x_1, x_2] with x_1, x_2 ∈ R^(H×W×(g/2)), where x_1 and x_2 are fed into the channel-wise recalibrate branch and the spatial-wise recalibrate branch, respectively.

Channel-Wise Recalibrate Branch
To refine the channel weights of the spectral-spatial feature maps, we design a channel-wise recalibrate branch; its structure is shown in Figure 3. First, global average pooling is utilized to average over the spatial dimensions of the feature maps, yielding features of size 1 × 1 × (g/2). Second, the feature tensors are sent to a 1 × 1, (g/2)/r 2-D convolutional layer to reduce the channel dimension and focus on the meaningful channels, where r is the channel compression ratio. Then, we use a ReLU activation layer to strengthen the nonlinear relationships between channels. Next, a 1 × 1, g/2 2-D convolutional layer is utilized to restore the channel dimension and obtain the g/2 spectral-spatial feature maps. Finally, we apply a sigmoid function to limit the range of the weights, and the original features are multiplied by the weight coefficients to obtain the refined channel features.
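The steps above map directly onto a small squeeze-and-excitation-style module; a sketch follows, with the half-channel count g/2 passed in as a parameter.

```python
import torch
import torch.nn as nn

class ChannelRecalibrate(nn.Module):
    """Channel-wise recalibrate branch: GAP -> 1x1 conv (reduce by r) ->
    ReLU -> 1x1 conv (restore) -> sigmoid gate multiplied onto the input."""

    def __init__(self, half_g, r=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                # -> 1 x 1 x (g/2)
        self.reduce = nn.Conv2d(half_g, half_g // r, 1)   # focus on key channels
        self.relu = nn.ReLU(inplace=True)
        self.expand = nn.Conv2d(half_g // r, half_g, 1)   # restore g/2 channels
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        w = self.sigmoid(self.expand(self.relu(self.reduce(self.gap(x)))))
        return x * w                                      # refined channel features
```

The gate w lies in (0, 1) per channel, so multiplication rescales rather than replaces the input features.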

Spatial-Wise Recalibrate Branch
To attend to the informative regions of the spatial dimension, we propose a spatial-wise recalibrate branch; its structure is shown in Figure 3. First, we utilize two 2-D convolutional layers, with C filters of 3 × 3 size and 2 × C filters of 3 × 3 size, to reduce the spatial dimension of the features and obtain feature maps involving important location information. Then, two transposed convolutional layers with C filters of 3 × 3 size are introduced to restore the original size. The transposed convolutional layers not only maintain the mapping relationship of spatial locations but are also vital for the subsequent weight optimization process. Here, each convolutional layer is followed by a ReLU activation function, which is used to enhance the nonlinear correlation. Finally, a sigmoid activation function is applied to limit the spectral-spatial feature maps to the [0, 1] range, and the output is multiplied with the feature maps to guarantee that the input of the next layer is optimal.
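A sketch of this branch follows. The stride-2 downsampling and the final interpolation guard (to recover odd spatial sizes such as 21) are assumptions not stated in the text; the filter counts (C, 2C, then C, C) follow the description above.

```python
import torch
import torch.nn as nn

class SpatialRecalibrate(nn.Module):
    """Spatial-wise recalibrate branch: two strided 3x3 convs shrink the
    spatial dims, two 3x3 transposed convs restore them, and a sigmoid gate
    in [0, 1] is multiplied back onto the input features."""

    def __init__(self, c):
        super().__init__()
        self.down1 = nn.Conv2d(c, c, 3, stride=2, padding=1)        # C filters
        self.down2 = nn.Conv2d(c, 2 * c, 3, stride=2, padding=1)    # 2C filters
        self.up1 = nn.ConvTranspose2d(2 * c, c, 3, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(c, c, 3, stride=2, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        h, w = x.shape[-2:]
        y = self.relu(self.down1(x))          # capture important locations
        y = self.relu(self.down2(y))
        y = self.relu(self.up1(y))            # restore the spatial mapping
        y = self.up2(y)
        # guard: snap back to the exact input size for arbitrary H, W
        y = nn.functional.interpolate(y, size=(h, w), mode="nearest")
        return x * self.sigmoid(y)            # gate the input features
```

For 21 × 21 inputs the transposed convolutions already restore the size exactly, so the interpolation is a no-op there.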

Feature Rethinking Module
Recently, the global second-order pooling operation has been widely applied in a variety of vision tasks and has achieved significant performance improvements [60][61][62]. Global second-order pooling can not only learn second-order representations but also enhance the nonlinear modelling capability of a network. Inspired by these advantages, we design an FRM to model more discriminative second-order spectral-spatial feature representations by capturing the importance of feature cross-dimension interactions and exploiting the second-order statistics of HSI. To better explain the working mechanism, we divide the FRM into two parts: feature cross-dimension interaction (FCI) and second-order pooling (SP). The proposed FRM is provided in Figure 4.


Feature Cross-Dimension Interaction
FCI is applied to capture close interdependencies between the (H, W), (H, C) and (W, C) dimensions of HSI, respectively, further enriching spectral-spatial feature representations. The diagram of FCI is provided in Figure 4. FCI is composed of three parallel streams: cross-dimension dependency between the spatial dimension H and the channel dimension C, dependency between the spatial dimension H and the spatial dimension W, and cross-dimension dependency between the spatial dimension W and the channel dimension C.

Cascaded Pooling
The cascaded pooling concatenates max-pooling features and average-pooling features to reduce the second dimension of the image cube. Mathematically, it can be expressed as Y = Concat(MaxPool(X), AvgPool(X)), where the reduction of the second dimension takes place during the max-pooling and average-pooling operations. For example, if the size of the image cube is (H, W, C), the output size of the cascaded pooling is (H, W, 2). The cascaded pooling can not only ease the computational overhead but also preserve an affluent spectral-spatial feature representation.
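As a minimal NumPy sketch (the function name is ours, not the authors'), the cascaded pooling simply stacks the max- and average-pooled maps along the reduced dimension:

```python
import numpy as np

def cascaded_pooling(x, axis=-1):
    # Concatenate max-pooled and average-pooled maps along `axis`,
    # reducing that dimension of the image cube to 2: (H, W, C) -> (H, W, 2).
    mx = x.max(axis=axis, keepdims=True)
    avg = x.mean(axis=axis, keepdims=True)
    return np.concatenate([mx, avg], axis=axis)

cube = np.random.rand(15, 15, 30)  # an image cube of size (H, W, C)
out = cascaded_pooling(cube)
print(out.shape)                   # (15, 15, 2)
```

Because only two maps per location survive, the subsequent 7 × 7 convolution in each FCI stream operates on a much thinner tensor, which is where the computational saving comes from.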

Triplet Cross-Dimension Stream
The input tensor of FCI is denoted as X ∈ R^(H×W×C), where H and W represent the height and width of the spatial domain, respectively, and C refers to the number of channels. We transmit X to each stream of the proposed FCI. In the first stream, we capture the interdependency between the spatial dimension H and the channel dimension C. First, we rotate X 90° anti-clockwise along the H axis to obtain the rotated tensor X1, whose shape is H × C × W. Second, X1 is sent to the cascaded pooling to reduce the second dimension of X1. The resulting reduced tensor is denoted as X1′, which is of shape H × C × 2. Subsequently, X1′ is fed into a 2-D convolutional layer with 1 filter of 7 × 7 size followed by a BN layer, which provides an intermediate output of shape H × C × 1. Then, a sigmoid function is utilized to generate attention weights. Finally, the obtained attention weights are multiplied by X1 and then rotated 90° clockwise along the H axis to restore the same shape as the input X.
For the second stream, we acquire interdependency between the spatial dimension H and the spatial dimension W. First, the channels of X are reduced to two by the cascaded pooling. Then, the obtained reduced dimension tensor X 2 of shape H × W × 2 is sent to a 2-D convolutional layer with 1 filter of 7 × 7 size followed by a BN layer. Finally, the output is passed through a sigmoid function to generate attention weights of shape H × W × 1 which are applied to the input X.
Similarly, in the third stream, we capture the interdependency between the spatial dimension W and the channel dimension C. First, we rotate X 90° anti-clockwise along the W axis to obtain the rotated tensor X3, whose shape is C × W × H. Second, X3 is sent to the cascaded pooling to reduce the second dimension of X3. The resulting reduced tensor is denoted as X3′, which is of shape C × W × 2. Subsequently, X3′ is fed into a 2-D convolutional layer with 1 filter of 7 × 7 size followed by a BN layer, which provides an intermediate output of shape C × W × 1. Then, a sigmoid function is utilized to generate attention weights. Finally, the obtained attention weights are multiplied by X3 and then rotated 90° clockwise along the W axis to restore the original shape of the input X.
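The shape bookkeeping of the three streams can be sketched in NumPy. Note the learned 7 × 7 convolution + BN is replaced here by a simple mean over the two pooled maps (a stand-in of ours), so the sketch only mirrors the rotate → pool → weight → multiply → rotate-back flow, not the trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cascaded_pooling(x):
    # (A, B, C) -> (A, B, 2): max- and average-pool the last dimension
    mx = x.max(axis=-1, keepdims=True)
    avg = x.mean(axis=-1, keepdims=True)
    return np.concatenate([mx, avg], axis=-1)

def fci(x):
    """Triplet cross-dimension streams; conv+BN replaced by a mean stand-in."""
    # stream 1: (H, W, C) -> (H, C, W), pool -> (H, C, 2), weights (H, C, 1)
    x1 = x.transpose(0, 2, 1)
    w1 = sigmoid(cascaded_pooling(x1).mean(-1, keepdims=True))
    y1 = (x1 * w1).transpose(0, 2, 1)          # rotated back to (H, W, C)
    # stream 2: pool channels -> (H, W, 2), weights (H, W, 1)
    w2 = sigmoid(cascaded_pooling(x).mean(-1, keepdims=True))
    y2 = x * w2
    # stream 3: (H, W, C) -> (C, W, H), pool -> (C, W, 2), weights (C, W, 1)
    x3 = x.transpose(2, 1, 0)
    w3 = sigmoid(cascaded_pooling(x3).mean(-1, keepdims=True))
    y3 = (x3 * w3).transpose(2, 1, 0)          # rotated back to (H, W, C)
    return (y1 + y2 + y3) / 3.0                # element-wise sum, then average

out = fci(np.random.rand(15, 15, 30))
print(out.shape)                               # (15, 15, 30)
```

The point of the sketch is that every stream returns a tensor of the original H × W × C shape, so the three refined views can be summed element-wise without any reshaping.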
The tensors of shape H × W × C generated by each stream are aggregated by element-wise summation, and then we utilize a simple average operation to obtain the final refined output. The process of FCI can be summarized as y1 = σ(ω1 ∗ CP(X1)) ⊗ X1, y2 = σ(ω2 ∗ CP(X)) ⊗ X, y3 = σ(ω3 ∗ CP(X3)) ⊗ X3 and y = (ȳ1 + y2 + ȳ3)/3, where CP(·) denotes the cascaded pooling, ω1, ω2 and ω3 represent the weights of the 2-D convolutional layers in the three streams, respectively, σ denotes the sigmoid activation function, y1, y2 and y3 are the outputs of the three streams, y is the output of FCI, and ȳ1 and ȳ3 refer to the 90° clockwise rotations of y1 and y3 that restore the original input shape of H × W × C.

Second-Order Pooling
Figure 4 provides the structure of SP. Similar to the squeeze-and-excitation networks [63], SP includes a squeeze process and an excitation process. The squeeze process aims to extract the global second-order statistics along the channel dimension of the input tensor. Given an input tensor X ∈ R^(H×W×C), to lessen the computational cost, we first pass it to a 2-D convolutional layer of 1 × 1 size to reduce the number of channels from C to d. Second, for the reduced tensor, pairwise channel correlations are computed to obtain one d × d covariance matrix. In the excitation process, because the order of the data is changed by the quadratic operations, we employ a row-wise convolution to retain the inherent structural information of the covariance matrix. Then, a 2-D convolution operation is performed, and we utilize the sigmoid activation function to obtain a 1 × 1 × d weight vector. Finally, the corresponding element in the weight vector is multiplied by each channel of X. SP pays attention to the correlation between spectral and spatial locations, while taking full advantage of HSI information to obtain more representative second-order spectral-spatial features.
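The SP data flow can be traced with a NumPy sketch. All learned layers (the 1 × 1 channel-reduction conv, the row-wise conv and the final mapping back to C channels) are replaced here by fixed random projections, which is our simplification; only the squeeze-to-covariance-to-weights pipeline is faithful to the description:

```python
import numpy as np

def second_order_pooling(x, d=8, seed=0):
    """Trace of SP: reduce C -> d, build a d x d covariance matrix,
    squeeze it to d gating weights, and recalibrate the input channels.
    Learned convolutions are replaced by random stand-ins."""
    rng = np.random.default_rng(seed)
    h, w, c = x.shape
    # squeeze: 1x1-conv stand-in reduces channels from C to d
    proj = rng.standard_normal((c, d)) / np.sqrt(c)
    z = x.reshape(-1, c) @ proj                      # (H*W, d)
    cov = np.cov(z, rowvar=False)                    # (d, d) channel covariance
    # excitation: row-wise-conv stand-in -> one statistic per row, then sigmoid
    row_stat = cov.mean(axis=1)                      # (d,)
    weights = 1.0 / (1.0 + np.exp(-row_stat))        # (d,) in (0, 1)
    # stand-in mapping of the d weights back to the C input channels
    chan_w = np.abs(proj) @ weights
    chan_w /= chan_w.max()
    return x * chan_w                                # recalibrated (H, W, C)

out = second_order_pooling(np.random.rand(15, 15, 30))
print(out.shape)                                     # (15, 15, 30)
```

The covariance matrix is what makes this "second-order": each entry couples a pair of channels, so the resulting gates reflect channel correlations rather than per-channel means.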

Experiments and Discussion
The experimental setup, experimental parameter settings, framework parameter settings, comparisons with the state-of-the-art methods, generalization performance and ablation studies are described and discussed here in detail.

Datasets
Four public HSI datasets are used in our experiments to evaluate qualitatively and quantitatively the classification performance of the proposed HS 2 FNet.
The University of Pavia (UP) dataset [44], including nine ground-truth categories, was collected by the Reflective Optics System Imaging Spectrometer (ROSIS-03) over the Pavia region of northern Italy with a spatial resolution of 1.3 m per pixel (mpp). It is composed of 103 bands ranging from 0.43 to 0.86 µm and 610 × 340 pixels. The numbers of training and test samples of the UP dataset used in the experiments are summarized in Table 1.

The Kennedy Space Centre (KSC) dataset [64], containing 13 ground-truth categories, was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) at the Kennedy Space Centre in Florida. The spatial resolution is 18 mpp. It comprises 512 × 614 pixels and 176 bands ranging from 0.4 to 2.5 µm. The numbers of training and test samples of the KSC dataset used in the experiments are summarized in Table 2.
The Indian Pines (IP) dataset [44], including 16 ground-truth categories, was captured by AVIRIS over the Indian Pines test site in Northwestern Indiana with a spatial resolution of 20 mpp. It consists of 200 bands ranging from 0.4 to 2.5 µm and 145 × 145 pixels. The numbers of training and test samples of the IP dataset used in the experiments are summarized in Table 3.
The Salinas (SA) dataset [44], containing 16 ground-truth categories, was gathered by an AVIRIS sensor over the Salinas Valley of California. The spatial resolution is 3.7 mpp. It includes 512 × 217 pixels and 204 bands ranging from 0.4 to 2.5 µm. The numbers of training and test samples of the SA dataset used in the experiments are summarized in Table 4.

Implementation Details
For the KSC and IP datasets, we randomly choose 20% of the samples for training and the remaining 80% for testing. For the UP and SA datasets, 10% of the samples are selected at random for training and the remaining 90% are utilized for testing. Because the experimental datasets contain different numbers of samples, different batch sizes are needed: 16, 16, 16 and 128 for the UP, KSC, IP and SA datasets, respectively. In addition, our proposed method is trained for 100, 100, 100 and 50 epochs on the UP, KSC, IP and SA datasets, respectively. In the training process, the optimizer plays an important role and affects the model convergence. To make the model converge rapidly, we utilize Adam as the optimizer. Specifically, the learning rate is set to 0.00001 for all four experimental datasets.
The hardware environment of the experiments is a server with an NVIDIA GeForce RTX 2060 SUPER GPU and an Intel i7-9700F CPU. In addition, the software platform is based on TensorFlow 2.3.0, Keras 2.4.3, CUDA 10.1 and Python 3.6.
To analyze the classification performance of the proposed method, four commonly used evaluation indicators are adopted: the accuracy of each category, the overall accuracy (OA), the average accuracy (AA) and the kappa coefficient (Kappa). The closer these evaluation indicators are to 1, the better the classification performance.
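OA, AA and Kappa can all be derived from the confusion matrix; the following is a minimal NumPy sketch using the standard formulas (the function name is ours):

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """OA, AA and Cohen's kappa computed from a confusion matrix."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                                # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)             # accuracy per category
    aa = per_class.mean()                                # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
oa, aa, kappa = classification_metrics(y_true, y_pred, 3)
print(round(oa, 3))  # 0.833
```

Kappa discounts the agreement expected by chance (pe), which is why it is typically slightly lower than OA on imbalanced datasets.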

Framework Parameter Settings
The classification performance of our proposed HS 2 FNet is affected by five important parameters, i.e., different spatial sizes, diverse training percentages, different numbers of principal components, diverse compressed ratios in the DB, and various numbers of SFEBs. In this part, in terms of HSI classification results, we discuss the influence of these five parameters under different value settings.

Influence of Different Spatial Sizes
The spatial size of the image cube has a relatively large influence on the classification performance of the proposed HS 2 FNet. Too small a spatial size will result in the loss of spectral-spatial information and damage the classification results. Too large a spatial size will lower the proportion of spectral-spatial information contributed by the central pixel of the image cube, which is not conducive to HSI classification. It is therefore vital for HSI classification to choose an appropriate spatial size. In this paper, we adopt different spatial sizes to find the best one for our proposed method; the results are provided in Figure 5. From Figure 5a,c,d, we can visually see that the three evaluation indexes are obviously better than the others, all exceeding 99%, when the spatial size is 15 × 15 for the UP dataset and 21 × 21 for the IP and SA datasets. Therefore, spatial sizes of 15 × 15, 21 × 21 and 21 × 21 are chosen as the pertinent image cubes of the proposed HS 2 FNet on the UP, IP and SA datasets, respectively. From Figure 5b, when the spatial size is 25 × 25 or 29 × 29, the three evaluation indexes are all 100%. Considering the training time and computational cost, we regard the spatial size of 25 × 25 as the most suitable input of our proposed method for the KSC dataset.
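To make the role of the spatial size concrete, here is a hypothetical helper (names ours) showing how each pixel is turned into a size × size image cube, with zero padding at the borders so that edge pixels also get full-sized cubes:

```python
import numpy as np

def extract_cube(img, row, col, size):
    """Cut a size x size spatial cube centred on pixel (row, col).

    `img` is an (H, W, C) data cube; borders are zero-padded so every
    pixel, including corners, yields a cube of the same spatial size.
    """
    m = size // 2
    padded = np.pad(img, ((m, m), (m, m), (0, 0)), mode="constant")
    return padded[row:row + size, col:col + size, :]

img = np.random.rand(145, 145, 40)  # e.g. an IP-sized cube after PCA
cube = extract_cube(img, 0, 0, 21)
print(cube.shape)                   # (21, 21, 40)
```

The trade-off discussed above is visible here: a larger `size` pulls in more neighbours (context), but the centre pixel's own spectrum occupies a smaller fraction of the cube.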

Influence of Diverse Training Percentage
To study the influence of the number of training samples on the performance of our proposed method, 1%, 3%, 5%, 7%, 10%, 15%, 20%, 25% and 30% of samples are chosen at random as the training set, with the corresponding remaining samples used for testing. The corresponding results for the UP, KSC, IP and SA datasets are provided in Figure 6. From Figure 6, we can clearly see that the three evaluation indexes on the four experimental datasets gradually improve as the number of training samples increases. Concretely, for the KSC dataset with 1%~7% training samples, for the IP dataset with 1%~10% training samples, and for the UP and SA datasets with 1% or 3% training samples, the three evaluation indicators of our proposed method are not satisfactory. When the proportion of training samples exceeds 3% for the UP and SA datasets, all evaluation indicators are over 99%. For the KSC dataset, when the proportion exceeds 7%, and for the IP dataset, when it exceeds 10%, all evaluation indicators exceed 99%. On the one hand, when the proportion of training samples is too small, because of the random choice of samples, some representative samples are not selected, resulting in poor classification performance. As the proportion of training samples increases, the three evaluation indexes also become better. When the proportion of training samples reaches 15%, the three evaluation indicators on all four datasets are over 99%. On the other hand, the KSC and IP datasets have relatively few labeled samples, hence the classification results of these two datasets are relatively affected by the number of training samples. By contrast, the UP and SA datasets include a mass of labeled samples, so good classification results can still be achieved without a large number of training samples.
Therefore, for UP dataset and SA dataset, we randomly choose 10% of samples as the training set and the corresponding remaining 90% of samples are adopted as the test set. For KSC dataset and IP dataset, 20% of samples are selected at random for training and the corresponding remaining 80% of samples for testing.

Influence of Different Numbers of Principal Components
To reduce the computational cost and learning parameters, we apply a PCA transformation in our proposed method. Different numbers of principal components (PCs) are set, i.e., {3, 5, 10, 15, 20, 25, 30, 35, 40}, to analyze their effect on our proposed method for different datasets. From Figure 7a, for the UP dataset, compared with the other settings, when the number of PCs is 15, though OA is not the best, AA and Kappa are obviously superior to the others and the three evaluation indicators are over 99.5%. Therefore, we set the number of PCs to 15 for the UP dataset. From Figure 7b, for the KSC dataset, when the number of PCs is 30 or 35, the three evaluation indicators reach 100%. Considering learning parameters and training time, we set the number of PCs to 30 for the KSC dataset. From Figure 7c, when the number of PCs is 40, the IP dataset obtains the best classification accuracy. From Figure 7d, when the number of PCs is 30, the SA dataset achieves the optimal evaluation indicators, which are close to 100%. Therefore, the number of PCs for the IP and SA datasets is set to 40 and 30, respectively.
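The PCA step can be sketched with an SVD-based NumPy implementation (our own sketch, not the authors' code): the spectral bands are centred and projected onto the top k principal components:

```python
import numpy as np

def pca_reduce(cube, k):
    """Project the spectral bands of an (H, W, C) cube onto k PCs via SVD."""
    h, w, c = cube.shape
    flat = cube.reshape(-1, c).astype(np.float64)
    flat -= flat.mean(axis=0)                  # centre each band
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:k].T).reshape(h, w, k)  # (H, W, k)

cube = np.random.rand(32, 32, 103)             # e.g. a UP-sized spectral cube
reduced = pca_reduce(cube, 15)
print(reduced.shape)                           # (32, 32, 15)
```

Because singular values are sorted in decreasing order, the first retained component carries the most spectral variance, which is why a small k (15 to 40 above) preserves most of the useful information.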

Influence of Diverse Compressed Ratios in the DB
To explore the compressed ratio of the channel-wise recalibrate branch in the DB, compressed ratios r = 1, 2, 3, 4, 5 and 6 are evaluated on the UP, KSC, IP and SA datasets. Figure 8 shows the influence of diverse compressed ratios on the four experimental datasets. From Figure 8a,b, when the compressed ratio of the DB is 1, the UP and KSC datasets obtain the optimal classification accuracy. Meanwhile, we find that a large r slightly degrades the evaluation indicators, which means that it underfits the feature channel-wise correlations. As shown in Figure 8c, when the compressed ratio of the DB is 4, the IP dataset has the best evaluation indicators. From Figure 8d, for the SA dataset, when the compressed ratio of the DB is 6, the OA, AA and Kappa are close to 100%. By contrast, for the IP and SA datasets, the evaluation indicators do not increase monotonically as r decreases. A possible reason is that the channel attention branch in the DB overfits the feature channel-wise correlations. Therefore, the most appropriate compressed ratio of the DB is 1, 1, 4 and 6 for the UP, KSC, IP and SA datasets, respectively.
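To illustrate what the compressed ratio r controls, here is a hypothetical squeeze-and-excitation-style channel recalibration with stand-in random weights (ours, not the trained branch); the C → C/r bottleneck is where a large r can underfit channel-wise correlations:

```python
import numpy as np

def channel_recalibrate(x, r, seed=0):
    """SE-style channel gating with compression ratio r (stand-in weights)."""
    rng = np.random.default_rng(seed)
    c = x.shape[-1]
    hidden = max(c // r, 1)                         # bottleneck width C // r
    w1 = rng.standard_normal((c, hidden)) / np.sqrt(c)
    w2 = rng.standard_normal((hidden, c)) / np.sqrt(hidden)
    s = x.mean(axis=(0, 1))                         # squeeze: global avg pool, (C,)
    z = np.maximum(s @ w1, 0.0)                     # excitation: bottleneck + ReLU
    weights = 1.0 / (1.0 + np.exp(-(z @ w2)))       # sigmoid gate per channel
    return x * weights                              # recalibrated (H, W, C)

x = np.random.rand(15, 15, 30)
out = channel_recalibrate(x, r=4)
print(out.shape)                                    # (15, 15, 30)
```

With r = 1 the bottleneck disappears (hidden = C), which matches the observation above that the UP and KSC datasets prefer the unrestricted setting, while the IP and SA datasets benefit from some compression.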

Influence of Various Numbers of SFEBs
The SFEB can capture abundant high-frequency spectral-spatial features at different scales as well as take full advantage of features from the previous layers. Too small a number of SFEBs may insufficiently extract spectral-spatial features, while too large a number will increase the computational cost. Therefore, it is indispensable to set an appropriate number of SFEBs to capture the multiscale spectral-spatial features. Figure 9 provides the three evaluation indexes when the number of SFEBs is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 on the four experimental datasets. From Figure 9b, for the KSC dataset, we can clearly see that, when the number of SFEBs is 2, 4, 5 or 6, the three evaluation indexes reach 100% and achieve first-rank classification performance. Compared with the other settings, when the number of SFEBs is 2, fewer learning parameters and a shorter training time are needed. Therefore, the most appropriate number of SFEBs is 2 for the KSC dataset. According to Figure 9a,c,d, when the number of SFEBs is 8, the UP dataset has excellent classification results; when the number of SFEBs is 9, the three evaluation indexes of the IP dataset achieve the optimal accuracy; and when the number of SFEBs is 2, the SA dataset obtains the optimal classification accuracy. Therefore, we set the number of SFEBs to 8, 9 and 2 for the UP, IP and SA datasets, respectively.
Comparison with State-of-the-Art Methods
To obtain fair and impartial results, all classification methods, including the 12 compared methods and our proposed HS 2 FNet, utilize the same proportions of training samples: 10%, 20%, 20% and 10% for the UP, KSC, IP and SA datasets, respectively. Tables 5-8 report the class-wise accuracy, OA, AA and Kappa for the four experimental datasets. Moreover, Figures 10-13 display the visual maps of the diverse classification methods. By comparing our presented HS 2 FNet with the twelve compared classification methods, we can obtain the following conclusions.
(1) From Tables 5-8, we can observe that, in comparison with the three classification methods using ML, the ten classification methods based on DL almost always achieve superior classification results on the four experimental datasets. Among them, our proposed HS 2 FNet occupies the first place. This is because ML-based methods all depend on handcrafted features and prior knowledge, resulting in poor generalization performance, and cannot be well adapted to the classification task. By comparison, the DL-based classification methods can automatically extract hierarchical representations from HSI data. In addition, among the ten DL-based classification methods, we also find that the classification accuracy of 1D_CNN is not satisfactory. This is because 1D_CNN only captures features in the spectral domain and ignores the rich spatial information of HSI.
(2) The MFDN, SSAN, DCRN and MAFN adopt two CNN architectures to capture spectral features and spatial features, respectively. A simple concatenation operation or element-wise summation is utilized to fuse the spectral and spatial features for classification. These methods obtain good classification results, but the close interdependency of spectral and spatial information is not excavated. Compared with them, our proposed method obtains better classification accuracy on the four datasets. For example, the three evaluation indexes of our proposed HS 2 FNet are 99.98%, 99.97% and 99.98% on the SA dataset, which are 3.14%, 3.07% and 3.48% higher than those of DCRN, and 0.47%, 0.49% and 0.53% higher than those of MFDN, respectively. Our proposed method uses the PFEM to capture multiscale spectral-spatial features while eliminating redundant information, while the FRM is designed to mine second-order spectral-spatial statistics to improve the classification performance.
The best accuracy of three evaluation indexes is marked in red. The highest category accuracy of the designed model is highlighted in blue.
(3) The attention mechanism can capture key areas from images for classification. The SSAN introduces the self-attention mechanism, using the relationship between the pixels within an HSI cube to obtain attention areas. The JSSAN designs a spectral-spatial attention block to capture the long-range correlation of the spectral-spatial information. To eliminate redundant bands and interfering pixels, the MAFN constructs a band attention module and a spatial attention module. Although these attention mechanism-based classification methods obtain good evaluation indicators, they ignore cross-dimension interaction. Compared with these methods, our proposed method achieves superior classification results on the four datasets. For example, the designed model achieves 100% OA, 100% AA and 100% Kappa on the KSC dataset, which are 0.58%, 0.88% and 0.64% higher than those of SSAN, 3.05%, 4.70% and 3.40% higher than those of JSSAN, and 3.31%, 5.01% and 3.69% higher than those of MAFN, respectively. Our designed DB not only pays more attention to adaptively recalibrating feature responses to eliminate redundant features, but also learns the close correlation of spatial and spectral data. Meanwhile, we utilize FCI, which heightens the representative ability of HSI by introducing cross-dimension interaction without dimensionality reduction.
(4) The visual classification maps in Figures 10-13 are consistent with the quantitative results in Tables 5-8. Compared with the ground-truth image, we can draw the conclusion that our proposed HS 2 FNet has smoother classification maps and higher classification accuracy. Meanwhile, it can strike a better balance between boundary information and object continuity.

Generalization Performance
To further demonstrate the superiority and robustness of our devised network, we perform experiments among the ten classification methods utilizing DL along with an increase in training samples, i.e., {1%, 3%, 5%, 7%, 10%}. From Figure 14, we can see that our proposed method occupies the first place in all these cases on the four experimental datasets, while the other compared methods do not show good robustness and generalization performance. For example, according to Figure 14a, our presented HS 2 FNet achieves the top classification accuracy and the MAFN obtains the lowest accuracy. When the number of training samples is 1%, the OA of our proposed model is 97.48%, which is 52.44% higher than MAFN. In Figure 14b, HybridSN achieves comparable results among these classification methods. When the number of training samples is 7%, the OA of our proposed method is 93.80%, which is 3.14% higher than HybridSN. As exhibited in Figure 14c, the OA of 3D_CNN is the worst in almost all cases. When the number of training samples is 10%, the OA of our proposed method is 98.04%, which is 18.87% higher than 3D_CNN. In Figure 14d, JSSAN ranks second among the competing methods. When the number of training samples is 3%, the OA of our proposed method is 99.72%, which is 6.87% higher than JSSAN.
Compared with these representative classification methods, the aforesaid experimental results adequately prove that our constructed HS 2 FNet possesses more robust generalization performance.

Effectiveness Analysis of the SFEB
In this paper, we design the SFEB to make full use of spectral-spatial features at different scales and from the previous layers, which is an important component of our proposed HS 2 FNet.
The SFEB is composed of a symmetrical multiscale dense link unit, cross transmission and local skip transmission. To verify the advantages of the novel design in the proposed method, ablation experiments are conducted under six different conditions on the four common datasets, labeled case 1, case 2, case 3, case 4, case 5 and case 6, respectively. The ablation results of the module validity analysis are shown in Table 9.

Case 1: SFEB only uses the Top-Link unit.
Case 2: SFEB only uses the Bottom-Link unit.
Case 3: SFEB uses the combination of the Top-Link unit and the Bottom-Link unit (without dense connection).
Case 4: SFEB uses the combination of the Top-Link unit and the Bottom-Link unit.
Case 5: SFEB uses the combination of the Top-Link unit, the Bottom-Link unit and cross transmission.
Case 6 (our proposed method): SFEB uses the combination of the Top-Link unit, the Bottom-Link unit, cross transmission and local skip transmission.
The best accuracy is highlighted in red.
In case 1 and case 2, we only introduce a one-path dense link unit: the Top-Link unit or the Bottom-Link unit. In case 3, we build a symmetrical multiscale dense link unit without dense connection. In case 4, the combination of the Top-Link unit and the Bottom-Link unit is utilized to construct the dual-branch multiscale dense link unit. From Table 9, we can find that case 4 obtains better classification results on the four experimental datasets. In particular, compared with case 1 or case 2, case 4 utilizes a dual-branch multiscale dense link unit to capture multiscale spectral-spatial features. For example, case 4 achieves 98.30% OA, 96.27% AA and 97.75% Kappa on the UP dataset, which are 2.09%, 4.27% and 2.79% higher than case 1 and 1.65%, 2.68% and 2.19% higher than case 2, respectively. Compared with case 3, case 4 introduces the dense connection to make full use of spectral-spatial features from the previous layers. For example, case 4 achieves 95.08% OA, 92.93% AA and 94.51% Kappa on the KSC dataset, which are 0.32%, 1.06% and 0.35% higher than case 3. These experimental results demonstrate that our designed symmetrical multiscale dense link unit is effective.
In case 5, the cross transmission is introduced to achieve spectral-spatial feature exchange and fusion. Compared with case 4, case 5 obtains preferable classification accuracy. For example, case 5 achieves 97.97% OA, 95.26% AA and 97.69% Kappa on the IP dataset, which are 0.3%, 4.3% and 0.35% higher than case 4. This is because the cross transmission allows spectral-spatial features at different scales from the Top-Link unit and the Bottom-Link unit to be transmitted to each other. These experimental results demonstrate that the introduced cross transmission is effective.
Local skip transmission can not only alleviate the degradation problem but also speed up training, so we introduce it into case 6 (our proposed method). As shown in Table 9, compared with case 5, case 6 obtains superior classification results on the four common experimental datasets. These experimental results demonstrate that local skip transmission is beneficial for our proposed method.
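In spirit, the cross transmission feeds each branch the other branch's features, and the local skip transmission adds a branch's input back onto its output. A toy sketch of this wiring (the summation-based exchange and the placeholder transforms `f_top`/`f_bottom` are illustrative assumptions standing in for the actual convolutional layers):

```python
def cross_and_skip(top, bottom, f_top, f_bottom):
    """Toy wiring of cross transmission + local skip transmission.

    top / bottom: feature vectors (lists of floats) from the two branches.
    f_top / f_bottom: the branch transforms (placeholders for conv layers).
    """
    # Cross transmission: each branch also receives the other branch's features.
    top_in = [t + b for t, b in zip(top, bottom)]
    bottom_in = [b + t for t, b in zip(top, bottom)]
    # Local skip transmission: add each branch's input back to its output.
    top_out = [y + x for y, x in zip(f_top(top_in), top)]
    bottom_out = [y + x for y, x in zip(f_bottom(bottom_in), bottom)]
    return top_out, bottom_out
```

With identity transforms, the skip path simply adds the original branch features back after the exchange, which is what makes the gradient path short during training.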

Effectiveness Analysis of the DB
To fully verify the effectiveness of our designed DB, three compared methods utilizing hierarchical features and three ablation experiments are performed on the four experimental datasets, labeled h1 through h7, as shown in Figure 15. The experimental results of the module validity analysis are shown in Table 10. h1: not using hierarchical features (as shown in Figure 15a). h2: one residual connection only using input features (as shown in Figure 15b). h3: residual dense connection using all hierarchical features (as shown in Figure 15c). h4: using a concatenation operation and a 1 × 1 convolutional layer for hierarchical feature fusion (as shown in Figure 15d). h5: HFFB only using the channel-wise recalibration branch (as shown in Figure 15e). h6: HFFB only using the spatial-wise recalibration branch (as shown in Figure 15f). h7: our proposed DB. From Table 10, we can clearly see that h7 achieves the best classification results on the four experimental datasets. On the one hand, compared with h1, h2 and h3, our proposed DB has obvious advantages. For example, h7 achieves 100% OA, 100% AA and 100% Kappa on the KSC dataset, which are 6.87%, 10.41% and 7.68% higher than h1. This is because the DB can utilize two 1 × 1 convolutional layers to reduce the number of feature maps without losing fine information, while focusing on adaptively recalibrating channel-wise and spatial-wise feature responses to achieve first-order spectral-spatial feature distillation.
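The channel-wise recalibration branch works in the spirit of squeeze-and-excitation: each channel is summarized by a scalar statistic, passed through a gate, and the resulting weight rescales that channel. A minimal sketch, assuming this standard formulation (the sigmoid gate applied directly to the channel average is a simplification; the actual block uses learned layers):

```python
import math

def channel_recalibrate(fmap):
    """fmap: list of channels, each a 2-D list (H x W) of activations.

    Returns the channels rescaled by a sigmoid gate on their global average.
    """
    out = []
    for ch in fmap:
        # Squeeze: global average pooling over the spatial dimensions.
        avg = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        # Excite: a sigmoid gate (stand-in for the learned FC layers).
        w = 1.0 / (1.0 + math.exp(-avg))
        # Recalibrate: rescale every spatial position of this channel.
        out.append([[w * v for v in row] for row in ch])
    return out
```

The spatial-wise branch is symmetric: it pools across channels at each spatial position and gates positions rather than channels.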
On the other hand, compared with h4, h5 and h6, our proposed DB obtains better classification accuracy. For example, h7 obtains 99.98% OA, 99.97% AA and 99.98% Kappa on the SA dataset, which are 3.41%, 4.73% and 3.80% higher than h6, respectively. Our designed DB pays more attention to adaptively recalibrating feature responses to eliminate redundant features, while also capturing the tight relation between spatial and spectral information. These experimental results demonstrate that our designed DB is effective. The best results in Table 10 are highlighted in red.

Effectiveness Analysis of the Proposed HS2FNet
In this paper, we propose HS2FNet for HSI classification, which is composed of three functional modules: SFEB, DB and FRM. To validate the effectiveness of each module in the proposed method, we execute self-comparison experiments on the four datasets under three conditions, labeled CN1, CN2 and CN3. CN1: using the SFEB. CN2: using the SFEB and DB. CN3 (our proposed method): using the SFEB, DB and FRM. The experimental results of the module validity analysis are shown in Table 11, where the best results are highlighted in red.
From Table 11, we can clearly see that, compared with CN1, CN2 obtains better classification accuracy. CN2 introduces the DB to pay more attention to adaptively recalibrating feature responses and eliminating redundant information. In particular, CN2 achieves 94.93% OA, 96.84% AA and 94.35% Kappa on the SA dataset, which are 3.46%, 4.43% and 3.87% higher than CN1, and 96.27% OA, 94.21% AA and 95.04% Kappa on the UP dataset, which are 0.29%, 0.41% and 0.39% higher than CN1. These results indicate that the DB is effective.
Moreover, the three evaluation indicators of CN3 are the best among the three conditions on the four common experimental datasets, all exceeding 99%. These experimental results show that the FRM introduces cross-dimensional features and second-order statistical features into our proposed method to produce richer and more expressive spectral-spatial features, which improves the classification performance.
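Second-order statistics of this kind are commonly realized as covariance pooling: the spatial positions of a feature map are treated as samples of C-dimensional vectors, and their C × C covariance matrix replaces first-order pooled features. A minimal sketch, assuming this standard formulation rather than the paper's exact implementation:

```python
def covariance_pool(features):
    """features: list of N spatial samples, each a list of C channel values.

    Returns the C x C sample covariance matrix (second-order statistics).
    """
    n, c = len(features), len(features[0])
    # Per-channel mean over the spatial positions.
    mean = [sum(f[j] for f in features) / n for j in range(c)]
    # Centered outer products, averaged over the spatial positions.
    cov = [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in features) / n
            for j in range(c)] for i in range(c)]
    return cov
```

Unlike first-order pooling, the off-diagonal entries capture how pairs of channels co-vary across space, which is the extra discriminative signal the FRM exploits.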

Conclusions
In this paper, we propose a hybrid-order spectral-spatial feature network (HS2FNet) for HSI classification. The HS2FNet consists of two main parts: a precedent feature extraction module (PFEM), which includes several symmetrical feature extraction blocks (SFEBs) and a distillation block (DB) to capture first-order spectral-spatial features, and a feature rethinking module (FRM) to model second-order spectral-spatial features. The FRM further refines the first-order features obtained from the PFEM and improves the classification performance. First, the SFEB is designed to extract multiscale spectral-spatial features and make full use of HSI feature flows between different scales. Connecting all multiscale spectral-spatial features is helpful for HSI classification; unfortunately, these hierarchical features may introduce noise and redundant information, which harms classification accuracy. Therefore, the DB is constructed to focus on adaptively recalibrating channel-wise and spatial-wise feature responses to achieve first-order spectral-spatial feature distillation. Then, to enrich feature representations and improve the classification performance, we devise the FRM to model more discriminative second-order spectral-spatial features, which can not only heighten the representative ability of HSI by capturing the importance of features across dimensions, but also learn more discriminative representations by exploiting the second-order statistics of HSI. These two progressive functional modules obtain refined spectral-spatial features and achieve accurate and efficient classification. Finally, we utilize two fully connected layers, two dropout layers and a soft-max layer to finish the classification task. Experimental results demonstrate that the proposed method achieves competitive results in comparison with state-of-the-art classification methods.
In addition, the ablation experiments demonstrate that the proposed architecture is reasonable and improves the classification performance. However, this method could still be improved. For example, as the depth of the network increases, the computational complexity also increases; the network needs numerous training parameters and a longer training time. Reducing these costs could be a future research direction for improving the proposed method.