Hyper-LGNet: Coupling Local and Global Features for Hyperspectral Image Classification

Abstract: Hyperspectral sensors capture rich spatial and spectral information and enable high-level earth observation applications, such as accurate land cover mapping and target/object detection. Currently, convolutional neural networks (CNNs) cope well with hyperspectral image processing tasks because of the strong spatial and spectral feature extraction ability brought by their hierarchical structures, but the convolution operation in CNNs is limited to local feature extraction in both dimensions. Meanwhile, the Transformer structure can capture long-distance dependencies between tokens from a global perspective; however, Transformer-based methods have a restricted ability to extract local information because they lack the inductive bias that CNNs have. To make full use of the advantages of both methods in hyperspectral image processing, a dual-flow architecture named Hyper-LGNet, which couples local and global features, is proposed for the first time by integrating CNN and Transformer branches to deal with HSI spatial-spectral information. In particular, a spatial-spectral feature fusion module (SSFFM) is designed to maximally integrate spectral and spatial information. Three mainstream hyperspectral datasets (Indian Pines, Pavia University and Houston 2013) are used to evaluate the performance of the proposed method. Comparative results show that the proposed Hyper-LGNet achieves state-of-the-art performance in comparison with the other nine approaches in terms of overall accuracy (OA), average accuracy (AA) and kappa index. Consequently, it is anticipated that, by coupling CNN and Transformer structures, this study can provide novel insights into hyperspectral image analysis.


Introduction
With the development of sensing technology, hyperspectral sensors can acquire hundreds of bands for each pixel, capturing reflectance intensity with high spatial/spectral detail and enabling the detection of various objects [1][2][3]. In comparison with red-green-blue (RGB)-based sensing images and multispectral images (MSI), hyperspectral images (HSI) contain hundreds of spectral bands because the number of bands increases while the bandwidth of each band decreases [4]. Such abundant band information has a more powerful discriminating ability, especially for spectrally similar categories, and thus has been widely applied in high-level earth observation (EO) missions, such as accurate land cover mapping, precision agriculture, target/object detection, urban planning and mineral exploration [5][6][7].
The land cover mapping problem in high-level earth observation missions can be cast as an image classification problem, which aims to identify various objects so that key stakeholders can obtain vital information for decision making [8][9][10]. In the early stage, the solution to cope with the HSI classification problem by traditional approaches with other strategies to achieve better performance, such as with multiscale dynamic graph and hashing semantic features [32,33].
To overcome the drawbacks of CNNs, the Transformer structure, which was designed for processing and analysing sequential data, has been introduced to image analysis problems. Because of its multi-head self-attention mechanism, Transformer is capable of capturing long-distance dependencies between tokens from a global perspective. According to related reviews, Transformer has achieved quite good results on multiple downstream tasks in the natural language processing (NLP) and computer vision (CV) domains with the help of large-scale pre-training [34][35][36]. Furthermore, Transformer has also achieved superior results in the field of hyperspectral image classification. For example, to address the limited receptive field, inflexibility and poor generalization of earlier methods, HSI-BERT was proposed to capture the global dependence among spatial pixels with bidirectional encoder representations from Transformers [37]. The spatial-spectral Transformer utilized a CNN to extract the spatial features and a modified Transformer to capture global relationships in the spectral dimension, fully exploring the spatial-spectral features [38]. Moreover, the spatial Transformer network was proposed to obtain the optimal input of HSI classifiers for the first time [39]. Rethinking hyperspectral image classification with Transformers, SpectralFormer can learn local spectral information from neighboring bands of HS images, achieving a significant improvement in comparison with state-of-the-art backbone networks [7].
It should be emphasized that although a single Transformer network can pave the way for the HSI classification problem compared to CNN methods by exploiting both spatial and spectral information, it still has some problems. First, the Transformer method has a restricted ability to extract local information since it does not possess the strong inductive bias that CNNs do. Second, Transformer needs large-scale pre-training to achieve the same performance as a CNN. Third, the computational load is strongly and positively correlated with the sequence length, so a Transformer-based method becomes unduly computationally intensive when the sequence is excessively long, while its representational ability is limited if the sequence is too short. Therefore, finding an adequate approach that combines the benefits of both paradigms (CNN-based and Transformer-based methods) while exploiting spatial and spectral information for the HSI classification task is a challenging problem.
Currently, many scholars have contributed work on the HSI classification problem, including traditional machine learning methods, CNN-based approaches and Transformer-based methods. Although some works integrate CNN and Transformer via a hybrid strategy in a single branch, the spatial and spectral information of HSI is not fully fused and utilized. Local features and global features are not complementary at the receptive field level, and the features of only one branch cannot help the model to discriminate various classes through a feature fusion method, resulting in a less convincing and accurate classification performance [39]. To address the drawbacks of a single CNN or Transformer network, a dual-flow framework named Hyper-LGNet is proposed, which aims to couple local and global features for hyperspectral image classification by using CNN and Transformer branches to deal with HSI spatial-spectral information. The proposed Spatial-spectral Feature Fusion Module (SSFFM) is applied to integrate spectral and spatial information maximally. The proposed method is validated on three popular datasets: the Indian Pines, Pavia University and Houston 2013 datasets. The results are compared with traditional machine learning methods and other deep learning architectures, showing that our method achieves the best performance, even compared with the previous SOTA method SpectralFormer [7]. To be more clear, the main contributions are summarized as follows:
(1) A dual-flow architecture named Hyper-LGNet is proposed, which, for the first time, utilizes CNN and Transformer models in two branches to obtain HSI spatial and spectral information for HSI classification problems.
(2) The sensing image feature fusion block, namely the Spatial-spectral Feature Fusion Module, is proposed to maximally fuse spectral information and spatial information from the two branches in the dual-flow architecture.
(3) Extensive experiments are conducted on three mainstream datasets, including the Indian Pines, Pavia University and Houston 2013 datasets. In comparison with various methods, a state-of-the-art classification performance is achieved under the SpectralFormer data settings.
The remaining sections of this paper are organized as follows: Section 2 presents the proposed Hyper-LGNet network design; Section 3 presents the comparative results of different algorithms on various public HSI datasets, both qualitatively and quantitatively; and finally, conclusions and directions for future work are given in Section 4.

Methodology
In this section, we first give a brief review of conventional CNN and Transformer models. Second, a detailed illustration of the proposed dual-branch architecture named Hyper-LGNet is presented. Then, the feature fusion module is introduced to effectively fuse the spectral features embedded in the two branches. Finally, the experimental configuration and evaluation metrics are described.

Overview of Conventional CNN and Transformer Network
For the hyperspectral image classification task, it is of paramount importance to make full use of the spatial and spectral information in the sensing images. Regarding the exploration of spatial information, both local features and global representations are vital for the pixel-wise classification task. Benefiting from the powerful local information extraction ability of convolution operations, CNNs are capable of coping with multiple tasks in the field of computer vision. As can be seen in Figure 1, a conventional framework of a basic convolutional block contains a convolutional layer, batch normalization (BN), an activation function and specific layers for downstream tasks, which provide it with a strong local information extraction ability. Specifically, local features are able to identify low-level information, such as boundary information and texture information among various classes, while global representations can capture higher-level semantic information. Although the receptive field can be increased by hierarchically stacking convolutional layers in a CNN, it is hard to clearly model long-range dependencies, meaning that it cannot effectively capture global representations.
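To illustrate why stacking convolutional layers enlarges the receptive field only gradually, the following minimal Python sketch computes the theoretical receptive field of a stack of layers. The function name `receptive_field` and the 3 × 3 kernel configurations are illustrative assumptions, not settings taken from this paper.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of stacked conv layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from the
    input towards the output.
    """
    rf = 1      # a single output pixel initially sees one input pixel
    jump = 1    # distance, in input pixels, between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Three 3x3 convolutions with stride 1 (hypothetical configuration).
print(receptive_field([(3, 1)] * 3))
# Three 3x3 convolutions with stride 2 (hypothetical configuration).
print(receptive_field([(3, 2)] * 3))
```

Under these assumptions, the field grows only linearly with depth for stride-1 layers, which is consistent with the observation that a shallow CNN mainly sees a local neighbourhood.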

[Figure 1. Basic convolutional block: convolution, batch normalization, activation function.]

As one of the self-attention mechanism-based networks, Transformer [40] can effectively model global dependencies, making up for the limitations of CNNs, especially for the HSI classification task. The principle of Transformer can be seen in Figure 2. It is based on a self-attention mechanism and stacks Transformer blocks to learn the word embeddings used in the Transformer decoder and other downstream tasks. To cope with image tasks, Vision Transformer (ViT) [41], also shown in Figure 2, was proposed to adapt the Transformer encoder and treat a patch as a token in order to sequentialize the image. With the help of large-scale pre-training, Transformer can explicitly model long-distance dependencies and achieve superior performance. However, because Transformer lacks the strong inductive bias possessed by a CNN, it cannot effectively model local information without large-scale pre-training. As a consequence, it is essential to integrate CNN and Transformer to deal with HSI classification problems.

In order to fully utilize the local features and global representations in the spatial and spectral dimensions, we combine CNN and Transformer in a dual-flow model named Hyper-LGNet. The proposed deep learning architecture is capable of extracting important local features and global context information equally in the spatial dimension by using parallel CNN and Transformer branches. Then, the spatial features extracted by the two branches are fused by the proposed feature fusion module. By means of the channel attention mechanism, the spectral information can also be learned and fused. Finally, the feature map is reshaped into a vector and fed through the fully connected layer to obtain the final output (classification vector). The details of the Hyper-LGNet design are introduced in the following section.

Hyper-LGNet Network Architecture
In this section, the proposed dual-flow architecture obtaining HSI spatial and spectral information is introduced. It employs a CNN branch and a Transformer branch to capture spatial representations and utilizes the Spatial-spectral Feature Fusion Module (SSFFM) to deeply fuse spectral information of both branches (see Figure 3).

CNN Branch Design
In order to fully extract local features in the spatial dimension and solve the aforementioned problem that Transformer cannot effectively model local information without pre-training, we design a simple yet effective CNN branch to build a lightweight architecture. As displayed in Figure 3, this CNN branch is divided into three main downsampling stages, at 1/2, 1/4 and 1/8 of the input resolution, where each stage corresponds to a particular spatial resolution scale.
Each stage of the CNN branch is composed of an improved residual block. Each residual block has three main parts: a convolutional layer with a stride of 2 that realizes the downsampling operation in the spatial dimension, a BN layer that accelerates model convergence through batch normalization, and a ReLU layer that enhances the nonlinear fitting ability of the CNN branch. Meanwhile, residual connections are also used to ease the training of the model. Each residual connection uses a stride-2 convolution to align the spatial resolution, so that the feature maps can be aggregated by direct addition at the end of the residual block. It is worth noting that, in order to avoid the loss of HSI spectral information, the channel dimension of each residual block in the CNN branch is set equal to the number of bands, which supports the subsequent extraction of spectral features. By constructing the aforementioned hierarchical CNN branch, the local features that are crucial for the accurate classification of hyperspectral images can be efficiently extracted.
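The three downsampling stages can be checked with the standard convolution output-size formula. The sketch below is a minimal illustration, assuming 3 × 3 kernels with padding 1 in the main path and a 1 × 1 stride-2 convolution on the shortcut; these kernel sizes and the helper names are assumptions, since only the stride of 2 and the patch size of 8 are stated in the paper.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def cnn_branch_sizes(patch_size, num_stages=3, kernel_size=3, padding=1):
    """Spatial side length after each stride-2 stage of the CNN branch."""
    sizes = []
    size = patch_size
    for _ in range(num_stages):
        size = conv_output_size(size, kernel_size, stride=2, padding=padding)
        sizes.append(size)
    return sizes


# Patch size 8, as used in the experimental settings.
print(cnn_branch_sizes(8))

# An assumed 1x1, stride-2 shortcut convolution yields the same size as the
# main path, so the two feature maps can be added directly.
print(conv_output_size(8, kernel_size=1, stride=2, padding=0))
```

With an 8 × 8 input patch, the side lengths follow the 1/2, 1/4 and 1/8 scales described above.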

Transformer Branch Design
The Transformer branch, as the parallel branch in the dual-flow architecture, is designed to capture global dependencies. Inspired by ViT [41], our Transformer branch consists of a convolutional stem block and four stacked Transformer blocks (as shown in Figure 3). Because the computational complexity of the Transformer is quadratic in the sequence length, the complexity of the Transformer branch would be too high if each pixel in the input image block were directly reshaped into a vector. As a result, we first use a stem block composed of a convolution to downsample the image resolution by a factor of 2, so that the computational complexity of the Transformer branch can be reduced. The role of the stem block can also be interpreted as feature embedding; hence, our Transformer branch actually takes a 2 × 2 patch as a token.
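The effect of the stem block on the attention cost can be made concrete with a short calculation. The sketch below counts tokens and pairwise attention entries for an 8 × 8 input patch (the patch size used in the experiments); the function name `attention_cost` is illustrative, and the count ignores the number of heads and the embedding width.

```python
def attention_cost(patch_size, downsample):
    """Token count and number of pairwise attention entries for a square patch."""
    side = patch_size // downsample
    tokens = side * side
    return tokens, tokens * tokens


pixel_tokens, pixel_cost = attention_cost(8, downsample=1)  # one token per pixel
stem_tokens, stem_cost = attention_cost(8, downsample=2)    # one token per 2x2 patch

print(pixel_tokens, pixel_cost)
print(stem_tokens, stem_cost)
print(pixel_cost // stem_cost)  # reduction factor of the attention matrix
```

Halving the resolution reduces the number of tokens by a factor of 4 and the size of the attention matrix by a factor of 16.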
Each Transformer block includes a multi-head self-attention (MHSA) layer and a feed-forward network (FFN). Based on its internal self-attention mechanism, the multi-head self-attention layer can explicitly model long-range dependencies from a global perspective, while the feed-forward layer further enhances the network's representation ability through its internal fully connected layers and nonlinear activation functions. It is worth mentioning that layer normalization (LN) is used to normalize the data before each layer input, and residual connections are used in both the multi-head self-attention layer and the feed-forward layer to ease the training of the Transformer (preventing vanishing gradients). Given a feature sequence as an input, the output of the n-th (n ∈ {1, 2, . . . , N}) Transformer block can be calculated as:
$$x'_n = \mathrm{MHSA}(\mathrm{LN}(x_{n-1})) + x_{n-1}, \qquad x_n = \mathrm{FFN}(\mathrm{LN}(x'_n)) + x'_n \tag{1}$$
where $\mathrm{LN}(\cdot)$ is the layer normalization, $x'_n$ is the intermediate output after the MHSA layer, and $x_n$ is the output of the n-th Transformer block.
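The pre-normalization and residual structure described above can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: it assumes a single attention head with identity query/key/value projections, a ReLU stand-in for the feed-forward network, and layer normalization without learnable parameters. All function names are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no learned scale)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity projections (an assumption)."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[c] for w, tok in zip(weights, tokens))
                    for c in range(dim)])
    return out


def relu_ffn(tokens):
    """Stand-in for the feed-forward network: an element-wise ReLU."""
    return [[max(0.0, v) for v in tok] for tok in tokens]


def add(a, b):
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]


def transformer_block(x, attention=self_attention, ffn=relu_ffn):
    """Pre-norm block: LN before each sublayer, residual around each sublayer."""
    x_mid = add(attention([layer_norm(t) for t in x]), x)
    return add(ffn([layer_norm(t) for t in x_mid]), x_mid)


tokens = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [3.0, 0.0, 1.0], [2.0, 2.0, -2.0]]
output = transformer_block(tokens)
print(len(output), len(output[0]))  # the sequence shape is preserved
```

Because each sublayer is wrapped in a residual connection, the block reduces to the identity mapping when both sublayers output zeros, which is one way to see why such connections ease optimization.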
In particular, the class token is discarded in the Transformer branch design to reduce the number of model parameters and keep the model lightweight. Finally, we utilize positional encoding via depth-wise separable convolution to further enhance the local features learned by the CNN branch and compensate for the loss of positional information of the tokens in the Transformer branch, further improving the network classification performance.

Spatial-Spectral Feature Fusion Module Design
Both the CNN and Transformer branches aim to extract HSI spatial information, and thus, an adequate method to effectively fuse the local and global features of these two branches is crucial for the entire model to achieve accurate classification performance. As a consequence, the spatial-spectral feature fusion module (SSFFM), inspired by SENet [42], is designed to achieve an effective fusion of local features and global features (ensuring the consistency of the dual-branch output features) while making full use of the spectral information of the channel dimension. The whole design of SSFFM is presented in Figure 4. To obtain spatially consistent dual-branch features, we first reshape the sequence output by the Transformer branch into the form of feature maps. Then, the CNN branch output feature map is upsampled by bilinear interpolation to the same spatial resolution as the Transformer branch. To effectively fuse the features of both branches (i.e., the CNN and Transformer branches) and apply the spectral information to enhance the representation ability of the model, we concatenate the dual-branch features along the channel dimension and utilize a convolution block to compress the channel dimension, reducing the computational complexity of the model. We then apply a channel attention module composed of two linear layers to the compressed spectral features to fully utilize the spectral information, enhancing the hidden layer feature representations in the channel dimension.
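The shape bookkeeping of these fusion steps can be traced with a small sketch. It assumes an 8 × 8 input patch, a Transformer embedding width equal to the number of bands, and a compression back to the number of bands after concatenation; the last two are assumptions made for illustration, as the text states only that the CNN branch keeps the band count as its channel dimension. The function name `fusion_shapes` is hypothetical.

```python
def fusion_shapes(bands, patch_size=8):
    """Trace (channels, height, width) through the fusion steps."""
    trans_side = patch_size // 2   # stem block: 2x downsampling
    cnn_side = patch_size // 8     # CNN branch: final stage at 1/8 scale
    return {
        # Transformer token sequence reshaped back into a feature map.
        "transformer_map": (bands, trans_side, trans_side),
        # Output of the last CNN stage.
        "cnn_map": (bands, cnn_side, cnn_side),
        # CNN map after bilinear upsampling to the Transformer resolution.
        "cnn_upsampled": (bands, trans_side, trans_side),
        # Channel-wise concatenation of both branches.
        "concatenated": (2 * bands, trans_side, trans_side),
        # Convolution block compressing the channels (assumed target width).
        "compressed": (bands, trans_side, trans_side),
    }


# 200 bands, as retained in the Indian Pines dataset.
for name, shape in fusion_shapes(bands=200).items():
    print(name, shape)
```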
Specifically, we first collapse the feature map of each (spectral) channel into a single value along the spatial dimensions by a global average pooling operation, so that the resulting vector can be sent to two fully connected layers (linear layers) for modelling long-range dependencies between channels. The output of the fully connected layers is a weighting factor for each spectral channel. These weighting factors are used to strengthen or weaken the representations of different channels, and the final output, which carries both spatial and spectral information, is obtained by direct multiplication. Notably, the whole architecture takes full advantage of the respective strengths of CNN, Transformer and MLP to achieve a lightweight and powerful design. In order to ease the training of the model, residual connections are also used in the structural design. At the same time, to reduce the number of parameters, the two fully connected layers in the feature fusion module first compress the vector length and then restore it to its original size. After spatial-spectral feature fusion, the output is directly reshaped into a vector and sent to the final two fully connected (FC) layers to obtain the final output for classification.
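The channel attention step described above (global average pooling, two linear layers that compress and restore the vector length, then channel reweighting) can be sketched as follows. This is a minimal SENet-style illustration in plain Python; the ReLU and sigmoid activations follow SENet and are assumptions here, the weights are toy values, and the residual connection mentioned in the text is omitted for brevity.

```python
import math


def global_average_pool(feature_map):
    """Collapse each channel (an H x W nested list) to a single mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def linear(vector, weight):
    """Bias-free fully connected layer; `weight` has shape (out, in)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def channel_attention(feature_map, w_squeeze, w_restore):
    """Reweight channels with gates computed by two linear layers."""
    pooled = global_average_pool(feature_map)
    # First layer compresses the vector length; ReLU is an assumed activation.
    hidden = [max(0.0, h) for h in linear(pooled, w_squeeze)]
    # Second layer restores the length; sigmoid gating is an assumed choice.
    gates = [1.0 / (1.0 + math.exp(-g)) for g in linear(hidden, w_restore)]
    scaled = [
        [[gates[c] * value for value in row] for row in channel]
        for c, channel in enumerate(feature_map)
    ]
    return scaled, gates


# Toy input: 2 channels of size 2 x 2, with hypothetical weights.
features = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w_squeeze = [[0.5, 0.5]]        # 2 channels -> 1 hidden unit
w_restore = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channel gates
scaled, gates = channel_attention(features, w_squeeze, w_restore)
print(gates)
```

In this toy case the first channel receives a gate above 0.5 and is emphasized, while the second receives a gate below 0.5 and is suppressed.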

Experimental Settings
Implementation Details: Our proposed method was implemented on the PyTorch platform and trained with an NVIDIA GeForce GTX 1080Ti GPU (11 GB memory). We adopted the Adam optimizer to train our method with a patch size of 8 on three different HSI datasets. Based on experimental results, the best hyperparameter configuration differed for each HSI dataset, and the details of the experimental settings are listed in Table 1. Specifically, on the Indian Pines dataset, the learning rate, which was initialized to a different value, was decayed by a factor of 0.9 after each one-tenth of the total epochs, while the learning rate on the Pavia University and Houston 2013 datasets followed a cosine decay schedule with a warm-up strategy for 10 epochs. On the Indian Pines dataset, the training epochs and learning rate were set to 500 and $5 \times 10^{-4}$, respectively, with a mini-batch size of 64. On the Pavia University dataset, the training epochs and learning rate were set to 1000 and $1 \times 10^{-4}$, respectively, also with a mini-batch size of 64. On the Houston 2013 dataset, we trained the proposed method for 1000 epochs with a mini-batch size of 96 and a learning rate of $1 \times 10^{-4}$. Of note, for experiments on the Pavia University and Houston 2013 datasets, L2 regularization was also applied with a weight decay rate of $5 \times 10^{-4}$.
Performance Metrics: The performance of each method was quantitatively evaluated by three commonly used indices, including overall accuracy (OA), average accuracy (AA), and kappa coefficient ($\kappa$). Moreover, the direct visualization results of various approaches are also displayed to make a qualitative comparison.
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{AA} = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_i}{TP_i + FN_i}, \qquad \kappa = \frac{p_0 - p_e}{1 - p_e}$$
where P, N, T and F are the abbreviations of positive, negative, true, and false pixels in the prediction map, respectively, and $C$ is the number of classes. In particular, TP indicates the correctly predicted positive values; FP is a value where the actual class is negative and the predicted class is positive; FN denotes that the actual class is positive but the predicted class is negative; and TN expresses the correctly predicted negative values. $p_0$ is the sum of the correctly predicted values for each class divided by the total number of values, namely OA in this situation, and $p_e$ is the sum over all classes of the number of true values times the number of predicted values of each class, divided by the square of the total number of values. OA is the main reference metric in our experiments.
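The two learning-rate schedules and the three evaluation indices described in this subsection can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the warm-up is taken to be linear and the cosine decay to end at zero (neither is specified in the text), the metrics are computed from a multi-class confusion matrix using the standard definitions, and all function names are hypothetical.

```python
import math


def step_decay_lr(epoch, total_epochs, base_lr, factor=0.9, num_steps=10):
    """Multiply the rate by `factor` after each one-tenth of the total epochs."""
    step_length = total_epochs // num_steps
    return base_lr * factor ** (epoch // step_length)


def warmup_cosine_lr(epoch, total_epochs, base_lr, warmup_epochs=10):
    """Linear warm-up (assumed) followed by cosine decay towards zero (assumed)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def classification_metrics(confusion):
    """OA, AA and kappa from a square confusion matrix.

    confusion[i][j] counts pixels of true class i predicted as class j.
    """
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(num_classes))
    row_sums = [sum(row) for row in confusion]          # true pixels per class
    col_sums = [sum(confusion[i][j] for i in range(num_classes))
                for j in range(num_classes)]            # predicted pixels per class
    oa = correct / total                                # equals p_0
    aa = sum(confusion[i][i] / row_sums[i]
             for i in range(num_classes)) / num_classes
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - p_e) / (1.0 - p_e)
    return oa, aa, kappa


# Indian Pines settings: 500 epochs, initial rate 5e-4, step decay.
print(step_decay_lr(50, 500, 5e-4))
# Pavia University / Houston 2013 settings: 1000 epochs, initial rate 1e-4.
print(warmup_cosine_lr(10, 1000, 1e-4))
# Toy two-class confusion matrix (made-up counts, for illustration only).
print(classification_metrics([[5, 1], [2, 2]]))
```

The toy confusion matrix is only meant to show how AA differs from OA when classes are imbalanced: OA weights every pixel equally, whereas AA weights every class equally.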

Experiments and Results
In this section, three main public datasets of hyperspectral images are first introduced. The data division results (training and testing pixels) are also displayed. Finally, the comparative results are presented with an ablation study from both quantitative and qualitative approaches.

Dataset Introduction and Division
The selected datasets for the HSI classification task are the Indian Pines, Pavia University and Houston 2013 datasets. Their basic information can be seen in Table 2, where the related sensors, band information, spatial resolutions, image sizes, classes and data acquisition years are presented. In addition, each hyperspectral dataset is divided into training data and testing data. Of note, there are two dataset division approaches for HSI; the division used in this study differs from that of the original hyperspectral data website but is the same as that in the literature [7] for a fair comparison.

Indian Pines Data
The Indian Pines dataset was collected by an airborne visible/infrared imaging spectrometer (AVIRIS) sensor covering northwestern Indiana, USA. The image consists of 145 × 145 pixels with a ground sampling distance of 20 m. In total, 220 spectral bands are provided by this sensor (10 nm spectral resolution), covering wavelengths from 400 nm to 2500 nm. In this dataset, 20 noisy and water absorption bands have been removed to facilitate the image classification process. In total, 16 classes, ranging from large to small sample sizes, are included in this dataset; the classes and the corresponding numbers of training and testing pixels are shown in Table 3. The training pixels in this dataset are far fewer than the testing pixels, so promising results under this division indicate that the model is reliable.

Pavia University Data
The Pavia University dataset was collected by the Reflective Optics Spectrographic Imaging System (ROSIS) sensor. This sensor captured images covering an urban area around Pavia University. In this dataset, the image size is 610 × 340 with a 1.3 m spatial resolution. In terms of spectral information, the band wavelength ranges from 0.43 µm to 0.86 µm. As in the Indian Pines dataset, 12 bands have been removed because of their low signal-to-noise ratio (SNR) and water absorption, leaving 103 bands in the dataset. There are, in total, 9 classes in this image, including asphalt, meadows, gravel, trees, metal sheets, bare soil, bitumen, bricks, and shadows, which need to be discriminated by the proposed method. The details of the training and testing pixels are displayed in Table 4.

Houston 2013 Data
The last dataset we applied to evaluate the effectiveness of the proposed Hyper-LGNet is the Houston 2013 dataset. It was obtained by an ITRES CASI-1500 sensor surveying the campus of the University of Houston. Each image in this dataset is formed as 349 × 1905 pixels. The spectral wavelength (total 144 bands) ranges from 346 nm to 1046 nm. The spatial resolution of this dataset is 2.5 m, and there are, in total, 15 classes that need to be classified by the proposed method. Detailed information regarding these data can be found in Table 5.

Indian Pines Dataset Classification Results
Our comparative study is first conducted on the Indian Pines dataset using various algorithms, including traditional machine learning methods (e.g., SVM, RF, KNN), deep learning methods (CNN, RNN, VGG, ViT, FuNet-C [43], SpectralFormer (SF)) and our proposed Hyper-LGNet. The results are evaluated in terms of overall accuracy (OA), average accuracy (AA) and kappa. First, comparing machine learning-based methods and deep learning-based methods, it can be seen from Table 6 that the OA, AA and kappa of conventional machine learning and deep learning methods are comparable, as deep learning has powerful learning and image feature extraction abilities. In particular, among the deep learning-based methods, the last four (SpectralFormer, FuNet-C, ViT and Hyper-LGNet) perform much better than CNN, FCN and RNN. This is mainly because these four can learn more local and global details in their encoder-decoder architecture.
In addition, it can be found that the OA, AA and kappa of the proposed Hyper-LGNet greatly outperform those of the previous SOTA SpectralFormer method (under the same data division settings): OA increases from 81.76% to 89.01%, AA increases from 87.81% to 94.14% and kappa increases from 0.7919 to 0.8743. In detail, for 15 of the 16 classes, the accuracy obtained by our method is higher than that of SpectralFormer. Class No.14 (Alfalfa) is hard to distinguish with the SpectralFormer method because of its similarity with other classes and its few training samples; however, the proposed method, using a dual-flow architecture, raises the classification accuracy of this class to 100%, surpassing the previous SOTA model by 20.51% and indicating that the proposed method obtains more local and global image feature details with this architecture. Therefore, the proposed method of integrating spectral and spatial information in a dual-flow way is effective, especially in coping with small classes and few training samples. The results, to a great extent, demonstrate the value and practicality of deep learning-based approaches in HS image classification. Compared with traditional convolution operations, Transformer-based models can extract finer spatial feature representations from the sequence perspective, yielding a comparable performance to other deep learning methods. Such a conclusion can also be drawn from the box plot of the different algorithms in Figure 5, which is based on the classification accuracies of all 16 classes on Indian Pines and shows the distribution for each method. The overall accuracy of the proposed method is the highest among the various algorithms. Additionally, considering the classification results of all classes, the proposed method achieves the smallest variance, as seen from the size of the box.
Although the previous SOTA, SpectralFormer, ranks second regarding overall accuracy among these methods, its classification accuracy varies widely across the 16 classes, meaning that this model is less accurate and robust than our proposed one. Therefore, the proposed Hyper-LGNet has a strong discriminating ability in the HSI classification problem. More direct results can be seen in the visualisation in Figure 6, where difficult classes are well classified.

Pavia University Dataset Classification Results
In this section, different algorithms, including SVM, RF, KNN, CNN, RNN, VGG, ViT, FuNet-C, SpectralFormer and our Hyper-LGNet, are also compared on the Pavia University dataset. It can be seen from Table 7 that machine learning methods perform predictably worse than most of the deep learning methods due to their limited data fitting and representation ability. This is mainly due to the characteristics of the dataset: the Pavia University dataset has large samples and few classes. As can be seen in Table 7, the OA results obtained by the machine learning methods (SVM, RF, KNN) are all lower than 80%, while the OA of most deep learning-based models can reach more than 80%. On this dataset, the results of the proposed Hyper-LGNet surpass the previous SOTA SpectralFormer again, providing increases of 1.61% in OA, 0.12% in AA and 0.0208 in kappa.
Regarding the explicit comparison between deep learning methods, the classic backbone models (CNN, RNN, VGG) could not reach outstanding results because they fail to make the best of spectral sequence information. Specifically, CNNs are good at extracting local contextual information but hardly capable of capturing sequence attributes well. RNNs, in contrast, learn spectral features band by band in order, which makes it hard for them to learn long-term dependencies among a large number of bands (103 bands in the Pavia University dataset). However, only considering sequence data without a powerful local contextual extraction ability also leads to poor classification performance.
The Transformer-based model, namely ViT, obtains 81.56% OA, but the Transformer structure designed for sequence information is poor at spatial information learning, which limits further improvement of its performance. As a result, with the full utilization of HSI spatial and spectral information, our Hyper-LGNet reaches the best results regarding OA, AA and kappa. This demonstrates that the proposed dual-flow architecture has a significant superiority over the other methods. The box plot of various algorithms applied to the Pavia University dataset can be seen in Figure 7, and the direct qualitative visualization can be found in Figure 8.

Houston 2013 Dataset Classification Results
Finally, the advantages of the proposed method are verified by comparing the classification performance of various algorithms on the Houston 2013 dataset. In general, the values obtained by different machine learning methods (SVM, RF, KNN) are comparable to each other but much lower than those of deep learning models. This is because machine learning methods generally adopt hand-crafted feature extraction approaches to realize image analysis. Therefore, these methods are not applicable without image pre-processing of HSI datasets, and their OA results are all lower than 80%. Moreover, since HSI has abundant data spatially and spectrally, the complete utilization of these data is important for models to perform well on HSI classification.
As can be seen in Table 8, ViT obtains limited OA scores. Despite using an attention mechanism to obtain the relationship between each pair of spectral bands, it fails to capture semantic features. In contrast, the other deep learning methods (CNN, RNN, VGG, SpectralFormer, FuNet-C, Hyper-LGNet) employ CNN structures, whose local connections and shared weights make them effective at capturing local correlations. Intuitively, with a similar capacity for spatial information acquisition, the quality of the aforementioned models depends on the acquisition of spectral connections. CNNs are poor at expanding their receptive fields, resulting in the loss of spectral information. However, the proposed Hyper-LGNet takes full advantage of CNN and Transformer structures, realizing the complete utilization of spatial and spectral data and reaching the best OA of 88.80%. In particular, for challenging classes in the dataset (e.g., Class 10: Highway), all the algorithms perform poorly except for our method, which achieves an accuracy of more than 80%. The box plot analysis of the proposed Hyper-LGNet on the Houston 2013 dataset can be seen in Figure 9. Additionally, as with the Indian Pines and Pavia University classification maps, the visualization of classification results on the Houston 2013 dataset can be seen in Figure 10, directly showing the superiority of the proposed Hyper-LGNet in the HS image classification problem.

Ablation Study for the Effectiveness of Dual-Branch Architecture
This ablation study explores whether the dual-flow architecture is effective for the HS image classification task by comparing different branch configurations. The experiment is conducted on the Indian Pines dataset, and three design strategies (a Transformer branch, a CNN branch, and the dual-flow architecture) are employed. As shown in Table 9, the best performance is achieved by the proposed dual-flow method, which obtains an OA of 89.01%, an AA of 94.14% and a kappa of 0.8743. Moreover, a single Transformer branch obtains OA, AA and kappa values of 86.90%, 93.45% and 0.8509, respectively, outperforming a single CNN branch and showing that the Transformer branch is more powerful in the HSI task than a conventional CNN architecture. As a result, it can be concluded that the proposed Hyper-LGNet combines spatial and spectral information from two different branches, enabling the HSI classification network to obtain more local and global image features and to achieve better classification performance than either single-branch network.
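For reference, the three scores reported in these tables can all be derived from a confusion matrix. The following is a minimal sketch using the standard definitions of OA, AA and Cohen's kappa; the function name `classification_scores` and the toy counts are illustrative assumptions, not values or code from this study.

```python
import numpy as np

def classification_scores(conf):
    """Return (OA, AA, kappa) from a confusion matrix.

    conf[i, j] = number of test pixels of true class i predicted as class j.
    Assumes every class has at least one test pixel (no empty rows).
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    # OA: fraction of all test pixels that are classified correctly.
    oa = np.trace(conf) / total
    # AA: mean of the per-class accuracies, so every class counts equally.
    per_class = np.diag(conf) / conf.sum(axis=1)
    aa = per_class.mean()
    # Kappa: observed agreement corrected for the agreement expected by chance.
    expected = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

# Toy 3-class example with made-up counts (not taken from the paper).
toy = [[80, 15, 5],
       [10, 85, 5],
       [0, 0, 10]]
oa, aa, kappa = classification_scores(toy)
```

In the toy matrix, the small third class is classified perfectly, which lifts AA above OA. The same effect can explain why AA exceeds OA in a table when small classes are recognized well.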

Ablation Study for Different Fusion Methods
To explore the effectiveness of various fusion methods in the proposed dual-flow architecture, three fusion approaches are compared: the designed SSFFM, direct addition and direct concatenation. As can be seen in Table 10, direct addition and direct concatenation yield similar OA, AA and kappa results, while the proposed SSFFM achieves the best performance, increasing OA by 1.3%, AA by 1.08% and kappa by 0.0148 compared with direct addition, and increasing OA by 1.35%, AA by 1.07% and kappa by 0.0152 compared with direct concatenation. In particular, some difficult classes with few samples can also be well discriminated, such as Class No. 14 (Alfalfa). Therefore, SSFFM is capable of fully utilizing the spatial/spectral information from both branches.
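The two baseline fusion strategies can be sketched as follows. This is only an illustration under stated assumptions: the feature shapes, the function names `fuse_add` and `fuse_concat`, and the 1x1 projection after concatenation are hypothetical choices, and the internals of SSFFM are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from the two branches: (channels, height, width).
cnn_feat = rng.standard_normal((64, 7, 7))
trans_feat = rng.standard_normal((64, 7, 7))

def fuse_add(a, b):
    """Direct addition: element-wise sum; both shapes must match exactly."""
    return a + b

def fuse_concat(a, b, weight):
    """Direct concatenation along channels, followed by a 1x1 projection.

    weight has shape (out_channels, 2 * channels) and plays the role of a
    1x1 convolution that maps the stacked channels back to out_channels.
    """
    stacked = np.concatenate([a, b], axis=0)          # (2C, H, W)
    return np.einsum("oc,chw->ohw", weight, stacked)  # (out, H, W)

# Random weights stand in for a learned projection (assumption).
proj = rng.standard_normal((64, 128)) * 0.1
added = fuse_add(cnn_feat, trans_feat)
concatenated = fuse_concat(cnn_feat, trans_feat, proj)
```

Under these assumptions, addition is the special case of concatenation whose projection is two identity matrices placed side by side, which may help explain why the two baselines behave similarly.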

Ablation Study for the Suitable Choice of the Number of Transformer Blocks
The number of Transformer blocks (TB) is investigated in this section to find the most suitable choice for HSI classification performance. Several Transformer block numbers (e.g., 2, 4, 8) are evaluated on the Indian Pines dataset with the proposed deep learning approach. It can be seen in Table 11 that the proposed Hyper-LGNet achieves the best classification performance when the number of Transformer blocks is set to 4, which is a clear improvement over method 1 (improving OA by 4.21%, AA by 2.06%, and kappa by 0.0474) and marginally better than method 3 (improving OA by 0.94%, AA by 1.4%, and kappa by 0.0107). The table also shows that the classification performance does not keep improving as more Transformer blocks are stacked, mainly due to the overfitting problem and the difficulty of optimizing the learnable parameters during the training phase. We believe that this ablation study offers useful guidance for applying the Transformer model to the HSI classification problem.
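To give a rough sense of how the number of learnable parameters scales with depth, the sketch below counts the parameters of a generic Transformer encoder block (multi-head self-attention, a two-layer MLP and two LayerNorms, all with biases). The embedding size of 64 and the MLP size of 256 are placeholder assumptions, not the settings used in Hyper-LGNet.

```python
def block_params(dim, mlp_dim):
    """Learnable parameters in one generic Transformer encoder block."""
    # Q, K, V and output projections, each a dim x dim matrix plus a bias.
    attention = 4 * (dim * dim + dim)
    # Two linear layers: dim -> mlp_dim and mlp_dim -> dim, with biases.
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * (2 * dim)
    return attention + mlp + norms

def encoder_params(depth, dim=64, mlp_dim=256):
    """Parameters of `depth` identical stacked blocks (hypothetical sizes)."""
    return depth * block_params(dim, mlp_dim)

# Parameter counts for the block numbers examined in the ablation.
counts = {depth: encoder_params(depth) for depth in (2, 4, 8)}
```

Since the count grows linearly with depth, doubling the number of blocks doubles the parameters that must be fitted to the same labelled training pixels, which is consistent with the overfitting explanation given above.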

Conclusions
In this study, we aimed to overcome the respective limitations of CNN-based models and Transformer-based models in HSI classification. Specifically, we proposed a dual-branch architecture that combines the CNN and Transformer models, realizing a full utilization of HSI spatial and spectral information. With the help of a lightweight and hierarchical CNN branch, the crucial local features could be extracted accurately. In addition, the Transformer branch could capture clear long-range dependencies from a global perspective and enhance the local features learned by the CNN branch. The Spatial-spectral Feature Fusion Module (SSFFM) was designed to eliminate the difference between the features obtained by the two branches for an effective fusion. The proposed Hyper-LGNet, composed of the above components, achieved the best classification performance in terms of overall accuracy, average accuracy and kappa on three popular HSI datasets, demonstrating that it has a powerful generalization ability. In particular, compared with the previous SOTA SpectralFormer method and seven other algorithms, our proposed method obtained SOTA performance on these three datasets. Several ablation studies were conducted to examine the effectiveness of the individual branches, the feature fusion methods and the number of Transformer blocks.
Although this work demonstrates the potential of a dual-flow architecture in HSI classification, several points are left for further exploration. Firstly, the Transformer branch is expected to be improved by utilizing more advanced techniques (e.g., self-supervised learning), making it more suitable for HS image classification tasks. Moreover, a more lightweight network could be established to reduce the computational complexity while maintaining the performance. Finally, the fusion module could be further improved to achieve a more effective fusion.