Multi-Scale Spatial–Spectral Residual Attention Network for Hyperspectral Image Classification

Introduction
Remote sensing images with high resolution and rich spectral information play a pivotal role in diverse fields, such as plant pest and disease detection [1][2][3], mineral exploration [4,5], crop yield estimation [6,7], and environmental monitoring [8,9]. Fang et al. [10] estimated soil moisture using satellite remote sensing data by integrating Bayesian depth image prior (BDIP) downsampling with a deep fully convolutional neural network. Pang et al. [11] developed an advanced platform combining Lidar, CCD cameras, and hyperspectral sensors for detailed forest monitoring and analysis. Guo et al. [12] utilized vegetation indices (VIs) and texture features (TFs) from unmanned aerial vehicle (UAV) hyperspectral images to create a disease monitoring model based on partial least squares regression (PLSR) for detecting wheat yellow rust. Hyperspectral images (HSIs) [13], composed of hundreds of spectral bands, offer finer spectral divisions than other images, which provides powerful material discrimination capabilities. As a critical technology in these applications, HSI classification has traditionally employed methods like the support vector machine (SVM) [14], decision tree (DT) [15], and maximum likelihood classifier (MLC) [16]. However, these conventional methods primarily extract shallow features and often overlook deeper characteristics, with the manual setting of feature selectors significantly impacting the classification accuracy.
In contrast to traditional HSI classification techniques, deep learning methods have the advantages of automatic feature learning and superior classification capabilities. They have achieved significant progress in various domains. Consequently, an increasing number of scholars are exploring HSI classification through deep learning approaches. Boulch et al. [17] proposed an autoencoder (AE)-based network to perform HSI classification by combining multi-layer AEs and pooling layers. Chen et al. [18] proposed a Deep Belief Network (DBN)-based method to extract both shallow and deep features from HSIs for classification. The advent of Convolutional Neural Networks (CNNs) significantly enhanced the image classification accuracy. Given the abundance of sequential information in the spectral channels of HSIs, Mou et al. [19] applied recurrent neural networks to HSI classification for the first time and achieved commendable results. Hu et al. [20] effectively extracted spectral features from HSIs using a stack of five 1D-CNNs. Sharma et al. [21] utilized 2D-CNNs to extract contextual spatial feature information from dimensionality-reduced HSIs, improving the classification performance with limited HSI training samples. However, utilizing the information from spectral or spatial features alone cannot fully take advantage of the rich features of HSIs. The fusion of spectral and spatial features is a beneficial complement to HSI classification. Yang et al. [22] combined the features of 1D-CNN and 2D-CNN frameworks to extract spectral and spatial features, respectively, and then fused these features through a fully connected layer for classification. Joint spatial-spectral features can significantly improve the HSI classification accuracy. Nevertheless, dual-branch networks that extract features from spectral and spatial dimensions separately can lead to the loss of some original information, resulting in inadequate spatial-spectral joint features. The introduction of three-dimensional Convolutional Neural Networks has effectively mitigated this issue. Li et al. [23] proposed an end-to-end five-layer 3D-CNN network. Chen et al. [24] combined the 3D-CNN with regularization techniques to effectively extract original three-dimensional features and reduce overfitting in neural networks. Zhong et al. [25] introduced an end-to-end Spectral-Spatial Residual Network for HSI classification, which takes the original three-dimensional data cube as the input and extracts both spectral and spatial features using the 3D-CNN. Qi et al. [26] combined multi-scale and residual concepts to extract deeper features from different receptive fields and classify HSIs. Li et al. [27] integrated attention mechanisms into the Dual-Branch Multi-Attention Network to refine hyperspectral features and achieve a better classification performance. Meng et al. [28] proposed a new deep residual involution network (DRIN) for HSI classification. By using an enlarged involution kernel, long-distance spatial interactions can be well modeled.
Although CNNs have shown good performance in the field of HSI classification, there are still challenges to be addressed. For example, labeled HSIs require substantial human and material resources, so maximizing the utilization of existing data when HSI samples are limited is crucial. Furthermore, distinguishing the varying contributions of deep features to each node of the neural network is a challenging task. Lastly, the extraction of joint spatial-spectral features can lead to interference, and determining the correct combination of extracted features in multi-branch networks poses a challenge.
To mitigate these issues, a novel model combining 3D-CNN, 2D-CNN, multi-scale mechanisms [29,30], and residual attention mechanisms [31,32], namely MSRAN, is proposed. This method independently extracts spatial and spectral features through dual branches to minimize the interference between features of different dimensions. It then uses multi-scale mechanisms to enrich the feature diversity and employs spatial residual attention and spectral residual attention to identify effective features, thereby improving feature utilization. Finally, convolutional adaptive fusion integrates spatial texture features and spectral sequence features for classification. The contributions of this paper can be summarized as follows.

1.
We present a Dual-Branch Multi-Scale Residual Spatial-Spectral Attention Network to classify hyperspectral remote sensing images. This network independently extracts spatial and spectral features, minimizes the interference between these two types of information, and enables the model to focus on multi-dimensional features.

2.
We extract the sequential and neighboring spectral features using different-sized convolution kernels. Meanwhile, multi-scale 2D convolution is employed to capture spatial features by superimposing various receptive fields. This approach improves the ability to classify multi-boundary samples by highlighting the central pixel weight and capturing the ground contours and local details.

3.
The proposed MSRAN method employs dual residual spectral and spatial attention mechanisms to identify the important features for hyperspectral image classification, which eliminates the disruptive features and enhances the utilization of spatial and spectral features.
The results demonstrate a substantial improvement in the classification accuracy over advanced methods.
The subsequent parts of this paper are organized as follows: Section 2 discusses the proposed MSRAN method. In Section 3, ablation studies and comparative experiments are conducted to verify the effectiveness and competitiveness of the proposed model. Section 4 summarizes our work.

Structure of the Multi-Scale Spatial-Spectral Residual Attention Network
In HSI classification methods, spatial and spectral features are directly extracted and fused to improve the accuracy, potentially leading to interactions between the two branches [35]. Although some improved approaches extract spectral and spatial features in parallel, further optimization of these features is necessary. To enhance the HSI classification performance, this paper extracts joint spatial-spectral, spatial texture, and spectral sequence features from HSIs at different receptive fields using the multi-scale 3D-CNN and 2D-CNN, followed by residual attention optimization. The structure of the proposed MSRAN method is illustrated in Figure 1. Initially, the original HSI is cropped into overlapping blocks to reduce the input data volume. These cropped HSI blocks are then fed into two branches for spectral and spatial feature extraction, respectively, followed by feature fusion classification. Specifically, in the spectral branch, various-scale 3D convolutions are utilized to extract joint spatial-spectral and spectral sequence features. To enhance the spectral distinguishability of different materials, an improved residual spectral attention block reallocates the weights of features, thus augmenting the salient spectral characteristics while filtering out noise and less relevant features. In the spatial feature extraction branch, a combination of multi-scale 2D-CNNs extracts spatial texture features from different receptive fields. For instance, small convolution kernels focus on spatial texture features, while larger ones are more adept at capturing object contour characteristics. However, due to parameter issues associated with large convolution kernels, we expand the receptive field by stacking sequential 2D-CNNs, thus attaining more spatial context features while reducing the number of parameters. Subsequently, to improve the spatial distinguishability of different materials, residual spatial attention reallocates the contributions of different depth features to the neural network nodes, enhancing significant spatial characteristics and diminishing the weights of irrelevant spatial features to the target task. Finally, the convolution fusion module combines the spectral and spatial features optimized through the attention mechanism and uses the Softmax function to classify the fused spatial-spectral features.
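The overlapping-block cropping step described above can be sketched as follows. The paper does not detail the cropping, so the per-pixel window centering, zero-padding of borders, and the toy cube sizes here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def extract_patches(hsi: torch.Tensor, patch_size: int = 9) -> torch.Tensor:
    """Crop an HSI cube (bands, H, W) into overlapping patch_size x patch_size
    blocks, one centered on each pixel, zero-padding the image borders."""
    c, h, w = hsi.shape
    pad = patch_size // 2
    padded = F.pad(hsi, (pad, pad, pad, pad))  # pad the two spatial dims
    patches = []
    for i in range(h):
        for j in range(w):
            patches.append(padded[:, i:i + patch_size, j:j + patch_size])
    return torch.stack(patches)  # (H*W, bands, patch_size, patch_size)

cube = torch.randn(103, 6, 5)      # toy Pavia-like cube: 103 bands, 6x5 pixels
patches = extract_patches(cube, 9)
print(patches.shape)               # torch.Size([30, 103, 9, 9])
```

Each patch is later classified by the label of its central pixel, which is why the central pixel weight matters for boundary samples.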


Spectral Feature Extraction Branch
In the spectral feature extraction branch, the multi-scale 3D-CNN is used to design the spectral feature extraction module, as shown in Figure 2. The joint spatial-spectral cubic features are extracted by using a three-dimensional convolution with a convolution kernel of 3 × 3 × 3, and the spectral sequence features are extracted by using a three-dimensional convolution with a convolution kernel of 5 × 1 × 1. Subsequently, the spatial-spectral features and spectral features are summed and then fused through convolution. Specifically, the input HSI is denoted as x ∈ R^{c×w×h}, where c represents the spectral dimension, and w and h respectively indicate the width and height of the input HSI. The spectral features extracted by convolution at different scales are activated by Relu and summed to obtain the output feature y. The calculation formulas are shown in Equations (1)-(3):

x_1 = Relu(Conv_{3×3×3}(x))    (1)
x_2 = Relu(Conv_{5×1×1}(x))    (2)
y = x_1 + x_2    (3)

In the spectral feature extraction module, x_i ∈ R^{c×w×h} (i = 1, 2) represents the Relu-activated convolutional features at different spectral scales, where i denotes the serial number of convolutions. The convolution operations Conv_{3×3×3}(·) and Conv_{5×1×1}(·) are similar. Taking the latter as an example, it is a 3D convolution with a kernel size of 5 × 1 × 1, where the spectral dimension is 5, and the spatial dimensions are 1 × 1.
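As an illustration, the two parallel 3D convolutions of Equations (1)-(3) can be sketched in PyTorch. The output channel width and the toy input shape are assumptions, not values stated in the paper:

```python
import torch
import torch.nn as nn

class MultiScaleSpectralConv(nn.Module):
    """Two parallel 3D convolutions: a 3x3x3 kernel for joint
    spatial-spectral features and a 5x1x1 kernel for spectral-sequence
    features; both outputs are Relu-activated and summed."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.conv_a = nn.Conv3d(1, channels, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv_b = nn.Conv3d(1, channels, kernel_size=(5, 1, 1), padding=(2, 0, 0))

    def forward(self, x):                 # x: (batch, 1, bands, h, w)
        x1 = torch.relu(self.conv_a(x))   # joint spatial-spectral features
        x2 = torch.relu(self.conv_b(x))   # spectral-sequence features
        return x1 + x2                    # y = x1 + x2

y = MultiScaleSpectralConv()(torch.randn(2, 1, 103, 9, 9))
print(y.shape)  # torch.Size([2, 8, 103, 9, 9])
```

The padding values are chosen so that both branches preserve the input dimensions, which is what allows the element-wise sum.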
To refine the spectral features and redistribute the feature contributions to nodes, we introduce the spectral residual attention module by using the residual structure, as shown in Figure 3. This module employs maximum pooling and average pooling on the feature matrix to respectively capture decisive features within the target neighborhood and smooth shared neighborhood features. Subsequently, these features are processed through a Multi-Layer Perceptron (MLP) with shared weights, thereby increasing the weights that are favorable for target feature classification. The module then utilizes a dual residual structure to mitigate network degradation issues and expedite network updates. Specifically, the extracted spectral features y undergo convolution with a B × 1 × 1 kernel, mapping the information from the spectral dimension to the channel dimension, resulting in the output spectral feature ŷ. This feature is then fed into the spectral channel attention module for further processing. The spectral feature ŷ ∈ R^{c×w×h} undergoes maximum pooling and average pooling, followed by a parameter-shared MLP to obtain preassigned weights. The MLP consists of two layers. The first layer produces a feature map of size c/r × 1 × 1 (where r is the feature compression ratio), while the second layer yields a feature map with the dimensions c × 1 × 1. The feature map outputs from the second layer of the MLP are added together, and then an activation function is used to generate the redistributed weights M_c ∈ R^{c×1×1}. These weights are subsequently multiplied with the spectral feature ŷ to enhance the feature representation, and ŷ is added to the result to obtain the output of the spectral feature extraction module, z. The formulas are shown in Equations (4)-(6):

ŷ = Conv_{B×1×1}(y)    (4)
M_c(ŷ) = σ(MLP(AvgPool(ŷ)) + MLP(MaxPool(ŷ))) = σ(W_1(W_0(ŷ^C_avg)) + W_1(W_0(ŷ^C_max)))    (5)
z = M_c(ŷ) ⊗ ŷ + ŷ    (6)

where Conv_{B×1×1}(·) represents a 3D convolution with a kernel size of B × 1 × 1, B is the spectral dimension size of the input y, and AvgPool(·) and MaxPool(·) represent the average pooling and maximum pooling operations, respectively. σ is the Sigmoid activation function, and M_c(·) denotes the weights generated by the MLP. After simplification in Equation (5), W_0 and W_1 are the weights of the first and second layers of the MLP, while ŷ^C_avg and ŷ^C_max represent the feature matrices after average pooling and maximum pooling.
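A minimal sketch of this channel-attention step, assuming the B × 1 × 1 spectral-to-channel mapping has already produced ŷ; the channel count and compression ratio r are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpectralResidualAttention(nn.Module):
    """Channel attention in the style of Equations (4)-(6): global max and
    average pooling, a shared two-layer MLP with compression ratio r,
    Sigmoid-activated weights, and a residual add."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1, bias=False),  # W0: c -> c/r
            nn.ReLU(),
            nn.Conv2d(channels // r, channels, 1, bias=False),  # W1: c/r -> c
        )

    def forward(self, y_hat):                                   # (batch, c, h, w)
        avg = self.mlp(y_hat.mean(dim=(2, 3), keepdim=True))    # MLP(AvgPool)
        mx = self.mlp(y_hat.amax(dim=(2, 3), keepdim=True))     # MLP(MaxPool)
        m_c = torch.sigmoid(avg + mx)                           # weights (batch, c, 1, 1)
        return m_c * y_hat + y_hat                              # residual output z

z = SpectralResidualAttention(channels=8)(torch.randn(2, 8, 9, 9))
print(z.shape)  # torch.Size([2, 8, 9, 9])
```

Using 1 × 1 convolutions for the shared MLP keeps the weights identical for the max- and average-pooled paths, matching the parameter-shared MLP described above.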


Spatial Feature Extraction Branch
The spatial dimension of the HSI is significantly smaller than the spectral dimension, which can easily lead to the Hughes phenomenon [36] in HSI classification. To minimize the disparity between the spectral and spatial dimensions, spectral dimension compression is performed while extracting features using the 2D-CNN. This not only reduces the loss of spectral information but also decreases the number of model parameters. As shown in Figure 4, in order to fully extract and utilize the spatial neighborhood features, this paper combines different receptive fields and the multi-level 2D-CNN to extract the features of four branches and then applies spatial attention to screen the effective features and obtain the output of each branch. Specifically, the first branch uses a single 2DConv with a convolution kernel size of 3 × 3 to extract the spatial details and texture features under a small-scale receptive field. The second branch employs a 2DConv with a convolution kernel size of 5 × 5 to extract spatial object contour features under a larger receptive field. The third branch initially utilizes a 2DConv with a convolution kernel size of 5 × 5 for a large-scale receptive field, followed by a 2DConv with a convolution kernel size of 3 × 3 to further expand the receptive field and extract environmental features surrounding spatial objects. In branch four, average pooling is used to obtain smooth neighborhood features to assist with the fusion of various levels of spatial features. Each branch output is numbered M_i (i = 1, 2, 3, 4), where i represents the number of different branches. The calculation formula is shown in Equation (7):

M_1 = M_S(Relu(Conv2d_{3×3}(x)))
M_2 = M_S(Relu(Conv2d_{5×5}(x)))
M_3 = M_S(Relu(Conv2d_{3×3}(Relu(Conv2d_{5×5}(x)))))
M_4 = AvgPooling(x)    (7)

where Conv2d_{3×3}(·) represents a 2D convolution with a size of 3 × 3, Relu(·) is the Relu activation function, M_S(·) denotes the spatial residual attention, AvgPooling(·) signifies average pooling, and σ is the Sigmoid activation function. Within the spatial residual attention applied to the extracted spatial features in two dimensions, as illustrated in Figure 5, decisive features within the target neighborhood and smoothed neighborhood features are obtained through maximum pooling and average pooling. After concatenating these features, convolution is used for feature discrimination, focusing on characteristics that are beneficial for target classification. Finally, a dual-path residual is employed to accelerate network training and prevent network degradation. Specifically, the spatial feature x ∈ R^{8×w×h} is compressed through maximum and average pooling along the spectral dimension. Then, the redistributed weights M_s are generated through convolution mapping and the Sigmoid activation function and are multiplied with the spatial feature x. Finally, the result of this multiplication is directly added to x, yielding the output K of the spatial feature extraction module.
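The four branches and the spatial residual attention can be sketched as follows. The channel width 8 follows the x ∈ R^{8×w×h} notation above; the 7 × 7 attention kernel and other layer details are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SpatialResidualAttention(nn.Module):
    """Figure 5 sketch: channel-wise max and average maps are concatenated,
    convolved into one weight map, squashed by Sigmoid, multiplied with the
    input, and added back residually (K = M_s * x + x)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):                                # x: (batch, 8, h, w)
        pooled = torch.cat([x.amax(dim=1, keepdim=True),
                            x.mean(dim=1, keepdim=True)], dim=1)
        m_s = torch.sigmoid(self.conv(pooled))           # (batch, 1, h, w)
        return m_s * x + x

class MultiScaleSpatialBranch(nn.Module):
    """The four branches described in the text: a 3x3 conv, a 5x5 conv,
    a 5x5 conv followed by a 3x3 conv, and average pooling; the conv
    branches pass through spatial residual attention."""
    def __init__(self, ch: int = 8):
        super().__init__()
        self.b1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.b2 = nn.Conv2d(ch, ch, 5, padding=2)
        self.b3a = nn.Conv2d(ch, ch, 5, padding=2)
        self.b3b = nn.Conv2d(ch, ch, 3, padding=1)
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)
        self.attn = SpatialResidualAttention()

    def forward(self, x):
        m1 = self.attn(torch.relu(self.b1(x)))           # fine texture
        m2 = self.attn(torch.relu(self.b2(x)))           # object contours
        m3 = self.attn(torch.relu(self.b3b(torch.relu(self.b3a(x)))))  # context
        m4 = self.pool(x)                                # smooth neighborhood
        return m1, m2, m3, m4

outs = MultiScaleSpatialBranch()(torch.randn(2, 8, 9, 9))
print([o.shape for o in outs])
```

Stacking the 5 × 5 and 3 × 3 convolutions in branch three widens the receptive field with fewer parameters than a single large kernel, as argued earlier in the section.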


Feature Fusion and Classification Module
The spectral and spatial features optimized by the attention mechanism are fused through a cascading operation, as shown in Figure 6. The features are then mapped through convolution to output N dimensions, where N is the number of classes of ground objects in the training samples. Finally, the classification result is obtained through a Softmax operation, as calculated in Equation (8):

Y = δ_Softmax(Conv(Cat(z, M_1, M_2, M_3, M_4)))    (8)

where δ_Softmax represents the Softmax function, and i in M_i is the number of different branches in the spatial feature extraction module. In the spatial-spectral feature fusion module, the spectral and spatial features are fused through a cascading operation. This is followed by a convolution operation to obtain spatial-spectral joint features with consistent dimensions, which are then further integrated through feature fusion.
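A minimal sketch of this fusion-and-classification step; the 1 × 1 convolution kernel and the global average pooling over space are assumptions about details the text leaves unstated:

```python
import torch
import torch.nn as nn

def fuse_and_classify(z, spatial_feats, num_classes):
    """Cascade (concatenate) the optimized spectral feature z with the
    spatial branch outputs along the channel axis, map to num_classes
    with a 1x1 convolution, pool over space, and apply Softmax."""
    fused = torch.cat([z, *spatial_feats], dim=1)        # cascading operation
    conv = nn.Conv2d(fused.shape[1], num_classes, kernel_size=1)
    logits = conv(fused).mean(dim=(2, 3))                # (batch, num_classes)
    return torch.softmax(logits, dim=1)

z = torch.randn(2, 8, 9, 9)                              # spectral branch output
spatial = [torch.randn(2, 8, 9, 9) for _ in range(4)]    # four spatial branches
probs = fuse_and_classify(z, spatial, num_classes=9)     # e.g., 9 PU classes
print(probs.shape)
```

Concatenation keeps the spectral and spatial features intact before fusion, leaving the 1 × 1 convolution to learn the adaptive weighting between them.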


Experimental Results and Analysis
To validate the effectiveness of the proposed MSRAN classification method, a series of ablation and comparative experiments were conducted using two classic datasets, the PU dataset and the SV dataset, along with a novel dataset, the WHLK dataset. The contributions of different modules to the network and the classification performance of various algorithms were assessed using multiple indicators (Overall Accuracy-OA, Average Accuracy-AA, the Kappa × 100 (K) coefficient [37], and the category-wise classification accuracy for each dataset).

Pavia University Dataset
This dataset was captured by the Reflective Optics System Imaging Spectrometer (ROSIS) in the university town of Pavia in northern Italy in 2001. It consists of 115 spectral bands. Twelve spectral bands were removed due to noise, leaving 103 spectral bands for the experimental study. The image size is 610 × 340 pixels, the spectral resolution is 4 nm, and the spatial resolution is 1.3 m/pixel. This dataset includes real-life scenarios consisting of nine different types of land cover, with 42,776 labeled samples. In this study, 5% of the samples were used as the training set and validation set, and the rest were used as the test set. Figure 7 shows the pseudo-color image and ground truth map. Table 1 lists the numbers of training, validation, and testing samples for each category.

Salinas Valley Dataset
This dataset was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in the Salinas Valley agricultural region of California in 1998. It originally contained 224 spectral bands, of which 204 were retained for experimental studies after removing 20 water absorption bands. The image size is 512 × 217 pixels, the wavelength range is 400-2500 nm, and the spatial resolution is 3.7 m/pixel. This dataset contains 16 different types of land cover, with 54,129 labeled samples. In this study, 5% of the samples were used as the training set and validation set, and the rest were used as the test set. Figure 8 shows the pseudo-color image and ground truth map. Table 2 lists the numbers of training, validation, and testing samples for each category.

WHLK Dataset
This dataset was captured by the Headwall Nano-Hyperspec imaging sensor from 13:49 to 14:37 on 17 July 2018 over Longkou Town, Hubei Province, China. It consists of 270 spectral bands that are used for experimental studies. The image is 550 × 400 pixels, the wavelength range is 400-1000 nm, and the spatial resolution is 0.463 m/pixel. This dataset contains nine land cover types with 204,542 labeled pixels. In this study, 1% of the samples were used as the training set and validation set, and the rest were used as the test set. Figure 9 shows the pseudo-color image and ground truth map. Table 3 lists the numbers of training, validation, and testing samples for each category.


Experimental Settings
In the experiments, the Adam optimizer was utilized, with the default setting of 64 samples per batch for training. The learning rate was set to 1 × 10^−3 by default, and the patch size was set to nine by default. For training periods, the PU and SV datasets were subjected to 300 epochs, while the WHLK dataset was set to 200 epochs. The parameter settings for other comparative methods were consistent with those reported in the corresponding literature. All training samples underwent the same data preprocessing methods, and the results were derived from the average of multiple experiments. In order to ensure fairness in the experimental process, all HSI classification methods were implemented on the same computing workstation. The workstation was equipped with 40 GB of memory, an Intel 8255C CPU, and an RTX 3080 GPU. The experiments were conducted on the Ubuntu 18.04 platform using the PyTorch 1.8.1 framework.
In order to quantitatively compare the effectiveness of the method presented in this paper with multiple other approaches, four quantitative evaluation metrics were adopted: OA, AA, the K coefficient, and the classification accuracy for individual categories in each dataset. OA represents the percentage of samples correctly classified by the model in the entire dataset, which is used as a measure of the overall performance; the higher the value, the better the performance. AA is the mean of the precision values for each category, providing a comprehensive assessment of the performance across categories.
The Kappa coefficient is a measure of the consistency between the model and random classification, taking into account the randomness in the classification results, thereby offering more reliability than OA alone. The Kappa coefficient ranges between −1 and 1, where 0 indicates agreement with random classification and 1 denotes complete agreement. The category classification accuracy of each dataset refers to the proportion of correctly classified samples in each category.
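The three metrics can be computed from a confusion matrix as follows; the toy label arrays are illustrative, not experimental data:

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, num_classes):
    """Compute OA, AA, and the Kappa coefficient from label arrays,
    following the metric definitions given in the text."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                    # confusion matrix
    n = cm.sum()
    oa = np.trace(cm) / n                                # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)             # category-wise accuracy
    aa = per_class.mean()                                # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 0, 2, 2])
oa, aa, kappa = oa_aa_kappa(y_true, y_pred, 3)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.833 0.833 0.75
```

Kappa subtracts the chance agreement pe expected from the class marginals, which is why it is more robust than OA when class sizes are imbalanced.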

Ablation Studies
To demonstrate the contribution of each module, ablation experiments were performed using a controlled module variable approach. The MSRAN is mainly composed of the multi-scale spectral residual network (spectral branch), the multi-scale spatial residual network (spatial branch), and the feature fusion module. Different combinations of these modules were tested to assess their contributions. Since the feature fusion module integrates feature fusion and classification tasks, it was included by default in our experiments. As shown in Table 4, a comparison between NET-1 and NET-4 revealed that the inclusion of multi-scale spatial residual attention in NET-4 led to increases in the OA of 1.02% and 3.40% for the PU and SV datasets, respectively. This improvement is attributed to the fact that HSIs encompass both spatial and spectral information, making the joint spatial-spectral features more suitable for HSI classification tasks. The incorporation of spatial residual attention mechanisms enables the network to focus more on features that are effective for classification. Notably, for the WHLK dataset, the introduction of multi-scale spatial-spectral residual attention mechanisms resulted in a significant 7.91% increase in the OA. This can be attributed to the higher spatial resolution of the WHLK, which provides richer details for land cover spatial features. Multi-scale feature extraction can capture more levels of land cover contours, edges, and local details through different receptive fields, thereby enhancing the precision of the classification features. Compared with NET-2, the introduction of multi-scale spectral residual attention in NET-4 resulted in notable OA improvements of 10.75%, 7.33%, and 19.68% on the three HSI datasets, respectively. This improvement is partly due to the advantages brought by the joint spatial-spectral features and partly because of the use of 3 × 3 × 3 three-dimensional convolution kernels and 5 × 1 × 1 convolution kernels. These kernels not only extract longer one-dimensional spectral features but also accommodate three-dimensional neighboring spectral features. Subsequently, spectral residual attention was applied to enhance the utilization of spectral features. In NET-3, the spatial and spectral multi-scale mechanisms are modified to a single scale. Specifically, in the spectral branch, the multi-scale 3DConv is replaced with a single-scale 3DConv with a 3 × 3 × 3 convolution kernel, while in the spatial branch, the multi-level multi-scale convolution is substituted with a single-scale 2DConv with a 5 × 5 kernel. All other experimental settings are retained as they are in our final NET-4 model. By comparison, our final NET-4 model incorporates a multi-scale mechanism and improves the OA by 6.31%, 8.54%, and 9.63% on the three distinct datasets, respectively. This demonstrates that the multi-scale mechanism is adept at capturing object features of various sizes. The reason mainly lies in the fact that the various volumes of different-sized objects lead to diverse scales of edge contours and local detail information. This diversity necessitates the use of different receptive fields for effective feature extraction. The combination of spectral and spatial multi-scale mechanisms enables our proposed MSRAN method to achieve a superior classification performance.

Comparative Experiments
To validate the effectiveness of the proposed MSRAN method, comparative experiments were conducted against a range of classical and contemporary networks, including the SVM [14], 1D-RNN [18], 1D-CNN [20], 2D-CNN [21], 3D-CNN [23], DBDA [27], DRIN [28], and LS2CM [38]. Tables 5-7 present the OA, AA, Kappa coefficient, and per-class accuracy for each land cover type on the PU, SV, and WHLK datasets; bolded results in these tables signify superior classification performances. Remarkably, the MSRAN outperformed the others, achieving the highest OA, AA, and Kappa coefficient values on the two traditional datasets and the one novel dataset. In particular, the proposed MSRAN achieved OA improvements of 3.68%, 4.26%, and 3.34% over the classical 3D-CNN method on these datasets, respectively. Even compared with the advanced DRIN and LS2CM methods, the MSRAN maintained its superiority owing to its efficient integration of spatial and spectral features. The classification results also showed notable imbalances across classes. For instance, on Bitumen (Table 5), Lettuce R4 (Table 6), and Mixed Weed (Table 7), methods relying on one-dimensional features, such as the SVM, 1D-CNN, and 1D-RNN, exhibited subpar performances due to the complex sample distribution and the predominance of boundary samples. Conversely, the MSRAN, along with the DBDA and 3D-CNN, demonstrated a consistently balanced classification accuracy across datasets and sample types, with the proposed MSRAN showing the most uniform performance. This consistency demonstrates the ability of the MSRAN to handle diverse and complex sample data, which can be mainly attributed to the multi-scale mechanism, feature extraction at various granularities, feature enhancement of the central pixel, and the residual attention mechanism. The integration of these modules enhances the utilization of spatial and spectral features, which validates the robustness of the proposed MSRAN method.
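For reference, the OA, AA, and Kappa coefficient reported in Tables 5-7 can be computed from a confusion matrix as follows. This is the standard formulation of these metrics, not code from the paper; the 2 × 2 example matrix is purely illustrative.

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA, and Cohen's kappa from a confusion matrix (rows = ground truth)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                   # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)  # per-class recall
    aa = per_class.mean()                         # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                  # chance-corrected agreement
    return oa, aa, kappa

conf = np.array([[48, 2],
                 [5, 45]])
oa, aa, kappa = classification_metrics(conf)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # prints: 0.93 0.93 0.86
```

Because AA averages per-class recall, the class-imbalance effects discussed above (e.g., the weak Bitumen and Mixed Weed results of the 1-D methods) depress AA more sharply than OA.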
For visual comparison, the classification maps generated by these methods on the PU, SV, and WHLK datasets are depicted in Figures 10-12. The maps from the DBDA, DRIN, LS2CM, and MSRAN align more closely with the actual ground conditions, and the map of the proposed MSRAN method exhibits the best visual accuracy, whereas the classification maps from the other methods are less precise. While some methods perform well with larger sample sizes, they struggle with smaller, more dispersed samples, such as the scattered Asphalt and Trees in the PU dataset shown in Figure 10. Samples that are close to, or even coincide with, edge pixels, such as Lettuce R4 in the SV dataset (Figure 11) and Roads and Houses in the WHLK dataset (Figure 12), are also classified poorly by these methods. In contrast, the proposed MSRAN excels in classes with a high percentage of boundary samples; on the high-resolution WHLK dataset, it effectively managed complex classes such as Roads and Houses and Mixed Weed along Riverbanks, which pose intricate boundary challenges. Overall, the proposed MSRAN enhances the spectral features of the central pixel and extracts multi-level fine-grained spatial details, cooperating with the dual-path residual attention mechanism to screen effective features. By removing redundant feature interference and minimizing the interference between surrounding pixels, the MSRAN enables the accurate classification of land cover types in areas with complex spatial distributions, particularly those with scattered and boundary samples.

Impacts of Different Training Ratios
Figure 13 displays the OA values of the various models under different proportions of training samples. Considering the varying total numbers of samples and the stability of the models under different proportions, 3%, 5%, 7%, and 10% of the samples were randomly selected for training on the PU and SV datasets, while 0.5%, 1%, 3%, and 7% were chosen for the WHLK dataset. Overall, even with a limited number of training samples, the proposed MSRAN still achieves satisfactory classification results. As the number of training samples increases, all methods show a growth trend in performance across the three datasets. Notably, the proposed MSRAN demonstrates a stable growth trend, further affirming its robustness.
These findings suggest that the MSRAN is particularly effective in scenarios with limited training data, a common challenge in HSI classification. Its ability to maintain stable performance improvements with increasing amounts of training data highlights its potential for use in various applications, particularly where collecting extensive training samples is challenging or impractical. The robustness of the MSRAN in these contexts underscores its suitability for real-world applications and its effectiveness in harnessing limited data for accurate classification.

Conclusions
Traditional convolutional neural network methods often face challenges in feature extraction, particularly when spatial and spectral features interfere with each other and when features must be extracted from few samples with many edges. This paper introduced a novel method for HSI classification, the dual-branch multi-scale spectral-spatial residual attention network (MSRAN). The approach employs two branches to extract spatial features at different granularities and to enhance the spectral features of the central pixel, and it utilizes spatial residual attention and spectral residual attention to refine the extracted spectral-spatial features. Finally, these features are fused and classified with a Softmax classifier. Extensive experimental evaluations demonstrated that the method is highly competitive. The dual-branch structure allows more effective and independent extraction of spatial and spectral features, addressing the limitations of traditional CNN methods, while the residual attention mechanisms further enhance feature extraction, particularly for complex samples with multiple edges or a limited quantity. The final fusion of these features ensures a comprehensive representation of the data, leading to an improved classification performance. Overall, the MSRAN method provides higher accuracy and robustness for hyperspectral image classification under diverse scenarios and conditions.
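As a rough illustration of the fusion-and-classification step summarized above, the two branch features can be concatenated and passed through a linear layer with Softmax. The feature sizes and the single linear layer are illustrative assumptions; the actual fusion module is more elaborate.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def fuse_and_classify(spec_feat, spat_feat, w, b):
    # concatenate the two branch features, then linear map + softmax
    fused = np.concatenate([spec_feat, spat_feat])
    return softmax(w @ fused + b)

rng = np.random.default_rng(1)
spec = rng.standard_normal(32)             # spectral-branch feature (illustrative size)
spat = rng.standard_normal(32)             # spatial-branch feature (illustrative size)
w = rng.standard_normal((9, 64)) * 0.1     # 9 classes, as in the PU dataset
probs = fuse_and_classify(spec, spat, w, np.zeros(9))
print(probs.shape)                          # (9,)
```

The predicted land cover class is then simply `np.argmax(probs)`.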

Figure 1. Overall architecture of the proposed MSRAN for HSI classification.
c represents the spectral dimension, and w and h respectively indicate the width and height of the input HSI. The spectral features extracted by convolution at different scales are activated by ReLU and summed to obtain the output feature y. The calculation formulas are given in Equations (1)-(3).

Figure 2. Spectral feature extraction and optimization branch.


Figure 4. Structure of the spatial feature extraction and optimization module.

convolution mapping and the Sigmoid activation function and are multiplied with the spatial feature x. Finally, the result of this multiplication is directly added to x, yielding the output K of the spatial feature extraction module.
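A minimal sketch of this residual gating, assuming a 1 × 1 convolution (modeled here as a per-pixel matrix multiply) as the convolution mapping; the feature sizes and weights are illustrative, not the paper's configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_residual_attention(x, w, b):
    """Residual attention gate: K = x + sigmoid(conv(x)) * x.

    x: (H, W, C) spatial feature map; w: (C, C) weights of an
    illustrative 1x1 convolution; b: (C,) bias.
    """
    attn = sigmoid(x @ w + b)   # attention mask from conv mapping + Sigmoid
    return x + attn * x         # gated features added back to the input

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 16))        # a small illustrative feature map
w = rng.standard_normal((16, 16)) * 0.1
k = spatial_residual_attention(x, w, np.zeros(16))
print(k.shape)                              # (7, 7, 16)
```

Because the mask lies in (0, 1), the gate can only rescale each feature between 1× and 2× its input value, so the identity path is always preserved — the usual motivation for residual attention.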

Figure 5. Structure of the spatial residual attention.


Figure 6. Feature fusion and classification module.

3.1.1. Pavia University Dataset
This dataset was captured by the Reflective Optics System Imaging Spectrometer (ROSIS) in the university town of Pavia in northern Italy in 2001. It consists of 115 spectral bands, twelve of which were removed due to noise, leaving 103 spectral bands for the experimental study. The image size is 610 × 340 pixels, the spectral resolution is 4 nm, and the spatial resolution is 1.3 m/pixel. This dataset includes real-life


Electronics 2024
Figure 7. Pseudo-color image and ground truth image of the PU dataset.


Figure 8. Pseudo-color image and ground truth image of the SV dataset.


Figure 9. Pseudo-color image and ground truth image of the WHLK dataset.


Figure 11. Classification maps of the SV dataset with different methods.

Figure 12. Classification maps of the WHLK dataset with different methods.

Figure 13. Classification results versus different percentages of training samples for the three datasets. (a) PU. (b) SV. (c) WHLK.


This dataset was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in the Salinas Valley agricultural region of California. It originally contained 224 spectral bands, of which 204 were retained for the experimental studies after removing 20 water absorption bands. The image size is 512 × 217 pixels, the wavelength range is 400-2500 nm, and the spatial resolution is 3.7 m/pixel. This dataset contains 16 different types of land cover, with 54,129 labeled samples. In this study, a portion of the samples was used as the training and validation sets, and the rest were used as the test set. Figure 8 shows the pseudo-color image and ground truth map. Table 2 lists the numbers of training, validation, and testing samples for each category.

Table 1. Categories and corresponding training, validation, and test sample numbers for the PU dataset.

Table 2. Categories and corresponding training, validation, and test sample numbers for the SV dataset.


Table 3. Categories and corresponding training, validation, and test sample numbers for the WHLK dataset.


Table 4. Contributions of different components to the proposed MSRAN method.

Table 5. Classification results of the PU dataset with different methods.

Table 6. Classification results of the SV dataset with different methods. The bold entities indicate the highest value.

Table 7. Classification results of the WHLK dataset with different methods. The bold entities indicate the highest value.