Adaptive Learnable Spectral–Spatial Fusion Transformer for Hyperspectral Image Classification

In hyperspectral image classification (HSIC), every pixel of the HSI is assigned to a land cover category. While convolutional neural network (CNN)-based methods for HSIC have significantly enhanced performance, they encounter challenges in learning the relevance of deep semantic features and grapple with escalating computational costs as network depth increases. In contrast, the transformer framework is adept at capturing the relevance of high-level semantic features, presenting an effective solution to the limitations of CNN-based approaches. This article introduces a novel adaptive learnable spectral–spatial fusion transformer (ALSST) to enhance HSI classification. The model incorporates a dual-branch adaptive spectral–spatial fusion gating mechanism (ASSF), which effectively captures spectral–spatial fusion features from images. The ASSF comprises two key components: the point depthwise attention module (PDWA) for spectral feature extraction and the asymmetric depthwise attention module (ADWA) for spatial feature extraction. The model efficiently obtains spectral–spatial fusion features by multiplying the outputs of these two branches. Furthermore, we integrate the layer scale and DropKey into the traditional transformer encoder and multi-head self-attention (MHSA) to form a new transformer with a layer scale and DropKey (LD-Former). This innovation enhances data dynamics and mitigates performance degradation in deeper encoder layers. The experiments detailed in this article are executed on four renowned datasets: Trento (TR), MUUFL (MU), Augsburg (AU), and the University of Pavia (UP). The findings demonstrate that the ALSST secures optimal performance, surpassing several existing models, with overall accuracies (OA) of 99.70%, 89.72%, 97.84%, and 99.78% on TR, MU, AU, and UP, respectively.


Introduction
The advent of advanced imaging technologies has garnered increased attention for various remote sensing modalities, with hyperspectral remote sensing technology emerging as a vital domain. Distinguished from grayscale and RGB imagery, hyperspectral images (HSIs) are three-dimensional (3D) data encapsulated across hundreds of contiguous spectral bands, offering an abundance of spectral information alongside intricate spatial texture information [1]. HSIs have found extensive applications across diverse sectors, including plant disease diagnosis [2], military reconnaissance [3], ecosystem assessment [4], urban planning [5], and target detection [6], among others [7][8][9]. Therefore, substantial research efforts have been channeled into HSI-related works, spanning classification [10], band selection [11], and anomaly detection tasks [12]. Notably, hyperspectral image classification (HSIC) has a critical status within this spectrum of applications, as it aims to distinguish land covers precisely at the pixel level [13].
Initially, the similarity of spectral information was evaluated by statistical algorithms, which then served as a basis for distinguishing hyperspectral pixels [14,15]. Nonetheless, this approach encounters limitations due to the potential variability of spectral characteristics within identical land objects and the occurrence of similar spectral features across different land types. Later, HSIC predominantly leveraged traditional machine learning models. Given that the spectral bands in HSIs typically exceed 100, whereas the actual categories of land objects are usually fewer than 30, there is notable redundancy in the spectral information. To address this, machine learning pipelines often employ techniques such as principal component analysis (PCA) [16], independent component analysis (ICA) [17], and linear discriminant analysis (LDA) [18] to mitigate spectral redundancy. Subsequently, classifiers like decision trees [19], support vector machines (SVMs) [20], and K-nearest neighbors (KNNs) [21] were applied to the refined features for classification. Although machine learning methods marked a substantial advancement in performance over earlier statistical approaches, they largely depend on manually crafted feature extraction, which may not effectively extract the more intricate information embedded within hyperspectral data. This limitation underscores the need for more advanced methodologies capable of autonomously adapting to the complex patterns inherent in HSIs [22].
The rapid evolution of deep learning [23][24][25][26] has rendered it a more potent tool than traditional machine learning for extracting abstract information through multi-layer neural networks, significantly improving classification performance [27]. Deep learning techniques are at the forefront of HSIC methods. Chen et al. [28] first introduced deep learning to HSIC, utilizing stacked autoencoders to extract spatial-spectral features and achieve notable classification outcomes. Since then, a plethora of deep learning architectures have been applied to HSIC, including deep belief networks (DBNs) [29], convolutional neural networks (CNNs) [30], graph convolutional networks (GCNs) [31], vision transformers [32], and other models [33,34]. Notably, CNNs and vision transformers have emerged as leading approaches in HSIC, attributed to the exceptional ability of CNNs to capture intricate spatial and spectral patterns and of vision transformers to capture long-range information, thereby setting a benchmark in methodological advancement.
Roy et al. [35] combined the strengths of three-dimensional and two-dimensional convolutional neural networks (3DCNNs and 2DCNNs, respectively) to create a hybrid spectral CNN that excels in spatial-spectral feature representation, enhancing spatial feature delineation. Sun et al. [36] introduced a novel classification model featuring heterogeneous spectral-spatial attention convolutional blocks, which supports plug-and-play functionality and adeptly extracts 3D features of HSIs. CNNs demonstrate formidable performance due to their intrinsic network architecture, but their effectiveness is somewhat constrained when processing the lengthy spectral feature sequences inherent to HSIs. Acknowledging the transformative potential of vision transformers, Hong et al. [37] developed a spectral transformer tailored for extracting discriminative spectral features from block frequency bands. Sun et al. [38] introduced a spectral-spatial feature tokenization transformer that adeptly captures sequential relationships and high-level semantic features. Although vision transformers have shown proficiency in handling spectral sequence features, their exploitation of local spatial information remains suboptimal. To address this gap, Wang et al. [39] proposed a novel spectral-spatial kernel that integrates with an enhanced visual transformation method, facilitating comprehensive extraction of spatial-spectral features. Huang et al. [40] introduced the 3D Swin Transformer (3DSwinT) model, specifically designed to embrace the 3D nature of HSIs and to exploit the rich spatial-spectral information they contain. Fang et al. [41] developed the MAR-LWFormer, which joins the attention mechanism with a lightweight transformer to facilitate multi-channel feature representation. The MAR-LWFormer capitalizes on the multispectral and multiscale spatial-spectral information intrinsic to HSI data, showing remarkable effectiveness even at exceedingly low sampling rates. Despite the widespread application of deep learning methods in HSI classification, several challenges persist in the field. In these methods, the CNN layers employ a fixed approach to feature extraction without the dynamic fusion of spectral-spatial information. Moreover, during the deep information extraction phase by the transformer, dynamic feature updates are not conducted, limiting the adaptability and potential effectiveness of the model in capturing intricate data patterns.
To fully extract the spectral-spatial fusion features in HSIs and increase classification performance, a novel adaptive learnable spectral-spatial fusion transformer (ALSST) is designed. The dual-branch adaptive spectral-spatial fusion gating mechanism (ASSF) is engineered to concurrently extract spectral-spatial fusion features from HSIs. Additionally, the learnable transition matrix, layer scale, is incorporated into the original vision transformer encoder to enhance training dynamics. Furthermore, the DropKey technique is implemented in the multi-head self-attention (MHSA) to mitigate the risk of overfitting. This approach enhances the model's generalization capabilities, ensuring more robust and reliable performance across diverse hyperspectral datasets. The key contributions of the ALSST are as follows:
1.
In this study, a dual-branch fusion model named ALSST is designed to extract the spectral-spatial fusion features of HSIs. This model synergistically combines the prowess of CNNs in extracting local features with the capacity of a vision transformer to discern long-range dependencies. Through this integrated approach, the ALSST aims to provide a comprehensive learning mechanism for the spectral-spatial fusion features of HSIs, optimizing the ability of the model to interpret and classify complex hyperspectral data effectively.

2.
A dual-branch fusion feature extraction module known as ASSF is developed in this study. The module contains the point depthwise attention module (PDWA) and the asymmetric depthwise attention module (ADWA). The PDWA primarily focuses on extracting spectral features from HSIs, whereas the ADWA is tailored to capture spatial information. The innovative design of ASSF enables the exclusion of linear layers, thereby accentuating local continuity while maintaining the richness of feature complexity.

3.
The new transformer with a layer scale and DropKey (LD-Former) is proposed to increase the data dynamics and prevent performance degradation as the transformer deepens. The layer scale is added to the output of each residual block, and different output channels are multiplied by different values to make the features more refined. At the same time, DropKey is adopted into self-attention (SA) to obtain DropKey self-attention (DSA). The combination of these two techniques overcomes the risk of overfitting and allows deeper transformers to be trained.
The remainder of this article is structured as follows: Section 2 delves into the underlying theory behind the proposed ALSST method. Section 3 introduces the four prominent HSI datasets, the experimental settings, and the experiments conducted on the datasets. Section 4 discusses the ablation analysis, the ratio of DropKey, and the impact of varying training samples. Section 5 provides a conclusion, summarizing the main findings and contributions of the article.

Overall Architecture
The ALSST proposed in this article is shown in Figure 1. The ASSF with the PDWA and the ADWA is designed to enhance spectral-spatial fusion feature extraction. The novel LD-Former is proposed to increase the data dynamics and prevent performance degradation as the transformer deepens. (H × W × L is the size of the HSI after PCA, and P × P × L is the patch size.)
From Figure 1, the original input data of the ALSST can be represented as X^H_ori ∈ R^(H×W×B), where the height is H, the width is W, and the number of spectral bands is B. HSIs feature extensive spectral bands that offer valuable information but concurrently elevate computational costs. PCA is applied to reduce the spectral dimension of the HSI, optimizing the balance between information retention and computational efficiency. The data after PCA are reshaped to X^H_PCA ∈ R^(H×W×L), where L is the number of bands after PCA.
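As an illustration of this step, the following is a small NumPy sketch of PCA on an HSI cube; the `pca_reduce` helper and the toy sizes are our own illustration, and an off-the-shelf PCA implementation would serve equally well.

```python
import numpy as np

def pca_reduce(cube: np.ndarray, n_components: int) -> np.ndarray:
    """Reduce the spectral dimension of an HSI cube (H, W, B) -> (H, W, L)."""
    H, W, B = cube.shape
    flat = cube.reshape(-1, B).astype(np.float64)
    flat -= flat.mean(axis=0)                      # center each band
    cov = flat.T @ flat / (flat.shape[0] - 1)      # B x B covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenpairs, ascending order
    top = eigvecs[:, ::-1][:, :n_components]       # leading principal components
    return (flat @ top).reshape(H, W, n_components)

rng = np.random.default_rng(0)
hsi = rng.random((16, 16, 64))                     # toy cube with B = 64 bands
reduced = pca_reduce(hsi, 30)                      # L = 30, as in the paper
print(reduced.shape)                               # (16, 16, 30)
```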

X^H_PCA is sent to the dual-branch ASSF to extract the spectral-spatial fusion features X^out_spectral and X^out_spatial. X^H_PCA is first sent to the 1 × 1 × 3 convolution kernel in the spectral branch to extract the spectral features X^3D_spectral, and X^3D_spectral is then reshaped to X^2D_spectral so that the feature dimension matches the subsequent PDWA. We put X^2D_spectral into the PDWA to focus on extracting spectral features. The outputs of the PDWA are sent to a 1 × 1 convolution kernel to reduce the feature channels, and the new outputs are reshaped to one-dimensional (1D) vectors. X^H_PCA is synchronously sent to the 3 × 1 × 1 and 1 × 3 × 1 convolution kernels in the spatial branch to extract the spatial features X^3D_spatial, and X^3D_spatial is then reshaped to X^2D_spatial so that the feature dimension matches the subsequent ADWA. We put X^2D_spatial into the ADWA to focus on extracting spatial features. The outputs of the ADWA are likewise sent to a 1 × 1 convolution kernel to reduce the feature channels, and the new outputs are also reshaped to 1D vectors. X^1D_spectral and X^1D_spatial are multiplied to obtain the spectral-spatial fusion features, from which the subsequent LD-Former extracts in-depth dynamic feature information. The LD-Former may loop N times. The outputs Y^cls_spectral and Y^cls_spatial are then put into multi-layer perceptrons (MLPs) [42] separately for the final classification. The cross-entropy (CE) loss function is used to measure the degree of inconsistency between the predicted labels Y_P and the true labels Y_L.
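To make the front end of the pipeline concrete, here is a minimal PyTorch sketch of the two convolutional stems and the multiplicative fusion; the attention modules and the LD-Former are omitted, and the `DualBranchStem` name and channel count are our own illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class DualBranchStem(nn.Module):
    """Sketch of the ASSF front end: a spectral branch with a kernel along the
    spectral axis, a spatial branch with two asymmetric spatial kernels, and
    element-wise multiplication to fuse the two branches."""
    def __init__(self, out_ch=8):
        super().__init__()
        # Conv3d input layout: (N, 1, L, P, P); kernel order is (depth=L, H, W),
        # so the paper's 1 x 1 x 3 spectral kernel becomes (3, 1, 1) here.
        self.spec = nn.Conv3d(1, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.spat_h = nn.Conv3d(1, out_ch, kernel_size=(1, 3, 1), padding=(0, 1, 0))
        self.spat_w = nn.Conv3d(out_ch, out_ch, kernel_size=(1, 1, 3), padding=(0, 0, 1))

    def forward(self, x):
        spec = self.spec(x)                  # spectral-branch features
        spat = self.spat_w(self.spat_h(x))   # spatial-branch features
        return spec * spat                   # multiplicative fusion

patch = torch.randn(2, 1, 30, 11, 11)        # batch of 11 x 11 patches, L = 30
fused = DualBranchStem()(patch)
print(fused.shape)                           # torch.Size([2, 8, 30, 11, 11])
```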

Feature Extraction via ASSF
While transformer networks adeptly model global interactions between token embeddings via SA, they exhibit limitations in extracting fine-grained local feature patterns [42]. Given the proven effectiveness of CNNs in modeling spatial context features, particularly in HSI classification tasks, we integrate a CNN for feature extraction from input data. We draw inspiration from the GSAU [43] to devise the ASSF module for further refined feature representation. Central to ASSF are the PDWA and the ADWA, which bypass the linear layer and capture local continuity while managing complexity effectively.
The PDWA, deployed for extracting spectral features from HSIs, is depicted on the left of Figure 2. It comprises several key components: pointwise convolution (PWC), point depthwise convolution (PDWC), the multiplication operation, and the residual connection. These elements collaboratively function to enhance the feature extraction process. The input X^in_spectral of the module is divided into X^P1_spectral and X^P2_spectral. X^P1_spectral is sent to the PWC layer to obtain X^PP1_spectral. We feed X^PP1_spectral into the PDWC with a 1 × 1 convolution kernel to yield the output X^PD_spectral. The number of groups in the PDWC layer equals the number of channels of X^PP1_spectral. Given that the convolution kernel size is 1 × 1 and the number of groups equals the input channels, this configuration effectively concentrates on the channel information, enabling a focused analysis and processing of the spectral or feature channels in the data. This approach ensures that the convolution operation emphasizes inter-channel relationships, enhancing the ability of the model to capture and exploit channel-specific features. X^P2_spectral also yields X^PP2_spectral through a PWC. To keep a portion of the original information intact, no transformations or operations are applied to X^PP2_spectral, ensuring the preservation of essential raw data characteristics within the analytical framework. The data obtained by multiplying X^PD_spectral and X^PP2_spectral are sent to a new PWC to obtain the output X′^out_spectral. We set a random parameter matrix G of size (1, 1, channels) and let it be updated by backpropagation during training. The result of multiplying G and X′^out_spectral is connected with X^in_spectral by a residual connection to obtain the output X^out_spectral. Through the matrix G, the adaptive update of the output of this module is achieved. The PWC employs a 1 × 1 convolution kernel designed to modify the data dimensions, which is crucial for aligning the dimensions across different layers. The main process of the PDWA is as follows:

X^PP1_spectral = Φ_PW(X^P1_spectral), X^PP2_spectral = Φ_PW(X^P2_spectral), X^PD_spectral = Φ_PDW(X^PP1_spectral),
X^out_spectral = X^in_spectral + G ⊗ Φ_PW(X^PD_spectral ⊗ X^PP2_spectral)

here, X^P1_spectral and X^P2_spectral represent the characteristic data of the two branches in the PDWA, respectively; Φ_PW(·), Φ_PDW(·), and ⊗ represent the PWC, the PDWC, and multiplication.

The ADWA is primarily utilized to extract spatial features from HSIs, with its structure illustrated on the right of Figure 2. The ADWA framework encompasses several key components: the PWC, two layers of asymmetric depthwise convolution (ADWC), the multiplication operation, and the residual connection. These elements collectively enhance the capacity of the model to capture and integrate spatial features, thereby enriching the overall feature representation. In this module, the PDWC within the PDWA is replaced by two ADWC layers with 3 × 1 and 1 × 3 convolution kernels, respectively, while all other operations remain consistent. This modification allows for a more nuanced extraction of spatial features by capturing variations along different dimensions, thereby enhancing the ability to discern spatial features within the data. The main process of the ADWA is as follows:

X^out_spatial = X^in_spatial + G ⊗ Φ_PW(Φ_ADW2(Φ_ADW1(Φ_PW(X^A1_spatial))) ⊗ Φ_PW(X^A2_spatial))

where X^A1_spatial and X^A2_spatial represent the features of the two branches in the ADWA, and Φ_ADW1(·) and Φ_ADW2(·) represent the two ADWCs.
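The two modules can be sketched in PyTorch as follows. This is a hedged, simplified rendering of the description above: the `PDWA`/`ADWA` class layouts, channel counts, and the subclassing shortcut are our own illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class PDWA(nn.Module):
    """Point depthwise attention (sketch): split the input channels, apply a
    PWC to both halves, a 1x1 depthwise conv (groups == channels, the PDWC)
    to one half, multiply the halves, and add a residual gated by G."""
    def __init__(self, ch):
        super().__init__()
        half = ch // 2
        self.pwc1 = nn.Conv2d(half, half, 1)
        self.core = nn.Conv2d(half, half, 1, groups=half)   # PDWC
        self.pwc2 = nn.Conv2d(half, half, 1)
        self.pwc_out = nn.Conv2d(half, ch, 1)
        self.G = nn.Parameter(torch.randn(1, ch, 1, 1))     # learnable gate G

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                          # split channels
        a = self.core(self.pwc1(x1))                        # channel-focused branch
        b = self.pwc2(x2)                                   # identity-preserving branch
        return x + self.G * self.pwc_out(a * b)             # gated residual output

class ADWA(PDWA):
    """Asymmetric depthwise attention (sketch): identical to the PDWA except
    the PDWC is replaced by two asymmetric depthwise convs (3x1 and 1x3)."""
    def __init__(self, ch):
        super().__init__(ch)
        half = ch // 2
        self.core = nn.Sequential(
            nn.Conv2d(half, half, (3, 1), padding=(1, 0), groups=half),
            nn.Conv2d(half, half, (1, 3), padding=(0, 1), groups=half),
        )

feats = torch.randn(2, 16, 11, 11)
out_spec, out_spat = PDWA(16)(feats), ADWA(16)(feats)
print(out_spec.shape, out_spat.shape)   # both torch.Size([2, 16, 11, 11])
```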


LD-Former Encoder
Figure 3 delineates the structure of the proposed LD-Former. The LD-Former encoders are adept at modeling deep semantic interrelations among feature tokens, transforming the input of the LD-Former into a sequence of vectors. A class token is integrated at the beginning of this vector sequence to encapsulate the overall sequence information. Subsequently, n positional encodings are embedded into the sequence to generate multiple tokens, with encoding similarity reflecting proximity in information. These tokens are then processed through the transformer encoder. The output from the multi-head DropKey self-attention (MHDSA) undergoes classification via an MLP, comprising one layer norm (LN) and two fully connected layers (FC), with a Gaussian error linear unit (GELU) [44] activation function. The operations are iteratively applied N times. In deeper models, the attention maps of subsequent blocks tend to be more similar, suggesting limited benefits from excessively increasing depth.
To address potential limitations of deep transformers, a learnable matrix (layer scale), inspired by the CaiT models [45], is integrated into the transformer encoder. This layer scale introduces a learnable diagonal matrix to the output of each residual block, initialized near zero, enabling differentiated scaling across the SA or MLP output channels, thereby refining feature representation and supporting the training of deeper models. The formulas are as follows:

X′ = X + diag(λ_1, …, λ_d) × MHDSA(LN(X))
X″ = X′ + diag(λ′_1, …, λ′_d) × MLP(LN(X′))

here, LN is the layer norm and MLP is the feed-forward network in the LD-Former.
diag(λ_1, …, λ_d) and diag(λ′_1, …, λ′_d) are learnable weights for the outputs of the MHDSA and the MLP. The diagonal values are all initialized to a fixed small value σ: σ = 0.1 when the depth is within 18, σ = 5 × 10⁻³ when the depth is within 24, and σ = 5 × 10⁻⁶ for deeper networks.
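A minimal PyTorch sketch of this layer scale, using the initialization rule above; the `LayerScale` class name and the toy dimensions are our own illustration.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """CaiT-style layer scale (sketch): a learnable per-channel scale (i.e., a
    diagonal matrix) applied to a residual branch, initialized to a small
    sigma that shrinks as the network deepens."""
    def __init__(self, dim, depth):
        super().__init__()
        sigma = 0.1 if depth <= 18 else (5e-3 if depth <= 24 else 5e-6)
        self.lam = nn.Parameter(sigma * torch.ones(dim))    # diag(lambda_1..d)

    def forward(self, branch_out):
        return self.lam * branch_out                        # per-channel scaling

# Residual update with layer scale: X <- X + diag(lambda) * Sublayer(LN(X))
dim = 64
x = torch.randn(2, 10, dim)                 # (batch, tokens, dim)
ln, mlp = nn.LayerNorm(dim), nn.Linear(dim, dim)
x = x + LayerScale(dim, depth=12)(mlp(ln(x)))
print(x.shape)                              # torch.Size([2, 10, 64])
```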
Figure 4 delineates the structure of the proposed MHDSA. To elucidate the interrelations among feature tokens, the learnable weights denoted W_Q, W_K, and W_V are established for the SA mechanism. These weights are multiplied with the feature tokens and linearly aggregated into three distinct matrices representing the queries (Q), keys (K), and values (V). Softmax(·) is then applied to the resultant scores, transforming them into weight probabilities, thereby facilitating the SA computation, as delineated below:

SA(Q, K, V) = Softmax(QKᵀ / √d_K) V

where d_K represents the dimension of K.
At the same time, DropKey [46] is an innovative regularization technique designed for the MHSA. It is an effective tool to combat overfitting, particularly in scenarios characterized by a scarcity of samples. DropKey prevents the model from relying too heavily on specific feature interactions by selectively dropping keys in the attention mechanism, promoting a more generalized learning process that enhances the performance of the model on unseen data. The DSA computation is delineated below:

DSA(Q, K, V) = Softmax(QKᵀ / √d_K + M) V, where M ∈ R^(n×n) is a random mask whose elements are set to −∞ with probability r_i and to 0 otherwise.
here, n is the number of feature tokens and r_i is the ratio of DropKey. A structure analogous to the MHSA, named MHDSA, is implemented in the approach, utilizing multiple sets of weights. The MHDSA comprises several DSA units, and the scores from each DSA are aggregated. This method allows for a diversified and comprehensive analysis of the input features, capturing various aspects of the data through different attention perspectives. The formulation of this process is detailed in the following expression:

head_i = DSA(QW^Q_i, KW^K_i, VW^V_i), MHDSA(Q, K, V) = Concat(head_1, …, head_h) W
here, h is the number of attention heads and W is the parameter matrix.
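The masking step of DropKey can be sketched in PyTorch as follows; this is a simplified single-head rendering, and the `dropkey_attention` helper and toy sizes are our own illustration.

```python
import torch
import torch.nn.functional as F

def dropkey_attention(q, k, v, ratio=0.1, training=True):
    """DropKey self-attention (sketch): before Softmax, a Bernoulli mask sets
    a `ratio` of the attention logits to -inf, dropping those keys for this
    forward pass and discouraging reliance on specific key interactions."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (n, n) attention logits
    if training and ratio > 0:
        drop = torch.rand_like(scores) < ratio          # keys to drop
        scores = scores.masked_fill(drop, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q = k = v = torch.randn(8, 16)                          # n = 8 tokens, d_K = 16
out = dropkey_attention(q, k, v, ratio=0.1)
print(out.shape)                                        # torch.Size([8, 16])
```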

Final Classifier
The overarching algorithm of the ALSST for HSIC is detailed in Algorithm 1. The features Y^cls_spectral and Y^cls_spatial extracted from the LD-Former encoders are fed into MLP layers for the terminal classification stage. Each MLP has two linear layers and a GELU operation. The size of the output layer in each MLP is customized to match the total class count, enabling Softmax(·) to normalize the activations of the output units to sum to one.
This normalization makes the output embody a probability distribution over the class labels. Aggregating the final two output probability vectors yields the ultimate probability vector, with the maximum probability value designating the label of a pixel. Subsequently, the CE function is employed to calculate the loss value to enhance the precision of the classification results. In brief, Algorithm 1 proceeds as follows: generate the spectral-spatial fusion features using the ASSF; perform the LD-Former encoder, in which the learnable class tokens are added at the first locations of the 1D spectral-spatial fusion feature vectors derived from the ASSF and positional embedding is applied to all feature vectors to form the semantic tokens, which are learned by Equations (3)-(7); input the spectral-spatial class tokens from the LD-Former into the MLP; and train and test the ALSST.
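The classification head described above can be sketched as follows; the helper names and sizes here are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

# Sketch of the final stage: each branch's class token passes through an MLP
# (two linear layers with a GELU), Softmax turns the outputs into class
# probabilities, and the two probability vectors are aggregated; the argmax
# of the aggregate gives the predicted label.
num_classes, dim = 6, 64   # e.g., six classes as in the TR dataset

def make_mlp():
    return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_classes))

mlp_spectral, mlp_spatial = make_mlp(), make_mlp()
cls_spectral = torch.randn(4, dim)       # spectral class tokens (batch of 4)
cls_spatial = torch.randn(4, dim)        # spatial class tokens

p_spec = torch.softmax(mlp_spectral(cls_spectral), dim=-1)
p_spat = torch.softmax(mlp_spatial(cls_spatial), dim=-1)
pred = (p_spec + p_spat).argmax(dim=-1)  # label from the aggregated vector
print(pred.shape)                        # torch.Size([4])
```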

Data Description
The performance of the proposed ALSST method is evaluated on four public multi-modal datasets: Trento (TR), MUUFL (MU) [47,48], Augsburg (AU), and University of Pavia (UP). Details of all datasets are described as follows.

TR dataset
The TR dataset covers a rural area surrounding the city of Trento, Italy. It includes 600 × 166 pixels and six categories. The HSI has 63 bands in the wavelength range from 420.89 to 989.09 nm. The spectral resolution is 9.2 nm, and the spatial resolution is 1 m. The pseudo-color image of the HSI and the ground-truth image are in Figure 5. The color, class name, training samples, and test samples for the TR dataset are in Table 1.
Remote Sens. 2024, 16, 1912

MU dataset
The MU dataset covers the University of Southern Mississippi Gulf Park Campus, Long Beach, Mississippi, USA. The dataset was acquired in November 2010 with a spatial resolution of 1 m per pixel. The original dataset is 325 × 337 pixels with 72 bands, and the imaging spectral range is between 380 nm and 1050 nm. Due to the influence of imaging noise, the first 4 and the last 4 bands were removed, leaving 64 bands. The invalid area on the right of the original image is removed, and 325 × 220 pixels are retained. Objects in the imaging scene were placed into eleven categories. The pseudo-color image of the HSI and the ground-truth image are in Figure 6. The details of the MU dataset are in Table 2.

AU dataset
The AU dataset was captured over the city of Augsburg, Germany. The spatial resolution was downsampled to 30 m. The HSI has 180 bands from 0.4 to 2.5 µm. The AU image is 332 × 485 pixels, and it depicts seven different land cover classes. The pseudo-color image of the HSI and the ground-truth image are in Figure 7. Details on the AU dataset are provided in Table 3.


Experimental Setting
The experiments presented in this article are executed on Windows 11 and an RTX 3090Ti. The programming is conducted in Python 3.8, utilizing PyTorch 1.12.0. The input image size is 11 × 11, with a batch size of 64 and 100 epochs. The PCA method is employed to reduce the dimensionality of the HSIs to 30. To enhance the robustness of the experimental outcomes, the training and test samples for TR, MU, AU, and UP are selected randomly. Tables 2-5 detail the number of training and testing samples for these four datasets, and the selection of training samples for each class depends on its distribution and total number. The experiments are repeated five times to ensure consistency, with the final classification results representing the average of these iterations. The evaluation metrics of overall accuracy (OA), average accuracy (AA), and the statistical Kappa coefficient (K) are the primary indicators of performance in these classification experiments.
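For reference, the three metrics can be computed from a confusion matrix as in this NumPy sketch; the `classification_metrics` helper is our own, and it assumes every class appears in `y_true`.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA, and the Kappa coefficient from label vectors."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                  # confusion matrix
    total = cm.sum()
    oa = np.trace(cm) / total                          # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))         # mean per-class recall
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / total**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 0, 2, 2])
oa, aa, kappa = classification_metrics(y_true, y_pred, 3)
print(round(oa, 3), round(aa, 3))                      # 0.833 0.833
```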
It is imperative to compare experimental outcomes across different parameter settings to achieve optimal accuracy. In this article, variables such as the initial learning rate of the Adam optimizer, the number of attention heads, the depth of the encoders, and the depth of the LD-Former are rigorously tested across all datasets. A controlled variable method is employed in these experiments, ensuring consistency in input size, number of epochs, experiment iterations, and the quantity of training and testing samples.


Initial Learning Rate
Table 5 illustrates the impact of various initial learning rates for the Adam optimizer on the experimental outcomes. The initial learning rates of 0.001, 0.0005, and 0.0001 are evaluated in the experiments. The findings indicate that the optimal accuracy is achieved when the initial learning rate is 0.001 for the TR, MU, and UP datasets and 0.0005 for the AU dataset.
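The tuned rates above can be kept in a small lookup table, with a helper that mirrors the Table 5 selection (choose the candidate rate whose OA is highest). The dataset keys are our own shorthand for the four datasets:

```python
# Initial learning rates per dataset, as reported in the text above.
DATASET_LR = {"TR": 1e-3, "MU": 1e-3, "AU": 5e-4, "UP": 1e-3}

def best_learning_rate(oa_by_lr):
    """Select the candidate rate with the highest OA, as in Table 5."""
    return max(oa_by_lr, key=oa_by_lr.get)
```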

Depth and Heads
Figure 9 presents the combined influence of the number of attention heads, encoder depths, and the depth of the LD-Former on the performance of the ALSST. In this work, we evaluate the LD-Former with four distinct configurations (i.e., 4 + 2, 4 + 1, 8 + 2, and 8 + 1) for the number of attention heads and the depth of the encoder. The findings indicate that the optimal configuration for the TR, MU, and UP datasets involves setting the attention heads and LD-Former depth to 8 and 1, respectively, while the optimal setting is 4 and 1 for the AU dataset.



Performance Comparison
In this section, we evaluate the performance of the proposed ALSST against various methods, including LiEtAl [49], SSRN [50], HyBridSN [35], DMCN [10], SpectralFormer [37], SSFTT [38], morpFormer [51], and 3D-ConvSST [52], to validate its classification effectiveness. The initial learning rates for all methods are aligned with those used for the ALSST to ensure a fair comparison, facilitating optimal performance evaluation. The classification outcomes and maps for each method across all datasets are detailed in Section 3.3.1. Subsequently, Section 3.3.2 provides a comparative analysis of resource consumption and computational complexity for all the methods.

TR dataset
Table 6 indicates that SpectralFormer yields the least favorable classification outcomes among the compared methods. This inferior performance is attributed to its approach of directly flattening image blocks into vectors, a process that disrupts the intrinsic structural information of the image. Following SpectralFormer, LiEtAl ranks as the second-least effective, primarily due to its simplistic structure, which limits its feature extraction capabilities. The proposed ALSST improves OA by 1.60%, 0.79%, 1.13%, 0.35%, 1.71%, 0.52%, 0.68%, and 0.12% compared to LiEtAl, SSRN, HyBridSN, DMCN, SpectralFormer, SSFTT, morpFormer, and 3D-ConvSST, respectively. At the same time, the proposed ALSST improves AA by 2.80%, 1.17%, 1.57%, 0.66%, 2.99%, 0.83%, 1.04%, and 0.14%, and K × 100 by 2.14%, 1.06%, 1.51%, 0.47%, 2.29%, 0.70%, 0.91%, and 0.16%, respectively. In addition, the accuracy of the proposed ALSST reaches 100% for categories 3, 4, and 5. The simplicity of the sample distribution facilitates the effective learning of feature information. As depicted in Figure 10, the ALSST model exhibits the least amount of salt-and-pepper noise in its classification maps compared to other methods, demonstrating its superior ability to produce cleaner and more accurate classification results. The category in the red box is represented by green, and only the proposed method achieves all green, while the other comparison methods are mixed with blue. Figure 11 also shows that the models with higher precision produce better clustering, and the clustering effect of the ALSST is the best. Taking categories 2 and 8 as examples, the ALSST can separate them to a greater extent.
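The OA, AA, and K × 100 figures quoted throughout these comparisons all derive from a confusion matrix. A minimal NumPy sketch of the three metrics (function name is ours, not the paper's):

```python
# OA = trace / total; AA = mean of per-class recalls;
# Kappa = (OA - chance agreement) / (1 - chance agreement).
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    aa = per_class.mean()
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

For example, with `y_true = [0, 0, 1, 1]` and `y_pred = [0, 0, 1, 0]` this gives OA = 0.75, AA = 0.75, and Kappa = 0.5.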

MU dataset
Table 7 shows that LiEtAl has the worst classification results, and MFT_PT has the second worst because it only carries out ordinary convolutional feature extraction on HSIs. The OA of the proposed ALSST rises by 6.84%, 2.78%, 4.50%, 2.33%, 2.64%, 2.66%, 4.76%, and 3.51% compared to LiEtAl, SSRN, HyBridSN, DMCN, SpectralFormer, SSFTT, morpFormer, and 3D-ConvSST, respectively. Meanwhile, the AA rises by 7.26%, 2.05%, 3.10%, 2.30%, 3.09%, 3.21%, 4.82%, and 4.22%, and the K × 100 rises by 8.73%, 3.54%, 5.62%, 2.99%, 3.45%, 3.40%, 5.98%, and 4.47%. The uneven and intricate sample distribution of the MU dataset poses considerable challenges for improving classification accuracy. The ALSST model leverages dynamic spectral-spatial fusion feature information to obtain superior classification performance relative to the other algorithms. As can be seen from the area inside the blue box in Figure 12, the classification image produced by the ALSST aligns most closely with the ground-truth image. As can be seen from the area inside the red circle of the T-SNE visualization in Figure 13, the ALSST has the best clustering effect.
AU dataset
The red boxed area in Figure 14 indicates that the ALSST model generates classification maps with minimal salt-and-pepper noise compared to the alternative methods, showcasing its capability to yield more precise classification outcomes. Furthermore, Figure 15 reveals that the higher the accuracy, the better the clustering effect, and the ALSST can distinguish categories 3 and 6 to the greatest extent.


Consumption and Computational Complexity
In this section, we conduct a thorough comparative analysis of the proposed ALSST against benchmark methods to evaluate classification performance. Main metrics, such as total parameters (TPs), training time (Tr), testing time (Te), and floating-point operations (Flops), are evaluated for each method. The detailed results are presented in Table 10. It is important to note that the padding in the convolution layers of the ALSST increases the number of parameters and the complexity of the model. However, the padding also introduces additional learnable features, which can contribute to enhancing classification accuracy. The experimental configurations are consistent with those previously described. The 3D-ConvSST has the most Flops. The ALSST model demonstrates a reduction in total parameters compared to HyBridSN and DMCN and a decrease in Flops compared to DMCN. When analyzing the MU, AU, and UP datasets, the ALSST exhibits fewer Flops than SSRN, although it shows an increase in Flops for the TR dataset. The ALSST has a shorter testing duration, more total parameters, more Flops, and a longer training time than the morpFormer on the TR dataset. Conversely, for the MU, AU, and UP datasets, the ALSST has shorter training times than morpFormer. The ALSST tends to have more total parameters and Flops compared to other methods across these datasets. Despite these aspects, the ALSST stands out by delivering superior classification performance, highlighting its effectiveness in handling diverse datasets.
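The TPs and timing columns of Table 10 can be reproduced for any PyTorch model with a few lines. A sketch under the assumption that the model is a standard `nn.Module` (helper names are ours):

```python
# Count trainable parameters and time a forward pass, as in Table 10.
import time
import torch

def count_parameters(model):
    """Total trainable parameters (the TPs column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def timed_inference(model, x):
    """Rough testing-time measurement for one batch."""
    model.eval()
    start = time.perf_counter()
    with torch.no_grad():
        out = model(x)
    return out, time.perf_counter() - start
```

Flops are typically estimated with an external profiler rather than by hand; the counting above covers only parameters and wall-clock time.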

Ablation Analysis
This section takes the UP dataset as an example to assess the impact of various components on performance. The first and second columns in Table 11 correspond to the convolutional fusion feature extraction module, ASSF, shown in Figure 1. The third column of Table 11 is the multiplication for the fusion of spectral-spatial features in the ASSF. The fourth column of Table 11 is the ordinary vision transformer encoder. According to the results, the ALSST model proposed in this study achieves the highest classification accuracy, indicating that each component positively enhances the classification performance. Table 12 illustrates the impact of employing the asymmetric convolution kernel within the ALSST. A 3D convolution kernel can be decomposed into multiple 2D convolution kernels. When a 2D kernel possesses a rank of 1, it effectively functions as a sequence of 1D convolutions and reinforces the core structure of CNNs. Therefore, the classification accuracy of the proposed model is improved.
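The rank-1 factorization discussed above can be illustrated by replacing a 3 × 3 convolution with a 3 × 1 kernel followed by a 1 × 3 kernel. This is a hedged sketch of the general technique, not the exact ADWA module; the module name is ours:

```python
# A 3x3 conv factorised into a 3x1 followed by a 1x3 conv, preserving
# spatial size via asymmetric padding.
import torch
import torch.nn as nn

class AsymmetricConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.vertical = nn.Conv2d(c_in, c_out, (3, 1), padding=(1, 0))
        self.horizontal = nn.Conv2d(c_out, c_out, (1, 3), padding=(0, 1))

    def forward(self, x):
        return self.horizontal(self.vertical(x))
```

The factorised pair uses fewer weights per output channel than a full 3 × 3 kernel while covering the same receptive field, which is the efficiency argument behind the ablation in Table 12.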

Ratio of DropKey
DropKey is an innovative regularizer applied within MHSA to effectively address the issue of overfitting, particularly in scenarios where samples are limited. In this section, we investigate the influence of varying DropKey ratios, ranging from 0.1 to 0.9, on the classification accuracy across four distinct datasets. The experimental results, including OA, AA, and K × 100, are showcased in Figure 18, indicating dataset-specific optimal DropKey ratios. Specifically, optimal performance for the TR and UP datasets is achieved with a DropKey ratio of 0.3, while the MU dataset peaks at 0.7 and the AU dataset at 0.8. Accordingly, the DropKey ratios are set to 0.3, 0.7, 0.8, and 0.3 for the TR, MU, AU, and UP datasets to optimize classification outcomes.
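DropKey can be sketched as masking random attention logits before the softmax, in contrast to ordinary dropout applied to the attention weights after it. A minimal PyTorch illustration (function name and tensor shapes are ours; the paper's multi-head version wraps this per head):

```python
# Scaled dot-product attention with DropKey: random key positions are
# masked to -inf in the logits *before* softmax during training, so the
# remaining keys are renormalized rather than simply zeroed out.
import torch
import torch.nn.functional as F

def dropkey_attention(q, k, v, ratio=0.3, training=True):
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    if training and ratio > 0:
        mask = torch.rand_like(logits) < ratio
        logits = logits.masked_fill(mask, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```

At inference time (`training=False`) the function reduces to standard scaled dot-product attention, matching how dropout-style regularizers are disabled at test time.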


Training Percentage
In this section, we conducted experiments to evaluate the performance of the proposed ALSST model across various training set sizes.The experimental configurations were consistent with the previously described settings.
The outcomes of these experiments are presented in Figure 19, illustrating the performance of the models under different training percentages. For the TR, AU, and UP datasets, we designated 2%, 4%, 6%, and 8% of the total samples as training samples. However, due to the notably uneven sample distribution of the MU dataset, 5%, 10%, 15%, and 20% of the total samples are selected for training. Our experiments demonstrated a significant improvement in the accuracies of all methods as the size of the training set increased. The ALSST model displayed superior performance across all scenarios, especially showing a marked increase in accuracy on the MU dataset. This enhancement is attributed to the capacity of the ALSST to leverage rich learnable features, which allows for more effective adaptation to uneven distributions and thus improves accuracy. Furthermore, the effectiveness of the ALSST across all datasets underscores its broad applicability in tasks that involve spectral-spatial feature fusion and classification.
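The per-class random selection of a training percentage described above can be sketched as a stratified split. The function name and the minimum-one-sample-per-class rule are our assumptions, not details from the paper:

```python
# Stratified random split: take the same percentage from every class,
# so rare classes are still represented in the training set.
import numpy as np

def stratified_split(labels, train_pct, seed=0):
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        n_train = max(1, int(round(len(idx) * train_pct)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)
```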


Conclusions
In this article, an adaptive learnable spectral-spatial fusion model named ALSST is proposed. Firstly, a dual-branch ASSF is designed to extract spectral-spatial fusion features; it mainly includes the PDWA and the ADWA. The PDWA extracts the spectral information of HSIs, and the ADWA extracts the spatial information of HSIs. Moreover, a transformer model that amalgamates MHSA and MLP is utilized to thoroughly leverage the correlations and heterogeneities among spectral-spatial features. Then, by adding a layer scale and DropKey to the primary transformer encoder and SA, the data dynamics are improved, and the influence of transformer depth on model classification performance is alleviated. Numerous experiments were executed across four HSI datasets to evaluate the performance of the ALSST in comparison with existing classification methods, aiming to validate its effectiveness and componential contributions. The outcomes of these experiments affirm the method's effectiveness and superior performance, underscoring the advantages of the ALSST in HSIC tasks. The inclusion of data padding in the ASSF results in an increase in model complexity and parameters. Consequently, a future direction for research is the development of a model that is both precise and lightweight, balancing the need for detailed feature extraction with the imperative of computational efficiency.


Figure 1 .
Figure 1. Structure for proposed ALSST model. (The ASSF is proposed to exclude the fully connected layers and capture local continuity while considering complexity. The multiplication is used to fuse the spectral-spatial features. The LD-Former is designed to increase the data dynamics and prevent performance degradation as the transformer deepens. In this figure, H × W × B is the size of the original HSI, H × W × L is the size of the HSI after PCA, and P × P × L is the patch size).

Figure 2 .
Figure 2. Structure for proposed ASSF. PDWA is used to extract the spectral features of HSI, and ADWA is used to extract the spatial features of HSI.
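The dual-branch idea of Figure 2 — a spectral branch built from pointwise and depthwise convolutions, a spatial branch built from asymmetric kernels, fused by element-wise multiplication — can be sketched as follows. The internals here are deliberately simplified stand-ins for the PDWA and ADWA, not the exact modules:

```python
# Two parallel branches over the same feature map, fused by multiplication.
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.spectral = nn.Sequential(             # pointwise + depthwise, in the spirit of PDWA
            nn.Conv2d(channels, channels, 1),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
        )
        self.spatial = nn.Sequential(              # asymmetric kernels, in the spirit of ADWA
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
        )

    def forward(self, x):
        return self.spectral(x) * self.spatial(x)  # fusion by element-wise multiplication
```

The multiplication acts as a gating between the two branches: a feature survives only where both the spectral and the spatial branch respond.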


The diagonal matrices contain learnable weights for the outputs of MHDSA and MLP. The diagonal values are all initialized to the fixed small value σ when the depth is within 18.
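The layer scale described above amounts to a learnable per-channel diagonal scaling applied to each sublayer's output before the residual addition. A minimal PyTorch sketch (the initialization value `sigma` and all names are illustrative):

```python
# Per-channel learnable scaling, initialised to a small sigma, applied to
# a sublayer's output before the residual addition.
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    def __init__(self, dim, sigma=1e-1):
        super().__init__()
        self.lmbda = nn.Parameter(sigma * torch.ones(dim))

    def forward(self, x):
        return self.lmbda * x

# Usage inside an encoder block (sketch):
#   x = x + scale_attn(mhdsa(norm(x)))
#   x = x + scale_mlp(mlp(norm(x)))
```

Starting the scales near zero keeps each new sublayer close to the identity at initialization, which is what lets deeper encoder stacks train without degradation.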

Figure 3 .
Figure 3. Structure for proposed LD-Former. In this paper, MHDSA is the DropKey multi-head self-attention, ADD is the residual connection, MLP is the multi-layer perceptron, and NORM is the layer norm. N is the number of encoder loops.

Figure 4
Figure 4 delineates the structure of the proposed MHDSA. To elucidate the interrelations among feature tokens, the learnable weights denoted W_Q, W_K, and W_V are established for the SA mechanism. These weights are utilized to multiply with the feature tokens to produce the query, key, and value matrices.


Figure 4 .
Figure 4. Structure for proposed MHDSA. Q, K, and V are query, key, and value matrices.

MU dataset
The MU dataset covers the University of Southern Mississippi Gulf Park Campus, Long Beach, Mississippi, USA. The dataset was acquired in November 2010 with a spatial resolution of 1 m per pixel. The original dataset is 325 × 337 pixels with 72 bands, and the imaging spectral range is between 380 nm and 1050 nm. Due to the influence of imaging noise, the first 4 and the last 4 bands were removed, and finally, 64 bands were used. The invalid area on the right of the original image is removed, and 325 × 220 pixels are retained. Objects in the imaging scene were placed into eleven categories. The pseudo-color image of the HSI and the ground-truth image are in Figure 6. The details of the MU dataset are in Table 2.

UP dataset
The UP hyperspectral dataset, captured in 2003, focuses on the urban area around the University of Pavia in Northern Italy. The dataset comprises 610 × 340 pixels, encompassing nine distinct categories of ground objects. It includes 115 consecutive spectral bands ranging from 0.43 to 0.86 µm, offering a spatial resolution of 1.3 m. Due to noise, 12 bands were discarded, leaving 103 bands for analysis. The pseudo-color image of the HSI and the ground-truth image are likewise provided, and the details of the UP dataset are in Table 4.

Figure 9 .
Figure 9. Combined effect of the number of attention heads and the encoder depth on the four datasets. (a) TR dataset (8 + 1); (b) MU dataset (8 + 1); (c) AU dataset (4 + 1); (d) UP dataset (8 + 1). The horizontal axis represents the attention heads, while the vertical axis denotes the OA (%). Green represents an encoder depth and fusion block depth of one, and orange represents an encoder depth and fusion block depth of two.
Figures 10, 12, 14 and 16 display the classification maps generated by all the considered methods, and Figures 11, 13, 15 and 17 display the T-SNE visualizations of all the considered methods, further illustrating the superior performance of the ALSST model.

Figure 18 .
Figure 18. OA (%), AA (%), and K × 100 for different ratios of DropKey. (a) TR dataset; (b) MU dataset; (c) AU dataset; (d) UP dataset. The blue dashed box marks the ratio corresponding to the best accuracy.


Figure 19 .
Figure 19. OA (%) of all methods for different training percentages. (a) TR dataset; (b) MU dataset; (c) AU dataset; (d) UP dataset. The accuracies of all evaluated methods demonstrated significant improvement as the number of training samples increased. Notably, the ALSST model consistently outperformed the other methods in every scenario, showcasing its superior effectiveness.


Algorithm 1
Adaptive Learnable Spectral-Spatial Fusion Transformer for Hyperspectral Image Classification.
Input: HSI X_H^ori ∈ R^(H×W×B), labels Y_L ∈ R^(H×W), patches = 11 × 11, PCA = 30.
Output: prediction Y_P.
1. Initialize: batch size = 64, epochs = 100; the initial learning rate of the Adam optimizer depends on the dataset.
Accomplish the slicing process for the HSI to acquire the small patches X_H^in ∈ R^(P×P×L).
4. Split X_H^in into training sets D_tr and test sets D_te (D_tr has the class labels, and D_te does not).
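A minimal PyTorch rendering of the training stage of Algorithm 1 (all names are illustrative and the model itself is passed in; the authors' actual implementation may differ):

```python
# Minimal Adam-based training loop over labelled patch batches,
# following the initialization in Algorithm 1.
import torch

def train_alsst(model, train_loader, lr, epochs=100, device="cpu"):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for patches, labels in train_loader:
            patches, labels = patches.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(patches), labels)
            loss.backward()
            opt.step()
    return model
```

After training, prediction Y_P is obtained by running the test set D_te through the trained model and taking the argmax over class logits.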

Table 1 .
Details on TR dataset.


Table 2 .
Details on MU dataset.


Table 3 .
Details on AU dataset.


Table 4 .
Details on UP dataset.


Table 5 .
OA (%) of different initial learning rates on all datasets (Bold represents the best accuracy).

Table 6 .
OA, AA, K and per-class accuracy for TR dataset (Bold represents the best accuracy).

Table 7 .
OA, AA, K and per-class accuracy for MU dataset (Bold represents the best accuracy).

Table 9 .
OA, AA, K and per-class accuracy for UP dataset (Bold represents the best accuracy).


Table 10 .
Total parameters, training time, testing time, and Flops of all methods on different datasets (Bold represents the best values).

Table 11 .
Effect of different combinations on ALSST (Rows represent different combinations, √ indicates that the component exists, ⊗ represents the multiplication of spectral-spatial features, and the bold represents the best accuracy).

Table 12 .
The effect of asymmetric convolution for ALSST (Bold represents the best accuracy).