Article

Cross-Domain Hyperspectral Image Classification Combined Sharpness-Aware Minimization with Local-to-Global Feature Enhancement

Heilongjiang Province Key Laboratory of Laser Spectroscopy Technology and Application, Harbin University of Science and Technology, Harbin 150080, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(5), 740; https://doi.org/10.3390/rs18050740
Submission received: 26 November 2025 / Revised: 7 January 2026 / Accepted: 22 January 2026 / Published: 28 February 2026
(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?
  • This study proposes a novel paradigm for classifying hyperspectral satellite imagery using UAV hyperspectral data, enabling effective utilization of large amounts of unlabeled satellite data. By integrating cross-domain learning with the high spatial resolution and abundant labeled information of UAV hyperspectral data, the proposed method significantly enhances the fine-grained classification performance of satellite hyperspectral images in broad-area scenes. This approach offers a new research direction for the intelligent interpretation of hyperspectral remote sensing data acquired from heterogeneous sensor platforms.
  • The proposed method achieves state-of-the-art classification performance, significantly outperforming advanced cross-domain classification approaches, including the state-of-the-art method DSFormer, on four standard benchmark datasets.
What are the implications of the main findings?
  • A local–global feature extraction model is developed. Initially, the model captures local edge information from cross-domain data, followed by global feature alignment through an improved self-attention mechanism. This strategy enhances boundary detail representation through local feature extraction and optimizes cross-domain feature consistency via global feature alignment, thereby improving the model’s adaptability and robustness in cross-domain hyperspectral classification tasks.
  • An improved Sharpness-Aware Minimization (ISAM) strategy is proposed to overcome the local optima and reduced generalization caused by spectral shift in hyperspectral cross-domain classification tasks. To reduce computational complexity and improve training efficiency, this work refines the gradient perturbation strategy by using a single forward propagation to compute approximate perturbations. Furthermore, by combining square-root gradient approximation perturbation with a nonlinear gradient scaling mechanism, the gradient update amplitude grows gradually relative to the gradient magnitude. This adaptive adjustment of feature update intensity suppresses the dominance of large gradients, enhances the influence of small gradients, and ensures more balanced cross-domain feature alignment.

Abstract

With the increasing availability of satellite imagery and the shortening revisit intervals, efficiently processing satellite hyperspectral images has become a critical task. However, in practice, a large portion of satellite hyperspectral data remains unlabeled, making it difficult to achieve satisfactory classification performance using satellite data alone. Meanwhile, UAV-based platforms offer acquisition flexibility, which facilitates the collection of rich and detailed information. To address these challenges, this paper proposes a method called Sharpness-Aware Minimization with Local-to-Global Feature Enhancement (SAMLFE), which uses UAV hyperspectral images for training to enhance the fine-grained classification performance of satellite hyperspectral images in large scenes. Specifically, a spectral dimension mapping model is first employed to unify UAV and satellite images into a common spectral dimension, thereby mitigating the impact of inconsistent feature representations. Next, a local-to-global feature extraction network is constructed to capture both local details and global semantics. Few-shot learning is applied to extract discriminative features from both the source and target domains within the shared feature space, thereby enhancing the model’s ability to utilize limited labeled data efficiently. Furthermore, a conditional adversarial domain adaptation strategy is adopted to align the feature distributions of the source and target domains, thereby alleviating spectral shift. Meanwhile, the integration of an improved Sharpness-Aware Minimization (ISAM) enhances the model’s robustness across domains. Finally, the K-Nearest Neighbor algorithm is employed to perform accurate classification. Experimental results on multiple datasets demonstrate that the proposed method achieves superior generalization and classification performance in cross-domain hyperspectral image classification. It also outperforms existing methods in terms of feature distribution alignment, robustness of feature extraction, and adaptability to small-sample scenarios.

1. Introduction

Hyperspectral remote sensing integrates imaging and spectral detection technologies, covering electromagnetic wave bands including visible light, near-infrared, mid-infrared, and thermal infrared. During the imaging of ground objects’ spatial characteristics, spectral measurements are conducted on each spatial unit, simultaneously capturing spatial and spectral information [1]. Owing to the “combination of image and spectrum” characteristic of hyperspectral images, they contain far richer information about ground objects. By fully exploiting this characteristic, land covers can be classified accurately [2]. Therefore, hyperspectral remote sensing has been widely applied in urban planning [3], environmental monitoring [4], precision agriculture [5], and modern medical diagnostics [6].
In recent years, deep learning methods have advanced rapidly, particularly in hyperspectral image classification, significantly enhancing classification accuracy through powerful feature extraction capabilities. Deep learning models can automatically learn spectral and spatial features. In particular, three-dimensional convolutional neural networks (3D CNNs) effectively capture spectral–spatial information, achieving outstanding performance in classification tasks. However, as CNN models deepen, the overfitting phenomenon inevitably occurs [7]. To address this, ResNet was introduced to mitigate gradient vanishing and overfitting issues in deep network training by employing residual blocks. The identity mapping within these blocks allows deep networks to be trained more effectively without increasing complexity [8]. Building upon this framework, the supervised spectral–spatial residual network (SSRN) was specifically optimized for hyperspectral image classification tasks [9]. By incorporating identity mapping, SSRN effectively reduces potential accuracy loss during feature extraction, significantly enhancing classification accuracy.
Despite the considerable success of deep learning in HSI classification, training such models typically requires a large amount of labeled data. However, in practical applications, newly acquired hyperspectral images often lack sufficient labels. Data labeling is time-consuming, labor-intensive, and expensive, greatly limiting the learning potential of current deep learning models. Therefore, few-shot learning (FSL) has attracted considerable attention due to its ability to quickly adapt to new tasks using prior knowledge and a limited number of labeled samples [10,11]. Various strategies have been developed to implement FSL effectively. For instance, the Deep Fuzzy Metric Learning (DFML) method leverages fuzzy logic theory to construct a spatial–spectral fuzzy metric space, enhancing the characterization of uncertainty in mixed and boundary pixel categories [12]. By combining a hybrid CNN–Transformer network with a Gaussian membership function-based fuzzy set representation, DFML improves classification performance under few-shot conditions. To further enhance feature discriminability, the Spatial–Spectral Enhancement and Fusion Network (SSEFN) utilizes a spatial–spectral enhancement strategy to facilitate model learning. Moreover, SSEFN incorporates an Adaptive Decision Fusion (ADF) module to integrate classification decisions from multiple enhanced features, effectively mitigating model overfitting [13]. In the context of few-shot open-set classification, the Self-Supervised Multitask Learning (SSMTL) framework was proposed to enhance feature extraction by introducing a self-supervised reconstruction task. By integrating modules such as the Data Diversification Module (DDM), Three-Dimensional Multi-Scale Attention Module (3D-MAM), and Adaptive Threshold Module (ATM), SSMTL dynamically adjusts thresholds based on uncertainty, thereby improving open-set classification performance [14].
Although the aforementioned methods have achieved promising classification performance, the generalization ability of the models remains limited due to insufficient labeled samples in the target domain. To address this challenge, domain adaptation methods have been widely adopted to tackle the problem of limited labeled samples. These methods leverage source domain data rich in annotations to extract potential correlation information (i.e., domain-invariant features) between different domains, and subsequently classify unlabeled target domains by mapping different but similar scenes into a shared feature space. This classification approach is known as cross-scene classification [15]. In the context of remote sensing, one effective strategy involves aligning data distributions. For example, semi-supervised Transfer Component Analysis (TCA) reduces inter-domain discrepancies in the feature space, significantly improving performance under scarce label conditions [16]. Beyond statistical methods, deep learning architectures have also been utilized for feature alignment. Specifically, models employing recurrent neural networks (RNNs) have been proposed to extract features followed by transformation learning to obtain domain-invariant representations [17]. Additionally, approaches based on deep metric learning have been developed to align embedded features via unsupervised domain adaptation, facilitating classification using the nearest neighbor (NN) algorithm even when the target scene contains limited labeled samples [18].
In practical cross-scene transfer learning, source and target datasets are often collected using different sensors or affected by varying external conditions. These differences not only lead to distributional shifts, but also result in inconsistencies in spectral dimensions, class distributions, and spatial resolution, thereby presenting significant challenges for cross-scene feature transfer under heterogeneous domain adaptation. In fact, compared to isomorphic transfer learning, heterogeneous relationships between source and target domains are more common in real-world applications. As a result, heterogeneous transfer learning has garnered increasing attention in recent years [19].
To address this challenge, Deep Cross-domain Few-shot Learning (DCFSL) was introduced as a heterogeneous transfer learning approach. Using meta-learning, DCFSL effectively handles few-shot HSI classification. Moreover, DCFSL incorporates a conditional domain adversarial strategy to mitigate domain shift between the source and target domains, thereby enhancing cross-domain classification performance [20]. From a metric learning perspective, the Class Covariance Metric-based Few-shot Learning (CMFSL) method was proposed to improve adaptability [21]. It employs interactive training and replaces the traditional Euclidean metric with class covariance distance to learn invariant features across domains, thereby enhancing model adaptability and improving HSI classification accuracy. To capture non-local spatial dependencies effectively, Graph Information Aggregation Cross-domain Few-shot Learning (Gia-CFSL) integrates information propagation and graph alignment within graph neural networks [22]. By aligning non-local spatial information at both feature and distribution levels, Gia-CFSL mitigates spectral shifts and improves cross-domain classification accuracy [22]. For heterogeneous domain adaptation, a transfer model based on the Extreme Learning Machine (ELM) network was developed to enforce feature dimension consistency between source and target domains to achieve heterogeneous domain adaptation [23]. Focusing on feature relationships, another innovative method integrates a spectral–spatial enhanced channel attention mechanism to dynamically extract multi-scale global-to-local features. This approach also incorporates correlation alignment losses to reduce distribution discrepancies [24]. From an optimization perspective, Decoupled Knowledge Distillation-based FSL (DKD-FSL) [25] was introduced. DKD-FSL formulates meta-knowledge extraction and debiasing as a collaborative optimization task and introduces a knowledge distillation strategy to efficiently acquire and utilize unbiased meta-knowledge. Additionally, DKD-FSL employs a decoupled log interaction module to optimize the interaction between task-relevant and data-internal knowledge, and integrates a discriminant information refining module to enhance the separability of similar spectral bands [25].
However, the above method primarily focuses on cross-domain learning using data collected from different sensors on the same platform. In practical scenarios, where a large amount of hyperspectral data remains unlabeled, relying solely on single-platform data restricts the diversity of information sources. Therefore, expanding the data scope beyond a single platform presents a promising avenue to improve classification performance [26]. As an effective supplement, UAV remote sensing technology provides high operational flexibility. It effectively captures data even under cloud cover and delivers high-resolution imagery. Furthermore, the detailed spatial and spectral information inherent in UAV data greatly enhances the precision of surface feature identification and classification [27,28,29]. While UAV remote sensing data excels in high resolution and timeliness, satellite remote sensing is characterized by its wide-area coverage and stable observation capabilities. Leveraging these complementary strengths, pre-training the model on UAV data and transferring the learned features to satellite images effectively integrates rich detailed information with large-scale global contexts. In addition, this strategy uses large amounts of unlabeled satellite images and applies cross-domain learning to extract latent information, thereby enhancing the model’s generalization and classification accuracy on satellite data. This approach significantly reduces dependence on manual annotation and improves the model’s practicality and scalability in large-scale remote sensing applications. However, integrating both approaches presents challenges, including differences in spatial–spectral resolution, changes in imaging conditions, and inconsistencies in data sources. Particularly, when migrating small-scene UAV data to large-scene satellite data, addressing these differences while maintaining high classification accuracy has become a key challenge in heterogeneous transfer learning. To accurately capture intrinsic features under spectral resolution variations, the Subpixel Spectral Variability Network (S2VNet) models spectral variability and nonlinear mixture characteristics to deeply integrate complete subpixel information with class features [30]. This mechanism significantly enhances the model’s discriminative capability, achieving precise feature capture in complex scenarios. Meanwhile, to address the challenge of distribution inconsistency in multisource data, Multisource Collaborative Domain Generalization (MS-CDG) employs a distribution consistency alignment strategy [31]. This enables the model to effectively extract domain-invariant features, thereby improving generalization capability under multisource conditions. Although the aforementioned methods have achieved significant progress in subpixel-level feature mining and multisource domain generalization, simultaneously overcoming the significant spatial–spectral resolution discrepancies between UAV and satellite data and enhancing model generalization remains a critical challenge.
Based on this, this paper proposes a cross-domain hyperspectral image classification model that incorporates sharpness-aware minimization and local-to-global feature enhancement. It utilizes UAV hyperspectral images for learning and training, aiming to achieve fine-grained classification of large-scene satellite hyperspectral images. The model primarily addresses the challenges of large spatial–spectral resolution differences and the cross-domain learning of small-scene UAV data and large-scene satellite data under varying imaging conditions. In summary, the main contributions of this work are as follows:
  • This study proposes a novel paradigm for classifying hyperspectral satellite imagery using UAV hyperspectral data, enabling effective utilization of large amounts of unlabeled satellite data. By integrating cross-domain learning with the high spatial resolution and abundant labeled information of UAV hyperspectral data, the proposed method significantly enhances the fine-grained classification performance of satellite hyperspectral images in broad-area scenes. This approach offers a new research direction for the intelligent interpretation of hyperspectral remote sensing data acquired from heterogeneous sensor platforms;
  • A local–global feature extraction model is developed. Initially, the model captures local edge information from cross-domain data, followed by global feature alignment through an improved self-attention mechanism. This strategy enhances boundary detail representation through local feature extraction and optimizes cross-domain feature consistency via global feature alignment, thereby improving the model’s adaptability and robustness in cross-domain hyperspectral classification tasks;
  • An improved Sharpness-Aware Minimization (ISAM) strategy is proposed to overcome the local optima and reduced generalization caused by spectral shift in hyperspectral cross-domain classification tasks. To reduce computational complexity and improve training efficiency, this work refines the gradient perturbation strategy by using a single forward propagation to compute approximate perturbations. Furthermore, by combining square-root gradient approximation perturbation with a nonlinear gradient scaling mechanism, the gradient update amplitude grows gradually relative to the gradient magnitude. This adaptive adjustment of feature update intensity suppresses the dominance of large gradients, enhances the influence of small gradients, and ensures more balanced cross-domain feature alignment.

2. Related Works

2.1. Hyperspectral Image Classification via Deep Neural Network

The deep learning models for hyperspectral image classification can be broadly categorized into spectral models and spatial–spectral models. For spectral deep learning, Hu et al. proposed a one-dimensional convolutional neural network that relies solely on spectral band information for classification. By modeling each spectral band individually, the representation capability of spectral information was significantly enhanced [32]. Mou et al. employed a recurrent neural network (RNN) to model spectral bands as sequences, achieving good classification performance [33]. However, although these methods are effective in modeling spectral information, there is potential to further improve classification performance by leveraging the rich spatial features in hyperspectral images.
To address this issue, the researchers began to explore deep learning models that integrate spatial information to fully exploit the feature representation capabilities of hyperspectral images. Li et al. proposed a three-dimensional convolutional neural network (3D CNN) capable of simultaneously extracting spatial and spectral features, thereby significantly enhancing classification performance [34]. In addition, Liu et al. combined two-dimensional and three-dimensional CNNs for feature extraction, which not only enhanced feature representation capability but also effectively reduced computational cost [35]. To further improve the stability of deep networks, Transformer models have been increasingly applied to hyperspectral classification tasks beyond traditional CNN architectures, due to their powerful global modeling capabilities. Ahmad et al. proposed the Spectral–Spatial Wavelet Transformer, which captures both local and global features simultaneously, thereby improving classification accuracy [36]. However, while these methods perform well in single-domain settings, there remains potential to adapt them for cross-domain hyperspectral classification, particularly to better handle differences in data distribution between domains.
Given the limitations of single-domain classification methods, researchers have shifted their focus to cross-domain hyperspectral image classification to enhance model generalization. In cross-domain hyperspectral classification tasks, ResNet is regarded as a key method for enhancing generalization, owing to its strong feature extraction abilities and stable optimization characteristics. Li et al. demonstrated that the residual structure can stably learn features despite changes in data distribution, thus improving model accuracy in cross-domain hyperspectral image classification tasks [20]. Consequently, ResNet and its variants have been widely employed in cross-domain learning tasks. Zhang et al. proposed a comparative learning method based on ResNet, which leverages contrastive learning to enhance cross-domain shared features, thereby improving target domain classification accuracy [37]. Additionally, graph neural networks (GNNs) have been introduced into cross-domain hyperspectral classification tasks. Ye et al. proposed a cross-domain few-shot learning method based on GNNs. By constructing an inter-domain relationship graph, they structured the samples from both source and target domains to enhance cross-domain classification performance [38]. Inspired by previous studies, this paper proposes a local–global feature extraction model that combines the residual learning framework of ResNet with the global modeling capability of the Transformer. This integration enhances the model’s generalization and robustness in cross-domain hyperspectral classification tasks.

2.2. Strategies for Enhancing Model Generalization

Improving the generalization ability of models has become a key challenge in cross-domain hyperspectral image classification tasks. To address this challenge, Domain Adaptation (DA) and Domain Generalization (DG) techniques have been introduced into cross-domain hyperspectral image classification tasks. Specifically, DA aims to reduce the feature distribution discrepancy between the source and target domains, whereas DG focuses on mining domain-invariant features to ensure model robustness on unseen domains. Existing methods mainly include data augmentation, regularization techniques, and domain adaptation, each improving model generalization from different perspectives. These methods have shown promising results, suggesting avenues for continued improvement.
Data augmentation methods expand datasets using generative adversarial networks (GANs) or traditional transformations, such as rotation, flipping, and noise perturbation, thereby improving the model’s adaptability to changes in inter-domain distributions. The GAN-based data augmentation strategy proposed by Miftahushudur et al. can improve cross-domain classification performance; however, the high dimensionality and complexity of hyperspectral data could pose challenges for the quality of generated samples [39]. Moreover, although traditional data augmentation methods are simple and efficient, when faced with significant inter-domain differences, augmented samples may still struggle to fully compensate for the feature shift between the source and target domains. Regularization techniques enhance generalization capabilities by optimizing the model parameter update process. Dropout reduces the network’s reliance on specific neurons by randomly discarding units and their connections during training, thus effectively mitigating overfitting [40]. However, to further extend their benefits to cross-domain scenarios, additional mechanisms may be incorporated to better handle target domain features.
Early domain adaptation methods primarily focused on mitigating distribution discrepancies through statistical alignment. Wang et al. proposed a maximum mean discrepancy (MMD)-based method that measures the mean difference between the feature distributions of the source and target domains in a reproducing kernel Hilbert space. This approach narrows the cross-domain distribution deviation and enhances the model’s generalization ability [41]. Additionally, some studies have incorporated manifold learning strategies. These methods abstract visual and semantic spaces into graph structures and utilize matrix factorization to enforce the optimized semantic manifold to approximate the visual manifold in terms of geometric topology, thereby constructing a visually aligned semantic graph to align cross-modal domain distributions [42]. However, these traditional approaches typically assume that the source and target domains share a consistent class space. In practice, scenarios are often more complex, with the target domain frequently exhibiting spectral shifts or containing unknown classes. Addressing these realistic conditions introduces significant generalization challenges. With advancements in deep learning, domain adaptation has witnessed significant breakthroughs, and current methods can be broadly categorized into fine-tuning-based approaches and feature-level domain adaptation.
Fine-tuning-based approaches typically involve transferring the weights of a network pre-trained on the source domain to a target domain model, enabling the model to adapt to the target data distribution through subsequent fine-tuning operations. For instance, Mei et al. trained a five-layer CNN classifier on source domain data and subsequently extracted the fully connected feature layers to construct a feature learning framework tailored for the target domain [43]. Similarly, Yang et al. proposed a dual-branch convolutional neural network that utilizes a fine-tuning strategy to transfer the weights of the initial layers from the pre-trained network to the target domain model [44]. While this strategy proves effective when the source and target domains share similar features, relying solely on fine-tuning may be insufficient to extract deep, robust features when facing significant data discrepancies, such as differences in resolution or sensor types.
Another approach is feature-level domain adaptation, which aims to extract domain-invariant features by minimizing the discrepancy between the feature distributions of the source and target domains. Othman et al. proposed a deep domain adaptation network utilizing a mini-batch gradient optimization algorithm to minimize the feature distribution error between the two domains, thereby effectively achieving cross-scene classification [45]. Similarly, Wang et al. proposed a neural network-based domain adaptation framework that achieves high-precision classification by jointly optimizing three objectives in the embedding space: source discriminative classification, cross-domain distribution alignment, and manifold structure preservation [46]. While the aforementioned methods excel at aligning global feature distributions, there remains room for improvement in capturing the fine-grained class structures intrinsic to the data. To address this, Conditional Adversarial Domain Adaptation strategies have been introduced for cross-domain classification tasks. The primary advantage of this strategy is that it not only focuses on feature alignment but also incorporates the classifier’s predictions (i.e., class information) as a critical condition into the adversarial network. This mechanism encourages a more compact distribution of same-class objects in the feature space (i.e., enhancing intra-class compactness), thereby enabling superior classification performance even when target domain samples are scarce [47].
Although Conditional Adversarial Domain Adaptation strategies effectively facilitate fine-grained alignment between source and target domains by incorporating class prediction information, relying solely on feature-level constraints is often insufficient to fully guarantee model generalization in complex cross-domain scenarios. In fact, the improvement of cross-domain classification performance depends not only on feature alignment but is also intrinsically linked to the optimization process of model parameters. Particularly under small-sample conditions, models tend to be highly sensitive to distribution discrepancies, where even subtle spectral variations can significantly impact final performance. Therefore, while adversarial adaptation strategies establish the correct direction for feature alignment, further incorporating an advanced optimization mechanism is essential. By guiding the model to converge to a flatter and more robust minimum, such a mechanism can significantly enhance the model’s resilience to environmental changes.
In recent years, several studies have aimed to improve model generalization through optimization strategies. Among these, Sharpness-Aware Minimization (SAM) has emerged as a powerful algorithm due to its superior noise robustness and generalization capabilities [48]. SAM introduces a new optimization paradigm, where the core idea is to optimize the worst loss in the neighborhood during the gradient update process. This guides the model to learn on a smoother loss surface, alleviating the issue of local optimization and improving generalization. Although SAM has shown promising results in two-dimensional natural image classification tasks, it faces challenges such as high optimization complexity and unstable local perturbation directions when applied to three-dimensional data, especially in tasks with high-dimensional structures and cross-domain differences, such as hyperspectral images. Further improvements and adaptations are needed. Inspired by previous studies, this paper proposes an improved SAM algorithm that enhances model optimization efficiency and significantly improves generalization in cross-domain scenarios.

3. Proposed SAMLFE for HSI Classification

Figure 1 illustrates the architecture of the proposed SAMLFE model, using the WHU-Hi-HanChuan and Pavia University datasets as examples. First, the target domain images (Pavia University), with limited labels, and the source domain images (WHU-Hi-HanChuan), with abundant labels, are input into the spectral dimension mapping module to align their spectral dimensions. The mapped data is then passed into the local-to-global feature extraction model, which comprehensively extracts both local details and global semantic features. This ensures that the global modeling stage retains key detail information while preventing the model from overfocusing on irrelevant local features. Additionally, few-shot learning is applied in the local-to-global feature space to extract shared deep features between the source and target domains, enabling rapid adaptation to spectral offsets in new tasks. A conditional adversarial domain adaptation strategy is then employed to align the feature distributions of the source and target domains, effectively mitigating the spectral offset problem. Simultaneously, the parameter optimization process is refined using the improved sharpness-aware minimization strategy to reduce the model’s sensitivity to feature distribution changes, thus minimizing performance fluctuations caused by spectral offsets. Finally, the extracted features are classified using KNN to determine the final geographic category.

3.1. Spectral Dimension Mapping Model Between Source and Target Domains

Differences in the number of spectral bands between the source and target domains (e.g., 274 bands in the source domain versus 103 bands in the target domain) can cause significant deviations in their feature distributions, thereby affecting the cross-domain performance of the classification model. The mapping model, first proposed in [49], is adopted to ensure that input samples share the same dimensionality before entering the embedded feature extractor. To mitigate feature inconsistencies caused by spectral dimension differences between the source domain (e.g., UAV images) and the target domain (e.g., satellite images), this study employs a spectral dimension mapping model to uniformly map the input data, which enables feature alignment across domains and enhances the model’s cross-domain adaptability. For a detailed description of the training process, refer to Algorithm 1.
The spectral dimension mapping model, illustrated in Figure 2, consists of two components: a 2D convolutional layer and a batch normalization layer. The figure depicts the source domain as an example, while the target domain follows an identical mechanism. First, the input data from the source and target domains, denoted as $x_s$ and $x_t$, are individually transformed using the spectral dimension mapping models $M_s$ and $M_t$. This operation projects the heterogeneous data into a unified spectral feature space with a target dimension of $d = 100$. Subsequently, the outputs generated by the 2D convolution layers are normalized through batch normalization (BN) to accelerate training and enhance model stability. Finally, the resulting aligned feature maps are obtained, denoted as $x_s'$ and $x_t'$, as formulated in Equation (1):
$x_s' = M_s(x_s), \quad x_t' = M_t(x_t)$ (1)
where $x_s$ and $x_t$ represent the input features of the source and target domains, with dimensions of $B_s \times H_s \times W_s$ and $B_t \times H_t \times W_t$, respectively. $M_s$ and $M_t$ denote the dimension mapping layers, both with a kernel size of $d \times 1 \times 1$. After mapping, the band dimensions $B_s$ and $B_t$ are projected onto the target dimension $d$. The aligned features of the source and target domains are denoted as $x_s'$ and $x_t'$, with dimensions $d \times H_s \times W_s$ and $d \times H_t \times W_t$, respectively, where $d$ is set to 100.
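To make the mapping concrete, the following is a minimal PyTorch sketch of a 1 × 1 convolution followed by batch normalization that projects an arbitrary number of input bands to the shared dimension d = 100, in the spirit of Equation (1); the module name SpectralMapper and the example tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SpectralMapper(nn.Module):
    """Hypothetical sketch of the spectral dimension mapping model:
    a 1x1 2D convolution projects the input bands to d target bands,
    followed by batch normalization (Equation (1))."""
    def __init__(self, in_bands: int, d: int = 100):
        super().__init__()
        self.proj = nn.Conv2d(in_bands, d, kernel_size=1, bias=False)  # d x 1 x 1 mapping layer
        self.bn = nn.BatchNorm2d(d)

    def forward(self, x):              # x: (batch, bands, H, W)
        return self.bn(self.proj(x))   # aligned features: (batch, d, H, W)

# Illustrative usage: a 274-band source patch and a 103-band target patch
# are both projected to the shared dimension d = 100.
M_s, M_t = SpectralMapper(274), SpectralMapper(103)
x_s, x_t = torch.randn(8, 274, 9, 9), torch.randn(8, 103, 9, 9)
x_s_aligned, x_t_aligned = M_s(x_s), M_t(x_t)  # both (8, 100, 9, 9)
```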
Algorithm 1 Pseudocode for the Training Process of the Proposed SAMLFE
Input: $S_T$, $Q_T$ of $D_t$; $S_S$, $Q_S$ of $D_S$; the number of training episodes.
Output: The classification accuracy of each class of the target dataset.
1: Spectral Dimension Mapping
2: Calculate $x_s'$, $x_t'$ by Equation (1);
3: Local-to-Global Feature Extraction
4: for episode = 1 : episodes do
5:     Randomly select $S_T$, $Q_T$ from $D_t$ and $S_S$, $Q_S$ from $D_S$;
6:     Extract deep representations from the mapped data;
7:     Calculate local features $y_{LFEM}$ by Equation (8);
8:     Calculate global features $O$ by Equation (14);
9:     Perform few-shot learning on the extracted features;
10:    Calculate $L_{fsl}^s$, $L_{fsl}^t$ by Equations (18) and (19);
11:    $L_{fsl} = L_{fsl}^s + L_{fsl}^t$;
12:    $L_{fsl}$.backward();
13: end for
14: Conditional Domain Discriminator
15: for episode = 1 : episodes do
16:    Calculate $L_d$ by Equation (22);
17:    $Loss = L_{fsl} + L_d$;
18:    $Loss$.backward();
19: end for
20: ISAM Parameter Optimization
21: for episode = 1 : episodes do
22:    Calculate $e_w$, $\hat{w}$, and $\tilde{e}_w$ by Equations (24)–(26);
23:    Update model parameters to reduce loss fluctuations;
24:    Calculate $W_{new}$ by Equation (27);
25: end for
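As a reading aid, the episodic flow of Algorithm 1 could be organized roughly as in the sketch below; every argument (sample_episode, mapper_s, mapper_t, extract_features, fsl_loss, domain_adversarial_loss, isam_step) is a hypothetical placeholder for the components defined in Sections 3.1–3.5, not code provided by the authors.

```python
def train_samlfe(episodes, D_s, D_t, sample_episode, mapper_s, mapper_t,
                 extract_features, fsl_loss, domain_adversarial_loss, isam_step):
    """Hypothetical skeleton of Algorithm 1. All arguments are caller-supplied
    callables/data standing in for the components described in Sections 3.1-3.5."""
    for _ in range(episodes):
        # Step 5: randomly build source/target support and query sets (C-way K-shot).
        (S_s, Q_s), (S_t, Q_t) = sample_episode(D_s), sample_episode(D_t)

        # Steps 2, 6-8: spectral mapping (Eq. 1) and local-to-global features (Eqs. 8, 14).
        f_s = extract_features(mapper_s(S_s), mapper_s(Q_s))
        f_t = extract_features(mapper_t(S_t), mapper_t(Q_t))

        # Steps 9-11: episodic few-shot losses on both domains (Eqs. 18-19).
        loss_fsl = fsl_loss(f_s) + fsl_loss(f_t)

        # Steps 16-17: conditional adversarial domain loss (Eq. 22) and total loss (Eq. 23).
        loss_total = loss_fsl + 2 * domain_adversarial_loss(f_s, f_t)

        # Steps 21-24: backward pass and ISAM parameter update (Eqs. 24-27).
        loss_total.backward()
        isam_step()
```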

3.2. Local-to-Global Feature Extraction Model

In constructing the local–global feature extraction model, this study adopts a “local-to-global” feature extraction strategy, differing from conventional network practices. The core innovation of this strategy lies in first employing a local feature extraction module to precisely capture details and edge information, followed by a global feature extraction module to model cross-domain relationships. This approach ensures that critical details are preserved during global modeling, while preventing excessive attention to irrelevant local features. In the local feature extraction module, the traditional residual block is replaced with a 3D asymmetric residual block. This module decomposes the three-dimensional convolution into kernels along three directions and executes them sequentially in a predefined order, enabling the model to flexibly capture spectral and spatial dependencies, thereby achieving precise local feature extraction under cross-domain conditions. In the global feature extraction module, an improved self-attention mechanism is introduced, with the weights of attention-enhanced features and asymmetric residual features balanced via an adaptive scaling parameter, γ. During the early training stages, γ is small, and the model primarily relies on domain-invariant residual features to prevent overfitting in the source domain. As training progresses, γ gradually increases, adaptively introducing global attention, which progressively aligns target domain features and enhances the robustness and generalization of cross-domain representations. The architecture of the local–global feature extraction model is illustrated in Figure 3. Moreover, although the network performs identical operations at different depths and across various channels, the resulting feature maps exhibit distinct semantic meanings because of variations in the input. Consequently, the feature representations differ across stages. To intuitively illustrate these semantic differences, different colors, sizes, and shapes are used in the schematic diagram for visual annotation. As illustrated in the bottom left of Figure 3, the Asymmetric Residual Block decomposes a standard 3D convolution into three sequential kernels: 1 × 1 × 3 (vertical), 1 × 3 × 1 (horizontal), and 3 × 1 × 1 (spectral). This decomposition aligns with Equations (3)–(5). Furthermore, the right side of Figure 3 visualizes the feature space distribution. Different colors and shapes (e.g., blue circles for ‘Trees’, gray dots for ‘Meadows’) represent the semantic separation of classes achieved after the Global Feature Extraction, demonstrating the enhanced feature extraction capability of the proposed model.
The Local Feature Extraction Model (LFEM) is constructed, which draws on the concept of residual learning and designs residual blocks with asymmetric structures. By flexibly configuring convolutional kernels in different directions, the model effectively extracts local feature information and accurately captures the edges and contours of land objects. This structure preserves critical boundary information during subsequent global feature modeling, preventing the loss of important details and enhancing the model’s ability to perceive complex land structures.
First, the input data is processed through the initial 3D convolution (3DConv) layer, simultaneously capturing spatial and spectral features while fully exploiting the spatial–spectral correlation in hyperspectral images, thereby enhancing the accuracy and representational capability of feature extraction. The detailed operations are presented in Equation (2).
$y_1(t, i, j) = R\left( \sum_{p=0}^{P-1} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} x(t+p,\ i+m,\ j+n)\, w(p, m, n) + b \right)$ (2)
where $R(\cdot)$ denotes the ReLU activation function, $y_1(t, i, j)$ represents the output feature map of the first convolutional layer at position $(t, i, j)$, and $x(t+p,\ i+m,\ j+n)$ corresponds to the input value at position $(t+p,\ i+m,\ j+n)$. $w(p, m, n)$ represents the weight at position $(p, m, n)$ of the three-dimensional convolution kernel, where $P \times k \times k$ indicates the size of the convolution kernel and $b$ is the bias term. Here, $(t, i, j)$ denotes the specific coordinate position within the 3D feature map, where $t$ indexes the spectral dimension, while $i$ and $j$ index the spatial vertical (row) and horizontal (column) dimensions, respectively.
Subsequently, the detailed features are further extracted using asymmetric convolution to enhance the representation of HSI details, including edges, textures and contours. The asymmetric convolution processes feature maps using multi-directional convolution kernels to capture fine-grained information such as edges and contours within the image. In this study, asymmetric convolution is applied sequentially along the y, x, and z axes. The y axis in the spatial dimension corresponds to the vertical direction in the HSI. Convolution operations along the y axis focus on extracting features related to the vertical structure of land objects, which helps the model classify objects with vertically distributed characteristics, such as buildings and trees. The process is shown in Equation (3).
$y_y(t, i, j) = \sum_{n=0}^{k-1} y_z(t,\ i,\ j+n)\, W_y(n) + b_y$ (3)
where $y_y(t, i, j)$ represents the feature map obtained after convolution along the y axis, $y_z(t,\ i,\ j+n)$ represents the value of the input data at position $(t,\ i,\ j+n)$, $W_y(n)$ denotes the weight of the 3D convolution at position $n$, where the convolution kernel has a size of $1 \times 1 \times 3$, and $b_y$ is the offset term along the y axis.
The x axis in the spatial dimension corresponds to the horizontal direction in the HSI. Convolution operations along the x axis help extract the horizontal spatial characteristics of the HSI, especially for land objects with horizontal arrangement or expansion (such as roads, building profiles, etc.), as shown in Equation (4).
$y_x(t, i, j) = \sum_{m=0}^{k-1} y_x(t,\ i+m,\ j)\, W_x(m) + b_x$ (4)
where $y_x(t, i, j)$ represents the feature map obtained after convolution along the x axis, $y_x(t,\ i+m,\ j)$ denotes the value of the input data at position $(t,\ i+m,\ j)$, $W_x(m)$ represents the weight of the three-dimensional convolution at position $m$, where $1 \times 3 \times 1$ is the size of the convolution kernel, and $b_x$ is the offset term along the x axis.
In hyperspectral images, the z axis corresponds to the spectral dimension. By performing convolution operations along the z axis, the model can more accurately capture pixel variations across different spectral bands. The process is shown in Equation (5).
$y_z(t, i, j) = \sum_{p=0}^{P-1} y_2(t+p,\ i,\ j)\, W_z(p) + b_z$ (5)
where $y_z(t, i, j)$ represents the feature map obtained after convolution along the z axis, $y_2(t+p,\ i,\ j)$ represents the value of the input data at position $(t+p,\ i,\ j)$, $W_z(p)$ represents the weight of the 3D convolution at position $p$, where the convolution kernel has a size of $3 \times 1 \times 1$, and $b_z$ is the offset term along the z axis.
Additionally, after obtaining multiple feature maps, they are fused to enhance the quality of local information as shown in Equation (6). Using asymmetric convolution, the embedded features from both the source and target domains effectively capture details, particularly the edge and contour information in the image. These embedded features not only enhance the accuracy of image representation but also provide stronger support for the subsequent classification tasks.
$y_2(t, i, j) = R\big( y_z(t, i, j) + y_y(t, i, j) + y_x(t, i, j) \big)$ (6)
The feature map is then further processed using 3D convolution operations to uncover the complex relationships between spatial and spectral dimensions, generating a new feature map, as shown in Equation (7).
$y_3(t, i, j) = \sum_{p=0}^{P-1} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} y_2(t+p,\ i+m,\ j+n)\, w(p, m, n) + b$ (7)
Finally, we apply the concept of residual learning to fuse multiple feature maps and generate the new feature maps. This strategy helps preserve important local features while enhancing the model’s ability to learn complex features through residual connections, improving the expression capability and classification performance of the final output, as shown in Equation (8).
$y_{LFEM} = R(y_1 + y_3)$ (8)
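To illustrate how the asymmetric residual block could be assembled, the following PyTorch sketch implements Equations (2)–(8). The sequential wiring of the three directional branches reflects one plausible reading of Equations (3)–(5), whose exact input chaining is left somewhat ambiguous in the text, and the kernel tuples are written in PyTorch's (depth, height, width) order; the class name is hypothetical.

```python
import torch
import torch.nn as nn

class AsymmetricResidualBlock3D(nn.Module):
    """Sketch of the 3D asymmetric residual block (Eqs. (2)-(8)). Kernel tuples
    use PyTorch's (depth, height, width) ordering, corresponding to the paper's
    spectral, vertical, and horizontal directions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_in = nn.Conv3d(channels, channels, kernel_size=3, padding=1)                 # Eq. (2)
        self.conv_z = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # spectral
        self.conv_y = nn.Conv3d(channels, channels, kernel_size=(1, 3, 1), padding=(0, 1, 0))  # vertical
        self.conv_x = nn.Conv3d(channels, channels, kernel_size=(1, 1, 3), padding=(0, 0, 1))  # horizontal
        self.conv_out = nn.Conv3d(channels, channels, kernel_size=3, padding=1)                # Eq. (7)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (batch, C, spectral, H, W)
        y1 = self.relu(self.conv_in(x))        # initial 3D convolution, Eq. (2)
        y_z = self.conv_z(y1)                  # spectral branch, Eq. (5)
        y_y = self.conv_y(y_z)                 # vertical branch, Eq. (3)
        y_x = self.conv_x(y_y)                 # horizontal branch, Eq. (4)
        y2 = self.relu(y_z + y_y + y_x)        # directional fusion, Eq. (6)
        y3 = self.conv_out(y2)                 # Eq. (7)
        return self.relu(y1 + y3)              # residual fusion, Eq. (8)
```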
The global feature extraction module (GFEM) leverages the global modeling capability of the improved self-attention mechanism in Transformer. By capturing the long-distance dependencies, it focuses on identifying shared features between the source and target domains. Additionally, the model automatically highlights stable and clear areas in the feature map while suppressing interference from noisy and uncertain regions, enhancing the perception of information in heterogeneous structures, which effectively mitigates information misalignment due to differences in spatial structure and resolution, improving the model’s generalization ability across domains. After local feature extraction, 3D maximum pooling operation is applied to reduce feature dimensionality, preserving the most representative values, emphasizing key local information, and providing a more compact and discriminative input for subsequent global modeling, as shown in Equation (9).
$y_4(t, i, j) = \max_{\delta_z, \delta_h, \delta_w}\ y_{LFEM}(t+\delta_z,\ i+\delta_h,\ j+\delta_w)$ (9)
where $y_4(t, i, j)$ represents the 3D maximum pooling output, with the size of the pooling kernel being $k_z \times k_h \times k_w$, and $0 \le \delta_z < k_z$, $0 \le \delta_h < k_h$, $0 \le \delta_w < k_w$.
Next, a global feature extraction space is constructed. An improved self-attention mechanism is employed to adaptively focus on key features while suppressing irrelevant or redundant information. This approach effectively establishes long-range dependencies between pixels in the feature map, thereby enhancing the model’s capacity to capture global structural information. The structure of the improved self-attention mechanism is illustrated in Figure 4. Initially, three convolutional layers with kernel size $1 \times 1 \times 1$ are applied to the input $y_4(t, i, j)$, projecting it into the Query, Key, and Value representations. This process is illustrated in Equation (10).
$Q_1 = \mathrm{Conv3d}_{query}(x), \quad Q_1 \in \mathbb{R}^{l \times \frac{C}{8} \times D \times W \times H}$
$K_1 = \mathrm{Conv3d}_{key}(x), \quad K_1 \in \mathbb{R}^{l \times \frac{C}{8} \times D \times W \times H}$
$V_1 = \mathrm{Conv3d}_{value}(x), \quad V_1 \in \mathbb{R}^{l \times C \times D \times W \times H}$ (10)
where $l$ represents the batch size, $C$ is the number of input channels, and $D$, $W$, and $H$ denote the depth, width, and height, respectively, while $Q_1$, $K_1$, and $V_1$ represent the Query, Key, and Value representations obtained after the first calculation, respectively.
In addition, to simplify subsequent matrix operations, the spatial dimension D × W × H is flattened into a unified dimension N, where N = D × W × H , as shown in Equation (11).
$Q_2 = \mathrm{permute}(\mathrm{reshape}(Q_1)), \quad Q_2 \in \mathbb{R}^{l \times N \times G}$
$K_2 = \mathrm{reshape}(K_1), \quad K_2 \in \mathbb{R}^{l \times G \times N}$
$V_2 = \mathrm{reshape}(V_1), \quad V_2 \in \mathbb{R}^{l \times C \times N}$ (11)
where $Q_2$, $K_2$, and $V_2$ represent the features output after transformation, and $G = C / 8$.
Subsequently, the attention score $A$ is calculated by computing the similarity between $Q_2$ and $K_2$, thereby focusing on features with higher scores, as shown in Equation (12).
$S = Q_2 K_2, \quad S \in \mathbb{R}^{l \times N \times N}$
$A = \mathrm{softmax}(S), \quad A \in \mathbb{R}^{l \times N \times N}$ (12)
After calculating the score matrix, the data is reshaped to restore the original spatial dimensions. First, the attention matrix is transposed by exchanging its last two dimensions, and then batch matrix multiplication is performed with $V_2$. The specific process is shown in Equation (13).
$\tilde{Y} = \mathrm{reshape}(V_2 A^{T}), \quad \tilde{Y} \in \mathbb{R}^{l \times C \times D \times W \times H}$ (13)
In the output of the previous step, this paper draws inspiration from the scaling factor in the Linformer self-attention mechanism and introduces a learnable scalar γ to adaptively control the weighting between the self-attention module output and the original input. γ is a trainable parameter optimized through backpropagation during training. The model automatically adjusts the value of γ, thereby adaptively balancing the contributions of the improved self-attention output and the input, as shown in Equation (14).
$O = \gamma \tilde{Y} + x$ (14)
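The improved self-attention step of Equations (10)–(14) could be realized as in the minimal PyTorch sketch below; the class name ImprovedSelfAttention3D is an illustrative assumption, and the module simply combines the 1 × 1 × 1 projections, the flattened attention over N = D × W × H positions, and the learnable scalar γ described above.

```python
import torch
import torch.nn as nn

class ImprovedSelfAttention3D(nn.Module):
    """Sketch of the improved self-attention mechanism (Eqs. (10)-(14)):
    1x1x1 projections to Query/Key/Value, attention over the flattened
    spatial-spectral positions N = D*W*H, and a learnable scalar gamma
    blending the attention output with the residual input."""
    def __init__(self, channels: int):
        super().__init__()
        g = max(channels // 8, 1)                              # G = C / 8
        self.query = nn.Conv3d(channels, g, kernel_size=1)
        self.key = nn.Conv3d(channels, g, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))              # small at first, grows during training

    def forward(self, x):                                      # x: (l, C, D, W, H)
        l, C, D, W, H = x.shape
        N = D * W * H
        q = self.query(x).reshape(l, -1, N).permute(0, 2, 1)   # Q2: (l, N, G), Eq. (11)
        k = self.key(x).reshape(l, -1, N)                      # K2: (l, G, N)
        v = self.value(x).reshape(l, C, N)                     # V2: (l, C, N)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)          # A: (l, N, N), Eq. (12)
        y = torch.bmm(v, attn.transpose(1, 2)).reshape(l, C, D, W, H)  # Eq. (13)
        return self.gamma * y + x                              # O = gamma * Y~ + x, Eq. (14)
```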

3.3. Source and Target Few-Shot Learning

Due to sensor discrepancies and environmental variations, the spectral characteristics of the same object may shift across different scenarios, leading to performance degradation in cross-domain applications. To mitigate the spectral shift between UAV and satellite data, this paper performs few-shot learning tasks based on source and target domains within the local–global feature space, as illustrated in Figure 5. By constructing K-shot C-way tasks, the few-shot learning utilizes a limited number of samples across multiple training rounds, enabling the model to progressively learn feature representations with enhanced generalization across different categories and scenarios. Combined with a multi-task training strategy, the model extracts the deep features shared between the source and target domains, enabling rapid adaptation to new tasks under spectral shifts while maintaining high classification accuracy. Figure 5 illustrates the construction of meta-learning tasks for the source and target domains. Specifically, the source task is constructed using the Support Set $S_s$ and Query Set $Q_s$, while the target task is similarly composed of $S_t$ and $Q_t$. As depicted in the right section, the diagram focuses on the calculation of the Euclidean metric within each constructed task. Through meta-learning training, the model encourages samples of the same class to cluster closer together in the feature space. As illustrated in the right section of the figure, where distinct colors represent different categories, the samples are shown aggregating towards each other, thereby achieving greater intra-class compactness.
Assume that the source domain dataset in the local–global feature space is denoted as $D_s = \{x_i^s,\ y_i^s\}_{i=1}^{n_s}$, where $x_i^s$ represents the HSI data of the $i$-th sample and $y_i^s$ denotes its corresponding category label. The target domain dataset is denoted as $D_t = \{x_i^t,\ y_i^t\}_{i=1}^{n_t}$, where $x_i^t$ represents the HSI data of the $i$-th sample and $y_i^t$ denotes the corresponding category label in the target domain. $D_t$ contains a small amount of labeled data $D_f$ and a large amount of unlabeled data $D_u$, where $D_t = D_f \cup D_u$. Few-shot learning tasks are performed on $D_s$ and $D_t$, respectively. First, a few-shot learning task is performed on the source domain $D_s$. Specifically, $C_s$ classes are randomly selected from the source domain dataset $D_s$. For each class, $K_s$ labeled samples are selected to form the support set $S_S$. $N_s$ unlabeled samples are then selected from the same $C_s$ classes to form the query set $Q_S$. The corresponding formula is expressed as follows:
$S_S = \{x_i^s,\ y_i^s\}_{i=1}^{C_s \times K_s}, \quad Q_S = \{x_j^s,\ y_j^s\}_{j=1}^{C_s \times N_s}$ (15)
The few-shot learning task for the target domain $D_t$ replicates the same construction, generating the support set $S_T$ and query set $Q_T$ for the target domain as follows:
$S_T = \{x_i^t,\ y_i^t\}_{i=1}^{C_t \times K_t}, \quad Q_T = \{x_j^t,\ y_j^t\}_{j=1}^{C_t \times N_t}$ (16)
During the training phase, the Softmax method is applied to calculate the similarity between the query sample and the category center of the support set. The network parameters are then updated by optimizing the negative log-probability loss function. For the query sample x j , its class distribution probability is calculated as follows:
$P(y_j = k \mid x_j \in Q_s) = \dfrac{\exp\!\left(-d\!\left(f_\phi(x_j),\ c_k\right)\right)}{\sum_{k'=1}^{C} \exp\!\left(-d\!\left(f_\phi(x_j),\ c_{k'}\right)\right)}$ (17)
where $d(\cdot,\cdot)$ represents the Euclidean distance, $c_k$ is the embedding feature center (prototype) of the $k$-th class in the support set, $C$ is the number of categories in the task, $f_\phi$ represents the embedding feature extraction network with mapping parameters $\phi$, $x_j$ is a sample in the query set, and $y_j$ is the true category label of $x_j$.
The classification loss for the source domain episode task is calculated by summing the negative logarithmic probabilities of all query samples:
$L_{fsl}^s = \mathbb{E}_{(S_s,\ Q_s)}\left[ -\sum_{(x,\ y) \in Q_s} \log p_\phi(y = k \mid x) \right]$ (18)
where $\mathbb{E}_{(S_s,\ Q_s)}$ denotes the expectation over the support set and the query set, and $S_s$ and $Q_s$ represent the support set and query set of the source domain, respectively.
Similarly, FSL is applied to the target domain data, and the classification loss for the target domain episode task is calculated:
$L_{fsl}^t = \mathbb{E}_{(S_t,\ Q_t)}\left[ -\sum_{(x,\ y) \in Q_t} \log p_\phi(y = k \mid x) \right]$ (19)
where $S_t$ and $Q_t$ represent the small-sample support set and query set of the target domain, respectively, and $p_\phi$ is the probability distribution output by the network parameterized by $\phi$.
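A compact sketch of this episodic loss, assuming prototype-based classification in the spirit of Equations (17)–(19), is shown below; the function name prototypical_fsl_loss and the toy episode sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prototypical_fsl_loss(support_feats, support_labels, query_feats, query_labels):
    """Sketch of the episodic few-shot loss (Eqs. (17)-(19)): class prototypes are
    the mean support embeddings, query samples are scored by a softmax over
    negative Euclidean distances, and the negative log-probability is averaged."""
    classes = torch.unique(support_labels)
    # c_k: prototype (embedding feature center) of each class in the support set.
    prototypes = torch.stack([support_feats[support_labels == c].mean(dim=0) for c in classes])
    dists = torch.cdist(query_feats, prototypes) ** 2          # (num_query, C)
    log_p = F.log_softmax(-dists, dim=1)                       # Eq. (17)
    # Map original labels to prototype indices before computing the loss.
    target_idx = torch.stack([(classes == y).nonzero(as_tuple=True)[0][0] for y in query_labels])
    return F.nll_loss(log_p, target_idx)                       # Eqs. (18)/(19)

# Illustrative usage with a 3-way 5-shot episode and 64-dimensional embeddings.
s_feats, s_labels = torch.randn(15, 64), torch.arange(3).repeat_interleave(5)
q_feats, q_labels = torch.randn(9, 64), torch.arange(3).repeat_interleave(3)
loss = prototypical_fsl_loss(s_feats, s_labels, q_feats, q_labels)
```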

3.4. Conditional Domain Discriminator Model

To reduce the distributional difference between UAV and satellite data, this paper employs a conditional domain discriminator and optimizes using domain adversarial losses. The model distinguishes data sources (e.g., UAVs or satellites) using domain discriminators, while the local–global feature extractor learns domain-independent shared features via adversarial training, thereby reducing domain differences such as those related to sensors and environments. As training progresses, the local–global feature extractor and domain discriminator interact, gradually aligning feature distributions and improving classification performance in the target domain (e.g., satellite data), effectively alleviating the impact of spectral offset. The network architecture is illustrated in Figure 6. To address the dimension explosion issue inherent in standard multilinear maps, where the output dimension $d_f \times d_g$ becomes computationally prohibitive, the Randomized Multilinear Map described in Equation (20) is employed to project the joint features into a compact 1024-dimensional space. Subsequently, the Conditional Domain Discriminator is implemented as a five-layer Multilayer Perceptron. The blocks labeled ‘FRD’ in the figure represent the hidden layers, where each block consists of a Fully Connected (FC) layer, a ReLU activation function, and a Dropout layer to prevent overfitting. Specifically, the Dropout and ReLU are applied after each FC layer except for the final one. Finally, the last layer utilizes a Softmax function to predict the domain probability.
Suppose $h = (f,\ g)$ represents the joint variable of $f$ and $g$. The multilinear mapping $f \otimes g$ is selected to condition the domain discriminator on $g$. Compared to simple concatenation strategies, the multilinear mapping $f \otimes g$ can more effectively capture multimodal structures in complex data distributions. However, its main disadvantage is its tendency to cause dimensional explosion. Assuming $d_f$ and $d_g$ represent the dimensions of $f$ and $g$, respectively, the output dimension of the multilinear map is $d_f \times d_g$, which is often difficult to embed into deep models. To address the issue of dimensional explosion, the traditional multilinear mapping is replaced with a randomized multilinear mapping. The multilinear map $T_\otimes(f,\ g)$ can be approximated by $T_\odot(f,\ g)$ using an element-wise product, as shown in the following formula:
$T_\odot(f,\ g) = \frac{1}{\sqrt{d}} \left(R_f f\right) \odot \left(R_g g\right)$ (20)
where $\odot$ represents the element-wise (Hadamard) product, and $R_f \in \mathbb{R}^{d \times d_f}$ and $R_g \in \mathbb{R}^{d \times d_g}$ are two random matrices. These random matrices are sampled once and fixed during the training phase. They satisfy $d \ll d_f \times d_g$, where $f$ is the feature extracted by the local–global feature extractor, and $g$ is the category information predicted by the classifier. Additionally, both matrices $R_f$ and $R_g$ follow symmetric distributions with a mean of zero. Finally, the following conditional strategy is employed:
$T(h) = \begin{cases} T_\otimes(f,\ g), & d_f \times d_g \le 1024 \\ T_\odot(f,\ g), & \text{otherwise} \end{cases}$ (21)
where 1024 represents the largest number of units in the model. If the dimension of the multilinear map T exceeds 1024, a randomized multilinear map will be used. The domain adversarial loss function is then defined as follows:
$L_d = \min_D \max_T L = -\,\mathbb{E}_{x_i^s \sim P_s(x)} \log\!\left[D\!\left(T\!\left(h_i^s\right)\right)\right] - \mathbb{E}_{x_j^t \sim P_t(x)} \log\!\left[1 - D\!\left(T\!\left(h_j^t\right)\right)\right]$ (22)
where $D$ represents the discriminator, and $h_i^s$ and $h_j^t$ are the embedded features of the source domain and target domain samples, where $h = (f,\ g)$ combines the feature $f$ extracted by the local-to-global feature extractor with the category information $g$ predicted by the classifier, and $T$ is the (randomized) multilinear dimensional transformation. $P_s(x)$ and $P_t(x)$ denote the source domain feature distribution and the target domain feature distribution, respectively. Specifically, $x_i^s$ consists of support samples from the source domain, while $x_j^t$ comprises query samples from the target domain. Finally, the loss function in this paper consists of $L_{fsl}^s$, $L_{fsl}^t$, and $L_d$, as follows:
$$L_{\mathrm{total}} = L_{\mathrm{total}}^{s} + L_{\mathrm{total}}^{t} = L_{fsl}^{s} + L_{fsl}^{t} + 2\,L_d$$
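The following sketch shows how the adversarial term and the total objective could be assembled, assuming a discriminator D and conditioning map T as sketched earlier; the few-shot losses L_fsl_s and L_fsl_t are placeholders for the source- and target-domain few-shot classification losses and are not defined here, and the source/target label convention is an assumption.

```python
import torch

def domain_adversarial_loss(D, T, f_s, g_s, f_t, g_t, eps=1e-8):
    """Sketch of L_d: negative log-likelihood of the conditional domain
    discriminator, with source samples labeled 1 and target samples labeled 0."""
    p_s = D(T(f_s, g_s))[:, 1]    # P(domain = source) for source support samples
    p_t = D(T(f_t, g_t))[:, 1]    # P(domain = source) for target query samples
    return -(torch.log(p_s + eps).mean() + torch.log(1.0 - p_t + eps).mean())

# Total objective of the equation above: L_total = L_fsl_s + L_fsl_t + 2 * L_d.
# In adversarial training the discriminator descends on L_d while the feature
# extractor ascends on it (e.g., via a gradient reversal layer).
def total_loss(L_fsl_s, L_fsl_t, L_d):
    return L_fsl_s + L_fsl_t + 2.0 * L_d
```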

3.5. Improved Sharpness-Aware Minimization Strategy

In cross-domain hyperspectral image classification tasks, the significant spectral shift between the source and target domains often causes substantial changes in feature distribution, leading to large fluctuations in the model’s loss surface. To address this issue, this paper proposes an improved SAM algorithm that uses a single forward pass to compute an approximate perturbation. This avoids the strong perturbations introduced by the two forward passes of the traditional SAM algorithm and thereby improves the stability of the gradient perturbation. In addition, the original perturbation strategy is refined with a nonlinear gradient scaling mechanism, so that the magnitude of the gradient update grows more gradually with the gradient size. This adaptive adjustment of feature update strength yields more stable perturbation directions that are better aligned with the characteristics of cross-domain classification, effectively smoothing the loss function and optimizing the model’s performance in the worst-case scenario. Figure 7 illustrates the schematic operation of the improved Sharpness-Aware Minimization (ISAM) strategy and its expected impact on the optimization landscape. As depicted, the initial loss surfaces (representing components such as $L_s$ and $L_d$) typically exhibit pronounced peaks and steep valleys caused by cross-domain spectral shifts, whereas the total loss surface ($L_{\mathrm{total}}$) obtained after applying ISAM is significantly smoother and flatter. This visualization shows how ISAM mitigates the instability of gradient perturbations and guides the model towards flatter minima, thereby enhancing generalization in worst-case scenarios.
First, the gradient perturbation amplitude of each parameter $w$ is computed by taking the square root of the squared gradient (i.e., its magnitude) and adding a small positive constant $\epsilon$, where $\epsilon$ is set to $1 \times 10^{-8}$ for numerical stability, a value commonly used in the Adam optimizer, as shown in Equation (24).
$$e_w = \sqrt{\big(\nabla_w L_{\mathrm{total}}\big)^2} + \epsilon = \big|\nabla_w L_{\mathrm{total}}\big| + \epsilon$$
Here, $e_w$ measures the magnitude of each parameter’s gradient and is stored in the optimizer state, $w$ denotes the model parameter, and $\nabla_w L_{\mathrm{total}}$ is the gradient of the loss function $L_{\mathrm{total}}$ with respect to $w$.
Next, the Adam optimizer’s update rule is applied to the parameters, yielding the intermediate result $\hat{w}$. This step is equivalent to a standard gradient descent update, as outlined in the following process.
$$\hat{w} = \mathrm{BaseOptimizerStep}\big(w, \nabla_w L_{\mathrm{total}}\big)$$
Subsequently, each parameter is corrected in a gradient-free (no-grad) context. Using the stored perturbation amplitude $e_w$, $\epsilon$ is added to obtain the update factor $\tilde{e}_w$, and the parameters are adjusted accordingly.
$$\tilde{e}_w = e_w + \epsilon = \big|\nabla_w L_{\mathrm{total}}\big| + 2\epsilon$$
Finally, based on the corrected perturbation factor, an update is applied to the model parameters, as shown in the following equation. This update not only accounts for the relative relationship between the original parameters and the perturbation direction but also refines the parameters through proportional weighting, ensuring model stability while guiding optimization in a more robust direction and thereby enhancing the model’s generalization ability in cross-domain hyperspectral image classification tasks.
$$w_{\mathrm{new}} = (1 - \rho)\,\hat{w} + \rho\,\frac{e_w}{\tilde{e}_w}$$
where $\rho$ is a hyperparameter controlling the weight-adjustment amplitude during the perturbation correction process, set to $1.1 \times 10^{-8}$; $(1 - \rho)\,\hat{w}$ represents the contribution of the base update, and $\rho\, e_w / \tilde{e}_w$ denotes the perturbation correction applied to the parameters. This ensures an adaptive balance of update strength across different features during gradient updates, preventing the loss of key information caused by gradient normalization.
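To make the four steps above concrete, the sketch below applies them to a PyTorch model after a single backward pass. It is a minimal illustration of the update arithmetic under the stated values of $\rho$ and $\epsilon$, not the authors’ implementation.

```python
import torch

@torch.no_grad()
def isam_step(params, base_optimizer, rho=1.1e-8, eps=1e-8):
    """Minimal sketch of the ISAM update, assuming loss.backward() has
    already populated the gradients of L_total."""
    params = list(params)

    # Step 1: perturbation amplitude e_w = |grad L_total| + eps per parameter
    e_ws = [p.grad.abs() + eps if p.grad is not None else None for p in params]

    # Step 2: base optimizer update, w_hat = BaseOptimizerStep(w, grad)
    base_optimizer.step()

    # Steps 3-4: corrected update w_new = (1 - rho) * w_hat + rho * e_w / e_tilde_w
    for p, e_w in zip(params, e_ws):
        if e_w is None:
            continue
        e_tilde = e_w + eps
        p.mul_(1.0 - rho).add_(rho * (e_w / e_tilde))

# Usage sketch: adam = torch.optim.Adam(model.parameters(), lr=1e-3)
#               loss.backward(); isam_step(model.parameters(), adam); adam.zero_grad()
```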

4. Experimental Validation and Analysis

4.1. Dataset Descriptions

To evaluate the performance and efficiency of the proposed model, experiments were conducted on four target-domain datasets using overall accuracy (OA), average accuracy (AA), the Kappa coefficient, and the F1 score. Specifically, OA is the proportion of correctly classified samples among all test samples; AA is the mean of the per-class accuracies and reflects classification differences between categories; the Kappa coefficient (K × 100) evaluates the consistency between the classification results and the ground truth; and the F1 score assesses the model’s generalization ability and robustness under class imbalance. Higher values of these indicators indicate better classification performance. The model size is used to evaluate spatial complexity, where a larger size indicates higher complexity, and the time metric denotes the training duration, with a smaller value corresponding to faster training. WHU-Hi-HanChuan serves as the source-domain (UAV) dataset; Pavia University, Indian Pines, and Salinas are used as airborne target-domain datasets; and the HZ dataset is used as the satellite target-domain dataset.
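For completeness, the sketch below shows one way to compute OA, AA, and the Kappa coefficient from a confusion matrix; it is a generic illustration of the definitions above rather than the evaluation code used in the experiments.

```python
import numpy as np

def classification_metrics(conf):
    """Compute OA, AA and Kappa (all in %) from a confusion matrix whose
    rows are true classes and columns are predicted classes."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                                   # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)              # per-class accuracy
    aa = per_class.mean()                                     # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n**2   # expected chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return 100 * oa, 100 * aa, 100 * kappa
```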
(1) Source domain dataset: The WHU-Hi-HanChuan (WHHC) dataset was collected by the Leica Aibot X6 UAV platform in Hanchuan City, Hubei Province, China, in 2016. This dataset consists of 274 spectral bands, with a wavelength range of 400 nm to 1000 nm, a spatial resolution of 0.109 m, and an image size of 1217 × 303 pixels. The details of the WHU-Hi-HanChuan dataset are shown in Table 1, and the false-color image and ground truth are shown in Figure 8.
(2) Target domain dataset: The Pavia University (PU) dataset was collected using ROSIS sensors, consisting of 610 × 340 pixels, with a spatial resolution of 1.3 m. This dataset contains 103 spectral bands, with a wavelength range of 430 nm to 860 nm. The dataset includes 9 categories. The details of the PU dataset are shown in Table 2, and the false-color image and ground-truth are shown in Figure 9.
The Salinas (SA) dataset has an image size of 512 × 217 pixels, a spatial resolution of 3.7 m, and covers 204 spectral bands with a wavelength range from 400 nm to 2500 nm. The dataset was collected using an airborne visible/infrared imaging spectrometer (AVIRIS) sensor over the SA Valley in California, USA. The details of the SA dataset are shown in Table 3, and the false-color image and ground truth are shown in Figure 10.
The Hangzhou (HZ) dataset was acquired by the NASA EO-1 Hyperion sensor. There are 198 spectral bands after uncalibrated and noisy bands are removed. The spatial size of the HZ dataset is 590 × 230 with 30 m spatial resolution. The details of the HZ dataset are shown in Table 4, and the false-color image and ground truth are shown in Figure 11.
The Indian Pines (IP) dataset was collected by AVIRIS in 1992 in Northwest Indiana, USA. This dataset contains 200 spectral bands, with a wavelength range of 400 nm to 2500 nm, a spatial resolution of 20 m, and an image size of 145 × 145 pixels. The dataset includes 16 vegetation categories. The details of the IP dataset are shown in Table 5, and the false-color image and ground truth are shown in Figure 12.

4.2. Experimental Setting

The method proposed in this paper uses a warmup cosine learning rate scheduler for optimization. In the early stage of training, the learning rate gradually increases from 1 × 10−5 to 1 × 10−3, preventing instability caused by an excessively high learning rate; afterwards, the learning rate decreases gradually following a cosine curve, which allows the model to converge more stably in the later stages of training. The model is optimized using the Adam optimizer, with the number of iterations set to 10,000. To reduce the randomness caused by the training samples, the final result of each trial was obtained by averaging 10 repetitions. To ensure a rigorous and fair comparison, all experiments were conducted on a unified hardware platform equipped with a single NVIDIA RTX 2080Ti GPU (Shanghai finehoo Technology Co., Ltd., Shanghai, China) and implemented in PyTorch 1.7.1. Specifically, we standardized the input data volume by evaluating all methods under a consistent 5-shot setting (i.e., 5 labeled samples per class) to guarantee identical supervision information across the different algorithms. Crucially, regarding the training details of the comparative state-of-the-art (SOTA) methods, we strictly adhered to the hyperparameter configurations (e.g., learning rate, weight decay, and training epochs) recommended in their original papers and implementations. This protocol ensures that each baseline is evaluated at its optimal performance level, eliminating potential bias arising from improper parameter tuning. The proposed SAMLFE method is compared with several representative algorithms, including traditional machine learning methods (XGBoost [50] and SVM [51]), deep learning methods (3D-CNN [34] and SSRN [9]), popular cross-domain classification methods (DCFSL [20] and Gia-CFSL [22]), and state-of-the-art methods (GSCViT [52] and DSFormer [53]). A systematic comparison with these methods provides a comprehensive evaluation of SAMLFE’s performance advantages and generalization capabilities across different modeling paradigms and cross-domain scenarios.
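As a reference for the schedule described above, the sketch below implements a warmup cosine learning-rate function; the number of warmup steps is an assumption, since the text only specifies the 1 × 10−5 to 1 × 10−3 range and the 10,000-iteration budget.

```python
import math

def warmup_cosine_lr(step, total_steps=10000, warmup_steps=500,
                     lr_min=1e-5, lr_max=1e-3):
    """Warmup cosine schedule: linear warmup from lr_min to lr_max,
    then cosine decay back down to lr_min."""
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Usage sketch: set the learning rate manually at each iteration, e.g.
#   for g in optimizer.param_groups: g["lr"] = warmup_cosine_lr(step)
```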

4.3. Classification Maps and Categorized Results

Table 6, Table 7, Table 8 and Table 9 show the classification results of the different methods on the Pavia University (PU), Indian Pines (IP), Salinas (SA), and Hangzhou (HZ) datasets, with the best accuracy highlighted in bold. The proposed SAMLFE method demonstrates significant advantages over the other methods, achieving the highest overall accuracy (OA) on the PU, IP, SA, and HZ datasets, with 84.27%, 65.45%, 90.61%, and 80.73%, respectively. In particular, on the PU dataset, the OA of SAMLFE is higher than that of DSFormer, GSCViT, Gia-CFSL, DCFSL, SSRN, 3D-CNN, SVM, and XGBoost by 2.07%, 2.92%, 1.63%, 2.78%, 5.19%, 19.17%, 21.54%, and 33.76%, respectively. These results clearly demonstrate that SAMLFE can effectively handle cross-domain HSI classification tasks and maintain high classification accuracy even when significant domain differences exist. In contrast, the other methods perform poorly on cross-domain data, particularly when there are large structural and resolution differences between the source and target domains.
Moreover, the time consumption results reveal clear hierarchical differences in computational complexity among the methods. Traditional machine learning methods, such as SVM and XGBoost, exhibit very low computational overhead due to their simple architectures and the absence of deep feature extraction. As the model depth increases, deep learning methods such as 3D-CNN and SSRN require more convolutional operations for extracting spatial–spectral features, thereby increasing computational time. Meanwhile, although GSCViT and DSFormer possess strong feature extraction capabilities, their architectures are optimized for computational efficiency, reducing convolutional operations and resulting in lower training costs compared to 3D-CNN and SSRN. In contrast, DCFSL and Gia-CFSL incur higher computational overhead due to complex processes, including small-sample learning strategies and feature alignment. SAMLFE falls into the category of models with high computational load, primarily due to the small-sample learning strategy and the ISAM module. While these strategies enhance cross-domain feature extraction and discriminative capabilities, they inevitably increase computational costs. Nevertheless, based on cross-domain HSI classification results, SAMLFE significantly outperforms existing methods across multiple metrics and maintains superior classification performance even when the source and target domains differ substantially.
The model size indicates that SAMLFE is only slightly larger than the 3D-CNN model, suggesting that its space complexity remains low. This implies that the method maintains a small number of parameters and low storage requirements while sustaining high model performance. Although training time is relatively long, the model’s compact architecture and limited parameter count provide strong deployment advantages, enabling adaptation to multiple scenarios and platforms, thereby demonstrating the method’s practicality and scalability.
As shown in Table 6, SAMLFE achieves the best classification results for three land-cover types: Asphalt, Trees, and Bare soil, which demonstrates the effectiveness of the multi-directional feature extraction enabled by the asymmetric residual blocks. Specifically, asphalt exhibits flat and regular surface structures, so strengthening horizontal feature extraction in hyperspectral images helps capture its texture and geometric characteristics more accurately, improving classification accuracy. Trees, in contrast, have pronounced vertical structures; strengthening vertical feature modeling allows their spatial characteristics to be captured more accurately, further improving classification performance. Figure 13 presents the classification results for the PU dataset, which indicate that SAMLFE closely matches the ground truth in spatial distribution and boundary delineation, reflecting high accuracy and strong generalization.
Table 7 shows the classification results for the IP dataset. Among these categories, category 15, “Buildings-Grass-Trees-Drives,” shows a significant improvement in classification accuracy. This category contains rich texture features and spatial structures in multiple directions; the local–global feature enhancement model effectively combines local details with global features, capturing long-range spatial dependencies and further improving classification accuracy. The average accuracy (AA) of SAMLFE is, however, lower than that of DSFormer, because DSFormer achieves higher accuracy on several small-sample categories (classes 4, 9, 13, and 16), which raises its AA. To account for the class imbalance problem, the F1 score is introduced to jointly assess the precision and recall of each category; it is less affected by variations in sample size and thus provides a fairer evaluation across all categories. In terms of the F1 score, which reflects the sample sizes of all categories, the proposed method still achieves the best overall performance.
Figure 14 shows the classification maps for the IP dataset. Compared with XGBoost, SVM, 3D-CNN, SSRN, DCFSL, Gia-CFSL, GSCViT, and DSFormer, SAMLFE exhibits the smallest misclassified area and most closely matches the ground truth. This suggests that SAMLFE captures subtle differences between land-cover categories more accurately, particularly in complex or mixed areas, and further confirms the effectiveness of the proposed method in improving classification accuracy and reducing errors.
Table 8 shows the classification results for the SA dataset, with the corresponding classification maps shown in Figure 15. The OA, AA, Kappa, and F1 achieved by SAMLFE reach 90.61%, 94.53%, 89.56%, and 93.97%, respectively, all of which outperform the comparison methods. Specifically, compared with the SOTA methods DSFormer and GSCViT, the OA of SAMLFE is higher by 1.07% and 1.71%, respectively. Furthermore, as shown in Figure 15, SAMLFE distinguishes boundary regions between classes more accurately, effectively reducing the classification confusion commonly observed in traditional methods when handling fuzzy boundaries. This advantage stems mainly from two strategies: the local-to-global feature extraction approach and the improved sharpness-aware minimization technique. The former employs asymmetric residual blocks to extract edge details and integrates an improved self-attention mechanism to achieve global feature alignment; the latter suppresses the impact of spectral shifts on generalization by introducing a nonlinear gradient perturbation mechanism. These results demonstrate that the method maintains excellent cross-domain generalization, even under significant differences between the source and target domains.
Figure 15 shows the pseudo-color classification results of various methods on the SA dataset. As shown in the figure, category 8, “Grapes_untrained,” exhibits a high rate of misclassification across all methods, indicating the difficulty of classifying this type of land cover. Nevertheless, the proposed method achieves the best classification performance for this category. This type of land cover is typically interspersed with surrounding soil, weeds, and shrubs, leading to significant spectral overlap with other categories, along with serious boundary blurring and the presence of mixed pixels. The local–global feature enhancement model proposed in this paper captures long-range dependencies between cells while finely extracting local detailed features, thereby effectively mitigating the uncertainty caused by mixed categories and enhancing the discrimination of complex land covers.
As shown in Table 9, on HZ dataset, the proposed method achieves a significant improvement in classification accuracy for the Land/Building class, reaching 84.40%. Furthermore, compared with other methods, the OA shows improvements of 4.14%, 5.17%, 2.64%, 2.87%, 4.81%, 7.75%, 13.03%, and 16.66%, respectively.
Figure 16 shows that the classification results of the proposed method most closely match the distribution of real ground objects, with the fewest misclassified areas. This indicates that the Local-to-Global Feature Extraction Model extracts more discriminative features, enabling accurate differentiation of various types of ground objects. Moreover, the introduction of the ISAM further enhances the model’s generalization ability, ensuring stable performance in complex scenarios.

4.4. Ablation Study of SAMLFE Model

In this subsection, the contribution of each module to classification performance is evaluated through ablation experiments. The main contributions of the proposed SAMLFE lie in the ISAM, improved self-attention, and asymmetric residual block components. To better analyze their roles, six configurations are investigated, as shown in Table 10: (1) performing DCFSL on the source and target domains only; (2) adding ISAM on the basis of (1); (3) adding ISAM and the asymmetric residual block on the basis of (1); (4) adding improved self-attention and the asymmetric residual block on the basis of (1); (5) adding ISAM and improved self-attention on the basis of (1); and (6) adding ISAM, improved self-attention, and the asymmetric residual block on the basis of (1).
The model without any additional modules (ID1) serves as the baseline reference. Introducing only ISAM (ID2) significantly improves model performance, indicating that ISAM effectively reduces spectral shifts between the source and target domains, thereby enhancing classification accuracy. Building on this, adding the asymmetric residual block (ID3) further improves overall accuracy, suggesting that it complements ISAM in feature extraction: ISAM primarily addresses inter-domain shifts, while the asymmetric residual block focuses on capturing local detailed features, so their combination enables both local detail extraction and enhanced domain generalization. Combining improved self-attention and the asymmetric residual block (ID4) also improves performance: improved self-attention captures global contextual information and facilitates the extraction of more discriminative features, while the asymmetric residual block enhances local features, and integrating local and global information yields more refined representations; these two components play complementary roles in feature representation. When ISAM and improved self-attention are applied together (ID5), the improvement is limited, indicating that without the asymmetric residual block, relying solely on global features is insufficient to further enhance performance. Finally, introducing all three modules simultaneously (ID6) achieves the best overall results, demonstrating that ISAM, improved self-attention, and the asymmetric residual block produce complementary and synergistic effects in cross-domain generalization, global context capture, and local feature enhancement. Although ID2 obtains the best results on the SA dataset, this is acceptable given the consistent improvements observed on the other datasets.
When only the ISAM module is retained, the model still achieves significant performance improvements on the target domain, indicating that ISAM plays a critical role in mitigating spectral offsets between the source and target domains. With the gradual introduction of the improved Self-Attention mechanism and asymmetric residual blocks, model performance steadily improves. Compared with the baseline model DCFSL, the complete SAMLFE structure achieves better classification results in the target domain, fully demonstrating the rationality and effectiveness of the proposed model design.
As shown in Figure 17, the ISAM module yields the most significant performance improvement, indicating that it plays a critical role in enhancing the adaptability of cross-domain HSI classification tasks. Although the improvements from other modules are less pronounced than those of ISAM, they still provide notable performance enhancements compared to the baseline model. This demonstrates that these modules are also effective in cross-domain classification tasks, particularly in improving target domain classification accuracy, reducing distribution differences between the source and target domains, and enhancing the model’s generalization ability and robustness.

5. Discussion

5.1. Effectiveness of the Number of Hyperparameters on the Model

To evaluate the parameter sensitivity of SAMLFE on the target-domain datasets, a parameter sensitivity analysis was conducted. The hyperparameter controlling the perturbation correction process in ISAM is denoted as $\lambda_1$, and the minimum learning rate in the warmup cosine annealing strategy as $\lambda_2$. The candidate values were selected from $\lambda_1 \in \{1 \times 10^{-9},\ 1 \times 10^{-8},\ 1.1 \times 10^{-8},\ 1.2 \times 10^{-8}\}$ and $\lambda_2 \in \{1 \times 10^{-7},\ 1 \times 10^{-6},\ 1 \times 10^{-5},\ 1.1 \times 10^{-5}\}$ for the combination experiments. Figure 18 illustrates the trends of OA, AA, and Kappa as the combinations of $\lambda_1$ and $\lambda_2$ change on the PU dataset. In Figure 18, the two horizontal axes represent the parameters $\lambda_1$ and $\lambda_2$, while the vertical axis corresponds to the model’s OA, AA, and Kappa coefficient; surfaces of different colors illustrate how these metrics vary across parameter combinations, with peaks indicating higher values and troughs indicating lower performance. To reduce the computational burden of hyperparameter optimization on the same dataset, the parameter combination achieving better overall performance was selected as the experimental setting: for the PU, IP, SA, and HZ datasets, the final parameters were $\lambda_1 = 1.1 \times 10^{-8}$ and $\lambda_2 = 1 \times 10^{-5}$.

5.2. Analysis of the Impact of Sample Size on SAMLFE Classification Accuracy

To assess the impact of varying sample sizes on SAMLFE performance, this study sets the number of samples per class to 1, 2, 3, 4, and 5, respectively. The 1–5 sample setting covers various small-sample learning scenarios and enables systematic evaluation of the model’s cross-domain learning ability under different data availability conditions. As the sample size increases (e.g., 3–5), target domain samples offer richer intra- and inter-class feature information, enabling the model to better adjust feature distributions and achieve effective cross-domain learning. Therefore, the 1–5 sample setting not only preserves the characteristics of small-sample learning but also validates the model’s generalization ability across domains. Most relevant studies (e.g., Li [20] and Zhang [22]) adopt similar settings to ensure fair comparisons among algorithms and reproducibility of experimental results. Furthermore, as shown in Table 11 and Figure 19, when the sample size is 5, OA, AA, Kappa, and F1 reach their highest values. Therefore, this study ultimately selects 5 samples per class for the experiments.

5.3. Analyzing the Impact of Batch Size on the SAMLFE Framework

In cross-domain hyperspectral image classification, batch size determines the number of samples used by the model for gradient estimation in a single iteration, thereby influencing its generalization performance and feature learning capability. Smaller batch sizes (e.g., 32 or 64) involve fewer samples in gradient computation, causing model convergence to fluctuate and making stable convergence difficult. Moreover, small batches cannot fully capture the diverse spectral–spatial features of the source and target domains, potentially weakening the model’s cross-domain feature learning capability. Conversely, a larger batch size (e.g., 150) introduces more samples per iteration, facilitating smoother gradient estimates, stabilizing model convergence, and promoting the learning of more robust cross-domain features. However, an excessively large batch size exposes the model to many samples from different domains simultaneously, averaging gradient directions, reducing the diversity of gradient updates, and masking subtle inter-domain feature distribution differences. This over-smoothed gradient update diminishes the model’s sensitivity to cross-domain feature variations, ultimately reducing its cross-domain learning ability in the target domain. Based on the above analysis, sensitivity experiments were conducted under four batch size settings: 32, 64, 128, and 150. As shown in Table 12 and Figure 20, increasing batch size gradually improves multiple metrics, including OA, AA, Kappa coefficient, and F1 score. When the batch size is 128, the model achieves optimal performance; further increasing it to 150 slightly reduces performance. Therefore, 128 is selected as the optimal batch size for subsequent experiments.

5.4. Analyzing the Impact of Parameter Gamma on the Improved Self-Attention

To verify the effectiveness of parameter gamma in the Improved Self-attention mechanism, an ablation study was conducted, with results shown in Table 13. It can be observed that the introduction of parameter gamma enables the model to adaptively adjust the weighting ratio between the input and the output of the self-attention mechanism during training. This capability enhances the model’s adaptability to cross-domain classification tasks, leading to significant improvements in OA, AA, and Kappa coefficients.

5.5. Feature Visualization of the Target Domain

This section employs t-SNE to visualize the two-dimensional feature projections of the original HSI data, Gia-CFSL, and SAMLFE across the PU, IP, and SA datasets. The visualization results are presented in Figure 21, Figure 22 and Figure 23. In t-SNE visualizations, each color represents a feature category, illustrating the model’s ability to distinguish different categories in a low-dimensional feature space. The clustering patterns shown by t-SNE directly reflect the model’s hyperspectral feature extraction ability, where tight intra-class clustering indicates that the model learns stable and discriminative features. Clear inter-class separation demonstrates that the model effectively captures subtle spectral–spatial differences between similar categories, thereby distinguishing these differences in the low-dimensional feature space. Compared to the original HSI data and the Gia-CFSL method, the proposed SAMLFE demonstrates reduced inter-class feature confusion and clearer classification boundaries across all three datasets, highlighting its ability to learn more discriminative feature representations. Specifically, Figure 21a, Figure 22a and Figure 23a illustrate that in the original HSI data, various land object categories exhibit wide coverage and significant overlap, complicating effective classification. For instance, in the IP dataset, the cross-domain method Gia-CFSL also exhibits notable category confusion in Figure 22b, particularly between “Corn-notill” and “Corn-mintill,” due to their similar surface coverage types. In contrast, Figure 22c shows that SAMLFE significantly reduces inter-class overlap, presents clearer feature distributions, and achieves better separability. Similarly, in the SA dataset, Figure 23b,c reveal that SAMLFE achieves more compact clustering for easily confused categories such as “Grapes_untrained” and “Vineyard_untrained” in Class 8, with minimal confusion, further validating its capability to distinguish complex classes. Overall, the proposed SAMLFE method significantly enhances category separability in the feature space, effectively improving the model’s discriminative ability and classification performance.
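For reference, the following sketch reproduces the style of the t-SNE visualizations in Figures 21–23 using scikit-learn and matplotlib; the perplexity and initialization settings are assumptions, and the input features would be the embeddings produced by the model under comparison (or the raw HSI spectra).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project high-dimensional features to 2-D with t-SNE and color by class."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(features)
    plt.figure(figsize=(5, 5))
    for c in np.unique(labels):
        idx = labels == c
        plt.scatter(emb[idx, 0], emb[idx, 1], s=2, label=str(c))
    plt.title(title)
    plt.legend(markerscale=4, fontsize=6, loc="best")
    plt.tight_layout()
    plt.show()
```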

6. Conclusions

This paper proposes a cross-domain hyperspectral image classification method that integrates sharpness-aware minimization with local-to-global feature enhancement, establishing a novel paradigm for large-scene satellite image classification supported by UAV hyperspectral data. The local-to-global feature extraction model simultaneously captures fine-grained local details and long-range dependencies, enabling effective extraction of shared semantic features across domains. When combined with the improved sharpness-aware minimization strategy, the model achieves enhanced cross-domain generalization and more precise feature alignment. Experiments on the PU, IP, SA, and HZ datasets demonstrate that the proposed method outperforms mainstream approaches in both classification accuracy and cross-domain adaptability. Notably, the method maintains strong robustness and generalization capability even when the source and target domains exhibit significant discrepancies. Compared with the SOTA method DSFormer, the overall accuracy (OA) on the four datasets improves by 2.07%, 1.00%, 1.07%, and 4.14%, respectively. These results confirm the method’s effectiveness in extracting semantic features across domains, enhancing feature alignment, and improving classification performance. Future work will focus on exploring more efficient feature alignment strategies and extending the method’s applicability to broader and more diverse remote sensing scenarios.

Author Contributions

Conceptualization, C.L., A.W. and H.W.; methodology, software, validation, C.L.; writing—review and editing, A.W., H.W. and C.L.; visualization, C.L., A.W., M.W. and S.Y.; supervision, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Key Research and Development Plan Project of Heilongjiang (JD2023SJ19), the National Key Support Project for Foreign Experts of Northeast Special Project (D20250098) and the Program for Young Talents of Basic Research in Universities of Heilongjiang Province (YQJH2024077) and the Postdoctoral Fellowship Program of China Postdoctoral Science Foundation (GZC20252304).

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Q.; Huang, J.; Wang, S.; Zhang, Z.; Shen, T.; Gu, Y. Community Structure Guided Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4404115. [Google Scholar] [CrossRef]
  2. Tu, B.; Ren, Q.; Li, Q.; He, W.; He, W. Hyperspectral Image Classification Using a Superpixel–Pixel–Subpixel Multilevel Network. IEEE Trans. Geosci. Remote Sens. 2023, 72, 5013616. [Google Scholar] [CrossRef]
  3. Weber, C.; Aguejdad, R.; Briottet, X.; Avala, J.; Fabre, S.; Demuynck, J.; Zenou, E.; Deville, Y.; Karoui, M.S.; Benhalouche, F.Z.; et al. Hyperspectral Imagery for Environmental Urban Planning. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1628–1631. [Google Scholar]
  4. Yang, X.; Yu, Y. Estimating Soil Salinity Under Various Moisture Conditions: An Experimental Study. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2525–2533. [Google Scholar] [CrossRef]
  5. Liang, L.; Di, L.; Zhang, L.; Deng, M.; Qin, Z.; Zhao, S.; Lin, H. Estimation of crop LAI using hyperspectral vegetation indices and a hybrid inversion method. Remote Sens. Environ. 2015, 165, 123–134. [Google Scholar] [CrossRef]
  6. Hao, Q.; Pei, Y.; Zhou, R.; Sun, B.; Sun, J.; Li, S.; Kang, X. Fusing Multiple Deep Models for in Vivo Human Brain Hyperspectral Image Classification to Identify Glioblastoma Tumor. IEEE Trans. Instrum. Meas. 2021, 70, 4007314. [Google Scholar] [CrossRef]
  7. Kanthi, M.; Sarma, T.H.; Bindu, C.S. A 3D-Deep CNN Based Feature Extraction and Hyperspectral Image Classification. In Proceedings of the 2020 IEEE India Geoscience and Remote Sensing Symposium, Ahmedabad, India, 1–4 December 2020; pp. 229–232. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  9. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–Spatial Residual Network for Hyperspectral Image Classification: A 3-D Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858. [Google Scholar] [CrossRef]
  10. Wei, W.; Tong, L.; Guo, B.; Zhou, J.; Xiao, C. Few-Shot Hyperspectral Image Classification Using Relational Generative Adversarial Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5539016. [Google Scholar] [CrossRef]
  11. Yu, C.; Gong, B.; Song, M.; Zhao, E.; Chang, C.-I. Multiview Calibrated Prototype Learning for Few-Shot Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5544713. [Google Scholar] [CrossRef]
  12. Tang, H.; Zhang, C.; Tang, D.; Lin, X.; Yang, X.; Xie, W. Few-Shot Hyperspectral Image Classification with Deep Fuzzy Metric Learning. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5502205. [Google Scholar] [CrossRef]
  13. Liu, S.; Fu, C.; Duan, Y.; Wang, X.; Luo, F. Spatial–Spectral Enhancement and Fusion Network for Hyperspectral Image Classification with Few Labeled Samples. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5502414. [Google Scholar] [CrossRef]
  14. Mu, C.; Liu, Y.; Yan, X.; Ali, A.; Liu, Y. Few-Shot Open-Set Hyperspectral Image Classification with Adaptive Threshold Using Self-Supervised Multitask Learning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5526618. [Google Scholar] [CrossRef]
  15. Zhao, C.; Qin, B.; Feng, S.; Zhu, W.; Zhang, L.; Ren, J. An Unsupervised Domain Adaptation Method Towards Multi-Level Features and Decision Boundaries for Cross-Scene Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5546216. [Google Scholar] [CrossRef]
  16. Matasci, G.; Volpi, M.; Kanevski, M.; Tuia, D. Semisupervised Transfer Component Analysis for Domain Adaptation in Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3550–3564. [Google Scholar] [CrossRef]
  17. Zhou, X.; Prasad, S. Deep Feature Alignment Neural Networks for Domain Adaptation of Hyperspectral Data. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5863–5872. [Google Scholar] [CrossRef]
  18. Deng, B.; Jia, S.; Shi, D. Deep Metric Learning-Based Feature Embedding for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 1422–1435. [Google Scholar] [CrossRef]
  19. Wang, Y.; Liu, G.; Yang, L.; Liu, J.; Wei, L. An Attention-Based Feature Processing Method for Cross-Domain Hyperspectral Image Classification. IEEE Signal Process. Lett. 2025, 32, 196–200. [Google Scholar] [CrossRef]
  20. Li, Z.; Liu, M.; Chen, Y.; Xu, Y.; Li, W.; Du, Q. Deep Cross-Domain Few-Shot Learning for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5501618. [Google Scholar] [CrossRef]
  21. Xi, B.; Li, J.; Li, Y.; Song, R.; Hong, D.; Chanussot, J. Few-Shot Learning with Class-Covariance Metric for Hyperspectral Image Classification. IEEE Trans. Image Process. 2022, 31, 5079–5092. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Li, W.; Zhang, M.; Wang, S.; Tao, R.; Du, Q. Graph Information Aggregation Cross-Domain Few-Shot Learning for Hyperspectral Image Classification. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 1912–1925. [Google Scholar] [CrossRef] [PubMed]
  23. Zhou, L.; Ma, L. Extreme Learning Machine-Based Heterogeneous Domain Adaptation for Classification of Hyperspectral Images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1781–1785. [Google Scholar] [CrossRef]
  24. Dang, Y.; Li, H.; Liu, B.; Zhang, X. Cross-Domain Few-Shot Learning for Hyperspectral Image Classification Based on Global-to-Local Enhanced Channel Attention. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5501905. [Google Scholar] [CrossRef]
  25. Feng, S.; Zhang, H.; Xi, B.; Zhao, C.; Li, Y.; Chanussot, J. Cross-Domain Few-Shot Learning Based on Decoupled Knowledge Distillation for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5534414. [Google Scholar] [CrossRef]
  26. Jiang, Z.; Li, Z.; Wang, Y.; Li, W.; Wang, K.; Tian, J.; Wang, C.; Du, Q. Lifelong Learning with Adaptive Knowledge Fusion and Class Margin Dynamic Adjustment for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5505619. [Google Scholar] [CrossRef]
  27. Zhong, Y.; Hu, X.; Luo, C.; Wang, X.; Zhao, J.; Zhang, L. WHU-Hi: UAV-Borne Hyperspectral with High Spatial Resolution (H2) Benchmark Datasets and Classifier for Precise Crop Identification Based on Deep Convolutional Neural Network with CRF. Remote Sens. Environ. 2020, 250, 112012. [Google Scholar] [CrossRef]
  28. Wei, L.; Yu, M.; Zhong, Y.; Zhao, J.; Liang, Y.; Hu, X. Spatial-Spectral Fusion Based on Conditional Random Fields for the Fine Classification of Crops in UAV-Borne Hyperspectral Remote Sensing Imagery. Remote Sens. 2019, 11, 780. [Google Scholar] [CrossRef]
  29. Zhong, Y.; Xu, Y.; Wang, X.; Jia, T.; Xia, G.; Ma, A.; Zhang, L. Pipeline Leakage Detection for District Heating Systems Using Multisource Data in Mid and High-Latitude Regions. ISPRS J. Photogramm. Remote Sens. 2019, 151, 207–222. [Google Scholar] [CrossRef]
  30. Han, Z.; Yang, J.; Gao, L.; Zeng, Z.; Zhang, B.; Chanussot, J. Subpixel Spectral Variability Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504014. [Google Scholar] [CrossRef]
  31. Han, Z.; Zhang, C.; Gao, L.; Zeng, Z.; Ng, M.K.; Zhang, B.; Chanussot, J. Multisource Collaborative Domain Generalization for Cross-Scene Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5535815. [Google Scholar] [CrossRef]
  32. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep Convolutional Neural Networks for Hyperspectral Image Classification. J. Sensors. 2015, 2015, 258619. [Google Scholar] [CrossRef]
  33. Mou, L.; Ghamisi, P.; Zhu, X.X. Deep Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3639–3655. [Google Scholar] [CrossRef]
  34. Li, Y.; Zhang, H.; Shen, Q. Spectral–Spatial Classification of Hyperspectral Imagery with 3D Convolutional Neural Network. Remote Sens. 2017, 9, 67. [Google Scholar] [CrossRef]
  35. Liu, X.; Liu, S.; Chen, W.; Qu, S. HDECGCN: A Heterogeneous Dual Enhanced Network Based on Hybrid CNNs Joint Multiscale Dynamic GCNs for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5515717. [Google Scholar] [CrossRef]
  36. Ahmad, M.; Ghous, U.; Usama, M.; Mazzara, M. WaveFormer: Spectral–Spatial Wavelet Transformer for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5502405. [Google Scholar] [CrossRef]
  37. Zhang, S.; Chen, Z.; Wang, D.; Wang, Z.J. Cross-Domain Few-Shot Contrastive Learning for Hyperspectral Images Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5514505. [Google Scholar] [CrossRef]
  38. Ye, Z.; Wang, J.; Sun, T.; Zhang, J.; Li, W. Cross-Domain Few-Shot Learning Based on Graph Convolution Contrast for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5504614. [Google Scholar] [CrossRef]
  39. Miftahushudur, T.; Grieve, B.; Yin, H. Permuted KPCA and SMOTE to Guide GAN-Based Oversampling for Imbalanced HSI Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 489–505. [Google Scholar] [CrossRef]
  40. Paoletti, M.E.; Haut, J.M.; Plaza, J.; Plaza, A. Neighboring Region Dropout for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1032–1036. [Google Scholar] [CrossRef]
  41. Wang, W.; Wang, X.; Liu, Y.; Yang, J. Rethinking Maximum Mean Discrepancy for Visual Domain Adaptation. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 264–277. [Google Scholar] [CrossRef]
  42. Li, Y.; Hu, H.; Wang, D. Learning Visually Aligned Semantic Graph for Cross-Modal Manifold Matching. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3412–3416. [Google Scholar]
  43. Mei, S.; Ji, J.; Hou, J.; Li, X.; Du, Q. Learning Sensor-Specific Spatial-Spectral Features of Hyperspectral Images via Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4520–4533. [Google Scholar] [CrossRef]
  44. Yang, J.; Zhao, Y.; Chan, J.C. Learning and Transferring Deep Joint Spectral–Spatial Features for Hyperspectral Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4729–4742. [Google Scholar] [CrossRef]
  45. Othman, E.; Bazi, Y.; Melgani, F.; Alhichri, H.; Alajlan, N.; Zuair, M. Domain Adaptation Network for Cross-Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4441–4456. [Google Scholar] [CrossRef]
  46. Wang, Z.; Du, B.; Shi, Q.; Tu, W. Domain Adaptation with Discriminative Distribution and Manifold Embedding for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1155–1159. [Google Scholar] [CrossRef]
  47. Wang, M.; Chen, J.; Wang, Y.; Wang, S.; Li, L.; Su, H.; Gong, Z. Joint Adversarial Domain Adaptation With Structural Graph Alignment. IEEE Trans. Netw. Sci. Eng. 2024, 11, 604–612. [Google Scholar] [CrossRef]
  48. Liu, L.; Zhang, Y.; Tang, J.; Chen, Q. Generalizable Prompt Learning via Gradient Constrained Sharpness-Aware Minimization. IEEE Trans. Multimed. 2025, 27, 1100–1113. [Google Scholar] [CrossRef]
  49. He, X.; Chen, Y.; Ghamisi, P. Heterogeneous Transfer Learning for Hyperspectral Image Classification Based on Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3246–3263. [Google Scholar] [CrossRef]
  50. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  51. Zhong, S.; Chang, C.-I.; Zhang, Y. Iterative Support Vector Machine for Hyperspectral Image Classification. In Proceedings of the 2018 25th IEEE International Conference on Image Processing, Athens, Greece, 7–10 October 2018; pp. 3309–3312. [Google Scholar]
  52. Zhao, Z.; Xu, X.; Li, S.; Plaza, A. Hyperspectral Image Classification Using Groupwise Separable Convolutional Vision Transformer Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5511817. [Google Scholar] [CrossRef]
  53. Xu, Y.; Wang, D.; Zhang, L.; Zhang, L. Dual Selective Fusion Transformer Network for Hyperspectral Image Classification. Neural Netw. 2025, 187, 107311. [Google Scholar] [CrossRef]
Figure 1. Framework of the proposed SAMLFE for HSI classification.
Figure 2. The structure of spectral dimension mapping model.
Figure 3. The structure of local-to-global feature extraction model.
Figure 4. The feature space of improved self-attention mechanism.
Figure 5. The few-shot learning between source and target domain.
Figure 6. The structure of conditional domain discriminator model.
Figure 7. The schematic diagram of ISAM.
Figure 8. WHHC dataset. (a) False-color image; (b) Ground-truth map.
Figure 9. PU dataset. (a) False-color image; (b) Ground-truth map.
Figure 10. SA dataset. (a) False-color image. (b) Ground-truth map.
Figure 11. HZ dataset. (a) False-color image. (b) Ground-truth map.
Figure 12. IP dataset. (a) False-color image; (b) Ground-truth map.
Figure 13. Classification maps of PU dataset by different methods. (a) Ground truth; (b) XGBoost; (c) SVM; (d) 3DCNN; (e) SSRN; (f) DCFSL; (g) Gia-CFSL; (h) GSCViT; (i) DSFormer; (j) SAMLFE.
Figure 14. Classification maps of IP dataset by different methods. (a) Ground truth; (b) XGBoost; (c) SVM; (d) 3DCNN; (e) SSRN; (f) DCFSL; (g) Gia-CFSL; (h) GSCViT; (i) DSFormer; (j) SAMLFE.
Figure 15. Classification maps of SA dataset by different methods. (a) Ground truth; (b) XGBoost; (c) SVM; (d) 3DCNN; (e) SSRN; (f) DCFSL; (g) Gia-CFSL; (h) GSCViT; (i) DSFormer; (j) SAMLFE.
Figure 16. Classification maps of HZ dataset by different methods. (a) Ground truth; (b) XGBoost; (c) SVM; (d) 3DCNN; (e) SSRN; (f) DCFSL; (g) Gia-CFSL; (h) GSCViT; (i) DSFormer; (j) SAMLFE.
Figure 17. The performance comparison of the SAMLFE model across different ablation settings. (a) PU; (b) IP; (c) SA.
Figure 18. The effect of hyperparameters on the SAMLFE model. (a) OA. (b) AA. (c) Kappa.
Figure 19. The performance comparison of the SAMLFE model across different sample sizes. (a) PU; (b) IP; (c) SA.
Figure 20. The performance comparison of the SAMLFE model across different batch size. (a) PU; (b) IP; (c) SA.
Figure 21. Two-dimensional feature visualization on the PU dataset. (a) Original samples; (b) Features by Gia-CFSL; (c) Features by SAMLFE.
Figure 22. Two-dimensional feature visualization on the IP dataset. (a) Original samples; (b) Features by Gia-CFSL; (c) Features by SAMLFE.
Figure 23. Two-dimensional feature visualization on the SA dataset. (a) Original samples; (b) Features by Gia-CFSL; (c) Features by SAMLFE.
Table 1. The Number of Samples of the WHHC dataset.
Class | Name | Pixels
1 | Strawberry | 44,735
2 | Cowpea | 22,753
3 | Soybean | 10,287
4 | Sorghum | 5353
5 | Water spinach | 1200
6 | Watermelon | 4533
7 | Greens | 5903
8 | Trees | 17,978
9 | Grass | 9469
10 | Red roof | 10,516
11 | Gray roof | 16,911
12 | Plastic | 3679
13 | Bare soil | 9116
14 | Road | 18,560
15 | Bright object | 1136
16 | Water | 75,401
Table 2. The Number of Samples of the PU dataset.
Class | Name | Pixels
1 | Asphalt | 6631
2 | Meadows | 18,649
3 | Gravel | 2099
4 | Trees | 3064
5 | Sheets | 1345
6 | Bare soil | 5029
7 | Bitumen | 1330
8 | Bricks | 3682
9 | Shadow | 947
Table 3. The Number of Samples of the SA dataset.
Class | Name | Pixels
1 | Brocoli_green_weeds_1 | 2009
2 | Brocoli_green_weeds_2 | 3726
3 | Fallow | 1976
4 | Fallow_rough_plow | 1394
5 | Fallow_smooth | 2678
6 | Stubble | 3959
7 | Celery | 3579
8 | Grapes_untrained | 11,271
9 | Soil_vinyard_develop | 6203
10 | Corn_senesced_green_weeds | 3278
11 | Lettuce_romaine_4wk | 1068
12 | Lettuce_romaine_5wk | 1927
13 | Lettuce_romaine_6wk | 916
14 | Lettuce_romaine_7wk | 1070
15 | Vinyard_untrained | 7268
16 | Vinyard_vertical_trellis | 1807
Table 4. The Number of Samples of the HZ dataset.
Class | Name | Pixels
1 | Water | 18,043
2 | Land/building | 77,450
3 | Plants | 40,207
Table 5. The Number of Samples of the IP dataset.
Class | Name | Pixels
1 | Alfalfa | 46
2 | Corn-notill | 1428
3 | Corn-mintill | 830
4 | Corn | 237
5 | Grass-pasture | 483
6 | Grass-trees | 730
7 | Grass-pasture-mowed | 28
8 | Hay-windrowed | 478
9 | Oats | 20
10 | Soybean-notill | 972
11 | Soybean-mintill | 2455
12 | Soybean-clean | 593
13 | Wheat | 205
14 | Woods | 1265
15 | Buildings-Grass-Trees-Drives | 386
16 | Stone-Steel-Towers | 93
Table 6. Classification results for the PU dataset.
Class | XGBoost | SVM | 3DCNN | SSRN | DCFSL | Gia-CFSL | GSCViT | DSFormer | SAMLFE
147.89 ± 20.4066.69 ± 4.5056.87 ± 3.2347.65 ± 8.3878.44 ± 6.9279.98 ± 2.7977.63 ± 13.1968.04 ± 2.4080.75 ± 4.06
247.52 ± 0.4356.21 ± 13.6270.50 ± 12.3985.85 ± 8.0684.24 ± 10.9389.41 ± 6.0179.76 ± 7.2587.37 ± 13.3488.29 ± 4.98
334.13 ± 18.9156.96 ± 8.6762.37 ± 5.8866.33 ± 18.2967.24 ± 12.2267.83 ± 7.9357.45 ± 23.9689.26 ± 5.3270.83 ± 11.23
463.11 ± 18.7873.44 ± 22.7271.76 ± 7.9783.93 ± 4.6993.26 ± 3.9492.64 ± 2.7890.11 ± 7.3187.09 ± 6.0893.83 ± 3.03
596.77 ± 1.4196.84 ± 1.5896.57 ± 4.1799.81 ± 0.1999.09 ± 1.0696.24 ± 6.5699.97 ± 0.05100.00 ± 0.0099.05 ± 0.85
639.98 ± 1.4952.38 ± 25.5449.77 ± 19.4277.63 ± 6.9176.24 ± 7.9065.38 ± 13.4877.77 ± 5.5566.81 ± 9.8779.28 ± 6.74
766.84 ± 18.3977.46 ± 12.1481.03 ± 6.4199.70 ± 0.3078.88 ± 10.5881.18 ± 3.3394.78 ± 5.6899.12 ± 0.5182.72 ± 10.73
850.92 ± 8.2069.93 ± 2.8845.77 ± 16.0186.32 ± 2.8369.83 ± 14.1869.23 ± 13.3791.09 ± 4.0278.90 ± 6.2369.60 ± 11.79
989.53 ± 8.0999.86 ± 0.1690.66 ± 7.4799.68 ± 0.3294.08 ± 6.1394.73 ± 8.8499.52 ± 0.5293.84 ± 10.4893.38 ± 7.39
OA(%)50.51 ± 2.2062.73 ± 5.1565.10 ± 4.2879.08 ± 2.8781.49 ± 4.7782.64 ± 2.2881.35 ± 4.0982.20 ± 4.7084.27 ± 1.90
AA(%)59.63 ± 0.1172.20 ± 2.9569.48 ± 1.3182.98 ± 2.0182.37 ± 2.7181.85 ± 1.6085.34 ± 0.3385.60 ± 1.0084.19 ± 1.77
K × 10040.01 ± 2.6653.78 ± 5.1055.83 ± 4.0673.01 ± 3.3976.17 ± 5.4977.25 ± 2.6376.01 ± 4.8276.92 ± 5.0479.51 ± 2.31
F149.59 ± 3.1166.70 ± 0.6661.12 ± 5.0679.78 ± 1.8479.28 ± 2.6979.48 ± 3.5475.67 ± 0.4876.28 ± 4.7281.26 ± 1.87
Model size (MB)0.120.760.290.860.592.580.27
Time(s)0.340.0161.0575.061989.375503.9216.4858.455030.01
Table 7. Classification results for the IP dataset.
Class | XGBoost | SVM | 3DCNN | SSRN | DCFSL | Gia-CFSL | GSCViT | DSFormer | SAMLFE
168.29 ± 12.9167.48 ± 17.1363.90 ± 7.4697.56 ± 2.44 94.15 ± 8.8184.88 ± 9.0597.22 ± 3.9399.39 ± 1.2295.37 ± 5.71
224.43 ± 18.8328.48 ± 8.6926.52 ± 9.32 39.72 ± 7.15 40.93 ± 6.6943.18 ± 8.0644.79 ± 23.9445.06 ± 2.2543.13 ± 12.66
325.13 ± 15.0034.34 ± 11.2827.01 ± 8.83 32.61 ± 23.17 45.70 ± 6.3947.64 ± 11.6662.38 ± 1.1251.18 ± 1.7253.53 ± 7.37
429.60 ± 2.8760.20 ± 3.5132.50 ± 12.38 54.20 ± 25.76 72.16 ± 16.8381.12 ± 5.4185.90 ± 11.8491.49 ± 6.4371.38 ± 15.32
548.26 ± 19.9050.28 ± 4.5161.88 ± 13.14 84.73 ± 2.63 71.23 ± 7.8773.26 ± 6.0363.01 ± 1.4971.97 ± 13.1174.54 ± 7.53
663.91 ± 6.2277.61 ± 2.6875.03 ± 13.12 74.24 ± 22.87 83.35 ± 7.0976.86 ± 6.7995.28 ± 0.5992.76 ± 2.9185.83 ± 5.24
768.12 ± 20.5585.51 ± 17.5788.70 ± 14.45 96.74 ± 5.65 98.70 ± 3.9196.52 ± 3.25100.00 ± 0.0094.57 ± 10.8798.70 ± 1.99
836.36 ± 3.9670.54 ± 12.7681.61 ± 7.81 80.92 ± 12.81 81.16 ± 13.8492.35 ± 4.1276.50 ± 33.2486.36 ± 9.2784.55 ± 10.50
962.22 ± 23.4191.11 ± 15.4098.67 ± 2.67 76.67 ± 20.41 99.33 ± 2.0097.33 ± 5.3395.00 ± 7.07100.00 ± 0.0098.67 ± 2.67
1029.85 ± 15.3545.81 ± 3.3840.02 ± 7.05 53.77 ± 18.93 56.29 ± 10.7758.47 ± 6.5959.25 ± 16.0252.43 ± 1.9756.83 ± 12.41
1119.69 ± 3.8435.01 ± 16.0450.52 ± 11.45 45.90 ± 9.25 57.49 ± 12.7856.78 ± 7.1348.16 ± 13.7357.43 ± 8.7263.40 ± 7.00
1224.94 ± 8.3338.72 ± 7.2324.97 ± 6.09 54.00 ± 11.48 43.84 ± 15.1743.47 ± 12.8248.97 ± 19.7739.67 ± 9.6339.44 ± 10.89
1382.33 ± 9.6592.50 ± 9.1097.60 ± 4.55 98.38 ± 1.98 96.35 ± 4.9994.10 ± 3.4499.75 ± 0.3699.63 ± 0.2596.80 ± 2.23
1457.17 ± 6.9863.97 ± 14.0254.40 ± 10.9879.86 ± 8.72 86.21 ± 6.2781.30 ± 6.2688.05 ± 16.0086.19 ± 3.6586.33 ± 7.92
1527.56 ± 7.8337.97 ± 16.1932.02 ± 6.46 61.09 ± 16.02 68.92 ± 8.7152.81 ± 9.2159.84 ± 1.1367.06 ± 8.9972.89 ± 11.42
1689.39 ± 8.8380.30 ± 10.9284.55 ± 7.35 98.58 ± 1.8698.64 ± 1.4295.91 ± 6.5391.57 ± 11.93100.00 ± 0.0097.50 ± 3.40
OA(%)34.70 ± 0.7646.82 ± 6.0747.32 ± 3.9257.46 ± 3.8362.65 ± 2.6062.17 ± 2.4163.23 ± 8.6164.45 ± 2.5565.45 ± 2.93
AA(%)47.33 ± 1.6459.99 ± 3.9358.74 ± 2.6670.56 ± 7.1774.65 ± 1.9373.50 ± 1.5875.98 ± 3.5877.20 ± 1.7076.18 ± 1.88
K × 10028.47 ± 1.2640.99 ± 6.0740.83 ± 3.9952.31 ± 4.6158.08 ± 2.7257.39 ± 2.8058.87 ± 9.5860.07 ± 2.7061.06 ± 3.15
F136.46 ± 1.2548.38 ± 2.0448.04 ± 3.2058.69 ± 2.4861.37 ± 1.7758.96 ± 1.3159.55 ± 4.9058.32 ± 0.8165.52 ± 2.54
Model size (MB)0.431.320.330.892.442.590.33
Time(s)0.690.0123.6135.463211.047602.7423.9961.536266.61
Table 8. Classification results for the SA dataset.
Class | XGBoost | SVM | 3DCNN | SSRN | DCFSL | Gia-CFSL | GSCViT | DSFormer | SAMLFE
192.65 ± 1.5795.43 ± 2.8098.25 ± 0.2097.85 ± 4.29 99.61 ± 0.8599.54 ± 0.2499.75 ± 0.47100.00 ± 0.0099.64 ± 0.55
272.92 ± 6.6795.05 ± 2.4298.70 ± 1.25 95.87 ± 8.26 99.01 ± 1.2399.11 ± 0.6999.81 ± 0.1896.41 ± 5.0799.56 ± 0.37
361.15 ± 18.3787.23 ± 12.3695.61 ± 0.43 93.08 ± 13.71 90.27 ± 10.2590.54 ± 7.6083.95 ± 12.8597.64 ± 3.0596.00 ± 3.95
497.43 ± 0.9698.61 ± 0.8397.34 ± 0.94 98.08 ± 1.77 99.40 ± 0.4897.48 ± 2.2999.69 ± 0.3299.89 ± 0.0499.12 ± 0.76
594.71 ± 2.8788.04 ± 7.3293.08 ± 4.08 95.23 ± 2.64 91.50 ± 2.8193.00 ± 2.3992.40 ± 5.9184.36 ± 5.5790.18 ± 6.38
687.73 ± 8.5399.36 ± 0.2299.73 ± 0.27 99.94 ± 0.11 99.39 ± 0.9798.41 ± 1.5799.37 ± 0.61100.00 ± 0.0099.05 ± 1.16
790.68 ± 7.7597.96 ± 2.3596.75 ± 1.82 99.96 ± 0.06 98.27 ± 1.3397.37 ± 1.5799.94 ± 0.0999.94 ± 0.0198.52 ± 1.10
845.81 ± 16.8966.11 ± 5.7271.39 ± 8.25 60.08 ± 25.00 75.88 ± 10.5371.11 ± 11.9868.59 ± 12.5470.00 ± 17.1679.06 ± 5.53
972.95 ± 19.0190.01 ± 8.6893.63 ± 0.34 99.74 ± 0.35 99.32 ± 0.7699.26 ± 0.6298.62 ± 2.2099.98 ± 0.0299.36 ± 0.73
1065.70 ± 8.8782.43 ± 0.7683.13 ± 8.65 94.01 ± 1.36 88.00 ± 4.3584.32 ± 4.7989.01 ± 7.3395.95 ± 0.2888.57 ± 4.07
1156.22 ± 27.1287.80 ± 10.7477.28 ± 15.95 99.34 ± 0.48 98.78 ± 1.1695.71 ± 3.9996.06 ± 4.5589.84 ± 3.1997.22 ± 4.28
1281.70 ± 8.7796.86 ± 0.6298.47 ± 1.53 99.38 ± 0.86 99.14 ± 1.4897.61 ± 2.5198.95 ± 1.1299.24 ± 1.0699.80 ± 0.18
1393.08 ± 4.3297.91 ± 0.1197.64 ± 1.92 99.30 ± 0.89 99.33 ± 0.7898.70 ± 0.6199.39 ± 1.3697.86 ± 3.0298.63 ± 1.82
1482.60 ± 7.2190.67 ± 0.3884.46 ± 6.43 97.78 ± 2.41 97.91 ± 1.5798.70 ± 0.8298.28 ± 2.7696.47 ± 4.4498.04 ± 1.72
1553.69 ± 16.0252.93 ± 5.4055.73 ± 0.40 61.66 ± 27.80 76.02 ± 8.0978.05 ± 8.5182.82 ± 12.1181.50 ± 17.9077.56 ± 5.67
1683.94 ± 4.9771.29 ± 8.1388.43 ± 7.41 91.01 ± 6.16 89.54 ± 6.7795.21 ± 3.1693.70 ± 7.0799.14 ± 0.9792.26 ± 7.13
OA(%)69.40 ± 0.8481.09 ± 0.6184.14 ± 3.1484.82 ± 1.7989.45 ± 1.8688.49 ± 1.7488.90 ± 2.7889.54 ± 1.1490.61 ± 0.93
AA(%)77.06 ± 2.8287.36 ± 0.5189.35 ± 3.3892.64 ± 1.3593.83 ± 0.9393.38 ± 0.9093.77 ± 1.5094.26 ± 0.1394.53 ± 0.82
K × 10066.34 ± 1.0378.99 ± 0.7082.37 ± 3.4683.16 ± 1.9388.28 ± 2.0387.24 ± 1.8887.69 ± 3.0888.41 ± 1.2189.56 ± 1.03
F172.86 ± 1.7684.35 ± 2.0384.77 ± 6.1186.69 ± 5.6193.68 ± 0.3592.41 ± 0.3991.95 ± 1.9593.19 ± 2.8893.97 ± 1.00
| Model size (MB) | – | – | 0.44 | 1.35 | 0.33 | 0.89 | 0.69 | 2.59 | 0.33 |
| Time (s) | 0.55 | 0.01 | 117.98 | 150.75 | 3217.59 | 7705.78 | 26.43 | 62.64 | 7129.28 |
Table 9. Classification results for the HZ dataset.
| Class | XGBoost | SVM | 3DCNN | SSRN | DCFSL | Gia-CFSL | GSCViT | DSFormer | SAMLFE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 97.25 ± 3.08 | 89.58 ± 9.33 | 86.07 ± 2.13 | 89.93 ± 3.10 | 84.74 ± 2.01 | 83.63 ± 3.65 | 83.17 ± 13.02 | 87.73 ± 4.05 | 83.61 ± 1.94 |
| 2 | 64.42 ± 11.16 | 62.65 ± 14.45 | 70.14 ± 9.25 | 78.19 ± 10.29 | 78.86 ± 2.55 | 77.79 ± 4.52 | 76.80 ± 20.60 | 73.47 ± 16.83 | 84.40 ± 5.17 |
| 3 | 48.51 ± 3.22 | 67.61 ± 17.24 | 72.57 ± 9.38 | 65.26 ± 5.38 | 72.84 ± 6.01 | 76.19 ± 6.41 | 69.74 ± 18.49 | 77.61 ± 9.34 | 72.35 ± 7.29 |
| OA (%) | 64.07 ± 6.92 | 67.70 ± 1.90 | 72.98 ± 3.12 | 75.92 ± 3.87 | 77.86 ± 1.74 | 78.09 ± 2.95 | 75.56 ± 6.00 | 76.59 ± 8.61 | 80.73 ± 1.21 |
| AA (%) | 70.06 ± 3.77 | 73.28 ± 4.04 | 76.26 ± 1.31 | 77.80 ± 0.61 | 78.81 ± 1.84 | 79.20 ± 2.39 | 76.57 ± 3.20 | 79.60 ± 4.31 | 80.12 ± 0.44 |
| K × 100 | 41.76 ± 8.75 | 47.69 ± 1.05 | 54.22 ± 3.85 | 58.87 ± 4.20 | 61.38 ± 3.16 | 62.50 ± 4.53 | 58.12 ± 6.03 | 60.77 ± 12.11 | 65.78 ± 1.52 |
| F1 | 62.46 ± 5.84 | 67.44 ± 2.69 | 75.12 ± 2.81 | 75.05 ± 5.15 | 79.14 ± 1.72 | 78.57 ± 3.97 | 49.20 ± 2.56 | 49.83 ± 5.96 | 81.37 ± 0.77 |
| Model size (MB) | – | – | 0.08 | 1.31 | 0.32 | 0.89 | 0.68 | 2.58 | 0.30 |
| Time (s) | 0.17 | 0.01 | 137.64 | 160.03 | 1574.63 | 5960.51 | 17.73 | 55.75 | 4282.37 |
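
The OA, AA, K × 100, and F1 values reported throughout these tables follow the usual confusion-matrix definitions: overall accuracy, mean per-class accuracy, Cohen's kappa scaled by 100, and (assumed here) macro-averaged F1. The sketch below is only a minimal illustration of how such scores are typically computed; the function name `summary_metrics`, the macro-averaging assumption, and the toy confusion matrix are illustrative additions, not taken from the paper's code.

```python
import numpy as np

def summary_metrics(conf):
    """Compute OA, AA, Cohen's kappa x 100, and macro-F1 (all in percent)
    from a confusion matrix whose rows are reference labels and whose
    columns are predicted labels."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    hits = np.diag(conf)
    recall = hits / conf.sum(axis=1)          # per-class (producer's) accuracy
    precision = hits / conf.sum(axis=0)       # per-class user's accuracy
    oa = hits.sum() / total                   # overall accuracy
    aa = recall.mean()                        # average accuracy over classes
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / total ** 2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)            # Cohen's kappa
    f1 = (2 * precision * recall / (precision + recall)).mean()    # macro-averaged F1
    return 100 * oa, 100 * aa, 100 * kappa, 100 * f1

# Toy 3-class confusion matrix; the numbers are illustrative only.
cm = [[50, 3, 2],
      [4, 45, 6],
      [1, 5, 40]]
print(["%.2f" % v for v in summary_metrics(cm)])
```

The mean ± standard deviation entries in the tables would then come from repeating such a computation over several independent training runs.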
Table 10. Sequential ablation study on three datasets.
| Dataset | ID | ISAM / Improved Self-Attention / Asymmetric Residual Block | OA (%) | AA (%) | K × 100 | F1 |
|---|---|---|---|---|---|---|
| PU | 1 | × × × | 81.49 ± 4.77 | 82.37 ± 2.71 | 76.17 ± 5.49 | 79.28 ± 2.69 |
| PU | 2 | × × | 82.75 ± 3.57 | 82.25 ± 1.42 | 77.61 ± 4.26 | 80.35 ± 4.08 |
| PU | 3 | × | 81.53 ± 1.05 | 82.48 ± 0.15 | 76.92 ± 1.19 | 79.29 ± 1.10 |
| PU | 4 | × | 81.62 ± 3.91 | 83.79 ± 0.98 | 77.41 ± 4.41 | 79.93 ± 1.15 |
| PU | 5 | × | 81.59 ± 2.08 | 82.98 ± 2.49 | 76.31 ± 2.38 | 79.31 ± 0.71 |
| PU | 6 | ✓ ✓ ✓ | 84.27 ± 1.90 | 84.19 ± 1.77 | 79.51 ± 2.31 | 81.26 ± 1.87 |
| IP | 1 | × × × | 62.65 ± 2.60 | 74.65 ± 1.93 | 58.08 ± 2.72 | 61.37 ± 1.77 |
| IP | 2 | × × | 63.85 ± 2.05 | 76.39 ± 2.17 | 59.24 ± 2.26 | 63.11 ± 5.79 |
| IP | 3 | × | 64.31 ± 3.87 | 77.01 ± 1.62 | 59.87 ± 4.02 | 64.85 ± 1.88 |
| IP | 4 | × | 64.27 ± 2.88 | 77.00 ± 1.54 | 59.67 ± 3.16 | 64.74 ± 1.87 |
| IP | 5 | × | 64.15 ± 3.75 | 76.08 ± 2.95 | 59.80 ± 4.02 | 64.12 ± 1.83 |
| IP | 6 | ✓ ✓ ✓ | 65.45 ± 2.93 | 76.18 ± 1.88 | 61.06 ± 3.15 | 65.52 ± 2.54 |
| SA | 1 | × × × | 89.45 ± 1.86 | 93.83 ± 0.93 | 88.28 ± 2.03 | 93.68 ± 0.35 |
| SA | 2 | × × | 91.50 ± 0.86 | 95.28 ± 0.64 | 90.54 ± 0.96 | 94.52 ± 1.31 |
| SA | 3 | × | 90.71 ± 0.81 | 94.24 ± 0.29 | 89.65 ± 0.89 | 93.60 ± 4.81 |
| SA | 4 | × | 90.63 ± 0.64 | 93.61 ± 1.51 | 89.57 ± 0.72 | 93.19 ± 0.63 |
| SA | 5 | × | 89.62 ± 2.14 | 94.58 ± 1.35 | 88.46 ± 2.35 | 93.56 ± 0.45 |
| SA | 6 | ✓ ✓ ✓ | 90.61 ± 0.93 | 94.53 ± 0.82 | 89.56 ± 1.03 | 93.97 ± 1.00 |
Table 11. Impact of sample size on the classification metrics.
| Dataset | Metric | Sample size 1 | Sample size 2 | Sample size 3 | Sample size 4 | Sample size 5 |
|---|---|---|---|---|---|---|
| PU | OA (%) | 58.74 ± 2.59 | 62.41 ± 11.28 | 65.60 ± 10.20 | 75.66 ± 9.65 | 84.27 ± 1.90 |
| PU | AA (%) | 63.63 ± 2.93 | 70.76 ± 2.69 | 74.22 ± 2.25 | 82.46 ± 3.42 | 84.19 ± 1.77 |
| PU | K × 100 | 48.29 ± 2.21 | 54.45 ± 11.71 | 58.23 ± 10.16 | 69.79 ± 10.83 | 79.51 ± 2.31 |
| PU | F1 | 56.61 ± 2.63 | 64.35 ± 5.62 | 67.75 ± 4.43 | 78.17 ± 4.60 | 81.26 ± 1.87 |
| IP | OA (%) | 41.97 ± 0.84 | 53.97 ± 2.55 | 56.11 ± 2.64 | 60.21 ± 0.74 | 65.45 ± 2.93 |
| IP | AA (%) | 51.40 ± 0.67 | 65.59 ± 0.31 | 70.22 ± 0.55 | 74.12 ± 0.41 | 76.18 ± 1.88 |
| IP | K × 100 | 35.29 ± 1.44 | 48.48 ± 2.71 | 50.88 ± 2.65 | 55.34 ± 0.77 | 61.06 ± 3.15 |
| IP | F1 | 38.92 ± 1.24 | 53.42 ± 0.55 | 54.97 ± 2.44 | 60.96 ± 1.19 | 65.52 ± 2.54 |
| SA | OA (%) | 76.31 ± 0.09 | 83.46 ± 2.19 | 89.17 ± 0.46 | 88.34 ± 0.78 | 90.61 ± 0.93 |
| SA | AA (%) | 79.59 ± 3.00 | 88.91 ± 0.73 | 93.47 ± 1.10 | 93.17 ± 0.24 | 94.53 ± 0.82 |
| SA | K × 100 | 73.68 ± 0.23 | 81.52 ± 2.46 | 87.99 ± 0.49 | 87.03 ± 0.84 | 89.56 ± 1.03 |
| SA | F1 | 77.93 ± 0.25 | 87.19 ± 0.01 | 91.38 ± 1.30 | 91.36 ± 0.32 | 93.97 ± 1.00 |
Table 12. Classification results across different datasets and batch sizes.
| Dataset | Metric | Batch size 32 | Batch size 64 | Batch size 128 | Batch size 150 |
|---|---|---|---|---|---|
| PU | OA (%) | 77.26 ± 7.46 | 79.23 ± 4.86 | 84.27 ± 1.90 | 81.31 ± 4.50 |
| PU | AA (%) | 80.08 ± 4.79 | 81.87 ± 2.97 | 84.19 ± 1.77 | 83.01 ± 2.67 |
| PU | K × 100 | 70.86 ± 9.29 | 73.48 ± 5.90 | 79.51 ± 2.31 | 76.07 ± 5.42 |
| PU | F1 | 74.63 ± 6.26 | 75.09 ± 4.16 | 81.26 ± 1.87 | 78.63 ± 3.12 |
| IP | OA (%) | 65.22 ± 0.91 | 64.59 ± 0.79 | 65.45 ± 2.93 | 63.68 ± 1.65 |
| IP | AA (%) | 77.61 ± 0.94 | 77.20 ± 0.54 | 76.18 ± 1.88 | 77.09 ± 1.34 |
| IP | K × 100 | 60.67 ± 1.18 | 60.32 ± 0.78 | 61.06 ± 3.15 | 59.21 ± 1.77 |
| IP | F1 | 64.22 ± 1.79 | 64.11 ± 0.85 | 65.52 ± 2.54 | 63.04 ± 1.51 |
| SA | OA (%) | 89.53 ± 0.08 | 90.11 ± 0.51 | 90.61 ± 0.93 | 89.96 ± 1.78 |
| SA | AA (%) | 92.95 ± 0.60 | 94.04 ± 0.70 | 94.53 ± 0.82 | 94.60 ± 0.71 |
| SA | K × 100 | 88.32 ± 0.11 | 88.96 ± 0.58 | 89.56 ± 1.03 | 88.85 ± 1.97 |
| SA | F1 | 92.94 ± 0.39 | 93.79 ± 1.06 | 93.97 ± 1.00 | 93.59 ± 4.66 |
Table 13. Sensitivity analysis of the parameter gamma.
| Dataset | Metric | Baseline | Without Gamma | With Gamma |
|---|---|---|---|---|
| PU | OA | 81.49 ± 4.77 | 75.93 ± 5.80 | 82.37 ± 5.43 |
| PU | AA | 82.37 ± 2.71 | 80.05 ± 2.81 | 83.18 ± 2.12 |
| PU | Kappa | 76.17 ± 5.49 | 69.73 ± 6.38 | 77.48 ± 6.18 |
| PU | F1 | 79.28 ± 2.69 | 76.18 ± 2.30 | 79.39 ± 2.92 |
| IP | OA | 62.65 ± 2.60 | 62.45 ± 1.67 | 63.96 ± 1.47 |
| IP | AA | 74.65 ± 1.93 | 72.63 ± 1.96 | 75.52 ± 1.83 |
| IP | Kappa | 58.08 ± 2.72 | 57.69 ± 2.04 | 59.44 ± 1.61 |
| IP | F1 | 61.37 ± 1.77 | 61.68 ± 3.12 | 63.33 ± 1.82 |
| SA | OA | 89.45 ± 1.86 | 88.13 ± 0.89 | 90.39 ± 1.84 |
| SA | AA | 93.83 ± 0.93 | 93.16 ± 1.28 | 94.94 ± 1.09 |
| SA | Kappa | 88.28 ± 2.03 | 86.82 ± 0.96 | 89.33 ± 2.05 |
| SA | F1 | 93.68 ± 0.35 | 91.97 ± 1.32 | 94.09 ± 0.92 |