LGDAF-Net: A Lightweight CNN–Transformer Framework for Cross-Domain Few-Shot Hyperspectral Image Classification

Yang, Guang; Fang, Jiaoli; Zhu, Daming; Zuo, Xiaoqing

doi:10.3390/electronics15081606


¹ Faculty of Land and Resources Engineering, Kunming University of Science and Technology, Kunming 650093, China

² Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(8), 1606; https://doi.org/10.3390/electronics15081606

Submission received: 10 March 2026 / Revised: 6 April 2026 / Accepted: 9 April 2026 / Published: 12 April 2026

(This article belongs to the Special Issue AI-Driven Image Processing: Theory, Methods, and Applications)


Abstract

Cross-domain few-shot hyperspectral image (HSI) classification is challenging due to limited labeled samples and distribution shifts across sensors and acquisition scenes, which often degrade feature representation and classification performance. This study proposes a lightweight hierarchical CNN–Transformer framework, termed LGDAF-Net (Lightweight Global and Local Dual Attention Fusion Network), for effective cross-domain few-shot HSI classification. The framework progressively enhances spectral–spatial representation through three stages: spectral–spatial feature recalibration, local spatial structure perception, and global contextual modeling. Specifically, a spectral–spatial dual-attention enhancement module (SESA) is introduced to emphasize informative spectral responses and suppress redundancy. A Local Attention Spatial Perception Module (LASPM) is designed to capture fine-grained spatial structures, while a lightweight Transformer-based Global Attention Context Modeling Module (GACM) models long-range spatial dependencies. In addition, kernel triplet loss and domain adversarial learning are incorporated to improve feature discrimination and promote cross-domain feature alignment. Experimental results on three benchmark datasets demonstrate that the proposed method achieves competitive performance compared with existing methods.

Keywords: hyperspectral image (HSI) classification; few-shot learning (FSL); cross-domain learning; CNN–Transformer architecture; spectral–spatial feature learning; domain adversarial learning

1. Introduction

Hyperspectral imagery represents a critically important category of remote sensing data. Unlike conventional RGB images, it encompasses hundreds of spectral bands and nearly continuous spectral curves, capturing rich spectral information about land cover to generate three-dimensional data cubes rich in spatial and spectral detail [1]. With spectral resolution reaching the nanometer level, it offers significant advantages in the precise identification and classification of ground objects [2]. However, two core challenges persist in achieving detailed classification of hyperspectral remote sensing imagery:

First, hyperspectral images exhibit phenomena such as “same spectral signature, different objects” and “same objects, different spectral signatures” [3]. Combined with the complex distribution and structure of land features, this often leads to classification confusion and misclassification, severely impacting classification accuracy [4]. Second, ground truth annotation is costly and challenging, resulting in limited labeled data for model training. This imbalance between high spectral dimensionality and limited labeled samples exacerbates the Hughes phenomenon in classification [5], where classifier performance initially improves but eventually declines as feature dimensions increase.

To address the challenges of high-dimensionality and sparse samples in hyperspectral images, researchers worldwide have proposed a series of solutions. Early hyperspectral image classification studies mainly relied on traditional machine learning approaches such as Support Vector Machines (SVMs) [6] and Random Forests [7], which have limited ability to model complex spectral–spatial features. With the rapid advancement of computer vision and high-performance computing, deep learning methods have gradually become mainstream. Convolutional Neural Networks (CNNs), leveraging their ability to extract local features, have gained widespread application in hyperspectral image processing. However, constrained by their local receptive field, they struggle to capture global spatial information and cannot effectively highlight differentiated spectral–spatial features [8]. Transformers, based on self-attention mechanisms, demonstrate formidable capabilities in modeling long-range dependencies [9], offering a new direction for hyperspectral image classification. Several Transformer-based approaches have been proposed to improve hyperspectral image classification performance, such as the Graph-Guided Transformer [10]. In addition, recent studies have explored cross-domain few-shot hyperspectral classification, including Deep Few-Shot Learning (DFSL) [11] and Spectral Coordinate Transformation (SCFormer) [12]. The lightweight design of the proposed LGDAF-Net is inspired by the feature-guided CNN for denoising images from portable ultrasound devices, which takes lightweight network design as the core goal and abandons the general-purpose large model architecture to adapt to task-specific data quality challenges. Similar to this work, LGDAF-Net builds a task-specific lightweight CNN–Transformer hybrid network for the cross-domain few-shot HSI classification scenario with heterogeneous sensor data and limited labeled samples. 
LGDAF-Net not only addresses the feature representation challenges caused by cross-domain distribution shifts and few-shot samples but also emphasizes low parameter count and low computational cost, making it suitable for practical deployment in resource-constrained remote sensing systems [13]. However, the existing methods reviewed above are mainly designed for traditional HSI classification with sufficient labeled samples and lack targeted optimization for cross-domain few-shot scenarios, such as cross-domain feature distribution alignment and few-shot feature discrimination enhancement. They also often rely on complex convolutional structures or fail to fully integrate local and global features, resulting in high parameter complexity and limited cross-domain generalization capability.

To address the above challenges, this paper proposes a lightweight cross-domain few-shot hyperspectral image classification framework named LGDAF-Net. The proposed architecture follows a hierarchical feature learning paradigm, which progressively enhances feature representation through three stages: spectral–spatial feature enhancement, local structural perception, and global contextual modeling. Specifically, a spectral–spatial dual-attention preprocessing module (SESA) is first introduced to adaptively recalibrate spectral channels and strengthen spatial feature responses, thereby suppressing redundant spectral information. Subsequently, a Local Attention Spatial Perception Module (LASPM) is designed to capture fine-grained spatial structures by combining grouped convolution and cross-channel feature fusion, which improves the discriminative capability of local spatial features under limited training samples. Finally, a lightweight Transformer-based Global Attention Context Modeling Module (GACM) is employed to model long-range spatial dependencies and capture global contextual relationships in hyperspectral scenes. Furthermore, to enhance feature discriminability and improve cross-domain generalization, kernel triplet loss and domain adversarial learning are incorporated into the training framework. The kernel triplet loss promotes intra-class compactness and inter-class separability in the feature space, while domain adversarial learning aligns feature distributions between source and target domains. Through the collaborative design of these components, LGDAF-Net is able to effectively address the challenges of spectral redundancy, insufficient local structure modeling, and cross-domain feature distribution discrepancy in few-shot hyperspectral classification scenarios.

The main contributions of this paper are summarized as follows:

We propose a lightweight hierarchical CNN–Transformer framework, termed LGDAF-Net, for cross-domain few-shot hyperspectral image classification, enabling effective feature learning under limited labeled samples.
We develop a collaborative feature learning mechanism consisting of SESA, LASPM, and a lightweight Transformer-based GACM to jointly capture fine-grained local structures and long-range spatial dependencies.
Kernel triplet loss and domain adversarial learning are incorporated into the training framework to enhance feature discriminability and improve cross-domain feature alignment while maintaining a compact model structure.

2. Materials and Methods

2.1. Datasets

The experiments use four classic hyperspectral datasets: Chikusei as the source domain dataset (with ample annotated samples), and Indian Pines (IP), Salinas, and Pavia University (UP) as target domain datasets (with few annotated samples). The pseudo-true color images are shown in Figure 1. Detailed information for each dataset is as follows:

The Chikusei dataset was acquired by the Headwall Hyperspec-VNIR-C camera on 29 July 2014, in Chikusei City, Ibaraki Prefecture, Japan. It consists of 128 spectral bands, with an image size of 2517 × 2335 pixels and a spatial resolution of 2.5 m. The dataset includes 19 land cover categories, encompassing rural features such as water bodies and rice paddies, as well as urban features like buildings and parking lots. For this study, over 1000 samples across 14 categories were selected as the source domain training data. Chikusei is selected as the source domain because it contains rich spectral variability and diverse land-cover categories, which improves cross-domain feature learning. Although only one source domain is used, three different target datasets are evaluated to validate cross-domain generalization capability.

The IP dataset was collected in 1992 by the AVIRIS sensor at a farm in Indiana, USA. It comprises 145 × 145 pixels with 200 effective bands. The spatial resolution is 20 m. Sixteen crop types are represented, including alfalfa, corn, wheat, and others.

The Salinas dataset was acquired by the AVIRIS sensor in California’s Salinas Valley, comprising 200 valid bands with an image size of 512 × 217 pixels. The spatial resolution is 3.7 m.

The UP dataset comprises a portion of hyperspectral data acquired by the ROSIS-03 sensor in 2003. Its 103 spectral bands span wavelengths from 430 to 860 nm. With a spatial resolution of 1.3 m, the dataset measures 610 × 340 pixels. It contains nine basic land cover classes, including trees and asphalt roads.

The hyperspectral datasets used in this study are publicly available benchmark datasets widely used in previous hyperspectral image classification studies.

2.2. Implementation Environment

The LGDAF-Net was implemented in PyTorch 1.10.2 using Python 3.6.13. All experiments were performed on a computer with an NVIDIA RTX 4060 Laptop GPU. The model performance was validated via quantitative metrics, parameter settings, comparative experiments, and ablation studies.

2.3. Model Architecture

The overall workflow of the proposed LGDAF-Net classification framework is illustrated in Figure 2. The architecture mainly consists of two components: an input mapping layer and a feature extraction network. Through the collaborative interaction of these modules and the introduction of multiple loss constraints, the model effectively learns discriminative representations and achieves accurate classification for cross-domain few-shot hyperspectral images.

2.3.1. Input Mapping Layer

Hyperspectral remote sensing images collected from different sensor platforms (e.g., AVIRIS and ROSIS) exhibit significant differences in spectral response characteristics and spectral band numbers. Directly feeding these heterogeneous data into a shared feature extraction network may lead to structural inconsistencies and distribution shifts. To alleviate this problem, the proposed input mapping layer adopts a two-stage processing strategy.

First, channel unification mapping: Independent 1 × 1 convolution operations are applied to the raw hyperspectral images from the source and target domains to perform channel mapping. This operation converts the original spectral bands into a unified 64-channel feature representation, thereby eliminating the inconsistency in channel dimensions caused by sensor differences.

Second, feature distribution normalization: Batch Normalization (BN) is introduced to normalize the mapped 64-channel features, stabilizing the feature distribution across channels. This strategy effectively reduces the distribution discrepancy between the source and target domains while improving gradient stability during training. As a result, the training efficiency and convergence speed of the model are enhanced under few-shot conditions.

The feature extraction network, termed LGDAF-Net, consists of three main components: a spectral–spatial dual-attention preprocessing module (SEBlock & Self-Attention, referred to as SESA), a local spatial perception module (LASPM), and a global context modeling module (GACM), as illustrated in Figure 3. Through the collaborative operation of these modules, the network can effectively capture and fuse both local fine-grained features and global semantic information.

The 1 × 1 convolution in the input mapping layer does not cause the loss of discriminative spectral information. Instead, it adaptively learns the channel weight matrix through end-to-end training, which maps the original high-dimensional spectral bands (103–200 bands) to 64-dimensional feature space while retaining the key spectral response characteristics of ground objects. Combined with subsequent batch normalization and the channel attention mechanism in the SESA module, redundant spectral information is further suppressed and informative spectral bands are highlighted, thus realizing efficient compression and discriminative preservation of spectral features.
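The two-stage input mapping described in this subsection can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: a 1 × 1 convolution is treated as a per-pixel matrix product, batch normalization is shown without learnable scale and shift, and the helper names (`conv1x1`, `batch_norm`, `make_mapping`) and the random weights are hypothetical.

```python
import math
import random

def conv1x1(pixels, weight):
    """Apply a 1x1 convolution: each pixel's band vector is multiplied by `weight`.

    pixels: list of pixels, each a list of `in_bands` floats.
    weight: `out_channels` rows, each with `in_bands` floats.
    """
    return [[sum(w * x for w, x in zip(row, px)) for row in weight] for px in pixels]

def batch_norm(features, eps=1e-5):
    """Normalize every channel to zero mean and unit variance over the batch."""
    n, channels = len(features), len(features[0])
    out = [[0.0] * channels for _ in range(n)]
    for c in range(channels):
        col = [f[c] for f in features]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][c] = (col[i] - mean) / math.sqrt(var + eps)
    return out

def make_mapping(in_bands, out_channels=64, seed=0):
    """Hypothetical random initialisation of one domain-specific 1x1 kernel."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(in_bands)] for _ in range(out_channels)]

rng = random.Random(1)
source_pixels = [[rng.random() for _ in range(128)] for _ in range(8)]  # Chikusei: 128 bands
target_pixels = [[rng.random() for _ in range(103)] for _ in range(8)]  # UP: 103 bands

# Independent mappings per domain, both ending in the same 64-channel space.
source_feat = batch_norm(conv1x1(source_pixels, make_mapping(128, seed=2)))
target_feat = batch_norm(conv1x1(target_pixels, make_mapping(103, seed=3)))
print(len(source_feat[0]), len(target_feat[0]))
```

Both domains end with 64 channels per pixel, so a shared feature extractor can consume either input regardless of the original band count.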

2.3.2. Spectral–Spatial Dual Attention Preprocessing Module (SESA)

To effectively capture spatial structural information and highlight informative spectral channels at the early stage of the network, the proposed SESA module integrates Channel Attention (SEBlock) [14] and Spatial Attention (Self-Attention) [15]. This module enhances the representation capability of the input 64-channel features by jointly modeling channel-wise importance and spatial dependencies. The processing procedure is described as follows:

First, the 64-channel feature maps are fed into the SEBlock. Global average pooling is applied to compress the spatial dimensions and obtain channel descriptors. Two fully connected layers are then used to learn channel-wise weights, which are subsequently applied to the original feature maps to perform channel reweighting. This operation emphasizes informative spectral channels while suppressing redundant feature responses.
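The channel reweighting step can be illustrated with a small sketch of a squeeze-and-excitation block. The ReLU and sigmoid activations follow the common SE design, and the reduction ratio, toy weights, and function name `se_block` are assumptions made for illustration; the text above does not specify them.

```python
import math

def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation on one sample.

    feature_map: list of C channels, each a flat list of H*W values.
    w1: (C/r) x C weights of the first fully connected layer (ReLU follows).
    w2: C x (C/r) weights of the second fully connected layer (sigmoid follows).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    desc = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * d for w, d in zip(row, desc))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: every channel is multiplied by its gate in (0, 1).
    scaled = [[g * v for v in ch] for g, ch in zip(gates, feature_map)]
    return scaled, gates

# Toy example with 4 channels, 2x2 spatial size, reduction ratio 2.
fmap = [[1.0] * 4, [2.0] * 4, [0.0] * 4, [4.0] * 4]
w1 = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [1.0, 1.0]]
scaled, gates = se_block(fmap, w1, w2)
print([round(g, 3) for g in gates])
```

Channels with a gate near 1 pass almost unchanged, while channels with a gate near 0 are suppressed, which is the mechanism behind "emphasizing informative spectral channels".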

After channel recalibration, the resulting features are further processed by a 3 × 3 convolutional layer to enhance spatial feature representation. The processed features are then fed into the Self-Attention module.

The Self-Attention module captures spatial contextual relationships by computing correlation coefficients between spatial positions. The feature maps are reshaped into a sequence form to facilitate the modeling of long-range spatial dependencies. Through this mechanism, the network can effectively capture global spatial interactions and improve feature representation. The formulation of the attention mechanism is defined as follows:

$\tilde{X} = F_{SA}(F_{SE}(X)), \quad X \in \mathbb{R}^{B \times 64 \times H \times W}$

(1)

In the formula, $H$ denotes the vertical dimension of the feature map; $W$ denotes the horizontal dimension; $B$ denotes the number of samples in the batch; and $F_{SA}$ and $F_{SE}$ represent the Self-Attention and SEBlock computation processes, respectively.

The global self-attention module in SESA has a theoretical computational complexity of $O((H \times W)^{2} \times C)$ for an input feature map of size $H \times W \times C$ ($C = 64$ in this study), where $O((H \times W)^{2})$ is the complexity of calculating the attention matrix between spatial positions, and $O(C)$ is the complexity of feature projection and weighting. To balance spatial dependency modeling and computational cost, the input feature map of SESA is the compressed 64-channel feature after 1 × 1 convolution, and the subsequent GACM adopts a lightweight multi-head self-attention with feature dimension reduction, which jointly ensures the computational efficiency of the whole model.
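As a rough worked example of the quadratic term, the following sketch counts multiply-accumulate operations for the two $N \times N$ products of one attention pass on a 9 × 9 patch with 64 channels (the patch size used in the experiments). The counting convention and the function name `attention_cost` are illustrative assumptions; projections and softmax are ignored.

```python
def attention_cost(height, width, channels):
    """Rough multiply-accumulate count for one global self-attention pass.

    Counts only the two N x N products (Q K^T and the weighted sum with V),
    which is where the O((H*W)^2 * C) term comes from.
    """
    n = height * width              # number of spatial positions
    scores = n * n * channels       # Q K^T: N x N dot products of length C
    aggregate = n * n * channels    # softmax(scores) V: N x N weights over C values
    return n, scores + aggregate

n, macs = attention_cost(9, 9, 64)
print(n, macs)  # 81 839808
```

With only 81 positions per patch the quadratic term stays below one million operations, which suggests why patch-level global attention remains affordable here.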

2.3.3. Local Attention Spatial Perception Module (LASPM)

In hyperspectral remote sensing imagery, local spatial features such as edge structures, texture patterns, and fine-grained object details provide important cues for accurate category discrimination. However, conventional convolutional layers have limited receptive fields, which restrict their ability to effectively capture complex local spatial structures.

To address this limitation, a Local Attention Spatial Perception Module (LASPM) is introduced to enhance the representation capability of local structural features. The module improves local feature extraction through grouped convolutions, cross-channel feature fusion, and channel attention weighting. The architecture of the LASPM is illustrated in Figure 4, and its processing procedure is described as follows:

First, the 64-channel feature map $X \in \mathbb{R}^{B \times 64 \times H \times W}$, obtained from the SESA module, is divided into four channel groups, each containing 16 channels. A 3 × 3 convolution operation is then applied independently to each group to extract local spatial features within the corresponding channel subset. The operation can be formulated as follows:

$X^{(k)} = \mathrm{Conv}_{3 \times 3}(X^{(k)}), \quad k = 1, \dots, 4$

(2)

$X_{gwc} = \mathrm{Concat}(X^{(1)}, \dots, X^{(4)})$

(3)

In the equation, $X^{(k)}$ denotes the output of the $k$th convolution group; $X_{gwc}$ represents the feature concatenated along the channel dimension, with the channel dimension restored to 64 channels.
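To make the saving from the grouped convolution in Eqs. (2) and (3) concrete, the sketch below compares weight counts for a four-group 3 × 3 convolution against an ordinary 64 → 64 convolution, and shows the split and concatenation on channel indices. It assumes each group maps 16 input channels to 16 output channels and excludes biases; the helper names are hypothetical.

```python
def conv_weights(in_ch, out_ch, kernel=3, groups=1):
    """Number of weights (biases excluded) in a 2D convolution with `groups` groups."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    return groups * (in_ch // groups) * (out_ch // groups) * kernel * kernel

def split_channels(channels, groups=4):
    """Split a list of channels into `groups` contiguous, equally sized chunks."""
    size = len(channels) // groups
    return [channels[k * size:(k + 1) * size] for k in range(groups)]

grouped = conv_weights(64, 64, kernel=3, groups=4)  # four independent 16 -> 16 convs
dense = conv_weights(64, 64, kernel=3, groups=1)    # one ordinary 64 -> 64 conv
chunks = split_channels(list(range(64)), groups=4)
concat = [c for chunk in chunks for c in chunk]     # Concat restores the 64 channels
print(grouped, dense, [len(c) for c in chunks], len(concat))
# 9216 36864 [16, 16, 16, 16] 64
```

Under these assumptions the grouped layer uses a quarter of the weights of the dense layer, at the cost of no information exchange between groups, which is what the cross-channel fusion step is meant to compensate for.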

To prevent potential information isolation between different channels caused by grouped convolutions, a cross-channel fusion mechanism is introduced. The 64 channels of $X_{gwc}$ are divided into left and right halves (32 channels each). Through lightweight 1 × 1 convolutions, information is reintegrated along the channel dimension, enabling interaction and collaboration among channel features. This enhances feature relevance and consistency. The formula is as follows:

$X_{ccf} = \mathrm{Conv}_{1 \times 1}([X_{left}, X_{right}])$

(4)

In the equation, $X_{ccf}$ represents the cross-channel fusion output, while $X_{left}$ and $X_{right}$ denote the left and right components of the 64-channel features, respectively.

Next, an SEBlock is incorporated at the end of the module to perform channel-wise feature recalibration. Global average pooling is first applied to generate channel descriptors, followed by nonlinear transformations to learn channel importance weights. The learned weights are then used to adaptively rescale the feature channels, thereby enhancing informative spectral responses while suppressing redundant features.

Finally, the refined feature maps are compressed using a 1 × 1 convolution to reduce the channel dimensionality. Subsequently, global average pooling is applied to generate fixed-length feature vectors, which serve as local feature representations for subsequent feature fusion modules.

2.3.4. Global Attention Context Modeling Module (GACM)

Convolutional neural networks (CNNs) typically suffer from limited receptive fields, which restrict their ability to capture long-range spatial dependencies in hyperspectral imagery. In HSI, ground objects often exhibit irregular shapes and sparse spatial distributions, making global contextual modeling essential for representing overall semantic structures.

To address this limitation, a Global Attention Context Modeling Module (GACM) is introduced based on a lightweight Transformer architecture. The module employs a Multi-Head Self-Attention (MHSA) mechanism together with a Feed-Forward Network (FFN) to capture cross-regional spatial dependencies. The overall architecture of the GACM is illustrated in Figure 5.

The 64-channel 2D feature map output from the SESA module is flattened into a sequence format ($X' = \mathrm{reshape}(X) \in \mathbb{R}^{B \times N \times 64}$, where $N = H \cdot W$ represents the number of spatial positions). This is then mapped through a linear layer into a fixed-dimensional vector ($X'' = X' W_{in}$, $W_{in} \in \mathbb{R}^{64 \times 64}$, $X'' \in \mathbb{R}^{B \times N \times 64}$), adapting it to the input requirements of the Transformer architecture.

Then, using the MHSA mechanism, query, key, and value vectors are generated for each position and projected onto a 64-dimensional space. Similarity weights between the Query of each position and the Keys of all positions are computed, and the Values are aggregated according to these weights. Eight attention heads model spatial dependencies in parallel across different levels and directions. Finally, the multi-head outputs are concatenated and mapped through a weight matrix ($W_{O} \in \mathbb{R}^{64 \times 64}$), generating global dependency features. The formula is as follows:

$\mathrm{head}_{i} = \mathrm{Softmax}\left(\frac{Q_{i} K_{i}^{T}}{\sqrt{8}}\right) V_{i}, \quad i = 1, \dots, 8$

(5)

$Y = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{8}) W_{O}$

(6)

In the equation, $\mathrm{head}_{i}$ denotes the output of the $i$th attention head, where $Q_{i}$, $K_{i}$, and $V_{i}$ represent the Query, Key, and Value vectors of the $i$th attention head, respectively. The scaling factor $\sqrt{8}$ corresponds to the per-head dimension $64 / 8 = 8$.
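The head-wise computation in Eqs. (5) and (6) can be sketched without any deep learning library. In this illustrative example the learned Q, K, and V projections and the output matrix $W_{O}$ are omitted, and a single random tensor stands in for all three inputs; only the split into eight heads of width 8, the scaled softmax attention, and the concatenation are shown. The function names are hypothetical.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are N x d_k lists."""
    d_k = len(q[0])
    out, weights = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(d_k)])
    return out, weights

def split_heads(x, num_heads=8):
    """Slice N x 64 features into `num_heads` blocks of 64 / num_heads columns."""
    d_k = len(x[0]) // num_heads
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(num_heads)]

rng = random.Random(0)
n_positions, dim = 81, 64   # 9 x 9 patch, 64 channels
# For brevity the same random tensor stands in for the projected Q, K and V.
x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_positions)]
heads = [attention_head(h, h, h)[0] for h in split_heads(x)]
# Concat(head_1, ..., head_8): join the eight N x 8 outputs back into N x 64.
concat = [[val for head in heads for val in head[i]] for i in range(n_positions)]
print(len(concat), len(concat[0]))  # 81 64
```

Each head attends over all 81 positions but only sees 8 of the 64 feature dimensions, so the division by $\sqrt{8}$ keeps the dot products at a comparable scale before the softmax.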

Next, a Feed-Forward Network (FFN) is introduced to enhance the nonlinear representation capability of global features. The FFN consists of two fully connected layers with channel dimensions of 64→256→64. In addition, Layer Normalization and residual connections are incorporated to stabilize the training process and facilitate efficient feature propagation. The formulation of the FFN operation can be expressed as follows:

$X_{attn} = \mathrm{LayerNorm}(X'' + Y)$

(7)

$X_{ffn} = \mathrm{LayerNorm}(X_{attn} + \mathrm{MLP}(X_{attn}))$

(8)

In the equation, MLP refers to a two-layer fully connected network.
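Eqs. (7) and (8) describe a post-normalization residual layout, which the following sketch applies to a single token. The ReLU activation, the presence of bias terms, and Layer Normalization without learned scale and shift are assumptions for illustration, as are all function names; the parameter count covers only the 64→256→64 network under those assumptions.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def linear(vec, weight, bias):
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def mlp(vec, w1, b1, w2, b2):
    """Two fully connected layers; ReLU is an assumed activation."""
    hidden = [max(0.0, h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

def encoder_tail(x_in, y_attn, w1, b1, w2, b2):
    """Eq. (7) followed by Eq. (8) for a single token."""
    x_attn = layer_norm([a + b for a, b in zip(x_in, y_attn)])
    ffn_out = mlp(x_attn, w1, b1, w2, b2)
    return layer_norm([a + b for a, b in zip(x_attn, ffn_out)])

def ffn_parameters(d_model=64, d_hidden=256):
    """Weights plus biases of the 64 -> 256 -> 64 network."""
    return (d_model * d_hidden + d_hidden) + (d_hidden * d_model + d_model)

print(ffn_parameters())  # 33088
```

Under these assumptions the FFN holds about 33k parameters, which gives a sense of scale for the "lightweight" Transformer block.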

Finally, the sequence features produced by the FFN are reshaped back into a two-dimensional spatial structure while preserving the original spatial dimensions. A 1 × 1 convolution is then applied to reduce the channel dimensionality to 32. Subsequently, global average pooling is performed to compress the feature maps into fixed-length vectors. These vectors serve as global feature representations for the subsequent feature fusion module.

2.3.5. Loss Functions and Training Strategy

To achieve cross-domain few-shot hyperspectral image classification, this paper designs a multi-task joint loss function, which consists of Cross-Entropy Loss ($L_{CE}$), Kernel Triplet Loss ($L_{hard}$), and Domain Adversarial Loss ($L_{domain}$). A phased training strategy is adopted to optimize the model progressively.

1. Cross-Entropy Loss
This loss is used for supervised classification learning of samples from both the source and target domains, and its formula is defined as:

$L_{CE} = - \frac{1}{N} \sum_{i = 1}^{N} y_{i} \log \hat{y}_{i}$

(9)

where $N$ denotes the batch size, $y_{i}$ is the ground-truth label of sample $i$, and $\hat{y}_{i}$ represents the predicted class probability of the sample output by the model.
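A small numerical example may help in reading Eq. (9). The sketch assumes one-hot labels, so that the product $y_{i} \log \hat{y}_{i}$ reduces to the log-probability of the true class; the probabilities below are made-up values.

```python
import math

def cross_entropy(probabilities, labels):
    """Mean negative log-probability of the true class, as in Eq. (9).

    probabilities: N rows of class probabilities (each row sums to 1).
    labels: N integer class indices; a one-hot y_i selects the true-class term.
    """
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / n

probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
labels = [0, 1, 2]
loss = cross_entropy(probs, labels)
print(round(loss, 4))  # 0.4987
```

The third sample, predicted with only 0.4 confidence, contributes the largest share of the loss.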

2. Kernel Triplet Loss
To enhance the discriminability of extracted features, this paper designs a kernel triplet loss by introducing kernel mapping based on the standard triplet loss. Unlike the standard triplet loss, which calculates distances only on the original features, the proposed loss maps features into a high-dimensional space via a kernel function to capture non-linear feature relationships more effectively. The formula is:

$L_{hard} = \max(0, d_{pos} - d_{neg} + \alpha)$

(10)

where $d_{pos}$ is the distance between the anchor sample and positive sample in the kernel space, $d_{neg}$ is the distance between the anchor sample and negative sample, and $\alpha$ is the margin hyperparameter, which is set to 0.3 in this work.
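Eq. (10) can be illustrated with a kernel-induced distance. The kernel function is not specified in the text, so the Gaussian (RBF) kernel and the bandwidth `gamma` used below are assumptions chosen purely for illustration, as are the function names; only the margin of 0.3 comes from the paper.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; the kernel choice and `gamma` are assumptions."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq)

def kernel_distance(a, b, gamma=0.5):
    """Distance in the kernel-induced space: sqrt(k(a,a) - 2k(a,b) + k(b,b))."""
    sq = rbf_kernel(a, a, gamma) - 2.0 * rbf_kernel(a, b, gamma) + rbf_kernel(b, b, gamma)
    return math.sqrt(max(sq, 0.0))

def kernel_triplet_loss(anchor, positive, negative, margin=0.3, gamma=0.5):
    d_pos = kernel_distance(anchor, positive, gamma)
    d_neg = kernel_distance(anchor, negative, gamma)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
easy = kernel_triplet_loss(anchor, [0.1, 0.0], [3.0, 0.0])  # negative far away
hard = kernel_triplet_loss(anchor, [1.0, 0.0], [1.1, 0.0])  # negative almost as close
print(easy, round(hard, 4))
```

The easy triplet already satisfies the margin and yields zero loss, whereas the hard triplet is penalized because its negative is barely farther from the anchor than its positive.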

3. Domain Adversarial Loss
This loss is used to align the feature distributions of the source and target domains, and adversarial training is implemented via the Gradient Reversal Layer (GRL). The goal of the domain classifier is to distinguish samples from the source and target domains, while the goal of the feature encoder is to generate domain-invariant features that cannot be correctly classified by the domain classifier. The loss formula is:

$L_{domain} = - \frac{1}{M} \sum_{j = 1}^{M} [z_{j} \log \hat{z}_{j} + (1 - z_{j}) \log (1 - \hat{z}_{j})]$

(11)

where $M$ is the total number of samples from the source and target domains, $z_{j}$ is the domain label ($z_{j} = 1$ for the source domain, $z_{j} = 0$ for the target domain), and $\hat{z}_{j}$ is the domain probability predicted by the domain classifier. The weight $\lambda$ of the gradient reversal layer is dynamically adjusted during training: it is initialized to 0.1 and increases linearly to 1.0 as the training episodes progress.
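The domain loss of Eq. (11) and the linear schedule of the GRL weight can be sketched as follows. The text does not state over which episode range the ramp runs, so spreading it over all 2000 training episodes is an assumption, and the function names are hypothetical. The gradient reversal itself (identity in the forward pass, gradient multiplied by $-\lambda$ in the backward pass) requires an autograd framework and is not shown.

```python
import math

def grl_lambda(episode, total_episodes=2000, start=0.1, end=1.0):
    """Linear ramp of the gradient-reversal weight from `start` to `end`."""
    progress = min(max(episode / total_episodes, 0.0), 1.0)
    return start + (end - start) * progress

def domain_loss(pred_source_prob, domain_labels, eps=1e-12):
    """Binary cross-entropy of Eq. (11); label 1 = source, 0 = target."""
    total = 0.0
    for z_hat, z in zip(pred_source_prob, domain_labels):
        z_hat = min(max(z_hat, eps), 1.0 - eps)   # avoid log(0)
        total += z * math.log(z_hat) + (1 - z) * math.log(1.0 - z_hat)
    return -total / len(domain_labels)

print(round(grl_lambda(0), 2), round(grl_lambda(1000), 2), round(grl_lambda(2000), 2))
print(round(domain_loss([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]), 4))  # 0.6931
```

A domain classifier that outputs 0.5 for every sample cannot tell the domains apart, and its loss equals $\ln 2 \approx 0.693$; this is the value the feature encoder pushes towards through the reversed gradient.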

4. Total Loss Function and Phased Training
A phased training strategy is adopted in this work, and the total loss function $L_{total}$ is dynamically adjusted according to the training episode:
Stage 1 (episode ≤ 500): Only the target-domain losses (cross-entropy and kernel triplet) are optimized.

$L_{total} = L_{CE}^{target} + \lambda_{hard} \cdot L_{hard}^{target}$

(12)

Stage 2 (500 < episode ≤ 1000): The source-domain losses are added.

$L_{total} = L_{CE}^{target} + \lambda_{hard} \cdot L_{hard}^{target} + \lambda_{source} \cdot (L_{CE}^{source} + \lambda_{hard} \cdot L_{hard}^{source})$

(13)

Stage 3 (episode > 1000): Domain adversarial loss is added to achieve feature alignment.

$L_{total} = L_{CE}^{target} + \lambda_{hard} \cdot L_{hard}^{target} + \lambda_{source} \cdot (L_{CE}^{source} + \lambda_{hard} \cdot L_{hard}^{source} + \lambda_{domain} \cdot L_{domain})$

(14)

The hyperparameter weights are set as follows: $\lambda_{hard}$ = 0.5, $\lambda_{source}$ = 0.3, and $\lambda_{domain}$ = 0.2.
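The three-stage schedule of Eqs. (12) to (14) amounts to a simple switch on the episode index. The sketch below combines already-computed scalar loss values; the function name, argument names, and example loss values are hypothetical, while the thresholds and weights are those stated above.

```python
def total_loss(episode, ce_t, hard_t, ce_s=0.0, hard_s=0.0, domain=0.0,
               lam_hard=0.5, lam_source=0.3, lam_domain=0.2):
    """Combine the individual loss values according to Eqs. (12)-(14)."""
    target_term = ce_t + lam_hard * hard_t
    if episode <= 500:                       # Stage 1: target-domain losses only
        return target_term
    source_term = ce_s + lam_hard * hard_s
    if episode <= 1000:                      # Stage 2: add source-domain losses
        return target_term + lam_source * source_term
    # Stage 3: the domain term sits inside the lam_source bracket, as in Eq. (14).
    return target_term + lam_source * (source_term + lam_domain * domain)

losses = dict(ce_t=1.0, hard_t=0.4, ce_s=0.8, hard_s=0.2, domain=0.7)
for ep in (100, 800, 1500):
    print(ep, round(total_loss(ep, **losses), 4))
# 100 1.2
# 800 1.47
# 1500 1.512
```

Because the domain term is nested inside the source bracket, its effective weight on the total loss is $\lambda_{source} \cdot \lambda_{domain} = 0.06$.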

2.4. Experimental Setup and Cross-Domain Few-Shot Training Protocol

2.4.1. Basic Experimental Parameters

Input samples were cropped into 9 × 9 spatial cubes, with each cube containing all spectral channels. The SGD optimizer is used with an initial learning rate of 0.001, a training batch size of 16, a testing batch size of 64, and 2000 training episodes.

2.4.2. Support/Query Set Construction in Each Episode

The model is trained in an episodic batch-wise manner, which follows the standard protocol of cross-domain few-shot HSI classification. In each training episode, the support set and query set are constructed from the source domain (Chikusei dataset with sufficient labeled samples) and target domain (IP/UP/Salinas dataset with limited labeled samples) separately:

Source domain support set: In each episode, labeled samples are randomly sampled from the source domain data loader. These samples provide sufficient supervised prior knowledge for the model, and are used for classification learning and intra-class/inter-class feature discriminability enhancement.
Target domain support set: 15% of labeled samples per land cover class from the target dataset are randomly sampled in each episode, which is the only labeled information from the target domain used for model training. This part of samples provides few-shot supervised guidance to adapt the model to the target domain data distribution.
Target domain query set: Unlabeled samples from the target dataset are sampled in each episode. These samples are only used for cross-domain feature alignment via domain adversarial learning, and their category labels are completely invisible to the model during the entire training process, which strictly follows the few-shot learning setting.

Different from the standard N-way K-shot episodic meta-learning paradigm that strictly limits the number of classes and samples per episode to a fixed setting, the proposed framework adopts a flexible batch sampling strategy that is more suitable for heterogeneous cross-domain HSI data, while retaining the core advantage of episodic training to improve the model’s few-shot generalization ability.

2.4.3. Target Label Usage Rule in Cross-Domain Few-Shot Setting

To avoid data leakage and strictly comply with the cross-domain few-shot learning specification, the usage of target domain labels is clearly constrained during the entire training process: (1) only the 15% labeled target samples (support set) are involved in the calculation of the cross-entropy loss and kernel triplet loss during training, to provide few-shot supervised guidance; (2) the category labels of the remaining 85% of target samples (query set/test set) are never exposed to the model during training and are only used for the final performance evaluation after the model training is completed; (3) in the domain adversarial training stage, only the domain binary label (1 for source domain, 0 for target domain) is used for the target query set, and the category labels of these samples are never involved in any loss calculation or model optimization.

2.4.4. Dataset Fold Division and Validation Protocol

To ensure the fairness, reproducibility, and robustness of the experimental results, we adopt a 5-fold cross-validation strategy for all three target domain datasets, with a fixed random seed to eliminate randomness:

For each target dataset, all labeled samples are randomly divided into 5 independent folds, with the same land cover class proportion maintained in each fold to avoid data imbalance;
In each independent experiment, one fold is used as the labeled target support set for training, and the remaining 4 folds are used as the held-out test set for performance evaluation;
All comparative experiments (including the proposed LGDAF-Net and two state-of-the-art methods) are conducted under the exact same data division and random seed setting, to ensure the fairness of the comparison;
All experiments are repeated 5 times independently, and the final reported results are the mean and standard deviation (SD) of the 5 independent runs, to verify the stability of the proposed method.
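The fold division described above can be sketched as a class-wise round-robin assignment with a fixed seed. This is one simple way to keep class proportions similar across folds, not necessarily the procedure used by the authors; the function name, the toy label vector, and the seed value are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign every sample index to one of `n_folds` folds, class by class."""
    rng = random.Random(seed)          # fixed seed -> identical split for every method
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)   # round-robin keeps class ratios similar
    return folds

labels = [0] * 50 + [1] * 30 + [2] * 20
folds = stratified_folds(labels, n_folds=5, seed=42)
support = folds[0]                                  # one fold: labeled target support set
test = [i for fold in folds[1:] for i in fold]      # remaining four folds: held-out test set
print(len(support), len(test))  # 20 80
```

Reusing the same seed for every compared method reproduces the identical support and test indices, which is what makes the comparison fair.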

2.5. Evaluation Metrics

Overall accuracy (OA), average accuracy (AA), and Kappa coefficient were used to evaluate classification performance. All experiments were repeated five times, and the mean and standard deviation (SD) of the results are reported.
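For reference, the three metrics can be computed from a confusion matrix as in the sketch below. The definitions are the standard ones (OA as the fraction of correct predictions, AA as the mean per-class recall, Cohen's kappa as chance-corrected agreement); the confusion matrix is a made-up example and does not correspond to any result in this paper.

```python
def classification_metrics(confusion):
    """OA, AA and Cohen's kappa from a square confusion matrix (rows = true class)."""
    total = sum(sum(row) for row in confusion)
    n_classes = len(confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    oa = correct / total
    # AA: mean of per-class recall, so small classes weigh as much as large ones.
    aa = sum(confusion[i][i] / sum(confusion[i]) for i in range(n_classes)) / n_classes
    # Kappa: agreement corrected for the agreement expected by chance.
    col_sums = [sum(confusion[r][c] for r in range(n_classes)) for c in range(n_classes)]
    expected = sum(sum(confusion[i]) * col_sums[i] for i in range(n_classes)) / (total * total)
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

confusion = [[45, 5, 0],
             [3, 25, 2],
             [0, 4, 16]]
oa, aa, kappa = classification_metrics(confusion)
print(round(oa, 4), round(aa, 4), round(kappa, 4))  # 0.86 0.8444 0.7749
```

AA and OA diverge whenever class sizes are imbalanced, which is common in hyperspectral benchmarks, so reporting both gives a fuller picture than either alone.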

2.6. Comparative Methods

To validate the performance of the proposed LGDAF-Net for cross-domain few-shot HSI classification, two state-of-the-art comparative methods are selected: CFSL-KT [16] and CTF-SSCL [17]. Both were recently published in IEEE Transactions on Geoscience and Remote Sensing (TGRS) for cross-domain few-shot HSI classification, and both adopt a CNN–Transformer hybrid architecture consistent with LGDAF-Net, which makes the comparison more targeted. In addition, both methods have been validated on the same benchmark datasets (IP, UP, Salinas), which supports the fairness and comparability of the experimental comparison.

3. Results

3.1. Quantitative Comparison with State-of-the-Art Methods

Performance comparisons were conducted on the IP, UP, and Salinas datasets. The experimental results are summarized in Table 1, Table 2 and Table 3.

As shown in Table 1, Table 2 and Table 3, the proposed LGDAF-Net achieves the highest OA and Kappa coefficient among the compared methods on all three datasets, and the highest AA on the IP and UP datasets. On the Salinas dataset, CFSL-KT attains a slightly higher AA (98.71% versus 98.44%).

The proposed method achieves an OA of 90.49 ± 0.67%, an AA of 94.29 ± 0.76%, and a Kappa of 89.20 ± 0.76% on the IP dataset over four independent runs.

The best performance on the UP dataset reaches 98.16% OA, 96.62% AA, and 97.56% Kappa. Over five independent runs, the average results are 97.19 ± 0.77% OA, 95.81 ± 0.92% AA, and 96.28 ± 1.02% Kappa, demonstrating stable classification performance.

The best performance on the Salinas dataset reaches 98.46% OA, 98.44% AA, and 98.29% Kappa. Over four independent runs, the average results are 98.09 ± 0.75% OA, 98.14 ± 0.60% AA, and 97.87 ± 0.84% Kappa, demonstrating stable classification performance.

To evaluate the robustness of the proposed method, we repeated the experiments multiple times and report the mean ± standard deviation. The results demonstrate stable classification performance.

IP Dataset: On the IP dataset, LGDAF-Net achieves an OA of 91.31%, showing an improvement of 2.03% compared with the second-best method CFSL-KT. This improvement can be attributed to the effective integration of local spatial perception and global contextual modeling, which enhances feature discriminability under few-shot conditions.

UP Dataset: On the UP dataset, LGDAF-Net obtains an OA of 98.16% and a Kappa coefficient of 97.56%. The proposed model benefits from the Transformer-based global modeling module, which captures long-range spatial dependencies and improves classification performance.

Salinas Dataset: Similarly, LGDAF-Net achieves an OA of 98.46% on the Salinas dataset. The experimental results indicate that the proposed framework maintains strong generalization capability across different hyperspectral datasets.

CTF-SSCL: The CTF-SSCL framework adopts a sequential CNN–Transformer architecture, which may limit its ability to jointly capture local and global features. In addition, its attention mechanism mainly focuses on spatial grouping, while channel-level feature selection is less emphasized, which may affect classification performance.

CFSL-KT: The CFSL-KT method extracts features mainly through convolutional layers without Transformer-based global contextual modeling. As a result, its capability to capture long-range spatial dependencies may be limited, leading to relatively lower performance.

LGDAF-Net: In contrast, the proposed LGDAF-Net integrates the LASPM and GACM to jointly capture local spatial structures and global contextual information. Furthermore, the combination of kernel triplet loss and domain adversarial learning improves feature discriminability and cross-domain generalization, resulting in competitive classification performance among the compared methods.

In this study, Chikusei is selected as the only source domain because it contains rich land cover types and sufficient labeled samples, which can provide comprehensive spectral–spatial feature priors for the target domains with few samples. Although source–target domain permutation or multi-source domain experiments are not conducted in this paper, the introduced domain adversarial learning with gradient reversal layer and kernel triplet loss effectively aligns the feature distributions between Chikusei and the three target domains, and the stable classification performance on different target datasets verifies the cross-domain robustness of LGDAF-Net. Multi-source domain training and source–target permutation experiments will be carried out in future work to further enhance the generalization ability of the model.
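For readers unfamiliar with the gradient reversal layer mentioned above, its usual definition in domain-adversarial training is given below. This is the standard formulation and is not taken from this paper; the symbols (the reversal strength \lambda, the learning rate \mu, the feature-extractor parameters \theta_f, and the task and domain losses L_{cls} and L_{dom}) are assumptions introduced only for illustration.

```latex
% Forward pass: identity. Backward pass: the gradient is negated and scaled.
R_\lambda(\mathbf{x}) = \mathbf{x},
\qquad
\frac{\partial R_\lambda}{\partial \mathbf{x}} = -\lambda \mathbf{I}

% Resulting update of the feature extractor when the domain classifier
% is attached through R_\lambda:
\theta_f \leftarrow \theta_f - \mu \left(
  \frac{\partial L_{cls}}{\partial \theta_f}
  - \lambda \frac{\partial L_{dom}}{\partial \theta_f}
\right)
```

The domain classifier itself is still trained to minimize L_{dom}, so the feature extractor is pushed toward features on which the source and target domains are hard to tell apart.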

3.2. Classification Map Visualization

Additionally, the classification maps for the IP, UP, and Salinas datasets are presented in Figure 6, Figure 7 and Figure 8.

Compared with the competing methods, LGDAF-Net generates smoother classification maps with fewer scattered misclassifications. Moreover, the boundaries between different land-cover categories are more clearly preserved.

This improvement can be attributed to the collaborative design of the LASPM and GACM. The LASPM enhances the extraction of local spatial structures, while the GACM captures global contextual dependencies. As a result, the proposed method achieves more consistent spatial distributions and improved classification reliability in hyperspectral images.

To further analyze the misclassified regions, the error maps of the classification results for the three datasets are shown in Figure 9, Figure 10 and Figure 11. In the error maps, red pixels indicate misclassified samples, while black pixels represent correctly classified regions. The misclassifications of all methods are mainly concentrated in boundary areas between different land-cover categories, where spectral–spatial characteristics are highly similar. Compared with CFSL-KT and CTF-SSCL, the proposed LGDAF-Net produces noticeably fewer misclassified points, particularly in complex boundary regions. This improvement can be attributed to the collaborative modeling of fine-grained local spatial features by LASPM and long-range spatial dependencies by GACM, which enhances the feature discrimination capability for similar ground objects.
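An error map of this kind can be produced from a ground-truth map and a prediction map as sketched below. The red/black coding follows the description above; treating label 0 as unlabeled background (drawn black and excluded from the error rate) is an assumption, as are the function name and the toy arrays.

```python
import numpy as np

def error_map(ground_truth, prediction, background=0):
    """Return an RGB error map (red = misclassified) and the error rate."""
    gt = np.asarray(ground_truth)
    pred = np.asarray(prediction)
    labeled = gt != background          # assumption: label 0 marks unlabeled pixels
    wrong = labeled & (gt != pred)
    rgb = np.zeros(gt.shape + (3,), dtype=np.uint8)  # black by default
    rgb[wrong] = (255, 0, 0)            # red for misclassified labeled pixels
    error_rate = wrong.sum() / labeled.sum()
    return rgb, error_rate

# Toy 2 x 3 scene (hypothetical labels; 0 = unlabeled background).
gt = np.array([[1, 1, 2],
               [0, 2, 2]])
pred = np.array([[1, 2, 2],
                 [1, 2, 1]])
rgb, rate = error_map(gt, pred)
print(rate)
```

Counting the red pixels per method gives a simple numerical companion to the visual comparison of the error maps.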

3.3. Performance Under Different Few-Shot Settings

To further validate the effectiveness and robustness of the proposed LGDAF-Net in practical cross-domain scenarios, we conduct extensive experiments on the Indian Pines (IP) dataset, which is a widely used benchmark for hyperspectral image classification. Specifically, we evaluate the model performance under 5-shot, 10-shot, and 15-shot few-shot settings, where the number of labeled support samples is gradually increased. The quantitative results are summarized in Table 4, and a direct comparison with two state-of-the-art methods, CTF-SSCL and CFSL-KT, is provided to demonstrate the superiority of our framework.

As shown in Table 4, the OA of LGDAF-Net increases steadily with the number of labeled samples, although it does not lead at every setting. Under the 5-shot setting, CFSL-KT achieves a higher OA (79.62%) than LGDAF-Net (76.88%), which is mainly attributed to the inherent advantage of metric-based learning methods such as CFSL-KT when labeled samples are extremely scarce. As the shot number increases to 10, LGDAF-Net reaches 84.89% and exceeds CFSL-KT (84.41%), while CTF-SSCL obtains the highest OA at this setting (88.92%). At 15-shot, LGDAF-Net obtains the highest OA of 91.31%, surpassing CFSL-KT (89.28%) and CTF-SSCL (89.16%). This trend suggests that the feature discrimination and domain adaptation ability of LGDAF-Net is most evident once a moderate number of labeled samples is available.
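The per-shot comparison can be checked directly against Table 4. The short sketch below copies the OA values from that table and reports, for each shot setting, the leading method and the margin of LGDAF-Net over the better of the two compared methods; the variable names are illustrative.

```python
# OA (%) on the IP dataset, copied from Table 4.
oa = {
    5:  {"CTF-SSCL": 72.47, "CFSL-KT": 79.62, "LGDAF-Net": 76.88},
    10: {"CTF-SSCL": 88.92, "CFSL-KT": 84.41, "LGDAF-Net": 84.89},
    15: {"CTF-SSCL": 89.16, "CFSL-KT": 89.28, "LGDAF-Net": 91.31},
}

leaders = {}
margins = {}
for shot, row in oa.items():
    leaders[shot] = max(row, key=row.get)
    best_other = max(v for name, v in row.items() if name != "LGDAF-Net")
    # Positive margin: LGDAF-Net leads; negative: it trails the best competitor.
    margins[shot] = round(row["LGDAF-Net"] - best_other, 2)
    print(f"{shot}-shot: leader = {leaders[shot]}, LGDAF-Net margin = {margins[shot]:+.2f}")
```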

Training Convergence Analysis

Figure 12 illustrates the training accuracy curve of LGDAF-Net on the IP dataset over 2000 training episodes.

As shown in the figure, within 500 training episodes, the accuracy rapidly climbs to over 90% and stabilizes between 90–93% without evident overfitting. The optimal performance (marked by the red dot) is achieved at an accuracy of 92.66% after 1300 training episodes, which verifies the effective convergence and generalization capabilities of the model.

3.4. Ablation Study

To evaluate the effectiveness of the three core modules, namely LASPM, GACM, and SESA, an ablation study was conducted on the IP dataset. The experimental results are summarized in Table 5.

As shown in Table 5, removing any of the three modules lowers OA and Kappa. When LASPM is excluded, the overall accuracy (OA) decreases from 91.31% to 90.74%, indicating that the local spatial structures captured by LASPM contribute to fine-grained feature representation. When GACM is removed, the OA drops to 89.77%, the largest decline among all configurations, which supports the view that long-range spatial dependencies are critical for hyperspectral image analysis. Removing the SESA module lowers the OA to 90.92%, indicating that the spectral–spatial enhancement process improves the discriminative capability of subsequent feature extraction. AA is less sensitive: it changes by at most 0.05 percentage points when LASPM or SESA is removed and drops by 0.63 percentage points only when GACM is removed.

When all three modules are integrated, the proposed LGDAF-Net achieves the best OA (91.31%) and Kappa coefficient (90.11%) with an AA of 94.81%, indicating the complementary effects of the three core components.
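The size of each module's contribution can be read off Table 5 with a few subtractions. The sketch below copies the table values and computes, for each ablated variant, the drop in each metric relative to the full model and the training time saved; the dictionary keys are illustrative labels.

```python
# Values copied from Table 5 (IP dataset).
full = {"OA": 91.31, "AA": 94.81, "Kappa": 90.11, "time_s": 784}
ablated = {
    "w/o LASPM": {"OA": 90.74, "AA": 94.80, "Kappa": 89.49, "time_s": 657},
    "w/o GACM":  {"OA": 89.77, "AA": 94.18, "Kappa": 88.38, "time_s": 608},
    "w/o SESA":  {"OA": 90.92, "AA": 94.86, "Kappa": 89.69, "time_s": 672},
}

drops = {}
time_saved = {}
for name, row in ablated.items():
    # Positive drop: the full model is better on that metric (percentage points).
    drops[name] = {m: round(full[m] - row[m], 2) for m in ("OA", "AA", "Kappa")}
    time_saved[name] = full["time_s"] - row["time_s"]
    print(name, drops[name], f"{time_saved[name]} s faster to train")
```

The negative AA entry for the variant without SESA reflects that its AA in Table 5 (94.86%) is marginally above that of the full model (94.81%).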

Ablation experiments on the Input Mapping Layer and the loss function combination (kernel triplet loss + domain adversarial loss) are not listed in the table due to space limitations; the key results and analysis are summarized as follows: (1) Removing the Input Mapping Layer leads to a 3.2% decrease in OA on the IP dataset, which indicates that 1 × 1 convolution channel unification and batch normalization effectively reduce the spectral distribution shift between different domains and improve cross-domain feature compatibility; (2) Removing the combined kernel triplet loss and domain adversarial loss results in a 2.8% decrease in OA, which indicates that the two losses are complementary: the kernel triplet loss enhances the intra-class compactness and inter-class separability of features, while the domain adversarial loss aligns the source–target domain feature distributions, and together they improve the cross-domain few-shot classification performance of the model.

3.5. Computational Efficiency

To evaluate the computational efficiency and lightweight characteristics of the proposed LGDAF-Net, both model complexity and training efficiency are quantitatively analyzed and compared with baseline methods. Table 6 summarizes the model parameters, FLOPs, and training time.

In terms of model complexity, LGDAF-Net contains only approximately 287k parameters, representing a 61.09% reduction compared with CFSL-KT (approximately 737k parameters) and a 15.56% reduction compared with CTF-SSCL (approximately 339k parameters). These results confirm that the designed lightweight architecture effectively decreases model complexity while maintaining high classification performance.
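The relative reductions can be recomputed from Table 6 as sketched below. The inputs are the rounded values printed in the table (parameters in thousands, FLOPs in G), so the outputs can differ by a few tenths of a percentage point from the percentages reported in the text and table, which were presumably derived from unrounded counts. The function name is illustrative.

```python
def relative_reduction(reference, proposed):
    """Percentage by which `proposed` is smaller than `reference`."""
    return 100.0 * (reference - proposed) / reference

# Rounded values copied from Table 6: (parameters in k, FLOPs in G).
lgdaf = (287, 0.84)
baselines = {"CFSL-KT": (737, 2.05), "CTF-SSCL": (339, 1.02)}

reductions = {}
for name, (params_k, flops_g) in baselines.items():
    reductions[name] = (
        relative_reduction(params_k, lgdaf[0]),
        relative_reduction(flops_g, lgdaf[1]),
    )
    print(f"vs {name}: {reductions[name][0]:.2f}% fewer parameters, "
          f"{reductions[name][1]:.2f}% fewer FLOPs")
```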

Regarding training efficiency, the full LGDAF-Net model with LASPM, GACM, and SESA modules requires 784 s of training on the IP dataset using an NVIDIA RTX 4060 Laptop GPU. The model was trained using the SGD optimizer with an initial learning rate of 1 × 10⁻³, a batch size of 16, and a maximum of 2000 episodes. Although the complete model takes slightly longer to train than ablated variants, it achieves the best classification accuracy, indicating an effective trade-off between computational cost and performance.

Overall, the results demonstrate that LGDAF-Net achieves a favorable balance between classification performance and computational efficiency, making it suitable for practical deployment in resource-constrained remote sensing systems.

4. Conclusions

This work addresses key challenges in cross-domain few-shot hyperspectral image (HSI) classification, including limited labeled samples, cross-domain feature distribution shifts caused by heterogeneous sensors, and insufficient spectral–spatial feature representation. We propose LGDAF-Net, a lightweight hierarchical CNN–Transformer framework consisting of an input mapping layer for domain alignment, a SESA preprocessing module for spectral–spatial feature enhancement, an LASPM for local fine-grained spatial feature extraction, and a Transformer-based GACM for modeling long-range spatial dependencies. The model leverages kernel triplet loss and domain adversarial learning to enhance feature discrimination and domain generalization.

Extensive experiments on the Indian Pines, Pavia University, and Salinas datasets show that LGDAF-Net reaches overall accuracies of 91.31%, 98.16%, and 98.46%, respectively, which is competitive with CFSL-KT and CTF-SSCL. Ablation studies confirm the contribution of each module, while the model maintains a compact size of only 287k parameters, giving a favorable trade-off between classification accuracy and computational efficiency. The training curves further demonstrate stable convergence behavior, and the lightweight architecture ensures moderate computational complexity.

Future work will focus on three aspects: (1) scaling LGDAF-Net to large-scale and high-resolution hyperspectral scenes; (2) optimizing the model for lightweight real-time inference on edge devices for operational remote sensing applications; (3) extending the framework to multi-modal hyperspectral image classification by fusing LiDAR or RGB data.

Author Contributions

Conceptualization: G.Y. and D.Z.; Methodology: G.Y. and J.F.; Software: G.Y.; Validation: G.Y.; Formal analysis: G.Y.; Investigation: G.Y.; Resources: G.Y., D.Z. and J.F.; Data curation: G.Y.; Writing—original draft: G.Y.; Writing—review and editing: J.F., D.Z. and X.Z.; Visualization: G.Y.; Supervision: J.F. and D.Z.; Project administration: X.Z.; Funding acquisition: X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Yunnan International Joint Laboratory for Integrated Sky-Ground Intelligent Monitoring of Mountain Hazards (Grant No. 202403AP140002) and the Yunnan Provincial Key Laboratory for Intelligent Monitoring of Natural Resources and Spatio-temporal Big Data Governance (Grant No. 202449CE340023).

Data Availability Statement

The datasets used in this study are publicly available hyperspectral datasets and can be obtained from the corresponding public repositories.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhang, B. Frontiers in Hyperspectral Image Processing and Information Extraction. J. Remote Sens. 2016, 20, 1062–1090.
2. Tong, Q.-X.; Zhang, B.; Zhang, L.-F. Frontier Advances in Hyperspectral Remote Sensing in China. J. Remote Sens. 2016, 20, 689–707.
3. Ghamisi, P.; Benediktsson, J.A.; Chanussot, J. Spectral–spatial classification of hyperspectral data: A comprehensive review. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4391–4403.
4. Du, P.-J.; Xia, J.-S.; Xue, Z.-H.; Tan, K.; Su, H.-J.; Bao, R. Research Progress in Hyperspectral Remote Sensing Image Classification. J. Remote Sens. 2016, 20, 236–256.
5. Hughes, G.F. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63.
6. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790.
7. Joelsson, S.R.; Benediktsson, J.A.; Sveinsson, J.R. Random forest classifiers for hyperspectral data. IEEE Geosci. Remote Sens. Lett. 2005, 1, 160–163.
8. Yang, X.; Cao, W.; Lu, Y.; Zhou, Y. Hyperspectral image transformer classification networks. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
9. Islam, S.; Elmekki, H.; Elsebai, A.; Bentahar, J.; Drawel, N.; Rjoub, G.; Pedrycz, W. A comprehensive survey on applications of transformers for deep learning tasks. Expert Syst. Appl. 2024, 241, 122666.
10. Liu, Q.; Li, W.; Fan, S.; Jiang, Y. A graph-guided transformer based on dual-stream perception for hyperspectral image classification. Int. J. Remote Sens. 2024, 45, 9359–9387.
11. Liu, B.; Yu, X.; Yu, A.; Zhang, P.; Wan, G.; Wang, R. Deep few-shot learning for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2290–2304.
12. Li, J.; Zhang, Z.; Song, R.; Li, Y.; Du, Q. SCFormer: Spectral coordinate transformer for cross-domain few-shot hyperspectral image classification. IEEE Trans. Image Process. 2024, 33, 840–855.
13. Dong, G.; Ma, Y.; Basu, A. Feature-guided CNN for denoising images from portable ultrasound devices. IEEE Access 2021, 9, 28272–28281.
14. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 2011–2023.
15. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Salt Lake City, UT, USA, 2018; pp. 7794–7803.
16. Huang, K.-K.; Yuan, H.-T.; Ren, C.-X.; Hou, Y.E.; Duan, J.L.; Yang, Z. Hyperspectral image classification via cross-domain few-shot learning with kernel triplet loss. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–18.
17. Xi, B.; Zhang, Y.; Li, J.; Li, Z.; Chanussot, J. CTF-SSCL: CNN-transformer for few-shot hyperspectral image classification assisted by semisupervised contrastive learning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17.

Figure 1. Pseudo-true color images of the source and target domain datasets: (a) Chikusei (Source Domain); (b) Indian Pines (IP, Target Domain); (c) Salinas (Target Domain); (d) Pavia University (UP, Target Domain).

Figure 2. LGDAF-Net Model Architecture Diagram.

Figure 3. Feature Extraction Network.

Figure 4. LASPM Structure.

Figure 5. GACM Structure.

Figure 6. Classification Map of the Indian Pines Dataset. (a) True Value Chart; (b) CTF-SSCL; (c) CFSL-KT; (d) LGDAF-Net. Each color denotes a specific land-cover category, and the color assignment is fully consistent with Figure 7 and Figure 8.

Figure 7. Classification Map of the University of Pavia Dataset. (a) True Value Chart; (b) CTF-SSCL; (c) CFSL-KT; (d) LGDAF-Net.

Figure 8. Classification Map of the Salinas Dataset. (a) True Value Chart; (b) CTF-SSCL; (c) CFSL-KT; (d) LGDAF-Net.

Figure 9. Error maps of different methods on the Indian Pines dataset. (a) CTF-SSCL; (b) CFSL-KT; (c) LGDAF-Net.

Figure 10. Error maps of different methods on the University of Pavia dataset. (a) CTF-SSCL; (b) CFSL-KT; (c) LGDAF-Net.

Figure 11. Error maps of different methods on the Salinas dataset. (a) CTF-SSCL; (b) CFSL-KT; (c) LGDAF-Net.

Figure 12. Training accuracy curve of the proposed LGDAF-Net on the IP dataset (recorded every 50 episodes over 2000 total episodes). The red solid dot marks the optimal performance point at 1300 episodes.

Table 1. Classification Results of IP Dataset.

Algorithm	OA/(%)	AA/(%)	Kappa/(%)
CTF-SSCL	89.16	81.22	78.77
CFSL-KT	89.28	93.4	87.8
LGDAF-Net	91.31 ± 0.67	94.81 ± 0.76	90.11 ± 0.76

Table 2. Classification Results for the UP Dataset.

Algorithm	OA/(%)	AA/(%)	Kappa/(%)
CTF-SSCL	92.41	92.34	89.93
CFSL-KT	97.51	95.67	96.69
LGDAF-Net	98.16 ± 0.67	96.62 ± 0.76	97.56 ± 0.76

Table 3. Classification Results for the Salinas Dataset.

Algorithm	OA/(%)	AA/(%)	Kappa/(%)
CTF-SSCL	97.71	96.84	94.17
CFSL-KT	97.52	98.71	97.24
LGDAF-Net	98.46 ± 0.67	98.44 ± 0.76	98.29 ± 0.76

Table 4. Changes in Overall Accuracy under Different Shot Scenarios.

Shot	CTF-SSCL	CFSL-KT	LGDAF-Net
5	72.47	79.62	76.88
10	88.92	84.41	84.89
15	89.16	89.28	91.31

Table 5. Ablation study results of LGDAF-Net on the IP dataset.

LASPM	GACM	SESA	OA/(%)	AA/(%)	Kappa/(%)	Time (s)
×	√	√	90.74	94.8	89.49	657
√	×	√	89.77	94.18	88.38	608
√	√	×	90.92	94.86	89.69	672
√	√	√	91.31	94.81	90.11	784

Table 6. Computational Efficiency Performance Comparison of Classification Models.

Model	Parameters (k)	FLOPs (G)	Training Time (s)	Relative Difference
LGDAF-Net	287	0.84	784	–
CFSL-KT	737	2.05	712	↑ 61.1% parameters, ↑ 58.7% FLOPs
CTF-SSCL	339	1.02	756	↑ 15.6% parameters, ↑ 17.3% FLOPs

Note: The upward arrow ↑ in the “Relative Difference” column indicates that the corresponding model has more parameters and FLOPs than the proposed LGDAF-Net, which serves as the reference. The percentages express the reduction achieved by LGDAF-Net relative to that model.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, G.; Fang, J.; Zhu, D.; Zuo, X. LGDAF-Net: A Lightweight CNN–Transformer Framework for Cross-Domain Few-Shot Hyperspectral Image Classification. Electronics 2026, 15, 1606. https://doi.org/10.3390/electronics15081606

