Article

Hybrid Mamba–Graph Fusion with Multi-Stage Pseudo-Label Refinement for Semi-Supervised Hyperspectral–LiDAR Classification

School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(3), 1005; https://doi.org/10.3390/s26031005
Submission received: 11 December 2025 / Revised: 25 January 2026 / Accepted: 30 January 2026 / Published: 3 February 2026
(This article belongs to the Special Issue Progress in LiDAR Technologies and Applications)

Highlights

What are the main findings?
  • Novel Hybrid Architecture for HSI–LiDAR Fusion: We propose HMGF-Net, a unified multimodal network combining a spectral–spatial CNN (HSI) and a multi-scale CNN (LiDAR) with a Mamba state-space sequence module and a graph-based fusion layer to capture both local and long-range contextual features across modalities.
  • Multi-Stage Pseudo-Label Refinement: We develop a three-stage semi-supervised training pipeline that progressively refines pseudolabels using confidence-based filtering, spatial–spectral smoothing via a KNN graph, and graph-consistency checks, effectively denoising labels and stabilizing training with very limited ground truth.
What is the implication of the main findings?
  • State-of-the-Art Performance with Limited Labels: The proposed approach achieves superior classification accuracy on three benchmark hyperspectral–LiDAR datasets (Houston2013, Augsburg, Trento), outperforming eight recent state-of-the-art methods. Notably, HMGF-Net attains higher overall accuracy (up to 92–99%) and average accuracy than competitors, with pronounced gains in low-label regimes (only ten labeled samples per class).
  • Synergy of Sequence Modelling and Graph Reasoning: Our results demonstrate that integrating long-range sequence modeling (Mamba) with graph-based context propagation yields smoother classification maps and better class discrimination. The hybrid approach reduces salt-and-pepper noise and improves boundary delineation in the predicted maps, highlighting its effectiveness for detailed sensor data analysis.

Abstract

Semi-supervised joint classification of Hyperspectral Images (HSIs) and LiDAR-derived Digital Surface Models (DSMs) remains challenging due to the scarcity of labeled pixels, strong intra-class variability, and the heterogeneous nature of spectral and elevation features. In this work, we propose a Hybrid Mamba–Graph Fusion Network (HMGF-Net) with Multi-Stage Pseudo-Label Refinement (MS-PLR) for semi-supervised hyperspectral–LiDAR classification. The framework employs a spectral–spatial HSI backbone combining 3D–2D convolutions, a compact LiDAR CNN encoder, Mamba-style state-space sequence blocks for long-range spectral and cross-modal dependency modeling, and a graph fusion module that propagates information over a heterogeneous pixel graph. Semi-supervised learning is realized via a three-stage pseudolabeling pipeline that progressively filters, smooths, and re-weights pseudolabels based on prediction confidence, spatial–spectral consistency, and graph neighborhood agreement. We validate HMGF-Net on three benchmark hyperspectral–LiDAR datasets. Compared with a set of eight state-of-the-art (SOTA) baselines, including 3D-CNNs, SSRN, HybridSN, transformer-based models such as SpectralFormer, multimodal CNN–GCN fusion networks, and recent semi-supervised methods, the proposed approach delivers consistent gains in overall accuracy, average accuracy, and Cohen’s kappa, especially in low-label regimes (10% labeled pixels). The results highlight that the synergy between sequence modeling and graph reasoning in combination with carefully designed pseudolabel refinement is essential to maximizing the benefit of abundant unlabeled samples in multimodal remote sensing scenarios.

1. Introduction

Hyperspectral Images (HSI) provide dense spectral information across hundreds of narrow bands, and have become indispensable for material identification and land cover analysis. However, issues such as spectral redundancy, nonlinear mixing, and high dimensionality continue to challenge robust spectral–spatial modeling, motivating the development of advanced learning strategies for high-dimensional remote sensing data [1,2]. Recent deep learning approaches have improved HSI classification through joint spatial–spectral feature extraction, hybrid CNN–attention architectures, and lightweight networks designed for small-sample settings [3,4,5].
To address the limitations of spectral information alone, complementary modalities such as Light Detection and Ranging (LiDAR) have been extensively integrated with HSI. LiDAR provides elevation and structural cues that enhance edge delineation and mitigate spectral ambiguity. Numerous studies have demonstrated that combining HSI and LiDAR significantly improves classification accuracy, particularly in heterogeneous or urban environments [6,7,8]. Recent advances include dual-branch transformers, hypergraph networks, and cross-attention fusion modules that model heterogeneous spectral–elevation interactions more effectively [9,10,11]. In addition to dual-modality setups, multisensor fusion frameworks now exploit combinations of hyperspectral, multispectral, radar, and LiDAR data to achieve more stable and generalizable scene understanding [12,13].
Despite this progress, multimodal HSI–LiDAR fusion remains difficult under limited supervision. HSI suffers from large intra-class spectral variance and sensitivity to noise, while LiDAR exhibits irregular sampling patterns and modality-specific distortions. Effective fusion requires bridging disparate feature spaces while preserving modality-specific advantages. Deep HSI classifiers based on random-patch learning, dense residual transfer, and spectral–spatial CNNs offer improved robustness [3,4,14], yet most multimodal fusion networks assume abundant labeled samples and degrade sharply when labels are scarce [15,16].
To address this, Semi-Supervised Learning (SSL) has gained importance in HSI classification. Modern SSL methods integrate unlabeled samples through consistency constraints, pseudolabel refinement, adversarial learning, and hybrid generative–discriminative models [17,18,19]. In addition, multiscale refinement strategies and cost-aware learning mechanisms have been shown to substantially improve label efficiency in low-annotation scenarios [15,20]. Nonetheless, applying SSL to multimodal fusion remains challenging, as pseudolabel noise can propagate between modalities, causing semantic drift unless cross-modal agreement and structural consistency are jointly enforced [21].
Meanwhile, long-range modeling advances have reshaped deep learning for remote sensing. Transformer-based architectures and graph neural networks have demonstrated strong spectral–spatial reasoning capabilities, but remain limited by computational complexity when applied to high-resolution hyperspectral cubes [2,16]. Recently, visual State-Space Models (SSMs), particularly Mamba-inspired architectures, have emerged as efficient alternatives that capture long-range dependencies with linear complexity. RS3Mamba and ConvMambaSR show excellent performance in segmentation and super-resolution tasks, highlighting the potential of SSMs for hyperspectral sequence modeling [22,23]. Additional work in SSM-driven remote sensing indicates improved efficiency, stability, and scalability compared to transformer-based models [24,25,26].
Despite these advances, existing state-space and graph-based approaches for remote sensing classification exhibit several limitations. Current Mamba-inspired architectures such as RS3Mamba [22] and ConvMambaSR [23] employ fixed scanning strategies over the spatial grid that do not adapt to irregular object boundaries in heterogeneous scenes, potentially overlooking critical cross-modal relationships at class transitions where spectral and elevation discontinuities do not align. While graph neural networks have shown promise for HSI classification through hypergraph convolutions [8] and cross-attention mechanisms [11], they face well-documented over-smoothing effects when stacked beyond two or three layers, and incur significant computational overhead when constructing dense pixel graphs over high-resolution hyperspectral cubes [2,16]. Most critically, neither paradigm alone addresses the challenge of pseudolabel noise propagation in semi-supervised multimodal learning. Erroneous predictions generated from one modality can corrupt feature representations through the fusion mechanism, leading to semantic drift during iterative self-training [21]. These limitations motivate the design of HMGF-Net, which combines efficient state-space sequence modeling with graph-based consistency verification specifically targeted at pseudolabel quality control.
These observations collectively motivate the design of a unified architecture capable of addressing the interconnected challenges of multimodal learning under limited labels. Existing methods rarely combine spectral–spatial HSI encoding, LiDAR structural modeling, efficient long-range sequence learning, graph reasoning, and systematic pseudolabel refinement into a single coherent framework. Moreover, multimodal SSL remains vulnerable to inconsistent pseudolabel predictions, which can propagate uncertainty and destabilize training. Additionally, high-dimensional hyperspectral sequences require an efficient long-range modeling approach that avoids the computational burden of transformers. These challenges form the foundation for our proposed methodology.
To address these intertwined challenges, we introduce HMGF-Net, a Hybrid Mamba–Graph Fusion Network equipped with an end-to-end Multi-Stage Pseudo-Label Refinement (MS-PLR) mechanism. In contrast to existing multimodal approaches, HMGF-Net integrates spectral–spatial representation learning for hyperspectral data, multiscale geometric modeling for LiDAR elevation cues, and efficient long-range dependency modeling through the Mamba selective state-space paradigm. These encoded features are subsequently processed within a graph-based fusion network that captures cross-modal relational structure and enhances contextual reasoning. Finally, the MS-PLR pipeline progressively refines pseudolabels through confidence filtering, spatial–spectral smoothing, and graph-consistency propagation, enabling the network to suppress noise, reinforce cross-modal stability, and achieve high classification accuracy even in demanding low-label conditions.
The main contributions of this work are summarized below:
  • We present HMGF-Net, a unified multimodal architecture that integrates a 3D–2D spectral–spatial CNN encoder for hyperspectral data, a multiscale CNN for LiDAR elevation modeling, and Mamba-based selective state-space modeling for efficient long-range dependency learning. This design combines local spectral–spatial feature extraction with global sequence modeling, enabling a more expressive and computationally efficient multimodal representation than conventional CNN- or transformer-based approaches.
  • We introduce a graph-guided multimodal fusion mechanism that aligns hyperspectral and LiDAR features using relational modeling based on spectral similarity, spatial proximity, and elevation-informed neighborhood structure. This graph-based fusion strategy promotes more coherent cross-modal interactions by preserving geometric continuity and spectral–spatial relationships, thereby enabling the network to integrate complementary modality information more effectively than concatenation, attention-only fusion, or shallow multimodal alignment methods.
  • We develop a Multi-Stage Pseudo-Label Refinement (MS-PLR) framework designed to stabilize semi-supervised learning through progressive noise suppression. The refinement process incorporates confidence filtering, spatial–spectral neighborhood smoothing, and graph consistency propagation to reduce the influence of unreliable predictions. This enables more reliable pseudolabel supervision in low-label scenarios, preventing semantic drift and improving training stability by ensuring that refined labels remain structurally consistent with both spectral–spatial patterns and elevation cues.

2. Materials and Methods

Datasets

To assess the effectiveness of the proposed approach, three publicly accessible multisensor remote sensing image classification datasets are utilized as experimental datasets: the Houston2013, Trento, and Augsburg datasets. Comprehensive parameters are shown in Table 1.
Houston2013 dataset. The Houston2013 dataset was captured using the ITRES CASI-1500 (ITRES Research Limited, Calgary, AB, Canada) sensor over the University of Houston campus and its surrounding urban area in Houston, Texas, USA in 2012. This dataset includes both HSI and LiDAR DSM data. The spatial dimensions of the dataset are 349 × 1905, with a spatial resolution of approximately 2.5 m. The HSI data consist of 144 spectral bands, covering the wavelength range from 380 to 1050 nm. The LiDAR data provide elevation information for ground features. The land cover is categorized into fifteen types: Healthy Grass, Stressed Grass, Synthetic Grass, Trees, Soil, Water, Residential, Commercial, Road, Highway, Railway, Parking Lot 1, Parking Lot 2, Tennis Court, and Running Track.
Augsburg dataset. The Augsburg dataset consists of paired HSI and LiDAR DSM data; the HSI data were collected using the HySpex (Norsk Elektro Optikk AS, Skedsmokorset, Norway) sensor, while the LiDAR DSM data were obtained with the DLR-3K sensor (German Aerospace Center, Oberpfaffenhofen, Germany). This dataset was acquired over Augsburg, Germany, which is an urban environment. The spatial dimensions of the Augsburg dataset are 332 × 485, with a spatial resolution of approximately 30 m. The HSI data include 180 spectral bands, spanning the wavelength range of 0.4 to 2.5 µm. The LiDAR DSM data provide 3D elevation information for surface features. The dataset comprises seven land cover categories with varying sample distributions.
Trento Dataset. The Trento dataset is an HSI–LiDAR pair dataset; the HSI data were collected by an AISA Eagle (Specim, Spectral Imaging Ltd., Oulu, Finland) sensor, while the LiDAR digital surface model (DSM) data were acquired by an Optech ALTM 3100EA (Teledyne Optech, Vaughan, ON, Canada) sensor. The dataset was captured over a rural area south of the city of Trento, Italy. The Trento dataset has a spatial dimension of 166 × 600 with a spatial resolution of approximately 1 m. The HSI data in the Trento dataset consist of 63 spectral bands, with wavelengths ranging from 420 to 990 nm. The LiDAR DSM data provide elevation information of ground features. The land cover is classified into six categories: Apple Trees, Buildings, Ground, Woods, Vineyard, and Roads.

3. Methods

To address the challenge of limited labeled samples in multi-source remote sensing image classification, we propose the Hybrid Mamba–Graph Fusion Network (HMGF-Net) trained with a Multi-Stage Pseudo-Label Refinement (MS-PLR) strategy. The proposed framework consists of two complementary components: (1) the HMGF-Net architecture, which effectively integrates hyperspectral and LiDAR features through modality-specific encoders and a Mamba-based fusion module, and (2) the MS-PLR training strategy, which leverages unlabeled data through graph-regularized pseudolabeling. The overall framework is illustrated in Figure 1, and the key notations used throughout this paper are summarized in Table 2.

3.1. Framework Overview

As shown in Figure 1, the proposed framework integrates the HMGF-Net architecture with the MS-PLR training strategy. The HMGF-Net architecture comprises three main components: (1) a dual-branch feature extraction module with modality-specific encoders, (2) a Mamba-based feature fusion module for cross-modal interaction, and (3) a classification head for land cover prediction. The MS-PLR training strategy employs a three-stage paradigm to maximize the utilization of both labeled and unlabeled samples:
Stage 1 (Supervised Pretraining): HMGF-Net is initially trained on the small labeled set $D_l$ using cross-entropy loss. The forward path consists of patch extraction, modality-specific encoders, Mamba-based feature fusion, and the classification head. This stage establishes the initial feature representations.
Stage 2 (Pseudolabel Generation): Using the pretrained HMGF-Net, we generate predictions for all unlabeled samples in $D_u$. High-confidence predictions are filtered through dynamic thresholding and validated via KNN graph consistency checking. Only samples that pass both criteria are accepted as pseudolabels.
Stage 3 (Semi-Supervised Refinement): HMGF-Net is fine-tuned on the augmented dataset $D_{aug} = D_l \cup \mathcal{P}$, where $\mathcal{P}$ denotes the validated pseudolabel set. A reduced learning rate prevents catastrophic forgetting of the knowledge learned in Stage 1.

3.2. HMGF-Net Architecture

The HMGF-Net architecture is designed for effective multimodal feature extraction and fusion. We describe each architectural component in detail below.

3.2.1. Dual-Branch Feature Extraction Module

For multi-source remote sensing data, we develop a dual-branch architecture with customized connectivity mechanisms specifically designed for the distinct characteristics of HSI and LiDAR data. Figure 2 illustrates the feature extraction module.
The HSI encoder employs 3D convolutions to jointly capture spectral correlations and spatial context. For an input patch P H R 1 × C × P × P , we apply three cascaded 3D convolutional layers with spectral kernel sizes of 7, 5, and 3, progressively reducing spectral dimensionality while extracting hierarchical features. Each layer incorporates batch normalization and ReLU activation to stabilize training and introduce non-linearity.
Considering the spectral–spatial complexity and potential noise in spatial dimensions, we enhance the standard 3D CNN with residual learning. The residual block incorporates skip connections that allow direct propagation of spectral features across layers, which helps preserve critical spectral information while alleviating gradient vanishing issues. It can be expressed as
$$F_H^{(l+2)} = \mathcal{H}_H\big(F_H^{(l)}\big) + F_H^{(l)},$$
where $F_H^{(l)}$ represents the input of the $l$-th layer and $\mathcal{H}_H(\cdot)$ represents the mapping function. The 3D features are reduced via spectral average pooling and a 2D convolution to produce $Z_H \in \mathbb{R}^{D \times P \times P}$, where $D = 64$.
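To make the residual spectral mapping above concrete, the following NumPy sketch applies a toy version of $F_H^{(l+2)} = \mathcal{H}_H(F_H^{(l)}) + F_H^{(l)}$ to a small patch. The `spectral_conv` helper is a hypothetical stand-in for the paper's 3D convolution layers (a 1D convolution along the spectral axis with ReLU), not the actual trained encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_conv(x, w):
    # Toy 1D convolution along the spectral (first) axis, 'same' padding,
    # followed by ReLU; stands in for one conv3d + BN + ReLU layer.
    pad = len(w) // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        out[i] = np.tensordot(w, xp[i:i + len(w)], axes=(0, 0))
    return np.maximum(out, 0.0)

def residual_block(x, w1, w2):
    # H_H(.) is two conv+ReLU layers; the skip connection adds the input back.
    h = spectral_conv(spectral_conv(x, w1), w2)
    return h + x

x = rng.standard_normal((16, 5, 5))       # (bands, P, P) toy HSI patch
w1 = rng.standard_normal(3) * 0.1
w2 = rng.standard_normal(3) * 0.1
y = residual_block(x, w1, w2)
assert y.shape == x.shape                 # the skip connection preserves shape
```

With zero weights the mapping $\mathcal{H}_H$ vanishes and the block reduces to the identity, which is exactly the property that lets spectral features propagate directly across layers.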
Because LiDAR data adopt the Digital Surface Model format, which has local correlations and sparsity, we use a dense connection method in which features of each layer are connected to all previous layers. This effectively leverages local correlations and enhances the re-usability and representational power of features:
$$F_L^{(l+2)} = \mathcal{H}_L^{(l+2)}\big([F_L^{(0)}, F_L^{(1)}, \ldots, F_L^{(l+1)}]\big),$$
where $[\cdot]$ denotes concatenation of features from all preceding layers. The LiDAR encoder processes elevation information through three 2D convolutional layers with channel dimensions 1→32→64→64, producing $Z_L \in \mathbb{R}^{D \times P \times P}$.
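The dense connectivity pattern can be sketched as follows; `dense_forward` and its random linear maps are illustrative placeholders for the actual 2D convolutions, showing only how each layer consumes the concatenation of all earlier feature maps.

```python
import numpy as np

rng = np.random.default_rng(1)

def dense_forward(x0, n_layers, growth=4):
    # Each layer sees [F^(0), F^(1), ..., F^(l)] concatenated along channels,
    # so earlier features are reused by every later layer.
    feats = [x0]                               # F_L^(0)
    for _ in range(n_layers):
        cat = np.concatenate(feats, axis=0)    # dense concatenation
        w = rng.standard_normal((growth, cat.shape[0])) * 0.1
        new = np.maximum(np.einsum("oc,chw->ohw", w, cat), 0.0)  # linear + ReLU
        feats.append(new)
    return feats

feats = dense_forward(rng.standard_normal((1, 5, 5)), n_layers=3, growth=4)
# channel counts grow as 1, 4, 4, 4; layer l consumes 1 + 4*l input channels
assert [f.shape[0] for f in feats] == [1, 4, 4, 4]
```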

3.2.2. Mamba-Based Feature Fusion Module

To capture long-range spatial dependencies and enable effective cross-modal interaction, we propose a Mamba-based state-space fusion module. Unlike attention mechanisms, which have $O(L^2)$ complexity, the state-space model formulation enables linear-time $O(L)$ sequence modeling while capturing long-range dependencies. The detailed structure is illustrated in Figure 3.
The modality-specific features are first concatenated along the channel dimension to form a unified representation
$$Z_{cat} = [Z_H; Z_L] \in \mathbb{R}^{2D \times P \times P};$$
then the spatial feature map is reshaped into a sequence representation to model spatial positions as sequential tokens:
$$S = \mathrm{Reshape}(Z_{cat}) \in \mathbb{R}^{L \times N},$$
where $L = P^2$ is the sequence length and $N = 2D$ is the feature dimension. This transformation enables the application of sequential modeling to capture spatial relationships.
Our Mamba block is based on structured state-space sequence models (S4/Mamba), which map input sequences to output sequences through a latent state. The discretized state-space recurrence for each spatial position $t \in \{1, \ldots, L\}$ is
$$h_t = \bar{A} h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t,$$
where $h_t \in \mathbb{R}^{N_s}$ is the latent state with dimension $N_s$, $\bar{A}$ is the discretized state transition matrix, and $\bar{B}_t$, $C_t$ are input-dependent projection matrices that enable content-aware sequence modeling. This selective mechanism allows the model to adaptively filter and retain relevant cross-modal information based on input content, which is crucial for fusing heterogeneous HSI spectral and LiDAR elevation features.
The complete Mamba block processes the input sequence through a gating mechanism:
$$Z_{fused} = \mathrm{Reshape}\big(\mathrm{SSM}(S) \odot \sigma_g(S)\big) \in \mathbb{R}^{2D \times P \times P},$$
where $\mathrm{SSM}(\cdot)$ denotes the selective state-space operation, $\sigma_g$ is the SiLU activation function, and $\odot$ represents element-wise multiplication providing the gating. The gating allows the network to control information flow, suppressing irrelevant features while enhancing discriminative ones.
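A minimal, self-contained sketch of the gated state-space fusion may clarify the data flow: tokens are scanned with the linear recurrence, gated by SiLU, and reshaped back to the spatial grid. All matrices below are random placeholders standing in for learned, input-dependent parameters; this is an illustration of the recurrence, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def silu(x):
    return x / (1.0 + np.exp(-x))

def ssm_scan(S, Abar, B_t, C_t):
    # S: (L, N) tokens; Abar: (Ns, Ns); B_t: (L, Ns, N); C_t: (L, N, Ns).
    # One pass over the sequence: cost grows linearly with L.
    L, _ = S.shape
    h = np.zeros(Abar.shape[0])
    Y = np.zeros_like(S)
    for t in range(L):
        h = Abar @ h + B_t[t] @ S[t]     # h_t = Abar h_{t-1} + Bbar_t x_t
        Y[t] = C_t[t] @ h                # y_t = C_t h_t
    return Y

P, D = 4, 8
L, N, Ns = P * P, 2 * D, 6               # L = P^2 tokens, N = 2D channels
S = rng.standard_normal((L, N))
Abar = 0.9 * np.eye(Ns)                  # stable toy transition matrix
B_t = rng.standard_normal((L, Ns, N)) * 0.1
C_t = rng.standard_normal((L, N, Ns)) * 0.1
# gate with SiLU(S), then reshape the sequence back to (2D, P, P)
Z_fused = (ssm_scan(S, Abar, B_t, C_t) * silu(S)).T.reshape(N, P, P)
assert Z_fused.shape == (2 * D, P, P)
```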

3.2.3. Classification Head

The fused features are passed through a $1 \times 1$ convolution to produce class logits. The center-pixel prediction is extracted and converted to class probabilities via softmax:
$$p_k = \frac{\exp(\hat{y}_k)}{\sum_{j=1}^{K} \exp(\hat{y}_j)}, \qquad k = 1, \ldots, K,$$
where $K$ is the number of land cover classes. The center-pixel extraction strategy focuses on the most reliable prediction within each patch, reducing boundary effects.
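The head can be sketched in a few lines: a $1 \times 1$ convolution is equivalent to a per-pixel linear map, so a single matrix multiply suffices in this toy version (random weights, illustrative dimensions only).

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max()                          # subtract max for stability
    e = np.exp(z)
    return e / e.sum()

P, D, K = 11, 64, 15                         # patch size, feature dim, classes
Z_fused = rng.standard_normal((2 * D, P, P))
W = rng.standard_normal((K, 2 * D)) * 0.05   # 1x1 conv == per-pixel linear map
logits = np.einsum("kc,chw->khw", W, Z_fused)
p = softmax(logits[:, P // 2, P // 2])       # keep only the center pixel
assert p.shape == (K,) and np.isclose(p.sum(), 1.0)
```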

3.3. Multi-Stage Pseudo-Label Refinement (MS-PLR) Strategy

While the HMGF-Net architecture provides effective multimodal feature fusion, the limited availability of labeled samples in remote sensing applications motivates the development of a semi-supervised training strategy. The MS-PLR strategy leverages unlabeled data through graph-regularized pseudolabeling to progressively refine the model. This training strategy operates on top of the HMGF-Net architecture and consists of three stages, as outlined in Algorithm 1.
Algorithm 1 HMGF-Net Training with MS-PLR Strategy
Require: HSI $X_H$, LiDAR $X_L$, labeled set $D_l$, unlabeled set $D_u$
Ensure: Trained HMGF-Net model $\theta^*$
 1: Stage 1: Supervised Pretraining
 2: Initialize HMGF-Net parameters $\theta$
 3: Train HMGF-Net on $D_l$ for $E_{sup}$ epochs using $\mathcal{L}_{sup}$
 4: Stage 2: Pseudo-Label Generation
 5: Compute predictions and confidences $\{(\hat{y}_j, c_j)\}$ for $D_u$ using HMGF-Net
 6: Compute dynamic threshold $\tau_{dyn}$
 7: Select candidates: $\mathcal{C} \leftarrow \{j : c_j \geq \tau_{dyn}\}$
 8: Build KNN graph on fused features and compute $\{\bar{p}_i\}$
 9: Filter by agreement: $\mathcal{P} \leftarrow \{i \in \mathcal{C} : \arg\max \bar{p}_i = \hat{y}_i\}$
10: Stage 3: Semi-Supervised Refinement
11: Fine-tune HMGF-Net on $D_l \cup \mathcal{P}$ for $E_{ssl}$ epochs with learning rate $\eta_{ssl}$
12: return $\theta^*$

3.3.1. Stage 1: Supervised Pretraining

HMGF-Net is first trained on labeled data using cross-entropy loss with label smoothing to prevent overconfident predictions:
$$\mathcal{L}_{sup} = -\frac{1}{|D_l|} \sum_{i} \sum_{k=1}^{K} \tilde{y}_{ik} \log\big(p_k^{(i)}\big),$$
where the smoothed label $\tilde{y}_{ik}$ applies smoothing parameter $\epsilon = 0.1$. This stage establishes robust initial feature representations that form the foundation for subsequent pseudolabel generation.
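The smoothed target blends the one-hot label toward the uniform distribution, which keeps a floor on the loss even for near-perfect predictions. A minimal sketch of this loss for a single sample (toy probabilities, standard label-smoothing formula):

```python
import numpy as np

def smoothed_ce(p, y, K, eps=0.1):
    # y_tilde = (1 - eps) * onehot(y) + eps / K, then cross-entropy.
    y_tilde = np.full(K, eps / K)
    y_tilde[y] += 1.0 - eps
    return -np.sum(y_tilde * np.log(p))

K = 5
p_confident = np.array([0.92, 0.02, 0.02, 0.02, 0.02])
p_uniform = np.full(K, 1.0 / K)
# smoothing keeps the loss strictly positive even for confident predictions
assert smoothed_ce(p_confident, 0, K) > 0.0
assert smoothed_ce(p_confident, 0, K) < smoothed_ce(p_uniform, 0, K)
```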

3.3.2. Stage 2: Graph-Regularized Pseudolabel Acquisition

To leverage unlabeled data while avoiding error propagation from noisy pseudolabels, we propose a graph-regularized acquisition strategy that combines confidence filtering with feature-space consistency verification. The pipeline is illustrated in Figure 4 and consists of two key components.
For each unlabeled sample $x_j \in D_u$, we compute the prediction confidence as $c_j = \max_k p_k^{(j)}$. A fixed threshold ignores the fact that different categories have varying learning difficulties, which makes difficult categories less likely to be selected when filtering pseudolabeled samples. Therefore, we employ a dynamic threshold that adapts to the confidence distribution:
$$\tau_{dyn} = \tau_{base} + \alpha \cdot \big(\mathrm{Median}(\{c_j\}) - \tau_{base}\big)_+,$$
where $\tau_{base}$ is the base threshold, $\alpha$ is the blending coefficient, and $(\cdot)_+$ denotes the positive-part function. This adaptive threshold adjusts to the model's overall confidence level, ensuring balanced selection across categories with different learning difficulties. Samples exceeding this threshold form the candidate set $\mathcal{C}$.
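The dynamic threshold is a one-liner in practice; the sketch below uses $\tau_{base} = 0.60$ and $\alpha = 0.3$ (the Houston2013-style values from the hyperparameter discussion) on toy confidence arrays.

```python
import numpy as np

def dynamic_threshold(conf, tau_base=0.60, alpha=0.3):
    # tau_dyn = tau_base + alpha * (median(c) - tau_base)_+ :
    # the bar rises only when the model is globally confident.
    return tau_base + alpha * max(np.median(conf) - tau_base, 0.0)

low = np.array([0.3, 0.4, 0.5, 0.55])     # under-confident model
high = np.array([0.8, 0.85, 0.9, 0.95])   # confident model
assert dynamic_threshold(low) == 0.60     # positive part clips to tau_base
assert dynamic_threshold(high) > 0.60     # threshold tightens with confidence
```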
To verify pseudolabel quality through spatial–spectral consistency, we construct a k-nearest neighbor graph in the learned feature space. For each candidate sample, we extract the globally averaged fused feature $f_j = \mathrm{GAP}(Z_{fused}^{(j)}) \in \mathbb{R}^{2D}$ and compute the cosine similarity to identify neighbors.
The neighborhood-aggregated prediction provides a smoothed estimate:
$$\bar{p}_i = \sum_{j \in \mathcal{N}_k(i)} \tilde{A}_{ij} \, p_j,$$
where $\mathcal{N}_k(i)$ denotes the $k$ nearest neighbors and $\tilde{A}_{ij}$ is the row-normalized adjacency weight. A pseudolabel is accepted only if the aggregated neighborhood prediction agrees with the sample’s own prediction:
$$\mathcal{P} = \big\{(x_i, \hat{y}_i) : c_i \geq \tau_{dyn} \ \text{and} \ \arg\max_k \bar{p}_{ik} = \hat{y}_i\big\}.$$
This graph consistency check effectively filters out samples near class boundaries or in ambiguous regions, ensuring that only high-quality pseudolabels contribute to model training.
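The consistency check above can be sketched end to end on toy data: cosine-similarity KNN over fused features, row-normalized weights, and agreement between the aggregated neighborhood prediction and each sample's own pseudolabel. The `graph_consistent` helper and its data are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

def graph_consistent(feats, probs, k=2):
    # Build a cosine-similarity KNN graph and keep samples whose
    # neighborhood-aggregated prediction agrees with their own argmax.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)           # exclude self from neighbors
    keep = []
    for i in range(len(feats)):
        nbr = np.argsort(sim[i])[-k:]        # k nearest by cosine similarity
        w = sim[i, nbr] / sim[i, nbr].sum()  # row-normalized A~_ij
        p_bar = w @ probs[nbr]               # smoothed estimate p_bar_i
        if np.argmax(p_bar) == np.argmax(probs[i]):
            keep.append(i)                   # neighbors agree: accept
    return keep

# two tight feature clusters with consistent predictions -> all accepted
feats = np.vstack([rng.normal(0, 0.05, (4, 8)) + 1.0,
                   rng.normal(0, 0.05, (4, 8)) - 1.0])
probs = np.vstack([np.tile([0.9, 0.1], (4, 1)),
                   np.tile([0.1, 0.9], (4, 1))])
assert graph_consistent(feats, probs) == list(range(8))
```

Samples sitting between the two clusters would receive a mixed $\bar{p}_i$ and be rejected, which is precisely the boundary-filtering behavior described above.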

3.3.3. Stage 3: Semi-Supervised Refinement

After pseudolabel generation, HMGF-Net is fine-tuned on the augmented dataset $D_{aug} = D_l \cup \mathcal{P}$. To account for potential noise in pseudolabels, we apply confidence-based sample reweighting, where labeled samples receive a weight of 1.0 and pseudolabeled samples receive a weight equal to their confidence score $c_i$. Training proceeds with a reduced learning rate $\eta_{ssl} = \gamma \cdot \eta_{base}$ (where $\gamma \in [0.03, 0.05]$) to prevent catastrophic forgetting of the knowledge learned in Stage 1.
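The Stage 3 reweighting scheme reduces to a per-sample weight in the loss; the sketch below uses toy probabilities and an assumed $\eta_{base}$ to illustrate both the weighting and the learning-rate reduction.

```python
import numpy as np

def weighted_loss(p_true, is_labeled, conf):
    # Labeled samples get weight 1.0; pseudolabeled ones get their
    # confidence c_i, shrinking the influence of uncertain labels.
    w = np.where(is_labeled, 1.0, conf)
    return np.mean(w * -np.log(p_true))

p_true = np.array([0.9, 0.8, 0.7, 0.6])       # prob assigned to each target
is_labeled = np.array([True, True, False, False])
conf = np.array([1.0, 1.0, 0.75, 0.65])       # pseudolabel confidences

eta_base, gamma = 1e-3, 0.05                   # eta_base is an assumed value
eta_ssl = gamma * eta_base                     # reduced fine-tuning rate
assert abs(eta_ssl - 5e-5) < 1e-12
# down-weighting reduces the influence of the noisier pseudolabels
assert weighted_loss(p_true, is_labeled, conf) < np.mean(-np.log(p_true))
```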

4. Results

This section presents a comprehensive evaluation of the proposed HMGF-Net with MS-PLR across three benchmark datasets: Houston2013, Augsburg, and Trento. We first describe the evaluation protocol, then present quantitative results with detailed comparisons against state-of-the-art methods, followed by parameter sensitivity studies and visual assessment.

4.1. Evaluation Protocol

The baseline methods span diverse state-of-the-art paradigms published between 2021 and 2024, including CNN-based approaches (Res-CP [30], CCR-Net [31], SepDGConv [32]), few-shot and semi-supervised methods (DCFSL [33], S3Net [34]), dual-modality fusion architectures (DSCA-Net [35], Fusion_HCT [36]), and transformer-based multimodal fusion (MFT [37]). Notably, MFT [37] (2023) represents the current state of the art in multimodal remote sensing transformers, while DSCA-Net [35] (2024) is among the most recent dual-stream adaptive networks. This selection ensures a comprehensive comparison, with seven of the eight baselines published in 2022 or later.
All models operated under identical training conditions with the same hyperspectral and LiDAR input modalities. The proposed HMGF-Net employed its dual-branch encoder, Mamba-based fusion module, and MS-PLR training strategy as described in Section 3.
Table 3 reports the hyperparameter configurations for MS-PLR. The base threshold τ base and KNN neighborhood size k were tuned per dataset via grid search, while the blending coefficient α = 0.3 generalized well across all datasets without per-dataset adjustment. Houston2013 requires a higher τ base (0.60) due to its fifteen-class complexity and finer inter-class boundaries, whereas Trento and Augsburg benefit from larger k (20) owing to their spatially homogeneous agricultural and urban parcels. The KNN graph is reconstructed at the start of each SSL round using updated fused features, ensuring that neighborhood relationships reflect the current model state.

4.2. Quantitative Results

4.2.1. Results on Houston2013

Table 4 presents the complete class-wise and overall results on Houston2013. The proposed HMGF-Net achieves the highest OA (92.30%), AA (93.43%), and Kappa (91.68), outperforming all comparison models. Notably, HMGF-Net demonstrates superior performance on challenging urban classes such as Commercial (94.72%), Residential (98.04%), and Road (91.51%), which exhibit high intra-class variability. The integration of Mamba-based long-range modeling with graph fusion provides stronger context propagation, while MS-PLR reduces pseudolabel noise near class boundaries.

4.2.2. Results on Augsburg

Table 5 reports results on the Augsburg dataset, which contains large-scale and highly heterogeneous urban–vegetation mixtures. HMGF-Net achieves an OA of 88.61%, AA of 78.46%, and Kappa of 83.74, outperforming competing approaches. The Augsburg dataset poses significant challenges due to the dominance of the Residential-Area and Low-Plants classes, which together comprise over 70% of the scene. HMGF-Net significantly improves on Low-Plants (95.36%) and maintains competitive performance across minority classes. The graph-based fusion mechanism effectively incorporates elevation discontinuities and spectral relationships, ensuring reliable pseudolabel propagation.

4.2.3. Results on Trento

Table 6 shows results on the Trento dataset. The proposed method achieves the highest OA (99.39%), AA (98.68%), and Kappa (99.18). Trento consists primarily of agricultural and semi-structured terrain where hyperspectral-LiDAR fusion plays a critical role in distinguishing vegetation types. HMGF-Net produces near-perfect accuracy for classes such as Apple Trees (99.70%), Woods (100.00%), Vineyard (99.98%), and Roads (96.97%). The Mamba state-space module effectively captures long-range spectral patterns, while the KNN graph consistency verification incorporates elevation and spatial continuity across agricultural parcels.

4.3. Comparative Analysis

Across all three datasets, HMGF-Net consistently surpasses existing models in OA, AA, and Kappa. Table 7 summarizes the performance comparison. The improvements arise from four key architectural and methodological strengths:
  • Hybrid Encoder Design: The 3D–2D CNN with residual connections for HSI and dense connections for LiDAR effectively captures modality-specific characteristics while maintaining computational efficiency.
  • Efficient Sequence Modeling: The Mamba block models long-range dependencies with O ( L ) complexity, offering advantages over standard CNNs (limited receptive field) and transformers ( O ( L 2 ) complexity).
  • Graph-Regularized Fusion: The KNN graph consistency verification aligns predictions semantically in the learned feature space, improving robustness against noisy pseudolabels.
  • Progressive Refinement: The MS-PLR strategy progressively expands the training set with validated pseudolabels, enabling effective utilization of unlabeled data under extreme label scarcity.
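The O(L) advantage claimed for the Mamba block can be illustrated with a minimal selective-scan sketch. This NumPy-only toy (random parameters, a diagonal state matrix, and dimensions loosely following the notation of Table 2) is an illustrative assumption rather than the trained model; it shows why the cost grows linearly in sequence length, since each token performs a fixed amount of state-update work.

```python
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """Linear-time state-space scan over a sequence.

    x:     (L, N)    input sequence (L tokens, N channels)
    A_bar: (N, S)    discretized state transition (diagonal, per channel)
    B_bar: (L, N, S) input-dependent input projections
    C:     (L, N, S) input-dependent output projections
    Returns y: (L, N). Each step costs O(N*S), so the whole scan is
    O(L) in sequence length, unlike O(L^2) self-attention.
    """
    L, N = x.shape
    S = A_bar.shape[1]
    h = np.zeros((N, S))              # latent state h_t
    y = np.empty((L, N))
    for t in range(L):
        # h_t = A_bar * h_{t-1} + B_bar_t * x_t  (elementwise, diagonal A)
        h = A_bar * h + B_bar[t] * x[t][:, None]
        # y_t = <C_t, h_t>, summed over the state dimension
        y[t] = (C[t] * h).sum(axis=1)
    return y

# Toy run: an 11x11 patch flattened to L = 121 tokens, N = 128 fused
# channels, S = 16 states (illustrative sizes, not the paper's config).
rng = np.random.default_rng(0)
L, N, S = 121, 128, 16
y = ssm_scan(rng.normal(size=(L, N)),
             rng.uniform(0.8, 0.99, size=(N, S)),
             rng.normal(scale=0.1, size=(L, N, S)),
             rng.normal(scale=0.1, size=(L, N, S)))
print(y.shape)  # (121, 128)
```

Doubling L doubles the number of loop iterations but leaves the per-step cost unchanged, which is the linear-scaling property exploited by the fusion module.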
Table 7. Summary of HMGF-Net classification performance (%) across all datasets.
Dataset       OA      AA      Kappa
Houston2013   92.30   93.43   91.68
Augsburg      88.61   78.46   83.74
Trento        99.39   98.68   99.18

4.4. Parameter Sensitivity Analysis

To investigate the robustness of HMGF-Net to hyperparameter selection, we conducted a systematic sensitivity analysis of three critical parameters: the KNN neighborhood size k, the LiDAR fusion weight ω_l, and the learning rate η. For all experiments, the batch size and patch size were fixed at 32 and 11 × 11, respectively, with label smoothing ε = 0.1.

4.4.1. Impact of KNN Neighborhood Size

The KNN neighborhood size k determines the scope of spatial–spectral consistency verification in the graph-regularized pseudolabel acquisition module. As illustrated in Figure 5, we evaluated classification performance with k values ranging from 5 to 25.
For Houston2013, optimal performance is achieved at k = 5 (OA: 92.30%). This is attributed to its high spatial resolution (2.5 m), where neighboring pixels are more likely to belong to the same category within a compact neighborhood. In contrast, Augsburg and Trento achieve optimal results at k = 20, reflecting their larger homogeneous regions that benefit from broader neighborhood context. These results demonstrate that the optimal k is dataset-dependent and should be tuned to the spatial characteristics of the scene.
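The neighborhood structure governed by k can be sketched as a cosine-similarity KNN graph over fused features, in the spirit of the graph-regularized module; the helper below is an illustrative assumption (function name, clipping, and normalization choices are ours), not the released implementation.

```python
import numpy as np

def knn_graph(features, k=5):
    """Build a row-normalized KNN adjacency from cosine similarity.

    features: (n, d) fused feature vectors (one per sample).
    Returns A_tilde: (n, n), where each row keeps its k most similar
    neighbors (self excluded) and rows sum to 1.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    sim = f @ f.T                        # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)       # exclude self-loops
    n = sim.shape[0]
    A = np.zeros((n, n))
    idx = np.argpartition(-sim, k, axis=1)[:, :k]   # top-k per row
    rows = np.repeat(np.arange(n), k)
    A[rows, idx.ravel()] = sim[rows, idx.ravel()].clip(min=0.0)
    return A / (A.sum(axis=1, keepdims=True) + 1e-12)

# Small k confines consistency checks to tight neighborhoods (Houston2013);
# larger k pools broader context (Augsburg, Trento).
rng = np.random.default_rng(1)
A = knn_graph(rng.random((30, 16)), k=5)
print(A.shape)  # (30, 30)
```

Because only k entries per row are kept, increasing k directly widens the neighborhood over which pseudolabel consensus is computed.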

4.4.2. Impact of LiDAR Fusion Weight

The fusion weight ω_l controls the relative contribution of LiDAR features, with the HSI weight satisfying ω_h = 1 − ω_l. Figure 6 shows classification performance as ω_l varies from 0 to 0.9.
For Houston2013, optimal performance occurs at ω_l = 0.3, indicating that spectral information dominates for distinguishing diverse urban categories. For Augsburg and Trento, balanced fusion (ω_l = 0.5) yields the best results, as elevation information provides crucial discriminative features for separating vegetation types and distinguishing buildings. Performance degradation at extreme weights (ω_l = 0 or 0.9) confirms the importance of multimodal fusion.
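The weighting itself is a one-line convex combination; the following sketch is illustrative (the function name and the point in the network where the weighting is applied are our assumptions, not the paper's exact implementation).

```python
import numpy as np

def fuse(z_hsi, z_lidar, w_l=0.3):
    """Convex combination of modality features before fusion refinement.

    w_l weights the LiDAR branch; the HSI weight is w_h = 1 - w_l, so the
    two contributions always sum to unit weight and the fused features
    stay on the scale of the inputs.
    """
    assert 0.0 <= w_l <= 1.0, "fusion weight must lie in [0, 1]"
    return (1.0 - w_l) * z_hsi + w_l * z_lidar

# w_l = 0 reduces to HSI-only; w_l = 0.9 nearly discards spectral
# features, matching the degradation observed at the extremes of Figure 6.
z_h, z_l = np.ones((4, 64)), np.zeros((4, 64))
print(fuse(z_h, z_l, w_l=0.3).mean())  # 0.7
```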

4.4.3. Impact of Learning Rate

Figure 7 evaluates six learning rates: {3 × 10⁻⁵, 10⁻⁴, 3 × 10⁻⁴, 10⁻³, 3 × 10⁻³, 10⁻²}. Houston2013 achieves optimal performance at 3 × 10⁻⁴, while Augsburg and Trento perform best at 10⁻⁴.
The larger optimal learning rate for Houston2013 may be attributed to its complex fifteen-class feature space requiring more aggressive parameter updates. Excessively large learning rates (10⁻²) cause significant performance degradation across all datasets, particularly for Augsburg (OA drops to 75.82%) due to its severe class imbalance. We recommend learning rates in the range [10⁻⁴, 3 × 10⁻⁴] for similar HSI–LiDAR classification tasks.
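The sweep protocol itself is simple to script. In the sketch below, `evaluate` is a hypothetical stand-in returning a toy accuracy curve (a real run would train the full network per candidate), so the selection loop is runnable.

```python
import numpy as np

def evaluate(lr):
    """Hypothetical stand-in for a full training run at learning rate lr.

    Returns a toy OA that peaks near 3e-4, mimicking the unimodal
    response seen in Figure 7 (illustrative only, not measured data).
    """
    return 92.0 - 4.0 * abs(np.log10(lr) - np.log10(3e-4))

# The six candidates evaluated in Figure 7.
candidates = [3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
best_lr = max(candidates, key=evaluate)
print(best_lr)  # 0.0003
```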

4.5. Visual Assessment

Figure 8, Figure 9 and Figure 10 present visual classification maps for all three datasets. Compared to baseline methods, HMGF-Net produces smoother regions, cleaner class boundaries, and fewer isolated misclassifications.
On Houston2013, HMGF-Net exhibits improved discrimination along road networks and parking lot boundaries, where spectral confusion is prevalent. The Augsburg results show enhanced separation between residential and commercial areas, benefiting from the elevation-aware fusion mechanism. On Trento, the agricultural parcel boundaries are sharply delineated, demonstrating effective utilization of both spectral signatures and terrain structure.

4.6. Summary

The experimental results across the Houston2013, Augsburg, and Trento datasets confirm that HMGF-Net with MS-PLR offers substantial advantages in multimodal semi-supervised learning. Through its combination of spectral–spatial encoding, Mamba-based sequence modeling, graph-regularized fusion, and progressive pseudolabel refinement, the proposed framework delivers robust performance under limited annotation, consistently surpassing state-of-the-art methods across all datasets and evaluation metrics.

5. Discussion

5.1. Ablation Study

To comprehensively evaluate the contribution of each component in our proposed HMGF-Net framework, we conducted extensive ablation experiments on all three benchmark datasets. The ablation study is organized into four parts: (1) module-wise quantitative analysis, presented in Table 8; (2) semi-supervised learning effectiveness, illustrated in Figure 11; (3) feature representation visualization using t-SNE, shown in Figure 12; and (4) computational complexity comparison, reported in Table 9.

5.1.1. Module-Wise Ablation Analysis

To quantify the contribution of each component in our proposed framework, we conducted a comprehensive module-wise ablation study. The results are presented in Table 8. Starting from the HSI-only baseline, we progressively integrated each module and evaluated the resulting classification performance across all three datasets. The HSI-only baseline, which employs only the spectral–spatial encoder without any auxiliary information, achieves OA of 96.64%, 84.04%, and 88.18% on the Trento, Houston2013, and Augsburg datasets, respectively. In contrast, using LiDAR data alone yields significantly lower performance, with OA of 90.24%, 61.45%, and 58.61%, confirming that spectral information from HSI provides more discriminative features than elevation information alone for land-cover classification tasks.
Introducing the Mamba-based fusion module, which integrates both HSI and LiDAR inputs through state-space sequence modeling, substantially changes the classification performance, yielding OA of 99.14%, 92.54%, and 83.96% on the three datasets. The changes of +2.50%, +8.50%, and −4.22% relative to the HSI-only baseline demonstrate the effectiveness of our cross-modal fusion strategy in capturing long-range spatial dependencies and complementary information between modalities on Trento and Houston2013. The temporary performance drop on the Augsburg dataset can be attributed to the introduction of noisy elevation features in heterogeneous urban regions, which is subsequently addressed by the graph-based refinement. Adding the graph-based consistency verification with a fixed threshold further enhances the results to 99.38%, 95.31%, and 88.46% OA, highlighting the importance of pseudolabel quality control through spatial–spectral neighborhood consistency checking.
The performance pattern on Augsburg (88.18% → 83.96% → 88.61%) warrants detailed analysis. The temporary drop after introducing the Mamba fusion module is caused by LiDAR noise at 30 m resolution inducing modality-fusion mismatch, rather than by oversmoothing from sequence modeling. Three observations support this conclusion: (1) LiDAR-only classification achieves only 58.61% OA on Augsburg, the lowest across all datasets, confirming limited discriminative power at coarse resolution; (2) Augsburg’s urban classes exhibit spectral homogeneity but elevation heterogeneity, causing noisy LiDAR features to dilute discriminative spectral information when fused; and (3) if oversmoothing were responsible, degradation would occur across all datasets, yet Houston2013 shows +8.50% improvement and Trento achieves 99.38% OA. The graph-consistency verification addresses this by rejecting pseudolabels where neighborhood predictions disagree, recovering OA from 83.96% to 88.46% (+4.50%).
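The rejection mechanism described above can be sketched compactly. The snippet below is NumPy-only and illustrative: the fixed τ stands in for the paper's dynamic threshold of Equation (9), and the function name is our assumption.

```python
import numpy as np

def accept_pseudolabels(P, A_tilde, tau=0.9):
    """Confidence filtering plus graph-consistency check for pseudolabels.

    P:       (n, K) softmax predictions for unlabeled samples.
    A_tilde: (n, n) row-normalized KNN adjacency over fused features.
    tau:     confidence threshold (fixed stand-in for the dynamic one).
    Returns a boolean acceptance mask and the hard pseudolabels.
    """
    conf = P.max(axis=1)                 # c_j = max_k p_k
    y_hat = P.argmax(axis=1)             # point prediction
    P_bar = A_tilde @ P                  # neighborhood-aggregated predictions
    consistent = P_bar.argmax(axis=1) == y_hat
    return (conf >= tau) & consistent, y_hat

# Sample 2 is fairly confident (0.90) yet disagrees with its neighbors,
# so the consistency check rejects it even though it passes the threshold.
P = np.array([[0.95, 0.05], [0.92, 0.08], [0.10, 0.90], [0.96, 0.04]])
A = np.array([[0, .5, 0, .5], [.5, 0, 0, .5], [.5, .5, 0, 0], [.5, .5, 0, 0]])
mask, labels = accept_pseudolabels(P, A, tau=0.9)
print(mask)  # [ True  True False  True]
```

This is exactly the behavior that recovers Augsburg: confident but locally inconsistent predictions, typical of noisy elevation features, never enter the training set.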
Finally, the complete HMGF-Net with the proposed MS-PLR strategy achieves the best performance across all datasets: 99.41% OA on Trento (+2.77% over baseline), 96.65% OA on Houston2013 (+12.61% over baseline), and 88.61% OA on Augsburg (+0.43% over baseline). The most significant improvement is observed on the Houston2013 dataset, where the complex urban environment with fifteen diverse land cover categories benefits substantially from our multimodal fusion and semi-supervised learning approach. These ablation results validate that each proposed component contributes positively to the final classification performance and that their synergistic combination yields optimal results across diverse remote sensing scenarios.

5.1.2. Effectiveness of Semi-Supervised Learning

To further validate the effectiveness of our proposed semi-supervised learning strategy, we compared three representative configurations, as illustrated in Figure 11: single-source classification using only HSI data, multi-source fusion combining HSI and LiDAR through the Mamba-based module under fully supervised settings, and multi-source with SSL, representing the complete HMGF-Net framework with graph-regularized pseudolabel acquisition. The visualization clearly demonstrates the progressive performance improvement achieved by each enhancement. On the Houston2013 dataset, the multi-source configuration achieves OA of 92.54%, AA of 93.66%, and Kappa of 91.94%, representing substantial improvements of 8.50%, 9.42%, and 9.22% over the single-source baseline (OA = 84.04%, AA = 84.24%, Kappa = 82.72%). Incorporation of our semi-supervised learning strategy with MS-PLR further boosts the performance to OA of 96.65%, AA of 97.08%, and Kappa of 96.37%, demonstrating the significant benefit of leveraging abundant unlabeled samples through propagation of high-confidence pseudolabels.
Similar performance gains are observed for the Trento dataset, where the full HMGF-Net achieves 99.41% OA compared to 96.64% for single-source and 99.14% for multi-source without SSL. The consistent improvement across all three metrics confirms that the semi-supervised refinement effectively enhances the model’s generalization capability by exploiting the rich spatial structure in unlabeled data. For the Augsburg dataset, which presents more challenging scenarios with severe class imbalance and complex urban landscapes, multi-source fusion initially shows slightly lower OA (83.96%) than single-source (88.18%) due to the introduction of noisy elevation features in heterogeneous regions. However, the semi-supervised refinement stage effectively addresses this limitation by selectively incorporating reliable pseudolabels verified through graph-based spatial consistency, ultimately achieving 88.61% OA with notably improved AA (79.75% vs. 76.10%) and Kappa coefficient (84.13% vs. 83.27%). These experimental results confirm that our proposed MS-PLR strategy successfully exploits the complementary information from both modalities while leveraging unlabeled samples to enhance classification accuracy across diverse remote sensing scenarios.

5.1.3. Feature Representation Visualization

To provide intuitive insights into the discriminative capability of the learned feature representations, we employ t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize the high-dimensional features extracted by different ablation configurations, as shown in Figure 12. Each row corresponds to a specific dataset (Houston2013, Augsburg, and Trento, from top to bottom), while each column represents a progressive ablation variant from left to right: HSI only, + LiDAR CNN, + Mamba (supervised), + Graph (fixed τ), and Full HMGF-Net.
For the Houston2013 dataset with fifteen land cover categories, the HSI-only baseline exhibits considerable overlap between semantically similar classes, particularly among different vegetation types and urban structures. The introduction of LiDAR features initially increases inter-class confusion due to the limited discriminative power of elevation information alone. However, the Mamba-based fusion module effectively integrates complementary cross-modal information, resulting in more compact intra-class clusters and clearer inter-class boundaries. The subsequent graph-based refinement and full HMGF-Net configuration progressively improve the cluster separation, with the final visualization showing well-defined, tightly grouped clusters for all fifteen categories with minimal overlap.
Similar progressive improvements are observed for the Augsburg dataset with seven classes. The initial HSI-only representation shows significant mixing between classes 1–2 (different forest types) and classes 3–4 (residential and industrial areas). The Mamba fusion and graph-based refinement stages gradually disentangle these confusing categories, with the full HMGF-Net producing the most separable feature space where each class forms a distinct and compact cluster. For the Trento dataset with six classes, the relatively simpler classification task results in well-separated clusters even with the baseline configuration. Nevertheless, the progressive integration of modules further enhances the compactness of intra-class distributions and increases the margins between different categories. The full HMGF-Net achieves the most discriminative feature representation with clearly defined boundaries and minimal intra-class variance, which directly translates to the superior classification accuracy of 99.41% OA reported in Table 8. These visualizations provide compelling qualitative evidence that each proposed component contributes to learning more discriminative and well-structured feature representations for HSI-LiDAR classification.

5.1.4. Computational Complexity Analysis

Table 9 compares the computational cost of all methods on Houston2013 in terms of trainable parameters, FLOPs, training time, and inference time. HMGF-Net contains only 227.63 K parameters, a reduction of 93.9% relative to DSCA-Net (3737.71 K) and 75.8% relative to MFT (940.79 K). The parameter efficiency arises from replacing quadratic self-attention (O(L²)) with linear state-space recurrence (O(L)) in the Mamba block. Training requires 688.64 s owing to the iterative pseudolabel refinement stages; however, inference completes in 2.07 s for the full test set, comparable to lightweight baselines such as DCFSL (2.04 s) and Fusion_HCT (2.35 s). Despite this modest computational footprint, HMGF-Net attains the highest OA (92.30%), outperforming DSCA-Net by +1.50% while using only 6.1% of its parameters, demonstrating a favorable accuracy–efficiency tradeoff that is suitable for operational deployment.
Compared to the transformer-based MFT, HMGF-Net reduces parameters by 75.8% (227.63 K vs. 940.79 K) while achieving +6.36% higher OA (92.30% vs. 85.94%), confirming the efficiency advantage of linear-complexity state-space modeling over quadratic self-attention.
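A back-of-the-envelope cost model makes the scaling argument concrete. The formulas below count only the dominant terms (they ignore projections and constants), and the dimensions are illustrative assumptions rather than the exact configurations of MFT or HMGF-Net.

```python
def attention_flops(L, d):
    # QK^T and attention-weighted V: both O(L^2 * d)
    return 2 * L * L * d

def ssm_flops(L, d, s):
    # one state update and one readout per token: O(L * d * s)
    return 2 * L * d * s

# An 11 x 11 patch flattened to L = 121 tokens, d = 128 channels,
# state size s = 16 (illustrative values).
L, d, s = 121, 128, 16
ratio = attention_flops(L, d) / ssm_flops(L, d, s)
print(round(ratio, 2))  # 7.56
```

Since the ratio is L/s, the state-space advantage grows with patch size while attention's quadratic term dominates, consistent with the efficiency gap reported in Table 9.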

5.2. Cross-Dataset Generalization Analysis

To assess the portability of HMGF-Net across heterogeneous remote sensing scenarios, we examined performance variations for semantically equivalent classes appearing in multiple datasets.
The Water class exhibits the largest cross-dataset variation, with 100.00% on Houston2013 versus 66.12% on Augsburg. Three factors account for this discrepancy:
  • Spatial Resolution: Houston2013 (2.5 m) resolves water bodies as spectrally pure pixels, whereas Augsburg (30 m) produces mixed pixels containing water, vegetation, and built-up materials along riverbanks and canals. Mixed pixels exhibit ambiguous spectral signatures that reduce classifier confidence.
  • Spectral Range: Augsburg spans 0.4–2.5 µm, including Short-Wave Infra-Red (SWIR) bands where water absorption is strong but variable depending on turbidity, depth, and dissolved constituents. Houston2013 covers only 0.38–1.05 µm, where water reflectance is more stable and distinctive.
  • Water Body Heterogeneity: Houston2013 contains relatively homogeneous urban water features (retention ponds, swimming pools), while Augsburg includes rivers, industrial waterways, and agricultural irrigation channels with diverse spectral characteristics.
Despite these challenges, HMGF-Net achieves the highest accuracy for the Water class on Augsburg among all compared methods (66.12% versus 65.99% for DSCA-Net and 41.78% for MFT), demonstrating robust generalization even under unfavorable imaging conditions. The consistent improvements across all three datasets, each with distinct sensors, resolutions, and land-cover distributions, confirm that the proposed architecture generalizes effectively without dataset-specific tuning. Future work may incorporate resolution-adaptive modules or domain adaptation techniques to further enhance cross-sensor transferability.

6. Conclusions

This work introduces HMGF-Net, a unified multimodal framework designed to address the challenges of semi-supervised hyperspectral–LiDAR classification under extremely limited labeled data. By combining a 3D–2D spectral–spatial encoder for hyperspectral imagery, a multi-scale CNN for LiDAR elevation modeling, and an efficient Mamba selective state-space module for long-range feature refinement, the network captures both local spectral–spatial structure and global contextual dependencies. A graph-based fusion mechanism further enhances cross-modal alignment by modeling relational consistency across spectral, spatial, and elevation domains. To ensure training stability in low-label scenarios, the proposed Multi-Stage Pseudo-Label Refinement (MS-PLR) framework progressively mitigates label noise through confidence filtering, spatial–spectral smoothing, and graph-consistency propagation.
Extensive experiments on the Houston2013, Augsburg, and Trento datasets demonstrate that HMGF-Net consistently outperforms state-of-the-art hyperspectral, multimodal, and semi-supervised learning approaches. The model achieves superior overall accuracy, average accuracy, and Kappa values across all datasets, with notable improvements in structurally complex or spectrally ambiguous classes. The results confirm that integrating selective state-space modeling with graph-guided fusion and progressive pseudolabel refinement offers a robust and efficient solution for multimodal classification under restricted supervision.
Future research may extend the framework toward large-scale scene understanding, real-time inference, and multimodal transformer–state-space hybrids. Moreover, the integration of physics-informed priors, domain generalization mechanisms, or additional modalities such as SAR and multispectral data may further broaden the applicability of the proposed approach in operational remote sensing environments.

Author Contributions

Conceptualization, K.M.H. and Y.L.; Methodology, K.M.H. and S.P.; Software, K.M.H.; Validation, K.Z. and Y.L.; Formal analysis, K.Z.; Investigation, K.Z.; Resources, S.P. and Y.L.; Data curation, K.M.H. and S.P.; Writing—original draft, K.M.H.; Writing—review & editing, K.M.H. and K.Z.; Visualization, S.P.; Supervision, Y.L.; Project administration, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available from their corresponding references. The code and trained models will be made available upon reasonable request to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Datta, D.; Mallick, P.K.; Bhoi, A.K.; Ijaz, M.F.; Shafi, J.; Choi, J. Hyperspectral image classification: Potentials, challenges, and future directions. Comput. Intell. Neurosci. 2022, 2022, 3854635. [Google Scholar] [CrossRef] [PubMed]
  2. Ranjan, P.; Girdhar, A. A comprehensive systematic review of deep learning methods for hyperspectral images classification. Int. J. Remote Sens. 2022, 43, 6221–6306. [Google Scholar] [CrossRef]
  3. Uchaev, D.; Uchaev, D. Small sample hyperspectral image classification based on the random patches network and recursive filtering. Sensors 2023, 23, 2499. [Google Scholar] [CrossRef] [PubMed]
  4. Liu, H.; Bi, W.; Mughees, N. Enhanced hyperspectral image classification technique using PCA-2D-CNN algorithm and null spectrum hyperpixel features. Sensors 2025, 25, 5790. [Google Scholar] [CrossRef]
  5. Torrecillas, C.; Zarzuelo, C.; de la Fuente, J.; Jigena-Antelo, B.; Prates, G. Evaluation and Modelling of the Coastal Geomorphological Changes of Deception Island since the 1970 Eruption and Its Involvement in Research Activity. Remote Sens. 2024, 16, 512. [Google Scholar] [CrossRef]
  6. Wang, Q.; Zhou, B.; Zhang, J.; Xie, J.; Wang, Y. Joint classification of hyperspectral images and lidar data based on dual-branch transformer. Sensors 2024, 24, 867. [Google Scholar] [CrossRef]
  7. Guo, H.; Tian, B.; Liu, W. CCFormer: Cross-Modal Cross-Attention Transformer for Classification of Hyperspectral and LiDAR Data. Sensors 2025, 25, 5698. [Google Scholar] [CrossRef]
  8. Wang, L.; Deng, S. Hypergraph Convolution Network Classification for Hyperspectral and LiDAR Data. Sensors 2025, 25, 3092. [Google Scholar] [CrossRef]
  9. Yang, J.X.; Zhou, J.; Wang, J.; Tian, H.; Liew, A.W.C. LiDAR-guided cross-attention fusion for hyperspectral band selection and image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5515815. [Google Scholar] [CrossRef]
  10. Li, Y.; Xiao, X. Deep learning-based fusion of optical, radar, and LiDAR data for advancing land monitoring. Sensors 2025, 25, 4991. [Google Scholar] [CrossRef]
  11. Jing, H.; Wu, S.; Zhang, L.; Meng, F.; Yan, Y.; Wang, Y.; Du, Z. Heterogeneous contrastive graph fusion network for classification of hyperspectral and LIDAR data. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5521317. [Google Scholar] [CrossRef]
  12. Brezini, S.E.; Deville, Y. Hyperspectral and multispectral image fusion with automated extraction of image-based endmember bundles and sparsity-based unmixing to deal with spectral variability. Sensors 2023, 23, 2341. [Google Scholar] [CrossRef] [PubMed]
  13. Kuras, A.; Brell, M.; Liland, K.H.; Burud, I. Multitemporal feature-level fusion on hyperspectral and LiDAR data in the urban environment. Remote Sens. 2023, 15, 632. [Google Scholar] [CrossRef]
  14. Zhou, H.; Wang, X.; Xia, K.; Ma, Y.; Yuan, G. Transfer learning-based hyperspectral image classification using residual dense connection networks. Sensors 2024, 24, 2664. [Google Scholar] [CrossRef]
  15. Bai, Y.; Liu, D.; Zhang, L.; Wu, H. A Low-Measurement-Cost-Based Multi-Strategy Hyperspectral Image Classification Scheme. Sensors 2024, 24, 6647. [Google Scholar] [CrossRef]
  16. He, Z.; Zhu, Q.; Xia, K.; Ghamisi, P.; Zu, B. Semi-supervised hierarchical Transformer for hyperspectral Image classification. Int. J. Remote Sens. 2024, 45, 21–50. [Google Scholar] [CrossRef]
  17. Zhan, Y.; Wang, Y.; Yu, X. Semisupervised hyperspectral image classification based on generative adversarial networks and spectral angle distance. Sci. Rep. 2023, 13, 22019. [Google Scholar] [CrossRef]
  18. Cao, X.; Li, C.; Feng, J.; Jiao, L. Semi-supervised feature learning for disjoint hyperspectral imagery classification. Neurocomputing 2023, 526, 9–18. [Google Scholar] [CrossRef]
  19. Sun, H.; Chen, R.; Yin, Z.; Yao, H.; Chen, Y.; Xie, W.; Feng, G.; Lu, X. Class-aware consistency learning for open-set semi-supervised hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5535415. [Google Scholar] [CrossRef]
  20. Yang, C.; Liu, Z.; Guan, R.; Zhao, H. A Semi-Supervised Multi-Scale Convolutional Neural Network for Hyperspectral Image Classification with Limited Labeled Samples. Remote Sens. 2025, 17, 3273. [Google Scholar] [CrossRef]
  21. Han, W.; Jiang, W.; Geng, J.; Miao, W. Difference-Complementary Learning and Label Reassignment for Multimodal Semi-Supervised Semantic Segmentation of Remote Sensing Images. IEEE Trans. Image Process. 2025, 34, 566–580. [Google Scholar] [CrossRef] [PubMed]
  22. Zhang, Z.; Fan, X.; Wang, X.; Qin, Y.; Xia, J. A novel remote sensing image change detection approach based on multi-level state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4417014. [Google Scholar] [CrossRef]
  23. Zhu, Q.; Zhang, G.; Zou, X.; Wang, X.; Huang, J.; Li, X. Convmambasr: Leveraging state-space models and cnns in a dual-branch architecture for remote sensing imagery super-resolution. Remote Sens. 2024, 16, 3254. [Google Scholar] [CrossRef]
  24. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote sensing change detection with spatiotemporal state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720. [Google Scholar] [CrossRef]
  25. Li, L.; Yi, J.; Fan, H.; Lin, H. A Lightweight Semantic Segmentation Network Based on Self-attention Mechanism and State Space Model for Efficient Urban Scene Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4703215. [Google Scholar] [CrossRef]
  26. Huang, J.; Yuan, X.; Lam, C.T.; Wang, Y.; Xia, M. LCCDMamba: Visual state space model for land cover change detection of VHR remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 5765–5781. [Google Scholar] [CrossRef]
  27. Debes, C.; Merentitis, A.; Heremans, R.; Hahn, J.; Frangiadakis, N.; van Kasteren, T.; Liao, W.; Bellens, R.; Pižurica, A.; Gautama, S.; et al. Hyperspectral and LiDAR data fusion: Outcome of the 2013 GRSS data fusion contest. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2405–2418. [Google Scholar] [CrossRef]
  28. Hong, D.; Gao, L.; Hang, R.; Zhang, B.; Chanussot, J. Deep encoder–decoder networks for classification of hyperspectral and LiDAR data. IEEE Geosci. Remote Sens. Lett. 2020, 19, 5500205. [Google Scholar] [CrossRef]
  29. Hong, D.; Hu, J.; Yao, J.; Chanussot, J.; Zhu, X.X. Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model. ISPRS J. Photogramm. Remote Sens. 2021, 178, 68–80. [Google Scholar] [CrossRef]
  30. Seydgar, M.; Rahnamayan, S.; Ghamisi, P.; Bidgoli, A.A. Semisupervised hyperspectral image classification using a probabilistic pseudo-label generation framework. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5535218. [Google Scholar] [CrossRef]
  31. Wu, X.; Hong, D.; Chanussot, J. Convolutional neural networks for multimodal remote sensing data classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5517010. [Google Scholar] [CrossRef]
  32. Yang, Y.; Zhu, D.; Qu, T.; Wang, Q.; Ren, F.; Cheng, C. Single-stream CNN with learnable architecture for multisource remote sensing data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5409218. [Google Scholar] [CrossRef]
  33. Li, Z.; Liu, M.; Chen, Y.; Xu, Y.; Li, W.; Du, Q. Deep Cross-Domain Few-Shot Learning for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5501618. [Google Scholar] [CrossRef]
  34. Xue, Z.; Zhou, Y.; Du, P. S3Net: Spectral–spatial Siamese network for few-shot hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5531219. [Google Scholar] [CrossRef]
  35. Lu, T.; Fang, Y.; Fu, W.; Ding, K.; Kang, X. Dual-stream class-adaptive network for semi-supervised hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5507511. [Google Scholar] [CrossRef]
  36. Zhao, G.; Ye, Q.; Sun, L.; Wu, Z.; Pan, C.; Jeon, B. Joint classification of hyperspectral and LiDAR data using a hierarchical CNN and transformer. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5500716. [Google Scholar] [CrossRef]
  37. Roy, S.K.; Deria, A.; Hong, D.; Rasti, B.; Plaza, A.; Chanussot, J. Multimodal fusion transformer for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5515620. [Google Scholar] [CrossRef]
Figure 1. Overall framework of the proposed HMGF-Net with MS-PLR training strategy.
Figure 2. Dual-branch feature extraction module.
Figure 3. Feature fusion and Mamba sequence modeling module. HSI and LiDAR features are concatenated, reshaped to sequence, processed through GRU gates, and reshaped back to spatial domain. Ellipsis dots (…) indicate omitted intermediate layers for visual clarity.
Figure 4. Graph-regularized pseudolabel acquisition pipeline. Stage 1 (Confidence Filtering): for each unlabeled sample x_j, the prediction confidence c_j = max_k p_k^(j) is compared against a dynamic threshold τ_dyn computed via Equation (9); samples with c_j < τ_dyn are rejected. Stage 2 (KNN Graph): fused features f_j = GAP(Z_fused^(j)) are used to construct a k-nearest-neighbor graph based on cosine similarity, yielding the normalized adjacency Ã. Stage 3 (Consistency Check): neighborhood-aggregated predictions p̄_i = Σ_{j∈N_k(i)} Ã_ij p_j determine acceptance; a pseudolabel enters P only when argmax_k p̄_{i,k} = ŷ_i, ensuring that the local consensus matches the point prediction.
Figure 5. Impact of KNN neighborhood size on classification performance: (a) Houston2013; (b) Augsburg; (c) Trento.
Figure 6. Impact of LiDAR fusion weight on classification performance: (a) Houston2013; (b) Augsburg; (c) Trento.
Figure 7. Impact of learning rate on classification performance: (a) Houston2013; (b) Augsburg; (c) Trento.
Figure 8. Classification maps for Houston2013: (a) ground truth; (b) Res-CP; (c) DCFSL; (d) S3Net; (e) DSCA-Net; (f) CCR-Net; (g) SepDGConv; (h) Fusion_HCT; (i) MFT; (j) proposed HMGF-Net (92.30%).
Figure 9. Classification maps for Augsburg: (a) ground truth; (b) Res-CP; (c) DCFSL; (d) S3Net; (e) DSCA-Net; (f) CCR-Net; (g) SepDGConv; (h) Fusion_HCT; (i) MFT; (j) proposed HMGF-Net (88.61%).
Figure 10. Classification maps for Trento: (a) ground truth; (b) Res-CP; (c) DCFSL; (d) S3Net; (e) DSCA-Net; (f) CCR-Net; (g) SepDGConv; (h) Fusion_HCT; (i) MFT; (j) proposed HMGF-Net (99.39%).
Figure 11. Ablation study on the effectiveness of semi-supervised learning: (a) Houston2013; (b) Augsburg; (c) Trento.
Figure 12. t-SNE visualization of learned feature representations across different ablation configurations. Each row corresponds to a dataset (Houston2013, Augsburg, Trento) and each column represents an ablation variant (HSI only, + LiDAR CNN, + Mamba, + Graph, Full HMGF-Net). Different colors indicate different land cover categories.
Table 1. Dataset description (HSI / LiDAR values separated by a slash; "–" where not reported).

| Property | Houston2013 [27] | Trento [28] | Augsburg [29] |
|---|---|---|---|
| Location | Houston, Texas, USA | Trento, Italy | Augsburg, Germany |
| Sensor Type | HSI / LiDAR | HSI / LiDAR | HSI / LiDAR |
| Image Size | 349 × 1905 | 600 × 166 | 332 × 485 |
| Spatial Resolution | 2.5 m | 1 m | 30 m |
| Number of Bands | 144 / 1 | 63 / 1 | 180 / 1 |
| Wavelength Range | 0.38–1.05 µm / – | 0.42–0.99 µm / – | 0.4–2.5 µm / – |
| Sensor Name | CASI-1500 / – | AISA Eagle / Optech ALTM 3100EA | HySpex / DLR-3K |
Table 2. Summary of the notation used in this paper.

| Symbol | Description | Symbol | Description |
|---|---|---|---|
| **Input and Output** | | | |
| X_H | HSI cube (H × W × C) | X_L | LiDAR DSM (H × W × 1) |
| K | Number of land-cover classes | P | Patch size |
| D_l | Labeled training set | D_u | Unlabeled sample pool |
| **Feature Extraction** | | | |
| F_H^(l) | HSI feature at layer l | F_L^(l) | LiDAR feature at layer l |
| Z_H | Encoded HSI features | Z_L | Encoded LiDAR features |
| D | Feature dimension (64) | H(·) | Layer mapping function |
| **Mamba Fusion Module** | | | |
| Z_cat | Concatenated features | S | Sequence representation |
| L | Sequence length (P²) | N | Feature channels (2D) |
| h_t | Latent state at position t | N_s | State dimension |
| Ā | Discretized state matrix | B̄_t, C_t | Input-dependent projections |
| Z_fused | Fused output features | σ_g | SiLU activation |
| **Pseudo-Label Acquisition** | | | |
| c_j | Prediction confidence | τ_base | Base threshold |
| τ_dyn | Dynamic threshold | α | Blending coefficient |
| N_k(i) | k-nearest neighbors of i | Ã_ij | Normalized adjacency |
| p̄_i | Aggregated prediction | 𝒫 | Validated pseudolabel set |
| **Training Parameters** | | | |
| L_sup | Supervised loss | ε | Label smoothing (0.1) |
| η | Base learning rate | γ | LR reduction factor |
| E_sup | Supervised epochs | E_ssl | SSL refinement epochs |
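The Mamba-module symbols above (h_t, Ā, B̄_t, C_t) follow the standard discretized selective state-space recurrence h_t = Ā ⊙ h_{t−1} + B̄_t x_t, y_t = C_t h_t. The following is a minimal sequential NumPy sketch of that recurrence, with shapes matching the table; it is illustrative only, not the authors' implementation (which would use a parallel scan):

```python
import numpy as np

def selective_scan(x, A_bar, B_bar, C):
    """Sequential form of the discretized SSM recurrence:
    h_t = A_bar * h_{t-1} + B_bar_t * x_t,   y_t = C_t . h_t
    x:     (L, N)      token sequence (L = P^2, N = 2D channels)
    A_bar: (N, Ns)     discretized (diagonal, per-channel) state matrix
    B_bar: (L, N, Ns)  input-dependent input projection
    C:     (L, N, Ns)  input-dependent output projection
    Returns y: (L, N)."""
    L, N = x.shape
    Ns = A_bar.shape[1]
    h = np.zeros((N, Ns))                          # latent state h_t
    y = np.empty((L, N))
    for t in range(L):
        h = A_bar * h + B_bar[t] * x[t][:, None]   # state update
        y[t] = (C[t] * h).sum(axis=1)              # readout y_t = C_t . h_t
    return y
```

Because A_bar is treated as diagonal per channel, each of the N channels carries an independent N_s-dimensional state, which is what keeps the scan linear in sequence length.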
Table 3. Hyperparameter settings for MS-PLR across datasets. Values were selected via grid search on a held-out validation set (10% of training samples) *.

| Hyperparameter | Houston2013 | Trento | Augsburg |
|---|---|---|---|
| Base threshold τ_base | 0.60 | 0.50 | 0.55 |
| Blending coefficient α | 0.3 | 0.3 | 0.3 |
| KNN neighborhood k | 5 | 20 | 20 |
| Graph update frequency | Every SSL round | Every SSL round | Every SSL round |
| Supervised epochs E_sup | 100 | 80 | 100 |
| SSL refinement epochs E_ssl | 15 | 20 | 15 |

* Search ranges: τ_base ∈ [0.40, 0.70]; α ∈ [0.1, 0.5]; k ∈ [5, 25].
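One plausible reading of how the τ_base, α, and k settings above interact in a single refinement round is sketched below: per-sample class probabilities are smoothed over a KNN graph in feature space, blended with coefficient α, and only pseudolabels whose smoothed confidence clears the threshold are kept. This is a hypothetical helper for illustration, not the authors' code (the paper's pipeline adds dynamic thresholding and graph-consistency checks on top):

```python
import numpy as np

def refine_pseudolabels(probs, feats, k=5, alpha=0.3, tau=0.6):
    """One illustrative MS-PLR-style round.
    probs: (n, K) softmax predictions for unlabeled samples
    feats: (n, D) encoded features used to build the KNN graph
    Returns (indices of accepted samples, their pseudolabels)."""
    # pairwise squared distances -> k nearest neighbors, excluding self
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]            # N_k(i)
    # row-normalized aggregation: mean of neighbors' predictions
    p_nbr = probs[nbrs].mean(axis=1)
    p_bar = (1 - alpha) * probs + alpha * p_nbr     # blend with alpha
    conf = p_bar.max(axis=1)                        # smoothed confidence
    keep = np.where(conf >= tau)[0]                 # threshold filter
    return keep, p_bar[keep].argmax(axis=1)
```

Smoothing before thresholding is what lets isolated, overconfident mispredictions be vetoed by their spatial–spectral neighbors.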
Table 4. Comparison of classification accuracy (%) of different methods on the Houston2013 dataset. The competing methods include Res-CP [30], DCFSL [33], S3Net [34], DSCA-Net [35], CCR-Net [31], SepDGConv [32], Fusion_HCT [36], and MFT [37]. The best results are highlighted in bold.

| No. | Class (Train/Test) | Res-CP | DCFSL | S3Net | DSCA-Net | CCR-Net | SepDGConv | Fusion_HCT | MFT | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Healthy grass (10/1241) | **98.51** | 96.21 | 94.04 | 93.31 | 83.40 | 87.37 | 82.35 | 93.39 | 84.03 |
| 2 | Stressed grass (10/1244) | 87.96 | 96.62 | 91.48 | 92.77 | 83.68 | 86.60 | 99.52 | 81.99 | **100.00** |
| 3 | Synthetic grass (10/687) | **100.00** | 98.98 | **100.00** | 98.98 | 98.25 | 98.85 | 97.09 | 99.56 | **100.00** |
| 4 | Trees (10/1234) | **98.07** | 96.52 | 95.79 | 88.74 | 87.60 | 75.72 | 96.27 | 92.22 | 93.28 |
| 5 | Soil (10/1232) | 99.47 | 99.19 | 95.13 | **100.00** | 95.29 | 97.18 | **100.00** | 99.35 | 98.89 |
| 6 | Water (10/315) | **100.00** | 82.86 | 98.41 | 89.84 | 86.98 | 92.62 | **100.00** | 98.09 | **100.00** |
| 7 | Residential (10/1258) | 93.39 | 87.52 | 82.19 | 93.64 | 70.11 | 74.68 | 76.39 | 85.93 | **98.04** |
| 8 | Commercial (10/1234) | 67.76 | 65.88 | 69.04 | 75.12 | 58.83 | 59.24 | 75.93 | 73.99 | **94.72** |
| 9 | Road (10/1242) | 87.41 | 80.27 | 74.48 | 91.38 | 50.56 | 56.31 | 87.20 | 69.40 | **91.51** |
| 10 | Highway (10/1217) | 73.79 | 81.68 | 76.99 | 78.80 | 57.76 | 56.89 | **81.92** | 65.16 | 74.59 |
| 11 | Railway (10/1225) | 91.16 | 80.73 | **94.37** | 90.12 | 54.12 | 78.29 | 90.78 | 85.79 | 93.86 |
| 12 | Parking lot 1 (10/1223) | 71.74 | **93.87** | 75.31 | 89.29 | 73.02 | 81.59 | 59.52 | 92.80 | 79.39 |
| 13 | Parking lot 2 (10/459) | 87.84 | 98.26 | 94.77 | **99.78** | 69.72 | 89.98 | 86.27 | 79.74 | 93.21 |
| 14 | Tennis court (10/418) | 98.12 | 98.33 | **100.00** | **100.00** | 90.19 | 92.29 | **100.00** | **100.00** | **100.00** |
| 15 | Running track (10/650) | 92.07 | 99.23 | **100.00** | 98.00 | 76.15 | 98.03 | **100.00** | 97.54 | **100.00** |
| | OA (%) | 87.31 | 89.38 | 87.26 | 90.80 | 73.71 | 78.79 | 87.00 | 85.94 | **92.30** |
| | AA (%) | 89.82 | 90.41 | 89.47 | 91.99 | 75.71 | 81.71 | 88.88 | 87.66 | **93.43** |
| | Kappa (×100) | 86.28 | 88.52 | 86.24 | 90.06 | 71.66 | 77.11 | 85.95 | 84.80 | **91.68** |
Table 5. Comparison of classification accuracy (%) of different methods on the Augsburg dataset. The competing methods include Res-CP [30], DCFSL [33], S3Net [34], DSCA-Net [35], CCR-Net [31], SepDGConv [32], Fusion_HCT [36], and MFT [37]. The best results are highlighted in bold.

| No. | Class (Train/Test) | Res-CP | DCFSL | S3Net | DSCA-Net | CCR-Net | SepDGConv | Fusion_HCT | MFT | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Forest (10/13,497) | 90.50 | 91.74 | **98.53** | 93.91 | 89.61 | 83.08 | 93.37 | 94.66 | 97.82 |
| 2 | Residential-Area (10/30,319) | **98.66** | 70.27 | 86.59 | 87.62 | 45.08 | 71.19 | 87.84 | 92.82 | 86.45 |
| 3 | Industrial-Area (10/3841) | **77.17** | 52.90 | 50.19 | 63.08 | 46.99 | 45.44 | 24.24 | 68.68 | 61.28 |
| 4 | Low-Plants (10/26,847) | **99.58** | 67.50 | 56.79 | 81.81 | 47.10 | 47.83 | 67.73 | 90.17 | 95.36 |
| 5 | Allotment (10/565) | 20.89 | **87.96** | 73.98 | 69.38 | 86.19 | 72.52 | 79.82 | 33.63 | 79.65 |
| 6 | Commercial-Area (10/1635) | 17.21 | 42.87 | 55.35 | 47.65 | 37.19 | 32.21 | **71.44** | 18.84 | 62.54 |
| 7 | Water (10/1520) | 39.54 | 62.50 | 65.65 | 65.99 | 52.70 | 42.42 | 59.74 | 41.78 | **66.12** |
| | OA (%) | 83.36 | 71.57 | 75.48 | 84.12 | 53.83 | 62.59 | 77.82 | 88.07 | **88.61** |
| | AA (%) | 63.37 | 67.96 | 69.59 | 72.78 | 57.84 | 56.39 | 69.17 | 62.93 | **78.46** |
| | Kappa (×100) | 77.38 | 62.99 | 67.05 | 77.94 | 42.56 | 50.33 | 69.37 | 82.93 | **83.74** |
Table 6. Comparison of classification accuracy (%) of different methods on the Trento dataset. The competing methods include Res-CP [30], DCFSL [33], S3Net [34], DSCA-Net [35], CCR-Net [31], SepDGConv [32], Fusion_HCT [36], and MFT [37]. The best results are highlighted in bold.

| No. | Class (Train/Test) | Res-CP | DCFSL | S3Net | DSCA-Net | CCR-Net | SepDGConv | Fusion_HCT | MFT | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Apple trees (10/4024) | 98.46 | 99.58 | 89.46 | **99.85** | 98.31 | 88.57 | 98.73 | 96.94 | 99.70 |
| 2 | Building (10/2893) | 90.23 | 89.70 | 73.56 | 78.64 | 95.06 | 98.97 | 90.94 | **99.89** | 97.77 |
| 3 | Ground (10/469) | 98.71 | **99.79** | 97.87 | 98.08 | 81.88 | 97.28 | 97.65 | 83.37 | 97.63 |
| 4 | Woods (10/9113) | 99.98 | 98.95 | 81.75 | **100.00** | 99.62 | 99.25 | **100.00** | 99.84 | **100.00** |
| 5 | Vineyard (10/10,491) | 99.28 | 98.73 | 72.90 | **100.00** | 88.79 | 91.47 | 92.76 | 98.83 | 99.98 |
| 6 | Roads (10/3164) | 64.89 | 82.33 | 81.48 | 82.02 | 88.15 | 92.31 | 84.54 | 83.57 | **96.97** |
| | OA (%) | 93.63 | 96.34 | 79.14 | 96.01 | 94.38 | 94.34 | 94.78 | 97.14 | **99.39** |
| | AA (%) | 91.92 | 94.85 | 82.84 | 93.10 | 92.13 | 94.64 | 94.11 | 93.74 | **98.68** |
| | Kappa (×100) | 91.50 | 95.12 | 72.27 | 94.68 | 92.21 | 92.48 | 93.10 | 96.19 | **99.18** |
Table 8. Module-wise ablation of HMGF-Net on three benchmark datasets. Ticks (✓) indicate enabled modules; crosses (✗) indicate disabled modules. The best results are shown in bold.

| Variant | HSI | LiDAR | Mamba | Graph | MS-PLR | Trento (OA / AA / κ) | Houston2013 (OA / AA / κ) | Augsburg (OA / AA / κ) |
|---|---|---|---|---|---|---|---|---|
| HSI only (baseline) | ✓ | ✗ | ✗ | ✗ | ✗ | 96.64 / 94.37 / 95.50 | 84.04 / 84.24 / 82.72 | 88.18 / 76.10 / 83.77 |
| LiDAR CNN only | ✗ | ✓ | ✗ | ✗ | ✗ | 90.24 / 90.64 / 87.26 | 61.45 / 60.89 / 58.43 | 58.61 / 58.97 / 48.66 |
| + Mamba block | ✓ | ✓ | ✓ | ✗ | ✗ | 99.38 / 98.95 / 99.17 | 92.54 / 93.66 / 91.94 | 83.96 / 77.13 / 79.40 |
| + Graph fusion (fixed τ) | ✓ | ✓ | ✓ | ✓ | ✗ | 99.38 / **99.38** / 99.19 | 95.31 / 96.04 / 94.93 | 88.46 / 79.24 / 83.91 |
| Full HMGF-Net (MS-PLR) | ✓ | ✓ | ✓ | ✓ | ✓ | **99.41** / 99.11 / **99.22** | **95.97** / **96.34** / **95.60** | **88.61** / **79.75** / **84.13** |
Table 9. Computational complexity comparison on the Houston2013 dataset. Params: trainable parameters (K); FLOPs: floating-point operations (M); Mem: peak GPU memory (GB); Train: total training time (s); Test: inference time for entire test set (s). Hardware: NVIDIA RTX 4090 (24 GB), 64 GB RAM (Santa Clara, CA, USA), batch size 32, patch size 11 × 11. Best results in bold.

| Method | Year | Params (K) | FLOPs (M) | Mem (GB) | Train (s) | Test (s) | OA (%) |
|---|---|---|---|---|---|---|---|
| CCR-Net [31] | 2021 | **70.08** | **0.14** | **1.26** | **1.12** | **0.06** | 73.71 |
| Res-CP [30] | 2022 | 180.02 | 2.23 | 1.81 | 15.77 | 1.56 | 87.31 |
| S3Net [34] | 2022 | 229.10 | 46.87 | 2.43 | 23.86 | 4.27 | 87.26 |
| DCFSL [33] | 2022 | 284.14 | 28.75 | 2.65 | 63.14 | 2.04 | 89.38 |
| SepDGConv [32] | 2022 | 312.45 | 18.64 | 2.21 | 98.42 | 1.82 | 78.79 |
| Fusion_HCT [36] | 2022 | 425.38 | 32.17 | 3.12 | 45.63 | 2.35 | 87.00 |
| MFT [37] | 2023 | 940.79 | 6.32 | 4.61 | 63.70 | 1.78 | 85.94 |
| DSCA-Net [35] | 2024 | 3737.71 | 322.52 | 8.44 | 72.35 | 5.81 | 90.80 |
| HMGF-Net (Ours) | 2025 | 227.63 | 52.48 | 2.86 | 88.64 | 2.07 | **92.30** |

Metrics are reported on Houston2013 as a representative benchmark; architecture and computational cost remain consistent across datasets, with <5% FLOPs variation due to input band differences (63–180 bands).
Share and Cite

Hussain, K.M.; Zhao, K.; Perviaz, S.; Li, Y. Hybrid Mamba–Graph Fusion with Multi-Stage Pseudo-Label Refinement for Semi-Supervised Hyperspectral–LiDAR Classification. Sensors 2026, 26, 1005. https://doi.org/10.3390/s26031005