1. Introduction
Hyperspectral Images (HSI) provide dense spectral information across hundreds of narrow bands and have become indispensable for material identification and land cover analysis. However, issues such as spectral redundancy, nonlinear mixing, and high dimensionality continue to challenge robust spectral–spatial modeling, motivating the development of advanced learning strategies for high-dimensional remote sensing data [1,2]. Recent deep learning approaches have improved HSI classification through joint spatial–spectral feature extraction, hybrid CNN–attention architectures, and lightweight networks designed for small-sample settings [3,4,5].
To address the limitations of spectral information alone, complementary modalities such as Light Detection and Ranging (LiDAR) have been extensively integrated with HSI. LiDAR provides elevation and structural cues that enhance edge delineation and mitigate spectral ambiguity. Numerous studies have demonstrated that combining HSI and LiDAR significantly improves classification accuracy, particularly in heterogeneous or urban environments [6,7,8]. Recent advances include dual-branch transformers, hypergraph networks, and cross-attention fusion modules that model heterogeneous spectral–elevation interactions more effectively [9,10,11]. In addition to dual-modality setups, multisensor fusion frameworks now exploit combinations of hyperspectral, multispectral, radar, and LiDAR data to achieve more stable and generalizable scene understanding [12,13].
Despite this progress, multimodal HSI–LiDAR fusion remains difficult under limited supervision. HSI suffers from large intra-class spectral variance and sensitivity to noise, while LiDAR exhibits irregular sampling patterns and modality-specific distortions. Effective fusion requires bridging disparate feature spaces while preserving modality-specific advantages. Deep HSI classifiers based on random-patch learning, dense residual transfer, and spectral–spatial CNNs offer improved robustness [3,4,14], yet most multimodal fusion networks assume abundant labeled samples and degrade sharply when labels are scarce [15,16].
To address this, Semi-Supervised Learning (SSL) has gained importance in HSI classification. Modern SSL methods integrate unlabeled samples through consistency constraints, pseudolabel refinement, adversarial learning, and hybrid generative–discriminative models [17,18,19]. In addition, multiscale refinement strategies and cost-aware learning mechanisms have been shown to substantially improve label efficiency in low-annotation scenarios [15,20]. Nonetheless, applying SSL to multimodal fusion remains challenging, as pseudolabel noise can propagate between modalities, causing semantic drift unless cross-modal agreement and structural consistency are jointly enforced [21].
Meanwhile, long-range modeling advances have reshaped deep learning for remote sensing. Transformer-based architectures and graph neural networks have demonstrated strong spectral–spatial reasoning capabilities, but remain limited by computational complexity when applied to high-resolution hyperspectral cubes [2,16]. Recently, visual State-Space Models (SSMs), particularly Mamba-inspired architectures, have emerged as efficient alternatives that capture long-range dependencies with linear complexity. RS3Mamba and ConvMambaSR show excellent performance in segmentation and super-resolution tasks, highlighting the potential of SSMs for hyperspectral sequence modeling [22,23]. Additional work in SSM-driven remote sensing indicates improved efficiency, stability, and scalability compared to transformer-based models [24,25,26].
Despite these advances, existing state-space and graph-based approaches for remote sensing classification exhibit several limitations. Current Mamba-inspired architectures such as RS3Mamba [22] and ConvMambaSR [23] employ fixed scanning strategies over the spatial grid that do not adapt to irregular object boundaries in heterogeneous scenes, potentially overlooking critical cross-modal relationships at class transitions where spectral and elevation discontinuities do not align. While graph neural networks have shown promise for HSI classification through hypergraph convolutions [8] and cross-attention mechanisms [11], they face well-documented over-smoothing effects when stacked beyond two or three layers, and incur significant computational overhead when constructing dense pixel graphs over high-resolution hyperspectral cubes [2,16]. Most critically, neither paradigm alone addresses the challenge of pseudolabel noise propagation in semi-supervised multimodal learning: erroneous predictions generated from one modality can corrupt feature representations through the fusion mechanism, leading to semantic drift during iterative self-training [21]. These limitations motivate the design of HMGF-Net, which combines efficient state-space sequence modeling with graph-based consistency verification specifically targeted at pseudolabel quality control.
These observations collectively motivate the design of a unified architecture capable of addressing the interconnected challenges of multimodal learning under limited labels. Existing methods rarely combine spectral–spatial HSI encoding, LiDAR structural modeling, efficient long-range sequence learning, graph reasoning, and systematic pseudolabel refinement into a single coherent framework. Moreover, multimodal SSL remains vulnerable to inconsistent pseudolabel predictions, which can propagate uncertainty and destabilize training. Additionally, high-dimensional hyperspectral sequences require an efficient long-range modeling approach that avoids the computational burden of transformers. These challenges form the foundation for our proposed methodology.
To address these intertwined challenges, we introduce HMGF-Net, a Hybrid Mamba–Graph Fusion Network equipped with an end-to-end Multi-Stage Pseudo-Label Refinement (MS-PLR) mechanism. In contrast to existing multimodal approaches, HMGF-Net integrates spectral–spatial representation learning for hyperspectral data, multiscale geometric modeling for LiDAR elevation cues, and efficient long-range dependency modeling through the Mamba selective state-space paradigm. These encoded features are subsequently processed within a graph-based fusion network that captures cross-modal relational structure and enhances contextual reasoning. Finally, the MS-PLR pipeline progressively refines pseudolabels through confidence filtering, spatial–spectral smoothing, and graph-consistency propagation, enabling the network to suppress noise, reinforce cross-modal stability, and achieve high classification accuracy even in demanding low-label conditions.
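For concreteness, the following minimal sketch illustrates one MS-PLR round, assuming softmax probabilities over unlabeled pixels and a precomputed k-nearest-neighbor index in the fused feature space. The function name, the equal-weight smoothing blend, and the majority-agreement rule are illustrative assumptions rather than the exact formulation used in the network.

```python
import numpy as np

def ms_plr_round(probs, knn_idx, tau=0.6):
    """One illustrative MS-PLR round.

    probs:   (N, C) softmax outputs for N unlabeled pixels
    knn_idx: (N, k) indices of each pixel's k nearest neighbors in the
             fused feature space (the graph also encodes spatial proximity)
    tau:     base confidence threshold (0.60 is the Houston2013 setting)
    """
    # Stage 1: confidence filtering -- retain only confident predictions.
    keep = probs.max(axis=1) >= tau

    # Stage 2: spatial-spectral smoothing -- blend each distribution with
    # the mean of its neighbors (the 0.5 blend weight is assumed).
    smoothed = 0.5 * probs + 0.5 * probs[knn_idx].mean(axis=1)
    labels = smoothed.argmax(axis=1)

    # Stage 3: graph-consistency propagation -- accept a pseudolabel only
    # if most graph neighbors carry the same label.
    agree = (labels[knn_idx] == labels[:, None]).mean(axis=1)
    keep &= agree > 0.5
    return labels, keep
```

Pixels whose pseudolabels survive all three stages would then augment the training set for the next round, mirroring the progressive refinement described above.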
The main contributions of this work are summarized below:
We present HMGF-Net, a unified multimodal architecture that integrates a 3D–2D spectral–spatial CNN encoder for hyperspectral data, a multiscale CNN for LiDAR elevation modeling, and Mamba-based selective state-space modeling for efficient long-range dependency learning. This design combines local spectral–spatial feature extraction with global sequence modeling, enabling a more expressive and computationally efficient multimodal representation than conventional CNN- or transformer-based approaches.
We introduce a graph-guided multimodal fusion mechanism that aligns hyperspectral and LiDAR features using relational modeling based on spectral similarity, spatial proximity, and elevation-informed neighborhood structure (a construction sketch is given after this list). This graph-based fusion strategy promotes more coherent cross-modal interactions by preserving geometric continuity and spectral–spatial relationships, thereby enabling the network to integrate complementary modality information more effectively than concatenation, attention-only fusion, or shallow multimodal alignment methods.
We develop a Multi-Stage Pseudo-Label Refinement (MS-PLR) framework designed to stabilize semi-supervised learning through progressive noise suppression. The refinement process incorporates confidence filtering, spatial–spectral neighborhood smoothing, and graph consistency propagation to reduce the influence of unreliable predictions. This enables more reliable pseudolabel supervision in low-label scenarios, preventing semantic drift and improving training stability by ensuring that refined labels remain structurally consistent with both spectral–spatial patterns and elevation cues.
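As referenced in the second contribution, the sketch below shows one plausible way to build such a multi-cue affinity graph, combining Gaussian kernels over spectral distance, spatial distance, and elevation difference. The bandwidth values, the multiplicative combination, and the dense-then-sparsify construction are assumptions for illustration; in the proposed network the graph is rebuilt over learned fused features (see Section 4.1).

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_fusion_graph(spec, xy, elev, k=20,
                       sigma_spec=1.0, sigma_xy=5.0, sigma_elev=1.0):
    """Illustrative k-NN affinity graph built from three cues.

    spec: (N, B) spectra; xy: (N, 2) pixel coordinates;
    elev: (N, 1) LiDAR elevations. All bandwidths are assumed values.
    Dense O(N^2) distances -- suitable only for small N in this sketch.
    """
    w = (np.exp(-(cdist(spec, spec) / sigma_spec) ** 2)     # spectral similarity
         * np.exp(-(cdist(xy, xy) / sigma_xy) ** 2)         # spatial proximity
         * np.exp(-(cdist(elev, elev) / sigma_elev) ** 2))  # elevation structure
    np.fill_diagonal(w, 0.0)
    # Sparsify: keep each node's k strongest edges, then symmetrize.
    adj = np.zeros_like(w)
    nn = np.argsort(-w, axis=1)[:, :k]
    rows = np.arange(w.shape[0])[:, None]
    adj[rows, nn] = w[rows, nn]
    return np.maximum(adj, adj.T)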
2. Materials and Methods
2.1. Datasets
To assess the effectiveness of the proposed approach, three publicly available multisensor remote sensing datasets are used: the Houston2013, Trento, and Augsburg datasets. Their detailed parameters are summarized in Table 1.
Houston2013 dataset. The Houston2013 dataset was captured using the ITRES CASI-1500 (ITRES Research Limited, Calgary, AB, Canada) sensor over the University of Houston campus and its surrounding urban area in Houston, Texas, USA, in 2012. This dataset includes both HSI and LiDAR DSM data. The spatial dimensions of the dataset are 349 × 1905 pixels, with a spatial resolution of approximately 2.5 m. The HSI data consist of 144 spectral bands, covering the wavelength range from 380 to 1050 nm. The LiDAR data provide elevation information for ground features. The land cover is categorized into fifteen types: Healthy Grass, Stressed Grass, Synthetic Grass, Trees, Soil, Water, Residential, Commercial, Road, Highway, Railway, Parking Lot 1, Parking Lot 2, Tennis Court, and Running Track.
Augsburg dataset. The Augsburg dataset consists of paired HSI and LiDAR DSM data; the HSI data were collected using the HySpex (Norsk Elektro Optikk AS, Skedsmokorset, Norway) sensor, while the LiDAR DSM data were obtained with the DLR-3K sensor (German Aerospace Center, Oberpfaffenhofen, Germany). The dataset was acquired over the urban environment of Augsburg, Germany. The spatial dimensions of the Augsburg dataset are 332 × 485 pixels, with a spatial resolution of approximately 30 m. The HSI data include 180 spectral bands, spanning the wavelength range of 0.4 to 2.5 µm. The LiDAR DSM data provide 3D elevation information for surface features. The dataset comprises seven land cover categories with varying sample distributions.
Trento dataset. The Trento dataset is a paired HSI–LiDAR dataset; the HSI data were collected by an AISA Eagle (Specim, Spectral Imaging Ltd., Oulu, Finland) sensor, while the LiDAR digital surface model (DSM) data were acquired by an Optech ALTM 3100EA (Teledyne Optech, Vaughan, ON, Canada) sensor. The dataset was captured over a rural area south of the city of Trento, Italy. The Trento dataset has spatial dimensions of 166 × 600 pixels, with a spatial resolution of approximately 1 m. The HSI data consist of 63 spectral bands, with wavelengths ranging from 420 to 990 nm. The LiDAR DSM data provide elevation information of ground features. The land cover is classified into six categories: Apple Trees, Buildings, Ground, Woods, Vineyard, and Roads.
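For readers reproducing the data pipeline, the following sketch shows a typical way to cut co-registered HSI/DSM patches around labeled pixels. The reflect padding and the window size are assumptions, since the preprocessing details are not specified in this section.

```python
import numpy as np

def extract_patches(hsi, dsm, coords, patch=11):
    """Cut co-registered HSI/DSM windows centered on labeled pixels.

    hsi:    (H, W, B) cube, e.g. 349 x 1905 x 144 for Houston2013
    dsm:    (H, W) LiDAR digital surface model
    coords: iterable of (row, col) labeled-pixel positions
    patch:  odd window size (the value 11 is an assumption)
    """
    r = patch // 2
    hsi_p = np.pad(hsi, ((r, r), (r, r), (0, 0)), mode="reflect")
    dsm_p = np.pad(dsm, ((r, r), (r, r)), mode="reflect")
    hsi_patches, dsm_patches = [], []
    for i, j in coords:
        # After padding, the original pixel (i, j) sits at (i + r, j + r).
        hsi_patches.append(hsi_p[i:i + patch, j:j + patch, :])
        dsm_patches.append(dsm_p[i:i + patch, j:j + patch])
    return np.stack(hsi_patches), np.stack(dsm_patches)
```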
4. Results
This section presents a comprehensive evaluation of the proposed HMGF-Net with MS-PLR across three benchmark datasets: Houston2013, Augsburg, and Trento. We first describe the evaluation protocol, then present quantitative results with detailed comparisons against state-of-the-art methods, followed by parameter sensitivity studies and visual assessment.
4.1. Evaluation Protocol
The baseline methods span diverse state-of-the-art paradigms published between 2021 and 2024, including CNN-based approaches (Res-CP [30], CCR-Net [31], SepDGConv [32]), few-shot and semi-supervised methods (DCFSL [33], S3Net [34]), dual-modality fusion architectures (DSCA-Net [35], Fusion_HCT [36]), and transformer-based multimodal fusion (MFT [37]). Notably, MFT [37] (2023) represents the current state-of-the-art in multimodal remote sensing transformers, while DSCA-Net [35] (2024) is among the most recent dual-stream adaptive networks. This selection ensures a comprehensive comparison, with seven of eight baselines published in 2022 or later.
All models operated under identical training conditions with the same hyperspectral and LiDAR input modalities. The proposed HMGF-Net employed its dual-branch encoder, Mamba-based fusion module, and MS-PLR training strategy as described in Section 3.
Table 3 reports the hyperparameter configurations for MS-PLR. The base confidence threshold and the KNN neighborhood size k were tuned per dataset via grid search, while the blending coefficient generalized well across all datasets without per-dataset adjustment. Houston2013 requires a higher base threshold (0.60) due to its fifteen-class complexity and finer inter-class boundaries, whereas Trento and Augsburg benefit from a larger k (20) owing to their spatially homogeneous agricultural and urban parcels. The KNN graph is reconstructed at the start of each SSL round using updated fused features, ensuring that neighborhood relationships reflect the current model state.
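The per-round graph reconstruction described above reduces to a standard nearest-neighbor query over the current fused embeddings. A minimal sketch using scikit-learn follows; the function name and the exclusion of self-matches are the only assumptions.

```python
from sklearn.neighbors import NearestNeighbors

def rebuild_knn_graph(fused_features, k=20):
    """Recompute the k-NN index at the start of an SSL round.

    fused_features: (N, D) current fused HSI-LiDAR embeddings.
    Returns an (N, k) neighbor-index array excluding each point itself.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(fused_features)
    _, idx = nn.kneighbors(fused_features)  # column 0 is the point itself
    return idx[:, 1:]
```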
4.2. Quantitative Results
4.2.1. Results on Houston2013
Table 4 presents the complete class-wise and overall results on Houston2013. The proposed HMGF-Net achieves the highest OA (92.30%), AA (93.43%), and Kappa (91.68), outperforming all comparison models. Notably, HMGF-Net demonstrates superior performance on challenging urban classes such as Commercial (94.72%), Residential (98.04%), and Road (91.51%), which exhibit high intra-class variability. The integration of Mamba-based long-range modeling with graph fusion provides stronger context propagation, while MS-PLR reduces pseudolabel noise near class boundaries.
4.2.2. Results on Augsburg
Table 5 reports results on the Augsburg dataset, which contains large-scale and highly heterogeneous urban–vegetation mixtures. HMGF-Net achieves an OA of 88.61%, AA of 78.46%, and Kappa of 83.74, outperforming competing approaches. The Augsburg dataset poses significant challenges due to the dominance of the Residential-Area and Low-Plants classes, which together comprise over 70% of the scene. HMGF-Net significantly improves on Low-Plants (95.36%) and maintains competitive performance across minority classes. The graph-based fusion mechanism effectively incorporates elevation discontinuities and spectral relationships, ensuring reliable pseudolabel propagation.
4.2.3. Results on Trento
Table 6 shows results on the Trento dataset. The proposed method achieves the highest OA (99.39%), AA (98.68%), and Kappa (99.18). Trento consists primarily of agricultural and semi-structured terrain where hyperspectral-LiDAR fusion plays a critical role in distinguishing vegetation types. HMGF-Net produces near-perfect accuracy for classes such as Apple Trees (99.70%), Woods (100.00%), Vineyard (99.98%), and Roads (96.97%). The Mamba state-space module effectively captures long-range spectral patterns, while the KNN graph consistency verification incorporates elevation and spatial continuity across agricultural parcels.
4.3. Comparative Analysis
Across all three datasets, HMGF-Net consistently surpasses existing models in OA, AA, and Kappa.
Table 7 summarizes the performance comparison. The improvements arise from four key architectural and methodological strengths:
Hybrid Encoder Design: The 3D–2D CNN with residual connections for HSI and dense connections for LiDAR effectively captures modality-specific characteristics while maintaining computational efficiency.
Efficient Sequence Modeling: The Mamba block models long-range dependencies with linear O(n) complexity, offering advantages over standard CNNs (limited receptive field) and transformers (quadratic O(n²) complexity); a minimal recurrence sketch is given after this list.
Graph-Regularized Fusion: The KNN graph consistency verification aligns predictions semantically in the learned feature space, improving robustness against noisy pseudolabels.
Progressive Refinement: The MS-PLR strategy progressively expands the training set with validated pseudolabels, enabling effective utilization of unlabeled data under extreme label scarcity.
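To make the complexity argument in the second point concrete, the toy recurrence below processes a length-T sequence in a single O(T) pass. It uses fixed diagonal parameters, whereas an actual Mamba block makes them input-dependent (selective), so this is a sketch of the cost profile rather than of the full mechanism.

```python
import numpy as np

def linear_ssm_scan(x, a, b, c):
    """Toy diagonal state-space recurrence:
        h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t.
    One pass over the sequence gives O(T) cost, versus the O(T^2)
    pairwise interactions of transformer self-attention.

    x: (T, D) float token sequence; a, b, c: per-channel (D,) parameters.
    """
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]  # state update, broadcast over D channels
        y[t] = c * h          # readout
    return y
```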
Table 7. Summary of HMGF-Net classification performance (%) across all datasets.
| Dataset | OA | AA | Kappa |
|---|---|---|---|
| Houston2013 | 92.30 | 93.43 | 91.68 |
| Augsburg | 88.61 | 78.46 | 83.74 |
| Trento | 99.39 | 98.68 | 99.18 |
4.4. Parameter Sensitivity Analysis
To investigate the robustness of HMGF-Net to hyperparameter selection, we conducted a systematic sensitivity analysis on three critical parameters: the KNN neighborhood size k, the LiDAR fusion weight, and the learning rate. For all experiments, the batch size was fixed at 32, and the patch size and label-smoothing coefficient were held constant.
4.4.1. Impact of KNN Neighborhood Size
The KNN neighborhood size k determines the scope of spatial–spectral consistency verification in the graph-regularized pseudolabel acquisition module. As illustrated in Figure 5, we evaluated classification performance with k values ranging from 5 to 25.
For Houston2013, optimal performance is achieved with a smaller neighborhood (OA: 92.30%). This is attributed to its high spatial resolution (2.5 m), where neighboring pixels are more likely to belong to the same category within a compact neighborhood. In contrast, Augsburg and Trento achieve optimal results at k = 20, reflecting their larger homogeneous regions that benefit from broader neighborhood context. These results demonstrate that the optimal k is dataset-dependent and should be tuned according to spatial characteristics.
4.4.2. Impact of LiDAR Fusion Weight
The fusion weight controls the relative contribution of LiDAR features, with the HSI weight defined so that the two modality weights sum to one. Figure 6 shows classification performance as the LiDAR weight varies from 0 to 0.9.
For Houston2013, optimal performance occurs at a smaller LiDAR weight, indicating that spectral information dominates for distinguishing diverse urban categories. For Augsburg and Trento, balanced fusion (0.5) yields the best results, as elevation information provides crucial discriminative features for separating vegetation types and distinguishing buildings. Performance degradation at the extreme weights (0 or 0.9) confirms the importance of multimodal fusion.
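Interpreting the fusion weight as a convex combination of modality features, the behavior described above can be sketched as follows. This scalar late-fusion form is a simplification of the graph-guided fusion actually used, introduced only to show the role of the weight.

```python
import numpy as np

def weighted_fusion(hsi_feat, lidar_feat, lidar_weight=0.5):
    """Convex combination of modality features (weights sum to one).

    lidar_weight = 0.5 is the balanced setting that performed best on
    Augsburg and Trento; Houston2013 favored a smaller LiDAR weight.
    """
    return (1.0 - lidar_weight) * hsi_feat + lidar_weight * lidar_feat
```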
4.4.3. Impact of Learning Rate
Figure 7 evaluates six learning rates. Houston2013 achieves its optimum at a larger learning rate than Augsburg and Trento.
The larger optimal learning rate for Houston2013 may be attributed to its complex fifteen-class feature space requiring more aggressive parameter updates. Excessively large learning rates cause significant performance degradation across all datasets, particularly for Augsburg (OA drops to 75.82%) due to its severe class imbalance. We recommend moderate learning rates for similar HSI–LiDAR classification tasks.
4.5. Visual Assessment
Figure 8, Figure 9 and Figure 10 present visual classification maps for all three datasets. Compared to baseline methods, HMGF-Net produces smoother regions, cleaner class boundaries, and fewer isolated misclassifications.
On Houston2013, HMGF-Net exhibits improved discrimination along road networks and parking lot boundaries, where spectral confusion is prevalent. The Augsburg results show enhanced separation between residential and commercial areas, benefiting from the elevation-aware fusion mechanism. On Trento, the agricultural parcel boundaries are sharply delineated, demonstrating effective utilization of both spectral signatures and terrain structure.
4.6. Summary
The experimental results across the Houston2013, Augsburg, and Trento datasets confirm that HMGF-Net with MS-PLR offers substantial advantages in multimodal semi-supervised learning. Through its combination of spectral–spatial encoding, Mamba-based sequence modeling, graph-regularized fusion, and progressive pseudolabel refinement, the proposed framework delivers robust performance under limited annotation, consistently surpassing state-of-the-art methods across all datasets and evaluation metrics.
6. Conclusions
This work introduces HMGF-Net, a unified multimodal framework designed to address the challenges of semi-supervised hyperspectral–LiDAR classification under extremely limited labeled data. By combining a 3D–2D spectral–spatial encoder for hyperspectral imagery, a multi-scale CNN for LiDAR elevation modeling, and an efficient Mamba selective state-space module for long-range feature refinement, the network captures both local spectral–spatial structure and global contextual dependencies. A graph-based fusion mechanism further enhances cross-modal alignment by modeling relational consistency across spectral, spatial, and elevation domains. To ensure training stability in low-label scenarios, the proposed Multi-Stage Pseudo-Label Refinement (MS-PLR) framework progressively mitigates label noise through confidence filtering, spatial–spectral smoothing, and graph-consistency propagation.
Extensive experiments on the Houston2013, Augsburg, and Trento datasets demonstrate that HMGF-Net consistently outperforms state-of-the-art hyperspectral, multimodal, and semi-supervised learning approaches. The model achieves superior overall accuracy, average accuracy, and Kappa values across all datasets, with notable improvements in structurally complex or spectrally ambiguous classes. The results confirm that integrating selective state-space modeling with graph-guided fusion and progressive pseudolabel refinement offers a robust and efficient solution for multimodal classification under restricted supervision.
Future research may extend the framework toward large-scale scene understanding, real-time inference, and multimodal transformer–state-space hybrids. Moreover, the integration of physics-informed priors, domain generalization mechanisms, or additional modalities such as SAR and multispectral data may further broaden the applicability of the proposed approach in operational remote sensing environments.