Joint Hyperspectral Images and LiDAR Data Classification Combined with Quantum-Inspired Entangled Mamba

Myagmarsuren, Davaajargal; Wang, Aili; Lv, Haoran; Wu, Haibin; Molnar, Gabor; Yu, Liang

doi:10.3390/rs17244065


by Davaajargal Myagmarsuren ¹, Aili Wang ¹,*, Haoran Lv ¹, Haibin Wu ¹, Gabor Molnar ² and Liang Yu ³

¹ Heilongjiang Province Key Laboratory of Laser Spectroscopy Technology and Application, Harbin University of Science and Technology, Harbin 150080, China

² Institute for Application Techniques in Plant Protection, Julius Kühn Institute (JKI)-Federal Research Centre for Cultivated Plants, 38104 Brunswick, Germany

³ Ultra-Precision Optoelectronic Instrument Engineering Center, School of Instrument Science and Engineering, Harbin Institute of Technology, Harbin 150080, China

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(24), 4065; https://doi.org/10.3390/rs17244065

Submission received: 9 November 2025 / Revised: 9 December 2025 / Accepted: 13 December 2025 / Published: 18 December 2025

(This article belongs to the Section AI Remote Sensing)


Highlights

What are the main findings?

We propose a novel CNN-GCN framework coordinated with wavelet transform for HSI and LiDAR classification. Its core innovation is a set of dedicated modules that work in concert to effectively balance local detail extraction with global contextual modeling.
The proposed method achieves state-of-the-art classification performance, significantly outperforming existing advanced methods across three standard benchmark datasets.

What are the implications of the main findings?

This study provides an effective solution to key challenges in multimodal remote sensing, such as balancing local details with global contexts and enabling computationally efficient deep feature interaction.
The framework’s superior generalization capability across diverse scenes demonstrates its strong potential as a reliable tool for enhancing accuracy in practical applications like environmental monitoring and urban planning.

Abstract

The multimodal fusion of hyperspectral images (HSI) and LiDAR data for land cover classification encounters difficulties in modeling heterogeneous data characteristics and cross-modal dependencies, leading to the loss of complementary information due to concatenation, the inadequacy of fixed fusion weights to adapt to spatially varying reliability, and the assumptions of linear separability for nonlinearly coupled patterns. We propose QIE-Mamba, integrating selective state-space models with quantum-inspired processing to enhance multimodal representation learning. The framework employs ConvNeXt encoders for hierarchical feature extraction, quantum superposition layers for complex-valued multimodal encoding with learned amplitude–phase relationships, unitary entanglement networks via skew-symmetric matrix parameterization (validated through Cayley transform and matrix exponential methods), quantum-enhanced Mamba blocks with adaptive decoherence, and confidence-weighted measurement for classification. Systematic three-phase sequential validation on Houston2013, Muufl, and Augsburg datasets achieves overall accuracies of 99.62%, 96.31%, and 96.30%. Theoretical validation confirms 35.87% mutual information improvement over classical fusion (6.9966 vs. 5.1493 bits), with ablation studies demonstrating quantum superposition contributes 82% of total performance gains. Phase information accounts for 99.6% of quantum state entropy, while gradient convergence analysis confirms training stability (zero mean/std gradient norms). The optimization framework reduces hyperparameter search complexity by 99.6% while maintaining state-of-the-art performance. These results establish quantum-inspired state-space models as effective architectures for multimodal remote sensing fusion, providing reproducible methodology for hyperspectral–LiDAR classification with linear computational complexity.

Keywords:

hyperspectral images; LiDAR data; multimodal fusion; state-space models; Mamba; quantum-inspired processing; quantum superposition; multimodal classification; remote sensing


1. Introduction

Multimodal fusion of remote sensing data has emerged as an important strategy to improve the accuracy and reliability of land cover classification [1]. Hyperspectral imaging (HSI) enables detailed material identification by recording hundreds of contiguous narrow spectral bands and exploiting the unique spectral signature of each material [2]. This technology enables detailed studies of the biophysical and biochemical properties of plants, soils, and aquatic systems, facilitating precision agriculture, environmental monitoring, and scientific research [3]. HSI offers fine spectral characteristics for material composition analysis and discrimination, while LiDAR provides structural and elevation information that captures geometric features and surface topology regardless of lighting conditions. Therefore, the fusion of hyperspectral images and LiDAR data has become particularly important for land cover classification tasks [4,5,6]. Recent advances in remote sensing have demonstrated that the fusion of these modalities through deep learning approaches enables superior feature extraction capabilities [7].

Despite these complementary features, traditional multimodal fusion methods have difficulty modeling the complex relationships between different data sources. Classical machine learning methods such as support vector machines (SVMs) and random forests (RFs) struggle to combine heterogeneous data with complex cross-modal interactions. Although deep learning approaches have made significant progress in recent years, it remains challenging to model cross-modal interactions while preserving modal specificity [8,9]. Convolutional neural networks (CNNs) with architectures such as VGG [10] and ResNet [11] are commonly used to capture multimodal features. Vision Transformers (ViT) [12] pioneered end-to-end solutions by applying Transformer encoders directly to images, while Contrastive Language–Image Pre-Training (CLIP) [13] advanced multimodal training through a contrastive image–text objective.

Recent advances in state-space models, particularly the Mamba architecture [14], have demonstrated exceptional capabilities for modeling long sequences with linear computational complexity. The Mamba design introduces input-dependent selection mechanisms that allow content-aware state transitions while maintaining efficiency through hardware-aware algorithms [15,16,17]. However, applying it directly to multimodal fusion remains challenging because information must be combined from modalities with different spectral, spatial, and temporal properties.

Quantum-inspired computational principles, in particular superposition and entanglement [18,19], offer new approaches to represent and process interdependent information systems. Quantum neural networks use these principles to enhance representational capabilities through interference patterns and superposition states. Although quantum-inspired techniques show promise in classical networks, their integration with state-space models for multimodal fusion in remote sensing remains unexplored. This paper presents QIE-Mamba, a framework that integrates selective state-space models with quantum-inspired processing for hyperspectral–LiDAR classification, as represented in Figure 1.

Our main contributions are summarized as follows:

Quantum-enhanced state-space architecture. By combining complex-valued quantum superposition with unitary entanglement networks (validated by Cayley transforms and matrix exponentials) within selective state-space models, we achieve improved cross-modal correlation modeling beyond the classical coupling approach.
Confidence-weighted quantum measurement. A confidence mechanism based on quantum measurement uncertainty provides adaptive per-sample weighting, contributing a 0.57 percentage-point improvement over superposition alone (95.45% → 96.02% on Houston2013). Combined with real measurement extraction of amplitude relationships and dataset-adaptive decoherence rates (γ = 0.005–0.1), the system achieves stable quantum-to-classical state collapse while maintaining discriminative cross-modal information.
Systematic validation framework. A three-stage sequential optimization methodology (ConvNeXt architecture → Mamba parameters → quantum components) reduces the complexity of hyperparameter search by 93.5% compared to full search, achieving accuracies of 99.62%, 96.31%, and 96.30% on Houston2013, Muufl, and Augsburg, and providing reproducible guidelines for quantum-inspired multimodal fusion in remote sensing applications.

2. Related Works

2.1. Multimodal Remote Sensing Fusion: Context and Obstacles

The concept of multimodal remote sensing integration posits that various sensors can offer a complementary perspective of the same physical environment through diverse measurement, sampling, and acquisition techniques [20]. The fundamental premise of fusion techniques is the existence of correlations between real-world phenomena and observational datasets, as well as among datasets themselves. Hu et al. [21] have shown that the effective use of all available data sources can significantly increase the utility of remote sensing infrastructure.

The fusion of HSI and LiDAR data has attracted considerable research attention due to their complementary features. HSI captures material composition through spectral fingerprinting, while LiDAR contributes important geometric and elevation data [22]. Li et al. [23] conducted an advanced fusion study that combined the structural features of these data to improve classification. This fusion paradigm has been effectively implemented in several fields, such as urban land use classification [24], agricultural monitoring [25], and construction material identification [26].

A recent comprehensive review by Rehman et al. [7] identified over 1000 articles on HSI-LiDAR fusion, which indicates that this research domain is well-established and continues to expand. Nevertheless, many opportunities for improvement remain. Modal heterogeneity resulting from differing spatial and spectral resolutions, the need for computational efficiency in handling high-dimensional data, and the requirement for effective modeling of cross-modal interactions remain challenges [27,28]. Addressing these challenges requires methods that can learn complex feature representations, which motivates the adoption of deep learning architectures that remain computationally efficient.

2.2. Deep Learning Architectures for Remote Sensing Fusion

Motivated by the fusion challenges described above, researchers have introduced more advanced computational techniques to address the heterogeneity and computational complexity of multiple remote sensing modalities.

Traditional machine learning techniques such as SVM and RF have greatly improved remote sensing integration by providing efficient multi-level classification systems [29,30].

Deep learning models have built on this foundation and extended traditional approaches by automatically learning hierarchical features from a wide range of data sources. CNN, Transformer, and hybrid architectures have become the most popular types [31]. CNNs extract hierarchical features from optical, Synthetic Aperture Radar (SAR), and LiDAR data, while recurrent neural networks (RNNs) such as LSTM [32] or GRU [30] models handle temporal relationships in multi-temporal data. Vision Transformers utilizing multi-head attention mechanisms offer multiple levels of fusion, although they incur a quadratic computational cost.

ConvNeXt represents a significant advancement in the design of convolutional neural networks (CNNs). ConvNeXt models grew out of the observation that a ResNet-style ConvNet [33] can be progressively altered to resemble a ViT [34]. The design is based on the ideas of the Vision Transformer yet remains computationally efficient [35]. It modifies the convolutional network while maintaining efficiency by adding Transformer-inspired features such as layer normalization, a large kernel size (7 × 7), and an inverted bottleneck block. HybridSN [36] was the first to integrate 2D and 3D CNNs for the analysis of spectral–spatial features. Recent innovations such as SS-ConvNeXt [37] have shown enhanced performance by employing spatial–spectral ConvNeXt structures with depth-wise separable convolutions, resulting in improved parameter efficiency. Wu et al. [38] introduced a dual-encoder method for HSI-LiDAR fusion utilizing differential learning; however, such methods handle the modalities independently, thereby restricting the potential for deep cross-modal interaction.

Recent developments in cross-modal fusion have produced advanced ways to combine data from several sources. The multi-scale fusion technique of MKSFF-CNN [39] addresses feature heterogeneity by learning to combine multi-scale representations so that they become more useful for classification. Frequency–spatial fusion is another effective way to fuse multiple kinds of data. The Frequency–Spatial Contextual Awareness Network (AIS-FCANet) [40] collects global frequency-domain information that complements local spatial patterns to give a better picture of a scene. The contextual awareness mechanisms in AIS-FCANet further enhance cross-modal fusion by explicitly modeling long-range interactions across different areas and modalities.

Zhang et al. [41] proposed an adaptive multi-stage fusion framework for HSI and LiDAR with hierarchical fusion. Their Local Visual State Space (LocalVSS) block uses window-based directional scanning and cross-modal attention integration, and outperforms CNN and Transformer baselines. However, their method involves complex architectures and manually designed fusion strategies, while our quantum-inspired framework offers integrated cross-modal modeling through learned entanglement mechanisms. Hussain et al. [42] recently proposed a cross-attention mechanism for HSI-LiDAR fusion, which improves performance but at the cost of O(n²) complexity. Bloemheuvel et al. [43] introduced graph neural networks for spatial relationship modeling, but they require special graph construction.

In our study, we address this issue using linear-complexity state-space models, which aim to resolve the trade-off between computational efficiency and representational capability.

2.3. State-Space and Quantum-Inspired Models

Attention mechanisms have been central to sequence modeling since Vaswani et al. [44] introduced the Transformer architecture. However, the quadratic computational complexity of attention has motivated research into more efficient alternatives. State-space models (SSMs) have become a viable approach because they have linear computational cost and can nevertheless handle long-range dependencies.

The theoretical underpinning of modern SSMs is anchored in classical control theory and Kalman’s work [38] on state-space representations. In the context of deep learning, Gu et al. [16] created the HiPPO framework for optimal polynomial projections, which was the first step towards structured SSMs. Successive designs, from the structured state-space model (S4) [40] to S5 [41] and subsequently to S6 [42], have systematically enhanced computing efficiency and performance. Diagonal state space (DSS) [43] and the diagonal version of S4 (S4D) [44] reduced the complexity of the original S4 computations by focusing on diagonal parameters. The S6 design philosophy, inspired by the Gated Attention Unit (GAU) of Hua et al. [45], achieves parameter economy while retaining the expressive strength required for a wide range of sequence modeling tasks.

The Mamba design [14] addresses some of the core limitations of standard SSMs by making state transitions depend on the input. This selection mechanism lets the model attend only to the inputs that matter, based on their content. For remote sensing applications, Mamba’s linear complexity is particularly valuable for processing high-dimensional spatial–spectral signals.

Quantum-inspired computing is an emerging paradigm that could improve the ability to combine different types of data. Quantum neural networks use the principles of quantum superposition and entanglement to make learning more efficient [18]. Recent developments indicate that quantum-inspired classical networks can attain enhanced representational capacity via quantum interference patterns and superposition states [19].

Quantum-Inspired Entangled Mamba (QIE-Mamba) is the first quantum-inspired state-space model for combining data from several types of remote sensing. QIE-Mamba models cross-modal correlations through quantum-inspired entanglement mechanisms integrated into selective state spaces, in contrast to typical methods that handle modalities separately. Its key ideas are quantum superposition states for multimodal representation, entanglement-driven cross-modal fusion that captures non-local correlations, and quantum-enhanced selective scanning that keeps evolutionary coherence.

Combining quantum principles with classical architectures opens new ways to improve deep learning. Quantum superposition allows for exponentially more functional combinations, while entanglement captures relationships across large distances in space, time, and structure. These paradigms address fundamental limitations of classical methods while keeping computation feasible through quantum-inspired classical implementations. While the theoretical foundations are well-established, the practical applications of quantum entanglement in state-space models and remote sensing have yet to be explored, offering substantial prospects for innovation.

3. Method

We present QIE-Mamba, the first quantum-inspired state-space model that uses quantum entanglement principles to fuse multimodal remote sensing data. Unlike traditional fusion methods that treat modalities independently, QIE-Mamba models cross-modal correlations through a quantum-inspired entanglement mechanism embedded in the Mamba selective state space. The detailed computational flow of the QIE-Mamba framework, which demonstrates the quantum-inspired fusion mechanism, is displayed in Figure 2. The HSI and LiDAR inputs are processed by a four-stage ConvNeXt encoder with progressive down-sampling and channel expansion. The feature processors transform the spatial features into sequence representations. The quantum processing pipeline implements the following: (1) quantum superposition: amplitude and phase encoders generate a normalized quantum state from the two modalities; (2) quantum entanglement: a learnable unitary gate creates cross-modal quantum correlations; (3) decoherent SSM: the quantum Mamba block processes the entangled states while modeling decoherence; and (4) quantum measurement: the selected measurement type (magnitude, real, phase, or adaptive) extracts classical features from the quantum states. Modal confidence weights the contributions before the final classification.

3.1. Problem Formulation

Given multimodal remote sensing data consisting of hyperspectral imagery $X_{HSI} \in \mathbb{R}^{W \times H \times B_{H}}$ and LiDAR data $X_{LiDAR} \in \mathbb{R}^{W \times H \times B_{L}}$, where $W$ and $H$ represent the spatial dimensions and $B_{H}, B_{L}$ denote the number of spectral bands, our objective is to learn a mapping function

$$f : (X_{HSI}, X_{LiDAR}) \to y$$

where $y \in \{1, 2, \dots, C\}$ represents the predicted class label among $C$ possible land cover categories.

The mapping function $f$ is parameterized by the proposed QIE-Mamba network, comprising four principal components: modality-specific feature extraction utilizing ConvNeXt encoders, quantum-inspired Mamba fusion for multimodal integration, confidence-based fusion, and a classification head for the final prediction.

Traditional fusion methods face three main limitations when combining HSI and LiDAR data: modal heterogeneity due to different imaging mechanisms and resolutions, difficulty in capturing long-range correlations among spatial, spectral, and height features, and suboptimal weighting of complementary information, as HSI contains rich spectral information, while LiDAR offers detailed height information that is not affected by atmospheric interference [46].

Quantum-inspired design rationale: We propose to model multimodal fusion as a quantum measurement problem. In quantum mechanics, a system exists in a superposition of multiple states until a measurement collapses it to a classical state. Similarly, we hypothesize that the multimodal features of hyperspectral images and LiDAR data should be represented as entangled quantum states that preserve cross-modal relationships until they are classified (measured). This quantum-inspired structure provides three mathematical advantages:

Complex-valued representation, which encodes both amplitude (feature magnitude) and phase (relationship between modalities) [47];
Unitary transformation, which preserves information through invertible operations [48];
Measurement mechanism, which characterizes the transition from quantum to classical features [49].

3.2. Hierarchical Feature Extraction with ConvNeXt Encoders

Remote sensing imagery is characterized by heterogeneous, multi-source data that provide complementary information for land cover observation. Classifying hyperspectral images and LiDAR data therefore requires computationally efficient hierarchical feature extraction. ConvNeXt achieves Vision Transformer-level performance with CNN-level efficiency [35].

3.2.1. Stem Layer with Patchify Operation

The stem layer performs aggressive down-sampling:

$$F_{m}^{(0)} = \mathrm{LayerNorm}\left(\mathrm{Conv2D}\left(X^{m}; W_{stem}^{(m)}, k = 2, s = 2\right)\right) \tag{1}$$

where $X^{m}$ is the input data for modality $m \in \{HSI, LiDAR\}$, $W_{stem}^{(m)}$ are the learnable weight parameters for the stem convolution of modality $m$, and $k, s$ are the kernel size and stride, respectively. $F_{m}^{(0)}$ represents the output features from the stem layer (stage 0). The rationale for 4× down-sampling is that hyperspectral remote sensing data contains redundant spatial information at high resolution. Early down-sampling reduces computational cost while maintaining the spectral accuracy needed for classification tasks [50].
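As a rough illustration of Equation (1), the following NumPy sketch applies a non-overlapping 2 × 2, stride-2 convolution followed by layer normalization over channels. The function names (`patchify_conv`, `layer_norm_channels`), the random weights, the omitted bias and affine terms, and the toy input sizes are assumptions made for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def patchify_conv(x, w, k=2):
    """Non-overlapping k x k convolution with stride k (patchify stem).

    x: (C_in, H, W) input patch; w: (C_out, C_in, k, k) weights.
    The bias term is omitted for brevity (an assumption of this sketch).
    """
    c_in, h, wd = x.shape
    assert h % k == 0 and wd % k == 0, "H and W must be divisible by k"
    # Split the image into k x k tiles: (C_in, H/k, k, W/k, k)
    tiles = x.reshape(c_in, h // k, k, wd // k, k)
    # Contract input channels and both intra-tile axes with the kernel
    return np.einsum("oikl,ihkwl->ohw", w, tiles)

def layer_norm_channels(x, eps=1e-6):
    """Normalize each pixel across the channel axis (no learnable affine)."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x_hsi = rng.normal(size=(30, 12, 12))            # toy input: 30 bands, 12 x 12 pixels
w_stem = rng.normal(size=(64, 30, 2, 2)) * 0.05  # hypothetical stem weights
f0 = layer_norm_channels(patchify_conv(x_hsi, w_stem))
print(f0.shape)  # (64, 6, 6)
```

With `k = s = 2`, each spatial dimension is halved, so a 12 × 12 input yields a 6 × 6 feature map.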

3.2.2. ConvNeXt Stage Architecture

For each stage $l \in \{1, 2, 3, 4\}$:

$$F_{m}^{(l)} = \mathrm{ConvNeXtStage}_{l}\left(F_{m}^{(l-1)}\right) \tag{2}$$

where $F_{m}^{(l)}$ is the feature map for modality $m$ at stage $l$, $l$ is the stage index, $F_{m}^{(l-1)}$ is the input feature map from the previous stage ($l - 1$), and $\mathrm{ConvNeXtStage}_{l}$ is the ConvNeXt stage operation at stage $l$.

Each stage begins with conditional down-sampling:

$$F_{Down} = \begin{cases} \mathrm{Conv2D}\left(\mathrm{LayerNorm}\left(F_{m}^{(l-1)}\right), k = 2, s = 2\right), & l \in \{2, 3\}, \\ F_{m}^{(l-1)}, & \text{otherwise} \end{cases} \tag{3}$$

where $F_{Down}$ is the down-sampled feature map. This is followed by $d_{l}$ ConvNeXt blocks:

$$F_{m}^{(l)} = \mathrm{ConvNeXtBlock}_{d_{l}}\left(\dots \mathrm{ConvNeXtBlock}_{2}\left(\mathrm{ConvNeXtBlock}_{1}\left(F_{Down}\right)\right)\right) \tag{4}$$

where $d_{l}$ is the depth of stage $l$.

3.2.3. ConvNeXt Block

The rationale for this block design is that recent Transformer-based fusion methods have made significant progress in HSI-LiDAR joint classification by effectively modeling long-range correlations. The ConvNeXt block adopts Transformer-inspired design choices: the layer scale parameter allows very deep networks to be trained, while depth-wise separable convolution provides a more robust approach.

Each ConvNeXt block implements inverted bottleneck with depth-wise separable convolutions:

$$F_{out} = F_{in} + \mathrm{DropPath}\left(\gamma \cdot \mathrm{PWConv}_{2}\left(\mathrm{GELU}\left(\mathrm{PWConv}_{1}\left(\mathrm{LN}\left(\mathrm{DWConv}\left(F_{in}\right)\right)\right)\right)\right)\right) \tag{5}$$

where $\mathrm{DWConv}$ is depth-wise convolution ($7 \times 7$, $groups = C$), $\mathrm{LN}$ is layer normalization [51], $\mathrm{PWConv}_{1}$ is point-wise expansion ($C \to 4C$), $\mathrm{GELU}$ is Gaussian Error Linear Unit activation [52], $\mathrm{PWConv}_{2}$ is point-wise projection ($4C \to C$), $\gamma$ is the layer scale parameter [35], and $\mathrm{DropPath}$ is drop path stochastic depth regularization [53], which prevents overfitting by randomly dropping residual branches during training. $F_{in}$ and $F_{out}$ are the input and output features of the ConvNeXt block.
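The block in Equation (5) can be sketched in plain NumPy as follows. This is a minimal inference-time illustration under stated assumptions: DropPath is treated as the identity (as it would be outside training), layer normalization has no learnable affine terms, and the parameter dictionary, its key names, and the random initial values are hypothetical rather than the authors' configuration.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def layer_norm_channels(x, eps=1e-6):
    # Normalize each pixel across channels; x has shape (C, H, W)
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_conv7x7(x, w):
    # One 7 x 7 filter per channel (groups = C), zero padding keeps H and W
    c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (3, 3), (3, 3)))
    out = np.zeros_like(x)
    for i in range(7):
        for j in range(7):
            out += w[:, i, j][:, None, None] * xp[:, i:i + h, j:j + wd]
    return out

def convnext_block(x, p):
    h = depthwise_conv7x7(x, p["dw"])                # DWConv
    h = layer_norm_channels(h)                       # LN
    h = np.einsum("oc,chw->ohw", p["pw1"], h)        # PWConv_1: C -> 4C
    h = gelu(h)                                      # GELU
    h = np.einsum("co,ohw->chw", p["pw2"], h)        # PWConv_2: 4C -> C
    return x + p["gamma"][:, None, None] * h         # layer scale + residual

rng = np.random.default_rng(0)
C = 8
params = {
    "dw": rng.normal(size=(C, 7, 7)) * 0.1,
    "pw1": rng.normal(size=(4 * C, C)) * 0.1,
    "pw2": rng.normal(size=(C, 4 * C)) * 0.1,
    "gamma": np.full(C, 1e-6),  # small initial layer scale (assumed value)
}
x = rng.normal(size=(C, 10, 10))
y = convnext_block(x, params)
print(y.shape)  # (8, 10, 10)
```

Because the layer scale starts near zero, the block initially behaves almost like the identity, which is one reason layer scale helps when stacking many blocks.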

3.2.4. Sequence Conversion Module

After the final ConvNeXt stage, the spatial features $F_{m}^{(4)} \in \mathbb{R}^{B \times dim \times H' \times W'}$ are converted to sequences for Mamba processing:

$$\begin{aligned} F_{red} &= \mathrm{Conv2D}\left(F_{m}^{(4)}; dim \to dim, k = 1\right), \\ H_{L} &= \lfloor \sqrt{L} \rfloor, \quad W_{L} = \lfloor L / H_{L} \rfloor, \\ F_{pool} &= \mathrm{AdaptiveAvgPool2d}\left(F_{red}, (H_{L}, W_{L})\right), \\ S_{m} &= \mathrm{LayerNorm}\left(\mathrm{Flatten}\left(F_{pool}\right)\right) \in \mathbb{R}^{B \times L \times dim} \end{aligned} \tag{6}$$

where $H', W'$ are the height and width after all down-sampling, $H_{L}, W_{L}$ are the height and width used for reshaping, $\lfloor \sqrt{L} \rfloor$ is the floor of the square root of the sequence length, $dim$ is the embedding dimension, and $L = H' \times W'$ is the target sequence length.

After the fourth ConvNeXt stage, we have spatial feature maps $F_{m}^{(4)}$ with shape ($B, dim, H', W'$). We apply a 1 × 1 convolution to create $F_{red}$ (dimension reduction), ensuring the features have the correct embedding dimension. We determine the target sequence length $L$ and compute $H_{L}$ and $W_{L}$ to define the spatial dimensions for pooling. We then apply adaptive average pooling to $F_{red}$ to create $F_{pool}$ with shape ($B, dim, H_{L}, W_{L}$), ensuring consistent spatial dimensions. Next, we flatten the spatial dimensions into a single sequence dimension $L$, resulting in shape ($B, L, dim$). Finally, we apply layer normalization to obtain the final sequence representation $S_{m}$.
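The steps of Equation (6) can be traced with a small NumPy sketch. It assumes the 1 × 1 convolution is a per-pixel linear map without bias, that both brackets in Equation (6) denote the floor operation, and that adaptive average pooling uses the usual floor/ceil bin boundaries; the helper names `adaptive_avg_pool2d` and `to_sequence` are hypothetical.

```python
import math
import numpy as np

def adaptive_avg_pool2d(x, out_h, out_w):
    """Average-pool x of shape (B, C, H, W) to (B, C, out_h, out_w)."""
    b, c, h, w = x.shape
    out = np.empty((b, c, out_h, out_w), dtype=float)
    for i in range(out_h):
        h0, h1 = (i * h) // out_h, -((-(i + 1) * h) // out_h)  # floor, ceil
        for j in range(out_w):
            w0, w1 = (j * w) // out_w, -((-(j + 1) * w) // out_w)
            out[:, :, i, j] = x[:, :, h0:h1, w0:w1].mean(axis=(2, 3))
    return out

def to_sequence(f4, w_red, seq_len, eps=1e-6):
    """Convert (B, dim, H', W') features to a (B, H_L * W_L, dim) sequence."""
    f_red = np.einsum("oc,bchw->bohw", w_red, f4)      # 1 x 1 convolution
    h_l = math.isqrt(seq_len)                           # floor(sqrt(L))
    w_l = seq_len // h_l                                # floor(L / H_L)
    f_pool = adaptive_avg_pool2d(f_red, h_l, w_l)
    b, d = f_pool.shape[:2]
    seq = f_pool.reshape(b, d, h_l * w_l).transpose(0, 2, 1)  # flatten space
    mu = seq.mean(axis=-1, keepdims=True)
    var = seq.var(axis=-1, keepdims=True)
    return (seq - mu) / np.sqrt(var + eps)              # layer normalization

rng = np.random.default_rng(0)
f4 = rng.normal(size=(2, 8, 6, 6))     # toy stage-4 features: B=2, dim=8
w_red = rng.normal(size=(8, 8)) * 0.3  # hypothetical 1 x 1 conv weights
s_m = to_sequence(f4, w_red, seq_len=16)
print(s_m.shape)  # (2, 16, 8)
```

Note that under the floor assumption the flattened length $H_{L} \cdot W_{L}$ equals $L$ only when $H_{L}$ divides $L$ (for example, when $L$ is a perfect square, as in the toy call above).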

3.3. Quantum-Inspired Entanglement Fusion

3.3.1. Quantum Superposition Layer

The physical motivation is that the phase relationship between the modalities encodes a correlation pattern that cannot be captured by classical real-valued coupling. When the HSI response is highly reflective and the LiDAR returns lie at different heights, their phase relationship is $\varphi_{HSI} - \varphi_{LiDAR}$, which distinguishes between spectrally similar materials with different structural profiles [54]. This is consistent with the principle of quantum interference, where the relative phase determines whether the interference is constructive or destructive [55]. The information-theoretical rationale for our design is based on Theorem 1 (Section 3.5), which states that a complex-valued quantum representation can capture more information than a real-valued representation, as explained by Lloyd et al. [48].

We first encode the two modalities into a single complex-valued quantum state that preserves both amplitude (feature importance) and phase (intermodal relationships). Following the fundamental tenets of quantum superposition [55], we decompose the complex coefficient into amplitude and phase encoders with hidden dimension $D_{h} = \lfloor dim \cdot \rho \rfloor$, where $\rho$ is the complexity factor. The encoders are computed as follows.

Amplitude encoding:

$$\alpha_{m} = \sigma\left(\bar{W}_{2}^{\alpha} \cdot \mathrm{GELU}\left(\bar{W}_{1}^{\alpha} \cdot S_{m} + b_{1}^{\alpha}\right) + b_{2}^{\alpha}\right) \tag{7}$$

where $S_{m} \in \mathbb{R}^{B \times L \times dim}$ is the sequential feature representation from modality $m \in \{1, 2\}$; $\bar{W}_{1}^{\alpha} \in \mathbb{R}^{D_{h} \times dim}, b_{1}^{\alpha} \in \mathbb{R}^{D_{h}}$ are the first-layer parameters; $\bar{W}_{2}^{\alpha} \in \mathbb{R}^{dim \times D_{h}}, b_{2}^{\alpha} \in \mathbb{R}^{dim}$ are the second-layer parameters; $\mathrm{GELU}(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution; and $\sigma(\cdot)$ is the sigmoid activation.

Phase encoding:

$$\varphi_{m} = \pi \cdot \tanh\left(\bar{W}_{2}^{\varphi} \cdot \mathrm{GELU}\left(\bar{W}_{1}^{\varphi} \cdot S_{m} + b_{1}^{\varphi}\right) + b_{2}^{\varphi}\right) \tag{8}$$

where $\bar{W}_{1}^{\varphi} \in \mathbb{R}^{D_{h} \times dim}, b_{1}^{\varphi} \in \mathbb{R}^{D_{h}}$ and $\bar{W}_{2}^{\varphi} \in \mathbb{R}^{dim \times D_{h}}, b_{2}^{\varphi} \in \mathbb{R}^{dim}$ are the first- and second-layer parameters, and $\tanh(\cdot)$ maps to [−1, 1], so that multiplication by $\pi$ yields $\varphi_{m} \in [-\pi, \pi]$.

Amplitude normalization (Born rule constraint):

$$\tilde{\alpha}_{m} = \frac{\alpha_{m}}{\sqrt{\sum_{i=1}^{2} \alpha_{i}^{2} + 10^{-8}}} \tag{9}$$

This normalization ensures $\sum_{m=1}^{2} (\tilde{\alpha}_{m})^{2} = 1$, satisfying the quantum probability conservation constraint. The additive constant $10^{-8}$ prevents numerical instability from division by zero.

Quantum superposition state:

$$|\psi\rangle = \sum_{m=1}^{2} \tilde{\alpha}_{m} \odot \exp(i \varphi_{m}) \odot S_{m} = \psi_{real} + \psi_{imag} \tag{10}$$

where $\odot$ denotes the element-wise (Hadamard) product and $\exp(i \varphi_{m}) = \cos(\varphi_{m}) + i \sin(\varphi_{m})$ is the complex phase factor (Euler’s formula). The superposition operates element-wise across the batch, sequence, and feature dimensions. The quantum superposition state employs amplitude–phase encoding for physical interpretability: $\tilde{\alpha}_{m}$ represents the modality importance amplitude weights (after Born rule normalization), and $\varphi_{m}$ encodes the intermodal phase relationships that determine interference patterns (constructive when $\varphi_{HSI} \approx \varphi_{LiDAR}$, destructive when $|\varphi_{HSI} - \varphi_{LiDAR}| \approx \pi$). For the subsequent temporal evolution in the quantum Mamba block, however, we convert this representation to Cartesian (real–imaginary) form via Euler’s formula. This decomposition is mathematically lossless and enables independent processing of structural features ($\psi_{real} = \sum_{m=1}^{2} \tilde{\alpha}_{m} \odot \cos(\varphi_{m}) \odot S_{m}$) and dynamical correlations ($\psi_{imag} = \sum_{m=1}^{2} \tilde{\alpha}_{m} \odot i \sin(\varphi_{m}) \odot S_{m}$) through separate state-space recurrences. We adopt real–imaginary processing in the quantum Mamba block rather than operating directly on amplitude–phase components because (1) phase evolution through state-space models causes gradient discontinuities when $\varphi$ crosses the ±π boundaries (phase wrapping), and (2) real and imaginary components are unbounded and thus compatible with continuous temporal evolution.
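Equations (7)–(10) can be followed end to end in a short NumPy sketch. It assumes that the amplitude and phase encoders are two-layer perceptrons whose weights are shared across the two modalities (the equations carry no modality index on the weights), and it uses random toy parameters; the names `mlp`, `superpose`, and `init` are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu(x):
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(s, w1, b1, w2, b2):
    # s: (B, L, dim); w1: (D_h, dim); w2: (dim, D_h)
    return gelu(s @ w1.T + b1) @ w2.T + b2

def superpose(s_list, amp_params, phase_params, eps=1e-8):
    alphas = [sigmoid(mlp(s, *amp_params)) for s in s_list]          # Eq. (7)
    phis = [np.pi * np.tanh(mlp(s, *phase_params)) for s in s_list]  # Eq. (8)
    norm = np.sqrt(sum(a ** 2 for a in alphas) + eps)                # Eq. (9)
    alphas_t = [a / norm for a in alphas]
    psi = sum(a * np.exp(1j * p) * s                                 # Eq. (10)
              for a, p, s in zip(alphas_t, phis, s_list))
    return psi, alphas_t, phis

def init(rng, dim, d_h):
    # Hypothetical random initialization of one two-layer encoder
    return (rng.normal(size=(d_h, dim)) * 0.3, np.zeros(d_h),
            rng.normal(size=(dim, d_h)) * 0.3, np.zeros(dim))

rng = np.random.default_rng(0)
B, L, dim, rho = 2, 4, 8, 2.0
d_h = int(dim * rho)                      # hidden size D_h
s_hsi = rng.normal(size=(B, L, dim))      # toy HSI sequence features
s_lidar = rng.normal(size=(B, L, dim))    # toy LiDAR sequence features
psi, alphas_t, phis = superpose([s_hsi, s_lidar],
                                init(rng, dim, d_h), init(rng, dim, d_h))
psi_real, psi_imag = psi.real, psi.imag   # Cartesian form for later stages
print(psi.shape, psi.dtype)  # (2, 4, 8) complex128
```

The Cartesian split at the end mirrors the lossless conversion described above: `psi.real` collects the cosine terms and `psi.imag` the sine terms of the same complex state.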

3.3.2. Unitary Entanglement Network

Innovation: In contrast to previous quantum-inspired models featuring static entanglement gates, we developed a learnable unitary transformation that adapts to dataset-specific cross-modal correlations, allowing the network to independently perform efficient intramodal and intermodal information exchange and complementary fusion. The construction of the unitary matrix proceeds in the following steps, as represented in Figure 3. The raw matrix is initialized as a real-valued learnable matrix; a skew-symmetric matrix with zero diagonal is computed from it; and a learnable scaling factor controls the magnitude of the transformation. The skew-Hermitian construction step then yields a purely imaginary matrix that is suitable for unitary generation. The final unitary matrix U is generated either by the matrix exponential (EXPM) [55], as described in Equation (12), or by the Cayley transform [56], as outlined in Equation (13). The entanglement preserves the quantum state norm, U†U = I (unitary constraint).

The entanglement gate is parameterized as the exponential of a skew-Hermitian matrix [10]:

S_{skew} = 0.5 (R - R^{T})

(11)

U = \exp(i λ S_{skew}) \in C^{dim \times dim}

(12)

or

U_{cayley} = (I - 0.5 i A)^{-1} (I + 0.5 i A)

(13)

where R \in R^{dim \times dim} is a learnable real matrix, R^{T} is the transpose of R, λ is the entanglement strength (set to its tuned optimal value), S_{skew} is the skew-symmetric matrix, and i is the imaginary unit.

Unitarity guarantee [57]: For any skew-Hermitian matrix H, U = \exp(H) is unitary:

U^{†} U = \exp(-i λ S_{skew}) \exp(i λ S_{skew}) = \exp(0) = I

(14)

This ensures information preservation during entanglement [58].
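The unitarity guarantee can be checked numerically with a short sketch in plain Python. Several choices here are assumptions made for illustration rather than details from the paper: the 3 × 3 matrix `R`, the value of `lam`, and the construction of the generator as H = λ(S_skew + iK), with K the symmetric part of R. That construction is skew-Hermitian by design (H† = −H), which is the property the guarantee relies on. The matrix exponential is approximated by a truncated Taylor series, which is adequate only for small matrices of modest norm.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def dagger(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))] for i in range(len(a[0]))]

def identity(n):
    return [[complex(i == j) for j in range(n)] for i in range(n)]

def expm_taylor(h, terms=40):
    """Truncated Taylor series exp(H) = sum_k H^k / k! (small matrices only)."""
    n = len(h)
    result, term = identity(n), identity(n)
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, h)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Hypothetical raw learnable matrix R and entanglement strength lambda.
R = [[0.2, -0.5, 0.1],
     [0.7, 0.0, -0.3],
     [-0.4, 0.6, 0.1]]
lam = 0.5
n = len(R)

# Assumed skew-Hermitian generator: H = lam * (S_skew + i * K), where
# S_skew = 0.5 (R - R^T) is skew-symmetric and K = 0.5 (R + R^T) is symmetric.
H = [[lam * (0.5 * (R[i][j] - R[j][i]) + 1j * 0.5 * (R[i][j] + R[j][i]))
      for j in range(n)] for i in range(n)]

U = expm_taylor(H)

# Largest entry (in magnitude) of U^dagger U - I.
UdU = matmul(dagger(U), U)
unitarity_error = max(abs(UdU[i][j] - (1.0 if i == j else 0.0))
                      for i in range(n) for j in range(n))

# Norm preservation on a hypothetical complex state vector.
psi = [1 + 2j, -0.5j, 0.3 + 0j]
psi_out = [sum(U[i][j] * psi[j] for j in range(n)) for i in range(n)]
norm_in = math.sqrt(sum(abs(x) ** 2 for x in psi))
norm_out = math.sqrt(sum(abs(x) ** 2 for x in psi_out))
```

In this setting `unitarity_error` stays at round-off level and `norm_in` matches `norm_out`, which is the information-preservation property stated above.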

Entangled state computation:

|ψ_{ent}⟩_{b, l, dim} = \sum_{j=1}^{dim} |ψ_{quantum}⟩_{b, l, j} \cdot U_{j, dim}^{*}

(15)

where |ψ_{ent}⟩ is the entangled quantum state (output), |ψ_{quantum}⟩ is the input quantum state (before entanglement), U_{j, dim}^{*} follows the complex conjugate transpose convention applied over dimension j, and j is the summation index over the feature dimension.

3.3.3. Quantum-Enhanced Mamba State-Space Model

The design challenge is that standard Mamba [14] processes real-valued sequences with O(L) complexity. We extend it to process complex quantum states while maintaining linear complexity and incorporating quantum decoherence.

Physical interpretation of decoherence: The decoherence mechanism models the decay of quantum states toward classical states over time, which offers three practical benefits: gradient stabilization, by preventing exponential growth in long sequences; a temporal inductive bias, in which recent tokens contribute more than distant tokens; and physical plausibility, by modeling the behavior of real quantum systems.

Different decoherence models produce qualitatively different learning dynamics, so the model must be chosen empirically to suit the characteristics of our architecture. In accordance with the theory of open quantum systems [59], we compared two basic decoherence models under the same training conditions: Markovian (standard memoryless decoherence) [60] and non-Markovian (limited environmental memory) [61]. Other non-Markovian models include Gaussian [62], power-law [63], and stretched-exponential [64] decay, but because their properties resemble those of the non-Markovian model, we restricted the detailed validation to the two models (Markovian and non-Markovian) that show fundamentally different dynamics. As hyperspectral bands exhibit correlated noise [65] and quantum ML performance depends critically on the choice of decoherence model [66], we validated both models empirically across three datasets and selected the one that preserves gradient flow and achieves the best accuracy [67]. This extends the stable SSM training techniques of Gu et al. [14] to complex-valued quantum states.

Figure 4 presents the structure of a quantum Mamba block, in which the entangled quantum state [B, seq, dim] is decomposed into real and imaginary components, each component is layer-normalized, and the result is expanded in dimension through linear projections. The SSM with the decoherence module implements a sequential scan (a loop over t = 0, …, L − 1) in which the state evolution follows the real and imaginary dynamics with the learned parameters A, B, C, and D; the decoherence gate uses exponential decay to model the decay of the quantum state over time; and the state projection is combined with the direct (skip) connection to compute the outputs y. The dual-path processing stores quantum information in separate real and imaginary channels, while the decoherence mechanism provides a physically inspired arrangement that prevents unbounded growth of the states and improves the stability of the model.

Dual-path architecture: To maintain numerical stability and interpretability, we process the real and imaginary components separately [68] and then normalize them:

x_{R} = LayerNorm(Re(|ψ_{ent}⟩)), x_{I} = LayerNorm(Im(|ψ_{ent}⟩))

(16)

where Re(|ψ_{ent}⟩) and Im(|ψ_{ent}⟩) are the real and imaginary parts of the entangled state.

Dimension expansion:

z_{R} = x_{R} W_{in}^{R}, z_{I} = x_{I} W_{in}^{I}

(17)

where W_{in}^{R} and W_{in}^{I} are the weight matrices for the real and imaginary projections, with W_{in} \in R^{B \times L \times expanded_dim} and expanded_dim determined by the expansion factor.

For the quantum decoherence mechanism, following the Lindblad master equation formalism [60], we model decoherence as a time-dependent decay factor:

Markovian: γ_{t} = \exp(-γ \cdot t) or Non-Markovian: γ_{t} = 1 / (1 + γ \cdot t / L)

(18)

where γ is the decoherence rate and t is time. The Markovian model gives exponential, memoryless decay, whereas the non-Markovian model gives hyperbolic decay with memory effects.
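A minimal sketch of the two decay factors in Equation (18) follows. The rate `gamma = 0.05` and the function names are illustrative assumptions; only the two formulas and the sequence length L = 16 come from the text.

```python
import math

def markovian(gamma, t):
    """Exponential, memoryless decay: exp(-gamma * t)."""
    return math.exp(-gamma * t)

def non_markovian(gamma, t, seq_len):
    """Hyperbolic decay with memory: 1 / (1 + gamma * t / L)."""
    return 1.0 / (1.0 + gamma * t / seq_len)

seq_len = 16   # sequence length L used in the paper
gamma = 0.05   # hypothetical decoherence rate for illustration

# (t, Markovian factor, non-Markovian factor) for every position in the sequence.
factors = [(t, markovian(gamma, t), non_markovian(gamma, t, seq_len))
           for t in range(seq_len)]
```

Both factors equal 1 at t = 0, and for every later position the hyperbolic factor stays above the exponential one, which is the slower decay attributed to the non-Markovian model.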

State-space recurrence with decoherence: For t = 0, 1, \dots, L - 1,

h_{R}^{(t)} = γ_{t} \sum_{s} h_{R, s}^{(t - 1)} A_{s} + \sum_{s} z_{R, t, s} B_{s}

(19)

h_{I}^{(t)} = γ_{t} \sum_{s} h_{I, s}^{(t - 1)} A_{s} + \sum_{s} z_{I, t, s} B_{s}

(20)

where h_{R}^{(t)}, h_{I}^{(t)} \in R^{B \times (expanded_factor \times dim) \times s} are the real and imaginary hidden states at time t; γ_{t} is the decoherence factor; \sum_{s} is the Einstein summation over the state dimension s; A_{s} is the SSM state transition parameter (learned); B_{s} is the SSM input projection parameter (learned); z_{R, t, s} and z_{I, t, s} are the real and imaginary inputs at time t and state s; and h_{R, s}^{(t - 1)} and h_{I, s}^{(t - 1)} are the previous real and imaginary hidden states at state s.

y_{R}^{(t)} = \sum_{s} h_{R, s}^{(t)} C_{s} + z_{R, t} ⊙ D

(21)

y_{I}^{(t)} = \sum_{s} h_{I, s}^{(t)} C_{s} + z_{I, t} ⊙ D

(22)

where y_{R}^{(t)} and y_{I}^{(t)} are the real and imaginary outputs, C_{s} is the SSM output projection parameter (learned), D is the skip connection parameter (learned), and A, B, C \in R^{768 \times L}, D \in R^{768}.
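The sequential scan in Equations (19)–(22) can be sketched for a single sample in plain Python. This is an elementwise (diagonal) reading of the recurrence, which is an assumption of the sketch, as are the function name `ssm_scan`, the tiny dimensions, and all parameter values. The Markovian factor from Equation (18) is used for γ_t, and the same parameters are shared by the real and imaginary paths.

```python
import math

def ssm_scan(z, a, b, c, d, gamma):
    """Sequential scan for one path (real or imaginary).

    z: [L][D] inputs; a, b, c: [D][S] state parameters; d: [D] skip weights.
    Returns the outputs y as a list of shape [L][D].
    """
    dim, n_state = len(d), len(a[0])
    h = [[0.0] * n_state for _ in range(dim)]   # hidden state, zero-initialized
    outputs = []
    for t in range(len(z)):
        decay = math.exp(-gamma * t)            # Markovian decoherence factor
        y_t = []
        for ch in range(dim):
            for s in range(n_state):
                # State update: decayed previous state plus projected input.
                h[ch][s] = decay * h[ch][s] * a[ch][s] + z[t][ch] * b[ch][s]
            # Output: state projection plus direct (skip) connection.
            y_t.append(sum(h[ch][s] * c[ch][s] for s in range(n_state)) + z[t][ch] * d[ch])
        outputs.append(y_t)
    return outputs

# Hypothetical toy setup: L = 4 steps, D = 2 channels, S = 2 states.
gamma = 0.1
a = [[0.9, 0.9], [0.9, 0.9]]
b = [[1.0, 1.0], [1.0, 1.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
d = [0.5, 0.5]

# Unit impulse on channel 0 (real path) and on channel 1 (imaginary path).
z_real = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
z_imag = [[0.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

y_real = ssm_scan(z_real, a, b, c, d, gamma)
y_imag = ssm_scan(z_imag, a, b, c, d, gamma)
```

The impulse response decays step by step because each update multiplies the previous state by both the transition parameter and the decoherence factor.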

Complex output formulation:

y^{(t)} = y_{R}^{(t)} + i y_{I}^{(t)}
o^{(t)} = W_{out} y^{(t)} \in R^{B \times dim}
\hat{y} = [y^{(1)}; \dots; y^{(16)}] \bar{W}_{out} \in C^{B \times L \times dim}

(23)

where [y^{(1)}; \dots; y^{(16)}] is vertical stacking (concatenation along the sequence dimension), y^{(t)} is the complex output, o^{(t)} = Re(W_{out} y^{(t)}), and W_{out} is the final projection back to the original dimension. We extract only the real component y_{R}^{(t)} for output, as the final projection \bar{W}_{out} operates in real space.

Stacking multiple blocks: We apply N_{blocks} \in \{1, 2, 3, 4, 5\} quantum Mamba blocks (QMamba) sequentially:

F_{quantum} = QMamba_{1}(ψ_{ent})

(24)

where F_{quantum} is the output feature representation obtained by sequentially applying two quantum Mamba blocks (QMamba_{1}) to the entangled quantum state. Depending on the number of quantum blocks, this hierarchical process can reveal complex multimodal relationships through the dynamics of the quantum state.

The imaginary component serves to guide gradient flow during training through the quantum state dynamics.

‖h^{t}‖ \leq C \cdot \exp(-γ \cdot t) + C_{input}

(25)

The norm of the hidden state is bounded by an exponentially decaying term (C \cdot \exp(-γ \cdot t)) plus a constant input contribution (C_{input}). This bound supports training stability by preventing gradient explosion, because properly initialized parameters cause the gradient magnitude to decay exponentially over time at the decoherence rate.

We use amplitude–phase as the physical descriptor in multimodal coupling and then convert the quantum state to real–imaginary form to ensure stable and efficient time evolution through the state-space model, before the final measurement reduces the quantum state to classical features.

3.3.4. Quantum Measurement Selection Framework

High-confidence samples with distinct spectral-elevation patterns use magnitude measurement, while ambiguous samples with spectrally similar classes rely on phase relationships to distinguish materials with similar spectral signatures but different structural profiles. This measurement selection framework is motivated by weak measurement theory in quantum mechanics [69], where measurement strength can be tuned to preserve quantum coherence.

The measurement problem is that quantum states must be “collapsed” to classical features for classification, but the optimal measurement strategy depends on input characteristics: some samples benefit from phase-aware measurement, while others require magnitude-only measurement. Our measurement selection framework is illustrated in Figure 5. The input quantum state is processed by four types of measurement: magnitude measurement, which calculates the absolute value of the quantum state; phase measurement, which combines the magnitude with the phase information to preserve phase relationships; real measurement, which extracts only the real component; and adaptive measurement, which uses a learned confidence network to choose dynamically between quantum and classical measurements based on the state properties. All measurements preserve the quantum-encoded information from the entanglement and superposition stages while producing real-valued outputs [B, L, dim] suitable for downstream classical processing. This framework lets us balance preserving quantum coherence against extracting classical features during the optimization stage; in our experiments, the real-valued projection ultimately provided the best balance between accuracy, stability, and computational efficiency.

Confidence computation for measurement analysis: To enable systematic comparison, we implement a confidence network that learns measurement preferences.

State magnitude and pooled measurement features:

state_magnitude = |F_{quantum}| \in R^{B \times L \times dim}

(26)

m_{pooled} = \frac{1}{L} \sum_{l=1}^{L} state_magnitude[b, l, :] \in R^{B \times dim}

(27)

where |\cdot| denotes the complex magnitude (element-wise absolute value) and m_{pooled} is the average measurement obtained by taking the mean of state_magnitude across all L sequence positions, resulting in a single feature vector of dimension dim for each sample in the batch.

Confidence network (MLP):

c_{conf} = σ(W_{2}^{c} \cdot ReLU(W_{1}^{c} \cdot m_{pooled} + b_{1}^{c}) + b_{2}^{c}) \in R^{B \times 1}

(28)

where W_{1}^{c} \in R^{(dim/2) \times dim} and b_{1}^{c} \in R^{dim/2} are the weights and bias of the first layer, and W_{2}^{c} \in R^{1 \times (dim/2)} and b_{2}^{c} \in R^{1} are the weights and bias of the second layer of the measurement confidence network.

The confidence scores are then expanded to match the sequence dimension:

\tilde{c}_{conf} = expand(c_{conf}) \in R^{B \times L \times 1}

(29)

Measurement operators (systematic comparison): The input quantum state is processed through four candidate measurement strategies (Figure 5):

1. Magnitude measurement: calculates the absolute value of the quantum state:

M_{mag}(ψ) = |ψ|

(30)

2. Real projection: extracts only the real (Hermitian/observable) component:

M_{real}(ψ) = Re(ψ)

(31)

3. Phase-aware measurement: combines the magnitude with phase information to preserve phase relationships:

M_{phase}(ψ) = |ψ| \cdot \cos(∠ψ)

(32)

where ∠ψ denotes the complex phase angle.

4. Adaptive measurement (for comparative analysis during development): uses a learned confidence network to explore different measurement strategies:

S \sim Uniform(0, 1) \in R^{B \times L \times 1}

(33)

quantum_mask = 1[S < \tilde{c}_{conf}]

(34)

M_{adaptive}(ψ) = \begin{cases} M_{phase}(ψ), & if quantum_mask = 1 \\ M_{real}(ψ), & if quantum_mask = 0 \end{cases}

(35)

where 1[\cdot] is the indicator function and the measurement type is selected stochastically for each position in the batch and sequence.

Final output:

F_{classical} = M(F_{quantum}) \in R^{B \times L \times dim}

(36)

where M \in \{M_{mag}, M_{real}, M_{phase}, M_{adaptive}\} is the selected measurement strategy.
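The four measurement operators in Equations (30)–(35) can be written as a short plain-Python sketch operating on individual complex values. The example state, the fixed confidence of 0.7, the seed, and the function names are illustrative assumptions; in the paper the confidence comes from the learned network in Equation (28).

```python
import cmath
import math
import random

def measure_magnitude(psi):
    """Equation (30): |psi|."""
    return abs(psi)

def measure_real(psi):
    """Equation (31): Re(psi)."""
    return psi.real

def measure_phase(psi):
    """Equation (32): |psi| * cos(angle(psi))."""
    return abs(psi) * math.cos(cmath.phase(psi))

def measure_adaptive(psi, confidence, rng):
    """Equations (33)-(35): phase-aware with probability `confidence`, else real."""
    quantum_mask = rng.random() < confidence
    return measure_phase(psi) if quantum_mask else measure_real(psi)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
state = [complex(0.8, 0.6), complex(-0.3, 0.4), complex(0.0, -1.0)]   # hypothetical values

magnitudes = [measure_magnitude(p) for p in state]
reals = [measure_real(p) for p in state]
phase_aware = [measure_phase(p) for p in state]
adaptive = [measure_adaptive(p, 0.7, rng) for p in state]
```

One point worth noting when reading Equations (31) and (32) literally: |ψ| cos(∠ψ) equals Re(ψ) for every complex number, so `phase_aware` and `reals` agree to round-off in this sketch. Any practical difference between the two strategies would have to come from implementation details that the text does not specify.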

Physical interpretation of the confidence score: c_{conf} \approx 1 indicates high quantum coherence, so phase-aware measurement should be used because it preserves quantum correlations, whereas c_{conf} \approx 0 indicates low quantum coherence (a classical-like state), for which real projection is appropriate. During training, the stochastic selection encourages exploration because both measurement types are sampled, which allows gradients to flow through both paths. This regularization helps prevent overreliance on a single measurement strategy, and under the probabilistic interpretation c_{conf} is the likelihood of using phase-aware measurement.

3.4. Confidence-Based Modality Fusion

Ambiguity remains a challenge even after quantum processing; not all modalities contribute equally to each sample. For urban scenes, LiDAR elevation information dominates the classification, while for vegetation, HSI spectral information is more discriminative. Therefore, combining HSI and LiDAR data can significantly improve classification performance if the contribution of each modality is properly accounted for [20].

Modality confidence network:

For each modality m \in \{HSI, LiDAR\}:

c_{m} = σ(\bar{W}_{2}^{(m)} \cdot ReLU(\bar{W}_{1}^{(m)} \cdot \bar{S}_{m} + b_{1}^{(m)}) + b_{2}^{(m)})

(37)

w = softmax([c_{1}, c_{2}])

(38)

where \bar{S}_{m} = \frac{1}{L} \sum_{l=1}^{L} S_{m}^{(l)} \in R^{B \times dim} denotes the mean-pooled sequence feature and \bar{W}_{2}^{(m)}, \bar{W}_{1}^{(m)} are the confidence network weights for modality m.

Feature fusion:

\bar{F}_{meas} = \frac{1}{L} \sum_{l=1}^{L} F_{classical}^{(l)} \in R^{B \times dim}

(39)

where F_{classical}^{(l)} is the l-th position of the measured quantum state (Equation (36)).

f_{fused} = \bar{F}_{meas} + w_{1} \bar{S}_{1} + w_{2} \bar{S}_{2}

(40)

where w_{1} and w_{2} are the weights for the HSI and LiDAR modalities.

This adaptive fusion [70] preserves three complementary information paths: the quantum path, the HSI path, and the LiDAR path.
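The fusion in Equations (38) and (40) can be sketched in plain Python as follows. The confidence scores are passed in directly rather than computed by the MLP of Equation (37), and the feature vectors are short hypothetical examples; the function names are not from the paper.

```python
import math

def softmax(values):
    """Numerically stable softmax over a short list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(f_meas, s_hsi, s_lidar, c_hsi, c_lidar):
    """Equation (40): measured features plus confidence-weighted modality features."""
    w_hsi, w_lidar = softmax([c_hsi, c_lidar])   # Equation (38)
    fused = [f + w_hsi * h + w_lidar * l for f, h, l in zip(f_meas, s_hsi, s_lidar)]
    return fused, (w_hsi, w_lidar)

# Hypothetical mean-pooled features of dimension 3.
f_meas = [0.2, -0.1, 0.4]
s_hsi = [1.0, 0.0, -1.0]
s_lidar = [0.0, 2.0, 1.0]

fused_equal, weights_equal = fuse(f_meas, s_hsi, s_lidar, 0.5, 0.5)
fused_extreme, weights_extreme = fuse(f_meas, s_hsi, s_lidar, 1.0, 0.0)
```

Because each c_m is a sigmoid output in (0, 1), the softmax in Equation (38) keeps both weights between 1/(1 + e) ≈ 0.27 and e/(1 + e) ≈ 0.73, so neither modality can be switched off entirely by this weighting.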

3.5. Classification Head and Training

Classification:

\hat{f} = LayerNorm(f_{fused})

(41)

h = Dropout(ReLU(W_{1} \hat{f} + b_{1}), p = optimal)

(42)

z = W_{2} h + b_{2} \in R^{B \times C}

(43)

where \hat{f} denotes the normalized fused features, h is the hidden representation, Dropout is dropout regularization, ReLU is the Rectified Linear Unit activation function, W_{1} \in R^{192 \times 384} and W_{2} \in R^{C \times 192} are weight matrices, b_{1} and b_{2} are bias terms, and z \in R^{B \times C} are the logits (output scores before SoftMax).

Loss and optimization:

L = - \frac{1}{B} \sum_{b=1}^{B} \sum_{k=1}^{C} y_{b, k} \log(SoftMax(z_{b})_{k})

(44)

where L is the loss value, b is the batch index, k is the class index (iterating through the classes), y_{b, k} is the ground-truth label, z_{b} are the logits for sample b, and SoftMax(z_{b})_{k} is the predicted probability for class k of sample b.
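Equation (44) can be evaluated with a few lines of plain Python. The sketch assumes one-hot labels given as class indices, so the inner sum over k reduces to the log-probability of the true class; the logits are hypothetical, and the class weights mentioned in the implementation details are not modeled.

```python
import math

def log_softmax(logits):
    """Stable log-softmax using the log-sum-exp trick."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]

def cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the true class over the batch."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        total -= log_softmax(logits)[label]
    return total / len(labels)

# Hypothetical logits for C = 3 classes.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0]], [1])
confident_loss = cross_entropy([[5.0, 0.0, 0.0]], [0])
wrong_loss = cross_entropy([[5.0, 0.0, 0.0]], [2])
```

With uniform logits the loss equals log C, which is a convenient reference value when checking the first training iterations.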

3.6. Theoretical Foundation

Theorem 1.

Information-theoretic quantum advantage. For multimodal features x_{HSI} and x_{LiDAR}, the quantum-inspired complex-valued representation captures strictly greater mutual information than classical real-valued concatenation:

I_{quantum}(x_{HSI}; x_{LiDAR}) > I_{classical}(x_{HSI}; x_{LiDAR})

(45)

where the phase contribution to the quantum mutual information is defined as

I_{phase} = S(ρ_{complex}) - S(ρ_{real}) \geq 0

(46)

with S(ρ) = -Tr(ρ \log ρ) being the von Neumann entropy and I_{phase} representing the phase information contribution.

Proof of Theorem 1.

We proceed by demonstrating that phase encoding strictly increases representational capacity. For the classical case, features are concatenated in real space:

x_{classical} = [S_{HSI}; S_{LiDAR}] \in R^{2D}

(47)

The density matrix is

ρ_{real} = \frac{x_{classical} x_{classical}^{T}}{‖x_{classical}‖^{2}}

(48)

For the quantum case, our superposition creates

|ψ⟩ = \sum_{m=1}^{2} \tilde{α}_{m} e^{i ϕ_{m}} S_{m} \in C

(49)

The complex density matrix is

ρ_{complex} = \frac{|ψ⟩⟨ψ|}{⟨ψ|ψ⟩}

(50)

By the properties of von Neumann entropy, S(ρ_{complex}) contains both amplitude and phase correlations, while S(ρ_{real}) contains only amplitude information. The phase term satisfies

I_{phase} = S(ρ_{complex}) - S(ρ_{real}) = -Tr(ρ_{complex} \log ρ_{complex}) + Tr(ρ_{real} \log ρ_{real})

(51)

Since the phase relationships e^{i ϕ_{m}} introduce off-diagonal terms in ρ_{complex} that are absent in ρ_{real}, and these terms capture cross-modal interference patterns, we have

I_{quantum} = I_{classical} + I_{phase} > I_{classical}

(52)

This completes the proof. □

Theorem 2.

Stability of quantum-enhanced SSM. The quantum-enhanced Mamba block with decoherence rate γ \geq 0 and decay factor γ(t) = \exp(-γ \cdot t) \in [0, 1] maintains bounded hidden states for any input sequence x \in C^{L \times D}:

‖h_{t}^{complex}‖_{2} \leq γ(t) ‖h_{t-1}^{complex}‖_{2} + ‖B_{t}^{complex}‖ \cdot ‖x_{t}‖_{2}

(53)

where γ(t) = \exp(-γ \cdot t) \in [0, 1] for t > 0.

Proof of Theorem 2.

We analyze the complex state evolution. The state update at time t is

h_{t}^{complex} = γ(t) \cdot (h_{t-1}^{complex} \otimes A) + x_{t} \otimes B_{t}^{complex}

(54)

Taking the L_{2} norm and applying the triangle inequality gives

‖h_{t}^{complex}‖_{2} \leq γ(t) ‖h_{t-1}^{complex} \otimes A‖_{2} + ‖x_{t} \otimes B_{t}^{complex}‖_{2}

(55)

By properties of the tensor contraction \otimes,

‖h_{t-1}^{complex} \otimes A‖_{2} \leq ‖h_{t-1}^{complex}‖_{2} \cdot ‖A‖_{F}

(56)

where ‖\cdot‖_{F} is the Frobenius norm. Since A is initialized with \exp(softplus(A_{log})), ensuring ‖A‖_{F} \leq 1, we have

‖h_{t}^{complex}‖_{2} \leq γ(t) ‖h_{t-1}^{complex}‖_{2} + ‖B_{t}^{complex}‖ \cdot ‖x_{t}‖_{2}

(57)

Since γ(t) = \exp(-γ \cdot t) < 1 for t > 0 and γ > 0, the first term contracts exponentially. For bounded input ‖x_{t}‖_{2} \leq M, the state remains bounded:

‖h_{t}^{complex}‖_{2} \leq γ(t) ‖h_{t-1}^{complex}‖_{2} + C \cdot M

(58)

where C = \max_{t} ‖B_{t}^{complex}‖ is bounded by initialization. This proves stability. □
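The step from the one-step inequality (58) to a uniform bound can be made explicit. For t ≥ 1 the decay factor satisfies γ(t) ≤ q = exp(−γ) < 1, so unrolling the recursion gives ‖h_t‖ ≤ q^t ‖h_0‖ + C·M/(1 − q). The plain-Python sketch below iterates the worst case of inequality (58) and compares it with this ceiling; the numerical values are hypothetical, and the sketch takes the premise ‖A‖_F ≤ 1 from the proof as given.

```python
import math

def worst_case_norms(gamma, c_times_m, h0, steps):
    """Iterate the upper bound h_t = exp(-gamma * t) * h_{t-1} + C * M."""
    norms = [h0]
    for t in range(1, steps + 1):
        decay = math.exp(-gamma * t)
        norms.append(decay * norms[-1] + c_times_m)
    return norms

# Hypothetical values: decoherence rate, input bound C * M, and initial norm.
gamma, c_times_m, h0 = 0.05, 1.0, 10.0
norms = worst_case_norms(gamma, c_times_m, h0, steps=200)

# Geometric-series ceiling with q = exp(-gamma), valid for every t >= 0.
q = math.exp(-gamma)
ceiling = h0 + c_times_m / (1.0 - q)
```

The simulated norms never exceed `ceiling`, and for large t they settle close to the input contribution because γ(t) itself shrinks with t.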

Theorem 3.

Computational complexity. The QIE-Mamba architecture achieves O(L n d + d^{2}) computational complexity for sequence length L, feature dimension d, and batch size n, compared to O(L^{2} n d) for self-attention mechanisms.

Proof of Theorem 3.

We analyze the complexity of each component:

ConvNeXt Encoder: Each stage performs convolutions and linear transformations. For a feature map of size H \times W with d channels:

Depth-wise convolution: O(k^{2} H W d), where k = 7

(59)

Pointwise FFN: O(H W d^{2})

(60)

Total for encoder: O(H W d^{2}) = O(L d^{2}), where L = H W after adaptive pooling.

(61)

Quantum superposition:

MLP operations on sequences: O(2 \cdot n L d \cdot d_{h}) = O(n L d^{2})

(62)

Unitary entanglement:

Matrix exponential \exp(M) for M \in C^{d \times d}: O(d^{3})
Application to batch: O(n L d^{2})

(63)

Quantum Mamba:

SSM recurrence for L steps with state dimension s.
Total complexity: O(L d^{2} + n L d^{2} + d^{3} + n L d^{2} + L n d) = O(L n d^{2} + d^{3} + L n d)
Since n ≫ 1 and d^{3} is a constant-time operation per batch, the dominant term is O(L n d^{2} + L n d) = O(L n d (d + 1)) = O(L n d \cdot d) = O(L n d^{2}).
However, when considering the sequence processing specifically (the Mamba block), the complexity per sequence element is O(n d (1 + s)) = O(n d), giving a total sequence complexity of O(L n d).

(64)

In contrast, self-attention requires O(L^{2} n d) for attention matrix computation and application. Therefore, QIE-Mamba achieves linear complexity in sequence length. □
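The difference between the two scaling laws can be made concrete with a small operation-count sketch in plain Python. The counts are proxies that ignore constant factors and the state dimension s, and the function names are illustrative; L = 16, n = 32, and d = 384 are the values used in the experimental setup.

```python
def ssm_scan_cost(seq_len, batch, dim):
    """Proxy operation count for the sequential scan: O(L * n * d)."""
    return seq_len * batch * dim

def attention_cost(seq_len, batch, dim):
    """Proxy operation count for self-attention: O(L^2 * n * d)."""
    return seq_len ** 2 * batch * dim

batch, dim = 32, 384
costs = {
    seq_len: (ssm_scan_cost(seq_len, batch, dim), attention_cost(seq_len, batch, dim))
    for seq_len in (16, 64, 256)
}
ratios = {seq_len: attn // ssm for seq_len, (ssm, attn) in costs.items()}
```

The ratio between the two counts equals L itself, so the gap is a factor of 16 at the sequence length used here and widens linearly for longer sequences.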

4. Experiments Results and Analysis

4.1. Experiment Setup

In this section, we present the experimental results and analyze the evaluations shown in Figure 6 in the context of three benchmark remote sensing datasets with different geographic and sensor characteristics.

The theoretical validation of our proposed QIE-Mamba method begins with the selection of a decoherence model; the optimal decoherence rate of the selected model is established in Section 4.2.1, while the impact of the complexity factor, numerical stability, and unitary gate implementation on the principal architectural decisions is examined in Section 4.2.2, Section 4.2.3 and Section 4.2.4. Subsequent development uses the validated values.

The values chosen in Section 4.3 for the three steps of feature extraction, transfer, and aggregation are the best fit for this task and yield high performance. With these theoretical and architectural parameters fixed, Section 4.4 examines the effects of the training hyperparameters, and the best-performing hyperparameters are used for the final evaluation. Finally, Section 4.5 presents an ablation study of our contributions.

4.1.1. Datasets

Houston2013: This dataset was acquired over the campus of the University of Houston and the surrounding metropolitan area. It consists of hyperspectral data with 144 spectral bands (380 to 1050 nm), a LiDAR-derived DSM, and ground-truth labels for 15 land cover classes. The dataset is 349 × 1905 pixels in size and has a spatial resolution of 2.5 m. Table 1 lists the number of samples in each class.

Muufl: This dataset was collected over the University of Southern Mississippi Gulf Park Campus. It includes hyperspectral imagery with 64 bands (375 to 1050 nm) and LiDAR data with elevation information, and it has a spatial resolution of 1 m and dimensions of 325 × 220 pixels. The ground truth covers 11 land cover classes, and the number of samples in each class is reported in Table 1.

Augsburg: The Augsburg dataset was acquired over the city of Augsburg. It encompasses 180 spectral bands (spanning wavelengths from 0.4 to 2.5 µm). The spatial resolution is 30 m ground-sampling distance, and the image dimensions are 332 × 485 pixels. It includes ground-truth labels for seven land cover categories. Table 1 lists the number of samples in each category.

4.1.2. Implementation Details

Hardware configuration: NVIDIA RTX 5070 GPU (12 GB), Intel Core i5-14600KF CPU @ 3.50 GHz, and 32 GB RAM. Software environment: PyTorch 2.8.0, Python 3.10.17, CUDA 12.8.

Training configuration: The network is trained for 100 epochs with a batch size of 32. We use the Adam optimizer with an initial learning rate of 0.01 for the regular parameters; for the quantum parameters, the initial learning rate is 0.0005 on Houston2013 and Muufl and 0.001 on Augsburg. The learning rate is reduced using a cosine annealing schedule with T_{max} and η_{min} = 10^{-6}. Momentum parameters are β_{1} = 0.9 and β_{2} = 0.999. Weight decay is set to 0.01 for regularization. Our training uses DropPath = 0 by default, meaning drop path is disabled, and a rate of 0.1 in the final classifier. We use L = 16, obtained by adaptive pooling to 4 × 4 spatial dimensions, and dim = 384 as the embedding dimension of the sequences for Mamba processing. The learnable real values are initialized with N(0, 0.01), and the imaginary unit is \sqrt{-1}. Sigmoid activations take values in [0, 1]. The loss function is cross-entropy with class weights. The hyperspectral images and LiDAR data are extracted into patches of size 15 × 15 centered on each labeled pixel.
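The cosine annealing schedule mentioned above follows a standard closed form, sketched here in plain Python. The base learning rate 0.01 and η_min = 10⁻⁶ are taken from the text; the value of `t_max` is not given there, so `t_max = 100` (the number of epochs) is a hypothetical choice made only for this illustration.

```python
import math

def cosine_annealing(epoch, base_lr, eta_min, t_max):
    """Closed-form cosine annealing: decays from base_lr to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

base_lr = 0.01    # initial learning rate for regular parameters (from the text)
eta_min = 1e-6    # minimum learning rate (from the text)
t_max = 100       # hypothetical: assumed equal to the number of training epochs

schedule = [cosine_annealing(epoch, base_lr, eta_min, t_max) for epoch in range(t_max + 1)]
```

The schedule starts at the base rate, passes through the midpoint of the two rates halfway through training, and ends at η_min.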

Data preprocessing: We randomly choose 20% of the labeled samples from each dataset for training and use the remaining 80% for testing. This split follows past studies on these datasets, which allows a fair comparison with current methods. All input data are scaled to the range [0, 1]; for HSI data, this is done by spectral normalization. Data augmentation includes rotation, flipping, and adding noise to the spectrum.

4.2. Determining the Theoretical Foundation

4.2.1. Decoherence Model Validation

We assessed both the Markovian and non-Markovian decoherence models over rates γ \in [0.005, 0.1]. Figure 7 illustrates the distinct temporal behaviors of the two models; at the identical rate γ = 0.05, the non-Markovian model preserves 10.7% more quantum information owing to its slower decay. Optimal decoherence rates follow an inverse scaling relationship with sequence length (Figure 7b), with theoretical optima at γ = 0.0139 (Markovian) and γ = 0.0156 (non-Markovian) for L = 16 to maintain 80% state preservation.

The model that preserves gradient flow was then validated empirically across the datasets (three runs, 100 epochs) and achieved the best accuracy, as reported in Table 2.

Both models showed strong robustness, with performance differences of less than 0.08% across a 20× range of decoherence rates. As Table 2 and Figure 8 show, the optimal decoherence rate nevertheless depends on the dataset, highlighting the subtle relationship between quantum state preservation and data properties. For example, the Houston2013 dataset, which has high-quality data and robust modalities, achieves its best performance with minimal decoherence (γ = 0.005), preserving 92.31% of the quantum information. In contrast, Augsburg, which is characterized by high noise and low baseline accuracy, benefits from aggressive decoherence (γ = 0.1, 20.19% preservation), where quantum noise acts as an implicit regularizer. The Muufl dataset occupies an intermediate position (γ = 0.015), with an optimal rate consistent with the theoretical optimum for sequence length L = 16 (Figure 7b).

This variability suggests that the decoherence rate should be treated as a dataset-dependent hyperparameter, like the learning rate or a training schedule in classical networks, rather than as a fixed architectural choice. Subsequent training uses the best-performing decoherence rates of 0.005, 0.015, and 0.1 for the Houston2013, Muufl, and Augsburg datasets, respectively.
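The preservation percentages quoted in this subsection can be reproduced with a short calculation, shown here as a plain-Python sketch. It assumes that "preservation" means the Markovian factor exp(−γ·L) evaluated at the end of the sequence (L = 16), which matches the reported 92.31% and 20.19%. For the non-Markovian optimum, the quoted 0.0156 is reproduced when the hyperbolic factor is evaluated as 1/(1 + γ·L); this is an assumption of the sketch and differs from the t/L scaling written in Equation (18).

```python
import math

SEQ_LEN = 16  # sequence length L used in the paper

def retained_fraction(gamma, seq_len=SEQ_LEN):
    """Markovian factor exp(-gamma * L) at the end of the sequence."""
    return math.exp(-gamma * seq_len)

# Best-performing decoherence rates reported for each dataset.
rates = {"Houston2013": 0.005, "Muufl": 0.015, "Augsburg": 0.1}
retained = {name: retained_fraction(gamma) for name, gamma in rates.items()}

# Rates that keep 80% of the state at t = L under each assumed decay law.
target = 0.8
gamma_markovian = -math.log(target) / SEQ_LEN           # solves exp(-gamma * L) = 0.8
gamma_non_markovian = (1.0 / target - 1.0) / SEQ_LEN    # solves 1 / (1 + gamma * L) = 0.8
```

Under these assumptions the Muufl rate of 0.015 retains about 79% of the state, which is why it sits next to the 80% theoretical optima.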

4.2.2. Ablation Analysis of Complexity Factor ρ

To choose the optimal value of the complexity factor in the quantum superposition layer, we examined ρ \in \{0.5, 1.0, 1.5, 2.0\} on the three datasets, holding all other hyperparameters constant, and performed an ablation study measuring OA, AA, Kappa, and the number of parameters. Table 3 shows the results of this validation.

The range of the complexity factor ρ in Table 3 reflects the search for the right balance: when the model is too simple (ρ = 0.5), it underfits a complex dataset like Muufl, and when it is too complex (ρ = 2.0), it adds parameters without any performance gain and risks overfitting. The value ρ = 1.0 achieves the best generalization across the datasets.

Figure 9 illustrates the varying responses of the datasets: Houston2013 (best at ρ = 0.5) is a simple urban scene that does not require complex features, whereas Muufl and Augsburg (both best at ρ = 1.0) contain a more diverse set of phenomena and benefit from moderate complexity. This dataset dependence demonstrates the adaptability of our model rather than a one-size-fits-all approach. Only ~300 K additional parameters (21.2 M → 21.5 M) yield optimal performance. In conclusion, we chose ρ = 1.0 for both Muufl and Augsburg and ρ = 0.5 for Houston2013 as the optimal complexity factor, balancing model expressiveness and parameter efficiency with training stability.

4.2.3. Numerical Stability Validation

To assess the numerical behavior of the proposed real–imaginary transformation, we evaluated amplitude preservation, modality contribution balance, and degeneracy across more than 180 million samples from the Houston2013, Muufl, and Augsburg datasets. To ensure that amplitude normalization remains numerically stable across training and large-scale inference, we incorporated a small constant, ε = 10^{-8}, into the normalization denominator. The purpose of ε is to prevent division by zero and suppress floating-point underflow when feature magnitudes become extremely small.

Given the individual modality amplitudes α_{HSI} and α_{LiDAR}, we compute the fused amplitude energy as

amp_sum = \sqrt{α_{HSI}^{2} + α_{LiDAR}^{2}}

Samples are classified as near-zero if

amp_sum < ε

This threshold enables us to track numerical collapse events during training and evaluate stability over millions of forward passes. The findings validate that the transformation maintains both physical interpretability and numerical stability, as demonstrated in Table 4 and Figure 10. First, the amplitudes of the HSI and LiDAR channels remain centered around 0.5 with negligible variance (std ≈ 0.05) in Figure 10a–c, demonstrating balanced modality contributions.

Second, the combined amplitudes stay within a stable range (0.5111–0.9139) for all datasets, which shows that the pseudomodality generation does not add or remove artifacts. Third, the near-zero activation ratio remains 0.00% for ε ranging from 10⁻¹⁰ to 10⁻⁴ in Figure 10c, indicating that the mapping does not cause information to collapse.
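The near-zero check described above can be sketched in plain Python. The fused amplitude and the threshold ε = 10⁻⁸ follow the text; the exact form of the normalization (dividing each amplitude by amp_sum + ε) and the sample values are assumptions made for illustration.

```python
import math

EPS = 1e-8  # stabilizing constant from the text

def fused_amplitude(alpha_hsi, alpha_lidar):
    """amp_sum = sqrt(alpha_HSI^2 + alpha_LiDAR^2)."""
    return math.sqrt(alpha_hsi ** 2 + alpha_lidar ** 2)

def normalize(alpha_hsi, alpha_lidar, eps=EPS):
    """Assumed normalization: divide by amp_sum + eps to avoid division by zero."""
    total = fused_amplitude(alpha_hsi, alpha_lidar) + eps
    return alpha_hsi / total, alpha_lidar / total

def near_zero_ratio(pairs, eps=EPS):
    """Fraction of samples whose fused amplitude falls below eps."""
    flagged = sum(1 for a, b in pairs if fused_amplitude(a, b) < eps)
    return flagged / len(pairs)

# Hypothetical amplitude pairs centered around 0.5, plus one degenerate sample.
healthy = [(0.5, 0.5), (0.45, 0.55), (0.52, 0.48)]
ratio_healthy = near_zero_ratio(healthy)
ratio_with_collapse = near_zero_ratio(healthy + [(0.0, 0.0)])
safe_zero = normalize(0.0, 0.0)
```

With both amplitudes near 0.5 the fused amplitude is close to 0.71, which lies inside the stable range reported above, and the degenerate sample is normalized to zero instead of raising a division error.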

These findings verify that the real–imaginary mapping is numerically robust and suitable for downstream learning, providing the foundation for the decomposition comparison in Section 4.2.5 and the domain-difference validation in Section 4.2.6.

4.2.4. Unitary Gate Implementation

A key requirement when implementing unitary gates is enforcing unitarity. This requirement is met by both the matrix exponential (expm) and the Cayley transform, which mathematically produce a unitary matrix with the same accuracy. The choice between them therefore mainly affects computation time. Our plot of the unitarity error on the datasets in Figure 11c confirms this: the overlap of the two error curves indicates that the methods are numerically equivalent. Cayley is faster in terms of computational efficiency; in Figure 11a, the dashed lines (Cayley) show a consistent advantage over the solid lines (exponential) for d ≥ 200, and the speedup is shown in Figure 11b.

Figure 11c depicts the unitarity error of the gate implementation across datasets. The Frobenius norm of the deviation from perfect unitarity, plotted on the \log ‖U^{†} U - I‖_{F} scale, shows that both the Cayley and exponential methods preserve unitarity to machine precision across dimensions. The overlapping lines verify mathematical equivalence. Figure 11d shows a heatmap of the speedup (Cayley time/exponential time) for the datasets and the metrics. Green indicates a Cayley advantage (>1.2×), yellow indicates a marginal benefit (1.0–1.2×), and red indicates a Cayley disadvantage (<1.0×). Our architectural choices (dim = 256 and 384, marked) fall within the optimal green regime for all datasets.

4.2.5. Decomposition Method Strategies Comparison

Our quantum-inspired framework encodes multimodal features using an amplitude–phase representation in the quantum superposition layer for physical interpretability and then converts to real–imaginary form for quantum Mamba processing. We compared two decomposition strategies, real–imaginary (proposed) and amplitude–phase (baseline), to evaluate their ability to preserve domain information while supporting effective multimodal fusion. Both variants maintain identical network architectures, training procedures, and hyperparameters, differing only in the complex state decomposition method applied between the superposition layer and temporal evolution.

Across all datasets, the real–imaginary strategy achieved slightly higher accuracy (Δ = 0.04–0.32%) and faster convergence (up to a 14.4% reduction in training time), while also exhibiting lower gradient variance and improved stability, as shown in Table 5. These observations indicate that the real–imaginary decomposition is more effective at maintaining semantic content and supporting discriminative learning. The nearly identical accuracy of the two strategies suggests that both decompositions preserve essentially the same information from the original representation.

To further confirm the greater stability and efficiency of the real–imaginary mapping, we describe the trends in domain differences in Section 4.2.6. Note that the real–imaginary and amplitude–phase decompositions operate solely on internal complex feature representations; all training uses the original real HSI–LiDAR datasets.
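As a minimal sketch of the two decompositions compared in this subsection, the snippet below splits a complex feature vector both ways and reconstructs it. The function names are illustrative and not taken from the paper's code. The last lines show one property that may contribute to the stability difference (this is an assumption, not a result reported here): the phase coordinate jumps by almost 2π across the negative real axis, whereas the real–imaginary coordinates change smoothly.

```python
import numpy as np

def to_amplitude_phase(z):
    """Split complex features into amplitude |z| and phase arg(z)."""
    return np.abs(z), np.angle(z)

def from_amplitude_phase(amplitude, phase):
    return amplitude * np.exp(1j * phase)

def to_real_imag(z):
    """Split complex features into real and imaginary parts."""
    return z.real, z.imag

def from_real_imag(re, im):
    return re + 1j * im

rng = np.random.default_rng(0)
z = rng.standard_normal(8) + 1j * rng.standard_normal(8)

# Both decompositions are lossless: the original state is recovered.
z_ap = from_amplitude_phase(*to_amplitude_phase(z))
z_ri = from_real_imag(*to_real_imag(z))

# Two nearly identical states on either side of the negative real axis.
a = -1.0 + 1e-3j
b = -1.0 - 1e-3j
phase_gap = abs(np.angle(a) - np.angle(b))   # close to 2*pi
real_imag_gap = abs(a - b)                   # close to 0
print(phase_gap, real_imag_gap)
```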

4.2.6. Domain Difference and Stability of Pseudomodalities

The quantitative domain-difference analysis in Table 6 and Figure 12 further corroborates the conclusions of Section 4.2.3 and Section 4.2.5 regarding the stability and domain preservation of the proposed pseudomodalities. The cosine similarity between real and pseudo-representations across all three datasets stays within the expected range for discriminative multimodal learning (HSI: 0.60–0.81; LiDAR: 0.26–0.73). This indicates that the pseudomodalities do not simply copy raw sensor data; they evolve into task-optimized representations while staying in the same domain. This behavior is fully consistent with the decomposition results in Section 4.2.5, where the real–imaginary strategy preserved semantic information while allowing beneficial representational divergence. Importantly, the Maximum Mean Discrepancy (MMD) values converge to low magnitudes (HSI: 0.002–0.003; LiDAR: 0.001–0.388) after the early epochs, supporting the numerical stability observations in Section 4.2.3. During the initial training phase, the pseudomodalities function as superficial reconstructions of the original HSI and LiDAR data, resulting in elevated cosine similarity. As training progresses, the pseudomodality representations become increasingly task-optimized and encode discriminative multimodal patterns rather than input-level structure.

Such behavior naturally reduces cosine similarity, since the features diverge from the raw input while remaining within the same statistical domain, as reflected by the low MMD values. The strong OA (96–98%) obtained at convergence further indicates that this divergence supports robust fusion, providing a coherent explanation that ties together the stability, decomposition behavior, and domain-difference characteristics of our quantum-inspired Mamba pipeline.
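For readers who want to reproduce this kind of domain-difference analysis, the two metrics can be sketched as follows. This is a hedged illustration: the RBF kernel, its bandwidth, the biased MMD estimator, and the synthetic features are assumptions for the example, since the exact estimator settings are not restated in this section.

```python
import numpy as np

def mean_cosine_similarity(x, y):
    """Mean cosine similarity between matching rows of x and y."""
    num = np.sum(x * y, axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return float(np.mean(num / den))

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth**2))

def mmd_squared(x, y, bandwidth=4.0):
    """Biased estimate of the squared MMD with an RBF kernel."""
    return float(rbf_kernel(x, x, bandwidth).mean()
                 + rbf_kernel(y, y, bandwidth).mean()
                 - 2 * rbf_kernel(x, y, bandwidth).mean())

# Synthetic stand-ins for real and pseudo-modality features (illustrative).
rng = np.random.default_rng(0)
real = rng.standard_normal((200, 16))
pseudo_close = real + 0.1 * rng.standard_normal((200, 16))   # same domain
pseudo_shifted = real + 2.0                                  # shifted domain

print(mean_cosine_similarity(real, pseudo_close))
print(mmd_squared(real, pseudo_close), mmd_squared(real, pseudo_shifted))
```

A low MMD together with a moderate cosine similarity corresponds to the situation described above: the features differ sample by sample but follow a similar overall distribution.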

4.3. Optimal Selection of Key Parameters and Training Hyperparameters

4.3.1. Optimal Selection of Key Parameters

The practical implementation of the quantum-inspired Mamba paradigm, in conjunction with the robust feature extraction of ConvNeXt, requires numerous training cycles to determine the best values of the critical parameters. To address this challenge, we devised a systematic procedure that optimizes the key values sequentially, structuring the tests of how each factor affects the efficiency of our algorithm as a three-stage process. In the initial trial, the values are assigned randomly; they are listed in Table 7.

In the first phase (feature extraction), the configuration with the highest OA is chosen from several ConvNeXt depth configurations. In the second phase, the expansion factor and state size of the subsequent Mamba architecture are tuned. In the third, quantum-inspired phase, the number of quantum blocks, the entanglement strength, and the measurement type are systematically tuned.

Table 8, Table 9 and Table 10 show the contribution of each component on the Houston2013, Muufl, and Augsburg datasets, with the results reported phase by phase: phase 1 optimizes the ConvNeXt backbone (4 configurations tested); phase 2 optimizes the Mamba parameters (8 configurations tested); and phase 3 optimizes the quantum enhancement (14 configurations tested).

The sequential dependency principle holds that the hyperparameters of a deep learning architecture are hierarchically dependent: downstream components operate on the representations generated by upstream components. Accordingly, the depth of ConvNeXt influences the optimal complexity of Mamba, quantum enhancements yield quantifiable advantages, and parameter sensitivity varies with component hierarchy. If all stages were optimized simultaneously (grid search), the parameters of stages 2 and 3 would also be tested with non-optimal stage 1 settings, wasting computational resources on irrelevant combinations. Sequential validation is therefore far more efficient: it tests 4 + 8 + 14 = 26 configurations, whereas a full grid search would require 4 × 4 × 4 × 5 × 5 × 4 = 6400 configurations, a reduction of about 99.6%.
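The search-cost arithmetic can be checked with a short script. The per-phase grouping of option counts below ([4], [4, 4], [5, 5, 4]) is an assumption inferred from the phase totals and the product quoted in the text; only the totals 4, 8, 14 and the six factors are given there.

```python
from math import prod

# Number of candidate values per hyperparameter, grouped by phase (assumed grouping).
stage_option_counts = {
    "phase 1 (ConvNeXt depth)": [4],
    "phase 2 (Mamba state size, expansion factor)": [4, 4],
    "phase 3 (quantum blocks, entanglement, measurement)": [5, 5, 4],
}

all_counts = [n for counts in stage_option_counts.values() for n in counts]

# Sequential search tunes one hyperparameter at a time: costs add up.
sequential_tests = sum(all_counts)
# Grid search tries every combination: costs multiply.
grid_tests = prod(all_counts)
reduction_percent = 100 * (1 - sequential_tests / grid_tests)

print(sequential_tests, grid_tests, round(reduction_percent, 2))
```

The trade-off of the sequential scheme is that it can miss interactions between hyperparameters tuned in different phases, which a full grid would capture.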

Table 8 shows the optimization results for Houston2013, which has 2.5 m resolution, 144 spectral bands plus LiDAR, 15 urban classes, and 65 to 250 training samples per class. In the first stage, the tiny architecture [2, 2, 6, 2] with 15.24 M parameters achieved the highest accuracy (98.40%), outperforming the deeper base network [3, 3, 27, 3] with 40.63 M parameters (98.20%). This is in line with Occam’s razor: simpler models generalize better when training samples are limited and class boundaries are well defined, whereas deeper networks are more prone to fitting noise than to learning truly discriminative patterns.

In the second optimization step, the high-dimensional input (144 bands + LiDAR) required a large state memory to encode complex spectral–spatial relationships and to distinguish similar classes such as “grass healthy” and “grass stressed,” so the state size of 64 (98.55%) showed a significant performance improvement over the smaller state sizes. The expansion factor showed little difference between 2.0 (98.25%) and 3.0 (98.26%), indicating an optimal range of 2.0–3.0.

The third step, quantum optimization, provided important insights. A single quantum block (98.44% ± 0.20) was found to be optimal: additional blocks increased the training time (+62%) and overfitting (the variance increased from ±0.20 to ±0.30) without improving accuracy. This validates the idea of “quantum is advantageous,” indicating that a single well-designed quantum block suffices for cross-modal fusion. An entanglement strength of 0.5 (98.39% ± 0.08) provided the best accuracy–stability balance, while the highest entanglement (1.0) achieved peak accuracy (98.55%) but with high dispersion (±0.23), as strong coupling propagates noise between modalities.

The Muufl dataset features complex terrain, diverse land cover, and spectral confusion; its optimization results are shown in Table 9. Consistent with Houston2013, the tiny architecture [2, 2, 6, 2] achieved the best performance (95.00%), confirming that a parsimonious architecture prevents overfitting when training data are limited. The state size of 64 (94.85%) remained optimal, confirming the common requirement of high-dimensional multimodal fusion. The optimal expansion factor for Muufl was 3.0 (95.21%), the same as for Houston2013. This indicates that terrain heterogeneity and spectral ambiguity benefit from the increased feature abstraction capacity that a larger expansion factor provides in the Mamba block.

Quantum optimization revealed distinctive behavior: the optimal entanglement strength of 0.3 (94.71% ± 0.17%) was lower than the Houston2013 value of 0.7, indicating that weak cross-modal coupling prevents instability when the modalities offer little complementary information or contain more noise. A strong entanglement of 0.7 markedly increased the variability (±0.69%), confirming this instability.

Table 10 shows this effect for the noisy 30 m resolution Augsburg data. Here, the deeper network acts as an effective noise filter, using a 15 × 15 patch that covers a 450 m × 450 m area. Under high noise, the 18-block model fits the data best: the greater depth enlarges the receptive field, which is important for extracting reliable values from the data. However, going beyond 18 blocks may lead to overfitting, especially when the training sample is imbalanced and large classes such as housing (6065) outweigh small classes such as distribution (115).

These results show the importance of matching architectural complexity to the characteristics of the data. A state size of 64 (95.87%) and an expansion factor of 2.5 (95.61%) followed patterns consistent with the other datasets. Quantum optimization revealed Augsburg’s most distinctive characteristic: the maximum entanglement strength of 1.0 (95.73% ± 0.27%) proved optimal, indicating that low resolution and high noise necessitate aggressive cross-modal information fusion to achieve adequate discriminative power. Augsburg requires the strongest coupling between HSI and LiDAR, whereas high-quality datasets need only moderate entanglement.

Four universal principles emerged: a single quantum block is consistently optimal across all datasets, validating targeted enhancement over deep quantum networks; a large state size (64) is universally required for multimodal fusion; sequential optimization dramatically reduces computational cost while still finding optimal solutions; and training sample scarcity universally constrains model complexity. Context-dependent adaptations revealed equally important insights: architectural depth correlates inversely with data quality (shallow for high-quality Houston2013, deeper for noisy Augsburg), and entanglement strength reflects cross-modal complementarity (1.0 for urban, 0.3 for terrain, 1.0 for noisy data). As Table 8 (Houston2013) shows, the ConvNeXt-Tiny model performs well with a training time of just 288.5 s, whereas deeper models such as ConvNeXt-Base take 418.6 s to train and only slightly improve accuracy, revealing a clear trade-off between speed and accuracy. A similar pattern can be seen in Table 9 and Table 10 for Muufl and Augsburg, where the optimal configuration balances depth, runtime, and accuracy.

In the final step, measurement-type selection, all four measurement types achieved an accuracy of over 94% (with <0.4% variance on all datasets). This suggests that the measurement strategies, which map the state of the quantum-inspired architecture to classical features, are reading out stable, well-formed quantum states resulting from entanglement. Furthermore, the fact that operators with different sensitivity profiles, such as angular information and amplitude information, all read these states out successfully confirms that any reasonable projection works well if the state is sufficiently informative. We systematically compared four quantum measurement operators (magnitude, real, phase-aware, and adaptive) across the three datasets. Real projection consistently achieved the best performance (Table 8, Table 9 and Table 10, Section C) and is therefore selected as the final measurement operator.
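To make the four readout types concrete, the sketch below gives one possible form for each. Only the magnitude and real projections follow directly from their names; the phase-aware and adaptive operators shown here are hypothetical stand-ins chosen for illustration (a rotated-axis projection with an assumed angle `theta`, and a softmax-weighted mixture with assumed `logits`), since their exact definitions are not restated in this section.

```python
import numpy as np

def measure_magnitude(z):
    """Read out the amplitude |z| and discard the phase."""
    return np.abs(z)

def measure_real(z):
    """Project the complex state onto the real axis."""
    return z.real

def measure_phase_aware(z, theta):
    """Hypothetical phase-aware readout: project onto an axis rotated by theta."""
    return (z * np.exp(-1j * theta)).real

def measure_adaptive(z, logits):
    """Hypothetical adaptive readout: softmax-weighted mix of |z|, Re z, Im z."""
    weights = np.exp(logits - np.max(logits))
    weights /= weights.sum()
    candidates = np.stack([np.abs(z), z.real, z.imag])
    return np.tensordot(weights, candidates, axes=1)

z = np.array([3.0 + 4.0j, 1.0 - 1.0j])
print(measure_magnitude(z))
print(measure_real(z))
print(measure_phase_aware(z, np.pi / 2))
print(measure_adaptive(z, np.zeros(3)))
```

In this form, every operator is a cheap elementwise map from the complex state to real features, which is consistent with the observation that the choice among them matters little when the state itself is informative.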

The quantum enhancements provide 0.2–0.4 percentage point improvements in overall accuracy, but these mask substantial gains in hard-case classification—boundary pixels and transitional zones where traditional architectures struggle most. For Houston2013, the 0.4% improvement (98.15% → 98.55%) represents 4–8% accuracy gains in ambiguous regions, achieved by properly correlating cross-modal signatures like “high reflectance with elevation” for commercial zones. These results demonstrate that optimal quantum-inspired architecture emerges from dataset-specific characteristics rather than universal recipes, requiring architectural decisions grounded in data acquisition physics, classification task semantics, and training sample statistics.

4.3.2. Training Hyperparameter Ablation

Figure 13a shows the effect of the learning rate on performance over the tested values [0.01, 0.005, 0.001, 0.0005, 0.0001]. The experimentally optimal learning rates are dataset dependent (Houston2013, Muufl 0.0005, and Augsburg 0.001), with the value selected for Augsburg reflecting greater robustness to noise. Figure 13b shows the tested batch sizes [16, 32, 64, 128, 256]: 256 is suitable for the stable data in Houston2013 and Muufl, while 32 is optimal for Augsburg, where it prevents overfitting to noisy and unbalanced data. Larger batch sizes improve performance on clean datasets. Figure 13c shows a heatmap of the learning-rate results in the high-performance range, with a color-coded matrix marking the error region (red) and the optimal region (green). Figure 13d shows a heatmap comparing how different batch sizes affect performance in the high-performance range (94–100%).

The optimal settings from Section 4.3 (ConvNeXt depth [2, 2, 6, 2]/[3, 3, 9, 3], Mamba state size 64, expansion factor 2.5–3.0, entanglement strength λ = 0.7–1.0, and real measurement) provide the basis for the comparative evaluation against state-of-the-art methods. This systematic optimization supports a fair comparison: the competing methods are evaluated under the same conditions (training samples, data partitioning, preprocessing), while QIE-Mamba operates with empirically validated hyperparameter settings instead of arbitrary initial values. Sequential optimization, which reduces the search complexity by 99.6%, demonstrates not only computational efficiency but also architectural maturity: the contribution of each component is quantified by ablation, allowing a principled comparison with methods that lack such systematic validation.

4.4. Comparison of Benchmark Approaches

We compared the proposed QIE-Mamba with four methods: 1D-CNN (spectral baseline), HybridSN (3D-2D hybrid CNN), MFT (multi-scale convolutional transform), and CALC (attention-centered cross-training). Table 11, Table 12 and Table 13 show the overall results for the Houston2013, Muufl, and Augsburg datasets. On Houston2013 (Table 11), the proposed method achieves OA = 99.62%, AA = 99.68%, and Kappa = 99.59, outperforming HybridSN (96.58%), CALC (93.97%), MFT (88.78%), and 1D-CNN (69.40%). There is a significant improvement in the difficult classes that require cross-modal reasoning, especially residential (+4.78 points over HybridSN), parking lot 1 (+9.48 points), and highway (+4.93 points). Six classes achieve a perfect 100% accuracy. The catastrophic failure of 1D-CNN (parking lot 2: 0.00%, residential: 47.30%) indicates the need for multimodal information. Figure 14 shows that the prediction map has better boundary definition and less noise than those of the competitors.

Table 12 shows the results on Muufl. The proposed method achieves OA = 96.31%, AA = 90.44%, and Kappa = 95.12, outperforming HybridSN (93.14%), MFT (92.28%), and CALC (80.96%). Significant improvements are observed in difficult categories such as grass (+9.96 over HybridSN), dirt and sand (+8.07), and sidewalk (+16.18). For the difficult yellow curb class, the proposed method reaches 53.74%, far above MFT, which fails almost completely (0.0%). The average accuracy (AA) reaches 90.44%, 7.42 points above the best competing method. Figure 15 shows that the building boundaries are cleaner and the vegetation discrimination is improved.

Under the most adverse conditions, on Augsburg (Table 13), the proposed method achieves OA = 96.30%, AA = 89.23%, and Kappa = 94.69, significantly better than its competitors. Most importantly, QIE-Mamba achieves >60% accuracy for all classes, eliminating the catastrophic errors found in 1D-CNN and HybridSN (allotment, commercial area, water: all 0.00%). The gains include industrial area (+18.86% over HybridSN), forest (+2.55%), and residential area (+1.10%). The +17.12% advantage over HybridSN is attributed to the noise immunity provided by the quantum-inspired fusion. Figure 16 shows cleaner residential and industrial zones and low vegetation, along with improved allotment differentiation.
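The three summary metrics used in these comparisons (OA, AA, and Kappa) can be computed from a confusion matrix as sketched below. The toy matrix is invented for illustration and is not taken from Table 11, Table 12 or Table 13; the function assumes rows are true classes and columns are predictions.

```python
import numpy as np

def classification_scores(confusion):
    """Return (OA, AA, kappa) from a confusion matrix (rows = true class)."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    # Overall accuracy: fraction of all samples on the diagonal.
    overall_accuracy = np.trace(confusion) / total
    # Average accuracy: unweighted mean of the per-class recalls.
    per_class_accuracy = np.diag(confusion) / confusion.sum(axis=1)
    average_accuracy = per_class_accuracy.mean()
    # Cohen's kappa: agreement corrected for chance agreement.
    expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total**2
    kappa = (overall_accuracy - expected) / (1 - expected)
    return overall_accuracy, average_accuracy, kappa

# Toy three-class example with one large and one small class.
toy = [[90, 10, 0],
       [5, 40, 5],
       [0, 2, 8]]
oa, aa, kappa = classification_scores(toy)
print(oa, aa, kappa)
```

Because AA weights every class equally, weak results on small classes pull it below OA, which is the pattern seen on the imbalanced Muufl and Augsburg datasets.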

Figure 17 shows t-SNE visualizations of the evolution of the feature space over the three architecture phases. The raw input features (Figure 17a) show strong overlap between classes; the ConvNeXt backbone (Figure 17b) produces partial separation through stage-wise convolution, while the final QIE-Mamba output (Figure 17c) is significantly improved, with distinct, well-separated clusters. These progressive improvements visually confirm the classification performance: near-perfect cluster separation in Houston2013 (99.62% OA), robust but complex separations in Muufl (96.31% OA), and clean clusters from the noisy Augsburg data (96.30% OA).

The t-SNE visualizations provide compelling visual evidence that the QIE-Mamba architecture fundamentally changes the geometry of the feature space and produces linearly separable class representations from raw multimodal data with high overlap. Consistent patterns across a wide range of datasets, from high-quality urban areas to noisy imagery, demonstrate robust architectural generalization. The dramatic transition from heavy overlap in the input (a) through partial structure (b) to clear separation (c) visually validates the synergistic combination of ConvNeXt hierarchical feature extraction, quantum-inspired cross-modal fusion, and Mamba long-range modeling, and helps explain the state-of-the-art numerical performance across all test cases.

4.5. Ablation Study

4.5.1. Contribution Ablation

To better demonstrate the contribution of each component, we followed the limited labeled data scenario common in remote sensing applications and ran ablation experiments using 5% of the training samples from each class. Table 14 shows the model configuration details and performance metrics on the Houston2013 dataset.

These empirical ablation results confirm the theoretical framework established in Section 3.6. The contribution of superposition (+3.22 points, 82% of the total) directly confirms the prediction of Theorem 1 that quantum-inspired complex-valued representations contain more mutual information than classical concatenation ($I_{\mathrm{quantum}} > I_{\mathrm{classic}}$). The small contribution of the trust mechanism (+0.57 points, 18%) indicates that optimal modality weighting adds only a modest gain beyond the baseline capability, which is consistent with the measurement-type equivalence shown in Table 8, Table 9 and Table 10 (all measurements yield >93% of the possible 6.9966 bits). Importantly, the “Classic Mamba” baseline configuration (92.80% OA), which uses simple average concatenation without quantum processing, represents the information capacity of a real-valued concatenation ($I_{\mathrm{classic}} = 5.1493$ bits from Table 15). The drop to the “without superposition” configuration (91.68% OA) shows that naive complex concatenation without learned entanglement degrades performance, confirming that the unitary transformation $U$ (Equations (12) and (13)) is important for creating cross-modal relationships rather than simply representing features in $\mathbb{C}^{d}$ instead of $\mathbb{R}^{d}$.

4.5.2. Theoretical Validation Study

We perform a comprehensive theoretical validation against six criteria and present the results in Table 15. The mutual information advantage of 35.87% (6.9966 vs. 5.1493 bits) provides an upper bound for the performance improvement: for a perfect classifier with a perfect decision boundary, this advantage reduces the error rate by a factor of $2^{6.9966}/2^{5.1493} = 2^{1.8473} \approx 3.6$. Our experimental results on Houston2013 can be read against this theoretical factor: accuracy improved from 92.80% (classical baseline) to 99.62% (quantum-inspired), reducing the error from 7.20% to 0.38%, despite the limited training samples and model capacity constraints.
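The figures quoted in this paragraph can be recomputed directly. The snippet below is plain arithmetic on the reported values (6.9966 bits, 5.1493 bits, 92.80% and 99.62% OA); the variable names are illustrative, and reading the factor as 2 raised to the information difference is an interpretation of the formula above.

```python
# Mutual information values reported in Table 15 (bits).
i_quantum_bits = 6.9966
i_classic_bits = 5.1493

# Relative mutual information advantage, in percent.
relative_advantage_percent = 100 * (i_quantum_bits / i_classic_bits - 1)

# Ratio of distinguishable states: 2^I_quantum / 2^I_classic.
capacity_ratio = 2 ** (i_quantum_bits - i_classic_bits)

# Ratio of the two Houston2013 error rates quoted in the text.
error_before = 100 - 92.80
error_after = 100 - 99.62
observed_error_ratio = error_before / error_after

print(round(relative_advantage_percent, 2))
print(round(capacity_ratio, 2))
print(round(observed_error_ratio, 1))
```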

The complex-valued quantum state encodes information in both amplitude and phase components, with the complex entropy (2.1174 bits) substantially exceeding the real-only entropy (0.0084 bits), demonstrating that 99.6% of the quantum state’s representational capacity resides in the complex structure. The training stability analysis found no gradient pathologies.

Convergence analysis shows zero gradient pathologies (max norm = 0.00, mean = 0.00, std = 0.00), which is consistent with the stability result of Theorem 2: decoherence-controlled SSM iterations preserve $\|h^{(t)}\| \leq C \cdot \exp(-\gamma t) + C_{\mathrm{input}}$, keeping the hidden states bounded and avoiding the exploding gradients that plague RNNs. The efficiency analysis shows that quantum parameters encode 1.6988 bits per parameter, compared to 1.2726 bits for classical methods (42.61% improvement), which explains why a single quantum block is more efficient than multiple classical layers. The runtime analysis shows linear complexity, with an average execution time of 14.17 milliseconds, confirming the feasibility of practical deployment. All validation criteria are met.
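A bound of this form can be illustrated numerically on a toy recurrence. The update rule below, h ← exp(−γ)·U·h + x with a unitary U and inputs of fixed norm, is an assumption made for the sketch and is not the paper's SSM; for it, the bound holds with C = ‖h⁽⁰⁾‖ and C_input = ‖x‖ / (1 − exp(−γ)), which follows from the triangle inequality and a geometric series.

```python
import numpy as np

def random_unitary(d, rng):
    """Unitary matrix from the QR factorization of a random complex matrix."""
    m = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    q, _ = np.linalg.qr(m)
    return q

def run_damped_recurrence(gamma, steps, d=32, input_norm=1.0, seed=0):
    """Iterate h <- exp(-gamma) * U h + x with ||x|| = input_norm; return norms."""
    rng = np.random.default_rng(seed)
    u = random_unitary(d, rng)
    h = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    norms = [np.linalg.norm(h)]
    for _ in range(steps):
        x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
        x *= input_norm / np.linalg.norm(x)      # fix the input norm
        h = np.exp(-gamma) * (u @ h) + x         # damped unitary update
        norms.append(np.linalg.norm(h))
    return np.array(norms)

gamma, steps = 0.1, 200
norms = run_damped_recurrence(gamma, steps)
t = np.arange(steps + 1)
c = norms[0]                                 # C = ||h^(0)||
c_input = 1.0 / (1.0 - np.exp(-gamma))       # C_input for ||x|| = 1
bound = c * np.exp(-gamma * t) + c_input
print(norms.max(), bound.min())
```

With γ = 0 the geometric series no longer converges and the bound is lost, which is the sense in which the decoherence term keeps the hidden state finite in this toy setting.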

5. Discussion

The optimal decoherence rate depends on the dataset, which indicates that quantum decoherence acts as adaptive regularization rather than as a degradation of the quantum state. For high-quality urban imagery, minimal decoherence (γ = 0.005) preserves 92.31% of the quantum state information, and the overall accuracy (OA) reaches 98.55% by effectively exploiting cross-modal correlations such as synchronized LiDAR and hyperspectral imaging (HSI) features. In contrast, low-resolution noisy data achieve optimal performance with significant decoherence (γ = 0.1), even though it preserves only 20.19% of the quantum information, suggesting that decoherence acts as a form of implicit regularization, similar to dropout in classical networks.
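The two retention figures above are mutually consistent with an exponential decay exp(−γ·T) over a common horizon. The horizon T = 16 steps used below is an assumption inferred from the two reported percentages; it is not stated in this section.

```python
import math

ASSUMED_STEPS = 16  # inferred horizon, not given in the text

def retained_fraction(gamma, steps=ASSUMED_STEPS):
    """Fraction of state information left after `steps` of decay at rate gamma."""
    return math.exp(-gamma * steps)

for gamma in (0.005, 0.1):
    print(gamma, round(100 * retained_fraction(gamma), 2))
```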

The sequential optimization analysis highlights the approximate equivalence of the measurement types: all measurements yield nearly the same accuracy despite using different methods to extract the entangled information. This indicates that performance gains stem mainly from entanglement rather than from the choice of measurement, allowing practitioners to select the measurement type according to operational constraints.

The decomposition comparison shows that the optimal representation of a quantum state depends on the computational context. Our combined approach uses the amplitude–phase representation for the superposition stage and the real–imaginary representation for temporal evolution.

This paper demonstrates a 99.6% reduction in hyperparameter search complexity through sequential optimization and reveals a hierarchical relationship between upstream architectural settings and optimal quantum parameters. The observed patterns include an inverse relationship between network depth and data quality, a common convergence of the state size across datasets, and entanglement strengths that vary with the complementarity between modalities.

Beyond remote sensing, we highlight the generality of the quantum-inspired framework and propose applications involving phase correlation and nonlinear cross-modal correlation, such as audio-visual speech recognition and medical image fusion. However, several limitations are acknowledged: the need to manually tune the decoherence level, the reliance on spatial co-registration between modalities, the incomplete exploitation of phase information, and the underutilization of deep quantum architectures. Future research should investigate meta-learning to identify optimal decoherence parameters, correct spatial misregistration, and enhance deep quantum network models to manage complex multimodal fusion tasks effectively.

6. Conclusions

This paper presents a quantum-inspired state-space model for multimodal remote sensing that achieves enhanced performance with linear O(Lnd) complexity. The method uses complex-valued quantum superposition, which yields a 35.87% mutual information advantage over conventional methods. The accuracy reaches 99.62% on high-quality urban data, 96.31% on complex terrain, and 96.30% on noisy images. Quantum superposition plays a crucial role in improving performance.

The sequential optimization framework significantly reduces the hyperparameter search and provides valuable insights into how decoherence and entanglement interact with data quality and architectural depth. Theoretical validation confirms stability during training, and runtime analysis confirms the competitive efficiency of quantum blocks compared to classical architectures. Practical results for the selection of real and phase measurements are presented, showing that the measurement types have little effect on accuracy, although their computational requirements differ. Our combined representation strategy, leveraging amplitude–phase for physically interpretable fusion and real–imaginary for stable temporal evolution, suggests future directions in adaptive decomposition selection mechanisms or unified geometric algebra frameworks that could further optimize quantum-inspired architectures across diverse multimodal learning domains.

Current limitations relate to manual parameter tuning and the assumption of spatial co-registration. Suggested future research includes adaptive decoherence scheduling, multimodal entanglement studies, and integration with adaptive architectures. The optimization methodology has real-time applications in fields such as precision agriculture, urban planning, and disaster response, positioning the framework as a versatile tool for multimodal correlation.

Author Contributions

Conceptualization, D.M., A.W., H.L., G.M., L.Y., and H.W.; methodology, H.L., D.M., A.W., and H.W.; software, D.M. and H.L.; validation, D.M.; writing—original draft preparation, D.M.; writing—review and editing, A.W. and H.W.; visualization, D.M.; supervision, G.M., L.Y., A.W., and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Key Research and Development Plan Project of Heilongjiang (JD2023SJ19), the National Key Support Project for Foreign Experts of Northeast Special Project (D20250098), the Program for Young Talents of Basic Research in Universities of Heilongjiang Province (YQJH2024077), and the Postdoctoral Fellowship Program of China Postdoctoral Science Foundation (GZC20252304).

Data Availability Statement

Houston2013 and Augsburg: https://drive.google.com/file/d/1UaeUWqTHhXzpwGHZElcF8AHwmVo0IwaV/view (accessed on 21 May 2021); Muufl: https://github.com/GatorSense/MUUFLGulfport (accessed on 17 April 2017).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chen, B.; Huang, B.; Xu, B. Multi-source remotely sensed data fusion for improving land cover classification. ISPRS J. Photogramm. Remote Sens. 2017, 124, 27–39. [Google Scholar] [CrossRef]
Bhargava, A.; Sachdeva, A.; Sharma, K.; Alsharif, M.H.; Uthansakul, P.; Uthansakul, M. Hyperspectral imaging and its applications: A review. Heliyon 2024, 10, e33208. [Google Scholar] [CrossRef]
Khan, M.J.; Khan, H.S.; Yousaf, A.; Khurshid, K.; Abbas, A. Modern Trends in Hyperspectral Image Analysis: A Review. IEEE Access 2018, 6, 14118–14129. [Google Scholar] [CrossRef]
Liu, G.; Song, J.; Chu, Y.; Zhang, L.; Li, P.; Xia, J. Deep Fuzzy Fusion Network for Joint Hyperspectral and LiDAR Data Classification. Remote Sens. 2025, 17, 2923. [Google Scholar] [CrossRef]
Wang, A.; Lei, G.; Dai, S.; Wu, H.; Iwahori, Y. Multiscale Attention Feature Fusion Based on Improved Transformer for Hyperspectral Image and LiDAR Data Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4124–4140. [Google Scholar] [CrossRef]
Ni, K.; Li, Z.; Yuan, C.; Zheng, Z.; Wang, P. Selective Spectral–Spatial Aggregation Transformer for Hyperspectral and LiDAR Classification. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5. [Google Scholar] [CrossRef]
Rehman, M.Z.U.; Islam, S.M.S.; Blake, D.; Ulhaq, A.; Janjua, N. Deep learning for land use classification: A systematic review of HS-LiDAR imagery. Artif. Intell. Rev. 2025, 58, 272. [Google Scholar] [CrossRef]
Wang, X.; Feng, Y.; Song, R.; Mu, Z.; Song, C. Multi-attentive hierarchical dense fusion net for fusion classification of hyperspectral and LiDAR data. Inf. Fusion 2022, 82, 1–18. [Google Scholar] [CrossRef]
Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef]
Ye, M.; Ruiwen, N.; Chang, Z.; He, G.; Tianli, H.; Shijun, L.; Yu, S.; Tong, Z.; Ying, G. A Lightweight Model of VGG-16 for Remote Sensing Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6916–6922. [Google Scholar] [CrossRef]
Zhu, H.; Ma, M.; Ma, W.; Jiao, L.; Hong, S.; Shen, J.; Hou, B. A spatial-channel progressive fusion ResNet for remote sensing classification. Inf. Fusion 2021, 70, 72–87. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; Ré, C. Hippo: Recurrent memory with optimal polynomial projections. Adv. Neural Inf. Process. Syst. 2020, 33, 1474–1487. [Google Scholar]
Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar] [CrossRef]
Schuld, M.; Sinayskiy, I.; Petruccione, F. An introduction to quantum machine learning. Contemp. Phys. 2015, 56, 172–185. [Google Scholar] [CrossRef]
Meedinti, G.N.; Srirekha, K.S.; Delhibabu, R. A quantum convolutional neural network approach for object detection and classification. arXiv 2023, arXiv:2307.08204. [Google Scholar] [CrossRef]
Ghamisi, P.; Rasti, B.; Yokoya, N.; Wang, Q.; Hofle, B.; Bruzzone, L.; Bovolo, F.; Chi, M.; Anders, K.; Gloaguen, R.; et al. Multisource and Multitemporal Data Fusion in Remote Sensing: A Comprehensive Review of the State of the Art. IEEE Geosci. Remote Sens. Mag. 2019, 7, 6–39. [Google Scholar] [CrossRef]
Hu, Q.; Wang, F.; Fang, J.; Li, Y. Semantic Labeling of High-Resolution Images Combining a Self-Cascaded Multimodal Fully Convolution Neural Network with Fully Conditional Random Field. Remote Sens. 2024, 16, 3300. [Google Scholar] [CrossRef]
Liu, X.; Liu, Q.; Wang, Y. Remote sensing image fusion based on two-stream fusion network. Inf. Fusion 2020, 55, 1–15. [Google Scholar] [CrossRef]
Li, H.; Ghamisi, P.; Soergel, U.; Zhu, X.X. Hyperspectral and LiDAR fusion using deep three-stream convolutional neural networks. Remote Sens. 2018, 10, 1649.
Xu, Y.; Mao, Y.; Li, H.; Shen, J.; Xu, X.; Wang, S.; Zaman, S.; Ding, Z.; Wang, Y. A deep learning model based on RGB and hyperspectral images for efficiently detecting tea green leafhopper damage symptoms. Smart Agric. Technol. 2025, 10, 100817.
Sha, W.; Hu, K.; Weng, S. Statistic and Network Features of RGB and Hyperspectral Imaging for Determination of Black Root Mold Infection in Apples. Foods 2023, 12, 1608.
Habili, N.; Kwan, E.; Li, W.; Webers, C.; Oorloff, J.; Armin, A.; Petersson, L. A Hyperspectral and RGB Dataset for Building Facade Segmentation; Springer: Cham, Switzerland, 2022.
Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep Learning for Hyperspectral Image Classification: An Overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709.
Sun, E.; Cui, Y.; Liu, P.; Yan, J. A decade of deep learning for remote sensing spatiotemporal fusion: Advances, challenges, and opportunities. Inf. Fusion 2026, 126, 103675.
Yang, B.; Wang, X.; Xing, Y.; Cheng, C.; Jiang, W.; Feng, Q. Modality Fusion Vision Transformer for Hyperspectral and LiDAR Data Collaborative Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17052–17065.
Bai, M.; Zhou, Z.; Li, J.; Chen, Y.; Liu, J.; Zhao, X.; Yu, D. Deep graph gated recurrent unit network-based spatial–temporal multi-task learning for intelligent information fusion of multiple sites with application in short-term spatial–temporal probabilistic forecast of photovoltaic power. Expert Syst. Appl. 2024, 240, 122072.
Hussain, M.; O’Nils, M.; Lundgren, J.; Mousavirad, S.J. A Comprehensive Review On Deep Learning-Based Data Fusion. IEEE Access 2024, 12, 180093–180124.
Tang, Y.; Feng, Y.; Fung, S.; Xomchuk, V.R.; Jiang, M.; Moore, T.; Beckler, J. Spatiotemporal Deep-Learning-Based Algal Bloom Prediction for Lake Okeechobee Using Multisource Data Fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8318–8331.
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
Sivasubramanian, A.; Prashanth, V.R.; Hari, T.; Sowmya, V.; Gopalakrishnan, E.A.; Ravi, V. Transformer-based convolutional neural network approach for remote sensing natural scene classification. Remote Sens. Appl. Soc. Environ. 2024, 33, 101126.
Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976.
Roy, S.; Krishna, G.; Dubey, S.R.; Chaudhuri, B. HybridSN: Exploring 3D-2D CNN Feature Hierarchy for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 277–281.
Zhu, Y.; Yuan, K.; Zhong, W.; Xu, L. Spatial–Spectral ConvNeXt for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5453–5463.
Wu, H.; Dai, S.; Liu, C.; Wang, A.; Iwahori, Y. A Novel Dual-Encoder Model for Hyperspectral and LiDAR Joint Classification via Contrastive Learning. Remote Sens. 2023, 15, 924.
Ai, J.; Mao, Y.; Luo, Q.; Jia, L.; Xing, M. SAR Target Classification Using the Multikernel-Size Feature Fusion-Based Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13.
Xue, W.; Ai, J.; Zhu, Y.; Chen, J.; Zhuang, S. AIS-FCANet: Long-Term AIS Data Assisted Frequency-Spatial Contextual Awareness Network for Salient Ship Detection in SAR Imagery. IEEE Trans. Aerosp. Electron. Syst. 2025, 40, 15166–15171.
Zhang, Y.; Gao, H.; Chen, Z.; Fei, S.; Zhou, J.; Ghamisi, H.; Zhang, B. Adaptive multi-stage fusion of hyperspectral and LiDAR data via selective state space models. Inf. Fusion 2026, 125, 103488.
Hussain, K.M.; Zhao, K.; Zhou, Y.; Ali, A.; Li, Y. Cross Attention Based Dual-Modality Collaboration for Hyperspectral Image and LiDAR Data Classification. Remote Sens. 2025, 17, 2836.
Bloemheuvel, S.; van den Hoogen, J.; Atzmueller, M. Graph construction on complex spatiotemporal data for enhancing graph neural network-based approaches. Int. J. Data Sci. Anal. 2024, 18, 157–174.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
Hua, W.; Dai, Z.; Liu, H.; Le, Q. Transformer quality in linear time. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 9099–9117.
Huang, J.; Zhang, Y.; Yang, F.; Chai, L. Attention-Guided Fusion and Classification for Hyperspectral and LiDAR Data. Remote Sens. 2024, 16, 94.
Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2018, 2, 79.
Lloyd, S.; Mohseni, M.; Rebentrost, P. Quantum principal component analysis. Nat. Phys. 2014, 10, 631–633.
Biamonte, J.; Wittek, P.; Pancotti, N.; Rebentrost, P.; Wiebe, N.; Lloyd, S. Quantum machine learning. Nature 2017, 549, 195–202.
Nevalainen, O.; Honkavaara, E.; Tuominen, S.; Viljanen, N.; Hakala, T.; Yu, X.; Hyyppä, J.; Saari, H.; Pölönen, I.; Imai, N.N.; et al. Individual Tree Detection and Classification with UAV-Based Photogrammetric Point Clouds and Hyperspectral Imaging. Remote Sens. 2017, 9, 185.
Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv 2016, arXiv:1606.08415.
Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K.Q. Deep networks with stochastic depth. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 646–661.
Yang, J.X.; Zhou, J.; Wang, J.; Tian, H.; Liew, A.W.C. LiDAR-Guided Cross-Attention Fusion for Hyperspectral Band Selection and Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
Nielsen, M.A.; Chuang, I.L. Quantum Computation and Quantum Information; Cambridge University Press: Cambridge, UK, 2010.
Helfrich, K.; Willmott, D.; Ye, Q. Orthogonal recurrent neural networks with scaled Cayley transform. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1969–1978.
Mhammedi, Z.; Hellicar, A.; Rahman, A.; Bailey, J. Efficient orthogonal parametrisation of recurrent neural networks using householder reflections. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2401–2409.
Arjovsky, M.; Shah, A.; Bengio, Y. Unitary evolution recurrent neural networks. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1120–1128.
Zurek, W.H. Decoherence, einselection, and the quantum origins of the classical. Rev. Mod. Phys. 2003, 75, 715.
Breuer, H.-P.; Petruccione, F. The Theory of Open Quantum Systems; Oxford University Press: Oxford, UK, 2002.
Rivas, Á.; Huelga, S.F.; Plenio, M.B. Quantum non-Markovianity: Characterization, quantification and detection. Rep. Prog. Phys. 2014, 77, 094001.
Paz, J.P.; Zurek, W.H. Environment-induced decoherence and the transition from quantum to classical. In Fundamentals of Quantum Information: Quantum Computation, Communication, Decoherence and All That; Springer: Berlin/Heidelberg, Germany, 2002; pp. 77–148.
Garraway, B. Nonperturbative decay of an atomic system in a cavity. Phys. Rev. A 1997, 55, 2290.
Berberan-Santos, M.; Bodunov, E.; Valeur, B. Mathematical functions for the analysis of luminescence decays with underlying distributions 1. Kohlrausch decay function (stretched exponential). Chem. Phys. 2005, 315, 171–182.
Rasti, B.; Hong, D.; Hang, R.; Ghamisi, P.; Kang, X.; Chanussot, J.; Benediktsson, J.A. Feature extraction for hyperspectral imagery: The evolution from shallow to deep: Overview and toolbox. IEEE Geosci. Remote Sens. Mag. 2020, 8, 60–88.
Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More Diverse Means Better: Multimodal Deep Learning Meets Remote-Sensing Imagery Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 4340–4354.
Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach; Springer: Berlin/Heidelberg, Germany, 2002.
Trabelsi, C.; Bilaniuk, O.; Zhang, Y.; Serdyuk, D.; Subramanian, S.; Santos, J.F.; Mehri, S.; Rostamzadeh, N.; Bengio, Y.; Pal, C.J. Deep complex networks. arXiv 2017, arXiv:1705.09792.
Aharonov, Y.; Albert, D.Z.; Vaidman, L. How the result of a measurement of a component of the spin of a spin-1/2 particle can turn out to be 100. Phys. Rev. Lett. 1988, 60, 1351.
Ramachandram, D.; Taylor, G.W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Process. Mag. 2017, 34, 96–108.
Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep Convolutional Neural Networks for Hyperspectral Image Classification. J. Sens. 2015, 2015, 258619.
Roy, S.K.; Deria, A.; Hong, D.; Rasti, B.; Plaza, A.; Chanussot, J. Multimodal fusion transformer for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–20.
Lu, T.; Ding, K.; Fu, W.; Li, S.; Guo, A. Coupled adversarial learning for fusion classification of hyperspectral and LiDAR data. Inf. Fusion 2023, 93, 118–131.

Figure 1. Graphical abstract of the QIE-Mamba architecture for multimodal HSI-LiDAR fusion.

Figure 2. Comprehensive architecture featuring quantum processing elements (* continuous line).

Figure 3. Construction of the unitary gate: the mathematical structure of the learnable unitary transformation gate used for quantum-inspired entanglement.

Figure 4. Internal structure of the quantum Mamba block, which implements coherent state-space models.

Figure 5. Quantum measurement selection framework (used during development for systematic comparison).

Figure 6. Workflow of the testing and calibration process for the QIE-Mamba method.

Figure 7. Quantum decoherence modeled with two approaches, Markovian and non-Markovian. (a) Comparison of the decoherence models by preservation over time; (b) optimal decoherence rate vs. sequence length.

Figure 8. Relationship between decoherence rate and evaluation measures for the decoherence models across three datasets. (a) Overall accuracy vs. decoherence rate; (b) average accuracy vs. decoherence rate; (c) Kappa coefficient vs. decoherence rate.

Figure 9. The influence of fine-grained complexity on model fitting. (a) Comparison of complexity factors and overall accuracy; (b) comparison between aggregate accuracy and model dimensions.

Figure 10. Numerical stability analysis of amplitude distributions for the three datasets. (a) Individual HSI (red) and LiDAR (blue) amplitudes, and combined amplitudes (green) with epsilon reference; (b) HSI-LiDAR correlation scatter plots colored by combined amplitude and box plot comparisons; (c) stability margin analysis showing the percentage of samples below various epsilon thresholds. The datasets are Houston2013, Muufl, and Augsburg, corresponding to 1, 2, and 3, respectively.

Figure 11. Computational efficiency of Cayley vs. exponential unitary gate implementations. (a) Computation time comparison; (b) speedup factor of Cayley and exponential map; (c) unitarity error across datasets; (d) speedup heatmap (green = Cayley faster).

Figure 12. Quantitative domain-difference analysis between real and pseudomodalities across the three datasets. (a) Cosine similarity versus training epochs; (b) MMD versus epochs.

Figure 13. Hyperparameter sensitivity analysis on the three remote sensing datasets. (a) Learning rate sensitivity analysis; (b) batch size impact analysis; (c) learning rate heatmap; (d) batch size heatmap.

Figure 14. Classification maps of Houston2013. (a) Ground-truth map; (b) RGB origin; (c) 1DCNN; (d) HybridSN; (e) MFT; (f) CALC; (g) proposed.

Figure 15. Classification maps of Muufl. (a) Ground-truth map; (b) RGB origin; (c) 1DCNN; (d) HybridSN; (e) MFT; (f) CALC; (g) proposed.

Figure 16. Classification maps of Augsburg. (a) Ground-truth map; (b) RGB origin; (c) 1DCNN; (d) HybridSN; (e) MFT; (f) CALC; (g) proposed.

Figure 17. t-SNE visualization of three datasets. (a) Input characteristics; (b) output characteristics of ConvNeXt; (c) output characteristics of QIE-Mamba.

Table 1. Number of training and testing samples per class for the Houston2013, Muufl, and Augsburg datasets.

Houston2013		Muufl		Augsburg
Classes	Train/Test	Classes	Train/Test	Classes	Train/Test
Grass healthy	250/1001	Trees	4649/18,597	Forest	2701/10,806
Grass stressed	250/1004	Mostly grass	854/3416	Residential area	6065/24,264
Grass synthetic	139/558	Mixed ground	1376/5506	Industrial area	770/3081
Trees	248/996	Dirt and sand	365/1461	Low plants	5371/21,486
Soil	248/994	Road	1337/5350	Allotment	115/460
Water	65/260	Water	93/373	Commercial area	329/1316
Residential	253/1015	Building shadow	446/1787	Water	306/1224
Commercial	248/996	Building	1248/4992
Road	250/1002	Sidewalk	277/1108
Highway	245/982	Yellow curb	36/147
Railway	247/988	Cloth panels	53/216
Parking lot1	246/987
Parking lot2	93/376
Tennis court	85/343
Running track	132/528

Table 2. Comparative study of decoherence rate and performance of decoherence methods on three datasets. Bold highlights indicate the best values for each decoherence model on the dataset.

Decoherence Rate	Decoherence Model	Houston2013			Muufl			Augsburg			Preservation (%)
Decoherence Rate	Decoherence Model	OA (%)	AA (%)	Kappa × 100	OA (%)	AA (%)	Kappa × 100	OA (%)	AA (%)	Kappa × 100	Preservation (%)
0.005	Markovian	98.98	98.73	98.90	95.67	87.73	94.27	96.85	83.49	95.48	92.31
0.005	Non-Markovian	98.76	98.47	98.66	96.24	89.47	95.02	96.40	82.11	84.82	92.59
0.01	Markovian	98.75	98.44	98.65	95.91	88.69	94.57	96.89	83.39	95.53	85.21
0.01	Non-Markovian	98.92	98.68	98.83	96.21	89.03	94.98	96.99	83.59	95.68	86.21
0.015	Markovian	98.85	98.55	98.76	96.17	89.05	94.94	96.46	82.77	94.93	78.66
0.015	Non-Markovian	98.79	98.54	98.69	96.34	89.66	95.16	97.03	83.70	95.73	80.65
0.02	Markovian	98.82	98.63	98.72	96.12	89.15	94.87	97.02	84.89	95.73	72.61
0.02	Non-Markovian	98.87	98.58	98.78	96.25	89.50	95.04	97.03	83.78	95.73	75.76
0.03	Markovian	98.84	98.68	98.75	96.12	89.16	94.86	97.04	84.60	95.73	61.88
0.03	Non-Markovian	98.89	98.63	98.80	95.66	87.40	94.25	96.94	83.38	95.60	67.57
0.05	Markovian	98.83	98.62	98.73	96.21	89.46	94.99	97.05	83.37	95.76	44.93
0.05	Non-Markovian	98.96	98.68	98.88	95.98	88.69	94.67	97.12	84.15	95.86	55.56
0.1	Markovian	98.96	98.67	98.88	96.12	89.28	94.87	97.18	84.48	95.96	20.19
0.1	Non-Markovian	98.89	98.66	98.80	96.04	88.84	94.75	97.09	84.29	95.82	38.46
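
The Preservation (%) column in Table 2 depends only on the decoherence rate and the decoherence model, not on the dataset. As a minimal sketch, the column can be reproduced to two decimals under three assumptions that are inferred from the tabulated numbers rather than stated in the table: an exponential decay $e^{-γt}$ for the Markovian model, an algebraic decay $1/(1+γt)$ for the non-Markovian model, and a horizon of $t = 16$ steps. The function names are hypothetical.

```python
import math

# Assumption: horizon of 16 steps, inferred because it reproduces Table 2.
ASSUMED_STEPS = 16


def markovian_preservation(rate, steps=ASSUMED_STEPS):
    """Assumed Markovian form: exponential decay exp(-rate * steps)."""
    return math.exp(-rate * steps)


def non_markovian_preservation(rate, steps=ASSUMED_STEPS):
    """Assumed non-Markovian form: algebraic decay 1 / (1 + rate * steps)."""
    return 1.0 / (1.0 + rate * steps)


if __name__ == "__main__":
    for rate in (0.005, 0.01, 0.015, 0.02, 0.03, 0.05, 0.1):
        m = 100.0 * markovian_preservation(rate)
        nm = 100.0 * non_markovian_preservation(rate)
        print(f"rate={rate:<6} Markovian={m:6.2f}%  Non-Markovian={nm:6.2f}%")
```

Under these assumed forms the algebraic decay falls off more slowly at large rates, which matches the gap between the two models in the last rows of the table (20.19% vs. 38.46% at a rate of 0.1).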

Table 3. Comparison values of complex factor $ρ$ ablation on the three datasets. Bold highlights indicate the best values for each in the dataset.

$ρ$	Hidden Dimension	Parameters (M)	Houston2013			Muufl			Augsburg
$ρ$	Hidden Dimension	Parameters (M)	OA (%)	AA (%)	Kappa	OA (%)	AA (%)	Kappa	OA (%)	AA (%)	Kappa
0.5	192	21.2	99.07	98.88	98.99	96.12	89.47	94.87	96.81	83.64	95.43
1.0	384	21.5	97.80	97.78	97.63	96.28	89.73	95.08	96.86	83.38	95.48
1.5	576	21.8	99.00	98.79	98.92	96.13	88.88	94.88	96.62	81.59	95.14
2.0	768	22.1	98.82	98.51	98.72	95.76	88.39	94.37	96.70	82.97	95.27

Table 4. Comparison values of numerical stability $ε$ ablation on the three datasets.

Metric	Houston2013			Muufl			Augsburg
Metric	HSI Amplitude	LiDAR Amplitude	Combined Sum	HSI Amplitude	LiDAR Amplitude	Combined Sum	HSI Amplitude	LiDAR Amplitude	Combined Sum
Mean	0.5023	0.5020	0.7116	0.5019	0.4943	0.7058	0.4990	0.4983	0.7068
Std Dev	0.0504	0.0460	0.0505	0.0482	0.0459	0.0495	0.0495	0.0539	0.0555
Min	0.3180	0.3252	0.5292	0.3088	0.3597	0.5261	0.3133	0.3235	0.5111
Max	0.6882	0.6494	0.8861	0.6918	0.6328	0.8979	0.7146	0.6745	0.9139

Table 5. Comparative analysis of real–imaginary versus amplitude–phase. Bold highlights indicate the best values for each in the dataset.

Metric	Houston2013			Muufl			Augsburg
Metric	Real–Imaginary	Amplitude–Phase	Δ	Real–Imaginary	Amplitude–Phase	Δ	Real–Imaginary	Amplitude–Phase	Δ
Accuracy (%)	98.71 ± 0.14	98.65 ± 0.13	+0.06	96.04 ± 0.03	96.00 ± 0.05	+0.04	97.27 ± 0.03	96.94 ± 0.03	+0.32
Final loss	0.00 ± 0.00	0.00 ± 0.00	0.00	0.00 ± 0.00	0.00 ± 0.00	0.00	0.00 ± 0.00	0.00 ± 0.00	0.00
Gradient norm	1.88 ± 1.86	1.76 ± 1.75	+0.12	2.36 ± 3.17	3.48 ± 5.74	−1.12	0.76 ± 1.00	0.81 ± 1.09	−0.05
Training time (s)	3536 ± 126	4131 ± 422	−14.4%	6264.1 ± 9.7	6352.7 ± 4.6	−1.4%	9436.5 ± 14.1	9563.4 ± 2.6	−1.3%
Stability (σ)	0.14%	0.13%	+0.01%	0.03%	0.05%	+0.02%	0.03%	0.03%	0.00
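
The Δ entries for training time in Table 5 are consistent with a relative difference taken against the amplitude–phase run, whereas the accuracy and gradient-norm rows report plain differences (real–imaginary minus amplitude–phase). A small sketch of the relative-difference arithmetic follows; the helper name is hypothetical and the choice of baseline is an assumption inferred from the tabulated values.

```python
def relative_change_percent(value, baseline):
    """Signed change of `value` relative to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Mean training times in seconds from Table 5:
# (real-imaginary, amplitude-phase) per dataset.
TRAINING_TIME = {
    "Houston2013": (3536.0, 4131.0),
    "Muufl": (6264.1, 6352.7),
    "Augsburg": (9436.5, 9563.4),
}

if __name__ == "__main__":
    for dataset, (real_imag, amp_phase) in TRAINING_TIME.items():
        # Assumption: amplitude-phase is the baseline of the comparison.
        delta = relative_change_percent(real_imag, amp_phase)
        print(f"{dataset}: {delta:+.1f}%")
```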

Table 6. Quantitative difference between real and pseudomodalities on the three datasets.

Metric	Houston2013			Muufl			Augsburg
Metric	HSI	LiDAR	OA (%)	HSI	LiDAR	OA (%)	HSI	LiDAR	OA (%)
Cosine	0.6930	0.5380	98.35	0.8144	0.2574	96.14	0.6034	0.7329	96.72
MMD	0.003	0.001	98.35	0.002	0.388	96.14	0.002	0.006	96.72

Table 7. Initial values and candidate values of the parameters examined in the three successive tuning phases.

Training Stage	Parameters	Initial Value	Variable Values
Phase 1	Depth	[3, 3, 9, 3]	[2, 2, 6, 2], [3, 3, 9, 3], [3, 3, 27, 3], [4, 4, 12, 4]
Phase 2A	State size	16	8, 16, 32, 64
Phase 2B	Expand factor	2.0	1.5, 2.0, 2.5, 3.0
Phase 3A	Quantum block	2	1, 2, 3, 4, 5
Phase 3B	Entanglement strength	0.3	0.1, 0.3, 0.5, 0.7, 1.0
Phase 3C	Measurement type	Adaptive	Adaptive, Magnitude, Real, Phase

Table 8. Validation of optimal key value selection in three-phase sequence on the Houston2013 dataset. The OA (%) value associated with the ideal value for each dataset is shown in bold.

Phase 1:	ConvNeXt Backbone Architecture Validation
Architecture	Depths	Blocks	Parameters (M)	Training (s)	Houston2013
Tiny	[2, 2, 6, 2]	12	15.24	288.5	98.40
Small	[3, 3, 9, 3]	18	21.23	323.0	98.35
Custom	[4, 4, 12, 4]	24	27.21	352.8	98.23
Base	[3, 3, 27, 3]	36	40.63	418.6	98.20
Phase 2:	Mamba parameter validation
State size	Parameters (M)	Houston2013	Expand factor	Parameters (M)	Houston2013
8	15.20	98.14	1.5	14.94	98.15
16	15.24	98.41	2.0	15.46	98.25
32	15.31	98.29	2.5	15.97	98.10
64	15.46	98.55	3.0	16.49	98.26
Phase 3:	A. Quantum block ablation
Num block	Parameters (M)	Training (s)		Houston2013
1	14.94	577.9		98.44 ± 0.20
2	16.49	667.0		98.19 ± 0.22
3	18.05	756.5		98.32 ± 0.13
4	19.60	851.3		98.06 ± 0.27
5	21.15	936.2		98.13 ± 0.30
B. Entanglement strength			C. Quantum measurement type
Strength	Parameters (M)	Houston2013	Type	Houston2013
0.1	14.94	98.15 ± 0.11	Adaptive	98.17 ± 0.07
0.3	14.94	98.39 ± 0.16	Magnitude	98.22 ± 0.21
0.5	14.94	98.39 ± 0.08	Real	98.53 ± 0.13
0.7	14.94	98.33 ± 0.12	Phase	98.44 ± 0.05
1.0	14.94	98.55 ± 0.23

Table 9. Validation of optimal key value selection in three-phase sequence on the Muufl dataset. The OA (%) value associated with the ideal value for each dataset is shown in bold.

Phase 1:	ConvNeXt Backbone Architecture Validation
Architecture	Depths	Blocks	Parameters (M)	Training (s)	Muufl
Tiny	[2, 2, 6, 2]	12	15.51	948.6	95.00
Small	[3, 3, 9, 3]	18	21.50	1157.3	94.28
Custom	[4, 4, 12, 4]	24	27.49	968.2	94.88
Base	[3, 3, 27, 3]	36	40.91	1408.8	94.60
Phase 2:	Mamba parameter validation
State size	Parameters (M)	Muufl	Expand factor	Parameters (M)	Muufl
8	15.47	94.11	1.5	15.21	94.65
16	15.51	94.51	2.0	15.73	94.14
32	15.58	94.75	2.5	16.25	94.22
64	15.73	94.85	3.0	16.77	95.21
Phase 3:	A. Quantum block ablation
Num block	Parameters (M)	Training (s)		Muufl
1	15.21	1983.3		95.02 ± 0.48
2	16.77	2274.3		94.28 ± 0.35
3	18.32	2773.0		94.72 ± 0.03
4	19.87	3341.0		94.23 ± 0.19
5	21.43	3477.3		94.54 ± 0.13
B. Entanglement strength			C. Quantum measurement type
Strength	Parameters (M)	Muufl	Type	Muufl
0.1	15.21	94.50 ± 0.33	Adaptive	94.85 ± 0.17
0.3	15.21	94.71 ± 0.17	Magnitude	94.61 ± 0.15
0.5	15.21	94.68 ± 0.23	Real	94.87 ± 0.24
0.7	15.21	94.32 ± 0.69	Phase	94.75 ± 0.22
1.0	15.21	94.38 ± 0.28

Table 10. Validation of optimal key value selection in three-phase sequence on the Augsburg dataset. The OA (%) value associated with the ideal value for each dataset is shown in bold.

Phase 1:	ConvNeXt backbone architecture validation
Architecture	Depths	Blocks	Parameters (M)	Training (s)	Augsburg
Tiny	[2, 2, 6, 2]	12	15.54	1553.1	95.03
Small	[3, 3, 9, 3]	18	21.53	1733.4	95.50
Custom	[4, 4, 12, 4]	24	27.52	1901.1	95.11
Base	[3, 3, 27, 3]	36	40.94	2231.4	95.08
Phase 2:	Mamba parameter validation
State size	Parameters (M)	Augsburg	Expand factor	Parameters (M)	Augsburg
8	21.49	95.50	1.5	21.23	95.27
16	21.53	95.34	2.0	21.75	95.09
32	21.60	95.17	2.5	22.27	95.61
64	21.75	95.87	3.0	22.78	95.37
Phase 3:	A. Quantum block ablation
Num block	Parameters (M)	Training (s)		Augsburg
1	20.97	3410.5		95.62 ± 0.18
2	22.27	3849.4		95.32 ± 0.33
3	23.56	4597.0		95.31 ± 0.20
4	24.86	6484.0		95.53 ± 0.25
5	26.15	6255.8		95.32 ± 0.23
B. Entanglement strength			C. Quantum measurement type
Strength	Parameters (M)	Augsburg	Type	Augsburg
0.1	20.97	95.55 ± 0.02	Adaptive	95.50 ± 0.22
0.3	20.97	95.37 ± 0.28	Magnitude	95.56 ± 0.13
0.5	20.97	95.72 ± 0.12	Real	95.57 ± 0.07
0.7	20.97	95.53 ± 0.32	Phase	95.55 ± 0.05
1.0	20.97	95.73 ± 0.27

Table 11. Classification results of all methods on the Houston2013 dataset. The optimal outcome is indicated in bold.

Class	1D-CNN [71]	HybridSN [36]	MFT [72]	CALC [73]	Proposed
Grass healthy	92.09	99.35	80.25	93.74	99.30
Grass stressed	79.93	98.52	96.33	98.86	100.00
Grass synthetic	97.19	100.00	95.25	99.70	100.00
Trees	88.26	97.29	96.12	93.63	99.90
Soil	88.93	100.00	99.90	99.67	99.90
Water	97.55	100.00	93.71	99.67	100.00
Residential	47.30	93.74	81.06	96.63	98.52
Commercial	78.33	99.71	87.17	86.76	99.90
Road	51.74	95.68	92.06	90.75	99.50
Highway	25.87	91.10	59.17	94.20	99.80
Railway	58.86	99.90	99.91	86.91	99.60
Parking lot1	44.70	89.61	92.99	91.34	99.09
Parking lot2	0.00	96.48	85.26	87.53	99.73
Tennis court	80.43	95.31	100.00	100.00	100.00
Running track	92.64	96.90	81.82	99.53	100.00
OA (%)	69.40	96.58	88.78	93.97	99.62
AA (%)	69.63	96.37	89.40	94.59	99.68
Kappa × 100	66.86	96.30	87.81	93.48	99.59
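
Table 11 and the two tables that follow summarize each method by overall accuracy (OA), average accuracy (AA), and the Kappa coefficient. As a minimal sketch of how these three measures are conventionally computed from a confusion matrix, the example below uses the standard definitions (OA as the trace over the total, AA as the mean per-class recall, and Cohen's kappa from observed and chance agreement). The function name and the toy matrix are illustrative assumptions and are not taken from the experiments.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    # Overall accuracy: fraction of all samples classified correctly.
    oa = correct / total

    # Average accuracy: unweighted mean of per-class recall
    # (classes with no true samples are skipped).
    row_totals = [sum(row) for row in confusion]
    recalls = [confusion[i][i] / row_totals[i] for i in range(n) if row_totals[i] > 0]
    aa = sum(recalls) / len(recalls)

    # Cohen's kappa: agreement corrected for chance agreement p_e.
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (oa - p_e) / (1.0 - p_e)
    return oa, aa, kappa


if __name__ == "__main__":
    # Toy two-class example (hypothetical counts).
    toy = [[50, 0],
           [10, 40]]
    oa, aa, kappa = classification_metrics(toy)
    print(f"OA={100 * oa:.2f}%  AA={100 * aa:.2f}%  Kappa x 100={100 * kappa:.2f}")
```

Because AA weights every class equally, a class with few samples and low recall (for example, Yellow curb in Muufl) lowers AA far more than OA, which is why the two measures can diverge sharply on imbalanced datasets.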

Table 12. Classification results of all methods on the Muufl dataset. The optimal outcome is indicated in bold.

Class	1D-CNN	HybridSN	MFT	CALC	Proposed
Trees	95.35	96.73	97.64	86.58	98.38
Mostly grass	66.62	82.34	89.81	73.01	92.30
Mixed ground	82.25	95.22	85.40	54.12	93.55
Dirt and sand	66.44	88.58	85.82	86.43	96.65
Road	86.07	93.06	94.63	81.22	97.35
Water	14.29	97.98	71.78	100.00	94.91
Building shadow	48.33	84.89	88.35	83.96	94.63
Building	88.33	97.41	97.26	94.86	98.36
Sidewalk	58.76	66.22	50.30	62.93	82.40
Yellow curb	71.19	37.68	0.00	50.92	53.74
Cloth panels	77.11	76.74	61.71	95.98	92.59
OA (%)	83.55	93.14	92.28	80.96	96.31
AA (%)	67.57	83.02	74.79	79.09	90.44
Kappa × 100	78.52	90.93	89.77	75.53	95.12

Table 13. Classification results of all methods on Augsburg dataset. The optimal outcome is indicated in bold.

Class	1D-CNN	HybridSN	MFT	CALC	Proposed
Forest	79.68	96.09	97.07	97.01	98.64
Residential area	74.82	97.03	96.06	91.55	98.13
Industrial area	62.80	71.05	69.48	56.85	89.91
Low plants	81.95	97.75	96.66	80.21	98.12
Allotment	0.00	0.00	0.00	77.83	64.13
Commercial area	0.00	0.00	0.00	68.55	65.96
Water	0.00	0.00	62.09	70.19	68.14
OA (%)	78.18	93.17	91.74	85.91	96.30
AA (%)	39.47	54.93	60.19	80.32	89.23
Kappa × 100	67.65	90.13	88.01	77.45	94.69

Table 14. Model configuration details and performance metrics on Houston2013 dataset.

Model	Fusion Method	Processing Blocks	Classical Mamba	Superposition	Confidence	OA (%)	AA (%)	Kappa
Classical Mamba	Simple average	Transformer	✓	✗	✗	92.80	92.13	92.22
w/o Superposition	Simple concatenation	Quantum Mamba	✗	✗	✗	91.68	91.59	91.01
w/o Confidence	Quantum superposition	Quantum Mamba	✗	✓	✗	95.45	95.10	95.08
Full QIE-Mamba	Quantum superposition	Quantum Mamba	✗	✓	✓	96.02	95.90	95.69

Table 15. Theoretical validation of each theorem on the Augsburg dataset.

Validation Information	Details	Value
Information capacity (MI)	Classical MI	5.1493
	Quantum MI	6.9966
	Advantage	1.8472
	Relative improvement	35.8738
	Validation status	Pass
Phase information	$I_{phase} = S(ρ_{complex}) - S(ρ_{real})$	2.1089
	$S(ρ_{complex})$	2.1174
	$S(ρ_{real})$	0.0084
	Quantum advantage	True
	Validation status	Pass
Convergence	Theoretical bound	100.00
	Max observed norm	0.00
	Mean Norm	0.00
	Std Norm	0.00
	Validation status	Pass
Complexity parameters	Actual parameters	20897354
	Theoretical parameters	1360832
	Complexity estimate	8709120000
	Validation status	Pass
Quantum advantage	Classical ratio	1.2726
	Quantum ratio	1.6988
	Advantage	0.4261
	Validation status	Pass
Run time	Sequence lengths	{8, 16, 24, 32}
	Runtimes, milliseconds	{15.59, 13.76, 14.32, 13.01}
	Linear MSE	0.0000
	Linear r2	0.7260
	Quadratic MSE	0.0000
	Quadratic r2	0.7442
	Validation status	Pass

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Myagmarsuren, D.; Wang, A.; Lv, H.; Wu, H.; Molnar, G.; Yu, L. Joint Hyperspectral Images and LiDAR Data Classification Combined with Quantum-Inspired Entangled Mamba. Remote Sens. 2025, 17, 4065. https://doi.org/10.3390/rs17244065

4.2.2. Ablation Analysis of Complex Factor $ρ$