1. Introduction
Through the acquisition of ground object images using tens to hundreds of contiguous narrow spectral bands, hyperspectral remote sensing technology provides nearly continuous spectral information for each pixel. This characteristic endows hyperspectral imagery with significant potential for fine-grained land-cover classification [1]. Early studies primarily focused on developing efficient classification algorithms, ranging from classical machine learning methods such as support vector machines (SVM) [2] to a variety of advanced statistical learning models [3], which collectively promoted the continuous advancement of hyperspectral classification techniques. In recent years, with the rapid development of deep learning, researchers have begun to construct more powerful network architectures, such as deep perception networks capable of jointly capturing spatial context and spectral detail features, in order to learn more discriminative feature representations [4]. More recent research efforts have shifted toward state space models (SSMs) and their variants, notably the Mamba architecture [5]. Owing to their linear computational complexity and strong sequence modeling capability, these models offer a promising new paradigm for handling high-dimensional and highly redundant hyperspectral data. Consequently, the field of hyperspectral image classification has evolved from the development of efficient classifiers to the design of advanced deep neural networks and, more recently, toward the exploration of novel and efficient sequence modeling architectures. Throughout this evolution, it has played a vital role in applications such as agriculture, environmental monitoring, and geological exploration.
In the early stages of deep learning-based hyperspectral image classification, Convolutional Neural Networks (CNNs) emerged as the mainstream paradigm owing to their formidable local feature extraction capabilities [6,7,8,9,10,11]. To address high-dimensional redundancy and spatial–spectral fusion, various CNN-based architectures have been developed [12,13,14,15]. These methods typically integrate dimensionality reduction techniques (e.g., Gabor filters or PCA) with 2D/3D convolutional kernels to capture joint features. However, they often suffer from high computational complexity, sensitivity to preprocessing parameters, and limited generalization due to their reliance on manual designs and local receptive fields. To further capture complex spatial–spectral dependencies, researchers have progressively integrated multiscale feature representations and hybrid attention mechanisms into deep learning frameworks, significantly improving the discriminative power of the models [16,17]. Collectively, while various CNN architectures have demonstrated superior spatial–spectral feature extraction capabilities, they generally suffer from limitations such as reliance on local receptive fields, propensity for overfitting on limited training samples, and difficulty in modeling long-range dependencies. Consequently, to transcend the intrinsic limitations of convolution operators and capture profound long-range contextual dependencies, research priorities have progressively shifted toward the Transformer architecture, renowned for its exceptional sequence modeling capabilities.
Following the successful application of Transformers in natural language processing and computer vision, models based on self-attention mechanisms have been increasingly integrated into hyperspectral image classification tasks [18,19,20,21,22,23,24,25]. Unlike CNNs that rely on local convolutional kernels, Transformers leverage self-attention mechanisms to effectively model long-range dependencies between pixels and spectral bands across input sequences, thereby overcoming the inherent limitations of convolutional architectures in capturing long-range contextual information and cross-band correlations. Specifically, Peng et al. [26] utilized cross-attention for spatial–spectral fusion, while Zhao et al. [27] designed a lightweight ViT using group-separable convolutions to reduce overhead. To handle limited labeled samples, Jia et al. [28] introduced a local–global fusion framework with center-mask pre-training. Although these Transformer-based methods exhibit notable advantages in long-range dependency modeling and spatial–spectral interaction, the quadratic computational complexity $O(N^2)$ of self-attention and the lack of local inductive biases make it challenging to efficiently capture spatial–spectral details under limited sample conditions.
To bridge these gaps, the Mamba architecture based on the State Space Model (SSM) has emerged as a promising solution. By virtue of its selective scan mechanism, Mamba achieves linear computational complexity $O(N)$ while maintaining a long-range receptive field, thereby providing a new paradigm for the efficient modeling of hyperspectral images [29,30,31,32,33,34,35,36]. However, directly applying vanilla Mamba to hyperspectral 3D data still encounters significant challenges: existing 3D scanning strategies often employ predetermined and rigid paths (e.g., simple row-by-row or band-by-band scanning), which overlook the spatial–spectral anisotropy of land cover distribution and fail to adaptively adjust contextual routing based on texture or spectral saliency. Furthermore, the redundancy inherent in high-dimensional spectral data remains insufficiently decoupled, leading to excessive computational loads or the creation of “information islands” when capturing fine-grained features. Additionally, continuous long-range features and discrete texture features exist in different representation forms, where simple linear superposition frequently induces representation discrepancies.
Previous studies have demonstrated that double-branch architectures (e.g., typical spatial–spectral dual streams) are highly effective in decoupling feature representations [37]. However, conventional spatial–spectral paradigms often treat hyperspectral data merely as two independent computer vision features to be mechanically concatenated, fundamentally neglecting the intrinsic physical duality of Earth observation data. In physical remote sensing, ground materials concurrently exhibit continuous physical evolution (e.g., smooth spectral absorption trajectories and continuous geometric textures) and high-frequency discrete mutations (e.g., sharp material boundaries and narrow-band distinctiveness). To genuinely reflect this physical reality and overcome the limitations of single-stream SSMs, we propose a paradigm shift: a continuous–discrete collaborative framework based on the state space model, termed Confluence Mamba (CF-Mamba). This framework innovatively incorporates an Adaptive Holographic Spectral Encoder (AHSE) to dynamically route continuous sequence contexts, and an Interactive Interval Spectral Encoder (IISE) to reduce spectral redundancy while maintaining information flow via channel shuffling. Finally, a Confluence Gating Unit (CGU) is employed to implement consistency constraints and detail enhancement for cross-representation features. Consequently, the proposed method significantly enhances hyperspectral image classification performance while ensuring computational efficiency.
The primary contributions of this paper are summarized in the following three aspects:
Proposing a continuous–discrete collaborative CF-Mamba architecture with customized embedding strategies. To tackle the challenges of high-dimensional redundancy and representation discrepancy in HSIs, a parallel CF-Mamba framework is constructed. We specifically design Spectral–Spatial Convolutional Representation (SSCR) and Depthwise Separable Embedding (DWE) for the continuous modeling path and discrete interaction path, respectively. This design achieves effective feature decoupling and redundancy compression at the input stage, laying a solid foundation for subsequent efficient modeling.
Designing AHSE and IISE modules for continuous evolution and discrete decoupling modeling. To overcome the limitations of traditional scanning mechanisms, an Adaptive Holographic Spectral Encoder (AHSE) is proposed, which introduces a content-aware adaptive routing mechanism to dynamically weight multi-view scanning features and preserve the continuous evolution of spectral–spatial information. Simultaneously, an Interactive Interval Spectral Encoder (IISE) is developed, employing discretized interval sampling and channel shuffling strategies to break “information islands” in discrete feature extraction while maintaining linear computational complexity.
Introducing a Confluence Gating Unit (CGU) to resolve cross-representation discrepancies. To address the representational differences between the dual-path features, the CGU is designed. Leveraging a bi-directional cross-modulation strategy, this module utilizes continuous contextual information to constrain the distribution consistency of discrete features, while employing discrete details to sharpen the continuous features, achieving deep alignment and complementary enhancement of cross-scale features.
3. Proposed Method
3.1. Overall Architecture
Addressing the inherent physical duality of hyperspectral images (HSIs) is a critical challenge in remote sensing. Conventional dual-branch networks typically treat spatial and spectral dimensions as independent modalities for mechanical fusion. In contrast, we propose a continuous–discrete collaborative framework based on the state space model (SSM), termed Confluence Mamba (CF-Mamba), which explicitly deconstructs the physical attributes of HSIs.
Rather than a simple spatial–spectral split, our architecture is driven by the physical reality of ground objects. The specific selection of AHSE, IISE, and CGU is strictly driven by the necessity to deconstruct the intrinsic physical duality of hyperspectral data. The AHSE is designed to track continuous physical evolution (e.g., smooth spectral absorption gradients), whereas the IISE isolates high-frequency discrete mutations to prevent over-smoothing. Because these continuous and discrete representations are physically orthogonal, conventional fusion paradigms (e.g., simple element-wise addition) fail to align their heterogeneous semantic spaces. Therefore, the CGU is uniquely introduced to perform bidirectional cross-modulation, dynamically gating the confluence of continuous envelopes and discrete details to explicitly resolve their cross-representation discrepancies. As illustrated in Figure 1, CF-Mamba primarily consists of three core stages: shallow feature embedding, dual-path deep feature extraction, and feature confluence and classification.
First, to mitigate the “curse of dimensionality” and reduce computational complexity, the raw hyperspectral data undergoes Principal Component Analysis (PCA) for dimensionality reduction and is then partitioned into overlapping 3D patches. These patches are fed into two parallel branches for complementary processing:
Continuous Modeling Path: This path aims to establish long-range dependencies at the sequence level while preserving the integrity of spectral–spatial information. The input features first pass through the Spectral–Spatial Convolutional Representation (SSCR) module to retain 3D structural information. Subsequently, the features enter the core Adaptive HoloSpectral Encoder (AHSE). Diverging from traditional methods that mechanically sum fixed scanning paths, AHSE introduces an adaptive routing mechanism that dynamically weights features from different 3D scanning directions based on the texture complexity of the input content, thereby reinforcing continuous context perception while suppressing noise.
Discrete Interaction Path: This path focuses on decoupling high-dimensional redundancy and extracting fine-grained discriminative features. After being processed by the Depth-wise separable spectral–spatial Embedding (DWE), the input features enter the Interactive Interval Spectral Encoder (IISE). IISE employs a discretized interval sampling strategy to partition continuous spectral features into non-overlapping subgroups and performs selective scanning (S6) within each group. To break the “information silos” caused by discrete grouping, IISE innovatively introduces a channel shuffle mechanism to facilitate information flow across subgroups while maintaining linear computational complexity.
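The partitioning into overlapping patches that precedes the two branches can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' implementation: the function name `extract_patches`, the `cube[h][w][b]` layout, the odd window size, and the zero padding at the image border are all assumptions, and the PCA step is omitted.

```python
def extract_patches(cube, patch_size):
    """Cut one patch_size x patch_size window around every pixel.

    cube is a nested list indexed as cube[h][w][b] (height, width, bands).
    Windows of neighbouring pixels overlap. Borders are zero-padded, which
    is an assumption of this sketch.
    """
    if patch_size % 2 == 0:
        raise ValueError("patch_size must be odd so that each patch has a centre pixel")
    height, width, bands = len(cube), len(cube[0]), len(cube[0][0])
    radius = patch_size // 2
    zero_pixel = [0.0] * bands

    def pixel(h, w):
        # Return the spectrum at (h, w), or zeros outside the image.
        if 0 <= h < height and 0 <= w < width:
            return cube[h][w]
        return zero_pixel

    offsets = range(-radius, radius + 1)
    return {
        (h, w): [[pixel(h + dh, w + dw) for dw in offsets] for dh in offsets]
        for h in range(height)
        for w in range(width)
    }
```

Each patch keeps the full (reduced) spectrum of every pixel in its window, so the label of the centre pixel can be predicted from its spatial–spectral neighbourhood.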
In the feature fusion stage, addressing the differences in representation forms and spatial response distributions between the dual-path features, we propose the Confluence Gating Unit (CGU). This module aims to achieve cross-representation feature alignment and complementary enhancement through a learnable gating mechanism. CGU utilizes a bi-directional cross-modulation strategy: it first maps the features of one path into gating coefficients via a nonlinear activation function, which are then used for channel-wise weighted filtering of the features from the other path.
This mechanism serves a dual purpose: first, it utilizes the continuous contextual information from AHSE to provide distributional constraints on the discrete features of IISE, filtering out fragmented noise that lacks contextual support; second, it leverages the discrete detailed information from IISE to supplement the continuous features of AHSE, compensating for potential boundary blurring caused by long-range smooth modeling. Through this deep bi-directional interaction, the model generates “Confluence” features that possess both sequence consistency and detail sharpness. Finally, these features are processed by Global Average Pooling (GAP) before being fed into the classifier to output pixel-level land-cover classification results.
3.2. Adaptive HoloSpectral Encoder (AHSE)
The selection and design of the Adaptive HoloSpectral Encoder (AHSE) are strictly driven by the severe physical limitations of applying standard sequence models to 3D hyperspectral data. Hyperspectral images are fundamentally anisotropic 3D physical entities. While the native Mamba architecture excels at 1D sequences, applying it directly to HSIs via arbitrary flattening inherently destroys either spatial or spectral physical continuity. Furthermore, existing multi-path extensions (e.g., 3DSS-Mamba [41]) merely perform mechanical feature summation. This approach falsely assumes isotropic physical continuity, contradicting the physical reality that different land covers exhibit highly anisotropic continuous evolutions. For instance, vegetation relies heavily on spectral waveform continuity, whereas built structures depend primarily on spatial geometric continuity.
To overcome these physical limitations, the AHSE module is introduced as the core of the continuous modeling path. Analogous to the principle of optical holography, which reconstructs 3D information by recording light waves from different angles, AHSE comprehensively captures anisotropic structural dependencies through multi-view orthogonal trajectories without blind spots. As illustrated in Figure 2, it transforms non-causal 3D images into causal continuous sequence streams through three cascaded stages: Multi-View Serialization, Parallel State Space Evolution, and Content-Aware Adaptive Routing. Crucially, rather than mechanically summing features, the final adaptive routing mechanism dynamically gates and weighs these perspectives. It adaptively amplifies the most physically relevant continuous perspective based on the intrinsic complexity of the specific ground object, thereby completely preventing the introduction of structural noise from irrelevant scanning dimensions.
3.2.1. Multi-View Spectral–Spatial Serialization
Prior to multi-view serialization, the Spectral–Spatial Convolutional Representation (SSCR) module is designed to satisfy the requirements of the continuous modeling path (AHSE) for holistic structural features and spectral continuity. This module leverages a 3D convolutional layer (3D Conv) to directly encode the input hyperspectral image patches $X$ as follows:
$$F = \mathrm{Conv3D}(X)$$
SSCR enables the direct extraction of spatial structures between pixels and continuous correlations between bands, transforming raw data into a highly compressed representation that preserves complete spectral evolution information. This provides a robust structural input for the subsequent AHSE to capture long-range sequential dependencies.
To model with SSMs without compromising the integrity of the 3D structure, a continuity-preserving sequence mapping must first be established. Given an input feature tensor $F \in \mathbb{R}^{H \times W \times D}$, we define different scanning trajectories to flatten the 3D cube into 1D sequences, capturing anisotropic long-range contextual dependencies. In this study, we construct three complementary paths:
Spatial-Priority Path ($P_1$): This path prioritizes traversing spatial pixels (h, w) before switching spectral bands. It is designed to capture the continuity of spatial textures and the geometric structures of object edges.
Spectral-Priority Path ($P_2$): This path prioritizes traversing the spectral dimension before switching spatial positions. It focuses on extracting the evolution of spectral fingerprints and long-range sequential dependencies across bands.
Joint-Cross Path ($P_3$): Utilizing a 3D spiral traversal strategy, this path aims to establish a deep coupling relationship between the spatial and spectral dimensions.
For the $k$-th path, the generated sequence $S_k$ can be expressed as:
$$S_k = \mathrm{Flatten}_{P_k}(F), \quad k \in \{1, 2, 3\},$$
where $S_k$ contains the holistic sequential context from a specific perspective.
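The two axis-aligned scanning orders can be illustrated with a short sketch. It assumes a `cube[h][w][b]` nested-list layout and hypothetical function names; the joint-cross spiral path is left out because its exact traversal is not specified in the text above.

```python
def spatial_priority(cube):
    """Visit every pixel (h, w) of one band before moving to the next band."""
    height, width, bands = len(cube), len(cube[0]), len(cube[0][0])
    return [cube[h][w][b]
            for b in range(bands)
            for h in range(height)
            for w in range(width)]


def spectral_priority(cube):
    """Visit every band of one pixel before moving to the next pixel."""
    height, width, bands = len(cube), len(cube[0]), len(cube[0][0])
    return [cube[h][w][b]
            for h in range(height)
            for w in range(width)
            for b in range(bands)]


def unflatten_spatial_priority(sequence, height, width, bands):
    """Inverse of spatial_priority: rebuild the cube[h][w][b] layout."""
    cube = [[[None] * bands for _ in range(width)] for _ in range(height)]
    index = 0
    for b in range(bands):
        for h in range(height):
            for w in range(width):
                cube[h][w][b] = sequence[index]
                index += 1
    return cube
```

Both functions return the same set of values in different orders. In the spatial-priority sequence, spatial neighbours in a row are adjacent, whereas in the spectral-priority sequence neighbouring bands of one pixel are adjacent, which determines what a causal sequence model sees as "recent" context.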
3.2.2. Parallel Selective State Space Evolution
Upon obtaining the flattened sequences, AHSE utilizes parallel selective state space models (S6) to model the independent sequence evolution for each path. For the input sequence of the $k$-th path, discretized state equations are employed to update the latent state $h_t$. The reliance on the Selective State Space Model (S6) for continuous modeling, rather than conventional CNNs, RNNs, or Transformers, is strictly dictated by the physical prerequisites of hyperspectral continuous evolution. Traditional CNNs, constrained by local receptive fields, inherently fail to capture holistic long-range spectral trajectories. While Transformers excel at global dependencies, their quadratic complexity typically forces aggressive sequence truncation or patch downsampling, which physically severs the continuity of the HSI data cube. Moreover, traditional RNNs suffer from memory decay over massive sequences and lack parallel efficiency. In contrast, Mamba uniquely satisfies both physical and computational demands: its linear complexity permits the full ingestion of untruncated 3D continuous sequences, preserving the absolute integrity of physical evolution. Furthermore, unlike the static parameters in RNNs, Mamba’s input-dependent selective mechanism dynamically retains critical continuous physical states (e.g., macro-spectral envelopes) while filtering out irrelevant noise, making it mathematically and structurally optimal for our continuous modeling path.
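First, discretization parameters are dynamically generated based on the input $x_t$:
$$\Delta_t = \mathrm{Softplus}(\mathrm{Linear}_{\Delta}(x_t)), \quad B_t = \mathrm{Linear}_{B}(x_t), \quad C_t = \mathrm{Linear}_{C}(x_t),$$
$$\bar{A}_t = \exp(\Delta_t A), \quad \bar{B}_t = (\Delta_t A)^{-1}\left(\exp(\Delta_t A) - I\right)\Delta_t B_t$$
Subsequently, state recurrence and output computation are executed:
$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \quad y_t = C_t h_t + D x_t,$$
where $D$ denotes the skip connection parameter, typically maintained as a static value to facilitate direct gradient propagation. After S6 processing, the output sequences from each path are reshaped back to their original 3D dimensions via an inverse transformation, and the result for the $k$-th path is denoted as $Y_k$. This process ensures that each branch independently captures continuous long-range dependency features under its specific perspective.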
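The state recurrence can be made concrete with a small single-channel sketch of a selective scan. It follows the usual S6 pattern (an input-dependent step size, a decaying state, and a static skip term), but it is an illustration under stated assumptions rather than the authors' code: the transition matrix is diagonal, the input projection uses the simplified Euler form (step size times B) instead of the exact zero-order hold, and the names `selective_scan`, `w_delta`, `b_delta`, `w_b`, and `w_c` are hypothetical scalar or vector stand-ins for learned linear layers.

```python
import math


def softplus(value):
    # Smooth positive mapping, used so that the step size stays above zero.
    return math.log1p(math.exp(value))


def selective_scan(x, a_diag, w_delta, b_delta, w_b, w_c, skip):
    """Run a diagonal selective state space recurrence over a 1D sequence.

    x: list of floats (one input channel).
    a_diag: negative floats, the diagonal of the state matrix A.
    w_delta, b_delta: scalars producing the input-dependent step size.
    w_b, w_c: per-state weights producing input-dependent B_t and C_t.
    skip: static skip-connection weight D.
    """
    state = [0.0] * len(a_diag)
    outputs = []
    for x_t in x:
        delta = softplus(w_delta * x_t + b_delta)      # input-dependent step size
        y_t = skip * x_t                               # direct skip path D * x_t
        for i, a in enumerate(a_diag):
            a_bar = math.exp(delta * a)                # discretised state decay
            b_bar = delta * (w_b[i] * x_t)             # simplified discretised input weight
            state[i] = a_bar * state[i] + b_bar * x_t  # h_t = A_bar h_{t-1} + B_bar x_t
            y_t += (w_c[i] * x_t) * state[i]           # y_t accumulates C_t h_t
        outputs.append(y_t)
    return outputs
```

The loop touches each element once, so the cost grows linearly with the sequence length, and every output depends only on the current and earlier inputs.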
3.2.3. Content-Aware Adaptive Fusion
Conventional feature fusion often relies on simple element-wise summation. This practice assumes that features from all perspectives hold equal importance for the final classification, thereby ignoring the discrepancies in characterizing data continuity across different scanning topologies (e.g., in spectrally smooth regions, forced spatial scanning may introduce redundant high-frequency structural noise). To address this, AHSE introduces a lightweight adaptive fusion module to achieve dynamic weighted aggregation of multi-path features by learning the sequential feature saliency for each perspective. First, the input features $F$ are compressed into holistic statistical descriptors $z$ using Global Average Pooling (GAP):
$$z = \mathrm{GAP}(F)$$
The GAP here is intended to aggregate the holistic distribution information within the current patch, establishing a macro-statistical basis for subsequent routing decisions. Subsequently, an excitation network, comprising two fully connected (FC) layers and a ReLU activation function, is employed to generate a normalized weight vector $\alpha = [\alpha_1, \dots, \alpha_K]$ for the $K$ paths:
$$\alpha = \mathrm{Softmax}\left(W_2\,\delta(W_1 z)\right),$$
where $\alpha_k$ denotes the representational contribution of the $k$-th scanning path to the current sample, $W_1$ and $W_2$ are the weights of the two FC layers, and $\delta$ refers to the ReLU activation function.
Ultimately, the output of AHSE, denoted as $F_{\mathrm{AHSE}}$, is formed by aggregating the weighted multi-view features and is fused with the input through a residual connection to ensure effective gradient propagation:
$$F_{\mathrm{AHSE}} = \sum_{k=1}^{K} \alpha_k Y_k + F,$$
where $Y_k$ denotes the reshaped S6 output of the $k$-th path.
Through this mechanism, AHSE adaptively reinforces the perspectives that best align with the continuous structures of ground objects based on the inherent distribution characteristics of the input data. Simultaneously, it suppresses perspectives that introduce redundant interference, thereby achieving precise refinement and sequence optimization of high-dimensional spectral–spatial features.
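The routing step can be sketched in a few lines of plain Python. The sketch assumes that the pooled descriptor is computed from the encoder input, that the weights are normalized with a softmax, and that features are stored as a list of channels holding flat lists of values; these choices, together with the names `routing_weights`, `adaptive_fusion`, `w1`, and `w2`, are illustrative assumptions.

```python
import math


def softmax(values):
    peak = max(values)                                  # subtract the max for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def routing_weights(x, w1, w2):
    """GAP -> FC -> ReLU -> FC -> softmax, giving one weight per scanning path."""
    descriptor = [sum(channel) / len(channel) for channel in x]  # GAP per channel
    hidden = [max(0.0, v) for v in matvec(w1, descriptor)]       # first FC + ReLU
    return softmax(matvec(w2, hidden))                           # second FC + normalisation


def adaptive_fusion(x, path_outputs, w1, w2):
    """Weighted sum of the path outputs plus a residual connection to x."""
    alpha = routing_weights(x, w1, w2)
    fused = [
        [x[c][i] + sum(a * y[c][i] for a, y in zip(alpha, path_outputs))
         for i in range(len(x[c]))]
        for c in range(len(x))
    ]
    return fused, alpha
```

With an all-zero second layer the weights are uniform and the module reduces to plain averaging of the paths; a trained second layer lets a patch favour whichever scanning path suits its content.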
3.3. Interactive Interval Spectral Encoder (IISE)
Although AHSE can effectively capture continuous long-range dependencies at the sequence level, capturing fine-grained discrete spectral features typically entails prohibitive computational costs (e.g., dense 3D convolutions). Direct full-band scanning in high-dimensional spectral space not only imposes a heavy computational burden but also leads to significant spectral redundancy due to the high correlation between adjacent bands. Therefore, how to efficiently decouple spectral redundancy and extract discriminative features without substantially increasing the model complexity remains a critical challenge.
To address this, we propose the Interactive Interval Spectral Encoder (IISE). Instead of naive continuous scanning, IISE adopts an “Isolation-Interaction” paradigm. Specifically, IISE leverages an interval feature decoupling strategy to significantly compress sequence length and eliminate spectral redundancy, while employing a parameter-free shuffle interaction mechanism to break the barriers between discrete groups. This design enables the model to achieve fine-grained modeling of discrete spectral–spatial details at a lightweight computational cost, effectively circumventing the “high precision, high complexity” dilemma inherent in conventional methods.
3.3.1. Interval Feature Decoupling
To address the requirements of the discrete interaction path (IISE) for spectral redundancy decoupling and fine-grained feature extraction, we propose the DWE module. This module adopts a strategy that combines depth-wise separable convolution (DWConv) with linear mapping (Linear), which substantially reduces the parameter count while enhancing the response to discriminative features:
$$F_d = \mathrm{Linear}(\mathrm{DWConv}(X)),$$
where $X$ denotes the input patch. The generated features $F_d$ are subsequently partitioned into $G$ non-overlapping interval groups. This discretization-based preprocessing effectively circumvents redundant computations in high-dimensional spectral space and seamlessly aligns with the subsequent interval-based discrete group interaction mechanism of IISE.
To break the high redundancy inherent in adjacent bands and alleviate the computational burden of sequence modeling, IISE first introduces a discrete interval grouping strategy as illustrated in Figure 3. Unlike traditional contiguous band clustering, which essentially acts as local smoothing and risks obliterating critical high-frequency physical mutations (e.g., narrow diagnostic absorption valleys of specific minerals), our strategy is physically grounded in sparse discrete sampling. By extracting bands at fixed intervals, it functions as a spectral comb, explicitly breaking the strong physical collinearity among adjacent bands to reduce redundancy, while strictly retaining the representative macro-spectral profile of ground objects.
Given the decoupled feature tensor $F_d$ with $C$ channels generated by the depth-wise separable embedding, we partition it into $G$ non-overlapping low-dimensional feature subspaces. The feature set $F_d^{(g)}$ of the $g$-th subspace (group) consists of channels with indices $\{g, g + G, g + 2G, \dots\}$:
$$F_d^{(g)} = \left\{ F_d[:, c] \;\middle|\; c \equiv g \pmod{G} \right\}, \quad g = 0, 1, \dots, G - 1$$
This interval sampling ensures that although each subgroup’s dimension is reduced to $1/G$ of the original, the bands it contains are uniformly distributed across the spectral domain. This enables each discrete subgroup to preserve the complete skeleton of the ground object’s spectral curve (i.e., a Representative Spectral Profile), rather than being restricted to discrete features within a narrow spectral range. Subsequently, each subgroup is independently fed into a unidirectional selective state space model (S6) to capture the intra-group spatial–spectral dependencies within each discrete subspace in parallel. Furthermore, from the perspective of multi-view analysis, each subgroup generated by the interval sampling fundamentally acts as an independent spectral ‘view’ of the ground object. By decoupling the continuous spectrum into multiple interleaved sub-spaces and analyzing the hyperspectral data from these diverse discrete angles, the IISE module effectively captures complementary high-frequency variations, thereby significantly enhancing the overall feature representation capability of the model.
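The difference between interval sampling and contiguous grouping is easy to see on channel indices alone. The sketch below assumes zero-based channel indices and a channel count divisible by the number of groups; `interval_groups` and `contiguous_groups` are hypothetical helper names.

```python
def interval_groups(num_channels, num_groups):
    """Group g takes channels g, g + G, g + 2G, ... (a comb across the spectrum)."""
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    return [list(range(g, num_channels, num_groups)) for g in range(num_groups)]


def contiguous_groups(num_channels, num_groups):
    """Baseline for comparison: each group is one block of adjacent channels."""
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    size = num_channels // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def spectral_span(group):
    """Distance between the first and last channel index covered by a group."""
    return max(group) - min(group)
```

For 12 channels and 3 groups, `interval_groups(12, 3)` returns `[[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]`, so every group spans 9 of the 11 index steps, whereas each contiguous group spans only 3.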
3.3.2. Cross-Group Shuffle Interaction
Feature extraction based on discrete grouping significantly reduces the sequence length and, combined with the linear complexity of S6, substantially diminishes computational overhead. However, merely decoupling the features is insufficient for true discrete modeling. Standard State Space Models (SSMs) are inherently biased toward sequential continuity. If the decoupled subgroups are processed strictly in their spectral order, the SSM will still compulsively attempt to fit a continuous evolutionary sequence. Therefore, IISE innovatively introduces a channel shuffle mechanism. This operation is not merely an “interaction bridge,” but a mathematical prerequisite: it intentionally destroys the physical wavelength sequence constraint. This forces the SSM to abandon continuous modeling and strictly focus on capturing the global non-sequential interactions among independent, discrete high-frequency mutations without introducing additional parameters.
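Assuming that the discrete group features processed by parallel S6 are stacked into a tensor $Z \in \mathbb{R}^{N \times G \times (C/G)}$ (where $N$ denotes the number of spatial pixels), the shuffle interaction process is implemented through tensor dimension rearrangement to enable cross-group information flow:
1. Transpose: The group dimension and the intra-group channel dimension are swapped:
$$Z' = \mathrm{Transpose}(Z) \in \mathbb{R}^{N \times (C/G) \times G}$$
2. Flatten & Fusion: The transposed tensor is flattened back to the original channel dimension $C$:
$$Z_{\mathrm{shuf}} = \mathrm{Flatten}(Z') \in \mathbb{R}^{N \times C}$$
This operation essentially performs a uniform “re-weaving” of the decoupled spectral features, ensuring that the subsequent linear projection layer can simultaneously receive and integrate feature information from all discrete intervals. Finally, the output of IISE, $F_{\mathrm{IISE}}$, is obtained through linear mapping of the interacted features combined with a residual connection:
$$F_{\mathrm{IISE}} = \mathrm{Linear}(Z_{\mathrm{shuf}}) + F_d,$$
where $F_d$ denotes the input features of IISE.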
Through this efficient workflow of “Decoupling—Independent Modeling—Interaction Re-weaving,” IISE successfully resolves the conflict between redundancy and correlation in high-dimensional spectral data processing. It maximizes the integrity of fine-grained discrete details and cross-interval spectral dependencies while maintaining linear computational complexity.
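The shuffle itself is a reshape, a transpose, and a flatten. The following sketch applies it to a single pixel's channel list and assumes the channels are stored group by group; `channel_shuffle` is a hypothetical name, and the learned linear projection and residual connection are omitted.

```python
def channel_shuffle(channels, num_groups):
    """Interleave channels that are stored group by group.

    channels: flat list of length C ordered as [group 0 | group 1 | ...].
    Returns a flat list in which consecutive entries come from different groups.
    """
    if len(channels) % num_groups != 0:
        raise ValueError("the number of channels must be divisible by num_groups")
    per_group = len(channels) // num_groups
    # Reshape to (G, C/G): one row per group.
    grouped = [channels[g * per_group:(g + 1) * per_group] for g in range(num_groups)]
    # Transpose to (C/G, G): one row per position inside a group.
    transposed = zip(*grouped)
    # Flatten back to length C.
    return [value for row in transposed for value in row]
```

The operation has no parameters and only reorders values. Under the group-by-group stacking assumed here, applying it to interval-sampled groups such as `[0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]` places the channels back in ascending band order, so each neighbourhood of the shuffled sequence mixes outputs from all groups.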
3.4. Confluence Gating Unit (CGU)
In the CF-Mamba framework, the AHSE path captures continuous sequence-level context, while the IISE path extracts discrete decoupled fine-grained features. Despite their informational complementarity, the distinct feature generation mechanisms—continuous evolution versus discrete sampling—often lead to representation discrepancy in the feature space when direct linear superposition (e.g., element-wise addition) is applied. For instance, the continuous modeling path may blur the high-frequency boundaries of ground objects due to sequence smoothing effects, whereas the discrete interaction path might introduce fragmented noise lacking contextual constraints as a result of severed long-range dependencies.
To bridge this representation gap and achieve a deep organic fusion of dual-path features, we designed the Confluence Gating Unit (CGU). Departing from traditional passive aggregation methods, the CGU adopts a “Bi-directional Cross-Rectification” strategy, which leverages the distribution characteristics of one path as a prior to dynamically calibrate the response of the other path.
3.4.1. Cross-Scale Alignment and Gate Generation
Let $F_c$ denote the continuous context features output by AHSE, and $F_d$ denote the discrete fine-grained features output by IISE. The core objective of the CGU is to construct two parallel gating branches designed to generate a “consistency mask” and a “detail enhancement mask,” respectively.
Initially, lightweight feature transformation functions, denoted as $\phi_c(\cdot)$ for the continuous context and $\phi_d(\cdot)$ for the discrete features, are introduced. These functions, typically composed of a $1 \times 1$ convolutional layer, batch normalization (BN), and an activation function, endow the gating coefficients with non-linear discriminative power. Subsequently, a Sigmoid function $\sigma(\cdot)$ is employed to map the features into the $(0, 1)$ interval, thereby generating the gating maps:
$$G_c = \sigma(\phi_c(F_c)), \quad G_d = \sigma(\phi_d(F_d))$$
Here, $G_c$ represents the attention map generated from the continuous context, indicating regions with high sequence-level consistency confidence. Conversely, $G_d$ represents the attention map derived from discrete details, identifying regions that contain significant discriminative texture information.
3.4.2. Bi-Directional Cross-Modulation
Upon obtaining the gating coefficients, the CGU executes a bi-directional cross-modulation operation. Rather than a simple feature mixture, this process utilizes the Hadamard product ($\odot$) to achieve mutual filtering and enhancement between the dual-path features:
Continuous-to-Discrete Guidance (Contextual Regularization): The discrete features $F_d$ are weighted using the continuous gate $G_c$:
$$\tilde{F}_d = F_d \odot G_c$$
The primary role of this step is to utilize continuous semantics as a “regularizer” to suppress discrete noise points within the discrete features that are inconsistent with the surrounding sequence evolution trends (e.g., filtering out isolated feature fragments resulting from group decoupling).
Discrete-to-Continuous Feedback (Detail Refinement): The continuous features $F_c$ are weighted using the discrete gate $G_d$:
$$\tilde{F}_c = F_c \odot G_d$$
This step serves to utilize discrete high-frequency details as an “enhancer” to strengthen the response of continuous features at object boundaries, thereby compensating for the potential smoothing of local responses caused by long-range sequence modeling.
3.4.3. Confluence Output and Classification
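Features after bi-directional calibration possess both the distribution consistency of continuous context and the fine-grained distinctiveness of discrete features. The final confluence feature $F_{\mathrm{out}}$ is obtained by fusing the complementarily modulated features from both paths, while maintaining the flow of original information through residual connections, as illustrated in the mechanism of Figure 4:
$$F_{\mathrm{out}} = (F_c \odot G_d + F_c) + (F_d \odot G_c + F_d),$$
where $F_c$ and $F_d$ are the continuous and discrete features, and $G_c$ and $G_d$ are the gates generated from them. Finally, $F_{\mathrm{out}}$ is compressed into a feature vector via a global average pooling (GAP) layer and then fed into a fully connected layer to calculate the final probability distribution of land cover categories:
$$\hat{y} = \mathrm{Softmax}\left(\mathrm{FC}(\mathrm{GAP}(F_{\mathrm{out}}))\right)$$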
Through the CGU module, CF-Mamba successfully achieves a transition from mechanical “multi-source feature stacking” to organic “continuous–discrete Representation Synergy”. This significantly enhances the model’s classification robustness when dealing with spectral confusion (which requires continuity constraints) and subtle texture differences (which require discrete feature differentiation).
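The cross-modulation can be sketched element-wise as follows. This is a simplified illustration: the learned transforms (convolution, batch normalization, activation) are replaced by identity functions by default, the two modulated paths are fused by addition with residual terms, and features are flat lists of floats. All of these, along with the names `confluence_gate`, `phi_c`, and `phi_d`, are assumptions made for the example.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def confluence_gate(f_c, f_d, phi_c=lambda v: v, phi_d=lambda v: v):
    """Bi-directional gating of continuous (f_c) and discrete (f_d) features.

    phi_c and phi_d stand in for the learned transforms that precede the
    sigmoid; they default to the identity in this sketch.
    """
    if len(f_c) != len(f_d):
        raise ValueError("both paths must have the same number of elements")
    gate_c = [sigmoid(phi_c(v)) for v in f_c]             # gate from the continuous path
    gate_d = [sigmoid(phi_d(v)) for v in f_d]             # gate from the discrete path
    d_mod = [d * g for d, g in zip(f_d, gate_c)]          # continuous gate filters discrete features
    c_mod = [c * g for c, g in zip(f_c, gate_d)]          # discrete gate sharpens continuous features
    # Additive fusion with residual connections to both original paths (assumed form).
    return [cm + c + dm + d for cm, c, dm, d in zip(c_mod, f_c, d_mod, f_d)]
```

Because each gate lies strictly between 0 and 1, a path can only be attenuated by the other path's gate, never amplified beyond its own magnitude; the residual terms keep the unmodulated information available to the classifier.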
Figure 4.
Bi-directional Cross-Modulation Mechanism of the Confluence Gating Unit (CGU).
4. Experiment
4.1. Datasets
To evaluate the effectiveness of the proposed method, extensive experimental comparisons were conducted on four public hyperspectral image (HSI) datasets: Indian Pines, Pavia University, Houston 2013, and WHU-Hi-Longkou.
Table 1 provides the detailed partitioning of the training and testing sets.
Indian Pines Dataset: This dataset was acquired in 1992 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over the Indian Pines test site in northwestern Indiana. The spectrometer covers a wavelength range from 0.4 to 2.5 µm. After removing water absorption channels, the dataset contains 200 spectral bands with a spatial resolution of 20 m per pixel, and the image size is 145 × 145 pixels. The dataset comprises a total of 10,249 ground truth samples across 16 different categories.
Pavia University Dataset: This dataset was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) over the city of Pavia in northern Italy. The imaging wavelength range of the spectrometer is 0.43 to 0.86 µm. After removing 12 noisy bands, the dataset includes 103 spectral bands with a spatial resolution of 1.3 m per pixel and an image size of 610 × 340 pixels. There are 42,776 labeled ground truth pixels, corresponding to 9 different categories.
Houston 2013 Dataset: This dataset was captured by the ITRES CASI-1500 sensor (ITRES Research Limited, Calgary, AB, Canada) over the University of Houston campus and its surrounding areas, and was provided by the 2013 IEEE Geoscience and Remote Sensing Society (GRSS) Data Fusion Contest. The spectrometer’s imaging wavelength range is 0.38 to 1.05 µm. It contains 144 spectral bands with an image size of 349 × 1905 pixels and a spatial resolution of 2.5 m per pixel. The dataset includes a total of 15,029 sample pixels categorized into 15 challenging classes.
WHU-Hi-Longkou Dataset: This dataset was acquired by the Headwall Nano-Hyperspec sensor (Headwall Photonics, Bolton, MA, USA) in Longkou Town, Hubei Province, China. The image size is 550 × 400 pixels, containing 270 spectral bands with a wavelength range of 0.4 to 1 µm and a spatial resolution of approximately 0.463 m. The area contains 9 types of land cover: Corn, Cotton, Sesame, Broad-leaf soybean, Narrow-leaf soybean, Rice, Water, Roads and houses, and Mixed weed, primarily used for precision agricultural classification research.
4.2. Experimental Setup
- (1)
To quantitatively evaluate the classification performance, three standard metrics are employed: Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient. OA is the ratio of correctly classified pixels to the total number of test pixels. AA is the mean of the per-class classification accuracies; it is included to give a fairer picture of performance on minority classes, since hyperspectral datasets are typically severely class-imbalanced. The Kappa coefficient measures the agreement between the classification results and the ground truth while discounting agreement expected by chance. To ensure a fair and stable comparison, each model is trained five times independently with random initializations, and the mean and standard deviation over these five runs are reported as the final result.
- (2)
The proposed CF-Mamba and all comparative models are implemented in the PyTorch 2.6.0 deep learning framework and run on a single NVIDIA GeForce RTX 4090D GPU. In the data preprocessing stage, Principal Component Analysis (PCA) is first used to compress the spectral dimension of the raw HSI data to 35 dimensions. The data is then segmented into overlapping patches with a spatial size of 15 × 15 as network inputs. The feature embedding dimension C is set to 64, and the group number G in the IISE module is set to 4. For training, the Adam optimizer is used with an initial learning rate of 0.001 and a batch size of 64. Training lasts for 150 epochs, and the model parameters at the end of the 150th epoch are used directly for final testing.
- (3)
To evaluate the performance of the proposed CF-Mamba, four categories of mainstream algorithms in the current field of hyperspectral classification are selected as comparative baselines. These include SVM [
2] as a representative of traditional shallow learning methods; convolutional neural networks based on deep feature extraction, such as 2D-CNN and 3D-CNN [
42]; Transformer-based methods utilizing self-attention mechanisms, including HSI-BERT [
43], SF [
44], CASST [
26], and DCTN [
45]; and Mamba-based methods built on selective scanning mechanisms, including CenterMamba [
34], S2-Mamba [
32], and 3DSS-Mamba [
41].
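The three evaluation metrics defined in item (1) above can be computed from a confusion matrix as sketched below. The function name and the toy two-class matrix are illustrative only (they are not results from this paper); rows are taken to be true classes and columns predicted classes.

```python
import numpy as np

def classification_metrics(conf):
    """Return (OA, AA, Kappa) from a confusion matrix (rows = true classes)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                        # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)       # per-class accuracy (recall)
    aa = per_class.mean()                              # average accuracy
    # Chance agreement from the row and column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Toy imbalanced example: 90 majority-class pixels, 10 minority-class pixels.
toy_conf = [[90, 0],
            [5, 5]]
oa, aa, kappa = classification_metrics(toy_conf)
print(f"OA={oa:.4f}  AA={aa:.4f}  Kappa={kappa:.4f}")
# OA=0.9500  AA=0.7500  Kappa=0.6429
```

The toy case shows why AA is reported alongside OA: the minority class is only half correct, which OA (0.95) largely hides but AA (0.75) exposes.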
4.3. Experimental Results
Indian Pines Dataset: As shown in
Table 2 and
Figure 5, CF-Mamba achieves the best performance with an OA of 97.77%, AA of 96.00%, and Kappa coefficient of 97.46%. Compared to advanced state-space-based methods 3DSS-Mamba, S2-Mamba, and CenterMamba, our method improves OA by 1.82%, 0.51%, and 1.07%, respectively, and improves AA by 4.92%, 1.72%, and 1.27%, respectively. These improvements are particularly significant given the challenges of low spatial resolution and severe mixed pixels in the Indian Pines dataset.
Notably, CF-Mamba demonstrates exceptional discriminative power across multiple agricultural categories with similar spectral features. Specifically, it achieves perfect classification accuracy of 100.00% on Class 4 (Corn), Class 7 (Grass-pasture-mowed), Class 8 (Hay-windrowed), Class 13 (Wheat), and Class 14 (Woods). These results fully validate the effectiveness of capturing continuous spectral evolution trends through the AHSE module, combined with extracting discrete fine-grained features via the IISE module. Although other SSM-based methods also achieve 100% accuracy on certain classes, CF-Mamba attains perfect classification on more classes while maintaining higher overall classification stability and average accuracy, further demonstrating the significant advantages of the continuous–discrete collaborative framework in addressing the challenges of “same object, different spectrum” and “different object, same spectrum.”
Houston 2013 Dataset: The classification results for the Houston 2013 dataset are summarized in
Table 3. Despite the “ceiling effect” where most deep learning models exceed 95% accuracy, CF-Mamba further pushes the limit, increasing the OA to 99.06%. Compared to advanced state-space-based methods 3DSS-Mamba, S2-Mamba, and CenterMamba, our method achieves improvements of 2.99%, 0.96%, and 1.47% in OA, respectively. Detailed comparisons reveal that CF-Mamba significantly reduces confusion between different urban land cover categories. For example, in Class 12 (Parking Lot1) and Class 13 (Parking Lot2), which share similar geometric structures, our model achieves near-perfect accuracies of 99.97% and 99.32%, respectively, substantially outperforming the baseline 3DSS-Mamba (98.06% and 96.15%) as well as other SSM-based methods. This robustness in complex urban scenes validates the effectiveness of the Confluence Gating Unit (CGU). By utilizing continuous contextual information to provide consistency regularization for discrete features, the CGU effectively filters out shadow effects and texture fragmentation noise common in high-resolution urban imagery, ensuring that visually similar objects are accurately distinguished based on their intrinsic spatial–spectral consistency.
Pavia University Dataset: As shown in
Figure 6 and
Table 4, CF-Mamba demonstrates a clear advantage on the Pavia University (ROSIS) dataset, achieving an OA of 99.68%, close to the saturation limit for this benchmark. Compared to the advanced state-space-based methods 3DSS-Mamba, S2-Mamba, and CenterMamba, our method improves OA by 0.79%, 0.83%, and 2.38%, respectively. The advantages are most pronounced in categories with narrow linear structures and fine-grained textures. For Class 2 (Meadows) and Class 8 (Self-blocking bricks), CF-Mamba achieves accuracies of 100.00% and 99.85%, respectively, substantially outperforming Transformer-based models (e.g., SF, DCTN), which often suffer from boundary blurring due to resolution loss during patch embedding. In contrast, CF-Mamba preserves fine-grained discrete features through the IISE interaction mechanism and enhances the sharpness of continuous features at object edges via the CGU “detail feedback” strategy, enabling pixel-level precision for small targets and linear objects. Our model also achieves 100.00% accuracy on Class 6 (Bare soil) and Class 7 (Bitumen), further demonstrating its robustness across diverse material types.
WHU-Hi-Longkou Dataset: The results for the Longkou dataset are presented in
Figure 7 and
Table 5. Due to its extremely high spatial resolution (0.463 m) and rich texture features, most deep learning models exceed 98% OA on this dataset. CF-Mamba achieves the highest OA of 99.59%, although the gain is more modest than on the other datasets. Compared to the advanced state-space-based methods 3DSS-Mamba, S2-Mamba, and CenterMamba, our method improves OA by 0.04%, 0.20%, and 0.86%, respectively, while its AA (98.54%) is slightly lower than that of 3DSS-Mamba (98.90%) but higher than those of S2-Mamba (97.69%) and CenterMamba (95.29%). This can be attributed to the highly concentrated distribution and strong anisotropic strip features of certain crops (e.g., Class 3 Sesame): while the AHSE module captures these structures, the complexity introduced by the dual-path mechanism may lead to slight overfitting on some small-sample fragmented objects. Nevertheless, CF-Mamba maintains near-perfect recognition rates for major classes such as Class 1 (Corn) and Class 4 (Broad-leaf soybean), with accuracies of 99.97% and 99.81%, respectively, indicating its robustness in high-resolution agricultural remote sensing scenarios.
4.4. Feature Visualization
To further qualitatively evaluate the feature representation capability of the proposed CF-Mamba, we utilize the t-distributed stochastic neighbor embedding (t-SNE) algorithm to visualize the high-dimensional features extracted by the network. Specifically, the output features from the final global average pooling layer are projected into a two-dimensional space. The visualization results from the IP dataset are presented in
Figure 8.
As illustrated in
Figure 8, the feature embeddings generated by CF-Mamba exhibit excellent discriminative properties. Data points belonging to the same land-cover categories are tightly clustered together, demonstrating high intra-class compactness. Meanwhile, different categories are well separated with distinct boundaries, indicating strong inter-class separability. This visualization intuitively verifies that the proposed dual-path collaborative framework—incorporating the continuous modeling path (AHSE) and the discrete interaction path (IISE)—can effectively decouple spectral–spatial redundancy and extract highly discriminative representations, thereby facilitating accurate hyperspectral image classification.
4.5. Parameter Analysis
To evaluate the robustness of the proposed CF-Mamba framework and validate the empirical settings used in our experiments, we conduct a detailed parameter sensitivity analysis. We specifically investigate the impact of two critical hyperparameters: the spatial patch size and the number of principal component analysis (PCA) dimensions. The experiments are conducted across all four datasets, and the results are illustrated in
Figure 9.
Effect of Spatial Patch Size: The spatial patch size determines the receptive field for capturing local spatial context. As shown in
Figure 9a, we vary the patch size from 9 to 21. For all four datasets, the Overall Accuracy (OA) initially increases as the patch size grows, peaking at a size of 15 (e.g., reaching 97.77% on Indian Pines and 99.68% on Pavia University). This upward trend indicates that a larger spatial neighborhood provides richer structural and contextual information, which benefits the continuous modeling path (AHSE). However, when the patch size exceeds 15, the performance begins to degrade slightly. This drop is attributed to the inclusion of heterogeneous pixels from different classes (i.e., the smoothing effect) and the introduction of redundant spatial noise, both of which interfere with the classification of the central pixel. Therefore, patch sizes of 13 to 15 offer the best balance, and a size of 15 is used in our experiments.
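To make the role of the patch size concrete, the sketch below extracts a square neighborhood around every labeled pixel of an HSI cube. It is a minimal sketch, not the authors' preprocessing code: the function name, the reflect padding at image borders, and the convention that label 0 means "unlabeled" are assumptions.

```python
import numpy as np

def extract_patches(cube, labels, patch_size=15):
    """Cut a patch_size x patch_size neighborhood around each labeled pixel.

    cube:   (H, W, B) image after spectral reduction.
    labels: (H, W) integer map, 0 = unlabeled (assumed convention).
    """
    half = patch_size // 2
    # Pad spatially so that border pixels also get full-size patches.
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    rows, cols = np.nonzero(labels)
    patches = np.stack([
        padded[r:r + patch_size, c:c + patch_size]   # centered on (r, c)
        for r, c in zip(rows, cols)
    ])
    return patches, labels[rows, cols]

rng = np.random.default_rng(0)
cube = rng.standard_normal((20, 20, 35))             # toy cube with 35 PCA bands
labels = np.zeros((20, 20), dtype=int)
labels[0, 0], labels[10, 7], labels[19, 19] = 1, 2, 3

patches, targets = extract_patches(cube, labels, patch_size=15)
print(patches.shape, targets.tolist())
# (3, 15, 15, 35) [1, 2, 3]
```

Each patch is classified by the label of its central pixel, so enlarging `patch_size` adds context but also admits more pixels from neighboring classes, which is the trade-off described above.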
Effect of PCA Dimensions: Hyperspectral images possess high spectral dimensionality with significant band correlation. The PCA dimensions dictate the amount of retained spectral information fed into the network.
Figure 9b illustrates the model’s performance when the retained PCA dimensions range from 20 to 50. The OA curves follow an inverted U-shape, with optimal performance at 35 dimensions across all benchmark datasets (e.g., 99.06% on Houston 2013 and 99.59% on WHU-Hi-Longkou). When the dimension is set too low (e.g., 20 or 25), critical discriminative spectral signatures are lost, leading to sub-optimal accuracy. Conversely, retaining too many dimensions (e.g., 45 or 50) not only increases the computational burden but also preserves redundant spectral noise, which hinders the interval decoupling process in the IISE module. Hence, setting the PCA dimension to 35 ensures sufficient information retention while effectively mitigating the curse of dimensionality.
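The spectral reduction step analyzed here can be sketched with a plain SVD-based PCA. This is an illustrative sketch only: the function name is hypothetical, no whitening or other normalization is applied, and whether the original pipeline includes such steps is not stated in the text.

```python
import numpy as np

def pca_reduce(cube, n_components=35):
    """Project the spectral axis of an (H, W, B) cube onto its top principal components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(float)          # one row per pixel
    flat -= flat.mean(axis=0)                         # center each band
    _, s, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = flat @ vt[:n_components].T              # (H*W, n_components)
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return reduced.reshape(h, w, n_components), explained

rng = np.random.default_rng(0)
toy_cube = rng.standard_normal((20, 20, 50))          # toy data, not a real HSI scene
reduced_cube, explained = pca_reduce(toy_cube, n_components=35)
print(reduced_cube.shape, round(float(explained), 3))
```

On real hyperspectral data the bands are strongly correlated, so the retained variance ratio rises quickly with the first few components; the toy random cube above does not show that behavior and only demonstrates the mechanics.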
Figure 9.
Effect of key hyperparameters (spatial patch size and PCA dimensions) on the classification performance of CF-Mamba.
4.6. Ablation Studies
To comprehensively validate the contributions of the proposed components and address the physical rationality behind our architectural design, we conducted extensive ablation studies on the Indian Pines dataset. The variants and their corresponding performances are detailed in
Table 6.
4.6.1. Internal Mechanisms of the Single-Path Encoders (AHSE & IISE)
We first investigated the internal sub-components of the continuous modeling path (AHSE) and the discrete interaction path (IISE) to verify their structural and physical necessity.
Effectiveness of Adaptive Routing in AHSE (ID 1 & ID 2): Replacing the content-aware adaptive routing with fixed average weights (ID 2) resulted in a 3.44% drop in OA. From a physical perspective, different ground objects exhibit varying spatial–spectral structural dependencies (e.g., roads possess strong spatial directionality, while vegetation heavily relies on continuous spectral waveforms). The adaptive routing ensures the model dynamically assigns higher weights to the most relevant scanning perspective, rather than mechanically averaging them, thereby better capturing the anisotropic continuous evolution.
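The contrast between content-aware routing and the fixed-average variant (ID 2) can be illustrated as follows. The router used here (mean-pooled features passed through a linear projection and a softmax), the number of scanning views, and all names are assumptions for illustration and do not reproduce the AHSE implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_views(views, w_router=None):
    """Combine features from several scanning perspectives.

    views:    (V, L, C) features, one (L, C) sequence per scanning view.
    w_router: (C,) hypothetical router weights; None means fixed averaging.
    """
    num_views = views.shape[0]
    if w_router is None:
        weights = np.full(num_views, 1.0 / num_views)   # fixed average (ablation ID 2)
    else:
        summary = views.mean(axis=1)                    # (V, C) per-view descriptor
        weights = softmax(summary @ w_router)           # content-dependent weights
    fused = np.tensordot(weights, views, axes=1)        # weighted sum -> (L, C)
    return fused, weights

rng = np.random.default_rng(0)
views = rng.standard_normal((4, 225, 64))               # 4 views is an assumed count
w_router = rng.standard_normal(64)

fused_adaptive, w_adaptive = fuse_views(views, w_router)
fused_average, w_average = fuse_views(views)
print(np.round(w_adaptive, 3), w_average)
```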
Interval Sampling vs. Contiguous Clustering in IISE (ID 3 & ID 5): To validate the physical basis of our interval sampling strategy, we replaced it with traditional contiguous band clustering (ID 5). This change reduced OA by 3.33%. Adjacent spectral bands in hyperspectral data are highly correlated and redundant, so contiguous clustering easily traps the model in localized, homogeneous information silos, blurring subtle spectral differences. Conversely, our sparse interval sampling strategy preserves the holistic structural skeleton (representative spectral profile) of the ground objects while stripping away adjacent redundancy.
Necessity of Channel Shuffle (ID 3 & ID 4): Removing the cross-group shuffle mechanism led to a decrease in accuracy. This confirms that the parameter-free shuffle operation successfully acts as a bridge, breaking the “information silos” caused by hard discrete decoupling and ensuring global spectral interaction.
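The difference between the grouping strategies compared in these ablations, and the effect of the channel shuffle, can be seen at the index level. The sketch below uses C = 64 and G = 4 from the experimental setup; treating interval sampling as stride-G indexing and the shuffle as the usual reshape-and-transpose operation are assumptions about the implementation.

```python
import numpy as np

C, G = 64, 4
channels = np.arange(C)

# Interval sampling: group g takes every G-th channel starting at g.
interval_groups = [channels[g::G] for g in range(G)]
# Contiguous clustering (ablation ID 5): group g takes one block of C // G channels.
contiguous_groups = np.split(channels, G)

def channel_shuffle(x, groups):
    """Parameter-free shuffle: interleave channels that were laid out group by group."""
    c = x.shape[-1]
    x = x.reshape(*x.shape[:-1], groups, c // groups)
    x = np.swapaxes(x, -1, -2)                 # (..., c // groups, groups)
    return x.reshape(*x.shape[:-2], c)

print(interval_groups[0][:4].tolist())         # [0, 4, 8, 12]
print(contiguous_groups[0][:4].tolist())       # [0, 1, 2, 3]
print(channel_shuffle(np.arange(8), 2).tolist())
# [0, 4, 1, 5, 2, 6, 3, 7]
```

Each interval group spans the whole channel range, whereas each contiguous group sees only one local block; after the shuffle, neighboring positions come from different groups, so the next grouped operation mixes information across groups.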
4.6.2. Synergy of Dual-Path and Superiority of CGU Fusion
We further evaluated the necessity of the dual-path architecture and compared our Confluence Gating Unit (CGU) against mainstream feature fusion paradigms.
Dual-Path vs. Single-Path: Compared to the single-stream models (ID 1 and ID 3), all dual-path variants (IDs 6–10) achieved significant performance gains. This indicates that continuous sequential context and discrete fine-grained details are highly complementary in characterizing complex hyperspectral scenes.
Superiority of CGU over Alternative Fusions (IDs 6–10): We compared CGU with Element-wise Sum, Concatenation, Dual-Attention (cross-attention mechanism), and Dual-GLU (Gated Linear Unit). While advanced mechanisms such as GLU and Attention outperformed naive addition, the proposed CGU achieved the highest performance (OA 97.77%). The structural advantage of CGU lies in its “Bi-directional Cross-Modulation” design. Unlike conventional attention or GLU, which simply re-weight aggregated features, CGU explicitly uses continuous context as a regularizer to denoise fragmented discrete features, while using discrete textures to sharpen continuous boundaries. This bidirectional constraint directly addresses the representation discrepancy between heterogeneous feature spaces.
4.7. Comprehensive Analysis of Computational Efficiency
To verify the practical feasibility of CF-Mamba, we conducted a comprehensive efficiency evaluation on the Indian Pines, Pavia University, Houston 2013, and WHU-Hi-Longkou datasets.
Table 7 details the inference time, Floating Point Operations (FLOPs), and parameter counts (Params) of eight representative methods on the full test sets.
As shown in
Table 7, Transformer-based architectures (such as HSI-BERT and SF) typically incur high computational costs due to the quadratic computational complexity of the global self-attention mechanism. For instance, on the Indian Pines dataset, HSI-BERT requires 11.32 s for inference and consumes 7.112 G FLOPs. In sharp contrast, benefiting from the linear complexity of the state space architecture, our CF-Mamba completes inference in just 1.68 s with only 0.115 G FLOPs. Furthermore, compared to other SSM-based methods (such as CenterMamba and S2-Mamba), CF-Mamba maintains a reasonable balance in inference time and computational cost while demonstrating highly competitive parameter efficiency. This significant reduction in computational overhead confirms the efficiency advantages of our framework when processing long spectral sequences.
Compared to the lightest SSM variants such as 3DSS-Mamba, our proposed CF-Mamba does exhibit an increase in parameters and FLOPs (e.g., 0.106 M parameters on the Houston 2013 dataset versus 0.011 M for 3DSS-Mamba). We acknowledge that CF-Mamba is not strictly the most ‘lightweight’ model among SSM-based variants. However, this is an intended and acceptable trade-off. The additional overhead introduced by the dual-path architecture is necessary to capture richer continuous–discrete complementary features, which significantly enhances the model’s robustness and discriminative power in complex scenarios.
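The relative costs implied by the figures quoted above can be worked out directly. The snippet uses only the values stated in the text; the dictionary layout and variable names are illustrative.

```python
# Indian Pines figures quoted in the text (inference time in s, FLOPs in G).
hsi_bert = {"time_s": 11.32, "gflops": 7.112}
cf_mamba = {"time_s": 1.68, "gflops": 0.115}

speedup = hsi_bert["time_s"] / cf_mamba["time_s"]
flops_ratio = hsi_bert["gflops"] / cf_mamba["gflops"]

# Houston 2013 parameter counts quoted in the text (in millions).
params_cf_mamba, params_3dss = 0.106, 0.011
param_ratio = params_cf_mamba / params_3dss

print(f"vs. HSI-BERT: {speedup:.1f}x faster, {flops_ratio:.1f}x fewer FLOPs")
print(f"vs. 3DSS-Mamba: {param_ratio:.1f}x more parameters")
# vs. HSI-BERT: 6.7x faster, 61.8x fewer FLOPs
# vs. 3DSS-Mamba: 9.6x more parameters
```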
5. Conclusions and Future Developments
To address three challenges in hyperspectral image classification (the difficulty of modeling spectral continuity, inadequate decoupling of high-dimensional redundancy, and conflicts in cross-representation feature fusion), this paper proposes CF-Mamba, a continuous–discrete collaborative framework based on the Mamba architecture. Moving beyond the single-view limitations of traditional sequence modeling, the method improves classification performance through three key contributions:
By constructing the Adaptive Holographic Spectral Encoder (AHSE), the model introduces a multi-view dynamic routing mechanism. This successfully resolves the spatial–spectral distribution anisotropy of HSI data, achieving adaptive focusing on key discriminative continuous evolutionary features while maintaining Sequence-level Long-range Dependency.
The Interactive Interval Spectral Encoder (IISE) achieves the extraction of discrete fine-grained features under extremely low computational load through Interval Feature Decoupling and cross-group channel shuffle, effectively overcoming the contradiction between spectral redundancy and feature fragmentation.
The proposed Confluence Gating Unit (CGU) utilizes a bidirectional cross-modulation strategy to achieve deep alignment and complementary enhancement of continuous context and discrete features, mitigating the Representation Discrepancy phenomenon during multi-source feature fusion.
Extensive experimental results demonstrate that CF-Mamba achieves state-of-the-art classification accuracy on the Indian Pines, Houston 2013, Pavia University, and WHU-Hi-Longkou datasets (with OAs of 97.77%, 99.06%, 99.68%, and 99.59%, respectively) and possesses significant computational efficiency advantages compared to Transformer-based methods (such as HSI-BERT).
Despite the superior performance demonstrated by CF-Mamba, future research could further explore the following directions:
Multimodal Extension: Extending the dual-path architecture to multi-source (e.g., LiDAR, SAR) or multi-temporal remote sensing data fusion to explore complementary mechanisms among different data modalities within a continuous evolutionary space.
Lightweight Deployment: Targeting edge computing scenarios such as satellite-borne or UAV platforms, future work could combine model quantization and pruning techniques to further exploit the compression potential of the discrete decoupling path, thereby optimizing the memory footprint and inference speed of the Mamba architecture.
Physical Interpretability Enhancement: Combining frequency domain analysis or attention visualization techniques to deeply investigate the mapping relationship between the evolution of Mamba’s internal hidden states and the physical properties of ground objects (such as spectral absorption peaks), thereby enhancing the trustworthiness of the model.